Re: Van Jacobson net channels
On Fri, 2006-02-03 at 18:48, Andi Kleen wrote:
> On Friday 03 February 2006 02:07, Greg Banks wrote:
> > > (Don't ask for code - it's not really in a usable state)
> >
> > Sure. I'm looking forward to it.
>
> I had actually shelved the idea because of TSO. But if you can get me
> some data from your NFS servers that shows TSO is not enough for them,
> that might change the picture.

We should be doing some NFS+TSO testing on SLES10 beta in the next few
weeks, time permitting.  I'll let you know how it goes.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
On Friday 03 February 2006 02:07, Greg Banks wrote:
> > (Don't ask for code - it's not really in a usable state)
>
> Sure. I'm looking forward to it.

I had actually shelved the idea because of TSO. But if you can get me
some data from your NFS servers that shows TSO is not enough for them,
that might change the picture.

-Andi
Re: Van Jacobson net channels
From: Greg Banks <[EMAIL PROTECTED]>
Date: Fri, 03 Feb 2006 12:08:54 +1100

> So, given 2.6.16 on tg3 hardware, would your advice be to
> enable TSO by default?

Yes.

In fact I've been meaning to discuss with Michael Chan enabling it in
the driver by default.
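[For readers following along: until the driver default changes, enabling
TSO per interface is "ethtool -K ethX tso on".  A minimal sketch of the
same thing done directly against the ETHTOOL_STSO ioctl is below; the
interface name "eth0" and the bare-bones error handling are only
illustrative, not a recommendation.]

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	/* ETHTOOL_STSO takes a simple on/off value. */
	struct ethtool_value ev = { .cmd = ETHTOOL_STSO, .data = 1 };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* example name */
	ifr.ifr_data = (char *)&ev;

	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		perror("ETHTOOL_STSO");
	close(fd);
	return 0;
}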
RE: Van Jacobson net channels
On Fri, 2006-02-03 at 01:41, Leonid Grossman wrote:
> As I mentioned earlier, it would be cool to get these moderation
> thresholds from NAPI, since it can make a better guess about the
> overall system utilization than the driver can.

Agreed.

> But even at the driver level, this works reasonably well.

Yep.

> - the moderation scheme is implemented in the ASIC on per channel
> basis.  So, if you have workloads with very distinct latency needs,
> you can just steer it to a separate channel and have an interrupt
> moderation that is different from other flows, for example keep an
> interrupt per packet always.

Wow, that's cool.  So I could configure a particular UDP port and a
particular TCP port to always have minimum latency, but keep all the
rest of the traffic on the same NIC at minimum interrupts?  Currently
we need to use separate NICs for the two traffic types (for a number
of reasons).

What's the interface, some kind of ethtool extension or /proc magic?

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 18:51, David S. Miller wrote:
> From: Greg Banks <[EMAIL PROTECTED]>
> Date: Thu, 02 Feb 2006 18:31:49 +1100
>
> > On Thu, 2006-02-02 at 17:45, Andi Kleen wrote:
> > > Normally TSO was supposed to fix that.
> >
> > Sure, except that the last time SGI looked at TSO it was
> > extremely flaky.  I gather that's much better now, but TSO
> > still has a very small size limit imposed by the stack (not
> > the hardware).
>
> Oh you have TSO disabled?  That explains a lot.
>
> Yes, it's been a bumpy road, and there are still some
> e1000 lockups, but in general things should be smooth
> these days.

So, given 2.6.16 on tg3 hardware, would your advice be to
enable TSO by default?

Greg
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 18:48, Andi Kleen wrote:
> On Thursday 02 February 2006 08:31, Greg Banks wrote:
>
> > [...] SGI's solution is to ship a script that uses ethtool
> > at boot to tune rx-usecs, rx-frames, rx-usecs-irq, rx-frames-irq
> > up from the defaults.
>
> All user tuning like this is bad. The stack should all do that
> automatically.

That would be nice ;-)

> Would there be a drawback of making these settings default?

Yes, as mentioned elsewhere in this thread, applications which are
latency-sensitive will suffer.  For example, SGI sells a clustered
filesystem where overall performance is sensitive to the RTT of
intra-cluster RPCs, to which receive latency due to NIC interrupt
mitigation is a significant contributor.  The NICs which run that
traffic need to be using minimum mitigation, but the NICs which run
NFS traffic need to be using maximum mitigation.

> > This helps a lot, and we're very grateful ;-)  But a scheme
> > which used the interrupt mitigation hardware dynamically based on
> > load could reduce the irq rate and CPU usage even further without
> > compromising latency at low load.
>
> If you know what's needed perhaps you could investigate it?

Maybe, in a couple of months when I've the time.

> You mean the 64k limit?

Exactly.  Currently the NFS server is limited to a 32K blocksize so
the largest RPC reply size is about 33K.  However the NFS client in
Linus' tree, and other OS's NFS servers, have much larger limits.
A value of about 1.001 MiB would probably be best.  The next SGI
Linux NFS server release will probably include a patch to increase
the maximum blocksize on TCP to 1MiB.

> > Cool.  Wouldn't it mean rewriting the nontrivial qdiscs?
>
> It had some compat code that just split up the lists - same
> for netfilter.  And only an implementation for pfifo_fast.

Ok by me, in practice our servers only ever use pfifo.

> (Don't ask for code - it's not really in a usable state)

Sure. I'm looking forward to it.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
RE: Van Jacobson net channels and NIC channels
> -----Original Message-----
> From: Andi Kleen [mailto:[EMAIL PROTECTED]
>
> Why are you saying it can't be used by the host? The stack
> should be fully ready for it.

Sorry, I should have said "it can't be used by the host to the full
potential of the feature" :-).  It does work for us now, as a "driver
only" implementation, but setting IRQ affinity from the kernel (as well
as a couple of other decisions that we would like the host to make,
rather than making them in the driver) should help quite a bit.

> The only small piece missing is a way to set the IRQ affinity
> from the kernel, but that can be simulated from user space by
> tweaking them in /proc. If you have a prototype patch adding
> the kernel interfaces wouldn't be that hard either.

Agreed, at this point we should put a patch forward and tweak the
kernel interface later on.

> Also how about per CPU TX completion interrupts?

Yes, a channel can have separate Tx completion and RX MSI-X interrupts
(and an exception MSI-X interrupt, if desired).  It's up to 64 MSI-X
interrupts total.
Re: Van Jacobson net channels
Andi Kleen wrote:
> On Thursday 02 February 2006 08:31, Greg Banks wrote:
> > The tg3 driver uses small hardcoded values for the RXCOL_TICKS
> > and RXMAX_FRAMES registers, and allows "ethtool -C" to change
> > them.  SGI's solution is to ship a script that uses ethtool
> > at boot to tune rx-usecs, rx-frames, rx-usecs-irq, rx-frames-irq
> > up from the defaults.
>
> All user tuning like this is bad. The stack should all do that
> automatically.  Would there be a drawback of making these settings
> default?

Larger settings (even the defaults) of the coalescing parms, while
giving decent CPU utilization for a bulk transfer and better CPU
utilization for a large aggregate workload, seem to mean bad things for
minimizing latency.  The "presentation" needs work but the data in:

ftp://ftp.cup.hp.com/dist/networking/briefs/nic_latency_vs_tput.txt

should show some of that.  The current executive summary:

   Executive Summary: By default, the e1000 driver used in conjunction
   with the A9900A PCI-X Dual-port Gigabit Ethernet adaptor strongly
   favors maximum packet per second throughput over minimum
   request/response latency.  Anyone desiring lowest possible
   request/response latency needs to alter the modprobe parameters used
   when the e1000 driver is loaded.  This appears to reduce round-trip
   latency by as much as 85%.  However, configuring the A9900A PCI-X
   Dual-port Gigabit Ethernet adaptor for minimum request/response
   latency will reduce maximum packet per second performance (as
   measured with the netperf TCP_RR test) by ~23% and increase the
   service demand for bulk data transfer by ~63% for sending and ~145%
   for receiving.

There is also some data in there for tg3 and for Xframe I (but with a
rather behind-the-times driver; I'm still trying to get cycles to run
with a newer driver).

> > This helps a lot, and we're very grateful ;-)  But a scheme
> > which used the interrupt mitigation hardware dynamically based on
> > load could reduce the irq rate and CPU usage even further without
> > compromising latency at low load.
>
> If you know what's needed perhaps you could investigate it?

I'm guessing that any automagic interrupt mitigation scheme might want
to know what it wants to enable for the single-stream TCP_RR
transaction/s as the base pps before it starts holding-off interrupts.
Even then however, the ability for the user to override needs to
remain, because there may be a workload that wants that PPS rate but
isn't concerned about the latency, only the CPU utilization, and so
indeed wants the interrupts mitigated.

So it would seem that an automagic coalescing might be an N% solution,
but I don't think it would be 100%.  Question then becomes whether or
not N is large enough to warrant it over defaults+manual config.

rick jones
RE: Van Jacobson net channels
Leonid Grossman writes:

> Right. Interrupt moderation is done on per channel basis.
> The only addition to the current NAPI mechanism I'd like to see is to
> have NAPI setting desired interrupt rate (once interrupts are ON),
> rather than use an interrupt per packet or a driver default. Arguably,
> NAPI can figure out desired interrupt rate a bit better than a driver
> can.

In the current scheme a driver can easily use a dynamic interrupt
scheme; in fact tulip has used this for years.  At low rates there are
no delays at all; if it reaches some threshold it increases interrupt
latency.  It can be done in several levels.  The best threshold,
luckily, seems just to be to count the number of packets sitting in the
RX ring when ->poll is called.  Jamal heavily experimented with this
and gave a talk at OLS 2000.

Yes, if the net channel classifier runs in hardirq we get back to the
livelock situation sooner or later.  IMO interrupts should just be a
signal to indicate work.

Cheers.
--ro
Re: Van Jacobson net channels
> Oh you have TSO disabled?  That explains a lot.
>
> Yes, it's been a bumpy road, and there are still some
> e1000 lockups, but in general things should be smooth
> these days.

I suspect that "these days" in kernel.org terms differs somewhat from
"these days" in RH/SuSE/etc terms, hence TSO being disabled.

rick jones
Re: Van Jacobson net channels
On Wed, 01 Feb 2006 16:29:11 -0800 (PST)
"David S. Miller" <[EMAIL PROTECTED]> wrote:

> From: Stephen Hemminger <[EMAIL PROTECTED]>
> Date: Wed, 1 Feb 2006 16:12:14 -0800
>
> > The bigger problem I see is scalability. All those mmap rings have
> > to be pinned in memory to be useful. It's fine for a single smart
> > application per server environment, but in real world with many dumb
> > thread monster applications on a single server it will be really
> > hard to get working.
>
> This is no different from when the thread blocks and the receive queue
> fills up, and in order to absorb scheduling latency.  We already lock
> memory into the kernel for socket buffer memory as it is.  At least
> the mmap() ring buffer method is optimized and won't have all of the
> overhead for struct sk_buff and friends.  So we have the potential to
> lock down less memory not more.
>
> This is just like when we started using BK or GIT for source
> management, everyone was against it and looking for holes while they
> tried to wrap their brains around the new concepts and ideas.  I guess
> it will take a while for people to understand all this new stuff, but
> we'll get there.

No, it just means we have to cover our bases and not regress while
moving forward.  Not that we never have any regressions ;=)

--
Stephen Hemminger <[EMAIL PROTECTED]>
OSDL http://developer.osdl.org/~shemminger
Re: Van Jacobson net channels and NIC channels
On Thursday 02 February 2006 17:27, Leonid Grossman wrote:

> By now we have submitted UFO, MSI-X and LRO patches. The one item on
> the TODO list that we did not submit a full driver patch for is the
> "support for distributing receive processing across multiple CPUs
> (using NIC hw queues)", mainly because at present the feature can't
> be fully used by the host anyways.

Why are you saying it can't be used by the host? The stack should be
fully ready for it.

The only small piece missing is a way to set the IRQ affinity from the
kernel, but that can be simulated from user space by tweaking them in
/proc.  If you have a prototype patch, adding the kernel interfaces
wouldn't be that hard either.

Also how about per CPU TX completion interrupts?

-Andi
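[The user-space workaround Andi refers to is simply writing a CPU mask
into /proc/irq/<n>/smp_affinity.  A minimal sketch follows; the IRQ
number and mask are made up for illustration, and set_irq_affinity() is
just a hypothetical helper name.]

#include <stdio.h>

/* Steer one interrupt (e.g. a per-channel MSI-X vector) to the CPUs in
 * cpu_mask by writing the hex mask to /proc/irq/<irq>/smp_affinity. */
static int set_irq_affinity(unsigned int irq, unsigned long cpu_mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%lx\n", cpu_mask);
	return fclose(f);
}

int main(void)
{
	/* Example only: pin hypothetical vector 58 to CPU 2 (mask 0x4). */
	return set_irq_affinity(58, 0x4) ? 1 : 0;
}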
RE: Van Jacobson net channels
> -----Original Message-----
> From: Eric W. Biederman [mailto:[EMAIL PROTECTED]
>
> How do you classify channels?

Multiple rx steering criteria are available, for example tcp tuple (or
subset) hash, direct tcp tuple (or subset) match, MAC address, pkt
size, vlan tag, QOS bits, etc.

> If your channels can map directly to the Van Jacobson
> channels then when the kernel starts using them, it sounds
> like the ideal strategy is to use the current NAPI algorithm
> of disabling interrupts on a per channel basis (assuming
> MSI-X here) until that channel gets caught up.  Then enable
> interrupts again.

Right. Interrupt moderation is done on per channel basis.
The only addition to the current NAPI mechanism I'd like to see is to
have NAPI setting desired interrupt rate (once interrupts are ON),
rather than use an interrupt per packet or a driver default.  Arguably,
NAPI can figure out desired interrupt rate a bit better than a driver
can.

> I wonder if someone could make that the default policy in their NICs?

Some NICs can support this today.  If there is a consensus on a
channel-aware NIC driver interface (including interrupt mgmt per
channel), this will become a default NIC implementation.  Over time,
NIC development is always driven by the OS/stack requirements.
Re: Van Jacobson net channels
On Thu, 02 Feb 2006 08:35:28 -0700
[EMAIL PROTECTED] (Eric W. Biederman) wrote:

> "Christopher Friesen" <[EMAIL PROTECTED]> writes:
>
> > Eric W. Biederman wrote:
> > > Jeff Garzik <[EMAIL PROTECTED]> writes:
> >
> > > > This was discussed on the netdev list, and the conclusion was
> > > > that you want both NAPI and hw mitigation.  This was implemented
> > > > in a few drivers, at least.
> >
> > > How does that deal with the latency that hw mitigation introduces.
> > > When you have a workload that is bottle-necked waiting for that
> > > next packet and hw mitigation is turned on you can see some
> > > horrible unjustified slow downs.
> >
> > Presumably at low traffic you would disable hardware mitigation to
> > get the best possible latency.  As traffic ramps up you tune the
> > hardware mitigation appropriately.  At high traffic loads, you end
> > up with full hardware mitigation, but you have enough packets coming
> > in that the latency is minimal.
>
> The evil but real workload is when you have a high volume of dependent
> traffic.  RPC calls or MPI collectives are cases where you are likely
> to see this.
>
> Or even in TCP there is an element that once you hit your window limit
> you won't send more traffic until you get your ack.  But you may not
> get your ack promptly, because the interrupt is mitigated.
>
> NAPI handles this beautifully.  It disables the interrupts until it
> knows it needs to process more packets.  Then when it is just waiting
> around for packets from that card it enables interrupts on that card.

Also, NAPI handles the case where the receiver is getting DoS'd or
overrun with packets, and you want the hardware to send flow control.
Without NAPI it is easy to get stuck only processing packets and
nothing else.  I hope the VJ channels code has receive flow control.
Re: Van Jacobson net channels
"Leonid Grossman" <[EMAIL PROTECTED]> writes: > There two facilities (at least, in our ASIC, but there is no reason this > can't be part of the generic multi-channel driver interface that I will > get to shortly) to deal with it. > > - hardware supports more than one utilization-based interrupt rate (we > have four). For lowest utilization range, we always set interrupt rate > to one interrupt for every rx packet - exactly for the latency reasons > that you are bringing up. Also, cpu is not busy anyways so extra > interrupts do not hurt much. For highest utilization range, we set the > rate by default to something like an interrupt per 128 packets. There is > also timer-based interrupt, as a last resort option. > As I mentioned earlier, it would be cool to get these moderation > tresholds from NAPI, since it can make a better guess about the overall > system utilization than the driver can. But even at the driver level, > this works reasonably well. > > - the moderation scheme is implemented in the ASIC on per channel basis. > So, if you have workloads with very distinct latency needs, you can just > steer it to a separate channel and have an interrupt moderation that is > different from other flows, for example keep an interrupt per packet > always. How do you classify channels? If your channels can map directly to the VAN Jacobsen channels then when the kernel starts using them, it sounds like the ideal strategy is to use the current NAPI algorithm of disabling interrupts (on a per channel basis (assuming MSI-X here) until that channel gets caught up Then enable interrupts again. I wonder if someone could make that the default policy in their NICs? Eric - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
"Christopher Friesen" <[EMAIL PROTECTED]> writes: > Eric W. Biederman wrote: >> Jeff Garzik <[EMAIL PROTECTED]> writes: > >>> This was discussed on the netdev list, and the conclusion was that >>> you want both NAPI and hw mitigation. This was implemented in a >>> few drivers, at least. > >> How does that deal with the latency that hw mitigation introduces. When you >> have a workload that bottle-necked waiting for that next >> packet and hw mitigation is turned on you can see some horrible >> unjustified slow downs. > > Presumably at low traffic you would disable hardware mitigation to get the > best > possible latency. As traffic ramps up you tune the hardware mitigation > appropriately. At high traffic loads, you end up with full hardware > mitigation, > but you have enough packets coming in that the latency is minimal. The evil but real work load is when you have a high volume of dependent traffic. RPC calls or MPI collectives are cases where you are likely to see this. Or even in TCP there is an element that once you hit your window limit you won't send more traffic until you get your ack. But if you don't get your ack because the interrupt is mitigated. NAPI handles this beautifully. It disables the interrupts until it knows it needs to process more packets. Then when it is just waiting around for packets from that card it enables interrupts on that card. Eric - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Van Jacobson net channels
> -----Original Message-----
> From: Andi Kleen [mailto:[EMAIL PROTECTED]
>
> > You just need to make sure that you don't leak data from
> > other peoples sockets.
>
> There are three basic ways I can see to do this:
>
> - You have really advanced hardware which can potentially
> manage tens of thousands of hardware queues with full
> classification down to the ports. Then everything is great.
> But who has such hardware?
> Perhaps Leonid will do it, but I expect the majority of Linux
> users to not have access to it in the foreseeable time. Also
> even with the advanced hardware that can handle e.g. 50k
> sockets what happens when you need 100k for some extreme situation?
>
> -Andi

You may be surprised here :-)

iWARP (RDMA over Ethernet) received a lot of funding and industry
support over the last several years, and rNIC development is already
pre-announced by multiple vendors, not just us.  I expect RDMA
deployment to be a long and bumpy multi-year road, since protocols and
applications will need to change to take full advantage of it.  And
this is a discussion for a totally separate thread anyways :-)

But in the meantime, these new ethernet adapters will have a huge
number of hw queue pairs (AKA channels), and at least some of the NICs
will have these channels at no incremental cost to the hardware.  You
may be able to use the channels for full socket traffic classification
if nothing else, and defer the rest of rNIC functionality until the
iWARP infrastructure is mature.

This is actually one of many reasons why VJ net channels and related
ideas look very promising - we can "extend" it to the driver/hw level
with the current NICs that have at least one channel per cpu, with a
good chance that the next wave of hardware will support many more
channels and will take advantage of the stack/NAPI improvements.

Leonid
Re: Van Jacobson net channels
Eric W. Biederman wrote:
> Jeff Garzik <[EMAIL PROTECTED]> writes:
>
> > This was discussed on the netdev list, and the conclusion was that
> > you want both NAPI and hw mitigation.  This was implemented in a
> > few drivers, at least.
>
> How does that deal with the latency that hw mitigation introduces.
> When you have a workload that is bottle-necked waiting for that next
> packet and hw mitigation is turned on you can see some horrible
> unjustified slow downs.

Presumably at low traffic you would disable hardware mitigation to get
the best possible latency.  As traffic ramps up you tune the hardware
mitigation appropriately.  At high traffic loads, you end up with full
hardware mitigation, but you have enough packets coming in that the
latency is minimal.

Chris
RE: Van Jacobson net channels
> -----Original Message-----
> From: Eric W. Biederman [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 02, 2006 4:29 AM
> To: Jeff Garzik
> Cc: Andi Kleen; Greg Banks; David S. Miller; Leonid Grossman;
> [EMAIL PROTECTED]; Linux Network Development list
> Subject: Re: Van Jacobson net channels
>
> Jeff Garzik <[EMAIL PROTECTED]> writes:
>
> > Andi Kleen wrote:
> > > There was already talk some time ago to make NAPI drivers use the
> > > hardware mitigation again. The reason is when you have
> >
> > This was discussed on the netdev list, and the conclusion was that
> > you want both NAPI and hw mitigation.  This was implemented in a few
> > drivers, at least.
>
> How does that deal with the latency that hw mitigation introduces.
> When you have a workload that is bottle-necked waiting for that next
> packet and hw mitigation is turned on you can see some horrible
> unjustified slow downs.

There are two facilities (at least, in our ASIC, but there is no reason
this can't be part of the generic multi-channel driver interface that I
will get to shortly) to deal with it.

- hardware supports more than one utilization-based interrupt rate (we
have four).  For lowest utilization range, we always set interrupt rate
to one interrupt for every rx packet - exactly for the latency reasons
that you are bringing up.  Also, cpu is not busy anyways so extra
interrupts do not hurt much.  For highest utilization range, we set the
rate by default to something like an interrupt per 128 packets.  There
is also timer-based interrupt, as a last resort option.
As I mentioned earlier, it would be cool to get these moderation
thresholds from NAPI, since it can make a better guess about the
overall system utilization than the driver can.  But even at the driver
level, this works reasonably well.

- the moderation scheme is implemented in the ASIC on per channel
basis.  So, if you have workloads with very distinct latency needs, you
can just steer it to a separate channel and have an interrupt
moderation that is different from other flows, for example keep an
interrupt per packet always.

Leonid
Re: Van Jacobson net channels
Jeff Garzik <[EMAIL PROTECTED]> writes:

> Andi Kleen wrote:
> > There was already talk some time ago to make NAPI drivers use
> > the hardware mitigation again. The reason is when you have
>
> This was discussed on the netdev list, and the conclusion was that you
> want both NAPI and hw mitigation.  This was implemented in a few
> drivers, at least.

How does that deal with the latency that hw mitigation introduces.
When you have a workload that is bottle-necked waiting for that next
packet and hw mitigation is turned on you can see some horrible
unjustified slow downs.

Eric
Re: Van Jacobson net channels
From: Greg Banks <[EMAIL PROTECTED]>
Date: Thu, 02 Feb 2006 18:31:49 +1100

> On Thu, 2006-02-02 at 17:45, Andi Kleen wrote:
> > Normally TSO was supposed to fix that.
>
> Sure, except that the last time SGI looked at TSO it was
> extremely flaky.  I gather that's much better now, but TSO
> still has a very small size limit imposed by the stack (not
> the hardware).

Oh you have TSO disabled?  That explains a lot.

Yes, it's been a bumpy road, and there are still some
e1000 lockups, but in general things should be smooth
these days.
Re: Van Jacobson net channels
On Thursday 02 February 2006 08:31, Greg Banks wrote:

> The tg3 driver uses small hardcoded values for the RXCOL_TICKS
> and RXMAX_FRAMES registers, and allows "ethtool -C" to change
> them.  SGI's solution is to ship a script that uses ethtool
> at boot to tune rx-usecs, rx-frames, rx-usecs-irq, rx-frames-irq
> up from the defaults.

All user tuning like this is bad. The stack should all do that
automatically.  Would there be a drawback of making these settings
default?

> This helps a lot, and we're very grateful ;-)  But a scheme
> which used the interrupt mitigation hardware dynamically based on
> load could reduce the irq rate and CPU usage even further without
> compromising latency at low load.

If you know what's needed perhaps you could investigate it?

> Sure, except that the last time SGI looked at TSO it was
> extremely flaky.

I believe David has done quite a lot of work on it and it should be
much better now.

> I gather that's much better now, but TSO
> still has a very small size limit imposed by the stack (not
> the hardware).

You mean the 64k limit?

> > I was playing with a design some time ago to let TCP batch
> > the lower level transactions even without that.  The idea
> > was instead of calling down into IP and dev_queue_xmit et.al.
> > for each packet generated by TCP first generate a list of packets
> > in sendmsg/sendpage and then just hand down the list
> > through all layers into the driver.
>
> Cool.  Wouldn't it mean rewriting the nontrivial qdiscs?

It had some compat code that just split up the lists - same
for netfilter.  And only an implementation for pfifo_fast.

(Don't ask for code - it's not really in a usable state)

-Andi
Re: Van Jacobson net channels
On Thursday 02 February 2006 00:50, David S. Miller wrote:

> Why not concentrate your thinking on how it can be made to _work_
> instead of punching holes in the idea?  Isn't that more productive?

What I think would be very practical to do would be to try to replace
the socket rx queue and the prequeues and perhaps the qdisc queues with
a netchannel-style array of pointers (just using pointers to skbs
instead of indexes), or a list of arrays, and see if it gives any cache
benefits.

What do you think?

-Andi
Re: Van Jacobson net channels
On Thursday 02 February 2006 00:08, Jeff Garzik wrote:

> Definitely not.  POSIX AIO is far more complex than the operation
> requires,

Ah, I sense a strong NIH field.

> and is particularly bad for implementations that find it wise
> to queue a bunch of to-be-filled buffers.

Why?  lio_listio seems to be very well suited for that task to me.

> Further, the current
> implementation of POSIX AIO uses a thread for almost every I/O, which
> is yet more overkill.

That's just an implementation detail of the current Linux aio.

> A simple mmap'd ring buffer is much closer to how the hardware
> actually behaves.  It's no surprise that the "ring buffer / doorbell"
> pattern pops up all over the place in computing these days.

If you really want you can just fill in the pointer to the lio list
into a mmaped ring buffer.  This can be hidden behind the POSIX
interfaces.

[I think Ben's early kernel aio had support for that, but it was
eliminated as unneeded complexity]

> Getting the TCP receive path out of the kernel isn't a requirement,
> just an improvement.

It's not clear to me yet how this is an improvement.

> But people who care about the performance of their networking apps
> are likely to want to switch over to this new userspace networking
> API, over the next decade, I think.

POSIX aio has the advantage that it already works on some other Unixes
and some big applications have support for it that just needs to be
enabled with the right ifdef.

-Andi
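[To make the lio_listio suggestion concrete: queueing a batch of
to-be-filled receive buffers in a single call might look roughly like
the sketch below.  The buffer count and sizes are invented, completion
handling is omitted, queue_reads() is just a hypothetical helper, and
as Andi notes glibc currently services these with threads; link with
-lrt.]

#include <aio.h>
#include <stdlib.h>
#include <string.h>

#define NBUFS 8
#define BUFSZ 4096

/* Queue NBUFS asynchronous reads on sockfd with one lio_listio() call. */
int queue_reads(int sockfd)
{
	static struct aiocb cbs[NBUFS];
	struct aiocb *list[NBUFS];
	int i;

	for (i = 0; i < NBUFS; i++) {
		memset(&cbs[i], 0, sizeof(cbs[i]));
		cbs[i].aio_fildes = sockfd;
		cbs[i].aio_buf = malloc(BUFSZ);
		cbs[i].aio_nbytes = BUFSZ;
		cbs[i].aio_lio_opcode = LIO_READ;
		list[i] = &cbs[i];
	}
	/* Submit the whole batch without waiting for completion. */
	return lio_listio(LIO_NOWAIT, list, NBUFS, NULL);
}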
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 17:45, Andi Kleen wrote:
> There was already talk some time ago to make NAPI drivers use
> the hardware mitigation again.  The reason is when you have
> a workload that runs below overload and doesn't quite
> fill the queues and is a bit bursty, then NAPI tends to turn
> on/off the NIC interrupts quite often.

In SGI's experience, all it takes to get into this state is an even
workload and a sufficiently fast CPU.

On Thu, 2006-02-02 at 17:49, David S. Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Thu, 2 Feb 2006 07:45:26 +0100
>
> > Don't think it was ever implemented though.  In the end we just
> > eat the slowdown in that particular load.
>
> The tg3 driver uses the chip interrupt mitigation to help
> deal with the SGI NUMA issues resulting from NAPI.

The tg3 driver uses small hardcoded values for the RXCOL_TICKS
and RXMAX_FRAMES registers, and allows "ethtool -C" to change
them.  SGI's solution is to ship a script that uses ethtool
at boot to tune rx-usecs, rx-frames, rx-usecs-irq, rx-frames-irq
up from the defaults.

This helps a lot, and we're very grateful ;-)  But a scheme
which used the interrupt mitigation hardware dynamically based on
load could reduce the irq rate and CPU usage even further without
compromising latency at low load.

On Thu, 2006-02-02 at 17:51, Andi Kleen wrote:
> On Thursday 02 February 2006 04:19, Greg Banks wrote:
> > On Thu, 2006-02-02 at 14:13, David S. Miller wrote:
> > > From: Greg Banks <[EMAIL PROTECTED]>
> > Multiple trips down through TCP, qdisc, and the driver for each
> > NFS packet sent:
>
> Normally TSO was supposed to fix that.

Sure, except that the last time SGI looked at TSO it was
extremely flaky.  I gather that's much better now, but TSO
still has a very small size limit imposed by the stack (not
the hardware).

> I was playing with a design some time ago to let TCP batch
> the lower level transactions even without that.  The idea
> was instead of calling down into IP and dev_queue_xmit et.al.
> for each packet generated by TCP first generate a list of packets
> in sendmsg/sendpage and then just hand down the list
> through all layers into the driver.

Cool.  Wouldn't it mean rewriting the nontrivial qdiscs?

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
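[For anyone curious what such a boot-time script ends up doing, "ethtool
-C" drives the ETHTOOL_GCOALESCE/ETHTOOL_SCOALESCE ioctls.  A minimal
sketch is below; the interface name and the particular values are only
examples, not SGI's actual settings, and error handling is minimal.]

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_coalesce ec;
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* example name */
	ifr.ifr_data = (char *)&ec;

	/* Read the current coalescing parameters... */
	memset(&ec, 0, sizeof(ec));
	ec.cmd = ETHTOOL_GCOALESCE;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_GCOALESCE");
		return 1;
	}

	/* ...then raise the RX mitigation thresholds (example values). */
	ec.rx_coalesce_usecs = 300;
	ec.rx_max_coalesced_frames = 60;
	ec.rx_coalesce_usecs_irq = 300;
	ec.rx_max_coalesced_frames_irq = 60;
	ec.cmd = ETHTOOL_SCOALESCE;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		perror("ETHTOOL_SCOALESCE");

	close(fd);
	return 0;
}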
Re: Van Jacobson net channels
Andi Kleen wrote:
> There was already talk some time ago to make NAPI drivers use
> the hardware mitigation again.  The reason is when you have

This was discussed on the netdev list, and the conclusion was that you
want both NAPI and hw mitigation.  This was implemented in a few
drivers, at least.

	Jeff
Re: Van Jacobson net channels
On Thursday 02 February 2006 07:49, David S. Miller wrote:
> From: Andi Kleen <[EMAIL PROTECTED]>
> Date: Thu, 2 Feb 2006 07:45:26 +0100
>
> > Don't think it was ever implemented though.  In the end we just
> > eat the slowdown in that particular load.
>
> The tg3 driver uses the chip interrupt mitigation to help
> deal with the SGI NUMA issues resulting from NAPI.

Ok thanks for the correction.  It was indeed fixed then.  Great!

-Andi
Re: Van Jacobson net channels
On Thursday 02 February 2006 00:37, Mitchell Blank Jr wrote:
> Jeff Garzik wrote:
> > Once packets are classified to be delivered to a specific local host
> > socket, what further operations require privs?  What received packet
> > data cannot be exposed to userspace?
>
> You just need to make sure that you don't leak data from other peoples
> sockets.

There are three basic ways I can see to do this:

- You have really advanced hardware which can potentially manage tens
of thousands of hardware queues with full classification down to the
ports.  Then everything is great.  But who has such hardware?
Perhaps Leonid will do it, but I expect the majority of Linux users to
not have access to it in the foreseeable time.  Also even with the
advanced hardware that can handle e.g. 50k sockets, what happens when
you need 100k for some extreme situation?

- You use some high level easy classifier to distinguish between
classical "slower and isolated" streams and "fast and shared by
everybody" streams.  Let's say you use two IP addresses and program the
NIC's hardware RX queues to distinguish them.  Then you end up with two
receive rings - a standard one managed in the classical way and a
netchannel one mapped into all applications running the user level TCP
stack.  This requires moderately advanced hardware (like a current
Xframe and perhaps Tigon3?), but should be possible.
One problem is that you will have to preallocate a lot of memory for
the fast ring because mapping new memory this way is relatively costly
(potentially lots of TLB flushes on all CPUs).  And of course the data
will be all shared between all fast users.  Ok, assuming the internet
is considered a rogue place these days with sniffers everywhere, I
guess that's not too bad - everybody interested in privacy should use
encryption anyways.
Still, maintaining the separate IP address as the high level
classification anchor would be somewhat of an administrator burden.
You could avoid it by putting just all data into the fast ring and
allowing everybody interested to mmap it, but I'm not sure it's a good
idea to completely drop all backwards compatibility in "secure" stream
isolation.

- You do classification to sockets in software in the interrupt handler
and then copy the data once from the memory in the RX ring into a big
preallocated buffer per netchannel consumer.  That would work, but if
the user space TCP stack is to emulate a standard read() interface it
would likely need to copy again to get the data into the place the
application expects it.  This means you would have added an additional
copy over the current stack, which is not good.  Also, the question is
how this classification would work and whether it would be really
faster than what we do today.

All the ways I described have severe drawbacks imho.  Did I miss some
clever additional way?

-Andi
RE: Van Jacobson net channels
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Andi Kleen
> Sent: Wednesday, February 01, 2006 10:45 PM
>
> There was already talk some time ago to make NAPI drivers use
> the hardware mitigation again.  The reason is when you have a
> workload that runs below overload and doesn't quite fill the
> queues and is a bit bursty, then NAPI tends to turn on/off
> the NIC interrupts quite often.  At least on some chipsets
> (Tigon3 in particular) this seems to cause slowdowns compared
> to non NAPI.  The idea (from Jamal originally iirc) was to use
> the hardware mitigation to cycle less often from polling to
> non polling state.
>
> Don't think it was ever implemented though.  In the end we
> just eat the slowdown in that particular load.

Ideally, we want NAPI to set driver interrupt rate dynamically, to a
desired number of packets per interrupt.  More and more NICs support
this in hardware as a run-time option; switching interrupts ON and OFF
is indeed a bit of an "overdrive" but can be still used for legacy
NICs.
Re: Van Jacobson net channels
On Thursday 02 February 2006 04:19, Greg Banks wrote:
> On Thu, 2006-02-02 at 14:13, David S. Miller wrote:
> > From: Greg Banks <[EMAIL PROTECTED]>
> > Date: Thu, 02 Feb 2006 14:06:06 +1100
> >
> > > On Thu, 2006-02-02 at 13:46, David S. Miller wrote:
> > > > I know SAMBA is using sendfile() (when the client has the oplock
> > > > held, which basically is "always"), is NFS doing so as well?
> > >
> > > NFS is an in-kernel server, and uses sock->ops->sendpage directly.
> >
> > Great.
> >
> > Then where's all the TX overhead for NFS?  All the small
> > transactions and the sunrpc header munging?
>
> Multiple trips down through TCP, qdisc, and the driver for each
> NFS packet sent:

Normally TSO was supposed to fix that.

I was playing with a design some time ago to let TCP batch the lower
level transactions even without that.  The idea was instead of calling
down into IP and dev_queue_xmit et.al. for each packet generated by
TCP, first generate a list of packets in sendmsg/sendpage and then just
hand down the list through all layers into the driver.

It was inspired by Andrew Morton's 2.5 work in the VM layer, where he
used this trick very successfully with pages and BHs.  But I didn't
pursue it further when it turned out all interesting hardware was using
TSO already, which does a similar thing.  There was also some
trickiness about when to do the flush exactly.

> one for the header and one for each page.  Lots
> of locks need to be taken and dropped, all this while multiple nfsds
> on multiple CPUs are all trying to reply to NFS RPCs at the same
> time.  And in the particular case of the SN2 architecture, time
> spent flushing PCI writes in the driver (less of an issue now that
> host send rings are the default in tg3).

Hmm, maybe it would still be worth it for your case with multiple
connections going on at the same time.  But accumulating the packet
list somewhere between different connections would be a natural
congestion point and potential scalability issue.

-Andi
Re: Van Jacobson net channels
From: Andi Kleen <[EMAIL PROTECTED]>
Date: Thu, 2 Feb 2006 07:45:26 +0100

> Don't think it was ever implemented though.  In the end we just
> eat the slowdown in that particular load.

The tg3 driver uses the chip interrupt mitigation to help
deal with the SGI NUMA issues resulting from NAPI.
Re: Van Jacobson net channels
On Thursday 02 February 2006 02:53, Greg Banks wrote:
> On Thu, 2006-02-02 at 08:11, David S. Miller wrote:
> > Van is not against NAPI, in fact he's taking NAPI to the next level.
> > Softirq handling is overhead, and as this work shows, it is totally
> > unnecessary overhead.
>
> I got the impression that his code was dynamically changing the
> e1000 interrupt mitigation registers in response to load, in
> other words using the capabilities of the hardware in a way that
> NAPI deliberately avoids doing.

There was already talk some time ago to make NAPI drivers use
the hardware mitigation again.  The reason is when you have
a workload that runs below overload and doesn't quite
fill the queues and is a bit bursty, then NAPI tends to turn
on/off the NIC interrupts quite often.  At least on some chipsets
(Tigon3 in particular) this seems to cause slowdowns compared
to non NAPI.  The idea (from Jamal originally iirc) was to use
the hardware mitigation to cycle less often from polling to
non polling state.

Don't think it was ever implemented though.  In the end we just
eat the slowdown in that particular load.

> > How in the world can you not understand how incredible this is?
>
> Maybe "you had to be there".  Van's presentation was amazingly
> convincing in person, in a way the slides don't convey.  I've
> not seen a standing ovation at a technical talk before ;-)

Wish I had made it then.  Perhaps I would see the light then @)

> I'm very interested in vj channels for improving CPU usage of
> NFS and Samba servers.  However, after a few days to reflect,
> I'm curious as to how the tx is improved.

Yes, I was missing that too.  He hinted about getting rid of
hard_start_xmit somehow, but then never touched it again.

-Andi
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 14:32, David S. Miller wrote:
> I see.
>
> Maybe we can be smarter about how the write(), CORK, sendfile,
> UNCORK sequence is done.

From the NFS server's point of view, the ideal interface would be
to pass an array of {page,offset,len} tuples, covering up to around
1 MiB + 1 KiB in total length.

Also, nfsd doesn't cork/uncork.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
From: Greg Banks <[EMAIL PROTECTED]>
Date: Thu, 02 Feb 2006 14:19:43 +1100

> Multiple trips down through TCP, qdisc, and the driver for each
> NFS packet sent: one for the header and one for each page.  Lots
> of locks need to be taken and dropped, all this while multiple nfsds
> on multiple CPUs are all trying to reply to NFS RPCs at the same
> time.  And in the particular case of the SN2 architecture, time
> spent flushing PCI writes in the driver (less of an issue now that
> host send rings are the default in tg3).

I see.

Maybe we can be smarter about how the write(), CORK, sendfile,
UNCORK sequence is done.

Thanks for mentioning this.
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 14:13, David S. Miller wrote:
> From: Greg Banks <[EMAIL PROTECTED]>
> Date: Thu, 02 Feb 2006 14:06:06 +1100
>
> > On Thu, 2006-02-02 at 13:46, David S. Miller wrote:
> > > I know SAMBA is using sendfile() (when the client has the oplock
> > > held, which basically is "always"), is NFS doing so as well?
> >
> > NFS is an in-kernel server, and uses sock->ops->sendpage directly.
>
> Great.
>
> Then where's all the TX overhead for NFS?  All the small transactions
> and the sunrpc header munging?

Multiple trips down through TCP, qdisc, and the driver for each
NFS packet sent: one for the header and one for each page.  Lots
of locks need to be taken and dropped, all this while multiple nfsds
on multiple CPUs are all trying to reply to NFS RPCs at the same
time.  And in the particular case of the SN2 architecture, time
spent flushing PCI writes in the driver (less of an issue now that
host send rings are the default in tg3).

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
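[The rough shape of the transmit path Greg describes is sketched below:
the in-kernel NFS server (net/sunrpc/svcsock.c) pushes each reply as a
header buffer plus one sock->ops->sendpage() call per data page, so
every reply makes several separate trips down through TCP, the qdisc
and the driver.  This is a schematic reconstruction, not the actual
svc_sendto() code; send_reply() and its header/page bookkeeping are
invented for illustration.]

#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>
#include <linux/mm.h>

static int send_reply(struct socket *sock, struct kvec *head,
		      struct page **pages, int npages, size_t last_len)
{
	struct msghdr msg = { .msg_flags = MSG_MORE };
	ssize_t sent;
	int i, ret;

	/* One pass down the stack for the RPC reply header... */
	ret = kernel_sendmsg(sock, &msg, head, 1, head->iov_len);
	if (ret < 0)
		return ret;

	/* ...then one pass per page of READ data. */
	for (i = 0; i < npages; i++) {
		size_t len = (i == npages - 1) ? last_len : PAGE_SIZE;
		int flags = (i == npages - 1) ? 0 : MSG_MORE;

		sent = sock->ops->sendpage(sock, pages[i], 0, len, flags);
		if (sent < 0)
			return sent;
	}
	return 0;
}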
Re: Van Jacobson net channels
From: Greg Banks <[EMAIL PROTECTED]>
Date: Thu, 02 Feb 2006 14:06:06 +1100

> On Thu, 2006-02-02 at 13:46, David S. Miller wrote:
> > I know SAMBA is using sendfile() (when the client has the oplock
> > held, which basically is "always"), is NFS doing so as well?
>
> NFS is an in-kernel server, and uses sock->ops->sendpage directly.

Great.

Then where's all the TX overhead for NFS?  All the small transactions
and the sunrpc header munging?
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 13:46, David S. Miller wrote:
> I know SAMBA is using sendfile() (when the client has the oplock held,
> which basically is "always"), is NFS doing so as well?

NFS is an in-kernel server, and uses sock->ops->sendpage directly.

> Van does have some ideas in mind for TX net channels that I touched
> upon briefly with him, and we'll see some more things in this area, we
> just need to be patient. :)

Cool.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
From: Greg Banks <[EMAIL PROTECTED]>
Date: Thu, 02 Feb 2006 12:53:14 +1100

> I got the impression that his code was dynamically changing the
> e1000 interrupt mitigation registers in response to load, in
> other words using the capabilities of the hardware in a way that
> NAPI deliberately avoids doing.  I'm very curious to see the
> details.

Yes, once you stop doing NAPI and demux in the driver, we start
using the HW interrupt mitigation facilities again.

> cpu usage on tx is a significant part of the CPU usage
> issues for many interesting NFS workloads.

I know SAMBA is using sendfile() (when the client has the oplock held,
which basically is "always"), is NFS doing so as well?

Van does have some ideas in mind for TX net channels that I touched
upon briefly with him, and we'll see some more things in this area, we
just need to be patient. :)
Re: Van Jacobson net channels
David S. Miller wrote:
> From: Rick Jones <[EMAIL PROTECTED]>
> Date: Wed, 01 Feb 2006 17:32:24 -0800
>
> > How large is "the bulk?"
>
> The prequeue is always enabled when the app has blocked on read().

Actually I meant in terms of percentage of the cycles to process the
packet rather than frequency of occurrence, but that is an interesting
question - any read(), or a read() against the socket associated with
that connection?

What happens when the application has not blocked on a read - say an
application using (e)poll on M connections?  Does that resurrect my
supposition about the three degrees of parallelism?

> > > Ie. ACK goes out as fast as we can context switch
> > > to the app receiving the data.  This feedback makes all senders
> > > to a system send at a rate that system can handle.
> >
> > Once those senders have filled the TCP windows right?
>
> All you have to do for this to take effect is to fill the congestion
> window, which starts at 2 packets :-)

I think we've had that conversation before - it starts at 4380 bytes or
three segments, whichever comes first, right?  IIRC (I should look this
up but...) the two-segment case is when the MSS is larger than 1460?

rick jones
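[For reference, the rule being half-remembered here is the RFC 3390
initial window, IW = min(4*MSS, max(2*MSS, 4380 bytes)), which I believe
is also what Linux's tcp_init_cwnd() implements when no route metric
overrides it.  Restated in whole segments (an illustrative helper, not
kernel code; rfc3390_initial_cwnd_segs() is a made-up name):]

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)), i.e.
 *   4 segments for MSS <= 1095,
 *   3 segments for 1095 < MSS <= 1460 (3 * 1460 = 4380),
 *   2 segments for anything larger. */
static unsigned int rfc3390_initial_cwnd_segs(unsigned int mss)
{
	if (mss > 1460)
		return 2;
	return (mss > 1095) ? 3 : 4;
}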
Re: Van Jacobson net channels
On Thu, 2006-02-02 at 08:11, David S. Miller wrote:
> Van is not against NAPI, in fact he's taking NAPI to the next level.
> Softirq handling is overhead, and as this work shows, it is totally
> unnecessary overhead.

I got the impression that his code was dynamically changing the
e1000 interrupt mitigation registers in response to load, in
other words using the capabilities of the hardware in a way that
NAPI deliberately avoids doing.  I'm very curious to see the
details.

> How in the world can you not understand how incredible this is?

Maybe "you had to be there".  Van's presentation was amazingly
convincing in person, in a way the slides don't convey.  I've
not seen a standing ovation at a technical talk before ;-)

I'm very interested in vj channels for improving CPU usage of
NFS and Samba servers.  However, after a few days to reflect,
I'm curious as to how the tx is improved.  Van didn't touch upon
the tx side at all, and cpu usage on tx is a significant part of
the CPU usage issues for many interesting NFS workloads.

The other objections raised here are non-issues for an NFS or
Samba server.

Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
Re: Van Jacobson net channels
From: Rick Jones <[EMAIL PROTECTED]>
Date: Wed, 01 Feb 2006 17:32:24 -0800

> How large is "the bulk?"

The prequeue is always enabled when the app has blocked on read().

> > Ie. ACK goes out as fast as we can context switch
> > to the app receiving the data.  This feedback makes all senders
> > to a system send at a rate that system can handle.
>
> Once those senders have filled the TCP windows right?

All you have to do for this to take effect is to fill the congestion
window, which starts at 2 packets :-)
Re: Van Jacobson net channels
> > Maybe I'm not sufficiently clued-in, but in broad handwaving terms,
> > it seems today that all three can be taking place in parallel for a
> > given TCP connection.  The application is doing its
> > application-level thing on request N on one CPU, while request N+1
> > is being processed by TCP on another CPU, while the NIC is DMA'ing
> > request N+2 into the host.
>
> That's not what happens in the Linux TCP stack even today.  The bulk
> of the TCP processing is done in user context via the kernel prequeue.

How large is "the bulk?"

> When we get a TCP packet, we simply find the socket and tack on the
> SKB, then wake up the task.  We do none of the TCP packet processing.
> Once the app wakes up, we do the TCP input path and copy the data
> directly into user space all in one go.

Sounds a little like TOPS but with the user context.  OK, I think I can
grasp that.

> This has several advantages.
>
> 1) TCP stack processing is accounted for in the user process
>
> 2) ACK emission is done at a rate that matches the load of the system.

ACK emission sounds like something tracked by the EPA :)

> Ie. ACK goes out as fast as we can context switch
> to the app receiving the data.  This feedback makes all senders
> to a system send at a rate that system can handle.

Once those senders have filled the TCP windows right?

> 3) checksum + copy in parallel into userspace is possible
>
> And we've been doing things like this for 6 years :-)  This prequeue
> is another Van Jacobson idea btw, and net channels just extend this
> concept further.

So the parallelism I gained by moving netperf from the interrupt CPU to
the non-interrupt CPU was strictly between the driver+ip on the
interrupt CPU and tcp+socket on the other, right?

Cpu0 :  0.1% us,  0.1% sy,  0.0% ni, 60.1% id,  0.2% wa,  0.5% hi, 39.1% si
Cpu1 :  0.2% us, 20.9% sy,  0.0% ni, 34.2% id,  0.0% wa,  0.0% hi, 44.8% si

(netperf was bound to CPU1, and assuming the top numbers are
trustworthy)

rick jones
onlist, no need to cc
Re: Van Jacobson net channels
From: Rick Jones <[EMAIL PROTECTED]>
Date: Wed, 01 Feb 2006 16:39:00 -0800

> My questions are meant to see if something is even a roadblock in
> the first place.

Fair enough.

> Maybe I'm not sufficiently clued-in, but in broad handwaving terms,
> it seems today that all three can be taking place in parallel for a
> given TCP connection.  The application is doing its
> application-level thing on request N on one CPU, while request N+1
> is being processed by TCP on another CPU, while the NIC is DMA'ing
> request N+2 into the host.

That's not what happens in the Linux TCP stack even today.  The bulk
of the TCP processing is done in user context via the kernel prequeue.

When we get a TCP packet, we simply find the socket and tack on the
SKB, then wake up the task.  We do none of the TCP packet processing.
Once the app wakes up, we do the TCP input path and copy the data
directly into user space all in one go.

This has several advantages.

1) TCP stack processing is accounted for in the user process

2) ACK emission is done at a rate that matches the load of the
   system.  Ie. ACK goes out as fast as we can context switch to the
   app receiving the data.  This feedback makes all senders to a
   system send at a rate that system can handle.

3) checksum + copy in parallel into userspace is possible

And we've been doing things like this for 6 years :-)  This prequeue is
another Van Jacobson idea btw, and net channels just extend this
concept further.
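[The path David describes, paraphrased from memory of the 2.6-era
sources (tcp_v4_rcv() and tcp_recvmsg()); this is a schematic fragment,
not a verbatim excerpt.]

/* Softirq side, tcp_v4_rcv(): if nobody holds the socket lock, try to
 * park the skb on the prequeue and wake the blocked reader instead of
 * doing the TCP input processing here. */
	if (!sock_owned_by_user(sk)) {
		if (!tcp_prequeue(sk, skb))        /* queued for the reader? */
			ret = tcp_v4_do_rcv(sk, skb);  /* no reader: do it now */
	} else
		sk_add_backlog(sk, skb);

/* Process context, tcp_recvmsg(): drain the prequeue, so the real TCP
 * input work - including the ACKs and the copy/checksum into the user
 * buffer - runs in the context (and on the CPU) of the receiving task. */
	tcp_prequeue_process(sk);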
Re: Van Jacobson net channels
David S. Miller wrote: From: Rick Jones <[EMAIL PROTECTED]> Date: Wed, 01 Feb 2006 15:50:38 -0800 [ What sucks about this whole thread is that only folks like Jeff and myself are attempting to think and use our imagination to consider how some roadblocks might be overcome ] My questions are meant to see if something is even a roadblock in the first place. If the TCP processing is put in the user context, that means there is no more parallelism between the application doing its non-TCP stuff, and the TCP stuff for say the next request, which presently could be processed on another CPU right? There is no such implicit limitation, really. Consider the userspace mmap()'d ring buffer being tagged with, say, connection IDs. Say, file descriptors. In this way the kernel could dump into a single net channel for multiple sockets, and then the app can demux this stuff however it likes. In particular, things like HTTP would want this because web servers get lots of tiny requests and using a net channel per socket could be very wasteful. I'm not meaning to talk about mux/demux of multiple connections, I'm asking about where all the cycles are consumed and how that affects parallelism between user space, "TCP/IP processing" and the NIC for a given flow/connection/whatever. Maybe I'm not sufficiently clued-in, but in broad handwaving terms, it seems today that all three can be taking place in parallel for a given TCP connection. The application is doing its application-level thing on request N on one CPU, while request N+1 is being processed by TCP on another CPU, while the NIC is DMA'ing request N+2 into the host. If the processing is pushed all the way up to user space, will it be the case that the single-threaded application code can be working on request N while the TCP code is processing request N+1? That's what I'm trying to ask about. I think the data I posted about saturating a GbE bidirectionally with a single TCP connection shows an example of advantage being taken of parallelism between the application doing its thing on request N, while TCP is processing N+1 on another CPU and the NIC is bringing N+2 into the RAM. ["Re: [RFC] Poor Network Performance with e1000 on 2.6.14.3" msg id <[EMAIL PROTECTED]> ] What I'm not sure of is if that fully matters. Hence the questions. rick jones So, other background... long ago and far away, in HP-UX 10.20 which was BSDish in its networking, with Inbound Packet Scheduling, the netisr handoff included a hash of the header info and a per-CPU netisr would be used for the "TCP processing" That got HP-UX parallelism for multiple TCP connections coming through a single NIC. It meant that a single threaded application, with multiple connections would have the inbound TCP processing possibly scattered across all the CPUs while it was running on only one CPU. Cache lines for socket structures going back and forth could indeed be a concern although moving a cache line from one CPU to another is not a priori evil (although the threshold is rather high IMO). In HP-UX 11.X IPS was replaced with Thread Optimized Packet Scheduling (TOPS). There was still a netisr-like hand-off (although not as low in the stack as I would have liked it) where a lookup took place that found where the application last accessed that connection (I think Solaris Fire Engine does something very similar today). The idea there was that the place where inbound processing would take place would be determined by where the application last accessed the socket. 
Still get advantage taken of multiple CPUs for multiple connections to multiple threads, but at the price of losing one part of the app/tcp/nic parallelism. Both TOPS and IPS have been successful in their days. I'm trying to come to grips with which might be "better" - if it is even possible to say that one was better than the other. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
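A rough sketch of the TOPS-style steering described above, under the assumption that a simple 4-tuple hash table is enough (all names and the hash are invented): the table remembers the CPU where the application last touched each connection, and the netisr-like handoff consults it instead of spreading work by a pure header hash as IPS did.

#include <stdint.h>

#define FLOW_TABLE_SIZE 1024

struct flow_entry {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    int      last_cpu;      /* CPU where the app last touched the socket */
};

static struct flow_entry flow_table[FLOW_TABLE_SIZE];

static unsigned flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;
    return h & (FLOW_TABLE_SIZE - 1);
}

/* Recorded whenever the application accesses the socket. */
static void flow_note_access(struct flow_entry *f, int this_cpu)
{
    f->last_cpu = this_cpu;
}

/* Consulted at the netisr-like handoff: inbound protocol processing is
 * steered to the CPU the application used last (TOPS), instead of
 * being spread by a pure header hash (IPS). */
static int flow_pick_cpu(uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport, int default_cpu)
{
    struct flow_entry *f = &flow_table[flow_hash(saddr, daddr, sport, dport)];

    if (f->saddr == saddr && f->daddr == daddr &&
        f->sport == sport && f->dport == dport)
        return f->last_cpu;
    return default_cpu;
}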
Re: Van Jacobson net channels
From: Stephen Hemminger <[EMAIL PROTECTED]> Date: Wed, 1 Feb 2006 16:12:14 -0800 > The bigger problem I see is scalability. All those mmap rings have to > be pinned in memory to be useful. It's fine for a single smart application > per server environment, but in real world with many dumb thread monster > applications on a single server it will be really hard to get working. This is no different from when the thread blocks and the receive queue fills up, and in order to absorb scheduling latency. We already lock memory into the kernel for socket buffer memory as it is. At least the mmap() ring buffer method is optimized and won't have all of the overhead for struct sk_buff and friends. So we have the potential to lock down less memory not more. This is just like when we started using BK or GIT for source management, everyone was against it and looking for holes while they tried to wrap their brains around the new concepts and ideas. I guess it will take a while for people to understand all this new stuff, but we'll get there. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
On Wed, 01 Feb 2006 15:42:39 -0800 (PST) "David S. Miller" <[EMAIL PROTECTED]> wrote: > From: Andi Kleen <[EMAIL PROTECTED]> > Date: Wed, 1 Feb 2006 23:55:11 +0100 > > > On Wednesday 01 February 2006 21:26, Jeff Garzik wrote: > > > Andi Kleen wrote: > > > > But I don't think Van's design is supposed to be exposed to user space. > > > > > > It is supposed to be exposed to userspace AFAICS. > > > > Then it's likely insecure and root only, unless he knows some magic > > that we don't. > > > > I hope it's not just PF_PACKET mmap rings with a user space TCP library. > > Yes, that's it. > > If the user screws up the TCP connection and corrupts his data why > should the kernel care? > > > I mean the Linux implementation is in the kernel, but in user context. > > Right, but prequeue doesn't go nearly far enough. We still do up to 5 > demuxes on the input path (protocol, route, IPSEC, netfilter, socket) > plus the queueing at the softint layer. That's rediculious and we've > always understood this, and Van has presented a way to kill this stuff > off. > > But even if you don't like the userspace stuff, we don't necessarily > have to go there, we can just demux directly to sockets in the kernel > TCP stack and then revisit the userspace idea before committing to it. > The bigger problem I see is scalability. All those mmap rings have to be pinned in memory to be useful. It's fine for a single smart application per server environment, but in real world with many dumb thread monster applications on a single server it will be really hard to get working. -- Stephen Hemminger <[EMAIL PROTECTED]> OSDL http://developer.osdl.org/~shemminger - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Rick Jones <[EMAIL PROTECTED]> Date: Wed, 01 Feb 2006 15:50:38 -0800 [ What sucks about this whole thread is that only folks like Jeff and myself are attempting to think and use our imagination to consider how some roadblocks might be overcome ] > If the TCP processing is put in the user context, that means there > is no more parallelism between the application doing its non-TCP > stuff, and the TCP stuff for say the next request, which presently > could be processed on another CPU right? There is no such implicit limitation, really. Consider the userspace mmap()'d ring buffer being tagged with, say, connection IDs. Say, file descriptors. In this way the kernel could dump into a single net channel for multiple sockets, and then the app can demux this stuff however it likes. In particular, things like HTTP would want this because web servers get lots of tiny requests and using a net channel per socket could be very wasteful. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
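As a sketch of the connection-ID-tagged ring idea above - the layout and names below are invented for illustration, not an existing ABI - each slot carries the owning file descriptor, so one channel can feed many sockets and the application performs the final demux itself, which is what a web server with lots of tiny requests would want.

#include <stdint.h>
#include <stddef.h>

#define RING_SLOTS 256

struct chan_slot {
    volatile uint32_t ready;    /* producer sets 1, consumer clears    */
    int               fd;       /* which connection this frame is for  */
    uint32_t          len;
    uint8_t           data[2048];
};

struct chan_ring {
    struct chan_slot slot[RING_SLOTS];
    unsigned         tail;      /* consumer's position                 */
};

/* Application-side demux: walk the ring and hand each frame to the
 * per-connection handler chosen by the fd tag. */
typedef void (*conn_handler)(int fd, const uint8_t *buf, size_t len);

static void channel_poll(struct chan_ring *r, conn_handler handle)
{
    for (;;) {
        struct chan_slot *s = &r->slot[r->tail % RING_SLOTS];

        if (!s->ready)
            break;                      /* nothing more to consume     */
        handle(s->fd, s->data, s->len);
        s->ready = 0;                   /* give the slot back          */
        r->tail++;
    }
}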
Re: Van Jacobson net channels
It almost feels like the channel concept wants a "thread per connection" model? No, it means only that your application must be asynchronous -- which all modern network apps are already. The INN model of a single process calling epoll(2) for 800 sockets should continue to work, as should the Apache N-sockets-per-thread model, as should the thread-per-connection model. All of that continues to be within the realm of application choice. I may not have been as clear as I should - I'm not meaning to ask about stuff like epoll continuing to function etc. If the TCP processing is put in the user context, that means there is no more parallelism between the application doing its non-TCP stuff, and the TCP stuff for say the next request, which presently could be processed on another CPU right? Like when I did the "yes, one can saturate GbE both ways on a single connection - when netperf runs on a CPU other than the one doing (some fraction) of the TCP processing" message earlier today in another thread. rick jones on list, no need to cc - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Mitchell Blank Jr <[EMAIL PROTECTED]> Date: Wed, 1 Feb 2006 15:37:04 -0800 > So I agree that this would have to be CAP_NET_ADMIN only. I'm drowning in all of this pessimism, folks. Why not concentrate your thinking on how it can be made to _work_ instead of punching holes in the idea? Isn't that more productive? Or I suppose those studly numbers aren't incentive enough to find a solution and try to be optimistic? If so, I think that sucks. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Andi Kleen <[EMAIL PROTECTED]> Date: Wed, 1 Feb 2006 23:55:11 +0100 > On Wednesday 01 February 2006 21:26, Jeff Garzik wrote: > > Andi Kleen wrote: > > > But I don't think Van's design is supposed to be exposed to user space. > > > > It is supposed to be exposed to userspace AFAICS. > > Then it's likely insecure and root only, unless he knows some magic > that we don't. > > I hope it's not just PF_PACKET mmap rings with a user space TCP library. Yes, that's it. If the user screws up the TCP connection and corrupts his data, why should the kernel care? > I mean the Linux implementation is in the kernel, but in user context. Right, but prequeue doesn't go nearly far enough. We still do up to 5 demuxes on the input path (protocol, route, IPSEC, netfilter, socket) plus the queueing at the softint layer. That's ridiculous and we've always understood this, and Van has presented a way to kill this stuff off. But even if you don't like the userspace stuff, we don't necessarily have to go there, we can just demux directly to sockets in the kernel TCP stack and then revisit the userspace idea before committing to it. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
Jeff Garzik wrote: > Once packets classified to be delivered to a specific local host socket, > what further operations are require privs? What received packet data > cannot be exposed to userspace? You just need to make sure that you don't leak data from other peoples sockets. Two issues I see: 1. If the card receives a long frame for application #1 and then receives a short frame for application #2, then you need to make sure that the data gets zeroed out first. So you need to limit this to only maximum-sized packets (or packets whose previous use was on the same flow). Probably not a big deal, since that's the performance-critical case anyway 2. More concerning is how you control what packets the app can see. If you made the memory frames all PAGE_SIZE then you could just give the app the packets to its flows by doing MMU tricks, but wouldn't that murder performance anyway? So I think the only real solution would be to allow the app to map all of the frames all of the time. So I agree that this would have to be CAP_NET_ADMIN only. -Mitch - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
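A minimal sketch of the first point above, with invented names: when a receive frame is recycled across flows and only partially overwritten, the stale tail has to be cleared (or the frame restricted to same-flow reuse) before it is exposed to the new owner.

#include <string.h>
#include <stdint.h>

struct rx_frame {
    uint32_t owner_id;      /* flow/app the frame was last used for */
    uint32_t frame_size;    /* full size of the DMA buffer          */
    uint8_t  data[2048];
};

/* Called before a refilled frame is made visible to userspace. */
static void frame_publish(struct rx_frame *f, uint32_t new_owner,
                          uint32_t new_len)
{
    /* Cheap cases: same flow as before, or the frame was completely
     * overwritten by a maximum-sized packet.  Otherwise zero the tail
     * so the previous owner's bytes cannot leak. */
    if (f->owner_id != new_owner && new_len < f->frame_size)
        memset(f->data + new_len, 0, f->frame_size - new_len);
    f->owner_id = new_owner;
}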
Re: Van Jacobson net channels
But people who care about the performance of their networking apps are likely to want to switch over to this new userspace networking API, over the next decade, I think. Yet there needs to be some cross-platform commonality for the API yes? That was the main thrust behind my simplistic asking about posix aio being sufficient (which of course could be more than necessary :) to the task - at least as an API, not specifically about any given implementation of it - I agree that doing aio by launching another thread is rather silly... rick jones on list, no need for cc - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
Andi Kleen wrote: On Wednesday 01 February 2006 21:26, Jeff Garzik wrote: Andi Kleen wrote: But I don't think Van's design is supposed to be exposed to user space. It is supposed to be exposed to userspace AFAICS. Then it's likely insecure and root only, unless he knows some magic that we don't. Once packets are classified to be delivered to a specific local host socket, what further operations require privs? What received packet data cannot be exposed to userspace? I hope it's not just PF_PACKET mmap rings with a user space TCP library. Why? It's still in the kernel, just in process context. Incorrect. It's in the userspace app (though usually via a library). See slides 26 and 27. I mean the Linux implementation is in the kernel, but in user context. Yes, I know what you meant. My answer still stands... Certainly older applications that use only read(2) and write(2) must continue to work. Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
Rick Jones wrote: what are the implications for having the application churning away doing application things while TCP is feeding it data? Or for an application that is processing more than one TCP connection in a given thread? It almost feels like the channel concept wants a "thread per connection" model? No, it means only that your application must be asynchronous -- which all modern network apps are already. The INN model of a single process calling epoll(2) for 800 sockets should continue to work, as should the Apache N-sockets-per-thread model, as should the thread-per-connection model. All of that continues to be within the realm of application choice. Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
Rick Jones wrote: Jeff Garzik wrote: Key point 1: Van's slides align closely with the design that I was already working on, for zero-copy RX. To have a fully async, zero copy network receive, POSIX read(2) is inadequate. Is there an aio_read() in POSIX adequate to the task? Definitely not. POSIX AIO is far more complex than the operation requires, and is particularly bad for implementations that find it wise to queue a bunch of to-be-filled buffers. Further, the current implementation of POSIX AIO uses a thread for almost every I/O, which is yet more overkill. A simple mmap'd ring buffer is much closer to how the hardware actually behaves. It's no surprise that the "ring buffer / doorbell" pattern pops up all over the place in computing these days. Are you speaking strictly in the context of a single TCP connection, or for multiple TCP connections? For the latter getting out of the kernel isn't a priori a requirement. Actually, I'm not even sure it is a priori a requirement for the former? Getting the TCP receive path out of the kernel isn't a requirement, just an improvement. You'll always have to have a basic path for existing applications that do normal read(2) and write(2). You can't break something that fundamental. But people who care about the performance of their networking apps are likely to want to switch over to this new userspace networking API, over the next decade, I think. Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
On Wednesday 01 February 2006 21:26, Jeff Garzik wrote: > Andi Kleen wrote: > > But I don't think Van's design is supposed to be exposed to user space. > > It is supposed to be exposed to userspace AFAICS. Then it's likely insecure and root only, unless he knows some magic that we don't. I hope it's not just PF_PACKET mmap rings with a user space TCP library. > > > It's still in the kernel, just in process context. > > Incorrect. Its in the userspace app (though usually via a library). > See slides 26 and 27. I mean the Linux implementation is in the kernel, but in user context. -Andi - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
On Wednesday 01 February 2006 22:11, David S. Miller wrote: > From: Andi Kleen <[EMAIL PROTECTED]> > Date: Wed, 1 Feb 2006 19:28:46 +0100 > > > http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf > > I did a writeup in my blog about all of this, another good > reason to actively follow my blog: > > http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html > > Go read. > > > -Andi (who prefers sourceware over slideware) > > People are definitely hung up on the details, and that means > they are analyzing Van's work from the absolute _wrong_ angle. The main reason I look for details is that it's unclear to me if his work is one copy or zero copy and how the actual data in the channels is managed. The netchannels seem to just pass indexes into some other buffer, so unless he found a much better e1000 than I have @) it's probably single copy from the RX ring into another big buffer. Right? Some of the other stuff sounded like an attempted zero copy. How is that other buffer managed? Is it sitting in user space? If yes, then how does the data end up in the simulated read() in user space? That would require another copy unless I'm missing something. The other way, if it's not copy-from-rx-ring to another buffer, would be to have a big shared pool that is always mapped to everybody (assuming no intelligent NIC queue support) - that would be insecure, right? I guess independent of any other stuff it would be an interesting experiment to change the socket and TCP prequeue into a linked list of arrays pointing to skbs and see if it really helps over the doubly linked lists (those are the points that should pass skbs between CPUs). Also the TX part is a bit unclear. > So when a TCP socket enters established state, we add an entry into > the classifier. The classifier is even smart enough to look for > a listening socket if the fully established classification fails. I think it's a pretty important detail. The current TCP demultiplex is a considerable part of the TCP processing cost and I haven't seen any good proposals yet to make it faster [except the old one of using a smaller hash ..] Is he using some kind of binary tree for this or a hash? > Van is not against NAPI, in fact he's taking NAPI to the next level. > Softirq handling is overhead, and as this work shows, it is totally > unnecessary overhead. > > Yes we do TCP prequeue now, and that's where the second stage net > channel stuff hooks into. But prequeue as we have it now is not > enough, we still run softirq, and IP input processing from softirq not > from user socket context. I don't quite get why this is a problem. Softirq is on the same CPU as the interrupt, so it should be pretty cheap (no bounced cachelines). Due to the way the stacking works the cache locality should also be OK (except for the big hash tables). > The RX net channel bypasses all of that > crap. > > The way we do softirq now we can feed one cpu with softirq work given > a single card, with Van's stuff we can feed socket users on multiple > cpus with a single card. The net channel data structure SMP > friendliness really helps here. OK, so the point is to not keep the softirq work on the CPU which has the interrupt affinity. MSI-X & receive hashing should solve that one mostly anyways, no? But I agree it would be nice to fix it on old hardware too. -Andi - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
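A sketch of the "linked list of arrays" experiment suggested above, with invented names and sizes: pointers to packets are batched into chunks linked together, so the point where packets are handed from one CPU to another touches far fewer cache lines than a doubly linked list of skbs would.

#include <stdlib.h>

#define CHUNK_ENTRIES 14            /* roughly a couple of cache lines */

struct pkt_chunk {
    struct pkt_chunk *next;
    unsigned          count;
    void             *pkt[CHUNK_ENTRIES];   /* would be struct sk_buff * */
};

struct array_queue {
    struct pkt_chunk *head, *tail;
};

/* Producer side: the common case is a single pointer store into a
 * chunk that is already cache hot; a new chunk is allocated only
 * every CHUNK_ENTRIES packets.  The consumer drains whole chunks. */
static int aq_enqueue(struct array_queue *q, void *pkt)
{
    struct pkt_chunk *c = q->tail;

    if (!c || c->count == CHUNK_ENTRIES) {
        c = calloc(1, sizeof(*c));
        if (!c)
            return -1;
        if (q->tail)
            q->tail->next = c;
        else
            q->head = c;
        q->tail = c;
    }
    c->pkt[c->count++] = pkt;
    return 0;
}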
Re: Van Jacobson net channels
At the risk of being told to launch myself towards a body of water... So, sort of linking with the data about saturating a GbE both ways on a single TCP connection, and how it required binding netperf to the CPU other than the one taking interrupts... If channels are taken to their limit, and the non-hard-irq processing of the packet is all in the user's context what are the implications for having the application churning away doing application things while TCP is feeding it data? Or for an application that is processing more than one TCP connection in a given thread? It almost feels like the channel concept wants a "thread per connection" model? rick jones - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Jeff Garzik <[EMAIL PROTECTED]> Date: Wed, 01 Feb 2006 14:37:46 -0500 > So, I am not concerned with slideware. These are two good ideas that > are worth pursuing, even if Van produces zero additional output. Right. And, to all of you having trouble imagining how else you'd apply these net channel ideas. Consider the case where the RX classifier gives you the net channel which is the output queue of another device. That's routing and packet mirroring :-) The RX classifier could give you a net channel for a netfilter rule. And let's assume that after netfilter NAT's the packet, it's for the local host, and netfilter does another classification and ends up with a local socket net channel. Use your imagination, this stuff can be applied everywhere. It's like eating peanuts. You start eating them one by one and you just can't stop :-) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Andi Kleen <[EMAIL PROTECTED]> Date: Wed, 1 Feb 2006 20:50:31 +0100 > On Wednesday 01 February 2006 20:37, Jeff Garzik wrote: > > > To have a fully async, zero copy network receive, POSIX read(2) is > > inadequate. > > Agreed, but POSIX aio is adequate. No, it's a joke. To do this stuff right you want networking experts (not UNIX interface standards experts) to come up with how to do things, because folks like POSIX are going to make a rocking implementation next to impossible. Only folks like Van Jacobson can take us out of the myopic view we currently have of how networking receive is done. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
From: Andi Kleen <[EMAIL PROTECTED]> Date: Wed, 1 Feb 2006 19:28:46 +0100 > http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf I did a writeup in my blog about all of this, another good reason to actively follow my blog: http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html Go read. > -Andi (who prefers sourceware over slideware) People are definitely hung up on the details, and that means they are analyzing Van's work from the absolute _wrong_ angle. This surprised me, what I expected was for anyone knowledgable about networking to get this immediately, and as for the details, have an attitude of "I don't care how, let's find a way to make this work!" But since you're so hung up on the details, the basic idea is that there is a tiny classifier in the RX IRQ processing of the driver. We have to touch that first cache line of the packet headers anyway, so the classification comes for free. You'll notice that even though he's running this tiny classifier in the hard IRQ context, in order to put the packet on the right RX net channel, IRQ overhead remains the same. So when a TCP socket enters established state, we add an entry into the classifier. The classifier is even smart enough to look for a listening socket if the fully established classification fails. Van is not against NAPI, in fact he's taking NAPI to the next level. Softirq handling is overhead, and as this work shows, it is totally unnecessary overhead. Yes we do TCP prequeue now, and that's where the second stage net channel stuff hooks into. But prequeue as we have it now is not enough, we still run softirq, and IP input processing from softirq not from user socket context. The RX net channel bypasses all of that crap. The way we do softirq now we can feed one cpu with softirq work given a single card, with Van's stuff we can feed socket users on multiple cpus with a single card. The net channel data structure SMP friendliness really helps here. In one shot it does the input route lookup and the socket lookup. We just attach the packet to the socket's RX net channel, all from hard IRQ context, at zero cost (see above). This is just like the grand unified flow cache idea that we've been tossing around for the past few years. And the beauty of all of this is that it complements ideas like LRO, I/O AT, and cpu architectures like Niagara. How in the world can you not understand how incredible this is? - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
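To illustrate the classifier step being described - a sketch only, under the assumption that a flat 4-tuple hash is acceptable; the names and hash are invented, and this is neither Van's nor the kernel's code - an entry is installed when the socket reaches established state, and the RX interrupt path does one cheap lookup on the header cache line it already touched, falling back to the listening socket's channel on a miss.

#include <stdint.h>
#include <string.h>

#define CLS_BUCKETS 4096

struct cls_key {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

struct cls_entry {
    struct cls_key key;
    void          *channel;     /* the socket's RX net channel */
    int            in_use;
};

static struct cls_entry cls_tab[CLS_BUCKETS];

static unsigned cls_hash(const struct cls_key *k)
{
    uint32_t h = k->saddr ^ k->daddr ^ (((uint32_t)k->sport << 16) | k->dport);
    h ^= h >> 12;
    return h & (CLS_BUCKETS - 1);
}

/* Called when the TCP socket enters established state. */
static void cls_add(const struct cls_key *k, void *channel)
{
    struct cls_entry *e = &cls_tab[cls_hash(k)];

    e->key = *k;
    e->channel = channel;
    e->in_use = 1;
}

/* Called from RX interrupt context; the headers were just read to
 * build the key, so the lookup rides on a cache line already paid for. */
static void *cls_lookup(const struct cls_key *k, void *listener_channel)
{
    struct cls_entry *e = &cls_tab[cls_hash(k)];

    if (e->in_use && !memcmp(&e->key, k, sizeof(*k)))
        return e->channel;
    return listener_channel;    /* fall back to the listening socket */
}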
Re: Van Jacobson net channels
Andi Kleen wrote: But I don't think Van's design is supposed to be exposed to user space. It is supposed to be exposed to userspace AFAICS. It's still in the kernel, just in process context. Incorrect. It's in the userspace app (though usually via a library). See slides 26 and 27. But irrespective of the slides, think about the underlying concept: the most efficient pipe of this sort is for the NIC to hand [selected] packets directly to the userspace app, with minimal (hopefully zero) copying. The userspace app is then given the freedom to choose how it handles incoming TCP data, either via custom algorithms or a standard shared library. I agree that ACK latency may be one potential issue... but this overall design is what a lot of low latency systems are moving to. Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
On 2/1/06, Andi Kleen <[EMAIL PROTECTED]> wrote: > On Wednesday 01 February 2006 20:37, Jeff Garzik wrote: > > > To have a fully async, zero copy network receive, POSIX read(2) is > > inadequate. > > Agreed, but POSIX aio is adequate. > > > One needs a ring buffer, similar in API to the mmap'd > > packet socket, where you can queue a whole bunch of reads. Van's design > > seems similar to this. > > See lio_listio et.al. > > But I don't think Van's design is supposed to be exposed to user space. > It's just a better way to implement BSD sockets. Well, for DCCP it seems interesting, look at: http://www.icir.org/kohler/dccp/nsdiabstract.pdf # A Congestion-Controlled Unreliable Datagram API Junwen Lai and Eddie Kohler Describes a potential DCCP API based on a shared-memory packet ring. The API simultaneously achieves kernel-implemented congestion control, high throughput, and late data choice, where the app can change what's sent very late in the process. Shows that congestion-controlled DCCP API can improve the rate of "important" frames delivered, relative to non-congestion-controlled UDP, in some situations. - Arnaldo - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
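A sketch of the late-data-choice idea from that abstract, with an invented slot layout (a real shared-memory design would need a proper ownership handshake so the application cannot race the kernel's copy): the application may keep rewriting a posted datagram until congestion control actually claims it for transmission.

#include <stdint.h>
#include <string.h>

enum slot_state { SLOT_FREE, SLOT_READY, SLOT_SENT };

struct tx_slot {
    volatile uint32_t state;
    uint32_t          len;
    uint8_t           data[1400];
};

/* App side: publish a datagram, or refresh one that has not gone out
 * yet - e.g. replace a stale video frame with a newer one. */
static void slot_post(struct tx_slot *s, const void *buf, uint32_t len)
{
    if (s->state == SLOT_SENT)
        return;                       /* too late, already transmitted */
    memcpy(s->data, buf, len);        /* may overwrite an earlier post */
    s->len = len;
    s->state = SLOT_READY;
}

/* "Kernel" side: claim the freshest contents when congestion control
 * says a packet may go out now. */
static uint32_t slot_claim(struct tx_slot *s, void *out, uint32_t maxlen)
{
    uint32_t n;

    if (s->state != SLOT_READY)
        return 0;
    n = s->len < maxlen ? s->len : maxlen;
    memcpy(out, s->data, n);
    s->state = SLOT_SENT;
    return n;
}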
Re: Van Jacobson net channels
Andi writes: > But I don't think Van's design is supposed to be exposed to user space. > It's just a better way to implement BSD sockets. Actually, it can, indeed, go all the way to user space - connecting channels to the socket layer was one of the intermediate steps. FWIW, I did an article on this, going mostly from the slides. Here's a get-out-of-subscription-jail-free card for it: http://lwn.net/SubscriberLink/169961/776b6c53d1c1673a/ jon - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
Jeff Garzik wrote: Key point 1: Van's slides align closely with the design that I was already working on, for zero-copy RX. To have a fully async, zero copy network receive, POSIX read(2) is inadequate. Is there an aio_read() in POSIX adequate to the task? One needs a ring buffer, similar in API to the mmap'd packet socket, where you can queue a whole bunch of reads. Van's design seems similar to this. Key point 2: Once the kernel gets enough info to determine which channel should receive a packet, it's done. Van pushes TCP/IP receive processing into the userland app, which is quite an idea. This pushes work out of the kernel and into the app, which in turn, increases the amount of work that can be performed in parallel on multiple cpus/cores. Increased opportunity for parallelism is indeed goodness. Are you speaking strictly in the context of a single TCP connection, or for multiple TCP connections? For the latter getting out of the kernel isn't a priori a requirement. Actually, I'm not even sure it is a priori a requirement for the former? This also has the side effect of making TOE look even more like an inferior solution... Van's design is far more scalable than TOE. I'm not disagreeing, but it would be good to further define the axis for scalability. rick jones - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Van Jacobson net channels
On Wednesday 01 February 2006 20:37, Jeff Garzik wrote: > To have a fully async, zero copy network receive, POSIX read(2) is > inadequate. Agreed, but POSIX aio is adequate. > One needs a ring buffer, similar in API to the mmap'd > packet socket, where you can queue a whole bunch of reads. Van's design > seems similar to this. See lio_listio et al. But I don't think Van's design is supposed to be exposed to user space. It's just a better way to implement BSD sockets. > Key point 2: > Once the kernel gets enough info to determine which channel should > receive a packet, it's done. Van pushes TCP/IP receive processing into > the userland app, which is quite an idea. We have already done this since 2.3 (Alexey's work). The only difference in his scheme seems to be that the demultiplex to different sockets is somehow (he doesn't explain how) pushed into the driver. It's also unclear how this will simplify the drivers as the slides claim. Also I should add that the added ACK latency is a problem for a few workloads. > This pushes work out of the > kernel and into the app, It's still in the kernel, just in process context. > which in turn, increases the amount of work > that can be performed in parallel on multiple cpus/cores. Well, the current demultiplex already runs on all CPUs (assuming you have enough devices to send affinitized interrupts to each CPU - in the future with MSI-X this can hopefully be done better). > The overall > bottleneck in the kernel is reduced. What bottleneck exactly? -Andi - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
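For reference, the lio_listio() batching mentioned above looks like this - standard POSIX AIO, link with -lrt on Linux; the file and sizes are arbitrary, and whether this is an adequate substitute for a mapped ring is exactly what is being argued in the thread.

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#define NREQ 4
#define BUFSZ 4096

int main(void)
{
    struct aiocb cbs[NREQ];
    struct aiocb *list[NREQ];
    static char bufs[NREQ][BUFSZ];
    int fd = open("/etc/services", O_RDONLY);
    int i;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(cbs, 0, sizeof(cbs));
    for (i = 0; i < NREQ; i++) {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = bufs[i];
        cbs[i].aio_nbytes = BUFSZ;
        cbs[i].aio_offset = (off_t)i * BUFSZ;
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }
    /* Submit the whole batch with one call and wait for completion. */
    if (lio_listio(LIO_WAIT, list, NREQ, NULL) < 0) {
        perror("lio_listio");
        return 1;
    }
    for (i = 0; i < NREQ; i++)
        printf("request %d: %zd bytes\n", i, aio_return(&cbs[i]));
    return 0;
}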
Re: Van Jacobson net channels
Key point 1: Van's slides align closely with the design that I was already working on, for zero-copy RX. To have a fully async, zero copy network receive, POSIX read(2) is inadequate. One needs a ring buffer, similar in API to the mmap'd packet socket, where you can queue a whole bunch of reads. Van's design seems similar to this. Key point 2: Once the kernel gets enough info to determine which channel should receive a packet, it's done. Van pushes TCP/IP receive processing into the userland app, which is quite an idea. This pushes work out of the kernel and into the app, which in turn, increases the amount of work that can be performed in parallel on multiple cpus/cores. The overall bottleneck in the kernel is reduced. PCI MSI-X further reduces the bottleneck, after that. This also has the side effect of making TOE look even more like an inferior solution... Van's design is far more scalable than TOE. So, I am not concerned with slideware. These are two good ideas that are worth pursuing, even if Van produces zero additional output. Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
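For comparison, the existing mmap'd packet socket interface referenced above (PACKET_RX_RING) already gives a kernel-filled ring that userspace drains by flipping per-frame status words; a rough sketch, requiring CAP_NET_RAW, with arbitrary sizes and minimal error handling:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

int main(void)
{
    struct tpacket_req req = {
        .tp_block_size = 4096,
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,
        .tp_frame_nr   = (4096 / 2048) * 64,   /* frames/block * blocks */
    };
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    unsigned char *ring;
    unsigned int frame = 0;

    if (fd < 0 || setsockopt(fd, SOL_PACKET, PACKET_RX_RING,
                             &req, sizeof(req)) < 0) {
        perror("PACKET_RX_RING");
        return 1;
    }
    ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    for (;;) {
        struct tpacket_hdr *hdr =
            (void *)(ring + (size_t)frame * req.tp_frame_size);

        if (!(hdr->tp_status & TP_STATUS_USER)) {
            usleep(1000);           /* nothing yet; use poll() in real code */
            continue;
        }
        printf("frame %u: %u bytes\n", frame, hdr->tp_len);
        hdr->tp_status = TP_STATUS_KERNEL;     /* hand the slot back */
        frame = (frame + 1) % req.tp_frame_nr;
    }
}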
Re: Van Jacobson net channels
On Wednesday 01 February 2006 14:48, Leonid Grossman wrote: > David S. Miller wrote: > > > And with Van Jacobson net channels, none of this is going to > > matter and 512 is going to be your limit whether you like it > > or not. So this short term complexity gain is doubly not justified. > > > David, can you elaborate on this upcoming implementation (timeframe, > description if any, etc)? There are some slides on the web, but I failed to fully understand the concept just from them - maybe I'm just not clever enough @) http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf Some details seem to be missing at least: - a description of TX (he hinted at eliminating hard_start_xmit somehow on one slide, but everything else was only on RX) - how the NIC selects the socket to connect to directly (e.g. with what does he replace the ETH/IP/TCP demux tables) - what his problem with softirqs was. They are fully parallelized in Linux and completely cache hot with their feeding interrupts, so I didn't quite follow why he wanted to eliminate them. - how exactly the drivers become simpler and what the generic functions he wants to replace driver code with do. I liked the concept of the cache-friendlier "array queues", but as far as I can see there should usually be only one handoff between different CPUs for RX, so I'm not sure they will make that much difference. And longer term this handoff can be eliminated with hardware queue support anyway, when we manage to get the NIC to send an MSI to the right CPU. Then everything will be CPU local. Hopefully there is a patch available soon to make this all clear. > I assume these net channels are per cpu? What's the relation to NAPI? > > This sounds like something we can "connect" our driver/ASIC queues to, > in order to extend the channels all the way to 10GbE PHY. I think the basic problem of that is still unsolved. The only really scalable way to do such connections seems to be RX hashing and sending MSIs to different CPUs based on that. But it is still not fully clear how to get the Linux scheduler to cooperate with that so that the socket consumers end up at the right CPUs selected by the driver hash. -Andi (who prefers sourceware over slideware) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
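A sketch of the RX-hash steering mentioned in the last paragraph, with invented names: the driver (or NIC) hashes the 4-tuple to pick a queue whose MSI is affined to one CPU, so a given flow stays CPU-local; how to get the scheduler to keep the consuming task on that same CPU is the part the thread leaves open.

#include <stdint.h>

struct rx_queue {
    int irq;            /* MSI vector, affined to exactly one CPU */
    int cpu;
};

static uint32_t rx_hash(uint32_t saddr, uint32_t daddr,
                        uint16_t sport, uint16_t dport)
{
    /* A real NIC would use something like a Toeplitz hash; any
     * deterministic mix of the 4-tuple gives flow-sticky steering. */
    uint32_t h = saddr * 2654435761u ^ daddr * 2246822519u;
    h ^= ((uint32_t)sport << 16) | dport;
    h ^= h >> 13;
    return h;
}

/* Same flow -> same queue -> same CPU, for the life of the connection. */
static struct rx_queue *pick_queue(struct rx_queue *queues, int nqueues,
                                   uint32_t saddr, uint32_t daddr,
                                   uint16_t sport, uint16_t dport)
{
    return &queues[rx_hash(saddr, daddr, sport, dport) % nqueues];
}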