Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]> Date: Fri, 06 Jul 2007 10:39:15 -0400

> If the issue is usability of listing 1024 netdevices, i can think of
> many ways to resolve it.

I would agree with this if there were a reason for it; it's a totally unnecessary complication as far as I can see. These virtual devices are an ethernet with the subnet details exposed to the driver, nothing more. I see zero benefit to having a netdev for each guest or node we can speak to whatsoever. It's a very heavy abstraction to use for something that is so bloody simple.

My demux on ->hard_start_xmit() is _5 DAMN LINES OF CODE_, and you want to replace that with a full netdev because of some minor difficulties in figuring out how to record the queuing state. It's beyond unreasonable.

Netdevs are like salt: if you put too much in your food it tastes awful.

- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
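Dave's five-line demux is not shown anywhere in the thread, but the idea it implements - an exact match on the destination MAC selects a peer's queue, while broadcast, multicast and unknown destinations fall through to the control node - can be sketched as plain userspace C. Everything below (the table layout, names, and sizes) is invented here for illustration, not the sunvnet code:

```c
#include <assert.h>
#include <string.h>

#define ETH_ALEN  6
#define MAX_PORTS 16

/* Hypothetical per-port table: remote MAC -> TX queue index. */
struct vnet_port {
    unsigned char remote_mac[ETH_ALEN];
    int queue;
};

static struct vnet_port ports[MAX_PORTS];
static int nports;
static const int control_queue = 0; /* control node port handles the rest */

/* The demux: exact remote-MAC match selects a peer queue; broadcast,
 * multicast and unknown unicast all go to the control node's port. */
static int select_queue(const unsigned char *dst_mac)
{
    int i;

    if (dst_mac[0] & 1)            /* group (broadcast/multicast) bit */
        return control_queue;
    for (i = 0; i < nports; i++)
        if (!memcmp(ports[i].remote_mac, dst_mac, ETH_ALEN))
            return ports[i].queue;
    return control_queue;          /* off-box or unknown destination */
}
```

In a real driver this lookup would have to run before (or at) ->hard_start_xmit() so the right per-port queue state can be consulted, which is exactly the queue-selection problem being debated.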
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-07-06 at 10:39 -0400, jamal wrote:
> The first thing that crossed my mind was "if you want to select a
> destination port based on a destination MAC you are talking about a
> switch/bridge". You bring up the issue of "a huge number of virtual NICs
> if you wanted arbitrary guests" which is a real one[2].

Hi Jamal,

I'm deeply tempted to agree with you that the answer is multiple virtual NICs (and I've been tempted to abandon lguest's N-way transport scheme), except that it looks like we're going to have multi-queue NICs for other reasons. Otherwise I'd be tempted to say "create/destroy virtual NICs as other guests appear/vanish from the network". No one does this today, but that doesn't make it wrong.

> If i got this right, still not answering the netif_stop question posed:
> the problem you are also trying to resolve now is get rid of N
> netdevices on each guest for a usability reason; i.e have one netdevice,
> move the bridging/switching functionality/tables into the driver;
> replace the ports with queues instead of netdevices. Did i get that
> right?

Yep, well summarized. I guess the question is: should the Intel guys be representing their multi-queue NICs as multiple NICs rather than adding the subqueue concept?

> BTW, one curve that threw me off a little is it seems most of the
> hardware that provides virtualization also provides point-to-point
> connections between different domains; i always thought that they all
> provided a point-to-point to the dom0 equivalent and let the dom0 worry
> about how things get from domainX to domainY.

Yeah, but that has obvious limitations as people care more about inter-guest I/O: we want direct inter-guest networking...

Cheers,
Rusty.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
jamal wrote:
> If the issue is usability of listing 1024 netdevices, i can think of
> many ways to resolve it. One way we can resolve the listing is with a
> simple tag to the netdev struct: i could say "list netdevices for
> guest 0-10" etc etc.

This would be a useful feature, not only for virtualization. I've seen some boxes with thousands of net devices (mostly ppp, but also some ATM). It would be nice to be able to assign a tag to an arbitrary set of devices. Does the network namespace stuff help with any of this?

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-06-07 at 17:32 +1000, Rusty Russell wrote:
[..some good stuff deleted here ..]
> Hope that adds something,

It does - thanks. I think i was letting my experience pollute my thinking earlier when Dave posted. The copy-avoidance requirement is clear to me[1]. I had another issue which wasn't clear, but you touched on it, so this breaks the ice for me - be gentle please, my asbestos suit is full of dust these days ;->:

The first thing that crossed my mind was "if you want to select a destination port based on a destination MAC you are talking about a switch/bridge". You bring up the issue of "a huge number of virtual NICs if you wanted arbitrary guests" which is a real one[2].

Let's take the case of a small number of guests; a bridge of course would solve the problem with the copy-avoidance, with the caveat being:
- you now have N bridges and their respective tables for N domains, i.e one on each domain
- N netdevices on each domain as well (of course you could say that is not very different resource-wise from N queues instead).

If i got this right, still not answering the netif_stop question posed: the problem you are also trying to resolve now is to get rid of N netdevices on each guest for a usability reason; i.e have one netdevice, move the bridging/switching functionality/tables into the driver; replace the ports with queues instead of netdevices. Did i get that right?

If the issue is usability of listing 1024 netdevices, i can think of many ways to resolve it. One way we can resolve the listing is with a simple tag to the netdev struct: i could say "list netdevices for guest 0-10" etc etc.

I am having a little problem differentiating conceptually the case of a guest being different from the host/dom0 if you want to migrate the switching/bridging functions into each guest. So even if this doesn't apply to all domains, it does apply to the dom0.
I like netdevices today (as opposed to queues within netdevices): the stack knows them well (I can add IP addresses, i can point routes to them, I can change MAC addresses, i can bring them administratively down/up, I can add qos rules etc etc). I can also tie netdevices to a CPU and therefore scale that way. I see this as viable at least from the host/dom0 perspective if a netdevice represents a guest.

Sorry for the long email - drained some of my morning coffee. Ok, kill me.

cheers,
jamal

[1] My experience is around qemu/uml/old-openvz - their model is to let the host do the routing/switching between guests or the outside of the box. From your description i would add Xen to that behavior. From Dave's posting, i understand that for many good reasons, any time you move between any one domain to another you are copying. So if you use Xen and you want to go from domainX to domainY you go through dom0, which implies copying domainX->dom0 then dom0->domainY. BTW, one curve that threw me off a little is it seems most of the hardware that provides virtualization also provides point-to-point connections between different domains; i always thought that they all provided a point-to-point to the dom0 equivalent and let the dom0 worry about how things get from domainX to domainY.

[2] Unfortunately that means if i wanted 1024 virtual routers/guest domains i have at least 1024 netdevices on each guest connected to the bridge on the guest. I have a freaking problem listing 72 netdevices today on some device i have.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Tue, 2007-07-03 at 22:20 -0400, jamal wrote:
> On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
> [.. some useful stuff here deleted ..]
>
> > That's why you have to copy into a purpose-built set of memory
> > that is composed of pages that _ONLY_ contain TX packet buffers
> > and nothing else.
> >
> > The cost of going through the switch is too high, and the copies are
> > necessary, so concentrate on allowing me to map the guest ports to the
> > egress queues. Anything else is a waste of discussion time, I've been
> > poring over these issues endlessly for weeks, so if I'm saying doing
> > copies and avoiding the switch is necessary I do in fact mean it. :-)
>
> ok, i get it Dave ;-> Thanks for your patience, that was useful.
> Now that is clear for me, I will go back and look at your original email
> and try to get back on track to what you really asked ;->

To expand on this, there are already "virtual" nic drivers in tree which do the demux based on dst mac and send to the appropriate other guest (iseries_veth.c, and Carsten Otte said the S/390 drivers do too). lguest and DaveM's LDOM make two more.

There is currently no good way to write such a driver. If one recipient is full, you have to drop the packet: if you netif_stop_queue, it means a slow/buggy recipient blocks packets going to other recipients. But dropping packets makes networking suck.

Some hypervisors (eg. Xen) only have a virtual NIC which is point-to-point: this sidesteps the issue, with the risk that you might need a huge number of virtual NICs if you wanted arbitrary guests to talk to each other (Xen doesn't support that, they route/bridge through dom0).

Most hypervisors have a sensible maximum on the number of guests they could talk to, so I'm not too unhappy with a static number of queues. But the dstmac -> queue mapping changes in hypervisor-specific ways, so it really needs to be managed by the driver...

Hope that adds something,
Rusty.
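The dilemma Rusty describes - netif_stop_queue semantics stall every recipient behind one slow guest, while per-port drops lose packets - is easy to model outside the kernel. This is a toy userspace sketch, not any of the drivers mentioned; all names and sizes are made up:

```c
#include <assert.h>
#include <stdbool.h>

#define NPORTS     4
#define RING_SLOTS 8

/* Hypothetical per-port TX ring fill counts, one ring per guest port. */
static int ring_used[NPORTS];

/* Option 1: netif_stop_queue semantics - if *any* recipient's ring is
 * full, the whole device is stopped and no recipient gets traffic. */
static bool can_xmit_stop_queue(void)
{
    for (int i = 0; i < NPORTS; i++)
        if (ring_used[i] >= RING_SLOTS)
            return false;   /* one slow guest stalls everyone */
    return true;
}

/* Option 2: per-port drop - only packets to the full recipient are
 * dropped, treated like "loss on the wire"; other ports keep flowing. */
static bool xmit_per_port(int port)
{
    if (ring_used[port] >= RING_SLOTS)
        return false;       /* drop just this packet */
    ring_used[port]++;
    return true;
}
```

Per-queue flow control (one stoppable queue per port, selected before the generic queueing layer) is the third option the thread is circling toward.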
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
[.. some useful stuff here deleted ..]
> That's why you have to copy into a purpose-built set of memory
> that is composed of pages that _ONLY_ contain TX packet buffers
> and nothing else.
>
> The cost of going through the switch is too high, and the copies are
> necessary, so concentrate on allowing me to map the guest ports to the
> egress queues. Anything else is a waste of discussion time, I've been
> poring over these issues endlessly for weeks, so if I'm saying doing
> copies and avoiding the switch is necessary I do in fact mean it. :-)

ok, i get it Dave ;-> Thanks for your patience, that was useful. Now that is clear for me, I will go back and look at your original email and try to get back on track to what you really asked ;->

cheers,
jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]> Date: Tue, 03 Jul 2007 08:42:33 -0400

> (likely not in the case of hypervisor based virtualization like Xen)
> just have their skbs cloned when crossing domains, is that not the
> case?[1]
> Assuming they copy, the balance that needs to be struck now is
> between:

Sigh, I kind of hoped I wouldn't have to give a lesson in hypervisors and virtualized I/O and all the issues contained within, but if you keep pushing the "avoid the copy" idea I guess I am forced to educate. :-)

First, keep in mind that my Linux guest drivers are talking to Solaris control node servers and switches; I cannot control the API for any of this stuff. And I think that's a good thing in fact.

Exporting memory between nodes is _THE_ problem with virtualized I/O in hypervisor based systems. These things should even be able to work between two guests that simply DO NOT trust each other at all.

With that in mind the hypervisor provides a very small shim layer of interface for exporting memory between two nodes. There is a pseudo-pagetable where you export pages, and a set of interfaces, one of which copies to/from inported memory to/from local memory.

If a guest reboots, crashes, or gets stuck, you have to be able to revoke the memory the remote node has inported. When this happens, if the inporting node comes back to life and tries to touch those pages it takes a fault.

Taking a fault is easy if the nodes go through the hypervisor copy interface; they just get a return value back. If, instead, you try to map in those pages or program them into the IOMMU of the PCI controller, you get faults, and extremely difficult to handle faults at that. If the IOMMU takes the exception on a revoked page, your E1000 card resets when it gets the master abort from the PCI controller.
On the CPU side you have to annotate every single kernel access to this memory mapping of inported pages, just like we have to annotate all userspace accesses with exception tables mapping load and store instructions to fixup code, in order to handle the fault correctly.

Next, you don't trust the other end, as we already stated, so you can't export an object in a page that belongs to other objects. For example, if an SKB's data sits in the same page as the plain-text password the user just typed in, you can't export that page. That's why you have to copy into a purpose-built set of memory that is composed of pages that _ONLY_ contain TX packet buffers and nothing else.

The cost of going through the switch is too high, and the copies are necessary, so concentrate on allowing me to map the guest ports to the egress queues. Anything else is a waste of discussion time, I've been poring over these issues endlessly for weeks, so if I'm saying doing copies and avoiding the switch is necessary I do in fact mean it. :-)
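Dave's "pages that _ONLY_ contain TX packet buffers" argument comes down to copying every outgoing frame into a dedicated pool before exporting it, so an untrusted peer can never see unrelated data. A toy userspace model of that copy step - all sizes and names here are made up, not the LDOM interfaces:

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE  4096
#define BUF_SIZE   2048                 /* one TX buffer */
#define POOL_PAGES 4
#define POOL_BUFS  (POOL_PAGES * PAGE_SIZE / BUF_SIZE)

/* The pool pages hold TX buffers and nothing else, so exporting them
 * to an untrusted peer can never leak unrelated memory contents. */
static unsigned char pool[POOL_PAGES * PAGE_SIZE];
static int next_buf;

/* Copy a packet out of its original (unexportable) location into the
 * dedicated pool; returns a pointer into the exportable region, or
 * NULL when the pool is exhausted and the caller must drop/stop. */
static unsigned char *tx_copy(const void *pkt, size_t len)
{
    unsigned char *buf;

    if (len > BUF_SIZE || next_buf >= POOL_BUFS)
        return NULL;
    buf = pool + (size_t)next_buf++ * BUF_SIZE;
    memcpy(buf, pkt, len);
    return buf;
}
```

A real driver would recycle buffers as the peer consumes them; this sketch only shows why the copy into a segregated region is unavoidable when pages are exported wholesale.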
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Sat, 2007-30-06 at 13:33 -0700, David Miller wrote:
> It's like twice as fast, since the switch doesn't have to copy
> the packet in, switch it, then the destination guest copies it
> into its address space.
>
> There is approximately one copy for each hop you go over through these
> virtual devices.

Ok - i see what you are getting at, and while it makes more sense to me now, let me continue to be _the_ devil's advocate (sip some espresso before responding or reading): for some reason i always thought that packets going across these things (likely not in the case of hypervisor based virtualization like Xen) just have their skbs cloned when crossing domains, is that not the case?[1]

Assuming they copy, the balance that needs to be struck now is between:
a) copy is expensive
vs
b1) For N guests, N^2 queues in the system vs N queues, and 1 vs N replicated global info.
b2) The architecture challenges to resolve the fact you now have to deal with a mesh (1-1 mapping) instead of a star topology between the guests.

I don't think #b1 is such a big deal; in the old days when i had played with what is now openvz, i was happy to get 1024 virtual routers/guests (each running Zebra/OSPF). I could live with a little more wasted memory if the copy is reduced. I think sub-consciously i am questioning #b2. Do you really need that sacrifice just so that you can avoid one extra copy between two guests? If i was running virtual routers or servers i think the majority of traffic (by far) would be between a domain and the outside of the box, not between any two domains within the same box.

cheers,
jamal

[1] But then if this is true, i can think of a simple way to attack the other domains by inserting a kernel module into a domain that reduced the refcount of each received skb to 0. I would be surprised if the openvz type approach hasn't thought this through.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]> Date: Sat, 30 Jun 2007 10:52:44 -0400

> On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:
>
> > Awesome, but let's concentrate on the client since I can actually
> > implement and test anything we come up with :-)
>
> Ok, you need to clear one premise for me then ;->
> You said the model is for the guest/client to have a port to the
> host and one to each guest; i think this is the confusing part for me
> (and may have led to the switch discussion) because i have not seen this
> model used before. What i have seen before is that the host side
> connects the different guests. In such a scenario, on the guest is a
> single port that connects to the host - the host worries (let's forget
> the switch/bridge for a sec) about how to get packets from guestX to
> guestY pending consultation of access control details.
> What is the advantage of direct domain-domain connection? Is it
> scalable?

It's like twice as fast, since the switch doesn't have to copy the packet in, switch it, then the destination guest copies it into its address space.

There is approximately one copy for each hop you go over through these virtual devices.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:
> Awesome, but let's concentrate on the client since I can actually
> implement and test anything we come up with :-)

Ok, you need to clear one premise for me then ;->

You said the model is for the guest/client to have a port to the host and one to each guest; i think this is the confusing part for me (and may have led to the switch discussion) because i have not seen this model used before. What i have seen before is that the host side connects the different guests. In such a scenario, on the guest is a single port that connects to the host - the host worries (let's forget the switch/bridge for a sec) about how to get packets from guestX to guestY pending consultation of access control details. What is the advantage of direct domain-domain connection? Is it scalable?

cheers,
jamal
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> It would be great if we could finally get a working e1000
> multiqueue patch so work in this area can actually be tested.

I'm actively working on this right now. I'm on vacation next week, but hopefully I can get something working before I leave OLS and post it.

-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
David Miller wrote:
> Now I get to pose a problem for everyone, prove to me how useful
> this new code is by showing me how it can be used to solve a
> recurring problem in virtualized network drivers of which I've
> had to code one up recently, see my most recent blog entry at:
>
> http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html
>
> Anyways the gist of the issue is (and this happens for Sun LDOMS
> networking, lguest, IBM iSeries, etc.) that we have a single
> virtualized network device. There is a "port" to the control
> node (which switches packets to the real network for the guest)
> and one "port" to each of the other guests.
>
> Each guest gets a unique MAC address. There is a queue per-port
> that can fill up.
>
> What all the drivers like this do right now is stop the queue if
> any of the per-port queues fill up, and that's what my sunvnet
> driver does right now as well. We can thus only wake up the
> queue when all of the ports have some space.
>
> The ports (and thus the queues) are selected by destination
> MAC address. Each port has a remote MAC address; if there
> is an exact match with a port's remote MAC we'd use that port
> and thus that port's queue. If there is no exact match
> (some other node on the real network, broadcast, multicast,
> etc.) we want to use the control node's port and port queue.
>
> So the problem to solve is to make a way for drivers to do the queue
> selection before the generic queueing layer starts to try and push
> things to the driver. Perhaps a classifier in the driver or similar.

That sounds like the only reasonable possibility if you really do want to use queues. Another possibility would be to not use a queue, make the whole thing unreliable, and treat full rx rings of the guests as "loss on the wire". Not sure if that makes any sense.
I was thinking about adding a way for (multiqueue) drivers to use other default qdiscs than pfifo_fast so they can default to a multiband prio or something else that makes sense for them .. maybe a dev->qdisc_setup hook that is invoked from dev_activate. They would need to be able to add a default classifier for this to have any effect (the grand plan is to get rid of the horrible wme scheduler). Specialized classifiers like your dst MAC classifier and maybe even WME should then probably be built into the driver and not register with the API, so they don't become globally visible.

> The solution to this problem generalizes to the other facility
> we want now, hashing the transmit queue by smp_processor_id()
> or similar. With that in place we can look at doing the TX locking
> per-queue too as is hinted at by the comments above the per-queue
> structure in the current net-2.6.23 tree.

It would be great if we could finally get a working e1000 multiqueue patch so work in this area can actually be tested.
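The smp_processor_id() hashing mentioned in the quote above amounts to nothing more than mapping the sending CPU onto a transmit queue index, so that different CPUs usually land on different queues (and, eventually, different per-queue TX locks). A trivial userspace illustration, with an invented queue count:

```c
#include <assert.h>

#define NUM_TX_QUEUES 4   /* hypothetical number of hardware TX queues */

/* Hash the sending CPU onto a TX queue; a trivial modulo suffices when
 * queue counts are small, keeping concurrent senders on separate queues. */
static unsigned int tx_queue_for_cpu(unsigned int cpu)
{
    return cpu % NUM_TX_QUEUES;
}
```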
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> "DM" == David Miller <[EMAIL PROTECTED]> writes:

DM> And some people still use hubs, believe it or not.

Hubs are 100Mbps at most. You could of course make a flooding Gbps switch, but it would be rather silly. If you care about multicast performance, you get a switch with IGMP snooping.

/Benny
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]> Date: Fri, 29 Jun 2007 21:30:53 -0400

> On Fri, 2007-29-06 at 14:31 -0700, David Miller wrote:
> > Maybe for the control node switch, yes, but not for the guest network
> > devices.
>
> And that is precisely what i was talking about - and i am sure that's how
> the discussion with Patrick was.

Awesome, but let's concentrate on the client since I can actually implement and test anything we come up with :-)
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 14:31 -0700, David Miller wrote:
> This conversation begins to go into a pointless direction already, as
> I feared it would.
>
> Nobody is going to configure bridges, classification, tc, and all of
> this other crap just for a simple virtualized guest networking device.
>
> It's a confined and well defined case that doesn't need any of that.
> You've got to be fucking kidding me if you think I'm going to go
> through the bridging code and all of that layering instead of my
> hash demux on transmit which is 4 or 5 lines of C code at best.
>
> Such a suggestion is beyond stupid.

Ok, calm down - will you please? If you are soliciting opinions, then you should be expecting all sorts of answers, otherwise why bother posting. If you think you are misunderstood, just clarify. Otherwise you are being totally unreasonable.

> Maybe for the control node switch, yes, but not for the guest network
> devices.

And that is precisely what i was talking about - and i am sure that's how the discussion with Patrick was.

cheers,
jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: Ben Greear <[EMAIL PROTECTED]> Date: Fri, 29 Jun 2007 08:33:06 -0700

> Patrick McHardy wrote:
> > Right, but the current bridging code always uses promiscuous mode
> > and it's nice to avoid that if possible. Looking at the code, it
> > should be easy to avoid though by disabling learning (and thus
> > promiscuous mode) and adding unicast filters for all static fdb entries.
>
> I am curious about why people are so hot to do away with promisc mode.
> It seems to me that in a modern switched environment, there should only
> very rarely be unicast packets received on an interface that does not
> want to receive them.
>
> Could someone give a quick example of when I am wrong and promisc mode
> would allow a NIC to receive a significant number of packets not really
> destined for it?

Your neighbour on the switch is being pummeled with multicast traffic, and now you get to see it all too.

Switches don't obviate the cost of promiscuous mode; you keep wanting to discuss this and think it doesn't matter, but it does.

And some people still use hubs, believe it or not.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
This conversation begins to go into a pointless direction already, as I feared it would.

Nobody is going to configure bridges, classification, tc, and all of this other crap just for a simple virtualized guest networking device.

It's a confined and well defined case that doesn't need any of that. You've got to be fucking kidding me if you think I'm going to go through the bridging code and all of that layering instead of my hash demux on transmit which is 4 or 5 lines of C code at best.

Such a suggestion is beyond stupid.

Maybe for the control node switch, yes, but not for the guest network devices.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> Ben Greear wrote:
> > Could someone give a quick example of when I am wrong and promisc mode
> > would allow a NIC to receive a significant number of packets not really
> > destined for it?
>
> In a switched environment it won't have a big effect, I agree. It might
> help avoid receiving unwanted multicast traffic, which could be more
> significant than unicast. Anyways, why be wasteful when it can be
> avoided .. :)

Ok, I had forgotten about multicast, thanks for the reminder!

--
Ben Greear <[EMAIL PROTECTED]>
Candela Technologies Inc http://www.candelatech.com
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Ben Greear wrote:
> Patrick McHardy wrote:
>
>> Right, but the current bridging code always uses promiscuous mode
>> and it's nice to avoid that if possible. Looking at the code, it
>> should be easy to avoid though by disabling learning (and thus
>> promiscuous mode) and adding unicast filters for all static fdb entries.
>
> I am curious about why people are so hot to do away with promisc mode.
> It seems to me that in a modern switched environment, there should only
> very rarely be unicast packets received on an interface that does not
> want to receive them.

I don't know if that really was Dave's reason to handle it in a driver.

> Could someone give a quick example of when I am wrong and promisc mode
> would allow a NIC to receive a significant number of packets not really
> destined for it?

In a switched environment it won't have a big effect, I agree. It might help avoid receiving unwanted multicast traffic, which could be more significant than unicast. Anyways, why be wasteful when it can be avoided .. :)
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> Right, but the current bridging code always uses promiscuous mode
> and it's nice to avoid that if possible. Looking at the code, it
> should be easy to avoid though by disabling learning (and thus
> promiscuous mode) and adding unicast filters for all static fdb entries.

I am curious about why people are so hot to do away with promisc mode. It seems to me that in a modern switched environment, there should only very rarely be unicast packets received on an interface that does not want to receive them.

Could someone give a quick example of when I am wrong and promisc mode would allow a NIC to receive a significant number of packets not really destined for it?

Thanks,
Ben

--
Ben Greear <[EMAIL PROTECTED]>
Candela Technologies Inc http://www.candelatech.com
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 15:08 +0200, Patrick McHardy wrote:
> jamal wrote:
> > On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:
>
> Right, but the current bridging code always uses promiscuous mode
> and it's nice to avoid that if possible.
> Looking at the code, it
> should be easy to avoid though by disabling learning (and thus
> promiscuous mode) and adding unicast filters for all static fdb entries.

Yes, that would do it for static provisioning (I suspect that would work today unless bridging has no knobs to turn off going into promisc). But you could even allow for learning and just have extra filters in tc before bridging disallowing things.

> Have a look at my secondary unicast address patches in case you didn't
> notice them before (there's also a driver example for e1000 on netdev):
>
> http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7
>
> You still need to do filtering in software, but you can have the NIC
> pre-filter in case it supports it, otherwise it goes to promiscuous mode.

Ok, I will look at them when i get back. Sorry - haven't caught up on netdev.

cheers,
jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
jamal wrote:
> On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:
>
>> The difference to a real bridge is that all the
>> addresses are completely known in advance, so it doesn't need
>> promiscuous mode for learning.
>
> You mean the per-virtual MAC addresses are known in advance, right?

Yes.

> This is fine. The bridging or otherwise (like L3 etc) is for
> interconnecting once you have the provisioning done. And you could build
> different "broadcast domains" by having multiple bridges.

Right, but the current bridging code always uses promiscuous mode and it's nice to avoid that if possible. Looking at the code, it should be easy to avoid though by disabling learning (and thus promiscuous mode) and adding unicast filters for all static fdb entries.

> To go off on a slight tangent:
> I think you have to look at the two types of NICs separately
> 1) dumb ones where you may have to use the mcast filters in hardware to
> pretend you have a unicast address per virtual device - those will be
> really hard to simulate using a separate netdevice per MAC address.
> Actually your bigger problem on those is tx MAC address selection
> because that is not built into the hardware. I still think even for
> these types something above the netdevice (bridge, L3 routing, tc action
> redirect etc) will do.

Have a look at my secondary unicast address patches in case you didn't notice them before (there's also a driver example for e1000 on netdev):

http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7

You still need to do filtering in software, but you can have the NIC pre-filter in case it supports it, otherwise it goes to promiscuous mode.

> 2) The new NICs being built for virtualization; those allow you to
> explicitly have clean separation of IO where the only thing that is
> shared between virtual devices is the wire and the bus (otherwise
> each has its own registers etc), i.e. the hardware is designed with this
> in mind.
> In such a case, I think a separate netdevice per single MAC
> address - possibly tied to a separate CPU - should work.

Agreed, that could also be useful for non-virtualized use.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:
> I'm guessing that that wouldn't allow doing unicast filtering for
> the guests on the real device without hacking the bridge code for
> this special case.

For ingress (I guess you could say for egress as well): we can do it as well today with tc filtering on the host - it is involved, but is part of provisioning for a guest IMO. A substantial number of ethernet switches (ok, not the $5 ones) do filtering at the same level.

> The difference to a real bridge is that all the
> addresses are completely known in advance, so it doesn't need
> promiscuous mode for learning.

You mean the per-virtual MAC addresses are known in advance, right? This is fine. The bridging or otherwise (like L3 etc) is for interconnecting once you have the provisioning done. And you could build different "broadcast domains" by having multiple bridges.

To go off on a slight tangent: I think you have to look at the two types of NICs separately
1) dumb ones where you may have to use the mcast filters in hardware to pretend you have a unicast address per virtual device - those will be really hard to simulate using a separate netdevice per MAC address. Actually your bigger problem on those is tx MAC address selection because that is not built into the hardware. I still think even for these types something above the netdevice (bridge, L3 routing, tc action redirect etc) will do.
2) The new NICs being built for virtualization; those allow you to explicitly have clean separation of IO where the only thing that is shared between virtual devices is the wire and the bus (otherwise each has its own registers etc), i.e. the hardware is designed with this in mind. In such a case, I think a separate netdevice per single MAC address - possibly tied to a separate CPU - should work.
cheers, jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
jamal wrote:
> On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:
>
>> Each guest gets a unique MAC address. There is a queue per-port
>> that can fill up.
>>
>> What all the drivers like this do right now is stop the queue if
>> any of the per-port queues fill up, and that's what my sunvnet
>> driver does right now as well. We can only thus wake up the
>> queue when all of the ports have some space.
>
> Is a netdevice really the correct construct for the host side?
> Sounds to me a layer above the netdevice is the way to go. A bridge for
> example, or L3 routing, or even simple tc classify/redirection etc.
> I haven't used what has become openvz these days in many years (or played
> with Eric's approach), but if I recall correctly - it used to have a
> single netdevice per guest on the host. That's close to what a basic
> qemu/UML has today. In such a case it is something above netdevices
> which does the guest selection.

I'm guessing that that wouldn't allow doing unicast filtering for the guests on the real device without hacking the bridge code for this special case. The difference to a real bridge is that all the addresses are completely known in advance, so it doesn't need promiscuous mode for learning.
Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
I've changed the topic for you, friend - otherwise most people won't follow (as you've said a few times yourself ;->).

On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:
> Now I get to pose a problem for everyone, prove to me how useful
> this new code is by showing me how it can be used to solve a
> recurring problem in virtualized network drivers of which I've
> had to code one up recently, see my most recent blog entry at:
>
> http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html

nice.

> Anyways the gist of the issue is (and this happens for Sun LDOMS
> networking, lguest, IBM iSeries, etc.) that we have a single
> virtualized network device. There is a "port" to the control
> node (which switches packets to the real network for the guest)
> and one "port" to each of the other guests.
>
> Each guest gets a unique MAC address. There is a queue per-port
> that can fill up.
>
> What all the drivers like this do right now is stop the queue if
> any of the per-port queues fill up, and that's what my sunvnet
> driver does right now as well. We can only thus wake up the
> queue when all of the ports have some space.

Is a netdevice really the correct construct for the host side? Sounds to me a layer above the netdevice is the way to go. A bridge for example, or L3 routing, or even simple tc classify/redirection etc. I haven't used what has become openvz these days in many years (or played with Eric's approach), but if I recall correctly - it used to have a single netdevice per guest on the host. That's close to what a basic qemu/UML has today. In such a case it is something above netdevices which does the guest selection.

> The ports (and thus the queues) are selected by destination
> MAC address. Each port has a remote MAC address, if there
> is an exact match with a port's remote MAC we'd use that port
> and thus that port's queue. If there is no exact match
> (some other node on the real network, broadcast, multicast,
> etc.)
> we want to use the control node's port and port queue.

Ok, Dave, isn't that what a bridge does? ;-> You'd need filtering to go with it (for example to restrict guest0 from getting certain broadcasts etc) - but we already have that.

> So the problem to solve is to make a way for drivers to do the queue
> selection before the generic queueing layer starts to try and push
> things to the driver. Perhaps a classifier in the driver or similar.
>
> The solution to this problem generalizes to the other facility
> we want now, hashing the transmit queue by smp_processor_id()
> or similar. With that in place we can look at doing the TX locking
> per-queue too as is hinted at by the comments above the per-queue
> structure in the current net-2.6.23 tree.

Major surgery will be needed on the tx path if you want to hash the tx queue to a processor id. Our unit construct (today, net-2.6.23) that can be tied to a cpu is a netdevice. OTOH, if you used a netdevice it should work as is. But I am possibly missing something in your comments. What do you have in mind?

> My current work-in-progress sunvnet.c driver is included below so
> we can discuss things concretely with code.
>
> I'm listening. :-)

And you got words above.

cheers, jamal
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> Ok everything is checked into net-2.6.23, thanks everyone.

Dave, thank you for your patience and feedback on this whole process. Patrick and everyone else, thank you for your feedback and assistance. I am looking at the virtualization question you posed, but I need sleep since I just remembered I'm on east coast time here at OLS, and it's 4:30am.

Thanks again, -PJ Waskiewicz
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: Patrick McHardy <[EMAIL PROTECTED]> Date: Thu, 28 Jun 2007 21:24:37 +0200

> Waskiewicz Jr, Peter P wrote:
> >> [...]
> >> The only reasonable thing it can do is not care about
> >> multiqueue and just dequeue as usual. In fact I think it
> >> should be an error to configure multiqueue on a non-root qdisc.
> >
> > Ack. This is a thought process that trips me up from time to time... I
> > see child qdisc, and think that's the last qdisc to dequeue and send to
> > the device, not the first one to dequeue. So please disregard my
> > comments before; I totally agree with you. Great catch here; I really
> > like the prio_classify() cleanup.
>
> Thanks. This updated patch makes configuring a non-root qdisc for
> multiqueue an error.

Ok everything is checked into net-2.6.23, thanks everyone.

Now I get to pose a problem for everyone, prove to me how useful this new code is by showing me how it can be used to solve a recurring problem in virtualized network drivers, of which I've had to code one up recently; see my most recent blog entry at:

http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html

Anyways the gist of the issue is (and this happens for Sun LDOMS networking, lguest, IBM iSeries, etc.) that we have a single virtualized network device. There is a "port" to the control node (which switches packets to the real network for the guest) and one "port" to each of the other guests.

Each guest gets a unique MAC address. There is a queue per-port that can fill up.

What all the drivers like this do right now is stop the queue if any of the per-port queues fill up, and that's what my sunvnet driver does right now as well. We can only thus wake up the queue when all of the ports have some space.

The ports (and thus the queues) are selected by destination MAC address. Each port has a remote MAC address, if there is an exact match with a port's remote MAC we'd use that port and thus that port's queue.
If there is no exact match (some other node on the real network, broadcast, multicast, etc.) we want to use the control node's port and port queue.

So the problem to solve is to make a way for drivers to do the queue selection before the generic queueing layer starts to try and push things to the driver. Perhaps a classifier in the driver or similar.

The solution to this problem generalizes to the other facility we want now, hashing the transmit queue by smp_processor_id() or similar. With that in place we can look at doing the TX locking per-queue too as is hinted at by the comments above the per-queue structure in the current net-2.6.23 tree.

My current work-in-progress sunvnet.c driver is included below so we can discuss things concretely with code.

I'm listening. :-)

sunvnet.h:

#ifndef _SUNVNET_H
#define _SUNVNET_H

#define DESC_NCOOKIES(entry_size)	\
	((entry_size) - sizeof(struct vio_net_desc))

/* length of time before we decide the hardware is borked,
 * and dev->tx_timeout() should be called to fix the problem
 */
#define VNET_TX_TIMEOUT			(5 * HZ)

#define VNET_TX_RING_SIZE		512
#define VNET_TX_WAKEUP_THRESH(dr)	((dr)->pending / 4)

/* VNET packets are sent in buffers with the first 6 bytes skipped
 * so that after the ethernet header the IPv4/IPv6 headers are aligned
 * properly.
 */
#define VNET_PACKET_SKIP		6

struct vnet_tx_entry {
	void			*buf;
	unsigned int		ncookies;
	struct ldc_trans_cookie	cookies[2];
};

struct vnet;
struct vnet_port {
	struct vio_driver_state	vio;

	struct hlist_node	hash;
	u8			raddr[ETH_ALEN];

	struct vnet		*vp;

	struct vnet_tx_entry	tx_bufs[VNET_TX_RING_SIZE];

	struct list_head	list;
};

static inline struct vnet_port *to_vnet_port(struct vio_driver_state *vio)
{
	return container_of(vio, struct vnet_port, vio);
}

#define VNET_PORT_HASH_SIZE	16
#define VNET_PORT_HASH_MASK	(VNET_PORT_HASH_SIZE - 1)

static inline unsigned int vnet_hashfn(u8 *mac)
{
	unsigned int val = mac[4] ^ mac[5];

	return val & (VNET_PORT_HASH_MASK);
}

struct vnet {
	/* Protects port_list and port_hash.  */
	spinlock_t		lock;

	struct net_device	*dev;

	u32			msg_enable;
	struct vio_dev		*vdev;

	struct list_head	port_list;

	struct hlist_head	port_hash[VNET_PORT_HASH_SIZE];
};

#endif /* _SUNVNET_H */

sunvnet.c:

/* sunvnet.c: Sun LDOM Virtual Network Driver.
 *
 * Copyright (C) 2007 David S. Miller <[EMAIL PROTECTED]>
 */

#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#include "sunvnet.h"

#define DRV_MODULE_NAME		"sunvnet"
#define PFX DRV_MODULE_NAME	": "
#define DRV_MODULE_VERSIO
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> Waskiewicz Jr, Peter P wrote:
> >> [...]
> >> The only reasonable thing it can do is not care about multiqueue and
> >> just dequeue as usual. In fact I think it should be an error to
> >> configure multiqueue on a non-root qdisc.
> >
> > Ack. This is a thought process that trips me up from time to time... I
> > see child qdisc, and think that's the last qdisc to dequeue and send
> > to the device, not the first one to dequeue. So please disregard my
> > comments before; I totally agree with you. Great catch here; I really
> > like the prio_classify() cleanup.
>
> Thanks. This updated patch makes configuring a non-root qdisc
> for multiqueue an error.

The patch looks fine to me. Thanks Patrick.

-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote: >>[...] >>The only reasonable thing it can do is not care about >>multiqueue and just dequeue as usual. In fact I think it >>should be an error to configure multiqueue on a non-root qdisc. > > > Ack. This is a thought process that trips me up from time to time...I > see child qdisc, and think that's the last qdisc to dequeue and send to > the device, not the first one to dequeue. So please disregard my > comments before; I totally agree with you. Great catch here; I really > like the prio_classify() cleanup. Thanks. This updated patch makes configuring a non-root qdisc for multiqueue an error. [SCHED] Qdisc changes and sch_rr added for multiqueue Add the new sch_rr qdisc for multiqueue network device support. Allow sch_prio and sch_rr to be compiled with or without multiqueue hardware support. sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS. This was done since sch_prio and sch_rr only differ in their dequeue routine. Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]> Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]> --- commit 8798d6bf4f3ed4f5b162184282adc714ef5b69b1 tree b3ba373a0534b905b34abc520eba6adecdcc3ca5 parent 48e334930c5fbb64a821733a7056e2800057c128 author Peter P Waskiewicz Jr <[EMAIL PROTECTED]> Thu, 28 Jun 2007 18:47:49 +0200 committer Patrick McHardy <[EMAIL PROTECTED]> Thu, 28 Jun 2007 21:23:44 +0200 include/linux/pkt_sched.h |9 +++ net/sched/Kconfig | 11 net/sched/sch_prio.c | 129 - 3 files changed, 134 insertions(+), 15 deletions(-) diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h index d10f353..268c515 100644 --- a/include/linux/pkt_sched.h +++ b/include/linux/pkt_sched.h @@ -101,6 +101,15 @@ struct tc_prio_qopt __u8 priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */ }; +enum +{ + TCA_PRIO_UNSPEC, + TCA_PRIO_MQ, + __TCA_PRIO_MAX +}; + +#define TCA_PRIO_MAX(__TCA_PRIO_MAX - 1) + /* TBF section */ struct tc_tbf_qopt diff --git a/net/sched/Kconfig 
b/net/sched/Kconfig index 475df84..f321794 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -111,6 +111,17 @@ config NET_SCH_PRIO To compile this code as a module, choose M here: the module will be called sch_prio. +config NET_SCH_RR + tristate "Multi Band Round Robin Queuing (RR)" + select NET_SCH_PRIO + ---help--- + Say Y here if you want to use an n-band round robin packet + scheduler. + + The module uses sch_prio for its framework and is aliased as + sch_rr, so it will load sch_prio, although it is referred + to using sch_rr. + config NET_SCH_RED tristate "Random Early Detection (RED)" ---help--- diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index 6d7542c..4045220 100644 --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -40,9 +40,11 @@ struct prio_sched_data { int bands; + int curband; /* for round-robin */ struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; + int mq; }; @@ -70,14 +72,17 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) #endif if (TC_H_MAJ(band)) band = 0; - return q->queues[q->prio2band[band&TC_PRIO_MAX]]; + band = q->prio2band[band&TC_PRIO_MAX]; + goto out; } band = res.classid; } band = TC_H_MIN(band) - 1; if (band >= q->bands) - return q->queues[q->prio2band[0]]; - + band = q->prio2band[0]; +out: + if (q->mq) + skb_set_queue_mapping(skb, band); return q->queues[band]; } @@ -144,17 +149,58 @@ prio_dequeue(struct Qdisc* sch) struct Qdisc *qdisc; for (prio = 0; prio < q->bands; prio++) { - qdisc = q->queues[prio]; - skb = qdisc->dequeue(qdisc); - if (skb) { - sch->q.qlen--; - return skb; + /* Check if the target subqueue is available before + * pulling an skb. This way we avoid excessive requeues + * for slower queues. + */ + if (!netif_subqueue_stopped(sch->dev, (q->mq ? 
prio : 0))) { + qdisc = q->queues[prio]; + skb = qdisc->dequeue(qdisc); + if (skb) { +sch->q.qlen--; +return skb; + } } } return NULL; } +static struct sk_buff *rr_dequeue(struct Qdisc* sch) +{ + struct sk_buff *skb; + struct prio_sched_data *q = qdisc_priv(sch); + struct Qdisc *qdisc; + int bandcount; + + /* Only take one pass through the queues. If nothing is available, + * return nothing. + */ + for (bandcount = 0; bandcount < q->bands; bandcount++) { + /* Check if the target subqueue is available before + * pulling an skb. This way we avoid excessive requeues + * for slower queues. If the queue is stopped, try the + * next queue. + */ + if (!netif_subqueue_stopped(sch->dev, + (q->mq ? q->curband : 0))) { + qdisc = q->queues[q->curband]; + skb = qdisc->dequeue(qdisc); + if (skb) { +sch->q.qlen--; +q->curband++; +if (q->curband >= q->bands) + q->curband = 0; +return skb; + } + } + q->curband++; + if (q->c
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> Absolutely not. First of all, it's perfectly valid to use
> non-multiqueue qdiscs on multiqueue devices. Secondly, it's
> only the root qdisc that has to know about multiqueue since
> that one controls the child qdiscs.
>
> Think about it, it makes absolutely no sense to have the
> child qdisc even know about multiqueue. Changing my example
> to have a multiqueue qdisc as child:
>
> root qdisc: 2 band prio multiqueue
> child qdisc of band 0: 2 band prio multiqueue
>
> When the root qdisc decides to dequeue band 0, it checks
> whether subqueue 0 is active and dequeues the child qdisc. If
> the child qdisc is indeed another multiqueue qdisc as you
> suggest, it might decide to dequeue its own band 1 and check
> that subqueue state. So where should the packet finally end
> up? And what if one of the two subqueues is inactive?
>
> The only reasonable thing it can do is not care about
> multiqueue and just dequeue as usual. In fact I think it
> should be an error to configure multiqueue on a non-root qdisc.

Ack. This is a thought process that trips me up from time to time... I see child qdisc, and think that's the last qdisc to dequeue and send to the device, not the first one to dequeue. So please disregard my comments before; I totally agree with you. Great catch here; I really like the prio_classify() cleanup.

As always, many thanks for your feedback and help Patrick.

-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>> PJ Waskiewicz wrote:
>>
>>> +#ifdef CONFIG_NET_SCH_MULTIQUEUE
>>> +	if (q->mq)
>>> +		skb->queue_mapping =
>>> +			q->prio2band[band&TC_PRIO_MAX];
>>> +	else
>>> +		skb->queue_mapping = 0;
>>> +#endif
>>
>> Setting it to zero here is wrong, consider:
>>
>> root qdisc: prio multiqueue
>> child qdisc: prio non-multiqueue
>>
>> The top-level qdisc will set it, the child qdisc will unset it again.
>> When multiqueue is inactive it should not touch it.
>>
>> I'll fix that as well.
>
> But the child can't assume the device is multiqueue if the child is
> non-multiqueue.

The child doesn't have to assume anything.

> This is the same issue with IP forwarding, where if you
> forward through a multiqueue device to a non-mq device, you don't know
> if the destination device is multiqueue.

No it's not. I'm talking about nested qdiscs, which are all on a single device.

> So the last qdisc to actually
> dequeue into a device should have control over what the queue mapping is.

Fully agreed. And that is always the top-level qdisc.

> If a user had a multiqueue qdisc as root, and configured a child qdisc as
> non-mq, that is a configuration error if the underlying device is indeed
> multiqueue IMO.

Absolutely not. First of all, it's perfectly valid to use non-multiqueue qdiscs on multiqueue devices. Secondly, it's only the root qdisc that has to know about multiqueue since that one controls the child qdiscs.

Think about it, it makes absolutely no sense to have the child qdisc even know about multiqueue. Changing my example to have a multiqueue qdisc as child:

root qdisc: 2 band prio multiqueue
child qdisc of band 0: 2 band prio multiqueue

When the root qdisc decides to dequeue band 0, it checks whether subqueue 0 is active and dequeues the child qdisc. If the child qdisc is indeed another multiqueue qdisc as you suggest, it might decide to dequeue its own band 1 and check that subqueue state. So where should the packet finally end up? And what if one of the two subqueues is inactive?

The only reasonable thing it can do is not care about multiqueue and just dequeue as usual. In fact I think it should be an error to configure multiqueue on a non-root qdisc.
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> PJ Waskiewicz wrote:
> > +#ifdef CONFIG_NET_SCH_MULTIQUEUE
> > +	if (q->mq)
> > +		skb->queue_mapping =
> > +			q->prio2band[band&TC_PRIO_MAX];
> > +	else
> > +		skb->queue_mapping = 0;
> > +#endif
>
> Setting it to zero here is wrong, consider:
>
> root qdisc: prio multiqueue
> child qdisc: prio non-multiqueue
>
> The top-level qdisc will set it, the child qdisc will unset it again.
> When multiqueue is inactive it should not touch it.
>
> I'll fix that as well.

But the child can't assume the device is multiqueue if the child is non-multiqueue. This is the same issue with IP forwarding, where if you forward through a multiqueue device to a non-mq device, you don't know if the destination device is multiqueue. So the last qdisc to actually dequeue into a device should have control over what the queue mapping is. If a user had a multiqueue qdisc as root, and configured a child qdisc as non-mq, that is a configuration error if the underlying device is indeed multiqueue IMO.

-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote:
> +#ifdef CONFIG_NET_SCH_MULTIQUEUE
> +	if (q->mq)
> +		skb->queue_mapping =
> +			q->prio2band[band&TC_PRIO_MAX];
> +	else
> +		skb->queue_mapping = 0;
> +#endif

Setting it to zero here is wrong, consider:

root qdisc: prio multiqueue
child qdisc: prio non-multiqueue

The top-level qdisc will set it, the child qdisc will unset it again. When multiqueue is inactive it should not touch it.

I'll fix that as well.
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
> Thanks for fixing; however, the current sch_prio doesn't unregister the
> qdisc if register_qdisc() on prio fails, or does that happen implicitly
> because the module will probably unload?

If it failed, there's nothing to unregister. But when you register two qdiscs and the second one fails you have to unregister the first one. Your way works too, but it might fail registering the second one without providing any feedback to the user.
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> It's not error handling. You do:
>
> err = register qdisc 1
> if (err)
>     return err;
> err = register qdisc 2
> if (err)
>     unregister qdisc 1
> return err
>
> anyways, I already fixed that and cleaned up prio_classify
> the way I suggested. Will send shortly.

Thanks for fixing; however, the current sch_prio doesn't unregister the qdisc if register_qdisc() on prio fails, or does that happen implicitly because the module will probably unload?

Thanks again Patrick, -PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> PJ Waskiewicz wrote:
>
>> +
>>  static int __init prio_module_init(void)
>>  {
>> -	return register_qdisc(&prio_qdisc_ops);
>> +	int err;
>> +	err = register_qdisc(&prio_qdisc_ops);
>> +	if (!err)
>> +		err = register_qdisc(&rr_qdisc_ops);
>> +	return err;
>>  }
>
> That's still broken. I'll fix this and some minor cleanliness issues
> myself so you don't have to go through another resend.

Here it is, fixed error handling and cleaned up prio_classify. There are still too many ifdefs in there for my taste, and I'm wondering whether the NET_SCH_MULTIQUEUE option should really be NETDEVICES_MULTIQUEUE. That would allow moving the #ifdefs to netif_subqueue_stopped and keep the qdiscs clean. I can send a patch for that on top of your patches.

[SCHED] Qdisc changes and sch_rr added for multiqueue

Add the new sch_rr qdisc for multiqueue network device support. Allow sch_prio and sch_rr to be compiled with or without multiqueue hardware support. sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS. This was done since sch_prio and sch_rr only differ in their dequeue routine.
Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]> Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]> --- commit ab2c102d05981e983b494e9211e1c987ae0b80ee tree edb9457aeed6eb4ac8adef517b783b3361ba8830 parent 1937d8b868253885584dd49fe7ba804eda6b525f author Peter P Waskiewicz Jr <[EMAIL PROTECTED]> Thu, 28 Jun 2007 18:47:49 +0200 committer Patrick McHardy <[EMAIL PROTECTED]> Thu, 28 Jun 2007 18:47:49 +0200 include/linux/pkt_sched.h |9 +++ net/sched/Kconfig | 23 +++ net/sched/sch_prio.c | 142 - 3 files changed, 159 insertions(+), 15 deletions(-) diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h index d10f353..268c515 100644 --- a/include/linux/pkt_sched.h +++ b/include/linux/pkt_sched.h @@ -101,6 +101,15 @@ struct tc_prio_qopt __u8priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */ }; +enum +{ + TCA_PRIO_UNSPEC, + TCA_PRIO_MQ, + __TCA_PRIO_MAX +}; + +#define TCA_PRIO_MAX(__TCA_PRIO_MAX - 1) + /* TBF section */ struct tc_tbf_qopt diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 475df84..65ee9e7 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -111,6 +111,29 @@ config NET_SCH_PRIO To compile this code as a module, choose M here: the module will be called sch_prio. +config NET_SCH_RR + tristate "Multi Band Round Robin Queuing (RR)" + select NET_SCH_PRIO + ---help--- + Say Y here if you want to use an n-band round robin packet + scheduler. + + The module uses sch_prio for its framework and is aliased as + sch_rr, so it will load sch_prio, although it is referred + to using sch_rr. + +config NET_SCH_MULTIQUEUE + bool "Multiple hardware queue support" + ---help--- + Say Y here if you want to allow supported qdiscs to assign flows to + multiple hardware queues on an ethernet device. This will + still work on devices with 1 queue. + + Current qdiscs supporting this feature are NET_SCH_PRIO and + NET_SCH_RR. + + Most people will say N here. 
+ config NET_SCH_RED tristate "Random Early Detection (RED)" ---help--- diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index 6d7542c..b35b9c5 100644 --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -40,9 +40,13 @@ struct prio_sched_data { int bands; + int curband; /* for round-robin */ struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; +#ifdef CONFIG_NET_SCH_MULTIQUEUE + unsigned char mq; +#endif }; @@ -70,14 +74,21 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) #endif if (TC_H_MAJ(band)) band = 0; - return q->queues[q->prio2band[band&TC_PRIO_MAX]]; + band = q->prio2band[band&TC_PRIO_MAX]; + goto out; } band = res.classid; } band = TC_H_MIN(band) - 1; if (band >= q->bands) - return q->queues[q->prio2band[0]]; - + band = q->prio2band[0]; +out: +#ifdef CONFIG_NET_SCH_MULTIQUEUE + if (q->mq) + skb->queue_mapping = band; + else + skb->queue_mapping = 0; +#endif return q->queues[band]; } @@ -144,17 +155,65 @@ prio_dequeue(struct Qdisc* sch) struct Qdisc *qdisc; for (prio = 0; prio < q->bands; prio++) { - qdisc = q->queues[prio]; - skb = qdisc->dequeue(qdisc); - if (skb) { - sch->q.qlen--; - return skb; +#ifdef CONFIG_NET_SCH_MULTIQUEUE +
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>> PJ Waskiewicz wrote:
>>> +
>>>  static int __init prio_module_init(void) {
>>> -	return register_qdisc(&prio_qdisc_ops);
>>> +	int err;
>>> +	err = register_qdisc(&prio_qdisc_ops);
>>> +	if (!err)
>>> +		err = register_qdisc(&rr_qdisc_ops);
>>> +	return err;
>>>  }
>>
>> That's still broken. I'll fix this and some minor cleanness
>> issues myself so you don't have to go through another resend.
>
> Auke and I just looked at register_qdisc() and this code. Maybe we
> haven't had enough coffee yet, but register_qdisc() returns 0 on
> success. So if register_qdisc(&prio_qdisc_ops) succeeds, then
> rr_qdisc_ops gets registered. I'm curious what is broken with this.

It's not error handling. You do:

	err = register qdisc 1
	if (err)
		return err;
	err = register qdisc 2
	if (err)
		unregister qdisc 1
	return err

Anyways, I already fixed that and cleaned up prio_classify the way I suggested. Will send shortly.
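The rollback pattern Patrick describes can be sketched in plain C. The stubs below simulate register_qdisc()/unregister_qdisc() (the real kernel functions take a struct Qdisc_ops *; these names and the failure toggle are hypothetical): the point is that when the second registration fails, the first must be undone before returning the error.

```c
#include <assert.h>

/* Stubs standing in for register_qdisc()/unregister_qdisc().
 * fail_second simulates the second registration failing. */
static int fail_second;
static int registered;

static int register_qdisc_stub(const char *id)
{
	if (fail_second && registered == 1)
		return -1;          /* simulated registration failure */
	registered++;
	return 0;
}

static void unregister_qdisc_stub(const char *id)
{
	registered--;
}

/* The shape Patrick asks for: roll back the prio registration if the
 * rr registration fails, so a failed module init leaves no qdisc behind. */
static int prio_module_init_sketch(void)
{
	int err = register_qdisc_stub("prio");
	if (err)
		return err;
	err = register_qdisc_stub("rr");
	if (err)
		unregister_qdisc_stub("prio");
	return err;
}
```

This is why "register the second only if the first succeeded" is not enough: the original code returns the error but leaves prio registered.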
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> PJ Waskiewicz wrote:
> > +
> >  static int __init prio_module_init(void) {
> > -	return register_qdisc(&prio_qdisc_ops);
> > +	int err;
> > +	err = register_qdisc(&prio_qdisc_ops);
> > +	if (!err)
> > +		err = register_qdisc(&rr_qdisc_ops);
> > +	return err;
> >  }
>
> Thats still broken. I'll fix this and some minor cleanness
> issues myself so you don't have to go through another resend.

Auke and I just looked at register_qdisc() and this code. Maybe we haven't had enough coffee yet, but register_qdisc() returns 0 on success. So if register_qdisc(&prio_qdisc_ops) succeeds, then rr_qdisc_ops gets registered. I'm curious what is broken with this.

Thanks Patrick,
-PJ Waskiewicz
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote:
> +
>  static int __init prio_module_init(void) {
> -	return register_qdisc(&prio_qdisc_ops);
> +	int err;
> +	err = register_qdisc(&prio_qdisc_ops);
> +	if (!err)
> +		err = register_qdisc(&rr_qdisc_ops);
> +	return err;
>  }

That's still broken. I'll fix this and some minor cleanness issues myself so you don't have to go through another resend.
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> Thats not necessary. I just though you could add one exit point:
>
> 	...
> out:
> 	skb->queue_mapping = q->mq ? band : 0;
> 	return q->queues[band];
> }
>
> But if that doesn't work don't bother ..

Unfortunately it won't, given how band might be used like this to select the queue:

	return q->queues[q->prio2band[band&TC_PRIO_MAX]];

I'll keep this in mind though, and if it can be done cleanly, I'll submit a patch.

Thanks Patrick,
-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>>> @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
>>>  #endif
>>>  		if (TC_H_MAJ(band))
>>>  			band = 0;
>>> +		if (q->mq)
>>> +			skb->queue_mapping =
>>> +				q->prio2band[band&TC_PRIO_MAX];
>>> +		else
>>> +			skb->queue_mapping = 0;
>>
>> Might look cleaner if you have one central point where queue_mapping
>> is set and the band is returned.
>
> I've taken a stab at this. I can have one return point, but I'll still
> have multiple assignments of skb->queue_mapping due to the different
> branches for which queue to select in the qdisc. I suppose we can do a
> rewrite of prio_classify(), but to me that seems beyond the scope of
> the multiqueue patches themselves. What do you think?

That's not necessary. I just thought you could add one exit point:

	...
out:
	skb->queue_mapping = q->mq ? band : 0;
	return q->queues[band];
}

But if that doesn't work don't bother ..
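Patrick's single-exit-point idea can be sketched with toy structures (hypothetical names, not the kernel code): every classification branch reduces to a band index, then one tail sets queue_mapping and returns the queue, so the mq conditional appears exactly once. This is essentially the shape the committed patch ends up with via goto out.

```c
#include <assert.h>

#define TC_PRIO_MAX 15
#define MAX_BANDS 16

/* Minimal stand-ins for the kernel structures involved. */
struct toy_skb {
	unsigned int queue_mapping;
};

struct toy_prio {
	int mq;                              /* multiqueue enabled? */
	int bands;
	unsigned char prio2band[TC_PRIO_MAX + 1];
	int queues[MAX_BANDS];               /* stands in for struct Qdisc * */
};

/* Every branch computes a band; one exit point sets queue_mapping. */
static int toy_classify(struct toy_prio *q, struct toy_skb *skb, int band)
{
	if (band >= q->bands)
		band = q->prio2band[0];      /* fall back to the default band */
	/* out: the single central point */
	skb->queue_mapping = q->mq ? band : 0;
	return q->queues[band];
}
```

With mq off, queue_mapping is always 0 regardless of the band chosen; with mq on, it tracks the band, which is what maps bands onto hardware queues.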
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> > @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc
> > *sch, int *qerr)
> >  #endif
> >  		if (TC_H_MAJ(band))
> >  			band = 0;
> > +		if (q->mq)
> > +			skb->queue_mapping =
> > +				q->prio2band[band&TC_PRIO_MAX];
> > +		else
> > +			skb->queue_mapping = 0;
>
> Might look cleaner if you have one central point where queue_mapping
> is set and the band is returned.

I've taken a stab at this. I can have one return point, but I'll still have multiple assignments of skb->queue_mapping due to the different branches for which queue to select in the qdisc. I suppose we can do a rewrite of prio_classify(), but to me that seems beyond the scope of the multiqueue patches themselves. What do you think?

Thanks,
-PJ
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> PJ Waskiewicz wrote:
> > +	/* If we're multiqueue, make sure the number of incoming bands
> > +	 * matches the number of queues on the device we're associating with.
> > +	 */
> > +	if (tb[TCA_PRIO_MQ - 1])
> > +		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);
> > +
> > +	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
> > +		return -EINVAL;
>
> A nice thing you could do for the user here is use
> egress_subqueue_count as default when qopt->bands == 0 (and change tc
> prio to accept 0 in case it doesn't).

prio only allows a minimum of 2 bands right now. I see what you're suggesting though; let me think about this. I do like this suggestion.

Thanks,
-PJ Waskiewicz
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>> And RTA_PUT_FLAG. Now that I think of it, does it even make sense to
>> have a prio private flag for this instead of a qdisc global one?
>
> There currently aren't any other qdiscs that are natural fits for
> multiqueue that I can see. I can see the benefit though of having this
> as a global flag in the qdisc API; let me check it out, and if it
> makes sense, I can move it.

Yes, that thought occurred to me as well. Keeping it private seems better.
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> > enum > > { > > - TCA_PRIO_UNPSEC, > > - TCA_PRIO_TEST, > > > You misunderstood me. You can work on top of my compat > attribute patches, but the example code should not have to go > in to apply your patch. Ok. I'll fix my patches. > > diff --git a/net/sched/Kconfig b/net/sched/Kconfig index > > 475df84..7f14fa6 100644 > > --- a/net/sched/Kconfig > > +++ b/net/sched/Kconfig > > @@ -102,8 +102,16 @@ config NET_SCH_ATM > > To compile this code as a module, choose M here: the > > module will be called sch_atm. > > > > +config NET_SCH_BANDS > > +bool "Multi Band Queueing (PRIO and RR)" > > This options seems useless. Its not used *anywhere* except > for dependencies. I was trying to group the multiqueue qdiscs together with this. But I can see just having the multiqueue option for scheduling will cover this. I'll remove this. > > +config NET_SCH_BANDS_MQ > > + bool "Multiple hardware queue support" > > + depends on NET_SCH_BANDS > > > OK, again: > > Introduce NET_SCH_RR. NET_SCH_RR selects NET_SCH_PRIO. > Nothing at all changes for NET_SCH_PRIO itself. Additionally > introduce a boolean NET_SCH_MULTIQUEUE. No dependencies at > all. Use NET_SCH_MULTIQUEUE to guard the multiqueue code in > sch_prio.c. > Your current code doesn't even have any ifdefs anymore > though, so this might not be needed at all. > > Additionally you could later introduce E1000_MULTIQUEUE and > have that select NET_SCH_MULTIQUEUE. I'll clean this up. Thanks for the persistance. :) > > diff --git a/net/sched/sch_generic.c > b/net/sched/sch_generic.c index > > 9461e8a..203d5c4 100644 > > --- a/net/sched/sch_generic.c > > +++ b/net/sched/sch_generic.c > > @@ -168,7 +168,8 @@ static inline int qdisc_restart(struct > net_device *dev) > > spin_unlock(&dev->queue_lock); > > > > ret = NETDEV_TX_BUSY; > > - if (!netif_queue_stopped(dev)) > > + if (!netif_queue_stopped(dev) && > > + !netif_subqueue_stopped(dev, skb->queue_mapping)) > > /* churn baby churn .. 
*/ > > ret = dev_hard_start_xmit(skb, dev); > > I'll try again - please move this to patch 2/3. I'm sorry; I misread your original comment about this. I'll move the change (although this disappears with Jamal's and KK's qdisc_restart() cleanup). > > diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index > > 40a13e8..8a716f0 100644 > > --- a/net/sched/sch_prio.c > > +++ b/net/sched/sch_prio.c > > @@ -40,9 +40,11 @@ > > struct prio_sched_data > > { > > int bands; > > + int curband; /* for round-robin */ > > struct tcf_proto *filter_list; > > u8 prio2band[TC_PRIO_MAX+1]; > > struct Qdisc *queues[TCQ_PRIO_BANDS]; > > + unsigned char mq; > > }; > > > > > > @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc > > *sch, int *qerr) #endif > > if (TC_H_MAJ(band)) > > band = 0; > > + if (q->mq) > > + skb->queue_mapping = > > + > q->prio2band[band&TC_PRIO_MAX]; > > + else > > + skb->queue_mapping = 0; > > > Might look cleaner if you have one central point where > queue_mapping is set and the band is returned. I'll see how easy it'll be to condense this; because the queue being selected in the qdisc can be different based on a few different things, I'm not sure how easy it'll be to assign this in one spot. I'll play around with it and see what I can come up with. > > + /* If we're multiqueue, make sure the number of incoming bands > > +* matches the number of queues on the device we're > associating with. > > +*/ > > + if (tb[TCA_PRIO_MQ - 1]) > > + q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]); > > > If you're using it as a flag, please use RTA_GET_FLAG(), > otherwise RTA_GET_U8. Will do. Thanks. 
> > + if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count)) > > + return -EINVAL; > > > > sch_tree_lock(sch); > > q->bands = qopt->bands; > > @@ -280,7 +342,7 @@ static int prio_dump(struct Qdisc *sch, > struct sk_buff *skb) > > memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1); > > > > nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt); > > - RTA_PUT_U32(skb, TCA_PRIO_TEST, 321); > > + RTA_PUT_U8(skb, TCA_PRIO_MQ, q->mq); > > > And RTA_PUT_FLAG. Now that I think of it, does it even makes > sense to have a prio private flag for this instead of a qdisc > global one? There currently aren't any other qdiscs that are natural fits for multiqueue that I can see. I can see the benefit though of having this as a global flag in the qdisc API; let me check it out, and if it makes sense, I can move it. > > static int __init prio_module_init(void) { > > - return register_qdisc(&prio_qdisc_ops); > > + register_qdisc(&prio_qdisc_ops); > > + register_qdisc(&rr_qdisc_ops); >
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote:
> +	/* If we're multiqueue, make sure the number of incoming bands
> +	 * matches the number of queues on the device we're associating with.
> +	 */
> +	if (tb[TCA_PRIO_MQ - 1])
> +		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);
> +
> +	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
> +		return -EINVAL;

A nice thing you could do for the user here is use egress_subqueue_count as a default when qopt->bands == 0 (and change tc prio to accept 0 in case it doesn't).
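The defaulting behaviour Patrick suggests might look like this toy sketch (hypothetical names; -22 stands in for -EINVAL): in multiqueue mode, bands == 0 means "match the hardware", and any other value must equal the device's subqueue count.

```c
#include <assert.h>

/* Toy version of the prio_tune() band validation: mq is the multiqueue
 * flag, requested_bands comes from userspace, subqueue_count models
 * sch->dev->egress_subqueue_count. */
static int toy_tune_bands(int mq, int requested_bands,
			  int subqueue_count, int *bands_out)
{
	int bands = requested_bands;

	if (mq && bands == 0)
		bands = subqueue_count;  /* default: match the hardware */
	if (mq && bands != subqueue_count)
		return -22;              /* -EINVAL: mismatch with HW queues */
	*bands_out = bands;
	return 0;
}
```

Without multiqueue the band count passes through unchecked, preserving today's prio behaviour.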
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote: > diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h > index 09808b7..ec3a9a5 100644 > --- a/include/linux/pkt_sched.h > +++ b/include/linux/pkt_sched.h > @@ -103,8 +103,8 @@ struct tc_prio_qopt > > enum > { > - TCA_PRIO_UNPSEC, > - TCA_PRIO_TEST, You misunderstood me. You can work on top of my compat attribute patches, but the example code should not have to go in to apply your patch. > diff --git a/net/sched/Kconfig b/net/sched/Kconfig > index 475df84..7f14fa6 100644 > --- a/net/sched/Kconfig > +++ b/net/sched/Kconfig > @@ -102,8 +102,16 @@ config NET_SCH_ATM > To compile this code as a module, choose M here: the > module will be called sch_atm. > > +config NET_SCH_BANDS > +bool "Multi Band Queueing (PRIO and RR)" This options seems useless. Its not used *anywhere* except for dependencies. > +---help--- > + Say Y here if you want to use n-band multiqueue packet > + schedulers. These include a priority-based scheduler and > +a round-robin scheduler. > + > config NET_SCH_PRIO > tristate "Multi Band Priority Queueing (PRIO)" > + depends on NET_SCH_BANDS And this dependency as well. > ---help--- > Say Y here if you want to use an n-band priority queue packet > scheduler. > @@ -111,6 +119,28 @@ config NET_SCH_PRIO > To compile this code as a module, choose M here: the > module will be called sch_prio. > > +config NET_SCH_RR > + tristate "Multi Band Round Robin Queuing (RR)" > + depends on NET_SCH_BANDS Same here. RR > + select NET_SCH_PRIO > + ---help--- > + Say Y here if you want to use an n-band round robin packet > + scheduler. > + > + The module uses sch_prio for its framework and is aliased as > + sch_rr, so it will load sch_prio, although it is referred > + to using sch_rr. > + > +config NET_SCH_BANDS_MQ > + bool "Multiple hardware queue support" > + depends on NET_SCH_BANDS OK, again: Introduce NET_SCH_RR. NET_SCH_RR selects NET_SCH_PRIO. Nothing at all changes for NET_SCH_PRIO itself. 
Additionally introduce a boolean NET_SCH_MULTIQUEUE. No dependencies at all. Use NET_SCH_MULTIQUEUE to guard the multiqueue code in sch_prio.c. Your current code doesn't even have any ifdefs anymore though, so this might not be needed at all. Additionally you could later introduce E1000_MULTIQUEUE and have that select NET_SCH_MULTIQUEUE. > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > index 9461e8a..203d5c4 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev) > spin_unlock(&dev->queue_lock); > > ret = NETDEV_TX_BUSY; > - if (!netif_queue_stopped(dev)) > + if (!netif_queue_stopped(dev) && > + !netif_subqueue_stopped(dev, skb->queue_mapping)) > /* churn baby churn .. */ > ret = dev_hard_start_xmit(skb, dev); I'll try again - please move this to patch 2/3. > diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c > index 40a13e8..8a716f0 100644 > --- a/net/sched/sch_prio.c > +++ b/net/sched/sch_prio.c > @@ -40,9 +40,11 @@ > struct prio_sched_data > { > int bands; > + int curband; /* for round-robin */ > struct tcf_proto *filter_list; > u8 prio2band[TC_PRIO_MAX+1]; > struct Qdisc *queues[TCQ_PRIO_BANDS]; > + unsigned char mq; > }; > > > @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int > *qerr) > #endif > if (TC_H_MAJ(band)) > band = 0; > + if (q->mq) > + skb->queue_mapping = > + q->prio2band[band&TC_PRIO_MAX]; > + else > + skb->queue_mapping = 0; Might look cleaner if you have one central point where queue_mapping is set and the band is returned. > + /* If we're multiqueue, make sure the number of incoming bands > + * matches the number of queues on the device we're associating with. > + */ > + if (tb[TCA_PRIO_MQ - 1]) > + q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]); If you're using it as a flag, please use RTA_GET_FLAG(), otherwise RTA_GET_U8. 
> + if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count)) > + return -EINVAL; > > sch_tree_lock(sch); > q->bands = qopt->bands; > @@ -280,7 +342,7 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff > *skb) > memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1); > > nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt); > - RTA_PUT_U32(skb, TCA_PRIO_TEST, 321); > + RTA_PUT_U8(skb, TCA_PRIO_MQ, q->mq); And RTA_PUT_FLAG. Now that I think of it, does it even makes sense to have a prio private flag for this ins
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> Patrick McHardy wrote:
> > Waskiewicz Jr, Peter P wrote:
> >
> >> Thought about this more last night and this morning. As far as I can
> >> tell, I still need this. If the qdisc gets loaded with multiqueue
> >> turned on, I can just use the value of band to assign
> >> skb->queue_mapping. But if the qdisc is loaded without multiqueue
> >> support, then I need to assign a value of zero to queue_mapping, or
> >> not assign it at all (it will be zero'd out before the call to
> >> ->enqueue() in dev_queue_xmit()). But I'd rather not have a
> >> conditional in the hotpath checking if the qdisc is multiqueue; I'd
> >> rather have the array to match the bands so I can just do an
> >> assignment.
> >>
> >> What do you think?
> >
> > I very much doubt that it has any measurable impact. You can also add
> > a small inline function
> >
> > void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)
>
> OK I didn't really listen obviously :) A compile time option won't
> help. Just remove it and assign it conditionally.

Sounds good. Thanks Patrick.

-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> Waskiewicz Jr, Peter P wrote:
>
>> Thought about this more last night and this morning. As far as I can
>> tell, I still need this. If the qdisc gets loaded with multiqueue
>> turned on, I can just use the value of band to assign
>> skb->queue_mapping. But if the qdisc is loaded without multiqueue
>> support, then I need to assign a value of zero to queue_mapping, or
>> not assign it at all (it will be zero'd out before the call to
>> ->enqueue() in dev_queue_xmit()). But I'd rather not have a
>> conditional in the hotpath checking if the qdisc is multiqueue; I'd
>> rather have the array to match the bands so I can just do an
>> assignment.
>>
>> What do you think?
>
> I very much doubt that it has any measurable impact. You can also add
> a small inline function
>
> void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)

OK I didn't really listen obviously :) A compile time option won't help. Just remove it and assign it conditionally.
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)
> {
> #ifdef CONFIG_NET_SCH_MULTIQUEUE
> 	skb->queue_mapping = queue;
> #else
> 	skb->queue_mapping = 0;
> #endif
> }

Maybe even use it everywhere and guard skb->queue_mapping by an #ifdef; on 32 bit it does enlarge the skb.
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>>> @@ -40,9 +42,13 @@
>>>  struct prio_sched_data
>>>  {
>>>  	int bands;
>>> +#ifdef CONFIG_NET_SCH_RR
>>> +	int curband; /* for round-robin */
>>> +#endif
>>>  	struct tcf_proto *filter_list;
>>>  	u8 prio2band[TC_PRIO_MAX+1];
>>>  	struct Qdisc *queues[TCQ_PRIO_BANDS];
>>> +	u16 band2queue[TC_PRIO_MAX + 1];
>>
>> Why is this still here? Its a 1:1 mapping.
>
> Thought about this more last night and this morning. As far as I can
> tell, I still need this. If the qdisc gets loaded with multiqueue
> turned on, I can just use the value of band to assign
> skb->queue_mapping. But if the qdisc is loaded without multiqueue
> support, then I need to assign a value of zero to queue_mapping, or
> not assign it at all (it will be zero'd out before the call to
> ->enqueue() in dev_queue_xmit()). But I'd rather not have a
> conditional in the hotpath checking if the qdisc is multiqueue; I'd
> rather have the array to match the bands so I can just do an
> assignment.
>
> What do you think?

I very much doubt that it has any measurable impact. You can also add a small inline function:

void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)
{
#ifdef CONFIG_NET_SCH_MULTIQUEUE
	skb->queue_mapping = queue;
#else
	skb->queue_mapping = 0;
#endif
}
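A self-contained version of that helper (with local stand-ins for struct sk_buff and the config symbol; not the kernel code) shows why it replaces the per-packet lookup array: with the option decided at compile time, each call compiles to a plain store, so there is no runtime conditional in the hot path.

```c
#include <assert.h>

/* Stand-in for the Kconfig symbol; comment it out to model a
 * non-multiqueue build, where the mapping collapses to 0. */
#define CONFIG_NET_SCH_MULTIQUEUE

/* Minimal stand-in for struct sk_buff. */
struct toy_skb {
	unsigned int queue_mapping;
};

/* Patrick's suggested helper: the #ifdef is resolved by the
 * preprocessor, so the generated code is branch-free either way. */
static inline void skb_set_queue_mapping(struct toy_skb *skb,
					 unsigned int queue)
{
#ifdef CONFIG_NET_SCH_MULTIQUEUE
	skb->queue_mapping = queue;
#else
	skb->queue_mapping = 0;
#endif
}
```

This is also what makes the band2queue array unnecessary: the band value itself is the queue index, and the helper decides at build time whether to honour it.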
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> > @@ -40,9 +42,13 @@
> >  struct prio_sched_data
> >  {
> >  	int bands;
> > +#ifdef CONFIG_NET_SCH_RR
> > +	int curband; /* for round-robin */
> > +#endif
> >  	struct tcf_proto *filter_list;
> >  	u8 prio2band[TC_PRIO_MAX+1];
> >  	struct Qdisc *queues[TCQ_PRIO_BANDS];
> > +	u16 band2queue[TC_PRIO_MAX + 1];
>
> Why is this still here? Its a 1:1 mapping.

Thought about this more last night and this morning. As far as I can tell, I still need this. If the qdisc gets loaded with multiqueue turned on, I can just use the value of band to assign skb->queue_mapping. But if the qdisc is loaded without multiqueue support, then I need to assign a value of zero to queue_mapping, or not assign it at all (it will be zero'd out before the call to ->enqueue() in dev_queue_xmit()). But I'd rather not have a conditional in the hotpath checking if the qdisc is multiqueue; I'd rather have the array to match the bands so I can just do an assignment.

What do you think?

Thanks,
-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>> The dependencies seem to be very confused. SCHED_PRIO does not depend
>> on anything new, SCH_RR also doesn't depend on anything. SCH_PRIO_MQ
>> and SCH_RR_MQ (which is missing) depend on SCH_PRIO/SCH_RR. A single
>> NET_SCH_MULTIQUEUE option seems better than adding one per scheduler
>> though.
>
> I agree with a NET_SCH_MULTIQUEUE option. However, SCH_RR does depend
> on SCH_PRIO being built since it's the same code, doesn't it? Maybe
> I'm not understanding something about the build process. I'll clean
> this up.

The easiest solution is to select SCH_PRIO from SCH_RR. I had something else in mind initially, but that is needlessly complicated.

>> For the tenth time now, the user should enable this at runtime. You
>> can't just break things dependent on config options.
>
> I had this in sch_prio and tc before, and was told to remove it
> because of ABI issues. I can put it back in, but I'm not sure what
> those previous ABI issues were. Was it backwards compatibility that
> you referred to before that was broken?

Your tc changes changed the structure in a way that old tc binaries wouldn't work anymore. This version breaks configurations that use a number of bands not matching the HW queues when the user enables the multiqueue compile time option.

Unfortunately prio does not use nested attributes, so the easiest way is extending struct tc_prio_qopt at the end and checking the attribute size to decide whether it's an old or a new version. A better fix would be to introduce a new qdisc configuration attribute that takes precedence over TCA_OPTIONS and have userspace send both the old non-nested structure and a new nested configuration. That would make sure we never run into this problem again.
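The size-based compatibility check Patrick describes could be sketched like this (toy structures and hypothetical names; the appended field models the suggested tc_prio_qopt extension): old binaries send the short payload, new binaries send the extended one, and the kernel distinguishes them purely by length.

```c
#include <assert.h>
#include <stddef.h>

/* Old wire format, and a hypothetically extended one: the new field is
 * appended at the end so old userspace payloads stay valid. */
struct old_qopt {
	int bands;
	unsigned char priomap[16];
};

struct new_qopt {
	struct old_qopt base;
	unsigned char mq;   /* the appended multiqueue flag */
};

/* Decide old vs. new format from the attribute payload length; an old
 * binary's payload never reaches sizeof(struct new_qopt). */
static int toy_parse_mq(const void *payload, size_t len)
{
	if (len >= sizeof(struct new_qopt))
		return ((const struct new_qopt *)payload)->mq;
	return 0;   /* old binary: multiqueue defaults to off */
}
```

This is the same trick used elsewhere in the rtnetlink ABI when a structure grows: because the field is appended and gated on length, old tc binaries keep working unchanged.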
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> The dependencies seem to be very confused. SCHED_PRIO does > not depend on anything new, SCH_RR also doesn't depend on > anything. SCH_PRIO_MQ and SCH_RR_MQ (which is missing) depend > on SCH_PRIO/SCH_RR. A single NET_SCH_MULTIQUEUE option seems > better than adding one per scheduler though. I agree with a NET_SCH_MULTIQUEUE option. However, SCH_RR does depend on SCH_PRIO being built since it's the same code, doesn't it? Maybe I'm not understanding something about the build process. I'll clean this up. > > > --- a/net/sched/sch_prio.c > > +++ b/net/sched/sch_prio.c > > @@ -9,6 +9,8 @@ > > * Authors:Alexey Kuznetsov, <[EMAIL PROTECTED]> > > * Fixes: 19990609: J Hadi Salim <[EMAIL PROTECTED]>: > > * Init -- EINVAL when opt undefined > > + * Additions: Peter P. Waskiewicz Jr. > <[EMAIL PROTECTED]> > > + * Added round-robin scheduling for selection at load-time > > > > git keeps changelogs, please don't add it here. Roger. > > struct tcf_proto *filter_list; > > u8 prio2band[TC_PRIO_MAX+1]; > > struct Qdisc *queues[TCQ_PRIO_BANDS]; > > + u16 band2queue[TC_PRIO_MAX + 1]; > > > > Why is this still here? Its a 1:1 mapping. I'll fix this. > > @@ -211,6 +265,22 @@ static int prio_tune(struct Qdisc > *sch, struct rtattr *opt) > > return -EINVAL; > > } > > > > + /* If we're prio multiqueue or are using round-robin, make > > +* sure the number of incoming bands matches the number of > > +* queues on the device we're associating with. > > +*/ > > +#ifdef CONFIG_NET_SCH_RR > > + if (strcmp("rr", sch->ops->id) == 0) > > + if (qopt->bands != sch->dev->egress_subqueue_count) > > + return -EINVAL; > > +#endif > > + > > +#ifdef CONFIG_NET_SCH_PRIO_MQ > > + if (strcmp("prio", sch->ops->id) == 0) > > + if (qopt->bands != sch->dev->egress_subqueue_count) > > + return -EINVAL; > > +#endif > > > > For the tenth time now, the user should enable this at > runtime. You can't just break things dependant on config options. 
I had this in sch_prio and tc before, and was told to remove it because of ABI issues. I can put it back in, but I'm not sure what those previous ABI issues were. Was it backwards compatibility that you referred to before that was broken? As always, the feedback is very much appreciated. I'll get these fixes in as soon as possible. -PJ - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote: diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 475df84..ca0b352 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -102,8 +102,16 @@ config NET_SCH_ATM To compile this code as a module, choose M here: the module will be called sch_atm. +config NET_SCH_BANDS +bool "Multi Band Queueing (PRIO and RR)" +---help--- + Say Y here if you want to use n-band multiqueue packet + schedulers. These include a priority-based scheduler and + a round-robin scheduler. + config NET_SCH_PRIO tristate "Multi Band Priority Queueing (PRIO)" + depends on NET_SCH_BANDS ---help--- Say Y here if you want to use an n-band priority queue packet scheduler. @@ -111,6 +119,30 @@ config NET_SCH_PRIO To compile this code as a module, choose M here: the module will be called sch_prio. +config NET_SCH_PRIO_MQ + bool "Multiple hardware queue support for PRIO" + depends on NET_SCH_PRIO + ---help--- + Say Y here if you want to allow the PRIO qdisc to assign + flows to multiple hardware queues on an ethernet device. This + will still work on devices with 1 queue. + + Consider this scheduler for devices that do not use + hardware-based scheduling policies. Otherwise, use NET_SCH_RR. + + Most people will say N here. + +config NET_SCH_RR + bool "Multi Band Round Robin Queuing (RR)" + depends on NET_SCH_BANDS && NET_SCH_PRIO + ---help--- + Say Y here if you want to use an n-band round robin packet + scheduler. + + The module uses sch_prio for its framework and is aliased as + sch_rr, so it will load sch_prio, although it is referred + to using sch_rr. The dependencies seem to be very confused. SCHED_PRIO does not depend on anything new, SCH_RR also doesn't depend on anything. SCH_PRIO_MQ and SCH_RR_MQ (which is missing) depend on SCH_PRIO/SCH_RR. A single NET_SCH_MULTIQUEUE option seems better than adding one per scheduler though. 
--- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -9,6 +9,8 @@ * Authors:Alexey Kuznetsov, <[EMAIL PROTECTED]> * Fixes: 19990609: J Hadi Salim <[EMAIL PROTECTED]>: * Init -- EINVAL when opt undefined + * Additions: Peter P. Waskiewicz Jr. <[EMAIL PROTECTED]> + * Added round-robin scheduling for selection at load-time git keeps changelogs, please don't add it here. */ #include @@ -40,9 +42,13 @@ struct prio_sched_data { int bands; +#ifdef CONFIG_NET_SCH_RR + int curband; /* for round-robin */ +#endif struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; + u16 band2queue[TC_PRIO_MAX + 1]; Why is this still here? Its a 1:1 mapping. @@ -211,6 +265,22 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt) return -EINVAL; } + /* If we're prio multiqueue or are using round-robin, make +* sure the number of incoming bands matches the number of +* queues on the device we're associating with. +*/ +#ifdef CONFIG_NET_SCH_RR + if (strcmp("rr", sch->ops->id) == 0) + if (qopt->bands != sch->dev->egress_subqueue_count) + return -EINVAL; +#endif + +#ifdef CONFIG_NET_SCH_PRIO_MQ + if (strcmp("prio", sch->ops->id) == 0) + if (qopt->bands != sch->dev->egress_subqueue_count) + return -EINVAL; +#endif For the tenth time now, the user should enable this at runtime. You can't just break things dependant on config options. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html