Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-08 Thread David Miller
From: jamal [EMAIL PROTECTED]
Date: Fri, 06 Jul 2007 10:39:15 -0400

 If the issue is usability of listing 1024 netdevices, i can think of
 many ways to resolve it.

I would agree with this if there were a reason for it; it's a totally
unnecessary complication as far as I can see.

These virtual devices are an ethernet with the subnet details exposed
to the driver, nothing more.

I see zero benefit to having a netdev for each guest or node we can
speak to whatsoever.  It's a very heavy abstraction to use for
something that is so bloody simple.

My demux in ->hard_start_xmit() is _5 DAMN LINES OF CODE_, and you want
to replace that with a full netdev because of some minor difficulty in
figuring out how to record the queueing state.  It's beyond unreasonable.

Netdevs are like salt, if you put too much in your food it tastes
awful.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-07 Thread Rusty Russell
On Fri, 2007-07-06 at 10:39 -0400, jamal wrote:
 The first thing that crossed my mind was: if you want to select a
 destination port based on a destination MAC, you are talking about a
 switch/bridge. You bring up the issue of a huge number of virtual NICs
 if you wanted arbitrary guests, which is a real one[2].

Hi Jamal,

I'm deeply tempted to agree with you that the answer is multiple
virtual NICs (and I've been tempted to abandon lguest's N-way transport
scheme), except that it looks like we're going to have multi-queue NICs
for other reasons.

Otherwise I'd be tempted to say create/destroy virtual NICs as other
guests appear on/vanish from the network.  No one does this today, but
that doesn't make it wrong.

 If i got this right (still not answering the netif_stop question posed):
 the problem you are also trying to resolve now is to get rid of N
 netdevices on each guest for a usability reason; i.e. have one netdevice,
 move the bridging/switching functionality/tables into the driver, and
 replace the ports with queues instead of netdevices. Did i get that
 right?

Yep, well summarized.  I guess the question is: should the Intel guys
be representing their multi-queue NICs as multiple NICs rather than
adding the subqueue concept?

 BTW, one curve that threw me off a little is that it seems most of the
 hardware that provides virtualization also provides point-to-point
 connections between different domains; i always thought that they all
 provided a point-to-point link to the dom0 equivalent and let the dom0
 worry about how things get from domainX to domainY.

Yeah, but that has obvious limitations as people care more about
inter-guest I/O: we want direct inter-guest networking...

Cheers,
Rusty.



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-06 Thread Rusty Russell
On Tue, 2007-07-03 at 22:20 -0400, jamal wrote:
 On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
 [.. some useful stuff here deleted ..]
 
  That's why you have to copy into a purpose-built set of memory
  that is composed of pages that _ONLY_ contain TX packet buffers
  and nothing else.
  
  The cost of going through the switch is too high, and the copies are
  necessary, so concentrate on allowing me to map the guest ports to the
  egress queues.  Anything else is a waste of discussion time, I've been
 poring over these issues endlessly for weeks, so if I'm saying doing
  copies and avoiding the switch is necessary I do in fact mean it. :-)
 
ok, i get it Dave ;-> Thanks for your patience, that was useful.
Now that is clear for me, I will go back and look at your original email
and try to get back on track to what you really asked ;->

To expand on this, there are already virtual nic drivers in tree which
do the demux based on dst mac and send to the appropriate other guest
(iseries_veth.c, and Carsten Otte said the S/390 drivers do too).  lguest
and DaveM's LDOM make two more.

There is currently no good way to write such a driver.  If one recipient
is full, you have to drop the packet: if you netif_stop_queue, it means
a slow/buggy recipient blocks packets going to other recipients.  But
dropping packets makes networking suck.
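The head-of-line dilemma described here can be made concrete with a toy model (all names, ring sizes, and policies below are illustrative, not from any in-tree driver):

```c
/* Toy model of the problem: one device queue feeds N per-recipient
 * rings.  When a single ring fills, the driver must either drop
 * (lossy) or stop the whole device (a slow/buggy recipient stalls
 * traffic to everyone else). */
#include <stdbool.h>

#define NRECIP  3
#define RING_SZ 4

static int ring_used[NRECIP];   /* packets queued per recipient */

/* Policy A: drop when the target ring is full. */
static bool xmit_or_drop(int recip)
{
    if (ring_used[recip] >= RING_SZ)
        return false;           /* packet dropped */
    ring_used[recip]++;
    return true;
}

/* Policy B: netif_stop_queue() semantics - any full ring stops the
 * whole device, blocking traffic to recipients with plenty of room. */
static bool device_stopped(void)
{
    for (int i = 0; i < NRECIP; i++)
        if (ring_used[i] >= RING_SZ)
            return true;
    return false;
}
```

Filling a single recipient's ring makes `device_stopped()` true even though the other rings are empty, which is exactly the unfairness being complained about.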

Some hypervisors (eg. Xen) only have a virtual NIC which is
point-to-point: this sidesteps the issue, with the risk that you might
need a huge number of virtual NICs if you wanted arbitrary guests to
talk to each other (Xen doesn't support that, they route/bridge through
dom0).

Most hypervisors have a sensible maximum on the number of guests they
can talk to, so I'm not too unhappy with a static number of queues.
But the dstmac -> queue mapping changes in hypervisor-specific ways, so
it really needs to be managed by the driver...

Hope that adds something,
Rusty.




Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-06 Thread jamal
On Fri, 2007-06-07 at 17:32 +1000, Rusty Russell wrote:

[..some good stuff deleted here ..]

 Hope that adds something,

It does - thanks. 
 
I think i was letting my experience pollute my thinking earlier when
Dave posted. The copy-avoidance requirement is clear to me[1].

I had another issue which wasn't clear, but you touched on it, so this
breaks the ice for me - be gentle please, my asbestos suit is full of
dust these days ;->:

The first thing that crossed my mind was: if you want to select a
destination port based on a destination MAC, you are talking about a
switch/bridge. You bring up the issue of a huge number of virtual NICs
if you wanted arbitrary guests, which is a real one[2].
Let's take the case of a small number of guests; a bridge of course
would solve the copy-avoidance problem, the caveats being:
- you now have N bridges and their respective tables for N domains, i.e.
one on each domain
- N netdevices on each domain as well (of course you could say that is
not very different resource-wise from N queues instead).

If i got this right (still not answering the netif_stop question posed):
the problem you are also trying to resolve now is to get rid of N
netdevices on each guest for a usability reason; i.e. have one netdevice,
move the bridging/switching functionality/tables into the driver, and
replace the ports with queues instead of netdevices. Did i get that
right?
If the issue is usability of listing 1024 netdevices, i can think of
many ways to resolve it.
One way we can resolve the listing is with a simple tag in the netdev
struct: i could say list netdevices for guests 0-10, etc etc.

I am having a little trouble conceptually differentiating a guest from
the host/dom0 if you want to migrate the switching/bridging functions
into each guest. So even if this doesn't apply to all domains, it does
apply to dom0.
I like netdevices today (as opposed to queues within netdevices):
- the stack knows them well (I can add IP addresses, point routes at
them, change MAC addresses, bring them administratively down/up, add
qos rules etc etc).
I can also tie netdevices to a CPU and therefore scale that way. I see
this as viable at least from the host/dom0 perspective if a netdevice
represents a guest.

Sorry for the long email - drained some of my morning coffee.
Ok, kill me.

cheers,
jamal

[1] My experience is around qemu/uml/old-openvz - their model is to let
the host do the routing/switching between guests or to the outside of
the box. From your description i would add Xen to that behavior.
From Dave's posting, i understand that for many good reasons, any time
you move from any one domain to another you are copying. So if you
use Xen and you want to go from domainX to domainY you go through dom0,
which implies copying domainX->dom0 then dom0->domainY.
BTW, one curve that threw me off a little is that it seems most of the
hardware that provides virtualization also provides point-to-point
connections between different domains; i always thought that they all
provided a point-to-point link to the dom0 equivalent and let the dom0
worry about how things get from domainX to domainY.

[2] Unfortunately that means if i wanted 1024 virtual routers/guest
domains i would have at least 1024 netdevices on each guest connected
to the bridge on the guest. I have a freaking problem listing 72
netdevices today on some device i have.



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-06 Thread James Chapman

jamal wrote:

If the issue is usability of listing 1024 netdevices, i can think of
many ways to resolve it.
One way we can resolve the listing is with a simple tag in the netdev
struct: i could say list netdevices for guests 0-10, etc etc.


This would be a useful feature, not only for virtualization. I've seen
some boxes with thousands of net devices (mostly ppp, but also some
ATM). It would be nice to be able to assign a tag to an arbitrary set
of devices.


Does the network namespace stuff help with any of this?

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-03 Thread jamal
On Sat, 2007-30-06 at 13:33 -0700, David Miller wrote:

 It's like twice as fast, since the switch doesn't have to copy
 the packet in, switch it, then the destination guest copies it
 into its address space.
 
 There is approximately one copy for each hop you go over through these
 virtual devices.

Ok - i see what you are getting at, and while it makes more sense to me
now, let me continue to be _the_ devil's advocate (sip some espresso
before responding or reading):
for some reason i always thought that packets going across these things
(likely not in the case of hypervisor-based virtualization like Xen)
just have their skbs cloned when crossing domains; is that not the
case?[1]
Assuming they copy, the balance that needs to be struck now is
between:

a) the copy is expensive
vs
b1) for N guests, N^2 queues in the system vs N queues, and N replicated
copies of the global info vs 1.
b2) the architectural challenge of now having to deal with a mesh (1-1
mappings) instead of a star topology between the guests.

I don't think #b1 is such a big deal; in the old days when i had played
with what is now openvz, i was happy to get 1024 virtual routers/guests
(each running Zebra/OSPF). I could live with a little more wasted memory
if the copying is reduced.
I think subconsciously i am questioning #b2. Do you really need that
sacrifice just so that you can avoid one extra copy between two guests?
If i was running virtual routers or servers i think the majority of
traffic (by far) would be between a domain and the outside of the box,
not between any two domains within the same box.

cheers,
jamal


[1] But then if this is true, i can think of a simple way to attack the
other domains by inserting a kernel module into a domain that reduces
the refcount of each received skb to 0. I would be surprised if the
openvz-type approach hasn't thought this through.




Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-03 Thread jamal
On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
[.. some useful stuff here deleted ..]

 That's why you have to copy into a purpose-built set of memory
 that is composed of pages that _ONLY_ contain TX packet buffers
 and nothing else.
 
 The cost of going through the switch is too high, and the copies are
 necessary, so concentrate on allowing me to map the guest ports to the
 egress queues.  Anything else is a waste of discussion time, I've been
 poring over these issues endlessly for weeks, so if I'm saying doing
 copies and avoiding the switch is necessary I do in fact mean it. :-)

ok, i get it Dave ;-> Thanks for your patience, that was useful.
Now that is clear for me, I will go back and look at your original email
and try to get back on track to what you really asked ;->

cheers,
jamal



Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-30 Thread Patrick McHardy
David Miller wrote:
 Now I get to pose a problem for everyone, prove to me how useful
 this new code is by showing me how it can be used to solve a
 reocurring problem in virtualized network drivers of which I've
 had to code one up recently, see my most recent blog entry at:
 
   http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html
 
 Anyways the gist of the issue is (and this happens for Sun LDOMS
 networking, lguest, IBM iSeries, etc.) that we have a single
 virtualized network device.  There is a port to the control
 node (which switches packets to the real network for the guest)
 and one port to each of the other guests.
 
 Each guest gets a unique MAC address.  There is a queue per-port
 that can fill up.
 
 What all the drivers like this do right now is stop the queue if
 any of the per-port queues fill up, and that's what my sunvnet
 driver does right now as well.  We can thus only wake up the
 queue when all of the ports have some space.
 
 The ports (and thus the queues) are selected by destination
 MAC address.  Each port has a remote MAC address, if there
 is an exact match with a port's remote MAC we'd use that port
 and thus that port's queue.  If there is no exact match
 (some other node on the real network, broadcast, multicast,
 etc.) we want to use the control node's port and port queue.
 
 So the problem to solve is to make a way for drivers to do the queue
 selection before the generic queueing layer starts to try and push
 things to the driver.  Perhaps a classifier in the driver or similar.


That sounds like the only reasonable possibility if you really
do want to use queues. Another possibility would be to not use
a queue at all, make the whole thing unreliable, and treat full rx
rings of the guests as loss on the wire. Not sure if that makes
any sense.

I was thinking about adding a way for (multiqueue) drivers to use
default qdiscs other than pfifo_fast, so they can default to a
multiband prio or something else that makes sense for them ..
maybe a dev->qdisc_setup hook that is invoked from dev_activate.
They would need to be able to add a default classifier for this
to have any effect (the grand plan is to get rid of the horrible
wme scheduler). Specialized classifiers like your dst-MAC classifier,
and maybe even WME, should then probably be built into the driver and
not register with the API, so they don't become globally visible.
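The kind of hook being discussed - the driver classifying a packet to a subqueue before the generic layer enqueues it - might be modeled like this (a sketch under assumed names; the structs and callbacks are hypothetical, not the API that existed in net-2.6.23):

```c
/* Userspace model of a driver-owned queue-selection hook: the core
 * asks the driver which TX subqueue a packet belongs to *before*
 * enqueueing, instead of classifying after the qdisc. */
#include <stddef.h>

struct skb_model { unsigned char dst_mac[6]; };

struct netdev_model {
    int num_tx_queues;
    /* driver fills this in; NULL means "always queue 0" */
    int (*select_queue)(struct netdev_model *dev,
                        const struct skb_model *skb);
};

/* What the core would do at enqueue time. */
static int core_pick_queue(struct netdev_model *dev,
                           const struct skb_model *skb)
{
    if (dev->select_queue)
        return dev->select_queue(dev, skb) % dev->num_tx_queues;
    return 0;
}

/* Example driver policy: hash the low byte of the dst MAC, standing
 * in for the exact-remote-MAC lookup a virtual nic driver would do. */
static int mac_hash_select(struct netdev_model *dev,
                           const struct skb_model *skb)
{
    return skb->dst_mac[5] % dev->num_tx_queues;
}
```

The point of the shape is that the mapping policy lives entirely in the driver, so hypervisor-specific MAC-to-queue tables never have to be exposed to the core or registered as a globally visible classifier.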

 The solution to this problem generalizes to the other facility
 we want now, hashing the transmit queue by smp_processor_id()
 or similar.  With that in place we can look at doing the TX locking
 per-queue too as is hinted at by the comments above the per-queue
 structure in the current net-2.6.23 tree.


It would be great if we could finally get a working e1000 multiqueue
patch so work in this area can actually be tested.



RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-30 Thread Waskiewicz Jr, Peter P
 It would be great if we could finally get a working e1000 
 multiqueue patch so work in this area can actually be tested.

I'm actively working on this right now.  I'm on vacation next week, but
hopefully I can get something working before I leave OLS and post it.

-PJ


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-30 Thread jamal
On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:

 Awesome, but let's concentrate on the client since I can actually
 implement and test anything we come up with :-)

Ok, you need to clear up one premise for me then ;->
You said the model is for the guest/client to have a port to the
host and one to each guest; i think this is the confusing part for me
(and may have led to the switch discussion) because i have not seen this
model used before. What i have seen before is that the host side
connects the different guests. In such a scenario, on the guest there is
a single port that connects to the host - the host worries (let's forget
the switch/bridge for a sec) about how to get packets from guestX to
guestY, pending consultation of access control details.
What is the advantage of a direct domain-domain connection? Is it
scalable?

cheers,
jamal



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-30 Thread David Miller
From: jamal [EMAIL PROTECTED]
Date: Sat, 30 Jun 2007 10:52:44 -0400

 On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:
 
  Awesome, but let's concentrate on the client since I can actually
  implement and test anything we come up with :-)
 
 Ok, you need to clear up one premise for me then ;->
 You said the model is for the guest/client to have a port to the
 host and one to each guest; i think this is the confusing part for me
 (and may have led to the switch discussion) because i have not seen this
 model used before. What i have seen before is that the host side
 connects the different guests. In such a scenario, on the guest there is
 a single port that connects to the host - the host worries (let's forget
 the switch/bridge for a sec) about how to get packets from guestX to
 guestY, pending consultation of access control details.
 What is the advantage of a direct domain-domain connection? Is it
 scalable?

It's like twice as fast, since the switch doesn't have to copy
the packet in, switch it, then the destination guest copies it
into its address space.

There is approximately one copy for each hop you go over through these
virtual devices.


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Waskiewicz Jr, Peter P
 Ok everything is checked into net-2.6.23, thanks everyone.

Dave, thank you for your patience and feedback on this whole process.
Patrick and everyone else, thank you for your feedback and assistance.

I am looking at your posed virtualization question, but I need sleep
since I just remembered I'm on east coast time here at OLS, and it's
4:30am.

Thanks again,

-PJ Waskiewicz


Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread jamal

I've changed the topic for you, friend - otherwise most people won't
follow (as you've said a few times yourself ;->).

On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:

 Now I get to pose a problem for everyone, prove to me how useful
 this new code is by showing me how it can be used to solve a
 recurring problem in virtualized network drivers, of which I've
 had to code one up recently, see my most recent blog entry at:
 
   http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html
 

nice.

 Anyways the gist of the issue is (and this happens for Sun LDOMS
 networking, lguest, IBM iSeries, etc.) that we have a single
 virtualized network device.  There is a port to the control
 node (which switches packets to the real network for the guest)
 and one port to each of the other guests.
 
 Each guest gets a unique MAC address.  There is a queue per-port
 that can fill up.
 
 What all the drivers like this do right now is stop the queue if
 any of the per-port queues fill up, and that's what my sunvnet
 driver does right now as well.  We can thus only wake up the
 queue when all of the ports have some space.

Is a netdevice really the correct construct for the host side?
Sounds to me like a layer above the netdevice is the way to go: a bridge
for example, or L3 routing, or even simple tc classify/redirection etc.
I haven't used what has become openvz these days in many years (or played
with Eric's approach), but if i recall correctly - it used to have a
single netdevice per guest on the host. That's close to what a basic
qemu/UML has today. In such a case it is something above the netdevices
which does the guest selection.
 
 The ports (and thus the queues) are selected by destination
 MAC address.  Each port has a remote MAC address, if there
 is an exact match with a port's remote MAC we'd use that port
 and thus that port's queue.  If there is no exact match
 (some other node on the real network, broadcast, multicast,
 etc.) we want to use the control node's port and port queue.
 

Ok, Dave, isn't that what a bridge does? ;-> You'd need filtering to go
with it (for example to restrict guest0 from getting certain broadcasts
etc) - but we already have that.

 So the problem to solve is to make a way for drivers to do the queue
 selection before the generic queueing layer starts to try and push
 things to the driver.  Perhaps a classifier in the driver or similar.

 The solution to this problem generalizes to the other facility
 we want now, hashing the transmit queue by smp_processor_id()
 or similar.  With that in place we can look at doing the TX locking
 per-queue too as is hinted at by the comments above the per-queue
 structure in the current net-2.6.23 tree.

Major surgery will be needed on the tx path if you want to hash the tx
queue to a processor id. Our unit construct (today, net-2.6.23) that can
be tied to a cpu is a netdevice. OTOH, if you used a netdevice it should
work as is. But i am possibly missing something in your comments.
What do you have in mind?
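What the per-queue TX locking hinted at above might look like can be sketched in a userspace model (all names are illustrative assumptions: one lock per TX queue, queue chosen from the CPU id, instead of a single per-device TX lock):

```c
/* Model of per-CPU TX queue hashing with per-queue locks: each queue
 * has its own lock, and a sender picks a queue from its CPU id, so
 * CPUs hashed to different queues never contend. */
#include <pthread.h>

#define NUM_TX_QUEUES 4

struct txq_model {
    pthread_mutex_t lock;   /* per-queue lock, not a device-wide one */
    int enqueued;
};

static struct txq_model txq[NUM_TX_QUEUES];

static void txq_init(void)
{
    for (int i = 0; i < NUM_TX_QUEUES; i++) {
        pthread_mutex_init(&txq[i].lock, NULL);
        txq[i].enqueued = 0;
    }
}

/* Hash the sending CPU onto a queue. */
static int txq_for_cpu(int cpu)
{
    return cpu % NUM_TX_QUEUES;
}

static void xmit_on_cpu(int cpu)
{
    struct txq_model *q = &txq[txq_for_cpu(cpu)];

    pthread_mutex_lock(&q->lock);   /* only this queue is serialized */
    q->enqueued++;
    pthread_mutex_unlock(&q->lock);
}
```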

 My current work-in-progress sunvnet.c driver is included below so
 we can discuss things concretely with code.
 
 I'm listening. :-)

And you got words above.

cheers,
jamal




Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Patrick McHardy
jamal wrote:
 On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:
 
Each guest gets a unique MAC address.  There is a queue per-port
that can fill up.

What all the drivers like this do right now is stop the queue if
any of the per-port queues fill up, and that's what my sunvnet
driver does right now as well.  We can thus only wake up the
queue when all of the ports have some space.
 
 
 Is a netdevice really the correct construct for the host side?
 Sounds to me a layer above the netdevice is the way to go. A bridge for
 example or L3 routing or even simple tc classify/redirection etc.
 I haven't used what has become openvz these days in many years (or played
 with Eric's approach), but if i recall correctly - it used to have a
 single netdevice per guest on the host. That's close to what a basic
 qemu/UML has today. In such a case it is something above netdevices
 which does the guest selection.


I'm guessing that that wouldn't allow doing unicast filtering for
the guests on the real device without hacking the bridge code for
this special case. The difference from a real bridge is that
all the addresses are completely known in advance, so it doesn't need
promiscuous mode for learning.


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread jamal
On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:

 I'm guessing that that wouldn't allow to do unicast filtering for
 the guests on the real device without hacking the bridge code for
 this special case. 

For ingress (i guess you could say for egress as well): we can do it
today with tc filtering on the host - it is involved, but it is part of
provisioning for a guest IMO.
A substantial number of ethernet switches (ok, not the $5 ones) do
filtering at the same level.

 The difference from a real bridge is that
 all the addresses are completely known in advance, so it doesn't need
 promiscuous mode for learning.

You mean the per-virtual-device MAC addresses are known in advance, right?
This is fine. The bridging or otherwise (like L3 etc) is for
interconnecting once you have the provisioning done. And you could build
different broadcast domains by having multiple bridges.

To go off on a slight tangent:
I think you have to look at the two types of NICs separately:
1) dumb ones, where you may have to use the mcast filters in hardware to
pretend you have a unicast address per virtual device - those will be
really hard to simulate using a separate netdevice per MAC address.
Actually your bigger problem on those is tx MAC address selection,
because that is not built into the hardware. I still think even for
these types something above the netdevice (bridge, L3 routing, tc action
redirect etc) will do.
2) the new NICs being built for virtualization; those allow you to
explicitly have clean separation of IO, where the only thing that is
shared between virtual devices is the wire and the bus (otherwise
each has its own registers etc), i.e. the hardware is designed with this
in mind. In such a case, i think a separate netdevice per single MAC
address - possibly tied to a separate CPU - should work.

cheers,
jamal



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Patrick McHardy
jamal wrote:
 On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:
 
 
 The difference from a real bridge is that
 all the addresses are completely known in advance, so it doesn't need
 promiscuous mode for learning.
 
 
 You mean the per-virtual-device MAC addresses are known in advance, right?

Yes.

 This is fine. The bridging or otherwise (like L3 etc) is for
 interconnecting once you have the provisioning done. And you could build
 different broadcast domains by having multiple bridges.


Right, but the current bridging code always uses promiscuous mode,
and it's nice to avoid that if possible. Looking at the code, it
should be easy to avoid, though, by disabling learning (and thus
promiscuous mode) and adding unicast filters for all static fdb entries.

 To go off on a slight tangent:
 I think you have to look at the two types of NICs separately 
 1) dumb ones where you may have to use the mcast filters in hardware to
 pretend you have a unicast address per virtual device - those will be
 really hard to simulate using a separate netdevice per MAC address.
 Actually your bigger problem on those is tx MAC address selection
 because that is not built into the hardware. I still think even for
 these types something above netdevice (bridge, L3 routing, tc action
 redirect etc) will do.


Have a look at my secondary unicast address patches in case you didn't
notice them before (there's also a driver example for e1000 on netdev):

http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7

You still need to do filtering in software, but you can have the NIC
pre-filter in case it supports it; otherwise it goes into promiscuous mode.

 2) The new NICs being built for virtualization; those allow you to
 explicitly have clean separation of IO where the only thing that is
 shared between virtual devices is the wire and the bus (otherwise
 each has its own registers etc) i.e the hardware is designed with this
 in mind. In such a case, i think a separate netdevice per single MAC
 address - possibly tied to a separate CPU should work.


Agreed, that could also be useful for non-virtualized use.


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread jamal
On Fri, 2007-29-06 at 15:08 +0200, Patrick McHardy wrote:
 jamal wrote:
  On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:

 Right, but the current bridging code always uses promiscuous mode,
 and it's nice to avoid that if possible. Looking at the code, it
 should be easy to avoid, though, by disabling learning (and thus
 promiscuous mode) and adding unicast filters for all static fdb entries.
 

Yes, that would do it for static provisioning (I suspect that would work
today, unless bridging has no knobs to turn off going into promisc). But
you could even allow for learning and just have extra filters in tc
before bridging disallowing things.

 Have a look at my secondary unicast address patches in case you didn't
 notice them before (there's also a driver example for e1000 on netdev):
 
 http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7
 
 You still need to do filtering in software, but you can have the NIC
 pre-filter in case it supports it; otherwise it goes into promiscuous mode.
 

Ok, I will look at them when i get back. Sorry - haven't caught up on
netdev.

cheers,
jamal

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Ben Greear

Patrick McHardy wrote:

Right, but the current bridging code always uses promiscuous mode,
and it's nice to avoid that if possible. Looking at the code, it
should be easy to avoid, though, by disabling learning (and thus
promiscuous mode) and adding unicast filters for all static fdb entries.
  
I am curious about why people are so hot to do away with promisc mode.
It seems to me that in a modern switched environment, there should only
very rarely be unicast packets received on an interface that does not
want to receive them.

Could someone give a quick example of when I am wrong and promisc mode
would allow a NIC to receive a significant number of packets not really
destined for it?

Thanks,
Ben


--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Ben Greear

Patrick McHardy wrote:

Ben Greear wrote:
  

Could someone give a quick example of when I am wrong and promisc mode
would allow
a NIC to receive a significant number of packets not really destined for
it?




In a switched environment it won't have a big effect, I agree.
It might help avoid receiving unwanted multicast traffic, which
could be more significant than unicast.

Anyways, why be wasteful when it can be avoided .. :)
  

Ok, I had forgotten about multicast, thanks for the reminder!

--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Patrick McHardy
Ben Greear wrote:
 Patrick McHardy wrote:
 
 Right, but the current bridging code always uses promiscuous mode
 and it's nice to avoid that if possible. Looking at the code, it
 should be easy to avoid though by disabling learning (and thus
 promiscuous mode) and adding unicast filters for all static fdb entries.
   
 
 I am curious about why people are so hot to do away with promisc mode.
 It seems to me that in a modern switched environment, there should only
 very rarely be unicast packets received on an interface that does not
 want to receive them.


I don't know if that really was Dave's reason to handle it in a driver.

 Could someone give a quick example of when I am wrong and promisc mode
 would allow
 a NIC to receive a significant number of packets not really destined for
 it?


In a switched environment it won't have a big effect, I agree.
It might help avoid receiving unwanted multicast traffic, which
could be more significant than unicast.

Anyways, why be wasteful when it can be avoided .. :)



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread David Miller

This conversation begins to go into a pointless direction already, as
I feared it would.

Nobody is going to configure bridges, classification, tc, and all of
this other crap just for a simple virtualized guest networking device.

It's a confined and well defined case that doesn't need any of that.
You've got to be fucking kidding me if you think I'm going to go
through the bridging code and all of that layering instead of my
hash demux on transmit which is 4 or 5 lines of C code at best.

Such a suggestion is beyond stupid.

Maybe for the control node switch, yes, but not for the guest network
devices.


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread David Miller
From: jamal [EMAIL PROTECTED]
Date: Fri, 29 Jun 2007 21:30:53 -0400

 On Fri, 2007-29-06 at 14:31 -0700, David Miller wrote:
  Maybe for the control node switch, yes, but not for the guest network
  devices.
 
 And that is precisely what i was talking about - and i am sure thats how
 the discussion with Patrick was.

Awesome, but let's concentrate on the client since I can actually
implement and test anything we come up with :-)


Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Patrick McHardy

PJ Waskiewicz wrote:


+
 static int __init prio_module_init(void)
 {
-	return register_qdisc(&prio_qdisc_ops);
+	int err;
+	err = register_qdisc(&prio_qdisc_ops);
+	if (!err)
+		err = register_qdisc(&rr_qdisc_ops);
+	return err;
 }
 


That's still broken. I'll fix this and some minor cleanliness issues myself so
you don't have to go through another resend.



RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Waskiewicz Jr, Peter P
 PJ Waskiewicz wrote:
 
  +
  static int __init prio_module_init(void)
  {
 -	return register_qdisc(&prio_qdisc_ops);
 +	int err;
 +	err = register_qdisc(&prio_qdisc_ops);
 +	if (!err)
 +		err = register_qdisc(&rr_qdisc_ops);
 +	return err;
   }
   
 
 That's still broken. I'll fix this and some minor cleanliness
 issues myself so you don't have to go through another resend.

Auke and I just looked at register_qdisc() and this code.  Maybe we
haven't had enough coffee yet, but register_qdisc() returns 0 on
success.  So if register_qdisc(&prio_qdisc_ops) succeeds, then
rr_qdisc_ops gets registered.  I'm curious what is broken with this.

Thanks Patrick
-PJ Waskiewicz


Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Patrick McHardy
Waskiewicz Jr, Peter P wrote:
PJ Waskiewicz wrote:


+
 static int __init prio_module_init(void)
 {
-	return register_qdisc(&prio_qdisc_ops);
+	int err;
+	err = register_qdisc(&prio_qdisc_ops);
+	if (!err)
+		err = register_qdisc(&rr_qdisc_ops);
+	return err;
 }
 

That's still broken. I'll fix this and some minor cleanliness
issues myself so you don't have to go through another resend.
 
 
 Auke and I just looked at register_qdisc() and this code.  Maybe we
 haven't had enough coffee yet, but register_qdisc() returns 0 on
 success.  So if register_qdisc(prio_qdisc_ops) succeeds, then
 rr_qdisc_ops gets registered.  I'm curious what is broken with this.

It's not error handling. You do:

err = register qdisc 1
if (err)
	return err;
err = register qdisc 2
if (err)
	unregister qdisc 1
return err

anyways, I already fixed that and cleaned up prio_classify the
way I suggested. Will send shortly.



RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Waskiewicz Jr, Peter P

 It's not error handling. You do:
 
 err = register qdisc 1
 if (err)
 	return err;
 err = register qdisc 2
 if (err)
 	unregister qdisc 1
 return err
 
 anyways, I already fixed that and cleaned up prio_classify 
 the way I suggested. Will send shortly.

Thanks for fixing; however, the current sch_prio doesn't unregister the
qdisc if register_qdisc() on prio fails, or does that happen implicitly
because the module will probably unload?

Thanks again Patrick,
-PJ


Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Patrick McHardy
Patrick McHardy wrote:
 PJ Waskiewicz wrote:
 
 +
  static int __init prio_module_init(void)
  {
 -return register_qdisc(prio_qdisc_ops);
 +int err;
 +err = register_qdisc(prio_qdisc_ops);
 +if (!err)
 +err = register_qdisc(rr_qdisc_ops);
 +return err;
  }
  
 
 
 That's still broken. I'll fix this and some minor cleanliness issues myself so
 you don't have to go through another resend.


Here it is, fixed error handling and cleaned up prio_classify.

There are still too many ifdefs in there for my taste, and I'm
wondering whether the NET_SCH_MULTIQUEUE option should really
be NETDEVICES_MULTIQUEUE. That would allow to move the #ifdefs
to netif_subqueue_stopped and keep the qdiscs clean.
I can send a patch for that on top of your patches.

[SCHED] Qdisc changes and sch_rr added for multiqueue

Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio and sch_rr to be compiled with or without multiqueue
hardware support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.

Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED]
Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit ab2c102d05981e983b494e9211e1c987ae0b80ee
tree edb9457aeed6eb4ac8adef517b783b3361ba8830
parent 1937d8b868253885584dd49fe7ba804eda6b525f
author Peter P Waskiewicz Jr [EMAIL PROTECTED] Thu, 28 Jun 2007 18:47:49 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 28 Jun 2007 18:47:49 +0200

 include/linux/pkt_sched.h |9 +++
 net/sched/Kconfig |   23 +++
 net/sched/sch_prio.c  |  142 -
 3 files changed, 159 insertions(+), 15 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..268c515 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -101,6 +101,15 @@ struct tc_prio_qopt
__u8	priomap[TC_PRIO_MAX+1];	/* Map: logical priority -> PRIO band */
 };
 
+enum
+{
+   TCA_PRIO_UNSPEC,
+   TCA_PRIO_MQ,
+   __TCA_PRIO_MAX
+};
+
+#define TCA_PRIO_MAX	(__TCA_PRIO_MAX - 1)
+
 /* TBF section */
 
 struct tc_tbf_qopt
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..65ee9e7 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -111,6 +111,29 @@ config NET_SCH_PRIO
  To compile this code as a module, choose M here: the
  module will be called sch_prio.
 
+config NET_SCH_RR
+	tristate "Multi Band Round Robin Queuing (RR)"
+	select NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to use an n-band round robin packet
+	  scheduler.
+
+	  The module uses sch_prio for its framework and is aliased as
+	  "sch_rr", so it will load sch_prio, although it is referred
+	  to using "sch_rr".
+
+config NET_SCH_MULTIQUEUE
+	bool "Multiple hardware queue support"
+	---help---
+	  Say Y here if you want to allow supported qdiscs to assign flows to
+	  multiple hardware queues on an ethernet device.  This will
+	  still work on devices with 1 queue.
+
+	  Current qdiscs supporting this feature are NET_SCH_PRIO and
+	  NET_SCH_RR.
+
+	  Most people will say N here.
+
 config NET_SCH_RED
 	tristate "Random Early Detection (RED)"
---help---
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 6d7542c..b35b9c5 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -40,9 +40,13 @@
 struct prio_sched_data
 {
int bands;
+   int curband; /* for round-robin */
struct tcf_proto *filter_list;
u8  prio2band[TC_PRIO_MAX+1];
struct Qdisc *queues[TCQ_PRIO_BANDS];
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+   unsigned char mq;
+#endif
 };
 
 
@@ -70,14 +74,21 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
if (TC_H_MAJ(band))
band = 0;
-			return q->queues[q->prio2band[band & TC_PRIO_MAX]];
+			band = q->prio2band[band & TC_PRIO_MAX];
+			goto out;
 		}
 		band = res.classid;
 	}
 	band = TC_H_MIN(band) - 1;
 	if (band >= q->bands)
-		return q->queues[q->prio2band[0]];
-
+		band = q->prio2band[0];
+out:
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+	if (q->mq)
+		skb->queue_mapping = band;
+	else
+		skb->queue_mapping = 0;
+#endif
 	return q->queues[band];
 }
 
@@ -144,17 +155,65 @@ prio_dequeue(struct Qdisc* sch)
struct Qdisc *qdisc;
 
 	for (prio = 0; prio < q->bands; prio++) {
-		qdisc = q->queues[prio];
-		skb = qdisc->dequeue(qdisc);
-		if (skb) {
-			sch->q.qlen--;
-			return skb;
+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+   /* Check if the target subqueue is available before
+ 

Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Patrick McHardy
Waskiewicz Jr, Peter P wrote:
 Thanks for fixing; however, the current sch_prio doesn't unregister the
 qdisc if register_qdisc() on prio fails, or does that happen implicitly
 because the module will probably unload?


It failed, there's nothing to unregister. But when you register two
qdiscs and the second one fails you have to unregister the first one.

Your way works too, but it might fail registering the second one without
providing any feedback to the user.


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Waskiewicz Jr, Peter P
 PJ Waskiewicz wrote:
  +#ifdef CONFIG_NET_SCH_MULTIQUEUE
  +	if (q->mq)
  +		skb->queue_mapping =
  +			q->prio2band[band & TC_PRIO_MAX];
  +	else
  +		skb->queue_mapping = 0;
  +#endif
 
 
 Setting it to zero here is wrong, consider:
 
 root qdisc: prio multiqueue
 child qdisc: prio non-multiqueue
 
 The top-level qdisc will set it, the child qdisc will unset it again.
 When multiqueue is inactive it should not touch it.
 
 I'll fix that as well.

But the child can't assume the device is multiqueue if the child is
non-multiqueue.  This is the same issue with IP forwarding, where if you
forward through a multiqueue device to a non-mq device, you don't know
if the destination device is multiqueue.  So the last qdisc to actually
dequeue into a device should have control what the queue mapping is.  If
a user had a multiqueue qdisc as root, and configured a child qdisc as
non-mq, that is a configuration error if the underlying device is indeed
multiqueue IMO.

-PJ


Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Patrick McHardy
Waskiewicz Jr, Peter P wrote:
PJ Waskiewicz wrote:

+#ifdef CONFIG_NET_SCH_MULTIQUEUE
+	if (q->mq)
+		skb->queue_mapping =
+			q->prio2band[band & TC_PRIO_MAX];
+	else
+		skb->queue_mapping = 0;
+#endif


Setting it to zero here is wrong, consider:

root qdisc: prio multiqueue
child qdisc: prio non-multiqueue

The top-level qdisc will set it, the child qdisc will unset it again.
When multiqueue is inactive it should not touch it.

I'll fix that as well.
 
 
 But the child can't assume the device is multiqueue if the child is
 non-multiqueue.

The child doesn't have to assume anything.

 This is the same issue with IP forwarding, where if you
 forward through a multiqueue device to a non-mq device, you don't know
 if the destination device is multiqueue.

No its not. I'm talking about nested qdiscs, which are all on
a single device.

 So the last qdisc to actually
 dequeue into a device should have control what the queue mapping is.


Fully agreed. And that is always the top-level qdisc.

 If
 a user had a multiqueue qdisc as root, and configured a child qdisc as
 non-mq, that is a configuration error if the underlying device is indeed
 multiqueue IMO.


Absolutely not. First of all, its perfectly valid to use non-multiqueue
qdiscs on multiqueue devices. Secondly, its only the root qdisc that
has to know about multiqueue since that one controls the child qdiscs.

Think about it, it makes absolutely no sense to have the child
qdisc even know about multiqueue. Changing my example to have
a multiqueue qdisc as child:

root qdisc: 2 band prio multiqueue
child qdisc of band 0: 2 band prio multiqueue

When the root qdisc decides to dequeue band0, it checks whether subqueue
0 is active and dequeues the child qdisc. If the child qdisc is indeed
another multiqueue qdisc as you suggest, it might decide to dequeue its
own band 1 and check that subqueue state. So where should the packet
finally end up? And what if one of the two subqueues is inactive?

The only reasonable thing it can do is not care about multiqueue and
just dequeue as usual. In fact I think it should be an error to
configure multiqueue on a non-root qdisc.


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Waskiewicz Jr, Peter P
 Absolutely not. First of all, its perfectly valid to use 
 non-multiqueue qdiscs on multiqueue devices. Secondly, its 
 only the root qdisc that has to know about multiqueue since 
 that one controls the child qdiscs.
 
 Think about it, it makes absolutely no sense to have the 
 child qdisc even know about multiqueue. Changing my example 
 to have a multiqueue qdisc as child:
 
 root qdisc: 2 band prio multiqueue
 child qdisc of band 0: 2 band prio multiqueue
 
 When the root qdisc decides to dequeue band0, it checks 
 whether subqueue 0 is active and dequeues the child qdisc. If 
 the child qdisc is indeed another multiqueue qdisc as you 
 suggest, it might decide to dequeue its own band 1 and check 
 that subqueue state. So where should the packet finally end 
 up? And what if one of the two subqueues is inactive?
 
 The only reasonable thing it can do is not care about 
 multiqueue and just dequeue as usual. In fact I think it 
 should be an error to configure multiqueue on a non-root qdisc.

Ack.  This is a thought process that trips me up from time to time...I
see child qdisc, and think that's the last qdisc to dequeue and send to
the device, not the first one to dequeue.  So please disregard my
comments before; I totally agree with you.  Great catch here; I really
like the prio_classify() cleanup.

As always, many thanks for your feedback and help Patrick.

-PJ


Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Patrick McHardy
Waskiewicz Jr, Peter P wrote:
[...]
The only reasonable thing it can do is not care about 
multiqueue and just dequeue as usual. In fact I think it 
should be an error to configure multiqueue on a non-root qdisc.
 
 
 Ack.  This is a thought process that trips me up from time to time...I
 see child qdisc, and think that's the last qdisc to dequeue and send to
 the device, not the first one to dequeue.  So please disregard my
 comments before; I totally agree with you.  Great catch here; I really
 like the prio_classify() cleanup.


Thanks. This updated patch makes configuring a non-root qdisc for
multiqueue an error.

[SCHED] Qdisc changes and sch_rr added for multiqueue

Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio and sch_rr to be compiled with or without multiqueue
hardware support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.

Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED]
Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 8798d6bf4f3ed4f5b162184282adc714ef5b69b1
tree b3ba373a0534b905b34abc520eba6adecdcc3ca5
parent 48e334930c5fbb64a821733a7056e2800057c128
author Peter P Waskiewicz Jr [EMAIL PROTECTED] Thu, 28 Jun 2007 18:47:49 +0200
committer Patrick McHardy [EMAIL PROTECTED] Thu, 28 Jun 2007 21:23:44 +0200

 include/linux/pkt_sched.h |9 +++
 net/sched/Kconfig |   11 
 net/sched/sch_prio.c  |  129 -
 3 files changed, 134 insertions(+), 15 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index d10f353..268c515 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -101,6 +101,15 @@ struct tc_prio_qopt
 	__u8	priomap[TC_PRIO_MAX+1];	/* Map: logical priority -> PRIO band */
 };
 
+enum
+{
+	TCA_PRIO_UNSPEC,
+	TCA_PRIO_MQ,
+	__TCA_PRIO_MAX
+};
+
+#define TCA_PRIO_MAX	(__TCA_PRIO_MAX - 1)
+
 /* TBF section */
 
 struct tc_tbf_qopt
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..f321794 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -111,6 +111,17 @@ config NET_SCH_PRIO
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_prio.
 
+config NET_SCH_RR
+	tristate "Multi Band Round Robin Queuing (RR)"
+	select NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to use an n-band round robin packet
+	  scheduler.
+
+	  The module uses sch_prio for its framework and is aliased as
+	  "sch_rr", so it will load sch_prio, although it is referred
+	  to using "sch_rr".
+
 config NET_SCH_RED
 	tristate "Random Early Detection (RED)"
 	---help---
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 6d7542c..4045220 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -40,9 +40,11 @@
 struct prio_sched_data
 {
 	int bands;
+	int curband; /* for round-robin */
 	struct tcf_proto *filter_list;
 	u8  prio2band[TC_PRIO_MAX+1];
 	struct Qdisc *queues[TCQ_PRIO_BANDS];
+	int mq;
 };
 
 
@@ -70,14 +72,17 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
 			if (TC_H_MAJ(band))
 band = 0;
-			return q->queues[q->prio2band[band & TC_PRIO_MAX]];
+			band = q->prio2band[band & TC_PRIO_MAX];
+			goto out;
 		}
 		band = res.classid;
 	}
 	band = TC_H_MIN(band) - 1;
 	if (band >= q->bands)
-		return q->queues[q->prio2band[0]];
-
+		band = q->prio2band[0];
+out:
+	if (q->mq)
+		skb_set_queue_mapping(skb, band);
 	return q->queues[band];
 }
 
@@ -144,17 +149,58 @@ prio_dequeue(struct Qdisc* sch)
 	struct Qdisc *qdisc;
 
 	for (prio = 0; prio < q->bands; prio++) {
-		qdisc = q->queues[prio];
-		skb = qdisc->dequeue(qdisc);
-		if (skb) {
-			sch->q.qlen--;
-			return skb;
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.
+		 */
+		if (!netif_subqueue_stopped(sch->dev, (q->mq ? prio : 0))) {
+			qdisc = q->queues[prio];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				return skb;
+			}
 		}
 	}
 	return NULL;
 
 }
 
+static struct sk_buff *rr_dequeue(struct Qdisc* sch)
+{
+	struct sk_buff *skb;
+	struct prio_sched_data *q = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	int bandcount;
+
+	/* Only take one pass through the queues.  If nothing is available,
+	 * return nothing.
+	 */
+	for (bandcount = 0; bandcount < q->bands; bandcount++) {
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for slower queues.  If the queue is stopped, try the
+		 * next queue.
+		 */
+		if (!netif_subqueue_stopped(sch->dev,
+					    (q->mq ? q->curband : 0))) {
+			qdisc = q->queues[q->curband];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				q->curband++;
+				if (q->curband >= q->bands)
+					q->curband = 0;
+				return skb;
+			}
+		}
+		q->curband++;
+		if (q->curband >= q->bands)
+			q->curband = 0;
+	}
+	return NULL;
+}
+
 

RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread Waskiewicz Jr, Peter P
 Waskiewicz Jr, Peter P wrote:
 [...]
 The only reasonable thing it can do is not care about
 multiqueue and just dequeue as usual. In fact I think it
 should be an error to configure multiqueue on a non-root qdisc.
  
  
  Ack.  This is a thought process that trips me up from time to
  time...I see child qdisc, and think that's the last qdisc to
  dequeue and send to the device, not the first one to dequeue.
  So please disregard my comments before; I totally agree with
  you.  Great catch here; I really like the prio_classify()
  cleanup.
 
 
 Thanks. This updated patch makes configuring a non-root qdisc
 for multiqueue an error.
 

The patch looks fine to me.  Thanks Patrick.

-PJ


Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-28 Thread David Miller
From: Patrick McHardy [EMAIL PROTECTED]
Date: Thu, 28 Jun 2007 21:24:37 +0200

 Waskiewicz Jr, Peter P wrote:
 [...]
 The only reasonable thing it can do is not care about 
 multiqueue and just dequeue as usual. In fact I think it 
 should be an error to configure multiqueue on a non-root qdisc.
  
  
  Ack.  This is a thought process that trips me up from time to time...I
  see child qdisc, and think that's the last qdisc to dequeue and send to
  the device, not the first one to dequeue.  So please disregard my
  comments before; I totally agree with you.  Great catch here; I really
  like the prio_classify() cleanup.
 
 
 Thanks. This updated patch makes configuring a non-root qdisc for
 multiqueue an error.

Ok everything is checked into net-2.6.23, thanks everyone.

Now I get to pose a problem for everyone, prove to me how useful
this new code is by showing me how it can be used to solve a
reocurring problem in virtualized network drivers of which I've
had to code one up recently, see my most recent blog entry at:

http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html

Anyways the gist of the issue is (and this happens for Sun LDOMS
networking, lguest, IBM iSeries, etc.) that we have a single
virtualized network device.  There is a port to the control
node (which switches packets to the real network for the guest)
and one port to each of the other guests.

Each guest gets a unique MAC address.  There is a queue per-port
that can fill up.

What all the drivers like this do right now is stop the queue if
any of the per-port queues fill up, and that's what my sunvnet
driver does right now as well.  We can thus only wake up the
queue when all of the ports have some space.

The ports (and thus the queues) are selected by destination
MAC address.  Each port has a remote MAC address, if there
is an exact match with a port's remote MAC we'd use that port
and thus that port's queue.  If there is no exact match
(some other node on the real network, broadcast, multicast,
etc.) we want to use the control node's port and port queue.

So the problem to solve is to make a way for drivers to do the queue
selection before the generic queueing layer starts to try and push
things to the driver.  Perhaps a classifier in the driver or similar.

The solution to this problem generalizes to the other facility
we want now, hashing the transmit queue by smp_processor_id()
or similar.  With that in place we can look at doing the TX locking
per-queue too as is hinted at by the comments above the per-queue
structure in the current net-2.6.23 tree.

My current work-in-progress sunvnet.c driver is included below so
we can discuss things concretely with code.

I'm listening. :-)

---- sunvnet.h ----
#ifndef _SUNVNET_H
#define _SUNVNET_H

#define DESC_NCOOKIES(entry_size)   \
((entry_size) - sizeof(struct vio_net_desc))

/* length of time before we decide the hardware is borked,
 * and dev->tx_timeout() should be called to fix the problem
 */
#define VNET_TX_TIMEOUT (5 * HZ)

#define VNET_TX_RING_SIZE   512
#define VNET_TX_WAKEUP_THRESH(dr)   ((dr)->pending / 4)

/* VNET packets are sent in buffers with the first 6 bytes skipped
 * so that after the ethernet header the IPv4/IPv6 headers are aligned
 * properly.
 */
#define VNET_PACKET_SKIP	6

struct vnet_tx_entry {
	void			*buf;
	unsigned int		ncookies;
	struct ldc_trans_cookie	cookies[2];
};

struct vnet;
struct vnet_port {
struct vio_driver_state vio;

struct hlist_node   hash;
u8  raddr[ETH_ALEN];

struct vnet *vp;

struct vnet_tx_entrytx_bufs[VNET_TX_RING_SIZE];

struct list_headlist;
};

static inline struct vnet_port *to_vnet_port(struct vio_driver_state *vio)
{
return container_of(vio, struct vnet_port, vio);
}

#define VNET_PORT_HASH_SIZE 16
#define VNET_PORT_HASH_MASK (VNET_PORT_HASH_SIZE - 1)

static inline unsigned int vnet_hashfn(u8 *mac)
{
unsigned int val = mac[4] ^ mac[5];

	return val & (VNET_PORT_HASH_MASK);
}

struct vnet {
/* Protects port_list and port_hash.  */
spinlock_t  lock;

struct net_device   *dev;

u32 msg_enable;
struct vio_dev  *vdev;

struct list_headport_list;

struct hlist_head   port_hash[VNET_PORT_HASH_SIZE];
};

#endif /* _SUNVNET_H */
---- sunvnet.c ----
/* sunvnet.c: Sun LDOM Virtual Network Driver.
 *
 * Copyright (C) 2007 David S. Miller [EMAIL PROTECTED]
 */

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/slab.h>
#include <linux/delay.h>
#include <linux/init.h>
#include <linux/netdevice.h>
#include <linux/ethtool.h>
#include <linux/etherdevice.h>

#include <asm/vio.h>
#include <asm/ldc.h>

#include "sunvnet.h"


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-25 Thread Waskiewicz Jr, Peter P
   enum
   {
  -   TCA_PRIO_UNPSEC,
  -   TCA_PRIO_TEST,
 
 
 You misunderstood me. You can work on top of my compat 
 attribute patches, but the example code should not have to go 
 in to apply your patch.

Ok.  I'll fix my patches.

  diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 
  475df84..7f14fa6 100644
  --- a/net/sched/Kconfig
  +++ b/net/sched/Kconfig
  @@ -102,8 +102,16 @@ config NET_SCH_ATM
To compile this code as a module, choose M here: the
module will be called sch_atm.
   
  +config NET_SCH_BANDS
  +	bool "Multi Band Queueing (PRIO and RR)"
 
 This options seems useless. Its not used *anywhere* except 
 for dependencies.

I was trying to group the multiqueue qdiscs together with this.  But I
can see just having the multiqueue option for scheduling will cover
this.  I'll remove this.

  +config NET_SCH_BANDS_MQ
 +	bool "Multiple hardware queue support"
  +   depends on NET_SCH_BANDS
 
 
 OK, again:
 
 Introduce NET_SCH_RR. NET_SCH_RR selects NET_SCH_PRIO. 
 Nothing at all changes for NET_SCH_PRIO itself. Additionally 
 introduce a boolean NET_SCH_MULTIQUEUE. No dependencies at 
 all. Use NET_SCH_MULTIQUEUE to guard the multiqueue code in 
 sch_prio.c.
 Your current code doesn't even have any ifdefs anymore 
 though, so this might not be needed at all.
 
 Additionally you could later introduce E1000_MULTIQUEUE and 
 have that select NET_SCH_MULTIQUEUE.

I'll clean this up.  Thanks for the persistence.  :)

  diff --git a/net/sched/sch_generic.c 
 b/net/sched/sch_generic.c index 
  9461e8a..203d5c4 100644
  --- a/net/sched/sch_generic.c
  +++ b/net/sched/sch_generic.c
  @@ -168,7 +168,8 @@ static inline int qdisc_restart(struct 
 net_device *dev)
  spin_unlock(dev-queue_lock);
   
  ret = NETDEV_TX_BUSY;
  -   if (!netif_queue_stopped(dev))
  -	if (!netif_queue_stopped(dev))
  +	if (!netif_queue_stopped(dev) &&
  +	    !netif_subqueue_stopped(dev, skb->queue_mapping))
  		/* churn baby churn .. */
 
 I'll try again - please move this to patch 2/3.

I'm sorry; I misread your original comment about this.  I'll move the
change (although this disappears with Jamal's and KK's qdisc_restart()
cleanup).

  diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index 
  40a13e8..8a716f0 100644
  --- a/net/sched/sch_prio.c
  +++ b/net/sched/sch_prio.c
  @@ -40,9 +40,11 @@
   struct prio_sched_data
   {
  int bands;
  +   int curband; /* for round-robin */
  struct tcf_proto *filter_list;
  u8  prio2band[TC_PRIO_MAX+1];
  struct Qdisc *queues[TCQ_PRIO_BANDS];
  +   unsigned char mq;
   };
   
   
  @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
  #endif
  	if (TC_H_MAJ(band))
  		band = 0;
  +	if (q->mq)
  +		skb->queue_mapping =
  +			q->prio2band[band & TC_PRIO_MAX];
  +	else
  +		skb->queue_mapping = 0;
 
 
 Might look cleaner if you have one central point where 
 queue_mapping is set and the band is returned.

I'll see how easy it'll be to condense this; because the queue being
selected in the qdisc can be different based on a few different things,
I'm not sure how easy it'll be to assign this in one spot.  I'll play
around with it and see what I can come up with.

  +	/* If we're multiqueue, make sure the number of incoming bands
  +	 * matches the number of queues on the device we're associating with.
  +	 */
  +	if (tb[TCA_PRIO_MQ - 1])
  +		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);
 
 
 If you're using it as a flag, please use RTA_GET_FLAG(), 
 otherwise RTA_GET_U8.

Will do.  Thanks.

  +	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
  +		return -EINVAL;
   
  	sch_tree_lock(sch);
  	q->bands = qopt->bands;
  @@ -280,7 +342,7 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
  	memcpy(opt.priomap, q->prio2band, TC_PRIO_MAX+1);
   
  	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
  -	RTA_PUT_U32(skb, TCA_PRIO_TEST, 321);
  +	RTA_PUT_U8(skb, TCA_PRIO_MQ, q->mq);
 
 
 And RTA_PUT_FLAG. Now that I think of it, does it even make 
 sense to have a prio private flag for this instead of a qdisc 
 global one?

There currently aren't any other qdiscs that are natural fits for
multiqueue that I can see.  I can see the benefit though of having this
as a global flag in the qdisc API; let me check it out, and if it makes
sense, I can move it.

   static int __init prio_module_init(void)
   {
  -	return register_qdisc(&prio_qdisc_ops);
  +	register_qdisc(&prio_qdisc_ops);
  +	register_qdisc(&rr_qdisc_ops);
 
 Proper error handling please.

Will do.

Thanks,
-PJ Waskiewicz
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-25 Thread Waskiewicz Jr, Peter P
  @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
  #endif
  	if (TC_H_MAJ(band))
  		band = 0;
  +	if (q->mq)
  +		skb->queue_mapping =
  +			q->prio2band[band & TC_PRIO_MAX];
  +	else
  +		skb->queue_mapping = 0;
 
 
 Might look cleaner if you have one central point where 
 queue_mapping is set and the band is returned.

I've taken a stab at this.  I can have one return point, but I'll still
have multiple assignments of skb-queue_mapping due to the different
branches for which queue to select in the qdisc.  I suppose we can do a
rewrite of prio_classify(), but to me that seems beyond the scope of the
multiqueue patches themselves.  What do you think?

Thanks,
-PJ


Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-25 Thread Patrick McHardy

Waskiewicz Jr, Peter P wrote:
@@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
	if (TC_H_MAJ(band))
		band = 0;
+	if (q->mq)
+		skb->queue_mapping =
+			q->prio2band[band & TC_PRIO_MAX];
+	else
+		skb->queue_mapping = 0;
  
Might look cleaner if you have one central point where 
queue_mapping is set and the band is returned.



I've taken a stab at this.  I can have one return point, but I'll still
have multiple assignments of skb-queue_mapping due to the different
branches for which queue to select in the qdisc.  I suppose we can do a
rewrite of prio_classify(), but to me that seems beyond the scope of the
multiqueue patches themselves.  What do you think?
  


That's not necessary. I just thought you could add one exit point:


...
out:
	skb->queue_mapping = q->mq ? band : 0;
	return q->queues[band];
}

But if that doesn't work don't bother ..



RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-25 Thread Waskiewicz Jr, Peter P
 That's not necessary. I just thought you could add one exit point:
 
 
 ...
 out:
 	skb->queue_mapping = q->mq ? band : 0;
 	return q->queues[band];
 }
 
 But if that doesn't work don't bother ..

Unfortunately it won't, given how band might be used like this to select
the queue:

return q->queues[q->prio2band[band & TC_PRIO_MAX]];

I'll keep this in mind though, and if it can be done cleanly, I'll
submit a patch.

Thanks Patrick,
-PJ


Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-24 Thread Patrick McHardy
PJ Waskiewicz wrote:
 diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
 index 09808b7..ec3a9a5 100644
 --- a/include/linux/pkt_sched.h
 +++ b/include/linux/pkt_sched.h
 @@ -103,8 +103,8 @@ struct tc_prio_qopt
  
  enum
  {
 - TCA_PRIO_UNPSEC,
 - TCA_PRIO_TEST,


You misunderstood me. You can work on top of my compat attribute
patches, but the example code should not have to go in to apply
your patch.


 diff --git a/net/sched/Kconfig b/net/sched/Kconfig
 index 475df84..7f14fa6 100644
 --- a/net/sched/Kconfig
 +++ b/net/sched/Kconfig
 @@ -102,8 +102,16 @@ config NET_SCH_ATM
 To compile this code as a module, choose M here: the
 module will be called sch_atm.
  
 +config NET_SCH_BANDS
 +	bool "Multi Band Queueing (PRIO and RR)"

This option seems useless. It's not used *anywhere* except for
dependencies.

 +	---help---
 +	  Say Y here if you want to use n-band multiqueue packet
 +	  schedulers.  These include a priority-based scheduler and
 +	  a round-robin scheduler.
 +
  config NET_SCH_PRIO
 	tristate "Multi Band Priority Queueing (PRIO)"
 +	depends on NET_SCH_BANDS

And this dependency as well.

   ---help---
 Say Y here if you want to use an n-band priority queue packet
 scheduler.
 @@ -111,6 +119,28 @@ config NET_SCH_PRIO
 To compile this code as a module, choose M here: the
 module will be called sch_prio.
  
 +config NET_SCH_RR
 +	tristate "Multi Band Round Robin Queuing (RR)"
 +	depends on NET_SCH_BANDS

Same here. RR

 +	select NET_SCH_PRIO
 +	---help---
 +	  Say Y here if you want to use an n-band round robin packet
 +	  scheduler.
 +
 +	  The module uses sch_prio for its framework and is aliased as
 +	  "sch_rr", so it will load sch_prio, although it is referred
 +	  to using "sch_rr".
 +
 +config NET_SCH_BANDS_MQ
 +	bool "Multiple hardware queue support"
 +	depends on NET_SCH_BANDS


OK, again:

Introduce NET_SCH_RR. NET_SCH_RR selects NET_SCH_PRIO. Nothing at
all changes for NET_SCH_PRIO itself. Additionally introduce a
boolean NET_SCH_MULTIQUEUE. No dependencies at all. Use
NET_SCH_MULTIQUEUE to guard the multiqueue code in sch_prio.c.
Your current code doesn't even have any ifdefs anymore though,
so this might not be needed at all.

Additionally you could later introduce E1000_MULTIQUEUE and
have that select NET_SCH_MULTIQUEUE.

 diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
 index 9461e8a..203d5c4 100644
 --- a/net/sched/sch_generic.c
 +++ b/net/sched/sch_generic.c
 @@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev)
 	spin_unlock(&dev->queue_lock);
  
 	ret = NETDEV_TX_BUSY;
 -	if (!netif_queue_stopped(dev))
 +	if (!netif_queue_stopped(dev) &&
 +	    !netif_subqueue_stopped(dev, skb->queue_mapping))
 		/* churn baby churn .. */
 		ret = dev_hard_start_xmit(skb, dev);

I'll try again - please move this to patch 2/3.



 diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
 index 40a13e8..8a716f0 100644
 --- a/net/sched/sch_prio.c
 +++ b/net/sched/sch_prio.c
 @@ -40,9 +40,11 @@
  struct prio_sched_data
  {
   int bands;
 + int curband; /* for round-robin */
   struct tcf_proto *filter_list;
   u8  prio2band[TC_PRIO_MAX+1];
   struct Qdisc *queues[TCQ_PRIO_BANDS];
 + unsigned char mq;
  };
  
  
 @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
  #endif
 	if (TC_H_MAJ(band))
 		band = 0;
 +	if (q->mq)
 +		skb->queue_mapping =
 +			q->prio2band[band & TC_PRIO_MAX];
 +	else
 +		skb->queue_mapping = 0;


Might look cleaner if you have one central point where queue_mapping is
set and the band is returned.

 + /* If we're multiqueue, make sure the number of incoming bands
 +  * matches the number of queues on the device we're associating with.
 +  */
 +	if (tb[TCA_PRIO_MQ - 1])
 +		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);


If you're using it as a flag, please use RTA_GET_FLAG(),
otherwise RTA_GET_U8.

 +	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
 +		return -EINVAL;
  
 	sch_tree_lock(sch);
 	q->bands = qopt->bands;
 @@ -280,7 +342,7 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
 	memcpy(opt.priomap, q->prio2band, TC_PRIO_MAX+1);
  
 	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
 -	RTA_PUT_U32(skb, TCA_PRIO_TEST, 321);
 +	RTA_PUT_U8(skb, TCA_PRIO_MQ, q->mq);


And RTA_PUT_FLAG. Now that I think of it, does it even make sense
to have a prio private flag for this instead of a qdisc global one?

  static int __init prio_module_init(void)
  {
 -	return register_qdisc(&prio_qdisc_ops);
 + 

Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-24 Thread Patrick McHardy
PJ Waskiewicz wrote:
 + /* If we're multiqueue, make sure the number of incoming bands
 +  * matches the number of queues on the device we're associating with.
 +  */
 +	if (tb[TCA_PRIO_MQ - 1])
 +		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);
 +
 +	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
 +		return -EINVAL;


A nice thing you could do for the user here is use
egress_subqueue_count as default when qopt->bands == 0
(and change tc prio to accept 0 in case it doesn't).


[PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-23 Thread PJ Waskiewicz
Updated: This patch applies on top of Patrick McHardy's RTNETLINK
nested compat attribute patches.  These are required to preserve
ABI for iproute2 when working with the multiqueue qdiscs.

Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio and sch_rr to be compiled with or without multiqueue hardware
support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.

Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED]
---

 include/linux/pkt_sched.h |4 +-
 net/sched/Kconfig |   30 +
 net/sched/sch_generic.c   |3 +
 net/sched/sch_prio.c  |  106 -
 4 files changed, 129 insertions(+), 14 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 09808b7..ec3a9a5 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -103,8 +103,8 @@ struct tc_prio_qopt
 
 enum
 {
-   TCA_PRIO_UNPSEC,
-   TCA_PRIO_TEST,
+   TCA_PRIO_UNSPEC,
+   TCA_PRIO_MQ,
__TCA_PRIO_MAX
 };
 
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..7f14fa6 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -102,8 +102,16 @@ config NET_SCH_ATM
  To compile this code as a module, choose M here: the
  module will be called sch_atm.
 
+config NET_SCH_BANDS
+	bool "Multi Band Queueing (PRIO and RR)"
+	---help---
+	  Say Y here if you want to use n-band multiqueue packet
+	  schedulers.  These include a priority-based scheduler and
+	  a round-robin scheduler.
+
 config NET_SCH_PRIO
	tristate "Multi Band Priority Queueing (PRIO)"
+	depends on NET_SCH_BANDS
	---help---
	  Say Y here if you want to use an n-band priority queue packet
	  scheduler.
@@ -111,6 +119,28 @@ config NET_SCH_PRIO
  To compile this code as a module, choose M here: the
  module will be called sch_prio.
 
+config NET_SCH_RR
+	tristate "Multi Band Round Robin Queuing (RR)"
+	depends on NET_SCH_BANDS
+	select NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to use an n-band round robin packet
+	  scheduler.
+
+	  The module uses sch_prio for its framework and is aliased as
+	  "sch_rr", so it will load sch_prio, although it is referred
+	  to using "sch_rr".
+
+config NET_SCH_BANDS_MQ
+	bool "Multiple hardware queue support"
+	depends on NET_SCH_BANDS
+	---help---
+	  Say Y here if you want to allow the PRIO and RR qdiscs to assign
+	  flows to multiple hardware queues on an ethernet device.  This
+	  will still work on devices with 1 queue.
+
+	  Most people will say N here.
+
 config NET_SCH_RED
	tristate "Random Early Detection (RED)"
	---help---
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 9461e8a..203d5c4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev)
	spin_unlock(&dev->queue_lock);
 
	ret = NETDEV_TX_BUSY;
-	if (!netif_queue_stopped(dev))
+	if (!netif_queue_stopped(dev) &&
+	    !netif_subqueue_stopped(dev, skb->queue_mapping))
/* churn baby churn .. */
ret = dev_hard_start_xmit(skb, dev);
 
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 40a13e8..8a716f0 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -40,9 +40,11 @@
 struct prio_sched_data
 {
int bands;
+   int curband; /* for round-robin */
struct tcf_proto *filter_list;
u8  prio2band[TC_PRIO_MAX+1];
struct Qdisc *queues[TCQ_PRIO_BANDS];
+   unsigned char mq;
 };
 
 
@@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
	if (TC_H_MAJ(band))
		band = 0;
+	if (q->mq)
+		skb->queue_mapping =
+			q->prio2band[band & TC_PRIO_MAX];
+	else
+		skb->queue_mapping = 0;
	return q->queues[q->prio2band[band & TC_PRIO_MAX]];
	}
	band = res.classid;
	}
	band = TC_H_MIN(band) - 1;
-	if (band >= q->bands)
+	if (band >= q->bands) {
+		if (q->mq)
+			skb->queue_mapping = q->prio2band[0];
+		else
+			skb->queue_mapping = 0;
		return q->queues[q->prio2band[0]];
+	}
 
+	if (q->mq)
+		skb->queue_mapping = band;
+	else
+		skb->queue_mapping = 0;
	return q->queues[band];
 }
 
@@ -144,17 +160,57 @@ prio_dequeue(struct Qdisc* sch)
struct Qdisc *qdisc;
 
	for (prio = 0; prio < q->bands; prio++) {
-   

RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-22 Thread Waskiewicz Jr, Peter P
   #include <linux/module.h>
  @@ -40,9 +42,13 @@
   struct prio_sched_data
   {
  int bands;
  +#ifdef CONFIG_NET_SCH_RR
  +   int curband; /* for round-robin */
  +#endif
  struct tcf_proto *filter_list;
  u8  prio2band[TC_PRIO_MAX+1];
  struct Qdisc *queues[TCQ_PRIO_BANDS];
  +   u16 band2queue[TC_PRIO_MAX + 1];

 
 Why is this still here? It's a 1:1 mapping.

Thought about this more last night and this morning.  As far as I can
tell, I still need this.  If the qdisc gets loaded with multiqueue
turned on, I can just use the value of band to assign
skb-queue_mapping.  But if the qdisc is loaded without multiqueue
support, then I need to assign a value of zero to queue_mapping, or not
assign it at all (it will be zero'd out before the call to -enqueue()
in dev_queue_xmit()).  But I'd rather not have a conditional in the
hotpath checking if the qdisc is multiqueue; I'd rather have the array
to match the bands so I can just do an assignment.

What do you think?

Thanks,
-PJ


Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-22 Thread Patrick McHardy
Waskiewicz Jr, Peter P wrote:
 #include <linux/module.h>
@@ -40,9 +42,13 @@
 struct prio_sched_data
 {
 int bands;
+#ifdef CONFIG_NET_SCH_RR
+int curband; /* for round-robin */
+#endif
 struct tcf_proto *filter_list;
 u8  prio2band[TC_PRIO_MAX+1];
 struct Qdisc *queues[TCQ_PRIO_BANDS];
+u16 band2queue[TC_PRIO_MAX + 1];
  

Why is this still here? It's a 1:1 mapping.
 
 
 Thought about this more last night and this morning.  As far as I can
 tell, I still need this.  If the qdisc gets loaded with multiqueue
 turned on, I can just use the value of band to assign
 skb->queue_mapping.  But if the qdisc is loaded without multiqueue
 support, then I need to assign a value of zero to queue_mapping, or not
 assign it at all (it will be zero'd out before the call to ->enqueue()
 in dev_queue_xmit()).  But I'd rather not have a conditional in the
 hotpath checking if the qdisc is multiqueue; I'd rather have the array
 to match the bands so I can just do an assignment.
 
 What do you think?


I very much doubt that it has any measurable impact. You can
also add a small inline function

void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)
{
#ifdef CONFIG_NET_SCH_MULTIQUEUE
	skb->queue_mapping = queue;
#else
	skb->queue_mapping = 0;
#endif
}


Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-22 Thread Patrick McHardy
Patrick McHardy wrote:
 void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)
 {
 #ifdef CONFIG_NET_SCH_MULTIQUEUE
 	skb->queue_mapping = queue;
 #else
 	skb->queue_mapping = 0;
 #endif
 }


Maybe even use it everywhere and guard skb->queue_mapping by
an #ifdef; on 32 bit it does enlarge the skb.



Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-22 Thread Patrick McHardy
Patrick McHardy wrote:
 Waskiewicz Jr, Peter P wrote:
 
Thought about this more last night and this morning.  As far as I can
tell, I still need this.  If the qdisc gets loaded with multiqueue
turned on, I can just use the value of band to assign
skb->queue_mapping.  But if the qdisc is loaded without multiqueue
support, then I need to assign a value of zero to queue_mapping, or not
assign it at all (it will be zero'd out before the call to ->enqueue()
in dev_queue_xmit()).  But I'd rather not have a conditional in the
hotpath checking if the qdisc is multiqueue; I'd rather have the array
to match the bands so I can just do an assignment.

What do you think?
 
 
 
 I very much doubt that it has any measurable impact. You can
 also add a small inline function
 
 void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)


OK I didn't really listen obviously :) A compile time option
won't help. Just remove it and assign it conditionally.


RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-22 Thread Waskiewicz Jr, Peter P
 Patrick McHardy wrote:
  Waskiewicz Jr, Peter P wrote:
  
 Thought about this more last night and this morning.  As 
 far as I can 
 tell, I still need this.  If the qdisc gets loaded with multiqueue 
 turned on, I can just use the value of band to assign
 skb->queue_mapping.  But if the qdisc is loaded without multiqueue
 support, then I need to assign a value of zero to queue_mapping, or 
 not assign it at all (it will be zero'd out before the call to 
 ->enqueue() in dev_queue_xmit()).  But I'd rather not have a 
 conditional in the hotpath checking if the qdisc is multiqueue; I'd 
 rather have the array to match the bands so I can just do 
 an assignment.
 
 What do you think?
  
  
  
  I very much doubt that it has any measurable impact. You 
 can also add 
  a small inline function
  
  void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)
 
 
 OK I didn't really listen obviously :) A compile time option 
 won't help. Just remove it and assign it conditionally.

Sounds good.  Thanks Patrick.

-PJ


[PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-21 Thread PJ Waskiewicz
Add the new sch_rr qdisc for multiqueue network device support.
Allow sch_prio to be compiled with or without multiqueue hardware
support.

sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS.  This
was done since sch_prio and sch_rr only differ in their dequeue routine.

Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED]
---

 net/sched/Kconfig   |   32 
 net/sched/sch_generic.c |3 +
 net/sched/sch_prio.c|  123 ---
 3 files changed, 150 insertions(+), 8 deletions(-)

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..ca0b352 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -102,8 +102,16 @@ config NET_SCH_ATM
  To compile this code as a module, choose M here: the
  module will be called sch_atm.
 
+config NET_SCH_BANDS
+	bool "Multi Band Queueing (PRIO and RR)"
+	---help---
+	  Say Y here if you want to use n-band multiqueue packet
+	  schedulers.  These include a priority-based scheduler and
+	  a round-robin scheduler.
+
 config NET_SCH_PRIO
	tristate "Multi Band Priority Queueing (PRIO)"
+	depends on NET_SCH_BANDS
	---help---
	  Say Y here if you want to use an n-band priority queue packet
	  scheduler.
@@ -111,6 +119,30 @@ config NET_SCH_PRIO
	  To compile this code as a module, choose M here: the
	  module will be called sch_prio.
 
+config NET_SCH_PRIO_MQ
+	bool "Multiple hardware queue support for PRIO"
+	depends on NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to allow the PRIO qdisc to assign
+	  flows to multiple hardware queues on an ethernet device.  This
+	  will still work on devices with 1 queue.
+
+	  Consider this scheduler for devices that do not use
+	  hardware-based scheduling policies.  Otherwise, use NET_SCH_RR.
+
+	  Most people will say N here.
+
+config NET_SCH_RR
+	bool "Multi Band Round Robin Queuing (RR)"
+	depends on NET_SCH_BANDS && NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to use an n-band round robin packet
+	  scheduler.
+
+	  The module uses sch_prio for its framework and is aliased as
+	  "sch_rr", so it will load sch_prio, although it is referred
+	  to using "sch_rr".
+
 config NET_SCH_RED
	tristate "Random Early Detection (RED)"
	---help---
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 9461e8a..203d5c4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev)
	spin_unlock(&dev->queue_lock);
 
	ret = NETDEV_TX_BUSY;
-	if (!netif_queue_stopped(dev))
+	if (!netif_queue_stopped(dev) &&
+	    !netif_subqueue_stopped(dev, skb->queue_mapping))
/* churn baby churn .. */
ret = dev_hard_start_xmit(skb, dev);
 
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 6d7542c..4eb3ba5 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -9,6 +9,8 @@
  * Authors:Alexey Kuznetsov, [EMAIL PROTECTED]
  * Fixes:   19990609: J Hadi Salim [EMAIL PROTECTED]:
  *  Init --  EINVAL when opt undefined
+ * Additions:  Peter P. Waskiewicz Jr. [EMAIL PROTECTED]
+ * Added round-robin scheduling for selection at load-time
  */
 
 #include <linux/module.h>
@@ -40,9 +42,13 @@
 struct prio_sched_data
 {
int bands;
+#ifdef CONFIG_NET_SCH_RR
+   int curband; /* for round-robin */
+#endif
struct tcf_proto *filter_list;
u8  prio2band[TC_PRIO_MAX+1];
struct Qdisc *queues[TCQ_PRIO_BANDS];
+   u16 band2queue[TC_PRIO_MAX + 1];
 };
 
 
@@ -70,14 +76,19 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 #endif
	if (TC_H_MAJ(band))
		band = 0;
+	skb->queue_mapping =
+		q->band2queue[q->prio2band[band & TC_PRIO_MAX]];
	return q->queues[q->prio2band[band & TC_PRIO_MAX]];
	}
	band = res.classid;
	}
	band = TC_H_MIN(band) - 1;
-	if (band >= q->bands)
+	if (band >= q->bands) {
+		skb->queue_mapping = q->band2queue[q->prio2band[0]];
		return q->queues[q->prio2band[0]];
+	}
 
+	skb->queue_mapping = q->band2queue[band];
	return q->queues[band];
 }
 
@@ -144,17 +155,59 @@ prio_dequeue(struct Qdisc* sch)
struct Qdisc *qdisc;
 
	for (prio = 0; prio < q->bands; prio++) {
-		qdisc = q->queues[prio];
-		skb = qdisc->dequeue(qdisc);
-		if (skb) {
-			sch->q.qlen--;
-			return skb;
+		/* Check if the target subqueue is available before
+		 * pulling an skb.  This way we avoid excessive requeues
+		 * for 

Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-21 Thread Patrick McHardy

PJ Waskiewicz wrote:

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 475df84..ca0b352 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -102,8 +102,16 @@ config NET_SCH_ATM
  To compile this code as a module, choose M here: the
  module will be called sch_atm.
 
+config NET_SCH_BANDS
+	bool "Multi Band Queueing (PRIO and RR)"
+	---help---
+	  Say Y here if you want to use n-band multiqueue packet
+	  schedulers.  These include a priority-based scheduler and
+	  a round-robin scheduler.
+
 config NET_SCH_PRIO
	tristate "Multi Band Priority Queueing (PRIO)"
+	depends on NET_SCH_BANDS
	---help---
	  Say Y here if you want to use an n-band priority queue packet
	  scheduler.
@@ -111,6 +119,30 @@ config NET_SCH_PRIO
	  To compile this code as a module, choose M here: the
	  module will be called sch_prio.
 
+config NET_SCH_PRIO_MQ
+	bool "Multiple hardware queue support for PRIO"
+	depends on NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to allow the PRIO qdisc to assign
+	  flows to multiple hardware queues on an ethernet device.  This
+	  will still work on devices with 1 queue.
+
+	  Consider this scheduler for devices that do not use
+	  hardware-based scheduling policies.  Otherwise, use NET_SCH_RR.
+
+	  Most people will say N here.
+
+config NET_SCH_RR
+	bool "Multi Band Round Robin Queuing (RR)"
+	depends on NET_SCH_BANDS && NET_SCH_PRIO
+	---help---
+	  Say Y here if you want to use an n-band round robin packet
+	  scheduler.
+
+	  The module uses sch_prio for its framework and is aliased as
+	  "sch_rr", so it will load sch_prio, although it is referred
+	  to using "sch_rr".
  


The dependencies seem to be very confused. SCHED_PRIO does not depend
on anything new, SCH_RR also doesn't depend on anything. SCH_PRIO_MQ
and SCH_RR_MQ (which is missing) depend on SCH_PRIO/SCH_RR. A single
NET_SCH_MULTIQUEUE option seems better than adding one per scheduler
though.


--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -9,6 +9,8 @@
  * Authors:Alexey Kuznetsov, [EMAIL PROTECTED]
  * Fixes:   19990609: J Hadi Salim [EMAIL PROTECTED]:
  *  Init --  EINVAL when opt undefined
+ * Additions:  Peter P. Waskiewicz Jr. [EMAIL PROTECTED]
+ * Added round-robin scheduling for selection at load-time
  


git keeps changelogs, please don't add it here.


  */
 
 #include <linux/module.h>

@@ -40,9 +42,13 @@
 struct prio_sched_data
 {
int bands;
+#ifdef CONFIG_NET_SCH_RR
+   int curband; /* for round-robin */
+#endif
struct tcf_proto *filter_list;
u8  prio2band[TC_PRIO_MAX+1];
struct Qdisc *queues[TCQ_PRIO_BANDS];
+   u16 band2queue[TC_PRIO_MAX + 1];
  


Why is this still here? It's a 1:1 mapping.

@@ -211,6 +265,22 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
return -EINVAL;
}
 
+	/* If we're prio multiqueue or are using round-robin, make
+	 * sure the number of incoming bands matches the number of
+	 * queues on the device we're associating with.
+	 */
+#ifdef CONFIG_NET_SCH_RR
+	if (strcmp("rr", sch->ops->id) == 0)
+		if (qopt->bands != sch->dev->egress_subqueue_count)
+			return -EINVAL;
+#endif
+
+#ifdef CONFIG_NET_SCH_PRIO_MQ
+	if (strcmp("prio", sch->ops->id) == 0)
+		if (qopt->bands != sch->dev->egress_subqueue_count)
+			return -EINVAL;
+#endif
  


For the tenth time now, the user should enable this at
runtime. You can't just break things dependant on config
options.




RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-21 Thread Waskiewicz Jr, Peter P
 The dependencies seem to be very confused. SCHED_PRIO does 
 not depend on anything new, SCH_RR also doesn't depend on 
 anything. SCH_PRIO_MQ and SCH_RR_MQ (which is missing) depend 
 on SCH_PRIO/SCH_RR. A single NET_SCH_MULTIQUEUE option seems 
 better than adding one per scheduler though.

I agree with a NET_SCH_MULTIQUEUE option.  However, SCH_RR does depend
on SCH_PRIO being built since it's the same code, doesn't it?  Maybe I'm
not understanding something about the build process.  I'll clean this
up.

 
  --- a/net/sched/sch_prio.c
  +++ b/net/sched/sch_prio.c
  @@ -9,6 +9,8 @@
* Authors:Alexey Kuznetsov, [EMAIL PROTECTED]
* Fixes:   19990609: J Hadi Salim [EMAIL PROTECTED]:
*  Init --  EINVAL when opt undefined
  + * Additions:  Peter P. Waskiewicz Jr. 
 [EMAIL PROTECTED]
  + * Added round-robin scheduling for selection at load-time

 
 git keeps changelogs, please don't add it here.

Roger.

  struct tcf_proto *filter_list;
  u8  prio2band[TC_PRIO_MAX+1];
  struct Qdisc *queues[TCQ_PRIO_BANDS];
  +   u16 band2queue[TC_PRIO_MAX + 1];

 
 Why is this still here? It's a 1:1 mapping.

I'll fix this.

  @@ -211,6 +265,22 @@ static int prio_tune(struct Qdisc 
 *sch, struct rtattr *opt)
  return -EINVAL;
  }
   
  +   /* If we're prio multiqueue or are using round-robin, make
  +* sure the number of incoming bands matches the number of
  +* queues on the device we're associating with.
  +*/
  +#ifdef CONFIG_NET_SCH_RR
  +	if (strcmp("rr", sch->ops->id) == 0)
  +		if (qopt->bands != sch->dev->egress_subqueue_count)
  +			return -EINVAL;
  +#endif
  +
  +#ifdef CONFIG_NET_SCH_PRIO_MQ
  +	if (strcmp("prio", sch->ops->id) == 0)
  +		if (qopt->bands != sch->dev->egress_subqueue_count)
  +			return -EINVAL;
  +#endif

 
 For the tenth time now, the user should enable this at 
 runtime. You can't just break things dependant on config options.

I had this in sch_prio and tc before, and was told to remove it because
of ABI issues.  I can put it back in, but I'm not sure what those
previous ABI issues were.  Was it backwards compatibility that you
referred to before that was broken?

As always, the feedback is very much appreciated.  I'll get these fixes
in as soon as possible.

-PJ


Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-21 Thread Patrick McHardy

Waskiewicz Jr, Peter P wrote:
The dependencies seem to be very confused. SCHED_PRIO does 
not depend on anything new, SCH_RR also doesn't depend on 
anything. SCH_PRIO_MQ and SCH_RR_MQ (which is missing) depend 
on SCH_PRIO/SCH_RR. A single NET_SCH_MULTIQUEUE option seems 
better than adding one per scheduler though.



I agree with a NET_SCH_MULTIQUEUE option.  However, SCH_RR does depend
on SCH_PRIO being built since it's the same code, doesn't it?  Maybe I'm
not understanding something about the build process.  I'll clean this
up.


The easiest solution is to select SCH_PRIO from SCH_RR.
I had something else in mind initially, but that is
needlessly complicated.
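Concretely, the `select`-based Kconfig shape would look like this (option text taken from the posted patch; sketch only):

```kconfig
config NET_SCH_RR
	tristate "Multi Band Round Robin Queuing (RR)"
	select NET_SCH_PRIO
	---help---
	  Say Y here if you want to use an n-band round robin packet
	  scheduler.
```

`select` pulls sch_prio in automatically, so no `depends on` line is needed and SCH_RR remains independently selectable.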




For the tenth time now, the user should enable this at 
runtime. You can't just break things dependant on config options.



I had this in sch_prio and tc before, and was told to remove it because
of ABI issues.  I can put it back in, but I'm not sure what those
previous ABI issues were.  Was it backwards compatibility that you
referred to before that was broken?


Your tc changes changed the structure in a way that old tc binaries
wouldn't work anymore. This version breaks configurations that use
a number of bands not matching the HW queues when the user enables
the multiqueue compile time option.

Unfortunately prio does not use nested attributes, so the easiest
way is extending struct tc_prio_qopt at the end and checking the
attribute size to decide whether it's an old or a new version.

A better fix would be to introduce a new qdisc configuration
attribute that takes precedence before TCA_OPTIONS and have
userspace send both the old non-nested structure and a new
nested configuration. That would make sure we never run into
this problem again.

