Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-07 Thread David Miller
From: jamal <[EMAIL PROTECTED]>
Date: Fri, 06 Jul 2007 10:39:15 -0400

> If the issue is usability of listing 1024 netdevices, i can think of
> many ways to resolve it.

I would agree with this if there were a reason for it, but as far as I
can see it's totally unnecessary complication.

These virtual devices are an ethernet with the subnet details exposed
to the driver, nothing more.

I see zero benefit whatsoever to having a netdev for each guest or node
we can speak to.  It's a very heavy abstraction to use for something
that is so bloody simple.

My demux on ->hard_start_xmit() is _5 DAMN LINES OF CODE_, and you want
to replace that with a full netdev because of some minor difficulties in
figuring out how to record the queuing state.  It's beyond unreasonable.
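
(For illustration only: a minimal sketch of that kind of demux.  struct
vnet, struct vnet_port, vnet_port_by_mac() and vnet_port_xmit() are
invented for the example, not taken from the real sunvnet driver.)

#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>

struct vnet_port;                       /* hypothetical per-peer port  */

struct vnet {                           /* hypothetical driver private */
        struct vnet_port *control_port; /* port to the control node    */
        /* ... hash of ports keyed by remote MAC ... */
};

/* Both helpers are assumed, not real sunvnet functions. */
struct vnet_port *vnet_port_by_mac(struct vnet *vp, const u8 *mac);
int vnet_port_xmit(struct vnet_port *port, struct sk_buff *skb);

static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct vnet *vp = netdev_priv(dev);
        struct ethhdr *eth = (struct ethhdr *)skb->data;
        struct vnet_port *port;

        /* An exact match on a peer's remote MAC selects that peer's port;
         * everything else (off-box unicast, broadcast, multicast) goes
         * out through the control node's port.
         */
        port = vnet_port_by_mac(vp, eth->h_dest);
        if (!port)
                port = vp->control_port;

        return vnet_port_xmit(port, skb);
}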

Netdevs are like salt; if you put too much in your food it tastes
awful.


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-07 Thread Rusty Russell
On Fri, 2007-07-06 at 10:39 -0400, jamal wrote:
> The first thing that crossed my mind was "if you want to select a
> destination port based on a destination MAC you are talking about a
> switch/bridge". You bring up the issue of "a huge number of virtual NICs
> if you wanted arbitrary guests" which is a real one[2].

Hi Jamal,

I'm deeply tempted to agree with you that the answer is multiple
virtual NICs (and I've been tempted to abandon lguest's N-way transport
scheme), except that it looks like we're going to have multi-queue NICs
for other reasons.

Otherwise I'd be tempted to say "create/destroy virtual NICs as other
guests appear/vanish from the network".  No one does this today, but that
doesn't make it wrong.

> If i got this right, still not answering the netif_stop question posed:
> the problem you are also trying to resolve now is get rid of N
> netdevices on each guest for a usability reason; i.e have one netdevice,
> move the bridging/switching functionality/tables into the driver;
> replace the ports with queues instead of netdevices. Did i get that
> right? 

Yep, well summarized.  I guess the question is: should the Intel guys
be representing their multi-queue NICs as multiple NICs rather than
adding the subqueue concept?

> BTW, one curve that threw me off a little is it seems most of the
> hardware that provides virtualization also provides point-to-point
> connections between different domains; i always thought that they all
> provided a point-to-point to the dom0 equivalent and let the dom0 worry
> about how things get from domainX to domainY.

Yeah, but that has obvious limitations as people care more about
inter-guest I/O: we want direct inter-guest networking...

Cheers,
Rusty.



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-06 Thread James Chapman

jamal wrote:
> If the issue is usability of listing 1024 netdevices, i can think of
> many ways to resolve it.
> One way we can resolve the listing is with a simple tag to the netdev
> struct i could say "list netdevices for guest 0-10" etc etc.


This would be a useful feature, not only for virtualization. I've seen 
some boxes with thousands of net devices (mostly ppp, but also some 
ATM). It would be nice to be able to assign a tag to an arbitrary set of 
devices.


Does the network namespace stuff help with any of this?

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-06 Thread jamal
On Fri, 2007-06-07 at 17:32 +1000, Rusty Russell wrote:

[..some good stuff deleted here ..]

> Hope that adds something,

It does - thanks. 
 
I think I was letting my experience pollute my thinking earlier when
Dave posted. The copy-avoidance requirement is now clear to me[1].

I had another issue which wasn't clear, but you touched on it, so this
breaks the ice for me - be gentle please, my asbestos suit is full of
dust these days ;->:

The first thing that crossed my mind was "if you want to select a
destination port based on a destination MAC you are talking about a
switch/bridge". You bring up the issue of "a huge number of virtual NICs
if you wanted arbitrary guests", which is a real one[2].
Let's take the case of a small number of guests; a bridge would of
course solve the problem while keeping the copy-avoidance, with the
caveats being:
- you now have N bridges and their respective tables for N domains, i.e.
one on each domain
- N netdevices on each domain as well (of course you could say that is
not very different resource-wise from N queues instead).

If I got this right (still not answering the netif_stop question
posed): the problem you are also trying to resolve now is getting rid of
N netdevices on each guest for usability reasons; i.e. have one
netdevice, move the bridging/switching functionality/tables into the
driver, and replace the ports with queues instead of netdevices. Did I
get that right?
If the issue is the usability of listing 1024 netdevices, I can think of
many ways to resolve it.
One way to resolve the listing is with a simple tag in the netdev
struct, so I could say "list netdevices for guests 0-10", etc.
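
(A purely hypothetical sketch of such a tag - neither the guest_tag
field nor this helper exists anywhere; they are invented only to show
the idea:)

#include <linux/netdevice.h>

/* Invented sketch: a guest tag stored in struct net_device (no such
 * field exists) plus a range-filtered listing, so "list netdevices for
 * guests 0-10" becomes a simple range check.  Loop style roughly as of
 * net-2.6.23.
 */
static void list_netdevs_for_guests(u32 lo, u32 hi)
{
        struct net_device *dev;

        read_lock(&dev_base_lock);
        for_each_netdev(dev) {
                if (dev->guest_tag >= lo && dev->guest_tag <= hi)
                        printk(KERN_INFO "%s (guest %u)\n",
                               dev->name, dev->guest_tag);
        }
        read_unlock(&dev_base_lock);
}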

I am having a little trouble conceptually differentiating a guest from
the host/dom0 if you want to migrate the switching/bridging functions
into each guest. So even if this doesn't apply to all domains, it does
apply to dom0.
I like netdevices today (as opposed to queues within netdevices):
- the stack knows them well (I can add IP addresses, point routes at
them, change MAC addresses, bring them administratively down/up, add QoS
rules, etc.).
I can also tie netdevices to a CPU and therefore scale that way. I see
this as viable at least from the host/dom0 perspective if a netdevice
represents a guest.

Sorry for the long email - drained some of my morning coffee.
Ok, kill me.

cheers,
jamal

[1] My experience is around qemu/UML/old-openvz - their model is to let
the host do the routing/switching between guests or to the outside of
the box. From your description I would add Xen to that behavior.
From Dave's posting, I understand that, for many good reasons, any time
you move from one domain to another you are copying. So if you use Xen
and you want to go from domainX to domainY you go through dom0, which
implies copying domainX->dom0 then dom0->domainY.
BTW, one curve that threw me off a little is that it seems most of the
hardware that provides virtualization also provides point-to-point
connections between different domains; I always thought that they all
provided a point-to-point link to the dom0 equivalent and let dom0 worry
about how things get from domainX to domainY.

[2] Unfortunately that means if I wanted 1024 virtual routers/guest
domains I would have at least 1024 netdevices on each guest connected to
the bridge on the guest. I have a freaking problem listing the 72
netdevices I have on some device today.



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-06 Thread Rusty Russell
On Tue, 2007-07-03 at 22:20 -0400, jamal wrote:
> On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
> [.. some useful stuff here deleted ..]
> 
> > That's why you have to copy into a purpose-built set of memory
> > that is composed of pages that _ONLY_ contain TX packet buffers
> > and nothing else.
> > 
> > The cost of going through the switch is too high, and the copies are
> > necessary, so concentrate on allowing me to map the guest ports to the
> > egress queues.  Anything else is a waste of discussion time, I've been
> > pouring over these issues endlessly for weeks, so if I'm saying doing
> > copies and avoiding the switch is necessary I do in fact mean it. :-)
> 
> ok, i get it Dave ;-> Thanks for your patience, that was useful.
> Now that is clear for me, I will go back and look at your original email
> and try to get back on track to what you really asked ;->

To expand on this, there are already "virtual" NIC drivers in tree
which do the demux based on dst MAC and send to the appropriate other
guest (iseries_veth.c, and Carsten Otte said the S/390 drivers do too).
lguest and DaveM's LDOM make two more.

There is currently no good way to write such a driver.  If one
recipient is full, you have to drop the packet: if you netif_stop_queue()
instead, a slow/buggy recipient blocks packets going to all the other
recipients.  But dropping packets makes networking suck.
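
(To make that concrete, a rough sketch of the two unattractive options
in such a driver's transmit path - the vguest_* structures and helpers
are hypothetical - plus the per-queue alternative the multiqueue patches
are aiming at:)

#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct vguest_port;                                      /* hypothetical */
struct vguest_port *vguest_select_port(struct net_device *dev,
                                       struct sk_buff *skb);
int vguest_port_xmit(struct vguest_port *port, struct sk_buff *skb);
int port_ring_full(struct vguest_port *port);

static int vguest_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct vguest_port *port = vguest_select_port(dev, skb);

        if (port_ring_full(port)) {
                /* Option 1: drop.  Traffic to this one peer now suffers. */
                dev_kfree_skb(skb);
                return NETDEV_TX_OK;

                /* Option 2 (instead of the drop above): netif_stop_queue(dev)
                 * and return NETDEV_TX_BUSY -- but then one slow or buggy
                 * peer stalls traffic to every other peer as well.
                 *
                 * Option 3, with per-queue state (what the multiqueue
                 * patches under discussion aim to allow): stop only this
                 * port's subqueue, e.g. netif_stop_subqueue(), and leave
                 * the other ports' queues running.
                 */
        }

        return vguest_port_xmit(port, skb);
}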

Some hypervisors (e.g. Xen) only have a virtual NIC which is
point-to-point: this sidesteps the issue, with the risk that you might
need a huge number of virtual NICs if you wanted arbitrary guests to
talk to each other (Xen doesn't support that; they route/bridge through
dom0).

Most hypervisors have a sensible maximum on the number of guests they
could talk to, so I'm not too unhappy with a static number of queues.
But the dst MAC -> queue mapping changes in hypervisor-specific ways, so
it really needs to be managed by the driver...

Hope that adds something,
Rusty.




Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-03 Thread jamal
On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
[.. some useful stuff here deleted ..]

> That's why you have to copy into a purpose-built set of memory
> that is composed of pages that _ONLY_ contain TX packet buffers
> and nothing else.
> 
> The cost of going through the switch is too high, and the copies are
> necessary, so concentrate on allowing me to map the guest ports to the
> egress queues.  Anything else is a waste of discussion time, I've been
> pouring over these issues endlessly for weeks, so if I'm saying doing
> copies and avoiding the switch is necessary I do in fact mean it. :-)

OK, I get it, Dave ;-> Thanks for your patience; that was useful.
Now that that is clear to me, I will go back and look at your original
email and try to get back on track with what you really asked ;->

cheers,
jamal



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-03 Thread David Miller
From: jamal <[EMAIL PROTECTED]>
Date: Tue, 03 Jul 2007 08:42:33 -0400

> (likely not in the case of hypervisor based virtualization like Xen)
> just have their skbs cloned when crossing domains, is that not the
> case?[1]
> Assuming they copy, the balance that needs to be stricken now is
> between:

Sigh, I kind of hoped I wouldn't have to give a lesson in
hypervisors and virtualized I/O and all the issues contained
within, but if you keep pushing the "avoid the copy" idea I
guess I am forced to educate. :-)

First, keep in mind that my Linux guest drivers are talking to Solaris
control node servers and switches; I cannot control the API for any of
this stuff.  And I think that's a good thing, in fact.

Exporting memory between nodes is _THE_ problem with virtualized I/O
in hypervisor based systems.

These things should even be able to work between two guests that
simply DO NOT trust each other at all.

With that in mind, the hypervisor provides a very small shim layer of
interfaces for exporting memory between two nodes.  There is a
pseudo-pagetable where you export pages, and a set of interfaces, one of
which copies between imported memory and local memory.

If a guest gets stuck, reboots, or crashes, you have to be able to
revoke the memory the remote node has imported.  When this happens, if
the importing node comes back to life and tries to touch those pages it
takes a fault.

Taking a fault is easy if the nodes go through the hypervisor copy
interface: they just get a return value back.  If, instead, you try to
map in those pages or program them into the IOMMU of the PCI controller,
you get faults, and extremely difficult-to-handle faults at that.  If
the IOMMU takes the exception on a revoked page, your E1000 card resets
when it gets the master abort from the PCI controller.  On the CPU side
you have to annotate every single kernel access to this mapping of
imported pages, just like we have to annotate all userspace accesses
with exception tables mapping load and store instructions to fixup code,
in order to handle the fault correctly.

Next, you don't trust the other end, as we already stated, so you can't
export an object in a page that also belongs to other objects.  For
example, if an SKB's data sits in the same page as the plain-text
password the user just typed in, you can't export that page.

That's why you have to copy into a purpose-built set of memory
that is composed of pages that _ONLY_ contain TX packet buffers
and nothing else.
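
(Roughly, and with entirely made-up names - this is not the real
LDC/hypervisor API - that copy-based TX path amounts to something like:)

#include <linux/skbuff.h>
#include <linux/errno.h>
#include <linux/mm.h>

/* Hypothetical sketch: every outgoing packet is copied into a buffer
 * pool whose pages hold nothing but TX data, and the peer then pulls
 * it in through the hypervisor copy interface.  struct tx_buf, the
 * cookie scheme and notify_peer() are invented for illustration.
 */
struct tx_buf {
        void            *vaddr;   /* backed by pages holding only TX buffers */
        unsigned long   cookie;   /* export handle handed to the peer        */
};

int notify_peer(unsigned long cookie, unsigned int len);   /* assumed */

static int xmit_one(struct tx_buf *buf, struct sk_buff *skb)
{
        if (skb->len > PAGE_SIZE)
                return -EMSGSIZE;

        /* Copy out of the skb: its pages may also hold unrelated data
         * (someone's plain-text password, say) that must never be
         * exported to an untrusted peer.
         */
        skb_copy_bits(skb, 0, buf->vaddr, skb->len);

        /* Tell the peer where to find the data.  It copies the buffer in
         * via the hypervisor shim, which simply returns an error -- rather
         * than faulting -- if this export has meanwhile been revoked.
         */
        return notify_peer(buf->cookie, skb->len);
}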

The cost of going through the switch is too high, and the copies are
necessary, so concentrate on allowing me to map the guest ports to the
egress queues.  Anything else is a waste of discussion time; I've been
poring over these issues endlessly for weeks, so if I say that doing
copies and avoiding the switch is necessary, I do in fact mean it. :-)


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-07-03 Thread jamal
On Sat, 2007-30-06 at 13:33 -0700, David Miller wrote:

> It's like twice as fast, since the switch doesn't have to copy
> the packet in, switch it, then the destination guest copies it
> into it's address space.
> 
> There is approximately one copy for each hop you go over through these
> virtual devices.

OK - I see what you are getting at, and while it makes more sense to me
now, let me continue to be _the_ devil's advocate (sip some espresso
before responding or reading):
For some reason I always thought that packets going across these things
(likely not in the case of hypervisor-based virtualization like Xen)
just have their skbs cloned when crossing domains; is that not the
case?[1]
Assuming they copy, the balance that needs to be struck now is between:

a) the copy is expensive
vs.
b1) for N guests, N^2 queues in the system vs. N queues, and 1-vs-N
replicated global info.
b2) the architectural challenge of now having to deal with a mesh (1-1
mapping) instead of a star topology between the guests.

I don't think #b1 is such a big deal; in the old days when I played
with what is now openvz, I was happy to get 1024 virtual routers/guests
(each running Zebra/OSPF). I could live with a little more wasted memory
if the copying is reduced.
I think subconsciously I am questioning #b2. Do you really need that
sacrifice just so that you can avoid one extra copy between two guests?
If I were running virtual routers or servers, I think the majority of
traffic (by far) would be between a domain and the outside of the box,
not between any two domains within the same box.

cheers,
jamal


[1] But then, if this is true, I can think of a simple way to attack
the other domains: insert a kernel module into a domain that drops the
refcount of each received skb to 0. I would be surprised if the
openvz-type approach hasn't thought this through.




Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-30 Thread David Miller
From: jamal <[EMAIL PROTECTED]>
Date: Sat, 30 Jun 2007 10:52:44 -0400

> On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:
> 
> > Awesome, but let's concentrate on the client since I can actually
> > implement and test anything we come up with :-)
> 
> Ok, you need to clear one premise for me then ;->
> You said the model is for the guest/client to hook have a port to the
> host and one to each guest; i think this is the confusing part for me
> (and may have led to the switch discussion) because i have not seen this
> model used before. What i have seen before is that the host side
> connects the different guests. In such a scenario, on the guest is a
> single port that connects to the host - the host worries (lets forget
> the switch/bridge for a sec) about how to get packets from guestX to
> guestY pending consultation of access control details.
> What is the advantage of direct domain-domain connection? Is it a
> scalable?

It's like twice as fast, since the switch doesn't have to copy the
packet in, switch it, and then have the destination guest copy it into
its address space.

There is approximately one copy for each hop you go through with these
virtual devices.


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-30 Thread jamal
On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:

> Awesome, but let's concentrate on the client since I can actually
> implement and test anything we come up with :-)

OK, you need to clear up one premise for me then ;->
You said the model is for the guest/client to have a port to the host
and one to each guest; I think this is the confusing part for me (and
may have led to the switch discussion) because I have not seen this
model used before. What I have seen before is that the host side
connects the different guests. In such a scenario the guest has a
single port that connects to the host, and the host worries (let's
forget the switch/bridge for a sec) about how to get packets from guestX
to guestY, pending consultation of the access control details.
What is the advantage of a direct domain-to-domain connection? Is it
scalable?

cheers,
jamal



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-30 Thread Benny Amorsen
> "DM" == David Miller <[EMAIL PROTECTED]> writes:

DM> And some people still use hubs, believe it or not.

Hubs are 100Mbps at most. You could of course make a flooding Gbps
switch, but it would be rather silly. If you care about multicast
performance, you get a switch with IGMP snooping.


/Benny




Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread David Miller
From: jamal <[EMAIL PROTECTED]>
Date: Fri, 29 Jun 2007 21:30:53 -0400

> On Fri, 2007-29-06 at 14:31 -0700, David Miller wrote:
> > Maybe for the control node switch, yes, but not for the guest network
> > devices.
> 
> And that is precisely what i was talking about - and i am sure thats how
> the discussion with Patrick was.

Awesome, but let's concentrate on the client since I can actually
implement and test anything we come up with :-)


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread jamal
On Fri, 2007-29-06 at 14:31 -0700, David Miller wrote:
> This conversation begins to go into a pointless direction already, as
> I feared it would.
> 
> Nobody is going to configure bridges, classification, tc, and all of
> this other crap just for a simple virtualized guest networking device.
> 
> It's a confined and well defined case that doesn't need any of that.
> You've got to be fucking kidding me if you think I'm going to go
> through the bridging code and all of that layering instead of my
> hash demux on transmit which is 4 or 5 lines of C code at best.
> 
> Such a suggestion is beyond stupid.
> 

OK, calm down - will you please?
If you are soliciting opinions, then you should be expecting all sorts
of answers; otherwise, why bother posting? If you think you are
misunderstood, just clarify. Otherwise you are being totally
unreasonable.

> Maybe for the control node switch, yes, but not for the guest network
> devices.

And that is precisely what I was talking about - and I am sure that's
how the discussion with Patrick went.

cheers,
jamal



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread David Miller
From: Ben Greear <[EMAIL PROTECTED]>
Date: Fri, 29 Jun 2007 08:33:06 -0700

> Patrick McHardy wrote:
> > Right, but the current bridging code always uses promiscous mode
> > and its nice to avoid that if possible. Looking at the code, it
> > should be easy to avoid though by disabling learning (and thus
> > promisous mode) and adding unicast filters for all static fdb entries.
> >   
> I am curious about why people are so hot to do away with promisc mode.  
> It seems to me
> that in a modern switched environment, there should only very rarely be 
> unicast packets received
> on an interface that does not want to receive them.
> 
> Could someone give a quick example of when I am wrong and promisc mode 
> would allow
> a NIC to receive a significant number of packets not really destined for it?

Your neighbour on the switch is being pummeled with multicast traffic,
and now you get to see it all too.

Switches don't obviate the cost of promiscuous mode.  You keep wanting
to discuss this and think it doesn't matter, but it does.

And some people still use hubs, believe it or not.


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread David Miller

This conversation is already heading in a pointless direction, as I
feared it would.

Nobody is going to configure bridges, classification, tc, and all of
this other crap just for a simple virtualized guest networking device.

It's a confined and well defined case that doesn't need any of that.
You've got to be fucking kidding me if you think I'm going to go
through the bridging code and all of that layering instead of my
hash demux on transmit which is 4 or 5 lines of C code at best.

Such a suggestion is beyond stupid.

Maybe for the control node switch, yes, but not for the guest network
devices.


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Ben Greear

Patrick McHardy wrote:
> Ben Greear wrote:
>> Could someone give a quick example of when I am wrong and promisc mode
>> would allow a NIC to receive a significant number of packets not
>> really destined for it?
>
> In a switched environment it won't have a big effect, I agree.
> It might help avoid receiving unwanted multicast traffic, which
> could be more significant than unicast.
>
> Anyways, why be wasteful when it can be avoided .. :)

Ok, I had forgotten about multicast, thanks for the reminder!

--
Ben Greear <[EMAIL PROTECTED]> 
Candela Technologies Inc  http://www.candelatech.com





Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Patrick McHardy
Ben Greear wrote:
> Patrick McHardy wrote:
> 
>> Right, but the current bridging code always uses promiscous mode
>> and its nice to avoid that if possible. Looking at the code, it
>> should be easy to avoid though by disabling learning (and thus
>> promisous mode) and adding unicast filters for all static fdb entries.
>>   
> 
> I am curious about why people are so hot to do away with promisc mode. 
> It seems to me
> that in a modern switched environment, there should only very rarely be
> unicast packets received
> on an interface that does not want to receive them.


I don't know if that really was Dave's reason to handle it in a driver.

> Could someone give a quick example of when I am wrong and promisc mode
> would allow
> a NIC to receive a significant number of packets not really destined for
> it?


In a switched environment it won't have a big effect, I agree.
It might help avoid receiving unwanted multicast traffic, which
could be more significant than unicast.

Anyways, why be wasteful when it can be avoided .. :)



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Ben Greear

Patrick McHardy wrote:
> Right, but the current bridging code always uses promiscous mode
> and its nice to avoid that if possible. Looking at the code, it
> should be easy to avoid though by disabling learning (and thus
> promisous mode) and adding unicast filters for all static fdb entries.

I am curious about why people are so hot to do away with promisc mode.
It seems to me that in a modern switched environment, there should only
very rarely be unicast packets received on an interface that does not
want to receive them.

Could someone give a quick example of when I am wrong and promisc mode
would allow a NIC to receive a significant number of packets not really
destined for it?

Thanks,
Ben


--
Ben Greear <[EMAIL PROTECTED]> 
Candela Technologies Inc  http://www.candelatech.com





Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread jamal
On Fri, 2007-29-06 at 15:08 +0200, Patrick McHardy wrote:
> jamal wrote:
> > On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:

> Right, but the current bridging code always uses promiscous mode
> and its nice to avoid that if possible. 
> Looking at the code, it
> should be easy to avoid though by disabling learning (and thus
> promisous mode) and adding unicast filters for all static fdb entries.
> 

Yes, that would do it for static provisioning (I suspect that would
work today, unless bridging has no knobs to turn off going into
promisc). But you could even allow learning and just have extra tc
filters in front of the bridging to disallow things.

> Have a look at my secondary unicast address patches in case you didn't
> notice them before (there's also a driver example for e1000 on netdev):
> 
> http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7
> 
> You still need to do filtering in software, but you can have the NIC
> pre-filter in case it supports it, otherwise it goes to promiscous mode.
> 

OK, I will look at them when I get back. Sorry - I haven't caught up on
netdev.

cheers,
jamal



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Patrick McHardy
jamal wrote:
> On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:
> 
> 
>>The difference to a real bridge is that the
>>all addresses are completely known in advance, so it doesn't need
>>promiscous mode for learning.
> 
> 
> You mean the per-virtual MAC addresses are known in advance, right?

Yes.

> This is fine. The bridging or otherwise (like L3 etc) is for
> interconnecting once you have the provisioning done. And you could build
> different "broadcast domains" by having multiple bridges.


Right, but the current bridging code always uses promiscuous mode, and
it's nice to avoid that if possible. Looking at the code, it should be
easy to avoid, though, by disabling learning (and thus promiscuous mode)
and adding unicast filters for all static fdb entries.

> To go off on a slight tangent:
> I think you have to look at the two types of NICs separately 
> 1) dumb ones where you may have to use the mcast filters in hardware to
> pretend you have a unicast address per virtual device - those will be
> really hard to simulate using a separate netdevice per MAC address.
> Actually your bigger problem on those is tx MAC address selection
> because that is not built into the hardware. I still think even for
> these types something above netdevice (bridge, L3 routing, tc action
> redirect etc) will do.


Have a look at my secondary unicast address patches in case you didn't
notice them before (there's also a driver example for e1000 on netdev):

http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7

You still need to do filtering in software, but you can have the NIC
pre-filter in case it supports it; otherwise it goes into promiscuous
mode.
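
(A hedged sketch of how that could be used for the guest case - the
guest table is invented, and dev_unicast_add() is assumed to take
(dev, addr, alen) as in the patch referenced above:)

#include <linux/netdevice.h>
#include <linux/if_ether.h>

/* Sketch: program one secondary unicast filter on the real device per
 * known guest MAC instead of putting it into promiscuous mode.  If the
 * NIC cannot filter in hardware, it falls back to promiscuous mode
 * anyway, as described above.
 */
struct guest_entry {
        unsigned char mac[ETH_ALEN];   /* static fdb entry, hypothetical */
};

static int program_guest_filters(struct net_device *real_dev,
                                 struct guest_entry *guests, int n)
{
        int i, err;

        for (i = 0; i < n; i++) {
                err = dev_unicast_add(real_dev, guests[i].mac, ETH_ALEN);
                if (err)
                        return err;
        }
        return 0;
}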

> 2) The new NICs being built for virtualization; those allow you to
> explicitly have clean separation of IO where the only thing that is
> shared between virtual devices is the wire and the bus (otherwise
> each has its own registers etc) i.e the hardware is designed with this
> in mind. In such a case, i think a separate netdevice per single MAC
> address - possibly tied to a separate CPU should work.


Agreed, that could also be useful for non-virtualized use.


Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread jamal
On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:

> I'm guessing that that wouldn't allow to do unicast filtering for
> the guests on the real device without hacking the bridge code for
> this special case. 

For ingress (I guess you could say for egress as well): we can do it
today with tc filtering on the host - it is involved, but it is part of
provisioning for a guest IMO.
A substantial number of Ethernet switches (OK, not the $5 ones) do
filtering at the same level.

> The difference to a real bridge is that the
> all addresses are completely known in advance, so it doesn't need
> promiscous mode for learning.

You mean the per-virtual-device MAC addresses are known in advance,
right?
This is fine. The bridging or otherwise (like L3, etc.) is for
interconnecting once you have the provisioning done. And you could build
different "broadcast domains" by having multiple bridges.

To go off on a slight tangent:
I think you have to look at the two types of NICs separately:
1) Dumb ones, where you may have to use the mcast filters in hardware to
pretend you have a unicast address per virtual device - those will be
really hard to simulate using a separate netdevice per MAC address.
Actually, your bigger problem on those is TX MAC address selection,
because that is not built into the hardware. I still think even for
these types something above the netdevice (bridge, L3 routing, tc action
redirect, etc.) will do.
2) The new NICs being built for virtualization; those explicitly allow a
clean separation of I/O where the only things shared between virtual
devices are the wire and the bus (otherwise each has its own registers,
etc.), i.e. the hardware is designed with this in mind. In such a case,
I think a separate netdevice per single MAC address - possibly tied to a
separate CPU - should work.

cheers,
jamal



Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread Patrick McHardy
jamal wrote:
> On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:
> 
>>Each guest gets a unique MAC address.  There is a queue per-port
>>that can fill up.
>>
>>What all the drivers like this do right now is stop the queue if
>>any of the per-port queues fill up, and that's what my sunvnet
>>driver does right now as well.  We can only thus wake up the
>>queue when all of the ports have some space.
> 
> 
> Is a netdevice really the correct construct for the host side?
> Sounds to me a layer above the netdevice is the way to go. A bridge for
> example or L3 routing or even simple tc classify/redirection etc.
> I havent used what has become openvz these days in many years (or played
> with Erics approach), but if i recall correctly - it used to have a
> single netdevice per guest on the host. Thats close to what a basic
> qemu/UML has today. In such a case it is something above netdevices
> which does the guest selection.


I'm guessing that that wouldn't allow doing unicast filtering for the
guests on the real device without hacking the bridge code for this
special case. The difference from a real bridge is that all the
addresses are completely known in advance, so it doesn't need
promiscuous mode for learning.


Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue

2007-06-29 Thread jamal

I've changed the topic for you, friend - otherwise most people won't
follow (as you've said a few times yourself ;->).

On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:

> Now I get to pose a problem for everyone, prove to me how useful
> this new code is by showing me how it can be used to solve a
> reocurring problem in virtualized network drivers of which I've
> had to code one up recently, see my most recent blog entry at:
> 
>   http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html
> 

nice.

> Anyways the gist of the issue is (and this happens for Sun LDOMS
> networking, lguest, IBM iSeries, etc.) that we have a single
> virtualized network device.  There is a "port" to the control
> node (which switches packets to the real network for the guest)
> and one "port" to each of the other guests.
> 
> Each guest gets a unique MAC address.  There is a queue per-port
> that can fill up.
> 
> What all the drivers like this do right now is stop the queue if
> any of the per-port queues fill up, and that's what my sunvnet
> driver does right now as well.  We can only thus wake up the
> queue when all of the ports have some space.

Is a netdevice really the correct construct for the host side?
It sounds to me like a layer above the netdevice is the way to go - a
bridge, for example, or L3 routing, or even simple tc
classify/redirection, etc.
I haven't used what has become openvz these days in many years (or
played with Eric's approach), but if I recall correctly it used to have
a single netdevice per guest on the host. That's close to what a basic
qemu/UML setup has today. In such a case it is something above the
netdevices which does the guest selection.
 
> The ports (and thus the queues) are selected by destination
> MAC address.  Each port has a remote MAC address, if there
> is an exact match with a port's remote MAC we'd use that port
> and thus that port's queue.  If there is no exact match
> (some other node on the real network, broadcast, multicast,
> etc.) we want to use the control node's port and port queue.
> 

OK, Dave, isn't that what a bridge does? ;-> You'd need filtering to go
with it (for example, to restrict guest0 from getting certain
broadcasts, etc.) - but we already have that.

> So the problem to solve is to make a way for drivers to do the queue
> selection before the generic queueing layer starts to try and push
> things to the driver.  Perhaps a classifier in the driver or similar.
>
> The solution to this problem generalizes to the other facility
> we want now, hashing the transmit queue by smp_processor_id()
> or similar.  With that in place we can look at doing the TX locking
> per-queue too as is hinted at by the comments above the per-queue
> structure in the current net-2.6.23 tree.

Major surgery will be needed on the TX path if you want to hash the TX
queue to a processor id. Our unit construct (today, in net-2.6.23) that
can be tied to a CPU is a netdevice. OTOH, if you used a netdevice it
should work as is. But I am possibly missing something in your comments.
What do you have in mind?
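
(Purely as an illustration of the kind of hook being asked about -
nothing like this exists in net-2.6.23; the callback, the queue count
and the queue_mapping usage are all assumptions:)

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/smp.h>

#define VNET_NUM_TX_QUEUES 8    /* hypothetical queue count */

/* Hypothetical: a driver-supplied hook that picks the TX queue before
 * the generic queueing layer enqueues the skb, so that per-queue
 * stop/wake state can be honoured.  Everything here is invented to
 * show the shape of it.
 */
static u16 vnet_select_queue(struct net_device *dev, struct sk_buff *skb)
{
        /* A virtualization driver would look up the port whose remote
         * MAC matches skb's destination and return that port's queue
         * index (queue 0 for the control node).  A multiqueue NIC could
         * instead just spread load by CPU:
         */
        return smp_processor_id() % VNET_NUM_TX_QUEUES;
}

/* The core would then record the choice (e.g. in skb->queue_mapping)
 * before running the qdisc and honour that queue's stop/wake state.
 */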

> My current work-in-progress sunvnet.c driver is included below so
> we can discuss things concretely with code.
> 
> I'm listening. :-)

And you got words above.

cheers,
jamal

