subject:"Re\: \[RFC PATCH v0 1\/2\] net\: bridge\: propagate FDB table into hardware"

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-13 Thread Jamal Hadi Salim

On Mon, 2012-03-12 at 09:48 +0100, Lennert Buytenhek wrote:

 Since it can lead to problems (address database mismatches, doesn't
 correctly handle STP transitions or topology changes automatically),
 I think it should be avoided whenever possible.  I don't see any
 advantages of hardware based learning over software based learning
 anyway ('flexibility' doesn't seem like a very good argument).

Indeed address mismatches may happen if you have two databases. 
You have two choices then:
Do learning in user space or be able to tolerate some transient 
inconsistency (if you have some software that lazily looks at the
database). But there is a case where the database sits only in hardware.
In such a case, you cant have mismatches.
I think the STP problem can be handled by user space regardless of
whether address mismatch happens or not.

 It should be doable along the lines of the current DSA patch --
 add a VLAN ID argument to the interface add/remove callbacks, and
 when a VLAN virtual interface is added to the bridge, call the
 relevant callbacks with the parent interface + VLAN ID instead.
 (This doesn't work for stacked VLANs, but the current net/dsa
 supported chips don't handle those anyway.)

Sounds like a good start - we could have a different interface for
stacked variants. I think you should push in the patch.

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-12 Thread Lennert Buytenhek

On Wed, Mar 07, 2012 at 09:11:40AM -0500, Jamal Hadi Salim wrote:

  Why so?  (I think the switch chips should just never do learning at
  all..)
 
 I agree that learning in software gives you more flexibility;
 however, I am for providing interface flexibility as well - switches
 have learning features. I think i should be able to use them when it
 makes sense to. 

Since it can lead to problems (address database mismatches, doesn't
correctly handle STP transitions or topology changes automatically),
I think it should be avoided whenever possible.  I don't see any
advantages of hardware based learning over software based learning
anyway ('flexibility' doesn't seem like a very good argument).


   I think it should also be upto the admin to decide whether the
   learning happens in the kernel or user space.
  
  I can't see any point in doing it in userspace.  What would be the
  advantage of that?  And based on what would the admin make the decision?
 
 If i wanted to do some funky access control based on some new MAC
 address showing up - best place to do it is user space.

Alright, that sounds fair.


  Keep in mind that these chips also do VLAN tagging in hardware, and
  so a scenario like:
  
  # brctl addbr br123
  # brctl addif br123 lan1.123
  # brctl addif br123 lan2.123
  
  is also one that can be handled in hardware (which the current
  patchwork patch doesn't handle yet).
 
 We would need to work with offloading VLANs, no?

Yes.


 Do the current VLAN offloads used for NICs suffice for switching
 chips as well?  i.e typically most chips have a table associated
 with some port in which the Vlan is partof or is the lookup key. 

It should be doable along the lines of the current DSA patch --
add a VLAN ID argument to the interface add/remove callbacks, and
when a VLAN virtual interface is added to the bridge, call the
relevant callbacks with the parent interface + VLAN ID instead.
(This doesn't work for stacked VLANs, but the current net/dsa
supported chips don't handle those anyway.)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-07 Thread Jamal Hadi Salim


On Tue, 2012-03-06 at 15:09 +0100, Lennert Buytenhek wrote:

 Why so?  (I think the switch chips should just never do learning at
 all..)

I agree that learning in software gives you more flexibility; however,
I am for providing interface flexibility as well - switches have
learning features. I think i should be able to use them when it makes
sense to. 

  I think it should also be upto the admin to decide whether the learning
  happens in the kernel or user space.
 
 I can't see any point in doing it in userspace.  What would be the
 advantage of that?  And based on what would the admin make the decision?
 

If i wanted to do some funky access control based on some new MAC
address showing up - best place to do it is user space.

 It does, there is an STP state field per port in the switch chip,
 which controls whether learning takes place on this port (in
 Learning and Forwarding states) and whether packets are forwarded
 (in the Forwarding state).

ok, makes sense.

 But e.g. it doesn't automatically flush this port's FDB entries if
 you move a port from Forwarding to Listening -- the STP state field
 only controls direct learning and forwarding for received packets.

 And when you receive a BPDU with the topology change notification
 bit set, the switch won't automatically shorten the FDB entry
 timeout for you until the topology change is over, either.

I have to go back and look at some manuals i have - but iirc, the
ones ive played with behaved similarly.  As long as we provide knobs
to set/unset those different attributes, I think the handling of all
that should be from software (likely some daemon in user space);
then it shouldnt matter whether we are working with STP BPDUs or TRILL
or thenewprotocolTM etc. 

 Keep in mind that these chips also do VLAN tagging in hardware, and
 so a scenario like:
 
   # brctl addbr br123
   # brctl addif br123 lan1.123
   # brctl addif br123 lan2.123
 
 is also one that can be handled in hardware (which the current
 patchwork patch doesn't handle yet).
 

We would need to work with offloading VLANs, no? Do the current
VLAN offloads used for NICs suffice for switching chips as well?
i.e typically most chips have a table associated with some port in
which the Vlan is partof or is the lookup key. 

 You can let the switch rate limit the number of packets passed up to
 the CPU.  500 kp/s broadcast traffic seems somewhat excessive in any
 case, and I'm not sure if this deserves handling apart from QoSing
 those streams to manageable levels.

Yes, that would provide a solution.
I havent seen anything where you can rate limit the learning(SA lookup
failure). 

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-06 Thread jamal

On Mon, 2012-03-05 at 17:53 +0100, Lennert Buytenhek wrote:

 net/dsa currently configures any switch chips in the system to do
 auto-learning.  

So we clearly need the (user configurable) knob to turn on/off learning.
I think it should also be upto the admin to decide whether the learning
happens in the kernel or user space.
 
 However, I would much prefer to disable that, and have
 the switch chip just pass up packets for new source addresses, have
 Linux do the learning, and then mirror the Linux software FDB into
 the hardware instead -- that avoids having to manually flush the
 hardware FDB on certain STP state transitions or having to configure
 the hardware to use a shorter address learning timeout when we're in
 the middle of an STP topology change, which are problems we are
 running into in practice.

So in the scenario you are describing then it seems the h/ware has
no stp state toggles, correct? In other ASICs i have seen, there is
influence from stp state on behavior.

 Just curious -- while your patches allow propagating FDB entries
 into the hardware, do you also have hooks to tell the hardware which
 ports are to share address databases?

I think those are missing in this discussion and makes a lot of sense to
be part of the interface.

 net/dsa currently solves this by not having the hardware handle
 broadcast packets at all, which circumvents the problem, but for
 multicast traffic you would still like to be able to do at least the
 forwarding that can be done in hardware in hardware.  (Unicast doesn't
 have this problem as long as the kernel and the switch chip agree on
 their view of the FDB.)

Of course this could represent an interesting opportunity for a DOS.
Even at 4 port switch at 100Mbs, hitting 500Kpps to the CPU (I am
thinking these tiny switches end up in some tiny MIPS/ARM cpu) could
be devastating. How do you deal with that?

cheers,
jamal


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-06 Thread Lennert Buytenhek

On Tue, Mar 06, 2012 at 08:42:26AM -0500, jamal wrote:

  net/dsa currently configures any switch chips in the system to do
  auto-learning.  
 
 So we clearly need the (user configurable) knob to turn on/off learning.

Why so?  (I think the switch chips should just never do learning at
all..)


 I think it should also be upto the admin to decide whether the learning
 happens in the kernel or user space.

I can't see any point in doing it in userspace.  What would be the
advantage of that?  And based on what would the admin make the decision?


  However, I would much prefer to disable that, and have
  the switch chip just pass up packets for new source addresses, have
  Linux do the learning, and then mirror the Linux software FDB into
  the hardware instead -- that avoids having to manually flush the
  hardware FDB on certain STP state transitions or having to configure
  the hardware to use a shorter address learning timeout when we're in
  the middle of an STP topology change, which are problems we are
  running into in practice.
 
 So in the scenario you are describing then it seems the h/ware has
 no stp state toggles, correct?  In other ASICs i have seen, there is
 influence from stp state on behavior.

It does, there is an STP state field per port in the switch chip,
which controls whether learning takes place on this port (in
Learning and Forwarding states) and whether packets are forwarded
(in the Forwarding state).

But e.g. it doesn't automatically flush this port's FDB entries if
you move a port from Forwarding to Listening -- the STP state field
only controls direct learning and forwarding for received packets.

And when you receive a BPDU with the topology change notification
bit set, the switch won't automatically shorten the FDB entry
timeout for you until the topology change is over, either.


  Just curious -- while your patches allow propagating FDB entries
  into the hardware, do you also have hooks to tell the hardware which
  ports are to share address databases?
 
 I think those are missing in this discussion and makes a lot of
 sense to be part of the interface.

Keep in mind that these chips also do VLAN tagging in hardware, and
so a scenario like:

# brctl addbr br123
# brctl addif br123 lan1.123
# brctl addif br123 lan2.123

is also one that can be handled in hardware (which the current
patchwork patch doesn't handle yet).


  net/dsa currently solves this by not having the hardware handle
  broadcast packets at all, which circumvents the problem, but for
  multicast traffic you would still like to be able to do at least the
  forwarding that can be done in hardware in hardware.  (Unicast doesn't
  have this problem as long as the kernel and the switch chip agree on
  their view of the FDB.)
 
 Of course this could represent an interesting opportunity for a DOS.
 Even at 4 port switch at 100Mbs, hitting 500Kpps to the CPU (I am
 thinking these tiny switches end up in some tiny MIPS/ARM cpu) could
 be devastating. How do you deal with that?

You can let the switch rate limit the number of packets passed up to
the CPU.  500 kp/s broadcast traffic seems somewhat excessive in any
case, and I'm not sure if this deserves handling apart from QoSing
those streams to manageable levels.


thanks,
Lennert
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-06 Thread Lennert Buytenhek

On Mon, Mar 05, 2012 at 07:45:22PM -0800, John Fastabend wrote:

  Also if there are embedded switches with learning capabilities they
  might want to trigger events to user space. In this case having
  a protocol type makes user space a bit easier to manage. I've
  added Lennert so maybe he can comment I think the Marvell chipsets
  might support something along these lines. The SR-IOV chipsets I'm
  aware of _today_ don't do learning. Learning makes the event model
  more plausible.
  
  net/dsa currently configures any switch chips in the system to do
  auto-learning.  However, I would much prefer to disable that, and have
  the switch chip just pass up packets for new source addresses, have
  Linux do the learning, and then mirror the Linux software FDB into
  the hardware instead -- that avoids having to manually flush the
  hardware FDB on certain STP state transitions or having to configure
  the hardware to use a shorter address learning timeout when we're in
  the middle of an STP topology change, which are problems we are
  running into in practice.
 
 Great. And the plan is we should be able to use the same daemon with
 minimal changes (currently a flag) to control both sw and hw bridges.

Why should userspace care at all whether there is a hw bridge present?


  Just curious -- while your patches allow propagating FDB entries
  into the hardware, do you also have hooks to tell the hardware which
  ports are to share address databases?
 
 Not in the current patches. I don't have hardware right now
 that can instantiate multiple bridges. When I get some I was hoping
 to do something similar to this patch and use netlink commands
 to create/delete bridges and add/remove ports to them. This would
 be modifying the existing commands to work for both software and
 hardware bridges.

In the DSA h/w bridging patch, a hardware address database is only
allocated on addif to an existing bridge, not on addbr.

For one, at addbr time, you have no idea yet whether there will
be any switch chip ports in the bridge port group for this bridge.

Also, the h/w has some restrictions on the assignment of address
database identifiers (e.g. if you want to bridge between lan1.123
and lan2.123, you have to use address database identifier '123').


 By a bridge instantiation I mean a shared address database in this case.

Makes sense.  (I'd add STP port states to this.)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-05 Thread Lennert Buytenhek

On Tue, Feb 28, 2012 at 08:40:06PM -0800, John Fastabend wrote:

 Also if there are embedded switches with learning capabilities they
 might want to trigger events to user space. In this case having
 a protocol type makes user space a bit easier to manage. I've
 added Lennert so maybe he can comment I think the Marvell chipsets
 might support something along these lines. The SR-IOV chipsets I'm
 aware of _today_ don't do learning. Learning makes the event model
 more plausible.

net/dsa currently configures any switch chips in the system to do
auto-learning.  However, I would much prefer to disable that, and have
the switch chip just pass up packets for new source addresses, have
Linux do the learning, and then mirror the Linux software FDB into
the hardware instead -- that avoids having to manually flush the
hardware FDB on certain STP state transitions or having to configure
the hardware to use a shorter address learning timeout when we're in
the middle of an STP topology change, which are problems we are
running into in practice.

Just curious -- while your patches allow propagating FDB entries
into the hardware, do you also have hooks to tell the hardware which
ports are to share address databases?

For net/dsa, we currently have:

http://patchwork.ozlabs.org/patch/16578/

While I think this is conceptually sound, the implementation is hacky,
and I wonder how you've solved it for your setup, and if DSA can
piggy-back off that.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-05 Thread Lennert Buytenhek

On Thu, Mar 01, 2012 at 08:36:20AM -0500, Jamal Hadi Salim wrote:

   I want to see a unified API so that user space control applications 
   (RSTP, TRILL?)
   can use one set of netlink calls for both software bridge and hardware 
   offloaded
   bridges.  Does this proposal meet that requirement?
   
 
 I dont see any issues with those requirements being met.
 
  Jamal, so why do They have to be different calls? I'm not so sure 
  anymore...
  moving to RTM_FDB_XXXENTRY saved some refactoring in the bridge module but 
  that
  is just cosmetic.
 
 I may not want to use the s/ware bridge i.e I may want to use h/ware
 bridge. I may want to use both.

This is a rather common case in embedded wireless routers/access points,
where you want to have the 4 LAN ports bridged together with the wlan0
interface.

In this scenario, the bridging between the LAN ports is typically done
in hardware, and the bridging between the LAN ports and wlan0 in
software, but here you have to be careful when you send the packet from
the switch chip up the stack to be forwarded to the wlan0 interface to
not re-send it to the hardware switch chip ports other than the one
that the packet came from.

net/dsa currently solves this by not having the hardware handle
broadcast packets at all, which circumvents the problem, but for
multicast traffic you would still like to be able to do at least the
forwarding that can be done in hardware in hardware.  (Unicast doesn't
have this problem as long as the kernel and the switch chip agree on
their view of the FDB.)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-05 Thread John Fastabend

On 3/5/2012 8:53 AM, Lennert Buytenhek wrote:
 On Tue, Feb 28, 2012 at 08:40:06PM -0800, John Fastabend wrote:
 
 Also if there are embedded switches with learning capabilities they
 might want to trigger events to user space. In this case having
 a protocol type makes user space a bit easier to manage. I've
 added Lennert so maybe he can comment I think the Marvell chipsets
 might support something along these lines. The SR-IOV chipsets I'm
 aware of _today_ don't do learning. Learning makes the event model
 more plausible.
 
 net/dsa currently configures any switch chips in the system to do
 auto-learning.  However, I would much prefer to disable that, and have
 the switch chip just pass up packets for new source addresses, have
 Linux do the learning, and then mirror the Linux software FDB into
 the hardware instead -- that avoids having to manually flush the
 hardware FDB on certain STP state transitions or having to configure
 the hardware to use a shorter address learning timeout when we're in
 the middle of an STP topology change, which are problems we are
 running into in practice.
 

Great. And the plan is we should be able to use the same daemon with
minimal changes (currently a flag) to control both sw and hw bridges.

 Just curious -- while your patches allow propagating FDB entries
 into the hardware, do you also have hooks to tell the hardware which
 ports are to share address databases?
 

Not in the current patches. I don't have hardware right now
that can instantiate multiple bridges. When I get some I was hoping
to do something similar to this patch and use netlink commands
to create/delete bridges and add/remove ports to them. This would
be modifying the existing commands to work for both software and
hardware bridges.

By a bridge instantiation I mean a shared address database in this case.

 For net/dsa, we currently have:
 
   http://patchwork.ozlabs.org/patch/16578/
 
 While I think this is conceptually sound, the implementation is hacky,
 and I wonder how you've solved it for your setup, and if DSA can
 piggy-back off that.

Yep anything we come up with should work in both cases.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-02 Thread jamal

On Thu, 2012-03-01 at 14:17 -0800, John Fastabend wrote:
 Hmm so I think what I'll do is this...
 
  both: ndm_flags = 0 
  sw  : ndm_flags = NTF_SW_FDB
  hw  : ndm_flags = NTF_HW_FDB

 Then current tools will work with embedded bridges and software
 bridges
 with the interesting case being when a port supporting an offloaded
 FDB is attached to a SW bridge. Doing both in this case seems to be a
 reasonable default to me.

Looks good, although it seems like no backward compat is broken, it
feels like the default should be whats goin on today i.e s/ware only.
IOW, I would make that the 0.

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-01 Thread Jamal Hadi Salim

On Wed, 2012-02-29 at 09:25 -0800, John Fastabend wrote:

 Well I think NETLINK_ROUTE is the most correct type to use in this
 case. Per netlink.h its for routing and device hooks.
 
 #define NETLINK_ROUTE   0   /* Routing/device hook
   */
 
 And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix
 were merely a copy from the SW BRIDGE code paths. How about,
 
 PF_BRIDGE:RTM_FDB_NEWENTRY
 PF_BRIDGE:RTM_FDB_DELENTRY
 PF_BRIDGE:RTM_FDB_GETENTRY

OK, I guess ;-

 And a new group RTNLGRP_FDB. 

Nod.

 Also using NETLINK_ROUTE gives the correct
 rtnl locking semantics for free.

makes sense.


 Agreed. I think adding some ndo_ops for bridging offloads here would
 work. For example the DSA infrastructure and/or macvlan devices might
 need this. Along the lines of extending this RFC,
 
 [RFC] hardware bridging support for DSA switches
 http://patchwork.ozlabs.org/patch/16578/

Certainly - thats one approach that is reasonable.
Where is Lennert? ;- I changed his email address to one that i am 
familiar with.

cheers,
jamal


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-01 Thread Jamal Hadi Salim

On Wed, 2012-02-29 at 10:19 -0800, John Fastabend wrote:

  
  I want to see a unified API so that user space control applications (RSTP, 
  TRILL?)
  can use one set of netlink calls for both software bridge and hardware 
  offloaded
  bridges.  Does this proposal meet that requirement?
  

I dont see any issues with those requirements being met.

 Jamal, so why do They have to be different calls? I'm not so sure anymore...
 moving to RTM_FDB_XXXENTRY saved some refactoring in the bridge module but 
 that
 is just cosmetic.

I may not want to use the s/ware bridge i.e I may want to use h/ware
bridge. I may want to use both. So there are 3 variations there. You
need at least 1.5 bits to represent them if you are going to use the
same interface. There may be features in either h/ware but not in
s/ware and vice-versa. 
A single interface with flags which say this applies to hware:sware:both
would be good, but it may be harder to achieve - thats why i suggested
they be different.

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-01 Thread Michael S. Tsirkin

On Wed, Feb 29, 2012 at 09:25:56AM -0800, John Fastabend wrote:
 Agreed. I think adding some ndo_ops for bridging offloads here would
 work. For example the DSA infrastructure and/or macvlan devices might
 need this. Along the lines of extending this RFC,
 
 [RFC] hardware bridging support for DSA switches
 http://patchwork.ozlabs.org/patch/16578/
 
 
 .John

One place where this might not work well would be
macvtap which is not a network device so it doesn't have
its own address, instead it inherits one from macvlan.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-01 Thread John Fastabend

On 3/1/2012 6:14 AM, Michael S. Tsirkin wrote:
 On Wed, Feb 29, 2012 at 09:25:56AM -0800, John Fastabend wrote:
 Agreed. I think adding some ndo_ops for bridging offloads here would
 work. For example the DSA infrastructure and/or macvlan devices might
 need this. Along the lines of extending this RFC,

 [RFC] hardware bridging support for DSA switches
 http://patchwork.ozlabs.org/patch/16578/


 .John
 
 One place where this might not work well would be
 macvtap which is not a network device so it doesn't have
 its own address, instead it inherits one from macvlan.
 

But is macvtap really doing any forwarding or implementing any
RX filters? Took a quick scan and it looks like the forwarding
logic is all in the macvlan code paths. In this case I suspect
if we enable macvlan then any device built on top of it would
work.

.John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-01 Thread John Fastabend

On 3/1/2012 5:36 AM, Jamal Hadi Salim wrote:
 On Wed, 2012-02-29 at 10:19 -0800, John Fastabend wrote:
 

 I want to see a unified API so that user space control applications (RSTP, 
 TRILL?)
 can use one set of netlink calls for both software bridge and hardware 
 offloaded
 bridges.  Does this proposal meet that requirement?

 
 I dont see any issues with those requirements being met.
 
 Jamal, so why do They have to be different calls? I'm not so sure 
 anymore...
 moving to RTM_FDB_XXXENTRY saved some refactoring in the bridge module but 
 that
 is just cosmetic.
 
 I may not want to use the s/ware bridge i.e I may want to use h/ware
 bridge. I may want to use both. So there are 3 variations there. You
 need at least 1.5 bits to represent them if you are going to use the
 same interface. There may be features in either h/ware but not in
 s/ware and vice-versa. 
 A single interface with flags which say this applies to hware:sware:both
 would be good, but it may be harder to achieve - thats why i suggested
 they be different.
 
 cheers,
 jamal
 

Hmm so I think what I'll do is this...

 both: ndm_flags = 0 
 sw  : ndm_flags = NTF_SW_FDB
 hw  : ndm_flags = NTF_HW_FDB

Then current tools will work with embedded bridges and software bridges
with the interesting case being when a port supporting an offloaded FDB
is attached to a SW bridge. Doing both in this case seems to be a reasonable
default to me.

The tricky bit will be pulling the message handlers out of the ./net/bridge
code so that we don't have to always load the bridge module to add entries
to a macvlan for example. I need to look at a few other things today but
I'll code up a patch for this tomorrow.

.John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread Jamal Hadi Salim

On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:

 OK back to this. The last piece is where to put these messages...
 we could take PF_ROUTE:RTM_*NEIGH
 
  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
  switch.
  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
  switch.
  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table
 

Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
may play a role in the user space app populating the FDB, i dont think
they are necessary players.
Learning could be via a table entry miss and packet redirect to user
space.
So my suggestion is to use FDB_*ENTRY for names
 
 The neighbor code is using the PF_UNSPEC protocol type so we won't
 collide with these unless someone was using PF_ROUTE and relying on
 falling back to PF_UNSPEC however I couldn't find any programs that
 did this iproute2 certainly doesn't. And the bridge pieces are using
 PF_BRIDGE so no collision there.

They have to be different calls from the calls that talk to the s/ware
bridge. In my opinion, as controversial as this may sound, you need to
be flexible enough that some vendor can replace these calls with
proprietary calls which are more efficient for their hardware. So a
plugin to replace these calls in the user space code would be a 
good idea. Alternatively, you could make that something they do at
the driver level i.e from user space to kernel it is hardware, please
addthistotheFDBtable() call and the implementation of that could be
proprietary to the specific hardware.

[..]

 Also if there are embedded switches with learning capabilities they
 might want to trigger events to user space. In this case having
 a protocol type makes user space a bit easier to manage. I've
 added Lennert so maybe he can comment I think the Marvell chipsets
 might support something along these lines. The SR-IOV chipsets I'm
 aware of _today_ don't do learning. Learning makes the event model
 more plausible.

The other events to consider is aging of hardware entries.

 The other mechanism would be to embed some more attributes into the
 PF_UNSPEC:RTM_XXXLINK msg however I'm thinking that if we want to
 support learning and triggering events then we likely also don't
 want to send these events to every app with RTNLGRP_LINK set.

I think this needs to be a different event message. 
FDB_TABLEMISS? FDB_EXCEPTION?

 Plus there is already a proliferation of LINK attributes and dumping
 the FDB out of this seems a bit much but could be done with some
 bitmasks. Although the current ext_filter_mask u32 doesn't seem to
 be sufficient for events to trigger this.

Dumping the FDB table should be something along the lines of FDB_GET
with the dump flag. It shouldnt tie to the LINK side of things.

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread Jamal Hadi Salim

On Tue, 2012-02-28 at 21:14 -0800, John Fastabend wrote:

 Just checked looks like the DSA infrastructure has commands to enable
 STP so guess it is doing learning.

IIRC, Lennert built some of this stuff tied to the kernel.

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread John Fastabend

On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote:
 On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:
 
 OK back to this. The last piece is where to put these messages...
 we could take PF_ROUTE:RTM_*NEIGH

  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
  switch.
  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
  switch.
  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table

 
 Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
 to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
 may play a role in the user space app populating the FDB, i dont think
 they are necessary players.
 Learning could be via a table entry miss and packet redirect to user
 space.
 So my suggestion is to use FDB_*ENTRY for names
  

Well I think NETLINK_ROUTE is the most correct type to use in this
case. Per netlink.h its for routing and device hooks.

#define NETLINK_ROUTE   0   /* Routing/device hook  
*/

And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix
were merely a copy from the SW BRIDGE code paths. How about,

PF_BRIDGE:RTM_FDB_NEWENTRY
PF_BRIDGE:RTM_FDB_DELENTRY
PF_BRIDGE:RTM_FDB_GETENTRY

And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct
rtnl locking semantics for free.

 The neighbor code is using the PF_UNSPEC protocol type so we won't
 collide with these unless someone was using PF_ROUTE and relying on
 falling back to PF_UNSPEC however I couldn't find any programs that
 did this iproute2 certainly doesn't. And the bridge pieces are using
 PF_BRIDGE so no collision there.
 
 They have to be different calls from the calls that talk to the s/ware
 bridge. In my opinion, as controversial as this may sound, you need to
 be flexible enough that some vendor can replace these calls with
 proprietary calls which are more efficient for their hardware. So a
 plugin to replace these calls in the user space code would be a 
 good idea. Alternatively, you could make that something they do at
 the driver level i.e from user space to kernel it is hardware, please
 addthistotheFDBtable() call and the implementation of that could be
 proprietary to the specific hardware.
 

Agreed. I think adding some ndo_ops for bridging offloads here would
work. For example the DSA infrastructure and/or macvlan devices might
need this. Along the lines of extending this RFC,

[RFC] hardware bridging support for DSA switches
http://patchwork.ozlabs.org/patch/16578/


.John

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread Stephen Hemminger

On Wed, 29 Feb 2012 09:25:56 -0800
John Fastabend john.r.fastab...@intel.com wrote:

 On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote:
  On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:
  
  OK back to this. The last piece is where to put these messages...
  we could take PF_ROUTE:RTM_*NEIGH
 
   PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
   switch.
   PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
   switch.
   PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table
 
  
  Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
  to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
  may play a role in the user space app populating the FDB, i dont think
  they are necessary players.
  Learning could be via a table entry miss and packet redirect to user
  space.
  So my suggestion is to use FDB_*ENTRY for names
   
 
 Well I think NETLINK_ROUTE is the most correct type to use in this
 case. Per netlink.h its for routing and device hooks.
 
 #define NETLINK_ROUTE   0   /* Routing/device hook
   */
 
 And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix
 were merely a copy from the SW BRIDGE code paths. How about,
 
 PF_BRIDGE:RTM_FDB_NEWENTRY
 PF_BRIDGE:RTM_FDB_DELENTRY
 PF_BRIDGE:RTM_FDB_GETENTRY
 
 And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct
 rtnl locking semantics for free.
 
  The neighbor code is using the PF_UNSPEC protocol type so we won't
  collide with these unless someone was using PF_ROUTE and relying on
  falling back to PF_UNSPEC however I couldn't find any programs that
  did this iproute2 certainly doesn't. And the bridge pieces are using
  PF_BRIDGE so no collision there.
  
  They have to be different calls from the calls that talk to the s/ware
  bridge. In my opinion, as controversial as this may sound, you need to
  be flexible enough that some vendor can replace these calls with
  proprietary calls which are more efficient for their hardware. So a
  plugin to replace these calls in the user space code would be a 
  good idea. Alternatively, you could make that something they do at
  the driver level i.e from user space to kernel it is hardware, please
  addthistotheFDBtable() call and the implementation of that could be
  proprietary to the specific hardware.
  
 
 Agreed. I think adding some ndo_ops for bridging offloads here would
 work. For example the DSA infrastructure and/or macvlan devices might
 need this. Along the lines of extending this RFC,
 
 [RFC] hardware bridging support for DSA switches
 http://patchwork.ozlabs.org/patch/16578/

I want to see a unified API so that user space control applications (RSTP, 
TRILL?)
can use one set of netlink calls for both software bridge and hardware offloaded
bridges.  Does this proposal meet that requirement?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread John Fastabend

On 2/29/2012 9:52 AM, Stephen Hemminger wrote:
 On Wed, 29 Feb 2012 09:25:56 -0800
 John Fastabend john.r.fastab...@intel.com wrote:
 
 On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote:
 On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:

 OK back to this. The last piece is where to put these messages...
 we could take PF_ROUTE:RTM_*NEIGH

  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
  switch.
  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
  switch.
  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table


 Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
 to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
 may play a role in the user space app populating the FDB, i dont think
 they are necessary players.
 Learning could be via a table entry miss and packet redirect to user
 space.
 So my suggestion is to use FDB_*ENTRY for names
  

 Well I think NETLINK_ROUTE is the most correct type to use in this
 case. Per netlink.h its for routing and device hooks.

 #define NETLINK_ROUTE   0   /* Routing/device hook   
*/

 And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix
 were merely a copy from the SW BRIDGE code paths. How about,

 PF_BRIDGE:RTM_FDB_NEWENTRY
 PF_BRIDGE:RTM_FDB_DELENTRY
 PF_BRIDGE:RTM_FDB_GETENTRY

 And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct
 rtnl locking semantics for free.

 The neighbor code is using the PF_UNSPEC protocol type so we won't
 collide with these unless someone was using PF_ROUTE and relying on
 falling back to PF_UNSPEC however I couldn't find any programs that
 did this iproute2 certainly doesn't. And the bridge pieces are using
 PF_BRIDGE so no collision there.

 They have to be different calls from the calls that talk to the s/ware
 bridge. In my opinion, as controversial as this may sound, you need to
 be flexible enough that some vendor can replace these calls with
 proprietary calls which are more efficient for their hardware. So a
 plugin to replace these calls in the user space code would be a 
 good idea. Alternatively, you could make that something they do at
 the driver level i.e from user space to kernel it is hardware, please
 addthistotheFDBtable() call and the implementation of that could be
 proprietary to the specific hardware.


 Agreed. I think adding some ndo_ops for bridging offloads here would
 work. For example the DSA infrastructure and/or macvlan devices might
 need this. Along the lines of extending this RFC,

 [RFC] hardware bridging support for DSA switches
 http://patchwork.ozlabs.org/patch/16578/
 
 I want to see a unified API so that user space control applications (RSTP, 
 TRILL?)
 can use one set of netlink calls for both software bridge and hardware 
 offloaded
 bridges.  Does this proposal meet that requirement?
 

With the patches I sent out last night the same netlink calls are used
for both SW and HW with a flag set in ndm_flags to indicate it is a hardware
entry. The flag is needed when a port has offload support and is also
a slave of a SW bridge. Another option would be to apply the command to both
hardware and software tables. This might be good enough and user space would
not have to make distinctions between HW and SW bridges. Also helps with my
original use case where I want the SW and HW bridge FDBs to be in sync.

In response to Jamal's comment I proposed changing the type to RTM_FDB_XXXENTRY
but the message contents are the same in both cases.

Jamal, so why do They have to be different calls? I'm not so sure anymore...
moving to RTM_FDB_XXXENTRY saved some refactoring in the bridge module but that
is just cosmetic.

Thanks,
John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-28 Thread John Fastabend

On 2/18/2012 4:41 AM, jamal wrote:
 On Fri, 2012-02-17 at 09:10 -0800, John Fastabend wrote:
 
 Yes I agree that is the goal.

 One last comment:
 With synchronization there are other challenges when the entry in the
 hardware conflicts with the entry in software when you intend the
 behavior to be the same. This is not such a big deal with bridging but
 becomes more apparent when you start offloading ACLs etc.


 OK and these sorts of conflicts certainly don't need to be resolved
 by kernel code. So I think this is a reasonable reason to drive the
 synchronization into a user space daemon.
 
 
 Yep. 
 Thanks for listening John. Waiting to see them patches.
 
 cheers,
 jamal
 
 
 

+Lennert

OK back to this. The last piece is where to put these messages...
we could take PF_ROUTE:RTM_*NEIGH

 PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
 switch.
 PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
 switch.
 PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table

The neighbor code is using the PF_UNSPEC protocol type so we won't
collide with these unless someone was using PF_ROUTE and relying on
falling back to PF_UNSPEC however I couldn't find any programs that
did this iproute2 certainly doesn't. And the bridge pieces are using
PF_BRIDGE so no collision there.

I briefly thought about trying to pull the PF_BRIDGE protocol out
and use this for both types but I think its better to leave the
bridge code alone and there is also the issue of disambiguating a msg
at a port which has both an embedded switch and has SW bridge for a
master.

Also if there are embedded switches with learning capabilities they
might want to trigger events to user space. In this case having
a protocol type makes user space a bit easier to manage. I've
added Lennert so maybe he can comment I think the Marvell chipsets
might support something along these lines. The SR-IOV chipsets I'm
aware of _today_ don't do learning. Learning makes the event model
more plausible.

The other mechanism would be to embed some more attributes into the
PF_UNSPEC:RTM_XXXLINK msg however I'm thinking that if we want to
support learning and triggering events then we likely also don't
want to send these events to every app with RTNLGRP_LINK set.

Plus there is already a proliferation of LINK attributes and dumping
the FDB out of this seems a bit much but could be done with some
bitmasks. Although the current ext_filter_mask u32 doesn't seem to
be sufficient for events to trigger this.

so much for a short note...

Thanks
.John




--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-28 Thread John Fastabend

On 2/28/2012 8:40 PM, John Fastabend wrote:
 On 2/18/2012 4:41 AM, jamal wrote:
 On Fri, 2012-02-17 at 09:10 -0800, John Fastabend wrote:

 Yes I agree that is the goal.

 One last comment:
 With synchronization there are other challenges when the entry in the
 hardware conflicts with the entry in software when you intend the
 behavior to be the same. This is not such a big deal with bridging but
 becomes more apparent when you start offloading ACLs etc.


 OK and these sorts of conflicts certainly don't need to be resolved
 by kernel code. So I think this is a reasonable reason to drive the
 synchronization into a user space daemon.


 Yep. 
 Thanks for listening John. Waiting to see them patches.

 cheers,
 jamal



 
 +Lennert
 
 OK back to this. The last piece is where to put these messages...
 we could take PF_ROUTE:RTM_*NEIGH
 
  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
  switch.
  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
  switch.
  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table
 
 The neighbor code is using the PF_UNSPEC protocol type so we won't
 collide with these unless someone was using PF_ROUTE and relying on
 falling back to PF_UNSPEC however I couldn't find any programs that
 did this iproute2 certainly doesn't. And the bridge pieces are using
 PF_BRIDGE so no collision there.
 
 I briefly thought about trying to pull the PF_BRIDGE protocol out
 and use this for both types but I think its better to leave the
 bridge code alone and there is also the issue of disambiguating a msg
 at a port which has both an embedded switch and has SW bridge for a
 master.

Maybe I gave up too quickly here I could use a bit in the ndm_flags to
specify embedded or sw bridge. But would require having the bridge
module loaded.

 
 Also if there are embedded switches with learning capabilities they
 might want to trigger events to user space. In this case having
 a protocol type makes user space a bit easier to manage. I've
 added Lennert so maybe he can comment I think the Marvell chipsets
 might support something along these lines. The SR-IOV chipsets I'm
 aware of _today_ don't do learning. Learning makes the event model
 more plausible.
 

Just checked looks like the DSA infrastructure has commands to enable
STP so guess it is doing learning.

 The other mechanism would be to embed some more attributes into the
 PF_UNSPEC:RTM_XXXLINK msg however I'm thinking that if we want to
 support learning and triggering events then we likely also don't
 want to send these events to every app with RTNLGRP_LINK set.
 
 Plus there is already a proliferation of LINK attributes and dumping
 the FDB out of this seems a bit much but could be done with some
 bitmasks. Although the current ext_filter_mask u32 doesn't seem to
 be sufficient for events to trigger this.
 
 so much for a short note...
 
 Thanks
 .John
 
 
 
 
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-18 Thread jamal

On Fri, 2012-02-17 at 09:10 -0800, John Fastabend wrote:

 Yes I agree that is the goal.
 
  One last comment:
  With synchronization there are other challenges when the entry in the
  hardware conflicts with the entry in software when you intend the
  behavior to be the same. This is not such a big deal with bridging but
  becomes more apparent when you start offloading ACLs etc.
  
 
 OK and these sorts of conflicts certainly don't need to be resolved
 by kernel code. So I think this is a reasonable reason to drive the
 synchronization into a user space daemon.


Yep. 
Thanks for listening John. Waiting to see them patches.

cheers,
jamal



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-17 Thread jamal

On Wed, 2012-02-15 at 17:26 -0800, John Fastabend wrote:
 On 2/15/2012 6:10 AM, Jamal Hadi Salim wrote:
  On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote:
  
  Roopa was likely on the right track here,
 
  http://patchwork.ozlabs.org/patch/123064/
  
  Doesnt seem related to the bridging stuff - the modeling looks
  reasonable however.
  
 
 The operations are really the same ADD/DEL/GET additional MAC
 addresses to a port, in this case a macvlan type port. The
 difference is the  macvlan port type drops any packet with an
 address not in the FDB where the bridge type floods these.

Ok.
[the vlan piece really should have been an integrated part of bridging;
in the early days this was the case]


 [root@jf-dev1-dcblab src]# br fdb help
 Usage: br fdb { add | del | replace } ADDR dev DEV
br fdb {show} [ dev DEV ]
 
 In my example I just dumped all bridge devices,
 

Ok, makes sense.


 Seems we need both a synchronize and a { add | del | replace } option.

I am conflicted on this.
Not sure if that is a command line thing or something built into a user
space daemon. It may be useful to have the command line variant but i
feel having a daemon take care of things helps in faster
synchronization.
I think user space is a good spot to add such functionality (as opposed
to the kernel). That way user space can work with h/ware switching such
as yours as well as a standalone switching chips (from sillicon vendors
like Marvel etc).
IMO, the average user doesnt need to be aware of such low level stuff;
so the default should be for the user not to be responsible for
configuration of synchronization. IOW, I want to just run well
understood user interface tools things like ifconfig, ip link etc, the
new br tool and not even need to be aware that we are offloading.
So as long as s/w br0 is mapping to the bridge on ixgb-0 i dont need
to know ixgb0 h/w bridge exists.

One last comment:
With synchronization there are other challenges when the entry in the
hardware conflicts with the entry in software when you intend the
behavior to be the same. This is not such a big deal with bridging but
becomes more apparent when you start offloading ACLs etc.


 So I think what your saying is a per port bit to disable learning...
 hmm but if you start tweaking it too much it looks less and less like a
 802.1D bridge and more like something you would want to build with tc or
 openvswitch or tc+bridge or tc+macvlan.

These are pretty commodity features in most silicon switching chips ive
come across. You have a knob to control learning and another to control
flooding.

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-17 Thread jamal

On Thu, 2012-02-16 at 03:58 +, Ben Hutchings wrote:

 
 Well, in addition, there are SR-IOV network adapters that don't have any
 bridge.  For these, the software bridge is necessary to handle
 multicast, broadcast and forwarding between local ports, not only to do
 learning.

For the scenario where there is no h/w bridge - the s/ware bridge should
be usable. There's no way working around that.
My contention is only with the case where there is a h/w bridge and
there being two FDB tables; one in hardware and another in s/w.
And both the h/w and s/w bridges doing flooding and learning.
It is desirable to have options to use one or other or both with
some synchronization.

 Solarflare's implementation of accelerated guest networking (which
 Shradha and I are gradually sending upstream) builds on libvirt's
 existing support for software bridges and assigns VFs to guests as a
 means to offload some of the forwarding.
 If and when we implement a hardware bridge, we would probably still want
 to keep the software bridge as a fallback.  If a guest is dependent on a
 VF that's connected to a hardware bridge, it becomes impossible or at
 least very disruptive to migrate it to another host that doesn't have a
 compatible VF available.

In the scheme i described to John in last email, libvirt needs not be aware of 
existence of hardware offloading (and migration should be transparent of whether
h/w bridge exists or not)...

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-17 Thread John Fastabend

On 2/17/2012 6:28 AM, jamal wrote:
 On Wed, 2012-02-15 at 17:26 -0800, John Fastabend wrote:
 On 2/15/2012 6:10 AM, Jamal Hadi Salim wrote:
 On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote:

 Roopa was likely on the right track here,

 http://patchwork.ozlabs.org/patch/123064/

 Doesnt seem related to the bridging stuff - the modeling looks
 reasonable however.


 The operations are really the same ADD/DEL/GET additional MAC
 addresses to a port, in this case a macvlan type port. The
 difference is the  macvlan port type drops any packet with an
 address not in the FDB where the bridge type floods these.
 
 Ok.
 [the vlan piece really should have been an integrated part of bridging;
 in the early days this was the case]
 
 
 [root@jf-dev1-dcblab src]# br fdb help
 Usage: br fdb { add | del | replace } ADDR dev DEV
br fdb {show} [ dev DEV ]

 In my example I just dumped all bridge devices,

 
 Ok, makes sense.
 
 
 Seems we need both a synchronize and a { add | del | replace } option.
 
 I am conflicted on this.
 Not sure if that is a command line thing or something built into a user
 space daemon. It may be useful to have the command line variant but i
 feel having a daemon take care of things helps in faster
 synchronization.
 I think user space is a good spot to add such functionality (as opposed
 to the kernel). That way user space can work with h/ware switching such
 as yours as well as a standalone switching chips (from sillicon vendors
 like Marvel etc).
 IMO, the average user doesnt need to be aware of such low level stuff;
 so the default should be for the user not to be responsible for
 configuration of synchronization. IOW, I want to just run well
 understood user interface tools things like ifconfig, ip link etc, the
 new br tool and not even need to be aware that we are offloading.
 So as long as s/w br0 is mapping to the bridge on ixgb-0 i dont need
 to know ixgb0 h/w bridge exists.
 

Yes I agree that is the goal.

 One last comment:
 With synchronization there are other challenges when the entry in the
 hardware conflicts with the entry in software when you intend the
 behavior to be the same. This is not such a big deal with bridging but
 becomes more apparent when you start offloading ACLs etc.
 

OK and these sorts of conflicts certainly don't need to be resolved
by kernel code. So I think this is a reasonable reason to drive the
synchronization into a user space daemon.

 
 So I think what your saying is a per port bit to disable learning...
 hmm but if you start tweaking it too much it looks less and less like a
 802.1D bridge and more like something you would want to build with tc or
 openvswitch or tc+bridge or tc+macvlan.
 
 These are pretty commodity features in most silicon switching chips ive
 come across. You have a knob to control learning and another to control
 flooding.
 

All right this looks like a follow up patch to me. First build the interface
to configure the HW FDB. Then a second series to add a flooding knob which
works for both embedded switches and software switches.

 cheers,
 jamal
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-16 Thread Shradha Shah

Hello,

Please find my comments inline.

Regards,
Shradha Shah

On 02/16/2012 03:58 AM, Ben Hutchings wrote:
 [I'm just catching up with this after getting my own driver changes into
 shape.]
 
 On Fri, 2012-02-10 at 10:18 -0500, jamal wrote:
 Hi John,

 I went backwards to summarize at the top after going through your email.

 TL;DR version 0.1: 
 you provide a good use case where it makes sense to do things in the
 kernel. IMO, you could make the same arguement if your embedded switch
 could do ACLs, IPv4 forwarding etc. And the kernel bloats.
 I am always bigoted to move all policy control to user space instead of
 bloating in the kernel.
 [...]
 Now here is the potential issue,

 (G) The frame transmitted from ethx.y with the destination address of
 veth0 but the embedded switch is not a learning switch. If the FDB
 update is done in user space its possible (likely?) that the FDB
 entry for veth0 has not been added to the embedded switch yet. 

 Ok, got it - so the catch here is the switch is not capable of learning.
 I think this depends on where learning is done. Your intent is to
 use the S/W bridge as something that does the learning for you i.e in
 the kernel. This makes the s/w bridge part of MUST-have-for-this-to-run.
 And that maybe the case for your use case.
 [...]
 
 Well, in addition, there are SR-IOV network adapters that don't have any
 bridge.  For these, the software bridge is necessary to handle
 multicast, broadcast and forwarding between local ports, not only to do
 learning.
 
 Solarflare's implementation of accelerated guest networking (which
 Shradha and I are gradually sending upstream) builds on libvirt's
 existing support for software bridges and assigns VFs to guests as a
 means to offload some of the forwarding.

I am also trying to work with bridging using macvtap. Libvirt supports
macvtap in four modes; vepa, bridge, private and passthrough mode.
Macvtap used in the bridge mode will work similar to a software bridge and 
will improve performance.

 
 If and when we implement a hardware bridge, we would probably still want
 to keep the software bridge as a fallback.  If a guest is dependent on a
 VF that's connected to a hardware bridge, it becomes impossible or at
 least very disruptive to migrate it to another host that doesn't have a
 compatible VF available.
 
 Ben.
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-15 Thread Jamal Hadi Salim

On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote:

 Roopa was likely on the right track here,
 
 http://patchwork.ozlabs.org/patch/123064/

Doesnt seem related to the bridging stuff - the modeling looks
reasonable however.

 But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
 netlink messages. And if possible drive this without extending ndo_ops.
 
 An ideal user space interaction IMHO would look like,
 
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb
 portmac addrflags
 veth2   36:a6:35:9b:96:c4   local
 veth4   aa:54:b0:7b:42:ef   local
 veth0   2a:e8:5c:95:6c:1b   local
 veth6   6e:26:d5:43:a3:36   local
 veth0   f2:c1:39:76:6a:fb
 veth8   4e:35:16:af:87:13   local
 veth10  52:e5:62:7b:57:88   static
 veth10  aa:a9:35:21:15:c4   local

Looks nice, where is the targeted bridge(eg br0) in that syntax?

 Using Stephen's br tool. First command adds FDB entry to SW bridge and
 if the same tool could be used to add entries to embedded bridge I think
 that would be the best case. 

That would be nice (although adds dependency on the presence of the
s/ware bridge). Would be nicer to have either a knob in the kernel to
say synchronize with h/w bridge foo which can be turned off.  

 So no RTNETLINK error on the second cmd. Then
 embedded FDB entries could be dumped this way also so I get a complete view
 of my FDB setup across multiple sw bridges and embedded bridges.

So if you had multiple h/ware bridges - which one is tied to br0? 


 Yes. The hardware has a bit to support this which is currently not exposed
 to user space. That's a case where we have 'yet another knob' that needs
 a clean solution. This causes real bugs today when users try to use the
 macvlan devices in VEPA mode on top of SR-IOV. By the way these modes are
 all part of the 802.1Qbg spec which people actually want to use with Linux
 so a good clean solution is probably needed.


I think the knobs to flood and learn are important. The hardware
seems to have the flood but not the learn/discover. I think the
s/ware bridge needs to have both. At the moment - as pointed out in that
*NEIGH* notification, s/w bridge assumes a policy that could be
considered a security flaw in some circles - just because you are my
neighbor does not mean i trust you to come into my house; i may trust
you partially and allow you only to come through the front door. Even in
Canada with a default policy of not locking your door we sometimes lock
our doors ;-


 I have no problem with drawing the line here and trying to implement something
 over PF_BRIDGE:RTM_xxx nlmsgs. 


My comment/concern was in regard to the bridge built-in policy of
reading from the neighbor updates (refer to above comments)

cheers,
jamal


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-15 Thread John Fastabend

On 2/15/2012 6:10 AM, Jamal Hadi Salim wrote:
 On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote:
 
 Roopa was likely on the right track here,

 http://patchwork.ozlabs.org/patch/123064/
 
 Doesnt seem related to the bridging stuff - the modeling looks
 reasonable however.
 

The operations are really the same ADD/DEL/GET additional MAC
addresses to a port, in this case a macvlan type port. The
difference is the  macvlan port type drops any packet with an
address not in the FDB where the bridge type floods these.

 But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
 netlink messages. And if possible drive this without extending ndo_ops.

 An ideal user space interaction IMHO would look like,

 [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb
 portmac addrflags
 veth2   36:a6:35:9b:96:c4   local
 veth4   aa:54:b0:7b:42:ef   local
 veth0   2a:e8:5c:95:6c:1b   local
 veth6   6e:26:d5:43:a3:36   local
 veth0   f2:c1:39:76:6a:fb
 veth8   4e:35:16:af:87:13   local
 veth10  52:e5:62:7b:57:88   static
 veth10  aa:a9:35:21:15:c4   local
 
 Looks nice, where is the targeted bridge(eg br0) in that syntax?

[root@jf-dev1-dcblab src]# br fdb help
Usage: br fdb { add | del | replace } ADDR dev DEV
   br fdb {show} [ dev DEV ]

In my example I just dumped all bridge devices,

#br fdb show dev bridge0

 
 Using Stephen's br tool. First command adds FDB entry to SW bridge and
 if the same tool could be used to add entries to embedded bridge I think
 that would be the best case. 
 
 That would be nice (although adds dependency on the presence of the
 s/ware bridge). Would be nicer to have either a knob in the kernel to
 say synchronize with h/w bridge foo which can be turned off.  
 

Seems we need both a synchronize and a { add | del | replace } option.

 So no RTNETLINK error on the second cmd. Then
 embedded FDB entries could be dumped this way also so I get a complete view
 of my FDB setup across multiple sw bridges and embedded bridges.
 
 So if you had multiple h/ware bridges - which one is tied to br0? 
 

Not sure I follow but does the additional dev parameter above answer this?

 
 Yes. The hardware has a bit to support this which is currently not exposed
 to user space. That's a case where we have 'yet another knob' that needs
 a clean solution. This causes real bugs today when users try to use the
 macvlan devices in VEPA mode on top of SR-IOV. By the way these modes are
 all part of the 802.1Qbg spec which people actually want to use with Linux
 so a good clean solution is probably needed.
 
 
 I think the knobs to flood and learn are important. The hardware
 seems to have the flood but not the learn/discover. I think the
 s/ware bridge needs to have both. At the moment - as pointed out in that
 *NEIGH* notification, s/w bridge assumes a policy that could be
 considered a security flaw in some circles - just because you are my
 neighbor does not mean i trust you to come into my house; i may trust
 you partially and allow you only to come through the front door. Even in
 Canada with a default policy of not locking your door we sometimes lock
 our doors ;-
 
 
 I have no problem with drawing the line here and trying to implement 
 something
 over PF_BRIDGE:RTM_xxx nlmsgs. 
 
 
 My comment/concern was in regard to the bridge built-in policy of
 reading from the neighbor updates (refer to above comments)
 

So I think what your saying is a per port bit to disable learning...

hmm but if you start tweaking it too much it looks less and less like a
802.1D bridge and more like something you would want to build with tc or
openvswitch or tc+bridge or tc+macvlan.

.John

 cheers,
 jamal
 
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-15 Thread Ben Hutchings

[I'm just catching up with this after getting my own driver changes into
shape.]

On Fri, 2012-02-10 at 10:18 -0500, jamal wrote:
 Hi John,
 
 I went backwards to summarize at the top after going through your email.
 
 TL;DR version 0.1: 
 you provide a good use case where it makes sense to do things in the
 kernel. IMO, you could make the same arguement if your embedded switch
 could do ACLs, IPv4 forwarding etc. And the kernel bloats.
 I am always bigoted to move all policy control to user space instead of
 bloating in the kernel.
[...]
  Now here is the potential issue,
  
  (G) The frame transmitted from ethx.y with the destination address of
  veth0 but the embedded switch is not a learning switch. If the FDB
  update is done in user space its possible (likely?) that the FDB
  entry for veth0 has not been added to the embedded switch yet. 
 
 Ok, got it - so the catch here is the switch is not capable of learning.
 I think this depends on where learning is done. Your intent is to
 use the S/W bridge as something that does the learning for you i.e in
 the kernel. This makes the s/w bridge part of MUST-have-for-this-to-run.
 And that maybe the case for your use case.
[...]

Well, in addition, there are SR-IOV network adapters that don't have any
bridge.  For these, the software bridge is necessary to handle
multicast, broadcast and forwarding between local ports, not only to do
learning.

Solarflare's implementation of accelerated guest networking (which
Shradha and I are gradually sending upstream) builds on libvirt's
existing support for software bridges and assigns VFs to guests as a
means to offload some of the forwarding.

If and when we implement a hardware bridge, we would probably still want
to keep the software bridge as a fallback.  If a guest is dependent on a
VF that's connected to a hardware bridge, it becomes impossible or at
least very disruptive to migrate it to another host that doesn't have a
compatible VF available.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-14 Thread jamal

On Mon, 2012-02-13 at 07:13 -0800, John Fastabend wrote:

 The use case here is multiple VFs but the same solution should work with
 multiple PFs as well. FDB controls should be independent of how the ports
 are exposed VFs, PFs, VMDQ/queue pairs, macvlan, etc.

Makes sense.

 With events and ADD/DEL/GET FDB controls we can solve both cases. This also
 solves Roopa's case with macvlan where he wants to add additional addresses
 to macvlan ports.

Not familiar with that issue - I'll prowl the list.

 Yes it should flood here, unless its acting as a 802.1Qbg VEB or VEPA.

Ok. So there is a toggle somewhere which controls how flooding should
happen.

 
 Maybe not. But the kernel already has the needed signals with one extra
 hook we can save running a daemon in user space. Maybe that's not a great
 argument to add kernel code though.

You make a reasonable arguement to have it in the kernel but i think we
win more if we separate the control. So while i empathize, I am hoping
that youd go with the path that is hard to travel ;-

 The PF_BRIDGE:RTM_GETNEIGH,RTM_NEWNEIGH,RTM_DELNEIGH are registered in the
 br_netlink_init() path. 

Hrm - hadnt paid attention to that before. Nasty.
The bridge seems to be hard-coding policy on station movement, no? 
This is a good example of the qualms i have on adding things to the
kernel;-
I may not want to auto update a MAC address moving ports as part of
some policy i have. I can go and add YAK (Yet Another Knob) - but where
is the line drawn?

cheers,
jamal


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-14 Thread John Fastabend

On 2/14/2012 5:18 AM, jamal wrote:
 On Mon, 2012-02-13 at 07:13 -0800, John Fastabend wrote:
 
 The use case here is multiple VFs but the same solution should work with
 multiple PFs as well. FDB controls should be independent of how the ports
 are exposed VFs, PFs, VMDQ/queue pairs, macvlan, etc.
 
 Makes sense.
 
 With events and ADD/DEL/GET FDB controls we can solve both cases. This also
 solves Roopa's case with macvlan where she wants to add additional addresses
 to macvlan ports.
 
 Not familiar with that issue - I'll prowl the list.

Roopa was likely on the right track here,

http://patchwork.ozlabs.org/patch/123064/

But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
netlink messages. And if possible drive this without extending ndo_ops.

An ideal user space interaction IMHO would look like,

[root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
[root@jf-dev1-dcblab iproute2]# ./br/br fdb
portmac addrflags
veth2   36:a6:35:9b:96:c4   local
veth4   aa:54:b0:7b:42:ef   local
veth0   2a:e8:5c:95:6c:1b   local
veth6   6e:26:d5:43:a3:36   local
veth0   f2:c1:39:76:6a:fb
veth8   4e:35:16:af:87:13   local
veth10  52:e5:62:7b:57:88   static
veth10  aa:a9:35:21:15:c4   local
[root@jf-dev1-dcblab iproute2]# ./br/br fdb add dev eth3 to 52:e5:62:7b:57:88
RTNETLINK answers: Invalid argument

Using Stephen's br tool. First command adds FDB entry to SW bridge and
if the same tool could be used to add entries to embedded bridge I think
that would be the best case. So no RTNETLINK error on the second cmd. Then
embedded FDB entries could be dumped this way also so I get a complete view
of my FDB setup across multiple sw bridges and embedded bridges.

I don't think br is part of iproute2 yet I just pulled it out of some RFC
but it works reasonably well and is intuitive enough.

 
 Yes it should flood here, unless its acting as a 802.1Qbg VEB or VEPA.
 
 Ok. So there is a toggle somewhere which controls how flooding should
 happen.
 

Yes. The hardware has a bit to support this which is currently not exposed
to user space. That's a case where we have 'yet another knob' that needs
a clean solution. This causes real bugs today when users try to use the
macvlan devices in VEPA mode on top of SR-IOV. By the way these modes are
all part of the 802.1Qbg spec which people actually want to use with Linux
so a good clean solution is probably needed.


 Maybe not. But the kernel already has the needed signals with one extra
 hook we can save running a daemon in user space. Maybe that's not a great
 argument to add kernel code though.
 
 You make a reasonable arguement to have it in the kernel but i think we
 win more if we separate the control. So while i empathize, I am hoping
 that youd go with the path that is hard to travel ;-
 
 The PF_BRIDGE:RTM_GETNEIGH,RTM_NEWNEIGH,RTM_DELNEIGH are registered in the
 br_netlink_init() path. 
 
 Hrm - hadnt paid attention to that before. Nasty.
 The bridge seems to be hard-coding policy on station movement, no? 
 This is a good example of the qualms i have on adding things to the
 kernel;-
 I may not want to auto update a MAC address moving ports as part of
 some policy i have. I can go and add YAK (Yet Another Knob) - but where
 is the line drawn?
 

I have no problem with drawing the line here and trying to implement something
over PF_BRIDGE:RTM_xxx nlmsgs. I'll work with Roopa and see if we can come
up with something in the next couple days.

w.r.t. VEPA/VEB and flooding behavior we could probably have a bit to indicate
if the port is a flooding port or not. Then users could build any sort of 
forwarding
table they wanted OR we could just drive it through a notifier (ndo_ops?) in the
macvlan path which does VEPA today.

OK I'll try to write some actual code now that can be critiqued.

 cheers,
 jamal
 
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-14 Thread Stephen Hemminger

On Tue, 14 Feb 2012 10:57:04 -0800
John Fastabend john.r.fastab...@intel.com wrote:

 On 2/14/2012 5:18 AM, jamal wrote:
  On Mon, 2012-02-13 at 07:13 -0800, John Fastabend wrote:
  
  The use case here is multiple VFs but the same solution should work with
  multiple PFs as well. FDB controls should be independent of how the ports
  are exposed VFs, PFs, VMDQ/queue pairs, macvlan, etc.
  
  Makes sense.
  
  With events and ADD/DEL/GET FDB controls we can solve both cases. This also
  solves Roopa's case with macvlan where she wants to add additional 
  addresses
  to macvlan ports.
  
  Not familiar with that issue - I'll prowl the list.
 
 Roopa was likely on the right track here,
 
 http://patchwork.ozlabs.org/patch/123064/
 
 But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
 netlink messages. And if possible drive this without extending ndo_ops.
 
 An ideal user space interaction IMHO would look like,
 
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb
 portmac addrflags
 veth2   36:a6:35:9b:96:c4   local
 veth4   aa:54:b0:7b:42:ef   local
 veth0   2a:e8:5c:95:6c:1b   local
 veth6   6e:26:d5:43:a3:36   local
 veth0   f2:c1:39:76:6a:fb
 veth8   4e:35:16:af:87:13   local
 veth10  52:e5:62:7b:57:88   static
 veth10  aa:a9:35:21:15:c4   local
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb add dev eth3 to 52:e5:62:7b:57:88
 RTNETLINK answers: Invalid argument

I am going to put bridge (nameclash with br) tool into iproute2 (soon).
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-14 Thread John Fastabend

On 2/14/2012 11:05 AM, Stephen Hemminger wrote:
 On Tue, 14 Feb 2012 10:57:04 -0800
 John Fastabend john.r.fastab...@intel.com wrote:
 
 On 2/14/2012 5:18 AM, jamal wrote:
 On Mon, 2012-02-13 at 07:13 -0800, John Fastabend wrote:

 The use case here is multiple VFs but the same solution should work with
 multiple PFs as well. FDB controls should be independent of how the ports
 are exposed VFs, PFs, VMDQ/queue pairs, macvlan, etc.

 Makes sense.

 With events and ADD/DEL/GET FDB controls we can solve both cases. This also
 solves Roopa's case with macvlan where she wants to add additional 
 addresses
 to macvlan ports.

 Not familiar with that issue - I'll prowl the list.

 Roopa was likely on the right track here,

 http://patchwork.ozlabs.org/patch/123064/

 But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
 netlink messages. And if possible drive this without extending ndo_ops.

 An ideal user space interaction IMHO would look like,

 [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb
 portmac addrflags
 veth2   36:a6:35:9b:96:c4   local
 veth4   aa:54:b0:7b:42:ef   local
 veth0   2a:e8:5c:95:6c:1b   local
 veth6   6e:26:d5:43:a3:36   local
 veth0   f2:c1:39:76:6a:fb
 veth8   4e:35:16:af:87:13   local
 veth10  52:e5:62:7b:57:88   static
 veth10  aa:a9:35:21:15:c4   local
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb add dev eth3 to 52:e5:62:7b:57:88
 RTNETLINK answers: Invalid argument
 
 I am going to put bridge (nameclash with br) tool into iproute2 (soon).

I've been using it on my dev box for awhile now and it works well for
me.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-13 Thread jamal

On Fri, 2012-02-10 at 08:39 -0800, Stephen Hemminger wrote:

 Some related discussion points:
  * the bridge needs to support control from both userspace (MSTP, TRILL, ...)
and kernel space (offload etc)

I think all are pretty much covered if you let some controler (I prefer
user space) ADD/DEL/GET/Event on the fdb 
TRILL really is outside the scope of this; from an encap/decap it
probably needs to be YAND (Yet another netdev) and from a control side
of things you need to just provide the above netlink ops(ADD, etC) on
the fdb and let the controller worry about things (Actually you _may_
need to have learning done outside of the kernel for TRILL)

  * the bridge forwarding database is simpler and different than the existing
neighbor table, don't remember the details but last time I checked it
using neighbor table in bridge would be putting square peg in round hole.

Agreed.

cheers,
jamal


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-13 Thread John Fastabend

On 2/10/2012 7:18 AM, jamal wrote:
 Hi John,
 
 I went backwards to summarize at the top after going through your email.
 
 TL;DR version 0.1: 
 you provide a good use case where it makes sense to do things in the
 kernel. IMO, you could make the same arguement if your embedded switch
 could do ACLs, IPv4 forwarding etc. And the kernel bloats.
 I am always bigoted to move all policy control to user space instead of
 bloating in the kernel.
 
  
 On Thu, 2012-02-09 at 20:14 -0800, John Fastabend wrote:
 

 Hi Jamal,

 The user space app in this case would listen for FDB updates to the SW
 bridge and then mirror them at the embedded NIC. In this case it seems
 easier to just add a notifier chain and let the kernel keep these in
 sync. Otherwise we need a daemon in user space to replicate these.

 
 A user space daemon if you need to ensure synchronization. Thats what i
 meant when i said there was a disadvantage over the simple case when
 the goal is always to synchronize.
 
 On the other hand if you could make the same RTM_NEWNEIGH, RTM_DELNEIGH,
 and RTM_GETNEIGH work for the bridge, embedded bridge, and macvlan you
 would have one common interface to drive these. But the bridge already
 has this protocol/msgtype so that would require either some demux or
 new protocol/msgtype pairs to be created. 

 
 The bridge is very netlink friendly these days. Given the rest of the
 network stack (*NEIGH* you mention above) talks netlink to user space
 it should be workable. 
 
 Let me think on it. I'm tempted by the simplicity of adding notifier
 hooks though.
 
 If something is missing bridge-side it may need to be added (as Per
 Stephen's comment) - i just took it one further indicating those
 notifiers need to also netlink-speak
 

Sure.

 
 Actually because the bridge is adding/removing fdb entries dynamically
 maybe its best this gets done in kernel. Here's the example case,
 
 [..]
 

 With the flow by letters above hope this is not too difficult to follow.
 
 (A) veth0 a virtual device transmits packet destined for ethx.y
 (B) SW bridge receives frames and updates FDB flooding to C
 (C) eth0 the PF in this case sends the frame to the HW backed by the
 embedded bridge
 
 Following so far.
 Can you have more than one PF per embedded switch? Or is the intent here
 purely to do VMs/VF separation?
 

The use case here is multiple VFs but the same solution should work with
multiple PFs as well. FDB controls should be independent of how the ports
are exposed VFs, PFs, VMDQ/queue pairs, macvlan, etc.

 (D) The HW embedded switch has a static entry for ethx.y and forwards
 the frame to the VF or if its a broadcast frame also floods it to
 the wire and ethx.y
 
 nod.
 
 (E) ethx.y receives the frame and generates a response to the dest mac of
 veth0
 
 nod.
 Since you said in #D the entries in the switch are static, I am assuming
 at this point neither ethx.y nor veth0 exist in the embedded FDB.
 
 Now here is the potential issue,

 (G) The frame transmitted from ethx.y with the destination address of
 veth0 but the embedded switch is not a learning switch. If the FDB
 update is done in user space its possible (likely?) that the FDB
 entry for veth0 has not been added to the embedded switch yet. 
 
 Ok, got it - so the catch here is the switch is not capable of learning.
 I think this depends on where learning is done. Your intent is to
 use the S/W bridge as something that does the learning for you i.e in
 the kernel. This makes the s/w bridge part of MUST-have-for-this-to-run.
 And that maybe the case for your use case.
 

This is _my_ use case today.

 What if I dont wanna run the S/W bridge at all?
 Ive been making a point that with a simple knob(Stephen doesn like to
 add such a knob), the SW bridge could defer learning to user space. 
 [This way you can add a lot of richness e.g on ACLs such as restricting
 what MAC addresses etc are allowed to talk to which ones etc.].
 But if bypass the s/w bridge all together and learn in user space
 or have a static config in which i populate the embedded switch, i dont
 see the issue.

With events and ADD/DEL/GET FDB controls we can solve both cases. This also
solves Roopa's case with macvlan where he wants to add additional addresses
to macvlan ports.

 
 Now
 we either have to flood the frame which is not horrible but not
 ideal or worse if the embedded switch does not support flooding send
 it to the wire and veth0 never receives it. 
 
 If it is a switch it has to flood, no? Otherwise it sounds broken.
 

Yes it should flood here, unless its acting as a 802.1Qbg VEB or VEPA.

 If the SW bridge pushes
 the FDB update down into the embedded switch the address is for
 sure in the embedded switches forwarding tables and the switching
 works as expected.
 
 Yes, there is a small gap between the s/w bridge learning and the
 synchronization happening to the embedded nic switch. That gap gets
 larger if you defer learning to

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-10 Thread Roopa Prabhu




On 2/9/12 9:36 AM, John Fastabend john.r.fastab...@intel.com wrote:

 On 2/8/2012 8:36 PM, Stephen Hemminger wrote:
 On Wed, 08 Feb 2012 19:22:06 -0800
 John Fastabend john.r.fastab...@intel.com wrote:
 
 Propagate software FDB table into hardware uc, mc lists when
 the NETIF_F_HW_FDB is set.
 
 This resolves the case below where an embedded switch is used
 in hardware to do inter-VF or VF-PF switching. This patch
 pushes the FDB entry (specifically the MAC address) into the
 embedded switch with dev_add_uc and dev_add_mc so the switch
 learns about the software bridge.
 
 
   veth0  veth2
 |  |
   
   |  bridge0 |    software bridging
   
/
/
   ethx.y  ethx
 VF PF
  \ \   propagate FDB entries to HW
  \ \
   
   |  Embedded Bridge | hardware offloaded switching
   
 
 This is only an RFC couple more changes are needed.
 
 (1) Optimize HW FDB set/del to only walk list if an FDB offloaded
 device is attached. Or decide it doesn't matter from unlikely()
 path.
 
 (2) Is it good enough to just call dev_uc_{add|del} or
 dev_mc_{add|del}? Or do some devices really need a new netdev
 callback to do this operation correctly. I think it should be
 good enough as is.
 
 (3) wrapped list walk in rcu_read_lock() just in case maybe every
 case is already inside rcu_read_lock()/unlock().
 
 Also this is in response to this thread regarding the macvlan and
 exposing rx filters posting now to see if folks think this is the
 right idea and if it will resolve at least the bridge case.
 
 http://lists.openwall.net/netdev/2011/11/08/135
 
 Signed-off-by: John Fastabend john.r.fastab...@intel.com
 ---
 
  include/linux/netdev_features.h |2 ++
  net/bridge/br_fdb.c |   34 ++
  2 files changed, 36 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/netdev_features.h
 b/include/linux/netdev_features.h
 index 77f5202..5936fae 100644
 
 Rather than yet another device feature, I would rather use netlink_notifier
 callback. The notifier is more general and generic without messing with
 internals
 of bridge.
 
 
 But the device features makes it easy for user space to learn that the device
 supports this sort of offload. Now if all SR-IOV devices support this then it
 doesn't matter but I thought there were SR-IOV devices that didn't do any
 switching? I'll dig through the SR-IOV drivers to check there are not too
 many of them.

Correct. Our 802.1Qbh sriov device (enic) does not do local switching.

 
 By netlink_notifier do you mean adding a notifier_block and using
 atomic_notifier_call_chain()
 probably in rtnl_notify()? Then drivers could register with the notifier chain
 with
 atomic_notifier_chain_register() and receive the events correctly. Or did I
 miss
 some notifier chain that already exists?
 
 Thanks,
 John
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-10 Thread jamal

Hi John,

I went backwards to summarize at the top after going through your email.

TL;DR version 0.1: 
you provide a good use case where it makes sense to do things in the
kernel. IMO, you could make the same arguement if your embedded switch
could do ACLs, IPv4 forwarding etc. And the kernel bloats.
I am always bigoted to move all policy control to user space instead of
bloating in the kernel.

 
On Thu, 2012-02-09 at 20:14 -0800, John Fastabend wrote:

  
  Hi Jamal,
  
  The user space app in this case would listen for FDB updates to the SW
  bridge and then mirror them at the embedded NIC. In this case it seems
  easier to just add a notifier chain and let the kernel keep these in
  sync. Otherwise we need a daemon in user space to replicate these.
  

A user space daemon if you need to ensure synchronization. Thats what i
meant when i said there was a disadvantage over the simple case when
the goal is always to synchronize.

  On the other hand if you could make the same RTM_NEWNEIGH, RTM_DELNEIGH,
  and RTM_GETNEIGH work for the bridge, embedded bridge, and macvlan you
  would have one common interface to drive these. But the bridge already
  has this protocol/msgtype so that would require either some demux or
  new protocol/msgtype pairs to be created. 
  

The bridge is very netlink friendly these days. Given the rest of the
network stack (*NEIGH* you mention above) talks netlink to user space
it should be workable. 

  Let me think on it. I'm tempted by the simplicity of adding notifier
  hooks though.

If something is missing bridge-side it may need to be added (as Per
Stephen's comment) - i just took it one further indicating those
notifiers need to also netlink-speak


 Actually because the bridge is adding/removing fdb entries dynamically
 maybe its best this gets done in kernel. Here's the example case,

[..]

 
 With the flow by letters above hope this is not too difficult to follow.

 (A) veth0 a virtual device transmits packet destined for ethx.y
 (B) SW bridge receives frames and updates FDB flooding to C
 (C) eth0 the PF in this case sends the frame to the HW backed by the
 embedded bridge

Following so far.
Can you have more than one PF per embedded switch? Or is the intent here
purely to do VMs/VF separation?

 (D) The HW embedded switch has a static entry for ethx.y and forwards
 the frame to the VF or if its a broadcast frame also floods it to
 the wire and ethx.y

nod.

 (E) ethx.y receives the frame and generates a response to the dest mac of
 veth0

nod.
Since you said in #D the entries in the switch are static, I am assuming
at this point neither ethx.y nor veth0 exist in the embedded FDB.

 Now here is the potential issue,
 
 (G) The frame transmitted from ethx.y with the destination address of
 veth0 but the embedded switch is not a learning switch. If the FDB
 update is done in user space its possible (likely?) that the FDB
 entry for veth0 has not been added to the embedded switch yet. 

Ok, got it - so the catch here is the switch is not capable of learning.
I think this depends on where learning is done. Your intent is to
use the S/W bridge as something that does the learning for you i.e in
the kernel. This makes the s/w bridge part of MUST-have-for-this-to-run.
And that maybe the case for your use case.

What if I dont wanna run the S/W bridge at all?
Ive been making a point that with a simple knob(Stephen doesn like to
add such a knob), the SW bridge could defer learning to user space. 
[This way you can add a lot of richness e.g on ACLs such as restricting
what MAC addresses etc are allowed to talk to which ones etc.].
But if bypass the s/w bridge all together and learn in user space
or have a static config in which i populate the embedded switch, i dont
see the issue.

 Now
 we either have to flood the frame which is not horrible but not
 ideal or worse if the embedded switch does not support flooding send
 it to the wire and veth0 never receives it. 

If it is a switch it has to flood, no? Otherwise it sounds broken.

 If the SW bridge pushes
 the FDB update down into the embedded switch the address is for
 sure in the embedded switches forwarding tables and the switching
 works as expected.

Yes, there is a small gap between the s/w bridge learning and the
synchronization happening to the embedded nic switch. That gap gets
larger if you defer learning to user space. But like you said earlier,
during that gap packets are flooded - and do you care if the
synchronization doesnt happen immediately?

 So to handle this case correctly its probably best IMHO to use a notifier
 hook. Having a RTM_GETNEIGH for the embedded switch implemented though
 would be nice for dumping the FDB of the embedded switch and SET/DEL
 could be used to configure the FDB when its not being driven by the SW
 switch. Of course we should try to be minimalists here.

Do you need to have a different *NEIGH* than what we already have
really?

The

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-10 Thread Stephen Hemminger

On Fri, 10 Feb 2012 10:18:31 -0500
jamal h...@cyberus.ca wrote:

 Hi John,
 
 I went backwards to summarize at the top after going through your email.
 
 TL;DR version 0.1: 
 you provide a good use case where it makes sense to do things in the
 kernel. IMO, you could make the same arguement if your embedded switch
 could do ACLs, IPv4 forwarding etc. And the kernel bloats.
 I am always bigoted to move all policy control to user space instead of
 bloating in the kernel.
 
  
 On Thu, 2012-02-09 at 20:14 -0800, John Fastabend wrote:
 
   
   Hi Jamal,
   
   The user space app in this case would listen for FDB updates to the SW
   bridge and then mirror them at the embedded NIC. In this case it seems
   easier to just add a notifier chain and let the kernel keep these in
   sync. Otherwise we need a daemon in user space to replicate these.
   
 
 A user space daemon if you need to ensure synchronization. Thats what i
 meant when i said there was a disadvantage over the simple case when
 the goal is always to synchronize.
 
   On the other hand if you could make the same RTM_NEWNEIGH, RTM_DELNEIGH,
   and RTM_GETNEIGH work for the bridge, embedded bridge, and macvlan you
   would have one common interface to drive these. But the bridge already
   has this protocol/msgtype so that would require either some demux or
   new protocol/msgtype pairs to be created. 
   
 
 The bridge is very netlink friendly these days. Given the rest of the
 network stack (*NEIGH* you mention above) talks netlink to user space
 it should be workable. 
 
   Let me think on it. I'm tempted by the simplicity of adding notifier
   hooks though.
 
 If something is missing bridge-side it may need to be added (as Per
 Stephen's comment) - i just took it one further indicating those
 notifiers need to also netlink-speak
 
 
  Actually because the bridge is adding/removing fdb entries dynamically
  maybe its best this gets done in kernel. Here's the example case,
 
 [..]
 
  
  With the flow by letters above hope this is not too difficult to follow.
 
  (A) veth0 a virtual device transmits packet destined for ethx.y
  (B) SW bridge receives frames and updates FDB flooding to C
  (C) eth0 the PF in this case sends the frame to the HW backed by the
  embedded bridge
 
 Following so far.
 Can you have more than one PF per embedded switch? Or is the intent here
 purely to do VMs/VF separation?
 
  (D) The HW embedded switch has a static entry for ethx.y and forwards
  the frame to the VF or if its a broadcast frame also floods it to
  the wire and ethx.y
 
 nod.
 
  (E) ethx.y receives the frame and generates a response to the dest mac of
  veth0
 
 nod.
 Since you said in #D the entries in the switch are static, I am assuming
 at this point neither ethx.y nor veth0 exist in the embedded FDB.
 
  Now here is the potential issue,
  
  (G) The frame transmitted from ethx.y with the destination address of
  veth0 but the embedded switch is not a learning switch. If the FDB
  update is done in user space its possible (likely?) that the FDB
  entry for veth0 has not been added to the embedded switch yet. 
 
 Ok, got it - so the catch here is the switch is not capable of learning.
 I think this depends on where learning is done. Your intent is to
 use the S/W bridge as something that does the learning for you i.e in
 the kernel. This makes the s/w bridge part of MUST-have-for-this-to-run.
 And that maybe the case for your use case.
 
 What if I dont wanna run the S/W bridge at all?
 Ive been making a point that with a simple knob(Stephen doesn like to
 add such a knob), the SW bridge could defer learning to user space. 
 [This way you can add a lot of richness e.g on ACLs such as restricting
 what MAC addresses etc are allowed to talk to which ones etc.].
 But if bypass the s/w bridge all together and learn in user space
 or have a static config in which i populate the embedded switch, i dont
 see the issue.
 
  Now
  we either have to flood the frame which is not horrible but not
  ideal or worse if the embedded switch does not support flooding send
  it to the wire and veth0 never receives it. 
 
 If it is a switch it has to flood, no? Otherwise it sounds broken.
 
  If the SW bridge pushes
  the FDB update down into the embedded switch the address is for
  sure in the embedded switches forwarding tables and the switching
  works as expected.
 
 Yes, there is a small gap between the s/w bridge learning and the
 synchronization happening to the embedded nic switch. That gap gets
 larger if you defer learning to user space. But like you said earlier,
 during that gap packets are flooded - and do you care if the
 synchronization doesnt happen immediately?
 
  So to handle this case correctly its probably best IMHO to use a notifier
  hook. Having a RTM_GETNEIGH for the embedded switch implemented though
  would be nice for dumping the FDB of the embedded switch and SET/DEL
  could be used to

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/8/2012 8:36 PM, Stephen Hemminger wrote:
 On Wed, 08 Feb 2012 19:22:06 -0800
 John Fastabend john.r.fastab...@intel.com wrote:
 
 Propagate software FDB table into hardware uc, mc lists when
 the NETIF_F_HW_FDB is set.

 This resolves the case below where an embedded switch is used
 in hardware to do inter-VF or VF-PF switching. This patch
 pushes the FDB entry (specifically the MAC address) into the
 embedded switch with dev_add_uc and dev_add_mc so the switch
 learns about the software bridge.


   veth0  veth2
 |  |
   
   |  bridge0 |    software bridging
   
/
/
   ethx.y  ethx
 VF PF
  \ \   propagate FDB entries to HW
  \ \
   
   |  Embedded Bridge | hardware offloaded switching
   

 This is only an RFC couple more changes are needed.

 (1) Optimize HW FDB set/del to only walk list if an FDB offloaded
 device is attached. Or decide it doesn't matter from unlikely()
 path.

 (2) Is it good enough to just call dev_uc_{add|del} or
 dev_mc_{add|del}? Or do some devices really need a new netdev
 callback to do this operation correctly. I think it should be
 good enough as is.

 (3) wrapped list walk in rcu_read_lock() just in case maybe every
 case is already inside rcu_read_lock()/unlock().

 Also this is in response to this thread regarding the macvlan and
 exposing rx filters posting now to see if folks think this is the
 right idea and if it will resolve at least the bridge case.

 http://lists.openwall.net/netdev/2011/11/08/135

 Signed-off-by: John Fastabend john.r.fastab...@intel.com
 ---

  include/linux/netdev_features.h |2 ++
  net/bridge/br_fdb.c |   34 ++
  2 files changed, 36 insertions(+), 0 deletions(-)

 diff --git a/include/linux/netdev_features.h 
 b/include/linux/netdev_features.h
 index 77f5202..5936fae 100644
 
 Rather than yet another device feature, I would rather use netlink_notifier
 callback. The notifier is more general and generic without messing with 
 internals
 of bridge.
 

But the device features makes it easy for user space to learn that the device
supports this sort of offload. Now if all SR-IOV devices support this then it
doesn't matter but I thought there were SR-IOV devices that didn't do any
switching? I'll dig through the SR-IOV drivers to check there are not too
many of them.

By netlink_notifier do you mean adding a notifier_block and using 
atomic_notifier_call_chain()
probably in rtnl_notify()? Then drivers could register with the notifier chain 
with
atomic_notifier_chain_register() and receive the events correctly. Or did I miss
some notifier chain that already exists?

Thanks,
John

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread Stephen Hemminger

On Thu, 09 Feb 2012 09:36:47 -0800
John Fastabend john.r.fastab...@intel.com wrote:

 But the device features makes it easy for user space to learn that the device
 supports this sort of offload. Now if all SR-IOV devices support this then it
 doesn't matter but I thought there were SR-IOV devices that didn't do any
 switching? I'll dig through the SR-IOV drivers to check there are not too
 many of them.

If user space needs to know then the OS is not designed properly.
The purpose of the network device is to abstract all those details, and more 
and more
of them are bleeding through. This makes writing management applications harder 
and makes
things dependent on features that may or may not be present. The best design is 
when
the change is invisible.

 By netlink_notifier do you mean adding a notifier_block and using 
 atomic_notifier_call_chain()
 probably in rtnl_notify()? Then drivers could register with the notifier 
 chain with
 atomic_notifier_chain_register() and receive the events correctly. Or did I 
 miss
 some notifier chain that already exists?

Yes. that is what I mean. The callbacks you need may or may not already be 
present.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/9/2012 9:40 AM, Stephen Hemminger wrote:
 On Thu, 09 Feb 2012 09:36:47 -0800
 John Fastabend john.r.fastab...@intel.com wrote:
 
 But the device features makes it easy for user space to learn that the device
 supports this sort of offload. Now if all SR-IOV devices support this then it
 doesn't matter but I thought there were SR-IOV devices that didn't do any
 switching? I'll dig through the SR-IOV drivers to check there are not too
 many of them.
 
 If user space needs to know then the OS is not designed properly.
 The purpose of the network device is to abstract all those details, and more 
 and more
 of them are bleeding through. This makes writing management applications 
 harder and makes
 things dependent on features that may or may not be present. The best design 
 is when
 the change is invisible.
 

Agreed.

 By netlink_notifier do you mean adding a notifier_block and using 
 atomic_notifier_call_chain()
 probably in rtnl_notify()? Then drivers could register with the notifier 
 chain with
 atomic_notifier_chain_register() and receive the events correctly. Or did I 
 miss
 some notifier chain that already exists?
 
 Yes. that is what I mean. The callbacks you need may or may not already be 
 present.


OK thanks I'll put together an update here shortly.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread Sridhar Samudrala

On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote:
 Propagate software FDB table into hardware uc, mc lists when
 the NETIF_F_HW_FDB is set.
 
 This resolves the case below where an embedded switch is used
 in hardware to do inter-VF or VF-PF switching. This patch
 pushes the FDB entry (specifically the MAC address) into the
 embedded switch with dev_add_uc and dev_add_mc so the switch
 learns about the software bridge.
 
 
   veth0  veth2
 |  |
   
   |  bridge0 |    software bridging
   
/
/
   ethx.y  ethx
 VF PF
  \ \   propagate FDB entries to HW
  \ \
   
   |  Embedded Bridge | hardware offloaded switching
   
 

This scenario works now as adding an interface to a bridge puts it in
promiscuous mode. So adding a PF to a software bridge should not be
a problem as it supports promiscuous mode. But adding a VF will not
work.

Are you trying to avoid the requirement of having to put the interface 
in promiscuous mode when adding to a bridge?

Thanks
Sridhar



 This is only an RFC couple more changes are needed.
 
 (1) Optimize HW FDB set/del to only walk list if an FDB offloaded
 device is attached. Or decide it doesn't matter from unlikely()
 path.
 
 (2) Is it good enough to just call dev_uc_{add|del} or
 dev_mc_{add|del}? Or do some devices really need a new netdev
 callback to do this operation correctly. I think it should be
 good enough as is.
 
 (3) wrapped list walk in rcu_read_lock() just in case maybe every
 case is already inside rcu_read_lock()/unlock().
 
 Also this is in response to this thread regarding the macvlan and
 exposing rx filters posting now to see if folks think this is the
 right idea and if it will resolve at least the bridge case.
 
 http://lists.openwall.net/netdev/2011/11/08/135
 
 Signed-off-by: John Fastabend john.r.fastab...@intel.com
 ---
 
  include/linux/netdev_features.h |2 ++
  net/bridge/br_fdb.c |   34 ++
  2 files changed, 36 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
 index 77f5202..5936fae 100644
 --- a/include/linux/netdev_features.h
 +++ b/include/linux/netdev_features.h
 @@ -55,6 +55,8 @@ enum {
   NETIF_F_NOCACHE_COPY_BIT,   /* Use no-cache copyfromuser */
   NETIF_F_LOOPBACK_BIT,   /* Enable loopback */
 
 + NETIF_F_HW_FDB, /* Hardware supports switching */
 +
   /*
* Add your fresh new feature above and remember to update
* netdev_features_strings[] in net/core/ethtool.c and maybe
 diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
 index 5ba0c84..4cc545b 100644
 --- a/net/bridge/br_fdb.c
 +++ b/net/bridge/br_fdb.c
 @@ -81,9 +81,26 @@ static void fdb_rcu_free(struct rcu_head *head)
   kmem_cache_free(br_fdb_cache, ent);
  }
 
 +static void fdb_hw_delete(struct net_bridge *br,
 +   struct net_bridge_fdb_entry *fdb)
 +{
 + struct net_bridge_port *op;
 +
 + rcu_read_lock();
 + list_for_each_entry_rcu(op, br-port_list, list) {
 + struct net_device *dev = op-dev;
 +
 + if ((dev-features  NETIF_F_HW_FDB) 
 + dev != fdb-dst-dev)
 + dev_uc_del(dev, fdb-addr.addr);
 + }
 + rcu_read_unlock();
 +}
 +
  static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f)
  {
   hlist_del_rcu(f-hlist);
 + fdb_hw_delete(br, f);
   fdb_notify(br, f, RTM_DELNEIGH);
   call_rcu(f-rcu, fdb_rcu_free);
  }
 @@ -350,6 +367,22 @@ static struct net_bridge_fdb_entry *fdb_find_rcu(struct 
 hlist_head *head,
   return NULL;
  }
 
 +static void fdb_hw_create(struct net_bridge *br,
 +   struct net_bridge_fdb_entry *fdb)
 +{
 + struct net_bridge_port *op;
 +
 + rcu_read_lock();
 + list_for_each_entry_rcu(op, br-port_list, list) {
 + struct net_device *dev = op-dev;
 +
 + if ((dev-features  NETIF_F_HW_FDB) 
 + dev != fdb-dst-dev)
 + dev_uc_add(dev, fdb-addr.addr);
 + }
 + rcu_read_unlock();
 +}
 +
  static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
  struct net_bridge_port *source,
  const unsigned char *addr)
 @@ -363,6 +396,7 @@ static struct net_bridge_fdb_entry *fdb_create(struct 
 hlist_head *head,
   fdb-is_local = 0;
   fdb-is_static = 0;
   fdb-updated = fdb-used = jiffies;
 + fdb_hw_create(source-br, fdb);
   hlist_add_head_rcu(fdb-hlist, head);
   }
   return fdb;
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/9/2012 10:14 AM, Sridhar Samudrala wrote:
 On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote:
 Propagate software FDB table into hardware uc, mc lists when
 the NETIF_F_HW_FDB is set.

 This resolves the case below where an embedded switch is used
 in hardware to do inter-VF or VF-PF switching. This patch
 pushes the FDB entry (specifically the MAC address) into the
 embedded switch with dev_add_uc and dev_add_mc so the switch
 learns about the software bridge.


   veth0  veth2
 |  |
   
   |  bridge0 |    software bridging
   
/
/
   ethx.y  ethx
 VF PF
  \ \   propagate FDB entries to HW
  \ \
   
   |  Embedded Bridge | hardware offloaded switching
   

 
 This scenario works now as adding an interface to a bridge puts it in
 promiscuous mode. So adding a PF to a software bridge should not be
 a problem as it supports promiscuous mode. But adding a VF will not
 work.

It shouldn't work because the embedded bridge will lookup the address
in its FDB and when it doesn't match any unicast filters it will forward
the packet onto the wire. Because the veth0 and veth2 above never get
inserted into the embedded brdige's FDB the packets will _never_ get
routed there.

That said the current 'ixgbe' driver is doing something broken in that
it is always setting the unicast hash table and mirroring bits to 1. So
if you think this is working your seeing a bug where packets are being
sent onto the wire AND upto the PF. Packets with destination addresses
matching veth1 should not end up on the wire and vice versa. This is
specific to ixgbe and is not the case for other SR-IOV devices.

This causes some issues (a) has some very real performance implications,
(b) at this point you have some strange behavior from my point of view.
The embedded bridge is not a learning bridge nor is it acting like an
802.1Q VEB or VEPA.

 
 Are you trying to avoid the requirement of having to put the interface 
 in promiscuous mode when adding to a bridge?
 

I think the bridge being in promiscuous mode is correct.

Hope that helps sorry its a bit long winded.
John



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread jamal

On Thu, 2012-02-09 at 09:52 -0800, John Fastabend wrote:

  By netlink_notifier do you mean adding a notifier_block and using 
  atomic_notifier_call_chain()
  probably in rtnl_notify()? Then drivers could register with the notifier 
  chain with
  atomic_notifier_chain_register() and receive the events correctly. Or did 
  I miss
  some notifier chain that already exists?
  
  Yes. that is what I mean. The callbacks you need may or may not already be 
  present.

I'll go one step further.
This stuff shouldnt be in the kernel at all. 
The disadvantage is you need a user space app to update the hardware.
i.e, the same mechanism should be usable for either a switch embedded
in a NIC or a standalone hardware switch (with/out the s/ware bridge 
presence)

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread Sridhar Samudrala

On Thu, 2012-02-09 at 12:30 -0800, John Fastabend wrote:
 On 2/9/2012 10:14 AM, Sridhar Samudrala wrote:
  On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote:
  Propagate software FDB table into hardware uc, mc lists when
  the NETIF_F_HW_FDB is set.
 
  This resolves the case below where an embedded switch is used
  in hardware to do inter-VF or VF-PF switching. This patch
  pushes the FDB entry (specifically the MAC address) into the
  embedded switch with dev_add_uc and dev_add_mc so the switch
  learns about the software bridge.
 
 
veth0  veth2
  |  |

|  bridge0 |    software bridging

 /
 /
ethx.y  ethx
  VF PF
   \ \   propagate FDB entries to HW
   \ \

|  Embedded Bridge | hardware offloaded switching

 
  
  This scenario works now as adding an interface to a bridge puts it in
  promiscuous mode. So adding a PF to a software bridge should not be
  a problem as it supports promiscuous mode. But adding a VF will not
  work.
 
 It shouldn't work because the embedded bridge will lookup the address
 in its FDB and when it doesn't match any unicast filters it will forward
 the packet onto the wire. Because the veth0 and veth2 above never get
 inserted into the embedded brdige's FDB the packets will _never_ get
 routed there.
 
 That said the current 'ixgbe' driver is doing something broken in that
 it is always setting the unicast hash table and mirroring bits to 1. So
 if you think this is working your seeing a bug where packets are being
 sent onto the wire AND upto the PF. Packets with destination addresses
 matching veth1 should not end up on the wire and vice versa. This is
 specific to ixgbe and is not the case for other SR-IOV devices.

OK. Is this behavior going to be fixed.

 
 This causes some issues (a) has some very real performance implications,
 (b) at this point you have some strange behavior from my point of view.
 The embedded bridge is not a learning bridge nor is it acting like an
 802.1Q VEB or VEPA.
 
  
  Are you trying to avoid the requirement of having to put the interface 
  in promiscuous mode when adding to a bridge?
  
 
 I think the bridge being in promiscuous mode is correct.

The interface that is added to the bridge is put in promiscuous mode,
not the bridge itself.  In this example, i assumed that setting
promiscuous on PF is putting the embedded bridge in learning mode.

Thanks
Sridhar

 
 Hope that helps sorry its a bit long winded.
 John
 
 
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/9/2012 4:39 PM, Sridhar Samudrala wrote:
 On Thu, 2012-02-09 at 12:30 -0800, John Fastabend wrote:
 On 2/9/2012 10:14 AM, Sridhar Samudrala wrote:
 On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote:
 Propagate software FDB table into hardware uc, mc lists when
 the NETIF_F_HW_FDB is set.

 This resolves the case below where an embedded switch is used
 in hardware to do inter-VF or VF-PF switching. This patch
 pushes the FDB entry (specifically the MAC address) into the
 embedded switch with dev_add_uc and dev_add_mc so the switch
 learns about the software bridge.


   veth0  veth2
 |  |
   
   |  bridge0 |    software bridging
   
/
/
   ethx.y  ethx
 VF PF
  \ \   propagate FDB entries to HW
  \ \
   
   |  Embedded Bridge | hardware offloaded switching
   


 This scenario works now as adding an interface to a bridge puts it in
 promiscuous mode. So adding a PF to a software bridge should not be
 a problem as it supports promiscuous mode. But adding a VF will not
 work.

 It shouldn't work because the embedded bridge will lookup the address
 in its FDB and when it doesn't match any unicast filters it will forward
 the packet onto the wire. Because the veth0 and veth2 above never get
 inserted into the embedded brdige's FDB the packets will _never_ get
 routed there.

 That said the current 'ixgbe' driver is doing something broken in that
 it is always setting the unicast hash table and mirroring bits to 1. So
 if you think this is working your seeing a bug where packets are being
 sent onto the wire AND upto the PF. Packets with destination addresses
 matching veth1 should not end up on the wire and vice versa. This is
 specific to ixgbe and is not the case for other SR-IOV devices.
 
 OK. Is this behavior going to be fixed.
 

Only after we have a mechanism to either configure the NIC FDB directly
or have it stay in sync with the SW switch. Flooding traffic seems better
than being unable to send traffic to the virtual device altogether. This
behavior is driver specific some devices just fail outright.

I'm thinking over Jamal's comment now.


 This causes some issues (a) has some very real performance implications,
 (b) at this point you have some strange behavior from my point of view.
 The embedded bridge is not a learning bridge nor is it acting like an
 802.1Q VEB or VEPA.


 Are you trying to avoid the requirement of having to put the interface 
 in promiscuous mode when adding to a bridge?


 I think the bridge being in promiscuous mode is correct.
 
 The interface that is added to the bridge is put in promiscuous mode,
 not the bridge itself.  In this example, i assumed that setting
 promiscuous on PF is putting the embedded bridge in learning mode.


Yes I misspoke I mean the PF. The embedded bridge in this case does not
support learning. Also I'm not aware of any SR-IOV NICs that do support
learning.

 
 Thanks
 Sridhar
 

 Hope that helps sorry its a bit long winded.
 John



 
 
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/9/2012 1:11 PM, jamal wrote:
 On Thu, 2012-02-09 at 09:52 -0800, John Fastabend wrote:
 
 By netlink_notifier do you mean adding a notifier_block and using 
 atomic_notifier_call_chain()
 probably in rtnl_notify()? Then drivers could register with the notifier 
 chain with
 atomic_notifier_chain_register() and receive the events correctly. Or did 
 I miss
 some notifier chain that already exists?

 Yes. that is what I mean. The callbacks you need may or may not already be 
 present.
 
 I'll go one step further.
 This stuff shouldnt be in the kernel at all. 
 The disadvantage is you need a user space app to update the hardware.
 i.e, the same mechanism should be usable for either a switch embedded
 in a NIC or a standalone hardware switch (with/out the s/ware bridge 
 presence)
 
 cheers,
 jamal
 

Hi Jamal,

The user space app in this case would listen for FDB updates to the SW
bridge and then mirror them at the embedded NIC. In this case it seems
easier to just add a notifier chain and let the kernel keep these in
sync. Otherwise we need a daemon in user space to replicate these.

On the other hand if you could make the same RTM_NEWNEIGH, RTM_DELNEIGH,
and RTM_GETNEIGH work for the bridge, embedded bridge, and macvlan you
would have one common interface to drive these. But the bridge already
has this protocol/msgtype so that would require either some demux or
new protocol/msgtype pairs to be created. 

Let me think on it. I'm tempted by the simplicity of adding notifier
hooks though.

.John


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend

On 2/9/2012 6:14 PM, John Fastabend wrote:
 On 2/9/2012 1:11 PM, jamal wrote:
 On Thu, 2012-02-09 at 09:52 -0800, John Fastabend wrote:

 By netlink_notifier do you mean adding a notifier_block and using 
 atomic_notifier_call_chain()
 probably in rtnl_notify()? Then drivers could register with the notifier 
 chain with
 atomic_notifier_chain_register() and receive the events correctly. Or did 
 I miss
 some notifier chain that already exists?

 Yes. that is what I mean. The callbacks you need may or may not already be 
 present.

 I'll go one step further.
 This stuff shouldnt be in the kernel at all. 
 The disadvantage is you need a user space app to update the hardware.
 i.e, the same mechanism should be usable for either a switch embedded
 in a NIC or a standalone hardware switch (with/out the s/ware bridge 
 presence)

 cheers,
 jamal

 
 Hi Jamal,
 
 The user space app in this case would listen for FDB updates to the SW
 bridge and then mirror them at the embedded NIC. In this case it seems
 easier to just add a notifier chain and let the kernel keep these in
 sync. Otherwise we need a daemon in user space to replicate these.
 
 On the other hand if you could make the same RTM_NEWNEIGH, RTM_DELNEIGH,
 and RTM_GETNEIGH work for the bridge, embedded bridge, and macvlan you
 would have one common interface to drive these. But the bridge already
 has this protocol/msgtype so that would require either some demux or
 new protocol/msgtype pairs to be created. 
 
 Let me think on it. I'm tempted by the simplicity of adding notifier
 hooks though.
 
 .John
 

Actually because the bridge is adding/removing fdb entries dynamically
maybe its best this gets done in kernel. Here's the example case,


  -- -
  | ethx.y |   E| veth0 |  --- A
  -- -
  |   |
  |   |
  |   |
  | --
  | |  SW Bridge | --- B
  | --
  |   |
  |   |
  |  -
  |  | eth0  | --- C
  |  -
  |   |
   ---
   |embedded switch  | --- D
   ---
   |
   |
   G

With the flow by letters above hope this is not too difficult to follow.

(A) veth0 a virtual device transmits packet destined for ethx.y
(B) SW bridge receives frames and updates FDB flooding to C
(C) eth0 the PF in this case sends the frame to the HW backed by the
embedded bridge
(D) The HW embedded switch has a static entry for ethx.y and forwards
the frame to the VF or if its a broadcast frame also floods it to
the wire and ethx.y
(E) ethx.y receives the frame and generates a response to the dest mac of
veth0

Now here is the potential issue,

(G) The frame transmitted from ethx.y with the destination address of
veth0 but the embedded switch is not a learning switch. If the FDB
update is done in user space its possible (likely?) that the FDB
entry for veth0 has not been added to the embedded switch yet. Now
we either have to flood the frame which is not horrible but not
ideal or worse if the embedded switch does not support flooding send
it to the wire and veth0 never receives it. If the SW bridge pushes
the FDB update down into the embedded switch the address is for
sure in the embedded switches forwarding tables and the switching
works as expected.

So to handle this case correctly its probably best IMHO to use a notifier
hook. Having a RTM_GETNEIGH for the embedded switch implemented though
would be nice for dumping the FDB of the embedded switch and SET/DEL
could be used to configure the FDB when its not being driven by the SW
switch. Of course we should try to be minimalists here.

.John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-08 Thread Stephen Hemminger

On Wed, 08 Feb 2012 19:22:06 -0800
John Fastabend john.r.fastab...@intel.com wrote:

 Propagate software FDB table into hardware uc, mc lists when
 the NETIF_F_HW_FDB is set.
 
 This resolves the case below where an embedded switch is used
 in hardware to do inter-VF or VF-PF switching. This patch
 pushes the FDB entry (specifically the MAC address) into the
 embedded switch with dev_add_uc and dev_add_mc so the switch
 learns about the software bridge.
 
 
   veth0  veth2
 |  |
   
   |  bridge0 |    software bridging
   
/
/
   ethx.y  ethx
 VF PF
  \ \   propagate FDB entries to HW
  \ \
   
   |  Embedded Bridge | hardware offloaded switching
   
 
 This is only an RFC couple more changes are needed.
 
 (1) Optimize HW FDB set/del to only walk list if an FDB offloaded
 device is attached. Or decide it doesn't matter from unlikely()
 path.
 
 (2) Is it good enough to just call dev_uc_{add|del} or
 dev_mc_{add|del}? Or do some devices really need a new netdev
 callback to do this operation correctly. I think it should be
 good enough as is.
 
 (3) wrapped list walk in rcu_read_lock() just in case maybe every
 case is already inside rcu_read_lock()/unlock().
 
 Also this is in response to this thread regarding the macvlan and
 exposing rx filters posting now to see if folks think this is the
 right idea and if it will resolve at least the bridge case.
 
 http://lists.openwall.net/netdev/2011/11/08/135
 
 Signed-off-by: John Fastabend john.r.fastab...@intel.com
 ---
 
  include/linux/netdev_features.h |2 ++
  net/bridge/br_fdb.c |   34 ++
  2 files changed, 36 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
 index 77f5202..5936fae 100644

Rather than yet another device feature, I would rather use netlink_notifier
callback. The notifier is more general and generic without messing with 
internals
of bridge.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

50 matches

Mail list logo