Re: [RFC PATCH 0/2] make mac programming for virtio net more robust

2013-01-11 Thread John Fastabend

On 1/10/2013 11:46 PM, Michael S. Tsirkin wrote:

On Fri, Jan 11, 2013 at 12:53:07PM +1030, Rusty Russell wrote:

"Michael S. Tsirkin"  writes:


On Thu, Jan 10, 2013 at 10:45:39PM +0800, ak...@redhat.com wrote:

From: Amos Kong 

Currenly mac is programmed byte by byte. This means that we
have an intermediate step where mac is wrong.

Second patch introduced a new vq control command to set mac
address in one time.


As you mention we could alternatively do it without
new commands, simply add a feature bit that says that MACs are
in the mac table.
This would be a much bigger patch, and I'm fine with either way.
Rusty what do you think?


Hmm, mac filtering and "my mac address" are not quite the same thing.  I
don't know if it matters for anyone: does it?
The mac address is abused
for things like identifying machines, etc.


I don't know either. I think net core differentiates between mac and
uc_list because linux has to know which mac to use when building
up packets, so at some level, I agree it might be useful to identify the
machine.

BTW netdev/davem should have been copied on this, Amos I think it's a
good idea to remember to do it next time you post.



If we keep it as a separate concept, Amos' patch seems to make sense.


Yes. It also keeps the patch small, I just thought I'd mention the
option.



Cheers,
Rusty.




Don't have the entire context here but if you implement the
ndo_fdb_dump() probably hooking it up to ndo_dflt_fdb_dump() you could
use the 'bridge' tool dump the uc_list.

Then use ndo_fdb_add() and ndo_fdb_del() to add and remove entries
from the uc_list. We do this today in macvlan and the ixgbe driver when
it is in SR-IOV mode and the embedded switch needs to be programmed.

fdb is "forwarding database" its a bit different then mac filtering
in that its telling the "switch" how to forward mac addresses, in
ixgbe and macvlan at least we have been overloading it a bit to also
stop filtering the mac address. I think this makes sense if you setup
forwarding to a port it doesn't make much sense to then drop them.

Maybe its not entirely applicable here just thought I would mention it.

Thanks,
John
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net

2012-06-25 Thread John Fastabend

On 6/25/2012 3:07 AM, Michael S. Tsirkin wrote:

On Mon, Jun 25, 2012 at 05:16:48PM +0800, Jason Wang wrote:

Hello All:

This series is an update version of multiqueue virtio-net driver based on
Krishna Kumar's work to let virtio-net use multiple rx/tx queues to do the
packets reception and transmission. Please review and comments.

Test Environment:
- Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
- Two directed connected 82599

Test Summary:

- Highlights: huge improvements on TCP_RR test
- Lowlights: regression on small packet transmission, higher cpu utilization
  than single queue, need further optimization


Didn't review yet, reacting this this paragraph:

To avoid regressions, it seems reasonable to make
the device use a single queue by default for now.
Add a way to switch multiqueue on/off using ethtool.

This way guest admin can tune the device for the
workload manually until we manage to imlement some
self-tuning heuristics.



Ethtool already has this switch 'ethtool -L' can be
used to set the number tx/rx channels. So you would
likely just need to add a set_channels hook.

.John
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next PATCH v0 3/5] net: add fdb generic dump routine

2012-03-26 Thread John Fastabend
On 3/25/2012 6:09 AM, Roopa Prabhu wrote:
> 
> 
> 
> On 3/18/12 11:52 PM, "John Fastabend"  wrote:
> 
>> This adds a generic dump routine drivers can call. It
>> should be sufficient to handle any bridging model that
>> uses the unicast address list. This should be most SR-IOV
>> enabled NICs.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>

[...]

>> +/**
>> + * ndo_dflt_fdb_dump: default netdevice operation to dump an FDB table.
>> + * @nlh: netlink message header
>> + * @dev: netdevice
>> + *
>> + * Default netdevice operation to dump the existing unicast address list.
>> + * Returns zero on success.
>> + */
>> +int ndo_dflt_fdb_dump(struct sk_buff *skb,
>> +struct netlink_callback *cb,
>> +struct net_device *dev,
>> +int idx)
>> +{
>> + struct netdev_hw_addr *ha;
>> + struct nlmsghdr *nlh;
>> + struct ndmsg *ndm;
>> + u32 pid, seq;
>> +
>> + pid = NETLINK_CB(cb->skb).pid;
>> + seq = cb->nlh->nlmsg_seq;
>> +
>> + netif_addr_lock_bh(dev);
>> + list_for_each_entry(ha, &dev->uc.list, list) {
>> +  if (idx < cb->args[0])
>> +   goto skip;
> 
> Any reason why its only uc ?. What about mc ?

Sure this might be useful to know for embedded devices
and likely more useful for the macvlan driver. I'll add
it in the next version.

Thanks,
John


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next PATCH v0 0/5] Series short description

2012-03-19 Thread John Fastabend
On 3/19/2012 5:35 PM, David Miller wrote:
> From: John Fastabend 
> Date: Mon, 19 Mar 2012 17:27:00 -0700
> 
>> Dave, its probably fine to push this to 3.5 then.
> 
> Fair enough.

Stephen, please let me know if you see any issues though
because without these we have no way to forward packets
correctly in the embedded switch. So we can't really
use SR-IOV and virtual interfaces together correctly. And
the macvlan device in passthru mode is putting the device
in promiscuous mode which isn't great either.

.John
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next PATCH v0 0/5] Series short description

2012-03-19 Thread John Fastabend
On 3/19/2012 3:55 PM, Stephen Hemminger wrote:
> On Mon, 19 Mar 2012 18:38:08 -0400 (EDT)
> David Miller  wrote:
> 
>> From: John Fastabend 
>> Date: Sun, 18 Mar 2012 23:51:45 -0700
>>
>>> This series is a follow up to this thread:
>>>
>>> http://www.spinics.net/lists/netdev/msg191360.html
>>
>> Can the interested parties please review this series?
>>
>> I'm willing to apply this right now if it looks OK, but if
>> it needs more revisions we'll have to defer.
> 
> Please don't rush this into this merge window. It needs more than
> 1 full day of review.

Dave, its probably fine to push this to 3.5 then. I can
resubmit after you close the merge window if you want? This
has been somewhat broken for SR-IOV cards for multiple
kernel releases now anyways one more wont hurt too much.

I'll work with Roopa to get the macvlan driver plugged into
the fdb ops in the meantime and maybe get DSA as well.

Thanks,
John
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next PATCH v0 1/5] net: add generic PF_BRIDGE:RTM_ FDB hooks

2012-03-19 Thread John Fastabend
Forgot to change the title resending with a title that won't
be dropped by netdev and kvm mailing lists. And updated my
local repo so it won't happen again.

---

This adds two new flags NTF_MASTER and NTF_LOWERDEV that can
now be used to specify where PF_BRIDGE netlink commands should
be sent. NTF_MASTER sends the commands to the 'dev->master'
device for parsing. Typically this will be the linux net/bridge,
macvlan, or open-vswitch devices. Also without any flags set
the command will be handled by the master device as well so
that current user space tools continue to work as expected.

The NTF_LOWERDEV flag will push the PF_BRIDGE commands to the
device. In the basic example below the commands are then parsed
and programmed in the embedded bridge.

Note if both NTF_LOWERDEV and NTF_MASTER bits are set then the
command will be sent both to 'dev->master' and 'dev' this allows
user space to easily keep the embedded bridge and software bridge
in sync.

To support this new net device ops were added to call into
the device and the existing bridging code was refactored
to use these. There should be no change from user space.

A basic setup with a SR-IOV enabled NIC looks like this,

  veth0  veth2
|  |
  
  |  bridge0 |   < software bridging
  
   /
   /
  ethx.y  ethx
VF PF
 \ \  < propagate FDB entries to HW
 \ \
  
  |  Embedded Bridge |< hardware offloaded switching
  

In this case the embedded bridge must be managed to allow 'veth0'
to communicate with 'ethx.y' correctly. At present drivers managing
the embedded bridge either send frames onto the network which
then get dropped by the switch OR the embedded bridge will flood
these frames. With this patch we have a mechanism to manage the
embedded bridge correctly from user space. This example is specific
to SR-IOV but replacing the VF with another PF or dropping this
into the DSA framework generates similar management issues.

Examples session using the 'br'[1] tool to add, dump and then
delete a mac address with a new "embedded" option and enabled
ixgbe driver:

# br fdb add 22:35:19:ac:60:59 dev eth3
# br fdb
portmac addrflags
veth0   22:35:19:ac:60:58   static
veth0   9a:5f:81:f7:f6:ec   local
eth300:1b:21:55:23:59   local
eth322:35:19:ac:60:59   static
veth0   22:35:19:ac:60:57   static
#br fdb add 22:35:19:ac:60:59 embedded dev eth3
#br fdb
portmac addrflags
veth0   22:35:19:ac:60:58   static
veth0   9a:5f:81:f7:f6:ec   local
eth300:1b:21:55:23:59   local
eth322:35:19:ac:60:59   static
veth0   22:35:19:ac:60:57   static
eth322:35:19:ac:60:59   local embedded
#br fdb del 22:35:19:ac:60:59 embedded dev eth3

I added a couple lines to 'br' to set the flags correctly is all. It
is my opinion that the merit of this patch is now embedded and SW
bridges can both be modeled correctly in user space using very nearly
the same message passing.

[1] 'br' tool was published as an RFC here and will be renamed 'bridge'
http://patchwork.ozlabs.org/patch/117664/

Thanks to Jamal Hadi Salim, Stephen Hemminger and Ben Hutchings for
valuable feedback, suggestions, and review.

Signed-off-by: John Fastabend 
---

 include/linux/neighbour.h |3 +
 include/linux/netdevice.h |   26 
 include/linux/rtnetlink.h |4 +
 net/bridge/br_device.c|3 +
 net/bridge/br_fdb.c   |  128 ++
 net/bridge/br_netlink.c   |   12 
 net/bridge/br_private.h   |   15 -
 net/core/rtnetlink.c  |  138 +
 8 files changed, 217 insertions(+), 112 deletions(-)

diff --git a/include/linux/neighbour.h b/include/linux/neighbour.h
index b188f68..3a94409 100644
--- a/include/linux/neighbour.h
+++ b/include/linux/neighbour.h
@@ -33,6 +33,9 @@ enum {
 #define NTF_PROXY  0x08/* == ATF_PUBL */
 #define NTF_ROUTER 0x80
 
+#define NTF_LOWERDEV   0x02
+#define NTF_MASTER 0x04
+
 /*
  * Neighbor Cache Entry States.
  */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4535a4e..4208901 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -54,6 +54,7 @@
 #include 
 
 #include 
+#include 
 
 struct netpoll_info;
 struct phy_device;
@@ -904,6 +905,19 @@ struct netdev_fcoe_hbainfo {
  * feature set might be less than what was returned by ndo_fix_features()).
  * Must return >0 or -errno if it changed dev->features itself.
  *
+ * int (*ndo_fdb_add)(struct ndmsg *ndm, struct net_device *dev,
+ *   unsigned char *addr, u16 flags)
+ * Adds an FDB entry to dev for addr. The ndmsg con

[net-next PATCH v0 4/5] ixgbe: enable FDB netdevice ops

2012-03-19 Thread John Fastabend
Enable FDB ops on ixgbe when in SR-IOV mode.

Signed-off-by: John Fastabend 
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   59 +
 1 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1d8f9f8..32adb4f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7586,6 +7586,62 @@ static int ixgbe_set_features(struct net_device *netdev,
 
 }
 
+static int ixgbe_ndo_fdb_add(struct ndmsg *ndm,
+struct net_device *dev,
+unsigned char *addr,
+u16 flags)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+   int err = -EOPNOTSUPP;
+
+   if (ndm->ndm_state & NUD_PERMANENT) {
+   pr_info("%s: FDB only supports static addresses\n",
+   ixgbe_driver_name);
+   return -EINVAL;
+   }
+
+   if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+   err = dev_uc_add_excl(dev, addr);
+
+   /* Only return duplicate errors if NLM_F_EXCL is set */
+   if (err == -EEXIST && !(flags & NLM_F_EXCL))
+   err = 0;
+
+   return err;
+}
+
+static int ixgbe_ndo_fdb_del(struct ndmsg *ndm,
+struct net_device *dev,
+unsigned char *addr)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+   int err = -EOPNOTSUPP;
+
+   if (ndm->ndm_state & NUD_PERMANENT) {
+   pr_info("%s: FDB only supports static addresses\n",
+   ixgbe_driver_name);
+   return -EINVAL;
+   }
+
+   if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+   err = dev_uc_del(dev, addr);
+
+   return err;
+}
+
+static int ixgbe_ndo_fdb_dump(struct sk_buff *skb,
+ struct netlink_callback *cb,
+ struct net_device *dev,
+ int idx)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+
+   if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+   idx = ndo_dflt_fdb_dump(skb, cb, dev, idx);
+
+   return idx;
+}
+
 static const struct net_device_ops ixgbe_netdev_ops = {
.ndo_open   = ixgbe_open,
.ndo_stop   = ixgbe_close,
@@ -7620,6 +7676,9 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 #endif /* IXGBE_FCOE */
.ndo_set_features = ixgbe_set_features,
.ndo_fix_features = ixgbe_fix_features,
+   .ndo_fdb_add= ixgbe_ndo_fdb_add,
+   .ndo_fdb_del= ixgbe_ndo_fdb_del,
+   .ndo_fdb_dump   = ixgbe_ndo_fdb_dump,
 };
 
 static void __devinit ixgbe_probe_vf(struct ixgbe_adapter *adapter,

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next PATCH v0 5/5] ixgbe: allow RAR table to be updated in promisc mode

2012-03-19 Thread John Fastabend
This allows RAR table updates while in promiscuous. With
SR-IOV enabled it is valuable to allow the RAR table to
be updated even when in promisc mode to configure forwarding

Signed-off-by: John Fastabend 
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 +++--
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 32adb4f..d1925b5 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3400,16 +3400,17 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
}
ixgbe_vlan_filter_enable(adapter);
hw->addr_ctrl.user_set_promisc = false;
-   /*
-* Write addresses to available RAR registers, if there is not
-* sufficient space to store all the addresses then enable
-* unicast promiscuous mode
-*/
-   count = ixgbe_write_uc_addr_list(netdev);
-   if (count < 0) {
-   fctrl |= IXGBE_FCTRL_UPE;
-   vmolr |= IXGBE_VMOLR_ROPE;
-   }
+   }
+
+   /*
+* Write addresses to available RAR registers, if there is not
+* sufficient space to store all the addresses then enable
+* unicast promiscuous mode
+*/
+   count = ixgbe_write_uc_addr_list(netdev);
+   if (count < 0) {
+   fctrl |= IXGBE_FCTRL_UPE;
+   vmolr |= IXGBE_VMOLR_ROPE;
}
 
if (adapter->num_vfs) {

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next PATCH v0 3/5] net: add fdb generic dump routine

2012-03-19 Thread John Fastabend
This adds a generic dump routine drivers can call. It
should be sufficient to handle any bridging model that
uses the unicast address list. This should be most SR-IOV
enabled NICs.

Signed-off-by: John Fastabend 
---

 net/core/rtnetlink.c |   56 ++
 1 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 8c3278a..35ee2d6 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2082,6 +2082,62 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct 
nlmsghdr *nlh, void *arg)
return err;
 }
 
+/**
+ * ndo_dflt_fdb_dump: default netdevice operation to dump an FDB table.
+ * @nlh: netlink message header
+ * @dev: netdevice
+ *
+ * Default netdevice operation to dump the existing unicast address list.
+ * Returns zero on success.
+ */
+int ndo_dflt_fdb_dump(struct sk_buff *skb,
+ struct netlink_callback *cb,
+ struct net_device *dev,
+ int idx)
+{
+   struct netdev_hw_addr *ha;
+   struct nlmsghdr *nlh;
+   struct ndmsg *ndm;
+   u32 pid, seq;
+
+   pid = NETLINK_CB(cb->skb).pid;
+   seq = cb->nlh->nlmsg_seq;
+
+   netif_addr_lock_bh(dev);
+   list_for_each_entry(ha, &dev->uc.list, list) {
+   if (idx < cb->args[0])
+   goto skip;
+
+   nlh = nlmsg_put(skb, pid, seq,
+   RTM_NEWNEIGH, sizeof(*ndm), NLM_F_MULTI);
+   if (!nlh)
+   break;
+
+   ndm = nlmsg_data(nlh);
+   ndm->ndm_family  = AF_BRIDGE;
+   ndm->ndm_pad1= 0;
+   ndm->ndm_pad2= 0;
+   ndm->ndm_flags   = NTF_LOWERDEV;
+   ndm->ndm_type= 0;
+   ndm->ndm_ifindex = dev->ifindex;
+   ndm->ndm_state   = NUD_PERMANENT;
+
+   NLA_PUT(skb, NDA_LLADDR, ETH_ALEN, ha->addr);
+
+   nlmsg_end(skb, nlh);
+skip:
+   ++idx;
+   }
+   netif_addr_unlock_bh(dev);
+
+   return idx;
+nla_put_failure:
+   netif_addr_unlock_bh(dev);
+   nlmsg_cancel(skb, nlh);
+   return idx;
+}
+EXPORT_SYMBOL(ndo_dflt_fdb_dump);
+
 static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
int idx = 0;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next PATCH v0 2/5] net: addr_list: add exclusive dev_uc_add

2012-03-19 Thread John Fastabend
This adds a dev_uc_add_excl() call similar to the original
dev_uc_add() except it sets the global bit. With this
change the reference count will not be bumped and -EEXIST
will be returned if a duplicate address exists.

This is useful for drivers that support SR-IOV and want
to manage the unicast lists.

Signed-off-by: John Fastabend 
---

 include/linux/netdevice.h |1 +
 net/core/dev_addr_lists.c |   19 +++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4208901..5e43cec 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2571,6 +2571,7 @@ extern int dev_addr_init(struct net_device *dev);
 
 /* Functions used for unicast addresses handling */
 extern int dev_uc_add(struct net_device *dev, unsigned char *addr);
+extern int dev_uc_add_excl(struct net_device *dev, unsigned char *addr);
 extern int dev_uc_del(struct net_device *dev, unsigned char *addr);
 extern int dev_uc_sync(struct net_device *to, struct net_device *from);
 extern void dev_uc_unsync(struct net_device *to, struct net_device *from);
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index 29c07fe..c7d27ad 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -377,6 +377,25 @@ EXPORT_SYMBOL(dev_addr_del_multiple);
  */
 
 /**
+ * dev_uc_add_excl - Add a global secondary unicast address
+ * @dev: device
+ * @addr: address to add
+ */
+int dev_uc_add_excl(struct net_device *dev, unsigned char *addr)
+{
+   int err;
+
+   netif_addr_lock_bh(dev);
+   err = __hw_addr_add_ex(&dev->uc, addr, dev->addr_len,
+  NETDEV_HW_ADDR_T_UNICAST, true);
+   if (!err)
+   __dev_set_rx_mode(dev);
+   netif_addr_unlock_bh(dev);
+   return err;
+}
+EXPORT_SYMBOL(dev_uc_add_excl);
+
+/**
  * dev_uc_add - Add a secondary unicast address
  * @dev: device
  * @addr: address to add

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[net-next PATCH v0 0/5] Series short description

2012-03-19 Thread John Fastabend
This series is a follow up to this thread:

http://www.spinics.net/lists/netdev/msg191360.html

This adds two NTF_XXX bits to signal if the PF_BRIDGE
netlink command should be parsed by the embedded bridge
or the SW bridge. The insight here is the SW bridge is
always the master device (NTF_MASTER) and the embedded
bridge is the lower device (NTF_LOWERDEV). Without either
flag set the command is parsed by the SW bridge to support
existing tooling.

To make this work correctly I added three new ndo ops

ndo_fdb_add
ndo_fdb_del
ndo_fdb_dump

to add, delete, and dump FDB entries. These operations
can be used by drivers to program embedded nics or by
software bridges. We have at least three SW bridge now
net/bridge, openvswitch, and macvlan. And three variants
of embedded bridges SR-IOV devices, multi-function devices
and Distributed Switch Architecture (DSA).

I think at least in this case adding netdevice ops is
the cleanest way to implement this. I thought about
notifier hooks and other methods but this seems to be
the simplest.

I've tested these three scenarios, embedded bridge only,
sw bridge only, and embedded bridge and SW bridge. These
are working on the Intel 82599 devices with this patch
series. I am also working on a patch for the macvlan
drivers. I'll submit that as an RFC shortly so far I
only have the passthru mode wired up.

Thanks to Stephen, Ben, and Jamal for bearing with me
and the feedback on the last round of patches.

As always any comments/feedback appreciated!

---

John Fastabend (5):
  ixgbe: allow RAR table to be updated in promisc mode
  ixgbe: enable FDB netdevice ops
  net: add fdb generic dump routine
  net: addr_list: add exclusive dev_uc_add
  net: add generic PF_BRIDGE:RTM_XXX FDB hooks


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   80 +-
 include/linux/neighbour.h |3 
 include/linux/netdevice.h |   27 +++
 include/linux/rtnetlink.h |4 +
 net/bridge/br_device.c|3 
 net/bridge/br_fdb.c   |  128 
 net/bridge/br_netlink.c   |   12 --
 net/bridge/br_private.h   |   15 ++
 net/core/dev_addr_lists.c |   19 ++
 net/core/rtnetlink.c  |  194 +
 10 files changed, 363 insertions(+), 122 deletions(-)

-- 
Signature
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 4/4] ixgbe: enable FDB netdevice ops

2012-03-09 Thread John Fastabend
On 3/9/2012 7:48 PM, Stephen Hemminger wrote:
> 
>> Enable FDB ops on ixgbe when in SR-IOV mode.
>>
>> Signed-off-by: John Fastabend 
> 
> Will all this break anything on the vf client? What if the vf is running
> a bridge.

No shouldn't break anything.

Actually, implementing these ops in the VF driver (ixgbevf in this
case) will allow bridging to work on the VF. Because at least on
the intel 82599 devices the VF can't be put in promiscuous mode
so bridging doesn't work as expected. With the ops implemented
we could at least get traffic forwarded correctly for addresses
that were known above the VF.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 1/4] net: add generic PF_BRIDGE:RTM_ FDB hooks

2012-03-09 Thread John Fastabend
Resending this patch with a new title apparently "PF_BRIDGE:RTM_XXX
FDB hooks" looks like spam. Sorry for the noise if you get a dup.

---
net: add generic PF_BRIDGE:RTM_XXX FDB hooks

This adds two new flags NTF_MASTER and NTF_LOWERDEV that can
now be used to specify where PF_BRIDGE netlink commands should
be sent. NTF_MASTER sends the commands to the 'dev->master'
device for parsing. Typically this will be the linux net/bridge,
macvlan, or open-vswitch devices. Also without any flags set
the command will be handled by the master device as well so
that current user space tools continue to work as expected.

The NTF_LOWERDEV flag will push the PF_BRIDGE commands to the
device. In the basic example below the commands are then parsed
and programmed in the embedded bridge.

Note if both NTF_LOWERDEV and NTF_MASTER bits are set then the
command will be sent both to 'dev->master' and 'dev' this allows
user space to easily keep the embedded bridge and software bridge
in sync.

To support this new net device ops were added to call into
the device and the existing bridging code was refactored
to use these. There should be no change from user space.

A basic setup with a SR-IOV enabled NIC looks like this,

  veth0  veth2
|  |
  
  |  bridge0 |   < software bridging
  
   /
   /
  ethx.y  ethx
VF PF
 \ \  < propagate FDB entries to HW
 \ \
  
  |  Embedded Bridge |< hardware offloaded switching
  

In this case the embedded bridge must be managed to allow 'veth0'
to communicate with 'ethx.y' correctly. At present drivers managing
the embedded bridge either send frames onto the network which
then get dropped by the switch OR the embedded bridge will flood
these frames. With this patch we have a mechanism to manage the
embedded bridge correctly from user space. This example is specific
to SR-IOV but replacing the VF with another PF or dropping this
into the DSA framework generates similar management issues.

Examples session using the 'br'[1] tool to add, dump and then
delete a mac address with a new "embedded" option and enabled
ixgbe driver:

# br fdb add 22:35:19:ac:60:59 dev eth3
# br fdb
portmac addrflags
veth0   22:35:19:ac:60:58   static
veth0   9a:5f:81:f7:f6:ec   local
eth300:1b:21:55:23:59   local
eth322:35:19:ac:60:59   static
veth0   22:35:19:ac:60:57   static
#br fdb add 22:35:19:ac:60:59 embedded dev eth3
#br fdb
portmac addrflags
veth0   22:35:19:ac:60:58   static
veth0   9a:5f:81:f7:f6:ec   local
eth300:1b:21:55:23:59   local
eth322:35:19:ac:60:59   static
veth0   22:35:19:ac:60:57   static
eth322:35:19:ac:60:59   local embedded
#br fdb del 22:35:19:ac:60:59 embedded dev eth3

I added a couple lines to 'br' to set the flags correctly is all. It
is my opinion that the merit of this patch is now embedded and SW
bridges can both be modeled correctly in user space using very nearly
the same message passing.

[1] 'br' tool was published as an RFC here and will be renamed 'bridge'
http://patchwork.ozlabs.org/patch/117664/

Signed-off-by: John Fastabend 
---

 include/linux/neighbour.h |3 +
 include/linux/netdevice.h |   26 
 include/linux/rtnetlink.h |4 +
 net/bridge/br_device.c|3 +
 net/bridge/br_fdb.c   |  128 ++
 net/bridge/br_netlink.c   |   12 
 net/bridge/br_private.h   |   15 -
 net/core/rtnetlink.c  |  138 +
 8 files changed, 217 insertions(+), 112 deletions(-)

diff --git a/include/linux/neighbour.h b/include/linux/neighbour.h
index b188f68..3a94409 100644
--- a/include/linux/neighbour.h
+++ b/include/linux/neighbour.h
@@ -33,6 +33,9 @@ enum {
 #define NTF_PROXY  0x08/* == ATF_PUBL */
 #define NTF_ROUTER 0x80
 
+#define NTF_LOWERDEV   0x02
+#define NTF_MASTER 0x04
+
 /*
  * Neighbor Cache Entry States.
  */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a89933b..3963992 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -54,6 +54,7 @@
 #include 
 
 #include 
+#include 
 
 struct netpoll_info;
 struct phy_device;
@@ -904,6 +905,19 @@ struct netdev_fcoe_hbainfo {
  * feature set might be less than what was returned by ndo_fix_features()).
  * Must return >0 or -errno if it changed dev->features itself.
  *
+ * int (*ndo_fdb_add)(struct ndmsg *ndm, struct net_device *dev,
+ *   unsigned char *addr, u16 flags)
+ * Adds an FDB entry to dev for addr. The ndmsg contains flags to indicate
+ * if the dev->master FDB should be updated or

[RFC PATCH v1 4/4] ixgbe: enable FDB netdevice ops

2012-03-09 Thread John Fastabend
Enable FDB ops on ixgbe when in SR-IOV mode.

Signed-off-by: John Fastabend 
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   59 +
 1 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 23a4665..c41439c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7484,6 +7484,62 @@ static int ixgbe_set_features(struct net_device *netdev,
 
 }
 
+static int ixgbe_ndo_fdb_add(struct ndmsg *ndm,
+struct net_device *dev,
+unsigned char *addr,
+u16 flags)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+   int err = -EOPNOTSUPP;
+
+   if (ndm->ndm_state & NUD_PERMANENT) {
+   pr_info("%s: FDB only supports static addresses\n",
+   ixgbe_driver_name);
+   return -EINVAL;
+   }
+
+   if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+   err = dev_uc_add_excl(dev, addr);
+
+   /* Only return duplicate errors if NLM_F_EXCL is set */
+   if (err == -EEXIST && !(flags & NLM_F_EXCL))
+   err = 0;
+
+   return err;
+}
+
+static int ixgbe_ndo_fdb_del(struct ndmsg *ndm,
+struct net_device *dev,
+unsigned char *addr)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+   int err = -EOPNOTSUPP;
+
+   if (ndm->ndm_state & NUD_PERMANENT) {
+   pr_info("%s: FDB only supports static addresses\n",
+   ixgbe_driver_name);
+   return -EINVAL;
+   }
+
+   if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+   err = dev_uc_del(dev, addr);
+
+   return err;
+}
+
+static int ixgbe_ndo_fdb_dump(struct sk_buff *skb,
+ struct netlink_callback *cb,
+ struct net_device *dev,
+ int idx)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(dev);
+
+   if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+   idx = ndo_dflt_fdb_dump(skb, cb, dev, idx);
+
+   return idx;
+}
+
 static const struct net_device_ops ixgbe_netdev_ops = {
.ndo_open   = ixgbe_open,
.ndo_stop   = ixgbe_close,
@@ -7518,6 +7574,9 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 #endif /* IXGBE_FCOE */
.ndo_set_features = ixgbe_set_features,
.ndo_fix_features = ixgbe_fix_features,
+   .ndo_fdb_add= ixgbe_ndo_fdb_add,
+   .ndo_fdb_del= ixgbe_ndo_fdb_del,
+   .ndo_fdb_dump   = ixgbe_ndo_fdb_dump,
 };
 
 static void __devinit ixgbe_probe_vf(struct ixgbe_adapter *adapter,

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 3/4] net: add fdb generic dump routine

2012-03-09 Thread John Fastabend
This adds a generic dump routine drivers can call. It
should be sufficient to handle any bridging model that
uses the unicast address list. This should be most SR-IOV
enabled NICs.

Signed-off-by: John Fastabend 
---

 net/core/rtnetlink.c |   56 ++
 1 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 8c3278a..35ee2d6 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2082,6 +2082,62 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct 
nlmsghdr *nlh, void *arg)
return err;
 }
 
+/**
+ * ndo_dflt_fdb_dump: default netdevice operation to dump an FDB table.
+ * @nlh: netlink message header
+ * @dev: netdevice
+ *
+ * Default netdevice operation to dump the existing unicast address list.
+ * Returns zero on success.
+ */
+int ndo_dflt_fdb_dump(struct sk_buff *skb,
+ struct netlink_callback *cb,
+ struct net_device *dev,
+ int idx)
+{
+   struct netdev_hw_addr *ha;
+   struct nlmsghdr *nlh;
+   struct ndmsg *ndm;
+   u32 pid, seq;
+
+   pid = NETLINK_CB(cb->skb).pid;
+   seq = cb->nlh->nlmsg_seq;
+
+   netif_addr_lock_bh(dev);
+   list_for_each_entry(ha, &dev->uc.list, list) {
+   if (idx < cb->args[0])
+   goto skip;
+
+   nlh = nlmsg_put(skb, pid, seq,
+   RTM_NEWNEIGH, sizeof(*ndm), NLM_F_MULTI);
+   if (!nlh)
+   break;
+
+   ndm = nlmsg_data(nlh);
+   ndm->ndm_family  = AF_BRIDGE;
+   ndm->ndm_pad1= 0;
+   ndm->ndm_pad2= 0;
+   ndm->ndm_flags   = NTF_LOWERDEV;
+   ndm->ndm_type= 0;
+   ndm->ndm_ifindex = dev->ifindex;
+   ndm->ndm_state   = NUD_PERMANENT;
+
+   NLA_PUT(skb, NDA_LLADDR, ETH_ALEN, ha->addr);
+
+   nlmsg_end(skb, nlh);
+skip:
+   ++idx;
+   }
+   netif_addr_unlock_bh(dev);
+
+   return idx;
+nla_put_failure:
+   netif_addr_unlock_bh(dev);
+   nlmsg_cancel(skb, nlh);
+   return idx;
+}
+EXPORT_SYMBOL(ndo_dflt_fdb_dump);
+
 static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
int idx = 0;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 2/4] net: addr_list: add exclusive dev_uc_add

2012-03-09 Thread John Fastabend
This adds a dev_uc_add_excl() call similar to the original
dev_uc_add() except it sets the global bit. With this
change the reference count will not be bumped and -EEXIST
will be returned if a duplicate address exists.

This is useful for drivers that support SR-IOV and want
to manage the unicast lists.

Signed-off-by: John Fastabend 
---

 include/linux/netdevice.h |1 +
 net/core/dev_addr_lists.c |   19 +++
 2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3963992..7e4a86f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2553,6 +2553,7 @@ extern int dev_addr_init(struct net_device *dev);
 
 /* Functions used for unicast addresses handling */
 extern int dev_uc_add(struct net_device *dev, unsigned char *addr);
+extern int dev_uc_add_excl(struct net_device *dev, unsigned char *addr);
 extern int dev_uc_del(struct net_device *dev, unsigned char *addr);
 extern int dev_uc_sync(struct net_device *to, struct net_device *from);
 extern void dev_uc_unsync(struct net_device *to, struct net_device *from);
diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index 29c07fe..c7d27ad 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -377,6 +377,25 @@ EXPORT_SYMBOL(dev_addr_del_multiple);
  */
 
 /**
+ * dev_uc_add_excl - Add a global secondary unicast address
+ * @dev: device
+ * @addr: address to add
+ */
+int dev_uc_add_excl(struct net_device *dev, unsigned char *addr)
+{
+   int err;
+
+   netif_addr_lock_bh(dev);
+   err = __hw_addr_add_ex(&dev->uc, addr, dev->addr_len,
+  NETDEV_HW_ADDR_T_UNICAST, true);
+   if (!err)
+   __dev_set_rx_mode(dev);
+   netif_addr_unlock_bh(dev);
+   return err;
+}
+EXPORT_SYMBOL(dev_uc_add_excl);
+
+/**
  * dev_uc_add - Add a secondary unicast address
  * @dev: device
  * @addr: address to add

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1 0/4] net: bridge: FDB management

2012-03-09 Thread John Fastabend
This series is a follow up to the previous thread here:

http://lists.openwall.net/netdev/2012/02/29/31

There are some significant changes in this series. First
I add two NTF_XXX bits to signal if the PF_BRIDGE netlink
command should be parsed by the embedded bridge or the
SW bridge. The insight here is the SW bridge is always the
master device (NTF_MASTER) and the embedded bridge is
the lower device (NTF_LOWERDEV). Without either flag set
the command is parsed by the SW bridge to support existing
tooling.

To make this work correctly I added three new ndo ops

ndo_fdb_add
ndo_fdb_del
ndo_fdb_dump

to add, delete, and dump FDB entries. These operations
can be used by drivers to program embedded nics or by
software bridges. We have at least three SW bridge now
net/bridge, openvswitch, and macvlan. And three variants
of embedded bridges SR-IOV devices, multi-function devices
and Distributed Switch Architecture (DSA).

I think at least in this case adding netdevice ops is
the cleanest way to implement this. I thought about
notifier hooks and other methods but for now at least
this seems to be the simplest.

I'm going to drop this into my testbed and let it run
for a few days. But I think (hope?) this series is close
to being ready for a non-RFC submission. I'll probably
audit the patches once more as well.

Thanks to Stephen, Ben, and Jamal for bearing with me
and the feedback on the last round of patches.

As always any comments/feedback is appreciated!

---

John Fastabend (4):
  ixgbe: enable FDB netdevice ops
  net: add fdb generic dump routine
  net: addr_list: add exclusive dev_uc_add
  net: add generic PF_BRIDGE:RTM_XXX FDB hooks


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   59 
 include/linux/neighbour.h |3 
 include/linux/netdevice.h |   27 +++
 include/linux/rtnetlink.h |4 +
 net/bridge/br_device.c|3 
 net/bridge/br_fdb.c   |  128 
 net/bridge/br_netlink.c   |   12 --
 net/bridge/br_private.h   |   15 ++
 net/core/dev_addr_lists.c |   19 ++
 net/core/rtnetlink.c  |  194 +
 10 files changed, 352 insertions(+), 112 deletions(-)

-- 
Signature
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-05 Thread John Fastabend
On 3/5/2012 8:53 AM, Lennert Buytenhek wrote:
> On Tue, Feb 28, 2012 at 08:40:06PM -0800, John Fastabend wrote:
> 
>> Also if there are embedded switches with learning capabilities they
>> might want to trigger events to user space. In this case having
>> a protocol type makes user space a bit easier to manage. I've
>> added Lennert so maybe he can comment I think the Marvell chipsets
>> might support something along these lines. The SR-IOV chipsets I'm
>> aware of _today_ don't do learning. Learning makes the event model
>> more plausible.
> 
> net/dsa currently configures any switch chips in the system to do
> auto-learning.  However, I would much prefer to disable that, and have
> the switch chip just pass up packets for new source addresses, have
> Linux do the learning, and then mirror the Linux software FDB into
> the hardware instead -- that avoids having to manually flush the
> hardware FDB on certain STP state transitions or having to configure
> the hardware to use a shorter address learning timeout when we're in
> the middle of an STP topology change, which are problems we are
> running into in practice.
> 

Great. And the plan is we should be able to use the same daemon with
minimal changes (currently a flag) to control both sw and hw bridges.

> Just curious -- while your patches allow propagating FDB entries
> into the hardware, do you also have hooks to tell the hardware which
> ports are to share address databases?
> 

Not in the current patches. I don't have hardware right now
that can instantiate multiple bridges. When I get some I was hoping
to do something similar to this patch and use netlink commands
to create/delete bridges and add/remove ports to them. This would
be modifying the existing commands to work for both software and
hardware bridges.

By a bridge instantiation I mean a shared address database in this case.

> For net/dsa, we currently have:
> 
>   http://patchwork.ozlabs.org/patch/16578/
> 
> While I think this is conceptually sound, the implementation is hacky,
> and I wonder how you've solved it for your setup, and if DSA can
> piggy-back off that.

Yep anything we come up with should work in both cases.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-01 Thread John Fastabend
On 3/1/2012 5:36 AM, Jamal Hadi Salim wrote:
> On Wed, 2012-02-29 at 10:19 -0800, John Fastabend wrote:
> 
>>>
>>> I want to see a unified API so that user space control applications (RSTP, 
>>> TRILL?)
>>> can use one set of netlink calls for both software bridge and hardware 
>>> offloaded
>>> bridges.  Does this proposal meet that requirement?
>>>
> 
> I dont see any issues with those requirements being met.
> 
>> Jamal, so why do "They have to be different calls"? I'm not so sure 
>> anymore...
>> moving to RTM_FDB_XXXENTRY saved some refactoring in the bridge module but 
>> that
>> is just cosmetic.
> 
> I may not want to use the s/ware bridge i.e I may want to use h/ware
> bridge. I may want to use both. So there are 3 variations there. You
> need at least 1.5 bits to represent them if you are going to use the
> same interface. There may be features in either h/ware but not in
> s/ware and vice-versa. 
> A single interface with flags which say this applies to hware:sware:both
> would be good, but it may be harder to achieve - thats why i suggested
> they be different.
> 
> cheers,
> jamal
> 

Hmm so I think what I'll do is this...

 both: ndm_flags = 0 
 sw  : ndm_flags = NTF_SW_FDB
 hw  : ndm_flags = NTF_HW_FDB

Then current tools will work with embedded bridges and software bridges
with the interesting case being when a port supporting an offloaded FDB
is attached to a SW bridge. Doing both in this case seems to be a reasonable
default to me.

The tricky bit will be pulling the message handlers out of the ./net/bridge
code so that we don't have to always load the bridge module to add entries
to a macvlan for example. I need to look at a few other things today but
I'll code up a patch for this tomorrow.

.John
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-03-01 Thread John Fastabend
On 3/1/2012 6:14 AM, Michael S. Tsirkin wrote:
> On Wed, Feb 29, 2012 at 09:25:56AM -0800, John Fastabend wrote:
>> Agreed. I think adding some ndo_ops for bridging offloads here would
>> work. For example the DSA infrastructure and/or macvlan devices might
>> need this. Along the lines of extending this RFC,
>>
>> [RFC] hardware bridging support for DSA switches
>> http://patchwork.ozlabs.org/patch/16578/
>>
>>
>> .John
> 
> One place where this might not work well would be
> macvtap which is not a network device so it doesn't have
> its own address, instead it inherits one from macvlan.
> 

But is macvtap really doing any forwarding or implementing any
RX filters? Took a quick scan and it looks like the forwarding
logic is all in the macvlan code paths. In this case I suspect
if we enable macvlan then any device built on top of it would
work.

.John
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread John Fastabend
On 2/29/2012 9:52 AM, Stephen Hemminger wrote:
> On Wed, 29 Feb 2012 09:25:56 -0800
> John Fastabend  wrote:
> 
>> On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote:
>>> On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:
>>>
>>>> OK back to this. The last piece is where to put these messages...
>>>> we could take PF_ROUTE:RTM_*NEIGH
>>>>
>>>>  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
>>>>  switch.
>>>>  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
>>>>  switch.
>>>>  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table
>>>>
>>>
>>> Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
>>> to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
>>> may play a role in the user space app populating the FDB, i dont think
>>> they are necessary players.
>>> Learning could be via a table entry miss and packet redirect to user
>>> space.
>>> So my suggestion is to use FDB_*ENTRY for names
>>>  
>>
>> Well I think NETLINK_ROUTE is the most correct type to use in this
>> case. Per netlink.h its for routing and device hooks.
>>
>> #define NETLINK_ROUTE   0   /* Routing/device hook   
>>*/
>>
>> And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix
>> were merely a copy from the SW BRIDGE code paths. How about,
>>
>> PF_BRIDGE:RTM_FDB_NEWENTRY
>> PF_BRIDGE:RTM_FDB_DELENTRY
>> PF_BRIDGE:RTM_FDB_GETENTRY
>>
>> And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct
>> rtnl locking semantics for free.
>>
>>>> The neighbor code is using the PF_UNSPEC protocol type so we won't
>>>> collide with these unless someone was using PF_ROUTE and relying on
>>>> falling back to PF_UNSPEC however I couldn't find any programs that
>>>> did this iproute2 certainly doesn't. And the bridge pieces are using
>>>> PF_BRIDGE so no collision there.
>>>
>>> They have to be different calls from the calls that talk to the s/ware
>>> bridge. In my opinion, as controversial as this may sound, you need to
>>> be flexible enough that some vendor can replace these calls with
>>> proprietary calls which are more efficient for their hardware. So a
>>> "plugin" to replace these calls in the user space code would be a 
>>> good idea. Alternatively, you could make that something they do at
>>> the driver level i.e from user space to kernel it is "hardware, please
>>> addthistotheFDBtable()" call and the implementation of that could be
>>> proprietary to the specific hardware.
>>>
>>
>> Agreed. I think adding some ndo_ops for bridging offloads here would
>> work. For example the DSA infrastructure and/or macvlan devices might
>> need this. Along the lines of extending this RFC,
>>
>> [RFC] hardware bridging support for DSA switches
>> http://patchwork.ozlabs.org/patch/16578/
> 
> I want to see a unified API so that user space control applications (RSTP, 
> TRILL?)
> can use one set of netlink calls for both software bridge and hardware 
> offloaded
> bridges.  Does this proposal meet that requirement?
> 

With the patches I sent out last night the same netlink calls are used
for both SW and HW with a flag set in ndm_flags to indicate it is a hardware
entry. The flag is needed when a port has offload support and is also
a slave of a SW bridge. Another option would be to apply the command to both
hardware and software tables. This might be good enough and user space would
not have to make distinctions between HW and SW bridges. Also helps with my
original use case where I want the SW and HW bridge FDBs to be in sync.

In response to Jamal's comment I proposed changing the type to RTM_FDB_XXXENTRY
but the message contents are the same in both cases.

Jamal, so why do "They have to be different calls"? I'm not so sure anymore...
moving to RTM_FDB_XXXENTRY saved some refactoring in the bridge module but that
is just cosmetic.

Thanks,
John
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread John Fastabend
On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote:
> On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:
> 
>> OK back to this. The last piece is where to put these messages...
>> we could take PF_ROUTE:RTM_*NEIGH
>>
>>  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
>>  switch.
>>  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
>>  switch.
>>  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table
>>
> 
> Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
> to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
> may play a role in the user space app populating the FDB, i dont think
> they are necessary players.
> Learning could be via a table entry miss and packet redirect to user
> space.
> So my suggestion is to use FDB_*ENTRY for names
>  

Well I think NETLINK_ROUTE is the most correct type to use in this
case. Per netlink.h its for routing and device hooks.

#define NETLINK_ROUTE   0   /* Routing/device hook  
*/

And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix
were merely a copy from the SW BRIDGE code paths. How about,

PF_BRIDGE:RTM_FDB_NEWENTRY
PF_BRIDGE:RTM_FDB_DELENTRY
PF_BRIDGE:RTM_FDB_GETENTRY

And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct
rtnl locking semantics for free.

>> The neighbor code is using the PF_UNSPEC protocol type so we won't
>> collide with these unless someone was using PF_ROUTE and relying on
>> falling back to PF_UNSPEC however I couldn't find any programs that
>> did this iproute2 certainly doesn't. And the bridge pieces are using
>> PF_BRIDGE so no collision there.
> 
> They have to be different calls from the calls that talk to the s/ware
> bridge. In my opinion, as controversial as this may sound, you need to
> be flexible enough that some vendor can replace these calls with
> proprietary calls which are more efficient for their hardware. So a
> "plugin" to replace these calls in the user space code would be a 
> good idea. Alternatively, you could make that something they do at
> the driver level i.e from user space to kernel it is "hardware, please
> addthistotheFDBtable()" call and the implementation of that could be
> proprietary to the specific hardware.
> 

Agreed. I think adding some ndo_ops for bridging offloads here would
work. For example the DSA infrastructure and/or macvlan devices might
need this. Along the lines of extending this RFC,

[RFC] hardware bridging support for DSA switches
http://patchwork.ozlabs.org/patch/16578/


.John

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-28 Thread John Fastabend
On 2/28/2012 8:40 PM, John Fastabend wrote:
> On 2/18/2012 4:41 AM, jamal wrote:
>> On Fri, 2012-02-17 at 09:10 -0800, John Fastabend wrote:
>>
>>> Yes I agree that is the goal.
>>>
>>>> One last comment:
>>>> With synchronization there are other challenges when the entry in the
>>>> hardware conflicts with the entry in software when you intend the
>>>> behavior to be the same. This is not such a big deal with bridging but
>>>> becomes more apparent when you start offloading ACLs etc.
>>>>
>>>
>>> OK and these sorts of conflicts certainly don't need to be resolved
>>> by kernel code. So I think this is a reasonable reason to drive the
>>> synchronization into a user space daemon.
>>
>>
>> Yep. 
>> Thanks for listening John. Waiting to see them patches.
>>
>> cheers,
>> jamal
>>
>>
>>
> 
> +Lennert
> 
> OK back to this. The last piece is where to put these messages...
> we could take PF_ROUTE:RTM_*NEIGH
> 
>  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
>  switch.
>  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
>  switch.
>  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table
> 
> The neighbor code is using the PF_UNSPEC protocol type so we won't
> collide with these unless someone was using PF_ROUTE and relying on
> falling back to PF_UNSPEC however I couldn't find any programs that
> did this iproute2 certainly doesn't. And the bridge pieces are using
> PF_BRIDGE so no collision there.
> 
> I briefly thought about trying to pull the PF_BRIDGE protocol out
> and use this for both types but I think its better to leave the
> bridge code alone and there is also the issue of disambiguating a msg
> at a port which has both an embedded switch and has SW bridge for a
> master.

Maybe I gave up too quickly here I could use a bit in the ndm_flags to
specify embedded or sw bridge. But would require having the bridge
module loaded.

> 
> Also if there are embedded switches with learning capabilities they
> might want to trigger events to user space. In this case having
> a protocol type makes user space a bit easier to manage. I've
> added Lennert so maybe he can comment I think the Marvell chipsets
> might support something along these lines. The SR-IOV chipsets I'm
> aware of _today_ don't do learning. Learning makes the event model
> more plausible.
> 

Just checked looks like the DSA infrastructure has commands to enable
STP so guess it is doing learning.

> The other mechanism would be to embed some more attributes into the
> PF_UNSPEC:RTM_XXXLINK msg however I'm thinking that if we want to
> support learning and triggering events then we likely also don't
> want to send these events to every app with RTNLGRP_LINK set.
> 
> Plus there is already a proliferation of LINK attributes and dumping
> the FDB out of this seems a bit much but could be done with some
> bitmasks. Although the current ext_filter_mask u32 doesn't seem to
> be sufficient for events to trigger this.
> 
> so much for a short note...
> 
> Thanks
> .John
> 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-28 Thread John Fastabend
On 2/18/2012 4:41 AM, jamal wrote:
> On Fri, 2012-02-17 at 09:10 -0800, John Fastabend wrote:
> 
>> Yes I agree that is the goal.
>>
>>> One last comment:
>>> With synchronization there are other challenges when the entry in the
>>> hardware conflicts with the entry in software when you intend the
>>> behavior to be the same. This is not such a big deal with bridging but
>>> becomes more apparent when you start offloading ACLs etc.
>>>
>>
>> OK and these sorts of conflicts certainly don't need to be resolved
>> by kernel code. So I think this is a reasonable reason to drive the
>> synchronization into a user space daemon.
> 
> 
> Yep. 
> Thanks for listening John. Waiting to see them patches.
> 
> cheers,
> jamal
> 
> 
> 

+Lennert

OK back to this. The last piece is where to put these messages...
we could take PF_ROUTE:RTM_*NEIGH

 PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
 switch.
 PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
 switch.
 PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table

The neighbor code is using the PF_UNSPEC protocol type so we won't
collide with these unless someone was using PF_ROUTE and relying on
falling back to PF_UNSPEC however I couldn't find any programs that
did this iproute2 certainly doesn't. And the bridge pieces are using
PF_BRIDGE so no collision there.

I briefly thought about trying to pull the PF_BRIDGE protocol out
and use this for both types but I think its better to leave the
bridge code alone and there is also the issue of disambiguating a msg
at a port which has both an embedded switch and has SW bridge for a
master.

Also if there are embedded switches with learning capabilities they
might want to trigger events to user space. In this case having
a protocol type makes user space a bit easier to manage. I've
added Lennert so maybe he can comment I think the Marvell chipsets
might support something along these lines. The SR-IOV chipsets I'm
aware of _today_ don't do learning. Learning makes the event model
more plausible.

The other mechanism would be to embed some more attributes into the
PF_UNSPEC:RTM_XXXLINK msg however I'm thinking that if we want to
support learning and triggering events then we likely also don't
want to send these events to every app with RTNLGRP_LINK set.

Plus there is already a proliferation of LINK attributes and dumping
the FDB out of this seems a bit much but could be done with some
bitmasks. Although the current ext_filter_mask u32 doesn't seem to
be sufficient for events to trigger this.

so much for a short note...

Thanks
.John




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-17 Thread John Fastabend
On 2/17/2012 6:28 AM, jamal wrote:
> On Wed, 2012-02-15 at 17:26 -0800, John Fastabend wrote:
>> On 2/15/2012 6:10 AM, Jamal Hadi Salim wrote:
>>> On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote:
>>>
>>>> Roopa was likely on the right track here,
>>>>
>>>> http://patchwork.ozlabs.org/patch/123064/
>>>
>>> Doesnt seem related to the bridging stuff - the modeling looks
>>> reasonable however.
>>>
>>
>> The operations are really the same ADD/DEL/GET additional MAC
>> addresses to a port, in this case a macvlan type port. The
>> difference is the  macvlan port type drops any packet with an
>> address not in the FDB where the bridge type floods these.
> 
> Ok.
> [the vlan piece really should have been an integrated part of bridging;
> in the early days this was the case]
> 
> 
>> [root@jf-dev1-dcblab src]# br fdb help
>> Usage: br fdb { add | del | replace } ADDR dev DEV
>>br fdb {show} [ dev DEV ]
>>
>> In my example I just dumped all bridge devices,
>>
> 
> Ok, makes sense.
> 
> 
>> Seems we need both a synchronize and a { add | del | replace } option.
> 
> I am conflicted on this.
> Not sure if that is a command line thing or something built into a user
> space daemon. It may be useful to have the command line variant but i
> feel having a daemon take care of things helps in faster
> synchronization.
> I think user space is a good spot to add such functionality (as opposed
> to the kernel). That way user space can work with h/ware switching such
> as yours as well as a standalone switching chips (from sillicon vendors
> like Marvel etc).
> IMO, the average user doesnt need to be aware of such low level stuff;
> so the default should be for the user not to be responsible for
> configuration of synchronization. IOW, I want to just run well
> understood user interface tools things like ifconfig, ip link etc, the
> new br tool and not even need to be aware that we are offloading.
> So as long as s/w br0 is mapping to the bridge on ixgb-0 i dont need
> to know ixgb0 h/w bridge exists.
> 

Yes I agree that is the goal.

> One last comment:
> With synchronization there are other challenges when the entry in the
> hardware conflicts with the entry in software when you intend the
> behavior to be the same. This is not such a big deal with bridging but
> becomes more apparent when you start offloading ACLs etc.
> 

OK and these sorts of conflicts certainly don't need to be resolved
by kernel code. So I think this is a reasonable reason to drive the
synchronization into a user space daemon.

> 
>> So I think what your saying is a per port bit to disable learning...
>> hmm but if you start tweaking it too much it looks less and less like a
>> 802.1D bridge and more like something you would want to build with tc or
>> openvswitch or tc+bridge or tc+macvlan.
> 
> These are pretty commodity features in most silicon switching chips ive
> come across. You have a knob to control learning and another to control
> flooding.
> 

All right this looks like a follow up patch to me. First build the interface
to configure the HW FDB. Then a second series to add a flooding knob which
works for both embedded switches and software switches.

> cheers,
> jamal
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-15 Thread John Fastabend
On 2/15/2012 6:10 AM, Jamal Hadi Salim wrote:
> On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote:
> 
>> Roopa was likely on the right track here,
>>
>> http://patchwork.ozlabs.org/patch/123064/
> 
> Doesnt seem related to the bridging stuff - the modeling looks
> reasonable however.
> 

The operations are really the same ADD/DEL/GET additional MAC
addresses to a port, in this case a macvlan type port. The
difference is the  macvlan port type drops any packet with an
address not in the FDB where the bridge type floods these.

>> But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
>> netlink messages. And if possible drive this without extending ndo_ops.
>>
>> An ideal user space interaction IMHO would look like,
>>
>> [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
>> [root@jf-dev1-dcblab iproute2]# ./br/br fdb
>> portmac addrflags
>> veth2   36:a6:35:9b:96:c4   local
>> veth4   aa:54:b0:7b:42:ef   local
>> veth0   2a:e8:5c:95:6c:1b   local
>> veth6   6e:26:d5:43:a3:36   local
>> veth0   f2:c1:39:76:6a:fb
>> veth8   4e:35:16:af:87:13   local
>> veth10  52:e5:62:7b:57:88   static
>> veth10  aa:a9:35:21:15:c4   local
> 
> Looks nice, where is the targeted bridge(eg br0) in that syntax?

[root@jf-dev1-dcblab src]# br fdb help
Usage: br fdb { add | del | replace } ADDR dev DEV
   br fdb {show} [ dev DEV ]

In my example I just dumped all bridge devices,

#br fdb show dev bridge0

> 
>> Using Stephen's br tool. First command adds FDB entry to SW bridge and
>> if the same tool could be used to add entries to embedded bridge I think
>> that would be the best case. 
> 
> That would be nice (although adds dependency on the presence of the
> s/ware bridge). Would be nicer to have either a knob in the kernel to
> say "synchronize with h/w bridge foo" which can be turned off.  
> 

Seems we need both a synchronize and a { add | del | replace } option.

>> So no RTNETLINK error on the second cmd. Then
>> embedded FDB entries could be dumped this way also so I get a complete view
>> of my FDB setup across multiple sw bridges and embedded bridges.
> 
> So if you had multiple h/ware bridges - which one is tied to br0? 
> 

Not sure I follow but does the additional dev parameter above answer this?

> 
>> Yes. The hardware has a bit to support this which is currently not exposed
>> to user space. That's a case where we have 'yet another knob' that needs
>> a clean solution. This causes real bugs today when users try to use the
>> macvlan devices in VEPA mode on top of SR-IOV. By the way these modes are
>> all part of the 802.1Qbg spec which people actually want to use with Linux
>> so a good clean solution is probably needed.
> 
> 
> I think the knobs to "flood" and "learn" are important. The hardware
> seems to have the "flood" but not the "learn/discover". I think the
> s/ware bridge needs to have both. At the moment - as pointed out in that
> *NEIGH* notification, s/w bridge assumes a policy that could be
> considered a security flaw in some circles - just because you are my
> neighbor does not mean i trust you to come into my house; i may trust
> you partially and allow you only to come through the front door. Even in
> Canada with a default policy of not locking your door we sometimes lock
> our doors ;->
> 
> 
>> I have no problem with drawing the line here and trying to implement 
>> something
>> over PF_BRIDGE:RTM_xxx nlmsgs. 
> 
> 
> My comment/concern was in regard to the bridge built-in policy of
> reading from the neighbor updates (refer to above comments)
> 

So I think what your saying is a per port bit to disable learning...

hmm but if you start tweaking it too much it looks less and less like a
802.1D bridge and more like something you would want to build with tc or
openvswitch or tc+bridge or tc+macvlan.

.John

> cheers,
> jamal
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-14 Thread John Fastabend
On 2/14/2012 11:05 AM, Stephen Hemminger wrote:
> On Tue, 14 Feb 2012 10:57:04 -0800
> John Fastabend  wrote:
> 
>> On 2/14/2012 5:18 AM, jamal wrote:
>>> On Mon, 2012-02-13 at 07:13 -0800, John Fastabend wrote:
>>>
>>>> The use case here is multiple VFs but the same solution should work with
>>>> multiple PFs as well. FDB controls should be independent of how the ports
>>>> are exposed VFs, PFs, VMDQ/queue pairs, macvlan, etc.
>>>
>>> Makes sense.
>>>
>>>> With events and ADD/DEL/GET FDB controls we can solve both cases. This also
>>>> solves Roopa's case with macvlan where she wants to add additional 
>>>> addresses
>>>> to macvlan ports.
>>>
>>> Not familiar with that issue - I'll prowl the list.
>>
>> Roopa was likely on the right track here,
>>
>> http://patchwork.ozlabs.org/patch/123064/
>>
>> But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
>> netlink messages. And if possible drive this without extending ndo_ops.
>>
>> An ideal user space interaction IMHO would look like,
>>
>> [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
>> [root@jf-dev1-dcblab iproute2]# ./br/br fdb
>> portmac addrflags
>> veth2   36:a6:35:9b:96:c4   local
>> veth4   aa:54:b0:7b:42:ef   local
>> veth0   2a:e8:5c:95:6c:1b   local
>> veth6   6e:26:d5:43:a3:36   local
>> veth0   f2:c1:39:76:6a:fb
>> veth8   4e:35:16:af:87:13   local
>> veth10  52:e5:62:7b:57:88   static
>> veth10  aa:a9:35:21:15:c4   local
>> [root@jf-dev1-dcblab iproute2]# ./br/br fdb add dev eth3 to 52:e5:62:7b:57:88
>> RTNETLINK answers: Invalid argument
> 
> I am going to put bridge (nameclash with br) tool into iproute2 (soon).

I've been using it on my dev box for awhile now and it works well for
me.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-14 Thread John Fastabend
On 2/14/2012 5:18 AM, jamal wrote:
> On Mon, 2012-02-13 at 07:13 -0800, John Fastabend wrote:
> 
>> The use case here is multiple VFs but the same solution should work with
>> multiple PFs as well. FDB controls should be independent of how the ports
>> are exposed VFs, PFs, VMDQ/queue pairs, macvlan, etc.
> 
> Makes sense.
> 
>> With events and ADD/DEL/GET FDB controls we can solve both cases. This also
>> solves Roopa's case with macvlan where she wants to add additional addresses
>> to macvlan ports.
> 
> Not familiar with that issue - I'll prowl the list.

Roopa was likely on the right track here,

http://patchwork.ozlabs.org/patch/123064/

But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
netlink messages. And if possible drive this without extending ndo_ops.

An ideal user space interaction IMHO would look like,

[root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
[root@jf-dev1-dcblab iproute2]# ./br/br fdb
portmac addrflags
veth2   36:a6:35:9b:96:c4   local
veth4   aa:54:b0:7b:42:ef   local
veth0   2a:e8:5c:95:6c:1b   local
veth6   6e:26:d5:43:a3:36   local
veth0   f2:c1:39:76:6a:fb
veth8   4e:35:16:af:87:13   local
veth10  52:e5:62:7b:57:88   static
veth10  aa:a9:35:21:15:c4   local
[root@jf-dev1-dcblab iproute2]# ./br/br fdb add dev eth3 to 52:e5:62:7b:57:88
RTNETLINK answers: Invalid argument

Using Stephen's br tool. First command adds FDB entry to SW bridge and
if the same tool could be used to add entries to embedded bridge I think
that would be the best case. So no RTNETLINK error on the second cmd. Then
embedded FDB entries could be dumped this way also so I get a complete view
of my FDB setup across multiple sw bridges and embedded bridges.

I don't think br is part of iproute2 yet I just pulled it out of some RFC
but it works reasonably well and is intuitive enough.

> 
>> Yes it should flood here, unless its acting as a 802.1Qbg VEB or VEPA.
> 
> Ok. So there is a toggle somewhere which controls how flooding should
> happen.
> 

Yes. The hardware has a bit to support this which is currently not exposed
to user space. That's a case where we have 'yet another knob' that needs
a clean solution. This causes real bugs today when users try to use the
macvlan devices in VEPA mode on top of SR-IOV. By the way these modes are
all part of the 802.1Qbg spec which people actually want to use with Linux
so a good clean solution is probably needed.

>>
>> Maybe not. But the kernel already has the needed signals with one extra
>> hook we can save running a daemon in user space. Maybe that's not a great
>> argument to add kernel code though.
> 
> You make a reasonable arguement to have it in the kernel but i think we
> win more if we separate the control. So while i empathize, I am hoping
> that youd go with the path that is hard to travel ;->
> 
>> The PF_BRIDGE:RTM_GETNEIGH,RTM_NEWNEIGH,RTM_DELNEIGH are registered in the
>> br_netlink_init() path. 
> 
> Hrm - hadnt paid attention to that before. Nasty.
> The bridge seems to be hard-coding policy on station movement, no? 
> This is a good example of the qualms i have on adding things to the
> kernel;->
> I may not want to auto update a MAC address moving ports as part of
> some policy i have. I can go and add YAK (Yet Another Knob) - but where
> is the line drawn?
> 

I have no problem with drawing the line here and trying to implement something
over PF_BRIDGE:RTM_xxx nlmsgs. I'll work with Roopa and see if we can come
up with something in the next couple days.

w.r.t. VEPA/VEB and flooding behavior we could probably have a bit to indicate
if the port is a flooding port or not. Then users could build any sort of 
forwarding
table they wanted OR we could just drive it through a notifier (ndo_ops?) in the
macvlan path which does VEPA today.

OK I'll try to write some actual code now that can be critiqued.

> cheers,
> jamal
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-13 Thread John Fastabend
On 2/10/2012 7:18 AM, jamal wrote:
> Hi John,
> 
> I went backwards to summarize at the top after going through your email.
> 
> TL;DR version 0.1: 
> you provide a good use case where it makes sense to do things in the
> kernel. IMO, you could make the same arguement if your embedded switch
> could do ACLs, IPv4 forwarding etc. And the kernel bloats.
> I am always bigoted to move all policy control to user space instead of
> bloating in the kernel.
> 
>  
> On Thu, 2012-02-09 at 20:14 -0800, John Fastabend wrote:
> 
>>>
>>> Hi Jamal,
>>>
>>> The user space app in this case would listen for FDB updates to the SW
>>> bridge and then mirror them at the embedded NIC. In this case it seems
>>> easier to just add a notifier chain and let the kernel keep these in
>>> sync. Otherwise we need a daemon in user space to replicate these.
>>>
> 
> A user space daemon if you need to ensure synchronization. Thats what i
> meant when i said there was a "disadvantage" over the simple case when
> the goal is always to synchronize.
> 
>>> On the other hand if you could make the same RTM_NEWNEIGH, RTM_DELNEIGH,
>>> and RTM_GETNEIGH work for the bridge, embedded bridge, and macvlan you
>>> would have one common interface to drive these. But the bridge already
>>> has this protocol/msgtype so that would require either some demux or
>>> new protocol/msgtype pairs to be created. 
>>>
> 
> The bridge is very netlink friendly these days. Given the rest of the
> network stack (*NEIGH* you mention above) talks netlink to user space
> it should be workable. 
> 
>>> Let me think on it. I'm tempted by the simplicity of adding notifier
>>> hooks though.
> 
> If something is missing bridge-side it may need to be added (as Per
> Stephen's comment) - i just took it one further indicating those
> notifiers need to also netlink-speak
> 

Sure.

> 
>> Actually because the bridge is adding/removing fdb entries dynamically
>> maybe its best this gets done in kernel. Here's the example case,
> 
> [..]
> 
>>
>> With the flow by letters above hope this is not too difficult to follow.
> 
>> (A) veth0 a virtual device transmits packet destined for ethx.y
>> (B) SW bridge receives frames and updates FDB flooding to C
>> (C) eth0 the PF in this case sends the frame to the HW backed by the
>> embedded bridge
> 
> Following so far.
> Can you have more than one PF per embedded switch? Or is the intent here
> purely to do VMs/VF separation?
> 

The use case here is multiple VFs but the same solution should work with
multiple PFs as well. FDB controls should be independent of how the ports
are exposed VFs, PFs, VMDQ/queue pairs, macvlan, etc.

>> (D) The HW embedded switch has a static entry for ethx.y and forwards
>> the frame to the VF or if its a broadcast frame also floods it to
>> the wire and ethx.y
> 
> nod.
> 
>> (E) ethx.y receives the frame and generates a response to the dest mac of
>> veth0
> 
> nod.
> Since you said in #D the entries in the switch are static, I am assuming
> at this point neither ethx.y nor veth0 exist in the embedded FDB.
> 
>> Now here is the potential issue,
>>
>> (G) The frame transmitted from ethx.y with the destination address of
>> veth0 but the embedded switch is not a learning switch. If the FDB
>> update is done in user space its possible (likely?) that the FDB
>> entry for veth0 has not been added to the embedded switch yet. 
> 
> Ok, got it - so the catch here is the switch is not capable of learning.
> I think this depends on where learning is done. Your intent is to
> use the S/W bridge as something that does the learning for you i.e in
> the kernel. This makes the s/w bridge part of MUST-have-for-this-to-run.
> And that maybe the case for your use case.
> 

This is _my_ use case today.

> What if I dont wanna run the S/W bridge at all?
> Ive been making a point that with a simple knob(Stephen doesn like to
> add such a knob), the SW bridge could defer learning to user space. 
> [This way you can add a lot of richness e.g on ACLs such as restricting
> what MAC addresses etc are allowed to talk to which ones etc.].
> But if bypass the s/w bridge all together and learn in user space
> or have a static config in which i populate the embedded switch, i dont
> see the issue.

With events and ADD/DEL/GET FDB controls we can solve both cases. This also
solves Roopa's case with macvlan where he wants to add additional addresses
to macvlan ports.

> 
>> Now
>> we either have to flood the frame which is not

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend
On 2/9/2012 6:14 PM, John Fastabend wrote:
> On 2/9/2012 1:11 PM, jamal wrote:
>> On Thu, 2012-02-09 at 09:52 -0800, John Fastabend wrote:
>>
>>>>> By netlink_notifier do you mean adding a notifier_block and using 
>>>>> atomic_notifier_call_chain()
>>>>> probably in rtnl_notify()? Then drivers could register with the notifier 
>>>>> chain with
>>>>> atomic_notifier_chain_register() and receive the events correctly. Or did 
>>>>> I miss
>>>>> some notifier chain that already exists?
>>>>
>>>> Yes. that is what I mean. The callbacks you need may or may not already be 
>>>> present.
>>
>> I'll go one step further.
>> This stuff shouldnt be in the kernel at all. 
>> The disadvantage is you need a user space app to update the hardware.
>> i.e, the same mechanism should be usable for either a switch embedded
>> in a NIC or a standalone hardware switch (with/out the s/ware bridge 
>> presence)
>>
>> cheers,
>> jamal
>>
> 
> Hi Jamal,
> 
> The user space app in this case would listen for FDB updates to the SW
> bridge and then mirror them at the embedded NIC. In this case it seems
> easier to just add a notifier chain and let the kernel keep these in
> sync. Otherwise we need a daemon in user space to replicate these.
> 
> On the other hand if you could make the same RTM_NEWNEIGH, RTM_DELNEIGH,
> and RTM_GETNEIGH work for the bridge, embedded bridge, and macvlan you
> would have one common interface to drive these. But the bridge already
> has this protocol/msgtype so that would require either some demux or
> new protocol/msgtype pairs to be created. 
> 
> Let me think on it. I'm tempted by the simplicity of adding notifier
> hooks though.
> 
> .John
> 

Actually because the bridge is adding/removing fdb entries dynamically
maybe its best this gets done in kernel. Here's the example case,


  -- -
  | ethx.y |  < E| veth0 |  <--- A
  -- -
  |   |
  |   |
  |   |
  | --
  | |  SW Bridge | <--- B
  | --
  |   |
  |   |
  |  -
  |  | eth0  | <--- C
  |  -
  |   |
   ---
   |embedded switch  | <--- D
   ---
   |
   |
   G

With the flow by letters above hope this is not too difficult to follow.

(A) veth0 a virtual device transmits packet destined for ethx.y
(B) SW bridge receives frames and updates FDB flooding to C
(C) eth0 the PF in this case sends the frame to the HW backed by the
embedded bridge
(D) The HW embedded switch has a static entry for ethx.y and forwards
the frame to the VF or if its a broadcast frame also floods it to
the wire and ethx.y
(E) ethx.y receives the frame and generates a response to the dest mac of
veth0

Now here is the potential issue,

(G) The frame transmitted from ethx.y with the destination address of
veth0 but the embedded switch is not a learning switch. If the FDB
update is done in user space its possible (likely?) that the FDB
entry for veth0 has not been added to the embedded switch yet. Now
we either have to flood the frame which is not horrible but not
ideal or worse if the embedded switch does not support flooding send
it to the wire and veth0 never receives it. If the SW bridge pushes
the FDB update down into the embedded switch the address is for
sure in the embedded switches forwarding tables and the switching
works as expected.

So to handle this case correctly its probably best IMHO to use a notifier
hook. Having a RTM_GETNEIGH for the embedded switch implemented though
would be nice for dumping the FDB of the embedded switch and SET/DEL
could be used to configure the FDB when its not being driven by the SW
switch. Of course we should try to be minimalists here.

.John
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend
On 2/9/2012 1:11 PM, jamal wrote:
> On Thu, 2012-02-09 at 09:52 -0800, John Fastabend wrote:
> 
>>>> By netlink_notifier do you mean adding a notifier_block and using 
>>>> atomic_notifier_call_chain()
>>>> probably in rtnl_notify()? Then drivers could register with the notifier 
>>>> chain with
>>>> atomic_notifier_chain_register() and receive the events correctly. Or did 
>>>> I miss
>>>> some notifier chain that already exists?
>>>
>>> Yes. that is what I mean. The callbacks you need may or may not already be 
>>> present.
> 
> I'll go one step further.
> This stuff shouldnt be in the kernel at all. 
> The disadvantage is you need a user space app to update the hardware.
> i.e, the same mechanism should be usable for either a switch embedded
> in a NIC or a standalone hardware switch (with/out the s/ware bridge 
> presence)
> 
> cheers,
> jamal
> 

Hi Jamal,

The user space app in this case would listen for FDB updates to the SW
bridge and then mirror them at the embedded NIC. In this case it seems
easier to just add a notifier chain and let the kernel keep these in
sync. Otherwise we need a daemon in user space to replicate these.

On the other hand if you could make the same RTM_NEWNEIGH, RTM_DELNEIGH,
and RTM_GETNEIGH work for the bridge, embedded bridge, and macvlan you
would have one common interface to drive these. But the bridge already
has this protocol/msgtype so that would require either some demux or
new protocol/msgtype pairs to be created. 

Let me think on it. I'm tempted by the simplicity of adding notifier
hooks though.

.John


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend
On 2/9/2012 4:39 PM, Sridhar Samudrala wrote:
> On Thu, 2012-02-09 at 12:30 -0800, John Fastabend wrote:
>> On 2/9/2012 10:14 AM, Sridhar Samudrala wrote:
>>> On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote:
>>>> Propagate software FDB table into hardware uc, mc lists when
>>>> the NETIF_F_HW_FDB is set.
>>>>
>>>> This resolves the case below where an embedded switch is used
>>>> in hardware to do inter-VF or VF-PF switching. This patch
>>>> pushes the FDB entry (specifically the MAC address) into the
>>>> embedded switch with dev_add_uc and dev_add_mc so the switch
>>>> "learns" about the software bridge.
>>>>
>>>>
>>>>   veth0  veth2
>>>> |  |
>>>>   
>>>>   |  bridge0 |   < software bridging
>>>>   
>>>>/
>>>>/
>>>>   ethx.y  ethx
>>>> VF PF
>>>>  \ \  < propagate FDB entries to HW
>>>>  \ \
>>>>   
>>>>   |  Embedded Bridge |< hardware offloaded switching
>>>>   
>>>>
>>>
>>> This scenario works now as adding an interface to a bridge puts it in
>>> promiscuous mode. So adding a PF to a software bridge should not be
>>> a problem as it supports promiscuous mode. But adding a VF will not
>>> work.
>>
>> It shouldn't work because the embedded bridge will lookup the address
>> in its FDB and when it doesn't match any unicast filters it will forward
>> the packet onto the wire. Because the veth0 and veth2 above never get
>> inserted into the embedded brdige's FDB the packets will _never_ get
>> routed there.
>>
>> That said the current 'ixgbe' driver is doing something broken in that
>> it is always setting the unicast hash table and mirroring bits to 1. So
>> if you think this is working your seeing a bug where packets are being
>> sent onto the wire AND upto the PF. Packets with destination addresses
>> matching veth1 should not end up on the wire and vice versa. This is
>> specific to ixgbe and is not the case for other SR-IOV devices.
> 
> OK. Is this behavior going to be fixed.
> 

Only after we have a mechanism to either configure the NIC FDB directly
or have it stay in sync with the SW switch. Flooding traffic seems better
than being unable to send traffic to the virtual device altogether. This
behavior is driver specific some devices just fail outright.

I'm thinking over Jamal's comment now.

>>
>> This causes some issues (a) has some very real performance implications,
>> (b) at this point you have some strange behavior from my point of view.
>> The embedded bridge is not a learning bridge nor is it acting like an
>> 802.1Q VEB or VEPA.
>>
>>>
>>> Are you trying to avoid the requirement of having to put the interface 
>>> in promiscuous mode when adding to a bridge?
>>>
>>
>> I think the bridge being in promiscuous mode is correct.
> 
> The interface that is added to the bridge is put in promiscuous mode,
> not the bridge itself.  In this example, i assumed that setting
> promiscuous on PF is putting the embedded bridge in learning mode.


Yes I misspoke I mean the PF. The embedded bridge in this case does not
support learning. Also I'm not aware of any SR-IOV NICs that do support
learning.

> 
> Thanks
> Sridhar
> 
>>
>> Hope that helps sorry its a bit long winded.
>> John
>>
>>
>>
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend
On 2/9/2012 10:14 AM, Sridhar Samudrala wrote:
> On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote:
>> Propagate software FDB table into hardware uc, mc lists when
>> the NETIF_F_HW_FDB is set.
>>
>> This resolves the case below where an embedded switch is used
>> in hardware to do inter-VF or VF-PF switching. This patch
>> pushes the FDB entry (specifically the MAC address) into the
>> embedded switch with dev_add_uc and dev_add_mc so the switch
>> "learns" about the software bridge.
>>
>>
>>   veth0  veth2
>> |  |
>>   
>>   |  bridge0 |   < software bridging
>>   
>>/
>>/
>>   ethx.y  ethx
>> VF PF
>>  \ \  < propagate FDB entries to HW
>>  \ \
>>   
>>   |  Embedded Bridge |< hardware offloaded switching
>>   
>>
> 
> This scenario works now as adding an interface to a bridge puts it in
> promiscuous mode. So adding a PF to a software bridge should not be
> a problem as it supports promiscuous mode. But adding a VF will not
> work.

It shouldn't work because the embedded bridge will lookup the address
in its FDB and when it doesn't match any unicast filters it will forward
the packet onto the wire. Because the veth0 and veth2 above never get
inserted into the embedded brdige's FDB the packets will _never_ get
routed there.

That said the current 'ixgbe' driver is doing something broken in that
it is always setting the unicast hash table and mirroring bits to 1. So
if you think this is working your seeing a bug where packets are being
sent onto the wire AND upto the PF. Packets with destination addresses
matching veth1 should not end up on the wire and vice versa. This is
specific to ixgbe and is not the case for other SR-IOV devices.

This causes some issues (a) has some very real performance implications,
(b) at this point you have some strange behavior from my point of view.
The embedded bridge is not a learning bridge nor is it acting like an
802.1Q VEB or VEPA.

> 
> Are you trying to avoid the requirement of having to put the interface 
> in promiscuous mode when adding to a bridge?
> 

I think the bridge being in promiscuous mode is correct.

Hope that helps sorry its a bit long winded.
John



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend
On 2/9/2012 9:40 AM, Stephen Hemminger wrote:
> On Thu, 09 Feb 2012 09:36:47 -0800
> John Fastabend  wrote:
> 
>> But the device features makes it easy for user space to learn that the device
>> supports this sort of offload. Now if all SR-IOV devices support this then it
>> doesn't matter but I thought there were SR-IOV devices that didn't do any
>> switching? I'll dig through the SR-IOV drivers to check there are not too
>> many of them.
> 
> If user space needs to know then the OS is not designed properly.
> The purpose of the network device is to abstract all those details, and more 
> and more
> of them are bleeding through. This makes writing management applications 
> harder and makes
> things dependent on features that may or may not be present. The best design 
> is when
> the change is invisible.
> 

Agreed.

>> By netlink_notifier do you mean adding a notifier_block and using 
>> atomic_notifier_call_chain()
>> probably in rtnl_notify()? Then drivers could register with the notifier 
>> chain with
>> atomic_notifier_chain_register() and receive the events correctly. Or did I 
>> miss
>> some notifier chain that already exists?
> 
> Yes. that is what I mean. The callbacks you need may or may not already be 
> present.


OK thanks I'll put together an update here shortly.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread John Fastabend
On 2/8/2012 8:36 PM, Stephen Hemminger wrote:
> On Wed, 08 Feb 2012 19:22:06 -0800
> John Fastabend  wrote:
> 
>> Propagate software FDB table into hardware uc, mc lists when
>> the NETIF_F_HW_FDB is set.
>>
>> This resolves the case below where an embedded switch is used
>> in hardware to do inter-VF or VF-PF switching. This patch
>> pushes the FDB entry (specifically the MAC address) into the
>> embedded switch with dev_add_uc and dev_add_mc so the switch
>> "learns" about the software bridge.
>>
>>
>>   veth0  veth2
>> |  |
>>   
>>   |  bridge0 |   < software bridging
>>   
>>/
>>/
>>   ethx.y  ethx
>> VF PF
>>  \ \  < propagate FDB entries to HW
>>  \ \
>>   
>>   |  Embedded Bridge |< hardware offloaded switching
>>   
>>
>> This is only an RFC couple more changes are needed.
>>
>> (1) Optimize HW FDB set/del to only walk list if an FDB offloaded
>> device is attached. Or decide it doesn't matter from unlikely()
>> path.
>>
>> (2) Is it good enough to just call dev_uc_{add|del} or
>> dev_mc_{add|del}? Or do some devices really need a new netdev
>> callback to do this operation correctly. I think it should be
>> good enough as is.
>>
>> (3) wrapped list walk in rcu_read_lock() just in case maybe every
>> case is already inside rcu_read_lock()/unlock().
>>
>> Also this is in response to this thread regarding the macvlan and
>> exposing rx filters posting now to see if folks think this is the
>> right idea and if it will resolve at least the bridge case.
>>
>> http://lists.openwall.net/netdev/2011/11/08/135
>>
>> Signed-off-by: John Fastabend 
>> ---
>>
>>  include/linux/netdev_features.h |2 ++
>>  net/bridge/br_fdb.c |   34 ++
>>  2 files changed, 36 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/netdev_features.h 
>> b/include/linux/netdev_features.h
>> index 77f5202..5936fae 100644
> 
> Rather than yet another device feature, I would rather use netlink_notifier
> callback. The notifier is more general and generic without messing with 
> internals
> of bridge.
> 

But the device features makes it easy for user space to learn that the device
supports this sort of offload. Now if all SR-IOV devices support this then it
doesn't matter but I thought there were SR-IOV devices that didn't do any
switching? I'll dig through the SR-IOV drivers to check there are not too
many of them.

By netlink_notifier do you mean adding a notifier_block and using 
atomic_notifier_call_chain()
probably in rtnl_notify()? Then drivers could register with the notifier chain 
with
atomic_notifier_chain_register() and receive the events correctly. Or did I miss
some notifier chain that already exists?

Thanks,
John

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v0 2/2] ixgbe: add NETIF_F_HW_FDB to supported flags

2012-02-08 Thread John Fastabend
Add support for NETIF_F_HW_FDB flag when SR-IOV is enabled. This
allows the bridge to push fdb entries into the hardware so the
VF can communicate with virtual devices attached to the bridge.


  veth0  veth2
|  |
  
  |  bridge0 |   < software bridging
  
   /
   /
  ethx.y  ethx
VF PF
 \ \  < propagate FDB entries to HW
 \ \
  
  |  Embedded Bridge |< hardware offloaded switching
  


Signed-off-by: John Fastabend 
---

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   35 +
 1 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ecc46ce..66261fa 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3207,6 +3207,11 @@ static int ixgbe_write_uc_addr_list(struct net_device 
*netdev)
netdev_for_each_uc_addr(ha, netdev) {
if (!rar_entries)
break;
+
+   netif_printk(adapter, hw, KERN_DEBUG, adapter->netdev,
+"%s %s: write vfn %i %pM\n",
+__func__, netdev->name, vfn, ha->addr);
+
hw->mac.ops.set_rar(hw, rar_entries--, ha->addr,
vfn, IXGBE_RAH_AV);
count++;
@@ -3268,16 +3273,17 @@ void ixgbe_set_rx_mode(struct net_device *netdev)
}
ixgbe_vlan_filter_enable(adapter);
hw->addr_ctrl.user_set_promisc = false;
-   /*
-* Write addresses to available RAR registers, if there is not
-* sufficient space to store all the addresses then enable
-* unicast promiscuous mode
-*/
-   count = ixgbe_write_uc_addr_list(netdev);
-   if (count < 0) {
-   fctrl |= IXGBE_FCTRL_UPE;
-   vmolr |= IXGBE_VMOLR_ROPE;
-   }
+   }
+
+   /*
+* Write addresses to available RAR registers, if there is not
+* sufficient space to store all the addresses then enable
+* unicast promiscuous mode
+*/
+   count = ixgbe_write_uc_addr_list(netdev);
+   if (count < 0) {
+   fctrl |= IXGBE_FCTRL_UPE;
+   vmolr |= IXGBE_VMOLR_ROPE;
}
 
if (adapter->num_vfs) {
@@ -7214,6 +7220,10 @@ static netdev_features_t ixgbe_fix_features(struct 
net_device *netdev,
e_info(probe, "rx-usecs set too low, not enabling RSC\n");
}
 
+   /* Only use offloaded FDB if SR-IOV is enabled */
+   if (!(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED))
+   data &= ~NETIF_F_HW_FDB;
+
return data;
 }
 
@@ -7549,9 +7559,12 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
 
netdev->priv_flags |= IFF_UNICAST_FLT;
 
-   if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
+   if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED) {
adapter->flags &= ~(IXGBE_FLAG_RSS_ENABLED |
IXGBE_FLAG_DCB_ENABLED);
+   netdev->hw_features |= NETIF_F_HW_FDB;
+   netdev->features |= NETIF_F_HW_FDB;
+   }
 
 #ifdef CONFIG_IXGBE_DCB
netdev->dcbnl_ops = &dcbnl_ops;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-08 Thread John Fastabend
Propagate software FDB table into hardware uc, mc lists when
the NETIF_F_HW_FDB is set.

This resolves the case below where an embedded switch is used
in hardware to do inter-VF or VF-PF switching. This patch
pushes the FDB entry (specifically the MAC address) into the
embedded switch with dev_add_uc and dev_add_mc so the switch
"learns" about the software bridge.


  veth0  veth2
|  |
  
  |  bridge0 |   < software bridging
  
   /
   /
  ethx.y  ethx
VF PF
 \ \  < propagate FDB entries to HW
 \ \
  
  |  Embedded Bridge |< hardware offloaded switching
  

This is only an RFC couple more changes are needed.

(1) Optimize HW FDB set/del to only walk list if an FDB offloaded
device is attached. Or decide it doesn't matter from unlikely()
path.

(2) Is it good enough to just call dev_uc_{add|del} or
dev_mc_{add|del}? Or do some devices really need a new netdev
callback to do this operation correctly. I think it should be
good enough as is.

(3) wrapped list walk in rcu_read_lock() just in case maybe every
case is already inside rcu_read_lock()/unlock().

Also this is in response to this thread regarding the macvlan and
exposing rx filters posting now to see if folks think this is the
right idea and if it will resolve at least the bridge case.

http://lists.openwall.net/netdev/2011/11/08/135

Signed-off-by: John Fastabend 
---

 include/linux/netdev_features.h |2 ++
 net/bridge/br_fdb.c |   34 ++
 2 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 77f5202..5936fae 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -55,6 +55,8 @@ enum {
NETIF_F_NOCACHE_COPY_BIT,   /* Use no-cache copyfromuser */
NETIF_F_LOOPBACK_BIT,   /* Enable loopback */
 
+   NETIF_F_HW_FDB, /* Hardware supports switching */
+
/*
 * Add your fresh new feature above and remember to update
 * netdev_features_strings[] in net/core/ethtool.c and maybe
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 5ba0c84..4cc545b 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -81,9 +81,26 @@ static void fdb_rcu_free(struct rcu_head *head)
kmem_cache_free(br_fdb_cache, ent);
 }
 
+static void fdb_hw_delete(struct net_bridge *br,
+ struct net_bridge_fdb_entry *fdb)
+{
+   struct net_bridge_port *op;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(op, &br->port_list, list) {
+   struct net_device *dev = op->dev;
+
+   if ((dev->features & NETIF_F_HW_FDB) &&
+   dev != fdb->dst->dev)
+   dev_uc_del(dev, fdb->addr.addr);
+   }
+   rcu_read_unlock();
+}
+
 static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f)
 {
hlist_del_rcu(&f->hlist);
+   fdb_hw_delete(br, f);
fdb_notify(br, f, RTM_DELNEIGH);
call_rcu(&f->rcu, fdb_rcu_free);
 }
@@ -350,6 +367,22 @@ static struct net_bridge_fdb_entry *fdb_find_rcu(struct 
hlist_head *head,
return NULL;
 }
 
+static void fdb_hw_create(struct net_bridge *br,
+ struct net_bridge_fdb_entry *fdb)
+{
+   struct net_bridge_port *op;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(op, &br->port_list, list) {
+   struct net_device *dev = op->dev;
+
+   if ((dev->features & NETIF_F_HW_FDB) &&
+   dev != fdb->dst->dev)
+   dev_uc_add(dev, fdb->addr.addr);
+   }
+   rcu_read_unlock();
+}
+
 static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
   struct net_bridge_port *source,
   const unsigned char *addr)
@@ -363,6 +396,7 @@ static struct net_bridge_fdb_entry *fdb_create(struct 
hlist_head *head,
fdb->is_local = 0;
fdb->is_static = 0;
fdb->updated = fdb->used = jiffies;
+   fdb_hw_create(source->br, fdb);
hlist_add_head_rcu(&fdb->hlist, head);
}
return fdb;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-08 Thread John Fastabend
On 2/5/2012 8:54 AM, Roopa Prabhu wrote:
> 
> 
> 
> On 2/3/12 7:32 AM, "Roopa Prabhu"  wrote:
> 
>>
>>
>>
>> On 2/2/12 10:58 AM, "John Fastabend"  wrote:
> ..
> 
>>> Are you sure they will be good to have? I'm  not so sure you want to be
>>> able to manipulate the uc and mc tables from user space. MACVLAN seems to
>>> be one type of device where it is useful but doing this to a PF or VF seems
>>> hard to use for any real use case. Fun to test the embedded bridge though.
>>>
>>
>> I wont say I am sure. Would be nice have to have netlink interfaces to
>> ADD/DEL additional uc, mc addrs, filter flags and vlans. I have looked at
>> the existing interfaces and nothing seemed straightforward then. But I
>> forget and need to take a look again.
>> I think vlans and filter flags is somehow possible today. And maybe mc too.
>> But if I am right we don't have a way to add additional unicast addresses
>> from userspace. 
>>
>> I will dig my notes and try and list down the problems with using the
>> existing netlink interfaces for this.
> 
> There are kernel api's/ops to add/del hw uc/mc/vlan/filter filter flags:
> Ndo_set_rx_mode, add/del_vid, dev_uc_add, dev_mc_add and dev_filter_flags.
> 
> But there are no straight forward mechanisms to add these from userspace. L2
> mc addresses can be added by SIOCADDMULTI. And filter flags maybe via
> netlink. Nothing for uc and vlan as far as I know (correct me if I am
> wrong). Setting of hw filters is usually done indirectly by the kernel
> during creation of vlan devices for example.
> 
> There is a netlink msg to create a vlan device. But there is no way to add a
> vlan filter directly to the hw. Nothing for secondary uc addrs.
> This is ok for all cases except for the virtualization case I am trying to
> solve. 

Well there is the somewhat asymmetric case where we allow port VLANs to be set
on a VF but not on the PF. I can think of cases (802.1Qbg) where firmware
might be doing EVB or CDCP and enforce a specific filtering. With current
asymmetric interfaces I'm not sure how to expose this on the PF.

to see how this works look for 'ifla_vf_vlan'.

> 
> To summarize,
> 
> The requirement is to have a mechanism from userspace to populate hw filters
> on a device. And this is required to program guest nic filters into the host
> device backing the guest nic. In the direct attach case, its the macvtap
> device and in turn the macvtap lower device.

Yes I agree we would like a mechanism to do this.

> 
> Today I cant think of any other use case that would require this (except
> that there is a brief chance that this could be used in the hybrid
> acceleration stuff that ben and intel have been discussing).

I'll post a RFC I hacked out today to do this with the ./net/bridge code
in a minute. (still needs testing and some fixups).

> 
> I see the below ways this can be done:
> 1) TUNSETTXFILTER: My v1 patch, that targets only the above specific macvtap
> problem. This works for only uc/mc and flags filter. Possibly requires a new
> cmd TUNSETVLANFILTER for vlan filters.
> 
> 
> 2) rtnetlink ops for setting hw filters: My v2 patch targeting virtual
> devices that implement rtnl_link_ops. Example macvtap/macvlan
> 
> This netlink interface to set filters follows TUNSETTXFILTER giving the
> ability to set filters on these devices. The netlink payload must contain
> all the uc, mc, vid's and filter flags that go on the device.
> 
> 
> 3) netdev_ops for setting hw filters: my v3 and v4 patches. This is same as
> 2 but moves the ops to netdev, so that it can be used by all devices if
> required.
> 
> 
> 4) v5 (New approach. Not submitted yet):
> In 2 and 3 above, the netlink msg could be broken down to have separate msgs
> to support add/del of uc/mc/vlan. This should be close to what we have today
> for vf vlan and vf mac. (Something similar to what John Fastabend was
> suggesting too). Advantage, use existing hw ops. (This slightly varies from
> the original goal which was not targeted at getting in to uc,mc lists alone.
> The goal was to have macvlan maintain its own filters and use it in fwding
> lookups if needed in the future. But I guess if we need this in the future
> we could possibly use the macvlan uc, mc lists.)
> 

hmm I'm on the fence here. I like that (4) is generic but I'm not sure I
would want user space to come in and whack a uc addr added from a stacked
device. Looks like there is a free bit in netdev_hw_addr maybe we could add
another bool here to let user space only modify/delete entries that haven't
been locked by the kernel.

That said (1) seems like the straight forward 

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-02 Thread John Fastabend
On 2/2/2012 12:38 PM, Ben Hutchings wrote:
> On Thu, 2012-02-02 at 00:46 -0800, John Fastabend wrote:
> [...]
>> OK finally got to read through this. And its not clear to me why we need
>> these per VF/PF filter netdevice ops and netlink extensions if we can
>> get the stacking correct. (Adding filters to the macvlan seems reasonable
>> to me)
>>
>> In the cases I saw listed above I see a few enumerations:
>>
>> PF <--> MACVLAN  <---> Guest <--- [...]
>>
>> VF <--> MACVLAN  <---> Guest <--- [...]   
>>
>> VF|Guest <--- [...]   direct assigned VF
>>
>> PF|Guest <--- [...]   direct assigned PF
>>
>>
>> I used '[...]' to represent whatever additional stacking is done in the
>> guest unknown to the host. In the direct assign VF case (Greg Rose
>> correct me if I am wrong) the normal uc and mc addr lists should suffice
>> along with the netdev op ndo_set_rx_mode(). Here the guest adds MAC
>> addresses and/or VLANS as normal and then the VF<->PF back channel
>> should handle this if needed. This should work for Linux guests and other
>> OS's should do something similar.
>>
>> In the direct assign PF case the hardware is owned by the guest so
>> no problems here.
>>
>> This leaves the two MACVLAN cases which can be handled the same. If
>> the MACVLAN driver and netlink interface is extended to add filters
>> to the MACVLAN then the addresses can be pushed to the lower device
>> using the normal dev_uc_{add|del}() and dev_mc_{add|del}() routines.
> [...]
> 
> There is another case: hybrid acceleration.  Without a bridge in the
> NIC, you need a software bridge for multicast/broadcast replication,
> traffic between guests, and traffic between guest and host.  A guest
> driver can then send and receive to remote addresses through a VF while
> retaining fallback to the software bridge.
> 
> In order to do this, the guest driver needs to know which addresses are
> local.  The net driver for the PF can tell it about the addresses
> assigned to each function, but if there are other devices included in
> the bridge then it will not know about them.

Off the top of my head another approach would be to add a flag to the
PF maybe NETIF_F_FDB_OFFLOADED and have the bridge push the fdb updates
down to the PF which could propagate them into the VFs. Then you don't
need any new netdevice ops or netlink extensions.

This also would allow any 'learned' addresses to be pushed into the
VF and your daemon wouldn't have to monitor the bridges FDB and send
updates down.

> 
> In Solarflare's current out-of-tree implementation this is dealt with in
> an extension to libvirt that writes the additional 'local' MAC addresses
> to a driver-specific sysfs file, but that is obviously not likely to be
> acceptable in-tree!  So I was interested in this proposal to extend MAC
> filtering, but wanted to get the semantics clear.

Agreed. Intel devices needs something similar to handle the bridge cases
where the MAC addresses of the virtual devices are not getting programmed
into the hardware forwarding table.

So we need something its just finding the right semantics and mechanism.

> 
> Ben.
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-02 Thread John Fastabend
On 2/2/2012 10:07 AM, Roopa Prabhu wrote:
> 
> 
> 
> On 2/2/12 12:46 AM, "John Fastabend"  wrote:
> 
>> On 2/1/2012 11:24 PM, Michael S. Tsirkin wrote:
>>> On Sun, Nov 20, 2011 at 08:30:24AM -0800, Roopa Prabhu wrote:
>>>>
>>>>
>>>>
>>>> On 11/17/11 4:15 PM, "Ben Hutchings"  wrote:
>>>>
>>>>> Sorry to come to this rather late.
>>>>>
>>>>> On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
>>>>> [...]
>>>>>> v2 -> v3
>>>>>> - Moved set and get filter ops from rtnl_link_ops to netdev_ops
>>>>>> - Support for SRIOV VFs.
>>>>>> [Note: The get filters msg (in the way current get rtnetlink
>>>>>> handles
>>>>>> it) might get too big for SRIOV vfs. This patch follows existing
>>>>>> sriov 
>>>>>> vf get code and tries to accomodate filters for all VF's in a PF.
>>>>>> And for the SRIOV case I have only tested the fact that the VF
>>>>>> arguments are getting delivered to rtnetlink correctly. The code
>>>>>> follows existing sriov vf handling code so rest of it should work
>>>>>> fine]
>>>>> [...]
>>>>>
>>>>> This is already broken for large numbers of VFs, and increasing the
>>>>> amount of information per VF is going to make the situation worse.  I am
>>>>> no netlink expert but I think that the current approach of bundling all
>>>>> information about an interface in a single message may not be
>>>>> sustainable.
>>>>
>>>> Yes agreed. I have the same concern.
>>>
>>> So it seems that we need to extend the existing interface to allow
>>> tweaking filters per VF. Does it need to block this
>>> patchset though? After all, we'll need to support the existing
>>
>> hmm not sure I follow what patchset is this blocking?
>>
>>> interface indefinitely, too.
>>>
>>
>> OK finally got to read through this. And its not clear to me why we need
>> these per VF/PF filter netdevice ops and netlink extensions if we can
>> get the stacking correct. (Adding filters to the macvlan seems reasonable
>> to me)
>>
>> In the cases I saw listed above I see a few enumerations:
>>
>> PF <--> MACVLAN  <---> Guest <--- [...]
>>
>> VF <--> MACVLAN  <---> Guest <--- [...]
>>
>> VF|Guest <--- [...]   direct assigned VF
>>
>> PF|Guest <--- [...]   direct assigned PF
>>
>>
>> I used '[...]' to represent whatever additional stacking is done in the
>> guest unknown to the host. In the direct assign VF case (Greg Rose
>> correct me if I am wrong) the normal uc and mc addr lists should suffice
>> along with the netdev op ndo_set_rx_mode(). Here the guest adds MAC
>> addresses and/or VLANS as normal and then the VF<->PF back channel
>> should handle this if needed. This should work for Linux guests and other
>> OS's should do something similar.
>>
>> In the direct assign PF case the hardware is owned by the guest so
>> no problems here.
>>
>> This leaves the two MACVLAN cases which can be handled the same. If
>> the MACVLAN driver and netlink interface is extended to add filters
>> to the MACVLAN then the addresses can be pushed to the lower device
>> using the normal dev_uc_{add|del}() and dev_mc_{add|del}() routines.
> 
> My patches were trying to do just this (unless I am missing something).
> 

Right I was trying enumerate the cases. Your patches 5,6 seem to use
dev_{uc|mc}_{add|del} like this.

>>
>> I think this has some real advantages to the above scheme. First
>> we get rid of _all_ the drivers having to add a bunch of new
>> net_device ops and do it once in the layer above. This is nice
>> for driver implementers but also because your feature becomes usable
>> immediately and we don't have to wait for driver developers to implement
>> it.
> 
> Yes my patches were targeting towards this too. I had macvlan implement the
> netlink ops and macvlan internally was using the dev_uc_add and del routines
> to pass the addr lists to lower device.

Yes. But I am missing why the VF ops and netlink extensions are useful. Or
even the op/netlink extension into the PF for that matter.

> 
>>
>> Also it prunes down the number of netlink extensions

Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-02 Thread John Fastabend
On 2/2/2012 12:50 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 02, 2012 at 12:46:57AM -0800, John Fastabend wrote:
>> On 2/1/2012 11:24 PM, Michael S. Tsirkin wrote:
>>> On Sun, Nov 20, 2011 at 08:30:24AM -0800, Roopa Prabhu wrote:
>>>>
>>>>
>>>>
>>>> On 11/17/11 4:15 PM, "Ben Hutchings"  wrote:
>>>>
>>>>> Sorry to come to this rather late.
>>>>>
>>>>> On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
>>>>> [...]
>>>>>> v2 -> v3
>>>>>> - Moved set and get filter ops from rtnl_link_ops to netdev_ops
>>>>>> - Support for SRIOV VFs.
>>>>>> [Note: The get filters msg (in the way current get rtnetlink 
>>>>>> handles
>>>>>> it) might get too big for SRIOV vfs. This patch follows existing
>>>>>> sriov 
>>>>>> vf get code and tries to accomodate filters for all VF's in a PF.
>>>>>> And for the SRIOV case I have only tested the fact that the VF
>>>>>> arguments are getting delivered to rtnetlink correctly. The code
>>>>>> follows existing sriov vf handling code so rest of it should work
>>>>>> fine]
>>>>> [...]
>>>>>
>>>>> This is already broken for large numbers of VFs, and increasing the
>>>>> amount of information per VF is going to make the situation worse.  I am
>>>>> no netlink expert but I think that the current approach of bundling all
>>>>> information about an interface in a single message may not be
>>>>> sustainable.
>>>>
>>>> Yes agreed. I have the same concern.
>>>
>>> So it seems that we need to extend the existing interface to allow
>>> tweaking filters per VF. Does it need to block this
>>> patchset though? After all, we'll need to support the existing
>>
>> hmm not sure I follow what patchset is this blocking?
> 
> The one you are replying to.

Gotcha that would seem OK to me although I think you can avoid it altogether.

.John
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-02 Thread John Fastabend
On 2/1/2012 11:24 PM, Michael S. Tsirkin wrote:
> On Sun, Nov 20, 2011 at 08:30:24AM -0800, Roopa Prabhu wrote:
>>
>>
>>
>> On 11/17/11 4:15 PM, "Ben Hutchings"  wrote:
>>
>>> Sorry to come to this rather late.
>>>
>>> On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
>>> [...]
 v2 -> v3
 - Moved set and get filter ops from rtnl_link_ops to netdev_ops
 - Support for SRIOV VFs.
 [Note: The get filters msg (in the way current get rtnetlink 
 handles
 it) might get too big for SRIOV vfs. This patch follows existing
 sriov 
 vf get code and tries to accomodate filters for all VF's in a PF.
 And for the SRIOV case I have only tested the fact that the VF
 arguments are getting delivered to rtnetlink correctly. The code
 follows existing sriov vf handling code so rest of it should work
 fine]
>>> [...]
>>>
>>> This is already broken for large numbers of VFs, and increasing the
>>> amount of information per VF is going to make the situation worse.  I am
>>> no netlink expert but I think that the current approach of bundling all
>>> information about an interface in a single message may not be
>>> sustainable.
>>
>> Yes agreed. I have the same concern.
> 
> So it seems that we need to extend the existing interface to allow
> tweaking filters per VF. Does it need to block this
> patchset though? After all, we'll need to support the existing

hmm not sure I follow what patchset is this blocking?

> interface indefinitely, too.
> 

OK finally got to read through this. And its not clear to me why we need
these per VF/PF filter netdevice ops and netlink extensions if we can
get the stacking correct. (Adding filters to the macvlan seems reasonable
to me)

In the cases I saw listed above I see a few enumerations:

PF <--> MACVLAN  <---> Guest <--- [...]

VF <--> MACVLAN  <---> Guest <--- [...]   

VF|Guest <--- [...]   direct assigned VF

PF|Guest <--- [...]   direct assigned PF


I used '[...]' to represent whatever additional stacking is done in the
guest unknown to the host. In the direct assign VF case (Greg Rose
correct me if I am wrong) the normal uc and mc addr lists should suffice
along with the netdev op ndo_set_rx_mode(). Here the guest adds MAC
addresses and/or VLANS as normal and then the VF<->PF back channel
should handle this if needed. This should work for Linux guests and other
OS's should do something similar.

In the direct assign PF case the hardware is owned by the guest so
no problems here.

This leaves the two MACVLAN cases which can be handled the same. If
the MACVLAN driver and netlink interface is extended to add filters
to the MACVLAN then the addresses can be pushed to the lower device
using the normal dev_uc_{add|del}() and dev_mc_{add|del}() routines.

I think this has some real advantages to the above scheme. First
we get rid of _all_ the drivers having to add a bunch of new
net_device ops and do it once in the layer above. This is nice
for driver implementers but also because your feature becomes usable
immediately and we don't have to wait for driver developers to implement
it.

Also it prunes down the number of netlink extensions being added
here. 

Additionally the existing semantics seem a bit strange to me on the
netlink message side. Taking a quick look at the macvlan implementation
it looks like every set has to have a complete list of address. But
the dev_uc_add and dev_uc_del seem to be using a refcnt scheme so
if I want to add a second address and then latter a third address
how does that work?

Is the expected flow from user space 'read uc_list -> write uc_list'?
This seems risky because with two adders in user space you might
lose addresses unless they are somehow kept in sync. IMHO it is likely
easier to implement an ADD and DEL attribute rather than a table
approach. Took a quick stab at something like this below but there
might be a better way to do this and allow direct modification of the
uc and mc lists I think means you could remove a uc address added
by some stacked device maybe a VLAN. (just guessing.)

Sorry if I missed something in the above thread I read most of it. And
maybe I missed something or oversimplified the problem.

Thanks,
John



+/* MACVLAN ADDRLIST management section 
+ *
+ * Contains attributes to expose multicast and unicast hardware
+ * RX address filters to user space.
+ *
+ * FIELDS:
+ * - IFLA_ADDRLIST_{UC|MC}
+ *
+ *   Read only attributes, returns currently set mc or uc addr list.
+ *
+ * - IFLA_ADDRLIST_{UC|MC}_ADD
+ *
+ *   Write only attributes, adds listed addresses to dev uc or mc
+ *   RX filter address lists.
+ *
+ * - IFLA_ADDRLIST_{UC|MC}_DEL
+ *
+ *   Write only attributes, deletes listed addresses in dev uc or
+ *   mc RX filter address lists.
+ *
+ * PRECEDENCE:
+ *
+ * Add operations are parsed before delete operations. Passing a
+ * single netlink message with a single address in bo