[Bridge] [PATCH RFC WIP 0/5] IGMP snooping for local traffic

2017-08-26 Thread Andrew Lunn
This is a WIP patchset i would like comments on from bridge, switchdev
and hardware offload people.

The linux bridge supports IGMP snooping. It will listen to IGMP
reports on bridge ports and keep track of which groups have been
joined on an interface. It will then forward multicast based on this
group membership.

When the bridge adds or removed groups from an interface, it uses
switchdev to request the hardware add an mdb to a port, so the
hardware can perform the selective forwarding between ports.

What is not covered by the current bridge code, is IGMP joins/leaves
from the host on the brX interface. No such monitoring is
performed. With a pure software bridge, it is not required. All
mulitcast frames are passed to the brX interface, and the network
stack filters them, as it does for any interface. However, when
hardware offload is involved, things change. We should program the
hardware to only send multcast packets to the host when the host has
in interest in them.

Thus we need to perform IGMP snooping on the brX interface, just like
any other interface of the bridge. However, currently the brX
interface is missing all the needed data structures to do this. There
is no net_bridge_port structure for the brX interface. This strucuture
is created when an interface is added to the bridge. But the brX
interface is not a member of the bridge. So this patchset makes the
brX interface a first class member of the bridge. When the brX
interface is opened, the interface is added to the bridge. A
net_bridge_port is allocated for it, and IGMP snooping is performed as
usual.

There are some complexities here. Some assumptions are broken, like
the master interface of a port interface is the bridge interface. The
brX interface cannot be its own master. The use of
netdev_master_upper_dev_get() within the bridge code has been changed
to reflecit this. The bridge receive handler needs to not process
frames for the brX interface, etc.

The interface downward to the hardware is also an issue. The code
presented here is a hack and needs to change. But that is secondary
and can be solved once it is agreed how the bridge needs to change to
support this use case.

Comment welcome and wanted.

Andrew

Andrew Lunn (5):
  net: rtnetlink: Handle bridge port without upper device
  net: bridge: Skip receive handler on brX interface
  net: bridge: Make the brX interface a member of the bridge
  net: dsa: HACK: Handle MDB add/remove for none-switch ports
  net: dsa: Don't include CPU port when adding MDB to a port

 include/linux/if_bridge.h |  1 +
 net/bridge/br_device.c| 12 ++--
 net/bridge/br_if.c| 37 -
 net/bridge/br_input.c |  4 
 net/bridge/br_mdb.c   |  2 --
 net/bridge/br_multicast.c |  7 ---
 net/bridge/br_private.h   |  1 +
 net/core/rtnetlink.c  | 23 +--
 net/dsa/port.c| 19 +--
 net/dsa/switch.c  |  2 +-
 10 files changed, 83 insertions(+), 25 deletions(-)

-- 
2.14.1



[Bridge] [PATCH RFC WIP 5/5] net: dsa: Don't include CPU port when adding MDB to a port

2017-08-26 Thread Andrew Lunn
Now that the MDB are explicitly added to the CPU port when required,
don't add the CPU port adding an MDB to a switch port.

Signed-off-by: Andrew Lunn 
---
 net/dsa/switch.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index 97e2e9c8cf3f..c178e2b86a9a 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -130,7 +130,7 @@ static int dsa_switch_mdb_add(struct dsa_switch *ds,
if (ds->index == info->sw_index)
set_bit(info->port, group);
for (port = 0; port < ds->num_ports; port++)
-   if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))
+   if (dsa_is_dsa_port(ds, port))
set_bit(port, group);
 
if (switchdev_trans_ph_prepare(trans)) {
-- 
2.14.1



[Bridge] [PATCH RFC WIP 1/5] net: rtnetlink: Handle bridge port without upper device

2017-08-26 Thread Andrew Lunn
The brX interface will with a following patch becomes a member of the
bridge. It however cannot be a slave interface, since it would have to
be a slave of itself. netdev_master_upper_dev_get() returns NULL as a
result. Handle this NULL, by knowing this bridge slave must also be
the master, i.e. what we are looking for.

Signed-off-by: Andrew Lunn 
---
 net/core/rtnetlink.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 9201e3621351..2673eb430b6f 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3093,8 +3093,12 @@ static int rtnl_fdb_add(struct sk_buff *skb, struct 
nlmsghdr *nlh,
if ((!ndm->ndm_flags || ndm->ndm_flags & NTF_MASTER) &&
(dev->priv_flags & IFF_BRIDGE_PORT)) {
struct net_device *br_dev = netdev_master_upper_dev_get(dev);
-   const struct net_device_ops *ops = br_dev->netdev_ops;
+   const struct net_device_ops *ops;
 
+   if (!br_dev)
+   br_dev = dev;
+
+   ops = br_dev->netdev_ops;
err = ops->ndo_fdb_add(ndm, tb, dev, addr, vid,
   nlh->nlmsg_flags);
if (err)
@@ -3197,7 +3201,12 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct 
nlmsghdr *nlh,
if ((!ndm->ndm_flags || ndm->ndm_flags & NTF_MASTER) &&
(dev->priv_flags & IFF_BRIDGE_PORT)) {
struct net_device *br_dev = netdev_master_upper_dev_get(dev);
-   const struct net_device_ops *ops = br_dev->netdev_ops;
+   const struct net_device_ops *ops;
+
+   if (!br_dev)
+   br_dev = dev;
+
+   ops = br_dev->netdev_ops;
 
if (ops->ndo_fdb_del)
err = ops->ndo_fdb_del(ndm, tb, dev, addr, vid);
@@ -3332,6 +3341,8 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct 
netlink_callback *cb)
if (!br_idx) { /* user did not specify a specific 
bridge */
if (dev->priv_flags & IFF_BRIDGE_PORT) {
br_dev = 
netdev_master_upper_dev_get(dev);
+   if (!br_dev)
+   br_dev = dev;
cops = br_dev->netdev_ops;
}
} else {
@@ -3410,6 +3421,9 @@ int ndo_dflt_bridge_getlink(struct sk_buff *skb, u32 pid, 
u32 seq,
struct net_device *br_dev = netdev_master_upper_dev_get(dev);
int err = 0;
 
+   if (!br_dev)
+   br_dev = dev;
+
nlh = nlmsg_put(skb, pid, seq, RTM_NEWLINK, sizeof(*ifm), nlflags);
if (nlh == NULL)
return -EMSGSIZE;
@@ -3647,6 +3661,8 @@ static int rtnl_bridge_setlink(struct sk_buff *skb, 
struct nlmsghdr *nlh,
 
if (!flags || (flags & BRIDGE_FLAGS_MASTER)) {
struct net_device *br_dev = netdev_master_upper_dev_get(dev);
+   if (!br_dev)
+   br_dev = dev;
 
if (!br_dev || !br_dev->netdev_ops->ndo_bridge_setlink) {
err = -EOPNOTSUPP;
@@ -3723,6 +3739,9 @@ static int rtnl_bridge_dellink(struct sk_buff *skb, 
struct nlmsghdr *nlh,
if (!flags || (flags & BRIDGE_FLAGS_MASTER)) {
struct net_device *br_dev = netdev_master_upper_dev_get(dev);
 
+   if (!br_dev)
+   br_dev = dev;
+
if (!br_dev || !br_dev->netdev_ops->ndo_bridge_dellink) {
err = -EOPNOTSUPP;
goto out;
-- 
2.14.1



[Bridge] [PATCH RFC WIP 2/5] net: bridge: Skip receive handler on brX interface

2017-08-26 Thread Andrew Lunn
The brX interface will soon become a member of the bridge. As such, it
will get a receiver handler assigned. However, we don't want to handle
packets received on this soft interfaces. So detect the condition and
say all the packets pass.
---
 net/bridge/br_input.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 7637f58c1226..38c2a41968f2 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -267,6 +267,10 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
return RX_HANDLER_CONSUMED;
 
p = br_port_get_rcu(skb->dev);
+
+   if (p->dev == p->br->dev)
+   return RX_HANDLER_PASS;
+
if (p->flags & BR_VLAN_TUNNEL) {
if (br_handle_ingress_vlan_tunnel(skb, p,
  nbp_vlan_group_rcu(p)))
-- 
2.14.1



[Bridge] [PATCH RFC WIP 3/5] net: bridge: Make the brX interface a member of the bridge

2017-08-26 Thread Andrew Lunn
In order to perform IGMP snooping on the brX interface, it has to be
part of the bridge, so that the code snooping on normal bridge ports
keeps track of IGMP joins and leaves.

When the brX interface is opened, add the interface to the bridge.
When the brX interface is closed, remove it from the bridge.

This port does however need some special handling. So add a bridge
port flag, BR_SOFT_INTERFACE, indicating a port is the sort interface
of the bridge.

When the port is added to the bridge, the netdev for this port cannot
be linked to the master device, since it is the master device.
Similarly when removing the port, it cannot be unlinked from the
master device.

With the brX interface now being a member of the bridge, and having
all associated structures, we can process IGMP messages sent by the
interface. This is done by the br_multicast_rcv() function, which
takes the bridge_port structure as a parameter. This cannot be easily
found, so keep track of it in the net_bridge structure.
---
 include/linux/if_bridge.h |  1 +
 net/bridge/br_device.c| 12 ++--
 net/bridge/br_if.c| 37 -
 net/bridge/br_mdb.c   |  2 --
 net/bridge/br_multicast.c |  7 ---
 net/bridge/br_private.h   |  1 +
 6 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 3cd18ac0697f..8a03821d1827 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -49,6 +49,7 @@ struct br_ip_list {
 #define BR_MULTICAST_TO_UNICASTBIT(12)
 #define BR_VLAN_TUNNEL BIT(13)
 #define BR_BCAST_FLOOD BIT(14)
+#define BR_SOFT_INTERFACE  BIT(15)
 
 #define BR_DEFAULT_AGEING_TIME (300 * HZ)
 
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 861ae2a165f4..f27ca62fd4a5 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -69,7 +69,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct 
net_device *dev)
br_flood(br, skb, BR_PKT_MULTICAST, false, true);
goto out;
}
-   if (br_multicast_rcv(br, NULL, skb, vid)) {
+   if (br_multicast_rcv(br, br->local_port, skb, vid)) {
kfree_skb(skb);
goto out;
}
@@ -133,6 +133,14 @@ static void br_dev_uninit(struct net_device *dev)
 static int br_dev_open(struct net_device *dev)
 {
struct net_bridge *br = netdev_priv(dev);
+   int err;
+
+   err = br_add_if(br, br->dev);
+   if (err)
+   return err;
+
+   br->local_port = list_first_or_null_rcu(&br->port_list,
+   struct net_bridge_port, list);
 
netdev_update_features(dev);
netif_start_queue(dev);
@@ -161,7 +169,7 @@ static int br_dev_stop(struct net_device *dev)
 
netif_stop_queue(dev);
 
-   return 0;
+   return br_del_if(br, br->dev);
 }
 
 static void br_get_stats64(struct net_device *dev,
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index f3aef22931ab..49208e774191 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -284,7 +284,8 @@ static void del_nbp(struct net_bridge_port *p)
 
nbp_update_port_count(br);
 
-   netdev_upper_dev_unlink(dev, br->dev);
+   if (!(p->flags & BR_SOFT_INTERFACE))
+   netdev_upper_dev_unlink(dev, br->dev);
 
dev->priv_flags &= ~IFF_BRIDGE_PORT;
 
@@ -362,6 +363,8 @@ static struct net_bridge_port *new_nbp(struct net_bridge 
*br,
p->priority = 0x8000 >> BR_PORT_BITS;
p->port_no = index;
p->flags = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;
+   if (br->dev == dev)
+   p->flags |= BR_SOFT_INTERFACE;
br_init_port(p);
br_set_state(p, BR_STATE_DISABLED);
br_stp_port_timer_init(p);
@@ -500,8 +503,11 @@ int br_add_if(struct net_bridge *br, struct net_device 
*dev)
return -EINVAL;
 
/* No bridging of bridges */
-   if (dev->netdev_ops->ndo_start_xmit == br_dev_xmit)
-   return -ELOOP;
+   if (dev->netdev_ops->ndo_start_xmit == br_dev_xmit) {
+   /* Unless it is our own soft interface */
+   if (br->dev != dev)
+   return -ELOOP;
+   }
 
/* Device is already being bridged */
if (br_port_exists(dev))
@@ -540,9 +546,11 @@ int br_add_if(struct net_bridge *br, struct net_device 
*dev)
 
dev->priv_flags |= IFF_BRIDGE_PORT;
 
-   err = netdev_master_upper_dev_link(dev, br->dev, NULL, NULL);
-   if (err)
-   goto err5;
+   if (!(p->flags & BR_SOFT_INTERFACE)) {
+   err = netdev_master_upper_dev_link(dev, br->dev, NULL, NULL);
+   if (err)
+   goto err5;
+   }
 
err = nbp_switchdev_mark_set(p);
if (err)
@@ -563,13 +571,15 @@ int br_add_if(struct net_bridge *br, struct net

[Bridge] [PATCH RFC WIP 4/5] net: dsa: HACK: Handle MDB add/remove for none-switch ports

2017-08-26 Thread Andrew Lunn
When there is a mdb added to a port which is not in the switch, we
need the switch to forward traffic for the group to the software
bridge, so it can forward it out the none-switch port.

The current implementation is a hack and will be replaced. Currently
only the bridge soft interface is supported. When there is a
join/leave on the soft interface, switchdev calls are made on the soft
interface device, brX. This does not have a switchdev ops structure
registered, so all lower interfaces of brX get there switchdev
function called. These are switch ports, and do have switchdev ops. By
comparing the original interface to the called interface, we can
determine this is not for a switch port, and add/remove the mdb to the
CPU port.
---
 net/dsa/port.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index d6e07176df3f..d8e4bfefd97d 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -194,8 +194,15 @@ int dsa_port_mdb_add(struct dsa_port *dp,
.mdb = mdb,
};
 
-   pr_info("dsa_port_mdb_add: %d %d", info.sw_index, info.port);
-
+   if (dp->netdev != mdb->obj.orig_dev) {
+   /* Not a port for this switch, so forward
+* multicast out the CPU port to the bridge.
+*/
+   struct dsa_switch_tree *dst = dp->ds->dst;
+   struct dsa_port *cpu_dp = dsa_get_cpu_port(dst);
+   info.port = cpu_dp->index;
+   return dsa_port_notify(cpu_dp, DSA_NOTIFIER_MDB_ADD, &info);
+   }
return dsa_port_notify(dp, DSA_NOTIFIER_MDB_ADD, &info);
 }
 
@@ -208,6 +215,14 @@ int dsa_port_mdb_del(struct dsa_port *dp,
.mdb = mdb,
};
 
+   if (dp->netdev != mdb->obj.orig_dev) {
+   struct dsa_switch_tree *dst = dp->ds->dst;
+   struct dsa_port *cpu_dp = dsa_get_cpu_port(dst);
+
+   info.port = cpu_dp->index;
+   return dsa_port_notify(cpu_dp, DSA_NOTIFIER_MDB_DEL, &info);
+   }
+
return dsa_port_notify(dp, DSA_NOTIFIER_MDB_DEL, &info);
 }
 
-- 
2.14.1



Re: [Bridge] [PATCH RFC WIP 0/5] IGMP snooping for local traffic

2017-08-26 Thread Andrew Lunn
> Hi Andrew,
> 
> Have you taken a look at mglist (the boolean, probably needs a rename) ? It 
> is for
> exactly that purpose, to track which groups the bridge is interested in.
> I assume I'm forgetting or missing something here.
> 
> > performed. With a pure software bridge, it is not required. All
> > mulitcast frames are passed to the brX interface, and the network
> 
> If mglist (again the boolean) is false then they won't be passed up.
> 
> > stack filters them, as it does for any interface. However, when
> > hardware offload is involved, things change. We should program the
> > hardware to only send multcast packets to the host when the host has
> > in interest in them.
> 
> Granted the boolean mglist might need some changes (esp. with host group 
> leave)
> but I think it can be used to program switchdev for host join/leave, can't
> we adjust its behaviour instead of introducing this complexity and avoid many
> headaches ?

I would like to avoid this complexity as well. I will take a look at
mglist. Thanks for the hint.

Andrew


Re: [Bridge] [PATCH RFC WIP 0/5] IGMP snooping for local traffic

2017-08-28 Thread Andrew Lunn
On Sun, Aug 27, 2017 at 07:44:15PM -0700, Florian Fainelli wrote:
> Hi Andrew,
> 
> On 08/26/2017 01:56 PM, Andrew Lunn wrote:
> > This is a WIP patchset i would like comments on from bridge,
> > switchdev and hardware offload people.
> > 
> > The linux bridge supports IGMP snooping. It will listen to IGMP 
> > reports on bridge ports and keep track of which groups have been 
> > joined on an interface. It will then forward multicast based on this 
> > group membership.
> > 
> > When the bridge adds or removed groups from an interface, it uses 
> > switchdev to request the hardware add an mdb to a port, so the 
> > hardware can perform the selective forwarding between ports.
> > 
> > What is not covered by the current bridge code, is IGMP joins/leaves 
> > from the host on the brX interface. No such monitoring is performed.
> > With a pure software bridge, it is not required. All mulitcast frames
> > are passed to the brX interface, and the network stack filters them,
> > as it does for any interface. However, when hardware offload is
> > involved, things change. We should program the hardware to only send
> > multcast packets to the host when the host has in interest in them.
> 
> OK, so if I understand this right, without a bridge, we have the
> following happen today: with a DSA-enabled setup using any kind of
> switch tagging protocol, if a host is interested in receiving particular
> multicast traffic, we would receive IGMP joins/leaves through sw0p0, and
> the stack should call ndo_set_rx_mode for sw0p0, which would be
> dsa_slave_set_rx_mode() and which would synchronize the DSA master
> network device with the slave network device, everything works fine
> provided that the CPU port is configured to accept multicast traffic.

Hi Florian

That sounds about right, but it is not a use case i've looked at yet.
I do need to look at it though, because there is a chance i break it
with IGMP snooping support.

> Note here that we don't really add a MDB entry for sw0p0 when that
> happens, but it seems like we should for switches that lack IGMP
> snooping and/or multicast filtering.

Yes. But it is not an MDB in the normal sense, at least from the
switches perspective. An MDB passed to a switch using switchdev says
that traffic for the specified group should go out that port. However,
when the host does a join on sw0p0, we want the exact opposite,
multicast traffic coming in that port of the switch which matches the
group should be forwarded to the host. So my second attempt at this
code adds a new switchdev object, SWITCHDEV_OBJ_ID_PORT_HOST_MDB.

> With the current bridge and DSA code, are not we actually always going
> to get the CPU port to be added with the multicast address and therefore
> no filtering is occurring and snooping is pretty much useless?

Correct. At the moment, all multicast traffic gets delivered to the
host.

> While I understand the reasons why you did it that way, I think this is
> going to break a lot of code in bridge that does not expect brX to be a
> bridge port member.

It does. 

Nikolay suggestion works a lot better, and i'm following that now. It
looks good and only requires some minor tweaks to the bridge code.

  Andrew


Re: [Bridge] [PATCH net-next v2] bridge: fdb add and delete tracepoints

2017-08-29 Thread Andrew Lunn
On Mon, Aug 28, 2017 at 09:22:48PM -0700, Roopa Prabhu wrote:
> From: Roopa Prabhu 
> 
> A few useful tracepoints to trace bridge forwarding
> database updates.

Hi Roopa

Once accepted, it would be nice to add mdb tracepoints as well. I
expect the implementation would be very similar.

   Andrew


Re: [Bridge] [PATCH net-next] net: Define eth_stp_addr in linux/etherdevice.h

2017-11-02 Thread Andrew Lunn
On Thu, Nov 02, 2017 at 10:36:48AM +0100, Egil Hjelmeland wrote:
> Why:

Please drop the Why.

> The lan9303 driver defines eth_stp_addr as a synonym to
> eth_reserved_addr_base to get the STP ethernet address 01:80:c2:00:00:00.
> 
> eth_reserved_addr_base is also used to define the start of Bridge Reserved
> ethernet address range, which happen to be the STP address.
> 
> br_dev_setup refer to eth_reserved_addr_base as a definition of STP
> address.
> 
> Clean up by:
>  - Move the eth_stp_addr definition to linux/etherdevice.h
>  - Use eth_stp_addr instead of eth_reserved_addr_base in br_dev_setup.
> 
> Signed-off-by: Egil Hjelmeland 

I was thinking along the same lines when reviewing your lan9303 patch.

Reviewed-by: Andrew Lunn 

Andrew


Re: [Bridge] [PATCH net-next 2/5] bridge: propagate BR_ flags updates through sysfs to switchdev

2018-03-10 Thread Andrew Lunn
On Fri, Mar 09, 2018 at 07:03:05PM -0800, Igor Mitsyanko wrote:
> sysfs interface to configure bridge flags only updates SW copy but does
> not notify hardware through switchdev interface. Make sure it is.
> 
> Signed-off-by: Igor Mitsyanko 
> ---
>  net/bridge/br_sysfs_if.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c
> index 126a8ea..9bdd177 100644
> --- a/net/bridge/br_sysfs_if.c
> +++ b/net/bridge/br_sysfs_if.c
> @@ -51,6 +51,7 @@ static int store_flag(struct net_bridge_port *p, unsigned 
> long v,
> unsigned long mask)
>  {
>   unsigned long flags;
> + int err;
>  
>   flags = p->flags;
>  
> @@ -59,10 +60,16 @@ static int store_flag(struct net_bridge_port *p, unsigned 
> long v,
>   else
>   flags &= ~mask;
>  
> - if (flags != p->flags) {
> - p->flags = flags;
> - br_port_flags_change(p, mask);
> - }
> + if (flags == p->flags)
> + return 0;
> +
> + err = br_switchdev_set_port_flag(p, flags, mask);
> + if (err)
> + return err;

You might want to consider the br_warn() in
br_switchdev_set_port_flag(). Do we want to spam the kernel log?  Or
should store_flag() do some validation before calling
br_switchdev_set_port_flag()?

Andrew


Re: [Bridge] [RFC PATCH net-next 3/5] bridge: allow switchdev port to handle flooding by itself

2018-03-10 Thread Andrew Lunn
On Fri, Mar 09, 2018 at 07:03:06PM -0800, Igor Mitsyanko wrote:
> Introduce BR_FLOOD_OFFLOAD bridge port flag that can be used by
> switchdev-capable hardware to advertize that it wants to handle all
> flooding by itself.
> In that case there is no need for a driver to set skb::offload_fwd_mark
> on each offloaded packet as it is implied by BR_FLOOD_OFFLOAD bridge
> port flag.

Is this sufficiently granular? There are a few different use cases for
flooding:

There is no fdb entry in the software switch for the destination MAC
address, so flood the packet out all ports of the bridge. The hardware
switch might have an entry in its fdb to the destination switch, so it
could unicast out the correct hardware port. If not, it should flood
the packet.

A point to remember here, the software switch and the hardware switch
can have different forwarding data bases.

A broadcast packet. Send it out all ports.

A multicast packet. If the hardware switch is capable of IGMP
snooping, it could have FDB entries indicating which ports it should
send the frame out of, and which is should not. Otherwise it needs to
flood.

Is one flag sufficient for all of these, and any other use cases i
might of missed?

As far as DSA switches go, i don't know of any of them which could
implement anything like this, so BR_FLOOD_OFFLOAD will never be
set. But maybe some of the TOR switches supported by switchdev can do
some of these, and not others

 Andrew


Re: [Bridge] [PATCH net-next 1/5] bridge: initialize port flags with switchdev defaults

2018-03-10 Thread Andrew Lunn
On Fri, Mar 09, 2018 at 07:03:04PM -0800, Igor Mitsyanko wrote:
> Default bridge port flags for switchdev devices can be different from
> what is used in bridging core. Get default value from switchdev itself
> on port initialization.
> 
> Signed-off-by: Igor Mitsyanko 
> ---
>  net/bridge/br_if.c | 17 -
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index 9ba4ed6..d658b55 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -342,6 +342,21 @@ static int find_portno(struct net_bridge *br)
>   return (index >= BR_MAX_PORTS) ? -EXFULL : index;
>  }
>  
> +static unsigned long nbp_flags_get_default(struct net_device *dev)
> +{
> + struct switchdev_attr attr = {
> + .orig_dev = dev,
> + .id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS,
> + };
> + int ret;
> +
> + ret = switchdev_port_attr_get(dev, &attr);
> + if (ret)
> + return BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD;


Hi Igor

Please check if ret == -EOPNOTSUPP and only then use the defaults. A
real error should be propagated, causing new_nbp to fail. You might
also consider what to do when ENODATA is returned.

 Andrew


Re: [Bridge] [PATCH net-next 0/5] Switchdev: flooding offload option

2018-03-10 Thread Andrew Lunn
On Fri, Mar 09, 2018 at 07:03:03PM -0800, Igor Mitsyanko wrote:
> Main goal of the patchset is to allow for flooding offload to switchdev
> in scenarios when HW switchdev ports and SW ports are operating in a
> single bridge.
> 
> In case a data frame that needs to be flooded is ingressed on a SW port,
> it needs to be flooded to a single HW port of any given
> switchdev-capable hardware only. Switchdev hardware will than take care
> about flooding to the rest of the ports that it manages.

Hi Igor

You don't appear to be adding any user of this. Please also make one
of the switchdev drivers actually use this new functionality.

   Andrew


Re: [Bridge] [RFC PATCH net-next 3/5] bridge: allow switchdev port to handle flooding by itself

2018-03-12 Thread Andrew Lunn
> The flag was introduced to enable hardware switch capabilities of
> drivers/net/wireless/quantenna/qtnfmac wifi driver. It does not have any
> switchdev functionality in upstream tree at this moment, and this patchset
> was intended as a preparatory change.

O.K. But i suggest you add basic switchdev support first. Then think
about adding new functionality. That way you can learn more about
switchdev, and we can learn more about your hardware.

> qtnfmac driver provides several physical radios (5 GHz and 2.4 GHz), each
> can have up to 8 virtual network interfaces. These interfaces can be bridged
> together in various configurations, and I'm trying to figure out what is the
> most efficient way to handle it from bridging perspective.

I think the first thing to do is get this part correctly represented
by switchdev. I don't think any of us maintainers have thought about
how wireless and switchdev can be combined. The wifi model seems to be
one phy device, with multiple MACs running on top of it, with each MAC
being a single SSID.  So is it one SSID per virtual interface?  Or are
your virtual network interfaces actually virtual phys in the wireless
model, and you can have multiple MACs on top of each virtual phy?

> My assumption was that software FDB and hardware FDB should always
> be in sync with each other. I guess it is a safe assumption if
> handled correctly?  Hardware should send a notification for each new
> FDB it has learned, and switchdev driver should process FDB
> notifications from software bridge.

No, you cannot make this assumption. Take the example of DSA
switches. They are generally connected over an MDIO bus, or an SPI
bus. The bandwidth is small. How long do you think it takes the
hardware to learn 8K MAC addresses with 5x 1Gbps ports receiving 64
byte packets? DSA drivers have no way of keeping up with the
hardware. And there is no need to. Everything works fine with the SW
and the HW bridge having different dynamic FDB entries.

I don't even think your hardware will have the hardware and software
in sync. How fast can your hardware learn new addresses? 'Line' rate?
Or do you prevent the hardware learning a new address until the
software bridge has confirmed it has learnt the previous new address?

> qtnfmac hardware has its own memory and maintains FWT table, so for the best
> efficiency forwarding between virtual interfaces should be handled locally.
> Qtnfmac can handle all the mentioned flooding by itself:
> - unknown unicasts
> - broadcast and unknown multicast
> - known multicasts (does have IGMP snooping)
> - can do multicast-to-unicast translation if required.
> 
> The most important usecase IMO is a muticast transmission, specific example
> being:
> - 2.4GHz x 8 and 5GHz x 8 virtual wifi interfaces, bridged with backbone
> ethernet interface in Linux
> - multicast video streaming from a server behind ethernet
> - multicast clients connected to some wifi interfaces

I agree this makes sense. But we need to ensure the solution is
generic, not something which just works for your hardware/firmware.  I
know somebody who would love to be able to do something like this with
DSA drivers. They would probably sacrifice IGMP snooping and just
flood everywhere, if that is all the hardware could do. But so far,
i've not been able to figure out a way to do this.

 Andrew


Re: [Bridge] [PATCH 02/12] treewide/net: Rename eth_stp_addr to ether_stp_addr

2018-03-31 Thread Andrew Lunn
On Sat, Mar 31, 2018 at 12:05:17AM -0700, Joe Perches wrote:
> --- a/net/bridge/br_device.c
> +++ b/net/bridge/br_device.c
> @@ -1,4 +1,4 @@
> -/*
> +gg/*

Hi Joe

This does not look good.

 Andrew


Re: [Bridge] [PATCH net-next 2/9] net: bridge: add bitfield for options and convert vlan opts

2018-09-26 Thread Andrew Lunn
Hi Nikolay

>  struct net_bridge {
>   spinlock_t  lock;
>   spinlock_t  hash_lock;
>   struct list_headport_list;
>   struct net_device   *dev;
>   struct pcpu_sw_netstats __percpu *stats;
> + unsigned long   options;

Maybe a u32 would be better, so we run out of bits at the same time on
32 and 64 bit systems?

   Andrew


Re: [Bridge] [PATCH net-next 2/9] net: bridge: add bitfield for options and convert vlan opts

2018-09-26 Thread Andrew Lunn
On Wed, Sep 26, 2018 at 05:55:47PM +0300, Nikolay Aleksandrov wrote:
> On 26/09/18 17:48, Andrew Lunn wrote:
> > Hi Nikolay
> > 
> >>  struct net_bridge {
> >>spinlock_t  lock;
> >>spinlock_t  hash_lock;
> >>struct list_headport_list;
> >>struct net_device   *dev;
> >>struct pcpu_sw_netstats __percpu *stats;
> >> +  unsigned long   options;
> > 
> > Maybe a u32 would be better, so we run out of bits at the same time on
> > 32 and 64 bit systems?
> > 
> >Andrew
> > 
> 
> Bitops operate on an unsigned long, I actually had a BUILD_BUG_ON() for
> 32 bits initially, but checked other places and they seem to be using it
> as-is without any checks so I decided to leave it as well (e.g.
> sock_flags).

O.K.

I assume we get compiler warnings anyway, when we use bit 32 on a
32bit system. The build-bots should let us know.

  Andrew


Re: [Bridge] [PATCH net-next 1/2] net: bridge: add support for user-controlled bool options

2018-11-22 Thread Andrew Lunn
On Thu, Nov 22, 2018 at 06:29:24AM +0200, Nikolay Aleksandrov wrote:
> We have been adding many new bridge options, a big number of which are
> boolean but still take up netlink attribute ids and waste space in the skb.
> Recently we discussed learning from link-local packets[1] and decided
> yet another new boolean option will be needed, thus introducing this API
> to save some bridge nl space.
> The API supports changing the value of multiple boolean options at once
> via the br_boolopt_multi struct which has an optmask (which options to
> set, bit per opt) and optval (options' new values). Future boolean
> options will only be added to the br_boolopt_id enum and then will have
> to be handled in br_boolopt_toggle/get. The API will automatically
> add the ability to change and export them via netlink, sysfs can use the
> single boolopt function versions to do the same. The behaviour with
> failing/succeeding is the same as with normal netlink option changing.
> 
> If an option requires mapping to internal kernel flag or needs special
> configuration to be enabled then it should be handled in
> br_boolopt_toggle. It should also be able to retrieve an option's current
> state via br_boolopt_get.
> 
> [1] https://www.spinics.net/lists/netdev/msg532698.html
> 
> Signed-off-by: Nikolay Aleksandrov 
> ---
>  include/uapi/linux/if_bridge.h | 18 +
>  include/uapi/linux/if_link.h   |  1 +
>  net/bridge/br.c| 68 ++
>  net/bridge/br_netlink.c| 17 -
>  net/bridge/br_private.h|  6 +++
>  net/core/rtnetlink.c   |  2 +-
>  6 files changed, 110 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
> index e41eda3c71f1..6dc02c03bdf8 100644
> --- a/include/uapi/linux/if_bridge.h
> +++ b/include/uapi/linux/if_bridge.h
> @@ -292,4 +292,22 @@ struct br_mcast_stats {
>   __u64 mcast_bytes[BR_MCAST_DIR_SIZE];
>   __u64 mcast_packets[BR_MCAST_DIR_SIZE];
>  };
> +
> +/* bridge boolean options
> + * IMPORTANT: if adding a new option do not forget to handle
> + *it in br_boolopt_toggle/get and bridge sysfs
> + */
> +enum br_boolopt_id {
> + BR_BOOLOPT_MAX
> +};
> +
> +/* struct br_boolopt_multi - change multiple bridge boolean options
> + *
> + * @optval: new option values (bit per option)
> + * @optmask: options to change (bit per option)
> + */
> +struct br_boolopt_multi {
> + __u32 optval;
> + __u32 optmask;
> +};

Hi Nikolay

Thanks for handling this.

How many boolean options do we already have? What it the likelihood a
u32 is going to be too small, in a couple of years time?

I recently went through the pain of converting the u32 for
representing link modes in the phylib API to a linux bitmap.  I'm just
wondering if in the long run, using a linux bitmap right from the
beginning would be better?

> +int br_boolopt_multi_toggle(struct net_bridge *br,
> + struct br_boolopt_multi *bm)
> +{
> + unsigned long bitmap = bm->optmask;
> + int err = 0;
> + int opt_id;
> +
> + for_each_set_bit(opt_id, &bitmap, BR_BOOLOPT_MAX) {
> + bool on = !!(bm->optval & BIT(opt_id));
> +
> + err = br_boolopt_toggle(br, opt_id, on);
> + if (err) {
> + br_debug(br, "boolopt multi-toggle error: option: %d 
> current: %d new: %d error: %d\n",
> +  opt_id, br_boolopt_get(br, opt_id), on, err);

Would it be possible to return that to userspace using the extended
error infrastructure?

  Andrew


Re: [Bridge] [PATCH net-next 1/2] net: bridge: add support for user-controlled bool options

2018-11-22 Thread Andrew Lunn
> +/* br_boolopt_toggle - change user-controlled boolean option
> + *
> + * @br: bridge device
> + * @opt: id of the option to change
> + * @on: new option value
> + *
> + * Changes the value of the respective boolean option to @on taking care of
> + * any internal option value mapping and configuration.
> + */
> +int br_boolopt_toggle(struct net_bridge *br, enum br_boolopt_id opt, bool on)
> +{
> + int err = -ENOENT;
> +
> + switch (opt) {
> + default:
> + break;
> + }
> +
> + return err;
> +}
> +


> +int br_boolopt_multi_toggle(struct net_bridge *br,
> + struct br_boolopt_multi *bm)
> +{
> + unsigned long bitmap = bm->optmask;
> + int err = 0;
> + int opt_id;
> +
> + for_each_set_bit(opt_id, &bitmap, BR_BOOLOPT_MAX) {
> + bool on = !!(bm->optval & BIT(opt_id));
> +
> + err = br_boolopt_toggle(br, opt_id, on);
> + if (err) {
> + br_debug(br, "boolopt multi-toggle error: option: %d 
> current: %d new: %d error: %d\n",
> +  opt_id, br_boolopt_get(br, opt_id), on, err);
> + break;
> + }

An old kernel with a new iproute2 might partially succeed in toggling
some low bits, but then silently ignore a high bit that is not
supported by the kernel. As a user, i want to know the kernel does not
support an option i'm trying to toggle.

Can you walk all the bits in the u32 from the MSB to the LSB? That
should avoid this problem.

Andrew


Re: [Bridge] [PATCH net-next 1/2] net: bridge: add support for user-controlled bool options

2018-11-22 Thread Andrew Lunn
> +void br_boolopt_multi_get(const struct net_bridge *br,
> +   struct br_boolopt_multi *bm)
> +{
> + u32 optval = 0;
> + int opt_id;
> +
> + for (opt_id = 0; opt_id < BR_BOOLOPT_MAX; opt_id++)
> + optval |= (br_boolopt_get(br, opt_id) << opt_id);
> +
> + bm->optval = optval;
> + bm->optmask = 0;

Maybe set optmask to indicate which bits this kernel supports?

  Andrew


Re: [Bridge] [PATCH net-next 2/2] net: bridge: add no_linklocal_learn bool option

2018-11-22 Thread Andrew Lunn
>  int br_boolopt_get(const struct net_bridge *br, enum br_boolopt_id opt)
>  {
> - int optval = 0;
> -
>   switch (opt) {
> + case BR_BOOLOPT_NO_LL_LEARN:
> + return br_opt_get(br, BROPT_NO_LL_LEARN);
>   default:
>   break;
>   }
>  
> - return optval;
> + return 0;
>  }

It seems like 1/2 of that change belongs in the previous patch.

> --- a/net/bridge/br_sysfs_br.c
> +++ b/net/bridge/br_sysfs_br.c
> @@ -328,6 +328,27 @@ static ssize_t flush_store(struct device *d,
>  }
>  static DEVICE_ATTR_WO(flush);
>  
> +static ssize_t no_linklocal_learn_show(struct device *d,
> +struct device_attribute *attr,
> +char *buf)
> +{
> + struct net_bridge *br = to_bridge(d);
> + return sprintf(buf, "%d\n", br_boolopt_get(br, BR_BOOLOPT_NO_LL_LEARN));
> +}
> +
> +static int set_no_linklocal_learn(struct net_bridge *br, unsigned long val)
> +{
> + return br_boolopt_toggle(br, BR_BOOLOPT_NO_LL_LEARN, !!val);
> +}
> +
> +static ssize_t no_linklocal_learn_store(struct device *d,
> + struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + return store_bridge_parm(d, buf, len, set_no_linklocal_learn);
> +}
> +static DEVICE_ATTR_RW(no_linklocal_learn);

I thought we where trying to move away from sysfs? Do we need to add
new options here? It seems like forcing people to use iproute2 for
newer options is a good way to get people to convert to iproute2.

Andrew


Re: [Bridge] [PATCH net-next v2 1/3] net: bridge: add support for user-controlled bool options

2018-11-24 Thread Andrew Lunn
> +int br_boolopt_toggle(struct net_bridge *br, enum br_boolopt_id opt, bool on,
> +   struct netlink_ext_ack *extack)
> +{
> + switch (opt) {
> + default:
> + /* shouldn't be called with unsupported options */
> + WARN_ON(1);
> + break;

So you return 0 here, meaning the br_debug() lower down will not
happen. Maybe return -EOPNOTSUPP?

> + }
> +
> + return 0;
> +}
> +

> +int br_boolopt_multi_toggle(struct net_bridge *br,
> + struct br_boolopt_multi *bm,
> + struct netlink_ext_ack *extack)
> +{
> + unsigned long bitmap = bm->optmask;
> + int err = 0;
> + int opt_id;
> +
> + for_each_set_bit(opt_id, &bitmap, BR_BOOLOPT_MAX) {
> + bool on = !!(bm->optval & BIT(opt_id));
> +
> + err = br_boolopt_toggle(br, opt_id, on, extack);
> + if (err) {
> + br_debug(br, "boolopt multi-toggle error: option: %d 
> current: %d new: %d error: %d\n",
> +  opt_id, br_boolopt_get(br, opt_id), on, err);
> + break;
> + }
> + }

Does the semantics of extack allow you to return something even when
there is no error? If there are bits > BR_BOOLOPT_MAX you could return
0, but also add a warning in extack that some bits where not supported
by this kernel.

> +void br_boolopt_multi_get(const struct net_bridge *br,
> +   struct br_boolopt_multi *bm)
> +{
> + u32 optval = 0;
> + int opt_id;
> +
> + for (opt_id = 0; opt_id < BR_BOOLOPT_MAX; opt_id++)
> + optval |= (br_boolopt_get(br, opt_id) << opt_id);
> +
> + bm->optval = optval;
> + bm->optmask = 0;

You liked the idea of setting optmask to indicate which bits this
kernel supports. Did you change your mind?

   Andrew


Re: [Bridge] [PATCH net-next v2 3/3] net: bridge: export supported boolopts

2018-11-24 Thread Andrew Lunn
On Sat, Nov 24, 2018 at 04:34:22AM +0200, Nikolay Aleksandrov wrote:
> Now that we have at least one bool option, we can export all of the
> supported bool options via optmask when dumping them.
> 
Hi Nik

That answers my question then...

I'm assuming this means there is no easy way to generate a bitmask of
0? So you waited until there was at least one bit.

(1 << 2) - 1 = 3
(1 << 1) - 1 = 1
(1 << 0) - 1 = 0

So does 

GENMASK((BR_BOOLOPT_MAX - 1), 0);

really not do the right thing when BR_BOOLOPT_MAX = 1?

   Andrew


Re: [Bridge] [PATCH net-next v2 1/3] net: bridge: add support for user-controlled bool options

2018-11-24 Thread Andrew Lunn
On Sat, Nov 24, 2018 at 06:18:33PM +0200, niko...@cumulusnetworks.com wrote:
> On 24 November 2018 18:10:41 EET, Andrew Lunn  wrote:
> >> +int br_boolopt_toggle(struct net_bridge *br, enum br_boolopt_id opt,
> >bool on,
> >> +struct netlink_ext_ack *extack)
> >> +{
> >> +  switch (opt) {
> >> +  default:
> >> +  /* shouldn't be called with unsupported options */
> >> +  WARN_ON(1);
> >> +  break;
> >
> >So you return 0 here, meaning the br_debug() lower down will not
> >happen. Maybe return -EOPNOTSUPP?
> >
> 
> No, the idea here is that some option in the future might return an error. 
> This function cannot be called with unsupported option thus the warn. 

O.K, i was trying to make it easier to see which option caused it to
happen.

> >> +  }
> >> +
> >> +  return 0;
> >> +}
> >> +
> >
> >> +int br_boolopt_multi_toggle(struct net_bridge *br,
> >> +  struct br_boolopt_multi *bm,
> >> +  struct netlink_ext_ack *extack)
> >> +{
> >> +  unsigned long bitmap = bm->optmask;
> >> +  int err = 0;
> >> +  int opt_id;
> >> +
> >> +  for_each_set_bit(opt_id, &bitmap, BR_BOOLOPT_MAX) {
> >> +  bool on = !!(bm->optval & BIT(opt_id));
> >> +
> >> +  err = br_boolopt_toggle(br, opt_id, on, extack);
> >> +  if (err) {
> >> +  br_debug(br, "boolopt multi-toggle error: option: %d 
> >> current: %d
> >new: %d error: %d\n",
> >> +   opt_id, br_boolopt_get(br, opt_id), on, err);
> >> +  break;
> >> +  }
> >> +  }
> >
> >Does the semantics of extack allow you to return something even when
> >there is no error? If there are bits > BR_BOOLOPT_MAX you could return
> >0, but also add a warning in extack that some bits where not supported
> >by this kernel.
> 
> If we return 0 there's no reason to check extack. 

Well, the caller can check to see if extack is present, even on
success. This is extack, not extnack after all...

 Andrew


Re: [Bridge] [PATCH net-next v2 3/3] net: bridge: export supported boolopts

2018-11-24 Thread Andrew Lunn
On Sat, Nov 24, 2018 at 04:34:22AM +0200, Nikolay Aleksandrov wrote:
> Now that we have at least one bool option, we can export all of the
> supported bool options via optmask when dumping them.
> 
> v2: new patch
> 
> Signed-off-by: Nikolay Aleksandrov 

Reviewed-by: Andrew Lunn 

Andrew


Re: [Bridge] [PATCH net-next v2 2/3] net: bridge: add no_linklocal_learn bool option

2018-11-24 Thread Andrew Lunn
On Sat, Nov 24, 2018 at 04:34:21AM +0200, Nikolay Aleksandrov wrote:
> Use the new boolopt API to add an option which disables learning from
> link-local packets. The default is kept as before and learning is
> enabled. This is a simple map from a boolopt bit to a bridge private
> flag that is tested before learning.
> 
> v2: pass NULL for extack via sysfs
> 
> Signed-off-by: Nikolay Aleksandrov 

Reviewed-by: Andrew Lunn 

Andrew


Re: [Bridge] [PATCH net-next v2 1/3] net: bridge: add support for user-controlled bool options

2018-11-25 Thread Andrew Lunn
On Sat, Nov 24, 2018 at 04:34:20AM +0200, Nikolay Aleksandrov wrote:
> We have been adding many new bridge options, a big number of which are
> boolean but still take up netlink attribute ids and waste space in the skb.
> Recently we discussed learning from link-local packets[1] and decided
> yet another new boolean option will be needed, thus introducing this API
> to save some bridge nl space.
> The API supports changing the value of multiple boolean options at once
> via the br_boolopt_multi struct which has an optmask (which options to
> set, bit per opt) and optval (options' new values). Future boolean
> options will only be added to the br_boolopt_id enum and then will have
> to be handled in br_boolopt_toggle/get. The API will automatically
> add the ability to change and export them via netlink, sysfs can use the
> single boolopt function versions to do the same. The behaviour with
> failing/succeeding is the same as with normal netlink option changing.
> 
> If an option requires mapping to internal kernel flag or needs special
> configuration to be enabled then it should be handled in
> br_boolopt_toggle. It should also be able to retrieve an option's current
> state via br_boolopt_get.
> 
> v2: WARN_ON() on unsupported option as that shouldn't be possible and
> also will help catch people who add new options without handling
> them for both set and get. Pass down extack so if an option desires
> it could set it on error and be more user-friendly.
> 
> [1] https://www.spinics.net/lists/netdev/msg532698.html
> 
> Signed-off-by: Nikolay Aleksandrov 

Reviewed-by: Andrew Lunn 

Andrew


Re: [Bridge] [PATCH net-next] Documentation: networking: Clarify switchdev devices behavior

2018-12-13 Thread Andrew Lunn
> +VLAN filtering
> +~~
> +
> +The Linux bridge allows the configuration of a VLAN filtering mode (compile 
> and
> +run time) which must be observed by the underlying switchdev network
> +device/hardware:
> +
> +- with VLAN filtering turned off: frames ingressing the device with a VID 
> that
> +  is not programmed into the bridge/switch's VLAN table must be forwarded.
> +
> +- with VLAN filtering turned on: frames ingressing the device with a VID 
> that is
> +  not programmed into the bridges/switch's VLAN table must be dropped.

Hi Florian

i forget the details, but there are some difference between VLAN
filtering being disabled at compile time, and disabled at runtime. I
think the expected behaviour is the same, but the switchdev API usage
is slightly different.

> +- when IGMP snooping is turned on, multicast traffic must be selectively 
> flowing
> +  to the appropriate network ports and not flood the entire switch, that must
> +  include the CPU/management port.

224.0.0.X/32 should always be flooded, IGMP is optional for those
groups in the local subnet.

Andrew


Re: [Bridge] [PATCH net-next v4] Documentation: networking: Clarify switchdev devices behavior

2019-01-11 Thread Andrew Lunn
> > +IGMP snooping
> > +~
> > +
> > +The Linux bridge allows the configuration of IGMP snooping (compile and run
> > +time) which must be observed by the underlying switchdev network 
> > device/hardware
> > +in the following way:
> > +
> > +- when IGMP snooping is turned off, multicast traffic must be flooded to 
> > all
> > +  switch ports within the same broadcast domain. The CPU/management port
> > +  should ideally not be flooded and continue to learn multicast traffic 
> > through
> > +  the network stack notifications. If the hardware is not capable of doing 
> > that
> > +  then the CPU/management port must also be flooded and multicast filtering
> > +  happens in software.
> > +
> > +- when IGMP snooping is turned on, multicast traffic must selectively flow
> > +  to the appropriate network ports (including CPU/management port) and not 
> > flood
> > +  the switch.
> > +
> > +Note: reserved multicast addresses (e.g.: BPDUs) as well as Local Network
> > +Control block (224.0.0.0 - 224.0.0.255) do not require IGMP and should 
> > always
> > +be flooded.
> 
> I'm not sure that these paragraphs are actually needed. You're basically
> describing RFC 4541 on which the IGMP snooping functionality in the
> Linux bridge is based on.

Hi Ido

My experience talking with people is that IGMP snooping is a bit
mystical and not well understood. I would not be surprised if
community driver writers, as opposed to vendor driver writers, don't
actually know how snooping works. So i find having some hints is good.

 Andrew


Re: [Bridge] Validation of forward_delay seems wrong...

2019-07-02 Thread Andrew Lunn
> Hi Andrew,
> The man page is wrong, these have been in USER_HZ scaled clock_t format from 
> the beginning.
> TBH a lot of the time/delay bridge config options are messed up like that.

Hi Nikola

Yes, that is a mess.

arch/alpha/include/asm/param.h:# define USER_HZ 
1024
arch/ia64/include/asm/param.h:# define USER_HZ  
HZ
include/asm-generic/param.h:# define USER_HZ
100

And ia64 does
# define HZ CONFIG_HZ

So it seems pretty hard for user space to get this right in a generic
fashion.

Andrew


[Bridge] Validation of forward_delay seems wrong...

2019-07-02 Thread Andrew Lunn
Hi Nikolay

The man page says that the bridge forward_delay is in units of
seconds, and should be between 2 and 30.

I've tested on a couple of different kernel versions, and this appears
to be not working correctly:

ip link set br0 type bridge forward_delay 2
RTNETLINK answers: Numerical result out of range

ip link set br0 type bridge forward_delay 199
RTNETLINK answers: Numerical result out of range

ip link set br0 type bridge forward_delay 200
# 

ip link set br0 type bridge forward_delay 3000
#

ip link set br0 type bridge forward_delay 3001
RTNETLINK answers: Numerical result out of range

I've not checked what delay is actually being used here, but clearly
something is mixed up.

grep HZ .config 
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
CONFIG_HZ_FIXED=0
CONFIG_HZ_100=y
# CONFIG_HZ_200 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
# CONFIG_HZ_500 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100

Thanks
Andrew


Re: [Bridge] [PATCH] net: bridge: Allow bridge to joing multicast groups

2019-07-26 Thread Andrew Lunn
On Fri, Jul 26, 2019 at 02:02:15PM +0200, Horatiu Vultur wrote:
> Hi Nikolay,
> 
> The 07/26/2019 12:26, Nikolay Aleksandrov wrote:
> > External E-Mail
> > 
> > 
> > On 26/07/2019 11:41, Nikolay Aleksandrov wrote:
> > > On 25/07/2019 17:21, Horatiu Vultur wrote:
> > >> Hi Nikolay,
> > >>
> > >> The 07/25/2019 16:21, Nikolay Aleksandrov wrote:
> > >>> External E-Mail
> > >>>
> > >>>
> > >>> On 25/07/2019 16:06, Nikolay Aleksandrov wrote:
> >  On 25/07/2019 14:44, Horatiu Vultur wrote:
> > > There is no way to configure the bridge, to receive only specific link
> > > layer multicast addresses. From the description of the command 'bridge
> > > fdb append' is supposed to do that, but there was no way to notify the
> > > network driver that the bridge joined a group, because LLADDR was 
> > > added
> > > to the unicast netdev_hw_addr_list.
> > >
> > > Therefore update fdb_add_entry to check if the NLM_F_APPEND flag is 
> > > set
> > > and if the source is NULL, which represent the bridge itself. Then add
> > > address to multicast netdev_hw_addr_list for each bridge interfaces.
> > > And then the .ndo_set_rx_mode function on the driver is called. To 
> > > notify
> > > the driver that the list of multicast mac addresses changed.
> > >
> > > Signed-off-by: Horatiu Vultur 
> > > ---
> > >  net/bridge/br_fdb.c | 49 
> > > ++---
> > >  1 file changed, 46 insertions(+), 3 deletions(-)
> > >
> > 
> >  Hi,
> >  I'm sorry but this patch is wrong on many levels, some notes below. In 
> >  general
> >  NLM_F_APPEND is only used in vxlan, the bridge does not handle that 
> >  flag at all.
> >  FDB is only for *unicast*, nothing is joined and no multicast should 
> >  be used with fdbs.
> >  MDB is used for multicast handling, but both of these are used for 
> >  forwarding.
> >  The reason the static fdbs are added to the filter is for non-promisc 
> >  ports, so they can
> >  receive traffic destined for these FDBs for forwarding.
> >  If you'd like to join any multicast group please use the standard way, 
> >  if you'd like to join
> >  it only on a specific port - join it only on that port (or ports) and 
> >  the bridge and you'll
> > >>>
> > >>> And obviously this is for the case where you're not enabling port 
> > >>> promisc mode (non-default).
> > >>> In general you'll only need to join the group on the bridge to receive 
> > >>> traffic for it
> > >>> or add it as an mdb entry to forward it.
> > >>>
> >  have the effect that you're describing. What do you mean there's no 
> >  way ?
> > >>
> > >> Thanks for the explanation.
> > >> There are few things that are not 100% clear to me and maybe you can
> > >> explain them, not to go totally in the wrong direction. Currently I am
> > >> writing a network driver on which I added switchdev support. Then I was
> > >> looking for a way to configure the network driver to copy link layer
> > >> multicast address to the CPU port.
> > >>
> > >> If I am using bridge mdb I can do it only for IP multicast addreses,
> > >> but how should I do it if I want non IP frames with link layer multicast
> > >> address to be copy to CPU? For example: all frames with multicast
> > >> address '01-21-6C-00-00-01' to be copy to CPU. What is the user space
> > >> command for that?
> > >>
> > > 
> > > Check SIOCADDMULTI (ip maddr from iproute2), f.e. add that mac to the port
> > > which needs to receive it and the bridge will send it up automatically 
> > > since
> > > it's unknown mcast (note that if there's a querier, you'll have to make 
> > > the
> > > bridge mcast router if it is not the querier itself). It would also flood 
> > > it to all
> > 
> > Actually you mentioned non-IP traffic, so the querier stuff is not a 
> > problem. This
> > traffic will always be flooded by the bridge (and also a copy will be 
> > locally sent up).
> > Thus only the flooding may need to be controlled.
> 
> OK, I see, but the part which is not clear to me is, which bridge
> command(from iproute2) to use so the bridge would notify the network
> driver(using swichdev or not) to configure the HW to copy all the frames
> with dmac '01-21-6C-00-00-01' to CPU? So that the bridge can receive
> those frames and then just to pass them up.

Hi Horatiu

Something to keep in mind.

My default, multicast should be flooded, and that includes the CPU
port for a DSA driver. Adding an MDB entry allows for optimisations,
limiting which ports a multicast frame goes out of. But it is just an
optimisation.

Andrew


Re: [PATCH] net: bridge: Allow bridge to joing multicast groups

2019-07-26 Thread Andrew Lunn
> As you properly guessed, this model is quite different from what we are used 
> to.

Yes, it takes a while to get the idea that the hardware is just an
accelerator for what the Linux stack can already do. And if the switch
cannot do some feature, pass the frame to Linux so it can handle it.

You need to keep in mind that there could be other ports in the bridge
than switch ports, and those ports might be interested in the
multicast traffic. Hence the CPU needs to see the traffic. But IGMP
snooping can be used to optimise this. But you still need to be
careful, eg. IPv6 Neighbour discovery has often been broken on
mv88e6xxx because we have been too aggressive with filtering
multicast.

Andrew


Re: [Bridge] [PATCH] net: bridge: Allow bridge to joing multicast groups

2019-07-28 Thread Andrew Lunn
> Trying to get back to the original problem:
> 
> We have a network which implements the ODVA/DLR ring protocol. This protocol
> sends out a beacon frame as often as every 3 us (as far as I recall, default I
> believe is 400 us) to this MAC address: 01:21:6C:00:00:01.
> 
> Try take a quick look at slide 10 in [1].
> 
> If we assume that the SwitchDev driver implemented such that all multicast
> traffic goes to the CPU, then we should really have a way to install a HW
> offload path in the silicon, such that these packets does not go to the CPU 
> (as
> they are known not to be use full, and a frame every 3 us is a significant 
> load
> on small DMA connections and CPU resources).
> 
> If we assume that the SwitchDev driver implemented such that only "needed"
> multicast packets goes to the CPU, then we need a way to get these packets in
> case we want to implement the DLR protocol.
> 
> I'm sure that both models can work, and I do not think that this is the main
> issue here.
> 
> Our initial attempt was to allow install static L2-MAC entries and append
> multiple ports to such an entry in the MAC table. This was rejected, for 
> several
> good reasons it seems. But I'm not sure it was clear what we wanted to 
> achieve,
> and why we find it to be important. Hopefully this is clear with a real world
> use-case.
> 
> Any hints or ideas on what would be a better way to solve this problems will 
> be
> much appreciated.

I always try to think about how this would work if i had a bunch of
discrete network interfaces, not a switch. What APIs are involved in
configuring such a system? How does the Linux network stack perform
software DLR? How is the reception and blocking of the multicast group
performed?

Once you understand how it works in the software implement, it should
then be more obvious which switchdev hooks should be used to
accelerate this using hardware.

   Andrew


Re: [PATCH] net: bridge: Allow bridge to joing multicast groups

2019-07-30 Thread Andrew Lunn
Hi Allan

Just throwing out another idea

The whole offloading story has been you use the hardware to accelerate
what the Linux stack can already do.

In this case, you want to accelerate Device Level Ring, DLR.  But i've
not yet seen a software implementation of DLR. Should we really be
considering first adding DLR to the SW bridge? Make it an alternative
to the STP code? Once we have a generic implementation we can then
look at how it can be accelerated using switchdev.

 Andrew


Re: [PATCH] net: bridge: Allow bridge to joing multicast groups

2019-07-30 Thread Andrew Lunn
> Our plan was to implement this in pure SW, and then look at how to HW offload
> it.

Great.

> But this will take some time before we have anything meaning full to show.
> 
> > Make it an alternative to the STP code?
> I'm still working on learning the details of DLR, but I actually believe that 
> it
> in some situations may co-exists with STP ;-)

The PDF you linked to suggests this as well. But i think you will need
to make some core changes to the bridge. At the moment, STP is a
bridge level property. But you are going to need it to be a per-port
option. You can then use DLR on the ring ports, and optionally STP on
the other ports.

> But what we are looking at here, is to offload a
> non-aware-(DLR|MRP)-switch which happens to be placed in a network
> with these protocols running.

So we need to think about why we are passing traffic to the CPU port,
and under what conditions can it be blocked.

1) The interface is not part of a bridge. In this case, we only need
the switch to pass to the CPU port MC addresses which have been set
via set_rx_mode().

I think this case does not apply for what you want. You have two ports
bridges together as part of the ring.

2) The interface is part of a bridge. There are a few sub-cases

a) IGMP snooping is being performed. We can block multicast where
there is no interest in the group. But this is limited to IP
multicast.

b) IGMP snooping is not being used and all interfaces in the bridge
are ports of the switch. IP Multicast can be blocked to the CPU.

c) IGMP snooping is not being used and there is a non-switch interface
in the bridge. Multicast needed is needed, so it can be flooded out
this port.

d) set_rx_mode() has been called on the br0 interface, indicating
there is interest in the packets on the host. They must be sent to the
CPU so they can be delivered locally.

e) 

Does the Multicast MAC address being used by DLR also map to an IP
mmulticast address? 01:21:6C:00:00:0[123] appear to be the MAC
addresses used by DLR. IPv4 multicast MAC addresses are
01:00:5E:XX:XX:XX. IPv6 multicast MAC addresses are 33:33:XX:XX:XX:XX.

So one possibility here is to teach the SW bridge about non-IP
multicast addresses. Initially the switch should forward all MAC
multicast frames to the CPU. If the frame is not an IPv4 or IPv6
frame, and there has not been a call to set_rx_mode() for the MAC
address on the br0 interface, and the bridge only contains switch
ports, switchdev could be used to block the multicast to the CPU
frame, but forward it out all other ports of the bridge.

  Andrew


Re: [PATCH] net: bridge: Allow bridge to joing multicast groups

2019-07-31 Thread Andrew Lunn
> > 2) The interface is part of a bridge. There are a few sub-cases
> > 
> > a) IGMP snooping is being performed. We can block multicast where
> > there is no interest in the group. But this is limited to IP
> > multicast.
> Agree. And this is done today by installing an explicit offload rule to limit
> the flooding of a specific group.
> 
> > b) IGMP snooping is not being used and all interfaces in the bridge
> > are ports of the switch. IP Multicast can be blocked to the CPU.
> Does it matter if IGMP snooping is used or not? A more general statement could
> be:
> 
> - "All interfaces in the bridge are ports of the switch. IP Multicast can be
>   blocked to the CPU."

We have seen IPv6 neighbour discovery break in conditions like this. I
don't know the exact details.

You also need to watch out for 224.0.0.0/24. This is the link local
multicast range. There is no need to join MC addresses in that
range. It is assumed they will always be received. So even if IGMP is
enabled, you still need to pass some multicast traffic to the CPU.

> > So one possibility here is to teach the SW bridge about non-IP
> > multicast addresses. Initially the switch should forward all MAC
> > multicast frames to the CPU. If the frame is not an IPv4 or IPv6
> > frame, and there has not been a call to set_rx_mode() for the MAC
> > address on the br0 interface, and the bridge only contains switch
> > ports, switchdev could be used to block the multicast to the CPU
> > frame, but forward it out all other ports of the bridge.
> Close, but not exactly (due to the arguments above).
> 
> Here is how I see it:
> 
> Teach the SW bridge about non-IP multicast addresses. Initially the switch
> should forward all MAC multicast frames to the CPU. Today MDB rules can be
> installed (either static or dynamic by IGMP), which limit the flooding of 
> IPv4/6
> multicast streams. In the same way, we should have a way to install a rule
> (FDM/ or MDB) to limit the flooding of L2 multicast frames.
> 
> If foreign interfaces (or br0 it self) is part of the destination list, then
> traffic also needs to go to the CPU.
> 
> By doing this, we can for explicitly configured dst mac address:
> - limit the flooding on the on the SW bridge interfaces
> - limit the flooding on the on the HW bridge interfaces
> - prevent them to go to the CPU if they are not needed

This is all very complex because of all the different corner cases. So
i don't think we want a user API to do the CPU part, we want the
network stack to do it. Otherwise the user is going to get is wrong,
break their network, and then come running to the list for help.

This also fits with how we do things in DSA. There is deliberately no
user space concept for configuring the DSA CPU port. To user space,
the switch is just a bunch of Linux interfaces. Everything to do with
the CPU port is hidden away in the DSA core layer, the DSA drivers,
and a little bit in the bridge.

   Andrew


Re: [PATCH 1/3] net: Add HW_BRIDGE offload feature

2019-08-22 Thread Andrew Lunn
> +/* Determin if the SW bridge can be offloaded to HW. Return true if all
> + * the interfaces of the bridge have the feature NETIF_F_HW_SWITCHDEV set
> + * and have the same netdev_ops.
> + */

Hi Horatiu

Why do you need these restrictions. The HW bridge should be able to
learn that a destination MAC address can be reached via the SW
bridge. The software bridge can then forward it out the correct
interface.

Or are you saying your hardware cannot learn from frames which come
from the CPU?

Andrew


Re: [Bridge] [PATCH 0/3] Add NETIF_F_HW_BRIDGE feature

2019-08-23 Thread Andrew Lunn
> > > Why do the devices have to be from the same driver ?
> After seeing yours and Andrews comments I realize that I try to do two
> things, but I have only explained one of them.
> 
> Here is what I was trying to do:
> A. Prevent ports from going into promisc mode, if it is not needed.

The switch definition is promisc is a bit odd. You really need to
split it into two use cases.

The Linux interface is not a member of a bridge. In this case, promisc
mode would mean all frames ingressing the port should be forwarded to
the CPU. Without promisc, you can program the hardware to just accept
frames with the interfaces MAC address. So this is just the usual
behaviour of an interface.

When the interface is part of the bridge, then you can turn on all the
learning and not forward frames to the CPU, unless the CPU asks for
them. But you need to watch out for various flags. By default, you
should flood to the CPU, unknown destinations to the CPU etc. But some
of these can be turned off by flags.

> B. Prevent adding the CPU to the flood-mask (in Ocelot we have a
> flood-mask controlling who should be included when flooding due to
> destination unknown).

So destination unknown should be flooded to the CPU. The CPU might
know where to send the frame.

> To solve item "B", the network driver needs to detect if there is a
> foreign interfaces added to the bridge. If that is the case then to add
> the CPU port to the flooding mask otherwise no.

It is not just a foreign interface. What about the MAC address on the
bridge interface?

> > > This is too specific targeting some devices.
> Maybe I was wrong to mention specific HW in the commit message. The
> purpose of the patch was to add an optimization (not to copy all the
> frames to the CPU) for HW that is capable of learning and flooding the
> frames.

To some extent, this is also tied to your hardware not learning MAC
addresses from frames passed from the CPU. You should also consider
fixing this. The SW bridge does send out notifications when it
adds/removes MAC addresses to its tables. You probably want to receive
this modifications, and use them to program your hardware to forward
frames to the CPU when needed.

   Andrew


Re: [PATCH v2 0/3] Add NETIF_F_HW_BR_CAP feature

2019-08-26 Thread Andrew Lunn
On Mon, Aug 26, 2019 at 10:11:12AM +0200, Horatiu Vultur wrote:
> When a network port is added to a bridge then the port is added in
> promisc mode. Some HW that has bridge capabilities(can learn, forward,
> flood etc the frames) they are disabling promisc mode in the network
> driver when the port is added to the SW bridge.
> 
> This patch adds the feature NETIF_F_HW_BR_CAP so that the network ports
> that have this feature will not be set in promisc mode when they are
> added to a SW bridge.
> 
> In this way the HW that has bridge capabilities don't need to send all the
> traffic to the CPU and can also implement the promisc mode and toggle it
> using the command 'ip link set dev swp promisc on'

Hi Horatiu

I'm still not convinced this is needed. The model is, the hardware is
there to accelerate what Linux can do in software. Any peculiarities
of the accelerator should be hidden in the driver.  If the accelerator
can do its job without needing promisc mode, do that in the driver.

So you are trying to differentiate between promisc mode because the
interface is a member of a bridge, and promisc mode because some
application, like pcap, has asked for promisc mode.

dev->promiscuity is a counter. So what you can do it look at its
value, and how the interface is being used. If the interface is not a
member of a bridge, and the count > 0, enable promisc mode in the
accelerator. If the interface is a member of a bridge, and the count >
1, enable promisc mode in the accelerator.

   Andrew



Re: [PATCH v2 0/3] Add NETIF_F_HW_BR_CAP feature

2019-08-27 Thread Andrew Lunn
> That sounds like a great idea. I was expecting to add this logic in the
> set_rx_mode function of the driver. But unfortunetly, I got the calls to
> this function before the dev->promiscuity is updated or not to get the
> call at all. For example in case the port is member of a bridge and I try
> to enable the promisc mode.

Hi Horatiu

What about the notifier? Is it called in all the conditions you need
to know about?

Or, you could consider adding a new switchdev call to pass this
information to any switchdev driver which is interested in the
information.

At the moment, the DSA driver core does not pass onto the driver it
should put a port into promisc mode. So pcap etc, will only see
traffic directed to the CPU, not all the traffic ingressing the
interface. If you put the needed core infrastructure into place, we
could plumb it down from the DSA core to DSA drivers.

Having said that, i don't actually know if the Marvell switches
support this. Forward using the ATU and send a copy to the CPU?  What
switches tend to support is port mirroring, sending all the traffic
out another port. A couple of DSA drivers support that, via TC.

Andrew


Re: [Bridge] [RFC net-next Patch 0/3] net: bridge: mrp: Add support for Media Redundancy Protocol(MRP)

2020-01-10 Thread Andrew Lunn
> I think it would help your case if you explained a bit more about
> the hw offload primitives you have implemented internally.

Agreed.

Horatiu, could you also give some references to the frames that need
to be sent. I've no idea what information they need to contain, if the
contents is dynamic, or static, etc.

 Andrew


Re: [Bridge] [RFC net-next Patch 0/3] net: bridge: mrp: Add support for Media Redundancy Protocol(MRP)

2020-01-10 Thread Andrew Lunn
> > Horatiu, could you also give some references to the frames that need
> > to be sent. I've no idea what information they need to contain, if the
> > contents is dynamic, or static, etc.
> It is dynamic - but trivial...

If it is trivial, i don't see why you are so worried about abstracting
it?

> Here is a dump from WireShark with
> annotation on what our HW can update:
> 
> Ethernet II, Src: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1), Dst: Iec_00:00:01 
> (01:15:4e:00:00:01)
> Destination: Iec_00:00:01 (01:15:4e:00:00:01)
> Source: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1)
> Type: MRP (0x88e3)
> PROFINET MRP MRP_Test, MRP_Common, MRP_End
> MRP_Version: 1
> MRP_TLVHeader.Type: MRP_Test (0x02)
> MRP_TLVHeader.Type: MRP_Test (0x02)
> MRP_TLVHeader.Length: 18
> MRP_Prio: 0x1f40 High priorities
> MRP_SA: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1)
> MRP_PortRole: Primary ring port (0x)
> MRP_RingState: Ring closed (0x0001)
> MRP_Transition: 0x0001
> MRP_TimeStamp [ms]: 0x000cf574 <-- Updated 
> automatic
> MRP_TLVHeader.Type: MRP_Common (0x01)
> MRP_TLVHeader.Type: MRP_Common (0x01)
> MRP_TLVHeader.Length: 18
> MRP_SequenceID: 0x00e9 <-- Updated 
> automatic
> MRP_DomainUUID: ----
> MRP_TLVHeader.Type: MRP_End (0x00)
> MRP_TLVHeader.Type: MRP_End (0x00)
> MRP_TLVHeader.Length: 0
> 
> But all the fields can change, but to change the other fields we need to
> interact with the HW. Other SoC may have other capabilities in their
> offload. As an example, if the ring becomes open then the fields
> MRP_RingState and MRP_Transition need to change and in our case this
> requires SW interference.

Isn't SW always required? You need to tell your state machine that the
state has changed.

> Would you like a PCAP file as an example? Or do you want a better
> description of the frame format.

I was hoping for a link to an RFC, or some standards document.

  Andrew


Re: [Bridge] [RFC net-next Patch 0/3] net: bridge: mrp: Add support for Media Redundancy Protocol(MRP)

2020-01-10 Thread Andrew Lunn
On Fri, Jan 10, 2020 at 09:12:48PM +0100, Horatiu Vultur wrote:
> The 01/10/2020 18:56, Andrew Lunn wrote:
> > EXTERNAL EMAIL: Do not click links or open attachments unless you know the 
> > content is safe
> > 
> > > > Horatiu, could you also give some references to the frames that need
> > > > to be sent. I've no idea what information they need to contain, if the
> > > > contents is dynamic, or static, etc.
> > > It is dynamic - but trivial...
> > 
> > If it is trivial, i don't see why you are so worried about abstracting
> > it?
> Maybe we misunderstood each other. When you asked if it is dynamic or
> static, I thought you meant if it is the same frame being send repeated
> or if it needs to be changed. It needs to be changed but the changes are
> trivial, but it means that a non-MRP aware frame generator can't
> properly offload this.

The only frame generator i've ever seen are for generating test
packets. They generally have random content, random length, maybe the
option to send invalid CRC etc. These are never going to work for MRP.
So we should limit our thinking to hardware specifically designed for
MRP offload.

What we need to think about is an abstract model for MRP offload. What
must such a bit of hardware do? What parameters do we need to pass to
it? When should it interrupt us because some event has happened?

Once we have an abstract model, we can define netlink messages, or
devlink messages. And you can implement driver code which takes this
abstract model and implements it for your real hardware. And if you
have the abstract model correct, other vendors should also be able to
implement drivers as well.

Since this is a closed standard, there is not much the rest of us can
do. You need to define this abstract model. We can then review it.

Andrew


Re: [Bridge] [RFC net-next Patch v2 4/4] net: bridge: mrp: switchdev: Add HW offload

2020-01-13 Thread Andrew Lunn
On Mon, Jan 13, 2020 at 01:46:20PM +0100, Horatiu Vultur wrote:
> +#ifdef CONFIG_BRIDGE_MRP
> +/* SWITCHDEV_OBJ_ID_PORT_MRP */
> +struct switchdev_obj_port_mrp {
> + struct switchdev_obj obj;
> + struct net_device *port;
> + u32 ring_nr;
> +};
> +
> +#define SWITCHDEV_OBJ_PORT_MRP(OBJ) \
> + container_of((OBJ), struct switchdev_obj_port_mrp, obj)
> +
> +/* SWITCHDEV_OBJ_ID_RING_TEST_MRP */
> +struct switchdev_obj_ring_test_mrp {
> + struct switchdev_obj obj;
> + /* The value is in us and a value of 0 represents to stop */
> + u32 interval;
> + u8 max;
> + u32 ring_nr;
> +};
> +
> +#define SWITCHDEV_OBJ_RING_TEST_MRP(OBJ) \
> + container_of((OBJ), struct switchdev_obj_ring_test_mrp, obj)
> +
> +/* SWITCHDEV_OBJ_ID_RING_ROLE_MRP */
> +struct switchdev_obj_ring_role_mrp {
> + struct switchdev_obj obj;
> + u8 ring_role;
> + u32 ring_nr;
> +};

Hi Horatiu

The structures above should give me enough information to build this,
correct?

Ethernet II, Src: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1), Dst: Iec_00:00:01 
(01:15:4e:00:00:01)
Destination: Iec_00:00:01 (01:15:4e:00:00:01)
Source: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1)
Type: MRP (0x88e3)
PROFINET MRP MRP_Test, MRP_Common, MRP_End
MRP_Version: 1
MRP_TLVHeader.Type: MRP_Test (0x02)
MRP_TLVHeader.Type: MRP_Test (0x02)
MRP_TLVHeader.Length: 18
MRP_Prio: 0x1f40 High priorities
MRP_SA: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1)
MRP_PortRole: Primary ring port (0x)
MRP_RingState: Ring closed (0x0001)
MRP_Transition: 0x0001
MRP_TimeStamp [ms]: 0x000cf574 <-- Updated automatic
MRP_TLVHeader.Type: MRP_Common (0x01)
MRP_TLVHeader.Type: MRP_Common (0x01)
MRP_TLVHeader.Length: 18
MRP_SequenceID: 0x00e9 <-- Updated automatic
MRP_DomainUUID: ----
MRP_TLVHeader.Type: MRP_End (0x00)
MRP_TLVHeader.Type: MRP_End (0x00)
MRP_TLVHeader.Length: 0

There are a couple of fields i don't see. MRP_SA, MRP_Transition.

What are max and ring_nr used for?

Do you need to set the first value MRP_SequenceID uses? Often, in
order to detect a reset, a random value is used to initialise the
sequence number. Also, does the time stamp need initializing?

Thanks
Andrew


Re: [Bridge] [RFC net-next Patch v2 4/4] net: bridge: mrp: switchdev: Add HW offload

2020-01-13 Thread Andrew Lunn
Hi Horatiu

It has been said a few times what the basic state machine should be in
user space. A pure software solution can use raw sockets to send and
receive MRP_Test test frames. When considering hardware acceleration,
the switchdev API you have proposed here seems quite simple. It should
not be too hard to map it to a set of netlink messages from userspace.

Yet your argument for kernel, not user space, is you are worried about
the parameters which need to be passed to the hardware offload engine.
In order to win the argument for a kernel solution, we are going to
need a better idea what you think this problem is. The MRP_Test is TLV
based. Are there other things which could be in this message? Is that
what you are worried about?

Thanks
 Andrew


Re: [Bridge] [RFC net-next Patch v2 4/4] net: bridge: mrp: switchdev: Add HW offload

2020-01-14 Thread Andrew Lunn
On Tue, Jan 14, 2020 at 09:08:56AM +0100, Horatiu Vultur wrote:
> The 01/14/2020 00:30, Andrew Lunn wrote:
> > 
> > Hi Horatiu
> > 
> > It has been said a few times what the basic state machine should be in
> > user space. A pure software solution can use raw sockets to send and
> > receive MRP_Test test frames. When considering hardware acceleration,
> > the switchdev API you have proposed here seems quite simple. It should
> > not be too hard to map it to a set of netlink messages from userspace.
> 
> Yes and we will try to go with this approach, to have a user space
> application that contains the state machines and then in the kernel to
> extend the netlink messages to map to the switchdev API.
> So we will create a new RFC once we will have the user space and the
> definition of the netlink messages.

Cool.

Before you get too far, we might want to discuss exactly how you pass
these netlink messages. Do we want to make this part of the new
ethtool Netlink implementation? Part of devlink? Extend the current
bridge netlink interface used by userspae RSTP daemons? A new generic
netlink socket?

Extending the bridge netlink interface might seem the most logical.
The argument against it, is that the kernel bridge code probably does
not need to know anything about this offloading. But it does allow you
to make use of the switchdev API, so we have a uniform API between the
network stack and drivers implementing offloading.

  Andrew


Re: [Bridge] [RFC net-next v3 02/10] net: bridge: mrp: Expose function br_mrp_port_open

2020-01-24 Thread Andrew Lunn
On Fri, Jan 24, 2020 at 05:18:20PM +0100, Horatiu Vultur wrote:
> In case the HW is capable to detect when the MRP ring is open or closed. It is
> expected that the network driver will notify the bridge that the ring is open 
> or
> closed.
> 
> The function br_mrp_port_open is used to notify the kernel that one of the 
> ports
> stopped receiving MRP_Test frames. The argument 'loc' has a value of '1' when
> the port stopped receiving MRP_Test and '0' when it started to receive 
> MRP_Test.

Hi Horatiu

Given the name of the function, br_mrp_port_open(), how about replacing
loc with a bool with the name open?

Andrew


Re: [Bridge] [RFC net-next v3 03/10] net: bridge: mrp: Add MRP interface used by netlink

2020-01-24 Thread Andrew Lunn
> br_mrp_flush - will flush the FDB.

How does this differ from a normal bridge flush? I assume there is a
way for user space to flush the bridge FDB.

Andrew


Re: [Bridge] [RFC net-next v3 04/10] net: bridge: mrp: Add generic netlink interface to configure MRP

2020-01-25 Thread Andrew Lunn
On Fri, Jan 24, 2020 at 05:18:22PM +0100, Horatiu Vultur wrote:
> Implement the generic netlink interface to configure MRP. The implementation
> will do sanity checks over the attributes and then eventually call the MRP
> interface which eventually will call the switchdev API.

Hi Horatiu

What was your thinking between adding a new generic netlink interface,
and extending the current one?

I've not looked at your user space code yet, but i assume it has to
make use of both? It needs to create the bridge and add the
interfaces. And then it needs to control the MRP state.

Allan mentioned you might get around to implementing 802.1CB? Would
that be another generic netlink interface, or would you extend the MRP
interface?

Thanks
Andrew  


Re: [Bridge] [RFC net-next v3 09/10] net: bridge: mrp: Integrate MRP into the bridge

2020-01-25 Thread Andrew Lunn
On Fri, Jan 24, 2020 at 05:18:27PM +0100, Horatiu Vultur wrote:
> To integrate MRP into the bridge, the bridge needs to do the following:
> - initialized and destroy the generic netlink used by MRP
> - detect if the MRP frame was received on a port that is part of a MRP ring. 
> In
>   case it was not, then forward the frame as usual, otherwise redirect the 
> frame
>   to the upper layer.
> 
> Signed-off-by: Horatiu Vultur 
> ---
>  net/bridge/br.c | 11 +++
>  net/bridge/br_device.c  |  3 +++
>  net/bridge/br_if.c  |  6 ++
>  net/bridge/br_input.c   | 14 ++
>  net/bridge/br_private.h | 14 ++
>  5 files changed, 48 insertions(+)
> 
> diff --git a/net/bridge/br.c b/net/bridge/br.c
> index b6fe30e3768f..d5e556eed4ba 100644
> --- a/net/bridge/br.c
> +++ b/net/bridge/br.c
> @@ -344,6 +344,12 @@ static int __init br_init(void)
>   if (err)
>   goto err_out5;
>  
> +#ifdef CONFIG_BRIDGE_MRP
> + err = br_mrp_netlink_init();
> + if (err)
> + goto err_out6;
> +#endif

Please try to avoid #ifdef's like this in C code. Add a stub function
to br_private_mrp.h.

If you really cannot avoid #ifdef, please use #if IS_ENABLED(CONFIG_BRIDGE_MRP).
That expands to

if (0) {

}

So the compiler will compile it and then optimize it out. That gives
us added benefit of build testing, we don't suddenly find the code no
longer compiles when we enable the option.

> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> @@ -21,6 +21,9 @@
>  #include 
>  #include "br_private.h"
>  #include "br_private_tunnel.h"
> +#ifdef CONFIG_BRIDGE_MRP
> +#include "br_private_mrp.h"
> +#endif

It should always be safe to include a header file.

   Andrew


Re: [Bridge] [RFC net-next v3 03/10] net: bridge: mrp: Add MRP interface used by netlink

2020-01-25 Thread Andrew Lunn
On Sat, Jan 25, 2020 at 12:37:26PM +0100, Horatiu Vultur wrote:
> The 01/24/2020 18:43, Andrew Lunn wrote:
> > EXTERNAL EMAIL: Do not click links or open attachments unless you know the 
> > content is safe
> > 
> > > br_mrp_flush - will flush the FDB.
> > 
> > How does this differ from a normal bridge flush? I assume there is a
> > way for user space to flush the bridge FDB.
> 
> Hi,
> 
> If I seen corectly the normal bridge flush will clear the entire FDB for
> all the ports of the bridge. In this case it is require to clear FDB
> entries only for the ring ports.

Maybe it would be better to extend the current bridge netlink call to
be able to pass an optional interface to be flushed?  I'm not sure it
is a good idea to have two APIs doing very similar things.

   Andrew


Re: [Bridge] [RFC net-next v3 09/10] net: bridge: mrp: Integrate MRP into the bridge

2020-01-25 Thread Andrew Lunn
>  br_netif_receive_skb(struct net *net, struct sock *sk, struct sk_buff *skb)
> @@ -338,6 +341,17 @@ rx_handler_result_t br_handle_frame(struct sk_buff 
> **pskb)
>   return RX_HANDLER_CONSUMED;
>   }
>   }
> +#ifdef CONFIG_BRIDGE_MRP
> + /* If there is no MRP instance do normal forwarding */
> + if (!p->mrp_aware)
> + goto forward;
> +
> + if (skb->protocol == htons(ETH_P_MRP))
> + return RX_HANDLER_PASS;

What MAC address is used for these MRP frames? It would make sense to
use a L2 link local destination address, since i assume they are not
supposed to be forwarded by the bridge. If so, you could extend the
if (unlikely(is_link_local_ether_addr(dest))) condition.

> +
> + if (p->state == BR_STATE_BLOCKING)
> + goto drop;
> +#endif

Is this needed? The next block of code is a switch statement on
p->state. The default case, which BR_STATE_BLOCKING should hit, is
drop.

This function is on the hot path. So we should try to optimize it as
much as possible.

 Andrew


Re: [Bridge] [RFC net-next v3 00/10] net: bridge: mrp: Add support for Media Redundancy Protocol (MRP)

2020-01-25 Thread Andrew Lunn
> Lets say that the link between H1 and H2 goes down:
> 
> +--+
> |  |
> +-->|H1|<---  / --->|H2|<-->|H3|<--+
> eth0eth1eth0eth1eth0eth1
> 
> H1 will now observe that the test packets it sends on eth1, is not
> received in eth0, meaninf that the ring is open

Hi Allan

Is H1 the only device sending test packets? It is assumed that H2 and
H3 will forward them? Or does each device send test packets, and when
it stops hearing these packets from a neighbour, it assumes the link
is open?

   Andrew


Re: [Bridge] [RFC net-next v3 06/10] net: bridge: mrp: switchdev: Extend switchdev API to offload MRP

2020-01-25 Thread Andrew Lunn
> SWITCHDEV_OBJ_ID_RING_TEST_MRP: This is used when to start/stop sending
>   MRP_Test frames on the mrp ring ports. This is called only on nodes that 
> have
>   the role Media Redundancy Manager.

How do you handle the 'headless chicken' scenario? User space tells
the port to start sending MRP_Test frames. It then dies. The hardware
continues sending these messages, and the neighbours thinks everything
is O.K, but in reality the state machine is dead, and when the ring
breaks, the daemon is not there to fix it?

And it is not just the daemon that could die. The kernel could opps or
deadlock, etc.

For a robust design, it seems like SWITCHDEV_OBJ_ID_RING_TEST_MRP
should mean: start sending MRP_Test frames for the next X seconds, and
then stop. And the request is repeated every X-1 seconds.

 Andrew


Re: [Bridge] [RFC net-next v3 03/10] net: bridge: mrp: Add MRP interface used by netlink

2020-01-26 Thread Andrew Lunn
> We could do also that. The main reason why I have added a new generic
> netlink was that I thought it would be clearer what commands are for MRP
> configuration.

The naming makes this clear, having _MRP_ in the attribute names etc.

But it would be good have input from the Bridge maintainers.

Andrew


Re: [Bridge] [RFC net-next v3 06/10] net: bridge: mrp: switchdev: Extend switchdev API to offload MRP

2020-01-26 Thread Andrew Lunn
On Sun, Jan 26, 2020 at 02:22:13PM +0100, Horatiu Vultur wrote:
> The 01/25/2020 17:35, Andrew Lunn wrote:
> > EXTERNAL EMAIL: Do not click links or open attachments unless you know the 
> > content is safe
> > 
> > > SWITCHDEV_OBJ_ID_RING_TEST_MRP: This is used when to start/stop sending
> > >   MRP_Test frames on the mrp ring ports. This is called only on nodes 
> > > that have
> > >   the role Media Redundancy Manager.
> > 
> > How do you handle the 'headless chicken' scenario? User space tells
> > the port to start sending MRP_Test frames. It then dies. The hardware
> > continues sending these messages, and the neighbours thinks everything
> > is O.K, but in reality the state machine is dead, and when the ring
> > breaks, the daemon is not there to fix it?
> > 
> > And it is not just the daemon that could die. The kernel could opps or
> > deadlock, etc.
> > 
> > For a robust design, it seems like SWITCHDEV_OBJ_ID_RING_TEST_MRP
> > should mean: start sending MRP_Test frames for the next X seconds, and
> > then stop. And the request is repeated every X-1 seconds.
> 
> I totally missed this case, I will update this as you suggest.

Hi Horatiu

What does your hardware actually provide?

Given the design of the protocol, if the hardware decides the OS etc
is dead, it should stop sending MRP_TEST frames and unblock the ports.
If then becomes a 'dumb switch', and for a short time there will be a
broadcast storm. Hopefully one of the other nodes will then take over
the role and block a port.

Andrew


Re: [Bridge] [RFC net-next v3 09/10] net: bridge: mrp: Integrate MRP into the bridge

2020-01-26 Thread Andrew Lunn
On Sun, Jan 26, 2020 at 02:01:11PM +0100, Horatiu Vultur wrote:
> The 01/25/2020 17:16, Andrew Lunn wrote:
> > EXTERNAL EMAIL: Do not click links or open attachments unless you know the 
> > content is safe
> > 
> > >  br_netif_receive_skb(struct net *net, struct sock *sk, struct sk_buff 
> > > *skb)
> > > @@ -338,6 +341,17 @@ rx_handler_result_t br_handle_frame(struct sk_buff 
> > > **pskb)
> > >   return RX_HANDLER_CONSUMED;
> > >   }
> > >   }
> > > +#ifdef CONFIG_BRIDGE_MRP
> > > + /* If there is no MRP instance do normal forwarding */
> > > + if (!p->mrp_aware)
> > > + goto forward;
> > > +
> > > + if (skb->protocol == htons(ETH_P_MRP))
> > > + return RX_HANDLER_PASS;
> > 
> > What MAC address is used for these MRP frames? It would make sense to
> > use a L2 link local destination address, since i assume they are not
> > supposed to be forwarded by the bridge. If so, you could extend the
> > if (unlikely(is_link_local_ether_addr(dest))) condition.
> 
> The MAC addresses used by MRP frames are:
> 0x1, 0x15, 0x4e, 0x0, 0x0, 0x1 - used by MRP_Test frames
> 0x1, 0x15, 0x4e, 0x0, 0x0, 0x2 - used by the rest of MRP frames.
> 
> If we will add support also for MIM/MIC. These requires 2 more MAC
> addresses:
> 0x1, 0x15, 0x4e, 0x0, 0x0, 0x3 - used by MRP_InTest frames.
> 0x1, 0x15, 0x4e, 0x0, 0x0, 0x4 - used by the other MRP interconnect
> frames.

Hi Horatiu

I made the wrong guess about how this protocol worked when i said L2
link local. These MAC addresses are L2 multicast.

And you are using a raw socket to receive them into userspace when
needed.

'Thinking allowed' here.

+--+
|  |
+-->|H1|<-->|H2|<-->|H3|<--+
eth0eth1eth0eth1eth0eth1
 ^
 |
  Blocked


There are three major classes of user case here:

1) Pure software solution

You need the software bridge in the client to forward these frames
from the left side to the right side. (Does the standard give these
two ports names)? In the master, the left port is blocked, so the
bridge drops them anyway. You have a RAW socket open on both eth0 and
eth1, so you get to see the frames, even if the bridge drops them.

2) Hardware offload to an MRP unaware switch.

I'm thinking about a plain switch supported by DSA, Marvell, Broadcom,
etc. It has no special knowledge of MRP.

Ideally, you want the switch to forward MRP_Test frames left to right
for a client. In a master, i think you have a problem, since the port
is blocked. The hardware is unlikely to recognise these frames as
special, since they are not in the 01-80-C2-XX-XX-XX block, and let
them through. So your raw socket is never going to see them, and you
cannot detect open/closed ring.

I don't know how realistic it is to support MRP in this case, and i
also don't think you can fall back to a pure software solution,
because the software bridge is going to offload the basic bridge
operation to the hardware. It would be nice if you could detect this,
and return -EOPNOTSUPP.

3) Hardware offload to an MRP aware switch.

For a client, you tell it which port is left, which is right, and
assume it forwards the frames. For a master, you again tell it which
is left, which is right, and ask it send MRP_Test frames out right,
and report changes in open/closed on the right port. You don't need
the CPU to see the MRP_Test frames, so the switch has no need to
forward them to the CPU.

We should think about the general case of a bridge with many ports,
and many pairs of ports using MRP. This makes the forwarding of these
frames interesting. Given that they are multicast, the default action
of the software bridge is that it will flood them. Does the protocol
handle seeing MRP_Test from some other loop? Do we need to avoid this?
You could avoid this by adding MDB entries to the bridge. However,
this does not scale to more then one ring. I don't think an MDB is
associated to an ingress port. So you cannot say 

0x1, 0x15, 0x4e, 0x0, 0x0, 0x1 ingress port1 egress port2
0x1, 0x15, 0x4e, 0x0, 0x0, 0x1 ingress port3 egress port4

The best you can say is

0x1, 0x15, 0x4e, 0x0, 0x0, 0x1 egress port2, port4

I'm sure there are other issues i'm missing, but it is interesting to
think about all this.

Andrew


Re: [Bridge] [RFC net-next v3 09/10] net: bridge: mrp: Integrate MRP into the bridge

2020-01-27 Thread Andrew Lunn
> > 'Thinking allowed' here.
> > 
> > +--+
> > |  |
> > +-->|H1|<-->|H2|<-->|H3|<--+
> > eth0eth1eth0eth1eth0eth1
> >  ^
> >  |
> >   Blocked
> > 
> > 
> > There are three major classes of user case here:
> > 
> > 1) Pure software solution
> > You need the software bridge in the client to forward these frames
> > from the left side to the right side.

> As far as I understand it is not the bridge which forward these frames -
> it is the user-space tool. This was to put as much functionality in
> user-space and only use the kernel to configure the HW. We can (and
> should) discuss if this is the right decision.

So i need to flip the point around. How does the software switch know
not to forward the frames? Are you adding an MDB?

> We would properly have better performance if we do this in kernel-space.

Yes, that is what i think. And if you can do it without any additional
code, using the forwarding tables, so much the better.

> BTW: It is not only from left to right, it is also from right to left.
> The MRM will inject packets on both ring ports, and monitor both.

Using the same MAC address in both directions? I need to think what
that implies for MDB entries. It probably just works, since you never
flood back out the ingress port.

> Again, I do not know how other HW is designed, but all the SOC's we are
> working with, does allow us to add a TCAM rule which can redirect these
> frames to the CPU even on a blocked port.

It is not in scope for what you are doing, but i wonder how we
describe this in a generic Linux way? And then how we push it down to
the hardware?

For the Marvell Switches, it might be possible to do this without the
TCAM. You can add forwarding DB entries marked as Management. It is
unclear if this overrides the blocked state, but it would be a bit odd
if it did not.

> > You could avoid this by adding MDB entries to the bridge. However,
> > this does not scale to more then one ring.
> I would prefer a solution where the individual drivers can do what is
> best on the given HW.

The nice thing about adding MDB is that it is making use of the
software bridge facilities. In general, the software bridge and
hardware bridges are pretty similar. If you can solve the problem
using generic software bridge features, not additional special cases
in code, you have good chance of being able to offload it to a
hardware bridge which is not MRP aware. The switchdev API for MRP
specific features should then allow you to make use of any additional
features the hardware might have.

> Yes, the solution Horatiu has chosen, is not to forward MRP frames,
> received in MRP ring ports at all. This is done by the user-space tool.
> 
> Again, not sure if this is the right way to do it, but it is what patch
> v3 does.
> 
> The alternative to this would be to learn the bridge how to forward MRP
> frames when it is a MRC. The user-space tool then never needs to do
> this, it know that the kernel will take care of this part (either in SW
> or in HW).

I think that should be considered. I'm not saying it is the best way,
just that some thought should be put into it to figure out what it
actually implies.

 Andrew


Re: [Bridge] [RFC net-next v3 06/10] net: bridge: mrp: switchdev: Extend switchdev API to offload MRP

2020-01-27 Thread Andrew Lunn
On Mon, Jan 27, 2020 at 03:26:38PM +0100, Jürgen Lambrecht wrote:
> On 1/27/20 12:04 PM, Allan W. Nielsen wrote:
> 
> > How do you handle the 'headless chicken' scenario? User space
> tells
> > the port to start sending MRP_Test frames. It then dies. The
> hardware
> 
> Andrew, I am a bit confused here - maybe I missed an email-thread, I'm sorry
> then.
> 
> In previous emails you and others talked about hardware support to send 
> packets
> (inside the switch). But somebody also talked about data-plane and
> control-plane (about STP in-kernel being a bad idea), and that data-plane is
> in-kernel, and control plane is a mrp-daemon (in user space).
> And in my mind, the "hardware" you mention is a frame-injector and can be both
> real hardware and a driver in the kernel.
> 
> Do I see it right?

Hi Jürgen

It i still unclear where the MRP_Test frames should be generated,
forward and consumed, either in kernel, or in user space.

The userspace RSTP daemon generates and consumes all the BPDUs in
userspace. But BPDUs are never forwarded. However MRP_Test frames are
forwarded by client nodes. Are the MRP_Test frames then part of the
data plane in a client?

What i think is clear is that the state machine is in user space.

For the rest, we are still exploring possibilities.

Andrew


Re: [Bridge] [RFC PATCH net-next] net: bridge: fix client roaming from DSA user port

2020-04-19 Thread Andrew Lunn
On Mon, Apr 20, 2020 at 12:19:46AM +0800, DENG Qingfang wrote:
> When a client roams from a DSA user port to a soft-bridged port (such as WiFi
> interface), the left-over MAC entry in the switch HW is not deleted, causing
> inconsistency between Linux fdb and the switch MAC table. As a result, the
> client cannot talk to other hosts which are on that DSA user port until the
> MAC entry expires.
> 
> Solve this by notifying switchdev fdb to delete the leftover entry when an
> entry is updated. Remove the added_by_user check in DSA
> 
> Signed-off-by: DENG Qingfang 
> ---
> I tried this on mt7530 and mv88e6xxx, but only mt7530 works.
> In previous discussion[1], Andrew Lunn said "try playing with auto learning
> for the CPU port" but it didn't work on mv88e6xxx either

Hi Deng

We should probably first define how we expect moving MAC to work. Then
we can make any core fixes, and driver fixes.

For DSA, we have assumed that the software bridge and the hardware
bridge are independent, each performs its own learning. Only static
entries are kept in sync.

How should this separate learning work for a MAC address which moves?

Andrew


Re: [Bridge] [RFC PATCH net-next] net: bridge: fix client roaming from DSA user port

2020-04-20 Thread Andrew Lunn
> When a client moves from a hardware port (e.g. sw0p1) to a software port 
> (wlan0)
> or another hardware port that belongs to a different switch (sw1p1),
> that MAC entry
> in sw0's MAC table should be deleted, or replaced with the CPU port as
> destination,
> by DSA. Otherwise the client is unable to talk to other hosts on sw0 because 
> sw0
> still thinks the client is on sw0p1.

The MAC address needs to move, no argument there. But what are the
mechanisms which cause this. Is learning sufficient, or does DSA need
to take an active role?

Forget about DSA for the moment. How does this work for two normal
bridges? Is the flow of unicast packets sufficient? Does it requires a
broadcast packet from the device after it moves?

  Andrew


Re: [Bridge] [RFC PATCH net-next] net: bridge: fix client roaming from DSA user port

2020-04-23 Thread Andrew Lunn
On Wed, Apr 22, 2020 at 02:01:28PM +0800, Chuanhong Guo wrote:
> Hi!
> 
> On Tue, Apr 21, 2020 at 12:36 AM Andrew Lunn  wrote:
> >
> > The MAC address needs to move, no argument there. But what are the
> > mechanisms which cause this. Is learning sufficient, or does DSA need
> > to take an active role?
> 
> cpu port learning will break switch operation if for whatever reason
> we want to disable bridge offloading (e.g. ebtables?). In this case
> a packet received from cpu port need to be sent back through
> cpu port to another switch port, and the switch will learn from this
> packet incorrectly.
> 
> If we want cpu port learning to kick in, we need to make sure that:
> 1. When bridge offload is enabled, the switch takes care of packet
> flooding on switch ports itself, instead of flooding with software
> bridge.

Hi Chuanhong

This is what the skb->offload_fwd_mark is all about. If this is set to
1, it means the switch has done all the forwarding needed for ports in
that switch. Most of the tag drivers set this unconditionally true.

> 2. Software bridge shouldn't forward any packet between ports
> on the same switch.

If skb->offload_fwd_mark is true, it won't.

> 3. cpu port learning should only be enabled when bridge
> offloading is used.

So it should be safe for most switch drivers. And the ones which don't
set offload_fwd_mark are probably relying of software bridging, or are
broken and replicating frames.

> It doesn't have to be a broadcast packet but it needs a packet to go
> through both bridges.
> 
> Say we have bridge A and bridge B, port A1 and B1 are connected
> together and a device is on port A2 first:
> Bridge A knows that this device is on port A2 and will forward traffic
> through A1 to B1 if needed. Bridge B sees these packets and knows
> device is on port B1.
> When the device move from A2 to B2, B updates its fdb and if a
> packet reaches A, A will update its fdb too.

The issue here is 'if a packet reaches A'. B might have no reason to
send a unicast packet to A, if none of the destinations the device is
talking to is reached via A. Which is why i think a
broadcast/multicast packet is more likely to be involved.

Andrew


Re: [Bridge] [PATCH v3 net-next 1/7] net: bridge: notify switchdev of disappearance of old FDB entry upon migration

2020-12-16 Thread Andrew Lunn
On Sun, Dec 13, 2020 at 04:07:04PM +0200, Vladimir Oltean wrote:
> Currently the bridge emits atomic switchdev notifications for
> dynamically learnt FDB entries. Monitoring these notifications works
> wonders for switchdev drivers that want to keep their hardware FDB in
> sync with the bridge's FDB.

Reviewed-by: Andrew Lunn 

Andrew


Re: [Bridge] [PATCH v3 net-next 2/7] net: dsa: be louder when a non-legacy FDB operation fails

2020-12-16 Thread Andrew Lunn
On Sun, Dec 13, 2020 at 04:07:05PM +0200, Vladimir Oltean wrote:
> The dev_close() call was added in commit c9eb3e0f8701 ("net: dsa: Add
> support for learning FDB through notification") "to indicate inconsistent
> situation" when we could not delete an FDB entry from the port.
> 
> bridge fdb del d8:58:d7:00:ca:6d dev swp0 self master
> 
> It is a bit drastic and at the same time not helpful if the above fails
> to only print with netdev_dbg log level, but on the other hand to bring
> the interface down.
> 
> So increase the verbosity of the error message, and drop dev_close().
> 
> Signed-off-by: Vladimir Oltean 

Reviewed-by: Andrew Lunn 

Andrew


Re: [Bridge] [PATCH v3 net-next 5/7] net: dsa: exit early in dsa_slave_switchdev_event if we can't program the FDB

2020-12-16 Thread Andrew Lunn
On Sun, Dec 13, 2020 at 04:07:08PM +0200, Vladimir Oltean wrote:
> Right now, the following would happen for a switch driver that does not
> implement .port_fdb_add or .port_fdb_del.
> 
> dsa_slave_switchdev_event returns NOTIFY_OK and schedules:
> -> dsa_slave_switchdev_event_work
>-> dsa_port_fdb_add
>   -> dsa_port_notify(DSA_NOTIFIER_FDB_ADD)
>  -> dsa_switch_fdb_add
> -> if (!ds->ops->port_fdb_add) return -EOPNOTSUPP;
>-> an error is printed with dev_dbg, and
>   dsa_fdb_offload_notify(switchdev_work) is not called.
> 
> We can avoid scheduling the worker for nothing and say NOTIFY_DONE.
> Because we don't call dsa_fdb_offload_notify, the static FDB entry will
> remain just in the software bridge.
> 
> Signed-off-by: Vladimir Oltean 
> Reviewed-by: Florian Fainelli 

Reviewed-by: Andrew Lunn 

Andrew


Re: [Bridge] [PATCH v3 net-next 6/7] net: dsa: listen for SWITCHDEV_{FDB, DEL}_ADD_TO_DEVICE on foreign bridge neighbors

2020-12-16 Thread Andrew Lunn
On Sun, Dec 13, 2020 at 04:07:09PM +0200, Vladimir Oltean wrote:
> Some DSA switches (and not only) cannot learn source MAC addresses from
> packets injected from the CPU. They only perform hardware address
> learning from inbound traffic.

Reviewed-by: Andrew Lunn 

Andrew


Re: [Bridge] [PATCH v3 net-next 7/7] net: dsa: ocelot: request DSA to fix up lack of address learning on CPU port

2020-12-16 Thread Andrew Lunn
On Sun, Dec 13, 2020 at 04:07:10PM +0200, Vladimir Oltean wrote:
> Given the following setup:
> 
> ip link add br0 type bridge
> ip link set eno0 master br0
> ip link set swp0 master br0
> ip link set swp1 master br0
> ip link set swp2 master br0
> ip link set swp3 master br0
> 
> Currently, packets received on a DSA slave interface (such as swp0)
> which should be routed by the software bridge towards a non-switch port
> (such as eno0) are also flooded towards the other switch ports (swp1,
> swp2, swp3) because the destination is unknown to the hardware switch.
> 
> This patch addresses the issue by monitoring the addresses learnt by the
> software bridge on eno0, and adding/deleting them as static FDB entries
> on the CPU port accordingly.
> 
> Signed-off-by: Vladimir Oltean 
> Reviewed-by: Florian Fainelli 

Reviewed-by: Andrew Lunn 

Andrew


Re: [Bridge] [PATCH net] net: mrp: use stp state as substitute for unimplemented mrp state

2021-01-19 Thread Andrew Lunn
On Tue, Jan 19, 2021 at 09:32:40AM +0100, Horatiu Vultur wrote:
> The 01/18/2021 21:27, Vladimir Oltean wrote:
> > EXTERNAL EMAIL: Do not click links or open attachments unless you know the 
> > content is safe
> > 
> > On Mon, Jan 18, 2021 at 09:20:36PM +0100, Horatiu Vultur wrote:
> > > The 01/18/2021 19:46, Vladimir Oltean wrote:
> > > >
> > > > On Mon, Jan 18, 2021 at 07:56:18PM +0100, Horatiu Vultur wrote:
> > > > > The reason was to stay away from STP, because you can't run these two
> > > > > protocols at the same time. Even though in SW, we reuse port's state.
> > > > > In our driver(which is not upstreamed), we currently implement
> > > > > SWITCHDEV_ATTR_ID_MRP_PORT_STATE and just call the
> > > > > SWITCHDEV_ATTR_ID_PORT_STP_STATE.
> > > >
> > > > And isn't Rasmus's approach reasonable, in that it allows unmodified
> > > > switchdev drivers to offload MRP port states without creating
> > > > unnecessary code churn?
> > >
> > > I am sorry but I don't see this as the correct solution. In my opinion,
> > > I would prefer to have 3 extra lines in the driver and have a better
> > > view of what is happening. Than having 2 calls in the driver for
> > > different protocols.
> > 
> > I think the question boils down to: is a MRP-unaware driver expected to
> > work with the current bridge MRP code?
> 
> If the driver has switchdev support, then is not expected to work with
> the current bridge MRP code.

> 
> For example, the Ocelot driver, it has switchdev support but no MRP
> support so this is not expected to work.

Then ideally, we need switchdev core to be testing for the needed ops
and returning an error which prevents MRP being configured when it
cannot work.

   Andrew


Re: [Bridge] [RFC net-next 6/9] net: dsa: Forward offloading

2021-05-04 Thread Andrew Lunn
> There is really no need to recompute the static parts of the tags on
> each skb. It would mean moving some knowledge of the tagging format to
> the driver. But that boundary is pretty artificial for
> mv88e6xxx. tag_dsa has no use outside of mv88e6xxx, and mv88e6xxx does
> not work with any other tagger. I suppose you could even move the whole
> tagger to drivers/net/dsa/mv88e6xxx/?
> 
> What do you think?
> 
> Andrew?

We have resisted this before.

What information do you actually need to share between the tagger and
the driver? Both tag_lan9303.c and tag_ocelot_8021q.c do reference
their switch driver data structures, so some sharing is allowed. But
please try to keep the surface areas down.

   Andrew


Re: [Bridge] [PATCH v6 net-next 5/7] net: bridge: switchdev: let drivers inform which bridge ports are offloaded

2021-07-26 Thread Andrew Lunn
On Mon, Jul 26, 2021 at 07:21:20PM +0530, Naresh Kamboju wrote:
> On Wed, 21 Jul 2021 at 21:56, Vladimir Oltean  wrote:
> >
> > On reception of an skb, the bridge checks if it was marked as 'already
> > forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it
> > is, it assigns the source hardware domain of that skb based on the
> > hardware domain of the ingress port. Then during forwarding, it enforces
> > that the egress port must have a different hardware domain than the
> > ingress one (this is done in nbp_switchdev_allowed_egress).

> [Please ignore if it is already reported]
> 
> Following build error noticed on Linux next 20210723 tag
> with omap2plus_defconfig on arm architecture.

Hi Naresh

Please trim emails when replying. It is really annoying to have to
page down and down and down to find your part in the email, and you
always wonder if you accidentally jumped over something when paging
down at speed.

 Andrew


Re: [Bridge] [PATCH net-next 1/4] net: bridge: Add support for bridge port in locked mode

2022-02-07 Thread Andrew Lunn
> > +   if (p->flags & BR_PORT_LOCKED) {
> > +   fdb_entry = br_fdb_find_rcu(br, eth_hdr(skb)->h_source, vid);
> > +   if (!(fdb_entry && fdb_entry->dst == p))
> > +   goto drop;
> 
> I'm not familiar with 802.1X so I have some questions:

Me neither.

> 
> 1. Do we need to differentiate between no FDB entry and an FDB entry
> pointing to a different port than we expect?

And extending that question, a static vs a dynamic entry?

Andrew


Re: [Bridge] [PATCH net-next 3/3] net: dsa: mv88e6xxx: mac-auth/MAB implementation

2022-03-10 Thread Andrew Lunn
> + if (mv88e6xxx_port_is_locked(chip, chip->ports[spid].port))
> + err = 
> mv88e6xxx_switchdev_handle_atu_miss_violation(chip,
> + 
> chip->ports[spid].port,
> + 
> &entry,
> + 
> fid);

> +static int mv88e6xxx_find_vid_on_matching_fid(struct mv88e6xxx_chip *chip,
> +   const struct mv88e6xxx_vtu_entry 
> *entry,
> +   void *priv)
> +{
> + struct mv88e6xxx_fid_search_ctx *ctx = priv;
> +
> + if (ctx->fid_search == entry->fid) {
> + ctx->vid_found = entry->vid;
> + return 1;
> + }
> + return 0;
> +}
> +
> +int mv88e6xxx_switchdev_handle_atu_miss_violation(struct mv88e6xxx_chip 
> *chip,
> +   int port,
> +   struct mv88e6xxx_atu_entry 
> *entry,
> +   u16 fid)
> +{
> + struct switchdev_notifier_fdb_info info = {
> + .addr = entry->mac,
> + .vid = 0,
> + .added_by_user = false,
> + .is_local = false,
> + .offloaded = true,
> + .locked = true,
> + };
> + struct mv88e6xxx_fid_search_ctx ctx;
> + struct netlink_ext_ack *extack;
> + struct net_device *brport;
> + struct dsa_port *dp;
> + int err;
> +
> + ctx.fid_search = fid;
> + err = mv88e6xxx_vtu_walk(chip, mv88e6xxx_find_vid_on_matching_fid, 
> &ctx);

I could be reading this code wrong, but it looks like you assume there
is a single new entry in the ATU. But interrupts on these devices are
slow. It would be easy for two or more devices to pop into existence
at the same time. Don't you need to walk the whole ATU to find all the
new entries? Have you tried this with a traffic generating populating
the ATU with new entries at different rates, up to line rate? Do you
get notifications for them all?

Andrew


Re: [Bridge] [PATCH net-next 3/3] net: dsa: mv88e6xxx: mac-auth/MAB implementation

2022-03-17 Thread Andrew Lunn
On Thu, Mar 17, 2022 at 09:52:15AM +0100, Hans Schultz wrote:
> On tor, mar 17, 2022 at 01:34, Vladimir Oltean  wrote:
> > On Mon, Mar 14, 2022 at 11:46:51AM +0100, Hans Schultz wrote:
> >> >> @@ -396,6 +414,13 @@ static irqreturn_t 
> >> >> mv88e6xxx_g1_atu_prob_irq_thread_fn(int irq, void *dev_id)
> >> >> "ATU miss violation for %pM portvec 
> >> >> %x spid %d\n",
> >> >> entry.mac, entry.portvec, spid);
> >> >> chip->ports[spid].atu_miss_violation++;
> >> >> +   if (mv88e6xxx_port_is_locked(chip, 
> >> >> chip->ports[spid].port))
> >> >> +   err = 
> >> >> mv88e6xxx_switchdev_handle_atu_miss_violation(chip,
> >> >> +   
> >> >> chip->ports[spid].port,
> >> >> +   
> >> >> &entry,
> >> >> +   
> >> >> fid);
> >> >
> >> > Do we want to suppress the ATU miss violation warnings if we're going to
> >> > notify the bridge, or is it better to keep them for some reason?
> >> > My logic is that they're part of normal operation, so suppressing makes
> >> > sense.
> >> >
> >> 
> >> I have been seeing many ATU member violations after the miss violation is
> >> handled (using ping), and I think it could be considered to suppress the 
> >> ATU member
> >> violations interrupts by setting the IgnoreWrongData bit for the
> >> port (sect 4.4.7). This would be something to do whenever a port is set in 
> >> locked mode?
> >
> > So the first packet with a given MAC SA triggers an ATU miss violation
> > interrupt.
> >
> > You program that MAC SA into the ATU with a destination port mask of all
> > zeroes. This suppresses further ATU miss interrupts for this MAC SA, but
> > now generates ATU member violations, because the MAC SA _is_ present in
> > the ATU, but not towards the expected port (in fact, towards _no_ port).
> >
> > Especially if user space decides it doesn't want to authorize this MAC
> > SA, it really becomes a problem because this is now a vector for denial
> > of service, with every packet triggering an ATU member violation
> > interrupt.
> >
> > So your suggestion is to set the IgnoreWrongData bit on locked ports,
> > and this will suppress the actual member violation interrupts for
> > traffic coming from these ports.
> >
> > So if the user decides to unplug a previously authorized printer from
> > switch port 1 and move it to port 2, how is this handled? If there isn't
> > a mechanism in place to delete the locked FDB entry when the printer
> > goes away, then by setting IgnoreWrongData you're effectively also
> > suppressing migration notifications.
> 
> I don't think such a scenario is so realistic, as changing port is not
> just something done casually, besides port 2 then must also be a locked
> port to have the same policy.

I think it is very realistic. It is also something which does not work
is going to cause a lot of confusion. People will blame the printer,
when in fact they should be blaming the switch. They will be rebooting
the printer, when in fact, they need to reboot the switch etc.

I expect there is a way to cleanly support this, you just need to
figure it out.

> The other aspect is that the user space daemon that authorizes catches
> the fdb add entry events and checks if it is a locked entry. So it will
> be up to said daemon to decide the policy, like remove the fdb entry
> after a timeout.
> 
> >
> > Oh, btw, my question was: could you consider suppressing the _prints_ on
> > an ATU miss violation on a locked port?
> 
> As there will only be such on the first packet, I think it should be
> logged and those prints serve that purpose, so I think it is best to
> keep the print.
> If in the future some tests or other can argue for suppressing the
> prints, it is an easy thing to do.

Please use a traffic generator and try to DOS one of your own
switches. Can you?

  Andrew


[Bridge] [PATCH RFC] net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.

2022-05-05 Thread Andrew Lunn
It is possible to stack bridges on top of each other. Consider the
following which makes use of an Ethernet switch:

   br1
 /\
/  \
   /\
 br0.11wlan0
   |
   br0
 /  |  \
p1  p2  p3

br0 is offloaded to the switch. Above br0 is a vlan interface, for
vlan 11. This vlan interface is then a slave of br1. br1 also has
wireless interface as a slave. This setup trunks wireless lan traffic
over the copper network inside a VLAN.

A frame received on p1 which is passed up to the bridge has the
skb->offload_fwd_mark flag set to true, indicating it that the switch
has dealt with forwarding the frame out ports p2 and p3 as
needed. This flag instructs the software bridge it does not need to
pass the frame back down again. However, the flag is not getting reset
when the frame is passed upwards. As a result br1 sees the flag,
wrongly interprets it, and fails to forward the frame to wlan0.

When passing a frame upwards, clear the flag.

RFC because i don't know the bridge code well enough if this is the
correct place to do this, and if there are any side effects, could the
skb be a clone, etc.

Fixes: f1c2eddf4cb6 ("bridge: switchdev: Use an helper to clear forward mark")
Signed-off-by: Andrew Lunn 
---
 net/bridge/br_input.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 196417859c4a..9327a5fad1df 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -39,6 +39,13 @@ static int br_pass_frame_up(struct sk_buff *skb)
dev_sw_netstats_rx_add(brdev, skb->len);
 
vg = br_vlan_group_rcu(br);
+
+   /* Reset the offload_fwd_mark because there could be a stacked
+* bridge above, and it should not think this bridge it doing
+* that bridges work forward out its ports.
+*/
+   br_switchdev_frame_unmark(skb);
+
/* Bridge is just like any other port.  Make sure the
 * packet is allowed except in promisc mode when someone
 * may be running packet capture.
-- 
2.36.0



Re: [Bridge] [PATCH RFC] net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.

2022-05-05 Thread Andrew Lunn
On Thu, May 05, 2022 at 04:07:20PM -0700, Stephen Hemminger wrote:
> On Fri,  6 May 2022 00:59:04 +0200
> Andrew Lunn  wrote:
> 
> > It is possible to stack bridges on top of each other. Consider the
> > following which makes use of an Ethernet switch:
> > 
> >br1
> >  /\
> > /  \
> >/\
> >  br0.11wlan0
> >|
> >br0
> >  /  |  \
> > p1  p2  p3
> > 
> > br0 is offloaded to the switch. Above br0 is a vlan interface, for
> > vlan 11. This vlan interface is then a slave of br1. br1 also has
> > wireless interface as a slave. This setup trunks wireless lan traffic
> > over the copper network inside a VLAN.
> > 
> > A frame received on p1 which is passed up to the bridge has the
> > skb->offload_fwd_mark flag set to true, indicating it that the switch
> > has dealt with forwarding the frame out ports p2 and p3 as
> > needed. This flag instructs the software bridge it does not need to
> > pass the frame back down again. However, the flag is not getting reset
> > when the frame is passed upwards. As a result br1 sees the flag,
> > wrongly interprets it, and fails to forward the frame to wlan0.
> > 
> > When passing a frame upwards, clear the flag.
> > 
> > RFC because i don't know the bridge code well enough if this is the
> > correct place to do this, and if there are any side effects, could the
> > skb be a clone, etc.
> > 
> > Fixes: f1c2eddf4cb6 ("bridge: switchdev: Use an helper to clear forward 
> > mark")
> > Signed-off-by: Andrew Lunn 
> 
> Bridging of bridges is not supposed to be allowed.
> See:
> 
> bridge:br_if.c
> 
>   /* No bridging of bridges */
>   if (dev->netdev_ops->ndo_start_xmit == br_dev_xmit) {
>   NL_SET_ERR_MSG(extack,
>  "Can not enslave a bridge to a bridge");
>   return -ELOOP;
>   }

This is not direct bridging of bridges. There is a vlan interface in
the middle. And even if it is not supposed to work, it does work, it
is being used, and it regressed. This fixes the regression.

   Andrew


Re: [Bridge] [PATCH RFC] net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.

2022-05-06 Thread Andrew Lunn
> Some safer alternatives to this patch are based on the idea that we
> could ignore skb->offload_fwd_mark coming from an unoffloaded bridge
> port (i.e. treat this condition at br1, not at br0). We could:
> - clear skb->offload_fwd_mark in br_handle_frame_finish(), if p->hwdom is 0
> - change nbp_switchdev_allowed_egress() to return true if cb->src_hwdom == 0

O.K, i will try out these solutions.

Thanks
 Andrew


Re: [Bridge] [PATCH RFC] net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.

2022-05-12 Thread Andrew Lunn
> I like Andrew's patch because it is the Rx equivalent of
> br_switchdev_frame_unmark() in br_dev_xmit(). However, if we go with the
> second option, it should allow us to remove the clearing of the mark in
> the Tx path as the control block is cleared in the Tx path since commit
> fd65e5a95d08 ("net: bridge: clear bridge's private skb space on xmit").
> 
> I don't know how far back Nik's patch was backported and I don't know
> how far back Andrew's patch will be backported, so it might be best to
> submit Andrew's patch to net as-is and then in net-next change
> nbp_switchdev_allowed_egress() and remove br_switchdev_frame_unmark()
> from both the Rx and Tx paths.
> 
> Anyway, I have applied this patch to our tree for testing. Will report
> tomorrow in case there are any regressions.

Hi Ido

Did your testing find any issues?

Thanks
Andrew


[Bridge] [PATCH v2 net] net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.

2022-05-17 Thread Andrew Lunn
It is possible to stack bridges on top of each other. Consider the
following which makes use of an Ethernet switch:

   br1
 /\
/  \
   /\
 br0.11wlan0
   |
   br0
 /  |  \
p1  p2  p3

br0 is offloaded to the switch. Above br0 is a vlan interface, for
vlan 11. This vlan interface is then a slave of br1. br1 also has a
wireless interface as a slave. This setup trunks wireless lan traffic
over the copper network inside a VLAN.

A frame received on p1 which is passed up to the bridge has the
skb->offload_fwd_mark flag set to true, indicating that the switch has
dealt with forwarding the frame out ports p2 and p3 as needed. This
flag instructs the software bridge it does not need to pass the frame
back down again. However, the flag is not getting reset when the frame
is passed upwards. As a result br1 sees the flag, wrongly interprets
it, and fails to forward the frame to wlan0.

When passing a frame upwards, clear the flag. This is the Rx
equivalent of br_switchdev_frame_unmark() in br_dev_xmit().

Fixes: f1c2eddf4cb6 ("bridge: switchdev: Use an helper to clear forward mark")
Signed-off-by: Andrew Lunn 
---

v2:
Extended the commit message with Ido obsersation of the equivelance of
br_dev_xmit().

Fixed up the comment.

This code has passed Ido test setup.

net/bridge/br_input.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 196417859c4a..68b3e850bcb9 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -39,6 +39,13 @@ static int br_pass_frame_up(struct sk_buff *skb)
dev_sw_netstats_rx_add(brdev, skb->len);
 
vg = br_vlan_group_rcu(br);
+
+   /* Reset the offload_fwd_mark because there could be a stacked
+* bridge above, and it should not think this bridge it doing
+* that bridge's work forwarding out its ports.
+*/
+   br_switchdev_frame_unmark(skb);
+
/* Bridge is just like any other port.  Make sure the
 * packet is allowed except in promisc mode when someone
 * may be running packet capture.
-- 
2.36.0