[Bridge] [PATCH RFC WIP 0/5] IGMP snooping for local traffic
This is a WIP patchset i would like comments on from bridge, switchdev and hardware offload people. The linux bridge supports IGMP snooping. It will listen to IGMP reports on bridge ports and keep track of which groups have been joined on an interface. It will then forward multicast based on this group membership. When the bridge adds or removed groups from an interface, it uses switchdev to request the hardware add an mdb to a port, so the hardware can perform the selective forwarding between ports. What is not covered by the current bridge code, is IGMP joins/leaves from the host on the brX interface. No such monitoring is performed. With a pure software bridge, it is not required. All mulitcast frames are passed to the brX interface, and the network stack filters them, as it does for any interface. However, when hardware offload is involved, things change. We should program the hardware to only send multcast packets to the host when the host has in interest in them. Thus we need to perform IGMP snooping on the brX interface, just like any other interface of the bridge. However, currently the brX interface is missing all the needed data structures to do this. There is no net_bridge_port structure for the brX interface. This strucuture is created when an interface is added to the bridge. But the brX interface is not a member of the bridge. So this patchset makes the brX interface a first class member of the bridge. When the brX interface is opened, the interface is added to the bridge. A net_bridge_port is allocated for it, and IGMP snooping is performed as usual. There are some complexities here. Some assumptions are broken, like the master interface of a port interface is the bridge interface. The brX interface cannot be its own master. The use of netdev_master_upper_dev_get() within the bridge code has been changed to reflecit this. The bridge receive handler needs to not process frames for the brX interface, etc. The interface downward to the hardware is also an issue. The code presented here is a hack and needs to change. But that is secondary and can be solved once it is agreed how the bridge needs to change to support this use case. Comment welcome and wanted. Andrew Andrew Lunn (5): net: rtnetlink: Handle bridge port without upper device net: bridge: Skip receive handler on brX interface net: bridge: Make the brX interface a member of the bridge net: dsa: HACK: Handle MDB add/remove for none-switch ports net: dsa: Don't include CPU port when adding MDB to a port include/linux/if_bridge.h | 1 + net/bridge/br_device.c| 12 ++-- net/bridge/br_if.c| 37 - net/bridge/br_input.c | 4 net/bridge/br_mdb.c | 2 -- net/bridge/br_multicast.c | 7 --- net/bridge/br_private.h | 1 + net/core/rtnetlink.c | 23 +-- net/dsa/port.c| 19 +-- net/dsa/switch.c | 2 +- 10 files changed, 83 insertions(+), 25 deletions(-) -- 2.14.1
[Bridge] [PATCH RFC WIP 5/5] net: dsa: Don't include CPU port when adding MDB to a port
Now that the MDB are explicitly added to the CPU port when required, don't add the CPU port adding an MDB to a switch port. Signed-off-by: Andrew Lunn --- net/dsa/switch.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/dsa/switch.c b/net/dsa/switch.c index 97e2e9c8cf3f..c178e2b86a9a 100644 --- a/net/dsa/switch.c +++ b/net/dsa/switch.c @@ -130,7 +130,7 @@ static int dsa_switch_mdb_add(struct dsa_switch *ds, if (ds->index == info->sw_index) set_bit(info->port, group); for (port = 0; port < ds->num_ports; port++) - if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port)) + if (dsa_is_dsa_port(ds, port)) set_bit(port, group); if (switchdev_trans_ph_prepare(trans)) { -- 2.14.1
[Bridge] [PATCH RFC WIP 1/5] net: rtnetlink: Handle bridge port without upper device
The brX interface will with a following patch becomes a member of the bridge. It however cannot be a slave interface, since it would have to be a slave of itself. netdev_master_upper_dev_get() returns NULL as a result. Handle this NULL, by knowing this bridge slave must also be the master, i.e. what we are looking for. Signed-off-by: Andrew Lunn --- net/core/rtnetlink.c | 23 +-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 9201e3621351..2673eb430b6f 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -3093,8 +3093,12 @@ static int rtnl_fdb_add(struct sk_buff *skb, struct nlmsghdr *nlh, if ((!ndm->ndm_flags || ndm->ndm_flags & NTF_MASTER) && (dev->priv_flags & IFF_BRIDGE_PORT)) { struct net_device *br_dev = netdev_master_upper_dev_get(dev); - const struct net_device_ops *ops = br_dev->netdev_ops; + const struct net_device_ops *ops; + if (!br_dev) + br_dev = dev; + + ops = br_dev->netdev_ops; err = ops->ndo_fdb_add(ndm, tb, dev, addr, vid, nlh->nlmsg_flags); if (err) @@ -3197,7 +3201,12 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct nlmsghdr *nlh, if ((!ndm->ndm_flags || ndm->ndm_flags & NTF_MASTER) && (dev->priv_flags & IFF_BRIDGE_PORT)) { struct net_device *br_dev = netdev_master_upper_dev_get(dev); - const struct net_device_ops *ops = br_dev->netdev_ops; + const struct net_device_ops *ops; + + if (!br_dev) + br_dev = dev; + + ops = br_dev->netdev_ops; if (ops->ndo_fdb_del) err = ops->ndo_fdb_del(ndm, tb, dev, addr, vid); @@ -3332,6 +3341,8 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb) if (!br_idx) { /* user did not specify a specific bridge */ if (dev->priv_flags & IFF_BRIDGE_PORT) { br_dev = netdev_master_upper_dev_get(dev); + if (!br_dev) + br_dev = dev; cops = br_dev->netdev_ops; } } else { @@ -3410,6 +3421,9 @@ int ndo_dflt_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq, struct net_device *br_dev = netdev_master_upper_dev_get(dev); int err = 0; + if (!br_dev) + br_dev = dev; + nlh = nlmsg_put(skb, pid, seq, RTM_NEWLINK, sizeof(*ifm), nlflags); if (nlh == NULL) return -EMSGSIZE; @@ -3647,6 +3661,8 @@ static int rtnl_bridge_setlink(struct sk_buff *skb, struct nlmsghdr *nlh, if (!flags || (flags & BRIDGE_FLAGS_MASTER)) { struct net_device *br_dev = netdev_master_upper_dev_get(dev); + if (!br_dev) + br_dev = dev; if (!br_dev || !br_dev->netdev_ops->ndo_bridge_setlink) { err = -EOPNOTSUPP; @@ -3723,6 +3739,9 @@ static int rtnl_bridge_dellink(struct sk_buff *skb, struct nlmsghdr *nlh, if (!flags || (flags & BRIDGE_FLAGS_MASTER)) { struct net_device *br_dev = netdev_master_upper_dev_get(dev); + if (!br_dev) + br_dev = dev; + if (!br_dev || !br_dev->netdev_ops->ndo_bridge_dellink) { err = -EOPNOTSUPP; goto out; -- 2.14.1
[Bridge] [PATCH RFC WIP 2/5] net: bridge: Skip receive handler on brX interface
The brX interface will soon become a member of the bridge. As such, it will get a receiver handler assigned. However, we don't want to handle packets received on this soft interfaces. So detect the condition and say all the packets pass. --- net/bridge/br_input.c | 4 1 file changed, 4 insertions(+) diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c index 7637f58c1226..38c2a41968f2 100644 --- a/net/bridge/br_input.c +++ b/net/bridge/br_input.c @@ -267,6 +267,10 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb) return RX_HANDLER_CONSUMED; p = br_port_get_rcu(skb->dev); + + if (p->dev == p->br->dev) + return RX_HANDLER_PASS; + if (p->flags & BR_VLAN_TUNNEL) { if (br_handle_ingress_vlan_tunnel(skb, p, nbp_vlan_group_rcu(p))) -- 2.14.1
[Bridge] [PATCH RFC WIP 3/5] net: bridge: Make the brX interface a member of the bridge
In order to perform IGMP snooping on the brX interface, it has to be part of the bridge, so that the code snooping on normal bridge ports keeps track of IGMP joins and leaves. When the brX interface is opened, add the interface to the bridge. When the brX interface is closed, remove it from the bridge. This port does however need some special handling. So add a bridge port flag, BR_SOFT_INTERFACE, indicating a port is the sort interface of the bridge. When the port is added to the bridge, the netdev for this port cannot be linked to the master device, since it is the master device. Similarly when removing the port, it cannot be unlinked from the master device. With the brX interface now being a member of the bridge, and having all associated structures, we can process IGMP messages sent by the interface. This is done by the br_multicast_rcv() function, which takes the bridge_port structure as a parameter. This cannot be easily found, so keep track of it in the net_bridge structure. --- include/linux/if_bridge.h | 1 + net/bridge/br_device.c| 12 ++-- net/bridge/br_if.c| 37 - net/bridge/br_mdb.c | 2 -- net/bridge/br_multicast.c | 7 --- net/bridge/br_private.h | 1 + 6 files changed, 40 insertions(+), 20 deletions(-) diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h index 3cd18ac0697f..8a03821d1827 100644 --- a/include/linux/if_bridge.h +++ b/include/linux/if_bridge.h @@ -49,6 +49,7 @@ struct br_ip_list { #define BR_MULTICAST_TO_UNICASTBIT(12) #define BR_VLAN_TUNNEL BIT(13) #define BR_BCAST_FLOOD BIT(14) +#define BR_SOFT_INTERFACE BIT(15) #define BR_DEFAULT_AGEING_TIME (300 * HZ) diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c index 861ae2a165f4..f27ca62fd4a5 100644 --- a/net/bridge/br_device.c +++ b/net/bridge/br_device.c @@ -69,7 +69,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev) br_flood(br, skb, BR_PKT_MULTICAST, false, true); goto out; } - if (br_multicast_rcv(br, NULL, skb, vid)) { + if (br_multicast_rcv(br, br->local_port, skb, vid)) { kfree_skb(skb); goto out; } @@ -133,6 +133,14 @@ static void br_dev_uninit(struct net_device *dev) static int br_dev_open(struct net_device *dev) { struct net_bridge *br = netdev_priv(dev); + int err; + + err = br_add_if(br, br->dev); + if (err) + return err; + + br->local_port = list_first_or_null_rcu(&br->port_list, + struct net_bridge_port, list); netdev_update_features(dev); netif_start_queue(dev); @@ -161,7 +169,7 @@ static int br_dev_stop(struct net_device *dev) netif_stop_queue(dev); - return 0; + return br_del_if(br, br->dev); } static void br_get_stats64(struct net_device *dev, diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c index f3aef22931ab..49208e774191 100644 --- a/net/bridge/br_if.c +++ b/net/bridge/br_if.c @@ -284,7 +284,8 @@ static void del_nbp(struct net_bridge_port *p) nbp_update_port_count(br); - netdev_upper_dev_unlink(dev, br->dev); + if (!(p->flags & BR_SOFT_INTERFACE)) + netdev_upper_dev_unlink(dev, br->dev); dev->priv_flags &= ~IFF_BRIDGE_PORT; @@ -362,6 +363,8 @@ static struct net_bridge_port *new_nbp(struct net_bridge *br, p->priority = 0x8000 >> BR_PORT_BITS; p->port_no = index; p->flags = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD; + if (br->dev == dev) + p->flags |= BR_SOFT_INTERFACE; br_init_port(p); br_set_state(p, BR_STATE_DISABLED); br_stp_port_timer_init(p); @@ -500,8 +503,11 @@ int br_add_if(struct net_bridge *br, struct net_device *dev) return -EINVAL; /* No bridging of bridges */ - if (dev->netdev_ops->ndo_start_xmit == br_dev_xmit) - return -ELOOP; + if (dev->netdev_ops->ndo_start_xmit == br_dev_xmit) { + /* Unless it is our own soft interface */ + if (br->dev != dev) + return -ELOOP; + } /* Device is already being bridged */ if (br_port_exists(dev)) @@ -540,9 +546,11 @@ int br_add_if(struct net_bridge *br, struct net_device *dev) dev->priv_flags |= IFF_BRIDGE_PORT; - err = netdev_master_upper_dev_link(dev, br->dev, NULL, NULL); - if (err) - goto err5; + if (!(p->flags & BR_SOFT_INTERFACE)) { + err = netdev_master_upper_dev_link(dev, br->dev, NULL, NULL); + if (err) + goto err5; + } err = nbp_switchdev_mark_set(p); if (err) @@ -563,13 +571,15 @@ int br_add_if(struct net_bridge *br, struct net
[Bridge] [PATCH RFC WIP 4/5] net: dsa: HACK: Handle MDB add/remove for none-switch ports
When there is a mdb added to a port which is not in the switch, we need the switch to forward traffic for the group to the software bridge, so it can forward it out the none-switch port. The current implementation is a hack and will be replaced. Currently only the bridge soft interface is supported. When there is a join/leave on the soft interface, switchdev calls are made on the soft interface device, brX. This does not have a switchdev ops structure registered, so all lower interfaces of brX get there switchdev function called. These are switch ports, and do have switchdev ops. By comparing the original interface to the called interface, we can determine this is not for a switch port, and add/remove the mdb to the CPU port. --- net/dsa/port.c | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/net/dsa/port.c b/net/dsa/port.c index d6e07176df3f..d8e4bfefd97d 100644 --- a/net/dsa/port.c +++ b/net/dsa/port.c @@ -194,8 +194,15 @@ int dsa_port_mdb_add(struct dsa_port *dp, .mdb = mdb, }; - pr_info("dsa_port_mdb_add: %d %d", info.sw_index, info.port); - + if (dp->netdev != mdb->obj.orig_dev) { + /* Not a port for this switch, so forward +* multicast out the CPU port to the bridge. +*/ + struct dsa_switch_tree *dst = dp->ds->dst; + struct dsa_port *cpu_dp = dsa_get_cpu_port(dst); + info.port = cpu_dp->index; + return dsa_port_notify(cpu_dp, DSA_NOTIFIER_MDB_ADD, &info); + } return dsa_port_notify(dp, DSA_NOTIFIER_MDB_ADD, &info); } @@ -208,6 +215,14 @@ int dsa_port_mdb_del(struct dsa_port *dp, .mdb = mdb, }; + if (dp->netdev != mdb->obj.orig_dev) { + struct dsa_switch_tree *dst = dp->ds->dst; + struct dsa_port *cpu_dp = dsa_get_cpu_port(dst); + + info.port = cpu_dp->index; + return dsa_port_notify(cpu_dp, DSA_NOTIFIER_MDB_DEL, &info); + } + return dsa_port_notify(dp, DSA_NOTIFIER_MDB_DEL, &info); } -- 2.14.1
Re: [Bridge] [PATCH RFC WIP 0/5] IGMP snooping for local traffic
> Hi Andrew, > > Have you taken a look at mglist (the boolean, probably needs a rename) ? It > is for > exactly that purpose, to track which groups the bridge is interested in. > I assume I'm forgetting or missing something here. > > > performed. With a pure software bridge, it is not required. All > > mulitcast frames are passed to the brX interface, and the network > > If mglist (again the boolean) is false then they won't be passed up. > > > stack filters them, as it does for any interface. However, when > > hardware offload is involved, things change. We should program the > > hardware to only send multcast packets to the host when the host has > > in interest in them. > > Granted the boolean mglist might need some changes (esp. with host group > leave) > but I think it can be used to program switchdev for host join/leave, can't > we adjust its behaviour instead of introducing this complexity and avoid many > headaches ? I would like to avoid this complexity as well. I will take a look at mglist. Thanks for the hint. Andrew
Re: [Bridge] [PATCH RFC WIP 0/5] IGMP snooping for local traffic
On Sun, Aug 27, 2017 at 07:44:15PM -0700, Florian Fainelli wrote: > Hi Andrew, > > On 08/26/2017 01:56 PM, Andrew Lunn wrote: > > This is a WIP patchset i would like comments on from bridge, > > switchdev and hardware offload people. > > > > The linux bridge supports IGMP snooping. It will listen to IGMP > > reports on bridge ports and keep track of which groups have been > > joined on an interface. It will then forward multicast based on this > > group membership. > > > > When the bridge adds or removed groups from an interface, it uses > > switchdev to request the hardware add an mdb to a port, so the > > hardware can perform the selective forwarding between ports. > > > > What is not covered by the current bridge code, is IGMP joins/leaves > > from the host on the brX interface. No such monitoring is performed. > > With a pure software bridge, it is not required. All mulitcast frames > > are passed to the brX interface, and the network stack filters them, > > as it does for any interface. However, when hardware offload is > > involved, things change. We should program the hardware to only send > > multcast packets to the host when the host has in interest in them. > > OK, so if I understand this right, without a bridge, we have the > following happen today: with a DSA-enabled setup using any kind of > switch tagging protocol, if a host is interested in receiving particular > multicast traffic, we would receive IGMP joins/leaves through sw0p0, and > the stack should call ndo_set_rx_mode for sw0p0, which would be > dsa_slave_set_rx_mode() and which would synchronize the DSA master > network device with the slave network device, everything works fine > provided that the CPU port is configured to accept multicast traffic. Hi Florian That sounds about right, but it is not a use case i've looked at yet. I do need to look at it though, because there is a chance i break it with IGMP snooping support. > Note here that we don't really add a MDB entry for sw0p0 when that > happens, but it seems like we should for switches that lack IGMP > snooping and/or multicast filtering. Yes. But it is not an MDB in the normal sense, at least from the switches perspective. An MDB passed to a switch using switchdev says that traffic for the specified group should go out that port. However, when the host does a join on sw0p0, we want the exact opposite, multicast traffic coming in that port of the switch which matches the group should be forwarded to the host. So my second attempt at this code adds a new switchdev object, SWITCHDEV_OBJ_ID_PORT_HOST_MDB. > With the current bridge and DSA code, are not we actually always going > to get the CPU port to be added with the multicast address and therefore > no filtering is occurring and snooping is pretty much useless? Correct. At the moment, all multicast traffic gets delivered to the host. > While I understand the reasons why you did it that way, I think this is > going to break a lot of code in bridge that does not expect brX to be a > bridge port member. It does. Nikolay suggestion works a lot better, and i'm following that now. It looks good and only requires some minor tweaks to the bridge code. Andrew
Re: [Bridge] [PATCH net-next v2] bridge: fdb add and delete tracepoints
On Mon, Aug 28, 2017 at 09:22:48PM -0700, Roopa Prabhu wrote: > From: Roopa Prabhu > > A few useful tracepoints to trace bridge forwarding > database updates. Hi Roopa Once accepted, it would be nice to add mdb tracepoints as well. I expect the implementation would be very similar. Andrew
Re: [Bridge] [PATCH net-next] net: Define eth_stp_addr in linux/etherdevice.h
On Thu, Nov 02, 2017 at 10:36:48AM +0100, Egil Hjelmeland wrote: > Why: Please drop the Why. > The lan9303 driver defines eth_stp_addr as a synonym to > eth_reserved_addr_base to get the STP ethernet address 01:80:c2:00:00:00. > > eth_reserved_addr_base is also used to define the start of Bridge Reserved > ethernet address range, which happen to be the STP address. > > br_dev_setup refer to eth_reserved_addr_base as a definition of STP > address. > > Clean up by: > - Move the eth_stp_addr definition to linux/etherdevice.h > - Use eth_stp_addr instead of eth_reserved_addr_base in br_dev_setup. > > Signed-off-by: Egil Hjelmeland I was thinking along the same lines when reviewing your lan9303 patch. Reviewed-by: Andrew Lunn Andrew
Re: [Bridge] [PATCH net-next 2/5] bridge: propagate BR_ flags updates through sysfs to switchdev
On Fri, Mar 09, 2018 at 07:03:05PM -0800, Igor Mitsyanko wrote: > sysfs interface to configure bridge flags only updates SW copy but does > not notify hardware through switchdev interface. Make sure it is. > > Signed-off-by: Igor Mitsyanko > --- > net/bridge/br_sysfs_if.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/net/bridge/br_sysfs_if.c b/net/bridge/br_sysfs_if.c > index 126a8ea..9bdd177 100644 > --- a/net/bridge/br_sysfs_if.c > +++ b/net/bridge/br_sysfs_if.c > @@ -51,6 +51,7 @@ static int store_flag(struct net_bridge_port *p, unsigned > long v, > unsigned long mask) > { > unsigned long flags; > + int err; > > flags = p->flags; > > @@ -59,10 +60,16 @@ static int store_flag(struct net_bridge_port *p, unsigned > long v, > else > flags &= ~mask; > > - if (flags != p->flags) { > - p->flags = flags; > - br_port_flags_change(p, mask); > - } > + if (flags == p->flags) > + return 0; > + > + err = br_switchdev_set_port_flag(p, flags, mask); > + if (err) > + return err; You might want to consider the br_warn() in br_switchdev_set_port_flag(). Do we want to spam the kernel log? Or should store_flag() do some validation before calling br_switchdev_set_port_flag()? Andrew
Re: [Bridge] [RFC PATCH net-next 3/5] bridge: allow switchdev port to handle flooding by itself
On Fri, Mar 09, 2018 at 07:03:06PM -0800, Igor Mitsyanko wrote: > Introduce BR_FLOOD_OFFLOAD bridge port flag that can be used by > switchdev-capable hardware to advertize that it wants to handle all > flooding by itself. > In that case there is no need for a driver to set skb::offload_fwd_mark > on each offloaded packet as it is implied by BR_FLOOD_OFFLOAD bridge > port flag. Is this sufficiently granular? There are a few different use cases for flooding: There is no fdb entry in the software switch for the destination MAC address, so flood the packet out all ports of the bridge. The hardware switch might have an entry in its fdb to the destination switch, so it could unicast out the correct hardware port. If not, it should flood the packet. A point to remember here, the software switch and the hardware switch can have different forwarding data bases. A broadcast packet. Send it out all ports. A multicast packet. If the hardware switch is capable of IGMP snooping, it could have FDB entries indicating which ports it should send the frame out of, and which is should not. Otherwise it needs to flood. Is one flag sufficient for all of these, and any other use cases i might of missed? As far as DSA switches go, i don't know of any of them which could implement anything like this, so BR_FLOOD_OFFLOAD will never be set. But maybe some of the TOR switches supported by switchdev can do some of these, and not others Andrew
Re: [Bridge] [PATCH net-next 1/5] bridge: initialize port flags with switchdev defaults
On Fri, Mar 09, 2018 at 07:03:04PM -0800, Igor Mitsyanko wrote: > Default bridge port flags for switchdev devices can be different from > what is used in bridging core. Get default value from switchdev itself > on port initialization. > > Signed-off-by: Igor Mitsyanko > --- > net/bridge/br_if.c | 17 - > 1 file changed, 16 insertions(+), 1 deletion(-) > > diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c > index 9ba4ed6..d658b55 100644 > --- a/net/bridge/br_if.c > +++ b/net/bridge/br_if.c > @@ -342,6 +342,21 @@ static int find_portno(struct net_bridge *br) > return (index >= BR_MAX_PORTS) ? -EXFULL : index; > } > > +static unsigned long nbp_flags_get_default(struct net_device *dev) > +{ > + struct switchdev_attr attr = { > + .orig_dev = dev, > + .id = SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS, > + }; > + int ret; > + > + ret = switchdev_port_attr_get(dev, &attr); > + if (ret) > + return BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD; Hi Igor Please check if ret == -EOPNOTSUPP and only then use the defaults. A real error should be propagated, causing new_nbp to fail. You might also consider what to do when ENODATA is returned. Andrew
Re: [Bridge] [PATCH net-next 0/5] Switchdev: flooding offload option
On Fri, Mar 09, 2018 at 07:03:03PM -0800, Igor Mitsyanko wrote: > Main goal of the patchset is to allow for flooding offload to switchdev > in scenarios when HW switchdev ports and SW ports are operating in a > single bridge. > > In case a data frame that needs to be flooded is ingressed on a SW port, > it needs to be flooded to a single HW port of any given > switchdev-capable hardware only. Switchdev hardware will than take care > about flooding to the rest of the ports that it manages. Hi Igor You don't appear to be adding any user of this. Please also make one of the switchdev drivers actually use this new functionality. Andrew
Re: [Bridge] [RFC PATCH net-next 3/5] bridge: allow switchdev port to handle flooding by itself
> The flag was introduced to enable hardware switch capabilities of > drivers/net/wireless/quantenna/qtnfmac wifi driver. It does not have any > switchdev functionality in upstream tree at this moment, and this patchset > was intended as a preparatory change. O.K. But i suggest you add basic switchdev support first. Then think about adding new functionality. That way you can learn more about switchdev, and we can learn more about your hardware. > qtnfmac driver provides several physical radios (5 GHz and 2.4 GHz), each > can have up to 8 virtual network interfaces. These interfaces can be bridged > together in various configurations, and I'm trying to figure out what is the > most efficient way to handle it from bridging perspective. I think the first thing to do is get this part correctly represented by switchdev. I don't think any of us maintainers have thought about how wireless and switchdev can be combined. The wifi model seems to be one phy device, with multiple MACs running on top of it, with each MAC being a single SSID. So is it one SSID per virtual interface? Or are your virtual network interfaces actually virtual phys in the wireless model, and you can have multiple MACs on top of each virtual phy? > My assumption was that software FDB and hardware FDB should always > be in sync with each other. I guess it is a safe assumption if > handled correctly? Hardware should send a notification for each new > FDB it has learned, and switchdev driver should process FDB > notifications from software bridge. No, you cannot make this assumption. Take the example of DSA switches. They are generally connected over an MDIO bus, or an SPI bus. The bandwidth is small. How long do you think it takes the hardware to learn 8K MAC addresses with 5x 1Gbps ports receiving 64 byte packets? DSA drivers have no way of keeping up with the hardware. And there is no need to. Everything works fine with the SW and the HW bridge having different dynamic FDB entries. I don't even think your hardware will have the hardware and software in sync. How fast can your hardware learn new addresses? 'Line' rate? Or do you prevent the hardware learning a new address until the software bridge has confirmed it has learnt the previous new address? > qtnfmac hardware has its own memory and maintains FWT table, so for the best > efficiency forwarding between virtual interfaces should be handled locally. > Qtnfmac can handle all the mentioned flooding by itself: > - unknown unicasts > - broadcast and unknown multicast > - known multicasts (does have IGMP snooping) > - can do multicast-to-unicast translation if required. > > The most important usecase IMO is a muticast transmission, specific example > being: > - 2.4GHz x 8 and 5GHz x 8 virtual wifi interfaces, bridged with backbone > ethernet interface in Linux > - multicast video streaming from a server behind ethernet > - multicast clients connected to some wifi interfaces I agree this makes sense. But we need to ensure the solution is generic, not something which just works for your hardware/firmware. I know somebody who would love to be able to do something like this with DSA drivers. They would probably sacrifice IGMP snooping and just flood everywhere, if that is all the hardware could do. But so far, i've not been able to figure out a way to do this. Andrew
Re: [Bridge] [PATCH 02/12] treewide/net: Rename eth_stp_addr to ether_stp_addr
On Sat, Mar 31, 2018 at 12:05:17AM -0700, Joe Perches wrote: > --- a/net/bridge/br_device.c > +++ b/net/bridge/br_device.c > @@ -1,4 +1,4 @@ > -/* > +gg/* Hi Joe This does not look good. Andrew
Re: [Bridge] [PATCH net-next 2/9] net: bridge: add bitfield for options and convert vlan opts
Hi Nikolay > struct net_bridge { > spinlock_t lock; > spinlock_t hash_lock; > struct list_headport_list; > struct net_device *dev; > struct pcpu_sw_netstats __percpu *stats; > + unsigned long options; Maybe a u32 would be better, so we run out of bits at the same time on 32 and 64 bit systems? Andrew
Re: [Bridge] [PATCH net-next 2/9] net: bridge: add bitfield for options and convert vlan opts
On Wed, Sep 26, 2018 at 05:55:47PM +0300, Nikolay Aleksandrov wrote: > On 26/09/18 17:48, Andrew Lunn wrote: > > Hi Nikolay > > > >> struct net_bridge { > >>spinlock_t lock; > >>spinlock_t hash_lock; > >>struct list_headport_list; > >>struct net_device *dev; > >>struct pcpu_sw_netstats __percpu *stats; > >> + unsigned long options; > > > > Maybe a u32 would be better, so we run out of bits at the same time on > > 32 and 64 bit systems? > > > >Andrew > > > > Bitops operate on an unsigned long, I actually had a BUILD_BUG_ON() for > 32 bits initially, but checked other places and they seem to be using it > as-is without any checks so I decided to leave it as well (e.g. > sock_flags). O.K. I assume we get compiler warnings anyway, when we use bit 32 on a 32bit system. The build-bots should let us know. Andrew
Re: [Bridge] [PATCH net-next 1/2] net: bridge: add support for user-controlled bool options
On Thu, Nov 22, 2018 at 06:29:24AM +0200, Nikolay Aleksandrov wrote: > We have been adding many new bridge options, a big number of which are > boolean but still take up netlink attribute ids and waste space in the skb. > Recently we discussed learning from link-local packets[1] and decided > yet another new boolean option will be needed, thus introducing this API > to save some bridge nl space. > The API supports changing the value of multiple boolean options at once > via the br_boolopt_multi struct which has an optmask (which options to > set, bit per opt) and optval (options' new values). Future boolean > options will only be added to the br_boolopt_id enum and then will have > to be handled in br_boolopt_toggle/get. The API will automatically > add the ability to change and export them via netlink, sysfs can use the > single boolopt function versions to do the same. The behaviour with > failing/succeeding is the same as with normal netlink option changing. > > If an option requires mapping to internal kernel flag or needs special > configuration to be enabled then it should be handled in > br_boolopt_toggle. It should also be able to retrieve an option's current > state via br_boolopt_get. > > [1] https://www.spinics.net/lists/netdev/msg532698.html > > Signed-off-by: Nikolay Aleksandrov > --- > include/uapi/linux/if_bridge.h | 18 + > include/uapi/linux/if_link.h | 1 + > net/bridge/br.c| 68 ++ > net/bridge/br_netlink.c| 17 - > net/bridge/br_private.h| 6 +++ > net/core/rtnetlink.c | 2 +- > 6 files changed, 110 insertions(+), 2 deletions(-) > > diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h > index e41eda3c71f1..6dc02c03bdf8 100644 > --- a/include/uapi/linux/if_bridge.h > +++ b/include/uapi/linux/if_bridge.h > @@ -292,4 +292,22 @@ struct br_mcast_stats { > __u64 mcast_bytes[BR_MCAST_DIR_SIZE]; > __u64 mcast_packets[BR_MCAST_DIR_SIZE]; > }; > + > +/* bridge boolean options > + * IMPORTANT: if adding a new option do not forget to handle > + *it in br_boolopt_toggle/get and bridge sysfs > + */ > +enum br_boolopt_id { > + BR_BOOLOPT_MAX > +}; > + > +/* struct br_boolopt_multi - change multiple bridge boolean options > + * > + * @optval: new option values (bit per option) > + * @optmask: options to change (bit per option) > + */ > +struct br_boolopt_multi { > + __u32 optval; > + __u32 optmask; > +}; Hi Nikolay Thanks for handling this. How many boolean options do we already have? What it the likelihood a u32 is going to be too small, in a couple of years time? I recently went through the pain of converting the u32 for representing link modes in the phylib API to a linux bitmap. I'm just wondering if in the long run, using a linux bitmap right from the beginning would be better? > +int br_boolopt_multi_toggle(struct net_bridge *br, > + struct br_boolopt_multi *bm) > +{ > + unsigned long bitmap = bm->optmask; > + int err = 0; > + int opt_id; > + > + for_each_set_bit(opt_id, &bitmap, BR_BOOLOPT_MAX) { > + bool on = !!(bm->optval & BIT(opt_id)); > + > + err = br_boolopt_toggle(br, opt_id, on); > + if (err) { > + br_debug(br, "boolopt multi-toggle error: option: %d > current: %d new: %d error: %d\n", > + opt_id, br_boolopt_get(br, opt_id), on, err); Would it be possible to return that to userspace using the extended error infrastructure? Andrew
Re: [Bridge] [PATCH net-next 1/2] net: bridge: add support for user-controlled bool options
> +/* br_boolopt_toggle - change user-controlled boolean option > + * > + * @br: bridge device > + * @opt: id of the option to change > + * @on: new option value > + * > + * Changes the value of the respective boolean option to @on taking care of > + * any internal option value mapping and configuration. > + */ > +int br_boolopt_toggle(struct net_bridge *br, enum br_boolopt_id opt, bool on) > +{ > + int err = -ENOENT; > + > + switch (opt) { > + default: > + break; > + } > + > + return err; > +} > + > +int br_boolopt_multi_toggle(struct net_bridge *br, > + struct br_boolopt_multi *bm) > +{ > + unsigned long bitmap = bm->optmask; > + int err = 0; > + int opt_id; > + > + for_each_set_bit(opt_id, &bitmap, BR_BOOLOPT_MAX) { > + bool on = !!(bm->optval & BIT(opt_id)); > + > + err = br_boolopt_toggle(br, opt_id, on); > + if (err) { > + br_debug(br, "boolopt multi-toggle error: option: %d > current: %d new: %d error: %d\n", > + opt_id, br_boolopt_get(br, opt_id), on, err); > + break; > + } An old kernel with a new iproute2 might partially succeed in toggling some low bits, but then silently ignore a high bit that is not supported by the kernel. As a user, i want to know the kernel does not support an option i'm trying to toggle. Can you walk all the bits in the u32 from the MSB to the LSB? That should avoid this problem. Andrew
Re: [Bridge] [PATCH net-next 1/2] net: bridge: add support for user-controlled bool options
> +void br_boolopt_multi_get(const struct net_bridge *br, > + struct br_boolopt_multi *bm) > +{ > + u32 optval = 0; > + int opt_id; > + > + for (opt_id = 0; opt_id < BR_BOOLOPT_MAX; opt_id++) > + optval |= (br_boolopt_get(br, opt_id) << opt_id); > + > + bm->optval = optval; > + bm->optmask = 0; Maybe set optmask to indicate which bits this kernel supports? Andrew
Re: [Bridge] [PATCH net-next 2/2] net: bridge: add no_linklocal_learn bool option
> int br_boolopt_get(const struct net_bridge *br, enum br_boolopt_id opt) > { > - int optval = 0; > - > switch (opt) { > + case BR_BOOLOPT_NO_LL_LEARN: > + return br_opt_get(br, BROPT_NO_LL_LEARN); > default: > break; > } > > - return optval; > + return 0; > } It seems like 1/2 of that change belongs in the previous patch. > --- a/net/bridge/br_sysfs_br.c > +++ b/net/bridge/br_sysfs_br.c > @@ -328,6 +328,27 @@ static ssize_t flush_store(struct device *d, > } > static DEVICE_ATTR_WO(flush); > > +static ssize_t no_linklocal_learn_show(struct device *d, > +struct device_attribute *attr, > +char *buf) > +{ > + struct net_bridge *br = to_bridge(d); > + return sprintf(buf, "%d\n", br_boolopt_get(br, BR_BOOLOPT_NO_LL_LEARN)); > +} > + > +static int set_no_linklocal_learn(struct net_bridge *br, unsigned long val) > +{ > + return br_boolopt_toggle(br, BR_BOOLOPT_NO_LL_LEARN, !!val); > +} > + > +static ssize_t no_linklocal_learn_store(struct device *d, > + struct device_attribute *attr, > + const char *buf, size_t len) > +{ > + return store_bridge_parm(d, buf, len, set_no_linklocal_learn); > +} > +static DEVICE_ATTR_RW(no_linklocal_learn); I thought we where trying to move away from sysfs? Do we need to add new options here? It seems like forcing people to use iproute2 for newer options is a good way to get people to convert to iproute2. Andrew
Re: [Bridge] [PATCH net-next v2 1/3] net: bridge: add support for user-controlled bool options
> +int br_boolopt_toggle(struct net_bridge *br, enum br_boolopt_id opt, bool on, > + struct netlink_ext_ack *extack) > +{ > + switch (opt) { > + default: > + /* shouldn't be called with unsupported options */ > + WARN_ON(1); > + break; So you return 0 here, meaning the br_debug() lower down will not happen. Maybe return -EOPNOTSUPP? > + } > + > + return 0; > +} > + > +int br_boolopt_multi_toggle(struct net_bridge *br, > + struct br_boolopt_multi *bm, > + struct netlink_ext_ack *extack) > +{ > + unsigned long bitmap = bm->optmask; > + int err = 0; > + int opt_id; > + > + for_each_set_bit(opt_id, &bitmap, BR_BOOLOPT_MAX) { > + bool on = !!(bm->optval & BIT(opt_id)); > + > + err = br_boolopt_toggle(br, opt_id, on, extack); > + if (err) { > + br_debug(br, "boolopt multi-toggle error: option: %d > current: %d new: %d error: %d\n", > + opt_id, br_boolopt_get(br, opt_id), on, err); > + break; > + } > + } Does the semantics of extack allow you to return something even when there is no error? If there are bits > BR_BOOLOPT_MAX you could return 0, but also add a warning in extack that some bits where not supported by this kernel. > +void br_boolopt_multi_get(const struct net_bridge *br, > + struct br_boolopt_multi *bm) > +{ > + u32 optval = 0; > + int opt_id; > + > + for (opt_id = 0; opt_id < BR_BOOLOPT_MAX; opt_id++) > + optval |= (br_boolopt_get(br, opt_id) << opt_id); > + > + bm->optval = optval; > + bm->optmask = 0; You liked the idea of setting optmask to indicate which bits this kernel supports. Did you change your mind? Andrew
Re: [Bridge] [PATCH net-next v2 3/3] net: bridge: export supported boolopts
On Sat, Nov 24, 2018 at 04:34:22AM +0200, Nikolay Aleksandrov wrote: > Now that we have at least one bool option, we can export all of the > supported bool options via optmask when dumping them. > Hi Nik That answers my question then... I'm assuming this means there is no easy way to generate a bitmask of 0? So you waited until there was at least one bit. (1 << 2) - 1 = 3 (1 << 1) - 1 = 1 (1 << 0) - 1 = 0 So does GENMASK((BR_BOOLOPT_MAX - 1), 0); really not do the right thing when BR_BOOLOPT_MAX = 1? Andrew
Re: [Bridge] [PATCH net-next v2 1/3] net: bridge: add support for user-controlled bool options
On Sat, Nov 24, 2018 at 06:18:33PM +0200, niko...@cumulusnetworks.com wrote: > On 24 November 2018 18:10:41 EET, Andrew Lunn wrote: > >> +int br_boolopt_toggle(struct net_bridge *br, enum br_boolopt_id opt, > >bool on, > >> +struct netlink_ext_ack *extack) > >> +{ > >> + switch (opt) { > >> + default: > >> + /* shouldn't be called with unsupported options */ > >> + WARN_ON(1); > >> + break; > > > >So you return 0 here, meaning the br_debug() lower down will not > >happen. Maybe return -EOPNOTSUPP? > > > > No, the idea here is that some option in the future might return an error. > This function cannot be called with unsupported option thus the warn. O.K, i was trying to make it easier to see which option caused it to happen. > >> + } > >> + > >> + return 0; > >> +} > >> + > > > >> +int br_boolopt_multi_toggle(struct net_bridge *br, > >> + struct br_boolopt_multi *bm, > >> + struct netlink_ext_ack *extack) > >> +{ > >> + unsigned long bitmap = bm->optmask; > >> + int err = 0; > >> + int opt_id; > >> + > >> + for_each_set_bit(opt_id, &bitmap, BR_BOOLOPT_MAX) { > >> + bool on = !!(bm->optval & BIT(opt_id)); > >> + > >> + err = br_boolopt_toggle(br, opt_id, on, extack); > >> + if (err) { > >> + br_debug(br, "boolopt multi-toggle error: option: %d > >> current: %d > >new: %d error: %d\n", > >> + opt_id, br_boolopt_get(br, opt_id), on, err); > >> + break; > >> + } > >> + } > > > >Does the semantics of extack allow you to return something even when > >there is no error? If there are bits > BR_BOOLOPT_MAX you could return > >0, but also add a warning in extack that some bits where not supported > >by this kernel. > > If we return 0 there's no reason to check extack. Well, the caller can check to see if extack is present, even on success. This is extack, not extnack after all... Andrew
Re: [Bridge] [PATCH net-next v2 3/3] net: bridge: export supported boolopts
On Sat, Nov 24, 2018 at 04:34:22AM +0200, Nikolay Aleksandrov wrote: > Now that we have at least one bool option, we can export all of the > supported bool options via optmask when dumping them. > > v2: new patch > > Signed-off-by: Nikolay Aleksandrov Reviewed-by: Andrew Lunn Andrew
Re: [Bridge] [PATCH net-next v2 2/3] net: bridge: add no_linklocal_learn bool option
On Sat, Nov 24, 2018 at 04:34:21AM +0200, Nikolay Aleksandrov wrote: > Use the new boolopt API to add an option which disables learning from > link-local packets. The default is kept as before and learning is > enabled. This is a simple map from a boolopt bit to a bridge private > flag that is tested before learning. > > v2: pass NULL for extack via sysfs > > Signed-off-by: Nikolay Aleksandrov Reviewed-by: Andrew Lunn Andrew
Re: [Bridge] [PATCH net-next v2 1/3] net: bridge: add support for user-controlled bool options
On Sat, Nov 24, 2018 at 04:34:20AM +0200, Nikolay Aleksandrov wrote: > We have been adding many new bridge options, a big number of which are > boolean but still take up netlink attribute ids and waste space in the skb. > Recently we discussed learning from link-local packets[1] and decided > yet another new boolean option will be needed, thus introducing this API > to save some bridge nl space. > The API supports changing the value of multiple boolean options at once > via the br_boolopt_multi struct which has an optmask (which options to > set, bit per opt) and optval (options' new values). Future boolean > options will only be added to the br_boolopt_id enum and then will have > to be handled in br_boolopt_toggle/get. The API will automatically > add the ability to change and export them via netlink, sysfs can use the > single boolopt function versions to do the same. The behaviour with > failing/succeeding is the same as with normal netlink option changing. > > If an option requires mapping to internal kernel flag or needs special > configuration to be enabled then it should be handled in > br_boolopt_toggle. It should also be able to retrieve an option's current > state via br_boolopt_get. > > v2: WARN_ON() on unsupported option as that shouldn't be possible and > also will help catch people who add new options without handling > them for both set and get. Pass down extack so if an option desires > it could set it on error and be more user-friendly. > > [1] https://www.spinics.net/lists/netdev/msg532698.html > > Signed-off-by: Nikolay Aleksandrov Reviewed-by: Andrew Lunn Andrew
Re: [Bridge] [PATCH net-next] Documentation: networking: Clarify switchdev devices behavior
> +VLAN filtering > +~~ > + > +The Linux bridge allows the configuration of a VLAN filtering mode (compile > and > +run time) which must be observed by the underlying switchdev network > +device/hardware: > + > +- with VLAN filtering turned off: frames ingressing the device with a VID > that > + is not programmed into the bridge/switch's VLAN table must be forwarded. > + > +- with VLAN filtering turned on: frames ingressing the device with a VID > that is > + not programmed into the bridges/switch's VLAN table must be dropped. Hi Florian i forget the details, but there are some difference between VLAN filtering being disabled at compile time, and disabled at runtime. I think the expected behaviour is the same, but the switchdev API usage is slightly different. > +- when IGMP snooping is turned on, multicast traffic must be selectively > flowing > + to the appropriate network ports and not flood the entire switch, that must > + include the CPU/management port. 224.0.0.X/32 should always be flooded, IGMP is optional for those groups in the local subnet. Andrew
Re: [Bridge] [PATCH net-next v4] Documentation: networking: Clarify switchdev devices behavior
> > +IGMP snooping > > +~ > > + > > +The Linux bridge allows the configuration of IGMP snooping (compile and run > > +time) which must be observed by the underlying switchdev network > > device/hardware > > +in the following way: > > + > > +- when IGMP snooping is turned off, multicast traffic must be flooded to > > all > > + switch ports within the same broadcast domain. The CPU/management port > > + should ideally not be flooded and continue to learn multicast traffic > > through > > + the network stack notifications. If the hardware is not capable of doing > > that > > + then the CPU/management port must also be flooded and multicast filtering > > + happens in software. > > + > > +- when IGMP snooping is turned on, multicast traffic must selectively flow > > + to the appropriate network ports (including CPU/management port) and not > > flood > > + the switch. > > + > > +Note: reserved multicast addresses (e.g.: BPDUs) as well as Local Network > > +Control block (224.0.0.0 - 224.0.0.255) do not require IGMP and should > > always > > +be flooded. > > I'm not sure that these paragraphs are actually needed. You're basically > describing RFC 4541 on which the IGMP snooping functionality in the > Linux bridge is based on. Hi Ido My experience talking with people is that IGMP snooping is a bit mystical and not well understood. I would not be surprised if community driver writers, as opposed to vendor driver writers, don't actually know how snooping works. So i find having some hints is good. Andrew
Re: [Bridge] Validation of forward_delay seems wrong...
> Hi Andrew, > The man page is wrong, these have been in USER_HZ scaled clock_t format from > the beginning. > TBH a lot of the time/delay bridge config options are messed up like that. Hi Nikola Yes, that is a mess. arch/alpha/include/asm/param.h:# define USER_HZ 1024 arch/ia64/include/asm/param.h:# define USER_HZ HZ include/asm-generic/param.h:# define USER_HZ 100 And ia64 does # define HZ CONFIG_HZ So it seems pretty hard for user space to get this right in a generic fashion. Andrew
[Bridge] Validation of forward_delay seems wrong...
Hi Nikolay The man page says that the bridge forward_delay is in units of seconds, and should be between 2 and 30. I've tested on a couple of different kernel versions, and this appears to be not working correctly: ip link set br0 type bridge forward_delay 2 RTNETLINK answers: Numerical result out of range ip link set br0 type bridge forward_delay 199 RTNETLINK answers: Numerical result out of range ip link set br0 type bridge forward_delay 200 # ip link set br0 type bridge forward_delay 3000 # ip link set br0 type bridge forward_delay 3001 RTNETLINK answers: Numerical result out of range I've not checked what delay is actually being used here, but clearly something is mixed up. grep HZ .config CONFIG_HZ_PERIODIC=y # CONFIG_NO_HZ_IDLE is not set # CONFIG_NO_HZ_FULL is not set # CONFIG_NO_HZ is not set CONFIG_HZ_FIXED=0 CONFIG_HZ_100=y # CONFIG_HZ_200 is not set # CONFIG_HZ_250 is not set # CONFIG_HZ_300 is not set # CONFIG_HZ_500 is not set # CONFIG_HZ_1000 is not set CONFIG_HZ=100 Thanks Andrew
Re: [Bridge] [PATCH] net: bridge: Allow bridge to joing multicast groups
On Fri, Jul 26, 2019 at 02:02:15PM +0200, Horatiu Vultur wrote: > Hi Nikolay, > > The 07/26/2019 12:26, Nikolay Aleksandrov wrote: > > External E-Mail > > > > > > On 26/07/2019 11:41, Nikolay Aleksandrov wrote: > > > On 25/07/2019 17:21, Horatiu Vultur wrote: > > >> Hi Nikolay, > > >> > > >> The 07/25/2019 16:21, Nikolay Aleksandrov wrote: > > >>> External E-Mail > > >>> > > >>> > > >>> On 25/07/2019 16:06, Nikolay Aleksandrov wrote: > > On 25/07/2019 14:44, Horatiu Vultur wrote: > > > There is no way to configure the bridge, to receive only specific link > > > layer multicast addresses. From the description of the command 'bridge > > > fdb append' is supposed to do that, but there was no way to notify the > > > network driver that the bridge joined a group, because LLADDR was > > > added > > > to the unicast netdev_hw_addr_list. > > > > > > Therefore update fdb_add_entry to check if the NLM_F_APPEND flag is > > > set > > > and if the source is NULL, which represent the bridge itself. Then add > > > address to multicast netdev_hw_addr_list for each bridge interfaces. > > > And then the .ndo_set_rx_mode function on the driver is called. To > > > notify > > > the driver that the list of multicast mac addresses changed. > > > > > > Signed-off-by: Horatiu Vultur > > > --- > > > net/bridge/br_fdb.c | 49 > > > ++--- > > > 1 file changed, 46 insertions(+), 3 deletions(-) > > > > > > > Hi, > > I'm sorry but this patch is wrong on many levels, some notes below. In > > general > > NLM_F_APPEND is only used in vxlan, the bridge does not handle that > > flag at all. > > FDB is only for *unicast*, nothing is joined and no multicast should > > be used with fdbs. > > MDB is used for multicast handling, but both of these are used for > > forwarding. > > The reason the static fdbs are added to the filter is for non-promisc > > ports, so they can > > receive traffic destined for these FDBs for forwarding. > > If you'd like to join any multicast group please use the standard way, > > if you'd like to join > > it only on a specific port - join it only on that port (or ports) and > > the bridge and you'll > > >>> > > >>> And obviously this is for the case where you're not enabling port > > >>> promisc mode (non-default). > > >>> In general you'll only need to join the group on the bridge to receive > > >>> traffic for it > > >>> or add it as an mdb entry to forward it. > > >>> > > have the effect that you're describing. What do you mean there's no > > way ? > > >> > > >> Thanks for the explanation. > > >> There are few things that are not 100% clear to me and maybe you can > > >> explain them, not to go totally in the wrong direction. Currently I am > > >> writing a network driver on which I added switchdev support. Then I was > > >> looking for a way to configure the network driver to copy link layer > > >> multicast address to the CPU port. > > >> > > >> If I am using bridge mdb I can do it only for IP multicast addreses, > > >> but how should I do it if I want non IP frames with link layer multicast > > >> address to be copy to CPU? For example: all frames with multicast > > >> address '01-21-6C-00-00-01' to be copy to CPU. What is the user space > > >> command for that? > > >> > > > > > > Check SIOCADDMULTI (ip maddr from iproute2), f.e. add that mac to the port > > > which needs to receive it and the bridge will send it up automatically > > > since > > > it's unknown mcast (note that if there's a querier, you'll have to make > > > the > > > bridge mcast router if it is not the querier itself). It would also flood > > > it to all > > > > Actually you mentioned non-IP traffic, so the querier stuff is not a > > problem. This > > traffic will always be flooded by the bridge (and also a copy will be > > locally sent up). > > Thus only the flooding may need to be controlled. > > OK, I see, but the part which is not clear to me is, which bridge > command(from iproute2) to use so the bridge would notify the network > driver(using swichdev or not) to configure the HW to copy all the frames > with dmac '01-21-6C-00-00-01' to CPU? So that the bridge can receive > those frames and then just to pass them up. Hi Horatiu Something to keep in mind. My default, multicast should be flooded, and that includes the CPU port for a DSA driver. Adding an MDB entry allows for optimisations, limiting which ports a multicast frame goes out of. But it is just an optimisation. Andrew
Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
> As you properly guessed, this model is quite different from what we are used > to. Yes, it takes a while to get the idea that the hardware is just an accelerator for what the Linux stack can already do. And if the switch cannot do some feature, pass the frame to Linux so it can handle it. You need to keep in mind that there could be other ports in the bridge than switch ports, and those ports might be interested in the multicast traffic. Hence the CPU needs to see the traffic. But IGMP snooping can be used to optimise this. But you still need to be careful, eg. IPv6 Neighbour discovery has often been broken on mv88e6xxx because we have been too aggressive with filtering multicast. Andrew
Re: [Bridge] [PATCH] net: bridge: Allow bridge to joing multicast groups
> Trying to get back to the original problem: > > We have a network which implements the ODVA/DLR ring protocol. This protocol > sends out a beacon frame as often as every 3 us (as far as I recall, default I > believe is 400 us) to this MAC address: 01:21:6C:00:00:01. > > Try take a quick look at slide 10 in [1]. > > If we assume that the SwitchDev driver implemented such that all multicast > traffic goes to the CPU, then we should really have a way to install a HW > offload path in the silicon, such that these packets does not go to the CPU > (as > they are known not to be use full, and a frame every 3 us is a significant > load > on small DMA connections and CPU resources). > > If we assume that the SwitchDev driver implemented such that only "needed" > multicast packets goes to the CPU, then we need a way to get these packets in > case we want to implement the DLR protocol. > > I'm sure that both models can work, and I do not think that this is the main > issue here. > > Our initial attempt was to allow install static L2-MAC entries and append > multiple ports to such an entry in the MAC table. This was rejected, for > several > good reasons it seems. But I'm not sure it was clear what we wanted to > achieve, > and why we find it to be important. Hopefully this is clear with a real world > use-case. > > Any hints or ideas on what would be a better way to solve this problems will > be > much appreciated. I always try to think about how this would work if i had a bunch of discrete network interfaces, not a switch. What APIs are involved in configuring such a system? How does the Linux network stack perform software DLR? How is the reception and blocking of the multicast group performed? Once you understand how it works in the software implement, it should then be more obvious which switchdev hooks should be used to accelerate this using hardware. Andrew
Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
Hi Allan Just throwing out another idea The whole offloading story has been you use the hardware to accelerate what the Linux stack can already do. In this case, you want to accelerate Device Level Ring, DLR. But i've not yet seen a software implementation of DLR. Should we really be considering first adding DLR to the SW bridge? Make it an alternative to the STP code? Once we have a generic implementation we can then look at how it can be accelerated using switchdev. Andrew
Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
> Our plan was to implement this in pure SW, and then look at how to HW offload > it. Great. > But this will take some time before we have anything meaning full to show. > > > Make it an alternative to the STP code? > I'm still working on learning the details of DLR, but I actually believe that > it > in some situations may co-exists with STP ;-) The PDF you linked to suggests this as well. But i think you will need to make some core changes to the bridge. At the moment, STP is a bridge level property. But you are going to need it to be a per-port option. You can then use DLR on the ring ports, and optionally STP on the other ports. > But what we are looking at here, is to offload a > non-aware-(DLR|MRP)-switch which happens to be placed in a network > with these protocols running. So we need to think about why we are passing traffic to the CPU port, and under what conditions can it be blocked. 1) The interface is not part of a bridge. In this case, we only need the switch to pass to the CPU port MC addresses which have been set via set_rx_mode(). I think this case does not apply for what you want. You have two ports bridges together as part of the ring. 2) The interface is part of a bridge. There are a few sub-cases a) IGMP snooping is being performed. We can block multicast where there is no interest in the group. But this is limited to IP multicast. b) IGMP snooping is not being used and all interfaces in the bridge are ports of the switch. IP Multicast can be blocked to the CPU. c) IGMP snooping is not being used and there is a non-switch interface in the bridge. Multicast needed is needed, so it can be flooded out this port. d) set_rx_mode() has been called on the br0 interface, indicating there is interest in the packets on the host. They must be sent to the CPU so they can be delivered locally. e) Does the Multicast MAC address being used by DLR also map to an IP mmulticast address? 01:21:6C:00:00:0[123] appear to be the MAC addresses used by DLR. IPv4 multicast MAC addresses are 01:00:5E:XX:XX:XX. IPv6 multicast MAC addresses are 33:33:XX:XX:XX:XX. So one possibility here is to teach the SW bridge about non-IP multicast addresses. Initially the switch should forward all MAC multicast frames to the CPU. If the frame is not an IPv4 or IPv6 frame, and there has not been a call to set_rx_mode() for the MAC address on the br0 interface, and the bridge only contains switch ports, switchdev could be used to block the multicast to the CPU frame, but forward it out all other ports of the bridge. Andrew
Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
> > 2) The interface is part of a bridge. There are a few sub-cases > > > > a) IGMP snooping is being performed. We can block multicast where > > there is no interest in the group. But this is limited to IP > > multicast. > Agree. And this is done today by installing an explicit offload rule to limit > the flooding of a specific group. > > > b) IGMP snooping is not being used and all interfaces in the bridge > > are ports of the switch. IP Multicast can be blocked to the CPU. > Does it matter if IGMP snooping is used or not? A more general statement could > be: > > - "All interfaces in the bridge are ports of the switch. IP Multicast can be > blocked to the CPU." We have seen IPv6 neighbour discovery break in conditions like this. I don't know the exact details. You also need to watch out for 224.0.0.0/24. This is the link local multicast range. There is no need to join MC addresses in that range. It is assumed they will always be received. So even if IGMP is enabled, you still need to pass some multicast traffic to the CPU. > > So one possibility here is to teach the SW bridge about non-IP > > multicast addresses. Initially the switch should forward all MAC > > multicast frames to the CPU. If the frame is not an IPv4 or IPv6 > > frame, and there has not been a call to set_rx_mode() for the MAC > > address on the br0 interface, and the bridge only contains switch > > ports, switchdev could be used to block the multicast to the CPU > > frame, but forward it out all other ports of the bridge. > Close, but not exactly (due to the arguments above). > > Here is how I see it: > > Teach the SW bridge about non-IP multicast addresses. Initially the switch > should forward all MAC multicast frames to the CPU. Today MDB rules can be > installed (either static or dynamic by IGMP), which limit the flooding of > IPv4/6 > multicast streams. In the same way, we should have a way to install a rule > (FDM/ or MDB) to limit the flooding of L2 multicast frames. > > If foreign interfaces (or br0 it self) is part of the destination list, then > traffic also needs to go to the CPU. > > By doing this, we can for explicitly configured dst mac address: > - limit the flooding on the on the SW bridge interfaces > - limit the flooding on the on the HW bridge interfaces > - prevent them to go to the CPU if they are not needed This is all very complex because of all the different corner cases. So i don't think we want a user API to do the CPU part, we want the network stack to do it. Otherwise the user is going to get is wrong, break their network, and then come running to the list for help. This also fits with how we do things in DSA. There is deliberately no user space concept for configuring the DSA CPU port. To user space, the switch is just a bunch of Linux interfaces. Everything to do with the CPU port is hidden away in the DSA core layer, the DSA drivers, and a little bit in the bridge. Andrew
Re: [PATCH 1/3] net: Add HW_BRIDGE offload feature
> +/* Determin if the SW bridge can be offloaded to HW. Return true if all > + * the interfaces of the bridge have the feature NETIF_F_HW_SWITCHDEV set > + * and have the same netdev_ops. > + */ Hi Horatiu Why do you need these restrictions. The HW bridge should be able to learn that a destination MAC address can be reached via the SW bridge. The software bridge can then forward it out the correct interface. Or are you saying your hardware cannot learn from frames which come from the CPU? Andrew
Re: [Bridge] [PATCH 0/3] Add NETIF_F_HW_BRIDGE feature
> > > Why do the devices have to be from the same driver ? > After seeing yours and Andrews comments I realize that I try to do two > things, but I have only explained one of them. > > Here is what I was trying to do: > A. Prevent ports from going into promisc mode, if it is not needed. The switch definition is promisc is a bit odd. You really need to split it into two use cases. The Linux interface is not a member of a bridge. In this case, promisc mode would mean all frames ingressing the port should be forwarded to the CPU. Without promisc, you can program the hardware to just accept frames with the interfaces MAC address. So this is just the usual behaviour of an interface. When the interface is part of the bridge, then you can turn on all the learning and not forward frames to the CPU, unless the CPU asks for them. But you need to watch out for various flags. By default, you should flood to the CPU, unknown destinations to the CPU etc. But some of these can be turned off by flags. > B. Prevent adding the CPU to the flood-mask (in Ocelot we have a > flood-mask controlling who should be included when flooding due to > destination unknown). So destination unknown should be flooded to the CPU. The CPU might know where to send the frame. > To solve item "B", the network driver needs to detect if there is a > foreign interfaces added to the bridge. If that is the case then to add > the CPU port to the flooding mask otherwise no. It is not just a foreign interface. What about the MAC address on the bridge interface? > > > This is too specific targeting some devices. > Maybe I was wrong to mention specific HW in the commit message. The > purpose of the patch was to add an optimization (not to copy all the > frames to the CPU) for HW that is capable of learning and flooding the > frames. To some extent, this is also tied to your hardware not learning MAC addresses from frames passed from the CPU. You should also consider fixing this. The SW bridge does send out notifications when it adds/removes MAC addresses to its tables. You probably want to receive this modifications, and use them to program your hardware to forward frames to the CPU when needed. Andrew
Re: [PATCH v2 0/3] Add NETIF_F_HW_BR_CAP feature
On Mon, Aug 26, 2019 at 10:11:12AM +0200, Horatiu Vultur wrote: > When a network port is added to a bridge then the port is added in > promisc mode. Some HW that has bridge capabilities(can learn, forward, > flood etc the frames) they are disabling promisc mode in the network > driver when the port is added to the SW bridge. > > This patch adds the feature NETIF_F_HW_BR_CAP so that the network ports > that have this feature will not be set in promisc mode when they are > added to a SW bridge. > > In this way the HW that has bridge capabilities don't need to send all the > traffic to the CPU and can also implement the promisc mode and toggle it > using the command 'ip link set dev swp promisc on' Hi Horatiu I'm still not convinced this is needed. The model is, the hardware is there to accelerate what Linux can do in software. Any peculiarities of the accelerator should be hidden in the driver. If the accelerator can do its job without needing promisc mode, do that in the driver. So you are trying to differentiate between promisc mode because the interface is a member of a bridge, and promisc mode because some application, like pcap, has asked for promisc mode. dev->promiscuity is a counter. So what you can do it look at its value, and how the interface is being used. If the interface is not a member of a bridge, and the count > 0, enable promisc mode in the accelerator. If the interface is a member of a bridge, and the count > 1, enable promisc mode in the accelerator. Andrew
Re: [PATCH v2 0/3] Add NETIF_F_HW_BR_CAP feature
> That sounds like a great idea. I was expecting to add this logic in the > set_rx_mode function of the driver. But unfortunetly, I got the calls to > this function before the dev->promiscuity is updated or not to get the > call at all. For example in case the port is member of a bridge and I try > to enable the promisc mode. Hi Horatiu What about the notifier? Is it called in all the conditions you need to know about? Or, you could consider adding a new switchdev call to pass this information to any switchdev driver which is interested in the information. At the moment, the DSA driver core does not pass onto the driver it should put a port into promisc mode. So pcap etc, will only see traffic directed to the CPU, not all the traffic ingressing the interface. If you put the needed core infrastructure into place, we could plumb it down from the DSA core to DSA drivers. Having said that, i don't actually know if the Marvell switches support this. Forward using the ATU and send a copy to the CPU? What switches tend to support is port mirroring, sending all the traffic out another port. A couple of DSA drivers support that, via TC. Andrew
Re: [Bridge] [RFC net-next Patch 0/3] net: bridge: mrp: Add support for Media Redundancy Protocol(MRP)
> I think it would help your case if you explained a bit more about > the hw offload primitives you have implemented internally. Agreed. Horatiu, could you also give some references to the frames that need to be sent. I've no idea what information they need to contain, if the contents is dynamic, or static, etc. Andrew
Re: [Bridge] [RFC net-next Patch 0/3] net: bridge: mrp: Add support for Media Redundancy Protocol(MRP)
> > Horatiu, could you also give some references to the frames that need > > to be sent. I've no idea what information they need to contain, if the > > contents is dynamic, or static, etc. > It is dynamic - but trivial... If it is trivial, i don't see why you are so worried about abstracting it? > Here is a dump from WireShark with > annotation on what our HW can update: > > Ethernet II, Src: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1), Dst: Iec_00:00:01 > (01:15:4e:00:00:01) > Destination: Iec_00:00:01 (01:15:4e:00:00:01) > Source: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1) > Type: MRP (0x88e3) > PROFINET MRP MRP_Test, MRP_Common, MRP_End > MRP_Version: 1 > MRP_TLVHeader.Type: MRP_Test (0x02) > MRP_TLVHeader.Type: MRP_Test (0x02) > MRP_TLVHeader.Length: 18 > MRP_Prio: 0x1f40 High priorities > MRP_SA: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1) > MRP_PortRole: Primary ring port (0x) > MRP_RingState: Ring closed (0x0001) > MRP_Transition: 0x0001 > MRP_TimeStamp [ms]: 0x000cf574 <-- Updated > automatic > MRP_TLVHeader.Type: MRP_Common (0x01) > MRP_TLVHeader.Type: MRP_Common (0x01) > MRP_TLVHeader.Length: 18 > MRP_SequenceID: 0x00e9 <-- Updated > automatic > MRP_DomainUUID: ---- > MRP_TLVHeader.Type: MRP_End (0x00) > MRP_TLVHeader.Type: MRP_End (0x00) > MRP_TLVHeader.Length: 0 > > But all the fields can change, but to change the other fields we need to > interact with the HW. Other SoC may have other capabilities in their > offload. As an example, if the ring becomes open then the fields > MRP_RingState and MRP_Transition need to change and in our case this > requires SW interference. Isn't SW always required? You need to tell your state machine that the state has changed. > Would you like a PCAP file as an example? Or do you want a better > description of the frame format. I was hoping for a link to an RFC, or some standards document. Andrew
Re: [Bridge] [RFC net-next Patch 0/3] net: bridge: mrp: Add support for Media Redundancy Protocol(MRP)
On Fri, Jan 10, 2020 at 09:12:48PM +0100, Horatiu Vultur wrote: > The 01/10/2020 18:56, Andrew Lunn wrote: > > EXTERNAL EMAIL: Do not click links or open attachments unless you know the > > content is safe > > > > > > Horatiu, could you also give some references to the frames that need > > > > to be sent. I've no idea what information they need to contain, if the > > > > contents is dynamic, or static, etc. > > > It is dynamic - but trivial... > > > > If it is trivial, i don't see why you are so worried about abstracting > > it? > Maybe we misunderstood each other. When you asked if it is dynamic or > static, I thought you meant if it is the same frame being send repeated > or if it needs to be changed. It needs to be changed but the changes are > trivial, but it means that a non-MRP aware frame generator can't > properly offload this. The only frame generator i've ever seen are for generating test packets. They generally have random content, random length, maybe the option to send invalid CRC etc. These are never going to work for MRP. So we should limit our thinking to hardware specifically designed for MRP offload. What we need to think about is an abstract model for MRP offload. What must such a bit of hardware do? What parameters do we need to pass to it? When should it interrupt us because some event has happened? Once we have an abstract model, we can define netlink messages, or devlink messages. And you can implement driver code which takes this abstract model and implements it for your real hardware. And if you have the abstract model correct, other vendors should also be able to implement drivers as well. Since this is a closed standard, there is not much the rest of us can do. You need to define this abstract model. We can then review it. Andrew
Re: [Bridge] [RFC net-next Patch v2 4/4] net: bridge: mrp: switchdev: Add HW offload
On Mon, Jan 13, 2020 at 01:46:20PM +0100, Horatiu Vultur wrote: > +#ifdef CONFIG_BRIDGE_MRP > +/* SWITCHDEV_OBJ_ID_PORT_MRP */ > +struct switchdev_obj_port_mrp { > + struct switchdev_obj obj; > + struct net_device *port; > + u32 ring_nr; > +}; > + > +#define SWITCHDEV_OBJ_PORT_MRP(OBJ) \ > + container_of((OBJ), struct switchdev_obj_port_mrp, obj) > + > +/* SWITCHDEV_OBJ_ID_RING_TEST_MRP */ > +struct switchdev_obj_ring_test_mrp { > + struct switchdev_obj obj; > + /* The value is in us and a value of 0 represents to stop */ > + u32 interval; > + u8 max; > + u32 ring_nr; > +}; > + > +#define SWITCHDEV_OBJ_RING_TEST_MRP(OBJ) \ > + container_of((OBJ), struct switchdev_obj_ring_test_mrp, obj) > + > +/* SWITCHDEV_OBJ_ID_RING_ROLE_MRP */ > +struct switchdev_obj_ring_role_mrp { > + struct switchdev_obj obj; > + u8 ring_role; > + u32 ring_nr; > +}; Hi Horatiu The structures above should give me enough information to build this, correct? Ethernet II, Src: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1), Dst: Iec_00:00:01 (01:15:4e:00:00:01) Destination: Iec_00:00:01 (01:15:4e:00:00:01) Source: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1) Type: MRP (0x88e3) PROFINET MRP MRP_Test, MRP_Common, MRP_End MRP_Version: 1 MRP_TLVHeader.Type: MRP_Test (0x02) MRP_TLVHeader.Type: MRP_Test (0x02) MRP_TLVHeader.Length: 18 MRP_Prio: 0x1f40 High priorities MRP_SA: 7a:8b:b1:35:96:e1 (7a:8b:b1:35:96:e1) MRP_PortRole: Primary ring port (0x) MRP_RingState: Ring closed (0x0001) MRP_Transition: 0x0001 MRP_TimeStamp [ms]: 0x000cf574 <-- Updated automatic MRP_TLVHeader.Type: MRP_Common (0x01) MRP_TLVHeader.Type: MRP_Common (0x01) MRP_TLVHeader.Length: 18 MRP_SequenceID: 0x00e9 <-- Updated automatic MRP_DomainUUID: ---- MRP_TLVHeader.Type: MRP_End (0x00) MRP_TLVHeader.Type: MRP_End (0x00) MRP_TLVHeader.Length: 0 There are a couple of fields i don't see. MRP_SA, MRP_Transition. What are max and ring_nr used for? Do you need to set the first value MRP_SequenceID uses? Often, in order to detect a reset, a random value is used to initialise the sequence number. Also, does the time stamp need initializing? Thanks Andrew
Re: [Bridge] [RFC net-next Patch v2 4/4] net: bridge: mrp: switchdev: Add HW offload
Hi Horatiu It has been said a few times what the basic state machine should be in user space. A pure software solution can use raw sockets to send and receive MRP_Test test frames. When considering hardware acceleration, the switchdev API you have proposed here seems quite simple. It should not be too hard to map it to a set of netlink messages from userspace. Yet your argument for kernel, not user space, is you are worried about the parameters which need to be passed to the hardware offload engine. In order to win the argument for a kernel solution, we are going to need a better idea what you think this problem is. The MRP_Test is TLV based. Are there other things which could be in this message? Is that what you are worried about? Thanks Andrew
Re: [Bridge] [RFC net-next Patch v2 4/4] net: bridge: mrp: switchdev: Add HW offload
On Tue, Jan 14, 2020 at 09:08:56AM +0100, Horatiu Vultur wrote: > The 01/14/2020 00:30, Andrew Lunn wrote: > > > > Hi Horatiu > > > > It has been said a few times what the basic state machine should be in > > user space. A pure software solution can use raw sockets to send and > > receive MRP_Test test frames. When considering hardware acceleration, > > the switchdev API you have proposed here seems quite simple. It should > > not be too hard to map it to a set of netlink messages from userspace. > > Yes and we will try to go with this approach, to have a user space > application that contains the state machines and then in the kernel to > extend the netlink messages to map to the switchdev API. > So we will create a new RFC once we will have the user space and the > definition of the netlink messages. Cool. Before you get too far, we might want to discuss exactly how you pass these netlink messages. Do we want to make this part of the new ethtool Netlink implementation? Part of devlink? Extend the current bridge netlink interface used by userspae RSTP daemons? A new generic netlink socket? Extending the bridge netlink interface might seem the most logical. The argument against it, is that the kernel bridge code probably does not need to know anything about this offloading. But it does allow you to make use of the switchdev API, so we have a uniform API between the network stack and drivers implementing offloading. Andrew
Re: [Bridge] [RFC net-next v3 02/10] net: bridge: mrp: Expose function br_mrp_port_open
On Fri, Jan 24, 2020 at 05:18:20PM +0100, Horatiu Vultur wrote: > In case the HW is capable to detect when the MRP ring is open or closed. It is > expected that the network driver will notify the bridge that the ring is open > or > closed. > > The function br_mrp_port_open is used to notify the kernel that one of the > ports > stopped receiving MRP_Test frames. The argument 'loc' has a value of '1' when > the port stopped receiving MRP_Test and '0' when it started to receive > MRP_Test. Hi Horatiu Given the name of the function, br_mrp_port_open(), how about replacing loc with a bool with the name open? Andrew
Re: [Bridge] [RFC net-next v3 03/10] net: bridge: mrp: Add MRP interface used by netlink
> br_mrp_flush - will flush the FDB. How does this differ from a normal bridge flush? I assume there is a way for user space to flush the bridge FDB. Andrew
Re: [Bridge] [RFC net-next v3 04/10] net: bridge: mrp: Add generic netlink interface to configure MRP
On Fri, Jan 24, 2020 at 05:18:22PM +0100, Horatiu Vultur wrote: > Implement the generic netlink interface to configure MRP. The implementation > will do sanity checks over the attributes and then eventually call the MRP > interface which eventually will call the switchdev API. Hi Horatiu What was your thinking between adding a new generic netlink interface, and extending the current one? I've not looked at your user space code yet, but i assume it has to make use of both? It needs to create the bridge and add the interfaces. And then it needs to control the MRP state. Allan mentioned you might get around to implementing 802.1CB? Would that be another generic netlink interface, or would you extend the MRP interface? Thanks Andrew
Re: [Bridge] [RFC net-next v3 09/10] net: bridge: mrp: Integrate MRP into the bridge
On Fri, Jan 24, 2020 at 05:18:27PM +0100, Horatiu Vultur wrote: > To integrate MRP into the bridge, the bridge needs to do the following: > - initialized and destroy the generic netlink used by MRP > - detect if the MRP frame was received on a port that is part of a MRP ring. > In > case it was not, then forward the frame as usual, otherwise redirect the > frame > to the upper layer. > > Signed-off-by: Horatiu Vultur > --- > net/bridge/br.c | 11 +++ > net/bridge/br_device.c | 3 +++ > net/bridge/br_if.c | 6 ++ > net/bridge/br_input.c | 14 ++ > net/bridge/br_private.h | 14 ++ > 5 files changed, 48 insertions(+) > > diff --git a/net/bridge/br.c b/net/bridge/br.c > index b6fe30e3768f..d5e556eed4ba 100644 > --- a/net/bridge/br.c > +++ b/net/bridge/br.c > @@ -344,6 +344,12 @@ static int __init br_init(void) > if (err) > goto err_out5; > > +#ifdef CONFIG_BRIDGE_MRP > + err = br_mrp_netlink_init(); > + if (err) > + goto err_out6; > +#endif Please try to avoid #ifdef's like this in C code. Add a stub function to br_private_mrp.h. If you really cannot avoid #ifdef, please use #if IS_ENABLED(CONFIG_BRIDGE_MRP). That expands to if (0) { } So the compiler will compile it and then optimize it out. That gives us added benefit of build testing, we don't suddenly find the code no longer compiles when we enable the option. > --- a/net/bridge/br_input.c > +++ b/net/bridge/br_input.c > @@ -21,6 +21,9 @@ > #include > #include "br_private.h" > #include "br_private_tunnel.h" > +#ifdef CONFIG_BRIDGE_MRP > +#include "br_private_mrp.h" > +#endif It should always be safe to include a header file. Andrew
Re: [Bridge] [RFC net-next v3 03/10] net: bridge: mrp: Add MRP interface used by netlink
On Sat, Jan 25, 2020 at 12:37:26PM +0100, Horatiu Vultur wrote: > The 01/24/2020 18:43, Andrew Lunn wrote: > > EXTERNAL EMAIL: Do not click links or open attachments unless you know the > > content is safe > > > > > br_mrp_flush - will flush the FDB. > > > > How does this differ from a normal bridge flush? I assume there is a > > way for user space to flush the bridge FDB. > > Hi, > > If I seen corectly the normal bridge flush will clear the entire FDB for > all the ports of the bridge. In this case it is require to clear FDB > entries only for the ring ports. Maybe it would be better to extend the current bridge netlink call to be able to pass an optional interface to be flushed? I'm not sure it is a good idea to have two APIs doing very similar things. Andrew
Re: [Bridge] [RFC net-next v3 09/10] net: bridge: mrp: Integrate MRP into the bridge
> br_netif_receive_skb(struct net *net, struct sock *sk, struct sk_buff *skb) > @@ -338,6 +341,17 @@ rx_handler_result_t br_handle_frame(struct sk_buff > **pskb) > return RX_HANDLER_CONSUMED; > } > } > +#ifdef CONFIG_BRIDGE_MRP > + /* If there is no MRP instance do normal forwarding */ > + if (!p->mrp_aware) > + goto forward; > + > + if (skb->protocol == htons(ETH_P_MRP)) > + return RX_HANDLER_PASS; What MAC address is used for these MRP frames? It would make sense to use a L2 link local destination address, since i assume they are not supposed to be forwarded by the bridge. If so, you could extend the if (unlikely(is_link_local_ether_addr(dest))) condition. > + > + if (p->state == BR_STATE_BLOCKING) > + goto drop; > +#endif Is this needed? The next block of code is a switch statement on p->state. The default case, which BR_STATE_BLOCKING should hit, is drop. This function is on the hot path. So we should try to optimize it as much as possible. Andrew
Re: [Bridge] [RFC net-next v3 00/10] net: bridge: mrp: Add support for Media Redundancy Protocol (MRP)
> Lets say that the link between H1 and H2 goes down: > > +--+ > | | > +-->|H1|<--- / --->|H2|<-->|H3|<--+ > eth0eth1eth0eth1eth0eth1 > > H1 will now observe that the test packets it sends on eth1, is not > received in eth0, meaninf that the ring is open Hi Allan Is H1 the only device sending test packets? It is assumed that H2 and H3 will forward them? Or does each device send test packets, and when it stops hearing these packets from a neighbour, it assumes the link is open? Andrew
Re: [Bridge] [RFC net-next v3 06/10] net: bridge: mrp: switchdev: Extend switchdev API to offload MRP
> SWITCHDEV_OBJ_ID_RING_TEST_MRP: This is used when to start/stop sending > MRP_Test frames on the mrp ring ports. This is called only on nodes that > have > the role Media Redundancy Manager. How do you handle the 'headless chicken' scenario? User space tells the port to start sending MRP_Test frames. It then dies. The hardware continues sending these messages, and the neighbours thinks everything is O.K, but in reality the state machine is dead, and when the ring breaks, the daemon is not there to fix it? And it is not just the daemon that could die. The kernel could opps or deadlock, etc. For a robust design, it seems like SWITCHDEV_OBJ_ID_RING_TEST_MRP should mean: start sending MRP_Test frames for the next X seconds, and then stop. And the request is repeated every X-1 seconds. Andrew
Re: [Bridge] [RFC net-next v3 03/10] net: bridge: mrp: Add MRP interface used by netlink
> We could do also that. The main reason why I have added a new generic > netlink was that I thought it would be clearer what commands are for MRP > configuration. The naming makes this clear, having _MRP_ in the attribute names etc. But it would be good have input from the Bridge maintainers. Andrew
Re: [Bridge] [RFC net-next v3 06/10] net: bridge: mrp: switchdev: Extend switchdev API to offload MRP
On Sun, Jan 26, 2020 at 02:22:13PM +0100, Horatiu Vultur wrote: > The 01/25/2020 17:35, Andrew Lunn wrote: > > EXTERNAL EMAIL: Do not click links or open attachments unless you know the > > content is safe > > > > > SWITCHDEV_OBJ_ID_RING_TEST_MRP: This is used when to start/stop sending > > > MRP_Test frames on the mrp ring ports. This is called only on nodes > > > that have > > > the role Media Redundancy Manager. > > > > How do you handle the 'headless chicken' scenario? User space tells > > the port to start sending MRP_Test frames. It then dies. The hardware > > continues sending these messages, and the neighbours thinks everything > > is O.K, but in reality the state machine is dead, and when the ring > > breaks, the daemon is not there to fix it? > > > > And it is not just the daemon that could die. The kernel could opps or > > deadlock, etc. > > > > For a robust design, it seems like SWITCHDEV_OBJ_ID_RING_TEST_MRP > > should mean: start sending MRP_Test frames for the next X seconds, and > > then stop. And the request is repeated every X-1 seconds. > > I totally missed this case, I will update this as you suggest. Hi Horatiu What does your hardware actually provide? Given the design of the protocol, if the hardware decides the OS etc is dead, it should stop sending MRP_TEST frames and unblock the ports. If then becomes a 'dumb switch', and for a short time there will be a broadcast storm. Hopefully one of the other nodes will then take over the role and block a port. Andrew
Re: [Bridge] [RFC net-next v3 09/10] net: bridge: mrp: Integrate MRP into the bridge
On Sun, Jan 26, 2020 at 02:01:11PM +0100, Horatiu Vultur wrote: > The 01/25/2020 17:16, Andrew Lunn wrote: > > EXTERNAL EMAIL: Do not click links or open attachments unless you know the > > content is safe > > > > > br_netif_receive_skb(struct net *net, struct sock *sk, struct sk_buff > > > *skb) > > > @@ -338,6 +341,17 @@ rx_handler_result_t br_handle_frame(struct sk_buff > > > **pskb) > > > return RX_HANDLER_CONSUMED; > > > } > > > } > > > +#ifdef CONFIG_BRIDGE_MRP > > > + /* If there is no MRP instance do normal forwarding */ > > > + if (!p->mrp_aware) > > > + goto forward; > > > + > > > + if (skb->protocol == htons(ETH_P_MRP)) > > > + return RX_HANDLER_PASS; > > > > What MAC address is used for these MRP frames? It would make sense to > > use a L2 link local destination address, since i assume they are not > > supposed to be forwarded by the bridge. If so, you could extend the > > if (unlikely(is_link_local_ether_addr(dest))) condition. > > The MAC addresses used by MRP frames are: > 0x1, 0x15, 0x4e, 0x0, 0x0, 0x1 - used by MRP_Test frames > 0x1, 0x15, 0x4e, 0x0, 0x0, 0x2 - used by the rest of MRP frames. > > If we will add support also for MIM/MIC. These requires 2 more MAC > addresses: > 0x1, 0x15, 0x4e, 0x0, 0x0, 0x3 - used by MRP_InTest frames. > 0x1, 0x15, 0x4e, 0x0, 0x0, 0x4 - used by the other MRP interconnect > frames. Hi Horatiu I made the wrong guess about how this protocol worked when i said L2 link local. These MAC addresses are L2 multicast. And you are using a raw socket to receive them into userspace when needed. 'Thinking allowed' here. +--+ | | +-->|H1|<-->|H2|<-->|H3|<--+ eth0eth1eth0eth1eth0eth1 ^ | Blocked There are three major classes of user case here: 1) Pure software solution You need the software bridge in the client to forward these frames from the left side to the right side. (Does the standard give these two ports names)? In the master, the left port is blocked, so the bridge drops them anyway. You have a RAW socket open on both eth0 and eth1, so you get to see the frames, even if the bridge drops them. 2) Hardware offload to an MRP unaware switch. I'm thinking about a plain switch supported by DSA, Marvell, Broadcom, etc. It has no special knowledge of MRP. Ideally, you want the switch to forward MRP_Test frames left to right for a client. In a master, i think you have a problem, since the port is blocked. The hardware is unlikely to recognise these frames as special, since they are not in the 01-80-C2-XX-XX-XX block, and let them through. So your raw socket is never going to see them, and you cannot detect open/closed ring. I don't know how realistic it is to support MRP in this case, and i also don't think you can fall back to a pure software solution, because the software bridge is going to offload the basic bridge operation to the hardware. It would be nice if you could detect this, and return -EOPNOTSUPP. 3) Hardware offload to an MRP aware switch. For a client, you tell it which port is left, which is right, and assume it forwards the frames. For a master, you again tell it which is left, which is right, and ask it send MRP_Test frames out right, and report changes in open/closed on the right port. You don't need the CPU to see the MRP_Test frames, so the switch has no need to forward them to the CPU. We should think about the general case of a bridge with many ports, and many pairs of ports using MRP. This makes the forwarding of these frames interesting. Given that they are multicast, the default action of the software bridge is that it will flood them. Does the protocol handle seeing MRP_Test from some other loop? Do we need to avoid this? You could avoid this by adding MDB entries to the bridge. However, this does not scale to more then one ring. I don't think an MDB is associated to an ingress port. So you cannot say 0x1, 0x15, 0x4e, 0x0, 0x0, 0x1 ingress port1 egress port2 0x1, 0x15, 0x4e, 0x0, 0x0, 0x1 ingress port3 egress port4 The best you can say is 0x1, 0x15, 0x4e, 0x0, 0x0, 0x1 egress port2, port4 I'm sure there are other issues i'm missing, but it is interesting to think about all this. Andrew
Re: [Bridge] [RFC net-next v3 09/10] net: bridge: mrp: Integrate MRP into the bridge
> > 'Thinking allowed' here. > > > > +--+ > > | | > > +-->|H1|<-->|H2|<-->|H3|<--+ > > eth0eth1eth0eth1eth0eth1 > > ^ > > | > > Blocked > > > > > > There are three major classes of user case here: > > > > 1) Pure software solution > > You need the software bridge in the client to forward these frames > > from the left side to the right side. > As far as I understand it is not the bridge which forward these frames - > it is the user-space tool. This was to put as much functionality in > user-space and only use the kernel to configure the HW. We can (and > should) discuss if this is the right decision. So i need to flip the point around. How does the software switch know not to forward the frames? Are you adding an MDB? > We would properly have better performance if we do this in kernel-space. Yes, that is what i think. And if you can do it without any additional code, using the forwarding tables, so much the better. > BTW: It is not only from left to right, it is also from right to left. > The MRM will inject packets on both ring ports, and monitor both. Using the same MAC address in both directions? I need to think what that implies for MDB entries. It probably just works, since you never flood back out the ingress port. > Again, I do not know how other HW is designed, but all the SOC's we are > working with, does allow us to add a TCAM rule which can redirect these > frames to the CPU even on a blocked port. It is not in scope for what you are doing, but i wonder how we describe this in a generic Linux way? And then how we push it down to the hardware? For the Marvell Switches, it might be possible to do this without the TCAM. You can add forwarding DB entries marked as Management. It is unclear if this overrides the blocked state, but it would be a bit odd if it did not. > > You could avoid this by adding MDB entries to the bridge. However, > > this does not scale to more then one ring. > I would prefer a solution where the individual drivers can do what is > best on the given HW. The nice thing about adding MDB is that it is making use of the software bridge facilities. In general, the software bridge and hardware bridges are pretty similar. If you can solve the problem using generic software bridge features, not additional special cases in code, you have good chance of being able to offload it to a hardware bridge which is not MRP aware. The switchdev API for MRP specific features should then allow you to make use of any additional features the hardware might have. > Yes, the solution Horatiu has chosen, is not to forward MRP frames, > received in MRP ring ports at all. This is done by the user-space tool. > > Again, not sure if this is the right way to do it, but it is what patch > v3 does. > > The alternative to this would be to learn the bridge how to forward MRP > frames when it is a MRC. The user-space tool then never needs to do > this, it know that the kernel will take care of this part (either in SW > or in HW). I think that should be considered. I'm not saying it is the best way, just that some thought should be put into it to figure out what it actually implies. Andrew
Re: [Bridge] [RFC net-next v3 06/10] net: bridge: mrp: switchdev: Extend switchdev API to offload MRP
On Mon, Jan 27, 2020 at 03:26:38PM +0100, Jürgen Lambrecht wrote: > On 1/27/20 12:04 PM, Allan W. Nielsen wrote: > > > How do you handle the 'headless chicken' scenario? User space > tells > > the port to start sending MRP_Test frames. It then dies. The > hardware > > Andrew, I am a bit confused here - maybe I missed an email-thread, I'm sorry > then. > > In previous emails you and others talked about hardware support to send > packets > (inside the switch). But somebody also talked about data-plane and > control-plane (about STP in-kernel being a bad idea), and that data-plane is > in-kernel, and control plane is a mrp-daemon (in user space). > And in my mind, the "hardware" you mention is a frame-injector and can be both > real hardware and a driver in the kernel. > > Do I see it right? Hi Jürgen It i still unclear where the MRP_Test frames should be generated, forward and consumed, either in kernel, or in user space. The userspace RSTP daemon generates and consumes all the BPDUs in userspace. But BPDUs are never forwarded. However MRP_Test frames are forwarded by client nodes. Are the MRP_Test frames then part of the data plane in a client? What i think is clear is that the state machine is in user space. For the rest, we are still exploring possibilities. Andrew
Re: [Bridge] [RFC PATCH net-next] net: bridge: fix client roaming from DSA user port
On Mon, Apr 20, 2020 at 12:19:46AM +0800, DENG Qingfang wrote: > When a client roams from a DSA user port to a soft-bridged port (such as WiFi > interface), the left-over MAC entry in the switch HW is not deleted, causing > inconsistency between Linux fdb and the switch MAC table. As a result, the > client cannot talk to other hosts which are on that DSA user port until the > MAC entry expires. > > Solve this by notifying switchdev fdb to delete the leftover entry when an > entry is updated. Remove the added_by_user check in DSA > > Signed-off-by: DENG Qingfang > --- > I tried this on mt7530 and mv88e6xxx, but only mt7530 works. > In previous discussion[1], Andrew Lunn said "try playing with auto learning > for the CPU port" but it didn't work on mv88e6xxx either Hi Deng We should probably first define how we expect moving MAC to work. Then we can make any core fixes, and driver fixes. For DSA, we have assumed that the software bridge and the hardware bridge are independent, each performs its own learning. Only static entries are kept in sync. How should this separate learning work for a MAC address which moves? Andrew
Re: [Bridge] [RFC PATCH net-next] net: bridge: fix client roaming from DSA user port
> When a client moves from a hardware port (e.g. sw0p1) to a software port > (wlan0) > or another hardware port that belongs to a different switch (sw1p1), > that MAC entry > in sw0's MAC table should be deleted, or replaced with the CPU port as > destination, > by DSA. Otherwise the client is unable to talk to other hosts on sw0 because > sw0 > still thinks the client is on sw0p1. The MAC address needs to move, no argument there. But what are the mechanisms which cause this. Is learning sufficient, or does DSA need to take an active role? Forget about DSA for the moment. How does this work for two normal bridges? Is the flow of unicast packets sufficient? Does it requires a broadcast packet from the device after it moves? Andrew
Re: [Bridge] [RFC PATCH net-next] net: bridge: fix client roaming from DSA user port
On Wed, Apr 22, 2020 at 02:01:28PM +0800, Chuanhong Guo wrote: > Hi! > > On Tue, Apr 21, 2020 at 12:36 AM Andrew Lunn wrote: > > > > The MAC address needs to move, no argument there. But what are the > > mechanisms which cause this. Is learning sufficient, or does DSA need > > to take an active role? > > cpu port learning will break switch operation if for whatever reason > we want to disable bridge offloading (e.g. ebtables?). In this case > a packet received from cpu port need to be sent back through > cpu port to another switch port, and the switch will learn from this > packet incorrectly. > > If we want cpu port learning to kick in, we need to make sure that: > 1. When bridge offload is enabled, the switch takes care of packet > flooding on switch ports itself, instead of flooding with software > bridge. Hi Chuanhong This is what the skb->offload_fwd_mark is all about. If this is set to 1, it means the switch has done all the forwarding needed for ports in that switch. Most of the tag drivers set this unconditionally true. > 2. Software bridge shouldn't forward any packet between ports > on the same switch. If skb->offload_fwd_mark is true, it won't. > 3. cpu port learning should only be enabled when bridge > offloading is used. So it should be safe for most switch drivers. And the ones which don't set offload_fwd_mark are probably relying of software bridging, or are broken and replicating frames. > It doesn't have to be a broadcast packet but it needs a packet to go > through both bridges. > > Say we have bridge A and bridge B, port A1 and B1 are connected > together and a device is on port A2 first: > Bridge A knows that this device is on port A2 and will forward traffic > through A1 to B1 if needed. Bridge B sees these packets and knows > device is on port B1. > When the device move from A2 to B2, B updates its fdb and if a > packet reaches A, A will update its fdb too. The issue here is 'if a packet reaches A'. B might have no reason to send a unicast packet to A, if none of the destinations the device is talking to is reached via A. Which is why i think a broadcast/multicast packet is more likely to be involved. Andrew
Re: [Bridge] [PATCH v3 net-next 1/7] net: bridge: notify switchdev of disappearance of old FDB entry upon migration
On Sun, Dec 13, 2020 at 04:07:04PM +0200, Vladimir Oltean wrote: > Currently the bridge emits atomic switchdev notifications for > dynamically learnt FDB entries. Monitoring these notifications works > wonders for switchdev drivers that want to keep their hardware FDB in > sync with the bridge's FDB. Reviewed-by: Andrew Lunn Andrew
Re: [Bridge] [PATCH v3 net-next 2/7] net: dsa: be louder when a non-legacy FDB operation fails
On Sun, Dec 13, 2020 at 04:07:05PM +0200, Vladimir Oltean wrote: > The dev_close() call was added in commit c9eb3e0f8701 ("net: dsa: Add > support for learning FDB through notification") "to indicate inconsistent > situation" when we could not delete an FDB entry from the port. > > bridge fdb del d8:58:d7:00:ca:6d dev swp0 self master > > It is a bit drastic and at the same time not helpful if the above fails > to only print with netdev_dbg log level, but on the other hand to bring > the interface down. > > So increase the verbosity of the error message, and drop dev_close(). > > Signed-off-by: Vladimir Oltean Reviewed-by: Andrew Lunn Andrew
Re: [Bridge] [PATCH v3 net-next 5/7] net: dsa: exit early in dsa_slave_switchdev_event if we can't program the FDB
On Sun, Dec 13, 2020 at 04:07:08PM +0200, Vladimir Oltean wrote: > Right now, the following would happen for a switch driver that does not > implement .port_fdb_add or .port_fdb_del. > > dsa_slave_switchdev_event returns NOTIFY_OK and schedules: > -> dsa_slave_switchdev_event_work >-> dsa_port_fdb_add > -> dsa_port_notify(DSA_NOTIFIER_FDB_ADD) > -> dsa_switch_fdb_add > -> if (!ds->ops->port_fdb_add) return -EOPNOTSUPP; >-> an error is printed with dev_dbg, and > dsa_fdb_offload_notify(switchdev_work) is not called. > > We can avoid scheduling the worker for nothing and say NOTIFY_DONE. > Because we don't call dsa_fdb_offload_notify, the static FDB entry will > remain just in the software bridge. > > Signed-off-by: Vladimir Oltean > Reviewed-by: Florian Fainelli Reviewed-by: Andrew Lunn Andrew
Re: [Bridge] [PATCH v3 net-next 6/7] net: dsa: listen for SWITCHDEV_{FDB, DEL}_ADD_TO_DEVICE on foreign bridge neighbors
On Sun, Dec 13, 2020 at 04:07:09PM +0200, Vladimir Oltean wrote: > Some DSA switches (and not only) cannot learn source MAC addresses from > packets injected from the CPU. They only perform hardware address > learning from inbound traffic. Reviewed-by: Andrew Lunn Andrew
Re: [Bridge] [PATCH v3 net-next 7/7] net: dsa: ocelot: request DSA to fix up lack of address learning on CPU port
On Sun, Dec 13, 2020 at 04:07:10PM +0200, Vladimir Oltean wrote: > Given the following setup: > > ip link add br0 type bridge > ip link set eno0 master br0 > ip link set swp0 master br0 > ip link set swp1 master br0 > ip link set swp2 master br0 > ip link set swp3 master br0 > > Currently, packets received on a DSA slave interface (such as swp0) > which should be routed by the software bridge towards a non-switch port > (such as eno0) are also flooded towards the other switch ports (swp1, > swp2, swp3) because the destination is unknown to the hardware switch. > > This patch addresses the issue by monitoring the addresses learnt by the > software bridge on eno0, and adding/deleting them as static FDB entries > on the CPU port accordingly. > > Signed-off-by: Vladimir Oltean > Reviewed-by: Florian Fainelli Reviewed-by: Andrew Lunn Andrew
Re: [Bridge] [PATCH net] net: mrp: use stp state as substitute for unimplemented mrp state
On Tue, Jan 19, 2021 at 09:32:40AM +0100, Horatiu Vultur wrote: > The 01/18/2021 21:27, Vladimir Oltean wrote: > > EXTERNAL EMAIL: Do not click links or open attachments unless you know the > > content is safe > > > > On Mon, Jan 18, 2021 at 09:20:36PM +0100, Horatiu Vultur wrote: > > > The 01/18/2021 19:46, Vladimir Oltean wrote: > > > > > > > > On Mon, Jan 18, 2021 at 07:56:18PM +0100, Horatiu Vultur wrote: > > > > > The reason was to stay away from STP, because you can't run these two > > > > > protocols at the same time. Even though in SW, we reuse port's state. > > > > > In our driver(which is not upstreamed), we currently implement > > > > > SWITCHDEV_ATTR_ID_MRP_PORT_STATE and just call the > > > > > SWITCHDEV_ATTR_ID_PORT_STP_STATE. > > > > > > > > And isn't Rasmus's approach reasonable, in that it allows unmodified > > > > switchdev drivers to offload MRP port states without creating > > > > unnecessary code churn? > > > > > > I am sorry but I don't see this as the correct solution. In my opinion, > > > I would prefer to have 3 extra lines in the driver and have a better > > > view of what is happening. Than having 2 calls in the driver for > > > different protocols. > > > > I think the question boils down to: is a MRP-unaware driver expected to > > work with the current bridge MRP code? > > If the driver has switchdev support, then is not expected to work with > the current bridge MRP code. > > For example, the Ocelot driver, it has switchdev support but no MRP > support so this is not expected to work. Then ideally, we need switchdev core to be testing for the needed ops and returning an error which prevents MRP being configured when it cannot work. Andrew
Re: [Bridge] [RFC net-next 6/9] net: dsa: Forward offloading
> There is really no need to recompute the static parts of the tags on > each skb. It would mean moving some knowledge of the tagging format to > the driver. But that boundary is pretty artificial for > mv88e6xxx. tag_dsa has no use outside of mv88e6xxx, and mv88e6xxx does > not work with any other tagger. I suppose you could even move the whole > tagger to drivers/net/dsa/mv88e6xxx/? > > What do you think? > > Andrew? We have resisted this before. What information do you actually need to share between the tagger and the driver? Both tag_lan9303.c and tag_ocelot_8021q.c do reference their switch driver data structures, so some sharing is allowed. But please try to keep the surface areas down. Andrew
Re: [Bridge] [PATCH v6 net-next 5/7] net: bridge: switchdev: let drivers inform which bridge ports are offloaded
On Mon, Jul 26, 2021 at 07:21:20PM +0530, Naresh Kamboju wrote: > On Wed, 21 Jul 2021 at 21:56, Vladimir Oltean wrote: > > > > On reception of an skb, the bridge checks if it was marked as 'already > > forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it > > is, it assigns the source hardware domain of that skb based on the > > hardware domain of the ingress port. Then during forwarding, it enforces > > that the egress port must have a different hardware domain than the > > ingress one (this is done in nbp_switchdev_allowed_egress). > [Please ignore if it is already reported] > > Following build error noticed on Linux next 20210723 tag > with omap2plus_defconfig on arm architecture. Hi Naresh Please trim emails when replying. It is really annoying to have to page down and down and down to find your part in the email, and you always wonder if you accidentally jumped over something when paging down at speed. Andrew
Re: [Bridge] [PATCH net-next 1/4] net: bridge: Add support for bridge port in locked mode
> > + if (p->flags & BR_PORT_LOCKED) { > > + fdb_entry = br_fdb_find_rcu(br, eth_hdr(skb)->h_source, vid); > > + if (!(fdb_entry && fdb_entry->dst == p)) > > + goto drop; > > I'm not familiar with 802.1X so I have some questions: Me neither. > > 1. Do we need to differentiate between no FDB entry and an FDB entry > pointing to a different port than we expect? And extending that question, a static vs a dynamic entry? Andrew
Re: [Bridge] [PATCH net-next 3/3] net: dsa: mv88e6xxx: mac-auth/MAB implementation
> + if (mv88e6xxx_port_is_locked(chip, chip->ports[spid].port)) > + err = > mv88e6xxx_switchdev_handle_atu_miss_violation(chip, > + > chip->ports[spid].port, > + > &entry, > + > fid); > +static int mv88e6xxx_find_vid_on_matching_fid(struct mv88e6xxx_chip *chip, > + const struct mv88e6xxx_vtu_entry > *entry, > + void *priv) > +{ > + struct mv88e6xxx_fid_search_ctx *ctx = priv; > + > + if (ctx->fid_search == entry->fid) { > + ctx->vid_found = entry->vid; > + return 1; > + } > + return 0; > +} > + > +int mv88e6xxx_switchdev_handle_atu_miss_violation(struct mv88e6xxx_chip > *chip, > + int port, > + struct mv88e6xxx_atu_entry > *entry, > + u16 fid) > +{ > + struct switchdev_notifier_fdb_info info = { > + .addr = entry->mac, > + .vid = 0, > + .added_by_user = false, > + .is_local = false, > + .offloaded = true, > + .locked = true, > + }; > + struct mv88e6xxx_fid_search_ctx ctx; > + struct netlink_ext_ack *extack; > + struct net_device *brport; > + struct dsa_port *dp; > + int err; > + > + ctx.fid_search = fid; > + err = mv88e6xxx_vtu_walk(chip, mv88e6xxx_find_vid_on_matching_fid, > &ctx); I could be reading this code wrong, but it looks like you assume there is a single new entry in the ATU. But interrupts on these devices are slow. It would be easy for two or more devices to pop into existence at the same time. Don't you need to walk the whole ATU to find all the new entries? Have you tried this with a traffic generating populating the ATU with new entries at different rates, up to line rate? Do you get notifications for them all? Andrew
Re: [Bridge] [PATCH net-next 3/3] net: dsa: mv88e6xxx: mac-auth/MAB implementation
On Thu, Mar 17, 2022 at 09:52:15AM +0100, Hans Schultz wrote: > On tor, mar 17, 2022 at 01:34, Vladimir Oltean wrote: > > On Mon, Mar 14, 2022 at 11:46:51AM +0100, Hans Schultz wrote: > >> >> @@ -396,6 +414,13 @@ static irqreturn_t > >> >> mv88e6xxx_g1_atu_prob_irq_thread_fn(int irq, void *dev_id) > >> >> "ATU miss violation for %pM portvec > >> >> %x spid %d\n", > >> >> entry.mac, entry.portvec, spid); > >> >> chip->ports[spid].atu_miss_violation++; > >> >> + if (mv88e6xxx_port_is_locked(chip, > >> >> chip->ports[spid].port)) > >> >> + err = > >> >> mv88e6xxx_switchdev_handle_atu_miss_violation(chip, > >> >> + > >> >> chip->ports[spid].port, > >> >> + > >> >> &entry, > >> >> + > >> >> fid); > >> > > >> > Do we want to suppress the ATU miss violation warnings if we're going to > >> > notify the bridge, or is it better to keep them for some reason? > >> > My logic is that they're part of normal operation, so suppressing makes > >> > sense. > >> > > >> > >> I have been seeing many ATU member violations after the miss violation is > >> handled (using ping), and I think it could be considered to suppress the > >> ATU member > >> violations interrupts by setting the IgnoreWrongData bit for the > >> port (sect 4.4.7). This would be something to do whenever a port is set in > >> locked mode? > > > > So the first packet with a given MAC SA triggers an ATU miss violation > > interrupt. > > > > You program that MAC SA into the ATU with a destination port mask of all > > zeroes. This suppresses further ATU miss interrupts for this MAC SA, but > > now generates ATU member violations, because the MAC SA _is_ present in > > the ATU, but not towards the expected port (in fact, towards _no_ port). > > > > Especially if user space decides it doesn't want to authorize this MAC > > SA, it really becomes a problem because this is now a vector for denial > > of service, with every packet triggering an ATU member violation > > interrupt. > > > > So your suggestion is to set the IgnoreWrongData bit on locked ports, > > and this will suppress the actual member violation interrupts for > > traffic coming from these ports. > > > > So if the user decides to unplug a previously authorized printer from > > switch port 1 and move it to port 2, how is this handled? If there isn't > > a mechanism in place to delete the locked FDB entry when the printer > > goes away, then by setting IgnoreWrongData you're effectively also > > suppressing migration notifications. > > I don't think such a scenario is so realistic, as changing port is not > just something done casually, besides port 2 then must also be a locked > port to have the same policy. I think it is very realistic. It is also something which does not work is going to cause a lot of confusion. People will blame the printer, when in fact they should be blaming the switch. They will be rebooting the printer, when in fact, they need to reboot the switch etc. I expect there is a way to cleanly support this, you just need to figure it out. > The other aspect is that the user space daemon that authorizes catches > the fdb add entry events and checks if it is a locked entry. So it will > be up to said daemon to decide the policy, like remove the fdb entry > after a timeout. > > > > > Oh, btw, my question was: could you consider suppressing the _prints_ on > > an ATU miss violation on a locked port? > > As there will only be such on the first packet, I think it should be > logged and those prints serve that purpose, so I think it is best to > keep the print. > If in the future some tests or other can argue for suppressing the > prints, it is an easy thing to do. Please use a traffic generator and try to DOS one of your own switches. Can you? Andrew
[Bridge] [PATCH RFC] net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.
It is possible to stack bridges on top of each other. Consider the following which makes use of an Ethernet switch: br1 /\ / \ /\ br0.11wlan0 | br0 / | \ p1 p2 p3 br0 is offloaded to the switch. Above br0 is a vlan interface, for vlan 11. This vlan interface is then a slave of br1. br1 also has wireless interface as a slave. This setup trunks wireless lan traffic over the copper network inside a VLAN. A frame received on p1 which is passed up to the bridge has the skb->offload_fwd_mark flag set to true, indicating it that the switch has dealt with forwarding the frame out ports p2 and p3 as needed. This flag instructs the software bridge it does not need to pass the frame back down again. However, the flag is not getting reset when the frame is passed upwards. As a result br1 sees the flag, wrongly interprets it, and fails to forward the frame to wlan0. When passing a frame upwards, clear the flag. RFC because i don't know the bridge code well enough if this is the correct place to do this, and if there are any side effects, could the skb be a clone, etc. Fixes: f1c2eddf4cb6 ("bridge: switchdev: Use an helper to clear forward mark") Signed-off-by: Andrew Lunn --- net/bridge/br_input.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c index 196417859c4a..9327a5fad1df 100644 --- a/net/bridge/br_input.c +++ b/net/bridge/br_input.c @@ -39,6 +39,13 @@ static int br_pass_frame_up(struct sk_buff *skb) dev_sw_netstats_rx_add(brdev, skb->len); vg = br_vlan_group_rcu(br); + + /* Reset the offload_fwd_mark because there could be a stacked +* bridge above, and it should not think this bridge it doing +* that bridges work forward out its ports. +*/ + br_switchdev_frame_unmark(skb); + /* Bridge is just like any other port. Make sure the * packet is allowed except in promisc mode when someone * may be running packet capture. -- 2.36.0
Re: [Bridge] [PATCH RFC] net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.
On Thu, May 05, 2022 at 04:07:20PM -0700, Stephen Hemminger wrote: > On Fri, 6 May 2022 00:59:04 +0200 > Andrew Lunn wrote: > > > It is possible to stack bridges on top of each other. Consider the > > following which makes use of an Ethernet switch: > > > >br1 > > /\ > > / \ > >/\ > > br0.11wlan0 > >| > >br0 > > / | \ > > p1 p2 p3 > > > > br0 is offloaded to the switch. Above br0 is a vlan interface, for > > vlan 11. This vlan interface is then a slave of br1. br1 also has > > wireless interface as a slave. This setup trunks wireless lan traffic > > over the copper network inside a VLAN. > > > > A frame received on p1 which is passed up to the bridge has the > > skb->offload_fwd_mark flag set to true, indicating it that the switch > > has dealt with forwarding the frame out ports p2 and p3 as > > needed. This flag instructs the software bridge it does not need to > > pass the frame back down again. However, the flag is not getting reset > > when the frame is passed upwards. As a result br1 sees the flag, > > wrongly interprets it, and fails to forward the frame to wlan0. > > > > When passing a frame upwards, clear the flag. > > > > RFC because i don't know the bridge code well enough if this is the > > correct place to do this, and if there are any side effects, could the > > skb be a clone, etc. > > > > Fixes: f1c2eddf4cb6 ("bridge: switchdev: Use an helper to clear forward > > mark") > > Signed-off-by: Andrew Lunn > > Bridging of bridges is not supposed to be allowed. > See: > > bridge:br_if.c > > /* No bridging of bridges */ > if (dev->netdev_ops->ndo_start_xmit == br_dev_xmit) { > NL_SET_ERR_MSG(extack, > "Can not enslave a bridge to a bridge"); > return -ELOOP; > } This is not direct bridging of bridges. There is a vlan interface in the middle. And even if it is not supposed to work, it does work, it is being used, and it regressed. This fixes the regression. Andrew
Re: [Bridge] [PATCH RFC] net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.
> Some safer alternatives to this patch are based on the idea that we > could ignore skb->offload_fwd_mark coming from an unoffloaded bridge > port (i.e. treat this condition at br1, not at br0). We could: > - clear skb->offload_fwd_mark in br_handle_frame_finish(), if p->hwdom is 0 > - change nbp_switchdev_allowed_egress() to return true if cb->src_hwdom == 0 O.K, i will try out these solutions. Thanks Andrew
Re: [Bridge] [PATCH RFC] net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.
> I like Andrew's patch because it is the Rx equivalent of > br_switchdev_frame_unmark() in br_dev_xmit(). However, if we go with the > second option, it should allow us to remove the clearing of the mark in > the Tx path as the control block is cleared in the Tx path since commit > fd65e5a95d08 ("net: bridge: clear bridge's private skb space on xmit"). > > I don't know how far back Nik's patch was backported and I don't know > how far back Andrew's patch will be backported, so it might be best to > submit Andrew's patch to net as-is and then in net-next change > nbp_switchdev_allowed_egress() and remove br_switchdev_frame_unmark() > from both the Rx and Tx paths. > > Anyway, I have applied this patch to our tree for testing. Will report > tomorrow in case there are any regressions. Hi Ido Did your testing find any issues? Thanks Andrew
[Bridge] [PATCH v2 net] net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.
It is possible to stack bridges on top of each other. Consider the following which makes use of an Ethernet switch: br1 /\ / \ /\ br0.11wlan0 | br0 / | \ p1 p2 p3 br0 is offloaded to the switch. Above br0 is a vlan interface, for vlan 11. This vlan interface is then a slave of br1. br1 also has a wireless interface as a slave. This setup trunks wireless lan traffic over the copper network inside a VLAN. A frame received on p1 which is passed up to the bridge has the skb->offload_fwd_mark flag set to true, indicating that the switch has dealt with forwarding the frame out ports p2 and p3 as needed. This flag instructs the software bridge it does not need to pass the frame back down again. However, the flag is not getting reset when the frame is passed upwards. As a result br1 sees the flag, wrongly interprets it, and fails to forward the frame to wlan0. When passing a frame upwards, clear the flag. This is the Rx equivalent of br_switchdev_frame_unmark() in br_dev_xmit(). Fixes: f1c2eddf4cb6 ("bridge: switchdev: Use an helper to clear forward mark") Signed-off-by: Andrew Lunn --- v2: Extended the commit message with Ido obsersation of the equivelance of br_dev_xmit(). Fixed up the comment. This code has passed Ido test setup. net/bridge/br_input.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c index 196417859c4a..68b3e850bcb9 100644 --- a/net/bridge/br_input.c +++ b/net/bridge/br_input.c @@ -39,6 +39,13 @@ static int br_pass_frame_up(struct sk_buff *skb) dev_sw_netstats_rx_add(brdev, skb->len); vg = br_vlan_group_rcu(br); + + /* Reset the offload_fwd_mark because there could be a stacked +* bridge above, and it should not think this bridge it doing +* that bridge's work forwarding out its ports. +*/ + br_switchdev_frame_unmark(skb); + /* Bridge is just like any other port. Make sure the * packet is allowed except in promisc mode when someone * may be running packet capture. -- 2.36.0