Re: [PATCH net-next 9/9] selftests: vxlan_mdb: Add MDB bulk deletion test

2024-02-19 Thread Ido Schimmel

Hi,

On Mon, Feb 19, 2024 at 01:54:32PM +0800, Yujie Liu wrote:
> Hi Ido,
> 
> I'm from the kernel test robot team. We noticed that this patch
> introduced a new group of flush tests. The bot cannot parse the test
> result correctly due to some duplicate output in the summary, such
> as the following ones marked by arrows:

[...]

> Although we can work around this problem on the bot side, we would still
> like to consult you about whether it is expected or by design to have the
> duplicate test descriptions, and is it possible to give them different
> descriptions to clearly tell them apart? Could you please give us
> some guidance?

Thanks for the report. Wasn't aware of this limitation. Will take care
of it later this week (AFK tomorrow) and copy you on the patch.

[PATCH net-next 9/9] selftests: vxlan_mdb: Add MDB bulk deletion test

2023-12-17 Thread Ido Schimmel

Add test cases to verify the behavior of the MDB bulk deletion
functionality in the VXLAN driver.

Signed-off-by: Ido Schimmel 
Acked-by: Petr Machata 
---
 tools/testing/selftests/net/test_vxlan_mdb.sh | 201 +-
 1 file changed, 199 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/test_vxlan_mdb.sh b/tools/testing/selftests/net/test_vxlan_mdb.sh
index 6725fd9157b9..84a05a9e46d8 100755
--- a/tools/testing/selftests/net/test_vxlan_mdb.sh
+++ b/tools/testing/selftests/net/test_vxlan_mdb.sh
@@ -79,6 +79,7 @@ CONTROL_PATH_TESTS="
dump_ipv6_ipv4
dump_ipv4_ipv6
dump_ipv6_ipv6
+   flush
 "
 
 DATA_PATH_TESTS="
@@ -968,6 +969,202 @@ dump_ipv6_ipv6()
dump_common $ns1 $local_addr $remote_prefix $fn
 }
 
+flush()
+{
+   local num_entries
+
+   echo
+   echo "Control path: Flush"
+   echo "---"
+
+   # Add entries with different attributes and check that they are all
+   # flushed when the flush command is given with no parameters.
+
+   # Different source VNI.
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.1 permanent dst 198.51.100.1 src_vni 10010"
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.2 permanent dst 198.51.100.1 src_vni 10011"
+
+   # Different routing protocol.
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.3 permanent proto bgp dst 198.51.100.1 src_vni 10010"
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.4 permanent proto zebra dst 198.51.100.1 src_vni 10010"
+
+   # Different destination IP.
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.5 permanent dst 198.51.100.1 src_vni 10010"
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.6 permanent dst 198.51.100.2 src_vni 10010"
+
+   # Different destination port.
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.7 permanent dst 198.51.100.1 dst_port 1 src_vni 10010"
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.8 permanent dst 198.51.100.1 dst_port 2 src_vni 10010"
+
+   # Different VNI.
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.9 permanent dst 198.51.100.1 vni 10010 src_vni 10010"
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.10 permanent dst 198.51.100.1 vni 10020 src_vni 10010"
+
+   run_cmd "bridge -n $ns1_v4 mdb flush dev vx0"
+   num_entries=$(bridge -n $ns1_v4 mdb show dev vx0 | wc -l)
+   [[ $num_entries -eq 0 ]]
+   log_test $? 0 "Flush all"
+
+   # Check that entries are flushed when port is specified as the VXLAN
+   # device and that an error is returned when port is specified as a
+   # different net device.
+
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.1 permanent dst 198.51.100.1 src_vni 10010"
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.1 permanent dst 198.51.100.2 src_vni 10010"
+
+   run_cmd "bridge -n $ns1_v4 mdb flush dev vx0 port vx0"
+   run_cmd "bridge -n $ns1_v4 -d -s mdb get dev vx0 grp 239.1.1.1 src_vni 10010"
+   log_test $? 254 "Flush by port"
+
+   run_cmd "bridge -n $ns1_v4 mdb flush dev vx0 port veth0"
+   log_test $? 255 "Flush by wrong port"
+
+   # Check that when flushing by source VNI only entries programmed with
+   # the specified source VNI are flushed and the rest are not.
+
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.1 permanent dst 198.51.100.1 src_vni 10010"
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.1 permanent dst 198.51.100.2 src_vni 10010"
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.1 permanent dst 198.51.100.1 src_vni 10011"
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.1.1 permanent dst 198.51.100.2 src_vni 10011"
+
+   run_cmd "bridge -n $ns1_v4 mdb flush dev vx0 src_vni 10010"
+
+   run_cmd "bridge -n $ns1_v4 -d -s mdb get dev vx0 grp 239.1.1.1 src_vni 10010"
+   log_test $? 254 "Flush by specified source VNI"
+   run_cmd "bridge -n $ns1_v4 -d -s mdb get dev vx0 grp 239.1.1.1 src_vni 10011"
+   log_test $? 0 "Flush by unspecified source VNI"
+
+   run_cmd "bridge -n $ns1_v4 mdb flush dev vx0"
+
+   # Check that all entries are flushed when "permanent" is specified and
+   # that an error is returned when "nopermanent" is specified.
+
+   run_cmd "bridge -n $ns1_v4 mdb add dev vx0 port vx0 grp 239.1.

[PATCH net-next 8/9] selftests: bridge_mdb: Add MDB bulk deletion test

2023-12-17 Thread Ido Schimmel

Add test cases to verify the behavior of the MDB bulk deletion
functionality in the bridge driver.

Signed-off-by: Ido Schimmel 
Acked-by: Petr Machata 
---
 .../selftests/net/forwarding/bridge_mdb.sh| 191 +-
 1 file changed, 189 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/bridge_mdb.sh b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
index e4e3e9405056..61348f71728c 100755
--- a/tools/testing/selftests/net/forwarding/bridge_mdb.sh
+++ b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
@@ -803,11 +803,198 @@ cfg_test_dump()
cfg_test_dump_common "L2" l2_grps_get
 }
 
+# Check flush functionality with different parameters.
+cfg_test_flush()
+{
+   local num_entries
+
+   # Add entries with different attributes and check that they are all
+   # flushed when the flush command is given with no parameters.
+
+   # Different port.
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.1 vid 10
+   bridge mdb add dev br0 port $swp2 grp 239.1.1.2 vid 10
+
+   # Different VLAN ID.
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.3 vid 10
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.4 vid 20
+
+   # Different routing protocol.
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.5 vid 10 proto bgp
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.6 vid 10 proto zebra
+
+   # Different state.
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.7 vid 10 permanent
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.8 vid 10 temp
+
+   bridge mdb flush dev br0
+   num_entries=$(bridge mdb show dev br0 | wc -l)
+   [[ $num_entries -eq 0 ]]
+   check_err $? 0 "Not all entries flushed after flush all"
+
+   # Check that when flushing by port only entries programmed with the
+   # specified port are flushed and the rest are not.
+
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.1 vid 10
+   bridge mdb add dev br0 port $swp2 grp 239.1.1.1 vid 10
+   bridge mdb add dev br0 port br0 grp 239.1.1.1 vid 10
+
+   bridge mdb flush dev br0 port $swp1
+
+   bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port $swp1"
+   check_fail $? "Entry not flushed by specified port"
+   bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port $swp2"
+   check_err $? "Entry flushed by wrong port"
+   bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port br0"
+   check_err $? "Host entry flushed by wrong port"
+
+   bridge mdb flush dev br0 port br0
+
+   bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port br0"
+   check_fail $? "Host entry not flushed by specified port"
+
+   bridge mdb flush dev br0
+
+   # Check that when flushing by VLAN ID only entries programmed with the
+   # specified VLAN ID are flushed and the rest are not.
+
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.1 vid 10
+   bridge mdb add dev br0 port $swp2 grp 239.1.1.1 vid 10
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.1 vid 20
+   bridge mdb add dev br0 port $swp2 grp 239.1.1.1 vid 20
+
+   bridge mdb flush dev br0 vid 10
+
+   bridge mdb get dev br0 grp 239.1.1.1 vid 10 &> /dev/null
+   check_fail $? "Entry not flushed by specified VLAN ID"
+   bridge mdb get dev br0 grp 239.1.1.1 vid 20 &> /dev/null
+   check_err $? "Entry flushed by wrong VLAN ID"
+
+   bridge mdb flush dev br0
+
+   # Check that all permanent entries are flushed when "permanent" is
+   # specified and that temporary entries are not.
+
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.1 permanent vid 10
+   bridge mdb add dev br0 port $swp2 grp 239.1.1.1 temp vid 10
+
+   bridge mdb flush dev br0 permanent
+
+   bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port $swp1"
+   check_fail $? "Entry not flushed by \"permanent\" state"
+   bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port $swp2"
+   check_err $? "Entry flushed by wrong state (\"permanent\")"
+
+   bridge mdb flush dev br0
+
+   # Check that all temporary entries are flushed when "nopermanent" is
+   # specified and that permanent entries are not.
+
+   bridge mdb add dev br0 port $swp1 grp 239.1.1.1 permanent vid 10
+   bridge mdb add dev br0 port $swp2 grp 239.1.1.1 temp vid 10
+
+   bridge mdb flush dev br0 nopermanent
+
+   bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port $swp1"
+   check_err $? "Entry flushed by wrong state (\"nopermanent\")"
+   bridge mdb get dev br0 grp 239.1.1.1 vid 10 | grep -q "port $swp2"
+   check_fail $? "Entry not flushed by \"nopermanent\" state"
+
+

[PATCH net-next 7/9] rtnetlink: bridge: Enable MDB bulk deletion

2023-12-17 Thread Ido Schimmel

Now that both the common code as well as individual drivers support MDB
bulk deletion, allow user space to make such requests.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
---
 net/core/rtnetlink.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 349255151ad0..33f1e8d8e842 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -6747,5 +6747,6 @@ void __init rtnetlink_init(void)
 
rtnl_register(PF_BRIDGE, RTM_GETMDB, rtnl_mdb_get, rtnl_mdb_dump, 0);
rtnl_register(PF_BRIDGE, RTM_NEWMDB, rtnl_mdb_add, NULL, 0);
-   rtnl_register(PF_BRIDGE, RTM_DELMDB, rtnl_mdb_del, NULL, 0);
+   rtnl_register(PF_BRIDGE, RTM_DELMDB, rtnl_mdb_del, NULL,
+ RTNL_FLAG_BULK_DEL_SUPPORTED);
 }
-- 
2.40.1

[PATCH net-next 6/9] vxlan: mdb: Add MDB bulk deletion support

2023-12-17 Thread Ido Schimmel

Implement MDB bulk deletion support in the VXLAN driver, allowing MDB
entries to be deleted in bulk according to provided parameters.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
---
 drivers/net/vxlan/vxlan_core.c|   1 +
 drivers/net/vxlan/vxlan_mdb.c | 174 +-
 drivers/net/vxlan/vxlan_private.h |   2 +
 3 files changed, 153 insertions(+), 24 deletions(-)

diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index 764ea02ff911..16106e088c63 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -3235,6 +3235,7 @@ static const struct net_device_ops vxlan_netdev_ether_ops = {
.ndo_fdb_get= vxlan_fdb_get,
.ndo_mdb_add= vxlan_mdb_add,
.ndo_mdb_del= vxlan_mdb_del,
+   .ndo_mdb_del_bulk   = vxlan_mdb_del_bulk,
.ndo_mdb_dump   = vxlan_mdb_dump,
.ndo_mdb_get= vxlan_mdb_get,
.ndo_fill_metadata_dst  = vxlan_fill_metadata_dst,
diff --git a/drivers/net/vxlan/vxlan_mdb.c b/drivers/net/vxlan/vxlan_mdb.c
index eb4c580b5cee..60eb95a06d55 100644
--- a/drivers/net/vxlan/vxlan_mdb.c
+++ b/drivers/net/vxlan/vxlan_mdb.c
@@ -74,6 +74,14 @@ struct vxlan_mdb_config {
u8 rt_protocol;
 };
 
+struct vxlan_mdb_flush_desc {
+   union vxlan_addr remote_ip;
+   __be32 src_vni;
+   __be32 remote_vni;
+   __be16 remote_port;
+   u8 rt_protocol;
+};
+
 static const struct rhashtable_params vxlan_mdb_rht_params = {
.head_offset = offsetof(struct vxlan_mdb_entry, rhnode),
.key_offset = offsetof(struct vxlan_mdb_entry, key),
@@ -1306,6 +1314,145 @@ int vxlan_mdb_del(struct net_device *dev, struct nlattr *tb[],
return err;
 }
 
+static const struct nla_policy
+vxlan_mdbe_attrs_del_bulk_pol[MDBE_ATTR_MAX + 1] = {
+   [MDBE_ATTR_RTPROT] = NLA_POLICY_MIN(NLA_U8, RTPROT_STATIC),
+   [MDBE_ATTR_DST] = NLA_POLICY_RANGE(NLA_BINARY,
+  sizeof(struct in_addr),
+  sizeof(struct in6_addr)),
+   [MDBE_ATTR_DST_PORT] = { .type = NLA_U16 },
+   [MDBE_ATTR_VNI] = NLA_POLICY_FULL_RANGE(NLA_U32, &vni_range),
+   [MDBE_ATTR_SRC_VNI] = NLA_POLICY_FULL_RANGE(NLA_U32, &vni_range),
+   [MDBE_ATTR_STATE_MASK] = NLA_POLICY_MASK(NLA_U8, MDB_PERMANENT),
+};
+
+static int vxlan_mdb_flush_desc_init(struct vxlan_dev *vxlan,
+struct vxlan_mdb_flush_desc *desc,
+struct nlattr *tb[],
+struct netlink_ext_ack *extack)
+{
+   struct br_mdb_entry *entry = nla_data(tb[MDBA_SET_ENTRY]);
+   struct nlattr *mdbe_attrs[MDBE_ATTR_MAX + 1];
+   int err;
+
+   if (entry->ifindex && entry->ifindex != vxlan->dev->ifindex) {
+   NL_SET_ERR_MSG_MOD(extack, "Invalid port net device");
+   return -EINVAL;
+   }
+
+   if (entry->vid) {
+   NL_SET_ERR_MSG_MOD(extack, "VID must not be specified");
+   return -EINVAL;
+   }
+
+   if (!tb[MDBA_SET_ENTRY_ATTRS])
+   return 0;
+
+   err = nla_parse_nested(mdbe_attrs, MDBE_ATTR_MAX,
+  tb[MDBA_SET_ENTRY_ATTRS],
+  vxlan_mdbe_attrs_del_bulk_pol, extack);
+   if (err)
+   return err;
+
+   if (mdbe_attrs[MDBE_ATTR_STATE_MASK]) {
+   u8 state_mask = nla_get_u8(mdbe_attrs[MDBE_ATTR_STATE_MASK]);
+
+   if ((state_mask & MDB_PERMANENT) && !(entry->state & MDB_PERMANENT)) {
+   NL_SET_ERR_MSG_MOD(extack, "Only permanent MDB entries are supported");
+   return -EINVAL;
+   }
+   }
+
+   if (mdbe_attrs[MDBE_ATTR_RTPROT])
+   desc->rt_protocol = nla_get_u8(mdbe_attrs[MDBE_ATTR_RTPROT]);
+
+   if (mdbe_attrs[MDBE_ATTR_DST])
+   vxlan_nla_get_addr(&desc->remote_ip, mdbe_attrs[MDBE_ATTR_DST]);
+
+   if (mdbe_attrs[MDBE_ATTR_DST_PORT])
+   desc->remote_port =
+   cpu_to_be16(nla_get_u16(mdbe_attrs[MDBE_ATTR_DST_PORT]));
+
+   if (mdbe_attrs[MDBE_ATTR_VNI])
+   desc->remote_vni =
+   cpu_to_be32(nla_get_u32(mdbe_attrs[MDBE_ATTR_VNI]));
+
+   if (mdbe_attrs[MDBE_ATTR_SRC_VNI])
+   desc->src_vni =
+   cpu_to_be32(nla_get_u32(mdbe_attrs[MDBE_ATTR_SRC_VNI]));
+
+   return 0;
+}
+
+static void vxlan_mdb_remotes_flush(struct vxlan_dev *vxlan,
+   struct vxlan_mdb_entry *mdb_entry,
+   const struct vxlan_mdb_flush_desc *desc)
+{
+   struct vxlan_mdb_remote *remote, *tmp;
+
+   list_for_each_entry_safe(remote, tmp, &mdb_entry->rem

[PATCH net-next 2/9] rtnetlink: bridge: Use a different policy for MDB bulk delete

2023-12-17 Thread Ido Schimmel

For MDB bulk delete we will need to validate 'MDBA_SET_ENTRY'
differently compared to regular delete. Specifically, allow the ifindex
to be zero (in case not filtering on bridge port) and force the address
to be zero as bulk delete based on address is not supported.

Do that by introducing a new policy and choosing the correct policy
based on the presence of the 'NLM_F_BULK' flag in the netlink message
header. Use nlmsg_parse() for strict validation.
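
As a rough illustration of the above (not kernel code), the policy
selection and the checks applied to 'MDBA_SET_ENTRY' can be sketched in
shell. The helper names pick_policy and validate_bulk_entry, and the idea
of passing the address as a string, are assumptions made for this sketch
only. The error strings are the ones used in the patch below.

```shell
#!/bin/sh
NLM_F_BULK=512      # 0x200 in include/uapi/linux/netlink.h
VLAN_N_VID=4096

# Pick the policy name based on the netlink message flags (hypothetical
# helper that mirrors the branch in rtnl_mdb_del()).
pick_policy()
{
    if [ $(($1 & NLM_F_BULK)) -ne 0 ]; then
        echo mdba_del_bulk_policy
    else
        echo mdba_policy
    fi
}

# Mirror the bulk-delete entry checks. Arguments: state, flags, vid and
# address ("0" stands for an all-zero address in this sketch). The ifindex
# is not checked because zero is allowed for bulk delete.
validate_bulk_entry()
{
    [ "$1" -eq 0 ] || [ "$1" -eq 1 ] || { echo "Unknown entry state"; return 1; }
    [ "$2" -eq 0 ] || { echo "Entry flags cannot be set"; return 1; }
    [ "$3" -lt $((VLAN_N_VID - 1)) ] || { echo "Invalid entry VLAN id"; return 1; }
    [ "$4" = "0" ] || { echo "Entry address cannot be set"; return 1; }
    echo "OK"
}

pick_policy 0
pick_policy "$NLM_F_BULK"
validate_bulk_entry 1 0 10 0
validate_bulk_entry 1 0 10 239.1.1.1 || true
```

The last call is rejected because bulk delete based on address is not
supported.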

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
---
 net/core/rtnetlink.c | 51 ++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 5e0ab4c08f72..30f030a672f2 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -6416,17 +6416,64 @@ static int rtnl_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh,
return dev->netdev_ops->ndo_mdb_add(dev, tb, nlh->nlmsg_flags, extack);
 }
 
+static int rtnl_validate_mdb_entry_del_bulk(const struct nlattr *attr,
+   struct netlink_ext_ack *extack)
+{
+   struct br_mdb_entry *entry = nla_data(attr);
+   struct br_mdb_entry zero_entry = {};
+
+   if (nla_len(attr) != sizeof(struct br_mdb_entry)) {
+   NL_SET_ERR_MSG_ATTR(extack, attr, "Invalid attribute length");
+   return -EINVAL;
+   }
+
+   if (entry->state != MDB_PERMANENT && entry->state != MDB_TEMPORARY) {
+   NL_SET_ERR_MSG(extack, "Unknown entry state");
+   return -EINVAL;
+   }
+
+   if (entry->flags) {
+   NL_SET_ERR_MSG(extack, "Entry flags cannot be set");
+   return -EINVAL;
+   }
+
+   if (entry->vid >= VLAN_N_VID - 1) {
+   NL_SET_ERR_MSG(extack, "Invalid entry VLAN id");
+   return -EINVAL;
+   }
+
+   if (memcmp(&entry->addr, &zero_entry.addr, sizeof(entry->addr))) {
+   NL_SET_ERR_MSG(extack, "Entry address cannot be set");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static const struct nla_policy mdba_del_bulk_policy[MDBA_SET_ENTRY_MAX + 1] = {
+   [MDBA_SET_ENTRY] = NLA_POLICY_VALIDATE_FN(NLA_BINARY,
+ rtnl_validate_mdb_entry_del_bulk,
+ sizeof(struct br_mdb_entry)),
+   [MDBA_SET_ENTRY_ATTRS] = { .type = NLA_NESTED },
+};
+
 static int rtnl_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
struct netlink_ext_ack *extack)
 {
+   bool del_bulk = !!(nlh->nlmsg_flags & NLM_F_BULK);
struct nlattr *tb[MDBA_SET_ENTRY_MAX + 1];
struct net *net = sock_net(skb->sk);
struct br_port_msg *bpm;
struct net_device *dev;
int err;
 
-   err = nlmsg_parse_deprecated(nlh, sizeof(*bpm), tb,
-MDBA_SET_ENTRY_MAX, mdba_policy, extack);
+   if (!del_bulk)
+   err = nlmsg_parse_deprecated(nlh, sizeof(*bpm), tb,
+MDBA_SET_ENTRY_MAX, mdba_policy,
+extack);
+   else
+   err = nlmsg_parse(nlh, sizeof(*bpm), tb, MDBA_SET_ENTRY_MAX,
+ mdba_del_bulk_policy, extack);
if (err)
return err;
 
-- 
2.40.1

[PATCH net-next 4/9] rtnetlink: bridge: Invoke MDB bulk deletion when needed

2023-12-17 Thread Ido Schimmel

Invoke the new MDB bulk deletion device operation when the 'NLM_F_BULK'
flag is set in the netlink message header.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
---
 net/core/rtnetlink.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 30f030a672f2..349255151ad0 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -6494,6 +6494,14 @@ static int rtnl_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
return -EINVAL;
}
 
+   if (del_bulk) {
+   if (!dev->netdev_ops->ndo_mdb_del_bulk) {
+   NL_SET_ERR_MSG(extack, "Device does not support MDB bulk deletion");
+   return -EOPNOTSUPP;
+   }
+   return dev->netdev_ops->ndo_mdb_del_bulk(dev, tb, extack);
+   }
+
if (!dev->netdev_ops->ndo_mdb_del) {
NL_SET_ERR_MSG(extack, "Device does not support MDB operations");
return -EOPNOTSUPP;
-- 
2.40.1

[PATCH net-next 5/9] bridge: mdb: Add MDB bulk deletion support

2023-12-17 Thread Ido Schimmel

Implement MDB bulk deletion support in the bridge driver, allowing MDB
entries to be deleted in bulk according to provided parameters.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
---
 net/bridge/br_device.c  |   1 +
 net/bridge/br_mdb.c | 133 
 net/bridge/br_private.h |   8 +++
 3 files changed, 142 insertions(+)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 8f40de3af154..65cee0ad3c1b 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -471,6 +471,7 @@ static const struct net_device_ops br_netdev_ops = {
.ndo_fdb_get = br_fdb_get,
.ndo_mdb_add = br_mdb_add,
.ndo_mdb_del = br_mdb_del,
+   .ndo_mdb_del_bulk= br_mdb_del_bulk,
.ndo_mdb_dump= br_mdb_dump,
.ndo_mdb_get = br_mdb_get,
.ndo_bridge_getlink  = br_getlink,
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 8cc526067bc2..bc37e47ad829 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -1412,6 +1412,139 @@ int br_mdb_del(struct net_device *dev, struct nlattr *tb[],
return err;
 }
 
+struct br_mdb_flush_desc {
+   u32 port_ifindex;
+   u16 vid;
+   u8 rt_protocol;
+   u8 state;
+   u8 state_mask;
+};
+
+static const struct nla_policy br_mdbe_attrs_del_bulk_pol[MDBE_ATTR_MAX + 1] = {
+   [MDBE_ATTR_RTPROT] = NLA_POLICY_MIN(NLA_U8, RTPROT_STATIC),
+   [MDBE_ATTR_STATE_MASK] = NLA_POLICY_MASK(NLA_U8, MDB_PERMANENT),
+};
+
+static int br_mdb_flush_desc_init(struct br_mdb_flush_desc *desc,
+ struct nlattr *tb[],
+ struct netlink_ext_ack *extack)
+{
+   struct br_mdb_entry *entry = nla_data(tb[MDBA_SET_ENTRY]);
+   struct nlattr *mdbe_attrs[MDBE_ATTR_MAX + 1];
+   int err;
+
+   desc->port_ifindex = entry->ifindex;
+   desc->vid = entry->vid;
+   desc->state = entry->state;
+
+   if (!tb[MDBA_SET_ENTRY_ATTRS])
+   return 0;
+
+   err = nla_parse_nested(mdbe_attrs, MDBE_ATTR_MAX,
+  tb[MDBA_SET_ENTRY_ATTRS],
+  br_mdbe_attrs_del_bulk_pol, extack);
+   if (err)
+   return err;
+
+   if (mdbe_attrs[MDBE_ATTR_STATE_MASK])
+   desc->state_mask = nla_get_u8(mdbe_attrs[MDBE_ATTR_STATE_MASK]);
+
+   if (mdbe_attrs[MDBE_ATTR_RTPROT])
+   desc->rt_protocol = nla_get_u8(mdbe_attrs[MDBE_ATTR_RTPROT]);
+
+   return 0;
+}
+
+static void br_mdb_flush_host(struct net_bridge *br,
+ struct net_bridge_mdb_entry *mp,
+ const struct br_mdb_flush_desc *desc)
+{
+   u8 state;
+
+   if (desc->port_ifindex && desc->port_ifindex != br->dev->ifindex)
+   return;
+
+   if (desc->rt_protocol)
+   return;
+
+   state = br_group_is_l2(&mp->addr) ? MDB_PERMANENT : 0;
+   if (desc->state_mask && (state & desc->state_mask) != desc->state)
+   return;
+
+   br_multicast_host_leave(mp, true);
+   if (!mp->ports && netif_running(br->dev))
+   mod_timer(&mp->timer, jiffies);
+}
+
+static void br_mdb_flush_pgs(struct net_bridge *br,
+struct net_bridge_mdb_entry *mp,
+const struct br_mdb_flush_desc *desc)
+{
+   struct net_bridge_port_group __rcu **pp;
+   struct net_bridge_port_group *p;
+
+   for (pp = &mp->ports; (p = mlock_dereference(*pp, br)) != NULL;) {
+   u8 state;
+
+   if (desc->port_ifindex &&
+   desc->port_ifindex != p->key.port->dev->ifindex) {
+   pp = &p->next;
+   continue;
+   }
+
+   if (desc->rt_protocol && desc->rt_protocol != p->rt_protocol) {
+   pp = &p->next;
+   continue;
+   }
+
+   state = p->flags & MDB_PG_FLAGS_PERMANENT ? MDB_PERMANENT : 0;
+   if (desc->state_mask &&
+   (state & desc->state_mask) != desc->state) {
+   pp = &p->next;
+   continue;
+   }
+
+   br_multicast_del_pg(mp, p, pp);
+   }
+}
+
+static void br_mdb_flush(struct net_bridge *br,
+const struct br_mdb_flush_desc *desc)
+{
+   struct net_bridge_mdb_entry *mp;
+
+   spin_lock_bh(&br->multicast_lock);
+
+   /* Safe variant is not needed because entries are removed from the list
+* upon group timer expiration or bridge deletion.
+*/
+   hlist_for_each_entry(mp, &br->mdb_li

[PATCH net-next 0/9] Add MDB bulk deletion support

2023-12-17 Thread Ido Schimmel

This patchset adds MDB bulk deletion support, allowing user space to
request the deletion of matching entries instead of dumping the entire
MDB and issuing a separate deletion request for each matching entry.
Support is added in both the bridge and VXLAN drivers in a similar
fashion to the existing FDB bulk deletion support.

The parameters according to which bulk deletion can be performed are
similar to the FDB ones, namely: Destination port, VLAN ID, state (e.g.,
"permanent"), routing protocol, source / destination VNI, destination IP
and UDP port. Flushing based on flags (e.g., "offload", "fast_leave",
"added_by_star_ex", "blocked") is not currently supported, but can be
added in the future, if a use case arises.
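
To make the difference concrete, here is a minimal shell sketch that only
prints the commands each approach would issue. Nothing is executed. The
sample entries and the helper names per_entry_cmds and bulk_cmd are made
up for illustration, and the entries are not real "bridge mdb show"
output.

```shell
#!/bin/sh
# Hypothetical sample of MDB entries as "port group vid" triples.
entries="swp1 239.1.1.1 10
swp2 239.1.1.1 10
swp1 239.1.1.1 20"

# Without bulk deletion: user space walks the dump and prints one
# "mdb del" request per entry that matches the given VLAN ID.
per_entry_cmds()
{
    vid=$1
    echo "$entries" | while read -r port grp evid; do
        [ "$evid" = "$vid" ] || continue
        echo "bridge mdb del dev br0 port $port grp $grp vid $evid"
    done
}

# With bulk deletion: a single request, matching is left to the kernel.
bulk_cmd()
{
    echo "bridge mdb flush dev br0 vid $1"
}

per_entry_cmds 10
bulk_cmd 10
```

With the sample data, flushing VLAN 10 takes two per-entry requests (plus
the dump) versus one flush request.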

Patch #1 adds a new uAPI attribute to allow specifying the state mask
according to which bulk deletion will be performed, if any.

Patch #2 adds a new policy according to which bulk deletion requests
(with 'NLM_F_BULK' flag set) will be parsed.

Patches #3-#4 add a new NDO for MDB bulk deletion and invoke it from the
rtnetlink code when a bulk deletion request is made.

Patches #5-#6 implement the MDB bulk deletion NDO in the bridge and
VXLAN drivers, respectively.

Patch #7 allows user space to issue MDB bulk deletion requests by no
longer rejecting the 'NLM_F_BULK' flag when it is set in 'RTM_DELMDB'
requests.

Patches #8-#9 add selftests for both drivers, for both good and bad
flows.

iproute2 changes can be found here [1].

[1] https://github.com/idosch/iproute2/tree/submit/mdb_flush_v1

Ido Schimmel (9):
  bridge: add MDB state mask uAPI attribute
  rtnetlink: bridge: Use a different policy for MDB bulk delete
  net: Add MDB bulk deletion device operation
  rtnetlink: bridge: Invoke MDB bulk deletion when needed
  bridge: mdb: Add MDB bulk deletion support
  vxlan: mdb: Add MDB bulk deletion support
  rtnetlink: bridge: Enable MDB bulk deletion
  selftests: bridge_mdb: Add MDB bulk deletion test
  selftests: vxlan_mdb: Add MDB bulk deletion test

 drivers/net/vxlan/vxlan_core.c|   1 +
 drivers/net/vxlan/vxlan_mdb.c | 174 ---
 drivers/net/vxlan/vxlan_private.h |   2 +
 include/linux/netdevice.h |   6 +
 include/uapi/linux/if_bridge.h|   1 +
 net/bridge/br_device.c|   1 +
 net/bridge/br_mdb.c   | 133 
 net/bridge/br_private.h   |   8 +
 net/core/rtnetlink.c  |  62 +-
 .../selftests/net/forwarding/bridge_mdb.sh| 191 -
 tools/testing/selftests/net/test_vxlan_mdb.sh | 201 +-
 11 files changed, 749 insertions(+), 31 deletions(-)

-- 
2.40.1

[PATCH net-next 3/9] net: Add MDB bulk deletion device operation

2023-12-17 Thread Ido Schimmel

Add MDB net device operation that will be invoked by rtnetlink code in
response to received 'RTM_DELMDB' messages with the 'NLM_F_BULK' flag
set. Subsequent patches will implement the operation in the bridge and
VXLAN drivers.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
---
 include/linux/netdevice.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1b935ee341b4..75c7725e5e4f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1329,6 +1329,9 @@ struct netdev_net_notifier {
  * int (*ndo_mdb_del)(struct net_device *dev, struct nlattr *tb[],
  *   struct netlink_ext_ack *extack);
  * Deletes the MDB entry from dev.
+ * int (*ndo_mdb_del_bulk)(struct net_device *dev, struct nlattr *tb[],
+ *struct netlink_ext_ack *extack);
+ * Bulk deletes MDB entries from dev.
  * int (*ndo_mdb_dump)(struct net_device *dev, struct sk_buff *skb,
  *struct netlink_callback *cb);
  * Dumps MDB entries from dev. The first argument (marker) in the netlink
@@ -1611,6 +1614,9 @@ struct net_device_ops {
int (*ndo_mdb_del)(struct net_device *dev,
   struct nlattr *tb[],
   struct netlink_ext_ack *extack);
+   int (*ndo_mdb_del_bulk)(struct net_device *dev,
+   struct nlattr *tb[],
+   struct netlink_ext_ack *extack);
int (*ndo_mdb_dump)(struct net_device *dev,
struct sk_buff *skb,
struct netlink_callback *cb);
-- 
2.40.1

[PATCH net-next 1/9] bridge: add MDB state mask uAPI attribute

2023-12-17 Thread Ido Schimmel

Currently, the 'state' field in 'struct br_mdb_entry' can be set to 1 if
the MDB entry is permanent or 0 if it is temporary. Additional states
might be added in the future.

In a similar fashion to 'NDA_NDM_STATE_MASK', add an MDB state mask uAPI
attribute that will allow the upcoming bulk deletion API to bulk delete
MDB entries with a certain state or any state.
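
A minimal sketch of how such a mask can be applied, written in shell for
illustration only. The helper name state_selected is hypothetical. The
comparison follows the check used by the bridge driver patch in this
series, and the mapping of "permanent" to state 1 with mask 1 is an
assumption about how user space fills the request.

```shell
#!/bin/sh
MDB_TEMPORARY=0
MDB_PERMANENT=1

# Return 0 (true) if an entry with state $1 is selected by a request that
# carries state $2 and state mask $3. A zero mask selects any state.
state_selected()
{
    entry_state=$1
    wanted=$2
    mask=$3

    [ "$mask" -eq 0 ] && return 0
    [ $((entry_state & mask)) -eq "$wanted" ]
}

# Request: state = permanent, mask = permanent.
for s in $MDB_TEMPORARY $MDB_PERMANENT; do
    if state_selected "$s" "$MDB_PERMANENT" "$MDB_PERMANENT"; then
        echo "entry state $s: deleted"
    else
        echo "entry state $s: kept"
    fi
done
```

Without the mask attribute, a state of 0 in the request could not be told
apart from "any state", which is why the mask is carried separately.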

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
---
 include/uapi/linux/if_bridge.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index 2e23f99dc0f1..a5b743a2f775 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -757,6 +757,7 @@ enum {
MDBE_ATTR_VNI,
MDBE_ATTR_IFINDEX,
MDBE_ATTR_SRC_VNI,
+   MDBE_ATTR_STATE_MASK,
__MDBE_ATTR_MAX,
 };
 #define MDBE_ATTR_MAX (__MDBE_ATTR_MAX - 1)
-- 
2.40.1

[Bridge] [PATCH net-next v2 13/13] selftests: vxlan_mdb: Use MDB get instead of dump

2023-10-25 Thread Ido Schimmel via Bridge

Test the new MDB get functionality by converting dump and grep to MDB
get.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 tools/testing/selftests/net/test_vxlan_mdb.sh | 108 +-
 1 file changed, 54 insertions(+), 54 deletions(-)

diff --git a/tools/testing/selftests/net/test_vxlan_mdb.sh b/tools/testing/selftests/net/test_vxlan_mdb.sh
index 31e5f0f8859d..6e996f8063cd 100755
--- a/tools/testing/selftests/net/test_vxlan_mdb.sh
+++ b/tools/testing/selftests/net/test_vxlan_mdb.sh
@@ -337,62 +337,62 @@ basic_common()
# Basic add, replace and delete behavior.
run_cmd "bridge -n $ns1 mdb add dev vx0 port vx0 $grp_key permanent dst $vtep_ip src_vni 10010"
log_test $? 0 "MDB entry addition"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010"
log_test $? 0 "MDB entry presence after addition"
 
run_cmd "bridge -n $ns1 mdb replace dev vx0 port vx0 $grp_key permanent dst $vtep_ip src_vni 10010"
log_test $? 0 "MDB entry replacement"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010"
log_test $? 0 "MDB entry presence after replacement"
 
run_cmd "bridge -n $ns1 mdb del dev vx0 port vx0 $grp_key dst $vtep_ip src_vni 10010"
log_test $? 0 "MDB entry deletion"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\""
-   log_test $? 1 "MDB entry presence after deletion"
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010"
+   log_test $? 254 "MDB entry presence after deletion"
 
run_cmd "bridge -n $ns1 mdb del dev vx0 port vx0 $grp_key dst $vtep_ip src_vni 10010"
log_test $? 255 "Non-existent MDB entry deletion"
 
# Default protocol and replacement.
run_cmd "bridge -n $ns1 mdb add dev vx0 port vx0 $grp_key permanent dst $vtep_ip src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \"proto static\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010 | grep \"proto static\""
log_test $? 0 "MDB entry default protocol"
 
run_cmd "bridge -n $ns1 mdb replace dev vx0 port vx0 $grp_key permanent proto 123 dst $vtep_ip src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \"proto 123\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010 | grep \"proto 123\""
log_test $? 0 "MDB entry protocol replacement"
 
run_cmd "bridge -n $ns1 mdb del dev vx0 port vx0 $grp_key dst $vtep_ip src_vni 10010"
 
# Default destination port and replacement.
run_cmd "bridge -n $ns1 mdb add dev vx0 port vx0 $grp_key permanent dst $vtep_ip src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \" dst_port \""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010 | grep \" dst_port \""
log_test $? 1 "MDB entry default destination port"
 
run_cmd "bridge -n $ns1 mdb replace dev vx0 port vx0 $grp_key permanent dst $vtep_ip dst_port 1234 src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \"dst_port 1234\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010 | grep \"dst_port 1234\""
log_test $? 0 "MDB entry destination port replacement"
 
run_cmd "bridge -n $ns1 mdb del dev vx0 port vx0 $grp_key dst $vtep_ip src_vni 10010"
 
# Default destination VNI and replacement.
run_cmd "bridge -n $ns1 mdb add dev vx0 port vx0 $grp_key permanent dst $vtep_ip src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \" vni \""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010 | grep \" vni \""
log_test $? 1 "MDB entry default destination VNI"
 
run_cmd "bridge -n $ns1 mdb replace dev vx0 port vx0 $grp_key permanent dst $vtep_ip vni 1234 src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \"vni 1234\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key

[Bridge] [PATCH net-next v2 12/13] selftests: bridge_mdb: Use MDB get instead of dump

2023-10-25 Thread Ido Schimmel via Bridge

Test the new MDB get functionality by converting dump and grep to MDB
get.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 .../selftests/net/forwarding/bridge_mdb.sh| 184 +++---
 1 file changed, 71 insertions(+), 113 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/bridge_mdb.sh b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
index d0c6c499d5da..e4e3e9405056 100755
--- a/tools/testing/selftests/net/forwarding/bridge_mdb.sh
+++ b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
@@ -145,14 +145,14 @@ cfg_test_host_common()
 
# Check basic add, replace and delete behavior.
bridge mdb add dev br0 port br0 grp $grp $state vid 10
-   bridge mdb show dev br0 vid 10 | grep -q "$grp"
+   bridge mdb get dev br0 grp $grp vid 10 &> /dev/null
check_err $? "Failed to add $name host entry"
 
bridge mdb replace dev br0 port br0 grp $grp $state vid 10 &> /dev/null
check_fail $? "Managed to replace $name host entry"
 
bridge mdb del dev br0 port br0 grp $grp $state vid 10
-   bridge mdb show dev br0 vid 10 | grep -q "$grp"
+   bridge mdb get dev br0 grp $grp vid 10 &> /dev/null
check_fail $? "Failed to delete $name host entry"
 
# Check error cases.
@@ -200,7 +200,7 @@ cfg_test_port_common()
 
# Check basic add, replace and delete behavior.
bridge mdb add dev br0 port $swp1 $grp_key permanent vid 10
-   bridge mdb show dev br0 vid 10 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 10 &> /dev/null
check_err $? "Failed to add $name entry"
 
bridge mdb replace dev br0 port $swp1 $grp_key permanent vid 10 \
@@ -208,31 +208,31 @@ cfg_test_port_common()
check_err $? "Failed to replace $name entry"
 
bridge mdb del dev br0 port $swp1 $grp_key permanent vid 10
-   bridge mdb show dev br0 vid 10 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 10 &> /dev/null
check_fail $? "Failed to delete $name entry"
 
# Check default protocol and replacement.
bridge mdb add dev br0 port $swp1 $grp_key permanent vid 10
-   bridge -d mdb show dev br0 vid 10 | grep "$grp_key" | grep -q "static"
+   bridge -d mdb get dev br0 $grp_key vid 10 | grep -q "static"
check_err $? "$name entry not added with default \"static\" protocol"
 
bridge mdb replace dev br0 port $swp1 $grp_key permanent vid 10 \
proto 123
-   bridge -d mdb show dev br0 vid 10 | grep "$grp_key" | grep -q "123"
+   bridge -d mdb get dev br0 $grp_key vid 10 | grep -q "123"
check_err $? "Failed to replace protocol of $name entry"
bridge mdb del dev br0 port $swp1 $grp_key permanent vid 10
 
# Check behavior when VLAN is not specified.
bridge mdb add dev br0 port $swp1 $grp_key permanent
-   bridge mdb show dev br0 vid 10 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 10 &> /dev/null
check_err $? "$name entry with VLAN 10 not added when VLAN was not specified"
-   bridge mdb show dev br0 vid 20 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 20 &> /dev/null
check_err $? "$name entry with VLAN 20 not added when VLAN was not specified"
 
bridge mdb del dev br0 port $swp1 $grp_key permanent
-   bridge mdb show dev br0 vid 10 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 10 &> /dev/null
check_fail $? "$name entry with VLAN 10 not deleted when VLAN was not specified"
-   bridge mdb show dev br0 vid 20 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 20 &> /dev/null
check_fail $? "$name entry with VLAN 20 not deleted when VLAN was not specified"
 
# Check behavior when bridge port is down.
@@ -298,21 +298,21 @@ __cfg_test_port_ip_star_g()
RET=0
 
bridge mdb add dev br0 port $swp1 grp $grp vid 10
-   bridge -d mdb show dev br0 vid 10 | grep "$grp" | grep -q "exclude"
+   bridge -d mdb get dev br0 grp $grp vid 10 | grep -q "exclude"
check_err $? "Default filter mode is not \"exclude\""
bridge mdb del dev br0 port $swp1 grp $grp vid 10
 
# Check basic add and delete behavior.
bridge mdb add dev br0 port $swp1 grp $grp vid 10 filter_mode exclude \
source_list $src1
-   bridge -d mdb show dev br0 vid 10 | grep "$grp" | grep -q -v "src"
+   bridge -d mdb get dev br0 grp $grp vid 10 &> /dev/null
check_err $? "(*, G) entry not created

[Bridge] [PATCH net-next v2 11/13] rtnetlink: Add MDB get support

2023-10-25 Thread Ido Schimmel via Bridge

Now that both the bridge and VXLAN drivers implement the MDB get net
device operation, expose the functionality to user space by registering
a handler for RTM_GETMDB messages. Derive the net device from the
ifindex specified in the ancillary header and invoke its MDB get NDO.

Note that unlike other get handlers, the allocation of the skb
containing the response is not performed in the common rtnetlink code as
the size is variable and needs to be determined by the respective
driver.
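
As a rough illustration of the dispatch described above, the following
userspace sketch mirrors the order of checks in the handler: ifindex
present, device found, request attribute present, operation implemented.
All sketch_* names are hypothetical stand-ins and not kernel API; the
device lookup is a plain array search standing in for the real lookup by
ifindex, and the extra NULL check on the ops table exists only because
this is not kernel code.

```c
#include <errno.h>
#include <stddef.h>

struct sketch_dev;

/* Hypothetical stand-in for the net device operations table. */
struct sketch_ops {
	/* Plays the role of the MDB get NDO; NULL when unsupported. */
	int (*mdb_get)(const struct sketch_dev *dev);
};

/* Hypothetical stand-in for a net device. */
struct sketch_dev {
	int ifindex;
	const struct sketch_ops *ops;
};

/* Linear search standing in for the lookup by ifindex. */
static const struct sketch_dev *
sketch_dev_by_index(const struct sketch_dev *devs, size_t n, int ifindex)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return &devs[i];
	return NULL;
}

/* Same order of checks as the handler added by this patch. */
static int sketch_mdb_get(const struct sketch_dev *devs, size_t n,
			  int ifindex, int has_get_entry)
{
	const struct sketch_dev *dev;

	if (!ifindex)
		return -EINVAL;

	dev = sketch_dev_by_index(devs, n, ifindex);
	if (!dev)
		return -ENODEV;

	if (!has_get_entry)
		return -EINVAL;

	if (!dev->ops || !dev->ops->mdb_get)
		return -EOPNOTSUPP;

	/* The driver allocates and sends the reply itself. */
	return dev->ops->mdb_get(dev);
}

/* Example driver callback that always succeeds. */
static int sketch_driver_get(const struct sketch_dev *dev)
{
	(void)dev;
	return 0;
}
```

The callback returns only a status because, as noted above, the reply
is allocated by the driver rather than by the common code.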

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 net/core/rtnetlink.c | 89 +++-
 1 file changed, 88 insertions(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index f2753fd58881..e8431c6c8490 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -6219,6 +6219,93 @@ static int rtnl_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
return skb->len;
 }
 
+static int rtnl_validate_mdb_entry_get(const struct nlattr *attr,
+  struct netlink_ext_ack *extack)
+{
+   struct br_mdb_entry *entry = nla_data(attr);
+
+   if (nla_len(attr) != sizeof(struct br_mdb_entry)) {
+   NL_SET_ERR_MSG_ATTR(extack, attr, "Invalid attribute length");
+   return -EINVAL;
+   }
+
+   if (entry->ifindex) {
+   NL_SET_ERR_MSG(extack, "Entry ifindex cannot be specified");
+   return -EINVAL;
+   }
+
+   if (entry->state) {
+   NL_SET_ERR_MSG(extack, "Entry state cannot be specified");
+   return -EINVAL;
+   }
+
+   if (entry->flags) {
+   NL_SET_ERR_MSG(extack, "Entry flags cannot be specified");
+   return -EINVAL;
+   }
+
+   if (entry->vid >= VLAN_VID_MASK) {
+   NL_SET_ERR_MSG(extack, "Invalid entry VLAN id");
+   return -EINVAL;
+   }
+
+   if (entry->addr.proto != htons(ETH_P_IP) &&
+   entry->addr.proto != htons(ETH_P_IPV6) &&
+   entry->addr.proto != 0) {
+   NL_SET_ERR_MSG(extack, "Unknown entry protocol");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static const struct nla_policy mdba_get_policy[MDBA_GET_ENTRY_MAX + 1] = {
+   [MDBA_GET_ENTRY] = NLA_POLICY_VALIDATE_FN(NLA_BINARY,
+ rtnl_validate_mdb_entry_get,
+ sizeof(struct br_mdb_entry)),
+   [MDBA_GET_ENTRY_ATTRS] = { .type = NLA_NESTED },
+};
+
+static int rtnl_mdb_get(struct sk_buff *in_skb, struct nlmsghdr *nlh,
+   struct netlink_ext_ack *extack)
+{
+   struct nlattr *tb[MDBA_GET_ENTRY_MAX + 1];
+   struct net *net = sock_net(in_skb->sk);
+   struct br_port_msg *bpm;
+   struct net_device *dev;
+   int err;
+
+   err = nlmsg_parse(nlh, sizeof(struct br_port_msg), tb,
+ MDBA_GET_ENTRY_MAX, mdba_get_policy, extack);
+   if (err)
+   return err;
+
+   bpm = nlmsg_data(nlh);
+   if (!bpm->ifindex) {
+   NL_SET_ERR_MSG(extack, "Invalid ifindex");
+   return -EINVAL;
+   }
+
+   dev = __dev_get_by_index(net, bpm->ifindex);
+   if (!dev) {
+   NL_SET_ERR_MSG(extack, "Device doesn't exist");
+   return -ENODEV;
+   }
+
+   if (NL_REQ_ATTR_CHECK(extack, NULL, tb, MDBA_GET_ENTRY)) {
+   NL_SET_ERR_MSG(extack, "Missing MDBA_GET_ENTRY attribute");
+   return -EINVAL;
+   }
+
+   if (!dev->netdev_ops->ndo_mdb_get) {
+   NL_SET_ERR_MSG(extack, "Device does not support MDB operations");
+   return -EOPNOTSUPP;
+   }
+
+   return dev->netdev_ops->ndo_mdb_get(dev, tb, NETLINK_CB(in_skb).portid,
+   nlh->nlmsg_seq, extack);
+}
+
 static int rtnl_validate_mdb_entry(const struct nlattr *attr,
   struct netlink_ext_ack *extack)
 {
@@ -6595,7 +6682,7 @@ void __init rtnetlink_init(void)
  0);
rtnl_register(PF_UNSPEC, RTM_SETSTATS, rtnl_stats_set, NULL, 0);
 
-   rtnl_register(PF_BRIDGE, RTM_GETMDB, NULL, rtnl_mdb_dump, 0);
+   rtnl_register(PF_BRIDGE, RTM_GETMDB, rtnl_mdb_get, rtnl_mdb_dump, 0);
rtnl_register(PF_BRIDGE, RTM_NEWMDB, rtnl_mdb_add, NULL, 0);
rtnl_register(PF_BRIDGE, RTM_DELMDB, rtnl_mdb_del, NULL, 0);
 }
-- 
2.40.1

[Bridge] [PATCH net-next v2 10/13] vxlan: mdb: Add MDB get support

2023-10-25 Thread Ido Schimmel via Bridge

Implement support for the MDB get operation by looking up a matching MDB
entry, allocating the skb according to the entry's size and then filling
in the response.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 drivers/net/vxlan/vxlan_core.c|   1 +
 drivers/net/vxlan/vxlan_mdb.c | 150 ++
 drivers/net/vxlan/vxlan_private.h |   2 +
 3 files changed, 153 insertions(+)

diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index 7b526ae16ed0..901c590caf24 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -3226,6 +3226,7 @@ static const struct net_device_ops vxlan_netdev_ether_ops = {
.ndo_mdb_add= vxlan_mdb_add,
.ndo_mdb_del= vxlan_mdb_del,
.ndo_mdb_dump   = vxlan_mdb_dump,
+   .ndo_mdb_get= vxlan_mdb_get,
.ndo_fill_metadata_dst  = vxlan_fill_metadata_dst,
 };
 
diff --git a/drivers/net/vxlan/vxlan_mdb.c b/drivers/net/vxlan/vxlan_mdb.c
index 19640f7e3a88..e472fd67fc2e 100644
--- a/drivers/net/vxlan/vxlan_mdb.c
+++ b/drivers/net/vxlan/vxlan_mdb.c
@@ -1306,6 +1306,156 @@ int vxlan_mdb_del(struct net_device *dev, struct nlattr *tb[],
return err;
 }
 
+static const struct nla_policy vxlan_mdbe_attrs_get_pol[MDBE_ATTR_MAX + 1] = {
+   [MDBE_ATTR_SOURCE] = NLA_POLICY_RANGE(NLA_BINARY,
+ sizeof(struct in_addr),
+ sizeof(struct in6_addr)),
+   [MDBE_ATTR_SRC_VNI] = NLA_POLICY_FULL_RANGE(NLA_U32, &vni_range),
+};
+
+static int vxlan_mdb_get_parse(struct net_device *dev, struct nlattr *tb[],
+  struct vxlan_mdb_entry_key *group,
+  struct netlink_ext_ack *extack)
+{
+   struct br_mdb_entry *entry = nla_data(tb[MDBA_GET_ENTRY]);
+   struct nlattr *mdbe_attrs[MDBE_ATTR_MAX + 1];
+   struct vxlan_dev *vxlan = netdev_priv(dev);
+   int err;
+
+   memset(group, 0, sizeof(*group));
+   group->vni = vxlan->default_dst.remote_vni;
+
+   if (!tb[MDBA_GET_ENTRY_ATTRS]) {
+   vxlan_mdb_group_set(group, entry, NULL);
+   return 0;
+   }
+
+   err = nla_parse_nested(mdbe_attrs, MDBE_ATTR_MAX,
+  tb[MDBA_GET_ENTRY_ATTRS],
+  vxlan_mdbe_attrs_get_pol, extack);
+   if (err)
+   return err;
+
+   if (mdbe_attrs[MDBE_ATTR_SOURCE] &&
+   !vxlan_mdb_is_valid_source(mdbe_attrs[MDBE_ATTR_SOURCE],
+  entry->addr.proto, extack))
+   return -EINVAL;
+
+   vxlan_mdb_group_set(group, entry, mdbe_attrs[MDBE_ATTR_SOURCE]);
+
+   if (mdbe_attrs[MDBE_ATTR_SRC_VNI])
+   group->vni =
+   cpu_to_be32(nla_get_u32(mdbe_attrs[MDBE_ATTR_SRC_VNI]));
+
+   return 0;
+}
+
+static struct sk_buff *
+vxlan_mdb_get_reply_alloc(const struct vxlan_dev *vxlan,
+ const struct vxlan_mdb_entry *mdb_entry)
+{
+   struct vxlan_mdb_remote *remote;
+   size_t nlmsg_size;
+
+   nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
+/* MDBA_MDB */
+nla_total_size(0) +
+/* MDBA_MDB_ENTRY */
+nla_total_size(0);
+
+   list_for_each_entry(remote, &mdb_entry->remotes, list)
+   nlmsg_size += vxlan_mdb_nlmsg_remote_size(vxlan, mdb_entry,
+ remote);
+
+   return nlmsg_new(nlmsg_size, GFP_KERNEL);
+}
+
+static int
+vxlan_mdb_get_reply_fill(const struct vxlan_dev *vxlan,
+struct sk_buff *skb,
+const struct vxlan_mdb_entry *mdb_entry,
+u32 portid, u32 seq)
+{
+   struct nlattr *mdb_nest, *mdb_entry_nest;
+   struct vxlan_mdb_remote *remote;
+   struct br_port_msg *bpm;
+   struct nlmsghdr *nlh;
+   int err;
+
+   nlh = nlmsg_put(skb, portid, seq, RTM_NEWMDB, sizeof(*bpm), 0);
+   if (!nlh)
+   return -EMSGSIZE;
+
+   bpm = nlmsg_data(nlh);
+   memset(bpm, 0, sizeof(*bpm));
+   bpm->family  = AF_BRIDGE;
+   bpm->ifindex = vxlan->dev->ifindex;
+   mdb_nest = nla_nest_start_noflag(skb, MDBA_MDB);
+   if (!mdb_nest) {
+   err = -EMSGSIZE;
+   goto cancel;
+   }
+   mdb_entry_nest = nla_nest_start_noflag(skb, MDBA_MDB_ENTRY);
+   if (!mdb_entry_nest) {
+   err = -EMSGSIZE;
+   goto cancel;
+   }
+
+   list_for_each_entry(remote, &mdb_entry->remotes, list) {
+   err = vxlan_mdb_entry_info_fill(vxlan, skb, mdb_entry, remote);
+   if (err)
+   goto cancel;
+   }
+
+   nla_nest_end(skb, mdb_entry_nest);
+

[Bridge] [PATCH net-next v2 09/13] bridge: mcast: Add MDB get support

2023-10-25 Thread Ido Schimmel via Bridge

Implement support for the MDB get operation by looking up a matching MDB
entry, allocating the skb according to the entry's size and then filling
in the response. The operation is performed under the bridge multicast
lock to ensure that the entry does not change between the time the reply
size is determined and when the reply is filled in.
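
The sizing step can be sketched in userspace as follows. This is a
simplified model, not the driver code: the sketch_* names are
hypothetical, the alignment rules (4-byte alignment, 4-byte attribute
header) are restated locally, and each host or port group record is
reduced to an entry payload plus a 32-bit timer, whereas the real helper
accounts for more attributes.

```c
#include <stddef.h>
#include <stdint.h>

/* Local restatement of netlink alignment; names are sketch-only. */
#define SKETCH_ALIGN(len)	(((size_t)(len) + 3) & ~(size_t)3)
#define SKETCH_NLA_HDRLEN	((size_t)4)

/* Attribute header plus payload, padded to the alignment boundary. */
static size_t sketch_nla_total_size(size_t payload)
{
	return SKETCH_ALIGN(SKETCH_NLA_HDRLEN + payload);
}

/* Simplified size of one host or port group record. */
static size_t sketch_pg_size(size_t entry_payload)
{
	return sketch_nla_total_size(entry_payload) +	 /* entry info */
	       sketch_nla_total_size(sizeof(uint32_t)); /* timer */
}

/* Fixed part once, then one record per host join and per port group. */
static size_t sketch_reply_size(size_t hdr_len, size_t entry_payload,
				int host_joined, size_t num_port_groups)
{
	size_t size;

	size = SKETCH_ALIGN(hdr_len) +
	       sketch_nla_total_size(0) +	/* outer nest */
	       sketch_nla_total_size(0);	/* inner nest */

	if (host_joined)
		size += sketch_pg_size(entry_payload);

	size += num_port_groups * sketch_pg_size(entry_payload);

	return size;
}
```

Because the total depends on how many port groups the entry has at that
moment, the entry must not change between sizing and filling, which is
what holding the multicast lock provides.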

Signed-off-by: Ido Schimmel 
---

Notes:
v2:
* Add a comment above spin_lock_bh().

 net/bridge/br_device.c  |   1 +
 net/bridge/br_mdb.c | 158 
 net/bridge/br_private.h |   9 +++
 3 files changed, 168 insertions(+)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index d624710b384a..8f40de3af154 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -472,6 +472,7 @@ static const struct net_device_ops br_netdev_ops = {
.ndo_mdb_add = br_mdb_add,
.ndo_mdb_del = br_mdb_del,
.ndo_mdb_dump= br_mdb_dump,
+   .ndo_mdb_get = br_mdb_get,
.ndo_bridge_getlink  = br_getlink,
.ndo_bridge_setlink  = br_setlink,
.ndo_bridge_dellink  = br_dellink,
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 42983f6a0abd..8cc526067bc2 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -1411,3 +1411,161 @@ int br_mdb_del(struct net_device *dev, struct nlattr *tb[],
br_mdb_config_fini(&cfg);
return err;
 }
+
+static const struct nla_policy br_mdbe_attrs_get_pol[MDBE_ATTR_MAX + 1] = {
+   [MDBE_ATTR_SOURCE] = NLA_POLICY_RANGE(NLA_BINARY,
+ sizeof(struct in_addr),
+ sizeof(struct in6_addr)),
+};
+
+static int br_mdb_get_parse(struct net_device *dev, struct nlattr *tb[],
+   struct br_ip *group, struct netlink_ext_ack *extack)
+{
+   struct br_mdb_entry *entry = nla_data(tb[MDBA_GET_ENTRY]);
+   struct nlattr *mdbe_attrs[MDBE_ATTR_MAX + 1];
+   int err;
+
+   if (!tb[MDBA_GET_ENTRY_ATTRS]) {
+   __mdb_entry_to_br_ip(entry, group, NULL);
+   return 0;
+   }
+
+   err = nla_parse_nested(mdbe_attrs, MDBE_ATTR_MAX,
+  tb[MDBA_GET_ENTRY_ATTRS], br_mdbe_attrs_get_pol,
+  extack);
+   if (err)
+   return err;
+
+   if (mdbe_attrs[MDBE_ATTR_SOURCE] &&
+   !is_valid_mdb_source(mdbe_attrs[MDBE_ATTR_SOURCE],
+entry->addr.proto, extack))
+   return -EINVAL;
+
+   __mdb_entry_to_br_ip(entry, group, mdbe_attrs);
+
+   return 0;
+}
+
+static struct sk_buff *
+br_mdb_get_reply_alloc(const struct net_bridge_mdb_entry *mp)
+{
+   struct net_bridge_port_group *pg;
+   size_t nlmsg_size;
+
+   nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
+/* MDBA_MDB */
+nla_total_size(0) +
+/* MDBA_MDB_ENTRY */
+nla_total_size(0);
+
+   if (mp->host_joined)
+   nlmsg_size += rtnl_mdb_nlmsg_pg_size(NULL);
+
+   for (pg = mlock_dereference(mp->ports, mp->br); pg;
+pg = mlock_dereference(pg->next, mp->br))
+   nlmsg_size += rtnl_mdb_nlmsg_pg_size(pg);
+
+   return nlmsg_new(nlmsg_size, GFP_ATOMIC);
+}
+
+static int br_mdb_get_reply_fill(struct sk_buff *skb,
+struct net_bridge_mdb_entry *mp, u32 portid,
+u32 seq)
+{
+   struct nlattr *mdb_nest, *mdb_entry_nest;
+   struct net_bridge_port_group *pg;
+   struct br_port_msg *bpm;
+   struct nlmsghdr *nlh;
+   int err;
+
+   nlh = nlmsg_put(skb, portid, seq, RTM_NEWMDB, sizeof(*bpm), 0);
+   if (!nlh)
+   return -EMSGSIZE;
+
+   bpm = nlmsg_data(nlh);
+   memset(bpm, 0, sizeof(*bpm));
+   bpm->family  = AF_BRIDGE;
+   bpm->ifindex = mp->br->dev->ifindex;
+   mdb_nest = nla_nest_start_noflag(skb, MDBA_MDB);
+   if (!mdb_nest) {
+   err = -EMSGSIZE;
+   goto cancel;
+   }
+   mdb_entry_nest = nla_nest_start_noflag(skb, MDBA_MDB_ENTRY);
+   if (!mdb_entry_nest) {
+   err = -EMSGSIZE;
+   goto cancel;
+   }
+
+   if (mp->host_joined) {
+   err = __mdb_fill_info(skb, mp, NULL);
+   if (err)
+   goto cancel;
+   }
+
+   for (pg = mlock_dereference(mp->ports, mp->br); pg;
+pg = mlock_dereference(pg->next, mp->br)) {
+   err = __mdb_fill_info(skb, mp, pg);
+   if (err)
+   goto cancel;
+   }
+
+   nla_nest_end(skb, mdb_entry_nest);
+   nla_nest_end(skb, mdb_nest);
+   nlmsg_end(skb, nlh);
+
+   return 0;
+
+cancel:
+   nlmsg_cancel(skb, nlh);
+

[Bridge] [PATCH net-next v2 08/13] net: Add MDB get device operation

2023-10-25 Thread Ido Schimmel via Bridge

Add an MDB get net device operation that the rtnetlink code invokes in
response to received RTM_GETMDB messages. Subsequent patches will
implement the operation in the bridge and VXLAN drivers.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 include/linux/netdevice.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b8bf669212cc..a16c9cc063fe 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1609,6 +1609,10 @@ struct net_device_ops {
int (*ndo_mdb_dump)(struct net_device *dev,
struct sk_buff *skb,
struct netlink_callback *cb);
+   int (*ndo_mdb_get)(struct net_device *dev,
+  struct nlattr *tb[], u32 portid,
+  u32 seq,
+  struct netlink_ext_ack *extack);
int (*ndo_bridge_setlink)(struct net_device *dev,
  struct nlmsghdr *nlh,
  u16 flags,
-- 
2.40.1

[Bridge] [PATCH net-next v2 07/13] bridge: add MDB get uAPI attributes

2023-10-25 Thread Ido Schimmel via Bridge

Add MDB get attributes that correspond to the MDB set attributes used in
RTM_NEWMDB messages. Specifically, add 'MDBA_GET_ENTRY' which will hold
a 'struct br_mdb_entry' and 'MDBA_GET_ENTRY_ATTRS' which will hold
'MDBE_ATTR_*' attributes that are used as indexes (source IP and source
VNI).

An example request will look as follows:

[ struct nlmsghdr ]
[ struct br_port_msg ]
[ MDBA_GET_ENTRY ]
struct br_mdb_entry
[ MDBA_GET_ENTRY_ATTRS ]
[ MDBE_ATTR_SOURCE ]
struct in_addr / struct in6_addr
[ MDBE_ATTR_SRC_VNI ]
u32
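
A minimal userspace sketch of how the attribute part of such a request
could be laid out in a buffer is shown below. It covers only the
attributes, not the nlmsghdr or br_port_msg that precede them. The
sketch_* and SKETCH_* names are hypothetical; the two MDBE attribute
numbers are placeholders that should be checked against if_bridge.h, the
entry payload is passed in as an opaque blob, and no attribute flag bits
(for example NLA_F_NESTED) are set, which a real request may require.

```c
#include <stdint.h>
#include <string.h>

/* Attribute header as laid out on the wire (same shape as struct nlattr). */
struct sketch_nlattr {
	uint16_t nla_len;
	uint16_t nla_type;
};

#define SKETCH_NLA_ALIGN(len)	(((size_t)(len) + 3) & ~(size_t)3)

enum {
	SKETCH_GET_ENTRY = 1,		/* position of MDBA_GET_ENTRY */
	SKETCH_GET_ENTRY_ATTRS = 2,	/* position of MDBA_GET_ENTRY_ATTRS */
	SKETCH_ATTR_SOURCE = 1,		/* placeholder for MDBE_ATTR_SOURCE */
	SKETCH_ATTR_SRC_VNI = 9,	/* placeholder for MDBE_ATTR_SRC_VNI */
};

/* Append one attribute at offset off; returns the next aligned offset. */
static size_t sketch_put_attr(uint8_t *buf, size_t off, uint16_t type,
			      const void *data, uint16_t len)
{
	struct sketch_nlattr hdr;

	hdr.nla_len = (uint16_t)(sizeof(hdr) + len);
	hdr.nla_type = type;
	memcpy(buf + off, &hdr, sizeof(hdr));
	if (len)
		memcpy(buf + off + sizeof(hdr), data, len);

	return off + SKETCH_NLA_ALIGN(sizeof(hdr) + len);
}

/* buf must be zeroed and large enough; returns the number of bytes used. */
static size_t sketch_build_get_attrs(uint8_t *buf,
				     const void *entry, uint16_t entry_len,
				     const void *src, uint16_t src_len,
				     uint32_t src_vni)
{
	struct sketch_nlattr nest;
	size_t nest_off, off;

	off = sketch_put_attr(buf, 0, SKETCH_GET_ENTRY, entry, entry_len);

	/* Reserve the nest header, add the children, then patch its length. */
	nest_off = off;
	off = sketch_put_attr(buf, off, SKETCH_GET_ENTRY_ATTRS, NULL, 0);
	off = sketch_put_attr(buf, off, SKETCH_ATTR_SOURCE, src, src_len);
	off = sketch_put_attr(buf, off, SKETCH_ATTR_SRC_VNI, &src_vni,
			      sizeof(src_vni));

	nest.nla_len = (uint16_t)(off - nest_off);
	nest.nla_type = SKETCH_GET_ENTRY_ATTRS;
	memcpy(buf + nest_off, &nest, sizeof(nest));

	return off;
}
```

The nest length is patched after the children are written because it is
only known once both index attributes have been appended.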

Signed-off-by: Ido Schimmel 
---

Notes:
v2:
* Add comment.

 include/uapi/linux/if_bridge.h | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index f95326fce6bb..2e23f99dc0f1 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -723,6 +723,24 @@ enum {
 };
 #define MDBA_SET_ENTRY_MAX (__MDBA_SET_ENTRY_MAX - 1)
 
+/* [MDBA_GET_ENTRY] = {
+ *struct br_mdb_entry
+ *[MDBA_GET_ENTRY_ATTRS] = {
+ *   [MDBE_ATTR_SOURCE]
+ *  struct in_addr / struct in6_addr
+ *   [MDBE_ATTR_SRC_VNI]
+ *  u32
+ *}
+ * }
+ */
+enum {
+   MDBA_GET_ENTRY_UNSPEC,
+   MDBA_GET_ENTRY,
+   MDBA_GET_ENTRY_ATTRS,
+   __MDBA_GET_ENTRY_MAX,
+};
+#define MDBA_GET_ENTRY_MAX (__MDBA_GET_ENTRY_MAX - 1)
+
 /* [MDBA_SET_ENTRY_ATTRS] = {
  *[MDBE_ATTR_xxx]
  *...
-- 
2.40.1

[Bridge] [PATCH net-next v2 06/13] vxlan: mdb: Factor out a helper for remote entry size calculation

2023-10-25 Thread Ido Schimmel via Bridge

Currently, netlink notifications are sent for individual remote entries
and not for the entire MDB entry itself.

Subsequent patches are going to add MDB get support which will require
the VXLAN driver to reply with an entire MDB entry.

Therefore, as a preparation, factor out a helper to calculate the size
of an individual remote entry. When determining the size of the reply
this helper will be invoked for each remote entry in the MDB entry.

No functional changes intended.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 drivers/net/vxlan/vxlan_mdb.c | 28 +++-
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/drivers/net/vxlan/vxlan_mdb.c b/drivers/net/vxlan/vxlan_mdb.c
index 0b6043e1473b..19640f7e3a88 100644
--- a/drivers/net/vxlan/vxlan_mdb.c
+++ b/drivers/net/vxlan/vxlan_mdb.c
@@ -925,23 +925,20 @@ vxlan_mdb_nlmsg_src_list_size(const struct vxlan_mdb_entry_key *group,
return nlmsg_size;
 }
 
-static size_t vxlan_mdb_nlmsg_size(const struct vxlan_dev *vxlan,
-  const struct vxlan_mdb_entry *mdb_entry,
-  const struct vxlan_mdb_remote *remote)
+static size_t
+vxlan_mdb_nlmsg_remote_size(const struct vxlan_dev *vxlan,
+   const struct vxlan_mdb_entry *mdb_entry,
+   const struct vxlan_mdb_remote *remote)
 {
const struct vxlan_mdb_entry_key *group = &mdb_entry->key;
struct vxlan_rdst *rd = rtnl_dereference(remote->rd);
size_t nlmsg_size;
 
-   nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
-/* MDBA_MDB */
-nla_total_size(0) +
-/* MDBA_MDB_ENTRY */
-nla_total_size(0) +
 /* MDBA_MDB_ENTRY_INFO */
-nla_total_size(sizeof(struct br_mdb_entry)) +
+   nlmsg_size = nla_total_size(sizeof(struct br_mdb_entry)) +
 /* MDBA_MDB_EATTR_TIMER */
 nla_total_size(sizeof(u32));
+
/* MDBA_MDB_EATTR_SOURCE */
if (vxlan_mdb_is_sg(group))
nlmsg_size += nla_total_size(vxlan_addr_size(&group->dst));
@@ -969,6 +966,19 @@ static size_t vxlan_mdb_nlmsg_size(const struct vxlan_dev *vxlan,
return nlmsg_size;
 }
 
+static size_t vxlan_mdb_nlmsg_size(const struct vxlan_dev *vxlan,
+  const struct vxlan_mdb_entry *mdb_entry,
+  const struct vxlan_mdb_remote *remote)
+{
+   return NLMSG_ALIGN(sizeof(struct br_port_msg)) +
+  /* MDBA_MDB */
+  nla_total_size(0) +
+  /* MDBA_MDB_ENTRY */
+  nla_total_size(0) +
+  /* Remote entry */
+  vxlan_mdb_nlmsg_remote_size(vxlan, mdb_entry, remote);
+}
+
 static int vxlan_mdb_nlmsg_fill(const struct vxlan_dev *vxlan,
struct sk_buff *skb,
const struct vxlan_mdb_entry *mdb_entry,
-- 
2.40.1

[Bridge] [PATCH net-next v2 05/13] vxlan: mdb: Adjust function arguments

2023-10-25 Thread Ido Schimmel via Bridge

Adjust the function's arguments and rename it to allow it to be reused
by future call sites that only have access to 'struct
vxlan_mdb_entry_key', but not to 'struct vxlan_mdb_config'.

No functional changes intended.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 drivers/net/vxlan/vxlan_mdb.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/net/vxlan/vxlan_mdb.c b/drivers/net/vxlan/vxlan_mdb.c
index 5e041622261a..0b6043e1473b 100644
--- a/drivers/net/vxlan/vxlan_mdb.c
+++ b/drivers/net/vxlan/vxlan_mdb.c
@@ -370,12 +370,10 @@ static bool vxlan_mdb_is_valid_source(const struct nlattr *attr, __be16 proto,
return true;
 }
 
-static void vxlan_mdb_config_group_set(struct vxlan_mdb_config *cfg,
-  const struct br_mdb_entry *entry,
-  const struct nlattr *source_attr)
+static void vxlan_mdb_group_set(struct vxlan_mdb_entry_key *group,
+   const struct br_mdb_entry *entry,
+   const struct nlattr *source_attr)
 {
-   struct vxlan_mdb_entry_key *group = &cfg->group;
-
switch (entry->addr.proto) {
case htons(ETH_P_IP):
group->dst.sa.sa_family = AF_INET;
@@ -503,7 +501,7 @@ static int vxlan_mdb_config_attrs_init(struct vxlan_mdb_config *cfg,
   entry->addr.proto, extack))
return -EINVAL;
 
-   vxlan_mdb_config_group_set(cfg, entry, mdbe_attrs[MDBE_ATTR_SOURCE]);
+   vxlan_mdb_group_set(&cfg->group, entry, mdbe_attrs[MDBE_ATTR_SOURCE]);
 
/* rtnetlink code only validates that IPv4 group address is
 * multicast.
-- 
2.40.1

[Bridge] [PATCH net-next v2 03/13] bridge: mcast: Factor out a helper for PG entry size calculation

2023-10-25 Thread Ido Schimmel via Bridge

Currently, netlink notifications are sent for individual port group
entries and not for the entire MDB entry itself.

Subsequent patches are going to add MDB get support which will require
the bridge driver to reply with an entire MDB entry.

Therefore, as a preparation, factor out a helper to calculate the size
of an individual port group entry. When determining the size of the
reply this helper will be invoked for each port group entry in the MDB
entry.

No functional changes intended.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 net/bridge/br_mdb.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 08de94bffc12..42983f6a0abd 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -450,18 +450,13 @@ static int nlmsg_populate_mdb_fill(struct sk_buff *skb,
return -EMSGSIZE;
 }
 
-static size_t rtnl_mdb_nlmsg_size(struct net_bridge_port_group *pg)
+static size_t rtnl_mdb_nlmsg_pg_size(const struct net_bridge_port_group *pg)
 {
struct net_bridge_group_src *ent;
size_t nlmsg_size, addr_size = 0;
 
-   nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
-/* MDBA_MDB */
-nla_total_size(0) +
-/* MDBA_MDB_ENTRY */
-nla_total_size(0) +
 /* MDBA_MDB_ENTRY_INFO */
-nla_total_size(sizeof(struct br_mdb_entry)) +
+   nlmsg_size = nla_total_size(sizeof(struct br_mdb_entry)) +
 /* MDBA_MDB_EATTR_TIMER */
 nla_total_size(sizeof(u32));
 
@@ -511,6 +506,17 @@ static size_t rtnl_mdb_nlmsg_size(struct net_bridge_port_group *pg)
return nlmsg_size;
 }
 
+static size_t rtnl_mdb_nlmsg_size(const struct net_bridge_port_group *pg)
+{
+   return NLMSG_ALIGN(sizeof(struct br_port_msg)) +
+  /* MDBA_MDB */
+  nla_total_size(0) +
+  /* MDBA_MDB_ENTRY */
+  nla_total_size(0) +
+  /* Port group entry */
+  rtnl_mdb_nlmsg_pg_size(pg);
+}
+
 void br_mdb_notify(struct net_device *dev,
   struct net_bridge_mdb_entry *mp,
   struct net_bridge_port_group *pg,
-- 
2.40.1

[Bridge] [PATCH net-next v2 04/13] bridge: mcast: Rename MDB entry get function

2023-10-25 Thread Ido Schimmel via Bridge

The current name is going to conflict with the upcoming net device
operation for the MDB get operation.

Rename the function to br_mdb_entry_skb_get(). No functional changes
intended.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 net/bridge/br_device.c|  2 +-
 net/bridge/br_input.c |  2 +-
 net/bridge/br_multicast.c |  5 +++--
 net/bridge/br_private.h   | 10 ++
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 9a5ea06236bd..d624710b384a 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -92,7 +92,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
goto out;
}
 
-   mdst = br_mdb_get(brmctx, skb, vid);
+   mdst = br_mdb_entry_skb_get(brmctx, skb, vid);
if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst))
br_multicast_flood(mdst, skb, brmctx, false, true);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index c729528b5e85..f21097e73482 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -175,7 +175,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
 
switch (pkt_type) {
case BR_PKT_MULTICAST:
-   mdst = br_mdb_get(brmctx, skb, vid);
+   mdst = br_mdb_entry_skb_get(brmctx, skb, vid);
if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst)) {
if ((mdst && mdst->host_joined) ||
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 96d1fc78dd39..d7d021af1029 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -145,8 +145,9 @@ static struct net_bridge_mdb_entry *br_mdb_ip6_get(struct net_bridge *br,
 }
 #endif
 
-struct net_bridge_mdb_entry *br_mdb_get(struct net_bridge_mcast *brmctx,
-   struct sk_buff *skb, u16 vid)
+struct net_bridge_mdb_entry *
+br_mdb_entry_skb_get(struct net_bridge_mcast *brmctx, struct sk_buff *skb,
+u16 vid)
 {
struct net_bridge *br = brmctx->br;
struct br_ip ip;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 27a7a06660f3..40bbcd9f63b5 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -955,8 +955,9 @@ int br_multicast_rcv(struct net_bridge_mcast **brmctx,
 struct net_bridge_mcast_port **pmctx,
 struct net_bridge_vlan *vlan,
 struct sk_buff *skb, u16 vid);
-struct net_bridge_mdb_entry *br_mdb_get(struct net_bridge_mcast *brmctx,
-   struct sk_buff *skb, u16 vid);
+struct net_bridge_mdb_entry *
+br_mdb_entry_skb_get(struct net_bridge_mcast *brmctx, struct sk_buff *skb,
+u16 vid);
 int br_multicast_add_port(struct net_bridge_port *port);
 void br_multicast_del_port(struct net_bridge_port *port);
 void br_multicast_enable_port(struct net_bridge_port *port);
@@ -1345,8 +1346,9 @@ static inline int br_multicast_rcv(struct net_bridge_mcast **brmctx,
return 0;
 }
 
-static inline struct net_bridge_mdb_entry *br_mdb_get(struct net_bridge_mcast *brmctx,
- struct sk_buff *skb, u16 vid)
+static inline struct net_bridge_mdb_entry *
+br_mdb_entry_skb_get(struct net_bridge_mcast *brmctx, struct sk_buff *skb,
+u16 vid)
 {
return NULL;
 }
-- 
2.40.1

[Bridge] [PATCH net-next v2 02/13] bridge: mcast: Account for missing attributes

2023-10-25 Thread Ido Schimmel via Bridge

The 'MDBA_MDB' and 'MDBA_MDB_ENTRY' nest attributes are not accounted
for when calculating the size of MDB notifications. Add them along with
comments for existing attributes.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 net/bridge/br_mdb.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index fb58bb1b60e8..08de94bffc12 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -452,11 +452,18 @@ static int nlmsg_populate_mdb_fill(struct sk_buff *skb,
 
 static size_t rtnl_mdb_nlmsg_size(struct net_bridge_port_group *pg)
 {
-   size_t nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
-   nla_total_size(sizeof(struct br_mdb_entry)) +
-   nla_total_size(sizeof(u32));
struct net_bridge_group_src *ent;
-   size_t addr_size = 0;
+   size_t nlmsg_size, addr_size = 0;
+
+   nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
+/* MDBA_MDB */
+nla_total_size(0) +
+/* MDBA_MDB_ENTRY */
+nla_total_size(0) +
+/* MDBA_MDB_ENTRY_INFO */
+nla_total_size(sizeof(struct br_mdb_entry)) +
+/* MDBA_MDB_EATTR_TIMER */
+nla_total_size(sizeof(u32));
 
if (!pg)
goto out;
-- 
2.40.1

[Bridge] [PATCH net-next v2 01/13] bridge: mcast: Dump MDB entries even when snooping is disabled

2023-10-25 Thread Ido Schimmel via Bridge

Currently, the bridge driver does not dump MDB entries when multicast
snooping is disabled although the entries are present in the kernel:

 # bridge mdb add dev br0 port swp1 grp 239.1.1.1 permanent
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ff9d:e61b temp
 # ip link set dev br0 type bridge mcast_snooping 0
 # bridge mdb show dev br0
 # ip link set dev br0 type bridge mcast_snooping 1
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ff9d:e61b temp

This behavior differs from other netlink dump interfaces, which dump
entries regardless of whether they are in use. For example, VLANs are
dumped even when VLAN filtering is disabled:

 # ip link set dev br0 type bridge vlan_filtering 0
 # bridge vlan show dev swp1
 port  vlan-id
 swp1  1 PVID Egress Untagged

Remove the check and always dump MDB entries:

 # bridge mdb add dev br0 port swp1 grp 239.1.1.1 permanent
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ffeb:1a4d temp
 # ip link set dev br0 type bridge mcast_snooping 0
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ffeb:1a4d temp
 # ip link set dev br0 type bridge mcast_snooping 1
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ffeb:1a4d temp

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 net/bridge/br_mdb.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 7305f5f8215c..fb58bb1b60e8 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -323,9 +323,6 @@ static int br_mdb_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
struct net_bridge_mdb_entry *mp;
struct nlattr *nest, *nest2;
 
-   if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
-   return 0;
-
nest = nla_nest_start_noflag(skb, MDBA_MDB);
if (nest == NULL)
return -EMSGSIZE;
-- 
2.40.1

[Bridge] [PATCH net-next v2 00/13] Add MDB get support

2023-10-25 Thread Ido Schimmel via Bridge

This patchset adds MDB get support, allowing user space to request a
single MDB entry to be retrieved instead of dumping the entire MDB.
Support is added in both the bridge and VXLAN drivers.

Patches #1-#6 are small preparations in both drivers.

Patches #7-#8 add the required uAPI attributes for the new functionality
and the MDB get net device operation (NDO), respectively.

Patches #9-#10 implement the MDB get NDO in both drivers.

Patch #11 registers a handler for RTM_GETMDB messages in rtnetlink core.
The handler derives the net device from the ifindex specified in the
ancillary header and invokes its MDB get NDO.

Patches #12-#13 add selftests by converting tests that use MDB dump with
grep to the new MDB get functionality.

iproute2 changes can be found here [1].

v2:
* Patch #7: Add a comment to describe attributes structure.
* Patch #9: Add a comment above spin_lock_bh().

[1] https://github.com/idosch/iproute2/tree/submit/mdb_get_v1

Ido Schimmel (13):
  bridge: mcast: Dump MDB entries even when snooping is disabled
  bridge: mcast: Account for missing attributes
  bridge: mcast: Factor out a helper for PG entry size calculation
  bridge: mcast: Rename MDB entry get function
  vxlan: mdb: Adjust function arguments
  vxlan: mdb: Factor out a helper for remote entry size calculation
  bridge: add MDB get uAPI attributes
  net: Add MDB get device operation
  bridge: mcast: Add MDB get support
  vxlan: mdb: Add MDB get support
  rtnetlink: Add MDB get support
  selftests: bridge_mdb: Use MDB get instead of dump
  selftests: vxlan_mdb: Use MDB get instead of dump

 drivers/net/vxlan/vxlan_core.c|   1 +
 drivers/net/vxlan/vxlan_mdb.c | 188 --
 drivers/net/vxlan/vxlan_private.h |   2 +
 include/linux/netdevice.h |   4 +
 include/uapi/linux/if_bridge.h|  18 ++
 net/bridge/br_device.c|   3 +-
 net/bridge/br_input.c |   2 +-
 net/bridge/br_mdb.c   | 184 -
 net/bridge/br_multicast.c |   5 +-
 net/bridge/br_private.h   |  19 +-
 net/core/rtnetlink.c  |  89 -
 .../selftests/net/forwarding/bridge_mdb.sh| 184 +++--
 tools/testing/selftests/net/test_vxlan_mdb.sh | 108 +-
 13 files changed, 608 insertions(+), 199 deletions(-)

-- 
2.40.1

Re: [Bridge] [PATCH net-next v5 3/5] net: bridge: Add netlink knobs for number / max learned FDB entries

2023-10-17 Thread Ido Schimmel

On Mon, Oct 16, 2023 at 03:27:22PM +0200, Johannes Nixdorf wrote:
> The previous patch added accounting and a limit for the number of
> dynamically learned FDB entries per bridge. However it did not provide
> means to actually configure those bounds or read back the count. This
> patch does that.
> 
> Two new netlink attributes are added for the accounting and limit of
> dynamically learned FDB entries:
>  - IFLA_BR_FDB_N_LEARNED (RO) for the number of entries accounted for
>a single bridge.
>  - IFLA_BR_FDB_MAX_LEARNED (RW) for the configured limit of entries for
>the bridge.
> 
> The new attributes are used like this:
> 
>  # ip link add name br up type bridge fdb_max_learned 256
>  # ip link add name v1 up master br type veth peer v2
>  # ip link set up dev v2
>  # mausezahn -a rand -c 1024 v2
>  0.01 seconds (90877 packets per second)
>  # bridge fdb | grep -v permanent | wc -l
>  256
>  # ip -d link show dev br
>  13: br:  mtu 1500 [...]
>  [...] fdb_n_learned 256 fdb_max_learned 256
> 
> Signed-off-by: Johannes Nixdorf 

Reviewed-by: Ido Schimmel

Re: [Bridge] [PATCH net-next v5 4/5] net: bridge: Set strict_start_type for br_policy

2023-10-17 Thread Ido Schimmel

On Mon, Oct 16, 2023 at 03:27:23PM +0200, Johannes Nixdorf wrote:
> Set any new attributes added to br_policy to be parsed strictly, to
> prevent userspace from passing garbage.
> 
> Signed-off-by: Johannes Nixdorf 

Reviewed-by: Ido Schimmel

Re: [Bridge] [PATCH net-next 09/13] bridge: mcast: Add MDB get support

2023-10-17 Thread Ido Schimmel via Bridge

On Tue, Oct 17, 2023 at 12:24:44PM +0300, Nikolay Aleksandrov wrote:
> On 10/16/23 16:12, Ido Schimmel wrote:
> > Implement support for MDB get operation by looking up a matching MDB
> > entry, allocating the skb according to the entry's size and then filling
> > in the response. The operation is performed under the bridge multicast
> > lock to ensure that the entry does not change between the time the reply
> > size is determined and when the reply is filled in.
> > 
> > Signed-off-by: Ido Schimmel 
> > ---
> >   net/bridge/br_device.c  |   1 +
> >   net/bridge/br_mdb.c | 154 
> >   net/bridge/br_private.h |   9 +++
> >   3 files changed, 164 insertions(+)
> > 
> [snip]
> > +int br_mdb_get(struct net_device *dev, struct nlattr *tb[], u32 portid, u32 seq,
> > +  struct netlink_ext_ack *extack)
> > +{
> > +   struct net_bridge *br = netdev_priv(dev);
> > +   struct net_bridge_mdb_entry *mp;
> > +   struct sk_buff *skb;
> > +   struct br_ip group;
> > +   int err;
> > +
> > +   err = br_mdb_get_parse(dev, tb, &group, extack);
> > +   if (err)
> > +   return err;
> > +
> > +   spin_lock_bh(&br->multicast_lock);
> 
> Since this is only reading, could we use rcu to avoid blocking mcast
> processing?

I tried to explain this choice in the commit message. Do you think it's
a non-issue?

> 
> > +
> > +   mp = br_mdb_ip_get(br, &group);
> > +   if (!mp) {
> > +   NL_SET_ERR_MSG_MOD(extack, "MDB entry not found");
> > +   err = -ENOENT;
> > +   goto unlock;
> > +   }
> > +
> > +   skb = br_mdb_get_reply_alloc(mp);
> > +   if (!skb) {
> > +   err = -ENOMEM;
> > +   goto unlock;
> > +   }
> > +
> > +   err = br_mdb_get_reply_fill(skb, mp, portid, seq);
> > +   if (err) {
> > +   NL_SET_ERR_MSG_MOD(extack, "Failed to fill MDB get reply");
> > +   goto free;
> > +   }
> > +
> > +   spin_unlock_bh(&br->multicast_lock);
> > +
> > +   return rtnl_unicast(skb, dev_net(dev), portid);
> > +
> > +free:
> > +   kfree_skb(skb);
> > +unlock:
> > +   spin_unlock_bh(&br->multicast_lock);
> > +   return err;
> > +}
>
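
For readers following the locking question, the shape of the quoted
function can be sketched in user space. This is only an illustration
under stated assumptions, not the bridge code: a toy spinlock built on
C11 atomic_flag stands in for spin_lock_bh(), malloc() stands in for the
skb allocation, and every sk_* name is hypothetical. The point it shows
is that the reply size is derived and the reply is filled while the same
lock is held, so the entry cannot change in between, and the lock is
dropped before the reply is handed on.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an MDB entry: a short list of port numbers. */
struct sk_entry {
	size_t n_ports;
	int ports[8];
};

/* Hypothetical table holding at most one entry, guarded by a toy lock. */
struct sk_table {
	atomic_flag lock;
	struct sk_entry *entry;
};

static void sk_lock(struct sk_table *t)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
		;
}

static void sk_unlock(struct sk_table *t)
{
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* Returns 0 and a malloc()ed reply on success, a negative errno otherwise. */
static int sk_get(struct sk_table *t, int **reply, size_t *reply_len)
{
	struct sk_entry *e;
	int *buf;
	int err;

	sk_lock(t);

	e = t->entry;
	if (!e) {
		err = -ENOENT;
		goto unlock;
	}
	assert(e->n_ports > 0 && e->n_ports <= 8);

	/* Size pass: the allocation is sized from the entry as it is now. */
	buf = malloc(e->n_ports * sizeof(*buf));
	if (!buf) {
		err = -ENOMEM;
		goto unlock;
	}

	/* Fill pass: still under the lock, so the size cannot be stale. */
	memcpy(buf, e->ports, e->n_ports * sizeof(*buf));
	*reply_len = e->n_ports;

	sk_unlock(t);

	/* The reply is handed to the caller only after the lock is dropped. */
	*reply = buf;
	return 0;

unlock:
	sk_unlock(t);
	return err;
}
```

The sketch has no fill-failure path, so it needs only the "unlock" label;
the quoted function additionally frees the skb when the fill step fails.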

Re: [Bridge] [PATCH net-next 07/13] bridge: add MDB get uAPI attributes

2023-10-17 Thread Ido Schimmel via Bridge

On Tue, Oct 17, 2023 at 12:08:30PM +0300, Nikolay Aleksandrov wrote:
> On 10/16/23 16:12, Ido Schimmel wrote:
> > Add MDB get attributes that correspond to the MDB set attributes used in
> > RTM_NEWMDB messages. Specifically, add 'MDBA_GET_ENTRY' which will hold
> > a 'struct br_mdb_entry' and 'MDBA_GET_ENTRY_ATTRS' which will hold
> > 'MDBE_ATTR_*' attributes that are used as indexes (source IP and source
> > VNI).
> > 
> > An example request will look as follows:
> > 
> > [ struct nlmsghdr ]
> > [ struct br_port_msg ]
> > [ MDBA_GET_ENTRY ]
> > struct br_mdb_entry
> > [ MDBA_GET_ENTRY_ATTRS ]
> > [ MDBE_ATTR_SOURCE ]
> > struct in_addr / struct in6_addr
> > [ MDBE_ATTR_SRC_VNI ]
> > u32
> > 
> 
> Could you please add this info as a comment above the enum?
> Similar to the enum below it. It'd be nice to have an example
> of what's expected.

Yes, will add in v2.

Thanks

[Bridge] [PATCH net-next 13/13] selftests: vxlan_mdb: Use MDB get instead of dump

2023-10-16 Thread Ido Schimmel via Bridge

Test the new MDB get functionality by converting dump and grep to MDB
get.

Signed-off-by: Ido Schimmel 
---
 tools/testing/selftests/net/test_vxlan_mdb.sh | 108 +-
 1 file changed, 54 insertions(+), 54 deletions(-)

diff --git a/tools/testing/selftests/net/test_vxlan_mdb.sh b/tools/testing/selftests/net/test_vxlan_mdb.sh
index 31e5f0f8859d..6e996f8063cd 100755
--- a/tools/testing/selftests/net/test_vxlan_mdb.sh
+++ b/tools/testing/selftests/net/test_vxlan_mdb.sh
@@ -337,62 +337,62 @@ basic_common()
# Basic add, replace and delete behavior.
run_cmd "bridge -n $ns1 mdb add dev vx0 port vx0 $grp_key permanent dst $vtep_ip src_vni 10010"
log_test $? 0 "MDB entry addition"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010"
log_test $? 0 "MDB entry presence after addition"
 
run_cmd "bridge -n $ns1 mdb replace dev vx0 port vx0 $grp_key permanent dst $vtep_ip src_vni 10010"
log_test $? 0 "MDB entry replacement"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010"
log_test $? 0 "MDB entry presence after replacement"
 
run_cmd "bridge -n $ns1 mdb del dev vx0 port vx0 $grp_key dst $vtep_ip src_vni 10010"
log_test $? 0 "MDB entry deletion"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\""
-   log_test $? 1 "MDB entry presence after deletion"
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010"
+   log_test $? 254 "MDB entry presence after deletion"
 
run_cmd "bridge -n $ns1 mdb del dev vx0 port vx0 $grp_key dst $vtep_ip src_vni 10010"
log_test $? 255 "Non-existent MDB entry deletion"
 
# Default protocol and replacement.
run_cmd "bridge -n $ns1 mdb add dev vx0 port vx0 $grp_key permanent dst $vtep_ip src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \"proto static\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010 | grep \"proto static\""
log_test $? 0 "MDB entry default protocol"
 
run_cmd "bridge -n $ns1 mdb replace dev vx0 port vx0 $grp_key permanent proto 123 dst $vtep_ip src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \"proto 123\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010 | grep \"proto 123\""
log_test $? 0 "MDB entry protocol replacement"
 
run_cmd "bridge -n $ns1 mdb del dev vx0 port vx0 $grp_key dst $vtep_ip src_vni 10010"
 
# Default destination port and replacement.
run_cmd "bridge -n $ns1 mdb add dev vx0 port vx0 $grp_key permanent dst $vtep_ip src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \" dst_port \""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010 | grep \" dst_port \""
log_test $? 1 "MDB entry default destination port"
 
run_cmd "bridge -n $ns1 mdb replace dev vx0 port vx0 $grp_key permanent dst $vtep_ip dst_port 1234 src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \"dst_port 1234\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010 | grep \"dst_port 1234\""
log_test $? 0 "MDB entry destination port replacement"
 
run_cmd "bridge -n $ns1 mdb del dev vx0 port vx0 $grp_key dst $vtep_ip src_vni 10010"
 
# Default destination VNI and replacement.
run_cmd "bridge -n $ns1 mdb add dev vx0 port vx0 $grp_key permanent dst $vtep_ip src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \" vni \""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010 | grep \" vni \""
log_test $? 1 "MDB entry default destination VNI"
 
run_cmd "bridge -n $ns1 mdb replace dev vx0 port vx0 $grp_key permanent dst $vtep_ip vni 1234 src_vni 10010"
-   run_cmd "bridge -n $ns1 -d -s mdb show dev vx0 | grep \"$grp_key\" | grep \"vni 1234\""
+   run_cmd "bridge -n $ns1 -d -s mdb get dev vx0 $grp_key src_vni 10010 | 
grep \&quo

[Bridge] [PATCH net-next 12/13] selftests: bridge_mdb: Use MDB get instead of dump

2023-10-16 Thread Ido Schimmel via Bridge

Test the new MDB get functionality by converting dump and grep to MDB
get.

Signed-off-by: Ido Schimmel 
---
 .../selftests/net/forwarding/bridge_mdb.sh| 184 +++---
 1 file changed, 71 insertions(+), 113 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/bridge_mdb.sh b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
index d0c6c499d5da..e4e3e9405056 100755
--- a/tools/testing/selftests/net/forwarding/bridge_mdb.sh
+++ b/tools/testing/selftests/net/forwarding/bridge_mdb.sh
@@ -145,14 +145,14 @@ cfg_test_host_common()
 
# Check basic add, replace and delete behavior.
bridge mdb add dev br0 port br0 grp $grp $state vid 10
-   bridge mdb show dev br0 vid 10 | grep -q "$grp"
+   bridge mdb get dev br0 grp $grp vid 10 &> /dev/null
check_err $? "Failed to add $name host entry"
 
bridge mdb replace dev br0 port br0 grp $grp $state vid 10 &> /dev/null
check_fail $? "Managed to replace $name host entry"
 
bridge mdb del dev br0 port br0 grp $grp $state vid 10
-   bridge mdb show dev br0 vid 10 | grep -q "$grp"
+   bridge mdb get dev br0 grp $grp vid 10 &> /dev/null
check_fail $? "Failed to delete $name host entry"
 
# Check error cases.
@@ -200,7 +200,7 @@ cfg_test_port_common()
 
# Check basic add, replace and delete behavior.
bridge mdb add dev br0 port $swp1 $grp_key permanent vid 10
-   bridge mdb show dev br0 vid 10 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 10 &> /dev/null
check_err $? "Failed to add $name entry"
 
bridge mdb replace dev br0 port $swp1 $grp_key permanent vid 10 \
@@ -208,31 +208,31 @@ cfg_test_port_common()
check_err $? "Failed to replace $name entry"
 
bridge mdb del dev br0 port $swp1 $grp_key permanent vid 10
-   bridge mdb show dev br0 vid 10 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 10 &> /dev/null
check_fail $? "Failed to delete $name entry"
 
# Check default protocol and replacement.
bridge mdb add dev br0 port $swp1 $grp_key permanent vid 10
-   bridge -d mdb show dev br0 vid 10 | grep "$grp_key" | grep -q "static"
+   bridge -d mdb get dev br0 $grp_key vid 10 | grep -q "static"
check_err $? "$name entry not added with default \"static\" protocol"
 
bridge mdb replace dev br0 port $swp1 $grp_key permanent vid 10 \
proto 123
-   bridge -d mdb show dev br0 vid 10 | grep "$grp_key" | grep -q "123"
+   bridge -d mdb get dev br0 $grp_key vid 10 | grep -q "123"
check_err $? "Failed to replace protocol of $name entry"
bridge mdb del dev br0 port $swp1 $grp_key permanent vid 10
 
# Check behavior when VLAN is not specified.
bridge mdb add dev br0 port $swp1 $grp_key permanent
-   bridge mdb show dev br0 vid 10 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 10 &> /dev/null
check_err $? "$name entry with VLAN 10 not added when VLAN was not specified"
-   bridge mdb show dev br0 vid 20 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 20 &> /dev/null
check_err $? "$name entry with VLAN 20 not added when VLAN was not specified"
 
bridge mdb del dev br0 port $swp1 $grp_key permanent
-   bridge mdb show dev br0 vid 10 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 10 &> /dev/null
check_fail $? "$name entry with VLAN 10 not deleted when VLAN was not specified"
-   bridge mdb show dev br0 vid 20 | grep -q "$grp_key"
+   bridge mdb get dev br0 $grp_key vid 20 &> /dev/null
check_fail $? "$name entry with VLAN 20 not deleted when VLAN was not specified"
 
# Check behavior when bridge port is down.
@@ -298,21 +298,21 @@ __cfg_test_port_ip_star_g()
RET=0
 
bridge mdb add dev br0 port $swp1 grp $grp vid 10
-   bridge -d mdb show dev br0 vid 10 | grep "$grp" | grep -q "exclude"
+   bridge -d mdb get dev br0 grp $grp vid 10 | grep -q "exclude"
check_err $? "Default filter mode is not \"exclude\""
bridge mdb del dev br0 port $swp1 grp $grp vid 10
 
# Check basic add and delete behavior.
bridge mdb add dev br0 port $swp1 grp $grp vid 10 filter_mode exclude \
source_list $src1
-   bridge -d mdb show dev br0 vid 10 | grep "$grp" | grep -q -v "src"
+   bridge -d mdb get dev br0 grp $grp vid 10 &> /dev/null
check_err $? "(*, G) entry not created"
-   bridge -d mdb

[Bridge] [PATCH net-next 11/13] rtnetlink: Add MDB get support

2023-10-16 Thread Ido Schimmel via Bridge

Now that both the bridge and VXLAN drivers implement the MDB get net
device operation, expose the functionality to user space by registering
a handler for RTM_GETMDB messages. Derive the net device from the
ifindex specified in the ancillary header and invoke its MDB get NDO.

Note that unlike other get handlers, the allocation of the skb
containing the response is not performed in the common rtnetlink code as
the size is variable and needs to be determined by the respective
driver.

Signed-off-by: Ido Schimmel 
---
 net/core/rtnetlink.c | 89 +++-
 1 file changed, 88 insertions(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index eef7f7788996..e4fb242655b4 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -6221,6 +6221,93 @@ static int rtnl_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
return skb->len;
 }
 
+static int rtnl_validate_mdb_entry_get(const struct nlattr *attr,
+  struct netlink_ext_ack *extack)
+{
+   struct br_mdb_entry *entry = nla_data(attr);
+
+   if (nla_len(attr) != sizeof(struct br_mdb_entry)) {
+   NL_SET_ERR_MSG_ATTR(extack, attr, "Invalid attribute length");
+   return -EINVAL;
+   }
+
+   if (entry->ifindex) {
+   NL_SET_ERR_MSG(extack, "Entry ifindex cannot be specified");
+   return -EINVAL;
+   }
+
+   if (entry->state) {
+   NL_SET_ERR_MSG(extack, "Entry state cannot be specified");
+   return -EINVAL;
+   }
+
+   if (entry->flags) {
+   NL_SET_ERR_MSG(extack, "Entry flags cannot be specified");
+   return -EINVAL;
+   }
+
+   if (entry->vid >= VLAN_VID_MASK) {
+   NL_SET_ERR_MSG(extack, "Invalid entry VLAN id");
+   return -EINVAL;
+   }
+
+   if (entry->addr.proto != htons(ETH_P_IP) &&
+   entry->addr.proto != htons(ETH_P_IPV6) &&
+   entry->addr.proto != 0) {
+   NL_SET_ERR_MSG(extack, "Unknown entry protocol");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static const struct nla_policy mdba_get_policy[MDBA_GET_ENTRY_MAX + 1] = {
+   [MDBA_GET_ENTRY] = NLA_POLICY_VALIDATE_FN(NLA_BINARY,
+ rtnl_validate_mdb_entry_get,
+ sizeof(struct br_mdb_entry)),
+   [MDBA_GET_ENTRY_ATTRS] = { .type = NLA_NESTED },
+};
+
+static int rtnl_mdb_get(struct sk_buff *in_skb, struct nlmsghdr *nlh,
+   struct netlink_ext_ack *extack)
+{
+   struct nlattr *tb[MDBA_GET_ENTRY_MAX + 1];
+   struct net *net = sock_net(in_skb->sk);
+   struct br_port_msg *bpm;
+   struct net_device *dev;
+   int err;
+
+   err = nlmsg_parse(nlh, sizeof(struct br_port_msg), tb,
+ MDBA_GET_ENTRY_MAX, mdba_get_policy, extack);
+   if (err)
+   return err;
+
+   bpm = nlmsg_data(nlh);
+   if (!bpm->ifindex) {
+   NL_SET_ERR_MSG(extack, "Invalid ifindex");
+   return -EINVAL;
+   }
+
+   dev = __dev_get_by_index(net, bpm->ifindex);
+   if (!dev) {
+   NL_SET_ERR_MSG(extack, "Device doesn't exist");
+   return -ENODEV;
+   }
+
+   if (NL_REQ_ATTR_CHECK(extack, NULL, tb, MDBA_GET_ENTRY)) {
+   NL_SET_ERR_MSG(extack, "Missing MDBA_GET_ENTRY attribute");
+   return -EINVAL;
+   }
+
+   if (!dev->netdev_ops->ndo_mdb_get) {
+   NL_SET_ERR_MSG(extack, "Device does not support MDB operations");
+   return -EOPNOTSUPP;
+   }
+
+   return dev->netdev_ops->ndo_mdb_get(dev, tb, NETLINK_CB(in_skb).portid,
+   nlh->nlmsg_seq, extack);
+}
+
 static int rtnl_validate_mdb_entry(const struct nlattr *attr,
   struct netlink_ext_ack *extack)
 {
@@ -6597,7 +6684,7 @@ void __init rtnetlink_init(void)
  0);
rtnl_register(PF_UNSPEC, RTM_SETSTATS, rtnl_stats_set, NULL, 0);
 
-   rtnl_register(PF_BRIDGE, RTM_GETMDB, NULL, rtnl_mdb_dump, 0);
+   rtnl_register(PF_BRIDGE, RTM_GETMDB, rtnl_mdb_get, rtnl_mdb_dump, 0);
rtnl_register(PF_BRIDGE, RTM_NEWMDB, rtnl_mdb_add, NULL, 0);
rtnl_register(PF_BRIDGE, RTM_DELMDB, rtnl_mdb_del, NULL, 0);
 }
-- 
2.40.1
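
As a rough illustration of the dispatch described in the commit message
above (derive the device from the ifindex, then invoke its MDB get
operation), here is a user-space sketch. It is not the rtnetlink code:
the sk_* structures and functions are hypothetical stand-ins, the device
lookup is a linear scan over an array, and the attribute parsing and the
MDBA_GET_ENTRY presence check are left out. Only the order of the checks
and the error codes are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sk_dev;

/* Hypothetical per-device operations table with a single optional hook. */
struct sk_dev_ops {
	int (*mdb_get)(struct sk_dev *dev, unsigned int portid, unsigned int seq);
};

struct sk_dev {
	int ifindex;
	const struct sk_dev_ops *ops;
};

/* Hypothetical driver handler; a real one would build and send a reply. */
static int sk_bridge_mdb_get(struct sk_dev *dev, unsigned int portid,
			     unsigned int seq)
{
	(void)dev;
	(void)portid;
	(void)seq;
	return 0;
}

/* One device type implements the hook, the other leaves it unset. */
static const struct sk_dev_ops sk_bridge_ops = { .mdb_get = sk_bridge_mdb_get };
static const struct sk_dev_ops sk_plain_ops = { .mdb_get = NULL };

static struct sk_dev *sk_dev_by_index(struct sk_dev *devs, size_t n, int ifindex)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].ifindex == ifindex)
			return &devs[i];
	return NULL;
}

/* Same order of checks as the handler in the patch, minus attribute parsing. */
static int sk_mdb_get(struct sk_dev *devs, size_t n, int ifindex,
		      unsigned int portid, unsigned int seq)
{
	struct sk_dev *dev;

	if (!ifindex)
		return -EINVAL;		/* "Invalid ifindex" */

	dev = sk_dev_by_index(devs, n, ifindex);
	if (!dev)
		return -ENODEV;		/* "Device doesn't exist" */

	assert(dev->ops);
	if (!dev->ops->mdb_get)
		return -EOPNOTSUPP;	/* device has no MDB get operation */

	return dev->ops->mdb_get(dev, portid, seq);
}
```

With a table holding one device per ops table, a request for ifindex 0
fails with -EINVAL, an unknown ifindex with -ENODEV, and a device without
the hook with -EOPNOTSUPP.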

[Bridge] [PATCH net-next 09/13] bridge: mcast: Add MDB get support

2023-10-16 Thread Ido Schimmel via Bridge

Implement support for MDB get operation by looking up a matching MDB
entry, allocating the skb according to the entry's size and then filling
in the response. The operation is performed under the bridge multicast
lock to ensure that the entry does not change between the time the reply
size is determined and when the reply is filled in.

Signed-off-by: Ido Schimmel 
---
 net/bridge/br_device.c  |   1 +
 net/bridge/br_mdb.c | 154 
 net/bridge/br_private.h |   9 +++
 3 files changed, 164 insertions(+)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index d624710b384a..8f40de3af154 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -472,6 +472,7 @@ static const struct net_device_ops br_netdev_ops = {
.ndo_mdb_add = br_mdb_add,
.ndo_mdb_del = br_mdb_del,
.ndo_mdb_dump= br_mdb_dump,
+   .ndo_mdb_get = br_mdb_get,
.ndo_bridge_getlink  = br_getlink,
.ndo_bridge_setlink  = br_setlink,
.ndo_bridge_dellink  = br_dellink,
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 42983f6a0abd..973e27fe3498 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -1411,3 +1411,157 @@ int br_mdb_del(struct net_device *dev, struct nlattr *tb[],
br_mdb_config_fini(&cfg);
return err;
 }
+
+static const struct nla_policy br_mdbe_attrs_get_pol[MDBE_ATTR_MAX + 1] = {
+   [MDBE_ATTR_SOURCE] = NLA_POLICY_RANGE(NLA_BINARY,
+ sizeof(struct in_addr),
+ sizeof(struct in6_addr)),
+};
+
+static int br_mdb_get_parse(struct net_device *dev, struct nlattr *tb[],
+   struct br_ip *group, struct netlink_ext_ack *extack)
+{
+   struct br_mdb_entry *entry = nla_data(tb[MDBA_GET_ENTRY]);
+   struct nlattr *mdbe_attrs[MDBE_ATTR_MAX + 1];
+   int err;
+
+   if (!tb[MDBA_GET_ENTRY_ATTRS]) {
+   __mdb_entry_to_br_ip(entry, group, NULL);
+   return 0;
+   }
+
+   err = nla_parse_nested(mdbe_attrs, MDBE_ATTR_MAX,
+  tb[MDBA_GET_ENTRY_ATTRS], br_mdbe_attrs_get_pol,
+  extack);
+   if (err)
+   return err;
+
+   if (mdbe_attrs[MDBE_ATTR_SOURCE] &&
+   !is_valid_mdb_source(mdbe_attrs[MDBE_ATTR_SOURCE],
+entry->addr.proto, extack))
+   return -EINVAL;
+
+   __mdb_entry_to_br_ip(entry, group, mdbe_attrs);
+
+   return 0;
+}
+
+static struct sk_buff *
+br_mdb_get_reply_alloc(const struct net_bridge_mdb_entry *mp)
+{
+   struct net_bridge_port_group *pg;
+   size_t nlmsg_size;
+
+   nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
+/* MDBA_MDB */
+nla_total_size(0) +
+/* MDBA_MDB_ENTRY */
+nla_total_size(0);
+
+   if (mp->host_joined)
+   nlmsg_size += rtnl_mdb_nlmsg_pg_size(NULL);
+
+   for (pg = mlock_dereference(mp->ports, mp->br); pg;
+pg = mlock_dereference(pg->next, mp->br))
+   nlmsg_size += rtnl_mdb_nlmsg_pg_size(pg);
+
+   return nlmsg_new(nlmsg_size, GFP_ATOMIC);
+}
+
+static int br_mdb_get_reply_fill(struct sk_buff *skb,
+struct net_bridge_mdb_entry *mp, u32 portid,
+u32 seq)
+{
+   struct nlattr *mdb_nest, *mdb_entry_nest;
+   struct net_bridge_port_group *pg;
+   struct br_port_msg *bpm;
+   struct nlmsghdr *nlh;
+   int err;
+
+   nlh = nlmsg_put(skb, portid, seq, RTM_NEWMDB, sizeof(*bpm), 0);
+   if (!nlh)
+   return -EMSGSIZE;
+
+   bpm = nlmsg_data(nlh);
+   memset(bpm, 0, sizeof(*bpm));
+   bpm->family  = AF_BRIDGE;
+   bpm->ifindex = mp->br->dev->ifindex;
+   mdb_nest = nla_nest_start_noflag(skb, MDBA_MDB);
+   if (!mdb_nest) {
+   err = -EMSGSIZE;
+   goto cancel;
+   }
+   mdb_entry_nest = nla_nest_start_noflag(skb, MDBA_MDB_ENTRY);
+   if (!mdb_entry_nest) {
+   err = -EMSGSIZE;
+   goto cancel;
+   }
+
+   if (mp->host_joined) {
+   err = __mdb_fill_info(skb, mp, NULL);
+   if (err)
+   goto cancel;
+   }
+
+   for (pg = mlock_dereference(mp->ports, mp->br); pg;
+pg = mlock_dereference(pg->next, mp->br)) {
+   err = __mdb_fill_info(skb, mp, pg);
+   if (err)
+   goto cancel;
+   }
+
+   nla_nest_end(skb, mdb_entry_nest);
+   nla_nest_end(skb, mdb_nest);
+   nlmsg_end(skb, nlh);
+
+   return 0;
+
+cancel:
+   nlmsg_cancel(skb, nlh);
+   return err;
+}
+
+int br_mdb_get(struct net_device *dev,

[Bridge] [PATCH net-next 10/13] vxlan: mdb: Add MDB get support

2023-10-16 Thread Ido Schimmel via Bridge

Implement support for MDB get operation by looking up a matching MDB
entry, allocating the skb according to the entry's size and then filling
in the response.

Signed-off-by: Ido Schimmel 
---
 drivers/net/vxlan/vxlan_core.c|   1 +
 drivers/net/vxlan/vxlan_mdb.c | 150 ++
 drivers/net/vxlan/vxlan_private.h |   2 +
 3 files changed, 153 insertions(+)

diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index 6f7d45e3cfa2..7ed19f2cf6f5 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -3302,6 +3302,7 @@ static const struct net_device_ops vxlan_netdev_ether_ops = {
.ndo_mdb_add= vxlan_mdb_add,
.ndo_mdb_del= vxlan_mdb_del,
.ndo_mdb_dump   = vxlan_mdb_dump,
+   .ndo_mdb_get= vxlan_mdb_get,
.ndo_fill_metadata_dst  = vxlan_fill_metadata_dst,
 };
 
diff --git a/drivers/net/vxlan/vxlan_mdb.c b/drivers/net/vxlan/vxlan_mdb.c
index 19640f7e3a88..e472fd67fc2e 100644
--- a/drivers/net/vxlan/vxlan_mdb.c
+++ b/drivers/net/vxlan/vxlan_mdb.c
@@ -1306,6 +1306,156 @@ int vxlan_mdb_del(struct net_device *dev, struct nlattr *tb[],
return err;
 }
 
+static const struct nla_policy vxlan_mdbe_attrs_get_pol[MDBE_ATTR_MAX + 1] = {
+   [MDBE_ATTR_SOURCE] = NLA_POLICY_RANGE(NLA_BINARY,
+ sizeof(struct in_addr),
+ sizeof(struct in6_addr)),
+   [MDBE_ATTR_SRC_VNI] = NLA_POLICY_FULL_RANGE(NLA_U32, &vni_range),
+};
+
+static int vxlan_mdb_get_parse(struct net_device *dev, struct nlattr *tb[],
+  struct vxlan_mdb_entry_key *group,
+  struct netlink_ext_ack *extack)
+{
+   struct br_mdb_entry *entry = nla_data(tb[MDBA_GET_ENTRY]);
+   struct nlattr *mdbe_attrs[MDBE_ATTR_MAX + 1];
+   struct vxlan_dev *vxlan = netdev_priv(dev);
+   int err;
+
+   memset(group, 0, sizeof(*group));
+   group->vni = vxlan->default_dst.remote_vni;
+
+   if (!tb[MDBA_GET_ENTRY_ATTRS]) {
+   vxlan_mdb_group_set(group, entry, NULL);
+   return 0;
+   }
+
+   err = nla_parse_nested(mdbe_attrs, MDBE_ATTR_MAX,
+  tb[MDBA_GET_ENTRY_ATTRS],
+  vxlan_mdbe_attrs_get_pol, extack);
+   if (err)
+   return err;
+
+   if (mdbe_attrs[MDBE_ATTR_SOURCE] &&
+   !vxlan_mdb_is_valid_source(mdbe_attrs[MDBE_ATTR_SOURCE],
+  entry->addr.proto, extack))
+   return -EINVAL;
+
+   vxlan_mdb_group_set(group, entry, mdbe_attrs[MDBE_ATTR_SOURCE]);
+
+   if (mdbe_attrs[MDBE_ATTR_SRC_VNI])
+   group->vni =
+   cpu_to_be32(nla_get_u32(mdbe_attrs[MDBE_ATTR_SRC_VNI]));
+
+   return 0;
+}
+
+static struct sk_buff *
+vxlan_mdb_get_reply_alloc(const struct vxlan_dev *vxlan,
+ const struct vxlan_mdb_entry *mdb_entry)
+{
+   struct vxlan_mdb_remote *remote;
+   size_t nlmsg_size;
+
+   nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
+/* MDBA_MDB */
+nla_total_size(0) +
+/* MDBA_MDB_ENTRY */
+nla_total_size(0);
+
+   list_for_each_entry(remote, &mdb_entry->remotes, list)
+   nlmsg_size += vxlan_mdb_nlmsg_remote_size(vxlan, mdb_entry,
+ remote);
+
+   return nlmsg_new(nlmsg_size, GFP_KERNEL);
+}
+
+static int
+vxlan_mdb_get_reply_fill(const struct vxlan_dev *vxlan,
+struct sk_buff *skb,
+const struct vxlan_mdb_entry *mdb_entry,
+u32 portid, u32 seq)
+{
+   struct nlattr *mdb_nest, *mdb_entry_nest;
+   struct vxlan_mdb_remote *remote;
+   struct br_port_msg *bpm;
+   struct nlmsghdr *nlh;
+   int err;
+
+   nlh = nlmsg_put(skb, portid, seq, RTM_NEWMDB, sizeof(*bpm), 0);
+   if (!nlh)
+   return -EMSGSIZE;
+
+   bpm = nlmsg_data(nlh);
+   memset(bpm, 0, sizeof(*bpm));
+   bpm->family  = AF_BRIDGE;
+   bpm->ifindex = vxlan->dev->ifindex;
+   mdb_nest = nla_nest_start_noflag(skb, MDBA_MDB);
+   if (!mdb_nest) {
+   err = -EMSGSIZE;
+   goto cancel;
+   }
+   mdb_entry_nest = nla_nest_start_noflag(skb, MDBA_MDB_ENTRY);
+   if (!mdb_entry_nest) {
+   err = -EMSGSIZE;
+   goto cancel;
+   }
+
+   list_for_each_entry(remote, &mdb_entry->remotes, list) {
+   err = vxlan_mdb_entry_info_fill(vxlan, skb, mdb_entry, remote);
+   if (err)
+   goto cancel;
+   }
+
+   nla_nest_end(skb, mdb_entry_nest);
+   nla_nest_end(skb, mdb_nest);
+

[Bridge] [PATCH net-next 08/13] net: Add MDB get device operation

2023-10-16 Thread Ido Schimmel via Bridge

Add MDB net device operation that will be invoked by rtnetlink code in
response to received RTM_GETMDB messages. Subsequent patches will
implement the operation in the bridge and VXLAN drivers.

Signed-off-by: Ido Schimmel 
---
 include/linux/netdevice.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1c7681263d30..18376b65dc61 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1586,6 +1586,10 @@ struct net_device_ops {
int (*ndo_mdb_dump)(struct net_device *dev,
struct sk_buff *skb,
struct netlink_callback *cb);
+   int (*ndo_mdb_get)(struct net_device *dev,
+  struct nlattr *tb[], u32 portid,
+  u32 seq,
+  struct netlink_ext_ack *extack);
int (*ndo_bridge_setlink)(struct net_device *dev,
  struct nlmsghdr *nlh,
  u16 flags,
-- 
2.40.1

[Bridge] [PATCH net-next 05/13] vxlan: mdb: Adjust function arguments

2023-10-16 Thread Ido Schimmel via Bridge

Adjust the function's arguments and rename it to allow it to be reused
by future call sites that only have access to 'struct
vxlan_mdb_entry_key', but not to 'struct vxlan_mdb_config'.

No functional changes intended.

Signed-off-by: Ido Schimmel 
---
 drivers/net/vxlan/vxlan_mdb.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/net/vxlan/vxlan_mdb.c b/drivers/net/vxlan/vxlan_mdb.c
index 5e041622261a..0b6043e1473b 100644
--- a/drivers/net/vxlan/vxlan_mdb.c
+++ b/drivers/net/vxlan/vxlan_mdb.c
@@ -370,12 +370,10 @@ static bool vxlan_mdb_is_valid_source(const struct nlattr *attr, __be16 proto,
return true;
 }
 
-static void vxlan_mdb_config_group_set(struct vxlan_mdb_config *cfg,
-  const struct br_mdb_entry *entry,
-  const struct nlattr *source_attr)
+static void vxlan_mdb_group_set(struct vxlan_mdb_entry_key *group,
+   const struct br_mdb_entry *entry,
+   const struct nlattr *source_attr)
 {
-   struct vxlan_mdb_entry_key *group = &cfg->group;
-
switch (entry->addr.proto) {
case htons(ETH_P_IP):
group->dst.sa.sa_family = AF_INET;
@@ -503,7 +501,7 @@ static int vxlan_mdb_config_attrs_init(struct vxlan_mdb_config *cfg,
   entry->addr.proto, extack))
return -EINVAL;
 
-   vxlan_mdb_config_group_set(cfg, entry, mdbe_attrs[MDBE_ATTR_SOURCE]);
+   vxlan_mdb_group_set(&cfg->group, entry, mdbe_attrs[MDBE_ATTR_SOURCE]);
 
/* rtnetlink code only validates that IPv4 group address is
 * multicast.
-- 
2.40.1

[Bridge] [PATCH net-next 06/13] vxlan: mdb: Factor out a helper for remote entry size calculation

2023-10-16 Thread Ido Schimmel via Bridge

Currently, netlink notifications are sent for individual remote entries
and not for the entire MDB entry itself.

Subsequent patches are going to add MDB get support which will require
the VXLAN driver to reply with an entire MDB entry.

Therefore, as a preparation, factor out a helper to calculate the size
of an individual remote entry. When determining the size of the reply
this helper will be invoked for each remote entry in the MDB entry.

No functional changes intended.

Signed-off-by: Ido Schimmel 
---
 drivers/net/vxlan/vxlan_mdb.c | 28 +++-
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/drivers/net/vxlan/vxlan_mdb.c b/drivers/net/vxlan/vxlan_mdb.c
index 0b6043e1473b..19640f7e3a88 100644
--- a/drivers/net/vxlan/vxlan_mdb.c
+++ b/drivers/net/vxlan/vxlan_mdb.c
@@ -925,23 +925,20 @@ vxlan_mdb_nlmsg_src_list_size(const struct vxlan_mdb_entry_key *group,
return nlmsg_size;
 }
 
-static size_t vxlan_mdb_nlmsg_size(const struct vxlan_dev *vxlan,
-  const struct vxlan_mdb_entry *mdb_entry,
-  const struct vxlan_mdb_remote *remote)
+static size_t
+vxlan_mdb_nlmsg_remote_size(const struct vxlan_dev *vxlan,
+   const struct vxlan_mdb_entry *mdb_entry,
+   const struct vxlan_mdb_remote *remote)
 {
const struct vxlan_mdb_entry_key *group = &mdb_entry->key;
struct vxlan_rdst *rd = rtnl_dereference(remote->rd);
size_t nlmsg_size;
 
-   nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
-/* MDBA_MDB */
-nla_total_size(0) +
-/* MDBA_MDB_ENTRY */
-nla_total_size(0) +
 /* MDBA_MDB_ENTRY_INFO */
-nla_total_size(sizeof(struct br_mdb_entry)) +
+   nlmsg_size = nla_total_size(sizeof(struct br_mdb_entry)) +
 /* MDBA_MDB_EATTR_TIMER */
 nla_total_size(sizeof(u32));
+
/* MDBA_MDB_EATTR_SOURCE */
if (vxlan_mdb_is_sg(group))
nlmsg_size += nla_total_size(vxlan_addr_size(&group->dst));
@@ -969,6 +966,19 @@ static size_t vxlan_mdb_nlmsg_size(const struct vxlan_dev *vxlan,
return nlmsg_size;
 }
 
+static size_t vxlan_mdb_nlmsg_size(const struct vxlan_dev *vxlan,
+  const struct vxlan_mdb_entry *mdb_entry,
+  const struct vxlan_mdb_remote *remote)
+{
+   return NLMSG_ALIGN(sizeof(struct br_port_msg)) +
+  /* MDBA_MDB */
+  nla_total_size(0) +
+  /* MDBA_MDB_ENTRY */
+  nla_total_size(0) +
+  /* Remote entry */
+  vxlan_mdb_nlmsg_remote_size(vxlan, mdb_entry, remote);
+}
+
 static int vxlan_mdb_nlmsg_fill(const struct vxlan_dev *vxlan,
struct sk_buff *skb,
const struct vxlan_mdb_entry *mdb_entry,
-- 
2.40.1

[Bridge] [PATCH net-next 07/13] bridge: add MDB get uAPI attributes

2023-10-16 Thread Ido Schimmel via Bridge

Add MDB get attributes that correspond to the MDB set attributes used in
RTM_NEWMDB messages. Specifically, add 'MDBA_GET_ENTRY' which will hold
a 'struct br_mdb_entry' and 'MDBA_GET_ENTRY_ATTRS' which will hold
'MDBE_ATTR_*' attributes that are used as indexes (source IP and source
VNI).

An example request will look as follows:

[ struct nlmsghdr ]
[ struct br_port_msg ]
[ MDBA_GET_ENTRY ]
    struct br_mdb_entry
[ MDBA_GET_ENTRY_ATTRS ]
    [ MDBE_ATTR_SOURCE ]
        struct in_addr / struct in6_addr
    [ MDBE_ATTR_SRC_VNI ]
        u32

Signed-off-by: Ido Schimmel 
---
 include/uapi/linux/if_bridge.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index f95326fce6bb..7e1bf080b414 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -723,6 +723,14 @@ enum {
 };
 #define MDBA_SET_ENTRY_MAX (__MDBA_SET_ENTRY_MAX - 1)
 
+enum {
+   MDBA_GET_ENTRY_UNSPEC,
+   MDBA_GET_ENTRY,
+   MDBA_GET_ENTRY_ATTRS,
+   __MDBA_GET_ENTRY_MAX,
+};
+#define MDBA_GET_ENTRY_MAX (__MDBA_GET_ENTRY_MAX - 1)
+
 /* [MDBA_SET_ENTRY_ATTRS] = {
  *[MDBE_ATTR_xxx]
  *...
-- 
2.40.1

[Bridge] [PATCH net-next 03/13] bridge: mcast: Factor out a helper for PG entry size calculation

2023-10-16 Thread Ido Schimmel via Bridge

Currently, netlink notifications are sent for individual port group
entries and not for the entire MDB entry itself.

Subsequent patches are going to add MDB get support which will require
the bridge driver to reply with an entire MDB entry.

Therefore, as a preparation, factor out a helper to calculate the size
of an individual port group entry. When determining the size of the
reply this helper will be invoked for each port group entry in the MDB
entry.

No functional changes intended.

Signed-off-by: Ido Schimmel 
---
 net/bridge/br_mdb.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 08de94bffc12..42983f6a0abd 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -450,18 +450,13 @@ static int nlmsg_populate_mdb_fill(struct sk_buff *skb,
return -EMSGSIZE;
 }
 
-static size_t rtnl_mdb_nlmsg_size(struct net_bridge_port_group *pg)
+static size_t rtnl_mdb_nlmsg_pg_size(const struct net_bridge_port_group *pg)
 {
struct net_bridge_group_src *ent;
size_t nlmsg_size, addr_size = 0;
 
-   nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
-/* MDBA_MDB */
-nla_total_size(0) +
-/* MDBA_MDB_ENTRY */
-nla_total_size(0) +
 /* MDBA_MDB_ENTRY_INFO */
-nla_total_size(sizeof(struct br_mdb_entry)) +
+   nlmsg_size = nla_total_size(sizeof(struct br_mdb_entry)) +
 /* MDBA_MDB_EATTR_TIMER */
 nla_total_size(sizeof(u32));
 
@@ -511,6 +506,17 @@ static size_t rtnl_mdb_nlmsg_size(struct net_bridge_port_group *pg)
return nlmsg_size;
 }
 
+static size_t rtnl_mdb_nlmsg_size(const struct net_bridge_port_group *pg)
+{
+   return NLMSG_ALIGN(sizeof(struct br_port_msg)) +
+  /* MDBA_MDB */
+  nla_total_size(0) +
+  /* MDBA_MDB_ENTRY */
+  nla_total_size(0) +
+  /* Port group entry */
+  rtnl_mdb_nlmsg_pg_size(pg);
+}
+
 void br_mdb_notify(struct net_device *dev,
   struct net_bridge_mdb_entry *mp,
   struct net_bridge_port_group *pg,
-- 
2.40.1

[Bridge] [PATCH net-next 04/13] bridge: mcast: Rename MDB entry get function

2023-10-16 Thread Ido Schimmel via Bridge

The current name is going to conflict with the upcoming net device
operation for the MDB get operation.

Rename the function to br_mdb_entry_skb_get(). No functional changes
intended.

Signed-off-by: Ido Schimmel 
---
 net/bridge/br_device.c|  2 +-
 net/bridge/br_input.c |  2 +-
 net/bridge/br_multicast.c |  5 +++--
 net/bridge/br_private.h   | 10 ++
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 9a5ea06236bd..d624710b384a 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -92,7 +92,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
goto out;
}
 
-   mdst = br_mdb_get(brmctx, skb, vid);
+   mdst = br_mdb_entry_skb_get(brmctx, skb, vid);
if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst))
br_multicast_flood(mdst, skb, brmctx, false, true);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index c729528b5e85..f21097e73482 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -175,7 +175,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
 
switch (pkt_type) {
case BR_PKT_MULTICAST:
-   mdst = br_mdb_get(brmctx, skb, vid);
+   mdst = br_mdb_entry_skb_get(brmctx, skb, vid);
if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst)) {
if ((mdst && mdst->host_joined) ||
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 96d1fc78dd39..d7d021af1029 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -145,8 +145,9 @@ static struct net_bridge_mdb_entry *br_mdb_ip6_get(struct net_bridge *br,
 }
 #endif
 
-struct net_bridge_mdb_entry *br_mdb_get(struct net_bridge_mcast *brmctx,
-   struct sk_buff *skb, u16 vid)
+struct net_bridge_mdb_entry *
+br_mdb_entry_skb_get(struct net_bridge_mcast *brmctx, struct sk_buff *skb,
+u16 vid)
 {
struct net_bridge *br = brmctx->br;
struct br_ip ip;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index cbbe35278459..3220898424ce 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -951,8 +951,9 @@ int br_multicast_rcv(struct net_bridge_mcast **brmctx,
 struct net_bridge_mcast_port **pmctx,
 struct net_bridge_vlan *vlan,
 struct sk_buff *skb, u16 vid);
-struct net_bridge_mdb_entry *br_mdb_get(struct net_bridge_mcast *brmctx,
-   struct sk_buff *skb, u16 vid);
+struct net_bridge_mdb_entry *
+br_mdb_entry_skb_get(struct net_bridge_mcast *brmctx, struct sk_buff *skb,
+u16 vid);
 int br_multicast_add_port(struct net_bridge_port *port);
 void br_multicast_del_port(struct net_bridge_port *port);
 void br_multicast_enable_port(struct net_bridge_port *port);
@@ -1341,8 +1342,9 @@ static inline int br_multicast_rcv(struct net_bridge_mcast **brmctx,
return 0;
 }
 
-static inline struct net_bridge_mdb_entry *br_mdb_get(struct net_bridge_mcast *brmctx,
- struct sk_buff *skb, u16 vid)
+static inline struct net_bridge_mdb_entry *
+br_mdb_entry_skb_get(struct net_bridge_mcast *brmctx, struct sk_buff *skb,
+u16 vid)
 {
return NULL;
 }
-- 
2.40.1

[Bridge] [PATCH net-next 01/13] bridge: mcast: Dump MDB entries even when snooping is disabled

2023-10-16 Thread Ido Schimmel via Bridge

Currently, the bridge driver does not dump MDB entries when multicast
snooping is disabled although the entries are present in the kernel:

 # bridge mdb add dev br0 port swp1 grp 239.1.1.1 permanent
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ff9d:e61b temp
 # ip link set dev br0 type bridge mcast_snooping 0
 # bridge mdb show dev br0
 # ip link set dev br0 type bridge mcast_snooping 1
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ff9d:e61b temp

This behavior differs from other netlink dump interfaces that dump
entries regardless of whether they are used. For example, VLANs are
dumped even when VLAN filtering is disabled:

 # ip link set dev br0 type bridge vlan_filtering 0
 # bridge vlan show dev swp1
 port  vlan-id
 swp1  1 PVID Egress Untagged

Remove the check and always dump MDB entries:

 # bridge mdb add dev br0 port swp1 grp 239.1.1.1 permanent
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ffeb:1a4d temp
 # ip link set dev br0 type bridge mcast_snooping 0
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ffeb:1a4d temp
 # ip link set dev br0 type bridge mcast_snooping 1
 # bridge mdb show dev br0
 dev br0 port swp1 grp 239.1.1.1 permanent
 dev br0 port br0 grp ff02::6a temp
 dev br0 port br0 grp ff02::1:ffeb:1a4d temp

Signed-off-by: Ido Schimmel 
---
 net/bridge/br_mdb.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 7305f5f8215c..fb58bb1b60e8 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -323,9 +323,6 @@ static int br_mdb_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
struct net_bridge_mdb_entry *mp;
struct nlattr *nest, *nest2;
 
-   if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
-   return 0;
-
nest = nla_nest_start_noflag(skb, MDBA_MDB);
if (nest == NULL)
return -EMSGSIZE;
-- 
2.40.1

[Bridge] [PATCH net-next 02/13] bridge: mcast: Account for missing attributes

2023-10-16 Thread Ido Schimmel via Bridge

The 'MDBA_MDB' and 'MDBA_MDB_ENTRY' nest attributes are not accounted
for when calculating the size of MDB notifications. Add them along with
comments for existing attributes.

Signed-off-by: Ido Schimmel 
---
 net/bridge/br_mdb.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index fb58bb1b60e8..08de94bffc12 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -452,11 +452,18 @@ static int nlmsg_populate_mdb_fill(struct sk_buff *skb,
 
 static size_t rtnl_mdb_nlmsg_size(struct net_bridge_port_group *pg)
 {
-   size_t nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
-   nla_total_size(sizeof(struct br_mdb_entry)) +
-   nla_total_size(sizeof(u32));
struct net_bridge_group_src *ent;
-   size_t addr_size = 0;
+   size_t nlmsg_size, addr_size = 0;
+
+   nlmsg_size = NLMSG_ALIGN(sizeof(struct br_port_msg)) +
+/* MDBA_MDB */
+nla_total_size(0) +
+/* MDBA_MDB_ENTRY */
+nla_total_size(0) +
+/* MDBA_MDB_ENTRY_INFO */
+nla_total_size(sizeof(struct br_mdb_entry)) +
+/* MDBA_MDB_EATTR_TIMER */
+nla_total_size(sizeof(u32));
 
if (!pg)
goto out;
-- 
2.40.1

[Bridge] [PATCH net-next 00/13] Add MDB get support

2023-10-16 Thread Ido Schimmel via Bridge

This patchset adds MDB get support, allowing user space to request a
single MDB entry to be retrieved instead of dumping the entire MDB.
Support is added in both the bridge and VXLAN drivers.

Patches #1-#6 are small preparations in both drivers.

Patches #7-#8 add the required uAPI attributes for the new functionality
and the MDB get net device operation (NDO), respectively.

Patches #9-#10 implement the MDB get NDO in both drivers.

Patch #11 registers a handler for RTM_GETMDB messages in rtnetlink core.
The handler derives the net device from the ifindex specified in the
ancillary header and invokes its MDB get NDO.

Patches #12-#13 add selftests by converting tests that use MDB dump with
grep to the new MDB get functionality.

iproute2 changes can be found here [1].

[1] https://github.com/idosch/iproute2/tree/submit/mdb_get_v1

Ido Schimmel (13):
  bridge: mcast: Dump MDB entries even when snooping is disabled
  bridge: mcast: Account for missing attributes
  bridge: mcast: Factor out a helper for PG entry size calculation
  bridge: mcast: Rename MDB entry get function
  vxlan: mdb: Adjust function arguments
  vxlan: mdb: Factor out a helper for remote entry size calculation
  bridge: add MDB get uAPI attributes
  net: Add MDB get device operation
  bridge: mcast: Add MDB get support
  vxlan: mdb: Add MDB get support
  rtnetlink: Add MDB get support
  selftests: bridge_mdb: Use MDB get instead of dump
  selftests: vxlan_mdb: Use MDB get instead of dump

 drivers/net/vxlan/vxlan_core.c|   1 +
 drivers/net/vxlan/vxlan_mdb.c | 188 --
 drivers/net/vxlan/vxlan_private.h |   2 +
 include/linux/netdevice.h |   4 +
 include/uapi/linux/if_bridge.h|   8 +
 net/bridge/br_device.c|   3 +-
 net/bridge/br_input.c |   2 +-
 net/bridge/br_mdb.c   | 180 -
 net/bridge/br_multicast.c |   5 +-
 net/bridge/br_private.h   |  19 +-
 net/core/rtnetlink.c  |  89 -
 .../selftests/net/forwarding/bridge_mdb.sh| 184 +++--
 tools/testing/selftests/net/test_vxlan_mdb.sh | 108 +-
 13 files changed, 594 insertions(+), 199 deletions(-)

-- 
2.40.1

Re: [Bridge] [PATCH net-next v4 5/6] net: bridge: Add a configurable default FDB learning limit

2023-09-26 Thread Ido Schimmel

On Thu, Sep 21, 2023 at 01:19:44PM +0300, Nikolay Aleksandrov wrote:
> I'm not strongly against, just IMO it is unnecessary. I won't block the set
> because of this, but it would be nice to get input from others as
> well. If you can recompile your kernel to set a limit, it should be easier
> to change your app to set the same limit via netlink, but I'm not familiar
> with your use case.

I agree with keeping it out. We don't have it for similar knobs (e.g.,
MDB limits) and it would create a precedent for other bridge options
instead of simply using netlink and improving user space applications.

Re: [Bridge] [PATCH net-next v4 3/6] net: bridge: Track and limit dynamically learned FDB entries

2023-09-26 Thread Ido Schimmel

On Tue, Sep 19, 2023 at 10:12:50AM +0200, Johannes Nixdorf wrote:
> A malicious actor behind one bridge port may spam the kernel with packets
> with a random source MAC address, each of which will create an FDB entry,
> each of which is a dynamic allocation in the kernel.
> 
> There are roughly 2^48 different MAC addresses, further limited by the
> rhashtable they are stored in to 2^31. Each entry is of the type struct
> net_bridge_fdb_entry, which is currently 128 bytes big. This means the
> maximum amount of memory allocated for FDB entries is 2^31 * 128B =
> 256GiB, which is too much for most computers.
> 
> Mitigate this by maintaining a per bridge count of those automatically
> generated entries in fdb_n_learned, and a limit in fdb_max_learned. If
> the limit is hit new entries are not learned anymore.
> 
> For backwards compatibility the default setting of 0 disables the limit.
> 
> User-added entries by netlink or from bridge or bridge port addresses
> are never blocked and do not count towards that limit.
> 
> Introduce a new fdb entry flag BR_FDB_DYNAMIC_LEARNED to keep track of
> whether an FDB entry is included in the count. The flag is enabled for
> dynamically learned entries, and disabled for all other entries. This
> should be equivalent to BR_FDB_ADDED_BY_USER and BR_FDB_LOCAL being unset,
> but contrary to the two flags it can be toggled atomically.
> 
> Atomicity is required here, as there are multiple callers that modify the
> flags, but are not under a common lock (br_fdb_update is the exception
> for br->hash_lock, br_fdb_external_learn_add for RTNL).
> 
> Signed-off-by: Johannes Nixdorf 

Reviewed-by: Ido Schimmel

Re: [Bridge] [PATCH net-next v4 1/6] net: bridge: Set BR_FDB_ADDED_BY_USER early in fdb_add_entry

2023-09-21 Thread Ido Schimmel

On Tue, Sep 19, 2023 at 10:12:48AM +0200, Johannes Nixdorf wrote:
> In preparation of the following fdb limit for dynamically learned entries,
> allow fdb_create to detect that the entry was added by the user. This
> way it can skip applying the limit in this case.
> 
> Signed-off-by: Johannes Nixdorf 

Reviewed-by: Ido Schimmel

[Bridge] [PATCH net-next v2 4/4] selftests: net: Add bridge backup port and backup nexthop ID test

2023-07-17 Thread Ido Schimmel via Bridge

Add test cases for bridge backup port and backup nexthop ID, testing
both good and bad flows.

Example truncated output:

 # ./test_bridge_backup_port.sh
 [...]
 Tests passed:  83
 Tests failed:   0

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 tools/testing/selftests/net/Makefile  |   1 +
 .../selftests/net/test_bridge_backup_port.sh  | 759 ++
 2 files changed, 760 insertions(+)
 create mode 100755 tools/testing/selftests/net/test_bridge_backup_port.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 2f69f7274e3d..6d1cd1c63d40 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -85,6 +85,7 @@ TEST_GEN_FILES += bind_wildcard
 TEST_PROGS += test_vxlan_mdb.sh
 TEST_PROGS += test_bridge_neigh_suppress.sh
 TEST_PROGS += test_vxlan_nolocalbypass.sh
+TEST_PROGS += test_bridge_backup_port.sh
 
 TEST_FILES := settings
 
diff --git a/tools/testing/selftests/net/test_bridge_backup_port.sh b/tools/testing/selftests/net/test_bridge_backup_port.sh
new file mode 100755
index ..112cfd8a10ad
--- /dev/null
+++ b/tools/testing/selftests/net/test_bridge_backup_port.sh
@@ -0,0 +1,759 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# This test is for checking bridge backup port and backup nexthop ID
+# functionality. The topology consists of two bridge (VTEPs) connected using
+# VXLAN. The test checks that when the switch port (swp1) is down, traffic is
+# redirected to the VXLAN port (vx0). When a backup nexthop ID is configured,
+# the test checks that traffic is redirected with the correct nexthop
+# information.
+#
+# +------------------------------------+   +------------------------------------+
+# |    + swp1                   + vx0  |   |    + swp1                   + vx0  |
+# |    |                        |      |   |    |                        |      |
+# |    |           br0          |      |   |    |                        |      |
+# |    +------------+-----------+      |   |    +------------+-----------+      |
+# |                 |                  |   |                 |                  |
+# |                 |                  |   |                 |                  |
+# |                 +                  |   |                 +                  |
+# |                br0                 |   |                br0                 |
+# |                 +                  |   |                 +                  |
+# |                 |                  |   |                 |                  |
+# |                 |                  |   |                 |                  |
+# |                 +                  |   |                 +                  |
+# |               br0.10               |   |               br0.10               |
+# |           192.0.2.65/28            |   |           192.0.2.66/28            |
+# |                                    |   |                                    |
+# |                                    |   |                                    |
+# |                 192.0.2.33         |   |                 192.0.2.34         |
+# |                 + lo               |   |                 + lo               |
+# |                                    |   |                                    |
+# |                                    |   |                                    |
+# |                    192.0.2.49/28   |   |   192.0.2.50/28                    |
+# |                              veth0 +---+ veth0                              |
+# |                                    |   |                                    |
+# | sw1                                |   | sw2                                |
+# +------------------------------------+   +------------------------------------+
+
+ret=0
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+# All tests in this script. Can be overridden with -t option.
+TESTS="
+   backup_port
+   backup_nhid
+   backup_nhid_invalid
+   backup_nhid_ping
+   backup_nhid_torture
+"
+VERBOSE=0
+PAUSE_ON_FAIL=no
+PAUSE=no
+PING_TIMEOUT=5
+
+
+# Utilities
+
+log_test()
+{
+   local rc=$1
+   local expected=$2
+   local msg="$3"
+
+   if [ ${rc} -eq ${expected} ]; then
+   printf "TEST: %-60s  [ OK ]\n" "${msg}"
+   nsuccess=$((nsuccess+1))
+   else
+   ret=1
+   nfail=$((nfail+1))
+   printf "TEST: %-60s  [FAIL]\n" "${msg}"
+   if [ "$VERBOSE" = "1" ]; then
+   echo "rc=$rc, expected $expected"
+   fi
+
+   if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
+   echo
+   echo "hit enter to continue, 'q' to quit"
+

[Bridge] [PATCH net-next v2 3/4] bridge: Add backup nexthop ID support

2023-07-17 Thread Ido Schimmel via Bridge

Add a new bridge port attribute that allows attaching a nexthop object
ID to an skb that is redirected to a backup bridge port with VLAN
tunneling enabled.

Specifically, when redirecting a known unicast packet, read the backup
nexthop ID from the bridge port that lost its carrier and set it in the
bridge control block of the skb before forwarding it via the backup
port. Note that reading the ID from the bridge port should not result in
a cache miss as the ID is added next to the 'backup_port' field that was
already accessed. After this change, the 'state' field still stays on
the first cache line, together with other data path related fields such
as 'flags' and 'vlgrp':

struct net_bridge_port {
        struct net_bridge *            br;            /*     0     8 */
        struct net_device *            dev;           /*     8     8 */
        netdevice_tracker              dev_tracker;   /*    16     0 */
        struct list_head               list;          /*    16    16 */
        long unsigned int              flags;         /*    32     8 */
        struct net_bridge_vlan_group * vlgrp;         /*    40     8 */
        struct net_bridge_port *       backup_port;   /*    48     8 */
        u32                            backup_nhid;   /*    56     4 */
        u8                             priority;      /*    60     1 */
        u8                             state;         /*    61     1 */
        u16                            port_no;       /*    62     2 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        [...]
} __attribute__((__aligned__(8)));

When forwarding an skb via a bridge port that has VLAN tunneling
enabled, check if the backup nexthop ID stored in the bridge control
block is valid (i.e., not zero). If so, instead of attaching the
pre-allocated metadata (that only has the tunnel key set), allocate a
new metadata, set both the tunnel key and the nexthop object ID and
attach it to the skb.

By default, do not dump the new attribute to user space as a value of
zero is an invalid nexthop object ID.

The above is useful for EVPN multihoming. When one of the links
composing an Ethernet Segment (ES) fails, traffic needs to be redirected
towards the host via one of the other ES peers. For example, if a host
is multihomed to three different VTEPs, the backup port of each ES link
needs to be set to the VXLAN device and the backup nexthop ID needs to
point to an FDB nexthop group that includes the IP addresses of the
other two VTEPs. The VXLAN driver will extract the ID from the metadata
of the redirected skb, calculate its flow hash and forward it towards
one of the other VTEPs. If the ID does not exist, or represents an
invalid nexthop object, the VXLAN driver will drop the skb. This
relieves the bridge driver from the need to validate the ID.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 include/uapi/linux/if_link.h |  1 +
 net/bridge/br_forward.c  |  1 +
 net/bridge/br_netlink.c  | 12 
 net/bridge/br_private.h  |  3 +++
 net/bridge/br_vlan_tunnel.c  | 15 +++
 net/core/rtnetlink.c |  2 +-
 6 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 0f6a0fe09bdb..ce3117df9cec 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -570,6 +570,7 @@ enum {
IFLA_BRPORT_MCAST_N_GROUPS,
IFLA_BRPORT_MCAST_MAX_GROUPS,
IFLA_BRPORT_NEIGH_VLAN_SUPPRESS,
+   IFLA_BRPORT_BACKUP_NHID,
__IFLA_BRPORT_MAX
 };
 #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 6116eba1bd89..9d7bc8b96b53 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -154,6 +154,7 @@ void br_forward(const struct net_bridge_port *to,
backup_port = rcu_dereference(to->backup_port);
if (unlikely(!backup_port))
goto out;
+   BR_INPUT_SKB_CB(skb)->backup_nhid = READ_ONCE(to->backup_nhid);
to = backup_port;
}
 
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 05c5863d2e20..10f0d33d8ccf 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -211,6 +211,7 @@ static inline size_t br_port_info_size(void)
+ nla_total_size(sizeof(u8))/* IFLA_BRPORT_MRP_IN_OPEN */
+ nla_total_size(sizeof(u32))   /* IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT */
+ nla_total_size(sizeof(u32))   /* IFLA_BRPORT_MCAST_EHT_HOSTS_CNT */
+   + nla_total_size(sizeof(u32))   /* IFLA_BRPORT_BACKUP_NHID */
+ 0;
 }
 
@@ -319,6 +320,10 @@ static int br_port_fill_attrs(struct sk_buff *skb,
backup_p->dev->ifindex);
rcu_read_unlock();
 
+   if (p->backup_nhid &&
+

[Bridge] [PATCH net-next v2 0/4] Add backup nexthop ID support

2023-07-17 Thread Ido Schimmel via Bridge

.2
[3] https://github.com/idosch/iproute2/tree/submit/backup_nhid_v1
[4] https://lore.kernel.org/netdev/20230713070925.3955850-1-ido...@nvidia.com/

Ido Schimmel (4):
  ip_tunnels: Add nexthop ID field to ip_tunnel_key
  vxlan: Add support for nexthop ID metadata
  bridge: Add backup nexthop ID support
  selftests: net: Add bridge backup port and backup nexthop ID test

 drivers/net/vxlan/vxlan_core.c|  44 +
 include/net/ip_tunnels.h  |   1 +
 include/uapi/linux/if_link.h  |   1 +
 net/bridge/br_forward.c   |   1 +
 net/bridge/br_netlink.c   |  12 +
 net/bridge/br_private.h   |   3 +
 net/bridge/br_vlan_tunnel.c   |  15 +
 net/core/rtnetlink.c  |   2 +-
 tools/testing/selftests/net/Makefile  |   1 +
 .../selftests/net/test_bridge_backup_port.sh  | 759 ++
 10 files changed, 838 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/net/test_bridge_backup_port.sh

-- 
2.40.1

[Bridge] [PATCH net-next v2 2/4] vxlan: Add support for nexthop ID metadata

2023-07-17 Thread Ido Schimmel via Bridge

VXLAN FDB entries can point to FDB nexthop objects. Each such object
includes the IP address(es) of remote VTEP(s) via which the target host
is accessible. Example:

 # ip nexthop add id 1 via 192.0.2.1 fdb
 # ip nexthop add id 2 via 192.0.2.17 fdb
 # ip nexthop add id 1000 group 1/2 fdb
 # bridge fdb add 00:11:22:33:44:55 dev vx0 self static nhid 1000 src_vni 10020

This is useful for EVPN multihoming where a single host can be connected
to multiple VTEPs. The source VTEP will calculate the flow hash of the
skb and forward it towards the IP address of one of the VTEP members in
the nexthop group.

There are cases where an external entity (e.g., the bridge driver) can
provide not only the tunnel ID (i.e., VNI) of the skb, but also the ID
of the nexthop object via which the skb should be forwarded.

Therefore, in order to support such cases, when the VXLAN device is in
external / collect metadata mode and the tunnel info attached to the skb
is of bridge type, extract the nexthop ID from the tunnel info. If the
ID is valid (i.e., non-zero), forward the skb via the nexthop object
associated with the ID, as if the skb hit an FDB entry associated with
this ID.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 drivers/net/vxlan/vxlan_core.c | 44 ++
 1 file changed, 44 insertions(+)

diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index 78744549c1b3..10a4dbd50710 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -2672,6 +2672,45 @@ static void vxlan_xmit_nh(struct sk_buff *skb, struct net_device *dev,
dev_kfree_skb(skb);
 }
 
+static netdev_tx_t vxlan_xmit_nhid(struct sk_buff *skb, struct net_device *dev,
+  u32 nhid, __be32 vni)
+{
+   struct vxlan_dev *vxlan = netdev_priv(dev);
+   struct vxlan_rdst nh_rdst;
+   struct nexthop *nh;
+   bool do_xmit;
+   u32 hash;
+
+   memset(&nh_rdst, 0, sizeof(struct vxlan_rdst));
+   hash = skb_get_hash(skb);
+
+   rcu_read_lock();
+   nh = nexthop_find_by_id(dev_net(dev), nhid);
+   if (unlikely(!nh || !nexthop_is_fdb(nh) || !nexthop_is_multipath(nh))) {
+   rcu_read_unlock();
+   goto drop;
+   }
+   do_xmit = vxlan_fdb_nh_path_select(nh, hash, &nh_rdst);
+   rcu_read_unlock();
+
+   if (vxlan->cfg.saddr.sa.sa_family != nh_rdst.remote_ip.sa.sa_family)
+   goto drop;
+
+   if (likely(do_xmit))
+   vxlan_xmit_one(skb, dev, vni, &nh_rdst, false);
+   else
+   goto drop;
+
+   return NETDEV_TX_OK;
+
+drop:
+   dev->stats.tx_dropped++;
+   vxlan_vnifilter_count(netdev_priv(dev), vni, NULL,
+ VXLAN_VNI_STATS_TX_DROPS, 0);
+   dev_kfree_skb(skb);
+   return NETDEV_TX_OK;
+}
+
 /* Transmit local packets over Vxlan
  *
  * Outer IP header inherits ECN and DF from inner header.
@@ -2687,6 +2726,7 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
struct vxlan_fdb *f;
struct ethhdr *eth;
__be32 vni = 0;
+   u32 nhid = 0;
 
info = skb_tunnel_info(skb);
 
@@ -2696,6 +2736,7 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
if (info && info->mode & IP_TUNNEL_INFO_BRIDGE &&
info->mode & IP_TUNNEL_INFO_TX) {
vni = tunnel_id_to_key32(info->key.tun_id);
+   nhid = info->key.nhid;
} else {
if (info && info->mode & IP_TUNNEL_INFO_TX)
vxlan_xmit_one(skb, dev, vni, NULL, false);
@@ -2723,6 +2764,9 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 #endif
}
 
+   if (nhid)
+   return vxlan_xmit_nhid(skb, dev, nhid, vni);
+
if (vxlan->cfg.flags & VXLAN_F_MDB) {
struct vxlan_mdb_entry *mdb_entry;
 
-- 
2.40.1

[Bridge] [PATCH net-next v2 1/4] ip_tunnels: Add nexthop ID field to ip_tunnel_key

2023-07-17 Thread Ido Schimmel via Bridge

Extend the ip_tunnel_key structure with a field indicating the ID of the
nexthop object via which the skb should be routed.

The field is going to be populated in subsequent patches by the bridge
driver in order to indicate to the VXLAN driver which FDB nexthop object
to use in order to reach the target host.

Signed-off-by: Ido Schimmel 
Reviewed-by: Nikolay Aleksandrov 
---
 include/net/ip_tunnels.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index ed4b6ad3fcac..e8750b4ef7e1 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -52,6 +52,7 @@ struct ip_tunnel_key {
u8  tos;/* TOS for IPv4, TC for IPv6 */
u8  ttl;/* TTL for IPv4, HL for IPv6 */
__be32  label;  /* Flow Label for IPv6 */
+   u32 nhid;
__be16  tp_src;
__be16  tp_dst;
__u8flow_flags;
-- 
2.40.1

[Bridge] [RFC PATCH net-next 4/4] selftests: net: Add bridge backup port and backup nexthop ID test

2023-07-13 Thread Ido Schimmel via Bridge

Add test cases for bridge backup port and backup nexthop ID, testing
both good and bad flows.

Example truncated output:

 # ./test_bridge_backup_port.sh
 [...]
 Tests passed:  83
 Tests failed:   0

Signed-off-by: Ido Schimmel 
---
 tools/testing/selftests/net/Makefile  |   1 +
 .../selftests/net/test_bridge_backup_port.sh  | 759 ++
 2 files changed, 760 insertions(+)
 create mode 100755 tools/testing/selftests/net/test_bridge_backup_port.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 7f3ab2a93ed6..18c94544aa9d 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -85,6 +85,7 @@ TEST_GEN_FILES += bind_wildcard
 TEST_PROGS += test_vxlan_mdb.sh
 TEST_PROGS += test_bridge_neigh_suppress.sh
 TEST_PROGS += test_vxlan_nolocalbypass.sh
+TEST_PROGS += test_bridge_backup_port.sh
 
 TEST_FILES := settings
 
diff --git a/tools/testing/selftests/net/test_bridge_backup_port.sh b/tools/testing/selftests/net/test_bridge_backup_port.sh
new file mode 100755
index ..112cfd8a10ad
--- /dev/null
+++ b/tools/testing/selftests/net/test_bridge_backup_port.sh
@@ -0,0 +1,759 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# This test is for checking bridge backup port and backup nexthop ID
+# functionality. The topology consists of two bridge (VTEPs) connected using
+# VXLAN. The test checks that when the switch port (swp1) is down, traffic is
+# redirected to the VXLAN port (vx0). When a backup nexthop ID is configured,
+# the test checks that traffic is redirected with the correct nexthop
+# information.
+#
+# +------------------------------------+   +------------------------------------+
+# |    + swp1                   + vx0  |   |    + swp1                   + vx0  |
+# |    |                        |      |   |    |                        |      |
+# |    |           br0          |      |   |    |                        |      |
+# |    +------------+-----------+      |   |    +------------+-----------+      |
+# |                 |                  |   |                 |                  |
+# |                 |                  |   |                 |                  |
+# |                 +                  |   |                 +                  |
+# |                br0                 |   |                br0                 |
+# |                 +                  |   |                 +                  |
+# |                 |                  |   |                 |                  |
+# |                 |                  |   |                 |                  |
+# |                 +                  |   |                 +                  |
+# |               br0.10               |   |               br0.10               |
+# |           192.0.2.65/28            |   |           192.0.2.66/28            |
+# |                                    |   |                                    |
+# |                                    |   |                                    |
+# |                 192.0.2.33         |   |                 192.0.2.34         |
+# |                 + lo               |   |                 + lo               |
+# |                                    |   |                                    |
+# |                                    |   |                                    |
+# |                    192.0.2.49/28   |   |   192.0.2.50/28                    |
+# |                              veth0 +---+ veth0                              |
+# |                                    |   |                                    |
+# | sw1                                |   | sw2                                |
+# +------------------------------------+   +------------------------------------+
+
+ret=0
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+# All tests in this script. Can be overridden with -t option.
+TESTS="
+   backup_port
+   backup_nhid
+   backup_nhid_invalid
+   backup_nhid_ping
+   backup_nhid_torture
+"
+VERBOSE=0
+PAUSE_ON_FAIL=no
+PAUSE=no
+PING_TIMEOUT=5
+
+
+# Utilities
+
+log_test()
+{
+   local rc=$1
+   local expected=$2
+   local msg="$3"
+
+   if [ ${rc} -eq ${expected} ]; then
+   printf "TEST: %-60s  [ OK ]\n" "${msg}"
+   nsuccess=$((nsuccess+1))
+   else
+   ret=1
+   nfail=$((nfail+1))
+   printf "TEST: %-60s  [FAIL]\n" "${msg}"
+   if [ "$VERBOSE" = "1" ]; then
+   echo "rc=$rc, expected $expected"
+   fi
+
+   if [ "${PAUSE_ON_FAIL}" = "yes" ]; then
+   echo
+   echo "hit enter to continue, 'q' to quit"
+   read a
+

[Bridge] [RFC PATCH net-next 3/4] bridge: Add backup nexthop ID support

2023-07-13 Thread Ido Schimmel via Bridge

Add a new bridge port attribute that allows attaching a nexthop object
ID to an skb that is redirected to a backup bridge port with VLAN
tunneling enabled.

Specifically, when redirecting a known unicast packet, read the backup
nexthop ID from the bridge port that lost its carrier and set it in the
bridge control block of the skb before forwarding it via the backup
port. Note that reading the ID from the bridge port should not result in
a cache miss as the ID is added next to the 'backup_port' field that was
already accessed. After this change, the 'state' field still stays on
the first cache line, together with other data path related fields such
as 'flags' and 'vlgrp':

struct net_bridge_port {
        struct net_bridge *        br;                   /*     0     8 */
        struct net_device *        dev;                  /*     8     8 */
        netdevice_tracker          dev_tracker;          /*    16     0 */
        struct list_head           list;                 /*    16    16 */
        long unsigned int          flags;                /*    32     8 */
        struct net_bridge_vlan_group * vlgrp;            /*    40     8 */
        struct net_bridge_port *   backup_port;          /*    48     8 */
        u32                        backup_nhid;          /*    56     4 */
        u8                         priority;             /*    60     1 */
        u8                         state;                /*    61     1 */
        u16                        port_no;              /*    62     2 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        [...]
} __attribute__((__aligned__(8)));

When forwarding an skb via a bridge port that has VLAN tunneling
enabled, check if the backup nexthop ID stored in the bridge control
block is valid (i.e., not zero). If so, instead of attaching the
pre-allocated metadata (that only has the tunnel key set), allocate a
new metadata, set both the tunnel key and the nexthop object ID and
attach it to the skb.

By default, do not dump the new attribute to user space as a value of
zero is an invalid nexthop object ID.

The above is useful for EVPN multihoming. When one of the links
composing an Ethernet Segment (ES) fails, traffic needs to be redirected
towards the host via one of the other ES peers. For example, if a host
is multihomed to three different VTEPs, the backup port of each ES link
needs to be set to the VXLAN device and the backup nexthop ID needs to
point to an FDB nexthop group that includes the IP addresses of the
other two VTEPs. The VXLAN driver will extract the ID from the metadata
of the redirected skb, calculate its flow hash and forward it towards
one of the other VTEPs. If the ID does not exist, or represents an
invalid nexthop object, the VXLAN driver will drop the skb. This
relieves the bridge driver from the need to validate the ID.
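
To illustrate the two steps described above, here is a minimal user-space
sketch. It is not the kernel code: 'struct port_model', 'struct skb_cb_model',
select_egress() and needs_new_metadata() are hypothetical names, and carrier
state and VLAN tunneling are reduced to booleans.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the bridge port fields used on redirect. */
struct port_model {
	bool carrier_ok;
	bool vlan_tunnel;               /* VLAN tunneling enabled on the port */
	struct port_model *backup_port; /* NULL when no backup is configured */
	uint32_t backup_nhid;           /* 0 means "not set" */
};

/* Hypothetical stand-in for the bridge control block of an skb. */
struct skb_cb_model {
	uint32_t backup_nhid;
};

/* Redirect to the backup port when the destination lost its carrier,
 * recording the backup nexthop ID in the control block on the way.
 */
static const struct port_model *select_egress(const struct port_model *to,
					      struct skb_cb_model *cb)
{
	if (to->backup_port && !to->carrier_ok) {
		cb->backup_nhid = to->backup_nhid;
		to = to->backup_port;
	}
	return to;
}

/* New metadata (tunnel key plus nexthop ID) is only needed when the egress
 * port does VLAN tunneling and a non-zero ID was recorded.
 */
static bool needs_new_metadata(const struct port_model *egress,
			       const struct skb_cb_model *cb)
{
	return egress->vlan_tunnel && cb->backup_nhid != 0;
}
```

With swp1 down and backed up by vx0 with ID 10, select_egress() returns vx0
and needs_new_metadata() is true; with swp1 up, nothing is recorded and the
pre-allocated metadata would be used.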

Signed-off-by: Ido Schimmel 
---
 include/uapi/linux/if_link.h |  1 +
 net/bridge/br_forward.c  |  1 +
 net/bridge/br_netlink.c  | 12 
 net/bridge/br_private.h  |  3 +++
 net/bridge/br_vlan_tunnel.c  | 15 +++
 net/core/rtnetlink.c |  2 +-
 6 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 0f6a0fe09bdb..ce3117df9cec 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -570,6 +570,7 @@ enum {
IFLA_BRPORT_MCAST_N_GROUPS,
IFLA_BRPORT_MCAST_MAX_GROUPS,
IFLA_BRPORT_NEIGH_VLAN_SUPPRESS,
+   IFLA_BRPORT_BACKUP_NHID,
__IFLA_BRPORT_MAX
 };
 #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 6116eba1bd89..9d7bc8b96b53 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -154,6 +154,7 @@ void br_forward(const struct net_bridge_port *to,
backup_port = rcu_dereference(to->backup_port);
if (unlikely(!backup_port))
goto out;
+   BR_INPUT_SKB_CB(skb)->backup_nhid = READ_ONCE(to->backup_nhid);
to = backup_port;
}
 
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 05c5863d2e20..10f0d33d8ccf 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -211,6 +211,7 @@ static inline size_t br_port_info_size(void)
+ nla_total_size(sizeof(u8))/* IFLA_BRPORT_MRP_IN_OPEN */
+ nla_total_size(sizeof(u32))   /* IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT */
+ nla_total_size(sizeof(u32))   /* IFLA_BRPORT_MCAST_EHT_HOSTS_CNT */
+   + nla_total_size(sizeof(u32))   /* IFLA_BRPORT_BACKUP_NHID */
+ 0;
 }
 
@@ -319,6 +320,10 @@ static int br_port_fill_attrs(struct sk_buff *skb,
backup_p->dev->ifindex);
rcu_read_unlock();
 
+   if (p->backup_nhid &&
+   nla_put_u32(skb, IFLA_BRPORT_B

[Bridge] [RFC PATCH net-next 2/4] vxlan: Add support for nexthop ID metadata

2023-07-13 Thread Ido Schimmel via Bridge

VXLAN FDB entries can point to FDB nexthop objects. Each such object
includes the IP address(es) of remote VTEP(s) via which the target host
is accessible. Example:

 # ip nexthop add id 1 via 192.0.2.1 fdb
 # ip nexthop add id 2 via 192.0.2.17 fdb
 # ip nexthop add id 1000 group 1/2 fdb
 # bridge fdb add 00:11:22:33:44:55 dev vx0 self static nhid 1000 src_vni 10020

This is useful for EVPN multihoming where a single host can be connected
to multiple VTEPs. The source VTEP will calculate the flow hash of the
skb and forward it towards the IP address of one of the VTEPs that are
members of the nexthop group.
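
As a rough illustration of why the flow hash matters: packets of the same
flow always map to the same group member. The sketch below uses a plain
modulo, which is an assumption made for brevity and not necessarily how the
kernel selects a path; pick_group_member() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Map a flow hash to a member index of a nexthop group.
 * Returns -1 for an empty group.
 */
static int pick_group_member(uint32_t flow_hash, size_t num_members)
{
	if (num_members == 0)
		return -1;
	return (int)(flow_hash % num_members);
}
```

For the two-member group above (IDs 1 and 2), every packet carrying the same
hash resolves to the same index and therefore to the same remote VTEP.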

There are cases where an external entity (e.g., the bridge driver) can
provide not only the tunnel ID (i.e., VNI) of the skb, but also the ID
of the nexthop object via which the skb should be forwarded.

Therefore, in order to support such cases, when the VXLAN device is in
external / collect metadata mode and the tunnel info attached to the skb
is of bridge type, extract the nexthop ID from the tunnel info. If the
ID is valid (i.e., non-zero), forward the skb via the nexthop object
associated with the ID, as if the skb hit an FDB entry associated with
this ID.

Signed-off-by: Ido Schimmel 
---
 drivers/net/vxlan/vxlan_core.c | 44 ++
 1 file changed, 44 insertions(+)

diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index 78744549c1b3..10a4dbd50710 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -2672,6 +2672,45 @@ static void vxlan_xmit_nh(struct sk_buff *skb, struct net_device *dev,
dev_kfree_skb(skb);
 }
 
+static netdev_tx_t vxlan_xmit_nhid(struct sk_buff *skb, struct net_device *dev,
+  u32 nhid, __be32 vni)
+{
+   struct vxlan_dev *vxlan = netdev_priv(dev);
+   struct vxlan_rdst nh_rdst;
+   struct nexthop *nh;
+   bool do_xmit;
+   u32 hash;
+
+   memset(&nh_rdst, 0, sizeof(struct vxlan_rdst));
+   hash = skb_get_hash(skb);
+
+   rcu_read_lock();
+   nh = nexthop_find_by_id(dev_net(dev), nhid);
+   if (unlikely(!nh || !nexthop_is_fdb(nh) || !nexthop_is_multipath(nh))) {
+   rcu_read_unlock();
+   goto drop;
+   }
+   do_xmit = vxlan_fdb_nh_path_select(nh, hash, &nh_rdst);
+   rcu_read_unlock();
+
+   if (vxlan->cfg.saddr.sa.sa_family != nh_rdst.remote_ip.sa.sa_family)
+   goto drop;
+
+   if (likely(do_xmit))
+   vxlan_xmit_one(skb, dev, vni, &nh_rdst, false);
+   else
+   goto drop;
+
+   return NETDEV_TX_OK;
+
+drop:
+   dev->stats.tx_dropped++;
+   vxlan_vnifilter_count(netdev_priv(dev), vni, NULL,
+ VXLAN_VNI_STATS_TX_DROPS, 0);
+   dev_kfree_skb(skb);
+   return NETDEV_TX_OK;
+}
+
 /* Transmit local packets over Vxlan
  *
  * Outer IP header inherits ECN and DF from inner header.
@@ -2687,6 +2726,7 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
struct vxlan_fdb *f;
struct ethhdr *eth;
__be32 vni = 0;
+   u32 nhid = 0;
 
info = skb_tunnel_info(skb);
 
@@ -2696,6 +2736,7 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
if (info && info->mode & IP_TUNNEL_INFO_BRIDGE &&
info->mode & IP_TUNNEL_INFO_TX) {
vni = tunnel_id_to_key32(info->key.tun_id);
+   nhid = info->key.nhid;
} else {
if (info && info->mode & IP_TUNNEL_INFO_TX)
vxlan_xmit_one(skb, dev, vni, NULL, false);
@@ -2723,6 +2764,9 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
 #endif
}
 
+   if (nhid)
+   return vxlan_xmit_nhid(skb, dev, nhid, vni);
+
if (vxlan->cfg.flags & VXLAN_F_MDB) {
struct vxlan_mdb_entry *mdb_entry;
 
-- 
2.40.1

[Bridge] [RFC PATCH net-next 0/4] Add backup nexthop ID support

2023-07-13 Thread Ido Schimmel via Bridge

/backup_nhid_v1

Ido Schimmel (4):
  ip_tunnels: Add nexthop ID field to ip_tunnel_key
  vxlan: Add support for nexthop ID metadata
  bridge: Add backup nexthop ID support
  selftests: net: Add bridge backup port and backup nexthop ID test

 drivers/net/vxlan/vxlan_core.c|  44 +
 include/net/ip_tunnels.h  |   1 +
 include/uapi/linux/if_link.h  |   1 +
 net/bridge/br_forward.c   |   1 +
 net/bridge/br_netlink.c   |  12 +
 net/bridge/br_private.h   |   3 +
 net/bridge/br_vlan_tunnel.c   |  15 +
 net/core/rtnetlink.c  |   2 +-
 tools/testing/selftests/net/Makefile  |   1 +
 .../selftests/net/test_bridge_backup_port.sh  | 759 ++
 10 files changed, 838 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/net/test_bridge_backup_port.sh

-- 
2.40.1

[Bridge] [RFC PATCH net-next 1/4] ip_tunnels: Add nexthop ID field to ip_tunnel_key

2023-07-13 Thread Ido Schimmel via Bridge

Extend the ip_tunnel_key structure with a field indicating the ID of the
nexthop object via which the skb should be routed.

The field is going to be populated in subsequent patches by the bridge
driver in order to indicate to the VXLAN driver which FDB nexthop object
to use in order to reach the target host.

Signed-off-by: Ido Schimmel 
---
 include/net/ip_tunnels.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index ed4b6ad3fcac..e8750b4ef7e1 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -52,6 +52,7 @@ struct ip_tunnel_key {
u8  tos;/* TOS for IPv4, TC for IPv6 */
u8  ttl;/* TTL for IPv4, HL for IPv6 */
__be32  label;  /* Flow Label for IPv6 */
+   u32 nhid;
__be16  tp_src;
__be16  tp_dst;
__u8flow_flags;
-- 
2.40.1

Re: [Bridge] [PATCH v2 net] bridge: Add extack warning when enabling STP in netns.

2023-07-12 Thread Ido Schimmel

On Wed, Jul 12, 2023 at 08:44:49AM -0700, Kuniyuki Iwashima wrote:
> When we create an L2 loop on a bridge in netns, we will see packets storm
> even if STP is enabled.
> 
>   # unshare -n
>   # ip link add br0 type bridge
>   # ip link add veth0 type veth peer name veth1
>   # ip link set veth0 master br0 up
>   # ip link set veth1 master br0 up
>   # ip link set br0 type bridge stp_state 1
>   # ip link set br0 up
>   # sleep 30
>   # ip -s link show br0
>   2: br0:  mtu 1500 qdisc noqueue state UP 
> mode DEFAULT group default qlen 1000
>   link/ether b6:61:98:1c:1c:b5 brd ff:ff:ff:ff:ff:ff
>   RX: bytes  packets  errors  dropped missed  mcast
>   956553768  12861249 0   0   0   12861249  <-. Keep
>   TX: bytes  packets  errors  dropped carrier collsns |  increasing
>   1027834119510   0   0   0 <-'   rapidly
> 
> This is because llc_rcv() drops all packets in non-root netns and BPDU
> is dropped.
> 
> Let's add extack warning when enabling STP in netns.
> 
>   # unshare -n
>   # ip link add br0 type bridge
>   # ip link set br0 type bridge stp_state 1
>   Warning: bridge: STP does not work in non-root netns.
> 
> Note this commit will be reverted later when we namespacify the whole LLC
> infra.
> 
> Fixes: e730c15519d0 ("[NET]: Make packet reception network namespace safe")
> Suggested-by: Harry Coin 
> Link: 
> https://lore.kernel.org/netdev/0f531295-e289-022d-5add-5ceffa0df...@quietfountain.com/
> Suggested-by: Ido Schimmel 
> Signed-off-by: Kuniyuki Iwashima 

Reviewed-by: Ido Schimmel

Re: [Bridge] [PATCH v1 net] bridge: Return an error when enabling STP in netns.

2023-07-12 Thread Ido Schimmel

On Tue, Jul 11, 2023 at 04:54:15PM -0700, Kuniyuki Iwashima wrote:
> When we create an L2 loop on a bridge in netns, we will see packets storm
> even if STP is enabled.
> 
>   # unshare -n
>   # ip link add br0 type bridge
>   # ip link add veth0 type veth peer name veth1
>   # ip link set veth0 master br0 up
>   # ip link set veth1 master br0 up
>   # ip link set br0 type bridge stp_state 1
>   # ip link set br0 up
>   # sleep 30
>   # ip -s link show br0
>   2: br0:  mtu 1500 qdisc noqueue state UP 
> mode DEFAULT group default qlen 1000
>   link/ether b6:61:98:1c:1c:b5 brd ff:ff:ff:ff:ff:ff
>   RX: bytes  packets  errors  dropped missed  mcast
>   956553768  12861249 0   0   0   12861249  <-. Keep
>   TX: bytes  packets  errors  dropped carrier collsns |  increasing
>   1027834119510   0   0   0 <-'   rapidly
> 
> This is because llc_rcv() drops all packets in non-root netns and BPDU
> is dropped.
> 
> Let's show an error when enabling STP in netns.
> 
>   # unshare -n
>   # ip link add br0 type bridge
>   # ip link set br0 type bridge stp_state 1
>   Error: bridge: STP can't be enabled in non-root netns.
> 
> Note this commit will be reverted later when we namespacify the whole LLC
> infra.
> 
> Fixes: e730c15519d0 ("[NET]: Make packet reception network namespace safe")
> Suggested-by: Harry Coin 

I'm not sure that's accurate. I read his response in the link below and
he says "I'd rather be warned than blocked" and "But better warned and
awaiting a fix than blocked", which I agree with. The patch has the
potential to cause a lot of regressions, but without actually fixing the
problem.

How about simply removing the error [1]? Since iproute2 commit
844c37b42373 ("libnetlink: Handle extack messages for non-error case"),
it can print extack warnings and not only errors. With the diff below:

 # unshare -n 
 # ip link add name br0 type bridge
 # ip link set dev br0 type bridge stp_state 1
 Warning: bridge: STP can't be enabled in non-root netns.
 # echo $?
 0

[1]
diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
index a807996ac56b..b5143de37938 100644
--- a/net/bridge/br_stp_if.c
+++ b/net/bridge/br_stp_if.c
@@ -201,10 +201,8 @@ int br_stp_set_enabled(struct net_bridge *br, unsigned long val,
 {
ASSERT_RTNL();
 
-   if (!net_eq(dev_net(br->dev), &init_net)) {
+   if (!net_eq(dev_net(br->dev), &init_net))
NL_SET_ERR_MSG_MOD(extack, "STP can't be enabled in non-root netns");
-   return -EINVAL;
-   }
 
if (br_mrp_enabled(br)) {
NL_SET_ERR_MSG_MOD(extack,

> Link: 
> https://lore.kernel.org/netdev/0f531295-e289-022d-5add-5ceffa0df...@quietfountain.com/
> Signed-off-by: Kuniyuki Iwashima 
> ---
>  net/bridge/br_stp_if.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/bridge/br_stp_if.c b/net/bridge/br_stp_if.c
> index 75204d36d7f9..a807996ac56b 100644
> --- a/net/bridge/br_stp_if.c
> +++ b/net/bridge/br_stp_if.c
> @@ -201,6 +201,11 @@ int br_stp_set_enabled(struct net_bridge *br, unsigned 
> long val,
>  {
>   ASSERT_RTNL();
>  
> + if (!net_eq(dev_net(br->dev), &init_net)) {
> + NL_SET_ERR_MSG_MOD(extack, "STP can't be enabled in non-root 
> netns");
> + return -EINVAL;
> + }
> +
>   if (br_mrp_enabled(br)) {
>   NL_SET_ERR_MSG_MOD(extack,
>  "STP can't be enabled if MRP is already 
> enabled");
> -- 
> 2.30.2
> 
>

Re: [Bridge] [PATCH v1 net] bridge: Return an error when enabling STP in netns.

2023-07-12 Thread Ido Schimmel

On Wed, Jul 12, 2023 at 05:52:09PM +0300, Nikolay Aleksandrov wrote:
> I'd prefer this approach to changing user-visible behaviour and potential 
> regressions.
> Just change the warning message.

Yea, I noticed after sending that the message no longer fits :)

Re: [Bridge] [PATCH net] net: bridge: keep ports without IFF_UNICAST_FLT in BR_PROMISC mode

2023-07-03 Thread Ido Schimmel via Bridge

On Fri, Jun 30, 2023 at 07:41:18PM +0300, Vladimir Oltean wrote:
> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index 3f04b40f6056..2450690f98cf 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -166,8 +166,9 @@ void br_manage_promisc(struct net_bridge *br)
>* This lets us disable promiscuous mode and write
>* this config to hw.
>*/
> - if (br->auto_cnt == 0 ||
> - (br->auto_cnt == 1 && br_auto_port(p)))
> + if ((p->dev->priv_flags & IFF_UNICAST_FLT) &&
> + (br->auto_cnt == 0 ||
> +  (br->auto_cnt == 1 && br_auto_port(p))))
>   br_port_clear_promisc(p);
>   else
>   br_port_set_promisc(p);

IIUC, you are basically saying "If the port does not support unicast
filtering, then set it to promiscuous mode right away instead of waiting
for the addition of the first FDB entry to trigger it."

If so, LGTM.
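
For reference, the condition as I read it, written as a standalone predicate.
This is only a sketch: port_may_clear_promisc() and its boolean parameters are
made-up names standing in for the IFF_UNICAST_FLT check, br->auto_cnt and
br_auto_port(p).

```c
#include <stdbool.h>

/* Returns true when the port may leave promiscuous mode, false when it
 * has to be (or stay) promiscuous.
 */
static bool port_may_clear_promisc(bool unicast_flt, int auto_cnt,
				   bool is_auto_port)
{
	return unicast_flt &&
	       (auto_cnt == 0 || (auto_cnt == 1 && is_auto_port));
}
```

A port without unicast filtering support never satisfies the predicate, so it
is put into promiscuous mode regardless of the number of auto ports.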

Reviewed-by: Ido Schimmel 

Tested using [1].

Before:

# ~/tmp/promisc_repo.sh 
0

After:

# ~/tmp/promisc_repo.sh 
1

[1]
#!/bin/bash

ip link add name swp1 type dummy
ip link add name br1 type bridge vlan_filtering 1
ip link set dev swp1 master br1
ip -d -j -p link show dev swp1 | jq '.[]["promiscuity"]'

Re: [Bridge] [PATCH net-next v2 2/3] bridge: Add a limit on learned FDB entries

2023-06-19 Thread Ido Schimmel via Bridge

On Mon, Jun 19, 2023 at 09:14:42AM +0200, Johannes Nixdorf wrote:
> A malicious actor behind one bridge port may spam the kernel with packets
> with a random source MAC address, each of which will create an FDB entry,
> each of which is a dynamic allocation in the kernel.
> 
> There are roughly 2^48 different MAC addresses, further limited by the
> rhashtable they are stored in to 2^31. Each entry is of the type struct
> net_bridge_fdb_entry, which is currently 128 bytes big. This means the
> maximum amount of memory allocated for FDB entries is 2^31 * 128B =
> 256GiB, which is too much for most computers.
> 
> Mitigate this by adding a bridge netlink setting
> IFLA_BR_FDB_MAX_LEARNED_ENTRIES, which, if nonzero, limits the amount
> of learned entries to a user specified maximum.
> 
> For backwards compatibility the default setting of 0 disables the limit.
> 
> User-added entries by netlink or from bridge or bridge port addresses
> are never blocked and do not count towards that limit.
> 
> All changes to fdb_n_entries are under br->hash_lock, which means we do
> not need additional locking. The call paths are (✓ denotes that
> br->hash_lock is taken around the next call):
> 
>  - fdb_delete <-+- fdb_delete_local <-+- br_fdb_changeaddr ✓
> | +- br_fdb_change_mac_address ✓
> | +- br_fdb_delete_by_port ✓
> +- br_fdb_find_delete_local ✓
> +- fdb_add_local <-+- br_fdb_changeaddr ✓
> |  +- br_fdb_change_mac_address ✓
> |  +- br_fdb_add_local ✓
> +- br_fdb_cleanup ✓
> +- br_fdb_flush ✓
> +- br_fdb_delete_by_port ✓
> +- fdb_delete_by_addr_and_port <--- __br_fdb_delete ✓
> +- br_fdb_external_learn_del ✓
>  - fdb_create <-+- fdb_add_local <-+- br_fdb_changeaddr ✓
> |  +- br_fdb_change_mac_address ✓
> |  +- br_fdb_add_local ✓
> +- br_fdb_update ✓
> +- fdb_add_entry <--- __br_fdb_add ✓
> +- br_fdb_external_learn_add ✓
> 
> The flags that imply an entry does not come from learning
> (BR_FDB_NOT_LEARNED_MASK) are now only set or cleared under br->hash_lock
> as well, and when the boolean value of (fdb->flags &
> BR_FDB_NOT_LEARNED_MASK) changes the accounting is updated.
> 
> This introduces one additional locked update in br_fdb_update if
> BR_FDB_ADDED_BY_USER was set. This is only the case when creating a new
> entry via netlink, and never in the packet handling fast path.
> 
> Signed-off-by: Johannes Nixdorf 
> 
> ---
> 
> Changes since v1:
>  - Do not initialize fdb_*_entries to 0. (from review)
>  - Do not skip decrementing on 0. (from review)
>  - Moved the counters to a conditional hole in struct net_bridge to
>avoid growing the struct. (from review, it still grows the struct as
>there are 2 32-bit values)
>  - Add IFLA_BR_FDB_CUR_LEARNED_ENTRIES (from review)
>  - Fix br_get_size()
>  - Only limit learned entries, rename to
>*_(CUR|MAX)_LEARNED_ENTRIES. (from review)
> 
> Obsolete v1 review comments:
>  - Return better errors to users: Due to limiting the limit to
>automatically created entries, netlink fdb add requests and changing
>bridge ports are never rejected, so they do not yet need a more
>friendly error returned.
> 
>  include/uapi/linux/if_link.h |  2 ++
>  net/bridge/br_fdb.c  | 67 +---
>  net/bridge/br_netlink.c  | 13 ++-
>  net/bridge/br_private.h  |  6 

To minimize the number of changes per patch and make review easier, try
to first maintain the count and the maximum and then in a separate patch
expose them via netlink. See b57e8d870d52 and a1aee20d5db2, for example.
Merge commit is cb3086cee656.

>  4 files changed, 83 insertions(+), 5 deletions(-)
> 
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index 4ac1000b0ef2..165b9014379b 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -510,6 +510,8 @@ enum {
>   IFLA_BR_VLAN_STATS_PER_PORT,
>   IFLA_BR_MULTI_BOOLOPT,
>   IFLA_BR_MCAST_QUERIER_STATE,
> + IFLA_BR_FDB_CUR_LEARNED_ENTRIES,
> + IFLA_BR_FDB_MAX_LEARNED_ENTRIES,
>   __IFLA_BR_MAX,
>  };
>  
> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
> index ac1dc8723b9c..bc61d1fd5fcf 100644
> --- a/net/bridge/br_fdb.c
> +++ b/net/bridge/br_fdb.c
> @@ -301,6 +301,38 @@ static void fdb_add_hw_addr(struct net_bridge *br, const 
> unsigned char *addr)
>   }
>  }
>  
> +/* Set a FDB flag that implies the entry was not learned, and account
> + * for changes in the learned status.
> + */
> +static void __fdb_set_flag_not_learned(struct net_bridge *br,
> +struct net_bridge_fdb_entry *fdb,
> +long nr)
> +{
>

Re: [Bridge] [PATCH net-next v2 1/3] bridge: Set BR_FDB_ADDED_BY_USER early in fdb_add_entry

2023-06-19 Thread Ido Schimmel via Bridge

On Mon, Jun 19, 2023 at 09:14:41AM +0200, Johannes Nixdorf wrote:
> This allows the called fdb_create to detect that the entry was added by
> the user early in the process. This is in preparation to adding limits
> in fdb_create that should not apply to user created fdb entries.

Use imperative mood:
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes

> 
> Signed-off-by: Johannes Nixdorf 
> 

Remove the blank line

> ---
> 
> Changes since v1:
>  - Added this change to ensure user added entries are not limited.
> 
>  net/bridge/br_fdb.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
> index e69a872bfc1d..ac1dc8723b9c 100644
> --- a/net/bridge/br_fdb.c
> +++ b/net/bridge/br_fdb.c
> @@ -1056,7 +1056,7 @@ static int fdb_add_entry(struct net_bridge *br, struct 
> net_bridge_port *source,
>   if (!(flags & NLM_F_CREATE))
>   return -ENOENT;
>  
> - fdb = fdb_create(br, source, addr, vid, 0);
> + fdb = fdb_create(br, source, addr, vid, BR_FDB_ADDED_BY_USER);

BIT(BR_FDB_ADDED_BY_USER)
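
To spell out why: BR_FDB_ADDED_BY_USER is used as a bit number (see the
set_bit() call in the hunk below), so assuming fdb_create() takes a flags
mask, passing the bare value sets the wrong bits. A standalone sketch, where
the FLAG_* names and their order are made up for illustration:

```c
#include <stdbool.h>

/* Hypothetical bit numbers, in the style of an FDB flags enum. */
enum {
	FLAG_LOCAL,         /* bit 0 */
	FLAG_STATIC,        /* bit 1 */
	FLAG_STICKY,        /* bit 2 */
	FLAG_ADDED_BY_USER, /* bit 3 */
};

#define BIT(nr) (1UL << (nr))

/* Test whether bit number 'nr' is set in the mask 'flags'. */
static bool flag_is_set(unsigned long flags, int nr)
{
	return (flags & BIT(nr)) != 0;
}
```

With this layout the bare value FLAG_ADDED_BY_USER is 3, which as a mask means
"local and static" and leaves bit 3 clear, while BIT(FLAG_ADDED_BY_USER) sets
exactly the intended bit.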

>   if (!fdb)
>   return -ENOMEM;
>  
> @@ -1069,6 +1069,8 @@ static int fdb_add_entry(struct net_bridge *br, struct 
> net_bridge_port *source,
>   WRITE_ONCE(fdb->dst, source);
>   modified = true;
>   }
> +
> + set_bit(BR_FDB_ADDED_BY_USER, &fdb->flags);
>   }
>  
>   if (fdb_to_nud(br, fdb) != state) {
> @@ -1100,8 +1102,6 @@ static int fdb_add_entry(struct net_bridge *br, struct 
> net_bridge_port *source,
>   if (fdb_handle_notify(fdb, notify))
>   modified = true;
>  
> - set_bit(BR_FDB_ADDED_BY_USER, &fdb->flags);
> -
>   fdb->used = jiffies;
>   if (modified) {
>   if (refresh)
> -- 
> 2.40.1
>

Re: [Bridge] [PATCH iproute2-next 1/1] iplink: bridge: Add support for bridge FDB learning limits

2023-06-19 Thread Ido Schimmel via Bridge

Please see the following link regarding posting of iproute2 patches:

https://www.kernel.org/doc/html/latest/process/maintainer-netdev.html#co-posting-changes-to-user-space-components

On Mon, Jun 19, 2023 at 09:14:44AM +0200, Johannes Nixdorf wrote:
> Support setting the FDB limit through ip link. The arguments is:
>  - fdb_max_learned_entries: A 32-bit unsigned integer specifying the
> maximum number of learned FDB entries, with 0
> disabling the limit.
> 
> Also support reading back the current number of learned FDB entries in
> the bridge by this count. The returned value's name is:
>  - fdb_cur_learned_entries: A 32-bit unsigned integer specifying the
>  current number of learned FDB entries.

MDB has "mcast_n_groups" and "mcast_max_groups". Maybe use
"fdb_n_learned_entries" to be consistent?

> 
> Example:
> 
>  # ip -d -j -p link show br0
> [ {
> ...
> "linkinfo": {
> "info_kind": "bridge",
> "info_data": {
> ...
> "fdb_cur_learned_entries": 2,
> "fdb_max_learned_entries": 0,
> ...
> }
> },
> ...
> } ]
>  # ip link set br0 type bridge fdb_max_learned_entries 1024
>  # ip -d -j -p link show br0
> [ {
> ...
> "linkinfo": {
> "info_kind": "bridge",
> "info_data": {
> ...
> "fdb_cur_learned_entries": 2,
> "fdb_max_learned_entries": 1024,
> ...
> }
> },
> ...
> } ]
> 
> Signed-off-by: Johannes Nixdorf 
> ---
>  include/uapi/linux/if_link.h |  2 ++
>  ip/iplink_bridge.c   | 21 +
>  man/man8/ip-link.8.in|  9 +
>  3 files changed, 32 insertions(+)
> 
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index 94fb7ef9e226..5ad1e2727e0d 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -508,6 +508,8 @@ enum {
>   IFLA_BR_VLAN_STATS_PER_PORT,
>   IFLA_BR_MULTI_BOOLOPT,
>   IFLA_BR_MCAST_QUERIER_STATE,
> + IFLA_BR_FDB_CUR_LEARNED_ENTRIES,
> + IFLA_BR_FDB_MAX_LEARNED_ENTRIES,
>   __IFLA_BR_MAX,
>  };
>  
> diff --git a/ip/iplink_bridge.c b/ip/iplink_bridge.c
> index 7e4e62c81c0c..68ed3c251945 100644
> --- a/ip/iplink_bridge.c
> +++ b/ip/iplink_bridge.c
> @@ -34,6 +34,7 @@ static void print_explain(FILE *f)
>   " [ group_fwd_mask MASK ]\n"
>   " [ group_address ADDRESS ]\n"
>   " [ no_linklocal_learn NO_LINKLOCAL_LEARN ]\n"
> + " [ fdb_max_learned_entries 
> FDB_MAX_LEARNED_ENTRIES ]\n"
>   " [ vlan_filtering VLAN_FILTERING ]\n"
>   " [ vlan_protocol VLAN_PROTOCOL ]\n"
>   " [ vlan_default_pvid VLAN_DEFAULT_PVID ]\n"
> @@ -168,6 +169,14 @@ static int bridge_parse_opt(struct link_util *lu, int 
> argc, char **argv,
>   bm.optval |= no_ll_learn_bit;
>   else
>   bm.optval &= ~no_ll_learn_bit;
> + } else if (matches(*argv, "fdb_max_learned_entries") == 0) {

New code is expected to use strcmp() instead of matches().
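
The difference in behavior, as a standalone sketch. prefix_matches() below is
a simplified stand-in that I wrote for illustration, not the iproute2
implementation of matches(); it returns 0 when the user argument is a
non-empty prefix of the keyword.

```c
#include <stddef.h>
#include <string.h>

/* Simplified model of prefix matching: 0 on match, non-zero otherwise. */
static int prefix_matches(const char *user_arg, const char *keyword)
{
	size_t len = strlen(user_arg);

	if (len == 0)
		return 1;
	return strncmp(user_arg, keyword, len) != 0;
}
```

With prefix matching, "fdb" is accepted for both "fdb_flush" and
"fdb_max_learned_entries", so the result depends on the order of the checks.
strcmp() only accepts the full keyword, which keeps the door open for adding
new options without changing how existing abbreviations are parsed.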

> + __u32 fdb_max_learned_entries;
> +
> + NEXT_ARG();
> + if (get_u32(&fdb_max_learned_entries, *argv, 0))
> + invarg("invalid fdb_max_learned_entries", 
> *argv);
> +
> + addattr32(n, 1024, IFLA_BR_FDB_MAX_LEARNED_ENTRIES, 
> fdb_max_learned_entries);
>   } else if (matches(*argv, "fdb_flush") == 0) {
>   addattr(n, 1024, IFLA_BR_FDB_FLUSH);
>   } else if (matches(*argv, "vlan_default_pvid") == 0) {
> @@ -544,6 +553,18 @@ static void bridge_print_opt(struct link_util *lu, FILE 
> *f, struct rtattr *tb[])
>   if (tb[IFLA_BR_GC_TIMER])
>   _bridge_print_timer(f, "gc_timer", tb[IFLA_BR_GC_TIMER]);
>  
> + if (tb[IFLA_BR_FDB_CUR_LEARNED_ENTRIES])
> + print_uint(PRINT_ANY,
> +"fdb_cur_learned_entries",
> +"fdb_cur_learned_entries %u ",
> +
> rta_getattr_u32(tb[IFLA_BR_FDB_CUR_LEARNED_ENTRIES]));
> +
> + if (tb[IFLA_BR_FDB_MAX_LEARNED_ENTRIES])
> + print_uint(PRINT_ANY,
> +"fdb_max_learned_entries",
> +"fdb_max_learned_entries %u ",
> +
> rta_getattr_u32(tb[IFLA_BR_FDB_MAX_LEARNED_ENTRIES]));
> +
>   if (tb[IFLA_BR_VLAN_DEFAULT_PVID])
>   print_uint(PRINT_ANY,
>  "vlan_default_pvid",
> diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
> index bf3605a9fa2e..a29595858a51 100644
> --- a/man/man8/ip-link.8.in
> +++ b/man/man8/ip-link.8.in
> @@ -1620,6 +1620,8 @@ the

[Bridge] [PATCH net-next v2 8/8] selftests: forwarding: Add layer 2 miss test cases

2023-05-29 Thread Ido Schimmel via Bridge

Add test cases to verify that the bridge driver correctly marks layer 2
misses only when it should and that the flower classifier can match on
this metadata.

Example output:

 # ./tc_flower_l2_miss.sh
 TEST: L2 miss - Unicast [ OK ]
 TEST: L2 miss - Multicast (IPv4)[ OK ]
 TEST: L2 miss - Multicast (IPv6)[ OK ]
 TEST: L2 miss - Link-local multicast (IPv4) [ OK ]
 TEST: L2 miss - Link-local multicast (IPv6) [ OK ]
 TEST: L2 miss - Broadcast   [ OK ]

Signed-off-by: Ido Schimmel 
---

Notes:
v2:
* Test that broadcast does not hit miss filter.

 .../testing/selftests/net/forwarding/Makefile |   1 +
 .../net/forwarding/tc_flower_l2_miss.sh   | 350 ++
 2 files changed, 351 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh

diff --git a/tools/testing/selftests/net/forwarding/Makefile b/tools/testing/selftests/net/forwarding/Makefile
index a474c60fe348..9d0062b542e5 100644
--- a/tools/testing/selftests/net/forwarding/Makefile
+++ b/tools/testing/selftests/net/forwarding/Makefile
@@ -83,6 +83,7 @@ TEST_PROGS = bridge_igmp.sh \
tc_chains.sh \
tc_flower_router.sh \
tc_flower.sh \
+   tc_flower_l2_miss.sh \
tc_mpls_l2vpn.sh \
tc_police.sh \
tc_shblocks.sh \
diff --git a/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh b/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh
new file mode 100755
index 000000000000..37b0369b5246
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh
@@ -0,0 +1,350 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
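+# +-----------------------+                             +----------------------+
+# | H1 (vrf)              |                             | H2 (vrf)             |
+# |    + $h1              |                             |              $h2 +   |
+# |    | 192.0.2.1/28     |                             | 192.0.2.2/28     |   |
+# |    | 2001:db8:1::1/64 |                             | 2001:db8:1::2/64 |   |
+# +----|------------------+                             +------------------|---+
+#      |                                                                   |
+# +----|-------------------------------------------------------------------|---+
+# | SW |                                                                   |   |
+# |  +-|-------------------------------------------------------------------|-+ |
+# |  | + $swp1                       BR                              $swp2 + | |
+# |  +-----------------------------------------------------------------------+ |
+# +----------------------------------------------------------------------------+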
+
+ALL_TESTS="
+   test_l2_miss_unicast
+   test_l2_miss_multicast
+   test_l2_miss_ll_multicast
+   test_l2_miss_broadcast
+"
+
+NUM_NETIFS=4
+source lib.sh
+source tc_common.sh
+
+h1_create()
+{
+   simple_if_init $h1 192.0.2.1/28 2001:db8:1::1/64
+}
+
+h1_destroy()
+{
+   simple_if_fini $h1 192.0.2.1/28 2001:db8:1::1/64
+}
+
+h2_create()
+{
+   simple_if_init $h2 192.0.2.2/28 2001:db8:1::2/64
+}
+
+h2_destroy()
+{
+   simple_if_fini $h2 192.0.2.2/28 2001:db8:1::2/64
+}
+
+switch_create()
+{
+   ip link add name br1 up type bridge
+   ip link set dev $swp1 master br1
+   ip link set dev $swp1 up
+   ip link set dev $swp2 master br1
+   ip link set dev $swp2 up
+
+   tc qdisc add dev $swp2 clsact
+}
+
+switch_destroy()
+{
+   tc qdisc del dev $swp2 clsact
+
+   ip link set dev $swp2 down
+   ip link set dev $swp2 nomaster
+   ip link set dev $swp1 down
+   ip link set dev $swp1 nomaster
+   ip link del dev br1
+}
+
+test_l2_miss_unicast()
+{
+   local dmac=00:01:02:03:04:05
+   local dip=192.0.2.2
+   local sip=192.0.2.1
+
+   RET=0
+
+   # Unknown unicast.
+   tc filter add dev $swp2 egress protocol ipv4 handle 101 pref 1 \
+  flower indev $swp1 l2_miss true dst_mac $dmac src_ip $sip \
+  dst_ip $dip action pass
+   # Known unicast.
+   tc filter add dev $swp2 egress protocol ipv4 handle 102 pref 1 \
+  flower indev $swp1 l2_miss false dst_mac $dmac src_ip $sip \
+  dst_ip $dip action pass
+
+   # Before adding FDB entry.
+   $MZ $h1 -a own -b $dmac -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+   tc_check_packets "dev $swp2 egress" 101 1
+   check_err $? "Unknown unicast filter was not hit before adding FDB entry"
+
+   tc_check_packets "dev $swp2 egress" 102 0
+   check_err $? "Known unicast filter was hit before adding FDB entry"
+
+   # Adding FDB entry.
+   bridge fdb replace $dmac dev $swp2 master static
+
+   $MZ $h1 -a

[Bridge] [PATCH net-next v2 7/8] mlxsw: spectrum_flower: Add ability to match on layer 2 miss

2023-05-29 Thread Ido Schimmel via Bridge

Add the 'fdb_miss' key element to supported key blocks and make use of
it to match on layer 2 miss.

The key is only supported on Spectrum-{2,3,4}. An error is returned for
Spectrum-1 since the key element is not present in any of its key
blocks.

Signed-off-by: Ido Schimmel 
---

Notes:
v2:
* Use 'fdb_miss' key element instead of 'dmac_type'.

 drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c| 1 +
 drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h| 3 ++-
 .../net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c| 2 ++
 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c   | 6 ++
 4 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
index bd1a51a0a540..f0b2963ebac3 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
@@ -42,6 +42,7 @@ static const struct mlxsw_afk_element_info 
mlxsw_afk_element_infos[] = {
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_64_95, 0x34, 4),
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_32_63, 0x38, 4),
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_0_31, 0x3C, 4),
+   MLXSW_AFK_ELEMENT_INFO_U32(FDB_MISS, 0x40, 0, 1),
 };
 
 struct mlxsw_afk {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h 
b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
index 3a037fe47211..65a4abadc7db 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
@@ -35,6 +35,7 @@ enum mlxsw_afk_element {
MLXSW_AFK_ELEMENT_IP_DSCP,
MLXSW_AFK_ELEMENT_VIRT_ROUTER_MSB,
MLXSW_AFK_ELEMENT_VIRT_ROUTER_LSB,
+   MLXSW_AFK_ELEMENT_FDB_MISS,
MLXSW_AFK_ELEMENT_MAX,
 };
 
@@ -69,7 +70,7 @@ struct mlxsw_afk_element_info {
MLXSW_AFK_ELEMENT_INFO(MLXSW_AFK_ELEMENT_TYPE_BUF,  
\
   _element, _offset, 0, _size)
 
-#define MLXSW_AFK_ELEMENT_STORAGE_SIZE 0x40
+#define MLXSW_AFK_ELEMENT_STORAGE_SIZE 0x44
 
 struct mlxsw_afk_element_inst { /* element instance in actual block */
enum mlxsw_afk_element element;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
index 00c32320f891..4dea39f2b304 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
@@ -123,10 +123,12 @@ const struct mlxsw_afk_ops mlxsw_sp1_afk_ops = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_mac_0[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(FDB_MISS, 0x00, 3, 1),
MLXSW_AFK_ELEMENT_INST_BUF(DMAC_0_31, 0x04, 4),
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_mac_1[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(FDB_MISS, 0x00, 3, 1),
MLXSW_AFK_ELEMENT_INST_BUF(SMAC_0_31, 0x04, 4),
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index 9c62c12e410b..72917f09e806 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -336,10 +336,8 @@ static int mlxsw_sp_flower_parse_meta(struct 
mlxsw_sp_acl_rule_info *rulei,
 
flow_rule_match_meta(rule, &match);
 
-   if (match.mask->l2_miss) {
-   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on 
\"l2_miss\"");
-   return -EOPNOTSUPP;
-   }
+   mlxsw_sp_acl_rulei_keymask_u32(rulei, MLXSW_AFK_ELEMENT_FDB_MISS,
+  match.key->l2_miss, match.mask->l2_miss);
 
return mlxsw_sp_flower_parse_meta_iif(rulei, block, &match,
  f->common.extack);
-- 
2.40.1

[Bridge] [PATCH net-next v2 6/8] mlxsw: spectrum_flower: Do not force matching on iif

2023-05-29 Thread Ido Schimmel via Bridge

Currently, mlxsw only supports the 'ingress_ifindex' field in the
'FLOW_DISSECTOR_KEY_META' key, but subsequent patches are going to add
support for the 'l2_miss' field as well. It is valid to only match on
'l2_miss' without 'ingress_ifindex', so do not force matching on it.
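
The resulting mask handling can be sketched in userspace as follows. This is only a model under the assumption that an all-ones mask is the one supported exact match; `iif_action` and `classify_iif_mask()` are hypothetical names that do not exist in the driver:

```c
#include <errno.h>
#include <stdint.h>

enum iif_action {
	IIF_SKIP,	/* mask is zero: the field was not requested */
	IIF_MATCH,	/* exact match on the ingress ifindex */
};

/* Returns an iif_action on success or -EINVAL for a partial mask. */
static int classify_iif_mask(uint32_t mask)
{
	if (!mask)
		return IIF_SKIP;
	if (mask != UINT32_MAX)
		return -EINVAL;
	return IIF_MATCH;
}
```

The zero-mask case is the new one: a filter that only sets 'l2_miss' falls into it and is no longer rejected for lacking an ingress ifindex.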

Signed-off-by: Ido Schimmel 
---

Notes:
v2:
* New patch.

 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index 2b0bae847eb9..9c62c12e410b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -290,6 +290,9 @@ mlxsw_sp_flower_parse_meta_iif(struct 
mlxsw_sp_acl_rule_info *rulei,
struct mlxsw_sp_port *mlxsw_sp_port;
struct net_device *ingress_dev;
 
+   if (!match->mask->ingress_ifindex)
+   return 0;
+
if (match->mask->ingress_ifindex != 0xFFFFFFFF) {
NL_SET_ERR_MSG_MOD(extack, "Unsupported ingress ifindex mask");
return -EINVAL;
-- 
2.40.1

[Bridge] [PATCH net-next v2 5/8] mlxsw: spectrum_flower: Split iif parsing to a separate function

2023-05-29 Thread Ido Schimmel via Bridge

Currently, mlxsw only supports the 'ingress_ifindex' field in the
'FLOW_DISSECTOR_KEY_META' key, but subsequent patches are going to add
support for the 'l2_miss' field as well. Split the parsing of the
'ingress_ifindex' field to a separate function to avoid nesting. No
functional changes intended.

Signed-off-by: Ido Schimmel 
---

Notes:
v2:
* New patch.

 .../ethernet/mellanox/mlxsw/spectrum_flower.c | 54 +++
 1 file changed, 33 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index 6fec9223250b..2b0bae847eb9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -281,45 +281,35 @@ static int mlxsw_sp_flower_parse_actions(struct mlxsw_sp 
*mlxsw_sp,
return 0;
 }
 
-static int mlxsw_sp_flower_parse_meta(struct mlxsw_sp_acl_rule_info *rulei,
- struct flow_cls_offload *f,
- struct mlxsw_sp_flow_block *block)
+static int
+mlxsw_sp_flower_parse_meta_iif(struct mlxsw_sp_acl_rule_info *rulei,
+  const struct mlxsw_sp_flow_block *block,
+  const struct flow_match_meta *match,
+  struct netlink_ext_ack *extack)
 {
-   struct flow_rule *rule = flow_cls_offload_flow_rule(f);
struct mlxsw_sp_port *mlxsw_sp_port;
struct net_device *ingress_dev;
-   struct flow_match_meta match;
-
-   if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META))
-   return 0;
-
-   flow_rule_match_meta(rule, &match);
 
-   if (match.mask->l2_miss) {
-   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on 
\"l2_miss\"");
-   return -EOPNOTSUPP;
-   }
-
-   if (match.mask->ingress_ifindex != 0xFFFFFFFF) {
-   NL_SET_ERR_MSG_MOD(f->common.extack, "Unsupported ingress 
ifindex mask");
+   if (match->mask->ingress_ifindex != 0xFFFFFFFF) {
+   NL_SET_ERR_MSG_MOD(extack, "Unsupported ingress ifindex mask");
return -EINVAL;
}
 
ingress_dev = __dev_get_by_index(block->net,
-match.key->ingress_ifindex);
+match->key->ingress_ifindex);
if (!ingress_dev) {
-   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't find specified 
ingress port to match on");
+   NL_SET_ERR_MSG_MOD(extack, "Can't find specified ingress port 
to match on");
return -EINVAL;
}
 
if (!mlxsw_sp_port_dev_check(ingress_dev)) {
-   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on non-mlxsw 
ingress port");
+   NL_SET_ERR_MSG_MOD(extack, "Can't match on non-mlxsw ingress 
port");
return -EINVAL;
}
 
mlxsw_sp_port = netdev_priv(ingress_dev);
if (mlxsw_sp_port->mlxsw_sp != block->mlxsw_sp) {
-   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on a port 
from different device");
+   NL_SET_ERR_MSG_MOD(extack, "Can't match on a port from 
different device");
return -EINVAL;
}
 
@@ -327,9 +317,31 @@ static int mlxsw_sp_flower_parse_meta(struct 
mlxsw_sp_acl_rule_info *rulei,
   MLXSW_AFK_ELEMENT_SRC_SYS_PORT,
   mlxsw_sp_port->local_port,
   0xFFFFFFFF);
+
return 0;
 }
 
+static int mlxsw_sp_flower_parse_meta(struct mlxsw_sp_acl_rule_info *rulei,
+ struct flow_cls_offload *f,
+ struct mlxsw_sp_flow_block *block)
+{
+   struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+   struct flow_match_meta match;
+
+   if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META))
+   return 0;
+
+   flow_rule_match_meta(rule, &match);
+
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on 
\"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+
+   return mlxsw_sp_flower_parse_meta_iif(rulei, block, &match,
+ f->common.extack);
+}
+
 static void mlxsw_sp_flower_parse_ipv4(struct mlxsw_sp_acl_rule_info *rulei,
   struct flow_cls_offload *f)
 {
-- 
2.40.1

[Bridge] [PATCH net-next v2 4/8] flow_offload: Reject matching on layer 2 miss

2023-05-29 Thread Ido Schimmel via Bridge

Adjust drivers that support the 'FLOW_DISSECTOR_KEY_META' key to reject
filters that try to match on the newly added layer 2 miss field. Add an
extack message to clearly communicate the failure reason to user space.

The following users were not patched:

1. mtk_flow_offload_replace(): Only checks that the key is present, but
   does not do anything with it.
2. mlx5_tc_ct_set_tuple_match(): Used as part of netfilter offload,
   which does not make use of the new field, unlike tc.
3. get_netdev_from_rule() in nfp: Likewise.

Example:

 # tc filter add dev swp1 egress pref 1 proto all flower skip_sw l2_miss true action drop
 Error: mlxsw_spectrum: Can't match on "l2_miss".
 We have an error talking to the kernel

Acked-by: Elad Nachman 
Signed-off-by: Ido Schimmel 
---

Notes:
v2:
* Expand commit message to explain why some users were not patched.

 .../net/ethernet/marvell/prestera/prestera_flower.c|  6 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|  6 ++
 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  6 ++
 drivers/net/ethernet/mscc/ocelot_flower.c  | 10 ++
 4 files changed, 28 insertions(+)

diff --git a/drivers/net/ethernet/marvell/prestera/prestera_flower.c 
b/drivers/net/ethernet/marvell/prestera/prestera_flower.c
index 91a478b75cbf..3e20e71b0f81 100644
--- a/drivers/net/ethernet/marvell/prestera/prestera_flower.c
+++ b/drivers/net/ethernet/marvell/prestera/prestera_flower.c
@@ -148,6 +148,12 @@ static int prestera_flower_parse_meta(struct 
prestera_acl_rule *rule,
__be16 key, mask;
 
flow_rule_match_meta(f_rule, &match);
+
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on 
\"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+
if (match.mask->ingress_ifindex != 0xFFFFFFFF) {
NL_SET_ERR_MSG_MOD(f->common.extack,
   "Unsupported ingress ifindex mask");
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index e95414ef1f04..1b0906cb57ef 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2587,6 +2587,12 @@ static int mlx5e_flower_parse_meta(struct net_device 
*filter_dev,
return 0;
 
flow_rule_match_meta(rule, &match);
+
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on 
\"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+
if (!match.mask->ingress_ifindex)
return 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index 594cdcb90b3d..6fec9223250b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -294,6 +294,12 @@ static int mlxsw_sp_flower_parse_meta(struct 
mlxsw_sp_acl_rule_info *rulei,
return 0;
 
flow_rule_match_meta(rule, &match);
+
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on 
\"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+
if (match.mask->ingress_ifindex != 0xFFFFFFFF) {
NL_SET_ERR_MSG_MOD(f->common.extack, "Unsupported ingress 
ifindex mask");
return -EINVAL;
diff --git a/drivers/net/ethernet/mscc/ocelot_flower.c 
b/drivers/net/ethernet/mscc/ocelot_flower.c
index ee052404eb55..e0916afcddfb 100644
--- a/drivers/net/ethernet/mscc/ocelot_flower.c
+++ b/drivers/net/ethernet/mscc/ocelot_flower.c
@@ -592,6 +592,16 @@ ocelot_flower_parse_key(struct ocelot *ocelot, int port, 
bool ingress,
return -EOPNOTSUPP;
}
 
+   if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META)) {
+   struct flow_match_meta match;
+
+   flow_rule_match_meta(rule, &match);
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(extack, "Can't match on 
\"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+   }
+
/* For VCAP ES0 (egress rewriter) we can match on the ingress port */
if (!ingress) {
ret = ocelot_flower_parse_indev(ocelot, port, f, filter);
-- 
2.40.1

[Bridge] [PATCH net-next v2 3/8] net/sched: flower: Allow matching on layer 2 miss

2023-05-29 Thread Ido Schimmel via Bridge

Add the 'TCA_FLOWER_L2_MISS' netlink attribute that allows user space to
match on packets that encountered a layer 2 miss. The miss indication is
set as metadata in the tc skb extension by the bridge driver upon FDB or
MDB lookup miss and dissected by the flow dissector to the
'FLOW_DISSECTOR_KEY_META' key.

The use of this skb extension is guarded by the 'tc_skb_ext_tc' static
key. As such, enable / disable this key when filters that match on layer
2 miss are added / deleted.
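
The reference counting described here can be modelled in userspace roughly as below. This is a sketch only: `ext_refcount` stands in for the static key's count, the `filter_*()` helpers are hypothetical, and the order in which a replace takes and drops references is an assumption.

```c
#include <stdbool.h>

/* Stand-in for the static key's reference count. */
static int ext_refcount;

struct filter {
	bool needs_ext;	/* the filter's mask includes l2_miss */
};

/* Adding a filter that matches on l2_miss takes a reference. */
static void filter_init(struct filter *f, bool matches_l2_miss)
{
	f->needs_ext = matches_l2_miss;
	if (f->needs_ext)
		ext_refcount++;
}

/* Destroying such a filter drops the reference again. */
static void filter_destroy(struct filter *f)
{
	if (f->needs_ext)
		ext_refcount--;
	f->needs_ext = false;
}

/* Replace: the replacement takes its reference before the old one drops. */
static void filter_replace(struct filter *old, struct filter *repl,
			   bool matches_l2_miss)
{
	filter_init(repl, matches_l2_miss);
	filter_destroy(old);
}
```

Note that both 'l2_miss true' and 'l2_miss false' count as matching on the field, since either one sets the mask. Adding one filter without the match and two with it therefore leaves the count at 2, replacing one of the two keeps it at 2, and deleting all filters returns it to 0.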

Tested:

 # cat tc_skb_ext_tc.py
 #!/usr/bin/env -S drgn -s vmlinux

 refcount = prog["tc_skb_ext_tc"].key.enabled.counter.value_()
 print(f"tc_skb_ext_tc reference count is {refcount}")

 # ./tc_skb_ext_tc.py
 tc_skb_ext_tc reference count is 0

 # tc filter add dev swp1 egress proto all handle 101 pref 1 flower src_mac 00:11:22:33:44:55 action drop
 # tc filter add dev swp1 egress proto all handle 102 pref 2 flower src_mac 00:11:22:33:44:55 l2_miss true action drop
 # tc filter add dev swp1 egress proto all handle 103 pref 3 flower src_mac 00:11:22:33:44:55 l2_miss false action drop

 # ./tc_skb_ext_tc.py
 tc_skb_ext_tc reference count is 2

 # tc filter replace dev swp1 egress proto all handle 102 pref 2 flower src_mac 00:01:02:03:04:05 l2_miss false action drop

 # ./tc_skb_ext_tc.py
 tc_skb_ext_tc reference count is 2

 # tc filter del dev swp1 egress proto all handle 103 pref 3 flower
 # tc filter del dev swp1 egress proto all handle 102 pref 2 flower
 # tc filter del dev swp1 egress proto all handle 101 pref 1 flower

 # ./tc_skb_ext_tc.py
 tc_skb_ext_tc reference count is 0

Signed-off-by: Ido Schimmel 
---

Notes:
v2:
* Split flow_dissector changes to a previous patch.
* Use tc skb extension instead of 'skb->l2_miss'.

 include/uapi/linux/pkt_cls.h |  2 ++
 net/sched/cls_flower.c   | 30 --
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 648a82f32666..00933dda7b10 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -594,6 +594,8 @@ enum {
 
TCA_FLOWER_KEY_L2TPV3_SID,  /* be32 */
 
+   TCA_FLOWER_L2_MISS, /* u8 */
+
__TCA_FLOWER_MAX,
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9dbc43388e57..04adcde9eb81 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -120,6 +120,7 @@ struct cls_fl_filter {
u32 handle;
u32 flags;
u32 in_hw_count;
+   u8 needs_tc_skb_ext:1;
struct rcu_work rwork;
struct net_device *hw_dev;
/* Flower classifier is unlocked, which means that its reference counter
@@ -415,6 +416,8 @@ static struct cls_fl_head *fl_head_dereference(struct 
tcf_proto *tp)
 
 static void __fl_destroy_filter(struct cls_fl_filter *f)
 {
+   if (f->needs_tc_skb_ext)
+   tc_skb_ext_tc_disable();
tcf_exts_destroy(&f->exts);
tcf_exts_put_net(&f->exts);
kfree(f);
@@ -615,7 +618,8 @@ static void *fl_get(struct tcf_proto *tp, u32 handle)
 }
 
 static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
-   [TCA_FLOWER_UNSPEC] = { .type = NLA_UNSPEC },
+   [TCA_FLOWER_UNSPEC] = { .strict_start_type =
+   TCA_FLOWER_L2_MISS },
[TCA_FLOWER_CLASSID]= { .type = NLA_U32 },
[TCA_FLOWER_INDEV]  = { .type = NLA_STRING,
.len = IFNAMSIZ },
@@ -720,7 +724,7 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 
1] = {
[TCA_FLOWER_KEY_PPPOE_SID]  = { .type = NLA_U16 },
[TCA_FLOWER_KEY_PPP_PROTO]  = { .type = NLA_U16 },
[TCA_FLOWER_KEY_L2TPV3_SID] = { .type = NLA_U32 },
-
+   [TCA_FLOWER_L2_MISS]= NLA_POLICY_MAX(NLA_U8, 1),
 };
 
 static const struct nla_policy
@@ -1668,6 +1672,10 @@ static int fl_set_key(struct net *net, struct nlattr 
**tb,
mask->meta.ingress_ifindex = 0xffffffff;
}
 
+   fl_set_key_val(tb, &key->meta.l2_miss, TCA_FLOWER_L2_MISS,
+  &mask->meta.l2_miss, TCA_FLOWER_UNSPEC,
+  sizeof(key->meta.l2_miss));
+
fl_set_key_val(tb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
   mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
   sizeof(key->eth.dst));
@@ -2085,6 +2093,11 @@ static int fl_check_assign_mask(struct cls_fl_head *head,
return ret;
 }
 
+static bool fl_needs_tc_skb_ext(const struct fl_flow_key *mask)
+{
+   return mask->meta.l2_miss;
+}
+
 static int fl_set_parms(struct net *net, struct tcf_proto *tp,
struct cls_fl_filter *f, struct fl_flow_mask *mask,
unsigned long base, struct nlattr **tb,
@@ -2

[Bridge] [PATCH net-next v2 2/8] flow_dissector: Dissect layer 2 miss from tc skb extension

2023-05-29 Thread Ido Schimmel via Bridge

Extend the 'FLOW_DISSECTOR_KEY_META' key with a new 'l2_miss' field and
populate it from a field with the same name in the tc skb extension.
This field is set by the bridge driver for packets that incur an FDB or
MDB miss.
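
A minimal userspace model of this dissection step, assuming simplified stand-in types (`ext_model`, `meta_model` and `dissect_meta()` are hypothetical and only mirror the shape of the change):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the tc skb extension. */
struct ext_model {
	uint8_t l2_miss;
};

/* Stand-in for the dissected metadata key. */
struct meta_model {
	int ingress_ifindex;
	uint8_t l2_miss;
};

static void dissect_meta(struct meta_model *meta, int iif,
			 bool key_enabled, const struct ext_model *ext)
{
	meta->ingress_ifindex = iif;
	/* Without the key or the extension, l2_miss keeps its initial value. */
	if (key_enabled && ext)
		meta->l2_miss = ext->l2_miss;
}
```

A packet without an extension, or one dissected while the static key is disabled, is therefore reported as not having missed, provided the caller zeroed the target beforehand.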

The next patch will extend the flower classifier to be able to match on
layer 2 misses.

Signed-off-by: Ido Schimmel 
---

Notes:
v2:
* Split from flower patch.
* Use tc skb extension instead of 'skb->l2_miss'.

 include/net/flow_dissector.h |  2 ++
 net/core/flow_dissector.c| 10 ++
 2 files changed, 12 insertions(+)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 85b2281576ed..8b41668c77fc 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -243,10 +243,12 @@ struct flow_dissector_key_ip {
  * struct flow_dissector_key_meta:
  * @ingress_ifindex: ingress ifindex
  * @ingress_iftype: ingress interface type
+ * @l2_miss: packet did not match an L2 entry during forwarding
  */
 struct flow_dissector_key_meta {
int ingress_ifindex;
u16 ingress_iftype;
+   u8 l2_miss;
 };
 
 /**
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 25fb0bbc310f..481ca4080cbd 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -241,6 +242,15 @@ void skb_flow_dissect_meta(const struct sk_buff *skb,
 FLOW_DISSECTOR_KEY_META,
 target_container);
meta->ingress_ifindex = skb->skb_iif;
+#if IS_ENABLED(CONFIG_NET_TC_SKB_EXT)
+   if (tc_skb_ext_tc_enabled()) {
+   struct tc_skb_ext *ext;
+
+   ext = skb_ext_find(skb, TC_SKB_EXT);
+   if (ext)
+   meta->l2_miss = ext->l2_miss;
+   }
+#endif
 }
 EXPORT_SYMBOL(skb_flow_dissect_meta);
 
-- 
2.40.1

[Bridge] [PATCH net-next v2 1/8] skbuff: bridge: Add layer 2 miss indication

2023-05-29 Thread Ido Schimmel via Bridge

For EVPN non-DF (Designated Forwarder) filtering we need to be able to
prevent decapsulated traffic from being flooded to a multi-homed host.
Filtering of multicast and broadcast traffic can be achieved using the
following flower filter:

 # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop
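
To illustrate why this key/mask pair covers both multicast and broadcast, here is a small userspace sketch of a masked MAC comparison (`MAC_LEN` and `mac_matches()` are hypothetical names, not kernel API):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAC_LEN 6

/* Compare an address against a key, looking only at the masked bits. */
static bool mac_matches(const uint8_t addr[MAC_LEN],
			const uint8_t key[MAC_LEN],
			const uint8_t mask[MAC_LEN])
{
	for (int i = 0; i < MAC_LEN; i++) {
		if ((addr[i] & mask[i]) != (key[i] & mask[i]))
			return false;
	}
	return true;
}
```

With key and mask both set to 01:00:00:00:00:00, only the group bit of the first octet is examined. Broadcast and multicast destinations have it set and match; any unicast destination, known or unknown, does not.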

Unlike broadcast and multicast traffic, it is not currently possible to
filter unknown unicast traffic. The classification into unknown unicast
is performed by the bridge driver, but is not visible to other layers
such as tc.

Solve this by adding a new 'l2_miss' bit to the tc skb extension. Clear
the bit whenever a packet enters the bridge (received from a bridge port
or transmitted via the bridge) and set it if the packet did not match an
FDB or MDB entry. If there is no skb extension and the bit needs to be
cleared, then do not allocate one as no extension is equivalent to the
bit being cleared. The bit is not set for broadcast packets as they
never perform a lookup and therefore never incur a miss.
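
The marking rules in this paragraph can be modelled in userspace along these lines. It is a sketch with stand-in types; `pkt_model`, `ext_model` and `miss_set()` are hypothetical, and `calloc()` stands in for allocating the extension:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct ext_model {
	bool l2_miss;
};

struct pkt_model {
	struct ext_model *ext;	/* NULL: no extension attached */
};

static void miss_set(struct pkt_model *pkt, bool key_enabled, bool miss)
{
	if (!key_enabled)
		return;		/* feature off: packet is left untouched */

	if (pkt->ext) {
		pkt->ext->l2_miss = miss;	/* update an existing extension */
		return;
	}

	if (!miss)
		return;		/* no extension already means "no miss" */

	pkt->ext = calloc(1, sizeof(*pkt->ext));
	if (pkt->ext)
		pkt->ext->l2_miss = true;
}
```

Clearing the bit on a packet without an extension is a no-op, so the common case of a packet entering the bridge allocates nothing; an allocation only happens on an actual miss.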

A bit that is set for every flooded packet would also work for the
current use case, but it does not allow us to differentiate between
registered and unregistered multicast traffic, which might be useful in
the future.

To keep the performance impact to a minimum, the marking of packets is
guarded by the 'tc_skb_ext_tc' static key. When 'false', the skb is not
touched and an skb extension is not allocated. Instead, only a
5-byte nop is executed, as demonstrated below for the call site in
br_handle_frame().

Before the patch:

```
memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
  c37b09:   49 c7 44 24 28 00 00    movq   $0x0,0x28(%r12)
  c37b10:   00 00

p = br_port_get_rcu(skb->dev);
  c37b12:   49 8b 44 24 10          mov    0x10(%r12),%rax
memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
  c37b17:   49 c7 44 24 30 00 00    movq   $0x0,0x30(%r12)
  c37b1e:   00 00
  c37b20:   49 c7 44 24 38 00 00    movq   $0x0,0x38(%r12)
  c37b27:   00 00
```

After the patch (when static key is disabled):

```
memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
  c37c29:   49 c7 44 24 28 00 00    movq   $0x0,0x28(%r12)
  c37c30:   00 00
  c37c32:   49 8d 44 24 28          lea    0x28(%r12),%rax
  c37c37:   48 c7 40 08 00 00 00    movq   $0x0,0x8(%rax)
  c37c3e:   00
  c37c3f:   48 c7 40 10 00 00 00    movq   $0x0,0x10(%rax)
  c37c46:   00

#ifdef CONFIG_HAVE_JUMP_LABEL_HACK

static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
{
asm_volatile_goto("1:"
  c37c47:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
br_tc_skb_miss_set(skb, false);

p = br_port_get_rcu(skb->dev);
  c37c4c:   49 8b 44 24 10          mov    0x10(%r12),%rax
```

Subsequent patches will extend the flower classifier to be able to match
on the new 'l2_miss' bit and enable / disable the static key when
filters that match on it are added / deleted.

Signed-off-by: Ido Schimmel 
---

Notes:
v2:
* Use tc skb extension instead of adding a bit to the skb.
* Do not mark broadcast packets as they never perform a lookup and
  therefore never incur a miss.

 include/linux/skbuff.h  |  1 +
 net/bridge/br_device.c  |  1 +
 net/bridge/br_forward.c |  3 +++
 net/bridge/br_input.c   |  1 +
 net/bridge/br_private.h | 27 +++
 5 files changed, 33 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5951904413ab..e2f48ddb2f7c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -330,6 +330,7 @@ struct tc_skb_ext {
u8 post_ct_snat:1;
u8 post_ct_dnat:1;
u8 act_miss:1; /* Set if act_miss_cookie is used */
+   u8 l2_miss:1; /* Set by bridge upon FDB or MDB miss */
 };
 #endif
 
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 8eca8a5c80c6..9a5ea06236bd 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -39,6 +39,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct 
net_device *dev)
u16 vid = 0;
 
memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
+   br_tc_skb_miss_set(skb, false);
 
rcu_read_lock();
nf_ops = rcu_dereference(nf_br_ops);
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 84d6dd5e5b1a..6116eba1bd89 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -203,6 +203,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
struct net_bridge_port *prev = NULL;
struct net_bridge_port *p;
 
+   br_tc_skb_miss_set(skb, pkt_type != BR_PKT_BROADCAST);
+
list_for_each_entry_rcu(p, &br->port_list, list) {
/* Do not flood unicast traffic to ports that turn it off, nor
 * other tra

[Bridge] [PATCH net-next v2 0/8] Add layer 2 miss indication and filtering

2023-05-29 Thread Ido Schimmel via Bridge

s patch. Use tc skb
  extension instead of 'skb->l2_miss'.

* Patch #4: Expand commit message to explain why some users were not
  patched.

* Patch #5: New patch.

* Patch #6: New patch.

* Patch #7: Use 'fdb_miss' key element instead of 'dmac_type'.

* Patch #8: Test that broadcast does not hit miss filter.

Since RFC [5]:

No changes.

[1] https://datatracker.ietf.org/doc/html/rfc7432#section-8.3
[2] https://datatracker.ietf.org/doc/html/rfc7432#section-8.5
[3] https://github.com/idosch/iproute2/tree/submit/non_df_filter_v1
[4] https://lore.kernel.org/netdev/20230518113328.1952135-1-ido...@nvidia.com/
[5] https://lore.kernel.org/netdev/20230509070446.246088-1-ido...@nvidia.com/

Ido Schimmel (8):
  skbuff: bridge: Add layer 2 miss indication
  flow_dissector: Dissect layer 2 miss from tc skb extension
  net/sched: flower: Allow matching on layer 2 miss
  flow_offload: Reject matching on layer 2 miss
  mlxsw: spectrum_flower: Split iif parsing to a separate function
  mlxsw: spectrum_flower: Do not force matching on iif
  mlxsw: spectrum_flower: Add ability to match on layer 2 miss
  selftests: forwarding: Add layer 2 miss test cases

 .../marvell/prestera/prestera_flower.c|   6 +
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |   6 +
 .../mellanox/mlxsw/core_acl_flex_keys.c   |   1 +
 .../mellanox/mlxsw/core_acl_flex_keys.h   |   3 +-
 .../mellanox/mlxsw/spectrum_acl_flex_keys.c   |   2 +
 .../ethernet/mellanox/mlxsw/spectrum_flower.c |  45 ++-
 drivers/net/ethernet/mscc/ocelot_flower.c |  10 +
 include/linux/skbuff.h|   1 +
 include/net/flow_dissector.h  |   2 +
 include/uapi/linux/pkt_cls.h  |   2 +
 net/bridge/br_device.c|   1 +
 net/bridge/br_forward.c   |   3 +
 net/bridge/br_input.c |   1 +
 net/bridge/br_private.h   |  27 ++
 net/core/flow_dissector.c |  10 +
 net/sched/cls_flower.c|  30 +-
 .../testing/selftests/net/forwarding/Makefile |   1 +
 .../net/forwarding/tc_flower_l2_miss.sh   | 350 ++
 18 files changed, 485 insertions(+), 16 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh

-- 
2.40.1

Re: [Bridge] [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication

2023-05-23 Thread Ido Schimmel via Bridge

On Tue, May 23, 2023 at 11:04:27AM +0200, Paolo Abeni wrote:
> I think you would only need to set/add the extension when l2_miss is
> true, right? (with no extension l2 hit is assumed). That will avoid
> unneeded overhead for br_dev_xmit().

If an extension is already present (possibly with 'l2_miss' being 'true'
because the packet was flooded by a different bridge earlier in the
pipeline), then we need to clear it when the packet enters the bridge.
IMO, this is quite unlikely. However, if the extension is missing, then
you are correct and there is no point in allocating one.

IOW, I can squash the following diff to the first patch:

diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index fb6525553a8a..32115d76a6de 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -764,10 +764,16 @@ static inline void br_tc_skb_miss_set(struct sk_buff 
*skb, bool miss)
return;

ext = skb_ext_find(skb, TC_SKB_EXT);
-   if (!ext)
-   ext = tc_skb_ext_alloc(skb);
-   if (ext)
+   if (ext) {
ext->l2_miss = miss;
+   return;
+   }
+   if (!miss)
+   return;
+   ext = tc_skb_ext_alloc(skb);
+   if (!ext)
+   return;
+   ext->l2_miss = miss;
 }
 #else
 static inline void br_tc_skb_miss_set(struct sk_buff *skb, bool miss)

Thanks

Re: [Bridge] [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication

2023-05-23 Thread Ido Schimmel via Bridge

On Fri, May 19, 2023 at 02:52:18PM -0700, Jakub Kicinski wrote:
> On Fri, 19 May 2023 16:51:48 +0300 Ido Schimmel wrote:
> > diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> > index fc17b9fd93e6..274e55455b15 100644
> > --- a/net/bridge/br_input.c
> > +++ b/net/bridge/br_input.c
> > @@ -46,6 +46,8 @@ static int br_pass_frame_up(struct sk_buff *skb)
> >  */
> > br_switchdev_frame_unmark(skb);
> >  
> > +   skb->l2_miss = BR_INPUT_SKB_CB(skb)->miss;
> > +
> > /* Bridge is just like any other port.  Make sure the
> >  * packet is allowed except in promisc mode when someone
> >  * may be running packet capture.
> > 
> > Ran these changes through the selftest and it seems to work.
> 
> Can we possibly put the new field at the end of the CB and then have TC
> look at it in the CB? We already do a bit of such CB juggling in strp
> (first member of struct sk_skb_cb).

Using the CB between different layers is very fragile and I would like
to avoid it. Note that the skb can pass various layers until hitting the
classifier, each of which can decide to memset() the CB.
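
A toy userspace model of the hazard, assuming a shared scratch area of arbitrary size (`pkt_model` and the `layer_*()` helpers are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Scratch area shared by every layer the packet passes through. */
struct pkt_model {
	uint8_t cb[48];
};

/* Layer A keeps a flag in the first byte of the scratch area. */
static void layer_a_mark(struct pkt_model *p)
{
	p->cb[0] = 1;
}

/* Layer B, unaware of layer A, resets the scratch area for its own use. */
static void layer_b_enter(struct pkt_model *p)
{
	memset(p->cb, 0, sizeof(p->cb));
}

/* Layer C tries to read back what layer A stored. */
static bool layer_c_sees_mark(const struct pkt_model *p)
{
	return p->cb[0] == 1;
}
```

If layer B runs between A and C, the mark is silently lost, and nothing in the types or the call chain warns about it.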

Anyway, I think I have a better alternative. I added the 'l2_miss' bit
to the tc skb extension and adjusted the bridge to mark packets via this
extension. The entire thing is protected by the existing 'tc_skb_ext_tc'
static key, so overhead is kept to a minimum when the feature is
disabled. I also extended flower to enable / disable this key when
filters that match on 'l2_miss' are added / removed.

bridge change to mark the packet:
https://github.com/idosch/linux/commit/3fab206492fcad9177f2340680f02ced1b9a0dec.patch

flow_dissector change to dissect the info from the extension:
https://github.com/idosch/linux/commit/1533c078b02586547817a4e63989a0db62aa5315.patch

flower change to enable / disable the key:
https://github.com/idosch/linux/commit/cf84b277511ec80fe565c41271abc6b2e2f629af.patch

Advantages compared to the previous approach are that we do not need a
new bit in the skb and that overhead is kept to a minimum when the
feature is disabled. The disadvantage is that overhead is higher when
the feature is enabled.

WDYT?

To be clear, merely asking for feedback on the general approach, not
code review.

Thanks

Re: [Bridge] [PATCH net-next 3/5] flow_offload: Reject matching on layer 2 miss

2023-05-19 Thread Ido Schimmel via Bridge

On Fri, May 19, 2023 at 01:33:00PM +0200, Simon Horman wrote:
> On Thu, May 18, 2023 at 02:33:26PM +0300, Ido Schimmel wrote:
> > Adjust drivers that support the 'FLOW_DISSECTOR_KEY_META' key to reject
> > filters that try to match on the newly added layer 2 miss option. Add an
> > extack message to clearly communicate the failure reason to user space.
> 
> Hi Ido,
> 
> FLOW_DISSECTOR_KEY_META is also used in the following.
> Perhaps they don't need updating. But perhaps it is worth mentioning why.

Good point.

> 
>  * drivers/net/ethernet/mediatek/mtk_ppe_offload.c

This driver does not seem to do anything with this key. TBH, I'm not
sure what the purpose of this hunk is:

if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META)) {
struct flow_match_meta match;

flow_rule_match_meta(rule, &match);
} else {
return -EOPNOTSUPP;
}

Felix, can you comment?
Original patch:
https://lore.kernel.org/netdev/20230518113328.1952135-4-ido...@nvidia.com/

>  * drivers/net/ethernet/netronome/nfp/flower/conntrack.c

My understanding is that this code is for netfilter offload (not tc)
which does not use the new bit. Adding a check would therefore be dead
code. I don't mind adding a check or mentioning in the commit message
why I didn't add one. Let me know what you prefer.

Thanks

Re: [Bridge] [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication

2023-05-19 Thread Ido Schimmel via Bridge

On Thu, May 18, 2023 at 07:08:47PM +0300, Nikolay Aleksandrov wrote:
> On 18/05/2023 14:33, Ido Schimmel wrote:
> > diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
> > index fc17b9fd93e6..d8ab5890cbe6 100644
> > --- a/net/bridge/br_input.c
> > +++ b/net/bridge/br_input.c
> > @@ -334,6 +334,7 @@ static rx_handler_result_t br_handle_frame(struct 
> > sk_buff **pskb)
> > return RX_HANDLER_CONSUMED;
> >  
> > memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
> > +   skb->l2_miss = 0;
> >  
> > p = br_port_get_rcu(skb->dev);
> > if (p->flags & BR_VLAN_TUNNEL)
> 
> Overall looks good, only this part is a bit worrisome and needs some 
> additional
> investigation because now we'll unconditionally dirty a cache line for every
> packet that is forwarded. Could you please check the effect with perf?

To eliminate it I tried the approach we discussed yesterday:

First, add the miss indication to the bridge's control block which is
zeroed for every skb entering the bridge:

diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 2119729ded2b..bd5c18286a40 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -581,6 +581,7 @@ struct br_input_skb_cb {
 #endif
u8 proxyarp_replied:1;
u8 src_port_isolated:1;
+   u8 miss:1;  /* FDB or MDB lookup miss */
 #ifdef CONFIG_BRIDGE_VLAN_FILTERING
u8 vlan_filtered:1;
 #endif

And set this bit upon misses instead of skb->l2_miss:

@@ -203,6 +205,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
struct net_bridge_port *prev = NULL;
struct net_bridge_port *p;
 
+   BR_INPUT_SKB_CB(skb)->miss = 1;
+
list_for_each_entry_rcu(p, &br->port_list, list) {
/* Do not flood unicast traffic to ports that turn it off, nor
 * other traffic if flood off, except for traffic we originate
@@ -295,6 +299,7 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
allow_mode_include = false;
} else {
p = NULL;
+   BR_INPUT_SKB_CB(skb)->miss = 1;
}
 
while (p || rp) {

Then copy it to skb->l2_miss at the very end where the cache line
containing this field is already written to:

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 84d6dd5e5b1a..89f65564e338 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -50,6 +50,8 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
 
br_switchdev_frame_set_offload_fwd_mark(skb);
 
+   skb->l2_miss = BR_INPUT_SKB_CB(skb)->miss;
+
dev_queue_xmit(skb);
 
return 0;

Also for locally received packets:

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index fc17b9fd93e6..274e55455b15 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -46,6 +46,8 @@ static int br_pass_frame_up(struct sk_buff *skb)
 */
br_switchdev_frame_unmark(skb);
 
+   skb->l2_miss = BR_INPUT_SKB_CB(skb)->miss;
+
/* Bridge is just like any other port.  Make sure the
 * packet is allowed except in promisc mode when someone
 * may be running packet capture.

Ran these changes through the selftest and it seems to work.

WDYT?
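
To make the flow above easier to follow, here is a minimal user-space
sketch of the idea, not kernel code. The names demo_cb, demo_skb and the
demo_* helpers are hypothetical stand-ins for struct br_input_skb_cb,
struct sk_buff and the bridge entry/flood/exit paths. The miss is staged
in the control block, which is zeroed on entry anyway, and is only
published to the skb at the exit points:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins; the real structures have many more members. */
struct demo_cb {
	unsigned int proxyarp_replied:1;
	unsigned int src_port_isolated:1;
	unsigned int miss:1;		/* FDB or MDB lookup miss */
};

struct demo_skb {
	struct demo_cb cb;		/* zeroed for every skb entering the bridge */
	unsigned int l2_miss:1;		/* only written at the exit points */
};

/* Entry: only the control block is cleared; l2_miss is not touched. */
static void demo_bridge_enter(struct demo_skb *skb)
{
	assert(skb);
	memset(&skb->cb, 0, sizeof(skb->cb));
}

/* Flood path: record the miss in the control block. */
static void demo_flood(struct demo_skb *skb)
{
	assert(skb);
	skb->cb.miss = 1;
}

/* Exit (transmit or pass-up): publish the staged result to the skb. */
static void demo_bridge_exit(struct demo_skb *skb)
{
	assert(skb);
	skb->l2_miss = skb->cb.miss;
}
```

In this model a stale l2_miss value from an earlier pass is overwritten at
exit, so no explicit clearing is needed on entry.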

[Bridge] [PATCH net-next 5/5] selftests: forwarding: Add layer 2 miss test cases

2023-05-18 Thread Ido Schimmel via Bridge

Add test cases to verify that the bridge driver correctly marks layer 2
misses only when it should and that the flower classifier can match on
this metadata.

Example output:

 # ./tc_flower_l2_miss.sh
 TEST: L2 miss - Unicast [ OK ]
 TEST: L2 miss - Multicast (IPv4)[ OK ]
 TEST: L2 miss - Multicast (IPv6)[ OK ]
 TEST: L2 miss - Link-local multicast (IPv4) [ OK ]
 TEST: L2 miss - Link-local multicast (IPv6) [ OK ]
 TEST: L2 miss - Broadcast   [ OK ]

Signed-off-by: Ido Schimmel 
---
 .../testing/selftests/net/forwarding/Makefile |   1 +
 .../net/forwarding/tc_flower_l2_miss.sh   | 343 ++
 2 files changed, 344 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh

diff --git a/tools/testing/selftests/net/forwarding/Makefile b/tools/testing/selftests/net/forwarding/Makefile
index a474c60fe348..9d0062b542e5 100644
--- a/tools/testing/selftests/net/forwarding/Makefile
+++ b/tools/testing/selftests/net/forwarding/Makefile
@@ -83,6 +83,7 @@ TEST_PROGS = bridge_igmp.sh \
tc_chains.sh \
tc_flower_router.sh \
tc_flower.sh \
+   tc_flower_l2_miss.sh \
tc_mpls_l2vpn.sh \
tc_police.sh \
tc_shblocks.sh \
diff --git a/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh b/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh
new file mode 100755
index 000000000000..fbf0a960b2c8
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh
@@ -0,0 +1,343 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
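+# +-----------------------+                          +----------------------+
+# | H1 (vrf)              |                          | H2 (vrf)             |
+# |    + $h1              |                          |              $h2 +   |
+# |    | 192.0.2.1/28     |                          |     192.0.2.2/28 |   |
+# |    | 2001:db8:1::1/64 |                          | 2001:db8:1::2/64 |   |
+# +----|------------------+                          +------------------|---+
+#      |                                                                |
+# +----|----------------------------------------------------------------|---+
+# | SW |                                                                |   |
+# |  +-|----------------------------------------------------------------|-+ |
+# |  | + $swp1                         BR                         $swp2 + | |
+# |  +--------------------------------------------------------------------+ |
+# +-------------------------------------------------------------------------+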
+
+ALL_TESTS="
+   test_l2_miss_unicast
+   test_l2_miss_multicast
+   test_l2_miss_ll_multicast
+   test_l2_miss_broadcast
+"
+
+NUM_NETIFS=4
+source lib.sh
+source tc_common.sh
+
+h1_create()
+{
+   simple_if_init $h1 192.0.2.1/28 2001:db8:1::1/64
+}
+
+h1_destroy()
+{
+   simple_if_fini $h1 192.0.2.1/28 2001:db8:1::1/64
+}
+
+h2_create()
+{
+   simple_if_init $h2 192.0.2.2/28 2001:db8:1::2/64
+}
+
+h2_destroy()
+{
+   simple_if_fini $h2 192.0.2.2/28 2001:db8:1::2/64
+}
+
+switch_create()
+{
+   ip link add name br1 up type bridge
+   ip link set dev $swp1 master br1
+   ip link set dev $swp1 up
+   ip link set dev $swp2 master br1
+   ip link set dev $swp2 up
+
+   tc qdisc add dev $swp2 clsact
+}
+
+switch_destroy()
+{
+   tc qdisc del dev $swp2 clsact
+
+   ip link set dev $swp2 down
+   ip link set dev $swp2 nomaster
+   ip link set dev $swp1 down
+   ip link set dev $swp1 nomaster
+   ip link del dev br1
+}
+
+test_l2_miss_unicast()
+{
+   local dmac=00:01:02:03:04:05
+   local dip=192.0.2.2
+   local sip=192.0.2.1
+
+   RET=0
+
+   # Unknown unicast.
+   tc filter add dev $swp2 egress protocol ipv4 handle 101 pref 1 \
+  flower indev $swp1 l2_miss true dst_mac $dmac src_ip $sip \
+  dst_ip $dip action pass
+   # Known unicast.
+   tc filter add dev $swp2 egress protocol ipv4 handle 102 pref 1 \
+  flower indev $swp1 l2_miss false dst_mac $dmac src_ip $sip \
+  dst_ip $dip action pass
+
+   # Before adding FDB entry.
+   $MZ $h1 -a own -b $dmac -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+   tc_check_packets "dev $swp2 egress" 101 1
+   check_err $? "Unknown unicast filter was not hit before adding FDB entry"
+
+   tc_check_packets "dev $swp2 egress" 102 0
+   check_err $? "Known unicast filter was hit before adding FDB entry"
+
+   # Adding FDB entry.
+   bridge fdb replace $dmac dev $swp2 master static
+
+   $MZ $h1 -a own -b $dmac -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+   tc_check_

[Bridge] [PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication

2023-05-18 Thread Ido Schimmel via Bridge

Allow the bridge driver to mark packets that did not match a layer 2
entry during forwarding by adding a 'l2_miss' bit to the skb.

Clear the bit whenever a packet enters the bridge (received from a
bridge port or transmitted via the bridge) and set it if the packet did
not match an FDB/MDB entry.

Subsequent patches will allow the flower classifier to match on this
bit. The motivating use case is non-DF (Designated Forwarder) filtering
where we would like to prevent decapsulated packets from being flooded
to a multi-homed host.

Do not allocate the bit if the kernel was not compiled with bridge
support and place it after the two bit fields in accordance with commit
4c60d04c2888 ("net: skbuff: push nf_trace down the bitfield"). The bit
does not increase the size of the structure as it is placed at an
existing hole. Layout with allmodconfig:

struct sk_buff {
[...]
__u8   csum_not_inet:1;  /*   132: 3  1 */
__u8   l2_miss:1;/*   132: 4  1 */

/* XXX 3 bits hole, try to pack */
/* XXX 1 byte hole, try to pack */

__u16  tc_index; /*   134 2 */
u16alloc_cpu;/*   136 2 */
[...]
} __attribute__((__aligned__(8)));
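
As a rough illustration of why the structure does not grow, here is a
minimal sketch that is not taken from the patch. demo_before and
demo_after are hypothetical miniatures of the relevant region (four
one-bit flags followed by a 16-bit field), not the real sk_buff layout.
It assumes a compiler that accepts unsigned char bit-fields, as GCC and
Clang do:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical miniature: four flag bits, then a 16-bit member. */
struct demo_before {
	unsigned char f0:1, f1:1, f2:1;
	unsigned char csum_not_inet:1;
	/* 4 bits hole here */
	uint16_t tc_index;
};

/* Same layout with one more flag placed in the existing hole. */
struct demo_after {
	unsigned char f0:1, f1:1, f2:1;
	unsigned char csum_not_inet:1;
	unsigned char l2_miss:1;
	/* 3 bits hole here */
	uint16_t tc_index;
};

/* Returns 1 when adding the bit did not change the structure size. */
static int demo_same_size(void)
{
	return sizeof(struct demo_before) == sizeof(struct demo_after);
}
```

On the real structure, pahole output such as the listing above is the
authoritative check.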

Signed-off-by: Ido Schimmel 
---
 include/linux/skbuff.h  | 4 ++++
 net/bridge/br_device.c  | 1 +
 net/bridge/br_forward.c | 3 +++
 net/bridge/br_input.c   | 1 +
 4 files changed, 9 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8cff3d817131..b64dc3f62c5c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -801,6 +801,7 @@ typedef unsigned char *sk_buff_data_t;
  * @encap_hdr_csum: software checksum is needed
  * @csum_valid: checksum is already valid
  * @csum_not_inet: use CRC32c to resolve CHECKSUM_PARTIAL
+ * @l2_miss: Packet did not match an L2 entry during forwarding
  * @csum_complete_sw: checksum was completed by software
  * @csum_level: indicates the number of consecutive checksums found in
  * the packet minus one that have been verified as
@@ -991,6 +992,9 @@ struct sk_buff {
 #if IS_ENABLED(CONFIG_IP_SCTP)
__u8csum_not_inet:1;
 #endif
+#if IS_ENABLED(CONFIG_BRIDGE)
+   __u8l2_miss:1;
+#endif
 
 #ifdef CONFIG_NET_SCHED
__u16   tc_index;   /* traffic control index */
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 8eca8a5c80c6..91dbdae4afd4 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -39,6 +39,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
u16 vid = 0;
 
memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
+   skb->l2_miss = 0;
 
rcu_read_lock();
nf_ops = rcu_dereference(nf_br_ops);
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 84d6dd5e5b1a..8cf5a51489ce 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -203,6 +203,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
struct net_bridge_port *prev = NULL;
struct net_bridge_port *p;
 
+   skb->l2_miss = 1;
+
list_for_each_entry_rcu(p, &br->port_list, list) {
/* Do not flood unicast traffic to ports that turn it off, nor
 * other traffic if flood off, except for traffic we originate
@@ -295,6 +297,7 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
allow_mode_include = false;
} else {
p = NULL;
+   skb->l2_miss = 1;
}
 
while (p || rp) {
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index fc17b9fd93e6..d8ab5890cbe6 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -334,6 +334,7 @@ static rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
return RX_HANDLER_CONSUMED;
 
memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
+   skb->l2_miss = 0;
 
p = br_port_get_rcu(skb->dev);
if (p->flags & BR_VLAN_TUNNEL)
-- 
2.40.1

[Bridge] [PATCH net-next 0/5] Add layer 2 miss indication and filtering

2023-05-18 Thread Ido Schimmel via Bridge

tl;dr
=====

This patchset adds a single bit to the skb to indicate that a packet
encountered a layer 2 miss in the bridge and extends flower to match on
this metadata. This is required for non-DF (Designated Forwarder)
filtering in EVPN multi-homing which prevents decapsulated BUM packets
from being forwarded multiple times to the same multi-homed host.

Background
==========

In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches in a rack. These switches act as VTEPs and are not directly
connected (as opposed to MLAG), but can communicate with each other (as
well as with VTEPs in remote racks) via spine switches over L3.

When a host sends a BUM packet over ES1 to VTEP1, the VTEP will flood it
to other VTEPs in the network, including those connected to the host
over ES1. The receiving VTEPs must drop the packet and not forward it
back to the host. This is called "split-horizon filtering" (SPH) [1].

FRR configures SPH filtering using two tc filters. The first, an ingress
filter that matches on packets received from VTEP1 and marks them using
a fwmark (firewall mark). The second, an egress filter configured on the
LAG interface connected to the host that matches on the fwmark and drops
the packets. Example:

 # tc filter add dev vxlan0 ingress pref 1 proto all flower enc_src_ip $VTEP1_IP action skbedit mark 101
 # tc filter add dev bond0 egress pref 1 handle 101 fw action drop
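
The two-stage logic of these filters can be modelled as follows. This is a
minimal sketch under stated assumptions, not FRR or kernel code: demo_pkt,
demo_vxlan_ingress() and demo_bond_egress_drop() are hypothetical names,
and IPv4 addresses are represented as plain 32-bit integers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SPH_MARK 101	/* same value as "mark 101" / "handle 101" */

/* Hypothetical packet metadata used only for this illustration. */
struct demo_pkt {
	uint32_t enc_src_ip;	/* outer source IP of the VXLAN packet */
	uint32_t mark;		/* fwmark carried with the packet */
};

/* Ingress filter on vxlan0: mark packets received from VTEP1. */
static void demo_vxlan_ingress(struct demo_pkt *pkt, uint32_t vtep1_ip)
{
	assert(pkt);
	if (pkt->enc_src_ip == vtep1_ip)
		pkt->mark = DEMO_SPH_MARK;
}

/* Egress filter on bond0: drop packets that carry the mark. */
static bool demo_bond_egress_drop(const struct demo_pkt *pkt)
{
	assert(pkt);
	return pkt->mark == DEMO_SPH_MARK;
}
```

A packet that arrived from VTEP1 is therefore never sent back to the host
over the shared ES, while packets from other VTEPs are unaffected.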

Motivation
==========

For each ES, only one VTEP is elected by the control plane as the DF.
The DF is responsible for forwarding decapsulated BUM traffic to the
host over the ES. The non-DF VTEPs must drop such traffic as otherwise
the host will receive multiple copies of BUM traffic. This is called
"non-DF filtering" [2].

Filtering of multicast and broadcast traffic can be achieved using the
following flower filter:

 # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop

Unlike broadcast and multicast traffic, it is not currently possible to
filter unknown unicast traffic. The classification into unknown unicast
is performed by the bridge driver, but is not visible to other layers.

Implementation
==============

The proposed solution is to add a single bit to the skb that is set by
the bridge for packets that encountered an FDB/MDB miss. The flower
classifier is extended to be able to match on this new metadata bit in a
similar fashion to existing metadata options such as 'indev'.

A bit that is set for every flooded packet would also work, but it does
not allow us to differentiate between registered and unregistered
multicast traffic which might be useful in the future.

A relatively generic name is chosen for this bit - 'l2_miss' - to allow
its use to be extended to other layer 2 devices such as VXLAN, should a
use case arise.

With the above, the control plane can implement a non-DF filter using
the following tc filters:

 # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop
 # tc filter add dev bond0 egress pref 2 proto all flower indev vxlan0 l2_miss true action drop

The first drops broadcast and multicast traffic and the second drops
unknown unicast traffic.
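
The combined effect of the two filters can be sketched in user space as
below. This is only a model of the matching logic under stated
assumptions: demo_pkt and demo_non_df_filter() are hypothetical names, and
from_vxlan stands in for the 'indev vxlan0' match:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum demo_verdict { DEMO_PASS, DEMO_DROP };

/* Hypothetical packet metadata used only for this illustration. */
struct demo_pkt {
	uint8_t dst_mac[6];
	bool from_vxlan;	/* models "indev vxlan0" */
	bool l2_miss;		/* set by the bridge on FDB/MDB miss */
};

static enum demo_verdict demo_non_df_filter(const struct demo_pkt *pkt)
{
	assert(pkt);
	if (!pkt->from_vxlan)
		return DEMO_PASS;
	/* pref 1: dst_mac 01:00:00:00:00:00/01:00:00:00:00:00, i.e. the
	 * group bit is set (broadcast and multicast).
	 */
	if (pkt->dst_mac[0] & 0x01)
		return DEMO_DROP;
	/* pref 2: l2_miss true (unknown unicast). */
	if (pkt->l2_miss)
		return DEMO_DROP;
	return DEMO_PASS;
}
```

Known unicast received from the VXLAN device passes both filters, which is
the behavior a non-DF VTEP needs.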

Testing
=======

A test exercising the different permutations of the 'l2_miss' bit is
added in patch #5.

Patchset overview
=================

Patch #1 adds the new bit to the skb and sets it in the bridge driver
for packets that encountered a miss. The new bit is added in an existing
hole in the skb in order not to inflate this data structure.

Patch #2 extends the flower classifier to be able to match on the new
layer 2 miss metadata.

Patch #3 rejects matching on the new metadata in drivers that already
support the 'FLOW_DISSECTOR_KEY_META' key.

Patch #4 extends mlxsw to be able to match on layer 2 miss.

Patch #5 adds a selftest.

iproute2 patches can be found here [3].

Changelog
=========

Since RFC [4]:

No changes.

[1] https://datatracker.ietf.org/doc/html/rfc7432#section-8.3
[2] https://datatracker.ietf.org/doc/html/rfc7432#section-8.5
[3] https://github.com/idosch/iproute2/tree/submit/non_df_filter_v1
[4] https://lore.kernel.org/netdev/20230509070446.246088-1-ido...@nvidia.com/

Ido Schimmel (5):
  skbuff: bridge: Add layer 2 miss indication
  net/sched: flower: Allow matching on layer 2 miss
  flow_offload: Reject matching on layer 2 miss
  mlxsw: spectrum_flower: Add ability to match on layer 2 miss
  selftests: forwarding: Add layer 2 miss test cases

 .../marvell/prestera/prestera_flower.c|   6 +
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |   6 +
 .../mellanox/mlxsw/core_acl_flex_keys.c   |   1 +
 .../mellanox/mlxsw/core_acl_flex_keys.h   |   3 +-
 .../mellanox/mlxsw/spectrum_acl_flex_keys.c   |   5 +
 .../ethernet/mellano

[Bridge] [PATCH net-next 4/5] mlxsw: spectrum_flower: Add ability to match on layer 2 miss

2023-05-18 Thread Ido Schimmel via Bridge

Add the 'dmac_type' key element to supported key blocks and make use of
it to match on layer 2 miss.

This is a two-bit key in hardware with the following values:
00b - Known multicast.
01b - Broadcast.
10b - Known unicast.
11b - Unknown unicast or unregistered multicast.

When 'l2_miss' is set we need to match on 01b or 11b. Therefore, only
match on the LSB in order to differentiate between both cases of
'l2_miss'.
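
The LSB-only match can be checked with a small user-space sketch. The enum
values follow the table above; the DEMO_* names and
demo_dmac_type_match() are hypothetical and not part of mlxsw:

```c
#include <assert.h>
#include <stdbool.h>

/* Values of the two-bit 'dmac_type' key as listed above. */
enum demo_dmac_type {
	DEMO_DMAC_KNOWN_MC = 0x0,	/* 00b */
	DEMO_DMAC_BC       = 0x1,	/* 01b */
	DEMO_DMAC_KNOWN_UC = 0x2,	/* 10b */
	DEMO_DMAC_UNKNOWN  = 0x3,	/* 11b */
};

/* Match with key = l2_miss and mask = 01b, so only the LSB is compared. */
static bool demo_dmac_type_match(enum demo_dmac_type type, bool l2_miss)
{
	unsigned int key = l2_miss ? 1 : 0;
	unsigned int mask = 0x1;

	assert((unsigned int)type <= 0x3);
	return ((unsigned int)type & mask) == key;
}
```

With this mask, 'l2_miss true' selects 01b and 11b, and 'l2_miss false'
selects 00b and 10b.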

Tested on Spectrum-{1,2,3,4}.

Signed-off-by: Ido Schimmel 
---
 .../mellanox/mlxsw/core_acl_flex_keys.c   |  1 +
 .../mellanox/mlxsw/core_acl_flex_keys.h   |  3 ++-
 .../mellanox/mlxsw/spectrum_acl_flex_keys.c   |  5 +
 .../ethernet/mellanox/mlxsw/spectrum_flower.c | 20 ++-
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
index bd1a51a0a540..81af0b9a4329 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
@@ -42,6 +42,7 @@ static const struct mlxsw_afk_element_info mlxsw_afk_element_infos[] = {
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_64_95, 0x34, 4),
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_32_63, 0x38, 4),
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_0_31, 0x3C, 4),
+   MLXSW_AFK_ELEMENT_INFO_U32(DMAC_TYPE, 0x40, 0, 2),
 };
 
 struct mlxsw_afk {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
index 3a037fe47211..6f1649cfa4cb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
@@ -35,6 +35,7 @@ enum mlxsw_afk_element {
MLXSW_AFK_ELEMENT_IP_DSCP,
MLXSW_AFK_ELEMENT_VIRT_ROUTER_MSB,
MLXSW_AFK_ELEMENT_VIRT_ROUTER_LSB,
+   MLXSW_AFK_ELEMENT_DMAC_TYPE,
MLXSW_AFK_ELEMENT_MAX,
 };
 
@@ -69,7 +70,7 @@ struct mlxsw_afk_element_info {
MLXSW_AFK_ELEMENT_INFO(MLXSW_AFK_ELEMENT_TYPE_BUF, \
   _element, _offset, 0, _size)
 
-#define MLXSW_AFK_ELEMENT_STORAGE_SIZE 0x40
+#define MLXSW_AFK_ELEMENT_STORAGE_SIZE 0x44
 
 struct mlxsw_afk_element_inst { /* element instance in actual block */
enum mlxsw_afk_element element;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
index 00c32320f891..18a968cded36 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
@@ -26,6 +26,7 @@ static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_l2_smac[] = {
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_l2_smac_ex[] = {
MLXSW_AFK_ELEMENT_INST_BUF(SMAC_32_47, 0x02, 2),
MLXSW_AFK_ELEMENT_INST_BUF(SMAC_0_31, 0x04, 4),
+   MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x08, 0, 2),
MLXSW_AFK_ELEMENT_INST_U32(ETHERTYPE, 0x0C, 0, 16),
 };
 
@@ -50,6 +51,7 @@ static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv4[] = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv4_ex[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 24, 2),
MLXSW_AFK_ELEMENT_INST_U32(VID, 0x00, 0, 12),
MLXSW_AFK_ELEMENT_INST_U32(PCP, 0x08, 29, 3),
MLXSW_AFK_ELEMENT_INST_U32(SRC_L4_PORT, 0x08, 0, 16),
@@ -78,6 +80,7 @@ static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv6_sip_ex[] = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_packet_type[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 30, 2),
MLXSW_AFK_ELEMENT_INST_U32(ETHERTYPE, 0x00, 0, 16),
 };
 
@@ -123,6 +126,7 @@ const struct mlxsw_afk_ops mlxsw_sp1_afk_ops = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_mac_0[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 0, 2),
MLXSW_AFK_ELEMENT_INST_BUF(DMAC_0_31, 0x04, 4),
 };
 
@@ -313,6 +317,7 @@ const struct mlxsw_afk_ops mlxsw_sp2_afk_ops = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_mac_5b[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 2, 2),
MLXSW_AFK_ELEMENT_INST_U32(VID, 0x04, 18, 12),
MLXSW_AFK_ELEMENT_INST_EXT_U32(SRC_SYS_PORT, 0x04, 0, 9, -1, true), /* RX_ACL_SYSTEM_PORT */
 };
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index 6fec9223250b..170a07f35897 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -295,11 +295,6 @@ static int mlxsw_sp_flower_parse_meta(struct mlxsw_sp_acl_rule_info *rulei,
 
flow_rule_match_meta(rule, &match);
 
-   if (match.mask->l2_miss) {
-   NL_SET_ERR_MSG_MOD(f->c

[Bridge] [PATCH net-next 2/5] net/sched: flower: Allow matching on layer 2 miss

2023-05-18 Thread Ido Schimmel via Bridge

Add the 'TCA_FLOWER_L2_MISS' netlink attribute that allows user space to
match on packets that encountered a layer 2 miss. The miss indication is
set as metadata in the skb by the bridge driver upon FDB/MDB lookup
miss.
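
The resulting match semantics can be sketched as follows. This is a
simplified user-space model, not the cls_flower implementation: the
demo_* names are hypothetical. It assumes, as in the diff below, that the
attribute is a u8 limited to the values 0 and 1 and that no separate mask
attribute exists, so a configured key is matched exactly:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical miniature of the 'meta' key; only l2_miss is modelled. */
struct demo_meta {
	uint8_t l2_miss;
};

struct demo_filter {
	struct demo_meta key;
	struct demo_meta mask;	/* all zeroes means "do not care" */
};

/* Returns 0 on success, -1 if the value exceeds the policy maximum of 1. */
static int demo_set_l2_miss(struct demo_filter *f, uint8_t attr_val)
{
	assert(f);
	if (attr_val > 1)
		return -1;
	f->key.l2_miss = attr_val;
	f->mask.l2_miss = 0xff;	/* no mask attribute: match exactly */
	return 0;
}

/* Masked comparison of the packet metadata against the filter key. */
static bool demo_match(const struct demo_filter *f, uint8_t skb_l2_miss)
{
	assert(f);
	return (skb_l2_miss & f->mask.l2_miss) ==
	       (f->key.l2_miss & f->mask.l2_miss);
}
```

A filter that never sets the attribute keeps a zero mask and therefore
matches packets regardless of the miss indication.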

Signed-off-by: Ido Schimmel 
---
 include/net/flow_dissector.h |  2 ++
 include/uapi/linux/pkt_cls.h |  2 ++
 net/core/flow_dissector.c    |  3 +++
 net/sched/cls_flower.c       | 14 ++++++++++++--
 4 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 85b2281576ed..8b41668c77fc 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -243,10 +243,12 @@ struct flow_dissector_key_ip {
  * struct flow_dissector_key_meta:
  * @ingress_ifindex: ingress ifindex
  * @ingress_iftype: ingress interface type
+ * @l2_miss: packet did not match an L2 entry during forwarding
  */
 struct flow_dissector_key_meta {
int ingress_ifindex;
u16 ingress_iftype;
+   u8 l2_miss;
 };
 
 /**
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 648a82f32666..00933dda7b10 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -594,6 +594,8 @@ enum {
 
TCA_FLOWER_KEY_L2TPV3_SID,  /* be32 */
 
+   TCA_FLOWER_L2_MISS, /* u8 */
+
__TCA_FLOWER_MAX,
 };
 
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 25fb0bbc310f..3776c7bdd228 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -241,6 +241,9 @@ void skb_flow_dissect_meta(const struct sk_buff *skb,
 FLOW_DISSECTOR_KEY_META,
 target_container);
meta->ingress_ifindex = skb->skb_iif;
+#if IS_ENABLED(CONFIG_BRIDGE)
+   meta->l2_miss = skb->l2_miss;
+#endif
 }
 EXPORT_SYMBOL(skb_flow_dissect_meta);
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9dbc43388e57..4eb06c6367fc 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -615,7 +615,8 @@ static void *fl_get(struct tcf_proto *tp, u32 handle)
 }
 
 static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
-   [TCA_FLOWER_UNSPEC] = { .type = NLA_UNSPEC },
+   [TCA_FLOWER_UNSPEC] = { .strict_start_type =
+   TCA_FLOWER_L2_MISS },
[TCA_FLOWER_CLASSID]= { .type = NLA_U32 },
[TCA_FLOWER_INDEV]  = { .type = NLA_STRING,
.len = IFNAMSIZ },
@@ -720,7 +721,7 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
[TCA_FLOWER_KEY_PPPOE_SID]  = { .type = NLA_U16 },
[TCA_FLOWER_KEY_PPP_PROTO]  = { .type = NLA_U16 },
[TCA_FLOWER_KEY_L2TPV3_SID] = { .type = NLA_U32 },
-
+   [TCA_FLOWER_L2_MISS]= NLA_POLICY_MAX(NLA_U8, 1),
 };
 
 static const struct nla_policy
@@ -1668,6 +1669,10 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
mask->meta.ingress_ifindex = 0xffffffff;
}
 
+   fl_set_key_val(tb, &key->meta.l2_miss, TCA_FLOWER_L2_MISS,
+  &mask->meta.l2_miss, TCA_FLOWER_UNSPEC,
+  sizeof(key->meta.l2_miss));
+
fl_set_key_val(tb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
   mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
   sizeof(key->eth.dst));
@@ -3074,6 +3079,11 @@ static int fl_dump_key(struct sk_buff *skb, struct net *net,
goto nla_put_failure;
}
 
+   if (fl_dump_key_val(skb, &key->meta.l2_miss,
+   TCA_FLOWER_L2_MISS, &mask->meta.l2_miss,
+   TCA_FLOWER_UNSPEC, sizeof(key->meta.l2_miss)))
+   goto nla_put_failure;
+
if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
sizeof(key->eth.dst)) ||
-- 
2.40.1

[Bridge] [PATCH net-next 3/5] flow_offload: Reject matching on layer 2 miss

2023-05-18 Thread Ido Schimmel via Bridge

Adjust drivers that support the 'FLOW_DISSECTOR_KEY_META' key to reject
filters that try to match on the newly added layer 2 miss option. Add an
extack message to clearly communicate the failure reason to user space.

Example:

 # tc filter add dev swp1 egress pref 1 proto all flower skip_sw l2_miss true action drop
 Error: mlxsw_spectrum: Can't match on "l2_miss".
 We have an error talking to the kernel
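
The check each driver gains has roughly the following shape. This is a
minimal user-space sketch, assuming a hypothetical demo_meta_mask
structure and demo_parse_meta() helper. The real drivers report the
message through extack with NL_SET_ERR_MSG_MOD() rather than through an
output pointer:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of the meta match mask seen by a driver. */
struct demo_meta_mask {
	uint32_t ingress_ifindex;
	uint8_t l2_miss;
};

/* Reject any filter whose mask asks for a match on l2_miss. */
static int demo_parse_meta(const struct demo_meta_mask *mask,
			   const char **errmsg)
{
	assert(mask && errmsg);
	if (mask->l2_miss) {
		*errmsg = "Can't match on \"l2_miss\"";
		return -EOPNOTSUPP;
	}
	*errmsg = NULL;
	return 0;
}
```

Filters that do not use the new option have a zero l2_miss mask and are
parsed as before.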

Acked-by: Elad Nachman 
Signed-off-by: Ido Schimmel 
---
 .../net/ethernet/marvell/prestera/prestera_flower.c    |  6 ++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c        |  6 ++++++
 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  6 ++++++
 drivers/net/ethernet/mscc/ocelot_flower.c              | 10 ++++++++++
 4 files changed, 28 insertions(+)

diff --git a/drivers/net/ethernet/marvell/prestera/prestera_flower.c b/drivers/net/ethernet/marvell/prestera/prestera_flower.c
index 91a478b75cbf..3e20e71b0f81 100644
--- a/drivers/net/ethernet/marvell/prestera/prestera_flower.c
+++ b/drivers/net/ethernet/marvell/prestera/prestera_flower.c
@@ -148,6 +148,12 @@ static int prestera_flower_parse_meta(struct prestera_acl_rule *rule,
__be16 key, mask;
 
flow_rule_match_meta(f_rule, &match);
+
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+
if (match.mask->ingress_ifindex != 0xffffffff) {
NL_SET_ERR_MSG_MOD(f->common.extack,
   "Unsupported ingress ifindex mask");
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 728b82ce4031..516653568330 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2586,6 +2586,12 @@ static int mlx5e_flower_parse_meta(struct net_device *filter_dev,
return 0;
 
flow_rule_match_meta(rule, &match);
+
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+
if (!match.mask->ingress_ifindex)
return 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index 594cdcb90b3d..6fec9223250b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -294,6 +294,12 @@ static int mlxsw_sp_flower_parse_meta(struct mlxsw_sp_acl_rule_info *rulei,
return 0;
 
flow_rule_match_meta(rule, &match);
+
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+
if (match.mask->ingress_ifindex != 0xffffffff) {
NL_SET_ERR_MSG_MOD(f->common.extack, "Unsupported ingress ifindex mask");
return -EINVAL;
diff --git a/drivers/net/ethernet/mscc/ocelot_flower.c b/drivers/net/ethernet/mscc/ocelot_flower.c
index ee052404eb55..e0916afcddfb 100644
--- a/drivers/net/ethernet/mscc/ocelot_flower.c
+++ b/drivers/net/ethernet/mscc/ocelot_flower.c
@@ -592,6 +592,16 @@ ocelot_flower_parse_key(struct ocelot *ocelot, int port, bool ingress,
return -EOPNOTSUPP;
}
 
+   if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META)) {
+   struct flow_match_meta match;
+
+   flow_rule_match_meta(rule, &match);
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(extack, "Can't match on \"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+   }
+
/* For VCAP ES0 (egress rewriter) we can match on the ingress port */
if (!ingress) {
ret = ocelot_flower_parse_indev(ocelot, port, f, filter);
-- 
2.40.1

[Bridge] [RFC PATCH net-next 4/5] mlxsw: spectrum_flower: Add ability to match on layer 2 miss

2023-05-09 Thread Ido Schimmel via Bridge

Add the 'dmac_type' key element to supported key blocks and make use of
it to match on layer 2 miss.

This is a two-bit key in hardware with the following values:
00b - Known multicast.
01b - Broadcast.
10b - Known unicast.
11b - Unknown unicast or unregistered multicast.

When 'l2_miss' is set we need to match on 01b or 11b. Therefore, only
match on the LSB in order to differentiate between both cases of
'l2_miss'.

Tested on Spectrum-{1,2,3,4}.

Signed-off-by: Ido Schimmel 
---
 .../mellanox/mlxsw/core_acl_flex_keys.c   |  1 +
 .../mellanox/mlxsw/core_acl_flex_keys.h   |  3 ++-
 .../mellanox/mlxsw/spectrum_acl_flex_keys.c   |  5 +
 .../ethernet/mellanox/mlxsw/spectrum_flower.c | 20 ++-
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
index bd1a51a0a540..81af0b9a4329 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
@@ -42,6 +42,7 @@ static const struct mlxsw_afk_element_info mlxsw_afk_element_infos[] = {
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_64_95, 0x34, 4),
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_32_63, 0x38, 4),
MLXSW_AFK_ELEMENT_INFO_BUF(DST_IP_0_31, 0x3C, 4),
+   MLXSW_AFK_ELEMENT_INFO_U32(DMAC_TYPE, 0x40, 0, 2),
 };
 
 struct mlxsw_afk {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
index 3a037fe47211..6f1649cfa4cb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
@@ -35,6 +35,7 @@ enum mlxsw_afk_element {
MLXSW_AFK_ELEMENT_IP_DSCP,
MLXSW_AFK_ELEMENT_VIRT_ROUTER_MSB,
MLXSW_AFK_ELEMENT_VIRT_ROUTER_LSB,
+   MLXSW_AFK_ELEMENT_DMAC_TYPE,
MLXSW_AFK_ELEMENT_MAX,
 };
 
@@ -69,7 +70,7 @@ struct mlxsw_afk_element_info {
MLXSW_AFK_ELEMENT_INFO(MLXSW_AFK_ELEMENT_TYPE_BUF, \
   _element, _offset, 0, _size)
 
-#define MLXSW_AFK_ELEMENT_STORAGE_SIZE 0x40
+#define MLXSW_AFK_ELEMENT_STORAGE_SIZE 0x44
 
 struct mlxsw_afk_element_inst { /* element instance in actual block */
enum mlxsw_afk_element element;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
index 00c32320f891..18a968cded36 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.c
@@ -26,6 +26,7 @@ static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_l2_smac[] = {
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_l2_smac_ex[] = {
MLXSW_AFK_ELEMENT_INST_BUF(SMAC_32_47, 0x02, 2),
MLXSW_AFK_ELEMENT_INST_BUF(SMAC_0_31, 0x04, 4),
+   MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x08, 0, 2),
MLXSW_AFK_ELEMENT_INST_U32(ETHERTYPE, 0x0C, 0, 16),
 };
 
@@ -50,6 +51,7 @@ static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv4[] = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv4_ex[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 24, 2),
MLXSW_AFK_ELEMENT_INST_U32(VID, 0x00, 0, 12),
MLXSW_AFK_ELEMENT_INST_U32(PCP, 0x08, 29, 3),
MLXSW_AFK_ELEMENT_INST_U32(SRC_L4_PORT, 0x08, 0, 16),
@@ -78,6 +80,7 @@ static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv6_sip_ex[] = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_packet_type[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 30, 2),
MLXSW_AFK_ELEMENT_INST_U32(ETHERTYPE, 0x00, 0, 16),
 };
 
@@ -123,6 +126,7 @@ const struct mlxsw_afk_ops mlxsw_sp1_afk_ops = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_mac_0[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 0, 2),
MLXSW_AFK_ELEMENT_INST_BUF(DMAC_0_31, 0x04, 4),
 };
 
@@ -313,6 +317,7 @@ const struct mlxsw_afk_ops mlxsw_sp2_afk_ops = {
 };
 
 static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_mac_5b[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(DMAC_TYPE, 0x00, 2, 2),
MLXSW_AFK_ELEMENT_INST_U32(VID, 0x04, 18, 12),
MLXSW_AFK_ELEMENT_INST_EXT_U32(SRC_SYS_PORT, 0x04, 0, 9, -1, true), /* RX_ACL_SYSTEM_PORT */
 };
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index 6fec9223250b..170a07f35897 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -295,11 +295,6 @@ static int mlxsw_sp_flower_parse_meta(struct mlxsw_sp_acl_rule_info *rulei,
 
flow_rule_match_meta(rule, &match);
 
-   if (match.mask->l2_miss) {
-   NL_SET_ERR_MSG_MOD(f->c

[Bridge] [RFC PATCH net-next 5/5] selftests: forwarding: Add layer 2 miss test cases

2023-05-09 Thread Ido Schimmel via Bridge

Add test cases to verify that the bridge driver correctly marks layer 2
misses only when it should and that the flower classifier can match on
this metadata.

Example output:

 # ./tc_flower_l2_miss.sh
 TEST: L2 miss - Unicast [ OK ]
 TEST: L2 miss - Multicast (IPv4)[ OK ]
 TEST: L2 miss - Multicast (IPv6)[ OK ]
 TEST: L2 miss - Link-local multicast (IPv4) [ OK ]
 TEST: L2 miss - Link-local multicast (IPv6) [ OK ]
 TEST: L2 miss - Broadcast   [ OK ]

Signed-off-by: Ido Schimmel 
---
 .../testing/selftests/net/forwarding/Makefile |   1 +
 .../net/forwarding/tc_flower_l2_miss.sh   | 343 ++
 2 files changed, 344 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh

diff --git a/tools/testing/selftests/net/forwarding/Makefile b/tools/testing/selftests/net/forwarding/Makefile
index a474c60fe348..9d0062b542e5 100644
--- a/tools/testing/selftests/net/forwarding/Makefile
+++ b/tools/testing/selftests/net/forwarding/Makefile
@@ -83,6 +83,7 @@ TEST_PROGS = bridge_igmp.sh \
tc_chains.sh \
tc_flower_router.sh \
tc_flower.sh \
+   tc_flower_l2_miss.sh \
tc_mpls_l2vpn.sh \
tc_police.sh \
tc_shblocks.sh \
diff --git a/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh b/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh
new file mode 100755
index 000000000000..fbf0a960b2c8
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/tc_flower_l2_miss.sh
@@ -0,0 +1,343 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
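+# +-----------------------+                          +----------------------+
+# | H1 (vrf)              |                          | H2 (vrf)             |
+# |    + $h1              |                          |              $h2 +   |
+# |    | 192.0.2.1/28     |                          |     192.0.2.2/28 |   |
+# |    | 2001:db8:1::1/64 |                          | 2001:db8:1::2/64 |   |
+# +----|------------------+                          +------------------|---+
+#      |                                                                |
+# +----|----------------------------------------------------------------|---+
+# | SW |                                                                |   |
+# |  +-|----------------------------------------------------------------|-+ |
+# |  | + $swp1                         BR                         $swp2 + | |
+# |  +--------------------------------------------------------------------+ |
+# +-------------------------------------------------------------------------+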
+
+ALL_TESTS="
+   test_l2_miss_unicast
+   test_l2_miss_multicast
+   test_l2_miss_ll_multicast
+   test_l2_miss_broadcast
+"
+
+NUM_NETIFS=4
+source lib.sh
+source tc_common.sh
+
+h1_create()
+{
+   simple_if_init $h1 192.0.2.1/28 2001:db8:1::1/64
+}
+
+h1_destroy()
+{
+   simple_if_fini $h1 192.0.2.1/28 2001:db8:1::1/64
+}
+
+h2_create()
+{
+   simple_if_init $h2 192.0.2.2/28 2001:db8:1::2/64
+}
+
+h2_destroy()
+{
+   simple_if_fini $h2 192.0.2.2/28 2001:db8:1::2/64
+}
+
+switch_create()
+{
+   ip link add name br1 up type bridge
+   ip link set dev $swp1 master br1
+   ip link set dev $swp1 up
+   ip link set dev $swp2 master br1
+   ip link set dev $swp2 up
+
+   tc qdisc add dev $swp2 clsact
+}
+
+switch_destroy()
+{
+   tc qdisc del dev $swp2 clsact
+
+   ip link set dev $swp2 down
+   ip link set dev $swp2 nomaster
+   ip link set dev $swp1 down
+   ip link set dev $swp1 nomaster
+   ip link del dev br1
+}
+
+test_l2_miss_unicast()
+{
+   local dmac=00:01:02:03:04:05
+   local dip=192.0.2.2
+   local sip=192.0.2.1
+
+   RET=0
+
+   # Unknown unicast.
+   tc filter add dev $swp2 egress protocol ipv4 handle 101 pref 1 \
+  flower indev $swp1 l2_miss true dst_mac $dmac src_ip $sip \
+  dst_ip $dip action pass
+   # Known unicast.
+   tc filter add dev $swp2 egress protocol ipv4 handle 102 pref 1 \
+  flower indev $swp1 l2_miss false dst_mac $dmac src_ip $sip \
+  dst_ip $dip action pass
+
+   # Before adding FDB entry.
+   $MZ $h1 -a own -b $dmac -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+   tc_check_packets "dev $swp2 egress" 101 1
+   check_err $? "Unknown unicast filter was not hit before adding FDB entry"
+
+   tc_check_packets "dev $swp2 egress" 102 0
+   check_err $? "Known unicast filter was hit before adding FDB entry"
+
+   # Adding FDB entry.
+   bridge fdb replace $dmac dev $swp2 master static
+
+   $MZ $h1 -a own -b $dmac -t ip -A $sip -B $dip -c 1 -p 100 -q
+
+   tc_check_

[Bridge] [RFC PATCH net-next 2/5] net/sched: flower: Allow matching on layer 2 miss

2023-05-09 Thread Ido Schimmel via Bridge

Add the 'TCA_FLOWER_L2_MISS' netlink attribute that allows user space to
match on packets that encountered a layer 2 miss. The miss indication is
set as metadata in the skb by the bridge driver upon FDB/MDB lookup
miss.
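
As a rough user-space illustration of the matching semantics (not the
kernel code), the sketch below models a key/mask pair for the new
attribute. 'struct meta_key', 'meta_set_l2_miss()' and 'meta_match()'
are hypothetical names. The all-ones mask is an assumption based on the
attribute having no separate mask attribute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the flower metadata key. */
struct meta_key {
	uint8_t l2_miss;	/* 0 or 1 */
};

/* Build the key/mask pair for "l2_miss true|false". */
static void meta_set_l2_miss(struct meta_key *key, struct meta_key *mask,
			     uint8_t val)
{
	assert(val <= 1);	/* mirrors a policy that caps the u8 at 1 */
	key->l2_miss = val;
	mask->l2_miss = 0xff;	/* assumed: exact match, no user-supplied mask */
}

/* A packet matches when its masked metadata equals the masked key. */
static bool meta_match(const struct meta_key *key,
		       const struct meta_key *mask, uint8_t pkt_l2_miss)
{
	return (pkt_l2_miss & mask->l2_miss) ==
	       (key->l2_miss & mask->l2_miss);
}
```

With a zeroed mask (attribute not given) every packet matches; once the
attribute is set, only packets whose bit equals the key match.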

Signed-off-by: Ido Schimmel 
---
 include/net/flow_dissector.h |  2 ++
 include/uapi/linux/pkt_cls.h |  2 ++
 net/core/flow_dissector.c|  3 +++
 net/sched/cls_flower.c   | 14 --
 4 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 85b2281576ed..8b41668c77fc 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -243,10 +243,12 @@ struct flow_dissector_key_ip {
  * struct flow_dissector_key_meta:
  * @ingress_ifindex: ingress ifindex
  * @ingress_iftype: ingress interface type
+ * @l2_miss: packet did not match an L2 entry during forwarding
  */
 struct flow_dissector_key_meta {
int ingress_ifindex;
u16 ingress_iftype;
+   u8 l2_miss;
 };
 
 /**
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 648a82f32666..00933dda7b10 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -594,6 +594,8 @@ enum {
 
TCA_FLOWER_KEY_L2TPV3_SID,  /* be32 */
 
+   TCA_FLOWER_L2_MISS, /* u8 */
+
__TCA_FLOWER_MAX,
 };
 
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 25fb0bbc310f..3776c7bdd228 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -241,6 +241,9 @@ void skb_flow_dissect_meta(const struct sk_buff *skb,
 FLOW_DISSECTOR_KEY_META,
 target_container);
meta->ingress_ifindex = skb->skb_iif;
+#if IS_ENABLED(CONFIG_BRIDGE)
+   meta->l2_miss = skb->l2_miss;
+#endif
 }
 EXPORT_SYMBOL(skb_flow_dissect_meta);
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9dbc43388e57..4eb06c6367fc 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -615,7 +615,8 @@ static void *fl_get(struct tcf_proto *tp, u32 handle)
 }
 
 static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
-   [TCA_FLOWER_UNSPEC] = { .type = NLA_UNSPEC },
+   [TCA_FLOWER_UNSPEC] = { .strict_start_type =
+   TCA_FLOWER_L2_MISS },
[TCA_FLOWER_CLASSID]= { .type = NLA_U32 },
[TCA_FLOWER_INDEV]  = { .type = NLA_STRING,
.len = IFNAMSIZ },
@@ -720,7 +721,7 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
[TCA_FLOWER_KEY_PPPOE_SID]  = { .type = NLA_U16 },
[TCA_FLOWER_KEY_PPP_PROTO]  = { .type = NLA_U16 },
[TCA_FLOWER_KEY_L2TPV3_SID] = { .type = NLA_U32 },
-
+   [TCA_FLOWER_L2_MISS]= NLA_POLICY_MAX(NLA_U8, 1),
 };
 
 static const struct nla_policy
@@ -1668,6 +1669,10 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
mask->meta.ingress_ifindex = 0xffffffff;
}
 
+   fl_set_key_val(tb, &key->meta.l2_miss, TCA_FLOWER_L2_MISS,
+  &mask->meta.l2_miss, TCA_FLOWER_UNSPEC,
+  sizeof(key->meta.l2_miss));
+
fl_set_key_val(tb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
   mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
   sizeof(key->eth.dst));
@@ -3074,6 +3079,11 @@ static int fl_dump_key(struct sk_buff *skb, struct net *net,
goto nla_put_failure;
}
 
+   if (fl_dump_key_val(skb, &key->meta.l2_miss,
+   TCA_FLOWER_L2_MISS, &mask->meta.l2_miss,
+   TCA_FLOWER_UNSPEC, sizeof(key->meta.l2_miss)))
+   goto nla_put_failure;
+
if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
sizeof(key->eth.dst)) ||
-- 
2.40.1

[Bridge] [RFC PATCH net-next 1/5] skbuff: bridge: Add layer 2 miss indication

2023-05-09 Thread Ido Schimmel via Bridge

Allow the bridge driver to mark packets that did not match a layer 2
entry during forwarding by adding a 'l2_miss' bit to the skb.

Clear the bit whenever a packet enters the bridge (received from a
bridge port or transmitted via the bridge) and set it if the packet did
not match an FDB/MDB entry.
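
A simplified user-space model of this marking scheme might look as
follows. It is only a sketch: 'struct pkt', 'bridge_enter()' and
'bridge_forward()' are hypothetical names, and the real data path has
many more cases than a single "entry found" flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical packet with only the bit of interest. */
struct pkt {
	unsigned int l2_miss:1;
};

/* Entering the bridge always starts from a clean state, so a stale
 * indication from an earlier bridge cannot leak through.
 */
static void bridge_enter(struct pkt *p)
{
	assert(p != NULL);
	p->l2_miss = 0;
}

/* Mark the packet only if the FDB/MDB lookup did not find an entry. */
static void bridge_forward(struct pkt *p, bool entry_found)
{
	bridge_enter(p);
	if (!entry_found)
		p->l2_miss = 1;
}
```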

Subsequent patches will allow the flower classifier to match on this
bit. The motivating use case is non-DF (Designated Forwarder) filtering
where we would like to prevent decapsulated packets from being flooded
to a multi-homed host.

Do not allocate the bit if the kernel was not compiled with bridge
support and place it after the two bit fields in accordance with commit
4c60d04c2888 ("net: skbuff: push nf_trace down the bitfield"). The bit
does not increase the size of the structure as it is placed at an
existing hole. Layout with allmodconfig:

struct sk_buff {
[...]
__u8   csum_not_inet:1;  /*   132: 3  1 */
__u8   l2_miss:1;/*   132: 4  1 */

/* XXX 3 bits hole, try to pack */
/* XXX 1 byte hole, try to pack */

__u16  tc_index; /*   134 2 */
u16alloc_cpu;/*   136 2 */
[...]
} __attribute__((__aligned__(8)));
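
The "no growth" claim can be checked on a toy structure. The sketch
below is not the real sk_buff: 'struct flags_before' and
'struct flags_after' are hypothetical, and the result relies on how GCC
and Clang pack adjacent 8-bit-wide bit-fields into one byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy layout before the patch: two flag bits, then two u16 fields. */
struct flags_before {
	uint8_t  csum_valid:1;
	uint8_t  csum_not_inet:1;
	uint16_t tc_index;
	uint16_t alloc_cpu;
};

/* Same layout with one more bit placed in the existing hole. */
struct flags_after {
	uint8_t  csum_valid:1;
	uint8_t  csum_not_inet:1;
	uint8_t  l2_miss:1;
	uint16_t tc_index;
	uint16_t alloc_cpu;
};

/* Number of bytes the extra bit adds to the structure. */
static size_t size_growth(void)
{
	return sizeof(struct flags_after) - sizeof(struct flags_before);
}

/* Aborts if the new bit inflated the structure. */
static void check_no_growth(void)
{
	assert(size_growth() == 0);
}
```

For the real structure, pahole output such as the listing above is the
authoritative check.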

Signed-off-by: Ido Schimmel 
---
 include/linux/skbuff.h  | 4 
 net/bridge/br_device.c  | 1 +
 net/bridge/br_forward.c | 3 +++
 net/bridge/br_input.c   | 1 +
 4 files changed, 9 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 738776ab8838..c7a84767ed48 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -801,6 +801,7 @@ typedef unsigned char *sk_buff_data_t;
  * @encap_hdr_csum: software checksum is needed
  * @csum_valid: checksum is already valid
  * @csum_not_inet: use CRC32c to resolve CHECKSUM_PARTIAL
+ * @l2_miss: Packet did not match an L2 entry during forwarding
  * @csum_complete_sw: checksum was completed by software
  * @csum_level: indicates the number of consecutive checksums found in
  * the packet minus one that have been verified as
@@ -991,6 +992,9 @@ struct sk_buff {
 #if IS_ENABLED(CONFIG_IP_SCTP)
__u8csum_not_inet:1;
 #endif
+#if IS_ENABLED(CONFIG_BRIDGE)
+   __u8l2_miss:1;
+#endif
 
 #ifdef CONFIG_NET_SCHED
__u16   tc_index;   /* traffic control index */
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 8eca8a5c80c6..91dbdae4afd4 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -39,6 +39,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
u16 vid = 0;
 
memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
+   skb->l2_miss = 0;
 
rcu_read_lock();
nf_ops = rcu_dereference(nf_br_ops);
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 57744704ff69..5893648c4da2 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -203,6 +203,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
struct net_bridge_port *prev = NULL;
struct net_bridge_port *p;
 
+   skb->l2_miss = 1;
+
list_for_each_entry_rcu(p, &br->port_list, list) {
/* Do not flood unicast traffic to ports that turn it off, nor
 * other traffic if flood off, except for traffic we originate
@@ -295,6 +297,7 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
allow_mode_include = false;
} else {
p = NULL;
+   skb->l2_miss = 1;
}
 
while (p || rp) {
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index fc17b9fd93e6..d8ab5890cbe6 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -334,6 +334,7 @@ static rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
return RX_HANDLER_CONSUMED;
 
memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
+   skb->l2_miss = 0;
 
p = br_port_get_rcu(skb->dev);
if (p->flags & BR_VLAN_TUNNEL)
-- 
2.40.1

[Bridge] [RFC PATCH net-next 3/5] flow_offload: Reject matching on layer 2 miss

2023-05-09 Thread Ido Schimmel via Bridge

Adjust drivers that support the 'FLOW_DISSECTOR_KEY_META' key to reject
filters that try to match on the newly added layer 2 miss option. Add an
extack message to clearly communicate the failure reason to user space.
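
The pattern each driver follows can be sketched in user space as below.
This is only an illustration: 'struct meta_mask' and 'parse_meta()' are
hypothetical, and a plain string pointer stands in for the extack.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of the metadata mask a driver sees. */
struct meta_mask {
	uint32_t ingress_ifindex;
	uint8_t  l2_miss;
};

/* Reject any filter whose mask covers l2_miss and report why. */
static int parse_meta(const struct meta_mask *mask, const char **errmsg)
{
	assert(mask != NULL && errmsg != NULL);

	*errmsg = NULL;
	if (mask->l2_miss) {
		*errmsg = "Can't match on \"l2_miss\"";
		return -EOPNOTSUPP;
	}

	return 0;
}
```

Checking the mask rather than the key means a filter that does not
mention 'l2_miss' at all is unaffected.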

Example:

 # tc filter add dev swp1 egress pref 1 proto all flower skip_sw l2_miss true action drop
 Error: mlxsw_spectrum: Can't match on "l2_miss".
 We have an error talking to the kernel

Signed-off-by: Ido Schimmel 
---
 .../net/ethernet/marvell/prestera/prestera_flower.c|  6 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|  6 ++
 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  6 ++
 drivers/net/ethernet/mscc/ocelot_flower.c  | 10 ++
 4 files changed, 28 insertions(+)

diff --git a/drivers/net/ethernet/marvell/prestera/prestera_flower.c b/drivers/net/ethernet/marvell/prestera/prestera_flower.c
index 91a478b75cbf..3e20e71b0f81 100644
--- a/drivers/net/ethernet/marvell/prestera/prestera_flower.c
+++ b/drivers/net/ethernet/marvell/prestera/prestera_flower.c
@@ -148,6 +148,12 @@ static int prestera_flower_parse_meta(struct prestera_acl_rule *rule,
__be16 key, mask;
 
flow_rule_match_meta(f_rule, &match);
+
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+
if (match.mask->ingress_ifindex != 0xFFFFFFFF) {
NL_SET_ERR_MSG_MOD(f->common.extack,
   "Unsupported ingress ifindex mask");
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 728b82ce4031..516653568330 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2586,6 +2586,12 @@ static int mlx5e_flower_parse_meta(struct net_device *filter_dev,
return 0;
 
flow_rule_match_meta(rule, &match);
+
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+
if (!match.mask->ingress_ifindex)
return 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index 594cdcb90b3d..6fec9223250b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -294,6 +294,12 @@ static int mlxsw_sp_flower_parse_meta(struct mlxsw_sp_acl_rule_info *rulei,
return 0;
 
flow_rule_match_meta(rule, &match);
+
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(f->common.extack, "Can't match on \"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+
if (match.mask->ingress_ifindex != 0xFFFFFFFF) {
NL_SET_ERR_MSG_MOD(f->common.extack, "Unsupported ingress ifindex mask");
return -EINVAL;
diff --git a/drivers/net/ethernet/mscc/ocelot_flower.c b/drivers/net/ethernet/mscc/ocelot_flower.c
index ee052404eb55..e0916afcddfb 100644
--- a/drivers/net/ethernet/mscc/ocelot_flower.c
+++ b/drivers/net/ethernet/mscc/ocelot_flower.c
@@ -592,6 +592,16 @@ ocelot_flower_parse_key(struct ocelot *ocelot, int port, bool ingress,
return -EOPNOTSUPP;
}
 
+   if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_META)) {
+   struct flow_match_meta match;
+
+   flow_rule_match_meta(rule, &match);
+   if (match.mask->l2_miss) {
+   NL_SET_ERR_MSG_MOD(extack, "Can't match on \"l2_miss\"");
+   return -EOPNOTSUPP;
+   }
+   }
+
/* For VCAP ES0 (egress rewriter) we can match on the ingress port */
if (!ingress) {
ret = ocelot_flower_parse_indev(ocelot, port, f, filter);
-- 
2.40.1

[Bridge] [RFC PATCH net-next 0/5] Add layer 2 miss indication and filtering

2023-05-09 Thread Ido Schimmel via Bridge

tl;dr
=====

This patchset adds a single bit to the skb to indicate that a packet
encountered a layer 2 miss in the bridge and extends flower to match on
this metadata. This is required for non-DF (Designated Forwarder)
filtering in EVPN multi-homing which prevents decapsulated BUM packets
from being forwarded multiple times to the same multi-homed host.

Background
==========

In a typical EVPN multi-homing setup each host is multi-homed using a
set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
switches in a rack. These switches act as VTEPs and are not directly
connected (as opposed to MLAG), but can communicate with each other (as
well as with VTEPs in remote racks) via spine switches over L3.

When a host sends a BUM packet over ES1 to VTEP1, the VTEP will flood it
to other VTEPs in the network, including those connected to the host
over ES1. The receiving VTEPs must drop the packet and not forward it
back to the host. This is called "split-horizon filtering" (SPH) [1].

FRR configures SPH filtering using two tc filters. The first, an ingress
filter that matches on packets received from VTEP1 and marks them using
a fwmark (firewall mark). The second, an egress filter configured on the
LAG interface connected to the host that matches on the fwmark and drops
the packets. Example:

 # tc filter add dev vxlan0 ingress pref 1 proto all flower enc_src_ip $VTEP1_IP action skbedit mark 101
 # tc filter add dev bond0 egress pref 1 handle 101 fw action drop

Motivation
==========

For each ES, only one VTEP is elected by the control plane as the DF.
The DF is responsible for forwarding decapsulated BUM traffic to the
host over the ES. The non-DF VTEPs must drop such traffic as otherwise
the host will receive multiple copies of BUM traffic. This is called
"non-DF filtering" [2].

Filtering of multicast and broadcast traffic can be achieved using the
following flower filter:

 # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop

Unlike broadcast and multicast traffic, it is not currently possible to
filter unknown unicast traffic. The classification into unknown unicast
is performed by the bridge driver, but is not visible to other layers.

Implementation
==============

The proposed solution is to add a single bit to the skb that is set by
the bridge for packets that encountered an FDB/MDB miss. The flower
classifier is extended to be able to match on this new metadata bit in a
similar fashion to existing metadata options such as 'indev'.

A bit that is set for every flooded packet would also work, but it does
not allow us to differentiate between registered and unregistered
multicast traffic which might be useful in the future.

A relatively generic name is chosen for this bit - 'l2_miss' - to allow
its use to be extended to other layer 2 devices such as VXLAN, should a
use case arise.

With the above, the control plane can implement a non-DF filter using
the following tc filters:

 # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop
 # tc filter add dev bond0 egress pref 2 proto all flower indev vxlan0 l2_miss true action drop

The first drops broadcast and multicast traffic and the second drops
unknown unicast traffic.
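
To make the combined effect of the two filters concrete, here is a small
user-space sketch of the drop decision. It is an illustration under
simplifying assumptions, not tc code: 'struct egress_pkt' and
'non_df_should_drop()' are hypothetical, and 'from_vxlan' stands in for
the 'indev vxlan0' match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of a packet leaving via the ES (bond0). */
struct egress_pkt {
	bool    from_vxlan;	/* ingress device was the VXLAN device */
	uint8_t dst_mac[6];
	bool    l2_miss;	/* set by the bridge on FDB/MDB miss */
};

/* Returns true if a non-DF VTEP should drop the packet. */
static bool non_df_should_drop(const struct egress_pkt *p)
{
	assert(p != NULL);

	if (!p->from_vxlan)
		return false;

	/* pref 1: group bit set, i.e. broadcast or multicast. */
	if (p->dst_mac[0] & 0x01)
		return true;

	/* pref 2: unknown unicast, visible only via l2_miss. */
	return p->l2_miss;
}
```

Known unicast from the VXLAN device and all locally switched traffic
pass through untouched.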

Testing
=======

A test exercising the different permutations of the 'l2_miss' bit is
added in patch #5.

Patchset overview
=================

Patch #1 adds the new bit to the skb and sets it in the bridge driver
for packets that encountered a miss. The new bit is added in an existing
hole in the skb in order not to inflate this data structure.

Patch #2 extends the flower classifier to be able to match on the new
layer 2 miss metadata.

Patch #3 rejects matching on the new metadata in drivers that already
support the 'FLOW_DISSECTOR_KEY_META' key.

Patch #4 extends mlxsw to be able to match on layer 2 miss.

Patch #5 adds a selftest.

iproute2 patches can be found here [3].

[1] https://datatracker.ietf.org/doc/html/rfc7432#section-8.3
[2] https://datatracker.ietf.org/doc/html/rfc7432#section-8.5
[3] https://github.com/idosch/iproute2/tree/submit/non_df_filter_v1

Ido Schimmel (5):
  skbuff: bridge: Add layer 2 miss indication
  net/sched: flower: Allow matching on layer 2 miss
  flow_offload: Reject matching on layer 2 miss
  mlxsw: spectrum_flower: Add ability to match on layer 2 miss
  selftests: forwarding: Add layer 2 miss test cases

 .../marvell/prestera/prestera_flower.c|   6 +
 .../net/ethernet/mellanox/mlx5/core/en_tc.c   |   6 +
 .../mellanox/mlxsw/core_acl_flex_keys.c   |   1 +
 .../mellanox/mlxsw/core_acl_flex_keys.h   |   3 +-
 .../mellanox/mlxsw/spectrum_acl_flex_keys.c   |   5 +
 .../ethernet/mellanox/mlxsw/spectrum_flower.c |  16 +
 drivers/net/ethernet/mscc/ocelot_flower.c |  10 +
 include/linux/skbuff.h

Re: [Bridge] [Question] Any plan to write/update the bridge doc?

2023-04-24 Thread Ido Schimmel via Bridge

On Mon, Apr 24, 2023 at 05:25:08PM +0800, Hangbin Liu wrote:
> Hi,
> 
> Maybe someone has already asked. The only official Linux bridge document I
> found is a very ancient wiki page[1] or the ip link man page[2][3]. As there are
> many bridge stp/vlan/multicast parameters, should we add a detailed kernel
> document about each parameter? The parameters shown in the ip link page seem
> a little brief.

I suggest improving the man pages instead of adding kernel
documentation. The man pages are the most up to date resource and
therefore the one users probably refer to the most. Also, it's already
quite annoying to patch both "ip-link" and "bridge" man pages when
adding bridge port options. Adding a third document and making sure all
three resources are patched would be a nightmare...

> 
> I'd like to help do this work. But apparently neither my English nor my
> understanding of the code is good enough. Anyway, if you want, I can help
> write a draft version first and you (bridge maintainers) keep working on this.

I can help review man page patches if you want. I'm going to send
some soon. Will copy you.

> 
> [1] https://wiki.linuxfoundation.org/networking/bridge
> [2] https://man7.org/linux/man-pages/man8/bridge.8.html
> [3] https://man7.org/linux/man-pages/man8/ip-link.8.html
> 
> Thanks
> Hangbin

Re: [Bridge] [PATCH v2 net] net: bridge: switchdev: don't notify FDB entries with "master dynamic"

2023-04-19 Thread Ido Schimmel via Bridge

On Tue, Apr 18, 2023 at 06:59:02PM +0300, Vladimir Oltean wrote:
> There is a structural problem in switchdev, where the flag bits in
> struct switchdev_notifier_fdb_info (added_by_user, is_local etc) only
> represent a simplified / denatured view of what's in struct
> net_bridge_fdb_entry :: flags (BR_FDB_ADDED_BY_USER, BR_FDB_LOCAL etc).
> Each time we want to pass more information about struct
> net_bridge_fdb_entry :: flags to struct switchdev_notifier_fdb_info
> (here, BR_FDB_STATIC), we find that FDB entries were already notified to
> switchdev with no regard to this flag, and thus, switchdev drivers had
> no indication whether the notified entries were static or not.

[...]

> Fixes: 6b26b51b1d13 ("net: bridge: Add support for notifying devices about FDB add/del")
> Link: https://lore.kernel.org/netdev/20230327115206.jk5q5l753aoelwus@skbuf/
> Signed-off-by: Vladimir Oltean 
> Reviewed-by: Jesse Brandeburg 

Reviewed-by: Ido Schimmel 
Tested-by: Ido Schimmel

[Bridge] [PATCH net-next v2 8/9] bridge: Allow setting per-{Port, VLAN} neighbor suppression state

2023-04-19 Thread Ido Schimmel via Bridge

Add a new bridge port attribute that allows user space to enable
per-{Port, VLAN} neighbor suppression. Example:

 # bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
 false
 # bridge link set dev swp1 neigh_vlan_suppress on
 # bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
 true
 # bridge link set dev swp1 neigh_vlan_suppress off
 # bridge -d -j -p link show dev swp1 | jq '.[]["neigh_vlan_suppress"]'
 false

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 include/uapi/linux/if_link.h | 1 +
 net/bridge/br_netlink.c  | 8 +++-
 net/core/rtnetlink.c | 2 +-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 8d679688efe0..4ac1000b0ef2 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -569,6 +569,7 @@ enum {
IFLA_BRPORT_MAB,
IFLA_BRPORT_MCAST_N_GROUPS,
IFLA_BRPORT_MCAST_MAX_GROUPS,
+   IFLA_BRPORT_NEIGH_VLAN_SUPPRESS,
__IFLA_BRPORT_MAX
 };
 #define IFLA_BRPORT_MAX (__IFLA_BRPORT_MAX - 1)
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index fefb1c0e248b..05c5863d2e20 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -189,6 +189,7 @@ static inline size_t br_port_info_size(void)
+ nla_total_size(1) /* IFLA_BRPORT_ISOLATED */
+ nla_total_size(1) /* IFLA_BRPORT_LOCKED */
+ nla_total_size(1) /* IFLA_BRPORT_MAB */
+   + nla_total_size(1) /* IFLA_BRPORT_NEIGH_VLAN_SUPPRESS */
+ nla_total_size(sizeof(struct ifla_bridge_id)) /* IFLA_BRPORT_ROOT_ID */
+ nla_total_size(sizeof(struct ifla_bridge_id)) /* IFLA_BRPORT_BRIDGE_ID */
+ nla_total_size(sizeof(u16))   /* IFLA_BRPORT_DESIGNATED_PORT */
@@ -278,7 +279,9 @@ static int br_port_fill_attrs(struct sk_buff *skb,
   !!(p->flags & BR_MRP_LOST_IN_CONT)) ||
nla_put_u8(skb, IFLA_BRPORT_ISOLATED, !!(p->flags & BR_ISOLATED)) ||
nla_put_u8(skb, IFLA_BRPORT_LOCKED, !!(p->flags & BR_PORT_LOCKED)) ||
-   nla_put_u8(skb, IFLA_BRPORT_MAB, !!(p->flags & BR_PORT_MAB)))
+   nla_put_u8(skb, IFLA_BRPORT_MAB, !!(p->flags & BR_PORT_MAB)) ||
+   nla_put_u8(skb, IFLA_BRPORT_NEIGH_VLAN_SUPPRESS,
+  !!(p->flags & BR_NEIGH_VLAN_SUPPRESS)))
return -EMSGSIZE;
 
timerval = br_timer_value(&p->message_age_timer);
@@ -891,6 +894,7 @@ static const struct nla_policy br_port_policy[IFLA_BRPORT_MAX + 1] = {
[IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT] = { .type = NLA_U32 },
[IFLA_BRPORT_MCAST_N_GROUPS] = { .type = NLA_REJECT },
[IFLA_BRPORT_MCAST_MAX_GROUPS] = { .type = NLA_U32 },
+   [IFLA_BRPORT_NEIGH_VLAN_SUPPRESS] = NLA_POLICY_MAX(NLA_U8, 1),
 };
 
 /* Change the state of the port and notify spanning tree */
@@ -957,6 +961,8 @@ static int br_setport(struct net_bridge_port *p, struct nlattr *tb[],
br_set_port_flag(p, tb, IFLA_BRPORT_ISOLATED, BR_ISOLATED);
br_set_port_flag(p, tb, IFLA_BRPORT_LOCKED, BR_PORT_LOCKED);
br_set_port_flag(p, tb, IFLA_BRPORT_MAB, BR_PORT_MAB);
+   br_set_port_flag(p, tb, IFLA_BRPORT_NEIGH_VLAN_SUPPRESS,
+BR_NEIGH_VLAN_SUPPRESS);
 
if ((p->flags & BR_PORT_MAB) &&
(!(p->flags & BR_PORT_LOCKED) || !(p->flags & BR_LEARNING))) {
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e844d75220fb..653901a1bf75 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -61,7 +61,7 @@
 #include "dev.h"
 
 #define RTNL_MAX_TYPE  50
-#define RTNL_SLAVE_MAX_TYPE42
+#define RTNL_SLAVE_MAX_TYPE43
 
 struct rtnl_link {
rtnl_doit_func  doit;
-- 
2.37.3

[Bridge] [PATCH net-next v2 9/9] selftests: net: Add bridge neighbor suppression test

2023-04-19 Thread Ido Schimmel via Bridge

Add test cases for bridge neighbor suppression, testing both per-port
and per-{Port, VLAN} neighbor suppression with both ARP and NS packets.

Example truncated output:

 # ./test_bridge_neigh_suppress.sh
 [...]
 Tests passed: 148
 Tests failed:   0

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 tools/testing/selftests/net/Makefile  |   1 +
 .../net/test_bridge_neigh_suppress.sh | 862 ++
 2 files changed, 863 insertions(+)
 create mode 100755 tools/testing/selftests/net/test_bridge_neigh_suppress.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 1de34ec99290..c12df57d5539 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -83,6 +83,7 @@ TEST_GEN_FILES += nat6to4.o
 TEST_GEN_FILES += ip_local_port_range
 TEST_GEN_FILES += bind_wildcard
 TEST_PROGS += test_vxlan_mdb.sh
+TEST_PROGS += test_bridge_neigh_suppress.sh
 
 TEST_FILES := settings
 
diff --git a/tools/testing/selftests/net/test_bridge_neigh_suppress.sh b/tools/testing/selftests/net/test_bridge_neigh_suppress.sh
new file mode 100755
index 000000000000..d80f2cd87614
--- /dev/null
+++ b/tools/testing/selftests/net/test_bridge_neigh_suppress.sh
@@ -0,0 +1,862 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# This test is for checking bridge neighbor suppression functionality. The
+# topology consists of two bridges (VTEPs) connected using VXLAN. A single
+# host is connected to each bridge over multiple VLANs. The test checks that
+# ARP/NS messages from the first host are suppressed on the VXLAN port when
+# they should be.
+#
+# +---+  ++
+# | h1|  | h2 |
+# |   |  ||
+# | + eth0.10 |  | + eth0.10  |
+# | | 192.0.2.1/28|  | | 192.0.2.2/28 |
+# | | 2001:db8:1::1/64|  | | 2001:db8:1::2/64 |
+# | | |  | |  |
+# | |  + eth0.20  |  | |  + eth0.20   |
+# | \  | 192.0.2.17/28|  | \  | 192.0.2.18/28 |
+# |  \ | 2001:db8:2::1/64 |  |  \ | 2001:db8:2::2/64  |
+# |   \|  |  |   \|   |
+# |+ eth0 |  |+ eth0  |
+# +|--+  +|---+
+#  |  |
+#  |  |
+# +|---+ +|---+
+# |+ swp1   + vx0  | |+ swp1   + vx0  |
+# |||  | |||  |
+# ||   br0  |  | |||  |
+# |++---+  | |++---+  |
+# | |  | | |  |
+# | |  | | |  |
+# | +---+---+  | | +---+---+  |
+# | |   |  | | |   |  |
+# | |   |  | | |   |  |
+# | +   +  | | +   +  |
+# |  br0.10  br0.20| |  br0.10  br0.20|
+# || ||
+# | 192.0.2.33 | | 192.0.2.34 |
+# | + lo   | | + lo   |
+# || ||
+# || ||
+# |   192.0.2.49/28| |192.0.2.50/28   |
+# |   veth0 +---+ veth0   |
+# || ||
+# | sw1| | sw2|
+# ++ ++
+
+ret=0
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+# All tests in this script. Can be overridden with -t option.
+TESTS="
+   neigh_suppress_arp
+   neigh_suppress_ns
+   neigh_vlan_suppress_arp
+   neigh_vlan_suppress_ns
+"
+VERBOSE=0
+PAUSE_ON_FAIL=no
+PAUSE=no
+
+
+# Utilities
+
+log_test()
+{
+   local rc=$1
+   local expected=$2
+   local msg="$3"
+
+   if [ ${rc} -eq ${expected} ]; then
+

[Bridge] [PATCH net-next v2 7/9] bridge: vlan: Allow setting VLAN neighbor suppression state

2023-04-19 Thread Ido Schimmel via Bridge

Add a new VLAN attribute that allows user space to set the neighbor
suppression state of the port VLAN. Example:

 # bridge -d -j -p vlan show dev swp1 vid 10 | jq '.[]["vlans"][]["neigh_suppress"]'
 false
 # bridge vlan set vid 10 dev swp1 neigh_suppress on
 # bridge -d -j -p vlan show dev swp1 vid 10 | jq '.[]["vlans"][]["neigh_suppress"]'
 true
 # bridge vlan set vid 10 dev swp1 neigh_suppress off
 # bridge -d -j -p vlan show dev swp1 vid 10 | jq '.[]["vlans"][]["neigh_suppress"]'
 false

 # bridge vlan set vid 10 dev br0 neigh_suppress on
 Error: bridge: Can't set neigh_suppress for non-port vlans.
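
The option handling can be modelled in user space roughly as follows.
This is only a sketch: 'vlan_set_neigh_suppress()' and the flag value
are hypothetical, and the flag word stands in for the VLAN's private
flags.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit value, chosen only for this example. */
#define VLFLAG_NEIGH_SUPPRESS_ENABLED (1u << 0)

/* Apply the requested state; report whether anything changed so the
 * caller knows if a notification is needed.
 */
static int vlan_set_neigh_suppress(unsigned int *priv_flags,
				   bool is_port_vlan, bool val,
				   bool *changed)
{
	bool enabled;

	assert(priv_flags != NULL && changed != NULL);

	if (!is_port_vlan)
		return -EINVAL;	/* only per-port VLANs carry this option */

	enabled = *priv_flags & VLFLAG_NEIGH_SUPPRESS_ENABLED;
	if (val != enabled) {
		*priv_flags ^= VLFLAG_NEIGH_SUPPRESS_ENABLED;
		*changed = true;
	}

	return 0;
}
```

Setting the option to its current value leaves 'changed' untouched,
which matches the on/off/on sequence shown above producing a state
change each time.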

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 include/uapi/linux/if_bridge.h |  1 +
 net/bridge/br_vlan.c   |  1 +
 net/bridge/br_vlan_options.c   | 20 +++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index c9d624f528c5..f95326fce6bb 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -525,6 +525,7 @@ enum {
BRIDGE_VLANDB_ENTRY_MCAST_ROUTER,
BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS,
BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS,
+   BRIDGE_VLANDB_ENTRY_NEIGH_SUPPRESS,
__BRIDGE_VLANDB_ENTRY_MAX,
 };
 #define BRIDGE_VLANDB_ENTRY_MAX (__BRIDGE_VLANDB_ENTRY_MAX - 1)
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 8a3dbc09ba38..15f44d026e75 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -2134,6 +2134,7 @@ static const struct nla_policy br_vlan_db_policy[BRIDGE_VLANDB_ENTRY_MAX + 1] =
[BRIDGE_VLANDB_ENTRY_MCAST_ROUTER]  = { .type = NLA_U8 },
[BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS]= { .type = NLA_REJECT },
[BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS]  = { .type = NLA_U32 },
+   [BRIDGE_VLANDB_ENTRY_NEIGH_SUPPRESS]= NLA_POLICY_MAX(NLA_U8, 1),
 };
 
 static int br_vlan_rtm_process_one(struct net_device *dev,
diff --git a/net/bridge/br_vlan_options.c b/net/bridge/br_vlan_options.c
index e378c2f3a9e2..8fa89b04ee94 100644
--- a/net/bridge/br_vlan_options.c
+++ b/net/bridge/br_vlan_options.c
@@ -52,7 +52,9 @@ bool br_vlan_opts_fill(struct sk_buff *skb, const struct net_bridge_vlan *v,
   const struct net_bridge_port *p)
 {
if (nla_put_u8(skb, BRIDGE_VLANDB_ENTRY_STATE, br_vlan_get_state(v)) ||
-   !__vlan_tun_put(skb, v))
+   !__vlan_tun_put(skb, v) ||
+   nla_put_u8(skb, BRIDGE_VLANDB_ENTRY_NEIGH_SUPPRESS,
+  !!(v->priv_flags & BR_VLFLAG_NEIGH_SUPPRESS_ENABLED)))
return false;
 
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
@@ -80,6 +82,7 @@ size_t br_vlan_opts_nl_size(void)
   + nla_total_size(sizeof(u32)) /* 
BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS */
   + nla_total_size(sizeof(u32)) /* 
BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS */
 #endif
+  + nla_total_size(sizeof(u8)) /* BRIDGE_VLANDB_ENTRY_NEIGH_SUPPRESS */
   + 0;
 }
 
@@ -239,6 +242,21 @@ static int br_vlan_process_one_opts(const struct net_bridge *br,
}
 #endif
 
+   if (tb[BRIDGE_VLANDB_ENTRY_NEIGH_SUPPRESS]) {
+   bool enabled = v->priv_flags & BR_VLFLAG_NEIGH_SUPPRESS_ENABLED;
+   bool val = nla_get_u8(tb[BRIDGE_VLANDB_ENTRY_NEIGH_SUPPRESS]);
+
+   if (!p) {
+   NL_SET_ERR_MSG_MOD(extack, "Can't set neigh_suppress for non-port vlans");
+   return -EINVAL;
+   }
+
+   if (val != enabled) {
+   v->priv_flags ^= BR_VLFLAG_NEIGH_SUPPRESS_ENABLED;
+   *changed = true;
+   }
+   }
+
return 0;
 }
 
-- 
2.37.3

[Bridge] [PATCH net-next v2 5/9] bridge: Encapsulate data path neighbor suppression logic

2023-04-19 Thread Ido Schimmel via Bridge

Currently, there are various places in the bridge data path that check
whether neighbor suppression is enabled on a given bridge port.

As a preparation for per-{Port, VLAN} neighbor suppression, encapsulate
this logic in a function and pass the VLAN ID of the packet as an
argument.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 net/bridge/br_arp_nd_proxy.c | 15 ++-
 net/bridge/br_forward.c  |  3 ++-
 net/bridge/br_private.h  |  1 +
 3 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c
index 016a25a9e444..16c3a1c5d0ae 100644
--- a/net/bridge/br_arp_nd_proxy.c
+++ b/net/bridge/br_arp_nd_proxy.c
@@ -158,7 +158,7 @@ void br_do_proxy_suppress_arp(struct sk_buff *skb, struct net_bridge *br,
return;
 
if (br_opt_get(br, BROPT_NEIGH_SUPPRESS_ENABLED)) {
-   if (p && (p->flags & BR_NEIGH_SUPPRESS))
+   if (br_is_neigh_suppress_enabled(p, vid))
return;
if (parp->ar_op != htons(ARPOP_RREQUEST) &&
parp->ar_op != htons(ARPOP_RREPLY) &&
@@ -202,8 +202,8 @@ void br_do_proxy_suppress_arp(struct sk_buff *skb, struct net_bridge *br,
bool replied = false;
 
if ((p && (p->flags & BR_PROXYARP)) ||
-   (f->dst && (f->dst->flags & (BR_PROXYARP_WIFI |
-BR_NEIGH_SUPPRESS)))) {
+   (f->dst && (f->dst->flags & BR_PROXYARP_WIFI)) ||
+   br_is_neigh_suppress_enabled(f->dst, vid)) {
if (!vid)
br_arp_send(br, p, skb->dev, sip, tip,
sha, n->ha, sha, 0, 0);
@@ -407,7 +407,7 @@ void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br,
 
BR_INPUT_SKB_CB(skb)->proxyarp_replied = 0;
 
-   if (p && (p->flags & BR_NEIGH_SUPPRESS))
+   if (br_is_neigh_suppress_enabled(p, vid))
return;
 
if (msg->icmph.icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT &&
@@ -461,7 +461,7 @@ void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br,
if (f) {
bool replied = false;
 
-   if (f->dst && (f->dst->flags & BR_NEIGH_SUPPRESS)) {
+   if (br_is_neigh_suppress_enabled(f->dst, vid)) {
if (vid != 0)
br_nd_send(br, p, skb, n,
   skb->vlan_proto,
@@ -483,3 +483,8 @@ void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br,
}
 }
 #endif
+
+bool br_is_neigh_suppress_enabled(const struct net_bridge_port *p, u16 vid)
+{
+   return p && (p->flags & BR_NEIGH_SUPPRESS);
+}
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 94a8d757ae4e..57744704ff69 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -226,7 +226,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
if (p->flags & BR_PROXYARP)
continue;
if (BR_INPUT_SKB_CB(skb)->proxyarp_replied &&
-   (p->flags & (BR_PROXYARP_WIFI | BR_NEIGH_SUPPRESS)))
+   ((p->flags & BR_PROXYARP_WIFI) ||
+br_is_neigh_suppress_enabled(p, vid)))
continue;
 
prev = maybe_deliver(prev, p, skb, local_orig);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index b17fc821ecc8..2119729ded2b 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -2220,4 +2220,5 @@ void br_do_proxy_suppress_arp(struct sk_buff *skb, struct net_bridge *br,
 void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br,
   u16 vid, struct net_bridge_port *p, struct nd_msg *msg);
 struct nd_msg *br_is_nd_neigh_msg(struct sk_buff *skb, struct nd_msg *m);
+bool br_is_neigh_suppress_enabled(const struct net_bridge_port *p, u16 vid);
 #endif
-- 
2.37.3

[Bridge] [PATCH net-next v2 6/9] bridge: Add per-{Port, VLAN} neighbor suppression data path support

2023-04-19 Thread Ido Schimmel via Bridge

When the bridge is not VLAN-aware (i.e., VLAN ID is 0), determine if
neighbor suppression is enabled on a given bridge port solely based on
the existing 'BR_NEIGH_SUPPRESS' flag.

Otherwise, if the bridge is VLAN-aware, first check if per-{Port, VLAN}
neighbor suppression is enabled on the given bridge port using the
'BR_NEIGH_VLAN_SUPPRESS' flag. If so, look up the VLAN and check whether
it has neighbor suppression enabled based on the per-VLAN
'BR_VLFLAG_NEIGH_SUPPRESS_ENABLED' flag.

If the bridge is VLAN-aware, but the bridge port does not have
per-{Port, VLAN} neighbor suppression enabled, then fall back to
determining neighbor suppression based on the 'BR_NEIGH_SUPPRESS' flag.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 net/bridge/br_arp_nd_proxy.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c
index 16c3a1c5d0ae..c7869a286df4 100644
--- a/net/bridge/br_arp_nd_proxy.c
+++ b/net/bridge/br_arp_nd_proxy.c
@@ -486,5 +486,21 @@ void br_do_suppress_nd(struct sk_buff *skb, struct net_bridge *br,
 
 bool br_is_neigh_suppress_enabled(const struct net_bridge_port *p, u16 vid)
 {
-   return p && (p->flags & BR_NEIGH_SUPPRESS);
+   if (!p)
+   return false;
+
+   if (!vid)
+   return !!(p->flags & BR_NEIGH_SUPPRESS);
+
+   if (p->flags & BR_NEIGH_VLAN_SUPPRESS) {
+   struct net_bridge_vlan_group *vg = nbp_vlan_group_rcu(p);
+   struct net_bridge_vlan *v;
+
+   v = br_vlan_find(vg, vid);
+   if (!v)
+   return false;
+   return !!(v->priv_flags & BR_VLFLAG_NEIGH_SUPPRESS_ENABLED);
+   } else {
+   return !!(p->flags & BR_NEIGH_SUPPRESS);
+   }
 }
-- 
2.37.3
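
Note: the three cases described in the commit message above can be exercised
outside the kernel with a small sketch like the one below. It is a
hypothetical user-space model, not the bridge implementation: the sketch_*
structures, the linear VLAN lookup (standing in for br_vlan_find()) and the
port flag values are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define SKETCH_NEIGH_SUPPRESS (1UL << 0)
#define SKETCH_NEIGH_VLAN_SUPPRESS (1UL << 1)
#define SKETCH_VLFLAG_NEIGH_SUPPRESS_ENABLED (1U << 4)

struct sketch_vlan {
	uint16_t vid;
	unsigned int priv_flags;
};

struct sketch_port {
	unsigned long flags;
	const struct sketch_vlan *vlans;	/* VLANs configured on the port */
	size_t num_vlans;
};

/* Linear search standing in for the kernel's VLAN lookup. */
static const struct sketch_vlan *
sketch_vlan_find(const struct sketch_port *p, uint16_t vid)
{
	for (size_t i = 0; i < p->num_vlans; i++)
		if (p->vlans[i].vid == vid)
			return &p->vlans[i];
	return NULL;
}

static bool sketch_is_neigh_suppress_enabled(const struct sketch_port *p,
					     uint16_t vid)
{
	if (!p)
		return false;

	/* Bridge is not VLAN-aware: only the per-port flag matters. */
	if (!vid)
		return !!(p->flags & SKETCH_NEIGH_SUPPRESS);

	/* Per-{Port, VLAN} mode: the per-VLAN flag decides. */
	if (p->flags & SKETCH_NEIGH_VLAN_SUPPRESS) {
		const struct sketch_vlan *v = sketch_vlan_find(p, vid);

		return v && (v->priv_flags & SKETCH_VLFLAG_NEIGH_SUPPRESS_ENABLED);
	}

	/* VLAN-aware bridge without per-VLAN mode: fall back to the port flag. */
	return !!(p->flags & SKETCH_NEIGH_SUPPRESS);
}
```

In this model, a port with both port-level flags set and VLAN 10 marked as
suppressed reports suppression for VID 10 and VID 0, but not for a VLAN
that is unmarked or missing, which matches the "has no effect" wording used
for 'BR_NEIGH_SUPPRESS' elsewhere in the series.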

[Bridge] [PATCH net-next v2 4/9] bridge: Take per-{Port, VLAN} neighbor suppression into account

2023-04-19 Thread Ido Schimmel via Bridge

The bridge driver gates the neighbor suppression code behind an internal
per-bridge flag called 'BROPT_NEIGH_SUPPRESS_ENABLED'. The flag is set
when at least one bridge port has neighbor suppression enabled.

As a preparation for per-{Port, VLAN} neighbor suppression, make sure
the global flag is also set if per-{Port, VLAN} neighbor suppression is
enabled. That is, when the 'BR_NEIGH_VLAN_SUPPRESS' flag is set on at
least one bridge port.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 net/bridge/br_arp_nd_proxy.c | 2 +-
 net/bridge/br_if.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c
index b45c00c01dea..016a25a9e444 100644
--- a/net/bridge/br_arp_nd_proxy.c
+++ b/net/bridge/br_arp_nd_proxy.c
@@ -30,7 +30,7 @@ void br_recalculate_neigh_suppress_enabled(struct net_bridge *br)
bool neigh_suppress = false;
 
list_for_each_entry(p, &br->port_list, list) {
-   if (p->flags & BR_NEIGH_SUPPRESS) {
+   if (p->flags & (BR_NEIGH_SUPPRESS | BR_NEIGH_VLAN_SUPPRESS)) {
neigh_suppress = true;
break;
}
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 24f01ff113f0..3f04b40f6056 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -759,7 +759,7 @@ void br_port_flags_change(struct net_bridge_port *p, unsigned long mask)
if (mask & BR_AUTO_MASK)
nbp_update_port_count(br);
 
-   if (mask & BR_NEIGH_SUPPRESS)
+   if (mask & (BR_NEIGH_SUPPRESS | BR_NEIGH_VLAN_SUPPRESS))
br_recalculate_neigh_suppress_enabled(br);
 }
 
-- 
2.37.3
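
Note: the recalculation of the per-bridge flag described above amounts to an
"any port has either flag" scan. A minimal user-space sketch, assuming a
plain array of port flag words in place of the kernel's port list (the
sketch_* names and flag values are hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define SKETCH_NEIGH_SUPPRESS (1UL << 0)
#define SKETCH_NEIGH_VLAN_SUPPRESS (1UL << 1)

/*
 * Return the value the bridge-wide gate should take: true as soon as one
 * port has either the per-port or the per-{Port, VLAN} flag set.
 */
static bool sketch_recalc_neigh_suppress(const unsigned long *port_flags,
					 size_t num_ports)
{
	for (size_t i = 0; i < num_ports; i++)
		if (port_flags[i] &
		    (SKETCH_NEIGH_SUPPRESS | SKETCH_NEIGH_VLAN_SUPPRESS))
			return true;

	return false;
}
```

With this shape, a bridge whose only configured port uses per-{Port, VLAN}
mode still turns the gate on, which is the case the patch adds.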

[Bridge] [PATCH net-next v2 3/9] bridge: Add internal flags for per-{Port, VLAN} neighbor suppression

2023-04-19 Thread Ido Schimmel via Bridge

Add two internal flags that will be used to enable / disable per-{Port,
VLAN} neighbor suppression:

1. 'BR_NEIGH_VLAN_SUPPRESS': A per-port flag used to indicate that
per-{Port, VLAN} neighbor suppression is enabled on the bridge port.
When set, 'BR_NEIGH_SUPPRESS' has no effect.

2. 'BR_VLFLAG_NEIGH_SUPPRESS_ENABLED': A per-VLAN flag used to indicate
that neighbor suppression is enabled on the given VLAN.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 include/linux/if_bridge.h | 1 +
 net/bridge/br_private.h   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index 1668ac4d7adc..3ff96ae31bf6 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -60,6 +60,7 @@ struct br_ip_list {
 #define BR_TX_FWD_OFFLOAD  BIT(20)
 #define BR_PORT_LOCKED BIT(21)
 #define BR_PORT_MAB BIT(22)
+#define BR_NEIGH_VLAN_SUPPRESS BIT(23)
 
 #define BR_DEFAULT_AGEING_TIME (300 * HZ)
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 1ff4d64ab584..b17fc821ecc8 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -178,6 +178,7 @@ enum {
BR_VLFLAG_ADDED_BY_SWITCHDEV = BIT(1),
BR_VLFLAG_MCAST_ENABLED = BIT(2),
BR_VLFLAG_GLOBAL_MCAST_ENABLED = BIT(3),
+   BR_VLFLAG_NEIGH_SUPPRESS_ENABLED = BIT(4),
 };
 
 /**
-- 
2.37.3

[Bridge] [PATCH net-next v2 2/9] bridge: Pass VLAN ID to br_flood()

2023-04-19 Thread Ido Schimmel via Bridge

Subsequent patches are going to add per-{Port, VLAN} neighbor
suppression, which will require br_flood() to potentially suppress ARP /
NS packets on a per-{Port, VLAN} basis.

As a preparation, pass the VLAN ID of the packet as another argument to
br_flood().

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 net/bridge/br_device.c  | 8 
 net/bridge/br_forward.c | 3 ++-
 net/bridge/br_input.c   | 2 +-
 net/bridge/br_private.h | 3 ++-
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index df47c876230e..8eca8a5c80c6 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -80,10 +80,10 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 
dest = eth_hdr(skb)->h_dest;
if (is_broadcast_ether_addr(dest)) {
-   br_flood(br, skb, BR_PKT_BROADCAST, false, true);
+   br_flood(br, skb, BR_PKT_BROADCAST, false, true, vid);
} else if (is_multicast_ether_addr(dest)) {
if (unlikely(netpoll_tx_running(dev))) {
-   br_flood(br, skb, BR_PKT_MULTICAST, false, true);
+   br_flood(br, skb, BR_PKT_MULTICAST, false, true, vid);
goto out;
}
if (br_multicast_rcv(&brmctx, &pmctx_null, vlan, skb, vid)) {
@@ -96,11 +96,11 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst))
br_multicast_flood(mdst, skb, brmctx, false, true);
else
-   br_flood(br, skb, BR_PKT_MULTICAST, false, true);
+   br_flood(br, skb, BR_PKT_MULTICAST, false, true, vid);
} else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) {
br_forward(dst->dst, skb, false, true);
} else {
-   br_flood(br, skb, BR_PKT_UNICAST, false, true);
+   br_flood(br, skb, BR_PKT_UNICAST, false, true, vid);
}
 out:
rcu_read_unlock();
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 0fe133fa214c..94a8d757ae4e 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -197,7 +197,8 @@ static struct net_bridge_port *maybe_deliver(
 
 /* called under rcu_read_lock */
 void br_flood(struct net_bridge *br, struct sk_buff *skb,
- enum br_pkt_type pkt_type, bool local_rcv, bool local_orig)
+ enum br_pkt_type pkt_type, bool local_rcv, bool local_orig,
+ u16 vid)
 {
struct net_bridge_port *prev = NULL;
struct net_bridge_port *p;
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 3027e8f6be15..fc17b9fd93e6 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -207,7 +207,7 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
br_forward(dst->dst, skb, local_rcv, false);
} else {
if (!mcast_hit)
-   br_flood(br, skb, pkt_type, local_rcv, false);
+   br_flood(br, skb, pkt_type, local_rcv, false, vid);
else
br_multicast_flood(mdst, skb, brmctx, local_rcv, false);
}
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 7264fd40f82f..1ff4d64ab584 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -849,7 +849,8 @@ void br_forward(const struct net_bridge_port *to, struct sk_buff *skb,
bool local_rcv, bool local_orig);
 int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb);
 void br_flood(struct net_bridge *br, struct sk_buff *skb,
- enum br_pkt_type pkt_type, bool local_rcv, bool local_orig);
+ enum br_pkt_type pkt_type, bool local_rcv, bool local_orig,
+ u16 vid);
 
 /* return true if both source port and dest port are isolated */
 static inline bool br_skb_isolated(const struct net_bridge_port *to,
-- 
2.37.3

[Bridge] [PATCH net-next v2 1/9] bridge: Reorder neighbor suppression check when flooding

2023-04-19 Thread Ido Schimmel via Bridge

The bridge does not flood ARP / NS packets for which a reply was sent to
bridge ports that have neighbor suppression enabled.

Subsequent patches are going to add per-{Port, VLAN} neighbor
suppression, which is going to make it more expensive to check whether
neighbor suppression is enabled since a VLAN lookup will be required.

Therefore, instead of unnecessarily performing this lookup for every
packet, only perform it for ARP / NS packets for which a reply was sent.

Signed-off-by: Ido Schimmel 
Acked-by: Nikolay Aleksandrov 
---
 net/bridge/br_forward.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 02bb620d3b8d..0fe133fa214c 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -224,8 +224,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
/* Do not flood to ports that enable proxy ARP */
if (p->flags & BR_PROXYARP)
continue;
-   if ((p->flags & (BR_PROXYARP_WIFI | BR_NEIGH_SUPPRESS)) &&
-   BR_INPUT_SKB_CB(skb)->proxyarp_replied)
+   if (BR_INPUT_SKB_CB(skb)->proxyarp_replied &&
+   (p->flags & (BR_PROXYARP_WIFI | BR_NEIGH_SUPPRESS)))
continue;
 
prev = maybe_deliver(prev, p, skb, local_orig);
-- 
2.37.3
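
Note: the benefit of the reordering above comes from short-circuit
evaluation of "&&". The sketch below is a hypothetical user-space model, not
the bridge code: a counter stands in for the cost of the future VLAN lookup,
and all sketch_* names are assumptions made for illustration.

```c
#include <stdbool.h>

/* Counts how often the (potentially expensive) suppression check runs. */
static unsigned int sketch_lookups;

/* Stand-in for a per-{Port, VLAN} check that requires a VLAN lookup. */
static bool sketch_expensive_suppress_check(bool suppress_enabled)
{
	sketch_lookups++;
	return suppress_enabled;
}

/*
 * Decide whether flooding to a port should be skipped. The cheap
 * "reply was already sent" test comes first, so the expensive check
 * only runs for packets that were actually answered.
 */
static bool sketch_skip_port(bool proxyarp_replied, bool suppress_enabled)
{
	return proxyarp_replied &&
	       sketch_expensive_suppress_check(suppress_enabled);
}
```

In this model the counter only advances for packets with a reply already
sent, so ordinary flooded traffic never pays for the lookup.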

[Bridge] [PATCH net-next v2 0/9] bridge: Add per-{Port, VLAN} neighbor suppression

2023-04-19 Thread Ido Schimmel via Bridge

Background
==========

In order to minimize the flooding of ARP and ND messages in the VXLAN
network, EVPN includes provisions [1] that allow participating VTEPs to
suppress such messages in case they know the MAC-IP binding and can
reply on behalf of the remote host. In Linux, the above is implemented
in the bridge driver using a per-port option called "neigh_suppress"
that was added in kernel version 4.15 [2].

Motivation
==========

Some applications use ARP messages as keepalives between the application
nodes in the network. This works perfectly well when two nodes are
connected to the same VTEP. When a node goes down it will stop
responding to ARP requests and the other node will notice it
immediately.

However, when the two nodes are connected to different VTEPs and
neighbor suppression is enabled, the local VTEP will reply to ARP
requests even after the remote node went down, until certain timers
expire and the EVPN control plane decides to withdraw the MAC/IP
Advertisement route for the address. Therefore, some users would like to
be able to disable neighbor suppression on VLANs where such applications
reside and keep it enabled on the rest.

Implementation
==============

The proposed solution is to allow user space to control neighbor
suppression on a per-{Port, VLAN} basis, in a similar fashion to other
per-port options that gained per-{Port, VLAN} counterparts such as
"mcast_router". This allows users to benefit from the operational
simplicity and scalability associated with shared VXLAN devices (i.e.,
external / collect-metadata mode), while still allowing for per-VLAN/VNI
neighbor suppression control.

The user interface is extended with a new "neigh_vlan_suppress" bridge
port option that allows user space to enable per-{Port, VLAN} neighbor
suppression on the bridge port. When enabled, the existing
"neigh_suppress" option has no effect and neighbor suppression is
controlled using a new "neigh_suppress" VLAN option. Example usage:

 # bridge link set dev vxlan0 neigh_vlan_suppress on
 # bridge vlan add vid 10 dev vxlan0
 # bridge vlan set vid 10 dev vxlan0 neigh_suppress on

Testing
=======

Tested using existing bridge selftests. Added a dedicated selftest in
the last patch.

Patchset overview
=================

Patches #1-#5 are preparations.

Patch #6 adds per-{Port, VLAN} neighbor suppression support to the
bridge's data path.

Patches #7-#8 add the required netlink attributes to enable the feature.

Patch #9 adds a selftest.

iproute2 patches can be found here [3].

Changelog
=========

Since RFC [4]:

No changes.

[1] https://www.rfc-editor.org/rfc/rfc7432#section-10
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a42317785c898c0ed46db45a33b0cc71b671bf29
[3] https://github.com/idosch/iproute2/tree/submit/neigh_suppress_v1
[4] https://lore.kernel.org/netdev/20230413095830.2182382-1-ido...@nvidia.com/

Ido Schimmel (9):
  bridge: Reorder neighbor suppression check when flooding
  bridge: Pass VLAN ID to br_flood()
  bridge: Add internal flags for per-{Port, VLAN} neighbor suppression
  bridge: Take per-{Port, VLAN} neighbor suppression into account
  bridge: Encapsulate data path neighbor suppression logic
  bridge: Add per-{Port, VLAN} neighbor suppression data path support
  bridge: vlan: Allow setting VLAN neighbor suppression state
  bridge: Allow setting per-{Port, VLAN} neighbor suppression state
  selftests: net: Add bridge neighbor suppression test

 include/linux/if_bridge.h |   1 +
 include/uapi/linux/if_bridge.h|   1 +
 include/uapi/linux/if_link.h  |   1 +
 net/bridge/br_arp_nd_proxy.c  |  33 +-
 net/bridge/br_device.c|   8 +-
 net/bridge/br_forward.c   |   8 +-
 net/bridge/br_if.c|   2 +-
 net/bridge/br_input.c |   2 +-
 net/bridge/br_netlink.c   |   8 +-
 net/bridge/br_private.h   |   5 +-
 net/bridge/br_vlan.c  |   1 +
 net/bridge/br_vlan_options.c  |  20 +-
 net/core/rtnetlink.c  |   2 +-
 tools/testing/selftests/net/Makefile  |   1 +
 .../net/test_bridge_neigh_suppress.sh | 862 ++
 15 files changed, 936 insertions(+), 19 deletions(-)
 create mode 100755 tools/testing/selftests/net/test_bridge_neigh_suppress.sh

-- 
2.37.3
