from:"Petr Machata"

Re: [patch net-next v4 6/6] selftests: virtio_net: add initial tests

2024-04-18 Thread Petr Machata



Jiri Pirko  writes:

> From: Jiri Pirko 
>
> Introduce initial tests for the virtio_net driver. Focus on feature testing,
> leveraging the previously introduced debugfs feature filtering
> infrastructure. Add very basic ping and F_MAC feature tests.
>
> To run this, do:
> $ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests
>
> Run it on a system with 2 virtio_net devices connected back-to-back
> on the hypervisor.
>
> Signed-off-by: Jiri Pirko 

Reviewed-by: Petr Machata

Re: [patch net-next v4 5/6] selftests: forwarding: add wait_for_dev() helper

2024-04-18 Thread Petr Machata



Jiri Pirko  writes:

> From: Jiri Pirko 
>
> The existing setup_wait*() helper family checks that the interface
> is up. Introduce wait_for_dev() to wait for the netdevice to appear,
> for example after a test script does a manual device bind.
>
> Signed-off-by: Jiri Pirko 

Reviewed-by: Petr Machata

Re: [patch net-next v4 3/6] selftests: forwarding: add ability to assemble NETIFS array by driver name

2024-04-18 Thread Petr Machata



Jiri Pirko  writes:

> From: Jiri Pirko 
>
> Allow driver tests to work without specifying the netdevice names.
> Introduce the possibility to search for available netdevices according to
> a set driver name. Allow a test to specify the name by setting the
> NETIF_FIND_DRIVER variable.
>
> Note that the user overrides this either by passing netdevice names on the
> command line or by declaring the NETIFS array in a custom forwarding.config
> configuration file.
>
> Signed-off-by: Jiri Pirko 

Reviewed-by: Petr Machata

Re: [patch net-next v4 2/6] selftests: forwarding: move initial root check to the beginning

2024-04-18 Thread Petr Machata



Jiri Pirko  writes:

> From: Jiri Pirko 
>
> This check can be done at the very beginning of the script.
> As the follow-up patch needs to add early code that must be executed
> after the check, move it.
>
> Signed-off-by: Jiri Pirko 

Reviewed-by: Petr Machata

Re: [patch net-next v3 6/6] selftests: virtio_net: add initial tests

2024-04-18 Thread Petr Machata



Jiri Pirko  writes:

> From: Jiri Pirko 
>
> Introduce initial tests for the virtio_net driver. Focus on feature testing,
> leveraging the previously introduced debugfs feature filtering
> infrastructure. Add very basic ping and F_MAC feature tests.
>
> To run this, do:
> $ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests
>
> Run it on a system with 2 virtio_net devices connected back-to-back
> on the hypervisor.
>
> Signed-off-by: Jiri Pirko 

> +h2_destroy()
> +{
> + simple_if_fini $h2 $H2_IPV4/24 $H2_IPV6/64
> +}
> +
> +initial_ping_test()
> +{
> + cleanup

All these cleanup() calls will end up possibly triggering
PAUSE_ON_CLEANUP. Not sure that's intended.

> + setup_prepare
> + ping_test $h1 $H2_IPV4 " simple"
> +}

Other than this nit, LGTM.

Reviewed-by: Petr Machata

Re: [patch net-next v3 3/6] selftests: forwarding: add ability to assemble NETIFS array by driver name

2024-04-18 Thread Petr Machata



Petr Machata  writes:

> Jiri Pirko  writes:
>
>> +# Whether to find netdevice according to the specified driver.
>> +: "${NETIF_FIND_DRIVER:=}"
>
> This would be better placed up there in the Topology description
> section. Together with NETIFS and NETIF_NO_CABLE, as it concerns
> specification of which interfaces to use.

Oh never mind, it's not something a user should configure, but rather a
test API.

Re: [patch net-next v3 5/6] selftests: forwarding: add wait_for_dev() helper

2024-04-18 Thread Petr Machata



Jiri Pirko  writes:

> From: Jiri Pirko 
>
> The existing setup_wait*() helper family checks that the interface
> is up. Introduce wait_for_dev() to wait for the netdevice to appear,
> for example after a test script does a manual device bind.
>
> Signed-off-by: Jiri Pirko 
> ---
> v1->v2:
> - reworked wait_for_dev() helper to use slowwait() helper
> ---
>  tools/testing/selftests/net/forwarding/lib.sh | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
> index edaec12c0575..41c0b0ed430b 100644
> --- a/tools/testing/selftests/net/forwarding/lib.sh
> +++ b/tools/testing/selftests/net/forwarding/lib.sh
> @@ -745,6 +745,19 @@ setup_wait()
>   sleep $WAIT_TIME
>  }
>  
> +wait_for_dev()
> +{
> +local dev=$1; shift
> +local timeout=${1:-$WAIT_TIMEOUT}; shift
> +
> +slowwait $timeout ip link show dev $dev up &> /dev/null

I agree with Benjamin's feedback that this should lose the up flag. It
looks as if it's waiting for the device to be up.

> +if (( $? )); then
> +check_err 1
> +log_test wait_for_dev "Interface $dev did not appear."
> +exit $EXIT_STATUS
> +fi
> +}
> +
>  cmd_jq()
>  {
>   local cmd=$1

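The review point above, that wait_for_dev() should wait for the device to exist rather than to be up, can be sketched as follows. This is only an illustration: slowwait_sketch() is a hypothetical stand-in for the selftests' slowwait() helper, the DEV_PROBE variable is an assumption added so the probe can be swapped out, and the default timeout of 5 seconds is arbitrary.

```shell
#!/bin/sh
# Hypothetical stand-in for the selftests' slowwait(): retry a command every
# 100 ms until it succeeds or roughly $timeout seconds have passed.
slowwait_sketch()
{
	local timeout=$1; shift
	local attempts=$((timeout * 10))

	while true; do
		"$@" && return 0
		attempts=$((attempts - 1))
		[ "$attempts" -le 0 ] && return 1
		sleep 0.1
	done
}

# Wait for a netdevice to appear, whatever its administrative state.
# DEV_PROBE is an assumption of this sketch so the probe can be replaced in
# tests; by default it is "ip link show dev", deliberately without "up".
wait_for_dev_sketch()
{
	local dev=$1
	local timeout=${2:-5}

	# DEV_PROBE is intentionally unquoted so it splits into command words.
	slowwait_sketch "$timeout" ${DEV_PROBE:-ip link show dev} "$dev" \
		> /dev/null 2>&1
}
```

A caller would check the return status and report the failure itself, as the quoted patch does with check_err and log_test.
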
Re: [patch net-next v3 4/6] selftests: forwarding: add check_driver() helper

2024-04-18 Thread Petr Machata



Jiri Pirko  writes:

> From: Jiri Pirko 
>
> Add a helper to check whether the netdevice is backed by the specified
> driver.
>
> Signed-off-by: Jiri Pirko 

Reviewed-by: Petr Machata

Re: [patch net-next v3 3/6] selftests: forwarding: add ability to assemble NETIFS array by driver name

2024-04-18 Thread Petr Machata



Jiri Pirko  writes:

> From: Jiri Pirko 
>
> Allow driver tests to work without specifying the netdevice names.
> Introduce the possibility to search for available netdevices according to
> a set driver name. Allow a test to specify the name by setting the
> NETIF_FIND_DRIVER variable.
>
> Note that the user overrides this either by passing netdevice names on the
> command line or by declaring the NETIFS array in a custom forwarding.config
> configuration file.
>
> Signed-off-by: Jiri Pirko 
> ---
> v1->v2:
> - removed unnecessary "-p" and "-e" options
> - removed unnecessary "! -z" from the check
> - moved NETIF_FIND_DRIVER declaration from the config options
> ---
>  tools/testing/selftests/net/forwarding/lib.sh | 39 +++
>  1 file changed, 39 insertions(+)
>
> diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
> index 2e7695b94b6b..b3fd0f052d71 100644
> --- a/tools/testing/selftests/net/forwarding/lib.sh
> +++ b/tools/testing/selftests/net/forwarding/lib.sh
> @@ -94,6 +94,45 @@ if [[ ! -v NUM_NETIFS ]]; then
>   exit $ksft_skip
>  fi
>  
> +##
> +# Find netifs by test-specified driver name
> +
> +driver_name_get()
> +{
> + local dev=$1; shift
> + local driver_path="/sys/class/net/$dev/device/driver"
> +
> + if [ ! -L $driver_path ]; then
> + echo ""
> + else
> + basename `realpath $driver_path`
> + fi

This is just:

	if [[ -L $driver_path ]]; then
		basename `realpath $driver_path`
	fi

> +}
> +
> +find_netif()

Maybe name it find_driver_netif? find_netif sounds super generic.

Also consider having it take an argument instead of accessing
environment NETIF_FIND_DRIVER directly.

> +{
> + local ifnames=`ip -j link show | jq -r ".[].ifname"`
> + local count=0
> +
> + for ifname in $ifnames
> + do
> + local driver_name=`driver_name_get $ifname`
> + if [[ ! -z $driver_name && $driver_name == $NETIF_FIND_DRIVER ]]; then
> + count=$((count + 1))
> + NETIFS[p$count]="$ifname"
> + fi
> + done
> +}
> +
> +# Whether to find netdevice according to the specified driver.
> +: "${NETIF_FIND_DRIVER:=}"

This would be better placed up there in the Topology description
section. Together with NETIFS and NETIF_NO_CABLE, as it concerns
specification of which interfaces to use.

> +
> +if [[ $NETIF_FIND_DRIVER ]]; then
> + unset NETIFS
> + declare -A NETIFS
> + find_netif
> +fi
> +
>  net_forwarding_dir=$(dirname "$(readlink -e "${BASH_SOURCE[0]}")")
>  
>  if [[ -f $net_forwarding_dir/forwarding.config ]]; then

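Two of the suggestions in this review (a more specific name, and taking the driver as an argument instead of reading NETIF_FIND_DRIVER directly) could look roughly like the sketch below. The function names and the optional sysfs-directory argument are assumptions made for illustration, and the sketch walks the sysfs directory rather than parsing "ip -j link show" with jq as the patch does.

```shell
#!/bin/sh
# Print the driver backing a netdevice, or nothing if it has none.
# The second argument (sysfs directory) is an assumption of this sketch,
# added so the lookup can be pointed at a fake tree.
driver_name_get_sketch()
{
	local dev=$1
	local sysfs_net=${2:-/sys/class/net}
	local driver_path="$sysfs_net/$dev/device/driver"

	if [ -L "$driver_path" ]; then
		basename "$(realpath "$driver_path")"
	fi
}

# Print the names of all netdevices backed by the given driver, one per
# line. The driver is an argument rather than a global variable.
find_driver_netifs_sketch()
{
	local driver=$1
	local sysfs_net=${2:-/sys/class/net}
	local path ifname

	# An empty driver name must not match devices that have no driver.
	[ -n "$driver" ] || return 0

	for path in "$sysfs_net"/*; do
		[ -e "$path" ] || continue
		ifname=$(basename "$path")
		if [ "$(driver_name_get_sketch "$ifname" "$sysfs_net")" = "$driver" ]; then
			echo "$ifname"
		fi
	done
}
```

A caller could then number the printed names and store them as NETIFS[p1], NETIFS[p2] and so on, which keeps the array handling out of the lookup itself.
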
[PATCH net-next 08/10] mlxsw: spectrum_qdisc: Allocate child qdiscs dynamically

2021-04-20 Thread Petr Machata

Instead of keeping qdiscs in globally-preallocated arrays, introduce a
per-qdisc-kind value num_classes, and then allocate the necessary child
qdiscs (if any) based on that value. Since dynamic allocation is now
involved, mlxsw_sp_qdisc_replace() gets messy enough that it is worth
splitting it into two cases: a new qdisc allocation and a change of an
existing qdisc. (Note that the change also includes what TC formally calls
replace, if the qdisc kind is the same.)

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 115 +-
 1 file changed, 83 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 9e7f1a0188e8..03c131027fa7 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -49,6 +49,7 @@ struct mlxsw_sp_qdisc_ops {
  struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params);
    struct mlxsw_sp_qdisc *(*find_class)(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
 u32 parent);
+   unsigned int num_classes;
 };
 
 struct mlxsw_sp_qdisc {
@@ -74,7 +75,6 @@ struct mlxsw_sp_qdisc {
 
 struct mlxsw_sp_qdisc_state {
struct mlxsw_sp_qdisc root_qdisc;
-   struct mlxsw_sp_qdisc tclass_qdiscs[IEEE_8021QAZ_MAX_TCS];
 
/* When a PRIO or ETS are added, the invisible FIFOs in their bands are
 * created first. When notifications for these FIFOs arrive, it is not
@@ -215,29 +215,41 @@ mlxsw_sp_qdisc_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
if (mlxsw_sp_qdisc->ops->destroy)
err = mlxsw_sp_qdisc->ops->destroy(mlxsw_sp_port,
   mlxsw_sp_qdisc);
+   if (mlxsw_sp_qdisc->ops->clean_stats)
+   mlxsw_sp_qdisc->ops->clean_stats(mlxsw_sp_port, mlxsw_sp_qdisc);
 
mlxsw_sp_qdisc->handle = TC_H_UNSPEC;
mlxsw_sp_qdisc->ops = NULL;
-
+   mlxsw_sp_qdisc->num_classes = 0;
+   kfree(mlxsw_sp_qdisc->qdiscs);
+   mlxsw_sp_qdisc->qdiscs = NULL;
return err_hdroom ?: err;
 }
 
-static int
-mlxsw_sp_qdisc_replace(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle,
-  struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
-  struct mlxsw_sp_qdisc_ops *ops, void *params)
+static int mlxsw_sp_qdisc_create(struct mlxsw_sp_port *mlxsw_sp_port,
+u32 handle,
+struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+struct mlxsw_sp_qdisc_ops *ops, void *params)
 {
struct mlxsw_sp_qdisc *root_qdisc = &mlxsw_sp_port->qdisc->root_qdisc;
struct mlxsw_sp_hdroom orig_hdroom;
+   unsigned int i;
int err;
 
-   if (mlxsw_sp_qdisc->ops && mlxsw_sp_qdisc->ops->type != ops->type)
-   /* In case this location contained a different qdisc of the
-* same type we can override the old qdisc configuration.
-* Otherwise, we need to remove the old qdisc before setting the
-* new one.
-*/
-   mlxsw_sp_qdisc_destroy(mlxsw_sp_port, mlxsw_sp_qdisc);
+   err = ops->check_params(mlxsw_sp_port, params);
+   if (err)
+   return err;
+
+   if (ops->num_classes) {
+       mlxsw_sp_qdisc->qdiscs = kcalloc(ops->num_classes,
+                                        sizeof(*mlxsw_sp_qdisc->qdiscs),
+                                        GFP_KERNEL);
+   if (!mlxsw_sp_qdisc->qdiscs)
+   return -ENOMEM;
+
+   for (i = 0; i < ops->num_classes; i++)
+   mlxsw_sp_qdisc->qdiscs[i].parent = mlxsw_sp_qdisc;
+   }
 
orig_hdroom = *mlxsw_sp_port->hdroom;
if (root_qdisc == mlxsw_sp_qdisc) {
@@ -253,20 +265,46 @@ mlxsw_sp_qdisc_replace(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle,
goto err_hdroom_configure;
}
 
+   mlxsw_sp_qdisc->num_classes = ops->num_classes;
+   mlxsw_sp_qdisc->ops = ops;
+   mlxsw_sp_qdisc->handle = handle;
+   err = ops->replace(mlxsw_sp_port, handle, mlxsw_sp_qdisc, params);
+   if (err)
+   goto err_replace;
+
+   return 0;
+
+err_replace:
+   mlxsw_sp_qdisc->handle = TC_H_UNSPEC;
+   mlxsw_sp_qdisc->ops = NULL;
+   mlxsw_sp_qdisc->num_classes = 0;
+   mlxsw_sp_hdroom_configure(mlxsw_sp_port, &orig_hdroom);
+err_hdroom_configure:
+   kfree(mlxsw_sp_qdisc->qdiscs);
+   mlxsw_sp_qdisc->qdiscs = NULL;
+   return err;
+}
+
+static int
+mlxsw_sp_qdisc_change(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle,
+

[PATCH net-next 10/10] selftests: mlxsw: sch_red_ets: Test proper counter cleaning in ETS

2021-04-20 Thread Petr Machata

There was a bug introduced during the rework which caused a non-zero backlog
to get stuck at ETS. Introduce a selftest that would have caught the issue
earlier.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh b/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh
index 3f007c5f8361..f3ef3274f9b3 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh
@@ -67,6 +67,13 @@ red_test()
 {
install_qdisc
 
+   # Make sure that we get the non-zero value if there is any.
+   local cur=$(busywait 1100 until_counter_is "> 0" \
+   qdisc_stats_get $swp3 10: .backlog)
+   (( cur == 0 ))
+   check_err $? "backlog of $cur observed on non-busy qdisc"
+   log_test "$QDISC backlog properly cleaned"
+
do_red_test 10 $BACKLOG1
do_red_test 11 $BACKLOG2
 
-- 
2.26.2

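The check added by this patch polls for a non-zero backlog for a bounded time and passes only if none shows up. A minimal sketch of that pattern follows. The helpers are hypothetical stand-ins for the selftests' busywait() and until_counter_is(), the attempt count replaces the real millisecond timeout, and the fake counters replace the qdisc_stats_get call.

```shell
#!/bin/sh
# Hypothetical stand-in for until_counter_is(): print the counter value and
# succeed only if "<value> <expr>" holds, e.g. expr "> 0".
until_counter_is_sketch()
{
	local expr=$1; shift
	local current

	current=$("$@")
	echo "$current"
	[ "$(($current $expr))" -ne 0 ]
}

# Hypothetical stand-in for busywait(): rerun a command until it succeeds or
# the attempt budget is used up, then print the output of the last run.
# The real helper takes a timeout in milliseconds rather than a count.
busywait_sketch()
{
	local attempts=$1; shift
	local out ret

	while true; do
		out=$("$@") && ret=0 || ret=$?
		attempts=$((attempts - 1))
		if [ "$ret" -eq 0 ] || [ "$attempts" -le 0 ]; then
			printf '%s' "$out"
			return "$ret"
		fi
	done
}

# Fake counters standing in for "qdisc_stats_get $swp3 10: .backlog".
clean_backlog() { echo 0; }
stuck_backlog() { echo 7; }

# Succeeds only if no non-zero backlog shows up within the attempt budget.
backlog_is_clean()
{
	local cur

	# A non-zero status here just means the counter never rose above zero.
	cur=$(busywait_sketch "$1" until_counter_is_sketch "> 0" "$2") || true
	[ "$cur" -eq 0 ]
}
```

Waiting for the "bad" condition and then asserting that it was not seen is what lets the test catch a backlog that appears late, instead of sampling the counter once.
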
[PATCH net-next 09/10] mlxsw: spectrum_qdisc: Index future FIFOs by band number

2021-04-20 Thread Petr Machata

mlxsw used to hold an array of qdiscs indexed by the TC number. In the
previous patch, it was changed to allocate child qdiscs dynamically, and
they are now indexed by band number. Follow suit with the array of future
FIFOs.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c| 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 03c131027fa7..04672eb5c7f3 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -962,7 +962,7 @@ static int __mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port,
 {
struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc;
struct mlxsw_sp_qdisc *mlxsw_sp_qdisc;
-   int tclass, child_index;
+   unsigned int band;
u32 parent_handle;
 
mlxsw_sp_qdisc = mlxsw_sp_qdisc_find(mlxsw_sp_port, p->parent, false);
@@ -977,13 +977,12 @@ static int __mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port,
qdisc_state->future_handle = parent_handle;
}
 
-   child_index = TC_H_MIN(p->parent);
-   tclass = MLXSW_SP_PRIO_CHILD_TO_TCLASS(child_index);
-   if (tclass < IEEE_8021QAZ_MAX_TCS) {
+   band = TC_H_MIN(p->parent) - 1;
+   if (band < IEEE_8021QAZ_MAX_TCS) {
if (p->command == TC_FIFO_REPLACE)
-   qdisc_state->future_fifos[tclass] = true;
+   qdisc_state->future_fifos[band] = true;
else if (p->command == TC_FIFO_DESTROY)
-   qdisc_state->future_fifos[tclass] = false;
+   qdisc_state->future_fifos[band] = false;
}
}
if (!mlxsw_sp_qdisc)
@@ -1117,7 +1116,7 @@ __mlxsw_sp_qdisc_ets_replace(struct mlxsw_sp_port *mlxsw_sp_port,
}
 
if (handle == qdisc_state->future_handle &&
-   qdisc_state->future_fifos[tclass]) {
+   qdisc_state->future_fifos[band]) {
err = mlxsw_sp_qdisc_replace(mlxsw_sp_port, TC_H_UNSPEC,
 child_qdisc,
 &mlxsw_sp_qdisc_ops_fifo,
-- 
2.26.2

[PATCH net-next 07/10] mlxsw: spectrum_qdisc: Guard all qdisc accesses with a lock

2021-04-20 Thread Petr Machata

The FIFO handler currently guards accesses to the future FIFO tracking by
asserting RTNL. In the future, the changes to the qdisc state will be more
thorough, so other qdiscs will need this guarding as well. In order
to not further the RTNL infestation, instead convert to a custom lock that
will guard accesses to the qdisc state.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 89 +++
 1 file changed, 73 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index f42ea958919b..9e7f1a0188e8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -89,6 +89,7 @@ struct mlxsw_sp_qdisc_state {
 */
u32 future_handle;
bool future_fifos[IEEE_8021QAZ_MAX_TCS];
+   struct mutex lock; /* Protects qdisc state. */
 };
 
 static bool
@@ -620,8 +621,8 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_red = {
.find_class = mlxsw_sp_qdisc_leaf_find_class,
 };
 
-int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port,
- struct tc_red_qopt_offload *p)
+static int __mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port,
+  struct tc_red_qopt_offload *p)
 {
struct mlxsw_sp_qdisc *mlxsw_sp_qdisc;
 
@@ -652,6 +653,18 @@ int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port,
}
 }
 
+int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port,
+ struct tc_red_qopt_offload *p)
+{
+   int err;
+
+   mutex_lock(&mlxsw_sp_port->qdisc->lock);
+   err = __mlxsw_sp_setup_tc_red(mlxsw_sp_port, p);
+   mutex_unlock(&mlxsw_sp_port->qdisc->lock);
+
+   return err;
+}
+
 static void
 mlxsw_sp_setup_tc_qdisc_leaf_clean_stats(struct mlxsw_sp_port *mlxsw_sp_port,
 struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
@@ -814,8 +827,8 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_tbf = {
.find_class = mlxsw_sp_qdisc_leaf_find_class,
 };
 
-int mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port *mlxsw_sp_port,
- struct tc_tbf_qopt_offload *p)
+static int __mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port *mlxsw_sp_port,
+  struct tc_tbf_qopt_offload *p)
 {
struct mlxsw_sp_qdisc *mlxsw_sp_qdisc;
 
@@ -843,6 +856,18 @@ int mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port *mlxsw_sp_port,
}
 }
 
+int mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port *mlxsw_sp_port,
+ struct tc_tbf_qopt_offload *p)
+{
+   int err;
+
+   mutex_lock(&mlxsw_sp_port->qdisc->lock);
+   err = __mlxsw_sp_setup_tc_tbf(mlxsw_sp_port, p);
+   mutex_unlock(&mlxsw_sp_port->qdisc->lock);
+
+   return err;
+}
+
 static int
 mlxsw_sp_qdisc_fifo_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
 void *params)
@@ -876,20 +901,14 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_fifo = {
.clean_stats = mlxsw_sp_setup_tc_qdisc_leaf_clean_stats,
 };
 
-int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port,
-  struct tc_fifo_qopt_offload *p)
+static int __mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port,
+   struct tc_fifo_qopt_offload *p)
 {
struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc;
struct mlxsw_sp_qdisc *mlxsw_sp_qdisc;
int tclass, child_index;
u32 parent_handle;
 
-   /* Invisible FIFOs are tracked in future_handle and future_fifos. Make
-* sure that not more than one qdisc is created for a port at a time.
-* RTNL is a simple proxy for that.
-*/
-   ASSERT_RTNL();
-
mlxsw_sp_qdisc = mlxsw_sp_qdisc_find(mlxsw_sp_port, p->parent, false);
if (!mlxsw_sp_qdisc && p->handle == TC_H_UNSPEC) {
parent_handle = TC_H_MAJ(p->parent);
@@ -936,6 +955,18 @@ int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port,
return -EOPNOTSUPP;
 }
 
+int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port,
+  struct tc_fifo_qopt_offload *p)
+{
+   int err;
+
+   mutex_lock(&mlxsw_sp_port->qdisc->lock);
+   err = __mlxsw_sp_setup_tc_fifo(mlxsw_sp_port, p);
+   mutex_unlock(&mlxsw_sp_port->qdisc->lock);
+
+   return err;
+}
+
 static int __mlxsw_sp_qdisc_ets_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
 {
@@ -1277,8 +1308,8 @@ mlxsw_sp_qdisc_prio_graft(struct mlxsw_sp_port *mlxsw_sp_port,
  p->band, p->child_handle);
 }
 
-int mlxsw_sp_setup_tc_prio(struct m

[PATCH net-next 06/10] mlxsw: spectrum_qdisc: Track children per qdisc

2021-04-20 Thread Petr Machata

mlxsw currently allows a two-level structure of qdiscs: the root and
possibly a number of children. In order to support offloading more general
qdisc trees, introduce to struct mlxsw_sp_qdisc a pointer to child qdiscs.
Refer to the child qdiscs through this pointer, instead of going through
the tclass_qdiscs in qdisc_state. Additionally introduce a field
num_classes, which holds the number of the given qdisc's children.

Also introduce a generic function for walking qdisc trees. Rewrite
mlxsw_sp_qdisc_find() and _find_by_handle() to use the generic walker.

For now, keep qdisc_state.tclass_qdiscs, and just point root_qdisc's
children to this array. Following patches will make the allocation dynamic.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 164 +-
 1 file changed, 118 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index a8a7e9c88a4d..f42ea958919b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -47,6 +47,8 @@ struct mlxsw_sp_qdisc_ops {
 */
void (*unoffload)(struct mlxsw_sp_port *mlxsw_sp_port,
  struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params);
+   struct mlxsw_sp_qdisc *(*find_class)(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+u32 parent);
 };
 
 struct mlxsw_sp_qdisc {
@@ -66,6 +68,8 @@ struct mlxsw_sp_qdisc {
 
struct mlxsw_sp_qdisc_ops *ops;
struct mlxsw_sp_qdisc *parent;
+   struct mlxsw_sp_qdisc *qdiscs;
+   unsigned int num_classes;
 };
 
 struct mlxsw_sp_qdisc_state {
@@ -93,44 +97,84 @@ mlxsw_sp_qdisc_compare(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, u32 handle)
return mlxsw_sp_qdisc->ops && mlxsw_sp_qdisc->handle == handle;
 }
 
+static struct mlxsw_sp_qdisc *
+mlxsw_sp_qdisc_walk(struct mlxsw_sp_qdisc *qdisc,
+   struct mlxsw_sp_qdisc *(*pre)(struct mlxsw_sp_qdisc *,
+ void *),
+   void *data)
+{
+   struct mlxsw_sp_qdisc *tmp;
+   unsigned int i;
+
+   if (pre) {
+   tmp = pre(qdisc, data);
+   if (tmp)
+   return tmp;
+   }
+
+   if (qdisc->ops) {
+   for (i = 0; i < qdisc->num_classes; i++) {
+   tmp = &qdisc->qdiscs[i];
+   if (qdisc->ops) {
+   tmp = mlxsw_sp_qdisc_walk(tmp, pre, data);
+   if (tmp)
+   return tmp;
+   }
+   }
+   }
+
+   return NULL;
+}
+
+static struct mlxsw_sp_qdisc *
+mlxsw_sp_qdisc_walk_cb_find(struct mlxsw_sp_qdisc *qdisc, void *data)
+{
+   u32 parent = *(u32 *)data;
+
+   if (qdisc->ops && TC_H_MAJ(qdisc->handle) == TC_H_MAJ(parent)) {
+   if (qdisc->ops->find_class)
+   return qdisc->ops->find_class(qdisc, parent);
+   }
+
+   return NULL;
+}
+
 static struct mlxsw_sp_qdisc *
 mlxsw_sp_qdisc_find(struct mlxsw_sp_port *mlxsw_sp_port, u32 parent,
bool root_only)
 {
struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc;
-   int tclass, child_index;
 
+   if (!qdisc_state)
+   return NULL;
if (parent == TC_H_ROOT)
return &qdisc_state->root_qdisc;
-
-   if (root_only || !qdisc_state ||
-   !qdisc_state->root_qdisc.ops ||
-   TC_H_MAJ(parent) != qdisc_state->root_qdisc.handle ||
-   TC_H_MIN(parent) > IEEE_8021QAZ_MAX_TCS)
+   if (root_only)
return NULL;
+   return mlxsw_sp_qdisc_walk(&qdisc_state->root_qdisc,
+  mlxsw_sp_qdisc_walk_cb_find, &parent);
+}
 
-   child_index = TC_H_MIN(parent);
-   tclass = MLXSW_SP_PRIO_CHILD_TO_TCLASS(child_index);
-   return &qdisc_state->tclass_qdiscs[tclass];
+static struct mlxsw_sp_qdisc *
+mlxsw_sp_qdisc_walk_cb_find_by_handle(struct mlxsw_sp_qdisc *qdisc, void *data)
+{
+   u32 handle = *(u32 *)data;
+
+   if (qdisc->ops && qdisc->handle == handle)
+   return qdisc;
+   return NULL;
 }
 
 static struct mlxsw_sp_qdisc *
 mlxsw_sp_qdisc_find_by_handle(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle)
 {
struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc;
-   int i;
 
-   if (qdisc_state->root_qdisc.handle == handle)
-   return &qdisc_state->root_qdisc;
-
-   if (qdisc_state->root_qdisc.handle == TC_H_UNSPEC)
+   if (!qdisc_state)
return NULL;
-
-   for (i = 0; i < IEEE_8021QAZ_MAX

[PATCH net-next 04/10] mlxsw: spectrum_qdisc: Track tclass_num as int, not u8

2021-04-20 Thread Petr Machata

tclass_num is just a number, a value that would be ordinarily passed around
as an int. (Which is unlike a u8 prio_bitmap.) In several places,
tclass_num already is an int. Convert the remaining instances.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index f1d32bfc4bed..da1f6314df60 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -51,7 +51,7 @@ struct mlxsw_sp_qdisc_ops {
 
 struct mlxsw_sp_qdisc {
u32 handle;
-   u8 tclass_num;
+   int tclass_num;
u8 prio_bitmap;
union {
struct red_stats red;
@@ -291,7 +291,7 @@ mlxsw_sp_qdisc_collect_tc_stats(struct mlxsw_sp_port *mlxsw_sp_port,
u64 *p_tx_bytes, u64 *p_tx_packets,
u64 *p_drops, u64 *p_backlog)
 {
-   u8 tclass_num = mlxsw_sp_qdisc->tclass_num;
+   int tclass_num = mlxsw_sp_qdisc->tclass_num;
struct mlxsw_sp_port_xstats *xstats;
u64 tx_bytes, tx_packets;
 
@@ -391,7 +391,7 @@ static void
 mlxsw_sp_setup_tc_qdisc_red_clean_stats(struct mlxsw_sp_port *mlxsw_sp_port,
struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
 {
-   u8 tclass_num = mlxsw_sp_qdisc->tclass_num;
+   int tclass_num = mlxsw_sp_qdisc->tclass_num;
struct mlxsw_sp_qdisc_stats *stats_base;
struct mlxsw_sp_port_xstats *xstats;
struct red_stats *red_base;
@@ -462,7 +462,7 @@ mlxsw_sp_qdisc_red_replace(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle,
 {
struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
struct tc_red_qopt_offload_params *p = params;
-   u8 tclass_num = mlxsw_sp_qdisc->tclass_num;
+   int tclass_num = mlxsw_sp_qdisc->tclass_num;
u32 min, max;
u64 prob;
 
@@ -507,7 +507,7 @@ mlxsw_sp_qdisc_get_red_xstats(struct mlxsw_sp_port *mlxsw_sp_port,
  void *xstats_ptr)
 {
struct red_stats *xstats_base = &mlxsw_sp_qdisc->xstats_base.red;
-   u8 tclass_num = mlxsw_sp_qdisc->tclass_num;
+   int tclass_num = mlxsw_sp_qdisc->tclass_num;
struct mlxsw_sp_port_xstats *xstats;
struct red_stats *res = xstats_ptr;
int early_drops, pdrops;
@@ -531,7 +531,7 @@ mlxsw_sp_qdisc_get_red_stats(struct mlxsw_sp_port *mlxsw_sp_port,
 struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
 struct tc_qopt_offload_stats *stats_ptr)
 {
-   u8 tclass_num = mlxsw_sp_qdisc->tclass_num;
+   int tclass_num = mlxsw_sp_qdisc->tclass_num;
struct mlxsw_sp_qdisc_stats *stats_base;
struct mlxsw_sp_port_xstats *xstats;
u64 overlimits;
-- 
2.26.2

[PATCH net-next 05/10] mlxsw: spectrum_qdisc: Promote backlog reduction to mlxsw_sp_qdisc_destroy()

2021-04-20 Thread Petr Machata

When a qdisc is removed, it is necessary to update the backlog value at its
parent--unless the qdisc is at root position. RED, TBF and FIFO all do
that, each separately. Since all of them need to do this, just promote the
operation directly to mlxsw_sp_qdisc_destroy(), instead of deferring it to
individual destructors. Since FIFO dtor thus becomes trivial, remove it.

Add struct mlxsw_sp_qdisc.parent to point at the parent qdisc. This will be
handy later as deeper structures are offloaded. Use the parent qdisc to
find the chain of parents whose backlog value needs to be updated.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 48 +++
 1 file changed, 18 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index da1f6314df60..a8a7e9c88a4d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -65,6 +65,7 @@ struct mlxsw_sp_qdisc {
} stats_base;
 
struct mlxsw_sp_qdisc_ops *ops;
+   struct mlxsw_sp_qdisc *parent;
 };
 
 struct mlxsw_sp_qdisc_state {
@@ -132,6 +133,15 @@ mlxsw_sp_qdisc_find_by_handle(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle)
return NULL;
 }
 
+static void
+mlxsw_sp_qdisc_reduce_parent_backlog(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
+{
+   struct mlxsw_sp_qdisc *tmp;
+
+   for (tmp = mlxsw_sp_qdisc->parent; tmp; tmp = tmp->parent)
+   tmp->stats_base.backlog -= mlxsw_sp_qdisc->stats_base.backlog;
+}
+
 static int
 mlxsw_sp_qdisc_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
@@ -153,7 +163,11 @@ mlxsw_sp_qdisc_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
err_hdroom = mlxsw_sp_hdroom_configure(mlxsw_sp_port, &hdroom);
}
 
-   if (mlxsw_sp_qdisc->ops && mlxsw_sp_qdisc->ops->destroy)
+   if (!mlxsw_sp_qdisc->ops)
+   return 0;
+
+   mlxsw_sp_qdisc_reduce_parent_backlog(mlxsw_sp_qdisc);
+   if (mlxsw_sp_qdisc->ops->destroy)
err = mlxsw_sp_qdisc->ops->destroy(mlxsw_sp_port,
   mlxsw_sp_qdisc);
 
@@ -417,13 +431,6 @@ static int
 mlxsw_sp_qdisc_red_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
 {
-   struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc;
-   struct mlxsw_sp_qdisc *root_qdisc = &qdisc_state->root_qdisc;
-
-   if (root_qdisc != mlxsw_sp_qdisc)
-   root_qdisc->stats_base.backlog -=
-   mlxsw_sp_qdisc->stats_base.backlog;
-
return mlxsw_sp_tclass_congestion_disable(mlxsw_sp_port,
  mlxsw_sp_qdisc->tclass_num);
 }
@@ -616,13 +623,6 @@ static int
 mlxsw_sp_qdisc_tbf_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
 {
-   struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc;
-   struct mlxsw_sp_qdisc *root_qdisc = &qdisc_state->root_qdisc;
-
-   if (root_qdisc != mlxsw_sp_qdisc)
-   root_qdisc->stats_base.backlog -=
-   mlxsw_sp_qdisc->stats_base.backlog;
-
return mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
 MLXSW_REG_QEEC_HR_SUBGROUP,
 mlxsw_sp_qdisc->tclass_num, 0,
@@ -790,19 +790,6 @@ int mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port *mlxsw_sp_port,
}
 }
 
-static int
-mlxsw_sp_qdisc_fifo_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
-   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
-{
-   struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc;
-   struct mlxsw_sp_qdisc *root_qdisc = &qdisc_state->root_qdisc;
-
-   if (root_qdisc != mlxsw_sp_qdisc)
-   root_qdisc->stats_base.backlog -=
-   mlxsw_sp_qdisc->stats_base.backlog;
-   return 0;
-}
-
 static int
 mlxsw_sp_qdisc_fifo_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
 void *params)
@@ -832,7 +819,6 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_fifo = {
.type = MLXSW_SP_QDISC_FIFO,
.check_params = mlxsw_sp_qdisc_fifo_check_params,
.replace = mlxsw_sp_qdisc_fifo_replace,
-   .destroy = mlxsw_sp_qdisc_fifo_destroy,
.get_stats = mlxsw_sp_qdisc_get_fifo_stats,
.clean_stats = mlxsw_sp_setup_tc_qdisc_leaf_clean_stats,
 };
@@ -1825,8 +1811,10 @@ int mlxsw_sp_tc_qdisc_init(struct mlxsw_sp_port *mlxsw_sp_port)
 
qdisc_state->root_qdisc.prio_bitmap

[PATCH net-next 02/10] mlxsw: spectrum_qdisc: Simplify mlxsw_sp_qdisc_compare()

2021-04-20 Thread Petr Machata

The purpose of this function is to filter out events that are related to
qdiscs that are not offloaded, or are not offloaded anymore. But the
function is unnecessarily thorough:

- mlxsw_sp_qdisc pointer is never NULL in the context where it is called
- Two qdiscs with the same handle will never have different types. Even
  when replacing one qdisc with another in the same class, Linux will not
  permit handle reuse unless the qdisc type also matches.

Simplify the function by omitting these two unnecessary conditions.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 22 ++-
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 644ffc021abe..013398ecd15b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -87,12 +87,9 @@ struct mlxsw_sp_qdisc_state {
 };
 
 static bool
-mlxsw_sp_qdisc_compare(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, u32 handle,
-  enum mlxsw_sp_qdisc_type type)
+mlxsw_sp_qdisc_compare(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, u32 handle)
 {
-   return mlxsw_sp_qdisc && mlxsw_sp_qdisc->ops &&
-  mlxsw_sp_qdisc->ops->type == type &&
-  mlxsw_sp_qdisc->handle == handle;
+   return mlxsw_sp_qdisc->ops && mlxsw_sp_qdisc->handle == handle;
 }
 
 static struct mlxsw_sp_qdisc *
@@ -579,8 +576,7 @@ int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port 
*mlxsw_sp_port,
  &mlxsw_sp_qdisc_ops_red,
  &p->set);
 
-   if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle,
-   MLXSW_SP_QDISC_RED))
+   if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle))
return -EOPNOTSUPP;
 
switch (p->command) {
@@ -780,8 +776,7 @@ int mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port 
*mlxsw_sp_port,
  &mlxsw_sp_qdisc_ops_tbf,
  &p->replace_params);
 
-   if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle,
-   MLXSW_SP_QDISC_TBF))
+   if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle))
return -EOPNOTSUPP;
 
switch (p->command) {
@@ -886,8 +881,7 @@ int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port 
*mlxsw_sp_port,
  &mlxsw_sp_qdisc_ops_fifo, NULL);
}
 
-   if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle,
-   MLXSW_SP_QDISC_FIFO))
+   if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle))
return -EOPNOTSUPP;
 
switch (p->command) {
@@ -1247,8 +1241,7 @@ int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port 
*mlxsw_sp_port,
  &mlxsw_sp_qdisc_ops_prio,
  &p->replace_params);
 
-   if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle,
-   MLXSW_SP_QDISC_PRIO))
+   if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle))
return -EOPNOTSUPP;
 
switch (p->command) {
@@ -1280,8 +1273,7 @@ int mlxsw_sp_setup_tc_ets(struct mlxsw_sp_port 
*mlxsw_sp_port,
  &mlxsw_sp_qdisc_ops_ets,
  &p->replace_params);
 
-   if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle,
-   MLXSW_SP_QDISC_ETS))
+   if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle))
return -EOPNOTSUPP;
 
switch (p->command) {
-- 
2.26.2

[PATCH net-next 03/10] mlxsw: spectrum_qdisc: Drop an always-true condition

2021-04-20 Thread Petr Machata

The function mlxsw_sp_qdisc_compare() is invoked a couple lines above this
check, which will bounce any requests where this condition does not hold.
Therefore drop it.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 013398ecd15b..f1d32bfc4bed 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -886,10 +886,7 @@ int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port 
*mlxsw_sp_port,
 
switch (p->command) {
case TC_FIFO_DESTROY:
-   if (p->handle == mlxsw_sp_qdisc->handle)
-   return mlxsw_sp_qdisc_destroy(mlxsw_sp_port,
- mlxsw_sp_qdisc);
-   return 0;
+   return mlxsw_sp_qdisc_destroy(mlxsw_sp_port, mlxsw_sp_qdisc);
case TC_FIFO_STATS:
return mlxsw_sp_qdisc_get_stats(mlxsw_sp_port, mlxsw_sp_qdisc,
&p->stats);
-- 
2.26.2

[PATCH net-next 01/10] mlxsw: spectrum_qdisc: Drop one argument from check_params callback

2021-04-20 Thread Petr Machata

The mlxsw_sp_qdisc argument is not used in any of the actual callbacks.
Drop it.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index baf17c0b2702..644ffc021abe 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -29,7 +29,6 @@ struct mlxsw_sp_qdisc;
 struct mlxsw_sp_qdisc_ops {
enum mlxsw_sp_qdisc_type type;
int (*check_params)(struct mlxsw_sp_port *mlxsw_sp_port,
-   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
void *params);
int (*replace)(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle,
   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params);
@@ -198,7 +197,7 @@ mlxsw_sp_qdisc_replace(struct mlxsw_sp_port *mlxsw_sp_port, 
u32 handle,
goto err_hdroom_configure;
}
 
-   err = ops->check_params(mlxsw_sp_port, mlxsw_sp_qdisc, params);
+   err = ops->check_params(mlxsw_sp_port, params);
if (err)
goto err_bad_param;
 
@@ -434,7 +433,6 @@ mlxsw_sp_qdisc_red_destroy(struct mlxsw_sp_port 
*mlxsw_sp_port,
 
 static int
 mlxsw_sp_qdisc_red_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
-   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
void *params)
 {
struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
@@ -678,7 +676,6 @@ mlxsw_sp_qdisc_tbf_rate_kbps(struct 
tc_tbf_qopt_offload_replace_params *p)
 
 static int
 mlxsw_sp_qdisc_tbf_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
-   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
void *params)
 {
struct tc_tbf_qopt_offload_replace_params *p = params;
@@ -813,7 +810,6 @@ mlxsw_sp_qdisc_fifo_destroy(struct mlxsw_sp_port 
*mlxsw_sp_port,
 
 static int
 mlxsw_sp_qdisc_fifo_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
-struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
 void *params)
 {
return 0;
@@ -948,7 +944,6 @@ __mlxsw_sp_qdisc_ets_check_params(unsigned int nbands)
 
 static int
 mlxsw_sp_qdisc_prio_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
-struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
 void *params)
 {
struct tc_prio_qopt_offload_params *p = params;
@@ -1124,7 +1119,6 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_prio 
= {
 
 static int
 mlxsw_sp_qdisc_ets_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
-   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
void *params)
 {
struct tc_ets_qopt_offload_replace_params *p = params;
-- 
2.26.2

[PATCH net-next 00/10] mlxsw: Refactor qdisc offload

2021-04-20 Thread Petr Machata

Currently, mlxsw admits for offload a suitable root qdisc, and its
children. Thus up to two levels of hierarchy are offloaded. Often, this is
enough: one can configure TCs with RED and TCs with a shaper, and can even
see counters for each TC by looking at a qdisc at a sufficiently shallow
position.

While simple, the system has obvious shortcomings. It is not possible to
configure both RED and shaping on one TC. It is not possible to place a
PRIO below root TBF, which would then be offloaded as port shaper. FIFOs
are only offloaded at root or directly below, which is confusing to users,
because RED and TBF of course have their own FIFO.

This patchset is a step towards the end goal of allowing more comprehensive
qdisc tree offload and cleans up the qdisc offload code.

- Patches #1-#4 contain small cleanups.

- Up until now, since mlxsw offloaded only very simple qdisc
  configurations, basically all bookkeeping was done using one container
  for the root qdisc, and 8 containers for its children. Patches #5, #6, #8
  and #9 gradually introduce a more dynamic structure, where parent-child
  relationships are tracked directly at qdiscs, instead of being implicit.

- This tree management assumes only one qdisc is created at a time. In FIFO
  handlers, this condition was enforced simply by asserting RTNL lock. But
  instead of furthering this RTNL dependence, patch #7 converts the whole
  qdisc offload logic to a per-port mutex.

- Patch #10 adds a selftest.
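
To illustrate what tracking parent-child relationships directly at qdiscs could look like, here is a small user-space sketch. It is an assumption-laden model, not the layout used by the patches: `struct qdisc_node` and the three helper functions are hypothetical names, and a handle of 0 is assumed to mark an unused child slot.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of a qdisc node; not the driver's struct. */
struct qdisc_node {
	uint32_t handle;		/* 0 means "slot not in use" */
	struct qdisc_node *parent;	/* NULL for the root */
	struct qdisc_node *children;	/* array of num_children slots */
	unsigned int num_children;
};

/* Allocate n child slots (n must be nonzero) and link each to its parent. */
static int qdisc_node_alloc_children(struct qdisc_node *node, unsigned int n)
{
	unsigned int i;

	node->children = calloc(n, sizeof(*node->children));
	if (!node->children)
		return -1;
	node->num_children = n;
	for (i = 0; i < n; i++)
		node->children[i].parent = node;
	return 0;
}

/* Depth-first search for the node with the given (nonzero) handle. */
static struct qdisc_node *qdisc_node_find(struct qdisc_node *node,
					  uint32_t handle)
{
	unsigned int i;

	if (node->handle == handle)
		return node;
	for (i = 0; i < node->num_children; i++) {
		struct qdisc_node *found;

		found = qdisc_node_find(&node->children[i], handle);
		if (found)
			return found;
	}
	return NULL;
}

/* Free the whole subtree below node, but not node itself. */
static void qdisc_node_free_children(struct qdisc_node *node)
{
	unsigned int i;

	for (i = 0; i < node->num_children; i++)
		qdisc_node_free_children(&node->children[i]);
	free(node->children);
	node->children = NULL;
	node->num_children = 0;
}
```

Because every node carries its own parent pointer and child array, the depth of the tree is no longer fixed by the number of statically allocated containers.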

Petr Machata (10):
  mlxsw: spectrum_qdisc: Drop one argument from check_params callback
  mlxsw: spectrum_qdisc: Simplify mlxsw_sp_qdisc_compare()
  mlxsw: spectrum_qdisc: Drop an always-true condition
  mlxsw: spectrum_qdisc: Track tclass_num as int, not u8
  mlxsw: spectrum_qdisc: Promote backlog reduction to
mlxsw_sp_qdisc_destroy()
  mlxsw: spectrum_qdisc: Track children per qdisc
  mlxsw: spectrum_qdisc: Guard all qdisc accesses with a lock
  mlxsw: spectrum_qdisc: Allocate child qdiscs dynamically
  mlxsw: spectrum_qdisc: Index future FIFOs by band number
  selftests: mlxsw: sch_red_ets: Test proper counter cleaning in ETS

 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 448 --
 .../drivers/net/mlxsw/sch_red_ets.sh  |   7 +
 2 files changed, 306 insertions(+), 149 deletions(-)

-- 
2.26.2

Re: [PATCH net-next 1/7] net: sched: Add a trap-and-forward action

2021-04-09 Thread Petr Machata



Jamal Hadi Salim  writes:

> On 2021-04-09 7:03 a.m., Petr Machata wrote:
>> Jamal Hadi Salim  writes:
>> 
>>> I am concerned about adding new opcodes which only make sense if you
>>> offload (or make sense only if you are running in s/w).
>>>
>>> Those opcodes are intended to be generic abstractions so the dispatcher
>>> can decide what to do next.
>>> [...]
>>> For details see:
>>> https://people.netfilter.org/pablo/netdev0.1/papers/Linux-Traffic-Control-Classifier-Action-Subsystem-Architecture.pdf
>>
>> Trap has been in since 4.13, so 2017ish. It's done and dusted at this
>> point.
>
> here's how it translates:
> "We already made a mistake, therefore, its ok to build on it and
> make more mistakes".

I can see how it reads that way, but that was not the intention. I was
actually thinking about whether there might be a way to gradually
migrate all this stuff over to mirred, but at this point, trap is very
much baked in.

>>> IMO:
>>> It seems to me there are two actions here encapsulated in one.
>>> The first is to "trap" and the second is to "drop".
>>>
>>> This is no different semantically than say "mirror and drop"
>>> offload being enunciated by "skip_sw".
>>>
>>> Does the spectrum not support multiple actions?
>>> e.g with a policy like:
>>>   match blah action trap action drop skip_sw
>> Trap drops implicitly. We need a "trap, but don't drop". Expressed in
>> terms of existing actions it would be "mirred egress redirect dev
>> $cpu_port". But how to express $cpu_port except again by a HW-specific
>> magic token I don't know.

(I meant mirred egress mirror, not redirect.)

> Note: mirred was originally intended to send redirect/mirror
> packets to user space (the comment is still there in the code).
> Infact there is a patch lying around somewhere that does that with
> packet sockets (the author hasnt been serious about pushing it
> upstream). In that case the semantics are redirecting to a file
> descriptor. Could we have something like that here which points
> to whatever representation $cpu_port has? Sounds like semantics
> for "trap and forward" are just "mirror and forward".

Hmm, we have devlink ports, the CPU port is exposed there. But that's
the only thing that comes to mind. Those are specific for the given
device though, it doesn't look suitable...

> I think there is value in having something like trap action
> which generalizes the combinations only to the fact that
> it will make it easier to relay the info to the offload without
> much transformation.
> If i was to do it i would write one action configured by user space:
> - to return DROP if you want action trap-and-drop semantics.
> - to return STOLEN if you want trap
> - to return PIPE if you want trap and forward. You will need a second
> action composed to forward.

I think your STOLEN and PIPE are the same behavior. Both are "transfer
the packet to the SW datapath, but keep it in the HW datapath".

In general I have no issue expressing this stuff as a new action,
instead of an opcode. I'll take a look at this.
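
For reference, the single configurable action suggested above could be sketched like this. It is only an illustration of the proposal as quoted: `enum trap_mode` and `trap_act_verdict()` are hypothetical names, "DROP" is taken to mean `TC_ACT_SHOT`, and the opcode values are copied from include/uapi/linux/pkt_cls.h so that the sketch builds outside the kernel tree.

```c
#include <assert.h>

/* Opcode values as defined in include/uapi/linux/pkt_cls.h. */
#define TC_ACT_SHOT	2
#define TC_ACT_PIPE	3
#define TC_ACT_STOLEN	4

/* Hypothetical configuration knob for the proposed action. */
enum trap_mode {
	TRAP_MODE_DROP,		/* trap and drop */
	TRAP_MODE_TRAP,		/* trap only */
	TRAP_MODE_FORWARD,	/* trap, then let a following action forward */
};

/* Map the user-configured mode to the opcode returned to the dispatcher. */
static int trap_act_verdict(enum trap_mode mode)
{
	switch (mode) {
	case TRAP_MODE_DROP:
		return TC_ACT_SHOT;
	case TRAP_MODE_TRAP:
		return TC_ACT_STOLEN;
	case TRAP_MODE_FORWARD:
		return TC_ACT_PIPE;
	}
	return TC_ACT_SHOT;	/* unknown mode: fail closed */
}
```

Whether the STOLEN and PIPE modes would really differ once offloaded is exactly the open question raised in the reply above.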

Re: [PATCH net-next 1/7] net: sched: Add a trap-and-forward action

2021-04-09 Thread Petr Machata



Jamal Hadi Salim  writes:

> I am concerned about adding new opcodes which only make sense if you
> offload (or make sense only if you are running in s/w).
>
> Those opcodes are intended to be generic abstractions so the dispatcher
> can decide what to do next. Adding things that are specific only
> to scenarios of hardware offload removes that opaqueness.
> I must have missed the discussion on ACT_TRAP because it is the
> same issue there i.e shouldnt be an opcode. For details see:
> https://people.netfilter.org/pablo/netdev0.1/papers/Linux-Traffic-Control-Classifier-Action-Subsystem-Architecture.pdf

Trap has been in since 4.13, so 2017ish. It's done and dusted at this
point.

> IMO:
> It seems to me there are two actions here encapsulated in one.
> The first is to "trap" and the second is to "drop".
>
> This is no different semantically than say "mirror and drop"
> offload being enunciated by "skip_sw".
>
> Does the spectrum not support multiple actions?
> e.g with a policy like:
>  match blah action trap action drop skip_sw

Trap drops implicitly. We need a "trap, but don't drop". Expressed in
terms of existing actions it would be "mirred egress redirect dev
$cpu_port". But how to express $cpu_port except again by a HW-specific
magic token I don't know.

[PATCH net-next 7/7] selftests: mlxsw: Add a trap_fwd test to devlink_trap_control

2021-04-08 Thread Petr Machata

Test that trap_fwd'd packets show up under the correct trap.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../drivers/net/mlxsw/devlink_trap_control.sh | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_control.sh 
b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_control.sh
index a37273473c1b..8bca4c58819b 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_control.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_control.sh
@@ -83,6 +83,7 @@ ALL_TESTS="
ptp_general_test
flow_action_sample_test
flow_action_trap_test
+   flow_action_trap_fwd_test
 "
 NUM_NETIFS=4
 source $lib_dir/lib.sh
@@ -663,14 +664,18 @@ flow_action_sample_test()
tc qdisc del dev $rp1 clsact
 }
 
-flow_action_trap_test()
+__flow_action_trap_test()
 {
+   local action=$1; shift
+   local trap=$1; shift
+   local description=$1; shift
+
# Install a filter that traps a specific flow.
tc qdisc add dev $rp1 clsact
tc filter add dev $rp1 ingress proto ip pref 1 handle 101 flower \
-   skip_sw ip_proto udp src_port 12345 dst_port 54321 action trap
+   skip_sw ip_proto udp src_port 12345 dst_port 54321 action $action
 
-   devlink_trap_stats_test "Flow Trapping (Logging)" "flow_action_trap" \
+   devlink_trap_stats_test "$description" $trap \
$MZ $h1 -c 1 -a own -b $(mac_get $rp1) \
-A 192.0.2.1 -B 198.51.100.1 -t udp sp=12345,dp=54321 -p 100 -q
 
@@ -678,6 +683,18 @@ flow_action_trap_test()
tc qdisc del dev $rp1 clsact
 }
 
+flow_action_trap_test()
+{
+   __flow_action_trap_test trap flow_action_trap \
+   "Flow Trapping (Logging)"
+}
+
+flow_action_trap_fwd_test()
+{
+   __flow_action_trap_test trap_fwd flow_action_trap_fwd \
+   "Flow Trap-and-forwarding (Logging)"
+}
+
 trap cleanup EXIT
 
 setup_prepare
-- 
2.26.2

[PATCH net-next 6/7] selftests: forwarding: Add a test for TC trapping behavior

2021-04-08 Thread Petr Machata

Test that trapped packets are forwarded through the SW datapath, whereas
trap_fwd'd ones are not (but are forwarded through HW datapath). For
completeness' sake, also test that "pass" (i.e. lack of trapping) simply
forwards the packets in the HW datapath.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../selftests/net/forwarding/tc_trap.sh   | 170 ++
 1 file changed, 170 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_trap.sh

diff --git a/tools/testing/selftests/net/forwarding/tc_trap.sh 
b/tools/testing/selftests/net/forwarding/tc_trap.sh
new file mode 100755
index ..56336cea45a2
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/tc_trap.sh
@@ -0,0 +1,170 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# In the following simple routing scenario, put SW datapath packet probes on
+# $swp1, $swp2 and $h2. Always expect packets to arrive at $h2. Depending on
+# whether, in the HW datapath, $swp1 lets packets pass, traps them, or
+# trap_fwd's them, $swp1 and $swp2 probes are expected to give different
+# results.
+#
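+# +--------------------+                     +----------------------+
+# | H1                 |                     |                   H2 |
+# |    + $h1           |                     |            $h2 +     |
+# |    | 192.0.2.1/28  |                     |  192.0.2.18/28 |     |
+# +----|---------------+                     +----------------|-----+
+#      |                                                      |
+# +----|------------------------------------------------------|-----+
+# | SW |                                                      |     |
+# |    + $swp1                                          $swp2 +     |
+# |      192.0.2.2/28                            192.0.2.17/28      |
+# +-----------------------------------------------------------------+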
+
+
+ALL_TESTS="
+   no_trap_test
+   trap_fwd_test
+   trap_test
+"
+
+NUM_NETIFS=4
+source lib.sh
+source tc_common.sh
+
+h1_create()
+{
+   simple_if_init $h1 192.0.2.1/28
+   ip route add vrf v$h1 192.0.2.16/28 via 192.0.2.2
+}
+
+h1_destroy()
+{
+   ip route del vrf v$h1 192.0.2.16/28 via 192.0.2.2
+   simple_if_fini $h1 192.0.2.1/28
+}
+
+h2_create()
+{
+   simple_if_init $h2 192.0.2.18/28
+   ip route add vrf v$h2 192.0.2.0/28 via 192.0.2.17
+   tc qdisc add dev $h2 clsact
+}
+
+h2_destroy()
+{
+   tc qdisc del dev $h2 clsact
+   ip route del vrf v$h2 192.0.2.0/28 via 192.0.2.17
+   simple_if_fini $h2 192.0.2.18/28
+}
+
+switch_create()
+{
+   simple_if_init $swp1 192.0.2.2/28
+   __simple_if_init $swp2 v$swp1 192.0.2.17/28
+
+   tc qdisc add dev $swp1 clsact
+   tc qdisc add dev $swp2 clsact
+}
+
+switch_destroy()
+{
+   tc qdisc del dev $swp2 clsact
+   tc qdisc del dev $swp1 clsact
+
+   __simple_if_fini $swp2 192.0.2.17/28
+   simple_if_fini $swp1 192.0.2.2/28
+}
+
+setup_prepare()
+{
+   h1=${NETIFS[p1]}
+   swp1=${NETIFS[p2]}
+
+   swp2=${NETIFS[p3]}
+   h2=${NETIFS[p4]}
+
+   vrf_prepare
+   forwarding_enable
+
+   h1_create
+   h2_create
+   switch_create
+}
+
+cleanup()
+{
+   pre_cleanup
+
+   switch_destroy
+   h2_destroy
+   h1_destroy
+
+   forwarding_restore
+   vrf_cleanup
+}
+
+__test()
+{
+   local action=$1; shift
+   local ingress_should_fail=$1; shift
+   local egress_should_fail=$1; shift
+
+   tc filter add dev $swp1 ingress protocol ip pref 2 handle 101 \
+   flower skip_sw dst_ip 192.0.2.18 action $action
+   tc filter add dev $swp1 ingress protocol ip pref 1 handle 102 \
+   flower skip_hw dst_ip 192.0.2.18 action pass
+   tc filter add dev $swp2 egress protocol ip pref 1 handle 103 \
+   flower skip_hw dst_ip 192.0.2.18 action pass
+   tc filter add dev $h2 ingress protocol ip pref 1 handle 104 \
+   flower dst_ip 192.0.2.18 action drop
+
+   RET=0
+
+   $MZ $h1 -c 1 -p 64 -a $(mac_get $h1) -b $(mac_get $swp1) \
+   -A 192.0.2.1 -B 192.0.2.18 -q -t ip
+
+   tc_check_packets "dev $swp1 ingress" 102 1
+   check_err_fail $ingress_should_fail $? "ingress should_fail $ingress_should_fail"
+
+   tc_check_packets "dev $swp2 egress" 103 1
+   check_err_fail $egress_should_fail $? "egress should_fail $egress_should_fail"
+
+   tc_check_packets "dev $h2 ingress" 104 1
+   check_err $? "Did not see the packet on host"
+
+   log_test "$action test"
+
+   tc filter del dev $h2 ingress protocol ip pref 1 handle 104 flower
+   tc filter del dev $swp2 egress protocol ip pref 1 handle 103 flower
+   tc filter del dev $swp1 ingress protocol ip p

[PATCH net-next 1/7] net: sched: Add a trap-and-forward action

2021-04-08 Thread Petr Machata

The TC action "trap" is used to instruct the HW datapath to drop the
matched packet and transfer it for processing in the SW pipeline. If
instead it is desirable to forward the packet and transfer a _copy_ to
the SW pipeline, there is no practical way to achieve that.

To that end add a new generic action, trap_fwd. In the software pipeline,
it is equivalent to an OK. When offloading, it should forward the packet to
the host, but unlike trap it should not drop the packet.
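
The software-side behavior described above can be sketched as a simplified verdict check. This is a model under stated assumptions, not kernel code: `sw_verdict_continues()` is a hypothetical helper, the opcode values are copied from include/uapi/linux/pkt_cls.h (with 9 for `TC_ACT_TRAP_FWD` being the value proposed by this patch), and every verdict other than the three listed is lumped together as "the packet does not continue", which glosses over redirect and similar cases.

```c
#include <assert.h>
#include <stdbool.h>

#define TC_ACT_OK		0
#define TC_ACT_RECLASSIFY	1
#define TC_ACT_SHOT		2
#define TC_ACT_TRAP		8
#define TC_ACT_TRAP_FWD		9	/* value proposed by this patch */

/* Return true if, in this simplified model of the SW datapath, the packet
 * keeps travelling through the stack after classification.
 */
static bool sw_verdict_continues(int verdict)
{
	switch (verdict) {
	case TC_ACT_OK:
	case TC_ACT_RECLASSIFY:
	case TC_ACT_TRAP_FWD:	/* treated exactly like OK in software */
		return true;
	default:
		return false;
	}
}
```

The point of grouping `TC_ACT_TRAP_FWD` with `TC_ACT_OK` is that only an offloading driver gives the opcode a distinct meaning.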

Signed-off-by: Petr Machata 
Reviewed-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 include/uapi/linux/pkt_cls.h   |  6 +-
 net/core/dev.c |  2 ++
 net/sched/act_bpf.c| 13 +++--
 net/sched/cls_bpf.c|  1 +
 net/sched/sch_dsmark.c |  1 +
 tools/include/uapi/linux/pkt_cls.h |  6 +-
 6 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 025c40fef93d..a1bbccb88e67 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -72,7 +72,11 @@ enum {
   * the skb and act like everything
   * is alright.
   */
-#define TC_ACT_VALUE_MAX   TC_ACT_TRAP
+#define TC_ACT_TRAP_FWD		9 /* For hw path, this means "send a copy
+  * of the packet to the cpu". For sw
+  * datapath, this is like TC_ACT_OK.
+  */
+#define TC_ACT_VALUE_MAX   TC_ACT_TRAP_FWD
 
 /* There is a special kind of actions called "extended actions",
  * which need a value parameter. These have a local opcode located in
diff --git a/net/core/dev.c b/net/core/dev.c
index 9d1a8fac793f..f0b8c16dbf12 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3975,6 +3975,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct 
net_device *dev)
switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) {
case TC_ACT_OK:
case TC_ACT_RECLASSIFY:
+   case TC_ACT_TRAP_FWD:
skb->tc_index = TC_H_MIN(cl_res.classid);
break;
case TC_ACT_SHOT:
@@ -5083,6 +5084,7 @@ sch_handle_ingress(struct sk_buff *skb, struct 
packet_type **pt_prev, int *ret,
 &cl_res, false)) {
case TC_ACT_OK:
case TC_ACT_RECLASSIFY:
+   case TC_ACT_TRAP_FWD:
skb->tc_index = TC_H_MIN(cl_res.classid);
break;
case TC_ACT_SHOT:
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index e48e980c3b93..be2a51c6f84e 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -54,8 +54,16 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct 
tc_action *act,
bpf_compute_data_pointers(skb);
filter_res = BPF_PROG_RUN(filter, skb);
}
-   if (skb_sk_is_prefetched(skb) && filter_res != TC_ACT_OK)
-   skb_orphan(skb);
+   if (skb_sk_is_prefetched(skb)) {
+   switch (filter_res) {
+   case TC_ACT_OK:
+   case TC_ACT_TRAP_FWD:
+   break;
+   default:
+   skb_orphan(skb);
+   break;
+   }
+   }
rcu_read_unlock();
 
/* A BPF program may overwrite the default action opcode.
@@ -72,6 +80,7 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct 
tc_action *act,
case TC_ACT_PIPE:
case TC_ACT_RECLASSIFY:
case TC_ACT_OK:
+   case TC_ACT_TRAP_FWD:
case TC_ACT_REDIRECT:
action = filter_res;
break;
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 6e3e63db0e01..5fd96cf2dca7 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -69,6 +69,7 @@ static int cls_bpf_exec_opcode(int code)
case TC_ACT_SHOT:
case TC_ACT_STOLEN:
case TC_ACT_TRAP:
+   case TC_ACT_TRAP_FWD:
case TC_ACT_REDIRECT:
case TC_ACT_UNSPEC:
return code;
diff --git a/net/sched/sch_dsmark.c b/net/sched/sch_dsmark.c
index cd2748e2d4a2..054a06bd9dc8 100644
--- a/net/sched/sch_dsmark.c
+++ b/net/sched/sch_dsmark.c
@@ -258,6 +258,7 @@ static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
goto drop;
 #endif
case TC_ACT_OK:
+   case TC_ACT_TRAP_FWD:
skb->tc_index = TC_H_MIN(res.classid);
break;
 
diff --git a/tools/include/uapi/linux/pkt_cls.h 
b/tools/include/uapi/linux/pkt_cls.h
index 12153771396a..ccfa424dfeaf 100644
--- a/tools/include/uapi/linux/pkt_cls.h
+++ b/tools/include/uapi/linux/pkt_cls.h
@@ -45,7 +45,11 @@ enum {
   * the skb and act like

[PATCH net-next 5/7] mlxsw: Offload trap_fwd

2021-04-08 Thread Petr Machata

Offload the TC action trap_fwd. This is offloaded as a TRAP_ACTION with
forward_action of FORWARD (as opposed to NOP for the trap action). Unlike
trap, trap_fwd needs to be in a "goto"-typed action set, not "next"-typed
one.

Trap_fwd'd traffic is marked with offload_fwd_mark and offload_l3_fwd_mark
to prevent second forwarding in the SW datapath.
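
The goto-versus-next restriction can be sketched in isolation as follows. This is a hedged user-space model: `struct action_set_model`, `enum set_link_type` and both helpers are hypothetical names that only mirror the idea of flagging a set when trap_fwd is appended and refusing to chain it with "next" afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* How an action set hands over to whatever follows it (hypothetical). */
enum set_link_type {
	SET_LINK_GOTO,
	SET_LINK_NEXT,
};

/* Hypothetical stand-in for a flexible action set. */
struct action_set_model {
	bool has_trap_fwd;
	enum set_link_type link;
};

/* Appending trap_fwd marks the set so that later linking can be checked. */
static void action_set_append_trap_fwd(struct action_set_model *set)
{
	set->has_trap_fwd = true;
}

/* Switch the set to "next" linking; refuse if it contains trap_fwd. */
static int action_set_link_next(struct action_set_model *set)
{
	if (set->has_trap_fwd)
		return -EINVAL;
	set->link = SET_LINK_NEXT;
	return 0;
}
```

Returning an error here, instead of silently ignoring the request, is what makes it worthwhile to propagate extack down to the commit path.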

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../mellanox/mlxsw/core_acl_flex_actions.c| 23 +++
 .../net/ethernet/mellanox/mlxsw/spectrum.h|  1 +
 .../ethernet/mellanox/mlxsw/spectrum_acl.c|  6 +
 .../ethernet/mellanox/mlxsw/spectrum_flower.c |  7 ++
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   |  8 +++
 drivers/net/ethernet/mellanox/mlxsw/trap.h|  2 ++
 6 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
index faa90cc31376..d7d7e688139f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
@@ -94,7 +94,8 @@ struct mlxsw_afa_set {
  * kvdl_index is valid).
  */
   has_trap:1,
-  has_police:1;
+  has_police:1,
+  has_trap_fwd:1;
unsigned int ref_count;
struct mlxsw_afa_set *next; /* Pointer to the next set. */
struct mlxsw_afa_set *prev; /* Pointer to the previous set,
@@ -263,14 +264,23 @@ static void mlxsw_afa_set_goto_set(struct mlxsw_afa_set 
*set,
mlxsw_afa_set_goto_next_binding_set(actions, group_id);
 }
 
-static void mlxsw_afa_set_next_set(struct mlxsw_afa_set *set,
+static int mlxsw_afa_set_next_set(struct mlxsw_afa_set *set,
  u32 next_set_kvdl_index,
  struct netlink_ext_ack *extack)
 {
char *actions = set->ht_key.enc_actions;
 
+   /* If the forwarding action is not drop, the next/goto record must not
+* be a next, it must be a goto.
+*/
+   if (set->has_trap_fwd) {
+   NL_SET_ERR_MSG_MOD(extack, "Only goto permissible after a trap_fwd action");
+   return -EINVAL;
+   }
+
mlxsw_afa_set_type_set(actions, MLXSW_AFA_SET_TYPE_NEXT);
mlxsw_afa_set_next_action_set_ptr_set(actions, next_set_kvdl_index);
+   return 0;
 }
 
 static struct mlxsw_afa_set *mlxsw_afa_set_create(bool is_first)
@@ -461,6 +471,7 @@ int mlxsw_afa_block_commit(struct mlxsw_afa_block *block,
 {
struct mlxsw_afa_set *set = block->cur_set;
struct mlxsw_afa_set *prev_set;
+   int err;
 
block->cur_set = NULL;
block->finished = true;
@@ -481,8 +492,10 @@ int mlxsw_afa_block_commit(struct mlxsw_afa_block *block,
return PTR_ERR(set);
if (prev_set) {
prev_set->next = set;
-   mlxsw_afa_set_next_set(prev_set, set->kvdl_index,
-  extack);
+   err = mlxsw_afa_set_next_set(prev_set, set->kvdl_index,
+extack);
+   if (err)
+   return err;
set = prev_set;
}
} while (prev_set);
@@ -1346,6 +1359,8 @@ int mlxsw_afa_block_append_trap_and_forward(struct 
mlxsw_afa_block *block,
 
if (IS_ERR(act))
return PTR_ERR(act);
+
+   block->cur_set->has_trap_fwd = true;
mlxsw_afa_trap_pack(act, MLXSW_AFA_TRAP_TRAP_ACTION_TRAP,
MLXSW_AFA_TRAP_FORWARD_ACTION_FORWARD, trap_id);
return 0;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index d74fc7ff8083..6067a049dcf2 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -940,6 +940,7 @@ int mlxsw_sp_acl_rulei_act_drop(struct 
mlxsw_sp_acl_rule_info *rulei,
const struct flow_action_cookie *fa_cookie,
struct netlink_ext_ack *extack);
 int mlxsw_sp_acl_rulei_act_trap(struct mlxsw_sp_acl_rule_info *rulei);
+int mlxsw_sp_acl_rulei_act_trap_fwd(struct mlxsw_sp_acl_rule_info *rulei);
 int mlxsw_sp_acl_rulei_act_mirror(struct mlxsw_sp *mlxsw_sp,
  struct mlxsw_sp_acl_rule_info *rulei,
  struct mlxsw_sp_flow_block *block,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
index b9c4c1feba6d..6f7913424bd9 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
@@ -401,6 +401,12 @@ int

[PATCH net-next 4/7] mlxsw: Propagate extack to mlxsw_afa_block_commit()

2021-04-08 Thread Petr Machata

In the following patch, attempts to change the next/goto of a flexible
action set from goto to next will be rejected for action sets that contain
a trap_fwd action. Propagate extack to make it possible to communicate the
issue to the user.

Signed-off-by: Petr Machata 
Reviewed-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 .../net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c  | 9 ++---
 .../net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h  | 3 ++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h   | 3 ++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum1_acl_tcam.c | 2 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c   | 5 +++--
 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c| 2 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c   | 2 +-
 7 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
index 78d9c0196f2b..faa90cc31376 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
@@ -264,7 +264,8 @@ static void mlxsw_afa_set_goto_set(struct mlxsw_afa_set 
*set,
 }
 
 static void mlxsw_afa_set_next_set(struct mlxsw_afa_set *set,
-  u32 next_set_kvdl_index)
+ u32 next_set_kvdl_index,
+ struct netlink_ext_ack *extack)
 {
char *actions = set->ht_key.enc_actions;
 
@@ -455,7 +456,8 @@ void mlxsw_afa_block_destroy(struct mlxsw_afa_block *block)
 }
 EXPORT_SYMBOL(mlxsw_afa_block_destroy);
 
-int mlxsw_afa_block_commit(struct mlxsw_afa_block *block)
+int mlxsw_afa_block_commit(struct mlxsw_afa_block *block,
+  struct netlink_ext_ack *extack)
 {
struct mlxsw_afa_set *set = block->cur_set;
struct mlxsw_afa_set *prev_set;
@@ -479,7 +481,8 @@ int mlxsw_afa_block_commit(struct mlxsw_afa_block *block)
return PTR_ERR(set);
if (prev_set) {
prev_set->next = set;
-   mlxsw_afa_set_next_set(prev_set, set->kvdl_index);
+   mlxsw_afa_set_next_set(prev_set, set->kvdl_index,
+  extack);
set = prev_set;
}
} while (prev_set);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h 
b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h
index b65bf98eb5ab..24350f9470f8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h
@@ -45,7 +45,8 @@ struct mlxsw_afa *mlxsw_afa_create(unsigned int 
max_acts_per_set,
 void mlxsw_afa_destroy(struct mlxsw_afa *mlxsw_afa);
 struct mlxsw_afa_block *mlxsw_afa_block_create(struct mlxsw_afa *mlxsw_afa);
 void mlxsw_afa_block_destroy(struct mlxsw_afa_block *block);
-int mlxsw_afa_block_commit(struct mlxsw_afa_block *block);
+int mlxsw_afa_block_commit(struct mlxsw_afa_block *block,
+  struct netlink_ext_ack *extack);
 char *mlxsw_afa_block_first_set(struct mlxsw_afa_block *block);
 char *mlxsw_afa_block_cur_set(struct mlxsw_afa_block *block);
 u32 mlxsw_afa_block_first_kvdl_index(struct mlxsw_afa_block *block);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index f99db88ee884..d74fc7ff8083 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -920,7 +920,8 @@ struct mlxsw_sp_acl_rule_info *
 mlxsw_sp_acl_rulei_create(struct mlxsw_sp_acl *acl,
  struct mlxsw_afa_block *afa_block);
 void mlxsw_sp_acl_rulei_destroy(struct mlxsw_sp_acl_rule_info *rulei);
-int mlxsw_sp_acl_rulei_commit(struct mlxsw_sp_acl_rule_info *rulei);
+int mlxsw_sp_acl_rulei_commit(struct mlxsw_sp_acl_rule_info *rulei,
+ struct netlink_ext_ack *extack);
 void mlxsw_sp_acl_rulei_priority(struct mlxsw_sp_acl_rule_info *rulei,
 unsigned int priority);
 void mlxsw_sp_acl_rulei_keymask_u32(struct mlxsw_sp_acl_rule_info *rulei,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum1_acl_tcam.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum1_acl_tcam.c
index 3a636f753607..cda04bc4453f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum1_acl_tcam.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum1_acl_tcam.c
@@ -75,7 +75,7 @@ mlxsw_sp1_acl_ctcam_region_catchall_add(struct mlxsw_sp 
*mlxsw_sp,
err = mlxsw_sp_acl_rulei_act_continue(rulei);
if (WARN_ON(err))
goto err_rulei_act_continue;
-   err = mlxsw_sp_acl_rulei_commit(rulei);
+   err = mlxsw_sp_acl_rulei_commit(rulei, NULL);
if (err)
goto err_rulei_comm

[PATCH net-next 0/7] tc: Introduce a trap-and-forward action

2021-04-08 Thread Petr Machata

The TC action "trap" is used to instruct the HW datapath to drop the
matched packet and transfer it to the host for processing in the SW
pipeline. If instead it is desirable to forward the packet in the HW
datapath, and to transfer a _copy_ to the SW pipeline, there is no
practical way to achieve that.

As a particular use case, the mlxsw driver could instruct a Spectrum
machine to mirror packets that are ECN-marked to the host. However these
packets are still forwarded in the HW datapath, therefore describing this
mirroring through the "trap" action is incorrect. A new action is needed.

To that end, this patchset introduces a new generic action, trap_fwd. In
the software pipeline, it is equivalent to an OK. When offloading, it
should forward the packet to the host, but unlike trap it should not drop
the packet.
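
The difference between the three dispositions can be sketched as a small
model. This is only an illustration of the semantics described above, not
code from the patchset: the enum, the struct and both function names are
hypothetical and do not correspond to the uAPI constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical action names for this sketch; not the uAPI constants. */
enum sketch_act { SKETCH_OK, SKETCH_TRAP, SKETCH_TRAP_FWD };

/* What the offloaded (HW) datapath does with a matched packet. */
struct hw_verdict {
	bool forwarded_in_hw;	/* packet continues through the HW datapath */
	bool sent_to_host;	/* packet (or a copy) reaches the SW pipeline */
};

struct hw_verdict hw_apply(enum sketch_act act)
{
	/* Default, as for OK: forward in HW, nothing sent to the host. */
	struct hw_verdict v = { .forwarded_in_hw = true, .sent_to_host = false };

	assert(act == SKETCH_OK || act == SKETCH_TRAP || act == SKETCH_TRAP_FWD);

	if (act == SKETCH_TRAP) {
		/* trap: drop in HW, transfer the packet to the host. */
		v.forwarded_in_hw = false;
		v.sent_to_host = true;
	} else if (act == SKETCH_TRAP_FWD) {
		/* trap_fwd: keep forwarding in HW, send a copy to the host. */
		v.sent_to_host = true;
	}
	return v;
}

/* In the software pipeline, trap_fwd is treated like OK. */
enum sketch_act sw_equivalent(enum sketch_act act)
{
	return act == SKETCH_TRAP_FWD ? SKETCH_OK : act;
}
```

The only field in which trap and trap_fwd differ is forwarded_in_hw, which
is the distinction the existing "trap" action cannot express.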

This patchset proceeds as follows:

- In patch #1, introduce the new action, and modify the TC code to
  recognize it as an OK.

- In patches #2 and #3, introduce the artifacts necessary for offloading
  the trap_fwd action, and a new trap so that drivers can report the
  trapped packets.

- Patches #4 and #5 offload trap_fwd in mlxsw.

- Patches #6 and #7 add selftests.

Petr Machata (7):
  net: sched: Add a trap-and-forward action
  net: sched: Make the action trap_fwd offloadable
  devlink: Add a new trap for the trap_fwd action
  mlxsw: Propagate extack to mlxsw_afa_block_commit()
  mlxsw: Offload trap_fwd
  selftests: forwarding: Add a test for TC trapping behavior
  selftests: mlxsw: Add a trap_fwd test to devlink_trap_control

 .../networking/devlink/devlink-trap.rst   |   4 +
 .../mellanox/mlxsw/core_acl_flex_actions.c|  28 ++-
 .../mellanox/mlxsw/core_acl_flex_actions.h|   3 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.h|   4 +-
 .../mellanox/mlxsw/spectrum1_acl_tcam.c   |   2 +-
 .../ethernet/mellanox/mlxsw/spectrum_acl.c|  11 +-
 .../ethernet/mellanox/mlxsw/spectrum_flower.c |   9 +-
 .../mellanox/mlxsw/spectrum_mr_tcam.c |   2 +-
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   |   8 +
 drivers/net/ethernet/mellanox/mlxsw/trap.h|   2 +
 include/net/devlink.h |   3 +
 include/net/flow_offload.h|   1 +
 include/net/tc_act/tc_gact.h  |   5 +
 include/uapi/linux/pkt_cls.h  |   6 +-
 net/core/dev.c|   2 +
 net/core/devlink.c|   1 +
 net/sched/act_bpf.c   |  13 +-
 net/sched/cls_api.c   |   2 +
 net/sched/cls_bpf.c   |   1 +
 net/sched/sch_dsmark.c|   1 +
 tools/include/uapi/linux/pkt_cls.h|   6 +-
 .../drivers/net/mlxsw/devlink_trap_control.sh |  23 ++-
 .../selftests/net/forwarding/tc_trap.sh   | 170 ++
 23 files changed, 288 insertions(+), 19 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_trap.sh

-- 
2.26.2

[PATCH net-next 3/7] devlink: Add a new trap for the trap_fwd action

2021-04-08 Thread Petr Machata

Add a new trap so that drivers can report packets forwarded due to the
trap_fwd action correctly.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 Documentation/networking/devlink/devlink-trap.rst | 4 
 include/net/devlink.h | 3 +++
 net/core/devlink.c| 1 +
 3 files changed, 8 insertions(+)

diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index 935b6397e8cf..3f1c0f89d284 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -405,6 +405,10 @@ be added to the following table:
  - ``control``
  - Traps packets logged during processing of flow action trap (e.g., via
tc's trap action)
+   * - ``flow_action_trap_fwd``
+ - ``control``
+ - Traps packets logged during processing of flow action trap_fwd (e.g., via
+   tc's trap_fwd action)
* - ``early_drop``
  - ``drop``
  - Traps packets dropped due to the RED (Random Early Detection) algorithm
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 853420db5d32..967e70363ba9 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -845,6 +845,7 @@ enum devlink_trap_generic_id {
DEVLINK_TRAP_GENERIC_ID_PTP_GENERAL,
DEVLINK_TRAP_GENERIC_ID_FLOW_ACTION_SAMPLE,
DEVLINK_TRAP_GENERIC_ID_FLOW_ACTION_TRAP,
+   DEVLINK_TRAP_GENERIC_ID_FLOW_ACTION_TRAP_FWD,
DEVLINK_TRAP_GENERIC_ID_EARLY_DROP,
DEVLINK_TRAP_GENERIC_ID_VXLAN_PARSING,
DEVLINK_TRAP_GENERIC_ID_LLC_SNAP_PARSING,
@@ -1053,6 +1054,8 @@ enum devlink_trap_group_generic_id {
"flow_action_sample"
 #define DEVLINK_TRAP_GENERIC_NAME_FLOW_ACTION_TRAP \
"flow_action_trap"
+#define DEVLINK_TRAP_GENERIC_NAME_FLOW_ACTION_TRAP_FWD \
+   "flow_action_trap_fwd"
 #define DEVLINK_TRAP_GENERIC_NAME_EARLY_DROP \
"early_drop"
 #define DEVLINK_TRAP_GENERIC_NAME_VXLAN_PARSING \
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 737b61c2976e..478d4bc01a39 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -9744,6 +9744,7 @@ static const struct devlink_trap devlink_trap_generic[] = {
DEVLINK_TRAP(PTP_GENERAL, CONTROL),
DEVLINK_TRAP(FLOW_ACTION_SAMPLE, CONTROL),
DEVLINK_TRAP(FLOW_ACTION_TRAP, CONTROL),
+   DEVLINK_TRAP(FLOW_ACTION_TRAP_FWD, CONTROL),
DEVLINK_TRAP(EARLY_DROP, DROP),
DEVLINK_TRAP(VXLAN_PARSING, DROP),
DEVLINK_TRAP(LLC_SNAP_PARSING, DROP),
-- 
2.26.2

[PATCH net-next 2/7] net: sched: Make the action trap_fwd offloadable

2021-04-08 Thread Petr Machata

Add the new flow action and related support so that drivers can offload the
trap_fwd action.

Signed-off-by: Petr Machata 
Reviewed-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 include/net/flow_offload.h   | 1 +
 include/net/tc_act/tc_gact.h | 5 +
 net/sched/cls_api.c  | 2 ++
 3 files changed, 8 insertions(+)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index dc5c1e69cd9f..5f35523f12b5 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -121,6 +121,7 @@ enum flow_action_id {
FLOW_ACTION_ACCEPT  = 0,
FLOW_ACTION_DROP,
FLOW_ACTION_TRAP,
+   FLOW_ACTION_TRAP_FWD,
FLOW_ACTION_GOTO,
FLOW_ACTION_REDIRECT,
FLOW_ACTION_MIRRED,
diff --git a/include/net/tc_act/tc_gact.h b/include/net/tc_act/tc_gact.h
index eb8f01c819e6..df9e0a19c826 100644
--- a/include/net/tc_act/tc_gact.h
+++ b/include/net/tc_act/tc_gact.h
@@ -49,6 +49,11 @@ static inline bool is_tcf_gact_trap(const struct tc_action *a)
return __is_tcf_gact_act(a, TC_ACT_TRAP, false);
 }
 
+static inline bool is_tcf_gact_trap_fwd(const struct tc_action *a)
+{
+   return __is_tcf_gact_act(a, TC_ACT_TRAP_FWD, false);
+}
+
 static inline bool is_tcf_gact_goto_chain(const struct tc_action *a)
 {
return __is_tcf_gact_act(a, TC_ACT_GOTO_CHAIN, true);
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index d3db70865d66..95e37eb50173 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3582,6 +3582,8 @@ int tc_setup_flow_action(struct flow_action *flow_action,
entry->id = FLOW_ACTION_DROP;
} else if (is_tcf_gact_trap(act)) {
entry->id = FLOW_ACTION_TRAP;
+   } else if (is_tcf_gact_trap_fwd(act)) {
+   entry->id = FLOW_ACTION_TRAP_FWD;
} else if (is_tcf_gact_goto_chain(act)) {
entry->id = FLOW_ACTION_GOTO;
entry->chain_index = tcf_gact_goto_chain_index(act);
-- 
2.26.2

[PATCH net-next] Documentation: net: Document resilient next-hop groups

2021-03-29 Thread Petr Machata

Add a document describing the principles behind resilient next-hop groups,
and some notes about how to configure and offload them.

Suggested-by: David Ahern 
Signed-off-by: Petr Machata 
Reviewed-by: David Ahern 
---

Notes:
v1 (from an RFC shared privately):
- Dropped a reference to a non-existent footnote [Ido]
- Spell out consequences of flow redirection explicitly [Ido]
- A handful of wording changes [Ido]
- Kept David's R-b due to minor scope of the above fixes

 Documentation/networking/index.rst|   1 +
 .../networking/nexthop-group-resilient.rst| 293 ++
 2 files changed, 294 insertions(+)
 create mode 100644 Documentation/networking/nexthop-group-resilient.rst

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index b8a29997d433..e9ce55992aa9 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -76,6 +76,7 @@ Contents:
netdevices
netfilter-sysctl
netif-msg
+   nexthop-group-resilient
nf_conntrack-sysctl
nf_flowtable
openvswitch
diff --git a/Documentation/networking/nexthop-group-resilient.rst b/Documentation/networking/nexthop-group-resilient.rst
new file mode 100644
index 000000000000..fabecee24d85
--- /dev/null
+++ b/Documentation/networking/nexthop-group-resilient.rst
@@ -0,0 +1,293 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+Resilient Next-hop Groups
+=========================
+
+Resilient groups are a type of next-hop group that is aimed at minimizing
+disruption in flow routing across changes to the group composition and
+weights of constituent next hops.
+
+The idea behind resilient hashing groups is best explained in contrast to
+the legacy multipath next-hop group, which uses the hash-threshold
+algorithm, described in RFC 2992.
+
+To select a next hop, the hash-threshold algorithm first assigns a range of
+hashes to each next hop in the group, and then selects the next hop by
+comparing the SKB hash with the individual ranges. When a next hop is
+removed from the group, the ranges are recomputed, which leads to
+reassignment of parts of hash space from one next hop to another. RFC 2992
+illustrates it thus::
+
+   +-------+-------+-------+-------+-------+
+   |   1   |   2   |   3   |   4   |   5   |
+   +-------+-+-----+---+---+-----+-+-------+
+   |    1    |    2    |    4    |    5    |
+   +---------+---------+---------+---------+
+
+    Before and after deletion of next hop 3
+    under the hash-threshold algorithm.
+
+Note how next hop 2 gave up part of the hash space in favor of next hop 1,
+and 4 in favor of 5. While there will usually be some overlap between the
+previous and the new distribution, some traffic flows change the next hop
+that they resolve to.
+
+If a multipath group is used for load-balancing between multiple servers,
+this hash space reassignment causes an issue that packets from a single
+flow suddenly end up arriving at a server that does not expect them. This
+can result in TCP connections being reset.
+
+If a multipath group is used for load-balancing among available paths to
+the same server, the issue is that different latencies and reordering along
+the way cause the packets to arrive in the wrong order, resulting in
+degraded application performance.
+
+To mitigate the above-mentioned flow redirection, resilient next-hop groups
+insert another layer of indirection between the hash space and its
+constituent next hops: a hash table. The selection algorithm uses SKB hash
+to choose a hash table bucket, then reads the next hop that this bucket
+contains, and forwards traffic there.
+
+This indirection brings an important feature. In the hash-threshold
+algorithm, the range of hashes associated with a next hop must be
+continuous. With a hash table, mapping between the hash table buckets and
+the individual next hops is arbitrary. Therefore when a next hop is deleted
+the buckets that held it are simply reassigned to other next hops::
+
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+                    v v v v
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+   |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
+   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+   Before and after deletion of next hop 3
+   under the resilient hashing algorithm.
+
+When weights of next hops in a group are altered, it may be possible to
+choose a subset of buckets that are currently not used for forwarding
+traffic, and use those to satisfy the new next-hop distribution demands,
+keeping the "busy" buckets intact. This way, established flows are ideally
+kept being forwarded to the same endpoints through the same paths as before
+the next-hop group change.
+
+Algorithm
+---------

[PATCH net-next] nexthop: Rename artifacts related to legacy multipath nexthop groups

2021-03-26 Thread Petr Machata

After resilient next-hop groups have been added recently, there are two
types of multipath next-hop groups: the legacy "mpath", and the new
"resilient". Calling the legacy next-hop group type "mpath" is unfortunate,
because that describes the fact that a packet could be forwarded in one of
several paths, which is also true for the resilient next-hop groups.

To make the naming clearer, rename various artifacts to reflect the
assumptions made. As of this patch:

- The flag for multipath groups is nh_group::is_multipath. This
  includes the legacy and resilient groups, as well as any future group
  types that behave as multipath groups.
  Functions that assume this have "mpath" in the name.

- The flag for legacy multipath groups is nh_group::hash_threshold.
  Functions that assume this have "hthr" in the name.

- The flag for resilient groups is nh_group::resilient.
  Functions that assume this have "res" in the name.

Besides the above, struct nh_grp_entry::mpath was renamed to ::hthr as
well.

UAPI artifacts were obviously left intact.

Suggested-by: David Ahern 
Signed-off-by: Petr Machata 
---
 include/net/nexthop.h |  4 ++--
 net/ipv4/nexthop.c| 56 +--
 2 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index ba94868a21d5..ace54bf90b2c 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -102,7 +102,7 @@ struct nh_grp_entry {
union {
struct {
atomic_t upper_bound;
-   } mpath;
+   } hthr;
struct {
/* Member on uw_nh_entries. */
struct list_head uw_nh_entry;
@@ -120,7 +120,7 @@ struct nh_group {
struct nh_group *spare; /* spare group for removals */
u16 num_nh;
bool is_multipath;
-   bool mpath;
+   bool hash_threshold;
bool resilient;
bool fdb_nh;
bool has_v4;
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index f09fe3a5608f..5a2fc8798d20 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -116,8 +116,8 @@ static void nh_notifier_single_info_fini(struct nh_notifier_info *info)
kfree(info->nh);
 }
 
-static int nh_notifier_mp_info_init(struct nh_notifier_info *info,
-   struct nh_group *nhg)
+static int nh_notifier_mpath_info_init(struct nh_notifier_info *info,
+  struct nh_group *nhg)
 {
u16 num_nh = nhg->num_nh;
int i;
@@ -181,8 +181,8 @@ static int nh_notifier_grp_info_init(struct nh_notifier_info *info,
 {
struct nh_group *nhg = rtnl_dereference(nh->nh_grp);
 
-   if (nhg->mpath)
-   return nh_notifier_mp_info_init(info, nhg);
+   if (nhg->hash_threshold)
+   return nh_notifier_mpath_info_init(info, nhg);
else if (nhg->resilient)
return nh_notifier_res_table_info_init(info, nhg);
return -EINVAL;
@@ -193,7 +193,7 @@ static void nh_notifier_grp_info_fini(struct nh_notifier_info *info,
 {
struct nh_group *nhg = rtnl_dereference(nh->nh_grp);
 
-   if (nhg->mpath)
+   if (nhg->hash_threshold)
kfree(info->nh_grp);
else if (nhg->resilient)
vfree(info->nh_res_table);
@@ -406,7 +406,7 @@ static int call_nexthop_res_table_notifiers(struct net *net, struct nexthop *nh,
 * could potentially veto it in case of unsupported configuration.
 */
nhg = rtnl_dereference(nh->nh_grp);
-   err = nh_notifier_mp_info_init(&info, nhg);
+   err = nh_notifier_mpath_info_init(&info, nhg);
if (err) {
NL_SET_ERR_MSG(extack, "Failed to initialize nexthop notifier info");
return err;
@@ -661,7 +661,7 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg)
u16 group_type = 0;
int i;
 
-   if (nhg->mpath)
+   if (nhg->hash_threshold)
group_type = NEXTHOP_GRP_TYPE_MPATH;
else if (nhg->resilient)
group_type = NEXTHOP_GRP_TYPE_RES;
@@ -992,9 +992,9 @@ static bool valid_group_nh(struct nexthop *nh, unsigned int npaths,
struct nh_group *nhg = rtnl_dereference(nh->nh_grp);
 
/* Nesting groups within groups is not supported. */
-   if (nhg->mpath) {
+   if (nhg->hash_threshold) {
NL_SET_ERR_MSG(extack,
-  "Multipath group can not be a nexthop within a group");
+  "H

[PATCH iproute2-next v4 5/6] nexthop: Add support for resilient nexthop groups

2021-03-17 Thread Petr Machata

From: Ido Schimmel 

Add ability to configure resilient nexthop groups and show their current
configuration. Example:

 # ip nexthop add id 10 group 1/2 type resilient buckets 8
 # ip nexthop show id 10
 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0
 # ip -j -p nexthop show id 10
 [ {
 "id": 10,
 "group": [ {
 "id": 1
 },{
 "id": 2
 } ],
 "type": "resilient",
 "resilient_args": {
 "buckets": 8,
 "idle_timer": 120,
 "unbalanced_timer": 0
 },
 "flags": [ ]
 } ]

Signed-off-by: Ido Schimmel 
Signed-off-by: Petr Machata 
---
 ip/ipnexthop.c| 144 +-
 man/man8/ip-nexthop.8 |  55 +++-
 2 files changed, 193 insertions(+), 6 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 5aae32629edd..1d50bf7529c4 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -43,9 +43,12 @@ static void usage(void)
"[ groups ] [ fdb ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
"[ encap ENCAPTYPE ENCAPHDR ] |\n"
-   "group GROUP [ fdb ] [ type TYPE ] }\n"
+   "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n"
"GROUP := [ <id[,weight]>/<id[,weight]>/... ]\n"
-   "TYPE := { mpath }\n"
+   "TYPE := { mpath | resilient }\n"
+   "TYPE_ARGS := [ RESILIENT_ARGS ]\n"
+   "RESILIENT_ARGS := [ buckets BUCKETS ] [ idle_timer IDLE ]\n"
+   "  [ unbalanced_timer UNBALANCED ]\n"
"ENCAPTYPE := [ mpls ]\n"
"ENCAPHDR := [ MPLSLABEL ]\n");
exit(-1);
@@ -203,6 +206,66 @@ static void print_nh_group(FILE *fp, const struct rtattr *grps_attr)
close_json_array(PRINT_JSON, NULL);
 }
 
+static const char *nh_group_type_name(__u16 type)
+{
+   switch (type) {
+   case NEXTHOP_GRP_TYPE_MPATH:
+   return "mpath";
+   case NEXTHOP_GRP_TYPE_RES:
+   return "resilient";
+   default:
+   return "<unknown type>";
+   }
+}
+
+static void print_nh_group_type(FILE *fp, const struct rtattr *grp_type_attr)
+{
+   __u16 type = rta_getattr_u16(grp_type_attr);
+
+   if (type == NEXTHOP_GRP_TYPE_MPATH)
+   /* Do not print type in order not to break existing output. */
+   return;
+
+   print_string(PRINT_ANY, "type", "type %s ", nh_group_type_name(type));
+}
+
+static void print_nh_res_group(FILE *fp, const struct rtattr *res_grp_attr)
+{
+   struct rtattr *tb[NHA_RES_GROUP_MAX + 1];
+   struct rtattr *rta;
+   struct timeval tv;
+
+   parse_rtattr_nested(tb, NHA_RES_GROUP_MAX, res_grp_attr);
+
+   open_json_object("resilient_args");
+
+   if (tb[NHA_RES_GROUP_BUCKETS])
+   print_uint(PRINT_ANY, "buckets", "buckets %u ",
+  rta_getattr_u16(tb[NHA_RES_GROUP_BUCKETS]));
+
+   if (tb[NHA_RES_GROUP_IDLE_TIMER]) {
+   rta = tb[NHA_RES_GROUP_IDLE_TIMER];
+   __jiffies_to_tv(&tv, rta_getattr_u32(rta));
+   print_tv(PRINT_ANY, "idle_timer", "idle_timer %g ", &tv);
+   }
+
+   if (tb[NHA_RES_GROUP_UNBALANCED_TIMER]) {
+   rta = tb[NHA_RES_GROUP_UNBALANCED_TIMER];
+   __jiffies_to_tv(&tv, rta_getattr_u32(rta));
+   print_tv(PRINT_ANY, "unbalanced_timer", "unbalanced_timer %g ",
+&tv);
+   }
+
+   if (tb[NHA_RES_GROUP_UNBALANCED_TIME]) {
+   rta = tb[NHA_RES_GROUP_UNBALANCED_TIME];
+   __jiffies_to_tv(&tv, rta_getattr_u32(rta));
+   print_tv(PRINT_ANY, "unbalanced_time", "unbalanced_time %g ",
+&tv);
+   }
+
+   close_json_object();
+}
+
 int print_nexthop(struct nlmsghdr *n, void *arg)
 {
struct nhmsg *nhm = NLMSG_DATA(n);
@@ -229,7 +292,7 @@ int print_nexthop(struct nlmsghdr *n, void *arg)
if (filter.proto && filter.proto != nhm->nh_protocol)
return 0;
 
-   parse_rtattr(tb, NHA_MAX, RTM_NHA(nhm), len);
+   parse_rtattr_flags(tb, NHA_MAX, RTM_NHA(nhm), len, NLA_F_NESTED);
 
open_json_object(NULL);
 
@@ -243,6 +306,12 @@ int print_nexthop(struct nlmsghdr *n, void *arg)
if (tb[NHA_GROUP])
print_nh_group(fp, tb[NHA_GROUP]);
 
+   if (tb[NHA_GROUP_TYPE])
+   print_nh_group_type(fp, tb[NHA_G

[PATCH iproute2-next v4 6/6] nexthop: Add support for nexthop buckets

2021-03-17 Thread Petr Machata

From: Ido Schimmel 

Add ability to dump multiple nexthop buckets and get a specific one.
Example:

 # ip nexthop add id 10 group 1/2 type resilient buckets 8
 # ip nexthop
 id 1 via 192.0.2.2 dev dummy10 scope link
 id 2 via 192.0.2.19 dev dummy20 scope link
 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 unbalanced_time 0
 # ip nexthop bucket
 id 10 index 0 idle_time 28.1 nhid 2
 id 10 index 1 idle_time 28.1 nhid 2
 id 10 index 2 idle_time 28.1 nhid 2
 id 10 index 3 idle_time 28.1 nhid 2
 id 10 index 4 idle_time 28.1 nhid 1
 id 10 index 5 idle_time 28.1 nhid 1
 id 10 index 6 idle_time 28.1 nhid 1
 id 10 index 7 idle_time 28.1 nhid 1
 # ip nexthop bucket show nhid 1
 id 10 index 4 idle_time 53.59 nhid 1
 id 10 index 5 idle_time 53.59 nhid 1
 id 10 index 6 idle_time 53.59 nhid 1
 id 10 index 7 idle_time 53.59 nhid 1
 # ip nexthop bucket get id 10 index 5
 id 10 index 5 idle_time 81 nhid 1
 # ip -j -p nexthop bucket get id 10 index 5
 [ {
 "id": 10,
 "bucket": {
 "index": 5,
 "idle_time": 104.89,
 "nhid": 1
 },
 "flags": [ ]
 } ]

Signed-off-by: Ido Schimmel 
Signed-off-by: Petr Machata 
---
 include/libnetlink.h  |   3 +
 ip/ip_common.h|   1 +
 ip/ipmonitor.c|   6 +
 ip/ipnexthop.c| 254 ++
 lib/libnetlink.c  |  26 +
 man/man8/ip-nexthop.8 |  45 
 6 files changed, 335 insertions(+)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index b9073a6a13ad..e8ed5d7fb495 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -97,6 +97,9 @@ int rtnl_dump_request_n(struct rtnl_handle *rth, struct nlmsghdr *n)
 int rtnl_nexthopdump_req(struct rtnl_handle *rth, int family,
 req_filter_fn_t filter_fn)
__attribute__((warn_unused_result));
+int rtnl_nexthop_bucket_dump_req(struct rtnl_handle *rth, int family,
+req_filter_fn_t filter_fn)
+   __attribute__((warn_unused_result));
 
 struct rtnl_ctrl_data {
int nsid;
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 9a31e837563f..55a5521c4275 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -53,6 +53,7 @@ int print_rule(struct nlmsghdr *n, void *arg);
 int print_netconf(struct rtnl_ctrl_data *ctrl,
  struct nlmsghdr *n, void *arg);
 int print_nexthop(struct nlmsghdr *n, void *arg);
+int print_nexthop_bucket(struct nlmsghdr *n, void *arg);
 void netns_map_init(void);
 void netns_nsid_socket_init(void);
 int print_nsid(struct nlmsghdr *n, void *arg);
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 99f5fda8ba1f..d7f31cf5d1b5 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -90,6 +90,12 @@ static int accept_msg(struct rtnl_ctrl_data *ctrl,
print_nexthop(n, arg);
return 0;
 
+   case RTM_NEWNEXTHOPBUCKET:
+   case RTM_DELNEXTHOPBUCKET:
+   print_headers(fp, "[NEXTHOPBUCKET]", ctrl);
+   print_nexthop_bucket(n, arg);
+   return 0;
+
case RTM_NEWLINK:
case RTM_DELLINK:
ll_remember_index(n, NULL);
diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 1d50bf7529c4..0263307c49df 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -21,6 +21,8 @@ static struct {
unsigned int master;
unsigned int proto;
unsigned int fdb;
+   unsigned int id;
+   unsigned int nhid;
 } filter;
 
 enum {
@@ -39,8 +41,11 @@ static void usage(void)
"Usage: ip nexthop { list | flush } [ protocol ID ] SELECTOR\n"
"   ip nexthop { add | replace } id ID NH [ protocol ID ]\n"
"   ip nexthop { get | del } id ID\n"
+   "   ip nexthop bucket list BUCKET_SELECTOR\n"
+   "   ip nexthop bucket get id ID index INDEX\n"
"SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n"
"[ groups ] [ fdb ]\n"
+   "BUCKET_SELECTOR := SELECTOR | [ nhid ID ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
"[ encap ENCAPTYPE ENCAPHDR ] |\n"
"group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n"
@@ -85,6 +90,36 @@ static int nh_dump_filter(struct nlmsghdr *nlh, int reqlen)
return 0;
 }
 
+static int nh_dump_bucket_filter(struct nlmsghdr *nlh, int reqlen)
+{
+   struct rtattr *nest;
+   int err = 0;
+
+   err = nh_dump_filter(nlh, reqlen);
+   if (err)
+   return err;
+
+   if (filter.id) {
+   err = addattr32(nlh, reqlen, NHA_ID, filter.id);
+   if (err)
+   return err;
+   }
+
+   if (filter.nhid) {
+

[PATCH iproute2-next v4 3/6] nexthop: Extract a helper to parse a NH ID

2021-03-17 Thread Petr Machata

NH ID extraction is a common operation, and will become more common still
with the resilient NH groups support. Add a helper that does what is
usually done and returns the parsed NH ID.
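
For readers without the iproute2 tree at hand, the kind of validation such
a helper centralizes can be sketched in standalone C. This is an assumed
analogue built on strtoul(); parse_nh_id() is a hypothetical name, and the
sketch reports errors through its return value instead of exiting the way
invarg() does.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a next-hop ID given in decimal, hex (0x...) or octal (0...).
 * Returns 0 and stores the value in *id on success, -1 on any error.
 */
int parse_nh_id(const char *arg, uint32_t *id)
{
	unsigned long val;
	char *end;

	assert(id != NULL);

	/* Reject NULL, empty strings, signs and leading whitespace. */
	if (arg == NULL || !isdigit((unsigned char)*arg))
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 0);

	/* Reject overflow, trailing garbage and values that exceed 32 bits. */
	if (errno != 0 || *end != '\0' || val > UINT32_MAX)
		return -1;

	*id = (uint32_t)val;
	return 0;
}
```

Keeping the check in one place means every "id" argument is rejected under
the same conditions, which is the point of the extraction.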

Signed-off-by: Petr Machata 
---
 ip/ipnexthop.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 20cde586596b..126b0b17cab4 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -327,6 +327,15 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv)
return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps));
 }
 
+static int ipnh_parse_id(const char *argv)
+{
+   __u32 id;
+
+   if (get_unsigned(&id, argv, 0))
+   invarg("invalid id value", argv);
+   return id;
+}
+
 static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 {
struct {
@@ -343,12 +352,9 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 
while (argc > 0) {
if (!strcmp(*argv, "id")) {
-   __u32 id;
-
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
-   addattr32(&req.n, sizeof(req), NHA_ID, id);
+   addattr32(&req.n, sizeof(req), NHA_ID,
+ ipnh_parse_id(*argv));
} else if (!strcmp(*argv, "dev")) {
int ifindex;
 
@@ -485,12 +491,8 @@ static int ipnh_list_flush(int argc, char **argv, int action)
if (!filter.master)
invarg("VRF does not exist\n", *argv);
} else if (!strcmp(*argv, "id")) {
-   __u32 id;
-
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
-   return ipnh_get_id(id);
+   return ipnh_get_id(ipnh_parse_id(*argv));
} else if (!matches(*argv, "protocol")) {
__u32 proto;
 
@@ -536,8 +538,7 @@ static int ipnh_get(int argc, char **argv)
while (argc > 0) {
if (!strcmp(*argv, "id")) {
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
+   id = ipnh_parse_id(*argv);
} else  {
usage();
}
-- 
2.26.2

[PATCH iproute2-next v4 4/6] nexthop: Add ability to specify group type

2021-03-17 Thread Petr Machata

From: Ido Schimmel 

Next patches are going to add a 'resilient' nexthop group type, so allow
users to specify the type using the 'type' argument. Currently, only
'mpath' type is supported.

These two commands are equivalent:

 # ip nexthop add id 10 group 1/2/3
 # ip nexthop add id 10 group 1/2/3 type mpath
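
Why the two commands are equivalent can be sketched as a tiny lookup in
which a missing "type" argument resolves to the same value as an explicit
"mpath". This is only a model of the behavior described here: the enum and
resolve_group_type() are hypothetical names, and where the default is
actually applied (tool or kernel) is not something this sketch asserts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type values for this sketch; not the uAPI constants. */
enum sketch_grp_type { SKETCH_GRP_MPATH, SKETCH_GRP_INVALID };

/*
 * Resolve the optional "type" argument. A NULL name models a command line
 * without "type". Returns 0 on success, -1 for an unknown type name.
 */
int resolve_group_type(const char *name, enum sketch_grp_type *type)
{
	assert(type != NULL);

	if (name == NULL || strcmp(name, "mpath") == 0) {
		*type = SKETCH_GRP_MPATH;
		return 0;
	}

	*type = SKETCH_GRP_INVALID;
	return -1;
}
```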

Signed-off-by: Ido Schimmel 
Signed-off-by: Petr Machata 
---

Notes:
v2:
- Add a missing example command to commit message
- Mention in the man page that mpath is the default

 ip/ipnexthop.c| 32 +++-
 man/man8/ip-nexthop.8 | 19 +--
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 126b0b17cab4..5aae32629edd 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -42,8 +42,10 @@ static void usage(void)
"SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n"
"[ groups ] [ fdb ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
-   "[ encap ENCAPTYPE ENCAPHDR ] | group GROUP [ fdb ] }\n"
+   "[ encap ENCAPTYPE ENCAPHDR ] |\n"
+   "group GROUP [ fdb ] [ type TYPE ] }\n"
"GROUP := [ <id[,weight]>/<id[,weight]>/... ]\n"
+   "TYPE := { mpath }\n"
"ENCAPTYPE := [ mpls ]\n"
"ENCAPHDR := [ MPLSLABEL ]\n");
exit(-1);
@@ -327,6 +329,32 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv)
return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps));
 }
 
+static int read_nh_group_type(const char *name)
+{
+   if (strcmp(name, "mpath") == 0)
+   return NEXTHOP_GRP_TYPE_MPATH;
+
+   return __NEXTHOP_GRP_TYPE_MAX;
+}
+
+static void parse_nh_group_type(struct nlmsghdr *n, int maxlen, int *argcp,
+   char ***argvp)
+{
+   char **argv = *argvp;
+   int argc = *argcp;
+   __u16 type;
+
+   NEXT_ARG();
+   type = read_nh_group_type(*argv);
+   if (type > NEXTHOP_GRP_TYPE_MAX)
+   invarg("\"type\" value is invalid\n", *argv);
+
+   *argcp = argc;
+   *argvp = argv;
+
+   addattr16(n, maxlen, NHA_GROUP_TYPE, type);
+}
+
 static int ipnh_parse_id(const char *argv)
 {
__u32 id;
@@ -409,6 +437,8 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 
if (add_nh_group_attr(&req.n, sizeof(req), *argv))
invarg("\"group\" value is invalid\n", *argv);
+   } else if (!strcmp(*argv, "type")) {
+   parse_nh_group_type(&req.n, sizeof(req), &argc, &argv);
} else if (matches(*argv, "protocol") == 0) {
__u32 prot;
 
diff --git a/man/man8/ip-nexthop.8 b/man/man8/ip-nexthop.8
index 4d55f4dbcc75..b86f307fef35 100644
--- a/man/man8/ip-nexthop.8
+++ b/man/man8/ip-nexthop.8
@@ -54,7 +54,9 @@ ip-nexthop \- nexthop object management
 .BR fdb " ] | "
 .B  group
 .IR GROUP " [ "
-.BR fdb " ] } "
+.BR fdb " ] [ "
+.B type
+.IR TYPE " ] } "
 
 .ti -8
 .IR ENCAP " := [ "
@@ -71,6 +73,10 @@ ip-nexthop \- nexthop object management
 .IR GROUP " := "
 .BR id "[," weight "[/...]"
 
+.ti -8
+.IR TYPE " := { "
+.BR mpath " }"
+
 .SH DESCRIPTION
 .B ip nexthop
 is used to manipulate entries in the kernel's nexthop tables.
@@ -122,9 +128,18 @@ is a set of encapsulation attributes specific to the
 .in -2
 
 .TP
-.BI group " GROUP"
+.BI group " GROUP [ " type " TYPE ]"
 create a nexthop group. Group specification is id with an optional
 weight (id,weight) and a '/' as a separator between entries.
+.sp
+.I TYPE
+is a string specifying the nexthop group type. Namely:
+
+.in +8
+.BI mpath
+- Multipath nexthop group backed by the hash-threshold algorithm. The
+default when the type is unspecified.
+
 .TP
 .B blackhole
 create a blackhole nexthop
-- 
2.26.2

[PATCH iproute2-next v4 2/6] json_print: Add print_tv()

2021-03-17 Thread Petr Machata

Add a helper to dump a timeval. Print by first converting to double and
then dispatching to print_color_float().
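
The conversion step can be shown in isolation. The sketch below is a
standalone illustration rather than the helper added by this patch;
tv_to_seconds() is a hypothetical name. It relies only on the fact that
tv_usec counts microseconds, so it is divided by one million.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Convert a timeval to fractional seconds; tv_usec counts microseconds. */
double tv_to_seconds(const struct timeval *tv)
{
	assert(tv != NULL);
	return (double)tv->tv_sec + (double)tv->tv_usec / 1000000.0;
}
```

A timeval of 28 seconds and 100000 microseconds comes out as roughly 28.1,
the same shape as the idle_time values shown by "ip nexthop bucket".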

Signed-off-by: Petr Machata 
---

Notes:
v4:
- Make print_tv() take a const*.

 include/json_print.h |  1 +
 lib/json_print.c | 13 +
 2 files changed, 14 insertions(+)

diff --git a/include/json_print.h b/include/json_print.h
index 6fcf9fd910ec..91b34571ceb0 100644
--- a/include/json_print.h
+++ b/include/json_print.h
@@ -81,6 +81,7 @@ _PRINT_FUNC(0xhex, unsigned long long)
 _PRINT_FUNC(luint, unsigned long)
 _PRINT_FUNC(lluint, unsigned long long)
 _PRINT_FUNC(float, double)
+_PRINT_FUNC(tv, const struct timeval *)
 #undef _PRINT_FUNC
 
 #define _PRINT_NAME_VALUE_FUNC(type_name, type, format_char) \
diff --git a/lib/json_print.c b/lib/json_print.c
index 994a2f8d6ae0..e3a88375fe7c 100644
--- a/lib/json_print.c
+++ b/lib/json_print.c
@@ -299,6 +299,19 @@ int print_color_null(enum output_type type,
return ret;
 }
 
+int print_color_tv(enum output_type type,
+  enum color_attr color,
+  const char *key,
+  const char *fmt,
+  const struct timeval *tv)
+{
+   double usecs = tv->tv_usec;
+   double secs = tv->tv_sec;
+   double time = secs + usecs / 1000000;
+
+   return print_color_float(type, color, key, fmt, time);
+}
+
 /* Print line separator (if not in JSON mode) */
 void print_nl(void)
 {
-- 
2.26.2

[PATCH iproute2-next v4 0/6] ip: nexthop: Support resilient groups

2021-03-17 Thread Petr Machata

Support for resilient next-hop groups was recently accepted into the Linux
kernel [1]. Resilient next-hop groups add a layer of indirection between the
SKB hash and the next hop. Thus the hash is used to reference a hash table
bucket, which is then used to reference a particular next hop. This allows
the system more flexibility when assigning SKB hash space to next hops.
Previously, each next hop had to be assigned a continuous range of SKB hash
space. With a hash table as an intermediate layer, it is possible to
reassign next hops with a hash table bucket granularity. In turn, this
mends issues with traffic flow redirection resulting from next hop removal
or adjustments in next-hop weights.
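
The effect of the extra indirection can be sketched with a toy model: a
hash space of 20 values, five equally weighted next hops, and removal of
next hop 3. This is an illustration under those assumptions, not kernel
code; all names are hypothetical, and the round-robin reassignment is just
one simple policy chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define HASH_SPACE 20	/* toy hash space: hashes 0..19 */

/* Hash-threshold: each next hop owns one contiguous, equal-sized range. */
int hthr_select(const int *nhs, size_t num_nh, unsigned int hash)
{
	assert(num_nh > 0 && hash < HASH_SPACE);
	return nhs[hash * num_nh / HASH_SPACE];
}

/* Resilient: the hash picks a bucket, the bucket names the next hop. */
int res_select(const int *buckets, unsigned int hash)
{
	assert(hash < HASH_SPACE);
	return buckets[hash];
}

/* Start with the same contiguous layout as hash-threshold would use. */
void res_init(int *buckets, const int *nhs, size_t num_nh)
{
	for (unsigned int i = 0; i < HASH_SPACE; i++)
		buckets[i] = hthr_select(nhs, num_nh, i);
}

/*
 * Remove a next hop: only buckets that pointed at it are rewritten,
 * handing them out round-robin to the remaining next hops.
 */
void res_remove(int *buckets, int removed, const int *left, size_t num_left)
{
	size_t next = 0;

	assert(num_left > 0);
	for (unsigned int i = 0; i < HASH_SPACE; i++) {
		if (buckets[i] == removed) {
			buckets[i] = left[next];
			next = (next + 1) % num_left;
		}
	}
}

/* Count hashes that did not use the removed next hop, yet moved anyway. */
unsigned int count_moved(const int *before, const int *after, int removed)
{
	unsigned int moved = 0;

	for (unsigned int i = 0; i < HASH_SPACE; i++)
		if (before[i] != removed && before[i] != after[i])
			moved++;
	return moved;
}
```

In this model, recomputing the hash-threshold ranges moves some hashes that
never resolved to the removed next hop, while the bucket table leaves all
of them where they were.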

In this patch set, introduce support for resilient next-hop groups to
iproute2.

- Patch #1 brings include/uapi/linux/nexthop.h and /rtnetlink.h up to date.

- Patches #2 and #3 add new helpers that will be useful later.

- Patch #4 extends the ip/nexthop sub-tool to accept group type as a
  command line argument, and to dispatch based on the specified type.

- Patch #5 adds the support for resilient next-hop groups.

- Patch #6 adds the support for resilient next-hop group bucket interface.

To illustrate the usage, consider the following commands:

 # ip nexthop add id 1 via 192.0.2.2 dev dummy1
 # ip nexthop add id 2 via 192.0.2.3 dev dummy1
 # ip nexthop add id 10 group 1/2 type resilient \
buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8
buckets, each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.
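
How the two timers interact can be sketched as a per-bucket decision. This
is a simplified model of the description above, not the kernel algorithm:
the struct and function names are hypothetical, times are plain seconds,
and treating an unbalanced_timer of 0 as "never force" is an assumption of
the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the two timers, in seconds. */
struct res_timers {
	unsigned int idle_timer;
	unsigned int unbalanced_timer;	/* 0: never force (assumption) */
};

/* A bucket is idle once it saw no traffic for at least idle_timer seconds. */
bool bucket_is_idle(const struct res_timers *t, unsigned int idle_time)
{
	assert(t != NULL);
	return idle_time >= t->idle_timer;
}

/*
 * May this bucket be handed to another next hop right now? Either it is
 * idle, or the table has been out of balance for long enough that
 * rebalancing is forced even for busy buckets.
 */
bool bucket_may_migrate(const struct res_timers *t, unsigned int idle_time,
			unsigned int unbalanced_time)
{
	bool force;

	assert(t != NULL);
	force = t->unbalanced_timer != 0 &&
		unbalanced_time >= t->unbalanced_timer;
	return bucket_is_idle(t, idle_time) || force;
}
```

With idle_timer 60 and unbalanced_timer 300 as in the command above, a
bucket that saw traffic 5 seconds ago stays put until the table has been
unbalanced for 300 seconds.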

And this is how the next-hop group bucket interface looks:

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=2a0186a37700b0d5b8cc40be202a62af44f02fa2

v4:
- Patch #2:
- Make print_tv() take a const*.

v3:
- Add missing S-o-b's.

v2:
- Patch #4:
- Add a missing example command to commit message
- Mention in the man page that mpath is the default

Ido Schimmel (3):
  nexthop: Add ability to specify group type
  nexthop: Add support for resilient nexthop groups
  nexthop: Add support for nexthop buckets

Petr Machata (3):
  nexthop: Synchronize uAPI files
  json_print: Add print_tv()
  nexthop: Extract a helper to parse a NH ID

 include/json_print.h   |   1 +
 include/libnetlink.h   |   3 +
 include/uapi/linux/nexthop.h   |  47 +++-
 include/uapi/linux/rtnetlink.h |   7 +
 ip/ip_common.h |   1 +
 ip/ipmonitor.c |   6 +
 ip/ipnexthop.c | 451 -
 lib/json_print.c   |  13 +
 lib/libnetlink.c   |  26 ++
 man/man8/ip-nexthop.8  | 113 -
 10 files changed, 651 insertions(+), 17 deletions(-)

-- 
2.26.2

[PATCH iproute2-next v4 1/6] nexthop: Synchronize uAPI files

2021-03-17 Thread Petr Machata

Signed-off-by: Petr Machata 
---
 include/uapi/linux/nexthop.h   | 47 +-
 include/uapi/linux/rtnetlink.h |  7 +
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
index b0a5613905ef..37b14b4ea6c4 100644
--- a/include/uapi/linux/nexthop.h
+++ b/include/uapi/linux/nexthop.h
@@ -21,7 +21,10 @@ struct nexthop_grp {
 };
 
 enum {
-   NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
+   NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group
+ * default type if not specified
+ */
+   NEXTHOP_GRP_TYPE_RES,/* resilient nexthop group */
__NEXTHOP_GRP_TYPE_MAX,
 };
 
@@ -52,8 +55,50 @@ enum {
NHA_FDB,/* flag; nexthop belongs to a bridge fdb */
/* if NHA_FDB is added, OIF, BLACKHOLE, ENCAP cannot be set */
 
+   /* nested; resilient nexthop group attributes */
+   NHA_RES_GROUP,
+   /* nested; nexthop bucket attributes */
+   NHA_RES_BUCKET,
+
__NHA_MAX,
 };
 
 #define NHA_MAX (__NHA_MAX - 1)
+
+enum {
+   NHA_RES_GROUP_UNSPEC,
+   /* Pad attribute for 64-bit alignment. */
+   NHA_RES_GROUP_PAD = NHA_RES_GROUP_UNSPEC,
+
+   /* u16; number of nexthop buckets in a resilient nexthop group */
+   NHA_RES_GROUP_BUCKETS,
+   /* clock_t as u32; nexthop bucket idle timer (per-group) */
+   NHA_RES_GROUP_IDLE_TIMER,
+   /* clock_t as u32; nexthop unbalanced timer */
+   NHA_RES_GROUP_UNBALANCED_TIMER,
+   /* clock_t as u64; nexthop unbalanced time */
+   NHA_RES_GROUP_UNBALANCED_TIME,
+
+   __NHA_RES_GROUP_MAX,
+};
+
+#define NHA_RES_GROUP_MAX  (__NHA_RES_GROUP_MAX - 1)
+
+enum {
+   NHA_RES_BUCKET_UNSPEC,
+   /* Pad attribute for 64-bit alignment. */
+   NHA_RES_BUCKET_PAD = NHA_RES_BUCKET_UNSPEC,
+
+   /* u16; nexthop bucket index */
+   NHA_RES_BUCKET_INDEX,
+   /* clock_t as u64; nexthop bucket idle time */
+   NHA_RES_BUCKET_IDLE_TIME,
+   /* u32; nexthop id assigned to the nexthop bucket */
+   NHA_RES_BUCKET_NH_ID,
+
+   __NHA_RES_BUCKET_MAX,
+};
+
+#define NHA_RES_BUCKET_MAX (__NHA_RES_BUCKET_MAX - 1)
+
 #endif
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index b34b9add5f65..f6217651 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -178,6 +178,13 @@ enum {
RTM_GETVLAN,
 #define RTM_GETVLAN RTM_GETVLAN
 
+   RTM_NEWNEXTHOPBUCKET = 116,
+#define RTM_NEWNEXTHOPBUCKET   RTM_NEWNEXTHOPBUCKET
+   RTM_DELNEXTHOPBUCKET,
+#define RTM_DELNEXTHOPBUCKET   RTM_DELNEXTHOPBUCKET
+   RTM_GETNEXTHOPBUCKET,
+#define RTM_GETNEXTHOPBUCKET   RTM_GETNEXTHOPBUCKET
+
__RTM_MAX,
 #define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
 };
-- 
2.26.2

Re: [PATCH iproute2-next v3 2/6] json_print: Add print_tv()

2021-03-17 Thread Petr Machata



Stephen Hemminger  writes:

>> +_PRINT_FUNC(tv, struct timeval *)
>
> This 
>
> Make it const please?

OK

[PATCH iproute2] ip: Fix batch processing

2021-03-17 Thread Petr Machata

After the commit cited below, batch mode neglects to set the global
variable batch_mode to a non-zero value. Netns and VRF commands use this
variable, and break in batch mode. Fix by setting the value again.

Fixes: 1d9a81b8c9f3 ("Unify batch processing across tools")
Reported-by: Tim Rice 
Signed-off-by: Petr Machata 
---
 ip/ip.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ip/ip.c b/ip/ip.c
index 40d2998ae60b..2d7d0d327734 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -155,6 +155,7 @@ static int batch(const char *name)
return EXIT_FAILURE;
}
 
+   batch_mode = 1;
ret = do_batch(name, force, ip_batch_cmd, &orig_family);
 
rtnl_close(&rth);
-- 
2.26.2

Re: [BUG] Iproute2 batch-mode fails to bring up veth

2021-03-16 Thread Petr Machata

David Ahern  writes:

>> Git bisect pinpoints this commit:
>> https://github.com/shemminger/iproute2/commit/1d9a81b8c9f30f9f4abeb875998262f61bf10577
>> 
>
> Petr, can you take a look at this regression?

Yes, see elsewhere in the thread:

https://marc.info/?l=linux-netdev&m=161589291608081&w=2

I'm pretty sure this fixes the issue, and hopefully Tim can take it for
a spin and confirm. I'll send this formally afterwards.

Re: [BUG] Iproute2 batch-mode fails to bring up veth

2021-03-16 Thread Petr Machata

Thanks for the report. Would you be able to test with the following
patch?


https://github.com/pmachata/iproute2/commit/a12eeca9caf90b3ebe24bc121819d506c9072a34.patch

I believe it fixes the issue.

[PATCH iproute2-next v3 6/6] nexthop: Add support for nexthop buckets

2021-03-16 Thread Petr Machata

From: Ido Schimmel 

Add ability to dump multiple nexthop buckets and get a specific one.
Example:

 # ip nexthop add id 10 group 1/2 type resilient buckets 8
 # ip nexthop
 id 1 via 192.0.2.2 dev dummy10 scope link
 id 2 via 192.0.2.19 dev dummy20 scope link
 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 unbalanced_time 0
 # ip nexthop bucket
 id 10 index 0 idle_time 28.1 nhid 2
 id 10 index 1 idle_time 28.1 nhid 2
 id 10 index 2 idle_time 28.1 nhid 2
 id 10 index 3 idle_time 28.1 nhid 2
 id 10 index 4 idle_time 28.1 nhid 1
 id 10 index 5 idle_time 28.1 nhid 1
 id 10 index 6 idle_time 28.1 nhid 1
 id 10 index 7 idle_time 28.1 nhid 1
 # ip nexthop bucket show nhid 1
 id 10 index 4 idle_time 53.59 nhid 1
 id 10 index 5 idle_time 53.59 nhid 1
 id 10 index 6 idle_time 53.59 nhid 1
 id 10 index 7 idle_time 53.59 nhid 1
 # ip nexthop bucket get id 10 index 5
 id 10 index 5 idle_time 81 nhid 1
 # ip -j -p nexthop bucket get id 10 index 5
 [ {
 "id": 10,
 "bucket": {
 "index": 5,
 "idle_time": 104.89,
 "nhid": 1
 },
 "flags": [ ]
 } ]

Signed-off-by: Ido Schimmel 
Signed-off-by: Petr Machata 
---
 include/libnetlink.h  |   3 +
 ip/ip_common.h|   1 +
 ip/ipmonitor.c|   6 +
 ip/ipnexthop.c| 254 ++
 lib/libnetlink.c  |  26 +
 man/man8/ip-nexthop.8 |  45 
 6 files changed, 335 insertions(+)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index b9073a6a13ad..e8ed5d7fb495 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -97,6 +97,9 @@ int rtnl_dump_request_n(struct rtnl_handle *rth, struct nlmsghdr *n)
 int rtnl_nexthopdump_req(struct rtnl_handle *rth, int family,
 req_filter_fn_t filter_fn)
__attribute__((warn_unused_result));
+int rtnl_nexthop_bucket_dump_req(struct rtnl_handle *rth, int family,
+req_filter_fn_t filter_fn)
+   __attribute__((warn_unused_result));
 
 struct rtnl_ctrl_data {
int nsid;
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 9a31e837563f..55a5521c4275 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -53,6 +53,7 @@ int print_rule(struct nlmsghdr *n, void *arg);
 int print_netconf(struct rtnl_ctrl_data *ctrl,
  struct nlmsghdr *n, void *arg);
 int print_nexthop(struct nlmsghdr *n, void *arg);
+int print_nexthop_bucket(struct nlmsghdr *n, void *arg);
 void netns_map_init(void);
 void netns_nsid_socket_init(void);
 int print_nsid(struct nlmsghdr *n, void *arg);
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 99f5fda8ba1f..d7f31cf5d1b5 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -90,6 +90,12 @@ static int accept_msg(struct rtnl_ctrl_data *ctrl,
print_nexthop(n, arg);
return 0;
 
+   case RTM_NEWNEXTHOPBUCKET:
+   case RTM_DELNEXTHOPBUCKET:
+   print_headers(fp, "[NEXTHOPBUCKET]", ctrl);
+   print_nexthop_bucket(n, arg);
+   return 0;
+
case RTM_NEWLINK:
case RTM_DELLINK:
ll_remember_index(n, NULL);
diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 1d50bf7529c4..0263307c49df 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -21,6 +21,8 @@ static struct {
unsigned int master;
unsigned int proto;
unsigned int fdb;
+   unsigned int id;
+   unsigned int nhid;
 } filter;
 
 enum {
@@ -39,8 +41,11 @@ static void usage(void)
"Usage: ip nexthop { list | flush } [ protocol ID ] SELECTOR\n"
"   ip nexthop { add | replace } id ID NH [ protocol ID ]\n"
"   ip nexthop { get | del } id ID\n"
+   "   ip nexthop bucket list BUCKET_SELECTOR\n"
+   "   ip nexthop bucket get id ID index INDEX\n"
"SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n"
"[ groups ] [ fdb ]\n"
+   "BUCKET_SELECTOR := SELECTOR | [ nhid ID ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
"[ encap ENCAPTYPE ENCAPHDR ] |\n"
"group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n"
@@ -85,6 +90,36 @@ static int nh_dump_filter(struct nlmsghdr *nlh, int reqlen)
return 0;
 }
 
+static int nh_dump_bucket_filter(struct nlmsghdr *nlh, int reqlen)
+{
+   struct rtattr *nest;
+   int err = 0;
+
+   err = nh_dump_filter(nlh, reqlen);
+   if (err)
+   return err;
+
+   if (filter.id) {
+   err = addattr32(nlh, reqlen, NHA_ID, filter.id);
+   if (err)
+   return err;
+   }
+
+   if (filter.nhid) {
+

[PATCH iproute2-next v3 5/6] nexthop: Add support for resilient nexthop groups

2021-03-16 Thread Petr Machata

From: Ido Schimmel 

Add ability to configure resilient nexthop groups and show their current
configuration. Example:

 # ip nexthop add id 10 group 1/2 type resilient buckets 8
 # ip nexthop show id 10
 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0
 # ip -j -p nexthop show id 10
 [ {
 "id": 10,
 "group": [ {
 "id": 1
 },{
 "id": 2
 } ],
 "type": "resilient",
 "resilient_args": {
 "buckets": 8,
 "idle_timer": 120,
 "unbalanced_timer": 0
 },
 "flags": [ ]
 } ]

Signed-off-by: Ido Schimmel 
Signed-off-by: Petr Machata 
---
 ip/ipnexthop.c| 144 +-
 man/man8/ip-nexthop.8 |  55 +++-
 2 files changed, 193 insertions(+), 6 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 5aae32629edd..1d50bf7529c4 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -43,9 +43,12 @@ static void usage(void)
"[ groups ] [ fdb ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
"[ encap ENCAPTYPE ENCAPHDR ] |\n"
-   "group GROUP [ fdb ] [ type TYPE ] }\n"
+   "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n"
"GROUP := [ <id[,weight]>/<id[,weight]>/... ]\n"
-   "TYPE := { mpath }\n"
+   "TYPE := { mpath | resilient }\n"
+   "TYPE_ARGS := [ RESILIENT_ARGS ]\n"
+   "RESILIENT_ARGS := [ buckets BUCKETS ] [ idle_timer IDLE ]\n"
+   "  [ unbalanced_timer UNBALANCED ]\n"
"ENCAPTYPE := [ mpls ]\n"
"ENCAPHDR := [ MPLSLABEL ]\n");
exit(-1);
@@ -203,6 +206,66 @@ static void print_nh_group(FILE *fp, const struct rtattr *grps_attr)
close_json_array(PRINT_JSON, NULL);
 }
 
+static const char *nh_group_type_name(__u16 type)
+{
+   switch (type) {
+   case NEXTHOP_GRP_TYPE_MPATH:
+   return "mpath";
+   case NEXTHOP_GRP_TYPE_RES:
+   return "resilient";
+   default:
+   return "";
+   }
+}
+
+static void print_nh_group_type(FILE *fp, const struct rtattr *grp_type_attr)
+{
+   __u16 type = rta_getattr_u16(grp_type_attr);
+
+   if (type == NEXTHOP_GRP_TYPE_MPATH)
+   /* Do not print type in order not to break existing output. */
+   return;
+
+   print_string(PRINT_ANY, "type", "type %s ", nh_group_type_name(type));
+}
+
+static void print_nh_res_group(FILE *fp, const struct rtattr *res_grp_attr)
+{
+   struct rtattr *tb[NHA_RES_GROUP_MAX + 1];
+   struct rtattr *rta;
+   struct timeval tv;
+
+   parse_rtattr_nested(tb, NHA_RES_GROUP_MAX, res_grp_attr);
+
+   open_json_object("resilient_args");
+
+   if (tb[NHA_RES_GROUP_BUCKETS])
+   print_uint(PRINT_ANY, "buckets", "buckets %u ",
+  rta_getattr_u16(tb[NHA_RES_GROUP_BUCKETS]));
+
+   if (tb[NHA_RES_GROUP_IDLE_TIMER]) {
+   rta = tb[NHA_RES_GROUP_IDLE_TIMER];
+   __jiffies_to_tv(&tv, rta_getattr_u32(rta));
+   print_tv(PRINT_ANY, "idle_timer", "idle_timer %g ", &tv);
+   }
+
+   if (tb[NHA_RES_GROUP_UNBALANCED_TIMER]) {
+   rta = tb[NHA_RES_GROUP_UNBALANCED_TIMER];
+   __jiffies_to_tv(&tv, rta_getattr_u32(rta));
+   print_tv(PRINT_ANY, "unbalanced_timer", "unbalanced_timer %g ",
+&tv);
+   }
+
+   if (tb[NHA_RES_GROUP_UNBALANCED_TIME]) {
+   rta = tb[NHA_RES_GROUP_UNBALANCED_TIME];
+   __jiffies_to_tv(&tv, rta_getattr_u64(rta));
+   print_tv(PRINT_ANY, "unbalanced_time", "unbalanced_time %g ",
+&tv);
+   }
+
+   close_json_object();
+}
+
 int print_nexthop(struct nlmsghdr *n, void *arg)
 {
struct nhmsg *nhm = NLMSG_DATA(n);
@@ -229,7 +292,7 @@ int print_nexthop(struct nlmsghdr *n, void *arg)
if (filter.proto && filter.proto != nhm->nh_protocol)
return 0;
 
-   parse_rtattr(tb, NHA_MAX, RTM_NHA(nhm), len);
+   parse_rtattr_flags(tb, NHA_MAX, RTM_NHA(nhm), len, NLA_F_NESTED);
 
open_json_object(NULL);
 
@@ -243,6 +306,12 @@ int print_nexthop(struct nlmsghdr *n, void *arg)
if (tb[NHA_GROUP])
print_nh_group(fp, tb[NHA_GROUP]);
 
+   if (tb[NHA_GROUP_TYPE])
+   print_nh_group_type(fp, tb[NHA_G

[PATCH iproute2-next v3 3/6] nexthop: Extract a helper to parse a NH ID

2021-03-16 Thread Petr Machata

NH ID extraction is a common operation, and will become more common still
with the resilient NH groups support. Add a helper that does what is
usually done and returns the parsed NH ID.

Signed-off-by: Petr Machata 
---
 ip/ipnexthop.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 20cde586596b..126b0b17cab4 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -327,6 +327,15 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv)
return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps));
 }
 
+static int ipnh_parse_id(const char *argv)
+{
+   __u32 id;
+
+   if (get_unsigned(&id, argv, 0))
+   invarg("invalid id value", argv);
+   return id;
+}
+
 static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 {
struct {
@@ -343,12 +352,9 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 
while (argc > 0) {
if (!strcmp(*argv, "id")) {
-   __u32 id;
-
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
-   addattr32(&req.n, sizeof(req), NHA_ID, id);
+   addattr32(&req.n, sizeof(req), NHA_ID,
+ ipnh_parse_id(*argv));
} else if (!strcmp(*argv, "dev")) {
int ifindex;
 
@@ -485,12 +491,8 @@ static int ipnh_list_flush(int argc, char **argv, int action)
if (!filter.master)
invarg("VRF does not exist\n", *argv);
} else if (!strcmp(*argv, "id")) {
-   __u32 id;
-
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
-   return ipnh_get_id(id);
+   return ipnh_get_id(ipnh_parse_id(*argv));
} else if (!matches(*argv, "protocol")) {
__u32 proto;
 
@@ -536,8 +538,7 @@ static int ipnh_get(int argc, char **argv)
while (argc > 0) {
if (!strcmp(*argv, "id")) {
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
+   id = ipnh_parse_id(*argv);
} else  {
usage();
}
-- 
2.26.2
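
For readers without the iproute2 tree at hand, the parse step that the helper
wraps can be approximated with standard C alone. The sketch below is a
stand-in, not the code from the patch: parse_nh_id() is a hypothetical name,
it uses strtoul() where the patch uses iproute2's own get_unsigned(), and it
reports failure through its return value instead of calling invarg().

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the NH ID parse step.
 * Returns 0 and stores the ID on success, -1 on malformed input.
 */
static int parse_nh_id(const char *arg, uint32_t *id)
{
	unsigned long val;
	char *end;

	/* Require a leading digit: rejects "", "-1" and leading whitespace. */
	if (!arg || !isdigit((unsigned char)arg[0]))
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 0);	/* base 0: accepts decimal, 0x.., 0.. */
	if (errno || *end != '\0' || val > UINT32_MAX)
		return -1;

	*id = (uint32_t)val;
	return 0;
}
```

Returning the status separately from the value keeps the full 32-bit ID range
usable, at the cost of an output parameter.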

[PATCH iproute2-next v3 4/6] nexthop: Add ability to specify group type

2021-03-16 Thread Petr Machata

From: Ido Schimmel 

Next patches are going to add a 'resilient' nexthop group type, so allow
users to specify the type using the 'type' argument. Currently, only
'mpath' type is supported.

These two commands are equivalent:

 # ip nexthop add id 10 group 1/2/3
 # ip nexthop add id 10 group 1/2/3 type mpath

Signed-off-by: Ido Schimmel 
Signed-off-by: Petr Machata 
---

Notes:
v2:
- Add a missing example command to commit message
- Mention in the man page that mpath is the default

 ip/ipnexthop.c| 32 +++-
 man/man8/ip-nexthop.8 | 19 +--
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 126b0b17cab4..5aae32629edd 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -42,8 +42,10 @@ static void usage(void)
"SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n"
"[ groups ] [ fdb ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
-   "[ encap ENCAPTYPE ENCAPHDR ] | group GROUP [ fdb ] }\n"
+   "[ encap ENCAPTYPE ENCAPHDR ] |\n"
+   "group GROUP [ fdb ] [ type TYPE ] }\n"
"GROUP := [ <id[,weight]>/<id[,weight]>/... ]\n"
+   "TYPE := { mpath }\n"
"ENCAPTYPE := [ mpls ]\n"
"ENCAPHDR := [ MPLSLABEL ]\n");
exit(-1);
@@ -327,6 +329,32 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv)
return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps));
 }
 
+static int read_nh_group_type(const char *name)
+{
+   if (strcmp(name, "mpath") == 0)
+   return NEXTHOP_GRP_TYPE_MPATH;
+
+   return __NEXTHOP_GRP_TYPE_MAX;
+}
+
+static void parse_nh_group_type(struct nlmsghdr *n, int maxlen, int *argcp,
+   char ***argvp)
+{
+   char **argv = *argvp;
+   int argc = *argcp;
+   __u16 type;
+
+   NEXT_ARG();
+   type = read_nh_group_type(*argv);
+   if (type > NEXTHOP_GRP_TYPE_MAX)
+   invarg("\"type\" value is invalid\n", *argv);
+
+   *argcp = argc;
+   *argvp = argv;
+
+   addattr16(n, maxlen, NHA_GROUP_TYPE, type);
+}
+
 static int ipnh_parse_id(const char *argv)
 {
__u32 id;
@@ -409,6 +437,8 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 
if (add_nh_group_attr(&req.n, sizeof(req), *argv))
invarg("\"group\" value is invalid\n", *argv);
+   } else if (!strcmp(*argv, "type")) {
+   parse_nh_group_type(&req.n, sizeof(req), &argc, &argv);
} else if (matches(*argv, "protocol") == 0) {
__u32 prot;
 
diff --git a/man/man8/ip-nexthop.8 b/man/man8/ip-nexthop.8
index 4d55f4dbcc75..b86f307fef35 100644
--- a/man/man8/ip-nexthop.8
+++ b/man/man8/ip-nexthop.8
@@ -54,7 +54,9 @@ ip-nexthop \- nexthop object management
 .BR fdb " ] | "
 .B  group
 .IR GROUP " [ "
-.BR fdb " ] } "
+.BR fdb " ] [ "
+.B type
+.IR TYPE " ] } "
 
 .ti -8
 .IR ENCAP " := [ "
@@ -71,6 +73,10 @@ ip-nexthop \- nexthop object management
 .IR GROUP " := "
 .BR id "[," weight "[/...]"
 
+.ti -8
+.IR TYPE " := { "
+.BR mpath " }"
+
 .SH DESCRIPTION
 .B ip nexthop
 is used to manipulate entries in the kernel's nexthop tables.
@@ -122,9 +128,18 @@ is a set of encapsulation attributes specific to the
 .in -2
 
 .TP
-.BI group " GROUP"
+.BI group " GROUP [ " type " TYPE ]"
 create a nexthop group. Group specification is id with an optional
 weight (id,weight) and a '/' as a separator between entries.
+.sp
+.I TYPE
+is a string specifying the nexthop group type. Namely:
+
+.in +8
+.BI mpath
+- Multipath nexthop group backed by the hash-threshold algorithm. The
+default when the type is unspecified.
+
 .TP
 .B blackhole
 create a blackhole nexthop
-- 
2.26.2
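
The name-to-type lookup in this patch returns a value one past the last valid
type when the name is unknown, and the caller rejects anything above the
maximum. A table-driven variant of the same pattern might look as follows.
The enum and function names here are hypothetical and local to the sketch,
and the "resilient" entry only anticipates the type added later in the
series.

```c
#include <string.h>

/* Local, hypothetical mirror of the group type enum. */
enum {
	GRP_TYPE_MPATH,
	GRP_TYPE_RES,
	__GRP_TYPE_MAX,
};
#define GRP_TYPE_MAX (__GRP_TYPE_MAX - 1)

static const char *const grp_type_names[] = {
	[GRP_TYPE_MPATH] = "mpath",
	[GRP_TYPE_RES]   = "resilient",
};

/* Returns the matching type, or __GRP_TYPE_MAX when the name is unknown. */
static int grp_type_from_name(const char *name)
{
	for (int i = 0; i <= GRP_TYPE_MAX; i++) {
		if (strcmp(name, grp_type_names[i]) == 0)
			return i;
	}
	return __GRP_TYPE_MAX;
}
```

A caller can then treat any result greater than GRP_TYPE_MAX as an invalid
"type" argument, without a separate error channel.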

[PATCH iproute2-next v3 2/6] json_print: Add print_tv()

2021-03-16 Thread Petr Machata

Add a helper to dump a timeval. Print by first converting to double and
then dispatching to print_color_float().

Signed-off-by: Petr Machata 
---
 include/json_print.h |  1 +
 lib/json_print.c | 13 +
 2 files changed, 14 insertions(+)

diff --git a/include/json_print.h b/include/json_print.h
index 6fcf9fd910ec..63eee3823fe4 100644
--- a/include/json_print.h
+++ b/include/json_print.h
@@ -81,6 +81,7 @@ _PRINT_FUNC(0xhex, unsigned long long)
 _PRINT_FUNC(luint, unsigned long)
 _PRINT_FUNC(lluint, unsigned long long)
 _PRINT_FUNC(float, double)
+_PRINT_FUNC(tv, struct timeval *)
 #undef _PRINT_FUNC
 
 #define _PRINT_NAME_VALUE_FUNC(type_name, type, format_char) \
diff --git a/lib/json_print.c b/lib/json_print.c
index 994a2f8d6ae0..1018bfb36d94 100644
--- a/lib/json_print.c
+++ b/lib/json_print.c
@@ -299,6 +299,19 @@ int print_color_null(enum output_type type,
return ret;
 }
 
+int print_color_tv(enum output_type type,
+  enum color_attr color,
+  const char *key,
+  const char *fmt,
+  struct timeval *tv)
+{
+   double usecs = tv->tv_usec;
+   double secs = tv->tv_sec;
+   double time = secs + usecs / 1000000;
+
+   return print_color_float(type, color, key, fmt, time);
+}
+
 /* Print line separator (if not in JSON mode) */
 void print_nl(void)
 {
-- 
2.26.2
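
The conversion that the helper performs before handing off to the float
printer can be shown on its own. This is a minimal sketch: tv_to_seconds()
is a hypothetical name used only here, and the printing side is left out.

```c
#include <sys/time.h>

/*
 * Convert a timeval to fractional seconds. tv_usec counts microseconds,
 * so it is scaled by one million before being added to tv_sec.
 */
static double tv_to_seconds(const struct timeval *tv)
{
	double usecs = tv->tv_usec;
	double secs = tv->tv_sec;

	return secs + usecs / 1000000;
}
```

Taking the timeval by const pointer matches the const-correctness requested
in review of this patch.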

[PATCH iproute2-next v3 1/6] nexthop: Synchronize uAPI files

2021-03-16 Thread Petr Machata

Signed-off-by: Petr Machata 
---
 include/uapi/linux/nexthop.h   | 47 +-
 include/uapi/linux/rtnetlink.h |  7 +
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
index b0a5613905ef..37b14b4ea6c4 100644
--- a/include/uapi/linux/nexthop.h
+++ b/include/uapi/linux/nexthop.h
@@ -21,7 +21,10 @@ struct nexthop_grp {
 };
 
 enum {
-   NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
+   NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group
+ * default type if not specified
+ */
+   NEXTHOP_GRP_TYPE_RES,/* resilient nexthop group */
__NEXTHOP_GRP_TYPE_MAX,
 };
 
@@ -52,8 +55,50 @@ enum {
NHA_FDB,/* flag; nexthop belongs to a bridge fdb */
/* if NHA_FDB is added, OIF, BLACKHOLE, ENCAP cannot be set */
 
+   /* nested; resilient nexthop group attributes */
+   NHA_RES_GROUP,
+   /* nested; nexthop bucket attributes */
+   NHA_RES_BUCKET,
+
__NHA_MAX,
 };
 
 #define NHA_MAX (__NHA_MAX - 1)
+
+enum {
+   NHA_RES_GROUP_UNSPEC,
+   /* Pad attribute for 64-bit alignment. */
+   NHA_RES_GROUP_PAD = NHA_RES_GROUP_UNSPEC,
+
+   /* u16; number of nexthop buckets in a resilient nexthop group */
+   NHA_RES_GROUP_BUCKETS,
+   /* clock_t as u32; nexthop bucket idle timer (per-group) */
+   NHA_RES_GROUP_IDLE_TIMER,
+   /* clock_t as u32; nexthop unbalanced timer */
+   NHA_RES_GROUP_UNBALANCED_TIMER,
+   /* clock_t as u64; nexthop unbalanced time */
+   NHA_RES_GROUP_UNBALANCED_TIME,
+
+   __NHA_RES_GROUP_MAX,
+};
+
+#define NHA_RES_GROUP_MAX  (__NHA_RES_GROUP_MAX - 1)
+
+enum {
+   NHA_RES_BUCKET_UNSPEC,
+   /* Pad attribute for 64-bit alignment. */
+   NHA_RES_BUCKET_PAD = NHA_RES_BUCKET_UNSPEC,
+
+   /* u16; nexthop bucket index */
+   NHA_RES_BUCKET_INDEX,
+   /* clock_t as u64; nexthop bucket idle time */
+   NHA_RES_BUCKET_IDLE_TIME,
+   /* u32; nexthop id assigned to the nexthop bucket */
+   NHA_RES_BUCKET_NH_ID,
+
+   __NHA_RES_BUCKET_MAX,
+};
+
+#define NHA_RES_BUCKET_MAX (__NHA_RES_BUCKET_MAX - 1)
+
 #endif
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index b34b9add5f65..f6217651 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -178,6 +178,13 @@ enum {
RTM_GETVLAN,
 #define RTM_GETVLAN RTM_GETVLAN
 
+   RTM_NEWNEXTHOPBUCKET = 116,
+#define RTM_NEWNEXTHOPBUCKET   RTM_NEWNEXTHOPBUCKET
+   RTM_DELNEXTHOPBUCKET,
+#define RTM_DELNEXTHOPBUCKET   RTM_DELNEXTHOPBUCKET
+   RTM_GETNEXTHOPBUCKET,
+#define RTM_GETNEXTHOPBUCKET   RTM_GETNEXTHOPBUCKET
+
__RTM_MAX,
 #define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
 };
-- 
2.26.2

[PATCH iproute2-next v3 0/6] ip: nexthop: Support resilient groups

2021-03-16 Thread Petr Machata

Support for resilient next-hop groups was recently accepted to Linux
kernel[1]. Resilient next-hop groups add a layer of indirection between the
SKB hash and the next hop. Thus the hash is used to reference a hash table
bucket, which is then used to reference a particular next hop. This allows
the system more flexibility when assigning SKB hash space to next hops.
Previously, each next hop had to be assigned a continuous range of SKB hash
space. With a hash table as an intermediate layer, it is possible to
reassign next hops with a hash table bucket granularity. In turn, this
mends issues with traffic flow redirection resulting from next hop removal
or adjustments in next-hop weights.

In this patch set, introduce support for resilient next-hop groups to
iproute2.

- Patch #1 brings include/uapi/linux/nexthop.h and /rtnetlink.h up to date.

- Patches #2 and #3 add new helpers that will be useful later.

- Patch #4 extends the ip/nexthop sub-tool to accept group type as a
  command line argument, and to dispatch based on the specified type.

- Patch #5 adds the support for resilient next-hop groups.

- Patch #6 adds the support for resilient next-hop group bucket interface.

To illustrate the usage, consider the following commands:

 # ip nexthop add id 1 via 192.0.2.2 dev dummy1
 # ip nexthop add id 2 via 192.0.2.3 dev dummy1
 # ip nexthop add id 10 group 1/2 type resilient \
buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8
buckets, each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.

And this is how the next-hop group bucket interface looks:

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=2a0186a37700b0d5b8cc40be202a62af44f02fa2

v3:
- Add missing S-o-b's.

v2:
- Patch #4:
- Add a missing example command to commit message
- Mention in the man page that mpath is the default

Ido Schimmel (3):
  nexthop: Add ability to specify group type
  nexthop: Add support for resilient nexthop groups
  nexthop: Add support for nexthop buckets

Petr Machata (3):
  nexthop: Synchronize uAPI files
  json_print: Add print_tv()
  nexthop: Extract a helper to parse a NH ID

 include/json_print.h   |   1 +
 include/libnetlink.h   |   3 +
 include/uapi/linux/nexthop.h   |  47 +++-
 include/uapi/linux/rtnetlink.h |   7 +
 ip/ip_common.h |   1 +
 ip/ipmonitor.c |   6 +
 ip/ipnexthop.c | 451 -
 lib/json_print.c   |  13 +
 lib/libnetlink.c   |  26 ++
 man/man8/ip-nexthop.8  | 113 -
 10 files changed, 651 insertions(+), 17 deletions(-)

-- 
2.26.2

Re: [PATCH iproute2-next v2 4/6] nexthop: Add ability to specify group type

2021-03-15 Thread Petr Machata



Petr Machata  writes:

> Signed-off-by: Ido Schimmel 

And I managed to forget my S-o-b :-/

[PATCH iproute2-next v2 6/6] nexthop: Add support for nexthop buckets

2021-03-15 Thread Petr Machata

From: Ido Schimmel 

Add ability to dump multiple nexthop buckets and get a specific one.
Example:

 # ip nexthop add id 10 group 1/2 type resilient buckets 8
 # ip nexthop
 id 1 via 192.0.2.2 dev dummy10 scope link
 id 2 via 192.0.2.19 dev dummy20 scope link
 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 unbalanced_time 0
 # ip nexthop bucket
 id 10 index 0 idle_time 28.1 nhid 2
 id 10 index 1 idle_time 28.1 nhid 2
 id 10 index 2 idle_time 28.1 nhid 2
 id 10 index 3 idle_time 28.1 nhid 2
 id 10 index 4 idle_time 28.1 nhid 1
 id 10 index 5 idle_time 28.1 nhid 1
 id 10 index 6 idle_time 28.1 nhid 1
 id 10 index 7 idle_time 28.1 nhid 1
 # ip nexthop bucket show nhid 1
 id 10 index 4 idle_time 53.59 nhid 1
 id 10 index 5 idle_time 53.59 nhid 1
 id 10 index 6 idle_time 53.59 nhid 1
 id 10 index 7 idle_time 53.59 nhid 1
 # ip nexthop bucket get id 10 index 5
 id 10 index 5 idle_time 81 nhid 1
 # ip -j -p nexthop bucket get id 10 index 5
 [ {
 "id": 10,
 "bucket": {
 "index": 5,
 "idle_time": 104.89,
 "nhid": 1
 },
 "flags": [ ]
 } ]

Signed-off-by: Ido Schimmel 
---
 include/libnetlink.h  |   3 +
 ip/ip_common.h|   1 +
 ip/ipmonitor.c|   6 +
 ip/ipnexthop.c| 254 ++
 lib/libnetlink.c  |  26 +
 man/man8/ip-nexthop.8 |  45 
 6 files changed, 335 insertions(+)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index b9073a6a13ad..e8ed5d7fb495 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -97,6 +97,9 @@ int rtnl_dump_request_n(struct rtnl_handle *rth, struct nlmsghdr *n)
 int rtnl_nexthopdump_req(struct rtnl_handle *rth, int family,
 req_filter_fn_t filter_fn)
__attribute__((warn_unused_result));
+int rtnl_nexthop_bucket_dump_req(struct rtnl_handle *rth, int family,
+req_filter_fn_t filter_fn)
+   __attribute__((warn_unused_result));
 
 struct rtnl_ctrl_data {
int nsid;
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 9a31e837563f..55a5521c4275 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -53,6 +53,7 @@ int print_rule(struct nlmsghdr *n, void *arg);
 int print_netconf(struct rtnl_ctrl_data *ctrl,
  struct nlmsghdr *n, void *arg);
 int print_nexthop(struct nlmsghdr *n, void *arg);
+int print_nexthop_bucket(struct nlmsghdr *n, void *arg);
 void netns_map_init(void);
 void netns_nsid_socket_init(void);
 int print_nsid(struct nlmsghdr *n, void *arg);
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 99f5fda8ba1f..d7f31cf5d1b5 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -90,6 +90,12 @@ static int accept_msg(struct rtnl_ctrl_data *ctrl,
print_nexthop(n, arg);
return 0;
 
+   case RTM_NEWNEXTHOPBUCKET:
+   case RTM_DELNEXTHOPBUCKET:
+   print_headers(fp, "[NEXTHOPBUCKET]", ctrl);
+   print_nexthop_bucket(n, arg);
+   return 0;
+
case RTM_NEWLINK:
case RTM_DELLINK:
ll_remember_index(n, NULL);
diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 1d50bf7529c4..0263307c49df 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -21,6 +21,8 @@ static struct {
unsigned int master;
unsigned int proto;
unsigned int fdb;
+   unsigned int id;
+   unsigned int nhid;
 } filter;
 
 enum {
@@ -39,8 +41,11 @@ static void usage(void)
"Usage: ip nexthop { list | flush } [ protocol ID ] SELECTOR\n"
"   ip nexthop { add | replace } id ID NH [ protocol ID ]\n"
"   ip nexthop { get | del } id ID\n"
+   "   ip nexthop bucket list BUCKET_SELECTOR\n"
+   "   ip nexthop bucket get id ID index INDEX\n"
"SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n"
"[ groups ] [ fdb ]\n"
+   "BUCKET_SELECTOR := SELECTOR | [ nhid ID ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
"[ encap ENCAPTYPE ENCAPHDR ] |\n"
"group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n"
@@ -85,6 +90,36 @@ static int nh_dump_filter(struct nlmsghdr *nlh, int reqlen)
return 0;
 }
 
+static int nh_dump_bucket_filter(struct nlmsghdr *nlh, int reqlen)
+{
+   struct rtattr *nest;
+   int err = 0;
+
+   err = nh_dump_filter(nlh, reqlen);
+   if (err)
+   return err;
+
+   if (filter.id) {
+   err = addattr32(nlh, reqlen, NHA_ID, filter.id);
+   if (err)
+   return err;
+   }
+
+   if (filter.nhid) {
+   nest = addattr_nest(nlh, reqlen, NHA_RES_BUCKET);
+   nest->rta_type |= NLA_F_NESTED;
+
+   err = addattr32(nlh, reqlen, NHA_RES_BUCKET_NH_ID,
+   f

[PATCH iproute2-next v2 5/6] nexthop: Add support for resilient nexthop groups

2021-03-15 Thread Petr Machata

From: Ido Schimmel 

Add ability to configure resilient nexthop groups and show their current
configuration. Example:

 # ip nexthop add id 10 group 1/2 type resilient buckets 8
 # ip nexthop show id 10
 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0
 # ip -j -p nexthop show id 10
 [ {
 "id": 10,
 "group": [ {
 "id": 1
 },{
 "id": 2
 } ],
 "type": "resilient",
 "resilient_args": {
 "buckets": 8,
 "idle_timer": 120,
 "unbalanced_timer": 0
 },
 "flags": [ ]
 } ]

Signed-off-by: Ido Schimmel 
---
 ip/ipnexthop.c| 144 +-
 man/man8/ip-nexthop.8 |  55 +++-
 2 files changed, 193 insertions(+), 6 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 5aae32629edd..1d50bf7529c4 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -43,9 +43,12 @@ static void usage(void)
"[ groups ] [ fdb ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
"[ encap ENCAPTYPE ENCAPHDR ] |\n"
-   "group GROUP [ fdb ] [ type TYPE ] }\n"
+   "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n"
"GROUP := [ <id[,weight]>/<id[,weight]>/... ]\n"
-   "TYPE := { mpath }\n"
+   "TYPE := { mpath | resilient }\n"
+   "TYPE_ARGS := [ RESILIENT_ARGS ]\n"
+   "RESILIENT_ARGS := [ buckets BUCKETS ] [ idle_timer IDLE ]\n"
+   "  [ unbalanced_timer UNBALANCED ]\n"
"ENCAPTYPE := [ mpls ]\n"
"ENCAPHDR := [ MPLSLABEL ]\n");
exit(-1);
@@ -203,6 +206,66 @@ static void print_nh_group(FILE *fp, const struct rtattr *grps_attr)
close_json_array(PRINT_JSON, NULL);
 }
 
+static const char *nh_group_type_name(__u16 type)
+{
+   switch (type) {
+   case NEXTHOP_GRP_TYPE_MPATH:
+   return "mpath";
+   case NEXTHOP_GRP_TYPE_RES:
+   return "resilient";
+   default:
+   return "";
+   }
+}
+
+static void print_nh_group_type(FILE *fp, const struct rtattr *grp_type_attr)
+{
+   __u16 type = rta_getattr_u16(grp_type_attr);
+
+   if (type == NEXTHOP_GRP_TYPE_MPATH)
+   /* Do not print type in order not to break existing output. */
+   return;
+
+   print_string(PRINT_ANY, "type", "type %s ", nh_group_type_name(type));
+}
+
+static void print_nh_res_group(FILE *fp, const struct rtattr *res_grp_attr)
+{
+   struct rtattr *tb[NHA_RES_GROUP_MAX + 1];
+   struct rtattr *rta;
+   struct timeval tv;
+
+   parse_rtattr_nested(tb, NHA_RES_GROUP_MAX, res_grp_attr);
+
+   open_json_object("resilient_args");
+
+   if (tb[NHA_RES_GROUP_BUCKETS])
+   print_uint(PRINT_ANY, "buckets", "buckets %u ",
+  rta_getattr_u16(tb[NHA_RES_GROUP_BUCKETS]));
+
+   if (tb[NHA_RES_GROUP_IDLE_TIMER]) {
+   rta = tb[NHA_RES_GROUP_IDLE_TIMER];
+   __jiffies_to_tv(&tv, rta_getattr_u32(rta));
+   print_tv(PRINT_ANY, "idle_timer", "idle_timer %g ", &tv);
+   }
+
+   if (tb[NHA_RES_GROUP_UNBALANCED_TIMER]) {
+   rta = tb[NHA_RES_GROUP_UNBALANCED_TIMER];
+   __jiffies_to_tv(&tv, rta_getattr_u32(rta));
+   print_tv(PRINT_ANY, "unbalanced_timer", "unbalanced_timer %g ",
+&tv);
+   }
+
+   if (tb[NHA_RES_GROUP_UNBALANCED_TIME]) {
+   rta = tb[NHA_RES_GROUP_UNBALANCED_TIME];
+   __jiffies_to_tv(&tv, rta_getattr_u32(rta));
+   print_tv(PRINT_ANY, "unbalanced_time", "unbalanced_time %g ",
+&tv);
+   }
+
+   close_json_object();
+}
+
 int print_nexthop(struct nlmsghdr *n, void *arg)
 {
struct nhmsg *nhm = NLMSG_DATA(n);
@@ -229,7 +292,7 @@ int print_nexthop(struct nlmsghdr *n, void *arg)
if (filter.proto && filter.proto != nhm->nh_protocol)
return 0;
 
-   parse_rtattr(tb, NHA_MAX, RTM_NHA(nhm), len);
+   parse_rtattr_flags(tb, NHA_MAX, RTM_NHA(nhm), len, NLA_F_NESTED);
 
open_json_object(NULL);
 
@@ -243,6 +306,12 @@ int print_nexthop(struct nlmsghdr *n, void *arg)
if (tb[NHA_GROUP])
print_nh_group(fp, tb[NHA_GROUP]);
 
+   if (tb[NHA_GROUP_TYPE])
+   print_nh_group_type(fp, tb[NHA_GROUP_TYPE]);
+
+   if (tb[NHA_RES_GROUP])
+   print_nh_res_group(fp, tb[NHA_RES_GROUP]);
+
if (tb[NHA_ENCAP])
lwt_print_encap(fp, tb[NHA_ENCAP_TYPE], tb[NHA_ENCAP]);
 
@@ -333,10 +402,70 @@ static int read_nh_group_type(const char *name)
 {
if (strcmp(name, "mpath") == 0)
return NEXTHOP_GRP_TYPE_MPATH;
+   else if (strcmp(name, "resilient") == 0)
+   return NEXTHOP_GRP_T

[PATCH iproute2-next v2 4/6] nexthop: Add ability to specify group type

2021-03-15 Thread Petr Machata

From: Ido Schimmel 

Next patches are going to add a 'resilient' nexthop group type, so allow
users to specify the type using the 'type' argument. Currently, only
'mpath' type is supported.

These two commands are equivalent:

 # ip nexthop add id 10 group 1/2/3
 # ip nexthop add id 10 group 1/2/3 type mpath

Signed-off-by: Ido Schimmel 
---

Notes:
v2:
- Add a missing example command to commit message
- Mention in the man page that mpath is the default

 ip/ipnexthop.c| 32 +++-
 man/man8/ip-nexthop.8 | 19 +--
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 126b0b17cab4..5aae32629edd 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -42,8 +42,10 @@ static void usage(void)
"SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n"
"[ groups ] [ fdb ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
-   "[ encap ENCAPTYPE ENCAPHDR ] | group GROUP [ fdb ] }\n"
+   "[ encap ENCAPTYPE ENCAPHDR ] |\n"
+   "group GROUP [ fdb ] [ type TYPE ] }\n"
"GROUP := [ <id[,weight]>/<id[,weight]>/... ]\n"
+   "TYPE := { mpath }\n"
"ENCAPTYPE := [ mpls ]\n"
"ENCAPHDR := [ MPLSLABEL ]\n");
exit(-1);
@@ -327,6 +329,32 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv)
return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps));
 }
 
+static int read_nh_group_type(const char *name)
+{
+   if (strcmp(name, "mpath") == 0)
+   return NEXTHOP_GRP_TYPE_MPATH;
+
+   return __NEXTHOP_GRP_TYPE_MAX;
+}
+
+static void parse_nh_group_type(struct nlmsghdr *n, int maxlen, int *argcp,
+   char ***argvp)
+{
+   char **argv = *argvp;
+   int argc = *argcp;
+   __u16 type;
+
+   NEXT_ARG();
+   type = read_nh_group_type(*argv);
+   if (type > NEXTHOP_GRP_TYPE_MAX)
+   invarg("\"type\" value is invalid\n", *argv);
+
+   *argcp = argc;
+   *argvp = argv;
+
+   addattr16(n, maxlen, NHA_GROUP_TYPE, type);
+}
+
 static int ipnh_parse_id(const char *argv)
 {
__u32 id;
@@ -409,6 +437,8 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 
if (add_nh_group_attr(&req.n, sizeof(req), *argv))
invarg("\"group\" value is invalid\n", *argv);
+   } else if (!strcmp(*argv, "type")) {
+   parse_nh_group_type(&req.n, sizeof(req), &argc, &argv);
} else if (matches(*argv, "protocol") == 0) {
__u32 prot;
 
diff --git a/man/man8/ip-nexthop.8 b/man/man8/ip-nexthop.8
index 4d55f4dbcc75..b86f307fef35 100644
--- a/man/man8/ip-nexthop.8
+++ b/man/man8/ip-nexthop.8
@@ -54,7 +54,9 @@ ip-nexthop \- nexthop object management
 .BR fdb " ] | "
 .B  group
 .IR GROUP " [ "
-.BR fdb " ] } "
+.BR fdb " ] [ "
+.B type
+.IR TYPE " ] } "
 
 .ti -8
 .IR ENCAP " := [ "
@@ -71,6 +73,10 @@ ip-nexthop \- nexthop object management
 .IR GROUP " := "
 .BR id "[," weight "[/...]"
 
+.ti -8
+.IR TYPE " := { "
+.BR mpath " }"
+
 .SH DESCRIPTION
 .B ip nexthop
 is used to manipulate entries in the kernel's nexthop tables.
@@ -122,9 +128,18 @@ is a set of encapsulation attributes specific to the
 .in -2
 
 .TP
-.BI group " GROUP"
+.BI group " GROUP [ " type " TYPE ]"
 create a nexthop group. Group specification is id with an optional
 weight (id,weight) and a '/' as a separator between entries.
+.sp
+.I TYPE
+is a string specifying the nexthop group type. Namely:
+
+.in +8
+.BI mpath
+- Multipath nexthop group backed by the hash-threshold algorithm. The
+default when the type is unspecified.
+
 .TP
 .B blackhole
 create a blackhole nexthop
-- 
2.26.2

[PATCH iproute2-next v2 3/6] nexthop: Extract a helper to parse a NH ID

2021-03-15 Thread Petr Machata

NH ID extraction is a common operation, and will become more common still
with the resilient NH groups support. Add a helper that does what is
usually done and returns the parsed NH ID.

Signed-off-by: Petr Machata 
---
 ip/ipnexthop.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 20cde586596b..126b0b17cab4 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -327,6 +327,15 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv)
return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps));
 }
 
+static int ipnh_parse_id(const char *argv)
+{
+   __u32 id;
+
+   if (get_unsigned(&id, argv, 0))
+   invarg("invalid id value", argv);
+   return id;
+}
+
 static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 {
struct {
@@ -343,12 +352,9 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 
while (argc > 0) {
if (!strcmp(*argv, "id")) {
-   __u32 id;
-
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
-   addattr32(&req.n, sizeof(req), NHA_ID, id);
+   addattr32(&req.n, sizeof(req), NHA_ID,
+ ipnh_parse_id(*argv));
} else if (!strcmp(*argv, "dev")) {
int ifindex;
 
@@ -485,12 +491,8 @@ static int ipnh_list_flush(int argc, char **argv, int action)
if (!filter.master)
invarg("VRF does not exist\n", *argv);
} else if (!strcmp(*argv, "id")) {
-   __u32 id;
-
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
-   return ipnh_get_id(id);
+   return ipnh_get_id(ipnh_parse_id(*argv));
} else if (!matches(*argv, "protocol")) {
__u32 proto;
 
@@ -536,8 +538,7 @@ static int ipnh_get(int argc, char **argv)
while (argc > 0) {
if (!strcmp(*argv, "id")) {
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
+   id = ipnh_parse_id(*argv);
} else  {
usage();
}
-- 
2.26.2
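
For readers without the iproute2 tree at hand, the parse-and-validate step that the helper above wraps can be sketched standalone. This is only an illustration: sketch_parse_u32() is a hypothetical name, it uses strtoul() from the C library instead of iproute2's get_unsigned(), and it returns an error code where the real helper calls invarg() and exits.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the get_unsigned() + invarg() pair.
 * Returns 0 and stores the value on success, -1 on malformed or
 * out-of-range input. Base 0 accepts decimal, octal and hex.
 */
static int sketch_parse_u32(const char *arg, uint32_t *out)
{
	char *end;
	unsigned long val;

	assert(arg != NULL && out != NULL);

	/* strtoul() silently accepts a leading '-', so reject it here. */
	if (*arg == '\0' || *arg == '-')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 0);
	if (errno != 0 || end == arg || *end != '\0' || val > UINT32_MAX)
		return -1;

	*out = (uint32_t)val;
	return 0;
}
```

A caller would check the return value once and use the parsed ID afterwards, which is the duplication the patch removes from the three call sites.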

[PATCH iproute2-next v2 1/6] nexthop: Synchronize uAPI files

2021-03-15 Thread Petr Machata

Signed-off-by: Petr Machata 
---
 include/uapi/linux/nexthop.h   | 47 +-
 include/uapi/linux/rtnetlink.h |  7 +
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
index b0a5613905ef..37b14b4ea6c4 100644
--- a/include/uapi/linux/nexthop.h
+++ b/include/uapi/linux/nexthop.h
@@ -21,7 +21,10 @@ struct nexthop_grp {
 };
 
 enum {
-   NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
+   NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group
+ * default type if not specified
+ */
+   NEXTHOP_GRP_TYPE_RES,/* resilient nexthop group */
__NEXTHOP_GRP_TYPE_MAX,
 };
 
@@ -52,8 +55,50 @@ enum {
NHA_FDB,/* flag; nexthop belongs to a bridge fdb */
/* if NHA_FDB is added, OIF, BLACKHOLE, ENCAP cannot be set */
 
+   /* nested; resilient nexthop group attributes */
+   NHA_RES_GROUP,
+   /* nested; nexthop bucket attributes */
+   NHA_RES_BUCKET,
+
__NHA_MAX,
 };
 
 #define NHA_MAX(__NHA_MAX - 1)
+
+enum {
+   NHA_RES_GROUP_UNSPEC,
+   /* Pad attribute for 64-bit alignment. */
+   NHA_RES_GROUP_PAD = NHA_RES_GROUP_UNSPEC,
+
+   /* u16; number of nexthop buckets in a resilient nexthop group */
+   NHA_RES_GROUP_BUCKETS,
+   /* clock_t as u32; nexthop bucket idle timer (per-group) */
+   NHA_RES_GROUP_IDLE_TIMER,
+   /* clock_t as u32; nexthop unbalanced timer */
+   NHA_RES_GROUP_UNBALANCED_TIMER,
+   /* clock_t as u64; nexthop unbalanced time */
+   NHA_RES_GROUP_UNBALANCED_TIME,
+
+   __NHA_RES_GROUP_MAX,
+};
+
+#define NHA_RES_GROUP_MAX  (__NHA_RES_GROUP_MAX - 1)
+
+enum {
+   NHA_RES_BUCKET_UNSPEC,
+   /* Pad attribute for 64-bit alignment. */
+   NHA_RES_BUCKET_PAD = NHA_RES_BUCKET_UNSPEC,
+
+   /* u16; nexthop bucket index */
+   NHA_RES_BUCKET_INDEX,
+   /* clock_t as u64; nexthop bucket idle time */
+   NHA_RES_BUCKET_IDLE_TIME,
+   /* u32; nexthop id assigned to the nexthop bucket */
+   NHA_RES_BUCKET_NH_ID,
+
+   __NHA_RES_BUCKET_MAX,
+};
+
+#define NHA_RES_BUCKET_MAX (__NHA_RES_BUCKET_MAX - 1)
+
 #endif
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index b34b9add5f65..f6217651 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -178,6 +178,13 @@ enum {
RTM_GETVLAN,
 #define RTM_GETVLANRTM_GETVLAN
 
+   RTM_NEWNEXTHOPBUCKET = 116,
+#define RTM_NEWNEXTHOPBUCKET   RTM_NEWNEXTHOPBUCKET
+   RTM_DELNEXTHOPBUCKET,
+#define RTM_DELNEXTHOPBUCKET   RTM_DELNEXTHOPBUCKET
+   RTM_GETNEXTHOPBUCKET,
+#define RTM_GETNEXTHOPBUCKET   RTM_GETNEXTHOPBUCKET
+
__RTM_MAX,
 #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1)
 };
-- 
2.26.2

[PATCH iproute2-next v2 2/6] json_print: Add print_tv()

2021-03-15 Thread Petr Machata

Add a helper to dump a timeval. Print by first converting to double and
then dispatching to print_color_float().

Signed-off-by: Petr Machata 
---
 include/json_print.h |  1 +
 lib/json_print.c | 13 +
 2 files changed, 14 insertions(+)

diff --git a/include/json_print.h b/include/json_print.h
index 6fcf9fd910ec..63eee3823fe4 100644
--- a/include/json_print.h
+++ b/include/json_print.h
@@ -81,6 +81,7 @@ _PRINT_FUNC(0xhex, unsigned long long)
 _PRINT_FUNC(luint, unsigned long)
 _PRINT_FUNC(lluint, unsigned long long)
 _PRINT_FUNC(float, double)
+_PRINT_FUNC(tv, struct timeval *)
 #undef _PRINT_FUNC
 
 #define _PRINT_NAME_VALUE_FUNC(type_name, type, format_char) \
diff --git a/lib/json_print.c b/lib/json_print.c
index 994a2f8d6ae0..1018bfb36d94 100644
--- a/lib/json_print.c
+++ b/lib/json_print.c
@@ -299,6 +299,19 @@ int print_color_null(enum output_type type,
return ret;
 }
 
+int print_color_tv(enum output_type type,
+  enum color_attr color,
+  const char *key,
+  const char *fmt,
+  struct timeval *tv)
+{
+   double usecs = tv->tv_usec;
+   double secs = tv->tv_sec;
+   double time = secs + usecs / 1000000;
+
+   return print_color_float(type, color, key, fmt, time);
+}
+
 /* Print line separator (if not in JSON mode) */
 void print_nl(void)
 {
-- 
2.26.2
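
The conversion described in the commit message above (timeval to a double, then print as a float) can be tried in isolation. A minimal sketch, assuming only the standard struct timeval; sketch_tv_to_seconds() is a hypothetical name and is not part of json_print.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/*
 * Convert a timeval to seconds as a double. tv_usec counts
 * microseconds, so it is scaled by one million.
 */
static double sketch_tv_to_seconds(const struct timeval *tv)
{
	assert(tv != NULL);
	return (double)tv->tv_sec + (double)tv->tv_usec / 1000000.0;
}
```

The resulting double can then be handed to any "%g"-style formatter, which is why values such as 120 print without a fractional part.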

[PATCH iproute2-next v2 0/6] ip: nexthop: Support resilient groups

2021-03-15 Thread Petr Machata

Support for resilient next-hop groups was recently accepted to Linux
kernel[1]. Resilient next-hop groups add a layer of indirection between the
SKB hash and the next hop. Thus the hash is used to reference a hash table
bucket, which is then used to reference a particular next hop. This allows
the system more flexibility when assigning SKB hash space to next hops.
Previously, each next hop had to be assigned a continuous range of SKB hash
space. With a hash table as an intermediate layer, it is possible to
reassign next hops with a hash table bucket granularity. In turn, this
mends issues with traffic flow redirection resulting from next hop removal
or adjustments in next-hop weights.
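
The indirection can be pictured with a small userspace model. This is a sketch under stated assumptions, not the kernel's data structure: all sketch_* names are hypothetical, the bucket count is capped for simplicity, and selecting the bucket as hash modulo bucket count is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_BUCKETS 64

/* Hypothetical model of a resilient group: bucket index -> next-hop ID. */
struct sketch_res_group {
	uint16_t num_buckets;
	uint32_t bucket_nhid[SKETCH_MAX_BUCKETS];
};

/* The hash picks a bucket; the bucket names the next hop. */
static uint32_t sketch_select_nh(const struct sketch_res_group *grp,
				 uint32_t hash)
{
	assert(grp->num_buckets > 0 &&
	       grp->num_buckets <= SKETCH_MAX_BUCKETS);
	return grp->bucket_nhid[hash % grp->num_buckets];
}

/*
 * Replace a next hop by rewriting only the buckets that pointed at it.
 * Buckets owned by other next hops, and the flows hashed to them, are
 * left alone. Returns the number of buckets that moved.
 */
static unsigned int sketch_replace_nh(struct sketch_res_group *grp,
				      uint32_t old_nhid, uint32_t new_nhid)
{
	unsigned int moved = 0;

	for (uint16_t i = 0; i < grp->num_buckets; i++) {
		if (grp->bucket_nhid[i] == old_nhid) {
			grp->bucket_nhid[i] = new_nhid;
			moved++;
		}
	}
	return moved;
}
```

With a contiguous hash range per next hop, removing one next hop shifts the boundaries of its neighbours; with the table, only the affected buckets change owner.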

In this patch set, introduce support for resilient next-hop groups to
iproute2.

- Patch #1 brings include/uapi/linux/nexthop.h and /rtnetlink.h up to date.

- Patches #2 and #3 add new helpers that will be useful later.

- Patch #4 extends the ip/nexthop sub-tool to accept group type as a
  command line argument, and to dispatch based on the specified type.

- Patch #5 adds the support for resilient next-hop groups.

- Patch #6 adds the support for resilient next-hop group bucket interface.

To illustrate the usage, consider the following commands:

 # ip nexthop add id 1 via 192.0.2.2 dev dummy1
 # ip nexthop add id 2 via 192.0.2.3 dev dummy1
 # ip nexthop add id 10 group 1/2 type resilient \
buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8
buckets, each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.
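
The two timers can be read as a simple predicate on whether a bucket may be handed to a different next hop. The following is a simplified model of the description above, not the kernel logic: the sketch_* names are hypothetical, times are plain seconds, and treating an unbalanced_timer of 0 as "never force a rebalance" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group timer settings, in seconds. */
struct sketch_res_timers {
	double idle_timer;
	double unbalanced_timer; /* 0 is assumed to mean "never force" */
};

/*
 * A bucket may migrate when it has been idle for at least idle_timer,
 * or when the table has been out of balance for at least
 * unbalanced_timer (if that timer is enabled).
 */
static bool sketch_bucket_may_migrate(const struct sketch_res_timers *t,
				      double bucket_idle_time,
				      double table_unbalanced_time)
{
	assert(t != NULL);

	if (bucket_idle_time >= t->idle_timer)
		return true;
	if (t->unbalanced_timer > 0 &&
	    table_unbalanced_time >= t->unbalanced_timer)
		return true;
	return false;
}
```

For the group created above (idle_timer 60, unbalanced_timer 300), a bucket idle for 5.59 seconds stays put until the table has been unbalanced for 300 seconds.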

And this is how the next-hop group bucket interface looks:

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=2a0186a37700b0d5b8cc40be202a62af44f02fa2

v2:
- Patch #4:
- Add a missing example command to commit message
- Mention in the man page that mpath is the default

Ido Schimmel (3):
  nexthop: Add ability to specify group type
  nexthop: Add support for resilient nexthop groups
  nexthop: Add support for nexthop buckets

Petr Machata (3):
  nexthop: Synchronize uAPI files
  json_print: Add print_tv()
  nexthop: Extract a helper to parse a NH ID

 include/json_print.h   |   1 +
 include/libnetlink.h   |   3 +
 include/uapi/linux/nexthop.h   |  47 +++-
 include/uapi/linux/rtnetlink.h |   7 +
 ip/ip_common.h |   1 +
 ip/ipmonitor.c |   6 +
 ip/ipnexthop.c | 451 -
 lib/json_print.c   |  13 +
 lib/libnetlink.c   |  26 ++
 man/man8/ip-nexthop.8  | 113 -
 10 files changed, 651 insertions(+), 17 deletions(-)

-- 
2.26.2

Re: [PATCH iproute2-next v2] dcb: Fix compilation warning about reallocarray

2021-03-15 Thread Petr Machata



Petr Machata  writes:

> Roi Dayan  writes:
>
>> --- a/dcb/dcb_app.c
>> +++ b/dcb/dcb_app.c
>> @@ -65,8 +65,7 @@ static void dcb_app_table_fini(struct dcb_app_table *tab)
>>  
>>  static int dcb_app_table_push(struct dcb_app_table *tab, struct dcb_app *app)
>>  {
>> -struct dcb_app *apps = reallocarray(tab->apps, tab->n_apps + 1,
>> -sizeof(*tab->apps));
>> +struct dcb_app *apps = realloc(tab->apps, (tab->n_apps + 1) * sizeof(*tab->apps));
>
> Reviewed-by: Petr Machata 

Could this be merged, please?

Re: [PATCH iproute2-next 4/6] nexthop: Add ability to specify group type

2021-03-15 Thread Petr Machata



David Ahern  writes:

> On 3/12/21 10:23 AM, Petr Machata wrote:
>> From: Petr Machata 
>> 
>> From: Ido Schimmel 
>
> All of the patches have the above. If Ido is the author and you are
> sending, AIUI you add your Signed-off-by below his.

Sorry about that, that's a leftover from when I was sending the DCB
patches. I'll resend with the correct headers.

>> +.sp
>> +.I TYPE
>> +is a string specifying the nexthop group type. Namely:
>> +
>> +.in +8
>> +.BI mpath
>> +- multipath nexthop group
>> +
>
> Add a comment that this is the default group type and refers to the
> legacy hash-based multipath group.

OK.

[PATCH iproute2-next 3/6] nexthop: Extract a helper to parse a NH ID

2021-03-12 Thread Petr Machata

From: Petr Machata 

From: Petr Machata 

NH ID extraction is a common operation, and will become more common still
with the resilient NH groups support. Add a helper that does what is
usually done and returns the parsed NH ID.

Signed-off-by: Petr Machata 
---
 ip/ipnexthop.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 20cde586596b..126b0b17cab4 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -327,6 +327,15 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv)
return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps));
 }
 
+static int ipnh_parse_id(const char *argv)
+{
+   __u32 id;
+
+   if (get_unsigned(&id, argv, 0))
+   invarg("invalid id value", argv);
+   return id;
+}
+
 static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 {
struct {
@@ -343,12 +352,9 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 
while (argc > 0) {
if (!strcmp(*argv, "id")) {
-   __u32 id;
-
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
-   addattr32(&req.n, sizeof(req), NHA_ID, id);
+   addattr32(&req.n, sizeof(req), NHA_ID,
+ ipnh_parse_id(*argv));
} else if (!strcmp(*argv, "dev")) {
int ifindex;
 
@@ -485,12 +491,8 @@ static int ipnh_list_flush(int argc, char **argv, int action)
if (!filter.master)
invarg("VRF does not exist\n", *argv);
} else if (!strcmp(*argv, "id")) {
-   __u32 id;
-
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
-   return ipnh_get_id(id);
+   return ipnh_get_id(ipnh_parse_id(*argv));
} else if (!matches(*argv, "protocol")) {
__u32 proto;
 
@@ -536,8 +538,7 @@ static int ipnh_get(int argc, char **argv)
while (argc > 0) {
if (!strcmp(*argv, "id")) {
NEXT_ARG();
-   if (get_unsigned(&id, *argv, 0))
-   invarg("invalid id value", *argv);
+   id = ipnh_parse_id(*argv);
} else  {
usage();
}
-- 
2.26.2

[PATCH iproute2-next 6/6] nexthop: Add support for nexthop buckets

2021-03-12 Thread Petr Machata

From: Petr Machata 

From: Ido Schimmel 

Add ability to dump multiple nexthop buckets and get a specific one.
Example:

 # ip nexthop add id 10 group 1/2 type resilient buckets 8
 # ip nexthop
 id 1 via 192.0.2.2 dev dummy10 scope link
 id 2 via 192.0.2.19 dev dummy20 scope link
 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 unbalanced_time 0
 # ip nexthop bucket
 id 10 index 0 idle_time 28.1 nhid 2
 id 10 index 1 idle_time 28.1 nhid 2
 id 10 index 2 idle_time 28.1 nhid 2
 id 10 index 3 idle_time 28.1 nhid 2
 id 10 index 4 idle_time 28.1 nhid 1
 id 10 index 5 idle_time 28.1 nhid 1
 id 10 index 6 idle_time 28.1 nhid 1
 id 10 index 7 idle_time 28.1 nhid 1
 # ip nexthop bucket show nhid 1
 id 10 index 4 idle_time 53.59 nhid 1
 id 10 index 5 idle_time 53.59 nhid 1
 id 10 index 6 idle_time 53.59 nhid 1
 id 10 index 7 idle_time 53.59 nhid 1
 # ip nexthop bucket get id 10 index 5
 id 10 index 5 idle_time 81 nhid 1
 # ip -j -p nexthop bucket get id 10 index 5
 [ {
 "id": 10,
 "bucket": {
 "index": 5,
 "idle_time": 104.89,
 "nhid": 1
 },
 "flags": [ ]
 } ]

Signed-off-by: Ido Schimmel 
---
 include/libnetlink.h  |   3 +
 ip/ip_common.h|   1 +
 ip/ipmonitor.c|   6 +
 ip/ipnexthop.c| 254 ++
 lib/libnetlink.c  |  26 +
 man/man8/ip-nexthop.8 |  45 
 6 files changed, 335 insertions(+)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index b9073a6a13ad..e8ed5d7fb495 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -97,6 +97,9 @@ int rtnl_dump_request_n(struct rtnl_handle *rth, struct nlmsghdr *n)
 int rtnl_nexthopdump_req(struct rtnl_handle *rth, int family,
 req_filter_fn_t filter_fn)
__attribute__((warn_unused_result));
+int rtnl_nexthop_bucket_dump_req(struct rtnl_handle *rth, int family,
+req_filter_fn_t filter_fn)
+   __attribute__((warn_unused_result));
 
 struct rtnl_ctrl_data {
int nsid;
diff --git a/ip/ip_common.h b/ip/ip_common.h
index 9a31e837563f..55a5521c4275 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -53,6 +53,7 @@ int print_rule(struct nlmsghdr *n, void *arg);
 int print_netconf(struct rtnl_ctrl_data *ctrl,
  struct nlmsghdr *n, void *arg);
 int print_nexthop(struct nlmsghdr *n, void *arg);
+int print_nexthop_bucket(struct nlmsghdr *n, void *arg);
 void netns_map_init(void);
 void netns_nsid_socket_init(void);
 int print_nsid(struct nlmsghdr *n, void *arg);
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 99f5fda8ba1f..d7f31cf5d1b5 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -90,6 +90,12 @@ static int accept_msg(struct rtnl_ctrl_data *ctrl,
print_nexthop(n, arg);
return 0;
 
+   case RTM_NEWNEXTHOPBUCKET:
+   case RTM_DELNEXTHOPBUCKET:
+   print_headers(fp, "[NEXTHOPBUCKET]", ctrl);
+   print_nexthop_bucket(n, arg);
+   return 0;
+
case RTM_NEWLINK:
case RTM_DELLINK:
ll_remember_index(n, NULL);
diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 1d50bf7529c4..0263307c49df 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -21,6 +21,8 @@ static struct {
unsigned int master;
unsigned int proto;
unsigned int fdb;
+   unsigned int id;
+   unsigned int nhid;
 } filter;
 
 enum {
@@ -39,8 +41,11 @@ static void usage(void)
"Usage: ip nexthop { list | flush } [ protocol ID ] SELECTOR\n"
"   ip nexthop { add | replace } id ID NH [ protocol ID ]\n"
"   ip nexthop { get | del } id ID\n"
+   "   ip nexthop bucket list BUCKET_SELECTOR\n"
+   "   ip nexthop bucket get id ID index INDEX\n"
"SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n"
"[ groups ] [ fdb ]\n"
+   "BUCKET_SELECTOR := SELECTOR | [ nhid ID ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
"[ encap ENCAPTYPE ENCAPHDR ] |\n"
"group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n"
@@ -85,6 +90,36 @@ static int nh_dump_filter(struct nlmsghdr *nlh, int reqlen)
return 0;
 }
 
+static int nh_dump_bucket_filter(struct nlmsghdr *nlh, int reqlen)
+{
+   struct rtattr *nest;
+   int err = 0;
+
+   err = nh_dump_filter(nlh, reqlen);
+   if (err)
+   return err;
+
+   if (filter.id) {
+   err = addattr32(nlh, reqlen, NHA_ID, filter.id);
+   if (err)
+   return err;
+   }
+
+   if (filter.nhid) {
+

[PATCH iproute2-next 5/6] nexthop: Add support for resilient nexthop groups

2021-03-12 Thread Petr Machata

From: Petr Machata 

From: Ido Schimmel 

Add ability to configure resilient nexthop groups and show their current
configuration. Example:

 # ip nexthop add id 10 group 1/2 type resilient buckets 8
 # ip nexthop show id 10
 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0
 # ip -j -p nexthop show id 10
 [ {
 "id": 10,
 "group": [ {
 "id": 1
 },{
 "id": 2
 } ],
 "type": "resilient",
 "resilient_args": {
 "buckets": 8,
 "idle_timer": 120,
 "unbalanced_timer": 0
 },
 "flags": [ ]
 } ]

Signed-off-by: Ido Schimmel 
---
 ip/ipnexthop.c| 144 +-
 man/man8/ip-nexthop.8 |  55 +++-
 2 files changed, 193 insertions(+), 6 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 5aae32629edd..1d50bf7529c4 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -43,9 +43,12 @@ static void usage(void)
"[ groups ] [ fdb ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
"[ encap ENCAPTYPE ENCAPHDR ] |\n"
-   "group GROUP [ fdb ] [ type TYPE ] }\n"
+   "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n"
"GROUP := [ <id[,weight]>/<id[,weight]>/... ]\n"
-   "TYPE := { mpath }\n"
+   "TYPE := { mpath | resilient }\n"
+   "TYPE_ARGS := [ RESILIENT_ARGS ]\n"
+   "RESILIENT_ARGS := [ buckets BUCKETS ] [ idle_timer IDLE ]\n"
+   "  [ unbalanced_timer UNBALANCED ]\n"
"ENCAPTYPE := [ mpls ]\n"
"ENCAPHDR := [ MPLSLABEL ]\n");
exit(-1);
@@ -203,6 +206,66 @@ static void print_nh_group(FILE *fp, const struct rtattr *grps_attr)
close_json_array(PRINT_JSON, NULL);
 }
 
+static const char *nh_group_type_name(__u16 type)
+{
+   switch (type) {
+   case NEXTHOP_GRP_TYPE_MPATH:
+   return "mpath";
+   case NEXTHOP_GRP_TYPE_RES:
+   return "resilient";
+   default:
+   return "";
+   }
+}
+
+static void print_nh_group_type(FILE *fp, const struct rtattr *grp_type_attr)
+{
+   __u16 type = rta_getattr_u16(grp_type_attr);
+
+   if (type == NEXTHOP_GRP_TYPE_MPATH)
+   /* Do not print type in order not to break existing output. */
+   return;
+
+   print_string(PRINT_ANY, "type", "type %s ", nh_group_type_name(type));
+}
+
+static void print_nh_res_group(FILE *fp, const struct rtattr *res_grp_attr)
+{
+   struct rtattr *tb[NHA_RES_GROUP_MAX + 1];
+   struct rtattr *rta;
+   struct timeval tv;
+
+   parse_rtattr_nested(tb, NHA_RES_GROUP_MAX, res_grp_attr);
+
+   open_json_object("resilient_args");
+
+   if (tb[NHA_RES_GROUP_BUCKETS])
+   print_uint(PRINT_ANY, "buckets", "buckets %u ",
+  rta_getattr_u16(tb[NHA_RES_GROUP_BUCKETS]));
+
+   if (tb[NHA_RES_GROUP_IDLE_TIMER]) {
+   rta = tb[NHA_RES_GROUP_IDLE_TIMER];
+   __jiffies_to_tv(&tv, rta_getattr_u32(rta));
+   print_tv(PRINT_ANY, "idle_timer", "idle_timer %g ", &tv);
+   }
+
+   if (tb[NHA_RES_GROUP_UNBALANCED_TIMER]) {
+   rta = tb[NHA_RES_GROUP_UNBALANCED_TIMER];
+   __jiffies_to_tv(&tv, rta_getattr_u32(rta));
+   print_tv(PRINT_ANY, "unbalanced_timer", "unbalanced_timer %g ",
+&tv);
+   }
+
+   if (tb[NHA_RES_GROUP_UNBALANCED_TIME]) {
+   rta = tb[NHA_RES_GROUP_UNBALANCED_TIME];
+   __jiffies_to_tv(&tv, rta_getattr_u32(rta));
+   print_tv(PRINT_ANY, "unbalanced_time", "unbalanced_time %g ",
+&tv);
+   }
+
+   close_json_object();
+}
+
 int print_nexthop(struct nlmsghdr *n, void *arg)
 {
struct nhmsg *nhm = NLMSG_DATA(n);
@@ -229,7 +292,7 @@ int print_nexthop(struct nlmsghdr *n, void *arg)
if (filter.proto && filter.proto != nhm->nh_protocol)
return 0;
 
-   parse_rtattr(tb, NHA_MAX, RTM_NHA(nhm), len);
+   parse_rtattr_flags(tb, NHA_MAX, RTM_NHA(nhm), len, NLA_F_NESTED);
 
open_json_object(NULL);
 
@@ -243,6 +306,12 @@ int print_nexthop(struct nlmsghdr *n, void *arg)
if (tb[NHA_GROUP])
print_nh_group(fp, tb[NHA_GROUP]);
 
+   if (tb[NHA_GROUP_TYPE])
+   print_nh_group_type(fp, tb[NHA_G

[PATCH iproute2-next 1/6] nexthop: Synchronize uAPI files

2021-03-12 Thread Petr Machata

From: Petr Machata 

From: Ido Schimmel 

Signed-off-by: Petr Machata 
---
 include/uapi/linux/nexthop.h   | 47 +-
 include/uapi/linux/rtnetlink.h |  7 +
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
index b0a5613905ef..37b14b4ea6c4 100644
--- a/include/uapi/linux/nexthop.h
+++ b/include/uapi/linux/nexthop.h
@@ -21,7 +21,10 @@ struct nexthop_grp {
 };
 
 enum {
-   NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
+   NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group
+ * default type if not specified
+ */
+   NEXTHOP_GRP_TYPE_RES,/* resilient nexthop group */
__NEXTHOP_GRP_TYPE_MAX,
 };
 
@@ -52,8 +55,50 @@ enum {
NHA_FDB,/* flag; nexthop belongs to a bridge fdb */
/* if NHA_FDB is added, OIF, BLACKHOLE, ENCAP cannot be set */
 
+   /* nested; resilient nexthop group attributes */
+   NHA_RES_GROUP,
+   /* nested; nexthop bucket attributes */
+   NHA_RES_BUCKET,
+
__NHA_MAX,
 };
 
 #define NHA_MAX(__NHA_MAX - 1)
+
+enum {
+   NHA_RES_GROUP_UNSPEC,
+   /* Pad attribute for 64-bit alignment. */
+   NHA_RES_GROUP_PAD = NHA_RES_GROUP_UNSPEC,
+
+   /* u16; number of nexthop buckets in a resilient nexthop group */
+   NHA_RES_GROUP_BUCKETS,
+   /* clock_t as u32; nexthop bucket idle timer (per-group) */
+   NHA_RES_GROUP_IDLE_TIMER,
+   /* clock_t as u32; nexthop unbalanced timer */
+   NHA_RES_GROUP_UNBALANCED_TIMER,
+   /* clock_t as u64; nexthop unbalanced time */
+   NHA_RES_GROUP_UNBALANCED_TIME,
+
+   __NHA_RES_GROUP_MAX,
+};
+
+#define NHA_RES_GROUP_MAX  (__NHA_RES_GROUP_MAX - 1)
+
+enum {
+   NHA_RES_BUCKET_UNSPEC,
+   /* Pad attribute for 64-bit alignment. */
+   NHA_RES_BUCKET_PAD = NHA_RES_BUCKET_UNSPEC,
+
+   /* u16; nexthop bucket index */
+   NHA_RES_BUCKET_INDEX,
+   /* clock_t as u64; nexthop bucket idle time */
+   NHA_RES_BUCKET_IDLE_TIME,
+   /* u32; nexthop id assigned to the nexthop bucket */
+   NHA_RES_BUCKET_NH_ID,
+
+   __NHA_RES_BUCKET_MAX,
+};
+
+#define NHA_RES_BUCKET_MAX (__NHA_RES_BUCKET_MAX - 1)
+
 #endif
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index b34b9add5f65..f6217651 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -178,6 +178,13 @@ enum {
RTM_GETVLAN,
 #define RTM_GETVLANRTM_GETVLAN
 
+   RTM_NEWNEXTHOPBUCKET = 116,
+#define RTM_NEWNEXTHOPBUCKET   RTM_NEWNEXTHOPBUCKET
+   RTM_DELNEXTHOPBUCKET,
+#define RTM_DELNEXTHOPBUCKET   RTM_DELNEXTHOPBUCKET
+   RTM_GETNEXTHOPBUCKET,
+#define RTM_GETNEXTHOPBUCKET   RTM_GETNEXTHOPBUCKET
+
__RTM_MAX,
 #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1)
 };
-- 
2.26.2

[PATCH iproute2-next 4/6] nexthop: Add ability to specify group type

2021-03-12 Thread Petr Machata

From: Petr Machata 

From: Ido Schimmel 

Next patches are going to add a 'resilient' nexthop group type, so allow
users to specify the type using the 'type' argument. Currently, only
'mpath' type is supported.

These two commands are equivalent:

Signed-off-by: Ido Schimmel 
---
 ip/ipnexthop.c| 32 +++-
 man/man8/ip-nexthop.8 | 18 --
 2 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 126b0b17cab4..5aae32629edd 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -42,8 +42,10 @@ static void usage(void)
"SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n"
"[ groups ] [ fdb ]\n"
"NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n"
-   "[ encap ENCAPTYPE ENCAPHDR ] | group GROUP [ fdb ] }\n"
+   "[ encap ENCAPTYPE ENCAPHDR ] |\n"
+   "group GROUP [ fdb ] [ type TYPE ] }\n"
"GROUP := [ <id[,weight]>/<id[,weight]>/... ]\n"
+   "TYPE := { mpath }\n"
"ENCAPTYPE := [ mpls ]\n"
"ENCAPHDR := [ MPLSLABEL ]\n");
exit(-1);
@@ -327,6 +329,32 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv)
return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps));
 }
 
+static int read_nh_group_type(const char *name)
+{
+   if (strcmp(name, "mpath") == 0)
+   return NEXTHOP_GRP_TYPE_MPATH;
+
+   return __NEXTHOP_GRP_TYPE_MAX;
+}
+
+static void parse_nh_group_type(struct nlmsghdr *n, int maxlen, int *argcp,
+   char ***argvp)
+{
+   char **argv = *argvp;
+   int argc = *argcp;
+   __u16 type;
+
+   NEXT_ARG();
+   type = read_nh_group_type(*argv);
+   if (type > NEXTHOP_GRP_TYPE_MAX)
+   invarg("\"type\" value is invalid\n", *argv);
+
+   *argcp = argc;
+   *argvp = argv;
+
+   addattr16(n, maxlen, NHA_GROUP_TYPE, type);
+}
+
 static int ipnh_parse_id(const char *argv)
 {
__u32 id;
@@ -409,6 +437,8 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 
if (add_nh_group_attr(&req.n, sizeof(req), *argv))
invarg("\"group\" value is invalid\n", *argv);
+   } else if (!strcmp(*argv, "type")) {
+   parse_nh_group_type(&req.n, sizeof(req), &argc, &argv);
} else if (matches(*argv, "protocol") == 0) {
__u32 prot;
 
diff --git a/man/man8/ip-nexthop.8 b/man/man8/ip-nexthop.8
index 4d55f4dbcc75..f02e0555a000 100644
--- a/man/man8/ip-nexthop.8
+++ b/man/man8/ip-nexthop.8
@@ -54,7 +54,9 @@ ip-nexthop \- nexthop object management
 .BR fdb " ] | "
 .B  group
 .IR GROUP " [ "
-.BR fdb " ] } "
+.BR fdb " ] [ "
+.B type
+.IR TYPE " ] } "
 
 .ti -8
 .IR ENCAP " := [ "
@@ -71,6 +73,10 @@ ip-nexthop \- nexthop object management
 .IR GROUP " := "
 .BR id "[," weight "[/...]"
 
+.ti -8
+.IR TYPE " := { "
+.BR mpath " }"
+
 .SH DESCRIPTION
 .B ip nexthop
 is used to manipulate entries in the kernel's nexthop tables.
@@ -122,9 +128,17 @@ is a set of encapsulation attributes specific to the
 .in -2
 
 .TP
-.BI group " GROUP"
+.BI group " GROUP [ " type " TYPE ]"
 create a nexthop group. Group specification is id with an optional
 weight (id,weight) and a '/' as a separator between entries.
+.sp
+.I TYPE
+is a string specifying the nexthop group type. Namely:
+
+.in +8
+.BI mpath
+- multipath nexthop group
+
 .TP
 .B blackhole
 create a blackhole nexthop
-- 
2.26.2

[PATCH iproute2-next 0/6] ip: nexthop: Support resilient groups

2021-03-12 Thread Petr Machata

Support for resilient next-hop groups was recently accepted into the Linux
kernel [1]. Resilient next-hop groups add a layer of indirection between the
SKB hash and the next hop. Thus the hash is used to reference a hash table
bucket, which is then used to reference a particular next hop. This allows
the system more flexibility when assigning SKB hash space to next hops.
Previously, each next hop had to be assigned a contiguous range of SKB hash
space. With a hash table as an intermediate layer, it is possible to
reassign next hops with a hash table bucket granularity. In turn, this
mends issues with traffic flow redirection resulting from next hop removal
or adjustments in next-hop weights.
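
The indirection described above can be sketched in a few lines of C. This is
only an illustration, not the kernel implementation: the structure and the
function names (res_group, res_group_select, res_group_replace_nh) are
hypothetical, the table size is fixed at 8, and the bucket is picked with a
plain modulo, which is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 8

/* Hypothetical resilient group: every bucket stores the ID of a next hop. */
struct res_group {
	uint32_t bucket_nhid[NUM_BUCKETS];
};

/* The hash selects a bucket; the bucket, not the hash, names the next hop. */
uint32_t res_group_select(const struct res_group *grp, uint32_t hash)
{
	return grp->bucket_nhid[hash % NUM_BUCKETS];
}

/*
 * Reassign only the buckets that currently point at old_nhid.
 * Buckets owned by other next hops are left untouched, so flows hashing
 * to them keep their next hop. Returns the number of buckets moved.
 */
size_t res_group_replace_nh(struct res_group *grp, uint32_t old_nhid,
			    uint32_t new_nhid)
{
	size_t moved = 0;
	size_t i;

	for (i = 0; i < NUM_BUCKETS; i++) {
		if (grp->bucket_nhid[i] == old_nhid) {
			grp->bucket_nhid[i] = new_nhid;
			moved++;
		}
	}

	return moved;
}
```

With a table laid out as 1, 1, 2, 2, 1, 1, 1, 1, removing next hop 2 touches
only buckets 2 and 3; the other six buckets, and the flows that hash to them,
are unaffected.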

In this patch set, introduce support for resilient next-hop groups to
iproute2.

- Patch #1 brings include/uapi/linux/nexthop.h and /rtnetlink.h up to date.

- Patches #2 and #3 add new helpers that will be useful later.

- Patch #4 extends the ip/nexthop sub-tool to accept group type as a
  command line argument, and to dispatch based on the specified type.

- Patch #5 adds the support for resilient next-hop groups.

- Patch #6 adds the support for resilient next-hop group bucket interface.

To illustrate the usage, consider the following commands:

 # ip nexthop add id 1 via 192.0.2.2 dev dummy1
 # ip nexthop add id 2 via 192.0.2.3 dev dummy1
 # ip nexthop add id 10 group 1/2 type resilient \
buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8
buckets, each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.
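
The interplay of the two timers can be sketched as follows. This is a minimal
model under stated assumptions, not the kernel code: all names are
hypothetical, times are plain seconds, and treating an unbalanced_timer of 0
as "never force" is an assumption.

```c
#include <stdbool.h>

/* Hypothetical per-group timer settings, in seconds. */
struct res_timers {
	unsigned long idle_timer;
	unsigned long unbalanced_timer;	/* assumed: 0 means never force */
};

/* A bucket is idle once no traffic has hit it for at least idle_timer. */
bool bucket_is_idle(const struct res_timers *t, unsigned long now,
		    unsigned long last_used)
{
	return now - last_used >= t->idle_timer;
}

/* Balance is forced once the table was unbalanced for unbalanced_timer. */
bool table_force_balance(const struct res_timers *t, unsigned long now,
			 unsigned long unbalanced_since)
{
	return t->unbalanced_timer != 0 &&
	       now - unbalanced_since >= t->unbalanced_timer;
}

/* A bucket may move to another next hop if it is idle or balance is forced. */
bool bucket_may_migrate(const struct res_timers *t, unsigned long now,
			unsigned long last_used,
			unsigned long unbalanced_since)
{
	return bucket_is_idle(t, now, last_used) ||
	       table_force_balance(t, now, unbalanced_since);
}
```

With idle_timer 60 and unbalanced_timer 300 as in the command above, a bucket
last hit 50 seconds ago stays where it is unless the table has been out of
balance for 300 seconds or more.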

And this is how the next-hop group bucket interface looks:

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=2a0186a37700b0d5b8cc40be202a62af44f02fa2

Ido Schimmel (4):
  nexthop: Synchronize uAPI files
  nexthop: Add ability to specify group type
  nexthop: Add support for resilient nexthop groups
  nexthop: Add support for nexthop buckets

Petr Machata (2):
  json_print: Add print_tv()
  nexthop: Extract a helper to parse a NH ID

 include/json_print.h   |   1 +
 include/libnetlink.h   |   3 +
 include/uapi/linux/nexthop.h   |  47 +++-
 include/uapi/linux/rtnetlink.h |   7 +
 ip/ip_common.h |   1 +
 ip/ipmonitor.c |   6 +
 ip/ipnexthop.c | 451 -
 lib/json_print.c   |  13 +
 lib/libnetlink.c   |  26 ++
 man/man8/ip-nexthop.8  | 112 +++-
 10 files changed, 650 insertions(+), 17 deletions(-)

-- 
2.26.2

[PATCH iproute2-next 2/6] json_print: Add print_tv()

2021-03-12 Thread Petr Machata

From: Petr Machata 


Add a helper to dump a timeval. Print by first converting to double and
then dispatching to print_color_float().
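
As a standalone illustration of the conversion, a timeval can be turned into
fractional seconds by adding the microsecond field divided by one million.
The helper name timeval_to_seconds() is hypothetical and is not part of
iproute2.

```c
#include <sys/time.h>

#define USEC_PER_SEC 1000000.0

/* Convert a timeval to seconds as a double: tv_sec + tv_usec / 10^6. */
double timeval_to_seconds(const struct timeval *tv)
{
	return (double)tv->tv_sec + (double)tv->tv_usec / USEC_PER_SEC;
}
```

For example, a timeval of 5 seconds and 500000 microseconds converts to 5.5.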

Signed-off-by: Petr Machata 
---
 include/json_print.h |  1 +
 lib/json_print.c | 13 +
 2 files changed, 14 insertions(+)

diff --git a/include/json_print.h b/include/json_print.h
index 6fcf9fd910ec..63eee3823fe4 100644
--- a/include/json_print.h
+++ b/include/json_print.h
@@ -81,6 +81,7 @@ _PRINT_FUNC(0xhex, unsigned long long)
 _PRINT_FUNC(luint, unsigned long)
 _PRINT_FUNC(lluint, unsigned long long)
 _PRINT_FUNC(float, double)
+_PRINT_FUNC(tv, struct timeval *)
 #undef _PRINT_FUNC
 
 #define _PRINT_NAME_VALUE_FUNC(type_name, type, format_char) \
diff --git a/lib/json_print.c b/lib/json_print.c
index 994a2f8d6ae0..1018bfb36d94 100644
--- a/lib/json_print.c
+++ b/lib/json_print.c
@@ -299,6 +299,19 @@ int print_color_null(enum output_type type,
return ret;
 }
 
+int print_color_tv(enum output_type type,
+  enum color_attr color,
+  const char *key,
+  const char *fmt,
+  struct timeval *tv)
+{
+   double usecs = tv->tv_usec;
+   double secs = tv->tv_sec;
+   double time = secs + usecs / 1000000;
+
+   return print_color_float(type, color, key, fmt, time);
+}
+
 /* Print line separator (if not in JSON mode) */
 void print_nl(void)
 {
-- 
2.26.2

[PATCH net-next 10/10] selftests: netdevsim: Add test for resilient nexthop groups offload API

2021-03-12 Thread Petr Machata

From: Ido Schimmel 

Test various aspects of the resilient nexthop group offload API on top
of the netdevsim implementation. Both good and bad flows are tested.

Signed-off-by: Ido Schimmel 
Co-developed-by: Petr Machata 
Signed-off-by: Petr Machata 
---
 .../drivers/net/netdevsim/nexthop.sh  | 620 ++
 1 file changed, 620 insertions(+)

diff --git a/tools/testing/selftests/drivers/net/netdevsim/nexthop.sh b/tools/testing/selftests/drivers/net/netdevsim/nexthop.sh
index be0c1b5ee6b8..ba75c81cda91 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/nexthop.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/nexthop.sh
@@ -11,14 +11,33 @@ ALL_TESTS="
nexthop_single_add_err_test
nexthop_group_add_test
nexthop_group_add_err_test
+   nexthop_res_group_add_test
+   nexthop_res_group_add_err_test
nexthop_group_replace_test
nexthop_group_replace_err_test
+   nexthop_res_group_replace_test
+   nexthop_res_group_replace_err_test
+   nexthop_res_group_idle_timer_test
+   nexthop_res_group_idle_timer_del_test
+   nexthop_res_group_increase_idle_timer_test
+   nexthop_res_group_decrease_idle_timer_test
+   nexthop_res_group_unbalanced_timer_test
+   nexthop_res_group_unbalanced_timer_del_test
+   nexthop_res_group_no_unbalanced_timer_test
+   nexthop_res_group_short_unbalanced_timer_test
+   nexthop_res_group_increase_unbalanced_timer_test
+   nexthop_res_group_decrease_unbalanced_timer_test
+   nexthop_res_group_force_migrate_busy_test
nexthop_single_replace_test
nexthop_single_replace_err_test
nexthop_single_in_group_replace_test
nexthop_single_in_group_replace_err_test
+   nexthop_single_in_res_group_replace_test
+   nexthop_single_in_res_group_replace_err_test
nexthop_single_in_group_delete_test
nexthop_single_in_group_delete_err_test
+   nexthop_single_in_res_group_delete_test
+   nexthop_single_in_res_group_delete_err_test
nexthop_replay_test
nexthop_replay_err_test
 "
@@ -27,6 +46,7 @@ DEV_ADDR=1337
 DEV=netdevsim${DEV_ADDR}
 DEVLINK_DEV=netdevsim/${DEV}
 SYSFS_NET_DIR=/sys/bus/netdevsim/devices/$DEV/net/
+DEBUGFS_NET_DIR=/sys/kernel/debug/netdevsim/$DEV/
 NUM_NETIFS=0
 source $lib_dir/lib.sh
 source $lib_dir/devlink_lib.sh
@@ -44,6 +64,28 @@ nexthop_check()
return 0
 }
 
+nexthop_bucket_nhid_count_check()
+{
+   local group_id=$1; shift
+   local expected
+   local count
+   local nhid
+   local ret
+
+   while (($# > 0)); do
+   nhid=$1; shift
+   expected=$1; shift
+
+   count=$($IP nexthop bucket show id $group_id nhid $nhid |
+   grep "trap" | wc -l)
+   if ((expected != count)); then
+   return 1
+   fi
+   done
+
+   return 0
+}
+
 nexthop_resource_check()
 {
local expected_occ=$1; shift
@@ -159,6 +201,71 @@ nexthop_group_add_err_test()
nexthop_resource_set 
 }
 
+nexthop_res_group_add_test()
+{
+   RET=0
+
+   $IP nexthop add id 1 via 192.0.2.2 dev dummy1
+   $IP nexthop add id 2 via 192.0.2.3 dev dummy1
+
+   $IP nexthop add id 10 group 1/2 type resilient buckets 4
+   nexthop_check "id 10" "id 10 group 1/2 type resilient buckets 4 idle_timer 120 unbalanced_timer 0 unbalanced_time 0 trap"
+   check_err $? "Unexpected nexthop group entry"
+
+   nexthop_bucket_nhid_count_check 10 1 2
+   check_err $? "Wrong nexthop buckets count"
+   nexthop_bucket_nhid_count_check 10 2 2
+   check_err $? "Wrong nexthop buckets count"
+
+   nexthop_resource_check 6
+   check_err $? "Wrong nexthop occupancy"
+
+   $IP nexthop del id 10
+   nexthop_resource_check 2
+   check_err $? "Wrong nexthop occupancy after delete"
+
+   $IP nexthop add id 10 group 1,3/2,2 type resilient buckets 5
+   nexthop_check "id 10" "id 10 group 1,3/2,2 type resilient buckets 5 idle_timer 120 unbalanced_timer 0 unbalanced_time 0 trap"
+   check_err $? "Unexpected weighted nexthop group entry"
+
+   nexthop_bucket_nhid_count_check 10 1 3
+   check_err $? "Wrong nexthop buckets count"
+   nexthop_bucket_nhid_count_check 10 2 2
+   check_err $? "Wrong nexthop buckets count"
+
+   nexthop_resource_check 7
+   check_err $? "Wrong weighted nexthop occupancy"
+
+   $IP nexthop del id 10
+   nexthop_resource_check 2
+   check_err $? "Wrong nexthop occupancy after delete"
+
+   log_test "Resilient nexthop group add and delete"
+
+   $IP nexthop flush &> /dev/null
+}
+
+nexthop_res_group_add_err_test()
+{
+   RET=0
+
+   nexthop_resource_set 2
+
+   $IP nexthop add id 1 via 192.0

[PATCH net-next 09/10] selftests: forwarding: Add resilient multipath tunneling nexthop test

2021-03-12 Thread Petr Machata

From: Ido Schimmel 

Add a resilient nexthop objects version of gre_multipath_nh.sh. Test
that both IPv4 and IPv6 overlays work with resilient nexthop groups
where the nexthops are two GRE tunnels.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
Signed-off-by: Petr Machata 
---
 .../net/forwarding/gre_multipath_nh_res.sh| 361 ++
 1 file changed, 361 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/gre_multipath_nh_res.sh

diff --git a/tools/testing/selftests/net/forwarding/gre_multipath_nh_res.sh b/tools/testing/selftests/net/forwarding/gre_multipath_nh_res.sh
new file mode 100755
index ..088b65e64d66
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/gre_multipath_nh_res.sh
@@ -0,0 +1,361 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Test traffic distribution when a wECMP route forwards traffic to two GRE
+# tunnels.
+#
+# +-+
+# | H1  |
+# |   $h1 + |
+# |  192.0.2.1/28 | |
+# |  2001:db8:1::1/64 | |
+# +---|-+
+# |
+# +---|+
+# | SW1   ||
+# |  $ol1 +|
+# |  192.0.2.2/28  |
+# |  2001:db8:1::2/64  |
+# ||
+# |  + g1a (gre)  + g1b (gre)  |
+# |loc=192.0.2.65   loc=192.0.2.81 |
+# |rem=192.0.2.66 --.   rem=192.0.2.82 --. |
+# |tos=inherit  |   tos=inherit  | |
+# |  .--'| |
+# |  |.--' |
+# |  vv|
+# |  + $ul1.111 (vlan)+ $ul1.222 (vlan)|
+# |  | 192.0.2.129/28 | 192.0.2.145/28 |
+# |   \  / |
+# |\/  |
+# ||   |
+# |+ $ul1  |
+# +|---+
+#  |
+# +|---+
+# | SW2+ $ul2  |
+# | ___|   |
+# |/\  |
+# |   /  \ |
+# |  + $ul2.111 (vlan)+ $ul2.222 (vlan)|
+# |  ^ 192.0.2.130/28 ^ 192.0.2.146/28 |
+# |  |||
+# |  |'--. |
+# |  '--.| |
+# |  + g2a (gre)| + g2b (gre)| |
+# |loc=192.0.2.66   |   loc=192.0.2.82   | |
+# |rem=192.0.2.65 --'   rem=192.0.2.81 --' |
+# |tos=inherit  tos=inherit|
+# ||
+# |  $ol2 +|
+# | 192.0.2.17/28 ||
+# |  2001:db8:2::1/64 ||
+# +---|+
+# |
+# +---|-+
+# | H2| |
+# |   $h2 + |
+# | 192.0.2.18/28   |
+# |  2001:db8:2::2/64   |
+# +-+
+
+ALL_TESTS="
+   ping_ipv4
+   ping_ipv6
+   multipath_ipv4
+   multipath_ipv6
+   multipath_ipv6_l4
+"
+
+NUM_NETIFS=6
+source lib.sh
+
+h1_create()
+{
+   simple_if_init $h1 192.0.2.1/28 2001:db8:1::1/64
+   ip route add vrf v$h1 192.0.2.16/28 via 192.0.2.2
+   ip route add vrf v$h1 2001:db8:2::/64 via 2001:db8:1::2
+}
+
+h1_destroy()
+{
+   ip route del vrf v$h1 2001:db8:2::/64 via 2001:db8:1::2
+   ip route del vrf v$h1 192.0.2.16/28 via 192.0.2.2
+   simple_if_fini $h1 192.0.2.1/28
+}
+
+sw1_create()
+{
+   simple_if_init $ol1 192.0.2.2/28 2001:db8:1::2/64
+   __simple_if_init $ul1 v$ol1
+   vlan_create $ul1 111 v$ol1 192.0.2.129/28
+   vlan_create $ul1 222 v$ol1 192.0.2.145/28
+
+   tunnel_create g1a gre 192.0.2.65 192.0.2.66 tos inherit dev v$ol1
+   __simple_if_init g1a v$ol1 192.0.2.65/32
+   ip route add vrf v$ol1 192.0.2.66/32 via 192.0.2.130
+
+   tunnel_create g1b gre 192.0.2.81 192.0.2.82 tos inherit dev v$ol1
+   __simple_if_init g1b v$ol1 192.0.2.81/32
+   ip route add vrf v$ol1 192.0.2.82/32 via 192.0.2.146
+
+   ip -6 nexthop add id 101 dev g1a
+   ip -6 nexthop add id 102 dev g1b
+   ip nexthop add id 103 group 101/102 type resilient buckets 512 \
+   idle_timer 0
+
+   ip route add vrf v$ol1 192.0.2.16/28 nhid 103
+   ip route add vrf v$ol1 2001:db8:2::/64 nhid 103
+}
+
+sw1_destroy()
+{
+   ip route del vrf v$ol1 2001:db8:2::/64
+   ip route del vrf v$ol1 192.0.2.16/28
+
+   ip nexthop del id 103
+   ip -6 nexthop del id 102
+   ip -6 nexthop del id 101
+
+

[PATCH net-next 07/10] selftests: fib_nexthops: Test resilient nexthop groups

2021-03-12 Thread Petr Machata

Signed-off-by: Ido Schimmel 
Co-developed-by: Petr Machata 
Signed-off-by: Petr Machata 
---
 tools/testing/selftests/net/fib_nexthops.sh | 517 
 1 file changed, 517 insertions(+)

diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
index c840aa88ff18..56dd0c6f2e96 100755
--- a/tools/testing/selftests/net/fib_nexthops.sh
+++ b/tools/testing/selftests/net/fib_nexthops.sh
@@ -22,26 +22,33 @@ ksft_skip=4
 IPV4_TESTS="
ipv4_fcnal
ipv4_grp_fcnal
+   ipv4_res_grp_fcnal
ipv4_withv6_fcnal
ipv4_fcnal_runtime
ipv4_large_grp
+   ipv4_large_res_grp
ipv4_compat_mode
ipv4_fdb_grp_fcnal
ipv4_torture
+   ipv4_res_torture
 "
 
 IPV6_TESTS="
ipv6_fcnal
ipv6_grp_fcnal
+   ipv6_res_grp_fcnal
ipv6_fcnal_runtime
ipv6_large_grp
+   ipv6_large_res_grp
ipv6_compat_mode
ipv6_fdb_grp_fcnal
ipv6_torture
+   ipv6_res_torture
 "
 
 ALL_TESTS="
basic
+   basic_res
${IPV4_TESTS}
${IPV6_TESTS}
 "
@@ -254,6 +261,19 @@ check_nexthop()
check_output "${out}" "${expected}"
 }
 
+check_nexthop_bucket()
+{
+   local nharg="$1"
+   local expected="$2"
+   local out
+
+   # remove the idle time since we cannot match it
+   out=$($IP nexthop bucket ${nharg} \
+   | sed s/idle_time\ [0-9.]*\ // 2>/dev/null)
+
+   check_output "${out}" "${expected}"
+}
+
 check_route()
 {
local pfx="$1"
@@ -330,6 +350,25 @@ check_large_grp()
log_test $? 0 "Dump large (x$ecmp) ecmp groups"
 }
 
+check_large_res_grp()
+{
+   local ipv=$1
+   local buckets=$2
+   local ipstr=""
+
+   if [ $ipv -eq 4 ]; then
+   ipstr="172.16.1.2"
+   else
+   ipstr="2001:db8:91::2"
+   fi
+
+   # create a resilient group with $buckets buckets and dump them
+   run_cmd "$IP nexthop add id 100 via $ipstr dev veth1"
+   run_cmd "$IP nexthop add id 1000 group 100 type resilient buckets $buckets"
+   run_cmd "$IP nexthop bucket list"
+   log_test $? 0 "Dump large (x$buckets) nexthop buckets"
+}
+
 start_ip_monitor()
 {
local mtype=$1
@@ -366,6 +405,15 @@ check_nexthop_fdb_support()
fi
 }
 
+check_nexthop_res_support()
+{
+   $IP nexthop help 2>&1 | grep -q resilient
+   if [ $? -ne 0 ]; then
+   echo "SKIP: iproute2 too old, missing resilient nexthop group support"
+   return $ksft_skip
+   fi
+}
+
 ipv6_fdb_grp_fcnal()
 {
local rc
@@ -688,6 +736,70 @@ ipv6_grp_fcnal()
log_test $? 2 "Nexthop group can not have a blackhole and another nexthop"
 }
 
+ipv6_res_grp_fcnal()
+{
+   local rc
+
+   echo
+   echo "IPv6 resilient groups functional"
+   echo ""
+
+   check_nexthop_res_support
+   if [ $? -eq $ksft_skip ]; then
+   return $ksft_skip
+   fi
+
+   #
+   # migration of nexthop buckets - equal weights
+   #
+   run_cmd "$IP nexthop add id 62 via 2001:db8:91::2 dev veth1"
+   run_cmd "$IP nexthop add id 63 via 2001:db8:91::3 dev veth1"
+   run_cmd "$IP nexthop add id 102 group 62/63 type resilient buckets 2 idle_timer 0"
+
+   run_cmd "$IP nexthop del id 63"
+   check_nexthop "id 102" \
+   "id 102 group 62 type resilient buckets 2 idle_timer 0 unbalanced_timer 0 unbalanced_time 0"
+   log_test $? 0 "Nexthop group updated when entry is deleted"
+   check_nexthop_bucket "list id 102" \
+   "id 102 index 0 nhid 62 id 102 index 1 nhid 62"
+   log_test $? 0 "Nexthop buckets updated when entry is deleted"
+
+   run_cmd "$IP nexthop add id 63 via 2001:db8:91::3 dev veth1"
+   run_cmd "$IP nexthop replace id 102 group 62/63 type resilient buckets 2 idle_timer 0"
+   check_nexthop "id 102" \
+   "id 102 group 62/63 type resilient buckets 2 idle_timer 0 unbalanced_timer 0 unbalanced_time 0"
+   log_test $? 0 "Nexthop group updated after replace"
+   check_nexthop_bucket "list id 102" \
+   "id 102 index 0 nhid 63 id 102 index 1 nhid 62"
+   log_test $? 0 "Nexthop buckets updated after replace"
+
+   $IP nexthop flush >/dev/null 2>&1
+
+   #
+   # migration of nexthop buckets - unequal weights
+   #
+   run_cmd "$IP nexthop add id 62 via 2001:db8:91::2 dev veth1"
+   run_cmd "$IP nexthop add id 63 via 2001:db8:91::3 dev veth1"
+   run_c

[PATCH net-next 08/10] selftests: forwarding: Add resilient hashing test

2021-03-12 Thread Petr Machata

From: Ido Schimmel 

Verify that IPv4 and IPv6 multipath forwarding works correctly with
resilient nexthop groups and with different weights.

Test that when the idle timer is not zero, the resilient groups are not
rebalanced - because the nexthop buckets are considered active - and the
initial weights (1:1) are used.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
Signed-off-by: Petr Machata 
---
 .../net/forwarding/router_mpath_nh_res.sh | 400 ++
 1 file changed, 400 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh

diff --git a/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh b/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh
new file mode 100755
index ..4898dd4118f1
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh
@@ -0,0 +1,400 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+ALL_TESTS="
+   ping_ipv4
+   ping_ipv6
+   multipath_test
+"
+NUM_NETIFS=8
+source lib.sh
+
+h1_create()
+{
+   vrf_create "vrf-h1"
+   ip link set dev $h1 master vrf-h1
+
+   ip link set dev vrf-h1 up
+   ip link set dev $h1 up
+
+   ip address add 192.0.2.2/24 dev $h1
+   ip address add 2001:db8:1::2/64 dev $h1
+
+   ip route add 198.51.100.0/24 vrf vrf-h1 nexthop via 192.0.2.1
+   ip route add 2001:db8:2::/64 vrf vrf-h1 nexthop via 2001:db8:1::1
+}
+
+h1_destroy()
+{
+   ip route del 2001:db8:2::/64 vrf vrf-h1
+   ip route del 198.51.100.0/24 vrf vrf-h1
+
+   ip address del 2001:db8:1::2/64 dev $h1
+   ip address del 192.0.2.2/24 dev $h1
+
+   ip link set dev $h1 down
+   vrf_destroy "vrf-h1"
+}
+
+h2_create()
+{
+   vrf_create "vrf-h2"
+   ip link set dev $h2 master vrf-h2
+
+   ip link set dev vrf-h2 up
+   ip link set dev $h2 up
+
+   ip address add 198.51.100.2/24 dev $h2
+   ip address add 2001:db8:2::2/64 dev $h2
+
+   ip route add 192.0.2.0/24 vrf vrf-h2 nexthop via 198.51.100.1
+   ip route add 2001:db8:1::/64 vrf vrf-h2 nexthop via 2001:db8:2::1
+}
+
+h2_destroy()
+{
+   ip route del 2001:db8:1::/64 vrf vrf-h2
+   ip route del 192.0.2.0/24 vrf vrf-h2
+
+   ip address del 2001:db8:2::2/64 dev $h2
+   ip address del 198.51.100.2/24 dev $h2
+
+   ip link set dev $h2 down
+   vrf_destroy "vrf-h2"
+}
+
+router1_create()
+{
+   vrf_create "vrf-r1"
+   ip link set dev $rp11 master vrf-r1
+   ip link set dev $rp12 master vrf-r1
+   ip link set dev $rp13 master vrf-r1
+
+   ip link set dev vrf-r1 up
+   ip link set dev $rp11 up
+   ip link set dev $rp12 up
+   ip link set dev $rp13 up
+
+   ip address add 192.0.2.1/24 dev $rp11
+   ip address add 2001:db8:1::1/64 dev $rp11
+
+   ip address add 169.254.2.12/24 dev $rp12
+   ip address add fe80:2::12/64 dev $rp12
+
+   ip address add 169.254.3.13/24 dev $rp13
+   ip address add fe80:3::13/64 dev $rp13
+}
+
+router1_destroy()
+{
+   ip route del 2001:db8:2::/64 vrf vrf-r1
+   ip route del 198.51.100.0/24 vrf vrf-r1
+
+   ip address del fe80:3::13/64 dev $rp13
+   ip address del 169.254.3.13/24 dev $rp13
+
+   ip address del fe80:2::12/64 dev $rp12
+   ip address del 169.254.2.12/24 dev $rp12
+
+   ip address del 2001:db8:1::1/64 dev $rp11
+   ip address del 192.0.2.1/24 dev $rp11
+
+   ip nexthop del id 103
+   ip nexthop del id 101
+   ip nexthop del id 102
+   ip nexthop del id 106
+   ip nexthop del id 104
+   ip nexthop del id 105
+
+   ip link set dev $rp13 down
+   ip link set dev $rp12 down
+   ip link set dev $rp11 down
+
+   vrf_destroy "vrf-r1"
+}
+
+router2_create()
+{
+   vrf_create "vrf-r2"
+   ip link set dev $rp21 master vrf-r2
+   ip link set dev $rp22 master vrf-r2
+   ip link set dev $rp23 master vrf-r2
+
+   ip link set dev vrf-r2 up
+   ip link set dev $rp21 up
+   ip link set dev $rp22 up
+   ip link set dev $rp23 up
+
+   ip address add 198.51.100.1/24 dev $rp21
+   ip address add 2001:db8:2::1/64 dev $rp21
+
+   ip address add 169.254.2.22/24 dev $rp22
+   ip address add fe80:2::22/64 dev $rp22
+
+   ip address add 169.254.3.23/24 dev $rp23
+   ip address add fe80:3::23/64 dev $rp23
+}
+
+router2_destroy()
+{
+   ip route del 2001:db8:1::/64 vrf vrf-r2
+   ip route del 192.0.2.0/24 vrf vrf-r2
+
+   ip address del fe80:3::23/64 dev $rp23
+   ip address del 169.254.3.23/24 dev $rp23
+
+   ip address del fe80:2::22/64 dev $rp22
+   ip address del 169.254.2.22/24 dev $rp22
+
+   ip address del 2001:db8:2::1/64 dev $rp21
+   ip address del 198.51.100.1/24 dev $rp21
+
+   ip nexthop del id 201
+   ip nexthop del id 202
+   ip nexthop del id 204
+   ip nexthop del id 205
+
+   i

[PATCH net-next 03/10] netdevsim: Add support for resilient nexthop groups

2021-03-12 Thread Petr Machata

From: Ido Schimmel 

Allow resilient nexthop groups to be programmed and account their
occupancy according to their number of buckets. The nexthop group itself
as well as its buckets are marked with hardware flags (i.e.,
'RTNH_F_TRAP').
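
The occupancy accounting can be modelled roughly as below. This is a
userspace sketch with hypothetical names (nh_kind, nh_desc, nh_occupancy),
not the netdevsim code; in particular, counting a single nexthop as one unit
is an assumption, chosen to be consistent with the occupancy values checked
by the netdevsim selftest.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nexthop kinds, mirroring the notifier info types. */
enum nh_kind { NH_KIND_SINGLE, NH_KIND_GROUP, NH_KIND_RES_TABLE };

struct nh_desc {
	enum nh_kind kind;
	const uint8_t *weights;	/* NH_KIND_GROUP: weight of each member */
	size_t num_nh;		/* NH_KIND_GROUP: number of members */
	uint16_t num_buckets;	/* NH_KIND_RES_TABLE: number of buckets */
};

/* Return how many units of the nexthop resource an object occupies. */
uint64_t nh_occupancy(const struct nh_desc *nh)
{
	uint64_t occ = 0;
	size_t i;

	switch (nh->kind) {
	case NH_KIND_SINGLE:
		occ = 1;	/* assumption: one unit per single nexthop */
		break;
	case NH_KIND_GROUP:
		for (i = 0; i < nh->num_nh; i++)
			occ += nh->weights[i];
		break;
	case NH_KIND_RES_TABLE:
		occ = nh->num_buckets;
		break;
	}

	return occ;
}
```

Under this model, two single nexthops plus a resilient group with 4 buckets
occupy 2 + 4 = 6 units, regardless of the weights of the group members.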

Replacement of a single nexthop bucket can be made to fail using the
following debugfs knob:

 # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
 N
 # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
 # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
 Y

Replacement of a resilient nexthop group can be made to fail using the
following debugfs knob:

 # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
 N
 # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
 # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
 Y

This enables testing of various error paths.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
Signed-off-by: Petr Machata 
---
 drivers/net/netdevsim/fib.c | 55 +
 1 file changed, 55 insertions(+)

diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c
index 62cbd716383c..e41f3b98295c 100644
--- a/drivers/net/netdevsim/fib.c
+++ b/drivers/net/netdevsim/fib.c
@@ -57,6 +57,8 @@ struct nsim_fib_data {
struct mutex nh_lock; /* Protects NH HT */
struct dentry *ddir;
bool fail_route_offload;
+   bool fail_res_nexthop_group_replace;
+   bool fail_nexthop_bucket_replace;
 };
 
 struct nsim_fib_rt_key {
@@ -117,6 +119,7 @@ struct nsim_nexthop {
struct rhash_head ht_node;
u64 occ;
u32 id;
+   bool is_resilient;
 };
 
 static const struct rhashtable_params nsim_nexthop_ht_params = {
@@ -1115,6 +1118,10 @@ static struct nsim_nexthop *nsim_nexthop_create(struct nsim_fib_data *data,
for (i = 0; i < info->nh_grp->num_nh; i++)
occ += info->nh_grp->nh_entries[i].weight;
break;
+   case NH_NOTIFIER_INFO_TYPE_RES_TABLE:
+   occ = info->nh_res_table->num_nh_buckets;
+   nexthop->is_resilient = true;
+   break;
default:
NL_SET_ERR_MSG_MOD(info->extack, "Unsupported nexthop type");
kfree(nexthop);
@@ -1161,7 +1168,15 @@ static void nsim_nexthop_hw_flags_set(struct net *net,
  const struct nsim_nexthop *nexthop,
  bool trap)
 {
+   int i;
+
nexthop_set_hw_flags(net, nexthop->id, false, trap);
+
+   if (!nexthop->is_resilient)
+   return;
+
+   for (i = 0; i < nexthop->occ; i++)
+   nexthop_bucket_set_hw_flags(net, nexthop->id, i, false, trap);
 }
 
 static int nsim_nexthop_add(struct nsim_fib_data *data,
@@ -1262,6 +1277,32 @@ static void nsim_nexthop_remove(struct nsim_fib_data *data,
nsim_nexthop_destroy(nexthop);
 }
 
+static int nsim_nexthop_res_table_pre_replace(struct nsim_fib_data *data,
+ struct nh_notifier_info *info)
+{
+   if (data->fail_res_nexthop_group_replace) {
+   NL_SET_ERR_MSG_MOD(info->extack, "Failed to replace a resilient nexthop group");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static int nsim_nexthop_bucket_replace(struct nsim_fib_data *data,
+  struct nh_notifier_info *info)
+{
+   if (data->fail_nexthop_bucket_replace) {
+   NL_SET_ERR_MSG_MOD(info->extack, "Failed to replace nexthop bucket");
+   return -EINVAL;
+   }
+
+   nexthop_bucket_set_hw_flags(info->net, info->id,
+   info->nh_res_bucket->bucket_index,
+   false, true);
+
+   return 0;
+}
+
 static int nsim_nexthop_event_nb(struct notifier_block *nb, unsigned long event,
 void *ptr)
 {
@@ -1278,6 +1319,12 @@ static int nsim_nexthop_event_nb(struct notifier_block *nb, unsigned long event,
case NEXTHOP_EVENT_DEL:
nsim_nexthop_remove(data, info);
break;
+   case NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE:
+   err = nsim_nexthop_res_table_pre_replace(data, info);
+   break;
+   case NEXTHOP_EVENT_BUCKET_REPLACE:
+   err = nsim_nexthop_bucket_replace(data, info);
+   break;
default:
break;
}
@@ -1387,6 +1434,14 @@ nsim_fib_debugfs_init(struct nsim_fib_data *data, struct nsim_dev *nsim_dev)
data->fail_route_offload = false;
debugfs_create_bool("fail_route_offload", 0600, data->ddir,
&data->fail_route_offload);

[PATCH net-next 06/10] selftests: fib_nexthops: List each test case in a different line

2021-03-12 Thread Petr Machata

From: Ido Schimmel 

The lines with the IPv4 and IPv6 test cases are already very long and
more test cases will be added in subsequent patches.

List each test case in a different line to make it easier to extend the
test with more test cases.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
Signed-off-by: Petr Machata 
---
 tools/testing/selftests/net/fib_nexthops.sh | 30 ++---
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
index 91226ac50112..c840aa88ff18 100755
--- a/tools/testing/selftests/net/fib_nexthops.sh
+++ b/tools/testing/selftests/net/fib_nexthops.sh
@@ -19,10 +19,32 @@ ret=0
 ksft_skip=4
 
 # all tests in this script. Can be overridden with -t option
-IPV4_TESTS="ipv4_fcnal ipv4_grp_fcnal ipv4_withv6_fcnal ipv4_fcnal_runtime ipv4_large_grp ipv4_compat_mode ipv4_fdb_grp_fcnal ipv4_torture"
-IPV6_TESTS="ipv6_fcnal ipv6_grp_fcnal ipv6_fcnal_runtime ipv6_large_grp ipv6_compat_mode ipv6_fdb_grp_fcnal ipv6_torture"
-
-ALL_TESTS="basic ${IPV4_TESTS} ${IPV6_TESTS}"
+IPV4_TESTS="
+   ipv4_fcnal
+   ipv4_grp_fcnal
+   ipv4_withv6_fcnal
+   ipv4_fcnal_runtime
+   ipv4_large_grp
+   ipv4_compat_mode
+   ipv4_fdb_grp_fcnal
+   ipv4_torture
+"
+
+IPV6_TESTS="
+   ipv6_fcnal
+   ipv6_grp_fcnal
+   ipv6_fcnal_runtime
+   ipv6_large_grp
+   ipv6_compat_mode
+   ipv6_fdb_grp_fcnal
+   ipv6_torture
+"
+
+ALL_TESTS="
+   basic
+   ${IPV4_TESTS}
+   ${IPV6_TESTS}
+"
 TESTS="${ALL_TESTS}"
 VERBOSE=0
 PAUSE_ON_FAIL=no
-- 
2.26.2

[PATCH net-next 02/10] netdevsim: Create a helper for setting nexthop hardware flags

2021-03-12 Thread Petr Machata

From: Ido Schimmel 

Instead of calling nexthop_set_hw_flags(), call a helper. It will be
used to also set nexthop bucket flags in a subsequent patch.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
Signed-off-by: Petr Machata 
---
 drivers/net/netdevsim/fib.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c
index ba577e20b1a1..62cbd716383c 100644
--- a/drivers/net/netdevsim/fib.c
+++ b/drivers/net/netdevsim/fib.c
@@ -1157,6 +1157,13 @@ static int nsim_nexthop_account(struct nsim_fib_data *data, u64 occ,
 
 }
 
+static void nsim_nexthop_hw_flags_set(struct net *net,
+ const struct nsim_nexthop *nexthop,
+ bool trap)
+{
+   nexthop_set_hw_flags(net, nexthop->id, false, trap);
+}
+
 static int nsim_nexthop_add(struct nsim_fib_data *data,
struct nsim_nexthop *nexthop,
struct netlink_ext_ack *extack)
@@ -1175,7 +1182,7 @@ static int nsim_nexthop_add(struct nsim_fib_data *data,
goto err_nexthop_dismiss;
}
 
-   nexthop_set_hw_flags(net, nexthop->id, false, true);
+   nsim_nexthop_hw_flags_set(net, nexthop, true);
 
return 0;
 
@@ -1204,7 +1211,7 @@ static int nsim_nexthop_replace(struct nsim_fib_data *data,
goto err_nexthop_dismiss;
}
 
-   nexthop_set_hw_flags(net, nexthop->id, false, true);
+   nsim_nexthop_hw_flags_set(net, nexthop, true);
nsim_nexthop_account(data, nexthop_old->occ, false, extack);
nsim_nexthop_destroy(nexthop_old);
 
@@ -1286,7 +1293,7 @@ static void nsim_nexthop_free(void *ptr, void *arg)
struct net *net;
 
net = devlink_net(data->devlink);
-   nexthop_set_hw_flags(net, nexthop->id, false, false);
+   nsim_nexthop_hw_flags_set(net, nexthop, false);
nsim_nexthop_account(data, nexthop->occ, false, NULL);
nsim_nexthop_destroy(nexthop);
 }
-- 
2.26.2

[PATCH net-next 01/10] netdevsim: fib: Introduce a lock to guard nexthop hashtable

2021-03-12 Thread Petr Machata

Currently netdevsim relies on RTNL to maintain exclusivity in accessing the
nexthop hash table. However, bucket notifications may be sent without RTNL
being held. Therefore, introduce a dedicated lock to guard the table.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 drivers/net/netdevsim/fib.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c
index 3ca0f54d0c3b..ba577e20b1a1 100644
--- a/drivers/net/netdevsim/fib.c
+++ b/drivers/net/netdevsim/fib.c
@@ -47,13 +47,14 @@ struct nsim_fib_data {
struct nsim_fib_entry nexthops;
struct rhashtable fib_rt_ht;
struct list_head fib_rt_list;
-   struct mutex fib_lock; /* Protects hashtable and list */
+   struct mutex fib_lock; /* Protects FIB HT and list */
struct notifier_block nexthop_nb;
struct rhashtable nexthop_ht;
struct devlink *devlink;
struct work_struct fib_event_work;
struct list_head fib_event_queue;
spinlock_t fib_event_queue_lock; /* Protects fib event queue list */
+   struct mutex nh_lock; /* Protects NH HT */
struct dentry *ddir;
bool fail_route_offload;
 };
@@ -1262,8 +1263,7 @@ static int nsim_nexthop_event_nb(struct notifier_block *nb, unsigned long event,
struct nh_notifier_info *info = ptr;
int err = 0;
 
-   ASSERT_RTNL();
-
+   mutex_lock(&data->nh_lock);
switch (event) {
case NEXTHOP_EVENT_REPLACE:
err = nsim_nexthop_insert(data, info);
@@ -1275,6 +1275,7 @@ static int nsim_nexthop_event_nb(struct notifier_block *nb, unsigned long event,
break;
}
 
+   mutex_unlock(&data->nh_lock);
return notifier_from_errno(err);
 }
 
@@ -1404,6 +1405,7 @@ struct nsim_fib_data *nsim_fib_create(struct devlink *devlink,
if (err)
goto err_data_free;
 
+   mutex_init(&data->nh_lock);
err = rhashtable_init(&data->nexthop_ht, &nsim_nexthop_ht_params);
if (err)
goto err_debugfs_exit;
@@ -1469,6 +1471,7 @@ struct nsim_fib_data *nsim_fib_create(struct devlink *devlink,
data);
mutex_destroy(&data->fib_lock);
 err_debugfs_exit:
+   mutex_destroy(&data->nh_lock);
nsim_fib_debugfs_exit(data);
 err_data_free:
kfree(data);
@@ -1497,6 +1500,7 @@ void nsim_fib_destroy(struct devlink *devlink, struct nsim_fib_data *data)
WARN_ON_ONCE(!list_empty(&data->fib_event_queue));
WARN_ON_ONCE(!list_empty(&data->fib_rt_list));
mutex_destroy(&data->fib_lock);
+   mutex_destroy(&data->nh_lock);
nsim_fib_debugfs_exit(data);
kfree(data);
 }
-- 
2.26.2

[PATCH net-next 04/10] netdevsim: Allow reporting activity on nexthop buckets

2021-03-12 Thread Petr Machata

From: Ido Schimmel 

A key component of the resilient hashing algorithm is the hash buckets'
activity. If a bucket is active, it will not be populated with a new
nexthop in order not to break existing flows. Therefore, in order to
easily and thoroughly test the algorithm, we need to be in full control
over the reported activity.

Add a debugfs interface that allows user space to have netdevsim report
a nexthop bucket within a resilient nexthop group as active. For
example:

 # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity

Will mark bucket 23 in nexthop group 10 as active.
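
The parsing step of such a write handler can be sketched in user space as
follows. This is only an illustration: parse_bucket_activity(),
sketch_set_bit() and sketch_test_bit() are hypothetical names, and a plain
array of unsigned long stands in for the kernel bitmap. The only part taken
from the patch is the "%u %hu" format string.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: parse "<nhid> <bucket>" and reject bucket indices
 * that do not fit in a group with num_buckets buckets.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
static int parse_bucket_activity(const char *buf, unsigned int num_buckets,
				 unsigned int *nhid, unsigned short *bucket)
{
	assert(buf && nhid && bucket);

	if (sscanf(buf, "%u %hu", nhid, bucket) != 2)
		return -1;
	if (*bucket >= num_buckets)
		return -1;
	return 0;
}

/* Set one bit in a bitmap stored as an array of unsigned long. */
static void sketch_set_bit(unsigned long *bitmap, unsigned int bit)
{
	bitmap[bit / SKETCH_BITS_PER_LONG] |= 1UL << (bit % SKETCH_BITS_PER_LONG);
}

/* Return 1 if the given bit is set, 0 otherwise. */
static int sketch_test_bit(const unsigned long *bitmap, unsigned int bit)
{
	return (bitmap[bit / SKETCH_BITS_PER_LONG] >>
		(bit % SKETCH_BITS_PER_LONG)) & 1;
}
```

With the input "10 23" from the example above, the helper yields nhid 10 and
bucket 23, and only bit 23 of the activity bitmap ends up set.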

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
Signed-off-by: Petr Machata 
---
 drivers/net/netdevsim/fib.c | 61 +
 1 file changed, 61 insertions(+)

diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c
index e41f3b98295c..fda6f37e7055 100644
--- a/drivers/net/netdevsim/fib.c
+++ b/drivers/net/netdevsim/fib.c
@@ -14,6 +14,7 @@
  * THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -1345,6 +1346,63 @@ static void nsim_nexthop_free(void *ptr, void *arg)
nsim_nexthop_destroy(nexthop);
 }
 
+static ssize_t nsim_nexthop_bucket_activity_write(struct file *file,
+ const char __user *user_buf,
+ size_t size, loff_t *ppos)
+{
+   struct nsim_fib_data *data = file->private_data;
+   struct net *net = devlink_net(data->devlink);
+   struct nsim_nexthop *nexthop;
+   unsigned long *activity;
+   loff_t pos = *ppos;
+   u16 bucket_index;
+   char buf[128];
+   int err = 0;
+   u32 nhid;
+
+   if (pos != 0)
+   return -EINVAL;
+   if (size > sizeof(buf))
+   return -EINVAL;
+   if (copy_from_user(buf, user_buf, size))
+   return -EFAULT;
+   if (sscanf(buf, "%u %hu", &nhid, &bucket_index) != 2)
+   return -EINVAL;
+
+   rtnl_lock();
+
+   nexthop = rhashtable_lookup_fast(&data->nexthop_ht, &nhid,
+nsim_nexthop_ht_params);
+   if (!nexthop || !nexthop->is_resilient ||
+   bucket_index >= nexthop->occ) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   activity = bitmap_zalloc(nexthop->occ, GFP_KERNEL);
+   if (!activity) {
+   err = -ENOMEM;
+   goto out;
+   }
+
+   bitmap_set(activity, bucket_index, 1);
+   nexthop_res_grp_activity_update(net, nhid, nexthop->occ, activity);
+   bitmap_free(activity);
+
+out:
+   rtnl_unlock();
+
+   *ppos = size;
+   return err ?: size;
+}
+
+static const struct file_operations nsim_nexthop_bucket_activity_fops = {
+   .open = simple_open,
+   .write = nsim_nexthop_bucket_activity_write,
+   .llseek = no_llseek,
+   .owner = THIS_MODULE,
+};
+
 static u64 nsim_fib_ipv4_resource_occ_get(void *priv)
 {
struct nsim_fib_data *data = priv;
@@ -1442,6 +1500,9 @@ nsim_fib_debugfs_init(struct nsim_fib_data *data, struct nsim_dev *nsim_dev)
data->fail_nexthop_bucket_replace = false;
debugfs_create_bool("fail_nexthop_bucket_replace", 0600, data->ddir,
&data->fail_nexthop_bucket_replace);
+
+   debugfs_create_file("nexthop_bucket_activity", 0200, data->ddir,
+   data, &nsim_nexthop_bucket_activity_fops);
return 0;
 }
 
-- 
2.26.2

[PATCH net-next 00/10] net: Resilient NH groups: netdevsim, selftests

2021-03-12 Thread Petr Machata

Support for resilient next-hop groups was added in a previous patch set.
Resilient next hop groups add a layer of indirection between the SKB hash
and the next hop. Thus the hash is used to reference a hash table bucket,
which is then used to reference a particular next hop. This allows the
system more flexibility when assigning SKB hash space to next hops.
Previously, each next hop had to be assigned a contiguous range of SKB hash
space. With a hash table as an intermediate layer, it is possible to
reassign next hops with a hash table bucket granularity. In turn, this
mends issues with traffic flow redirection resulting from next hop removal
or adjustments in next-hop weights.
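
The indirection described above can be sketched as a small lookup table. This
is a simplified user-space model, not the kernel code: res_table_sketch,
res_lookup() and res_reassign() are hypothetical names, the table has a fixed
size, and the bucket is picked by a plain modulo of the hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_BUCKETS 8

/* Hypothetical model: each bucket holds the ID of one next hop. */
struct res_table_sketch {
	uint32_t bucket_nh[SKETCH_NUM_BUCKETS];
};

/* The hash selects a bucket, and the bucket selects the next hop. */
static uint32_t res_lookup(const struct res_table_sketch *t, uint32_t skb_hash)
{
	return t->bucket_nh[skb_hash % SKETCH_NUM_BUCKETS];
}

/* Reassign a single bucket; flows hashing to other buckets are untouched. */
static void res_reassign(struct res_table_sketch *t, size_t bucket,
			 uint32_t nh_id)
{
	assert(bucket < SKETCH_NUM_BUCKETS);
	t->bucket_nh[bucket] = nh_id;
}
```

Reassigning bucket 3 to a new next hop changes the result only for hashes that
map to bucket 3, which is the bucket-granularity property the text refers to.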

This patch set introduces mock offloading of resilient next hop groups by
the netdevsim driver, and a suite of selftests.

- Patch #1 adds a netdevsim-specific lock to protect the next-hop hashtable.
  Previously, netdevsim relied on RTNL to maintain mutual exclusion.

- Patch #2 extracts a helper to make the following patches clearer.

- Patch #3 implements the support for offloading of resilient next-hop
  groups.

- Patch #4 introduces a new debugfs interface to set activity on a selected
  next-hop bucket. This simulates how HW can periodically report bucket
  activity, and buckets thus marked are expected to be exempt from
  migration to new next hops when the group changes.

- Patches #5 and #6 clean up the fib_nexthop selftests.

- Patches #7, #8 and #9 add tests for resilient next hop groups. Patch #7
  adds resilient-hashing counterparts to fib_nexthops.sh. Patch #8 adds a
  new traffic test for resilient next-hop groups. Patch #9 adds a new
  traffic test for tunneling.

- Patch #10 actually leverages the netdevsim offload to implement a suite
  of algorithmic tests that verify how and when buckets are migrated under
  various simulated workload scenarios.

The overall plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next hop groups (already pushed)
3) Implementation of resilient next hop group (already pushed)
4) Netdevsim offload plus a suite of selftests (this patchset)
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the complete code at [2].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1

Ido Schimmel (9):
  netdevsim: Create a helper for setting nexthop hardware flags
  netdevsim: Add support for resilient nexthop groups
  netdevsim: Allow reporting activity on nexthop buckets
  selftests: fib_nexthops: Declutter test output
  selftests: fib_nexthops: List each test case in a different line
  selftests: fib_nexthops: Test resilient nexthop groups
  selftests: forwarding: Add resilient hashing test
  selftests: forwarding: Add resilient multipath tunneling nexthop test
  selftests: netdevsim: Add test for resilient nexthop groups offload API

Petr Machata (1):
  netdevsim: fib: Introduce a lock to guard nexthop hashtable

 drivers/net/netdevsim/fib.c   | 139 +++-
 .../drivers/net/netdevsim/nexthop.sh  | 620 ++
 tools/testing/selftests/net/fib_nexthops.sh   | 549 +++-
 .../net/forwarding/gre_multipath_nh_res.sh| 361 ++
 .../net/forwarding/router_mpath_nh_res.sh | 400 +++
 5 files changed, 2059 insertions(+), 10 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/gre_multipath_nh_res.sh
 create mode 100755 tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh

-- 
2.26.2

[PATCH net-next 05/10] selftests: fib_nexthops: Declutter test output

2021-03-12 Thread Petr Machata

From: Ido Schimmel 

Before:

 # ./fib_nexthops.sh -t ipv4_torture

IPv4 runtime torture

TEST: IPv4 torture test [ OK ]
./fib_nexthops.sh: line 213: 19376 Killed  ipv4_del_add_loop1
./fib_nexthops.sh: line 213: 19377 Killed  ipv4_grp_replace_loop
./fib_nexthops.sh: line 213: 19378 Killed  ip netns exec me ping -f 172.16.101.1 > /dev/null 2>&1
./fib_nexthops.sh: line 213: 19380 Killed  ip netns exec me ping -f 172.16.101.2 > /dev/null 2>&1
./fib_nexthops.sh: line 213: 19381 Killed  ip netns exec me mausezahn veth1 -B 172.16.101.2 -A 172.16.1.1 -c 0 -t tcp "dp=1-1023, flags=syn" > /dev/null 2>&1

Tests passed:   1
Tests failed:   0

 # ./fib_nexthops.sh -t ipv6_torture

IPv6 runtime torture

TEST: IPv6 torture test [ OK ]
./fib_nexthops.sh: line 213: 24453 Killed  ipv6_del_add_loop1
./fib_nexthops.sh: line 213: 24454 Killed  ipv6_grp_replace_loop
./fib_nexthops.sh: line 213: 24456 Killed  ip netns exec me ping -f 2001:db8:101::1 > /dev/null 2>&1
./fib_nexthops.sh: line 213: 24457 Killed  ip netns exec me ping -f 2001:db8:101::2 > /dev/null 2>&1
./fib_nexthops.sh: line 213: 24458 Killed  ip netns exec me mausezahn -6 veth1 -B 2001:db8:101::2 -A 2001:db8:91::1 -c 0 -t tcp "dp=1-1023, flags=syn" > /dev/null 2>&1

Tests passed:   1
Tests failed:   0

After:

 # ./fib_nexthops.sh -t ipv4_torture

IPv4 runtime torture

TEST: IPv4 torture test [ OK ]

Tests passed:   1
Tests failed:   0

 # ./fib_nexthops.sh -t ipv6_torture

IPv6 runtime torture

TEST: IPv6 torture test [ OK ]

Tests passed:   1
Tests failed:   0

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
Signed-off-by: Petr Machata 
---
 tools/testing/selftests/net/fib_nexthops.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
index d98fb85e201c..91226ac50112 100755
--- a/tools/testing/selftests/net/fib_nexthops.sh
+++ b/tools/testing/selftests/net/fib_nexthops.sh
@@ -874,6 +874,7 @@ ipv6_torture()
 
sleep 300
kill -9 $pid1 $pid2 $pid3 $pid4 $pid5
+   wait $pid1 $pid2 $pid3 $pid4 $pid5 2>/dev/null
 
# if we did not crash, success
log_test 0 0 "IPv6 torture test"
@@ -1476,6 +1477,7 @@ ipv4_torture()
 
sleep 300
kill -9 $pid1 $pid2 $pid3 $pid4 $pid5
+   wait $pid1 $pid2 $pid3 $pid4 $pid5 2>/dev/null
 
# if we did not crash, success
log_test 0 0 "IPv4 torture test"
-- 
2.26.2

[PATCH net-next v2 14/14] nexthop: Enable resilient next-hop groups

2021-03-11 Thread Petr Machata

Now that all the code is in place, stop rejecting requests to create
resilient next-hop groups.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: David Ahern 
---
 net/ipv4/nexthop.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 015a47e8163a..f09fe3a5608f 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -2443,10 +2443,6 @@ static struct nexthop *nexthop_create_group(struct net *net,
} else if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_RES) {
struct nh_res_table *res_table;
 
-   /* Bounce resilient groups for now. */
-   err = -EINVAL;
-   goto out_no_nh;
-
res_table = nexthop_res_table_alloc(net, cfg->nh_id, cfg);
if (!res_table) {
err = -ENOMEM;
-- 
2.26.2

[PATCH net-next v2 13/14] nexthop: Notify userspace about bucket migrations

2021-03-11 Thread Petr Machata

Nexthop replacements et al. are notified through netlink, but if a delayed
work migrates buckets in the background, userspace will stay oblivious.
Notify these migrations as RTM_NEWNEXTHOPBUCKET events.
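
The role of the extra flag can be sketched with a toy model. Everything here
is hypothetical (migr_table, migr_bucket(), migr_upkeep()); a counter stands
in for the per-bucket netlink message. The point is only that the same
migration path is shared, and the caller decides whether per-bucket events
are emitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a resilient table: one next-hop ID per bucket. */
struct migr_table {
	size_t num_buckets;
	unsigned int *bucket_nh;
	unsigned int nl_events;	/* stand-in for RTM_NEWNEXTHOPBUCKET messages */
};

/* Point one bucket at a new next hop; optionally emit a per-bucket event. */
static void migr_bucket(struct migr_table *t, size_t i, unsigned int new_nh,
			bool notify_nl)
{
	assert(i < t->num_buckets);
	t->bucket_nh[i] = new_nh;
	if (notify_nl)
		t->nl_events++;
}

/* Move every bucket that references old_nh over to new_nh. */
static void migr_upkeep(struct migr_table *t, unsigned int old_nh,
			unsigned int new_nh, bool notify_nl)
{
	size_t i;

	for (i = 0; i < t->num_buckets; i++) {
		if (t->bucket_nh[i] == old_nh)
			migr_bucket(t, i, new_nh, notify_nl);
	}
}
```

In this model, a caller that already sends a full group notification would
pass notify_nl as false, while a background worker would pass true so that
each migrated bucket is reported individually.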

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: David Ahern 
---

Notes:
v1 (changes since RFC):
- u32 -> u16 for bucket counts / indices

 net/ipv4/nexthop.c | 45 +++--
 1 file changed, 39 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 3d602ef6f2c1..015a47e8163a 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -957,6 +957,34 @@ static int nh_fill_res_bucket(struct sk_buff *skb, struct nexthop *nh,
return -EMSGSIZE;
 }
 
+static void nexthop_bucket_notify(struct nh_res_table *res_table,
+ u16 bucket_index)
+{
+   struct nh_res_bucket *bucket = &res_table->nh_buckets[bucket_index];
+   struct nh_grp_entry *nhge = nh_res_dereference(bucket->nh_entry);
+   struct nexthop *nh = nhge->nh_parent;
+   struct sk_buff *skb;
+   int err = -ENOBUFS;
+
+   skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+   if (!skb)
+   goto errout;
+
+   err = nh_fill_res_bucket(skb, nh, bucket, bucket_index,
+RTM_NEWNEXTHOPBUCKET, 0, 0, NLM_F_REPLACE,
+NULL);
+   if (err < 0) {
+   kfree_skb(skb);
+   goto errout;
+   }
+
+   rtnl_notify(skb, nh->net, 0, RTNLGRP_NEXTHOP, NULL, GFP_KERNEL);
+   return;
+errout:
+   if (err < 0)
+   rtnl_set_sk_err(nh->net, RTNLGRP_NEXTHOP, err);
+}
+
 static bool valid_group_nh(struct nexthop *nh, unsigned int npaths,
   bool *is_fdb, struct netlink_ext_ack *extack)
 {
@@ -1470,7 +1498,8 @@ static bool nh_res_bucket_should_migrate(struct nh_res_table *res_table,
 }
 
 static bool nh_res_bucket_migrate(struct nh_res_table *res_table,
- u16 bucket_index, bool notify, bool force)
+ u16 bucket_index, bool notify,
+ bool notify_nl, bool force)
 {
struct nh_res_bucket *bucket = &res_table->nh_buckets[bucket_index];
struct nh_grp_entry *new_nhge;
@@ -1513,6 +1542,9 @@ static bool nh_res_bucket_migrate(struct nh_res_table *res_table,
nh_res_bucket_set_nh(bucket, new_nhge);
nh_res_bucket_set_idle(res_table, bucket);
 
+   if (notify_nl)
+   nexthop_bucket_notify(res_table, bucket_index);
+
if (nh_res_nhge_is_balanced(new_nhge))
list_del(&new_nhge->res.uw_nh_entry);
return true;
@@ -1520,7 +1552,8 @@ static bool nh_res_bucket_migrate(struct nh_res_table *res_table,
 
 #define NH_RES_UPKEEP_DW_MINIMUM_INTERVAL (HZ / 2)
 
-static void nh_res_table_upkeep(struct nh_res_table *res_table, bool notify)
+static void nh_res_table_upkeep(struct nh_res_table *res_table,
+   bool notify, bool notify_nl)
 {
unsigned long now = jiffies;
unsigned long deadline;
@@ -1545,7 +1578,7 @@ static void nh_res_table_upkeep(struct nh_res_table *res_table, bool notify)
if (nh_res_bucket_should_migrate(res_table, bucket,
 &deadline, &force)) {
if (!nh_res_bucket_migrate(res_table, i, notify,
-  force)) {
+  notify_nl, force)) {
unsigned long idle_point;
 
/* A driver can override the migration
@@ -1586,7 +1619,7 @@ static void nh_res_table_upkeep_dw(struct work_struct *work)
struct nh_res_table *res_table;
 
res_table = container_of(dw, struct nh_res_table, upkeep_dw);
-   nh_res_table_upkeep(res_table, true);
+   nh_res_table_upkeep(res_table, true, true);
 }
 
 static void nh_res_table_cancel_upkeep(struct nh_res_table *res_table)
@@ -1674,7 +1707,7 @@ static void replace_nexthop_grp_res(struct nh_group *oldg,
nh_res_group_rebalance(newg, old_res_table);
if (prev_has_uw && !list_empty(&old_res_table->uw_nh_entries))
old_res_table->unbalanced_since = prev_unbalanced_since;
-   nh_res_table_upkeep(old_res_table, true);
+   nh_res_table_upkeep(old_res_table, true, false);
 }
 
 static void nh_mp_group_rebalance(struct nh_group *nhg)
@@ -2288,7 +2321,7 @@ static int insert_nexthop(struct net *net, struct nexthop *new_nh,
/* Do not send bucket notifications, we do full
 * notification below.
 */
-   nh_res_table_upkeep(res_table, false);
+   nh_res_table_upkeep(res_table, false, false);
}
}
 
-- 
2.26.2

[PATCH net-next v2 12/14] nexthop: Add netlink handlers for bucket get

2021-03-11 Thread Petr Machata

Allow getting (but not setting) individual buckets to inspect the next hop
mapped therein, idle time, and flags.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: David Ahern 
---

Notes:
v1 (changes since RFC):
- u32 -> u16 for bucket counts / indices

 net/ipv4/nexthop.c | 110 -
 1 file changed, 109 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index ed2745708f9d..3d602ef6f2c1 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -66,6 +66,15 @@ static const struct nla_policy rtm_nh_res_bucket_policy_dump[] = {
[NHA_RES_BUCKET_NH_ID]  = { .type = NLA_U32 },
 };
 
+static const struct nla_policy rtm_nh_policy_get_bucket[] = {
+   [NHA_ID]= { .type = NLA_U32 },
+   [NHA_RES_BUCKET]= { .type = NLA_NESTED },
+};
+
+static const struct nla_policy rtm_nh_res_bucket_policy_get[] = {
+   [NHA_RES_BUCKET_INDEX]  = { .type = NLA_U16 },
+};
+
 static bool nexthop_notifiers_is_empty(struct net *net)
 {
return !net->nexthop.notifier_chain.head;
@@ -3381,6 +3390,105 @@ static int rtm_dump_nexthop_bucket(struct sk_buff *skb,
return err;
 }
 
+static int nh_valid_get_bucket_req_res_bucket(struct nlattr *res,
+ u16 *bucket_index,
+ struct netlink_ext_ack *extack)
+{
+   struct nlattr *tb[ARRAY_SIZE(rtm_nh_res_bucket_policy_get)];
+   int err;
+
+   err = nla_parse_nested(tb, ARRAY_SIZE(rtm_nh_res_bucket_policy_get) - 1,
+  res, rtm_nh_res_bucket_policy_get, extack);
+   if (err < 0)
+   return err;
+
+   if (!tb[NHA_RES_BUCKET_INDEX]) {
+   NL_SET_ERR_MSG(extack, "Bucket index is missing");
+   return -EINVAL;
+   }
+
+   *bucket_index = nla_get_u16(tb[NHA_RES_BUCKET_INDEX]);
+   return 0;
+}
+
+static int nh_valid_get_bucket_req(const struct nlmsghdr *nlh,
+  u32 *id, u16 *bucket_index,
+  struct netlink_ext_ack *extack)
+{
+   struct nlattr *tb[ARRAY_SIZE(rtm_nh_policy_get_bucket)];
+   int err;
+
+   err = nlmsg_parse(nlh, sizeof(struct nhmsg), tb,
+ ARRAY_SIZE(rtm_nh_policy_get_bucket) - 1,
+ rtm_nh_policy_get_bucket, extack);
+   if (err < 0)
+   return err;
+
+   err = __nh_valid_get_del_req(nlh, tb, id, extack);
+   if (err)
+   return err;
+
+   if (!tb[NHA_RES_BUCKET]) {
+   NL_SET_ERR_MSG(extack, "Bucket information is missing");
+   return -EINVAL;
+   }
+
+   err = nh_valid_get_bucket_req_res_bucket(tb[NHA_RES_BUCKET],
+bucket_index, extack);
+   if (err)
+   return err;
+
+   return 0;
+}
+
+/* rtnl */
+static int rtm_get_nexthop_bucket(struct sk_buff *in_skb, struct nlmsghdr *nlh,
+ struct netlink_ext_ack *extack)
+{
+   struct net *net = sock_net(in_skb->sk);
+   struct nh_res_table *res_table;
+   struct sk_buff *skb = NULL;
+   struct nh_group *nhg;
+   struct nexthop *nh;
+   u16 bucket_index;
+   int err;
+   u32 id;
+
+   err = nh_valid_get_bucket_req(nlh, &id, &bucket_index, extack);
+   if (err)
+   return err;
+
+   nh = nexthop_find_group_resilient(net, id, extack);
+   if (IS_ERR(nh))
+   return PTR_ERR(nh);
+
+   nhg = rtnl_dereference(nh->nh_grp);
+   res_table = rtnl_dereference(nhg->res_table);
+   if (bucket_index >= res_table->num_nh_buckets) {
+   NL_SET_ERR_MSG(extack, "Bucket index out of bounds");
+   return -ENOENT;
+   }
+
+   skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+   if (!skb)
+   return -ENOBUFS;
+
+   err = nh_fill_res_bucket(skb, nh, &res_table->nh_buckets[bucket_index],
+bucket_index, RTM_NEWNEXTHOPBUCKET,
+NETLINK_CB(in_skb).portid, nlh->nlmsg_seq,
+0, extack);
+   if (err < 0) {
+   WARN_ON(err == -EMSGSIZE);
+   goto errout_free;
+   }
+
+   return rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid);
+
+errout_free:
+   kfree_skb(skb);
+   return err;
+}
+
 static void nexthop_sync_mtu(struct net_device *dev, u32 orig_mtu)
 {
unsigned int hash = nh_dev_hashfn(dev->ifindex);
@@ -3604,7 +3712,7 @@ static int __init nexthop_init(void)
rtnl_register(PF_INET6, RTM_NEWNEXTHOP, rtm_new_nexthop, NULL, 0);
rtnl_register(PF_INET6, RTM_GETNEXTHOP, NULL, rtm_dump_nexthop, 0);
 
-   rtnl_register(PF_UNSPEC, RTM_GETNEXTHOPBUCKET, NU

[PATCH net-next v2 10/14] nexthop: Add netlink handlers for resilient nexthop groups

2021-03-11 Thread Petr Machata

Implement the netlink messages that allow creation and dumping of resilient
nexthop groups.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: David Ahern 
---

Notes:
v1 (changes since RFC):
- u32 -> u16 for bucket counts / indices

 net/ipv4/nexthop.c | 150 +++--
 1 file changed, 145 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 495b5e69ffcd..439bf3b7ced5 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -16,6 +16,9 @@
 #include 
 #include 
 
+#define NH_RES_DEFAULT_IDLE_TIMER  (120 * HZ)
+#define NH_RES_DEFAULT_UNBALANCED_TIMER 0 /* No forced rebalancing. */
+
 static void remove_nexthop(struct net *net, struct nexthop *nh,
   struct nl_info *nlinfo);
 
@@ -32,6 +35,7 @@ static const struct nla_policy rtm_nh_policy_new[] = {
[NHA_ENCAP_TYPE]= { .type = NLA_U16 },
[NHA_ENCAP] = { .type = NLA_NESTED },
[NHA_FDB]   = { .type = NLA_FLAG },
+   [NHA_RES_GROUP] = { .type = NLA_NESTED },
 };
 
 static const struct nla_policy rtm_nh_policy_get[] = {
@@ -45,6 +49,12 @@ static const struct nla_policy rtm_nh_policy_dump[] = {
[NHA_FDB]   = { .type = NLA_FLAG },
 };
 
+static const struct nla_policy rtm_nh_res_policy_new[] = {
+   [NHA_RES_GROUP_BUCKETS] = { .type = NLA_U16 },
+   [NHA_RES_GROUP_IDLE_TIMER]  = { .type = NLA_U32 },
+   [NHA_RES_GROUP_UNBALANCED_TIMER]= { .type = NLA_U32 },
+};
+
 static bool nexthop_notifiers_is_empty(struct net *net)
 {
return !net->nexthop.notifier_chain.head;
@@ -588,6 +598,41 @@ static void nh_res_time_set_deadline(unsigned long next_time,
*deadline = next_time;
 }
 
+static clock_t nh_res_table_unbalanced_time(struct nh_res_table *res_table)
+{
+   if (list_empty(&res_table->uw_nh_entries))
+   return 0;
+   return jiffies_delta_to_clock_t(jiffies - res_table->unbalanced_since);
+}
+
+static int nla_put_nh_group_res(struct sk_buff *skb, struct nh_group *nhg)
+{
+   struct nh_res_table *res_table = rtnl_dereference(nhg->res_table);
+   struct nlattr *nest;
+
+   nest = nla_nest_start(skb, NHA_RES_GROUP);
+   if (!nest)
+   return -EMSGSIZE;
+
+   if (nla_put_u16(skb, NHA_RES_GROUP_BUCKETS,
+   res_table->num_nh_buckets) ||
+   nla_put_u32(skb, NHA_RES_GROUP_IDLE_TIMER,
+   jiffies_to_clock_t(res_table->idle_timer)) ||
+   nla_put_u32(skb, NHA_RES_GROUP_UNBALANCED_TIMER,
+   jiffies_to_clock_t(res_table->unbalanced_timer)) ||
+   nla_put_u64_64bit(skb, NHA_RES_GROUP_UNBALANCED_TIME,
+ nh_res_table_unbalanced_time(res_table),
+ NHA_RES_GROUP_PAD))
+   goto nla_put_failure;
+
+   nla_nest_end(skb, nest);
+   return 0;
+
+nla_put_failure:
+   nla_nest_cancel(skb, nest);
+   return -EMSGSIZE;
+}
+
 static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg)
 {
struct nexthop_grp *p;
@@ -598,6 +643,8 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg)
 
if (nhg->mpath)
group_type = NEXTHOP_GRP_TYPE_MPATH;
+   else if (nhg->resilient)
+   group_type = NEXTHOP_GRP_TYPE_RES;
 
if (nla_put_u16(skb, NHA_GROUP_TYPE, group_type))
goto nla_put_failure;
@@ -613,6 +660,9 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg)
p += 1;
}
 
+   if (nhg->resilient && nla_put_nh_group_res(skb, nhg))
+   goto nla_put_failure;
+
return 0;
 
 nla_put_failure:
@@ -700,13 +750,26 @@ static int nh_fill_node(struct sk_buff *skb, struct nexthop *nh,
return -EMSGSIZE;
 }
 
+static size_t nh_nlmsg_size_grp_res(struct nh_group *nhg)
+{
+   return nla_total_size(0) +  /* NHA_RES_GROUP */
+   nla_total_size(2) + /* NHA_RES_GROUP_BUCKETS */
+   nla_total_size(4) + /* NHA_RES_GROUP_IDLE_TIMER */
+   nla_total_size(4) + /* NHA_RES_GROUP_UNBALANCED_TIMER */
+   nla_total_size_64bit(8);/* NHA_RES_GROUP_UNBALANCED_TIME */
+}
+
 static size_t nh_nlmsg_size_grp(struct nexthop *nh)
 {
struct nh_group *nhg = rtnl_dereference(nh->nh_grp);
size_t sz = sizeof(struct nexthop_grp) * nhg->num_nh;
+   size_t tot = nla_total_size(sz) +
+   nla_total_size(2); /* NHA_GROUP_TYPE */
+
+   if (nhg->resilient)
+   tot += nh_nlmsg_size_grp_res(nhg);
 
-   return nla_total_size(sz) +
-  nla_total_size(2);  /* NHA_GROUP_TYPE */
+   return tot;
 }
 
 static size_t nh_nlmsg_size_single(struct nexthop *nh)
@@ -876,7 +939,7 @@ static int nh_c

[PATCH net-next v2 09/14] nexthop: Allow reporting activity of nexthop buckets

2021-03-11 Thread Petr Machata

From: Ido Schimmel 

The kernel periodically checks the idle time of nexthop buckets to
determine if they are idle and can be re-populated with a new nexthop.
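
A minimal sketch of such an idle check, assuming a free-running tick counter
in place of jiffies. The names bucket_sketch and bucket_is_idle() are made up
for this illustration, and the real migration decision takes more state into
account than this model does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one bucket: only the time of last activity. */
struct bucket_sketch {
	unsigned long used_time;	/* tick count when last seen busy */
};

/* A bucket counts as idle once idle_timer ticks have elapsed since it was
 * last used. Unsigned subtraction keeps the result correct when the tick
 * counter wraps around.
 */
static bool bucket_is_idle(const struct bucket_sketch *b, unsigned long now,
			   unsigned long idle_timer)
{
	assert(b != NULL);
	return now - b->used_time >= idle_timer;
}
```

A bucket last used at tick 1000 with an idle timer of 120 ticks is still busy
at tick 1050 and becomes eligible for re-population at tick 1120.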

When the resilient nexthop group is offloaded to hardware, the kernel
will not see activity on nexthop buckets unless it is reported from
hardware.

Add a function that can be periodically called by device drivers to
report activity on nexthop buckets after querying it from the underlying
device.
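
What such a report amounts to can be sketched in user space as below. This is
a simplified model under stated assumptions: activity_table, activity_bucket
and activity_update() are hypothetical names, there is no locking, and a tick
value passed by the caller replaces jiffies. As in the function added by this
patch, a report whose size does not match the table is ignored as a whole.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define ACT_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct activity_bucket {
	unsigned long used_time;	/* tick count when last seen busy */
};

struct activity_table {
	size_t num_buckets;
	struct activity_bucket *buckets;
};

/* Mark every bucket whose bit is set in the activity bitmap as used "now".
 * Returns false, changing nothing, if the reported size does not match.
 */
static bool activity_update(struct activity_table *t, size_t num_buckets,
			    const unsigned long *activity, unsigned long now)
{
	size_t i;

	assert(t && activity);
	if (num_buckets != t->num_buckets)
		return false;

	for (i = 0; i < num_buckets; i++) {
		unsigned long word = activity[i / ACT_BITS_PER_LONG];

		if ((word >> (i % ACT_BITS_PER_LONG)) & 1)
			t->buckets[i].used_time = now;
	}
	return true;
}
```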

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
Reviewed-by: David Ahern 
Signed-off-by: Petr Machata 
---

Notes:
v1 (changes since RFC):
- u32 -> u16 for bucket counts / indices

 include/net/nexthop.h |  2 ++
 net/ipv4/nexthop.c| 35 +++
 2 files changed, 37 insertions(+)

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index 685f208d26b5..ba94868a21d5 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -222,6 +222,8 @@ int unregister_nexthop_notifier(struct net *net, struct notifier_block *nb);
 void nexthop_set_hw_flags(struct net *net, u32 id, bool offload, bool trap);
 void nexthop_bucket_set_hw_flags(struct net *net, u32 id, u16 bucket_index,
 bool offload, bool trap);
+void nexthop_res_grp_activity_update(struct net *net, u32 id, u16 num_buckets,
+unsigned long *activity);
 
 /* caller is holding rcu or rtnl; no reference taken to nexthop */
 struct nexthop *nexthop_find_by_id(struct net *net, u32 id);
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 1fce4ff39390..495b5e69ffcd 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -3106,6 +3106,41 @@ void nexthop_bucket_set_hw_flags(struct net *net, u32 id, u16 bucket_index,
 }
 EXPORT_SYMBOL(nexthop_bucket_set_hw_flags);
 
+void nexthop_res_grp_activity_update(struct net *net, u32 id, u16 num_buckets,
+unsigned long *activity)
+{
+   struct nh_res_table *res_table;
+   struct nexthop *nexthop;
+   struct nh_group *nhg;
+   u16 i;
+
+   rcu_read_lock();
+
+   nexthop = nexthop_find_by_id(net, id);
+   if (!nexthop || !nexthop->is_group)
+   goto out;
+
+   nhg = rcu_dereference(nexthop->nh_grp);
+   if (!nhg->resilient)
+   goto out;
+
+   /* Instead of silently ignoring some buckets, demand that the sizes
+* be the same.
+*/
+   res_table = rcu_dereference(nhg->res_table);
+   if (num_buckets != res_table->num_nh_buckets)
+   goto out;
+
+   for (i = 0; i < num_buckets; i++) {
+   if (test_bit(i, activity))
+   nh_res_bucket_set_busy(&res_table->nh_buckets[i]);
+   }
+
+out:
+   rcu_read_unlock();
+}
+EXPORT_SYMBOL(nexthop_res_grp_activity_update);
+
 static void __net_exit nexthop_net_exit(struct net *net)
 {
rtnl_lock();
-- 
2.26.2

[PATCH net-next v2 11/14] nexthop: Add netlink handlers for bucket dump

2021-03-11 Thread Petr Machata

Add a dump handler for resilient next hop buckets. When next-hop group ID
is given, it walks buckets of that group, otherwise it walks buckets of all
groups. It then dumps the buckets whose next hops match the given filtering
criteria.
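
The filtering described above can be modelled roughly as follows. This is a
user-space sketch, not the handler itself: dump_bucket and
dump_buckets_sketch() are hypothetical, all buckets live in one flat array,
and an ID of 0 is assumed to mean that the corresponding filter is not set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat record of one bucket of one resilient group. */
struct dump_bucket {
	uint32_t group_id;
	uint16_t index;
	uint32_t nh_id;
};

/* Copy the buckets matching the filters into out, up to out_cap entries.
 * A group_id or nh_id of 0 matches anything (assumption of this sketch).
 * Returns the number of entries written.
 */
static size_t dump_buckets_sketch(const struct dump_bucket *all, size_t n,
				  uint32_t group_id, uint32_t nh_id,
				  struct dump_bucket *out, size_t out_cap)
{
	size_t i, count = 0;

	assert(out != NULL || out_cap == 0);
	for (i = 0; i < n && count < out_cap; i++) {
		if (group_id && all[i].group_id != group_id)
			continue;
		if (nh_id && all[i].nh_id != nh_id)
			continue;
		out[count++] = all[i];
	}
	return count;
}
```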

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: David Ahern 
---

Notes:
v1 (changes since RFC):
- u32 -> u16 for bucket counts / indices

 net/ipv4/nexthop.c | 283 +
 1 file changed, 283 insertions(+)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 439bf3b7ced5..ed2745708f9d 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -55,6 +55,17 @@ static const struct nla_policy rtm_nh_res_policy_new[] = {
[NHA_RES_GROUP_UNBALANCED_TIMER]= { .type = NLA_U32 },
 };
 
+static const struct nla_policy rtm_nh_policy_dump_bucket[] = {
+   [NHA_ID]= { .type = NLA_U32 },
+   [NHA_OIF]   = { .type = NLA_U32 },
+   [NHA_MASTER]= { .type = NLA_U32 },
+   [NHA_RES_BUCKET]= { .type = NLA_NESTED },
+};
+
+static const struct nla_policy rtm_nh_res_bucket_policy_dump[] = {
+   [NHA_RES_BUCKET_NH_ID]  = { .type = NLA_U32 },
+};
+
 static bool nexthop_notifiers_is_empty(struct net *net)
 {
return !net->nexthop.notifier_chain.head;
@@ -883,6 +894,60 @@ static void nh_res_bucket_set_busy(struct nh_res_bucket *bucket)
atomic_long_set(&bucket->used_time, (long)jiffies);
 }
 
+static clock_t nh_res_bucket_idle_time(const struct nh_res_bucket *bucket)
+{
+   unsigned long used_time = nh_res_bucket_used_time(bucket);
+
+   return jiffies_delta_to_clock_t(jiffies - used_time);
+}
+
+static int nh_fill_res_bucket(struct sk_buff *skb, struct nexthop *nh,
+ struct nh_res_bucket *bucket, u16 bucket_index,
+ int event, u32 portid, u32 seq,
+ unsigned int nlflags,
+ struct netlink_ext_ack *extack)
+{
+   struct nh_grp_entry *nhge = nh_res_dereference(bucket->nh_entry);
+   struct nlmsghdr *nlh;
+   struct nlattr *nest;
+   struct nhmsg *nhm;
+
+   nlh = nlmsg_put(skb, portid, seq, event, sizeof(*nhm), nlflags);
+   if (!nlh)
+   return -EMSGSIZE;
+
+   nhm = nlmsg_data(nlh);
+   nhm->nh_family = AF_UNSPEC;
+   nhm->nh_flags = bucket->nh_flags;
+   nhm->nh_protocol = nh->protocol;
+   nhm->nh_scope = 0;
+   nhm->resvd = 0;
+
+   if (nla_put_u32(skb, NHA_ID, nh->id))
+   goto nla_put_failure;
+
+   nest = nla_nest_start(skb, NHA_RES_BUCKET);
+   if (!nest)
+   goto nla_put_failure;
+
+   if (nla_put_u16(skb, NHA_RES_BUCKET_INDEX, bucket_index) ||
+   nla_put_u32(skb, NHA_RES_BUCKET_NH_ID, nhge->nh->id) ||
+   nla_put_u64_64bit(skb, NHA_RES_BUCKET_IDLE_TIME,
+ nh_res_bucket_idle_time(bucket),
+ NHA_RES_BUCKET_PAD))
+   goto nla_put_failure_nest;
+
+   nla_nest_end(skb, nest);
+   nlmsg_end(skb, nlh);
+   return 0;
+
+nla_put_failure_nest:
+   nla_nest_cancel(skb, nest);
+nla_put_failure:
+   nlmsg_cancel(skb, nlh);
+   return -EMSGSIZE;
+}
+
 static bool valid_group_nh(struct nexthop *nh, unsigned int npaths,
   bool *is_fdb, struct netlink_ext_ack *extack)
 {
@@ -2918,10 +2983,12 @@ static int rtm_get_nexthop(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 }
 
 struct nh_dump_filter {
+   u32 nh_id;
int dev_idx;
int master_idx;
bool group_filter;
bool fdb_filter;
+   u32 res_bucket_nh_id;
 };
 
 static bool nh_dump_filtered(struct nexthop *nh,
@@ -3101,6 +3168,219 @@ static int rtm_dump_nexthop(struct sk_buff *skb, struct netlink_callback *cb)
return err;
 }
 
+static struct nexthop *
+nexthop_find_group_resilient(struct net *net, u32 id,
+struct netlink_ext_ack *extack)
+{
+   struct nh_group *nhg;
+   struct nexthop *nh;
+
+   nh = nexthop_find_by_id(net, id);
+   if (!nh)
+   return ERR_PTR(-ENOENT);
+
+   if (!nh->is_group) {
+   NL_SET_ERR_MSG(extack, "Not a nexthop group");
+   return ERR_PTR(-EINVAL);
+   }
+
+   nhg = rtnl_dereference(nh->nh_grp);
+   if (!nhg->resilient) {
+   NL_SET_ERR_MSG(extack, "Nexthop group not of type resilient");
+   return ERR_PTR(-EINVAL);
+   }
+
+   return nh;
+}
+
+static int nh_valid_dump_nhid(struct nlattr *attr, u32 *nh_id_p,
+ struct netlink_ext_ack *extack)
+{
+   u32 idx;
+
+   if (attr) {
+   idx = nla_get_u32(attr);
+   if (!idx) {
+   NL_SET_ERR_MSG(extack, "Invalid nexthop id");
+

[PATCH net-next v2 06/14] nexthop: Add data structures for resilient group notifications

2021-03-11 Thread Petr Machata

From: Ido Schimmel 

Add data structures that will be used for in-kernel notifications about
addition / deletion of a resilient nexthop group and about changes to a
hash bucket within a resilient group.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
Reviewed-by: David Ahern 
Signed-off-by: Petr Machata 
---

Notes:
v1 (changes since RFC):
- u32 -> u16 for bucket counts / indices

 include/net/nexthop.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index b78505c9031e..fd3c0debe8bf 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -155,11 +155,15 @@ struct nexthop {
 enum nexthop_event_type {
NEXTHOP_EVENT_DEL,
NEXTHOP_EVENT_REPLACE,
+   NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE,
+   NEXTHOP_EVENT_BUCKET_REPLACE,
 };
 
 enum nh_notifier_info_type {
NH_NOTIFIER_INFO_TYPE_SINGLE,
NH_NOTIFIER_INFO_TYPE_GRP,
+   NH_NOTIFIER_INFO_TYPE_RES_TABLE,
+   NH_NOTIFIER_INFO_TYPE_RES_BUCKET,
 };
 
 struct nh_notifier_single_info {
@@ -186,6 +190,19 @@ struct nh_notifier_grp_info {
struct nh_notifier_grp_entry_info nh_entries[];
 };
 
+struct nh_notifier_res_bucket_info {
+   u16 bucket_index;
+   unsigned int idle_timer_ms;
+   bool force;
+   struct nh_notifier_single_info old_nh;
+   struct nh_notifier_single_info new_nh;
+};
+
+struct nh_notifier_res_table_info {
+   u16 num_nh_buckets;
+   struct nh_notifier_single_info nhs[];
+};
+
 struct nh_notifier_info {
struct net *net;
struct netlink_ext_ack *extack;
@@ -194,6 +211,8 @@ struct nh_notifier_info {
union {
struct nh_notifier_single_info *nh;
struct nh_notifier_grp_info *nh_grp;
+   struct nh_notifier_res_table_info *nh_res_table;
+   struct nh_notifier_res_bucket_info *nh_res_bucket;
};
 };
 
-- 
2.26.2
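
For illustration only, here is a user-space sketch of how a listener might dispatch on the two new info types. The sketch_* names are stand-ins invented for this example, not kernel identifiers, and only the fields needed for the dispatch are modelled.

```c
#include <stdint.h>

/* Stand-ins for the notifier info types; names are hypothetical. */
enum sketch_info_type {
	SKETCH_INFO_TYPE_SINGLE,
	SKETCH_INFO_TYPE_GRP,
	SKETCH_INFO_TYPE_RES_TABLE,
	SKETCH_INFO_TYPE_RES_BUCKET,
};

/* Placeholder for the per-nexthop payload. */
struct sketch_single_info {
	uint32_t id;
};

struct sketch_res_bucket_info {
	uint16_t bucket_index;
	unsigned int idle_timer_ms;
	int force;
	struct sketch_single_info old_nh;
	struct sketch_single_info new_nh;
};

struct sketch_res_table_info {
	uint16_t num_nh_buckets;
	struct sketch_single_info nhs[];
};

struct sketch_info {
	enum sketch_info_type type;
	union {
		struct sketch_res_table_info *nh_res_table;
		struct sketch_res_bucket_info *nh_res_bucket;
	};
};

/* Return how many bucket entries the event describes, or -1 if the
 * event is not one of the two resilient-group types.
 */
int sketch_buckets_touched(const struct sketch_info *info)
{
	switch (info->type) {
	case SKETCH_INFO_TYPE_RES_TABLE:
		/* Whole-table event: one entry per bucket. */
		return info->nh_res_table->num_nh_buckets;
	case SKETCH_INFO_TYPE_RES_BUCKET:
		/* Single-bucket event. */
		return 1;
	default:
		return -1;
	}
}
```

The type field selects which member of the union is valid, so a listener has to check it before touching either pointer.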

[PATCH net-next v2 04/14] nexthop: Add netlink defines and enumerators for resilient NH groups

2021-03-11 Thread Petr Machata

From: Ido Schimmel 

- RTM_NEWNEXTHOP et al. that handle resilient groups will have a new nested
  attribute, NHA_RES_GROUP, whose elements are attributes NHA_RES_GROUP_*.

- RTM_NEWNEXTHOPBUCKET et al. is a suite of new messages that will
  currently serve only for dumping of individual buckets of resilient next
  hop groups. For nexthop group buckets, these messages will carry a nested
  attribute NHA_RES_BUCKET, whose elements are attributes NHA_RES_BUCKET_*.

  There are several reasons why a new suite of messages is created for
  nexthop buckets instead of overloading the information on the existing
  RTM_{NEW,DEL,GET}NEXTHOP messages.

  First, a nexthop group can contain a large number of nexthop buckets (4k
  is not unheard of). This imposes limits on the amount of information that
  can be encoded for each nexthop bucket given a netlink message is limited
  to 64k bytes.

  Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at
  this point, in the future it can be extended to provide user space with
  control over nexthop buckets configuration.

- The new group type is NEXTHOP_GRP_TYPE_RES. Note that nexthop code is
  adjusted to bounce groups with that type for now.
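
To put a rough number on the size argument above, the following sketch estimates how many bytes one nested bucket record would take. It assumes the usual 4-byte netlink attribute header and 4-byte alignment, and it ignores message headers and any pad attribute for the u64. The SKETCH_* and sketch_* names are made up for this example.

```c
/* Netlink attributes carry a 4-byte header and are padded to 4 bytes. */
#define SKETCH_NLA_HDRLEN 4u
#define SKETCH_NLA_ALIGN(len) (((len) + 3u) & ~3u)

/* Bytes taken by one attribute with the given payload size. */
unsigned int sketch_attr_size(unsigned int payload)
{
	return SKETCH_NLA_ALIGN(SKETCH_NLA_HDRLEN + payload);
}

/* Bytes taken by one nested bucket record: a nest header wrapping the
 * bucket index (u16), the idle time (u64) and the nexthop ID (u32).
 */
unsigned int sketch_bucket_record_size(void)
{
	return SKETCH_NLA_HDRLEN +
	       sketch_attr_size(2) + sketch_attr_size(8) + sketch_attr_size(4);
}

/* Whether num_buckets such records fit in a message of msg_limit bytes. */
int sketch_fits(unsigned int num_buckets, unsigned int msg_limit)
{
	return (unsigned long)num_buckets * sketch_bucket_record_size() <=
	       msg_limit;
}
```

Under these assumptions a record takes 32 bytes, so 4k buckets would need 128k bytes, twice the 64k limit, even before any further per-bucket information is added.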

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
Reviewed-by: David Ahern 
Signed-off-by: Petr Machata 
---

Notes:
v2:
- Comment at NEXTHOP_GRP_TYPE_MPATH that it's for the hash-threshold
  groups.

v1 (changes since RFC):
- u32 -> u16 for bucket counts / indices

 include/uapi/linux/nexthop.h   | 47 +-
 include/uapi/linux/rtnetlink.h |  7 +
 net/ipv4/nexthop.c |  2 ++
 security/selinux/nlmsgtab.c|  5 +++-
 4 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
index 2d4a1e784cf0..d8ffa8c9ca78 100644
--- a/include/uapi/linux/nexthop.h
+++ b/include/uapi/linux/nexthop.h
@@ -21,7 +21,10 @@ struct nexthop_grp {
 };
 
 enum {
-   NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
+   NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group
+ * default type if not specified
+ */
+   NEXTHOP_GRP_TYPE_RES, /* resilient nexthop group */
__NEXTHOP_GRP_TYPE_MAX,
 };
 
@@ -52,8 +55,50 @@ enum {
NHA_FDB, /* flag; nexthop belongs to a bridge fdb */
/* if NHA_FDB is added, OIF, BLACKHOLE, ENCAP cannot be set */
 
+   /* nested; resilient nexthop group attributes */
+   NHA_RES_GROUP,
+   /* nested; nexthop bucket attributes */
+   NHA_RES_BUCKET,
+
__NHA_MAX,
 };
 
 #define NHA_MAX (__NHA_MAX - 1)
+
+enum {
+   NHA_RES_GROUP_UNSPEC,
+   /* Pad attribute for 64-bit alignment. */
+   NHA_RES_GROUP_PAD = NHA_RES_GROUP_UNSPEC,
+
+   /* u16; number of nexthop buckets in a resilient nexthop group */
+   NHA_RES_GROUP_BUCKETS,
+   /* clock_t as u32; nexthop bucket idle timer (per-group) */
+   NHA_RES_GROUP_IDLE_TIMER,
+   /* clock_t as u32; nexthop unbalanced timer */
+   NHA_RES_GROUP_UNBALANCED_TIMER,
+   /* clock_t as u64; nexthop unbalanced time */
+   NHA_RES_GROUP_UNBALANCED_TIME,
+
+   __NHA_RES_GROUP_MAX,
+};
+
+#define NHA_RES_GROUP_MAX  (__NHA_RES_GROUP_MAX - 1)
+
+enum {
+   NHA_RES_BUCKET_UNSPEC,
+   /* Pad attribute for 64-bit alignment. */
+   NHA_RES_BUCKET_PAD = NHA_RES_BUCKET_UNSPEC,
+
+   /* u16; nexthop bucket index */
+   NHA_RES_BUCKET_INDEX,
+   /* clock_t as u64; nexthop bucket idle time */
+   NHA_RES_BUCKET_IDLE_TIME,
+   /* u32; nexthop id assigned to the nexthop bucket */
+   NHA_RES_BUCKET_NH_ID,
+
+   __NHA_RES_BUCKET_MAX,
+};
+
+#define NHA_RES_BUCKET_MAX (__NHA_RES_BUCKET_MAX - 1)
+
 #endif
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 91e4ca064d61..d35953bc7d53 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -178,6 +178,13 @@ enum {
RTM_GETVLAN,
 #define RTM_GETVLAN RTM_GETVLAN
 
+   RTM_NEWNEXTHOPBUCKET = 116,
+#define RTM_NEWNEXTHOPBUCKET   RTM_NEWNEXTHOPBUCKET
+   RTM_DELNEXTHOPBUCKET,
+#define RTM_DELNEXTHOPBUCKET   RTM_DELNEXTHOPBUCKET
+   RTM_GETNEXTHOPBUCKET,
+#define RTM_GETNEXTHOPBUCKET   RTM_GETNEXTHOPBUCKET
+
__RTM_MAX,
 #define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 56c54d0fbacc..7a94591da856 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -1492,6 +1492,8 @@ static struct nexthop *nexthop_create_group(struct net *net,
if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_MPATH) {
nhg->mpath = 1;
nhg->is_multipath = true;
+   } else if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_RES) {
+   goto out_no_nh;
}
 
WARN_ON_ONCE(nhg->mpath !

[PATCH net-next v2 07/14] nexthop: Implement notifiers for resilient nexthop groups

2021-03-11 Thread Petr Machata

Implement the following notifications towards drivers:

- NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created.

- NEXTHOP_EVENT_BUCKET_REPLACE any time there is a change in assignment of
  next hops to hash table buckets. That includes replacements, deletions,
  and delayed upkeep cycles. Some bucket notifications can be vetoed by the
  driver, to make it possible to propagate bucket busy-ness flags from the
  HW back to the algorithm. Some are however forced, e.g. if a next hop is
  deleted, all buckets that use this next hop simply must be migrated,
  whether the HW wishes so or not.

- NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is
  replaced. Usually the driver will get the bucket notifications as well,
  and could veto those. But in some cases, a bucket may not be migrated
  immediately, but during delayed upkeep, and that is too late to roll the
  transaction back. This notification allows the driver to take a look and
  veto the new proposed group up front, before anything is committed.
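
As a rough user-space model of the veto rules above, a driver's handling of a bucket replacement might look like the sketch below. The idle-timer rule follows the comment at nh_notifier_res_bucket_idle_timer_get() in the diff. All sketch_* names and the poll_interval_ms parameter are hypothetical, and real drivers may well decide differently.

```c
#include <stdbool.h>

/* Hypothetical view of what a driver knows about one bucket request. */
struct sketch_bucket_req {
	bool force;                 /* replacement must happen regardless */
	unsigned int idle_timer_ms; /* idle timer passed along by the core */
	bool hw_active;             /* hardware saw traffic on the bucket */
};

enum sketch_verdict {
	SKETCH_REPLACE, /* program the new nexthop into the bucket */
	SKETCH_VETO,    /* refuse; the bucket keeps its old nexthop */
};

enum sketch_verdict sketch_bucket_replace(const struct sketch_bucket_req *req,
					  unsigned int poll_interval_ms)
{
	/* Forced replacements cannot be vetoed. */
	if (req->force)
		return SKETCH_REPLACE;

	/* If activity is polled less often than the idle timer fires, the
	 * activity information is too stale to base a veto on, so an
	 * atomic replacement is not attempted.
	 */
	if (req->idle_timer_ms < poll_interval_ms)
		return SKETCH_REPLACE;

	/* Atomic replacement: only replace a bucket that is inactive. */
	return req->hw_active ? SKETCH_VETO : SKETCH_REPLACE;
}
```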

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: David Ahern 
---

Notes:
v1 (changes since RFC):
- u32 -> u16 for bucket counts / indices

 net/ipv4/nexthop.c | 320 +++--
 1 file changed, 308 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 0e2ff72e10c0..8b06aafc2e9e 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -115,6 +115,37 @@ static int nh_notifier_mp_info_init(struct nh_notifier_info *info,
return 0;
 }
 
+static int nh_notifier_res_table_info_init(struct nh_notifier_info *info,
+  struct nh_group *nhg)
+{
+   struct nh_res_table *res_table = rtnl_dereference(nhg->res_table);
+   u16 num_nh_buckets = res_table->num_nh_buckets;
+   unsigned long size;
+   u16 i;
+
+   info->type = NH_NOTIFIER_INFO_TYPE_RES_TABLE;
+   size = struct_size(info->nh_res_table, nhs, num_nh_buckets);
+   info->nh_res_table = __vmalloc(size, GFP_KERNEL | __GFP_ZERO |
+  __GFP_NOWARN);
+   if (!info->nh_res_table)
+   return -ENOMEM;
+
+   info->nh_res_table->num_nh_buckets = num_nh_buckets;
+
+   for (i = 0; i < num_nh_buckets; i++) {
+   struct nh_res_bucket *bucket = &res_table->nh_buckets[i];
+   struct nh_grp_entry *nhge;
+   struct nh_info *nhi;
+
+   nhge = rtnl_dereference(bucket->nh_entry);
+   nhi = rtnl_dereference(nhge->nh->nh_info);
+   __nh_notifier_single_info_init(&info->nh_res_table->nhs[i],
+  nhi);
+   }
+
+   return 0;
+}
+
 static int nh_notifier_grp_info_init(struct nh_notifier_info *info,
 const struct nexthop *nh)
 {
@@ -122,6 +153,8 @@ static int nh_notifier_grp_info_init(struct nh_notifier_info *info,
 
if (nhg->mpath)
return nh_notifier_mp_info_init(info, nhg);
+   else if (nhg->resilient)
+   return nh_notifier_res_table_info_init(info, nhg);
return -EINVAL;
 }
 
@@ -132,6 +165,8 @@ static void nh_notifier_grp_info_fini(struct nh_notifier_info *info,
 
if (nhg->mpath)
kfree(info->nh_grp);
+   else if (nhg->resilient)
+   vfree(info->nh_res_table);
 }
 
 static int nh_notifier_info_init(struct nh_notifier_info *info,
@@ -183,6 +218,107 @@ static int call_nexthop_notifiers(struct net *net,
return notifier_to_errno(err);
 }
 
+static int
+nh_notifier_res_bucket_idle_timer_get(const struct nh_notifier_info *info,
+ bool force, unsigned int *p_idle_timer_ms)
+{
+   struct nh_res_table *res_table;
+   struct nh_group *nhg;
+   struct nexthop *nh;
+   int err = 0;
+
+   /* When 'force' is false, nexthop bucket replacement is performed
+* because the bucket was deemed to be idle. In this case, capable
+* listeners can choose to perform an atomic replacement: The bucket is
+* only replaced if it is inactive. However, if the idle timer interval
+* is smaller than the interval in which a listener is querying
+* buckets' activity from the device, then atomic replacement should
+* not be tried. Pass the idle timer value to listeners, so that they
+* could determine which type of replacement to perform.
+*/
+   if (force) {
+   *p_idle_timer_ms = 0;
+   return 0;
+   }
+
+   rcu_read_lock();
+
+   nh = nexthop_find_by_id(info->net, info->id);
+   if (!nh) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   nhg = rcu_dereference(nh->nh_grp);
+   res_table = rcu_deref

[PATCH net-next v2 05/14] nexthop: Add implementation of resilient next-hop groups

2021-03-11 Thread Petr Machata

roup type, and that is currently bounced.
There is therefore no way to actually access this code.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: David Ahern 
---

Notes:
v1 (changes since RFC):
- u32 -> u16 for bucket counts / indices
- set the new flag is_multipath for resilient groups

 include/net/nexthop.h |  42 
 net/ipv4/nexthop.c| 517 --
 2 files changed, 546 insertions(+), 13 deletions(-)

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index 5062c2c08e2b..b78505c9031e 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -40,6 +40,12 @@ struct nh_config {
 
struct nlattr   *nh_grp;
u16 nh_grp_type;
+   u16 nh_grp_res_num_buckets;
+   unsigned long   nh_grp_res_idle_timer;
+   unsigned long   nh_grp_res_unbalanced_timer;
+   bool nh_grp_res_has_num_buckets;
+   bool nh_grp_res_has_idle_timer;
+   bool nh_grp_res_has_unbalanced_timer;
 
struct nlattr   *nh_encap;
u16 nh_encap_type;
@@ -63,6 +69,32 @@ struct nh_info {
};
 };
 
+struct nh_res_bucket {
+   struct nh_grp_entry __rcu *nh_entry;
+   atomic_long_t   used_time;
+   unsigned long   migrated_time;
+   bool occupied;
+   u8  nh_flags;
+};
+
+struct nh_res_table {
+   struct net  *net;
+   u32 nhg_id;
+   struct delayed_work upkeep_dw;
+
+   /* List of NHGEs that have too few buckets ("uw" for underweight).
+* Reclaimed buckets will be given to entries in this list.
+*/
+   struct list_head uw_nh_entries;
+   unsigned long   unbalanced_since;
+
+   u32 idle_timer;
+   u32 unbalanced_timer;
+
+   u16 num_nh_buckets;
+   struct nh_res_bucket nh_buckets[];
+};
+
 struct nh_grp_entry {
struct nexthop  *nh;
u8  weight;
@@ -71,6 +103,13 @@ struct nh_grp_entry {
struct {
atomic_t upper_bound;
} mpath;
+   struct {
+   /* Member on uw_nh_entries. */
+   struct list_head uw_nh_entry;
+
+   u16 count_buckets;
+   u16 wants_buckets;
+   } res;
};
 
struct list_head nh_list;
@@ -82,8 +121,11 @@ struct nh_group {
u16 num_nh;
bool is_multipath;
bool mpath;
+   bool resilient;
bool fdb_nh;
bool has_v4;
+
+   struct nh_res_table __rcu *res_table;
struct nh_grp_entry nh_entries[];
 };
 
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 7a94591da856..0e2ff72e10c0 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -183,6 +183,30 @@ static int call_nexthop_notifiers(struct net *net,
return notifier_to_errno(err);
 }
 
+/* There are three users of RES_TABLE, and NHs etc. referenced from there:
+ *
+ * 1) a collection of callbacks for NH maintenance. This operates under
+ *RTNL,
+ * 2) the delayed work that gradually balances the resilient table,
+ * 3) and nexthop_select_path(), operating under RCU.
+ *
+ * Both the delayed work and the RTNL block are writers, and need to
+ * maintain mutual exclusion. Since there are only two and well-known
+ * writers for each table, the RTNL code can make sure it has exclusive
+ * access thus:
+ *
+ * - Have the DW operate without locking;
+ * - synchronously cancel the DW;
+ * - do the writing;
+ * - if the write was not actually a delete, call upkeep, which schedules
+ *   DW again if necessary.
+ *
+ * The functions that are always called from the RTNL context use
+ * rtnl_dereference(). The functions that can also be called from the DW do
+ * a raw dereference and rely on the above mutual exclusion scheme.
+ */
+#define nh_res_dereference(p) (rcu_dereference_raw(p))
+
 static int call_nexthop_notifier(struct notifier_block *nb, struct net *net,
 enum nexthop_event_type event_type,
 struct nexthop *nh,
@@ -241,6 +265,9 @@ static void nexthop_free_group(struct nexthop *nh)
 
WARN_ON(nhg->spare == nhg);
 
+   if (nhg->resilient)
+   vfree(rcu_dereference_raw(nhg->res_table));
+
kfree(nhg->spare);
kfree(nhg);
 }
@@ -299,6 +326,30 @@ static struct nh_group *nexthop_grp_alloc(u16 num_nh)
return nhg;
 }
 
+static void nh_res_table_upkeep_dw(struct work_struct *work);
+
+static struct nh_res_table *
+nexthop_res_table_alloc(struct net *net, u32 nhg_id, struct nh_con

[PATCH net-next v2 08/14] nexthop: Allow setting "offload" and "trap" indication of nexthop buckets

2021-03-11 Thread Petr Machata

From: Ido Schimmel 

Add a function that can be called by device drivers to set "offload" or
"trap" indication on nexthop buckets following nexthop notifications and
other changes such as a neighbour becoming invalid.

Signed-off-by: Ido Schimmel 
Reviewed-by: Petr Machata 
Reviewed-by: David Ahern 
Signed-off-by: Petr Machata 
---

Notes:
v1 (changes since RFC):
- u32 -> u16 for bucket counts / indices

 include/net/nexthop.h |  2 ++
 net/ipv4/nexthop.c| 34 ++
 2 files changed, 36 insertions(+)

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index fd3c0debe8bf..685f208d26b5 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -220,6 +220,8 @@ int register_nexthop_notifier(struct net *net, struct notifier_block *nb,
  struct netlink_ext_ack *extack);
 int unregister_nexthop_notifier(struct net *net, struct notifier_block *nb);
 void nexthop_set_hw_flags(struct net *net, u32 id, bool offload, bool trap);
+void nexthop_bucket_set_hw_flags(struct net *net, u32 id, u16 bucket_index,
+bool offload, bool trap);
 
 /* caller is holding rcu or rtnl; no reference taken to nexthop */
 struct nexthop *nexthop_find_by_id(struct net *net, u32 id);
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 8b06aafc2e9e..1fce4ff39390 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -3072,6 +3072,40 @@ void nexthop_set_hw_flags(struct net *net, u32 id, bool offload, bool trap)
 }
 EXPORT_SYMBOL(nexthop_set_hw_flags);
 
+void nexthop_bucket_set_hw_flags(struct net *net, u32 id, u16 bucket_index,
+bool offload, bool trap)
+{
+   struct nh_res_table *res_table;
+   struct nh_res_bucket *bucket;
+   struct nexthop *nexthop;
+   struct nh_group *nhg;
+
+   rcu_read_lock();
+
+   nexthop = nexthop_find_by_id(net, id);
+   if (!nexthop || !nexthop->is_group)
+   goto out;
+
+   nhg = rcu_dereference(nexthop->nh_grp);
+   if (!nhg->resilient)
+   goto out;
+
+   res_table = rcu_dereference(nhg->res_table);
+   if (bucket_index >= res_table->num_nh_buckets)
+   goto out;
+
+   bucket = &res_table->nh_buckets[bucket_index];
+   bucket->nh_flags &= ~(RTNH_F_OFFLOAD | RTNH_F_TRAP);
+   if (offload)
+   bucket->nh_flags |= RTNH_F_OFFLOAD;
+   if (trap)
+   bucket->nh_flags |= RTNH_F_TRAP;
+
+out:
+   rcu_read_unlock();
+}
+EXPORT_SYMBOL(nexthop_bucket_set_hw_flags);
+
 static void __net_exit nexthop_net_exit(struct net *net)
 {
rtnl_lock();
-- 
2.26.2

[PATCH net-next v2 03/14] nexthop: Add a dedicated flag for multipath next-hop groups

2021-03-11 Thread Petr Machata

With the introduction of resilient nexthop groups, there will be two types
of multipath groups: the current hash-threshold "mpath" ones, and resilient
groups. Both are multipath, but to determine the fact, the system needs to
consider two flags. This might prove costly in the datapath. Therefore,
introduce a new flag, that should be set for next-hop groups that have more
than one nexthop, and should be considered multipath.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: David Ahern 
---

Notes:
v1 (changes since RFC):
- This patch is new

 include/net/nexthop.h | 7 ---
 net/ipv4/nexthop.c| 5 -
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index 7bc057aee40b..5062c2c08e2b 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -80,6 +80,7 @@ struct nh_grp_entry {
 struct nh_group {
struct nh_group *spare; /* spare group for removals */
u16 num_nh;
+   bool is_multipath;
bool mpath;
bool fdb_nh;
bool has_v4;
@@ -212,7 +213,7 @@ static inline bool nexthop_is_multipath(const struct nexthop *nh)
struct nh_group *nh_grp;
 
nh_grp = rcu_dereference_rtnl(nh->nh_grp);
-   return nh_grp->mpath;
+   return nh_grp->is_multipath;
}
return false;
 }
@@ -227,7 +228,7 @@ static inline unsigned int nexthop_num_path(const struct nexthop *nh)
struct nh_group *nh_grp;
 
nh_grp = rcu_dereference_rtnl(nh->nh_grp);
-   if (nh_grp->mpath)
+   if (nh_grp->is_multipath)
rc = nh_grp->num_nh;
}
 
@@ -308,7 +309,7 @@ struct fib_nh_common *nexthop_fib_nhc(struct nexthop *nh, int nhsel)
struct nh_group *nh_grp;
 
nh_grp = rcu_dereference_rtnl(nh->nh_grp);
-   if (nh_grp->mpath) {
+   if (nh_grp->is_multipath) {
nh = nexthop_mpath_select(nh_grp, nhsel);
if (!nh)
return NULL;
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 69c8b50a936e..56c54d0fbacc 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -967,6 +967,7 @@ static void remove_nh_grp_entry(struct net *net, struct nh_grp_entry *nhge,
}
 
newg->has_v4 = false;
+   newg->is_multipath = nhg->is_multipath;
newg->mpath = nhg->mpath;
newg->fdb_nh = nhg->fdb_nh;
newg->num_nh = nhg->num_nh;
@@ -1488,8 +1489,10 @@ static struct nexthop *nexthop_create_group(struct net *net,
nhg->nh_entries[i].nh_parent = nh;
}
 
-   if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_MPATH)
+   if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_MPATH) {
nhg->mpath = 1;
+   nhg->is_multipath = true;
+   }
 
WARN_ON_ONCE(nhg->mpath != 1);
 
-- 
2.26.2

[PATCH net-next v2 02/14] nexthop: __nh_notifier_single_info_init(): Make nh_info an argument

2021-03-11 Thread Petr Machata

The cited function currently uses rtnl_dereference() to get nh_info from a
handed-in nexthop. However, under the resilient hashing scheme, this
function will not always be called under RTNL; sometimes the mutual
exclusion will be achieved differently. Therefore move the nh_info
extraction from the function to its callers to make it possible to use a
different synchronization guarantee.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: David Ahern 
---
 net/ipv4/nexthop.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index f723dc97dcd3..69c8b50a936e 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -52,10 +52,8 @@ static bool nexthop_notifiers_is_empty(struct net *net)
 
 static void
 __nh_notifier_single_info_init(struct nh_notifier_single_info *nh_info,
-  const struct nexthop *nh)
+  const struct nh_info *nhi)
 {
-   struct nh_info *nhi = rtnl_dereference(nh->nh_info);
-
nh_info->dev = nhi->fib_nhc.nhc_dev;
nh_info->gw_family = nhi->fib_nhc.nhc_gw_family;
if (nh_info->gw_family == AF_INET)
@@ -71,12 +69,14 @@ __nh_notifier_single_info_init(struct nh_notifier_single_info *nh_info,
 static int nh_notifier_single_info_init(struct nh_notifier_info *info,
const struct nexthop *nh)
 {
+   struct nh_info *nhi = rtnl_dereference(nh->nh_info);
+
info->type = NH_NOTIFIER_INFO_TYPE_SINGLE;
info->nh = kzalloc(sizeof(*info->nh), GFP_KERNEL);
if (!info->nh)
return -ENOMEM;
 
-   __nh_notifier_single_info_init(info->nh, nh);
+   __nh_notifier_single_info_init(info->nh, nhi);
 
return 0;
 }
@@ -103,11 +103,13 @@ static int nh_notifier_mp_info_init(struct nh_notifier_info *info,
 
for (i = 0; i < num_nh; i++) {
struct nh_grp_entry *nhge = &nhg->nh_entries[i];
+   struct nh_info *nhi;
 
+   nhi = rtnl_dereference(nhge->nh->nh_info);
info->nh_grp->nh_entries[i].id = nhge->nh->id;
info->nh_grp->nh_entries[i].weight = nhge->weight;
__nh_notifier_single_info_init(&info->nh_grp->nh_entries[i].nh,
-  nhge->nh);
+  nhi);
}
 
return 0;
-- 
2.26.2

[PATCH net-next v2 01/14] nexthop: Pass nh_config to replace_nexthop()

2021-03-11 Thread Petr Machata

Currently, replace assumes that the new group that is given is a
fully-formed object. But mpath groups really only have one attribute, and
that is the constituent next hop configuration. This may not be universally
true. From the usability perspective, it is desirable to allow the replace
operation to adjust just the constituent next hop configuration and leave
the group attributes as such intact.

But the object that keeps track of whether an attribute was or was not
given is the nh_config object, not the next hop or next-hop group. To allow
(selective) attribute updates during NH group replacement, propagate `cfg'
to replace_nexthop() and further to replace_nexthop_grp().

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Reviewed-by: David Ahern 
---
 net/ipv4/nexthop.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 743777bce179..f723dc97dcd3 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -1107,7 +1107,7 @@ static void nh_rt_cache_flush(struct net *net, struct nexthop *nh)
 }
 
 static int replace_nexthop_grp(struct net *net, struct nexthop *old,
-  struct nexthop *new,
+  struct nexthop *new, const struct nh_config *cfg,
   struct netlink_ext_ack *extack)
 {
struct nh_group *oldg, *newg;
@@ -1276,7 +1276,8 @@ static void nexthop_replace_notify(struct net *net, struct nexthop *nh,
 }
 
 static int replace_nexthop(struct net *net, struct nexthop *old,
-  struct nexthop *new, struct netlink_ext_ack *extack)
+  struct nexthop *new, const struct nh_config *cfg,
+  struct netlink_ext_ack *extack)
 {
bool new_is_reject = false;
struct nh_grp_entry *nhge;
@@ -1319,7 +1320,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old,
}
 
if (old->is_group)
-   err = replace_nexthop_grp(net, old, new, extack);
+   err = replace_nexthop_grp(net, old, new, cfg, extack);
else
err = replace_nexthop_single(net, old, new, extack);
 
@@ -1361,7 +1362,7 @@ static int insert_nexthop(struct net *net, struct nexthop *new_nh,
} else if (new_id > nh->id) {
pp = &next->rb_right;
} else if (replace) {
-   rc = replace_nexthop(net, nh, new_nh, extack);
+   rc = replace_nexthop(net, nh, new_nh, cfg, extack);
if (!rc) {
new_nh = nh; /* send notification with old nh */
replace_notify = 1;
-- 
2.26.2

[PATCH net-next v2 00/14] nexthop: Resilient next-hop groups

2021-03-11 Thread Petr Machata

emain unbalanced
indefinitely. The value of 120 is the default in Cumulus implementation of
resilient next-hop groups. To a degree the default is arbitrary, the only
value that certainly does not make sense is 0. Therefore going with an
existing deployed implementation is reasonable.

Unbalanced time, i.e. how long since the last time that all nexthops had as
many buckets as they should according to their weights, is reported when
the group is dumped:

 # ip nexthop show id 10
 id 10 group 1/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0
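
To illustrate how the two timers interact, here is a small sketch of a per-bucket migration check. It assumes that a bucket may be migrated once it has been idle for at least the idle timer, or once the group has been unbalanced for at least the unbalanced timer, with 0 meaning that no forced rebalancing takes place. The names and the exact boundary comparisons are assumptions, not the kernel's code.

```c
#include <stdbool.h>

struct sketch_timers {
	unsigned long idle_timer;       /* seconds */
	unsigned long unbalanced_timer; /* seconds; 0 = never force */
};

/* Decide whether a bucket may be handed to an underweight nexthop. */
bool sketch_bucket_may_migrate(const struct sketch_timers *t,
			       unsigned long bucket_idle_time,
			       unsigned long unbalanced_time)
{
	/* An idle bucket can move without disturbing any flow. */
	if (bucket_idle_time >= t->idle_timer)
		return true;

	/* A busy bucket moves only when the group has been unbalanced
	 * for too long and forced rebalancing is enabled.
	 */
	if (t->unbalanced_timer && unbalanced_time >= t->unbalanced_timer)
		return true;

	return false;
}
```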

When replacing next hops or changing weights, if one does not specify some
parameters, their value is left as it was:

 # ip nexthop replace id 10 group 1,3/2 type resilient
 # ip nexthop show id 10
 id 10 group 1,3/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0
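
The "left as it was" behavior can be pictured with the has_* flags that nh_config carries for the timers. The following is a simplified user-space sketch with invented sketch_* names. It models only the two timers and is not the replace code itself.

```c
#include <stdbool.h>

/* Simplified stand-in for the timer-related part of nh_config. */
struct sketch_cfg {
	unsigned long idle_timer;
	unsigned long unbalanced_timer;
	bool has_idle_timer;
	bool has_unbalanced_timer;
};

/* Simplified stand-in for the timers stored in the group. */
struct sketch_group {
	unsigned long idle_timer;
	unsigned long unbalanced_timer;
};

/* Apply only those attributes that the request actually carried. */
void sketch_group_replace(struct sketch_group *g, const struct sketch_cfg *cfg)
{
	if (cfg->has_idle_timer)
		g->idle_timer = cfg->idle_timer;
	if (cfg->has_unbalanced_timer)
		g->unbalanced_timer = cfg->unbalanced_timer;
}
```

Keeping a separate flag per attribute is what lets "not given" be told apart from "given as 0".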

It is also possible to do a dump of individual buckets (and now you know
why there were only 8 of them in the example above):

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the nexthop replace command to satisfy the new demand
that nexthop 1 be given 6 buckets instead of 4.
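
The 6-and-2 split seen in the dump is what a 3:1 weighting gives for 8 buckets. A minimal sketch of such a weighted split follows. It assumes each cumulative share is rounded to the nearest integer so that the counts always add up to the table size. The function name is made up, and the kernel's actual computation may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Split num_buckets among n nexthops in proportion to their weights.
 * wants[i] receives the number of buckets nexthop i should own.
 */
void sketch_wants_buckets(const uint8_t *weights, size_t n,
			  uint16_t num_buckets, uint16_t *wants)
{
	unsigned int total = 0, w = 0, prev = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		total += weights[i];
		wants[i] = 0;
	}
	if (!total)
		return;

	for (i = 0; i < n; i++) {
		unsigned int upper;

		/* Round the cumulative share to the nearest bucket. */
		w += weights[i];
		upper = (num_buckets * w + total / 2) / total;
		wants[i] = (uint16_t)(upper - prev);
		prev = upper;
	}
}
```

With equal weights and 8 buckets each nexthop wants 4, and with weights 3 and 1 the split becomes 6 and 2, so two buckets have to migrate.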

The patchset proceeds as follows:

- Patches #1 and #2 are small refactoring patches.

- Patch #3 adds a new flag to struct nh_group, is_multipath. This flag is
  meant to be set for all nexthop groups that in general have several
  nexthops from which they choose, and avoids a more expensive dispatch
  based on reading several flags, one for each nexthop group type.

- Patch #4 contains defines of new UAPI attributes and the new next-hop
  group type. At this point, the nexthop code is made to bounce the new
  type. As the resilient hashing code is gradually added in the following
  patches, it will remain dead. The last patch will make it accessible.

  This patch also adds a suite of new messages related to next hop buckets.
  This approach was taken instead of overloading the information on the
  existing RTM_{NEW,DEL,GET}NEXTHOP messages for the following reasons.

  First, a next-hop group can contain a large number of next-hop buckets
  (4k is not unheard of). This imposes limits on the amount of information
  that can be encoded for each next-hop bucket given a netlink message is
  limited to 64k bytes.

  Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this
  point, in the future it can be extended to provide user space with
  control over next-hop buckets configuration.

- Patch #5 contains the meat of the resilient next-hop group support.

- Patches #6 and #7 implement support for notifications towards the
  drivers.

- Patch #8 adds an interface for the drivers to report whether a given
  hash table bucket is offloaded or trapping traffic.

- Patch #9 adds an interface for the drivers to report resilient hash
  table bucket activity. Drivers will be able to report through this
  interface whether traffic is hitting a given bucket.

- In patches #10, #11, #12 and #13, UAPI is implemented. This includes all
  the code necessary for creation of resilient groups, bucket dumping and
  getting, and bucket migration notifications.

- In patch #14 the next-hop groups are finally made available.

The overall plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next-hop groups (already pushed)
3) Implementation of resilient next-hop groups (this patchset)
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and
[3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1

v2:
- Patch #4:
- Comment at NEXTHOP_GRP_TYPE_MPATH that it's for the hash-threshold
  groups.

v1 (changes since RFC):
- Patch #3:
- This patch is new
- Patches #4-#13:
- u32 -> u16 for bucket counts / indices
- Patch #5:
- set the new flag is_multipath for resilient groups

Ido Schimmel (4):
  nexthop: Add netlink defines and enumerators for resilient NH groups
  nexthop: Add data structures for resilient group notifications
  nexthop: Allow setting "offload" and "trap" indication of nexthop
    buckets
  nexthop: Allow reporting activity of nexthop buckets

Petr Machata (10):
  nexthop: Pass nh_config to replace_nexthop()
  nexthop: __nh_notifier_single_info_init(): Make nh_info an argument
  nexthop: Add a

Re: [PATCH net-next 00/14] nexthop: Resilient next-hop groups

2021-03-11 Thread Petr Machata



David Ahern  writes:

> When you get to the end of the sets, it would be good to submit
> documentation for resilient multipath under Documentation/networking

All right.

Re: [PATCH net-next 03/14] nexthop: Add a dedicated flag for multipath next-hop groups

2021-03-11 Thread Petr Machata

David Ahern  writes:

> On 3/11/21 8:39 AM, Petr Machata wrote:
>> 
>> David Ahern  writes:
>> 
>>>> diff --git a/include/net/nexthop.h b/include/net/nexthop.h
>>>> index 7bc057aee40b..5062c2c08e2b 100644
>>>> --- a/include/net/nexthop.h
>>>> +++ b/include/net/nexthop.h
>>>> @@ -80,6 +80,7 @@ struct nh_grp_entry {
>>>>  struct nh_group {
>>>>    struct nh_group *spare; /* spare group for removals */
>>>>    u16 num_nh;
>>>> +  bool is_multipath;
>>>>    bool mpath;
>>>
>>>
>>> It would be good to rename the existing type 'mpath' to something else.
>>> You have 'resilient' as a group type later, so maybe rename this one to
>>> hash or hash_threshold.
>> 
>> All right, I'll send a follow-up with that.
>
> I'm fine with the rename being a followup after this patch set or as the
> last patch in this set.

I looked at this, and it's more than just this struct field. There is a
whole number of functions with mpath in their name to reflect that they
are for the hash-threshold algorithm. (And then some where the "mpath"
reflects the is_multipath assumption.)

So I'll send this separately, and have it go through our regression.
It's still trivialish renaming, but a fair amount thereof.

Re: [PATCH net-next 04/14] nexthop: Add netlink defines and enumerators for resilient NH groups

2021-03-11 Thread Petr Machata



David Ahern  writes:

> On 3/11/21 8:45 AM, Petr Machata wrote:
>> 
>> David Ahern  writes:
>> 
>>> On 3/10/21 8:02 AM, Petr Machata wrote:
>>>> diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
>>>> index 2d4a1e784cf0..8efebf3cb9c7 100644
>>>> --- a/include/uapi/linux/nexthop.h
>>>> +++ b/include/uapi/linux/nexthop.h
>>>> @@ -22,6 +22,7 @@ struct nexthop_grp {
>>>>  
>>>>  enum {
>>>>NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
>>>
>>> Update the above comment that it is for legacy, hash based multipath.
>> 
>> Maybe this would make sense?
>> 
>>  NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group */
>> 
>
> yes, the description is fine. keep the comment about 'default type'.

OK.

Re: [PATCH net-next 04/14] nexthop: Add netlink defines and enumerators for resilient NH groups

2021-03-11 Thread Petr Machata



David Ahern  writes:

> On 3/10/21 8:02 AM, Petr Machata wrote:
>> diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
>> index 2d4a1e784cf0..8efebf3cb9c7 100644
>> --- a/include/uapi/linux/nexthop.h
>> +++ b/include/uapi/linux/nexthop.h
>> @@ -22,6 +22,7 @@ struct nexthop_grp {
>>  
>>  enum {
>>  NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
>
> Update the above comment that it is for legacy, hash based multipath.

Maybe this would make sense?

NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group */

Re: [PATCH net-next 03/14] nexthop: Add a dedicated flag for multipath next-hop groups

2021-03-11 Thread Petr Machata



David Ahern  writes:

>> diff --git a/include/net/nexthop.h b/include/net/nexthop.h
>> index 7bc057aee40b..5062c2c08e2b 100644
>> --- a/include/net/nexthop.h
>> +++ b/include/net/nexthop.h
>> @@ -80,6 +80,7 @@ struct nh_grp_entry {
>>  struct nh_group {
>>  struct nh_group *spare; /* spare group for removals */
>>  u16 num_nh;
>> +bool is_multipath;
>>  bool mpath;
>
>
> It would be good to rename the existing type 'mpath' to something else.
> You have 'resilient' as a group type later, so maybe rename this one to
> hash or hash_threshold.

All right, I'll send a follow-up with that.
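
Roughly, the rename could look like the sketch below. This is a trimmed-down stand-in for struct nh_group, not the real definition: the struct name, the helper, and the choice of hash_threshold over hash are all assumptions for illustration, and 'resilient' is the flag that later patches in the series add.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for struct nh_group; only the flags matter here. */
struct nh_group_sketch {
	uint16_t num_nh;
	bool is_multipath;   /* any multipath group, whatever the algorithm */
	bool hash_threshold; /* formerly 'mpath' */
	bool resilient;      /* resilient group type, added later */
};

/* Hypothetical helper: mark a group as hash-threshold multipath. */
static inline void nh_group_sketch_set_hthr(struct nh_group_sketch *nhg)
{
	nhg->is_multipath = true;
	nhg->hash_threshold = true;
	nhg->resilient = false;
}
```

With the rename, is_multipath answers "is this a multipath group at all", while hash_threshold and resilient say which algorithm picks the next hop.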

[PATCH net-next 14/14] nexthop: Enable resilient next-hop groups

2021-03-10 Thread Petr Machata

Now that all the code is in place, stop rejecting requests to create
resilient next-hop groups.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 net/ipv4/nexthop.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 015a47e8163a..f09fe3a5608f 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -2443,10 +2443,6 @@ static struct nexthop *nexthop_create_group(struct net *net,
} else if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_RES) {
struct nh_res_table *res_table;
 
-   /* Bounce resilient groups for now. */
-   err = -EINVAL;
-   goto out_no_nh;
-
res_table = nexthop_res_table_alloc(net, cfg->nh_id, cfg);
if (!res_table) {
err = -ENOMEM;
-- 
2.26.2

[PATCH net-next 10/14] nexthop: Add netlink handlers for resilient nexthop groups

2021-03-10 Thread Petr Machata

Implement the netlink messages that allow creation and dumping of resilient
nexthop groups.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---

Notes:
v1 (changes since RFC):
- u32 -> u16 for bucket counts / indices

 net/ipv4/nexthop.c | 150 +++--
 1 file changed, 145 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 495b5e69ffcd..439bf3b7ced5 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -16,6 +16,9 @@
 #include 
 #include 
 
+#define NH_RES_DEFAULT_IDLE_TIMER  (120 * HZ)
+#define NH_RES_DEFAULT_UNBALANCED_TIMER    0   /* No forced rebalancing. */
+
 static void remove_nexthop(struct net *net, struct nexthop *nh,
   struct nl_info *nlinfo);
 
@@ -32,6 +35,7 @@ static const struct nla_policy rtm_nh_policy_new[] = {
[NHA_ENCAP_TYPE]= { .type = NLA_U16 },
[NHA_ENCAP] = { .type = NLA_NESTED },
[NHA_FDB]   = { .type = NLA_FLAG },
+   [NHA_RES_GROUP] = { .type = NLA_NESTED },
 };
 
 static const struct nla_policy rtm_nh_policy_get[] = {
@@ -45,6 +49,12 @@ static const struct nla_policy rtm_nh_policy_dump[] = {
[NHA_FDB]   = { .type = NLA_FLAG },
 };
 
+static const struct nla_policy rtm_nh_res_policy_new[] = {
+   [NHA_RES_GROUP_BUCKETS] = { .type = NLA_U16 },
+   [NHA_RES_GROUP_IDLE_TIMER]  = { .type = NLA_U32 },
+   [NHA_RES_GROUP_UNBALANCED_TIMER]= { .type = NLA_U32 },
+};
+
 static bool nexthop_notifiers_is_empty(struct net *net)
 {
return !net->nexthop.notifier_chain.head;
@@ -588,6 +598,41 @@ static void nh_res_time_set_deadline(unsigned long next_time,
*deadline = next_time;
 }
 
+static clock_t nh_res_table_unbalanced_time(struct nh_res_table *res_table)
+{
+   if (list_empty(&res_table->uw_nh_entries))
+   return 0;
+   return jiffies_delta_to_clock_t(jiffies - res_table->unbalanced_since);
+}
+
+static int nla_put_nh_group_res(struct sk_buff *skb, struct nh_group *nhg)
+{
+   struct nh_res_table *res_table = rtnl_dereference(nhg->res_table);
+   struct nlattr *nest;
+
+   nest = nla_nest_start(skb, NHA_RES_GROUP);
+   if (!nest)
+   return -EMSGSIZE;
+
+   if (nla_put_u16(skb, NHA_RES_GROUP_BUCKETS,
+   res_table->num_nh_buckets) ||
+   nla_put_u32(skb, NHA_RES_GROUP_IDLE_TIMER,
+   jiffies_to_clock_t(res_table->idle_timer)) ||
+   nla_put_u32(skb, NHA_RES_GROUP_UNBALANCED_TIMER,
+   jiffies_to_clock_t(res_table->unbalanced_timer)) ||
+   nla_put_u64_64bit(skb, NHA_RES_GROUP_UNBALANCED_TIME,
+ nh_res_table_unbalanced_time(res_table),
+ NHA_RES_GROUP_PAD))
+   goto nla_put_failure;
+
+   nla_nest_end(skb, nest);
+   return 0;
+
+nla_put_failure:
+   nla_nest_cancel(skb, nest);
+   return -EMSGSIZE;
+}
+
 static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg)
 {
struct nexthop_grp *p;
@@ -598,6 +643,8 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg)
 
if (nhg->mpath)
group_type = NEXTHOP_GRP_TYPE_MPATH;
+   else if (nhg->resilient)
+   group_type = NEXTHOP_GRP_TYPE_RES;
 
if (nla_put_u16(skb, NHA_GROUP_TYPE, group_type))
goto nla_put_failure;
@@ -613,6 +660,9 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg)
p += 1;
}
 
+   if (nhg->resilient && nla_put_nh_group_res(skb, nhg))
+   goto nla_put_failure;
+
return 0;
 
 nla_put_failure:
@@ -700,13 +750,26 @@ static int nh_fill_node(struct sk_buff *skb, struct nexthop *nh,
return -EMSGSIZE;
 }
 
+static size_t nh_nlmsg_size_grp_res(struct nh_group *nhg)
+{
+   return nla_total_size(0) +  /* NHA_RES_GROUP */
+   nla_total_size(2) + /* NHA_RES_GROUP_BUCKETS */
+   nla_total_size(4) + /* NHA_RES_GROUP_IDLE_TIMER */
+   nla_total_size(4) + /* NHA_RES_GROUP_UNBALANCED_TIMER */
+   nla_total_size_64bit(8);/* NHA_RES_GROUP_UNBALANCED_TIME */
+}
+
 static size_t nh_nlmsg_size_grp(struct nexthop *nh)
 {
struct nh_group *nhg = rtnl_dereference(nh->nh_grp);
size_t sz = sizeof(struct nexthop_grp) * nhg->num_nh;
+   size_t tot = nla_total_size(sz) +
+   nla_total_size(2); /* NHA_GROUP_TYPE */
+
+   if (nhg->resilient)
+   tot += nh_nlmsg_size_grp_res(nhg);
 
-   return nla_total_size(sz) +
-  nla_total_size(2);  /* NHA_GROUP_TYPE */
+   return tot;
 }
 
 static size_t nh_nlmsg_size_single(struct nexthop *nh)
@@ -876,7 +939,7 @@ static int nh_check_attr_fdb_g
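
As a sanity check on the size accounting in nh_nlmsg_size_grp_res() above, the arithmetic can be reproduced in userspace. The sketch below restates the netlink attribute size helpers under a sk_ prefix (these names are made up for the example, they are not kernel or uapi symbols) and assumes the worst case, where nla_total_size_64bit() also reserves room for an empty padding attribute.

```c
#include <stddef.h>

/* Userspace restatement of the attribute size helpers; sk_ names are
 * hypothetical and exist only for this example.
 */
#define SK_NLA_ALIGNTO   4u
#define SK_NLA_ALIGN(n)  (((n) + SK_NLA_ALIGNTO - 1) & ~(size_t)(SK_NLA_ALIGNTO - 1))
#define SK_NLA_HDRLEN    SK_NLA_ALIGN((size_t)4)  /* 2-byte length + 2-byte type */

/* Header plus payload, rounded up to the 4-byte attribute alignment. */
static inline size_t sk_nla_total_size(size_t payload)
{
	return SK_NLA_ALIGN(SK_NLA_HDRLEN + payload);
}

/* Worst case: room for one empty pad attribute in front of the value so
 * that the 64-bit payload can be aligned.
 */
static inline size_t sk_nla_total_size_64bit(size_t payload)
{
	return sk_nla_total_size(payload) + sk_nla_total_size(0);
}

/* Mirrors the sum in nh_nlmsg_size_grp_res(). */
static inline size_t sk_nh_nlmsg_size_grp_res(void)
{
	return sk_nla_total_size(0) +       /* NHA_RES_GROUP */
	       sk_nla_total_size(2) +       /* NHA_RES_GROUP_BUCKETS */
	       sk_nla_total_size(4) +       /* NHA_RES_GROUP_IDLE_TIMER */
	       sk_nla_total_size(4) +       /* NHA_RES_GROUP_UNBALANCED_TIMER */
	       sk_nla_total_size_64bit(8);  /* NHA_RES_GROUP_UNBALANCED_TIME */
}
```

Under these assumptions the nest comes to 4 + 8 + 8 + 8 + 16 = 44 bytes on top of the existing group attributes. The u16 bucket count still costs 8 bytes because the payload is padded to the 4-byte boundary.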
