Re: [patch net-next v4 6/6] selftests: virtio_net: add initial tests
Jiri Pirko writes:

> From: Jiri Pirko
>
> Introduce initial tests for virtio_net driver. Focus on feature testing
> leveraging previously introduced debugfs feature filtering
> infrastructure. Add very basic ping and F_MAC feature tests.
>
> To run this, do:
> $ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests
>
> Run it on a system with 2 virtio_net devices connected back-to-back
> on the hypervisor.
>
> Signed-off-by: Jiri Pirko

Reviewed-by: Petr Machata
Re: [patch net-next v4 5/6] selftests: forwarding: add wait_for_dev() helper
Jiri Pirko writes:

> From: Jiri Pirko
>
> The existing setup_wait*() helper family check the status of the
> interface to be up. Introduce wait_for_dev() to wait for the netdevice
> to appear, for example after test script does manual device bind.
>
> Signed-off-by: Jiri Pirko

Reviewed-by: Petr Machata
Re: [patch net-next v4 3/6] selftests: forwarding: add ability to assemble NETIFS array by driver name
Jiri Pirko writes:

> From: Jiri Pirko
>
> Allow driver tests to work without specifying the netdevice names.
> Introduce a possibility to search for available netdevices according to
> set driver name. Allow test to specify the name by setting
> NETIF_FIND_DRIVER variable.
>
> Note that user overrides this either by passing netdevice names on the
> command line or by declaring NETIFS array in custom forwarding.config
> configuration file.
>
> Signed-off-by: Jiri Pirko

Reviewed-by: Petr Machata
Re: [patch net-next v4 2/6] selftests: forwarding: move initial root check to the beginning
Jiri Pirko writes:

> From: Jiri Pirko
>
> This check can be done at the very beginning of the script.
> As the follow up patch needs to add early code that needs to be executed
> after the check, move it.
>
> Signed-off-by: Jiri Pirko

Reviewed-by: Petr Machata
Re: [patch net-next v3 6/6] selftests: virtio_net: add initial tests
Jiri Pirko writes:

> From: Jiri Pirko
>
> Introduce initial tests for virtio_net driver. Focus on feature testing
> leveraging previously introduced debugfs feature filtering
> infrastructure. Add very basic ping and F_MAC feature tests.
>
> To run this, do:
> $ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests
>
> Run it on a system with 2 virtio_net devices connected back-to-back
> on the hypervisor.
>
> Signed-off-by: Jiri Pirko

> +h2_destroy()
> +{
> +	simple_if_fini $h2 $H2_IPV4/24 $H2_IPV6/64
> +}
> +
> +initial_ping_test()
> +{
> +	cleanup

All these cleanup() calls will end up possibly triggering PAUSE_ON_CLEANUP.
Not sure that's intended.

> +	setup_prepare
> +	ping_test $h1 $H2_IPV4 " simple"
> +}

Other than this nit, LGTM.

Reviewed-by: Petr Machata
Re: [patch net-next v3 3/6] selftests: forwarding: add ability to assemble NETIFS array by driver name
Petr Machata writes:

> Jiri Pirko writes:
>
>> +# Whether to find netdevice according to the specified driver.
>> +: "${NETIF_FIND_DRIVER:=}"
>
> This would be better placed up there in the Topology description
> section. Together with NETIFS and NETIF_NO_CABLE, as it concerns
> specification of which interfaces to use.

Oh never mind, it's not something a user should configure, but rather a
test API.
Re: [patch net-next v3 5/6] selftests: forwarding: add wait_for_dev() helper
Jiri Pirko writes:

> From: Jiri Pirko
>
> The existing setup_wait*() helper family check the status of the
> interface to be up. Introduce wait_for_dev() to wait for the netdevice
> to appear, for example after test script does manual device bind.
>
> Signed-off-by: Jiri Pirko
> ---
> v1->v2:
> - reworked wait_for_dev() helper to use slowwait() helper
> ---
>  tools/testing/selftests/net/forwarding/lib.sh | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
> index edaec12c0575..41c0b0ed430b 100644
> --- a/tools/testing/selftests/net/forwarding/lib.sh
> +++ b/tools/testing/selftests/net/forwarding/lib.sh
> @@ -745,6 +745,19 @@ setup_wait()
>  	sleep $WAIT_TIME
>  }
>
> +wait_for_dev()
> +{
> +	local dev=$1; shift
> +	local timeout=${1:-$WAIT_TIMEOUT}; shift
> +
> +	slowwait $timeout ip link show dev $dev up &> /dev/null

I agree with Benjamin's feedback that this should lose the up flag. It looks
as if it's waiting for the device to be up.

> +	if (( $? )); then
> +		check_err 1
> +		log_test wait_for_dev "Interface $dev did not appear."
> +		exit $EXIT_STATUS
> +	fi
> +}
> +
>  cmd_jq()
>  {
>  	local cmd=$1
Re: [patch net-next v3 4/6] selftests: forwarding: add check_driver() helper
Jiri Pirko writes:

> From: Jiri Pirko
>
> Add a helper to be used to check if the netdevice is backed by specified
> driver.
>
> Signed-off-by: Jiri Pirko

Reviewed-by: Petr Machata
Re: [patch net-next v3 3/6] selftests: forwarding: add ability to assemble NETIFS array by driver name
Jiri Pirko writes:

> From: Jiri Pirko
>
> Allow driver tests to work without specifying the netdevice names.
> Introduce a possibility to search for available netdevices according to
> set driver name. Allow test to specify the name by setting
> NETIF_FIND_DRIVER variable.
>
> Note that user overrides this either by passing netdevice names on the
> command line or by declaring NETIFS array in custom forwarding.config
> configuration file.
>
> Signed-off-by: Jiri Pirko
> ---
> v1->v2:
> - removed unnecessary "-p" and "-e" options
> - removed unnecessary "! -z" from the check
> - moved NETIF_FIND_DRIVER declaration from the config options
> ---
>  tools/testing/selftests/net/forwarding/lib.sh | 39 +++
>  1 file changed, 39 insertions(+)
>
> diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
> index 2e7695b94b6b..b3fd0f052d71 100644
> --- a/tools/testing/selftests/net/forwarding/lib.sh
> +++ b/tools/testing/selftests/net/forwarding/lib.sh
> @@ -94,6 +94,45 @@ if [[ ! -v NUM_NETIFS ]]; then
>  	exit $ksft_skip
>  fi
>
> +##
> +# Find netifs by test-specified driver name
> +
> +driver_name_get()
> +{
> +	local dev=$1; shift
> +	local driver_path="/sys/class/net/$dev/device/driver"
> +
> +	if [ ! -L $driver_path ]; then
> +		echo ""
> +	else
> +		basename `realpath $driver_path`
> +	fi

This is just:

	if [[ -L $driver_path ]]; then
		basename `realpath $driver_path`
	fi

> +}
> +
> +find_netif()

Maybe name it find_driver_netif? find_netif sounds super generic.

Also consider having it take an argument instead of accessing environment
NETIF_FIND_DRIVER directly.

> +{
> +	local ifnames=`ip -j link show | jq -r ".[].ifname"`
> +	local count=0
> +
> +	for ifname in $ifnames
> +	do
> +		local driver_name=`driver_name_get $ifname`
> +		if [[ ! -z $driver_name && $driver_name == $NETIF_FIND_DRIVER ]]; then
> +			count=$((count + 1))
> +			NETIFS[p$count]="$ifname"
> +		fi
> +	done
> +}
> +
> +# Whether to find netdevice according to the specified driver.
> +: "${NETIF_FIND_DRIVER:=}"

This would be better placed up there in the Topology description section.
Together with NETIFS and NETIF_NO_CABLE, as it concerns specification of
which interfaces to use.

> +
> +if [[ $NETIF_FIND_DRIVER ]]; then
> +	unset NETIFS
> +	declare -A NETIFS
> +	find_netif
> +fi
> +
>  net_forwarding_dir=$(dirname "$(readlink -e "${BASH_SOURCE[0]}")")
>
>  if [[ -f $net_forwarding_dir/forwarding.config ]]; then
[PATCH net-next 08/10] mlxsw: spectrum_qdisc: Allocate child qdiscs dynamically
Instead of keeping qdiscs in globally-preallocated arrays, introduce a per-qdisc-kind value num_classes, and then allocate the necessary child qdiscs (if any) based on that value. Since now dynamic allocation is involved, mlxsw_sp_qdisc_replace() gets messy enough that it is worth it to split it to two cases: a new qdisc allocation and a change of existing qdisc. (Note that the change also includes what TC formally calls replace, if the qdisc kind is the same.) Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- .../ethernet/mellanox/mlxsw/spectrum_qdisc.c | 115 +- 1 file changed, 83 insertions(+), 32 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c index 9e7f1a0188e8..03c131027fa7 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c @@ -49,6 +49,7 @@ struct mlxsw_sp_qdisc_ops { struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params); struct mlxsw_sp_qdisc *(*find_class)(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, u32 parent); + unsigned int num_classes; }; struct mlxsw_sp_qdisc { @@ -74,7 +75,6 @@ struct mlxsw_sp_qdisc { struct mlxsw_sp_qdisc_state { struct mlxsw_sp_qdisc root_qdisc; - struct mlxsw_sp_qdisc tclass_qdiscs[IEEE_8021QAZ_MAX_TCS]; /* When a PRIO or ETS are added, the invisible FIFOs in their bands are * created first. 
When notifications for these FIFOs arrive, it is not @@ -215,29 +215,41 @@ mlxsw_sp_qdisc_destroy(struct mlxsw_sp_port *mlxsw_sp_port, if (mlxsw_sp_qdisc->ops->destroy) err = mlxsw_sp_qdisc->ops->destroy(mlxsw_sp_port, mlxsw_sp_qdisc); + if (mlxsw_sp_qdisc->ops->clean_stats) + mlxsw_sp_qdisc->ops->clean_stats(mlxsw_sp_port, mlxsw_sp_qdisc); mlxsw_sp_qdisc->handle = TC_H_UNSPEC; mlxsw_sp_qdisc->ops = NULL; - + mlxsw_sp_qdisc->num_classes = 0; + kfree(mlxsw_sp_qdisc->qdiscs); + mlxsw_sp_qdisc->qdiscs = NULL; return err_hdroom ?: err; } -static int -mlxsw_sp_qdisc_replace(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle, - struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, - struct mlxsw_sp_qdisc_ops *ops, void *params) +static int mlxsw_sp_qdisc_create(struct mlxsw_sp_port *mlxsw_sp_port, +u32 handle, +struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, +struct mlxsw_sp_qdisc_ops *ops, void *params) { struct mlxsw_sp_qdisc *root_qdisc = &mlxsw_sp_port->qdisc->root_qdisc; struct mlxsw_sp_hdroom orig_hdroom; + unsigned int i; int err; - if (mlxsw_sp_qdisc->ops && mlxsw_sp_qdisc->ops->type != ops->type) - /* In case this location contained a different qdisc of the -* same type we can override the old qdisc configuration. -* Otherwise, we need to remove the old qdisc before setting the -* new one. 
-*/ - mlxsw_sp_qdisc_destroy(mlxsw_sp_port, mlxsw_sp_qdisc); + err = ops->check_params(mlxsw_sp_port, params); + if (err) + return err; + + if (ops->num_classes) { + mlxsw_sp_qdisc->qdiscs = kcalloc(ops->num_classes, + sizeof(*mlxsw_sp_qdisc->qdiscs), +GFP_KERNEL); + if (!mlxsw_sp_qdisc->qdiscs) + return -ENOMEM; + + for (i = 0; i < ops->num_classes; i++) + mlxsw_sp_qdisc->qdiscs[i].parent = mlxsw_sp_qdisc; + } orig_hdroom = *mlxsw_sp_port->hdroom; if (root_qdisc == mlxsw_sp_qdisc) { @@ -253,20 +265,46 @@ mlxsw_sp_qdisc_replace(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle, goto err_hdroom_configure; } + mlxsw_sp_qdisc->num_classes = ops->num_classes; + mlxsw_sp_qdisc->ops = ops; + mlxsw_sp_qdisc->handle = handle; + err = ops->replace(mlxsw_sp_port, handle, mlxsw_sp_qdisc, params); + if (err) + goto err_replace; + + return 0; + +err_replace: + mlxsw_sp_qdisc->handle = TC_H_UNSPEC; + mlxsw_sp_qdisc->ops = NULL; + mlxsw_sp_qdisc->num_classes = 0; + mlxsw_sp_hdroom_configure(mlxsw_sp_port, &orig_hdroom); +err_hdroom_configure: + kfree(mlxsw_sp_qdisc->qdiscs); + mlxsw_sp_qdisc->qdiscs = NULL; + return err; +} + +static int +mlxsw_sp_qdisc_change(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle, +
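The allocation scheme described in this patch can be illustrated with a small user-space sketch. This is not the driver code: struct qdisc, struct qdisc_ops, qdisc_create() and qdisc_destroy() are hypothetical, simplified stand-ins, and calloc()/free() are used in place of kcalloc()/kfree(). It only shows the idea of sizing the child array from a per-kind num_classes value and wiring each child's parent pointer.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct qdisc;

struct qdisc_ops {
	unsigned int num_classes;	/* children a qdisc of this kind needs */
};

struct qdisc {
	const struct qdisc_ops *ops;	/* NULL while the slot is unused */
	struct qdisc *parent;
	struct qdisc *qdiscs;		/* array of num_classes children */
	unsigned int num_classes;
};

/* Allocate children according to the kind's num_classes, then publish ops. */
static int qdisc_create(struct qdisc *q, const struct qdisc_ops *ops)
{
	unsigned int i;

	if (ops->num_classes) {
		q->qdiscs = calloc(ops->num_classes, sizeof(*q->qdiscs));
		if (!q->qdiscs)
			return -ENOMEM;

		for (i = 0; i < ops->num_classes; i++)
			q->qdiscs[i].parent = q;
	}

	q->num_classes = ops->num_classes;
	q->ops = ops;
	return 0;
}

/* Undo qdisc_create(): release the children and mark the slot unused. */
static void qdisc_destroy(struct qdisc *q)
{
	q->ops = NULL;
	q->num_classes = 0;
	free(q->qdiscs);
	q->qdiscs = NULL;
}
```

A leaf kind would set num_classes to 0 and get no child array at all, which is why the allocation is guarded by the num_classes check.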
[PATCH net-next 10/10] selftests: mlxsw: sch_red_ets: Test proper counter cleaning in ETS
There was a bug introduced during the rework which caused a non-zero backlog
to be stuck at ETS. Introduce a selftest that would have caught the issue
earlier.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
---
 tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh b/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh
index 3f007c5f8361..f3ef3274f9b3 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh
@@ -67,6 +67,13 @@ red_test()
 {
 	install_qdisc
 
+	# Make sure that we get the non-zero value if there is any.
+	local cur=$(busywait 1100 until_counter_is "> 0" \
+			qdisc_stats_get $swp3 10: .backlog)
+	(( cur == 0 ))
+	check_err $? "backlog of $cur observed on non-busy qdisc"
+	log_test "$QDISC backlog properly cleaned"
+
 	do_red_test 10 $BACKLOG1
 	do_red_test 11 $BACKLOG2
 
-- 
2.26.2
[PATCH net-next 09/10] mlxsw: spectrum_qdisc: Index future FIFOs by band number
mlxsw used to hold an array of qdiscs indexed by the TC number. In the
previous patch, it was changed to allocate child qdiscs dynamically, and they
are now indexed by band number. Follow suit with the array of future FIFOs.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
---
 .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c    | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 03c131027fa7..04672eb5c7f3 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -962,7 +962,7 @@ static int __mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port,
 {
 	struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc;
 	struct mlxsw_sp_qdisc *mlxsw_sp_qdisc;
-	int tclass, child_index;
+	unsigned int band;
 	u32 parent_handle;
 
 	mlxsw_sp_qdisc = mlxsw_sp_qdisc_find(mlxsw_sp_port, p->parent, false);
@@ -977,13 +977,12 @@ static int __mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port,
 			qdisc_state->future_handle = parent_handle;
 		}
 
-		child_index = TC_H_MIN(p->parent);
-		tclass = MLXSW_SP_PRIO_CHILD_TO_TCLASS(child_index);
-		if (tclass < IEEE_8021QAZ_MAX_TCS) {
+		band = TC_H_MIN(p->parent) - 1;
+		if (band < IEEE_8021QAZ_MAX_TCS) {
 			if (p->command == TC_FIFO_REPLACE)
-				qdisc_state->future_fifos[tclass] = true;
+				qdisc_state->future_fifos[band] = true;
 			else if (p->command == TC_FIFO_DESTROY)
-				qdisc_state->future_fifos[tclass] = false;
+				qdisc_state->future_fifos[band] = false;
 		}
 	}
 	if (!mlxsw_sp_qdisc)
@@ -1117,7 +1116,7 @@ __mlxsw_sp_qdisc_ets_replace(struct mlxsw_sp_port *mlxsw_sp_port,
 		}
 
 		if (handle == qdisc_state->future_handle &&
-		    qdisc_state->future_fifos[tclass]) {
+		    qdisc_state->future_fifos[band]) {
 			err = mlxsw_sp_qdisc_replace(mlxsw_sp_port, TC_H_UNSPEC,
 						     child_qdisc,
 						     &mlxsw_sp_qdisc_ops_fifo,
-- 
2.26.2
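The band computation in this patch declares band as unsigned int, so a class minor of 0 wraps around to a very large value and is rejected by the same upper-bound check that rejects out-of-range bands. The following user-space sketch illustrates just that arithmetic. EX_TC_H_MIN() and EX_MAX_TCS are local stand-ins that are assumed to mirror TC_H_MIN() and IEEE_8021QAZ_MAX_TCS, and future_fifo_set() is a hypothetical helper, not a driver function.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_TC_H_MIN(h)	((h) & 0x0000FFFFU)	/* assumed to mirror TC_H_MIN() */
#define EX_MAX_TCS	8			/* assumed to mirror IEEE_8021QAZ_MAX_TCS */

static bool future_fifos[EX_MAX_TCS];

/*
 * Record whether a FIFO is expected in the band named by the parent class.
 * Returns false when the class minor does not name a valid band.
 */
static bool future_fifo_set(uint32_t parent, bool present)
{
	unsigned int band = EX_TC_H_MIN(parent) - 1;

	/* A minor of 0 wraps to UINT_MAX, so one bound check covers it too. */
	if (band >= EX_MAX_TCS)
		return false;

	future_fifos[band] = present;
	return true;
}
```

With a signed index, the minor-0 case would need its own negative check; the unsigned type folds both ends of the range into a single comparison.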
[PATCH net-next 07/10] mlxsw: spectrum_qdisc: Guard all qdisc accesses with a lock
The FIFO handler currently guards accesses to the future FIFO tracking by asserting RTNL. In the future, the changes to the qdisc state will be more thorough, so other qdiscs will need this guarding as well. In order to not further the RTNL infestation, instead convert to a custom lock that will guard accesses to the qdisc state. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- .../ethernet/mellanox/mlxsw/spectrum_qdisc.c | 89 +++ 1 file changed, 73 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c index f42ea958919b..9e7f1a0188e8 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c @@ -89,6 +89,7 @@ struct mlxsw_sp_qdisc_state { */ u32 future_handle; bool future_fifos[IEEE_8021QAZ_MAX_TCS]; + struct mutex lock; /* Protects qdisc state. */ }; static bool @@ -620,8 +621,8 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_red = { .find_class = mlxsw_sp_qdisc_leaf_find_class, }; -int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port, - struct tc_red_qopt_offload *p) +static int __mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port, + struct tc_red_qopt_offload *p) { struct mlxsw_sp_qdisc *mlxsw_sp_qdisc; @@ -652,6 +653,18 @@ int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port, } } +int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port, + struct tc_red_qopt_offload *p) +{ + int err; + + mutex_lock(&mlxsw_sp_port->qdisc->lock); + err = __mlxsw_sp_setup_tc_red(mlxsw_sp_port, p); + mutex_unlock(&mlxsw_sp_port->qdisc->lock); + + return err; +} + static void mlxsw_sp_setup_tc_qdisc_leaf_clean_stats(struct mlxsw_sp_port *mlxsw_sp_port, struct mlxsw_sp_qdisc *mlxsw_sp_qdisc) @@ -814,8 +827,8 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_tbf = { .find_class = mlxsw_sp_qdisc_leaf_find_class, }; -int mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port
*mlxsw_sp_port, - struct tc_tbf_qopt_offload *p) +static int __mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port *mlxsw_sp_port, + struct tc_tbf_qopt_offload *p) { struct mlxsw_sp_qdisc *mlxsw_sp_qdisc; @@ -843,6 +856,18 @@ int mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port *mlxsw_sp_port, } } +int mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port *mlxsw_sp_port, + struct tc_tbf_qopt_offload *p) +{ + int err; + + mutex_lock(&mlxsw_sp_port->qdisc->lock); + err = __mlxsw_sp_setup_tc_tbf(mlxsw_sp_port, p); + mutex_unlock(&mlxsw_sp_port->qdisc->lock); + + return err; +} + static int mlxsw_sp_qdisc_fifo_check_params(struct mlxsw_sp_port *mlxsw_sp_port, void *params) @@ -876,20 +901,14 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_fifo = { .clean_stats = mlxsw_sp_setup_tc_qdisc_leaf_clean_stats, }; -int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port, - struct tc_fifo_qopt_offload *p) +static int __mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port, + struct tc_fifo_qopt_offload *p) { struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc; struct mlxsw_sp_qdisc *mlxsw_sp_qdisc; int tclass, child_index; u32 parent_handle; - /* Invisible FIFOs are tracked in future_handle and future_fifos. Make -* sure that not more than one qdisc is created for a port at a time. -* RTNL is a simple proxy for that. 
-*/ - ASSERT_RTNL(); - mlxsw_sp_qdisc = mlxsw_sp_qdisc_find(mlxsw_sp_port, p->parent, false); if (!mlxsw_sp_qdisc && p->handle == TC_H_UNSPEC) { parent_handle = TC_H_MAJ(p->parent); @@ -936,6 +955,18 @@ int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port, return -EOPNOTSUPP; } +int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port, + struct tc_fifo_qopt_offload *p) +{ + int err; + + mutex_lock(&mlxsw_sp_port->qdisc->lock); + err = __mlxsw_sp_setup_tc_fifo(mlxsw_sp_port, p); + mutex_unlock(&mlxsw_sp_port->qdisc->lock); + + return err; +} + static int __mlxsw_sp_qdisc_ets_destroy(struct mlxsw_sp_port *mlxsw_sp_port, struct mlxsw_sp_qdisc *mlxsw_sp_qdisc) { @@ -1277,8 +1308,8 @@ mlxsw_sp_qdisc_prio_graft(struct mlxsw_sp_port *mlxsw_sp_port, p->band, p->child_handle); } -int mlxsw_sp_setup_tc_prio(struct m
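The locking pattern this patch applies to each setup_tc entry point (rename the worker, then add a thin wrapper that takes the lock around it) can be sketched in user space as follows. This is an illustration under assumptions, not the driver code: a pthread mutex stands in for the kernel's struct mutex, and struct qdisc_state, setup_fifo_locked() and setup_fifo() are hypothetical names.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-port qdisc state. */
struct qdisc_state {
	pthread_mutex_t lock;	/* protects the fields below */
	uint32_t future_handle;
};

/* Worker: does the actual state change. Caller must hold state->lock. */
static int setup_fifo_locked(struct qdisc_state *state, uint32_t handle)
{
	if (!handle)
		return -EOPNOTSUPP;

	state->future_handle = handle;
	return 0;
}

/* Entry point: take the lock, run the worker, drop the lock on every path. */
static int setup_fifo(struct qdisc_state *state, uint32_t handle)
{
	int err;

	pthread_mutex_lock(&state->lock);
	err = setup_fifo_locked(state, handle);
	pthread_mutex_unlock(&state->lock);

	return err;
}
```

Keeping the worker's early returns inside the unlocked helper means the wrapper has exactly one lock and one unlock, so no return path can leave the mutex held.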
[PATCH net-next 06/10] mlxsw: spectrum_qdisc: Track children per qdisc
mlxsw currently allows a two-level structure of qdiscs: the root and possibly a number of children. In order to support offloading more general qdisc trees, introduce to struct mlxsw_sp_qdisc a pointer to child qdiscs. Refer to the child qdiscs through this pointer, instead of going through the tclass_qdiscs in qdisc_state. Additionally introduce a field num_classes, which holds the number of the given qdisc's children. Also introduce a generic function for walking qdisc trees. Rewrite mlxsw_sp_qdisc_find() and _find_by_handle() to use the generic walker. For now, keep qdisc_state.tclass_qdiscs, and just point root_qdisc's children to this array. Following patches will make the allocation dynamic. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- .../ethernet/mellanox/mlxsw/spectrum_qdisc.c | 164 +- 1 file changed, 118 insertions(+), 46 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c index a8a7e9c88a4d..f42ea958919b 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c @@ -47,6 +47,8 @@ struct mlxsw_sp_qdisc_ops { */ void (*unoffload)(struct mlxsw_sp_port *mlxsw_sp_port, struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params); + struct mlxsw_sp_qdisc *(*find_class)(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, +u32 parent); }; struct mlxsw_sp_qdisc { @@ -66,6 +68,8 @@ struct mlxsw_sp_qdisc { struct mlxsw_sp_qdisc_ops *ops; struct mlxsw_sp_qdisc *parent; + struct mlxsw_sp_qdisc *qdiscs; + unsigned int num_classes; }; struct mlxsw_sp_qdisc_state { @@ -93,44 +97,84 @@ mlxsw_sp_qdisc_compare(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, u32 handle) return mlxsw_sp_qdisc->ops && mlxsw_sp_qdisc->handle == handle; } +static struct mlxsw_sp_qdisc * +mlxsw_sp_qdisc_walk(struct mlxsw_sp_qdisc *qdisc, + struct mlxsw_sp_qdisc *(*pre)(struct mlxsw_sp_qdisc *, + void *), + void *data) +{ + struct mlxsw_sp_qdisc *tmp; + unsigned int i; +
+ if (pre) { + tmp = pre(qdisc, data); + if (tmp) + return tmp; + } + + if (qdisc->ops) { + for (i = 0; i < qdisc->num_classes; i++) { + tmp = &qdisc->qdiscs[i]; + if (qdisc->ops) { + tmp = mlxsw_sp_qdisc_walk(tmp, pre, data); + if (tmp) + return tmp; + } + } + } + + return NULL; +} + +static struct mlxsw_sp_qdisc * +mlxsw_sp_qdisc_walk_cb_find(struct mlxsw_sp_qdisc *qdisc, void *data) +{ + u32 parent = *(u32 *)data; + + if (qdisc->ops && TC_H_MAJ(qdisc->handle) == TC_H_MAJ(parent)) { + if (qdisc->ops->find_class) + return qdisc->ops->find_class(qdisc, parent); + } + + return NULL; +} + static struct mlxsw_sp_qdisc * mlxsw_sp_qdisc_find(struct mlxsw_sp_port *mlxsw_sp_port, u32 parent, bool root_only) { struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc; - int tclass, child_index; + if (!qdisc_state) + return NULL; if (parent == TC_H_ROOT) return &qdisc_state->root_qdisc; - - if (root_only || !qdisc_state || - !qdisc_state->root_qdisc.ops || - TC_H_MAJ(parent) != qdisc_state->root_qdisc.handle || - TC_H_MIN(parent) > IEEE_8021QAZ_MAX_TCS) + if (root_only) return NULL; + return mlxsw_sp_qdisc_walk(&qdisc_state->root_qdisc, + mlxsw_sp_qdisc_walk_cb_find, &parent); +} - child_index = TC_H_MIN(parent); - tclass = MLXSW_SP_PRIO_CHILD_TO_TCLASS(child_index); - return &qdisc_state->tclass_qdiscs[tclass]; +static struct mlxsw_sp_qdisc * +mlxsw_sp_qdisc_walk_cb_find_by_handle(struct mlxsw_sp_qdisc *qdisc, void *data) +{ + u32 handle = *(u32 *)data; + + if (qdisc->ops && qdisc->handle == handle) + return qdisc; + return NULL; } static struct mlxsw_sp_qdisc * mlxsw_sp_qdisc_find_by_handle(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle) { struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc; - int i; - if (qdisc_state->root_qdisc.handle == handle) - return &qdisc_state->root_qdisc; - - if (qdisc_state->root_qdisc.handle == TC_H_UNSPEC) + if (!qdisc_state) return NULL; - - for (i = 0; i < IEEE_8021QAZ_MAX
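The generic walker introduced in this patch is a pre-order traversal that stops at the first non-NULL callback result. A reduced user-space sketch of that idea follows. It is not the driver code: struct qdisc is a hypothetical stand-in, an offloaded flag is assumed here in place of the driver's ops pointer check, and qdisc_walk() and cb_find_by_handle() are illustrative names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the driver's qdisc structure. */
struct qdisc {
	uint32_t handle;
	int offloaded;			/* stands in for "ops != NULL" */
	struct qdisc *qdiscs;		/* array of num_classes children */
	unsigned int num_classes;
};

typedef struct qdisc *(*walk_cb)(struct qdisc *qdisc, void *data);

/* Pre-order walk: return the first non-NULL result of the callback. */
static struct qdisc *qdisc_walk(struct qdisc *qdisc, walk_cb pre, void *data)
{
	struct qdisc *tmp;
	unsigned int i;

	if (pre) {
		tmp = pre(qdisc, data);
		if (tmp)
			return tmp;
	}

	/* Only descend into slots that actually hold a qdisc. */
	if (!qdisc->offloaded)
		return NULL;

	for (i = 0; i < qdisc->num_classes; i++) {
		tmp = qdisc_walk(&qdisc->qdiscs[i], pre, data);
		if (tmp)
			return tmp;
	}

	return NULL;
}

/* Callback: match an offloaded qdisc by its handle. */
static struct qdisc *cb_find_by_handle(struct qdisc *qdisc, void *data)
{
	uint32_t handle = *(uint32_t *)data;

	if (qdisc->offloaded && qdisc->handle == handle)
		return qdisc;
	return NULL;
}
```

Because the search predicate lives entirely in the callback, lookups by handle and by parent class can share the same traversal and differ only in the function passed as pre.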
[PATCH net-next 04/10] mlxsw: spectrum_qdisc: Track tclass_num as int, not u8
tclass_num is just a number, a value that would be ordinarily passed around as an int. (Which is unlike a u8 prio_bitmap.) In several places, tclass_num already is an int. Convert the remaining instances. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c index f1d32bfc4bed..da1f6314df60 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c @@ -51,7 +51,7 @@ struct mlxsw_sp_qdisc_ops { struct mlxsw_sp_qdisc { u32 handle; - u8 tclass_num; + int tclass_num; u8 prio_bitmap; union { struct red_stats red; @@ -291,7 +291,7 @@ mlxsw_sp_qdisc_collect_tc_stats(struct mlxsw_sp_port *mlxsw_sp_port, u64 *p_tx_bytes, u64 *p_tx_packets, u64 *p_drops, u64 *p_backlog) { - u8 tclass_num = mlxsw_sp_qdisc->tclass_num; + int tclass_num = mlxsw_sp_qdisc->tclass_num; struct mlxsw_sp_port_xstats *xstats; u64 tx_bytes, tx_packets; @@ -391,7 +391,7 @@ static void mlxsw_sp_setup_tc_qdisc_red_clean_stats(struct mlxsw_sp_port *mlxsw_sp_port, struct mlxsw_sp_qdisc *mlxsw_sp_qdisc) { - u8 tclass_num = mlxsw_sp_qdisc->tclass_num; + int tclass_num = mlxsw_sp_qdisc->tclass_num; struct mlxsw_sp_qdisc_stats *stats_base; struct mlxsw_sp_port_xstats *xstats; struct red_stats *red_base; @@ -462,7 +462,7 @@ mlxsw_sp_qdisc_red_replace(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle, { struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp; struct tc_red_qopt_offload_params *p = params; - u8 tclass_num = mlxsw_sp_qdisc->tclass_num; + int tclass_num = mlxsw_sp_qdisc->tclass_num; u32 min, max; u64 prob; @@ -507,7 +507,7 @@ mlxsw_sp_qdisc_get_red_xstats(struct mlxsw_sp_port *mlxsw_sp_port, void *xstats_ptr) { struct red_stats *xstats_base = &mlxsw_sp_qdisc->xstats_base.red; - u8 tclass_num = 
mlxsw_sp_qdisc->tclass_num; + int tclass_num = mlxsw_sp_qdisc->tclass_num; struct mlxsw_sp_port_xstats *xstats; struct red_stats *res = xstats_ptr; int early_drops, pdrops; @@ -531,7 +531,7 @@ mlxsw_sp_qdisc_get_red_stats(struct mlxsw_sp_port *mlxsw_sp_port, struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, struct tc_qopt_offload_stats *stats_ptr) { - u8 tclass_num = mlxsw_sp_qdisc->tclass_num; + int tclass_num = mlxsw_sp_qdisc->tclass_num; struct mlxsw_sp_qdisc_stats *stats_base; struct mlxsw_sp_port_xstats *xstats; u64 overlimits; -- 2.26.2
[PATCH net-next 05/10] mlxsw: spectrum_qdisc: Promote backlog reduction to mlxsw_sp_qdisc_destroy()
When a qdisc is removed, it is necessary to update the backlog value at its parent--unless the qdisc is at root position. RED, TBF and FIFO all do that, each separately. Since all of them need to do this, just promote the operation directly to mlxsw_sp_qdisc_destroy(), instead of deferring it to individual destructors. Since FIFO dtor thus becomes trivial, remove it. Add struct mlxsw_sp_qdisc.parent to point at the parent qdisc. This will be handy later as deeper structures are offloaded. Use the parent qdisc to find the chain of parents whose backlog value needs to be updated. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- .../ethernet/mellanox/mlxsw/spectrum_qdisc.c | 48 +++ 1 file changed, 18 insertions(+), 30 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c index da1f6314df60..a8a7e9c88a4d 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c @@ -65,6 +65,7 @@ struct mlxsw_sp_qdisc { } stats_base; struct mlxsw_sp_qdisc_ops *ops; + struct mlxsw_sp_qdisc *parent; }; struct mlxsw_sp_qdisc_state { @@ -132,6 +133,15 @@ mlxsw_sp_qdisc_find_by_handle(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle) return NULL; } +static void +mlxsw_sp_qdisc_reduce_parent_backlog(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc) +{ + struct mlxsw_sp_qdisc *tmp; + + for (tmp = mlxsw_sp_qdisc->parent; tmp; tmp = tmp->parent) + tmp->stats_base.backlog -= mlxsw_sp_qdisc->stats_base.backlog; +} + static int mlxsw_sp_qdisc_destroy(struct mlxsw_sp_port *mlxsw_sp_port, struct mlxsw_sp_qdisc *mlxsw_sp_qdisc) @@ -153,7 +163,11 @@ mlxsw_sp_qdisc_destroy(struct mlxsw_sp_port *mlxsw_sp_port, err_hdroom = mlxsw_sp_hdroom_configure(mlxsw_sp_port, &hdroom); } - if (mlxsw_sp_qdisc->ops && mlxsw_sp_qdisc->ops->destroy) + if (!mlxsw_sp_qdisc->ops) + return 0; + + mlxsw_sp_qdisc_reduce_parent_backlog(mlxsw_sp_qdisc); + if 
(mlxsw_sp_qdisc->ops->destroy) err = mlxsw_sp_qdisc->ops->destroy(mlxsw_sp_port, mlxsw_sp_qdisc); @@ -417,13 +431,6 @@ static int mlxsw_sp_qdisc_red_destroy(struct mlxsw_sp_port *mlxsw_sp_port, struct mlxsw_sp_qdisc *mlxsw_sp_qdisc) { - struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc; - struct mlxsw_sp_qdisc *root_qdisc = &qdisc_state->root_qdisc; - - if (root_qdisc != mlxsw_sp_qdisc) - root_qdisc->stats_base.backlog -= - mlxsw_sp_qdisc->stats_base.backlog; - return mlxsw_sp_tclass_congestion_disable(mlxsw_sp_port, mlxsw_sp_qdisc->tclass_num); } @@ -616,13 +623,6 @@ static int mlxsw_sp_qdisc_tbf_destroy(struct mlxsw_sp_port *mlxsw_sp_port, struct mlxsw_sp_qdisc *mlxsw_sp_qdisc) { - struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc; - struct mlxsw_sp_qdisc *root_qdisc = &qdisc_state->root_qdisc; - - if (root_qdisc != mlxsw_sp_qdisc) - root_qdisc->stats_base.backlog -= - mlxsw_sp_qdisc->stats_base.backlog; - return mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port, MLXSW_REG_QEEC_HR_SUBGROUP, mlxsw_sp_qdisc->tclass_num, 0, @@ -790,19 +790,6 @@ int mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port *mlxsw_sp_port, } } -static int -mlxsw_sp_qdisc_fifo_destroy(struct mlxsw_sp_port *mlxsw_sp_port, - struct mlxsw_sp_qdisc *mlxsw_sp_qdisc) -{ - struct mlxsw_sp_qdisc_state *qdisc_state = mlxsw_sp_port->qdisc; - struct mlxsw_sp_qdisc *root_qdisc = &qdisc_state->root_qdisc; - - if (root_qdisc != mlxsw_sp_qdisc) - root_qdisc->stats_base.backlog -= - mlxsw_sp_qdisc->stats_base.backlog; - return 0; -} - static int mlxsw_sp_qdisc_fifo_check_params(struct mlxsw_sp_port *mlxsw_sp_port, void *params) @@ -832,7 +819,6 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_fifo = { .type = MLXSW_SP_QDISC_FIFO, .check_params = mlxsw_sp_qdisc_fifo_check_params, .replace = mlxsw_sp_qdisc_fifo_replace, - .destroy = mlxsw_sp_qdisc_fifo_destroy, .get_stats = mlxsw_sp_qdisc_get_fifo_stats, .clean_stats = mlxsw_sp_setup_tc_qdisc_leaf_clean_stats, }; @@ -1825,8 +1811,10 @@ int 
mlxsw_sp_tc_qdisc_init(struct mlxsw_sp_port *mlxsw_sp_port) qdisc_state->root_qdisc.prio_bitmap
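The parent-chain update that this patch moves into mlxsw_sp_qdisc_destroy() can be shown in isolation with a short user-space sketch. struct qdisc here is a hypothetical two-field stand-in (the driver keeps the value in stats_base.backlog), and qdisc_reduce_parent_backlog() is modelled on the helper in the patch rather than copied from the driver.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in: only what the backlog update needs. */
struct qdisc {
	struct qdisc *parent;	/* NULL for the root qdisc */
	uint64_t backlog;
};

/*
 * Subtract the departing qdisc's backlog from every ancestor. The root has
 * no parent, so calling this on the root changes nothing.
 */
static void qdisc_reduce_parent_backlog(struct qdisc *qdisc)
{
	struct qdisc *tmp;

	for (tmp = qdisc->parent; tmp; tmp = tmp->parent)
		tmp->backlog -= qdisc->backlog;
}
```

Walking the parent pointers replaces the per-kind "if not root, adjust root" checks and keeps working once trees are deeper than two levels.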
[PATCH net-next 02/10] mlxsw: spectrum_qdisc: Simplify mlxsw_sp_qdisc_compare()
The purpose of this function is to filter out events that are related to qdiscs that are not offloaded, or are not offloaded anymore. But the function is unnecessarily thorough: - mlxsw_sp_qdisc pointer is never NULL in the context where it is called - Two qdiscs with the same handle will never have different types. Even when replacing one qdisc with another in the same class, Linux will not permit handle reuse unless the qdisc type also matches. Simplify the function by omitting these two unnecessary conditions. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- .../ethernet/mellanox/mlxsw/spectrum_qdisc.c | 22 ++- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c index 644ffc021abe..013398ecd15b 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c @@ -87,12 +87,9 @@ struct mlxsw_sp_qdisc_state { }; static bool -mlxsw_sp_qdisc_compare(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, u32 handle, - enum mlxsw_sp_qdisc_type type) +mlxsw_sp_qdisc_compare(struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, u32 handle) { - return mlxsw_sp_qdisc && mlxsw_sp_qdisc->ops && - mlxsw_sp_qdisc->ops->type == type && - mlxsw_sp_qdisc->handle == handle; + return mlxsw_sp_qdisc->ops && mlxsw_sp_qdisc->handle == handle; } static struct mlxsw_sp_qdisc * @@ -579,8 +576,7 @@ int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port, &mlxsw_sp_qdisc_ops_red, &p->set); - if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle, - MLXSW_SP_QDISC_RED)) + if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle)) return -EOPNOTSUPP; switch (p->command) { @@ -780,8 +776,7 @@ int mlxsw_sp_setup_tc_tbf(struct mlxsw_sp_port *mlxsw_sp_port, &mlxsw_sp_qdisc_ops_tbf, &p->replace_params); - if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle, - MLXSW_SP_QDISC_TBF)) + if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle)) 
return -EOPNOTSUPP; switch (p->command) { @@ -886,8 +881,7 @@ int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port, &mlxsw_sp_qdisc_ops_fifo, NULL); } - if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle, - MLXSW_SP_QDISC_FIFO)) + if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle)) return -EOPNOTSUPP; switch (p->command) { @@ -1247,8 +1241,7 @@ int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port, &mlxsw_sp_qdisc_ops_prio, &p->replace_params); - if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle, - MLXSW_SP_QDISC_PRIO)) + if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle)) return -EOPNOTSUPP; switch (p->command) { @@ -1280,8 +1273,7 @@ int mlxsw_sp_setup_tc_ets(struct mlxsw_sp_port *mlxsw_sp_port, &mlxsw_sp_qdisc_ops_ets, &p->replace_params); - if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle, - MLXSW_SP_QDISC_ETS)) + if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle)) return -EOPNOTSUPP; switch (p->command) { -- 2.26.2
[PATCH net-next 03/10] mlxsw: spectrum_qdisc: Drop an always-true condition
The function mlxsw_sp_qdisc_compare() is invoked a couple lines above this
check, which will bounce any requests where this condition does not hold.
Therefore drop it.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 013398ecd15b..f1d32bfc4bed 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -886,10 +886,7 @@ int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port,
 
 	switch (p->command) {
 	case TC_FIFO_DESTROY:
-		if (p->handle == mlxsw_sp_qdisc->handle)
-			return mlxsw_sp_qdisc_destroy(mlxsw_sp_port,
-						      mlxsw_sp_qdisc);
-		return 0;
+		return mlxsw_sp_qdisc_destroy(mlxsw_sp_port, mlxsw_sp_qdisc);
 	case TC_FIFO_STATS:
 		return mlxsw_sp_qdisc_get_stats(mlxsw_sp_port, mlxsw_sp_qdisc,
 						&p->stats);
-- 
2.26.2
[PATCH net-next 01/10] mlxsw: spectrum_qdisc: Drop one argument from check_params callback
The mlxsw_sp_qdisc argument is not used in any of the actual callbacks. Drop it. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c | 8 +--- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c index baf17c0b2702..644ffc021abe 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c @@ -29,7 +29,6 @@ struct mlxsw_sp_qdisc; struct mlxsw_sp_qdisc_ops { enum mlxsw_sp_qdisc_type type; int (*check_params)(struct mlxsw_sp_port *mlxsw_sp_port, - struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params); int (*replace)(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle, struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params); @@ -198,7 +197,7 @@ mlxsw_sp_qdisc_replace(struct mlxsw_sp_port *mlxsw_sp_port, u32 handle, goto err_hdroom_configure; } - err = ops->check_params(mlxsw_sp_port, mlxsw_sp_qdisc, params); + err = ops->check_params(mlxsw_sp_port, params); if (err) goto err_bad_param; @@ -434,7 +433,6 @@ mlxsw_sp_qdisc_red_destroy(struct mlxsw_sp_port *mlxsw_sp_port, static int mlxsw_sp_qdisc_red_check_params(struct mlxsw_sp_port *mlxsw_sp_port, - struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params) { struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp; @@ -678,7 +676,6 @@ mlxsw_sp_qdisc_tbf_rate_kbps(struct tc_tbf_qopt_offload_replace_params *p) static int mlxsw_sp_qdisc_tbf_check_params(struct mlxsw_sp_port *mlxsw_sp_port, - struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params) { struct tc_tbf_qopt_offload_replace_params *p = params; @@ -813,7 +810,6 @@ mlxsw_sp_qdisc_fifo_destroy(struct mlxsw_sp_port *mlxsw_sp_port, static int mlxsw_sp_qdisc_fifo_check_params(struct mlxsw_sp_port *mlxsw_sp_port, -struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params) { return 0; @@ -948,7 +944,6 @@ __mlxsw_sp_qdisc_ets_check_params(unsigned int nbands) static 
int mlxsw_sp_qdisc_prio_check_params(struct mlxsw_sp_port *mlxsw_sp_port, -struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params) { struct tc_prio_qopt_offload_params *p = params; @@ -1124,7 +1119,6 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_prio = { static int mlxsw_sp_qdisc_ets_check_params(struct mlxsw_sp_port *mlxsw_sp_port, - struct mlxsw_sp_qdisc *mlxsw_sp_qdisc, void *params) { struct tc_ets_qopt_offload_replace_params *p = params; -- 2.26.2
[PATCH net-next 00/10] mlxsw: Refactor qdisc offload
Currently, mlxsw admits for offload a suitable root qdisc, and its
children. Thus up to two levels of hierarchy are offloaded. Often, this is
enough: one can configure TCs with RED and TCs with a shaper, and can even
see counters for each TC by looking at a qdisc at a sufficiently shallow
position.

While simple, the system has obvious shortcomings. It is not possible to
configure both RED and shaping on one TC. It is not possible to place a
PRIO below root TBF, which would then be offloaded as port shaper. FIFOs
are only offloaded at root or directly below, which is confusing to users,
because RED and TBF of course have their own FIFO.

This patchset is a step towards the end goal of allowing more comprehensive
qdisc tree offload and cleans up the qdisc offload code.

- Patches #1-#4 contain small cleanups.

- Up until now, since mlxsw offloaded only very simple qdisc
  configurations, basically all bookkeeping was done using one container
  for the root qdisc, and 8 containers for its children. Patches #5, #6, #8
  and #9 gradually introduce a more dynamic structure, where parent-child
  relationships are tracked directly at qdiscs, instead of being implicit.

- This tree management assumes only one qdisc is created at a time. In FIFO
  handlers, this condition was enforced simply by asserting RTNL lock. But
  instead of furthering this RTNL dependence, patch #7 converts the whole
  qdisc offload logic to a per-port mutex.

- Patch #10 adds a selftest.
Petr Machata (10):
  mlxsw: spectrum_qdisc: Drop one argument from check_params callback
  mlxsw: spectrum_qdisc: Simplify mlxsw_sp_qdisc_compare()
  mlxsw: spectrum_qdisc: Drop an always-true condition
  mlxsw: spectrum_qdisc: Track tclass_num as int, not u8
  mlxsw: spectrum_qdisc: Promote backlog reduction to mlxsw_sp_qdisc_destroy()
  mlxsw: spectrum_qdisc: Track children per qdisc
  mlxsw: spectrum_qdisc: Guard all qdisc accesses with a lock
  mlxsw: spectrum_qdisc: Allocate child qdiscs dynamically
  mlxsw: spectrum_qdisc: Index future FIFOs by band number
  selftests: mlxsw: sch_red_ets: Test proper counter cleaning in ETS

 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 448 --
 .../drivers/net/mlxsw/sch_red_ets.sh          |   7 +
 2 files changed, 306 insertions(+), 149 deletions(-)

-- 
2.26.2
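The "more dynamic structure" described in the cover letter can be pictured with a small userspace model. This is a sketch under stated assumptions, not the driver's code: all `sketch_` names are hypothetical, the children are a plain heap-allocated array, and the per-port mutex from patch #7 is only mentioned in comments rather than modeled.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model: each qdisc records its parent and owns its children,
 * instead of the port keeping one root slot plus a fixed set of child slots.
 */
struct sketch_qdisc {
	uint32_t handle;		/* 0 means "slot not in use" */
	struct sketch_qdisc *parent;
	struct sketch_qdisc *children;	/* num_classes slots, or NULL */
	unsigned int num_classes;
};

/* Allocate child slots for a classful qdisc. Returns 0 on success, -1 on
 * allocation failure. In the driver, callers would hold the per-port lock.
 */
static int sketch_qdisc_set_classes(struct sketch_qdisc *q, unsigned int n)
{
	unsigned int i;

	q->children = calloc(n, sizeof(*q->children));
	if (!q->children)
		return -1;
	q->num_classes = n;
	for (i = 0; i < n; i++)
		q->children[i].parent = q;
	return 0;
}

/* Depth-first search for a non-zero handle in the subtree rooted at q. */
static struct sketch_qdisc *sketch_qdisc_find(struct sketch_qdisc *q,
					      uint32_t handle)
{
	unsigned int i;

	if (handle && q->handle == handle)
		return q;
	for (i = 0; i < q->num_classes; i++) {
		struct sketch_qdisc *hit;

		hit = sketch_qdisc_find(&q->children[i], handle);
		if (hit)
			return hit;
	}
	return NULL;
}

/* Free everything below q and mark q itself as unused. */
static void sketch_qdisc_destroy(struct sketch_qdisc *q)
{
	unsigned int i;

	for (i = 0; i < q->num_classes; i++)
		sketch_qdisc_destroy(&q->children[i]);
	free(q->children);
	q->children = NULL;
	q->num_classes = 0;
	q->handle = 0;
}
```

Because the parent link lives in the qdisc itself, a lookup by handle no longer depends on the tree being at most two levels deep.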
Re: [PATCH net-next 1/7] net: sched: Add a trap-and-forward action
Jamal Hadi Salim writes:

> On 2021-04-09 7:03 a.m., Petr Machata wrote:
>> Jamal Hadi Salim writes:
>>
>>> I am concerned about adding new opcodes which only make sense if you
>>> offload (or make sense only if you are running in s/w).
>>>
>>> Those opcodes are intended to be generic abstractions so the dispatcher
>>> can decide what to do next.
>>> [...]
>>> For details see:
>>> https://people.netfilter.org/pablo/netdev0.1/papers/Linux-Traffic-Control-Classifier-Action-Subsystem-Architecture.pdf
>>
>> Trap has been in since 4.13, so 2017ish. It's done and dusted at this
>> point.
>
> here's how it translates:
> "We already made a mistake, therefore, its ok to build on it and
> make more mistakes".

I can see how it reads that way, but that was not the intention. I was
actually thinking about whether there might be a way to gradually migrate
all this stuff over to mirred, but at this point, trap is very much baked
in.

>>> IMO:
>>> It seems to me there are two actions here encapsulated in one.
>>> The first is to "trap" and the second is to "drop".
>>>
>>> This is no different semantically than say "mirror and drop"
>>> offload being enunciated by "skip_sw".
>>>
>>> Does the spectrum not support multiple actions?
>>> e.g with a policy like:
>>> match blah action trap action drop skip_sw
>> Trap drops implicitly. We need a "trap, but don't drop". Expressed in
>> terms of existing actions it would be "mirred egress redirect dev
>> $cpu_port". But how to express $cpu_port except again by a HW-specific
>> magic token I don't know.

(I meant mirred egress mirror, not redirect.)

> Note: mirred was originally intended to send redirect/mirror
> packets to user space (the comment is still there in the code).
> Infact there is a patch lying around somewhere that does that with
> packet sockets (the author hasnt been serious about pushing it
> upstream). In that case the semantics are redirecting to a file
> descriptor. Could we have something like that here which points
> to whatever representation $cpu_port has? Sounds like semantics
> for "trap and forward" are just "mirror and forward".

Hmm, we have devlink ports, the CPU port is exposed there. But that's the
only thing that comes to mind. Those are specific for the given device
though, it doesn't look suitable...

> I think there is value in having something like trap action
> which generalizes the combinations only to the fact that
> it will make it easier to relay the info to the offload without
> much transformation.
> If i was to do it i would write one action configured by user space:
> - to return DROP if you want action trap-and-drop semantics.
> - to return STOLEN if you want trap
> - to return PIPE if you want trap and forward. You will need a second
> action composed to forward.

I think your STOLEN and PIPE are the same behavior. Both are "transfer the
packet to the SW datapath, but keep it in the HW datapath".

In general I have no issue expressing this stuff as a new action, instead
of an opcode. I'll take a look at this.
Re: [PATCH net-next 1/7] net: sched: Add a trap-and-forward action
Jamal Hadi Salim writes:

> I am concerned about adding new opcodes which only make sense if you
> offload (or make sense only if you are running in s/w).
>
> Those opcodes are intended to be generic abstractions so the dispatcher
> can decide what to do next. Adding things that are specific only
> to scenarios of hardware offload removes that opaqueness.
> I must have missed the discussion on ACT_TRAP because it is the
> same issue there i.e shouldnt be an opcode. For details see:
> https://people.netfilter.org/pablo/netdev0.1/papers/Linux-Traffic-Control-Classifier-Action-Subsystem-Architecture.pdf

Trap has been in since 4.13, so 2017ish. It's done and dusted at this
point.

> IMO:
> It seems to me there are two actions here encapsulated in one.
> The first is to "trap" and the second is to "drop".
>
> This is no different semantically than say "mirror and drop"
> offload being enunciated by "skip_sw".
>
> Does the spectrum not support multiple actions?
> e.g with a policy like:
> match blah action trap action drop skip_sw

Trap drops implicitly. We need a "trap, but don't drop". Expressed in terms
of existing actions it would be "mirred egress redirect dev $cpu_port". But
how to express $cpu_port except again by a HW-specific magic token I don't
know.
[PATCH net-next 7/7] selftests: mlxsw: Add a trap_fwd test to devlink_trap_control
Test that trap_fwd'd packets show up under the correct trap. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- .../drivers/net/mlxsw/devlink_trap_control.sh | 23 --- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_control.sh b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_control.sh index a37273473c1b..8bca4c58819b 100755 --- a/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_control.sh +++ b/tools/testing/selftests/drivers/net/mlxsw/devlink_trap_control.sh @@ -83,6 +83,7 @@ ALL_TESTS=" ptp_general_test flow_action_sample_test flow_action_trap_test + flow_action_trap_fwd_test " NUM_NETIFS=4 source $lib_dir/lib.sh @@ -663,14 +664,18 @@ flow_action_sample_test() tc qdisc del dev $rp1 clsact } -flow_action_trap_test() +__flow_action_trap_test() { + local action=$1; shift + local trap=$1; shift + local description=$1; shift + # Install a filter that traps a specific flow. tc qdisc add dev $rp1 clsact tc filter add dev $rp1 ingress proto ip pref 1 handle 101 flower \ - skip_sw ip_proto udp src_port 12345 dst_port 54321 action trap + skip_sw ip_proto udp src_port 12345 dst_port 54321 action $action - devlink_trap_stats_test "Flow Trapping (Logging)" "flow_action_trap" \ + devlink_trap_stats_test "$description" $trap \ $MZ $h1 -c 1 -a own -b $(mac_get $rp1) \ -A 192.0.2.1 -B 198.51.100.1 -t udp sp=12345,dp=54321 -p 100 -q @@ -678,6 +683,18 @@ flow_action_trap_test() tc qdisc del dev $rp1 clsact } +flow_action_trap_test() +{ + __flow_action_trap_test trap flow_action_trap \ + "Flow Trapping (Logging)" +} + +flow_action_trap_fwd_test() +{ + __flow_action_trap_test trap_fwd flow_action_trap_fwd \ + "Flow Trap-and-forwarding (Logging)" +} + trap cleanup EXIT setup_prepare -- 2.26.2
[PATCH net-next 6/7] selftests: forwarding: Add a test for TC trapping behavior
Test that trapped packets are forwarded through the SW datapath, whereas trap_fwd'd ones are not (but are forwarded through HW datapath). For completeness' sake, also test that "pass" (i.e. lack of trapping) simply forwards the packets in the HW datapath. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- .../selftests/net/forwarding/tc_trap.sh | 170 ++ 1 file changed, 170 insertions(+) create mode 100755 tools/testing/selftests/net/forwarding/tc_trap.sh diff --git a/tools/testing/selftests/net/forwarding/tc_trap.sh b/tools/testing/selftests/net/forwarding/tc_trap.sh new file mode 100755 index ..56336cea45a2 --- /dev/null +++ b/tools/testing/selftests/net/forwarding/tc_trap.sh @@ -0,0 +1,170 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +# In the following simple routing scenario, put SW datapath packet probes on +# $swp1, $swp2 and $h2. Always expect packets to arrive at $h2. Depending on +# whether, in the HW datapath, $swp1 lets packets pass, traps them, or +# traps_forwards them, $swp1 and $swp2 probes are expected to give different +# results. 
+# +# +--+ +--+ +# | H1 | | H2 | +# |+ $h1 | |$h2 + | +# || 192.0.2.1/28| | 192.0.2.18/28 | | +# +|-+ +|-+ +# || +# +||-+ +# | SW || | +# |+ $swp1$swp2 + | +# | 192.0.2.2/28 192.0.2.17/28 | +# +---+ + + +ALL_TESTS=" + no_trap_test + trap_fwd_test + trap_test +" + +NUM_NETIFS=4 +source lib.sh +source tc_common.sh + +h1_create() +{ + simple_if_init $h1 192.0.2.1/28 + ip route add vrf v$h1 192.0.2.16/28 via 192.0.2.2 +} + +h1_destroy() +{ + ip route del vrf v$h1 192.0.2.16/28 via 192.0.2.2 + simple_if_fini $h1 192.0.2.1/28 +} + +h2_create() +{ + simple_if_init $h2 192.0.2.18/28 + ip route add vrf v$h2 192.0.2.0/28 via 192.0.2.17 + tc qdisc add dev $h2 clsact +} + +h2_destroy() +{ + tc qdisc del dev $h2 clsact + ip route del vrf v$h2 192.0.2.0/28 via 192.0.2.17 + simple_if_fini $h2 192.0.2.18/28 +} + +switch_create() +{ + simple_if_init $swp1 192.0.2.2/28 + __simple_if_init $swp2 v$swp1 192.0.2.17/28 + + tc qdisc add dev $swp1 clsact + tc qdisc add dev $swp2 clsact +} + +switch_destroy() +{ + tc qdisc del dev $swp2 clsact + tc qdisc del dev $swp1 clsact + + __simple_if_fini $swp2 192.0.2.17/28 + simple_if_fini $swp1 192.0.2.2/28 +} + +setup_prepare() +{ + h1=${NETIFS[p1]} + swp1=${NETIFS[p2]} + + swp2=${NETIFS[p3]} + h2=${NETIFS[p4]} + + vrf_prepare + forwarding_enable + + h1_create + h2_create + switch_create +} + +cleanup() +{ + pre_cleanup + + switch_destroy + h2_destroy + h1_destroy + + forwarding_restore + vrf_cleanup +} + +__test() +{ + local action=$1; shift + local ingress_should_fail=$1; shift + local egress_should_fail=$1; shift + + tc filter add dev $swp1 ingress protocol ip pref 2 handle 101 \ + flower skip_sw dst_ip 192.0.2.18 action $action + tc filter add dev $swp1 ingress protocol ip pref 1 handle 102 \ + flower skip_hw dst_ip 192.0.2.18 action pass + tc filter add dev $swp2 egress protocol ip pref 1 handle 103 \ + flower skip_hw dst_ip 192.0.2.18 action pass + tc filter add dev $h2 ingress protocol ip pref 1 handle 104 \ + flower dst_ip 192.0.2.18 
action drop + + RET=0 + + $MZ $h1 -c 1 -p 64 -a $(mac_get $h1) -b $(mac_get $swp1) \ + -A 192.0.2.1 -B 192.0.2.18 -q -t ip + + tc_check_packets "dev $swp1 ingress" 102 1 + check_err_fail $ingress_should_fail $? "ingress should_fail $ingress_should_fail" + + tc_check_packets "dev $swp2 egress" 103 1 + check_err_fail $egress_should_fail $? "egress should_fail $egress_should_fail" + + tc_check_packets "dev $h2 ingress" 104 1 + check_err $? "Did not see the packet on host" + + log_test "$action test" + + tc filter del dev $h2 ingress protocol ip pref 1 handle 104 flower + tc filter del dev $swp2 egress protocol ip pref 1 handle 103 flower + tc filter del dev $swp1 ingress protocol ip p
[PATCH net-next 1/7] net: sched: Add a trap-and-forward action
The TC action "trap" is used to instruct the HW datapath to drop the
matched packet and transfer it for processing in the SW pipeline. If
instead it is desirable to forward the packet and transfer a _copy_ to the
SW pipeline, there is no practical way to achieve that.

To that end add a new generic action, trap_fwd. In the software pipeline,
it is equivalent to an OK. When offloading, it should forward the packet to
the host, but unlike trap it should not drop the packet.

Signed-off-by: Petr Machata
Reviewed-by: Jiri Pirko
Reviewed-by: Ido Schimmel
---
 include/uapi/linux/pkt_cls.h       |  6 +-
 net/core/dev.c                     |  2 ++
 net/sched/act_bpf.c                | 13 +++--
 net/sched/cls_bpf.c                |  1 +
 net/sched/sch_dsmark.c             |  1 +
 tools/include/uapi/linux/pkt_cls.h |  6 +-
 6 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index 025c40fef93d..a1bbccb88e67 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -72,7 +72,11 @@ enum {
 				   * the skb and act like everything
 				   * is alright.
 				   */
-#define TC_ACT_VALUE_MAX	TC_ACT_TRAP
+#define TC_ACT_TRAP_FWD		9 /* For hw path, this means "send a copy
+				   * of the packet to the cpu". For sw
+				   * datapath, this is like TC_ACT_OK.
+				   */
+#define TC_ACT_VALUE_MAX	TC_ACT_TRAP_FWD
 
 /* There is a special kind of actions called "extended actions",
  * which need a value parameter.
These have a local opcode located in diff --git a/net/core/dev.c b/net/core/dev.c index 9d1a8fac793f..f0b8c16dbf12 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3975,6 +3975,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) { case TC_ACT_OK: case TC_ACT_RECLASSIFY: + case TC_ACT_TRAP_FWD: skb->tc_index = TC_H_MIN(cl_res.classid); break; case TC_ACT_SHOT: @@ -5083,6 +5084,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret, &cl_res, false)) { case TC_ACT_OK: case TC_ACT_RECLASSIFY: + case TC_ACT_TRAP_FWD: skb->tc_index = TC_H_MIN(cl_res.classid); break; case TC_ACT_SHOT: diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index e48e980c3b93..be2a51c6f84e 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -54,8 +54,16 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act, bpf_compute_data_pointers(skb); filter_res = BPF_PROG_RUN(filter, skb); } - if (skb_sk_is_prefetched(skb) && filter_res != TC_ACT_OK) - skb_orphan(skb); + if (skb_sk_is_prefetched(skb)) { + switch (filter_res) { + case TC_ACT_OK: + case TC_ACT_TRAP_FWD: + break; + default: + skb_orphan(skb); + break; + } + } rcu_read_unlock(); /* A BPF program may overwrite the default action opcode. 
@@ -72,6 +80,7 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act, case TC_ACT_PIPE: case TC_ACT_RECLASSIFY: case TC_ACT_OK: + case TC_ACT_TRAP_FWD: case TC_ACT_REDIRECT: action = filter_res; break; diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index 6e3e63db0e01..5fd96cf2dca7 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -69,6 +69,7 @@ static int cls_bpf_exec_opcode(int code) case TC_ACT_SHOT: case TC_ACT_STOLEN: case TC_ACT_TRAP: + case TC_ACT_TRAP_FWD: case TC_ACT_REDIRECT: case TC_ACT_UNSPEC: return code; diff --git a/net/sched/sch_dsmark.c b/net/sched/sch_dsmark.c index cd2748e2d4a2..054a06bd9dc8 100644 --- a/net/sched/sch_dsmark.c +++ b/net/sched/sch_dsmark.c @@ -258,6 +258,7 @@ static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch, goto drop; #endif case TC_ACT_OK: + case TC_ACT_TRAP_FWD: skb->tc_index = TC_H_MIN(res.classid); break; diff --git a/tools/include/uapi/linux/pkt_cls.h b/tools/include/uapi/linux/pkt_cls.h index 12153771396a..ccfa424dfeaf 100644 --- a/tools/include/uapi/linux/pkt_cls.h +++ b/tools/include/uapi/linux/pkt_cls.h @@ -45,7 +45,11 @@ enum { * the skb and act like
[PATCH net-next 5/7] mlxsw: Offload trap_fwd
Offload the TC action trap_fwd. This is offloaded as a TRAP_ACTION with
forward_action of FORWARD (as opposed to NOP for the trap action). Unlike
trap, trap_fwd needs to be in a "goto"-typed action set, not a "next"-typed
one. Trap_fwd'd traffic is marked with offload_fwd_mark and
offload_l3_fwd_mark to prevent second forwarding in the SW datapath.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
---
 .../mellanox/mlxsw/core_acl_flex_actions.c    | 23 +++
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |  1 +
 .../ethernet/mellanox/mlxsw/spectrum_acl.c    |  6 +
 .../ethernet/mellanox/mlxsw/spectrum_flower.c |  7 ++
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   |  8 +++
 drivers/net/ethernet/mellanox/mlxsw/trap.h    |  2 ++
 6 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
index faa90cc31376..d7d7e688139f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
@@ -94,7 +94,8 @@ struct mlxsw_afa_set {
 		      * kvdl_index is valid).
 		      */
 	   has_trap:1,
-	   has_police:1;
+	   has_police:1,
+	   has_trap_fwd:1;
 	unsigned int ref_count;
 	struct mlxsw_afa_set *next; /* Pointer to the next set. */
 	struct mlxsw_afa_set *prev; /* Pointer to the previous set,
@@ -263,14 +264,23 @@ static void mlxsw_afa_set_goto_set(struct mlxsw_afa_set *set,
 	mlxsw_afa_set_goto_next_binding_set(actions, group_id);
 }
 
-static void mlxsw_afa_set_next_set(struct mlxsw_afa_set *set,
+static int mlxsw_afa_set_next_set(struct mlxsw_afa_set *set,
 				   u32 next_set_kvdl_index,
 				   struct netlink_ext_ack *extack)
 {
 	char *actions = set->ht_key.enc_actions;
 
+	/* If the forwarding action is not drop, the next/goto record must not
+	 * be a next, it must be a goto.
+*/ + if (set->has_trap_fwd) { + NL_SET_ERR_MSG_MOD(extack, "Only goto permissible after a trap_fwd action"); + return -EINVAL; + } + mlxsw_afa_set_type_set(actions, MLXSW_AFA_SET_TYPE_NEXT); mlxsw_afa_set_next_action_set_ptr_set(actions, next_set_kvdl_index); + return 0; } static struct mlxsw_afa_set *mlxsw_afa_set_create(bool is_first) @@ -461,6 +471,7 @@ int mlxsw_afa_block_commit(struct mlxsw_afa_block *block, { struct mlxsw_afa_set *set = block->cur_set; struct mlxsw_afa_set *prev_set; + int err; block->cur_set = NULL; block->finished = true; @@ -481,8 +492,10 @@ int mlxsw_afa_block_commit(struct mlxsw_afa_block *block, return PTR_ERR(set); if (prev_set) { prev_set->next = set; - mlxsw_afa_set_next_set(prev_set, set->kvdl_index, - extack); + err = mlxsw_afa_set_next_set(prev_set, set->kvdl_index, +extack); + if (err) + return err; set = prev_set; } } while (prev_set); @@ -1346,6 +1359,8 @@ int mlxsw_afa_block_append_trap_and_forward(struct mlxsw_afa_block *block, if (IS_ERR(act)) return PTR_ERR(act); + + block->cur_set->has_trap_fwd = true; mlxsw_afa_trap_pack(act, MLXSW_AFA_TRAP_TRAP_ACTION_TRAP, MLXSW_AFA_TRAP_FORWARD_ACTION_FORWARD, trap_id); return 0; diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h index d74fc7ff8083..6067a049dcf2 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h @@ -940,6 +940,7 @@ int mlxsw_sp_acl_rulei_act_drop(struct mlxsw_sp_acl_rule_info *rulei, const struct flow_action_cookie *fa_cookie, struct netlink_ext_ack *extack); int mlxsw_sp_acl_rulei_act_trap(struct mlxsw_sp_acl_rule_info *rulei); +int mlxsw_sp_acl_rulei_act_trap_fwd(struct mlxsw_sp_acl_rule_info *rulei); int mlxsw_sp_acl_rulei_act_mirror(struct mlxsw_sp *mlxsw_sp, struct mlxsw_sp_acl_rule_info *rulei, struct mlxsw_sp_flow_block *block, diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c 
index b9c4c1feba6d..6f7913424bd9 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c @@ -401,6 +401,12 @@ int
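The next-versus-goto restriction introduced by this patch can be illustrated with a standalone model. This is a sketch, not the mlxsw code: the `sketch_` names and the reduced struct are assumptions made for this example, and the extack message reporting is left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reduction of an action set to the fields that matter here. */
enum sketch_set_type {
	SKETCH_SET_TYPE_GOTO,
	SKETCH_SET_TYPE_NEXT,
};

struct sketch_action_set {
	enum sketch_set_type type;
	bool has_trap_fwd;		/* set when a trap_fwd action was appended */
	unsigned int next_index;
};

/* Chain this set to a following set. A set that carries trap_fwd must stay
 * "goto"-typed, so the request to turn it into a "next" is refused and the
 * set is left untouched.
 */
static int sketch_set_next_set(struct sketch_action_set *set,
			       unsigned int next_index)
{
	if (set->has_trap_fwd)
		return -EINVAL;

	set->type = SKETCH_SET_TYPE_NEXT;
	set->next_index = next_index;
	return 0;
}
```

Returning an error instead of void is what makes the extack propagation from the previous patch useful: the commit path can now fail and tell the user why.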
[PATCH net-next 4/7] mlxsw: Propagate extack to mlxsw_afa_block_commit()
In the following patch, attempts to change the next/goto of a flexible action set from goto to next will be rejected for action sets that contain a trap_fwd action. Propagate extack to make it possible to communicate the issue to the user. Signed-off-by: Petr Machata Reviewed-by: Jiri Pirko Reviewed-by: Ido Schimmel --- .../net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c | 9 ++--- .../net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h | 3 ++- drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 3 ++- drivers/net/ethernet/mellanox/mlxsw/spectrum1_acl_tcam.c | 2 +- drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c | 5 +++-- drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c| 2 +- drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c | 2 +- 7 files changed, 16 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c index 78d9c0196f2b..faa90cc31376 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c +++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c @@ -264,7 +264,8 @@ static void mlxsw_afa_set_goto_set(struct mlxsw_afa_set *set, } static void mlxsw_afa_set_next_set(struct mlxsw_afa_set *set, - u32 next_set_kvdl_index) + u32 next_set_kvdl_index, + struct netlink_ext_ack *extack) { char *actions = set->ht_key.enc_actions; @@ -455,7 +456,8 @@ void mlxsw_afa_block_destroy(struct mlxsw_afa_block *block) } EXPORT_SYMBOL(mlxsw_afa_block_destroy); -int mlxsw_afa_block_commit(struct mlxsw_afa_block *block) +int mlxsw_afa_block_commit(struct mlxsw_afa_block *block, + struct netlink_ext_ack *extack) { struct mlxsw_afa_set *set = block->cur_set; struct mlxsw_afa_set *prev_set; @@ -479,7 +481,8 @@ int mlxsw_afa_block_commit(struct mlxsw_afa_block *block) return PTR_ERR(set); if (prev_set) { prev_set->next = set; - mlxsw_afa_set_next_set(prev_set, set->kvdl_index); + mlxsw_afa_set_next_set(prev_set, set->kvdl_index, + extack); set = 
prev_set; } } while (prev_set); diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h index b65bf98eb5ab..24350f9470f8 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h +++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h @@ -45,7 +45,8 @@ struct mlxsw_afa *mlxsw_afa_create(unsigned int max_acts_per_set, void mlxsw_afa_destroy(struct mlxsw_afa *mlxsw_afa); struct mlxsw_afa_block *mlxsw_afa_block_create(struct mlxsw_afa *mlxsw_afa); void mlxsw_afa_block_destroy(struct mlxsw_afa_block *block); -int mlxsw_afa_block_commit(struct mlxsw_afa_block *block); +int mlxsw_afa_block_commit(struct mlxsw_afa_block *block, + struct netlink_ext_ack *extack); char *mlxsw_afa_block_first_set(struct mlxsw_afa_block *block); char *mlxsw_afa_block_cur_set(struct mlxsw_afa_block *block); u32 mlxsw_afa_block_first_kvdl_index(struct mlxsw_afa_block *block); diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h index f99db88ee884..d74fc7ff8083 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h @@ -920,7 +920,8 @@ struct mlxsw_sp_acl_rule_info * mlxsw_sp_acl_rulei_create(struct mlxsw_sp_acl *acl, struct mlxsw_afa_block *afa_block); void mlxsw_sp_acl_rulei_destroy(struct mlxsw_sp_acl_rule_info *rulei); -int mlxsw_sp_acl_rulei_commit(struct mlxsw_sp_acl_rule_info *rulei); +int mlxsw_sp_acl_rulei_commit(struct mlxsw_sp_acl_rule_info *rulei, + struct netlink_ext_ack *extack); void mlxsw_sp_acl_rulei_priority(struct mlxsw_sp_acl_rule_info *rulei, unsigned int priority); void mlxsw_sp_acl_rulei_keymask_u32(struct mlxsw_sp_acl_rule_info *rulei, diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum1_acl_tcam.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum1_acl_tcam.c index 3a636f753607..cda04bc4453f 100644 --- 
a/drivers/net/ethernet/mellanox/mlxsw/spectrum1_acl_tcam.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum1_acl_tcam.c @@ -75,7 +75,7 @@ mlxsw_sp1_acl_ctcam_region_catchall_add(struct mlxsw_sp *mlxsw_sp, err = mlxsw_sp_acl_rulei_act_continue(rulei); if (WARN_ON(err)) goto err_rulei_act_continue; - err = mlxsw_sp_acl_rulei_commit(rulei); + err = mlxsw_sp_acl_rulei_commit(rulei, NULL); if (err) goto err_rulei_comm
[PATCH net-next 0/7] tc: Introduce a trap-and-forward action
The TC action "trap" is used to instruct the HW datapath to drop the
matched packet and transfer it to the host for processing in the SW
pipeline. If instead it is desirable to forward the packet in the HW
datapath, and to transfer a _copy_ to the SW pipeline, there is no
practical way to achieve that.

As a particular use case, the mlxsw driver could instruct a Spectrum
machine to mirror packets that are ECN-marked to the host. However these
packets are still forwarded in the HW datapath, therefore describing this
mirroring through the "trap" action is incorrect. A new action is needed.

To that end, this patchset introduces a new generic action, trap_fwd. In
the software pipeline, it is equivalent to an OK. When offloading, it
should forward the packet to the host, but unlike trap it should not drop
the packet.

This patchset proceeds as follows:

- In patch #1, introduce the new action, and modify the TC code to
  recognize it as an OK.

- In patches #2 and #3, introduce the artifacts necessary for offloading
  the trap_fwd action, and a new trap so that drivers can report the
  trapped packets.

- Patches #4 and #5 offload trap_fwd in mlxsw.

- Patches #6 and #7 add selftests.
Petr Machata (7):
  net: sched: Add a trap-and-forward action
  net: sched: Make the action trap_fwd offloadable
  devlink: Add a new trap for the trap_fwd action
  mlxsw: Propagate extack to mlxsw_afa_block_commit()
  mlxsw: Offload trap_fwd
  selftests: forwarding: Add a test for TC trapping behavior
  selftests: mlxsw: Add a trap_fwd test to devlink_trap_control

 .../networking/devlink/devlink-trap.rst       |   4 +
 .../mellanox/mlxsw/core_acl_flex_actions.c    |  28 ++-
 .../mellanox/mlxsw/core_acl_flex_actions.h    |   3 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |   4 +-
 .../mellanox/mlxsw/spectrum1_acl_tcam.c       |   2 +-
 .../ethernet/mellanox/mlxsw/spectrum_acl.c    |  11 +-
 .../ethernet/mellanox/mlxsw/spectrum_flower.c |   9 +-
 .../mellanox/mlxsw/spectrum_mr_tcam.c         |   2 +-
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   |   8 +
 drivers/net/ethernet/mellanox/mlxsw/trap.h    |   2 +
 include/net/devlink.h                         |   3 +
 include/net/flow_offload.h                    |   1 +
 include/net/tc_act/tc_gact.h                  |   5 +
 include/uapi/linux/pkt_cls.h                  |   6 +-
 net/core/dev.c                                |   2 +
 net/core/devlink.c                            |   1 +
 net/sched/act_bpf.c                           |  13 +-
 net/sched/cls_api.c                           |   2 +
 net/sched/cls_bpf.c                           |   1 +
 net/sched/sch_dsmark.c                        |   1 +
 tools/include/uapi/linux/pkt_cls.h            |   6 +-
 .../drivers/net/mlxsw/devlink_trap_control.sh |  23 ++-
 .../selftests/net/forwarding/tc_trap.sh       | 170 ++
 23 files changed, 288 insertions(+), 19 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_trap.sh

-- 
2.26.2
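The statement that trap_fwd is "equivalent to an OK" in the software pipeline can be modeled roughly as below. This is a simplified sketch, not the code in net/core/dev.c: the `SKETCH_` names are hypothetical, only a handful of opcodes are covered, and the numeric values are the ones from include/uapi/linux/pkt_cls.h, with 9 being the value proposed for TC_ACT_TRAP_FWD in patch #1.

```c
/* Standalone copies of a few TC opcodes; prefixed to make clear that this
 * is a model and not the UAPI header.
 */
enum sketch_tc_act {
	SKETCH_TC_ACT_OK = 0,
	SKETCH_TC_ACT_RECLASSIFY = 1,
	SKETCH_TC_ACT_SHOT = 2,
	SKETCH_TC_ACT_TRAP = 8,
	SKETCH_TC_ACT_TRAP_FWD = 9,	/* proposed by this series */
};

/* What the software datapath does with the packet after classification. */
enum sketch_sw_outcome {
	SKETCH_SW_CONTINUE,	/* packet keeps travelling through the stack */
	SKETCH_SW_DROP,		/* packet is dropped and counted as such */
	SKETCH_SW_CONSUMED,	/* packet is silently taken off the path */
};

static enum sketch_sw_outcome sketch_sw_verdict(int verdict)
{
	switch (verdict) {
	case SKETCH_TC_ACT_OK:
	case SKETCH_TC_ACT_RECLASSIFY:
	case SKETCH_TC_ACT_TRAP_FWD:	/* new: handled exactly like OK */
		return SKETCH_SW_CONTINUE;
	case SKETCH_TC_ACT_SHOT:
		return SKETCH_SW_DROP;
	case SKETCH_TC_ACT_TRAP:	/* assumption: treated like a stolen packet */
		return SKETCH_SW_CONSUMED;
	default:
		return SKETCH_SW_CONTINUE;
	}
}
```

The difference between trap and trap_fwd therefore only becomes visible once the filter is offloaded, which is what patches #2 to #5 take care of.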
[PATCH net-next 3/7] devlink: Add a new trap for the trap_fwd action
Add a new trap so that drivers can report packets forwarded due to the trap_fwd action correctly. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- Documentation/networking/devlink/devlink-trap.rst | 4 include/net/devlink.h | 3 +++ net/core/devlink.c| 1 + 3 files changed, 8 insertions(+) diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst index 935b6397e8cf..3f1c0f89d284 100644 --- a/Documentation/networking/devlink/devlink-trap.rst +++ b/Documentation/networking/devlink/devlink-trap.rst @@ -405,6 +405,10 @@ be added to the following table: - ``control`` - Traps packets logged during processing of flow action trap (e.g., via tc's trap action) + * - ``flow_action_trap_fwd`` + - ``control`` + - Traps packets logged during processing of flow action trap_fwd (e.g., via + tc's trap_fwd action) * - ``early_drop`` - ``drop`` - Traps packets dropped due to the RED (Random Early Detection) algorithm diff --git a/include/net/devlink.h b/include/net/devlink.h index 853420db5d32..967e70363ba9 100644 --- a/include/net/devlink.h +++ b/include/net/devlink.h @@ -845,6 +845,7 @@ enum devlink_trap_generic_id { DEVLINK_TRAP_GENERIC_ID_PTP_GENERAL, DEVLINK_TRAP_GENERIC_ID_FLOW_ACTION_SAMPLE, DEVLINK_TRAP_GENERIC_ID_FLOW_ACTION_TRAP, + DEVLINK_TRAP_GENERIC_ID_FLOW_ACTION_TRAP_FWD, DEVLINK_TRAP_GENERIC_ID_EARLY_DROP, DEVLINK_TRAP_GENERIC_ID_VXLAN_PARSING, DEVLINK_TRAP_GENERIC_ID_LLC_SNAP_PARSING, @@ -1053,6 +1054,8 @@ enum devlink_trap_group_generic_id { "flow_action_sample" #define DEVLINK_TRAP_GENERIC_NAME_FLOW_ACTION_TRAP \ "flow_action_trap" +#define DEVLINK_TRAP_GENERIC_NAME_FLOW_ACTION_TRAP_FWD \ + "flow_action_trap_fwd" #define DEVLINK_TRAP_GENERIC_NAME_EARLY_DROP \ "early_drop" #define DEVLINK_TRAP_GENERIC_NAME_VXLAN_PARSING \ diff --git a/net/core/devlink.c b/net/core/devlink.c index 737b61c2976e..478d4bc01a39 100644 --- a/net/core/devlink.c +++ b/net/core/devlink.c @@ -9744,6 +9744,7 @@ static const struct 
devlink_trap devlink_trap_generic[] = { DEVLINK_TRAP(PTP_GENERAL, CONTROL), DEVLINK_TRAP(FLOW_ACTION_SAMPLE, CONTROL), DEVLINK_TRAP(FLOW_ACTION_TRAP, CONTROL), + DEVLINK_TRAP(FLOW_ACTION_TRAP_FWD, CONTROL), DEVLINK_TRAP(EARLY_DROP, DROP), DEVLINK_TRAP(VXLAN_PARSING, DROP), DEVLINK_TRAP(LLC_SNAP_PARSING, DROP), -- 2.26.2
[PATCH net-next 2/7] net: sched: Make the action trap_fwd offloadable
Add the new flow action and related support so that drivers can offload the trap_fwd action. Signed-off-by: Petr Machata Reviewed-by: Jiri Pirko Reviewed-by: Ido Schimmel --- include/net/flow_offload.h | 1 + include/net/tc_act/tc_gact.h | 5 + net/sched/cls_api.c | 2 ++ 3 files changed, 8 insertions(+) diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h index dc5c1e69cd9f..5f35523f12b5 100644 --- a/include/net/flow_offload.h +++ b/include/net/flow_offload.h @@ -121,6 +121,7 @@ enum flow_action_id { FLOW_ACTION_ACCEPT = 0, FLOW_ACTION_DROP, FLOW_ACTION_TRAP, + FLOW_ACTION_TRAP_FWD, FLOW_ACTION_GOTO, FLOW_ACTION_REDIRECT, FLOW_ACTION_MIRRED, diff --git a/include/net/tc_act/tc_gact.h b/include/net/tc_act/tc_gact.h index eb8f01c819e6..df9e0a19c826 100644 --- a/include/net/tc_act/tc_gact.h +++ b/include/net/tc_act/tc_gact.h @@ -49,6 +49,11 @@ static inline bool is_tcf_gact_trap(const struct tc_action *a) return __is_tcf_gact_act(a, TC_ACT_TRAP, false); } +static inline bool is_tcf_gact_trap_fwd(const struct tc_action *a) +{ + return __is_tcf_gact_act(a, TC_ACT_TRAP_FWD, false); +} + static inline bool is_tcf_gact_goto_chain(const struct tc_action *a) { return __is_tcf_gact_act(a, TC_ACT_GOTO_CHAIN, true); diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index d3db70865d66..95e37eb50173 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -3582,6 +3582,8 @@ int tc_setup_flow_action(struct flow_action *flow_action, entry->id = FLOW_ACTION_DROP; } else if (is_tcf_gact_trap(act)) { entry->id = FLOW_ACTION_TRAP; + } else if (is_tcf_gact_trap_fwd(act)) { + entry->id = FLOW_ACTION_TRAP_FWD; } else if (is_tcf_gact_goto_chain(act)) { entry->id = FLOW_ACTION_GOTO; entry->chain_index = tcf_gact_goto_chain_index(act); -- 2.26.2
[PATCH net-next] Documentation: net: Document resilient next-hop groups
Add a document describing the principles behind resilient next-hop groups, and some notes about how to configure and offload them. Suggested-by: David Ahern Signed-off-by: Petr Machata Reviewed-by: David Ahern --- Notes: v1 (from an RFC shared privately): - Dropped a reference to a non-existent footnote [Ido] - Spell out consequences of flow redirection explicitly [Ido] - A handful of wording changes [Ido] - Kept David's R-b due to minor scope of the above fixes Documentation/networking/index.rst | 1 + .../networking/nexthop-group-resilient.rst | 293 ++ 2 files changed, 294 insertions(+) create mode 100644 Documentation/networking/nexthop-group-resilient.rst diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b8a29997d433..e9ce55992aa9 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -76,6 +76,7 @@ Contents: netdevices netfilter-sysctl netif-msg + nexthop-group-resilient nf_conntrack-sysctl nf_flowtable openvswitch diff --git a/Documentation/networking/nexthop-group-resilient.rst b/Documentation/networking/nexthop-group-resilient.rst new file mode 100644 index 000000000000..fabecee24d85 --- /dev/null +++ b/Documentation/networking/nexthop-group-resilient.rst @@ -0,0 +1,293 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +Resilient Next-hop Groups +========================= + +Resilient groups are a type of next-hop group that is aimed at minimizing +disruption in flow routing across changes to the group composition and +weights of constituent next hops. + +The idea behind resilient hashing groups is best explained in contrast to +the legacy multipath next-hop group, which uses the hash-threshold +algorithm, described in RFC 2992. + +To select a next hop, hash-threshold algorithm first assigns a range of +hashes to each next hop in the group, and then selects the next hop by +comparing the SKB hash with the individual ranges. 
When a next hop is +removed from the group, the ranges are recomputed, which leads to +reassignment of parts of hash space from one next hop to another. RFC 2992 +illustrates it thus:: + + +---+---+---+---+---+ + | 1 | 2 | 3 | 4 | 5 | + +---+-+-+---+---+-+-+---+ + |1|2|4|5| + +-+-+-+-+ + + Before and after deletion of next hop 3 + under the hash-threshold algorithm. + +Note how next hop 2 gave up part of the hash space in favor of next hop 1, +and 4 in favor of 5. While there will usually be some overlap between the +previous and the new distribution, some traffic flows change the next hop +that they resolve to. + +If a multipath group is used for load-balancing between multiple servers, +this hash space reassignment causes an issue that packets from a single +flow suddenly end up arriving at a server that does not expect them. This +can result in TCP connections being reset. + +If a multipath group is used for load-balancing among available paths to +the same server, the issue is that different latencies and reordering along +the way causes the packets to arrive in the wrong order, resulting in +degraded application performance. + +To mitigate the above-mentioned flow redirection, resilient next-hop groups +insert another layer of indirection between the hash space and its +constituent next hops: a hash table. The selection algorithm uses SKB hash +to choose a hash table bucket, then reads the next hop that this bucket +contains, and forwards traffic there. + +This indirection brings an important feature. In the hash-threshold +algorithm, the range of hashes associated with a next hop must be +continuous. With a hash table, mapping between the hash table buckets and +the individual next hops is arbitrary. 
Therefore when a next hop is deleted +the buckets that held it are simply reassigned to other next hops:: + + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5| + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +v v v v + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5| + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + Before and after deletion of next hop 3 + under the resilient hashing algorithm. + +When weights of next hops in a group are altered, it may be possible to +choose a subset of buckets that are currently not used for forwarding +traffic, and use those to satisfy the new next-hop distribution demands, +keeping the "busy" buckets intact. This way, established flows are ideally +kept being forwarded to the same endpoints through the same paths as before +the next-hop group change. + +Algorithm +--
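The bucket indirection described in the document can be sketched in a few lines of C. This is only an illustration of the idea, not the kernel's implementation: the names `res_table`, `res_table_select()`, `res_table_remove()` and `res_table_init()` are hypothetical, the table size of 20 is taken from the diagram, and handing freed buckets out round-robin is an assumption made to reproduce the picture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the kernel's data structure. */
#define NUM_BUCKETS 20

struct res_table {
	uint16_t bucket_nh[NUM_BUCKETS];	/* next-hop ID held by each bucket */
};

/* Select a next hop: the hash picks a bucket, the bucket names the next hop. */
static uint16_t res_table_select(const struct res_table *tbl, uint32_t hash)
{
	return tbl->bucket_nh[hash % NUM_BUCKETS];
}

/*
 * Remove next hop `gone`: only the buckets that held it are rewritten,
 * handed out round-robin to the `nsurvivors` remaining next hops.
 * Returns the number of buckets that changed.
 */
static size_t res_table_remove(struct res_table *tbl, uint16_t gone,
			       const uint16_t *survivors, size_t nsurvivors)
{
	size_t changed = 0;
	size_t i;

	assert(nsurvivors > 0);

	for (i = 0; i < NUM_BUCKETS; i++) {
		if (tbl->bucket_nh[i] != gone)
			continue;
		tbl->bucket_nh[i] = survivors[changed % nsurvivors];
		changed++;
	}
	return changed;
}

/* Fill the table as in the diagram: four consecutive buckets per next hop 1..5. */
static void res_table_init(struct res_table *tbl)
{
	size_t i;

	for (i = 0; i < NUM_BUCKETS; i++)
		tbl->bucket_nh[i] = (uint16_t)(i / 4 + 1);
}
```

Removing next hop 3 with survivors 1, 2, 4 and 5 rewrites only buckets 8 through 11. Every other bucket, and therefore every flow hashing to it, keeps its next hop, which is the property the hash-threshold algorithm cannot provide.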
[PATCH net-next] nexthop: Rename artifacts related to legacy multipath nexthop groups
After resilient next-hop groups have been added recently, there are two types of multipath next-hop groups: the legacy "mpath", and the new "resilient". Calling the legacy next-hop group type "mpath" is unfortunate, because that describes the fact that a packet could be forwarded in one of several paths, which is also true for the resilient next-hop groups. Therefore, to make the naming clearer, rename various artifacts to reflect the assumptions made. Therefore as of this patch: - The flag for multipath groups is nh_grp_entry::is_multipath. This includes the legacy and resilient groups, as well as any future group types that behave as multipath groups. Functions that assume this have "mpath" in the name. - The flag for legacy multipath groups is nh_grp_entry::hash_threshold. Functions that assume this have "hthr" in the name. - The flag for resilient groups is nh_grp_entry::resilient. Functions that assume this have "res" in the name. Besides the above, struct nh_grp_entry::mpath was renamed to ::hthr as well. UAPI artifacts were obviously left intact. Suggested-by: David Ahern Signed-off-by: Petr Machata --- include/net/nexthop.h | 4 ++-- net/ipv4/nexthop.c | 56 +-- 2 files changed, 30 insertions(+), 30 deletions(-) diff --git a/include/net/nexthop.h b/include/net/nexthop.h index ba94868a21d5..ace54bf90b2c 100644 --- a/include/net/nexthop.h +++ b/include/net/nexthop.h @@ -102,7 +102,7 @@ struct nh_grp_entry { union { struct { atomic_t upper_bound; - } mpath; + } hthr; struct { /* Member on uw_nh_entries. 
*/ struct list_head uw_nh_entry; @@ -120,7 +120,7 @@ struct nh_group { struct nh_group *spare; /* spare group for removals */ u16 num_nh; bool is_multipath; - bool mpath; + bool hash_threshold; bool resilient; bool fdb_nh; bool has_v4; diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index f09fe3a5608f..5a2fc8798d20 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -116,8 +116,8 @@ static void nh_notifier_single_info_fini(struct nh_notifier_info *info) kfree(info->nh); } -static int nh_notifier_mp_info_init(struct nh_notifier_info *info, - struct nh_group *nhg) +static int nh_notifier_mpath_info_init(struct nh_notifier_info *info, + struct nh_group *nhg) { u16 num_nh = nhg->num_nh; int i; @@ -181,8 +181,8 @@ static int nh_notifier_grp_info_init(struct nh_notifier_info *info, { struct nh_group *nhg = rtnl_dereference(nh->nh_grp); - if (nhg->mpath) - return nh_notifier_mp_info_init(info, nhg); + if (nhg->hash_threshold) + return nh_notifier_mpath_info_init(info, nhg); else if (nhg->resilient) return nh_notifier_res_table_info_init(info, nhg); return -EINVAL; @@ -193,7 +193,7 @@ static void nh_notifier_grp_info_fini(struct nh_notifier_info *info, { struct nh_group *nhg = rtnl_dereference(nh->nh_grp); - if (nhg->mpath) + if (nhg->hash_threshold) kfree(info->nh_grp); else if (nhg->resilient) vfree(info->nh_res_table); @@ -406,7 +406,7 @@ static int call_nexthop_res_table_notifiers(struct net *net, struct nexthop *nh, * could potentially veto it in case of unsupported configuration. 
*/ nhg = rtnl_dereference(nh->nh_grp); - err = nh_notifier_mp_info_init(&info, nhg); + err = nh_notifier_mpath_info_init(&info, nhg); if (err) { NL_SET_ERR_MSG(extack, "Failed to initialize nexthop notifier info"); return err; @@ -661,7 +661,7 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg) u16 group_type = 0; int i; - if (nhg->mpath) + if (nhg->hash_threshold) group_type = NEXTHOP_GRP_TYPE_MPATH; else if (nhg->resilient) group_type = NEXTHOP_GRP_TYPE_RES; @@ -992,9 +992,9 @@ static bool valid_group_nh(struct nexthop *nh, unsigned int npaths, struct nh_group *nhg = rtnl_dereference(nh->nh_grp); /* Nesting groups within groups is not supported. */ - if (nhg->mpath) { + if (nhg->hash_threshold) { NL_SET_ERR_MSG(extack, - "Multipath group can not be a nexthop within a group"); + "H
[PATCH iproute2-next v4 5/6] nexthop: Add support for resilient nexthop groups
From: Ido Schimmel Add ability to configure resilient nexthop groups and show their current configuration. Example: # ip nexthop add id 10 group 1/2 type resilient buckets 8 # ip nexthop show id 10 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 # ip -j -p nexthop show id 10 [ { "id": 10, "group": [ { "id": 1 },{ "id": 2 } ], "type": "resilient", "resilient_args": { "buckets": 8, "idle_timer": 120, "unbalanced_timer": 0 }, "flags": [ ] } ] Signed-off-by: Ido Schimmel Signed-off-by: Petr Machata --- ip/ipnexthop.c| 144 +- man/man8/ip-nexthop.8 | 55 +++- 2 files changed, 193 insertions(+), 6 deletions(-) diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 5aae32629edd..1d50bf7529c4 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -43,9 +43,12 @@ static void usage(void) "[ groups ] [ fdb ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" "[ encap ENCAPTYPE ENCAPHDR ] |\n" - "group GROUP [ fdb ] [ type TYPE ] }\n" + "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n" "GROUP := [ //... ]\n" - "TYPE := { mpath }\n" + "TYPE := { mpath | resilient }\n" + "TYPE_ARGS := [ RESILIENT_ARGS ]\n" + "RESILIENT_ARGS := [ buckets BUCKETS ] [ idle_timer IDLE ]\n" + " [ unbalanced_timer UNBALANCED ]\n" "ENCAPTYPE := [ mpls ]\n" "ENCAPHDR := [ MPLSLABEL ]\n"); exit(-1); @@ -203,6 +206,66 @@ static void print_nh_group(FILE *fp, const struct rtattr *grps_attr) close_json_array(PRINT_JSON, NULL); } +static const char *nh_group_type_name(__u16 type) +{ + switch (type) { + case NEXTHOP_GRP_TYPE_MPATH: + return "mpath"; + case NEXTHOP_GRP_TYPE_RES: + return "resilient"; + default: + return ""; + } +} + +static void print_nh_group_type(FILE *fp, const struct rtattr *grp_type_attr) +{ + __u16 type = rta_getattr_u16(grp_type_attr); + + if (type == NEXTHOP_GRP_TYPE_MPATH) + /* Do not print type in order not to break existing output. 
*/ + return; + + print_string(PRINT_ANY, "type", "type %s ", nh_group_type_name(type)); +} + +static void print_nh_res_group(FILE *fp, const struct rtattr *res_grp_attr) +{ + struct rtattr *tb[NHA_RES_GROUP_MAX + 1]; + struct rtattr *rta; + struct timeval tv; + + parse_rtattr_nested(tb, NHA_RES_GROUP_MAX, res_grp_attr); + + open_json_object("resilient_args"); + + if (tb[NHA_RES_GROUP_BUCKETS]) + print_uint(PRINT_ANY, "buckets", "buckets %u ", + rta_getattr_u16(tb[NHA_RES_GROUP_BUCKETS])); + + if (tb[NHA_RES_GROUP_IDLE_TIMER]) { + rta = tb[NHA_RES_GROUP_IDLE_TIMER]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "idle_timer", "idle_timer %g ", &tv); + } + + if (tb[NHA_RES_GROUP_UNBALANCED_TIMER]) { + rta = tb[NHA_RES_GROUP_UNBALANCED_TIMER]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "unbalanced_timer", "unbalanced_timer %g ", +&tv); + } + + if (tb[NHA_RES_GROUP_UNBALANCED_TIME]) { + rta = tb[NHA_RES_GROUP_UNBALANCED_TIME]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "unbalanced_time", "unbalanced_time %g ", +&tv); + } + + close_json_object(); +} + int print_nexthop(struct nlmsghdr *n, void *arg) { struct nhmsg *nhm = NLMSG_DATA(n); @@ -229,7 +292,7 @@ int print_nexthop(struct nlmsghdr *n, void *arg) if (filter.proto && filter.proto != nhm->nh_protocol) return 0; - parse_rtattr(tb, NHA_MAX, RTM_NHA(nhm), len); + parse_rtattr_flags(tb, NHA_MAX, RTM_NHA(nhm), len, NLA_F_NESTED); open_json_object(NULL); @@ -243,6 +306,12 @@ int print_nexthop(struct nlmsghdr *n, void *arg) if (tb[NHA_GROUP]) print_nh_group(fp, tb[NHA_GROUP]); + if (tb[NHA_GROUP_TYPE]) + print_nh_group_type(fp, tb[NHA_G
[PATCH iproute2-next v4 6/6] nexthop: Add support for nexthop buckets
From: Ido Schimmel Add ability to dump multiple nexthop buckets and get a specific one. Example: # ip nexthop add id 10 group 1/2 type resilient buckets 8 # ip nexthop id 1 via 192.0.2.2 dev dummy10 scope link id 2 via 192.0.2.19 dev dummy20 scope link id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 unbalanced_time 0 # ip nexthop bucket id 10 index 0 idle_time 28.1 nhid 2 id 10 index 1 idle_time 28.1 nhid 2 id 10 index 2 idle_time 28.1 nhid 2 id 10 index 3 idle_time 28.1 nhid 2 id 10 index 4 idle_time 28.1 nhid 1 id 10 index 5 idle_time 28.1 nhid 1 id 10 index 6 idle_time 28.1 nhid 1 id 10 index 7 idle_time 28.1 nhid 1 # ip nexthop bucket show nhid 1 id 10 index 4 idle_time 53.59 nhid 1 id 10 index 5 idle_time 53.59 nhid 1 id 10 index 6 idle_time 53.59 nhid 1 id 10 index 7 idle_time 53.59 nhid 1 # ip nexthop bucket get id 10 index 5 id 10 index 5 idle_time 81 nhid 1 # ip -j -p nexthop bucket get id 10 index 5 [ { "id": 10, "bucket": { "index": 5, "idle_time": 104.89, "nhid": 1 }, "flags": [ ] } ] Signed-off-by: Ido Schimmel Signed-off-by: Petr Machata --- include/libnetlink.h | 3 + ip/ip_common.h| 1 + ip/ipmonitor.c| 6 + ip/ipnexthop.c| 254 ++ lib/libnetlink.c | 26 + man/man8/ip-nexthop.8 | 45 6 files changed, 335 insertions(+) diff --git a/include/libnetlink.h b/include/libnetlink.h index b9073a6a13ad..e8ed5d7fb495 100644 --- a/include/libnetlink.h +++ b/include/libnetlink.h @@ -97,6 +97,9 @@ int rtnl_dump_request_n(struct rtnl_handle *rth, struct nlmsghdr *n) int rtnl_nexthopdump_req(struct rtnl_handle *rth, int family, req_filter_fn_t filter_fn) __attribute__((warn_unused_result)); +int rtnl_nexthop_bucket_dump_req(struct rtnl_handle *rth, int family, +req_filter_fn_t filter_fn) + __attribute__((warn_unused_result)); struct rtnl_ctrl_data { int nsid; diff --git a/ip/ip_common.h b/ip/ip_common.h index 9a31e837563f..55a5521c4275 100644 --- a/ip/ip_common.h +++ b/ip/ip_common.h @@ -53,6 +53,7 @@ int print_rule(struct nlmsghdr *n, void 
*arg); int print_netconf(struct rtnl_ctrl_data *ctrl, struct nlmsghdr *n, void *arg); int print_nexthop(struct nlmsghdr *n, void *arg); +int print_nexthop_bucket(struct nlmsghdr *n, void *arg); void netns_map_init(void); void netns_nsid_socket_init(void); int print_nsid(struct nlmsghdr *n, void *arg); diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c index 99f5fda8ba1f..d7f31cf5d1b5 100644 --- a/ip/ipmonitor.c +++ b/ip/ipmonitor.c @@ -90,6 +90,12 @@ static int accept_msg(struct rtnl_ctrl_data *ctrl, print_nexthop(n, arg); return 0; + case RTM_NEWNEXTHOPBUCKET: + case RTM_DELNEXTHOPBUCKET: + print_headers(fp, "[NEXTHOPBUCKET]", ctrl); + print_nexthop_bucket(n, arg); + return 0; + case RTM_NEWLINK: case RTM_DELLINK: ll_remember_index(n, NULL); diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 1d50bf7529c4..0263307c49df 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -21,6 +21,8 @@ static struct { unsigned int master; unsigned int proto; unsigned int fdb; + unsigned int id; + unsigned int nhid; } filter; enum { @@ -39,8 +41,11 @@ static void usage(void) "Usage: ip nexthop { list | flush } [ protocol ID ] SELECTOR\n" " ip nexthop { add | replace } id ID NH [ protocol ID ]\n" " ip nexthop { get | del } id ID\n" + " ip nexthop bucket list BUCKET_SELECTOR\n" + " ip nexthop bucket get id ID index INDEX\n" "SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n" "[ groups ] [ fdb ]\n" + "BUCKET_SELECTOR := SELECTOR | [ nhid ID ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" "[ encap ENCAPTYPE ENCAPHDR ] |\n" "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n" @@ -85,6 +90,36 @@ static int nh_dump_filter(struct nlmsghdr *nlh, int reqlen) return 0; } +static int nh_dump_bucket_filter(struct nlmsghdr *nlh, int reqlen) +{ + struct rtattr *nest; + int err = 0; + + err = nh_dump_filter(nlh, reqlen); + if (err) + return err; + + if (filter.id) { + err = addattr32(nlh, reqlen, NHA_ID, filter.id); + if (err) + return err; + } + + if (filter.nhid) { +
[PATCH iproute2-next v4 3/6] nexthop: Extract a helper to parse a NH ID
NH ID extraction is a common operation, and will become more common still with the resilient NH groups support. Add a helper that does what is usually done and returns the parsed NH ID. Signed-off-by: Petr Machata --- ip/ipnexthop.c | 25 + 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 20cde586596b..126b0b17cab4 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -327,6 +327,15 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv) return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps)); } +static int ipnh_parse_id(const char *argv) +{ + __u32 id; + + if (get_unsigned(&id, argv, 0)) + invarg("invalid id value", argv); + return id; +} + static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv) { struct { @@ -343,12 +352,9 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv) while (argc > 0) { if (!strcmp(*argv, "id")) { - __u32 id; - NEXT_ARG(); - if (get_unsigned(&id, *argv, 0)) - invarg("invalid id value", *argv); - addattr32(&req.n, sizeof(req), NHA_ID, id); + addattr32(&req.n, sizeof(req), NHA_ID, + ipnh_parse_id(*argv)); } else if (!strcmp(*argv, "dev")) { int ifindex; @@ -485,12 +491,8 @@ static int ipnh_list_flush(int argc, char **argv, int action) if (!filter.master) invarg("VRF does not exist\n", *argv); } else if (!strcmp(*argv, "id")) { - __u32 id; - NEXT_ARG(); - if (get_unsigned(&id, *argv, 0)) - invarg("invalid id value", *argv); - return ipnh_get_id(id); + return ipnh_get_id(ipnh_parse_id(*argv)); } else if (!matches(*argv, "protocol")) { __u32 proto; @@ -536,8 +538,7 @@ static int ipnh_get(int argc, char **argv) while (argc > 0) { if (!strcmp(*argv, "id")) { NEXT_ARG(); - if (get_unsigned(&id, *argv, 0)) - invarg("invalid id value", *argv); + id = ipnh_parse_id(*argv); } else { usage(); } -- 2.26.2
[PATCH iproute2-next v4 4/6] nexthop: Add ability to specify group type
From: Ido Schimmel Next patches are going to add a 'resilient' nexthop group type, so allow users to specify the type using the 'type' argument. Currently, only 'mpath' type is supported. These two commands are equivalent: # ip nexthop add id 10 group 1/2/3 # ip nexthop add id 10 group 1/2/3 type mpath Signed-off-by: Ido Schimmel Signed-off-by: Petr Machata --- Notes: v2: - Add a missing example command to commit message - Mention in the man page that mpath is the default ip/ipnexthop.c| 32 +++- man/man8/ip-nexthop.8 | 19 +-- 2 files changed, 48 insertions(+), 3 deletions(-) diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 126b0b17cab4..5aae32629edd 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -42,8 +42,10 @@ static void usage(void) "SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n" "[ groups ] [ fdb ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" - "[ encap ENCAPTYPE ENCAPHDR ] | group GROUP [ fdb ] }\n" + "[ encap ENCAPTYPE ENCAPHDR ] |\n" + "group GROUP [ fdb ] [ type TYPE ] }\n" "GROUP := [ //... 
]\n" + "TYPE := { mpath }\n" "ENCAPTYPE := [ mpls ]\n" "ENCAPHDR := [ MPLSLABEL ]\n"); exit(-1); @@ -327,6 +329,32 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv) return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps)); } +static int read_nh_group_type(const char *name) +{ + if (strcmp(name, "mpath") == 0) + return NEXTHOP_GRP_TYPE_MPATH; + + return __NEXTHOP_GRP_TYPE_MAX; +} + +static void parse_nh_group_type(struct nlmsghdr *n, int maxlen, int *argcp, + char ***argvp) +{ + char **argv = *argvp; + int argc = *argcp; + __u16 type; + + NEXT_ARG(); + type = read_nh_group_type(*argv); + if (type > NEXTHOP_GRP_TYPE_MAX) + invarg("\"type\" value is invalid\n", *argv); + + *argcp = argc; + *argvp = argv; + + addattr16(n, maxlen, NHA_GROUP_TYPE, type); +} + static int ipnh_parse_id(const char *argv) { __u32 id; @@ -409,6 +437,8 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv) if (add_nh_group_attr(&req.n, sizeof(req), *argv)) invarg("\"group\" value is invalid\n", *argv); + } else if (!strcmp(*argv, "type")) { + parse_nh_group_type(&req.n, sizeof(req), &argc, &argv); } else if (matches(*argv, "protocol") == 0) { __u32 prot; diff --git a/man/man8/ip-nexthop.8 b/man/man8/ip-nexthop.8 index 4d55f4dbcc75..b86f307fef35 100644 --- a/man/man8/ip-nexthop.8 +++ b/man/man8/ip-nexthop.8 @@ -54,7 +54,9 @@ ip-nexthop \- nexthop object management .BR fdb " ] | " .B group .IR GROUP " [ " -.BR fdb " ] } " +.BR fdb " ] [ " +.B type +.IR TYPE " ] } " .ti -8 .IR ENCAP " := [ " @@ -71,6 +73,10 @@ ip-nexthop \- nexthop object management .IR GROUP " := " .BR id "[," weight "[/...]" +.ti -8 +.IR TYPE " := { " +.BR mpath " }" + .SH DESCRIPTION .B ip nexthop is used to manipulate entries in the kernel's nexthop tables. @@ -122,9 +128,18 @@ is a set of encapsulation attributes specific to the .in -2 .TP -.BI group " GROUP" +.BI group " GROUP [ " type " TYPE ]" create a nexthop group. 
Group specification is id with an optional weight (id,weight) and a '/' as a separator between entries. +.sp +.I TYPE +is a string specifying the nexthop group type. Namely: + +.in +8 +.BI mpath +- Multipath nexthop group backed by the hash-threshold algorithm. The +default when the type is unspecified. + .TP .B blackhole create a blackhole nexthop -- 2.26.2
[PATCH iproute2-next v4 2/6] json_print: Add print_tv()
Add a helper to dump a timeval. Print by first converting to double and then dispatching to print_color_float(). Signed-off-by: Petr Machata --- Notes: v4: - Make print_tv() take a const*. include/json_print.h | 1 + lib/json_print.c | 13 + 2 files changed, 14 insertions(+) diff --git a/include/json_print.h b/include/json_print.h index 6fcf9fd910ec..91b34571ceb0 100644 --- a/include/json_print.h +++ b/include/json_print.h @@ -81,6 +81,7 @@ _PRINT_FUNC(0xhex, unsigned long long) _PRINT_FUNC(luint, unsigned long) _PRINT_FUNC(lluint, unsigned long long) _PRINT_FUNC(float, double) +_PRINT_FUNC(tv, const struct timeval *) #undef _PRINT_FUNC #define _PRINT_NAME_VALUE_FUNC(type_name, type, format_char) \ diff --git a/lib/json_print.c b/lib/json_print.c index 994a2f8d6ae0..e3a88375fe7c 100644 --- a/lib/json_print.c +++ b/lib/json_print.c @@ -299,6 +299,19 @@ int print_color_null(enum output_type type, return ret; } +int print_color_tv(enum output_type type, + enum color_attr color, + const char *key, + const char *fmt, + const struct timeval *tv) +{ + double usecs = tv->tv_usec; + double secs = tv->tv_sec; + double time = secs + usecs / 1000000; + + return print_color_float(type, color, key, fmt, time);
+} + /* Print line separator (if not in JSON mode) */ void print_nl(void) { -- 2.26.2
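For readers who want to check the conversion in isolation, here is a minimal standalone sketch of turning a `struct timeval` into fractional seconds, which is the value a helper like print_tv() would hand to a float printer. The function name `tv_to_seconds()` is hypothetical and not part of iproute2; the only fact relied on is that `tv_usec` counts microseconds, so it is divided by one million.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical standalone helper, not part of iproute2. */
static double tv_to_seconds(const struct timeval *tv)
{
	assert(tv != NULL);

	/* tv_usec is in microseconds: 10^6 of them make one second. */
	return (double)tv->tv_sec + (double)tv->tv_usec / 1000000.0;
}
```

With `tv_sec = 28` and `tv_usec = 500000` this yields 28.5, which a `%g` format would print as "28.5".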
[PATCH iproute2-next v4 0/6] ip: nexthop: Support resilient groups
Support for resilient next-hop groups was recently accepted to Linux kernel[1]. Resilient next-hop groups add a layer of indirection between the SKB hash and the next hop. Thus the hash is used to reference a hash table bucket, which is then used to reference a particular next hop. This allows the system more flexibility when assigning SKB hash space to next hops. Previously, each next hop had to be assigned a continuous range of SKB hash space. With a hash table as an intermediate layer, it is possible to reassign next hops with a hash table bucket granularity. In turn, this mends issues with traffic flow redirection resulting from next hop removal or adjustments in next-hop weights. In this patch set, introduce support for resilient next-hop groups to iproute2. - Patch #1 brings include/uapi/linux/nexthop.h and /rtnetlink.h up to date. - Patches #2 and #3 add new helpers that will be useful later. - Patch #4 extends the ip/nexthop sub-tool to accept group type as a command line argument, and to dispatch based on the specified type. - Patch #5 adds the support for resilient next-hop groups. - Patch #6 adds the support for resilient next-hop group bucket interface. To illustrate the usage, consider the following commands: # ip nexthop add id 1 via 192.0.2.2 dev dummy1 # ip nexthop add id 2 via 192.0.2.3 dev dummy1 # ip nexthop add id 10 group 1/2 type resilient \ buckets 8 idle_timer 60 unbalanced_timer 300 The last command creates a resilient next-hop group. It will have 8 buckets, each bucket will be considered idle when no traffic hits it for at least 60 seconds, and if the table remains out of balance for 300 seconds, it will be forcefully brought into balance. 
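The meaning of the two timers in the last command can be modelled with a short C sketch. This is an assumption-laden illustration, not the kernel's upkeep logic: the struct `res_group_timers` and the function `bucket_may_migrate()` are hypothetical, times are plain seconds, and treating an unbalanced_timer of 0 as "never force" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; field names mirror the ip-nexthop keywords only. */
struct res_group_timers {
	double idle_timer;	 /* seconds without traffic before a bucket counts as idle */
	double unbalanced_timer; /* seconds out of balance before forcing; 0 assumed to mean never */
};

/*
 * Decide whether a bucket may be handed to another next hop:
 * either the bucket itself is idle, or the table has been out of
 * balance for long enough that balance is enforced regardless.
 */
static bool bucket_may_migrate(const struct res_group_timers *t,
			       double bucket_idle_time,
			       double table_unbalanced_time)
{
	assert(t != NULL);

	if (bucket_idle_time >= t->idle_timer)
		return true;
	if (t->unbalanced_timer > 0 &&
	    table_unbalanced_time >= t->unbalanced_timer)
		return true;
	return false;
}
```

With idle_timer 60 and unbalanced_timer 300, a bucket that saw traffic 8.74 seconds ago stays put while the table has only just become unbalanced, and becomes eligible once it has been quiet for 60 seconds or the table has been unbalanced for 300.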
And this is how the next-hop group bucket interface looks: # ip nexthop bucket show id 10 id 10 index 0 idle_time 5.59 nhid 1 id 10 index 1 idle_time 5.59 nhid 1 id 10 index 2 idle_time 8.74 nhid 2 id 10 index 3 idle_time 8.74 nhid 2 id 10 index 4 idle_time 8.74 nhid 1 id 10 index 5 idle_time 8.74 nhid 1 id 10 index 6 idle_time 8.74 nhid 1 id 10 index 7 idle_time 8.74 nhid 1 [1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=2a0186a37700b0d5b8cc40be202a62af44f02fa2 v4: - Patch #2: - Make print_tv() take a const*. v3: - Add missing S-o-b's. v2: - Patch #4: - Add a missing example command to commit message - Mention in the man page that mpath is the default Ido Schimmel (3): nexthop: Add ability to specify group type nexthop: Add support for resilient nexthop groups nexthop: Add support for nexthop buckets Petr Machata (3): nexthop: Synchronize uAPI files json_print: Add print_tv() nexthop: Extract a helper to parse a NH ID include/json_print.h | 1 + include/libnetlink.h | 3 + include/uapi/linux/nexthop.h | 47 +++- include/uapi/linux/rtnetlink.h | 7 + ip/ip_common.h | 1 + ip/ipmonitor.c | 6 + ip/ipnexthop.c | 451 - lib/json_print.c | 13 + lib/libnetlink.c | 26 ++ man/man8/ip-nexthop.8 | 113 - 10 files changed, 651 insertions(+), 17 deletions(-) -- 2.26.2
[PATCH iproute2-next v4 1/6] nexthop: Synchronize uAPI files
Signed-off-by: Petr Machata --- include/uapi/linux/nexthop.h | 47 +- include/uapi/linux/rtnetlink.h | 7 + 2 files changed, 53 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h index b0a5613905ef..37b14b4ea6c4 100644 --- a/include/uapi/linux/nexthop.h +++ b/include/uapi/linux/nexthop.h @@ -21,7 +21,10 @@ struct nexthop_grp { }; enum { - NEXTHOP_GRP_TYPE_MPATH, /* default type if not specified */ + NEXTHOP_GRP_TYPE_MPATH, /* hash-threshold nexthop group + * default type if not specified + */ + NEXTHOP_GRP_TYPE_RES,/* resilient nexthop group */ __NEXTHOP_GRP_TYPE_MAX, }; @@ -52,8 +55,50 @@ enum { NHA_FDB,/* flag; nexthop belongs to a bridge fdb */ /* if NHA_FDB is added, OIF, BLACKHOLE, ENCAP cannot be set */ + /* nested; resilient nexthop group attributes */ + NHA_RES_GROUP, + /* nested; nexthop bucket attributes */ + NHA_RES_BUCKET, + __NHA_MAX, }; #define NHA_MAX(__NHA_MAX - 1) + +enum { + NHA_RES_GROUP_UNSPEC, + /* Pad attribute for 64-bit alignment. */ + NHA_RES_GROUP_PAD = NHA_RES_GROUP_UNSPEC, + + /* u16; number of nexthop buckets in a resilient nexthop group */ + NHA_RES_GROUP_BUCKETS, + /* clock_t as u32; nexthop bucket idle timer (per-group) */ + NHA_RES_GROUP_IDLE_TIMER, + /* clock_t as u32; nexthop unbalanced timer */ + NHA_RES_GROUP_UNBALANCED_TIMER, + /* clock_t as u64; nexthop unbalanced time */ + NHA_RES_GROUP_UNBALANCED_TIME, + + __NHA_RES_GROUP_MAX, +}; + +#define NHA_RES_GROUP_MAX (__NHA_RES_GROUP_MAX - 1) + +enum { + NHA_RES_BUCKET_UNSPEC, + /* Pad attribute for 64-bit alignment. 
*/ + NHA_RES_BUCKET_PAD = NHA_RES_BUCKET_UNSPEC, + + /* u16; nexthop bucket index */ + NHA_RES_BUCKET_INDEX, + /* clock_t as u64; nexthop bucket idle time */ + NHA_RES_BUCKET_IDLE_TIME, + /* u32; nexthop id assigned to the nexthop bucket */ + NHA_RES_BUCKET_NH_ID, + + __NHA_RES_BUCKET_MAX, +}; + +#define NHA_RES_BUCKET_MAX (__NHA_RES_BUCKET_MAX - 1) + #endif diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index b34b9add5f65..f6217651 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -178,6 +178,13 @@ enum { RTM_GETVLAN, #define RTM_GETVLANRTM_GETVLAN + RTM_NEWNEXTHOPBUCKET = 116, +#define RTM_NEWNEXTHOPBUCKET RTM_NEWNEXTHOPBUCKET + RTM_DELNEXTHOPBUCKET, +#define RTM_DELNEXTHOPBUCKET RTM_DELNEXTHOPBUCKET + RTM_GETNEXTHOPBUCKET, +#define RTM_GETNEXTHOPBUCKET RTM_GETNEXTHOPBUCKET + __RTM_MAX, #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1) }; -- 2.26.2
Re: [PATCH iproute2-next v3 2/6] json_print: Add print_tv()
Stephen Hemminger writes: >> +_PRINT_FUNC(tv, struct timeval *) > > This > > Make it const please? OK
[PATCH iproute2] ip: Fix batch processing
After the commit cited below, batch mode neglects to set the global variable batch_mode to a non-zero value. Netns and VRF commands use this variable, and break in batch mode. Fix by setting the value again. Fixes: 1d9a81b8c9f3 ("Unify batch processing across tools") Reported-by: Tim Rice Signed-off-by: Petr Machata --- ip/ip.c | 1 + 1 file changed, 1 insertion(+) diff --git a/ip/ip.c b/ip/ip.c index 40d2998ae60b..2d7d0d327734 100644 --- a/ip/ip.c +++ b/ip/ip.c @@ -155,6 +155,7 @@ static int batch(const char *name) return EXIT_FAILURE; } + batch_mode = 1; ret = do_batch(name, force, ip_batch_cmd, &orig_family); rtnl_close(&rth); -- 2.26.2
Re: [BUG] Iproute2 batch-mode fails to bring up veth
David Ahern writes: >> Git bisect pinpoints this commit: >> https://github.com/shemminger/iproute2/commit/1d9a81b8c9f30f9f4abeb875998262f61bf10577 >> > > Petr, can you take a look at this regression? Yes, see elsewhere in the thread: https://marc.info/?l=linux-netdev&m=161589291608081&w=2 I'm pretty sure this fixes the issue, and hopefully Tim can take it for a spin and confirm. I'll send this formally afterwards.
Re: [BUG] Iproute2 batch-mode fails to bring up veth
Thanks for the report. Would you be able to test with the following patch? https://github.com/pmachata/iproute2/commit/a12eeca9caf90b3ebe24bc121819d506c9072a34.patch I believe it fixes the issue.
[PATCH iproute2-next v3 6/6] nexthop: Add support for nexthop buckets
From: Ido Schimmel Add ability to dump multiple nexthop buckets and get a specific one. Example: # ip nexthop add id 10 group 1/2 type resilient buckets 8 # ip nexthop id 1 via 192.0.2.2 dev dummy10 scope link id 2 via 192.0.2.19 dev dummy20 scope link id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 unbalanced_time 0 # ip nexthop bucket id 10 index 0 idle_time 28.1 nhid 2 id 10 index 1 idle_time 28.1 nhid 2 id 10 index 2 idle_time 28.1 nhid 2 id 10 index 3 idle_time 28.1 nhid 2 id 10 index 4 idle_time 28.1 nhid 1 id 10 index 5 idle_time 28.1 nhid 1 id 10 index 6 idle_time 28.1 nhid 1 id 10 index 7 idle_time 28.1 nhid 1 # ip nexthop bucket show nhid 1 id 10 index 4 idle_time 53.59 nhid 1 id 10 index 5 idle_time 53.59 nhid 1 id 10 index 6 idle_time 53.59 nhid 1 id 10 index 7 idle_time 53.59 nhid 1 # ip nexthop bucket get id 10 index 5 id 10 index 5 idle_time 81 nhid 1 # ip -j -p nexthop bucket get id 10 index 5 [ { "id": 10, "bucket": { "index": 5, "idle_time": 104.89, "nhid": 1 }, "flags": [ ] } ] Signed-off-by: Ido Schimmel Signed-off-by: Petr Machata --- include/libnetlink.h | 3 + ip/ip_common.h| 1 + ip/ipmonitor.c| 6 + ip/ipnexthop.c| 254 ++ lib/libnetlink.c | 26 + man/man8/ip-nexthop.8 | 45 6 files changed, 335 insertions(+) diff --git a/include/libnetlink.h b/include/libnetlink.h index b9073a6a13ad..e8ed5d7fb495 100644 --- a/include/libnetlink.h +++ b/include/libnetlink.h @@ -97,6 +97,9 @@ int rtnl_dump_request_n(struct rtnl_handle *rth, struct nlmsghdr *n) int rtnl_nexthopdump_req(struct rtnl_handle *rth, int family, req_filter_fn_t filter_fn) __attribute__((warn_unused_result)); +int rtnl_nexthop_bucket_dump_req(struct rtnl_handle *rth, int family, +req_filter_fn_t filter_fn) + __attribute__((warn_unused_result)); struct rtnl_ctrl_data { int nsid; diff --git a/ip/ip_common.h b/ip/ip_common.h index 9a31e837563f..55a5521c4275 100644 --- a/ip/ip_common.h +++ b/ip/ip_common.h @@ -53,6 +53,7 @@ int print_rule(struct nlmsghdr *n, void 
*arg); int print_netconf(struct rtnl_ctrl_data *ctrl, struct nlmsghdr *n, void *arg); int print_nexthop(struct nlmsghdr *n, void *arg); +int print_nexthop_bucket(struct nlmsghdr *n, void *arg); void netns_map_init(void); void netns_nsid_socket_init(void); int print_nsid(struct nlmsghdr *n, void *arg); diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c index 99f5fda8ba1f..d7f31cf5d1b5 100644 --- a/ip/ipmonitor.c +++ b/ip/ipmonitor.c @@ -90,6 +90,12 @@ static int accept_msg(struct rtnl_ctrl_data *ctrl, print_nexthop(n, arg); return 0; + case RTM_NEWNEXTHOPBUCKET: + case RTM_DELNEXTHOPBUCKET: + print_headers(fp, "[NEXTHOPBUCKET]", ctrl); + print_nexthop_bucket(n, arg); + return 0; + case RTM_NEWLINK: case RTM_DELLINK: ll_remember_index(n, NULL); diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 1d50bf7529c4..0263307c49df 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -21,6 +21,8 @@ static struct { unsigned int master; unsigned int proto; unsigned int fdb; + unsigned int id; + unsigned int nhid; } filter; enum { @@ -39,8 +41,11 @@ static void usage(void) "Usage: ip nexthop { list | flush } [ protocol ID ] SELECTOR\n" " ip nexthop { add | replace } id ID NH [ protocol ID ]\n" " ip nexthop { get | del } id ID\n" + " ip nexthop bucket list BUCKET_SELECTOR\n" + " ip nexthop bucket get id ID index INDEX\n" "SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n" "[ groups ] [ fdb ]\n" + "BUCKET_SELECTOR := SELECTOR | [ nhid ID ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" "[ encap ENCAPTYPE ENCAPHDR ] |\n" "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n" @@ -85,6 +90,36 @@ static int nh_dump_filter(struct nlmsghdr *nlh, int reqlen) return 0; } +static int nh_dump_bucket_filter(struct nlmsghdr *nlh, int reqlen) +{ + struct rtattr *nest; + int err = 0; + + err = nh_dump_filter(nlh, reqlen); + if (err) + return err; + + if (filter.id) { + err = addattr32(nlh, reqlen, NHA_ID, filter.id); + if (err) + return err; + } + + if (filter.nhid) { +
[PATCH iproute2-next v3 5/6] nexthop: Add support for resilient nexthop groups
From: Ido Schimmel Add ability to configure resilient nexthop groups and show their current configuration. Example: # ip nexthop add id 10 group 1/2 type resilient buckets 8 # ip nexthop show id 10 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 # ip -j -p nexthop show id 10 [ { "id": 10, "group": [ { "id": 1 },{ "id": 2 } ], "type": "resilient", "resilient_args": { "buckets": 8, "idle_timer": 120, "unbalanced_timer": 0 }, "flags": [ ] } ] Signed-off-by: Ido Schimmel Signed-off-by: Petr Machata --- ip/ipnexthop.c| 144 +- man/man8/ip-nexthop.8 | 55 +++- 2 files changed, 193 insertions(+), 6 deletions(-) diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 5aae32629edd..1d50bf7529c4 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -43,9 +43,12 @@ static void usage(void) "[ groups ] [ fdb ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" "[ encap ENCAPTYPE ENCAPHDR ] |\n" - "group GROUP [ fdb ] [ type TYPE ] }\n" + "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n" "GROUP := [ //... ]\n" - "TYPE := { mpath }\n" + "TYPE := { mpath | resilient }\n" + "TYPE_ARGS := [ RESILIENT_ARGS ]\n" + "RESILIENT_ARGS := [ buckets BUCKETS ] [ idle_timer IDLE ]\n" + " [ unbalanced_timer UNBALANCED ]\n" "ENCAPTYPE := [ mpls ]\n" "ENCAPHDR := [ MPLSLABEL ]\n"); exit(-1); @@ -203,6 +206,66 @@ static void print_nh_group(FILE *fp, const struct rtattr *grps_attr) close_json_array(PRINT_JSON, NULL); } +static const char *nh_group_type_name(__u16 type) +{ + switch (type) { + case NEXTHOP_GRP_TYPE_MPATH: + return "mpath"; + case NEXTHOP_GRP_TYPE_RES: + return "resilient"; + default: + return ""; + } +} + +static void print_nh_group_type(FILE *fp, const struct rtattr *grp_type_attr) +{ + __u16 type = rta_getattr_u16(grp_type_attr); + + if (type == NEXTHOP_GRP_TYPE_MPATH) + /* Do not print type in order not to break existing output. 
*/ + return; + + print_string(PRINT_ANY, "type", "type %s ", nh_group_type_name(type)); +} + +static void print_nh_res_group(FILE *fp, const struct rtattr *res_grp_attr) +{ + struct rtattr *tb[NHA_RES_GROUP_MAX + 1]; + struct rtattr *rta; + struct timeval tv; + + parse_rtattr_nested(tb, NHA_RES_GROUP_MAX, res_grp_attr); + + open_json_object("resilient_args"); + + if (tb[NHA_RES_GROUP_BUCKETS]) + print_uint(PRINT_ANY, "buckets", "buckets %u ", + rta_getattr_u16(tb[NHA_RES_GROUP_BUCKETS])); + + if (tb[NHA_RES_GROUP_IDLE_TIMER]) { + rta = tb[NHA_RES_GROUP_IDLE_TIMER]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "idle_timer", "idle_timer %g ", &tv); + } + + if (tb[NHA_RES_GROUP_UNBALANCED_TIMER]) { + rta = tb[NHA_RES_GROUP_UNBALANCED_TIMER]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "unbalanced_timer", "unbalanced_timer %g ", +&tv); + } + + if (tb[NHA_RES_GROUP_UNBALANCED_TIME]) { + rta = tb[NHA_RES_GROUP_UNBALANCED_TIME]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "unbalanced_time", "unbalanced_time %g ", +&tv); + } + + close_json_object(); +} + int print_nexthop(struct nlmsghdr *n, void *arg) { struct nhmsg *nhm = NLMSG_DATA(n); @@ -229,7 +292,7 @@ int print_nexthop(struct nlmsghdr *n, void *arg) if (filter.proto && filter.proto != nhm->nh_protocol) return 0; - parse_rtattr(tb, NHA_MAX, RTM_NHA(nhm), len); + parse_rtattr_flags(tb, NHA_MAX, RTM_NHA(nhm), len, NLA_F_NESTED); open_json_object(NULL); @@ -243,6 +306,12 @@ int print_nexthop(struct nlmsghdr *n, void *arg) if (tb[NHA_GROUP]) print_nh_group(fp, tb[NHA_GROUP]); + if (tb[NHA_GROUP_TYPE]) + print_nh_group_type(fp, tb[NHA_G
[PATCH iproute2-next v3 3/6] nexthop: Extract a helper to parse a NH ID
NH ID extraction is a common operation, and will become more common still with the resilient NH groups support. Add a helper that does what is usually done and returns the parsed NH ID.

Signed-off-by: Petr Machata
---
 ip/ipnexthop.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 20cde586596b..126b0b17cab4 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -327,6 +327,15 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv)
 	return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps));
 }
 
+static int ipnh_parse_id(const char *argv)
+{
+	__u32 id;
+
+	if (get_unsigned(&id, argv, 0))
+		invarg("invalid id value", argv);
+	return id;
+}
+
 static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 {
 	struct {
@@ -343,12 +352,9 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 
 	while (argc > 0) {
 		if (!strcmp(*argv, "id")) {
-			__u32 id;
-
 			NEXT_ARG();
-			if (get_unsigned(&id, *argv, 0))
-				invarg("invalid id value", *argv);
-			addattr32(&req.n, sizeof(req), NHA_ID, id);
+			addattr32(&req.n, sizeof(req), NHA_ID,
+				  ipnh_parse_id(*argv));
 		} else if (!strcmp(*argv, "dev")) {
 			int ifindex;
 
@@ -485,12 +491,8 @@ static int ipnh_list_flush(int argc, char **argv, int action)
 			if (!filter.master)
 				invarg("VRF does not exist\n", *argv);
 		} else if (!strcmp(*argv, "id")) {
-			__u32 id;
-
 			NEXT_ARG();
-			if (get_unsigned(&id, *argv, 0))
-				invarg("invalid id value", *argv);
-			return ipnh_get_id(id);
+			return ipnh_get_id(ipnh_parse_id(*argv));
 		} else if (!matches(*argv, "protocol")) {
 			__u32 proto;
 
@@ -536,8 +538,7 @@ static int ipnh_get(int argc, char **argv)
 	while (argc > 0) {
 		if (!strcmp(*argv, "id")) {
 			NEXT_ARG();
-			if (get_unsigned(&id, *argv, 0))
-				invarg("invalid id value", *argv);
+			id = ipnh_parse_id(*argv);
 		} else {
 			usage();
 		}
-- 
2.26.2
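For readers without the iproute2 tree at hand, here is a minimal standalone sketch of the same "parse and validate in one place" idea. It is not part of the patch: ex_parse_id() is a hypothetical name, it uses plain strtoul() rather than iproute2's get_unsigned(), and it reports failure through a return code instead of bailing out through invarg().

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, for illustration only.
 * Parse a numeric string (decimal, 0x-prefixed hex or 0-prefixed octal)
 * into a 32-bit ID. Returns 0 on success, -1 on any parse error.
 */
static int ex_parse_id(const char *arg, uint32_t *id)
{
	unsigned long val;
	char *end;

	/* Reject NULL, empty input and negative numbers up front;
	 * strtoul() would otherwise silently wrap "-1" around.
	 */
	if (!arg || *arg == '\0' || *arg == '-')
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 0);

	/* Fail on overflow, trailing garbage, or values wider than 32 bits. */
	if (errno || *end != '\0' || val > UINT32_MAX)
		return -1;

	*id = (uint32_t)val;
	return 0;
}
```

A caller would then do the validation once, e.g. `if (ex_parse_id(*argv, &id)) usage();`, instead of repeating the check at every "id" keyword.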
[PATCH iproute2-next v3 4/6] nexthop: Add ability to specify group type
From: Ido Schimmel Next patches are going to add a 'resilient' nexthop group type, so allow users to specify the type using the 'type' argument. Currently, only 'mpath' type is supported. These two commands are equivalent: # ip nexthop add id 10 group 1/2/3 # ip nexthop add id 10 group 1/2/3 type mpath Signed-off-by: Ido Schimmel Signed-off-by: Petr Machata --- Notes: v2: - Add a missing example command to commit message - Mention in the man page that mpath is the default ip/ipnexthop.c| 32 +++- man/man8/ip-nexthop.8 | 19 +-- 2 files changed, 48 insertions(+), 3 deletions(-) diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 126b0b17cab4..5aae32629edd 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -42,8 +42,10 @@ static void usage(void) "SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n" "[ groups ] [ fdb ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" - "[ encap ENCAPTYPE ENCAPHDR ] | group GROUP [ fdb ] }\n" + "[ encap ENCAPTYPE ENCAPHDR ] |\n" + "group GROUP [ fdb ] [ type TYPE ] }\n" "GROUP := [ //... 
]\n" + "TYPE := { mpath }\n" "ENCAPTYPE := [ mpls ]\n" "ENCAPHDR := [ MPLSLABEL ]\n"); exit(-1); @@ -327,6 +329,32 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv) return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps)); } +static int read_nh_group_type(const char *name) +{ + if (strcmp(name, "mpath") == 0) + return NEXTHOP_GRP_TYPE_MPATH; + + return __NEXTHOP_GRP_TYPE_MAX; +} + +static void parse_nh_group_type(struct nlmsghdr *n, int maxlen, int *argcp, + char ***argvp) +{ + char **argv = *argvp; + int argc = *argcp; + __u16 type; + + NEXT_ARG(); + type = read_nh_group_type(*argv); + if (type > NEXTHOP_GRP_TYPE_MAX) + invarg("\"type\" value is invalid\n", *argv); + + *argcp = argc; + *argvp = argv; + + addattr16(n, maxlen, NHA_GROUP_TYPE, type); +} + static int ipnh_parse_id(const char *argv) { __u32 id; @@ -409,6 +437,8 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv) if (add_nh_group_attr(&req.n, sizeof(req), *argv)) invarg("\"group\" value is invalid\n", *argv); + } else if (!strcmp(*argv, "type")) { + parse_nh_group_type(&req.n, sizeof(req), &argc, &argv); } else if (matches(*argv, "protocol") == 0) { __u32 prot; diff --git a/man/man8/ip-nexthop.8 b/man/man8/ip-nexthop.8 index 4d55f4dbcc75..b86f307fef35 100644 --- a/man/man8/ip-nexthop.8 +++ b/man/man8/ip-nexthop.8 @@ -54,7 +54,9 @@ ip-nexthop \- nexthop object management .BR fdb " ] | " .B group .IR GROUP " [ " -.BR fdb " ] } " +.BR fdb " ] [ " +.B type +.IR TYPE " ] } " .ti -8 .IR ENCAP " := [ " @@ -71,6 +73,10 @@ ip-nexthop \- nexthop object management .IR GROUP " := " .BR id "[," weight "[/...]" +.ti -8 +.IR TYPE " := { " +.BR mpath " }" + .SH DESCRIPTION .B ip nexthop is used to manipulate entries in the kernel's nexthop tables. @@ -122,9 +128,18 @@ is a set of encapsulation attributes specific to the .in -2 .TP -.BI group " GROUP" +.BI group " GROUP [ " type " TYPE ]" create a nexthop group. 
Group specification is id with an optional weight (id,weight) and a '/' as a separator between entries. +.sp +.I TYPE +is a string specifying the nexthop group type. Namely: + +.in +8 +.BI mpath +- Multipath nexthop group backed by the hash-threshold algorithm. The +default when the type is unspecified. + .TP .B blackhole create a blackhole nexthop -- 2.26.2
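As an aside, not part of the patch: the name lookup above returns the "one past the last valid value" sentinel for an unknown string, and the caller rejects anything above the maximum. A minimal standalone sketch of that pattern follows; every EX_/ex_ name is hypothetical and stands in for the NEXTHOP_GRP_TYPE_* constants of the uAPI header.

```c
#include <string.h>

/* Hypothetical stand-ins for the uAPI group type constants. */
enum ex_grp_type {
	EX_GRP_TYPE_MPATH,	/* default type if not specified */
	EX_GRP_TYPE_SENTINEL,	/* one past the last valid type */
};
#define EX_GRP_TYPE_MAX (EX_GRP_TYPE_SENTINEL - 1)

/* Map a type name to its numeric value; unknown names map to the
 * sentinel, which is by construction greater than EX_GRP_TYPE_MAX.
 */
static int ex_read_grp_type(const char *name)
{
	if (strcmp(name, "mpath") == 0)
		return EX_GRP_TYPE_MPATH;

	return EX_GRP_TYPE_SENTINEL;
}

/* A value is valid only if it lies within [0, EX_GRP_TYPE_MAX]. */
static int ex_grp_type_valid(int type)
{
	return type >= 0 && type <= EX_GRP_TYPE_MAX;
}
```

Adding a new type later then only means adding an enumerator before the sentinel and one more strcmp() branch; the validity check does not change.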
[PATCH iproute2-next v3 2/6] json_print: Add print_tv()
Add a helper to dump a timeval. Print by first converting to double and then dispatching to print_color_float().

Signed-off-by: Petr Machata
---
 include/json_print.h | 1 +
 lib/json_print.c | 13 +
 2 files changed, 14 insertions(+)

diff --git a/include/json_print.h b/include/json_print.h
index 6fcf9fd910ec..63eee3823fe4 100644
--- a/include/json_print.h
+++ b/include/json_print.h
@@ -81,6 +81,7 @@ _PRINT_FUNC(0xhex, unsigned long long)
 _PRINT_FUNC(luint, unsigned long)
 _PRINT_FUNC(lluint, unsigned long long)
 _PRINT_FUNC(float, double)
+_PRINT_FUNC(tv, struct timeval *)
 #undef _PRINT_FUNC
 
 #define _PRINT_NAME_VALUE_FUNC(type_name, type, format_char) \
diff --git a/lib/json_print.c b/lib/json_print.c
index 994a2f8d6ae0..1018bfb36d94 100644
--- a/lib/json_print.c
+++ b/lib/json_print.c
@@ -299,6 +299,19 @@ int print_color_null(enum output_type type,
 	return ret;
 }
 
+int print_color_tv(enum output_type type,
+		   enum color_attr color,
+		   const char *key,
+		   const char *fmt,
+		   struct timeval *tv)
+{
+	double usecs = tv->tv_usec;
+	double secs = tv->tv_sec;
+	double time = secs + usecs / 1000000;
+
+	return print_color_float(type, color, key, fmt, time);
+}
+
 /* Print line separator (if not in JSON mode) */
 void print_nl(void)
 {
-- 
2.26.2
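For illustration only, and not part of the patch: the conversion the helper performs before handing the value to the float printer can be sketched in isolation as below. tv_to_seconds() is a hypothetical name; the only fact relied on is that tv_usec counts microseconds, of which there are 1,000,000 in a second.

```c
#include <sys/time.h>

/* Hypothetical helper: convert a struct timeval to fractional seconds.
 * tv_sec holds whole seconds, tv_usec the microsecond remainder.
 */
static double tv_to_seconds(const struct timeval *tv)
{
	double secs = tv->tv_sec;
	double usecs = tv->tv_usec;

	return secs + usecs / 1000000.0;
}
```

With this, a timeval of 28 seconds and 100000 microseconds comes out as roughly 28.1, which is the kind of value seen in the idle_time fields of the bucket dumps.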
[PATCH iproute2-next v3 1/6] nexthop: Synchronize uAPI files
Signed-off-by: Petr Machata --- include/uapi/linux/nexthop.h | 47 +- include/uapi/linux/rtnetlink.h | 7 + 2 files changed, 53 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h index b0a5613905ef..37b14b4ea6c4 100644 --- a/include/uapi/linux/nexthop.h +++ b/include/uapi/linux/nexthop.h @@ -21,7 +21,10 @@ struct nexthop_grp { }; enum { - NEXTHOP_GRP_TYPE_MPATH, /* default type if not specified */ + NEXTHOP_GRP_TYPE_MPATH, /* hash-threshold nexthop group + * default type if not specified + */ + NEXTHOP_GRP_TYPE_RES,/* resilient nexthop group */ __NEXTHOP_GRP_TYPE_MAX, }; @@ -52,8 +55,50 @@ enum { NHA_FDB,/* flag; nexthop belongs to a bridge fdb */ /* if NHA_FDB is added, OIF, BLACKHOLE, ENCAP cannot be set */ + /* nested; resilient nexthop group attributes */ + NHA_RES_GROUP, + /* nested; nexthop bucket attributes */ + NHA_RES_BUCKET, + __NHA_MAX, }; #define NHA_MAX(__NHA_MAX - 1) + +enum { + NHA_RES_GROUP_UNSPEC, + /* Pad attribute for 64-bit alignment. */ + NHA_RES_GROUP_PAD = NHA_RES_GROUP_UNSPEC, + + /* u16; number of nexthop buckets in a resilient nexthop group */ + NHA_RES_GROUP_BUCKETS, + /* clock_t as u32; nexthop bucket idle timer (per-group) */ + NHA_RES_GROUP_IDLE_TIMER, + /* clock_t as u32; nexthop unbalanced timer */ + NHA_RES_GROUP_UNBALANCED_TIMER, + /* clock_t as u64; nexthop unbalanced time */ + NHA_RES_GROUP_UNBALANCED_TIME, + + __NHA_RES_GROUP_MAX, +}; + +#define NHA_RES_GROUP_MAX (__NHA_RES_GROUP_MAX - 1) + +enum { + NHA_RES_BUCKET_UNSPEC, + /* Pad attribute for 64-bit alignment. 
*/ + NHA_RES_BUCKET_PAD = NHA_RES_BUCKET_UNSPEC, + + /* u16; nexthop bucket index */ + NHA_RES_BUCKET_INDEX, + /* clock_t as u64; nexthop bucket idle time */ + NHA_RES_BUCKET_IDLE_TIME, + /* u32; nexthop id assigned to the nexthop bucket */ + NHA_RES_BUCKET_NH_ID, + + __NHA_RES_BUCKET_MAX, +}; + +#define NHA_RES_BUCKET_MAX (__NHA_RES_BUCKET_MAX - 1) + #endif diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index b34b9add5f65..f6217651 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -178,6 +178,13 @@ enum { RTM_GETVLAN, #define RTM_GETVLANRTM_GETVLAN + RTM_NEWNEXTHOPBUCKET = 116, +#define RTM_NEWNEXTHOPBUCKET RTM_NEWNEXTHOPBUCKET + RTM_DELNEXTHOPBUCKET, +#define RTM_DELNEXTHOPBUCKET RTM_DELNEXTHOPBUCKET + RTM_GETNEXTHOPBUCKET, +#define RTM_GETNEXTHOPBUCKET RTM_GETNEXTHOPBUCKET + __RTM_MAX, #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1) }; -- 2.26.2
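A side note on the two "MAX" conventions visible in these headers, as a standalone sketch rather than anything taken from the kernel tree. The attribute enums end in a sentinel and define the maximum as "sentinel minus one", while RTM_MAX additionally rounds the sentinel up to a multiple of four first. All EX_/ex_ names below are hypothetical.

```c
/* Hypothetical attribute enum following the sentinel convention. */
enum {
	EX_ATTR_UNSPEC,
	EX_ATTR_INDEX,
	EX_ATTR_IDLE_TIME,
	EX_ATTR_NH_ID,

	EX_ATTR_SENTINEL,	/* plays the role of __NHA_RES_BUCKET_MAX */
};

/* Highest valid attribute; tables are sized EX_ATTR_MAX + 1. */
#define EX_ATTR_MAX (EX_ATTR_SENTINEL - 1)

/* Same arithmetic as RTM_MAX: round the sentinel up to a multiple
 * of 4, then subtract 1.
 */
static int ex_round_max(int sentinel)
{
	return ((sentinel + 3) & ~3) - 1;
}
```

Worked through for the values in the diff: RTM_NEWNEXTHOPBUCKET is 116, so DEL is 117, GET is 118 and the sentinel that follows is 119; (119 + 3) & ~3 is 120, giving a maximum of 119.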
[PATCH iproute2-next v3 0/6] ip: nexthop: Support resilient groups
Support for resilient next-hop groups was recently accepted to the Linux kernel[1]. Resilient next-hop groups add a layer of indirection between the SKB hash and the next hop. Thus the hash is used to reference a hash table bucket, which is then used to reference a particular next hop. This allows the system more flexibility when assigning SKB hash space to next hops.

Previously, each next hop had to be assigned a contiguous range of SKB hash space. With a hash table as an intermediate layer, it is possible to reassign next hops with a hash table bucket granularity. In turn, this mends issues with traffic flow redirection resulting from next hop removal or adjustments in next-hop weights.

In this patch set, introduce support for resilient next-hop groups to iproute2.

- Patch #1 brings include/uapi/linux/nexthop.h and /rtnetlink.h up to date.

- Patches #2 and #3 add new helpers that will be useful later.

- Patch #4 extends the ip/nexthop sub-tool to accept group type as a command line argument, and to dispatch based on the specified type.

- Patch #5 adds the support for resilient next-hop groups.

- Patch #6 adds the support for resilient next-hop group bucket interface.

To illustrate the usage, consider the following commands:

 # ip nexthop add id 1 via 192.0.2.2 dev dummy1
 # ip nexthop add id 2 via 192.0.2.3 dev dummy1
 # ip nexthop add id 10 group 1/2 type resilient \
	buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8 buckets, each bucket will be considered idle when no traffic hits it for at least 60 seconds, and if the table remains out of balance for 300 seconds, it will be forcefully brought into balance.

And this is how the next-hop group bucket interface looks:

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=2a0186a37700b0d5b8cc40be202a62af44f02fa2

v3:
- Add missing S-o-b's.

v2:
- Patch #4:
    - Add a missing example command to commit message
    - Mention in the man page that mpath is the default

Ido Schimmel (3):
  nexthop: Add ability to specify group type
  nexthop: Add support for resilient nexthop groups
  nexthop: Add support for nexthop buckets

Petr Machata (3):
  nexthop: Synchronize uAPI files
  json_print: Add print_tv()
  nexthop: Extract a helper to parse a NH ID

 include/json_print.h | 1 +
 include/libnetlink.h | 3 +
 include/uapi/linux/nexthop.h | 47 +++-
 include/uapi/linux/rtnetlink.h | 7 +
 ip/ip_common.h | 1 +
 ip/ipmonitor.c | 6 +
 ip/ipnexthop.c | 451 -
 lib/json_print.c | 13 +
 lib/libnetlink.c | 26 ++
 man/man8/ip-nexthop.8 | 113 -
 10 files changed, 651 insertions(+), 17 deletions(-)

-- 
2.26.2
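To make the indirection described in the cover letter concrete, here is a toy model in C. It is a sketch under loud assumptions, not the kernel's implementation: the struct, the function names and in particular the "hash modulo number of buckets" mapping are made up for illustration. What it does show is the property the cover letter relies on: moving one bucket to a different next hop leaves every other bucket, and the flows hashing to it, untouched.

```c
#include <stdint.h>

#define EX_NUM_BUCKETS 8	/* mirrors "buckets 8" in the example above */

/* Hypothetical model of a resilient group: each bucket names a next hop. */
struct ex_res_group {
	uint32_t bucket_nhid[EX_NUM_BUCKETS];
};

/* The hash selects a bucket, and the bucket selects the next hop.
 * The modulo mapping is an assumption made for this sketch.
 */
static uint32_t ex_res_group_select(const struct ex_res_group *g,
				    uint32_t hash)
{
	return g->bucket_nhid[hash % EX_NUM_BUCKETS];
}

/* Reassign a single bucket to another next hop. Out-of-range indexes
 * are ignored; all other buckets keep their current next hop.
 */
static void ex_res_group_migrate(struct ex_res_group *g,
				 unsigned int index, uint32_t nhid)
{
	if (index < EX_NUM_BUCKETS)
		g->bucket_nhid[index] = nhid;
}
```

Starting from the bucket layout shown in the dump above (nhid 1, 1, 2, 2, 1, 1, 1, 1), migrating bucket 2 to next hop 1 changes the result only for hashes that land in bucket 2; a hash landing in bucket 3 still resolves to next hop 2.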
Re: [PATCH iproute2-next v2 4/6] nexthop: Add ability to specify group type
Petr Machata writes:

> Signed-off-by: Ido Schimmel

And I managed to forget my S-o-b :-/
[PATCH iproute2-next v2 6/6] nexthop: Add support for nexthop buckets
From: Ido Schimmel Add ability to dump multiple nexthop buckets and get a specific one. Example: # ip nexthop add id 10 group 1/2 type resilient buckets 8 # ip nexthop id 1 via 192.0.2.2 dev dummy10 scope link id 2 via 192.0.2.19 dev dummy20 scope link id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 unbalanced_time 0 # ip nexthop bucket id 10 index 0 idle_time 28.1 nhid 2 id 10 index 1 idle_time 28.1 nhid 2 id 10 index 2 idle_time 28.1 nhid 2 id 10 index 3 idle_time 28.1 nhid 2 id 10 index 4 idle_time 28.1 nhid 1 id 10 index 5 idle_time 28.1 nhid 1 id 10 index 6 idle_time 28.1 nhid 1 id 10 index 7 idle_time 28.1 nhid 1 # ip nexthop bucket show nhid 1 id 10 index 4 idle_time 53.59 nhid 1 id 10 index 5 idle_time 53.59 nhid 1 id 10 index 6 idle_time 53.59 nhid 1 id 10 index 7 idle_time 53.59 nhid 1 # ip nexthop bucket get id 10 index 5 id 10 index 5 idle_time 81 nhid 1 # ip -j -p nexthop bucket get id 10 index 5 [ { "id": 10, "bucket": { "index": 5, "idle_time": 104.89, "nhid": 1 }, "flags": [ ] } ] Signed-off-by: Ido Schimmel --- include/libnetlink.h | 3 + ip/ip_common.h| 1 + ip/ipmonitor.c| 6 + ip/ipnexthop.c| 254 ++ lib/libnetlink.c | 26 + man/man8/ip-nexthop.8 | 45 6 files changed, 335 insertions(+) diff --git a/include/libnetlink.h b/include/libnetlink.h index b9073a6a13ad..e8ed5d7fb495 100644 --- a/include/libnetlink.h +++ b/include/libnetlink.h @@ -97,6 +97,9 @@ int rtnl_dump_request_n(struct rtnl_handle *rth, struct nlmsghdr *n) int rtnl_nexthopdump_req(struct rtnl_handle *rth, int family, req_filter_fn_t filter_fn) __attribute__((warn_unused_result)); +int rtnl_nexthop_bucket_dump_req(struct rtnl_handle *rth, int family, +req_filter_fn_t filter_fn) + __attribute__((warn_unused_result)); struct rtnl_ctrl_data { int nsid; diff --git a/ip/ip_common.h b/ip/ip_common.h index 9a31e837563f..55a5521c4275 100644 --- a/ip/ip_common.h +++ b/ip/ip_common.h @@ -53,6 +53,7 @@ int print_rule(struct nlmsghdr *n, void *arg); int print_netconf(struct 
rtnl_ctrl_data *ctrl, struct nlmsghdr *n, void *arg); int print_nexthop(struct nlmsghdr *n, void *arg); +int print_nexthop_bucket(struct nlmsghdr *n, void *arg); void netns_map_init(void); void netns_nsid_socket_init(void); int print_nsid(struct nlmsghdr *n, void *arg); diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c index 99f5fda8ba1f..d7f31cf5d1b5 100644 --- a/ip/ipmonitor.c +++ b/ip/ipmonitor.c @@ -90,6 +90,12 @@ static int accept_msg(struct rtnl_ctrl_data *ctrl, print_nexthop(n, arg); return 0; + case RTM_NEWNEXTHOPBUCKET: + case RTM_DELNEXTHOPBUCKET: + print_headers(fp, "[NEXTHOPBUCKET]", ctrl); + print_nexthop_bucket(n, arg); + return 0; + case RTM_NEWLINK: case RTM_DELLINK: ll_remember_index(n, NULL); diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 1d50bf7529c4..0263307c49df 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -21,6 +21,8 @@ static struct { unsigned int master; unsigned int proto; unsigned int fdb; + unsigned int id; + unsigned int nhid; } filter; enum { @@ -39,8 +41,11 @@ static void usage(void) "Usage: ip nexthop { list | flush } [ protocol ID ] SELECTOR\n" " ip nexthop { add | replace } id ID NH [ protocol ID ]\n" " ip nexthop { get | del } id ID\n" + " ip nexthop bucket list BUCKET_SELECTOR\n" + " ip nexthop bucket get id ID index INDEX\n" "SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n" "[ groups ] [ fdb ]\n" + "BUCKET_SELECTOR := SELECTOR | [ nhid ID ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" "[ encap ENCAPTYPE ENCAPHDR ] |\n" "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n" @@ -85,6 +90,36 @@ static int nh_dump_filter(struct nlmsghdr *nlh, int reqlen) return 0; } +static int nh_dump_bucket_filter(struct nlmsghdr *nlh, int reqlen) +{ + struct rtattr *nest; + int err = 0; + + err = nh_dump_filter(nlh, reqlen); + if (err) + return err; + + if (filter.id) { + err = addattr32(nlh, reqlen, NHA_ID, filter.id); + if (err) + return err; + } + + if (filter.nhid) { + nest = addattr_nest(nlh, reqlen, 
NHA_RES_BUCKET); + nest->rta_type |= NLA_F_NESTED; + + err = addattr32(nlh, reqlen, NHA_RES_BUCKET_NH_ID, + f
[PATCH iproute2-next v2 5/6] nexthop: Add support for resilient nexthop groups
From: Ido Schimmel Add ability to configure resilient nexthop groups and show their current configuration. Example: # ip nexthop add id 10 group 1/2 type resilient buckets 8 # ip nexthop show id 10 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 # ip -j -p nexthop show id 10 [ { "id": 10, "group": [ { "id": 1 },{ "id": 2 } ], "type": "resilient", "resilient_args": { "buckets": 8, "idle_timer": 120, "unbalanced_timer": 0 }, "flags": [ ] } ] Signed-off-by: Ido Schimmel --- ip/ipnexthop.c| 144 +- man/man8/ip-nexthop.8 | 55 +++- 2 files changed, 193 insertions(+), 6 deletions(-) diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 5aae32629edd..1d50bf7529c4 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -43,9 +43,12 @@ static void usage(void) "[ groups ] [ fdb ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" "[ encap ENCAPTYPE ENCAPHDR ] |\n" - "group GROUP [ fdb ] [ type TYPE ] }\n" + "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n" "GROUP := [ //... ]\n" - "TYPE := { mpath }\n" + "TYPE := { mpath | resilient }\n" + "TYPE_ARGS := [ RESILIENT_ARGS ]\n" + "RESILIENT_ARGS := [ buckets BUCKETS ] [ idle_timer IDLE ]\n" + " [ unbalanced_timer UNBALANCED ]\n" "ENCAPTYPE := [ mpls ]\n" "ENCAPHDR := [ MPLSLABEL ]\n"); exit(-1); @@ -203,6 +206,66 @@ static void print_nh_group(FILE *fp, const struct rtattr *grps_attr) close_json_array(PRINT_JSON, NULL); } +static const char *nh_group_type_name(__u16 type) +{ + switch (type) { + case NEXTHOP_GRP_TYPE_MPATH: + return "mpath"; + case NEXTHOP_GRP_TYPE_RES: + return "resilient"; + default: + return ""; + } +} + +static void print_nh_group_type(FILE *fp, const struct rtattr *grp_type_attr) +{ + __u16 type = rta_getattr_u16(grp_type_attr); + + if (type == NEXTHOP_GRP_TYPE_MPATH) + /* Do not print type in order not to break existing output. 
*/ + return; + + print_string(PRINT_ANY, "type", "type %s ", nh_group_type_name(type)); +} + +static void print_nh_res_group(FILE *fp, const struct rtattr *res_grp_attr) +{ + struct rtattr *tb[NHA_RES_GROUP_MAX + 1]; + struct rtattr *rta; + struct timeval tv; + + parse_rtattr_nested(tb, NHA_RES_GROUP_MAX, res_grp_attr); + + open_json_object("resilient_args"); + + if (tb[NHA_RES_GROUP_BUCKETS]) + print_uint(PRINT_ANY, "buckets", "buckets %u ", + rta_getattr_u16(tb[NHA_RES_GROUP_BUCKETS])); + + if (tb[NHA_RES_GROUP_IDLE_TIMER]) { + rta = tb[NHA_RES_GROUP_IDLE_TIMER]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "idle_timer", "idle_timer %g ", &tv); + } + + if (tb[NHA_RES_GROUP_UNBALANCED_TIMER]) { + rta = tb[NHA_RES_GROUP_UNBALANCED_TIMER]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "unbalanced_timer", "unbalanced_timer %g ", +&tv); + } + + if (tb[NHA_RES_GROUP_UNBALANCED_TIME]) { + rta = tb[NHA_RES_GROUP_UNBALANCED_TIME]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "unbalanced_time", "unbalanced_time %g ", +&tv); + } + + close_json_object(); +} + int print_nexthop(struct nlmsghdr *n, void *arg) { struct nhmsg *nhm = NLMSG_DATA(n); @@ -229,7 +292,7 @@ int print_nexthop(struct nlmsghdr *n, void *arg) if (filter.proto && filter.proto != nhm->nh_protocol) return 0; - parse_rtattr(tb, NHA_MAX, RTM_NHA(nhm), len); + parse_rtattr_flags(tb, NHA_MAX, RTM_NHA(nhm), len, NLA_F_NESTED); open_json_object(NULL); @@ -243,6 +306,12 @@ int print_nexthop(struct nlmsghdr *n, void *arg) if (tb[NHA_GROUP]) print_nh_group(fp, tb[NHA_GROUP]); + if (tb[NHA_GROUP_TYPE]) + print_nh_group_type(fp, tb[NHA_GROUP_TYPE]); + + if (tb[NHA_RES_GROUP]) + print_nh_res_group(fp, tb[NHA_RES_GROUP]); + if (tb[NHA_ENCAP]) lwt_print_encap(fp, tb[NHA_ENCAP_TYPE], tb[NHA_ENCAP]); @@ -333,10 +402,70 @@ static int read_nh_group_type(const char *name) { if (strcmp(name, "mpath") == 0) return NEXTHOP_GRP_TYPE_MPATH; + else if (strcmp(name, 
"resilient") == 0) + return NEXTHOP_GRP_T
[PATCH iproute2-next v2 4/6] nexthop: Add ability to specify group type
From: Ido Schimmel Next patches are going to add a 'resilient' nexthop group type, so allow users to specify the type using the 'type' argument. Currently, only 'mpath' type is supported. These two commands are equivalent: # ip nexthop add id 10 group 1/2/3 # ip nexthop add id 10 group 1/2/3 type mpath Signed-off-by: Ido Schimmel --- Notes: v2: - Add a missing example command to commit message - Mention in the man page that mpath is the default ip/ipnexthop.c| 32 +++- man/man8/ip-nexthop.8 | 19 +-- 2 files changed, 48 insertions(+), 3 deletions(-) diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 126b0b17cab4..5aae32629edd 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -42,8 +42,10 @@ static void usage(void) "SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n" "[ groups ] [ fdb ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" - "[ encap ENCAPTYPE ENCAPHDR ] | group GROUP [ fdb ] }\n" + "[ encap ENCAPTYPE ENCAPHDR ] |\n" + "group GROUP [ fdb ] [ type TYPE ] }\n" "GROUP := [ //... 
]\n" + "TYPE := { mpath }\n" "ENCAPTYPE := [ mpls ]\n" "ENCAPHDR := [ MPLSLABEL ]\n"); exit(-1); @@ -327,6 +329,32 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv) return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps)); } +static int read_nh_group_type(const char *name) +{ + if (strcmp(name, "mpath") == 0) + return NEXTHOP_GRP_TYPE_MPATH; + + return __NEXTHOP_GRP_TYPE_MAX; +} + +static void parse_nh_group_type(struct nlmsghdr *n, int maxlen, int *argcp, + char ***argvp) +{ + char **argv = *argvp; + int argc = *argcp; + __u16 type; + + NEXT_ARG(); + type = read_nh_group_type(*argv); + if (type > NEXTHOP_GRP_TYPE_MAX) + invarg("\"type\" value is invalid\n", *argv); + + *argcp = argc; + *argvp = argv; + + addattr16(n, maxlen, NHA_GROUP_TYPE, type); +} + static int ipnh_parse_id(const char *argv) { __u32 id; @@ -409,6 +437,8 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv) if (add_nh_group_attr(&req.n, sizeof(req), *argv)) invarg("\"group\" value is invalid\n", *argv); + } else if (!strcmp(*argv, "type")) { + parse_nh_group_type(&req.n, sizeof(req), &argc, &argv); } else if (matches(*argv, "protocol") == 0) { __u32 prot; diff --git a/man/man8/ip-nexthop.8 b/man/man8/ip-nexthop.8 index 4d55f4dbcc75..b86f307fef35 100644 --- a/man/man8/ip-nexthop.8 +++ b/man/man8/ip-nexthop.8 @@ -54,7 +54,9 @@ ip-nexthop \- nexthop object management .BR fdb " ] | " .B group .IR GROUP " [ " -.BR fdb " ] } " +.BR fdb " ] [ " +.B type +.IR TYPE " ] } " .ti -8 .IR ENCAP " := [ " @@ -71,6 +73,10 @@ ip-nexthop \- nexthop object management .IR GROUP " := " .BR id "[," weight "[/...]" +.ti -8 +.IR TYPE " := { " +.BR mpath " }" + .SH DESCRIPTION .B ip nexthop is used to manipulate entries in the kernel's nexthop tables. @@ -122,9 +128,18 @@ is a set of encapsulation attributes specific to the .in -2 .TP -.BI group " GROUP" +.BI group " GROUP [ " type " TYPE ]" create a nexthop group. 
Group specification is id with an optional weight (id,weight) and a '/' as a separator between entries. +.sp +.I TYPE +is a string specifying the nexthop group type. Namely: + +.in +8 +.BI mpath +- Multipath nexthop group backed by the hash-threshold algorithm. The +default when the type is unspecified. + .TP .B blackhole create a blackhole nexthop -- 2.26.2
[PATCH iproute2-next v2 3/6] nexthop: Extract a helper to parse a NH ID
NH ID extraction is a common operation, and will become more common still
with the resilient NH groups support. Add a helper that does what is usually
done and returns the parsed NH ID.

Signed-off-by: Petr Machata
---
 ip/ipnexthop.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 20cde586596b..126b0b17cab4 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -327,6 +327,15 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv)
 	return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps));
 }
 
+static int ipnh_parse_id(const char *argv)
+{
+	__u32 id;
+
+	if (get_unsigned(&id, argv, 0))
+		invarg("invalid id value", argv);
+	return id;
+}
+
 static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 {
 	struct {
@@ -343,12 +352,9 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 
 	while (argc > 0) {
 		if (!strcmp(*argv, "id")) {
-			__u32 id;
-
 			NEXT_ARG();
-			if (get_unsigned(&id, *argv, 0))
-				invarg("invalid id value", *argv);
-			addattr32(&req.n, sizeof(req), NHA_ID, id);
+			addattr32(&req.n, sizeof(req), NHA_ID,
+				  ipnh_parse_id(*argv));
 		} else if (!strcmp(*argv, "dev")) {
 			int ifindex;
 
@@ -485,12 +491,8 @@ static int ipnh_list_flush(int argc, char **argv, int action)
 			if (!filter.master)
 				invarg("VRF does not exist\n", *argv);
 		} else if (!strcmp(*argv, "id")) {
-			__u32 id;
-
 			NEXT_ARG();
-			if (get_unsigned(&id, *argv, 0))
-				invarg("invalid id value", *argv);
-			return ipnh_get_id(id);
+			return ipnh_get_id(ipnh_parse_id(*argv));
 		} else if (!matches(*argv, "protocol")) {
 			__u32 proto;
 
@@ -536,8 +538,7 @@ static int ipnh_get(int argc, char **argv)
 	while (argc > 0) {
 		if (!strcmp(*argv, "id")) {
 			NEXT_ARG();
-			if (get_unsigned(&id, *argv, 0))
-				invarg("invalid id value", *argv);
+			id = ipnh_parse_id(*argv);
 		} else {
 			usage();
 		}
--
2.26.2
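For readers without the iproute2 tree at hand, here is a minimal standalone sketch of the kind of parsing such a helper wraps. ex_parse_u32() is a hypothetical stand-in written for this note; it is not get_unsigned() and may differ from it in which inputs it accepts.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a 32-bit unsigned ID from a command-line word.
 * Returns 0 and stores the value on success, -1 on any parse error.
 * Base 0 lets strtoul() accept decimal, octal (0...) and hex (0x...).
 */
static int ex_parse_u32(uint32_t *out, const char *arg)
{
	unsigned long val;
	char *end;

	/* Reject NULL, empty strings, signs and leading whitespace. */
	if (!arg || !isdigit((unsigned char)arg[0]))
		return -1;

	errno = 0;
	val = strtoul(arg, &end, 0);
	if (errno || *end != '\0' || val > UINT32_MAX)
		return -1;

	*out = (uint32_t)val;
	return 0;
}
```

A caller would report the error and exit on a non-zero return, which is the part that the three call sites in the patch previously duplicated.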
[PATCH iproute2-next v2 1/6] nexthop: Synchronize uAPI files
Signed-off-by: Petr Machata
---
 include/uapi/linux/nexthop.h   | 47 +-
 include/uapi/linux/rtnetlink.h |  7 +
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
index b0a5613905ef..37b14b4ea6c4 100644
--- a/include/uapi/linux/nexthop.h
+++ b/include/uapi/linux/nexthop.h
@@ -21,7 +21,10 @@ struct nexthop_grp {
 };
 
 enum {
-	NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
+	NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group
+				  * default type if not specified
+				  */
+	NEXTHOP_GRP_TYPE_RES,    /* resilient nexthop group */
 	__NEXTHOP_GRP_TYPE_MAX,
 };
 
@@ -52,8 +55,50 @@ enum {
 	NHA_FDB,	/* flag; nexthop belongs to a bridge fdb */
 	/* if NHA_FDB is added, OIF, BLACKHOLE, ENCAP cannot be set */
 
+	/* nested; resilient nexthop group attributes */
+	NHA_RES_GROUP,
+	/* nested; nexthop bucket attributes */
+	NHA_RES_BUCKET,
+
 	__NHA_MAX,
 };
 
 #define NHA_MAX	(__NHA_MAX - 1)
+
+enum {
+	NHA_RES_GROUP_UNSPEC,
+	/* Pad attribute for 64-bit alignment. */
+	NHA_RES_GROUP_PAD = NHA_RES_GROUP_UNSPEC,
+
+	/* u16; number of nexthop buckets in a resilient nexthop group */
+	NHA_RES_GROUP_BUCKETS,
+	/* clock_t as u32; nexthop bucket idle timer (per-group) */
+	NHA_RES_GROUP_IDLE_TIMER,
+	/* clock_t as u32; nexthop unbalanced timer */
+	NHA_RES_GROUP_UNBALANCED_TIMER,
+	/* clock_t as u64; nexthop unbalanced time */
+	NHA_RES_GROUP_UNBALANCED_TIME,
+
+	__NHA_RES_GROUP_MAX,
+};
+
+#define NHA_RES_GROUP_MAX	(__NHA_RES_GROUP_MAX - 1)
+
+enum {
+	NHA_RES_BUCKET_UNSPEC,
+	/* Pad attribute for 64-bit alignment. */
+	NHA_RES_BUCKET_PAD = NHA_RES_BUCKET_UNSPEC,
+
+	/* u16; nexthop bucket index */
+	NHA_RES_BUCKET_INDEX,
+	/* clock_t as u64; nexthop bucket idle time */
+	NHA_RES_BUCKET_IDLE_TIME,
+	/* u32; nexthop id assigned to the nexthop bucket */
+	NHA_RES_BUCKET_NH_ID,
+
+	__NHA_RES_BUCKET_MAX,
+};
+
+#define NHA_RES_BUCKET_MAX	(__NHA_RES_BUCKET_MAX - 1)
+
 #endif
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index b34b9add5f65..f6217651 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -178,6 +178,13 @@ enum {
 	RTM_GETVLAN,
 #define RTM_GETVLAN	RTM_GETVLAN
 
+	RTM_NEWNEXTHOPBUCKET = 116,
+#define RTM_NEWNEXTHOPBUCKET	RTM_NEWNEXTHOPBUCKET
+	RTM_DELNEXTHOPBUCKET,
+#define RTM_DELNEXTHOPBUCKET	RTM_DELNEXTHOPBUCKET
+	RTM_GETNEXTHOPBUCKET,
+#define RTM_GETNEXTHOPBUCKET	RTM_GETNEXTHOPBUCKET
+
 	__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
--
2.26.2
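The RTM_MAX expression in the rtnetlink.h hunk is easy to misread, so here is a small worked sketch of the arithmetic. The EX_ names are hypothetical copies made for this note; only the numeric values and the formula come from the patch. As far as I understand, the rounding exists because message types are allocated in blocks of four.

```c
/* Numbering from the hunk, reproduced as a standalone enum. */
enum {
	EX_RTM_NEWNEXTHOPBUCKET = 116,
	EX_RTM_DELNEXTHOPBUCKET,	/* 117 */
	EX_RTM_GETNEXTHOPBUCKET,	/* 118 */
	EX_RTM_END,			/* 119, plays the role of __RTM_MAX */
};

/* Same formula as RTM_MAX: round the end marker up to a multiple of 4,
 * then step back by one to get the last value of that block. */
#define EX_RTM_MAX (((EX_RTM_END + 3) & ~3) - 1)

/* Function form of the same arithmetic, handy for trying other inputs. */
static int ex_rtm_max(int end_marker)
{
	return ((end_marker + 3) & ~3) - 1;
}
```

With the end marker at 119, the computation is (122 & ~3) - 1 = 120 - 1 = 119, so the maximum covers the whole 116..119 block even though only three of its four slots are named.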
[PATCH iproute2-next v2 2/6] json_print: Add print_tv()
Add a helper to dump a timeval. Print by first converting to double and then
dispatching to print_color_float().

Signed-off-by: Petr Machata
---
 include/json_print.h |  1 +
 lib/json_print.c     | 13 +
 2 files changed, 14 insertions(+)

diff --git a/include/json_print.h b/include/json_print.h
index 6fcf9fd910ec..63eee3823fe4 100644
--- a/include/json_print.h
+++ b/include/json_print.h
@@ -81,6 +81,7 @@ _PRINT_FUNC(0xhex, unsigned long long)
 _PRINT_FUNC(luint, unsigned long)
 _PRINT_FUNC(lluint, unsigned long long)
 _PRINT_FUNC(float, double)
+_PRINT_FUNC(tv, struct timeval *)
 #undef _PRINT_FUNC
 
 #define _PRINT_NAME_VALUE_FUNC(type_name, type, format_char)		  \
diff --git a/lib/json_print.c b/lib/json_print.c
index 994a2f8d6ae0..1018bfb36d94 100644
--- a/lib/json_print.c
+++ b/lib/json_print.c
@@ -299,6 +299,19 @@ int print_color_null(enum output_type type,
 	return ret;
 }
 
+int print_color_tv(enum output_type type,
+		   enum color_attr color,
+		   const char *key,
+		   const char *fmt,
+		   struct timeval *tv)
+{
+	double usecs = tv->tv_usec;
+	double secs = tv->tv_sec;
+	double time = secs + usecs / 1000000;
+
+	return print_color_float(type, color, key, fmt, time);
+}
+
 /* Print line separator (if not in JSON mode) */
 void print_nl(void)
 {
--
2.26.2
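A minimal sketch of the conversion that print_tv() relies on, assuming tv_usec holds microseconds so that the fractional part is tv_usec divided by one million. ex_tv_to_seconds() is a hypothetical name used only for this illustration.

```c
#include <sys/time.h>

/* Convert a timeval to fractional seconds: 5 s + 590000 us is about 5.59. */
static double ex_tv_to_seconds(const struct timeval *tv)
{
	return (double)tv->tv_sec + (double)tv->tv_usec / 1000000.0;
}
```

The resulting double is what gets handed to a "%g"-style format, which is why bucket idle times show up as values such as 5.59 in the command output.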
[PATCH iproute2-next v2 0/6] ip: nexthop: Support resilient groups
Support for resilient next-hop groups was recently accepted to Linux
kernel[1]. Resilient next-hop groups add a layer of indirection between the
SKB hash and the next hop. Thus the hash is used to reference a hash table
bucket, which is then used to reference a particular next hop. This allows
the system more flexibility when assigning SKB hash space to next hops.

Previously, each next hop had to be assigned a continuous range of SKB hash
space. With a hash table as an intermediate layer, it is possible to
reassign next hops with a hash table bucket granularity. In turn, this mends
issues with traffic flow redirection resulting from next hop removal or
adjustments in next-hop weights.

In this patch set, introduce support for resilient next-hop groups to
iproute2.

- Patch #1 brings include/uapi/linux/nexthop.h and /rtnetlink.h up to date.

- Patches #2 and #3 add new helpers that will be useful later.

- Patch #4 extends the ip/nexthop sub-tool to accept group type as a
  command line argument, and to dispatch based on the specified type.

- Patch #5 adds the support for resilient next-hop groups.

- Patch #6 adds the support for resilient next-hop group bucket interface.

To illustrate the usage, consider the following commands:

 # ip nexthop add id 1 via 192.0.2.2 dev dummy1
 # ip nexthop add id 2 via 192.0.2.3 dev dummy1
 # ip nexthop add id 10 group 1/2 type resilient \
	buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8 buckets,
each bucket will be considered idle when no traffic hits it for at least 60
seconds, and if the table remains out of balance for 300 seconds, it will be
forcefully brought into balance.

And this is how the next-hop group bucket interface looks:

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=2a0186a37700b0d5b8cc40be202a62af44f02fa2

v2:
- Patch #4:
    - Add a missing example command to commit message
    - Mention in the man page that mpath is the default

Ido Schimmel (3):
  nexthop: Add ability to specify group type
  nexthop: Add support for resilient nexthop groups
  nexthop: Add support for nexthop buckets

Petr Machata (3):
  nexthop: Synchronize uAPI files
  json_print: Add print_tv()
  nexthop: Extract a helper to parse a NH ID

 include/json_print.h           |   1 +
 include/libnetlink.h           |   3 +
 include/uapi/linux/nexthop.h   |  47 +++-
 include/uapi/linux/rtnetlink.h |   7 +
 ip/ip_common.h                 |   1 +
 ip/ipmonitor.c                 |   6 +
 ip/ipnexthop.c                 | 451 -
 lib/json_print.c               |  13 +
 lib/libnetlink.c               |  26 ++
 man/man8/ip-nexthop.8          | 113 -
 10 files changed, 651 insertions(+), 17 deletions(-)

--
2.26.2
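The indirection described in the cover letter can be sketched in a few lines. This is only an illustration under stated assumptions: the ex_ names are hypothetical, the table size is fixed at 8 to match the example, and picking the bucket with a plain modulo is an assumption made for this note, not necessarily how the kernel maps an SKB hash to a bucket.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_NUM_BUCKETS 8

/* A resilient group reduced to its essence: bucket index -> nexthop ID. */
struct ex_res_group {
	uint32_t bucket_nhid[EX_NUM_BUCKETS];
};

/* Two-step lookup: the hash selects a bucket, the bucket names a nexthop. */
static uint32_t ex_select_nexthop(const struct ex_res_group *grp, uint32_t hash)
{
	return grp->bucket_nhid[hash % EX_NUM_BUCKETS];
}

/* Reassigning one bucket moves only the flows that hash to that bucket;
 * every other bucket, and therefore every other flow, is left alone. */
static void ex_migrate_bucket(struct ex_res_group *grp, size_t index,
			      uint32_t nhid)
{
	if (index < EX_NUM_BUCKETS)
		grp->bucket_nhid[index] = nhid;
}
```

Filling bucket_nhid with {1, 1, 2, 2, 1, 1, 1, 1} reproduces the table printed by "ip nexthop bucket show id 10" above. Migrating bucket 2 to nexthop 1 then changes the result only for hashes that land in bucket 2.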
Re: [PATCH iproute2-next v2] dcb: Fix compilation warning about reallocarray
Petr Machata writes:

> Roi Dayan writes:
>
>> --- a/dcb/dcb_app.c
>> +++ b/dcb/dcb_app.c
>> @@ -65,8 +65,7 @@ static void dcb_app_table_fini(struct dcb_app_table *tab)
>>
>>  static int dcb_app_table_push(struct dcb_app_table *tab, struct dcb_app *app)
>>  {
>> -struct dcb_app *apps = reallocarray(tab->apps, tab->n_apps + 1,
>> -sizeof(*tab->apps));
>> +struct dcb_app *apps = realloc(tab->apps, (tab->n_apps + 1) * sizeof(*tab->apps));
>
> Reviewed-by: Petr Machata

Could this be merged, please?
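For context on what the quoted change trades away: reallocarray() checks the element-count multiplication for overflow, while a bare realloc() with an open-coded product does not. The sketch below shows one portable way to keep that check. ex_realloc_array() is a hypothetical helper written for this note and is not part of the patch or of iproute2.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Grow or shrink an array of nmemb elements of the given size.
 * Fails with ENOMEM instead of wrapping around when nmemb * size
 * would not fit in size_t.
 */
static void *ex_realloc_array(void *ptr, size_t nmemb, size_t size)
{
	if (size != 0 && nmemb > SIZE_MAX / size) {
		errno = ENOMEM;
		return NULL;
	}

	return realloc(ptr, nmemb * size);
}
```

For a table that grows one entry at a time, as in the quoted hunk, the product is unlikely to overflow in practice, so the plain realloc() form is a reasonable fix for the warning.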
Re: [PATCH iproute2-next 4/6] nexthop: Add ability to specify group type
David Ahern writes:

> On 3/12/21 10:23 AM, Petr Machata wrote:
>> From: Petr Machata
>>
>> From: Ido Schimmel
>
> All of the patches have the above. If Ido is the author and you are
> sending, AIUI you add your Signed-off-by below his.

Sorry about that, that's a leftover from when I was sending the DCB patches.
I'll resend with the correct headers.

>> +.sp
>> +.I TYPE
>> +is a string specifying the nexthop group type. Namely:
>> +
>> +.in +8
>> +.BI mpath
>> +- multipath nexthop group
>> +
>
> Add a comment that this is the default group type and refers to the
> legacy hash-based multipath group.

OK.
[PATCH iproute2-next 3/6] nexthop: Extract a helper to parse a NH ID
From: Petr Machata

From: Petr Machata

NH ID extraction is a common operation, and will become more common still
with the resilient NH groups support. Add a helper that does what is usually
done and returns the parsed NH ID.

Signed-off-by: Petr Machata
---
 ip/ipnexthop.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
index 20cde586596b..126b0b17cab4 100644
--- a/ip/ipnexthop.c
+++ b/ip/ipnexthop.c
@@ -327,6 +327,15 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv)
 	return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps));
 }
 
+static int ipnh_parse_id(const char *argv)
+{
+	__u32 id;
+
+	if (get_unsigned(&id, argv, 0))
+		invarg("invalid id value", argv);
+	return id;
+}
+
 static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 {
 	struct {
@@ -343,12 +352,9 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv)
 
 	while (argc > 0) {
 		if (!strcmp(*argv, "id")) {
-			__u32 id;
-
 			NEXT_ARG();
-			if (get_unsigned(&id, *argv, 0))
-				invarg("invalid id value", *argv);
-			addattr32(&req.n, sizeof(req), NHA_ID, id);
+			addattr32(&req.n, sizeof(req), NHA_ID,
+				  ipnh_parse_id(*argv));
 		} else if (!strcmp(*argv, "dev")) {
 			int ifindex;
 
@@ -485,12 +491,8 @@ static int ipnh_list_flush(int argc, char **argv, int action)
 			if (!filter.master)
 				invarg("VRF does not exist\n", *argv);
 		} else if (!strcmp(*argv, "id")) {
-			__u32 id;
-
 			NEXT_ARG();
-			if (get_unsigned(&id, *argv, 0))
-				invarg("invalid id value", *argv);
-			return ipnh_get_id(id);
+			return ipnh_get_id(ipnh_parse_id(*argv));
 		} else if (!matches(*argv, "protocol")) {
 			__u32 proto;
 
@@ -536,8 +538,7 @@ static int ipnh_get(int argc, char **argv)
 	while (argc > 0) {
 		if (!strcmp(*argv, "id")) {
 			NEXT_ARG();
-			if (get_unsigned(&id, *argv, 0))
-				invarg("invalid id value", *argv);
+			id = ipnh_parse_id(*argv);
 		} else {
 			usage();
 		}
--
2.26.2
[PATCH iproute2-next 6/6] nexthop: Add support for nexthop buckets
From: Petr Machata From: Ido Schimmel Add ability to dump multiple nexthop buckets and get a specific one. Example: # ip nexthop add id 10 group 1/2 type resilient buckets 8 # ip nexthop id 1 via 192.0.2.2 dev dummy10 scope link id 2 via 192.0.2.19 dev dummy20 scope link id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 unbalanced_time 0 # ip nexthop bucket id 10 index 0 idle_time 28.1 nhid 2 id 10 index 1 idle_time 28.1 nhid 2 id 10 index 2 idle_time 28.1 nhid 2 id 10 index 3 idle_time 28.1 nhid 2 id 10 index 4 idle_time 28.1 nhid 1 id 10 index 5 idle_time 28.1 nhid 1 id 10 index 6 idle_time 28.1 nhid 1 id 10 index 7 idle_time 28.1 nhid 1 # ip nexthop bucket show nhid 1 id 10 index 4 idle_time 53.59 nhid 1 id 10 index 5 idle_time 53.59 nhid 1 id 10 index 6 idle_time 53.59 nhid 1 id 10 index 7 idle_time 53.59 nhid 1 # ip nexthop bucket get id 10 index 5 id 10 index 5 idle_time 81 nhid 1 # ip -j -p nexthop bucket get id 10 index 5 [ { "id": 10, "bucket": { "index": 5, "idle_time": 104.89, "nhid": 1 }, "flags": [ ] } ] Signed-off-by: Ido Schimmel --- include/libnetlink.h | 3 + ip/ip_common.h| 1 + ip/ipmonitor.c| 6 + ip/ipnexthop.c| 254 ++ lib/libnetlink.c | 26 + man/man8/ip-nexthop.8 | 45 6 files changed, 335 insertions(+) diff --git a/include/libnetlink.h b/include/libnetlink.h index b9073a6a13ad..e8ed5d7fb495 100644 --- a/include/libnetlink.h +++ b/include/libnetlink.h @@ -97,6 +97,9 @@ int rtnl_dump_request_n(struct rtnl_handle *rth, struct nlmsghdr *n) int rtnl_nexthopdump_req(struct rtnl_handle *rth, int family, req_filter_fn_t filter_fn) __attribute__((warn_unused_result)); +int rtnl_nexthop_bucket_dump_req(struct rtnl_handle *rth, int family, +req_filter_fn_t filter_fn) + __attribute__((warn_unused_result)); struct rtnl_ctrl_data { int nsid; diff --git a/ip/ip_common.h b/ip/ip_common.h index 9a31e837563f..55a5521c4275 100644 --- a/ip/ip_common.h +++ b/ip/ip_common.h @@ -53,6 +53,7 @@ int print_rule(struct nlmsghdr *n, void *arg); int 
print_netconf(struct rtnl_ctrl_data *ctrl, struct nlmsghdr *n, void *arg); int print_nexthop(struct nlmsghdr *n, void *arg); +int print_nexthop_bucket(struct nlmsghdr *n, void *arg); void netns_map_init(void); void netns_nsid_socket_init(void); int print_nsid(struct nlmsghdr *n, void *arg); diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c index 99f5fda8ba1f..d7f31cf5d1b5 100644 --- a/ip/ipmonitor.c +++ b/ip/ipmonitor.c @@ -90,6 +90,12 @@ static int accept_msg(struct rtnl_ctrl_data *ctrl, print_nexthop(n, arg); return 0; + case RTM_NEWNEXTHOPBUCKET: + case RTM_DELNEXTHOPBUCKET: + print_headers(fp, "[NEXTHOPBUCKET]", ctrl); + print_nexthop_bucket(n, arg); + return 0; + case RTM_NEWLINK: case RTM_DELLINK: ll_remember_index(n, NULL); diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 1d50bf7529c4..0263307c49df 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -21,6 +21,8 @@ static struct { unsigned int master; unsigned int proto; unsigned int fdb; + unsigned int id; + unsigned int nhid; } filter; enum { @@ -39,8 +41,11 @@ static void usage(void) "Usage: ip nexthop { list | flush } [ protocol ID ] SELECTOR\n" " ip nexthop { add | replace } id ID NH [ protocol ID ]\n" " ip nexthop { get | del } id ID\n" + " ip nexthop bucket list BUCKET_SELECTOR\n" + " ip nexthop bucket get id ID index INDEX\n" "SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n" "[ groups ] [ fdb ]\n" + "BUCKET_SELECTOR := SELECTOR | [ nhid ID ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" "[ encap ENCAPTYPE ENCAPHDR ] |\n" "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n" @@ -85,6 +90,36 @@ static int nh_dump_filter(struct nlmsghdr *nlh, int reqlen) return 0; } +static int nh_dump_bucket_filter(struct nlmsghdr *nlh, int reqlen) +{ + struct rtattr *nest; + int err = 0; + + err = nh_dump_filter(nlh, reqlen); + if (err) + return err; + + if (filter.id) { + err = addattr32(nlh, reqlen, NHA_ID, filter.id); + if (err) + return err; + } + + if (filter.nhid) { +
[PATCH iproute2-next 5/6] nexthop: Add support for resilient nexthop groups
From: Petr Machata From: Ido Schimmel Add ability to configure resilient nexthop groups and show their current configuration. Example: # ip nexthop add id 10 group 1/2 type resilient buckets 8 # ip nexthop show id 10 id 10 group 1/2 type resilient buckets 8 idle_timer 120 unbalanced_timer 0 # ip -j -p nexthop show id 10 [ { "id": 10, "group": [ { "id": 1 },{ "id": 2 } ], "type": "resilient", "resilient_args": { "buckets": 8, "idle_timer": 120, "unbalanced_timer": 0 }, "flags": [ ] } ] Signed-off-by: Ido Schimmel --- ip/ipnexthop.c| 144 +- man/man8/ip-nexthop.8 | 55 +++- 2 files changed, 193 insertions(+), 6 deletions(-) diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 5aae32629edd..1d50bf7529c4 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -43,9 +43,12 @@ static void usage(void) "[ groups ] [ fdb ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" "[ encap ENCAPTYPE ENCAPHDR ] |\n" - "group GROUP [ fdb ] [ type TYPE ] }\n" + "group GROUP [ fdb ] [ type TYPE [ TYPE_ARGS ] ] }\n" "GROUP := [ //... ]\n" - "TYPE := { mpath }\n" + "TYPE := { mpath | resilient }\n" + "TYPE_ARGS := [ RESILIENT_ARGS ]\n" + "RESILIENT_ARGS := [ buckets BUCKETS ] [ idle_timer IDLE ]\n" + " [ unbalanced_timer UNBALANCED ]\n" "ENCAPTYPE := [ mpls ]\n" "ENCAPHDR := [ MPLSLABEL ]\n"); exit(-1); @@ -203,6 +206,66 @@ static void print_nh_group(FILE *fp, const struct rtattr *grps_attr) close_json_array(PRINT_JSON, NULL); } +static const char *nh_group_type_name(__u16 type) +{ + switch (type) { + case NEXTHOP_GRP_TYPE_MPATH: + return "mpath"; + case NEXTHOP_GRP_TYPE_RES: + return "resilient"; + default: + return ""; + } +} + +static void print_nh_group_type(FILE *fp, const struct rtattr *grp_type_attr) +{ + __u16 type = rta_getattr_u16(grp_type_attr); + + if (type == NEXTHOP_GRP_TYPE_MPATH) + /* Do not print type in order not to break existing output. 
*/ + return; + + print_string(PRINT_ANY, "type", "type %s ", nh_group_type_name(type)); +} + +static void print_nh_res_group(FILE *fp, const struct rtattr *res_grp_attr) +{ + struct rtattr *tb[NHA_RES_GROUP_MAX + 1]; + struct rtattr *rta; + struct timeval tv; + + parse_rtattr_nested(tb, NHA_RES_GROUP_MAX, res_grp_attr); + + open_json_object("resilient_args"); + + if (tb[NHA_RES_GROUP_BUCKETS]) + print_uint(PRINT_ANY, "buckets", "buckets %u ", + rta_getattr_u16(tb[NHA_RES_GROUP_BUCKETS])); + + if (tb[NHA_RES_GROUP_IDLE_TIMER]) { + rta = tb[NHA_RES_GROUP_IDLE_TIMER]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "idle_timer", "idle_timer %g ", &tv); + } + + if (tb[NHA_RES_GROUP_UNBALANCED_TIMER]) { + rta = tb[NHA_RES_GROUP_UNBALANCED_TIMER]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "unbalanced_timer", "unbalanced_timer %g ", +&tv); + } + + if (tb[NHA_RES_GROUP_UNBALANCED_TIME]) { + rta = tb[NHA_RES_GROUP_UNBALANCED_TIME]; + __jiffies_to_tv(&tv, rta_getattr_u32(rta)); + print_tv(PRINT_ANY, "unbalanced_time", "unbalanced_time %g ", +&tv); + } + + close_json_object(); +} + int print_nexthop(struct nlmsghdr *n, void *arg) { struct nhmsg *nhm = NLMSG_DATA(n); @@ -229,7 +292,7 @@ int print_nexthop(struct nlmsghdr *n, void *arg) if (filter.proto && filter.proto != nhm->nh_protocol) return 0; - parse_rtattr(tb, NHA_MAX, RTM_NHA(nhm), len); + parse_rtattr_flags(tb, NHA_MAX, RTM_NHA(nhm), len, NLA_F_NESTED); open_json_object(NULL); @@ -243,6 +306,12 @@ int print_nexthop(struct nlmsghdr *n, void *arg) if (tb[NHA_GROUP]) print_nh_group(fp, tb[NHA_GROUP]); + if (tb[NHA_GROUP_TYPE]) + print_nh_group_type(fp, tb[NHA_G
[PATCH iproute2-next 1/6] nexthop: Synchronize uAPI files
From: Petr Machata

From: Ido Schimmel

Signed-off-by: Petr Machata
---
 include/uapi/linux/nexthop.h   | 47 +-
 include/uapi/linux/rtnetlink.h |  7 +
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
index b0a5613905ef..37b14b4ea6c4 100644
--- a/include/uapi/linux/nexthop.h
+++ b/include/uapi/linux/nexthop.h
@@ -21,7 +21,10 @@ struct nexthop_grp {
 };
 
 enum {
-	NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
+	NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group
+				  * default type if not specified
+				  */
+	NEXTHOP_GRP_TYPE_RES,    /* resilient nexthop group */
 	__NEXTHOP_GRP_TYPE_MAX,
 };
 
@@ -52,8 +55,50 @@ enum {
 	NHA_FDB,	/* flag; nexthop belongs to a bridge fdb */
 	/* if NHA_FDB is added, OIF, BLACKHOLE, ENCAP cannot be set */
 
+	/* nested; resilient nexthop group attributes */
+	NHA_RES_GROUP,
+	/* nested; nexthop bucket attributes */
+	NHA_RES_BUCKET,
+
 	__NHA_MAX,
 };
 
 #define NHA_MAX	(__NHA_MAX - 1)
+
+enum {
+	NHA_RES_GROUP_UNSPEC,
+	/* Pad attribute for 64-bit alignment. */
+	NHA_RES_GROUP_PAD = NHA_RES_GROUP_UNSPEC,
+
+	/* u16; number of nexthop buckets in a resilient nexthop group */
+	NHA_RES_GROUP_BUCKETS,
+	/* clock_t as u32; nexthop bucket idle timer (per-group) */
+	NHA_RES_GROUP_IDLE_TIMER,
+	/* clock_t as u32; nexthop unbalanced timer */
+	NHA_RES_GROUP_UNBALANCED_TIMER,
+	/* clock_t as u64; nexthop unbalanced time */
+	NHA_RES_GROUP_UNBALANCED_TIME,
+
+	__NHA_RES_GROUP_MAX,
+};
+
+#define NHA_RES_GROUP_MAX	(__NHA_RES_GROUP_MAX - 1)
+
+enum {
+	NHA_RES_BUCKET_UNSPEC,
+	/* Pad attribute for 64-bit alignment. */
+	NHA_RES_BUCKET_PAD = NHA_RES_BUCKET_UNSPEC,
+
+	/* u16; nexthop bucket index */
+	NHA_RES_BUCKET_INDEX,
+	/* clock_t as u64; nexthop bucket idle time */
+	NHA_RES_BUCKET_IDLE_TIME,
+	/* u32; nexthop id assigned to the nexthop bucket */
+	NHA_RES_BUCKET_NH_ID,
+
+	__NHA_RES_BUCKET_MAX,
+};
+
+#define NHA_RES_BUCKET_MAX	(__NHA_RES_BUCKET_MAX - 1)
+
 #endif
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index b34b9add5f65..f6217651 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -178,6 +178,13 @@ enum {
 	RTM_GETVLAN,
 #define RTM_GETVLAN	RTM_GETVLAN
 
+	RTM_NEWNEXTHOPBUCKET = 116,
+#define RTM_NEWNEXTHOPBUCKET	RTM_NEWNEXTHOPBUCKET
+	RTM_DELNEXTHOPBUCKET,
+#define RTM_DELNEXTHOPBUCKET	RTM_DELNEXTHOPBUCKET
+	RTM_GETNEXTHOPBUCKET,
+#define RTM_GETNEXTHOPBUCKET	RTM_GETNEXTHOPBUCKET
+
 	__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
--
2.26.2
[PATCH iproute2-next 4/6] nexthop: Add ability to specify group type
From: Petr Machata From: Ido Schimmel Next patches are going to add a 'resilient' nexthop group type, so allow users to specify the type using the 'type' argument. Currently, only 'mpath' type is supported. These two command are equivalent: Signed-off-by: Ido Schimmel --- ip/ipnexthop.c| 32 +++- man/man8/ip-nexthop.8 | 18 -- 2 files changed, 47 insertions(+), 3 deletions(-) diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c index 126b0b17cab4..5aae32629edd 100644 --- a/ip/ipnexthop.c +++ b/ip/ipnexthop.c @@ -42,8 +42,10 @@ static void usage(void) "SELECTOR := [ id ID ] [ dev DEV ] [ vrf NAME ] [ master DEV ]\n" "[ groups ] [ fdb ]\n" "NH := { blackhole | [ via ADDRESS ] [ dev DEV ] [ onlink ]\n" - "[ encap ENCAPTYPE ENCAPHDR ] | group GROUP [ fdb ] }\n" + "[ encap ENCAPTYPE ENCAPHDR ] |\n" + "group GROUP [ fdb ] [ type TYPE ] }\n" "GROUP := [ //... ]\n" + "TYPE := { mpath }\n" "ENCAPTYPE := [ mpls ]\n" "ENCAPHDR := [ MPLSLABEL ]\n"); exit(-1); @@ -327,6 +329,32 @@ static int add_nh_group_attr(struct nlmsghdr *n, int maxlen, char *argv) return addattr_l(n, maxlen, NHA_GROUP, grps, count * sizeof(*grps)); } +static int read_nh_group_type(const char *name) +{ + if (strcmp(name, "mpath") == 0) + return NEXTHOP_GRP_TYPE_MPATH; + + return __NEXTHOP_GRP_TYPE_MAX; +} + +static void parse_nh_group_type(struct nlmsghdr *n, int maxlen, int *argcp, + char ***argvp) +{ + char **argv = *argvp; + int argc = *argcp; + __u16 type; + + NEXT_ARG(); + type = read_nh_group_type(*argv); + if (type > NEXTHOP_GRP_TYPE_MAX) + invarg("\"type\" value is invalid\n", *argv); + + *argcp = argc; + *argvp = argv; + + addattr16(n, maxlen, NHA_GROUP_TYPE, type); +} + static int ipnh_parse_id(const char *argv) { __u32 id; @@ -409,6 +437,8 @@ static int ipnh_modify(int cmd, unsigned int flags, int argc, char **argv) if (add_nh_group_attr(&req.n, sizeof(req), *argv)) invarg("\"group\" value is invalid\n", *argv); + } else if (!strcmp(*argv, "type")) { + parse_nh_group_type(&req.n, sizeof(req), &argc, 
&argv); } else if (matches(*argv, "protocol") == 0) { __u32 prot; diff --git a/man/man8/ip-nexthop.8 b/man/man8/ip-nexthop.8 index 4d55f4dbcc75..f02e0555a000 100644 --- a/man/man8/ip-nexthop.8 +++ b/man/man8/ip-nexthop.8 @@ -54,7 +54,9 @@ ip-nexthop \- nexthop object management .BR fdb " ] | " .B group .IR GROUP " [ " -.BR fdb " ] } " +.BR fdb " ] [ " +.B type +.IR TYPE " ] } " .ti -8 .IR ENCAP " := [ " @@ -71,6 +73,10 @@ ip-nexthop \- nexthop object management .IR GROUP " := " .BR id "[," weight "[/...]" +.ti -8 +.IR TYPE " := { " +.BR mpath " }" + .SH DESCRIPTION .B ip nexthop is used to manipulate entries in the kernel's nexthop tables. @@ -122,9 +128,17 @@ is a set of encapsulation attributes specific to the .in -2 .TP -.BI group " GROUP" +.BI group " GROUP [ " type " TYPE ]" create a nexthop group. Group specification is id with an optional weight (id,weight) and a '/' as a separator between entries. +.sp +.I TYPE +is a string specifying the nexthop group type. Namely: + +.in +8 +.BI mpath +- multipath nexthop group + .TP .B blackhole create a blackhole nexthop -- 2.26.2
[PATCH iproute2-next 0/6] ip: nexthop: Support resilient groups
Support for resilient next-hop groups was recently accepted to Linux
kernel[1]. Resilient next-hop groups add a layer of indirection between the
SKB hash and the next hop. Thus the hash is used to reference a hash table
bucket, which is then used to reference a particular next hop. This allows
the system more flexibility when assigning SKB hash space to next hops.

Previously, each next hop had to be assigned a continuous range of SKB hash
space. With a hash table as an intermediate layer, it is possible to
reassign next hops with a hash table bucket granularity. In turn, this mends
issues with traffic flow redirection resulting from next hop removal or
adjustments in next-hop weights.

In this patch set, introduce support for resilient next-hop groups to
iproute2.

- Patch #1 brings include/uapi/linux/nexthop.h and /rtnetlink.h up to date.

- Patches #2 and #3 add new helpers that will be useful later.

- Patch #4 extends the ip/nexthop sub-tool to accept group type as a
  command line argument, and to dispatch based on the specified type.

- Patch #5 adds the support for resilient next-hop groups.

- Patch #6 adds the support for resilient next-hop group bucket interface.

To illustrate the usage, consider the following commands:

 # ip nexthop add id 1 via 192.0.2.2 dev dummy1
 # ip nexthop add id 2 via 192.0.2.3 dev dummy1
 # ip nexthop add id 10 group 1/2 type resilient \
	buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8 buckets,
each bucket will be considered idle when no traffic hits it for at least 60
seconds, and if the table remains out of balance for 300 seconds, it will be
forcefully brought into balance.

And this is how the next-hop group bucket interface looks:

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

[1] https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=2a0186a37700b0d5b8cc40be202a62af44f02fa2

Ido Schimmel (4):
  nexthop: Synchronize uAPI files
  nexthop: Add ability to specify group type
  nexthop: Add support for resilient nexthop groups
  nexthop: Add support for nexthop buckets

Petr Machata (2):
  json_print: Add print_tv()
  nexthop: Extract a helper to parse a NH ID

 include/json_print.h           |   1 +
 include/libnetlink.h           |   3 +
 include/uapi/linux/nexthop.h   |  47 +++-
 include/uapi/linux/rtnetlink.h |   7 +
 ip/ip_common.h                 |   1 +
 ip/ipmonitor.c                 |   6 +
 ip/ipnexthop.c                 | 451 -
 lib/json_print.c               |  13 +
 lib/libnetlink.c               |  26 ++
 man/man8/ip-nexthop.8          | 112 +++-
 10 files changed, 650 insertions(+), 17 deletions(-)

--
2.26.2
[PATCH iproute2-next 2/6] json_print: Add print_tv()
From: Petr Machata

From: Petr Machata

Add a helper to dump a timeval. Print by first converting to double and then
dispatching to print_color_float().

Signed-off-by: Petr Machata
---
 include/json_print.h |  1 +
 lib/json_print.c     | 13 +
 2 files changed, 14 insertions(+)

diff --git a/include/json_print.h b/include/json_print.h
index 6fcf9fd910ec..63eee3823fe4 100644
--- a/include/json_print.h
+++ b/include/json_print.h
@@ -81,6 +81,7 @@ _PRINT_FUNC(0xhex, unsigned long long)
 _PRINT_FUNC(luint, unsigned long)
 _PRINT_FUNC(lluint, unsigned long long)
 _PRINT_FUNC(float, double)
+_PRINT_FUNC(tv, struct timeval *)
 #undef _PRINT_FUNC
 
 #define _PRINT_NAME_VALUE_FUNC(type_name, type, format_char)		  \
diff --git a/lib/json_print.c b/lib/json_print.c
index 994a2f8d6ae0..1018bfb36d94 100644
--- a/lib/json_print.c
+++ b/lib/json_print.c
@@ -299,6 +299,19 @@ int print_color_null(enum output_type type,
 	return ret;
 }
 
+int print_color_tv(enum output_type type,
+		   enum color_attr color,
+		   const char *key,
+		   const char *fmt,
+		   struct timeval *tv)
+{
+	double usecs = tv->tv_usec;
+	double secs = tv->tv_sec;
+	double time = secs + usecs / 1000000;
+
+	return print_color_float(type, color, key, fmt, time);
+}
+
 /* Print line separator (if not in JSON mode) */
 void print_nl(void)
 {
--
2.26.2
[PATCH net-next 10/10] selftests: netdevsim: Add test for resilient nexthop groups offload API
From: Ido Schimmel Test various aspects of the resilient nexthop group offload API on top of the netdevsim implementation. Both good and bad flows are tested. Signed-off-by: Ido Schimmel Co-developed-by: Petr Machata Signed-off-by: Petr Machata --- .../drivers/net/netdevsim/nexthop.sh | 620 ++ 1 file changed, 620 insertions(+) diff --git a/tools/testing/selftests/drivers/net/netdevsim/nexthop.sh b/tools/testing/selftests/drivers/net/netdevsim/nexthop.sh index be0c1b5ee6b8..ba75c81cda91 100755 --- a/tools/testing/selftests/drivers/net/netdevsim/nexthop.sh +++ b/tools/testing/selftests/drivers/net/netdevsim/nexthop.sh @@ -11,14 +11,33 @@ ALL_TESTS=" nexthop_single_add_err_test nexthop_group_add_test nexthop_group_add_err_test + nexthop_res_group_add_test + nexthop_res_group_add_err_test nexthop_group_replace_test nexthop_group_replace_err_test + nexthop_res_group_replace_test + nexthop_res_group_replace_err_test + nexthop_res_group_idle_timer_test + nexthop_res_group_idle_timer_del_test + nexthop_res_group_increase_idle_timer_test + nexthop_res_group_decrease_idle_timer_test + nexthop_res_group_unbalanced_timer_test + nexthop_res_group_unbalanced_timer_del_test + nexthop_res_group_no_unbalanced_timer_test + nexthop_res_group_short_unbalanced_timer_test + nexthop_res_group_increase_unbalanced_timer_test + nexthop_res_group_decrease_unbalanced_timer_test + nexthop_res_group_force_migrate_busy_test nexthop_single_replace_test nexthop_single_replace_err_test nexthop_single_in_group_replace_test nexthop_single_in_group_replace_err_test + nexthop_single_in_res_group_replace_test + nexthop_single_in_res_group_replace_err_test nexthop_single_in_group_delete_test nexthop_single_in_group_delete_err_test + nexthop_single_in_res_group_delete_test + nexthop_single_in_res_group_delete_err_test nexthop_replay_test nexthop_replay_err_test " @@ -27,6 +46,7 @@ DEV_ADDR=1337 DEV=netdevsim${DEV_ADDR} DEVLINK_DEV=netdevsim/${DEV} SYSFS_NET_DIR=/sys/bus/netdevsim/devices/$DEV/net/ 
+DEBUGFS_NET_DIR=/sys/kernel/debug/netdevsim/$DEV/ NUM_NETIFS=0 source $lib_dir/lib.sh source $lib_dir/devlink_lib.sh @@ -44,6 +64,28 @@ nexthop_check() return 0 } +nexthop_bucket_nhid_count_check() +{ + local group_id=$1; shift + local expected + local count + local nhid + local ret + + while (($# > 0)); do + nhid=$1; shift + expected=$1; shift + + count=$($IP nexthop bucket show id $group_id nhid $nhid | + grep "trap" | wc -l) + if ((expected != count)); then + return 1 + fi + done + + return 0 +} + nexthop_resource_check() { local expected_occ=$1; shift @@ -159,6 +201,71 @@ nexthop_group_add_err_test() nexthop_resource_set } +nexthop_res_group_add_test() +{ + RET=0 + + $IP nexthop add id 1 via 192.0.2.2 dev dummy1 + $IP nexthop add id 2 via 192.0.2.3 dev dummy1 + + $IP nexthop add id 10 group 1/2 type resilient buckets 4 + nexthop_check "id 10" "id 10 group 1/2 type resilient buckets 4 idle_timer 120 unbalanced_timer 0 unbalanced_time 0 trap" + check_err $? "Unexpected nexthop group entry" + + nexthop_bucket_nhid_count_check 10 1 2 + check_err $? "Wrong nexthop buckets count" + nexthop_bucket_nhid_count_check 10 2 2 + check_err $? "Wrong nexthop buckets count" + + nexthop_resource_check 6 + check_err $? "Wrong nexthop occupancy" + + $IP nexthop del id 10 + nexthop_resource_check 2 + check_err $? "Wrong nexthop occupancy after delete" + + $IP nexthop add id 10 group 1,3/2,2 type resilient buckets 5 + nexthop_check "id 10" "id 10 group 1,3/2,2 type resilient buckets 5 idle_timer 120 unbalanced_timer 0 unbalanced_time 0 trap" + check_err $? "Unexpected weighted nexthop group entry" + + nexthop_bucket_nhid_count_check 10 1 3 + check_err $? "Wrong nexthop buckets count" + nexthop_bucket_nhid_count_check 10 2 2 + check_err $? "Wrong nexthop buckets count" + + nexthop_resource_check 7 + check_err $? "Wrong weighted nexthop occupancy" + + $IP nexthop del id 10 + nexthop_resource_check 2 + check_err $? 
"Wrong nexthop occupancy after delete" + + log_test "Resilient nexthop group add and delete" + + $IP nexthop flush &> /dev/null +} + +nexthop_res_group_add_err_test() +{ + RET=0 + + nexthop_resource_set 2 + + $IP nexthop add id 1 via 192.0
[PATCH net-next 09/10] selftests: forwarding: Add resilient multipath tunneling nexthop test
From: Ido Schimmel Add a resilient nexthop objects version of gre_multipath_nh.sh. Test that both IPv4 and IPv6 overlays work with resilient nexthop groups where the nexthops are two GRE tunnels. Signed-off-by: Ido Schimmel Reviewed-by: Petr Machata Signed-off-by: Petr Machata --- .../net/forwarding/gre_multipath_nh_res.sh| 361 ++ 1 file changed, 361 insertions(+) create mode 100755 tools/testing/selftests/net/forwarding/gre_multipath_nh_res.sh diff --git a/tools/testing/selftests/net/forwarding/gre_multipath_nh_res.sh b/tools/testing/selftests/net/forwarding/gre_multipath_nh_res.sh new file mode 100755 index ..088b65e64d66 --- /dev/null +++ b/tools/testing/selftests/net/forwarding/gre_multipath_nh_res.sh @@ -0,0 +1,361 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +# Test traffic distribution when a wECMP route forwards traffic to two GRE +# tunnels. +# +# +-+ +# | H1 | +# | $h1 + | +# | 192.0.2.1/28 | | +# | 2001:db8:1::1/64 | | +# +---|-+ +# | +# +---|+ +# | SW1 || +# | $ol1 +| +# | 192.0.2.2/28 | +# | 2001:db8:1::2/64 | +# || +# | + g1a (gre) + g1b (gre) | +# |loc=192.0.2.65 loc=192.0.2.81 | +# |rem=192.0.2.66 --. rem=192.0.2.82 --. | +# |tos=inherit | tos=inherit | | +# | .--'| | +# | |.--' | +# | vv| +# | + $ul1.111 (vlan)+ $ul1.222 (vlan)| +# | | 192.0.2.129/28 | 192.0.2.145/28 | +# | \ / | +# |\/ | +# || | +# |+ $ul1 | +# +|---+ +# | +# +|---+ +# | SW2+ $ul2 | +# | ___| | +# |/\ | +# | / \ | +# | + $ul2.111 (vlan)+ $ul2.222 (vlan)| +# | ^ 192.0.2.130/28 ^ 192.0.2.146/28 | +# | ||| +# | |'--. 
| +# | '--.| | +# | + g2a (gre)| + g2b (gre)| | +# |loc=192.0.2.66 | loc=192.0.2.82 | | +# |rem=192.0.2.65 --' rem=192.0.2.81 --' | +# |tos=inherit tos=inherit| +# || +# | $ol2 +| +# | 192.0.2.17/28 || +# | 2001:db8:2::1/64 || +# +---|+ +# | +# +---|-+ +# | H2| | +# | $h2 + | +# | 192.0.2.18/28 | +# | 2001:db8:2::2/64 | +# +-+ + +ALL_TESTS=" + ping_ipv4 + ping_ipv6 + multipath_ipv4 + multipath_ipv6 + multipath_ipv6_l4 +" + +NUM_NETIFS=6 +source lib.sh + +h1_create() +{ + simple_if_init $h1 192.0.2.1/28 2001:db8:1::1/64 + ip route add vrf v$h1 192.0.2.16/28 via 192.0.2.2 + ip route add vrf v$h1 2001:db8:2::/64 via 2001:db8:1::2 +} + +h1_destroy() +{ + ip route del vrf v$h1 2001:db8:2::/64 via 2001:db8:1::2 + ip route del vrf v$h1 192.0.2.16/28 via 192.0.2.2 + simple_if_fini $h1 192.0.2.1/28 +} + +sw1_create() +{ + simple_if_init $ol1 192.0.2.2/28 2001:db8:1::2/64 + __simple_if_init $ul1 v$ol1 + vlan_create $ul1 111 v$ol1 192.0.2.129/28 + vlan_create $ul1 222 v$ol1 192.0.2.145/28 + + tunnel_create g1a gre 192.0.2.65 192.0.2.66 tos inherit dev v$ol1 + __simple_if_init g1a v$ol1 192.0.2.65/32 + ip route add vrf v$ol1 192.0.2.66/32 via 192.0.2.130 + + tunnel_create g1b gre 192.0.2.81 192.0.2.82 tos inherit dev v$ol1 + __simple_if_init g1b v$ol1 192.0.2.81/32 + ip route add vrf v$ol1 192.0.2.82/32 via 192.0.2.146 + + ip -6 nexthop add id 101 dev g1a + ip -6 nexthop add id 102 dev g1b + ip nexthop add id 103 group 101/102 type resilient buckets 512 \ + idle_timer 0 + + ip route add vrf v$ol1 192.0.2.16/28 nhid 103 + ip route add vrf v$ol1 2001:db8:2::/64 nhid 103 +} + +sw1_destroy() +{ + ip route del vrf v$ol1 2001:db8:2::/64 + ip route del vrf v$ol1 192.0.2.16/28 + + ip nexthop del id 103 + ip -6 nexthop del id 102 + ip -6 nexthop del id 101 + +
[PATCH net-next 07/10] selftests: fib_nexthops: Test resilient nexthop groups
by: Ido Schimmel Co-developed-by: Petr Machata Signed-off-by: Petr Machata --- tools/testing/selftests/net/fib_nexthops.sh | 517 1 file changed, 517 insertions(+) diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh index c840aa88ff18..56dd0c6f2e96 100755 --- a/tools/testing/selftests/net/fib_nexthops.sh +++ b/tools/testing/selftests/net/fib_nexthops.sh @@ -22,26 +22,33 @@ ksft_skip=4 IPV4_TESTS=" ipv4_fcnal ipv4_grp_fcnal + ipv4_res_grp_fcnal ipv4_withv6_fcnal ipv4_fcnal_runtime ipv4_large_grp + ipv4_large_res_grp ipv4_compat_mode ipv4_fdb_grp_fcnal ipv4_torture + ipv4_res_torture " IPV6_TESTS=" ipv6_fcnal ipv6_grp_fcnal + ipv6_res_grp_fcnal ipv6_fcnal_runtime ipv6_large_grp + ipv6_large_res_grp ipv6_compat_mode ipv6_fdb_grp_fcnal ipv6_torture + ipv6_res_torture " ALL_TESTS=" basic + basic_res ${IPV4_TESTS} ${IPV6_TESTS} " @@ -254,6 +261,19 @@ check_nexthop() check_output "${out}" "${expected}" } +check_nexthop_bucket() +{ + local nharg="$1" + local expected="$2" + local out + + # remove the idle time since we cannot match it + out=$($IP nexthop bucket ${nharg} \ + | sed s/idle_time\ [0-9.]*\ // 2>/dev/null) + + check_output "${out}" "${expected}" +} + check_route() { local pfx="$1" @@ -330,6 +350,25 @@ check_large_grp() log_test $? 0 "Dump large (x$ecmp) ecmp groups" } +check_large_res_grp() +{ + local ipv=$1 + local buckets=$2 + local ipstr="" + + if [ $ipv -eq 4 ]; then + ipstr="172.16.1.2" + else + ipstr="2001:db8:91::2" + fi + + # create a resilient group with $buckets buckets and dump them + run_cmd "$IP nexthop add id 100 via $ipstr dev veth1" + run_cmd "$IP nexthop add id 1000 group 100 type resilient buckets $buckets" + run_cmd "$IP nexthop bucket list" + log_test $? 0 "Dump large (x$buckets) nexthop buckets" +} + start_ip_monitor() { local mtype=$1 @@ -366,6 +405,15 @@ check_nexthop_fdb_support() fi } +check_nexthop_res_support() +{ + $IP nexthop help 2>&1 | grep -q resilient + if [ $? 
-ne 0 ]; then + echo "SKIP: iproute2 too old, missing resilient nexthop group support" + return $ksft_skip + fi +} + ipv6_fdb_grp_fcnal() { local rc @@ -688,6 +736,70 @@ ipv6_grp_fcnal() log_test $? 2 "Nexthop group can not have a blackhole and another nexthop" } +ipv6_res_grp_fcnal() +{ + local rc + + echo + echo "IPv6 resilient groups functional" + echo "" + + check_nexthop_res_support + if [ $? -eq $ksft_skip ]; then + return $ksft_skip + fi + + # + # migration of nexthop buckets - equal weights + # + run_cmd "$IP nexthop add id 62 via 2001:db8:91::2 dev veth1" + run_cmd "$IP nexthop add id 63 via 2001:db8:91::3 dev veth1" + run_cmd "$IP nexthop add id 102 group 62/63 type resilient buckets 2 idle_timer 0" + + run_cmd "$IP nexthop del id 63" + check_nexthop "id 102" \ + "id 102 group 62 type resilient buckets 2 idle_timer 0 unbalanced_timer 0 unbalanced_time 0" + log_test $? 0 "Nexthop group updated when entry is deleted" + check_nexthop_bucket "list id 102" \ + "id 102 index 0 nhid 62 id 102 index 1 nhid 62" + log_test $? 0 "Nexthop buckets updated when entry is deleted" + + run_cmd "$IP nexthop add id 63 via 2001:db8:91::3 dev veth1" + run_cmd "$IP nexthop replace id 102 group 62/63 type resilient buckets 2 idle_timer 0" + check_nexthop "id 102" \ + "id 102 group 62/63 type resilient buckets 2 idle_timer 0 unbalanced_timer 0 unbalanced_time 0" + log_test $? 0 "Nexthop group updated after replace" + check_nexthop_bucket "list id 102" \ + "id 102 index 0 nhid 63 id 102 index 1 nhid 62" + log_test $? 0 "Nexthop buckets updated after replace" + + $IP nexthop flush >/dev/null 2>&1 + + # + # migration of nexthop buckets - unequal weights + # + run_cmd "$IP nexthop add id 62 via 2001:db8:91::2 dev veth1" + run_cmd "$IP nexthop add id 63 via 2001:db8:91::3 dev veth1" + run_c
[PATCH net-next 08/10] selftests: forwarding: Add resilient hashing test
From: Ido Schimmel Verify that IPv4 and IPv6 multipath forwarding works correctly with resilient nexthop groups and with different weights. Test that when the idle timer is not zero, the resilient groups are not rebalanced - because the nexthop buckets are considered active - and the initial weights (1:1) are used. Signed-off-by: Ido Schimmel Reviewed-by: Petr Machata Signed-off-by: Petr Machata --- .../net/forwarding/router_mpath_nh_res.sh | 400 ++ 1 file changed, 400 insertions(+) create mode 100755 tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh diff --git a/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh b/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh new file mode 100755 index ..4898dd4118f1 --- /dev/null +++ b/tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh @@ -0,0 +1,400 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +ALL_TESTS=" + ping_ipv4 + ping_ipv6 + multipath_test +" +NUM_NETIFS=8 +source lib.sh + +h1_create() +{ + vrf_create "vrf-h1" + ip link set dev $h1 master vrf-h1 + + ip link set dev vrf-h1 up + ip link set dev $h1 up + + ip address add 192.0.2.2/24 dev $h1 + ip address add 2001:db8:1::2/64 dev $h1 + + ip route add 198.51.100.0/24 vrf vrf-h1 nexthop via 192.0.2.1 + ip route add 2001:db8:2::/64 vrf vrf-h1 nexthop via 2001:db8:1::1 +} + +h1_destroy() +{ + ip route del 2001:db8:2::/64 vrf vrf-h1 + ip route del 198.51.100.0/24 vrf vrf-h1 + + ip address del 2001:db8:1::2/64 dev $h1 + ip address del 192.0.2.2/24 dev $h1 + + ip link set dev $h1 down + vrf_destroy "vrf-h1" +} + +h2_create() +{ + vrf_create "vrf-h2" + ip link set dev $h2 master vrf-h2 + + ip link set dev vrf-h2 up + ip link set dev $h2 up + + ip address add 198.51.100.2/24 dev $h2 + ip address add 2001:db8:2::2/64 dev $h2 + + ip route add 192.0.2.0/24 vrf vrf-h2 nexthop via 198.51.100.1 + ip route add 2001:db8:1::/64 vrf vrf-h2 nexthop via 2001:db8:2::1 +} + +h2_destroy() +{ + ip route del 2001:db8:1::/64 vrf vrf-h2 + ip 
route del 192.0.2.0/24 vrf vrf-h2 + + ip address del 2001:db8:2::2/64 dev $h2 + ip address del 198.51.100.2/24 dev $h2 + + ip link set dev $h2 down + vrf_destroy "vrf-h2" +} + +router1_create() +{ + vrf_create "vrf-r1" + ip link set dev $rp11 master vrf-r1 + ip link set dev $rp12 master vrf-r1 + ip link set dev $rp13 master vrf-r1 + + ip link set dev vrf-r1 up + ip link set dev $rp11 up + ip link set dev $rp12 up + ip link set dev $rp13 up + + ip address add 192.0.2.1/24 dev $rp11 + ip address add 2001:db8:1::1/64 dev $rp11 + + ip address add 169.254.2.12/24 dev $rp12 + ip address add fe80:2::12/64 dev $rp12 + + ip address add 169.254.3.13/24 dev $rp13 + ip address add fe80:3::13/64 dev $rp13 +} + +router1_destroy() +{ + ip route del 2001:db8:2::/64 vrf vrf-r1 + ip route del 198.51.100.0/24 vrf vrf-r1 + + ip address del fe80:3::13/64 dev $rp13 + ip address del 169.254.3.13/24 dev $rp13 + + ip address del fe80:2::12/64 dev $rp12 + ip address del 169.254.2.12/24 dev $rp12 + + ip address del 2001:db8:1::1/64 dev $rp11 + ip address del 192.0.2.1/24 dev $rp11 + + ip nexthop del id 103 + ip nexthop del id 101 + ip nexthop del id 102 + ip nexthop del id 106 + ip nexthop del id 104 + ip nexthop del id 105 + + ip link set dev $rp13 down + ip link set dev $rp12 down + ip link set dev $rp11 down + + vrf_destroy "vrf-r1" +} + +router2_create() +{ + vrf_create "vrf-r2" + ip link set dev $rp21 master vrf-r2 + ip link set dev $rp22 master vrf-r2 + ip link set dev $rp23 master vrf-r2 + + ip link set dev vrf-r2 up + ip link set dev $rp21 up + ip link set dev $rp22 up + ip link set dev $rp23 up + + ip address add 198.51.100.1/24 dev $rp21 + ip address add 2001:db8:2::1/64 dev $rp21 + + ip address add 169.254.2.22/24 dev $rp22 + ip address add fe80:2::22/64 dev $rp22 + + ip address add 169.254.3.23/24 dev $rp23 + ip address add fe80:3::23/64 dev $rp23 +} + +router2_destroy() +{ + ip route del 2001:db8:1::/64 vrf vrf-r2 + ip route del 192.0.2.0/24 vrf vrf-r2 + + ip address del 
fe80:3::23/64 dev $rp23 + ip address del 169.254.3.23/24 dev $rp23 + + ip address del fe80:2::22/64 dev $rp22 + ip address del 169.254.2.22/24 dev $rp22 + + ip address del 2001:db8:2::1/64 dev $rp21 + ip address del 198.51.100.1/24 dev $rp21 + + ip nexthop del id 201 + ip nexthop del id 202 + ip nexthop del id 204 + ip nexthop del id 205 + + i
[PATCH net-next 03/10] netdevsim: Add support for resilient nexthop groups
From: Ido Schimmel Allow resilient nexthop groups to be programmed and account their occupancy according to their number of buckets. The nexthop group itself as well as its buckets are marked with hardware flags (i.e., 'RTNH_F_TRAP'). Replacement of a single nexthop bucket can fail using the following debugfs knob: # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace N # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace Y Replacement of a resilient nexthop group can fail using the following debugfs knob: # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace N # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace # cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace Y This enables testing of various error paths. Signed-off-by: Ido Schimmel Reviewed-by: Petr Machata Signed-off-by: Petr Machata --- drivers/net/netdevsim/fib.c | 55 + 1 file changed, 55 insertions(+) diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c index 62cbd716383c..e41f3b98295c 100644 --- a/drivers/net/netdevsim/fib.c +++ b/drivers/net/netdevsim/fib.c @@ -57,6 +57,8 @@ struct nsim_fib_data { struct mutex nh_lock; /* Protects NH HT */ struct dentry *ddir; bool fail_route_offload; + bool fail_res_nexthop_group_replace; + bool fail_nexthop_bucket_replace; }; struct nsim_fib_rt_key { @@ -117,6 +119,7 @@ struct nsim_nexthop { struct rhash_head ht_node; u64 occ; u32 id; + bool is_resilient; }; static const struct rhashtable_params nsim_nexthop_ht_params = { @@ -1115,6 +1118,10 @@ static struct nsim_nexthop *nsim_nexthop_create(struct nsim_fib_data *data, for (i = 0; i < info->nh_grp->num_nh; i++) occ += info->nh_grp->nh_entries[i].weight; break; + case NH_NOTIFIER_INFO_TYPE_RES_TABLE: + occ = info->nh_res_table->num_nh_buckets; + nexthop->is_resilient = true; + break; 
default: NL_SET_ERR_MSG_MOD(info->extack, "Unsupported nexthop type"); kfree(nexthop); @@ -1161,7 +1168,15 @@ static void nsim_nexthop_hw_flags_set(struct net *net, const struct nsim_nexthop *nexthop, bool trap) { + int i; + nexthop_set_hw_flags(net, nexthop->id, false, trap); + + if (!nexthop->is_resilient) + return; + + for (i = 0; i < nexthop->occ; i++) + nexthop_bucket_set_hw_flags(net, nexthop->id, i, false, trap); } static int nsim_nexthop_add(struct nsim_fib_data *data, @@ -1262,6 +1277,32 @@ static void nsim_nexthop_remove(struct nsim_fib_data *data, nsim_nexthop_destroy(nexthop); } +static int nsim_nexthop_res_table_pre_replace(struct nsim_fib_data *data, + struct nh_notifier_info *info) +{ + if (data->fail_res_nexthop_group_replace) { + NL_SET_ERR_MSG_MOD(info->extack, "Failed to replace a resilient nexthop group"); + return -EINVAL; + } + + return 0; +} + +static int nsim_nexthop_bucket_replace(struct nsim_fib_data *data, + struct nh_notifier_info *info) +{ + if (data->fail_nexthop_bucket_replace) { + NL_SET_ERR_MSG_MOD(info->extack, "Failed to replace nexthop bucket"); + return -EINVAL; + } + + nexthop_bucket_set_hw_flags(info->net, info->id, + info->nh_res_bucket->bucket_index, + false, true); + + return 0; +} + static int nsim_nexthop_event_nb(struct notifier_block *nb, unsigned long event, void *ptr) { @@ -1278,6 +1319,12 @@ static int nsim_nexthop_event_nb(struct notifier_block *nb, unsigned long event, case NEXTHOP_EVENT_DEL: nsim_nexthop_remove(data, info); break; + case NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE: + err = nsim_nexthop_res_table_pre_replace(data, info); + break; + case NEXTHOP_EVENT_BUCKET_REPLACE: + err = nsim_nexthop_bucket_replace(data, info); + break; default: break; } @@ -1387,6 +1434,14 @@ nsim_fib_debugfs_init(struct nsim_fib_data *data, struct nsim_dev *nsim_dev) data->fail_route_offload = false; debugfs_create_bool("fail_route_offload", 0600, data->ddir, &data->fail_route_offload);
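The occupancy accounting in this patch can be checked with a little arithmetic. The following is only a sketch: the function name is hypothetical, and it assumes one unit of occupancy per single nexthop plus one unit per bucket of a resilient group, which is what the netdevsim selftest in this series expects (2 nexthops and 4 buckets give 6).

```shell
#!/bin/bash
# Hypothetical helper, not part of the patch: expected netdevsim nexthop
# occupancy for a number of single nexthops plus one resilient group.
# Assumption: one unit per single nexthop, one unit per bucket.
expected_nexthop_occ()
{
	num_single=$1
	num_buckets=$2

	echo $((num_single + num_buckets))
}

# Two single nexthops and a 4-bucket resilient group.
expected_nexthop_occ 2 4
# Two single nexthops and a 5-bucket resilient group.
expected_nexthop_occ 2 5
```

Note that the weights of the group members do not enter the calculation for a resilient group, only the bucket count does.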
[PATCH net-next 06/10] selftests: fib_nexthops: List each test case in a different line
From: Ido Schimmel

The lines with the IPv4 and IPv6 test cases are already very long and
more test cases will be added in subsequent patches.

List each test case in a different line to make it easier to extend the
test with more test cases.

Signed-off-by: Ido Schimmel
Reviewed-by: Petr Machata
Signed-off-by: Petr Machata
---
 tools/testing/selftests/net/fib_nexthops.sh | 30 ++--
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
index 91226ac50112..c840aa88ff18 100755
--- a/tools/testing/selftests/net/fib_nexthops.sh
+++ b/tools/testing/selftests/net/fib_nexthops.sh
@@ -19,10 +19,32 @@ ret=0
 ksft_skip=4
 
 # all tests in this script. Can be overridden with -t option
-IPV4_TESTS="ipv4_fcnal ipv4_grp_fcnal ipv4_withv6_fcnal ipv4_fcnal_runtime ipv4_large_grp ipv4_compat_mode ipv4_fdb_grp_fcnal ipv4_torture"
-IPV6_TESTS="ipv6_fcnal ipv6_grp_fcnal ipv6_fcnal_runtime ipv6_large_grp ipv6_compat_mode ipv6_fdb_grp_fcnal ipv6_torture"
-
-ALL_TESTS="basic ${IPV4_TESTS} ${IPV6_TESTS}"
+IPV4_TESTS="
+	ipv4_fcnal
+	ipv4_grp_fcnal
+	ipv4_withv6_fcnal
+	ipv4_fcnal_runtime
+	ipv4_large_grp
+	ipv4_compat_mode
+	ipv4_fdb_grp_fcnal
+	ipv4_torture
+"
+
+IPV6_TESTS="
+	ipv6_fcnal
+	ipv6_grp_fcnal
+	ipv6_fcnal_runtime
+	ipv6_large_grp
+	ipv6_compat_mode
+	ipv6_fdb_grp_fcnal
+	ipv6_torture
+"
+
+ALL_TESTS="
+	basic
+	${IPV4_TESTS}
+	${IPV6_TESTS}
+"
 TESTS="${ALL_TESTS}"
 VERBOSE=0
 PAUSE_ON_FAIL=no
-- 
2.26.2
[PATCH net-next 02/10] netdevsim: Create a helper for setting nexthop hardware flags
From: Ido Schimmel Instead of calling nexthop_set_hw_flags(), call a helper. It will be used to also set nexthop bucket flags in a subsequent patch. Signed-off-by: Ido Schimmel Reviewed-by: Petr Machata Signed-off-by: Petr Machata --- drivers/net/netdevsim/fib.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c index ba577e20b1a1..62cbd716383c 100644 --- a/drivers/net/netdevsim/fib.c +++ b/drivers/net/netdevsim/fib.c @@ -1157,6 +1157,13 @@ static int nsim_nexthop_account(struct nsim_fib_data *data, u64 occ, } +static void nsim_nexthop_hw_flags_set(struct net *net, + const struct nsim_nexthop *nexthop, + bool trap) +{ + nexthop_set_hw_flags(net, nexthop->id, false, trap); +} + static int nsim_nexthop_add(struct nsim_fib_data *data, struct nsim_nexthop *nexthop, struct netlink_ext_ack *extack) @@ -1175,7 +1182,7 @@ static int nsim_nexthop_add(struct nsim_fib_data *data, goto err_nexthop_dismiss; } - nexthop_set_hw_flags(net, nexthop->id, false, true); + nsim_nexthop_hw_flags_set(net, nexthop, true); return 0; @@ -1204,7 +1211,7 @@ static int nsim_nexthop_replace(struct nsim_fib_data *data, goto err_nexthop_dismiss; } - nexthop_set_hw_flags(net, nexthop->id, false, true); + nsim_nexthop_hw_flags_set(net, nexthop, true); nsim_nexthop_account(data, nexthop_old->occ, false, extack); nsim_nexthop_destroy(nexthop_old); @@ -1286,7 +1293,7 @@ static void nsim_nexthop_free(void *ptr, void *arg) struct net *net; net = devlink_net(data->devlink); - nexthop_set_hw_flags(net, nexthop->id, false, false); + nsim_nexthop_hw_flags_set(net, nexthop, false); nsim_nexthop_account(data, nexthop->occ, false, NULL); nsim_nexthop_destroy(nexthop); } -- 2.26.2
[PATCH net-next 01/10] netdevsim: fib: Introduce a lock to guard nexthop hashtable
Currently netdevsim relies on RTNL to maintain exclusivity in accessing the nexthop hash table. However, bucket notification may be called without RTNL having been held. Instead, introduce a custom lock to guard the table. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- drivers/net/netdevsim/fib.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c index 3ca0f54d0c3b..ba577e20b1a1 100644 --- a/drivers/net/netdevsim/fib.c +++ b/drivers/net/netdevsim/fib.c @@ -47,13 +47,14 @@ struct nsim_fib_data { struct nsim_fib_entry nexthops; struct rhashtable fib_rt_ht; struct list_head fib_rt_list; - struct mutex fib_lock; /* Protects hashtable and list */ + struct mutex fib_lock; /* Protects FIB HT and list */ struct notifier_block nexthop_nb; struct rhashtable nexthop_ht; struct devlink *devlink; struct work_struct fib_event_work; struct list_head fib_event_queue; spinlock_t fib_event_queue_lock; /* Protects fib event queue list */ + struct mutex nh_lock; /* Protects NH HT */ struct dentry *ddir; bool fail_route_offload; }; @@ -1262,8 +1263,7 @@ static int nsim_nexthop_event_nb(struct notifier_block *nb, unsigned long event, struct nh_notifier_info *info = ptr; int err = 0; - ASSERT_RTNL(); - + mutex_lock(&data->nh_lock); switch (event) { case NEXTHOP_EVENT_REPLACE: err = nsim_nexthop_insert(data, info); @@ -1275,6 +1275,7 @@ static int nsim_nexthop_event_nb(struct notifier_block *nb, unsigned long event, break; } + mutex_unlock(&data->nh_lock); return notifier_from_errno(err); } @@ -1404,6 +1405,7 @@ struct nsim_fib_data *nsim_fib_create(struct devlink *devlink, if (err) goto err_data_free; + mutex_init(&data->nh_lock); err = rhashtable_init(&data->nexthop_ht, &nsim_nexthop_ht_params); if (err) goto err_debugfs_exit; @@ -1469,6 +1471,7 @@ struct nsim_fib_data *nsim_fib_create(struct devlink *devlink, data); mutex_destroy(&data->fib_lock); err_debugfs_exit: + 
mutex_destroy(&data->nh_lock); nsim_fib_debugfs_exit(data); err_data_free: kfree(data); @@ -1497,6 +1500,7 @@ void nsim_fib_destroy(struct devlink *devlink, struct nsim_fib_data *data) WARN_ON_ONCE(!list_empty(&data->fib_event_queue)); WARN_ON_ONCE(!list_empty(&data->fib_rt_list)); mutex_destroy(&data->fib_lock); + mutex_destroy(&data->nh_lock); nsim_fib_debugfs_exit(data); kfree(data); } -- 2.26.2
[PATCH net-next 04/10] netdevsim: Allow reporting activity on nexthop buckets
From: Ido Schimmel A key component of the resilient hashing algorithm is the hash buckets' activity. If a bucket is active, it will not be populated with a new nexthop in order not to break existing flows. Therefore, in order to easily and thoroughly test the algorithm, we need to be in full control over the reported activity. Add a debugfs interface that allows user space to have netdevsim report a nexthop bucket within a resilient nexthop group as active. For example: # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity Will mark bucket 23 in nexthop group 10 as active. Signed-off-by: Ido Schimmel Reviewed-by: Petr Machata Signed-off-by: Petr Machata --- drivers/net/netdevsim/fib.c | 61 + 1 file changed, 61 insertions(+) diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c index e41f3b98295c..fda6f37e7055 100644 --- a/drivers/net/netdevsim/fib.c +++ b/drivers/net/netdevsim/fib.c @@ -14,6 +14,7 @@ * THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 
*/ +#include #include #include #include @@ -1345,6 +1346,63 @@ static void nsim_nexthop_free(void *ptr, void *arg) nsim_nexthop_destroy(nexthop); } +static ssize_t nsim_nexthop_bucket_activity_write(struct file *file, + const char __user *user_buf, + size_t size, loff_t *ppos) +{ + struct nsim_fib_data *data = file->private_data; + struct net *net = devlink_net(data->devlink); + struct nsim_nexthop *nexthop; + unsigned long *activity; + loff_t pos = *ppos; + u16 bucket_index; + char buf[128]; + int err = 0; + u32 nhid; + + if (pos != 0) + return -EINVAL; + if (size > sizeof(buf)) + return -EINVAL; + if (copy_from_user(buf, user_buf, size)) + return -EFAULT; + if (sscanf(buf, "%u %hu", &nhid, &bucket_index) != 2) + return -EINVAL; + + rtnl_lock(); + + nexthop = rhashtable_lookup_fast(&data->nexthop_ht, &nhid, +nsim_nexthop_ht_params); + if (!nexthop || !nexthop->is_resilient || + bucket_index >= nexthop->occ) { + err = -EINVAL; + goto out; + } + + activity = bitmap_zalloc(nexthop->occ, GFP_KERNEL); + if (!activity) { + err = -ENOMEM; + goto out; + } + + bitmap_set(activity, bucket_index, 1); + nexthop_res_grp_activity_update(net, nhid, nexthop->occ, activity); + bitmap_free(activity); + +out: + rtnl_unlock(); + + *ppos = size; + return err ?: size; +} + +static const struct file_operations nsim_nexthop_bucket_activity_fops = { + .open = simple_open, + .write = nsim_nexthop_bucket_activity_write, + .llseek = no_llseek, + .owner = THIS_MODULE, +}; + static u64 nsim_fib_ipv4_resource_occ_get(void *priv) { struct nsim_fib_data *data = priv; @@ -1442,6 +1500,9 @@ nsim_fib_debugfs_init(struct nsim_fib_data *data, struct nsim_dev *nsim_dev) data->fail_nexthop_bucket_replace = false; debugfs_create_bool("fail_nexthop_bucket_replace", 0600, data->ddir, &data->fail_nexthop_bucket_replace); + + debugfs_create_file("nexthop_bucket_activity", 0200, data->ddir, + data, &nsim_nexthop_bucket_activity_fops); return 0; } -- 2.26.2
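The debugfs write described in the commit message can be wrapped in a small shell helper. This is a minimal sketch, not part of the patch: the function name is hypothetical, and the debugfs directory is passed in as an argument rather than assumed.

```shell
#!/bin/bash
# Hypothetical wrapper around the nexthop_bucket_activity debugfs file.
# $1: netdevsim debugfs directory of the device
#     (e.g. /sys/kernel/debug/netdevsim/netdevsim10)
# $2: ID of the resilient nexthop group
# $3: index of the bucket to report as active
nexthop_bucket_activity_mark()
{
	debugfs_dir=$1
	group_id=$2
	bucket_index=$3

	# The handler in the patch parses "<nhid> <bucket_index>" from the
	# written string.
	echo "$group_id $bucket_index" > "$debugfs_dir/fib/nexthop_bucket_activity"
}
```

With the example from the commit message, `nexthop_bucket_activity_mark /sys/kernel/debug/netdevsim/netdevsim10 10 23` would mark bucket 23 in nexthop group 10 as active. Per the handler in the patch, the write fails with EINVAL if the group does not exist, is not resilient, or the bucket index is out of range.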
[PATCH net-next 00/10] net: Resilient NH groups: netdevsim, selftests
Support for resilient next-hop groups was added in a previous patch set.
Resilient next hop groups add a layer of indirection between the SKB hash and
the next hop. Thus the hash is used to reference a hash table bucket, which is
then used to reference a particular next hop. This allows the system more
flexibility when assigning SKB hash space to next hops. Previously, each next
hop had to be assigned a continuous range of SKB hash space. With a hash table
as an intermediate layer, it is possible to reassign next hops with a hash
table bucket granularity. In turn, this mends issues with traffic flow
redirection resulting from next hop removal or adjustments in next-hop weights.

This patch set introduces mock offloading of resilient next hop groups by the
netdevsim driver, and a suite of selftests.

- Patch #1 adds a netdevsim-specific lock to protect next-hop hashtable.
  Previously, netdevsim relied on RTNL to maintain mutual exclusion.
  Patch #2 extracts a helper to make the following patches clearer.

- Patch #3 implements the support for offloading of resilient next-hop
  groups.

- Patch #4 introduces a new debugfs interface to set activity on a selected
  next-hop bucket. This simulates how HW can periodically report bucket
  activity, and buckets thus marked are expected to be exempt from migration
  to new next hops when the group changes.

- Patches #5 and #6 clean up the fib_nexthop selftests.

- Patches #7, #8 and #9 add tests for resilient next hop groups. Patch #7
  adds resilient-hashing counterparts to fib_nexthops.sh. Patch #8 adds a new
  traffic test for resilient next-hop groups. Patch #9 adds a new traffic test
  for tunneling.

- Patch #10 actually leverages the netdevsim offload to implement a suite of
  algorithmic tests that verify how and when buckets are migrated under
  various simulated workload scenarios.
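The bucket indirection described above can be illustrated with a toy model. This is only a sketch and not the kernel algorithm: the function names are hypothetical, the bucket table is a plain space-separated list of next-hop IDs, and bucket activity and timers are ignored entirely.

```shell
#!/bin/bash
# Toy model of the hash -> bucket -> next hop indirection.
# The table is a space-separated list; position N holds the next hop that
# currently owns bucket N.

# Print the next hop selected by a given hash value.
bucket_lookup()
{
	table=$1
	hash=$2

	set -- $table
	index=$((hash % $#))
	shift $index
	echo "$1"
}

# Print a new table in which only the buckets of a removed next hop are
# reassigned. All other buckets, and the flows hashed to them, stay put.
bucket_migrate()
{
	table=$1
	removed=$2
	replacement=$3
	result=""

	for nh in $table; do
		if [ "$nh" = "$removed" ]; then
			nh=$replacement
		fi
		result="$result${result:+ }$nh"
	done
	echo "$result"
}

table="1 2 3 1 2 3 1 2"
bucket_lookup "$table" 13	# hash 13 selects bucket 5, owned by next hop 3
bucket_migrate "$table" 3 1	# buckets of next hop 3 are handed to next hop 1
```

In this model, removing next hop 3 changes only buckets 2 and 5. A flow whose hash selects any other bucket keeps its next hop, which is the property that a contiguous hash-range assignment cannot provide.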

The overall plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next hop groups (already pushed)
3) Implementation of resilient next hop group (already pushed)
4) Netdevsim offload plus a suite of selftests (this patchset)
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the complete code at [2].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1

Ido Schimmel (9):
  netdevsim: Create a helper for setting nexthop hardware flags
  netdevsim: Add support for resilient nexthop groups
  netdevsim: Allow reporting activity on nexthop buckets
  selftests: fib_nexthops: Declutter test output
  selftests: fib_nexthops: List each test case in a different line
  selftests: fib_nexthops: Test resilient nexthop groups
  selftests: forwarding: Add resilient hashing test
  selftests: forwarding: Add resilient multipath tunneling nexthop test
  selftests: netdevsim: Add test for resilient nexthop groups offload API

Petr Machata (1):
  netdevsim: fib: Introduce a lock to guard nexthop hashtable

 drivers/net/netdevsim/fib.c | 139 +++-
 .../drivers/net/netdevsim/nexthop.sh | 620 ++
 tools/testing/selftests/net/fib_nexthops.sh | 549 +++-
 .../net/forwarding/gre_multipath_nh_res.sh| 361 ++
 .../net/forwarding/router_mpath_nh_res.sh | 400 +++
 5 files changed, 2059 insertions(+), 10 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/gre_multipath_nh_res.sh
 create mode 100755 tools/testing/selftests/net/forwarding/router_mpath_nh_res.sh

-- 
2.26.2
[PATCH net-next 05/10] selftests: fib_nexthops: Declutter test output
From: Ido Schimmel

Before:

 # ./fib_nexthops.sh -t ipv4_torture
 IPv4 runtime torture
 TEST: IPv4 torture test [ OK ]
 ./fib_nexthops.sh: line 213: 19376 Killed ipv4_del_add_loop1
 ./fib_nexthops.sh: line 213: 19377 Killed ipv4_grp_replace_loop
 ./fib_nexthops.sh: line 213: 19378 Killed ip netns exec me ping -f 172.16.101.1 > /dev/null 2>&1
 ./fib_nexthops.sh: line 213: 19380 Killed ip netns exec me ping -f 172.16.101.2 > /dev/null 2>&1
 ./fib_nexthops.sh: line 213: 19381 Killed ip netns exec me mausezahn veth1 -B 172.16.101.2 -A 172.16.1.1 -c 0 -t tcp "dp=1-1023, flags=syn" > /dev/null 2>&1
 Tests passed: 1
 Tests failed: 0

 # ./fib_nexthops.sh -t ipv6_torture
 IPv6 runtime torture
 TEST: IPv6 torture test [ OK ]
 ./fib_nexthops.sh: line 213: 24453 Killed ipv6_del_add_loop1
 ./fib_nexthops.sh: line 213: 24454 Killed ipv6_grp_replace_loop
 ./fib_nexthops.sh: line 213: 24456 Killed ip netns exec me ping -f 2001:db8:101::1 > /dev/null 2>&1
 ./fib_nexthops.sh: line 213: 24457 Killed ip netns exec me ping -f 2001:db8:101::2 > /dev/null 2>&1
 ./fib_nexthops.sh: line 213: 24458 Killed ip netns exec me mausezahn -6 veth1 -B 2001:db8:101::2 -A 2001:db8:91::1 -c 0 -t tcp "dp=1-1023, flags=syn" > /dev/null 2>&1
 Tests passed: 1
 Tests failed: 0

After:

 # ./fib_nexthops.sh -t ipv4_torture
 IPv4 runtime torture
 TEST: IPv4 torture test [ OK ]
 Tests passed: 1
 Tests failed: 0

 # ./fib_nexthops.sh -t ipv6_torture
 IPv6 runtime torture
 TEST: IPv6 torture test [ OK ]
 Tests passed: 1
 Tests failed: 0

Signed-off-by: Ido Schimmel
Reviewed-by: Petr Machata
Signed-off-by: Petr Machata
---
 tools/testing/selftests/net/fib_nexthops.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/net/fib_nexthops.sh b/tools/testing/selftests/net/fib_nexthops.sh
index d98fb85e201c..91226ac50112 100755
--- a/tools/testing/selftests/net/fib_nexthops.sh
+++ b/tools/testing/selftests/net/fib_nexthops.sh
@@ -874,6 +874,7 @@ ipv6_torture()
 
 	sleep 300
 	kill -9 $pid1 $pid2 $pid3 $pid4 $pid5
+	wait $pid1 $pid2 $pid3 $pid4 $pid5 2>/dev/null
 
 	# if we did not crash, success
 	log_test 0 0 "IPv6 torture test"
@@ -1476,6 +1477,7 @@ ipv4_torture()
 
 	sleep 300
 	kill -9 $pid1 $pid2 $pid3 $pid4 $pid5
+	wait $pid1 $pid2 $pid3 $pid4 $pid5 2>/dev/null
 
 	# if we did not crash, success
 	log_test 0 0 "IPv4 torture test"
-- 
2.26.2
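The fix above relies on a common shell idiom: after killing background jobs, wait for them explicitly with stderr redirected. A self-contained sketch of the pattern follows. The helper name is hypothetical and the `sleep` jobs merely stand in for the torture loops of the real test.

```shell
#!/bin/bash
# Kill the given background jobs and reap them right away, so that the
# shell has nothing left to report about them later.
stop_background_jobs()
{
	kill -9 "$@"
	# Reap the jobs here; anything wait prints about their termination
	# goes to stderr and is discarded.
	wait "$@" 2>/dev/null
	return 0
}

sleep 300 &
pid1=$!
sleep 300 &
pid2=$!

stop_background_jobs $pid1 $pid2
echo "background jobs stopped"
```

Without the explicit `wait`, a non-interactive bash reports each killed job the next time it reaps children, which is where the "Killed" lines in the "Before" output come from.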
[PATCH net-next v2 14/14] nexthop: Enable resilient next-hop groups
Now that all the code is in place, stop rejecting requests to create
resilient next-hop groups.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Reviewed-by: David Ahern
---
 net/ipv4/nexthop.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 015a47e8163a..f09fe3a5608f 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -2443,10 +2443,6 @@ static struct nexthop *nexthop_create_group(struct net *net,
 	} else if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_RES) {
 		struct nh_res_table *res_table;
 
-		/* Bounce resilient groups for now. */
-		err = -EINVAL;
-		goto out_no_nh;
-
 		res_table = nexthop_res_table_alloc(net, cfg->nh_id, cfg);
 		if (!res_table) {
 			err = -ENOMEM;
-- 
2.26.2
[PATCH net-next v2 13/14] nexthop: Notify userspace about bucket migrations
Nexthop replacements et al. are notified through netlink, but if a delayed work migrates buckets in the background, userspace will stay oblivious. Notify these as RTM_NEWNEXTHOPBUCKET events. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Reviewed-by: David Ahern --- Notes: v1 (changes since RFC): - u32 -> u16 for bucket counts / indices net/ipv4/nexthop.c | 45 +++-- 1 file changed, 39 insertions(+), 6 deletions(-) diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 3d602ef6f2c1..015a47e8163a 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -957,6 +957,34 @@ static int nh_fill_res_bucket(struct sk_buff *skb, struct nexthop *nh, return -EMSGSIZE; } +static void nexthop_bucket_notify(struct nh_res_table *res_table, + u16 bucket_index) +{ + struct nh_res_bucket *bucket = &res_table->nh_buckets[bucket_index]; + struct nh_grp_entry *nhge = nh_res_dereference(bucket->nh_entry); + struct nexthop *nh = nhge->nh_parent; + struct sk_buff *skb; + int err = -ENOBUFS; + + skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL); + if (!skb) + goto errout; + + err = nh_fill_res_bucket(skb, nh, bucket, bucket_index, +RTM_NEWNEXTHOPBUCKET, 0, 0, NLM_F_REPLACE, +NULL); + if (err < 0) { + kfree_skb(skb); + goto errout; + } + + rtnl_notify(skb, nh->net, 0, RTNLGRP_NEXTHOP, NULL, GFP_KERNEL); + return; +errout: + if (err < 0) + rtnl_set_sk_err(nh->net, RTNLGRP_NEXTHOP, err); +} + static bool valid_group_nh(struct nexthop *nh, unsigned int npaths, bool *is_fdb, struct netlink_ext_ack *extack) { @@ -1470,7 +1498,8 @@ static bool nh_res_bucket_should_migrate(struct nh_res_table *res_table, } static bool nh_res_bucket_migrate(struct nh_res_table *res_table, - u16 bucket_index, bool notify, bool force) + u16 bucket_index, bool notify, + bool notify_nl, bool force) { struct nh_res_bucket *bucket = &res_table->nh_buckets[bucket_index]; struct nh_grp_entry *new_nhge; @@ -1513,6 +1542,9 @@ static bool nh_res_bucket_migrate(struct nh_res_table *res_table, 
nh_res_bucket_set_nh(bucket, new_nhge); nh_res_bucket_set_idle(res_table, bucket); + if (notify_nl) + nexthop_bucket_notify(res_table, bucket_index); + if (nh_res_nhge_is_balanced(new_nhge)) list_del(&new_nhge->res.uw_nh_entry); return true; @@ -1520,7 +1552,8 @@ static bool nh_res_bucket_migrate(struct nh_res_table *res_table, #define NH_RES_UPKEEP_DW_MINIMUM_INTERVAL (HZ / 2) -static void nh_res_table_upkeep(struct nh_res_table *res_table, bool notify) +static void nh_res_table_upkeep(struct nh_res_table *res_table, + bool notify, bool notify_nl) { unsigned long now = jiffies; unsigned long deadline; @@ -1545,7 +1578,7 @@ static void nh_res_table_upkeep(struct nh_res_table *res_table, bool notify) if (nh_res_bucket_should_migrate(res_table, bucket, &deadline, &force)) { if (!nh_res_bucket_migrate(res_table, i, notify, - force)) { + notify_nl, force)) { unsigned long idle_point; /* A driver can override the migration @@ -1586,7 +1619,7 @@ static void nh_res_table_upkeep_dw(struct work_struct *work) struct nh_res_table *res_table; res_table = container_of(dw, struct nh_res_table, upkeep_dw); - nh_res_table_upkeep(res_table, true); + nh_res_table_upkeep(res_table, true, true); } static void nh_res_table_cancel_upkeep(struct nh_res_table *res_table) @@ -1674,7 +1707,7 @@ static void replace_nexthop_grp_res(struct nh_group *oldg, nh_res_group_rebalance(newg, old_res_table); if (prev_has_uw && !list_empty(&old_res_table->uw_nh_entries)) old_res_table->unbalanced_since = prev_unbalanced_since; - nh_res_table_upkeep(old_res_table, true); + nh_res_table_upkeep(old_res_table, true, false); } static void nh_mp_group_rebalance(struct nh_group *nhg) @@ -2288,7 +2321,7 @@ static int insert_nexthop(struct net *net, struct nexthop *new_nh, /* Do not send bucket notifications, we do full * notification below. */ - nh_res_table_upkeep(res_table, false); + nh_res_table_upkeep(res_table, false, false); } } -- 2.26.2
[PATCH net-next v2 12/14] nexthop: Add netlink handlers for bucket get
Allow getting (but not setting) individual buckets to inspect the next hop mapped therein, idle time, and flags. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Reviewed-by: David Ahern --- Notes: v1 (changes since RFC): - u32 -> u16 for bucket counts / indices net/ipv4/nexthop.c | 110 - 1 file changed, 109 insertions(+), 1 deletion(-) diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index ed2745708f9d..3d602ef6f2c1 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -66,6 +66,15 @@ static const struct nla_policy rtm_nh_res_bucket_policy_dump[] = { [NHA_RES_BUCKET_NH_ID] = { .type = NLA_U32 }, }; +static const struct nla_policy rtm_nh_policy_get_bucket[] = { + [NHA_ID]= { .type = NLA_U32 }, + [NHA_RES_BUCKET]= { .type = NLA_NESTED }, +}; + +static const struct nla_policy rtm_nh_res_bucket_policy_get[] = { + [NHA_RES_BUCKET_INDEX] = { .type = NLA_U16 }, +}; + static bool nexthop_notifiers_is_empty(struct net *net) { return !net->nexthop.notifier_chain.head; @@ -3381,6 +3390,105 @@ static int rtm_dump_nexthop_bucket(struct sk_buff *skb, return err; } +static int nh_valid_get_bucket_req_res_bucket(struct nlattr *res, + u16 *bucket_index, + struct netlink_ext_ack *extack) +{ + struct nlattr *tb[ARRAY_SIZE(rtm_nh_res_bucket_policy_get)]; + int err; + + err = nla_parse_nested(tb, ARRAY_SIZE(rtm_nh_res_bucket_policy_get) - 1, + res, rtm_nh_res_bucket_policy_get, extack); + if (err < 0) + return err; + + if (!tb[NHA_RES_BUCKET_INDEX]) { + NL_SET_ERR_MSG(extack, "Bucket index is missing"); + return -EINVAL; + } + + *bucket_index = nla_get_u16(tb[NHA_RES_BUCKET_INDEX]); + return 0; +} + +static int nh_valid_get_bucket_req(const struct nlmsghdr *nlh, + u32 *id, u16 *bucket_index, + struct netlink_ext_ack *extack) +{ + struct nlattr *tb[ARRAY_SIZE(rtm_nh_policy_get_bucket)]; + int err; + + err = nlmsg_parse(nlh, sizeof(struct nhmsg), tb, + ARRAY_SIZE(rtm_nh_policy_get_bucket) - 1, + rtm_nh_policy_get_bucket, extack); + if (err < 0) + return err; + + err = 
__nh_valid_get_del_req(nlh, tb, id, extack); + if (err) + return err; + + if (!tb[NHA_RES_BUCKET]) { + NL_SET_ERR_MSG(extack, "Bucket information is missing"); + return -EINVAL; + } + + err = nh_valid_get_bucket_req_res_bucket(tb[NHA_RES_BUCKET], +bucket_index, extack); + if (err) + return err; + + return 0; +} + +/* rtnl */ +static int rtm_get_nexthop_bucket(struct sk_buff *in_skb, struct nlmsghdr *nlh, + struct netlink_ext_ack *extack) +{ + struct net *net = sock_net(in_skb->sk); + struct nh_res_table *res_table; + struct sk_buff *skb = NULL; + struct nh_group *nhg; + struct nexthop *nh; + u16 bucket_index; + int err; + u32 id; + + err = nh_valid_get_bucket_req(nlh, &id, &bucket_index, extack); + if (err) + return err; + + nh = nexthop_find_group_resilient(net, id, extack); + if (IS_ERR(nh)) + return PTR_ERR(nh); + + nhg = rtnl_dereference(nh->nh_grp); + res_table = rtnl_dereference(nhg->res_table); + if (bucket_index >= res_table->num_nh_buckets) { + NL_SET_ERR_MSG(extack, "Bucket index out of bounds"); + return -ENOENT; + } + + skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL); + if (!skb) + return -ENOBUFS; + + err = nh_fill_res_bucket(skb, nh, &res_table->nh_buckets[bucket_index], +bucket_index, RTM_NEWNEXTHOPBUCKET, +NETLINK_CB(in_skb).portid, nlh->nlmsg_seq, +0, extack); + if (err < 0) { + WARN_ON(err == -EMSGSIZE); + goto errout_free; + } + + return rtnl_unicast(skb, net, NETLINK_CB(in_skb).portid); + +errout_free: + kfree_skb(skb); + return err; +} + static void nexthop_sync_mtu(struct net_device *dev, u32 orig_mtu) { unsigned int hash = nh_dev_hashfn(dev->ifindex); @@ -3604,7 +3712,7 @@ static int __init nexthop_init(void) rtnl_register(PF_INET6, RTM_NEWNEXTHOP, rtm_new_nexthop, NULL, 0); rtnl_register(PF_INET6, RTM_GETNEXTHOP, NULL, rtm_dump_nexthop, 0); - rtnl_register(PF_UNSPEC, RTM_GETNEXTHOPBUCKET, NU
[PATCH net-next v2 10/14] nexthop: Add netlink handlers for resilient nexthop groups
Implement the netlink messages that allow creation and dumping of resilient nexthop groups. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Reviewed-by: David Ahern --- Notes: v1 (changes since RFC): - u32 -> u16 for bucket counts / indices net/ipv4/nexthop.c | 150 +++-- 1 file changed, 145 insertions(+), 5 deletions(-) diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 495b5e69ffcd..439bf3b7ced5 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -16,6 +16,9 @@ #include #include +#define NH_RES_DEFAULT_IDLE_TIMER (120 * HZ) +#define NH_RES_DEFAULT_UNBALANCED_TIMER0 /* No forced rebalancing. */ + static void remove_nexthop(struct net *net, struct nexthop *nh, struct nl_info *nlinfo); @@ -32,6 +35,7 @@ static const struct nla_policy rtm_nh_policy_new[] = { [NHA_ENCAP_TYPE]= { .type = NLA_U16 }, [NHA_ENCAP] = { .type = NLA_NESTED }, [NHA_FDB] = { .type = NLA_FLAG }, + [NHA_RES_GROUP] = { .type = NLA_NESTED }, }; static const struct nla_policy rtm_nh_policy_get[] = { @@ -45,6 +49,12 @@ static const struct nla_policy rtm_nh_policy_dump[] = { [NHA_FDB] = { .type = NLA_FLAG }, }; +static const struct nla_policy rtm_nh_res_policy_new[] = { + [NHA_RES_GROUP_BUCKETS] = { .type = NLA_U16 }, + [NHA_RES_GROUP_IDLE_TIMER] = { .type = NLA_U32 }, + [NHA_RES_GROUP_UNBALANCED_TIMER]= { .type = NLA_U32 }, +}; + static bool nexthop_notifiers_is_empty(struct net *net) { return !net->nexthop.notifier_chain.head; @@ -588,6 +598,41 @@ static void nh_res_time_set_deadline(unsigned long next_time, *deadline = next_time; } +static clock_t nh_res_table_unbalanced_time(struct nh_res_table *res_table) +{ + if (list_empty(&res_table->uw_nh_entries)) + return 0; + return jiffies_delta_to_clock_t(jiffies - res_table->unbalanced_since); +} + +static int nla_put_nh_group_res(struct sk_buff *skb, struct nh_group *nhg) +{ + struct nh_res_table *res_table = rtnl_dereference(nhg->res_table); + struct nlattr *nest; + + nest = nla_nest_start(skb, NHA_RES_GROUP); + if (!nest) + 
return -EMSGSIZE; + + if (nla_put_u16(skb, NHA_RES_GROUP_BUCKETS, + res_table->num_nh_buckets) || + nla_put_u32(skb, NHA_RES_GROUP_IDLE_TIMER, + jiffies_to_clock_t(res_table->idle_timer)) || + nla_put_u32(skb, NHA_RES_GROUP_UNBALANCED_TIMER, + jiffies_to_clock_t(res_table->unbalanced_timer)) || + nla_put_u64_64bit(skb, NHA_RES_GROUP_UNBALANCED_TIME, + nh_res_table_unbalanced_time(res_table), + NHA_RES_GROUP_PAD)) + goto nla_put_failure; + + nla_nest_end(skb, nest); + return 0; + +nla_put_failure: + nla_nest_cancel(skb, nest); + return -EMSGSIZE; +} + static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg) { struct nexthop_grp *p; @@ -598,6 +643,8 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg) if (nhg->mpath) group_type = NEXTHOP_GRP_TYPE_MPATH; + else if (nhg->resilient) + group_type = NEXTHOP_GRP_TYPE_RES; if (nla_put_u16(skb, NHA_GROUP_TYPE, group_type)) goto nla_put_failure; @@ -613,6 +660,9 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg) p += 1; } + if (nhg->resilient && nla_put_nh_group_res(skb, nhg)) + goto nla_put_failure; + return 0; nla_put_failure: @@ -700,13 +750,26 @@ static int nh_fill_node(struct sk_buff *skb, struct nexthop *nh, return -EMSGSIZE; } +static size_t nh_nlmsg_size_grp_res(struct nh_group *nhg) +{ + return nla_total_size(0) + /* NHA_RES_GROUP */ + nla_total_size(2) + /* NHA_RES_GROUP_BUCKETS */ + nla_total_size(4) + /* NHA_RES_GROUP_IDLE_TIMER */ + nla_total_size(4) + /* NHA_RES_GROUP_UNBALANCED_TIMER */ + nla_total_size_64bit(8);/* NHA_RES_GROUP_UNBALANCED_TIME */ +} + static size_t nh_nlmsg_size_grp(struct nexthop *nh) { struct nh_group *nhg = rtnl_dereference(nh->nh_grp); size_t sz = sizeof(struct nexthop_grp) * nhg->num_nh; + size_t tot = nla_total_size(sz) + + nla_total_size(2); /* NHA_GROUP_TYPE */ + + if (nhg->resilient) + tot += nh_nlmsg_size_grp_res(nhg); - return nla_total_size(sz) + - nla_total_size(2); /* NHA_GROUP_TYPE */ + return tot; } static size_t 
nh_nlmsg_size_single(struct nexthop *nh) @@ -876,7 +939,7 @@ static int nh_c
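The size accounting done by nh_nlmsg_size_grp_res() in this patch can be checked by hand. The following is a minimal user-space sketch, assuming the usual netlink attribute layout (a 4-byte header, everything padded to 4 bytes). The sk_* helpers are local stand-ins written for this illustration, not the kernel's inline functions, and the need_pad parameter is an assumption that models the extra empty pad attribute a 64-bit attribute may need on architectures without efficient unaligned access.

```c
#include <stddef.h>

/* Netlink attribute layout: 4-byte header, sizes rounded up to 4 bytes. */
#define SK_NLA_HDRLEN   ((size_t)4)
#define SK_NLA_ALIGN(n) (((size_t)(n) + 3) & ~(size_t)3)

/* Stand-in for nla_total_size(): header plus payload, padded. */
static size_t sk_nla_total_size(size_t payload)
{
	return SK_NLA_ALIGN(SK_NLA_HDRLEN + payload);
}

/*
 * Stand-in for nla_total_size_64bit(). When need_pad is non-zero, room for
 * one empty pad attribute is added (assumption, see the lead-in).
 */
static size_t sk_nla_total_size_64bit(size_t payload, int need_pad)
{
	return sk_nla_total_size(payload) +
	       (need_pad ? sk_nla_total_size(0) : 0);
}

/* Mirrors the sum in nh_nlmsg_size_grp_res(). */
static size_t sk_nlmsg_size_grp_res(int need_pad)
{
	return sk_nla_total_size(0) +	/* NHA_RES_GROUP nest header */
	       sk_nla_total_size(2) +	/* NHA_RES_GROUP_BUCKETS */
	       sk_nla_total_size(4) +	/* NHA_RES_GROUP_IDLE_TIMER */
	       sk_nla_total_size(4) +	/* NHA_RES_GROUP_UNBALANCED_TIMER */
	       sk_nla_total_size_64bit(8, need_pad); /* ..._UNBALANCED_TIME */
}
```

Under these assumptions the nested attribute adds 4 + 8 + 8 + 8 + 12 = 40 bytes to the message estimate, or 44 bytes when the pad attribute is counted.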
[PATCH net-next v2 09/14] nexthop: Allow reporting activity of nexthop buckets
From: Ido Schimmel The kernel periodically checks the idle time of nexthop buckets to determine if they are idle and can be re-populated with a new nexthop. When the resilient nexthop group is offloaded to hardware, the kernel will not see activity on nexthop buckets unless it is reported from hardware. Add a function that can be periodically called by device drivers to report activity on nexthop buckets after querying it from the underlying device. Signed-off-by: Ido Schimmel Reviewed-by: Petr Machata Reviewed-by: David Ahern Signed-off-by: Petr Machata --- Notes: v1 (changes since RFC): - u32 -> u16 for bucket counts / indices include/net/nexthop.h | 2 ++ net/ipv4/nexthop.c| 35 +++ 2 files changed, 37 insertions(+) diff --git a/include/net/nexthop.h b/include/net/nexthop.h index 685f208d26b5..ba94868a21d5 100644 --- a/include/net/nexthop.h +++ b/include/net/nexthop.h @@ -222,6 +222,8 @@ int unregister_nexthop_notifier(struct net *net, struct notifier_block *nb); void nexthop_set_hw_flags(struct net *net, u32 id, bool offload, bool trap); void nexthop_bucket_set_hw_flags(struct net *net, u32 id, u16 bucket_index, bool offload, bool trap); +void nexthop_res_grp_activity_update(struct net *net, u32 id, u16 num_buckets, +unsigned long *activity); /* caller is holding rcu or rtnl; no reference taken to nexthop */ struct nexthop *nexthop_find_by_id(struct net *net, u32 id); diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 1fce4ff39390..495b5e69ffcd 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -3106,6 +3106,41 @@ void nexthop_bucket_set_hw_flags(struct net *net, u32 id, u16 bucket_index, } EXPORT_SYMBOL(nexthop_bucket_set_hw_flags); +void nexthop_res_grp_activity_update(struct net *net, u32 id, u16 num_buckets, +unsigned long *activity) +{ + struct nh_res_table *res_table; + struct nexthop *nexthop; + struct nh_group *nhg; + u16 i; + + rcu_read_lock(); + + nexthop = nexthop_find_by_id(net, id); + if (!nexthop || !nexthop->is_group) + goto out; + 
+ nhg = rcu_dereference(nexthop->nh_grp); + if (!nhg->resilient) + goto out; + + /* Instead of silently ignoring some buckets, demand that the sizes +* be the same. +*/ + res_table = rcu_dereference(nhg->res_table); + if (num_buckets != res_table->num_nh_buckets) + goto out; + + for (i = 0; i < num_buckets; i++) { + if (test_bit(i, activity)) + nh_res_bucket_set_busy(&res_table->nh_buckets[i]); + } + +out: + rcu_read_unlock(); +} +EXPORT_SYMBOL(nexthop_res_grp_activity_update); + static void __net_exit nexthop_net_exit(struct net *net) { rtnl_lock(); -- 2.26.2
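The core of nexthop_res_grp_activity_update() is a bitmap walk guarded by a size check. Below is a user-space sketch of that logic only, with no RCU and no atomics. The sk_* names and the plain unsigned long timestamp are stand-ins invented for the illustration, and the sketch returns a bool so the outcome is observable, whereas the kernel function returns void.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct sk_bucket {
	unsigned long used_time;	/* stand-in for the atomic used_time */
};

/* Minimal equivalent of test_bit() for a bitmap of unsigned longs. */
static bool sk_test_bit(size_t nr, const unsigned long *map)
{
	return (map[nr / SK_BITS_PER_LONG] >> (nr % SK_BITS_PER_LONG)) & 1UL;
}

/*
 * Mark every bucket whose bit is set in 'activity' as used at time 'now'.
 * As in the patch, a size mismatch is rejected outright instead of
 * silently updating only part of the table.
 */
static bool sk_activity_update(struct sk_bucket *buckets, size_t table_size,
			       size_t num_buckets,
			       const unsigned long *activity,
			       unsigned long now)
{
	size_t i;

	if (num_buckets != table_size)
		return false;

	for (i = 0; i < num_buckets; i++) {
		if (sk_test_bit(i, activity))
			buckets[i].used_time = now;
	}
	return true;
}
```

A driver-like caller would fill the bitmap from its hardware activity query and pass the same bucket count that the group was created with.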
[PATCH net-next v2 11/14] nexthop: Add netlink handlers for bucket dump
Add a dump handler for resilient next hop buckets. When next-hop group ID is given, it walks buckets of that group, otherwise it walks buckets of all groups. It then dumps the buckets whose next hops match the given filtering criteria. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Reviewed-by: David Ahern --- Notes: v1 (changes since RFC): - u32 -> u16 for bucket counts / indices net/ipv4/nexthop.c | 283 + 1 file changed, 283 insertions(+) diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 439bf3b7ced5..ed2745708f9d 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -55,6 +55,17 @@ static const struct nla_policy rtm_nh_res_policy_new[] = { [NHA_RES_GROUP_UNBALANCED_TIMER]= { .type = NLA_U32 }, }; +static const struct nla_policy rtm_nh_policy_dump_bucket[] = { + [NHA_ID]= { .type = NLA_U32 }, + [NHA_OIF] = { .type = NLA_U32 }, + [NHA_MASTER]= { .type = NLA_U32 }, + [NHA_RES_BUCKET]= { .type = NLA_NESTED }, +}; + +static const struct nla_policy rtm_nh_res_bucket_policy_dump[] = { + [NHA_RES_BUCKET_NH_ID] = { .type = NLA_U32 }, +}; + static bool nexthop_notifiers_is_empty(struct net *net) { return !net->nexthop.notifier_chain.head; @@ -883,6 +894,60 @@ static void nh_res_bucket_set_busy(struct nh_res_bucket *bucket) atomic_long_set(&bucket->used_time, (long)jiffies); } +static clock_t nh_res_bucket_idle_time(const struct nh_res_bucket *bucket) +{ + unsigned long used_time = nh_res_bucket_used_time(bucket); + + return jiffies_delta_to_clock_t(jiffies - used_time); +} + +static int nh_fill_res_bucket(struct sk_buff *skb, struct nexthop *nh, + struct nh_res_bucket *bucket, u16 bucket_index, + int event, u32 portid, u32 seq, + unsigned int nlflags, + struct netlink_ext_ack *extack) +{ + struct nh_grp_entry *nhge = nh_res_dereference(bucket->nh_entry); + struct nlmsghdr *nlh; + struct nlattr *nest; + struct nhmsg *nhm; + + nlh = nlmsg_put(skb, portid, seq, event, sizeof(*nhm), nlflags); + if (!nlh) + return -EMSGSIZE; + + nhm = nlmsg_data(nlh); + 
nhm->nh_family = AF_UNSPEC; + nhm->nh_flags = bucket->nh_flags; + nhm->nh_protocol = nh->protocol; + nhm->nh_scope = 0; + nhm->resvd = 0; + + if (nla_put_u32(skb, NHA_ID, nh->id)) + goto nla_put_failure; + + nest = nla_nest_start(skb, NHA_RES_BUCKET); + if (!nest) + goto nla_put_failure; + + if (nla_put_u16(skb, NHA_RES_BUCKET_INDEX, bucket_index) || + nla_put_u32(skb, NHA_RES_BUCKET_NH_ID, nhge->nh->id) || + nla_put_u64_64bit(skb, NHA_RES_BUCKET_IDLE_TIME, + nh_res_bucket_idle_time(bucket), + NHA_RES_BUCKET_PAD)) + goto nla_put_failure_nest; + + nla_nest_end(skb, nest); + nlmsg_end(skb, nlh); + return 0; + +nla_put_failure_nest: + nla_nest_cancel(skb, nest); +nla_put_failure: + nlmsg_cancel(skb, nlh); + return -EMSGSIZE; +} + static bool valid_group_nh(struct nexthop *nh, unsigned int npaths, bool *is_fdb, struct netlink_ext_ack *extack) { @@ -2918,10 +2983,12 @@ static int rtm_get_nexthop(struct sk_buff *in_skb, struct nlmsghdr *nlh, } struct nh_dump_filter { + u32 nh_id; int dev_idx; int master_idx; bool group_filter; bool fdb_filter; + u32 res_bucket_nh_id; }; static bool nh_dump_filtered(struct nexthop *nh, @@ -3101,6 +3168,219 @@ static int rtm_dump_nexthop(struct sk_buff *skb, struct netlink_callback *cb) return err; } +static struct nexthop * +nexthop_find_group_resilient(struct net *net, u32 id, +struct netlink_ext_ack *extack) +{ + struct nh_group *nhg; + struct nexthop *nh; + + nh = nexthop_find_by_id(net, id); + if (!nh) + return ERR_PTR(-ENOENT); + + if (!nh->is_group) { + NL_SET_ERR_MSG(extack, "Not a nexthop group"); + return ERR_PTR(-EINVAL); + } + + nhg = rtnl_dereference(nh->nh_grp); + if (!nhg->resilient) { + NL_SET_ERR_MSG(extack, "Nexthop group not of type resilient"); + return ERR_PTR(-EINVAL); + } + + return nh; +} + +static int nh_valid_dump_nhid(struct nlattr *attr, u32 *nh_id_p, + struct netlink_ext_ack *extack) +{ + u32 idx; + + if (attr) { + idx = nla_get_u32(attr); + if (!idx) { + NL_SET_ERR_MSG(extack, "Invalid nexthop id"); +
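On the receiving side, the attributes that nh_fill_res_bucket() nests under NHA_RES_BUCKET can be walked as ordinary netlink TLVs. The sketch below is a self-contained user-space illustration, not code from the patch or from any library: the sk_* names are hypothetical, the enum simply repeats the uapi ordering (index, idle time, nexthop ID), and the buffer is assumed to hold only the payload of the nest, in host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Attribute types inside NHA_RES_BUCKET, in uapi enum order. */
enum {
	SK_RES_BUCKET_UNSPEC,		/* doubles as the pad attribute */
	SK_RES_BUCKET_INDEX,		/* u16 */
	SK_RES_BUCKET_IDLE_TIME,	/* clock_t as u64 */
	SK_RES_BUCKET_NH_ID,		/* u32 */
};

#define SK_NLA_HDRLEN    ((size_t)4)
#define SK_NLA_ALIGN(n)  (((size_t)(n) + 3) & ~(size_t)3)
#define SK_NLA_TYPE_MASK 0x3fff	/* strips the nested / byte-order flag bits */

struct sk_bucket_info {
	uint16_t index;
	uint64_t idle_time;
	uint32_t nh_id;
};

/* Copy a fixed-size payload, rejecting attributes of unexpected length. */
static int sk_get(void *dst, size_t want, const unsigned char *p, size_t plen)
{
	if (plen != want)
		return -1;
	memcpy(dst, p, want);
	return 0;
}

/* Walk the nest payload; returns 0 on success, -1 on malformed input. */
static int sk_parse_res_bucket(const unsigned char *buf, size_t len,
			       struct sk_bucket_info *out)
{
	size_t off = 0;

	memset(out, 0, sizeof(*out));
	while (off < len && len - off >= SK_NLA_HDRLEN) {
		const unsigned char *p = buf + off + SK_NLA_HDRLEN;
		uint16_t nla_len, nla_type;
		size_t plen;
		int err = 0;

		memcpy(&nla_len, buf + off, sizeof(nla_len));
		memcpy(&nla_type, buf + off + 2, sizeof(nla_type));
		if (nla_len < SK_NLA_HDRLEN || nla_len > len - off)
			return -1;
		plen = nla_len - SK_NLA_HDRLEN;

		switch (nla_type & SK_NLA_TYPE_MASK) {
		case SK_RES_BUCKET_INDEX:
			err = sk_get(&out->index, sizeof(out->index), p, plen);
			break;
		case SK_RES_BUCKET_IDLE_TIME:
			err = sk_get(&out->idle_time, sizeof(out->idle_time),
				     p, plen);
			break;
		case SK_RES_BUCKET_NH_ID:
			err = sk_get(&out->nh_id, sizeof(out->nh_id), p, plen);
			break;
		default:
			break;	/* pad or unknown attribute: skip it */
		}
		if (err)
			return err;
		off += SK_NLA_ALIGN(nla_len);
	}
	return 0;
}

/*
 * Test helper: append one attribute at 'off' and return the new offset.
 * The caller must provide enough room in 'buf'.
 */
static size_t sk_put_attr(unsigned char *buf, size_t off, uint16_t type,
			  const void *data, uint16_t dlen)
{
	uint16_t nla_len = (uint16_t)(SK_NLA_HDRLEN + dlen);

	memset(buf + off, 0, SK_NLA_ALIGN(nla_len));
	memcpy(buf + off, &nla_len, sizeof(nla_len));
	memcpy(buf + off + 2, &type, sizeof(type));
	memcpy(buf + off + SK_NLA_HDRLEN, data, dlen);
	return off + SK_NLA_ALIGN(nla_len);
}
```

Real consumers would more likely use an existing netlink library for this. The hand-rolled walker is only meant to make the wire layout of a bucket message visible.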
[PATCH net-next v2 06/14] nexthop: Add data structures for resilient group notifications
From: Ido Schimmel Add data structures that will be used for in-kernel notifications about addition / deletion of a resilient nexthop group and about changes to a hash bucket within a resilient group. Signed-off-by: Ido Schimmel Reviewed-by: Petr Machata Reviewed-by: David Ahern Signed-off-by: Petr Machata --- Notes: v1 (changes since RFC): - u32 -> u16 for bucket counts / indices include/net/nexthop.h | 19 +++ 1 file changed, 19 insertions(+) diff --git a/include/net/nexthop.h b/include/net/nexthop.h index b78505c9031e..fd3c0debe8bf 100644 --- a/include/net/nexthop.h +++ b/include/net/nexthop.h @@ -155,11 +155,15 @@ struct nexthop { enum nexthop_event_type { NEXTHOP_EVENT_DEL, NEXTHOP_EVENT_REPLACE, + NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, + NEXTHOP_EVENT_BUCKET_REPLACE, }; enum nh_notifier_info_type { NH_NOTIFIER_INFO_TYPE_SINGLE, NH_NOTIFIER_INFO_TYPE_GRP, + NH_NOTIFIER_INFO_TYPE_RES_TABLE, + NH_NOTIFIER_INFO_TYPE_RES_BUCKET, }; struct nh_notifier_single_info { @@ -186,6 +190,19 @@ struct nh_notifier_grp_info { struct nh_notifier_grp_entry_info nh_entries[]; }; +struct nh_notifier_res_bucket_info { + u16 bucket_index; + unsigned int idle_timer_ms; + bool force; + struct nh_notifier_single_info old_nh; + struct nh_notifier_single_info new_nh; +}; + +struct nh_notifier_res_table_info { + u16 num_nh_buckets; + struct nh_notifier_single_info nhs[]; +}; + struct nh_notifier_info { struct net *net; struct netlink_ext_ack *extack; @@ -194,6 +211,8 @@ struct nh_notifier_info { union { struct nh_notifier_single_info *nh; struct nh_notifier_grp_info *nh_grp; + struct nh_notifier_res_table_info *nh_res_table; + struct nh_notifier_res_bucket_info *nh_res_bucket; }; }; -- 2.26.2
[PATCH net-next v2 04/14] nexthop: Add netlink defines and enumerators for resilient NH groups
From: Ido Schimmel

- RTM_NEWNEXTHOP et al. that handle resilient groups will have a new nested
  attribute, NHA_RES_GROUP, whose elements are attributes NHA_RES_GROUP_*.

- RTM_NEWNEXTHOPBUCKET et al. is a suite of new messages that will currently
  serve only for dumping of individual buckets of resilient next hop groups.
  For nexthop group buckets, these messages will carry a nested attribute
  NHA_RES_BUCKET, whose elements are attributes NHA_RES_BUCKET_*.

  There are several reasons why a new suite of messages is created for
  nexthop buckets instead of overloading the information on the existing
  RTM_{NEW,DEL,GET}NEXTHOP messages. First, a nexthop group can contain a
  large number of nexthop buckets (4k is not unheard of). This imposes limits
  on the amount of information that can be encoded for each nexthop bucket
  given a netlink message is limited to 64k bytes. Second, while
  RTM_NEWNEXTHOPBUCKET is only used for notifications at this point, in the
  future it can be extended to provide user space with control over nexthop
  buckets configuration.

- The new group type is NEXTHOP_GRP_TYPE_RES. Note that nexthop code is
  adjusted to bounce groups with that type for now.

Signed-off-by: Ido Schimmel
Reviewed-by: Petr Machata
Reviewed-by: David Ahern
Signed-off-by: Petr Machata
---

Notes:
    v2:
    - Comment at NEXTHOP_GRP_TYPE_MPATH that it's for the hash-threshold
      groups.
v1 (changes since RFC): - u32 -> u16 for bucket counts / indices include/uapi/linux/nexthop.h | 47 +- include/uapi/linux/rtnetlink.h | 7 + net/ipv4/nexthop.c | 2 ++ security/selinux/nlmsgtab.c| 5 +++- 4 files changed, 59 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h index 2d4a1e784cf0..d8ffa8c9ca78 100644 --- a/include/uapi/linux/nexthop.h +++ b/include/uapi/linux/nexthop.h @@ -21,7 +21,10 @@ struct nexthop_grp { }; enum { - NEXTHOP_GRP_TYPE_MPATH, /* default type if not specified */ + NEXTHOP_GRP_TYPE_MPATH, /* hash-threshold nexthop group + * default type if not specified + */ + NEXTHOP_GRP_TYPE_RES,/* resilient nexthop group */ __NEXTHOP_GRP_TYPE_MAX, }; @@ -52,8 +55,50 @@ enum { NHA_FDB,/* flag; nexthop belongs to a bridge fdb */ /* if NHA_FDB is added, OIF, BLACKHOLE, ENCAP cannot be set */ + /* nested; resilient nexthop group attributes */ + NHA_RES_GROUP, + /* nested; nexthop bucket attributes */ + NHA_RES_BUCKET, + __NHA_MAX, }; #define NHA_MAX(__NHA_MAX - 1) + +enum { + NHA_RES_GROUP_UNSPEC, + /* Pad attribute for 64-bit alignment. */ + NHA_RES_GROUP_PAD = NHA_RES_GROUP_UNSPEC, + + /* u16; number of nexthop buckets in a resilient nexthop group */ + NHA_RES_GROUP_BUCKETS, + /* clock_t as u32; nexthop bucket idle timer (per-group) */ + NHA_RES_GROUP_IDLE_TIMER, + /* clock_t as u32; nexthop unbalanced timer */ + NHA_RES_GROUP_UNBALANCED_TIMER, + /* clock_t as u64; nexthop unbalanced time */ + NHA_RES_GROUP_UNBALANCED_TIME, + + __NHA_RES_GROUP_MAX, +}; + +#define NHA_RES_GROUP_MAX (__NHA_RES_GROUP_MAX - 1) + +enum { + NHA_RES_BUCKET_UNSPEC, + /* Pad attribute for 64-bit alignment. 
*/ + NHA_RES_BUCKET_PAD = NHA_RES_BUCKET_UNSPEC, + + /* u16; nexthop bucket index */ + NHA_RES_BUCKET_INDEX, + /* clock_t as u64; nexthop bucket idle time */ + NHA_RES_BUCKET_IDLE_TIME, + /* u32; nexthop id assigned to the nexthop bucket */ + NHA_RES_BUCKET_NH_ID, + + __NHA_RES_BUCKET_MAX, +}; + +#define NHA_RES_BUCKET_MAX (__NHA_RES_BUCKET_MAX - 1) + #endif diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index 91e4ca064d61..d35953bc7d53 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -178,6 +178,13 @@ enum { RTM_GETVLAN, #define RTM_GETVLANRTM_GETVLAN + RTM_NEWNEXTHOPBUCKET = 116, +#define RTM_NEWNEXTHOPBUCKET RTM_NEWNEXTHOPBUCKET + RTM_DELNEXTHOPBUCKET, +#define RTM_DELNEXTHOPBUCKET RTM_DELNEXTHOPBUCKET + RTM_GETNEXTHOPBUCKET, +#define RTM_GETNEXTHOPBUCKET RTM_GETNEXTHOPBUCKET + __RTM_MAX, #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1) }; diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 56c54d0fbacc..7a94591da856 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -1492,6 +1492,8 @@ static struct nexthop *nexthop_create_group(struct net *net, if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_MPATH) { nhg->mpath = 1; nhg->is_multipath = true; + } else if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_RES) { + goto out_no_nh; } WARN_ON_ONCE(nhg->mpath !
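Two numbers in this patch are easy to verify with a few lines of arithmetic: the value RTM_MAX takes once the three bucket messages are appended, and the per-bucket byte budget implied by the commit message (a 64k message shared by 4k buckets). The sketch below is illustrative only. The SK_* names are local copies made for the example, and the remark about the low two bits relies on rtnetlink allocating message types in groups of four, which is stated here as background rather than taken from the patch.

```c
#include <stddef.h>

/* Local copies of the new message numbers from the rtnetlink.h hunk. */
enum {
	SK_RTM_NEWNEXTHOPBUCKET = 116,
	SK_RTM_DELNEXTHOPBUCKET,	/* 117 */
	SK_RTM_GETNEXTHOPBUCKET,	/* 118 */
	SK_RTM_END,			/* 119, plays the role of __RTM_MAX */
};

/* Same rounding as RTM_MAX: round up to a multiple of four, minus one. */
#define SK_RTM_MAX (((SK_RTM_END + 3) & ~3) - 1)

/*
 * Message types come in groups of four, so the low two bits tell the
 * kinds apart: 0 for NEW, 1 for DEL, 2 for GET.
 */
static int sk_rtm_kind(int msg_type)
{
	return msg_type & 3;
}

/* Upper bound on bytes per bucket if one message had to carry them all. */
static size_t sk_bytes_per_bucket(size_t msg_limit, size_t num_buckets)
{
	return num_buckets ? msg_limit / num_buckets : 0;
}
```

With a 64 KiB limit and 4096 buckets the budget comes out at 16 bytes per bucket, which is the constraint the commit message gives as the first reason for a separate message suite.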
[PATCH net-next v2 07/14] nexthop: Implement notifiers for resilient nexthop groups
Implement the following notifications towards drivers:

- NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created.

- NEXTHOP_EVENT_BUCKET_REPLACE any time there is a change in assignment of
  next hops to hash table buckets. That includes replacements, deletions,
  and delayed upkeep cycles. Some bucket notifications can be vetoed by the
  driver, to make it possible to propagate bucket busy-ness flags from the
  HW back to the algorithm. Some are however forced, e.g. if a next hop is
  deleted, all buckets that use this next hop simply must be migrated,
  whether the HW wishes so or not.

- NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is
  replaced. Usually the driver will get the bucket notifications as well,
  and could veto those. But in some cases, a bucket may not be migrated
  immediately, but during delayed upkeep, and that is too late to roll the
  transaction back. This notification allows the driver to take a look and
  veto the new proposed group up front, before anything is committed.

Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Reviewed-by: David Ahern --- Notes: v1 (changes since RFC): - u32 -> u16 for bucket counts / indices net/ipv4/nexthop.c | 320 +++-- 1 file changed, 308 insertions(+), 12 deletions(-) diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 0e2ff72e10c0..8b06aafc2e9e 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -115,6 +115,37 @@ static int nh_notifier_mp_info_init(struct nh_notifier_info *info, return 0; } +static int nh_notifier_res_table_info_init(struct nh_notifier_info *info, + struct nh_group *nhg) +{ + struct nh_res_table *res_table = rtnl_dereference(nhg->res_table); + u16 num_nh_buckets = res_table->num_nh_buckets; + unsigned long size; + u16 i; + + info->type = NH_NOTIFIER_INFO_TYPE_RES_TABLE; + size = struct_size(info->nh_res_table, nhs, num_nh_buckets); + info->nh_res_table = __vmalloc(size, GFP_KERNEL | __GFP_ZERO | + __GFP_NOWARN); + if (!info->nh_res_table) + return -ENOMEM; + + info->nh_res_table->num_nh_buckets = num_nh_buckets; + + for (i = 0; i < num_nh_buckets; i++) { + struct nh_res_bucket *bucket = &res_table->nh_buckets[i]; + struct nh_grp_entry *nhge; + struct nh_info *nhi; + + nhge = rtnl_dereference(bucket->nh_entry); + nhi = rtnl_dereference(nhge->nh->nh_info); + __nh_notifier_single_info_init(&info->nh_res_table->nhs[i], + nhi); + } + + return 0; +} + static int nh_notifier_grp_info_init(struct nh_notifier_info *info, const struct nexthop *nh) { @@ -122,6 +153,8 @@ static int nh_notifier_grp_info_init(struct nh_notifier_info *info, if (nhg->mpath) return nh_notifier_mp_info_init(info, nhg); + else if (nhg->resilient) + return nh_notifier_res_table_info_init(info, nhg); return -EINVAL; } @@ -132,6 +165,8 @@ static void nh_notifier_grp_info_fini(struct nh_notifier_info *info, if (nhg->mpath) kfree(info->nh_grp); + else if (nhg->resilient) + vfree(info->nh_res_table); } static int nh_notifier_info_init(struct nh_notifier_info *info, @@ -183,6 +218,107 @@ static int 
call_nexthop_notifiers(struct net *net, return notifier_to_errno(err); } +static int +nh_notifier_res_bucket_idle_timer_get(const struct nh_notifier_info *info, + bool force, unsigned int *p_idle_timer_ms) +{ + struct nh_res_table *res_table; + struct nh_group *nhg; + struct nexthop *nh; + int err = 0; + + /* When 'force' is false, nexthop bucket replacement is performed +* because the bucket was deemed to be idle. In this case, capable +* listeners can choose to perform an atomic replacement: The bucket is +* only replaced if it is inactive. However, if the idle timer interval +* is smaller than the interval in which a listener is querying +* buckets' activity from the device, then atomic replacement should +* not be tried. Pass the idle timer value to listeners, so that they +* could determine which type of replacement to perform. +*/ + if (force) { + *p_idle_timer_ms = 0; + return 0; + } + + rcu_read_lock(); + + nh = nexthop_find_by_id(info->net, info->id); + if (!nh) { + err = -EINVAL; + goto out; + } + + nhg = rcu_dereference(nh->nh_grp); + res_table = rcu_deref
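The comment in nh_notifier_res_bucket_idle_timer_get() describes a decision that a listener has to make for each bucket notification. One way a driver might encode it is sketched below. Everything here is hypothetical: the enum, the function name, and the poll_interval_ms parameter (how often the driver queries bucket activity from the device) are assumptions made for the illustration and do not come from the patch.

```c
#include <stdbool.h>

enum sk_replace_mode {
	SK_REPLACE_UNCONDITIONAL,	/* overwrite the bucket regardless */
	SK_REPLACE_ATOMIC,		/* overwrite only if the bucket is inactive */
};

/*
 * Pick a replacement type from the fields carried in the bucket
 * notification ('force', 'idle_timer_ms') and the driver's own activity
 * polling period.
 */
static enum sk_replace_mode
sk_pick_replace_mode(bool force, unsigned int idle_timer_ms,
		     unsigned int poll_interval_ms)
{
	/* Forced migrations must happen whether the HW agrees or not. */
	if (force)
		return SK_REPLACE_UNCONDITIONAL;

	/*
	 * If buckets can go idle faster than the driver samples activity,
	 * the sampled state is too stale to base an atomic replacement on.
	 */
	if (idle_timer_ms < poll_interval_ms)
		return SK_REPLACE_UNCONDITIONAL;

	return SK_REPLACE_ATOMIC;
}
```

An atomic replacement that fails because the device saw traffic is what lets the driver veto the migration, so the bucket's busy state propagates back to the algorithm.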
[PATCH net-next v2 05/14] nexthop: Add implementation of resilient next-hop groups
roup type, and that is currently bounced. There is therefore no way to actually access this code. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel Reviewed-by: David Ahern --- Notes: v1 (changes since RFC): - u32 -> u16 for bucket counts / indices - set the new flag is_multipath for resilient groups include/net/nexthop.h | 42 net/ipv4/nexthop.c| 517 -- 2 files changed, 546 insertions(+), 13 deletions(-) diff --git a/include/net/nexthop.h b/include/net/nexthop.h index 5062c2c08e2b..b78505c9031e 100644 --- a/include/net/nexthop.h +++ b/include/net/nexthop.h @@ -40,6 +40,12 @@ struct nh_config { struct nlattr *nh_grp; u16 nh_grp_type; + u16 nh_grp_res_num_buckets; + unsigned long nh_grp_res_idle_timer; + unsigned long nh_grp_res_unbalanced_timer; + boolnh_grp_res_has_num_buckets; + boolnh_grp_res_has_idle_timer; + boolnh_grp_res_has_unbalanced_timer; struct nlattr *nh_encap; u16 nh_encap_type; @@ -63,6 +69,32 @@ struct nh_info { }; }; +struct nh_res_bucket { + struct nh_grp_entry __rcu *nh_entry; + atomic_long_t used_time; + unsigned long migrated_time; + booloccupied; + u8 nh_flags; +}; + +struct nh_res_table { + struct net *net; + u32 nhg_id; + struct delayed_work upkeep_dw; + + /* List of NHGEs that have too few buckets ("uw" for underweight). +* Reclaimed buckets will be given to entries in this list. +*/ + struct list_headuw_nh_entries; + unsigned long unbalanced_since; + + u32 idle_timer; + u32 unbalanced_timer; + + u16 num_nh_buckets; + struct nh_res_bucketnh_buckets[]; +}; + struct nh_grp_entry { struct nexthop *nh; u8 weight; @@ -71,6 +103,13 @@ struct nh_grp_entry { struct { atomic_tupper_bound; } mpath; + struct { + /* Member on uw_nh_entries. 
*/ + struct list_headuw_nh_entry; + + u16 count_buckets; + u16 wants_buckets; + } res; }; struct list_head nh_list; @@ -82,8 +121,11 @@ struct nh_group { u16 num_nh; boolis_multipath; boolmpath; + boolresilient; boolfdb_nh; boolhas_v4; + + struct nh_res_table __rcu *res_table; struct nh_grp_entry nh_entries[]; }; diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 7a94591da856..0e2ff72e10c0 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -183,6 +183,30 @@ static int call_nexthop_notifiers(struct net *net, return notifier_to_errno(err); } +/* There are three users of RES_TABLE, and NHs etc. referenced from there: + * + * 1) a collection of callbacks for NH maintenance. This operates under + *RTNL, + * 2) the delayed work that gradually balances the resilient table, + * 3) and nexthop_select_path(), operating under RCU. + * + * Both the delayed work and the RTNL block are writers, and need to + * maintain mutual exclusion. Since there are only two and well-known + * writers for each table, the RTNL code can make sure it has exclusive + * access thus: + * + * - Have the DW operate without locking; + * - synchronously cancel the DW; + * - do the writing; + * - if the write was not actually a delete, call upkeep, which schedules + * DW again if necessary. + * + * The functions that are always called from the RTNL context use + * rtnl_dereference(). The functions that can also be called from the DW do + * a raw dereference and rely on the above mutual exclusion scheme. 
+ */ +#define nh_res_dereference(p) (rcu_dereference_raw(p)) + static int call_nexthop_notifier(struct notifier_block *nb, struct net *net, enum nexthop_event_type event_type, struct nexthop *nh, @@ -241,6 +265,9 @@ static void nexthop_free_group(struct nexthop *nh) WARN_ON(nhg->spare == nhg); + if (nhg->resilient) + vfree(rcu_dereference_raw(nhg->res_table)); + kfree(nhg->spare); kfree(nhg); } @@ -299,6 +326,30 @@ static struct nh_group *nexthop_grp_alloc(u16 num_nh) return nhg; } +static void nh_res_table_upkeep_dw(struct work_struct *work); + +static struct nh_res_table * +nexthop_res_table_alloc(struct net *net, u32 nhg_id, struct nh_con
[PATCH net-next v2 08/14] nexthop: Allow setting "offload" and "trap" indication of nexthop buckets
From: Ido Schimmel

Add a function that can be called by device drivers to set "offload" or
"trap" indication on nexthop buckets following nexthop notifications and
other changes such as a neighbour becoming invalid.

Signed-off-by: Ido Schimmel
Reviewed-by: Petr Machata
Reviewed-by: David Ahern
Signed-off-by: Petr Machata
---

Notes:
    v1 (changes since RFC):
    - u32 -> u16 for bucket counts / indices

 include/net/nexthop.h |  2 ++
 net/ipv4/nexthop.c    | 34 ++
 2 files changed, 36 insertions(+)

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index fd3c0debe8bf..685f208d26b5 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -220,6 +220,8 @@ int register_nexthop_notifier(struct net *net, struct notifier_block *nb,
 			      struct netlink_ext_ack *extack);
 int unregister_nexthop_notifier(struct net *net, struct notifier_block *nb);
 void nexthop_set_hw_flags(struct net *net, u32 id, bool offload, bool trap);
+void nexthop_bucket_set_hw_flags(struct net *net, u32 id, u16 bucket_index,
+				 bool offload, bool trap);
 
 /* caller is holding rcu or rtnl; no reference taken to nexthop */
 struct nexthop *nexthop_find_by_id(struct net *net, u32 id);
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 8b06aafc2e9e..1fce4ff39390 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -3072,6 +3072,40 @@ void nexthop_set_hw_flags(struct net *net, u32 id, bool offload, bool trap)
 }
 EXPORT_SYMBOL(nexthop_set_hw_flags);
 
+void nexthop_bucket_set_hw_flags(struct net *net, u32 id, u16 bucket_index,
+				 bool offload, bool trap)
+{
+	struct nh_res_table *res_table;
+	struct nh_res_bucket *bucket;
+	struct nexthop *nexthop;
+	struct nh_group *nhg;
+
+	rcu_read_lock();
+
+	nexthop = nexthop_find_by_id(net, id);
+	if (!nexthop || !nexthop->is_group)
+		goto out;
+
+	nhg = rcu_dereference(nexthop->nh_grp);
+	if (!nhg->resilient)
+		goto out;
+
+	if (bucket_index >= nhg->res_table->num_nh_buckets)
+		goto out;
+
+	res_table = rcu_dereference(nhg->res_table);
+	bucket = &res_table->nh_buckets[bucket_index];
+	bucket->nh_flags &= ~(RTNH_F_OFFLOAD | RTNH_F_TRAP);
+	if (offload)
+		bucket->nh_flags |= RTNH_F_OFFLOAD;
+	if (trap)
+		bucket->nh_flags |= RTNH_F_TRAP;
+
+out:
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL(nexthop_bucket_set_hw_flags);
+
 static void __net_exit nexthop_net_exit(struct net *net)
 {
 	rtnl_lock();
-- 
2.26.2
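The flag handling in the patch above follows a clear-then-set pattern: both hardware bits are cleared first, then only the requested ones are set, so the other bits in nh_flags are left alone. Below is a minimal userspace sketch of that pattern, not the kernel code itself. The sketch_* names and the SKETCH_F_* bit values are stand-ins chosen for illustration, and the RCU locking and nexthop lookup are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in bit values for illustration only; the kernel uses
 * RTNH_F_OFFLOAD and RTNH_F_TRAP from its UAPI headers. */
#define SKETCH_F_OFFLOAD 0x08u
#define SKETCH_F_TRAP    0x40u

/* Hypothetical, simplified stand-ins for the bucket and table types. */
struct sketch_bucket {
	unsigned int nh_flags;
};

struct sketch_table {
	uint16_t num_buckets;
	struct sketch_bucket *buckets;
};

/* Clear both hardware bits, then set only the requested ones.
 * Returns false when the index is out of range, in which case
 * nothing is modified (the patch silently returns in that case). */
static bool sketch_bucket_set_hw_flags(struct sketch_table *table,
				       uint16_t bucket_index,
				       bool offload, bool trap)
{
	struct sketch_bucket *bucket;

	if (!table || bucket_index >= table->num_buckets)
		return false;

	bucket = &table->buckets[bucket_index];
	bucket->nh_flags &= ~(SKETCH_F_OFFLOAD | SKETCH_F_TRAP);
	if (offload)
		bucket->nh_flags |= SKETCH_F_OFFLOAD;
	if (trap)
		bucket->nh_flags |= SKETCH_F_TRAP;
	return true;
}
```

Clearing before setting means a driver can call the helper repeatedly with the current state and never has to remember what it reported last time.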
[PATCH net-next v2 03/14] nexthop: Add a dedicated flag for multipath next-hop groups
With the introduction of resilient nexthop groups, there will be two types
of multipath groups: the current hash-threshold "mpath" ones, and resilient
groups. Both are multipath, but to determine the fact, the system needs to
consider two flags. This might prove costly in the datapath. Therefore,
introduce a new flag, that should be set for next-hop groups that have more
than one nexthop, and should be considered multipath.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Reviewed-by: David Ahern
---

Notes:
    v1 (changes since RFC):
    - This patch is new

 include/net/nexthop.h | 7 ---
 net/ipv4/nexthop.c    | 5 -
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/net/nexthop.h b/include/net/nexthop.h
index 7bc057aee40b..5062c2c08e2b 100644
--- a/include/net/nexthop.h
+++ b/include/net/nexthop.h
@@ -80,6 +80,7 @@ struct nh_grp_entry {
 struct nh_group {
 	struct nh_group		*spare; /* spare group for removals */
 	u16			num_nh;
+	bool			is_multipath;
 	bool			mpath;
 	bool			fdb_nh;
 	bool			has_v4;
@@ -212,7 +213,7 @@ static inline bool nexthop_is_multipath(const struct nexthop *nh)
 		struct nh_group *nh_grp;
 
 		nh_grp = rcu_dereference_rtnl(nh->nh_grp);
-		return nh_grp->mpath;
+		return nh_grp->is_multipath;
 	}
 	return false;
 }
@@ -227,7 +228,7 @@ static inline unsigned int nexthop_num_path(const struct nexthop *nh)
 		struct nh_group *nh_grp;
 
 		nh_grp = rcu_dereference_rtnl(nh->nh_grp);
-		if (nh_grp->mpath)
+		if (nh_grp->is_multipath)
 			rc = nh_grp->num_nh;
 	}
 
@@ -308,7 +309,7 @@ struct fib_nh_common *nexthop_fib_nhc(struct nexthop *nh, int nhsel)
 		struct nh_group *nh_grp;
 
 		nh_grp = rcu_dereference_rtnl(nh->nh_grp);
-		if (nh_grp->mpath) {
+		if (nh_grp->is_multipath) {
 			nh = nexthop_mpath_select(nh_grp, nhsel);
 			if (!nh)
 				return NULL;
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 69c8b50a936e..56c54d0fbacc 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -967,6 +967,7 @@ static void remove_nh_grp_entry(struct net *net, struct nh_grp_entry *nhge,
 	}
 
 	newg->has_v4 = false;
+	newg->is_multipath = nhg->is_multipath;
 	newg->mpath = nhg->mpath;
 	newg->fdb_nh = nhg->fdb_nh;
 	newg->num_nh = nhg->num_nh;
@@ -1488,8 +1489,10 @@ static struct nexthop *nexthop_create_group(struct net *net,
 		nhg->nh_entries[i].nh_parent = nh;
 	}
 
-	if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_MPATH)
+	if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_MPATH) {
 		nhg->mpath = 1;
+		nhg->is_multipath = true;
+	}
 
 	WARN_ON_ONCE(nhg->mpath != 1);
 
-- 
2.26.2
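The point of the dedicated flag is that the datapath reads one field instead of testing one flag per group type. A small sketch of the idea follows, assuming simplified hypothetical types (struct sketch_group, enum sketch_grp_type). This is not the kernel structure; it only mirrors the fields named in the patch and in the changelog note that resilient groups also set the flag.

```c
#include <stdbool.h>

/* Hypothetical group types for the sketch. */
enum sketch_grp_type {
	SKETCH_GRP_TYPE_MPATH,	/* hash-threshold group */
	SKETCH_GRP_TYPE_RES,	/* resilient group */
};

/* Simplified stand-in for the group object. */
struct sketch_group {
	unsigned int num_nh;
	bool is_multipath;	/* set once, at creation, for every multipath type */
	bool mpath;		/* hash-threshold */
	bool resilient;		/* resilient */
};

/* Set the per-type flag and the summary flag together at creation time. */
static void sketch_group_init(struct sketch_group *g,
			      enum sketch_grp_type type, unsigned int num_nh)
{
	g->num_nh = num_nh;
	g->mpath = (type == SKETCH_GRP_TYPE_MPATH);
	g->resilient = (type == SKETCH_GRP_TYPE_RES);
	g->is_multipath = g->mpath || g->resilient;
}

/* Datapath-style check: a single load, regardless of how many
 * multipath group types exist. */
static bool sketch_is_multipath(const struct sketch_group *g)
{
	return g->is_multipath;
}

/* The alternative the patch avoids: one test per group type. */
static bool sketch_is_multipath_slow(const struct sketch_group *g)
{
	return g->mpath || g->resilient;
}
```

The cost moves from every lookup to group creation, which is the usual trade for a field read on the forwarding path.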
[PATCH net-next v2 02/14] nexthop: __nh_notifier_single_info_init(): Make nh_info an argument
The cited function currently uses rtnl_dereference() to get nh_info from a
handed-in nexthop. However, under the resilient hashing scheme, this
function will not always be called under RTNL, sometimes the mutual
exclusion will be achieved differently. Therefore move the nh_info
extraction from the function to its callers to make it possible to use a
different synchronization guarantee.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Reviewed-by: David Ahern
---
 net/ipv4/nexthop.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index f723dc97dcd3..69c8b50a936e 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -52,10 +52,8 @@ static bool nexthop_notifiers_is_empty(struct net *net)
 
 static void
 __nh_notifier_single_info_init(struct nh_notifier_single_info *nh_info,
-			       const struct nexthop *nh)
+			       const struct nh_info *nhi)
 {
-	struct nh_info *nhi = rtnl_dereference(nh->nh_info);
-
 	nh_info->dev = nhi->fib_nhc.nhc_dev;
 	nh_info->gw_family = nhi->fib_nhc.nhc_gw_family;
 	if (nh_info->gw_family == AF_INET)
@@ -71,12 +69,14 @@ __nh_notifier_single_info_init(struct nh_notifier_single_info *nh_info,
 static int nh_notifier_single_info_init(struct nh_notifier_info *info,
 					const struct nexthop *nh)
 {
+	struct nh_info *nhi = rtnl_dereference(nh->nh_info);
+
 	info->type = NH_NOTIFIER_INFO_TYPE_SINGLE;
 	info->nh = kzalloc(sizeof(*info->nh), GFP_KERNEL);
 	if (!info->nh)
 		return -ENOMEM;
 
-	__nh_notifier_single_info_init(info->nh, nh);
+	__nh_notifier_single_info_init(info->nh, nhi);
 
 	return 0;
 }
@@ -103,11 +103,13 @@ static int nh_notifier_mp_info_init(struct nh_notifier_info *info,
 
 	for (i = 0; i < num_nh; i++) {
 		struct nh_grp_entry *nhge = &nhg->nh_entries[i];
+		struct nh_info *nhi;
 
+		nhi = rtnl_dereference(nhge->nh->nh_info);
 		info->nh_grp->nh_entries[i].id = nhge->nh->id;
 		info->nh_grp->nh_entries[i].weight = nhge->weight;
 		__nh_notifier_single_info_init(&info->nh_grp->nh_entries[i].nh,
-					       nhge->nh);
+					       nhi);
 	}
 
 	return 0;
-- 
2.26.2
[PATCH net-next v2 01/14] nexthop: Pass nh_config to replace_nexthop()
Currently, replace assumes that the new group that is given is a
fully-formed object. But mpath groups really only have one attribute, and
that is the constituent next hop configuration. This may not be universally
true. From the usability perspective, it is desirable to allow the replace
operation to adjust just the constituent next hop configuration and leave
the group attributes as such intact.

But the object that keeps track of whether an attribute was or was not
given is the nh_config object, not the next hop or next-hop group. To allow
(selective) attribute updates during NH group replacement, propagate `cfg'
to replace_nexthop() and further to replace_nexthop_grp().

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
Reviewed-by: David Ahern
---
 net/ipv4/nexthop.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 743777bce179..f723dc97dcd3 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -1107,7 +1107,7 @@ static void nh_rt_cache_flush(struct net *net, struct nexthop *nh)
 }
 
 static int replace_nexthop_grp(struct net *net, struct nexthop *old,
-			       struct nexthop *new,
+			       struct nexthop *new, const struct nh_config *cfg,
 			       struct netlink_ext_ack *extack)
 {
 	struct nh_group *oldg, *newg;
@@ -1276,7 +1276,8 @@ static void nexthop_replace_notify(struct net *net, struct nexthop *nh,
 }
 
 static int replace_nexthop(struct net *net, struct nexthop *old,
-			   struct nexthop *new, struct netlink_ext_ack *extack)
+			   struct nexthop *new, const struct nh_config *cfg,
+			   struct netlink_ext_ack *extack)
 {
 	bool new_is_reject = false;
 	struct nh_grp_entry *nhge;
@@ -1319,7 +1320,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old,
 	}
 
 	if (old->is_group)
-		err = replace_nexthop_grp(net, old, new, extack);
+		err = replace_nexthop_grp(net, old, new, cfg, extack);
 	else
 		err = replace_nexthop_single(net, old, new, extack);
 
@@ -1361,7 +1362,7 @@ static int insert_nexthop(struct net *net, struct nexthop *new_nh,
 		} else if (new_id > nh->id) {
 			pp = &next->rb_right;
 		} else if (replace) {
-			rc = replace_nexthop(net, nh, new_nh, extack);
+			rc = replace_nexthop(net, nh, new_nh, cfg, extack);
 			if (!rc) {
 				new_nh = nh; /* send notification with old nh */
 				replace_notify = 1;
-- 
2.26.2
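The reason for passing the config object down is that only it records which attributes the user actually supplied. A rough sketch of what a selective update can look like once `cfg' is available follows. The structures and the has_* fields are hypothetical and invented for this illustration; they are not the layout of the kernel's struct nh_config.

```c
#include <stdbool.h>

/* Hypothetical request object: each attribute carries a
 * "was it given" marker next to its value. */
struct sketch_cfg {
	bool has_idle_timer;
	unsigned int idle_timer;
	bool has_unbalanced_timer;
	unsigned int unbalanced_timer;
};

/* Hypothetical group attributes that a replace may or may not touch. */
struct sketch_group_attrs {
	unsigned int idle_timer;
	unsigned int unbalanced_timer;
};

/* Build the attributes of the replacement group: take each value from
 * the request when it was given, otherwise carry over the old one. */
static void sketch_replace_attrs(struct sketch_group_attrs *newg,
				 const struct sketch_group_attrs *oldg,
				 const struct sketch_cfg *cfg)
{
	newg->idle_timer = cfg->has_idle_timer ?
			   cfg->idle_timer : oldg->idle_timer;
	newg->unbalanced_timer = cfg->has_unbalanced_timer ?
				 cfg->unbalanced_timer : oldg->unbalanced_timer;
}
```

Without the request object, the replace path would only see a fully populated new group and could not tell a default from an explicit value.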
[PATCH net-next v2 00/14] nexthop: Resilient next-hop groups
emain unbalanced indefinitely.

The value of 120 is the default in the Cumulus implementation of resilient
next-hop groups. To a degree the default is arbitrary, the only value that
certainly does not make sense is 0. Therefore going with an existing
deployed implementation is reasonable.

Unbalanced time, i.e. how long since the last time that all nexthops had as
many buckets as they should according to their weights, is reported when
the group is dumped:

 # ip nexthop show id 10
 id 10 group 1/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0

When replacing next hops or changing weights, if one does not specify some
parameters, their value is left as it was:

 # ip nexthop replace id 10 group 1,3/2 type resilient
 # ip nexthop show id 10
 id 10 group 1,3/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0

It is also possible to do a dump of individual buckets (and now you know
why there were only 8 of them in the example above):

 # ip nexthop bucket show id 10
 id 10 index 0 idle_time 5.59 nhid 1
 id 10 index 1 idle_time 5.59 nhid 1
 id 10 index 2 idle_time 8.74 nhid 2
 id 10 index 3 idle_time 8.74 nhid 2
 id 10 index 4 idle_time 8.74 nhid 1
 id 10 index 5 idle_time 8.74 nhid 1
 id 10 index 6 idle_time 8.74 nhid 1
 id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the nexthop replace command to satisfy the new demand
that nexthop 1 be given 6 buckets instead of 4.

The patchset proceeds as follows:

- Patches #1 and #2 are small refactoring patches.

- Patch #3 adds a new flag to struct nh_group, is_multipath. This flag is
  meant to be set for all nexthop groups that in general have several
  nexthops from which they choose, and avoids a more expensive dispatch
  based on reading several flags, one for each nexthop group type.

- Patch #4 contains defines of new UAPI attributes and the new next-hop
  group type. At this point, the nexthop code is made to bounce the new
  type. As the resilient hashing code is gradually added in the following
  patches, it will remain dead. The last patch will make it accessible.

  This patch also adds a suite of new messages related to next hop buckets.
  This approach was taken instead of overloading the information on the
  existing RTM_{NEW,DEL,GET}NEXTHOP messages for the following reasons.

  First, a next-hop group can contain a large number of next-hop buckets
  (4k is not unheard of). This imposes limits on the amount of information
  that can be encoded for each next-hop bucket given a netlink message is
  limited to 64k bytes.

  Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this
  point, in the future it can be extended to provide user space with
  control over next-hop buckets configuration.

- Patch #5 contains the meat of the resilient next-hop group support.

- Patches #6 and #7 implement support for notifications towards the
  drivers.

- Patch #8 adds an interface for the drivers to report whether a given hash
  table bucket is offloaded or trapping traffic.

- Patch #9 adds an interface for the drivers to report resilient hash table
  bucket activity. Drivers will be able to report through this interface
  whether traffic is hitting a given bucket.

- In patches #10, #11, #12 and #13, UAPI is implemented. This includes all
  the code necessary for creation of resilient groups, bucket dumping and
  getting, and bucket migration notifications.

- In patch #14 the next-hop groups are finally made available.
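The bucket dump in the cover letter shows nexthop 1 holding six of the eight buckets and nexthop 2 holding two, which is what a 3:1 weight ratio implies. One way to arrive at such a split is sketched below, assuming a cumulative round-to-nearest division of the table by weight. The function names are made up for this example, and the kernel's actual computation may differ in detail.

```c
#include <stddef.h>

/* Round-to-nearest integer division for unsigned operands. */
static unsigned int sketch_div_round_closest(unsigned int n, unsigned int d)
{
	return (n + d / 2) / d;
}

/* Split num_buckets among n nexthops in proportion to their weights.
 * Works on cumulative weights so that the per-nexthop counts always
 * sum to num_buckets, whatever the rounding does along the way.
 * Returns 0 on success, -1 if the weights sum to zero. */
static int sketch_wants_buckets(const unsigned int *weights, size_t n,
				unsigned int num_buckets, unsigned int *out)
{
	unsigned int total = 0, upper = 0, prev = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += weights[i];
	if (total == 0)
		return -1;

	for (i = 0; i < n; i++) {
		unsigned int cur;

		upper += weights[i];
		cur = sketch_div_round_closest(num_buckets * upper, total);
		out[i] = cur - prev;
		prev = cur;
	}
	return 0;
}
```

With eight buckets, equal weights give four buckets each, and weights of 3 and 1 give six and two. Going from the first split to the second means two buckets change owner, which matches the two buckets with the shorter idle time in the dump.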
The overall plan is to contribute approximately the following patchsets:

1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next-hop groups (already pushed)
3) Implementation of resilient next-hop groups (this patchset)
4) Netdevsim offload plus a suite of selftests
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests

Interested parties can look at the current state of the code at [2] and
[3].

[1] https://tools.ietf.org/html/rfc2992
[2] https://github.com/idosch/linux/commits/submit/res_integ_v1
[3] https://github.com/idosch/iproute2/commits/submit/res_v1

v2:
- Patch #4:
    - Comment at NEXTHOP_GRP_TYPE_MPATH that it's for the hash-threshold
      groups.

v1 (changes since RFC):
- Patch #3:
    - This patch is new
- Patches #4-#13:
    - u32 -> u16 for bucket counts / indices
- Patch #5:
    - set the new flag is_multipath for resilient groups

Ido Schimmel (4):
  nexthop: Add netlink defines and enumerators for resilient NH groups
  nexthop: Add data structures for resilient group notifications
  nexthop: Allow setting "offload" and "trap" indication of nexthop buckets
  nexthop: Allow reporting activity of nexthop buckets

Petr Machata (10):
  nexthop: Pass nh_config to replace_nexthop()
  nexthop: __nh_notifier_single_info_init(): Make nh_info an argument
  nexthop: Add a
Re: [PATCH net-next 00/14] nexthop: Resilient next-hop groups
David Ahern writes:

> When you get to the end of the sets, it would be good to submit
> documentation for resilient multipath under Documentation/networking

All right.
Re: [PATCH net-next 03/14] nexthop: Add a dedicated flag for multipath next-hop groups
David Ahern writes:

> On 3/11/21 8:39 AM, Petr Machata wrote:
>>
>> David Ahern writes:
>>
>>>> diff --git a/include/net/nexthop.h b/include/net/nexthop.h
>>>> index 7bc057aee40b..5062c2c08e2b 100644
>>>> --- a/include/net/nexthop.h
>>>> +++ b/include/net/nexthop.h
>>>> @@ -80,6 +80,7 @@ struct nh_grp_entry {
>>>>  struct nh_group {
>>>>  	struct nh_group		*spare; /* spare group for removals */
>>>>  	u16			num_nh;
>>>> +	bool			is_multipath;
>>>>  	bool			mpath;
>>>
>>>
>>> It would be good to rename the existing type 'mpath' to something else.
>>> You have 'resilient' as a group type later, so maybe rename this one to
>>> hash or hash_threshold.
>>
>> All right, I'll send a follow-up with that.
>
> I'm fine with the rename being a followup after this patch set or as the
> last patch in this set.

I looked at this, it's more than just this struct field. There is a whole
number of functions with mpath in their name to reflect that they are for
the hash-threshold algorithm. (And then some where the "mpath" reflects
is_multipath assumption.)

So I'll send this separately, and have it go through our regression. It's
still trivialish renaming, but a fair amount thereof.
Re: [PATCH net-next 04/14] nexthop: Add netlink defines and enumerators for resilient NH groups
David Ahern writes:

> On 3/11/21 8:45 AM, Petr Machata wrote:
>>
>> David Ahern writes:
>>
>>> On 3/10/21 8:02 AM, Petr Machata wrote:
>>>> diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
>>>> index 2d4a1e784cf0..8efebf3cb9c7 100644
>>>> --- a/include/uapi/linux/nexthop.h
>>>> +++ b/include/uapi/linux/nexthop.h
>>>> @@ -22,6 +22,7 @@ struct nexthop_grp {
>>>>
>>>>  enum {
>>>>  	NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
>>>
>>> Update the above comment that it is for legacy, hash based multipath.
>>
>> Maybe this would make sense?
>>
>>  NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group */
>>
>
> yes, the description is fine. keep the comment about 'default type'.

OK.
Re: [PATCH net-next 04/14] nexthop: Add netlink defines and enumerators for resilient NH groups
David Ahern writes:

> On 3/10/21 8:02 AM, Petr Machata wrote:
>> diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
>> index 2d4a1e784cf0..8efebf3cb9c7 100644
>> --- a/include/uapi/linux/nexthop.h
>> +++ b/include/uapi/linux/nexthop.h
>> @@ -22,6 +22,7 @@ struct nexthop_grp {
>>
>>  enum {
>>  	NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
>
> Update the above comment that it is for legacy, hash based multipath.

Maybe this would make sense?

 NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group */
Re: [PATCH net-next 03/14] nexthop: Add a dedicated flag for multipath next-hop groups
David Ahern writes:

>> diff --git a/include/net/nexthop.h b/include/net/nexthop.h
>> index 7bc057aee40b..5062c2c08e2b 100644
>> --- a/include/net/nexthop.h
>> +++ b/include/net/nexthop.h
>> @@ -80,6 +80,7 @@ struct nh_grp_entry {
>>  struct nh_group {
>>  	struct nh_group		*spare; /* spare group for removals */
>>  	u16			num_nh;
>> +	bool			is_multipath;
>>  	bool			mpath;
>
>
> It would be good to rename the existing type 'mpath' to something else.
> You have 'resilient' as a group type later, so maybe rename this one to
> hash or hash_threshold.

All right, I'll send a follow-up with that.
[PATCH net-next 14/14] nexthop: Enable resilient next-hop groups
Now that all the code is in place, stop rejecting requests to create
resilient next-hop groups.

Signed-off-by: Petr Machata
Reviewed-by: Ido Schimmel
---
 net/ipv4/nexthop.c | 4
 1 file changed, 4 deletions(-)

diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 015a47e8163a..f09fe3a5608f 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -2443,10 +2443,6 @@ static struct nexthop *nexthop_create_group(struct net *net,
 	} else if (cfg->nh_grp_type == NEXTHOP_GRP_TYPE_RES) {
 		struct nh_res_table *res_table;
 
-		/* Bounce resilient groups for now. */
-		err = -EINVAL;
-		goto out_no_nh;
-
 		res_table = nexthop_res_table_alloc(net, cfg->nh_id, cfg);
 		if (!res_table) {
 			err = -ENOMEM;
-- 
2.26.2
[PATCH net-next 10/14] nexthop: Add netlink handlers for resilient nexthop groups
Implement the netlink messages that allow creation and dumping of resilient nexthop groups. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- Notes: v1 (changes since RFC): - u32 -> u16 for bucket counts / indices net/ipv4/nexthop.c | 150 +++-- 1 file changed, 145 insertions(+), 5 deletions(-) diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 495b5e69ffcd..439bf3b7ced5 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -16,6 +16,9 @@ #include #include +#define NH_RES_DEFAULT_IDLE_TIMER (120 * HZ) +#define NH_RES_DEFAULT_UNBALANCED_TIMER0 /* No forced rebalancing. */ + static void remove_nexthop(struct net *net, struct nexthop *nh, struct nl_info *nlinfo); @@ -32,6 +35,7 @@ static const struct nla_policy rtm_nh_policy_new[] = { [NHA_ENCAP_TYPE]= { .type = NLA_U16 }, [NHA_ENCAP] = { .type = NLA_NESTED }, [NHA_FDB] = { .type = NLA_FLAG }, + [NHA_RES_GROUP] = { .type = NLA_NESTED }, }; static const struct nla_policy rtm_nh_policy_get[] = { @@ -45,6 +49,12 @@ static const struct nla_policy rtm_nh_policy_dump[] = { [NHA_FDB] = { .type = NLA_FLAG }, }; +static const struct nla_policy rtm_nh_res_policy_new[] = { + [NHA_RES_GROUP_BUCKETS] = { .type = NLA_U16 }, + [NHA_RES_GROUP_IDLE_TIMER] = { .type = NLA_U32 }, + [NHA_RES_GROUP_UNBALANCED_TIMER]= { .type = NLA_U32 }, +}; + static bool nexthop_notifiers_is_empty(struct net *net) { return !net->nexthop.notifier_chain.head; @@ -588,6 +598,41 @@ static void nh_res_time_set_deadline(unsigned long next_time, *deadline = next_time; } +static clock_t nh_res_table_unbalanced_time(struct nh_res_table *res_table) +{ + if (list_empty(&res_table->uw_nh_entries)) + return 0; + return jiffies_delta_to_clock_t(jiffies - res_table->unbalanced_since); +} + +static int nla_put_nh_group_res(struct sk_buff *skb, struct nh_group *nhg) +{ + struct nh_res_table *res_table = rtnl_dereference(nhg->res_table); + struct nlattr *nest; + + nest = nla_nest_start(skb, NHA_RES_GROUP); + if (!nest) + return -EMSGSIZE; + + if 
(nla_put_u16(skb, NHA_RES_GROUP_BUCKETS, + res_table->num_nh_buckets) || + nla_put_u32(skb, NHA_RES_GROUP_IDLE_TIMER, + jiffies_to_clock_t(res_table->idle_timer)) || + nla_put_u32(skb, NHA_RES_GROUP_UNBALANCED_TIMER, + jiffies_to_clock_t(res_table->unbalanced_timer)) || + nla_put_u64_64bit(skb, NHA_RES_GROUP_UNBALANCED_TIME, + nh_res_table_unbalanced_time(res_table), + NHA_RES_GROUP_PAD)) + goto nla_put_failure; + + nla_nest_end(skb, nest); + return 0; + +nla_put_failure: + nla_nest_cancel(skb, nest); + return -EMSGSIZE; +} + static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg) { struct nexthop_grp *p; @@ -598,6 +643,8 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg) if (nhg->mpath) group_type = NEXTHOP_GRP_TYPE_MPATH; + else if (nhg->resilient) + group_type = NEXTHOP_GRP_TYPE_RES; if (nla_put_u16(skb, NHA_GROUP_TYPE, group_type)) goto nla_put_failure; @@ -613,6 +660,9 @@ static int nla_put_nh_group(struct sk_buff *skb, struct nh_group *nhg) p += 1; } + if (nhg->resilient && nla_put_nh_group_res(skb, nhg)) + goto nla_put_failure; + return 0; nla_put_failure: @@ -700,13 +750,26 @@ static int nh_fill_node(struct sk_buff *skb, struct nexthop *nh, return -EMSGSIZE; } +static size_t nh_nlmsg_size_grp_res(struct nh_group *nhg) +{ + return nla_total_size(0) + /* NHA_RES_GROUP */ + nla_total_size(2) + /* NHA_RES_GROUP_BUCKETS */ + nla_total_size(4) + /* NHA_RES_GROUP_IDLE_TIMER */ + nla_total_size(4) + /* NHA_RES_GROUP_UNBALANCED_TIMER */ + nla_total_size_64bit(8);/* NHA_RES_GROUP_UNBALANCED_TIME */ +} + static size_t nh_nlmsg_size_grp(struct nexthop *nh) { struct nh_group *nhg = rtnl_dereference(nh->nh_grp); size_t sz = sizeof(struct nexthop_grp) * nhg->num_nh; + size_t tot = nla_total_size(sz) + + nla_total_size(2); /* NHA_GROUP_TYPE */ + + if (nhg->resilient) + tot += nh_nlmsg_size_grp_res(nhg); - return nla_total_size(sz) + - nla_total_size(2); /* NHA_GROUP_TYPE */ + return tot; } static size_t nh_nlmsg_size_single(struct 
nexthop *nh) @@ -876,7 +939,7 @@ static int nh_check_attr_fdb_g