Acked-by: Viacheslav Ovsiienko <[email protected]>

> -----Original Message-----
> From: Vincent Jardin <[email protected]>
> Sent: Sunday, March 22, 2026 3:46 PM
> To: [email protected]
> Cc: Raslan Darawsheh <[email protected]>; NBU-Contact-Thomas Monjalon
> (EXTERNAL) <[email protected]>; [email protected];
> Dariusz Sosnowski <[email protected]>; Slava Ovsiienko
> <[email protected]>; Bing Zhao <[email protected]>; Ori Kam
> <[email protected]>; Suanming Mou <[email protected]>; Matan Azrad
> <[email protected]>; [email protected];
> [email protected]; Vincent Jardin <[email protected]>
> Subject: [PATCH v4 05/10] net/mlx5: support per-queue rate limiting
> 
> Wire rte_eth_set_queue_rate_limit() to the mlx5 PMD. The callback allocates a
> per-queue PP index with the requested data rate, then modifies the live SQ via
> modify_bitmask bit 0 to apply the new packet_pacing_rate_limit_index — no
> queue teardown required.
> 
> Setting tx_rate=0 clears the PP index on the SQ and frees it.
> 
> Capability check uses hca_attr.qos.packet_pacing directly (not dev_cap.txpp_en
> which requires Clock Queue prerequisites). This allows per-queue rate limiting
> without the tx_pp devarg.
> 
> The callback rejects hairpin queues and queues whose SQ is not yet created.
> 
> testpmd usage (no testpmd changes needed):
>   set port 0 queue 0 rate 1000
>   set port 0 queue 1 rate 5000
>   set port 0 queue 0 rate 0     # disable
> 
> Supported hardware:
> - ConnectX-6 Dx: full support, per-SQ rate via HW rate table
> - ConnectX-7/8: full support, coexists with wait-on-time scheduling
> - BlueField-2/3: full support as DPU rep ports
> 
> Not supported:
> - ConnectX-5: packet_pacing exists but dynamic SQ modify may not
>   work on all firmware versions
> - ConnectX-4 Lx and earlier: no packet_pacing capability
> 
> Signed-off-by: Vincent Jardin <[email protected]>
> ---
>  doc/guides/nics/features/mlx5.ini |   1 +
>  doc/guides/nics/mlx5.rst          |  54 ++++++++++++++
>  drivers/net/mlx5/mlx5.c           |   2 +
>  drivers/net/mlx5/mlx5_tx.h        |   2 +
>  drivers/net/mlx5/mlx5_txq.c       | 118 ++++++++++++++++++++++++++++++
>  5 files changed, 177 insertions(+)
> 
> diff --git a/doc/guides/nics/features/mlx5.ini b/doc/guides/nics/features/mlx5.ini
> index 4f9c4c309b..3b3eda28b8 100644
> --- a/doc/guides/nics/features/mlx5.ini
> +++ b/doc/guides/nics/features/mlx5.ini
> @@ -30,6 +30,7 @@ Inner RSS            = Y
>  SR-IOV               = Y
>  VLAN filter          = Y
>  Flow control         = Y
> +Rate limitation      = Y
>  CRC offload          = Y
>  VLAN offload         = Y
>  L3 checksum offload  = Y
> diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
> index 6bb8c07353..c72a60f084 100644
> --- a/doc/guides/nics/mlx5.rst
> +++ b/doc/guides/nics/mlx5.rst
> @@ -580,6 +580,60 @@ for an additional list of options shared with other mlx5 drivers.
>    (with ``tx_pp``) and ConnectX-7+ (wait-on-time) scheduling modes.
>    The default value is zero.
> 
> +.. _mlx5_per_queue_rate_limit:
> +
> +Per-Queue Tx Rate Limiting
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The mlx5 PMD supports per-queue Tx rate limiting via the standard
> +ethdev API ``rte_eth_set_queue_rate_limit()`` and
> +``rte_eth_get_queue_rate_limit()``.
> +
> +This feature uses the hardware packet pacing mechanism to enforce a
> +data rate on individual TX queues without tearing down the queue. The
> +rate is specified in Mbps.
> +
> +**Requirements:**
> +
> +- ConnectX-6 Dx or later with ``packet_pacing`` HCA capability.
> +- The DevX path must be used (default). The legacy Verbs path
> +  (``dv_flow_en=0``) does not support dynamic SQ modification and
> +  returns ``-EINVAL``.
> +- The queue must be started (SQ in RDY state) before setting a rate.
> +
> +**Supported hardware:**
> +
> +- ConnectX-6 Dx: per-SQ rate via HW rate table.
> +- ConnectX-7/8: full support, coexists with wait-on-time scheduling.
> +- BlueField-2/3: full support as DPU rep ports.
> +
> +**Not supported:**
> +
> +- ConnectX-5: ``packet_pacing`` exists but dynamic SQ modify may not
> +  work on all firmware versions.
> +- ConnectX-4 Lx and earlier: no ``packet_pacing`` capability.
> +
> +**Rate table sharing:**
> +
> +The hardware rate table has a limited number of entries (typically 128
> +on ConnectX-6 Dx). When multiple queues are configured with identical
> +rate parameters, the kernel mlx5 driver shares a single rate table
> +entry across them. Each queue still has its own independent SQ and
> +enforces the rate independently; queues are never merged. The rate cap
> +applies per-queue: if two queues share the same 1000 Mbps entry, each
> +can send up to 1000 Mbps independently; they do not share a combined
> +budget.
> +
> +This sharing is transparent and only affects table capacity: 128
> +entries can serve thousands of queues as long as many use the same
> +rate. Queues with different rates consume separate entries.
> +
> +**Usage with testpmd:**
> +
> +.. code-block:: console
> +
> +   testpmd> set port 0 queue 0 rate 1000
> +   testpmd> show port 0 queue 0 rate
> +   testpmd> set port 0 queue 0 rate 0
> +
>  - ``tx_vec_en`` parameter [int]
> 
>    A nonzero value enables Tx vector with ConnectX-5 NICs and above.
> diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
> index e795948187..e718f0fa8c 100644
> --- a/drivers/net/mlx5/mlx5.c
> +++ b/drivers/net/mlx5/mlx5.c
> @@ -2621,6 +2621,7 @@ const struct eth_dev_ops mlx5_dev_ops = {
>       .map_aggr_tx_affinity = mlx5_map_aggr_tx_affinity,
>       .rx_metadata_negotiate = mlx5_flow_rx_metadata_negotiate,
>       .get_restore_flags = mlx5_get_restore_flags,
> +     .set_queue_rate_limit = mlx5_set_queue_rate_limit,
>  };
> 
>  /* Available operations from secondary process. */
> @@ -2714,6 +2715,7 @@ const struct eth_dev_ops mlx5_dev_ops_isolate = {
>       .count_aggr_ports = mlx5_count_aggr_ports,
>       .map_aggr_tx_affinity = mlx5_map_aggr_tx_affinity,
>       .get_restore_flags = mlx5_get_restore_flags,
> +     .set_queue_rate_limit = mlx5_set_queue_rate_limit,
>  };
> 
>  /**
> diff --git a/drivers/net/mlx5/mlx5_tx.h b/drivers/net/mlx5/mlx5_tx.h
> index 51f330454a..975ff57acd 100644
> --- a/drivers/net/mlx5/mlx5_tx.h
> +++ b/drivers/net/mlx5/mlx5_tx.h
> @@ -222,6 +222,8 @@ struct mlx5_txq_ctrl *mlx5_txq_get(struct rte_eth_dev *dev, uint16_t idx);
>  int mlx5_txq_release(struct rte_eth_dev *dev, uint16_t idx);
>  int mlx5_txq_releasable(struct rte_eth_dev *dev, uint16_t idx);
>  int mlx5_txq_verify(struct rte_eth_dev *dev);
> +int mlx5_set_queue_rate_limit(struct rte_eth_dev *dev, uint16_t queue_idx,
> +                           uint32_t tx_rate);
>  int mlx5_txq_get_sqn(struct mlx5_txq_ctrl *txq);
>  void mlx5_txq_alloc_elts(struct mlx5_txq_ctrl *txq_ctrl);
>  void mlx5_txq_free_elts(struct mlx5_txq_ctrl *txq_ctrl);
> diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
> index 3356c89758..ce08363ca9 100644
> --- a/drivers/net/mlx5/mlx5_txq.c
> +++ b/drivers/net/mlx5/mlx5_txq.c
> @@ -1363,6 +1363,124 @@ mlx5_txq_release(struct rte_eth_dev *dev, uint16_t idx)
>       return 0;
>  }
> 
> +/**
> + * Set per-queue packet pacing rate limit.
> + *
> + * @param dev
> + *   Pointer to Ethernet device.
> + * @param queue_idx
> + *   TX queue index.
> + * @param tx_rate
> + *   TX rate in Mbps, 0 to disable rate limiting.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +mlx5_set_queue_rate_limit(struct rte_eth_dev *dev, uint16_t queue_idx,
> +                       uint32_t tx_rate)
> +{
> +     struct mlx5_priv *priv = dev->data->dev_private;
> +     struct mlx5_dev_ctx_shared *sh = priv->sh;
> +     struct mlx5_txq_ctrl *txq_ctrl;
> +     struct mlx5_devx_obj *sq_devx;
> +     struct mlx5_devx_modify_sq_attr sq_attr = { 0 };
> +     struct mlx5_txq_rate_limit new_rate_limit = { 0 };
> +     int ret;
> +
> +     if (!sh->cdev->config.hca_attr.qos.packet_pacing) {
> +             DRV_LOG(ERR, "Port %u packet pacing not supported.",
> +                     dev->data->port_id);
> +             rte_errno = ENOTSUP;
> +             return -rte_errno;
> +     }
> +     if (priv->txqs == NULL || (*priv->txqs)[queue_idx] == NULL) {
> +             DRV_LOG(ERR, "Port %u Tx queue %u not configured.",
> +                     dev->data->port_id, queue_idx);
> +             rte_errno = EINVAL;
> +             return -rte_errno;
> +     }
> +     txq_ctrl = container_of((*priv->txqs)[queue_idx],
> +                             struct mlx5_txq_ctrl, txq);
> +     if (txq_ctrl->is_hairpin) {
> +             DRV_LOG(ERR, "Port %u Tx queue %u is hairpin.",
> +                     dev->data->port_id, queue_idx);
> +             rte_errno = EINVAL;
> +             return -rte_errno;
> +     }
> +     if (txq_ctrl->obj == NULL) {
> +             DRV_LOG(ERR, "Port %u Tx queue %u not initialized.",
> +                     dev->data->port_id, queue_idx);
> +             rte_errno = EINVAL;
> +             return -rte_errno;
> +     }
> +     /*
> +      * For non-hairpin queues the SQ DevX object lives in
> +      * obj->sq_obj.sq (used by DevX/HWS mode), while hairpin
> +      * queues use obj->sq directly. These are different members
> +      * of a union inside mlx5_txq_obj.
> +      */
> +     sq_devx = txq_ctrl->obj->sq_obj.sq;
> +     if (sq_devx == NULL) {
> +             DRV_LOG(ERR, "Port %u Tx queue %u SQ not ready.",
> +                     dev->data->port_id, queue_idx);
> +             rte_errno = EINVAL;
> +             return -rte_errno;
> +     }
> +     if (dev->data->tx_queue_state[queue_idx] !=
> +         RTE_ETH_QUEUE_STATE_STARTED) {
> +             DRV_LOG(ERR,
> +                     "Port %u Tx queue %u is not started, start it before setting a rate.",
> +                     dev->data->port_id, queue_idx);
> +             rte_errno = EINVAL;
> +             return -rte_errno;
> +     }
> +     if (tx_rate == 0) {
> +             /* Disable rate limiting. */
> +             if (txq_ctrl->rate_limit.pp_id == 0)
> +                     return 0; /* Already disabled. */
> +             sq_attr.sq_state = MLX5_SQC_STATE_RDY;
> +             sq_attr.state = MLX5_SQC_STATE_RDY;
> +             sq_attr.rl_update = 1;
> +             sq_attr.packet_pacing_rate_limit_index = 0;
> +             ret = mlx5_devx_cmd_modify_sq(sq_devx, &sq_attr);
> +             if (ret) {
> +                     DRV_LOG(ERR,
> +                             "Port %u Tx queue %u failed to clear rate.",
> +                             dev->data->port_id, queue_idx);
> +                     rte_errno = -ret;
> +                     return ret;
> +             }
> +             mlx5_txq_free_pp_rate_limit(&txq_ctrl->rate_limit);
> +             DRV_LOG(DEBUG, "Port %u Tx queue %u rate limit disabled.",
> +                     dev->data->port_id, queue_idx);
> +             return 0;
> +     }
> +     /* Allocate a new PP index for the requested rate into a temp. */
> +     ret = mlx5_txq_alloc_pp_rate_limit(sh, &new_rate_limit, tx_rate);
> +     if (ret)
> +             return ret;
> +     /* Modify live SQ to use the new PP index. */
> +     sq_attr.sq_state = MLX5_SQC_STATE_RDY;
> +     sq_attr.state = MLX5_SQC_STATE_RDY;
> +     sq_attr.rl_update = 1;
> +     sq_attr.packet_pacing_rate_limit_index = new_rate_limit.pp_id;
> +     ret = mlx5_devx_cmd_modify_sq(sq_devx, &sq_attr);
> +     if (ret) {
> +             DRV_LOG(ERR, "Port %u Tx queue %u failed to set rate %u Mbps.",
> +                     dev->data->port_id, queue_idx, tx_rate);
> +             mlx5_txq_free_pp_rate_limit(&new_rate_limit);
> +             rte_errno = -ret;
> +             return ret;
> +     }
> +     /* SQ updated — release old PP context, install new one. */
> +     mlx5_txq_free_pp_rate_limit(&txq_ctrl->rate_limit);
> +     txq_ctrl->rate_limit = new_rate_limit;
> +     DRV_LOG(DEBUG, "Port %u Tx queue %u rate set to %u Mbps (PP idx %u).",
> +             dev->data->port_id, queue_idx, tx_rate,
> +             txq_ctrl->rate_limit.pp_id);
> +     return 0;
> +}
> +
>  /**
>   * Verify if the queue can be released.
>   *
> --
> 2.43.0