Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of load from the lcore cache.

This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache. i.e. the receive side will free the buffers from
transmit side directly into it's software ring. This will avoid the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache.

However, this solution poses several constraints:

1)The receive queue needs to know which transmit queue it should take
the buffers from. The application logic decides which transmit port to
use to send out the packets. In many use cases the NIC might have a
single port ([1], [2], [3]), in which case a given transmit queue is
always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
is easy to configure.

If the NIC has 2 ports (there are several references), then we will have
1:2 (RX queue: TX queue) mapping which is still easy to configure.
However, if this is generalized to 'N' ports, the configuration can be
long. More over the PMD would have to scan a list of transmit queues to
pull the buffers from.

2)The other factor that needs to be considered is 'run-to-completion' vs
'pipeline' models. In the run-to-completion model, the receive side and
the transmit side are running on the same lcore serially. In the pipeline
model. The receive side and transmit side might be running on different
lcores in parallel. This requires locking. This is not supported at this
point.

3)Tx and Rx buffers must be from the same mempool. And we also must
ensure Tx buffer free number is equal to Rx buffer free number:
(txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
is due to tx_next_dd is a variable to compute tx sw-ring free location.
Its value will be one more round than the position where next time free
starts.

Current status in this RFC:
1)An API is added to allow for mapping a TX queue to a RX queue.
  Currently it supports 1:1 mapping.
2)The i40e driver is changed to do the direct re-arm of the receive
  side.
3)L3fwd application is modified to do the direct rearm mapping
automatically without user config. This follows the rules that the
thread can map TX queue to a RX queue based on the first received
package destination port.

Testing status:
1.The testing results for L3fwd are as follows:
-------------------------------------------------------------------
enabled direct rearm
-------------------------------------------------------------------
Arm:
N1SDP(neon path):
without fast-free mode          with fast-free mode
        +14.1%                          +7.0%

Ampere Altra(neon path):
without fast-free mode          with fast-free mode
        +17.1                           +14.0%

X86:
Dell-8268(limit frequency):
sse path:
without fast-free mode          with fast-free mode
        +6.96%                          +2.02%
avx2 path:
without fast-free mode          with fast-free mode
        +9.04%                          +7.75%
avx512 path:
without fast-free mode          with fast-free mode
        +5.43%                          +1.57%
-------------------------------------------------------------------
This patch can not affect base performance of normal mode.
Furthermore, the reason for that limiting the CPU frequency is
that dell-8268 can encounter i40e NIC bottleneck with maximum
frequency.

2.The testing results for VPP-L3fwd are as follows:
-------------------------------------------------------------------
Arm:
N1SDP(neon path):
with direct re-arm mode enabled
        +7.0%
-------------------------------------------------------------------
For Ampere Altra and X86,VPP-L3fwd test has not been done.

Reference:
[1] 
https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
[2] 
https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
[3] 
https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g

Feifei Wang (5):
  net/i40e: remove redundant Dtype initialization
  net/i40e: enable direct rearm mode
  ethdev: add API for direct rearm mode
  net/i40e: add direct rearm mode internal API
  examples/l3fwd: enable direct rearm mode

 drivers/net/i40e/i40e_ethdev.c          |  34 +++
 drivers/net/i40e/i40e_rxtx.c            |   4 -
 drivers/net/i40e/i40e_rxtx.h            |   4 +
 drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx_vec_avx2.c   |  14 +-
 drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
 drivers/net/i40e/i40e_rxtx_vec_neon.c   | 141 ++++++++++++-
 drivers/net/i40e/i40e_rxtx_vec_sse.c    | 170 ++++++++++++++-
 examples/l3fwd/l3fwd_lpm.c              |  16 +-
 lib/ethdev/ethdev_driver.h              |  15 ++
 lib/ethdev/rte_ethdev.c                 |  14 ++
 lib/ethdev/rte_ethdev.h                 |  31 +++
 lib/ethdev/version.map                  |   1 +
 13 files changed, 949 insertions(+), 13 deletions(-)

-- 
2.25.1

Reply via email to