On 9/20/2023 2:12 PM, Ferruh Yigit wrote:
> On 8/24/2023 8:36 AM, Feifei Wang wrote:
>> Currently, the transmit side frees buffers into the lcore cache and
>> the receive side allocates buffers from the lcore cache. The transmit
>> side typically frees 32 buffers, resulting in 32*8=256B of stores to
>> the lcore cache. The receive side allocates 32 buffers and stores them
>> in the receive-side software ring, resulting in 256B of stores and
>> 256B of loads from the lcore cache.
>>
>> This patch proposes a mechanism to avoid freeing to / allocating from
>> the lcore cache: the receive side refills its software ring directly
>> with the buffers freed by the transmit side. This avoids the 256B of
>> loads and stores introduced by the lcore cache and frees up the cache
>> lines the lcore cache occupies. We call this mode mbufs recycle mode.
>>
>> In the latest version, mbufs recycle mode is packaged as a separate API.
>> This allows users to change the rxq/txq pairing at run time in the data
>> plane, according to the application's analysis of the packet flow, for
>> example:
>> -----------------------------------------------------------------------
>> Step 1: upper application analyses the flow direction
>> Step 2: recycle_rxq_info = rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid)
>> Step 3: rte_eth_recycle_mbufs(rx_portid, rx_queueid, tx_portid, tx_queueid, recycle_rxq_info)
>> Step 4: rte_eth_rx_burst(rx_portid, rx_queueid)
>> Step 5: rte_eth_tx_burst(tx_portid, tx_queueid)
>> -----------------------------------------------------------------------
>> The above sequence lets the user change the rxq/txq pairing at run time,
>> without needing to know the flow direction in advance, which effectively
>> widens the use scenarios of mbufs recycle mode.
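>>
>> As an illustration, a minimal sketch (not from the patchset itself) of
>> the steps above inside a forwarding loop follows. It assumes a single
>> forwarding lcore, queues that are already configured and started, a
>> burst size of 32, and placeholder port/queue ids:
>> -----------------------------------------------------------------------
>> #include <rte_ethdev.h>
>> #include <rte_mbuf.h>
>>
>> static void
>> recycle_fwd_loop(uint16_t rx_portid, uint16_t rx_queueid,
>> 		uint16_t tx_portid, uint16_t tx_queueid)
>> {
>> 	struct rte_eth_recycle_rxq_info recycle_rxq_info;
>> 	struct rte_mbuf *pkts[32];
>> 	uint16_t nb_rx, nb_tx;
>>
>> 	/* Step 2: query the Rx queue sw-ring info, once per pairing. */
>> 	if (rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid,
>> 			&recycle_rxq_info) != 0)
>> 		return; /* queue does not support mbufs recycle mode */
>>
>> 	for (;;) {
>> 		/* Step 3: move freed Tx mbufs straight into the Rx
>> 		 * sw ring, bypassing the lcore cache.
>> 		 */
>> 		rte_eth_recycle_mbufs(rx_portid, rx_queueid,
>> 				tx_portid, tx_queueid, &recycle_rxq_info);
>> 		/* Steps 4-5: the usual Rx/Tx burst pair. */
>> 		nb_rx = rte_eth_rx_burst(rx_portid, rx_queueid, pkts, 32);
>> 		nb_tx = rte_eth_tx_burst(tx_portid, tx_queueid, pkts, nb_rx);
>> 		if (nb_tx < nb_rx)
>> 			rte_pktmbuf_free_bulk(&pkts[nb_tx], nb_rx - nb_tx);
>> 	}
>> }
>> -----------------------------------------------------------------------
>> When the application detects that a flow's direction has changed, it
>> only needs to fetch recycle_rxq_info again for the new rxq/txq pairing.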
>>
>> Furthermore, mbufs recycle mode is no longer limited to a single PMD:
>> it can move mbufs between PMDs from different vendors, and can even put
>> the mbufs anywhere into your Rx mbuf ring as long as the address of the
>> mbuf ring is provided. In the latest version, we enable mbufs recycle
>> mode in the i40e and ixgbe PMDs, and also try using the i40e driver for
>> Rx and the ixgbe driver for Tx, achieving a 7-9% performance improvement
>> with mbufs recycle mode.
>>
>> Difference between mbuf recycle, the ZC API used in mempool, and the
>> general path:
>> For the general path:
>>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>>                 Tx: 32 pkts memcpy from tx_sw_ring to a temporary variable
>>                     + 32 pkts memcpy from the temporary variable to mempool cache
>> For the ZC API used in mempool:
>>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>>                 Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
>>                 Refer link:
>>                 http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.alig...@arm.com/
>> For mbufs recycle:
>>                 Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
>> Thus, in one loop, mbufs recycle mode saves 32+32=64 pkts of memcpy
>> compared to the general path, and 32 pkts of memcpy compared to the ZC
>> API used in mempool. So mbufs recycle has its own benefits.
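>>
>> Assuming 8B mbuf pointers and bursts of 32 (matching the 32*8=256B
>> store/load figures above), the pointer copies per loop work out to:
>>                 general path:    32 + (32+32) = 96 copies = 768B
>>                 ZC mempool API:  32 + 32      = 64 copies = 512B
>>                 mbufs recycle:   32           = 32 copies = 256B
>> i.e. mbufs recycle moves 512B less per loop than the general path and
>> 256B less than the ZC API, consistent with the 64 and 32 pkts memcpy
>> savings quoted above.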
>>
>> Testing status:
>> (1) dpdk l3fwd test with multiple drivers:
>>     port 0: 82599 NIC   port 1: XL710 NIC
>> -------------------------------------------------------------
>>                  Without fast free      With fast free
>> Thunderx2:           +7.53%                 +13.54%
>> -------------------------------------------------------------
>>
>> (2) dpdk l3fwd test with same driver:
>>     port 0 && 1: XL710 NIC
>> -------------------------------------------------------------
>>                  Without fast free      With fast free
>> Ampere altra:        +12.61%                +11.42%
>> n1sdp:               +8.30%                 +3.85%
>> x86-sse:             +8.43%                 +3.72%
>> -------------------------------------------------------------
>>
>> (3) Performance comparison with ZC_mempool used
>>     port 0 && 1: XL710 NIC
>>     with fast free
>> -------------------------------------------------------------
>>                  With recycle buffer    With zc_mempool
>> Ampere altra:        +11.42%                +3.54%
>> -------------------------------------------------------------
>>
>> Furthermore, we add a recycle_mbufs engine in testpmd. Because the XL710
>> NIC hits an I/O bottleneck in testpmd on Ampere Altra, we cannot see a
>> throughput change compared with the io fwd engine. However, using the
>> record command in testpmd:
>> 'set record-burst-stats on'
>> we can see that the ratio of 'Rx/Tx burst size of 32' is reduced, which
>> indicates that mbufs recycle can save CPU cycles.
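>>
>> For reference, this observation could be reproduced with a testpmd
>> session along the following lines (the engine name matches the engine
>> added in patch 4/4; EAL and port options are omitted as placeholders):
>> -----------------------------------------------------------------------
>> testpmd> set fwd recycle_mbufs
>> testpmd> set record-burst-stats on
>> testpmd> start
>> ... run traffic for a while ...
>> testpmd> stop
>> (the forwarding statistics printed on stop include the recorded burst
>> size distribution, where the 'Rx/Tx burst size of 32' share drops)
>> -----------------------------------------------------------------------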
>>
>> V2:
>> 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
>> 2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
>> 3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
>> 4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
>>
>> V3:
>> 1. Separate Rx and Tx operations with two APIs in direct-rearm (Konstantin)
>> 2. Delete L3fwd change for direct rearm (Jerin)
>> 3. Enable direct rearm in the ixgbe driver on Arm
>>
>> v4:
>> 1. Rename direct-rearm as buffer recycle. Based on this, function name
>> and variable name are changed to let this mode more general for all
>> drivers. (Konstantin, Morten)
>> 2. Add ring wrapping check (Konstantin)
>>
>> v5:
>> 1. Some changes to the ethdev API (Morten)
>> 2. Add support for the avx2, sse and altivec paths
>>
>> v6:
>> 1. Fix the ixgbe build issue on ppc
>> 2. Remove the 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill'
>>    API wrappers (Tech Board meeting)
>> 3. Add a recycle_mbufs engine in testpmd (Tech Board meeting)
>> 4. Add a namespace to the functions related to mbufs recycle (Ferruh)
>>
>> v7:
>> 1. move 'rxq/txq data' pointers to the beginning of eth_dev structure,
>> in order to keep them in the same cache line as rx/tx_burst function
>> pointers (Morten)
>> 2. add the extra description for 'rte_eth_recycle_mbufs' to show it can
>> support feeding 1 Rx queue from 2 Tx queues in the same thread
>> (Konstantin)
>> 3. For the i40e/ixgbe drivers, mark the previously copied buffers as
>> invalid if any Tx buffers have refcnt > 1 or come from an unexpected
>> mempool (Konstantin)
>> 4. add check for the return value of 'rte_eth_recycle_rx_queue_info_get'
>> in testpmd fwd engine (Morten)
>>
>> v8:
>> 1. Add an arm/x86 build option to fix the ixgbe build issue on ppc
>>
>> v9:
>> 1. Delete the duplicate file name for ixgbe
>>
>> v10:
>> 1. Fix a compile issue on Windows
>>
>> v11:
>> 1. Fix a doc warning
>>
>> v12:
>> 1. Replace the Rx queue check code with the eth_dev_validate_rx_queue
>> function (Stephen)
>> 2. Put the port and queue checks before the function call (Konstantin)
>>
>> Feifei Wang (4):
>>   ethdev: add API for mbufs recycle mode
>>   net/i40e: implement mbufs recycle mode
>>   net/ixgbe: implement mbufs recycle mode
>>   app/testpmd: add recycle mbufs engine
>>
> 
> Thanks for your dedication to improving the patchset and finding a
> better solution; it is appreciated.
> 
> Series applied to dpdk-next-net/main, thanks.
> 

Konstantin highlighted that there is an outstanding discussion:
http://patchwork.dpdk.org/project/dpdk/patch/20230822072710.1945027-3-feifei.wa...@arm.com/

Dropping the patchset from next-net and updating its status in patchwork
as "Changes Requested".
