On 2024-04-24 12:21, Ferruh Yigit wrote:
On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
On 2024-04-23 13:15, Ferruh Yigit wrote:
On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
Cache align Rx and Tx queue struct to avoid false sharing.

RX struct happens to be 64 bytes on x86_64 already, so cache alignment
makes no change there, but it does on 32-bit ISAs.

TX struct is 56 bytes on x86_64.


Hi Mattias,

No objection to the patch. Is the improvement theoretical, or did you
measure any improvement in practice? If so, how much?


I didn't run any benchmarks.

Two cores storing to a (falsely) shared cache line on a per-packet basis
is going to be very expensive, at least for "light touch" applications.


ack
I expect the bottleneck for af_packet to be on the kernel side, so I
don't expect any practically visible improvement, but it is OK to fix
this theoretical issue.


If you use af_packet as a slow path interface, you may well poll it often. If you are unlucky and aren't on an ISA where the RX struct is cache-line aligned *by accident*, you will suffer greatly from false sharing, because the counters are written to regardless of whether any packets were polled.

If you don't care about synchronization overhead, why even support multi-queue in the af_packet driver!?

Both structs keep counters, and in the RX case they are updated even
for empty polls.


Do you think it would help to move the 'rx_pkts' & 'rx_bytes' updates
inside the loop?


No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
are declared volatile, so you are forcing a load-modify-store cycle for
every increment.


My intention was to prevent updating the stats on empty polls; I thought
the stats would stay hot in the cache, but that won't work with volatile.


I would drop "volatile", or replace it with an atomic (although *not*
using an atomic add for incrementing, but rather an atomic load + <n>
non-atomic adds + an atomic store).


As only a single core polls from an Rx queue, I assume atomics are
required only for the stats reset case. And for that case, won't we need
an atomic add?


Correct me if I'm wrong here, but my impression is that statistics reset tends to be best-effort in DPDK. Or, expressed in a different way, it is not thread-safe.

If you want to support MT safe counter reset, the counter increment operation must be an atomic add, since now you have two writers. (Unless you do something fancy and keep only a new offset upon reset, and never actually zero the counter.)

And do you know if there is a performance difference between an atomic
add and keeping the volatile qualifier?



I don't know how slow af_packet is, but if you care about performance, you don't want to use an atomic add for statistics.

volatile in this case is not a big issue, performance-wise, I would expect.

Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
---
 drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 397a32db58..28aeb7d08e 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -6,6 +6,7 @@
  * All rights reserved.
  */
 
+#include <rte_common.h>
 #include <rte_string_fns.h>
 #include <rte_mbuf.h>
 #include <ethdev_driver.h>
@@ -53,7 +54,7 @@ struct pkt_rx_queue {
 
 	volatile unsigned long rx_pkts;
 	volatile unsigned long rx_bytes;
-};
+} __rte_cache_aligned;

The latest convention places the '__rte_cache_aligned' tag at the
beginning of the struct [1], so something like:
`struct __rte_cache_aligned pkt_rx_queue {`

[1]
https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
