On 4/24/2024 12:57 PM, Mattias Rönnblom wrote: > On 2024-04-24 12:21, Ferruh Yigit wrote: >> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote: >>> On 2024-04-23 13:15, Ferruh Yigit wrote: >>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: >>>>> Cache align Rx and Tx queue struct to avoid false sharing. >>>>> >>>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment >>>>> makes no change there, but it does on 32-bit ISAs. >>>>> >>>>> TX struct is 56 bytes on x86_64. >>>>> >>>> >>>> Hi Mattias, >>>> >>>> No objection to the patch. Is the improvement theoretical or do you >>>> measure any improvement practically, if so how much is the improvement? >>>> >>> >>> I didn't run any benchmarks. >>> >>> Two cores storing to a (falsely) shared cache line on a per-packet basis >>> is going to be very expensive, at least for "light touch" applications. >>> >> >> ack >> I expect for af_packet bottleneck is the kernel side, so I don't expect >> any visible improvement practically, but OK to fix this theoretical >> issue. >> > > If you use af_packet as a slow path interface, you may well poll it > often. If you are unlucky and aren't on an ISA where the RX struct *by > accident* is cache-line aligned, you will suffer greatly from false > sharing, because the counters are written to regardless if there are > packets polled or not. > > If you don't care about synchronization overhead, why even support > multi-queue in the af_packet driver!? >
We care, only I expect no measurable performance improvement and was asking if you have data for it. >>>>> Both structs keep counters, and in the RX case they are updated even >>>>> for empty polls. >>>>> >>>> >>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within >>>> the loop? >>>> >>> >>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes >>> are declared volatile, so you are forcing a load-modify-store cycle for >>> every increment. >>> >> >> My intention was to prevent updating stats in empty polls, I thought >> stats will be hot in the cache but won't work with volatile. >> >> >>> I would drop "volatile", or replace it with an atomic (although *not* >>> use an atomic add for incrementing, but rather atomic load + <n> >>> non-atomic adds + atomic store). >>> >> >> As only single core polls from an Rx queue, I assume atomics is required >> only for stats reset case. And for that case won't we need atomic add? >> > > Correct me if I'm wrong here, but my impression is that statistics reset > tend to be best-effort in DPDK. Or, expressed in a different way, are > not thread-safe. > At least I don't see this is something documented. For cases HW tracks stats we don't have this problem. Or some drivers use offset mechanism you mentioned below. But I can see some soft drivers will have this problem with stats clear. > If you want to support MT safe counter reset, the counter increment > operation must be an atomic add, since now you have two writers. (Unless > you do something fancy and keep only a new offset upon reset, and never > actually zero the counter.) > Asking to learn, if we ignore the stats reset case and assume stats only incremented, single writer case, do we still need either volatile qualifier or atomic operations at all? >> And do you know if there is a performance difference between atomic add >> and keeping volatile qualifier? >> >> > > I don't know how slow af_packet is, but if you care about performance, > you don't want to use atomic add for statistics. > There are a few soft drivers already using atomics adds for updating stats. If we document expectations from 'rte_eth_stats_reset()', we can update those usages. > volatile in this case it's not a big issue, performance-wise, I would > expect. > >>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com> >>>>> --- >>>>> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >>>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c >>>>> b/drivers/net/af_packet/rte_eth_af_packet.c >>>>> index 397a32db58..28aeb7d08e 100644 >>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>>>> @@ -6,6 +6,7 @@ >>>>> * All rights reserved. >>>>> */ >>>>> +#include <rte_common.h> >>>>> #include <rte_string_fns.h> >>>>> #include <rte_mbuf.h> >>>>> #include <ethdev_driver.h> >>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >>>>> volatile unsigned long rx_pkts; >>>>> volatile unsigned long rx_bytes; >>>>> -}; >>>>> +} __rte_cache_aligned; >>>>> >>>> >>>> Latest location for '__rte_cache_aligned' tag is at the beginning of >>>> the >>>> struct [1], so something like: >>>> `struct __rte_cache_aligned pkt_rx_queue {` >>>> >>>> [1] >>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both >>