On 4/24/2024 12:57 PM, Mattias Rönnblom wrote:
> On 2024-04-24 12:21, Ferruh Yigit wrote:
>> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
>>> On 2024-04-23 13:15, Ferruh Yigit wrote:
>>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>>>
>>>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>>>>> makes no change there, but it does on 32-bit ISAs.
>>>>>
>>>>> TX struct is 56 bytes on x86_64.
>>>>>
>>>>
>>>> Hi Mattias,
>>>>
>>>> No objection to the patch. Is the improvement theoretical or do you
>>>> measure any improvement practically, if so how much is the improvement?
>>>>
>>>
>>> I didn't run any benchmarks.
>>>
>>> Two cores storing to a (falsely) shared cache line on a per-packet basis
>>> is going to be very expensive, at least for "light touch" applications.
>>>
>>
>> ack
>> I expect for af_packet bottleneck is the kernel side, so I don't expect
>> any visible improvement practically, but OK to fix this theoretical
>> issue.
>>
> 
> If you use af_packet as a slow path interface, you may well poll it
> often. If you are unlucky and aren't on an ISA where the RX struct *by
> accident* is cache-line aligned, you will suffer greatly from false
> sharing, because the counters are written to regardless if there are
> packets polled or not.
> 
> If you don't care about synchronization overhead, why even support
> multi-queue in the af_packet driver!?
> 

We do care; I just don't expect a measurable performance improvement in
practice, which is why I was asking whether you have data for it.

>>>>> Both structs keep counters, and in the RX case they are updated even
>>>>> for empty polls.
>>>>>
>>>>
>>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>>>> the loop?
>>>>
>>>
>>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
>>> are declared volatile, so you are forcing a load-modify-store cycle for
>>> every increment.
>>>
>>
>> My intention was to prevent updating stats in empty polls, I thought
>> stats will be hot in the cache but won't work with volatile.
>>
>>
>>> I would drop "volatile", or replace it with an atomic (although *not*
>>> use an atomic add for incrementing, but rather atomic load + <n>
>>> non-atomic adds + atomic store).
>>>
>>
>> As only single core polls from an Rx queue, I assume atomics is required
>> only for stats reset case. And for that case won't we need atomic add?
>>
> 
> Correct me if I'm wrong here, but my impression is that statistics reset
> tend to be best-effort in DPDK. Or, expressed in a different way, are
> not thread-safe.
> 

At least I don't see this documented anywhere. For the cases where HW
tracks the stats we don't have this problem, and some drivers use the
offset mechanism you mention below. But I can see that some soft
drivers will have this problem with stats clear.

> If you want to support MT safe counter reset, the counter increment
> operation must be an atomic add, since now you have two writers. (Unless
> you do something fancy and keep only a new offset upon reset, and never
> actually zero the counter.)
> 

Asking to learn: if we ignore the stats reset case and assume the stats
are only ever incremented (single-writer case), do we still need either
the volatile qualifier or atomic operations at all?

>> And do you know if there is a performance difference between atomic add
>> and keeping volatile qualifier?
>>
>>
> 
> I don't know how slow af_packet is, but if you care about performance,
> you don't want to use atomic add for statistics.
> 

There are a few soft drivers already using atomic adds for updating
stats. If we document the expectations from 'rte_eth_stats_reset()',
we can update those usages.

> volatile in this case it's not a big issue, performance-wise, I would
> expect.
> 
>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
>>>>> ---
>>>>>    drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>>    1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> index 397a32db58..28aeb7d08e 100644
>>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> @@ -6,6 +6,7 @@
>>>>>     * All rights reserved.
>>>>>     */
>>>>>    +#include <rte_common.h>
>>>>>    #include <rte_string_fns.h>
>>>>>    #include <rte_mbuf.h>
>>>>>    #include <ethdev_driver.h>
>>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>>          volatile unsigned long rx_pkts;
>>>>>        volatile unsigned long rx_bytes;
>>>>> -};
>>>>> +} __rte_cache_aligned;
>>>>>    
>>>>
>>>> Latest location for '__rte_cache_aligned' tag is at the beginning of
>>>> the
>>>> struct [1], so something like:
>>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>>>
>>>> [1]
>>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>>
