On 2024-05-22 21:51, Stephen Hemminger wrote:
On Wed, 22 May 2024 12:01:12 -0700
Tyler Retzlaff <roret...@linux.microsoft.com> wrote:
On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
From: Stephen Hemminger [mailto:step...@networkplumber.org]
Sent: Wednesday, 22 May 2024 17.38
On Wed, 22 May 2024 10:31:39 +0200
Morten Brørup <m...@smartsharesystems.com> wrote:
+/* On 32 bit platform, need to use atomic to avoid load/store tearing */
+typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
As shown by Godbolt experiments discussed in a previous thread [2],
non-tearing 64 bit counters can be implemented without using atomic
instructions on all 32 bit architectures supported by DPDK. So we should
use the counter/offset design pattern for RTE_ARCH_32 too.
[2]: https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
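For reference, the counter/offset design pattern mentioned above can be sketched roughly like this (names are illustrative, not the actual DPDK API; making the 64-bit loads and stores non-tearing is the separate, per-architecture concern being debated here):

```c
#include <stdint.h>

typedef uint64_t rte_counter64_t;

/* Writer side: plain, non-atomic increment on the fast path. */
static inline void
counter_add(rte_counter64_t *counter, uint32_t val)
{
	*counter += val;
}

/* Reader side: the published value is the raw counter minus a
 * reset offset, so resetting never writes to the hot counter. */
static inline uint64_t
counter_fetch(const rte_counter64_t *counter, const uint64_t *offset)
{
	return *counter - *offset;
}

/* Reset: record the current counter value as the new offset. */
static inline void
counter_reset(const rte_counter64_t *counter, uint64_t *offset)
{
	*offset = *counter;
}
```

The point of the pattern is that the writer's fast path stays a plain add; only the reader and the (rare) reset pay any extra cost.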
This code, built with -O3 and -m32 on godbolt, shows the split problem.
#include <stdint.h>

typedef uint64_t rte_counter64_t;

void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
	*counter += val;
}

…
	*counter = val;
}
rte_counter64_add:
        push    ebx
        mov     eax, DWORD PTR [esp+8]
        xor     ebx, ebx
        mov     ecx, DWORD PTR [esp+12]
        add     DWORD PTR [eax], ecx
        adc     DWORD PTR [eax+4], ebx
        pop     ebx
        ret

rte_counter64_read:
        mov     eax, DWORD PTR [esp+4]
        mov     edx, DWORD PTR [eax+4]
        mov     eax, DWORD PTR [eax]
        ret

rte_counter64_set:
        movq    xmm0, QWORD PTR [esp+8]
        mov     eax, DWORD PTR [esp+4]
        movq    QWORD PTR [eax], xmm0
        ret
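For what it's worth, relaxed atomic load/store would cure the tearing shown in the read path above without any lock prefix; with SSE2 available, the i386 codegen becomes a single 8-byte movq. A sketch (function names mirror the snippet above, not necessarily the final API):

```c
#include <stdint.h>

typedef uint64_t rte_counter64_t;

/* Relaxed atomic load: a single 8-byte access, so no tearing,
 * and no lock prefix on x86. */
uint64_t
rte_counter64_read(const rte_counter64_t *counter)
{
	return __atomic_load_n(counter, __ATOMIC_RELAXED);
}

/* Relaxed atomic store: likewise a single 8-byte access. */
void
rte_counter64_set(rte_counter64_t *counter, uint64_t val)
{
	__atomic_store_n(counter, val, __ATOMIC_RELAXED);
}
```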
Sure, atomic might be required on some 32 bit architectures and/or with some
compilers.
In theory, I think you should be able to use generic atomics and, depending on the target, get codegen that works. It might be something more expensive on 32-bit and nothing on 64-bit, etc.
What's the damage if we just use generic atomics with relaxed ordering? Is the codegen not optimal?
If we use an atomic with relaxed memory order, the compiler for x86 still generates a locked increment in the fast path. This costs about 100 extra cycles due to cache and prefetch stalls. This whole endeavor is an attempt to avoid that.
It's because the code is overly restrictive (e.g., needlessly forcing the whole read-modify-write to be atomic) in that case, and no fault of the compiler.
void add(uint64_t *addr, uint64_t operand)
{
	uint64_t value = __atomic_load_n(addr, __ATOMIC_RELAXED);
	value += operand;
	__atomic_store_n(addr, value, __ATOMIC_RELAXED);
}
->
x86_64
add:
        mov     rax, QWORD PTR [rdi]
        add     rax, rsi
        mov     QWORD PTR [rdi], rax
        ret

x86

add:
        sub     esp, 12
        mov     ecx, DWORD PTR [esp+16]
        movq    xmm0, QWORD PTR [ecx]
        movq    QWORD PTR [esp], xmm0
        mov     eax, DWORD PTR [esp]
        mov     edx, DWORD PTR [esp+4]
        add     eax, DWORD PTR [esp+20]
        adc     edx, DWORD PTR [esp+24]
        mov     DWORD PTR [esp], eax
        mov     DWORD PTR [esp+4], edx
        movq    xmm1, QWORD PTR [esp]
        movq    QWORD PTR [ecx], xmm1
        add     esp, 12
        ret
No locked instructions.
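To contrast with the listing above: an atomic read-modify-write, even with relaxed ordering, still forces a lock-prefixed instruction on x86, because the increment itself must be atomic (sketch using the GCC/clang builtins):

```c
#include <stdint.h>

/* Even with __ATOMIC_RELAXED, a fetch-add is an atomic
 * read-modify-write, so x86 compilers emit "lock add" (or
 * "lock xadd") here. The split load/store version avoids this,
 * at the cost of not being atomic against concurrent writers. */
void
add_rmw(uint64_t *addr, uint64_t operand)
{
	__atomic_fetch_add(addr, operand, __ATOMIC_RELAXED);
}
```

This is why the single-writer counter/offset pattern avoids the RMW entirely instead of just relaxing its ordering.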
PS: looking at the locked increment code, on 32 bit it involves a locked compare-exchange and a potential retry loop. We probably don't care about performance on that platform anymore.
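Roughly, that compare-exchange retry loop looks like this in C (an illustrative sketch of what the compiler expands a 64-bit atomic fetch-add into on 32-bit x86 via lock cmpxchg8b, not the actual generated code):

```c
#include <stdint.h>

/* Sketch: 64-bit fetch-add implemented as a CAS retry loop, the
 * shape a 32-bit x86 compiler lowers __atomic_fetch_add into. */
static uint64_t
fetch_add_u64(uint64_t *addr, uint64_t operand)
{
	uint64_t old = __atomic_load_n(addr, __ATOMIC_RELAXED);

	/* On failure, __atomic_compare_exchange_n reloads 'old'
	 * with the current value and we retry with a fresh sum. */
	while (!__atomic_compare_exchange_n(addr, &old, old + operand,
					    1 /* weak */,
					    __ATOMIC_RELAXED,
					    __ATOMIC_RELAXED))
		;
	return old;
}
```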