On 2024-09-12 11:39, Morten Brørup wrote:
>> +struct lcore_state {
>> +	uint64_t a;
>> +	uint64_t b;
>> +	uint64_t sum;
>> +};
>> +
>> +static __rte_always_inline void
>> +update(struct lcore_state *state)
>> +{
>> +	state->sum += state->a * state->b;
>> +}
>> +
>> +static RTE_DEFINE_PER_LCORE(struct lcore_state, tls_lcore_state);
>> +
>> +static __rte_noinline void
>> +tls_update(void)
>> +{
>> +	update(&RTE_PER_LCORE(tls_lcore_state));
>
> I would normally access TLS variables directly, not through a pointer, i.e.:
>
> RTE_PER_LCORE(tls_lcore_state.sum) +=
> 	RTE_PER_LCORE(tls_lcore_state.a) * RTE_PER_LCORE(tls_lcore_state.b);
>
> On the other hand, then it wouldn't be 1:1 comparable to the two other
> test cases. Besides, I expect the compiler to optimize away the indirect
> access and produce the same output (as for the alternative
> implementation) anyway.
>
> No change requested. Just noticing.
>
>> +}
>> +
>> +struct __rte_cache_aligned lcore_state_aligned {
>> +	uint64_t a;
>> +	uint64_t b;
>> +	uint64_t sum;
>
> Please add RTE_CACHE_GUARD here, for 100 % matching the common design pattern.
Will do.
>> +};
>> +
>> +static struct lcore_state_aligned sarray_lcore_state[RTE_MAX_LCORE];
>> +	printf("Latencies [ns/update]\n");
>> +	printf("Thread-local storage  Static array  Lcore variables\n");
>> +	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
>> +	       sarray_latency * 1e9, lvar_latency * 1e9);
>
> I prefer cycles over ns. Perhaps you could show both?
That makes you an x86 guy. :) Only on x86 do those cycles make any sense.
I didn't want to use cycles since it would be a very small value on certain (e.g., old ARM) platforms.
That said, elsewhere in the perf tests TSC cycles are used, so maybe I should switch to those nevertheless.
With RTE_CACHE_GUARD added where mentioned, Acked-by: Morten Brørup <[email protected]>

