On Friday, 30 May 2014 at 09:46:10 UTC, Marco Leise wrote:
simplicity. But as soon as I added a single CAS I was already
over the time that TCMalloc needs. That way I learned that CAS
is not as cheap as it looks and the fastest allocators work
thread local as long as possible.

22 cycles latency if on a valid cacheline?
+ overhead of going to memory

Did you try to add explicit prefetch, maybe that would help?

Prefetch is expensive on Ivy Bridge (43 cycles throughput, versus 0.5 cycles on Haswell). You need instructions to fill the pipeline between PREFETCH and LOCK CMPXCHG, so you probably need to go ASM and do a lot of testing on different CPUs. Explicit prefetching, lock-free strategies, etc. are tricky to get right. Get it wrong and it is worse than the naive implementation.
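
To make the trade-off concrete, here is a rough C++ sketch (not taken from Marco's allocator or from TCMalloc). `SharedBump` and `LocalBump` are hypothetical names; `__builtin_prefetch` is a GCC/Clang builtin and is compiled out elsewhere. The shared allocator pays one CAS per allocation, while the thread-local wrapper pays one CAS per 4096-byte chunk and bumps a plain pointer otherwise. Alignment and freeing are ignored to keep it short.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical shared bump allocator: every allocation costs one CAS
// (LOCK CMPXCHG on x86).
struct SharedBump {
    std::atomic<std::size_t> offset{0};
    std::size_t capacity;
    unsigned char* base;

    SharedBump(unsigned char* b, std::size_t cap) : capacity(cap), base(b) {}

    void* alloc(std::size_t n) {
#if defined(__GNUC__) || defined(__clang__)
        // Prefetch-for-write hint on the line holding `offset`. Issued this
        // close to the CAS there is little independent work to overlap with,
        // so whether it helps at all has to be measured per CPU.
        __builtin_prefetch(&offset, 1, 3);
#endif
        std::size_t old = offset.load(std::memory_order_relaxed);
        for (;;) {
            if (n > capacity - old) return nullptr;  // out of space
            // On failure `old` is reloaded with the current value; retry.
            if (offset.compare_exchange_weak(old, old + n,
                                             std::memory_order_acq_rel,
                                             std::memory_order_relaxed))
                return base + old;
        }
    }
};

// Hypothetical per-thread front end: one instance per thread, no atomics
// on the fast path.
struct LocalBump {
    static constexpr std::size_t chunk = 4096;  // assumed refill size
    SharedBump& shared;
    unsigned char* cur = nullptr;
    std::size_t left = 0;

    explicit LocalBump(SharedBump& s) : shared(s) {}

    void* alloc(std::size_t n) {
        if (n > left) {
            if (n > chunk) return shared.alloc(n);  // large request: go shared
            void* p = shared.alloc(chunk);          // the only CAS on this path
            if (!p) return nullptr;
            cur = static_cast<unsigned char*>(p);   // remainder of old chunk is dropped
            left = chunk;
        }
        void* r = cur;
        cur += n;
        left -= n;
        return r;
    }
};
```

In a real program each thread would own its `LocalBump` (for example as a `thread_local` object), and the prefetch would have to be issued well before the CAS, with useful work in between, to have any chance of paying off.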
