cmcfarlen opened a new issue, #8826:
URL: https://github.com/apache/trafficserver/issues/8826
Posting this issue to open a discussion and settle on a good remedy.
With freelists on, and under high transaction rates, a perf flame graph
shows a lot of time is spent in `freelist_new`/`freelist_free` under
`Arena::alloc` or `Arena::reset`. This is due to the freelist implementation
performing atomic CAS operations in a loop. I wrote a benchmark to exercise
this and added a bit of instrumentation to count how many times the CAS fails.
```
benchmark name       samples       iterations    estimated
                     mean          low mean      high mean
                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
global allocator     100           1             2.17617 s
                     18.6112 ms    17.8946 ms    19.1444 ms
                     3.13175 ms    2.48935 ms    4.16424 ms

thread allocator     100           1             515.923 ms
                     5.09905 ms    4.97022 ms    5.34322 ms
                     872.118 us    562.584 us    1.52475 ms

max global loop count: 226979
max local loop count: 0
```
The difference between these benchmarks is that `global allocator` is marked
static (as it is in ATS ArenaBlock), but `thread allocator` is marked
`thread_local`. Otherwise the code is identical.
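To make the storage difference concrete, here is a minimal sketch. `StubAllocator`, `global_allocator`, `thread_allocator` and `run_both` are hypothetical names for illustration; they are not the ATS types or the benchmark code. The stub only counts calls, which is enough to show that a `static` object is one instance shared by all threads while a `thread_local` object is a separate instance per thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for an allocator; it only counts calls.
struct StubAllocator {
  std::atomic<uint64_t> allocations{0};
  void alloc() { allocations.fetch_add(1, std::memory_order_relaxed); }
};

static StubAllocator global_allocator;       // one instance shared by every thread
thread_local StubAllocator thread_allocator; // one separate instance per thread

struct Counts {
  uint64_t global_total;            // calls seen by the shared instance
  std::vector<uint64_t> per_thread; // calls seen by each thread's own instance
};

// Every thread calls both allocators `items` times.
Counts run_both(unsigned threads, uint32_t items) {
  const uint64_t global_before = global_allocator.allocations.load();
  std::vector<uint64_t> per_thread(threads, 0);
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < threads; ++t) {
    pool.emplace_back([&per_thread, t, items] {
      // A new thread starts with a fresh thread_local instance.
      assert(thread_allocator.allocations.load() == 0);
      for (uint32_t i = 0; i < items; ++i) {
        global_allocator.alloc();
        thread_allocator.alloc();
      }
      per_thread[t] = thread_allocator.allocations.load();
    });
  }
  for (auto &th : pool) {
    th.join();
  }
  return {global_allocator.allocations.load() - global_before, per_thread};
}
```

Calling `run_both(20, 1000)` should report 20000 calls on the shared instance and 1000 on each per-thread instance, which is why the `thread_local` variant has nothing to contend over.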
The benchmark shows that the `thread_local` version is much faster. I also
added a loop counter that tracks how many times the CAS fails when allocating
from or freeing to the freelists. The max global loop count here is the worst
number of CAS misses across all of the benchmark iterations. For thread local,
this is 0 because there is no contention, so the CAS never fails. For the
global allocator though, there are 20 threads vying to allocate 1000 items each
from the same allocator instance. Doing the math, that should be `20 * 1000 *
2 = 40000` CAS attempts (one alloc and one free per item), but there were
`226979` misses, which makes `266979` CAS operations in total, or about `7`
tries (`266979 / 40000 ≈ 6.7`) to accomplish one CAS. In other words, the
thread local allocator needs about 7x fewer CAS operations in this scenario.
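The retry counting described above can be sketched roughly as follows. This is not the ATS freelist and not the benchmark from this issue: `IndexFreelist`, `run_shared` and `tries_per_success` are hypothetical names, and the list is a simplified lock-free stack of slot indices whose head carries a version counter so a CAS cannot succeed against a stale head. Every failed CAS bumps `cas_misses`, which plays the role of the loop counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical freelist: a lock-free stack of slot indices (illustration only).
class IndexFreelist {
public:
  static constexpr uint32_t kNil = 0xFFFFFFFFu;
  std::atomic<uint64_t> cas_misses{0}; // number of failed CAS attempts

  explicit IndexFreelist(uint32_t capacity) : next_(capacity) {
    for (uint32_t i = 0; i < capacity; ++i) {
      next_[i].store(i + 1 < capacity ? i + 1 : kNil);
    }
    head_.store(pack(0, capacity ? 0 : kNil));
  }

  // Pop one slot; returns kNil when the list is empty.
  uint32_t alloc() {
    uint64_t old_head = head_.load();
    for (;;) {
      const uint32_t idx = index_of(old_head);
      if (idx == kNil) return kNil;
      const uint64_t new_head = pack(version_of(old_head) + 1, next_[idx].load());
      // On failure, old_head is refreshed with the current head and we retry.
      if (head_.compare_exchange_strong(old_head, new_head)) return idx;
      cas_misses.fetch_add(1, std::memory_order_relaxed);
    }
  }

  // Push a slot back onto the list.
  void release(uint32_t idx) {
    uint64_t old_head = head_.load();
    for (;;) {
      next_[idx].store(index_of(old_head));
      const uint64_t new_head = pack(version_of(old_head) + 1, idx);
      if (head_.compare_exchange_strong(old_head, new_head)) return;
      cas_misses.fetch_add(1, std::memory_order_relaxed);
    }
  }

private:
  static uint64_t pack(uint32_t version, uint32_t index) {
    return (static_cast<uint64_t>(version) << 32) | index;
  }
  static uint32_t index_of(uint64_t h) { return static_cast<uint32_t>(h & 0xFFFFFFFFu); }
  static uint32_t version_of(uint64_t h) { return static_cast<uint32_t>(h >> 32); }

  std::vector<std::atomic<uint32_t>> next_; // next_[i] is the slot below slot i
  std::atomic<uint64_t> head_;              // {version, top index}
};

// Each thread allocates `items` slots from one shared list, then frees them.
// Returns the total number of failed CAS attempts.
uint64_t run_shared(unsigned threads, uint32_t items) {
  IndexFreelist list(threads * items);
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < threads; ++t) {
    pool.emplace_back([&list, items] {
      std::vector<uint32_t> held;
      held.reserve(items);
      for (uint32_t i = 0; i < items; ++i) {
        const uint32_t idx = list.alloc();
        assert(idx != IndexFreelist::kNil); // capacity covers every thread
        held.push_back(idx);
      }
      for (uint32_t idx : held) list.release(idx);
    });
  }
  for (auto &th : pool) th.join();
  return list.cas_misses.load();
}

// Average CAS attempts per successful alloc or free.
double tries_per_success(uint64_t misses, unsigned threads, uint32_t items) {
  const double successes = 2.0 * threads * items; // one alloc + one free per item
  return (successes + static_cast<double>(misses)) / successes;
}
```

With a single thread, `run_shared(1, 1000)` returns 0 because nothing competes for the head. With 20 threads the miss count varies from run to run and machine to machine. Plugging the numbers reported in this issue into `tries_per_success(226979, 20, 1000)` gives roughly 6.7, the "about 7 tries" figure.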
We tested marking this `defaultSizeArenaBlock` allocator thread_local and
there was a significant performance increase.
I see a couple of discussion items around this:
- With freelists off, this allocator uses system malloc (or a thread_local
jemalloc heap) for allocation, so in some situations, marking this thread_local
ends up using two thread_local variables which seems odd.
- Normally this might be a use for a `ProxyAllocator`, but code
organization-wise, Arena is in `tscore` whereas `ProxyAllocator` is in
`iocore/eventsystem`, so the dependency would go in the wrong direction.
The most direct solution is just to mark this `defaultSizeArenaBlock`
allocator `thread_local`, but I'd like to get some community feedback on this
issue.