cmcfarlen opened a new issue, #8826:
URL: https://github.com/apache/trafficserver/issues/8826

   Posting an issue about this for discussion to come up with a good remedy.
   
   With freelists on, and under high transaction rates, a perf flame graph 
shows a lot of time is spent in `freelist_new`/`freelist_free` under 
`Arena::alloc` or `Arena::reset`.  This is due to the freelist implementation 
performing atomic CAS operations in a loop.  I wrote a benchmark to exercise 
this and added a bit of instrumentation to count how many times the CAS fails.
   
   ```
   benchmark name                       samples       iterations    estimated
                                        mean          low mean      high mean
                                        std dev       low std dev   high std dev
   -------------------------------------------------------------------------------
   global allocator                               100             1     2.17617 s
                                           18.6112 ms    17.8946 ms    19.1444 ms
                                           3.13175 ms    2.48935 ms    4.16424 ms

   thread allocator                               100             1    515.923 ms
                                           5.09905 ms    4.97022 ms    5.34322 ms
                                           872.118 us    562.584 us    1.52475 ms


   max global loop count: 226979
   max local loop count: 0
   ```
   
   The difference between these benchmarks is that the `global allocator` instance is marked `static` (as it is for the ATS ArenaBlock allocator), while the `thread allocator` instance is marked `thread_local`. Otherwise the code is identical; a sketch of the two cases follows below.
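
   To make the contention mechanism concrete, here is a minimal, self-contained sketch of what the two cases boil down to. `MiniFreelist`, `global_freelist` and `per_thread_freelist` are made-up names for illustration; this is not the actual ATS freelist code (which, among other things, has to handle ABA with versioned pointers). The point is the compare-exchange loop, the miss counter, and the storage class on the instance:

   ```cpp
   #include <atomic>
   #include <cstddef>

   // Illustration only, not the real freelist. alloc()/free() pop and push a
   // lock-free stack with a compare-exchange in a loop; cas_misses counts every
   // failed attempt, which is what the "loop count" numbers above correspond to.
   struct MiniFreelist {
     struct Node {
       Node *next;
     };

     std::atomic<Node *>      head{nullptr};
     std::atomic<std::size_t> cas_misses{0};

     void *
     alloc()
     {
       Node *old = head.load(std::memory_order_acquire);
       while (old != nullptr) {
         // Succeeds only if no other thread touched head since we loaded it.
         if (head.compare_exchange_strong(old, old->next, std::memory_order_acquire)) {
           return old; // reuse a previously freed block
         }
         cas_misses.fetch_add(1, std::memory_order_relaxed); // another thread won the race
       }
       return ::operator new(64); // freelist empty: fall back to the heap (fixed size, sketch only)
     }

     void
     free(void *p)
     {
       Node *n   = static_cast<Node *>(p);
       Node *old = head.load(std::memory_order_relaxed);
       while (true) {
         n->next = old;
         if (head.compare_exchange_strong(old, n, std::memory_order_release)) {
           return;
         }
         cas_misses.fetch_add(1, std::memory_order_relaxed);
       }
     }
   };

   // The only difference between the two benchmark cases is this storage class:
   static MiniFreelist       global_freelist;     // shared by every thread, CAS can lose races
   thread_local MiniFreelist per_thread_freelist; // one per thread, CAS never has a competitor
   ```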
   
   The benchmark shows that it's much faster with `thread_local`, but I also added a loop_counter that keeps track of how many times the CAS fails when allocating/freeing from the freelists. The max global loop count here is the worst number of CAS misses seen across all of the benchmark iterations. For the thread_local allocator this is 0, because there should be no contention and the CAS will never fail. For the global allocator, though, there are 20 threads vying to allocate 1000 items each from the same allocator instance. Doing the math, that should be `20 * 1000 * 2 = 40000` successful CAS operations, but there were `226979` misses on top of that, i.e. `40000 + 226979 = 266979` total CAS attempts, or about `7` tries to accomplish each successful CAS. In other words, the thread_local allocator is about 7x better in this scenario.
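
   For reference, the contended case is driven by something like the following (again just a sketch, not the real benchmark; it reuses the hypothetical `MiniFreelist`/`global_freelist` from the sketch above, so it assumes both live in the same file). Each thread does 1000 allocs and then 1000 frees; once the freelist is warm, that is one CAS per operation, which is where the `20 * 1000 * 2 = 40000` figure above comes from:

   ```cpp
   #include <cstdio>
   #include <thread>
   #include <vector>

   int
   main()
   {
     std::vector<std::thread> threads;
     for (int t = 0; t < 20; ++t) {        // 20 threads hammering the shared freelist
       threads.emplace_back([] {
         std::vector<void *> items(1000);
         for (auto &p : items) {
           p = global_freelist.alloc();    // roughly one CAS per alloc once the list is warm ...
         }
         for (auto &p : items) {
           global_freelist.free(p);        // ... and one CAS per free
         }
       });
     }
     for (auto &th : threads) {
       th.join();
     }
     // Every CAS attempt beyond the successful ones shows up here as a miss.
     std::printf("cas misses: %zu\n", global_freelist.cas_misses.load());
   }
   ```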
   
   We tested marking this `defaultSizeArenaBlock` allocator thread_local and 
there was a significant performance increase.
   
   I see a couple of discussion items around this:
   
     - With freelists off, this allocator uses system malloc (or a thread-local jemalloc heap) for allocation, so in some situations marking it `thread_local` ends up introducing two thread-local variables, which seems odd.
     - Normally this might be a use for a `ProxyAllocator`, but code organization-wise, `Arena` is in `tscore` while `ProxyAllocator` is in `iocore/eventsystem`, so the dependency would go the wrong direction.
   
   The most direct solution is just to mark this `defaultSizeArenaBlock` 
allocator `thread_local`, but I'd like to get some community feedback on this 
issue.
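
   For concreteness, the shape of that change would be roughly the following. `Allocator` here is only a stand-in type and `DEFAULT_BLOCK_SIZE` a made-up constant; the real declaration lives in the tscore Arena code and its constructor arguments may differ, so this only shows the storage-class change being discussed:

   ```cpp
   // Stand-ins for illustration only; not the real tscore declarations.
   struct Allocator {
     Allocator(const char *, unsigned) {}
   };
   constexpr unsigned DEFAULT_BLOCK_SIZE = 4096; // hypothetical value

   // Current shape: one process-wide freelist, shared (and CAS-contended) by all threads.
   // static Allocator defaultSizeArenaBlock("ArenaBlock", DEFAULT_BLOCK_SIZE);

   // Proposed shape: one freelist per thread, so the CAS loop never loses a race.
   thread_local Allocator defaultSizeArenaBlock("ArenaBlock", DEFAULT_BLOCK_SIZE);
   ```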

