JosiahWI opened a new pull request, #13170:
URL: https://github.com/apache/trafficserver/pull/13170

   The PR title is fairly self-explanatory, but the design choices here deserve 
explicit mention.
   
   * The head_p type is no longer a union with a {pointer, version} field
   
   This has been done to eliminate type punning, which was done all over the 
implementation, and is UB. The pointer and version are now always set through 
the macros `FREELIST_POINTER`, `FREELIST_VERSION`, and 
`SET_FREELIST_POINTER_VERSION`, which use a separate {pointer, version} struct 
type and `memcpy` on platforms where this is appropriate (see preprocessor defs 
for the list).
   
   * The head_p type has been changed into a type alias of the data type
   
   This is necessary so that the head of the list will be an atomic integer 
type instead of an atomic class type to be sure it can use 128bit atomic 
hardware instructions on platforms that support them.
   
   * Freelist alignment is now adjusted to be satisfy `head_p` alignment 
requirements
   
   This is actually a bug in master. There is no issue created for it. It 
happens to work to allocate `head_p` objects that are underaligned on our 
supported platforms, but it is UB and could very realistically fail... to drive 
this point home, the included unit tests segfault with the original 
implementation for this exact reason (they pass if the alignment is adjusted).
   
   * The freelist and atomiclist pop operations (called freelist_new for the 
freelist) are now locked to provide mutual exclusion
   
   This one is annoying and might need to be reverted from this PR and 
considered separately. This particular change is necessary to fix #11640 - 
there is a minor data race in master in that the second pointer from the list 
head can be overwritten by an allocator's placement new before it is read 
without synchronization in `freelist_new` by another thread, which is going to 
subsequently find out the list head is stale and retry. Thus, the garbage 
pointer is not dereferenced, but this is still UB.
   
   I have been thinking about approaches here. One approach is to add an atomic 
flag to the list head that is set by any thread popping from the list. A thread 
that has successfully popped can spin on that flag to wait for the completion 
of any other threads still reading the memory it is about to return. My 
intuition is that this will be better, but I don't know without benchmarking, 
and it's a lot more complex than the lock.
   
   Fixes #5398 
   Fixes #11640
   
   ## Previous Work
   
   See #7382. This PR is only a step in the direction of #7382; it retains a 
lot of the old code structure along with most of its design flaws. If this 
change is accepted, it should thereafter be possible to apply other design 
improvements from #7382, such as the fleshed out versioned pointer type, with 
greater confidence.
   
   ## A Few Comments About Assertions
   
   This PR adds a hoard of assertions that check alignment requirements. Most 
of them are debug assertions - the alignment check on the pointer passed to 
`freelist_push` is a release assert for now, because it would almost certainly 
indicate a major issue if it triggered. According to the comment from 
@bryancall, this assertion was in fact failing before (it was previously a 
debug assert, and he commented it out). I am hopeful that that issue is now 
resolved.
   
   I have commented out the assertions in atomiclist, because the alignment 
requirements for atomiclist are established elsewhere in ATS - and established 
incorrectly. The head_p objects are misaligned all over the place. That is bad. 
Unfortunately, it's not trivial to fix, because the head_p objects have to be 
aligned at an offset from the base pointer, which *also* has to be aligned, but 
for a different object type (e.g. `Event`).
   
   ## Performance Implications
   
   I ran the following benchmarks in WSL running on an AMD Ryzen 5 5500U 
processor.
   
   ### Freelist - Single Threaded Release Performance - Before
   ```
   benchmark name                       samples       iterations    est run time
                                        mean          low mean      high mean
                                        std dev       low std dev   high std dev
   
-------------------------------------------------------------------------------
   Single threaded alloc                          100           587      2.348 
ms
                                           36.2374 ns    35.5824 ns    37.6799 
ns
                                           4.69932 ns    2.48233 ns    9.03906 
ns
   
   Single threaded free                           100          1412     2.2592 
ms
                                           15.2373 ns    14.9109 ns    16.1355 
ns
                                           2.52519 ns    1.00718 ns    5.32402 
ns
   ```
   
   ### Freelist - Single Threaded Release Performance - After
   
   Notice that the allocation takes a performance hit because of the added lock.
   
   ```
   benchmark name                       samples       iterations    est run time
                                        mean          low mean      high mean
                                        std dev       low std dev   high std dev
   
-------------------------------------------------------------------------------
   Single threaded alloc                          100           600       2.34 
ms
                                           40.3439 ns    38.0766 ns    46.6134 
ns
                                           17.5734 ns    7.42366 ns    35.7456 
ns
   
   Single threaded free                           100           346     2.3528 
ms
                                           12.0705 ns      11.37 ns    15.4749 
ns
                                           6.78033 ns   0.198969 ns    16.1671 
ns
   ```
   
   ### Freelist - Single Threaded Release Performance - After Without Lock
   
   This represents the potential performance of allocation without the mutex 
guarding the allocation routine. This case represents similar behavior to the 
original code, which did not have the lock, either.
   
   ```
   benchmark name                       samples       iterations    est run time
                                        mean          low mean      high mean
                                        std dev       low std dev   high std dev
   
-------------------------------------------------------------------------------
   Single threaded alloc                          100           672      2.352 
ms
                                           30.9758 ns    29.5555 ns    36.3515 
ns
                                           12.3705 ns    2.25337 ns    28.5114 
ns
   
   Single threaded free                           100           489     2.3472 
ms
                                           13.0952 ns    11.6199 ns    16.8676 
ns
                                           10.5817 ns     2.0172 ns     19.303 
ns
   ```
   
   ### Multithreaded Performance (Contention)
   
   I don't know the best way to test this. I would appreciate input from 
@bryancall or anyone who knows how to benchmark overall ATS performance. I'm 
concerned that the added lock will have unacceptable performance impacts, but I 
don't know how I should confirm this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to