> On Dec 14, 2025, at 11:38 AM, Mathieu Desnoyers 
> <[email protected]> wrote:
> 
> On 2025-12-13 17:31, Paul E. McKenney wrote:
>> Hello!
>> I didn't do a good job of answering the "what about large numbers of
>> hazard pointers" at Boqun's and my hazard-pointers talk at Linux Plumbers
>> Conference yesterday, so please allow me to at least start on the path
>> towards fixing that problem.
>> Also, there were a couple of people participating whose email addresses
>> I don't know, so please feel free to CC them.
>> The trick is that in real workloads to date, although there might be
>> millions of hazard pointers, there will typically only be a few active
>> per CPU at a given time.  This of course suggests a per-CPU data structure
>> tracking the active ones.  Allocating a hazard pointer grabs an unused one
>> from this array, or, if all entries are in use, takes memory provided by
>> the caller and links it into an overflow list.  Either way, it returns a
>> pointer to the hazard pointer that is now visible to updaters.  When done,
>> the caller calls a function that marks the array-entry as unused or
>> removes the element from the list, as the case may be.  Because hazard
>> pointers can migrate among CPUs, full synchronization is required when
>> operating on the array and the overflow list.
>> And either way, the caller is responsible for allocating and freeing the
>> backup hazard-pointer structure that will be used in case of overflow.
>> And also either way, the updater need only deal with hazard pointers
>> that are currently in use.
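Just to make sure I am following, I think the shape you describe is roughly
the following (a userspace-flavored sketch; all names such as hp_slot,
hp_acquire() and hp_release() are made up, and a plain per-CPU mutex stands
in for whatever synchronization the real thing would use):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define HP_SLOTS_PER_CPU 8

struct hp_slot {
	void *ptr;			/* Protected pointer, NULL when unused. */
	struct hp_slot *next;		/* Overflow-list linkage. */
	bool on_overflow_list;
};

struct hp_cpu {
	pthread_mutex_t lock;		/* Full sync: HPs can migrate among CPUs. */
	struct hp_slot slots[HP_SLOTS_PER_CPU];
	struct hp_slot *overflow;	/* Caller-provided backup slots. */
};

/* Grab a slot on @cpu: per-CPU array if one is free, else link in @backup. */
static struct hp_slot *hp_acquire(struct hp_cpu *cpu, struct hp_slot *backup,
				  void *ptr)
{
	struct hp_slot *slot = NULL;
	int i;

	pthread_mutex_lock(&cpu->lock);
	for (i = 0; i < HP_SLOTS_PER_CPU; i++) {
		if (!cpu->slots[i].ptr) {
			slot = &cpu->slots[i];
			slot->on_overflow_list = false;
			break;
		}
	}
	if (!slot) {			/* Array full: use the caller's backup. */
		slot = backup;
		slot->on_overflow_list = true;
		slot->next = cpu->overflow;
		cpu->overflow = slot;
	}
	slot->ptr = ptr;		/* Now visible to updaters. */
	pthread_mutex_unlock(&cpu->lock);
	return slot;
}

/* Done: mark the array entry unused, or unlink the backup from the list. */
static void hp_release(struct hp_cpu *cpu, struct hp_slot *slot)
{
	struct hp_slot **p;

	pthread_mutex_lock(&cpu->lock);
	if (slot->on_overflow_list) {
		for (p = &cpu->overflow; *p != slot; p = &(*p)->next)
			;
		*p = slot->next;
	}
	slot->ptr = NULL;
	pthread_mutex_unlock(&cpu->lock);
}

The updater scan would then only need to walk the occupied array entries plus
the overflow list, as you say.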
> OK, so let me state how I see the fundamental problem you are trying
> to address, and detail a possible solution.
> 
> * Problem Statement
> 
> Assuming we go for an array of hazard pointer slots per CPU to cover
> the fast path (common case), we still need to handle the
> overflow scenario, where more hazard pointers are accessed
> concurrently for a given CPU than the array size, either due to
> preemption, nested interrupts, or simply nested calls.
> 
> * Possible Solution
> 
> Requiring the HP caller to allocate backup space is clearly something
> that would cover all scenarios. My worry is that tracking this backup
> space allocation may be cumbersome for the user, especially if this
> requires heap allocation.
> 
> Where the backup space can be allocated will likely depend on how long
> the HP will be accessed. My working hypothesis here (let me know if
> I'm wrong) is that most of those HP users will complete their access
> within the same stack frame where the HP was acquired. This is the
> primary use-case I would like to make sure is convenient.
> 
> For that use-case the users can simply allocate room on their
> stack frame for the backup HP slot. The requirement here is that they
> clear the HP slot before the end of the current stack frame.
> If there is enough room in the per-CPU array, they use that, else
> they add the backup slot from their stack into the backup slot
> list. When they are done, if they used a backup slot, they need
> to remove it from the list.
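Reusing the made-up hp_acquire()/hp_release() names from the sketch above, a
reader following that rule would look roughly like this (the usual hazard-
pointer publish-and-recheck of the protected pointer is elided for brevity):

void reader(struct hp_cpu *cpu, void **shared)
{
	struct hp_slot backup;		/* Backup slot lives on this stack frame. */
	struct hp_slot *slot;

	slot = hp_acquire(cpu, &backup, *shared);
	/* ... access the protected object ... */
	hp_release(cpu, slot);		/* Must run before this frame returns. */
}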
> 
> There could still be room for more advanced use-cases where the
> backup slots are allocated on the heap, but I suspect that it would
> make the API trickier to use and should be reserved for use-cases
> that really require it.
> 
> Thoughts ?

This sounds fine to me. 

We have had similar issues with kfree_rcu(): running out of preallocated
memory there meant we just triggered a slow path (falling back to
synchronize_rcu()), but for hazard pointers I am not sure what such a slow
path would be, since this problem appears to be on the reader side.

Perhaps we could also preallocate overflow nodes in the API itself, and tap
into those in case of overflow? The user would then not need to provide their
own storage for overflow purposes, I think. And perhaps this preallocated pool
of overflow nodes could also be common to all hazptr users. Ideally the API
user would not have to deal with overflow at all, and it would transparently
work behind the scenes.
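Something like the following is what I have in mind (again, all names and
sizes are made up, and it reuses struct hp_slot from the sketch earlier in
the thread):

#define HP_POOL_SIZE 1024

static struct hp_slot hp_overflow_pool[HP_POOL_SIZE];
static struct hp_slot *hp_pool_free;	/* Free list of unused overflow nodes. */
static pthread_mutex_t hp_pool_lock = PTHREAD_MUTEX_INITIALIZER;

static void hp_pool_put(struct hp_slot *slot)
{
	pthread_mutex_lock(&hp_pool_lock);
	slot->next = hp_pool_free;
	hp_pool_free = slot;
	pthread_mutex_unlock(&hp_pool_lock);
}

static struct hp_slot *hp_pool_get(void)
{
	struct hp_slot *slot;

	pthread_mutex_lock(&hp_pool_lock);
	slot = hp_pool_free;
	if (slot)
		hp_pool_free = slot->next;
	pthread_mutex_unlock(&hp_pool_lock);
	return slot;			/* NULL means the pool is exhausted. */
}

/* Done once at init: string the preallocated nodes into the free list. */
static void hp_pool_init(void)
{
	int i;

	for (i = 0; i < HP_POOL_SIZE; i++)
		hp_pool_put(&hp_overflow_pool[i]);
}

hp_acquire() could then fall back to hp_pool_get() instead of requiring a
caller-provided backup slot, and hp_release() would return the node with
hp_pool_put().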

I think we need not worry too much about the preallocated overflow nodes
themselves running out, because that is no different from the reserved memory
needed in atomic context, which requires some minimum amount anyway, right?
And we have a bounded number of CPUs and a bounded number of contexts, so the
number of *active* nodes required at any given time should also be bounded?
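For example (numbers made up), with 256 CPUs, four contexts per CPU (task,
softirq, hardirq, NMI), and at most 8 nested hazard pointers per context, the
worst case would be 256 * 4 * 8 = 8192 overflow nodes, which seems like a
modest fixed reservation.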

Thoughts?

Thanks.

> 
> Thanks,
> 
> Mathieu
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 
