On 2025-12-18 05:33, Joel Fernandes wrote:
Hi Mathieu,
Thanks for posting this.

On Dec 17, 2025, at 8:45 PM, Mathieu Desnoyers <[email protected]> 
wrote:

Hi,

Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
if you guys have time to try it out with your use-cases it would be
great!

This new version does the following:

- It has 8 preallocated hazard pointer slots per CPU (one cache line).
- The hazard pointer user allocates a hazard pointer context variable
  (typically on the stack), which contains the pointer to the slot *and*
  a backup slot.
- When all the per-CPU slots are in use, fall back to the backup slot.
  Chain the backup slot into per-CPU lists, each protected by a raw
  spinlock.
- The hazard pointer synchronize does a piecewise iteration on the
  per-CPU overflow slot lists, releasing the raw spinlock between
  each list item. It uses a 64-bit generation counter to detect
  concurrent list changes, and restarts the traversal on generation
  counter mismatch. (Both mechanisms are sketched after this list.)
- There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled,
  the hazard pointer acquire/release adds the hazard pointer context
  to, and then removes it from, a per-task linked list. On context
  switch, the scheduler migrates the per-CPU slots used by the task to
  the backup per-context slots, thus making sure the per-CPU slots are
  not held by preempted or blocked tasks.
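
To make this concrete, here is a minimal sketch of the data structures
and the acquire fast path. All identifiers (hazptr_ctx,
hazptr_cpu_slots, hazptr_alloc_slot, ...) are illustrative stand-ins
rather than the actual patch API; only ptr_eq() comes from patch 1 of
this series:

#define HAZPTR_SLOTS_PER_CPU	8	/* one cache line of pointers */

struct hazptr_slot {
	void *ptr;			/* protected pointer, NULL if free */
};

struct hazptr_cpu_slots {
	struct hazptr_slot slot[HAZPTR_SLOTS_PER_CPU];
} ____cacheline_aligned;

struct hazptr_ctx {
	struct hazptr_slot *slot;	/* per-CPU slot, or &backup */
	struct hazptr_slot backup;	/* used when per-CPU slots are full */
	struct list_head overflow_node;	/* chains backup into per-CPU list */
	struct list_head task_node;	/* per-task list (PREEMPT_HAZPTR) */
};

/* Publish a hazard pointer on *addr and re-check it for stability. */
static void *hazptr_acquire(struct hazptr_ctx *ctx, void * const *addr)
{
	void *p;

	/* Hypothetical: picks a free per-CPU slot, else &ctx->backup. */
	ctx->slot = hazptr_alloc_slot(ctx);
	for (;;) {
		p = READ_ONCE(*addr);
		WRITE_ONCE(ctx->slot->ptr, p);
		smp_mb();	/* order slot store before the re-load */
		if (ptr_eq(READ_ONCE(*addr), p))
			return p;	/* hazptr visible, pointer stable */
	}
}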
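
And a sketch of the piecewise overflow-list scan on the synchronize
side, again with hypothetical identifiers:

struct hazptr_overflow {
	raw_spinlock_t lock;
	u64 gen;			/* bumped on every add/remove */
	struct list_head head;		/* of chained hazptr_ctx */
};

/* Return true if no chained backup slot still protects @retired. */
static bool hazptr_scan_overflow(struct hazptr_overflow *of, void *retired)
{
	struct hazptr_ctx *ctx;
	u64 snap;

retry:
	raw_spin_lock(&of->lock);
	snap = of->gen;
	list_for_each_entry(ctx, &of->head, overflow_node) {
		if (READ_ONCE(ctx->backup.ptr) == retired) {
			raw_spin_unlock(&of->lock);
			return false;	/* still protected */
		}
		/* Piecewise: drop the lock between list items. */
		raw_spin_unlock(&of->lock);
		cond_resched();
		raw_spin_lock(&of->lock);
		/*
		 * If the generation changed, the list was modified while
		 * the lock was dropped and ctx may be stale: start over.
		 */
		if (of->gen != snap)
			goto restart;
	}
	raw_spin_unlock(&of->lock);
	return true;			/* no reference to @retired found */

restart:
	raw_spin_unlock(&of->lock);
	goto retry;
}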

This last point is another reason why I want the slots to be per task
instead of per CPU. It becomes very natural because a hazard pointer is
always associated with a task anyway, never with a CPU, at the use-case
level. By putting the slot in the task struct, we let these
requirements fall out naturally without requiring any locking or list
management. Did I miss something about the use cases?

I start from the hypothesis that scanning all tasks at synchronize time
is likely too slow for a general-purpose synchronization algorithm.

This is why we have RCU (general purpose) tracking hierarchical per-cpu
state, and specialized RCU flavors (for tracing and other use-cases)
using iteration on all tasks.

One of the highlights of hazard pointers over RCU is its ability to
know about quiescence of a pointer without waiting for a whole grace
period. Therefore I went down the route of scanning per-CPU state rather
than per-task.


I did some measurements of the task-scanning issue, and it is fast in
my testing (~1ms per 10000 tasks). Any input from you or anyone else on
the typical task-count distribution we are addressing? I also made a
rough prototype, and it appears to be simpler, with fewer lines of
code, because I do not need to handle preemption; it just happens
naturally.

It really depends on the access pattern here. If you want to replace
refcount decrement-and-test with hazard pointer synchronize, a delay of
1ms on each hazptr synchronize is a *lot*.

Of course, this could be batched with a worker thread, but then
you are back to using more memory due to delayed reclaim, and thus
closer to RCU.


First of all, we can have a per-task counter that tracks how many
hazard pointers are active. If it is zero, we can simply skip the task
instead of wasting cycles scanning all of its slots.
Further, we can have a retire list and amortize a single scan over all
the objects on that list, reusing the scan cost. This could also enable
asynchronous object retirement via a dedicated thread, perhaps built on
the Tasks RCU infrastructure. We could also turn the per-task counter
into a bitmap to potentially speed up scanning. (A sketch of this
scheme follows.)
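
A hypothetical sketch of how those two ideas could combine. None of
these fields or names (nr_active_hazptr, hazptr_slots[] in task_struct,
retired_obj) exist today; this is only an illustration of the idea:

#define HAZPTR_SLOTS_PER_TASK	8

struct retired_obj {
	struct list_head node;
	void *ptr;
	bool still_used;
	void (*free_fn)(void *);
};

static void hazptr_scan_tasks(struct list_head *retire_list)
{
	struct task_struct *g, *t;
	struct retired_obj *obj, *tmp;
	int i;

	rcu_read_lock();
	for_each_process_thread(g, t) {
		if (!READ_ONCE(t->nr_active_hazptr))
			continue;	/* fast skip: no hazptrs held */
		for (i = 0; i < HAZPTR_SLOTS_PER_TASK; i++) {
			void *p = READ_ONCE(t->hazptr_slots[i]);

			if (!p)
				continue;
			/* Mark retired objects that are still protected. */
			list_for_each_entry(obj, retire_list, node)
				if (obj->ptr == p)
					obj->still_used = true;
		}
	}
	rcu_read_unlock();
	/* One task scan amortized: reclaim everything left unmarked. */
	list_for_each_entry_safe(obj, tmp, retire_list, node) {
		if (!obj->still_used) {
			list_del(&obj->node);
			obj->free_fn(obj->ptr);
		}
	}
}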

I am okay with the concept of an overflow list, but if we keep the overflow 
list at the per-task level instead of the per-CPU level, it is highly unlikely 
IMO that such an overflow list will be used unless more than, say, eight hazard 
pointers per task are active at any given time.

The PREEMPT_HAZPTR scheme achieves this as well, with a per-CPU scan
rather than a per-task scan.

So its lock contention would be rarer than with, say, a per-CPU
overflow list. I would say that contention would be incredibly rare,
because hazard pointers are typically used by multiple tasks, each of
which has its own unique set of slots, whereas a per-CPU overflow
approach has a higher chance of lock contention, especially when the
number of CPUs is low.

Contention on the per-CPU spinlocks would only happen if threads are
migrated between the chaining of their backup slot (either when all 8
hazptrs are in use by the thread, or on context switch) and its
release. Do you expect this type of migration to happen often enough
to increase contention?


Other than the task-scanning performance issue, what am I missing?

Task-scan performance is the key issue. Also, you have to set a limit
on the number of per-task slots. How would you handle the scenario
where a user needs more slots than the limit?


Another nice benefit of using per-task hazard pointers is that we can
also allow sleeping in hazard pointer sections, because we will be
scanning sleeping tasks as well.

The PREEMPT_HAZPTR scheme takes care of this as well. Sleeping with a
hazptr held will trigger a context switch, and the scheduler will
simply move the hazard pointer from the per-CPU slot to the backup
slot. End of story. (A sketch of this migration follows.)
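
For reference, a sketch of what that context-switch migration could
look like. The hook and field names are illustrative; the actual patch
may differ:

/*
 * Move every per-CPU slot held by the outgoing task to that context's
 * backup slot, so preempted or blocked tasks never pin per-CPU slots.
 */
static void hazptr_context_switch(struct task_struct *prev)
{
	struct hazptr_ctx *ctx;

	/* prev->hazptr_ctx_list is maintained by acquire/release. */
	list_for_each_entry(ctx, &prev->hazptr_ctx_list, task_node) {
		if (ctx->slot == &ctx->backup)
			continue;	/* already on the backup slot */
		/*
		 * Publish the backup copy and chain it before freeing
		 * the per-CPU slot, so the pointer stays protected and
		 * scannable throughout.
		 */
		WRITE_ONCE(ctx->backup.ptr, READ_ONCE(ctx->slot->ptr));
		hazptr_chain_overflow(ctx);	/* hypothetical helper */
		smp_store_release(&ctx->slot->ptr, NULL);
		ctx->slot = &ctx->backup;
	}
}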


By contrast, the other approaches I have seen with per-CPU hazard
pointers forbid sleeping, since after sleeping a task is no longer
associated with its CPU. The other approaches also have a higher
likelihood of locking due to running out of slots.

Except for PREEMPT_HAZPTR. :)


Of course, I am still missing a use case, but I suspect we can find a
per-CPU refcount use case that benefits from this. I am researching
use cases as I get time. I think my next task is to find a solid use
case for this before doing further development of a solution.

Let me know what you find, I'm very interested!


By the way, feedback on the scanning patch:

Can you consider using a per-CPU counter to track the number of active slots 
per CPU? That way you can ignore CPU slots for CPUs that are not using hazard 
pointers. Another idea is to skip idle CPUs as well.

I had this in a userspace prototype: a counter of used per-CPU slots.
But I finally decided against it, because it requires extra
instructions on the fast path, *and* scanning 8 pointers within a
single cache line on synchronize is really cheap (sketched below). So
I'd need benchmarks to justify adding the counter back.
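
For the record, the synchronize-side check of those 8 slots boils down
to something like this (sketch with hypothetical names, reusing the
structures sketched earlier): a single cache line of loads per CPU.

DEFINE_PER_CPU(struct hazptr_cpu_slots, hazptr_slots);

/* Does any of @cpu's 8 preallocated slots protect @retired? */
static bool hazptr_cpu_protects(int cpu, void *retired)
{
	struct hazptr_cpu_slots *cs = per_cpu_ptr(&hazptr_slots, cpu);
	int i;

	for (i = 0; i < HAZPTR_SLOTS_PER_CPU; i++)
		if (READ_ONCE(cs->slot[i].ptr) == retired)
			return true;
	return false;
}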


Have you also considered any asynchronous use case where maintaining a retired 
list would assist in RCU-style deferred reclaim of hazard-pointer objects?

This would be a logical next step indeed. I'll have to think
about this some more.
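
Purely as a speculative sketch of what such an asynchronous retire
path might look like; none of these identifiers exist in the series:

struct hazptr_retired {
	struct llist_node node;
	void *ptr;
	void (*free_fn)(void *);
};

static void hazptr_reclaim_workfn(struct work_struct *work);
static LLIST_HEAD(hazptr_retire_list);
static DECLARE_WORK(hazptr_reclaim_work, hazptr_reclaim_workfn);

/* Queue an object for deferred reclaim and kick the worker. */
static void hazptr_retire(struct hazptr_retired *r)
{
	llist_add(&r->node, &hazptr_retire_list);
	schedule_work(&hazptr_reclaim_work);
}

static void hazptr_reclaim_workfn(struct work_struct *work)
{
	struct llist_node *batch = llist_del_all(&hazptr_retire_list);
	struct hazptr_retired *r, *tmp;

	/*
	 * A hypothetical hazptr_is_protected() would consult a snapshot
	 * of all hazptr slots taken once per batch, so the scan cost is
	 * paid once for the whole batch rather than once per object.
	 */
	llist_for_each_entry_safe(r, tmp, batch, node) {
		if (hazptr_is_protected(r->ptr)) {
			/* Still protected: re-queue for a later pass. */
			llist_add(&r->node, &hazptr_retire_list);
			schedule_work(&hazptr_reclaim_work);
		} else {
			r->free_fn(r->ptr);
		}
	}
}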

Thanks,

Mathieu


thanks,

  - Joel


It is based on v6.18.1.

Review is very welcome,

Thanks,

Mathieu

Cc: Nicholas Piggin <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Zqiang <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: Mateusz Guzik <[email protected]>
Cc: Jonas Oberhauser <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Mathieu Desnoyers (4):
  compiler.h: Introduce ptr_eq() to preserve address dependency
  Documentation: RCU: Refer to ptr_eq()
  hazptr: Implement Hazard Pointers
  hazptr: Migrate per-CPU slots to backup slot on context switch

Documentation/RCU/rcu_dereference.rst |  38 +++-
include/linux/compiler.h              |  63 +++++++
include/linux/hazptr.h                | 241 ++++++++++++++++++++++++++
include/linux/sched.h                 |   4 +
init/init_task.c                      |   3 +
init/main.c                           |   2 +
kernel/Kconfig.preempt                |  10 ++
kernel/Makefile                       |   2 +-
kernel/fork.c                         |   3 +
kernel/hazptr.c                       | 150 ++++++++++++++++
kernel/sched/core.c                   |   2 +
11 files changed, 512 insertions(+), 6 deletions(-)
create mode 100644 include/linux/hazptr.h
create mode 100644 kernel/hazptr.c

--
2.39.5



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
