On Wed, 2026-05-13 at 13:32 +0800, Wen Yang wrote:
> Thanks for both messages.  Two patches are ready; let me address
> your follow-up concerns before sending.
> 
>    1. "all monitors reusing slots would suffer from it"
> 
>       Only RV_MON_PER_TASK uses the rv_get/put_task_monitor_slot()
>       pool.  RV_MON_GLOBAL and RV_MON_PER_CPU each have dedicated
>       storage (a single static variable and a per-cpu variable) and
>       never share slots across monitor types.  The race is exclusive
>       to PER_TASK, so fixing that variant's da_monitor_destroy() is
>       the correct scope.
> 
>    2. "LTL monitors don't even have monitoring"
> 
>       tracepoint_synchronize_unregister() does not rely on the
>       monitoring flag at all.  It is a system-wide barrier — it
>       calls synchronize_rcu_tasks_trace() followed by
>       synchronize_srcu(&tracepoint_srcu) — draining every in-flight
>       tracepoint handler on every CPU regardless of which monitor
>       dispatched it.  LTL handlers are covered without any special
>       treatment.
> 
> The slot-ordering issue (patch 1) affects all per-task DA monitors,
> not only HA ones — "independent on HA" — because
> RV_PER_TASK_MONITOR_INIT equals CONFIG_RV_PER_TASK_MONITORS (one
> past the end of rv[]), so da_monitor_reset_all() overwrites whatever
> follows rv[] in task_struct whenever any per-task monitor is
> disabled.

Exactly, and since whatever follows .rv is randomised on a task_struct, this can
get quite nasty.

I included my version of the fix in the series in [1], but feel free to send
yours, you got there first ;)

> 
> Also corrected "wwnr probe handler" to "stall probe handler" in
> patch 2 per your annotation.
> 

While tracepoint_synchronize_unregister() does fix the race, I still see a timed
bomb in the way we do ha_monitor_reset_env().

Since we reused the same slots for per-task monitors (not for the others, you're
right I was brainfarting) we essentially don't know what happened before we do
da_monitor_init(), the same slot could have been used by an LTL monitor which
cannot even reliably clear the byte used by the monitoring flag.

Now, we either mandate all monitors to memset the entire slot (union
rv_task_monitor) or we don't assume anything about the slot's state during
initialisation. Any middle ground could reveal pesky bugs as soon as we refactor
the structs.

The latter idea is what I did in [1]. I believe that would make the
synchronisation superfluous.

What do you think?

Thanks,
Gabriele

[1] - https://lore.kernel.org/lkml/[email protected]

> Please let me know if the above reasoning addresses your concerns.
> 
> 
> --
> Best wishes,
> Wen
> 
> > > 
> > > >   include/rv/da_monitor.h | 18 ++++++++++++++++--
> > > >   1 file changed, 16 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/rv/da_monitor.h b/include/rv/da_monitor.h
> > > > index 00ded3d5ab3f..d04bb3229c75 100644
> > > > --- a/include/rv/da_monitor.h
> > > > +++ b/include/rv/da_monitor.h
> > > > @@ -304,6 +304,20 @@ static int da_monitor_init(void)
> > > >   
> > > >   /*
> > > >    * da_monitor_destroy - return the allocated slot
> > > > + *
> > > > + * Call tracepoint_synchronize_unregister() before reset_all() to close
> > > > + * the race where an in-flight non-HA probe handler sets monitoring=1
> > > > + * (without calling timer_setup()) after da_monitor_reset_all() has
> > > > + * already cleared the slot but before the caller's own sync completes.
> > > > + * Without this barrier, an HA_TIMER_WHEEL monitor that later acquires
> > > > + * the same slot would call timer_delete() on a never-initialised
> > > > + * timer_list, triggering ODEBUG warnings.
> > > > + *
> > > > + * Note: tracepoint_synchronize_unregister() is a system-wide barrier
> > > > + * that waits for all CPUs to finish any in-flight tracepoint handlers.
> > > > + * The caller's own __rv_disable_monitor() issues a second sync after
> > > > + * returning from disable(); that redundant call is harmless on the
> > > > + * infrequent admin (enable/disable) path.
> > > >    */
> > > >   static inline void da_monitor_destroy(void)
> > > >   {
> > > > @@ -311,10 +325,10 @@ static inline void da_monitor_destroy(void)
> > > >                 WARN_ONCE(1, "Disabling a disabled monitor: "
> > > > __stringify(MONITOR_NAME));
> > > >                 return;
> > > >         }
> > > > +       tracepoint_synchronize_unregister();
> > > > +       da_monitor_reset_all();
> > > >         rv_put_task_monitor_slot(task_mon_slot);
> > > >         task_mon_slot = RV_PER_TASK_MONITOR_INIT;
> > > > -
> > > > -       da_monitor_reset_all();
> > > >   }
> > > >   
> > > >   #elif RV_MON_TYPE == RV_MON_PER_OBJ
> > 


Reply via email to