On Fri, Sep 15, 2023 at 12:57 PM Paul E. McKenney <[email protected]> wrote:
>
[...]
> > > > > On the other hand, I came up with a real fix [1] and I am currently
> > > > > testing it.
> > > > > This is to fix a live lock between RT push and CPU hotplug's
> > > > > select_fallback_rq()-induced push. I am not sure if the fix works but
> > > > > I have some faith based on what I'm seeing in traces. Fingers crossed.
> > > > > I also feel the real fix is needed to prevent these issues even if
> > > > > we're able to hide it by halving the total rcutorture boost threads.
> > > >
> > > > So that fixed it without any changes to RCU. Below is the updated patch
> > > > also for the archives. Though I'm rewriting it slightly differently and
> > > > testing that more. The main thing I am doing in the new patch is I find
> > > > that RT should not select !cpu_active() CPUs since those have the
> > > > scheduler turned off. Though checking for cpu_dying() also works. I
> > > > could not find any instance where cpu_dying() != cpu_active() but there
> > > > could be a tiny window where that is true. Anyway, I'll make some noise
> > > > with scheduler folks once I have the new version of the patch tested.
> > > >
> > > > Also halving the number of RT boost threads makes it less likely to
> > > > occur but does not work. Not too surprising since the issue actually
> > > > may not be related to too many RT threads but rather a lockup between
> > > > hotplug and RT..
> > >
> > > Again, looks promising!  When I get the non-RCU -rcu stuff moved to
> > > v6.6-rc1 and appropriately branched and tested, I will give it a go on
> > > the test setup here.
> >
> > Thanks a lot, and I have enclosed a simpler updated patch below which also
> > similarly shows very good results. This is the one I would like to test
> > more and send to scheduler folks. I'll send it out once I have it tested
> > more and also possibly after seeing your results (I am on vacation next
> > week so there's time).
>
> Much nicer!  This is just on current mainline, correct?

Yes, correct. I also applied it cleanly to all stable kernels for my
test rigs. Only 5.10 had a little merge conflict but it was trivially
fixed.

thanks,

 - Joel
