On Fri, Sep 15, 2023 at 12:57 PM Paul E. McKenney <[email protected]> wrote:
> [...]
> > > > > On the other hand, I came up with a real fix [1] and I am currently
> > > > > testing it.
> > > > > This is to fix a live lock between RT push and CPU hotplug's
> > > > > select_fallback_rq()-induced push. I am not sure if the fix works
> > > > > but I have some faith based on what I'm seeing in traces. Fingers
> > > > > crossed. I also feel the real fix is needed to prevent these issues
> > > > > even if we're able to hide it by halving the total rcutorture boost
> > > > > threads.
> > > >
> > > > So that fixed it without any changes to RCU. Below is the updated
> > > > patch also for the archives. Though I'm rewriting it slightly
> > > > differently and testing that more. The main thing I am doing in the
> > > > new patch is I find that RT should not select !cpu_active() CPUs
> > > > since those have the scheduler turned off. Though checking for
> > > > cpu_dying() also works. I could not find any instance where
> > > > cpu_dying() != cpu_active() but there could be a tiny window where
> > > > that is true. Anyway, I'll make some noise with scheduler folks once
> > > > I have the new version of the patch tested.
> > > >
> > > > Also halving the number of RT boost threads makes it less likely to
> > > > occur but does not work. Not too surprising since the issue actually
> > > > may not be related to too many RT threads but rather a lockup between
> > > > hotplug and RT..
> > >
> > > Again, looks promising! When I get the non-RCU -rcu stuff moved to
> > > v6.6-rc1 and appropriately branched and tested, I will give it a go on
> > > the test setup here.
> >
> > Thanks a lot, and I have enclosed a simpler updated patch below which
> > also similarly shows very good results. This is the one I would like to
> > test more and send to scheduler folks. I'll send it out once I have it
> > tested more and also possibly after seeing your results (I am on
> > vacation next week so there's time).
>
> Much nicer! This is just on current mainline, correct?

Yes, correct. I also applied it cleanly to all stable kernels for my test
rigs. Only 5.10 had a little merge conflict but it was trivially fixed.

thanks,

 - Joel
