On 6/4/2025 8:26 AM, Paul E. McKenney wrote:
>>>>>>> Or just don't send subsequent self-IPIs if we just sent one for the
>>>>>>> rdp. Chances are, if we did not get the scheduler's attention during
>>>>>>> the first one, we may not in subsequent ones I think. Plus we do send
>>>>>>> other IPIs already if the grace period was over extended (from the FQS
>>>>>>> loop), maybe we can tweak that?
>>>>>> Thanks a lot for your reply. I think it's hard for me to fix this issue
>>>>>> as above without introducing new bugs. I barely understand the RCU code.
>>>>>> But I'm very glad to help test if you have any code modifications that
>>>>>> need testing. I have the VM and the syzkaller benchmark which can
>>>>>> reproduce the problem.
>>>>> Sure, I understand. This is already incredibly valuable, so thank you
>>>>> again. I will request your testing help soon. I also have a test module
>>>>> now which can sort of reproduce this. Will keep you posted!
>>>>
>>>> Oh sorry I meant to ask - could you provide the full kernel log and also is
>>>> there a standalone reproducer syzkaller binary one can run to reproduce it
>>>> in a VM?
>>
>> Sorry, I communicated with the team who maintains the syzkaller tools. They
>> said I can't send the syzkaller binary out of the company. Sorry, but I can
>> help to reproduce. It's not complicated and not time consuming.
>>
>> I found the original log, which used kernel v6.6. But it's not complete.
>> Then I reproduced the problem using the latest kernel.
>> Both logs are attached.
>>
> Looking at both the v6.6 version and Joel's fix, I am forced to conclude
> that this bug has been there for a very long time. Thank you for your
> testing efforts and Joel for the fix!
Thanks. I am still working on polishing the fix Xiongfeng tested. I hope to have
it out next week for review. As we discussed, I will split the context-tracking
API into a separate patch and will also add a separate documentation
comment-patch on why we need the irq_work.
thanks,
- Joel