Re: [Kgdb-bugreport] [PATCH 1/2] prevent Slave CPUs hang on exit

Jason Wessel Mon, 24 Mar 2008 06:50:07 -0700

Amit S. Kale wrote:
> On Monday 17 March 2008 11:55:28 pm Konstantin Baydarov wrote:
>   
>> Problem:
>> Sometimes(after remote gdb was connected) x86 SMP kernel(with KGDB and NMI
>> watchdog enabled) hangs when kernel modules are automatically loaded.
>>     
>
> Konstantin,
>
> The description below doesn't mention how module loading comes into picture.
>
>


I have also observed this problem, as well as hangs in the stress test
where each cpu executes the same system call over and over (via a user
space program) while a kernel breakpoint is set on that call.

Specifically, the problem Konstantin is referring to occurs when you
attach a debugger, continue, and then a number of kernel module loads
are executed as part of the whole user space startup or initrd startup.
A kernel module aware debugger will stop, load symbols and
automatically continue on each kernel module load event.

>> Root Cause:
>>   Slave CPU hangs in kgdb_wait() when master CPU leaves KGDB, causing the
>> whole system to hang.
>>   If watchdog NMI occurs when Slave CPU have already exited kgdb_wait() and
>> Master CPU haven't unset debugger_active, 
>>     
>
> An NMI watchdog can't occur until kgdb_wait function returns, control goes to 
> kgdb_nmihook, which returns control to kgdb_notify, which in turn returns 
> through the notify chain call returns, do_nmi, and then to entry.S, where an 
> iret is executed. (NMI is disabled until iret is executed).
>   

The issue here is that a window opens as soon as the slave cpu is
unlocked with kgdb_spin_unlock(&slavecpulocks[i]).  Inside that window
the slave cpu resumes running and starts taking NMI events again, at a
rate that depends on how often the APIC timer is set to fire.  Removing
the msleep() narrows the window but does not close it, so a processor
can still re-enter kgdb_wait() before debugger_active is zeroed out.
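
To make the window concrete, here is a minimal user-space model of the
two master exit steps and a watchdog NMI on one slave.  It is only a
sketch and not the kgdb source: struct kgdb_model and every function
name in it are made up for illustration, and the assumption is that an
NMI sends a running slave back into the wait loop whenever
debugger_active is still nonzero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state discussed above, not kernel code. */
struct kgdb_model {
    int  debugger_active;  /* nonzero while the master cpu owns the debugger */
    bool slave_in_wait;    /* true while the slave sits in the wait loop */
};

static void model_init(struct kgdb_model *m)
{
    m->debugger_active = 1;    /* master is inside the debugger */
    m->slave_in_wait = true;   /* slave is parked, waiting for the master */
}

/* Master exit, step 1: unlock the slave so it leaves the wait loop. */
static void master_unlock_slave(struct kgdb_model *m)
{
    m->slave_in_wait = false;
}

/* Master exit, step 2: mark the debugger as no longer active. */
static void master_clear_active(struct kgdb_model *m)
{
    m->debugger_active = 0;
}

/*
 * Watchdog NMI on the slave.  Assumption: a running slave goes back
 * into the wait loop if the debugger still looks active.
 * Returns true when the slave re-entered the wait loop.
 */
static bool slave_watchdog_nmi(struct kgdb_model *m)
{
    if (!m->slave_in_wait && m->debugger_active) {
        m->slave_in_wait = true;
        return true;
    }
    return false;
}

/* The NMI lands between the two master steps: the slave re-enters. */
static bool nmi_inside_window(void)
{
    struct kgdb_model m;
    model_init(&m);
    master_unlock_slave(&m);
    bool reentered = slave_watchdog_nmi(&m);
    master_clear_active(&m);
    return reentered;
}

/* The NMI lands after both master steps: the slave keeps running. */
static bool nmi_after_window(void)
{
    struct kgdb_model m;
    model_init(&m);
    master_unlock_slave(&m);
    master_clear_active(&m);
    return slave_watchdog_nmi(&m);
}
```

In this model nmi_inside_window() reports a re-entry and
nmi_after_window() does not.  Any delay between the two master steps
only widens the first case; it does not create it.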


> Compare this to what the master CPU does: master CPU just has to unlock
> all slave locks and then immediately set debugger_active to 0. (The only
> exception to this is when debugger_step is set. More about this below).
>
> The latter can be executed much quicker than the former and while in
> theory the former can execute before the latter, it can't happen in a
> real-life situation.
>
> There is a delay of mdelay(2), when debugger_step is set and master debugger 
> lets other CPUs run (kgdb_contthread == 0), which could potentially trigger 
> this race. Could you please confirm whether any of the following two solve 
> your problem?
>
> 1. Comment out these lines from kgdb_handle_exception
> +     if (debugger_step)
> +             mdelay(2);
>
>
> 2. Execute this command from gdb as soon as it is started:
>    "set scheduler-locking on"
>
>
>   
>> How Solved:
>>   New atomic variable debugger_exiting was added. It's set when Master CPU
>> starts waiting Slave CPUs, and is reset after debugger_active is set to
>> zero. Variable debugger_exiting is checked in kgdb_notify() and
>> kgdb_nmihook wouldn't be called until debugger_exiting equal zero. So
>> debugger_exiting guaranties that Slave CPU won't reenter kgdb_wait() until
>> Master CPU completely leaves KGDB. Patch against kernel 2.6.24.3.
>>     
>
> I would strongly recommend adding any more locking variables. As it is we've 
> sufficient difficulty analyzing races :-).
> -Amit
>   

I assume you meant to recommend against adding any more locking variables.

Something that serves the same purpose as this particular variable is in
fact needed.  I created a patch to fix the same problem about 6 months
ago (the new variable was called kgdb_resuming in my case), but that
patch is even uglier in that I also added controls to change the
behavior of single stepping so as to allow another processor to hit a
breakpoint while single stepping a different processor.
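
As a rough illustration of what such a variable does, here is a
user-space sketch using C11 atomics.  It is neither Konstantin's patch
nor my 2007 patch: the function names are hypothetical, and the
ordering follows the description quoted above (the guard is set before
the slaves are released and cleared only after debugger_active is
zeroed).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Models the flag from this thread; nonzero while a master owns kgdb. */
static atomic_int debugger_active;
/* Guard flag; nonzero while the master is on its way out of kgdb. */
static atomic_int kgdb_resuming;

/* Hypothetical: master cpu enters the debugger. */
static void master_enter(void)
{
    atomic_store(&debugger_active, 1);
}

/* Hypothetical: called before the master unlocks any slave. */
static void master_begin_exit(void)
{
    atomic_store(&kgdb_resuming, 1);
}

/* Hypothetical: called once every slave has been released. */
static void master_finish_exit(void)
{
    atomic_store(&debugger_active, 0);
    atomic_store(&kgdb_resuming, 0);  /* cleared last, closing the window */
}

/*
 * Check a slave would make in its NMI path before going into the
 * wait loop.  While the guard is set the slave stays out, even though
 * debugger_active has not been zeroed yet.
 */
static bool slave_should_enter_wait(void)
{
    if (atomic_load(&kgdb_resuming))
        return false;
    return atomic_load(&debugger_active) != 0;
}
```

With the guard in place, an NMI that arrives between
master_begin_exit() and master_finish_exit() no longer sends the slave
back into the wait loop, which is the case the unguarded check gets
wrong.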

In the last 1.5 months the kgdb core was significantly changed, and a
kgdb test suite was added to test for some of these architecture
specific issues.  It appears that this case is now hit much less often
because one of the commits removed the msleep(), which definitely
reduces the window of opportunity.

In short, this is definitely a real problem and with the msleep() the
window is large enough that it gets hit reasonably easily.  I plan to
split the 2007 single_step / kgdb resuming patch to cover just the
resuming case and I will test it on the new kgdb core.

As a side point, I back ported the new kgdb core to 2.6.24, which I can
make available in cvs/git.  I am wondering if Sergei and/or Konstantin
would be willing to help get it working with some of these oddball
kgdb specific RS232 drivers?  It is not possible for me to truly test
them because I don't have any of the boards.

Jason.


