Re: [Kgdb-bugreport] [PATCH 1/2] prevent Slave CPUs hang on exit

Jason Wessel Tue, 25 Mar 2008 10:59:28 -0700

Amit S. Kale wrote:
> On Monday 24 March 2008 07:18:51 pm Jason Wessel wrote:
>> Amit S. Kale wrote:
>>> On Monday 17 March 2008 11:55:28 pm Konstantin Baydarov wrote:
>>>> Problem:
>>>> Sometimes(after remote gdb was connected) x86 SMP kernel(with KGDB and
>>>> NMI watchdog enabled) hangs when kernel modules are automatically
>>>> loaded.
>>> Konstantin,
>>>
>>> The description below doesn't mention how module loading comes into
>>> picture.
>> I too have observed this problem as well as hangs in the stress test
>> where you ask each cpu to execute the same system call over and over
>> (via a user space program) and you set a kernel breakpoint there.
>>
>> Specifically the problem Konstantin is referring to is when you attach a
>> debugger, continue and then a number of kernel module loads are executed
>> as a part of the whole user space startup or initrd startup.  The a
>> kernel module aware debugger will stop, load symbols and automatically
>> continue on each kernel module load event.
> 
> I was wondering how module load is different from a regular 
> breakpoint-singlestep.
>



There are quite a few back to back breakpoints.  The reality is that
is not any different to stress testing a breakpoint in a system call
on an SMP system.


>> The issue here is that there is a window where the slavecpu is unlocked
>> with kgdb_spin_unlock(&slavecpulocks[i]).  After that there is a window
>> where the slave cpu will spin up again and start taking NMI events based
>> on how often the APIC timer is set to fire.  Even if you remove the
>> msleep() it doesn't remove the window entirely and you can still have a
>> processor re-enter the kgdb_wait() before debugger_active is zeroed out.
> 
> Agreed. There is definitely a window where this can happen. However given 
> removal of msleep, I have doubts about how it'll be hit on x86 arch.
> 


Without the msleep it seemed only happen on the 4-way SMP with the
10000 breakpoint test.  It is just a matter of timing to get the NMI
to trigger at the right point to hit the window on a CPU that was just
released to run, while the others are getting activated.


> Yes. I am generally against adding new locking variables since we already 
> have 
> got enough of them. We haven't defined a good hierarchy for them (resulting 
> in spinlock lock detection false alarms).
> 

The latest version of kgdb (called kgdb-light on kernel.org) does not
have as many locks in it.  The code was refactored, although it still
had this particular problem with the exit kgdb_handle_exception
problem.

>> Something that serves the same purpose as this particular variable is in
>> fact needed.  I created patch to fix the same problem ~ 6 months ago
>> (the new variable was called kgdb_resuming in my case), but the patch is
>> even uglier in that I also added controls to change the behavior of the
>> single stepping so as to allow another processor to hit a breakpoint
>> while single stepping a different processor.
>>
>> In the last 1.5 months the kgdb core was significantly changed, as well
>> as a kgdb test suite was added to test for some of these architecture
>> specific issues.  It appears that the test case cannot be hit very often
>> because one of the commits removed the msleep(), which definitely
>> reduces the window of opportunity.
>>
>> In short, this is definitely a real problem and with the msleep() the
>> window is large enough that it gets hit reasonably easily.  I plan to
>> split the 2007 single_step / kgdb resuming patch to cover just the
>> resuming case and I will test it on the new kgdb core.
> 
> It appears to solve the problem described above.
> 
> Does it make sure that we don't miss any NMI watchdog events on master 
> processor and cause them to be routed to default panic handler? [Haven't 
> thought through that for not having a quickly reachable current kgdb source]
> 
> -Amit
> 

The patch that Konstantin created would still correctly handle the NMI
watch dog because the arch specific handlers were changed some time
ago to deal with the various NMI entry types.  In fact, it was
reasonably straight forward to change the locks around to eliminate
the race condition.  This is a patch for the 2.6.24 branch that solves
the same problem a different way without adding any addtional atomic
variables to control the code flow.

I have a different version of the patch for the kgdb-light branch
because the variable names are different.

Jason.

---CLIP HERE---

Fix the problem of protecting the kgdb handle_exception exit
which had an NMI race condition, while trying to restore
normal system operation.

There was a small window after the master processor sets
procindebug to zero but before it has set debugger_active
to zero where a non-master processor in an SMP system
could receive an NMI and re-enter the kgdb_wait()
loop.

As long as the master processor sets the procindebug before
sending the cpu roundup the procindebug variable can also
be used to guard against the race condition.

The kgdb_wait() function no longer needs to check
debugger_active because it is done in the arch specific code
and handled along with the nmi traps at the low level.
This also allows kgdb_wait() to exit correctly if it was
entered for some unknown reason due to a spurious NMI that
could not be handled by the arch specific code.

Signed-off-by: Jason Wessel <[EMAIL PROTECTED]>

---
 kernel/kgdb.c |   27 ++++++---------------------
 1 file changed, 6 insertions(+), 21 deletions(-)

--- a/kernel/kgdb.c
+++ b/kernel/kgdb.c
@@ -636,22 +636,6 @@ static void kgdb_wait(struct pt_regs *re
        kgdb_info[processor].task = current;
        atomic_set(&procindebug[processor], 1);
 
-       /* The master processor must be active to enter here, but this is
-        * gaurd in case the master processor had not been selected if
-        * this was an entry via nmi.
-        */
-       while (!atomic_read(&debugger_active))
-               cpu_relax();
-
-       /* Wait till master processor goes completely into the debugger.
-        */
-       while (!atomic_read(&procindebug[atomic_read(&debugger_active) - 1])) {
-               int i = 10;     /* an arbitrary number */
-
-               while (--i)
-                       cpu_relax();
-       }
-
        /* Wait till master processor is done with debugging */
        spin_lock(&slavecpulocks[processor]);
 
@@ -1042,16 +1026,16 @@ int kgdb_handle_exception(int ex_vector,
                for (i = 0; i < NR_CPUS; i++)
                        spin_lock(&slavecpulocks[i]);
 
+       /* spin_lock code is good enough as a barrier so we don't
+        * need one here */
+       atomic_set(&procindebug[processor], 1);
+
 #ifdef CONFIG_SMP
        /* Make sure we get the other CPUs */
        if (!debugger_step || !kgdb_contthread)
                kgdb_roundup_cpus(flags);
 #endif
 
-       /* spin_lock code is good enough as a barrier so we don't
-        * need one here */
-       atomic_set(&procindebug[processor], 1);
-
        /* Wait a reasonable time for the other CPUs to be notified and
         * be waiting for us.  Very early on this could be imperfect
         * as num_online_cpus() could be 0.*/
@@ -1584,7 +1568,8 @@ int kgdb_nmihook(int cpu, void *regs)
 {
 #ifdef CONFIG_SMP
        if (!atomic_read(&procindebug[cpu]) &&
-               atomic_read(&debugger_active) != (cpu + 1)) {
+               atomic_read(&debugger_active) != (cpu + 1) &&
+               atomic_read(&procindebug[atomic_read(&debugger_active) - 1])) {
                kgdb_wait((struct pt_regs *)regs);
                return 0;
        }

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Kgdb-bugreport mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/kgdb-bugreport

Re: [Kgdb-bugreport] [PATCH 1/2] prevent Slave CPUs hang on exit

Reply via email to