Re: debugging kernel hang (can't type anything)

Peter Teoh Wed, 18 Feb 2009 06:34:30 -0800

On Wed, Feb 18, 2009 at 3:58 AM, Sukanto Ghosh
<[email protected]> wrote:


> Hi Peter,
>
> Accidentally I came upon this article:
> http://stackframe.blogspot.com/2007/04/debugging-linux-kernels-with.html
>
> With the help of it I could get a stacktrace when the kernel hung.
>
> The problem was a stupid one: I was holding a spin_lock and then I
> called some function that again tries to hold the same lock.  Now I
> release the lock before I call that method.
>
> I was working with mm/thrash.c
>
> But now I am getting a "BUG: spinlock recursion on CPU#0" error.  What
> does this error mean ?



Yes, it means that the previous lock has not been released yet, and now
you are calling a function that tries to acquire the same spinlock again.
Before the spinlock is actually acquired, a debug check is made:

lib/spinlock_debug.c:

static inline void
debug_spin_lock_before(spinlock_t *lock)
{
        SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
        SPIN_BUG_ON(lock->owner == current, lock, "recursion");
        SPIN_BUG_ON(lock->owner_cpu == raw_smp_processor_id(),
                                                        lock, "cpu recursion");
}

So the "recursion" case above was detected: the task that already owns the
spinlock (lock->owner == current) is trying to acquire it again.  Instead of
silently going into a tight spin, the debug code prints the message and a
stack trace beforehand.


>
> Does it mean again it's spinning indefinitely on a spinlock ?
>
> I got the following backtrace:
> #0  0xc0505f1b in delay_tsc (loops=1) at arch/x86/lib/delay.c:85
> #1  0xc0505f77 in __udelay (usecs=3479683981) at arch/x86/lib/delay.c:118
> #2  0xc700ad98 in ?? ()
> #3  0xc05097ba in _raw_spin_lock (lock=0xc6c34998) at lib/spinlock_debug.c:116
> #4  0xc0647509 in _spin_lock_bh (lock=0xc6c349a8) at kernel/spinlock.c:113
> #5  0xc048189e in dmam_pool_match (dev=<value optimized out>, res=0x1cb,
>    match_data=0x0) at mm/dmapool.c:457
> #6  0x00000001 in ?? ()
>
>
> While writing this mail I had paused my guest OS kernel from gdb (^c)
> for some time, and when I said continue (c) it printed: "Clocksource tsc
> unstable (delta = 838972636559 ns)"
>
>
> Regards,
> Sukanto Ghosh
>
>
>
>
>
> On Wed, Feb 18, 2009 at 4:52 AM, Peter Teoh <[email protected]>
> wrote:
> > Would you like to share WHERE you made the change?  WHAT you did could be
> > part of an academic exercise, so perhaps you want to keep it confidential,
> > but WHERE would be helpful.
> >
> > I suspect (very usual for changes to MM code) that you have done
> > something illegal while holding a spinlock.  Knowing where you inserted
> > the code will help us understand whether this is a problem or not.
> >
> > On Sat, Feb 14, 2009 at 7:52 PM, Sukanto Ghosh <[email protected]>
> > wrote:
> >>
> >> Hi,
> >>
> >> I have made some changes to the memory management part of the kernel
> >> as an experiment. Now when I boot into that kernel and start some
> >> heavy processes (which cause paging), the kernel hangs. I can't even
> >> type anything.
> >>
> >> I have gone through the 'paper on debugging kernel oops or hang'
> >> (http://mail.nl.linux.org/kernelnewbies/2003-08/msg00347.html)
> >>
> >> In this paper Erik says that to get the stack trace we can type
> >> 'Alt-SysRq-t', which prints the stack trace, and when it's not possible
> >> to type anything, it's best to use a serial port + console. He says
> >> the config for lilo would be: console=ttyS0,9600 console=tty0
> >>
> >> As I have grub I am using the following lines:
> >>
> >> default=0
> >> timeout=15
> >> title Fedora (2.6.27.4)
> >>        root (hd0,0)
> >>        kernel /boot/vmlinuz-2.6.27.4 ro root=/dev/sda1
> >>        initrd /boot/initrd-2.6.27.4.img
> >>        serial --unit=0 --speed=9600 --word=8 --parity=no --stop=1
> >>        terminal --dumb --timeout=10 serial console
> >>
> >>
> >>
> >> CONFIG_MAGIC_SYSRQ was enabled in my config file.
> >>
> >> My test kernel is running inside a Virtual machine (VM) (VMware), with
> >> its serial port 0 redirected to a file.
> >> VM OS: fedora Core 9 with modified kernel 2.6.27.4
> >> Host OS: ubuntu hardy 2.6.24.3
> >>
> >> My problem is I am not getting any kind of output in the file to which
> >> I redirected the serial port of the VM except a bunch of "Press any
> >> key to continue .. " messages.
> >>
> >> Should I be providing the 'alt-sysrq-t' input through the serial port,
> >> and if so, how?
> >> Can I connect a host terminal to the serial port of the VM?
> >> VMware gives me three options for the serial port of the virtual machine:
> >> i) connect it to a physical port of the host, ii) connect it to a named
> >> pipe, and iii) connect it to a file on the host.
> >>
> >> Please help ...
> >>
> >>
> >> --
> >> Regards,
> >> Sukanto Ghosh
> >>
> >> --
> >> To unsubscribe from this list: send an email with
> >> "unsubscribe kernelnewbies" to [email protected]
> >> Please read the FAQ at http://kernelnewbies.org/FAQ
> >>
> >
> >
> >
> > --
> > Regards,
> > Peter Teoh
> >
>



-- 
Regards,
Peter Teoh
