Re: debugging kernel hang (can't type anything)

Peter Teoh Wed, 18 Feb 2009 18:48:14 -0800

On Thu, Feb 19, 2009 at 3:19 AM, Sukanto Ghosh
<sukanto.cse.i...@gmail.com>wrote:


> Hi,
>
> I am running into a peculiar problem here.
>
> I am getting stack-traces from two sources, i) spin_lock BUG_ON "cpu
> recursion", and ii) remote gdb running on my host and connected to the
> guest kernel (the one I am debugging).
>
> spin_lock BUG_ON stack-trace says:
> BUG: spinlock recursion on CPU#0, qsbench/2169
>  lock: c6783618, .magic: dead4ead, .owner: qsbench/2169, .owner_cpu: 0
> Pid: 2169, comm: qsbench Not tainted 2.6.27.4-sg-kgdb #2
>  [<c0644bb7>] ? printk+0xf/0x18
>  [<c05096ae>] spin_bug+0x7c/0x87
>  [<c0509766>] _raw_spin_lock+0x35/0xfa
>  [<c0647509>] _spin_lock+0x3d/0x4b
>  [<c048189e>] grab_swap_token+0x20c/0x246   <===
>  [<c0477873>] handle_mm_fault+0x35b/0x70a
>  [<c0649352>] ? do_page_fault+0x2a6/0x722
>  [<c0649425>] do_page_fault+0x379/0x722
>  [<c0445d82>] ? __lock_acquire+0x7bd/0x814
>  [<c0445d82>] ? __lock_acquire+0x7bd/0x814
>  [<c043f9c4>] ? getnstimeofday+0x3c/0xd6
> ....
>
>
> And gdb shows:
> #0  0xc0505f1b in delay_tsc (loops=1) at arch/x86/lib/delay.c:85
> #1  0xc0505f77 in __udelay (usecs=2394097143) at arch/x86/lib/delay.c:118
> #2  0xc6866d98 in ?? ()
> #3  0xc05097ba in _raw_spin_lock (lock=0xc6783618) at
> lib/spinlock_debug.c:116
> #4  0xc0647509 in _spin_lock_bh (lock=0xc6783628) at kernel/spinlock.c:113
> #5  0xc048189e in dmam_pool_match (dev=<value optimized out>, res=0x2d5,
>     match_data=0x0) at mm/dmapool.c:457        <===
> #6  0x00000001 in ?? ()
>
>
> If you notice the lines I have shown with arrows, the address
> 0xc048189e  is shown in different function in each case.
> What is more surprising, System.map says
> c0481692 T grab_swap_token
> c04818d8 t dmam_pool_match
> (consistent with the spin_lock BUG_ON trace)
>
>
> But on going through the output of 'objdump -s -l --source' I found
> that 'dmam_pool_match' starts at address 0xc0481898.
> (consistent with the gdb back-trace output)
>

I must say I am quite surprised by this feature of --source.....quite useful
...but now sure how to use it not.   But it really takes too
LONG......normally I do "objdump -rd" much faster.   But to find the
address, both approach are the same.

And the address indeed have some problem....u sure u did not used an old
copy of these files ?(as u said ..... these are from two different physical
sources??? compiled codes may be different on different machines)
Normally, System.map's address should match to "objdump -rd"'s address.   In
your case it is c04818d8 vs c0481898, weird.   Mine both are identical.



>
> I am confused about the source of the 'spinlock cpu recursion'. I have
> checked the code I have written (grab_swap_token modifications etc in
> mm/thrash.c) , it doesn't appears to be the source of the problem.  I
> haven't touched the function dmam_pool_match() and it doesn't appears
> in mm/thrash.c
>
>
>
>
> >>> But now I am getting a "BUG: spinlock recursion on CPU#0" error.  What
> >>> does this error mean ?
> >>
> >> yes, it actually means that the previous lock is still not released yet,
> and
> >> now u are calling a function to acquire the spinlock again......and
> somehow,
> >> before actually acquiring the spinlock, a check was made:
> >>
> >> lib/spinlock_debug.c:
> >>
> >> static inline void
> >> debug_spin_lock_before(spinlock_t *lock)
> >> {
> >>         SPIN_BUG_ON(lock->magic != SPINLOCK_MAGIC, lock, "bad magic");
> >>         SPIN_BUG_ON(lock->owner == current, lock, "recursion");
> >>         SPIN_BUG_ON(lock->owner_cpu == raw_smp_processor_id(),
> >>                                                         lock, "cpu
> >> recursion");
> >> }
> >>
> >> so now the above "recursion" was detected, as the spinlock is still
> residing
> >> on the same CPU while being reacquired.   so instead of going into a
> tight
> >> spin, it print out the stack trace before hand.
> >>
>
> Does it mean that it's illegal for two processes (on the same cpu)
> spinning on a lock that is held by someone else (again on the same
> cpu) ?
>

yes,....but very confusing.   normally, when u are talking about spinlock on
one CPU,  the process will not be switched out of the CPU until it has
finish and release the spinlock.   the rule is that spinlock is meant for
multi-cpu synchronization, NOT multiple process on one CPU.   in fact, if
another process is switch in, and found the spinlock locked by another
process, the CPU will (in the past) go into a deadly spin.....and loop wait
until the spinlock is unlock.  but this is not possible, because the CPU is
spinning.   now I think the code is wiser - it will throw an error
instead.   if u find this happening....i think u have found a bug....:-)



>
>
> --
> Regards,
> Sukanto Ghosh
>



-- 
Regards,
Peter Teoh

Re: debugging kernel hang (can't type anything)

Reply via email to