On Wed, Aug 13, 2003 at 12:25:37PM -0400, Nick Fisher wrote:
> > On Tue, Aug 12, 2003 at 03:52:34PM -0400, Nick Fisher wrote:
> >> I have a machine that I cannot compile a stable 2.4.20 kernel for, yet
> >> the one off of the 1.4_rc2 liveCD works fine. I'm guessing there is an
> >> option or a patch that is/isn't set/applied. Apart from good old trial
> >> and error, how the heck do I work out what is giving me the problem?
> > I've often found the NMI watchdog timer to be extremely
> > helpful with unexplained kernel lockups. Documentation is in
> > /usr/src/linux/Documentation/nmi_watchdog.txt. Basically, you append
> > "nmi_watchdog=1" to your kernel launch from LILO or Grub - in a few
> > rare cases, you need a value other than 1. When the kernel locks up, the
> > watchdog detects it and dumps interesting traceback to the console. I've
> > been able to correlate that traceback to symbols in /proc/ksyms and
> > identify malfunctioning drivers.
> Ok.... sounds good. I now have this:
> 
> NMI Watchdog detected LOCKUP on CPU1, eip c02f2750, registers:
> eax: d64ca000 ebx: c24d5400 ecx: c24bb000 edx: 00000000
> ds: 0018 es: 0018 ss: 0018
> 
> How do I make some sense of it? I've had a look in /proc/ksyms but
> c02f2750 isn't listed. Is there some other list I should be asking on? A
> URL I should be looking at? I'm very new to kernel
> debugging/troubleshooting and I'm a bit lost..... though not quite as lost
> as I was yesterday ;)

Progress! As you surmise, eip is the interesting register. The exact
address won't be listed; sort the contents of /proc/ksyms by address and
see which two entries it falls between. In my /proc/ksyms (which won't
match yours!), I see these entries bracketing that value:

  c02f04b0 task_read_24_Rsmp_ae3fb3f3
  c02f3870 proc_ide_read_geometry_Rsmp_50fed6f7

so if I saw that address in my dump, I'd know the failure was in a routine
named task_read_24(). (The trailing _Rsmp_ae3fb3f3 stuff is module
versioning.)
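
If sorting and eyeballing gets tedious, the lookup can be scripted. Here is
a minimal sketch in Python, assuming each /proc/ksyms line starts with a hex
address followed by the symbol name. The helper names parse_ksyms and
nearest_symbol are made up for illustration, and SAMPLE just holds the two
entries from my machine, so substitute the contents of your own /proc/ksyms.

```python
import bisect


def parse_ksyms(text):
    """Parse '<hex address> <symbol> ...' lines into a list sorted by address."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            address = int(fields[0], 16)
        except ValueError:
            continue  # first field was not a hex address
        symbols.append((address, fields[1]))
    symbols.sort()
    return symbols


def nearest_symbol(symbols, eip):
    """Return (symbol, offset) for the last symbol at or below eip, else None."""
    addresses = [address for address, _ in symbols]
    index = bisect.bisect_right(addresses, eip) - 1
    if index < 0:
        return None  # eip is below every listed symbol
    address, name = symbols[index]
    return name, eip - address


# Illustrative data only: the two entries quoted above, deliberately unsorted.
SAMPLE = """\
c02f3870 proc_ide_read_geometry_Rsmp_50fed6f7
c02f04b0 task_read_24_Rsmp_ae3fb3f3
"""

if __name__ == "__main__":
    # On a 2.4 box you would read the real file instead of SAMPLE, e.g.
    # symbols = parse_ksyms(open("/proc/ksyms").read())
    symbols = parse_ksyms(SAMPLE)
    name, offset = nearest_symbol(symbols, 0xC02F2750)
    print(f"{name}+0x{offset:x}")
```

Keep an eye on the offset it reports. /proc/ksyms lists only exported
symbols, so a large offset may mean the address really sits in an unexported
routine that follows the reported one.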

I believe a stack traceback also appears in the NMI Watchdog output -
it's sometimes worthwhile to look up some of those addresses the same way
and piece together the chain of calls that led to the lockup.


The last time I used this technique, BTW, I identified some buggy SCSI
module code. After some grueling detective work, I found a message
somewhere that said, "oops, I forgot to propagate my Adaptec fix from
this aic79xx module to this aic7xxx module"... found the patch, applied
it to my Gentoo sources, and was back in business. This ain't for
the faint of heart :-).

Nathan Meyers
[EMAIL PROTECTED]

--
[EMAIL PROTECTED] mailing list
