Re: [gentoo-user] Kernel conundrum

Nick Fisher Thu, 14 Aug 2003 11:22:39 -0700

>> >> I have a machine that I cannot compile a stable 2.4.20 kernel for, yet
>> >> the one off of the 1.4_rc2 liveCD works fine. I'm guessing there is an
>> >> option or a patch that is/isn't set/applied. Apart from good old trial
>> >> and error, how the heck do I work out what is giving me the problem?
>> > I've often found the NMI watchdog timer to be extremely
>> > helpful with unexplained kernel lockups. Documentation is in
>> > /usr/src/linux/Documentation/nmi_watchdog.txt. Basically, you append
>> > "nmi_watchdog=1" to your kernel launch from LILO or Grub - in a few
>> > rare cases, you need a value other than 1. When the kernel locks up,
>> > the watchdog detects it and dumps interesting traceback to the console.
>> > I've been able to correlate that traceback to symbols in /proc/ksyms
>> > and identify malfunctioning drivers.
>> Ok.... sounds good. I now have this:
>>
>> NMI Watchdog detected LOCKUP on CPU1, eip c02f2750, registers:
>> eax: d64ca000 ebx: c24d5400 ecx: c24bb000 edx: 00000000
>> ds: 0018 es: 0018 ss: 0018
>>
>> How do I make some sense of it? I've had a look in /proc/ksyms but
>> c02f2750 isn't listed. Is there some other list I should be asking on? A
>> URL I should be looking at? I'm very new to kernel
>> debugging/troubleshooting and I'm a bit lost..... though not quite as
>> lost as I was yesterday ;)
>
> Progress!
Yey!


> As you surmise, eip is the interesting register. Sort the
> contents of /proc/ksyms and see where it falls. In my /proc/ksyms (which
> won't match yours!), I see these entries bracketing that value:
>
>   c02f04b0 task_read_24_Rsmp_ae3fb3f3
>   c02f3870 proc_ide_read_geometry_Rsmp_50fed6f7
>
> so if I saw that address in my dump, I'd know the failure was in a routine
> named task_read_24(). (The trailing _Rsmp_ae3fb3f3 is module versioning.)
Right..... well from my sorted ksyms...

c02ea460 scsi_malloc_R1cce3f92
c02ea598 scsi_free_R475dddfa
c0306a4c register_cdrom_R5a61744f
c0306d20 unregister_cdrom_R703d3575

So I'm guessing that the problem is in scsi_free(), yes? That would explain
why I keep having the problem with all my kernels. All my kernels have the
aic7xxx driver for my card.....

How can I tell where scsi_free() comes from? I'm guessing that it's from
the aic7xxx driver but how can I tell?
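
For what it's worth, the bracketing lookup is easy to script. Below is a
rough sketch in Python, assuming the ksyms output is in the usual
"address name" form. parse_ksyms and bracket are made-up helper names, and
the sample table is just the four entries quoted above. One caveat, as far
as I understand it: /proc/ksyms only lists exported symbols, so the nearest
entry below the eip is a hint about the neighbourhood rather than proof of
the faulting function. The System.map that matches the running kernel
should also list the non-exported ones.

```python
import bisect

# Sample data: the four sorted /proc/ksyms entries quoted in this mail.
SAMPLE = """\
c02ea460 scsi_malloc_R1cce3f92
c02ea598 scsi_free_R475dddfa
c0306a4c register_cdrom_R5a61744f
c0306d20 unregister_cdrom_R703d3575
"""


def parse_ksyms(text):
    """Parse 'address name [module]' lines into a list sorted by address."""
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        try:
            addr = int(fields[0], 16)
        except ValueError:
            continue  # first field is not a hex address
        symbols.append((addr, fields[1]))
    symbols.sort()
    return symbols


def bracket(symbols, eip):
    """Return (below, above): the nearest symbol at or below eip and the
    nearest symbol strictly above it. Either may be None at the edges."""
    addrs = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addrs, eip)
    below = symbols[i - 1] if i > 0 else None
    above = symbols[i] if i < len(symbols) else None
    return below, above


if __name__ == "__main__":
    eip = 0xC02F2750  # from the "NMI Watchdog detected LOCKUP" line
    below, above = bracket(parse_ksyms(SAMPLE), eip)
    if below is not None:
        print("nearest symbol below: %s + %#x" % (below[1], eip - below[0]))
    if above is not None:
        print("next symbol above:    %s" % above[1])
```

With the four entries above, the eip lands 0x81b8 bytes (roughly 32 KB)
past scsi_free, which is a lot for a single routine, so the lockup may
well be in something unexported that sits after it.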

> I believe a stack traceback also appears in the NMI Watchdog output -
> it's sometimes interesting to construct a traceback by gathering some
> of those addresses.
I put everything I found on the console in the mail..... so I'm not sure
about the stack trace.....

> The last time I used this technique, BTW, I identified some buggy SCSI
> module code.
Hummmmm.... sounds familiar....

> After some grueling detective work, I found a message
> somewhere that said, "oops, I forgot to propagate my Adaptec fix from
> this aic79xx module to this aic7xxx module"... found the patch, applied
> it to my Gentoo sources, and was back in business.
I don't suppose that patch is still missing from the Gentoo sources, is it?
I'm guessing not.... that would be *too* easy.
Without getting you to do my work for me, where should I go looking for
things relating to this? What should I look for?

> This ain't for
> the faint of heart :-).
LOL! If you have a better idea I'm all ears ;)

Thanks for your help on this. I've been banging my head against it for
about three weeks on and off now. I finally feel I'm making headway!

  Nick
