>> >> I have a machine that I cannot compile a stable 2.4.20 kernel for, yet
>> >> the one off of the 1.4_rc2 liveCD works fine. I'm guessing there is an
>> >> option or a patch that is/isn't set/applied. Apart from good old trial
>> >> and error how the heck do I work out what is giving me the problem?
>> > I've often found the NMI watchdog timer to be extremely
>> > helpful with unexplained kernel lockups. Documentation is in
>> > /usr/src/linux/Documentation/nmi_watchdog.txt. Basically, you append
>> > "nmi_watchdog=1" to your kernel launch from LILO or Grub - in a few
>> > rare cases, you need a value other than 1. When the kernel locks up,
>> > the watchdog detects it and dumps interesting traceback to the console.
>> > I've been able to correlate that traceback to symbols in /proc/ksyms
>> > and identify malfunctioning drivers.
>> Ok.... sounds good. I now have this:
>>
>> NMI Watchdog detected LOCKUP on CPU1, eip c02f2750, registers:
>> eax: d64ca000 ebx: c24d5400 ecx: c24bb000 edx: 00000000
>> ds: 0018 es: 0018 ss: 0018
>>
>> How do I make some sense of it? I've had a look in /proc/ksyms but
>> c02f2750 isn't listed. Is there some other list I should be asking on? A
>> URL I should be looking at? I'm very new to kernel
>> debugging/troubleshooting and I'm a bit lost..... though not quite as
>> lost as I was yesterday ;)
>
> Progress! Yey!
> As you surmise, eip is the interesting register. Sort the
> contents of /proc/ksyms and see where it falls. In my /proc/ksyms (which
> won't match yours!), I see these entries bracketing that value:
>
> c02f04b0 task_read_24_Rsmp_ae3fb3f3
> c02f3870 proc_ide_read_geometry_Rsmp_50fed6f7
>
> so if I saw that address in my dump, I'd know the failure was in a routine
> named task_read_24(). (The trailing stuff is module versioning.)

Right..... well from my sorted ksyms...

c02ea460 scsi_malloc_R1cce3f92
c02ea598 scsi_free_R475dddfa
c0306a4c register_cdrom_R5a61744f
c0306d20 unregister_cdrom_R703d3575

So I'm guessing that the problem is in scsi_free(), yes? That would explain
why I keep having the problem with all my kernels. All my kernels have the
aic7xxx driver for my card..... How can I tell where scsi_free() comes from?
I'm guessing that it's from the aic7xxx driver, but how can I tell?

> I believe a stack traceback also appears in the NMI Watchdog output -
> it's sometimes interesting to construct a traceback by gathering some
> of those addresses.

I put everything I found on the console in the mail..... so I'm not sure
about the stack trace.....

> The last time I used this technique, BTW, I identified some buggy SCSI
> module code.

Hummmmm.... sounds familiar....

> After some grueling detective work, I found a message
> somewhere that said, "oops, I forgot to propagate my Adaptec fix from
> this aic79xx module to this aic7xxx module"... found the patch, applied
> it to my Gentoo sources, and was back in business.

I don't suppose that patch is still missing from the Gentoo sources, is it?
I'm guessing not.... that would be *too* easy. Without getting you to do my
work for me, where should I go looking for things relating to this? What
should I look for?

> This ain't for
> the faint of heart :-).

LOL! If you have a better idea I'm all ears ;)

Thanks for your help on this, I've been banging my head against it for
about three weeks on and off now.
I finally feel I'm making headway!

Nick
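
For anyone following the thread, the lookup described above (sort the
symbol table, then find the entries on either side of eip) can be sketched
in a few lines of Python. This is only an illustration: the helper names
parse_symbols and bracket are made up for this note, and the sample table
is just the four ksyms entries quoted in this message. On a real 2.4 system
you would feed it the contents of /proc/ksyms instead.

```python
import bisect


def parse_symbols(text):
    """Parse 'address name [module]' lines into a list sorted by address.

    Hypothetical helper: assumes the address is hexadecimal in the first
    column and the symbol name is in the second, as in the entries quoted
    in this thread. Lines that do not fit are skipped.
    """
    symbols = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            address = int(fields[0], 16)
        except ValueError:
            continue
        symbols.append((address, fields[1]))
    symbols.sort()
    return symbols


def bracket(symbols, eip):
    """Return the (below, above) entries surrounding eip; either may be None."""
    addresses = [address for address, _ in symbols]
    index = bisect.bisect_right(addresses, eip)
    below = symbols[index - 1] if index > 0 else None
    above = symbols[index] if index < len(symbols) else None
    return below, above


# The four entries quoted in this message, deliberately out of order.
SAMPLE = """\
c0306a4c register_cdrom_R5a61744f
c02ea598 scsi_free_R475dddfa
c0306d20 unregister_cdrom_R703d3575
c02ea460 scsi_malloc_R1cce3f92
"""

if __name__ == "__main__":
    eip = 0xC02F2750
    below, above = bracket(parse_symbols(SAMPLE), eip)
    if below is not None:
        print("below: %08x %s (eip is +0x%x past it)"
              % (below[0], below[1], eip - below[0]))
    if above is not None:
        print("above: %08x %s" % (above[0], above[1]))
```

One caution when reading the result: /proc/ksyms lists only exported
symbols, so the entry just below eip is not necessarily the routine that
locked up. Here eip sits 0x81b8 bytes past the start of scsi_free(), which
is a large gap for one function, so the real culprit may be a non-exported
routine in between. If a System.map matching the running kernel is
available, running the same lookup against it should give a tighter
bracket.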