Re: How to diagnose kernel panic?

2006-07-11 Thread Mark Copper
Thank you for your suggestions on how to chase down this intermittent panic.

This has occurred from the beginning with this machine (Supermicro P4SCi MB,
Ablecom 420 W power supply, Seagate Barracuda SATA HDs, Crucial RAM, Debian
testing with 2.6.15 kernel, software RAID1).

All of the manufacturer's HD tests passed (thanks for the idea).  Searching
for others reporting the same panic message was interesting but turned up no
exact matches, which seems to indicate the problem is in the hardware
somewhere.  The data center guy suspects a static zap during assembly.

I can't imagine which conditions of being on-line at the data center I am
unable to replicate...  I guess I'll just wait, run [EMAIL PROTECTED] and
occasionally bombard it with HTTP requests.  It's gotta happen again
sometime, especially if it's the power supply.
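
Since the call trace points at the SCSI request path, sustained disk
writes may be a more direct trigger than HTTP requests alone.  The
following is only a minimal sketch of such a load, not a tested
reproducer: the function names, worker count, and file sizes are all
assumptions made for illustration.

```python
import hashlib
import os
import tempfile
import threading


def stress_worker(directory, rounds, size_bytes, failures):
    """Write a random file, force it to disk, read it back, compare digests."""
    for i in range(rounds):
        path = None
        try:
            payload = os.urandom(size_bytes)
            expected = hashlib.sha256(payload).hexdigest()
            fd, path = tempfile.mkstemp(dir=directory)
            with os.fdopen(fd, "wb") as handle:
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # push the write down to the device
            with open(path, "rb") as handle:
                actual = hashlib.sha256(handle.read()).hexdigest()
            if actual != expected:
                failures.append((path, i))
        except OSError as exc:
            # An I/O error here is exactly the kind of event worth recording.
            failures.append((path, repr(exc)))
        finally:
            if path is not None and os.path.exists(path):
                os.remove(path)


def run_stress(directory, workers=4, rounds=100, size_bytes=1 << 20):
    """Run several workers in parallel; return mismatches and I/O errors."""
    failures = []
    threads = [
        threading.Thread(target=stress_worker,
                         args=(directory, rounds, size_bytes, failures))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return failures
```

Calling run_stress() on a directory that lives on the RAID1 array and
leaving it in a loop would keep the disks busy.  Note that the read-back
is probably served from the page cache, so it is the write plus fsync
that actually loads the drives.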

Any further test suggestions would be welcome.

Mark


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED] 
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Re: How to diagnose kernel panic?

2006-07-10 Thread Lothar Braun
Hi Mark,

On Sun, 2006-07-09 at 15:56 -0400, Mark Copper wrote:
> I have a server that is brought down by a kernel panic every two weeks
> on average.

Did it do that right from the first installation or did it run for some
time without problems?

> Nothing untoward gets in the logs and the on-screen panic
> message starts with something like
>    Kernel panic - not syncing: Fatal exception in interrupt
>
>    Call trace:
>    [c026bc42] scsi_request_fn+0xf610x294
> I wasn't able to get any more at the data center...

Well, the first thing I'd suspect is a problem with the hardware
(most of the kernel panics I have seen were related to broken hardware).

And it looks like it's something problematic with the hard drive
containing the root filesystem.

The steps I would take:

1.) Search Google for the name of your hard drive together with
scsi_request_fn and kernel panic

2.) Get the hard drive checking tools of your manufacturer and test that
drive

3.) If you have two machines with the same drive, swap drives between
them (and check where the problem occurs then)

4.) Get a newer kernel

If none of the steps above helps:

5.) Write a mail to the kernel guys, tell them about the problem and
what you did to find the problem.
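
As a rough complement to step 2 (not a replacement for the
manufacturer's tools), a plain sequential read of the whole drive can
show whether any region fails to read.  This is only a sketch; the
function name and chunk size are assumptions.

```python
import os


def read_scan(path, chunk_bytes=1 << 20):
    """Read `path` from start to end in chunks.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the starting
    offset of every chunk whose read raised OSError.
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a file or a block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_bytes, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                bad_offsets.append(offset)
                offset += want  # skip the unreadable chunk and keep going
                continue
            if not data:
                break  # unexpected end of data
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

Run as root against the raw device (for example /dev/sda, which is an
assumed name) it reads every sector once; any offsets it returns would
be worth mentioning in the mail from step 5.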


Hth,
Lothar





Re: How to diagnose kernel panic?

2006-07-10 Thread Andrew Sackville-West
On Sun, Jul 09, 2006 at 03:56:12PM -0400, Mark Copper wrote:
> I have a server that is brought down by a kernel panic every two weeks
> on average.  Nothing untoward gets in the logs and the on-screen panic
> message starts with something like
>    Kernel panic - not syncing: Fatal exception in interrupt
>
>    Call trace:
>    [c026bc42] scsi_request_fn+0xf610x294
> I wasn't able to get any more at the data center...
>
> So I brought the machine home and am running [EMAIL PROTECTED] on it and so
> far I have not been able to induce the panic.

there was something unique about what the machine was doing previously
that caused this. You are not doing whatever that is now and thus not
inducing the error. 

> The replacement machine
> is similar, but not identical.  The main difference being a switch from
> software to hardware RAID1.  Also, the new machine, except for the
> hardware driver, uses stable while the problematic machine uses testing.
> And the replacement has run so far without problem.

well, the call trace above points to a disk problem and you've changed
the disk setup in the new machine by putting a piece of hardware
between the disks and the motherboard, so your problem may be gone
because of that. It's unclear what exactly you've done here. Does the
new machine use the old disks through the hardware RAID, or are you
dealing with all new disks? Either way you've changed a lot from old
to new machine and it's not surprising that you've eliminated the
problem as a result of this.

 
> The only other thing I can add is that the bad machine would seem to
> start getting sluggish before it froze, but for the life of me, I
> couldn't see why.


maybe the kernel was trying repeatedly to do some disk operation that
failed, which used up cpu time and caused the sluggish behaviour?
 

> I am posting because I'm hopeful that list participants might have
> suggestions how I might start to chase down or, better yet, eliminate
> this problem.

can you reproduce the exact setup that was causing problems before,
including the usage levels? 
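
One rough way to compare usage levels between the data center and the
bench is to sample the counters in /proc/interrupts and see how fast the
disk controller's line is firing under each load.  A minimal sketch; the
function names and the sampling interval are assumptions.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts content into {label: total count across CPUs}."""
    lines = text.splitlines()
    if not lines:
        return {}
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        total = 0
        for token in rest.split()[:cpu_count]:
            if not token.isdigit():
                break  # reached the description columns
            total += int(token)
        totals[label.strip()] = total
    return totals


def interrupt_deltas(before, after):
    """Return the per-label increase between two snapshots (nonzero only)."""
    deltas = {}
    for label, count in after.items():
        change = count - before.get(label, 0)
        if change != 0:
            deltas[label] = change
    return deltas


def sample(interval_seconds=10.0, path="/proc/interrupts"):
    """Take two snapshots `interval_seconds` apart and return the deltas."""
    with open(path) as handle:
        before = parse_interrupts(handle.read())
    time.sleep(interval_seconds)
    with open(path) as handle:
        after = parse_interrupts(handle.read())
    return interrupt_deltas(before, after)
```

This only observes interrupt rates; it does not generate interrupts.
If the rates on the bench are far below what the machine saw in
production, the test load is probably not representative.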

A




How to diagnose kernel panic?

2006-07-09 Thread Mark Copper
I have a server that is brought down by a kernel panic every two weeks
on average.  Nothing untoward gets in the logs and the on-screen panic
message starts with something like
   Kernel panic - not syncing: Fatal exception in interrupt
   
   Call trace:
   [c026bc42] scsi_request_fn+0xf610x294
I wasn't able to get any more at the data center...

So I brought the machine home and am running [EMAIL PROTECTED] on it and so
far I have not been able to induce the panic.  The replacement machine
is similar, but not identical.  The main difference being a switch from
software to hardware RAID1.  Also, the new machine, except for the
hardware driver, uses stable while the problematic machine uses testing.
And the replacement has run so far without problem.

The only other thing I can add is that the bad machine would seem to
start getting sluggish before it froze, but for the life of me, I
couldn't see why.

I am posting because I'm hopeful that list participants might have
suggestions how I might start to chase down or, better yet, eliminate
this problem.

Is there a way, perhaps, to manufacture the possible interrupts that
occur?  

Thanks.

Mark





Re: How to diagnose kernel panic?

2006-07-09 Thread Robert Jeffrey Miesen
Hi Mark.

I don't know if this kind of information will help out at all or not,
but what are the specs of your machine? Specifically, do you have a
quality power supply? How about your hard drive and your motherboard?
As I said, I don't know if answering these questions will reveal
anything important, but it always helps to verify that you are using
quality parts in your machine.

After all, a software program is just a collection of assembly
instructions to your CPU (usually compiled from a high-level language,
such as C++). If a piece of software executes an instruction that asks a
hard disk for data, and the motherboard and/or the hard disk are cheapies
that fail to return the data the instruction was expecting, that can
certainly cause software bugs ranging from incorrect display of data to
kernel panics, depending on which program is the unlucky one. Cheap
motherboards and hard disks are cheap because they have less redundancy
and fault tolerance, and use components that are more likely to fail to
begin with.

Also, if your power supply is a cheap one, it might not be supplying
enough power to your computer. If that happens, your computer just won't
work correctly, because both your software and your hardware expect full
power in order to work correctly.

Hope all that helps.



