Dear List,

I have a dual 450 MHz PII built from a SuperMicro P6DGS (440GX)
motherboard with two deschutes CPUs.  This MoBo has onboard aic7895
controllers.  The system has a Matrox G200 video card and is currently
running a linksys sort-of-tulip card.  It has 512 MB of SDRAM.  It was
running 2.0.35 SMP a couple of days ago; now it is running 2.0.36 SMP.

This system has, since we got it, exhibited a most painful tendency to
just die under load, and sometimes under NO load.  Worse, it dies
without so much as a whimper.  No aiee's, no deadlocks -- it just hangs
or spontaneously reboots.

I reported this problem a few months ago and it was suggested that I
turn off DMA on the IDE drive (it has an IDE drive, an IDE CD-R, a
SCSI drive, and a SCSI Jaz drive) -- I did so and for a while the
problem appeared to have disappeared.  It recently reappeared, though,
just when we really need the system.

The kernel is spanking new, and the problem has persisted through
several SMP kernel and driver revisions:  I've run 2.0.33, 2.0.35, and
2.0.36 on it with aic7xxx 5.0.19, 5.1.2, and 5.1.4, and I've tried both
eepro100 and linksys ethernet cards with several driver revisions.  I
haven't tried the system without the SVGA card, but the problem occurs
even when the monitor is doing "nothing" but running an idle VGA
console.  I'm therefore becoming doubtful that the problem is in
software.  However, I have almost nothing to go on to diagnose the
problem in hardware, and it is a very bad time to ship the box back for
depot repair (especially with a diffuse problem like, "Uh, dunno, just
reboots itself from time to time").

Any suggestions on how to debug this and isolate a hardware problem?
Any (negative or positive) experiences with this particular motherboard?
I've skimmed the smp-faq (of course) and it isn't mentioned, nor is
there anything suggested there that I haven't already tried.  For
example, yes, the CPU's are the same stepping -- 2 -- and the same
bogomips.  I'm not overclocking (doing long term numerical calculations
with 10^12 or more Ops I'd be crazy to overclock, but then, only crazy
people EVER overclock).  There is -- literally -- nothing in
/var/adm/[syslog,messages] and I get no console message at crash time.

Should I:  

  a) Enable kernel profiling or in some other way trace what it is doing
at a crash to at least see if one particular thing causes the crash?

  b) I've done a fair amount of hardware swap already -- permutation of
network devices, removal of SCSI and IDE devices -- but I haven't
removed one DIMM at a time or anything like that, as I have no reason to
believe that my SDRAM is "bad".  This worries me, however, as I've heard
on this list that 450 MHz CPU's are very memory sensitive.  Is there any
way to test this?  Anything I should look for?  I'm using SDRAM provided
by Aberdeen "for this motherboard" that SHOULD be in spec, but how do I
tell?  Is there a memory tester/exerciser program available somewhere?

  c) I'm writing the smp list because it is an SMP motherboard running
an SMP kernel.  I really don't think that the kernel (SMP vs UP or 2.0.x
vs 2.1.x) is at fault here.  I've tested both SMP and UP kernels on it
and managed to get the crash both ways.  Nevertheless, if anybody has
specific evidence that the kernel might be at fault I'd be happy to try
anything at this point.
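
To make (b) concrete, the kind of user-space exerciser I have in mind
could be sketched roughly as follows.  This is only a minimal
illustration, written in Python for brevity; the name exercise_memory,
the pattern set, and the buffer size are all assumptions of mine, not an
existing tool.  A user-space program only touches the pages the kernel
happens to hand it, so a clean run would prove very little about the
DIMMs as a whole.

```python
# Hypothetical user-space memory exerciser (illustrative sketch only).
# It fills a buffer with simple bit patterns, reads them back, and
# counts bytes that do not match.  It cannot choose which physical
# pages it tests, so it is a load generator more than a diagnostic.

PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # zeros, ones, alternating bits


def exercise_memory(size_bytes, passes=1):
    """Return the number of mismatched bytes seen over all passes."""
    buf = bytearray(size_bytes)
    errors = 0
    for _ in range(passes):
        for pattern in PATTERNS:
            expected = bytes([pattern]) * size_bytes
            buf[:] = expected  # write the pattern across the buffer
            if buf != expected:  # read back and compare
                errors += sum(1 for a, b in zip(buf, expected) if a != b)
    return errors


if __name__ == "__main__":
    size = 16 * 1024 * 1024  # 16 MB buffer; an arbitrary choice
    print("mismatched bytes:", exercise_memory(size, passes=4))
```

Running several copies at once, one per CPU, would at least put load on
both processors and the memory bus at the same time.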

I'd also be happy to provide contents of any /proc file or the like.
They are all, however, cosmically normal and boring.  The system boots
normally, runs normally, and then just "dies".  Sometimes under load,
sometimes in mid-keystroke while basically idle.

Help!  Please!

   rgb

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:[EMAIL PROTECTED]


