Not that you haven't already done this, but the first thing I think of is
making all the BIOS settings as conservative as possible, for instance
the memory wait states and such.  Also, can you drop to a single
processor and see if the problem persists?  Best of luck.

On the slightly wacky side of things, here are some more ideas:
If you think that heat could be a problem, keep the case off and use a box
fan to cool it--I didn't make this up myself; someone else solved their
problems that way.  Or maybe the reset switch connector is faulty?  Try
disconnecting the reset switch from the motherboard (note where it goes
for later, unless you have a mobo manual).
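
On the memory tester question: I don't have a specific program to point
you at, but a crude userspace exerciser is easy to sketch.  The one below
is only an illustration, and the function names and default sizes are my
own invention.  It fills a buffer with a few bit patterns and reads them
back.  A userspace test like this can't touch memory the kernel holds,
and the CPU caches can hide marginal DIMMs, so a clean run proves little.
A mismatch, or a hang partway through, would be a useful hint though.

```python
def find_mismatches(buf, pattern):
    """Return the offsets in buf whose byte differs from pattern."""
    # Fast path: count() scans the whole buffer in C.
    if buf.count(pattern) == len(buf):
        return []
    return [i for i, b in enumerate(buf) if b != pattern]


def exercise_memory(size_bytes=64 * 1024 * 1024,
                    patterns=(0x00, 0xFF, 0x55, 0xAA),
                    passes=1):
    """Write each pattern over a buffer and verify it on read-back.

    Returns a list of (pass_number, pattern, offset) tuples, one per
    mismatching byte.  An empty list means no errors were observed.
    """
    errors = []
    buf = bytearray(size_bytes)
    for pass_number in range(passes):
        for pattern in patterns:
            # Overwrite the entire buffer with the current pattern.
            buf[:] = bytes([pattern]) * size_bytes
            for offset in find_mismatches(buf, pattern):
                errors.append((pass_number, pattern, offset))
    return errors


if __name__ == "__main__":
    bad = exercise_memory()
    print("mismatches found:", len(bad))
    for pass_number, pattern, offset in bad[:20]:
        print("pass %d pattern 0x%02X offset %d" % (pass_number, pattern, offset))
```

Run it with a buffer size close to your free RAM, and pull one DIMM at a
time between runs to see whether the failures follow a particular module.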

-Paul Komarek

On Wed, 2 Dec 1998, Robert G. Brown wrote:

> Dear List,
> 
> I have a dual 450 MHz PII built from a SuperMicro P6DGS (440GX)
> motherboard with two deschutes CPUs.  This MoBo has onboard aic7895
> controllers.  The system has a Matrox G200 video card and is currently
> running a linksys sort-of-tulip card.  It has 512 MB of SDRAM.  It was
> running 2.0.35 SMP a couple of days ago; now it is running 2.0.35 SMP.
> 
> This system has, since we got it, exhibited a most painful tendency to
> just die under load, and sometimes under NO load.  Worse, it dies
> without so much as a whimper.  No aiee's, no deadlocks -- it just hangs
> or spontaneously reboots.
> 
> I reported this problem a few months ago and it was suggested that I
> turn off DMA on the IDE drive (it has an IDE drive, an IDE CD-R, a
> SCSI drive, and a SCSI Jaz drive) -- I did so and for a while the
> problem appeared to have disappeared.  It recently reappeared, though,
> just when we really need the system.
> 
> With a spanking new kernel and a problem that has persisted through
> several SMP kernel and driver revisions (I've run 2.0.33, 2.0.35, 2.0.36
> on it with aic7xxx 5.0.19, 5.1.2, and 5.1.4, I've tried eepro100's and
> the linksys ethernet cards with several driver revisions, and although I
> haven't tried the system without the SVGA card the problem occurs even
> when the monitor is doing "nothing" but running an idle VGA console) I'm
> becoming doubtful that the problem is in software.  However, I have
> almost nothing to go on to diagnose the problem in hardware, and it is
> a very bad time to ship the box back for depot repair (especially with a
> diffuse problem like, "Uh, dunno, just reboots itself from time to
> time").
> 
> Any suggestions on how to debug this and isolate a hardware problem?
> Any (negative or positive) experiences with this particular motherboard?
> I've skimmed the smp-faq (of course) and it isn't mentioned, nor is
> there anything suggested there that I haven't already tried (for
> example, yes, the CPU's are the same stepping -- 2 -- and the same
> bogomips, I'm not overclocking (doing long term numerical calculations
> with 10^12 or more Ops I'd be crazy to overclock, but then, only crazy
> people EVER overclock), there is -- literally -- nothing in
> /var/adm/[syslog,messages] and I get no console message at crash time).
> 
> Should I:  
> 
>   a) Enable kernel profiling or in some other way trace what it is doing
> at a crash to at least see if one particular thing causes the crash?
> 
>   b) I've done a fair amount of hardware swap already -- permutation of
> network devices, removal of SCSI and IDE devices -- but I haven't
> removed one DIMM at a time or anything like that, as I have no reason to
> believe that my SDRAM is "bad".  This worries me, however, as I've heard
> on this list that 450 MHz CPU's are very memory sensitive.  Is there any
> way to test this?  Anything I should look for?  I'm using SDRAM provided
> by Aberdeen "for this motherboard" that SHOULD be in spec, but how do I
> tell?  Is there a memory tester/exerciser program available somewhere?
> 
>   c) I'm writing the smp list because it is an SMP motherboard running
> an SMP kernel.  I really don't think that the kernel (SMP vs UP or 2.0.x
> vs 2.1.x) is at fault here.  I've tested both SMP and UP kernels on it
> and managed to get the crash both ways.  Nevertheless, if anybody has
> specific evidence that the kernel might be at fault I'd be happy to try
> anything at this point.
> 
> I'd also be happy to provide contents of any /proc file or the like.
> They are all, however, cosmically normal and boring.  The system boots
> normally, runs normally, and then just "dies".  Sometimes under load,
> sometimes in mid-keystroke while basically idle.
> 
> Help!  Please!
> 
>    rgb
> 
> Robert G. Brown                              http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:[EMAIL PROTECTED]
> 
