The following is based on our experience with dual PII 450 systems (ASUS
P2B-D; we have not tried the DS with onboard SCSI yet, since by our
criteria it has not been stable long enough for a production system).
1. SDRAM -- you need to be very careful about the SDRAM brand. We saw the
same symptoms you describe, and the problem was solved by replacing the
memory with a reliable brand (we got ours from SWT). One test you may
want to try is to reduce the memory bus speed. I assume you are using a
100 MHz memory bus here.
2. Power supply. We found that dual 450 MHz CPU systems are quite
sensitive to power supply quality. ASUS P2B MBs have power checking
built into the BIOS. We found it very useful for diagnosing problems.
3. In our experience, a system with 512MB+ is more stable when the SCSI
drivers are compiled into the kernel instead of built as modules.
However, we have not yet seen any problem using SCSI modules on new
PII 450 systems with kernel 2.0.35+.
4. JAZ drives cause a lot of SCSI problems (I still cannot figure out
why). If you want a heavy-duty system to run stably, don't
attach JAZ/ZIP.
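
On the memory question in item 1: below is a minimal sketch of a
user-space pattern test, assuming a Python interpreter is available on
the box. The names pattern_test and find_mismatches are made up for
illustration. It only touches memory the OS hands to the process, so a
clean run proves little and it is no substitute for a dedicated
boot-time memory tester, but a reported mismatch would point strongly
at the hardware.

```python
# Hypothetical user-space memory pattern check (illustrative names only).
# It fills a buffer with fixed bit patterns and verifies them on read-back.

PATTERNS = (0x00, 0xFF, 0xAA, 0x55)  # all zeros, all ones, alternating bits


def find_mismatches(buf, pattern, limit=10):
    """Return up to `limit` (offset, value) pairs where buf differs from pattern."""
    bad = []
    for offset, value in enumerate(buf):
        if value != pattern:
            bad.append((offset, value))
            if len(bad) >= limit:
                break
    return bad


def pattern_test(size_bytes, passes=1):
    """Write each pattern over a buffer of size_bytes and verify it.

    Returns a list of (pattern, offset, value) tuples; empty means no
    mismatch was seen in the memory this process happened to get.
    """
    buf = bytearray(size_bytes)
    errors = []
    for _ in range(passes):
        for pattern in PATTERNS:
            # Overwrite the whole buffer in place with the pattern byte.
            buf[:] = bytes([pattern]) * size_bytes
            # Fast check: every byte should equal the pattern.
            if buf.count(pattern) != size_bytes:
                for offset, value in find_mismatches(buf, pattern):
                    errors.append((pattern, offset, value))
    return errors
```

For example, pattern_test(64 * 1024 * 1024, passes=10) exercises a 64 MB
buffer ten times. Running it once at the normal bus speed and once at
the reduced speed would give a rough comparison.
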
Hope this is useful.
-- QZ
"Robert G. Brown" wrote:
> Dear List,
>
> I have a dual 450 MHz PII built from a SuperMicro P6DGS (440GX)
> motherboard with two deschutes CPUs. This MoBo has onboard aic7895
> controllers. The system has a Matrox G200 video card and is currently
> running a linksys sort-of-tulip card. It has 512 MB of SDRAM. It was
> running 2.0.35 SMP a couple of days ago; now it is running 2.0.36 SMP.
>
> This system has, since we got it, exhibited a most painful tendency to
> just die under load, and sometimes under NO load. Worse, it dies
> without so much as a whimper. No aiee's, no deadlocks -- it just hangs
> or spontaneously reboots.
>
> I reported this problem a few months ago and it was suggested that I
> turn off DMA on the IDE drive (it has an IDE drive, an IDE CD-R, a
> SCSI drive, and a SCSI Jaz drive) -- I did so and for a while the
> problem appeared to have disappeared. It recently reappeared, though,
> just when we really need the system.
>
> With a spanking new kernel and a problem that has persisted through
> several SMP kernel and driver revisions (I've run 2.0.33, 2.0.35, 2.0.36
> on it with aic7xxx 5.0.19, 5.1.2, and 5.1.4, I've tried eepro100's and
> the linksys ethernet cards with several driver revisions, and although I
> haven't tried the system without the SVGA card the problem occurs even
> when the monitor is doing "nothing" but running an idle VGA console) I'm
> becoming doubtful that the problem is in software. However, I have
> almost nothing to go on to diagnose the problem in hardware, and it is
> a very bad time to ship the box back for depot repair (especially with a
> diffuse problem like, "Uh, dunno, just reboots itself from time to
> time").
>
> Any suggestions on how to debug this and isolate a hardware problem?
> Any (negative or positive) experiences with this particular motherboard?
> I've skimmed the smp-faq (of course) and it isn't mentioned, nor is
> there anything suggested there that I haven't already tried (for
> example, yes, the CPU's are the same stepping -- 2 -- and the same
> bogomips, I'm not overclocking (doing long term numerical calculations
> with 10^12 or more Ops I'd be crazy to overclock, but then, only crazy
> people EVER overclock), there is -- literally -- nothing in
> /var/adm/[syslog,messages] and I get no console message at crash time).
>
> Should I:
>
> a) Enable kernel profiling or in some other way trace what it is doing
> at a crash to at least see if one particular thing causes the crash?
>
> b) I've done a fair amount of hardware swap already -- permutation of
> network devices, removal of SCSI and IDE devices -- but I haven't
> removed one DIMM at a time or anything like that as I have no reason to
> believe that my SDRAM is "bad". This worries me, however, as I've heard
> on this list that 450 MHz CPU's are very memory sensitive. Is there any
> way to test this? Anything I should look for? I'm using SDRAM provided
> by Aberdeen "for this motherboard" that SHOULD be in spec, but how do I
> tell? Is there a memory tester/exerciser program available somewhere?
>
> c) I'm writing the smp list because it is an SMP motherboard running
> an SMP kernel. I really don't think that the kernel (SMP vs UP or 2.0.x
> vs 2.1.x) is at fault here. I've tested both SMP and UP kernels on it
> and managed to get the crash both ways. Nevertheless, if anybody has
> specific evidence that the kernel might be at fault I'd be happy to try
> anything at this point.
>
> I'd also be happy to provide contents of any /proc file or the like.
> They are all, however, cosmically normal and boring. The system boots
> normally, runs normally, and then just "dies". Sometimes under load,
> sometimes in mid-keystroke while basically idle.
>
> Help! Please!
>
> rgb
=================================================================
Qiru Zhou [EMAIL PROTECTED]
2D432 Lucent Bell Laboratories tel (908) 582-4562
700 Mountain Avenue, Murray Hill, NJ 07974 fax (908) 582-7308
=================================================================
-
Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/mentre/smp-faq/
To Unsubscribe: send "unsubscribe linux-smp" to [EMAIL PROTECTED]