On Wed, 13 Jan 1999, Jean SCHMITTBUHL wrote:

> Hello,
> 
>       I read with great interest your emails on the SMP list about
> problems you met with your dual 450 MHz PII. I bought a dual 450 MHz PII
> last October and I have problems that look very similar to yours. My
> machine reboots spontaneously when it is loaded by some jobs. However it is
> perfectly stable with little or no load.
> My configuration is:
> 2x450 MHz PII
> Asus P2BDS MB
> 512 MB
> Viking SCSI disk 4 GB
> ATI video card
> 3c900 ethernet card
> 
> When I underclock the bus frequency to 83 MHz instead of 100 MHz it works
> perfectly. Swapping the processors at 100 MHz does not change the
> instability problem. Did you try to underclock yours? Did you solve your
> stability problem?

I should answer this briefly for the whole group, since the answer is
yes.  We yanked the expensive 256 MB SDRAM DIMMs we got from the vendor
("registered" PC100 and all), replaced them with over-the-counter 128
MB PC100 SDRAM DIMMs from a local vendor, and have never looked back.
Totally stable.  From some of the suggestions I got, it may be that the
P6DBS has trouble with 256 MB DIMMs, or it may be that the memory we got
was simply crap, but I don't have evidence either way, as we decided
to just add nice cheap 128 MB DIMMs as our jobs demanded them rather
than load up to 512 MB a priori anyway.

We tried a whole range of things suggested by the group, however, and
all put together they form a veritable hardware debugging manual that is
well worth adding to the smp FAQ.  Something like:

   a) Removal of all cards but an SVGA card and testing to failure.
Swapping the SVGA card with a completely different one and testing to
failure.  (If it stabilizes on this step, add back cards until point of
failure is found and deal with it by replacement or hacking the driver
or whatever).

   b) Setting the motherboard speed (and hence both CPU and memory
speed) down to the next accessible quantum, which for us meant running
the CPUs at 300 MHz.  This stabilized the system!  That strongly
suggested a problem with the CPUs (possible forgery? possible
overclocks?), the memory (bad memory) or the motherboard itself (yuk!
the one part that is really hard to swap out).

   c) Removal of one CPU (back at full speed), then the other, and
testing to failure.  Removal of all the memory but one DIMM, testing to
failure.  Swapping for the other DIMM(s) and testing to failure (binary
search optional if 3 or more DIMMs).  None of this stabilized our
system.  We checked the CPUs carefully for forgery (apparently a
significant problem, see list archives) but Intel verified their serial
numbers.  Problem almost certainly in the motherboard itself or the
memory subsystem, but not in either particular DIMM.

   d) Finally, we bought the aforementioned 128 MB DIMMs, swapped them
in, and the system was stable with dual CPUs at full speed.  Sent the
(expensive!) 256 MB DIMMs back to the vendor (Aberdeen, Inc.) with an
irate letter.  Long ago vowed never to do business with Aberdeen again
for other reasons (like the utter impossibility of getting service or
even attention from them unless we write the president of the company
PERSONALLY -- fortunately I do indeed have his email address;-); this
reinforced the decision.

   e) Don't know what stage to put this, but I've found putting an
actual multimeter on the power supply to be helpful in the past, as well
as a careful check of its rated peak current/power.  Some systems,
especially ATX systems, won't start unless the power supply can provide
enough current at startup, and vendors sometimes load in the cards or
peripherals after burning in the motherboard (idiots!) and don't realize
that a system built with a cheap PS won't boot.  In our case, I was
running the lm-sensors package and already knew the core voltages and
temperatures to be in the nominal range.

   f) Don't know what stage to put this either, as the SMP kernels are
pretty reliable these days, but it is always a good idea to try both SMP
and UP kernels, and to build "minimal device" kernels and see whether
the system stabilizes with them, to eliminate software as a source of
difficulty.  These days I find hardware MUCH more likely to be the point
of failure if it is something mysterious.  Problems with device drivers
are usually fairly obvious and fixed by the time you get a single clean
boot.
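
The "add back cards until point of failure" idea in (a) and the
"binary search" over DIMMs in (c) can be sketched roughly as below.
This is only an illustration, not a tool we actually ran: the name
find_faulty and the is_stable callback are made up for the example, and
is_stable stands for "install just these parts, load the machine, and
see whether it survives".  It assumes at most one bad part and that the
machine can run with any non-empty subset installed.

```python
from typing import Callable, Optional, Sequence


def find_faulty(parts: Sequence[str],
                is_stable: Callable[[Sequence[str]], bool]) -> Optional[str]:
    """Bisect a set of parts (cards, DIMMs) to find a single bad one.

    is_stable(installed) is a hypothetical callback: in practice a human
    installs exactly those parts, runs the load, and reports the result.
    Returns the suspect part, or None if no single part explains the
    failures (e.g. the motherboard is at fault).
    """
    candidates = list(parts)
    if not candidates or is_stable(candidates):
        return None  # stable with everything installed: nothing to find

    # Halve the candidate set until one suspect remains.
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if is_stable(half):
            candidates = candidates[len(half):]  # culprit is in the rest
        else:
            candidates = half  # culprit is in the half just tested

    suspect = candidates[0]
    # Cross-check: with the suspect removed, the rest should be stable.
    rest = [p for p in parts if p != suspect]
    if rest and not is_stable(rest):
        return None  # still unstable without it: the fault lies elsewhere
    return suspect
```

The final cross-check matters: in our case every combination failed,
which is exactly the situation where the sketch returns None and you
should stop blaming individual parts.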

SO, I don't want to suggest that your problem is certainly memory or
anything like that, but the protocol above might be useful to you and
anyone else.  Perhaps it could get added to the linux-SMP FAQ?
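
For anyone puzzling over the numbers in step (b) and in Jean's 83 MHz
test, here is the arithmetic as a small sketch.  It assumes the CPU
runs at a fixed 4.5x multiple of the bus clock (which is what 450 MHz
on a 100 MHz bus implies); the names are invented for the example and
the bus values are nominal.

```python
# Assumption: a 450 MHz part on a 100 MHz bus implies a 4.5x multiplier.
CPU_MULTIPLIER = 4.5


def core_clock_mhz(bus_mhz: float, multiplier: float = CPU_MULTIPLIER) -> float:
    """Return the CPU core clock in MHz for a given bus clock in MHz."""
    return bus_mhz * multiplier


# Nominal bus settings: full speed, Jean's underclock, and our step (b).
for bus in (100.0, 83.3, 66.67):
    print(f"bus {bus:6.2f} MHz -> core about {core_clock_mhz(bus):.0f} MHz")
```

Note that dropping the bus slows the CPUs and the memory together, so
a system that stabilizes at the lower setting has not yet told you
which of the two (or the motherboard) is marginal.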

    rgb

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:[EMAIL PROTECTED]



-
Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/mentre/smp-faq/
To Unsubscribe: send "unsubscribe linux-smp" to [EMAIL PROTECTED]
