Joerg Schilling wrote:

The machine currently runs fine with Build 30.

Hmmm, it sounds like you have a memory DIMM producing
an uncorrectable ECC error. Prior to build 34 the
error detectors and dispositions were not enabled or
not set correctly, so you got away with it.  Try
disabling ECC if the BIOS allows, or enabling ChipKill
if not already done (this may or may not help, depending
on the DIMM types installed).  A couple of other errors
can also lead to HyperTransport sync floods and the
attendant resets, such as HT CRC errors, but an
uncorrectable ECC error from memory is more likely.

If you'd like to take the hardware divide-and-conquer
approach, try removing DIMMs until you can boot.  You can
even leave a CPU with no memory at all (as long as one
of them has some).  You could also check for
badly seated DIMMs, bent pins in sockets etc.

Since your system has been running like this before
(i.e. perhaps with the error present and perhaps with
data damage already done), you may as well perform
another experiment to suppress the HT sync flood and see
if we can catch the error:

 - boot kmdb -d and before starting up do
 - ::bp cpu.AuthenticAMD.15`ao_mca_init
   to break on mca initialization
 - :c to boot, and wait for breakpoint.
 - ao_nb_cfg_add/W100 (disable northbridge watchdog,
   and don't set syncflood on uncorrectable ecc)
 - :c to continue
 - if we survive I'm hoping that since we've left the
   detectors enabled (just disabled some of the reset
   response) then we'll soon see some events and perhaps
   a diagnosis.  There are some known bugs here since
   the reset normally means this code is never executed,
   but most stuff is ok.
 - so 'fmdump -e' to see if there have been any events;
   'fmdump -ev' and 'fmdump -eV' will show more details
   on any events
 - if a diagnosis is reached you'll see a console message
   and 'fmadm faulty' will point out a dimm.  dimms are
   (in a simplification) numbered 0/1/2/3 on each chip -
   how that maps for your motherboard I don't know.
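
The fmdump and fmadm steps above can be wrapped in a small
shell function so that everything lands in one file you can
post back to the list.  This is only a sketch: the function
name collect_fma_state and the default output file name are
made up for illustration, and it assumes fmdump and fmadm are
on your PATH and that you run it with enough privilege
(probably root) to read the error log.

```shell
#!/bin/sh
# Sketch: gather FMA telemetry once the system has survived boot.
# fmdump and fmadm are the commands named in the steps above;
# collect_fma_state and fma-state.txt are hypothetical names.
collect_fma_state() {
    out=${1:-fma-state.txt}

    # Bail out early if we are not on a host that has the FMA tools.
    if ! command -v fmdump >/dev/null 2>&1; then
        echo "fmdump not found; run this on the affected host" >&2
        return 1
    fi

    {
        echo "== fmdump -e (error event summary) =="
        fmdump -e
        echo "== fmdump -eV (full event detail) =="
        fmdump -eV
        echo "== fmadm faulty (diagnosed faults, if any) =="
        fmadm faulty
    } > "$out" 2>&1

    echo "wrote $out"
}
```

Call it as 'collect_fma_state /var/tmp/fma.txt' after the
machine has been up for a while.  An empty 'fmadm faulty'
section just means no diagnosis has been reached yet.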

We want to make the above easy sometime after working out
how to keep data safe under these conditions (i.e. by
resetting for certain error types).

Gavin
_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org
