fmd itself will not disable DIMMs. The firmware on some SPARC platforms (I don't think the V890 is one of them) may blacklist DIMMs that experience a fatal UE. From the OS' point of view, a blacklisted DIMM will appear to have effectively disappeared. But that is not what happened here.

What fmd will do is "retire" the physical pages of memory on which UEs (or a high rate of CEs) have been detected. This will reduce the amount of physical memory available, though not by a huge amount.

rob


--
Robert Johnston, Fishworks            http://blogs.sun.com/robj


On 03/11/2011 09:21 PM, Jerry Sutton wrote:
I believe you will find that fmd has disabled this DIMM or DIMMs.
If you know how much memory is supposed to be there you can check the
output of "prtconf", or, preferably, that of "prtdiag -v" and compare
that to what you believe is physically installed.
The last time I read the fmadm manpage, as I recall, end users are
expected to NOT run fmadm repair unless specifically instructed to do so
by Sun XXX Oracle support staff. It *may* *not* need to be run at all
when this DIMM is replaced (in my experience, different Sun SPARC
hardware behaves a bit differently with respect to fault management).

Certainly, you do NOT run 'fmadm repair' before actually replacing the
DIMMs identified as failing.

Unless you suffer additional error rate problems with other DIMMs, I
would not expect another panic-induced reboot.
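
The prtconf comparison suggested above could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, the expected figure of 32768 MB is a placeholder and not this server's actual configuration, and it assumes prtconf prints a line of the form "Memory size: N Megabytes".

```shell
#!/bin/sh
# Sketch only: mem_size_mb and check_memory are hypothetical helper names.

# Read prtconf-style output on stdin and print the "Memory size" value (MB).
mem_size_mb() {
    awk '/^Memory size:/ { print $3 }'
}

# Compare the size found on stdin with the expected size given as $1.
# Returns non-zero when the two differ.
check_memory() {
    expected_mb=$1
    reported_mb=$(mem_size_mb)
    if [ "$reported_mb" = "$expected_mb" ]; then
        echo "ok: $reported_mb MB visible to the OS"
    else
        echo "mismatch: expected $expected_mb MB, OS reports $reported_mb MB"
        return 1
    fi
}

# Example use on the host (32768 is a placeholder for the installed size):
#   prtconf | check_memory 32768
```

A mismatch would suggest that memory has dropped out of the OS's view. "prtdiag -v" is still the better place to see which banks are populated.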

On 03/11/11 14:17, Paul Robertson wrote:
Our V890 server reported a memory fault, rebooted, and now shows the
following:

csgams08:~>sudo fmadm faulty
Password:
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Mar 11 00:21:46 7acef7a1-6c9e-49db-9b03-ff6f0d5f911d SUN4U-8000-35  Critical

Fault class : fault.memory.bank 95%
Affects     : mem:///unum=Slot,B:J8100,J8101,J8201,J8200
              degraded but still in service
FRU         : mem:///unum=Slot,B:J8100,J8101,J8201,J8200 95%
Serial ID.  :

Description : The number of errors associated with this memory module
has exceeded acceptable levels. Refer to
http://sun.com/msg/SUN4U-8000-35 for more information.

Response : Pages of memory associated with this memory module are
being removed from service as errors are reported.

Impact : Total system memory capacity will be reduced as pages
are retired.

Action : Schedule a repair procedure to replace the affected
memory module. Use fmdump -v -u <EVENT_ID> to identify the module.
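
For reference, the "Action" step above can be sketched as a small shell helper that pulls the EVENT-ID out of the "fmadm faulty" output and hands it to "fmdump -v -u". The function name extract_event_ids is hypothetical, and the sketch simply assumes the EVENT-ID is the only UUID-shaped token in the output. The stock awk on Solaris is quite old, so nawk or /usr/xpg4/bin/awk may be the safer choice there.

```shell
#!/bin/sh
# Sketch only: extract_event_ids is a hypothetical helper name.

# Print every UUID-shaped token (36 characters, five lowercase-hex groups
# separated by dashes) found on stdin, one per line.
extract_event_ids() {
    awk '{
        for (i = 1; i <= NF; i++)
            if (length($i) == 36 &&
                $i ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/)
                print $i
    }'
}

# Example use on the affected host (needs the same privileges as fmadm):
#   fmadm faulty | extract_event_ids | while read id; do
#       fmdump -v -u "$id"
#   done
```

Scanning every field rather than a fixed column keeps the helper working even if the table row has been wrapped across lines.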

We've scheduled the replacement already, but I want to understand
whether fmd has effectively disabled these DIMMs until such time as
we run "fmadm repair". In other words, is it likely that we'll get
another failure/reboot before we can schedule the maintenance? If so,
I guess we'll try to asr-disable these DIMMs to minimize the risk.

Please advise.

Paul


_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org
