Re: kernel MCA messages

2010-08-25 Thread Andriy Gapon
on 25/08/2010 02:38 Jeremy Chadwick said the following:
 On Tue, Aug 24, 2010 at 07:13:23PM -0400, Dan Langille wrote:
 On 8/22/2010 9:18 PM, Dan Langille wrote:
 What does this mean?

 kernel: MCA: Bank 4, Status 0x940c4001fe080813
 kernel: MCA: Global Cap 0x0105, Status 0x
 kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
 kernel: MCA: CPU 0 COR BUSLG Source RD Memory
 kernel: MCA: Address 0x7ff6b0

 FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43

 FYI, these are occurring every hour, almost to the second. e.g.
 xx:56:yy, where yy is 09, 10, or 11.

 Checking logs, I don't see anything that correlates with this point
 in the hour (i.e 56 minutes past) that doesn't also occur at other
 times.

 It seems very odd to occur so regularly.

I still think that everything of essence has already been said in this thread.

 1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all

Bank 4 here is an MCA reporting bank; it has nothing to do with RAM slots.
Currently on FreeBSD we don't have a standard tool to map a physical address to
a DRAM module, but I am sure there are ways to do it.

the DIMMs just to be sure?  Do this and see if the problem goes
away.  If not, no harm done, and you've narrowed it down.
 
 2) What exact manufacturer and model of motherboard is this?  If
you can provide a link to a User Manual that would be great.
 
 3) Please go into your system BIOS and find where ECC ChipKill
options are available (likely under a Memory, Chipset, or
Northbridge section).  Please write down and provide here all
of the options and what their currently selected values are.
 
 4) Please make sure you're running the latest system BIOS.  I've seen
on certain Rackable AMD-based systems where Northbridge-related
features don't work quite right (at least with Solaris), resulting
in atrocious memory performance on the system.  A BIOS upgrade
solved the problem.
 
 There's a ChipKill feature called ECC BG Scrubbing that's vague in
 definition, given that it's a background memory scrub that happens at
 intervals which are unknown to me.  Maybe 60 minutes?  I don't know.
 This is why I ask question #3.
 
 For John and other devs: I assume the decoded MCA messages indicate with
 absolute certainty that the ECC error is coming from external DRAM and
 not, say, bad L1 or L2 cache?

Have you read the decoded message?
Please re-read it.

I still recommend reading at least the summary of the RAM ECC research article
to make your own judgment about the need to replace the DRAM.

-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: kernel MCA messages

2010-08-25 Thread Tom Evans
On Tue, Aug 24, 2010 at 4:06 PM, John Baldwin j...@freebsd.org wrote:
 On Monday, August 23, 2010 5:35:40 pm Matthew D. Fuller wrote:
 On Mon, Aug 23, 2010 at 08:20:35AM -0400 I heard the voice of
 John Baldwin, and lo! it spake thus:
 
  It is not private, it is in //depot/projects/mcelog/... in p4.

 Which may as well be Siberia for us lowly non-developers.  Any chance
 you could stick a tarball or a patch against upstream mcelog
 somewhere?

 It is actually public at perforce.freebsd.org. :)  However, it is tedious to
 download the files.  It really should be a port perhaps, though Someone (tm)
 should try to get the patches integrated upstream.

 You can find a patch at www.freebsd.org/~jhb/mcelog/.  You will also need to
 download the memstream.c file from there as well and put that in the extracted
 mcelog tarball.


I wrote a small script a while back to extract a tree from perforce
using the web interface, might be handy:

http://www.clearchain.com/~benjsc/downloads/FreeBSD/P4fetch.rb

Cheers

Tom


Re: kernel MCA messages

2010-08-25 Thread Dan Langille

On 8/25/2010 3:11 AM, Andriy Gapon wrote:


Have you read the decoded message?
Please re-read it.

I still recommend reading at least the summary of the RAM ECC research article
to make your own judgment about need to replace DRAM.


Andriy: What is your interpretation of the decoded message?  What is 
your view on replacing DRAM?  What do you conclude from the summary?


--
Dan Langille - http://langille.org/


Re: kernel MCA messages

2010-08-25 Thread Andriy Gapon
on 25/08/2010 13:41 Dan Langille said the following:
 On 8/25/2010 3:11 AM, Andriy Gapon wrote:
 
 Have you read the decoded message?
 Please re-read it.

 I still recommend reading at least the summary of the RAM ECC research article
 to make your own judgment about need to replace DRAM.
 
 Andriy: What is your interpretation of the decoded message?  What is your view on
 replacing DRAM?  What do you conclude from the summary?

Most likely you have a small defect in one of your memory modules, perhaps a
stuck bit.  You will be getting correctable ECC errors for that module.
Eventually you might get an uncorrectable error.  That may happen soon or it may
never happen during the lifetime of that module.

As that study has demonstrated, a significant percentage of systems and modules
report at least one correctable ECC error.  Correctable ECC errors at present
correlate with correctable ECC errors in the future.  They also correlate with
uncorrectable errors in the future.  But the percentage of systems developing
uncorrectable errors is significantly smaller, so the chance of a false positive
is substantial.

You should decide whether you want to replace the module (if you can pinpoint it)
or all modules depending on your resources (money, etc.), the importance of the
service that the server in question provides (allowable downtime and its cost),
and the fault tolerance of the larger system of which the server may be a part
(e.g. it may have a standby server for failover).

I think that most of what I've just said was kind of obvious from the start.
The important bit from that study is that ECC errors are not as random and as
rare as was previously thought, and they can be attributed to a number of factors
like manufacturing defects, layout of memory lanes on motherboard, etc.

-- 
Andriy Gapon


Re: kernel MCA messages

2010-08-25 Thread John Baldwin
On Wednesday, August 25, 2010 12:05:09 am Matthew D. Fuller wrote:
 On Tue, Aug 24, 2010 at 11:06:43AM -0400 I heard the voice of
 John Baldwin, and lo! it spake thus:
  
  It is actually public at perforce.freebsd.org. :)  However, it is
  tedious to download the files.
 
 Oh, I'd apparently blocked out of my mind that you could clicky-clicky
 files one at a time from there.  Probably for the best; I'd be real
 annoyed by the end of that   ;)
 
 
  You can find a patch at www.freebsd.org/~jhb/mcelog/.  You will also
  need to download the memstream.c file from there as well and put
  that in the extracted mcelog tarball.
 
 Thanks!  For anyone following along at home, I needed to make a few
 changes to get it compiling here:
 
 - I'm on a nice recent -CURRENT, so I had to #if 0 out the getline()
   definition.

Yeah, that needs a __FreeBSD_version wrapper.

 - Add a FREEBSD definition to the Makefile (or remember it manually).

I just do 'gmake FREEBSD=yes'.

 - Comment out the kread_symbol() of X_SNAPDATE in mcelog.c.  I don't
   see X_SNAPDATE defined anywhere in my /usr/include, and the var
   doesn't seem to ever actually be read for anything anyway (unless
   I'm supposed to -DLOCAL_HACK...).

Bah, I must have missed that.  You certainly don't want LOCAL_HACK.

-- 
John Baldwin


Re: kernel MCA messages

2010-08-25 Thread John Baldwin
On Tuesday, August 24, 2010 7:13:23 pm Dan Langille wrote:
 On 8/22/2010 9:18 PM, Dan Langille wrote:
  What does this mean?
 
  kernel: MCA: Bank 4, Status 0x940c4001fe080813
  kernel: MCA: Global Cap 0x0105, Status 0x
  kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
  kernel: MCA: CPU 0 COR BUSLG Source RD Memory
  kernel: MCA: Address 0x7ff6b0
 
  FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
 
 FYI, these are occurring every hour, almost to the second. e.g. 
 xx:56:yy, where yy is 09, 10, or 11.
 
 Checking logs, I don't see anything that correlates with this point in 
 the hour (i.e 56 minutes past) that doesn't also occur at other times.
 
 It seems very odd to occur so regularly.

That is because machine checks for corrected errors have to be polled and the 
kernel polls once an hour.  On newer Intel CPUs (such as Nehalem) there is a 
separate interrupt (CMCI) that can fire for corrected errors.
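
The effect of polling on the log timestamps can be illustrated with a small
simulation.  This is only a sketch, not the kernel's code: the function names,
the sample error times, and the 3600-second default are assumptions chosen to
match the "once an hour" description above.  Corrected errors may happen at
arbitrary moments, but each one is only noticed at the next poll, so every log
entry lands on the same minute and second of the hour.

```python
import math


def poll_report_times(error_times, first_poll, interval=3600):
    """Return the sorted poll ticks (in seconds) at which errors would be logged.

    error_times: moments at which corrected errors actually occurred (hypothetical).
    first_poll:  time of the first poll, e.g. 56*60 + 9 for xx:56:09.
    interval:    polling period in seconds (assumed to be one hour).
    """
    ticks = set()
    for t in error_times:
        # Whole intervals to wait until the first poll at or after time t.
        n = max(0, math.ceil((t - first_poll) / interval))
        ticks.add(first_poll + n * interval)
    return sorted(ticks)


def as_mm_ss(t):
    """Format a time in seconds as minute:second within its hour."""
    minutes, seconds = divmod(t % 3600, 60)
    return f"{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # Errors at irregular times still surface at the same point in each hour.
    for tick in poll_report_times([100, 2000, 5000, 9000], first_poll=56 * 60 + 9):
        print(as_mm_ss(tick))
```

If the kernel exposes the polling period as a tunable (on FreeBSD I believe
this is the hw.mca.interval sysctl, but check the output of "sysctl hw.mca" on
your release), changing it would move the log lines without changing how often
the underlying errors occur.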

-- 
John Baldwin


Re: kernel MCA messages

2010-08-25 Thread John Baldwin
On Wednesday, August 25, 2010 7:01:19 am Andriy Gapon wrote:
 on 25/08/2010 13:41 Dan Langille said the following:
  On 8/25/2010 3:11 AM, Andriy Gapon wrote:
  
  Have you read the decoded message?
  Please re-read it.
 
  I still recommend reading at least the summary of the RAM ECC research 
  article
  to make your own judgment about need to replace DRAM.
  
  Andriy: What is your interpretation of the decoded message?  What is your 
  view on
  replacing DRAM?  What do you conclude from the summary?
 
 Most likely you have a small defect in one of your memory modules, perhaps a
 stuck bit.  You will be getting correctable ECC errors for that module.
 Eventually you might get uncorrectable error.  That may happen soon or it may
 never happen during lifetime of that modules.
 
 As that study has demonstrated a significant percentage of systems and modules
 report at least one correctable ECC error.  ECC correctable errors at present
 correlate with correctable ECC errors in the future.  They also correlate with
 uncorrectable errors in the future.  But percentage of systems developing
 uncorrectable errors is significantly smaller, so chances of false positives 
 are
 substantial.
 
 You should decide whether you want to replace the module (if you can pinpoint 
 it)
 or all modules depending on your resources (money, etc), importance of service
 that the server in question provides (allowable downtime and cost of it and
 fault-tolerance of a larger system, of which the server may be a part (e.g. 
 it may
 have a standby server for failover).
 
 I think that most of what I've just said was kind of obvious from the start.
 The important bit from that study is that ECC errors are not as random and as 
 rare
 as was previously thought, and they can be attributed to a number of factors 
 like
 manufacturing defects, layout of memory lanes on motherboard, etc.

A while back I found a slide deck from some Intel presentation that claimed
that a modern 4GB DIMM should average 18 corrected errors a month.  Your
rate is a bit higher than that, but corrected ECC errors are not that
unexpected.
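
For a rough side-by-side of the two rates, here is the arithmetic as a short
sketch.  The assumptions are mine: a 30-day month, and one logged record per
hourly poll, as reported earlier in the thread.  The 18-per-month figure is the
one recalled above for a 4GB DIMM, so the comparison is indicative at best for
a box with four 2GB sticks.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumed month length


def logged_per_month(records_per_hour, hours=HOURS_PER_DAY, days=DAYS_PER_MONTH):
    """Number of logged records in one month at a steady hourly rate."""
    return records_per_hour * hours * days


QUOTED_PER_DIMM_PER_MONTH = 18  # figure recalled from the slide deck

observed = logged_per_month(1)                # one record per hourly poll
ratio = observed / QUOTED_PER_DIMM_PER_MONTH  # how far above the quoted average

if __name__ == "__main__":
    print(observed, ratio)
```

Because the records come from a poll, the count is best read as an estimate of
what was logged rather than an exact count of what happened in the DRAM.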

-- 
John Baldwin


Re: kernel MCA messages

2010-08-25 Thread Andriy Gapon
on 25/08/2010 15:23 John Baldwin said the following:
 That is because machine checks for corrected errors have to be polled and the 
 kernel polls once an hour.  On newer Intel CPUs (such as Nehalem) there is a 
 separate interrupt (CMCI) that can fire for corrected errors.

I think that on AMD it's possible to configure an interrupt for Bank 4 events as
well (perhaps other banks too), but I need to refresh my memory of the BKDG.

-- 
Andriy Gapon


Re: kernel MCA messages

2010-08-25 Thread Andriy Gapon
on 25/08/2010 18:02 Andriy Gapon said the following:
 on 25/08/2010 15:23 John Baldwin said the following:
 That is because machine checks for corrected errors have to be polled and the
 kernel polls once an hour.  On newer Intel CPUs (such as Nehalem) there is a
 separate interrupt (CMCI) that can fire for corrected errors.
 
 I think that on AMD it's possible to configure an interrupt for Bank 4 events as
 well (perhaps other banks too), but I need to refresh my memory of BKDG.

Yeah, for Bank 4 only, configurable via MSR_0413 and the MSRC000_04[0A:08]
Machine Check Misc 4 (Thresholding) Registers.
Also, see section 2.12.1.6, Error Thresholding, in the Fam10h BKDG.

-- 
Andriy Gapon


Re: kernel MCA messages

2010-08-24 Thread Ronald Klop

On Mon, 23 Aug 2010 14:20:35 +0200, John Baldwin j...@freebsd.org wrote:


On Monday, August 23, 2010 2:44:38 am Andriy Gapon wrote:

on 23/08/2010 05:05 Dan Langille said the following:
 On 8/22/2010 9:18 PM, Dan Langille wrote:
 What does this mean?

 kernel: MCA: Bank 4, Status 0x940c4001fe080813
 kernel: MCA: Global Cap 0x0105, Status 0x
 kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
 kernel: MCA: CPU 0 COR BUSLG Source RD Memory
 kernel: MCA: Address 0x7ff6b0

 FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43

 And another one:

 kernel: MCA: Bank 4, Status 0x9459c0014a080813
 kernel: MCA: Global Cap 0x0105, Status 0x
 kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
 kernel: MCA: CPU 0 COR BUSLG Source RD Memory
 kernel: MCA: Address 0x7ff670

I believe that you get correctable RAM ECC errors, but not entirely sure.
There is mcelog utility that decodes such messages into human-friendly
descriptions.
The utility is available on Linux-based systems.
John Baldwin has a port of it to FreeBSD, but it seems to be WIP and is private
so far.  Wait and watch John posting decoded text in this thread :-)


It is not private, it is in //depot/projects/mcelog/... in p4.  It is not a
complete port yet though (doesn't support the daemon and client modes for
example).

Details for these errors:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge
ADDR 7ff6b0
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = fe18
   bit32 = err cpu0
   bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
 generic read mem transaction
 memory access, level generic'
STATUS 940c4001fe080813 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 15 Model 5
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge
ADDR 7ff670
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = 4ab3
   bit32 = err cpu0
   bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
 generic read mem transaction
 memory access, level generic'
STATUS 9459c0014a080813 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 15 Model 5

As Andriy guessed, I believe both of these are corrected ECC errors.  You
can likely ignore them as a low rate of corrected ECC errors is not
unexpected.



Hi,

A little off topic, but what is 'a low rate of corrected ECC errors'? At
work one machine has them about once per day, but runs OK. Is once per day
much?


Ronald.


Re: kernel MCA messages

2010-08-24 Thread Andriy Gapon
on 24/08/2010 09:14 Ronald Klop said the following:
 
 A little off topic, but what is 'a low rate of corrected ECC errors'? At work
 one machine has them like ones per day, but runs ok. Is ones per day much?

That's up to your judgment.  It's like asking after how many remapped sectors
you replace an HDD.
You may find this interesting:
http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

-- 
Andriy Gapon


Re: kernel MCA messages

2010-08-24 Thread John Baldwin
On Monday, August 23, 2010 5:35:40 pm Matthew D. Fuller wrote:
 On Mon, Aug 23, 2010 at 08:20:35AM -0400 I heard the voice of
 John Baldwin, and lo! it spake thus:
 
  It is not private, it is in //depot/projects/mcelog/... in p4.
 
 Which may as well be Siberia for us lowly non-developers.  Any chance
 you could stick a tarball or a patch against upstream mcelog
 somewhere?

It is actually public at perforce.freebsd.org. :)  However, it is tedious to 
download the files.  It really should be a port perhaps, though Someone (tm) 
should try to get the patches integrated upstream.

You can find a patch at www.freebsd.org/~jhb/mcelog/.  You will also need to 
download the memstream.c file from there as well and put that in the extracted 
mcelog tarball.

-- 
John Baldwin


Re: kernel MCA messages

2010-08-24 Thread Artem Belevich
IMHO the key here is whether the hardware is broken or not. The only case
where correctable ECC errors are OK is when a bit gets flipped by a
high-energy particle. That's a normal but fairly rare event. If you
get bit flips often enough that you can recall details of more than
one of them on the same hardware, my guess would be that you're
dealing with something else: bad/marginal memory, signal integrity
issues, power issues, overheating... The list goes on. In all those
cases the hardware does *not* work correctly. Whether you can (or want to)
keep running stuff on hardware that is broken is another question.

--Artem



On Tue, Aug 24, 2010 at 1:15 AM, Andriy Gapon a...@icyb.net.ua wrote:
 on 24/08/2010 09:14 Ronald Klop said the following:

 A little off topic, but what is 'a low rate of corrected ECC errors'? At work
 one machine has them like ones per day, but runs ok. Is ones per day much?

 That's up to your judgment.  It's like after how many remapped sectors do you
 replace HDD.
 You may find this interesting:
 http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

 --
 Andriy Gapon



Re: kernel MCA messages

2010-08-24 Thread Andriy Gapon
on 24/08/2010 22:51 Artem Belevich said the following:
 IMHO the key here is whether hardware is broken or not. The only case
 where correctable ECC errors are OK is when a bit gets flipped by a
 high-energy particle. That's a normal but fairly rare event. If you
 get bit flips often enough that you can recall details of more then
 one of them on the same hardware, my guess would be that you're
 dealing with something else -- bad/marginal memory, signal integrity
 issues, power issues, overheating... The list continues.. In all those
 cases hardware does *not* work correctly. Whether you can (or want to)
 keep running stuff on the hardware that is broken is another question.

Have you read the article? :)
If not, read at least the summary.

 On Tue, Aug 24, 2010 at 1:15 AM, Andriy Gapon a...@icyb.net.ua wrote:
 on 24/08/2010 09:14 Ronald Klop said the following:

 A little off topic, but what is 'a low rate of corrected ECC errors'? At 
 work
 one machine has them like ones per day, but runs ok. Is ones per day much?

 That's up to your judgment.  It's like after how many remapped sectors do you
 replace HDD.
 You may find this interesting:
 http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

 --
 Andriy Gapon

-- 
Andriy Gapon


Re: kernel MCA messages

2010-08-24 Thread Dan Langille

On 8/22/2010 9:18 PM, Dan Langille wrote:

What does this mean?

kernel: MCA: Bank 4, Status 0x940c4001fe080813
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Source RD Memory
kernel: MCA: Address 0x7ff6b0

FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43


FYI, these are occurring every hour, almost to the second. e.g. 
xx:56:yy, where yy is 09, 10, or 11.


Checking logs, I don't see anything that correlates with this point in 
the hour (i.e 56 minutes past) that doesn't also occur at other times.


It seems very odd to occur so regularly.

--
Dan Langille - http://langille.org/


Re: kernel MCA messages

2010-08-24 Thread Jeremy Chadwick
On Tue, Aug 24, 2010 at 07:13:23PM -0400, Dan Langille wrote:
 On 8/22/2010 9:18 PM, Dan Langille wrote:
 What does this mean?
 
 kernel: MCA: Bank 4, Status 0x940c4001fe080813
 kernel: MCA: Global Cap 0x0105, Status 0x
 kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
 kernel: MCA: CPU 0 COR BUSLG Source RD Memory
 kernel: MCA: Address 0x7ff6b0
 
 FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
 
 FYI, these are occurring every hour, almost to the second. e.g.
 xx:56:yy, where yy is 09, 10, or 11.
 
 Checking logs, I don't see anything that correlates with this point
 in the hour (i.e 56 minutes past) that doesn't also occur at other
 times.
 
 It seems very odd to occur so regularly.

1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all
   the DIMMs just to be sure?  Do this and see if the problem goes
   away.  If not, no harm done, and you've narrowed it down.

2) What exact manufacturer and model of motherboard is this?  If
   you can provide a link to a User Manual that would be great.

3) Please go into your system BIOS and find where ECC ChipKill
   options are available (likely under a Memory, Chipset, or
   Northbridge section).  Please write down and provide here all
   of the options and what their currently selected values are.

4) Please make sure you're running the latest system BIOS.  I've seen
   on certain Rackable AMD-based systems where Northbridge-related
   features don't work quite right (at least with Solaris), resulting
   in atrocious memory performance on the system.  A BIOS upgrade
   solved the problem.

There's a ChipKill feature called ECC BG Scrubbing that's vague in
definition, given that it's a background memory scrub that happens at
intervals which are unknown to me.  Maybe 60 minutes?  I don't know.
This is why I ask question #3.

For John and other devs: I assume the decoded MCA messages indicate with
absolute certainty that the ECC error is coming from external DRAM and
not, say, bad L1 or L2 cache?

-- 
| Jeremy Chadwick   j...@parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: kernel MCA messages

2010-08-24 Thread Dan Langille

On 8/24/2010 7:38 PM, Jeremy Chadwick wrote:

On Tue, Aug 24, 2010 at 07:13:23PM -0400, Dan Langille wrote:

On 8/22/2010 9:18 PM, Dan Langille wrote:

What does this mean?

kernel: MCA: Bank 4, Status 0x940c4001fe080813
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Source RD Memory
kernel: MCA: Address 0x7ff6b0

FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43


FYI, these are occurring every hour, almost to the second. e.g.
xx:56:yy, where yy is 09, 10, or 11.

Checking logs, I don't see anything that correlates with this point
in the hour (i.e 56 minutes past) that doesn't also occur at other
times.

It seems very odd to occur so regularly.


1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all
the DIMMs just to be sure?  Do this and see if the problem goes
away.  If not, no harm done, and you've narrowed it down.


For good reason: time and distance.  I've not had the time or
opportunity to buy new RAM.  Today is Tuesday.  The problem appeared
about 48 hours ago after upgrading to 8.1 stable from 7.x.  The box is
in Austin.  I'm in Philadelphia.  You know the math.  ;)  When I can get
the time to fly to Austin, I will if required.


I'm sorry, I don't mean to be flippant.  I'm just glad I documented
as much as I could 4 years ago.



2) What exact manufacturer and model of motherboard is this?  If
you can provide a link to a User Manual that would be great.


 This is a box from iXsystems that I obtained back when 6.1-RELEASE was 
the latest.  I know it has four sticks of 2GB.


   http://www.freebsddiary.org/dual-opteron.php

Sadly, many of the links are now invalid. The board is an AccelerTech
ATO2161-DC, also known as a RioWorks HDAMA-G.


See also:

  http://www.freebsddiary.org/dual-opteron-dmidecode.txt

And we have a close up of the RAM and the m/b:

  http://www.freebsddiary.org/showpicture.php?id=85
  http://www.freebsddiary.org/showpicture.php?id=84

I am quite sure it's very close to this:

  http://www.accelertech.com/2007/amd_mb/opteron/ato2161i-dc_pic.php

With the manual here:

  http://www.accelertech.com/2007/amd_mb/opteron/ato2161i-dc_manual.php


3) Please go into your system BIOS and find where ECC ChipKill
options are available (likely under a Memory, Chipset, or
Northbridge section).  Please write down and provide here all
of the options and what their currently selected values are.

4) Please make sure you're running the latest system BIOS.  I've seen
on certain Rackable AMD-based systems where Northbridge-related
features don't work quite right (at least with Solaris), resulting
in atrocious memory performance on the system.  A BIOS upgrade
solved the problem.


#3 and #4 are just as hard as #1 at the moment.


There's a ChipKill feature called ECC BG Scrubbing that's vague in
definition, given that it's a background memory scrub that happens at
intervals which are unknown to me.  Maybe 60 minutes?  I don't know.
This is why I ask question #3.

For John and other devs: I assume the decoded MCA messages indicate with
absolute certainty that the ECC error is coming from external DRAM and
not, say, bad L1 or L2 cache?


Nice question.

--
Dan Langille - http://langille.org/


Re: kernel MCA messages

2010-08-24 Thread Matthew D. Fuller
On Tue, Aug 24, 2010 at 11:06:43AM -0400 I heard the voice of
John Baldwin, and lo! it spake thus:
 
 It is actually public at perforce.freebsd.org. :)  However, it is
 tedious to download the files.

Oh, I'd apparently blocked out of my mind that you could clicky-clicky
files one at a time from there.  Probably for the best; I'd be real
annoyed by the end of that   ;)


 You can find a patch at www.freebsd.org/~jhb/mcelog/.  You will also
 need to download the memstream.c file from there as well and put
 that in the extracted mcelog tarball.

Thanks!  For anyone following along at home, I needed to make a few
changes to get it compiling here:

- I'm on a nice recent -CURRENT, so I had to #if 0 out the getline()
  definition.

- Add a FREEBSD definition to the Makefile (or remember it manually).

- Comment out the kread_symbol() of X_SNAPDATE in mcelog.c.  I don't
  see X_SNAPDATE defined anywhere in my /usr/include, and the var
  doesn't seem to ever actually be read for anything anyway (unless
  I'm supposed to -DLOCAL_HACK...).


-- 
Matthew Fuller (MF4839)   |  fulle...@over-yonder.net
Systems/Network Administrator |  http://www.over-yonder.net/~fullermd/
   On the Internet, nobody can hear you scream.


Re: kernel MCA messages

2010-08-23 Thread Andriy Gapon
on 23/08/2010 05:05 Dan Langille said the following:
 On 8/22/2010 9:18 PM, Dan Langille wrote:
 What does this mean?

 kernel: MCA: Bank 4, Status 0x940c4001fe080813
 kernel: MCA: Global Cap 0x0105, Status 0x
 kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
 kernel: MCA: CPU 0 COR BUSLG Source RD Memory
 kernel: MCA: Address 0x7ff6b0

 FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
 
 And another one:
 
 kernel: MCA: Bank 4, Status 0x9459c0014a080813
 kernel: MCA: Global Cap 0x0105, Status 0x
 kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
 kernel: MCA: CPU 0 COR BUSLG Source RD Memory
 kernel: MCA: Address 0x7ff670

I believe that you get correctable RAM ECC errors, but not entirely sure.
There is mcelog utility that decodes such messages into human-friendly 
descriptions.
The utility is available on Linux-based systems.
John Baldwin has a port of it to FreeBSD, but it seems to be WIP and is private
so far.  Wait and watch John posting decoded text in this thread :-)
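
Even without mcelog, the terse "BUSLG Source RD Memory" text can be recovered
by hand from the low 16 bits of the Status value (0x0813 in both reports).  The
sketch below is not the kernel's or mcelog's code; it follows the architectural
"bus and interconnect" compound error code layout (0000 1PPT RRRR IILL) as I
understand it, and the function name, table names, and label strings are my own
choices.

```python
LEVELS = ["L0", "L1", "L2", "LG"]                              # LL field
PARTICIPATION = ["Source", "Responder", "Observer", "Generic"]  # PP field
REQUESTS = {0: "GEN", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
            5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}    # RRRR field
TARGETS = {0: "Memory", 2: "I/O", 3: "Other"}                   # II field


def decode_bus_error(code):
    """Decode a 16-bit MCA error code if it is a bus/interconnect error.

    Returns a short description, or None when the code is of another class.
    """
    if (code & 0xF800) != 0x0800:      # bus errors have the form 0000 1xxx ...
        return None
    level = LEVELS[code & 0x3]
    target = TARGETS.get((code >> 2) & 0x3, "???")
    request = REQUESTS.get((code >> 4) & 0xF, "???")
    timed_out = bool((code >> 8) & 0x1)
    who = PARTICIPATION[(code >> 9) & 0x3]
    text = f"BUS{level} {who} {request} {target}"
    if timed_out:
        text += " timed out"
    return text


if __name__ == "__main__":
    status = 0x940c4001fe080813        # first Status value quoted above
    print(decode_bus_error(status & 0xFFFF))
```

Read this way, the message describes a generic-level read of memory that
originated on the local node and did not time out.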

-- 
Andriy Gapon


Re: kernel MCA messages

2010-08-23 Thread John Baldwin
On Monday, August 23, 2010 2:44:38 am Andriy Gapon wrote:
 on 23/08/2010 05:05 Dan Langille said the following:
  On 8/22/2010 9:18 PM, Dan Langille wrote:
  What does this mean?
 
  kernel: MCA: Bank 4, Status 0x940c4001fe080813
  kernel: MCA: Global Cap 0x0105, Status 0x
  kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
  kernel: MCA: CPU 0 COR BUSLG Source RD Memory
  kernel: MCA: Address 0x7ff6b0
 
  FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
  
  And another one:
  
  kernel: MCA: Bank 4, Status 0x9459c0014a080813
  kernel: MCA: Global Cap 0x0105, Status 0x
  kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
  kernel: MCA: CPU 0 COR BUSLG Source RD Memory
  kernel: MCA: Address 0x7ff670
 
 I believe that you get correctable RAM ECC errors, but not entirely sure.
 There is mcelog utility that decodes such messages into human-friendly 
 descriptions.
 The utility is available on Linux-based systems.
 John Baldwin has a port of it to FreeBSD, but it seems to be WIP and is 
 private
 so far.  Wait and watch John posting decoded text in this thread :-)

It is not private, it is in //depot/projects/mcelog/... in p4.  It is not a
complete port yet though (doesn't support the daemon and client modes for
example).

Details for these errors:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge 
ADDR 7ff6b0 
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = fe18
   bit32 = err cpu0
   bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
 generic read mem transaction
 memory access, level generic'
STATUS 940c4001fe080813 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 5
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge 
ADDR 7ff670 
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = 4ab3
   bit32 = err cpu0
   bit46 = corrected ecc error
  bus error 'local node origin, request didn't time out
 generic read mem transaction
 memory access, level generic'
STATUS 9459c0014a080813 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 5

As Andriy guessed, I believe both of these are corrected ECC errors.  You
can likely ignore them as a low rate of corrected ECC errors is not
unexpected.
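
The decoded fields above can be reproduced by hand from the raw status
word.  What follows is only a minimal sketch in Python, not code taken from
mcelog or the kernel.  It assumes the usual MCi_STATUS flag positions (VAL
bit 63, OVER bit 62, UC bit 61, ADDRV bit 58) and the AMD K8 northbridge
layout that the output above implies (corrected-ECC flag at bit 46, Chipkill
syndrome split across bits 31:24 and 54:47).  The function name and the
dictionary keys are made up for illustration.

```python
def decode_nb_status(status):
    """Pick apart a bank 4 (northbridge) MCi_STATUS value from an AMD K8."""
    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),          # VAL: the register holds a logged error
        "overflow": bit(62),       # OVER: an earlier error was overwritten
        "uncorrected": bit(61),    # UC: clear means the error was corrected
        "addr_valid": bit(58),     # ADDRV: the Address line is meaningful
        "corrected_ecc": bit(46),  # "bit46 = corrected ecc error"
        "err_cpu0": bit(32),       # "bit32 = err cpu0"
        "ext_error_code": (status >> 16) & 0xF,
        # Chipkill syndrome: bits 31:24 are the high byte, bits 54:47 the low byte.
        "syndrome": (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF),
    }


for raw in (0x940C4001FE080813, 0x9459C0014A080813):
    fields = decode_nb_status(raw)
    kind = "uncorrected" if fields["uncorrected"] else "corrected"
    print("%016x: %s, syndrome %04x" % (raw, kind, fields["syndrome"]))
# prints:
# 940c4001fe080813: corrected, syndrome fe18
# 9459c0014a080813: corrected, syndrome 4ab3
```

Both syndromes agree with the mcelog output, which is a reasonable check
that the assumed bit positions are right for this CPU family.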

-- 
John Baldwin


Re: kernel MCA messages

2010-08-23 Thread Matthew D. Fuller
On Mon, Aug 23, 2010 at 08:20:35AM -0400 I heard the voice of
John Baldwin, and lo! it spake thus:

 It is not private, it is in //depot/projects/mcelog/... in p4.

Which may as well be Siberia for us lowly non-developers.  Any chance
you could stick a tarball or a patch against upstream mcelog
somewhere?


-- 
Matthew Fuller (MF4839)   |  fulle...@over-yonder.net
Systems/Network Administrator |  http://www.over-yonder.net/~fullermd/
   On the Internet, nobody can hear you scream.


Re: kernel MCA messages

2010-08-23 Thread Dan Langille

On 8/22/2010 10:05 PM, Dan Langille wrote:

On 8/22/2010 9:18 PM, Dan Langille wrote:

What does this mean?

kernel: MCA: Bank 4, Status 0x940c4001fe080813
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Source RD Memory
kernel: MCA: Address 0x7ff6b0

FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43


And another one:

kernel: MCA: Bank 4, Status 0x9459c0014a080813
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Source RD Memory
kernel: MCA: Address 0x7ff670


kernel: MCA: Bank 4, Status 0x947ec000d8080a13
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Responder RD Memory
kernel: MCA: Address 0xbfa9930

Another one.

These errors started appearing after upgrading to 8.1-STABLE from 7.2.. 
something.  I suspect the functionality was added about then
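
The only visible difference between this record and the two earlier ones is
"Responder" instead of "Source".  That word comes from the low 16 bits of the
status value.  A small Python sketch, assuming the standard x86 machine-check
bus/interconnect error-code format 0000 1PPT RRRR IILL, shows how the summary
line can be rebuilt.  The table and function names are illustrative, and this
is not the code in mca.c.

```python
# Field tables for the bus/interconnect error code 0000 1PPT RRRR IILL.
PARTICIPATION = ["Source", "Responder", "Observer", "Generic"]  # PP, bits 10:9
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD",
           "PREFETCH", "EVICT", "SNOOP"]                        # RRRR, bits 7:4
MEM_IO = ["Memory", "Reserved", "I/O", "Other"]                 # II, bits 3:2
LEVEL = ["L0", "L1", "L2", "LG"]                                # LL, bits 1:0


def describe_bus_error(status):
    """Return a one-line summary, or None if this is not a bus error code."""
    code = status & 0xFFFF
    if (code & 0xF800) != 0x0800:   # bits 15:12 must be 0 and bit 11 must be 1
        return None
    request = (code >> 4) & 0xF
    parts = [
        "UNCOR" if (status >> 61) & 1 else "COR",   # UC flag, bit 61
        "BUS" + LEVEL[code & 0x3],
        PARTICIPATION[(code >> 9) & 0x3],
        REQUEST[request] if request < len(REQUEST) else "???",
        MEM_IO[(code >> 2) & 0x3],
    ]
    if code & 0x0100:               # T, bit 8: the request timed out
        parts.append("timed out")
    return " ".join(parts)


print(describe_bus_error(0x940C4001FE080813))  # COR BUSLG Source RD Memory
print(describe_bus_error(0x947EC000D8080A13))  # COR BUSLG Responder RD Memory
```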



--
Dan Langille - http://langille.org/


Re: kernel MCA messages

2010-08-23 Thread Andriy Gapon
on 24/08/2010 02:43 Dan Langille said the following:
 On 8/22/2010 10:05 PM, Dan Langille wrote:
 On 8/22/2010 9:18 PM, Dan Langille wrote:
 What does this mean?

 kernel: MCA: Bank 4, Status 0x940c4001fe080813
 kernel: MCA: Global Cap 0x0105, Status 0x
 kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
 kernel: MCA: CPU 0 COR BUSLG Source RD Memory
 kernel: MCA: Address 0x7ff6b0

 FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43

 And another one:

 kernel: MCA: Bank 4, Status 0x9459c0014a080813
 kernel: MCA: Global Cap 0x0105, Status 0x
 kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
 kernel: MCA: CPU 0 COR BUSLG Source RD Memory
 kernel: MCA: Address 0x7ff670
 
 kernel: MCA: Bank 4, Status 0x947ec000d8080a13
 kernel: MCA: Global Cap 0x0105, Status 0x
 kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
 kernel: MCA: CPU 0 COR BUSLG Responder RD Memory
 kernel: MCA: Address 0xbfa9930
 
 Another one.
 
 These errors started appearing after upgrading to 8.1-STABLE from 7.2..
 something.  I suspect the functionality was added about then

Please stop the flood :-)
Depending on hardware there could be hundreds of such errors per day.
Either replace memory modules or learn to live with these messages.
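
One way to judge whether the rate is worth acting on is to tally the
records by address.  A rough Python sketch follows.  It assumes log lines in
the "kernel: MCA: Address 0x..." form quoted above, and the function name and
the sample list are only for illustration.

```python
import re
from collections import Counter

# Matches the "MCA: Address 0x..." line that the kernel logs for each record.
ADDRESS_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")


def count_addresses(lines):
    """Count how many MCA records were logged for each physical address."""
    counts = Counter()
    for line in lines:
        match = ADDRESS_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# The three records quoted in this thread; in practice the lines would be
# read from the system log instead.
sample = [
    "kernel: MCA: Bank 4, Status 0x940c4001fe080813",
    "kernel: MCA: Address 0x7ff6b0",
    "kernel: MCA: Bank 4, Status 0x9459c0014a080813",
    "kernel: MCA: Address 0x7ff670",
    "kernel: MCA: Bank 4, Status 0x947ec000d8080a13",
    "kernel: MCA: Address 0xbfa9930",
]

for address, count in sorted(count_addresses(sample).items()):
    print(address, count)
```

Many hits on one address or a few neighbouring ones would point at a
single weak spot, while addresses spread over the whole range would not.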

-- 
Andriy Gapon


Re: kernel MCA messages

2010-08-23 Thread Dan Langille

On 8/23/2010 7:47 PM, Andriy Gapon wrote:

on 24/08/2010 02:43 Dan Langille said the following:

On 8/22/2010 10:05 PM, Dan Langille wrote:

On 8/22/2010 9:18 PM, Dan Langille wrote:

What does this mean?

kernel: MCA: Bank 4, Status 0x940c4001fe080813
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Source RD Memory
kernel: MCA: Address 0x7ff6b0

FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43


And another one:

kernel: MCA: Bank 4, Status 0x9459c0014a080813
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Source RD Memory
kernel: MCA: Address 0x7ff670


kernel: MCA: Bank 4, Status 0x947ec000d8080a13
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Responder RD Memory
kernel: MCA: Address 0xbfa9930

Another one.

These errors started appearing after upgrading to 8.1-STABLE from 7.2..
something.  I suspect the functionality was added about then


Please stop the flood :-)


Sure.  Three emails is hardly a flood.  :)


Depending on hardware there could be hundreds of such errors per day.
Either replace memory modules or learn to live with these messages.


I was posting a remark anyway.  Thought I'd include one more that I
noticed.  Surely you can cope.  :)


--
Dan Langille - http://langille.org/


Re: kernel MCA messages

2010-08-22 Thread Daniel O'Connor

On 23/08/2010, at 10:48, Dan Langille wrote:
 What does this mean?
 
 kernel: MCA: Bank 4, Status 0x940c4001fe080813
 kernel: MCA: Global Cap 0x0105, Status 0x
 kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
 kernel: MCA: CPU 0 COR BUSLG Source RD Memory
 kernel: MCA: Address 0x7ff6b0
 
 FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43

It's generated by machine check support, see..
/usr/src/sys/i386/i386/mca.c
/usr/src/sys/amd64/amd64/mca.c

Some info here..
http://en.wikipedia.org/wiki/Machine_check_architecture

No man page for it though.

--
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
The nice thing about standards is that there
are so many of them to choose from.
  -- Andrew Tanenbaum
GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C

Re: kernel MCA messages

2010-08-22 Thread Dan Langille

On 8/22/2010 9:18 PM, Dan Langille wrote:

What does this mean?

kernel: MCA: Bank 4, Status 0x940c4001fe080813
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Source RD Memory
kernel: MCA: Address 0x7ff6b0

FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43


And another one:

kernel: MCA: Bank 4, Status 0x9459c0014a080813
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Source RD Memory
kernel: MCA: Address 0x7ff670


--
Dan Langille - http://langille.org/