Re: kernel MCA messages
on 25/08/2010 02:38 Jeremy Chadwick said the following:
> On Tue, Aug 24, 2010 at 07:13:23PM -0400, Dan Langille wrote:
>> On 8/22/2010 9:18 PM, Dan Langille wrote:
>>> What does this mean?
>>>
>>> kernel: MCA: Bank 4, Status 0x940c4001fe080813
>>> kernel: MCA: Global Cap 0x0105, Status 0x
>>> kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
>>> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>>> kernel: MCA: Address 0x7ff6b0
>>>
>>> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>>
>> FYI, these are occurring every hour, almost to the second. e.g. xx:56:yy, where yy is 09, 10, or 11. Checking logs, I don't see anything that correlates with this point in the hour (i.e 56 minutes past) that doesn't also occur at other times. It seems very odd to occur so regularly.

I still think that everything of essence has already been said in this thread.

> 1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all

Bank 4 here is the MCA reporting bank; it has nothing to do with RAM slots. Currently on FreeBSD we don't have a standard tool to map a physical address to a DRAM module, but I am sure that there could be some ways to do it.

> the DIMMs just to be sure? Do this and see if the problem goes away. If not, no harm done, and you've narrowed it down.
>
> 2) What exact manufacturer and model of motherboard is this? If you can provide a link to a User Manual that would be great.
>
> 3) Please go into your system BIOS and find where ECC ChipKill options are available (likely under a Memory, Chipset, or Northbridge section). Please write down and provide here all of the options and what their currently selected values are.
>
> 4) Please make sure you're running the latest system BIOS. I've seen on certain Rackable AMD-based systems where Northbridge-related features don't work quite right (at least with Solaris), resulting in atrocious memory performance on the system. A BIOS upgrade solved the problem.
>
> There's a ChipKill feature called ECC BG Scrubbing that's vague in definition, given that it's a background memory scrub that happens at intervals which are unknown to me. Maybe 60 minutes? I don't know. This is why I ask question #3.
>
> For John and other devs: I assume the decoded MCA messages indicate with absolute certainty that the ECC error is coming from external DRAM and not, say, bad L1 or L2 cache?

Have you read the decoded message? Please re-read it.

I still recommend reading at least the summary of the RAM ECC research article to make your own judgment about the need to replace DRAM.

--
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: kernel MCA messages
On Tue, Aug 24, 2010 at 4:06 PM, John Baldwin j...@freebsd.org wrote: On Monday, August 23, 2010 5:35:40 pm Matthew D. Fuller wrote: On Mon, Aug 23, 2010 at 08:20:35AM -0400 I heard the voice of John Baldwin, and lo! it spake thus: It is not private, it is in //depot/projects/mcelog/... in p4. Which may as well be Siberia for us lowly non-developers. Any chance you could stick a tarball or a patch against upstream mcelog somewhere? It is actually public at perforce.freebsd.org. :) However, it is tedious to download the files. It really should be a port perhaps, though Someone (tm) should try to get the patches integrated upstream. You can find a patch at www.freebsd.org/~jhb/mcelog/. You will also need to download the memstream.c file from there as well and put that in the extracted mcelog tarball. I wrote a small script a while back to extract a tree from perforce using the web interface, might be handy: http://www.clearchain.com/~benjsc/downloads/FreeBSD/P4fetch.rb Cheers Tom ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: kernel MCA messages
On 8/25/2010 3:11 AM, Andriy Gapon wrote:
> Have you read the decoded message? Please re-read it.
>
> I still recommend reading at least the summary of the RAM ECC research article to make your own judgment about need to replace DRAM.

Andriy: What is your interpretation of the decoded message? What is your view on replacing DRAM? What do you conclude from the summary?

--
Dan Langille - http://langille.org/
Re: kernel MCA messages
on 25/08/2010 13:41 Dan Langille said the following:
> On 8/25/2010 3:11 AM, Andriy Gapon wrote:
>> Have you read the decoded message? Please re-read it.
>>
>> I still recommend reading at least the summary of the RAM ECC research article to make your own judgment about need to replace DRAM.
>
> Andriy: What is your interpretation of the decoded message? What is your view on replacing DRAM? What do you conclude from the summary?

Most likely you have a small defect in one of your memory modules, perhaps a stuck bit. You will be getting correctable ECC errors for that module. Eventually you might get an uncorrectable error. That may happen soon, or it may never happen during the lifetime of that module.

As that study has demonstrated, a significant percentage of systems and modules report at least one correctable ECC error. Correctable ECC errors at present correlate with correctable ECC errors in the future. They also correlate with uncorrectable errors in the future. But the percentage of systems developing uncorrectable errors is significantly smaller, so the chances of false positives are substantial.

You should decide whether you want to replace the module (if you can pinpoint it) or all modules depending on your resources (money, etc.), the importance of the service that the server in question provides (allowable downtime and its cost), and the fault tolerance of a larger system of which the server may be a part (e.g. it may have a standby server for failover).

I think that most of what I've just said was kind of obvious from the start. The important bit from that study is that ECC errors are not as random and as rare as was previously thought, and they can be attributed to a number of factors like manufacturing defects, layout of memory lanes on the motherboard, etc.

--
Andriy Gapon
Re: kernel MCA messages
On Wednesday, August 25, 2010 12:05:09 am Matthew D. Fuller wrote: On Tue, Aug 24, 2010 at 11:06:43AM -0400 I heard the voice of John Baldwin, and lo! it spake thus: It is actually public at perforce.freebsd.org. :) However, it is tedious to download the files. Oh, I'd apparently blocked out of my mind that you could clicky-clicky files one at a time from there. Probably for the best; I'd be real annoyed by the end of that ;) You can find a patch at www.freebsd.org/~jhb/mcelog/. You will also need to download the memstream.c file from there as well and put that in the extracted mcelog tarball. Thanks! For anyone following along at home, I needed to make a few changes to get it compiling here: - I'm on a nice recent -CURRENT, so I had to #if 0 out the getline() definition. Yeah, that needs a __FreeBSD_version wrapper. - Add a FREEBSD definition to the Makefile (or remember it manually). I just do 'gmake FREEBSD=yes'. - Comment out the kread_symbol() of X_SNAPDATE in mcelog.c. I don't see X_SNAPDATE defined anywhere in my /usr/include, and the var doesn't seem to ever actually be read for anything anyway (unless I'm supposed to -DLOCAL_HACK...). Bah, I must have missed that. You certainly don't want LOCAL_HACK. -- John Baldwin ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: kernel MCA messages
On Tuesday, August 24, 2010 7:13:23 pm Dan Langille wrote:
> On 8/22/2010 9:18 PM, Dan Langille wrote:
>> What does this mean?
>>
>> kernel: MCA: Bank 4, Status 0x940c4001fe080813
>> kernel: MCA: Global Cap 0x0105, Status 0x
>> kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
>> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>> kernel: MCA: Address 0x7ff6b0
>>
>> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>
> FYI, these are occurring every hour, almost to the second. e.g. xx:56:yy, where yy is 09, 10, or 11. Checking logs, I don't see anything that correlates with this point in the hour (i.e 56 minutes past) that doesn't also occur at other times. It seems very odd to occur so regularly.

That is because machine checks for corrected errors have to be polled and the kernel polls once an hour. On newer Intel CPUs (such as Nehalem) there is a separate interrupt (CMCI) that can fire for corrected errors.

--
John Baldwin
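Editorial sketch: the hourly-polling explanation can be illustrated with a toy model. This is not the FreeBSD kernel code. It assumes (as a simplification) that a bank holds at most one reportable record per poll window, and the error arrival times and the 56:10 poll offset are made up for illustration. Whatever the real error times are, the reports always land on the poll schedule.

```python
import random

POLL_PERIOD = 3600  # seconds; "the kernel polls once an hour" per the message above


def simulate(error_times, first_poll, horizon):
    """Return the poll times at which a latched corrected error is reported.

    Toy assumption: errors that arrive between two polls collapse into a
    single report at the next poll.
    """
    pending = sorted(error_times)
    reports = []
    i = 0
    t = first_poll
    while t <= horizon:
        latched = False
        while i < len(pending) and pending[i] <= t:
            latched = True  # further errors in this window add no extra report
            i += 1
        if latched:
            reports.append(t)
        t += POLL_PERIOD
    return reports


random.seed(0)
DAY = 24 * 3600
errors = [random.uniform(0, DAY) for _ in range(200)]  # hypothetical error times
first_poll = 56 * 60 + 10                              # hypothetical offset, xx:56:10
reports = simulate(errors, first_poll, DAY)

for t in reports[:3]:
    print("report at %02d:%02d:%02d" % (t // 3600, (t % 3600) // 60, t % 60))
```

In this model the log shows at most 24 entries a day, all at the same minute and second past the hour, regardless of how many errors actually occurred.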
Re: kernel MCA messages
On Wednesday, August 25, 2010 7:01:19 am Andriy Gapon wrote: on 25/08/2010 13:41 Dan Langille said the following: On 8/25/2010 3:11 AM, Andriy Gapon wrote: Have you read the decoded message? Please re-read it. I still recommend reading at least the summary of the RAM ECC research article to make your own judgment about need to replace DRAM. Andriy: What is your interpretation of the decoded message? What is your view on replacing DRAM? What do you conclude from the summary? Most likely you have a small defect in one of your memory modules, perhaps a stuck bit. You will be getting correctable ECC errors for that module. Eventually you might get uncorrectable error. That may happen soon or it may never happen during lifetime of that modules. As that study has demonstrated a significant percentage of systems and modules report at least one correctable ECC error. ECC correctable errors at present correlate with correctable ECC errors in the future. They also correlate with uncorrectable errors in the future. But percentage of systems developing uncorrectable errors is significantly smaller, so chances of false positives are substantial. You should decide whether you want to replace the module (if you can pinpoint it) or all modules depending on your resources (money, etc), importance of service that the server in question provides (allowable downtime and cost of it and fault-tolerance of a larger system, of which the server may be a part (e.g. it may have a standby server for failover). I think that most of what I've just said was kind of obvious from the start. The important bit from that study is that ECC errors are not as random and as rare as was previously thought, and they can be attributed to a number of factors like manufacturing defects, layout of memory lanes on motherboard, etc. A while back I found a slide deck from some Intel presentation that claimed that a modern 4GB DIMM should average 18 corrected errors a month. 
Your rate is a bit higher than that, but corrected ECC errors are not that unexpected. -- John Baldwin
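Editorial sketch: a rough back-of-the-envelope comparison of the two rates mentioned in this thread. It assumes one logged report per hourly poll (as Dan observed) and a 30-day month. Since polling may hide additional errors within an hour, the observed figure is best read as a lower bound.

```python
REPORTS_PER_DAY = 24   # one report per hourly poll, as observed in this thread
DAYS_PER_MONTH = 30    # assumption, for round numbers
SLIDE_DECK_RATE = 18   # corrected errors per month for a 4GB DIMM, per the slide deck John mentions

observed_per_month = REPORTS_PER_DAY * DAYS_PER_MONTH
ratio = observed_per_month / SLIDE_DECK_RATE

print("observed reports per month:", observed_per_month)
print("ratio to slide-deck figure: %.0fx" % ratio)
```

Under these assumptions the box logs about 720 reports a month, roughly 40 times the quoted per-DIMM average.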
Re: kernel MCA messages
on 25/08/2010 15:23 John Baldwin said the following:
> That is because machine checks for corrected errors have to be polled and the kernel polls once an hour. On newer Intel CPUs (such as Nehalem) there is a separate interrupt (CMCI) that can fire for corrected errors.

I think that on AMD it's possible to configure an interrupt for Bank 4 events as well (perhaps other banks too), but I need to refresh my memory of BKDG.

--
Andriy Gapon
Re: kernel MCA messages
on 25/08/2010 18:02 Andriy Gapon said the following:
> on 25/08/2010 15:23 John Baldwin said the following:
>> That is because machine checks for corrected errors have to be polled and the kernel polls once an hour. On newer Intel CPUs (such as Nehalem) there is a separate interrupt (CMCI) that can fire for corrected errors.
>
> I think that on AMD it's possible to configure an interrupt for Bank 4 events as well (perhaps other banks too), but I need to refresh my memory of BKDG.

Yeah, for Bank 4 only, configurable via MSR_0413 and MSRC000_04[0A:08] Machine Check Misc 4 (Thresholding) Registers. Also, see section 2.12.1.6 Error Thresholding in Fam10h BKDG.

--
Andriy Gapon
Re: kernel MCA messages
On Mon, 23 Aug 2010 14:20:35 +0200, John Baldwin j...@freebsd.org wrote: On Monday, August 23, 2010 2:44:38 am Andriy Gapon wrote: on 23/08/2010 05:05 Dan Langille said the following: On 8/22/2010 9:18 PM, Dan Langille wrote: What does this mean? kernel: MCA: Bank 4, Status 0x940c4001fe080813 kernel: MCA: Global Cap 0x0105, Status 0x kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0 kernel: MCA: CPU 0 COR BUSLG Source RD Memory kernel: MCA: Address 0x7ff6b0 FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43 And another one: kernel: MCA: Bank 4, Status 0x9459c0014a080813 kernel: MCA: Global Cap 0x0105, Status 0x kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0 kernel: MCA: CPU 0 COR BUSLG Source RD Memory kernel: MCA: Address 0x7ff670 I believe that you get correctable RAM ECC errors, but not entirely sure. There is mcelog utility that decodes such messages into human-friendly descriptions. The utility is available on Linux-based systems. John Baldwin has a port of it to FreeBSD, but it seems to be WIP and is private so far. Wait and watch John posting decoded text in this thread :-) It is not private, it is in //depot/projects/mcelog/... in p4. It is not a complete port yet though (doesn't support the daemon and client modes for example). Details for these errors: HARDWARE ERROR. This is *NOT* a software problem! Please contact your hardware vendor CPU 0 4 northbridge ADDR 7ff6b0 Northbridge RAM Chipkill ECC error Chipkill ECC syndrome = fe18 bit32 = err cpu0 bit46 = corrected ecc error bus error 'local node origin, request didn't time out generic read mem transaction memory access, level generic' STATUS 940c4001fe080813 MCGSTATUS 0 MCGCAP 105 APICID 0 SOCKETID 0 CPUID Vendor AMD Family 15 Model 5 HARDWARE ERROR. This is *NOT* a software problem! 
Please contact your hardware vendor CPU 0 4 northbridge ADDR 7ff670 Northbridge RAM Chipkill ECC error Chipkill ECC syndrome = 4ab3 bit32 = err cpu0 bit46 = corrected ecc error bus error 'local node origin, request didn't time out generic read mem transaction memory access, level generic' STATUS 9459c0014a080813 MCGSTATUS 0 MCGCAP 105 APICID 0 SOCKETID 0 CPUID Vendor AMD Family 15 Model 5 As Andriy guessed, I believe both of these are corrected ECC errors. You can likely ignore them as a low rate of corrected ECC errors is not unexpected.

Hi, A little off topic, but what is 'a low rate of corrected ECC errors'? At work one machine has them about once per day, but runs OK. Is once per day much? Ronald.
Re: kernel MCA messages
on 24/08/2010 09:14 Ronald Klop said the following:
> A little off topic, but what is 'a low rate of corrected ECC errors'? At work one machine has them about once per day, but runs OK. Is once per day much?

That's up to your judgment. It's like asking after how many remapped sectors you replace an HDD. You may find this interesting: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

--
Andriy Gapon
Re: kernel MCA messages
On Monday, August 23, 2010 5:35:40 pm Matthew D. Fuller wrote: On Mon, Aug 23, 2010 at 08:20:35AM -0400 I heard the voice of John Baldwin, and lo! it spake thus: It is not private, it is in //depot/projects/mcelog/... in p4. Which may as well be Siberia for us lowly non-developers. Any chance you could stick a tarball or a patch against upstream mcelog somewhere? It is actually public at perforce.freebsd.org. :) However, it is tedious to download the files. It really should be a port perhaps, though Someone (tm) should try to get the patches integrated upstream. You can find a patch at www.freebsd.org/~jhb/mcelog/. You will also need to download the memstream.c file from there as well and put that in the extracted mcelog tarball. -- John Baldwin ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: kernel MCA messages
IMHO the key here is whether hardware is broken or not. The only case where correctable ECC errors are OK is when a bit gets flipped by a high-energy particle. That's a normal but fairly rare event. If you get bit flips often enough that you can recall details of more than one of them on the same hardware, my guess would be that you're dealing with something else: bad/marginal memory, signal integrity issues, power issues, overheating... The list continues. In all those cases hardware does *not* work correctly. Whether you can (or want to) keep running stuff on hardware that is broken is another question.

--Artem

On Tue, Aug 24, 2010 at 1:15 AM, Andriy Gapon a...@icyb.net.ua wrote:
> on 24/08/2010 09:14 Ronald Klop said the following:
>> A little off topic, but what is 'a low rate of corrected ECC errors'? At work one machine has them about once per day, but runs OK. Is once per day much?
>
> That's up to your judgment. It's like after how many remapped sectors do you replace HDD. You may find this interesting: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
>
> --
> Andriy Gapon
Re: kernel MCA messages
on 24/08/2010 22:51 Artem Belevich said the following: IMHO the key here is whether hardware is broken or not. The only case where correctable ECC errors are OK is when a bit gets flipped by a high-energy particle. That's a normal but fairly rare event. If you get bit flips often enough that you can recall details of more then one of them on the same hardware, my guess would be that you're dealing with something else -- bad/marginal memory, signal integrity issues, power issues, overheating... The list continues.. In all those cases hardware does *not* work correctly. Whether you can (or want to) keep running stuff on the hardware that is broken is another question. Have you read the article? :) If not, read at least the summary. On Tue, Aug 24, 2010 at 1:15 AM, Andriy Gapon a...@icyb.net.ua wrote: on 24/08/2010 09:14 Ronald Klop said the following: A little off topic, but what is 'a low rate of corrected ECC errors'? At work one machine has them like ones per day, but runs ok. Is ones per day much? That's up to your judgment. It's like after how many remapped sectors do you replace HDD. You may find this interesting: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf -- Andriy Gapon -- Andriy Gapon ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: kernel MCA messages
On 8/22/2010 9:18 PM, Dan Langille wrote: What does this mean? kernel: MCA: Bank 4, Status 0x940c4001fe080813 kernel: MCA: Global Cap 0x0105, Status 0x kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0 kernel: MCA: CPU 0 COR BUSLG Source RD Memory kernel: MCA: Address 0x7ff6b0 FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43 FYI, these are occurring every hour, almost to the second. e.g. xx:56:yy, where yy is 09, 10, or 11. Checking logs, I don't see anything that correlates with this point in the hour (i.e 56 minutes past) that doesn't also occur at other times. It seems very odd to occur so regularly. -- Dan Langille - http://langille.org/ ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: kernel MCA messages
On Tue, Aug 24, 2010 at 07:13:23PM -0400, Dan Langille wrote: On 8/22/2010 9:18 PM, Dan Langille wrote: What does this mean? kernel: MCA: Bank 4, Status 0x940c4001fe080813 kernel: MCA: Global Cap 0x0105, Status 0x kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0 kernel: MCA: CPU 0 COR BUSLG Source RD Memory kernel: MCA: Address 0x7ff6b0 FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43 FYI, these are occurring every hour, almost to the second. e.g. xx:56:yy, where yy is 09, 10, or 11. Checking logs, I don't see anything that correlates with this point in the hour (i.e 56 minutes past) that doesn't also occur at other times. It seems very odd to occur so regularly. 1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all the DIMMs just to be sure? Do this and see if the problem goes away. If not, no harm done, and you've narrowed it down. 2) What exact manufacturer and model of motherboard is this? If you can provide a link to a User Manual that would be great. 3) Please go into your system BIOS and find where ECC ChipKill options are available (likely under a Memory, Chipset, or Northbridge section). Please write down and provide here all of the options and what their currently selected values are. 4) Please make sure you're running the latest system BIOS. I've seen on certain Rackable AMD-based systems where Northbridge-related features don't work quite right (at least with Solaris), resulting in atrocious memory performance on the system. A BIOS upgrade solved the problem. There's a ChipKill feature called ECC BG Scrubbing that's vague in definition, given that it's a background memory scrub that happens at intervals which are unknown to me. Maybe 60 minutes? I don't know. This is why I ask question #3. For John and other devs: I assume the decoded MCA messages indicate with absolute certainty that the ECC error is coming from external DRAM and not, say, bad L1 or L2 cache? 
--
| Jeremy Chadwick                                  j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: kernel MCA messages
On 8/24/2010 7:38 PM, Jeremy Chadwick wrote:
> On Tue, Aug 24, 2010 at 07:13:23PM -0400, Dan Langille wrote:
>> On 8/22/2010 9:18 PM, Dan Langille wrote:
>>> What does this mean?
>>>
>>> kernel: MCA: Bank 4, Status 0x940c4001fe080813
>>> kernel: MCA: Global Cap 0x0105, Status 0x
>>> kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
>>> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
>>> kernel: MCA: Address 0x7ff6b0
>>>
>>> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>>
>> FYI, these are occurring every hour, almost to the second. e.g. xx:56:yy, where yy is 09, 10, or 11. Checking logs, I don't see anything that correlates with this point in the hour (i.e 56 minutes past) that doesn't also occur at other times. It seems very odd to occur so regularly.
>
> 1) Why haven't you replaced the DIMM in Bank 4 -- or better yet, all the DIMMs just to be sure? Do this and see if the problem goes away. If not, no harm done, and you've narrowed it down.

For good reason: time and distance. I've not had the time or opportunity to buy new RAM. Today is Tuesday. The problem appeared about 48 hours ago after upgrading to 8.1 stable from 7.x. The box is in Austin. I'm in Philadelphia. You know the math. ;) When I can get the time to fly to Austin, I will if required.

I'm sorry, I'm not meaning to be flippant. I'm just glad I documented as much as I could 4 years ago.

> 2) What exact manufacturer and model of motherboard is this? If you can provide a link to a User Manual that would be great.

This is a box from iXsystems that I obtained back when 6.1-RELEASE was the latest. I know it has four sticks of 2GB.

http://www.freebsddiary.org/dual-opteron.php

Sadly, many of the links are now invalid. The board is an AccelerTech ATO2161-DC, also known as a RioWorks HDAMA-G.

See also: http://www.freebsddiary.org/dual-opteron-dmidecode.txt

And we have a close up of the RAM and the m/b:

http://www.freebsddiary.org/showpicture.php?id=85
http://www.freebsddiary.org/showpicture.php?id=84

I am quite sure it's very close to this: http://www.accelertech.com/2007/amd_mb/opteron/ato2161i-dc_pic.php

With the manual here: http://www.accelertech.com/2007/amd_mb/opteron/ato2161i-dc_manual.php

> 3) Please go into your system BIOS and find where ECC ChipKill options are available (likely under a Memory, Chipset, or Northbridge section). Please write down and provide here all of the options and what their currently selected values are.
>
> 4) Please make sure you're running the latest system BIOS. I've seen on certain Rackable AMD-based systems where Northbridge-related features don't work quite right (at least with Solaris), resulting in atrocious memory performance on the system. A BIOS upgrade solved the problem.

#3 and #4 are just as hard as #1 at the moment.

> There's a ChipKill feature called ECC BG Scrubbing that's vague in definition, given that it's a background memory scrub that happens at intervals which are unknown to me. Maybe 60 minutes? I don't know. This is why I ask question #3.
>
> For John and other devs: I assume the decoded MCA messages indicate with absolute certainty that the ECC error is coming from external DRAM and not, say, bad L1 or L2 cache?

Nice question.

--
Dan Langille - http://langille.org/
Re: kernel MCA messages
On Tue, Aug 24, 2010 at 11:06:43AM -0400 I heard the voice of John Baldwin, and lo! it spake thus: It is actually public at perforce.freebsd.org. :) However, it is tedious to download the files. Oh, I'd apparently blocked out of my mind that you could clicky-clicky files one at a time from there. Probably for the best; I'd be real annoyed by the end of that ;) You can find a patch at www.freebsd.org/~jhb/mcelog/. You will also need to download the memstream.c file from there as well and put that in the extracted mcelog tarball. Thanks! For anyone following along at home, I needed to make a few changes to get it compiling here: - I'm on a nice recent -CURRENT, so I had to #if 0 out the getline() definition. - Add a FREEBSD definition to the Makefile (or remember it manually). - Comment out the kread_symbol() of X_SNAPDATE in mcelog.c. I don't see X_SNAPDATE defined anywhere in my /usr/include, and the var doesn't seem to ever actually be read for anything anyway (unless I'm supposed to -DLOCAL_HACK...). -- Matthew Fuller (MF4839) | fulle...@over-yonder.net Systems/Network Administrator | http://www.over-yonder.net/~fullermd/ On the Internet, nobody can hear you scream. ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: kernel MCA messages
on 23/08/2010 05:05 Dan Langille said the following: On 8/22/2010 9:18 PM, Dan Langille wrote: What does this mean? kernel: MCA: Bank 4, Status 0x940c4001fe080813 kernel: MCA: Global Cap 0x0105, Status 0x kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0 kernel: MCA: CPU 0 COR BUSLG Source RD Memory kernel: MCA: Address 0x7ff6b0 FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43 And another one: kernel: MCA: Bank 4, Status 0x9459c0014a080813 kernel: MCA: Global Cap 0x0105, Status 0x kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0 kernel: MCA: CPU 0 COR BUSLG Source RD Memory kernel: MCA: Address 0x7ff670 I believe that you get correctable RAM ECC errors, but not entirely sure. There is mcelog utility that decodes such messages into human-friendly descriptions. The utility is available on Linux-based systems. John Baldwin has a port of it to FreeBSD, but it seems to be WIP and is private so far. Wait and watch John posting decoded text in this thread :-) -- Andriy Gapon ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: kernel MCA messages
On Monday, August 23, 2010 2:44:38 am Andriy Gapon wrote: on 23/08/2010 05:05 Dan Langille said the following: On 8/22/2010 9:18 PM, Dan Langille wrote: What does this mean? kernel: MCA: Bank 4, Status 0x940c4001fe080813 kernel: MCA: Global Cap 0x0105, Status 0x kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0 kernel: MCA: CPU 0 COR BUSLG Source RD Memory kernel: MCA: Address 0x7ff6b0 FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43 And another one: kernel: MCA: Bank 4, Status 0x9459c0014a080813 kernel: MCA: Global Cap 0x0105, Status 0x kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0 kernel: MCA: CPU 0 COR BUSLG Source RD Memory kernel: MCA: Address 0x7ff670 I believe that you get correctable RAM ECC errors, but not entirely sure. There is mcelog utility that decodes such messages into human-friendly descriptions. The utility is available on Linux-based systems. John Baldwin has a port of it to FreeBSD, but it seems to be WIP and is private so far. Wait and watch John posting decoded text in this thread :-) It is not private, it is in //depot/projects/mcelog/... in p4. It is not a complete port yet though (doesn't support the daemon and client modes for example). Details for these errors: HARDWARE ERROR. This is *NOT* a software problem! Please contact your hardware vendor CPU 0 4 northbridge ADDR 7ff6b0 Northbridge RAM Chipkill ECC error Chipkill ECC syndrome = fe18 bit32 = err cpu0 bit46 = corrected ecc error bus error 'local node origin, request didn't time out generic read mem transaction memory access, level generic' STATUS 940c4001fe080813 MCGSTATUS 0 MCGCAP 105 APICID 0 SOCKETID 0 CPUID Vendor AMD Family 15 Model 5 HARDWARE ERROR. This is *NOT* a software problem! 
Please contact your hardware vendor CPU 0 4 northbridge ADDR 7ff670 Northbridge RAM Chipkill ECC error Chipkill ECC syndrome = 4ab3 bit32 = err cpu0 bit46 = corrected ecc error bus error 'local node origin, request didn't time out generic read mem transaction memory access, level generic' STATUS 9459c0014a080813 MCGSTATUS 0 MCGCAP 105 APICID 0 SOCKETID 0 CPUID Vendor AMD Family 15 Model 5 As Andriy guessed, I believe both of these are corrected ECC errors. You can likely ignore them as a low rate of corrected ECC errors is not unexpected. -- John Baldwin ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
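Editorial sketch: the status words in this thread can be decoded by hand, which shows where the kernel's "COR BUSLG Source RD Memory" and mcelog's syndrome values come from. The flag positions (bits 63, 62, 61, 58) and the bus-error code layout (0000 1PPT RRRR IILL) follow the general x86 machine-check format. The syndrome and corrected/uncorrected ECC bit positions are the AMD K8 northbridge layout as understood here, so treat them as assumptions to check against the BKDG. The function and field names are this sketch's own, not kernel or mcelog identifiers.

```python
# Labels for the bus-error sub-fields; the wording is illustrative.
PP = ("Source", "Responder", "Observer", "Generic")
RRRR = ("ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP")
II = ("Memory", "Reserved", "I/O", "Other")
LL = ("L0", "L1", "L2", "LG")


def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def decode_status(status):
    """Decode a 64-bit MCi_STATUS value into a dict of fields."""
    fields = {
        "valid": bit(status, 63),
        "overflow": bit(status, 62),
        "uncorrected": bit(status, 61),
        "addr_valid": bit(status, 58),
        "corrected_ecc": bit(status, 46),    # assumed K8 northbridge position
        "uncorrected_ecc": bit(status, 45),  # assumed K8 northbridge position
        # Assumed ChipKill syndrome: high byte in bits 31:24, low byte in bits 54:47.
        "syndrome": (((status >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF),
    }
    code = status & 0xFFFF
    summary = "UNCOR" if fields["uncorrected"] else "COR"
    if code >> 11 == 1:  # bus error: 0000 1PPT RRRR IILL
        rrrr = (code >> 4) & 0xF
        request = RRRR[rrrr] if rrrr < len(RRRR) else "???"
        summary += " BUS%s %s %s %s" % (
            LL[code & 3], PP[(code >> 9) & 3], request, II[(code >> 2) & 3])
    fields["summary"] = summary
    return fields


for status in (0x940C4001FE080813, 0x9459C0014A080813, 0x947EC000D8080A13):
    f = decode_status(status)
    print("%016x  %s  syndrome=%04x" % (status, f["summary"], f["syndrome"]))
```

For the first two status words this yields syndromes fe18 and 4ab3, matching the mcelog output quoted above, and the third decodes as a Responder rather than a Source, matching the kernel message Dan posted.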
Re: kernel MCA messages
On Mon, Aug 23, 2010 at 08:20:35AM -0400 I heard the voice of John Baldwin, and lo! it spake thus:
> It is not private, it is in //depot/projects/mcelog/... in p4.

Which may as well be Siberia for us lowly non-developers. Any chance you could stick a tarball or a patch against upstream mcelog somewhere?

--
Matthew Fuller (MF4839) | fulle...@over-yonder.net
Systems/Network Administrator | http://www.over-yonder.net/~fullermd/
On the Internet, nobody can hear you scream.
Re: kernel MCA messages
On 8/22/2010 10:05 PM, Dan Langille wrote:
> On 8/22/2010 9:18 PM, Dan Langille wrote:
> > What does this mean?
> >
> > kernel: MCA: Bank 4, Status 0x940c4001fe080813
> > kernel: MCA: Global Cap 0x0105, Status 0x
> > kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
> > kernel: MCA: CPU 0 COR BUSLG Source RD Memory
> > kernel: MCA: Address 0x7ff6b0
> >
> > FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
>
> And another one:
>
> kernel: MCA: Bank 4, Status 0x9459c0014a080813
> kernel: MCA: Global Cap 0x0105, Status 0x
> kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
> kernel: MCA: Address 0x7ff670

kernel: MCA: Bank 4, Status 0x947ec000d8080a13
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Responder RD Memory
kernel: MCA: Address 0xbfa9930

Another one.

These errors started appearing after upgrading to 8.1-STABLE from 7.2..
something. I suspect the functionality was added about then

-- 
Dan Langille - http://langille.org/
Re: kernel MCA messages
on 24/08/2010 02:43 Dan Langille said the following:
> On 8/22/2010 10:05 PM, Dan Langille wrote:
> > On 8/22/2010 9:18 PM, Dan Langille wrote:
> > > What does this mean?
> > >
> > > kernel: MCA: Bank 4, Status 0x940c4001fe080813
> > > kernel: MCA: Global Cap 0x0105, Status 0x
> > > kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
> > > kernel: MCA: CPU 0 COR BUSLG Source RD Memory
> > > kernel: MCA: Address 0x7ff6b0
> > >
> > > FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
> >
> > And another one:
> >
> > kernel: MCA: Bank 4, Status 0x9459c0014a080813
> > kernel: MCA: Global Cap 0x0105, Status 0x
> > kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
> > kernel: MCA: CPU 0 COR BUSLG Source RD Memory
> > kernel: MCA: Address 0x7ff670
>
> kernel: MCA: Bank 4, Status 0x947ec000d8080a13
> kernel: MCA: Global Cap 0x0105, Status 0x
> kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
> kernel: MCA: CPU 0 COR BUSLG Responder RD Memory
> kernel: MCA: Address 0xbfa9930
>
> Another one.
>
> These errors started appearing after upgrading to 8.1-STABLE from 7.2..
> something. I suspect the functionality was added about then

Please stop the flood :-)
Depending on hardware there could be hundreds of such errors per day.
Either replace memory modules or learn to live with these messages.

-- 
Andriy Gapon
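Whether the messages are worth acting on depends largely on how often they occur and whether they cluster on a few addresses. A rough way to get that picture is to tally the kernel log lines instead of reading them one by one. The sketch below is an illustration only: the helper name `tally_mca` is hypothetical, and the regular expressions assume the exact "MCA: Bank ..." and "MCA: Address ..." wording shown in this thread.

```python
import re
from collections import Counter

# Patterns match the kernel log wording quoted in this thread (assumption:
# other releases may word these lines differently).
BANK_RE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")
ADDR_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")


def tally_mca(lines):
    """Count MCA records per (bank, address) pair from syslog-style lines."""
    counts = Counter()
    bank = None
    for line in lines:
        match = BANK_RE.search(line)
        if match:
            bank = int(match.group(1))      # start of a new record
            continue
        match = ADDR_RE.search(line)
        if match and bank is not None:
            counts[(bank, int(match.group(1), 16))] += 1
            bank = None                     # record complete
    return counts


SAMPLE = """\
kernel: MCA: Bank 4, Status 0x940c4001fe080813
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: CPU 0 COR BUSLG Source RD Memory
kernel: MCA: Address 0x7ff6b0
kernel: MCA: Bank 4, Status 0x9459c0014a080813
kernel: MCA: Address 0x7ff670
kernel: MCA: Bank 4, Status 0x947ec000d8080a13
kernel: MCA: Address 0xbfa9930
""".splitlines()

if __name__ == "__main__":
    for (bank, addr), n in sorted(tally_mca(SAMPLE).items()):
        print(f"bank {bank} address {addr:#x}: {n}")
```

To run it against a real log, pass an open file object (for example the system log, whose path varies by configuration) in place of `SAMPLE`. Many hits on the same or neighbouring addresses would point at one weak spot in a module, while a thin, scattered spread is closer to the low background rate John describes.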
Re: kernel MCA messages
On 8/23/2010 7:47 PM, Andriy Gapon wrote:
> on 24/08/2010 02:43 Dan Langille said the following:
> > On 8/22/2010 10:05 PM, Dan Langille wrote:
> > > On 8/22/2010 9:18 PM, Dan Langille wrote:
> > > > What does this mean?
> > > >
> > > > kernel: MCA: Bank 4, Status 0x940c4001fe080813
> > > > kernel: MCA: Global Cap 0x0105, Status 0x
> > > > kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
> > > > kernel: MCA: CPU 0 COR BUSLG Source RD Memory
> > > > kernel: MCA: Address 0x7ff6b0
> > > >
> > > > FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43
> > >
> > > And another one:
> > >
> > > kernel: MCA: Bank 4, Status 0x9459c0014a080813
> > > kernel: MCA: Global Cap 0x0105, Status 0x
> > > kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
> > > kernel: MCA: CPU 0 COR BUSLG Source RD Memory
> > > kernel: MCA: Address 0x7ff670
> >
> > kernel: MCA: Bank 4, Status 0x947ec000d8080a13
> > kernel: MCA: Global Cap 0x0105, Status 0x
> > kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
> > kernel: MCA: CPU 0 COR BUSLG Responder RD Memory
> > kernel: MCA: Address 0xbfa9930
> >
> > Another one.
> >
> > These errors started appearing after upgrading to 8.1-STABLE from 7.2..
> > something. I suspect the functionality was added about then
>
> Please stop the flood :-)

Sure. Three emails is hardly a flood. :)

> Depending on hardware there could be hundreds of such errors per day.
> Either replace memory modules or learn to live with these messages.

I was posting a remark anyway. Thought I'd include one more that I noticed.
Surely you can cope. :)

-- 
Dan Langille - http://langille.org/
Re: kernel MCA messages
On 23/08/2010, at 10:48, Dan Langille wrote:
> What does this mean?
>
> kernel: MCA: Bank 4, Status 0x940c4001fe080813
> kernel: MCA: Global Cap 0x0105, Status 0x
> kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
> kernel: MCA: Address 0x7ff6b0
>
> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43

It's generated by machine check support, see..
/usr/src/sys/i386/i386/mca.c
/usr/src/sys/amd64/amd64/mca.c

Some info here..
http://en.wikipedia.org/wiki/Machine_check_architecture

No man page for it though.

-- 
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
"The nice thing about standards is that there
are so many of them to choose from."
  -- Andrew Tanenbaum
GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C
Re: kernel MCA messages
On 8/22/2010 9:18 PM, Dan Langille wrote:
> What does this mean?
>
> kernel: MCA: Bank 4, Status 0x940c4001fe080813
> kernel: MCA: Global Cap 0x0105, Status 0x
> kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
> kernel: MCA: CPU 0 COR BUSLG Source RD Memory
> kernel: MCA: Address 0x7ff6b0
>
> FreeBSD 7.3-STABLE #1: Sun Aug 22 23:16:43

And another one:

kernel: MCA: Bank 4, Status 0x9459c0014a080813
kernel: MCA: Global Cap 0x0105, Status 0x
kernel: MCA: Vendor AuthenticAMD, ID 0xf5a, APIC ID 0
kernel: MCA: CPU 0 COR BUSLG Source RD Memory
kernel: MCA: Address 0x7ff670

-- 
Dan Langille - http://langille.org/