Re: Wheezy: mcelog not getting notified of ECC errors anymore?

2013-06-04 Thread Steffen Grunewald
On Mon, Jun 03, 2013 at 02:32:39PM -0500, Karl Schmidt wrote:
> This is serious - I sure hope you wrote up a bug report?

I first wanted to make sure that the serious error isn't on my side.
Up to now I seem to be the only one affected...

S


-- 
To UNSUBSCRIBE, email to debian-amd64-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20130604070529.gk20...@casco.aei.mpg.de



Re: Wheezy: mcelog not getting notified of ECC errors anymore?

2013-06-03 Thread Karl Schmidt

This is serious - I sure hope you wrote up a bug report?



Karl Schmidt  EMail k...@xtronics.com
Transtronics, Inc.  WEB http://secure.transtronics.com
3209 West 9th Street Ph (785) 841-3089
Lawrence, KS 66049  FAX (785) 841-0434

Truth is mighty and will prevail.
There is nothing wrong with this,
except that it ain't so.
--Mark Twain




Archive: http://lists.debian.org/51acef57.6010...@xtronics.com



Wheezy: mcelog not getting notified of ECC errors anymore?

2013-06-03 Thread Steffen Grunewald
After installing Wheezy (using FAI, so the setup is essentially unaltered),
one of my machines doesn't report memory errors via mcelog anymore. Error
messages go to syslog instead:

> Jun  3 09:47:07 testbed kernel: [231899.816038] [Hardware Error]: CPU:0 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd4674833
> Jun  3 09:47:07 testbed kernel: [231899.816282] [Hardware Error]: MC0_ADDR: 0x76d39ec0
> Jun  3 09:47:07 testbed kernel: [231899.816377] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:47:07 testbed kernel: [231899.816534] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun  3 09:47:07 testbed kernel: [231899.816899] [Hardware Error]: CPU:0 MC2_STATUS[Over|CE|-|-|-|CECC]: 0xd0004863
> Jun  3 09:47:07 testbed kernel: [231899.817136] [Hardware Error]: Bus Unit Error: PRF/ECC error in data read from NB: SRC.
> Jun  3 09:47:07 testbed kernel: [231899.817314] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: PRF, part-proc: SRC (no timeout)
> Jun  3 09:47:07 testbed kernel: [231899.817677] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd46740020813
> Jun  3 09:47:07 testbed kernel: [231899.817915] [Hardware Error]: MC4_ADDR: 0x7fafc410
> Jun  3 09:47:07 testbed kernel: [231899.818009] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> Jun  3 09:47:07 testbed kernel: [231899.818189] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x7fafc410
> Jun  3 09:47:07 testbed kernel: [231899.818289] EDAC MC0: CE page 0x7fafc, offset 0x410, grain 0, syndrome 0xce, row 1, channel 0, label "": amd64_edac
> Jun  3 09:47:07 testbed kernel: [231899.818298] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun  3 09:47:08 testbed kernel: [231900.804029] [Hardware Error]: CPU:1 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd4674833
> Jun  3 09:47:08 testbed kernel: [231900.804278] [Hardware Error]: MC0_ADDR: 0x7a673600
> Jun  3 09:47:08 testbed kernel: [231900.804371] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:47:08 testbed kernel: [231900.804530] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun  3 09:47:08 testbed kernel: [231900.804894] [Hardware Error]: CPU:1 MC2_STATUS[Over|CE|-|-|-|CECC]: 0xd0004863
> Jun  3 09:47:08 testbed kernel: [231900.805130] [Hardware Error]: Bus Unit Error: PRF/ECC error in data read from NB: SRC.
> Jun  3 09:47:08 testbed kernel: [231900.810632] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: PRF, part-proc: SRC (no timeout)
> Jun  3 09:52:07 testbed kernel: [232199.816039] [Hardware Error]: CPU:0 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd4674833
> Jun  3 09:52:07 testbed kernel: [232199.816284] [Hardware Error]: MC0_ADDR: 0x0021086ea0c0
> Jun  3 09:52:07 testbed kernel: [232199.816378] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:52:07 testbed kernel: [232199.816536] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DRD, part-proc: SRC (no timeout)
> Jun  3 09:52:07 testbed kernel: [232199.816901] [Hardware Error]: CPU:0 MC2_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd4004813
> Jun  3 09:52:07 testbed kernel: [232199.817139] [Hardware Error]: MC2_ADDR: 0x77ef0cc0
> Jun  3 09:52:07 testbed kernel: [232199.817232] [Hardware Error]: Bus Unit Error: RD/ECC error in data read from NB: SRC.
> Jun  3 09:52:07 testbed kernel: [232199.817409] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun  3 09:52:07 testbed kernel: [232199.817771] [Hardware Error]: CPU:0 MC4_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd46740020813
> Jun  3 09:52:07 testbed kernel: [232199.818008] [Hardware Error]: MC4_ADDR: 0x7fafc410
> Jun  3 09:52:07 testbed kernel: [232199.818101] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> Jun  3 09:52:07 testbed kernel: [232199.818282] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x7fafc410
> Jun  3 09:52:07 testbed kernel: [232199.818382] EDAC MC0: CE page 0x7fafc, offset 0x410, grain 0, syndrome 0xce, row 1, channel 0, label "": amd64_edac
> Jun  3 09:52:07 testbed kernel: [232199.818391] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
> Jun  3 09:52:08 testbed kernel: [232200.804035] [Hardware Error]: CPU:1 MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd4674833
> Jun  3 09:52:08 testbed kernel: [232200.804283] [Hardware Error]: MC0_ADDR: 0x7a673600
> Jun  3 09:52:08 testbed kernel: [232200.804377] [Hardware Error]: Data Cache Error: during system linefill.
> Jun  3 09:52:08 testbed kernel: [232200.804534] [Hardware Error]: cache 
>