Re: [GLLUG] How worried should I be ...
These errors are logged in mcelog also - if you are running mcelog!

On Sun, 24 May 2020 at 16:12, John Hearns wrote:

> As Martin Broosk says, run memtest.
> You can run the user space memtester on circa 90% of the RAM.
> Even better, download https://www.stresslinux.org/sl/
> Format a USB stick and boot from it. Then run the memtester utility there.
>
> On a server I would advise using the iDRAC or BMC to get a list of the
> hardware events as well.
>
> On Fri, 22 May 2020 at 18:18, James Courtier-Dutton via GLLUG
> <gllug@mailman.lug.org.uk> wrote:
>
>> On Fri, 22 May 2020 at 16:38, Alain D D Williams via GLLUG wrote:
>> >
>> > The message below was put to all login sessions this morning. I have
>> > never seen this before. There is nothing more in /var/log/messages.
>> >
>> > The machine is 8 years old, always switched on, AMD 8150 Eight-Core
>> > Processor.
>> >
>> > Should I take this as a warning and look to replace the machine or just
>> > shrug my shoulders & mutter something about cosmic rays?
>> >
>> > TIA
>> >
>> > Message from syslogd@mint at May 22 07:27:09 ...
>> >  kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.
>> >
>> > Message from syslogd@mint at May 22 07:27:09 ...
>> >  kernel:[Hardware Error]: Error Status: Corrected error, no action required.
>>
>> If this is a one off, I would not worry about it.
>> Bits flip occasionally.
>> If you are getting it continuously, then power off the box. Reboot it,
>> and see if the problem goes away.
>> If it is always there, even after a cold power cycle, you have a hardware
>> fault.

--
GLLUG mailing list
GLLUG@mailman.lug.org.uk
https://mailman.lug.org.uk/mailman/listinfo/gllug
Re: [GLLUG] How worried should I be ...
As Martin Broosk says, run memtest.
You can run the user space memtester on circa 90% of the RAM.
Even better, download https://www.stresslinux.org/sl/
Format a USB stick and boot from it. Then run the memtester utility there.

On a server I would advise using the iDRAC or BMC to get a list of the
hardware events as well.

On Fri, 22 May 2020 at 18:18, James Courtier-Dutton via GLLUG
<gllug@mailman.lug.org.uk> wrote:

> On Fri, 22 May 2020 at 16:38, Alain D D Williams via GLLUG wrote:
> >
> > The message below was put to all login sessions this morning. I have
> > never seen this before. There is nothing more in /var/log/messages.
> >
> > The machine is 8 years old, always switched on, AMD 8150 Eight-Core
> > Processor.
> >
> > Should I take this as a warning and look to replace the machine or just
> > shrug my shoulders & mutter something about cosmic rays?
> >
> > TIA
> >
> > Message from syslogd@mint at May 22 07:27:09 ...
> >  kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.
> >
> > Message from syslogd@mint at May 22 07:27:09 ...
> >  kernel:[Hardware Error]: Error Status: Corrected error, no action required.
>
> If this is a one off, I would not worry about it.
> Bits flip occasionally.
> If you are getting it continuously, then power off the box. Reboot it,
> and see if the problem goes away.
> If it is always there, even after a cold power cycle, you have a hardware
> fault.
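The "circa 90% of the RAM" suggestion can be turned into a concrete memtester invocation. What follows is only a minimal sketch: it assumes the usual `memtester <size>M <loops>` command form and a `MemTotal:` line in the format found in /proc/meminfo. The function name `memtester_command` and the sample figures are made up for illustration. On a busy machine less than 90% may actually be lockable, so treat the number as a starting point.

```python
def memtester_command(meminfo_text, percent=90, loops=1):
    """Build a memtester command line covering `percent` of total RAM.

    `meminfo_text` is the content of /proc/meminfo (or a sample of it).
    Integer arithmetic is used throughout to avoid rounding surprises.
    """
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            total_kib = int(line.split()[1])          # value is in kB
            size_mib = total_kib * percent // 100 // 1024
            return f"memtester {size_mib}M {loops}"
    raise ValueError("MemTotal not found")


# Hypothetical sample; on a real box read open("/proc/meminfo").read()
sample = "MemTotal:       16384000 kB\nMemFree:         1234567 kB\n"
print(memtester_command(sample))
```

memtester normally needs root so that it can lock the memory it is testing.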
Re: [GLLUG] How worried should I be ...
On Fri, 22 May 2020 at 16:38, Alain D D Williams via GLLUG wrote:
>
> The message below was put to all login sessions this morning. I have never
> seen this before. There is nothing more in /var/log/messages.
>
> The machine is 8 years old, always switched on, AMD 8150 Eight-Core Processor.
>
> Should I take this as a warning and look to replace the machine or just shrug
> my shoulders & mutter something about cosmic rays?
>
> TIA
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: Error Status: Corrected error, no action required.

If this is a one off, I would not worry about it.
Bits flip occasionally.
If you are getting it continuously, then power off the box. Reboot it,
and see if the problem goes away.
If it is always there, even after a cold power cycle, you have a hardware
fault.
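To tell "one off" from "continuously", it helps to count the corrected errors per day rather than rely on memory. A rough sketch, assuming the log lines look like the broadcast messages quoted above (a "Message from syslogd@... at <date>" line followed by a kernel line). The function name `corrected_errors_per_day` is made up here, and what count is "too many" is a judgment call that the code does not try to make.

```python
import re
from collections import Counter

# Matches the timestamp header seen in the quoted broadcast messages.
STAMP = re.compile(r"Message from syslogd@\S+ at (\w{3} +\d+) \d\d:\d\d:\d\d")


def corrected_errors_per_day(lines):
    """Count '[Hardware Error] ... Corrected error' lines, grouped by day."""
    counts = Counter()
    day = None
    for line in lines:
        match = STAMP.search(line)
        if match:
            day = match.group(1)          # e.g. "May 22"
        elif "[Hardware Error]" in line and "Corrected error" in line:
            counts[day] += 1
    return counts


sample = [
    "Message from syslogd@mint at May 22 07:27:09 ...",
    " kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.",
    "Message from syslogd@mint at May 22 07:27:09 ...",
    " kernel:[Hardware Error]: Error Status: Corrected error, no action required.",
]
print(dict(corrected_errors_per_day(sample)))
```

A count that stays at one is the "bits flip occasionally" case; a count that climbs day after day, or survives a cold power cycle, points at hardware.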
Re: [GLLUG] How worried should I be ...
On 2020-05-22 16:37, Alain D D Williams via GLLUG wrote:
> The message below was put to all login sessions this morning. I have never
> seen this before. There is nothing more in /var/log/messages.

You probably have faulty RAM. Run memtest.
Re: [GLLUG] How worried should I be ...
Hello,

On Fri, May 22, 2020 at 04:37:57PM +0100, Alain D D Williams via GLLUG wrote:
> Should I take this as a warning and look to replace the machine or just shrug
> my shoulders & mutter something about cosmic rays?
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: Error Status: Corrected error, no action required.

The L3 cache is inside the CPU. It can be a faulty CPU; I think it could
possibly also be faulty RAM if you do not have ECC RAM (otherwise the problem
would have been detected in the RAM, not the L3 cache). Either way it is a
single bit flip detected by ECC in the cache and corrected.

If you can shut the machine down I would run a few passes of memtest. That
will hopefully spot any RAM problems.

If the RAM comes up clean but it keeps happening, I would really suspect the
CPU and plan for a replacement soon.

If the RAM comes up clean and it never happens again, well, then yes, it could
be cosmic rays or similar.

I have seen this sort of thing only a couple of times in 20 years; only one of
those times did it not soon get worse. It's not really enough data to say
whether you are in for a bad time.

Cheers,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting
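For anyone wondering how a "single bit flip detected by ECC and corrected" works in principle, here is a toy illustration using a Hamming(7,4) code. This is an assumption-laden sketch only: real cache ECC protects much wider words and the exact scheme used in this CPU's L3 is not stated anywhere in the thread. The function names `encode` and `correct` are made up for the example.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4      # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(codeword):
    """Return (data_bits, error_position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome gives the flipped position
    if position:
        c[position - 1] ^= 1          # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


data = [1, 0, 1, 1]
word = encode(data)
word[5] ^= 1                          # simulate one flipped bit (position 6)
recovered, where = correct(word)
print(recovered, where)
```

The data comes back intact and the code can report which bit it repaired, which is roughly what "Corrected error, no action required" is telling you. A code like this cannot repair two flipped bits in the same word, which is why a rising error rate is worth taking seriously.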
Re: [GLLUG] How worried should I be ...
On Friday, 22 May 2020 16:37:57 BST Alain D D Williams via GLLUG wrote:
> The message below was put to all login sessions this morning. I have never
> seen this before. There is nothing more in /var/log/messages.
>
> The machine is 8 years old, always switched on, AMD 8150 Eight-Core
> Processor.
>
> Should I take this as a warning and look to replace the machine or just
> shrug my shoulders & mutter something about cosmic rays?
>
> TIA
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: Error Status: Corrected error, no action required.
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: CPU:0 (15:1:2)
> MC4_STATUS[-|CE|MiscV|-|AddrV|-|Poison|CECC]: 0x9d5c4881011c011b
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: MC4_ADDR: 0x00076f75be90
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Several years ago there was an on-line demonstration of an SGI Purple computer
which used terabytes of non-ECC RAM because of the price, and simply marked
faulty sections as not available until they could be bothered to shut down and
swap it.

--
Chris Bell
Website http://chrisbell.org.uk
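The bracketed flags in the MC4_STATUS line are just individual bits of the 64-bit status value. A minimal sketch of the decoding, assuming the bit positions used by the Linux kernel's MCI_STATUS_* definitions and the field order printed for family 15h parts. The helper names `bit` and `decode_mc_status` are made up here, and this is not the kernel's own code.

```python
# Assumed bit positions (per the Linux kernel's MCI_STATUS_* masks).
OVER, UC, MISCV, ADDRV, PCC = 62, 61, 59, 58, 57
DEFERRED, POISON = 44, 43


def bit(status, n):
    """Return bit n of the status word as 0 or 1."""
    return (status >> n) & 1


def decode_mc_status(status):
    """Rebuild the flag string the kernel prints inside MCx_STATUS[...]."""
    fields = [
        "Over" if bit(status, OVER) else "-",
        "UE" if bit(status, UC) else ("-" if bit(status, DEFERRED) else "CE"),
        "MiscV" if bit(status, MISCV) else "-",
        "PCC" if bit(status, PCC) else "-",
        "AddrV" if bit(status, ADDRV) else "-",
        "Deferred" if bit(status, DEFERRED) else "-",
        "Poison" if bit(status, POISON) else "-",
    ]
    ecc = (status >> 45) & 0x3        # bits 46:45, 2 = corrected, else uncorrected
    if ecc:
        fields.append("CECC" if ecc == 2 else "UECC")
    return "|".join(fields)


print(decode_mc_status(0x9D5C4881011C011B))
```

Fed the value from the quoted message, this reproduces the flag string in the log: the UE bit is clear (so "CE", a corrected error), there is no overflow, and the ECC field says corrected, which agrees with "no action required".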