Re: [GLLUG] How worried should I be ...

2020-05-24 Thread John Hearns via GLLUG
These errors are logged in mcelog also - if you are running mcelog!

On Sun, 24 May 2020 at 16:12, John Hearns  wrote:

> As Martin Brooks says, run memtest.
> You can run the user space memtester on circa 90% of the RAM.
> Even better, download https://www.stresslinux.org/sl/
> Format a USB stick and boot from it. Then run the memtester utility there.
>
> On a server I would also advise using the iDRAC or BMC to get a list of the
> hardware events.
>
> On Fri, 22 May 2020 at 18:18, James Courtier-Dutton via GLLUG <
> gllug@mailman.lug.org.uk> wrote:
>
>> On Fri, 22 May 2020 at 16:38, Alain D D Williams via GLLUG
>>  wrote:
>> >
>> > The message below was put to all login sessions this morning. I have
>> > never seen this before. There is nothing more in /var/log/messages.
>> >
>> > The machine is 8 years old, always switched on, AMD 8150 Eight-Core
>> > Processor.
>> >
>> > Should I take this as a warning and look to replace the machine or just
>> > shrug my shoulders & mutter something about cosmic rays ?
>> >
>> > TIA
>> >
>> >
>> > Message from syslogd@mint at May 22 07:27:09 ...
>> >  kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.
>> >
>> > Message from syslogd@mint at May 22 07:27:09 ...
>> >  kernel:[Hardware Error]: Error Status: Corrected error, no action required.
>> >
>>
>> If this is a one off, I would not worry about it.
>> Bits flip occasionally.
>> If you are getting it continuously, then power off the box. Reboot it,
>> and see if the problem goes away.
>> If it is always there, even after a cold power cycle, you have a hardware
>> fault.
>
>
-- 
GLLUG mailing list
GLLUG@mailman.lug.org.uk
https://mailman.lug.org.uk/mailman/listinfo/gllug

Re: [GLLUG] How worried should I be ...

2020-05-24 Thread John Hearns via GLLUG
As Martin Brooks says, run memtest.
You can run the user space memtester on circa 90% of the RAM.
Even better, download https://www.stresslinux.org/sl/
Format a USB stick and boot from it. Then run the memtester utility there.
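
A minimal sketch of the "circa 90% of the RAM" sizing, assuming the usual
"memtester <size>M <loops>" invocation and a MemTotal line in the format used
by /proc/meminfo. The function name and the percent/loops parameters are made
up for illustration. It only builds the command; it does not run it.

```python
import re


def memtester_command(meminfo_text, percent=90, loops=1):
    """Build a memtester argv that covers `percent` of MemTotal.

    `meminfo_text` is expected to look like the contents of /proc/meminfo.
    Integer arithmetic is used so the result is exact.
    """
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    total_kb = int(match.group(1))
    # Scale to the requested share, then convert kB to whole megabytes.
    megabytes = total_kb * percent // 100 // 1024
    return ["memtester", f"{megabytes}M", str(loops)]
```

On a live system you would pass in the text read from /proc/meminfo and run
the result as root. On a busy machine memtester may not be able to lock that
much memory, which is one reason to prefer booting a dedicated test image.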

On a server I would also advise using the iDRAC or BMC to get a list of the
hardware events.

On Fri, 22 May 2020 at 18:18, James Courtier-Dutton via GLLUG <
gllug@mailman.lug.org.uk> wrote:

> On Fri, 22 May 2020 at 16:38, Alain D D Williams via GLLUG
>  wrote:
> >
> > The message below was put to all login sessions this morning. I have
> > never seen this before. There is nothing more in /var/log/messages.
> >
> > The machine is 8 years old, always switched on, AMD 8150 Eight-Core
> > Processor.
> >
> > Should I take this as a warning and look to replace the machine or just
> > shrug my shoulders & mutter something about cosmic rays ?
> >
> > TIA
> >
> >
> > Message from syslogd@mint at May 22 07:27:09 ...
> >  kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.
> >
> > Message from syslogd@mint at May 22 07:27:09 ...
> >  kernel:[Hardware Error]: Error Status: Corrected error, no action required.
> >
>
> If this is a one off, I would not worry about it.
> Bits flip occasionally.
> If you are getting it continuously, then power off the box. Reboot it,
> and see if the problem goes away.
> If it is always there, even after a cold power cycle, you have a hardware
> fault.

Re: [GLLUG] How worried should I be ...

2020-05-22 Thread James Courtier-Dutton via GLLUG
On Fri, 22 May 2020 at 16:38, Alain D D Williams via GLLUG
 wrote:
>
> The message below was put to all login sessions this morning. I have never
> seen this before. There is nothing more in /var/log/messages.
>
> The machine is 8 years old, always switched on, AMD 8150 Eight-Core Processor.
>
> Should I take this as a warning and look to replace the machine or just shrug
> my shoulders & mutter something about cosmic rays ?
>
> TIA
>
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: Error Status: Corrected error, no action required.
>

If this is a one off, I would not worry about it.
Bits flip occasionally.
If you are getting it continuously, then power off the box. Reboot it,
and see if the problem goes away.
If it is always there, even after a cold power cycle, you have a hardware fault.
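
To tell a one-off from something continuous, the events can be counted per
day. This is a rough sketch that assumes the log text looks like the syslogd
wall messages quoted above (a "Message from ... at May 22 07:27:09" line
followed by the kernel line). The function name is hypothetical, and logs
taken from a file or the journal may need a different timestamp pattern.

```python
import re
from collections import Counter

# One "MCn Error" line is printed per machine check event.
EVENT = re.compile(r"\[Hardware Error\]: MC\d+ Error")
# Timestamp as it appears in the quoted wall messages, e.g. "at May 22 07:27:09".
STAMP = re.compile(r"at (\w{3} +\d+) \d\d:\d\d:\d\d")


def errors_per_day(log_text):
    """Count machine check events per day in syslogd-style message text."""
    counts = Counter()
    day = "unknown"
    for line in log_text.splitlines():
        stamp = STAMP.search(line)
        if stamp:
            day = stamp.group(1)  # remember the most recent timestamp seen
        if EVENT.search(line):
            counts[day] += 1
    return counts
```

A single entry with a count of 1 fits the "bits flip occasionally" case.
Counts that keep growing, or that return after a cold power cycle, point to
the hardware fault case.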


Re: [GLLUG] How worried should I be ...

2020-05-22 Thread Martin A. Brooks via GLLUG

On 2020-05-22 16:37, Alain D D Williams via GLLUG wrote:
> The message below was put to all login sessions this morning. I have never
> seen this before. There is nothing more in /var/log/messages.


You probably have faulty RAM.  Run memtest.


Re: [GLLUG] How worried should I be ...

2020-05-22 Thread Andy Smith via GLLUG
Hello,

On Fri, May 22, 2020 at 04:37:57PM +0100, Alain D D Williams via GLLUG wrote:
> Should I take this as a warning and look to replace the machine or just shrug
> my shoulders & mutter something about cosmic rays ?
>
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.
> 
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: Error Status: Corrected error, no action required.

The L3 cache is inside the CPU. It can be a faulty CPU; I think it
could possibly also be faulty RAM if you do not have ECC RAM
(otherwise the problem would have been detected in the RAM, not the
L3 cache). Either way it is a single bit flip detected by ECC in the
cache and corrected.

If you can shut the machine down I would run a few passes of
memtest. That will hopefully spot any RAM problems.

If the RAM comes up clean but it keeps happening, I would really
suspect the CPU and plan for a replacement soon.

If the RAM comes up clean and it never happens again, well, then yes
it could be cosmic rays or similar. I have seen this sort of thing
only a couple of times in 20 years; only one of those times did it
not soon get worse. It's not really enough data to say whether you
are in for a bad time.

Cheers,
Andy

-- 
https://bitfolk.com/ -- No-nonsense VPS hosting


Re: [GLLUG] How worried should I be ...

2020-05-22 Thread Chris Bell via GLLUG
On Friday, 22 May 2020 16:37:57 BST Alain D D Williams via GLLUG wrote:
> The message below was put to all login sessions this morning. I have never
> seen this before. There is nothing more in /var/log/messages.
> 
> The machine is 8 years old, always switched on, AMD 8150 Eight-Core
> Processor.
> 
> Should I take this as a warning and look to replace the machine or just
> shrug my shoulders & mutter something about cosmic rays ?
> 
> TIA
> 
> 
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.
> 
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: Error Status: Corrected error, no action required.
> 
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: CPU:0 (15:1:2) MC4_STATUS[-|CE|MiscV|-|AddrV|-|Poison|CECC]: 0x9d5c4881011c011b
> 
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: MC4_ADDR: 0x00076f75be90
> 
> Message from syslogd@mint at May 22 07:27:09 ...
>  kernel:[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Several years ago there was an on-line demonstration of an SGI Purple computer 
which used terabytes of non-ECC RAM because of the price, and simply marked 
faulty sections as not available until they could be bothered to shut down and 
swap it.

-- 
Chris Bell
Website http://chrisbell.org.uk




[GLLUG] How worried should I be ...

2020-05-22 Thread Alain D D Williams via GLLUG
The message below was put to all login sessions this morning. I have never seen
this before. There is nothing more in /var/log/messages.

The machine is 8 years old, always switched on, AMD 8150 Eight-Core Processor.

Should I take this as a warning and look to replace the machine or just shrug my
shoulders & mutter something about cosmic rays ?

TIA


Message from syslogd@mint at May 22 07:27:09 ...
 kernel:[Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.

Message from syslogd@mint at May 22 07:27:09 ...
 kernel:[Hardware Error]: Error Status: Corrected error, no action required.

Message from syslogd@mint at May 22 07:27:09 ...
 kernel:[Hardware Error]: CPU:0 (15:1:2) MC4_STATUS[-|CE|MiscV|-|AddrV|-|Poison|CECC]: 0x9d5c4881011c011b

Message from syslogd@mint at May 22 07:27:09 ...
 kernel:[Hardware Error]: MC4_ADDR: 0x00076f75be90

Message from syslogd@mint at May 22 07:27:09 ...
 kernel:[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
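
For anyone who wants to check the MC4_STATUS value by hand, here is a minimal
Python sketch. It assumes the architectural machine check status layout (Val
63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57), the AMD CECC/UECC bits
at 46 and 45, and the memory error code format "0000 0001 RRRR TTLL" in the
low 16 bits. The function and table names are made up for illustration. This
is not the kernel's decoder, and other bits, which vary by CPU family, are
deliberately ignored.

```python
# Illustrative decoder for an MCi_STATUS value; all names are hypothetical.
FLAG_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57,
    "CECC": 46, "UECC": 45,
}
# Labels chosen to match the strings in the kernel message quoted above.
LEVELS = ["L0", "L1", "L2", "L3/GEN"]
TX_TYPES = ["INSN", "DATA", "GEN", "RESV"]
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Split a 64-bit status value into flags and memory error fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}
    # Memory error codes have the form 0000 0001 RRRR TTLL.
    if (code >> 8) == 0x01:
        rrrr = (code >> 4) & 0xF
        result["cache_level"] = LEVELS[code & 0x3]
        result["tx"] = TX_TYPES[(code >> 2) & 0x3]
        result["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "unknown"
    return result


if __name__ == "__main__":
    info = decode_status(0x9D5C4881011C011B)
    print(sorted(name for name, is_set in info["flags"].items() if is_set))
    print(info["cache_level"], info["tx"], info["mem_tx"])
```

For the value in the log, UC is clear and CECC is set, which agrees with the
"Corrected error, no action required" line, and the low 16 bits decode to the
same "L3/GEN, GEN, RD" fields that the kernel printed.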



-- 
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT 
Lecturer.
+44 (0) 787 668 0256  https://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: 
https://www.phcomp.co.uk/Contact.html
#include 
