Re: [gentoo-user] time to build a new machine ?

2021-09-24 Thread Adam Carter
>
> >> man mcelog
>
> 'man mcelog' + 'man mce' find nothing.  does it need to be installed ?
>

Yep and the package is called mcelog.

Did you check for any other messages before/after the mce errors?

Do you also have lm-sensors installed? Running sensord?

Genuine CPU issues seem pretty rare, so I would check for overheating or
power issues, and lm-sensors will help with that.
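
For a quick look at temperatures without the lm-sensors userland tools, the kernel's hwmon sysfs interface can be read directly. The following is a minimal Python sketch, not part of lm-sensors itself. The function name read_hwmon_temps is made up for illustration, and it assumes the relevant sensor driver (for example k10temp on AMD) is loaded, so that temp*_input files holding millidegrees Celsius exist under /sys/class/hwmon.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM": degrees_C} for each temp*_input under root."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        # Fall back to the directory name if the driver exposes no "name" file.
        chip_name = name_file.read_text().strip() if name_file.exists() else "unknown"
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor unreadable or not reporting a number
            label = sensor.name[: -len("_input")]
            temps[f"{chip.name}:{chip_name}/{label}"] = millideg / 1000.0
    return temps


if __name__ == "__main__":
    for label, deg in read_hwmon_temps().items():
        print(f"{label}: {deg:.1f} C")
```

Running it while the machine is under load and again when idle gives a rough idea of whether heat is a plausible cause.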


Re: [gentoo-user] time to build a new machine ?

2021-09-24 Thread Philip Webb
210924 Mark Knecht wrote:
> On 2021-09-24, at 05:58, Philip Webb  wrote:
>> While I was asleep yesterday, my machine reported on all  3  Konsoles
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a0100
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
MK> I have no direct experience with this error,
> however I'd suggest it was most likely an error
> reading a block of DRAM and not likely the CPU itself failing.
> I periodically get mce errors on my i980 machine
> when running big PixInsight jobs and I hit thermal limits.

I thought you had written "1980 machine" (grin).

> I'd suggest you run extensive memory tests 
> and if you don't see any problems don't worry too much.
> It's always wise to do good backups in case the problem gets worse.

Everything is backed up, incl off-site.

> On Fri, Sep 24, 2021 at 8:23 AM Andrew Udvare  wrote:
>> man mcelog

'man mcelog' + 'man mce' find nothing.  does it need to be installed ?

Thanks for the advice so far.

-- 
,,
SUPPORT ___//___,   Philip Webb
ELECTRIC   /] [] [] [] [] []|   Cities Centre, University of Toronto
TRANSIT`-O--O---'   purslowatchassdotutorontodotca




Re: [gentoo-user] time to build a new machine ?

2021-09-24 Thread Mark Knecht
On Fri, Sep 24, 2021 at 8:23 AM Andrew Udvare  wrote:
>
> On 24/09/2021 06:48, Philip Webb wrote:
> > 210924 Andrew Udvare wrote:
> >> On 2021-09-24, at 05:58, Philip Webb  wrote:
> >>> While I was asleep yesterday, my machine reported on all  3  Konsoles :
> >>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> >>> : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
> >>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> >>> : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a0100
> >>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> >>> : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
> >>> -- end of report --
> >>  From the manpage:
> >
> > Which man page is that ?
>
> man mcelog

I have no direct experience with this error; however, I'd suggest it was most
likely an error reading a block of DRAM and not likely the CPU itself failing.
I periodically get mce errors on my i980 machine when running big PixInsight
jobs and I hit thermal limits.

I'd suggest you run extensive memory tests and if you don't see any problems
don't worry too much. Of course, it's always wise to do good backups in case
the problem gets worse.

Good luck,
Mark


Re: [gentoo-user] time to build a new machine ?

2021-09-24 Thread Andrew Udvare

On 24/09/2021 06:48, Philip Webb wrote:
> 210924 Andrew Udvare wrote:
>> On 2021-09-24, at 05:58, Philip Webb  wrote:
>>> While I was asleep yesterday, my machine reported on all  3  Konsoles :
>>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>>> : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
>>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>>> : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a0100
>>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>>> : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
>>> -- end of report --
>> From the manpage:
>
> Which man page is that ?

man mcelog





Re: [gentoo-user] time to build a new machine ?

2021-09-24 Thread Philip Webb
210924 Andrew Udvare wrote:
> On 2021-09-24, at 05:58, Philip Webb  wrote:
>> While I was asleep yesterday, my machine reported on all  3  Konsoles :
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a0100 
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
>> -- end of report --
> From the manpage:

Which man page is that ?

> Most errors can be corrected by the CPU
> by internal error correction mechanisms.  Uncorrected errors cause
> machine check exceptions which may kill processes or panic the machine.
> A small number of corrected errors is usually not a cause for worry,
> but a large number can indicate future failure.

So it looks as if the above was a correctable error.
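
That reading can be cross-checked against the status word in the report. Below is a rough Python sketch, assuming the value printed after "Bank 4:" is the raw MCi_STATUS register. It decodes only the architectural x86 flag bits and leaves the vendor-specific fields alone. decode_status is a hypothetical helper name, not something mcelog provides.

```python
# Bit positions of the architectural flags in an x86 MCi_STATUS register.
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_status(status_hex):
    """Map each architectural flag name to True/False for a hex status word."""
    status = int(status_hex, 16)
    return {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}


flags = decode_status("9d0b4c16001d011b")
for name, is_set in flags.items():
    print(f"{name:5} {'set' if is_set else 'clear'}")
print("corrected" if flags["VAL"] and not flags["UC"] else "uncorrected or invalid")
```

For 9d0b4c16001d011b this shows VAL set with UC and PCC clear, which is consistent with a corrected error.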

> When an uncorrected machine check error happens
> that the kernel cannot recover from, then it will usually panic the system.
> In this case when there was a warm reset after the panic,
> mcelog should pick up the machine check errors after reboot.
> This is not possible after a cold reset.

No sign of any other effects : everything went on running.

> If you are overclocking, try disabling it.

No, I never overclock anything (smile).

-- 
,,
SUPPORT ___//___,   Philip Webb
ELECTRIC   /] [] [] [] [] []|   Cities Centre, University of Toronto
TRANSIT`-O--O---'   purslowatchassdotutorontodotca




Re: [gentoo-user] time to build a new machine ?

2021-09-24 Thread Andrew Udvare



> On 2021-09-24, at 05:58, Philip Webb  wrote:
> 
> While I was asleep yesterday, my machine reported on all  3  Konsoles :
> 
> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
> 
> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a0100 
> 
> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
> 
> -- end of report --
> 
> I don't remember seeing this before : how concerned should I be ?

From the manpage:

   Most errors can be corrected by the CPU by internal error correction
   mechanisms. Uncorrected errors cause machine check exceptions which may
   kill processes or panic the machine. A small number of corrected errors
   is usually not a cause for worry, but a large number can indicate
   future failure.

   When an uncorrected machine check error happens that the kernel cannot
   recover from then it will usually panic the system. In this case when
   there was a warm reset after the panic mcelog should pick up the machine
   check errors after reboot. This is not possible after a cold reset.

If you are overclocking, try disabling it.
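
To judge whether the number of corrected errors is small or growing, the kernel messages can be tallied per CPU and bank. This is a minimal Python sketch, assuming the log lines have the same shape as in the report quoted above. count_mce_events is a hypothetical name, and where the messages end up (for instance /var/log/messages) depends on the syslog setup.

```python
import re
from collections import Counter

# Matches the first line of each report, e.g.
# "mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b"
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)",
    re.IGNORECASE,
)


def count_mce_events(log_text):
    """Count machine check reports per (cpu, bank) pair found in log_text."""
    counts = Counter()
    for match in MCE_LINE.finditer(log_text):
        cpu, bank = int(match.group(1)), int(match.group(2))
        counts[(cpu, bank)] += 1
    return counts


sample = """\
: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
: mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a0100
: mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
"""

for (cpu, bank), n in sorted(count_mce_events(sample).items()):
    print(f"CPU {cpu} bank {bank}: {n} event(s)")
```

A single event on one bank, as here, fits the "small number" case the manpage describes. A count that keeps climbing over days would be the thing to watch.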




[gentoo-user] time to build a new machine ?

2021-09-24 Thread Philip Webb
While I was asleep yesterday, my machine reported on all  3  Konsoles :

Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b

Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
: mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a0100 

Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
: mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822

-- end of report --

I don't remember seeing this before : how concerned should I be ?

The present machine is  6  years old & has always worked very well ;
its CPU is an AMD.  I plan to build a new machine in the next few months :
should I accelerate my plans ?

-- 
,,
SUPPORT ___//___,   Philip Webb
ELECTRIC   /] [] [] [] [] []|   Cities Centre, University of Toronto
TRANSIT`-O--O---'   purslowatchassdotutorontodotca