Re: ipmi0: incorrect critical max

2023-03-21 Thread Lloyd Parkes




On 22/03/23 03:45, Stephen Borrill wrote:
> On Sat, 18 Mar 2023, Lloyd Parkes wrote:
>> On 18/03/23 05:14, Stephen Borrill wrote:
>>> On an HP Microserver Gen10 Plus, I found that soon after booting, I
>>> get the following alert:
>>>
>>> ...
>>>                  Current  CritMax  WarnMax  WarnMin  CritMin  Unit
>>> [ipmi0]
>>>    11-LOM-CORE:   59.253    0.000  110.471                    degC
>>
>> Just out of interest, in the BIOS (RBSU) what is the Power Management
>> / Power Regulator set to? It will have settings such as "Dynamic Power
>> Savings Mode" and "OS Control Mode".
>
> I set it to Maximum I/O Performance (words may not match exactly, it is
> in a box waiting to be installed at a customer).


OK. When you don't set it to OS Controlled, the HPE RBSU chops power 
management out of the ACPI in a way that makes Linux complain about 
corrupt ACPI information.


I realise that you are looking at IPMI, not ACPI, but this does have that
HPE smell of something being crudely hidden from your view because the
RBSU is managing it. That could just be a coincidence, of course.


Cheers,
Lloyd


Re: ipmi0: incorrect critical max

2023-03-21 Thread Stephen Borrill

On Sat, 18 Mar 2023, Lloyd Parkes wrote:
> On 18/03/23 05:14, Stephen Borrill wrote:
>> On an HP Microserver Gen10 Plus, I found that soon after booting, I get the
>> following alert:
>>
>> ...
>>                  Current  CritMax  WarnMax  WarnMin  CritMin  Unit
>> [ipmi0]
>>    11-LOM-CORE:   59.253    0.000  110.471                    degC
>
> Just out of interest, in the BIOS (RBSU) what is the Power Management / Power
> Regulator set to? It will have settings such as "Dynamic Power Savings Mode"
> and "OS Control Mode".


I set it to Maximum I/O Performance (words may not match exactly, it is in 
a box waiting to be installed at a customer).


--
Stephen


Re: ipmi0: incorrect critical max

2023-03-18 Thread Michael van Elst
net...@precedence.co.uk (Stephen Borrill) writes:

>                 Current  CritMax  WarnMax  WarnMin  CritMin  Unit
>[ipmi0]
>   11-LOM-CORE:   59.253    0.000  110.471                    degC

>Seen on 9.3_STABLE, but also in 10 BETA.

>I suppose one simple fix would be to ensure that if CritMax is lower 
>than WarnMax, it should be set to the value of WarnMax.

IPMI reports 3 upper and 3 lower limits (each as an unsigned byte)
and a bitmask to show which of these values are valid.

lower non-recoverable threshold
-> configures CritMin
lower critical threshold
-> configures CritMin
lower non-critical threshold
-> configures WarnMin

Lower limits of 0 are ignored, because a reading cannot fall below them.


upper non-recoverable threshold
-> configures CritMax
upper critical threshold
-> configures CritMax
upper non-critical threshold
-> configures WarnMax

Upper limits of 255 are ignored, because a reading cannot exceed them.
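
The mapping above can be sketched in C roughly as follows. This is a standalone illustration, not the driver's code: `struct limits`, `decode_limits` and the `HAVE_*` flags are hypothetical names, and limits are kept as raw bytes rather than converted. It assumes that data[0] is the validity bitmask, data[1..3] are the lower non-critical, critical and non-recoverable thresholds, and data[4..6] are the upper ones in the same order. It also assumes that when both the critical and the non-recoverable threshold are valid, the critical one wins.

```c
#include <stdint.h>

/* Hypothetical flags recording which limits were decoded. */
enum {
	HAVE_CRITMAX = 1 << 0,
	HAVE_WARNMAX = 1 << 1,
	HAVE_WARNMIN = 1 << 2,
	HAVE_CRITMIN = 1 << 3
};

/* Hypothetical container; limits are kept as raw, unconverted bytes. */
struct limits {
	unsigned props;
	uint8_t critmax, warnmax, warnmin, critmin;
};

/*
 * Assumed layout: data[0] is the validity bitmask, data[1..3] are the
 * lower non-critical/critical/non-recoverable thresholds, data[4..6]
 * the upper non-critical/critical/non-recoverable thresholds.
 */
struct limits
decode_limits(const uint8_t data[7])
{
	struct limits l = { 0, 0, 0, 0, 0 };

	/* Upper limits: a raw value of 255 cannot be exceeded, so skip it. */
	if ((data[0] & 0x20) && data[6] != 0xff) {	/* non-recoverable */
		l.critmax = data[6];
		l.props |= HAVE_CRITMAX;
	}
	if ((data[0] & 0x10) && data[5] != 0xff) {	/* critical (wins) */
		l.critmax = data[5];
		l.props |= HAVE_CRITMAX;
	}
	if ((data[0] & 0x08) && data[4] != 0xff) {	/* non-critical */
		l.warnmax = data[4];
		l.props |= HAVE_WARNMAX;
	}

	/* Lower limits: a raw value of 0 cannot be undercut, so skip it. */
	if ((data[0] & 0x04) && data[3] != 0x00) {	/* non-recoverable */
		l.critmin = data[3];
		l.props |= HAVE_CRITMIN;
	}
	if ((data[0] & 0x02) && data[2] != 0x00) {	/* critical (wins) */
		l.critmin = data[2];
		l.props |= HAVE_CRITMIN;
	}
	if ((data[0] & 0x01) && data[1] != 0x00) {	/* non-critical */
		l.warnmin = data[1];
		l.props |= HAVE_WARNMIN;
	}
	return l;
}
```

Note that an upper critical threshold flagged as valid with a raw value of 0 passes these checks, since only 255 is filtered on the upper side.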


Apparently your system says that the upper critical or the
non-recoverable threshold exists but returns a value of zero for it.

The code could do some more sanity checking and then just
skip the invalid limits.

Something like:

@@ -1582,6 +1684,16 @@ ipmi_get_sensor_limits(struct ipmi_softc
 		break;
 	}
 
+	if ((data[0] & 0x28) == 0x28 && data[6] < data[4])
+		data[0] ^= 0x20;
+	if ((data[0] & 0x18) == 0x18 && data[5] < data[4])
+		data[0] ^= 0x10;
+
+	if ((data[0] & 0x05) == 0x05 && data[3] > data[1])
+		data[0] ^= 0x04;
+	if ((data[0] & 0x03) == 0x03 && data[2] > data[1])
+		data[0] ^= 0x02;
+
 	if (data[0] & 0x20 && data[6] != 0xff) {
 		*pcritmax = ipmi_convert_sensor(&data[6], psensor);
 		*props |= prop_critmax;
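
To make the idea easier to try outside the kernel, here is a minimal standalone sketch of the same kind of sanity check. The function name `sanitize_thresholds` is hypothetical. The sketch assumes that bits 0 to 2 of data[0] flag the lower non-critical, critical and non-recoverable thresholds in data[1..3], and bits 3 to 5 flag the upper ones in data[4..6]; the masks used for the lower-limit checks follow from that assumption.

```c
#include <stdint.h>

/*
 * Drop critical limits that contradict the warning limit on the same
 * side. A limit is only dropped when both it and the warning limit are
 * flagged as valid, so the XOR always clears a bit that is known to be
 * set.
 */
void
sanitize_thresholds(uint8_t data[7])
{
	/* Upper side: a critical limit below the warning limit is bogus. */
	if ((data[0] & 0x28) == 0x28 && data[6] < data[4])
		data[0] ^= 0x20;	/* upper non-recoverable */
	if ((data[0] & 0x18) == 0x18 && data[5] < data[4])
		data[0] ^= 0x10;	/* upper critical */

	/* Lower side: a critical limit above the warning limit is bogus. */
	if ((data[0] & 0x05) == 0x05 && data[3] > data[1])
		data[0] ^= 0x04;	/* lower non-recoverable */
	if ((data[0] & 0x03) == 0x03 && data[2] > data[1])
		data[0] ^= 0x02;	/* lower critical */
}
```

With a mask of 0x18, an upper non-critical byte of 110 and an upper critical byte of 0 (illustrative raw values, not taken from the machine), the mask becomes 0x08, so only the warning limit survives.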


As an alternative you could also override the limit in /etc/envsys.conf.
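
As a rough illustration of that alternative, an override might look like the fragment below. This is a sketch under assumptions: the sensor index (`sensor10`) is hypothetical and has to be looked up for the actual machine, and `100C` is only a placeholder, not a recommended limit for this hardware. See envsys.conf(5) for the exact syntax.

```
ipmi0 {
	# Hypothetical index for 11-LOM-CORE; placeholder limit value.
	sensor10 { critical-max = 100C; }
}
```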




Re: ipmi0: incorrect critical max

2023-03-17 Thread Lloyd Parkes




On 18/03/23 05:14, Stephen Borrill wrote:
> On an HP Microserver Gen10 Plus, I found that soon after booting, I get
> the following alert:
>
> ...
>                  Current  CritMax  WarnMax  WarnMin  CritMin  Unit
> [ipmi0]
>    11-LOM-CORE:   59.253    0.000  110.471                    degC



Just out of interest, in the BIOS (RBSU) what is the Power Management / 
Power Regulator set to? It will have settings such as "Dynamic Power 
Savings Mode" and "OS Control Mode".


Cheers,
Lloyd


Re: ipmi0: incorrect critical max

2023-03-17 Thread Brad Spencer
Stephen Borrill  writes:

> On an HP Microserver Gen10 Plus, I found that soon after booting, I get 
> the following alert:
>
> ipmi0: critical over limit on '11-LOM-CORE'
>
> If powerd is running (the default), it shuts the machine down (so 
> basically as soon as it hits multi-user).
>
> envstat shows that CritMax is zero:
>
>                  Current  CritMax  WarnMax  WarnMin  CritMin  Unit
> [ipmi0]
>    11-LOM-CORE:   59.253    0.000  110.471                    degC
>
> Seen on 9.3_STABLE, but also in 10 BETA.
>
> I suppose one simple fix would be to ensure that if CritMax is lower 
> than WarnMax, it should be set to the value of WarnMax.
>
> Any other things to look at? The machine won't be put into production for
> a few days, so it's a good time to experiment.
>
> I have put the latest BIOS on the machine.



If that server has an independent out-of-band management system in it,
such as a BMC with a command-line or web interface, I would get into that
and see whether it reports the sensors there, just to check whether the
ipmi driver pulls them correctly.  The BMC may not have a way to specify
or tell you what the Crit and Warn values are, but it would be worth
looking around for that too.  Failing any of that, I think you should be
able to set what NetBSD thinks the CritMax is in /etc/envsys.conf.  See
envsys.conf(5) for details.

I have an ASRock Rack board that doesn't report one of the sensors
correctly, and/or the IPMI driver doesn't pull it correctly.  It shows a
fixed value that never changes, no matter what...



-- 
Brad Spencer - b...@anduin.eldar.org - KC8VKS - http://anduin.eldar.org


ipmi0: incorrect critical max

2023-03-17 Thread Stephen Borrill
On an HP Microserver Gen10 Plus, I found that soon after booting, I get 
the following alert:


ipmi0: critical over limit on '11-LOM-CORE'

If powerd is running (the default), it shuts the machine down (so 
basically as soon as it hits multi-user).


envstat shows that CritMax is zero:

                 Current  CritMax  WarnMax  WarnMin  CritMin  Unit
[ipmi0]
   11-LOM-CORE:   59.253    0.000  110.471                    degC

Seen on 9.3_STABLE, but also in 10 BETA.

I suppose one simple fix would be to ensure that if CritMax is lower 
than WarnMax, it should be set to the value of WarnMax.
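
A minimal sketch of that idea, working on already decoded values in degC rather than on the raw bytes the driver sees. The helper name `clamp_critmax` is hypothetical and not part of any existing code:

```c
/*
 * Hypothetical helper: if the critical maximum lies below the warning
 * maximum, raise it to the warning maximum; otherwise leave it alone.
 */
double
clamp_critmax(double critmax, double warnmax)
{
	return critmax < warnmax ? warnmax : critmax;
}
```

For the values shown above, a CritMax of 0.000 and a WarnMax of 110.471 would give a CritMax of 110.471.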


Any other things to look at? The machine won't be put into production for
a few days, so it's a good time to experiment.

I have put the latest BIOS on the machine.

--
Stephen