URL: <https://savannah.gnu.org/bugs/?64792>
Summary: Bad IPMI DCMI response from Huawei and Xfusion BMCs
Group: GNU FreeIPMI
Submitter: oleholmnielsen
Submitted: Thu 19 Oct 2023 08:15:00 AM UTC
Category: None
Severity: 3 - Normal
Priority: 5 - Normal
Item Group: None
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Any
Operating System: None

_______________________________________________________

Follow-up Comments:

-------------------------------------------------------
Date: Thu 19 Oct 2023 08:15:00 AM UTC By: Ole Holm Nielsen <oleholmnielsen>

We have successfully integrated the development FreeIPMI version 1.7.0 in our Linux cluster with the Slurm resource manager. My test is described in https://bugs.schedmd.com/show_bug.cgi?id=17639#c55 and I have documented the FreeIPMI setup in my Slurm Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#freeipmi-issues

Now we would like to deploy Slurm including the FreeIPMI power monitoring, but we have discovered a snag: We have 196 older Huawei XH620 V3 nodes (Intel Broadwell) whose BMC doesn't seem to support the IPMI DCMI extensions. A colleague at another university has the same problem with brand new Xfusion FusionOne HPC 1288H V6 servers (Intel IceLake, essentially rebranded Huawei servers) even though the server's BMC is documented to support DCMI 1.5!

On the Huawei and Xfusion nodes we get this error message:

$ ipmi-dcmi --get-system-power-statistics
ipmi_cmd_dcmi_get_power_reading: command invalid or unsupported

Due to this error, Slurm logs (spams) every minute in slurmd.log "error: _get_dcmi_power_reading: get DCMI power reading failed"

I've tried to find out how to query the Huawei BMC with IPMI DCMI but I only get error messages:

$ ipmi-dcmi --get-dcmi-capability-info
ipmi_cmd_dcmi_get_dcmi_capability_info_supported_dcmi_capabilities: bad completion code

I also tried each of the WORKAROUNDS listed in the ipmi-dcmi manual page, but in every case they return the same error.
The debug option gives some details:

$ ipmi-dcmi --get-dcmi-capability-info --debug
=====================================================
Group Extension - Get DCMI Capability Info Request
=====================================================
[ 1h] = cmd[ 8b]
[ DCh] = group_extension_identification[ 8b]
[ 1h] = parameter_selector[ 8b]
=====================================================
Group Extension - Get DCMI Capability Info Response
=====================================================
[ 1h] = cmd[ 8b]
[ D6h] = comp_code[ 8b]
ipmi_cmd_dcmi_get_dcmi_capability_info_supported_dcmi_capabilities: bad completion code

The non-DCMI commands seem to be working correctly. For example, I can read the system power:

$ ipmi-sensors -t Power_Unit
ID | Name  | Type       | Reading | Units | Event
22 | Power | Power Unit | 296.00  | W     | 'OK'
(lines deleted)

Question: Would a WORKAROUND be feasible to implement for Huawei and Xfusion servers? If so, how can we help by providing debugging information? Or is there some other way for getting the DCMI extensions to work?

Thanks a lot,
Ole

_______________________________________________________

File Attachments:

-------------------------------------------------------
Date: Thu 19 Oct 2023 08:15:00 AM UTC Name: bmc-info.log Size: 2KiB By: oleholmnielsen
Output from bmc-info
<http://savannah.gnu.org/bugs/download.php?file_id=55257>

_______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?64792>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
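
The debug trace in this report ends with a raw completion code (D6h) that ipmi-dcmi only reports as "bad completion code". The following is a minimal sketch, not part of FreeIPMI, of how the comp_code field could be extracted from such a trace and matched against the generic completion codes of the IPMI 2.0 specification. The names COMPLETION_CODES, find_completion_codes and describe are hypothetical, the table is deliberately partial, and the regular expression assumes the "[ D6h] = comp_code[ 8b]" line format shown above.

```python
import re

# Partial table of generic IPMI completion codes (IPMI 2.0, generic
# completion code table). Only a few entries are listed here; DCMI and
# vendors may also define command-specific codes not covered below.
COMPLETION_CODES = {
    0x00: "Command completed normally",
    0xC1: "Invalid command",
    0xC9: "Parameter out of range",
    0xCC: "Invalid data field in request",
    0xD4: "Insufficient privilege level",
    0xD5: "Command or request parameter not supported in present state",
    0xD6: "Command sub-function disabled or unavailable",
    0xFF: "Unspecified error",
}

# Assumed line format, taken from the trace in this report:
#   [ D6h] = comp_code[ 8b]
COMP_CODE_RE = re.compile(r"\[\s*([0-9A-Fa-f]+)h\]\s*=\s*comp_code\[")


def find_completion_codes(debug_text):
    """Return every comp_code value found in the debug text, as integers."""
    return [int(m.group(1), 16) for m in COMP_CODE_RE.finditer(debug_text)]


def describe(code):
    """Map a completion code to a short description (hypothetical helper)."""
    return COMPLETION_CODES.get(code, "unknown or command-specific code")


if __name__ == "__main__":
    # Two lines copied from the response part of the trace above.
    sample = "[ 1h] = cmd[ 8b]\n[ D6h] = comp_code[ 8b]\n"
    for code in find_completion_codes(sample):
        print(f"{code:02X}h: {describe(code)}")
```

Read against the generic table, D6h would suggest that the BMC recognizes the request but treats the sub-function as disabled or unavailable, rather than rejecting the command as unknown. BMC firmware does not always follow the generic meanings, so this is a hint for further debugging and not a diagnosis.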