Hi Corey,
Thanks a lot for you kindly guide.

I send the version 2 patch to you: [PATCH] [v2] ipmi: retry to get device id 
when error
If there is still something wrong or I missed something, please correct me:)

-----Original Message-----
From: Corey Minyard [mailto:tcminy...@gmail.com] On Behalf Of Corey Minyard
Sent: Monday, September 14, 2020 1:26 AM
To: tianxianting (RD) <tian.xiant...@h3c.com>
Cc: a...@arndb.de; gre...@linuxfoundation.org; 
openipmi-develo...@lists.sourceforge.net; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ipmi: retry to get device id when error

On Sun, Sep 13, 2020 at 02:10:01PM +0000, Tianxianting wrote:
> Hi Corey
> Thanks for your quickly reply,
> We didn't try the method you mentioned, actually, I didn't know it 
> before you told me:( The issue ever occurred on our 2 ceph storage server 
> both with low probability.
> We finally use this patch to solve the issue, it can automatically solve the 
> issue when it happened. So no need to judge and reload ipmi driver manually 
> or develop additional scripts to make it.
> The 1 second delay is acceptable to us.
> 
> If there really isn't a BMC out there, ipmi driver will not be loaded, is it 
> right?

No, there are systems that have IPMI controllers that are specified in the 
ACPI/SMBIOS tables but have no IPMI controller.

> 
> May be we can adjust to retry 3 times with 500ms interval? 

Maybe there is another way.  What I'm guessing is happening is not that the 
interface is lossy, but that the BMC is in the middle of a reset or an upgrade. 
 The D5 completion code means: Cannot execute command. Command, or request 
parameter(s), not supported in present state.

That error is also returned from bt_start_transaction() in the driver if the 
driver is not in the right state when starting a transaction, which may point 
to a bug in the BT state machine.  Search for IPMI_NOT_IN_MY_STATE_ERR in 
drivers/char/ipmi/ipmi_bt_sm.c.

If it's coming fron the state machine, it would be handy to know what state it 
was in when the error happened to help trace the bug down.
That's what I would suggest first.  Fix the fundamental bug, if you can.
a pr_warn() added to the code there would be all that's needed, and thats 
probably a good permanent addition.

I would be ok with a patch that retried some number of times if it got a
D5 completion code.  That wouldn't have any effect on other systems.
Plus you could add a D1 and D2 completion code, which are similar things.  Add 
the new completion codes to include/uapi/linux/ipmi_msgdefs.h and implement the 
retry.  That makes sense from a general point of view.

Thanks,

-corey

> 
> Thanks in advance if you can feedback again.
> 
> -----Original Message-----
> From: Corey Minyard [mailto:tcminy...@gmail.com] On Behalf Of Corey 
> Minyard
> Sent: Sunday, September 13, 2020 8:40 PM
> To: tianxianting (RD) <tian.xiant...@h3c.com>
> Cc: a...@arndb.de; gre...@linuxfoundation.org; 
> openipmi-develo...@lists.sourceforge.net; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] ipmi: retry to get device id when error
> 
> On Sun, Sep 13, 2020 at 08:02:03PM +0800, Xianting Tian wrote:
> > We can't get bmc's device id with low probability when loading ipmi 
> > driver, it caused bmc device register failed. This issue may caused 
> > by bad lpc signal quality. When this issue happened, we got below kernel 
> > printks:
> >     [Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: IPMI message handler: 
> > device id demangle failed: -22
> >     [Wed Sep  9 19:52:03 2020] IPMI BT: using default values
> >     [Wed Sep  9 19:52:03 2020] IPMI BT: req2rsp=5 secs retries=2
> >     [Wed Sep  9 19:52:03 2020] ipmi_si IPI0001:00: Unable to get the device 
> > id: -5
> >     [Wed Sep  9 19:52:04 2020] ipmi_si IPI0001:00: Unable to register
> > device: error -5
> > 
> > When this issue happened, we want to manually unload the driver and 
> > try to load it again, but it can't be unloaded by 'rmmod' as it is already 
> > 'in use'.
> 
> I'm not sure this patch is a good idea; it would cause a long boot delay in 
> situations where there really isn't a BMC out there.  Yes, it happens.
> 
> You don't have to reload the driver to add a device, though.  You can hot-add 
> devices using /sys/modules/ipmi_si/parameters/hotmod.  Look in 
> Documentation/driver-api/ipmi.rst for details.
> 
> Does that work for you?
> 
> -corey
> 
> > 
> > We add below 'printk' in handle_one_recv_msg(), when this issue 
> > happened, the msg we received is "Recv: 1c 01 d5", which means the 
> > data_len is 1, data[0] is 0xd5.
> >     Debug code:
> >     static int handle_one_recv_msg(struct ipmi_smi *intf,
> >                                struct ipmi_smi_msg *msg) {
> >             printk("Recv: %*ph\n", msg->rsp_size, msg->rsp);
> >             ... ...
> >     }
> > Then in ipmi_demangle_device_id(), it returned '-EINVAL' as 'data_len < 7'
> > and 'data[0] != 0'.
> > 
> > We used this patch to retry to get device id when error happen, we 
> > reproduced this issue again and the retry succeed on the first 
> > retry, we finally got the correct msg and then all is ok:
> > Recv: 1c 01 00 01 81 05 84 02 af db 07 00 01 00 b9 00 10 00
> > 
> > So use retry machanism in this patch to give bmc more opportunity to 
> > correctly response kernel.
> > 
> > Signed-off-by: Xianting Tian <tian.xiant...@h3c.com>
> > ---
> >  drivers/char/ipmi/ipmi_msghandler.c | 17 ++++++++++++++---
> >  1 file changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/char/ipmi/ipmi_msghandler.c
> > b/drivers/char/ipmi/ipmi_msghandler.c
> > index 737c0b6b2..bfb2de77a 100644
> > --- a/drivers/char/ipmi/ipmi_msghandler.c
> > +++ b/drivers/char/ipmi/ipmi_msghandler.c
> > @@ -34,6 +34,7 @@
> >  #include <linux/uuid.h>
> >  #include <linux/nospec.h>
> >  #include <linux/vmalloc.h>
> > +#include <linux/delay.h>
> >  
> >  #define IPMI_DRIVER_VERSION "39.2"
> >  
> > @@ -60,6 +61,9 @@ enum ipmi_panic_event_op {  #else  #define 
> > IPMI_PANIC_DEFAULT IPMI_SEND_PANIC_EVENT_NONE  #endif
> > +
> > +#define GET_DEVICE_ID_MAX_RETRY    5
> > +
> >  static enum ipmi_panic_event_op ipmi_send_panic_event = 
> > IPMI_PANIC_DEFAULT;
> >  
> >  static int panic_op_write_handler(const char *val, @@ -2426,19
> > +2430,26 @@ send_get_device_id_cmd(struct ipmi_smi *intf)  static 
> > +int
> > __get_device_id(struct ipmi_smi *intf, struct bmc_device *bmc)  {
> >     int rv;
> > -
> > -   bmc->dyn_id_set = 2;
> > +   unsigned int retry_count = 0;
> >  
> >     intf->null_user_handler = bmc_device_id_handler;
> >  
> > +retry:
> > +   bmc->dyn_id_set = 2;
> > +
> >     rv = send_get_device_id_cmd(intf);
> >     if (rv)
> >             return rv;
> >  
> >     wait_event(intf->waitq, bmc->dyn_id_set != 2);
> >  
> > -   if (!bmc->dyn_id_set)
> > +   if (!bmc->dyn_id_set) {
> > +           msleep(1000);
> > +           if (++retry_count <= GET_DEVICE_ID_MAX_RETRY)
> > +                   goto retry;
> > +
> >             rv = -EIO; /* Something went wrong in the fetch. */
> > +   }
> >  
> >     /* dyn_id_set makes the id data available. */
> >     smp_rmb();
> > --
> > 2.17.1
> > 

Reply via email to