On Mon, Apr 20, 2026 at 11:33:01AM -0500, Corey Minyard wrote:
> On Sun, Apr 19, 2026 at 09:50:38PM +0100, Matt Fleming wrote:
> > On Fri, Apr 17, 2026 at 06:53:55PM -0500, Corey Minyard wrote:
> > > 
> > > The EVENT_MSG_BUFFER_FULL flag only gets cleared when a unsuccessful
> > > READ_EVENT_MSG_BUFFER command completes.  Getting data from the
> > > BMC has higher priority than sending data to the BMC.
> > > 
> > > If the BMC continually reports success from READ_EVENT_MSG_BUFFER, then
> > > that would certainly wedge the driver.  But it would have to continually
> > > report success for that command, which would be strange as its supposed
> > > to error out when the queue is empty.
> >  
> > That does indeed appear to be what's happening.
> > 
> > The implementation of intel-ipmi-oem's OpenBMC READ_EVENT_MSG_BUFFER
> > handler does not fail when there is nothing to read,
> > 
> >   
> > https://github.com/openbmc/intel-ipmi-oem/blob/master/src/bridgingcommands.cpp#L704
> 
> Actually, that is so clearly wrong that it's hard to imagine they made
> this mistake.  Anyway, a defect needs to be filed against it.  It will
> certainly break other software on other OSes.
> 
> I think I have an easy workaround, though I'm guessing.  I'm guessing
> they are returning zero data bytes. There's no check on the size at that
> point in the code (it's later).
> 
> Can you try the following patch?

Actually ignore that, I was thinking too much about the other stuff and
didn't actually read my patch.

Here's a working patch:

diff --git a/drivers/char/ipmi/ipmi_si_intf.c b/drivers/char/ipmi/ipmi_si_intf.c
index 4a9e9de4d684..08c208cc64c5 100644
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -630,7 +630,13 @@ static void handle_transaction_done(struct smi_info 
*smi_info)
                 */
                msg = smi_info->curr_msg;
                smi_info->curr_msg = NULL;
-               if (msg->rsp[2] != 0) {
+               /*
+                * It appears some BMCs, with no event data, return no
+                * data in the message and not a 0x80 error as the
+                * spec says they should.  Shut down processing if
+                * the data is not the right length.
+                */
+               if (msg->rsp[2] != 0 || msg->rsp_size != 19) {
                        /* Error getting event, probably done. */
                        msg->done(msg);



> 
> diff --git a/drivers/char/ipmi/ipmi_si_intf.c 
> b/drivers/char/ipmi/ipmi_si_intf.c
> index 4a9e9de4d684..cf8674a93af1 100644
> --- a/drivers/char/ipmi/ipmi_si_intf.c
> +++ b/drivers/char/ipmi/ipmi_si_intf.c
> @@ -630,7 +630,13 @@ static void handle_transaction_done(struct smi_info 
> *smi_info)
>                  */
>                 msg = smi_info->curr_msg;
>                 smi_info->curr_msg = NULL;
> -               if (msg->rsp[2] != 0) {
> +               /*
> +                * It appears some BMCs, with no event data, return no
> +                * data in the message and not a 0x80 error as the
> +                * spec says they should.  Shut down processing if
> +                * the data is not the right length.
> +                */
> +               if (msg->rsp[2] != 0 || smi_info->curr_msg->rsp_size != 19) {
>                         /* Error getting event, probably done. */
>                         msg->done(msg);
> 
>  
> With your approval on that, I'll send it to Linus after letting it sit
> in the next tree for a bit.  Actually, I'll probably add it in any case,
> as it's a good check to do.
> 
> > 
> > > If it's really something like that, I could also look at adding limits
> > > for those operations.
> > 
> > That would be great. Me and Fred would be happy to test out any patch.
> > 
> > I still think the original patch I sent is a worthwhile defense.
> > 
> > Our periodic monitoring scripts cause TASK_UNINTERRUPTIBLE tasks to
> > block behind one another when we hit these kinds of issues in the IPMI
> > code. Untangling that across thousands of machines can be time
> > consuming and a more explicit EIO or ETIMEDOUT would help with triage.
> 
> Unfortunately, that might have other issues, similar to the ones the
> people with the watchdog issue found.  I'll look at it a bit, but those
> sorts of things would have to be scattered all over the code, not just
> in that one place.  As you say, it would make debugging easier.
> 
> I think adding a counter to the number of operations occuring from a
> single flag fetch would be a way to avoid this issue.  That's going to
> take a little more time, but I'll definitely work on that.
> 
> Thanks,
> 
> -corey


_______________________________________________
Openipmi-developer mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/openipmi-developer

Reply via email to