Re: OpenBSD crash on an IBM x3550 M3

Mike Belopuhov Thu, 03 Mar 2011 11:10:49 -0800

On Fri, Feb 04, 2011 at 14:53 +0000, emeric boit wrote:
> Hello,
> 
> After doing a clean install of OpenBSD 4.8 (AMD64) on an IBM x3550 M3,
> I find the system randomly panics after a period of use.
> uvm_fault(0xffffffff80cc8360, 0xffff8000149b7000, 0, 1) -> e
> kernel: page fault trap, code=0
> Stopped at      mpi_reply+0x102:        movq    0(%r13),%rax
> ddb{0}>
> 
> ddb{0}> trace
> mpi_reply() at mpi_reply+0x102
> mpi_intr() at mpi_intr+0x20
> Xintr_ioapic_level18() at Xintr_ioapic_level18+0xec
> --- interrupt ---
> Bad frame pointer: 0xffff8000194e1920
> end trace frame: 0xffff8000194e1920, count: -3
> Xspllower+0xe:
> ddb{0}>
>

We've tried different things, but after this hint I realised
that what might be happening is that the bnx and mpi interrupts
are chained (it's bnx0 actually, my initial guess about bnx1
was wrong) and mpi_intr is called first.  Currently neither
mpi(4) nor mpii(4) checks the interrupt status register; both
look directly into the reply post queue.  Although there's
not supposed to be any race between the host CPU reading
from memory and the IOC writing to it, in practice it turns
out that in some particular hardware configurations this rule
is violated and we read a garbled reply from the controller.
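
To illustrate the idea outside the kernel, here is a minimal userland
sketch of a handler on a shared interrupt line that checks a status
bit before draining the reply queue.  Everything named fake_* or
FAKE_* is hypothetical and exists only for this simulation; the bit
value, the queue layout and the clearing of the status bit after the
drain are assumptions, not the real mpi(4) registers or behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values, chosen only for this simulation. */
#define FAKE_INTR_STATUS_REPLY	0x08u
#define FAKE_REPLY_QUEUE_EMPTY	0xffffffffu

/* Simulated controller state; not the real mpi(4) softc. */
struct fake_ioc {
	uint32_t	intr_status;	/* simulated interrupt status register */
	uint32_t	queue[4];	/* simulated reply post queue */
	size_t		head;		/* index of the next reply to pop */
	size_t		count;		/* replies still queued */
	int		handled;	/* replies processed so far */
};

/* Pop one reply, or return the "empty" marker when nothing is queued. */
static uint32_t
fake_pop_reply(struct fake_ioc *sc)
{
	if (sc->count == 0)
		return (FAKE_REPLY_QUEUE_EMPTY);
	sc->count--;
	return (sc->queue[sc->head++]);
}

/*
 * Handler for a shared interrupt line.  It returns 0 ("not mine")
 * without touching the reply queue unless the status register says
 * the controller has posted a reply.
 */
static int
fake_intr(void *arg)
{
	struct fake_ioc *sc = arg;
	uint32_t reg;
	int rv = 0;

	assert(sc != NULL);

	/* Another device on the line raised the interrupt: bail out. */
	if ((sc->intr_status & FAKE_INTR_STATUS_REPLY) == 0)
		return (rv);

	while ((reg = fake_pop_reply(sc)) != FAKE_REPLY_QUEUE_EMPTY) {
		sc->handled++;	/* stand-in for processing the reply */
		rv = 1;
	}

	/* Simulation only: pretend the drain cleared the status bit. */
	sc->intr_status &= ~FAKE_INTR_STATUS_REPLY;
	return (rv);
}
```

With the status bit clear, fake_intr() leaves a half-written queue
alone and reports the interrupt as unclaimed, so the next handler in
the chain (bnx in the case above) gets its turn.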

If my memory serves, I considered this check for mpii_intr
but never ran into a situation where it was needed and
thus omitted it.  I guess I have to bring it back too.

Emeric tortured the machine with this diff and reported that
it solves the issue for him.  OK to commit?

On Wed, Mar 02, 2011 at 17:20 +0000, emeric boit wrote:
> hi,
> 
> This change doesn't solve the issue.
> 
> I have noticed that the server crashes when I use the network.
> 
> I copy a small file several times without problem.
> On the IBM I do :
> scp USER@IP:/tmp/mpi.c .
> 
> And when I copy a larger file the server crashes :
> scp USER@IP:/bsd .
> 
> 
> And when I copy the same file (bsd) from a USB key I don't have a problem.
> 
> Emeric.
> 

that sounds like an interrupt sharing bug of some sort.
is it bnx1 that you're using to reproduce a crash?

try the following diff please (on a clean checkout):

Index: mpi.c
===================================================================
RCS file: /home/cvs/src/sys/dev/ic/mpi.c,v
retrieving revision 1.166
diff -u -p -r1.166 mpi.c
--- mpi.c       1 Mar 2011 23:48:33 -0000       1.166
+++ mpi.c       2 Mar 2011 17:40:13 -0000
@@ -887,6 +887,9 @@ mpi_intr(void *arg)
        u_int32_t                       reg;
        int                             rv = 0;

+       if ((mpi_read_intr(sc) & MPI_INTR_STATUS_REPLY) == 0)
+               return (rv);
+
        while ((reg = mpi_pop_reply(sc)) != 0xffffffff) {
                mpi_reply(sc, reg);
                rv = 1;
