On the old box, I tried 7.6 and got the same crash. I will try
7.4 next, if that makes sense.

Testing is difficult. The panic often happens less than
an hour into the stress test, but a run that survives
longer is no guarantee that the bug is absent.
Maybe it's some race condition involving disk or network.
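
To put a rough number on "no guarantee", here is a minimal sketch. It assumes that stress runs are independent and that a kernel with the bug panics within one run with probability p_panic. Both are assumptions, not measurements, and the function name is made up for illustration.

```c
/*
 * Hypothetical helper: smallest number of consecutive clean runs n
 * such that a buggy kernel would survive all of them with probability
 * (1 - p_panic)^n <= alpha.  Returns -1 for inputs that make no sense.
 */
int
clean_runs_needed(double p_panic, double alpha)
{
	double miss = 1.0;	/* chance that a buggy kernel is still unnoticed */
	int n = 0;

	if (p_panic <= 0.0 || p_panic > 1.0 || alpha <= 0.0 || alpha >= 1.0)
		return -1;
	while (miss > alpha) {
		miss *= 1.0 - p_panic;
		n++;
	}
	return n;
}
```

Under these assumptions, if a buggy kernel panics in half of the one-hour runs (p_panic = 0.5), it takes 5 clean runs before the chance of having missed the bug drops below 5%.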

So far, the earliest version that showed the panic
is 7.4-install. I'm stress testing 7.3-install now.

This is the code that panics (with the DEBUG parts removed).
(I wonder what that magic number 10 is about ...)

sys/dev/pv/xen.c line 1180:

void
xen_grant_table_remove(struct xen_softc *sc, grant_ref_t ref)
{
        struct xen_gntent *ge;
        uint32_t flags, *ptr;
        int loop;

        ref -= ge->ge_start;
        /* Invalidate the grant reference */
        virtio_membar_sync();
        ptr = (uint32_t *)&ge->ge_table[ref];
        flags = (ge->ge_table[ref].flags & ~(GTF_reading|GTF_writing)) |
            (ge->ge_table[ref].domid << 16);
        loop = 0;
        while (atomic_cas_uint(ptr, flags, GTF_invalid) != flags) {
                if (loop++ > 10) {
                ****** panic("grant table reference %u is held "
                            "by domain %d: frame %#x flags %#x",
                            ref + ge->ge_start, ge->ge_table[ref].domid,
                            ge->ge_table[ref].frame,
                            ge->ge_table[ref].flags);
                }
#if (defined(__amd64__) || defined(__i386__))
                __asm volatile("pause": : : "memory");
#endif
        }
        ge->ge_table[ref].frame = 0xffffffff;
}
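
On the magic number: as far as the loop above can be read, the expected value for the compare-and-swap is computed once, with GTF_reading and GTF_writing masked out. The swap can therefore only succeed once the other domain has dropped those bits, and the counter bounds how long the loop spins waiting for that. Below is a userspace sketch of the same pattern with C11 atomics. It is not the kernel code: the flag values, MAX_SPINS, and the function name are stand-ins for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

#define GTF_INVALID	0u		/* stand-in values, for illustration only */
#define GTF_READING	(1u << 3)
#define GTF_WRITING	(1u << 4)
#define MAX_SPINS	10		/* mirrors the "loop++ > 10" check */

/*
 * Try to swap the entry to GTF_INVALID.  The expected value is the
 * entry as seen on entry, with the in-use bits cleared, so the swap
 * fails for as long as a reader or writer bit stays set.
 * Returns 0 on success, -1 at the point where the kernel would panic.
 * *failed receives the number of failed compare-and-swap attempts.
 */
int
grant_invalidate(_Atomic uint32_t *entry, int *failed)
{
	uint32_t idle, expected;
	int loop = 0;

	idle = atomic_load(entry) & ~(GTF_READING | GTF_WRITING);
	*failed = 0;
	for (;;) {
		expected = idle;	/* a failed CAS overwrites expected; reset it */
		if (atomic_compare_exchange_strong(entry, &expected,
		    GTF_INVALID))
			return 0;
		(*failed)++;
		if (loop++ > MAX_SPINS)
			return -1;
	}
}
```

With an in-use bit that never clears, this sketch gives up on the 12th failed attempt. If the kernel loop behaves the same way, the backend presumably gets only a very short window to release the grant, since nothing but a "pause" separates the attempts.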


The "process" of the panic is the xbf interrupt handler,
which is clearly disk-related. There was a Xen-related
update between 7.3 and 7.4 that looked
network-related; however, it seems to change some
general kernel locking granularity in the hypervisor-related
code.

Putting the diff here, maybe it rings a bell for someone:

diff -r 73/sys/dev/pv/hyperv.c 74/sys/dev/pv/hyperv.c
410,412c410,411
<    __asm__ volatile ("mov %0, %%r8" : : "r" (output_pa) : "r8");
<    __asm__ volatile ("call *%3" : "=a" (status) : "c" (control),
<        "d" (input_pa), "m" (sc->sc_hc));
---
>    extern uint64_t hv_hypercall_trampoline(uint64_t, paddr_t, paddr_t);
>    status = hv_hypercall_trampoline(control, input_pa, output_pa);
diff -r 73/sys/dev/pv/hypervic.c 74/sys/dev/pv/hypervic.c
849c849
<    int i, j, lo, hi, s, af;
---
>    int i, j, lo, hi, af;
873,874c873
<    KERNEL_LOCK();
<    s = splnet();
---
>    NET_LOCK_SHARED();
881,882c880
<            splx(s);
<            KERNEL_UNLOCK();
---
>            NET_UNLOCK_SHARED();
892c890
<             * we were asked for for an IPv6 address.
---
>             * we were asked for an IPv6 address.
922,923c920
<                    splx(s);
<                    KERNEL_UNLOCK();
---
>                    NET_UNLOCK_SHARED();
959,960c956
<    splx(s);
<    KERNEL_UNLOCK();
---
>    NET_UNLOCK_SHARED();

--korni
