On the old box, I tried 7.6 and got the same crash. I will try
7.4 next, if that makes sense.
Testing is difficult. The panic often happens less than
an hour into the stress test, but a run without a panic
is no guarantee that the bug is absent.
Maybe it's some race condition involving disk or network.
So far, the earliest version that did show the panic
is 7.4-install. I'm stress testing 7.3-install now.
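
To put a rough number on "no guarantee": a minimal sketch, assuming a buggy kernel panics within any given hour with some fixed probability and that hours are independent. The 0.5 used in the usage note is a made-up figure, not a measurement, and both function names are only for illustration.

```c
#include <assert.h>

/*
 * Chance that a kernel which DOES have the bug still survives
 * `hours` of stress testing, if each hour independently panics
 * with probability p_per_hour (hypothetical figure).
 */
double
false_clean_probability(double p_per_hour, int hours)
{
	double q = 1.0;
	int i;

	assert(p_per_hour >= 0.0 && p_per_hour <= 1.0 && hours >= 0);
	for (i = 0; i < hours; i++)
		q *= 1.0 - p_per_hour;
	return q;
}

/*
 * Smallest number of panic-free hours after which the chance of
 * wrongly calling a buggy version "good" is at most `risk`.
 */
int
hours_needed(double p_per_hour, double risk)
{
	double q = 1.0;
	int n = 0;

	assert(p_per_hour > 0.0 && p_per_hour <= 1.0);
	assert(risk > 0.0 && risk < 1.0);
	while (q > risk) {
		q *= 1.0 - p_per_hour;
		n++;
	}
	return n;
}
```

With the made-up 0.5 per hour, hours_needed(0.5, 0.01) comes out at 7, so a version would need about seven clean hours before calling it "good" is wrong less than 1% of the time. If the real rate is lower, the required time grows quickly.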
This is the code for the panic (with the DEBUG parts removed).
I wonder what that magic number 10 is about ...
sys/dev/pv/xen.c line 1180:
void
xen_grant_table_remove(struct xen_softc *sc, grant_ref_t ref)
{
	struct xen_gntent *ge;
	uint32_t flags, *ptr;
	int loop;

	ref -= ge->ge_start;

	/* Invalidate the grant reference */
	virtio_membar_sync();
	ptr = (uint32_t *)&ge->ge_table[ref];
	flags = (ge->ge_table[ref].flags & ~(GTF_reading|GTF_writing)) |
	    (ge->ge_table[ref].domid << 16);
	loop = 0;
	while (atomic_cas_uint(ptr, flags, GTF_invalid) != flags) {
		if (loop++ > 10) {
******			panic("grant table reference %u is held "
			    "by domain %d: frame %#x flags %#x",
			    ref + ge->ge_start, ge->ge_table[ref].domid,
			    ge->ge_table[ref].frame,
			    ge->ge_table[ref].flags);
		}
#if (defined(__amd64__) || defined(__i386__))
		__asm volatile("pause": : : "memory");
#endif
	}
	ge->ge_table[ref].frame = 0xffffffff;
}
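
As far as I can read it, the loop above is a bounded compare-and-swap retry: the expected value has GTF_reading and GTF_writing masked off, so the swap to GTF_invalid can only succeed once the other domain has dropped those bits, and 10 is presumably just a retry budget. Below is a minimal userland sketch of that pattern, not the kernel code. It uses C11 atomics instead of atomic_cas_uint(), returns false where the kernel panics, and the names grant_invalidate and MAX_RETRIES are made up. The flag values are as I recall them from Xen's grant_table.h, so treat them as assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag values (from memory of Xen's public grant_table.h). */
#define GTF_invalid		0U
#define GTF_permit_access	1U
#define GTF_reading		(1U << 3)
#define GTF_writing		(1U << 4)

#define MAX_RETRIES		10	/* stands in for the magic number */

/*
 * entry holds the flags in the low 16 bits and the domid in the
 * high 16 bits, like the first word of a v1 grant entry on a
 * little-endian machine.
 * Returns true if the entry was invalidated, false if it was still
 * marked as being read or written after the retry budget ran out.
 */
bool
grant_invalidate(_Atomic uint32_t *entry)
{
	uint32_t expected, want;
	int loop = 0;

	assert(entry != NULL);

	/* Value we expect once the peer is no longer using the grant. */
	expected = atomic_load(entry) & ~(GTF_reading | GTF_writing);

	for (;;) {
		/* A failed exchange overwrites want, so reset it each time. */
		want = expected;
		if (atomic_compare_exchange_strong(entry, &want, GTF_invalid))
			return true;
		if (loop++ > MAX_RETRIES)
			return false;	/* the kernel code panics here */
	}
}
```

In this sketch an entry with only GTF_permit_access set is invalidated on the first attempt, while one that keeps GTF_reading set fails after a dozen attempts and is left untouched. Each attempt is only one CAS plus a pause, so the whole budget is probably very short in wall-clock terms.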
The "process" of the panic is the xbf interrupt handler,
which is clearly disk-related. There was a hypervisor-related
update between 7.3 and 7.4 that looks network-related.
However, it also seems to change the kernel locking
granularity in that code.
Putting the diff here, maybe it rings a bell for someone:
diff -r 73/sys/dev/pv/hyperv.c 74/sys/dev/pv/hyperv.c
410,412c410,411
< __asm__ volatile ("mov %0, %%r8" : : "r" (output_pa) : "r8");
< __asm__ volatile ("call *%3" : "=a" (status) : "c" (control),
< "d" (input_pa), "m" (sc->sc_hc));
---
> extern uint64_t hv_hypercall_trampoline(uint64_t, paddr_t, paddr_t);
> status = hv_hypercall_trampoline(control, input_pa, output_pa);
diff -r 73/sys/dev/pv/hypervic.c 74/sys/dev/pv/hypervic.c
849c849
< int i, j, lo, hi, s, af;
---
> int i, j, lo, hi, af;
873,874c873
< KERNEL_LOCK();
< s = splnet();
---
> NET_LOCK_SHARED();
881,882c880
< splx(s);
< KERNEL_UNLOCK();
---
> NET_UNLOCK_SHARED();
892c890
< * we were asked for for an IPv6 address.
---
> * we were asked for an IPv6 address.
922,923c920
< splx(s);
< KERNEL_UNLOCK();
---
> NET_UNLOCK_SHARED();
959,960c956
< splx(s);
< KERNEL_UNLOCK();
---
> NET_UNLOCK_SHARED();
--korni