> Date: Wed, 29 Jul 2020 13:03:43 -0700
> From: Mike Larkin <mlar...@nested.page>
> 
> Hi,
> 
>  I'm seeing crashes on amd64 GENERIC.MP on a few VMs recently. This happens
> on GENERIC.MP regardless of whether or not the VM has one cpu or more than
> one. It does not happen on GENERIC kernels.
> 
>  The crash will happen fairly quickly after the kernel starts executing
> processes. Sometimes it crashes instantly, sometimes it lasts for a minute
> or two. It rarely makes it to the login prompt. The problem is 100%
> reproducible on two different VMs I have, running on two different
> hypervisors (Hyper-V and ESXi6.7U2).
> 
>  I first started noticing the problem on the 24th July snap, but TBH these
> machines were not frequently updated, so the previous snap I had installed
> might have been a couple months old. Whatever older snap was on them before
> worked fine.
> 
>  Since this is happening on two different machines with two different VMs,
> I'm gonna rule out hardware issues.
> 
>  Crash:
> 
> kernel: protection fault trap, code=0
> Stopped at    setrunqueue+0xa2:       addl    $0x1,0x288(%r13)
> 
>  Trace:
> ddb{2}> trace
> setrunqueue(27b3d6c24c3fab80, ffff800015e874e0,32) at setrunqueue+0xa2
> sched_barrier_task(ffff800015f1a168) at sched_barrier_task+0x6c
> taskq_thread(ffffffff82121548) at taskq_thread+0x8d
> end trace frame: 0x0, count: -3
> 
>  Registers:
> ddb{2}> sh r
> rdi                   0xffffffff821ee728      sched_lock
> rsi                   0xffff800014cc6ff0
> rbp                   0xffff800015ea0e40
> rbx                                    0
> rdx                             0x23ca94      acpi_pdirpa_0x2288fc
> rcx                                  0xc
> rax                                  0xc
> r8                                 0x202
> r9                                   0x2
> r10                                    0
> r11                   0x57f79bf6968709d8
> r12                   0xffff800015e874e0
> r13                   0x27b3d6c24c3fab80
> r14                                 0x32
> r15                   0x27b3d6c24c3fab80
> rip                   0xffffffff81b9df22      setrunqueue+0xa2
> cs                                   0x8
> rflags                                   0x10207      __ALIGN_SIZE+0xf207
> rsp                   0xffff800015ea0df0
> ss                                  0x10
> 
> 
> The offending instruction is in kern_sched.c:260:
> 
>       spc->spc_nrun++;
> 
> ... which indicates 'spc' is trash (and it is, based on %r13 above). In my
> tests, %r13 always is this same trash value. That comes from 'ci', which is
> either passed in or chosen by sched_choosecpu. Neither of these functions
> have changed recently, so I'm guessing this corruption is coming from
> something else.
> 
>  Anyone have ideas where to start looking? I suppose I could start bisecting,
> but does anyone know of any changes that would affect this area?
> 
>  I can send dmesgs if needed, but these are pretty standard VMs,
> nothing fancy configured in them. 4 CPUs, 8GB RAM, etc.

They're VMs and it turns out that many of the "PV" drivers are/were
using the intr_barrier() interface the wrong way.
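
To illustrate the kind of misuse meant here, below is a rough userspace
sketch, not kernel source. It assumes the barrier treats its cookie as an
interrupt handle and reads the target CPU out of it; every struct layout
and every *_model name is hypothetical. If a driver hands in some other
pointer (its softc, say), the "CPU" that comes back is whatever happens to
sit in the first word of that object, which would be consistent with the
trash value in %r13 above.

```c
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real kernel structures. */
struct cpu_model {
	int nrun;			/* models a per-CPU run counter */
};

struct intrhand_model {
	struct cpu_model *ih_cpu;	/* CPU the handler is bound to */
};

struct softc_model {
	struct cpu_model *sc_unrelated;	/* first word: not a CPU pointer */
	struct intrhand_model *sc_ih;	/* the handle the driver should pass */
};

/* Models a barrier that trusts its cookie to be an interrupt handle. */
struct cpu_model *
barrier_target(void *cookie)
{
	struct intrhand_model *ih = cookie;

	return ih->ih_cpu;
}

/* Returns 1 if passing the interrupt handle resolves to the real CPU. */
int
correct_cookie_hits_cpu(void)
{
	struct cpu_model cpu0 = { 0 };
	struct intrhand_model ih = { &cpu0 };
	struct softc_model sc = { (struct cpu_model *)(uintptr_t)0x27b3d6c2UL, &ih };

	return barrier_target(sc.sc_ih) == &cpu0;
}

/* Returns 1 if passing the softc itself resolves to the real CPU. */
int
wrong_cookie_hits_cpu(void)
{
	struct cpu_model cpu0 = { 0 };
	struct intrhand_model ih = { &cpu0 };
	struct softc_model sc = { (struct cpu_model *)(uintptr_t)0x27b3d6c2UL, &ih };

	/* The barrier reads sc.sc_unrelated as if it were ih_cpu. */
	return barrier_target(&sc) == &cpu0;
}
```

In this model the bogus pointer is only compared, never dereferenced. In
the crash above, the equivalent of spc->spc_nrun++ on such a pointer is
where the fault would show up.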

For Hyper-V, see my reply in the "Panic on boot with Hyper-V since Jun
17 snapshot" thread on bugs@ from earlier today.

Cheers,

Mark
