On Wed, Apr 1, 2015 at 6:37 PM, Andrey Korolyov <and...@xdel.ru> wrote:
> On Wed, Apr 1, 2015 at 4:19 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>
>>
>> On 01/04/2015 14:26, Andrey Korolyov wrote:
>>> Yes, I disabled host watchdog during runtime. Indeed guest-induced NMI
>>> would look different and they had no reasons to be fired at this stage
>>> inside guest. I`d suspect a hypervisor hardware misbehavior there but
>>> have a very little idea on how APICv behavior (which is completely
>>> microcode-dependent and CPU-dependent but decoupled from peripheral
>>> hardware) may vary at this point, I am using 1.20140913.1 ucode
>>> version from debian if this can matter. Will send trace suggested by
>>> Paolo in a next couple of hours. Also it would be awesome to ask
>>> hardware folks from Intel who can prove or disprove my abovementioned
>>> statement (as I was unable to catch the problem on 2603v2 so far, this
>>> hypothesis has some chance to be real).
>>
>> Yes, the interaction with the NMI watchdog is unexpected and makes a
>> processor erratum somewhat more likely.
>>
>> Paolo
>
>
> http://xdel.ru/downloads/kvm-e5v2-issue/trace-nmi-apicv-fail-at-reboot.dat.gz
>
> err, no NMI entries nearby failure event, though capture should be correct:
> /sys/kernel/debug/tracing/events/kvm*/filter
> /sys/kernel/debug/tracing/events/*/kvm*/filter
> /sys/kernel/debug/tracing/events/nmi*/filter
> /sys/kernel/debug/tracing/events/*/nmi*/filter
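For anyone who wants to double-check which events that capture actually covered: the four filter globs quoted above can be expanded with a few lines of Python. This is only a rough sketch, not what was used for the trace; the function name `matching_filters` and the `events_dir` parameter are made up for illustration, and it assumes debugfs is mounted at the usual /sys/kernel/debug.

```python
import glob
import os

# The four glob patterns from the quoted message, relative to the
# tracing "events" directory.
PATTERNS = ["kvm*/filter", "*/kvm*/filter", "nmi*/filter", "*/nmi*/filter"]


def matching_filters(events_dir="/sys/kernel/debug/tracing/events"):
    """Return the sorted list of filter files the patterns expand to.

    events_dir is a hypothetical parameter so the function can be pointed
    at a copy of the tree; the default assumes the usual debugfs mount.
    """
    found = set()
    for pattern in PATTERNS:
        found.update(glob.glob(os.path.join(events_dir, pattern)))
    return sorted(found)


if __name__ == "__main__":
    # Reading the real tree normally requires root.
    for path in matching_filters():
        print(path)
```

If no path containing nmi shows up in the output, the NMI events were never part of the capture, which would be a simpler explanation for the missing entries than the NMIs not firing.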
I moved the 2603v2s back and the issue is still there. I used the wrong reproduction pattern in the previous series of tests on those CPUs in the middle of the month: I was continuously respawning VMs, whereas the real issue hides in the *first* reboot events after a hypervisor reboot (or module load). So either it should be reproducible anywhere, or this is not a hardware issue (or it is related to the mainboard rather than the CPU itself :) ).