On Wed, Apr 1, 2015 at 6:37 PM, Andrey Korolyov <and...@xdel.ru> wrote:
> On Wed, Apr 1, 2015 at 4:19 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>
>>
>> On 01/04/2015 14:26, Andrey Korolyov wrote:
>>> Yes, I disabled the host watchdog during runtime. Indeed, guest-induced
>>> NMIs would look different, and there was no reason for them to fire at
>>> this stage inside the guest. I'd suspect hypervisor hardware misbehavior
>>> there, but I have very little idea of how APICv behavior (which is
>>> completely microcode-dependent and CPU-dependent but decoupled from
>>> peripheral hardware) may vary at this point. I am using ucode version
>>> 1.20140913.1 from Debian, if this matters. I will send the trace
>>> suggested by Paolo in the next couple of hours. It would also be great
>>> to ask hardware folks from Intel who can prove or disprove the above
>>> statement (as I was unable to catch the problem on the 2603v2 so far,
>>> this hypothesis has some chance of being true).
>>
>> Yes, the interaction with the NMI watchdog is unexpected and makes a
>> processor erratum somewhat more likely.
>>
>> Paolo
>
>
> http://xdel.ru/downloads/kvm-e5v2-issue/trace-nmi-apicv-fail-at-reboot.dat.gz
>
> err, no NMI entries near the failure event, though the capture should be correct:
> /sys/kernel/debug/tracing/events/kvm*/filter
> /sys/kernel/debug/tracing/events/*/kvm*/filter
> /sys/kernel/debug/tracing/events/nmi*/filter
> /sys/kernel/debug/tracing/events/*/nmi*/filter

I put the 2603v2s back and the issue is still there. In my previous
series of tests on those CPUs, in the middle of the month, I used the
wrong reproduction pattern: I was continuously respawning VMs, whereas
the real issue hides in the *first* reboot events after a hypervisor
reboot (or module load). So either it should be reproducible anywhere,
or this is not a hardware issue (or it is related to the mainboard
rather than the CPU itself :) ).
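
For anyone who wants to double-check which filter files the four glob patterns quoted above actually match on their own host, here is a minimal sketch in Python. It is not the tooling used for the trace in this thread; the function name `list_filters` and the default tracing root are assumptions for illustration, and reading these files normally requires root with debugfs mounted.

```python
import glob
import os

# Glob patterns quoted earlier in this thread, relative to the tracing root.
PATTERNS = [
    "events/kvm*/filter",
    "events/*/kvm*/filter",
    "events/nmi*/filter",
    "events/*/nmi*/filter",
]


def list_filters(tracing_root="/sys/kernel/debug/tracing"):
    """Return {path: filter text} for every filter file the patterns match.

    tracing_root is an assumed default; adjust it if tracefs is mounted
    elsewhere on your system.
    """
    result = {}
    for pattern in PATTERNS:
        for path in sorted(glob.glob(os.path.join(tracing_root, pattern))):
            try:
                with open(path) as f:
                    result[path] = f.read().strip()
            except OSError as exc:
                # Keep going so one unreadable file does not hide the rest.
                result[path] = "unreadable: %s" % exc
    return result


if __name__ == "__main__":
    for path, text in list_filters().items():
        print(path, "->", text)
```

An empty result would mean the patterns matched nothing on that kernel, which is one way a capture could silently miss NMI events.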
