
Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-04-05 Thread Andrey Korolyov

A small update:

The behavior is caused by setting the unrestricted_guest feature to N. I
have had this feature disabled everywhere for approximately three years,
since enabling it was one of the suspects in host crashes with the KVM
module of that time. Also, nVMX is likely not to work at all without
unrestricted_guest=1 and to produce the same traces as in
https://lkml.org/lkml/2014/7/17/12. I think this explains all the
real-mode weirdness we have seen before. It should probably be resolved
either by documenting it in a README or in the module information, by
making apicv and nested strictly depend on unrestricted_guest, or by
fixing the issue at its root, if that is possible and appropriate.
Thanks, everyone, for keeping up with ideas throughout this thread!
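
For anyone who wants to check their own hosts, here is a minimal sketch of reading a kvm_intel module parameter. It assumes the usual sysfs location /sys/module/kvm_intel/parameters and that boolean parameters read back as "Y"/"N" (or "1"/"0"); the helper names are hypothetical and only for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of a sysfs-style parameter file into buf.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
int read_param(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}

/* Map a boolean parameter value to 1 (enabled), 0 (disabled)
 * or -1 (anything else). */
int param_enabled(const char *value)
{
    if (!strcmp(value, "Y") || !strcmp(value, "1"))
        return 1;
    if (!strcmp(value, "N") || !strcmp(value, "0"))
        return 0;
    return -1;
}

/* Hypothetical helper: state of parameter `name` under directory `base`,
 * where `base` is normally "/sys/module/kvm_intel/parameters". */
int kvm_intel_param(const char *base, const char *name)
{
    char path[256], value[32];
    int n = snprintf(path, sizeof(path), "%s/%s", base, name);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    if (read_param(path, value, sizeof(value)) != 0)
        return -1;
    return param_enabled(value);
}
```

Calling kvm_intel_param("/sys/module/kvm_intel/parameters", "unrestricted_guest") on an affected host would be expected to return 0; the same call works for enable_apicv and nested.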
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-04-01 Thread Paolo Bonzini



On 01/04/2015 13:49, Radim Krčmář wrote:
 2015-03-31 21:23+0300, Andrey Korolyov:
 On Tue, Mar 31, 2015 at 9:04 PM, Bandan Das b...@redhat.com wrote:
 Bandan Das b...@redhat.com writes:
 Andrey Korolyov and...@xdel.ru writes:
 ...
 http://xdel.ru/downloads/kvm-e5v2-issue/another-tracepoint-fail-with-apicv.dat.gz

 Something a bit more interesting, but the mess is happening just
 *after* NMI firing.

 What happens if NMI is turned off on the host ?

 Sorry, I meant the watchdog..

 Thanks, everything goes well (as it probably should go there):
 http://xdel.ru/downloads/kvm-e5v2-issue/apicv-enabled-nmi-disabled.dat.gz
 
 Nice revelation!

Yes, pretty random but good to know.  Can you try again with the
nmi/nmi_handler tracepoint also?

Paolo

 KVM doesn't expect host's NMIs to look like this so it doesn't pass them
 to the host.  What was the watchdog that casually sent NMIs?
 (It worked after nmi_watchdog=0 on the host?)
 
 (Guest's NMI should have a different result as well.  NMI_EXCEPTION is
  an expected exit reason for guest's hard exceptions, they are then
  differentiated by intr_info and nothing hinted that this was a NMI.)
 

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-04-01 Thread Andrey Korolyov

On Wed, Apr 1, 2015 at 2:49 PM, Radim Krčmář rkrc...@redhat.com wrote:
 2015-03-31 21:23+0300, Andrey Korolyov:
 On Tue, Mar 31, 2015 at 9:04 PM, Bandan Das b...@redhat.com wrote:
  Bandan Das b...@redhat.com writes:
  Andrey Korolyov and...@xdel.ru writes:
  ...
  http://xdel.ru/downloads/kvm-e5v2-issue/another-tracepoint-fail-with-apicv.dat.gz
 
  Something a bit more interesting, but the mess is happening just
  *after* NMI firing.
 
  What happens if NMI is turned off on the host ?
 
  Sorry, I meant the watchdog..

 Thanks, everything goes well (as it probably should go there):
 http://xdel.ru/downloads/kvm-e5v2-issue/apicv-enabled-nmi-disabled.dat.gz

 Nice revelation!

 KVM doesn't expect host's NMIs to look like this so it doesn't pass them
 to the host.  What was the watchdog that casually sent NMIs?
 (It worked after nmi_watchdog=0 on the host?)

 (Guest's NMI should have a different result as well.  NMI_EXCEPTION is
  an expected exit reason for guest's hard exceptions, they are then
  differentiated by intr_info and nothing hinted that this was a NMI.)

Yes, I disabled the host watchdog at runtime. Indeed, guest-induced NMIs
would look different, and they had no reason to fire at this stage
inside the guest. I'd suspect hardware misbehavior on the hypervisor
there, but I have very little idea of how APICv behavior (which is
completely microcode-dependent and CPU-dependent but decoupled from
peripheral hardware) may vary at this point. I am using the 1.20140913.1
microcode version from Debian, if that matters. I will send the trace
suggested by Paolo in the next couple of hours. It would also be great to
ask the hardware folks at Intel, who could prove or disprove the
hypothesis above (as I have been unable to catch the problem on the
2603v2 so far, it has some chance of being real).
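
For reference, a minimal sketch of toggling the watchdog at runtime, assuming the standard sysctl file /proc/sys/kernel/nmi_watchdog (root is required on a real host). The function names are hypothetical, and the path is a parameter so the code can be tried on an ordinary file first.

```c
#include <stdio.h>

/* Write "1" or "0" to the watchdog control file.
 * `path` is normally "/proc/sys/kernel/nmi_watchdog".
 * Returns 0 on success, -1 on failure. */
int set_nmi_watchdog(const char *path, int enable)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the current setting back; returns the value or -1 on failure. */
int get_nmi_watchdog(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

The boot-time equivalent is the nmi_watchdog=0 kernel parameter that Radim asked about.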

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-04-01 Thread Radim Krčmář

2015-03-31 21:23+0300, Andrey Korolyov:
 On Tue, Mar 31, 2015 at 9:04 PM, Bandan Das b...@redhat.com wrote:
  Bandan Das b...@redhat.com writes:
  Andrey Korolyov and...@xdel.ru writes:
  ...
  http://xdel.ru/downloads/kvm-e5v2-issue/another-tracepoint-fail-with-apicv.dat.gz
 
  Something a bit more interesting, but the mess is happening just
  *after* NMI firing.
 
  What happens if NMI is turned off on the host ?
 
  Sorry, I meant the watchdog..
 
 Thanks, everything goes well (as it probably should go there):
 http://xdel.ru/downloads/kvm-e5v2-issue/apicv-enabled-nmi-disabled.dat.gz

Nice revelation!

KVM doesn't expect host's NMIs to look like this so it doesn't pass them
to the host.  What was the watchdog that casually sent NMIs?
(It worked after nmi_watchdog=0 on the host?)

(Guest's NMI should have a different result as well.  NMI_EXCEPTION is
 an expected exit reason for guest's hard exceptions, they are then
 differentiated by intr_info and nothing hinted that this was a NMI.)
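
To make the intr_info point concrete, here is a minimal decoding sketch. It assumes the VM-exit interruption-information layout from the Intel SDM (bits 7:0 vector, bits 10:8 type, bit 11 error code valid, bit 31 valid). It also assumes that the "8b0d" and "80f6" values shown in the archived traces are 0x80000b0d and 0x800000f6 with a run of zeros lost in the archive; that is a guess, not something the traces confirm. The struct and function names are hypothetical.

```c
#include <stdint.h>

struct intr_info {
    unsigned vector;      /* bits 7:0  */
    unsigned type;        /* bits 10:8 */
    int has_error_code;   /* bit 11    */
    int valid;            /* bit 31    */
};

/* Split a raw 32-bit interruption-information value into its fields. */
struct intr_info decode_intr_info(uint32_t raw)
{
    struct intr_info i;

    i.vector = raw & 0xff;
    i.type = (raw >> 8) & 0x7;
    i.has_error_code = (raw >> 11) & 1;
    i.valid = (raw >> 31) & 1;
    return i;
}

/* Human-readable name for the interruption type field. */
const char *intr_type_name(unsigned type)
{
    switch (type) {
    case 0: return "external interrupt";
    case 2: return "NMI";
    case 3: return "hardware exception";
    case 4: return "software interrupt";
    case 5: return "privileged software exception";
    case 6: return "software exception";
    default: return "reserved";
    }
}
```

Under that assumption, 0x80000b0d decodes to a hardware exception with vector 13 (#GP) and a valid error code, not to type 2 (NMI), and 0x800000f6 is an external interrupt with vector 0xf6.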

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-04-01 Thread Paolo Bonzini



On 01/04/2015 14:26, Andrey Korolyov wrote:
 Yes, I disabled host watchdog during runtime. Indeed guest-induced NMI
 would look different and they had no reasons to be fired at this stage
 inside guest. I`d suspect a hypervisor hardware misbehavior there but
 have a very little idea on how APICv behavior (which is completely
 microcode-dependent and CPU-dependent but decoupled from peripheral
 hardware) may vary at this point, I am using 1.20140913.1 ucode
 version from debian if this can matter. Will send trace suggested by
 Paolo in a next couple of hours. Also it would be awesome to ask
 hardware folks from Intel who can prove or disprove my abovementioned
 statement (as I was unable to catch the problem on 2603v2 so far, this
 hypothesis has some chance to be real).

Yes, the interaction with the NMI watchdog is unexpected and makes a
processor erratum somewhat more likely.

Paolo

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-04-01 Thread Andrey Korolyov

On Wed, Apr 1, 2015 at 4:19 PM, Paolo Bonzini pbonz...@redhat.com wrote:


 On 01/04/2015 14:26, Andrey Korolyov wrote:
 Yes, I disabled host watchdog during runtime. Indeed guest-induced NMI
 would look different and they had no reasons to be fired at this stage
 inside guest. I`d suspect a hypervisor hardware misbehavior there but
 have a very little idea on how APICv behavior (which is completely
 microcode-dependent and CPU-dependent but decoupled from peripheral
 hardware) may vary at this point, I am using 1.20140913.1 ucode
 version from debian if this can matter. Will send trace suggested by
 Paolo in a next couple of hours. Also it would be awesome to ask
 hardware folks from Intel who can prove or disprove my abovementioned
 statement (as I was unable to catch the problem on 2603v2 so far, this
 hypothesis has some chance to be real).

 Yes, the interaction with the NMI watchdog is unexpected and makes a
 processor erratum somewhat more likely.

 Paolo


http://xdel.ru/downloads/kvm-e5v2-issue/trace-nmi-apicv-fail-at-reboot.dat.gz

Err, there are no NMI entries near the failure event, though the capture setup should be correct:
/sys/kernel/debug/tracing/events/kvm*/filter
/sys/kernel/debug/tracing/events/*/kvm*/filter
/sys/kernel/debug/tracing/events/nmi*/filter
/sys/kernel/debug/tracing/events/*/nmi*/filter
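
As a cross-check of where these controls live, here is a small sketch that builds the path of a tracepoint's enable file. It assumes the usual tracefs layout events/<system>/<event>/enable under /sys/kernel/debug/tracing; the function name is hypothetical.

```c
#include <stdio.h>

/* Build the path of an "enable" control file in tracefs.
 * With event == NULL the path refers to the whole subsystem.
 * Returns 0 on success, -1 if the result does not fit into `out`. */
int tracepoint_enable_path(char *out, size_t len, const char *tracefs,
                           const char *system, const char *event)
{
    int n;

    if (event)
        n = snprintf(out, len, "%s/events/%s/%s/enable",
                     tracefs, system, event);
    else
        n = snprintf(out, len, "%s/events/%s/enable", tracefs, system);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For the nmi/nmi_handler tracepoint Paolo asked about, this yields /sys/kernel/debug/tracing/events/nmi/nmi_handler/enable; writing 1 to that file turns the event on.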

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-04-01 Thread Andrey Korolyov

On Wed, Apr 1, 2015 at 6:37 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Apr 1, 2015 at 4:19 PM, Paolo Bonzini pbonz...@redhat.com wrote:


 On 01/04/2015 14:26, Andrey Korolyov wrote:
 Yes, I disabled host watchdog during runtime. Indeed guest-induced NMI
 would look different and they had no reasons to be fired at this stage
 inside guest. I`d suspect a hypervisor hardware misbehavior there but
 have a very little idea on how APICv behavior (which is completely
 microcode-dependent and CPU-dependent but decoupled from peripheral
 hardware) may vary at this point, I am using 1.20140913.1 ucode
 version from debian if this can matter. Will send trace suggested by
 Paolo in a next couple of hours. Also it would be awesome to ask
 hardware folks from Intel who can prove or disprove my abovementioned
 statement (as I was unable to catch the problem on 2603v2 so far, this
 hypothesis has some chance to be real).

 Yes, the interaction with the NMI watchdog is unexpected and makes a
 processor erratum somewhat more likely.

 Paolo


 http://xdel.ru/downloads/kvm-e5v2-issue/trace-nmi-apicv-fail-at-reboot.dat.gz

 err, no NMI entries nearby failure event, though capture should be correct:
 /sys/kernel/debug/tracing/events/kvm*/filter
 /sys/kernel/debug/tracing/events/*/kvm*/filter
 /sys/kernel/debug/tracing/events/nmi*/filter
 /sys/kernel/debug/tracing/events/*/nmi*/filter

I moved the 2603v2s back and the issue is still here. I used the wrong
pattern to reproduce the issue in the previous series of tests on those
CPUs in the middle of the month: I was continuously respawning VMs,
whereas the real issue hides in the *first* reboot events after the
hypervisor reboot (or module load). So either it should be reproducible
anywhere, or this is not a hardware issue (or it is related to the
mainboard instead of the CPU itself :) ).

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-04-01 Thread Andrey Korolyov

*putting my tinfoil hat on*

After thinking a little more, the observable behavior is quite a
good match for a BIOS-level hypervisor (a hardware trojan, in modern
terminology): it is likely sensitive to timing[1], does not appear
more than once per VM during a boot cycle, seemingly does not depend
on whether kvm-intel was reloaded once, twice or more, and is not
reproducible outside of a single board model. If nobody has
better suggestions to try, I'll take a couple of steps in the next few
days:
- extract the BIOS with an SPI programmer and compare it to the vendor's image,
- extract the BMC image and compare it with the public version (should be easy as well),
- try to analyze switch timings by writing sample code for bare
hardware (a hint could be that an L2 Linux guest shows a
larger execution time difference from L1 on a host with a top-level
hypervisor than on a supposedly 'non-infected' one),
- try to analyze the binary BIOS code itself, though that can be VERY
problematic, not to mention doing the same for the BMC.

Sorry for posting such naive and stupid stuff on a public mailing list, but
I am really out of clues about what is happening there and why it is not
reproducible anywhere else.

1. https://xakep.ru/2011/12/26/58104/ (Russian text, but it can be read
through Google Translate without loss of detail)

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-31 Thread Radim Krčmář

2015-03-30 22:32+0300, Andrey Korolyov:
 On Mon, Mar 30, 2015 at 9:56 PM, Radim Krčmář rkrc...@redhat.com wrote:
  2015-03-27 13:16+0300, Andrey Korolyov:
  On Fri, Mar 27, 2015 at 12:03 AM, Bandan Das b...@redhat.com wrote:
   Radim Krčmář rkrc...@redhat.com writes:
   I second Bandan -- checking that it reproduces on other machine would be
   great for sanity :)  (Although a bug in our APICv is far more likely.)
  
   If it's APICv related, a run without apicv enabled could give more hints.
  
   Your devices not getting reset hypothesis makes the most sense to me,
   maybe the timer vector in the error message is just one part of
   the whole story. Another misbehaving interrupt from the dark comes in at the
   same time and leads to a double fault.
 
  Default trace (APICv enabled, first reboot introduced the issue):
  http://xdel.ru/downloads/kvm-e5v2-issue/hanged-reboot-apic-on.dat.gz
 
  The relevant part is here,
  prefixed with qemu-system-x86-4180  [002]   697.111550:
 
kvm_exit: reason CR_ACCESS rip 0xd272 info 0 0
kvm_cr:   cr_write 0 = 0x10
kvm_mmu_get_page: existing sp gfn 0 0/4 q0 direct --- !pge !nxe root 0 sync
kvm_entry:vcpu 0
kvm_emulate_insn: f:d275: ea 7a d2 00 f0
kvm_emulate_insn: f:d27a: 2e 0f 01 1e f0 6c
kvm_emulate_insn: f:d280: 31 c0
kvm_emulate_insn: f:d282: 8e e0
kvm_emulate_insn: f:d284: 8e e8
kvm_emulate_insn: f:d286: 8e c0
kvm_emulate_insn: f:d288: 8e d8
kvm_emulate_insn: f:d28a: 8e d0
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0xd28f info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason EPT_VIOLATION rip 0x8dd0 info 184 0
kvm_page_fault:   address f8dd0 error_code 184
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0x8dd0 info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason EPT_VIOLATION rip 0x76d6 info 184 0
kvm_page_fault:   address f76d6 error_code 184
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0x76d6 info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason PENDING_INTERRUPT rip 0xd331 info 0 0
kvm_inj_virq: irq 8
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason EPT_VIOLATION rip 0xfea5 info 184 0
kvm_page_fault:   address ffea5 error_code 184
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason EPT_VIOLATION rip 0xe990 info 184 0
kvm_page_fault:   address fe990 error_code 184
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0xe990 info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason EXCEPTION_NMI rip 0xd334 info 0 8b0d
kvm_userspace_exit:   reason KVM_EXIT_INTERNAL_ERROR (17)
 
  Trace without APICv (three reboots, just to make sure to hit the
  problematic condition of supposed DF, as it still have not one hundred
  percent reproducibility):
  http://xdel.ru/downloads/kvm-e5v2-issue/apic-off.dat.gz
 
  The trace here contains a well matching excerpt, just instead of the
  EXCEPTION_NMI, it does
 
   169.905098: kvm_exit: reason EPT_VIOLATION rip 0xd334 info 181   0
   169.905102: kvm_page_fault:   address feffd066 error_code 181
 
  and works.  Page fault says we tried to read 0xfeffd066 -- probably IOPB
  of TSS.  (I guess it is pre-fetch for following IO instruction.)
 
  Nothing strikes me when looking at it, but some APICv boots don't fail,
  so it would be interesting to compare them ... hosts's 0xf6 interrupt
  (IRQ_WORK_VECTOR) is a possible source of races.  (We could look more
  closely.  It is fired too often for my liking as well.)
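
The "info 184" and "info 181" values in the EPT_VIOLATION lines can be decoded with a short sketch. It assumes the exit-qualification layout from the Intel SDM (bit 0 data read, bit 1 data write, bit 2 instruction fetch, bit 7 guest linear address valid, bit 8 access is to the final translation rather than to a paging-structure entry). The struct and function names are hypothetical.

```c
#include <stdint.h>

struct ept_violation {
    int read;              /* bit 0: caused by a data read          */
    int write;             /* bit 1: caused by a data write         */
    int fetch;             /* bit 2: caused by an instruction fetch */
    int gla_valid;         /* bit 7: guest linear address is valid  */
    int final_translation; /* bit 8: access to the translated page,
                              not to a paging-structure entry       */
};

/* Split an EPT-violation exit qualification into its flag bits. */
struct ept_violation decode_ept_qual(uint64_t q)
{
    struct ept_violation v;

    v.read = (int)(q & 1);
    v.write = (int)((q >> 1) & 1);
    v.fetch = (int)((q >> 2) & 1);
    v.gla_valid = (int)((q >> 7) & 1);
    v.final_translation = (int)((q >> 8) & 1);
    return v;
}
```

With this layout, 0x184 is an instruction fetch and 0x181 is a data read, which agrees with the remark that the page fault at 0xfeffd066 was a read.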
 
 Thanks Radim, 
 http://xdel.ru/downloads/kvm-e5v2-issue/no-fail-with-apicv.dat.gz
 
 The related bits looks the same as with enable_apicv=0 for me.

Yeah,

 qemu-system-x86-4201  [007]   159.297337:
  kvm_exit: reason CR_ACCESS rip 0xd272 info 0 0
  kvm_cr:   cr_write 0 = 0x10
  kvm_mmu_get_page: existing sp gfn 0 0/4 q0 direct --- !pge !nxe root 0 sync
  kvm_entry:vcpu 0
  kvm_emulate_insn: f:d275: ea 7a d2 00 f0
  kvm_emulate_insn: f:d27a: 2e 0f 01 1e f0 6c
  kvm_emulate_insn: f:d280: 31 c0
  kvm_emulate_insn: f:d282: 8e e0
  kvm_emulate_insn: f:d284: 8e e8
  kvm_emulate_insn: f:d286: 8e c0
  kvm_emulate_insn: f:d288: 8e d8
  kvm_emulate_insn: f:d28a: 8e d0
  kvm_entry:vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0x8dd0 info 184 0
  kvm_page_fault:

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-31 Thread Andrey Korolyov

On Tue, Mar 31, 2015 at 4:45 PM, Radim Krčmář rkrc...@redhat.com wrote:
 2015-03-30 22:32+0300, Andrey Korolyov:
 On Mon, Mar 30, 2015 at 9:56 PM, Radim Krčmář rkrc...@redhat.com wrote:
  2015-03-27 13:16+0300, Andrey Korolyov:
  On Fri, Mar 27, 2015 at 12:03 AM, Bandan Das b...@redhat.com wrote:
   Radim Krčmář rkrc...@redhat.com writes:
   I second Bandan -- checking that it reproduces on other machine would be
   great for sanity :)  (Although a bug in our APICv is far more likely.)
  
   If it's APICv related, a run without apicv enabled could give more hints.
  
   Your devices not getting reset hypothesis makes the most sense to me,
   maybe the timer vector in the error message is just one part of
   the whole story. Another misbehaving interrupt from the dark comes in at the
   same time and leads to a double fault.
 
  Default trace (APICv enabled, first reboot introduced the issue):
  http://xdel.ru/downloads/kvm-e5v2-issue/hanged-reboot-apic-on.dat.gz
 
  The relevant part is here,
  prefixed with qemu-system-x86-4180  [002]   697.111550:
 
kvm_exit: reason CR_ACCESS rip 0xd272 info 0 0
kvm_cr:   cr_write 0 = 0x10
kvm_mmu_get_page: existing sp gfn 0 0/4 q0 direct --- !pge !nxe root 0 sync
kvm_entry:vcpu 0
kvm_emulate_insn: f:d275: ea 7a d2 00 f0
kvm_emulate_insn: f:d27a: 2e 0f 01 1e f0 6c
kvm_emulate_insn: f:d280: 31 c0
kvm_emulate_insn: f:d282: 8e e0
kvm_emulate_insn: f:d284: 8e e8
kvm_emulate_insn: f:d286: 8e c0
kvm_emulate_insn: f:d288: 8e d8
kvm_emulate_insn: f:d28a: 8e d0
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0xd28f info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason EPT_VIOLATION rip 0x8dd0 info 184 0
kvm_page_fault:   address f8dd0 error_code 184
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0x8dd0 info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason EPT_VIOLATION rip 0x76d6 info 184 0
kvm_page_fault:   address f76d6 error_code 184
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0x76d6 info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason PENDING_INTERRUPT rip 0xd331 info 0 0
kvm_inj_virq: irq 8
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason EPT_VIOLATION rip 0xfea5 info 184 0
kvm_page_fault:   address ffea5 error_code 184
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason EPT_VIOLATION rip 0xe990 info 184 0
kvm_page_fault:   address fe990 error_code 184
kvm_entry:vcpu 0
kvm_exit: reason EXTERNAL_INTERRUPT rip 0xe990 info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason EXCEPTION_NMI rip 0xd334 info 0 8b0d
kvm_userspace_exit:   reason KVM_EXIT_INTERNAL_ERROR (17)
 
  Trace without APICv (three reboots, just to make sure to hit the
  problematic condition of supposed DF, as it still have not one hundred
  percent reproducibility):
  http://xdel.ru/downloads/kvm-e5v2-issue/apic-off.dat.gz
 
  The trace here contains a well matching excerpt, just instead of the
  EXCEPTION_NMI, it does
 
   169.905098: kvm_exit: reason EPT_VIOLATION rip 0xd334 info 181 0
   169.905102: kvm_page_fault:   address feffd066 error_code 181
 
  and works.  Page fault says we tried to read 0xfeffd066 -- probably IOPB
  of TSS.  (I guess it is pre-fetch for following IO instruction.)
 
  Nothing strikes me when looking at it, but some APICv boots don't fail,
  so it would be interesting to compare them ... hosts's 0xf6 interrupt
  (IRQ_WORK_VECTOR) is a possible source of races.  (We could look more
  closely.  It is fired too often for my liking as well.)

 Thanks Radim, 
 http://xdel.ru/downloads/kvm-e5v2-issue/no-fail-with-apicv.dat.gz

 The related bits looks the same as with enable_apicv=0 for me.

 Yeah,

  qemu-system-x86-4201  [007]   159.297337:
   kvm_exit: reason CR_ACCESS rip 0xd272 info 0 0
   kvm_cr:   cr_write 0 = 0x10
   kvm_mmu_get_page: existing sp gfn 0 0/4 q0 direct --- !pge !nxe root 0 sync
   kvm_entry:vcpu 0
   kvm_emulate_insn: f:d275: ea 7a d2 00 f0
   kvm_emulate_insn: f:d27a: 2e 0f 01 1e f0 6c
   kvm_emulate_insn: f:d280: 31 c0
   kvm_emulate_insn: f:d282: 8e e0
   kvm_emulate_insn: f:d284: 8e e8
   kvm_emulate_insn: f:d286: 8e c0
   kvm_emulate_insn: f:d288: 8e d8
   kvm_emulate_insn: f:d28a: 8e d0

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-31 Thread Radim Krčmář

2015-03-31 17:56+0300, Andrey Korolyov:
  Chasing the culprit this way could take a long time, so a new tracepoint
  that shows if 0xef is set on entry would let us guess the bug faster ...
 
  Please provide a failing trace with the following patch:

 Thanks, please see below:
 
 http://xdel.ru/downloads/kvm-e5v2-issue/new-tracepoint-fail-with-apicv.dat.gz

 qemu-system-x86-4022  [006]  255.915978:
  kvm_entry:vcpu 0
  kvm_emulate_insn: f:d275: ea 7a d2 00 f0
  kvm_emulate_insn: f:d27a: 2e 0f 01 1e f0 6c
  kvm_emulate_insn: f:d280: 31 c0
  kvm_emulate_insn: f:d282: 8e e0
  kvm_emulate_insn: f:d284: 8e e8
  kvm_emulate_insn: f:d286: 8e c0
  kvm_emulate_insn: f:d288: 8e d8
  kvm_emulate_insn: f:d28a: 8e d0
  kvm_entry:vcpu 0
  kvm_0xef: irr clear, isr clear, vmcs 0x0
  kvm_exit: reason EPT_VIOLATION rip 0x8dd0 info 184 0
  kvm_page_fault:   address f8dd0 error_code 184
  kvm_entry:vcpu 0
  kvm_0xef: irr clear, isr clear, vmcs 0x0
  kvm_exit: reason EPT_VIOLATION rip 0x76d6 info 184 0
  kvm_page_fault:   address f76d6 error_code 184
  kvm_entry:vcpu 0
  kvm_0xef: irr clear, isr clear, vmcs 0x0
  kvm_exit: reason EXCEPTION_NMI rip 0xd331 info 0 8b0d
  kvm_userspace_exit:   reason KVM_EXIT_INTERNAL_ERROR (17)

Ok, nothing obvious here either ... I've desperately added all
information I know about.  Please run it again, thanks.

(The patch has to be applied instead of the previous one.)
---
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 7c7bc8bef21f..f986636ad9d0 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -742,6 +742,41 @@ TRACE_EVENT(kvm_emulate_insn,
 #define trace_kvm_emulate_insn_start(vcpu) trace_kvm_emulate_insn(vcpu, 0)
 #define trace_kvm_emulate_insn_failed(vcpu) trace_kvm_emulate_insn(vcpu, 1)
 
+TRACE_EVENT(kvm_0xef,
+	TP_PROTO(bool irr, bool isr, u32 info, bool on, bool pir, u16 status),
+	TP_ARGS(irr, isr, info, on, pir, status),
+
+	TP_STRUCT__entry(
+		__field(bool,  irr )
+		__field(bool,  isr )
+		__field(u32,   info)
+		__field(bool,  on  )
+		__field(bool,  pir )
+		__field(u8,    rvi )
+		__field(u8,    svi )
+	),
+
+	TP_fast_assign(
+		__entry->irr  = irr;
+		__entry->isr  = isr;
+		__entry->info = info;
+		__entry->on   = on;
+		__entry->pir  = pir;
+		__entry->rvi  = status & 0xff;
+		__entry->svi  = status >> 8;
+	),
+
+	TP_printk("irr %s, isr %s, info 0x%x, on %s, pir %s, rvi 0x%x, svi 0x%x",
+		  __entry->irr ? "set"   : "clear",
+		  __entry->isr ? "set"   : "clear",
+		  __entry->info,
+		  __entry->on  ? "set"   : "clear",
+		  __entry->pir ? "set"   : "clear",
+		  __entry->rvi,
+		  __entry->svi
+		 )
+	);
+
 TRACE_EVENT(
vcpu_match_mmio,
TP_PROTO(gva_t gva, gpa_t gpa, bool write, bool gpa_match),
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index eee63dc33d89..b461edc93d53 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5047,6 +5047,25 @@ static int handle_machine_check(struct kvm_vcpu *vcpu)
return 1;
 }
 
+#define VEC_POS(v) ((v) & (32 - 1))
+#define REG_POS(v) (((v) >> 5) << 4)
+static inline int apic_test_vector(int vec, void *bitmap)
+{
+	return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
+static inline void random_trace(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	trace_kvm_0xef(apic_test_vector(0xef, vcpu->arch.apic->regs + APIC_IRR),
+		       apic_test_vector(0xef, vcpu->arch.apic->regs + APIC_ISR),
+		       vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
+		       test_bit(POSTED_INTR_ON, (unsigned long *)&vmx->pi_desc.control),
+		       test_bit(0xef, (unsigned long *)vmx->pi_desc.pir),
+		       vmcs_read16(GUEST_INTR_STATUS));
+}
+
 static int handle_exception(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -5077,6 +5096,8 @@ static int handle_exception(struct kvm_vcpu *vcpu)
return 1;
}
 
+   random_trace(vcpu);
+
 	error_code = 0;
 	if (intr_info & INTR_INFO_DELIVER_CODE_MASK)
 		error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
@@ -8143,6 +8164,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (vmx->emulation_required)
 		return;
 
+	random_trace(vcpu);
+
 	if (vmx->ple_window_dirty) {
 		vmx->ple_window_dirty = false;
 		vmcs_write32(PLE_WINDOW, vmx->ple_window);
@@ -8312,6 +8335,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu
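
As a worked example of the bit arithmetic the patch relies on, here is a small userspace sketch. It assumes VEC_POS and REG_POS as they are defined in the kernel's local APIC code, and the Intel SDM layout of the guest interrupt status field (RVI in the low byte, SVI in the high byte). The wrapper function names are hypothetical.

```c
#include <stdint.h>

/* Bit position of a vector inside its 32-bit APIC register. */
#define VEC_POS(v) ((v) & (32 - 1))
/* Byte offset of that register: one 32-bit register every 16 bytes. */
#define REG_POS(v) (((v) >> 5) << 4)

unsigned vector_bit(unsigned vec)        { return VEC_POS(vec); }
unsigned vector_reg_offset(unsigned vec) { return REG_POS(vec); }

/* Split the 16-bit guest interrupt status into RVI and SVI. */
uint8_t intr_status_rvi(uint16_t status) { return (uint8_t)(status & 0xff); }
uint8_t intr_status_svi(uint16_t status) { return (uint8_t)(status >> 8); }
```

For vector 0xef (239 = 7 * 32 + 15) this gives bit 15 of the register at offset 0x70 from the start of the IRR or ISR bank, which is the bit apic_test_vector() checks in the patch.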

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-31 Thread Andrey Korolyov

On Tue, Mar 31, 2015 at 7:45 PM, Radim Krčmář rkrc...@redhat.com wrote:
 2015-03-31 17:56+0300, Andrey Korolyov:
  Chasing the culprit this way could take a long time, so a new tracepoint
  that shows if 0xef is set on entry would let us guess the bug faster ...
 
  Please provide a failing trace with the following patch:

 Thanks, please see below:

 http://xdel.ru/downloads/kvm-e5v2-issue/new-tracepoint-fail-with-apicv.dat.gz

  qemu-system-x86-4022  [006]  255.915978:
   kvm_entry:vcpu 0
   kvm_emulate_insn: f:d275: ea 7a d2 00 f0
   kvm_emulate_insn: f:d27a: 2e 0f 01 1e f0 6c
   kvm_emulate_insn: f:d280: 31 c0
   kvm_emulate_insn: f:d282: 8e e0
   kvm_emulate_insn: f:d284: 8e e8
   kvm_emulate_insn: f:d286: 8e c0
   kvm_emulate_insn: f:d288: 8e d8
   kvm_emulate_insn: f:d28a: 8e d0
   kvm_entry:vcpu 0
   kvm_0xef: irr clear, isr clear, vmcs 0x0
   kvm_exit: reason EPT_VIOLATION rip 0x8dd0 info 184 0
   kvm_page_fault:   address f8dd0 error_code 184
   kvm_entry:vcpu 0
   kvm_0xef: irr clear, isr clear, vmcs 0x0
   kvm_exit: reason EPT_VIOLATION rip 0x76d6 info 184 0
   kvm_page_fault:   address f76d6 error_code 184
   kvm_entry:vcpu 0
   kvm_0xef: irr clear, isr clear, vmcs 0x0
   kvm_exit: reason EXCEPTION_NMI rip 0xd331 info 0 8b0d
   kvm_userspace_exit:   reason KVM_EXIT_INTERNAL_ERROR (17)

 Ok, nothing obvious here either ... I've desperately added all
 information I know about.  Please run it again, thanks.

 (The patch has to be applied instead of the previous one.)
 ---
 diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
 index 7c7bc8bef21f..f986636ad9d0 100644
 --- a/arch/x86/kvm/trace.h
 +++ b/arch/x86/kvm/trace.h
 @@ -742,6 +742,41 @@ TRACE_EVENT(kvm_emulate_insn,
  #define trace_kvm_emulate_insn_start(vcpu) trace_kvm_emulate_insn(vcpu, 0)
  #define trace_kvm_emulate_insn_failed(vcpu) trace_kvm_emulate_insn(vcpu, 1)

 +TRACE_EVENT(kvm_0xef,
 +	TP_PROTO(bool irr, bool isr, u32 info, bool on, bool pir, u16 status),
 +	TP_ARGS(irr, isr, info, on, pir, status),
 +
 +	TP_STRUCT__entry(
 +		__field(bool,  irr )
 +		__field(bool,  isr )
 +		__field(u32,   info)
 +		__field(bool,  on  )
 +		__field(bool,  pir )
 +		__field(u8,    rvi )
 +		__field(u8,    svi )
 +	),
 +
 +	TP_fast_assign(
 +		__entry->irr  = irr;
 +		__entry->isr  = isr;
 +		__entry->info = info;
 +		__entry->on   = on;
 +		__entry->pir  = pir;
 +		__entry->rvi  = status & 0xff;
 +		__entry->svi  = status >> 8;
 +	),
 +
 +	TP_printk("irr %s, isr %s, info 0x%x, on %s, pir %s, rvi 0x%x, svi 0x%x",
 +		  __entry->irr ? "set"   : "clear",
 +		  __entry->isr ? "set"   : "clear",
 +		  __entry->info,
 +		  __entry->on  ? "set"   : "clear",
 +		  __entry->pir ? "set"   : "clear",
 +		  __entry->rvi,
 +		  __entry->svi
 +		 )
 +	);
 +
  TRACE_EVENT(
 vcpu_match_mmio,
 TP_PROTO(gva_t gva, gpa_t gpa, bool write, bool gpa_match),
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index eee63dc33d89..b461edc93d53 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -5047,6 +5047,25 @@ static int handle_machine_check(struct kvm_vcpu *vcpu)
 return 1;
  }

 +#define VEC_POS(v) ((v) & (32 - 1))
 +#define REG_POS(v) (((v) >> 5) << 4)
 +static inline int apic_test_vector(int vec, void *bitmap)
 +{
 +	return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 +}
 +
 +static inline void random_trace(struct kvm_vcpu *vcpu)
 +{
 +	struct vcpu_vmx *vmx = to_vmx(vcpu);
 +
 +	trace_kvm_0xef(apic_test_vector(0xef, vcpu->arch.apic->regs + APIC_IRR),
 +		       apic_test_vector(0xef, vcpu->arch.apic->regs + APIC_ISR),
 +		       vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 +		       test_bit(POSTED_INTR_ON, (unsigned long *)&vmx->pi_desc.control),
 +		       test_bit(0xef, (unsigned long *)vmx->pi_desc.pir),
 +		       vmcs_read16(GUEST_INTR_STATUS));
 +}
 +
  static int handle_exception(struct kvm_vcpu *vcpu)
  {
 struct vcpu_vmx *vmx = to_vmx(vcpu);
 @@ -5077,6 +5096,8 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 return 1;
 }

 +   random_trace(vcpu);
 +
  	error_code = 0;
  	if (intr_info & INTR_INFO_DELIVER_CODE_MASK)
  		error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
 @@ -8143,6 +8164,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
  	if (vmx->emulation_required)
  		return;

 +   random_trace(vcpu);
 +
 if

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-31 Thread Bandan Das

Bandan Das b...@redhat.com writes:

 Andrey Korolyov and...@xdel.ru writes:
 ...
 http://xdel.ru/downloads/kvm-e5v2-issue/another-tracepoint-fail-with-apicv.dat.gz

 Something a bit more interesting, but the mess is happening just
 *after* NMI firing.

 What happens if NMI is turned off on the host ?

Sorry, I meant the watchdog..

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-31 Thread Bandan Das

Andrey Korolyov and...@xdel.ru writes:
...
 http://xdel.ru/downloads/kvm-e5v2-issue/another-tracepoint-fail-with-apicv.dat.gz

 Something a bit more interesting, but the mess is happening just
 *after* NMI firing.

What happens if NMI is turned off on the host ?

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-31 Thread Andrey Korolyov

On Tue, Mar 31, 2015 at 9:04 PM, Bandan Das b...@redhat.com wrote:
 Bandan Das b...@redhat.com writes:

 Andrey Korolyov and...@xdel.ru writes:
 ...
 http://xdel.ru/downloads/kvm-e5v2-issue/another-tracepoint-fail-with-apicv.dat.gz

 Something a bit more interesting, but the mess is happening just
 *after* NMI firing.

 What happens if NMI is turned off on the host ?

 Sorry, I meant the watchdog..


Thanks, everything goes well (as it probably should go there):
http://xdel.ru/downloads/kvm-e5v2-issue/apicv-enabled-nmi-disabled.dat.gz

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-30 Thread Radim Krčmář

2015-03-27 13:16+0300, Andrey Korolyov:
 On Fri, Mar 27, 2015 at 12:03 AM, Bandan Das b...@redhat.com wrote:
  Radim Krčmář rkrc...@redhat.com writes:
  I second Bandan -- checking that it reproduces on other machine would be
  great for sanity :)  (Although a bug in our APICv is far more likely.)
 
  If it's APICv related, a run without apicv enabled could give more hints.
 
  Your devices not getting reset hypothesis makes the most sense to me,
  maybe the timer vector in the error message is just one part of
  the whole story. Another misbehaving interrupt from the dark comes in at the
  same time and leads to a double fault.
 
 Default trace (APICv enabled, first reboot introduced the issue):
 http://xdel.ru/downloads/kvm-e5v2-issue/hanged-reboot-apic-on.dat.gz

The relevant part is here,
prefixed with qemu-system-x86-4180  [002]   697.111550:

  kvm_exit: reason CR_ACCESS rip 0xd272 info 0 0
  kvm_cr:   cr_write 0 = 0x10
  kvm_mmu_get_page: existing sp gfn 0 0/4 q0 direct --- !pge !nxe root 0 sync
  kvm_entry:vcpu 0
  kvm_emulate_insn: f:d275: ea 7a d2 00 f0
  kvm_emulate_insn: f:d27a: 2e 0f 01 1e f0 6c
  kvm_emulate_insn: f:d280: 31 c0
  kvm_emulate_insn: f:d282: 8e e0
  kvm_emulate_insn: f:d284: 8e e8
  kvm_emulate_insn: f:d286: 8e c0
  kvm_emulate_insn: f:d288: 8e d8
  kvm_emulate_insn: f:d28a: 8e d0
  kvm_entry:vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xd28f info 0 80f6
  kvm_entry:vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0x8dd0 info 184 0
  kvm_page_fault:   address f8dd0 error_code 184
  kvm_entry:vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0x8dd0 info 0 80f6
  kvm_entry:vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0x76d6 info 184 0
  kvm_page_fault:   address f76d6 error_code 184
  kvm_entry:vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0x76d6 info 0 80f6
  kvm_entry:vcpu 0
  kvm_exit: reason PENDING_INTERRUPT rip 0xd331 info 0 0
  kvm_inj_virq: irq 8
  kvm_entry:vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 80f6
  kvm_entry:vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0xfea5 info 184 0
  kvm_page_fault:   address ffea5 error_code 184
  kvm_entry:vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 80f6
  kvm_entry:vcpu 0
  kvm_exit: reason EPT_VIOLATION rip 0xe990 info 184 0
  kvm_page_fault:   address fe990 error_code 184
  kvm_entry:vcpu 0
  kvm_exit: reason EXTERNAL_INTERRUPT rip 0xe990 info 0 80f6
  kvm_entry:vcpu 0
  kvm_exit: reason EXCEPTION_NMI rip 0xd334 info 0 8b0d
  kvm_userspace_exit:   reason KVM_EXIT_INTERNAL_ERROR (17)
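
The kvm_page_fault addresses in this excerpt are consistent with plain real-mode addressing: each faulting address equals the CS segment shifted left by four plus the rip of the preceding kvm_exit. A minimal Python sketch of that arithmetic, assuming CS=f000 as in the register dumps posted in this thread (the helper name is illustrative and not part of KVM or trace-cmd):

```python
def real_mode_linear(segment: int, offset: int) -> int:
    """Return the linear address of a real-mode segment:offset pair."""
    # Real mode: linear = segment * 16 + offset. No 1 MiB wrap is applied,
    # since the dumps in this thread show A20=1.
    return (segment << 4) + offset


# (rip, address) pairs copied from the kvm_exit / kvm_page_fault lines above.
PAIRS = [(0x8dd0, 0xf8dd0), (0x76d6, 0xf76d6), (0xfea5, 0xffea5), (0xe990, 0xfe990)]

for rip, address in PAIRS:
    linear = real_mode_linear(0xf000, rip)
    print(hex(rip), hex(linear), linear == address)
```

All four pairs match, which suggests these EPT violations are ordinary instruction fetches from the BIOS segment rather than stray accesses.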

 Trace without APICv (three reboots, just to make sure to hit the
 problematic condition of the supposed DF, as it is still not one hundred
 percent reproducible):
 http://xdel.ru/downloads/kvm-e5v2-issue/apic-off.dat.gz

The trace here contains a well matching excerpt, just instead of the
EXCEPTION_NMI, it does

 169.905098: kvm_exit: reason EPT_VIOLATION rip 0xd334 info 181 0
 169.905102: kvm_page_fault:   address feffd066 error_code 181

and works.  Page fault says we tried to read 0xfeffd066 -- probably IOPB
of TSS.  (I guess it is pre-fetch for following IO instruction.)

Nothing strikes me when looking at it, but some APICv boots don't fail,
so it would be interesting to compare them ... host's 0xf6 interrupt
(IRQ_WORK_VECTOR) is a possible source of races.  (We could look more
closely.  It is fired too often for my liking as well.)
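
To compare failing and non-failing boots, it may help to tally exit reasons and host vectors from the text form of a trace. The following is a rough Python sketch that assumes the kvm_exit line layout shown in the excerpt above, and assumes the low byte of the last field is the interrupt vector on EXTERNAL_INTERRUPT exits; the function name and sample are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as:
#   kvm_exit: reason EXTERNAL_INTERRUPT rip 0xd28f info 0 80f6
# A task/cpu/timestamp prefix before "kvm_exit:" is tolerated by search().
EXIT_RE = re.compile(
    r"kvm_exit:\s*reason\s+(\w+)\s+rip\s+0x([0-9a-fA-F]+)"
    r"\s+info\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)"
)


def tally_exits(lines):
    """Count exits per reason and host vectors seen on EXTERNAL_INTERRUPT exits."""
    reasons = Counter()
    vectors = Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if not match:
            continue
        reason, _rip, _info1, info2 = match.groups()
        reasons[reason] += 1
        if reason == "EXTERNAL_INTERRUPT":
            # Assumption: the low byte of the last field is the vector.
            vectors[int(info2, 16) & 0xFF] += 1
    return reasons, vectors


sample = """\
kvm_exit: reason EXTERNAL_INTERRUPT rip 0xd28f info 0 80f6
kvm_entry:vcpu 0
kvm_exit: reason EPT_VIOLATION rip 0x8dd0 info 184 0
kvm_page_fault:   address f8dd0 error_code 184
kvm_exit: reason EXTERNAL_INTERRUPT rip 0x8dd0 info 0 80f6
kvm_exit: reason EXCEPTION_NMI rip 0xd334 info 0 8b0d
""".splitlines()

reasons, vectors = tally_exits(sample)
print(dict(reasons))
print({hex(v): n for v, n in vectors.items()})
```

Running the same tally over a failing and a non-failing trace would show whether the 0xf6 exits are noticeably more frequent in one of them.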

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-30 Thread Radim Krčmář

2015-03-27 14:54+0300, Andrey Korolyov:
 Trace with new bits:

Thanks.

 KVM internal error. Suberror: 2
 extra data[0]: 80ef
 extra data[1]: 8b0d
 extra data[2]: 77b

The #GP code looks formatted as documented under INT in SDM,
  (vector << 3) | 2 | ext
where 'ext' stands for 'external' (as opposed to software).

  0x77b == (0xef << 3) | 2 | 1

It was 0xef and wasn't triggered by an INT instruction.
The weird part is that it looks like a protected mode error, but CR0
says we are in real mode.

(If CPU interpreted the vector in protected mode, then it would violate
 the IDT limit and throw a #GP ...
 It's too late for coffee today, so I'll try to lure some ideas later.)
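
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, the error-code layout (EXT in bit 0, IDT in bit 1, TI in bit 2, index from bit 3 up) is the one described in the SDM, and the IDT limit of 0x3ff is taken from the register dump below.

```python
def idt_error_code(vector: int, external: bool) -> int:
    """Build the error code for a fault that references an IDT entry."""
    # bit 1 (IDT) is always set here; bit 0 (EXT) marks an external event.
    return (vector << 3) | 0x2 | (1 if external else 0)


def decode_error_code(code: int) -> dict:
    """Split a selector-style error code into its fields."""
    return {
        "index": code >> 3,
        "ti": bool(code & 0x4),
        "idt": bool(code & 0x2),
        "ext": bool(code & 0x1),
    }


def entry_within_limit(vector: int, entry_size: int, limit: int) -> bool:
    """True if the whole IDT/IVT entry for 'vector' lies within 'limit'."""
    return vector * entry_size + (entry_size - 1) <= limit


print(hex(idt_error_code(0xef, external=True)))
print(decode_error_code(0x77b))
# Real-mode IVT entries are 4 bytes, protected-mode IDT gates are 8 bytes.
print(entry_within_limit(0xef, 4, 0x3ff), entry_within_limit(0xef, 8, 0x3ff))
```

With a limit of 0x3ff, vector 0xef fits when entries are 4 bytes wide but not when they are 8 bytes wide, which matches the suspicion that the vector was looked up as if the CPU were in protected mode.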

 EAX= EBX= ECX= EDX=
 ESI= EDI= EBP= ESP=6d24
 EIP=d331 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT= 000f6cb0 0037
 IDT=  03ff
 CR0=0010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=66 c3 cd 02 cb cd 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd
 19 cb cd 1c cb cd 4a cb fa fc 66 ba 47 d3 0f 00 e9 ad fe f3 90 f0 0f
 ba 2d d4 fe fb 3f

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-30 Thread Andrey Korolyov

On Mon, Mar 30, 2015 at 9:56 PM, Radim Krčmář rkrc...@redhat.com wrote:
 2015-03-27 13:16+0300, Andrey Korolyov:
 On Fri, Mar 27, 2015 at 12:03 AM, Bandan Das b...@redhat.com wrote:
  Radim Krčmář rkrc...@redhat.com writes:
  I second Bandan -- checking that it reproduces on other machine would be
  great for sanity :)  (Although a bug in our APICv is far more likely.)
 
  If it's APICv related, a run without apicv enabled could give more hints.
 
  Your devices not getting reset hypothesis makes the most sense to me,
  maybe the timer vector in the error message is just one part of
  the whole story. Another misbehaving interrupt from the dark comes in at the
  same time and leads to a double fault.

 Default trace (APICv enabled, first reboot introduced the issue):
 http://xdel.ru/downloads/kvm-e5v2-issue/hanged-reboot-apic-on.dat.gz

 The relevant part is here,
 prefixed with qemu-system-x86-4180  [002]   697.111550:

   kvm_exit: reason CR_ACCESS rip 0xd272 info 0 0
   kvm_cr:   cr_write 0 = 0x10
   kvm_mmu_get_page: existing sp gfn 0 0/4 q0 direct --- !pge !nxe root 0 sync
   kvm_entry:vcpu 0
   kvm_emulate_insn: f:d275: ea 7a d2 00 f0
   kvm_emulate_insn: f:d27a: 2e 0f 01 1e f0 6c
   kvm_emulate_insn: f:d280: 31 c0
   kvm_emulate_insn: f:d282: 8e e0
   kvm_emulate_insn: f:d284: 8e e8
   kvm_emulate_insn: f:d286: 8e c0
   kvm_emulate_insn: f:d288: 8e d8
   kvm_emulate_insn: f:d28a: 8e d0
   kvm_entry:vcpu 0
   kvm_exit: reason EXTERNAL_INTERRUPT rip 0xd28f info 0 80f6
   kvm_entry:vcpu 0
   kvm_exit: reason EPT_VIOLATION rip 0x8dd0 info 184 0
   kvm_page_fault:   address f8dd0 error_code 184
   kvm_entry:vcpu 0
   kvm_exit: reason EXTERNAL_INTERRUPT rip 0x8dd0 info 0 80f6
   kvm_entry:vcpu 0
   kvm_exit: reason EPT_VIOLATION rip 0x76d6 info 184 0
   kvm_page_fault:   address f76d6 error_code 184
   kvm_entry:vcpu 0
   kvm_exit: reason EXTERNAL_INTERRUPT rip 0x76d6 info 0 80f6
   kvm_entry:vcpu 0
   kvm_exit: reason PENDING_INTERRUPT rip 0xd331 info 0 0
   kvm_inj_virq: irq 8
   kvm_entry:vcpu 0
   kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 80f6
   kvm_entry:vcpu 0
   kvm_exit: reason EPT_VIOLATION rip 0xfea5 info 184 0
   kvm_page_fault:   address ffea5 error_code 184
   kvm_entry:vcpu 0
   kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfea5 info 0 80f6
   kvm_entry:vcpu 0
   kvm_exit: reason EPT_VIOLATION rip 0xe990 info 184 0
   kvm_page_fault:   address fe990 error_code 184
   kvm_entry:vcpu 0
   kvm_exit: reason EXTERNAL_INTERRUPT rip 0xe990 info 0 80f6
   kvm_entry:vcpu 0
   kvm_exit: reason EXCEPTION_NMI rip 0xd334 info 0 8b0d
   kvm_userspace_exit:   reason KVM_EXIT_INTERNAL_ERROR (17)

 Trace without APICv (three reboots, just to make sure to hit the
 problematic condition of the supposed DF, as it is still not one hundred
 percent reproducible):
 http://xdel.ru/downloads/kvm-e5v2-issue/apic-off.dat.gz

 The trace here contains a well matching excerpt, just instead of the
 EXCEPTION_NMI, it does

  169.905098: kvm_exit: reason EPT_VIOLATION rip 0xd334 info 181 0
  169.905102: kvm_page_fault:   address feffd066 error_code 181

 and works.  Page fault says we tried to read 0xfeffd066 -- probably IOPB
 of TSS.  (I guess it is pre-fetch for following IO instruction.)

 Nothing strikes me when looking at it, but some APICv boots don't fail,
 so it would be interesting to compare them ... host's 0xf6 interrupt
 (IRQ_WORK_VECTOR) is a possible source of races.  (We could look more
 closely.  It is fired too often for my liking as well.)


Thanks Radim, http://xdel.ru/downloads/kvm-e5v2-issue/no-fail-with-apicv.dat.gz

(missed right button in mailer previously)

The related bits look the same as with enable_apicv=0 to me.

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-27 Thread Andrey Korolyov

On Fri, Mar 27, 2015 at 12:03 AM, Bandan Das b...@redhat.com wrote:
 Radim Krčmář rkrc...@redhat.com writes:

 2015-03-26 21:24+0300, Andrey Korolyov:
 On Thu, Mar 26, 2015 at 8:40 PM, Radim Krčmář rkrc...@redhat.com wrote:
  2015-03-26 20:08+0300, Andrey Korolyov:
  KVM internal error. Suberror: 2
  extra data[0]: 80ef
  extra data[1]: 8b0d
 
  Btw. does this part ever change?
 
  I see that first report had:
 
KVM internal error. Suberror: 2
extra data[0]: 80d1
extra data[1]: 8b0d
 
  Was that a Windows guest by any chance?

 Yes, exactly, the different extra data output was from Windows VMs.

 Windows uses vector 0xd1 for timer interrupts.

 I second Bandan -- checking that it reproduces on other machine would be
 great for sanity :)  (Although a bug in our APICv is far more likely.)

 If it's APICv related, a run without apicv enabled could give more hints.

 Your devices not getting reset hypothesis makes the most sense to me,
 maybe the timer vector in the error message is just one part of
 the whole story. Another misbehaving interrupt from the dark comes in at the
 same time and leads to a double fault.

Default trace (APICv enabled, first reboot introduced the issue):
http://xdel.ru/downloads/kvm-e5v2-issue/hanged-reboot-apic-on.dat.gz

Trace without APICv (three reboots, just to make sure to hit the
problematic condition of the supposed DF, as it is still not one hundred
percent reproducible):
http://xdel.ru/downloads/kvm-e5v2-issue/apic-off.dat.gz

It would of course be great to reproduce this somewhere else,
otherwise this whole thread may end up fixing a bug which exists only on
my particular platform. Right now I have no hardware except a lot of
well-known (in terms of existing issues) Supermicro boards of one
model.

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-27 Thread Andrey Korolyov

On Thu, Mar 26, 2015 at 11:40 PM, Radim Krčmář rkrc...@redhat.com wrote:
 2015-03-26 21:24+0300, Andrey Korolyov:
 On Thu, Mar 26, 2015 at 8:40 PM, Radim Krčmář rkrc...@redhat.com wrote:
  2015-03-26 20:08+0300, Andrey Korolyov:
  KVM internal error. Suberror: 2
  extra data[0]: 80ef
  extra data[1]: 8b0d
 
  Btw. does this part ever change?
 
  I see that first report had:
 
KVM internal error. Suberror: 2
extra data[0]: 80d1
extra data[1]: 8b0d
 
  Was that a Windows guest by any chance?

 Yes, exactly, the different extra data output was from Windows VMs.

 Windows uses vector 0xd1 for timer interrupts.

 I second Bandan -- checking that it reproduces on other machine would be
 great for sanity :)  (Although a bug in our APICv is far more likely.)

Trace with new bits:

KVM internal error. Suberror: 2
extra data[0]: 80ef
extra data[1]: 8b0d
extra data[2]: 77b
EAX= EBX= ECX= EDX=
ESI= EDI= EBP= ESP=6d24
EIP=d331 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   9300
CS =f000 000f  9b00
SS =   9300
DS =   9300
FS =   9300
GS =   9300
LDT=   8200
TR =   8b00
GDT= 000f6cb0 0037
IDT=  03ff
CR0=0010 CR2= CR3= CR4=
DR0= DR1= DR2=
DR3=
DR6=0ff0 DR7=0400
EFER=
Code=66 c3 cd 02 cb cd 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd
19 cb cd 1c cb cd 4a cb fa fc 66 ba 47 d3 0f 00 e9 ad fe f3 90 f0 0f
ba 2d d4 fe fb 3f

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Andrey Korolyov

On Thu, Mar 26, 2015 at 5:47 AM, Bandan Das b...@redhat.com wrote:
 Hi Andrey,

 Andrey Korolyov and...@xdel.ru writes:

 On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov and...@xdel.ru wrote:
 For now, it looks like the bug has a mixed Murphy-Heisenberg nature, as
 its appearance is very rare (compared to the number of actual launches)
 and most probably bound to the physical characteristics of my
 production nodes. As soon as I reach any reproducible path for a
 regular workstation environment, I'll let everyone know. Also I am
 starting to think that the issue may belong to a particular motherboard
 firmware revision, despite the fact that the CPU microcode is the same
 everywhere.

 I will take the risk and say this - could it be a processor bug ? :)


 Hello everyone, I've managed to reproduce this issue
 *deterministically* with latest seabios with smp fix and 3.18.3. The
 error occurs just *once* per VM until the hypervisor reboots, at least
 in my setup, this is definitely crazy...

 - launch two VMs (Centos 7 in my case),
 - wait a little while they are booting,
 - attach serial console (I am using virsh list for this exact purpose),
 - issue acpi reboot or reset, does not matter,
 - VM always hangs at boot, most times with sgabios initialization
 string printed out [1], but sometimes it hangs a bit later [2],
 - no matter how many times I try to relaunch the QEMU afterwards, the
 issue does not appear on VM which experienced problem once;
 - trace and sample args can be seen in [3] and [4] respectively.

 My system is a Dell R720 dual socket which has 2620v2s. I tried your
 setup but couldn't reproduce (my qemu cmdline isn't exactly the same
 as yours), although, if you could simplify your command line a bit,
 I can try again.

 Bandan

 1)
 Google, Inc.
 Serial Graphics Adapter 06/11/14
 SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $
 (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
 Term: 211x62
 4 0

 2)
 Google, Inc.
 Serial Graphics Adapter 06/11/14
 SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $
 (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
 Term: 211x62
 4 0
 [...empty screen...]
 SeaBIOS (version 1.8.1-20150325_230423-testnode)
 Machine UUID 3c78721f-7317-4f85-bcbe-f5ad46d293a1


 iPXE (http://ipxe.org) 00:02.0 C100 PCI2.10 PnP PMM+3FF95BA0+3FEF5BA0 C10

 3)

 KVM internal error. Suberror: 2
 extra data[0]: 80ef
 extra data[1]: 8b0d
 EAX= EBX= ECX= EDX=
 ESI= EDI= EBP= ESP=6d2c
 EIP=d331 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT= 000f6cb0 0037
 IDT=  03ff
 CR0=0010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=66 c3 cd 02 cb cd 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd
 19 cb cd 1c cb cd 4a cb fa fc 66 ba 47 d3 0f 00 e9 ad fe f3 90 f0 0f
 ba 2d d4 fe fb 3f

 4)
 /usr/bin/qemu-system-x86_64 -name centos71 -S -machine
 pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -bios
 /usr/share/seabios/bios.bin -m 1024 -realtime mlock=off -smp
 12,sockets=1,cores=12,threads=12 -uuid
 3c78721f-7317-4f85-bcbe-f5ad46d293a1 -nographic -no-user-config
 -nodefaults -device sga -chardev
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/centos71.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc
 base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard
 -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global
 PIIX4_PM.disable_s4=1 -boot strict=on -device
 nec-usb-xhci,id=usb,bus=pci.0,addr=0x3 -device
 virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive
 file=rbd:dev-rack2/centos7-1.raw:id=qemukvm:key=XX:auth_supported=cephx\;none:mon_host=10.6.0.1\:6789\;10.6.0.3\:6789\;10.6.0.4\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=writeback,aio=native
 -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -chardev pty,id=charserial0 -device
 isa-serial,chardev=charserial0,id=serial0 -chardev
 socket,id=charchannel0,path=/var/lib/libvirt/qemu/centos71.sock,server,nowait
 -device 
 virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.1
 -msg timestamp=on

Hehe, 2.2 works perfectly but 2.1 doesn't. I'll bisect the issue in
the next couple of days and post the right commit (though as far as I can
remember, none of the commits between 2.1 and 2.2 fixes a similar issue
on purpose). I've attached a reference xml to simplify playing with
libvirt if anyone is willing to do so.
domain type='kvm'

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Kevin O'Connor

On Thu, Mar 26, 2015 at 08:08:52PM +0300, Andrey Korolyov wrote:
 On Thu, Mar 26, 2015 at 8:06 PM, Kevin O'Connor ke...@koconnor.net wrote:
  On Thu, Mar 26, 2015 at 07:48:09PM +0300, Andrey Korolyov wrote:
  On Thu, Mar 26, 2015 at 7:36 PM, Kevin O'Connor ke...@koconnor.net wrote:
   I'm not sure if the crash always happens at the int $0x19 location
   though.  Andrey, does the crash always happen with EIP=d331 and/or
   with Code=... cd 19?
 
  There are also rare occurrences for d3f9 (in the middle of ep) and d334
  ep (less than one tenth of events for both). I`ll post a sample event
  capture with and without Radim`s proposed patch maybe today or
  tomorrow.
 
  /root/seabios-1.8.1/src/romlayout.S:289
  d3eb:   66 50   pushl  %eax
  d3ed:   66 51   pushl  %ecx
  d3ef:   66 52   pushl  %edx
  d3f1:   66 53   pushl  %ebx
  d3f3:   66 55   pushl  %ebp
  d3f5:   66 56   pushl  %esi
  d3f7:   66 57   pushl  %edi
  d3f9:   06  pushw  %es
  d3fa:   1e  pushw  %ds
 
  d334 irq_trampoline_0x1c:
  irq_trampoline_0x1c():
  /root/seabios-1.8.1/src/romlayout.S:196
  d334:   cd 1c   int$0x1c
  d336:   cb  lretw
 
  Thanks.  The d334 looks very similar to the d331 report (code=cd
  1c).  That path could happen during post (big real mode) or
  immediately after post (real mode).
 
  The d3f9 report does not look like the others - interrupts are
  disabled there.  If you still have the error logs, can you post the
  full kvm crash report for d3f9?
 
 
 Here you go:

Thanks.  While we're at it, can you verify if all your reports are
showing the cpu in real mode.  That is, do they all have 
in the third column of the segment registers - as in:

 ES =   9300

[...]
 Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
 b8 00 e0 00 00 8e

KVM reports the code as int $0x10 here.  Is it possible this report
was from a different build of seabios (that had a different code
layout)?

Interestingly, this int $0x10 is also in real-mode and not big real
mode, so I think it would have occurred after post completed.
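
The Code= dumps in this thread contain a run of "cd NN cb" byte triples, which the disassembly quoted elsewhere in the thread shows as "int $0xNN" followed by "lretw" (the irq_trampoline entries). A naive Python sketch for listing those vectors from a Code= string follows; it is a plain pattern scan rather than a disassembler, and the function name is illustrative.

```python
def find_int_trampolines(code_hex: str) -> list:
    """Return the NN values of every 'cd NN cb' (int $NN; lretw) triple."""
    data = bytes.fromhex(code_hex)  # fromhex() ignores the spaces
    return [
        data[i + 1]
        for i in range(len(data) - 2)
        if data[i] == 0xCD and data[i + 2] == 0xCB
    ]


# Leading part of the Code= line from the seabios 1.8.1 reports in this thread.
code = (
    "66 c3 cd 02 cb cd 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd "
    "19 cb cd 1c cb cd 4a cb fa fc"
)
print([hex(v) for v in find_int_trampolines(code)])
```

Comparing the list from two dumps is a quick way to notice that they come from different SeaBIOS builds, since the set and order of trampolines differ between layouts.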

-Kevin

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Radim Krčmář

2015-03-26 20:08+0300, Andrey Korolyov:
 KVM internal error. Suberror: 2
 extra data[0]: 80ef
 extra data[1]: 8b0d

Btw. does this part ever change?

I see that first report had:

  KVM internal error. Suberror: 2
  extra data[0]: 80d1
  extra data[1]: 8b0d

Was that a Windows guest by any chance?

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Andrey Korolyov

On Thu, Mar 26, 2015 at 7:36 PM, Kevin O'Connor ke...@koconnor.net wrote:
 On Thu, Mar 26, 2015 at 04:58:07PM +0100, Radim Krčmář wrote:
 2015-03-25 20:05-0400, Kevin O'Connor:
  On Thu, Mar 26, 2015 at 02:35:58AM +0300, Andrey Korolyov wrote:
   Thanks, strangely the reboot is always failing now and always reaching
   seabios greeting. May be prints straightened up a race (e.g. it is not
   int19 problem really).
  
   object file part:
  
   d331 irq_trampoline_0x19:
   irq_trampoline_0x19():
   /root/seabios-1.8.1/src/romlayout.S:195
   d331:   cd 19   int$0x19
   d333:   cb  lretw
 
  [...]
   Jump to int19 (vector=f000e6f2)
 
  Thanks.  So, it dies on the int $0x19 instruction itself.  The
  vector looks correct and I don't see anything in the cpu register
  state that looks wrong.  Maybe one of the kvm developers will have an
  idea what could cause a fault there.

 The place agrees with the cd 19 cb part of KVM error output.
 Suberror 2 means that we were interrupted while delivering a vector,
 here it is dissected: (delivering 'vect_info')

   vect_info (extra data[0]: 80ef)
   - vector 0xef
   - INTR_TYPE_EXT_INTR (0x000)
   - no error code (0x000)
   - valid (0x8000)

   intr_info (extra data[1]: 8b0d)
   - #GP (0x0d)
   - INTR_TYPE_HARD_EXCEPTION (0x300)
   - error code on stack (0x800)  [Hunk at the bottom exposes it.]
   - valid (0x8000)
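
The breakdown above can be reproduced with a small decoder. This Python sketch assumes the VMX interruption-information layout from the SDM (vector in bits 7:0, type in bits 10:8, error-code-valid in bit 11, valid in bit 31). The extra data values appear in this archive as 80ef and 8b0d; the sketch assumes the full 32-bit fields were 0x800000ef and 0x80000b0d, which is what a valid bit in position 31 implies. The names are illustrative, not KVM identifiers.

```python
INTR_TYPES = {
    0: "external interrupt",
    2: "NMI",
    3: "hardware exception",
    4: "software interrupt",
    5: "privileged software exception",
    6: "software exception",
}


def decode_intr_info(value: int) -> dict:
    """Split a 32-bit interruption-information field into its parts."""
    return {
        "vector": value & 0xFF,
        "type": INTR_TYPES.get((value >> 8) & 0x7, "reserved/other"),
        "error_code_valid": bool(value & (1 << 11)),
        "valid": bool(value & (1 << 31)),
    }


# Assumed full values; the archive shows them as "80ef" and "8b0d".
vect_info = decode_intr_info(0x800000EF)
intr_info = decode_intr_info(0x80000B0D)
print(vect_info)
print(intr_info)
```

Under those assumptions the first field decodes to an external interrupt with vector 0xef and the second to a hardware exception with vector 0x0d (#GP) carrying an error code, in line with the list above.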

 Thanks for the background info.

 Notice the 0xef.  My best hypothesis so far is that we fail at resetting
 devices, and 0xef is LOCAL_TIMER_VECTOR from Linux before we rebooted.
 (The bug happens at the first place that enables interrupts.)

 FYI, the int $0x19 isn't the first place SeaBIOS will enable
 interrupts.  Each screen print (every character in the seabios banner
 and uuid string) will call the vga bios (int $0x10) with irqs enabled
 (see output.c:screenc).

 Also, SeaBIOS loads a default vector (f000:ff53) at 0xef which does a
 simple iretw.

 Things that are unusual about the int $0x19 call:
   - it is likely the first place that the cpu is transitioned into
 16bit real mode as opposed to big real mode.  (That is, the
 first place interrupts are enabled with the segment limits set to
 0x.)
   - it's right after the fw/shadow.c:make_bios_readonly() call, which
 attempts to configure the memory at 0xf-0x10 as
 read-only.  That code also issues a wbinvd() call.

 I'm not sure if the crash always happens at the int $0x19 location
 though.  Andrey, does the crash always happen with EIP=d331 and/or
 with Code=... cd 19?

 -Kevin

There are also rare occurrences for d3f9 (in the middle of ep) and d334
ep (less than one tenth of events for both). I'll post a sample event
capture with and without Radim's proposed patch maybe today or
tomorrow.

/root/seabios-1.8.1/src/romlayout.S:289
d3eb:   66 50   pushl  %eax
d3ed:   66 51   pushl  %ecx
d3ef:   66 52   pushl  %edx
d3f1:   66 53   pushl  %ebx
d3f3:   66 55   pushl  %ebp
d3f5:   66 56   pushl  %esi
d3f7:   66 57   pushl  %edi
d3f9:   06  pushw  %es
d3fa:   1e  pushw  %ds

d334 irq_trampoline_0x1c:
irq_trampoline_0x1c():
/root/seabios-1.8.1/src/romlayout.S:196
d334:   cd 1c   int$0x1c
d336:   cb  lretw

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Radim Krčmář

2015-03-26 12:36-0400, Kevin O'Connor:
 On Thu, Mar 26, 2015 at 04:58:07PM +0100, Radim Krčmář wrote:
  Notice the 0xef.  My best hypothesis so far is that we fail at resetting
  devices, and 0xef is LOCAL_TIMER_VECTOR from Linux before we rebooted.
  (The bug happens at the first place that enables interrupts.)
 
 FYI, the int $0x19 isn't the first place SeaBIOS will enable
 interrupts.  Each screen print (every character in the seabios banner
 and uuid string) will call the vga bios (int $0x10) with irqs enabled
 (see output.c:screenc).

Most useful, thank you.
So interrupt can't be forgotten there on reboot ... it's possible that
a pending timer injects it later.
(I'd like to grasp the reason behind 0xef first.)

 Also, SeaBIOS loads a default vector (f000:ff53) at 0xef which does a
 simple iretw.

The #GP error code could help a bit here.

 Things that are unusual about the int $0x19 call:
   - it is likely the first place that the cpu is transitioned into
 16bit real mode as opposed to big real mode.  (That is, the
 first place interrupts are enabled with the segment limits set to
 0x.)
   - it's right after the fw/shadow.c:make_bios_readonly() call, which
 attempts to configure the memory at 0xf-0x10 as
 read-only.  That code also issues a wbinvd() call.

(I'll wait for the trace before doing more wild guesses ...)

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Radim Krčmář

2015-03-26 19:48+0300, Andrey Korolyov:
  I`ll post a sample event
 capture with and without Radim`s proposed patch maybe today or
 tomorrow.

Thanks.

The patch doesn't change runtime behavior, it just adds another data
field to the error report, so there is no need to test both cases.

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Kevin O'Connor

On Thu, Mar 26, 2015 at 04:58:07PM +0100, Radim Krčmář wrote:
 2015-03-25 20:05-0400, Kevin O'Connor:
  On Thu, Mar 26, 2015 at 02:35:58AM +0300, Andrey Korolyov wrote:
   Thanks, strangely the reboot is always failing now and always reaching
   seabios greeting. May be prints straightened up a race (e.g. it is not
   int19 problem really).
   
   object file part:
   
   d331 irq_trampoline_0x19:
   irq_trampoline_0x19():
   /root/seabios-1.8.1/src/romlayout.S:195
   d331:   cd 19   int$0x19
   d333:   cb  lretw
  
  [...]
   Jump to int19 (vector=f000e6f2)
  
  Thanks.  So, it dies on the int $0x19 instruction itself.  The
  vector looks correct and I don't see anything in the cpu register
  state that looks wrong.  Maybe one of the kvm developers will have an
  idea what could cause a fault there.
 
 The place agrees with the cd 19 cb part of KVM error output.
 Suberror 2 means that we were interrupted while delivering a vector,
 here it is dissected: (delivering 'vect_info')
 
   vect_info (extra data[0]: 80ef)
   - vector 0xef
   - INTR_TYPE_EXT_INTR (0x000)
   - no error code (0x000)
   - valid (0x8000)
 
   intr_info (extra data[1]: 8b0d)
   - #GP (0x0d)
   - INTR_TYPE_HARD_EXCEPTION (0x300)
   - error code on stack (0x800)  [Hunk at the bottom exposes it.]
   - valid (0x8000)

Thanks for the background info.

 Notice the 0xef.  My best hypothesis so far is that we fail at resetting
 devices, and 0xef is LOCAL_TIMER_VECTOR from Linux before we rebooted.
 (The bug happens at the first place that enables interrupts.)

FYI, the int $0x19 isn't the first place SeaBIOS will enable
interrupts.  Each screen print (every character in the seabios banner
and uuid string) will call the vga bios (int $0x10) with irqs enabled
(see output.c:screenc).

Also, SeaBIOS loads a default vector (f000:ff53) at 0xef which does a
simple iretw.

Things that are unusual about the int $0x19 call:
  - it is likely the first place that the cpu is transitioned into
16bit real mode as opposed to big real mode.  (That is, the
first place interrupts are enabled with the segment limits set to
0x.)
  - it's right after the fw/shadow.c:make_bios_readonly() call, which
attempts to configure the memory at 0xf-0x10 as
read-only.  That code also issues a wbinvd() call.

I'm not sure if the crash always happens at the int $0x19 location
though.  Andrey, does the crash always happen with EIP=d331 and/or
with Code=... cd 19?

-Kevin

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Andrey Korolyov

On Thu, Mar 26, 2015 at 8:06 PM, Kevin O'Connor ke...@koconnor.net wrote:
 On Thu, Mar 26, 2015 at 07:48:09PM +0300, Andrey Korolyov wrote:
 On Thu, Mar 26, 2015 at 7:36 PM, Kevin O'Connor ke...@koconnor.net wrote:
  I'm not sure if the crash always happens at the int $0x19 location
  though.  Andrey, does the crash always happen with EIP=d331 and/or
  with Code=... cd 19?

 There are also rare occurrences for d3f9 (in the middle of ep) and d334
 ep (less than one tenth of events for both). I`ll post a sample event
 capture with and without Radim`s proposed patch maybe today or
 tomorrow.

 /root/seabios-1.8.1/src/romlayout.S:289
 d3eb:   66 50   pushl  %eax
 d3ed:   66 51   pushl  %ecx
 d3ef:   66 52   pushl  %edx
 d3f1:   66 53   pushl  %ebx
 d3f3:   66 55   pushl  %ebp
 d3f5:   66 56   pushl  %esi
 d3f7:   66 57   pushl  %edi
 d3f9:   06  pushw  %es
 d3fa:   1e  pushw  %ds

 d334 irq_trampoline_0x1c:
 irq_trampoline_0x1c():
 /root/seabios-1.8.1/src/romlayout.S:196
 d334:   cd 1c   int$0x1c
 d336:   cb  lretw

 Thanks.  The d334 looks very similar to the d331 report (code=cd
 1c).  That path could happen during post (big real mode) or
 immediately after post (real mode).

 The d3f9 report does not look like the others - interrupts are
 disabled there.  If you still have the error logs, can you post the
 full kvm crash report for d3f9?


Here you go:

KVM internal error. Suberror: 2
extra data[0]: 80ef
extra data[1]: 8b0d
EAX=0003 EBX= ECX= EDX=
ESI= EDI= EBP= ESP=6cd4
EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   9300
CS =f000 000f  9b00
SS =   9300
DS =   9300
FS =   9300
GS =   9300
LDT=   8200
TR =   8b00
GDT= 000f6e98 0037
IDT=  03ff
CR0=0010 CR2= CR3= CR4=
DR0= DR1= DR2=
DR3=
DR6=0ff0 DR7=0400
EFER=
Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
b8 00 e0 00 00 8e

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Andrey Korolyov

On Thu, Mar 26, 2015 at 8:18 PM, Kevin O'Connor ke...@koconnor.net wrote:
 On Thu, Mar 26, 2015 at 08:08:52PM +0300, Andrey Korolyov wrote:
 On Thu, Mar 26, 2015 at 8:06 PM, Kevin O'Connor ke...@koconnor.net wrote:
  On Thu, Mar 26, 2015 at 07:48:09PM +0300, Andrey Korolyov wrote:
  On Thu, Mar 26, 2015 at 7:36 PM, Kevin O'Connor ke...@koconnor.net 
  wrote:
   I'm not sure if the crash always happens at the int $0x19 location
   though.  Andrey, does the crash always happen with EIP=d331 and/or
   with Code=... cd 19?
 
  There are also rare occurrences for d3f9 (in the middle of ep) and d334
  ep (less than one tenth of events for both). I`ll post a sample event
  capture with and without Radim`s proposed patch maybe today or
  tomorrow.
 
  /root/seabios-1.8.1/src/romlayout.S:289
  d3eb:   66 50   pushl  %eax
  d3ed:   66 51   pushl  %ecx
  d3ef:   66 52   pushl  %edx
  d3f1:   66 53   pushl  %ebx
  d3f3:   66 55   pushl  %ebp
  d3f5:   66 56   pushl  %esi
  d3f7:   66 57   pushl  %edi
  d3f9:   06  pushw  %es
  d3fa:   1e  pushw  %ds
 
  d334 irq_trampoline_0x1c:
  irq_trampoline_0x1c():
  /root/seabios-1.8.1/src/romlayout.S:196
  d334:   cd 1c   int$0x1c
  d336:   cb  lretw
 
  Thanks.  The d334 looks very similar to the d331 report (code=cd
  1c).  That path could happen during post (big real mode) or
  immediately after post (real mode).
 
  The d3f9 report does not look like the others - interrupts are
  disabled there.  If you still have the error logs, can you post the
  full kvm crash report for d3f9?
 

 Here you go:

 Thanks.  While we're at it, can you verify if all your reports are
 showing the cpu in real mode?  That is, do they all have 0000ffff
 in the third column of the segment registers - as in:

 ES =   9300


That's positive.

 [...]
 Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
 b8 00 e0 00 00 8e

 KVM reports the code as int $0x10 here.  Was it possible this report
 was from a different build of seabios (that had a different code
 layout)?


Yep, sorry, I've mixed in logs from just before the transition away from 1.7.5.

 Interestingly, this int $0x10 is also in real-mode and not big real
 mode, so I think it would have occurred after post completed.

 -Kevin

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Radim Krčmář

2015-03-26 21:24+0300, Andrey Korolyov:
 On Thu, Mar 26, 2015 at 8:40 PM, Radim Krčmář rkrc...@redhat.com wrote:
  2015-03-26 20:08+0300, Andrey Korolyov:
  KVM internal error. Suberror: 2
  extra data[0]: 80ef
  extra data[1]: 8b0d
 
  Btw. does this part ever change?
 
  I see that first report had:
 
KVM internal error. Suberror: 2
extra data[0]: 80d1
extra data[1]: 8b0d
 
  Was that a Windows guest by any chance?
 
 Yes, exactly, different extra data output was from a Windows VMs.

Windows uses vector 0xd1 for timer interrupts.

I second Bandan -- checking that it reproduces on another machine would be
great for sanity :)  (Although a bug in our APICv is far more likely.)

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Kevin O'Connor

On Thu, Mar 26, 2015 at 07:48:09PM +0300, Andrey Korolyov wrote:
 On Thu, Mar 26, 2015 at 7:36 PM, Kevin O'Connor ke...@koconnor.net wrote:
  I'm not sure if the crash always happens at the int $0x19 location
  though.  Andrey, does the crash always happen with EIP=d331 and/or
  with Code=... cd 19?
 
 There are also rare occurences for d3f9 (in the middle of ep) and d334
 ep (less than one tenth of events for both). I`ll post a sample event
 capture with and without Radim`s proposed patch maybe today or
 tomorrow.
 
 /root/seabios-1.8.1/src/romlayout.S:289
 d3eb:   66 50   pushl  %eax
 d3ed:   66 51   pushl  %ecx
 d3ef:   66 52   pushl  %edx
 d3f1:   66 53   pushl  %ebx
 d3f3:   66 55   pushl  %ebp
 d3f5:   66 56   pushl  %esi
 d3f7:   66 57   pushl  %edi
 d3f9:   06  pushw  %es
 d3fa:   1e  pushw  %ds
 
 0000d334 <irq_trampoline_0x1c>:
 irq_trampoline_0x1c():
 /root/seabios-1.8.1/src/romlayout.S:196
 d334:   cd 1c   int    $0x1c
 d336:   cb  lretw

Thanks.  The d334 looks very similar to the d331 report (code=cd
1c).  That path could happen during post (big real mode) or
immediately after post (real mode).

The d3f9 report does not look like the others - interrupts are
disabled there.  If you still have the error logs, can you post the
full kvm crash report for d3f9?

-Kevin

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Andrey Korolyov

On Thu, Mar 26, 2015 at 8:40 PM, Radim Krčmář rkrc...@redhat.com wrote:
 2015-03-26 20:08+0300, Andrey Korolyov:
 KVM internal error. Suberror: 2
 extra data[0]: 80ef
 extra data[1]: 8b0d

 Btw. does this part ever change?

 I see that first report had:

   KVM internal error. Suberror: 2
   extra data[0]: 80d1
   extra data[1]: 8b0d

 Was that a Windows guest by any chance?

Yes, exactly, the different extra data output was from a Windows VM.
Thanks for clarifying what your patch does; I hadn't looked at the
vmx code yet and thought that it changed things.

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Bandan Das

Radim Krčmář rkrc...@redhat.com writes:

 2015-03-26 21:24+0300, Andrey Korolyov:
 On Thu, Mar 26, 2015 at 8:40 PM, Radim Krčmář rkrc...@redhat.com wrote:
  2015-03-26 20:08+0300, Andrey Korolyov:
  KVM internal error. Suberror: 2
  extra data[0]: 80ef
  extra data[1]: 8b0d
 
  Btw. does this part ever change?
 
  I see that first report had:
 
KVM internal error. Suberror: 2
extra data[0]: 80d1
extra data[1]: 8b0d
 
  Was that a Windows guest by any chance?
 
 Yes, exactly, different extra data output was from a Windows VMs.

 Windows uses vector 0xd1 for timer interrupts.

 I second Bandan -- checking that it reproduces on other machine would be
 great for sanity :)  (Although a bug in our APICv is far more likely.)

If it's APICv related, a run without apicv enabled could give more hints.

Your hypothesis that the devices are not getting reset makes the most
sense to me; maybe the timer vector in the error message is just one
part of the whole story. Another misbehaving interrupt from the dark
may come in at the same time and lead to a double fault.

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Radim Krčmář

2015-03-25 20:05-0400, Kevin O'Connor:
 On Thu, Mar 26, 2015 at 02:35:58AM +0300, Andrey Korolyov wrote:
  Thanks, strangely the reboot is always failing now and always reaching
  seabios greeting. May be prints straightened up a race (e.g. it is not
  int19 problem really).
  
  object file part:
  
  d331 irq_trampoline_0x19:
  irq_trampoline_0x19():
  /root/seabios-1.8.1/src/romlayout.S:195
  d331:   cd 19   int$0x19
  d333:   cb  lretw
 
 [...]
  Jump to int19 (vector=f000e6f2)
 
 Thanks.  So, it dies on the int $0x19 instruction itself.  The
 vector looks correct and I don't see anything in the cpu register
 state that looks wrong.  Maybe one of the kvm developers will have an
 idea what could cause a fault there.

The place agrees with the cd 19 cb part of the KVM error output.
Suberror 2 means that we were interrupted while delivering a vector;
here it is dissected (we were delivering 'vect_info'):

  vect_info (extra data[0]: 800000ef)
  - vector 0xef
  - INTR_TYPE_EXT_INTR (0x000)
  - no error code (0x000)
  - valid (0x80000000)

  intr_info (extra data[1]: 80000b0d)
  - #GP (0x0d)
  - INTR_TYPE_HARD_EXCEPTION (0x300)
  - error code on stack (0x800)  [Hunk at the bottom exposes it.]
  - valid (0x80000000)
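
The dissection above can be reproduced with a few lines of Python. This is only a sketch: the masks follow the VMX interruption-information layout (vector in bits 7:0, type in bits 10:8, error-code flag in bit 11, valid in bit 31), the function and table names are made up for illustration, and the two input words are assumed to be 0x800000ef and 0x80000b0d (the valid flag is bit 31, so the shorter spellings seen in this archive are taken to have lost a run of zeros).

```python
# Field layout of the two "extra data" words (VMX interruption information).
VECTOR_MASK = 0x000000FF      # bits 7:0, interrupt or exception vector
TYPE_MASK = 0x00000700        # bits 10:8, event type
ERROR_CODE_MASK = 0x00000800  # bit 11, an error code is delivered
VALID_MASK = 0x80000000       # bit 31, the word holds valid information

# Only the two event types that show up in this report are named here;
# the table itself is a hypothetical helper, not a KVM structure.
INTR_TYPES = {0: "INTR_TYPE_EXT_INTR", 3: "INTR_TYPE_HARD_EXCEPTION"}


def decode(word):
    """Split one extra-data word into its fields."""
    return {
        "valid": bool(word & VALID_MASK),
        "vector": word & VECTOR_MASK,
        "type": INTR_TYPES.get((word & TYPE_MASK) >> 8, "other"),
        "has_error_code": bool(word & ERROR_CODE_MASK),
    }


vect_info = decode(0x800000EF)  # extra data[0]: the event being delivered
intr_info = decode(0x80000B0D)  # extra data[1]: the event that interrupted it

print(vect_info)
print(intr_info)
```

Read this way, the first word is an external interrupt on vector 0xef and the second is a hardware exception on vector 0x0d (#GP) that carries an error code.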

Notice the 0xef.  My best hypothesis so far is that we fail at resetting
devices, and 0xef is LOCAL_TIMER_VECTOR from Linux before we rebooted.
(The bug happens at the first place that enables interrupts.)

This should be handled without an internal error, though.
(It's possible we are hitting two bugs here ...)

Can you provide KVM event trace as well?  (`trace-cmd record -e 'kvm*'`)

Thanks.


---
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 50c675b46901..541a29476a56 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5088,9 +5088,10 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 	    !(is_page_fault(intr_info) && !(error_code & PFERR_RSVD_MASK))) {
 		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
 		vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_SIMUL_EX;
-		vcpu->run->internal.ndata = 2;
+		vcpu->run->internal.ndata = 3;
 		vcpu->run->internal.data[0] = vect_info;
 		vcpu->run->internal.data[1] = intr_info;
+		vcpu->run->internal.data[2] = error_code;
 		return 0;
 	}
 

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-26 Thread Andrey Korolyov

On Thu, Mar 26, 2015 at 12:18 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Thu, Mar 26, 2015 at 5:47 AM, Bandan Das b...@redhat.com wrote:
 Hi Andrey,

 Andrey Korolyov and...@xdel.ru writes:

 On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov and...@xdel.ru wrote:
 For now, it looks like bug have a mixed Murphy-Heisenberg nature, as
 it appearance is very rare (compared to the number of actual launches)
 and most probably bounded to the physical characteristics of my
 production nodes. As soon as I reach any reproducible path for a
 regular workstation environment, I`ll let everyone know. Also I am
 starting to think that issue can belong to the particular motherboard
 firmware revision, despite fact that the CPU microcode is the same
 everywhere.

 I will take the risk and say this - could it be a processor bug ? :)


 Hello everyone, I`ve managed to reproduce this issue
 *deterministically* with latest seabios with smp fix and 3.18.3. The
 error occuring just *once* per vm until hypervisor reboots, at least
 in my setup, this is definitely crazy...

 - launch two VMs (Centos 7 in my case),
 - wait a little while they are booting,
 - attach serial console (I am using virsh list for this exact purpose),
 - issue acpi reboot or reset, does not matter,
 - VM always hangs at boot, most times with sgabios initialization
 string printed out [1], but sometimes it hangs a bit later [2],
 - no matter how many times I try to relaunch the QEMU afterwards, the
 issue does not appear on VM which experienced problem once;
 - trace and sample args can be seen in [3] and [4] respectively.

 My system is a Dell R720 dual socket which has 2620v2s. I tried your
 setup but couldn't reproduce (my qemu cmdline isn't exactly the same
 as yours), although, if you could simplify your command line a bit,
 I can try again.

 Bandan

 1)
 Google, Inc.
 Serial Graphics Adapter 06/11/14
 SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $
 (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
 Term: 211x62
 4 0

 2)
 Google, Inc.
 Serial Graphics Adapter 06/11/14
 SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $
 (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
 Term: 211x62
 4 0
 [...empty screen...]
 SeaBIOS (version 1.8.1-20150325_230423-testnode)
 Machine UUID 3c78721f-7317-4f85-bcbe-f5ad46d293a1


 iPXE (http://ipxe.org) 00:02.0 C100 PCI2.10 PnP PMM+3FF95BA0+3FEF5BA0 C10

 3)

 KVM internal error. Suberror: 2
 extra data[0]: 80ef
 extra data[1]: 8b0d
 EAX= EBX= ECX= EDX=
 ESI= EDI= EBP= ESP=6d2c
 EIP=d331 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT= 000f6cb0 0037
 IDT=  03ff
 CR0=0010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=66 c3 cd 02 cb cd 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd
 19 cb cd 1c cb cd 4a cb fa fc 66 ba 47 d3 0f 00 e9 ad fe f3 90 f0 0f
 ba 2d d4 fe fb 3f

 4)
 /usr/bin/qemu-system-x86_64 -name centos71 -S -machine
 pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -bios
 /usr/share/seabios/bios.bin -m 1024 -realtime mlock=off -smp
 12,sockets=1,cores=12,threads=12 -uuid
 3c78721f-7317-4f85-bcbe-f5ad46d293a1 -nographic -no-user-config
 -nodefaults -device sga -chardev
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/centos71.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc
 base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard
 -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global
 PIIX4_PM.disable_s4=1 -boot strict=on -device
 nec-usb-xhci,id=usb,bus=pci.0,addr=0x3 -device
 virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive
 file=rbd:dev-rack2/centos7-1.raw:id=qemukvm:key=XX:auth_supported=cephx\;none:mon_host=10.6.0.1\:6789\;10.6.0.3\:6789\;10.6.0.4\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=writeback,aio=native
 -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -chardev pty,id=charserial0 -device
 isa-serial,chardev=charserial0,id=serial0 -chardev
 socket,id=charchannel0,path=/var/lib/libvirt/qemu/centos71.sock,server,nowait
 -device 
 virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.1
 -msg timestamp=on

 Hehe, 2.2 works just perfectly but 2.1 doesn't. I'll bisect the issue in
 the next couple of days and post the right commit (though as far as I can
 remember, none of the commits between 2.1 and 2.2 was meant to fix a
 similar issue). I've attached a reference xml to

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-25 Thread Kevin O'Connor

On Wed, Mar 25, 2015 at 11:43:31PM +0300, Andrey Korolyov wrote:
 On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov and...@xdel.ru wrote:
  For now, it looks like bug have a mixed Murphy-Heisenberg nature, as
  it appearance is very rare (compared to the number of actual launches)
  and most probably bounded to the physical characteristics of my
  production nodes. As soon as I reach any reproducible path for a
  regular workstation environment, I`ll let everyone know. Also I am
  starting to think that issue can belong to the particular motherboard
  firmware revision, despite fact that the CPU microcode is the same
  everywhere.
 
 
 Hello everyone, I`ve managed to reproduce this issue
 *deterministically* with latest seabios with smp fix and 3.18.3. The
 error occuring just *once* per vm until hypervisor reboots, at least
 in my setup, this is definitely crazy...
 
 - launch two VMs (Centos 7 in my case),
 - wait a little while they are booting,
 - attach serial console (I am using virsh list for this exact purpose),
 - issue acpi reboot or reset, does not matter,
 - VM always hangs at boot, most times with sgabios initialization
 string printed out [1], but sometimes it hangs a bit later [2],
 - no matter how many times I try to relaunch the QEMU afterwards, the
 issue does not appear on VM which experienced problem once;
 - trace and sample args can be seen in [3] and [4] respectively.

Can you add something like:

  -chardev file,path=seabioslog.`date +%s`,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios

to the qemu command line and forward the resulting log from both a
successful boot and a failed one?
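
If the command line is assembled by a script rather than typed into a shell, the same two options can be built without relying on backtick expansion. A rough sketch in Python; the helper name and its parameters are invented for illustration, and only the option strings themselves come from the request above:

```python
import time


def debugcon_args(prefix="seabioslog", now=None):
    """Return the extra qemu arguments that send SeaBIOS debug output
    (I/O port 0x402) to a file named <prefix>.<unix timestamp>."""
    stamp = int(time.time() if now is None else now)
    return [
        "-chardev", f"file,path={prefix}.{stamp},id=seabios",
        "-device", "isa-debugcon,iobase=0x402,chardev=seabios",
    ]


# Example: print the options so they can be appended to an existing command.
print(" ".join(debugcon_args()))
```

Using a fresh timestamp per launch keeps the log of a failed boot from being overwritten by the next successful one.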

-Kevin

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-25 Thread Andrey Korolyov

On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov and...@xdel.ru wrote:
 For now, it looks like bug have a mixed Murphy-Heisenberg nature, as
 it appearance is very rare (compared to the number of actual launches)
 and most probably bounded to the physical characteristics of my
 production nodes. As soon as I reach any reproducible path for a
 regular workstation environment, I`ll let everyone know. Also I am
 starting to think that issue can belong to the particular motherboard
 firmware revision, despite fact that the CPU microcode is the same
 everywhere.


Hello everyone, I've managed to reproduce this issue
*deterministically* with the latest seabios with the smp fix and
3.18.3. The error occurs just *once* per VM until the hypervisor
reboots, at least in my setup; this is definitely crazy...

- launch two VMs (Centos 7 in my case),
- wait a little while they are booting,
- attach serial console (I am using virsh list for this exact purpose),
- issue acpi reboot or reset, does not matter,
- VM always hangs at boot, most times with the sgabios initialization
string printed out [1], but sometimes it hangs a bit later [2],
- no matter how many times I try to relaunch QEMU afterwards, the
issue does not appear on a VM which has experienced the problem once;
- trace and sample args can be seen in [3] and [4] respectively.

1)
Google, Inc.
Serial Graphics Adapter 06/11/14
SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $
(pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
Term: 211x62
4 0

2)
Google, Inc.
Serial Graphics Adapter 06/11/14
SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $
(pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
Term: 211x62
4 0
[...empty screen...]
SeaBIOS (version 1.8.1-20150325_230423-testnode)
Machine UUID 3c78721f-7317-4f85-bcbe-f5ad46d293a1


iPXE (http://ipxe.org) 00:02.0 C100 PCI2.10 PnP PMM+3FF95BA0+3FEF5BA0 C10

3)

KVM internal error. Suberror: 2
extra data[0]: 80ef
extra data[1]: 8b0d
EAX= EBX= ECX= EDX=
ESI= EDI= EBP= ESP=6d2c
EIP=d331 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   9300
CS =f000 000f  9b00
SS =   9300
DS =   9300
FS =   9300
GS =   9300
LDT=   8200
TR =   8b00
GDT= 000f6cb0 0037
IDT=  03ff
CR0=0010 CR2= CR3= CR4=
DR0= DR1= DR2=
DR3=
DR6=0ff0 DR7=0400
EFER=
Code=66 c3 cd 02 cb cd 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd
19 cb cd 1c cb cd 4a cb fa fc 66 ba 47 d3 0f 00 e9 ad fe f3 90 f0 0f
ba 2d d4 fe fb 3f

4)
/usr/bin/qemu-system-x86_64 -name centos71 -S -machine
pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -bios
/usr/share/seabios/bios.bin -m 1024 -realtime mlock=off -smp
12,sockets=1,cores=12,threads=12 -uuid
3c78721f-7317-4f85-bcbe-f5ad46d293a1 -nographic -no-user-config
-nodefaults -device sga -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/centos71.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc
base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard
-no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global
PIIX4_PM.disable_s4=1 -boot strict=on -device
nec-usb-xhci,id=usb,bus=pci.0,addr=0x3 -device
virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive
file=rbd:dev-rack2/centos7-1.raw:id=qemukvm:key=XX:auth_supported=cephx\;none:mon_host=10.6.0.1\:6789\;10.6.0.3\:6789\;10.6.0.4\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=writeback,aio=native
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -chardev
socket,id=charchannel0,path=/var/lib/libvirt/qemu/centos71.sock,server,nowait
-device 
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.1
-msg timestamp=on

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-25 Thread Andrey Korolyov

 - attach serial console (I am using virsh list for this exact purpose),

virsh console of course, sorry

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-25 Thread Andrey Korolyov

On Wed, Mar 25, 2015 at 11:54 PM, Kevin O'Connor ke...@koconnor.net wrote:
 On Wed, Mar 25, 2015 at 11:43:31PM +0300, Andrey Korolyov wrote:
 On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov and...@xdel.ru wrote:
  For now, it looks like bug have a mixed Murphy-Heisenberg nature, as
  it appearance is very rare (compared to the number of actual launches)
  and most probably bounded to the physical characteristics of my
  production nodes. As soon as I reach any reproducible path for a
  regular workstation environment, I`ll let everyone know. Also I am
  starting to think that issue can belong to the particular motherboard
  firmware revision, despite fact that the CPU microcode is the same
  everywhere.


 Hello everyone, I`ve managed to reproduce this issue
 *deterministically* with latest seabios with smp fix and 3.18.3. The
 error occuring just *once* per vm until hypervisor reboots, at least
 in my setup, this is definitely crazy...

 - launch two VMs (Centos 7 in my case),
 - wait a little while they are booting,
 - attach serial console (I am using virsh list for this exact purpose),
 - issue acpi reboot or reset, does not matter,
 - VM always hangs at boot, most times with sgabios initialization
 string printed out [1], but sometimes it hangs a bit later [2],
 - no matter how many times I try to relaunch the QEMU afterwards, the
 issue does not appear on VM which experienced problem once;
 - trace and sample args can be seen in [3] and [4] respectively.

 Can you add something like:

   -chardev file,path=seabioslog.`date +%s`,id=seabios -device 
 isa-debugcon,iobase=0x402,chardev=seabios

 to the qemu command line and forward the resulting log from both a
 succesful boot and a failed one?

 -Kevin

Of course, logs are attached.


reboot.failed
Description: Binary data


reboot.succeeded
Description: Binary data

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-25 Thread Andrey Korolyov

On Thu, Mar 26, 2015 at 2:02 AM, Kevin O'Connor ke...@koconnor.net wrote:
 On Thu, Mar 26, 2015 at 01:31:11AM +0300, Andrey Korolyov wrote:
 On Wed, Mar 25, 2015 at 11:54 PM, Kevin O'Connor ke...@koconnor.net wrote:
 
  Can you add something like:
 
-chardev file,path=seabioslog.`date +%s`,id=seabios -device 
  isa-debugcon,iobase=0x402,chardev=seabios
 
  to the qemu command line and forward the resulting log from both a
  succesful boot and a failed one?
 
  -Kevin

 Of course, logs are attached.

 Thanks.  From a diff of the two logs:

  4: 3ffe - 4000 = 2 RESERVED
  5: feffc000 - ff00 = 2 RESERVED
  6: fffc - 0001 = 2 RESERVED
   -enter handle_19:
   -  NULL
   -Booting from Hard Disk...
   -Booting from :7c00

 So, it got most of the way through the reboot - there's only a few
 function calls between the e820 map being dumped and the handle_19
 call.  The fault also seems to show it stopped in the BIOS in 16bit
 mode:

 EIP=d331 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00

 Can you add the patch below, force the fault, and forward the log.

 Also, if you recreate the failure can you take the EIP from the fault
 (eg, d331) and search for the corresponding function in the output of:
   objdump -m i386 -M i8086 -M suffix -ldr out/rom16.o | less
 (That is, search for d331:.)  If that's too much of a pain, just
 send me a direct email with the seabios out/rom16.o file and the new
 EIP of the fault.  (I need the out/rom16.o that was used to build the
 version of SeaBIOS that faulted.)

 -Kevin


 diff --git a/src/post.c b/src/post.c
 index 9ea5620..bbd19c0 100644
 --- a/src/post.c
 +++ b/src/post.c
 @@ -185,21 +185,24 @@ prepareboot(void)
       pmm_prepboot();
       malloc_prepboot();
       memmap_prepboot();
 +    dprintf(1, "a\n");

       HaveRunPost = 2;

       // Setup bios checksum.
       BiosChecksum -= checksum((u8*)BUILD_BIOS_ADDR, BUILD_BIOS_SIZE);
 +    dprintf(1, "b\n");
   }

   // Begin the boot process by invoking an int0x19 in 16bit mode.
   void VISIBLE32FLAT
   startBoot(void)
   {
 +    dprintf(1, "e\n");
       // Clear low-memory allocations (required by PMM spec).
       memset((void*)BUILD_STACK_ADDR, 0, BUILD_EBDA_MINIMUM - BUILD_STACK_ADDR);

 -    dprintf(3, "Jump to int19\n");
 +    dprintf(1, "Jump to int19 (vector=%x)\n", GET_IVT(0x19).segoff);
       struct bregs br;
       memset(&br, 0, sizeof(br));
       br.flags = F_IF;
 @@ -239,9 +242,11 @@ maininit(void)
       // Prepare for boot.
       prepareboot();

 +    dprintf(1, "c\n");
       // Write protect bios memory.
       make_bios_readonly();

 +    dprintf(1, "d\n");
       // Invoke int 19 to start boot process.
       startBoot();
   }

Thanks. Strangely, the reboot now always fails and always reaches the
seabios greeting. Maybe the prints straightened out a race (i.e. it is
not really an int19 problem).

object file part:

0000d331 <irq_trampoline_0x19>:
irq_trampoline_0x19():
/root/seabios-1.8.1/src/romlayout.S:195
d331:   cd 19   int    $0x19
d333:   cb  lretw


reboot.failed
Description: Binary data

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-25 Thread Kevin O'Connor

On Thu, Mar 26, 2015 at 01:31:11AM +0300, Andrey Korolyov wrote:
 On Wed, Mar 25, 2015 at 11:54 PM, Kevin O'Connor ke...@koconnor.net wrote:
 
  Can you add something like:
 
-chardev file,path=seabioslog.`date +%s`,id=seabios -device 
  isa-debugcon,iobase=0x402,chardev=seabios
 
  to the qemu command line and forward the resulting log from both a
  succesful boot and a failed one?
 
  -Kevin
 
 Of course, logs are attached.

Thanks.  From a diff of the two logs:

 4: 000000003ffe0000 - 0000000040000000 = 2 RESERVED
 5: 00000000feffc000 - 00000000ff000000 = 2 RESERVED
 6: 00000000fffc0000 - 0000000100000000 = 2 RESERVED
  -enter handle_19:
  -  NULL
  -Booting from Hard Disk...
  -Booting from 0000:7c00

So, it got most of the way through the reboot - there are only a few
function calls between the e820 map being dumped and the handle_19
call.  The fault also seems to show it stopped in the BIOS in 16bit
mode:

 EIP=d331 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
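
The log comparison above amounts to asking which lines the successful boot printed that the failed boot never reached. A small sketch of that check with Python's difflib; the function name is invented, and the two sample logs are abbreviated stand-ins rather than the attached files:

```python
import difflib


def lines_only_in_ok(ok_log, failed_log):
    """Return, in order, the lines that appear in the successful log
    but are missing from the failed one."""
    matcher = difflib.SequenceMatcher(a=failed_log, b=ok_log, autojunk=False)
    missing = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            missing.extend(ok_log[j1:j2])
    return missing


# Abbreviated stand-ins for the two debug logs.
ok_log = [
    "(earlier POST output)",
    "(last e820 entry) RESERVED",
    "enter handle_19:",
    "  NULL",
    "Booting from Hard Disk...",
]
failed_log = ok_log[:2]  # the failed boot stops right after the e820 dump

for line in lines_only_in_ok(ok_log, failed_log):
    print(line)
```

The first line of the output marks the earliest step the failed boot did not reach, which is what narrows the search to the short stretch before handle_19.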

Can you add the patch below, force the fault, and forward the log?

Also, if you recreate the failure can you take the EIP from the fault
(e.g., d331) and search for the corresponding function in the output of:
  objdump -m i386 -M i8086 -M suffix -ldr out/rom16.o | less
(That is, search for "d331:".)  If that's too much of a pain, just
send me a direct email with the seabios out/rom16.o file and the new
EIP of the fault.  (I need the out/rom16.o that was used to build the
version of SeaBIOS that faulted.)
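
For anyone who would rather script the lookup than page through less, the search can be sketched in Python. This assumes the usual objdump layout, where a symbol header looks like "0000d331 <name>:" and each instruction line starts with "address:"; the function name and the embedded sample listing are illustrative only, and in practice the listing would be the saved output of the objdump command above.

```python
import re

SYMBOL = re.compile(r"^([0-9a-f]+) <([^>]+)>:$")  # e.g. "0000d331 <name>:"
INSN = re.compile(r"^\s*([0-9a-f]+):\s")          # e.g. "    d331:  cd 19 ..."


def function_at(listing, eip):
    """Return the symbol whose code contains an instruction starting at eip,
    or None if no instruction in the listing starts there."""
    current = None
    for line in listing.splitlines():
        header = SYMBOL.match(line.strip())
        if header:
            current = header.group(2)
            continue
        insn = INSN.match(line)
        if insn and int(insn.group(1), 16) == eip:
            return current
    return None


# Illustrative excerpt in objdump's format.
SAMPLE = """\
0000d331 <irq_trampoline_0x19>:
irq_trampoline_0x19():
/root/seabios-1.8.1/src/romlayout.S:195
    d331:\tcd 19                \tint    $0x19
    d333:\tcb                   \tlretw
0000d334 <irq_trampoline_0x1c>:
    d334:\tcd 1c                \tint    $0x1c
    d336:\tcb                   \tlretw
"""

print(function_at(SAMPLE, 0xD331))
```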

-Kevin


diff --git a/src/post.c b/src/post.c
index 9ea5620..bbd19c0 100644
--- a/src/post.c
+++ b/src/post.c
@@ -185,21 +185,24 @@ prepareboot(void)
     pmm_prepboot();
     malloc_prepboot();
     memmap_prepboot();
+    dprintf(1, "a\n");
 
     HaveRunPost = 2;
 
     // Setup bios checksum.
     BiosChecksum -= checksum((u8*)BUILD_BIOS_ADDR, BUILD_BIOS_SIZE);
+    dprintf(1, "b\n");
 }
 
 // Begin the boot process by invoking an int0x19 in 16bit mode.
 void VISIBLE32FLAT
 startBoot(void)
 {
+    dprintf(1, "e\n");
     // Clear low-memory allocations (required by PMM spec).
     memset((void*)BUILD_STACK_ADDR, 0, BUILD_EBDA_MINIMUM - BUILD_STACK_ADDR);
 
-    dprintf(3, "Jump to int19\n");
+    dprintf(1, "Jump to int19 (vector=%x)\n", GET_IVT(0x19).segoff);
     struct bregs br;
     memset(&br, 0, sizeof(br));
     br.flags = F_IF;
@@ -239,9 +242,11 @@ maininit(void)
     // Prepare for boot.
     prepareboot();
 
+    dprintf(1, "c\n");
     // Write protect bios memory.
     make_bios_readonly();
 
+    dprintf(1, "d\n");
     // Invoke int 19 to start boot process.
     startBoot();
 }

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-25 Thread Kevin O'Connor

On Thu, Mar 26, 2015 at 02:35:58AM +0300, Andrey Korolyov wrote:
 Thanks, strangely the reboot is always failing now and always reaching
 seabios greeting. May be prints straightened up a race (e.g. it is not
 int19 problem really).
 
 object file part:
 
 d331 irq_trampoline_0x19:
 irq_trampoline_0x19():
 /root/seabios-1.8.1/src/romlayout.S:195
 d331:   cd 19   int$0x19
 d333:   cb  lretw

[...]
 Jump to int19 (vector=f000e6f2)

Thanks.  So, it dies on the int $0x19 instruction itself.  The
vector looks correct and I don't see anything in the cpu register
state that looks wrong.  Maybe one of the kvm developers will have an
idea what could cause a fault there.
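
The Code= bytes in the crash report can be cross-checked against the objdump listing with a short script. The sketch below assumes that the dump starts 20 bytes before EIP, so the byte at index 20 is the one EIP points at; that offset is an assumption, though it fits both reports in this thread. The constant and function names are made up for illustration.

```python
# Assumption: the Code= line shows 20 bytes before EIP, then the bytes at EIP.
BYTES_BEFORE_EIP = 20

# Code= bytes from the report with EIP=d331 (SeaBIOS 1.8.1 build).
DUMP_1_8_1 = (
    "66 c3 cd 02 cb cd 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd "
    "19 cb cd 1c cb cd 4a cb fa fc 66 ba 47 d3 0f 00 e9 ad fe f3 90 f0 0f "
    "ba 2d d4 fe fb 3f"
)
# Code= bytes from another report in this thread (older SeaBIOS build).
DUMP_OLDER = (
    "48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd "
    "10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66 "
    "b8 00 e0 00 00 8e"
)


def parse(dump):
    """Turn the space-separated hex dump into a list of byte values."""
    return [int(byte, 16) for byte in dump.split()]


def insn_at_eip(code):
    """Name the instruction at EIP if it is a software interrupt (opcode cd)."""
    opcode = code[BYTES_BEFORE_EIP]
    if opcode == 0xCD:
        return f"int $0x{code[BYTES_BEFORE_EIP + 1]:02x}"
    return f"opcode 0x{opcode:02x}"


def trampolines(code, eip):
    """Map address -> vector for every 'cd NN cb' (int NN; lretw) triple."""
    base = eip - BYTES_BEFORE_EIP
    found = {}
    i = 0
    while i + 2 < len(code):
        if code[i] == 0xCD and code[i + 2] == 0xCB:
            found[base + i] = code[i + 1]
            i += 3
        else:
            i += 1
    return found


print(insn_at_eip(parse(DUMP_1_8_1)))
print(insn_at_eip(parse(DUMP_OLDER)))
for addr, vector in trampolines(parse(DUMP_1_8_1), 0xD331).items():
    print(f"{addr:x}: int $0x{vector:02x}")
```

With EIP=d331 the scan places the int $0x19 trampoline at d331 and the int $0x1c one at d334, which lines up with the romlayout.S listing quoted above.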

-Kevin

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-25 Thread Bandan Das

Hi Andrey,

Andrey Korolyov and...@xdel.ru writes:

 On Mon, Mar 16, 2015 at 10:17 PM, Andrey Korolyov and...@xdel.ru wrote:
 For now, it looks like bug have a mixed Murphy-Heisenberg nature, as
 it appearance is very rare (compared to the number of actual launches)
 and most probably bounded to the physical characteristics of my
 production nodes. As soon as I reach any reproducible path for a
 regular workstation environment, I`ll let everyone know. Also I am
 starting to think that issue can belong to the particular motherboard
 firmware revision, despite fact that the CPU microcode is the same
 everywhere.

I will take the risk and say this - could it be a processor bug? :)


 Hello everyone, I`ve managed to reproduce this issue
 *deterministically* with latest seabios with smp fix and 3.18.3. The
 error occuring just *once* per vm until hypervisor reboots, at least
 in my setup, this is definitely crazy...

 - launch two VMs (Centos 7 in my case),
 - wait a little while they are booting,
 - attach serial console (I am using virsh list for this exact purpose),
 - issue acpi reboot or reset, does not matter,
 - VM always hangs at boot, most times with sgabios initialization
 string printed out [1], but sometimes it hangs a bit later [2],
 - no matter how many times I try to relaunch the QEMU afterwards, the
 issue does not appear on VM which experienced problem once;
 - trace and sample args can be seen in [3] and [4] respectively.

My system is a Dell R720 dual socket which has 2620v2s. I tried your
setup but couldn't reproduce (my qemu cmdline isn't exactly the same
as yours), although, if you could simplify your command line a bit,
I can try again.

Bandan

 1)
 Google, Inc.
 Serial Graphics Adapter 06/11/14
 SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $
 (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
 Term: 211x62
 4 0

 2)
 Google, Inc.
 Serial Graphics Adapter 06/11/14
 SGABIOS $Id: sgabios.S 8 2010-04-22 00:03:40Z nlaredo $
 (pbuilder@zorak) Wed Jun 11 05:57:34 UTC 2014
 Term: 211x62
 4 0
 [...empty screen...]
 SeaBIOS (version 1.8.1-20150325_230423-testnode)
 Machine UUID 3c78721f-7317-4f85-bcbe-f5ad46d293a1


 iPXE (http://ipxe.org) 00:02.0 C100 PCI2.10 PnP PMM+3FF95BA0+3FEF5BA0 C10

 3)

 KVM internal error. Suberror: 2
 extra data[0]: 80ef
 extra data[1]: 8b0d
 EAX= EBX= ECX= EDX=
 ESI= EDI= EBP= ESP=6d2c
 EIP=d331 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT= 000f6cb0 0037
 IDT=  03ff
 CR0=0010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=66 c3 cd 02 cb cd 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd
 19 cb cd 1c cb cd 4a cb fa fc 66 ba 47 d3 0f 00 e9 ad fe f3 90 f0 0f
 ba 2d d4 fe fb 3f

 4)
 /usr/bin/qemu-system-x86_64 -name centos71 -S -machine
 pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -bios
 /usr/share/seabios/bios.bin -m 1024 -realtime mlock=off -smp
 12,sockets=1,cores=12,threads=12 -uuid
 3c78721f-7317-4f85-bcbe-f5ad46d293a1 -nographic -no-user-config
 -nodefaults -device sga -chardev
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/centos71.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc
 base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard
 -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global
 PIIX4_PM.disable_s4=1 -boot strict=on -device
 nec-usb-xhci,id=usb,bus=pci.0,addr=0x3 -device
 virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive
 file=rbd:dev-rack2/centos7-1.raw:id=qemukvm:key=XX:auth_supported=cephx\;none:mon_host=10.6.0.1\:6789\;10.6.0.3\:6789\;10.6.0.4\:6789,if=none,id=drive-virtio-disk0,format=raw,cache=writeback,aio=native
 -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -chardev pty,id=charserial0 -device
 isa-serial,chardev=charserial0,id=serial0 -chardev
 socket,id=charchannel0,path=/var/lib/libvirt/qemu/centos71.sock,server,nowait
 -device 
 virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.1
 -msg timestamp=on

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-16 Thread Andrey Korolyov

For now, it looks like the bug has a mixed Murphy-Heisenberg nature, as
its appearance is very rare (compared to the number of actual launches)
and is most probably bound to the physical characteristics of my
production nodes. As soon as I find a reproducible path for a
regular workstation environment, I'll let everyone know. I am also
starting to think that the issue may be tied to a particular motherboard
firmware revision, despite the fact that the CPU microcode is the same
everywhere.

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-16 Thread Dr. David Alan Gilbert

* Andrey Korolyov (and...@xdel.ru) wrote:
 For now, it looks like the bug has a mixed Murphy-Heisenberg nature, as
 its appearance is very rare (compared to the number of actual launches)
 and is most probably bound to the physical characteristics of my
 production nodes. As soon as I find a reproducible path for a
 regular workstation environment, I'll let everyone know. I am also
 starting to think that the issue may be tied to a particular motherboard
 firmware revision, despite the fact that the CPU microcode is the same
 everywhere.

OK - so you're still seeing it with the new ROM that went in today?
( remotes/kraxel/tags/pull-seabios-1.8.1-20150316-1 ) and it doesn't
trigger with my one line script?

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-12 Thread Dr. David Alan Gilbert

* Kevin O'Connor (ke...@koconnor.net) wrote:
 On Wed, Mar 11, 2015 at 01:45:57PM +, Dr. David Alan Gilbert wrote:
  * Bandan Das (b...@redhat.com) wrote:
   Dr. David Alan Gilbert dgilb...@redhat.com writes:
while true; do (sleep 5; echo -e 
'\001cq\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -machine 
pc-i440fx-2.0,accel=kvm -m 1024 -smp 128 -nographic -device sga 2>&1 | 
tee /tmp/qemu.op; grep "internal error" /tmp/qemu.op -q && break; done
   
 [...]
[root@virtlab413 qemu-world3]# git bisect bad
21f5826a04d38e19488f917e1eef22751490c769 is the first bad commit
   
   I can reproduce this on E5-2620 v2 with  David's while true test.
   (The emulation failure I mean, not the suberror 2 that Andrey is seeing)
   The commit that seems to have introduced this is -
   
   commit 0673b7870063a3affbad9046fb6d385a4e734c19
   Author: Kevin O'Connor ke...@koconnor.net
   Date:   Sat May 24 10:49:50 2014 -0400
   
   smp: Replace QEMU SMP init assembler code with C; run only in 32bit 
   mode.
 [...]
  Turning on debug logging
  ( -chardev file,id=log,path=/tmp/debugcon.$$ -device 
  isa-debugcon,chardev=log,iobase=0x402 )
  
  SeaBIOS (version 
  rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org)
 [...]
  Found 1 cpu(s) max supported 128 cpu(s)
 
 Something is very odd here.  When I run the above command (on an older
 AMD machine) I get:
 
 Found 128 cpu(s) max supported 128 cpu(s)
 
 That first value (1 vs 128) comes from QEMU (via cmos index 0x5f).
 That is, during smp init, SeaBIOS expects QEMU to tell it how many
 cpus are active, and SeaBIOS waits until that many CPUs check in from
 its SIPI request before proceeding.
 
 I wonder if QEMU reported only 1 active cpu via that cmos register,
 but more were actually active.  If that was the case, it could
 certainly explain the failure - as multiple cpus could be running
 without the SIPI trampoline in place.
 
 What does the log look like on a non-failure case?
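
The comparison Kevin makes between the two numbers on the `Found ... cpu(s)` line can be automated against a debugcon log. A minimal Python sketch, assuming the log contains the summary line exactly as quoted in this thread and that no hotpluggable CPUs are configured (so both numbers are expected to agree); the function name is made up for illustration:

```python
import re

# Matches the SeaBIOS summary line quoted in this thread, e.g.
# "Found 1 cpu(s) max supported 128 cpu(s)".
FOUND_RE = re.compile(r"Found (\d+) cpu\(s\) max supported (\d+) cpu\(s\)")


def smp_count_mismatch(log_text):
    """Return (found, max_supported) if the two numbers differ, else None.

    Also returns None when the summary line is absent from the log.
    """
    match = FOUND_RE.search(log_text)
    if match is None:
        return None
    found = int(match.group(1))
    max_supported = int(match.group(2))
    if found != max_supported:
        return (found, max_supported)
    return None
```

Feeding it the contents of the `/tmp/debugcon.$$` file from a failed run should flag the 1 vs 128 case reported above.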

I had to drop down from 128 to get a working run with debug; here
are two runs with -smp 20; the first one worked, the second one
failed.

Dave

=== Working ===

SeaBIOS (version rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org)
No Xen hypervisor found.
Running on QEMU (i440fx)
Running on KVM
RamSize: 0x4000 [cmos]
Relocating init from 0x000dea20 to 0x3ffaed30 (size 70160)
Found QEMU fw_cfg
RamBlock: addr 0x len 0x4000 [e820]
Moving pm_base to 0x600
CPU Mhz=2113
=== PCI bus & bridge init ===
PCI: pci_bios_init_bus_rec bus = 0x0
=== PCI device probing ===
Found 6 PCI devices (max PCI bus is 00)
=== PCI new allocation pass #1 ===
PCI: check devices
=== PCI new allocation pass #2 ===
PCI: IO: c000 - c04f
PCI: 32: 8000 - fec0
PCI: map device bdf=00:03.0  bar 1, addr c000, size 0040 [io]
PCI: map device bdf=00:01.1  bar 4, addr c040, size 0010 [io]
PCI: map device bdf=00:03.0  bar 6, addr feb8, size 0004 [mem]
PCI: map device bdf=00:03.0  bar 0, addr febc, size 0002 [mem]
PCI: map device bdf=00:02.0  bar 6, addr febe, size 0001 [mem]
PCI: map device bdf=00:02.0  bar 2, addr febf, size 1000 [mem]
PCI: map device bdf=00:02.0  bar 0, addr fd00, size 0100 [prefmem]
PCI: init bdf=00:00.0 id=8086:1237
PCI: init bdf=00:01.0 id=8086:7000
PIIX3/PIIX4 init: elcr=00 0c
PCI: init bdf=00:01.1 id=8086:7010
PCI: init bdf=00:01.3 id=8086:7113
Using pmtimer, ioport 0x608
PCI: init bdf=00:02.0 id=1234:
PCI: init bdf=00:03.0 id=8086:100e
PCI: Using 00:02.0 for primary VGA
handle_smp: apic_id=12
handle_smp: apic_id=8
handle_smp: apic_id=14
handle_smp: apic_id=2
handle_smp: apic_id=13
handle_smp: apic_id=18
handle_smp: apic_id=1
handle_smp: apic_id=7
handle_smp: apic_id=3
handle_smp: apic_id=4
handle_smp: apic_id=6
handle_smp: apic_id=11
handle_smp: apic_id=10
handle_smp: apic_id=15
handle_smp: apic_id=9
handle_smp: apic_id=16
handle_smp: apic_id=17
handle_smp: apic_id=19
handle_smp: apic_id=5
Found 20 cpu(s) max supported 20 cpu(s)
Copying PIR from 0x3ffbfc98 to 0x000f65a0
Copying MPTABLE from 0x6db0/3ffa5c60 to 0x000f6340
Copying SMBIOS entry point from 0x6db0 to 0x000f6320
Scan for VGA option rom
Running option rom at c000:0003
Start SeaVGABIOS (version 
rel-1.8.0-0-g4c59f5d-20150219_092912-nilsson.home.kraxel.org)
enter vga_post:
   a=0010  b=  c=  d= ds= es=f000 ss=
  si= di=6970 bp= sp=6d0a cs=f000 ip=d239  f=
VBE DISPI: bdf 00:02.0, bar 0
VBE DISPI: lfb_addr=fd00, size 16 MB
Attempting to allocate VGA stack via pmm call to f000:d2f4
pmm call arg1=0
VGA stack allocated at ef1b0
Running option rom at c980:0003
Turning on vga text mode console
set VGA mode 3
SeaBIOS (version rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org)
All threads complete.
Found 1 lpt ports
Found 1 serial ports
Searching bootorder for: /pci@i0cf8/isa@1/fdc@03f0/floppy@0
ATA controller 1

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-12 Thread Paolo Bonzini



On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
 I'm seeing something similar; it's very intermittent and generally
 happening right at boot of the guest;   I'm running this on qemu
 head+my postcopy world (but it's happening right at boot before postcopy
 gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
 but hey maybe I'm seeing a different bug.

Same here on a 3.16 kernel + Xeon E5 v3.

Paolo

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-12 Thread Dr. David Alan Gilbert

* Andrey Korolyov (and...@xdel.ru) wrote:
 On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert
 dgilb...@redhat.com wrote:
  * Andrey Korolyov (and...@xdel.ru) wrote:
  On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov and...@xdel.ru wrote:
   On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das b...@redhat.com wrote:
   Andrey Korolyov and...@xdel.ru writes:
  
   On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov and...@xdel.ru wrote:
   Hello,
  
   recently I've got a couple of shiny new Intel 2620v2s as a future
   replacement for the E5-2620v1, but I have experienced relatively many
   emulation errors; all traces look similar to the one below. I am
   running qemu-2.1 on x86 on top of the 3.10 branch for testing purposes but
   can switch to other versions if necessary. Most of the crashes
   happened during a reboot cycle or at the end of an ACPI-based shutdown,
   if this helps. I have no clue what could introduce such
   a mess within the same processor family using identical software, as the
   2620v1 has never had a similar problem. Please let me know if there are
   any additional measures that would make the story clearer.
  
   Thanks!
  
   KVM internal error. Suberror: 2
   extra data[0]: 80d1
   extra data[1]: 8b0d
   EAX=0003 EBX= ECX= EDX=
   ESI= EDI= EBP= ESP=6cd4
   EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
   ES =   9300
   CS =f000 000f  9b00
   SS =   9300
   DS =   9300
   FS =   9300
   GS =   9300
   LDT=   8200
   TR =   8b00
   GDT= 000f6e98 0037
   IDT=  03ff
   CR0=0010 CR2= CR3= CR4=
   DR0= DR1= DR2=
   DR3=
   DR6=0ff0 DR7=0400
   EFER=
   Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
   10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
   b8 00 e0 00 00 8e
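
Since the thread distinguishes between a plain emulation failure and "Suberror: 2", it may help to decode the number when triaging captured QEMU output. A minimal Python sketch, assuming the output contains the `KVM internal error. Suberror: N` line as shown above; the code-to-name table follows the definitions in the Linux `linux/kvm.h` header, and the helper name is made up for illustration:

```python
import re

# Suberror codes as defined in the Linux uapi header linux/kvm.h.
KVM_INTERNAL_ERRORS = {
    1: "KVM_INTERNAL_ERROR_EMULATION",    # instruction emulation failed
    2: "KVM_INTERNAL_ERROR_SIMUL_EX",     # unexpected simultaneous exceptions
    3: "KVM_INTERNAL_ERROR_DELIVERY_EV",  # unexpected exit during event delivery
}

SUBERROR_RE = re.compile(r"KVM internal error\. Suberror: (\d+)")


def describe_suberror(qemu_output):
    """Return (code, name) for the first suberror line, or None if absent."""
    match = SUBERROR_RE.search(qemu_output)
    if match is None:
        return None
    code = int(match.group(1))
    return code, KVM_INTERNAL_ERRORS.get(code, "unknown")
```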
  
  
   It turns out that those errors are introduced by APICv, which gets
   enabled due to the different feature set. If anyone is interested in
   reproducing/fixing this specifically on 3.10, it takes about one hundred
   migrations/power state changes for the issue to appear; the guest OS can be
   Linux or Windows.
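
Whether APICv is actually enabled on a given host can be checked through the kvm_intel module parameters in sysfs. A minimal Python sketch, assuming a Linux host with kvm_intel loaded; `unrestricted_guest` is included because it comes up in the later follow-up in this thread, and the function name and `base` argument are made up for illustration:

```python
from pathlib import Path

# Default sysfs location of kvm_intel module parameters on Linux.
KVM_INTEL_PARAMS = Path("/sys/module/kvm_intel/parameters")


def read_kvm_intel_params(names=("enable_apicv", "unrestricted_guest"),
                          base=KVM_INTEL_PARAMS):
    """Return {name: value or None} for each requested parameter.

    A value of None means the file could not be read, for example because
    kvm_intel is not loaded or the kernel predates the parameter.
    """
    result = {}
    for name in names:
        try:
            result[name] = (Path(base) / name).read_text().strip()
        except OSError:
            result[name] = None
    return result
```

On a host like the one described here, the values would typically read `Y` or `N`.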
  
   Are you able to reproduce this on a more recent upstream kernel as well 
   ?
  
   Bandan
  
   I'll go through a test cycle with 3.18 and the 2603v2 around tomorrow and
   follow up with any reproducible results.
 
  Heh.. the issue is not triggered on the 2603v2 at all, at least I am not able
  to hit it. The only difference from the 2620v2, apart from the lower frequency, is
  the Intel Dynamic Acceleration feature. I'd appreciate any testing on
  higher CPU models with the same or a richer feature set. The testing itself
  can be done on either generic 3.10 or RH7 kernels, as both of them
  exhibit this issue. I conducted all tests with C-states disabled,
  so I advise doing the same as a first reproduction step.
 
  Thanks!
 
  model name  : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
  stepping: 4
  microcode   : 0x416
  cpu MHz : 2100.039
  cache size  : 15360 KB
  siblings: 12
  apicid  : 43
  initial apicid  : 43
  fpu : yes
  fpu_exception   : yes
  cpuid level : 13
  wp  : yes
  flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
  mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
  syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
  rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
  dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
  sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
  rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
  flexpriority ept vpid fsgsbase smep erms
 
  I'm seeing something similar; it's very intermittent and generally
  happening right at boot of the guest;   I'm running this on qemu
  head+my postcopy world (but it's happening right at boot before postcopy
  gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
  but hey maybe I'm seeing a different bug.
 
  Dave
 
 Yep, looks like we are hitting the same bug - two thirds of my failure
 events occurred during the boot/reboot cycle and approx. one third
 happened in the middle of runtime. What CPU are you using, v0 or v2
 (in other words, is APICv enabled)?

processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model   : 45
model name  : Intel(R) Xeon(R) CPU E5-2407 0 @ 2.20GHz
stepping: 7
microcode   : 0x70d
cpu MHz : 2200.000
cache size  : 10240 KB
physical id : 1
siblings: 4
core id : 3
cpu cores   : 4
apicid

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-12 Thread Dr. David Alan Gilbert

* Kevin O'Connor (ke...@koconnor.net) wrote:
 On Wed, Mar 11, 2015 at 04:52:03PM +, Dr. David Alan Gilbert wrote:
  * Kevin O'Connor (ke...@koconnor.net) wrote:
   So, I couldn't get this to fail on my older AMD machine at all with
   the default SeaBIOS code.  But, when I change the code with the patch
   below, it failed right away.
 [...]
   And the failed debug output looks like:
   
   SeaBIOS (version 
   rel-1.8.0-7-gd23eba6-dirty-20150311_121819-morn.localdomain)
   [...]
   cmos_smp_count0=20
   [...]
   cmos_smp_count=1
   cmos_smp_count2=1/20
   Found 1 cpu(s) max supported 20 cpu(s)
   
   I'm going to check the assembly for a compiler error, but is it
   possible QEMU is returning incorrect data in cmos index 0x5f?
 
 I checked the SeaBIOS assembler and it looks sane.  So, I think the
 question is, why is QEMU sometimes returning a 0 instead of 127 from
 cmos 0x5f.

My reading of the logs I've just created is that qemu doesn't think
it's ever being asked to read 5f in the failed case:

good:

pc_cmos_init 5f setting smp_cpus=20
cmos: read index=0x0f val=0x00
cmos: read index=0x34 val=0x00
cmos: read index=0x35 val=0x3f
cmos: read index=0x38 val=0x30
cmos: read index=0x3d val=0x12
cmos: read index=0x38 val=0x30
cmos: read index=0x0b val=0x02
cmos: read index=0x0d val=0x80
cmos: read index=0x5f val=0x13  Yeh!
cmos: read index=0x0f val=0x00
cmos: read index=0x0f val=0x00
cmos: read index=0x0f val=0x00

bad:
pc_cmos_init 5f setting smp_cpus=20
cmos: read index=0x0f val=0x00
cmos: read index=0x34 val=0x00
cmos: read index=0x35 val=0x3f
cmos: read index=0x38 val=0x30
cmos: read index=0x3d val=0x12
cmos: read index=0x38 val=0x30
cmos: read index=0x0b val=0x02
cmos: read index=0x0d val=0x80  Oh!
cmos: read index=0x0f val=0x00
cmos: read index=0x0f val=0x00
cmos: read index=0x0f val=0x00
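
The good/bad comparison above boils down to whether index 0x5f is ever read, and what count it decodes to. A minimal Python sketch, assuming trace lines in the `cmos: read index=0x.. val=0x..` form of the ad-hoc debug output above, and assuming the register holds the CPU count minus one (consistent with `smp_cpus=20` being read back as `0x13`); the function name is made up for illustration:

```python
import re

CMOS_READ_RE = re.compile(
    r"cmos: read index=0x([0-9a-fA-F]+) val=0x([0-9a-fA-F]+)")
SMP_COUNT_INDEX = 0x5F


def smp_count_from_trace(trace):
    """Return the CPU count decoded from the first read of index 0x5f.

    Returns None if the trace never shows a read of that index, which is
    the pattern seen in the failing run.
    """
    for match in CMOS_READ_RE.finditer(trace):
        index = int(match.group(1), 16)
        value = int(match.group(2), 16)
        if index == SMP_COUNT_INDEX:
            return value + 1  # register stores (cpu count - 1)
    return None
```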

Dave

 
   David, any chance you can recompile seabios and double check your
   output?
  
  Done;
  
  === Working ===
  SeaBIOS (version rel-1.8.0-0-g4c59f5d-dirty-20150311_164408-dgilbert-t530)
 [...]
  cmos_smp_count0=20
  cmos_smp_count=20
  cmos_smp_count2=20/20
  Found 20 cpu(s) max supported 20 cpu(s)
 [...]
  === Broken ===
  SeaBIOS (version rel-1.8.0-0-g4c59f5d-dirty-20150311_164408-dgilbert-t530)
 [...]
  cmos_smp_count0=20
  cmos_smp_count=1
  cmos_smp_count2=1/20
  Found 1 cpu(s) max supported 20 cpu(s)
 
 That's the same pattern I see.
 
 -Kevin
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-12 Thread Andrey Korolyov

On Wed, Mar 11, 2015 at 10:59 PM, Dr. David Alan Gilbert
dgilb...@redhat.com wrote:
 * Andrey Korolyov (and...@xdel.ru) wrote:
 On Wed, Mar 11, 2015 at 10:33 PM, Dr. David Alan Gilbert
 dgilb...@redhat.com wrote:
  * Kevin O'Connor (ke...@koconnor.net) wrote:
  On Wed, Mar 11, 2015 at 02:45:31PM -0400, Kevin O'Connor wrote:
   On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:
For what it's worth, I can't seem to trigger the problem if I move the
cmos read above the SIPI/LAPIC code (see patch below).
  
   Ugh!
  
   That's a seabios bug.  Main processor modifies the rtc index
   (rtc_read()) while APs try to clear the NMI bit by modifying the rtc
   index (romlayout.S:transition32).
  
   I'll put together a fix.
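
The race described here can be illustrated with a toy model of the CMOS index/data port pair. This is a sketch under stated assumptions, not SeaBIOS code: the class and function names are made up, index 0x0f is taken from the traces in this thread as the register the APs touch, and the interleaving is forced deterministically instead of happening by chance:

```python
NMI_DISABLE_BIT = 0x80   # bit 7 of the index port masks NMI
CMOS_RESET_CODE = 0x0F   # index seen repeatedly in the traces in this thread
CMOS_SMP_COUNT = 0x5F    # index holding (cpu count - 1)


class Cmos:
    """Toy model of the RTC/CMOS index (0x70) and data (0x71) ports."""

    def __init__(self, cpus):
        self.mem = [0] * 128
        self.mem[CMOS_SMP_COUNT] = cpus - 1
        self.index = 0

    def write_index(self, value):
        self.index = value & 0x7F  # low 7 bits select the register

    def read_data(self):
        return self.mem[self.index]


def bsp_read_count(cmos, ap_interferes):
    """Main CPU reads the count; optionally an AP moves the index in between."""
    cmos.write_index(CMOS_SMP_COUNT | NMI_DISABLE_BIT)
    if ap_interferes:
        # An AP switching to 32bit mode selects another register to turn
        # NMI off, silently changing the shared index under the main CPU.
        cmos.write_index(CMOS_RESET_CODE | NMI_DISABLE_BIT)
        cmos.read_data()
    return cmos.read_data() + 1
```

With 20 CPUs, `bsp_read_count(Cmos(20), False)` returns 20, while `bsp_read_count(Cmos(20), True)` returns 1, matching the "Found 1 cpu(s) max supported 20 cpu(s)" symptom.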
 
  The seabios patch below resolves the issue for me.
 
  Thanks! Looks good here.
 
  Andrey, Paolo, Bandan: Does it fix it for you as well?
 

 Thanks Kevin, Dave,

 I'm afraid that I'm hitting something different, not only because of the
 different suberror code but also because of my version of seabios -
 I am using 1.7.5, and the corresponding code looks different there:
 it has none of the smp-related code the proposed patch touches. The
 machines in question went to production successfully and I'm afraid I
 cannot afford to play with them anymore; even if I re-trigger the issue
 with the patched 1.8.1-rc, there is no way to switch to a different kernel
 and retest due to the specific conditions of this production suite. I've
 ordered a pair of new shoes^W 2620v2-s which should arrive next

 Well I was testing on a pair of 'E5-2620 v2'; but as you saw my test case
 was pretty simple.  If you can suggest any flags I should add etc to the
 test I'd be happy to give it a go.

 Dave

Here is my launch string:

qemu-system-x86_64 -enable-kvm -name vmtest -S -machine
pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -m 512
-realtime mlock=off -smp 12,sockets=1,cores=12,threads=12 -numa
node,nodeid=0,cpus=0-11,mem=512 -nographic -no-user-config -nodefaults
-device sga -rtc base=utc,driftfix=slew -global
kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global
PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on
-device nec-usb-xhci,id=usb,bus=pci.0,addr=0x4 -device
virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -m
512,slots=31,maxmem=16384M -object
memory-backend-ram,id=mem0,size=512M -device
pc-dimm,id=dimm0,node=0,memdev=mem0

I omitted disk backend in this example, but there is a chance that my
problem is not reproducible without some calls made explicitly by a
bootloader (not sure what to say for mid-runtime failures).


 Monday, so I'll be able to test a) against 1.8.0-release, b) against
 patched bios code, c) reproduce the initial error on master/3.19 (maybe
 I'll get them before the weekend by going to the computer shop in
 person). Until then, I have a very strong feeling that my issue is not
 there :) I have also become very curious about how the lack of the IDA
 feature could completely eliminate the issue for me; the only possible
 explanation is a clock-related race, which is a kinda stupid suggestion
 and unlikely to exist in nature.

 Thanks again to everyone for the thorough testing and ideas!

 
  -Kevin
 
 
  --- a/src/romlayout.S
  +++ b/src/romlayout.S
  @@ -22,7 +22,8 @@
   // %edx = return location (in 32bit mode)
   // Clobbers: ecx, flags, segment registers, cr0, idt/gdt
   DECLFUNC transition32
  -transition32_for_smi:
  +transition32_nmi_off:
  +// transition32 when NMI and A20 are already initialized
   movl %eax, %ecx
   jmp 1f
   transition32:
  @@ -205,7 +206,7 @@ __farcall16:
   entry_smi:
   // Transition to 32bit mode.
   movl $1f + BUILD_BIOS_ADDR, %edx
  -jmp transition32_for_smi
  +jmp transition32_nmi_off
   .code32
   1:  movl $BUILD_SMM_ADDR + 0x8000, %esp
   calll _cfunc32flat_handle_smi - BUILD_BIOS_ADDR
  @@ -216,8 +217,10 @@ entry_smi:
   DECLFUNC entry_smp
   entry_smp:
   // Transition to 32bit mode.
  +cli
  +cld
   movl $2f + BUILD_BIOS_ADDR, %edx
  -jmp transition32
  +jmp transition32_nmi_off
   .code32
   // Acquire lock and take ownership of shared stack
   1:  rep ; nop
  --
  Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
 --
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-12 Thread Dr. David Alan Gilbert

* Andrey Korolyov (and...@xdel.ru) wrote:
 On Wed, Mar 11, 2015 at 10:59 PM, Dr. David Alan Gilbert
 dgilb...@redhat.com wrote:
  * Andrey Korolyov (and...@xdel.ru) wrote:
  On Wed, Mar 11, 2015 at 10:33 PM, Dr. David Alan Gilbert
  dgilb...@redhat.com wrote:
   * Kevin O'Connor (ke...@koconnor.net) wrote:
   On Wed, Mar 11, 2015 at 02:45:31PM -0400, Kevin O'Connor wrote:
On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:
 For what it's worth, I can't seem to trigger the problem if I move 
 the
 cmos read above the SIPI/LAPIC code (see patch below).
   
Ugh!
   
That's a seabios bug.  Main processor modifies the rtc index
(rtc_read()) while APs try to clear the NMI bit by modifying the rtc
index (romlayout.S:transition32).
   
I'll put together a fix.
  
   The seabios patch below resolves the issue for me.
  
   Thanks! Looks good here.
  
   Andrey, Paolo, Bandan: Does it fix it for you as well?
  
 
  Thanks Kevin, Dave,
 
  I'm afraid that I'm hitting something different, not only because of the
  different suberror code but also because of my version of seabios -
  I am using 1.7.5, and the corresponding code looks different there:
  it has none of the smp-related code the proposed patch touches. The
  machines in question went to production successfully and I'm afraid I
  cannot afford to play with them anymore; even if I re-trigger the issue
  with the patched 1.8.1-rc, there is no way to switch to a different kernel
  and retest due to the specific conditions of this production suite. I've
  ordered a pair of new shoes^W 2620v2-s which should arrive next
 
  Well I was testing on a pair of 'E5-2620 v2'; but as you saw my test case
  was pretty simple.  If you can suggest any flags I should add etc to the
  test I'd be happy to give it a go.
 
  Dave
 
 Here is my launch string:
 
 qemu-system-x86_64 -enable-kvm -name vmtest -S -machine
 pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -m 512
 -realtime mlock=off -smp 12,sockets=1,cores=12,threads=12 -numa
 node,nodeid=0,cpus=0-11,mem=512 -nographic -no-user-config -nodefaults
 -device sga -rtc base=utc,driftfix=slew -global
 kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global
 PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on
 -device nec-usb-xhci,id=usb,bus=pci.0,addr=0x4 -device
 virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -m
 512,slots=31,maxmem=16384M -object
 memory-backend-ram,id=mem0,size=512M -device
 pc-dimm,id=dimm0,node=0,memdev=mem0
 
 I omitted disk backend in this example, but there is a chance that my
 problem is not reproducible without some calls made explicitly by a
 bootloader (not sure what to say for mid-runtime failures).

It seems to survive OK:

while true; do (sleep 1; echo -e '\001cc\n'; sleep 5; echo -e 
'q\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -enable-kvm -name vmtest -S 
-machine pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -m 512 
-realtime mlock=off -smp 12,sockets=1,cores=12,threads=12 -numa 
node,nodeid=0,cpus=0-11,mem=512 -nographic -no-user-config  -device sga -rtc 
base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet 
-no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot 
strict=on -device nec-usb-xhci,id=usb,bus=pci.0,addr=0x4 -device 
virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -m 
512,slots=31,maxmem=16384M -object memory-backend-ram,id=mem0,size=512M -device 
pc-dimm,id=dimm0,node=0,memdev=mem0  ~/pi.vfd 2>&1 | tee /tmp/qemu.op; grep 
"internal error" /tmp/qemu.op -q && break; done
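
The same "launch until it breaks" loop can be expressed as a small harness that also reports how many launches it took. A minimal Python sketch, assuming the caller supplies a `run_once` callable that performs one guest launch and returns its captured output; all names here are made up for illustration:

```python
def run_until_marker(run_once, marker="internal error", max_runs=1000):
    """Call run_once() repeatedly until its output contains marker.

    run_once is any callable that performs one guest launch and returns
    the captured output as a string. Returns (run_number, output) for the
    first failing run, or None if max_runs launches all passed.
    """
    for run in range(1, max_runs + 1):
        output = run_once()
        if marker in output:
            return run, output
    return None
```

In practice `run_once` could wrap `subprocess.run(...)` with `stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True` around the QEMU command line above and return the `stdout` attribute of the result.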

Dave

 
 
  Monday, so I'll be able to test a) against 1.8.0-release, b) against
  patched bios code, c) reproduce the initial error on master/3.19 (maybe
  I'll get them before the weekend by going to the computer shop in
  person). Until then, I have a very strong feeling that my issue is not
  there :) I have also become very curious about how the lack of the IDA
  feature could completely eliminate the issue for me; the only possible
  explanation is a clock-related race, which is a kinda stupid suggestion
  and unlikely to exist in nature.

  Thanks again to everyone for the thorough testing and ideas!
 
  
   -Kevin
  
  
   --- a/src/romlayout.S
   +++ b/src/romlayout.S
   @@ -22,7 +22,8 @@
// %edx = return location (in 32bit mode)
// Clobbers: ecx, flags, segment registers, cr0, idt/gdt
DECLFUNC transition32
   -transition32_for_smi:
   +transition32_nmi_off:
   +// transition32 when NMI and A20 are already initialized
movl %eax, %ecx
jmp 1f
transition32:
   @@ -205,7 +206,7 @@ __farcall16:
entry_smi:
// Transition to 32bit mode.
movl $1f + BUILD_BIOS_ADDR, %edx
   -jmp transition32_for_smi
   +jmp transition32_nmi_off
.code32
1:  movl $BUILD_SMM_ADDR + 0x8000, %esp
calll _cfunc32flat_handle_smi - BUILD_BIOS_ADDR
   @@ -216,8 +217,10 @@

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-12 Thread Andrey Korolyov

On Wed, Mar 11, 2015 at 10:33 PM, Dr. David Alan Gilbert
dgilb...@redhat.com wrote:
 * Kevin O'Connor (ke...@koconnor.net) wrote:
 On Wed, Mar 11, 2015 at 02:45:31PM -0400, Kevin O'Connor wrote:
  On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:
   For what it's worth, I can't seem to trigger the problem if I move the
   cmos read above the SIPI/LAPIC code (see patch below).
 
  Ugh!
 
  That's a seabios bug.  Main processor modifies the rtc index
  (rtc_read()) while APs try to clear the NMI bit by modifying the rtc
  index (romlayout.S:transition32).
 
  I'll put together a fix.

 The seabios patch below resolves the issue for me.

 Thanks! Looks good here.

 Andrey, Paolo, Bandan: Does it fix it for you as well?


Thanks Kevin, Dave,

I'm afraid that I'm hitting something different, not only because of the
different suberror code but also because of my version of seabios -
I am using 1.7.5, and the corresponding code looks different there:
it has none of the smp-related code the proposed patch touches. The
machines in question went to production successfully and I'm afraid I
cannot afford to play with them anymore; even if I re-trigger the issue
with the patched 1.8.1-rc, there is no way to switch to a different kernel
and retest due to the specific conditions of this production suite. I've
ordered a pair of new shoes^W 2620v2-s which should arrive next
Monday, so I'll be able to test a) against 1.8.0-release, b) against
patched bios code, c) reproduce the initial error on master/3.19 (maybe
I'll get them before the weekend by going to the computer shop in
person). Until then, I have a very strong feeling that my issue is not
there :) I have also become very curious about how the lack of the IDA
feature could completely eliminate the issue for me; the only possible
explanation is a clock-related race, which is a kinda stupid suggestion
and unlikely to exist in nature.

Thanks again to everyone for the thorough testing and ideas!


 -Kevin


 --- a/src/romlayout.S
 +++ b/src/romlayout.S
 @@ -22,7 +22,8 @@
  // %edx = return location (in 32bit mode)
  // Clobbers: ecx, flags, segment registers, cr0, idt/gdt
  DECLFUNC transition32
 -transition32_for_smi:
 +transition32_nmi_off:
 +// transition32 when NMI and A20 are already initialized
  movl %eax, %ecx
  jmp 1f
  transition32:
 @@ -205,7 +206,7 @@ __farcall16:
  entry_smi:
  // Transition to 32bit mode.
  movl $1f + BUILD_BIOS_ADDR, %edx
 -jmp transition32_for_smi
 +jmp transition32_nmi_off
  .code32
  1:  movl $BUILD_SMM_ADDR + 0x8000, %esp
  calll _cfunc32flat_handle_smi - BUILD_BIOS_ADDR
 @@ -216,8 +217,10 @@ entry_smi:
  DECLFUNC entry_smp
  entry_smp:
  // Transition to 32bit mode.
 +cli
 +cld
  movl $2f + BUILD_BIOS_ADDR, %edx
 -jmp transition32
 +jmp transition32_nmi_off
  .code32
  // Acquire lock and take ownership of shared stack
  1:  rep ; nop
 --
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-12 Thread Andrey Korolyov

On Thu, Mar 12, 2015 at 12:59 PM, Dr. David Alan Gilbert
dgilb...@redhat.com wrote:
 * Andrey Korolyov (and...@xdel.ru) wrote:
 On Wed, Mar 11, 2015 at 10:59 PM, Dr. David Alan Gilbert
 dgilb...@redhat.com wrote:
  * Andrey Korolyov (and...@xdel.ru) wrote:
  On Wed, Mar 11, 2015 at 10:33 PM, Dr. David Alan Gilbert
  dgilb...@redhat.com wrote:
   * Kevin O'Connor (ke...@koconnor.net) wrote:
   On Wed, Mar 11, 2015 at 02:45:31PM -0400, Kevin O'Connor wrote:
On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:
 For what it's worth, I can't seem to trigger the problem if I move 
 the
 cmos read above the SIPI/LAPIC code (see patch below).
   
Ugh!
   
That's a seabios bug.  Main processor modifies the rtc index
(rtc_read()) while APs try to clear the NMI bit by modifying the rtc
index (romlayout.S:transition32).
   
I'll put together a fix.
  
   The seabios patch below resolves the issue for me.
  
   Thanks! Looks good here.
  
   Andrey, Paolo, Bandan: Does it fix it for you as well?
  
 
  Thanks Kevin, Dave,
 
  I'm afraid that I'm hitting something different, not only because of the
  different suberror code but also because of my version of seabios -
  I am using 1.7.5, and the corresponding code looks different there:
  it has none of the smp-related code the proposed patch touches. The
  machines in question went to production successfully and I'm afraid I
  cannot afford to play with them anymore; even if I re-trigger the issue
  with the patched 1.8.1-rc, there is no way to switch to a different kernel
  and retest due to the specific conditions of this production suite. I've
  ordered a pair of new shoes^W 2620v2-s which should arrive next
 
  Well I was testing on a pair of 'E5-2620 v2'; but as you saw my test case
  was pretty simple.  If you can suggest any flags I should add etc to the
  test I'd be happy to give it a go.
 
  Dave

 Here is my launch string:

 qemu-system-x86_64 -enable-kvm -name vmtest -S -machine
 pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -m 512
 -realtime mlock=off -smp 12,sockets=1,cores=12,threads=12 -numa
 node,nodeid=0,cpus=0-11,mem=512 -nographic -no-user-config -nodefaults
 -device sga -rtc base=utc,driftfix=slew -global
 kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global
 PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on
 -device nec-usb-xhci,id=usb,bus=pci.0,addr=0x4 -device
 virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -m
 512,slots=31,maxmem=16384M -object
 memory-backend-ram,id=mem0,size=512M -device
 pc-dimm,id=dimm0,node=0,memdev=mem0

 I omitted disk backend in this example, but there is a chance that my
 problem is not reproducible without some calls made explicitly by a
 bootloader (not sure what to say for mid-runtime failures).

 It seems to survive OK:

Thanks David, I'll go through the test sequence and report. Unfortunately
my orchestration does not have even hundred-millisecond precision
for libvirt events, so I can't tell whether the immediate start-up failures
happened before bootloader execution or during it; all I have for
those is a less-than-two-second interval between the launch command
being issued and the paused state event. QEMU logging also does not give
me timestamps for emulation errors, even with the appropriate timestamp
arg.


 while true; do (sleep 1; echo -e '\001cc\n'; sleep 5; echo -e 
 'q\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -enable-kvm -name vmtest 
 -S -machine pc-i440fx-2.1,accel=kvm,usb=off -cpu SandyBridge,+kvm_pv_eoi -m 
 512 -realtime mlock=off -smp 12,sockets=1,cores=12,threads=12 -numa 
 node,nodeid=0,cpus=0-11,mem=512 -nographic -no-user-config  -device sga -rtc 
 base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet 
 -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 
 -boot strict=on -device nec-usb-xhci,id=usb,bus=pci.0,addr=0x4 -device 
 virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -m 
 512,slots=31,maxmem=16384M -object memory-backend-ram,id=mem0,size=512M 
 -device pc-dimm,id=dimm0,node=0,memdev=mem0  ~/pi.vfd 2>&1 | tee 
 /tmp/qemu.op; grep "internal error" /tmp/qemu.op -q && break; done

 Dave


 
  Monday, so I'll be able to test a) against 1.8.0-release, b) against
  patched bios code, c) reproduce the initial error on master/3.19 (maybe
  I'll get them before the weekend by going to the computer shop in
  person). Until then, I have a very strong feeling that my issue is not
  there :) I have also become very curious about how the lack of the IDA
  feature could completely eliminate the issue for me; the only possible
  explanation is a clock-related race, which is a kinda stupid suggestion
  and unlikely to exist in nature.

  Thanks again to everyone for the thorough testing and ideas!
 
  
   -Kevin
  
  
   --- a/src/romlayout.S
   +++ b/src/romlayout.S
   @@ -22,7 +22,8 @@
// %edx = return location (in 32bit mode)
// Clobbers: ecx, flags, segment registers, cr0, idt/gdt

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Dr. David Alan Gilbert

* Kevin O'Connor (ke...@koconnor.net) wrote:
 On Wed, Mar 11, 2015 at 03:53:07PM +0000, Dr. David Alan Gilbert wrote:
  * Kevin O'Connor (ke...@koconnor.net) wrote:
   On Wed, Mar 11, 2015 at 01:45:57PM +0000, Dr. David Alan Gilbert wrote:
* Bandan Das (b...@redhat.com) wrote:
 Dr. David Alan Gilbert dgilb...@redhat.com writes:
  while true; do (sleep 5; echo -e 
  '\001cq\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -machine 
  pc-i440fx-2.0,accel=kvm -m 1024 -smp 128 -nographic -device sga 
  2>&1 | tee /tmp/qemu.op; grep "internal error" /tmp/qemu.op -q && 
  break; done
 
 That is a truly impressive command line, BTW.

Thanks :-)

  [root@virtlab413 qemu-world3]# git bisect bad
  21f5826a04d38e19488f917e1eef22751490c769 is the first bad commit
 
 I can reproduce this on E5-2620 v2 with  David's while true test.
 (The emulation failure I mean, not the suberror 2 that Andrey is 
 seeing)
 The commit that seems to have introduced this is -
 
 commit 0673b7870063a3affbad9046fb6d385a4e734c19
 Author: Kevin O'Connor ke...@koconnor.net
 Date:   Sat May 24 10:49:50 2014 -0400
 
 smp: Replace QEMU SMP init assembler code with C; run only in 
 32bit mode.
   [...]
Turning on debug logging
( -chardev file,id=log,path=/tmp/debugcon.$$ -device 
isa-debugcon,chardev=log,iobase=0x402 )

SeaBIOS (version 
rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org)
   [...]
Found 1 cpu(s) max supported 128 cpu(s)
   
   Something is very odd here.  When I run the above command (on an older
   AMD machine) I get:
   
   Found 128 cpu(s) max supported 128 cpu(s)
   
   That first value (1 vs 128) comes from QEMU (via cmos index 0x5f).
   That is, during smp init, SeaBIOS expects QEMU to tell it how many
   cpus are active, and SeaBIOS waits until that many CPUs check in from
   its SIPI request before proceeding.
   
   I wonder if QEMU reported only 1 active cpu via that cmos register,
   but more were actually active.  If that was the case, it could
   certainly explain the failure - as multiple cpus could be running
   without the sipi trampoline in place.
   
   What does the log look like on a non-failure case?
  
  I had to drop down from 128 to get a working run with debug; here
  are two runs with -smp 20   the first one worked, the second one
  failed.
 [...]
  === Working ===
  
  SeaBIOS (version 
  rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org)
 [...]
  Found 20 cpu(s) max supported 20 cpu(s)
 [...]
  === Broken ===
  
  SeaBIOS (version 
  rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org)
 [...]
  Found 1 cpu(s) max supported 20 cpu(s)
 
 So, I couldn't get this to fail on my older AMD machine at all with
 the default SeaBIOS code.  But, when I change the code with the patch
 below, it failed right away.
 
 KVM internal error. Suberror: 1
 emulation failure
 EAX= EBX= ECX= EDX=000fd2b8
 ESI= EDI= EBP= ESP=
 EIP=000fd2c1 EFL=0007 [-PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =0010   00c09300 DPL=0 DS   [-WA]
 CS =0008   00c09b00 DPL=0 CS32 [-RA]
 SS =0010   00c09300 DPL=0 DS   [-WA]
 DS =0010   00c09300 DPL=0 DS   [-WA]
 FS =0010   00c09300 DPL=0 DS   [-WA]
 GS =0010   00c09300 DPL=0 DS   [-WA]
 LDT=   8200 DPL=0 LDT
 TR =   8300 DPL=0 TSS16-busy
 GDT= 000f6a50 0037
 IDT= 000f6a8e 
 CR0=6011 CR2= CR3= CR4=
 DR0= DR1= DR2= 
 DR3= 
 DR6=0ff0 DR7=0400
 EFER=
 Code=66 ba b8 d2 0f 00 e9 a2 fe f3 90 f0 0f ba 2d 04 ff fb 3f 00 72 f3 8b 
 25 00 ff fb 3f e8 d2 65 ff ff c7 05 04 ff fb 3f 00 00 00 00 f4 eb fd fa fc 66 
 b8
 
 And the failed debug output looks like:
 
 SeaBIOS (version rel-1.8.0-7-gd23eba6-dirty-20150311_121819-morn.localdomain)
 [...]
 cmos_smp_count0=20
 [...]
 cmos_smp_count=1
 cmos_smp_count2=1/20
 Found 1 cpu(s) max supported 20 cpu(s)
 
 I'm going to check the assembly for a compiler error, but is it
 possible QEMU is returning incorrect data in cmos index 0x5f?
 
 David, any chance you can recompile seabios and double check your
 output?

Done;

=== Working ===
SeaBIOS (version rel-1.8.0-0-g4c59f5d-dirty-20150311_164408-dgilbert-t530)
No Xen hypervisor found.
Running on QEMU (i440fx)
Running on KVM
RamSize: 0x4000 [cmos]
Relocating init from 0x000df4e0 to 0x3ffaf570 (size 68048)
Found QEMU fw_cfg
RamBlock: addr 0x len 0x4000 [e820]
Moving pm_base to 0x600
CPU Mhz=2116
cmos_smp_count0=20
=== PCI bus  bridge init ===
PCI: pci_bios_init_bus_rec bus = 0x0
=== PCI device probing ===
Found 6 PCI devices (max PCI bus is 00)
=== PCI new

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Kevin O'Connor

On Wed, Mar 11, 2015 at 03:53:07PM +0000, Dr. David Alan Gilbert wrote:
 * Kevin O'Connor (ke...@koconnor.net) wrote:
  On Wed, Mar 11, 2015 at 01:45:57PM +0000, Dr. David Alan Gilbert wrote:
   * Bandan Das (b...@redhat.com) wrote:
Dr. David Alan Gilbert dgilb...@redhat.com writes:
 while true; do (sleep 5; echo -e 
 '\001cq\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -machine 
 pc-i440fx-2.0,accel=kvm -m 1024 -smp 128 -nographic -device sga 2>&1 
 | tee /tmp/qemu.op; grep "internal error" /tmp/qemu.op -q && break; 
 done

That is a truly impressive command line, BTW.

 [root@virtlab413 qemu-world3]# git bisect bad
 21f5826a04d38e19488f917e1eef22751490c769 is the first bad commit

I can reproduce this on E5-2620 v2 with  David's while true test.
(The emulation failure I mean, not the suberror 2 that Andrey is seeing)
The commit that seems to have introduced this is -

commit 0673b7870063a3affbad9046fb6d385a4e734c19
Author: Kevin O'Connor ke...@koconnor.net
Date:   Sat May 24 10:49:50 2014 -0400

smp: Replace QEMU SMP init assembler code with C; run only in 32bit 
mode.
  [...]
   Turning on debug logging
   ( -chardev file,id=log,path=/tmp/debugcon.$$ -device 
   isa-debugcon,chardev=log,iobase=0x402 )
   
   SeaBIOS (version 
   rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org)
  [...]
   Found 1 cpu(s) max supported 128 cpu(s)
  
  Something is very odd here.  When I run the above command (on an older
  AMD machine) I get:
  
  Found 128 cpu(s) max supported 128 cpu(s)
  
  That first value (1 vs 128) comes from QEMU (via cmos index 0x5f).
  That is, during smp init, SeaBIOS expects QEMU to tell it how many
  cpus are active, and SeaBIOS waits until that many CPUs check in from
  its SIPI request before proceeding.
  
  I wonder if QEMU reported only 1 active cpu via that cmos register,
  but more were actually active.  If that was the case, it could
  certainly explain the failure - as multiple cpus could be running
  without the sipi trampoline in place.
  
  What does the log look like on a non-failure case?
 
 I had to drop down from 128 to get a working run with debug; here
 are two runs with -smp 20   the first one worked, the second one
 failed.
[...]
 === Working ===
 
 SeaBIOS (version rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org)
[...]
 Found 20 cpu(s) max supported 20 cpu(s)
[...]
 === Broken ===
 
 SeaBIOS (version rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org)
[...]
 Found 1 cpu(s) max supported 20 cpu(s)

So, I couldn't get this to fail on my older AMD machine at all with
the default SeaBIOS code.  But, when I change the code with the patch
below, it failed right away.

KVM internal error. Suberror: 1
emulation failure
EAX= EBX= ECX= EDX=000fd2b8
ESI= EDI= EBP= ESP=
EIP=000fd2c1 EFL=0007 [-PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010   00c09300 DPL=0 DS   [-WA]
CS =0008   00c09b00 DPL=0 CS32 [-RA]
SS =0010   00c09300 DPL=0 DS   [-WA]
DS =0010   00c09300 DPL=0 DS   [-WA]
FS =0010   00c09300 DPL=0 DS   [-WA]
GS =0010   00c09300 DPL=0 DS   [-WA]
LDT=   8200 DPL=0 LDT
TR =   8300 DPL=0 TSS16-busy
GDT= 000f6a50 0037
IDT= 000f6a8e 
CR0=6011 CR2= CR3= CR4=
DR0= DR1= DR2= 
DR3= 
DR6=0ff0 DR7=0400
EFER=
Code=66 ba b8 d2 0f 00 e9 a2 fe f3 90 f0 0f ba 2d 04 ff fb 3f 00 72 f3 8b 25 
00 ff fb 3f e8 d2 65 ff ff c7 05 04 ff fb 3f 00 00 00 00 f4 eb fd fa fc 66 b8

And the failed debug output looks like:

SeaBIOS (version rel-1.8.0-7-gd23eba6-dirty-20150311_121819-morn.localdomain)
[...]
cmos_smp_count0=20
[...]
cmos_smp_count=1
cmos_smp_count2=1/20
Found 1 cpu(s) max supported 20 cpu(s)

I'm going to check the assembly for a compiler error, but is it
possible QEMU is returning incorrect data in cmos index 0x5f?

David, any chance you can recompile seabios and double check your
output?

-Kevin


--- a/src/fw/smp.c
+++ b/src/fw/smp.c
@@ -128,6 +128,7 @@ smp_setup(void)
 
 // Wait for other CPUs to process the SIPI.
 u8 cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
+dprintf(1, "cmos_smp_count=%d\n", cmos_smp_count);
 while (cmos_smp_count != CountCPUs)
 asm volatile(
 // Release lock and allow other processors to use the stack.
@@ -140,6 +141,8 @@ smp_setup(void)
 : "+m" (SMPLock), "+m" (SMPStack)
 : : "cc", "memory");
 yield();
+dprintf(1, "cmos_smp_count2=%d/%d\n", cmos_smp_count
+, rtc_read(CMOS_BIOS_SMP_COUNT) + 1);
 
 // Restore memory.
 *(u64*)BUILD_AP_BOOT_ADDR = old;
diff --git a/src/post.c b/src/post.c
index

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Bandan Das

Kevin O'Connor ke...@koconnor.net writes:
...

 Something is very odd here.  When I run the above command (on an older
 AMD machine) I get:

 Found 128 cpu(s) max supported 128 cpu(s)

 That first value (1 vs 128) comes from QEMU (via cmos index 0x5f).
 That is, during smp init, SeaBIOS expects QEMU to tell it how many
 cpus are active, and SeaBIOS waits until that many CPUs check in from
 its SIPI request before proceeding.

 I wonder if QEMU reported only 1 active cpu via that cmos register,
 but more were actually active.  If that was the case, it could

I was daring enough to try this and I don't see the crash :)

diff --git a/src/fw/smp.c b/src/fw/smp.c
index a466ea6..a346d46 100644
--- a/src/fw/smp.c
+++ b/src/fw/smp.c
@@ -49,6 +49,7 @@ int apic_id_is_present(u8 apic_id)
 void VISIBLE32FLAT
 handle_smp(void)
 {
+  dprintf(DEBUG_HDL_smp, "Calling handle_smp\n");
 if (!CONFIG_QEMU)
 return;
 
@@ -128,6 +129,8 @@ smp_setup(void)
 
 // Wait for other CPUs to process the SIPI.
 u8 cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
+while (cmos_smp_count == 1)
+cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
 while (cmos_smp_count != CountCPUs)
 asm volatile(
 // Release lock and allow other processors to use the stack.

So, the while loop results in a race somehow ?

Bandan
 certainly explain the failure - as multiple cpus could be running
 without the sipi trampoline in place.

 What does the log look like on a non-failure case?

 -Kevin

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Kevin O'Connor

On Wed, Mar 11, 2015 at 01:45:57PM +0000, Dr. David Alan Gilbert wrote:
 * Bandan Das (b...@redhat.com) wrote:
  Dr. David Alan Gilbert dgilb...@redhat.com writes:
   while true; do (sleep 5; echo -e 
   '\001cq\n')|/opt/qemu-try-world3/bin/qemu-system-x86_64 -machine 
   pc-i440fx-2.0,accel=kvm -m 1024 -smp 128 -nographic -device sga 2>&1 | 
   tee /tmp/qemu.op; grep "internal error" /tmp/qemu.op -q && break; done
  
[...]
   [root@virtlab413 qemu-world3]# git bisect bad
   21f5826a04d38e19488f917e1eef22751490c769 is the first bad commit
  
  I can reproduce this on E5-2620 v2 with  David's while true test.
  (The emulation failure I mean, not the suberror 2 that Andrey is seeing)
  The commit that seems to have introduced this is -
  
  commit 0673b7870063a3affbad9046fb6d385a4e734c19
  Author: Kevin O'Connor ke...@koconnor.net
  Date:   Sat May 24 10:49:50 2014 -0400
  
  smp: Replace QEMU SMP init assembler code with C; run only in 32bit 
  mode.
[...]
 Turning on debug logging
 ( -chardev file,id=log,path=/tmp/debugcon.$$ -device 
 isa-debugcon,chardev=log,iobase=0x402 )
 
 SeaBIOS (version rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org)
[...]
 Found 1 cpu(s) max supported 128 cpu(s)

Something is very odd here.  When I run the above command (on an older
AMD machine) I get:

Found 128 cpu(s) max supported 128 cpu(s)

That first value (1 vs 128) comes from QEMU (via cmos index 0x5f).
That is, during smp init, SeaBIOS expects QEMU to tell it how many
cpus are active, and SeaBIOS waits until that many CPUs check in from
its SIPI request before proceeding.

I wonder if QEMU reported only 1 active cpu via that cmos register,
but more were actually active.  If that was the case, it could
certainly explain the failure - as multiple cpus could be running
without the sipi trampoline in place.

What does the log look like on a non-failure case?

-Kevin
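
For reference, the handshake described above can be written down in a few lines of C. This is only an illustration and not SeaBIOS code: expected_cpus() and all_cpus_checked_in() are hypothetical helper names, and the "+ 1" assumes that cmos index 0x5f holds the number of CPUs minus one, mirroring the rtc_read(CMOS_BIOS_SMP_COUNT) + 1 expression in smp_setup().

```c
#include <stdint.h>

/* Hypothetical helper: total CPUs the boot CPU should wait for, given
 * the raw byte read from cmos index 0x5f (assumed to be "CPUs - 1"). */
static uint8_t expected_cpus(uint8_t cmos_5f)
{
    /* Kept in a u8, as in smp_setup(), so a raw 0xff would wrap to 0. */
    return (uint8_t)(cmos_5f + 1);
}

/* Nonzero once every expected CPU has checked in, i.e. the running
 * counter (CountCPUs in SeaBIOS) has reached the expected total. */
static int all_cpus_checked_in(uint8_t cmos_5f, uint8_t count_cpus)
{
    return expected_cpus(cmos_5f) == count_cpus;
}
```

With -smp 128 the register should read 127; a read of 0 makes the boot CPU believe it is alone, so the check is already satisfied before any other CPU has run.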

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Kevin O'Connor

On Wed, Mar 11, 2015 at 05:59:04PM +0000, Dr. David Alan Gilbert wrote:
 * Kevin O'Connor (ke...@koconnor.net) wrote:
  On Wed, Mar 11, 2015 at 04:52:03PM +0000, Dr. David Alan Gilbert wrote:
   * Kevin O'Connor (ke...@koconnor.net) wrote:
So, I couldn't get this to fail on my older AMD machine at all with
the default SeaBIOS code.  But, when I change the code with the patch
below, it failed right away.
  [...]
And the failed debug output looks like:

SeaBIOS (version 
rel-1.8.0-7-gd23eba6-dirty-20150311_121819-morn.localdomain)
[...]
cmos_smp_count0=20
[...]
cmos_smp_count=1
cmos_smp_count2=1/20
Found 1 cpu(s) max supported 20 cpu(s)

I'm going to check the assembly for a compiler error, but is it
possible QEMU is returning incorrect data in cmos index 0x5f?
  
  I checked the SeaBIOS assembler and it looks sane.  So, I think the
  question is, why is QEMU sometimes returning a 0 instead of 127 from
  cmos 0x5f.
 
 My reading of the logs I've just created is that qemu doesn't think
 it's ever being asked to read 5f in the failed case:
 
 good:
 
 pc_cmos_init 5f setting smp_cpus=20
 cmos: read index=0x0f val=0x00
 cmos: read index=0x34 val=0x00
 cmos: read index=0x35 val=0x3f
 cmos: read index=0x38 val=0x30
 cmos: read index=0x3d val=0x12
 cmos: read index=0x38 val=0x30
 cmos: read index=0x0b val=0x02
 cmos: read index=0x0d val=0x80
 cmos: read index=0x5f val=0x13  Yeh!
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00
 
 bad:
 pc_cmos_init 5f setting smp_cpus=20
 cmos: read index=0x0f val=0x00
 cmos: read index=0x34 val=0x00
 cmos: read index=0x35 val=0x3f
 cmos: read index=0x38 val=0x30
 cmos: read index=0x3d val=0x12
 cmos: read index=0x38 val=0x30
 cmos: read index=0x0b val=0x02
 cmos: read index=0x0d val=0x80  Oh!
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00

For what it's worth, I can't seem to trigger the problem if I move the
cmos read above the SIPI/LAPIC code (see patch below).

I used this command line:

while true; do (sleep 5; echo -e '\001cq\n')| 
../qemu/qemu-git/x86_64-softmmu/qemu-system-x86_64 -chardev file,path=foo.`date 
+%s`,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios -machine 
pc-i440fx-2.0,accel=kvm -m 1024 -smp 128 -nographic -device sga -L test 2>&1 | 
tee /tmp/qemu.op; grep "internal error" /tmp/qemu.op -q && break; done

This is on an AMD Phenom(tm) II X6 1090T Processor machine.

-Kevin


--- a/src/fw/smp.c
+++ b/src/fw/smp.c
@@ -107,6 +107,8 @@ smp_setup(void)
| (((u32)entry_smp - BUILD_BIOS_ADDR) << 8));
 *(u64*)BUILD_AP_BOOT_ADDR = new;
 
+u8 cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
+
 // enable local APIC
 u32 val = readl(APIC_SVR);
 writel(APIC_SVR, val | APIC_ENABLED);
@@ -127,7 +129,7 @@ smp_setup(void)
 writel(APIC_ICR_LOW, 0x000C4600 | sipi_vector);
 
 // Wait for other CPUs to process the SIPI.
-u8 cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
+dprintf(1, "cmos_smp_count=%d\n", cmos_smp_count);
 while (cmos_smp_count != CountCPUs)
 asm volatile(
 // Release lock and allow other processors to use the stack.
@@ -140,6 +142,8 @@ smp_setup(void)
 : "+m" (SMPLock), "+m" (SMPStack)
 : : "cc", "memory");
 yield();
+dprintf(1, "cmos_smp_count2=%d/%d\n", cmos_smp_count
+, rtc_read(CMOS_BIOS_SMP_COUNT) + 1);
 
 // Restore memory.
 *(u64*)BUILD_AP_BOOT_ADDR = old;
diff --git a/src/post.c b/src/post.c
index 9ea5620..dc11c72 100644
--- a/src/post.c
+++ b/src/post.c
@@ -170,6 +170,7 @@ platform_hardware_setup(void)
 clock_setup();
 
 // Platform specific setup
+dprintf(1, "cmos_smp_count0=%d\n", rtc_read(CMOS_BIOS_SMP_COUNT) + 1);
 qemu_platform_setup();
 coreboot_platform_setup();
 }

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Kevin O'Connor

On Wed, Mar 11, 2015 at 02:45:31PM -0400, Kevin O'Connor wrote:
 On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:
  For what it's worth, I can't seem to trigger the problem if I move the
  cmos read above the SIPI/LAPIC code (see patch below).
 
 Ugh!
 
 That's a seabios bug.  Main processor modifies the rtc index
 (rtc_read()) while APs try to clear the NMI bit by modifying the rtc
 index (romlayout.S:transition32).
 
 I'll put together a fix.

The seabios patch below resolves the issue for me.

-Kevin


--- a/src/romlayout.S
+++ b/src/romlayout.S
@@ -22,7 +22,8 @@
 // %edx = return location (in 32bit mode)
 // Clobbers: ecx, flags, segment registers, cr0, idt/gdt
 DECLFUNC transition32
-transition32_for_smi:
+transition32_nmi_off:
+// transition32 when NMI and A20 are already initialized
 movl %eax, %ecx
 jmp 1f
 transition32:
@@ -205,7 +206,7 @@ __farcall16:
 entry_smi:
 // Transition to 32bit mode.
 movl $1f + BUILD_BIOS_ADDR, %edx
-jmp transition32_for_smi
+jmp transition32_nmi_off
 .code32
 1:  movl $BUILD_SMM_ADDR + 0x8000, %esp
 calll _cfunc32flat_handle_smi - BUILD_BIOS_ADDR
@@ -216,8 +217,10 @@ entry_smi:
 DECLFUNC entry_smp
 entry_smp:
 // Transition to 32bit mode.
+cli
+cld
 movl $2f + BUILD_BIOS_ADDR, %edx
-jmp transition32
+jmp transition32_nmi_off
 .code32
 // Acquire lock and take ownership of shared stack
 1:  rep ; nop
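
The race this patch closes can be modeled in plain C. The sketch below is a toy model, not SeaBIOS or QEMU code: struct cmos and its helpers are invented for illustration, NMI_DISABLE_BIT is assumed to be 0x80, and the index the APs select (0x0f) is taken from the "cmos: read index=0x0f" lines in the logs earlier in the thread.

```c
#include <stdint.h>

#define NMI_DISABLE_BIT     0x80  /* assumed value of the SeaBIOS constant */
#define CMOS_RESET_CODE     0x0f  /* index the APs select (seen in the logs) */
#define CMOS_BIOS_SMP_COUNT 0x5f

/* Toy CMOS device: one shared index latch in front of 128 bytes. */
struct cmos {
    uint8_t index;
    uint8_t data[128];
};

/* Models a write to the index port; the top bit only gates NMI. */
static void cmos_select(struct cmos *c, uint8_t port_byte)
{
    c->index = port_byte & 0x7f;
}

/* Models a read from the data port: whatever index is latched now. */
static uint8_t cmos_read_data(const struct cmos *c)
{
    return c->data[c->index];
}

/* rtc_read() split into its two port accesses. ap_interferes models an
 * AP running the NMI-disable sequence between the boot CPU's index
 * write and its data read. */
static uint8_t bsp_read_smp_count(struct cmos *c, int ap_interferes)
{
    cmos_select(c, CMOS_BIOS_SMP_COUNT | NMI_DISABLE_BIT);
    if (ap_interferes)
        cmos_select(c, CMOS_RESET_CODE | NMI_DISABLE_BIT);
    return cmos_read_data(c);
}
```

With data[0x5f] set to 0x13, the undisturbed read gives 0x13 (20 CPUs after the + 1), while the disturbed read returns the contents of index 0x0f, which is 0 in the logs and therefore yields a count of 1.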

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Kevin O'Connor

On Wed, Mar 11, 2015 at 04:52:03PM +0000, Dr. David Alan Gilbert wrote:
 * Kevin O'Connor (ke...@koconnor.net) wrote:
  So, I couldn't get this to fail on my older AMD machine at all with
  the default SeaBIOS code.  But, when I change the code with the patch
  below, it failed right away.
[...]
  And the failed debug output looks like:
  
  SeaBIOS (version 
  rel-1.8.0-7-gd23eba6-dirty-20150311_121819-morn.localdomain)
  [...]
  cmos_smp_count0=20
  [...]
  cmos_smp_count=1
  cmos_smp_count2=1/20
  Found 1 cpu(s) max supported 20 cpu(s)
  
  I'm going to check the assembly for a compiler error, but is it
  possible QEMU is returning incorrect data in cmos index 0x5f?

I checked the SeaBIOS assembler and it looks sane.  So, I think the
question is, why is QEMU sometimes returning a 0 instead of 127 from
cmos 0x5f.

  David, any chance you can recompile seabios and double check your
  output?
 
 Done;
 
 === Working ===
 SeaBIOS (version rel-1.8.0-0-g4c59f5d-dirty-20150311_164408-dgilbert-t530)
[...]
 cmos_smp_count0=20
 cmos_smp_count=20
 cmos_smp_count2=20/20
 Found 20 cpu(s) max supported 20 cpu(s)
[...]
 === Broken ===
 SeaBIOS (version rel-1.8.0-0-g4c59f5d-dirty-20150311_164408-dgilbert-t530)
[...]
 cmos_smp_count0=20
 cmos_smp_count=1
 cmos_smp_count2=1/20
 Found 1 cpu(s) max supported 20 cpu(s)

That's the same pattern I see.

-Kevin

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Kevin O'Connor

On Wed, Mar 11, 2015 at 01:09:42PM -0400, Bandan Das wrote:
 Kevin O'Connor ke...@koconnor.net writes:
 ...
 
  Something is very odd here.  When I run the above command (on an older
  AMD machine) I get:
 
  Found 128 cpu(s) max supported 128 cpu(s)
 
  That first value (1 vs 128) comes from QEMU (via cmos index 0x5f).
  That is, during smp init, SeaBIOS expects QEMU to tell it how many
  cpus are active, and SeaBIOS waits until that many CPUs check in from
  its SIPI request before proceeding.
 
  I wonder if QEMU reported only 1 active cpu via that cmos register,
  but more were actually active.  If that was the case, it could
 
 I was daring enough to try this and I don't see the crash :)
 
 diff --git a/src/fw/smp.c b/src/fw/smp.c
 index a466ea6..a346d46 100644
 --- a/src/fw/smp.c
 +++ b/src/fw/smp.c
 @@ -49,6 +49,7 @@ int apic_id_is_present(u8 apic_id)
  void VISIBLE32FLAT
  handle_smp(void)
  {
 +  dprintf(DEBUG_HDL_smp, "Calling handle_smp\n");
  if (!CONFIG_QEMU)
  return;
  
 @@ -128,6 +129,8 @@ smp_setup(void)
  
  // Wait for other CPUs to process the SIPI.
  u8 cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
 +while (cmos_smp_count == 1)
 +cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;

That would loop forever if you had -smp 1.

  while (cmos_smp_count != CountCPUs)
  asm volatile(
  // Release lock and allow other processors to use the stack.
 
 So, the while loop results in a race somehow ?

No, the problem is that loop doesn't run at all, and as a result the
other cpus end up running random code.  SeaBIOS sets up an entry
vector for multiple cpus, wakes up the cpus, then tears down the entry
vector.  If it tears down the entry vector before all the cpus have
run through it, then the other cpus can end up running random code.

-Kevin
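
The sequence described above can be illustrated with a small stand-alone model. It is a sketch under simplifying assumptions, not the real smp_setup(): aps_left_behind() is a made-up name, and the model lets exactly one AP check in per iteration of the wait loop.

```c
#include <stdint.h>

/* Returns how many APs have not yet run through the entry vector when
 * the boot CPU restores the original memory, or -1 if the wait could
 * never finish. Model only: one AP checks in per loop iteration. */
static int aps_left_behind(uint8_t cmos_5f, int total_cpus)
{
    uint8_t expected = (uint8_t)(cmos_5f + 1);
    int count_cpus = 1;                 /* the boot CPU itself */

    while (expected != count_cpus) {
        if (count_cpus >= total_cpus)
            return -1;                  /* nobody left to check in */
        count_cpus++;                   /* an AP increments the counter */
    }
    return total_cpus - count_cpus;     /* APs still using the vector */
}
```

For -smp 20, a correct read (0x13) leaves no AP behind, whereas a bogus read of 0 skips the loop entirely and tears the vector down with 19 APs still on their way through it.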

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Bandan Das

Kevin O'Connor ke...@koconnor.net writes:

 On Wed, Mar 11, 2015 at 01:09:42PM -0400, Bandan Das wrote:
 Kevin O'Connor ke...@koconnor.net writes:
 ...
 
  Something is very odd here.  When I run the above command (on an older
  AMD machine) I get:
 
  Found 128 cpu(s) max supported 128 cpu(s)
 
  That first value (1 vs 128) comes from QEMU (via cmos index 0x5f).
  That is, during smp init, SeaBIOS expects QEMU to tell it how many
  cpus are active, and SeaBIOS waits until that many CPUs check in from
  its SIPI request before proceeding.
 
  I wonder if QEMU reported only 1 active cpu via that cmos register,
  but more were actually active.  If that was the case, it could
 
 I was daring enough to try this and I don't see the crash :)
 
 diff --git a/src/fw/smp.c b/src/fw/smp.c
 index a466ea6..a346d46 100644
 --- a/src/fw/smp.c
 +++ b/src/fw/smp.c
 @@ -49,6 +49,7 @@ int apic_id_is_present(u8 apic_id)
  void VISIBLE32FLAT
  handle_smp(void)
  {
 +  dprintf(DEBUG_HDL_smp, "Calling handle_smp\n");
  if (!CONFIG_QEMU)
  return;
  
 @@ -128,6 +129,8 @@ smp_setup(void)
  
  // Wait for other CPUs to process the SIPI.
  u8 cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;
 +while (cmos_smp_count == 1)
 +cmos_smp_count = rtc_read(CMOS_BIOS_SMP_COUNT) + 1;

 That would loop forever if you had -smp 1.

Sorry, I should have been more clear. What I meant is if I run
with -smp 128 (from Dave's original reproducer), sticking this
while loop avoids the crash. So, the rtc_read eventually returns the
right number (127), as the above while loop keeps polling.

Bandan
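
To make the trade-off concrete, here is a small stand-alone model of that polling workaround, with a hypothetical retry bound added so that a genuine single-CPU guest does not spin forever. poll_smp_count(), reads and max_polls are names invented for this sketch; the array stands in for successive rtc_read() results. It only papers over the bad read and is not meant as a fix.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the first CPU count other than 1 seen within max_polls
 * reads, or 1 if every read says "one CPU" (e.g. a real -smp 1). */
static uint8_t poll_smp_count(const uint8_t *reads, size_t n,
                              size_t max_polls)
{
    uint8_t count = 1;

    for (size_t i = 0; i < n && i < max_polls; i++) {
        count = (uint8_t)(reads[i] + 1);   /* raw value holds CPUs - 1 */
        if (count != 1)
            break;                         /* got a plausible SMP count */
    }
    return count;
}
```

Two bogus zero reads followed by 0x7f yield 128, matching the -smp 128 reproducer, while a run of zeros falls back to 1 once the bound is reached.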

  while (cmos_smp_count != CountCPUs)
  asm volatile(
  // Release lock and allow other processors to use the stack.
 
 So, the while loop results in a race somehow ?

 No, the problem is that loop doesn't run at all, and as a result the
 other cpus end up running random code.  SeaBIOS sets up an entry
 vector for multiple cpus, wakes up the cpus, then tears down the entry
 vector.  If it tears down the entry vector before all the cpus have
 run through it, then the other cpus can end up running random code.

 -Kevin

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Bandan Das

Dr. David Alan Gilbert dgilb...@redhat.com writes:

 * Kevin O'Connor (ke...@koconnor.net) wrote:
 On Wed, Mar 11, 2015 at 02:45:31PM -0400, Kevin O'Connor wrote:
  On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:
   For what it's worth, I can't seem to trigger the problem if I move the
   cmos read above the SIPI/LAPIC code (see patch below).
  
  Ugh!
  
  That's a seabios bug.  Main processor modifies the rtc index
  (rtc_read()) while APs try to clear the NMI bit by modifying the rtc
  index (romlayout.S:transition32).
  
  I'll put together a fix.
 
 The seabios patch below resolves the issue for me.

 Thanks! Looks good here.

 Andrey, Paolo, Bandan: Does it fix it for you as well?

Works for me too, thanks Kevin!

Bandan

 Dave

 -Kevin
 
 
 --- a/src/romlayout.S
 +++ b/src/romlayout.S
 @@ -22,7 +22,8 @@
  // %edx = return location (in 32bit mode)
  // Clobbers: ecx, flags, segment registers, cr0, idt/gdt
  DECLFUNC transition32
 -transition32_for_smi:
 +transition32_nmi_off:
 +// transition32 when NMI and A20 are already initialized
  movl %eax, %ecx
  jmp 1f
  transition32:
 @@ -205,7 +206,7 @@ __farcall16:
  entry_smi:
  // Transition to 32bit mode.
  movl $1f + BUILD_BIOS_ADDR, %edx
 -jmp transition32_for_smi
 +jmp transition32_nmi_off
  .code32
  1:  movl $BUILD_SMM_ADDR + 0x8000, %esp
  calll _cfunc32flat_handle_smi - BUILD_BIOS_ADDR
 @@ -216,8 +217,10 @@ entry_smi:
  DECLFUNC entry_smp
  entry_smp:
  // Transition to 32bit mode.
 +cli
 +cld
  movl $2f + BUILD_BIOS_ADDR, %edx
 -jmp transition32
 +jmp transition32_nmi_off
  .code32
  // Acquire lock and take ownership of shared stack
  1:  rep ; nop
 --
 Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Dr. David Alan Gilbert

* Andrey Korolyov (and...@xdel.ru) wrote:
 On Wed, Mar 11, 2015 at 10:33 PM, Dr. David Alan Gilbert
 dgilb...@redhat.com wrote:
  * Kevin O'Connor (ke...@koconnor.net) wrote:
  On Wed, Mar 11, 2015 at 02:45:31PM -0400, Kevin O'Connor wrote:
   On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:
For what it's worth, I can't seem to trigger the problem if I move the
cmos read above the SIPI/LAPIC code (see patch below).
  
   Ugh!
  
   That's a seabios bug.  Main processor modifies the rtc index
   (rtc_read()) while APs try to clear the NMI bit by modifying the rtc
   index (romlayout.S:transition32).
  
   I'll put together a fix.
 
  The seabios patch below resolves the issue for me.
 
  Thanks! Looks good here.
 
  Andrey, Paolo, Bandan: Does it fix it for you as well?
 
 
 Thanks Kevin, Dave,
 
 I'm afraid that I'm hitting something different, not only because of the
 different suberror code but also because of my version of seabios -
 I am using 1.7.5, and the corresponding code looks different there: the
 SMP-related code that the proposed patch touches does not exist. The
 mentioned devices went to production successfully and I'm afraid I
 cannot afford playing on them anymore; even if I re-trigger the issue
 with patched 1.8.1-rc, there is no way to switch to a different kernel
 and retest due to specific conditions of this production suite. I've
 ordered a pair of new shoes^W 2620v2-s which should arrive to me next

Well I was testing on a pair of 'E5-2620 v2'; but as you saw my test case
was pretty simple.  If you can suggest any flags I should add etc to the
test I'd be happy to give it a go.

Dave

 Monday, so I'll be able to test a) against 1.8.0-release, b) against
 patched bios code, c) reproduce the initial error on master/3.19 (maybe
 I'll pick them up before the weekend by going to the computer shop in
 person). Until then, I have a very strong feeling that my issue is not
 there :) Also I became very curious how the lack of the IDT feature can
 completely eliminate the issue for me; the only possible
 explanation is a clock-related race, which is a kinda stupid suggestion
 and unlikely to exist in nature.
 
 Thanks again to everyone for the thorough testing and ideas!
 
 
  -Kevin
 
 
  --- a/src/romlayout.S
  +++ b/src/romlayout.S
  @@ -22,7 +22,8 @@
   // %edx = return location (in 32bit mode)
   // Clobbers: ecx, flags, segment registers, cr0, idt/gdt
   DECLFUNC transition32
  -transition32_for_smi:
  +transition32_nmi_off:
  +// transition32 when NMI and A20 are already initialized
   movl %eax, %ecx
   jmp 1f
   transition32:
  @@ -205,7 +206,7 @@ __farcall16:
   entry_smi:
   // Transition to 32bit mode.
   movl $1f + BUILD_BIOS_ADDR, %edx
  -jmp transition32_for_smi
  +jmp transition32_nmi_off
   .code32
   1:  movl $BUILD_SMM_ADDR + 0x8000, %esp
   calll _cfunc32flat_handle_smi - BUILD_BIOS_ADDR
  @@ -216,8 +217,10 @@ entry_smi:
   DECLFUNC entry_smp
   entry_smp:
   // Transition to 32bit mode.
  +cli
  +cld
   movl $2f + BUILD_BIOS_ADDR, %edx
  -jmp transition32
  +jmp transition32_nmi_off
   .code32
   // Acquire lock and take ownership of shared stack
   1:  rep ; nop
  --
  Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Bandan Das

Dr. David Alan Gilbert dgilb...@redhat.com writes:

 * Kevin O'Connor (ke...@koconnor.net) wrote:
 On Wed, Mar 11, 2015 at 04:52:03PM +0000, Dr. David Alan Gilbert wrote:
  * Kevin O'Connor (ke...@koconnor.net) wrote:
   So, I couldn't get this to fail on my older AMD machine at all with
   the default SeaBIOS code.  But, when I change the code with the patch
   below, it failed right away.
 [...]
   And the failed debug output looks like:
   
   SeaBIOS (version 
   rel-1.8.0-7-gd23eba6-dirty-20150311_121819-morn.localdomain)
   [...]
   cmos_smp_count0=20
   [...]
   cmos_smp_count=1
   cmos_smp_count2=1/20
   Found 1 cpu(s) max supported 20 cpu(s)
   
   I'm going to check the assembly for a compiler error, but is it
   possible QEMU is returning incorrect data in cmos index 0x5f?
 
 I checked the SeaBIOS assembler and it looks sane.  So, I think the
 question is, why is QEMU sometimes returning a 0 instead of 127 from
 cmos 0x5f.

 My reading of the logs I've just created is that qemu doesn't think
 it's ever being asked to read 5f in the failed case:

 good:

 pc_cmos_init 5f setting smp_cpus=20
 cmos: read index=0x0f val=0x00
 cmos: read index=0x34 val=0x00
 cmos: read index=0x35 val=0x3f
 cmos: read index=0x38 val=0x30
 cmos: read index=0x3d val=0x12
 cmos: read index=0x38 val=0x30
 cmos: read index=0x0b val=0x02
 cmos: read index=0x0d val=0x80
 cmos: read index=0x5f val=0x13  Yeh!
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00

 bad:
 pc_cmos_init 5f setting smp_cpus=20
 cmos: read index=0x0f val=0x00
 cmos: read index=0x34 val=0x00
 cmos: read index=0x35 val=0x3f
 cmos: read index=0x38 val=0x30
 cmos: read index=0x3d val=0x12
 cmos: read index=0x38 val=0x30
 cmos: read index=0x0b val=0x02
 cmos: read index=0x0d val=0x80  Oh!
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00
 cmos: read index=0x0f val=0x00

rtc_read in hw/rtc.c is:
u8
rtc_read(u8 index)
{
    index |= NMI_DISABLE_BIT;
    outb(index, PORT_CMOS_INDEX);
    return inb(PORT_CMOS_DATA);
}

Is it possible that the read would happen before the index has been committed?

 Dave
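
For reference, the two traces above can be read with a bit of arithmetic.
This is only a minimal sketch, not SeaBIOS or QEMU code: it assumes (in line
with the smp_cpus=20 / val=0x13 pair in the good trace) that the CPU count is
stored in cmos 0x5f as "count - 1" and that the firmware adds 1 back. The
helper names are hypothetical.

```python
# Hypothetical helpers; they only model the arithmetic visible in the traces.
CMOS_BIOS_SMP_COUNT = 0x5f  # register index shown in the traces


def encode_smp_count(smp_cpus):
    """Byte assumed to be stored in cmos 0x5f for a given -smp count."""
    if not 1 <= smp_cpus <= 256:
        raise ValueError("count - 1 must fit in one byte")
    return smp_cpus - 1


def decode_smp_count(cmos_value):
    """CPU count the firmware would derive from the byte it read back."""
    return (cmos_value & 0xff) + 1


if __name__ == "__main__":
    print(hex(encode_smp_count(20)))  # 0x13, as in the good trace
    print(decode_smp_count(0x13))     # 20
    print(decode_smp_count(0x00))     # 1, i.e. a stray 0x00 read looks like a single CPU
```

Under that assumption, a read that returns 0x00 instead of 0x13 would account
for the "cmos_smp_count=1" line in the failing debug output.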


Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Dr. David Alan Gilbert

* Kevin O'Connor (ke...@koconnor.net) wrote:
 On Wed, Mar 11, 2015 at 02:45:31PM -0400, Kevin O'Connor wrote:
  On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:
   For what it's worth, I can't seem to trigger the problem if I move the
   cmos read above the SIPI/LAPIC code (see patch below).
  
  Ugh!
  
  That's a seabios bug.  Main processor modifies the rtc index
  (rtc_read()) while APs try to clear the NMI bit by modifying the rtc
  index (romlayout.S:transition32).
  
  I'll put together a fix.
 
 The seabios patch below resolves the issue for me.

Thanks! Looks good here.

Andrey, Paolo, Bandan: Does it fix it for you as well?

Dave

 -Kevin
 
 
 --- a/src/romlayout.S
 +++ b/src/romlayout.S
 @@ -22,7 +22,8 @@
  // %edx = return location (in 32bit mode)
  // Clobbers: ecx, flags, segment registers, cr0, idt/gdt
  DECLFUNC transition32
 -transition32_for_smi:
 +transition32_nmi_off:
 +// transition32 when NMI and A20 are already initialized
  movl %eax, %ecx
  jmp 1f
  transition32:
 @@ -205,7 +206,7 @@ __farcall16:
  entry_smi:
  // Transition to 32bit mode.
  movl $1f + BUILD_BIOS_ADDR, %edx
 -jmp transition32_for_smi
 +jmp transition32_nmi_off
  .code32
  1:  movl $BUILD_SMM_ADDR + 0x8000, %esp
  calll _cfunc32flat_handle_smi - BUILD_BIOS_ADDR
 @@ -216,8 +217,10 @@ entry_smi:
  DECLFUNC entry_smp
  entry_smp:
  // Transition to 32bit mode.
 +cli
 +cld
  movl $2f + BUILD_BIOS_ADDR, %edx
 -jmp transition32
 +jmp transition32_nmi_off
  .code32
  // Acquire lock and take ownership of shared stack
  1:  rep ; nop
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Paolo Bonzini



On 11/03/2015 18:37, Kevin O'Connor wrote:
  I'm going to check the assembly for a compiler error, but is it
  possible QEMU is returning incorrect data in cmos index 0x5f?
 
 I checked the SeaBIOS assembler and it looks sane.  So, I think the
 question is, why is QEMU sometimes returning a 0 instead of 127 from
 cmos 0x5f.

Dave, can you get a KVM trace (trace-cmd record -e kvm/*
qemu-system-x86_64 ...) and send it to me privately?

Paolo

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Kevin O'Connor

On Wed, Mar 11, 2015 at 02:40:39PM -0400, Kevin O'Connor wrote:
 For what it's worth, I can't seem to trigger the problem if I move the
 cmos read above the SIPI/LAPIC code (see patch below).

Ugh!

That's a seabios bug.  Main processor modifies the rtc index
(rtc_read()) while APs try to clear the NMI bit by modifying the rtc
index (romlayout.S:transition32).

I'll put together a fix.

-Kevin
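
The race described above can be illustrated with a toy model. This is a
sketch under stated assumptions, not the SeaBIOS implementation: the Cmos
class and bsp_rtc_read() are hypothetical, and the index the APs select
(0x0f) is assumed from the 0x0f reads that show up in the failing trace. The
point is only that the index port is a single shared latch, so a write from
another CPU between the outb and the inb changes which register is read.

```python
NMI_DISABLE_BIT = 0x80
CMOS_RESET_CODE = 0x0f       # index assumed to be selected by the APs
CMOS_BIOS_SMP_COUNT = 0x5f   # index the boot CPU wants to read


class Cmos:
    """Toy RTC/CMOS: one shared index latch plus a register file."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.index = 0

    def write_index(self, value):
        # Models outb(value, PORT_CMOS_INDEX); bit 7 only gates NMI.
        self.index = value & 0x7f

    def read_data(self):
        # Models inb(PORT_CMOS_DATA).
        return self.regs.get(self.index, 0)


def bsp_rtc_read(cmos, index, ap_interferes):
    """rtc_read() on the boot CPU, optionally with an AP write in between."""
    cmos.write_index(index | NMI_DISABLE_BIT)
    if ap_interferes:
        # An AP entering 32bit mode rewrites the shared latch at this point.
        cmos.write_index(CMOS_RESET_CODE | NMI_DISABLE_BIT)
    return cmos.read_data()


if __name__ == "__main__":
    cmos = Cmos({CMOS_BIOS_SMP_COUNT: 0x13, CMOS_RESET_CODE: 0x00})
    print(hex(bsp_rtc_read(cmos, CMOS_BIOS_SMP_COUNT, ap_interferes=False)))  # 0x13
    print(hex(bsp_rtc_read(cmos, CMOS_BIOS_SMP_COUNT, ap_interferes=True)))   # 0x0
```

In this model the only way to make the read safe is to keep the APs from
touching the index port at all, or to serialize access to it.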

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-11 Thread Dr. David Alan Gilbert

* Bandan Das (b...@redhat.com) wrote:
 Dr. David Alan Gilbert dgilb...@redhat.com writes:
 
  * Paolo Bonzini (pbonz...@redhat.com) wrote:
  
  
  On 10/03/2015 19:21, Bandan Das wrote:
   Paolo Bonzini pbonz...@redhat.com writes:
   
   On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
   I'm seeing something similar; it's very intermittent and generally
   happening right at boot of the guest;   I'm running this on qemu
   head+my postcopy world (but it's happening right at boot before postcopy
   gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
   but hey maybe I'm seeing a different bug.
   
   Probably a tangent but is the qemu trace identical to what Andrey is seeing?
   From a cursory look and my limited understanding, it seems his failure is #GP
   when executing video bios.
   
   Same here on 3.16 + Xeon E5 v3 kernel.
   
   I will try to reproduce this on a v2.
  
  I see several failures, usually mine have suberror 1.  With a 32-VCPU
  guest I can reproduce it roughly half of the time.
  
  Paolo
 
  while true; do (sleep 5; echo -e '\001cq\n') | \
    /opt/qemu-try-world3/bin/qemu-system-x86_64 -machine pc-i440fx-2.0,accel=kvm \
    -m 1024 -smp 128 -nographic -device sga 2>&1 | tee /tmp/qemu.op; \
    grep "internal error" /tmp/qemu.op -q && break; done
 
  (and leave about 2mins of runs before declaring good)
 
  bad: cd2946607b42636d6c8cf6dbf94bce0273507b17
  bad: 041ccc922ee474693a2869d4e3b59e920c739bc0
  bad: 2559db069628981bfdc90637fac5bf1b4f4e8ef5
  bad: 21f5826a04d38e19488f917e1eef22751490c769
  good: e95d24ff40c77fbfd71396834a2eb99375f8bcc4
  good: 7781a492fa5a2eff53d06b25b93f0186ad3226c9
  good: c3edd62851098e6417786193ed9e9341781fcf57
  good: c5c6d7f81a6950d8e32a3b5a0bafd37bfa5a8e88
  good: 73104fd399c6778112f64fe0d439319f24508d9a
  good: 92013cf8ca10adafec9a92deb5df993e7df22cb9
  good: 4478aa768ccefcc5b234c23d035435fd71b932f6
  good: 2.2.0
 
  [root@virtlab413 qemu-world3]# git bisect bad
  21f5826a04d38e19488f917e1eef22751490c769 is the first bad commit
 
 I can reproduce this on E5-2620 v2 with  David's while true test.
 (The emulation failure I mean, not the suberror 2 that Andrey is seeing)
 The commit that seems to have introduced this is -
 
 commit 0673b7870063a3affbad9046fb6d385a4e734c19
 Author: Kevin O'Connor ke...@koconnor.net
 Date:   Sat May 24 10:49:50 2014 -0400
 
 smp: Replace QEMU SMP init assembler code with C; run only in 32bit mode.
 
 Change the multi-processor init code to trampoline into 32bit mode on
 each of the additional processors.  Implement an atomic lock so that
 each processor performs its initialization serially.
 
 I am not sure what in that change could cause this though..
 Also, in my testing, unrestricted_guest=0 avoids the failure.

Turning on debug logging
( -chardev file,id=log,path=/tmp/debugcon.$$ -device 
isa-debugcon,chardev=log,iobase=0x402 )


SeaBIOS (version rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org)
No Xen hypervisor found.
Running on QEMU (i440fx)
Running on KVM
RamSize: 0x4000 [cmos]
Relocating init from 0x000dea20 to 0x3ffaed30 (size 70160)
Found QEMU fw_cfg
RamBlock: addr 0x len 0x4000 [e820]
Moving pm_base to 0x600
CPU Mhz=2112
=== PCI bus & bridge init ===
PCI: pci_bios_init_bus_rec bus = 0x0
=== PCI device probing ===
Found 6 PCI devices (max PCI bus is 00)
=== PCI new allocation pass #1 ===
PCI: check devices
=== PCI new allocation pass #2 ===
PCI: IO: c000 - c04f
PCI: 32: 8000 - fec0
PCI: map device bdf=00:03.0  bar 1, addr c000, size 0040 [io]
PCI: map device bdf=00:01.1  bar 4, addr c040, size 0010 [io]
PCI: map device bdf=00:03.0  bar 6, addr feb8, size 0004 [mem]
PCI: map device bdf=00:03.0  bar 0, addr febc, size 0002 [mem]
PCI: map device bdf=00:02.0  bar 6, addr febe, size 0001 [mem]
PCI: map device bdf=00:02.0  bar 2, addr febf, size 1000 [mem]
PCI: map device bdf=00:02.0  bar 0, addr fd00, size 0100 [prefmem]
PCI: init bdf=00:00.0 id=8086:1237
PCI: init bdf=00:01.0 id=8086:7000
PIIX3/PIIX4 init: elcr=00 0c
PCI: init bdf=00:01.1 id=8086:7010
PCI: init bdf=00:01.3 id=8086:7113
Using pmtimer, ioport 0x608
PCI: init bdf=00:02.0 id=1234:
PCI: init bdf=00:03.0 id=8086:100e
PCI: Using 00:02.0 for primary VGA
Found 1 cpu(s) max supported 128 cpu(s)
Copying PIR from 0x3ffbfc98 to 0x000f65a0
WARNING - Unable to allocate resource at copy_mptable:62!
Copying SMBIOS entry point from 0x6db0 to 0x000f6580
Scan for VGA option rom
Running option rom at c000:0003
Start SeaVGABIOS (version 
rel-1.8.0-0-g4c59f5d-20150219_092912-nilsson.home.kraxel.org)
enter vga_post:
   a=0010  b=  c=  d= ds= es=f000 ss=
  si= di=6970 bp= sp=6d0a cs=f000 ip=d239  f=
VBE DISPI: bdf 00:02.0, bar 0
VBE DISPI: lfb_addr=fd00, size 16 MB
Attempting to allocate VGA stack via pmm call to f000:d2f4
pmm call arg1=0
VGA

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Andrey Korolyov

On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov and...@xdel.ru wrote:
 On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das b...@redhat.com wrote:
 Andrey Korolyov and...@xdel.ru writes:

 On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov and...@xdel.ru wrote:
 Hello,

 recently I've got a couple of shiny new Intel 2620v2s as a future
 replacement for the E5-2620v1, but I have run into a relatively large
 number of emulation errors; all traces look similar to the one below. I am
 running qemu-2.1 on x86 on top of the 3.10 branch for testing purposes but
 can switch to other versions if necessary. Most of the crashes
 happened during a reboot cycle or at the end of an ACPI-based shutdown,
 if this helps. I have no clue what could introduce such
 a mess within the same processor family using identical software, as the
 2620v1 has never had a similar problem. Please let me know if there are
 any additional measures that would make the whole story clearer.

 Thanks!

 KVM internal error. Suberror: 2
 extra data[0]: 80d1
 extra data[1]: 8b0d
 EAX=0003 EBX= ECX= EDX=
 ESI= EDI= EBP= ESP=6cd4
 EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT= 000f6e98 0037
 IDT=  03ff
 CR0=0010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
 b8 00 e0 00 00 8e


 It turns out that those errors are introduced by APICv, which gets
 enabled because of the different feature set. If anyone is interested in
 reproducing/fixing this specifically on 3.10, it takes about one hundred
 migrations/power state changes for the issue to appear; the guest OS can be
 Linux or Windows.

 Are you able to reproduce this on a more recent upstream kernel as well ?

 Bandan

 I'll go through a test cycle with 3.18 and a 2603v2 around tomorrow and
 follow up with any reproducible results.

Heh.. the issue is not triggered on the 2603v2 at all, at least I am not able
to hit it. The only difference from the 2620v2, apart from the lower frequency, is
the Intel Dynamic Acceleration feature. I'd appreciate any testing with
higher CPU models with the same or a richer feature set. The testing itself
can be done on either generic 3.10 or RH7 kernels, as both of them
exhibit this issue. I conducted all tests with C-states disabled,
so I advise doing the same as a first reproduction step.

Thanks!

model name  : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
stepping: 4
microcode   : 0x416
cpu MHz : 2100.039
cache size  : 15360 KB
siblings: 12
apicid  : 43
initial apicid  : 43
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
flexpriority ept vpid fsgsbase smep erms
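
For anyone who wants to repeat the feature-set comparison between two hosts,
here is a minimal sketch. It is an illustration only: cpu_flags() and
flag_diff() are hypothetical helpers that take /proc/cpuinfo-style text (as
pasted above, including flags wrapped over several lines) and report which
flags appear on one host but not the other.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set of the first CPU in /proc/cpuinfo-style text."""
    lines = cpuinfo_text.splitlines()
    for i, line in enumerate(lines):
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            words = value.split()
            # Pasted listings may wrap the flags; fold in continuation
            # lines, i.e. lines that carry no "key : value" pair.
            for cont in lines[i + 1:]:
                if ":" in cont or not cont.strip():
                    break
                words.extend(cont.split())
            return set(words)
    return set()


def flag_diff(cpuinfo_a, cpuinfo_b):
    """Return (flags only in a, flags only in b), each sorted."""
    flags_a, flags_b = cpu_flags(cpuinfo_a), cpu_flags(cpuinfo_b)
    return sorted(flags_a - flags_b), sorted(flags_b - flags_a)


if __name__ == "__main__":
    host_a = "model name : A\nflags : fpu vme ida\narat\ncpuid level : 13"
    host_b = "model name : B\nflags : fpu vme arat"
    print(flag_diff(host_a, host_b))  # (['ida'], [])
```

Note that this only compares what the kernel chooses to list under "flags";
features that are not exported there will not show up in the diff.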

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Dr. David Alan Gilbert

* Andrey Korolyov (and...@xdel.ru) wrote:
 [...]

I'm seeing something similar; it's very intermittent and generally
happening right at boot of the guest;   I'm running this on qemu
head+my postcopy world (but it's happening right at boot before postcopy
gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
but hey maybe I'm seeing a different bug.

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Andrey Korolyov

On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert
dgilb...@redhat.com wrote:
 * Andrey Korolyov (and...@xdel.ru) wrote:
 [...]

 I'm seeing something similar; it's very intermittent and generally
 happening right at boot of the guest;   I'm running this on qemu
 head+my postcopy world (but it's happening right at boot before postcopy
 gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
 but hey maybe I'm seeing a different bug.

 Dave

Yep, looks like we are hitting the same bug - two thirds of my failure
events occurred during the boot/reboot cycle and approx. one third
happened in the middle of runtime. Which CPU are you using, v0 or v2
(in other words, is APICv enabled)?
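
One way to answer that question on a given host is to look at the kvm_intel
module parameters. The sketch below is an assumption-laden helper, not part
of any tool: kvm_intel_params() is a hypothetical function, and the parameter
names (enable_apicv, unrestricted_guest) and the sysfs directory are assumed
to match the running kernel; a parameter that is missing is reported as None
rather than guessed.

```python
from pathlib import Path


def kvm_intel_params(names=("enable_apicv", "unrestricted_guest"),
                     root="/sys/module/kvm_intel/parameters"):
    """Read selected kvm_intel module parameters; None if not readable."""
    result = {}
    for name in names:
        try:
            result[name] = (Path(root) / name).read_text().strip()
        except OSError:
            # Module not loaded, parameter absent, or no permission.
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in kvm_intel_params().items():
        print(f"{name} = {value}")
```

The root argument exists only so the function can be pointed at a copy of
the directory (for example one collected from another host).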

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Bandan Das

Dr. David Alan Gilbert dgilb...@redhat.com writes:

 * Paolo Bonzini (pbonz...@redhat.com) wrote:
 
 
 On 10/03/2015 19:21, Bandan Das wrote:
  Paolo Bonzini pbonz...@redhat.com writes:
  
  On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
  I'm seeing something similar; it's very intermittent and generally
  happening right at boot of the guest;   I'm running this on qemu
  head+my postcopy world (but it's happening right at boot before postcopy
  gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
  but hey maybe I'm seeing a different bug.
  
  Probably a tangent but is the qemu trace identical to what Andrey is seeing?
  From a cursory look and my limited understanding, it seems his failure is #GP
  when executing video bios.
  
  Same here on 3.16 + Xeon E5 v3 kernel.
  
  I will try to reproduce this on a v2.
 
 I see several failures, usually mine have suberror 1.  With a 32-VCPU
 guest I can reproduce it roughly half of the time.
 
 Paolo

 while true; do (sleep 5; echo -e '\001cq\n') | \
   /opt/qemu-try-world3/bin/qemu-system-x86_64 -machine pc-i440fx-2.0,accel=kvm \
   -m 1024 -smp 128 -nographic -device sga 2>&1 | tee /tmp/qemu.op; \
   grep "internal error" /tmp/qemu.op -q && break; done

 (and leave about 2mins of runs before declaring good)

 bad: cd2946607b42636d6c8cf6dbf94bce0273507b17
 bad: 041ccc922ee474693a2869d4e3b59e920c739bc0
 bad: 2559db069628981bfdc90637fac5bf1b4f4e8ef5
 bad: 21f5826a04d38e19488f917e1eef22751490c769
 good: e95d24ff40c77fbfd71396834a2eb99375f8bcc4
 good: 7781a492fa5a2eff53d06b25b93f0186ad3226c9
 good: c3edd62851098e6417786193ed9e9341781fcf57
 good: c5c6d7f81a6950d8e32a3b5a0bafd37bfa5a8e88
 good: 73104fd399c6778112f64fe0d439319f24508d9a
 good: 92013cf8ca10adafec9a92deb5df993e7df22cb9
 good: 4478aa768ccefcc5b234c23d035435fd71b932f6
 good: 2.2.0

 [root@virtlab413 qemu-world3]# git bisect bad
 21f5826a04d38e19488f917e1eef22751490c769 is the first bad commit

I can reproduce this on E5-2620 v2 with  David's while true test.
(The emulation failure I mean, not the suberror 2 that Andrey is seeing)
The commit that seems to have introduced this is -

commit 0673b7870063a3affbad9046fb6d385a4e734c19
Author: Kevin O'Connor ke...@koconnor.net
Date:   Sat May 24 10:49:50 2014 -0400

smp: Replace QEMU SMP init assembler code with C; run only in 32bit mode.

Change the multi-processor init code to trampoline into 32bit mode on
each of the additional processors.  Implement an atomic lock so that
each processor performs its initialization serially.

I am not sure what in that change could cause this though..
Also, in my testing, unrestricted_guest=0 avoids the failure.

 commit 21f5826a04d38e19488f917e1eef22751490c769
 Author: Gerd Hoffmann kra...@redhat.com
 Date:   Thu Feb 19 09:33:03 2015 +0100

 seabios: update to 1.8.0 release
 
 'git shortlog 8936dbb2..4c59f5d8' for seabios repo:
 
 David Woodhouse (4):
   Update EFI_COMPATIBILITY16_TABLE to match 0.98 spec update
   build: use -m16 where available instead of asm(.code16gcc)
   romlayout: Use .code16 not .code16gcc
   vgabios: Use .code16 not .code16gcc
 
 Gerd Hoffmann (2):
   add scripts/tarball.sh
   build: set LC_ALL=C
 
 Hannes Reinecke (1):
   megasas: read addional PCI I/O bar
 
 Ian Campbell (1):
   romlayout: Use rep ; nop not rep nop.
 
 Kevin O'Connor (139):
   vgabios: Return from handle_1011() if handler found.
   edd: Move EDD get drive parameters (int 1348) logic from disk.c to 
 block.c.
   edd: Use sectors==-1 to detect removable media.
   edd: Separate out ATA and virtio specific parts of fill_edd().
   cdemu: store internal cdemu fields in standard el-torito spec 
 format.
   Move cdemu call interface and disk_ret helper code to disk.c.
   smm: Replace SMI assembler code with C code.
   smm: Use a C struct to define the layout of the SMM area.
   smp: Replace QEMU SMP init assembler code with C; run only in 32bit 
 mode.
   Don't enable thread preemption during S3 resume vga option rom 
 execution.
   Remove old Bochs bios fixed address string at 0xfff00.
   Move most of the VAR16FIXED() defs to misc.c.
   build: Avoid absolute paths during whole-program compiling.
   Make sure handle_smi() and handle_smp() are compiled out if not 
 enabled.
   Remove the TODO file.
   Abstract reset call (and possible 16bit mode switch) into reset() 
 function.
   build: Remove unused function getSectionsStart() from layoutrom.py.
   build: Extract section visiting logic in layoutrom.py.
   build: Refactor layoutrom.py gc() function.
   build: Use customized entry point for each type of build.
   build: Refactor findInit() function.
   build: Rework getRelocs() to use a hash instead of categories in

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Dr. David Alan Gilbert

* Paolo Bonzini (pbonz...@redhat.com) wrote:
 
 
 On 10/03/2015 19:21, Bandan Das wrote:
  Paolo Bonzini pbonz...@redhat.com writes:
  
  On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
  I'm seeing something similar; it's very intermittent and generally
  happening right at boot of the guest;   I'm running this on qemu
  head+my postcopy world (but it's happening right at boot before postcopy
  gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
  but hey maybe I'm seeing a different bug.
  
  Probably a tangent but is the qemu trace identical to what Andrey is seeing?
  From a cursory look and my limited understanding, it seems his failure is #GP
  when executing video bios.
  
  Same here on 3.16 + Xeon E5 v3 kernel.
  
  I will try to reproduce this on a v2.
 
 I see several failures, usually mine have suberror 1.  With a 32-VCPU
 guest I can reproduce it roughly half of the time.
 
 Paolo

while true; do (sleep 5; echo -e '\001cq\n') | \
  /opt/qemu-try-world3/bin/qemu-system-x86_64 -machine pc-i440fx-2.0,accel=kvm \
  -m 1024 -smp 128 -nographic -device sga 2>&1 | tee /tmp/qemu.op; \
  grep "internal error" /tmp/qemu.op -q && break; done

(and leave about 2mins of runs before declaring good)
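
The same kind of loop can be sketched in Python for use in a bisect script.
This is a hedged variant, not a drop-in replacement: run_until_failure() is a
hypothetical helper, and instead of feeding the monitor escape sequence on
stdin it simply lets each run go for run_seconds and then relies on
subprocess.run() killing the child when the timeout expires.

```python
import subprocess


def run_until_failure(cmd, marker="internal error", max_runs=20, run_seconds=5.0):
    """Run cmd repeatedly; return the 1-based run that printed marker, else None."""
    for attempt in range(1, max_runs + 1):
        try:
            proc = subprocess.run(cmd, stdin=subprocess.DEVNULL,
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.STDOUT,
                                  timeout=run_seconds)
            raw = proc.stdout
        except subprocess.TimeoutExpired as exc:
            # The child was killed at the timeout; keep whatever it printed.
            raw = exc.stdout or b""
        if marker in raw.decode(errors="replace"):
            return attempt
    return None
```

A caller would pass the qemu command line as a list of arguments and treat a
None result as "good" only if max_runs is large enough to cover the couple
of minutes suggested above.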

bad: cd2946607b42636d6c8cf6dbf94bce0273507b17
bad: 041ccc922ee474693a2869d4e3b59e920c739bc0
bad: 2559db069628981bfdc90637fac5bf1b4f4e8ef5
bad: 21f5826a04d38e19488f917e1eef22751490c769
good: e95d24ff40c77fbfd71396834a2eb99375f8bcc4
good: 7781a492fa5a2eff53d06b25b93f0186ad3226c9
good: c3edd62851098e6417786193ed9e9341781fcf57
good: c5c6d7f81a6950d8e32a3b5a0bafd37bfa5a8e88
good: 73104fd399c6778112f64fe0d439319f24508d9a
good: 92013cf8ca10adafec9a92deb5df993e7df22cb9
good: 4478aa768ccefcc5b234c23d035435fd71b932f6
good: 2.2.0

[root@virtlab413 qemu-world3]# git bisect bad
21f5826a04d38e19488f917e1eef22751490c769 is the first bad commit
commit 21f5826a04d38e19488f917e1eef22751490c769
Author: Gerd Hoffmann kra...@redhat.com
Date:   Thu Feb 19 09:33:03 2015 +0100

seabios: update to 1.8.0 release

'git shortlog 8936dbb2..4c59f5d8' for seabios repo:

David Woodhouse (4):
  Update EFI_COMPATIBILITY16_TABLE to match 0.98 spec update
  build: use -m16 where available instead of asm(.code16gcc)
  romlayout: Use .code16 not .code16gcc
  vgabios: Use .code16 not .code16gcc

Gerd Hoffmann (2):
  add scripts/tarball.sh
  build: set LC_ALL=C

Hannes Reinecke (1):
  megasas: read addional PCI I/O bar

Ian Campbell (1):
  romlayout: Use rep ; nop not rep nop.

Kevin O'Connor (139):
  vgabios: Return from handle_1011() if handler found.
  edd: Move EDD get drive parameters (int 1348) logic from disk.c to 
block.c.
  edd: Use sectors==-1 to detect removable media.
  edd: Separate out ATA and virtio specific parts of fill_edd().
  cdemu: store internal cdemu fields in standard el-torito spec 
format.
  Move cdemu call interface and disk_ret helper code to disk.c.
  smm: Replace SMI assembler code with C code.
  smm: Use a C struct to define the layout of the SMM area.
  smp: Replace QEMU SMP init assembler code with C; run only in 32bit 
mode.
  Don't enable thread preemption during S3 resume vga option rom 
execution.
  Remove old Bochs bios fixed address string at 0xfff00.
  Move most of the VAR16FIXED() defs to misc.c.
  build: Avoid absolute paths during whole-program compiling.
  Make sure handle_smi() and handle_smp() are compiled out if not 
enabled.
  Remove the TODO file.
  Abstract reset call (and possible 16bit mode switch) into reset() 
function.
  build: Remove unused function getSectionsStart() from layoutrom.py.
  build: Extract section visiting logic in layoutrom.py.
  build: Refactor layoutrom.py gc() function.
  build: Use customized entry point for each type of build.
  build: Refactor findInit() function.
  build: Rework getRelocs() to use a hash instead of categories in 
layoutrom.py
  build: Keep segmented sections separate until final link step.
  build: Use fileid instead of category to write sections in 
layoutrom.py.
  build: Only export needed fields in LayoutInfo in layoutrom.py.
  build: Get fixed address variables from 32bit compile pass (not 16bit)
  build: Minor - fix comments referring to old tools/ directory.
  xhci: Update the times for usb command timeouts.
  ehci: Update usb command timeouts to use usb_xfer_time()
  uhci: Update usb command timeouts to use usb_xfer_time()
  ohci: Update usb command timeouts to use usb_xfer_time()
  vgabios: Fix broken build resulting from e5749978.
  boot: Change :rom%d boot order rom instance to :rom%x
  Minor - remove stray tab from src/fw/smm.c.
  build: Update

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Paolo Bonzini



On 10/03/2015 19:21, Bandan Das wrote:
 Paolo Bonzini pbonz...@redhat.com writes:
 
 On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
 I'm seeing something similar; it's very intermittent and generally
 happening right at boot of the guest;   I'm running this on qemu
 head+my postcopy world (but it's happening right at boot before postcopy
 gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
 but hey maybe I'm seeing a different bug.
 
 Probably a tangent but is the qemu trace identical to what Andrey is seeing ?
 From a cursory look and my limited understanding, it seems his failure is #GP
 when executing video bios.
 
 Same here on 3.16 + Xeon E5 v3 kernel.
 
 I will try to reproduce this on a v2.

I see several failures, usually mine have suberror 1.  With a 32-VCPU
guest I can reproduce it roughly half of the time.

Paolo
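
Since the reports in this thread mix suberror 1 and suberror 2, a small
helper for sorting captured logs may be useful. It is a sketch only:
suberrors() is a hypothetical function, the message format is taken from the
"KVM internal error. Suberror: N" lines quoted in this thread, and the names
are the KVM_INTERNAL_ERROR_* constants from the Linux <linux/kvm.h> header.

```python
import re

# Names of the KVM_INTERNAL_ERROR_* values defined in <linux/kvm.h>.
KVM_INTERNAL_ERRORS = {
    1: "KVM_INTERNAL_ERROR_EMULATION",
    2: "KVM_INTERNAL_ERROR_SIMUL_EX",
    3: "KVM_INTERNAL_ERROR_DELIVERY_EV",
}

_SUBERROR_RE = re.compile(r"KVM internal error\. Suberror: (\d+)")


def suberrors(log_text):
    """Return [(number, name)] for every suberror line found in log_text."""
    found = []
    for match in _SUBERROR_RE.findall(log_text):
        number = int(match)
        found.append((number, KVM_INTERNAL_ERRORS.get(number, "unknown")))
    return found


if __name__ == "__main__":
    sample = "KVM internal error. Suberror: 2\nextra data[0]: 80d1\n"
    print(suberrors(sample))  # [(2, 'KVM_INTERNAL_ERROR_SIMUL_EX')]
```

Numbers that are not in the table are reported as "unknown" rather than
mapped to a guess.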

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Andrey Korolyov

On Tue, Mar 10, 2015 at 9:16 PM, Dr. David Alan Gilbert
dgilb...@redhat.com wrote:
 * Andrey Korolyov (and...@xdel.ru) wrote:
 On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert
 dgilb...@redhat.com wrote:
  * Andrey Korolyov (and...@xdel.ru) wrote:
  On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov and...@xdel.ru wrote:
   On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das b...@redhat.com wrote:
   Andrey Korolyov and...@xdel.ru writes:
  
   On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov and...@xdel.ru wrote:
   Hello,
  
   recently I've got a couple of shiny new Intel 2620v2s as a future
   replacement for the E5-2620v1, but I have run into a relatively large
   number of emulation errors; all traces look similar to the one below. I am
   running qemu-2.1 on x86 on top of the 3.10 branch for testing purposes but
   can switch to other versions if necessary. Most of the crashes
   happened during a reboot cycle or at the end of an ACPI-based shutdown,
   if this helps. I have no clue what could introduce such
   a mess within the same processor family using identical software, as the
   2620v1 has never had a similar problem. Please let me know if there are
   any additional measures that would make the whole story clearer.
  
   Thanks!
  
   KVM internal error. Suberror: 2
   extra data[0]: 80d1
   extra data[1]: 8b0d
   EAX=0003 EBX= ECX= EDX=
   ESI= EDI= EBP= ESP=6cd4
   EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
   ES =   9300
   CS =f000 000f  9b00
   SS =   9300
   DS =   9300
   FS =   9300
   GS =   9300
   LDT=   8200
   TR =   8b00
   GDT= 000f6e98 0037
   IDT=  03ff
   CR0=0010 CR2= CR3= CR4=
   DR0= DR1= DR2=
   DR3=
   DR6=0ff0 DR7=0400
   EFER=
   Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
   10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
   b8 00 e0 00 00 8e
  
  
   It turns out that these errors are triggered by APICv, which gets
   enabled because of the different feature set. If anyone is interested in
   reproducing/fixing this on 3.10 specifically, it takes about one hundred
   migrations/power state changes for the issue to appear; the guest OS can
   be Linux or Windows.
  
   Are you able to reproduce this on a more recent upstream kernel as well?
  
   Bandan
  
   I'll run a test cycle with 3.18 and a 2603v2 around tomorrow and
   follow up with any reproducible results.
 
  Heh.. the issue is not triggered on the 2603v2 at all, or at least I am
  not able to hit it. Apart from the lower frequency, the only difference
  from the 2620v2 is the Intel Dynamic Acceleration feature. I'd appreciate
  any testing on higher CPU models with the same or a richer feature set.
  The testing itself can be done on either generic 3.10 or RH7 kernels, as
  both of them show this issue. I ran all tests with C-states disabled,
  so I advise doing the same for a first reproduction attempt.
 
  Thanks!
 
  model name  : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
  stepping: 4
  microcode   : 0x416
  cpu MHz : 2100.039
  cache size  : 15360 KB
  siblings: 12
  apicid  : 43
  initial apicid  : 43
  fpu : yes
  fpu_exception   : yes
  cpuid level : 13
  wp  : yes
  flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
  mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
  syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
  rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
  dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
  sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
  rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
  flexpriority ept vpid fsgsbase smep erms
 
  I'm seeing something similar; it's very intermittent and generally
  happening right at boot of the guest;   I'm running this on qemu
  head+my postcopy world (but it's happening right at boot before postcopy
  gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
  but hey maybe I'm seeing a different bug.
 
  Dave

 Yep, looks like we are hitting the same bug - two thirds of my failures
 occurred during a boot/reboot cycle and roughly one third happened in
 the middle of runtime. Which CPU are you using, v0 or v2
 (in other words, is APICv enabled)?

 processor   : 7
 vendor_id   : GenuineIntel
 cpu family  : 6
 model   : 45
 model name  : Intel(R) Xeon(R) CPU E5-2407 0 @ 2.20GHz
 stepping: 7
 microcode   : 0x70d
 cpu MHz :

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Bandan Das

Paolo Bonzini pbonz...@redhat.com writes:

 On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
 I'm seeing something similar; it's very intermittent and generally
 happening right at boot of the guest;   I'm running this on qemu
 head+my postcopy world (but it's happening right at boot before postcopy
 gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
 but hey maybe I'm seeing a different bug.

Probably a tangent but is the qemu trace identical to what Andrey is seeing ?
From a cursory look and my limited understanding, it seems his failure is #GP
when executing video bios.

 Same here on 3.16 + Xeon E5 v3 kernel.

I will try to reproduce this on a v2.

Bandan
 Paolo

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Dr. David Alan Gilbert

* Paolo Bonzini (pbonz...@redhat.com) wrote:
 
 
 On 10/03/2015 19:21, Bandan Das wrote:
  Paolo Bonzini pbonz...@redhat.com writes:
  
  On 10/03/2015 17:57, Dr. David Alan Gilbert wrote:
  I'm seeing something similar; it's very intermittent and generally
  happening right at boot of the guest;   I'm running this on qemu
  head+my postcopy world (but it's happening right at boot before postcopy
  gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
  but hey maybe I'm seeing a different bug.
  
  Probably a tangent but is the qemu trace identical to what Andrey is seeing ?
  From a cursory look and my limited understanding, it seems his failure is #GP
  when executing video bios.
  
  Same here on 3.16 + Xeon E5 v3 kernel.
  
  I will try to reproduce this on a v2.
 
 I see several failures, usually mine have suberror 1.  With a 32-VCPU
 guest I can reproduce it roughly half of the time.

Oh yes, that helps:

while true; do
  (sleep 5; echo -e '\001cq\n') | /opt/qemu-try-world3/bin/qemu-system-x86_64 \
    -machine pc-i440fx-2.3,accel=kvm -m 8192 -smp 20 -nographic -device sga
done

gets it under a minute on my other box; Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
(Running a 3.18)
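
For anyone collecting statistics from a loop like the one above, here is a
minimal sketch that tallies failures by suberror number. The function name
count_suberrors is made up for illustration; it only assumes the
"KVM internal error. Suberror: N" line format seen in the traces in this
thread, and that the QEMU output has been captured to a file or pipe.

```shell
# Hypothetical helper: count "KVM internal error" lines per suberror number.
# Reads captured QEMU output on stdin, prints one "suberror N: count" line
# per suberror seen, sorted; prints nothing if there were no failures.
count_suberrors() {
    grep -o 'KVM internal error\. Suberror: [0-9]*' \
        | awk '{ n[$NF]++ } END { for (s in n) printf "suberror %s: %d\n", s, n[s] }' \
        | sort
}
```

Usage would be something like appending "2>&1 | tee -a run.log" to the qemu
invocation and then running "count_suberrors < run.log" once enough
iterations have gone by.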

Dave

 Paolo
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-10 Thread Paolo Bonzini



On 10/03/2015 19:16, Dr. David Alan Gilbert wrote:
 KVM internal error. Suberror: 1
 emulation failure
 EAX= EBX= ECX= EDX=000fd2bc
 ESI= EDI= EBP= ESP=
 EIP=000fd2c5 EFL=00010007 [-PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =0010   00c09300 DPL=0 DS   [-WA]
 CS =0008   00c09b00 DPL=0 CS32 [-RA]
 SS =0010   00c09300 DPL=0 DS   [-WA]
 DS =0010   00c09300 DPL=0 DS   [-WA]
 FS =0010   00c09300 DPL=0 DS   [-WA]
 GS =0010   00c09300 DPL=0 DS   [-WA]
 LDT=   8200 DPL=0 LDT
 TR =   8b00 DPL=0 TSS32-busy
 GDT= 000f6a80 0037
 IDT= 000f6abe 
 CR0=6011 CR2= CR3= CR4=
 DR0= DR1= DR2= 
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=66 ba bc d2 0f 00 e9 a2 fe f3 90 f0 0f ba 2d 04 ff fb bf 00 72 f3 8b 
 25 00 ff fb bf e8 44 66 ff ff c7 05 04 ff
  fb bf 00 00 00 00 f4 eb fd fa fc 66 b8
 KVM internal error. Suberror: 1
 emulation failure

This is exactly the same as mine.  I'm using upstream QEMU right now,
but I can try to reproduce using RHEL7's too.

Paolo

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-06 Thread Andrey Korolyov

On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das b...@redhat.com wrote:
 Andrey Korolyov and...@xdel.ru writes:

 On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov and...@xdel.ru wrote:
 Hello,

 I recently got a couple of shiny new Intel E5-2620v2s as a future
 replacement for the E5-2620v1, but I have hit a relatively large number
 of emulation errors; all traces look similar to the one below. I am
 running qemu-2.1 on x86 on top of the 3.10 branch for testing, but
 can switch to other versions if necessary. Most of the crashes
 happened during a reboot cycle or at the end of an ACPI-based shutdown,
 if that helps. I have no idea what could introduce such a mess within
 the same processor family running identical software, as the
 2620v1 has never shown a similar problem. Please let me know if there is
 anything else I can do to make the picture clearer.

 Thanks!

 KVM internal error. Suberror: 2
 extra data[0]: 80d1
 extra data[1]: 8b0d
 EAX=0003 EBX= ECX= EDX=
 ESI= EDI= EBP= ESP=6cd4
 EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT= 000f6e98 0037
 IDT=  03ff
 CR0=0010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
 b8 00 e0 00 00 8e


 It turns out that these errors are triggered by APICv, which gets
 enabled because of the different feature set. If anyone is interested in
 reproducing/fixing this on 3.10 specifically, it takes about one hundred
 migrations/power state changes for the issue to appear; the guest OS can
 be Linux or Windows.

 Are you able to reproduce this on a more recent upstream kernel as well ?

 Bandan

I'll run a test cycle with 3.18 and a 2603v2 around tomorrow and
follow up with any reproducible results.

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-06 Thread Bandan Das

Andrey Korolyov and...@xdel.ru writes:

 On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov and...@xdel.ru wrote:
 Hello,

 I recently got a couple of shiny new Intel E5-2620v2s as a future
 replacement for the E5-2620v1, but I have hit a relatively large number
 of emulation errors; all traces look similar to the one below. I am
 running qemu-2.1 on x86 on top of the 3.10 branch for testing, but
 can switch to other versions if necessary. Most of the crashes
 happened during a reboot cycle or at the end of an ACPI-based shutdown,
 if that helps. I have no idea what could introduce such a mess within
 the same processor family running identical software, as the
 2620v1 has never shown a similar problem. Please let me know if there is
 anything else I can do to make the picture clearer.

 Thanks!

 KVM internal error. Suberror: 2
 extra data[0]: 80d1
 extra data[1]: 8b0d
 EAX=0003 EBX= ECX= EDX=
 ESI= EDI= EBP= ESP=6cd4
 EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT= 000f6e98 0037
 IDT=  03ff
 CR0=0010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
 b8 00 e0 00 00 8e


 It turns out that these errors are triggered by APICv, which gets
 enabled because of the different feature set. If anyone is interested in
 reproducing/fixing this on 3.10 specifically, it takes about one hundred
 migrations/power state changes for the issue to appear; the guest OS can
 be Linux or Windows.

Are you able to reproduce this on a more recent upstream kernel as well ?

Bandan

E5-2620v2 - emulation stop error

2015-03-05 Thread Andrey Korolyov

Hello,

I recently got a couple of shiny new Intel E5-2620v2s as a future
replacement for the E5-2620v1, but I have hit a relatively large number
of emulation errors; all traces look similar to the one below. I am
running qemu-2.1 on x86 on top of the 3.10 branch for testing, but
can switch to other versions if necessary. Most of the crashes
happened during a reboot cycle or at the end of an ACPI-based shutdown,
if that helps. I have no idea what could introduce such a mess within
the same processor family running identical software, as the
2620v1 has never shown a similar problem. Please let me know if there is
anything else I can do to make the picture clearer.

Thanks!

KVM internal error. Suberror: 2
extra data[0]: 80d1
extra data[1]: 8b0d
EAX=0003 EBX= ECX= EDX=
ESI= EDI= EBP= ESP=6cd4
EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   9300
CS =f000 000f  9b00
SS =   9300
DS =   9300
FS =   9300
GS =   9300
LDT=   8200
TR =   8b00
GDT= 000f6e98 0037
IDT=  03ff
CR0=0010 CR2= CR3= CR4=
DR0= DR1= DR2=
DR3=
DR6=0ff0 DR7=0400
EFER=
Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
b8 00 e0 00 00 8e

Re: E5-2620v2 - emulation stop error

2015-03-05 Thread Andrey Korolyov

On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov and...@xdel.ru wrote:
 Hello,

 I recently got a couple of shiny new Intel E5-2620v2s as a future
 replacement for the E5-2620v1, but I have hit a relatively large number
 of emulation errors; all traces look similar to the one below. I am
 running qemu-2.1 on x86 on top of the 3.10 branch for testing, but
 can switch to other versions if necessary. Most of the crashes
 happened during a reboot cycle or at the end of an ACPI-based shutdown,
 if that helps. I have no idea what could introduce such a mess within
 the same processor family running identical software, as the
 2620v1 has never shown a similar problem. Please let me know if there is
 anything else I can do to make the picture clearer.

 Thanks!

 KVM internal error. Suberror: 2
 extra data[0]: 80d1
 extra data[1]: 8b0d
 EAX=0003 EBX= ECX= EDX=
 ESI= EDI= EBP= ESP=6cd4
 EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT= 000f6e98 0037
 IDT=  03ff
 CR0=0010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb cd
 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
 b8 00 e0 00 00 8e


It turns out that these errors are triggered by APICv, which gets
enabled because of the different feature set. If anyone is interested in
reproducing/fixing this on 3.10 specifically, it takes about one hundred
migrations/power state changes for the issue to appear; the guest OS can
be Linux or Windows.
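
To check whether APICv is actually on for a given host, one option is to
read the kvm_intel module parameter. This is only a sketch under
assumptions: it presumes the parameter is exposed as enable_apicv under
/sys/module/kvm_intel/parameters (the name may differ between kernel
versions), and show_apicv is a hypothetical helper name, not something
shipped with KVM or QEMU.

```shell
# Hypothetical helper: report the kvm_intel enable_apicv setting.
# $1: parameters directory; defaults to the usual sysfs location (assumed).
show_apicv() {
    dir=${1:-/sys/module/kvm_intel/parameters}
    if [ -r "$dir/enable_apicv" ]; then
        # The kernel reports boolean module parameters as Y or N.
        printf 'enable_apicv=%s\n' "$(cat "$dir/enable_apicv")"
    else
        # Module not loaded, or this kernel uses a different parameter name.
        printf 'enable_apicv=unavailable\n'
    fi
}
```

Comparing the output of "show_apicv" on a host that reproduces the error
with one that does not should show whether the feature-set difference is in
play.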
