Re: [PATCH] KVM: x86: emulate: correct page fault error code for NoWrite instructions
Wanpeng Liwrote: > 2015-11-20 10:52 GMT+08:00 Wanpeng Li : >> Hi Paolo, >> 2015-02-09 17:03 GMT+08:00 Paolo Bonzini : >>> NoWrite instructions (e.g. cmp or test) never set the "write access" >>> bit in the error code, even if one of the operands is treated as a >>> destination. >> >> Sorry to reply to an old commit, btw, could you point out where in SDM >> describe above? > > I see. I don’t understand whether you still need my help, so to clarify: on a page-fault the error code should indicate whether the access was due to a write access. Previously KVM marked it as “write access” for instructions such as test and cmp that do not perform write. Nadav-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: x86: fix eflags state following processor init/reset
Here are my 5 cents. Note that vmx_vcpu_reset calls: vmcs_writel(GUEST_RFLAGS, 0x02); (And the RFLAGS value is not cached by KVM, so no consistency problem should occur.) You may want to change the value into constant or call a wrapper function for setting RFLAGS, but I don’t see something broken in the functionality. Regards, Nadav Wanpeng Liwrote: > Ping, :-) > On 10/21/15 2:28 PM, Wanpeng Li wrote: >> Reference SDM 3.4.3: >> >> Following initialization of the processor (either by asserting the >> RESET pin or the INIT pin), the state of the EFLAGS register is >> 0002H. >> >> However, the eflags fixed bit is not set and other bits are also not >> cleared during the init/reset in kvm. >> >> This patch fix it by set eflags register to 0002H following >> initialization of the processor. >> >> Signed-off-by: Wanpeng Li >> --- >> arch/x86/kvm/vmx.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >> index b680c2e..326f6ea 100644 >> --- a/arch/x86/kvm/vmx.c >> +++ b/arch/x86/kvm/vmx.c >> @@ -4935,6 +4935,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool >> init_event) >> vmx_set_efer(vcpu, 0); >> vmx_fpu_activate(vcpu); >> update_exception_bitmap(vcpu); >> +vmx_set_rflags(vcpu, X86_EFLAGS_FIXED); >> vpid_sync_context(vmx->vpid); >> } > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests] x86: Test watchpoints during emulation
From: Rami Burstein& Anrey IsakovThis adds a test for data and port watchpoints during instruction emulation. Signed-off-by: Rami Burstein Signed-off-by: Andrey Isakov --- This unit tests are based on old kvm-unit-tests version and are only delivered for reference. They are certainly bloated in their current form. --- x86/debug.c | 374 1 file changed, 374 insertions(+) diff --git a/x86/debug.c b/x86/debug.c index 34e56fb..43a2777 100644 --- a/x86/debug.c +++ b/x86/debug.c @@ -11,10 +11,24 @@ #include "libcflat.h" #include "desc.h" +#include "vm.h" + +#define TESTDEV_IO_PORT0xe0 static volatile unsigned long bp_addr[10], dr6[10]; static volatile unsigned int n; static volatile unsigned long value; +static volatile char st1[] = "abcdefghijklmnop"; + +static void set_dr2(void *value) +{ + asm volatile("mov %0,%%dr2" : : "r" (value)); +} + +static void set_dr3(void *value) +{ + asm volatile("mov %0,%%dr3" : : "r" (value)); +} static unsigned long get_dr6(void) { @@ -58,6 +72,360 @@ static void handle_bp(struct ex_regs *regs) bp_addr[0] = regs->rip; } +static void test_port_io_bp() +{ + unsigned char r = 0; + unsigned short port_in = 0; + unsigned int port_din = 0; + + n = 0; + set_dr0((void*)TESTDEV_IO_PORT); + set_dr1((void*)TESTDEV_IO_PORT); + set_dr2((void*)TESTDEV_IO_PORT); + set_dr3((void*)TESTDEV_IO_PORT); + set_dr7(0x00020402); + + // Set DE flag. + write_cr4(read_cr4() | 0x8); + asm volatile("movw %0, %%dx \n\t" "outsb \n\t" + : : "i"((short)TESTDEV_IO_PORT),"S"(st1)); + asm volatile("inb %1, %0\n\t" : "=a"(r) : "i"((short)TESTDEV_IO_PORT)); +io_0: + report("em_IO out_in 1 byte bp through dr0", + n == 2 && bp_addr[n-1] == ((unsigned long)&_0) && + dr6[n-1] == 0x4ff1); + + set_dr7(0x00200408); + asm volatile("movw %0, %%dx \n\t" "outsb \n\t" + : : "i"((short)TESTDEV_IO_PORT), "S"(st1)); + asm volatile("inb %1, %0\n\t" : "=a"(r) + : "i"((short)TESTDEV_IO_PORT)); +io_1: + report("em_IO out_in 1 byte bp through dr1", + n == 4 && bp_addr[n-1] == ((unsigned long)&_1) && + dr6[n-1] == 0x4ff2); + + set_dr7(0x02000420); + asm volatile("movw %0, %%dx \n\t" "outsb \n\t" + : : "i"((short)TESTDEV_IO_PORT), "S"(st1)); + asm volatile("inb %1, %0\n\t" : "=a"(r) + : "i"((short)TESTDEV_IO_PORT)); +io_2: + report("em_IO out_in 1 byte bp through dr2", + n == 6 && bp_addr[n-1] == ((unsigned long)&_2) && + dr6[n-1] == 0x4ff4); + + set_dr7(0x2480); + asm volatile("movw %0, %%dx \n\t" "outsb \n\t" + : : "i"((short)TESTDEV_IO_PORT), "S"(st1)); + asm volatile("inb %1, %0\n\t" : "=a"(r) + : "i"((short)TESTDEV_IO_PORT)); +io_3: + report("em_IO out_in 1 byte bp through dr3", + n == 8 && bp_addr[n-1] == ((unsigned long)&_3) && + dr6[n-1] == 0x4ff8); + + n=0; + set_dr7(0x6480); + asm volatile("movw %0, %%dx \n\t" "outsw \n\t" +: : "i"((short)TESTDEV_IO_PORT), "S"(st1)); + asm volatile("inw %1, %0\n\t" : "=r"(port_in) + : "i"((short)TESTDEV_IO_PORT)); +io_w: + report("em_IO out_in 2 byte bp through dr3", + n == 2 && bp_addr[n-1] == ((unsigned long)&_w) && + dr6[n-1] == 0x4ff8); + + set_dr7(0xe480); + asm volatile("movw %0, %%dx \n\t" "outsl \n\t" + : : "i"((short)TESTDEV_IO_PORT), "S"(st1)); + asm volatile ("insl;" :"+D" (port_din) + : "i" ((short)TESTDEV_IO_PORT) : "memory"); +io_dw: + report("em_IO out_in 4 byte bp through dr3", + n == 4 && bp_addr[n-1] == ((unsigned long)&_dw) + && dr6[n-1] == 0x4ff8); +} +static void test_port_io_aligned_bp(){ + + unsigned char r = 0; + unsigned short port_in = 0; + unsigned int port_din = 0; + + n = 0; + write_cr4(read_cr4() | 0x8); + + set_dr7(0x00020402); + set_dr0((void*)(TESTDEV_IO_PORT )); + asm volatile("movw %0, %%dx \n\t" "outsb \n\t" + : : "i"((short)TESTDEV_IO_PORT),"S"(st1)); + asm volatile("inb %1, %0\n\t" : "=a"(r) + : "i"((short)TESTDEV_IO_PORT)); +aio_0: + report("em_IO aligned out_in 1 byte bp through dr0", + n == 2 &&
Re: [PATCH 2/2] KVM: x86: Guest watchpoints during emulation.
Looking at the patch again, there two points you may want to change, see inline. If you want, I’ll send a v2. Nadav Nadav Amit <na...@cs.technion.ac.il> wrote: > From: Nadav Amit <nadav.a...@gmail.com> > > This adds support for guest data and I/O breakpoints during instruction > emulation. > > Watchpoints are examined during data and io interceptions: segmented_read, > segmented_write, em_in, em_out, segmented_read_std and kvm_fast_pio_out. > > When such a breakpoint is triggered, trap is reported by DB_VECTOR > exception. > > Signed-off-by: Andrey Isakov <andreyisak...@gmail.com> > Signed-off-by: Rami Burstein <ramib...@gmail.com> > Signed-off-by: Nadav Amit <nadav.a...@gmail.com> > --- > arch/x86/include/asm/kvm_emulate.h | 3 ++ > arch/x86/kvm/emulate.c | 32 + > arch/x86/kvm/x86.c | 74 +- > 3 files changed, 92 insertions(+), 17 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_emulate.h > b/arch/x86/include/asm/kvm_emulate.h > index e16466e..f6d5d6c 100644 > --- a/arch/x86/include/asm/kvm_emulate.h > +++ b/arch/x86/include/asm/kvm_emulate.h > @@ -211,6 +211,9 @@ struct x86_emulate_ops { > void (*get_cpuid)(struct x86_emulate_ctxt *ctxt, > u32 *eax, u32 *ebx, u32 *ecx, u32 *edx); > void (*set_nmi_mask)(struct x86_emulate_ctxt *ctxt, bool masked); > + bool (*check_watchpoint)(struct x86_emulate_ctxt *ctxt, > + unsigned long addr, unsigned int length, > + int type); > }; > > typedef u32 __attribute__((vector_size(16))) sse128_t; > diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c > index b372a75..4e91b7b 100644 > --- a/arch/x86/kvm/emulate.c > +++ b/arch/x86/kvm/emulate.c > @@ -24,6 +24,7 @@ > #include "kvm_cache_regs.h" > #include > #include > +#include > #include > #include > > @@ -564,6 +565,17 @@ static int emulate_db(struct x86_emulate_ctxt *ctxt) > return emulate_exception(ctxt, DB_VECTOR, 0, false); > } > > +static void emulate_db_trap(struct x86_emulate_ctxt *ctxt) > +{ > + /* > + * If a fault is later encountered, the exception information will be > + * overridden. Otherwise the trap would be handled after the emulation > + * is completed. > + */ > + (void)emulate_exception(ctxt, DB_VECTOR, 0, false); > + ctxt->have_exception = true; > +} > + > static int emulate_gp(struct x86_emulate_ctxt *ctxt, int err) > { > return emulate_exception(ctxt, GP_VECTOR, err, true); > @@ -776,6 +788,10 @@ static int segmented_read_std(struct x86_emulate_ctxt > *ctxt, > rc = linearize(ctxt, addr, size, false, ); > if (rc != X86EMUL_CONTINUE) > return rc; > + > + if (ctxt->ops->check_watchpoint(ctxt, linear, size, DR_RW_READ)) > + emulate_db_trap(ctxt); > + > return ctxt->ops->read_std(ctxt, linear, data, size, >exception); > } > > @@ -1369,6 +1385,10 @@ static int segmented_read(struct x86_emulate_ctxt > *ctxt, > rc = linearize(ctxt, addr, size, false, ); > if (rc != X86EMUL_CONTINUE) > return rc; > + > + if (ctxt->ops->check_watchpoint(ctxt, linear, size, DR_RW_READ)) > + emulate_db_trap(ctxt); > + > return read_emulated(ctxt, linear, data, size); > } > > @@ -1383,6 +1403,10 @@ static int segmented_write(struct x86_emulate_ctxt > *ctxt, > rc = linearize(ctxt, addr, size, true, ); > if (rc != X86EMUL_CONTINUE) > return rc; > + > + if (ctxt->ops->check_watchpoint(ctxt, linear, size, DR_RW_WRITE)) > + emulate_db_trap(ctxt); > + > return ctxt->ops->write_emulated(ctxt, linear, data, size, >>exception); > } > @@ -3729,11 +3753,19 @@ static int em_in(struct x86_emulate_ctxt *ctxt) >>dst.val)) > return X86EMUL_IO_NEEDED; > > + if (ctxt->ops->check_watchpoint(ctxt, ctxt->src.val, ctxt->dst.bytes, > + DR_RW_PORT)) > + emulate_db_trap(ctxt); > + > return X86EMUL_CONTINUE; > } > > static int em_out(struct x86_emulate_ctxt *ctxt) > { > + if (ctxt->ops->check_watchpoint(ctxt, ctxt->dst.val, ctxt->src.bytes, > + DR_RW_PORT)) > + emulate_db_trap(ctxt); > + > ctxt->ops->pio_out_emulated(ctxt, ctxt->src.bytes, ctxt->dst.val, > >src.v
[PATCH 2/2] KVM: x86: Guest watchpoints during emulation.
From: Nadav Amit <nadav.a...@gmail.com> This adds support for guest data and I/O breakpoints during instruction emulation. Watchpoints are examined during data and io interceptions: segmented_read, segmented_write, em_in, em_out, segmented_read_std and kvm_fast_pio_out. When such a breakpoint is triggered, trap is reported by DB_VECTOR exception. Signed-off-by: Andrey Isakov <andreyisak...@gmail.com> Signed-off-by: Rami Burstein <ramib...@gmail.com> Signed-off-by: Nadav Amit <nadav.a...@gmail.com> --- arch/x86/include/asm/kvm_emulate.h | 3 ++ arch/x86/kvm/emulate.c | 32 + arch/x86/kvm/x86.c | 74 +- 3 files changed, 92 insertions(+), 17 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index e16466e..f6d5d6c 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -211,6 +211,9 @@ struct x86_emulate_ops { void (*get_cpuid)(struct x86_emulate_ctxt *ctxt, u32 *eax, u32 *ebx, u32 *ecx, u32 *edx); void (*set_nmi_mask)(struct x86_emulate_ctxt *ctxt, bool masked); + bool (*check_watchpoint)(struct x86_emulate_ctxt *ctxt, +unsigned long addr, unsigned int length, +int type); }; typedef u32 __attribute__((vector_size(16))) sse128_t; diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index b372a75..4e91b7b 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -24,6 +24,7 @@ #include "kvm_cache_regs.h" #include #include +#include #include #include @@ -564,6 +565,17 @@ static int emulate_db(struct x86_emulate_ctxt *ctxt) return emulate_exception(ctxt, DB_VECTOR, 0, false); } +static void emulate_db_trap(struct x86_emulate_ctxt *ctxt) +{ + /* +* If a fault is later encountered, the exception information will be +* overridden. Otherwise the trap would be handled after the emulation +* is completed. +*/ + (void)emulate_exception(ctxt, DB_VECTOR, 0, false); + ctxt->have_exception = true; +} + static int emulate_gp(struct x86_emulate_ctxt *ctxt, int err) { return emulate_exception(ctxt, GP_VECTOR, err, true); @@ -776,6 +788,10 @@ static int segmented_read_std(struct x86_emulate_ctxt *ctxt, rc = linearize(ctxt, addr, size, false, ); if (rc != X86EMUL_CONTINUE) return rc; + + if (ctxt->ops->check_watchpoint(ctxt, linear, size, DR_RW_READ)) + emulate_db_trap(ctxt); + return ctxt->ops->read_std(ctxt, linear, data, size, >exception); } @@ -1369,6 +1385,10 @@ static int segmented_read(struct x86_emulate_ctxt *ctxt, rc = linearize(ctxt, addr, size, false, ); if (rc != X86EMUL_CONTINUE) return rc; + + if (ctxt->ops->check_watchpoint(ctxt, linear, size, DR_RW_READ)) + emulate_db_trap(ctxt); + return read_emulated(ctxt, linear, data, size); } @@ -1383,6 +1403,10 @@ static int segmented_write(struct x86_emulate_ctxt *ctxt, rc = linearize(ctxt, addr, size, true, ); if (rc != X86EMUL_CONTINUE) return rc; + + if (ctxt->ops->check_watchpoint(ctxt, linear, size, DR_RW_WRITE)) + emulate_db_trap(ctxt); + return ctxt->ops->write_emulated(ctxt, linear, data, size, >exception); } @@ -3729,11 +3753,19 @@ static int em_in(struct x86_emulate_ctxt *ctxt) >dst.val)) return X86EMUL_IO_NEEDED; + if (ctxt->ops->check_watchpoint(ctxt, ctxt->src.val, ctxt->dst.bytes, + DR_RW_PORT)) + emulate_db_trap(ctxt); + return X86EMUL_CONTINUE; } static int em_out(struct x86_emulate_ctxt *ctxt) { + if (ctxt->ops->check_watchpoint(ctxt, ctxt->dst.val, ctxt->src.bytes, + DR_RW_PORT)) + emulate_db_trap(ctxt); + ctxt->ops->pio_out_emulated(ctxt, ctxt->src.bytes, ctxt->dst.val, >src.val, 1); /* Disable writeback. */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3e4d032..ba75f76 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4831,6 +4831,55 @@ static void emulator_set_nmi_mask(struct x86_emulate_ctxt *ctxt, bool masked) kvm_x86_ops->set_nmi_mask(emul_to_vcpu(ctxt), masked); } +static int kvm_vcpu_check_hw_bp(unsigned long addr, u32 type, u32 len, + u32 dr7, unsigned long *db) +{ + u32 dr6 = 0; + int i; + u32 enable, dr7_rw, dr7_len; + unsigned long align_db; + + enable = dr7; + dr7_rw
[PATCH 1/2] KVM: x86: Add DR flag definition for IO watchpoints
From: Rami Burstein <ramib...@gmail.com> Add DR_RW_PROT for I/O watchponts. Signed-off-by: Andrey Isakov <andreyisak...@gmail.com> Signed-off-by: Rami Burstein <ramib...@gmail.com> Signed-off-by: Nadav Amit <nadav.a...@gmail.com> --- arch/x86/include/uapi/asm/debugreg.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/include/uapi/asm/debugreg.h b/arch/x86/include/uapi/asm/debugreg.h index 3c0874d..d815cd4 100644 --- a/arch/x86/include/uapi/asm/debugreg.h +++ b/arch/x86/include/uapi/asm/debugreg.h @@ -37,6 +37,7 @@ #define DR_RW_EXECUTE (0x0) /* Settings for the access types to trap on */ #define DR_RW_WRITE (0x1) +#define DR_RW_PORT (0x2) #define DR_RW_READ (0x3) #define DR_LEN_1 (0x0) /* Settings for data length to trap on */ -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] KVM: x86: Emulated I/O and data breakpoints support
This patch-set adds data and I/O breakpoints support on emulated instructions to KVM. The first patch just adds a missing definition and all the logic is in the second one. These patches were implemented as part of a university project, and then cleaned and rebased on the latest KVM by me. I have some concerns about their impact on performance, so it is possible to add a fast-patch for the case all breakpoints are disabled. In addition, these patches do not handle watchpoints that are set by the host debugger. Nadav Amit (1): KVM: x86: Guest watchpoints during emulation. Rami Burstein (1): KVM: x86: Add DR flag definition for IO watchpoints arch/x86/include/asm/kvm_emulate.h | 3 ++ arch/x86/include/uapi/asm/debugreg.h | 1 + arch/x86/kvm/emulate.c | 32 arch/x86/kvm/x86.c | 74 +++- 4 files changed, 93 insertions(+), 17 deletions(-) -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH kvm-unit-tests] x86: debug: test debug extensions
Sorry for misleading. I forgot there are several issues with the patches, so I cannot send them. If I find the time (and you don’t implement it first), I’ll fix and send them. Nadav Paolo Bonzini <pbonz...@redhat.com> wrote: > > > On 21/09/2015 14:26, Nadav Amit wrote: >> Paolo, >> >> Two students here implemented emulated I/O and data breakpoints support for >> KVM under my supervision. I mistakenly graded their project before they >> actually sent the patches, and at this point (surprisingly) they >> disappeared. The patches are relatively ok and include unit-tests. I also >> ran some Intel tests to check the patches, which they (eventually) passed. >> >> I don’t know if I can find the time to rebase them (they are based on 3.19), >> but let me know if you are interested. > > Yes, I am. Worst case just send them based on 3.19, I owe you enough > bugfixes that I can do the rebase myself. :) > > Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH kvm-unit-tests] x86: debug: test debug extensions
Paolo, Two students here implemented emulated I/O and data breakpoints support for KVM under my supervision. I mistakenly graded their project before they actually sent the patches, and at this point (surprisingly) they disappeared. The patches are relatively ok and include unit-tests. I also ran some Intel tests to check the patches, which they (eventually) passed. I don’t know if I can find the time to rebase them (they are based on 3.19), but let me know if you are interested. Nadav Paolo Bonziniwrote: > Not committing this yet since KVM does not support I/O breakpoints, but > posting > it because it is useful for TCG too. > > TCG fails the tests because it doesn't preserve DRn_FIXED_1 on mov to dr6 and > dr7, and also because it lacks support for ICEBP, but it is easy to fix the > former and disable the ICEBP test. > > Signed-off-by: Paolo Bonzini > --- > x86/debug.c | 49 + > 1 file changed, 49 insertions(+) > > diff --git a/x86/debug.c b/x86/debug.c > index d04de23..a71a8ae 100644 > --- a/x86/debug.c > +++ b/x86/debug.c > @@ -11,6 +11,7 @@ > > #include "libcflat.h" > #include "desc.h" > +#include "processor.h" > > static volatile unsigned long bp_addr[10], dr6[10]; > static volatile unsigned int n; > @@ -71,6 +72,8 @@ int main(int ac, char **av) > handle_exception(DB_VECTOR, handle_db); > handle_exception(BP_VECTOR, handle_bp); > > + write_cr4(read_cr4() | X86_CR4_DE); > + > sw_bp: > asm volatile("int3"); > report("#BP", bp_addr[0] == (unsigned long)&_bp + 1); > @@ -146,5 +149,51 @@ sw_icebp: > bp_addr[0] == (unsigned long)&_icebp + 1 && > dr6[0] == 0x0ff0); > > + n = 0; > + set_dr1((void *)0xe1ul); > + set_dr7(0x0020400a); > + set_dr6(0); > + > + asm volatile( > + "out %ax,$0xe0\n\t" > + "out %al,$0xe1\n\t"); > +hw_wp_out1: > + > + set_dr1((void *)0xe0ul); > + set_dr7(0x00e0400a); > + asm volatile( > + "out %al,$0xe0\n\t"); > +hw_wp_out2: > + > + report("hw I/O port watchpoint (out)", > +n == 3 && > +bp_addr[0] == ((unsigned long)&_wp_out1 - 2) && > +bp_addr[1] == ((unsigned long)&_wp_out1) && > +bp_addr[2] == ((unsigned long)&_wp_out2) && > +dr6[0] == 0x0ff2); > + > + n = 0; > + set_dr1((void *)0xe1ul); > + set_dr7(0x0020400a); > + set_dr6(0); > + > + asm volatile( > + "in $0xe0,%%ax\n\t" > + "in $0xe1,%%al\n\t" : : : "rax"); > +hw_wp_in1: > + > + set_dr1((void *)0xe0ul); > + set_dr7(0x00e0400a); > + > + asm volatile( > + "in $0xe0,%%al\n\t" : : : "rax"); > +hw_wp_in2: > + report("hw I/O port watchpoint (in)", > +n == 3 && > +bp_addr[0] == ((unsigned long)&_wp_in1 - 2) && > +bp_addr[1] == ((unsigned long)&_wp_in1) && > +bp_addr[2] == ((unsigned long)&_wp_in2) && > +dr6[0] == 0x0ff2); > + > return report_summary(); > } > -- > 2.5.0 > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 1/2] target-i386: disable LINT0 after reset
I don’t happen to have a similar platform. On regular qemu/kvm runs with q35, I see APIC_LVT0 is set once to 0x8700 on the BSP - as expected: qemu-system-x86-19345 [011] d... 2583274.503018: kvm_entry: vcpu 0 qemu-system-x86-19345 [011] d... 2583274.503019: kvm_exit: reason APIC_ACCESS rip 0x7ffb8288 info 1350 0 qemu-system-x86-19345 [011] 2583274.503020: kvm_emulate_insn: 0:7ffb8288:c7 05 50 03 e0 fe 00 87 00 00 (prot32) qemu-system-x86-19345 [011] 2583274.503021: kvm_mmio: mmio write len 4 gpa 0xfee00350 val 0x8700 qemu-system-x86-19345 [011] 2583274.503021: kvm_apic: apic_write APIC_LVT0 = 0x8700 If someone sends a trace ( http://www.linux-kvm.org/page/Tracing ) of the failure, I would be happy to assist. Nadav Gerd Hoffmann <kra...@redhat.com> wrote: > On Mi, 2015-09-16 at 07:23 +0200, Jan Kiszka wrote: >> On 2015-09-15 23:19, Alex Williamson wrote: >>> On Mon, 2015-04-13 at 02:32 +0300, Nadav Amit wrote: >>>> Due to old Seabios bug, QEMU reenable LINT0 after reset. This bug is long >>>> gone >>>> and therefore this hack is no longer needed. Since it violates the >>>> specifications, it is removed. >>>> >>>> Signed-off-by: Nadav Amit <na...@cs.technion.ac.il> >>>> --- >>>> hw/intc/apic_common.c | 9 - >>>> 1 file changed, 9 deletions(-) >>> >>> Please see bug: https://bugs.launchpad.net/qemu/+bug/1488363 >>> >>> Is this bug perhaps not as long gone as we thought, or is there >>> something else going on here? Thanks, >> >> I would say, someone needs to check if the SeaBIOS line that is supposed >> to enable LINT0 is actually executed on one of the broken systems and, >> if not, why not. > > There is only one reason (beside miscompiling seabios with > CONFIG_QEMU=n) why seabios would skip acpi initialization, and that is > apic not being present according to cpuid: > >cpuid(1, , , , _features); >if (eax < 1 || !(cpuid_features & CPUID_APIC)) { >// No apic - only the main cpu is present. > > https://www.kraxel.org/cgit/seabios/tree/src/fw/smp.c#n79 > > cheers, > Gerd > > PS: coreboot tripped over this too, fixed just a few days ago. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: nVMX: Don't return error on nested bitmap memory allocation failure
Bandan Das b...@redhat.com wrote: If get_free_page() fails for nested bitmap area, it's evident that we are gonna get screwed anyway but returning failure because we failed allocating memory for a nested structure seems like an unnecessary big hammer. Also, save the call for later; after we are done with other non-nested allocations. Signed-off-by: Bandan Das b...@redhat.com --- arch/x86/kvm/vmx.c | 29 +++-- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index f7b6168..200bc5c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -6039,20 +6039,22 @@ static __init int hardware_setup(void) if (!vmx_msr_bitmap_longmode_x2apic) goto out4; - if (nested) { - vmx_msr_bitmap_nested = - (unsigned long *)__get_free_page(GFP_KERNEL); - if (!vmx_msr_bitmap_nested) - goto out5; - } - vmx_vmread_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_vmread_bitmap) - goto out6; + goto out5; vmx_vmwrite_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL); if (!vmx_vmwrite_bitmap) - goto out7; + goto out6; + + if (nested) { + vmx_msr_bitmap_nested = + (unsigned long *)__get_free_page(GFP_KERNEL); + if (!vmx_msr_bitmap_nested) { + printk(KERN_WARNING +vmx: Failed getting memory for nested bitmap\n); + nested = 0; + } memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); @@ -6073,7 +6075,7 @@ static __init int hardware_setup(void) if (setup_vmcs_config(vmcs_config) 0) { r = -EIO; - goto out8; + goto out7; } if (boot_cpu_has(X86_FEATURE_NX)) @@ -6190,13 +6192,12 @@ static __init int hardware_setup(void) return alloc_kvm_area(); -out8: - free_page((unsigned long)vmx_vmwrite_bitmap); out7: - free_page((unsigned long)vmx_vmread_bitmap); -out6: if (nested) free_page((unsigned long)vmx_msr_bitmap_nested); + free_page((unsigned long)vmx_vmwrite_bitmap); +out6: + free_page((unsigned long)vmx_vmread_bitmap); out5: free_page((unsigned long)vmx_msr_bitmap_longmode_x2apic); out4: free_page appears to check whether the address is zero before it actually frees the page. Perhaps it is better to leverage this behaviour to remove all the outX and simplify the code. Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Guest handling of IA32_DEBUGCTL MSR
It seems strange that the guest is allowed to set IA32_DEBUGCTL MSR for the nested VM and get this value to the physical IA32_DEBUGCTL (see prepare_vmcs02), while it cannot set IA32_DEBUGCTL for itself (see kvm_set_msr_common). Am I missing something? Regards, Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Guest handling of IA32_DEBUGCTL MSR
Jan Kiszka jan.kis...@siemens.com wrote: Am 2015-04-28 um 13:43 schrieb Paolo Bonzini: On 28/04/2015 13:42, Nadav Amit wrote: It seems strange that the guest is allowed to set IA32_DEBUGCTL MSR for the nested VM and get this value to the physical IA32_DEBUGCTL (see prepare_vmcs02), while it cannot set IA32_DEBUGCTL for itself (see kvm_set_msr_common). Am I missing something? No, it makes no sense. Are you sure that vmx is not allowing direct access to that MSR while in guest mode? We do save/restore it on all Intel CPUs, see setup_vmcs_config. Not sure about the AMD situation, though. Unless you explicitly disable the interception of the MSR in the bitmap, you trap WRMSR to this MSR. I guess the original intention of saving/restoring it was to address situations in which the CPU itself implicitly changes it (e.g., clearing LBR when “freezing LBR on PMI” is set). I think that these situations do not apply when the MSR is zeroed, as it happens to be in KVM. Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] KVM: x86: Fix update RCX/RDI/RSI on REP-string
When REP-string instruction is preceded with an address-size prefix, ECX/EDI/ESI are used as the operation counter and pointers. When they are updated, the high 32-bits of RCX/RDI/RSI are cleared, similarly to the way they are updated on every 32-bit register operation. Fix it. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c1bc650..296d58e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -524,13 +524,9 @@ static void masked_increment(ulong *reg, ulong mask, int inc) static inline void register_address_increment(struct x86_emulate_ctxt *ctxt, int reg, int inc) { - ulong mask; + ulong *preg = reg_rmw(ctxt, reg); - if (ctxt-ad_bytes == sizeof(unsigned long)) - mask = ~0UL; - else - mask = ad_mask(ctxt); - masked_increment(reg_rmw(ctxt, reg), mask, inc); + assign_register(preg, *preg + inc, ctxt-ad_bytes); } static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc) -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] KVM: x86: Fix REP-string effect on RCX/RSI/RDI
This patch-set fixes KVM behavior when handling a REP-string instruction that runs with an address-size of 32-bit. In this case ECX/EDI/ESI are used as counter and pointers, and the high 32-bits should be cleared. The first patch handles with the simple case. The second one handles the corner-case in which ECX is initially zero. It appears that Intel and AMD behave differently in this case (and some experiments suggest even different Intel generations act differently), and I could not find any documentation that describes it. Yet, the behavior of INS/OUTS can be observed by the guest and VMware appears to get it right. Thanks for reviewing the patches. Nadav Amit (2): KVM: x86: Fix update RCX/RDI/RSI on REP-string KVM: x86: Fix zero iterations REP-string arch/x86/kvm/emulate.c | 33 +++-- 1 file changed, 27 insertions(+), 6 deletions(-) -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] KVM: x86: Fix zero iterations REP-string
When a REP-string is executed in 64-bit mode with an address-size prefix, ECX/EDI/ESI are used as counter and pointers. When ECX is initially zero, Intel CPUs clear the high 32-bits of RCX, and recent Intel CPUs update the high bits of the pointers in MOVS/STOS. This behavior is specific to Intel according to few experiments. As one may guess, this is an undocumented behavior. Yet, it is observable in the guest, since at least VMX traps REP-INS/OUTS even when ECX=0. Note that VMware appears to get it right. The behavior can be observed using the following code: #include stdio.h #define LOW_MASK (0xull) #define ALL_MASK (0xull) #define TEST(opcode) \ do {\ asm volatile(.byte 0xf2 \n\t .byte 0x67 \n\t .byte opcode \n\t \ : =S(s), =c(c), =D(d) \ : S(ALL_MASK), c(LOW_MASK), D(ALL_MASK)); \ printf(opcode %s rcx=%llx rsi=%llx rdi=%llx\n,\ opcode, c, s, d); \ } while(0) void main() { unsigned long long s, d, c; iopl(3); TEST(0x6c); TEST(0x6d); TEST(0x6e); TEST(0x6f); TEST(0xa4); TEST(0xa5); TEST(0xa6); TEST(0xa7); TEST(0xaa); TEST(0xab); TEST(0xae); TEST(0xaf); } Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 25 + 1 file changed, 25 insertions(+) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 296d58e..e726c50 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2591,6 +2591,30 @@ static bool emulator_io_permited(struct x86_emulate_ctxt *ctxt, return true; } +static void string_registers_quirk(struct x86_emulate_ctxt *ctxt) +{ + /* +* Intel CPUs mask the counter and pointers in quite strange +* manner when ECX is zero due to REP-string optimizations. +*/ +#ifdef CONFIG_X86_64 + if (ctxt-ad_bytes != 4 || !vendor_intel(ctxt)) + return; + + *reg_write(ctxt, VCPU_REGS_RCX) = 0; + + switch (ctxt-b) { + case 0xa4: /* movsb */ + case 0xa5: /* movsd/w */ + *reg_rmw(ctxt, VCPU_REGS_RSI) = (u32)-1; + /* fall through */ + case 0xaa: /* stosb */ + case 0xab: /* stosd/w */ + *reg_rmw(ctxt, VCPU_REGS_RDI) = (u32)-1; + } +#endif +} + static void save_state_to_tss16(struct x86_emulate_ctxt *ctxt, struct tss_segment_16 *tss) { @@ -5004,6 +5028,7 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) if (ctxt-rep_prefix (ctxt-d String)) { /* All REP prefixes have the same first termination condition */ if (address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) == 0) { + string_registers_quirk(ctxt); ctxt-eip = ctxt-_eip; ctxt-eflags = ~X86_EFLAGS_RF; goto done; -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: x86: Fix cmpxchg with two 32-bit registers
The emulation of CMPXCHG with two register operands in 64-bit mistakenly masks the high 32-bits as it performs assignment. Fix it. Fixes: 2fcf5c8ae244b4c298d2111a288d410a719ac626 Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 630bcb0..84c132b 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2242,6 +2242,8 @@ static int em_cmpxchg(struct x86_emulate_ctxt *ctxt) ctxt-src.val = ctxt-dst.orig_val; /* Create write-cycle to dest by writing the same value */ ctxt-dst.val = ctxt-dst.orig_val; + if (ctxt-dst.type != OP_MEM) + ctxt-dst.type = OP_NONE; } return X86EMUL_CONTINUE; } -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM x86 emulator design question
Hi, I would appreciate if someone briefly explains the design choice that leaded to separating the x86 emulator from the rest of x86 code, i.e., making it oblivious to VCPUs and using the x86_emulate_ops as an interface. Thanks, Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: x86: Mask DR7 correctly on task-switch while debugging
If the host sets hardware breakpoints to debug the guest, and a task-switch occurs in the guest, the architectural DR7 will not be updated. The effective DR7 would be updated instead. This fix uses the standard DR setting mechanism instead of the one that was previously used. As a bonus, the update of DR7 will now be effective for AMD as well. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/vmx.c | 3 --- arch/x86/kvm/x86.c | 3 +++ 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index f7a0a7f..8f731c0 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5703,9 +5703,6 @@ static int handle_task_switch(struct kvm_vcpu *vcpu) return 0; } - /* clear all local breakpoint enable flags */ - vmcs_writel(GUEST_DR7, vmcs_readl(GUEST_DR7) ~0x155); - /* * TODO: What about debug traps on tss switch? * Are we supposed to inject them and update dr6? diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2046be4..a170c35 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6816,6 +6816,9 @@ int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index, kvm_rip_write(vcpu, ctxt-eip); kvm_set_rflags(vcpu, ctxt-eflags); + ret = __kvm_set_dr(vcpu, 7, vcpu-arch.dr7 ~(DR_LOCAL_ENABLE_MASK | + DR_LOCAL_SLOWDOWN)); + WARN_ON(ret != 0); kvm_make_request(KVM_REQ_EVENT, vcpu); return EMULATE_DONE; } -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] KVM: x86: Fix DR7 mask on task-switch while debugging
If the host sets hardware breakpoints to debug the guest, and a task-switch occurs in the guest, the architectural DR7 will not be updated. The effective DR7 would be updated instead. This fix puts the DR7 update during task-switch emulation, so it now uses the standard DR setting mechanism instead of the one that was previously used. As a bonus, the update of DR7 will now be effective for AMD as well. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- v1 - v2: - Move the setting to emulate.c instead of x86.c - Shorten title --- arch/x86/kvm/emulate.c | 6 +- arch/x86/kvm/vmx.c | 3 --- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 630bcb0..4a4555a 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -25,6 +25,7 @@ #include linux/module.h #include asm/kvm_emulate.h #include linux/stringify.h +#include asm/debugreg.h #include x86.h #include tss.h @@ -2849,7 +2850,7 @@ static int emulator_do_task_switch(struct x86_emulate_ctxt *ctxt, ulong old_tss_base = ops-get_cached_segment_base(ctxt, VCPU_SREG_TR); u32 desc_limit; - ulong desc_addr; + ulong desc_addr, dr7; /* FIXME: old_tss_base == ~0 ? */ @@ -2934,6 +2935,9 @@ static int emulator_do_task_switch(struct x86_emulate_ctxt *ctxt, ret = em_push(ctxt); } + ops-get_dr(ctxt, 7, dr7); + ops-set_dr(ctxt, 7, dr7 ~(DR_LOCAL_ENABLE_MASK | DR_LOCAL_SLOWDOWN)); + return ret; } diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index f7a0a7f..8f731c0 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5703,9 +5703,6 @@ static int handle_task_switch(struct kvm_vcpu *vcpu) return 0; } - /* clear all local breakpoint enable flags */ - vmcs_writel(GUEST_DR7, vmcs_readl(GUEST_DR7) ~0x155); - /* * TODO: What about debug traps on tss switch? * Are we supposed to inject them and update dr6? -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: XP machine freeze
Brad Campbell lists2...@fnarfbargle.com wrote: On 13/04/15 22:02, Paolo Bonzini wrote: On 13/04/2015 14:45, Brad Campbell wrote: G'day Paolo, Yes, on AMD and I've tried hard to reproduce it on Intel and been unable to thus far. Now you mention it may be AMD specific, I have a spare motherboard and processor sitting in a drawer. I'll bolt it together tomorrow and see if I can reproduce it on another AMD machine. Two machines should let me test it twice as fast. I got a fail this afternoon, so I'm due to reboot tonight. I'll just revert that one suspect commit from a known bad kernel and see if that cleans it up. If not then I'll work through the remainder of the information in your mail. I really appreciate the attention you've paid to this, it has been a frustrating bug for me because I'm in a position of not knowing what I don't know, and obviously doing something wrong in very long bisection processes. Actually, if you have time to change your course of action, please revert the one that Nadav pointed out (f210f7572bed, KVM: x86: Fix lost interrupt on irr_pending race) or cherry-pick it on top of 3.17. Paolo Ok, I think we have a winner. Patch manually plopped on top of vanilla 3.17. It has never gone for anywhere near this long on a bad kernel. brad@srv:~$ uptime 23:24:48 up 6 days, 1:01, 3 users, load average: 1.48, 1.95, 2.48 So this patch went into the kernel during the 3.19 release cycle? Affected kernels 3.16-3.18? Actually, the original bug seemed to be introduced by commit 33e4c68656a2e461b296ce714ec322978de85412 KVM: Optimize searching for highest IRR”. So the bug goes all the way back to 2.6.32. The race that this patch fixes just became more apparent (i.e., likely to happen) on 3.16. It is fixed in 3.19. I guess Paolo would push it to stable now. Right? Regards, Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: x86: Mask DR7 correctly on task-switch while debugging
Paolo Bonzini pbonz...@redhat.com wrote: On 19/04/2015 14:18, Nadav Amit wrote: If the host sets hardware breakpoints to debug the guest, and a task-switch occurs in the guest, the architectural DR7 will not be updated. The effective DR7 would be updated instead. This fix uses the standard DR setting mechanism instead of the one that was previously used. As a bonus, the update of DR7 will now be effective for AMD as well. Is there a reason not to do it in emulate.c instead? Not that I can think of. I will make a new version. Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 4/4] KVM: x86: Clear CR2 on VCPU reset
Wanpeng Li wanpeng...@linux.intel.com wrote: Hi Nadav, On Thu, Apr 02, 2015 at 03:10:38AM +0300, Nadav Amit wrote: CR2 is not cleared as it should after reset. See Intel SDM table named IA-32 Processor States Following Power-up, Reset, or INIT. How you trigger the reset instead of the Power-up one? I sent an IPI of INIT for the KVM “reset” flow. I posted a unit-test: http://www.spinics.net/lists/kvm/msg115525.html The actual reset is handled by qemu, but KVM is still able to introduce bugs in it, as it did in not reseting DR0-DR3. Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3] KVM: x86: INIT and reset sequences are different
x86 architecture defines differences between the reset and INIT sequences. INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU, MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP. References (from Intel SDM): If the MP protocol has completed and a BSP is chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be repeated. [8.4.2: MP Initialization Protocol Requirements and Restrictions] [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT] If the processor is reset by asserting the INIT# pin, the x87 FPU state is not changed. [9.2: X87 FPU INITIALIZATION] The state of the local APIC following an INIT reset is the same as it is after a power-up or hardware reset, except that the APIC ID and arbitration ID registers are not affected. [10.4.7.3: Local APIC State After an INIT Reset (“Wait-for-SIPI” State)] Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- v3: - Leave EFER unchanged on INIT. Instead, set cr0 correctly so vmx_set_cr0 would recognize that paging was changed from on to off and clear LMA. - Clean the surrounding from unnecassary indirection of vmx-vcpu. - Change svm similarly to vmx (UNTESTED). v2: - Same as v1 (was part of a patch-set that was modified due to missing tests) --- arch/x86/include/asm/kvm_host.h | 6 ++--- arch/x86/kvm/lapic.c| 11 + arch/x86/kvm/lapic.h| 2 +- arch/x86/kvm/svm.c | 27 +++--- arch/x86/kvm/vmx.c | 51 +++-- arch/x86/kvm/x86.c | 17 -- 6 files changed, 63 insertions(+), 51 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f80ad59..3a19e30 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -711,7 +711,7 @@ struct kvm_x86_ops { /* Create, but do not attach this VCPU */ struct kvm_vcpu *(*vcpu_create)(struct kvm *kvm, unsigned id); void (*vcpu_free)(struct kvm_vcpu *vcpu); - void (*vcpu_reset)(struct kvm_vcpu *vcpu); + void (*vcpu_reset)(struct kvm_vcpu *vcpu, bool init_event); void (*prepare_guest_switch)(struct kvm_vcpu *vcpu); void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu); @@ -1001,7 +1001,7 @@ void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id); void kvm_inject_nmi(struct kvm_vcpu *vcpu); -int fx_init(struct kvm_vcpu *vcpu); +int fx_init(struct kvm_vcpu *vcpu, bool init_event); void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes); @@ -1145,7 +1145,7 @@ int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v); int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu); int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu); int kvm_cpu_get_interrupt(struct kvm_vcpu *v); -void kvm_vcpu_reset(struct kvm_vcpu *vcpu); +void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event); void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu); void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm, unsigned long address); diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index fe2d89e..a91fb2f 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1557,7 +1557,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) } -void kvm_lapic_reset(struct kvm_vcpu *vcpu) +void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) { struct kvm_lapic *apic; int i; @@ -1571,7 +1571,8 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu) /* Stop the timer in case it's a reset to an active apic */ hrtimer_cancel(apic-lapic_timer.timer); - kvm_apic_set_id(apic, vcpu-vcpu_id); + if (!init_event) + kvm_apic_set_id(apic, vcpu-vcpu_id); kvm_apic_set_version(apic-vcpu); for (i = 0; i APIC_LVT_NUM; i++) @@ -1713,7 +1714,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu) APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE); static_key_slow_inc(apic_sw_disabled.key); /* sw disabled at reset */ - kvm_lapic_reset(vcpu); + kvm_lapic_reset(vcpu, false); kvm_iodevice_init(apic-dev, apic_mmio_ops); return 0; @@ -2047,8 +2048,8 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu) pe = xchg(apic-pending_events, 0); if (test_bit(KVM_APIC_INIT, pe)) { - kvm_lapic_reset(vcpu); - kvm_vcpu_reset(vcpu); + kvm_lapic_reset(vcpu, true); + kvm_vcpu_reset(vcpu, true); if (kvm_vcpu_is_bsp(apic-vcpu)) vcpu-arch.mp_state = KVM_MP_STATE_RUNNABLE; else diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 9d28383..793f761 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -48,7 +48,7 @@ int
Re: [PATCH] KVM: x86: Support for disabling quirks
Thanks. If you want a test-case you can apply/try the following on top of the previous kvm-unit-tests patch-set I sent ( http://www.spinics.net/lists/kvm/msg115525.html ). Regards, Nadav -- 8 -- From: Nadav Amit na...@cs.technion.ac.il Subject: [PATCH kvm-unit-tests] x86: Test LINT0 is disabled after reset Requires x2APIC in order to easily save LINT0 during 16-bit code. For the test to pass a fix for qemu is required. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- x86/init.c| 13 - x86/unittests.cfg | 1 + 2 files changed, 13 insertions(+), 1 deletion(-) diff --git a/x86/init.c b/x86/init.c index 2d7ea99..23079ad 100644 --- a/x86/init.c +++ b/x86/init.c @@ -70,6 +70,7 @@ static struct expected_state expected[] = { #define cr2 (*(volatile int*)0x2010) #define sysenter_eip (*(volatile int*)0x2014) #define boot_apic_id (*(volatile int *)0x2018) +#define lvt0 (*(volatile int *)0x201c) static void set_test_regs(void) { @@ -94,7 +95,10 @@ static bool check_test_regs(bool init) printf(wrong sysenter_eip msr: %x\n, sysenter_eip); error = true; } - + if (lvt0 != 0x1) { + printf(wrong lvt0 value: %x\n, lvt0); + error = true; + } return error; } @@ -240,6 +244,13 @@ asm ( cpuid\n shrl $24, %ebx\n mov %ebx, %cs:0x2018\n// apic_id + mov $0x1b, %ecx\n // IA32_APIC_BASE + rdmsr\n + or $0x400, %eax\n + wrmsr\n // Enabling x2apic + mov $0x835, %ecx\n + rdmsr\n +mov %eax, %cs:0x201c\n// lvt0 mov $0x0f, %al\n // rtc_out(0x0f, 0x00); out %al, $0x70\n mov $0x00, %al\n diff --git a/x86/unittests.cfg b/x86/unittests.cfg index 2d25801..0d1a42b 100644 --- a/x86/unittests.cfg +++ b/x86/unittests.cfg @@ -86,6 +86,7 @@ arch = x86_64 [init] file = init.flat smp = 2 +extra_params = -cpu qemu64,+x2apic [msr] file = msr.flat -- 1.9.1 Paolo Bonzini pbonz...@redhat.com wrote: On 13/04/2015 00:53, Nadav Amit wrote: Introducing KVM_CAP_DISABLE_QUIRKS for disabling x86 quirks that were previous created in order to overcome QEMU issues. Those issue were mostly result of invalid VM BIOS. Currently there are two quirks that can be disabled: 1. KVM_QUIRK_LINT0_REENABLED - LINT0 was enabled after boot 2. KVM_QUIRK_CD_NW_CLEARED - CD and NW are cleared after boot These two issues are already resolved in recent releases of QEMU, and would therefore be disabled by QEMU. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- Documentation/virtual/kvm/api.txt | 3 ++- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/include/uapi/asm/kvm.h | 3 +++ arch/x86/kvm/lapic.c | 5 +++-- arch/x86/kvm/svm.c| 3 ++- arch/x86/kvm/x86.c| 29 + include/uapi/linux/kvm.h | 1 + 7 files changed, 42 insertions(+), 4 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index bc9f6fe..3931221 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -959,7 +959,8 @@ documentation when it pops into existence). 4.37 KVM_ENABLE_CAP Capability: KVM_CAP_ENABLE_CAP, KVM_CAP_ENABLE_CAP_VM -Architectures: ppc, s390 +Architectures: x86 (only KVM_CAP_ENABLE_CAP_VM), + mips (only KVM_CAP_ENABLE_CAP), ppc, s390 Type: vcpu ioctl, vm ioctl (with KVM_CAP_ENABLE_CAP_VM) Parameters: struct kvm_enable_cap (in) Returns: 0 on success; -1 on error diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index dea2e7e..f80ad59 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -635,6 +635,8 @@ struct kvm_arch { #endif bool boot_vcpu_runs_old_kvmclock; + +u64 disabled_quirks; }; struct kvm_vm_stat { diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index d7dcef5..2fec75e 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -345,4 +345,7 @@ struct kvm_xcrs { struct kvm_sync_regs { }; +#define KVM_QUIRK_LINT0_REENABLED (1 0) +#define KVM_QUIRK_CD_NW_CLEARED (1 1) + #endif /* _ASM_X86_KVM_H */ diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 4a6e58a..fe2d89e 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1577,8 +1577,9 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu) for (i = 0; i APIC_LVT_NUM; i++) apic_set_reg(apic, APIC_LVTT + 0x10 * i, APIC_LVT_MASKED); apic-lapic_timer.timer_mode = 0; -apic_set_reg(apic, APIC_LVT0, - SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); +if (!(vcpu-kvm-arch.disabled_quirks KVM_QUIRK_LINT0_REENABLED)) +apic_set_reg(apic, APIC_LVT0
Re: [Qemu-devel] [PATCH 2/2] target-i386: kvm: Disable KVM quirks
Paolo Bonzini pbonz...@redhat.com wrote: On 13/04/2015 16:17, Nadav Amit wrote: Paolo Bonzini pbonz...@redhat.com wrote: On 13/04/2015 01:32, Nadav Amit wrote: +if (kvm_check_extension(s, KVM_CAP_ENABLE_CAP_VM)) { The right capability to check here is KVM_CAP_DISABLE_QUIRKS, not KVM_CAP_ENABLE_CAP_VM. Paolo +ret = kvm_vm_enable_cap(s, KVM_CAP_DISABLE_QUIRKS, 0, +KVM_QUIRK_LINT0_DISABLED | +KVM_QUIRK_CD_NW_CLEARED); +if (ret 0) { +return ret; +} +} Of course… Do you want a v2 now, later (after 4.2), or would you change it yourself? Later, close to 4.2 but not necessarily after it. Anyhow, in that case the KVM patch is also wrong (not reporting KVM_CAP_DISABLE_QUIRKS is supported). I don’t want to spam, so I’ll run some tests and resubmit. Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 2/2] target-i386: kvm: Disable KVM quirks
Paolo Bonzini pbonz...@redhat.com wrote: On 13/04/2015 01:32, Nadav Amit wrote: +if (kvm_check_extension(s, KVM_CAP_ENABLE_CAP_VM)) { The right capability to check here is KVM_CAP_DISABLE_QUIRKS, not KVM_CAP_ENABLE_CAP_VM. Paolo +ret = kvm_vm_enable_cap(s, KVM_CAP_DISABLE_QUIRKS, 0, +KVM_QUIRK_LINT0_DISABLED | +KVM_QUIRK_CD_NW_CLEARED); +if (ret 0) { +return ret; +} +} Of course… Do you want a v2 now, later (after 4.2), or would you change it yourself? Thanks, Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: XP machine freeze
Paolo, I hope I am not misleading or interrupting, and I am obviously very biased — but couldn’t it be related to the issue that patch f210f7572bed (KVM: x86: Fix lost interrupt on irr_pending race”) deals with? I got this issue first when I upgraded to 3.17 in my testing environment, since apparently a race got worse due to patch 56cc2406d68c. Did anyone try 3.19 that has this fix? Regards, Nadav Paolo Bonzini pbonz...@redhat.com wrote: On 13/04/2015 06:07, Brad Campbell wrote: On 31/03/15 05:11, Paolo Bonzini wrote: On 22/03/2015 16:31, Brad Campbell wrote: No help I'm afraid, but at least I can conclusively say that 3.16 is good, and 3.17 is bad. Can you try more specifically around the first KVM pull request? That would be between c9b88e958182 (presumed good) and 8533ce727188 (presumed bad)? G'day Paolo. I can confirm that the fault appears to lie between good and bad as specified above. Bad failed before 48 hours, good ran for 143 hours. I'm bisecting now. Thanks! Remember to bisect only with arch/x86/kvm. Also: 1) Brad, I see you are on AMD. Have you ever reproduced it on Intel? Saso, are you on AMD as well? If so, the most likely culprit is this: commit 6addfc42992be4b073c39137ecfdf4b2aa2d487f Author: Paolo Bonzini pbonz...@redhat.com Date: Thu Mar 27 11:29:28 2014 +0100 KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation Despite the provisions to emulate up to 130 consecutive instructions, in practice KVM will emulate just one before exiting handle_invalid_guest_state, because x86_emulate_instruction always sets KVM_REQ_EVENT. However, we only need to do this if an interrupt could be injected, which happens a) if an interrupt shadow bit (STI or MOV SS) has gone away; b) if the interrupt flag has just been set (other instructions than STI can set it without enabling an interrupt shadow). This cuts another 700-900 cycles from the cost of emulating an instruction (measured on a Sandy Bridge Xeon: 1650-2600 cycles before the patch on kvm-unit-tests, 925-1700 afterwards). Signed-off-by: Paolo Bonzini pbonz...@redhat.com I would first try this one, and see if it is bad. Radim, do you think this could cause a missed interrupt injection after Windows does a TPR write? 2) For bisection feel free to git bisect skip the following: 03916db9348c079d8d214f971cc114bb51c6b869 Replace NR_VMX_MSR with its definition 9a2a05b9ed618b1bb6d4cbec0c2e1f80d6636609 KVM: nVMX: clean up nested_release_vmcs12 and code around it 4fa7734c62cdd8c07edd54fa5a5e91482273071a KVM: nVMX: fix lifetime issues for vmcs02 c9cdd085bb75226879fd468b88e2e7eb467325b7 KVM: x86: Defining missing x86 vectors 0123be429fef40f067e5b1811576c3994229f59e KVM: x86: Assertions to check no overrun in MSR lists 296f047502f1b3ddfd63adbc192624ce80740081 KVM: vmx: remove duplicate vmx_mpx_supported() prototype 963fee1656603ce2e91ebb988cd5a92f2af41369 KVM: nVMX: Fix virtual interrupt delivery injection 6cbc5f5a80a9ae5a80bc81efc574b5a85bfd4a84 KVM: nSVM: Set correct port for IOIO interception evaluation 6493f1574e898b46370e2b2315836d76a1980f2c KVM: nSVM: Fix IOIO size reported on emulation 9bf418335e24da995ea682a028926d7e1036be6f KVM: nSVM: Fix IOIO bitmap evaluation 62baf44cad3bc6b37115cc21e4228fe53d4f3474 KVM: nSVM: Do not report CLTS via SVM_EXIT_WRITE_CR0 to L1 5381417f6a51293e7b8af1eb18aefa5d47976a71 KVM: nVMX: Fix returned value of MSR_IA32_VMX_VMCS_ENUM 2996fca0690f03a5220203588f4a0d8c5acba2b0 KVM: nVMX: Allow to disable VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS 560b7ee12ca5e1ebc1675d7eb4008bb22708277a KVM: nVMX: Fix returned value of MSR_IA32_VMX_PROCBASED_CTLS 3dcdf3ec6e48d918741ea11349d4436d0c5aac93 KVM: nVMX: Allow to disable CR3 access interception 3dbcd8da7b564194f93271b003a1c46ef404cbdb KVM: nVMX: Advertise support for MSR_IA32_VMX_TRUE_*_CTLS bc39c4db7110f88f338cbbabe53d3e43c7400a59 arch/x86/kvm/vmx.c: use PAGE_ALIGNED instead of IS_ALIGNED(PAGE_SIZE e4aa5288ff07766d101751de9a8420d666c61735 KVM: x86: Fix constant value of VM_{EXIT_SAVE,ENTRY_LOAD}_DEBUG_CONTROLS 42cbc04fd3b5e3f9b011bf9fa3ce0b3d1e10b58b x86/kvm: Resolve shadow warnings in macro expansion b55a8144d1807f9e74c51cb584f0dd198483d86c x86/kvm: Resolve shadow warning from min macro 98eff52ab5c0ff5cb96940a93e99a1aeb2f11c89 KVM: x86: Fix lapic.c debug prints 9f6226a762c7ae02f6a23a3d4fc552dafa57ea23 arch: x86: kvm: x86.c: Cleaning up variable is set more than once 80112c89ed872c725e7dc39ccf6c37d1a585e161 KVM: Synthesize G bit for all segments. 27e6fb5dae2819d17f38dc9224692b771e989981 KVM: vmx: vmx instructions handling does not consider cs.l bdc907222c5e4edd848da0c031deb55b59f1cf9a KVM: emulate: fix harmless typo in MMX decoding 10e38fc7cabc668738e6a7b7b57cbcddb2234440 KVM: x86: Emulator flag for instruction that only support 16-bit addresses in real mode
[Qemu-devel][PATCH 1/2] target-i386: disable LINT0 after reset
Due to old Seabios bug, QEMU reenable LINT0 after reset. This bug is long gone and therefore this hack is no longer needed. Since it violates the specifications, it is removed. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- hw/intc/apic_common.c | 9 - 1 file changed, 9 deletions(-) diff --git a/hw/intc/apic_common.c b/hw/intc/apic_common.c index 042e960..d38d24b 100644 --- a/hw/intc/apic_common.c +++ b/hw/intc/apic_common.c @@ -243,15 +243,6 @@ static void apic_reset_common(DeviceState *dev) info-vapic_base_update(s); apic_init_reset(dev); - -if (bsp) { -/* - * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization - * time typically by BIOS, so PIC interrupt can be delivered to the - * processor when local APIC is enabled. - */ -s-lvt[APIC_LVT_LINT0] = 0x700; -} } /* This function is only used for old state version 1 and 2 */ -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Qemu-devel][PATCH 0/2] target-i386: disable LINT0 after reset and init
LINT0 is currently reenabled after reset to circumvent old seabios bug, which violates x86 specifications. This patch-set handles this issue, by removing the old hack from qemu and reporting to kvm that this quirk is no longer needed. In addition, we disable another kvm quirk that clears CD and NW from CR0 after reset. Thanks for reviewing these patches. Nadav Amit (2): target-i386: disable LINT0 after reset target-i386: kvm: Disable KVM quirks hw/intc/apic_common.c | 9 - linux-headers/asm-x86/kvm.h | 4 linux-headers/linux/kvm.h | 1 + target-i386/kvm.c | 8 4 files changed, 13 insertions(+), 9 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Qemu-devel][PATCH 2/2] target-i386: kvm: Disable KVM quirks
KVM has quirks to overcome legacy QEMU bugs that are already resolved. Using a new KVM feature for disabling these quirks. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- linux-headers/asm-x86/kvm.h | 4 linux-headers/linux/kvm.h | 1 + target-i386/kvm.c | 8 3 files changed, 13 insertions(+) diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h index d7dcef5..f8fbb4a 100644 --- a/linux-headers/asm-x86/kvm.h +++ b/linux-headers/asm-x86/kvm.h @@ -345,4 +345,8 @@ struct kvm_xcrs { struct kvm_sync_regs { }; +/* KVM legacy quirks for use with KVM_CAP_DISABLE_QUIRKS */ +#define KVM_QUIRK_LINT0_DISABLED (1 0) +#define KVM_QUIRK_CD_NW_CLEARED(1 1) + #endif /* _ASM_X86_KVM_H */ diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h index 60a54c8..757e869 100644 --- a/linux-headers/linux/kvm.h +++ b/linux-headers/linux/kvm.h @@ -760,6 +760,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_ENABLE_HCALL 104 #define KVM_CAP_CHECK_EXTENSION_VM 105 #define KVM_CAP_S390_USER_SIGP 106 +#define KVM_CAP_DISABLE_QUIRKS 115 #ifdef KVM_CAP_IRQ_ROUTING diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 41d09e5..3b28122 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -898,6 +898,14 @@ int kvm_arch_init(MachineState *ms, KVMState *s) return ret; } } +if (kvm_check_extension(s, KVM_CAP_ENABLE_CAP_VM)) { +ret = kvm_vm_enable_cap(s, KVM_CAP_DISABLE_QUIRKS, 0, +KVM_QUIRK_LINT0_DISABLED | +KVM_QUIRK_CD_NW_CLEARED); +if (ret 0) { +return ret; +} +} return 0; } -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: x86: Fix MSR_IA32_BNDCFGS in msrs_to_save
kvm_init_msr_list is currently called before hardware_setup. As a result, vmx_mpx_supported always returns false when kvm_init_msr_list checks whether to save MSR_IA32_BNDCFGS. Move kvm_init_msr_list after vmx_hardware_setup is called to fix this issue. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- Note that this patch was not tested on a system that supports MPX. --- arch/x86/kvm/x86.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b8cb1d0..b7e3b05 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5800,7 +5800,6 @@ int kvm_arch_init(void *opaque) kvm_set_mmio_spte_mask(); kvm_x86_ops = ops; - kvm_init_msr_list(); kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK, PT_DIRTY_MASK, PT64_NX_MASK, 0); @@ -7254,7 +7253,14 @@ void kvm_arch_hardware_disable(void) int kvm_arch_hardware_setup(void) { - return kvm_x86_ops-hardware_setup(); + int r; + + r = kvm_x86_ops-hardware_setup(); + if (r != 0) + return r; + + kvm_init_msr_list(); + return 0; } void kvm_arch_hardware_unsetup(void) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: x86: Question regarding the reset value of LINT0
Paolo Bonzini pbonz...@redhat.com wrote: On 09/04/2015 21:17, Bandan Das wrote: Excluding (1) all of the other issues are related to the VM BIOS. Perhaps KVM should somehow realize which VM BIOS runs? (yes, it sounds just as bad.) How about renaming the toggle Avi mentioned above to something more generic (KVM_DISABLE_LEGACY_QUIRKS ?) and grouping all the issues together ? Modern userspace will always enable it and get the new correct behavior. When more cases are discovered, KVM can just add them to the list. It can be a VM capability (KVM_FIRMWARE_QUIRKS?) that is enabled via KVM_ENABLE_CAP. The first argument in struct kvm_enable_cap can be used to add more quirks in the future. For now, an argument of zero could be used to: 1) set up LINT0 correctly 2) set up CD and NW correctly in svm_set_cr0 AFAIK the MTRR issue in SeaBIOS was not fixed. For that, QEMU could write to MSR_MTRRcap with KVM_SET_MSR. Setting the accessed bit in vmx_set_segment is IMHO harmless, and we might as well do it even if !enable_unrestricted_guest. Ok. I’ll send an updated version (as well as updated version of the init/reset issues) in the next few days. Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: x86: Support for disabling quirks
Introducing KVM_CAP_DISABLE_QUIRKS for disabling x86 quirks that were previous created in order to overcome QEMU issues. Those issue were mostly result of invalid VM BIOS. Currently there are two quirks that can be disabled: 1. KVM_QUIRK_LINT0_REENABLED - LINT0 was enabled after boot 2. KVM_QUIRK_CD_NW_CLEARED - CD and NW are cleared after boot These two issues are already resolved in recent releases of QEMU, and would therefore be disabled by QEMU. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- Documentation/virtual/kvm/api.txt | 3 ++- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/include/uapi/asm/kvm.h | 3 +++ arch/x86/kvm/lapic.c | 5 +++-- arch/x86/kvm/svm.c| 3 ++- arch/x86/kvm/x86.c| 29 + include/uapi/linux/kvm.h | 1 + 7 files changed, 42 insertions(+), 4 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index bc9f6fe..3931221 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -959,7 +959,8 @@ documentation when it pops into existence). 4.37 KVM_ENABLE_CAP Capability: KVM_CAP_ENABLE_CAP, KVM_CAP_ENABLE_CAP_VM -Architectures: ppc, s390 +Architectures: x86 (only KVM_CAP_ENABLE_CAP_VM), + mips (only KVM_CAP_ENABLE_CAP), ppc, s390 Type: vcpu ioctl, vm ioctl (with KVM_CAP_ENABLE_CAP_VM) Parameters: struct kvm_enable_cap (in) Returns: 0 on success; -1 on error diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index dea2e7e..f80ad59 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -635,6 +635,8 @@ struct kvm_arch { #endif bool boot_vcpu_runs_old_kvmclock; + + u64 disabled_quirks; }; struct kvm_vm_stat { diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index d7dcef5..2fec75e 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -345,4 +345,7 @@ struct kvm_xcrs { struct kvm_sync_regs { }; +#define KVM_QUIRK_LINT0_REENABLED (1 0) +#define KVM_QUIRK_CD_NW_CLEARED(1 1) + #endif /* _ASM_X86_KVM_H */ diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 4a6e58a..fe2d89e 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1577,8 +1577,9 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu) for (i = 0; i APIC_LVT_NUM; i++) apic_set_reg(apic, APIC_LVTT + 0x10 * i, APIC_LVT_MASKED); apic-lapic_timer.timer_mode = 0; - apic_set_reg(apic, APIC_LVT0, -SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); + if (!(vcpu-kvm-arch.disabled_quirks KVM_QUIRK_LINT0_REENABLED)) + apic_set_reg(apic, APIC_LVT0, +SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); apic_set_reg(apic, APIC_DFR, 0xU); apic_set_spiv(apic, 0xff); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ce741b8..46299da 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1575,7 +1575,8 @@ static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0) * does not do it - this results in some delay at * reboot */ - cr0 = ~(X86_CR0_CD | X86_CR0_NW); + if (!(vcpu-kvm-arch.disabled_quirks KVM_QUIRK_CD_NW_CLEARED)) + cr0 = ~(X86_CR0_CD | X86_CR0_NW); svm-vmcb-save.cr0 = cr0; mark_dirty(svm-vmcb, VMCB_CR); update_cr0_intercept(svm); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b8cb1d0..c3859a6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2778,6 +2778,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_HYPERV_TIME: case KVM_CAP_IOAPIC_POLARITY_IGNORED: case KVM_CAP_TSC_DEADLINE_TIMER: + case KVM_CAP_ENABLE_CAP_VM: #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT case KVM_CAP_ASSIGN_DEV_IRQ: case KVM_CAP_PCI_2_3: @@ -3825,6 +3826,26 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event, return 0; } +static int kvm_vm_ioctl_enable_cap(struct kvm *kvm, + struct kvm_enable_cap *cap) +{ + int r; + + if (cap-flags) + return -EINVAL; + + switch (cap-cap) { + case KVM_CAP_DISABLE_QUIRKS: + kvm-arch.disabled_quirks = cap-args[0]; + r = 0; + break; + default: + r = -EINVAL; + break; + } + return r; +} + long kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -4077,7 +4098,15 @@ long kvm_arch_vm_ioctl(struct file *filp, r = 0; break; } + case KVM_ENABLE_CAP: { + struct kvm_enable_cap cap; + r = -EFAULT; + if (copy_from_user(cap, argp
Re: x86: Question regarding the reset value of LINT0
Avi Kivity avi.kiv...@gmail.com wrote: On 04/09/2015 09:21 PM, Nadav Amit wrote: Bandan Das b...@redhat.com wrote: Nadav Amit nadav.a...@gmail.com writes: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 19:40, Nadav Amit wrote: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:59, Nadav Amit wrote: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:40, Nadav Amit wrote: Hi, I would appreciate if someone explains the reason for enabling LINT0 during APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local Vector Table” that says all LVT registers are reset to 0x1. In kvm_lapic_reset, I see: apic_set_reg(apic, APIC_LVT0, SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); Which is actually pretty similar to QEMU’s apic_reset_common: if (bsp) { /* * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization * time typically by BIOS, so PIC interrupt can be delivered to the * processor when local APIC is enabled. */ s-lvt[APIC_LVT_LINT0] = 0x700; } Yet, in both cases, I miss the point - if it is typically done by the BIOS, why does QEMU or KVM enable it? BTW: KVM seems to run fine without it, and I think setting it causes me problems in certain cases. I suspect it has some historic BIOS backgrounds. Already tried to find more information in the git logs of both code bases? Or something that indicates of SeaBIOS or BochsBIOS once didn't do this initialization? Thanks. I found no indication of such thing. QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says: Don't route PIC interrupts through the local APIC if the local APIC config says so. By Ari Kivity. Maybe Avi Kivity knows this guy. ths? That should have been Thiemo Seufer (IIRC), but he just committed the code back then (and is no longer with us, sadly). Oh… I am sorry - I didn’t know about that.. (I tried to make an unfunny joke about Avi knowing “Ari”). Ah. No problem. My brain apparently fixed that typo up unnoticed. But if that commit went in without any BIOS changes around it, QEMU simply had to do the job of the latter to keep things working. So should I leave it as is? Can I at least disable in KVM during INIT (and leave it as is for RESET)? No, I don't think there is a need to leave this inaccurate for QEMU if our included BIOS gets it right. I don't know what the backward bug-compatibility of KVM is, though. Maybe you can identify since when our BIOS is fine so that we can discuss time frames. I think that it was addressed in commit 19c1a7692bf65fc40e56f93ad00cc3eefaad22a4 (Initialize the LINT LVTs on the local APIC of the BSP.”) So it should be included in seabios 0.5.0, which means qemu 0.12 - so we are talking about the end of 2009 or start of 2010. The probability that someone will use a newer version of kernel with something as old as 0.12 is probably minimal. I think it's ok to change it with a comment indicating the reason. To be on the safe side, however, a user changeable switch is something worth considering. I don’t see any existing mechanism for KVM to be aware of its user type and version. I do see another case of KVM hacks that are intended for fixing very old QEMU bugs (see 3a624e29c75 changes in vmx_set_segment, which are from pretty much the same time-frame of the issue I try to fix). Since this is something which would follow around, please advise what would be the format. A new ioctl that would supply the userspace “type” (according to predefined constants) and version? That would be madness. KVM shouldn't even know that qemu exists, let alone track its versions. Simply add a new toggle KVM_USE_STANDARD_LAPIC_LVT_INIT and document that userspace MUST use it. Old userspace won't, and will get the old buggy behavior. I fully agree it would be madness. Yet it appears to be a recurring problem. Here are similar problems found from a short search: 1. vmx_set_segment setting segment accessed (3a624e29c75) 2. svm_set_cr0 clearing CD and NW (709ddebf81c) 3. Limited number of MTRRs due to Seabios bus (0d234daf7e0a) Excluding (1) all of the other issues are related to the VM BIOS. Perhaps KVM should somehow realize which VM BIOS runs? (yes, it sounds just as bad.) Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: x86: Question regarding the reset value of LINT0
Bandan Das b...@redhat.com wrote: Nadav Amit nadav.a...@gmail.com writes: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 19:40, Nadav Amit wrote: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:59, Nadav Amit wrote: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:40, Nadav Amit wrote: Hi, I would appreciate if someone explains the reason for enabling LINT0 during APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local Vector Table” that says all LVT registers are reset to 0x1. In kvm_lapic_reset, I see: apic_set_reg(apic, APIC_LVT0, SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); Which is actually pretty similar to QEMU’s apic_reset_common: if (bsp) { /* * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization * time typically by BIOS, so PIC interrupt can be delivered to the * processor when local APIC is enabled. */ s-lvt[APIC_LVT_LINT0] = 0x700; } Yet, in both cases, I miss the point - if it is typically done by the BIOS, why does QEMU or KVM enable it? BTW: KVM seems to run fine without it, and I think setting it causes me problems in certain cases. I suspect it has some historic BIOS backgrounds. Already tried to find more information in the git logs of both code bases? Or something that indicates of SeaBIOS or BochsBIOS once didn't do this initialization? Thanks. I found no indication of such thing. QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says: Don't route PIC interrupts through the local APIC if the local APIC config says so. By Ari Kivity. Maybe Avi Kivity knows this guy. ths? That should have been Thiemo Seufer (IIRC), but he just committed the code back then (and is no longer with us, sadly). Oh… I am sorry - I didn’t know about that.. (I tried to make an unfunny joke about Avi knowing “Ari”). Ah. No problem. My brain apparently fixed that typo up unnoticed. But if that commit went in without any BIOS changes around it, QEMU simply had to do the job of the latter to keep things working. So should I leave it as is? Can I at least disable in KVM during INIT (and leave it as is for RESET)? No, I don't think there is a need to leave this inaccurate for QEMU if our included BIOS gets it right. I don't know what the backward bug-compatibility of KVM is, though. Maybe you can identify since when our BIOS is fine so that we can discuss time frames. I think that it was addressed in commit 19c1a7692bf65fc40e56f93ad00cc3eefaad22a4 (Initialize the LINT LVTs on the local APIC of the BSP.”) So it should be included in seabios 0.5.0, which means qemu 0.12 - so we are talking about the end of 2009 or start of 2010. The probability that someone will use a newer version of kernel with something as old as 0.12 is probably minimal. I think it's ok to change it with a comment indicating the reason. To be on the safe side, however, a user changeable switch is something worth considering. I don’t see any existing mechanism for KVM to be aware of its user type and version. I do see another case of KVM hacks that are intended for fixing very old QEMU bugs (see 3a624e29c75 changes in vmx_set_segment, which are from pretty much the same time-frame of the issue I try to fix). Since this is something which would follow around, please advise what would be the format. A new ioctl that would supply the userspace “type” (according to predefined constants) and version? Thanks, Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: x86: fix x86 eflags fixed bit
Sorry for that - fixes 0efb04406de834d820f7ba150a00d1d3194aa8a6 (KVM: x86: removing redundant eflags bits definitions”). Nadav Wanpeng Li wanpeng...@linux.intel.com wrote: Guest can't be booted w/ ept=0, there is a message dumped as below: If you're running a guest on an Intel machine without unrestricted mode support, the failure can be most likely due to the guest entering an invalid state for Intel VT. For example, the guest maybe running in big real mode which is not supported on less recent Intel processors. EAX=0011 EBX=f000d2f6 ECX=6cac EDX=000f8956 ESI=bffbdf62 EDI= EBP=6c68 ESP=6c68 EIP=d187 EFL=0004 [-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =e000 000e 00809300 DPL=0 DS16 [-WA] CS =f000 000f 00809b00 DPL=0 CS16 [-RA] SS = 00809300 DPL=0 DS16 [-WA] DS = 00809300 DPL=0 DS16 [-WA] FS = 00809300 DPL=0 DS16 [-WA] GS = 00809300 DPL=0 DS16 [-WA] LDT= 8200 DPL=0 LDT TR = 8b00 DPL=0 TSS32-busy GDT= 000f6a80 0037 IDT= 000f6abe CR0=0011 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= Code=01 1e b8 6a 2e 0f 01 16 74 6a 0f 20 c0 66 83 c8 01 0f 22 c0 66 ea 8f d1 0f 00 08 00 b8 10 00 00 00 8e d8 8e c0 8e d0 8e e0 8e e8 89 c8 ff e2 89 c1 b8X X86 eflags bit 1 is fixed set, which means that 1 1 is set instead of 1, this patch fix it. Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com --- arch/x86/kvm/emulate.c |6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index b304728..630bcb0 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2033,7 +2033,7 @@ static int emulate_iret_real(struct x86_emulate_ctxt *ctxt) X86_EFLAGS_IF | X86_EFLAGS_DF | X86_EFLAGS_OF | X86_EFLAGS_IOPL | X86_EFLAGS_NT | X86_EFLAGS_RF | X86_EFLAGS_AC | X86_EFLAGS_ID | - X86_EFLAGS_FIXED_BIT; + X86_EFLAGS_FIXED; unsigned long vm86_mask = X86_EFLAGS_VM | X86_EFLAGS_VIF | X86_EFLAGS_VIP; @@ -2072,7 +2072,7 @@ static int emulate_iret_real(struct x86_emulate_ctxt *ctxt) } ctxt-eflags = ~EFLG_RESERVED_ZEROS_MASK; /* Clear reserved zeros */ - ctxt-eflags |= X86_EFLAGS_FIXED_BIT; + ctxt-eflags |= X86_EFLAGS_FIXED; ctxt-ops-set_nmi_mask(ctxt, false); return rc; @@ -2390,7 +2390,7 @@ static int em_syscall(struct x86_emulate_ctxt *ctxt) ops-get_msr(ctxt, MSR_SYSCALL_MASK, msr_data); ctxt-eflags = ~msr_data; - ctxt-eflags |= X86_EFLAGS_FIXED_BIT; + ctxt-eflags |= X86_EFLAGS_FIXED; #endif } else { /* legacy mode */ -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
x86: Question regarding the reset value of LINT0
Hi, I would appreciate if someone explains the reason for enabling LINT0 during APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local Vector Table” that says all LVT registers are reset to 0x1. In kvm_lapic_reset, I see: apic_set_reg(apic, APIC_LVT0, SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); Which is actually pretty similar to QEMU’s apic_reset_common: if (bsp) { /* * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization * time typically by BIOS, so PIC interrupt can be delivered to the * processor when local APIC is enabled. */ s-lvt[APIC_LVT_LINT0] = 0x700; } Yet, in both cases, I miss the point - if it is typically done by the BIOS, why does QEMU or KVM enable it? BTW: KVM seems to run fine without it, and I think setting it causes me problems in certain cases. Thanks, Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: x86: Question regarding the reset value of LINT0
Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:40, Nadav Amit wrote: Hi, I would appreciate if someone explains the reason for enabling LINT0 during APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local Vector Table” that says all LVT registers are reset to 0x1. In kvm_lapic_reset, I see: apic_set_reg(apic, APIC_LVT0, SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); Which is actually pretty similar to QEMU’s apic_reset_common: if (bsp) { /* * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization * time typically by BIOS, so PIC interrupt can be delivered to the * processor when local APIC is enabled. */ s-lvt[APIC_LVT_LINT0] = 0x700; } Yet, in both cases, I miss the point - if it is typically done by the BIOS, why does QEMU or KVM enable it? BTW: KVM seems to run fine without it, and I think setting it causes me problems in certain cases. I suspect it has some historic BIOS backgrounds. Already tried to find more information in the git logs of both code bases? Or something that indicates of SeaBIOS or BochsBIOS once didn't do this initialization? Thanks. I found no indication of such thing. QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says: Don't route PIC interrupts through the local APIC if the local APIC config says so. By Ari Kivity. Maybe Avi Kivity knows this guy. Regards, Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: x86: Question regarding the reset value of LINT0
Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:59, Nadav Amit wrote: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:40, Nadav Amit wrote: Hi, I would appreciate if someone explains the reason for enabling LINT0 during APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local Vector Table” that says all LVT registers are reset to 0x1. In kvm_lapic_reset, I see: apic_set_reg(apic, APIC_LVT0, SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); Which is actually pretty similar to QEMU’s apic_reset_common: if (bsp) { /* * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization * time typically by BIOS, so PIC interrupt can be delivered to the * processor when local APIC is enabled. */ s-lvt[APIC_LVT_LINT0] = 0x700; } Yet, in both cases, I miss the point - if it is typically done by the BIOS, why does QEMU or KVM enable it? BTW: KVM seems to run fine without it, and I think setting it causes me problems in certain cases. I suspect it has some historic BIOS backgrounds. Already tried to find more information in the git logs of both code bases? Or something that indicates of SeaBIOS or BochsBIOS once didn't do this initialization? Thanks. I found no indication of such thing. QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says: Don't route PIC interrupts through the local APIC if the local APIC config says so. By Ari Kivity. Maybe Avi Kivity knows this guy. ths? That should have been Thiemo Seufer (IIRC), but he just committed the code back then (and is no longer with us, sadly). Oh… I am sorry - I didn’t know about that.. (I tried to make an unfunny joke about Avi knowing “Ari”). But if that commit went in without any BIOS changes around it, QEMU simply had to do the job of the latter to keep things working. So should I leave it as is? Can I at least disable in KVM during INIT (and leave it as is for RESET)? Thanks, Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: x86: Question regarding the reset value of LINT0
Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 19:40, Nadav Amit wrote: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:59, Nadav Amit wrote: Jan Kiszka jan.kis...@siemens.com wrote: On 2015-04-08 18:40, Nadav Amit wrote: Hi, I would appreciate if someone explains the reason for enabling LINT0 during APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local Vector Table” that says all LVT registers are reset to 0x1. In kvm_lapic_reset, I see: apic_set_reg(apic, APIC_LVT0, SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT)); Which is actually pretty similar to QEMU’s apic_reset_common: if (bsp) { /* * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization * time typically by BIOS, so PIC interrupt can be delivered to the * processor when local APIC is enabled. */ s-lvt[APIC_LVT_LINT0] = 0x700; } Yet, in both cases, I miss the point - if it is typically done by the BIOS, why does QEMU or KVM enable it? BTW: KVM seems to run fine without it, and I think setting it causes me problems in certain cases. I suspect it has some historic BIOS backgrounds. Already tried to find more information in the git logs of both code bases? Or something that indicates of SeaBIOS or BochsBIOS once didn't do this initialization? Thanks. I found no indication of such thing. QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says: Don't route PIC interrupts through the local APIC if the local APIC config says so. By Ari Kivity. Maybe Avi Kivity knows this guy. ths? That should have been Thiemo Seufer (IIRC), but he just committed the code back then (and is no longer with us, sadly). Oh… I am sorry - I didn’t know about that.. (I tried to make an unfunny joke about Avi knowing “Ari”). Ah. No problem. My brain apparently fixed that typo up unnoticed. But if that commit went in without any BIOS changes around it, QEMU simply had to do the job of the latter to keep things working. So should I leave it as is? Can I at least disable in KVM during INIT (and leave it as is for RESET)? No, I don't think there is a need to leave this inaccurate for QEMU if our included BIOS gets it right. I don't know what the backward bug-compatibility of KVM is, though. Maybe you can identify since when our BIOS is fine so that we can discuss time frames. I think that it was addressed in commit 19c1a7692bf65fc40e56f93ad00cc3eefaad22a4 (Initialize the LINT LVTs on the local APIC of the BSP.”) So it should be included in seabios 0.5.0, which means qemu 0.12 - so we are talking about the end of 2009 or start of 2010. What is the verdict? Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 3/4] x86: check register values after reset
Checking the reset values of dr0, cr2 and sysenter_eip after reset. Since INIT does not initialize certain registers, we keep for each state whether it is INIT or not. It is also required for the next patch. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- x86/init.c | 55 +++ 1 file changed, 55 insertions(+) diff --git a/x86/init.c b/x86/init.c index 717bd12..3694d67 100644 --- a/x86/init.c +++ b/x86/init.c @@ -1,6 +1,8 @@ #include libcflat.h #include apic.h #include io.h +#include processor.h +#include msr.h #define CHECK_HARD_RESET 0 @@ -36,9 +38,51 @@ static inline void rtc_out(u8 reg, u8 val) extern char resume_start, resume_end; +struct expected_state { + bool init; +}; + +static struct expected_state expected[] = { + {.init = false}, + {.init = false}, + {.init = false}, + {.init = false}, + {.init = true}, +}; + #define state (*(volatile int *)0x2000) #define bad (*(volatile int *)0x2004) #define resumed (*(volatile int *)0x2008) +#define dr0 (*(volatile int*)0x200c) +#define cr2 (*(volatile int*)0x2010) +#define sysenter_eip (*(volatile int*)0x2014) + +static void set_test_regs(void) +{ + write_dr0(1); + write_cr2(1); + wrmsr(MSR_IA32_SYSENTER_EIP, 1); +} + +static bool check_test_regs(bool init) +{ + bool error = false; + + if (dr0 != 0) { + printf(error: dr0 was not initialized: %x\n, dr0); + error = true; + } + if (cr2 != 0) { + printf(error: cr2 was not initialized\n, cr2); + error = true; + } + if (init sysenter_eip != 1) { + printf(wrong sysenter_eip msr: %x\n, sysenter_eip); + error = true; + } + + return error; +} int main(int argc, char **argv) { @@ -63,6 +107,9 @@ int main(int argc, char **argv) bad |= 2; } + if (check_test_regs(expected[state-1].init)) + bad |= 4; + #if CHECK_HARD_RESET /* * Port 92 bit 0 is cleared on system reset. On a soft reset it @@ -76,6 +123,7 @@ int main(int argc, char **argv) #endif } + set_test_regs(); resumed = 0; switch (state++) { @@ -122,6 +170,13 @@ asm ( .code16\n resume_start:\n incb %cs:0x2008\n // resumed++; + mov %dr0, %ecx\n + mov %ecx, %cs:0x200c\n// dr0 + mov %cr2, %ecx\n + mov %ecx, %cs:0x2010\n// cr2 + mov $0x176, %ecx\n + rdmsr\n + mov %eax, %cs:0x2014\n// sysenter_eip mov $0x0f, %al\n // rtc_out(0x0f, 0x00); out %al, $0x70\n mov $0x00, %al\n -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 4/4] x86: Test BSP change
Testing whether changing the BSP, followed by INIT works correctly. The APIC ID is saved after resume and compared. For this test to pass, a fix of QEMU is needed in addition to a KVM fix. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- x86/init.c| 77 ++- x86/unittests.cfg | 1 + 2 files changed, 71 insertions(+), 7 deletions(-) diff --git a/x86/init.c b/x86/init.c index 3694d67..2d7ea99 100644 --- a/x86/init.c +++ b/x86/init.c @@ -3,6 +3,7 @@ #include io.h #include processor.h #include msr.h +#include smp.h #define CHECK_HARD_RESET 0 @@ -36,18 +37,30 @@ static inline void rtc_out(u8 reg, u8 val) outb(val, 0x71); } +static inline void send_init(void) +{ + char *reset_vec = (void*)0xfff0; + + reset_vec[0] = 0xeb; + reset_vec[1] = 0xfe; + apic_icr_write(APIC_DEST_ALLINC | APIC_DEST_PHYSICAL + | APIC_DM_INIT, 0); +} + extern char resume_start, resume_end; struct expected_state { bool init; + unsigned int bsp; }; static struct expected_state expected[] = { - {.init = false}, - {.init = false}, - {.init = false}, - {.init = false}, - {.init = true}, + {.init = false, .bsp = 0}, + {.init = false, .bsp = 0}, + {.init = false, .bsp = 0}, + {.init = false, .bsp = 0}, + {.init = true, .bsp = 0}, + {.init = true, .bsp = 1}, }; #define state (*(volatile int *)0x2000) @@ -56,6 +69,7 @@ static struct expected_state expected[] = { #define dr0 (*(volatile int*)0x200c) #define cr2 (*(volatile int*)0x2010) #define sysenter_eip (*(volatile int*)0x2014) +#define boot_apic_id (*(volatile int *)0x2018) static void set_test_regs(void) { @@ -84,6 +98,39 @@ static bool check_test_regs(bool init) return error; } +static void change_bsp(bool bsp) +{ + unsigned long apicbase = rdmsr(MSR_IA32_APICBASE); + + if (bsp) + apicbase |= MSR_IA32_APICBASE_BSP; + else + apicbase = ~MSR_IA32_APICBASE_BSP; +wrmsr(MSR_IA32_APICBASE, apicbase); +} + +static void set_bsp(void *data) +{ + set_test_regs(); + change_bsp(true); +} + +static bool change_bsp_test(void) +{ + int ncpus; + + smp_init(); + ncpus = cpu_count(); + if (ncpus != 2) { + printf(Skipping BSP test\n); + return false; + } + on_cpu(1, set_bsp, NULL); + change_bsp(false); + return true; +} + + int main(int argc, char **argv) { volatile u16 *resume_vector_ptr = (u16 *)0x467L; @@ -110,6 +157,12 @@ int main(int argc, char **argv) if (check_test_regs(expected[state-1].init)) bad |= 4; + if (expected[state-1].bsp != boot_apic_id) { + printf(error: wrong BSP: %x expected: %x\n, + boot_apic_id, expected[state-1].bsp); + bad |= 8; + } + #if CHECK_HARD_RESET /* * Port 92 bit 0 is cleared on system reset. On a soft reset it @@ -151,11 +204,17 @@ int main(int argc, char **argv) case 4: printf(testing init to BSP... ); - apic_icr_write(APIC_DEST_SELF | APIC_DEST_PHYSICAL - | APIC_DM_INIT, 0); + send_init(); break; case 5: + printf(testing init to BSP (2nd core)... ); + if (change_bsp_test()) { + send_init(); + break; + } + state++;/* fall through */ + case 6: exit(bad); } @@ -177,6 +236,10 @@ asm ( mov $0x176, %ecx\n rdmsr\n mov %eax, %cs:0x2014\n// sysenter_eip + mov $1, %eax\n + cpuid\n + shrl $24, %ebx\n + mov %ebx, %cs:0x2018\n// apic_id mov $0x0f, %al\n // rtc_out(0x0f, 0x00); out %al, $0x70\n mov $0x00, %al\n diff --git a/x86/unittests.cfg b/x86/unittests.cfg index d7ddd9d..2d25801 100644 --- a/x86/unittests.cfg +++ b/x86/unittests.cfg @@ -85,6 +85,7 @@ arch = x86_64 [init] file = init.flat +smp = 2 [msr] file = msr.flat -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 1/4] x86: move some common functions to processor.h
Moving some code for reading/writing drs to processor.h. The names of the functions are changed to be consistent with others. While we are at it, fixing lldt so it would work. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- lib/x86/processor.h | 19 ++- x86/debug.c | 36 +++- 2 files changed, 25 insertions(+), 30 deletions(-) diff --git a/lib/x86/processor.h b/lib/x86/processor.h index 7973879..47f8d2c 100644 --- a/lib/x86/processor.h +++ b/lib/x86/processor.h @@ -232,7 +232,7 @@ static inline void sidt(struct descriptor_table_ptr *ptr) static inline void lldt(unsigned val) { -asm volatile (lldt %0 : : rm(val)); +asm volatile (lldt %0 : : m(val)); } static inline u16 sldt(void) @@ -254,6 +254,23 @@ static inline u16 str(void) return val; } +static inline ulong read_dr0(void) +{ +ulong val; +asm volatile (mov %%dr0, %0 : =r(val)); +return val; +} + +static inline void write_dr0(ulong val) +{ + asm volatile(mov %0,%%dr0 : : r (val)); +} + +static inline void write_dr1(ulong val) +{ + asm volatile(mov %0,%%dr1 : : r (val)); +} + static inline void write_dr6(ulong val) { asm volatile (mov %0, %%dr6 : : r(val) : memory); diff --git a/x86/debug.c b/x86/debug.c index 34e56fb..68cac20 100644 --- a/x86/debug.c +++ b/x86/debug.c @@ -11,45 +11,23 @@ #include libcflat.h #include desc.h +#include processor.h static volatile unsigned long bp_addr[10], dr6[10]; static volatile unsigned int n; static volatile unsigned long value; -static unsigned long get_dr6(void) -{ - unsigned long value; - - asm volatile(mov %%dr6,%0 : =r (value)); - return value; -} - -static void set_dr0(void *value) -{ - asm volatile(mov %0,%%dr0 : : r (value)); -} - -static void set_dr1(void *value) -{ - asm volatile(mov %0,%%dr1 : : r (value)); -} - -static void set_dr7(unsigned long value) -{ - asm volatile(mov %0,%%dr7 : : r (value)); -} - static void handle_db(struct ex_regs *regs) { bp_addr[n] = regs-rip; - dr6[n] = get_dr6(); + dr6[n] = read_dr6(); if (dr6[n] 0x1) regs-rflags |= (1 16); if (++n = 10) { regs-rflags = ~(1 8); - set_dr7(0x0400); + write_dr7(0x0400); } } @@ -70,8 +48,8 @@ sw_bp: asm volatile(int3); report(#BP, bp_addr[0] == (unsigned long)sw_bp + 1); - set_dr0(hw_bp); - set_dr7(0x0402); + write_dr0((ulong)hw_bp); + write_dr7(0x0402); hw_bp: asm volatile(nop); report(hw breakpoint, @@ -97,8 +75,8 @@ hw_bp: bp_addr[2] == start+1+6+1+1 dr6[2] == 0x4ff0); n = 0; - set_dr1((void *)value); - set_dr7(0x00d0040a); + write_dr1((ulong)value); + write_dr7(0x00d0040a); asm volatile( mov $42,%%rax\n\t -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 0/4] Init tests for reset reg checks
Reenabling the init tests in order to check whether the registers are set correctly after reset, and to check whether the BSP can be changed from the VM. Thanks for reviewing the patches. Nadav Amit (4): x86: move some common functions to processor.h x86: Reenable init tests x86: check register values after reset x86: Test BSP change lib/x86/processor.h | 19 +++- x86/debug.c | 36 +++ x86/init.c | 128 +++- x86/unittests.cfg | 5 +- 4 files changed, 154 insertions(+), 34 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 2/4] x86: Reenable init tests
Reenabling init tests. Disabling the tests of port 0x92 since it currently fails. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- x86/init.c| 6 ++ x86/unittests.cfg | 4 ++-- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/x86/init.c b/x86/init.c index 344dc1c..717bd12 100644 --- a/x86/init.c +++ b/x86/init.c @@ -2,6 +2,8 @@ #include apic.h #include io.h +#define CHECK_HARD_RESET 0 + #define KBD_CCMD_READ_OUTPORT 0xD0/* read output port */ #define KBD_CCMD_WRITE_OUTPORT 0xD1/* write output port */ #define KBD_CCMD_RESET 0xFE/* CPU reset */ @@ -60,14 +62,18 @@ int main(int argc, char **argv) printf(Uh, resume vector visited %d times?\n, resumed); bad |= 2; } + +#if CHECK_HARD_RESET /* * Port 92 bit 0 is cleared on system reset. On a soft reset it * is left to 1. Use this to distinguish INIT from hard reset. +* This test is currently disabled since it currently fails. */ if (resumed != 0 (inb(0x92) 1) == 0) { printf(Uh, hard reset!\n); bad |= 1; } +#endif } resumed = 0; diff --git a/x86/unittests.cfg b/x86/unittests.cfg index badb08a..d7ddd9d 100644 --- a/x86/unittests.cfg +++ b/x86/unittests.cfg @@ -83,8 +83,8 @@ file = hypercall.flat file = idt_test.flat arch = x86_64 -#[init] -#file = init.flat +[init] +file = init.flat [msr] file = msr.flat -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/4] KVM: x86: INIT and reset sequences are different
x86 architecture defines differences between the reset and INIT sequences. INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU, MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP. EFER is supposed NOT to be reset according to the SDM, but leaving the LMA and LME untouched causes failed VM-entry. Therefore we reset EFER (although it is unclear whether the rest of EFER bits should be reset). References (from Intel SDM): If the MP protocol has completed and a BSP is chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be repeated. [8.4.2: MP Initialization Protocol Requirements and Restrictions] [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT] If the processor is reset by asserting the INIT# pin, the x87 FPU state is not changed. [9.2: X87 FPU INITIALIZATION] The state of the local APIC following an INIT reset is the same as it is after a power-up or hardware reset, except that the APIC ID and arbitration ID registers are not affected. [10.4.7.3: Local APIC State After an INIT Reset (“Wait-for-SIPI” State)] Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/include/asm/kvm_host.h | 6 +++--- arch/x86/kvm/lapic.c| 11 ++- arch/x86/kvm/lapic.h| 2 +- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/vmx.c | 30 +- arch/x86/kvm/x86.c | 17 ++--- 6 files changed, 38 insertions(+), 30 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index bf5a160..59f4374 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -701,7 +701,7 @@ struct kvm_x86_ops { /* Create, but do not attach this VCPU */ struct kvm_vcpu *(*vcpu_create)(struct kvm *kvm, unsigned id); void (*vcpu_free)(struct kvm_vcpu *vcpu); - void (*vcpu_reset)(struct kvm_vcpu *vcpu); + void (*vcpu_reset)(struct kvm_vcpu *vcpu, bool init_event); void (*prepare_guest_switch)(struct kvm_vcpu *vcpu); void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu); @@ -989,7 +989,7 @@ void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id); void kvm_inject_nmi(struct kvm_vcpu *vcpu); -int fx_init(struct kvm_vcpu *vcpu); +int fx_init(struct kvm_vcpu *vcpu, bool init_event); void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes); @@ -1134,7 +1134,7 @@ int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v); int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu); int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu); int kvm_cpu_get_interrupt(struct kvm_vcpu *v); -void kvm_vcpu_reset(struct kvm_vcpu *vcpu); +void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event); void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu); void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm, unsigned long address); diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index bd4e34d..17da6fc 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1534,7 +1534,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) } -void kvm_lapic_reset(struct kvm_vcpu *vcpu) +void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) { struct kvm_lapic *apic; int i; @@ -1548,7 +1548,8 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu) /* Stop the timer in case it's a reset to an active apic */ hrtimer_cancel(apic-lapic_timer.timer); - kvm_apic_set_id(apic, vcpu-vcpu_id); + if (!init_event) + kvm_apic_set_id(apic, vcpu-vcpu_id); kvm_apic_set_version(apic-vcpu); for (i = 0; i APIC_LVT_NUM; i++) @@ -1689,7 +1690,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu) APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE); static_key_slow_inc(apic_sw_disabled.key); /* sw disabled at reset */ - kvm_lapic_reset(vcpu); + kvm_lapic_reset(vcpu, false); kvm_iodevice_init(apic-dev, apic_mmio_ops); return 0; @@ -2023,8 +2024,8 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu) pe = xchg(apic-pending_events, 0); if (test_bit(KVM_APIC_INIT, pe)) { - kvm_lapic_reset(vcpu); - kvm_vcpu_reset(vcpu); + kvm_lapic_reset(vcpu, true); + kvm_vcpu_reset(vcpu, true); if (kvm_vcpu_is_bsp(apic-vcpu)) vcpu-arch.mp_state = KVM_MP_STATE_RUNNABLE; else diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 0bc6c65..e4c82dc 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -48,7 +48,7 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu); int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu); int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu); void
[PATCH v2 2/4] KVM: x86: BSP in MSR_IA32_APICBASE is writable
After reset, the CPU can change the BSP, which will be used upon INIT. Reset should return the BSP which QEMU asked for, and therefore handled accordingly. To quote: If the MP protocol has completed and a BSP is chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be repeated. [Intel SDM 8.4.2: MP Initialization Protocol Requirements and Restrictions] Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/lapic.c | 2 -- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/vmx.c | 2 +- arch/x86/kvm/x86.c | 2 +- include/linux/kvm_host.h | 7 ++- 5 files changed, 9 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 17da6fc..b0dbf68 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1498,8 +1498,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) return; } - if (!kvm_vcpu_is_bsp(apic-vcpu)) - value = ~MSR_IA32_APICBASE_BSP; vcpu-arch.apic_base = value; /* update jump label if enable bit changes */ diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 1ef4c0d..ef5bf21 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1261,7 +1261,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) svm-vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE; - if (kvm_vcpu_is_bsp(svm-vcpu)) + if (kvm_vcpu_is_reset_bsp(svm-vcpu)) svm-vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP; svm_init_osvw(svm-vcpu); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e9c94c6..5925c65 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4709,7 +4709,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) if (!init_event) { apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE; - if (kvm_vcpu_is_bsp(vmx-vcpu)) + if (kvm_vcpu_is_reset_bsp(vmx-vcpu)) apic_base_msr.data |= MSR_IA32_APICBASE_BSP; apic_base_msr.host_initiated = true; kvm_set_apic_base(vmx-vcpu, apic_base_msr); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 324e639..c93b1b5 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7264,7 +7264,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) vcpu-arch.pv.pv_unhalted = false; vcpu-arch.emulate_ctxt.ops = emulate_ops; - if (!irqchip_in_kernel(kvm) || kvm_vcpu_is_bsp(vcpu)) + if (!irqchip_in_kernel(kvm) || kvm_vcpu_is_reset_bsp(vcpu)) vcpu-arch.mp_state = KVM_MP_STATE_RUNNABLE; else vcpu-arch.mp_state = KVM_MP_STATE_UNINITIALIZED; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0f574eb..8365cae 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -968,11 +968,16 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args) #endif /* CONFIG_HAVE_KVM_EVENTFD */ #ifdef CONFIG_KVM_APIC_ARCHITECTURE -static inline bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu) +static inline bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu) { return vcpu-kvm-bsp_vcpu_id == vcpu-vcpu_id; } +static inline bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu) +{ + return (vcpu-arch.apic_base MSR_IA32_APICBASE_BSP) != 0; +} + bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu); #else -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/4] KVM: x86: Reset fixes
This set includes 2 previous patches that deal with the INIT flow that is not distinguished from regular boot, and allowing the VM to change BSP (which is used in very certain testing environments). The next 2 patches are new, dealing with regression that cause DR0-DR3 not to be reset (even when QEMU initiates the RESET) and CR2 not cleared after INIT. The second patch regarding BSP requires an additional fix for QEMU, as otherwise reset fails. A separate patch was submitted to QEMU mailing-list. Thanks for reviewing the patches. Nadav Amit (4): KVM: x86: INIT and reset sequences are different KVM: x86: BSP in MSR_IA32_APICBASE is writable KVM: x86: DR0-DR3 are not clear on reset KVM: x86: Clear CR2 on VCPU reset arch/x86/include/asm/kvm_host.h | 7 --- arch/x86/kvm/lapic.c| 13 ++--- arch/x86/kvm/lapic.h| 2 +- arch/x86/kvm/svm.c | 4 ++-- arch/x86/kvm/vmx.c | 30 +- arch/x86/kvm/x86.c | 35 +++ include/linux/kvm_host.h| 7 ++- 7 files changed, 63 insertions(+), 35 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/4] KVM: x86: DR0-DR3 are not clear on reset
DR0-DR3 are not cleared as they should during reset and when they are set from userspace. It appears to be caused by c77fb5fe6f03 (KVM: x86: Allow the guest to run with dirty debug registers). Force their reload on these situations. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/x86.c | 14 ++ 2 files changed, 15 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 59f4374..495fcb6 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -345,6 +345,7 @@ struct kvm_pmu { enum { KVM_DEBUGREG_BP_ENABLED = 1, KVM_DEBUGREG_WONT_EXIT = 2, + KVM_DEBUGREG_RELOAD = 4, }; struct kvm_vcpu_arch { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c93b1b5..e4ac17e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -801,6 +801,17 @@ unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_get_cr8); +static void kvm_update_dr0123(struct kvm_vcpu *vcpu) +{ + int i; + + if (!(vcpu-guest_debug KVM_GUESTDBG_USE_HW_BP)) { + for (i = 0; i KVM_NR_DB_REGS; i++) + vcpu-arch.eff_db[i] = vcpu-arch.db[i]; + vcpu-arch.switch_db_regs |= KVM_DEBUGREG_RELOAD; + } +} + static void kvm_update_dr6(struct kvm_vcpu *vcpu) { if (!(vcpu-guest_debug KVM_GUESTDBG_USE_HW_BP)) @@ -3150,6 +3161,7 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, return -EINVAL; memcpy(vcpu-arch.db, dbgregs-db, sizeof(vcpu-arch.db)); + kvm_update_dr0123(vcpu); vcpu-arch.dr6 = dbgregs-dr6; kvm_update_dr6(vcpu); vcpu-arch.dr7 = dbgregs-dr7; @@ -6322,6 +6334,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) set_debugreg(vcpu-arch.eff_db[2], 2); set_debugreg(vcpu-arch.eff_db[3], 3); set_debugreg(vcpu-arch.dr6, 6); + vcpu-arch.switch_db_regs = ~KVM_DEBUGREG_RELOAD; } trace_kvm_entry(vcpu-vcpu_id); @@ -7098,6 +7111,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) kvm_clear_exception_queue(vcpu); memset(vcpu-arch.db, 0, sizeof(vcpu-arch.db)); + kvm_update_dr0123(vcpu); vcpu-arch.dr6 = DR6_INIT; kvm_update_dr6(vcpu); vcpu-arch.dr7 = DR7_FIXED_1; -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 4/4] KVM: x86: Clear CR2 on VCPU reset
CR2 is not cleared as it should after reset. See Intel SDM table named IA-32 Processor States Following Power-up, Reset, or INIT. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/x86.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e4ac17e..8fdad04 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7117,6 +7117,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) vcpu-arch.dr7 = DR7_FIXED_1; kvm_update_dr7(vcpu); + vcpu-arch.cr2 = 0; + kvm_make_request(KVM_REQ_EVENT, vcpu); vcpu-arch.apf.msr_val = 0; vcpu-arch.st.msr_val = 0; -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5] KVM: x86: INIT and reset sequences are different
x86 architecture defines differences between the reset and INIT sequences. INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU, MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP. References (from Intel SDM): If the MP protocol has completed and a BSP is chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be repeated. [8.4.2: MP Initialization Protocol Requirements and Restrictions] [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT] If the processor is reset by asserting the INIT# pin, the x87 FPU state is not changed. [9.2: X87 FPU INITIALIZATION] The state of the local APIC following an INIT reset is the same as it is after a power-up or hardware reset, except that the APIC ID and arbitration ID registers are not affected. [10.4.7.3: Local APIC State After an INIT Reset (“Wait-for-SIPI” State)] Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/include/asm/kvm_host.h | 6 +++--- arch/x86/kvm/lapic.c| 11 ++- arch/x86/kvm/lapic.h| 2 +- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/vmx.c | 33 +++-- arch/x86/kvm/x86.c | 17 ++--- 6 files changed, 40 insertions(+), 31 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index bf5a160..59f4374 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -701,7 +701,7 @@ struct kvm_x86_ops { /* Create, but do not attach this VCPU */ struct kvm_vcpu *(*vcpu_create)(struct kvm *kvm, unsigned id); void (*vcpu_free)(struct kvm_vcpu *vcpu); - void (*vcpu_reset)(struct kvm_vcpu *vcpu); + void (*vcpu_reset)(struct kvm_vcpu *vcpu, bool init_event); void (*prepare_guest_switch)(struct kvm_vcpu *vcpu); void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu); @@ -989,7 +989,7 @@ void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id); void kvm_inject_nmi(struct kvm_vcpu *vcpu); -int fx_init(struct kvm_vcpu *vcpu); +int fx_init(struct kvm_vcpu *vcpu, bool init_event); void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes); @@ -1134,7 +1134,7 @@ int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v); int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu); int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu); int kvm_cpu_get_interrupt(struct kvm_vcpu *v); -void kvm_vcpu_reset(struct kvm_vcpu *vcpu); +void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event); void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu); void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm, unsigned long address); diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index bd4e34d..17da6fc 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1534,7 +1534,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) } -void kvm_lapic_reset(struct kvm_vcpu *vcpu) +void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) { struct kvm_lapic *apic; int i; @@ -1548,7 +1548,8 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu) /* Stop the timer in case it's a reset to an active apic */ hrtimer_cancel(apic-lapic_timer.timer); - kvm_apic_set_id(apic, vcpu-vcpu_id); + if (!init_event) + kvm_apic_set_id(apic, vcpu-vcpu_id); kvm_apic_set_version(apic-vcpu); for (i = 0; i APIC_LVT_NUM; i++) @@ -1689,7 +1690,7 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu) APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE); static_key_slow_inc(apic_sw_disabled.key); /* sw disabled at reset */ - kvm_lapic_reset(vcpu); + kvm_lapic_reset(vcpu, false); kvm_iodevice_init(apic-dev, apic_mmio_ops); return 0; @@ -2023,8 +2024,8 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu) pe = xchg(apic-pending_events, 0); if (test_bit(KVM_APIC_INIT, pe)) { - kvm_lapic_reset(vcpu); - kvm_vcpu_reset(vcpu); + kvm_lapic_reset(vcpu, true); + kvm_vcpu_reset(vcpu, true); if (kvm_vcpu_is_bsp(apic-vcpu)) vcpu-arch.mp_state = KVM_MP_STATE_RUNNABLE; else diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 0bc6c65..e4c82dc 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -48,7 +48,7 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu); int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu); int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu); void kvm_apic_accept_events(struct kvm_vcpu *vcpu); -void kvm_lapic_reset(struct kvm_vcpu *vcpu); +void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event); u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu); void
[PATCH 5/5] KVM: x86: BSP in MSR_IA32_APICBASE is writable
After reset, the CPU can change the BSP, which will be used upon INIT. Reset should return the BSP which QEMU asked for, and therefore handled accordingly. To quote: If the MP protocol has completed and a BSP is chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be repeated. [Intel SDM 8.4.2: MP Initialization Protocol Requirements and Restrictions] Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/lapic.c | 2 -- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/vmx.c | 2 +- include/linux/kvm_host.h | 7 ++- 4 files changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 17da6fc..b0dbf68 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1498,8 +1498,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) return; } - if (!kvm_vcpu_is_bsp(apic-vcpu)) - value = ~MSR_IA32_APICBASE_BSP; vcpu-arch.apic_base = value; /* update jump label if enable bit changes */ diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 1ef4c0d..ef5bf21 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1261,7 +1261,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) svm-vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE; - if (kvm_vcpu_is_bsp(svm-vcpu)) + if (kvm_vcpu_is_reset_bsp(svm-vcpu)) svm-vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP; svm_init_osvw(svm-vcpu); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 8aee6db..7e370b2 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4709,7 +4709,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) if (!init_event) { apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE; - if (kvm_vcpu_is_bsp(vmx-vcpu)) + if (kvm_vcpu_is_reset_bsp(vmx-vcpu)) apic_base_msr.data |= MSR_IA32_APICBASE_BSP; apic_base_msr.host_initiated = true; kvm_set_apic_base(vmx-vcpu, apic_base_msr); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0f574eb..8365cae 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -968,11 +968,16 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args) #endif /* CONFIG_HAVE_KVM_EVENTFD */ #ifdef CONFIG_KVM_APIC_ARCHITECTURE -static inline bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu) +static inline bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu) { return vcpu-kvm-bsp_vcpu_id == vcpu-vcpu_id; } +static inline bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu) +{ + return (vcpu-arch.apic_base MSR_IA32_APICBASE_BSP) != 0; +} + bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu); #else -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] KVM: x86: CMOV emulation on legacy mode is wrong
On legacy mode CMOV emulation should still clear bits [63:32] even if the assignment is not done. The previous fix 140bad89fd (KVM: x86: emulation of dword cmov on long-mode should clear [63:32]) was incomplete. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c941abe..62f7a39 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -5126,8 +5126,7 @@ twobyte_insn: case 0x40 ... 0x4f: /* cmov */ if (test_cc(ctxt-b, ctxt-eflags)) ctxt-dst.val = ctxt-src.val; - else if (ctxt-mode != X86EMUL_MODE_PROT64 || -ctxt-op_bytes != 4) + else if (ctxt-op_bytes != 4) ctxt-dst.type = OP_NONE; /* no writeback */ break; case 0x80 ... 0x8f: /* jnz rel, etc*/ -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] KVM: x86: BSF and BSR emulation change register unnecassarily
If the source of BSF and BSR is zero, the destination register should not change. That is how real hardware behaves. If we set the destination even with the same value that we had before, we may clear bits [63:32] unnecassarily. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 19 ++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 4961dc5..7004577 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -962,6 +962,22 @@ FASTOP2(xadd); FASTOP2R(cmp, cmp_r); +static int em_bsf_c(struct x86_emulate_ctxt *ctxt) +{ + /* If src is zero, do not writeback, but update flags */ + if (ctxt-src.val == 0) + ctxt-dst.type = OP_NONE; + return fastop(ctxt, em_bsf); +} + +static int em_bsr_c(struct x86_emulate_ctxt *ctxt) +{ + /* If src is zero, do not writeback, but update flags */ + if (ctxt-src.val == 0) + ctxt-dst.type = OP_NONE; + return fastop(ctxt, em_bsr); +} + static u8 test_cc(unsigned int condition, unsigned long flags) { u8 rc; @@ -4188,7 +4204,8 @@ static const struct opcode twobyte_table[256] = { N, N, G(BitOp, group8), F(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc), - F(DstReg | SrcMem | ModRM, em_bsf), F(DstReg | SrcMem | ModRM, em_bsr), + I(DstReg | SrcMem | ModRM, em_bsf_c), + I(DstReg | SrcMem | ModRM, em_bsr_c), D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov), /* 0xC0 - 0xC7 */ F2bv(DstMem | SrcReg | ModRM | SrcWrite | Lock, em_xadd), -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/5] KVM: x86: 64/32 bit fixes and INIT/BSP fixes
This patch-set handles 2 issues. Patches 1-3 deal with some more cases in which bits [63:32] are not cleared when using dword opsize. Patches 4-5 handle anomalies with INIT/BSP (INIT does not behave exactly as reset). Thanks for reviewing the patches. Nadav Amit (5): KVM: x86: CMOV emulation on legacy mode is wrong KVM: x86: POPA emulation may not clear bits [63:32] KVM: x86: BSF and BSR emulation change register unnecassarily KVM: x86: INIT and reset sequences are different KVM: x86: BSP in MSR_IA32_APICBASE is writable arch/x86/include/asm/kvm_host.h | 6 ++-- arch/x86/kvm/emulate.c | 61 - arch/x86/kvm/lapic.c| 13 - arch/x86/kvm/lapic.h| 2 +- arch/x86/kvm/svm.c | 4 +-- arch/x86/kvm/vmx.c | 33 -- arch/x86/kvm/x86.c | 17 +++- include/linux/kvm_host.h| 7 - 8 files changed, 89 insertions(+), 54 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] KVM: x86: POPA emulation may not clear bits [63:32]
POPA should assign the values to the registers as usual registers are assigned. In other words, 32-bits register assignments should clear bits [63:32] of the register. Split the code of register assignments that will be used by future changes as well. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 39 +++ 1 file changed, 23 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 62f7a39..4961dc5 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -478,6 +478,25 @@ static void assign_masked(ulong *dest, ulong src, ulong mask) *dest = (*dest ~mask) | (src mask); } +static void assign_register(unsigned long *reg, u64 val, int bytes) +{ + /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */ + switch (bytes) { + case 1: + *(u8 *)reg = (u8)val; + break; + case 2: + *(u16 *)reg = (u16)val; + break; + case 4: + *reg = (u32)val; + break; /* 64b: zero-extend */ + case 8: + *reg = val; + break; + } +} + static inline unsigned long ad_mask(struct x86_emulate_ctxt *ctxt) { return (1UL (ctxt-ad_bytes 3)) - 1; @@ -1691,21 +1710,7 @@ static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt, static void write_register_operand(struct operand *op) { - /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */ - switch (op-bytes) { - case 1: - *(u8 *)op-addr.reg = (u8)op-val; - break; - case 2: - *(u16 *)op-addr.reg = (u16)op-val; - break; - case 4: - *op-addr.reg = (u32)op-val; - break; /* 64b: zero-extend */ - case 8: - *op-addr.reg = op-val; - break; - } + return assign_register(op-addr.reg, op-val, op-bytes); } static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op) @@ -1926,6 +1931,7 @@ static int em_popa(struct x86_emulate_ctxt *ctxt) { int rc = X86EMUL_CONTINUE; int reg = VCPU_REGS_RDI; + u32 val; while (reg = VCPU_REGS_RAX) { if (reg == VCPU_REGS_RSP) { @@ -1933,9 +1939,10 @@ static int em_popa(struct x86_emulate_ctxt *ctxt) --reg; } - rc = emulate_pop(ctxt, reg_rmw(ctxt, reg), ctxt-op_bytes); + rc = emulate_pop(ctxt, val, ctxt-op_bytes); if (rc != X86EMUL_CONTINUE) break; + assign_register(reg_rmw(ctxt, reg), val, ctxt-op_bytes); --reg; } return rc; -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: x86: BSP in MSR_IA32_APICBASE is writable
Paolo, It appears you are right and I have not tested 4 and 5 well enough. I’ll repost them (the others were tested presumably well enough). Two short questions: Can I use init.c in the kvm-unit-tests ? Why is it disabled? Nadav Paolo Bonzini pbonz...@redhat.com wrote: On 30/03/2015 14:39, Nadav Amit wrote: After reset, the CPU can change the BSP, which will be used upon INIT. Reset should return the BSP which QEMU asked for, and therefore handled accordingly. To quote: If the MP protocol has completed and a BSP is chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be repeated. [Intel SDM 8.4.2: MP Initialization Protocol Requirements and Restrictions] Signed-off-by: Nadav Amit na...@cs.technion.ac.il Please provide a kvm-unit-tests testcase for this. Paolo --- arch/x86/kvm/lapic.c | 2 -- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/vmx.c | 2 +- include/linux/kvm_host.h | 7 ++- 4 files changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 17da6fc..b0dbf68 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1498,8 +1498,6 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value) return; } -if (!kvm_vcpu_is_bsp(apic-vcpu)) -value = ~MSR_IA32_APICBASE_BSP; vcpu-arch.apic_base = value; /* update jump label if enable bit changes */ diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 1ef4c0d..ef5bf21 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1261,7 +1261,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) svm-vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE; -if (kvm_vcpu_is_bsp(svm-vcpu)) +if (kvm_vcpu_is_reset_bsp(svm-vcpu)) svm-vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP; svm_init_osvw(svm-vcpu); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 8aee6db..7e370b2 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4709,7 +4709,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) if (!init_event) { apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE; -if (kvm_vcpu_is_bsp(vmx-vcpu)) +if (kvm_vcpu_is_reset_bsp(vmx-vcpu)) apic_base_msr.data |= MSR_IA32_APICBASE_BSP; apic_base_msr.host_initiated = true; kvm_set_apic_base(vmx-vcpu, apic_base_msr); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0f574eb..8365cae 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -968,11 +968,16 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args) #endif /* CONFIG_HAVE_KVM_EVENTFD */ #ifdef CONFIG_KVM_APIC_ARCHITECTURE -static inline bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu) +static inline bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu) { return vcpu-kvm-bsp_vcpu_id == vcpu-vcpu_id; } +static inline bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu) +{ +return (vcpu-arch.apic_base MSR_IA32_APICBASE_BSP) != 0; +} + bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu); #else -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/5] KVM: x86: BSP in MSR_IA32_APICBASE is writable
Paolo Bonzini pbonz...@redhat.com wrote: On 30/03/2015 16:40, Nadav Amit wrote: Paolo, It appears you are right and I have not tested 4 and 5 well enough. I’ll repost them (the others were tested presumably well enough). Two short questions: Can I use init.c in the kvm-unit-tests ? Why is it disabled? Because QEMU support for INIT is incomplete, so the tests would fail (Uh, hard reset!). IIRC sending init to BSP with APIC_DEST_SELF is also not supported by actual hardware (or at least not supported officially) so that test would also have to be changed to not use a shortcut. So, I would revive the init unit-test and disable the failing assertions, right? BTW: It appears that there is another bug - DR[0..3] are not reloaded after reset. Regards, Nadav -- 8 -- From: Nadav Amit na...@cs.technion.ac.il Subject: [PATCH] KVM: x86: DR0-DR3 are not clear on reset DR0-DR3 are not cleared as they should during reset and when they are set from userspace. It appears to be caused by c77fb5fe6f03 (KVM: x86: Allow the guest to run with dirty debug registers). Force their reload on these situations. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/x86.c | 14 ++ 2 files changed, 15 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index bf5a160..913ae41 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -345,6 +345,7 @@ struct kvm_pmu { enum { KVM_DEBUGREG_BP_ENABLED = 1, KVM_DEBUGREG_WONT_EXIT = 2, + KVM_DEBUGREG_RELOAD = 4, }; struct kvm_vcpu_arch { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cc2c759..1f65c3a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -801,6 +801,17 @@ unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_get_cr8); +static void kvm_update_dr0123(struct kvm_vcpu *vcpu) +{ + int i; + + if (!(vcpu-guest_debug KVM_GUESTDBG_USE_HW_BP)) { + for (i = 0; i KVM_NR_DB_REGS; i++) + vcpu-arch.eff_db[i] = vcpu-arch.db[i]; + vcpu-arch.switch_db_regs |= KVM_DEBUGREG_RELOAD; + } +} + static void kvm_update_dr6(struct kvm_vcpu *vcpu) { if (!(vcpu-guest_debug KVM_GUESTDBG_USE_HW_BP)) @@ -3150,6 +3161,7 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, return -EINVAL; memcpy(vcpu-arch.db, dbgregs-db, sizeof(vcpu-arch.db)); + kvm_update_dr0123(vcpu); vcpu-arch.dr6 = dbgregs-dr6; kvm_update_dr6(vcpu); vcpu-arch.dr7 = dbgregs-dr7; @@ -6322,6 +6334,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) set_debugreg(vcpu-arch.eff_db[2], 2); set_debugreg(vcpu-arch.eff_db[3], 3); set_debugreg(vcpu-arch.dr6, 6); + vcpu-arch.switch_db_regs = ~KVM_DEBUGREG_RELOAD; } trace_kvm_entry(vcpu-vcpu_id); @@ -7096,6 +7109,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu) kvm_clear_exception_queue(vcpu); memset(vcpu-arch.db, 0, sizeof(vcpu-arch.db)); + kvm_update_dr0123(vcpu); vcpu-arch.dr6 = DR6_INIT; kvm_update_dr6(vcpu); vcpu-arch.dr7 = DR7_FIXED_1; -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] KVM: x86: Remove redundant definitions
Some constants are redfined in emulate.c. Avoid it. s/SELECTOR_RPL_MASK/SEGMENT_RPL_MASK s/SELECTOR_TI_MASK/SEGMENT_TI_MASK No functional change. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/include/asm/kvm_host.h | 3 --- arch/x86/kvm/emulate.c | 6 +++--- arch/x86/kvm/vmx.c | 18 +- 3 files changed, 12 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 7ba3d9d..30b28dc 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -81,9 +81,6 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) (base_gfn KVM_HPAGE_GFN_SHIFT(level)); } -#define SELECTOR_TI_MASK (1 2) -#define SELECTOR_RPL_MASK 0x03 - #define KVM_PERMILLE_MMU_PAGES 20 #define KVM_MIN_ALLOC_MMU_PAGES 64 #define KVM_MMU_HASH_SHIFT 10 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 9a578a1..a48bcd7 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2412,7 +2412,7 @@ static int em_sysenter(struct x86_emulate_ctxt *ctxt) return emulate_gp(ctxt, 0); ctxt-eflags = ~(X86_EFLAGS_VM | X86_EFLAGS_IF); - cs_sel = (u16)msr_data ~SELECTOR_RPL_MASK; + cs_sel = (u16)msr_data ~SEGMENT_RPL_MASK; ss_sel = cs_sel + 8; if (efer EFER_LMA) { cs.d = 0; @@ -2479,8 +2479,8 @@ static int em_sysexit(struct x86_emulate_ctxt *ctxt) return emulate_gp(ctxt, 0); break; } - cs_sel |= SELECTOR_RPL_MASK; - ss_sel |= SELECTOR_RPL_MASK; + cs_sel |= SEGMENT_RPL_MASK; + ss_sel |= SEGMENT_RPL_MASK; ops-set_segment(ctxt, cs_sel, cs, 0, VCPU_SREG_CS); ops-set_segment(ctxt, ss_sel, ss, 0, VCPU_SREG_SS); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index fdd9f8b..63ca692 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3263,8 +3263,8 @@ static void fix_pmode_seg(struct kvm_vcpu *vcpu, int seg, * default value. */ if (seg == VCPU_SREG_CS || seg == VCPU_SREG_SS) - save-selector = ~SELECTOR_RPL_MASK; - save-dpl = save-selector SELECTOR_RPL_MASK; + save-selector = ~SEGMENT_RPL_MASK; + save-dpl = save-selector SEGMENT_RPL_MASK; save-s = 1; } vmx_set_segment(vcpu, save, seg); @@ -3837,7 +3837,7 @@ static bool code_segment_valid(struct kvm_vcpu *vcpu) unsigned int cs_rpl; vmx_get_segment(vcpu, cs, VCPU_SREG_CS); - cs_rpl = cs.selector SELECTOR_RPL_MASK; + cs_rpl = cs.selector SEGMENT_RPL_MASK; if (cs.unusable) return false; @@ -3865,7 +3865,7 @@ static bool stack_segment_valid(struct kvm_vcpu *vcpu) unsigned int ss_rpl; vmx_get_segment(vcpu, ss, VCPU_SREG_SS); - ss_rpl = ss.selector SELECTOR_RPL_MASK; + ss_rpl = ss.selector SEGMENT_RPL_MASK; if (ss.unusable) return true; @@ -3887,7 +3887,7 @@ static bool data_segment_valid(struct kvm_vcpu *vcpu, int seg) unsigned int rpl; vmx_get_segment(vcpu, var, seg); - rpl = var.selector SELECTOR_RPL_MASK; + rpl = var.selector SEGMENT_RPL_MASK; if (var.unusable) return true; @@ -3914,7 +3914,7 @@ static bool tr_valid(struct kvm_vcpu *vcpu) if (tr.unusable) return false; - if (tr.selector SELECTOR_TI_MASK) /* TI = 1 */ + if (tr.selector SEGMENT_TI_MASK) /* TI = 1 */ return false; if (tr.type != 3 tr.type != 11) /* TODO: Check if guest is in IA32e mode */ return false; @@ -3932,7 +3932,7 @@ static bool ldtr_valid(struct kvm_vcpu *vcpu) if (ldtr.unusable) return true; - if (ldtr.selector SELECTOR_TI_MASK) /* TI = 1 */ + if (ldtr.selector SEGMENT_TI_MASK)/* TI = 1 */ return false; if (ldtr.type != 2) return false; @@ -3949,8 +3949,8 @@ static bool cs_ss_rpl_check(struct kvm_vcpu *vcpu) vmx_get_segment(vcpu, cs, VCPU_SREG_CS); vmx_get_segment(vcpu, ss, VCPU_SREG_SS); - return ((cs.selector SELECTOR_RPL_MASK) == -(ss.selector SELECTOR_RPL_MASK)); + return ((cs.selector SEGMENT_RPL_MASK) == +(ss.selector SEGMENT_RPL_MASK)); } /* -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] KVM: x86: removing redundant eflags bits definitions
The eflags are redefined (using other defines) in emulate.c. Use the definition from processor-flags.h as some mess already started. No functional change. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/include/asm/kvm_host.h | 2 - arch/x86/kvm/emulate.c | 105 ++-- 2 files changed, 46 insertions(+), 61 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index bf5a160..7ba3d9d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -84,8 +84,6 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) #define SELECTOR_TI_MASK (1 2) #define SELECTOR_RPL_MASK 0x03 -#define IOPL_SHIFT 12 - #define KVM_PERMILLE_MMU_PAGES 20 #define KVM_MIN_ALLOC_MMU_PAGES 64 #define KVM_MMU_HASH_SHIFT 10 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c941abe..9a578a1 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -248,27 +248,7 @@ struct mode_dual { struct opcode mode64; }; -/* EFLAGS bit definitions. */ -#define EFLG_ID (121) -#define EFLG_VIP (120) -#define EFLG_VIF (119) -#define EFLG_AC (118) -#define EFLG_VM (117) -#define EFLG_RF (116) -#define EFLG_IOPL (312) -#define EFLG_NT (114) -#define EFLG_OF (111) -#define EFLG_DF (110) -#define EFLG_IF (19) -#define EFLG_TF (18) -#define EFLG_SF (17) -#define EFLG_ZF (16) -#define EFLG_AF (14) -#define EFLG_PF (12) -#define EFLG_CF (10) - #define EFLG_RESERVED_ZEROS_MASK 0xffc0802a -#define EFLG_RESERVED_ONE_MASK 2 enum x86_transfer_type { X86_TRANSFER_NONE, @@ -317,7 +297,8 @@ static void invalidate_registers(struct x86_emulate_ctxt *ctxt) * These EFLAGS bits are restored from saved value during emulation, and * any changes are written back to the saved value after emulation. */ -#define EFLAGS_MASK (EFLG_OF|EFLG_SF|EFLG_ZF|EFLG_AF|EFLG_PF|EFLG_CF) +#define EFLAGS_MASK (X86_EFLAGS_OF|X86_EFLAGS_SF|X86_EFLAGS_ZF|X86_EFLAGS_AF|\ +X86_EFLAGS_PF|X86_EFLAGS_CF) #ifdef CONFIG_X86_64 #define ON64(x) x @@ -1399,7 +1380,7 @@ static int pio_in_emulated(struct x86_emulate_ctxt *ctxt, unsigned int in_page, n; unsigned int count = ctxt-rep_prefix ? address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) : 1; - in_page = (ctxt-eflags EFLG_DF) ? + in_page = (ctxt-eflags X86_EFLAGS_DF) ? offset_in_page(reg_read(ctxt, VCPU_REGS_RDI)) : PAGE_SIZE - offset_in_page(reg_read(ctxt, VCPU_REGS_RDI)); n = min3(in_page, (unsigned int)sizeof(rc-data) / size, count); @@ -1412,7 +1393,7 @@ static int pio_in_emulated(struct x86_emulate_ctxt *ctxt, } if (ctxt-rep_prefix (ctxt-d String) - !(ctxt-eflags EFLG_DF)) { + !(ctxt-eflags X86_EFLAGS_DF)) { ctxt-dst.data = rc-data + rc-pos; ctxt-dst.type = OP_MEM_STR; ctxt-dst.count = (rc-end - rc-pos) / size; @@ -1792,32 +1773,34 @@ static int emulate_popf(struct x86_emulate_ctxt *ctxt, { int rc; unsigned long val, change_mask; - int iopl = (ctxt-eflags X86_EFLAGS_IOPL) IOPL_SHIFT; + int iopl = (ctxt-eflags X86_EFLAGS_IOPL) X86_EFLAGS_IOPL_BIT; int cpl = ctxt-ops-cpl(ctxt); rc = emulate_pop(ctxt, val, len); if (rc != X86EMUL_CONTINUE) return rc; - change_mask = EFLG_CF | EFLG_PF | EFLG_AF | EFLG_ZF | EFLG_SF | EFLG_OF - | EFLG_TF | EFLG_DF | EFLG_NT | EFLG_AC | EFLG_ID; + change_mask = X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | + X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF | + X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_NT | + X86_EFLAGS_AC | X86_EFLAGS_ID; switch(ctxt-mode) { case X86EMUL_MODE_PROT64: case X86EMUL_MODE_PROT32: case X86EMUL_MODE_PROT16: if (cpl == 0) - change_mask |= EFLG_IOPL; + change_mask |= X86_EFLAGS_IOPL; if (cpl = iopl) - change_mask |= EFLG_IF; + change_mask |= X86_EFLAGS_IF; break; case X86EMUL_MODE_VM86: if (iopl 3) return emulate_gp(ctxt, 0); - change_mask |= EFLG_IF; + change_mask |= X86_EFLAGS_IF; break; default: /* real mode */ - change_mask |= (EFLG_IOPL | EFLG_IF); + change_mask |= (X86_EFLAGS_IOPL | X86_EFLAGS_IF); break; } @@ -1918,7 +1901,7 @@ static int em_pusha(struct x86_emulate_ctxt *ctxt) static int em_pushf(struct x86_emulate_ctxt *ctxt) { - ctxt-src.val = (unsigned long)ctxt-eflags ~EFLG_VM; + ctxt-src.val = (unsigned long)ctxt-eflags
[PATCH 0/2] KVM: x86: Removing redundant definitions from emulator
There are several redundant definitions in processor-flags.h and emulator.c. Slowly, but surely they will get mixed, so removing those of emulator.c seems like a reasonable move (unless I am missing something, e.g., kvm-kmod consideration). Nadav Amit (2): KVM: x86: removing redundant eflags bits definitions KVM: x86: Remove redundant definitions arch/x86/include/asm/kvm_host.h | 5 -- arch/x86/kvm/emulate.c | 111 ++-- arch/x86/kvm/vmx.c | 18 +++ 3 files changed, 58 insertions(+), 76 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2 CPU Conformance Issue in KVM/x86
Avi Kivity avi.kiv...@gmail.com wrote: On 03/03/2015 11:52 AM, Paolo Bonzini wrote: In this case, the VM might expect exceptions when PTE bits which are higher than the maximum (reported) address width are set, and it would not get such exceptions. This problem can easily be experienced by small change to the existing KVM unit-tests. There are many variants to this problem, and the only solution which I consider complete is to report to the VM the maximum (52) physical address width to the VM, configure the VM to exit on #PF with reserved-bit error-codes, and then emulate these faulting instructions. Not even that would be a definitive solution. If the guest tries to map RAM (e.g. a PCI BAR that is backed by RAM) above the host MAXPHYADDR, you would get EPT misconfiguration vmexits. I think there is no way to emulate physical address width correctly, except by disabling EPT. Is the issue emulating a higher MAXPHYADDR on the guest than is available on the host? I don't think there's any need to support that. Emulating a lower setting on the guest than is available on the host is, I think, desirable. Whether it would work depends on the relative priority of EPT misconfiguration exits vs. page table permission faults. Thanks for the feedback. Guest page-table permissions faults got priority over EPT misconfiguration. KVM can even be set to trap page-table permission faults, at least in VT-x. Anyhow, I don’t think it is enough. Here is an example My machine has MAXPHYADDR of 46. I modified kvm-unit-tests access test to set pte.45 instead of pte.51, which from the VM point-of-view should cause the #PF error-code indicate the reserved bits are set (just as pte.51 does). Here is one error from the log: test pte.p pte.45 pde.p user: FAIL: error code 5 expected d Dump mapping: address: 1234 --L4: 304b007 --L3: 304c007 --L2: 304d001 --L1: 2201 As you can see, the #PF should have had two reasons: reserved bits, and user access to supervisor only page. The error-code however does not indicate the reserved-bits are set. Note that KVM did not trap any exit on that faulting instruction, as otherwise it would try to emulate the instruction and assuming it is supported (and that the #PF was not on an instruction fetch), should be able to emulate the #PF correctly. [ The test actually crashes soon after this error due to these reasons. ] Anyhow, that is the reason for me to assume that having the maximum MAXPHYADDR is better. Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2 CPU Conformance Issue in KVM/x86
Avi Kivity avi.kiv...@gmail.com wrote: On 03/09/2015 07:51 PM, Nadav Amit wrote: Avi Kivity avi.kiv...@gmail.com wrote: On 03/03/2015 11:52 AM, Paolo Bonzini wrote: In this case, the VM might expect exceptions when PTE bits which are higher than the maximum (reported) address width are set, and it would not get such exceptions. This problem can easily be experienced by small change to the existing KVM unit-tests. There are many variants to this problem, and the only solution which I consider complete is to report to the VM the maximum (52) physical address width to the VM, configure the VM to exit on #PF with reserved-bit error-codes, and then emulate these faulting instructions. Not even that would be a definitive solution. If the guest tries to map RAM (e.g. a PCI BAR that is backed by RAM) above the host MAXPHYADDR, you would get EPT misconfiguration vmexits. I think there is no way to emulate physical address width correctly, except by disabling EPT. Is the issue emulating a higher MAXPHYADDR on the guest than is available on the host? I don't think there's any need to support that. Emulating a lower setting on the guest than is available on the host is, I think, desirable. Whether it would work depends on the relative priority of EPT misconfiguration exits vs. page table permission faults. Thanks for the feedback. Guest page-table permissions faults got priority over EPT misconfiguration. KVM can even be set to trap page-table permission faults, at least in VT-x. Anyhow, I don’t think it is enough. Why is it not enough? If you trap a permission fault, you can inject any exception error code you like. Because there is no real permission fault. In the following example, the VM expects one (VM’s MAXPHYADDR=40), but there isn’t (Host’s MAXPHYADDR=46), so the hypervisor cannot trap it. It can only trap all #PF, which is obviously too intrusive. Here is an example My machine has MAXPHYADDR of 46. I modified kvm-unit-tests access test to set pte.45 instead of pte.51, which from the VM point-of-view should cause the #PF error-code indicate the reserved bits are set (just as pte.51 does). Here is one error from the log: test pte.p pte.45 pde.p user: FAIL: error code 5 expected d Dump mapping: address: 1234 --L4: 304b007 --L3: 304c007 --L2: 304d001 --L1: 2201 This is with an ept misconfig programmed into that address, yes? A reserved bit in the PTE is set - from the VM point-of-view. If there wasn’t another cause for #PF, it would lead to EPT violation/misconfig. As you can see, the #PF should have had two reasons: reserved bits, and user access to supervisor only page. The error-code however does not indicate the reserved-bits are set. Note that KVM did not trap any exit on that faulting instruction, as otherwise it would try to emulate the instruction and assuming it is supported (and that the #PF was not on an instruction fetch), should be able to emulate the #PF correctly. [ The test actually crashes soon after this error due to these reasons. ] Anyhow, that is the reason for me to assume that having the maximum MAXPHYADDR is better. Well, that doesn't work for the reasons Paolo noted. The guest can have a ivshmem device attached, and map it above a host-supported virtual address, and suddenly it goes slow. I fully understand. That’s the reason I don’t have a reasonable solution. Regards, Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
2 CPU Conformance Issue in KVM/x86
I got two conformance issues in x86/KVM. For the first one I have no solution. For the latter, my solution is not “great”. Ideas and feedback would be appreciated. The first problem is caused by the deprecating of FPU CS/DS in new Intel CPUs. Assume the VM executes a floating point instruction in real mode (when CS != 0), and later KVM exits to userspace, causing XSAVE/XRSTOR to save and restore the FPU state. At this point FPU CS/DS in new CPUs are zero. If the VM then executes FSAVE in real-mode the save FPU IP would be wrong, since it is actually calculated by the CPU as [FPU CS] * 4 + [FPU IP]. The second problem occurs when the maximum physical address width that KVM reports to the VM is different than the real one. Assume the real one is greater than the reported one (which in KVM is not greater than 40). In this case, the VM might expect exceptions when PTE bits which are higher than the maximum (reported) address width are set, and it would not get such exceptions. This problem can easily be experienced by small change to the existing KVM unit-tests. There are many variants to this problem, and the only solution which I consider complete is to report to the VM the maximum (52) physical address width to the VM, configure the VM to exit on #PF with reserved-bit error-codes, and then emulate these faulting instructions. Thoughts/ideas? Regards, Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2 CPU Conformance Issue in KVM/x86
Paolo, Thanks for the feedback. One small comment below. Paolo Bonzini pbonz...@redhat.com wrote: On 03/03/2015 09:34, Nadav Amit wrote: I got two conformance issues in x86/KVM. For the first one I have no solution. For the latter, my solution is not “great”. Ideas and feedback would be appreciated. The first problem is caused by the deprecating of FPU CS/DS in new Intel CPUs. Assume the VM executes a floating point instruction in real mode (when CS != 0), and later KVM exits to userspace, causing XSAVE/XRSTOR to save and restore the FPU state. At this point FPU CS/DS in new CPUs are zero. If the VM then executes FSAVE in real-mode the save FPU IP would be wrong, since it is actually calculated by the CPU as [FPU CS] * 4 + [FPU IP]. I think this was analyzed a couple years ago and we decided that this bit was not virtualizable. I am fully aware of the previous reports ( https://lkml.org/lkml/2013/10/16/258 ). However, one might have expected the conformance problem to be fully resolved now, since [FPU CS] and [FPU DS] are deprecated in new CPUs. Indeed, the problem is resolved in all modes, but not in real-mode. Regards, Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/4] KVM: APIC improvements (with bonus mixed mode)
Marcello, Radim, As you know - I can run some tests on the patches and whether they comply with real hardware. Please let me know which version to test and I’ll try to do next week. Regards, Nadav Marcelo Tosatti mtosa...@redhat.com wrote: Radim, On Thu, Feb 12, 2015 at 07:41:30PM +0100, Radim Krčmář wrote: Each patch has a diff from v1, here is only a prologue on the mythical mixed xAPIC and x2APIC mode: There is one interesting alias in xAPIC and x2APIC ICR destination, the 0xff00, which is a broadcast in xAPIC and either a physical message to high x2APIC ID or a message to an empty set in x2APIC cluster. This corner case in mixed mode can be triggered by a weird message 1) when we have no x2APIC ID 0xff00, but send x2APIC message there 10.7 SYSTEM AND APIC BUS ARBITRATION Note that except for the SIPI IPI (see Section 10.6.1, “Interrupt Command Register (ICR)”), all bus messages that fail to be delivered to their specified destination or destinations are automatically retried. Software should avoid situations in which IPIs are sent to disabled or nonexistent local APICs, causing the messages to be resent repeatedly. or after something that isn't forbidden in SDM most likely because they didn't think that anyone would ever consider it 2) we have x2APIC ID 0xff00 and reset other x2APICs into xAPIC mode Or just x2APIC initialization in non lockstep: vcpu0 vcpu1 T0) xapic mode xapic mode T1) x2apic enable T2) broadcast IPI Current KVM doesn't need to consider (2), so there only is a slim chance that some hobbyist OS pushes the specification to its limits. The problem is that SDM for xAPIC doesn't believably specify[1] if all messages beginning with 0xff are considered as broadcasts, 10.6.2.1: A broadcast IPI (bits 28-31 of the MDA are 1's) No volunteer came to clear this up, so I hacked Linux to have one x2APIC core between xAPIC ones. Physical IPI to 0xff00 got delivered only to CPU0, like most other destinations, Logical IPI to 0xff00 got dropped and only 0x worked as a broadcast in both modes. I think it is because Intel never considered mixed mode to be valid, and seen delivery might be an emergent feature of QPI routing. Luckily, broadcast from xAPIC isn't delivered to x2APIC. In real hardware? Real exploration would require greater modifications to Linux (ideally writing a custom kernel), so this series implements something that makes some sense and isn't too far from reality. Ok, so the problem is that broadcast (ICR.destination == 0xff00) from xAPIC CPU is not delivered to x2APIC CPUs ? Radim Krčmář (4): KVM: x86: use MDA for interrupt matching KVM: x86: fix mixed APIC mode broadcast KVM: x86: avoid logical_map when it is invalid KVM: x86: simplify kvm_apic_map I can't find any restriction against (cpu0==x2APIC, cpu1==xAPIC) in the documentation. Anyway, emulation should match physical hardware. From your message above, it is not clear what is the behaviour there: No volunteer came to clear this up, so I hacked Linux to have one x2APIC core between xAPIC ones. Physical IPI to 0xff00 got delivered only to CPU0, like most other destinations, Logical IPI to 0xff00 got dropped and only 0x worked as a broadcast in both modes. Luckily, broadcast from xAPIC isn't delivered to x2APIC. From the x2APIC CPU or the xAPIC ones? It should be easy to write kvm-unit-test testcases that match physical hardware behaviour (in general, i am having a hard time figure out in what way mixed mode is supposed to behave, please describe it clearly). -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: x86: emulate: correct page fault error code for NoWrite instructions
Sorry for that. I’ll try to run some few more tests next week. Nadav Paolo Bonzini pbonz...@redhat.com wrote: NoWrite instructions (e.g. cmp or test) never set the write access bit in the error code, even if one of the operands is treated as a destination. Fixes: c205fb7d7d4f81e46fc577b707ceb9e356af1456 Cc: Nadav Amit na...@cs.technion.ac.il Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/kvm/emulate.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 81dcf7964701..a943bf0c06d0 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4954,7 +4954,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) rc = segmented_read(ctxt, ctxt-dst.addr.mem, ctxt-dst.val, ctxt-dst.bytes); if (rc != X86EMUL_CONTINUE) { - if (rc == X86EMUL_PROPAGATE_FAULT + if (!(ctxt-d NoWrite) + rc == X86EMUL_PROPAGATE_FAULT ctxt-exception.vector == PF_VECTOR) ctxt-exception.error_code |= PFERR_WRITE_MASK; goto done; -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/7] KVM: x86: Dirty the dest op page on cmpxchg emulation
Intel SDM says for CMPXCHG: To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison.. This means the destination page should be dirtied. Fix it to by writing back the original value if cmpxchg failed. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index ef23c1e..aa27254 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2205,12 +2205,15 @@ static int em_cmpxchg(struct x86_emulate_ctxt *ctxt) fastop(ctxt, em_cmp); if (ctxt-eflags EFLG_ZF) { - /* Success: write back to memory. */ + /* Success: write back to memory; no update of EAX */ + ctxt-src.type = OP_NONE; ctxt-dst.val = ctxt-src.orig_val; } else { /* Failure: write the value we saw to EAX. */ - ctxt-dst.type = OP_REG; - ctxt-dst.addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX); + ctxt-src.type = OP_REG; + ctxt-src.addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX); + ctxt-src.val = ctxt-dst.orig_val; + /* Create write-cycle to dest by writing the same value */ ctxt-dst.val = ctxt-dst.orig_val; } return X86EMUL_CONTINUE; @@ -4157,7 +4160,7 @@ static const struct opcode twobyte_table[256] = { F(DstMem | SrcReg | Src2CL | ModRM, em_shrd), GD(0, group15), F(DstReg | SrcMem | ModRM, em_imul), /* 0xB0 - 0xB7 */ - I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_cmpxchg), + I2bv(DstMem | SrcReg | ModRM | Lock | PageTable | SrcWrite, em_cmpxchg), I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg), F(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr), I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg), -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/7] KVM: x86: Fix defines in emulator.c
Unnecassary define was left after commit 7d882ffa81d5 (KVM: x86: Revert NoBigReal patch in the emulator”). Commit 39f062ff51b2 (KVM: x86: Generate #UD when memory operand is required”) was missing undef. Fix it. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index db3cf39..997c9eb 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -167,7 +167,6 @@ #define NoMod ((u64)1 47) /* Mod field is ignored */ #define Intercept ((u64)1 48) /* Has valid intercept field */ #define CheckPerm ((u64)1 49) /* Has valid check_perm field */ -#define NoBigReal ((u64)1 50) /* No big real mode */ #define PrivUD ((u64)1 51) /* #UD instead of #GP on CPL 0 */ #define NearBranch ((u64)1 52) /* Near branches */ #define No16 ((u64)1 53) /* No 16 bit operand */ @@ -4246,6 +4245,7 @@ static const struct opcode opcode_map_0f_38[256] = { #undef GP #undef EXT #undef MD +#undef ID #undef D2bv #undef D2bvIP -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/7] KVM: x86: ARPL emulation can cause spurious exceptions
ARPL and MOVSXD are encoded the same and their execution depends on the execution mode. The operand sizes of each instruction are different. Currently, ARPL is detected too late, after the decoding was already done, and therefore may result in spurious exception (instead of failed emulation). Introduce a group to the emulator to handle instructions according to execution mode (32/64 bits). Note: in order not to make changes that may affect performance, the new ModeDual can only be applied to instructions with ModRM. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 32 ++-- 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index fa3ca55..db3cf39 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -125,6 +125,7 @@ #define RMExt (415) /* Opcode extension in ModRM r/m if mod == 3 */ #define Escape (515) /* Escape to coprocessor instruction */ #define InstrDual (615) /* Alternate instruction decoding of mod == 3 */ +#define ModeDual(715) /* Different instruction for 32/64 bit */ #define Sse (118) /* SSE Vector instruction */ /* Generic ModRM decode. */ #define ModRM (119) @@ -215,6 +216,7 @@ struct opcode { const struct gprefix *gprefix; const struct escape *esc; const struct instr_dual *idual; + const struct mode_dual *mdual; void (*fastop)(struct fastop *fake); } u; int (*check_perm)(struct x86_emulate_ctxt *ctxt); @@ -242,6 +244,11 @@ struct instr_dual { struct opcode mod3; }; +struct mode_dual { + struct opcode mode32; + struct opcode mode64; +}; + /* EFLAGS bit definitions. */ #define EFLG_ID (121) #define EFLG_VIP (120) @@ -3530,6 +3537,12 @@ static int em_clflush(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } +static int em_movsxd(struct x86_emulate_ctxt *ctxt) +{ + ctxt-dst.val = (s32) ctxt-src.val; + return X86EMUL_CONTINUE; +} + static bool valid_cr(int nr) { switch (nr) { @@ -3729,6 +3742,7 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt) #define G(_f, _g) { .flags = ((_f) | Group | ModRM), .u.group = (_g) } #define GD(_f, _g) { .flags = ((_f) | GroupDual | ModRM), .u.gdual = (_g) } #define ID(_f, _i) { .flags = ((_f) | InstrDual | ModRM), .u.idual = (_i) } +#define MD(_f, _m) { .flags = ((_f) | ModeDual), .u.mdual = (_m) } #define E(_f, _e) { .flags = ((_f) | Escape | ModRM), .u.esc = (_e) } #define I(_f, _e) { .flags = (_f), .u.execute = (_e) } #define F(_f, _e) { .flags = (_f) | Fastop, .u.fastop = (_e) } @@ -3973,6 +3987,10 @@ static const struct instr_dual instr_dual_0f_c3 = { I(DstMem | SrcReg | ModRM | No16 | Mov, em_mov), N }; +static const struct mode_dual mode_dual_63 = { + N, I(DstReg | SrcMem32 | ModRM | Mov, em_movsxd) +}; + static const struct opcode opcode_table[256] = { /* 0x00 - 0x07 */ F6ALU(Lock, em_add), @@ -4007,7 +4025,7 @@ static const struct opcode opcode_table[256] = { /* 0x60 - 0x67 */ I(ImplicitOps | Stack | No64, em_pusha), I(ImplicitOps | Stack | No64, em_popa), - N, D(DstReg | SrcMem32 | ModRM | Mov) /* movsxd (x86/64) */ , + N, MD(ModRM, mode_dual_63), N, N, N, N, /* 0x68 - 0x6F */ I(SrcImm | Mov | Stack, em_push), @@ -4227,6 +4245,7 @@ static const struct opcode opcode_map_0f_38[256] = { #undef I #undef GP #undef EXT +#undef MD #undef D2bv #undef D2bvIP @@ -4616,6 +4635,12 @@ done_prefixes: else opcode = opcode.u.idual-mod012; break; + case ModeDual: + if (ctxt-mode == X86EMUL_MODE_PROT64) + opcode = opcode.u.mdual-mode64; + else + opcode = opcode.u.mdual-mode32; + break; default: return EMULATION_FAILED; } @@ -4956,11 +4981,6 @@ special_insn: goto threebyte_insn; switch (ctxt-b) { - case 0x63: /* movsxd */ - if (ctxt-mode != X86EMUL_MODE_PROT64) - goto cannot_emulate; - ctxt-dst.val = (s32) ctxt-src.val; - break; case 0x70 ... 0x7f: /* jcc (short) */ if (test_cc(ctxt-b, ctxt-eflags)) rc = jmp_rel(ctxt, ctxt-src.val); -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/7] KVM: x86: IRET emulation does not clear NMI masking
The IRET instruction should clear NMI masking, but the current implementation does not do so. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/include/asm/kvm_emulate.h | 1 + arch/x86/kvm/emulate.c | 1 + arch/x86/kvm/x86.c | 6 ++ 3 files changed, 8 insertions(+) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index eb18117..57a9d94 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -208,6 +208,7 @@ struct x86_emulate_ops { void (*get_cpuid)(struct x86_emulate_ctxt *ctxt, u32 *eax, u32 *ebx, u32 *ecx, u32 *edx); + void (*set_nmi_mask)(struct x86_emulate_ctxt *ctxt, bool masked); }; typedef u32 __attribute__((vector_size(16))) sse128_t; diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index dd71004..fa3ca55 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -2047,6 +2047,7 @@ static int emulate_iret_real(struct x86_emulate_ctxt *ctxt) ctxt-eflags = ~EFLG_RESERVED_ZEROS_MASK; /* Clear reserved zeros */ ctxt-eflags |= EFLG_RESERVED_ONE_MASK; + ctxt-ops-set_nmi_mask(ctxt, false); return rc; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index afa0815..cdd6606 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4963,6 +4963,11 @@ static void emulator_write_gpr(struct x86_emulate_ctxt *ctxt, unsigned reg, ulon kvm_register_write(emul_to_vcpu(ctxt), reg, val); } +static void emulator_set_nmi_mask(struct x86_emulate_ctxt *ctxt, bool masked) +{ + kvm_x86_ops-set_nmi_mask(emul_to_vcpu(ctxt), masked); +} + static const struct x86_emulate_ops emulate_ops = { .read_gpr= emulator_read_gpr, .write_gpr = emulator_write_gpr, @@ -4998,6 +5003,7 @@ static const struct x86_emulate_ops emulate_ops = { .put_fpu = emulator_put_fpu, .intercept = emulator_intercept, .get_cpuid = emulator_get_cpuid, + .set_nmi_mask= emulator_set_nmi_mask, }; static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/7] KVM: x86: 32-bit wraparound read/write not emulated correctly
If we got a wraparound of 32-bit operand, and the limit is 0x, read and writes should be successful. It just needs to be done in two segments. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 10 +++--- arch/x86/kvm/x86.c | 2 ++ 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 997c9eb..c3b0757 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -684,9 +684,13 @@ static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt, } if (addr.ea lim) goto bad; - *max_size = min_t(u64, ~0u, (u64)lim + 1 - addr.ea); - if (size *max_size) - goto bad; + if (lim == 0x) + *max_size = ~0u; + else { + *max_size = (u64)lim + 1 - addr.ea; + if (size *max_size) + goto bad; + } la = (u32)-1; break; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cdd6606..1e10e3f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4495,6 +4495,8 @@ int emulator_read_write(struct x86_emulate_ctxt *ctxt, unsigned long addr, if (rc != X86EMUL_CONTINUE) return rc; addr += now; + if (ctxt-mode != X86EMUL_MODE_PROT64) + addr = (u32)addr; val += now; bytes -= now; } -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/7] KVM: x86: Emulation of call may use incorrect stack size
On long-mode, when far call that changes cs.l takes place, the stack size is determined by the new mode. For instance, if we go from 32-bit mode to 64-bit mode, the stack-size if 64. KVM uses the old stack size. Fix it. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 25 ++--- 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index c3b0757..81dcf79 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -741,19 +741,26 @@ static int assign_eip_far(struct x86_emulate_ctxt *ctxt, ulong dst, const struct desc_struct *cs_desc) { enum x86emul_mode mode = ctxt-mode; + int rc; #ifdef CONFIG_X86_64 - if (ctxt-mode = X86EMUL_MODE_PROT32 cs_desc-l) { - u64 efer = 0; + if (ctxt-mode = X86EMUL_MODE_PROT16) { + if (cs_desc-l) { + u64 efer = 0; - ctxt-ops-get_msr(ctxt, MSR_EFER, efer); - if (efer EFER_LMA) - mode = X86EMUL_MODE_PROT64; + ctxt-ops-get_msr(ctxt, MSR_EFER, efer); + if (efer EFER_LMA) + mode = X86EMUL_MODE_PROT64; + } else + mode = X86EMUL_MODE_PROT32; /* temporary value */ } #endif if (mode == X86EMUL_MODE_PROT16 || mode == X86EMUL_MODE_PROT32) mode = cs_desc-d ? X86EMUL_MODE_PROT32 : X86EMUL_MODE_PROT16; - return assign_eip(ctxt, dst, mode); + rc = assign_eip(ctxt, dst, mode); + if (rc == X86EMUL_CONTINUE) + ctxt-mode = mode; + return rc; } static inline int jmp_rel(struct x86_emulate_ctxt *ctxt, int rel) @@ -3062,6 +3069,7 @@ static int em_call_far(struct x86_emulate_ctxt *ctxt) struct desc_struct old_desc, new_desc; const struct x86_emulate_ops *ops = ctxt-ops; int cpl = ctxt-ops-cpl(ctxt); + enum x86emul_mode prev_mode = ctxt-mode; old_eip = ctxt-_eip; ops-get_segment(ctxt, old_cs, old_desc, NULL, VCPU_SREG_CS); @@ -3085,11 +3093,14 @@ static int em_call_far(struct x86_emulate_ctxt *ctxt) rc = em_push(ctxt); /* If we failed, we tainted the memory, but the very least we should restore cs */ - if (rc != X86EMUL_CONTINUE) + if (rc != X86EMUL_CONTINUE) { + pr_warn_once(faulting far call emulation tainted memory\n); goto fail; + } return rc; fail: ops-set_segment(ctxt, old_cs, old_desc, 0, VCPU_SREG_CS); + ctxt-mode = prev_mode; return rc; } -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/7] KVM: x86: Wrong operand size for far ret
Indeed, Intel SDM specifically states that for the RET instruction In 64-bit mode, the default operation size of this instruction is the stack-address size, i.e. 64 bits. However, experiments show this is not the case. Here is for example objdump of small 64-bit asm: 4004f1: ca 14 00lret $0x14 4004f4: 48 cb lretq 4004f6: 48 ca 14 00 lretq $0x14 Therefore, remove the Stack flag from far-ret instructions. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index aa27254..dd71004 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4062,8 +4062,8 @@ static const struct opcode opcode_table[256] = { G(ByteOp, group11), G(0, group11), /* 0xC8 - 0xCF */ I(Stack | SrcImmU16 | Src2ImmByte, em_enter), I(Stack, em_leave), - I(ImplicitOps | Stack | SrcImmU16, em_ret_far_imm), - I(ImplicitOps | Stack, em_ret_far), + I(ImplicitOps | SrcImmU16, em_ret_far_imm), + I(ImplicitOps, em_ret_far), D(ImplicitOps), DI(SrcImmByte, intn), D(ImplicitOps | No64), II(ImplicitOps, em_iret, iret), /* 0xD0 - 0xD7 */ -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/7] KVM: x86: Emulator fixes
Sorry for sending patches at the last minute. There is nothing critical in this patch-set. Yet, if you may want to incorporate something in 3.20 - specifically 5 (small define mistakes) or 7 (which is somewhat affected by recent changes). Thanks for reviewing the patches. Nadav Amit (7): KVM: x86: Dirty the dest op page on cmpxchg emulation KVM: x86: Wrong operand size for far ret KVM: x86: IRET emulation does not clear NMI masking KVM: x86: ARPL emulation can cause spurious exceptions KVM: x86: Fix defines in emulator.c KVM: x86: 32-bit wraparound read/write not emulated correctly KVM: x86: Emulation of call may use incorrect stack size arch/x86/include/asm/kvm_emulate.h | 1 + arch/x86/kvm/emulate.c | 85 +++--- arch/x86/kvm/x86.c | 8 3 files changed, 71 insertions(+), 23 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI
Radim Kr?má? rkrc...@redhat.com wrote: 2015-01-14 01:27+, Wu, Feng: the new hardware even doesn't consider the TPR for lowest priority interrupts delivery. A bold move ... what hardware was the first to do so? I think it was starting with Nehalem. Thanks, (Could be that QPI can't inform about TPR changes anymore ...) I played with Linux's TPR on Haswell and found that is has no effect. Sorry for jumping into the discussion, but doesn’t it depend on IA32_MISC_ENABLE[23]? This bit disables xTPR messages. On my machine it is set (probably by the BIOS), but since there is no IA32_MISC_ENABLE is not locked for changes, the OS can control it. Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8] KVM: x86: #PF error-code on R/W operations is wrong
Feng feng...@intel.com wrote: -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Nadav Amit Sent: Thursday, December 25, 2014 8:52 AM To: pbonz...@redhat.com Cc: kvm@vger.kernel.org; Nadav Amit Subject: [PATCH 1/8] KVM: x86: #PF error-code on R/W operations is wrong When emulating an instruction that reads the destination memory operand (i.e., instructions without the Mov flag in the emulator), the operand is first read. If a page-fault is detected in this phase, the error-code which would be delivered to the VM does not indicate that the access that caused the exception is a write one. This does not conform with real hardware, and may cause the VM to enter the page-fault handler twice for no reason (once for read, once for write). Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/include/asm/kvm_host.h | 12 arch/x86/kvm/emulate.c | 6 +- arch/x86/kvm/mmu.h | 12 3 files changed, 17 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 73ccb12..d6f90ca 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -162,6 +162,18 @@ enum { #define DR7_FIXED_1 0x0400 #define DR7_VOLATILE 0x2bff +#define PFERR_PRESENT_BIT 0 +#define PFERR_WRITE_BIT 1 +#define PFERR_USER_BIT 2 +#define PFERR_RSVD_BIT 3 +#define PFERR_FETCH_BIT 4 + +#define PFERR_PRESENT_MASK (1U PFERR_PRESENT_BIT) +#define PFERR_WRITE_MASK (1U PFERR_WRITE_BIT) +#define PFERR_USER_MASK (1U PFERR_USER_BIT) +#define PFERR_RSVD_MASK (1U PFERR_RSVD_BIT) +#define PFERR_FETCH_MASK (1U PFERR_FETCH_BIT) + /* apic attention bits */ #define KVM_APIC_CHECK_VAPIC 0 /* diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 7a9697f..e5a84be 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4941,8 +4941,12 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) /* optimisation - avoid slow emulated read if Mov */ rc = segmented_read(ctxt, ctxt-dst.addr.mem, ctxt-dst.val, ctxt-dst.bytes); This is a write operation, do you know why we need to read the dst operand first here? Some x86 instructions read the destination operand during their operation. For instance - MOV [EAX], EBX (Intel ASM format), would perform [EAX] = [EAX] + EBX. Therefore, it would first read the memory of [EAX] at this stage. Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] KVM: x86: pop sreg accesses only 2 bytes
Feng feng...@intel.com wrote: -Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Nadav Amit Sent: Thursday, December 25, 2014 5:55 PM To: Chen, Tiejun Cc: Paolo Bonzini; kvm list Subject: Re: [PATCH 2/8] KVM: x86: pop sreg accesses only 2 bytes Tiejun tiejun.c...@intel.com wrote: On 2014/12/25 8:52, Nadav Amit wrote: Although pop sreg updates RSP according to the operand size, only 2 bytes are read. The current behavior may result in incorrect #GP or #PF exceptions. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e5a84be..702da5e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1830,12 +1830,14 @@ static int em_pop_sreg(struct x86_emulate_ctxt *ctxt) unsigned long selector; int rc; Looks we just should do similar thing to em_push_sreg(), unsigned long selector; int rc; + if (ctxt-op_bytes == 4) { + rsp_increment(ctxt, -2); + ctxt-op_bytes = 2; + } rc = emulate_pop(ctxt, selector, ctxt-op_bytes); if (rc != X86EMUL_CONTINUE) return rc; Right? I don't think so. It seems the behaviour of push and pop is a bit different. For push: If the source operand is a segment register (16 bits) and the operand size is 64-bits, a zero-extended value is pushed on the stack; if the operand size is 32-bits ... all recent Core and Atom processors perform a 16-bit move, leaving the upper portion of the stack location unmodified. Therefore, for push in the case of op_bytes==8 we push zero-extended value. For pop the behaviour is not well-documented, but experimentally it appears only the first two bytes are accessed. I cannot see why it would be different when opsize is 8, since it is not like the push case, where the segment register value was zero extended. Let's take 64-bits operand size as an example. When pushing a segment register, it uses zero-extended value, so 8 bytes will be pushed on the stack. When popping it, the current code return the top 8 bytes in the stack, and it only uses the lowest 2 bytes for load_segment_descriptor(). what is the issue here? The issue I try to solve is that during the emulated write operation of the pop the read is perform using the wrong size (operand size instead of segment selector size). As you indicated, the destination register/memory of the pop instruction will be identical before the fix and after the fix. However, the emulated read may cause #PF if the operand-size that does not occur on read hardware. Consider for instance a case in which the operand size is 8, RSP=0xFFC and the page of [0x1000] is non-present. In this case POP-SREG should not cause a #PF, yet on KVM it does. Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] KVM: x86: pop sreg accesses only 2 bytes
Tiejun tiejun.c...@intel.com wrote: On 2014/12/25 8:52, Nadav Amit wrote: Although pop sreg updates RSP according to the operand size, only 2 bytes are read. The current behavior may result in incorrect #GP or #PF exceptions. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e5a84be..702da5e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1830,12 +1830,14 @@ static int em_pop_sreg(struct x86_emulate_ctxt *ctxt) unsigned long selector; int rc; Looks we just should do similar thing to em_push_sreg(), unsigned long selector; int rc; + if (ctxt-op_bytes == 4) { + rsp_increment(ctxt, -2); + ctxt-op_bytes = 2; + } rc = emulate_pop(ctxt, selector, ctxt-op_bytes); if (rc != X86EMUL_CONTINUE) return rc; Right? I don’t think so. It seems the behaviour of push and pop is a bit different. For push: “If the source operand is a segment register (16 bits) and the operand size is 64-bits, a zero-extended value is pushed on the stack; if the operand size is 32-bits ... all recent Core and Atom processors perform a 16-bit move, leaving the upper portion of the stack location unmodified.” Therefore, for push in the case of op_bytes==8 we push zero-extended value. For pop the behaviour is not well-documented, but experimentally it appears only the first two bytes are accessed. I cannot see why it would be different when opsize is 8, since it is not like the push case, where the segment register value was zero extended. If you feel strongly about it, I’ll create a unit test. Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/8] KVM: x86: pop sreg accesses only 2 bytes
Although pop sreg updates RSP according to the operand size, only 2 bytes are read. The current behavior may result in incorrect #GP or #PF exceptions. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e5a84be..702da5e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1830,12 +1830,14 @@ static int em_pop_sreg(struct x86_emulate_ctxt *ctxt) unsigned long selector; int rc; - rc = emulate_pop(ctxt, selector, ctxt-op_bytes); + rc = emulate_pop(ctxt, selector, 2); if (rc != X86EMUL_CONTINUE) return rc; if (ctxt-modrm_reg == VCPU_SREG_SS) ctxt-interruptibility = KVM_X86_SHADOW_INT_MOV_SS; + if (ctxt-op_bytes 2) + rsp_increment(ctxt, ctxt-op_bytes - 2); rc = load_segment_descriptor(ctxt, (u16)selector, seg); return rc; -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/8] KVM: x86: Emulator fixes
Few more emulator fixes. Each is logically independent from the others. The first one is the most interesting one. It appears that the current behavior may cause the VM to enter the page-fault handler twice on certain faulting write accesses. If you do not like my solution, please propose a better one. The fourth (JMP/CALL using call- or task-gate) is not a fix, but returns an error instead of emulating the wrong (#GP) exception. Thanks for reviewing the patches. Nadav Amit (8): KVM: x86: #PF error-code on R/W operations is wrong KVM: x86: pop sreg accesses only 2 bytes KVM: x86: fnstcw and fnstsw may cause spurious exception KVM: x86: JMP/CALL using call- or task-gate causes exception KVM: x86: em_call_far should return failure result KVM: x86: POP [ESP] is not emulated correctly KVM: x86: Do not set access bit on accessed segments KVM: x86: Access to LDT/GDT that wraparound is incorrect arch/x86/include/asm/kvm_host.h | 12 arch/x86/kvm/emulate.c | 138 ++-- arch/x86/kvm/mmu.h | 12 3 files changed, 103 insertions(+), 59 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/8] KVM: x86: #PF error-code on R/W operations is wrong
When emulating an instruction that reads the destination memory operand (i.e., instructions without the Mov flag in the emulator), the operand is first read. If a page-fault is detected in this phase, the error-code which would be delivered to the VM does not indicate that the access that caused the exception is a write one. This does not conform with real hardware, and may cause the VM to enter the page-fault handler twice for no reason (once for read, once for write). Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/include/asm/kvm_host.h | 12 arch/x86/kvm/emulate.c | 6 +- arch/x86/kvm/mmu.h | 12 3 files changed, 17 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 73ccb12..d6f90ca 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -162,6 +162,18 @@ enum { #define DR7_FIXED_10x0400 #define DR7_VOLATILE 0x2bff +#define PFERR_PRESENT_BIT 0 +#define PFERR_WRITE_BIT 1 +#define PFERR_USER_BIT 2 +#define PFERR_RSVD_BIT 3 +#define PFERR_FETCH_BIT 4 + +#define PFERR_PRESENT_MASK (1U PFERR_PRESENT_BIT) +#define PFERR_WRITE_MASK (1U PFERR_WRITE_BIT) +#define PFERR_USER_MASK (1U PFERR_USER_BIT) +#define PFERR_RSVD_MASK (1U PFERR_RSVD_BIT) +#define PFERR_FETCH_MASK (1U PFERR_FETCH_BIT) + /* apic attention bits */ #define KVM_APIC_CHECK_VAPIC 0 /* diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 7a9697f..e5a84be 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4941,8 +4941,12 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) /* optimisation - avoid slow emulated read if Mov */ rc = segmented_read(ctxt, ctxt-dst.addr.mem, ctxt-dst.val, ctxt-dst.bytes); - if (rc != X86EMUL_CONTINUE) + if (rc != X86EMUL_CONTINUE) { + if (rc == X86EMUL_PROPAGATE_FAULT + ctxt-exception.vector == PF_VECTOR) + ctxt-exception.error_code |= PFERR_WRITE_MASK; goto done; + } } ctxt-dst.orig_val = ctxt-dst.val; diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 6b34876b..daae711 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -44,18 +44,6 @@ #define PT_DIRECTORY_LEVEL 2 #define PT_PAGE_TABLE_LEVEL 1 -#define PFERR_PRESENT_BIT 0 -#define PFERR_WRITE_BIT 1 -#define PFERR_USER_BIT 2 -#define PFERR_RSVD_BIT 3 -#define PFERR_FETCH_BIT 4 - -#define PFERR_PRESENT_MASK (1U PFERR_PRESENT_BIT) -#define PFERR_WRITE_MASK (1U PFERR_WRITE_BIT) -#define PFERR_USER_MASK (1U PFERR_USER_BIT) -#define PFERR_RSVD_MASK (1U PFERR_RSVD_BIT) -#define PFERR_FETCH_MASK (1U PFERR_FETCH_BIT) - static inline u64 rsvd_bits(int s, int e) { return ((1ULL (e - s + 1)) - 1) s; -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/8] KVM: x86: JMP/CALL using call- or task-gate causes exception
The KVM emulator does not emulate JMP and CALL that target a call gate or a task gate. This patch does not try to implement these scenario as they are presumably rare; yet it returns X86EMUL_UNHANDLEABLE error in such cases instead of generating an exception. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 54 +++--- 1 file changed, 34 insertions(+), 20 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 19923e7..fd89471 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -264,6 +264,13 @@ struct instr_dual { #define EFLG_RESERVED_ZEROS_MASK 0xffc0802a #define EFLG_RESERVED_ONE_MASK 2 +enum x86_transfer_type { + X86_TRANSFER_NONE, + X86_TRANSFER_CALL_JMP, + X86_TRANSFER_RET, + X86_TRANSFER_TASK_SWITCH, +}; + static ulong reg_read(struct x86_emulate_ctxt *ctxt, unsigned nr) { if (!(ctxt-regs_valid (1 nr))) { @@ -1474,7 +1481,7 @@ static int write_segment_descriptor(struct x86_emulate_ctxt *ctxt, /* Does not support long mode */ static int __load_segment_descriptor(struct x86_emulate_ctxt *ctxt, u16 selector, int seg, u8 cpl, -bool in_task_switch, +enum x86_transfer_type transfer, struct desc_struct *desc) { struct desc_struct seg_desc, old_desc; @@ -1528,11 +1535,15 @@ static int __load_segment_descriptor(struct x86_emulate_ctxt *ctxt, return ret; err_code = selector 0xfffc; - err_vec = in_task_switch ? TS_VECTOR : GP_VECTOR; + err_vec = (transfer == X86_TRANSFER_TASK_SWITCH) ? TS_VECTOR : + GP_VECTOR; /* can't load system descriptor into segment selector */ - if (seg = VCPU_SREG_GS !seg_desc.s) + if (seg = VCPU_SREG_GS !seg_desc.s) { + if (transfer == X86_TRANSFER_CALL_JMP) + return X86EMUL_UNHANDLEABLE; goto exception; + } if (!seg_desc.p) { err_vec = (seg == VCPU_SREG_SS) ? SS_VECTOR : NP_VECTOR; @@ -1630,7 +1641,8 @@ static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt, u16 selector, int seg) { u8 cpl = ctxt-ops-cpl(ctxt); - return __load_segment_descriptor(ctxt, selector, seg, cpl, false, NULL); + return __load_segment_descriptor(ctxt, selector, seg, cpl, +X86_TRANSFER_NONE, NULL); } static void write_register_operand(struct operand *op) @@ -2042,7 +2054,8 @@ static int em_jmp_far(struct x86_emulate_ctxt *ctxt) memcpy(sel, ctxt-src.valptr + ctxt-op_bytes, 2); - rc = __load_segment_descriptor(ctxt, sel, VCPU_SREG_CS, cpl, false, + rc = __load_segment_descriptor(ctxt, sel, VCPU_SREG_CS, cpl, + X86_TRANSFER_CALL_JMP, new_desc); if (rc != X86EMUL_CONTINUE) return rc; @@ -2131,7 +2144,8 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt) /* Outer-privilege level return is not implemented */ if (ctxt-mode = X86EMUL_MODE_PROT16 (cs 3) cpl) return X86EMUL_UNHANDLEABLE; - rc = __load_segment_descriptor(ctxt, (u16)cs, VCPU_SREG_CS, cpl, false, + rc = __load_segment_descriptor(ctxt, (u16)cs, VCPU_SREG_CS, cpl, + X86_TRANSFER_RET, new_desc); if (rc != X86EMUL_CONTINUE) return rc; @@ -2569,23 +2583,23 @@ static int load_state_from_tss16(struct x86_emulate_ctxt *ctxt, * it is handled in a context of new task */ ret = __load_segment_descriptor(ctxt, tss-ldt, VCPU_SREG_LDTR, cpl, - true, NULL); + X86_TRANSFER_TASK_SWITCH, NULL); if (ret != X86EMUL_CONTINUE) return ret; ret = __load_segment_descriptor(ctxt, tss-es, VCPU_SREG_ES, cpl, - true, NULL); + X86_TRANSFER_TASK_SWITCH, NULL); if (ret != X86EMUL_CONTINUE) return ret; ret = __load_segment_descriptor(ctxt, tss-cs, VCPU_SREG_CS, cpl, - true, NULL); + X86_TRANSFER_TASK_SWITCH, NULL); if (ret != X86EMUL_CONTINUE) return ret; ret = __load_segment_descriptor(ctxt, tss-ss, VCPU_SREG_SS, cpl, - true, NULL); + X86_TRANSFER_TASK_SWITCH, NULL); if (ret != X86EMUL_CONTINUE) return ret; ret = __load_segment_descriptor(ctxt, tss
[PATCH 8/8] KVM: x86: Access to LDT/GDT that wraparound is incorrect
When access to descriptor in LDT/GDT wraparound outside long-mode, the address of the descriptor should be truncated to 32-bit. Citing Intel SDM 2.1.1.1 Global and Local Descriptor Tables in IA-32e Mode: GDTR and LDTR registers are expanded to 64-bits wide in both IA-32e sub-modes (64-bit mode and compatibility mode). So in other cases, we need to truncate. Creating new function to return a pointer to descriptor table to avoid too much code duplication. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 45 - 1 file changed, 32 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 4fcd9fd..cb6b8ef 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1446,10 +1446,8 @@ static void get_descriptor_table_ptr(struct x86_emulate_ctxt *ctxt, ops-get_gdt(ctxt, dt); } -/* allowed just for 8 bytes segments */ -static int read_segment_descriptor(struct x86_emulate_ctxt *ctxt, - u16 selector, struct desc_struct *desc, - ulong *desc_addr_p) +static int get_descriptor_ptr(struct x86_emulate_ctxt *ctxt, + u16 selector, ulong *desc_addr_p) { struct desc_ptr dt; u16 index = selector 3; @@ -1460,8 +1458,32 @@ static int read_segment_descriptor(struct x86_emulate_ctxt *ctxt, if (dt.size index * 8 + 7) return emulate_gp(ctxt, selector 0xfffc); - *desc_addr_p = addr = dt.address + index * 8; - return ctxt-ops-read_std(ctxt, addr, desc, sizeof *desc, + addr = dt.address + index * 8; + + if (addr 32 != 0) { + u64 efer = 0; + + ctxt-ops-get_msr(ctxt, MSR_EFER, efer); + if (!(efer EFER_LMA)) + addr = (u32)-1; + } + + *desc_addr_p = addr; + return X86EMUL_CONTINUE; +} + +/* allowed just for 8 bytes segments */ +static int read_segment_descriptor(struct x86_emulate_ctxt *ctxt, + u16 selector, struct desc_struct *desc, + ulong *desc_addr_p) +{ + int rc; + + rc = get_descriptor_ptr(ctxt, selector, desc_addr_p); + if (rc != X86EMUL_CONTINUE) + return rc; + + return ctxt-ops-read_std(ctxt, *desc_addr_p, desc, sizeof(*desc), ctxt-exception); } @@ -1469,16 +1491,13 @@ static int read_segment_descriptor(struct x86_emulate_ctxt *ctxt, static int write_segment_descriptor(struct x86_emulate_ctxt *ctxt, u16 selector, struct desc_struct *desc) { - struct desc_ptr dt; - u16 index = selector 3; + int rc; ulong addr; - get_descriptor_table_ptr(ctxt, selector, dt); - - if (dt.size index * 8 + 7) - return emulate_gp(ctxt, selector 0xfffc); + rc = get_descriptor_ptr(ctxt, selector, addr); + if (rc != X86EMUL_CONTINUE) + return rc; - addr = dt.address + index * 8; return ctxt-ops-write_std(ctxt, addr, desc, sizeof *desc, ctxt-exception); } -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/8] KVM: x86: POP [ESP] is not emulated correctly
According to Intel SDM: If the ESP register is used as a base register for addressing a destination operand in memory, the POP instruction computes the effective address of the operand after it increments the ESP register. The current emulation does not behave so. The fix required to waste another of the precious instruction flags and to check the flag in decode_modrm. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 7f80f01..7bf3548 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -171,6 +171,7 @@ #define PrivUD ((u64)1 51) /* #UD instead of #GP on CPL 0 */ #define NearBranch ((u64)1 52) /* Near branches */ #define No16 ((u64)1 53) /* No 16 bit operand */ +#define IncSP ((u64)1 54) /* SP is incremented before ModRM calc */ #define DstXacc (DstAccLo | SrcAccHi | SrcWrite) @@ -1229,6 +1230,10 @@ static int decode_modrm(struct x86_emulate_ctxt *ctxt, else { modrm_ea += reg_read(ctxt, base_reg); adjust_modrm_seg(ctxt, base_reg); + /* Increment ESP on POP [ESP] */ + if ((ctxt-d IncSP) + base_reg == VCPU_REGS_RSP) + modrm_ea += ctxt-op_bytes; } if (index_reg != 4) modrm_ea += reg_read(ctxt, index_reg) scale; @@ -3825,7 +3830,7 @@ static const struct opcode group1[] = { }; static const struct opcode group1A[] = { - I(DstMem | SrcNone | Mov | Stack, em_pop), N, N, N, N, N, N, N, + I(DstMem | SrcNone | Mov | Stack | IncSP, em_pop), N, N, N, N, N, N, N, }; static const struct opcode group2[] = { -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/8] KVM: x86: em_call_far should return failure result
Currently, if em_call_far fails it returns success instead of the resulting error-code. Fix it. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index fd89471..7f80f01 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3034,7 +3034,7 @@ static int em_call_far(struct x86_emulate_ctxt *ctxt) rc = __load_segment_descriptor(ctxt, sel, VCPU_SREG_CS, cpl, X86_TRANSFER_CALL_JMP, new_desc); if (rc != X86EMUL_CONTINUE) - return X86EMUL_CONTINUE; + return rc; rc = assign_eip_far(ctxt, ctxt-src.val, new_desc); if (rc != X86EMUL_CONTINUE) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/8] KVM: x86: fnstcw and fnstsw may cause spurious exception
Since the operand size of fnstcw and fnstsw is updated during the execution, the emulation may cause spurious exceptions as it reads the memory beforehand. Marking these instructions as Mov (since the previous value is ignored) and DstMem16 to simplify the setting of operand size. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 702da5e..19923e7 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -86,6 +86,7 @@ #define DstAcc (OpAcc DstShift) #define DstDI (OpDI DstShift) #define DstMem64(OpMem64 DstShift) +#define DstMem16(OpMem16 DstShift) #define DstImmUByte (OpImmUByte DstShift) #define DstDX (OpDX DstShift) #define DstAccLo(OpAccLo DstShift) @@ -1059,8 +1060,6 @@ static int em_fnstcw(struct x86_emulate_ctxt *ctxt) asm volatile(fnstcw %0: +m(fcw)); ctxt-ops-put_fpu(ctxt); - /* force 2 byte destination */ - ctxt-dst.bytes = 2; ctxt-dst.val = fcw; return X86EMUL_CONTINUE; @@ -1077,8 +1076,6 @@ static int em_fnstsw(struct x86_emulate_ctxt *ctxt) asm volatile(fnstsw %0: +m(fsw)); ctxt-ops-put_fpu(ctxt); - /* force 2 byte destination */ - ctxt-dst.bytes = 2; ctxt-dst.val = fsw; return X86EMUL_CONTINUE; @@ -3930,7 +3927,7 @@ static const struct gprefix pfx_0f_e7 = { }; static const struct escape escape_d9 = { { - N, N, N, N, N, N, N, I(DstMem, em_fnstcw), + N, N, N, N, N, N, N, I(DstMem16 | Mov, em_fnstcw), }, { /* 0xC0 - 0xC7 */ N, N, N, N, N, N, N, N, @@ -3972,7 +3969,7 @@ static const struct escape escape_db = { { } }; static const struct escape escape_dd = { { - N, N, N, N, N, N, N, I(DstMem, em_fnstsw), + N, N, N, N, N, N, N, I(DstMem16 | Mov, em_fnstsw), }, { /* 0xC0 - 0xC7 */ N, N, N, N, N, N, N, N, -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/8] KVM: x86: Do not set access bit on accessed segments
When segment is loaded, the segment access bit is set unconditionally. In fact, it should be set conditionally, based on whether the segment had the accessed bit set before. In addition, it can improve performance. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 7bf3548..4fcd9fd 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1620,10 +1620,13 @@ static int __load_segment_descriptor(struct x86_emulate_ctxt *ctxt, if (seg_desc.s) { /* mark segment as accessed */ - seg_desc.type |= 1; - ret = write_segment_descriptor(ctxt, selector, seg_desc); - if (ret != X86EMUL_CONTINUE) - return ret; + if (!(seg_desc.type 1)) { + seg_desc.type |= 1; + ret = write_segment_descriptor(ctxt, selector, + seg_desc); + if (ret != X86EMUL_CONTINUE) + return ret; + } } else if (ctxt-mode == X86EMUL_MODE_PROT64) { ret = ctxt-ops-read_std(ctxt, desc_addr+8, base3, sizeof(base3), ctxt-exception); -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Problem with KVM when using XSAVES in host
I encountered an interesting and annoying problem when KVM uses XSAVES/XRSTORS. The problem results from the fact XSAVES does not save the exact value of XINUSE[1]. See Intel SDM 13.10 “Operation of XSAVES”: “...if RFBM[1] = 1 and MXCSR does not have the value 1F80H, XSAVEC writes XSTATE_BV[1] as 1 even if XINUSE[1] = 0”. [ there is a typo in the SDM; they refer to XSAVES in this section, and probably copy-pasted the sentence]. XINUSE[1] marks whether the SSE state is in its initial state and not saved in the save-area. The result of this behaviour, is that when KVM uses XSAVES and then XRSTORS, it may not restore XINUSE[1] correctly, i.e. restore it to 1, although it was 0. This behaviour hurts the “equivalence property” - the VM does not behave as bare-metal system. Moreover, it may hurt the VM performance if the VM uses XSAVEOPT (and not XSAVES), has MXCSR with value different than the reset value of 1F80H and has all SSE registers set to zero. In such case, the VM would save/restore SSE registers unnecessarily. I don’t know whether such scenario happens in real workloads. tl;dr - hypervisors which use XSAVES (and XSAVEC) mess the VM state and may hurt VM performance. Perhaps KVM should use XSAVE/XSAVEOPT instead. Regards, Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] KVM: x86: Emulator fixes for VM86
Two minor fixes for emulation of instructions on VM86. Thanks for reviewing them. Nadav Amit (2): KVM: x86: Do not push eflags.vm on pushf KVM: x86: Emulate should check #UD before #GP arch/x86/kvm/emulate.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] KVM: x86: Emulate should check #UD before #GP
Intel SDM table 6-2 (Priority Among Simultaneous Exceptions and Interrupts) shows that faults from decoding the next instruction got higher priority than general protection. Moving the protected-mode check before the CPL check to avoid wrong exception on vm86 mode. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 5cd5401..0d42aca 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4803,6 +4803,12 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) goto done; } + /* Instruction can only be executed in protected mode */ + if ((ctxt-d Prot) ctxt-mode X86EMUL_MODE_PROT16) { + rc = emulate_ud(ctxt); + goto done; + } + /* Privileged instruction can be executed only in CPL=0 */ if ((ctxt-d Priv) ops-cpl(ctxt)) { if (ctxt-d PrivUD) @@ -4812,12 +4818,6 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) goto done; } - /* Instruction can only be executed in protected mode */ - if ((ctxt-d Prot) ctxt-mode X86EMUL_MODE_PROT16) { - rc = emulate_ud(ctxt); - goto done; - } - /* Do instruction specific permission checks */ if (ctxt-d CheckPerm) { rc = ctxt-check_perm(ctxt); -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] KVM: x86: Do not push eflags.vm on pushf
The pushf instruction does not push eflags.VM, so emulation should not do so as well. Although eflags.RF should not be pushed as well, it is already cleared by the time pushf is executed. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 33ecfcf..5cd5401 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1863,7 +1863,7 @@ static int em_pusha(struct x86_emulate_ctxt *ctxt) static int em_pushf(struct x86_emulate_ctxt *ctxt) { - ctxt-src.val = (unsigned long)ctxt-eflags; + ctxt-src.val = (unsigned long)ctxt-eflags ~EFLG_VM; return em_push(ctxt); } -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests] x86: test_conforming_switch misses es initialization
test_conforming_switch in the taskswitch2 tests, miss es initialization. Fix it. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- x86/taskswitch2.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/x86/taskswitch2.c b/x86/taskswitch2.c index f55843c..db3e41a 100644 --- a/x86/taskswitch2.c +++ b/x86/taskswitch2.c @@ -271,7 +271,8 @@ void test_conforming_switch(void) tss_intr.cs = CONFORM_CS_SEL | 3; tss_intr.eip = (u32)user_tss; - tss_intr.ds = tss_intr.gs = tss_intr.fs = tss_intr.ss = USER_DS; + tss_intr.ss = USER_DS; + tss_intr.ds = tss_intr.gs = tss_intr.es = tss_intr.fs = tss_intr.ss; tss_intr.eflags |= 3 IOPL_SHIFT; set_gdt_entry(CONFORM_CS_SEL, 0, 0x, 0x9f, 0xc0); asm volatile(lcall $ xstr(TSS_INTR) , $0xf4f4f4f4); -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: x86: Remove prefix flag when GP macro is used
The macro GP already sets the flag Prefix. Remove the redundant flag for 0f_38_f0 and 0f_38_f1 opcodes. Signed-off-by: Nadav Amit na...@cs.technion.ac.il --- arch/x86/kvm/emulate.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 3817334..b4f4201 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4172,8 +4172,8 @@ static const struct opcode opcode_map_0f_38[256] = { /* 0x80 - 0xef */ X16(N), X16(N), X16(N), X16(N), X16(N), X16(N), X16(N), /* 0xf0 - 0xf1 */ - GP(EmulateOnUD | ModRM | Prefix, three_byte_0f_38_f0), - GP(EmulateOnUD | ModRM | Prefix, three_byte_0f_38_f1), + GP(EmulateOnUD | ModRM, three_byte_0f_38_f0), + GP(EmulateOnUD | ModRM, three_byte_0f_38_f1), /* 0xf2 - 0xff */ N, N, X4(N), X8(N) }; -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] kvm: cpuid: fix xsave area size of XSAVEC
Wanpeng Li wanpeng...@linux.intel.com wrote: XSAVEC also use the compacted format for the extended region of the XSAVE area. This patch fix it by caculate the size of XSAVEC extended region of XSAVE area as compact format. Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com --- arch/x86/kvm/cpuid.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 70f0fa1..e16a0c7 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -92,7 +92,7 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu) } best = kvm_find_cpuid_entry(vcpu, 0xD, 1); - if (best (best-eax F(XSAVES))) + if (best (best-eax (F(XSAVES) || F(XSAVEC Did you want to use | (bitwise or) instead of II (logical or) ? best-ebx = xstate_required_size(vcpu-arch.xcr0, true); /* -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [CFT PATCH v2 2/2] KVM: x86: support XSAVES usage in the host
Paolo Bonzini pbonz...@redhat.com wrote: Userspace is expecting non-compacted format for KVM_GET_XSAVE, but struct xsave_struct might be using the compacted format. Convert in order to preserve userspace ABI. Likewise, userspace is passing non-compacted format for KVM_SET_XSAVE but the kernel will pass it to XRSTORS, and we need to convert back. Fixes: f31a9f7c71691569359fa7fb8b0acaa44bce0324 Cc: Fenghua Yu fenghua...@intel.com Cc: H. Peter Anvin h...@linux.intel.com Cc: Nadav Amit na...@cs.technion.ac.il Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/kvm/x86.c | 87 +- 1 file changed, 80 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 08b5657e57ed..373b0ab9a32e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3132,15 +3132,89 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, return 0; } +#define XSTATE_COMPACTION_ENABLED (1ULL 63) + +static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu) +{ + struct xsave_struct *xsave = vcpu-arch.guest_fpu.state-xsave; + u64 xstate_bv = vcpu-arch.guest_supported_xcr0 | XSTATE_FPSSE; + u64 valid; + + /* + * Copy legacy XSAVE area, to avoid complications with CPUID + * leaves 0 and 1 in the loop below. + */ + memcpy(dest, xsave, XSAVE_HDR_OFFSET); + + /* Set XSTATE_BV */ + *(u64 *)(dest + XSAVE_HDR_OFFSET) = xstate_bv; I have a problem with this line. I ran some experiments and it has a side-effect of causing XINUSE (an internal register which saves which state components are not in the initial state) to be all set. As a results, after load_xsave runs, when the guest runs xsave instruction, initialised xsave state components are marked as not-initialised in the guest’s xstate_bv. This causes both transparency issues (the VM does not behave as bare-metal machine). In addition it may cause performance overheads, since from this point on, xsave and xrstor instructions would save and load state which is in fact in the initial state. I think it is better just to replace the last line with: *(u64 *)(dest + XSAVE_HDR_OFFSET) = xsave-xsave_hdr.xstate_bv Thanks, Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] KVM: x86: allow 256 logical x2APICs again
Radim Krčmář rkrc...@redhat.com wrote: While fixing an x2apic bug, 17d68b7 KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376) we've made only one cluster available. This means that the amount of logically addressible x2APICs was reduced to 16 and VCPUs kept overwriting themselves in that region, so even the first cluster wasn't set up correctly. This patch extends x2APIC support back to the logical_map's limit, and keeps the CVE fixed as messages for non-present APICs are dropped. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/lapic.c | 11 ++- arch/x86/kvm/lapic.h | 2 -- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 049d30f..f6e3369 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -132,8 +132,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic) return (kvm_apic_get_reg(apic, APIC_ID) 24) 0xff; } -#define KVM_X2APIC_CID_BITS 0 - static void recalculate_apic_map(struct kvm *kvm) { struct kvm_apic_map *new, *old = NULL; @@ -163,8 +161,7 @@ static void recalculate_apic_map(struct kvm *kvm) if (apic_x2apic_mode(apic)) { new-ldr_bits = 32; new-cid_shift = 16; - new-cid_mask = (1 KVM_X2APIC_CID_BITS) - 1; - new-lid_mask = 0x; + new-cid_mask = new-lid_mask = 0x; You set cid_mask to 0x, while there are only 16 clusters. I think it is risky (if you twist my hand would come with a scenario). Yet, why not to set cid_mask to (ARRAY_SIZE(map-logical_map) - 1) ? Nadav-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] KVM: x86: allow 256 logical x2APICs again
Radim Krčmář rkrc...@redhat.com wrote: 2014-11-27 21:53+0200, Nadav Amit: Radim Krčmář rkrc...@redhat.com wrote: - new-cid_mask = (1 KVM_X2APIC_CID_BITS) - 1; - new-lid_mask = 0x; + new-cid_mask = new-lid_mask = 0x; You set cid_mask to 0x, while there are only 16 clusters. I think it is risky (if you twist my hand would come with a scenario). Let's see :) APIC id is 8 bit, and we compute cluster part of LDR by taking four upper bits, so 16 is enough. To clarify my concern - I am worried that some of the CPUs are still in xAPIC mode with LDR that does not follow x2APIC LDR scheme. It isn't the safest programming practice, but we already fail to check physical_map bounds and any boost to maximal APIC ID is going to require a rewrite, thus I didn't bother to do it ... All uses should be covered with the following hunk, I will add it to v2 after all reviews, diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 6c2b8a5..30e4cc1 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -198,7 +198,7 @@ static void recalculate_apic_map(struct kvm *kvm) cid = apic_cluster_id(new, ldr); lid = apic_logical_id(new, ldr); - if (lid) + if (lid cid ARRAY_SIZE(map-logical_map)) new-logical_map[cid][ffs(lid) - 1] = apic; } out: Yet, why not to set cid_mask to (ARRAY_SIZE(map-logical_map) - 1) ? We would incorrectly deliver messages intended for high clusters, it has to be 0x. From the SDM, I am not sure you are correct, but your solution is fine. Nadav -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 14/21] KVM: x86: Software disabled APIC should still deliver NMIs
Radim Krčmář rkrc...@redhat.com wrote: 2014-11-26 19:01+0200, Nadav Amit: Sorry for the late and long reply, but I got an issue with the new version (and my previous version as well). Indeed, the SDM states that DFR should be the same for enabled CPUs, and that the BIOS should get all CPUs in either xAPIC or x2APIC. Yet, there is nothing that says all CPUs need to be in xAPIC/x2APIC mode. In my tests (which pass on bare-metal), I got a scenario in which some CPUs are in xAPIC mode, the BSP changed (which is currently not handled correctly by KVM) and the BSP has x2APIC enabled. How many (V)CPUs were you using? (We fail hard with logical destination x2APIC and 16+ VCPUs.) 2 at the moment. What failure do you refer to? All the core APICs are software-enabled. The expected behaviour is that the CPUs with x2APIC enabled would work in x2APIC mode. (Nice, I bet that made some Intel designers happy.) There shouldn't be any message conflict when using APIC IDs 255, so it might be possible if the x2APIC isn't programmed to issue weird messages, like physical to nonexistent APIC ID 0xff00, which would be also interpreted as xAPIC broadcast. I think such a transitory scenario is possible on real-systems as well, perhaps during CPU hot-plug. It appears the previous version (before all of our changes) handled it better. I presume the most efficient way is to start determining the APIC logical mode from the BSP, and if it is disabled, traverse the rest of the CPUs until finding the first one with APIC enabled. Yet, I have not finished doing and checking the BSP fix and other dependent INIT signal handling fixes. In the meanwhile, would you be ok with restoring some of the previous behaviour - i.e., x2APIC is enabled if any CPU turned it on (regardless to whether APIC is software enabled), otherwise use the configuration of the last enabled APIC? I don't think this patch improves anything. (Both behaviors are wrong and I think the current one is a bit less so.) Our x2APIC implementation is a hack that allowed faster IPI thanks to 1 MSR exit instead of 2 MMIO ones. No OS, that doesn't know KVM's limitations, should have enabled it because we didn't emulate interrupt remapping, which is an architectural requirement for x2APIC … It is a shame - I was under the impression QEMU emulation of the Intel IOMMU would include it as well, and I now see they only did DMAR… And for more concrete points: - Physical x2APIC isn't affected (only broadcast, which is incorrect either way) - Logical x2APIC and xAPIC don't work at the same time No, but it is important to determine what is the “consensus” APIC mode. - Btw. logical x2APIC isn't supposed to work (see KVM_X2APIC_CID_BITS) Why? It is as if there is only a single cluster. You can still send an APIC message to multiple CPUs within the same cluster (0). - Logical xAPIC is shifted incorrectly in x2APIC mode, so they are all going to be inaccessible (ldr = 0) - Our map isn't designed to allow x2APIC and xAPIC at the same time - Your patch does not cover the case where sw-disabled x2APIC is before sw-enabled xAPIC, only if it is after. I thought I covered it. The rationale was that if any lapic is in x2APIC mode, then the are all in x2APIC mode. It is done similarly to the previous version (3.18). Anyhow, I have my workarounds, so do as you find appropriately. Once I deal with the BSP issues, I may resubmit another version. Nadav -- 8 — Subject: [PATCH] KVM: x86: Traverse all CPUs during recalculate_apic_map --- arch/x86/kvm/lapic.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 9c90d31..6dc2be6 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -139,6 +139,7 @@ static void recalculate_apic_map(struct kvm *kvm) struct kvm_apic_map *new, *old = NULL; struct kvm_vcpu *vcpu; int i; +bool any_enabled = false; new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL); @@ -160,13 +161,21 @@ static void recalculate_apic_map(struct kvm *kvm) if (!kvm_apic_present(vcpu)) continue; +/* + * All APICs DFRs have to be configured the same mode by an OS. + * We take advatage of this while building logical id lookup + * table. After reset APICs are in software disabled mode, so if + * we find apic with different setting we assume this is the mode + * OS wants all apics to be in; build lookup table accordingly. + */ if (apic_x2apic_mode(apic)) { new-ldr_bits = 32; new-cid_shift = 16; new-cid_mask = (1 KVM_X2APIC_CID_BITS) - 1; new-lid_mask = 0x; new-broadcast