[Bug 59891] Unrecoverable Kernel Panic, in Virtio Net in memcpy(), Complete system lockup.

2013-06-18 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=59891





--- Comment #1 from Matthew Grant   2013-06-19 
05:51:59 ---
Created an attachment (id=105301)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=105301)
JPEG picture of screen dump

Screen dump picture.

Can't get serial console output, or capture any other way.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 59891] New: Unrecoverable Kernel Panic, in Virtio Net in memcpy(), Complete system lockup.

2013-06-18 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=59891

   Summary: Unrecoverable Kernel Panic, in Virtio Net in memcpy(),
Complete system lockup.
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.9+
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: blocking
  Priority: P1
 Component: kvm
AssignedTo: virtualization_...@kernel-bugs.osdl.org
ReportedBy: matthewgra...@gmail.com
CC: matthewgra...@gmail.com
Regression: Yes


On AMD processor, under heavy network load and ATI GPU use, System will
completely lock up when a VM is receiving? a packet.

This happens in memcpy()

Get:

Kernel Panic - not syncing: Fatal exception in interrupt.

Can't seem to get anything on a serial port, or with kexec kernel core dumps.

Attaching picture of exception print

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-18 Thread Benjamin Herrenschmidt
On Wed, 2013-06-19 at 13:05 +0930, Rusty Russell wrote:
> symbol_get() won't try to load a module; it'll just fail.  This is what
> you want, since they must have vfio in the kernel to get a valid fd...

Ok, cool. I suppose what we want here Alexey is slightly higher level,
something like:

vfio_validate_iommu_id(file, iommu_id)

Which verifies that the file that was passed in is allowed to use
that iommu_id.

That's a simple and flexible interface (ie, it will work even if we
support multiple iommu IDs in the future for a vfio, for example
for DDW windows etc...), the logic to know about the ID remains
in qemu, this is strictly a validation call.

That way we also don't have to expose the containing vfio struct etc...
just that simple function.

Alex, any objection ?

Do we need to make it a get/put interface instead ?

vfio_validate_and_use_iommu(file, iommu_id);

vfio_release_iommu(file, iommu_id);

To ensure that the resource remains owned by the process until KVM
is closed as well ?

Or do we want to register with VFIO with a callback so that VFIO can
call us if it needs us to give it up ?

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-18 Thread Rusty Russell
Alex Williamson  writes:
> On Mon, 2013-06-17 at 13:56 +1000, Benjamin Herrenschmidt wrote:
>> On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:
>> 
>> > IOMMU groups themselves don't provide security, they're accessed by
>> > interfaces like VFIO, which provide the security.  Given a brief look, I
>> > agree, this looks like a possible backdoor.  The typical VFIO way to
>> > handle this would be to pass a VFIO file descriptor here to prove that
>> > the process has access to the IOMMU group.  This is how /dev/vfio/vfio
>> > gains the ability to setup an IOMMU domain an do mappings with the
>> > SET_CONTAINER ioctl using a group fd.  Thanks,
>> 
>> How do you envision that in the kernel ? IE. I'm in KVM code, gets that
>> vfio fd, what do I do with it ?
>> 
>> Basically, KVM needs to know that the user is allowed to use that iommu
>> group. I don't think we want KVM however to call into VFIO directly
>> right ?
>
> Right, we don't want to create dependencies across modules.  I don't
> have a vision for how this should work.  This is effectively a complete
> side-band to vfio, so we're really just dealing in the iommu group
> space.  Maybe there needs to be some kind of registration of ownership
> for the group using some kind of token.  It would need to include some
> kind of notification when that ownership ends.  That might also be a
> convenient tag to toggle driver probing off for devices in the group.
> Other ideas?  Thanks,

It's actually not that bad.

eg. 

struct vfio_container *vfio_container_from_file(struct file *filp)
{
if (filp->f_op != &vfio_device_fops)
return ERR_PTR(-EINVAL);

/* OK it really is a vfio fd, return the data. */

}
EXPORT_SYMBOL_GPL(vfio_container_from_file);

...

inside KVM_CREATE_SPAPR_TCE_IOMMU:

struct file *vfio_filp;
struct vfio_container *(lookup)(struct file *filp);

vfio_filp = fget(create_tce_iommu.fd);
if (!vfio)
ret = -EBADF;
lookup = symbol_get(vfio_container_from_file);
if (!lookup)
ret = -EINVAL;
else {
container = lookup(vfio_filp);
if (IS_ERR(container))
ret = PTR_ERR(container);
else
...
symbol_put(vfio_container_from_file);
}

symbol_get() won't try to load a module; it'll just fail.  This is what
you want, since they must have vfio in the kernel to get a valid fd...

Hope that helps,
Rusty.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-18 Thread Alexey Kardashevskiy
On 06/16/2013 02:39 PM, Benjamin Herrenschmidt wrote:
>>  static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool 
>> writing,
>> -unsigned long *pte_sizep)
>> +unsigned long *pte_sizep, bool do_get_page)
>>  {
>>  pte_t *ptep;
>>  unsigned int shift = 0;
>> @@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned 
>> long hva, bool writing,
>>  if (!pte_present(*ptep))
>>  return __pte(0);
>>  
>> +/*
>> + * Put huge pages handling to the virtual mode.
>> + * The only exception is for TCE list pages which we
>> + * do need to call get_page() for.
>> + */
>> +if ((*pte_sizep > PAGE_SIZE) && do_get_page)
>> +return __pte(0);
>> +
>>  /* wait until _PAGE_BUSY is clear then set it atomically */
>>  __asm__ __volatile__ (
>>  "1: ldarx   %0,0,%3\n"
>> @@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned 
>> long hva, bool writing,
>>  : "cc");
>>  
>>  ret = pte;
>> +if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
>> +struct page *pg = NULL;
>> +pg = realmode_pfn_to_page(pte_pfn(pte));
>> +if (realmode_get_page(pg)) {
>> +ret = __pte(0);
>> +} else {
>> +pte = pte_mkyoung(pte);
>> +if (writing)
>> +pte = pte_mkdirty(pte);
>> +}
>> +}
>> +*ptep = pte;/* clears _PAGE_BUSY */
>>  
>>  return ret;
>>  }
> 
> So now you are adding the clearing of _PAGE_BUSY that was missing for
> your first patch, except that this is not enough since that means that
> in the "emulated" case (ie, !do_get_page) you will in essence return
> and then use a PTE that is not locked without any synchronization to
> ensure that the underlying page doesn't go away... then you'll
> dereference that page.
> 
> So either make everything use speculative get_page, or make the emulated
> case use the MMU notifier to drop the operation in case of collision.
> 
> The former looks easier.
> 
> Also, any specific reason why you do:
> 
>   - Lock the PTE
>   - get_page()
>   - Unlock the PTE
> 
> Instead of
> 
>   - Read the PTE
>   - get_page_unless_zero
>   - re-check PTE
> 
> Like get_user_pages_fast() does ?
> 
> The former will be two atomic ops, the latter only one (faster), but
> maybe you have a good reason why that can't work...



If we want to set "dirty" and "young" bits for pte then I do not know how
to avoid _PAGE_BUSY.



-- 
Alexey
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pci: Enable overrides for missing ACS capabilities

2013-06-18 Thread Bjorn Helgaas
On Tue, Jun 18, 2013 at 5:03 PM, Don Dutile  wrote:
> On 06/18/2013 06:22 PM, Alex Williamson wrote:
>>
>> On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:
>>>
>>> On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
>>>   wrote:

 On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
>
> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
> ...
> Who do you expect to decide whether to use this option?  I think it
> requires intimate knowledge of how the device works.
>
> I think the benefit of using the option is that it makes assignment of
> devices to guests more flexible, which will make it attractive to
> users.
> But most users have no way of knowing whether it's actually *safe* to
> use this.  So I worry that you're adding an easy way to pretend
> isolation
> exists when there's no good way of being confident that it actually
> does.

> ...

>>> I wonder if we should taint the kernel if this option is used (but not
>>> for specific devices added to pci_dev_acs_enabled[]).  It would also
>>> be nice if pci_dev_specific_acs_enabled() gave some indication in
>>> dmesg for the specific devices you're hoping to add to
>>> pci_dev_acs_enabled[].  It's not an enumeration-time quirk right now,
>>> so I'm not sure how we'd limit it to one message per device.
>>
>> Right, setup vs use and getting single prints is a lot of extra code.
>> Tainting is troublesome for support, Don had some objections when I
>> suggested the same to him.
>>
> For RH GSS (Global Support Services), a 'taint' in the kernel printk means
> RH doesn't support that system.  The 'non-support' due to 'taint' being
> printed
> out in this case may be incorrect -- RH may support that use, at least until
> a more sufficient patched kernel is provided.
> Thus my dissension that 'taint' be output.  WARN is ok. 'driver beware',
> 'unleashed dog afoot' sure...

So ...  that's really a RH-specific support issue, and easily worked
around by RH adding a patch that turns off tainting.

It still sounds like a good idea to me for upstream, where use of this
option can very possibly lead to corruption or information leakage
between devices the user claimed were isolated, but in fact were not.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/6] KVM: MMU: fast invalidate all mmio sptes

2013-06-18 Thread Xiao Guangrong
On 06/18/2013 10:26 PM, Paolo Bonzini wrote:
> Il 07/06/2013 10:51, Xiao Guangrong ha scritto:
>> Changelog:
>> V3:
>>   All of these changes are from Gleb's review:
>>   1) rename RET_MMIO_PF_EMU to RET_MMIO_PF_EMULATE.
>>   2) smartly adjust kvm generation number in kvm_current_mmio_generatio()
>>  to avoid kvm_memslots->generation overflow.
>>
>> V2:
>>   - rename kvm_mmu_invalid_mmio_spte to kvm_mmu_invalid_mmio_sptes
>>   - use kvm->memslots->generation as kvm global generation-number
>>   - fix comment and codestyle
>>   - init kvm generation close to mmio wrap-around value
>>   - keep kvm_mmu_zap_mmio_sptes
>>
>> The current way is holding hot mmu-lock and walking all shadow pages, this
>> is not scale. This patchset tries to introduce a very simple and scale way
>> to fast invalidate all mmio sptes - it need not walk any shadow pages and 
>> hold
>> any locks.
>>
>> The idea is simple:
>> KVM maintains a global mmio valid generation-number which is stored in
>> kvm->memslots.generation and every mmio spte stores the current global
>> generation-number into his available bits when it is created
>>
>> When KVM need zap all mmio sptes, it just simply increase the global
>> generation-number. When guests do mmio access, KVM intercepts a MMIO #PF
>> then it walks the shadow page table and get the mmio spte. If the
>> generation-number on the spte does not equal the global generation-number,
>> it will go to the normal #PF handler to update the mmio spte
>>
>> Since 19 bits are used to store generation-number on mmio spte, we zap all
>> mmio sptes when the number is round
>>
>> Xiao Guangrong (6):
>>   KVM: MMU: retain more available bits on mmio spte
>>   KVM: MMU: store generation-number into mmio spte
>>   KVM: MMU: make return value of mmio page fault handler more readable
>>   KVM: MMU: fast invalidate all mmio sptes
>>   KVM: MMU: add tracepoint for check_mmio_spte
>>   KVM: MMU: init kvm generation close to mmio wrap-around value
>>
>>  arch/x86/include/asm/kvm_host.h |   2 +-
>>  arch/x86/kvm/mmu.c  | 131 
>> 
>>  arch/x86/kvm/mmu.h  |  17 ++
>>  arch/x86/kvm/mmutrace.h |  34 +--
>>  arch/x86/kvm/paging_tmpl.h  |  10 ++-
>>  arch/x86/kvm/vmx.c  |  12 ++--
>>  arch/x86/kvm/x86.c  |  11 +++-
>>  7 files changed, 177 insertions(+), 40 deletions(-)
>>
> 
> Xiao, is it time to add more comments to the code or update
> Documentation/virtual/kvm/mmu.txt?  

Yes. it is.

We missed to update mmu.txt for a long time. I will post a separate patchset
to update it to the current mmu code.

> Don't worry about the English, it is
> more than understandable and I can help with the editing.

Okay. Thank you in advance, Paolo! :)


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm api doc: fix section numbers

2013-06-18 Thread Alexey Kardashevskiy
Signed-off-by: Alexey Kardashevskiy 
---
 Documentation/virtual/kvm/api.txt |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 5f91eda..6365fef 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2261,7 +2261,7 @@ return indicates the attribute is implemented.  It does 
not necessarily
 indicate that the attribute can be read or written in the device's
 current state.  "addr" is ignored.
 
-4.77 KVM_ARM_VCPU_INIT
+4.82 KVM_ARM_VCPU_INIT
 
 Capability: basic
 Architectures: arm
@@ -2285,7 +2285,7 @@ Possible features:
  Depends on KVM_CAP_ARM_PSCI.
 
 
-4.78 KVM_GET_REG_LIST
+4.83 KVM_GET_REG_LIST
 
 Capability: basic
 Architectures: arm
@@ -2305,7 +2305,7 @@ This ioctl returns the guest registers that are supported 
for the
 KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
 
 
-4.80 KVM_ARM_SET_DEVICE_ADDR
+4.84 KVM_ARM_SET_DEVICE_ADDR
 
 Capability: KVM_CAP_ARM_SET_DEVICE_ADDR
 Architectures: arm
@@ -2342,7 +2342,7 @@ and distributor interface, the ioctl must be called after 
calling
 KVM_CREATE_IRQCHIP, but before calling KVM_RUN on any of the VCPUs.  Calling
 this ioctl twice for any of the base addresses will return -EEXIST.
 
-4.82 KVM_PPC_RTAS_DEFINE_TOKEN
+4.85 KVM_PPC_RTAS_DEFINE_TOKEN
 
 Capability: KVM_CAP_PPC_RTAS
 Architectures: ppc
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-18 Thread 李春奇
On Wed, Jun 19, 2013 at 12:44 AM, Gleb Natapov  wrote:
> Send code in a form of a patch.
>
> On Wed, Jun 19, 2013 at 12:14:13AM +0800, 李春奇  wrote:
>> extern u8 insn_page[], insn_page_end[];
>> extern u8 test_insn[], test_insn_end[];
>> extern u8 alt_insn_page[];
>>
>> asm(
>> ".align 4096\n\t"
>> ".global insn_page\n\t"
>> ".global insn_page_end\n\t"
>> ".global test_insn\n\t"
>> ".global test_insn_end\n\t"
>> "insn_page:\n\t"
>>
>> "ret \n\t"
>>
>> "push %rax; push %rbx\n\t"
>> "push %rcx; push %rdx\n\t"
>> "push %rsi; push %rdi\n\t"
>> "push %rbp\n\t"
>> "push %r8; push %r9\n\t"
>> "push %r10; push %r11\n\t"
>> "push %r12; push %r13\n\t"
>> "push %r14; push %r15\n\t"
>> "pushf\n\t"
>>
>> "push 136+save \n\t"
>> "popf \n\t"
>> "mov 0+save, %rax \n\t"
>> "mov 8+save, %rbx \n\t"
>> "mov 16+save, %rcx \n\t"
>> "mov 24+save, %rdx \n\t"
>> "mov 32+save, %rsi \n\t"
>> "mov 40+save, %rdi \n\t"
>> "mov 56+save, %rbp \n\t"
>> "mov 64+save, %r8 \n\t"
>> "mov 72+save, %r9 \n\t"
>> "mov 80+save, %r10  \n\t"
>> "mov 88+save, %r11 \n\t"
>> "mov 96+save, %r12 \n\t"
>> "mov 104+save, %r13 \n\t"
>> "mov 112+save, %r14 \n\t"
>> "mov 120+save, %r15 \n\t"
>>
>> "test_insn:\n\t"
>> "in  (%dx),%al\n\t"
>> ". = . + 31\n\t"
>> "test_insn_end:\n\t"
>>
>> "pushf \n\t"
>> "pop 136+save \n\t"
>> "mov %rax, 0+save \n\t"
>> "mov %rbx, 8+save \n\t"
>> "mov %rcx, 16+save \n\t"
>> "mov %rdx, 24+save \n\t"
>> "mov %rsi, 32+save \n\t"
>> "mov %rdi, 40+save \n\t"
>> "mov %rbp, 56+save \n\t"
>> "mov %r8, 64+save \n\t"
>> "mov %r9, 72+save \n\t"
>> "mov %r10, 80+save \n\t"
>> "mov %r11, 88+save \n\t"
>> "mov %r12, 96+save \n\t"
>> "mov %r13, 104+save \n\t"
>> "mov %r14, 112+save \n\t"
>> "mov %r15, 120+save \n\t"
>> "popf \n\t"
>> "pop %r15; pop %r14 \n\t"
>> "pop %r13; pop %r12 \n\t"
>> "pop %r11; pop %r10 \n\t"
>> "pop %r9; pop %r8 \n\t"
>> "pop %rbp \n\t"
>> "pop %rdi; pop %rsi \n\t"
>> "pop %rdx; pop %rcx \n\t"
>> "pop %rbx; pop %rax \n\t"
>>
>> "ret\n\t"
>> "save:\n\t"
>> ". = . + 256\n\t"
>> ".align 4096\n\t"
>> "alt_insn_page:\n\t"
>> ". = . + 4096\n\t"
>> );
>>
>>
>> static void mk_insn_page(uint8_t *alt_insn_page,
>> uint8_t *alt_insn, int alt_insn_length)
>> {
>> int i, emul_offset;
>> for (i=1; i> test_insn[i] = 0x90; // nop
> Why? Gcc should pad it with nops.
>
>> emul_offset = test_insn - insn_page;
>> for (i=0; i> alt_insn_page[i+emul_offset] = alt_insn[i];
>> }
>>
>> static void trap_emulator(uint64_t *mem, uint8_t* alt_insn, int 
>> alt_insn_length)
>> {
>> ulong *cr3 = (ulong *)read_cr3();
>> int save_offset = (u8 *)(&save) - insn_page;
>>
>> memset(alt_insn_page, 0x90, 4096);
> alt_insn_page should contains the same instruction as insn_page except
> between test_insn and test_insn_end. I do not know how you expect it to
> work otherwise.
In my oponion, only codes between test_insn and test_insn_end in
alt_insn_page need to be set, insn_page will be executed in the guest,
and when trapping into emulator OS will load alt_insn_page (because of
invlpg(insn_page)), then return to guest with executing insn_page
(from TLB). I don't know if this is right, but I use this trick in my
previous patch and it runs well. I use "trace-cmd record -e kvm" to
trace it and found instructions in alt_insn_page are not executed, so
I suppose that alt_insn_page is not loaded to the right place.

Arthur
>
>> save = inregs;
>> mk_insn_page(alt_insn_page, alt_insn, alt_insn_length);
>> // Load the code TLB with insn_page, but point the page tables at
>> // alt_insn_page (and keep the data TLB clear, for AMD decode assist).
>> // This will make the CPU trap on the insn_page instruction but the
>> // hypervisor will see alt_insn_page.
>> //install_page(cr3, virt_to_phys(insn_page), insn_page);
>> invlpg(insn_page);
>> // Load code TLB
>> asm volatile("call *%0" : : "r"(insn_page));
>> install_page(cr3, virt_to_phys(alt_insn_page), insn_page);
>> // Trap, let hypervisor emulate at alt_insn_page
>> asm volatile("call *%0": : "r"(insn_page+1));
>>
>> outregs = *((struct regs *)(&alt_insn_page[save_offset]));
>> }
>>
>> static void test_movabs(uint64_t *mem)
>> {
>> // mov $0xc3c3c3c3c3c3c3c3, %rcx
>> uint8_t alt_insn[] = {0x48, 0xb9, 0xc3, 0xc3, 0xc3,
>> 0xc3, 0xc3, 0xc3, 0xc3, 0xc3};
>> inregs = (struct regs){ 0 };
>> trap_emulator(mem, alt_insn, 10);
>> report("64-bit mov imm2", outregs.rcx == 0xc3c3c3c3c3c3c3c3);
>> }
>>
>> On Wed, Jun 19, 2013 at 12:09 AM, Gleb Natapov  wrote:
>> > On Tue, Jun 18, 2013 at 11:56:24PM +0800, 李春奇  wrote:
>> >> On Tue, Jun 18, 2013 at 11:47 PM, Gleb Natapov  wrote:
>> >> > On Tue, Jun 18, 2013 at 10:28:59PM +0800, Ê??Ê?•Â•?  
>> >> > wrote:
>> >> >> On Tue, Jun 18, 2013 at 8:45 PM, Gleb Natapov  wrote:
>> >> >> > On Thu, Jun 13, 2013 at 05:30:03PM +0800, 李春奇 > >> >> > Li> wrote:
>> >> >> >> Hi Gleb,
>> >> >> >> I'm trying to solve these problems in the past days and m

Re: [PATCH] kvmclock: clock should count only if vm is running

2013-06-18 Thread Marcelo Tosatti
On Tue, Jun 18, 2013 at 11:02:27AM +0200, Paolo Bonzini wrote:
> Hi Marcelo, sorry for the late review.
> 
> Il 08/06/2013 04:00, Marcelo Tosatti ha scritto:
> > kvmclock should not count while vm is paused, because:
> > 
> > 1) if the vm is paused for long periods, timekeeping 
> > math can overflow while converting the (large) clocksource 
> > delta to nanoseconds.
> > 
> > 2) Users rely on CLOCK_MONOTONIC to count run time, that is, 
> > time which OS has been in a runnable state (see CLOCK_BOOTTIME).
> 
> Do you have any ideas on how to implement CLOCK_BOOTTIME for kvmclock?
> I think we need to add more fields for the delta between CLOCK_MONOTONIC
> and CLOCK_BOOTTIME.

Unsure. An alternative to new fields is to use MSR_KVM_WALL_CLOCK
interface. Looking at two possibilities for catching up with real time
on unpause or vmload:

1) Catch up realtime via guest agent.
2) Catch up realtime via kvmclock interface.

But it would be good to have proper CLOCK_BOOTTIME behaviour at the same
time.

> > Change kvmclock driver so as to save clock value when vm transitions
> > from runnable to stopped state, and to restore clock value from stopped
> > to runnable transition.
> > 
> > Signed-off-by: Marcelo Tosatti 
> > 
> > diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
> > index 87d4d0f..7d2d005 100644
> > --- a/hw/i386/kvm/clock.c
> > +++ b/hw/i386/kvm/clock.c
> > @@ -28,38 +28,6 @@ typedef struct KVMClockState {
> >  bool clock_valid;
> >  } KVMClockState;
> >  
> > -static void kvmclock_pre_save(void *opaque)
> > -{
> > -KVMClockState *s = opaque;
> > -struct kvm_clock_data data;
> > -int ret;
> > -
> > -if (s->clock_valid) {
> > -return;
> > -}
> > -ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> > -if (ret < 0) {
> > -fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> > -data.clock = 0;
> > -}
> > -s->clock = data.clock;
> > -/*
> > - * If the VM is stopped, declare the clock state valid to avoid 
> > re-reading
> > - * it on next vmsave (which would return a different value). Will be 
> > reset
> > - * when the VM is continued.
> > - */
> > -s->clock_valid = !runstate_is_running();
> > -}
> > -
> > -static int kvmclock_post_load(void *opaque, int version_id)
> > -{
> > -KVMClockState *s = opaque;
> > -struct kvm_clock_data data;
> > -
> > -data.clock = s->clock;
> > -data.flags = 0;
> > -return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
> > -}
> >  
> >  static void kvmclock_vm_state_change(void *opaque, int running,
> >   RunState state)
> > @@ -70,8 +38,18 @@ static void kvmclock_vm_state_change(void *opaque, int 
> > running,
> >  int ret;
> >  
> >  if (running) {
> > +struct kvm_clock_data data;
> > +
> 
> Do we need an "if (!s->clock_valid) return;" here, or an assertion?  (Or
> alternatively, what happens if s->clock_valid == false?)


s->clock_valid = true means the clock value at s->clock (saved in
QEMU) is valid. 

At the moment VM begins execution, it becomes invalid (which means
synchronizing that value is necessary).

For the stopped->running transition, it has no meaning (just as the load
functions assume the device state is valid when they are called).

That said, i see no reason for additional checks.

> > + */
> > +s->clock_valid = !runstate_is_running();
> 
> Here we know that !runstate_is_running() is true (we're in the else
> branch of "if (running)") so it can just be "s->clock_valid = true".
> This matches the false assignment in the other branch.  This is just a
> nit, but it makes the code clearer.
> 
> Paolo

Sure, resending.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvmclock: clock should count only if vm is running (v2)

2013-06-18 Thread Marcelo Tosatti

v2: remove unnecessary runstate_is_running() usage (Paolo)

--

kvmclock should not count while vm is paused, because:

1) if the vm is paused for long periods, timekeeping
math can overflow while converting the (large) clocksource
delta to nanoseconds.

2) Users rely on CLOCK_MONOTONIC to count run time, that is,
time which OS has been in a runnable state (see CLOCK_BOOTTIME).

Change kvmclock driver so as to save clock value when vm transitions
from runnable to stopped state, and to restore clock value from stopped
to runnable transition.

Signed-off-by: Marcelo Tosatti 

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index 87d4d0f..98e5ca5 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -28,38 +28,6 @@ typedef struct KVMClockState {
 bool clock_valid;
 } KVMClockState;
 
-static void kvmclock_pre_save(void *opaque)
-{
-KVMClockState *s = opaque;
-struct kvm_clock_data data;
-int ret;
-
-if (s->clock_valid) {
-return;
-}
-ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
-if (ret < 0) {
-fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
-data.clock = 0;
-}
-s->clock = data.clock;
-/*
- * If the VM is stopped, declare the clock state valid to avoid re-reading
- * it on next vmsave (which would return a different value). Will be reset
- * when the VM is continued.
- */
-s->clock_valid = !runstate_is_running();
-}
-
-static int kvmclock_post_load(void *opaque, int version_id)
-{
-KVMClockState *s = opaque;
-struct kvm_clock_data data;
-
-data.clock = s->clock;
-data.flags = 0;
-return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
-}
 
 static void kvmclock_vm_state_change(void *opaque, int running,
  RunState state)
@@ -70,8 +38,18 @@ static void kvmclock_vm_state_change(void *opaque, int 
running,
 int ret;
 
 if (running) {
+struct kvm_clock_data data;
+
 s->clock_valid = false;
 
+data.clock = s->clock;
+data.flags = 0;
+ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
+if (ret < 0) {
+fprintf(stderr, "KVM_SET_CLOCK failed: %s\n", strerror(ret));
+abort();
+}
+
 if (!cap_clock_ctrl) {
 return;
 }
@@ -84,6 +62,26 @@ static void kvmclock_vm_state_change(void *opaque, int 
running,
 return;
 }
 }
+} else {
+struct kvm_clock_data data;
+int ret;
+
+if (s->clock_valid) {
+return;
+}
+ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
+if (ret < 0) {
+fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
+abort();
+}
+s->clock = data.clock;
+
+/*
+ * If the VM is stopped, declare the clock state valid to
+ * avoid re-reading it on next vmsave (which would return
+ * a different value). Will be reset when the VM is continued.
+ */
+s->clock_valid = true;
 }
 }
 
@@ -100,8 +98,6 @@ static const VMStateDescription kvmclock_vmsd = {
 .version_id = 1,
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
-.pre_save = kvmclock_pre_save,
-.post_load = kvmclock_post_load,
 .fields = (VMStateField[]) {
 VMSTATE_UINT64(clock, KVMClockState),
 VMSTATE_END_OF_LIST()
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pci: Enable overrides for missing ACS capabilities

2013-06-18 Thread Don Dutile

On 06/18/2013 06:22 PM, Alex Williamson wrote:

On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:

On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
  wrote:

On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:

On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:

PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
allows us to control whether transactions are allowed to be redirected
in various subnodes of a PCIe topology.  For instance, if two
endpoints are below a root port or downsteam switch port, the
downstream port may optionally redirect transactions between the
devices, bypassing upstream devices.  The same can happen internally
on multifunction devices.  The transaction may never be visible to the
upstream devices.

One upstream device that we particularly care about is the IOMMU.  If
a redirection occurs in the topology below the IOMMU, then the IOMMU
cannot provide isolation between devices.  This is why the PCIe spec
encourages topologies to include ACS support.  Without it, we have to
assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.

Unfortunately, far too many topologies do not support ACS to make this
a steadfast requirement.  Even the latest chipsets from Intel are only
sporadically supporting ACS.  We have trouble getting interconnect
vendors to include the PCIe spec required PCIe capability, let alone
suggested features.

Therefore, we need to add some flexibility.  The pcie_acs_override=
boot option lets users opt-in specific devices or sets of devices to
assume ACS support.


"ACS support" means the device provides an ACS capability and we
can change bits in the ACS Control Register to control how things
work.

You are really adding a way to "assume this device always routes
peer-to-peer DMA upstream," which ultimately translates into "assume
this device can be isolated from others via the IOMMU."  I think.


We've heard from one vendor that they support ACS with a NULL capability
structure, ie. the ACS PCIe capability exists, but reports no ACS
capabilities and allows no control.  This takes advantage of the "if
supported" style wording of the spec to imply that peer-to-peer is not
supported because the capability is not available.  This supported but
not controllable version is what we're trying to enable here.


I was wrong to say "we can change bits in the control register."  All
the control register bits are actually required to be hardwired to
zero when the relevant functionality is not implemented.

The example you mention (a device with an ACS capability structure
that reports no supported capabilities and allows no control) sounds
perfectly legal as a description of a device that doesn't support
peer-to-peer, and I hope it doesn't require any user intervention
(e.g., this patch) or quirks to make it work.


It requires:

Subject: [PATCH v2 2/2] pci: Differentiate ACS controllable from enabled


The ACS P2P Request Redirect "must be implemented by Functions that
support peer-to-peer traffic with other Functions."  This example
device doesn't support peer-to-peer traffic, so why would it implement
the ACS R bit?  In fact, it looks like the R bit (and all the other
bits) *must* be hardwired to zero in both capability and control
registers.


Right, if they don't support peer-to-peer then hardwiring capability and
control to zero should indicate that and is fixed by the patch
referenced above.


The "downstream" option assumes full ACS support
on root ports and downstream switch ports.  The "multifunction"
option assumes the subset of ACS features available on multifunction
endpoints and upstream switch ports are supported.  The "id::"
option enables ACS support on devices matching the provided vendor
and device IDs, allowing more strategic ACS overrides.  These options
may be combined in any order.  A maximum of 16 id specific overrides
are available.  It's suggested to use the most limited set of options
necessary to avoid completely disabling ACS across the topology.


Probably I'm confused by your use of "assume full ACS support,"


[on root ports and downstream ports]


  but I
don't understand the bit about "completely disabling ACS."


[across the topology]


   If we use
more options than necessary, it seems like we'd assume more isolation
than really exists.  That sounds like pretending ACS is *enabled* in
more places than it really is.


Exactly.  I'm just trying to suggest that booting with
pcie_acs_override=downstream,multifunction is not generally recommended
because it effectively disables ACS checking across the topology and
assumes isolation where there may be none.  In other words, if
everything is overriding ACS checks, then we've disabled any benefit of
doing the checking in the first place (so I really mean disable
checking, not disable ACS).


Yep, the missing "checking" is what is confusing.  Also, I think it
would be good to make the implication more explicit -- using this
option makes the ke

Re: [PATCH] pci: Enable overrides for missing ACS capabilities

2013-06-18 Thread Alex Williamson
On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:
> On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
>  wrote:
> > On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
> >> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
> >> > PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
> >> > allows us to control whether transactions are allowed to be redirected
> >> > in various subnodes of a PCIe topology.  For instance, if two
> >> > endpoints are below a root port or downsteam switch port, the
> >> > downstream port may optionally redirect transactions between the
> >> > devices, bypassing upstream devices.  The same can happen internally
> >> > on multifunction devices.  The transaction may never be visible to the
> >> > upstream devices.
> >> >
> >> > One upstream device that we particularly care about is the IOMMU.  If
> >> > a redirection occurs in the topology below the IOMMU, then the IOMMU
> >> > cannot provide isolation between devices.  This is why the PCIe spec
> >> > encourages topologies to include ACS support.  Without it, we have to
> >> > assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
> >> >
> >> > Unfortunately, far too many topologies do not support ACS to make this
> >> > a steadfast requirement.  Even the latest chipsets from Intel are only
> >> > sporadically supporting ACS.  We have trouble getting interconnect
> >> > vendors to include the PCIe spec required PCIe capability, let alone
> >> > suggested features.
> >> >
> >> > Therefore, we need to add some flexibility.  The pcie_acs_override=
> >> > boot option lets users opt-in specific devices or sets of devices to
> >> > assume ACS support.
> >>
> >> "ACS support" means the device provides an ACS capability and we
> >> can change bits in the ACS Control Register to control how things
> >> work.
> >>
> >> You are really adding a way to "assume this device always routes
> >> peer-to-peer DMA upstream," which ultimately translates into "assume
> >> this device can be isolated from others via the IOMMU."  I think.
> >
> > We've heard from one vendor that they support ACS with a NULL capability
> > structure, ie. the ACS PCIe capability exists, but reports no ACS
> > capabilities and allows no control.  This takes advantage of the "if
> > supported" style wording of the spec to imply that peer-to-peer is not
> > supported because the capability is not available.  This supported but
> > not controllable version is what we're trying to enable here.
> 
> I was wrong to say "we can change bits in the control register."  All
> the control register bits are actually required to be hardwired to
> zero when the relevant functionality is not implemented.
> 
> The example you mention (a device with an ACS capability structure
> that reports no supported capabilities and allows no control) sounds
> perfectly legal as a description of a device that doesn't support
> peer-to-peer, and I hope it doesn't require any user intervention
> (e.g., this patch) or quirks to make it work.

It requires: 

Subject: [PATCH v2 2/2] pci: Differentiate ACS controllable from enabled

> The ACS P2P Request Redirect "must be implemented by Functions that
> support peer-to-peer traffic with other Functions."  This example
> device doesn't support peer-to-peer traffic, so why would it implement
> the ACS R bit?  In fact, it looks like the R bit (and all the other
> bits) *must* be hardwired to zero in both capability and control
> registers.

Right, if they don't support peer-to-peer then hardwiring capability and
control to zero should indicate that and is fixed by the patch
referenced above.

> >> > The "downstream" option assumes full ACS support
> >> > on root ports and downstream switch ports.  The "multifunction"
> >> > option assumes the subset of ACS features available on multifunction
> >> > endpoints and upstream switch ports are supported.  The "id::"
> >> > option enables ACS support on devices matching the provided vendor
> >> > and device IDs, allowing more strategic ACS overrides.  These options
> >> > may be combined in any order.  A maximum of 16 id specific overrides
> >> > are available.  It's suggested to use the most limited set of options
> >> > necessary to avoid completely disabling ACS across the topology.
> >>
> >> Probably I'm confused by your use of "assume full ACS support,"
> >
> > [on root ports and downstream ports]
> >
> >>  but I
> >> don't understand the bit about "completely disabling ACS."
> >
> > [across the topology]
> >
> >>   If we use
> >> more options than necessary, it seems like we'd assume more isolation
> >> than really exists.  That sounds like pretending ACS is *enabled* in
> >> more places than it really is.
> >
> > Exactly.  I'm just trying to suggest that booting with
> > pcie_acs_override=downstream,multifunction is not generally recommended
> > because it effectively disables ACS checking across the topology and
> > assumes isolation where there may 

Re: [PATCH v3 0/6] KVM: MMU: fast invalidate all mmio sptes

2013-06-18 Thread Marcelo Tosatti
On Mon, Jun 17, 2013 at 07:59:15PM +0800, Xiao Guangrong wrote:
> Sorry for the delay reply since i was on vacation.
> 
> On 06/15/2013 10:22 AM, Takuya Yoshikawa wrote:
> > On Thu, 13 Jun 2013 21:08:21 -0300
> > Marcelo Tosatti  wrote:
> > 
> >> On Fri, Jun 07, 2013 at 04:51:22PM +0800, Xiao Guangrong wrote:
> > 
> >> - Where is the generation number increased?
> > 
> > Looks like when a new slot is installed in update_memslots() because
> > it's based on slots->generation.  This is not restricted to "create"
> > and "move".
> 
> Yes. It reuses slots->generation to avoid unnecessary synchronizations
> (RCU, memory barrier).
> 
> Increasing mmio generation number in the case of "create" and "move"
> is ok - it is no addition work unless mmio generation number is overflow
> which is hardly triggered (since the valid mmio generation number is
> large enough and zap_all is scale well now.) and the mmio spte is updated
> only when it is used in the future.
> 
> > 
> >> - Should use spinlock breakable code in kvm_mmu_zap_mmio_sptes()
> >> (picture guest with 512GB of RAM, even walking all those pages is
> >> expensive) (ah, patch to remove kvm_mmu_zap_mmio_sptes does that).
> >> - Is -13 enough to test wraparound? Its highly likely the guest has 
> >> not began executing by the time 13 kvm_set_memory_calls are made
> >> (so no sptes around). Perhaps -2000 is more sensible (should confirm
> >> though).
> > 
> > In the future, after we've tested enough, we should change the testing
> > code to be executed only for some debugging configs.  Especially, if we
> > change zap_mmio_sptes() to zap_all_shadows(), very common guests, even
> > without huge memory like 512GB, can see the effect induced by sudden page
> > faults unnecessarily.
> > 
> > If necessary, developers can test the wraparound code by lowering the
> > max_gen itself anyway.
> 
> I agree.
> 
> > 
> >> - Why remove "if (change == KVM_MR_CREATE) || (change
> >> ==  KVM_MR_MOVE)" from kvm_arch_commit_memory_region?
> >> Its instructive.
> > 
> > There may be a chance that we miss generation wraparounds if we don't
> > check other cases: seems unlikely, but theoretically possible.
> > 
> > In short, all memory slot changes make mmio sptes stored in shadow pages
> > obsolete, or zapped for wraparounds, in the new way -- am I right?
> 
> Yes. You are definitely right. :)
> 
> Takuya-san, thank you very much for you answering the questions for me and 
> thanks
> all of you for patiently reviewing my patches.
> 
> Marcelo, your points?

Agreed - points are clear. Patchset looks good.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv1] kvm guest: fix uninitialized kvmclock read by KVM guest

2013-06-18 Thread Marcelo Tosatti
On Sat, Jun 15, 2013 at 10:01:45PM +0400, Eugene Batalov wrote:
> Due to unintialized kvmclock read KVM guest is hanging on SMP boot stage.
> If unintialized memory contains fatal garbage then hang reproduction is 100%.
> Unintialized memory is allocated by memblock_alloc. So the garbage values
> depend on many many things.
> 
> See the detailed description of the bug and possible ways to fix it
> in the kernel bug tracker.
> "Bug 59521 - KVM linux guest reads uninitialized pvclock values before
> executing rdmsr MSR_KVM_WALL_CLOCK"
> 
> I decided to fix it simply returning 0ULL from kvm_clock_read when
> kvm clocksource is not initialized yet.
> The same as kernel bootstrap CPU doesn on boot stage when kernel
> clocksources are not initialized yet.
> 
> Signed-off-by: Eugene Batalov 
> ---
> I dont' use kernel percpu variables because for each SMP CPU
> their contents are copied from the bootstrap CPU. And I don't
> think that fixing the value for each SMP CPU is a good style.
> If you know a better approach to store the is_pv_clock_ready flags
> I am ready to use it.
> 
> The patch applies cleanly to
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> 
> I've tested the changes with Ubuntu 13.04 "raring" userspace and
> Ubuntu-3.8.0.19-30 kernel tag.
> 
>  arch/x86/kernel/kvmclock.c |   13 +++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 5bedbdd..a6e0af4 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -43,6 +43,9 @@ early_param("no-kvmclock", parse_no_kvmclock);
>  static struct pvclock_vsyscall_time_info *hv_clock;
>  static struct pvclock_wall_clock wall_clock;
>  
> +/* For each cpu store here a flag which tells whether pvclock is initialized 
> */
> +static int __cacheline_aligned_in_smp is_pv_clock_ready[NR_CPUS] = {};
> +
>  /*
>   * The wallclock is the time of day when we booted. Since then, some time may
>   * have elapsed since the hypervisor wrote the data. So we try to account for
> @@ -84,8 +87,11 @@ static cycle_t kvm_clock_read(void)
>  
>   preempt_disable_notrace();
>   cpu = smp_processor_id();
> - src = &hv_clock[cpu].pvti;
> - ret = pvclock_clocksource_read(src);
> + if (is_pv_clock_ready[cpu]) {
> + src = &hv_clock[cpu].pvti;
> + ret = pvclock_clocksource_read(src);
> + } else
> + ret = 0ULL;
>   preempt_enable_notrace();
>   return ret;
>  }
> @@ -168,6 +174,9 @@ int kvm_register_clock(char *txt)
>   printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
>  cpu, high, low, txt);
>  
> + if (!ret)
> + is_pv_clock_ready[cpu] = 1;
> +
>   return ret;
>  }

The path can be really hot, so better avoid it if possible. The patch
to zero the memblock_alloc'd area from Igor brings the behaviour back
to before regression: return 0 until kvmclock is initialized. On top of
your analysis in the BZ, it now seems the right (and safer) thing to do.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-18 Thread Benjamin Herrenschmidt
On Tue, 2013-06-18 at 08:48 -0600, Alex Williamson wrote:
> On Tue, 2013-06-18 at 14:38 +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2013-06-17 at 20:32 -0600, Alex Williamson wrote:
> > 
> > > Right, we don't want to create dependencies across modules.  I don't
> > > have a vision for how this should work.  This is effectively a complete
> > > side-band to vfio, so we're really just dealing in the iommu group
> > > space.  Maybe there needs to be some kind of registration of ownership
> > > for the group using some kind of token.  It would need to include some
> > > kind of notification when that ownership ends.  That might also be a
> > > convenient tag to toggle driver probing off for devices in the group.
> > > Other ideas?  Thanks,
> > 
> > All of that smells nasty like it will need a pile of bloody
> > infrastructure which makes me think it's too complicated and not the
> > right approach.
> > 
> > How does access control work today on x86/VFIO ? Can you give me a bit
> > more details ? I didn't get a good grasp in your previous email
> 
> The current model is not x86 specific, but it only covers doing iommu
> and device access through vfio.  The kink here is that we're trying to
> do device access and setup through vfio, but iommu manipulation through
> kvm.  We may want to revisit whether we can do the in-kernel iommu
> manipulation through vfio rather than kvm.

How would that be possible ?

The hypercalls from the guest arrive in KVM... in a very very specific &
restricted environment which we call real mode (MMU off but still
running in guest context), where we try to do as much as possible, or in
virtual mode, where they get handled as normal KVM exits.

The only way we could handle them "in VFIO" would be if somewhat VFIO
registered callbacks with KVM... if we have that sort of
cross-dependency, then we may as well have a simpler one where VFIO
tells KVM what iommu is available for the VM

> For vfio in general, the group is the unit of ownership.  A user is
> granted access to /dev/vfio/$GROUP through file permissions.  The user
> opens the group and a container (/dev/vfio/vfio) and calls SET_CONTAINER
> on the group.  If supported by the platform, multiple groups can be set
> to the same container, which allows for iommu domain sharing.  Once a
> group is associated with a container, an iommu backend can be
> initialized for the container.  Only then can a device be accessed
> through the group.
> 
> So even if we were to pass a vfio group file descriptor into kvm and it
> matched as some kind of ownership token on the iommu group, it's not
> clear that's sufficient to assume we can start programming the iommu.
> Thanks,

Your scheme seems to me that it would have the same problem if you
wanted to do virtualized iommu

In any case, this is a big deal. We have a requirement for pass-through.
It cannot work with any remotely usable performance level if we don't
implement the calls in KVM, so it needs to be sorted one way or another
and I'm at a loss how here...

Ben.

> Alex
> 
> > From the look of it, the VFIO file descriptor is what has the "access
> > control" to the underlying iommu, is this right ? So we somewhat need to
> > transfer (or copy) that ownership from the VFIO fd to the KVM VM.
> > 
> > I don't see a way to do that without some cross-layering here...
> > 
> > Rusty, are you aware of some kernel mechanism we can use for that ?
> > 
> > Cheers,
> > Ben.
> > 
> > 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pci: Enable overrides for missing ACS capabilities

2013-06-18 Thread Bjorn Helgaas
On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
 wrote:
> On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
>> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
>> > PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
>> > allows us to control whether transactions are allowed to be redirected
>> > in various subnodes of a PCIe topology.  For instance, if two
>> > endpoints are below a root port or downsteam switch port, the
>> > downstream port may optionally redirect transactions between the
>> > devices, bypassing upstream devices.  The same can happen internally
>> > on multifunction devices.  The transaction may never be visible to the
>> > upstream devices.
>> >
>> > One upstream device that we particularly care about is the IOMMU.  If
>> > a redirection occurs in the topology below the IOMMU, then the IOMMU
>> > cannot provide isolation between devices.  This is why the PCIe spec
>> > encourages topologies to include ACS support.  Without it, we have to
>> > assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
>> >
>> > Unfortunately, far too many topologies do not support ACS to make this
>> > a steadfast requirement.  Even the latest chipsets from Intel are only
>> > sporadically supporting ACS.  We have trouble getting interconnect
>> > vendors to include the PCIe spec required PCIe capability, let alone
>> > suggested features.
>> >
>> > Therefore, we need to add some flexibility.  The pcie_acs_override=
>> > boot option lets users opt-in specific devices or sets of devices to
>> > assume ACS support.
>>
>> "ACS support" means the device provides an ACS capability and we
>> can change bits in the ACS Control Register to control how things
>> work.
>>
>> You are really adding a way to "assume this device always routes
>> peer-to-peer DMA upstream," which ultimately translates into "assume
>> this device can be isolated from others via the IOMMU."  I think.
>
> We've heard from one vendor that they support ACS with a NULL capability
> structure, ie. the ACS PCIe capability exists, but reports no ACS
> capabilities and allows no control.  This takes advantage of the "if
> supported" style wording of the spec to imply that peer-to-peer is not
> supported because the capability is not available.  This supported but
> not controllable version is what we're trying to enable here.

I was wrong to say "we can change bits in the control register."  All
the control register bits are actually required to be hardwired to
zero when the relevant functionality is not implemented.

The example you mention (a device with an ACS capability structure
that reports no supported capabilities and allows no control) sounds
perfectly legal as a description of a device that doesn't support
peer-to-peer, and I hope it doesn't require any user intervention
(e.g., this patch) or quirks to make it work.

The ACS P2P Request Redirect "must be implemented by Functions that
support peer-to-peer traffic with other Functions."  This example
device doesn't support peer-to-peer traffic, so why would it implement
the ACS R bit?  In fact, it looks like the R bit (and all the other
bits) *must* be hardwired to zero in both capability and control
registers.

>> > The "downstream" option assumes full ACS support
>> > on root ports and downstream switch ports.  The "multifunction"
>> > option assumes the subset of ACS features available on multifunction
>> > endpoints and upstream switch ports are supported.  The "id::"
>> > option enables ACS support on devices matching the provided vendor
>> > and device IDs, allowing more strategic ACS overrides.  These options
>> > may be combined in any order.  A maximum of 16 id specific overrides
>> > are available.  It's suggested to use the most limited set of options
>> > necessary to avoid completely disabling ACS across the topology.
>>
>> Probably I'm confused by your use of "assume full ACS support,"
>
> [on root ports and downstream ports]
>
>>  but I
>> don't understand the bit about "completely disabling ACS."
>
> [across the topology]
>
>>   If we use
>> more options than necessary, it seems like we'd assume more isolation
>> than really exists.  That sounds like pretending ACS is *enabled* in
>> more places than it really is.
>
> Exactly.  I'm just trying to suggest that booting with
> pcie_acs_override=downstream,multifunction is not generally recommended
> because it effectively disables ACS checking across the topology and
> assumes isolation where there may be none.  In other words, if
> everything is overriding ACS checks, then we've disabled any benefit of
> doing the checking in the first place (so I really mean disable
> checking, not disable ACS).

Yep, the missing "checking" is what is confusing.  Also, I think it
would be good to make the implication more explicit -- using this
option makes the kernel assume isolation where it may not actually
exist -- because the users of this option don't know about checking
anywa

Unreliable QueryPerformanceCounter on Windows 2008 RC2 guest

2013-06-18 Thread pedro pinto
Hi there,

I am using a Ubuntu 12.04 x64 host to run a Windows 2008 RC2 guest and
I have noticed very puzzling behaviour inside the guest when invoking
QueryPerformanceCounter. Specifically the program below will
periodically show a 43 second offset between loop iterations despite
the fact that  loop is roughly running once per second (printing out
the guest system clock confirms this). The same thing happens on a
Windows 7 guest. The 43 second number is remarkably stable across
hosts and guests.

#include "stdafx.h"
#include "windows.h"
#include 

int main(int argc, char ** argv)
{

   LARGE_INTEGER qpcnt;
   QueryPerformanceCounter(&qpcnt);
   auto then = qpcnt.QuadPart;

   while (true) {
  LARGE_INTEGER qpfreq;
  QueryPerformanceFrequency(&qpfreq);
  auto freq = qpfreq.QuadPart;

  QueryPerformanceCounter(&qpcnt);
  auto now = qpcnt.QuadPart;
  auto delta_seconds = (now - then) / freq;

  std::cout << delta_seconds << std::endl;

  QueryPerformanceCounter(&qpcnt);
  then = qpcnt.QuadPart;
  Sleep (1000);
   }
}

A few more details on the host:

$ uname -a
Linux think 3.2.0-48-generic #74-Ubuntu SMP Thu Jun 6 19:43:26 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux
$ kvm --version
QEMU emulator version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008
Fabrice Bellard

Any ideas on what might be happening?

Thanks in advance,
-pp
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pci: Enable overrides for missing ACS capabilities

2013-06-18 Thread Alex Williamson
On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:
> On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
> > PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
> > allows us to control whether transactions are allowed to be redirected
> > in various subnodes of a PCIe topology.  For instance, if two
> > endpoints are below a root port or downsteam switch port, the
> > downstream port may optionally redirect transactions between the
> > devices, bypassing upstream devices.  The same can happen internally
> > on multifunction devices.  The transaction may never be visible to the
> > upstream devices.
> > 
> > One upstream device that we particularly care about is the IOMMU.  If
> > a redirection occurs in the topology below the IOMMU, then the IOMMU
> > cannot provide isolation between devices.  This is why the PCIe spec
> > encourages topologies to include ACS support.  Without it, we have to
> > assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
> > 
> > Unfortunately, far too many topologies do not support ACS to make this
> > a steadfast requirement.  Even the latest chipsets from Intel are only
> > sporadically supporting ACS.  We have trouble getting interconnect
> > vendors to include the PCIe spec required PCIe capability, let alone
> > suggested features.
> > 
> > Therefore, we need to add some flexibility.  The pcie_acs_override=
> > boot option lets users opt-in specific devices or sets of devices to
> > assume ACS support.  
> 
> "ACS support" means the device provides an ACS capability and we
> can change bits in the ACS Control Register to control how things
> work.
> 
> You are really adding a way to "assume this device always routes
> peer-to-peer DMA upstream," which ultimately translates into "assume
> this device can be isolated from others via the IOMMU."  I think.

We've heard from one vendor that they support ACS with a NULL capability
structure, ie. the ACS PCIe capability exists, but reports no ACS
capabilities and allows no control.  This takes advantage of the "if
supported" style wording of the spec to imply that peer-to-peer is not
supported because the capability is not available.  This supported but
not controllable version is what we're trying to enable here.

> > The "downstream" option assumes full ACS support
> > on root ports and downstream switch ports.  The "multifunction"
> > option assumes the subset of ACS features available on multifunction
> > endpoints and upstream switch ports are supported.  The "id::"
> > option enables ACS support on devices matching the provided vendor
> > and device IDs, allowing more strategic ACS overrides.  These options
> > may be combined in any order.  A maximum of 16 id specific overrides
> > are available.  It's suggested to use the most limited set of options
> > necessary to avoid completely disabling ACS across the topology.
> 
> Probably I'm confused by your use of "assume full ACS support,"

[on root ports and downstream ports]

>  but I
> don't understand the bit about "completely disabling ACS."

[across the topology]

>   If we use
> more options than necessary, it seems like we'd assume more isolation
> than really exists.  That sounds like pretending ACS is *enabled* in
> more places than it really is.

Exactly.  I'm just trying to suggest that booting with
pcie_acs_override=downstream,multifunction is not generally recommended
because it effectively disables ACS checking across the topology and
assumes isolation where there may be none.  In other words, if
everything is overriding ACS checks, then we've disabled any benefit of
doing the checking in the first place (so I really mean disable
checking, not disable ACS).  Instead, the recommendation is to be more
selective, possibly opting for just "downstream" or even better, using
the specific IDs for devices which are known or suspected to not allow
peer-to-peer.

> > Note to hardware vendors, we have facilities to permanently quirk
> > specific devices which enforce isolation but not provide an ACS
> > capability.  Please contact me to have your devices added and save
> > your customers the hassle of this boot option.
> 
> Who do you expect to decide whether to use this option?  I think it
> requires intimate knowledge of how the device works.
> 
> I think the benefit of using the option is that it makes assignment of
> devices to guests more flexible, which will make it attractive to users.
> But most users have no way of knowing whether it's actually *safe* to
> use this.  So I worry that you're adding an easy way to pretend isolation
> exists when there's no good way of being confident that it actually does.
> 
> I see the warning you added for this case; I just don't feel good about
> it.  Maybe the idea is that, e.g., Red Hat will research certain devices
> and recommend the option for those cases, and sign up to support that
> config.  I assume you would not be willing to support its use unless
> Red Hat speci

Re: [PATCH v2 0/2] kvm: x86: Emulate MSR_PLATFORM_INFO

2013-06-18 Thread Nakajima, Jun
On Tue, Jun 18, 2013 at 8:16 AM, Gleb Natapov  wrote:
> On Tue, Jun 18, 2013 at 04:05:08PM +0200, Paolo Bonzini wrote:
>> Il 05/06/2013 10:42, Gleb Natapov ha scritto:
>> >> > These patches add an emulated MSR_PLATFORM_INFO that kvm guests
>> >> > can read as described in section 14.3.2.4 of the Intel SDM.
>> >> > The relevant changes and details are in [2/2]; [1/2] makes vendor_intel
>> >> > generic. There are atleat two known applications that fail to run 
>> >> > because
>> >> > of this MSR missing - Sandra and vTune.
>> > So I really want Intel opinion on this. Right now it is impossible to
>> > implement the MSR correctly in the face of migration (may be with tsc
>> > scaling it will be possible) and while it is unimplemented if application
>> > tries to use it it fails, but if we will implement it application will
>> > just produce incorrect result without any means for user to detect it.
>>
>> Jun, ping?  (Perhaps Gleb you want to ask a more specific question though).
>>
>> I don't think this is much different from any other RDTSC usage in
>> applications (they will typically do their calibration manually, and do
>> it just once).  I'm applying it to queue.
>>
> And we do not support application that uses RDTSC directly! If we could
> catch those it would be good from support point of view, so the way
> MSR_PLATFORM_INFO behaves now it better then proposed alternative.

Is it reasonable or possible to expose MSR_PLATFORM_INFO more and then
disable migration? Some use cases (like VTune) don't need live
migration.


--
Jun
Intel Open Source Technology Center
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pci: Enable overrides for missing ACS capabilities

2013-06-18 Thread Bjorn Helgaas
On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:
> PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
> allows us to control whether transactions are allowed to be redirected
> in various subnodes of a PCIe topology.  For instance, if two
> endpoints are below a root port or downsteam switch port, the
> downstream port may optionally redirect transactions between the
> devices, bypassing upstream devices.  The same can happen internally
> on multifunction devices.  The transaction may never be visible to the
> upstream devices.
> 
> One upstream device that we particularly care about is the IOMMU.  If
> a redirection occurs in the topology below the IOMMU, then the IOMMU
> cannot provide isolation between devices.  This is why the PCIe spec
> encourages topologies to include ACS support.  Without it, we have to
> assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
> 
> Unfortunately, far too many topologies do not support ACS to make this
> a steadfast requirement.  Even the latest chipsets from Intel are only
> sporadically supporting ACS.  We have trouble getting interconnect
> vendors to include the PCIe spec required PCIe capability, let alone
> suggested features.
> 
> Therefore, we need to add some flexibility.  The pcie_acs_override=
> boot option lets users opt-in specific devices or sets of devices to
> assume ACS support.  

"ACS support" means the device provides an ACS capability and we
can change bits in the ACS Control Register to control how things
work.

You are really adding a way to "assume this device always routes
peer-to-peer DMA upstream," which ultimately translates into "assume
this device can be isolated from others via the IOMMU."  I think.

> The "downstream" option assumes full ACS support
> on root ports and downstream switch ports.  The "multifunction"
> option assumes the subset of ACS features available on multifunction
> endpoints and upstream switch ports are supported.  The "id::"
> option enables ACS support on devices matching the provided vendor
> and device IDs, allowing more strategic ACS overrides.  These options
> may be combined in any order.  A maximum of 16 id specific overrides
> are available.  It's suggested to use the most limited set of options
> necessary to avoid completely disabling ACS across the topology.

Probably I'm confused by your use of "assume full ACS support," but I
don't understand the bit about "completely disabling ACS."  If we use
more options than necessary, it seems like we'd assume more isolation
than really exists.  That sounds like pretending ACS is *enabled* in
more places than it really is.

> Note to hardware vendors, we have facilities to permanently quirk
> specific devices which enforce isolation but not provide an ACS
> capability.  Please contact me to have your devices added and save
> your customers the hassle of this boot option.

Who do you expect to decide whether to use this option?  I think it
requires intimate knowledge of how the device works.

I think the benefit of using the option is that it makes assignment of
devices to guests more flexible, which will make it attractive to users.
But most users have no way of knowing whether it's actually *safe* to
use this.  So I worry that you're adding an easy way to pretend isolation
exists when there's no good way of being confident that it actually does.

I see the warning you added for this case; I just don't feel good about
it.  Maybe the idea is that, e.g., Red Hat will research certain devices
and recommend the option for those cases, and sign up to support that
config.  I assume you would not be willing to support its use unless
Red Hat specifically recommended it.

Bjorn

> Signed-off-by: Alex Williamson 
> ---
>  Documentation/kernel-parameters.txt |   10 +++
>  drivers/pci/quirks.c|  102 
> +++
>  2 files changed, 112 insertions(+)
> 
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 47bb23c..a60e6ad 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2349,6 +2349,16 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   nomsi   Do not use MSI for native PCIe PME signaling (this makes
>   all PCIe root ports use INTx for all services).
>  
> + pcie_acs_override =
> + [PCIE] Override missing PCIe ACS support for:
> + downstream
> + All downstream ports - full ACS capabilties
> + multifunction
> + All multifunction devices - multifunction ACS subset
> + id::
> + Specfic device - full ACS capabilities
> + Specified as vid:did (vendor/device ID) in hex
> +
>   pcmv=   [HW,PCMCIA] BadgePAD 4
>  
>   pd. [PARIDE]
> diff --git a/drivers/pci/q

Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-18 Thread Gleb Natapov
Send code in a form of a patch.

On Wed, Jun 19, 2013 at 12:14:13AM +0800, 李春奇  wrote:
> extern u8 insn_page[], insn_page_end[];
> extern u8 test_insn[], test_insn_end[];
> extern u8 alt_insn_page[];
> 
> asm(
> ".align 4096\n\t"
> ".global insn_page\n\t"
> ".global insn_page_end\n\t"
> ".global test_insn\n\t"
> ".global test_insn_end\n\t"
> "insn_page:\n\t"
> 
> "ret \n\t"
> 
> "push %rax; push %rbx\n\t"
> "push %rcx; push %rdx\n\t"
> "push %rsi; push %rdi\n\t"
> "push %rbp\n\t"
> "push %r8; push %r9\n\t"
> "push %r10; push %r11\n\t"
> "push %r12; push %r13\n\t"
> "push %r14; push %r15\n\t"
> "pushf\n\t"
> 
> "push 136+save \n\t"
> "popf \n\t"
> "mov 0+save, %rax \n\t"
> "mov 8+save, %rbx \n\t"
> "mov 16+save, %rcx \n\t"
> "mov 24+save, %rdx \n\t"
> "mov 32+save, %rsi \n\t"
> "mov 40+save, %rdi \n\t"
> "mov 56+save, %rbp \n\t"
> "mov 64+save, %r8 \n\t"
> "mov 72+save, %r9 \n\t"
> "mov 80+save, %r10  \n\t"
> "mov 88+save, %r11 \n\t"
> "mov 96+save, %r12 \n\t"
> "mov 104+save, %r13 \n\t"
> "mov 112+save, %r14 \n\t"
> "mov 120+save, %r15 \n\t"
> 
> "test_insn:\n\t"
> "in  (%dx),%al\n\t"
> ". = . + 31\n\t"
> "test_insn_end:\n\t"
> 
> "pushf \n\t"
> "pop 136+save \n\t"
> "mov %rax, 0+save \n\t"
> "mov %rbx, 8+save \n\t"
> "mov %rcx, 16+save \n\t"
> "mov %rdx, 24+save \n\t"
> "mov %rsi, 32+save \n\t"
> "mov %rdi, 40+save \n\t"
> "mov %rbp, 56+save \n\t"
> "mov %r8, 64+save \n\t"
> "mov %r9, 72+save \n\t"
> "mov %r10, 80+save \n\t"
> "mov %r11, 88+save \n\t"
> "mov %r12, 96+save \n\t"
> "mov %r13, 104+save \n\t"
> "mov %r14, 112+save \n\t"
> "mov %r15, 120+save \n\t"
> "popf \n\t"
> "pop %r15; pop %r14 \n\t"
> "pop %r13; pop %r12 \n\t"
> "pop %r11; pop %r10 \n\t"
> "pop %r9; pop %r8 \n\t"
> "pop %rbp \n\t"
> "pop %rdi; pop %rsi \n\t"
> "pop %rdx; pop %rcx \n\t"
> "pop %rbx; pop %rax \n\t"
> 
> "ret\n\t"
> "save:\n\t"
> ". = . + 256\n\t"
> ".align 4096\n\t"
> "alt_insn_page:\n\t"
> ". = . + 4096\n\t"
> );
> 
> 
> static void mk_insn_page(uint8_t *alt_insn_page,
> uint8_t *alt_insn, int alt_insn_length)
> {
> int i, emul_offset;
> for (i=1; i test_insn[i] = 0x90; // nop
Why? Gcc should pad it with nops.

> emul_offset = test_insn - insn_page;
> for (i=0; i alt_insn_page[i+emul_offset] = alt_insn[i];
> }
> 
> static void trap_emulator(uint64_t *mem, uint8_t* alt_insn, int 
> alt_insn_length)
> {
> ulong *cr3 = (ulong *)read_cr3();
> int save_offset = (u8 *)(&save) - insn_page;
> 
> memset(alt_insn_page, 0x90, 4096);
alt_insn_page should contains the same instruction as insn_page except
between test_insn and test_insn_end. I do not know how you expect it to
work otherwise.

> save = inregs;
> mk_insn_page(alt_insn_page, alt_insn, alt_insn_length);
> // Load the code TLB with insn_page, but point the page tables at
> // alt_insn_page (and keep the data TLB clear, for AMD decode assist).
> // This will make the CPU trap on the insn_page instruction but the
> // hypervisor will see alt_insn_page.
> //install_page(cr3, virt_to_phys(insn_page), insn_page);
> invlpg(insn_page);
> // Load code TLB
> asm volatile("call *%0" : : "r"(insn_page));
> install_page(cr3, virt_to_phys(alt_insn_page), insn_page);
> // Trap, let hypervisor emulate at alt_insn_page
> asm volatile("call *%0": : "r"(insn_page+1));
> 
> outregs = *((struct regs *)(&alt_insn_page[save_offset]));
> }
> 
> static void test_movabs(uint64_t *mem)
> {
> // mov $0xc3c3c3c3c3c3c3c3, %rcx
> uint8_t alt_insn[] = {0x48, 0xb9, 0xc3, 0xc3, 0xc3,
> 0xc3, 0xc3, 0xc3, 0xc3, 0xc3};
> inregs = (struct regs){ 0 };
> trap_emulator(mem, alt_insn, 10);
> report("64-bit mov imm2", outregs.rcx == 0xc3c3c3c3c3c3c3c3);
> }
> 
> On Wed, Jun 19, 2013 at 12:09 AM, Gleb Natapov  wrote:
> > On Tue, Jun 18, 2013 at 11:56:24PM +0800, 李春奇  wrote:
> >> On Tue, Jun 18, 2013 at 11:47 PM, Gleb Natapov  wrote:
> >> > On Tue, Jun 18, 2013 at 10:28:59PM +0800, Ê??Ê?•Â•?  
> >> > wrote:
> >> >> On Tue, Jun 18, 2013 at 8:45 PM, Gleb Natapov  wrote:
> >> >> > On Thu, Jun 13, 2013 at 05:30:03PM +0800, 李春奇  >> >> > Li> wrote:
> >> >> >> Hi Gleb,
> >> >> >> I'm trying to solve these problems in the past days and meet many
> >> >> >> difficulties. You want to save all the general registers in calling
> >> >> >> insn_page, so registers should be saved to (save) in insn_page.
> >> >> >> Because all the instructions should be generated outside and copy to
> >> >> >> insn_page, and the instructions generated outside is RIP-relative, so
> >> >> >> inside insn_page (save) will be wrong pointed with RIP-relative code.
> >> >> >>
> >> >> > They do not have to be generated outside. You can write code into
> >> >> > insn_page directly. Something like this outside of any functions:
> >> >> >
> >> >> > asm(".align 4096\n\t"
> >> >> > ".global insn_page\n\t"
> >> >> > ".global insn_page_end\n\t"
> >> >> > ".global test_insn\n\t"
> >> >> > ".gl

Re: Bug#707257: linux-image-3.8-1-686-pae: KVM crashes with "entry failed, hardware error 0x80000021"

2013-06-18 Thread Stefan Pietsch
On 17.06.2013 18:07, Paolo Bonzini wrote:
> Il 16/06/2013 02:25, Stefan Pietsch ha scritto:
>> Bisecting leads to
>>
>> git bisect bad 378a8b099fc207ddcb91b19a8c1457667e0af398
>> git bisect good 007a3b547512d69f67ceb9641796d64552bd337e
>> git bisect good 1f3141e80b149e7215313dff29e9a0c47811b1d1
>> git bisect good 286da4156dc65c8a054580fdd96b7709132dce8d
>> git bisect bad 25391454e73e3156202264eb3c473825afe4bc94
>> git bisect good 218e763f458c44f30041c1b48b4371e130fd4317
>>
>>
>> first bad commit: [25391454e73e3156202264eb3c473825afe4bc94]
>> KVM: VMX: don't clobber segment AR of unusable segments.
>>
>> 25391454e73e3156202264eb3c473825afe4bc94
>> emulate_invalid_guest_state=0 -> hangs and shows "KVM: entry failed"
>> emulate_invalid_guest_state=1 -> hangs
>>
>> Please note, I had to compile some revisions with
>> 3f0c3d0bb2bcc4b88b22452a7cf0073ee9a0f1e6 applied, caused by
>> 9ae9febae9500a0a6f5ce29ee4b8d942b5332529.
> 
> Can you please execute "info registers" and "x/10i $pc" from the QEMU
> monitor at the time of the hang, and include the output?  Using
> "-monitor stdio" or the new GTK+ interface can help.
> 
> Also, can you run under tracing (for information on how to do this, see
> http://www.linux-kvm.org/page/Tracing) and include the bottom of the log?

Tested with 25391454e73e3156202264eb3c473825afe4bc94
 emulate_invalid_guest_state=1


(qemu) info registers
EAX=00010286 EBX= ECX=c12c527c EDX=
ESI=00010286 EDI=c14c4744 EBP=c10161f5 ESP=de84df10
EIP=c1014a8d EFL=00010286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =007b   00c0f300 DPL=3 DS   [-WA]
CS =0060   00c09b00 DPL=0 CS32 [-RA]
SS =0068   00c09300 DPL=0 DS   [-WA]
DS =007b   00c0f300 DPL=3 DS   [-WA]
FS =  ffff 00f0ff00 DPL=3 CS64 [CRA]
GS =00e0 c1438b40 0018 00409100 DPL=0 DS   [--A]
LDT=  ffff 00f0ff00 DPL=3 CS64 [CRA]
TR =0080 c1400f00 206b 8b00 DPL=0 TSS32-busy
GDT= c13f6000 00ff
IDT= c13f5000 07ff
CR0=8005003b CR2= CR3=014bc000 CR4=0690
DR0= DR1= DR2=0007
DR3=
DR6=0ff0 DR7=0400
EFER=
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80
FPR0=f44d002c6000 400d FPR1=80847fe7 400e
FPR2=fa007fa24000 400e FPR3=80e88055f000 400e
FPR4=ea61009c4000 400d FPR5=ea62009c4000 400c
FPR6=800bf600 4015 FPR7= 
XMM00=
XMM01=
XMM02=
XMM03=
XMM04=
XMM05=
XMM06=
XMM07=

(qemu) x/10i $pc
0xc1014a8d:  lea0x0(%esi),%esi
0xc1014a91:  ret
0xc1014a92:  cli
0xc1014a93:  nop
0xc1014a94:  lea0x0(%esi),%esi
0xc1014a98:  ret
0xc1014a99:  push   %eax
0xc1014a9a:  call   0xc1014a84
0xc1014a9f:  mov%eax,(%esp)
0xc1014aa2:  call   0xc1014a92


last 20 lines of the trace:
 qemu-system-x86-3575  [000]   542.279800: kvm_entry:vcpu 0
 qemu-system-x86-3575  [000]   542.279802: kvm_inj_virq: irq 48
 qemu-system-x86-3575  [000]   542.279802: kvm_entry:vcpu 0
 qemu-system-x86-3575  [000]   542.279803: kvm_inj_virq: irq 48
 qemu-system-x86-3575  [000]   542.279804: kvm_entry:vcpu 0
 qemu-system-x86-3575  [000]   542.279805: kvm_inj_virq: irq 48
 qemu-system-x86-3575  [000]   542.279806: kvm_entry:vcpu 0
 qemu-system-x86-3575  [000]   542.279807: kvm_inj_virq: irq 48
 qemu-system-x86-3575  [000]   542.279808: kvm_entry:vcpu 0
 qemu-system-x86-3575  [000]   542.279809: kvm_inj_virq: irq 48
 qemu-system-x86-3575  [000]   542.279810: kvm_entry:vcpu 0
 qemu-system-x86-3575  [000]   542.279811: kvm_inj_virq: irq 48
 qemu-system-x86-3575  [000]   542.279812: kvm_entry:vcpu 0
 qemu-system-x86-3573  [001]   542.280010: kvm_set_irq:  gsi 0
level 1 source 0
 qemu-system-x86-3573  [001]   542.280013: kvm_pic_set_irq:  chip 0
pin 0 (edge|masked)
 qemu-system-x86-3573  [001]   542.280015: kvm_apic_accept_irq:  apicid
0 vec 48 (LowPrio|edge) (coalesced)
 qemu-system-x86-3573  [001]   542.280015: kvm_ioapic_set_irq:   pin 2
dst 1 vec=48 (LowPrio|logical|edge) (coalesced)
 qemu-system-x86-3573  [001]   542.280016: kvm_set_irq:  gsi 0
level 0 source 0
 qemu-system-x86-3573  [001]   542.280017: kvm_pic_set_irq:  chip 0
pin 0 (edge|masked)
 qemu-system-x86-3573  [001]   542.280017: kvm_ioapic_set_irq:   pin 2
dst 1 vec=48 (LowPrio|logical|edge)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majord

Re: [PATCH qom-cpu v2 01/29] kvm: Change kvm_cpu_synchronize_state() argument to CPUState

2013-06-18 Thread Andreas Färber
Am 17.06.2013 18:15, schrieb Paolo Bonzini:
> Il 16/06/2013 17:57, Andreas Färber ha scritto:
>> It no longer relies on CPUArchState since 20d695a.
>>
>> Reviewed-by: liguang 
>> Signed-off-by: Andreas Färber 
>> ---
>>  hw/ppc/spapr_rtas.c  |  2 +-
>>  include/sysemu/kvm.h |  4 ++--
>>  kvm-all.c|  4 +---
>>  kvm-stub.c   |  2 +-
>>  target-i386/kvm.c| 10 +-
>>  5 files changed, 10 insertions(+), 12 deletions(-)
[...]
> 
> Acked-by: Paolo Bonzini 

Thanks, applied to qom-cpu:
https://github.com/afaerber/qemu-cpu/commits/qom-cpu

Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-18 Thread 李春奇
extern u8 insn_page[], insn_page_end[];
extern u8 test_insn[], test_insn_end[];
extern u8 alt_insn_page[];

asm(
".align 4096\n\t"
".global insn_page\n\t"
".global insn_page_end\n\t"
".global test_insn\n\t"
".global test_insn_end\n\t"
"insn_page:\n\t"

"ret \n\t"

"push %rax; push %rbx\n\t"
"push %rcx; push %rdx\n\t"
"push %rsi; push %rdi\n\t"
"push %rbp\n\t"
"push %r8; push %r9\n\t"
"push %r10; push %r11\n\t"
"push %r12; push %r13\n\t"
"push %r14; push %r15\n\t"
"pushf\n\t"

"push 136+save \n\t"
"popf \n\t"
"mov 0+save, %rax \n\t"
"mov 8+save, %rbx \n\t"
"mov 16+save, %rcx \n\t"
"mov 24+save, %rdx \n\t"
"mov 32+save, %rsi \n\t"
"mov 40+save, %rdi \n\t"
"mov 56+save, %rbp \n\t"
"mov 64+save, %r8 \n\t"
"mov 72+save, %r9 \n\t"
"mov 80+save, %r10  \n\t"
"mov 88+save, %r11 \n\t"
"mov 96+save, %r12 \n\t"
"mov 104+save, %r13 \n\t"
"mov 112+save, %r14 \n\t"
"mov 120+save, %r15 \n\t"

"test_insn:\n\t"
"in  (%dx),%al\n\t"
". = . + 31\n\t"
"test_insn_end:\n\t"

"pushf \n\t"
"pop 136+save \n\t"
"mov %rax, 0+save \n\t"
"mov %rbx, 8+save \n\t"
"mov %rcx, 16+save \n\t"
"mov %rdx, 24+save \n\t"
"mov %rsi, 32+save \n\t"
"mov %rdi, 40+save \n\t"
"mov %rbp, 56+save \n\t"
"mov %r8, 64+save \n\t"
"mov %r9, 72+save \n\t"
"mov %r10, 80+save \n\t"
"mov %r11, 88+save \n\t"
"mov %r12, 96+save \n\t"
"mov %r13, 104+save \n\t"
"mov %r14, 112+save \n\t"
"mov %r15, 120+save \n\t"
"popf \n\t"
"pop %r15; pop %r14 \n\t"
"pop %r13; pop %r12 \n\t"
"pop %r11; pop %r10 \n\t"
"pop %r9; pop %r8 \n\t"
"pop %rbp \n\t"
"pop %rdi; pop %rsi \n\t"
"pop %rdx; pop %rcx \n\t"
"pop %rbx; pop %rax \n\t"

"ret\n\t"
"save:\n\t"
". = . + 256\n\t"
".align 4096\n\t"
"alt_insn_page:\n\t"
". = . + 4096\n\t"
);


static void mk_insn_page(uint8_t *alt_insn_page,
uint8_t *alt_insn, int alt_insn_length)
{
int i, emul_offset;
for (i=1; i wrote:
> On Tue, Jun 18, 2013 at 11:56:24PM +0800, 李春奇  wrote:
>> On Tue, Jun 18, 2013 at 11:47 PM, Gleb Natapov  wrote:
>> > On Tue, Jun 18, 2013 at 10:28:59PM +0800, Ê??Ê?•Â•?  
>> > wrote:
>> >> On Tue, Jun 18, 2013 at 8:45 PM, Gleb Natapov  wrote:
>> >> > On Thu, Jun 13, 2013 at 05:30:03PM +0800, 李春奇  
>> >> > wrote:
>> >> >> Hi Gleb,
>> >> >> I'm trying to solve these problems in the past days and meet many
>> >> >> difficulties. You want to save all the general registers in calling
>> >> >> insn_page, so registers should be saved to (save) in insn_page.
>> >> >> Because all the instructions should be generated outside and copy to
>> >> >> insn_page, and the instructions generated outside is RIP-relative, so
>> >> >> inside insn_page (save) will be wrong pointed with RIP-relative code.
>> >> >>
>> >> > They do not have to be generated outside. You can write code into
>> >> > insn_page directly. Something like this outside of any functions:
>> >> >
>> >> > asm(".align 4096\n\t"
>> >> > ".global insn_page\n\t"
>> >> > ".global insn_page_end\n\t"
>> >> > ".global test_insn\n\t"
>> >> > ".global test_insn_end\n\t"
>> >> > "insn_page:"
>> >> > "mov %%rax, outregs \n\t"
>> >> > ...
>> >> > "test_insn:\n\t"
>> >> > "in (%ds), %al\n\t"
>> >> > ". = . + 31\n\t"
>> >> > "test_insn_end:\n\t"
>> >> > "mov outregs, %%rax\n\t"
>> >> > ...
>> >> > "ret\n\t"
>> >> > ".align 4096\n\t"
>> >> > "insn_page_end:\n\t");
>> >> >
>> >> > Now you copy that into alt_insn_page, put instruction you want to test
>> >> > into test_insn offset and remap alt_insn_page into "insn_page" virtual 
>> >> > address.
>> >> I used such codes:
>> >>
>> >> invlpg((void *)virt_to_phys(insn_page));
>> > virt_to_phys?
>> This is a mistake, I changed it to "invlpg(insn_page)" but the result
>> is the same.
>> >
>> >> asm volatile("call *%0" : : "r"(insn_page));
>> >> install_page(cr3, virt_to_phys(alt_insn_page), insn_page);
>> >> asm volatile("call *%0": : "r"(insn_page+1));
>> > +1?
>> Here I put "ret" on the first byte of insn_page, so the first call of
>> "insn_page" can just return, and the second call of "insn_page+1“ will
>> directly call the second byte, which is the real content of insn_page.
> Send the code.
>
> --
> Gleb.



-- 
Arthur Chunqi Li
Department of Computer Science
School of EECS
Peking University
Beijing, China
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-18 Thread Gleb Natapov
On Tue, Jun 18, 2013 at 11:56:24PM +0800, 李春奇  wrote:
> On Tue, Jun 18, 2013 at 11:47 PM, Gleb Natapov  wrote:
> > On Tue, Jun 18, 2013 at 10:28:59PM +0800, Ê??Ê?•Â•?  
> > wrote:
> >> On Tue, Jun 18, 2013 at 8:45 PM, Gleb Natapov  wrote:
> >> > On Thu, Jun 13, 2013 at 05:30:03PM +0800, 李春奇  
> >> > wrote:
> >> >> Hi Gleb,
> >> >> I'm trying to solve these problems in the past days and meet many
> >> >> difficulties. You want to save all the general registers in calling
> >> >> insn_page, so registers should be saved to (save) in insn_page.
> >> >> Because all the instructions should be generated outside and copy to
> >> >> insn_page, and the instructions generated outside is RIP-relative, so
> >> >> inside insn_page (save) will be wrong pointed with RIP-relative code.
> >> >>
> >> > They do not have to be generated outside. You can write code into
> >> > insn_page directly. Something like this outside of any functions:
> >> >
> >> > asm(".align 4096\n\t"
> >> > ".global insn_page\n\t"
> >> > ".global insn_page_end\n\t"
> >> > ".global test_insn\n\t"
> >> > ".global test_insn_end\n\t"
> >> > "insn_page:"
> >> > "mov %%rax, outregs \n\t"
> >> > ...
> >> > "test_insn:\n\t"
> >> > "in (%ds), %al\n\t"
> >> > ". = . + 31\n\t"
> >> > "test_insn_end:\n\t"
> >> > "mov outregs, %%rax\n\t"
> >> > ...
> >> > "ret\n\t"
> >> > ".align 4096\n\t"
> >> > "insn_page_end:\n\t");
> >> >
> >> > Now you copy that into alt_insn_page, put instruction you want to test
> >> > into test_insn offset and remap alt_insn_page into "insn_page" virtual 
> >> > address.
> >> I used such codes:
> >>
> >> invlpg((void *)virt_to_phys(insn_page));
> > virt_to_phys?
> This is a mistake, I changed it to "invlpg(insn_page)" but the result
> is the same.
> >
> >> asm volatile("call *%0" : : "r"(insn_page));
> >> install_page(cr3, virt_to_phys(alt_insn_page), insn_page);
> >> asm volatile("call *%0": : "r"(insn_page+1));
> > +1?
> Here I put "ret" on the first byte of insn_page, so the first call of
> "insn_page" can just return, and the second call of "insn_page+1“ will
> directly call the second byte, which is the real content of insn_page.
Send the code.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-18 Thread 李春奇
On Tue, Jun 18, 2013 at 11:47 PM, Gleb Natapov  wrote:
> On Tue, Jun 18, 2013 at 10:28:59PM +0800, Ê??Ê?•Â•?  wrote:
>> On Tue, Jun 18, 2013 at 8:45 PM, Gleb Natapov  wrote:
>> > On Thu, Jun 13, 2013 at 05:30:03PM +0800, 李春奇  
>> > wrote:
>> >> Hi Gleb,
>> >> I'm trying to solve these problems in the past days and meet many
>> >> difficulties. You want to save all the general registers in calling
>> >> insn_page, so registers should be saved to (save) in insn_page.
>> >> Because all the instructions should be generated outside and copy to
>> >> insn_page, and the instructions generated outside is RIP-relative, so
>> >> inside insn_page (save) will be wrong pointed with RIP-relative code.
>> >>
>> > They do not have to be generated outside. You can write code into
>> > insn_page directly. Something like this outside of any functions:
>> >
>> > asm(".align 4096\n\t"
>> > ".global insn_page\n\t"
>> > ".global insn_page_end\n\t"
>> > ".global test_insn\n\t"
>> > ".global test_insn_end\n\t"
>> > "insn_page:"
>> > "mov %%rax, outregs \n\t"
>> > ...
>> > "test_insn:\n\t"
>> > "in (%ds), %al\n\t"
>> > ". = . + 31\n\t"
>> > "test_insn_end:\n\t"
>> > "mov outregs, %%rax\n\t"
>> > ...
>> > "ret\n\t"
>> > ".align 4096\n\t"
>> > "insn_page_end:\n\t");
>> >
>> > Now you copy that into alt_insn_page, put instruction you want to test
>> > into test_insn offset and remap alt_insn_page into "insn_page" virtual 
>> > address.
>> I used such codes:
>>
>> invlpg((void *)virt_to_phys(insn_page));
> virt_to_phys?
This is a mistake, I changed it to "invlpg(insn_page)" but the result
is the same.
>
>> asm volatile("call *%0" : : "r"(insn_page));
>> install_page(cr3, virt_to_phys(alt_insn_page), insn_page);
>> asm volatile("call *%0": : "r"(insn_page+1));
> +1?
Here I put "ret" on the first byte of insn_page, so the first call of
"insn_page" can just return, and the second call of "insn_page+1“ will
directly call the second byte, which is the real content of insn_page.
>
>>
>> But it seems that alt_insn_page are not remapped to insn_page. Here
>> insn_page and alt_insn_page are all declared statically with
>> "asm(...)".
>>
>> Arthur
>> >
>> >> I have tried to move (save) into insn_page. But when calling
>> >> insn_page, data in it can only be read and any instructions like "xchg
>> >> %%rax, 0+%[save]" may cause error, because at this time read is from
>> >> TLB but write will cause inconsistent.
>> >>
>> >> Another way is disabling RIP-relative code, but I failed when using
>> >> "-mcmodel-large -fno-pic", the binary is also using RIP-relative mode.
>> >> Is there any way to totally disable RIP-relative code? Besides, using
>> >> this feature may specified to some newer C compiler. This may not be a
>> >> good solution.
>> >>
>> >> If we don't set %rsp and %rbp when executing emulator code, we can
>> >> just use “push/pop" to save other general registers.
>> >>
>> >> If you have any better solutions, please let me know.
>> >>
>> >> Thanks,
>> >> Arthur
>> >>
>> >> On Thu, Jun 13, 2013 at 12:50 PM, 李春奇 
>> >>  wrote:
>> >> > On Thu, Jun 13, 2013 at 4:50 AM, Paolo Bonzini  
>> >> > wrote:
>> >> >> Il 06/06/2013 11:24, Arthur Chunqi Li ha scritto:
>> >> >>> Add a function trap_emulator to run an instruction in emulator.
>> >> >>> Set inregs first (%rax is invalid because it is used as return
>> >> >>> address), put instruction codec in alt_insn and call func with
>> >> >>> alt_insn_length. Get results in outregs.
>> >> >>>
>> >> >>> Signed-off-by: Arthur Chunqi Li 
>> >> >>> ---
>> >> >>>  x86/emulator.c |   81 
>> >> >>> 
>> >> >>>  1 file changed, 81 insertions(+)
>> >> >>>
>> >> >>> diff --git a/x86/emulator.c b/x86/emulator.c
>> >> >>> index 96576e5..8ab9904 100644
>> >> >>> --- a/x86/emulator.c
>> >> >>> +++ b/x86/emulator.c
>> >> >>> @@ -11,6 +11,14 @@ int fails, tests;
>> >> >>>
>> >> >>>  static int exceptions;
>> >> >>>
>> >> >>> +struct regs {
>> >> >>> + u64 rax, rbx, rcx, rdx;
>> >> >>> + u64 rsi, rdi, rsp, rbp;
>> >> >>> + u64 rip, rflags;
>> >> >>> +};
>> >> >>> +
>> >> >>> +static struct regs inregs, outregs;
>> >> >>> +
>> >> >>>  void report(const char *name, int result)
>> >> >>>  {
>> >> >>>   ++tests;
>> >> >>> @@ -685,6 +693,79 @@ static void test_shld_shrd(u32 *mem)
>> >> >>>  report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
>> >> >>>  }
>> >> >>>
>> >> >>> +static void trap_emulator(uint64_t *mem, uint8_t *insn_page,
>> >> >>> +  uint8_t *alt_insn_page, void *insn_ram,
>> >> >>> +  uint8_t *alt_insn, int alt_insn_length)
>> >> >>> +{
>> >> >>> + ulong *cr3 = (ulong *)read_cr3();
>> >> >>> + int i;
>> >> >>> +
>> >> >>> + // Pad with RET instructions
>> >> >>> + memset(insn_page, 0xc3, 4096);
>> >> >>> + memset(alt_insn_page, 0xc3, 4096);
>> >> >>> +
>> >> >>> + // Place a tr

Re: [PATCH 2/7] KVM: arm-vgic: Set base addr through device API

2013-06-18 Thread Alexander Graf

On 18.06.2013, at 17:46, Christoffer Dall wrote:

> On Tue, Jun 18, 2013 at 03:25:04PM +0200, Alexander Graf wrote:
>> 
>> 
>> Am 11.06.2013 um 06:51 schrieb Christoffer Dall 
>> :
>> 
>>> Support setting the distributor and cpu interface base addresses in the
>>> VM physical address space through the KVM_{SET,GET}_DEVICE_ATTR API
>>> in addition the ARM specific API.
>>> 
>>> This has the added benefit of being able to share more code in user
>>> space and do things in a uniform maner.
>>> 
>>> Also deprecate the older API at the same time, but backwards
>>> compatibility will be maintained.
>>> 
>>> Signed-off-by: Christoffer Dall 
>>> ---
>>> Documentation/virtual/kvm/api.txt  |5 +-
>>> Documentation/virtual/kvm/devices/arm-vgic.txt |   11 +++
>>> arch/arm/kvm/arm.c |2 +-
>>> include/kvm/arm_vgic.h |2 +-
>>> virt/kvm/arm/vgic.c|   90 
>>> 
>>> 5 files changed, 95 insertions(+), 15 deletions(-)
>>> 
>>> diff --git a/Documentation/virtual/kvm/api.txt 
>>> b/Documentation/virtual/kvm/api.txt
>>> index 5f91eda..ea5ec4a 100644
>>> --- a/Documentation/virtual/kvm/api.txt
>>> +++ b/Documentation/virtual/kvm/api.txt
>>> @@ -2305,7 +2305,7 @@ This ioctl returns the guest registers that are 
>>> supported for the
>>> KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
>>> 
>>> 
>>> -4.80 KVM_ARM_SET_DEVICE_ADDR
>>> +4.80 KVM_ARM_SET_DEVICE_ADDR (deprecated)
>>> 
>>> Capability: KVM_CAP_ARM_SET_DEVICE_ADDR
>>> Architectures: arm
>>> @@ -2342,6 +2342,9 @@ and distributor interface, the ioctl must be called 
>>> after calling
>>> KVM_CREATE_IRQCHIP, but before calling KVM_RUN on any of the VCPUs.  Calling
>>> this ioctl twice for any of the base addresses will return -EEXIST.
>>> 
>>> +Note, this IOCTL is deprecated and the more flexible SET/GET_DEVICE_ATTR 
>>> API
>>> +should be used instead.
>>> +
>>> 4.82 KVM_PPC_RTAS_DEFINE_TOKEN
>>> 
>>> Capability: KVM_CAP_PPC_RTAS
>>> diff --git a/Documentation/virtual/kvm/devices/arm-vgic.txt 
>>> b/Documentation/virtual/kvm/devices/arm-vgic.txt
>>> index 25fd2d9..ca83ad8 100644
>>> --- a/Documentation/virtual/kvm/devices/arm-vgic.txt
>>> +++ b/Documentation/virtual/kvm/devices/arm-vgic.txt
>>> @@ -8,3 +8,14 @@ Only one VGIC instance may be instantiated through either 
>>> this API or the
>>> legacy KVM_CREATE_IRQCHIP api.  The created VGIC will act as the VM 
>>> interrupt
>>> controller, requiring emulated user-space devices to inject interrupts to 
>>> the
>>> VGIC instead of directly to CPUs.
>>> +
>>> +Groups:
>> 
>> Ah, here they are :)
>> 
>>> +  KVM_DEV_ARM_VGIC_GRP_ADDR
>>> +  Attributes:
>>> +KVM_VGIC_V2_ADDR_TYPE_DIST (rw, 64-bit)
>>> +  Base address in the guest physical address space of the GIC 
>>> distributor
>>> +  register mappings.
>>> +
>>> +KVM_VGIC_V2_ADDR_TYPE_CPU (rw, 64-bit)
>>> +  Base address in the guest physical address space of the GIC virtual 
>>> cpu
>>> +  interface register mappings.
>> 
>> Is this per-cpu or per-vgic? Can different CPUs have their gic interface 
>> maps mapped at different offsets?
>> 
> This is per-vgic.
> 
> If the _CPU part is confusing, it means it's the address of the CPU
> interface, as opposed to the Distributor interface.  Individual CPUs
> calculate their specific offset from this base address based on a mask,
> but the base is common for everyone (and banked depending on the
> accessing CPU for a certain region).

Ah, that's perfectly fine then. On the MPIC this is part of the normal address 
space, so I was merely confused on why it has a separate offset. But I guess 
you can't argue about how hardware works ;).


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-18 Thread Gleb Natapov
On Tue, Jun 18, 2013 at 10:28:59PM +0800, �??�?���?  wrote:
> On Tue, Jun 18, 2013 at 8:45 PM, Gleb Natapov  wrote:
> > On Thu, Jun 13, 2013 at 05:30:03PM +0800, 李春奇  wrote:
> >> Hi Gleb,
> >> I'm trying to solve these problems in the past days and meet many
> >> difficulties. You want to save all the general registers in calling
> >> insn_page, so registers should be saved to (save) in insn_page.
> >> Because all the instructions should be generated outside and copy to
> >> insn_page, and the instructions generated outside is RIP-relative, so
> >> inside insn_page (save) will be wrong pointed with RIP-relative code.
> >>
> > They do not have to be generated outside. You can write code into
> > insn_page directly. Something like this outside of any functions:
> >
> > asm(".align 4096\n\t"
> > ".global insn_page\n\t"
> > ".global insn_page_end\n\t"
> > ".global test_insn\n\t"
> > ".global test_insn_end\n\t"
> > "insn_page:"
> > "mov %%rax, outregs \n\t"
> > ...
> > "test_insn:\n\t"
> > "in (%ds), %al\n\t"
> > ". = . + 31\n\t"
> > "test_insn_end:\n\t"
> > "mov outregs, %%rax\n\t"
> > ...
> > "ret\n\t"
> > ".align 4096\n\t"
> > "insn_page_end:\n\t");
> >
> > Now you copy that into alt_insn_page, put instruction you want to test
> > into test_insn offset and remap alt_insn_page into "insn_page" virtual 
> > address.
> I used such codes:
> 
> invlpg((void *)virt_to_phys(insn_page));
virt_to_phys?

> asm volatile("call *%0" : : "r"(insn_page));
> install_page(cr3, virt_to_phys(alt_insn_page), insn_page);
> asm volatile("call *%0": : "r"(insn_page+1));
+1?

> 
> But it seems that alt_insn_page are not remapped to insn_page. Here
> insn_page and alt_insn_page are all declared statically with
> "asm(...)".
> 
> Arthur
> >
> >> I have tried to move (save) into insn_page. But when calling
> >> insn_page, data in it can only be read and any instructions like "xchg
> >> %%rax, 0+%[save]" may cause error, because at this time read is from
> >> TLB but write will cause inconsistent.
> >>
> >> Another way is disabling RIP-relative code, but I failed when using
> >> "-mcmodel-large -fno-pic", the binary is also using RIP-relative mode.
> >> Is there any way to totally disable RIP-relative code? Besides, using
> >> this feature may specified to some newer C compiler. This may not be a
> >> good solution.
> >>
> >> If we don't set %rsp and %rbp when executing emulator code, we can
> >> just use “push/pop" to save other general registers.
> >>
> >> If you have any better solutions, please let me know.
> >>
> >> Thanks,
> >> Arthur
> >>
> >> On Thu, Jun 13, 2013 at 12:50 PM, 李春奇 
> >>  wrote:
> >> > On Thu, Jun 13, 2013 at 4:50 AM, Paolo Bonzini  
> >> > wrote:
> >> >> Il 06/06/2013 11:24, Arthur Chunqi Li ha scritto:
> >> >>> Add a function trap_emulator to run an instruction in emulator.
> >> >>> Set inregs first (%rax is invalid because it is used as return
> >> >>> address), put instruction codec in alt_insn and call func with
> >> >>> alt_insn_length. Get results in outregs.
> >> >>>
> >> >>> Signed-off-by: Arthur Chunqi Li 
> >> >>> ---
> >> >>>  x86/emulator.c |   81 
> >> >>> 
> >> >>>  1 file changed, 81 insertions(+)
> >> >>>
> >> >>> diff --git a/x86/emulator.c b/x86/emulator.c
> >> >>> index 96576e5..8ab9904 100644
> >> >>> --- a/x86/emulator.c
> >> >>> +++ b/x86/emulator.c
> >> >>> @@ -11,6 +11,14 @@ int fails, tests;
> >> >>>
> >> >>>  static int exceptions;
> >> >>>
> >> >>> +struct regs {
> >> >>> + u64 rax, rbx, rcx, rdx;
> >> >>> + u64 rsi, rdi, rsp, rbp;
> >> >>> + u64 rip, rflags;
> >> >>> +};
> >> >>> +
> >> >>> +static struct regs inregs, outregs;
> >> >>> +
> >> >>>  void report(const char *name, int result)
> >> >>>  {
> >> >>>   ++tests;
> >> >>> @@ -685,6 +693,79 @@ static void test_shld_shrd(u32 *mem)
> >> >>>  report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
> >> >>>  }
> >> >>>
> >> >>> +static void trap_emulator(uint64_t *mem, uint8_t *insn_page,
> >> >>> +  uint8_t *alt_insn_page, void *insn_ram,
> >> >>> +  uint8_t *alt_insn, int alt_insn_length)
> >> >>> +{
> >> >>> + ulong *cr3 = (ulong *)read_cr3();
> >> >>> + int i;
> >> >>> +
> >> >>> + // Pad with RET instructions
> >> >>> + memset(insn_page, 0xc3, 4096);
> >> >>> + memset(alt_insn_page, 0xc3, 4096);
> >> >>> +
> >> >>> + // Place a trapping instruction in the page to trigger a VMEXIT
> >> >>> + insn_page[0] = 0x89; // mov %eax, (%rax)
> >> >>> + insn_page[1] = 0x00;
> >> >>> + insn_page[2] = 0x90; // nop
> >> >>> + insn_page[3] = 0xc3; // ret
> >> >>> +
> >> >>> + // Place the instruction we want the hypervisor to see in the 
> >> >>> alternate page
> >> >>> + for (i=0; i >> >>> + alt_insn_page[i] = alt_insn[i];
> >> >>> +
> >> >>> + // Save general registers
> >> >>> + asm

Re: [PATCH 2/7] KVM: arm-vgic: Set base addr through device API

2013-06-18 Thread Christoffer Dall
On Tue, Jun 18, 2013 at 03:25:04PM +0200, Alexander Graf wrote:
> 
> 
> Am 11.06.2013 um 06:51 schrieb Christoffer Dall :
> 
> > Support setting the distributor and cpu interface base addresses in the
> > VM physical address space through the KVM_{SET,GET}_DEVICE_ATTR API
> > in addition the ARM specific API.
> > 
> > This has the added benefit of being able to share more code in user
> > space and do things in a uniform maner.
> > 
> > Also deprecate the older API at the same time, but backwards
> > compatibility will be maintained.
> > 
> > Signed-off-by: Christoffer Dall 
> > ---
> > Documentation/virtual/kvm/api.txt  |5 +-
> > Documentation/virtual/kvm/devices/arm-vgic.txt |   11 +++
> > arch/arm/kvm/arm.c |2 +-
> > include/kvm/arm_vgic.h |2 +-
> > virt/kvm/arm/vgic.c|   90 
> > 
> > 5 files changed, 95 insertions(+), 15 deletions(-)
> > 
> > diff --git a/Documentation/virtual/kvm/api.txt 
> > b/Documentation/virtual/kvm/api.txt
> > index 5f91eda..ea5ec4a 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -2305,7 +2305,7 @@ This ioctl returns the guest registers that are 
> > supported for the
> > KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
> > 
> > 
> > -4.80 KVM_ARM_SET_DEVICE_ADDR
> > +4.80 KVM_ARM_SET_DEVICE_ADDR (deprecated)
> > 
> > Capability: KVM_CAP_ARM_SET_DEVICE_ADDR
> > Architectures: arm
> > @@ -2342,6 +2342,9 @@ and distributor interface, the ioctl must be called 
> > after calling
> > KVM_CREATE_IRQCHIP, but before calling KVM_RUN on any of the VCPUs.  Calling
> > this ioctl twice for any of the base addresses will return -EEXIST.
> > 
> > +Note, this IOCTL is deprecated and the more flexible SET/GET_DEVICE_ATTR 
> > API
> > +should be used instead.
> > +
> > 4.82 KVM_PPC_RTAS_DEFINE_TOKEN
> > 
> > Capability: KVM_CAP_PPC_RTAS
> > diff --git a/Documentation/virtual/kvm/devices/arm-vgic.txt 
> > b/Documentation/virtual/kvm/devices/arm-vgic.txt
> > index 25fd2d9..ca83ad8 100644
> > --- a/Documentation/virtual/kvm/devices/arm-vgic.txt
> > +++ b/Documentation/virtual/kvm/devices/arm-vgic.txt
> > @@ -8,3 +8,14 @@ Only one VGIC instance may be instantiated through either 
> > this API or the
> > legacy KVM_CREATE_IRQCHIP api.  The created VGIC will act as the VM 
> > interrupt
> > controller, requiring emulated user-space devices to inject interrupts to 
> > the
> > VGIC instead of directly to CPUs.
> > +
> > +Groups:
> 
> Ah, here they are :)
> 
> > +  KVM_DEV_ARM_VGIC_GRP_ADDR
> > +  Attributes:
> > +KVM_VGIC_V2_ADDR_TYPE_DIST (rw, 64-bit)
> > +  Base address in the guest physical address space of the GIC 
> > distributor
> > +  register mappings.
> > +
> > +KVM_VGIC_V2_ADDR_TYPE_CPU (rw, 64-bit)
> > +  Base address in the guest physical address space of the GIC virtual 
> > cpu
> > +  interface register mappings.
> 
> Is this per-cpu or per-vgic? Can different CPUs have their gic interface maps 
> mapped at different offsets?
> 
This is per-vgic.

If the _CPU part is confusing, it means it's the address of the CPU
interface, as opposed to the Distributor interface.  Individual CPUs
calculate their specific offset from this base address based on a mask,
but the base is common for everyone (and banked depending on the
accessing CPU for a certain region).

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] KVM: arm-vgic: Support KVM_CREATE_DEVICE for VGIC

2013-06-18 Thread Christoffer Dall
On Tue, Jun 18, 2013 at 03:21:38PM +0200, Alexander Graf wrote:
> 
> 
> Am 11.06.2013 um 06:51 schrieb Christoffer Dall :
> 
> > Support creating the ARM VGIC device through the KVM_CREATE_DEVICE
> > ioctl, which can then later be leveraged to use the
> > KVM_{GET/SET}_DEVICE_ATTR, which is useful both for setting addresses in
> > a more generic API than the ARM-specific one and is useful for
> > save/restore of VGIC state.
> > 
> > Adds KVM_CAP_DEVICE_CTRL to ARM capabilities.
> > 
> > Signed-off-by: Christoffer Dall 
> > ---
> > Documentation/virtual/kvm/devices/arm-vgic.txt |   10 +++
> > arch/arm/include/uapi/asm/kvm.h|8 ++
> > arch/arm/kvm/arm.c |1 +
> > include/linux/kvm_host.h   |1 +
> > include/uapi/linux/kvm.h   |1 +
> > virt/kvm/arm/vgic.c|   34 
> > 
> > virt/kvm/kvm_main.c|4 +++
> > 7 files changed, 59 insertions(+)
> > create mode 100644 Documentation/virtual/kvm/devices/arm-vgic.txt
> > 
> > diff --git a/Documentation/virtual/kvm/devices/arm-vgic.txt 
> > b/Documentation/virtual/kvm/devices/arm-vgic.txt
> > new file mode 100644
> > index 000..25fd2d9
> > --- /dev/null
> > +++ b/Documentation/virtual/kvm/devices/arm-vgic.txt
> > @@ -0,0 +1,10 @@
> > +MPIC interrupt controller
> 
> MPIC?
> 

yes, ARM MPIC, haven't heard of it?

> > +=
> > +
> > +Device types supported:
> > +  KVM_DEV_TYPE_ARM_VGICARM Generic Interrupt Controller v2.0
> > +
> > +Only one VGIC instance may be instantiated through either this API or the
> > +legacy KVM_CREATE_IRQCHIP api.  The created VGIC will act as the VM 
> > interrupt
> > +controller, requiring emulated user-space devices to inject interrupts to 
> > the
> > +VGIC instead of directly to CPUs.
> > diff --git a/arch/arm/include/uapi/asm/kvm.h 
> > b/arch/arm/include/uapi/asm/kvm.h
> > index c1ee007..587f1ae 100644
> > --- a/arch/arm/include/uapi/asm/kvm.h
> > +++ b/arch/arm/include/uapi/asm/kvm.h
> > @@ -142,6 +142,14 @@ struct kvm_arch_memory_slot {
> > #define KVM_REG_ARM_VFP_FPINST0x1009
> > #define KVM_REG_ARM_VFP_FPINST20x100A
> > 
> > +/* Device Control API: ARM VGIC */
> > +#define KVM_DEV_ARM_VGIC_GRP_ADDR0
> > +#define KVM_DEV_ARM_VGIC_GRP_DIST_REGS1
> > +#define KVM_DEV_ARM_VGIC_GRP_CPU_REGS2
> > +#define   KVM_DEV_ARM_VGIC_CPUID_SHIFT32
> > +#define   KVM_DEV_ARM_VGIC_CPUID_MASK(0xffULL << 
> > KVM_DEV_ARM_VGIC_CPUID_SHIFT)
> > +#define   KVM_DEV_ARM_VGIC_OFFSET_SHIFT0
> > +#define   KVM_DEV_ARM_VGIC_OFFSET_MASK(0xULL << 
> > KVM_DEV_ARM_VGIC_OFFSET_SHIFT)
> 
> Could you please describe the groups in the documentation too?
> 

these defines can go in the other patch

> > 
> > /* KVM_IRQ_LINE irq field index values */
> > #define KVM_ARM_IRQ_TYPE_SHIFT24
> > diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> > index 741f66a..b8e3b6e 100644
> > --- a/arch/arm/kvm/arm.c
> > +++ b/arch/arm/kvm/arm.c
> > @@ -187,6 +187,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> >case KVM_CAP_IRQCHIP:
> >r = vgic_present;
> >break;
> > +case KVM_CAP_DEVICE_CTRL:
> >case KVM_CAP_USER_MEMORY:
> >case KVM_CAP_SYNC_MMU:
> >case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index d9a3c30..e2d6556 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1086,6 +1086,7 @@ struct kvm_device *kvm_device_from_filp(struct file 
> > *filp);
> > 
> > extern struct kvm_device_ops kvm_mpic_ops;
> > extern struct kvm_device_ops kvm_xics_ops;
> > +extern struct kvm_device_ops kvm_arm_vgic_ops;
> > 
> > #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> > 
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index a5c86fc..4f2a4ab 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -839,6 +839,7 @@ struct kvm_device_attr {
> > #define KVM_DEV_TYPE_FSL_MPIC_201
> > #define KVM_DEV_TYPE_FSL_MPIC_422
> > #define KVM_DEV_TYPE_XICS3
> > +#define KVM_DEV_TYPE_ARM_VGIC4
> 
> Should this be versioned? Gicv2 is different from v3, no?
> 

yes, we can call it all XXX_VGIC_V2. It just becomes so verbose, but
it's really the right thing to do.

> Alex
> 
> > 
> > /*
> >  * ioctls for VM fds
> > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> > index 17c5ac7..b3dcd66 100644
> > --- a/virt/kvm/arm/vgic.c
> > +++ b/virt/kvm/arm/vgic.c
> > @@ -1497,3 +1497,37 @@ int kvm_vgic_set_addr(struct kvm *kvm, unsigned long 
> > type, u64 addr)
> >mutex_unlock(&kvm->lock);
> >return r;
> > }
> > +
> > +static int vgic_set_attr(struct kvm_device *dev, struct kvm_device_attr 
> > *attr)
> > +{
> > +return -ENXIO;
> > +}
> > +
> > +static int vgic_get_attr(struct kvm_device *dev, struct kvm_device_attr 
> > *at

Re: [PATCH v2 0/2] kvm: x86: Emulate MSR_PLATFORM_INFO

2013-06-18 Thread Gleb Natapov
On Tue, Jun 18, 2013 at 11:29:27AM -0400, Bandan Das wrote:
> Gleb Natapov  writes:
> 
> > On Tue, Jun 18, 2013 at 04:05:08PM +0200, Paolo Bonzini wrote:
> >> Il 05/06/2013 10:42, Gleb Natapov ha scritto:
> >> >> > These patches add an emulated MSR_PLATFORM_INFO that kvm guests
> >> >> > can read as described in section 14.3.2.4 of the Intel SDM. 
> >> >> > The relevant changes and details are in [2/2]; [1/2] makes 
> >> >> > vendor_intel
> >> >> > generic. There are atleat two known applications that fail to run 
> >> >> > because
> >> >> > of this MSR missing - Sandra and vTune.
> >> > So I really want Intel opinion on this. Right now it is impossible to
> >> > implement the MSR correctly in the face of migration (may be with tsc
> >> > scaling it will be possible) and while it is unimplemented if application
> >> > tries to use it it fails, but if we will implement it application will
> >> > just produce incorrect result without any means for user to detect it.
> >> 
> >> Jun, ping?  (Perhaps Gleb you want to ask a more specific question though).
> >> 
> >> I don't think this is much different from any other RDTSC usage in
> >> applications (they will typically do their calibration manually, and do
> >> it just once).  I'm applying it to queue.
> >> 
> > And we do not support application that uses RDTSC directly! If we could
> > catch those it would be good from support point of view, so the way
> > MSR_PLATFORM_INFO behaves now it better then proposed alternative.
> 
> If support is the issue, can't we have a flag that disables this by
> default and users who want to take the plunge (and be responsible
> for the consequences) can enable it to read platform_info ?
> 
We have it already :) ignore_msrs. If it is set unimplemented MSRs will
not inject #GP, but return zero value instead. Zero it as incorrect as
anything else in the case of migration.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/2] kvm: x86: Emulate MSR_PLATFORM_INFO

2013-06-18 Thread Bandan Das
Gleb Natapov  writes:

> On Tue, Jun 18, 2013 at 04:05:08PM +0200, Paolo Bonzini wrote:
>> Il 05/06/2013 10:42, Gleb Natapov ha scritto:
>> >> > These patches add an emulated MSR_PLATFORM_INFO that kvm guests
>> >> > can read as described in section 14.3.2.4 of the Intel SDM. 
>> >> > The relevant changes and details are in [2/2]; [1/2] makes vendor_intel
>> >> > generic. There are atleat two known applications that fail to run 
>> >> > because
>> >> > of this MSR missing - Sandra and vTune.
>> > So I really want Intel opinion on this. Right now it is impossible to
>> > implement the MSR correctly in the face of migration (may be with tsc
>> > scaling it will be possible) and while it is unimplemented if application
>> > tries to use it it fails, but if we will implement it application will
>> > just produce incorrect result without any means for user to detect it.
>> 
>> Jun, ping?  (Perhaps Gleb you want to ask a more specific question though).
>> 
>> I don't think this is much different from any other RDTSC usage in
>> applications (they will typically do their calibration manually, and do
>> it just once).  I'm applying it to queue.
>> 
> And we do not support application that uses RDTSC directly! If we could
> catch those it would be good from support point of view, so the way
> MSR_PLATFORM_INFO behaves now it better then proposed alternative.

If support is the issue, can't we have a flag that disables this by
default and users who want to take the plunge (and be responsible
for the consequences) can enable it to read platform_info ?

> --
>   Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: fix missed memory synchronization when patch hypercall

2013-06-18 Thread Gleb Natapov
On Tue, Jun 18, 2013 at 04:13:02PM +0200, Paolo Bonzini wrote:
> Il 09/06/2013 14:27, Gleb Natapov ha scritto:
> > On Sun, Jun 09, 2013 at 08:17:19PM +0800, Xiao Guangrong wrote:
> >> On 06/09/2013 07:56 PM, Gleb Natapov wrote:
> >>> On Sun, Jun 09, 2013 at 07:44:03PM +0800, Xiao Guangrong wrote:
>  On 06/09/2013 07:36 PM, Gleb Natapov wrote:
> > On Sun, Jun 09, 2013 at 07:25:17PM +0800, Xiao Guangrong wrote:
> >> On 06/09/2013 06:19 PM, Gleb Natapov wrote:
> >>> On Sun, Jun 09, 2013 at 06:01:45PM +0800, Xiao Guangrong wrote:
>  On 06/09/2013 05:39 PM, Gleb Natapov wrote:
> > On Sun, Jun 09, 2013 at 05:29:37PM +0800, Xiao Guangrong wrote:
> >> On 06/09/2013 04:45 PM, Gleb Natapov wrote:
> >>
> >>> +static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
> >>> +{
> >>> + struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
> >>> + return kvm_exec_with_stopped_vcpu(vcpu->kvm,
> >>> + emulator_fix_hypercall_cb, ctxt);
> >>> +}
> >>> +
> >>> +
> >>>  /*
> >>>   * Check if userspace requested an interrupt window, and that the
> >>>   * interrupt window is open.
> >>> @@ -5761,6 +5769,10 @@ static int vcpu_enter_guest(struct 
> >>> kvm_vcpu *vcpu)
> >>>   kvm_deliver_pmi(vcpu);
> >>>   if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))
> >>>   vcpu_scan_ioapic(vcpu);
> >>> + if (kvm_check_request(KVM_REQ_STOP_VCPU, vcpu)){
> >>> + mutex_lock(&vcpu->kvm->lock);
> >>> + mutex_unlock(&vcpu->kvm->lock);
> >>
> >> We should execute a serializing instruction here?
> >>
> >>> --- a/virt/kvm/kvm_main.c
> >>> +++ b/virt/kvm/kvm_main.c
> >>> @@ -222,6 +222,18 @@ void kvm_make_scan_ioapic_request(struct kvm 
> >>> *kvm)
> >>>   make_all_cpus_request(kvm, KVM_REQ_SCAN_IOAPIC);
> >>>  }
> >>>
> >>> +int kvm_exec_with_stopped_vcpu(struct kvm *kvm, int (*cb)(void 
> >>> *), void *data)
> >>> +{
> >>> + int r;
> >>> +
> >>> + mutex_lock(&kvm->lock);
> >>> + make_all_cpus_request(kvm, KVM_REQ_STOP_VCPU);
> >>> + r = cb(data);
> >>
> >> And here?
> > Since the serialisation instruction the SDM suggest to use is CPUID 
> > I
> > think the point here is to flush CPU pipeline. Since all vcpus are 
> > out
> > of a guest mode I think out of order execution of modified 
> > instruction
> > is no an issue here.
> 
>  I checked the SDM that it did not said VMLAUNCH/VMRESUME are the
>  serializing instructions both in VM-Entry description and Instruction
>  reference, instead it said the VMX related serializing instructions 
>  are:
>  INVEPT, INVVPID.
> 
>  So, i guess the explicit serializing instruction is needed here.
> 
> >>> Again the question is what for? SDM says:
> >>>
> >>>   The Intel 64 and IA-32 architectures define several serializing
> >>>   instructions. These instructions force the processor to complete all
> >>>   modifications to flags, registers, and memory by previous 
> >>> instructions
> >>>   and to drain all buffered writes to memory before the next 
> >>> instruction
> >>>   is fetched and executed.
> >>>
> >>> So flags and registers modifications on a host are obviously 
> >>> irrelevant for a guest.
> >>
> >> Okay. Hmm... but what can guarantee that "drain all buffered writes to 
> >> memory"?
> > Memory barrier should guaranty that as I said bellow.
> >
> >>
> >>> And for memory ordering we have smp_mb() on a guest entry.
> >>
> >> If i understand the SDM correctly, memory-ordering instructions can 
> >> not drain
> >> instruction buffer, it only drains "data memory subsystem":
> > What is "instruction buffer"?
> 
>  I mean "Instruction Cache" (icache). Can memory ordering drain icache?
>  The "data memory subsystem" confused me, does it mean dcache?
> 
> >>> I think it means all caches.
> >>> 11.6 says:
> >>>
> >>>   A write to a memory location in a code segment that is currently
> >>>   cached in the processor causes the associated cache line (or lines)
> >>>   to be invalidated. This check is based on the physical address of
> >>>   the instruction. In addition, the P6 family and Pentium processors
> >>>   check whether a write to a code segment may modify an instruction that
> >>>   has been prefetched for execution. If the write affects a prefetched
> >>>   instruction, the prefetch queue is invalidated. This latter check is
> >>>   based on the linear address of the instruction. For the Pentium 4 and
> >>>   Intel Xeon p

Re: [PATCH v2 0/2] kvm: x86: Emulate MSR_PLATFORM_INFO

2013-06-18 Thread Gleb Natapov
On Tue, Jun 18, 2013 at 04:05:08PM +0200, Paolo Bonzini wrote:
> Il 05/06/2013 10:42, Gleb Natapov ha scritto:
> >> > These patches add an emulated MSR_PLATFORM_INFO that kvm guests
> >> > can read as described in section 14.3.2.4 of the Intel SDM. 
> >> > The relevant changes and details are in [2/2]; [1/2] makes vendor_intel
> >> > generic. There are atleat two known applications that fail to run because
> >> > of this MSR missing - Sandra and vTune.
> > So I really want Intel opinion on this. Right now it is impossible to
> > implement the MSR correctly in the face of migration (may be with tsc
> > scaling it will be possible) and while it is unimplemented if application
> > tries to use it it fails, but if we will implement it application will
> > just produce incorrect result without any means for user to detect it.
> 
> Jun, ping?  (Perhaps Gleb you want to ask a more specific question though).
> 
> I don't think this is much different from any other RDTSC usage in
> applications (they will typically do their calibration manually, and do
> it just once).  I'm applying it to queue.
> 
And we do not support application that uses RDTSC directly! If we could
catch those it would be good from support point of view, so the way
MSR_PLATFORM_INFO behaves now it better then proposed alternative.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] kvm: x86: emulate MSR_PLATFORM_INFO

2013-06-18 Thread Bandan Das
Paolo Bonzini  writes:

> Il 04/06/2013 18:02, Bandan Das ha scritto:
>> +static u64 kvm_get_platform_info(struct kvm_vcpu *vcpu)
>> +{
>> +u8 cpumodel;
>> +u32 bclk;
>> +
>> +/*
>> + * Programmable Ratio Limit for Turbo Mode (bit 28): 0
>> + * Programmable TDC-TDP Limit for Turbo Mode (bit 29): 0
>> + */
>> +u64 platform_info = 0, max_nonturbo_ratio = 0, max_effi_ratio = 0;
>> +
>> +cpumodel = kvm_cpuid_get_intel_model(vcpu);
>> +
>> +switch (cpumodel) {
>> +case MODEL_NEHALEM_CLARKSFIELD:
>> +case MODEL_NEHALEM_BLOOMFIELD:
>> +case MODEL_NEHALEM_EX:
>> +case MODEL_WESTMERE_ARRANDALE:
>> +case MODEL_WESTMERE_GULFTOWN:
>> +case MODEL_WESTMERE_EX:
>> +bclk = BCLK_133_DEFAULT;
>
> Just one change: please rename this to base_clock_khz and remove the
> BCLK_*_DEFAULT constants please.

Thanks for the review! I thought the base clock is usually referred to 
as bclk in Intel parlance but not sure :) 

Anyway, I will send a new one.


> Paolo
>
>> +break;
>> +case MODEL_SANDYBRIDGE_SANDY:
>> +case MODEL_SANDYBRIDGE_E:
>> +case MODEL_IVYBRIDGE_IVY:
>> +case MODEL_HASWELL_HASWELL:
>> +bclk = BCLK_100_DEFAULT;
>> +break;
>> +default:
>> +bclk = 0;
>> +break;
>> +}
>> +
>> +if (bclk) {
>> +max_nonturbo_ratio = max_effi_ratio
>> += (u8)(vcpu->arch.virtual_tsc_khz / bclk);
>> +platform_info = (max_effi_ratio << 40)
>> +| (max_nonturbo_ratio << 8);
>> +}
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-18 Thread Alex Williamson
On Tue, 2013-06-18 at 14:38 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-06-17 at 20:32 -0600, Alex Williamson wrote:
> 
> > Right, we don't want to create dependencies across modules.  I don't
> > have a vision for how this should work.  This is effectively a complete
> > side-band to vfio, so we're really just dealing in the iommu group
> > space.  Maybe there needs to be some kind of registration of ownership
> > for the group using some kind of token.  It would need to include some
> > kind of notification when that ownership ends.  That might also be a
> > convenient tag to toggle driver probing off for devices in the group.
> > Other ideas?  Thanks,
> 
> All of that smells nasty like it will need a pile of bloody
> infrastructure which makes me think it's too complicated and not the
> right approach.
> 
> How does access control work today on x86/VFIO ? Can you give me a bit
> more details ? I didn't get a good grasp in your previous email

The current model is not x86 specific, but it only covers doing iommu
and device access through vfio.  The kink here is that we're trying to
do device access and setup through vfio, but iommu manipulation through
kvm.  We may want to revisit whether we can do the in-kernel iommu
manipulation through vfio rather than kvm.

For vfio in general, the group is the unit of ownership.  A user is
granted access to /dev/vfio/$GROUP through file permissions.  The user
opens the group and a container (/dev/vfio/vfio) and calls SET_CONTAINER
on the group.  If supported by the platform, multiple groups can be set
to the same container, which allows for iommu domain sharing.  Once a
group is associated with a container, an iommu backend can be
initialized for the container.  Only then can a device be accessed
through the group.

So even if we were to pass a vfio group file descriptor into kvm and it
matched as some kind of ownership token on the iommu group, it's not
clear that's sufficient to assume we can start programming the iommu.
Thanks,

Alex

> From the look of it, the VFIO file descriptor is what has the "access
> control" to the underlying iommu, is this right ? So we somewhat need to
> transfer (or copy) that ownership from the VFIO fd to the KVM VM.
> 
> I don't see a way to do that without some cross-layering here...
> 
> Rusty, are you aware of some kernel mechanism we can use for that ?
> 
> Cheers,
> Ben.
> 
> 



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-18 Thread 李春奇
On Tue, Jun 18, 2013 at 8:45 PM, Gleb Natapov  wrote:
> On Thu, Jun 13, 2013 at 05:30:03PM +0800, 李春奇  wrote:
>> Hi Gleb,
>> I'm trying to solve these problems in the past days and meet many
>> difficulties. You want to save all the general registers in calling
>> insn_page, so registers should be saved to (save) in insn_page.
>> Because all the instructions should be generated outside and copy to
>> insn_page, and the instructions generated outside is RIP-relative, so
>> inside insn_page (save) will be wrong pointed with RIP-relative code.
>>
> They do not have to be generated outside. You can write code into
> insn_page directly. Something like this outside of any functions:
>
> asm(".align 4096\n\t"
> ".global insn_page\n\t"
> ".global insn_page_end\n\t"
> ".global test_insn\n\t"
> ".global test_insn_end\n\t"
> "insn_page:"
> "mov %%rax, outregs \n\t"
> ...
> "test_insn:\n\t"
> "in (%ds), %al\n\t"
> ". = . + 31\n\t"
> "test_insn_end:\n\t"
> "mov outregs, %%rax\n\t"
> ...
> "ret\n\t"
> ".align 4096\n\t"
> "insn_page_end:\n\t");
>
> Now you copy that into alt_insn_page, put instruction you want to test
> into test_insn offset and remap alt_insn_page into "insn_page" virtual 
> address.
I used such codes:

invlpg((void *)virt_to_phys(insn_page));
asm volatile("call *%0" : : "r"(insn_page));
install_page(cr3, virt_to_phys(alt_insn_page), insn_page);
asm volatile("call *%0": : "r"(insn_page+1));

But it seems that alt_insn_page are not remapped to insn_page. Here
insn_page and alt_insn_page are all declared statically with
"asm(...)".

Arthur
>
>> I have tried to move (save) into insn_page. But when calling
>> insn_page, data in it can only be read and any instructions like "xchg
>> %%rax, 0+%[save]" may cause error, because at this time read is from
>> TLB but write will cause inconsistent.
>>
>> Another way is disabling RIP-relative code, but I failed when using
>> "-mcmodel-large -fno-pic", the binary is also using RIP-relative mode.
>> Is there any way to totally disable RIP-relative code? Besides, using
>> this feature may specified to some newer C compiler. This may not be a
>> good solution.
>>
>> If we don't set %rsp and %rbp when executing emulator code, we can
>> just use “push/pop" to save other general registers.
>>
>> If you have any better solutions, please let me know.
>>
>> Thanks,
>> Arthur
>>
>> On Thu, Jun 13, 2013 at 12:50 PM, 李春奇 
>>  wrote:
>> > On Thu, Jun 13, 2013 at 4:50 AM, Paolo Bonzini  wrote:
>> >> Il 06/06/2013 11:24, Arthur Chunqi Li ha scritto:
>> >>> Add a function trap_emulator to run an instruction in emulator.
>> >>> Set inregs first (%rax is invalid because it is used as return
>> >>> address), put instruction codec in alt_insn and call func with
>> >>> alt_insn_length. Get results in outregs.
>> >>>
>> >>> Signed-off-by: Arthur Chunqi Li 
>> >>> ---
>> >>>  x86/emulator.c |   81 
>> >>> 
>> >>>  1 file changed, 81 insertions(+)
>> >>>
>> >>> diff --git a/x86/emulator.c b/x86/emulator.c
>> >>> index 96576e5..8ab9904 100644
>> >>> --- a/x86/emulator.c
>> >>> +++ b/x86/emulator.c
>> >>> @@ -11,6 +11,14 @@ int fails, tests;
>> >>>
>> >>>  static int exceptions;
>> >>>
>> >>> +struct regs {
>> >>> + u64 rax, rbx, rcx, rdx;
>> >>> + u64 rsi, rdi, rsp, rbp;
>> >>> + u64 rip, rflags;
>> >>> +};
>> >>> +
>> >>> +static struct regs inregs, outregs;
>> >>> +
>> >>>  void report(const char *name, int result)
>> >>>  {
>> >>>   ++tests;
>> >>> @@ -685,6 +693,79 @@ static void test_shld_shrd(u32 *mem)
>> >>>  report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
>> >>>  }
>> >>>
>> >>> +static void trap_emulator(uint64_t *mem, uint8_t *insn_page,
>> >>> +  uint8_t *alt_insn_page, void *insn_ram,
>> >>> +  uint8_t *alt_insn, int alt_insn_length)
>> >>> +{
>> >>> + ulong *cr3 = (ulong *)read_cr3();
>> >>> + int i;
>> >>> +
>> >>> + // Pad with RET instructions
>> >>> + memset(insn_page, 0xc3, 4096);
>> >>> + memset(alt_insn_page, 0xc3, 4096);
>> >>> +
>> >>> + // Place a trapping instruction in the page to trigger a VMEXIT
>> >>> + insn_page[0] = 0x89; // mov %eax, (%rax)
>> >>> + insn_page[1] = 0x00;
>> >>> + insn_page[2] = 0x90; // nop
>> >>> + insn_page[3] = 0xc3; // ret
>> >>> +
>> >>> + // Place the instruction we want the hypervisor to see in the 
>> >>> alternate page
>> >>> + for (i=0; i> >>> + alt_insn_page[i] = alt_insn[i];
>> >>> +
>> >>> + // Save general registers
>> >>> + asm volatile(
>> >>> + "push %rax\n\r"
>> >>> + "push %rbx\n\r"
>> >>> + "push %rcx\n\r"
>> >>> + "push %rdx\n\r"
>> >>> + "push %rsi\n\r"
>> >>> + "push %rdi\n\r"
>> >>> + );
>> >>
>> >> This will not work if GCC is using rsp-relative addresses to access
>> >> l

Re: [PATCH v3 0/6] KVM: MMU: fast invalidate all mmio sptes

2013-06-18 Thread Paolo Bonzini
Il 07/06/2013 10:51, Xiao Guangrong ha scritto:
> Changelog:
> V3:
>   All of these changes are from Gleb's review:
>   1) rename RET_MMIO_PF_EMU to RET_MMIO_PF_EMULATE.
>   2) smartly adjust kvm generation number in kvm_current_mmio_generatio()
>  to avoid kvm_memslots->generation overflow.
> 
> V2:
>   - rename kvm_mmu_invalid_mmio_spte to kvm_mmu_invalid_mmio_sptes
>   - use kvm->memslots->generation as kvm global generation-number
>   - fix comment and codestyle
>   - init kvm generation close to mmio wrap-around value
>   - keep kvm_mmu_zap_mmio_sptes
> 
> The current way is holding hot mmu-lock and walking all shadow pages, this
> is not scale. This patchset tries to introduce a very simple and scale way
> to fast invalidate all mmio sptes - it need not walk any shadow pages and hold
> any locks.
> 
> The idea is simple:
> KVM maintains a global mmio valid generation-number which is stored in
> kvm->memslots.generation and every mmio spte stores the current global
> generation-number into his available bits when it is created
> 
> When KVM need zap all mmio sptes, it just simply increase the global
> generation-number. When guests do mmio access, KVM intercepts a MMIO #PF
> then it walks the shadow page table and get the mmio spte. If the
> generation-number on the spte does not equal the global generation-number,
> it will go to the normal #PF handler to update the mmio spte
> 
> Since 19 bits are used to store generation-number on mmio spte, we zap all
> mmio sptes when the number is round
> 
> Xiao Guangrong (6):
>   KVM: MMU: retain more available bits on mmio spte
>   KVM: MMU: store generation-number into mmio spte
>   KVM: MMU: make return value of mmio page fault handler more readable
>   KVM: MMU: fast invalidate all mmio sptes
>   KVM: MMU: add tracepoint for check_mmio_spte
>   KVM: MMU: init kvm generation close to mmio wrap-around value
> 
>  arch/x86/include/asm/kvm_host.h |   2 +-
>  arch/x86/kvm/mmu.c  | 131 
> 
>  arch/x86/kvm/mmu.h  |  17 ++
>  arch/x86/kvm/mmutrace.h |  34 +--
>  arch/x86/kvm/paging_tmpl.h  |  10 ++-
>  arch/x86/kvm/vmx.c  |  12 ++--
>  arch/x86/kvm/x86.c  |  11 +++-
>  7 files changed, 177 insertions(+), 40 deletions(-)
> 

Xiao, is it time to add more comments to the code or update
Documentation/virtual/kvm/mmu.txt?  Don't worry about the English, it is
more than understandable and I can help with the editing.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: fix missed memory synchronization when patch hypercall

2013-06-18 Thread Paolo Bonzini
Il 09/06/2013 14:27, Gleb Natapov ha scritto:
> On Sun, Jun 09, 2013 at 08:17:19PM +0800, Xiao Guangrong wrote:
>> On 06/09/2013 07:56 PM, Gleb Natapov wrote:
>>> On Sun, Jun 09, 2013 at 07:44:03PM +0800, Xiao Guangrong wrote:
 On 06/09/2013 07:36 PM, Gleb Natapov wrote:
> On Sun, Jun 09, 2013 at 07:25:17PM +0800, Xiao Guangrong wrote:
>> On 06/09/2013 06:19 PM, Gleb Natapov wrote:
>>> On Sun, Jun 09, 2013 at 06:01:45PM +0800, Xiao Guangrong wrote:
 On 06/09/2013 05:39 PM, Gleb Natapov wrote:
> On Sun, Jun 09, 2013 at 05:29:37PM +0800, Xiao Guangrong wrote:
>> On 06/09/2013 04:45 PM, Gleb Natapov wrote:
>>
>>> +static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
>>> +{
>>> +   struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
>>> +   return kvm_exec_with_stopped_vcpu(vcpu->kvm,
>>> +   emulator_fix_hypercall_cb, ctxt);
>>> +}
>>> +
>>> +
>>>  /*
>>>   * Check if userspace requested an interrupt window, and that the
>>>   * interrupt window is open.
>>> @@ -5761,6 +5769,10 @@ static int vcpu_enter_guest(struct kvm_vcpu 
>>> *vcpu)
>>> kvm_deliver_pmi(vcpu);
>>> if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))
>>> vcpu_scan_ioapic(vcpu);
>>> +   if (kvm_check_request(KVM_REQ_STOP_VCPU, vcpu)){
>>> +   mutex_lock(&vcpu->kvm->lock);
>>> +   mutex_unlock(&vcpu->kvm->lock);
>>
>> We should execute a serializing instruction here?
>>
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -222,6 +222,18 @@ void kvm_make_scan_ioapic_request(struct kvm 
>>> *kvm)
>>> make_all_cpus_request(kvm, KVM_REQ_SCAN_IOAPIC);
>>>  }
>>>
>>> +int kvm_exec_with_stopped_vcpu(struct kvm *kvm, int (*cb)(void *), 
>>> void *data)
>>> +{
>>> +   int r;
>>> +
>>> +   mutex_lock(&kvm->lock);
>>> +   make_all_cpus_request(kvm, KVM_REQ_STOP_VCPU);
>>> +   r = cb(data);
>>
>> And here?
> Since the serialisation instruction the SDM suggest to use is CPUID I
> think the point here is to flush CPU pipeline. Since all vcpus are out
> of a guest mode I think out of order execution of modified instruction
> is no an issue here.

 I checked the SDM that it did not said VMLAUNCH/VMRESUME are the
 serializing instructions both in VM-Entry description and Instruction
 reference, instead it said the VMX related serializing instructions 
 are:
 INVEPT, INVVPID.

 So, i guess the explicit serializing instruction is needed here.

>>> Again the question is what for? SDM says:
>>>
>>>   The Intel 64 and IA-32 architectures define several serializing
>>>   instructions. These instructions force the processor to complete all
>>>   modifications to flags, registers, and memory by previous instructions
>>>   and to drain all buffered writes to memory before the next instruction
>>>   is fetched and executed.
>>>
>>> So flags and registers modifications on a host are obviously irrelevant 
>>> for a guest.
>>
>> Okay. Hmm... but what can guarantee that "drain all buffered writes to 
>> memory"?
> Memory barrier should guaranty that as I said bellow.
>
>>
>>> And for memory ordering we have smp_mb() on a guest entry.
>>
>> If i understand the SDM correctly, memory-ordering instructions can not 
>> drain
>> instruction buffer, it only drains "data memory subsystem":
> What is "instruction buffer"?

 I mean "Instruction Cache" (icache). Can memory ordering drain icache?
 The "data memory subsystem" confused me, does it mean dcache?

>>> I think it means all caches.
>>> 11.6 says:
>>>
>>>   A write to a memory location in a code segment that is currently
>>>   cached in the processor causes the associated cache line (or lines)
>>>   to be invalidated. This check is based on the physical address of
>>>   the instruction. In addition, the P6 family and Pentium processors
>>>   check whether a write to a code segment may modify an instruction that
>>>   has been prefetched for execution. If the write affects a prefetched
>>>   instruction, the prefetch queue is invalidated. This latter check is
>>>   based on the linear address of the instruction. For the Pentium 4 and
>>>   Intel Xeon processors, a write or a snoop of an instruction in a code
>>>   segment, where the target instruction is already decoded and resident in
>>>   the trace cache, invalidates the entire trace cache. The latter behavior
>>>   means that programs that self-modify code can cause severe degradation
>>>   o

Re: [PATCH v2 0/2] kvm: x86: Emulate MSR_PLATFORM_INFO

2013-06-18 Thread Paolo Bonzini
Il 05/06/2013 10:42, Gleb Natapov ha scritto:
>> > These patches add an emulated MSR_PLATFORM_INFO that kvm guests
>> > can read as described in section 14.3.2.4 of the Intel SDM. 
>> > The relevant changes and details are in [2/2]; [1/2] makes vendor_intel
>> > generic. There are atleat two known applications that fail to run because
>> > of this MSR missing - Sandra and vTune.
> So I really want Intel opinion on this. Right now it is impossible to
> implement the MSR correctly in the face of migration (may be with tsc
> scaling it will be possible) and while it is unimplemented if application
> tries to use it it fails, but if we will implement it application will
> just produce incorrect result without any means for user to detect it.

Jun, ping?  (Perhaps Gleb you want to ask a more specific question though).

I don't think this is much different from any other RDTSC usage in
applications (they will typically do their calibration manually, and do
it just once).  I'm applying it to queue.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] kvm: x86: emulate MSR_PLATFORM_INFO

2013-06-18 Thread Paolo Bonzini
Il 04/06/2013 18:02, Bandan Das ha scritto:
> +static u64 kvm_get_platform_info(struct kvm_vcpu *vcpu)
> +{
> + u8 cpumodel;
> + u32 bclk;
> +
> + /*
> +  * Programmable Ratio Limit for Turbo Mode (bit 28): 0
> +  * Programmable TDC-TDP Limit for Turbo Mode (bit 29): 0
> +  */
> + u64 platform_info = 0, max_nonturbo_ratio = 0, max_effi_ratio = 0;
> +
> + cpumodel = kvm_cpuid_get_intel_model(vcpu);
> +
> + switch (cpumodel) {
> + case MODEL_NEHALEM_CLARKSFIELD:
> + case MODEL_NEHALEM_BLOOMFIELD:
> + case MODEL_NEHALEM_EX:
> + case MODEL_WESTMERE_ARRANDALE:
> + case MODEL_WESTMERE_GULFTOWN:
> + case MODEL_WESTMERE_EX:
> + bclk = BCLK_133_DEFAULT;

Just one change: please rename this to base_clock_khz and remove the
BCLK_*_DEFAULT constants please.

Paolo

> + break;
> + case MODEL_SANDYBRIDGE_SANDY:
> + case MODEL_SANDYBRIDGE_E:
> + case MODEL_IVYBRIDGE_IVY:
> + case MODEL_HASWELL_HASWELL:
> + bclk = BCLK_100_DEFAULT;
> + break;
> + default:
> + bclk = 0;
> + break;
> + }
> +
> + if (bclk) {
> + max_nonturbo_ratio = max_effi_ratio
> + = (u8)(vcpu->arch.virtual_tsc_khz / bclk);
> + platform_info = (max_effi_ratio << 40)
> + | (max_nonturbo_ratio << 8);
> + }

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-18 Thread 李春奇
On Tue, Jun 18, 2013 at 8:45 PM, Gleb Natapov  wrote:
> On Thu, Jun 13, 2013 at 05:30:03PM +0800, 李春奇  wrote:
>> Hi Gleb,
>> I'm trying to solve these problems in the past days and meet many
>> difficulties. You want to save all the general registers in calling
>> insn_page, so registers should be saved to (save) in insn_page.
>> Because all the instructions should be generated outside and copy to
>> insn_page, and the instructions generated outside is RIP-relative, so
>> inside insn_page (save) will be wrong pointed with RIP-relative code.
>>
> They do not have to be generated outside. You can write code into
> insn_page directly. Something like this outside of any functions:
>
> asm(".align 4096\n\t"
> ".global insn_page\n\t"
> ".global insn_page_end\n\t"
> ".global test_insn\n\t"
> ".global test_insn_end\n\t"
> "insn_page:"
> "mov %%rax, outregs \n\t"
> ...
> "test_insn:\n\t"
> "in (%ds), %al\n\t"
> ". = . + 31\n\t"
> "test_insn_end:\n\t"
> "mov outregs, %%rax\n\t"
> ...
> "ret\n\t"
> ".align 4096\n\t"
> "insn_page_end:\n\t");
>
> Now you copy that into alt_insn_page, put instruction you want to test
> into test_insn offset and remap alt_insn_page into "insn_page" virtual 
> address.
Which function can be used to remap alt_insn_page into "insn_page"
virtual address?

Arthur
>
>> I have tried to move (save) into insn_page. But when calling
>> insn_page, data in it can only be read and any instructions like "xchg
>> %%rax, 0+%[save]" may cause error, because at this time read is from
>> TLB but write will cause inconsistent.
>>
>> Another way is disabling RIP-relative code, but I failed when using
>> "-mcmodel-large -fno-pic", the binary is also using RIP-relative mode.
>> Is there any way to totally disable RIP-relative code? Besides, using
>> this feature may specified to some newer C compiler. This may not be a
>> good solution.
>>
>> If we don't set %rsp and %rbp when executing emulator code, we can
>> just use “push/pop" to save other general registers.
>>
>> If you have any better solutions, please let me know.
>>
>> Thanks,
>> Arthur
>>
>> On Thu, Jun 13, 2013 at 12:50 PM, 李春奇 
>>  wrote:
>> > On Thu, Jun 13, 2013 at 4:50 AM, Paolo Bonzini  wrote:
>> >> Il 06/06/2013 11:24, Arthur Chunqi Li ha scritto:
>> >>> Add a function trap_emulator to run an instruction in emulator.
>> >>> Set inregs first (%rax is invalid because it is used as return
>> >>> address), put instruction codec in alt_insn and call func with
>> >>> alt_insn_length. Get results in outregs.
>> >>>
>> >>> Signed-off-by: Arthur Chunqi Li 
>> >>> ---
>> >>>  x86/emulator.c |   81 
>> >>> 
>> >>>  1 file changed, 81 insertions(+)
>> >>>
>> >>> diff --git a/x86/emulator.c b/x86/emulator.c
>> >>> index 96576e5..8ab9904 100644
>> >>> --- a/x86/emulator.c
>> >>> +++ b/x86/emulator.c
>> >>> @@ -11,6 +11,14 @@ int fails, tests;
>> >>>
>> >>>  static int exceptions;
>> >>>
>> >>> +struct regs {
>> >>> + u64 rax, rbx, rcx, rdx;
>> >>> + u64 rsi, rdi, rsp, rbp;
>> >>> + u64 rip, rflags;
>> >>> +};
>> >>> +
>> >>> +static struct regs inregs, outregs;
>> >>> +
>> >>>  void report(const char *name, int result)
>> >>>  {
>> >>>   ++tests;
>> >>> @@ -685,6 +693,79 @@ static void test_shld_shrd(u32 *mem)
>> >>>  report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
>> >>>  }
>> >>>
>> >>> +static void trap_emulator(uint64_t *mem, uint8_t *insn_page,
>> >>> +  uint8_t *alt_insn_page, void *insn_ram,
>> >>> +  uint8_t *alt_insn, int alt_insn_length)
>> >>> +{
>> >>> + ulong *cr3 = (ulong *)read_cr3();
>> >>> + int i;
>> >>> +
>> >>> + // Pad with RET instructions
>> >>> + memset(insn_page, 0xc3, 4096);
>> >>> + memset(alt_insn_page, 0xc3, 4096);
>> >>> +
>> >>> + // Place a trapping instruction in the page to trigger a VMEXIT
>> >>> + insn_page[0] = 0x89; // mov %eax, (%rax)
>> >>> + insn_page[1] = 0x00;
>> >>> + insn_page[2] = 0x90; // nop
>> >>> + insn_page[3] = 0xc3; // ret
>> >>> +
>> >>> + // Place the instruction we want the hypervisor to see in the 
>> >>> alternate page
>> >>> + for (i=0; i> >>> + alt_insn_page[i] = alt_insn[i];
>> >>> +
>> >>> + // Save general registers
>> >>> + asm volatile(
>> >>> + "push %rax\n\r"
>> >>> + "push %rbx\n\r"
>> >>> + "push %rcx\n\r"
>> >>> + "push %rdx\n\r"
>> >>> + "push %rsi\n\r"
>> >>> + "push %rdi\n\r"
>> >>> + );
>> >>
>> >> This will not work if GCC is using rsp-relative addresses to access
>> >> local variables.  You need to use mov instructions to load from inregs,
>> >> and put the push/pop sequences inside the "main" asm that does the "call
>> >> *%1".
>> > Is there any way to let gcc use absolute address to access variables?
>> > I move variant "save" to th

Re: [PATCH] KVM: x86: remove vcpu's CPL check in host invoked vcpu's xcr set process

2013-06-18 Thread Paolo Bonzini
Il 18/06/2013 14:39, Zhanghaoyu (A) ha scritto:
>> Applied to master, but check your setup because I had to do so manually.
>> Your patch has 2 tabs + 1 space for each indentation level (??), the file 
>> only has one tab.
>>
>> This time the patch was pretty simple and may even be worth having in 3.9, 
>> but next time I'd ask you to resubmit.
>>
>> Can you write a testcase for kvm-unit-tests?
>>
>> Paolo
> 
> So sorry for above bad format patch, this time the format is OK, but
> I'm not sure could the format be changed in transmission?

This time it is okay.

Paolo

> 
> Signed-off-by: Zhang Haoyu 
> ---
>  arch/x86/kvm/x86.c |5 ++---
>  1 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 094b5d9..aa7bda1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -582,8 +582,6 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 
> xcr)
>   if (index != XCR_XFEATURE_ENABLED_MASK)
>   return 1;
>   xcr0 = xcr;
> - if (kvm_x86_ops->get_cpl(vcpu) != 0)
> - return 1;
>   if (!(xcr0 & XSTATE_FP))
>   return 1;
>   if ((xcr0 & XSTATE_YMM) && !(xcr0 & XSTATE_SSE))
> @@ -597,7 +595,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 
> xcr)
>  
>  int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>  {
> - if (__kvm_set_xcr(vcpu, index, xcr)) {
> + if ((kvm_x86_ops->get_cpl(vcpu) != 0) ||
> + __kvm_set_xcr(vcpu, index, xcr)) {
>   kvm_inject_gp(vcpu, 0);
>   return 1;
>   }
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/7] KVM: arm-vgic: Set base addr through device API

2013-06-18 Thread Alexander Graf


Am 11.06.2013 um 06:51 schrieb Christoffer Dall :

> Support setting the distributor and cpu interface base addresses in the
> VM physical address space through the KVM_{SET,GET}_DEVICE_ATTR API
> in addition the ARM specific API.
> 
> This has the added benefit of being able to share more code in user
> space and do things in a uniform maner.
> 
> Also deprecate the older API at the same time, but backwards
> compatibility will be maintained.
> 
> Signed-off-by: Christoffer Dall 
> ---
> Documentation/virtual/kvm/api.txt  |5 +-
> Documentation/virtual/kvm/devices/arm-vgic.txt |   11 +++
> arch/arm/kvm/arm.c |2 +-
> include/kvm/arm_vgic.h |2 +-
> virt/kvm/arm/vgic.c|   90 
> 5 files changed, 95 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 5f91eda..ea5ec4a 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2305,7 +2305,7 @@ This ioctl returns the guest registers that are 
> supported for the
> KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
> 
> 
> -4.80 KVM_ARM_SET_DEVICE_ADDR
> +4.80 KVM_ARM_SET_DEVICE_ADDR (deprecated)
> 
> Capability: KVM_CAP_ARM_SET_DEVICE_ADDR
> Architectures: arm
> @@ -2342,6 +2342,9 @@ and distributor interface, the ioctl must be called 
> after calling
> KVM_CREATE_IRQCHIP, but before calling KVM_RUN on any of the VCPUs.  Calling
> this ioctl twice for any of the base addresses will return -EEXIST.
> 
> +Note, this IOCTL is deprecated and the more flexible SET/GET_DEVICE_ATTR API
> +should be used instead.
> +
> 4.82 KVM_PPC_RTAS_DEFINE_TOKEN
> 
> Capability: KVM_CAP_PPC_RTAS
> diff --git a/Documentation/virtual/kvm/devices/arm-vgic.txt 
> b/Documentation/virtual/kvm/devices/arm-vgic.txt
> index 25fd2d9..ca83ad8 100644
> --- a/Documentation/virtual/kvm/devices/arm-vgic.txt
> +++ b/Documentation/virtual/kvm/devices/arm-vgic.txt
> @@ -8,3 +8,14 @@ Only one VGIC instance may be instantiated through either 
> this API or the
> legacy KVM_CREATE_IRQCHIP api.  The created VGIC will act as the VM interrupt
> controller, requiring emulated user-space devices to inject interrupts to the
> VGIC instead of directly to CPUs.
> +
> +Groups:

Ah, here they are :)

> +  KVM_DEV_ARM_VGIC_GRP_ADDR
> +  Attributes:
> +KVM_VGIC_V2_ADDR_TYPE_DIST (rw, 64-bit)
> +  Base address in the guest physical address space of the GIC distributor
> +  register mappings.
> +
> +KVM_VGIC_V2_ADDR_TYPE_CPU (rw, 64-bit)
> +  Base address in the guest physical address space of the GIC virtual cpu
> +  interface register mappings.

Is this per-cpu or per-vgic? Can different CPUs have their gic interface maps 
mapped at different offsets?

Alex

> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index b8e3b6e..63ddccb 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -768,7 +768,7 @@ static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
>case KVM_ARM_DEVICE_VGIC_V2:
>if (!vgic_present)
>return -ENXIO;
> -return kvm_vgic_set_addr(kvm, type, dev_addr->addr);
> +return kvm_vgic_addr(kvm, type, &dev_addr->addr, true);
>default:
>return -ENODEV;
>}
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index 343744e..5631cfa 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -144,7 +144,7 @@ struct kvm_run;
> struct kvm_exit_mmio;
> 
> #ifdef CONFIG_KVM_ARM_VGIC
> -int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr);
> +int kvm_vgic_addr(struct kvm *kvm, unsigned long type, u64 *addr, bool 
> write);
> int kvm_vgic_hyp_init(void);
> int kvm_vgic_init(struct kvm *kvm);
> int kvm_vgic_create(struct kvm *kvm);
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index b3dcd66..8e915b7 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -1457,6 +1457,12 @@ static int vgic_ioaddr_assign(struct kvm *kvm, 
> phys_addr_t *ioaddr,
> {
>int ret;
> 
> +if (addr & ~KVM_PHYS_MASK)
> +return -E2BIG;
> +
> +if (addr & (SZ_4K - 1))
> +return -EINVAL;
> +
>if (!IS_VGIC_ADDR_UNDEF(*ioaddr))
>return -EEXIST;
>if (addr + size < addr)
> @@ -1469,26 +1475,41 @@ static int vgic_ioaddr_assign(struct kvm *kvm, 
> phys_addr_t *ioaddr,
>return ret;
> }
> 
> -int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr)
> +/**
> + * kvm_vgic_addr - set or get vgic VM base addresses
> + * @kvm:   pointer to the vm struct
> + * @type:  the VGIC addr type, one of KVM_VGIC_V2_ADDR_TYPE_XXX
> + * @addr:  pointer to address value
> + * @write: if true set the address in the VM address space, if false read the
> + *  address
> + *
> + * Set or get the vgic base addresses for the distributor and the virtual CPU
> + * interface in the VM physical address space.  These addresses a

Re: [PATCH 1/7] KVM: arm-vgic: Support KVM_CREATE_DEVICE for VGIC

2013-06-18 Thread Alexander Graf


Am 11.06.2013 um 06:51 schrieb Christoffer Dall :

> Support creating the ARM VGIC device through the KVM_CREATE_DEVICE
> ioctl, which can then later be leveraged to use the
> KVM_{GET/SET}_DEVICE_ATTR, which is useful both for setting addresses in
> a more generic API than the ARM-specific one and is useful for
> save/restore of VGIC state.
> 
> Adds KVM_CAP_DEVICE_CTRL to ARM capabilities.
> 
> Signed-off-by: Christoffer Dall 
> ---
> Documentation/virtual/kvm/devices/arm-vgic.txt |   10 +++
> arch/arm/include/uapi/asm/kvm.h|8 ++
> arch/arm/kvm/arm.c |1 +
> include/linux/kvm_host.h   |1 +
> include/uapi/linux/kvm.h   |1 +
> virt/kvm/arm/vgic.c|   34 
> virt/kvm/kvm_main.c|4 +++
> 7 files changed, 59 insertions(+)
> create mode 100644 Documentation/virtual/kvm/devices/arm-vgic.txt
> 
> diff --git a/Documentation/virtual/kvm/devices/arm-vgic.txt 
> b/Documentation/virtual/kvm/devices/arm-vgic.txt
> new file mode 100644
> index 000..25fd2d9
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/arm-vgic.txt
> @@ -0,0 +1,10 @@
> +MPIC interrupt controller

MPIC?

> +=
> +
> +Device types supported:
> +  KVM_DEV_TYPE_ARM_VGICARM Generic Interrupt Controller v2.0
> +
> +Only one VGIC instance may be instantiated through either this API or the
> +legacy KVM_CREATE_IRQCHIP api.  The created VGIC will act as the VM interrupt
> +controller, requiring emulated user-space devices to inject interrupts to the
> +VGIC instead of directly to CPUs.
> diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
> index c1ee007..587f1ae 100644
> --- a/arch/arm/include/uapi/asm/kvm.h
> +++ b/arch/arm/include/uapi/asm/kvm.h
> @@ -142,6 +142,14 @@ struct kvm_arch_memory_slot {
> #define KVM_REG_ARM_VFP_FPINST0x1009
> #define KVM_REG_ARM_VFP_FPINST20x100A
> 
> +/* Device Control API: ARM VGIC */
> +#define KVM_DEV_ARM_VGIC_GRP_ADDR0
> +#define KVM_DEV_ARM_VGIC_GRP_DIST_REGS1
> +#define KVM_DEV_ARM_VGIC_GRP_CPU_REGS2
> +#define   KVM_DEV_ARM_VGIC_CPUID_SHIFT32
> +#define   KVM_DEV_ARM_VGIC_CPUID_MASK(0xffULL << 
> KVM_DEV_ARM_VGIC_CPUID_SHIFT)
> +#define   KVM_DEV_ARM_VGIC_OFFSET_SHIFT0
> +#define   KVM_DEV_ARM_VGIC_OFFSET_MASK(0xULL << 
> KVM_DEV_ARM_VGIC_OFFSET_SHIFT)

Could you please describe the groups in the documentation too?

> 
> /* KVM_IRQ_LINE irq field index values */
> #define KVM_ARM_IRQ_TYPE_SHIFT24
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 741f66a..b8e3b6e 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -187,6 +187,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>case KVM_CAP_IRQCHIP:
>r = vgic_present;
>break;
> +case KVM_CAP_DEVICE_CTRL:
>case KVM_CAP_USER_MEMORY:
>case KVM_CAP_SYNC_MMU:
>case KVM_CAP_DESTROY_MEMORY_REGION_WORKS:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d9a3c30..e2d6556 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1086,6 +1086,7 @@ struct kvm_device *kvm_device_from_filp(struct file 
> *filp);
> 
> extern struct kvm_device_ops kvm_mpic_ops;
> extern struct kvm_device_ops kvm_xics_ops;
> +extern struct kvm_device_ops kvm_arm_vgic_ops;
> 
> #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a5c86fc..4f2a4ab 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -839,6 +839,7 @@ struct kvm_device_attr {
> #define KVM_DEV_TYPE_FSL_MPIC_201
> #define KVM_DEV_TYPE_FSL_MPIC_422
> #define KVM_DEV_TYPE_XICS3
> +#define KVM_DEV_TYPE_ARM_VGIC4

Should this be versioned? Gicv2 is different from v3, no?

Alex

> 
> /*
>  * ioctls for VM fds
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 17c5ac7..b3dcd66 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -1497,3 +1497,37 @@ int kvm_vgic_set_addr(struct kvm *kvm, unsigned long 
> type, u64 addr)
>mutex_unlock(&kvm->lock);
>return r;
> }
> +
> +static int vgic_set_attr(struct kvm_device *dev, struct kvm_device_attr 
> *attr)
> +{
> +return -ENXIO;
> +}
> +
> +static int vgic_get_attr(struct kvm_device *dev, struct kvm_device_attr 
> *attr)
> +{
> +return -ENXIO;
> +}
> +
> +static int vgic_has_attr(struct kvm_device *dev, struct kvm_device_attr 
> *attr)
> +{
> +return -ENXIO;
> +}
> +
> +static void vgic_destroy(struct kvm_device *dev)
> +{
> +kfree(dev);
> +}
> +
> +static int vgic_create(struct kvm_device *dev, u32 type)
> +{
> +return kvm_vgic_create(dev->kvm);
> +}
> +
> +struct kvm_device_ops kvm_arm_vgic_ops = {
> +.name = "kvm-arm-vgic",
> +.create = vgic_create,
> +.destroy = vgic_destroy,
> +.set_attr = vgic_se

Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h

2013-06-18 Thread Gleb Natapov
On Tue, Jun 18, 2013 at 08:51:25PM +0800, Xiao Guangrong wrote:
> On 06/18/2013 06:57 PM, Gleb Natapov wrote:
> > On Mon, Jun 17, 2013 at 08:11:03PM +0800, Xiao Guangrong wrote:
> >> On 06/11/2013 07:32 PM, Gleb Natapov wrote:
> >>> On Tue, May 21, 2013 at 03:52:12PM +0800, Xiao Guangrong wrote:
>  On 05/19/2013 12:52 PM, Jun Nakajima wrote:
> > From: Nadav Har'El 
> >
> > This is the first patch in a series which adds nested EPT support to 
> > KVM's
> > nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 
> > can use
> > EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 
> > guest
> > to set its own cr3 and take its own page faults without either of L0 or 
> > L1
> > getting involved. This often significanlty improves L2's performance 
> > over the
> > previous two alternatives (shadow page tables over EPT, and shadow page
> > tables over shadow page tables).
> >
> > This patch adds EPT support to paging_tmpl.h.
> >
> > paging_tmpl.h contains the code for reading and writing page tables. 
> > The code
> > for 32-bit and 64-bit tables is very similar, but not identical, so
> > paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and 
> > once
> > with PTTYPE=64, and this generates the two sets of similar functions.
> >
> > There are subtle but important differences between the format of EPT 
> > tables
> > and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> > third set of functions to read the guest EPT table and to write the 
> > shadow
> > EPT table.
> >
> > So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions 
> > (prefixed
> > with "EPT") which correctly read and write EPT tables.
> >
> > Signed-off-by: Nadav Har'El 
> > Signed-off-by: Jun Nakajima 
> > Signed-off-by: Xinhao Xu 
> > ---
> >  arch/x86/kvm/mmu.c |  5 +
> >  arch/x86/kvm/paging_tmpl.h | 43 
> > +--
> >  2 files changed, 46 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index 117233f..6c1670f 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu 
> > *mmu, unsigned level, unsigned gp
> > return mmu->last_pte_bitmap & (1 << index);
> >  }
> >
> > +#define PTTYPE_EPT 18 /* arbitrary */
> > +#define PTTYPE PTTYPE_EPT
> > +#include "paging_tmpl.h"
> > +#undef PTTYPE
> > +
> >  #define PTTYPE 64
> >  #include "paging_tmpl.h"
> >  #undef PTTYPE
> > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> > index df34d4a..4c45654 100644
> > --- a/arch/x86/kvm/paging_tmpl.h
> > +++ b/arch/x86/kvm/paging_tmpl.h
> > @@ -50,6 +50,22 @@
> > #define PT_LEVEL_BITS PT32_LEVEL_BITS
> > #define PT_MAX_FULL_LEVELS 2
> > #define CMPXCHG cmpxchg
> > +#elif PTTYPE == PTTYPE_EPT
> > +   #define pt_element_t u64
> > +   #define guest_walker guest_walkerEPT
> > +   #define FNAME(name) EPT_##name
> > +   #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> > +   #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> > +   #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> > +   #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> > +   #define PT_LEVEL_BITS PT64_LEVEL_BITS
> > +   #ifdef CONFIG_X86_64
> > +   #define PT_MAX_FULL_LEVELS 4
> > +   #define CMPXCHG cmpxchg
> > +   #else
> > +   #define CMPXCHG cmpxchg64
> 
>  CMPXHG is only used in FNAME(cmpxchg_gpte), but you commented it later.
>  Do we really need it?
> 
> > +   #define PT_MAX_FULL_LEVELS 2
> 
>  And the SDM says:
> 
>  "It uses a page-walk length of 4, meaning that at most 4 EPT 
>  paging-structure
>  entriesare accessed to translate a guest-physical address.", Is My SDM 
>  obsolete?
>  Which kind of process supports page-walk length = 2?
> 
>  It seems your patch is not able to handle the case that the guest uses 
>  walk-lenght = 2
>  which is running on the host with walk-lenght = 4.
>  (plrease refer to how to handle sp->role.quadrant in 
>  FNAME(get_level1_sp_gpa) in
>  the current code.)
> 
> >>> But since EPT always has 4 levels on all existing cpus it is not an issue 
> >>> and the only case
> >>> that we should worry about is guest walk-lenght == host walk-lenght == 4, 
> >>> or have I
> >>
> >> Yes. I totally agree with you, but...
> >>
> >>> misunderstood what you mean here?
> >>
> >> What confused me is that this patch defines "#define PT_MAX_FULL_LEVELS 
> >> 2", so i asked the
> >> question

Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h

2013-06-18 Thread Xiao Guangrong
On 06/18/2013 06:57 PM, Gleb Natapov wrote:
> On Mon, Jun 17, 2013 at 08:11:03PM +0800, Xiao Guangrong wrote:
>> On 06/11/2013 07:32 PM, Gleb Natapov wrote:
>>> On Tue, May 21, 2013 at 03:52:12PM +0800, Xiao Guangrong wrote:
 On 05/19/2013 12:52 PM, Jun Nakajima wrote:
> From: Nadav Har'El 
>
> This is the first patch in a series which adds nested EPT support to KVM's
> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can 
> use
> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 
> guest
> to set its own cr3 and take its own page faults without either of L0 or L1
> getting involved. This often significanlty improves L2's performance over 
> the
> previous two alternatives (shadow page tables over EPT, and shadow page
> tables over shadow page tables).
>
> This patch adds EPT support to paging_tmpl.h.
>
> paging_tmpl.h contains the code for reading and writing page tables. The 
> code
> for 32-bit and 64-bit tables is very similar, but not identical, so
> paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and once
> with PTTYPE=64, and this generates the two sets of similar functions.
>
> There are subtle but important differences between the format of EPT 
> tables
> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> third set of functions to read the guest EPT table and to write the shadow
> EPT table.
>
> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions 
> (prefixed
> with "EPT") which correctly read and write EPT tables.
>
> Signed-off-by: Nadav Har'El 
> Signed-off-by: Jun Nakajima 
> Signed-off-by: Xinhao Xu 
> ---
>  arch/x86/kvm/mmu.c |  5 +
>  arch/x86/kvm/paging_tmpl.h | 43 
> +--
>  2 files changed, 46 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 117233f..6c1670f 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu 
> *mmu, unsigned level, unsigned gp
>   return mmu->last_pte_bitmap & (1 << index);
>  }
>
> +#define PTTYPE_EPT 18 /* arbitrary */
> +#define PTTYPE PTTYPE_EPT
> +#include "paging_tmpl.h"
> +#undef PTTYPE
> +
>  #define PTTYPE 64
>  #include "paging_tmpl.h"
>  #undef PTTYPE
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index df34d4a..4c45654 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -50,6 +50,22 @@
>   #define PT_LEVEL_BITS PT32_LEVEL_BITS
>   #define PT_MAX_FULL_LEVELS 2
>   #define CMPXCHG cmpxchg
> +#elif PTTYPE == PTTYPE_EPT
> + #define pt_element_t u64
> + #define guest_walker guest_walkerEPT
> + #define FNAME(name) EPT_##name
> + #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> + #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> + #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> + #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> + #define PT_LEVEL_BITS PT64_LEVEL_BITS
> + #ifdef CONFIG_X86_64
> + #define PT_MAX_FULL_LEVELS 4
> + #define CMPXCHG cmpxchg
> + #else
> + #define CMPXCHG cmpxchg64

 CMPXHG is only used in FNAME(cmpxchg_gpte), but you commented it later.
 Do we really need it?

> + #define PT_MAX_FULL_LEVELS 2

 And the SDM says:

 "It uses a page-walk length of 4, meaning that at most 4 EPT 
 paging-structure
 entriesare accessed to translate a guest-physical address.", Is My SDM 
 obsolete?
 Which kind of process supports page-walk length = 2?

 It seems your patch is not able to handle the case that the guest uses 
 walk-lenght = 2
 which is running on the host with walk-lenght = 4.
 (plrease refer to how to handle sp->role.quadrant in 
 FNAME(get_level1_sp_gpa) in
 the current code.)

>>> But since EPT always has 4 levels on all existing cpus it is not an issue 
>>> and the only case
>>> that we should worry about is guest walk-lenght == host walk-lenght == 4, 
>>> or have I
>>
>> Yes. I totally agree with you, but...
>>
>>> misunderstood what you mean here?
>>
>> What confused me is that this patch defines "#define PT_MAX_FULL_LEVELS 2", 
>> so i asked the
>> question: "Which kind of process supports page-walk length = 2".
>> Sorry, there is a typo in my origin comments. "process" should be 
>> "processor" or "CPU".
>>
> That is how I understood it, but then the discussion moved to dropping
> of nEPT support on 32-bit host. What's the connection? Even on 32bit

If EPT supports "walk-level = 2" on 32bit host (maybe it is not true), i 
thought dropping
32bit support to reduce the complex is worthwhile, 

Re: [PATCH 1/2] kvm-unit-tests: Add a func to run instruction in emulator

2013-06-18 Thread Gleb Natapov
On Thu, Jun 13, 2013 at 05:30:03PM +0800, 李春奇  wrote:
> Hi Gleb,
> I'm trying to solve these problems in the past days and meet many
> difficulties. You want to save all the general registers in calling
> insn_page, so registers should be saved to (save) in insn_page.
> Because all the instructions should be generated outside and copy to
> insn_page, and the instructions generated outside is RIP-relative, so
> inside insn_page (save) will be wrong pointed with RIP-relative code.
> 
They do not have to be generated outside. You can write code into
insn_page directly. Something like this outside of any functions:

asm(".align 4096\n\t"
".global insn_page\n\t"
".global insn_page_end\n\t"
".global test_insn\n\t"
".global test_insn_end\n\t"
"insn_page:"
"mov %%rax, outregs \n\t"
...
"test_insn:\n\t" 
"in (%ds), %al\n\t"
". = . + 31\n\t"
"test_insn_end:\n\t"
"mov outregs, %%rax\n\t"
...
"ret\n\t"
".align 4096\n\t"
"insn_page_end:\n\t");

Now you copy that into alt_insn_page, put instruction you want to test
into test_insn offset and remap alt_insn_page into "insn_page" virtual address.

> I have tried to move (save) into insn_page. But when calling
> insn_page, data in it can only be read and any instructions like "xchg
> %%rax, 0+%[save]" may cause error, because at this time read is from
> TLB but write will cause inconsistent.
> 
> Another way is disabling RIP-relative code, but I failed when using
> "-mcmodel-large -fno-pic", the binary is also using RIP-relative mode.
> Is there any way to totally disable RIP-relative code? Besides, using
> this feature may specified to some newer C compiler. This may not be a
> good solution.
> 
> If we don't set %rsp and %rbp when executing emulator code, we can
> just use “push/pop" to save other general registers.
> 
> If you have any better solutions, please let me know.
> 
> Thanks,
> Arthur
> 
> On Thu, Jun 13, 2013 at 12:50 PM, 李春奇 
>  wrote:
> > On Thu, Jun 13, 2013 at 4:50 AM, Paolo Bonzini  wrote:
> >> Il 06/06/2013 11:24, Arthur Chunqi Li ha scritto:
> >>> Add a function trap_emulator to run an instruction in emulator.
> >>> Set inregs first (%rax is invalid because it is used as return
> >>> address), put instruction codec in alt_insn and call func with
> >>> alt_insn_length. Get results in outregs.
> >>>
> >>> Signed-off-by: Arthur Chunqi Li 
> >>> ---
> >>>  x86/emulator.c |   81 
> >>> 
> >>>  1 file changed, 81 insertions(+)
> >>>
> >>> diff --git a/x86/emulator.c b/x86/emulator.c
> >>> index 96576e5..8ab9904 100644
> >>> --- a/x86/emulator.c
> >>> +++ b/x86/emulator.c
> >>> @@ -11,6 +11,14 @@ int fails, tests;
> >>>
> >>>  static int exceptions;
> >>>
> >>> +struct regs {
> >>> + u64 rax, rbx, rcx, rdx;
> >>> + u64 rsi, rdi, rsp, rbp;
> >>> + u64 rip, rflags;
> >>> +};
> >>> +
> >>> +static struct regs inregs, outregs;
> >>> +
> >>>  void report(const char *name, int result)
> >>>  {
> >>>   ++tests;
> >>> @@ -685,6 +693,79 @@ static void test_shld_shrd(u32 *mem)
> >>>  report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29)));
> >>>  }
> >>>
> >>> +static void trap_emulator(uint64_t *mem, uint8_t *insn_page,
> >>> +  uint8_t *alt_insn_page, void *insn_ram,
> >>> +  uint8_t *alt_insn, int alt_insn_length)
> >>> +{
> >>> + ulong *cr3 = (ulong *)read_cr3();
> >>> + int i;
> >>> +
> >>> + // Pad with RET instructions
> >>> + memset(insn_page, 0xc3, 4096);
> >>> + memset(alt_insn_page, 0xc3, 4096);
> >>> +
> >>> + // Place a trapping instruction in the page to trigger a VMEXIT
> >>> + insn_page[0] = 0x89; // mov %eax, (%rax)
> >>> + insn_page[1] = 0x00;
> >>> + insn_page[2] = 0x90; // nop
> >>> + insn_page[3] = 0xc3; // ret
> >>> +
> >>> + // Place the instruction we want the hypervisor to see in the 
> >>> alternate page
> >>> + for (i=0; i >>> + alt_insn_page[i] = alt_insn[i];
> >>> +
> >>> + // Save general registers
> >>> + asm volatile(
> >>> + "push %rax\n\r"
> >>> + "push %rbx\n\r"
> >>> + "push %rcx\n\r"
> >>> + "push %rdx\n\r"
> >>> + "push %rsi\n\r"
> >>> + "push %rdi\n\r"
> >>> + );
> >>
> >> This will not work if GCC is using rsp-relative addresses to access
> >> local variables.  You need to use mov instructions to load from inregs,
> >> and put the push/pop sequences inside the "main" asm that does the "call
> >> *%1".
> > Is there any way to let gcc use absolute address to access variables?
> > I move variant "save" to the global and use "xchg %%rax, 0+%[save]"
> > and it seems that addressing for "save" is wrong.
> >
> > Arthur
> >>
> >> Paolo
> >>
> >>> + // Load the code TLB with insn_page, but point the page tables at
> >>> + // alt_insn_page (and keep the data TLB clear, for AMD decode 
> >>> as

Re: [PATCH] KVM: x86: remove vcpu's CPL check in host invoked vcpu's xcr set process

2013-06-18 Thread Zhanghaoyu (A)
>> __kvm_set_xcr function does the CPL check when set xcr. __kvm_set_xcr 
>> is called in two flows, one is invoked by guest, call stack shown as 
>> below, handle_xsetbv(or xsetbv_interception)
>>   kvm_set_xcr
>> __kvm_set_xcr
>> the other one is invoked by host(QEMU), call stack shown as below, 
>> kvm_arch_vcpu_ioctl
>>   kvm_vcpu_ioctl_x86_set_xcrs
>> __kvm_set_xcr
>> 
>> The former does need the CPL check, but the latter does not.
>> 
>> Signed-off-by: Zhang Haoyu 
>> ---
>>  arch/x86/kvm/x86.c|5 +-
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 
>> 094b5d9..16ff44e 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -582,8 +582,6 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 
>> xcr)
>>   if (index != XCR_XFEATURE_ENABLED_MASK)
>>   return 1;
>>   xcr0 = xcr;
>> - if (kvm_x86_ops->get_cpl(vcpu) != 0)
>> - return 1;
>>   if (!(xcr0 & XSTATE_FP))
>>   return 1;
>>   if ((xcr0 & XSTATE_YMM) && !(xcr0 & XSTATE_SSE)) @@ -597,7 
>> +595,8 
>> @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>>  
>>  int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)  {
>> - if (__kvm_set_xcr(vcpu, index, xcr)) {
>> + if ((kvm_x86_ops->get_cpl(vcpu) != 0) ||
>> +  __kvm_set_xcr(vcpu, index, xcr)) {
>>   kvm_inject_gp(vcpu, 0);
>>   return 1;
>>   }
>> 1.7.3.1.msysgit.0
>> 
>
>Applied to master, but check your setup because I had to do so manually.
> Your patch has 2 tabs + 1 space for each indentation level (??), the file 
> only has one tab.
>
>This time the patch was pretty simple and may even be worth having in 3.9, but 
>next time I'd ask you to resubmit.
>
>Can you write a testcase for kvm-unit-tests?
>
>Paolo

So sorry for above bad format patch, this time the format is OK, but I'm not 
sure could the format be changed in transmission?

Signed-off-by: Zhang Haoyu 
---
 arch/x86/kvm/x86.c |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 094b5d9..aa7bda1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -582,8 +582,6 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
if (index != XCR_XFEATURE_ENABLED_MASK)
return 1;
xcr0 = xcr;
-   if (kvm_x86_ops->get_cpl(vcpu) != 0)
-   return 1;
if (!(xcr0 & XSTATE_FP))
return 1;
if ((xcr0 & XSTATE_YMM) && !(xcr0 & XSTATE_SSE))
@@ -597,7 +595,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 
 int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 {
-   if (__kvm_set_xcr(vcpu, index, xcr)) {
+   if ((kvm_x86_ops->get_cpl(vcpu) != 0) ||
+   __kvm_set_xcr(vcpu, index, xcr)) {
kvm_inject_gp(vcpu, 0);
return 1;
}
-- 
1.7.3.1.msysgit.0
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h

2013-06-18 Thread Gleb Natapov
On Mon, Jun 17, 2013 at 08:11:03PM +0800, Xiao Guangrong wrote:
> On 06/11/2013 07:32 PM, Gleb Natapov wrote:
> > On Tue, May 21, 2013 at 03:52:12PM +0800, Xiao Guangrong wrote:
> >> On 05/19/2013 12:52 PM, Jun Nakajima wrote:
> >>> From: Nadav Har'El 
> >>>
> >>> This is the first patch in a series which adds nested EPT support to KVM's
> >>> nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can 
> >>> use
> >>> EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 
> >>> guest
> >>> to set its own cr3 and take its own page faults without either of L0 or L1
> >>> getting involved. This often significanlty improves L2's performance over 
> >>> the
> >>> previous two alternatives (shadow page tables over EPT, and shadow page
> >>> tables over shadow page tables).
> >>>
> >>> This patch adds EPT support to paging_tmpl.h.
> >>>
> >>> paging_tmpl.h contains the code for reading and writing page tables. The 
> >>> code
> >>> for 32-bit and 64-bit tables is very similar, but not identical, so
> >>> paging_tmpl.h is #include'd twice in mmu.c, once with PTTTYPE=32 and once
> >>> with PTTYPE=64, and this generates the two sets of similar functions.
> >>>
> >>> There are subtle but important differences between the format of EPT 
> >>> tables
> >>> and that of ordinary x86 64-bit page tables, so for nested EPT we need a
> >>> third set of functions to read the guest EPT table and to write the shadow
> >>> EPT table.
> >>>
> >>> So this patch adds third PTTYPE, PTTYPE_EPT, which creates functions 
> >>> (prefixed
> >>> with "EPT") which correctly read and write EPT tables.
> >>>
> >>> Signed-off-by: Nadav Har'El 
> >>> Signed-off-by: Jun Nakajima 
> >>> Signed-off-by: Xinhao Xu 
> >>> ---
> >>>  arch/x86/kvm/mmu.c |  5 +
> >>>  arch/x86/kvm/paging_tmpl.h | 43 
> >>> +--
> >>>  2 files changed, 46 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >>> index 117233f..6c1670f 100644
> >>> --- a/arch/x86/kvm/mmu.c
> >>> +++ b/arch/x86/kvm/mmu.c
> >>> @@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu 
> >>> *mmu, unsigned level, unsigned gp
> >>>   return mmu->last_pte_bitmap & (1 << index);
> >>>  }
> >>>
> >>> +#define PTTYPE_EPT 18 /* arbitrary */
> >>> +#define PTTYPE PTTYPE_EPT
> >>> +#include "paging_tmpl.h"
> >>> +#undef PTTYPE
> >>> +
> >>>  #define PTTYPE 64
> >>>  #include "paging_tmpl.h"
> >>>  #undef PTTYPE
> >>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> >>> index df34d4a..4c45654 100644
> >>> --- a/arch/x86/kvm/paging_tmpl.h
> >>> +++ b/arch/x86/kvm/paging_tmpl.h
> >>> @@ -50,6 +50,22 @@
> >>>   #define PT_LEVEL_BITS PT32_LEVEL_BITS
> >>>   #define PT_MAX_FULL_LEVELS 2
> >>>   #define CMPXCHG cmpxchg
> >>> +#elif PTTYPE == PTTYPE_EPT
> >>> + #define pt_element_t u64
> >>> + #define guest_walker guest_walkerEPT
> >>> + #define FNAME(name) EPT_##name
> >>> + #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
> >>> + #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> >>> + #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> >>> + #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >>> + #define PT_LEVEL_BITS PT64_LEVEL_BITS
> >>> + #ifdef CONFIG_X86_64
> >>> + #define PT_MAX_FULL_LEVELS 4
> >>> + #define CMPXCHG cmpxchg
> >>> + #else
> >>> + #define CMPXCHG cmpxchg64
> >>
> >> CMPXHG is only used in FNAME(cmpxchg_gpte), but you commented it later.
> >> Do we really need it?
> >>
> >>> + #define PT_MAX_FULL_LEVELS 2
> >>
> >> And the SDM says:
> >>
> >> "It uses a page-walk length of 4, meaning that at most 4 EPT 
> >> paging-structure
> >> entriesare accessed to translate a guest-physical address.", Is My SDM 
> >> obsolete?
> >> Which kind of process supports page-walk length = 2?
> >>
> >> It seems your patch is not able to handle the case that the guest uses 
> >> walk-lenght = 2
> >> which is running on the host with walk-lenght = 4.
> >> (plrease refer to how to handle sp->role.quadrant in 
> >> FNAME(get_level1_sp_gpa) in
> >> the current code.)
> >>
> > But since EPT always has 4 levels on all existing cpus it is not an issue 
> > and the only case
> > that we should worry about is guest walk-lenght == host walk-lenght == 4, 
> > or have I
> 
> Yes. I totally agree with you, but...
> 
> > misunderstood what you mean here?
> 
> What confused me is that this patch defines "#define PT_MAX_FULL_LEVELS 2", 
> so i asked the
> question: "Which kind of process supports page-walk length = 2".
> Sorry, there is a typo in my origin comments. "process" should be "processor" 
> or "CPU".
> 
That is how I understood it, but then the discussion moved to dropping
of nEPT support on 32-bit host. What's the connection? Even on 32bit
host the walk is 4 levels. Doesn't shadow page code support 4 levels on
32bit host?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a

Re: [PATCH qom-cpu v2 02/29] kvm: Change cpu_synchronize_state() argument to CPUState

2013-06-18 Thread Igor Mammedov
On Sun, 16 Jun 2013 17:57:22 +0200
Andreas Färber  wrote:

> Change Monitor::mon_cpu to CPUState as well.
> In cpu_synchronize_all_states() use qemu_for_each_cpu() now.
> 
> Reviewed-by: liguang 
> Signed-off-by: Andreas Färber 
> ---
>  cpus.c  | 8 
>  gdbstub.c   | 8 
>  hw/i386/kvm/apic.c  | 2 +-
>  hw/i386/kvmvapic.c  | 4 ++--
>  hw/misc/vmport.c| 2 +-
>  hw/ppc/ppce500_spin.c   | 2 +-
>  include/sysemu/kvm.h| 4 ++--
>  monitor.c   | 6 +++---
>  target-i386/helper.c| 4 ++--
>  target-i386/kvm.c   | 2 +-
>  target-ppc/mmu-hash64.c | 2 +-
>  target-ppc/translate.c  | 2 +-
>  target-s390x/kvm.c  | 9 +
>  13 files changed, 28 insertions(+), 27 deletions(-)
> 
> diff --git a/cpus.c b/cpus.c
> index c232265..3260f09 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -407,10 +407,10 @@ void hw_error(const char *fmt, ...)
>  
>  void cpu_synchronize_all_states(void)
>  {
> -CPUArchState *cpu;
> +CPUArchState *env;
>  
> -for (cpu = first_cpu; cpu; cpu = cpu->next_cpu) {
> -cpu_synchronize_state(cpu);
> +for (env = first_cpu; env; env = env->next_cpu) {
> +cpu_synchronize_state(ENV_GET_CPU(env));
>  }
>  }
>  
> @@ -1219,7 +1219,7 @@ CpuInfoList *qmp_query_cpus(Error **errp)
>  CPUState *cpu = ENV_GET_CPU(env);
>  CpuInfoList *info;
>  
> -cpu_synchronize_state(env);
> +cpu_synchronize_state(cpu);
>  
>  info = g_malloc0(sizeof(*info));
>  info->value = g_malloc0(sizeof(*info->value));
> diff --git a/gdbstub.c b/gdbstub.c
> index 94c78ce..bbae06d 100644
> --- a/gdbstub.c
> +++ b/gdbstub.c
> @@ -2033,7 +2033,7 @@ static void gdb_breakpoint_remove_all(void)
>  
>  static void gdb_set_cpu_pc(GDBState *s, target_ulong pc)
>  {
> -cpu_synchronize_state(s->c_cpu);
> +cpu_synchronize_state(ENV_GET_CPU(s->c_cpu));
>  #if defined(TARGET_I386)
>  s->c_cpu->eip = pc;
>  #elif defined (TARGET_PPC)
> @@ -2232,7 +2232,7 @@ static int gdb_handle_packet(GDBState *s, const char 
> *line_buf)
>  }
>  break;
>  case 'g':
> -cpu_synchronize_state(s->g_cpu);
> +cpu_synchronize_state(ENV_GET_CPU(s->g_cpu));
>  env = s->g_cpu;
>  len = 0;
>  for (addr = 0; addr < num_g_regs; addr++) {
> @@ -2243,7 +2243,7 @@ static int gdb_handle_packet(GDBState *s, const char 
> *line_buf)
>  put_packet(s, buf);
>  break;
>  case 'G':
> -cpu_synchronize_state(s->g_cpu);
> +cpu_synchronize_state(ENV_GET_CPU(s->g_cpu));
>  env = s->g_cpu;
>  registers = mem_buf;
>  len = strlen(p) / 2;
> @@ -2411,7 +2411,7 @@ static int gdb_handle_packet(GDBState *s, const char 
> *line_buf)
>  env = find_cpu(thread);
>  if (env != NULL) {
>  CPUState *cpu = ENV_GET_CPU(env);
> -cpu_synchronize_state(env);
> +cpu_synchronize_state(cpu);
>  len = snprintf((char *)mem_buf, sizeof(mem_buf),
> "CPU#%d [%s]", cpu->cpu_index,
> cpu->halted ? "halted " : "running");
> diff --git a/hw/i386/kvm/apic.c b/hw/i386/kvm/apic.c
> index 8f80425..bd0bdd8 100644
> --- a/hw/i386/kvm/apic.c
> +++ b/hw/i386/kvm/apic.c
> @@ -129,7 +129,7 @@ static void do_inject_external_nmi(void *data)
>  uint32_t lvt;
>  int ret;
>  
> -cpu_synchronize_state(&s->cpu->env);
> +cpu_synchronize_state(cpu);
>  
>  lvt = s->lvt[APIC_LVT_LINT1];
>  if (!(lvt & APIC_LVT_MASKED) && ((lvt >> 8) & 7) == APIC_DM_NMI) {
> diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c
> index 655483b..f93629f 100644
> --- a/hw/i386/kvmvapic.c
> +++ b/hw/i386/kvmvapic.c
> @@ -456,7 +456,7 @@ void vapic_report_tpr_access(DeviceState *dev, CPUState 
> *cs, target_ulong ip,
>  X86CPU *cpu = X86_CPU(cs);
>  CPUX86State *env = &cpu->env;
>  
> -cpu_synchronize_state(env);
> +cpu_synchronize_state(cs);
>  
>  if (evaluate_tpr_instruction(s, env, &ip, access) < 0) {
>  if (s->state == VAPIC_ACTIVE) {
> @@ -627,7 +627,7 @@ static void vapic_write(void *opaque, hwaddr addr, 
> uint64_t data,
>  hwaddr rom_paddr;
>  VAPICROMState *s = opaque;
>  
> -cpu_synchronize_state(env);
> +cpu_synchronize_state(CPU(x86_env_get_cpu(env)));
why not use ENV_GET_CPU() here and in several other places below to make it
uniform?

>  
>  /*
>   * The VAPIC supports two PIO-based hypercalls, both via port 0x7E.
> diff --git a/hw/misc/vmport.c b/hw/misc/vmport.c
> index 57b71f5..8363dfd 100644
> --- a/hw/misc/vmport.c
> +++ b/hw/misc/vmport.c
> @@ -66,7 +66,7 @@ static uint64_t vmport_ioport_read(void *opaque, hwaddr 
> addr,
>  unsigned char command;
>  uint32_t eax;
>  
> -cpu_synchronize_state(env);
> +cpu_synchronize_state(CPU(x86_env_get_cpu(env)));
>  
>  eax = env->regs[R_EAX];
>  if (eax != VMPORT_MAGIC)

Re: [PATCH] kvmclock: clock should count only if vm is running

2013-06-18 Thread Paolo Bonzini
Hi Marcelo, sorry for the late review.

Il 08/06/2013 04:00, Marcelo Tosatti ha scritto:
> kvmclock should not count while vm is paused, because:
> 
> 1) if the vm is paused for long periods, timekeeping 
> math can overflow while converting the (large) clocksource 
> delta to nanoseconds.
> 
> 2) Users rely on CLOCK_MONOTONIC to count run time, that is, 
> time which OS has been in a runnable state (see CLOCK_BOOTTIME).

Do you have any ideas on how to implement CLOCK_BOOTTIME for kvmclock?
I think we need to add more fields for the delta between CLOCK_MONOTONIC
and CLOCK_BOOTTIME.

> Change kvmclock driver so as to save clock value when vm transitions
> from runnable to stopped state, and to restore clock value from stopped
> to runnable transition.
> 
> Signed-off-by: Marcelo Tosatti 
> 
> diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
> index 87d4d0f..7d2d005 100644
> --- a/hw/i386/kvm/clock.c
> +++ b/hw/i386/kvm/clock.c
> @@ -28,38 +28,6 @@ typedef struct KVMClockState {
>  bool clock_valid;
>  } KVMClockState;
>  
> -static void kvmclock_pre_save(void *opaque)
> -{
> -KVMClockState *s = opaque;
> -struct kvm_clock_data data;
> -int ret;
> -
> -if (s->clock_valid) {
> -return;
> -}
> -ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> -if (ret < 0) {
> -fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> -data.clock = 0;
> -}
> -s->clock = data.clock;
> -/*
> - * If the VM is stopped, declare the clock state valid to avoid 
> re-reading
> - * it on next vmsave (which would return a different value). Will be 
> reset
> - * when the VM is continued.
> - */
> -s->clock_valid = !runstate_is_running();
> -}
> -
> -static int kvmclock_post_load(void *opaque, int version_id)
> -{
> -KVMClockState *s = opaque;
> -struct kvm_clock_data data;
> -
> -data.clock = s->clock;
> -data.flags = 0;
> -return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
> -}
>  
>  static void kvmclock_vm_state_change(void *opaque, int running,
>   RunState state)
> @@ -70,8 +38,18 @@ static void kvmclock_vm_state_change(void *opaque, int 
> running,
>  int ret;
>  
>  if (running) {
> +struct kvm_clock_data data;
> +

Do we need an "if (!s->clock_valid) return;" here, or an assertion?  (Or
alternatively, what happens if s->clock_valid == false?)

>  s->clock_valid = false;
>  
> +data.clock = s->clock;
> +data.flags = 0;
> +ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
> +if (ret < 0) {
> +fprintf(stderr, "KVM_SET_CLOCK failed: %s\n", strerror(ret));
> +abort();
> +}
> +
>  if (!cap_clock_ctrl) {
>  return;
>  }
> @@ -84,6 +62,26 @@ static void kvmclock_vm_state_change(void *opaque, int 
> running,
>  return;
>  }
>  }
> +} else {
> +struct kvm_clock_data data;
> +int ret;
> +
> +if (s->clock_valid) {
> +return;
> +}
> +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> +if (ret < 0) {
> +fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> +abort();
> +}
> +s->clock = data.clock;
> +
> +/*
> + * If the VM is stopped, declare the clock state valid to
> + * avoid re-reading it on next vmsave (which would return
> + * a different value). Will be reset when the VM is continued.
> + */
> +s->clock_valid = !runstate_is_running();

Here we know that !runstate_is_running() is true (we're in the else
branch of "if (running)") so it can just be "s->clock_valid = true".
This matches the false assignment in the other branch.  This is just a
nit, but it makes the code clearer.

Paolo

>  }
>  }
>  
> @@ -100,8 +98,6 @@ static const VMStateDescription kvmclock_vmsd = {
>  .version_id = 1,
>  .minimum_version_id = 1,
>  .minimum_version_id_old = 1,
> -.pre_save = kvmclock_pre_save,
> -.post_load = kvmclock_post_load,
>  .fields = (VMStateField[]) {
>  VMSTATE_UINT64(clock, KVMClockState),
>  VMSTATE_END_OF_LIST()
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: remove vcpu's CPL check in host invoked vcpu's xcr set process

2013-06-18 Thread Paolo Bonzini
Il 14/06/2013 09:36, Zhanghaoyu (A) ha scritto:
> __kvm_set_xcr function does the CPL check when set xcr. __kvm_set_xcr is 
> called in two flows,
> one is invoked by guest, call stack shown as below,
> handle_xsetbv(or xsetbv_interception)
>   kvm_set_xcr
> __kvm_set_xcr
> the other one is invoked by host(QEMU), call stack shown as below,
> kvm_arch_vcpu_ioctl
>   kvm_vcpu_ioctl_x86_set_xcrs
> __kvm_set_xcr
> 
> The former does need the CPL check, but the latter does not.
> 
> Signed-off-by: Zhang Haoyu 
> ---
>  arch/x86/kvm/x86.c|5 +-
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 094b5d9..16ff44e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -582,8 +582,6 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 
> xcr)
>if (index != XCR_XFEATURE_ENABLED_MASK)
>return 1;
>xcr0 = xcr;
> -  if (kvm_x86_ops->get_cpl(vcpu) != 0)
> -  return 1;
>if (!(xcr0 & XSTATE_FP))
>return 1;
>if ((xcr0 & XSTATE_YMM) && !(xcr0 & XSTATE_SSE))
> @@ -597,7 +595,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 
> xcr)
>  
>  int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
>  {
> -  if (__kvm_set_xcr(vcpu, index, xcr)) {
> +  if ((kvm_x86_ops->get_cpl(vcpu) != 0) ||
> +   __kvm_set_xcr(vcpu, index, xcr)) {
>kvm_inject_gp(vcpu, 0);
>return 1;
>}
> 1.7.3.1.msysgit.0
> 

Applied to master, but check your setup because I had to do so manually.
 Your patch has 2 tabs + 1 space for each indentation level (??), the
file only has one tab.

This time the patch was pretty simple and may even be worth having in
3.9, but next time I'd ask you to resubmit.

Can you write a testcase for kvm-unit-tests?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html