Re: [PATCH] kvm: mmu: delay mmu audit activation

2013-11-20 Thread Gleb Natapov
On Tue, Nov 19, 2013 at 03:22:47PM -0500, Sasha Levin wrote:
> We should not be using jump labels before they are initialized. Push back
> the callback until after jump label initialization.
> 
> Signed-off-by: Sasha Levin 
Applied, thanks.

> ---
>  arch/x86/kvm/mmu_audit.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
> index daff69e..1185fe7 100644
> --- a/arch/x86/kvm/mmu_audit.c
> +++ b/arch/x86/kvm/mmu_audit.c
> @@ -296,4 +296,4 @@ static struct kernel_param_ops audit_param_ops = {
>   .get = param_get_bool,
>  };
>  
> -module_param_cb(mmu_audit, &audit_param_ops, &mmu_audit, 0644);
> +arch_param_cb(mmu_audit, &audit_param_ops, &mmu_audit, 0644);
> -- 
> 1.7.10.4

--
Gleb.
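
For context, the hazard being fixed: with module_param_cb(), a built-in
parameter's setter can run while start_kernel() parses the kernel command
line, i.e. before jump_label_init(), and the mmu_audit setter flips a static
key. arch_param_cb(), used in the hunk above, defers the callback to arch
initcall level, after jump labels are initialized. A minimal sketch of the
pattern with illustrative names (not the actual mmu_audit code):

#include <linux/jump_label.h>
#include <linux/moduleparam.h>

static bool audit_enabled;
static struct static_key audit_key = STATIC_KEY_INIT_FALSE;

static int audit_param_set(const char *val, const struct kernel_param *kp)
{
	/* parse "0"/"1" into audit_enabled first */
	int ret = param_set_bool(val, kp);

	if (ret)
		return ret;

	/* touching the static key is only safe after jump_label_init() */
	if (audit_enabled)
		static_key_slow_inc(&audit_key);
	return 0;
}

static struct kernel_param_ops audit_param_ops = {
	.set = audit_param_set,
	.get = param_get_bool,
};

/* runs at arch initcall time rather than during early command-line parsing */
arch_param_cb(audit_enabled_param, &audit_param_ops, &audit_enabled, 0644);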


Re: How to trace every memory access

2013-11-20 Thread Paolo Bonzini
On 20/11/2013 08:55, Arthur Chunqi Li wrote:
> Hi Paolo,
> 
> Currently I can trap every first write/read to a memory page from
> guest VM (add codes in tdp_page_fault). If I want to trace every
> memory access to a page, how can I achieve such goal in KVM?

You don't. :)

If you are looking for something like this, a dynamic recompilation
engine (such as QEMU's TCG) probably ends up being faster.

Paolo


Re: [PATCH RESEND v4] powerpc: kvm: fix rare but potential deadlock scene

2013-11-20 Thread Alexander Graf


> On 20.11.2013 at 03:42, Liu ping fan wrote:
> 
>> On Tue, Nov 19, 2013 at 6:39 PM, Alexander Graf  wrote:
>> 
>>> On 19.11.2013, at 07:12, Liu Ping Fan  wrote:
>>> 
>>> Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
>>> realmode, it can trigger a deadlock.
>>> 
>>> Suppose the following scenario:
>>> 
>>> Two physical CPUs, cpuM and cpuN, and two VM instances, A and B, each VM
>>> with a group of vcpus.
>>> 
>>> If, on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched
>>> out, and on cpuN, vcpu_A_2 tries to lock X in realmode, then cpuN will be
>>> stuck in realmode for a long time.
>>> 
>>> What makes things even worse is if the following happens:
>>> On cpuM, bitlock X is held; on cpuN, Y is held.
>>> vcpu_B_2 tries to lock Y on cpuM in realmode
>>> vcpu_A_2 tries to lock X on cpuN in realmode
>>> 
>>> Oops! A deadlock happens.
>>> 
>>> Signed-off-by: Liu Ping Fan 
>> 
>> Any particular reason for the resend? The patch is already applied, no?
> Oh, seems that I misunderstood your meaning. You said "Actually, I've
> changed my mind and moved the patch to the for-3.13 branch instead.
> Please make sure to CC kvm@vger on all patches you submit though". So
> I think it is necessary to resend with cc kvm@vger

Oh. I meant "for next time" :). But now that it happened for this patch already 
it's even better.

Thanks a lot!

Alex

> 
> Regards,
> Pingfan


Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

2013-11-20 Thread Xiao Guangrong

On Nov 20, 2013, at 8:29 AM, Marcelo Tosatti  wrote:

> On Wed, Aug 07, 2013 at 12:06:49PM +0800, Xiao Guangrong wrote:
>> On 08/07/2013 09:48 AM, Marcelo Tosatti wrote:
>>> On Tue, Jul 30, 2013 at 09:02:02PM +0800, Xiao Guangrong wrote:
 Make sure we can see the writable spte before the dirty bitmap is visible.
 
 We do this because kvm_vm_ioctl_get_dirty_log() write-protects the spte based
 on the dirty bitmap, so we should ensure the writable spte can be found in the
 rmap before the dirty bitmap is visible. Otherwise, we may clear the dirty
 bitmap but fail to write-protect the page.
 
 Signed-off-by: Xiao Guangrong 
 ---
 arch/x86/kvm/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
>>> 
>>> Can you explain why this is safe, with regard to the rule 
>>> at edde99ce05290e50 ?
>> 
>> BTW, this log fixed this case:
>> 
>>    VCPU 0                           KVM migration control
>> 
>>                                     write-protects all pages
>> #Pf happen then the page
>> become writable, set dirty
>> bit on the bitmap
>> 
>>                                     swap the bitmap, current bitmap is empty
>> 
>> write the page (no dirty log)
>> 
>>                                     stop the guest and push
>>                                     the remaining dirty pages
>> Stopped
>>                                     See current bitmap is empty that means
>>                                     no page is dirty.
>>> 
>>> "The rule is that all pages are either dirty in the current bitmap,
>>> or write-protected, which is violated here."
>> 
>> Actually, this rule is not completely true; there's a third case:
>> the window between writing the guest page and setting the dirty bitmap is valid.
>> In that window, the page is writable and not dirty logged.
>> 
>> This case is based on the fact that at the final step of live migration,
>> kvm should stop the guest and push the remaining dirty pages to the
>> destination.
>> 
>> There are some examples in the current code:
>> example 1, in fast_pf_fix_direct_spte():
>>  if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>>  /* The window in here... */
>>  mark_page_dirty(vcpu->kvm, gfn);
>> 
>> example 2, in kvm_write_guest_page():
>>  r = __copy_to_user((void __user *)addr + offset, data, len);
>>  if (r)
>>  return -EFAULT;
>>  	/*
>>  	 * The window is here: the page is dirty but not logged in
>>  	 * the bitmap.
>>  	 */
>>  mark_page_dirty(kvm, gfn);
>>  return 0;
> 

Hi Marcelo,

> Why is this valid ? That is, the obviously correct rule is
> 
> "that all pages are either dirty in the current bitmap,
> or write-protected, which is violated here."
> 
> With the window above, GET_DIRTY_LOG can be called 100 times while the 
> page is dirty, but the corresponding bit not set in the dirty bitmap.
> 
> It violates the documentation:
> 
> /* for KVM_GET_DIRTY_LOG */
> struct kvm_dirty_log {
>   __u32 slot;
>   __u32 padding;
>   union {
>   void __user *dirty_bitmap; /* one bit per page */
>   __u64 padding;
>   };
> };
> 
> Given a memory slot, return a bitmap containing any pages dirtied
> since the last call to this ioctl.  Bit 0 is the first page in the
> memory slot.  Ensure the entire structure is cleared to avoid padding
> issues.
> 
> The point about migration, is that GET_DIRTY_LOG is strictly correct
> because it stops vcpus.
> 
> But what guarantee does userspace require, from GET_DIRTY_LOG, while vcpus are
> executing? 

Aha. A single call to GET_DIRTY_LOG is useless since new dirty pages can be
generated while GET_DIRTY_LOG is returning. If the user wants to get the exact
dirty pages, the vcpus should be stopped.

> 
> With fast page fault:
> 
>  if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>  /* The window in here... */
>  mark_page_dirty(vcpu->kvm, gfn);
> 
> And the $SUBJECT set_spte reordering, the rule becomes
> 
> A call to GET_DIRTY_LOG guarantees to return correct information about 
> dirty pages before invocation of the previous GET_DIRTY_LOG call.
> 
> (see example 1: the next GET_DIRTY_LOG will return the dirty information
> there).
> 

It seems not.

The first GET_DIRTY_LOG can happen before the fast page fault, and the second
GET_DIRTY_LOG happens in the window between cmpxchg() and mark_page_dirty();
for the second one, the information is still “incorrect”.

> The rule for sptes that is, because kvm_write_guest does not match the
> documentation at all.

You mean the case of “kvm_write_guest” is valid (I do not know why it is)?
Or anything else?

> 
> So before example 1 and this patch, the rule (well for sptes at least) was
> 
> "Given a memory slot, return a bitmap containing any pages dirtied
> since the last call to this ioctl.  Bit 0 is the first page in the
> memory slot.  Ensure the entire structure is cleared to avoid padding
> issues."
> 
> Can you explain why it is OK to relax this rule?

It’s because:
1) it doesn’t break current use cases, i.e. live migration and FB-flushing.
2) the current code, like kvm_write_guest, has already broken the documentation
   (the guest page has been written but missed in the dirty bitmap).
3) it’s needless to implement an exact get-dirty-pages since the dirty pages
   cannot be exactly got except by stopping the vcpus.

So I think we'd document this case instead. No?
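
For reference, the ioctl being discussed is driven from userspace roughly as
follows; a hedged sketch (slot number, page size and error handling are
illustrative assumptions, not taken from the thread):

#include <linux/kvm.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

/* Fetch the dirty bitmap for one memory slot: one bit per page, reporting
 * pages dirtied since the previous KVM_GET_DIRTY_LOG call for that slot. */
static int get_dirty_bitmap(int vm_fd, __u32 slot, size_t slot_bytes,
			    unsigned long **bitmap_out)
{
	struct kvm_dirty_log log;
	size_t pages = slot_bytes / 4096;
	unsigned long *bitmap = calloc(1, (pages + 7) / 8);

	if (!bitmap)
		return -1;

	memset(&log, 0, sizeof(log));	/* clear the padding, as the doc asks */
	log.slot = slot;
	log.dirty_bitmap = bitmap;

	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0) {
		free(bitmap);
		return -1;
	}

	*bitmap_out = bitmap;
	return 0;
}

As the thread notes, while vcpus keep running the returned bitmap can already
be stale by the time the ioctl returns; only with the vcpus stopped is it exact.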

Re: Starting a VM using /dev/kvm

2013-11-20 Thread Paolo Bonzini
On 20/11/2013 00:56, Jim MacArthur wrote:
> I haven't been able to find much information on it, but by reading the
> API document and stracing qemu I've put together a small program which
> creates a VM, VCPU, and sets up some memory. All of these ioctls
> return successfully, but trying to run always returns with exit code
> 17 and suberror 1 which so far as I can tell seems to be a problem
> with page tables. I'm on an x86_64 host.
> 
> The question is, how does a new vcpu start up? Will it start in full
> 64-bit mode or 16-bit real mode?

By default it starts in 16-bit real mode, with CS=0xf000 and EIP=0xfff0,
but CS.base = 0x.  However, you can send ioctls to modify
CR0/CR4/EFER and place the VCPU in whatever mode you'd like to have.

> And will I need a full set of
> translation tables to run a single instruction or can I just point it
> at some memory and expect it to run?

Real mode doesn't need page tables of course, and neither does 32-bit
protected mode with CR0.PG=0.  However, 64-bit mode only exists with
paging (and PAE) enabled, so you need page tables to enable 64-bit mode.

Paolo
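
For readers following along, a compressed sketch of the sequence being
discussed; error handling is omitted, and the guest code, memory layout and
register values are illustrative assumptions (here the vcpu is left in real
mode but CS is pointed at a page of NOPs at guest physical address 0):

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);
	int vm = ioctl(kvm, KVM_CREATE_VM, 0);

	/* One page of guest RAM at guest physical address 0. */
	uint8_t *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
			    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	memset(mem, 0x90, 0x1000);		/* fill with NOPs */
	mem[0xff0] = 0xf4;			/* end with HLT   */

	struct kvm_userspace_memory_region region = {
		.slot = 0,
		.guest_phys_addr = 0,
		.memory_size = 0x1000,
		.userspace_addr = (uint64_t)mem,
	};
	ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

	int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
	int mmap_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
	struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
				   MAP_SHARED, vcpu, 0);

	/* Stay in real mode, but point CS:IP at our page. */
	struct kvm_sregs sregs;
	ioctl(vcpu, KVM_GET_SREGS, &sregs);
	sregs.cs.base = 0;
	sregs.cs.selector = 0;
	ioctl(vcpu, KVM_SET_SREGS, &sregs);

	struct kvm_regs regs;
	memset(&regs, 0, sizeof(regs));
	regs.rip = 0;
	regs.rflags = 2;			/* bit 1 of RFLAGS is always set */
	ioctl(vcpu, KVM_SET_REGS, &regs);

	ioctl(vcpu, KVM_RUN, 0);
	printf("exit_reason = %d\n", run->exit_reason);	/* expect KVM_EXIT_HLT */
	return 0;
}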


[PATCH] arm: KVM: Don't return PSCI_INVAL if waitqueue is inactive

2013-11-20 Thread Christoffer Dall
The current KVM implementation of PSCI returns INVALID_PARAMETERS if the
waitqueue for the corresponding CPU is not active.  This does not seem
correct, since KVM should not care what the specific thread is doing,
for example, user space may not have called KVM_RUN on this VCPU yet or
the thread may be busy looping to user space because it received a
signal; this is really up to the user space implementation.  We should
simply clear the pause flag on the CPU and wake up the thread if it
happens to be blocked for us.

Further, the implementation seems to be racy when executing multiple
VCPU threads.  There really isn't a reasonable user space programming
scheme to ensure all secondary CPUs have reached kvm_vcpu_first_run_init
before turning on the boot CPU.  It also does not make much sense to
call into the PSCI code for a CPU that is turned off - after all, it
cannot do anything if it is turned off and PSCI code could reasonably be
written with the assumption that the VCPU is actually up, in some shape
or form.

Therefore, set the pause flag on the vcpu at VCPU init time (which can
reasonably be expected to be completed for all CPUs by user space before
running any VCPUs) and clear both this flag and the feature (in case the
feature can somehow get set again in the future) and ping the waitqueue
on turning on a VCPU using PSCI.

Cc: Marc Zyngier 
Reported-by: Peter Maydell 
Signed-off-by: Christoffer Dall 
---
 arch/arm/kvm/arm.c  | 30 +++---
 arch/arm/kvm/psci.c |  6 ++
 2 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index e312e4a..1140e0e 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -478,15 +478,6 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu *vcpu)
return ret;
}
 
-   /*
-* Handle the "start in power-off" case by calling into the
-* PSCI code.
-*/
-   if (test_and_clear_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features)) {
-   *vcpu_reg(vcpu, 0) = KVM_PSCI_FN_CPU_OFF;
-   kvm_psci_call(vcpu);
-   }
-
return 0;
 }
 
@@ -700,6 +691,24 @@ int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct 
kvm_irq_level *irq_level,
return -EINVAL;
 }
 
+static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
+struct kvm_vcpu_init *init)
+{
+   int ret;
+
+   ret = kvm_vcpu_set_target(vcpu, init);
+   if (ret)
+   return ret;
+
+   /*
+* Handle the "start in power-off" case by marking the VCPU as paused.
+*/
+   if (test_and_clear_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features))
+   vcpu->arch.pause = true;
+
+   return 0;
+}
+
 long kvm_arch_vcpu_ioctl(struct file *filp,
 unsigned int ioctl, unsigned long arg)
 {
@@ -713,8 +722,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
if (copy_from_user(&init, argp, sizeof(init)))
return -EFAULT;
 
-   return kvm_vcpu_set_target(vcpu, &init);
-
+   return kvm_arch_vcpu_ioctl_vcpu_init(vcpu, &init);
}
case KVM_SET_ONE_REG:
case KVM_GET_ONE_REG: {
diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c
index 0881bf1..2e72ef5 100644
--- a/arch/arm/kvm/psci.c
+++ b/arch/arm/kvm/psci.c
@@ -59,10 +59,6 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu 
*source_vcpu)
 
target_pc = *vcpu_reg(source_vcpu, 2);
 
-   wq = kvm_arch_vcpu_wq(vcpu);
-   if (!waitqueue_active(wq))
-   return KVM_PSCI_RET_INVAL;
-
kvm_reset_vcpu(vcpu);
 
/* Gracefully handle Thumb2 entry point */
@@ -76,9 +72,11 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu 
*source_vcpu)
kvm_vcpu_set_be(vcpu);
 
*vcpu_pc(vcpu) = target_pc;
+   clear_bit(KVM_ARM_VCPU_POWER_OFF, vcpu->arch.features);
vcpu->arch.pause = false;
smp_mb();   /* Make sure the above is visible */
 
+   wq = kvm_arch_vcpu_wq(vcpu);
wake_up_interruptible(wq);
 
return KVM_PSCI_RET_SUCCESS;
-- 
1.8.4.3



Re: [PATCH for 1.7] kvm: Fix uninitialized cpuid_data

2013-11-20 Thread Stefan Weil
On 07.11.2013 12:15, Gleb Natapov wrote:
> On Wed, Nov 06, 2013 at 10:35:27PM +0100, Stefan Weil wrote:
>> This error was reported by valgrind when running qemu-system-x86_64
>> with kvm:
>>
>> ==975== Conditional jump or move depends on uninitialised value(s)
>> ==975==at 0x521C38: cpuid_find_entry (kvm.c:176)
>> ==975==by 0x5235BA: kvm_arch_init_vcpu (kvm.c:686)
>> ==975==by 0x4D5175: kvm_init_vcpu (kvm-all.c:267)
>> ==975==by 0x45035B: qemu_kvm_cpu_thread_fn (cpus.c:858)
>> ==975==by 0xD361E0D: start_thread (pthread_create.c:311)
>> ==975==by 0xD65E9EC: clone (clone.S:113)
>> ==975==  Uninitialised value was created by a stack allocation
>> ==975==at 0x5226E4: kvm_arch_init_vcpu (kvm.c:446)
>>
>> Instead of adding more memset calls for parts of cpuid_data, the existing
>> calls were removed and cpuid_data is now initialized completely in one
>> call.
>>
>> Signed-off-by: Stefan Weil 
> Applied, thanks.


Ping. This bug fix for KVM is still missing in QEMU 1.7.

Regards,
Stefan



Re: [PATCH] arm: KVM: Don't return PSCI_INVAL if waitqueue is inactive

2013-11-20 Thread Peter Maydell
On 20 November 2013 18:51, Christoffer Dall  wrote:
> Therefore, set the pause flag on the vcpu at VCPU init time (which can
> reasonably be expected to be completed for all CPUs by user space before
> running any VCPUs) and clear both this flag and the feature (in case the
> feature can somehow get set again in the future) and ping the waitqueue
> on turning on a VCPU using PSCI.

Tangential, but your phrasing prompted me to ask: how does
the "start in PSCI power-off" boot protocol work for system reset?
Since the kernel doesn't currently provide a "reset this vCPU"
ioctl userspace has to do reset manually[*]; how do we say "take
this vCPU which has started up and run once, and put it back
into PSCI power-off" ?

[*] this is pretty tedious, since it involves reading every CPU
register on the vCPU before first run in order to feed the kernel
back a bunch of info it already knows about the reset state of
a vCPU.

thanks
-- PMM


Re: [PATCH] arm: KVM: Don't return PSCI_INVAL if waitqueue is inactive

2013-11-20 Thread Christoffer Dall
On Wed, Nov 20, 2013 at 07:12:42PM +, Peter Maydell wrote:
> On 20 November 2013 18:51, Christoffer Dall  
> wrote:
> > Therefore, set the pause flag on the vcpu at VCPU init time (which can
> > reasonably be expected to be completed for all CPUs by user space before
> > running any VCPUs) and clear both this flag and the feature (in case the
> > feature can somehow get set again in the future) and ping the waitqueue
> > on turning on a VCPU using PSCI.
> 
> Tangential, but your phrasing prompted me to ask: how does
> the "start in PSCI power-off" boot protocol work for system reset?
> Since the kernel doesn't currently provide a "reset this vCPU"
> ioctl userspace has to do reset manually[*]; how do we say "take
> this vCPU which has started up and run once, and put it back
> into PSCI power-off" ?
> 
> [*] this is pretty tedious, since it involves reading every CPU
> register on the vCPU before first run in order to feed the kernel
> back a bunch of info it already knows about the reset state of
> a vCPU.
> 
So, from looking at the code and the API specification, calling
KVM_ARM_VCPU_INIT does exactly this and also resets all core and cp15
registers for you - here you would be able to set the power-off flag and
pause those CPUs so PSCI can wake them up again.

Am I missing something here?

This makes me wonder if it's worth adding to
Documentation/virtual/kvm/api.txt that KVM_ARM_VCPU_INIT should be
called on all VCPUs before running any of the VCPUs...

-Christoffer
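
For concreteness, a hedged userspace sketch of the usage being discussed (the
CPU target is an illustrative assumption, and whether a second call is a
sanctioned reset path is exactly the open question in this thread):

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/* Initialize (or re-initialize) an ARM vcpu.  With the patch above, setting
 * KVM_ARM_VCPU_POWER_OFF leaves the vcpu paused until another vcpu turns it
 * on via PSCI CPU_ON; calling this again later would act as a reset. */
static int arm_vcpu_init(int vcpu_fd, int powered_off)
{
	struct kvm_vcpu_init init;

	memset(&init, 0, sizeof(init));
	init.target = KVM_ARM_TARGET_CORTEX_A15;	/* illustrative target */
	if (powered_off)
		init.features[0] |= 1 << KVM_ARM_VCPU_POWER_OFF;

	return ioctl(vcpu_fd, KVM_ARM_VCPU_INIT, &init);
}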



Re: [PATCH] arm: KVM: Don't return PSCI_INVAL if waitqueue is inactive

2013-11-20 Thread Peter Maydell
On 20 November 2013 19:21, Christoffer Dall  wrote:
> On Wed, Nov 20, 2013 at 07:12:42PM +, Peter Maydell wrote:
>> Tangential, but your phrasing prompted me to ask: how does
>> the "start in PSCI power-off" boot protocol work for system reset?
>> Since the kernel doesn't currently provide a "reset this vCPU"
>> ioctl userspace has to do reset manually[*]; how do we say "take
>> this vCPU which has started up and run once, and put it back
>> into PSCI power-off" ?
>>
>> [*] this is pretty tedious, since it involves reading every CPU
>> register on the vCPU before first run in order to feed the kernel
>> back a bunch of info it already knows about the reset state of
>> a vCPU.
>>
> So, from looking at the code and the API specification calling
> KVM_ARM_VCPU_INIT does exactly this and also resets all core and cp15
> registers for you - you would here be able to set the power-off flag and
> pause those CPUs so PSCI can wake them up again.
>
> Am I missing something here?

Well, we call KVM_ARM_VCPU_INIT as part of our CPU init,
but I didn't think you were allowed to call it a second time
later on to do a CPU reset (ie as part of system reset when
the guest asks for a system reset as part of 'shutdown -r now').
Is that valid? If so, it would probably be good to specifically document it...

-- PMM


Re: [PATCH] arm: KVM: Don't return PSCI_INVAL if waitqueue is inactive

2013-11-20 Thread Christoffer Dall
On Wed, Nov 20, 2013 at 07:27:52PM +, Peter Maydell wrote:
> On 20 November 2013 19:21, Christoffer Dall  
> wrote:
> > On Wed, Nov 20, 2013 at 07:12:42PM +, Peter Maydell wrote:
> >> Tangential, but your phrasing prompted me to ask: how does
> >> the "start in PSCI power-off" boot protocol work for system reset?
> >> Since the kernel doesn't currently provide a "reset this vCPU"
> >> ioctl userspace has to do reset manually[*]; how do we say "take
> >> this vCPU which has started up and run once, and put it back
> >> into PSCI power-off" ?
> >>
> >> [*] this is pretty tedious, since it involves reading every CPU
> >> register on the vCPU before first run in order to feed the kernel
> >> back a bunch of info it already knows about the reset state of
> >> a vCPU.
> >>
> > So, from looking at the code and the API specification calling
> > KVM_ARM_VCPU_INIT does exactly this and also resets all core and cp15
> > registers for you - you would here be able to set the power-off flag and
> > pause those CPUs so PSCI can wake them up again.
> >
> > Am I missing something here?
> 
> Well, we call KVM_ARM_VCPU_INIT as part of our CPU init,
> but I didn't think you were allowed to call it a second time
> later on to do a CPU reset (ie as part of system reset when
> the guest asks for a system reset as part of 'shutdown -r now').
> Is that valid? If so, it would probably be good to specifically document it...
> 
I don't see anything technically preventing you from doing so nor
anything in the documentation telling you that you shouldn't be doing
it.

However, I agree that we haven't thought about this ioctl in that way so
far, but we can start doing that.  If nobody objects I can add a patch
that clarifies this in the api.txt documentation; I don't believe this
would break or change the API, it's merely clarifying the
recommendations on how user space can leverage the functionality.

-Christoffer


guest assigned device MMIO maps with WC: does this work correctly?

2013-11-20 Thread Michael S. Tsirkin
I see this in kvm:

static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool
is_mmio)
{
u64 ret;
 
/* For VT-d and EPT combination
 * 1. MMIO: always map as UC
 * 2. EPT with VT-d:
 *   a. VT-d without snooping control feature: can't guarantee the
 *      result, try to trust guest.
 *   b. VT-d with snooping control feature: snooping control feature of
 *      VT-d engine can guarantee the cache correctness. Just set it
 *      to WB to keep consistent with host. So the same as item 3.
 * 3. EPT without VT-d: always map as WB and set IPAT=1 to keep
 *    consistent with host MTRR
 */
if (is_mmio)
ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

...
}


Does this mean that even if the guest maps a BAR for an assigned device
as write-combined (or configures such using an MTRR),
the host will override this and use uncacheable in practice?

I think there are some drivers that map MMIO as write-combined
for performance (but I don't have such hardware), so
I'd like to figure out from the code whether this will work correctly.

-- 
MST


Re: [PATCH v3] KVM: PPC: vfio kvm device: support spapr tce

2013-11-20 Thread Alex Williamson
On Wed, 2013-11-20 at 16:18 +1100, Alexey Kardashevskiy wrote:
> In addition to the external VFIO user API, a VFIO KVM device
> has been introduced recently.
> 
> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
> identifier. LIOBNs are made up and linked to IOMMU groups by the user
> space. In order to accelerate IOMMU operations in the KVM, we need
> to tell KVM the information about LIOBN-to-group mapping.
> 
> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
> is added. It accepts a pair of a VFIO group fd and LIOBN.
> 
> This also adds a new kvm_vfio_find_group_by_liobn() function which
> receives a kvm struct, a LIOBN and a callback. As it increases the IOMMU
> group use counter, KVM is required to pass a callback which is called
> when the VFIO group is about to be removed from VFIO-KVM tracking, so
> that KVM is able to call iommu_group_put() to release the IOMMU group.
> 
> The KVM uses kvm_vfio_find_group_by_liobn() once per KVM run and caches
> the result in kvm_arch. iommu_group_put() for all groups will be called
> when KVM finishes (in the SPAPR TCE in KVM enablement patch).
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v3:
> * total rework
> * added a release callback into kvm_vfio_find_group_by_liobn so now
> the user of the API can get a notification if the group is about to
> disappear
> ---
>  Documentation/virtual/kvm/devices/vfio.txt |  19 -
>  arch/powerpc/kvm/Kconfig   |   1 +
>  arch/powerpc/kvm/Makefile  |   3 +
>  include/linux/kvm_host.h   |  18 +
>  include/uapi/linux/kvm.h   |   7 ++
>  virt/kvm/vfio.c| 116 
> -
>  6 files changed, 161 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/devices/vfio.txt 
> b/Documentation/virtual/kvm/devices/vfio.txt
> index ef51740..7ecb3b2 100644
> --- a/Documentation/virtual/kvm/devices/vfio.txt
> +++ b/Documentation/virtual/kvm/devices/vfio.txt
> @@ -16,7 +16,22 @@ Groups:
>  
>  KVM_DEV_VFIO_GROUP attributes:
>KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> + kvm_device_attr.addr points to an int32_t file descriptor
> + for the VFIO group.
> +
>KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> + kvm_device_attr.addr points to an int32_t file descriptor
> + for the VFIO group.
> +
> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
> + kvm_device_attr.addr points to a struct:
> + struct kvm_vfio_spapr_tce_liobn {
> + __u32   argsz;
> + __u32   fd;

fds are signed, __s32

> + __u32   liobn;
> + };
> + where
> + @argsz is a struct size;
> + @fd is a file descriptor for a VFIO group;
> + @liobn is a logical bus id to be associated with the group.
>  
> -For each, kvm_device_attr.addr points to an int32_t file descriptor
> -for the VFIO group.
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 61b3535..d1b7f64 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -60,6 +60,7 @@ config KVM_BOOK3S_64
>   select KVM_BOOK3S_64_HANDLER
>   select KVM
>   select SPAPR_TCE_IOMMU
> + select KVM_VFIO
>   ---help---
> Support running unmodified book3s_64 and book3s_32 guest kernels
> in virtual machines on book3s_64 host processors.
> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index 6646c95..2438d2e 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -87,6 +87,9 @@ kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HV) := \
>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>   book3s_xics.o
>  
> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
> + $(KVM)/vfio.o \
> +
>  kvm-book3s_64-module-objs := \
>   $(KVM)/kvm_main.o \
>   $(KVM)/eventfd.o \
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 88ff96a..1d2ad5e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1112,5 +1112,23 @@ static inline bool 
> kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
>  }
>  
>  #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
> +
> +typedef void (*kvm_vfio_release_group_callback)(struct kvm *kvm,
> + unsigned long liobn);

liobn was said to be __u32 in kvm_vfio_spapr_tce_liobn above, here it's
unsigned long?

> +
> +#if defined(CONFIG_KVM_VFIO) && defined(CONFIG_SPAPR_TCE_IOMMU)
> +
> +extern struct iommu_group *kvm_vfio_find_group_by_liobn(struct kvm *kvm,
> + unsigned long liobn, kvm_vfio_release_group_callback cb);
> +
> +#else
> +
> +static inline struct iommu_group *kvm_vfio_find_group_by_liobn(struct kvm 
> *kvm,
> + unsigned long liobn, ikvm
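
For completeness, the proposed attribute would presumably be driven from
userspace through the generic KVM device API, roughly as sketched below.
KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN and struct kvm_vfio_spapr_tce_liobn
exist only with this patch applied, and the field values are illustrative:

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Create the VFIO KVM device and associate a VFIO group fd with a LIOBN. */
static int set_spapr_tce_liobn(int vm_fd, int group_fd, uint32_t liobn)
{
	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
	struct kvm_vfio_spapr_tce_liobn param;
	struct kvm_device_attr attr;

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -1;

	param.argsz = sizeof(param);
	param.fd = group_fd;		/* VFIO group file descriptor */
	param.liobn = liobn;		/* logical bus id used by the guest */

	attr.flags = 0;
	attr.group = KVM_DEV_VFIO_GROUP;
	attr.attr = KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN;
	attr.addr = (uint64_t)(uintptr_t)&param;

	return ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);
}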

Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

2013-11-20 Thread Marcelo Tosatti
On Wed, Nov 20, 2013 at 10:20:09PM +0800, Xiao Guangrong wrote:
> > But what guarantee does userspace require, from GET_DIRTY_LOG, while vcpus 
> > are
> > executing? 
> 
> Aha. Single calling GET_DIRTY_LOG is useless since new dirty page can be 
> generated
> when GET_DIRTY_LOG is being returned. If user wants to get exact dirty pages 
> the vcpus
> should be stopped. 
> 
> > 
> > With fast page fault:
> > 
> >  if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
> >  /* The window in here... */
> >  mark_page_dirty(vcpu->kvm, gfn);
> > 
> > And the $SUBJECT set_spte reordering, the rule becomes
> > 
> > A call to GET_DIRTY_LOG guarantees to return correct information about 
> > dirty pages before invocation of the previous GET_DIRTY_LOG call.
> > 
> > (see example 1: the next GET_DIRTY_LOG will return the dirty information
> > there).
> > 
> 
> It seems no.
> 
> The first GET_DIRTY_LOG can happen before fast-page-fault,
> the second GET_DIRTY_LOG happens in the window between cmpxchg()
> and mark_page_dirty(), for the second one, the information is still 
> “incorrect”.

Its correct for the previous GET_DIRTY_LOG call.

> > The rule for sptes that is, because kvm_write_guest does not match the
> > documentation at all.
> 
> You mean the case of “kvm_write_guest” is valid (I do not know why it is)?
> Or anything else?
> 
> > 
> > So before example 1 and this patch, the rule (well for sptes at least) was
> > 
> > "Given a memory slot, return a bitmap containing any pages dirtied
> > since the last call to this ioctl.  Bit 0 is the first page in the
> > memory slot.  Ensure the entire structure is cleared to avoid padding
> > issues."
> > 
> > Can you explain why it is OK to relax this rule?
> 
> It’s because:
> 1) it doesn’t break current use cases, i.e. Live migration and FB-flushing.
> 2) the current code, like kvm_write_guest  has already broken the 
> documentation
>(the guest page has been written but missed in the dirty bitmap).
> 3) it’s needless to implement a exact get-dirty-pages since the dirty pages 
> can
>no be exactly got except stopping vcpus. 
> 
> So i think we'd document this case instead. No?

Lets figure out the requirements, then. I don't understand why
FB-flushing is safe (think kvm-autotest: one pixel off the entire
test fails).

Before fast page fault: Pages are either write protected or the
corresponding dirty bitmap bit is set. Any write faults to dirty logged
sptes while GET_DIRTY log is executing in the protected section are
allowed to instantiate writeable spte after GET_DIRTY log is finished.

After fast page fault: Pages can be writeable and the dirty bitmap not
set. Therefore data can be dirty before GET_DIRTY executes and still
fail to be returned in the bitmap.

Since this patchset does not introduce change in behaviour (fast pf
did), no reason to avoid merging this.

BTW, since GET_DIRTY log does not require to report concurrent (to
GET_DIRTY_LOG) write faults, it is not necessary to rewalk the spte
list, is it?





Re: [PATCH 0/9] kvm-unit-tests/arm: initial drop

2013-11-20 Thread María Soler Heredia
Andrew Jones  redhat.com> writes:

 
> This series introduces arm to kvm-unit-tests.

> To use this you need an arm platform or simulator capable
> of running kvmarm and a qemu with the mach-virt patches[2], as
> well as the previously mentioned virtio-testdev.

Hello,

I have been playing with your tests for a while and I cannot seem to get
them to work all right. When I run them disabling kvm on the arm-run script,
they do work, but when I run them with kvm enabled they fail.

This is my output:

./arm-run arm/boot.flat -smp 1 -m 256 -append 'info 0x1000 0x1000'
qemu-system-arm -device virtio-testdev -display none -serial stdio -M virt
-cpu cortex-a15 -enable-kvm -kernel arm/boot.flat -smp 1 -m 256 -append info
0x1000 0x1000
kvm [1252]: load/store instruction decoding not implemented
error: kvm run failed Function not implemented
./arm-run: line 16:  1251 Aborted $command "$@"
Return value from qemu: 134
FAIL boot_info
./arm-run arm/boot.flat -smp 1 -append 'vectors'
qemu-system-arm -device virtio-testdev -display none -serial stdio -M virt
-cpu cortex-a15 -enable-kvm -kernel arm/boot.flat -smp 1 -append vectors
kvm [1257]: load/store instruction decoding not implemented
error: kvm run failed Function not implemented
./arm-run: line 16:  1256 Aborted $command "$@"
Return value from qemu: 134
FAIL boot_vectors

I am using FastModels Model Debugger version 8.2.028, with a model of this
characteristics:

Model:
--
Model name: ARM_Cortex-A15
Instance: cluster.cpu0
Using CADI 2.0 interface revision 0.
Version: 8.2.72
Generated by Core Generator: No
Needs SimGen License: No

running the latest stable linux release and qemu-devel's latest qemu with
the patches indicated here
https://lists.gnu.org/archive/html/qemu-devel/2013-10/msg02428.html plus
> [1] http://lists.nongnu.org/archive/html/qemu-devel/2013-10/msg01815.html

I tested the installation by running Linux with the same setup using this call:
qemu-system-arm \
-display none \
-enable-kvm \
-kernel zImage\
-m 128 -M virt -cpu cortex-a15 \
-drive if=none,file=linux.img,id=fs \
-device virtio-blk-device,drive=fs

As I said, the tests pass if the kvm is not enabled and fail otherwise. I
have added a few printfs for debugging and I can tell that the code in
boot.c runs ok, but then when virtio_testdev is called (from
virtio_testdev_exit) the execution throws an exception (more specifically at
the line *tdp++ = cpu_to_le32(va_arg(va, unsigned)); inside the first while
loop).

I am not used to sending emails to this kind of list, so I don't know if I
am being too specific, too little or maybe not even giving the right
information. Please tell me what else you need and if you can help me solve
this problem. 

Thank you very much,
María.



Re: Starting a VM using /dev/kvm

2013-11-20 Thread Jim MacArthur
On 20 November 2013 17:28, Paolo Bonzini  wrote:
> On 20/11/2013 00:56, Jim MacArthur wrote:
>> The question is, how does a new vcpu start up? Will it start in full
>> 64-bit mode or 16-bit real mode?
>
> By default it start in 16-bit real mode, with CS=0xf000 and EIP=0xfff0,
> but CS.base = 0x.  However, you can send ioctls to modify
> CR0/CR4/EFER and place the VCPU in whatever mode you'd like to have.

After reading this I added a call to KVM_GET_SREGS. Everything you say
here matches my experience except that CS.base=0xf.
So I adjusted my memory to cover physical address 0x0, and now
it's happily running instructions (NOPs, at least.)
I'm a bit puzzled that it didn't start with CS.base=0x, but it
doesn't matter, I've done what I wanted to do for now.

Thanks very much for your help.


Re: [PATCH v3] KVM: PPC: vfio kvm device: support spapr tce

2013-11-20 Thread Alexey Kardashevskiy
On 11/21/2013 07:57 AM, Alex Williamson wrote:
> On Wed, 2013-11-20 at 16:18 +1100, Alexey Kardashevskiy wrote:
>> In addition to the external VFIO user API, a VFIO KVM device
>> has been introduced recently.
>>
>> sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
>> via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
>> identifier. LIOBNs are made up and linked to IOMMU groups by the user
>> space. In order to accelerate IOMMU operations in the KVM, we need
>> to tell KVM the information about LIOBN-to-group mapping.
>>
>> For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
>> is added. It accepts a pair of a VFIO group fd and LIOBN.
>>
>> This also adds a new kvm_vfio_find_group_by_liobn() function which
>> receives a kvm struct, a LIOBN and a callback. As it increases the IOMMU
>> group use counter, KVM is required to pass a callback which is called
>> when the VFIO group is about to be removed from VFIO-KVM tracking, so
>> that KVM is able to call iommu_group_put() to release the IOMMU group.
>>
>> The KVM uses kvm_vfio_find_group_by_liobn() once per KVM run and caches
>> the result in kvm_arch. iommu_group_put() for all groups will be called
>> when KVM finishes (in the SPAPR TCE in KVM enablement patch).
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>> Changes:
>> v3:
>> * total rework
>> * added a release callback into kvm_vfio_find_group_by_liobn so now
>> the user of the API can get a notification if the group is about to
>> disappear
>> ---
>>  Documentation/virtual/kvm/devices/vfio.txt |  19 -
>>  arch/powerpc/kvm/Kconfig   |   1 +
>>  arch/powerpc/kvm/Makefile  |   3 +
>>  include/linux/kvm_host.h   |  18 +
>>  include/uapi/linux/kvm.h   |   7 ++
>>  virt/kvm/vfio.c| 116 
>> -
>>  6 files changed, 161 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/virtual/kvm/devices/vfio.txt 
>> b/Documentation/virtual/kvm/devices/vfio.txt
>> index ef51740..7ecb3b2 100644
>> --- a/Documentation/virtual/kvm/devices/vfio.txt
>> +++ b/Documentation/virtual/kvm/devices/vfio.txt
>> @@ -16,7 +16,22 @@ Groups:
>>  
>>  KVM_DEV_VFIO_GROUP attributes:
>>KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
>> +kvm_device_attr.addr points to an int32_t file descriptor
>> +for the VFIO group.
>> +
>>KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
>> +kvm_device_attr.addr points to an int32_t file descriptor
>> +for the VFIO group.
>> +
>> +  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN: sets a liobn for a VFIO group
>> +kvm_device_attr.addr points to a struct:
>> +struct kvm_vfio_spapr_tce_liobn {
>> +__u32   argsz;
>> +__u32   fd;
> 
> fds are signed, __s32
> 
>> +__u32   liobn;
>> +};
>> +where
>> +@argsz is a struct size;
>> +@fd is a file descriptor for a VFIO group;
>> +@liobn is a logical bus id to be associated with the group.
>>  
>> -For each, kvm_device_attr.addr points to an int32_t file descriptor
>> -for the VFIO group.
>> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
>> index 61b3535..d1b7f64 100644
>> --- a/arch/powerpc/kvm/Kconfig
>> +++ b/arch/powerpc/kvm/Kconfig
>> @@ -60,6 +60,7 @@ config KVM_BOOK3S_64
>>  select KVM_BOOK3S_64_HANDLER
>>  select KVM
>>  select SPAPR_TCE_IOMMU
>> +select KVM_VFIO
>>  ---help---
>>Support running unmodified book3s_64 and book3s_32 guest kernels
>>in virtual machines on book3s_64 host processors.
>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>> index 6646c95..2438d2e 100644
>> --- a/arch/powerpc/kvm/Makefile
>> +++ b/arch/powerpc/kvm/Makefile
>> @@ -87,6 +87,9 @@ kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HV) := \
>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>  book3s_xics.o
>>  
>> +kvm-book3s_64-objs-$(CONFIG_KVM_VFIO) += \
>> +$(KVM)/vfio.o \
>> +
>>  kvm-book3s_64-module-objs := \
>>  $(KVM)/kvm_main.o \
>>  $(KVM)/eventfd.o \
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 88ff96a..1d2ad5e 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1112,5 +1112,23 @@ static inline bool 
>> kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
>>  }
>>  
>>  #endif /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT */
>> +
>> +typedef void (*kvm_vfio_release_group_callback)(struct kvm *kvm,
>> +unsigned long liobn);
> 
> liobn was said to be __u32 in kvm_vfio_spapr_tce_liobn above, here it's
> unsigned long?


PAPR spec says it is 32 bit and kvm_vfio_spapr_tce_liobn is a binary
interface (ABI?) so I want it to be precise.

However kvmppc_get_gpr() (used to parse hypercall parameters such as liobn)
returns unsigned long. So inside the kern

question about VM kernel parameter idle=

2013-11-20 Thread Zhanghaoyu (A)
Hi, all

What's the difference between the Linux guest kernel parameter settings
idle=halt, idle=poll and idle=mwait, especially in performance?

Taking performance into account, which one is best?

In my opinion, if the number of all VMs' vcpus is far more than that of pcpus,
e.g. in a SPECvirt test, idle=halt is better for the server's total throughput;
otherwise, e.g. in some CT scenarios where the number of total vcpus is not
greater than that of pcpus, idle=poll is better for the server's total
throughput, because of lower latency and fewer VMEXITs.

On linux-3.9 and above, idle=mwait is not recommended.

Thanks,
Zhang Haoyu
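
A rough illustration (not the actual kernel idle code) of what the tradeoff
described above boils down to inside the guest's idle loop:

#include <linux/sched.h>
#include <asm/processor.h>
#include <asm/irqflags.h>

/* idle=poll spins with cpu_relax(): lowest wakeup latency and no VMEXIT, but
 * the idle vcpu keeps burning its pcpu.  idle=halt executes hlt, which traps
 * to KVM (a VMEXIT) and lets the host run another vcpu on that pcpu. */
static void guest_idle(bool poll)
{
	if (poll) {
		while (!need_resched())
			cpu_relax();	/* busy-wait, stays in guest mode */
	} else {
		safe_halt();		/* sti; hlt -> VMEXIT to the host */
	}
}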


Re: [PATCH 04/12] KVM: MMU: log dirty page after marking spte writable

2013-11-20 Thread Xiao Guangrong

On Nov 21, 2013, at 3:47 AM, Marcelo Tosatti  wrote:

> On Wed, Nov 20, 2013 at 10:20:09PM +0800, Xiao Guangrong wrote:
>>> But what guarantee does userspace require, from GET_DIRTY_LOG, while vcpus 
>>> are
>>> executing? 
>> 
>> Aha. Single calling GET_DIRTY_LOG is useless since new dirty page can be 
>> generated
>> when GET_DIRTY_LOG is being returned. If user wants to get exact dirty pages 
>> the vcpus
>> should be stopped. 
>> 
>>> 
>>> With fast page fault:
>>> 
>>> if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
>>> /* The window in here... */
>>> mark_page_dirty(vcpu->kvm, gfn);
>>> 
>>> And the $SUBJECT set_spte reordering, the rule becomes
>>> 
>>> A call to GET_DIRTY_LOG guarantees to return correct information about 
>>> dirty pages before invocation of the previous GET_DIRTY_LOG call.
>>> 
>>> (see example 1: the next GET_DIRTY_LOG will return the dirty information
>>> there).
>>> 
>> 
>> It seems no.
>> 
>> The first GET_DIRTY_LOG can happen before fast-page-fault,
>> the second GET_DIRTY_LOG happens in the window between cmpxchg()
>> and mark_page_dirty(), for the second one, the information is still 
>> “incorrect”.
> 
> Its correct for the previous GET_DIRTY_LOG call.

Oh, yes.

> 
>>> The rule for sptes that is, because kvm_write_guest does not match the
>>> documentation at all.
>> 
>> You mean the case of “kvm_write_guest” is valid (I do not know why it is)?
>> Or anything else?
>> 
>>> 
>>> So before example 1 and this patch, the rule (well for sptes at least) was
>>> 
>>> "Given a memory slot, return a bitmap containing any pages dirtied
>>> since the last call to this ioctl.  Bit 0 is the first page in the
>>> memory slot.  Ensure the entire structure is cleared to avoid padding
>>> issues."
>>> 
>>> Can you explain why it is OK to relax this rule?
>> 
>> It’s because:
>> 1) it doesn’t break current use cases, i.e. Live migration and FB-flushing.
>> 2) the current code, like kvm_write_guest  has already broken the 
>> documentation
>>   (the guest page has been written but missed in the dirty bitmap).
>> 3) it’s needless to implement a exact get-dirty-pages since the dirty pages 
>> can
>>   no be exactly got except stopping vcpus. 
>> 
>> So i think we'd document this case instead. No?
> 
> Lets figure out the requirements, then. I don't understand why
> FB-flushing is safe (think kvm-autotest: one pixel off the entire
> test fails).

I did not read the FB-flushing code; I guess the reason why it can work is:
FB-flushing does get-dirty-page periodically and flushes it.  After the guest
writes the page, the page will be flushed by the next GET_DIRTY_LOG.

> 
> Before fast page fault: Pages are either write protected or the
> corresponding dirty bitmap bit is set. Any write faults to dirty logged
> sptes while GET_DIRTY log is executing in the protected section are
> allowed to instantiate writeable spte after GET_DIRTY log is finished.
> 
> After fast page fault: Pages can be writeable and the dirty bitmap not
> set. Therefore data can be dirty before GET_DIRTY executes and still
> fail to be returned in the bitmap.

It’s right. The current GET_DIRTY fails to get the dirty page, but the
next GET_DIRTY can get it properly, since the current GET_DIRTY needs
to flush all TLBs, which waits for the fast page fault to finish.

I do not think it’s a big problem, since a single GET_DIRTY is useless, as
I explained in the previous mail.

> 
> Since this patchset does not introduce change in behaviour (fast pf
> did), no reason to avoid merging this.

Yes, thank you, Marcelo! :)

> 
> BTW, since GET_DIRTY log does not require to report concurrent (to
> GET_DIRTY_LOG) write faults, it is not necessary to rewalk the spte
> list, is it?

You mean the “rewalk” we introduced in pte_list_walk_lockless() in this
patchset?  I think this rewalk is needed because it is triggered by meeting
an unexpected nulls marker, which means we are walking an unexpected rmap.
If we do not do this, some writable sptes will be missed.
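
The restart-on-unexpected-nulls pattern described here is the standard one
for lockless walks of RCU lists whose objects can be freed and reused
concurrently; a generic hedged sketch with illustrative names (not the kvm
mmu code):

#include <linux/rculist_nulls.h>
#include <linux/rcupdate.h>

struct item {
	struct hlist_nulls_node node;
	unsigned long key;
};

/* Walk a nulls-terminated hash chain locklessly.  A freed-and-reused object
 * can redirect the walk onto another chain; the nulls value encodes which
 * bucket a chain ends in, so ending on an unexpected nulls value means the
 * walk may have skipped entries and must be restarted. */
static struct item *lookup(struct hlist_nulls_head *table,
			   unsigned long key, unsigned int bucket)
{
	struct item *it;
	struct hlist_nulls_node *pos;

	rcu_read_lock();
restart:
	hlist_nulls_for_each_entry_rcu(it, pos, &table[bucket], node) {
		if (it->key == key) {
			rcu_read_unlock();
			return it;
		}
	}
	if (get_nulls_value(pos) != bucket)
		goto restart;	/* landed on another bucket's nulls: rewalk */
	rcu_read_unlock();
	return NULL;
}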






Re: question about VM kernel parameter idle=

2013-11-20 Thread Michael S. Tsirkin
On Thu, Nov 21, 2013 at 03:45:28AM +, Zhanghaoyu (A) wrote:
> Hi, all
> 
> What's the difference of the linux guest kernel parameter 
> idle=, especially in performance?
> 
> Taking the performance into account, which one is best?
> 
> In my opinion, if the number of all VMs' vcpus is far more than that of 
> pcpus, e.g. SPECVirt test, idle=halt is better for server's total throughput,
> otherwise, e.g. in some CT scenario, the number of total vcpus is not greater 
> than that of pcpus, idle=poll is better for server's total throughput,
> because of less latency and VMEXIT.

Makes sense overall.

> linux-3.9 and above, idle=mwait is not recommended.
> 
> Thanks,
> Zhang Haoyu

Does it actually have effect? I didn't think it would.


Re: guest assigned device MMIO maps with WC: does this work correctly?

2013-11-20 Thread Gleb Natapov
On Wed, Nov 20, 2013 at 09:58:15PM +0200, Michael S. Tsirkin wrote:
> I see this in kvm:
> 
> static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool
> is_mmio)
> {
> u64 ret;
>  
> /* For VT-d and EPT combination
>  * 1. MMIO: always map as UC
>  * 2. EPT with VT-d:
>  *   a. VT-d without snooping control feature: can't guarantee the
>  *      result, try to trust guest.
>  *   b. VT-d with snooping control feature: snooping control feature of
>  *      VT-d engine can guarantee the cache correctness. Just set it
>  *      to WB to keep consistent with host. So the same as item 3.
>  * 3. EPT without VT-d: always map as WB and set IPAT=1 to keep
>  *    consistent with host MTRR
>  */
> if (is_mmio)
> ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> 
> ...
> }
> 
> 
> does this mean that even if guest maps BAR for an assigned device
> as write combined (or configures such using an MTRR),
> host will override this and use uncacheable in practice?
> 
No, it does not mean that. I already answered this once (my previous
answer included below): effective memory type is a combination of MTRR
(EPT MT bits in case of a guest) and PAT bits. See section 11.5.2.2
in SDM on how effective memory type is calculated.

 Since MTRR UC + PAT WC = WC, if the guest maps MMIO as WC in a page table
 (that is what ioremap_wc does), everything works as it should. If the guest
 maps MMIO as WB (ioremap_cache) and the MTRR says the MMIO is UC (as any MMIO
 will be by default), the combined memory type will be UC, so that is also
 fine. If the guest maps an MMIO range as WB and fixes the MTRR for this
 region to be WB, then the memory type will be incorrect in the guest, but I
 found only one place that does that in Linux: drivers/video/vesafb.c. All
 other uses of ioremap_cache either remap RAM or are used to get whatever
 memory type is configured in the MTRR.

--
Gleb.
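
The combination rule referred to here (Intel SDM Vol. 3, the table of
effective page-level memory types) can be sketched for the cases that matter
in this thread; a simplified illustration, not the complete table:

/* Effective type from the MTRR type (EPT memory-type bits for a guest) and
 * the PAT type in the page tables.  Only the rows used above are shown. */
enum mem_type { UC, WC, WT, WB };

static enum mem_type effective_type(enum mem_type mtrr, enum mem_type pat)
{
	if (mtrr == UC)
		return pat == WC ? WC : UC;	/* UC + WC = WC, otherwise UC */
	if (mtrr == WB)
		return pat;			/* WB + anything = the PAT type */
	/* other MTRR types omitted in this sketch */
	return UC;
}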


Re: question about VM kernel parameter idle=

2013-11-20 Thread Gleb Natapov
On Thu, Nov 21, 2013 at 09:01:39AM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 21, 2013 at 03:45:28AM +, Zhanghaoyu (A) wrote:
> > Hi, all
> > 
> > What's the difference of the linux guest kernel parameter 
> > idle=, especially in performance?
> > 
> > Taking the performance into account, which one is best?
> > 
> > In my opinion, if the number of all VMs' vcpus is far more than that of 
> > pcpus, e.g. SPECVirt test, idle=halt is better for server's total 
> > throughput,
> > otherwise, e.g. in some CT scenario, the number of total vcpus is not 
> > greater than that of pcpus, idle=poll is better for server's total 
> > throughput,
> > because of less latency and VMEXIT.
> 
> Makes sense overall.
> 
> > linux-3.9 and above, idle=mwait is not recommended.
> > 
> > Thanks,
> > Zhang Haoyu
> 
> Does it actually have effect? I didn't think it would.
The mwait instruction is never exposed to a guest. With idle=mwait it will
likely fall back to halt silently.

--
Gleb.


Re: Starting a VM using /dev/kvm

2013-11-20 Thread Gleb Natapov
On Thu, Nov 21, 2013 at 12:39:49AM +, Jim MacArthur wrote:
> On 20 November 2013 17:28, Paolo Bonzini  wrote:
> > On 20/11/2013 00:56, Jim MacArthur wrote:
> >> The question is, how does a new vcpu start up? Will it start in full
> >> 64-bit mode or 16-bit real mode?
> >
> > By default it start in 16-bit real mode, with CS=0xf000 and EIP=0xfff0,
> > but CS.base = 0x.  However, you can send ioctls to modify
> > CR0/CR4/EFER and place the VCPU in whatever mode you'd like to have.
> 
> After reading this I added a call to KVM_GET_SREGS. Everything you say
> here matches my experience except that CS.base=0xf.
> So I adjusted my memory to cover physical address 0x0, and now
> it's happily running instructions (NOPs, at least.)
> I'm a bit puzzled that it didn't start with CS.base=0x, but it
> doesn't matter, I've done what I wanted to do for now.
> 
What is your kernel version?

--
Gleb.


Re: guest assigned device MMIO maps with WC: does this work correctly?

2013-11-20 Thread Michael S. Tsirkin
On Thu, Nov 21, 2013 at 09:18:55AM +0200, Gleb Natapov wrote:
> On Wed, Nov 20, 2013 at 09:58:15PM +0200, Michael S. Tsirkin wrote:
> > I see this in kvm:
> > 
> > static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool
> > is_mmio)
> > {
> > u64 ret;
> >  
> > /* For VT-d and EPT combination
> >  * 1. MMIO: always map as UC
> >  * 2. EPT with VT-d:
> >  *   a. VT-d without snooping control feature: can't guarantee the
> >  *      result, try to trust guest.
> >  *   b. VT-d with snooping control feature: snooping control feature of
> >  *      VT-d engine can guarantee the cache correctness. Just set it
> >  *      to WB to keep consistent with host. So the same as item 3.
> >  * 3. EPT without VT-d: always map as WB and set IPAT=1 to keep
> >  *    consistent with host MTRR
> >  */
> > if (is_mmio)
> > ret = MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> > 
> > ...
> > }
> > 
> > 
> > does this mean that even if guest maps BAR for an assigned device
> > as write combined (or configures such using an MTRR),
> > host will override this and use uncacheable in practice?
> > 
> No, it does not mean that. I already answered this once (my previous
> answer included below): effective memory type is a combination of MTRR
> (EPT MT bits in case of a guest) and PAT bits. See section 11.5.2.2
> in SDM


Can you quote the chapter name please?
My SDM has:
11.5.2.2 Denormal-Operand Exception (#D)

> on how effective memory type is calculated.
> 
>  Since MTRR UC + PAT WC = WC, if guest maps MMIO as WC in a page table
>  (that what ioremap_wc does), everything works as it should. If guest maps
>  MMIO as WB (ioremap_cache) and MTRR says MMIO is UC (like any MMIO will
>  be by default) combined memory type will be UC, so also fine. If guest
>  maps MMIO range as WB and fixes mtrr for this region to be WB then memory
>  type will be incorrect in a guest,

Meaning the MTRR in the guest is ignored in this case?

> but I found only one place that does
>  it in Linux: drivers/video/vesafb.c. All other uses of ioremap_cache
>  either remap RAM or used to get whatever memory type configured in MTRR.
> 
> --
>   Gleb.