Re: [PATCH v3 6/6] KVM: introduce readonly memslot

2012-06-19 Thread Gleb Natapov
On Mon, Jun 18, 2012 at 05:25:05PM -0300, Marcelo Tosatti wrote:
> On Mon, Jun 18, 2012 at 12:50:10PM +0300, Avi Kivity wrote:
> > On 06/16/2012 05:11 AM, Marcelo Tosatti wrote:
> > > 
> > > Can you introduce a separate exit reason, say KVM_EXIT_READ_FAULT, with
> > > information about the fault?
> > 
> > I think you mean WRITE_FAULT.  
> 
> Yes.
> 
> > But what's wrong with the normal mmio exit?
> 
> It is necessary to perform an address->mmio region lookup, to verify
> whether the mmio exit is due to an actual mmio (no memory slot) or from
> a write access to a write protected slot. That information is readily
> available in the kernel but is lost if the mmio exit is used to transmit 
> the information.
> 
Why is it necessary though? Write access to a write protected slot is
MMIO by (our) definition.

> Moreover, i'd argue the uses are different: one is an mmio emulation
> exit, the other is more like handling a pagefault in qemu.
> 
What do you mean by "handling a pagefault in qemu"?

--
Gleb.


Re: [PATCH v3 6/6] KVM: introduce readonly memslot

2012-06-19 Thread Avi Kivity
On 06/18/2012 11:25 PM, Marcelo Tosatti wrote:
> On Mon, Jun 18, 2012 at 12:50:10PM +0300, Avi Kivity wrote:
>> On 06/16/2012 05:11 AM, Marcelo Tosatti wrote:
>> > 
>> > Can you introduce a separate exit reason, say KVM_EXIT_READ_FAULT, with
>> > information about the fault?
>> 
>> I think you mean WRITE_FAULT.  
> 
> Yes.
> 
>> But what's wrong with the normal mmio exit?
> 
> It is necessary to perform an address->mmio region lookup, to verify
> whether the mmio exit is due to an actual mmio (no memory slot) or from
> a write access to a write protected slot. That information is readily
> available in the kernel but is lost if the mmio exit is used to transmit 
> the information.

For qemu it doesn't matter, but other userspaces might need it.  Can we
add it to the mmio exit reason as extra data?  Only present if the CAP
is available.
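
For illustration, such an extension might look like the sketch below; the
flag and CAP names are made up here, not part of any proposed ABI:

	/* hypothetical extension of the kvm_run mmio payload */
	struct {
		__u64 phys_addr;
		__u8  data[8];
		__u32 len;
		__u8  is_write;
		__u8  flags;	/* valid only if the CAP is advertised */
	} mmio;

	#define KVM_MMIO_F_WRITE_PROTECTED	(1 << 0) /* write hit a readonly slot */

A userspace that knows about the CAP can dispatch on the flag; an old
userspace keeps treating the exit as an ordinary mmio exit.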

> Moreover, i'd argue the uses are different: one is an mmio emulation
> exit, the other is more like handling a pagefault in qemu.

It is not.  A pagefault is handled by fixing and retrying (fault).  MMIO
emulation is done by userspace instead of the kernel or guest (trap).
Here we're not fixing anything (say, by marking the page writeable); the
slot is readonly, not the page (as is the case with ksm).

Qemu's ROM and ROMD behaviour follows this (discard the access for ROM,
emulate for ROMD).

There might be value in implementing user-visible write protection, for
example to implement distributed shared memory or postcopy migration.
But as long as per-page protection requires separate VMAs it is not
practical (and is unlikely to perform well anyway).

> 
>> > Then perform this exit only if userspace allows it by explicit enable, 
>> > and by default have the exit_read_fault handler jump to the mmio
>> > handler. 
>> 
>> 
>> I don't get this.
> 
> 
> CAN USERSPACE HANDLE WRITE FAULT EXITS?
> YES: WRITE FAULT EXIT.
> NO: MMIO EXIT.
> 
> But then again userspace won't set read-only slots if it does not know
> about them. So it is not necessary.
> 

Okay, okay, don't shout.

-- 
error compiling committee.c: too many arguments to function




RE: [PATCH v4] KVM: x86: Implement PCID/INVPCID for guests with EPT

2012-06-19 Thread Mao, Junjie
Hi, Avi

Any comments on the patch?

Best Regards
Junjie Mao

> -----Original Message-----
> From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
> Sent: Saturday, June 16, 2012 10:32 AM
> To: Mao, Junjie
> Cc: 'kvm@vger.kernel.org'; Avi Kivity
> Subject: Re: [PATCH v4] KVM: x86: Implement PCID/INVPCID for guests with
> EPT
> 
> On Thu, Jun 14, 2012 at 02:04:25AM +, Mao, Junjie wrote:
> > This patch handles PCID/INVPCID for guests.
> >
> > Process-context identifiers (PCIDs) are a facility by which a logical
> > processor may cache information for multiple linear-address spaces so
> > that the processor may retain cached information when software
> > switches to a different linear address space. Refer to section 4.10.1
> > in IA32 Intel Software Developer's Manual Volume 3A for details.
> >
> > For guests with EPT, the PCID feature is enabled and INVPCID behaves
> > as running natively.
> > For guests without EPT, the PCID feature is disabled and INVPCID triggers
> #UD.
> >
> > Changes from v3:
> > Rebase to the latest tree
> > Expose PCID to nested guests
> > Remove the pcid_supported callback
> >
> > Changes from v2:
> > Separate management of PCID and INVPCID
> > Prevent the PCID bit in CPUID from being exposed to guest hypervisors
> > Don't check the lower 12 bits when loading cr3 if cr4.PCIDE is set
> > Explicitly disable INVPCID for L2 guests
> > Support both enable and disable INVPCID in vmx_cpuid_update()
> >
> > Changes from v1:
> > Move cr0/cr4 writing checks to x86.c
> > Update comments for the reason why PCID is disabled for non-EPT guests
> > Do not support PCID/INVPCID for nested guests at present
> > Clean up useless symbols
> >
> > Signed-off-by: Junjie Mao 
> 
> Looks good to me.
> 
> > ---
> >  arch/x86/include/asm/kvm_host.h        |    4 ++-
> >  arch/x86/include/asm/processor-flags.h |    2 +
> >  arch/x86/include/asm/vmx.h             |    2 +
> >  arch/x86/kvm/cpuid.c                   |    6 +++-
> >  arch/x86/kvm/cpuid.h                   |    8 +++
> >  arch/x86/kvm/svm.c                     |    6 +
> >  arch/x86/kvm/vmx.c                     |   37 +++-
> >  arch/x86/kvm/x86.c                     |   24 ++--
> >  8 files changed, 82 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h
> > b/arch/x86/include/asm/kvm_host.h index db7c1f2..95828a4 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -48,12 +48,13 @@
> >
> >  #define CR3_PAE_RESERVED_BITS ((X86_CR3_PWT | X86_CR3_PCD) - 1)
> >  #define CR3_NONPAE_RESERVED_BITS ((PAGE_SIZE-1) & ~(X86_CR3_PWT | X86_CR3_PCD))
> > +#define CR3_PCID_ENABLED_RESERVED_BITS 0xFFFFFF0000000000ULL
> >  #define CR3_L_MODE_RESERVED_BITS (CR3_NONPAE_RESERVED_BITS |  \
> >                                    0xFFFFFF0000000000ULL)
> >  #define CR4_RESERVED_BITS                                              \
> >     (~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\
> >                       | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE     \
> > -                     | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR  \
> > +                     | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
> >                       | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_RDWRGSFS \
> >                       | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE))
> >
> > @@ -661,6 +662,7 @@ struct kvm_x86_ops {
> > u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> > int (*get_lpage_level)(void);
> > bool (*rdtscp_supported)(void);
> > +   bool (*invpcid_supported)(void);
> > void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment,
> > bool host);
> >
> >     void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
> >
> > diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
> > index f8ab3ea..aea1d1d 100644
> > --- a/arch/x86/include/asm/processor-flags.h
> > +++ b/arch/x86/include/asm/processor-flags.h
> > @@ -44,6 +44,7 @@
> >   */
> >  #define X86_CR3_PWT        0x00000008 /* Page Write Through */
> >  #define X86_CR3_PCD        0x00000010 /* Page Cache Disable */
> > +#define X86_CR3_PCID_MASK  0x00000fff /* PCID Mask */
> >
> >  /*
> >   * Intel CPU features in CR4
> > @@ -61,6 +62,7 @@
> >  #define X86_CR4_OSXMMEXCPT 0x00000400 /* enable unmasked SSE exceptions */
> >  #define X86_CR4_VMXE       0x00002000 /* enable VMX virtualization */
> >  #define X86_CR4_RDWRGSFS   0x00010000 /* enable RDWRGSFS support */
> > +#define X86_CR4_PCIDE      0x00020000 /* enable PCID support */
> >  #define X86_CR4_OSXSAVE    0x00040000 /* enable xsave and xrestore */
> >  #define X86_CR4_SMEP       0x00100000 /* enable SMEP support */
> >
> > diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> > index 31f180c..b81525c 100644
> > --- a/arch/x86/include/asm/vmx.h
> > +++ b/arch/x86/include/asm/vmx.h
> > @@ -60,6 +60,7 @@
> 

Re: [PATCH v8 04/15] KVM: Guard mmu_notifier specific code with CONFIG_MMU_NOTIFIER

2012-06-19 Thread Avi Kivity
On 06/18/2012 08:47 PM, Christoffer Dall wrote:
> On Mon, Jun 18, 2012 at 9:08 AM, Avi Kivity  wrote:
>> On 06/15/2012 10:07 PM, Christoffer Dall wrote:
>>> From: Marc Zyngier 
>>>
>>> In order to avoid compilation failure when KVM is not compiled in,
>>> guard the mmu_notifier specific sections with both CONFIG_MMU_NOTIFIER
>>> and KVM_ARCH_WANT_MMU_NOTIFIER, like it is being done in the rest of
>>> the KVM code.
>>>
>>>
>>> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>>> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>>>   struct mmu_notifier mmu_notifier;
>>>   unsigned long mmu_notifier_seq;
>>>   long mmu_notifier_count;
>>> @@ -780,7 +780,7 @@ struct kvm_stats_debugfs_item {
>>>  extern struct kvm_stats_debugfs_item debugfs_entries[];
>>>  extern struct dentry *kvm_debugfs_dir;
>>>
>>> -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
>>> +#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>>>  static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_seq)
>>>  {
>>
>> Why not have Kconfig select CONFIG_MMU_NOTIFIER?
>>
>>
> Not sure I understand. Where would you select this option?
> 
> We do select this option when choosing to compile KVM on, but when we
> do _not_, then other includes of kvm_host.h fails.

Right, my mistake.  Didn't notice it was a header.


-- 
error compiling committee.c: too many arguments to function




Re: [PATCH v8 06/15] ARM: KVM: Hypervisor identity mapping

2012-06-19 Thread Avi Kivity
On 06/18/2012 08:55 PM, Christoffer Dall wrote:
> On Mon, Jun 18, 2012 at 9:12 AM, Avi Kivity  wrote:
>> On 06/15/2012 10:07 PM, Christoffer Dall wrote:
>>> Adds support in the identity mapping feature that allows KVM to set up
>>> an identity mapping for Hyp mode with the AP[1] bit set, as required by
>>> the specification, and also supports freeing the created sub-pmds after
>>> use.
>>>
>>> These two functions:
>>>  - hyp_idmap_add(pgd, addr, end);
>>>  - hyp_idmap_del(pgd, addr, end);
>>> are essentially calls to the same function as the non-hyp versions but
>>> with a different argument value. KVM calls these functions to setup
>>> and teardown the identity mapping used to initialize the hypervisor.
>>>
>>> Note, the hyp-version of the _del function actually frees the pmd's
>>> pointed to by the pgd as opposed to the non-hyp version which just
>>> clears them.
>>
>>
>> I asked previously what happens if two data structures share a page, and
>> one of them is removed.  Is that handled now?  How?
>>
> 
> I think you asked previously for the general hyp-mode mappings, not
> the identity mappings. For the general hyp-mode mappings we simply
> don't unmap the data structures, potentially leaking a few pages for
> the page tables themselves.
> 
> This is only for initialization, so there are not really any data
> structures mapped, only one/two pages to initialize the hypervisor
> mode.
> 
>> Why not just identity map all memory?  You can use large pages so it's
>> fast and doesn't consume a lot of page table memory.
> 
> That's an option, but it still seems like an awful waste since it's
> only used once (unless you unload and re-load the module) and there's
> really no problem with data structures here.
> 
> The truth is that this is going to go away, and the code will be put
> in a section that's idmapped from kernel start. There's a patch under
> way from Marc taking care of this which I assume we'll merge for v9.

Okay, thanks.


-- 
error compiling committee.c: too many arguments to function




Re: [PATCH v8 10/15] ARM: KVM: Inject IRQs and FIQs from userspace

2012-06-19 Thread Avi Kivity
On 06/18/2012 11:56 PM, Christoffer Dall wrote:
> On Mon, Jun 18, 2012 at 9:32 AM, Avi Kivity  wrote:
>> On 06/15/2012 10:08 PM, Christoffer Dall wrote:
>>> From: Christoffer Dall 
>>>
>>> Userspace can inject IRQs and FIQs through the KVM_IRQ_LINE VM ioctl.
>>> This ioctl is used since the semantics are in fact two lines that can be
>>> either raised or lowered on the VCPU - the IRQ and FIQ lines.
>>>
>>> KVM needs to know which VCPU it must operate on and whether the FIQ or
>>> IRQ line is raised/lowered. Hence both pieces of information are packed
>>> in the kvm_irq_level->irq field. The irq field value will be:
>>>   IRQ: vcpu_index << 1
>>>   FIQ: (vcpu_index << 1) | 1
>>>
>>> This is documented in Documentation/kvm/api.txt.
>>>
>>> The effect of the ioctl is simply to raise/lower the
>>> corresponding irq_line field on the VCPU struct, which will cause the
>>> world-switch code to raise/lower virtual interrupts when running the
>>> guest on next switch. The wait_for_interrupt flag is also cleared for
>>> raised IRQs or FIQs causing an idle VCPU to become active again. CPUs
>>> in guest mode are kicked to make sure they refresh their interrupt status.
>>
>>>
>>> +static int kvm_arch_vm_ioctl_irq_line(struct kvm *kvm,
>>> +   struct kvm_irq_level *irq_level)
>>> +{
>>> + int mask;
>>> + unsigned int vcpu_idx;
>>> + struct kvm_vcpu *vcpu;
>>> + unsigned long old, new, *ptr;
>>> +
>>> + vcpu_idx = irq_level->irq >> 1;
>>> + if (vcpu_idx >= KVM_MAX_VCPUS)
>>> + return -EINVAL;
>>> +
>>> + vcpu = kvm_get_vcpu(kvm, vcpu_idx);
>>> + if (!vcpu)
>>> + return -EINVAL;
>>> +
>>> + if ((irq_level->irq & 1) == KVM_ARM_IRQ_LINE)
>>> + mask = HCR_VI;
>>> + else /* KVM_ARM_FIQ_LINE */
>>> + mask = HCR_VF;
>>> +
>>> + trace_kvm_set_irq(irq_level->irq, irq_level->level, 0);
>>> +
>>> + ptr = (unsigned long *)&vcpu->arch.irq_lines;
>>> + do {
>>> + old = ACCESS_ONCE(*ptr);
>>> + if (irq_level->level)
>>> + new = old | mask;
>>> + else
>>> + new = old & ~mask;
>>> +
>>> + if (new == old)
>>> + return 0; /* no change */
>>> + } while (cmpxchg(ptr, old, new) != old);
>>
>> Isn't this a complicated
>>
>>
>>   if (level)
>>   set_bit()
>>   else
>>   clear_bit()
>>
>> ?
>>
> 
> we need to atomically know if we changed either the FIQ/IRQ fields and
> atomically update without locking.

So use test_and_set_bit()/test_and_clear_bit().

> (I think you suggested this
> approach, in fact).

I think so too, but it only makes sense if you need to consider both
fields simultaneously (which you don't here).  For example, if IRQ was
asserted while FIQ was already raised, you don't need to kick.  But
that's not the case according to the below.
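
For illustration, the bitop variant could look like this (a sketch; the
HCR_VI_BIT/HCR_VF_BIT names for the bit positions are assumed here):

	int bit = ((irq_level->irq & 1) == KVM_ARM_IRQ_LINE)
			? HCR_VI_BIT : HCR_VF_BIT;
	unsigned long *ptr = (unsigned long *)&vcpu->arch.irq_lines;

	if (irq_level->level) {
		if (test_and_set_bit(bit, ptr))
			return 0;	/* already raised: no change, no kick */
	} else {
		if (!test_and_clear_bit(bit, ptr))
			return 0;	/* already lowered: no change, no kick */
	}
	kvm_vcpu_kick(vcpu);		/* something changed: force a world switch */
	return 0;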

> 
>>> +
>>> + /*
>>> +  * The vcpu irq_lines field was updated, wake up sleeping VCPUs and
>>> +  * trigger a world-switch round on the running physical CPU to set the
>>> +  * virtual IRQ/FIQ fields in the HCR appropriately.
>>> +  */
>>> + kvm_vcpu_kick(vcpu);
>>
>> No need to wake when the line is asserted so you can make this
>> conditional on level.
>>
> 
> we need to trigger a world switch to update the virtualized register
> from the actual running physical CPU if we changed any of the IRQ/FIQ
> fields. We return in the loop above if we didn't change anything.

Okay.


> 
>>> +
>>> + return 0;
>>> +}
>>> +
>>>  long kvm_arch_vcpu_ioctl(struct file *filp,
>>>unsigned int ioctl, unsigned long arg)
>>>  {
>>> @@ -298,7 +345,20 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>>>  long kvm_arch_vm_ioctl(struct file *filp,
>>>  unsigned int ioctl, unsigned long arg)
>>>  {
>>> - return -EINVAL;
>>> + struct kvm *kvm = filp->private_data;
>>> + void __user *argp = (void __user *)arg;
>>> +
>>> + switch (ioctl) {
>>> + case KVM_IRQ_LINE: {
>>> + struct kvm_irq_level irq_event;
>>> +
>>> + if (copy_from_user(&irq_event, argp, sizeof irq_event))
>>> + return -EFAULT;
>>> + return kvm_arch_vm_ioctl_irq_line(kvm, &irq_event);
>>> + }
>>> + default:
>>> + return -EINVAL;
>>> + }
>>>  }
>>
>> Should be in common code guarded by the define introduced previously.
>>
>>
> 
> you mean like this: ?
> 
> 
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index a9f209b..5bf2193 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -535,8 +535,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>   return ret;
>  }
> 
> -static int kvm_arch_vm_ioctl_irq_line(struct kvm *kvm,
> -   struct kvm_irq_level *irq_level)
> +int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_le

Re: [PATCH v8 11/15] ARM: KVM: World-switch implementation

2012-06-19 Thread Avi Kivity
On 06/19/2012 01:05 AM, Christoffer Dall wrote:
>> Premature, but this is sad.  I suggest you split vmid generation from
>> next available vmid.  This allows you to make the generation counter
>> atomic so it may be read outside the lock.
>>
>> You can do
>>
>>    if (likely(kvm->arch.vmid_gen == atomic64_read(&kvm_vmid_gen)))
>>   return;
>>
>>spin_lock(...
>>
> 
> I knew you were going to say something here :), please take a look at
> this and see if you agree:

It looks reasonable wrt locking.

> +
> + /* First user of a new VMID generation? */
> + if (unlikely(kvm_next_vmid == 0)) {
> + atomic64_inc(&kvm_vmid_gen);
> + kvm_next_vmid = 1;
> +
> + /* This does nothing on UP */
> + smp_call_function(reset_vm_context, NULL, 1);

Does this imply a memory barrier?  If not, smp_mb__after_atomic_inc().

> +
> + /*
> +  * On SMP we know no other CPUs can use this CPU's or
> +  * each other's VMID since the kvm_vmid_lock blocks
> +  * them from reentry to the guest.
> +  */
> +
> + reset_vm_context(NULL);

These two lines can be folded as on_each_cpu().

Don't you have a race here where you can increment the generation just
before guest entry?

cpu0                          cpu1
(vmid=0, gen=1)               (gen=0)
--------------                --------------
gen == global_gen, return
                              gen != global_gen
                              increment global_gen
                              smp_call_function
reset_vm_context
                              vmid=0

enter with vmid=0             enter with vmid=0

You must recheck gen after disabling interrupts to ensure global_gen
didn't bump after update_vttbr but before entry.  x86 has a lot of this,
see vcpu_enter_guest() and vcpu->requests (doesn't apply directly to
your case but may come in useful later).
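
A sketch of the shape of the fix (helper names are illustrative, not the
actual ARM code):

	for (;;) {
		update_vttbr(vcpu->kvm);	/* may take kvm_vmid_lock */

		local_irq_disable();
		if (likely(vcpu->kvm->arch.vmid_gen ==
			   atomic64_read(&kvm_vmid_gen)))
			break;	/* still current: enter guest with IRQs off */

		/* global_gen was bumped after our check: redo the update */
		local_irq_enable();
	}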

> 
>>> +
>>> +/**
>>> + * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest 
>>> code
>>> + * @vcpu:The VCPU pointer
>>> + * @run: The kvm_run structure pointer used for userspace state 
>>> exchange
>>> + *
>>> + * This function is called through the VCPU_RUN ioctl called from user 
>>> space. It
>>> + * will execute VM code in a loop until the time slice for the process is 
>>> used
>>> + * or some emulation is needed from user space in which case the function 
>>> will
>>> + * return with return value 0 and with the kvm_run structure filled in 
>>> with the
>>> + * required data for the requested emulation.
>>> + */
>>>  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>>  {
>>> - return -EINVAL;
>>> + int ret = 0;
>>> + sigset_t sigsaved;
>>> +
>>> + if (vcpu->sigset_active)
>>> + sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
>>> +
>>> + run->exit_reason = KVM_EXIT_UNKNOWN;
>>> + while (run->exit_reason == KVM_EXIT_UNKNOWN) {
>>
>> It's not a good idea to read stuff from run unless it's part of the ABI,
>> since userspace can play with it while you're reading it.  It's harmless
>> here but in other places it can lead to a vulnerability.
>>
> 
> ok, so in this case, userspace can 'suppress' an MMIO or interrupt
> window operation.
> 
> I can change to keep some local variable to maintain the state and
> have some API convention for emulation functions, if you feel strongly
> about it, but otherwise it feels to me like the code will be more
> complicated. Do you have other ideas?

x86 uses:

 0 - return to userspace (run prepared)
 1 - return to guest (run untouched)
 -ESOMETHING - return to userspace

as return values from handlers and for locals (usually named 'r').
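
In code, the convention boils down to a loop of roughly this shape (a
sketch of the pattern, not the actual x86 source):

	int r = 1;

	while (r == 1)
		r = vcpu_enter_guest(vcpu);	/* handlers return 0, 1 or -errno */
	/*
	 * r == 0: return to userspace, run already prepared;
	 * r < 0:  return the error (-ESOMETHING) to userspace.
	 */
	return r;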


-- 
error compiling committee.c: too many arguments to function




Re: SOLVED: RE: Garbled audio - Windows 7 64 bit guest on Debian

2012-06-19 Thread Avi Kivity
On 06/18/2012 10:32 PM, Jimmy Crossley wrote:
> 
> I have mostly solved this issue.  The sound works much, much better, but is 
> still not as good as on my host machine.
> 
> I installed PulseAudio and used it instead of ALSA.  In order to get kvm to 
> use it, I set the environment variable QEMU_AUDIO_DRV=pa.  I had been using 
> sudo to start the VM, and that kept this environment variable from being 
> used.  I did a "sudo setcap cap_net_admin+ep /usr/bin/kvm" in order to be 
> able to run kvm under a normal user account.  Now the sound works quite well.
> 

That is good to hear.  But you are giving up on a lot of security if
you're running kvm as root (or with CAP_NET_ADMIN).  qemu supports
setting up the network externally and running with no special privileges.

Of course, it may not matter for your use case.

-- 
error compiling committee.c: too many arguments to function




Re: [PATCH v8 13/15] ARM: KVM: Handle guest faults in KVM

2012-06-19 Thread Avi Kivity
On 06/19/2012 01:20 AM, Christoffer Dall wrote:
> On Mon, Jun 18, 2012 at 9:45 AM, Avi Kivity  wrote:
>> On 06/15/2012 10:09 PM, Christoffer Dall wrote:
>>> From: Christoffer Dall 
>>>
>>> Handles the guest faults in KVM by mapping in corresponding user pages
>>> in the 2nd stage page tables.
>>>
>>> Introduces new ARM-specific kernel memory types, PAGE_KVM_GUEST and
>>> pgprot_guest variables used to map 2nd stage memory for KVM guests.
>>>
>>> Leverages MMU notifiers on KVM/ARM by supporting the kvm_unmap_hva() operation,
>>> where we remove the HVA from the 2nd stage translation. All other KVM MMU
>>> notifier hooks are NOPs.
>>
>> I think you must at least support change_pte (possibly by unmapping).
>> Andrea?
>>
> hmmm, at least for KSM support we need to support change_pte (are
> there other callers for this type of memory?)
> 
> It's not trivial I guess, since we would need to support COW and
> thereby stage-2 permission faults... Marc, right?

As I mentioned, you can support change_pte by unmapping.  This will
cause ksm to be ineffective (pages will only be shared if the guest
doesn't touch them at all), but it's enough to get started.
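
A minimal sketch of that approach, assuming the arch hook takes the usual
kvm_set_spte_hva() shape:

	void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
	{
		/*
		 * Drop the stage-2 mapping instead of updating it; the next
		 * guest access faults and maps the new (possibly KSM-shared)
		 * page.  Correct, but sharing is lost for any page the guest
		 * keeps touching.
		 */
		kvm_unmap_hva(kvm, hva);
	}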

-- 
error compiling committee.c: too many arguments to function




Re: [PATCH v2 0/3] Improve virtio-blk performance

2012-06-19 Thread Stefan Hajnoczi
On Tue, Jun 19, 2012 at 5:24 AM, Asias He  wrote:
> On 06/18/2012 06:58 PM, Stefan Hajnoczi wrote:
>>
>> As long as the latency is decreasing that's good.  But it's worth
>> keeping in mind that these percentages are probably wildly different
>> on real storage devices and/or qemu-kvm.  What we don't know here is
>> whether this bottleneck matters in real environments - results with
>> real storage and with qemu-kvm would be interesting.
>
>
> Yes. Here is the performance data on a Fusion-IO SSD device.
>
> The fio test is performed in an 8-vcpu guest backed by the Fusion-IO device,
> using kvm tool.
>
>    Short version:
>     With bio-based IO path, sequential read/write, random read/write
>     IOPS boost         : 11%, 11%, 13%, 10%
>     Latency improvement: 10%, 10%, 12%, 10%

Nice, I'm glad the improvement shows on real hardware.

Stefan


Re: [PATCH v8 13/15] ARM: KVM: Handle guest faults in KVM

2012-06-19 Thread Andrea Arcangeli
On Tue, Jun 19, 2012 at 12:32:06PM +0300, Avi Kivity wrote:
> On 06/19/2012 01:20 AM, Christoffer Dall wrote:
> > On Mon, Jun 18, 2012 at 9:45 AM, Avi Kivity  wrote:
> >> On 06/15/2012 10:09 PM, Christoffer Dall wrote:
> >>> From: Christoffer Dall 
> >>>
> >>> Handles the guest faults in KVM by mapping in corresponding user pages
> >>> in the 2nd stage page tables.
> >>>
> >>> Introduces new ARM-specific kernel memory types, PAGE_KVM_GUEST and
> >>> pgprot_guest variables used to map 2nd stage memory for KVM guests.
> >>>
> >>> Leverages MMU notifiers on KVM/ARM by supporting the kvm_unmap_hva() operation,
> >>> where we remove the HVA from the 2nd stage translation. All other KVM MMU
> >>> notifier hooks are NOPs.
> >>
> >> I think you must at least support change_pte (possibly by unmapping).
> >> Andrea?
> >>
> > hmmm, at least for KSM support we need to support change_pte (are
> > there other callers for this type of memory?)
> > 
> > It's not trivial I guess, since we would need to support COW and
> > thereby stage-2 permission faults... Marc, right?
> 
> As I mentioned, you can support change_pte by unmapping.  This will
> cause ksm to be ineffective (pages will only be shared if the guest
> doesn't touch them at all), but it's enough to get started.

The main reason change_pte was initially required for KSM to be
effective was that gup_fast was called with write=1
unconditionally. change_pte was also responsible for setting the spte
readonly. But that should have been fixed by now on x86, so KSM should be
effective even despite the lack of change_pte on x86.

If the KVM page fault is calling gfn_to_pfn_async(write=0/1) depending
on whether the vmexit was caused by a write or read access (instead of
gfn_to_pfn, which still has the unconditional write=1), and in turn
it's forced to set the spte readonly after calling
gfn_to_pfn_async(write=0), change_pte is still useful, but it's only a
worthwhile optimization to avoid a spte read fault after every KSM
page merged; it's not strictly required for KSM effectiveness anymore.

In short, if ARM does the right thing with regard to KVM read faults
passed to gup_fast(write=0) and sets the spte readonly, all should
work well with KSM (even if not as optimally as with change_pte).
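
Concretely, the fault path described above looks roughly like this on the
KVM side (a sketch; the exact spte plumbing is arch-specific, and
PT_WRITABLE_MASK is the x86 name):

	bool writable;

	/* pass the real access type instead of an unconditional write=1 */
	pfn = gfn_to_pfn_async(kvm, gfn, &async, write_fault, &writable);
	if (!write_fault && !writable)
		spte &= ~PT_WRITABLE_MASK;	/* readonly spte: a KSM-merged
						 * page stays shared until the
						 * guest actually writes it */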


Re: [PATCH] KVM: PPC: Book3S HV: Drop locks around call to kvmppc_pin_guest_page

2012-06-19 Thread Alexander Graf

On 06.06.2012, at 17:52, Alexander Graf wrote:

> On 06/06/2012 02:28 PM, Avi Kivity wrote:
>> On 06/01/2012 01:20 PM, Paul Mackerras wrote:
>>> At the moment we call kvmppc_pin_guest_page() in kvmppc_update_vpa()
>>> with two spinlocks held: the vcore lock and the vcpu->vpa_update_lock.
>>> This is not good, since kvmppc_pin_guest_page() calls down_read() and
>>> get_user_pages_fast(), both of which can sleep.  This bug was introduced
>>> in 2e25aa5f ("KVM: PPC: Book3S HV: Make virtual processor area
>>> registration more robust").
>>> 
>>> This arranges to drop those spinlocks before calling
>>> kvmppc_pin_guest_page() and re-take them afterwards.  Dropping the
>>> vcore lock in kvmppc_run_core() means we have to set the vcore_state
>>> field to VCORE_RUNNING before we drop the lock, so that other vcpus
>>> won't try to run this vcore.
>>> 
>>> Signed-off-by: Paul Mackerras
>>> ---
>>> Since this bug is in Linus' tree, and it can cause a scheduling while
>>> atomic bug message, can we send this to Linus for inclusion in 3.5,
>>> after review of course?
>>> 
>> Sure, Alex?
> 
> Yup, reviewed and tested. Looks ready to go into the tree to me. Can you pull 
> it in the short way please?

Avi?


Alex



Re: [Qemu-devel] [PATCH] spapr_vscsi: Error handling fixes

2012-06-19 Thread Andreas Färber
On 19.06.2012 08:02, Benjamin Herrenschmidt wrote:
> We were incorrectly g_free'ing an object that isn't allocated
> in one error path, and failed to release it completely in another.
> 
> This fixes qemu crashes with some cases of IO errors.
> 
> Signed-off-by: Benjamin Herrenschmidt 
> ---
>  hw/spapr_vscsi.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/spapr_vscsi.c b/hw/spapr_vscsi.c
> index d2fe3e5..6afc3b4 100644
> --- a/hw/spapr_vscsi.c
> +++ b/hw/spapr_vscsi.c
> @@ -801,6 +801,7 @@ static void vscsi_got_payload(VSCSIState *s, vscsi_crq *crq)
>  if (crq->s.IU_length > sizeof(union viosrp_iu)) {
>  fprintf(stderr, "VSCSI: SRP IU too long (%d bytes) !\n",
>  crq->s.IU_length);
> +vscsi_put_req(req);
>  return;
>  }
>  
> @@ -808,7 +809,8 @@ static void vscsi_got_payload(VSCSIState *s, vscsi_crq *crq)
>  if (spapr_vio_dma_read(&s->vdev, crq->s.IU_data_ptr, &req->iu,
> crq->s.IU_length)) {
>  fprintf(stderr, "vscsi_got_payload: DMA read failure !\n");
> -g_free(req);
> +vscsi_put_req(req);

> + return;

Tab alert. :)

/-F

>  }
>  memcpy(&req->crq, crq, sizeof(vscsi_crq));
>  
> 
> 
> 


-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [PATCH] KVM: PPC: Book3S HV: Drop locks around call to kvmppc_pin_guest_page

2012-06-19 Thread Avi Kivity
On 06/19/2012 01:50 PM, Alexander Graf wrote:
> 
> On 06.06.2012, at 17:52, Alexander Graf wrote:
> 
>> On 06/06/2012 02:28 PM, Avi Kivity wrote:
>>> On 06/01/2012 01:20 PM, Paul Mackerras wrote:
 At the moment we call kvmppc_pin_guest_page() in kvmppc_update_vpa()
 with two spinlocks held: the vcore lock and the vcpu->vpa_update_lock.
 This is not good, since kvmppc_pin_guest_page() calls down_read() and
 get_user_pages_fast(), both of which can sleep.  This bug was introduced
 in 2e25aa5f ("KVM: PPC: Book3S HV: Make virtual processor area
 registration more robust").
 
 This arranges to drop those spinlocks before calling
 kvmppc_pin_guest_page() and re-take them afterwards.  Dropping the
 vcore lock in kvmppc_run_core() means we have to set the vcore_state
 field to VCORE_RUNNING before we drop the lock, so that other vcpus
 won't try to run this vcore.
 
 Signed-off-by: Paul Mackerras
 ---
 Since this bug is in Linus' tree, and it can cause a scheduling while
 atomic bug message, can we send this to Linus for inclusion in 3.5,
 after review of course?
 
>>> Sure, Alex?
>> 
>> Yup, reviewed and tested. Looks ready to go into the tree to me. Can you 
>> pull it in the short way please?
> 
> Avi?

Sorry!  Applied to master, will push upstream shortly.

-- 
error compiling committee.c: too many arguments to function




[FIX PATCH] KVM: Use kvm_kvfree() to free memory allocated by kvm_kvzalloc()

2012-06-19 Thread Takuya Yoshikawa
From: Takuya Yoshikawa 

The following commit did not care about the error handling path:

  commit c1a7b32a14138f908df52d7c53b5ce3415ec6b50
  KVM: Avoid wasting pages for small lpage_info arrays

If memory allocation fails, vfree() will be called with the address
returned by kzalloc().  This patch fixes this issue.

Signed-off-by: Takuya Yoshikawa 
---
 arch/x86/kvm/x86.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a01a424..6ed5983 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6350,7 +6350,7 @@ int kvm_arch_create_memslot(struct kvm_memory_slot *slot, unsigned long npages)
 
 out_free:
for (i = 0; i < KVM_NR_PAGE_SIZES - 1; ++i) {
-   vfree(slot->arch.lpage_info[i]);
+   kvm_kvfree(slot->arch.lpage_info[i]);
slot->arch.lpage_info[i] = NULL;
}
return -ENOMEM;
-- 
1.7.5.4



Re: [FIX PATCH] KVM: Use kvm_kvfree() to free memory allocated by kvm_kvzalloc()

2012-06-19 Thread Avi Kivity
On 06/19/2012 04:04 PM, Takuya Yoshikawa wrote:
> From: Takuya Yoshikawa 
> 
> The following commit did not care about the error handling path:
> 
>   commit c1a7b32a14138f908df52d7c53b5ce3415ec6b50
>   KVM: Avoid wasting pages for small lpage_info arrays
> 
> If memory allocation fails, vfree() will be called with the address
> returned by kzalloc().  This patch fixes this issue.


Thanks, applied to next.


-- 
error compiling committee.c: too many arguments to function




Re: [PATCH 3/4] KVM: MMU: Make kvm_handle_hva() handle range of addresses

2012-06-19 Thread Takuya Yoshikawa
On Mon, 18 Jun 2012 15:11:42 +0300
Avi Kivity  wrote:

> Potential for improvement: don't do 512 iterations on the same large page.
> 
> Something like
> 
> if ((gfn ^ prev_gfn) & mask(level))
> ret |= handler(...)
> 
> with clever selection of the first prev_gfn so it always matches (~gfn
> maybe).


I thought up a better solution:

1. Separate rmap_pde from lpage_info->write_count and
   make this a simple array. (I once tried this.)

2. Use gfn_to_index() and loop over rmap array:
	...
	/* intersection check */
	start = max(start, memslot->userspace_addr);
	end = min(end, memslot->userspace_addr +
		       (memslot->npages << PAGE_SHIFT));
	if (start > end)
		continue;

	/* hva to gfn conversion */
	gfn_start = hva_to_gfn_memslot(start);
	gfn_end = hva_to_gfn_memslot(end);

	/* main part */
	for each level {
		rmapp = __gfn_to_rmap(gfn_start, level, memslot);
		for (idx = gfn_to_index(gfn_start, memslot->base_gfn, level);
		     idx < gfn_to_index(gfn_end, memslot->base_gfn, level); idx++) {
			...
			/* loop over rmap array */
			ret |= handler(kvm, rmapp + idx, data);
		}
	}


Thanks,
Takuya


RE: SOLVED: RE: Garbled audio - Windows 7 64 bit guest on Debian

2012-06-19 Thread Jimmy Crossley
Thanks for that reminder, Avi.  You are right - it would be best to not be 
running kvm as root or with any special capabilities.  I mainly wanted to 
report back that I got the sound working by switching to PulseAudio.  I made 
the other changes so I would not have to run kvm as root and it would see the 
environment variable.  When I run without that capability, it seems that kvm 
cannot create the tun device.  I will do more research and if I can't find the 
information I need, I'll start a new thread.

--
Jimmy Crossley
CoNetrix
Sentry Plaza III
5214 68th Street, Suite 200
Lubbock, TX  79424
jcross...@conetrix.com
https://www.CoNetrix.com
Telephone: 806-687-8600
Toll Free: 800-356-6568
Fax: 806-687-8511
This e-mail message (and attachments) may contain confidential CoNetrix 
information. If you are not the intended recipient, you cannot use, distribute 
or copy the message or attachments. In such a case, please notify the sender by 
return e-mail immediately and erase all copies of the message and attachments. 
Opinions, conclusions and other information in this message and attachments 
that do not relate to official business are neither given nor endorsed by 
CoNetrix.


-----Original Message-----
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Tuesday, June 19, 2012 04:21
> To: Jimmy Crossley
> Cc: kvm@vger.kernel.org
> Subject: Re: SOLVED: RE: Garbled audio - Windows 7 64 bit guest on Debian
>
> On 06/18/2012 10:32 PM, Jimmy Crossley wrote:
> >
> > I have mostly solved this issue.  The sound works much, much better, but
> > is still not as good as on my host machine.
> >
> > I installed PulseAudio and used it instead of ALSA.  In order to get kvm to
> > use it, I set the environment variable QEMU_AUDIO_DRV=pa.  I had been using
> > sudo to start the VM, and that kept this environment variable from being
> > used.  I did a "sudo setcap cap_net_admin+ep /usr/bin/kvm" in order to be
> > able to run kvm under a normal user account.  Now the sound works quite well.
> >
>
> That is good to hear.  But you are giving up on a lot of security if
> you're running kvm as root (or with CAP_NET_ADMIN).  qemu supports
> setting up the network externally and running with no special privileges.
>
> Of course, it may not matter for your use case.
>
> --
> error compiling committee.c: too many arguments to function
>



Re: KVM call agenda for Tuesday, June 19th

2012-06-19 Thread Juan Quintela
Juan Quintela  wrote:
> Hi
>
> Please send in any agenda items you are interested in covering.
>
> Anthony suggested for last week:
> - multithreading vhost (and general vhost improvements)
>
> I suggest:
> - status of migration: post-copy, IDL, XBZRLE, huge memory, ...
>   Will send an email with a status before tomorrow's call.

XBZRLE: v12 is coming today or so.


These three patches should be a no-brainer (just refactoring code).
1st one is shared with postcopy.

[PATCH v11 1/9] Add MigrationParams structure
[PATCH v11 5/9] Add uleb encoding/decoding functions
[PATCH v11 6/9] Add save_block_hdr function

These are the ones that we can discuss.

[PATCH v11 2/9] Add migration capabilites
[PATCH v11 3/9] Add XBZRLE documentation
[PATCH v11 4/9] Add cache handling functions
[PATCH v11 7/9] Add XBZRLE to ram_save_block and ram_save_live
[PATCH v11 8/9] Add set_cachesize command

Postcopy:  This is just refactoring that can be integrated.

[PATCH v2 01/41] arch_init: export sort_ram_list() and ram_save_block()
[PATCH v2 02/41] arch_init: export RAM_SAVE_xxx flags for postcopy
[PATCH v2 03/41] arch_init/ram_save: introduce constant for ram save version = 4
[PATCH v2 04/41] arch_init: refactor host_from_stream_offset()
[PATCH v2 05/41] arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
[PATCH v2 06/41] arch_init: refactor ram_save_block()
[PATCH v2 07/41] arch_init/ram_save_live: factor out ram_save_limit
[PATCH v2 08/41] arch_init/ram_load: refactor ram_load
[PATCH v2 09/41] arch_init: introduce helper function to find ram block with id 
string
[PATCH v2 10/41] arch_init: simplify a bit by ram_find_block()
[PATCH v2 11/41] arch_init: factor out counting transferred bytes
[PATCH v2 12/41] arch_init: factor out setting last_block, last_offset
[PATCH v2 13/41] exec.c: factor out qemu_get_ram_ptr()
[PATCH v2 14/41] exec.c: export last_ram_offset()
[PATCH v2 15/41] savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
[PATCH v2 16/41] savevm: qemu_pending_size() to return pending buffered size
[PATCH v2 17/41] savevm, buffered_file: introduce method to drain buffer of 
buffered file
[PATCH v2 18/41] QEMUFile: add qemu_file_fd() for later use
[PATCH v2 19/41] savevm/QEMUFile: drop qemu_stdio_fd
[PATCH v2 20/41] savevm/QEMUFileSocket: drop duplicated member fd
[PATCH v2 21/41] savevm: rename QEMUFileSocket to QEMUFileFD, socket_close to 
fd_close
[PATCH v2 22/41] savevm/QEMUFile: introduce qemu_fopen_fd
[PATCH v2 23/41] migration.c: remove redundant line in migrate_init()
[PATCH v2 24/41] migration: export migrate_fd_completed() and 
migrate_fd_cleanup()
[PATCH v2 25/41] migration: factor out parameters into MigrationParams
[PATCH v2 26/41] buffered_file: factor out buffer management logic
[PATCH v2 27/41] buffered_file: Introduce QEMUFileNonblock for nonblock write
[PATCH v2 28/41] buffered_file: add qemu_file to read/write to buffer in memory

This is postcopy proper.  From this point on, postcopy needs the things
addressed in the previous review, and from there probably (at least)
another review.  One thing to take into account is that the umem (or
whatever you want to call it) should be able to work over RDMA.  Anyone
who knows anything about RDMA care to comment on this?

[PATCH v2 29/41] umem.h: import Linux umem.h
[PATCH v2 30/41] update-linux-headers.sh: teach umem.h to 
update-linux-headers.sh
[PATCH v2 31/41] configure: add CONFIG_POSTCOPY option
[PATCH v2 32/41] savevm: add new section that is used by postcopy
[PATCH v2 33/41] postcopy: introduce -postcopy and -postcopy-flags option
[PATCH v2 34/41] postcopy outgoing: add -p and -n option to migrate command
[PATCH v2 35/41] postcopy: introduce helper functions for postcopy
[PATCH v2 36/41] postcopy: implement incoming part of postcopy live migration
[PATCH v2 37/41] postcopy: implement outgoing part of postcopy live migration
[PATCH v2 38/41] postcopy/outgoing: add forward, backward option to specify the 
size of prefault
[PATCH v2 39/41] postcopy/outgoing: implement prefault
[PATCH v2 40/41] migrate: add -m (movebg) option to migrate command
[PATCH v2 41/41] migration/postcopy: add movebg mode

Huge memory migration.
These should be trivial, and can be integrated.

[PATCH 1/7] Add spent time for migration
[PATCH 2/7] Add tracepoints for savevm section start/end
[PATCH 3/7] No need to iterate if we already are over the limit
[PATCH 4/7] Only TCG needs TLB handling
[PATCH 5/7] Only calculate expected_time for stage 2

This one is also trivial, but in previous reviews Anthony wanted to have
the migration thread before we integrated this one.

[PATCH 6/7] Exit loop if we have been there too long

For this one, Anthony wanted a different approach that improves bitmap
handling.  Not done yet.

[PATCH 7/7] Maintaing number of dirty pages

IDL patchset.  I am not against generating the VMState information, but
I am trying to understand how the patch works.  Notice that I don't grok
Python; this is one of the reasons it is taking long.

This was one of the 

Re: [Qemu-devel] KVM call agenda for Tuesday, June 19th

2012-06-19 Thread Anthony Liguori

On 06/19/2012 08:54 AM, Juan Quintela wrote:

Juan Quintela  wrote:

Hi

Please send in any agenda items you are interested in covering.

Anthony suggested for last week:
- multithreading vhost (and general vhost improvements)

I suggest:
- status of migration: post-copy, IDL, XBZRLE, huge memory, ...
   Will send an email with a status before tomorrow's call.


XBZRLE: v12 is coming today or so.


These three patches should be a no-brainer (just refactoring code).
1st one is shared with postcopy.

[PATCH v11 1/9] Add MigrationParams structure
[PATCH v11 5/9] Add uleb encoding/decoding functions
[PATCH v11 6/9] Add save_block_hdr function

These are the ones that we can discuss.

[PATCH v11 2/9] Add migration capabilites
[PATCH v11 3/9] Add XBZRLE documentation
[PATCH v11 4/9] Add cache handling functions
[PATCH v11 7/9] Add XBZRLE to ram_save_block and ram_save_live
[PATCH v11 8/9] Add set_cachesize command

Postcopy:  This is just refactoring that can be integrated.

[PATCH v2 01/41] arch_init: export sort_ram_list() and ram_save_block()
[PATCH v2 02/41] arch_init: export RAM_SAVE_xxx flags for postcopy
[PATCH v2 03/41] arch_init/ram_save: introduce constant for ram save version = 4
[PATCH v2 04/41] arch_init: refactor host_from_stream_offset()
[PATCH v2 05/41] arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE case
[PATCH v2 06/41] arch_init: refactor ram_save_block()
[PATCH v2 07/41] arch_init/ram_save_live: factor out ram_save_limit
[PATCH v2 08/41] arch_init/ram_load: refactor ram_load
[PATCH v2 09/41] arch_init: introduce helper function to find ram block with id 
string
[PATCH v2 10/41] arch_init: simplify a bit by ram_find_block()
[PATCH v2 11/41] arch_init: factor out counting transferred bytes
[PATCH v2 12/41] arch_init: factor out setting last_block, last_offset
[PATCH v2 13/41] exec.c: factor out qemu_get_ram_ptr()
[PATCH v2 14/41] exec.c: export last_ram_offset()
[PATCH v2 15/41] savevm: export qemu_peek_buffer, qemu_peek_byte, qemu_file_skip
[PATCH v2 16/41] savevm: qemu_pending_size() to return pending buffered size
[PATCH v2 17/41] savevm, buffered_file: introduce method to drain buffer of 
buffered file
[PATCH v2 18/41] QEMUFile: add qemu_file_fd() for later use
[PATCH v2 19/41] savevm/QEMUFile: drop qemu_stdio_fd
[PATCH v2 20/41] savevm/QEMUFileSocket: drop duplicated member fd
[PATCH v2 21/41] savevm: rename QEMUFileSocket to QEMUFileFD, socket_close to 
fd_close
[PATCH v2 22/41] savevm/QEMUFile: introduce qemu_fopen_fd
[PATCH v2 23/41] migration.c: remove redundant line in migrate_init()
[PATCH v2 24/41] migration: export migrate_fd_completed() and 
migrate_fd_cleanup()
[PATCH v2 25/41] migration: factor out parameters into MigrationParams
[PATCH v2 26/41] buffered_file: factor out buffer management logic
[PATCH v2 27/41] buffered_file: Introduce QEMUFileNonblock for nonblock write
[PATCH v2 28/41] buffered_file: add qemu_file to read/write to buffer in memory

This is postcopy proper.  From this point on, postcopy needs the things
addressed in the previous review, and from there probably (at least)
another review.  One thing to take into account is that the umem (or
whatever you want to call it) should be able to work over RDMA.  Anyone
who knows anything about RDMA care to comment on this?

[PATCH v2 29/41] umem.h: import Linux umem.h
[PATCH v2 30/41] update-linux-headers.sh: teach umem.h to 
update-linux-headers.sh
[PATCH v2 31/41] configure: add CONFIG_POSTCOPY option
[PATCH v2 32/41] savevm: add new section that is used by postcopy
[PATCH v2 33/41] postcopy: introduce -postcopy and -postcopy-flags option
[PATCH v2 34/41] postcopy outgoing: add -p and -n option to migrate command
[PATCH v2 35/41] postcopy: introduce helper functions for postcopy
[PATCH v2 36/41] postcopy: implement incoming part of postcopy live migration
[PATCH v2 37/41] postcopy: implement outgoing part of postcopy live migration
[PATCH v2 38/41] postcopy/outgoing: add forward, backward option to specify the 
size of prefault
[PATCH v2 39/41] postcopy/outgoing: implement prefault
[PATCH v2 40/41] migrate: add -m (movebg) option to migrate command
[PATCH v2 41/41] migration/postcopy: add movebg mode


I'm not at all convinced that postcopy is a good idea.  There needs to be a
clear expression of what the value proposition is, backed by benchmarks.  Those
benchmarks need to include latency measurements of downtime which, so far, I've
not seen.


I don't want to take any postcopy patches until this discussion happens.

Regards,

Anthony Liguori



Huge memory migration.
These should be trivial, and can be integrated.

[PATCH 1/7] Add spent time for migration
[PATCH 2/7] Add tracepoints for savevm section start/end
[PATCH 3/7] No need to iterate if we already are over the limit
[PATCH 4/7] Only TCG needs TLB handling
[PATCH 5/7] Only calculate expected_time for stage 2

This one is also trivial, but in previous reviews Anthony wanted to have
the migration thread before we integrated this one.

[PATCH 6/7] Exit loop i

Re: [Qemu-devel] KVM call agenda for Tuesday, June 19th

2012-06-19 Thread Takuya Yoshikawa
On Tue, 19 Jun 2012 09:01:36 -0500
Anthony Liguori  wrote:

> I'm not at all convinced that postcopy is a good idea.  There needs to be a
> clear expression of what the value proposition is, backed by benchmarks.
> Those benchmarks need to include latency measurements of downtime which,
> so far, I've not seen.
> 
> I don't want to take any postcopy patches until this discussion happens.

FWIW:

I rather see postcopy as a way of migrating guests forcibly and I know
a service in which such a way is needed: emergency migration.  There is
also a product which does live migration when some hardware problems are
detected (as a semi-FT solution) -- in such cases, we cannot wait until
the guest becomes calm.

Although I am not certain whether QEMU can be used for such products,
it may be worth thinking about.

Thanks,
Takuya


Re: [Qemu-devel] KVM call agenda for Tuesday, June 19th

2012-06-19 Thread Chegu Vinod
Hello,

Wanted to share some preliminary data from live migration experiments on a 
setup 
that is perhaps one of the larger ones.  

We used Juan's "huge_memory" patches (without the separate migration thread) 
and 
measured the total migration time and the time taken for stage 3 ("downtime"). 
Note: We didn't change the default "downtime" (30ms?). We had a private 10Gig 
back-to-back link between the two hosts..and we set the migration speed to 
10Gig. 

The "workloads" chosen were ones that we could easily setup. All experiments 
were done without using virsh/virt-manager (i.e. direct interaction with the 
qemu monitor prompt).  Pl. see the data below. 

As the guest size increased (and for busier workloads) we observed that
network connections were getting dropped not only during the "downtime" (i.e.
stage 3) but at times also during the iterative pre-copy phase (i.e. stage
2).  Perhaps some of this will get fixed when we have the migration thread
implemented.

We had also briefly tried the proposed delta compression changes (easier to say
than XBZRLE :)) on a smaller configuration. For the simple workloads (perhaps
there was not much temporal locality in them) it didn't seem to show
improvements; instead it took much longer to migrate (high cache miss
penalty?). Waiting for the updated version of XBZRLE for further experiments
to see how well it scales on this larger setup...

FYI
Vinod

---
10VCPUs/128G
---
1) Idle guest
Total migration time : 124585 ms, 
Stage_3_time : 941 ms , 
Total MB transferred : 2720


2) AIM7-compute (2000 users)
Total migration time : 123540 ms, 
Stage_3_time : 726 ms , 
Total MB transferred : 3580

3) SpecJBB (modified to run 10 warehouse threads for a long duration of time)
Total migration time : 165720 ms, 
Stage_3_time : 6851 ms , 
Total MB transferred : 19656


4) Google SAT  (-s 3600 -C 5 -i 5)
Total migration time : 411827 ms, 
Stage_3_time : 77807 ms , 
Total MB transferred : 142136



---
20VCPUs/256G
---

1) Idle  guest
Total migration time : 259938 ms, 
Stage_3_time : 1998 ms , 
Total MB transferred : 5114

2) AIM7-compute (2000 users)
Total migration time : 261336 ms, 
Stage_3_time : 2107 ms , 
Total MB transferred : 5473

3) SpecJBB (modified to run 20 warehouse threads for a long duration of time)
Total migration time : 390548 ms, 
Stage_3_time : 19596 ms , 
Total MB transferred : 48109

4) Google SAT  (-s 3600 -C 10 -i 10)
Total migration time : 780150 ms, 
Stage_3_time : 90346 ms , 
Total MB transferred : 251287


---
30VCPUs/384G
---

1) Idle guest
Total migration time : 501704 ms, 
Stage_3_time : 2835 ms , 
Total MB transferred : 15731


2) AIM7-compute (2000 users)
Total migration time : 496001 ms, 
Stage_3_time : 3884 ms , 
Total MB transferred : 9375


3) SpecJBB (modified to run 30 warehouse threads for a long duration of time)
Total migration time : 611075 ms, 
Stage_3_time : 17107 ms , 
Total MB transferred : 48862


4) Google SAT  (-s 3600 -C 15 -i 15)  (look at /tmp/kvm_30w_Goog)
Total migration time : 1348102 ms, 
Stage_3_time : 128531 ms , 
Total MB transferred : 367524



---
40VCPUs/512G
---

1) Idle guest
Total migration time : 780257 ms, 
Stage_3_time : 3770 ms , 
Total MB transferred : 13330


2) AIM7-compute (2000 users)
Total migration time : 720963 ms, 
Stage_3_time : 3966 ms , 
Total MB transferred : 10595

3) SpecJBB (modified to run 40 warehouse threads for a long duration of time)
Total migration time : 863577 ms, 
Stage_3_time : 25149 ms , 
Total MB transferred : 54685

4) Google SAT  (-s 3600 -C 20 -i 20)
Total migration time : 2585039 ms, 
Stage_3_time : 177625 ms , 
Total MB transferred : 493575




Re: [Qemu-devel] KVM call agenda for Tuesday, June 19th

2012-06-19 Thread Michael Roth
On Tue, Jun 19, 2012 at 03:54:23PM +0200, Juan Quintela wrote:
> Juan Quintela  wrote:
> > Hi
> >
> > Please send in any agenda items you are interested in covering.
> >
> > Anthony suggested for last week:
> > - multithreading vhost (and general vhost improvements)
> >
> > I suggest:
> > - status of migration: post-copy, IDL, XBZRLE, huge memory, ...
> >   Will send an email with a status before tomorrow's call.
> 
> XBZRLE: v12 is coming today or so.
> 
> 
> These three patches should be a no-brainer (just refactoring code).
> 1st one is shared with postcopy.
> 
> [PATCH v11 1/9] Add MigrationParams structure
> [PATCH v11 5/9] Add uleb encoding/decoding functions
> [PATCH v11 6/9] Add save_block_hdr function
> 
> These are the ones that we can discuss.
> 
> [PATCH v11 2/9] Add migration capabilites
> [PATCH v11 3/9] Add XBZRLE documentation
> [PATCH v11 4/9] Add cache handling functions
> [PATCH v11 7/9] Add XBZRLE to ram_save_block and ram_save_live
> [PATCH v11 8/9] Add set_cachesize command
> 
> Postcopy:  This is just refactoring that can be integrated.
> 
> [PATCH v2 01/41] arch_init: export sort_ram_list() and ram_save_block()
> [PATCH v2 02/41] arch_init: export RAM_SAVE_xxx flags for postcopy
> [PATCH v2 03/41] arch_init/ram_save: introduce constant for ram save version 
> = 4
> [PATCH v2 04/41] arch_init: refactor host_from_stream_offset()
> [PATCH v2 05/41] arch_init/ram_save_live: factor out RAM_SAVE_FLAG_MEM_SIZE 
> case
> [PATCH v2 06/41] arch_init: refactor ram_save_block()
> [PATCH v2 07/41] arch_init/ram_save_live: factor out ram_save_limit
> [PATCH v2 08/41] arch_init/ram_load: refactor ram_load
> [PATCH v2 09/41] arch_init: introduce helper function to find ram block with 
> id string
> [PATCH v2 10/41] arch_init: simplify a bit by ram_find_block()
> [PATCH v2 11/41] arch_init: factor out counting transferred bytes
> [PATCH v2 12/41] arch_init: factor out setting last_block, last_offset
> [PATCH v2 13/41] exec.c: factor out qemu_get_ram_ptr()
> [PATCH v2 14/41] exec.c: export last_ram_offset()
> [PATCH v2 15/41] savevm: export qemu_peek_buffer, qemu_peek_byte, 
> qemu_file_skip
> [PATCH v2 16/41] savevm: qemu_pending_size() to return pending buffered size
> [PATCH v2 17/41] savevm, buffered_file: introduce method to drain buffer of 
> buffered file
> [PATCH v2 18/41] QEMUFile: add qemu_file_fd() for later use
> [PATCH v2 19/41] savevm/QEMUFile: drop qemu_stdio_fd
> [PATCH v2 20/41] savevm/QEMUFileSocket: drop duplicated member fd
> [PATCH v2 21/41] savevm: rename QEMUFileSocket to QEMUFileFD, socket_close to 
> fd_close
> [PATCH v2 22/41] savevm/QEMUFile: introduce qemu_fopen_fd
> [PATCH v2 23/41] migration.c: remove redundant line in migrate_init()
> [PATCH v2 24/41] migration: export migrate_fd_completed() and 
> migrate_fd_cleanup()
> [PATCH v2 25/41] migration: factor out parameters into MigrationParams
> [PATCH v2 26/41] buffered_file: factor out buffer management logic
> [PATCH v2 27/41] buffered_file: Introduce QEMUFileNonblock for nonblock write
> [PATCH v2 28/41] buffered_file: add qemu_file to read/write to buffer in 
> memory
> 
> This is postcopy proper.  From this point on, postcopy needs the things
> addressed in the previous review, and from there probably (at least)
> another review.  One thing to take into account is that the umem (or
> whatever you want to call it) should be able to work over RDMA.  Anyone
> who knows anything about RDMA care to comment on this?
> 
> [PATCH v2 29/41] umem.h: import Linux umem.h
> [PATCH v2 30/41] update-linux-headers.sh: teach umem.h to 
> update-linux-headers.sh
> [PATCH v2 31/41] configure: add CONFIG_POSTCOPY option
> [PATCH v2 32/41] savevm: add new section that is used by postcopy
> [PATCH v2 33/41] postcopy: introduce -postcopy and -postcopy-flags option
> [PATCH v2 34/41] postcopy outgoing: add -p and -n option to migrate command
> [PATCH v2 35/41] postcopy: introduce helper functions for postcopy
> [PATCH v2 36/41] postcopy: implement incoming part of postcopy live migration
> [PATCH v2 37/41] postcopy: implement outgoing part of postcopy live migration
> [PATCH v2 38/41] postcopy/outgoing: add forward, backward option to specify 
> the size of prefault
> [PATCH v2 39/41] postcopy/outgoing: implement prefault
> [PATCH v2 40/41] migrate: add -m (movebg) option to migrate command
> [PATCH v2 41/41] migration/postcopy: add movebg mode
> 
> Huge memory migration.
> These should be trivial, and can be integrated.
> 
> [PATCH 1/7] Add spent time for migration
> [PATCH 2/7] Add tracepoints for savevm section start/end
> [PATCH 3/7] No need to iterate if we already are over the limit
> [PATCH 4/7] Only TCG needs TLB handling
> [PATCH 5/7] Only calculate expected_time for stage 2
> 
> This one is also trivial, but Anthony in previous reviews wanted to have
> migration-thread before we integrated this one.
> 
> [PATCH 6/7] Exit loop if we have been there too long
> 
> For this one, Anthony wanted a different approach improving bitmap
> handling.

Re: [Qemu-devel] KVM call agenda for Tuesday, June 19th

2012-06-19 Thread Michael Roth
On Tue, Jun 19, 2012 at 11:34:42PM +0900, Takuya Yoshikawa wrote:
> On Tue, 19 Jun 2012 09:01:36 -0500
> Anthony Liguori  wrote:
> 
> > I'm not at all convinced that postcopy is a good idea.  There needs to
> > be a clear expression of what the value proposition is that's backed by
> > benchmarks.  Those benchmarks need to include latency measurements of
> > downtime which, so far, I've not seen.
> > 
> > I don't want to take any postcopy patches until this discussion happens.
> 
> FWIW:
> 
> I rather see postcopy as a way of migrating guests forcibly, and I know
> a service in which such a capability is needed: emergency migration.  There
> is also a product which does live migration when some hardware problems are
> detected (as a semi-FT solution) -- in such cases, we cannot wait until
> the guest becomes calm.

Ignoring max downtime values when we've determined that the target is no
longer converging would be another option: essentially a
use_strict_max_downtime knob that can be set on a per-migration basis,
where, if it is not set, we can "give up" on maintaining the max_downtime
once it's clear that progress is no longer being made.
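
Concretely, something like the following in the stage-2 completion
check (a sketch; use_strict_max_downtime and not_converging are
hypothetical names, not existing QEMU interfaces):

#include <stdbool.h>
#include <stdint.h>

/* Sketch only: decide whether stage 2 of the migration may complete. */
static bool migration_may_complete(uint64_t expected_downtime,
                                   uint64_t max_downtime,
                                   bool use_strict_max_downtime,
                                   bool not_converging)
{
    if (expected_downtime <= max_downtime) {
        return true;                 /* normal completion criterion */
    }
    if (!use_strict_max_downtime && not_converging) {
        return true;                 /* give up on the bound, finish anyway */
    }
    return false;
}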

> 
> Although I am not certain whether QEMU can be used for such products,
> it may be worth thinking about.
> 
> Thanks,
>   Takuya
> 


Re: [PATCH 1/3] block: Introduce __blk_segment_map_sg() helper

2012-06-19 Thread Tejun Heo
Hello,

On Mon, Jun 18, 2012 at 7:02 PM, Asias He  wrote:
>> I *hope* this is a bit prettier.  e.g. Do we really need to pass in
>> @sglist and keep using "goto new_segment"?
>
> I think this deserves another patch on top of this splitting one. I'd like
> to clean this up later.

Yeap, doing it in a separate patch would be better. It would be great
if you can include such a patch in this series. :)

Thanks.

-- 
tejun


Re: asm/kvm_para.h from asm-generic (was: Re: linux-next: triage for April 19, 2012)

2012-06-19 Thread Sam Ravnborg
> >>
> >> I just noticed include/asm-generic/Kbuild.asm already had
> >>
> >> ifneq ($(wildcard $(srctree)/arch/$(SRCARCH)/include/asm/kvm_para.h \
> >>                  $(srctree)/include/asm-$(SRCARCH)/kvm_para.h),)
> >> header-y  += kvm_para.h
> >> endif
> >>
> >> but this doesn't seem to work.
> >>
> >> Kbuild people: which one is correct?

Any use of wildcards in include/asm-generic/Kbuild.asm is bogus.
We should:
1) add explicit header-y for kvm.h + a.out.h to the relevant architectures.
2) add explicit header-y for kvm_para.h for the architectures that need
this file
3) and drop the header-y / dummy file for the others.
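
Concretely, for each such arch this amounts to something like
(sketch; the arch and exact header set are illustrative):

# arch/x86/include/asm/Kbuild (illustrative)
header-y += kvm.h
header-y += kvm_para.h
header-y += a.out.h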

Anything else is just papering over the initial wrongdoings.

I actually coded up "header-n" support to explicitly say:
do not include this file for this arch - but looking at the
actual problem being solved this is just bogus so I killed
it off again.

Sam


Re: asm/kvm_para.h from asm-generic (was: Re: linux-next: triage for April 19, 2012)

2012-06-19 Thread Sam Ravnborg
On Tue, Jun 19, 2012 at 08:48:03PM +0200, Sam Ravnborg wrote:
> > >>
> > >> I just noticed include/asm-generic/Kbuild.asm already had
> > >>
> > >> ifneq ($(wildcard $(srctree)/arch/$(SRCARCH)/include/asm/kvm_para.h \
> > >>                  $(srctree)/include/asm-$(SRCARCH)/kvm_para.h),)
> > >> header-y  += kvm_para.h
> > >> endif
> > >>
> > >> but this doesn't seem to work.
> > >>
> > >> Kbuild people: which one is correct?
> 
> Any use of wildcards in include/asm-generic/Kbuild.asm is bogus.
> We should:
> 1) add explicit header-y for kvm.h + a.out.h to the relevant architectures.
> 2) add explicit header-y for kvm_para.h for the architectures that need
> this file
> 3) and drop the header-y / dummy file for the others.
> 
> Anything else is just papering over the initial wrongdoings.
> 
> I actually coded up "header-n" support to explicitly say:
> do not include this file for this arch - but looking at the
> actual problem being solved this is just bogus so I killed
> it off again.

Just to be clear - I know this will break headers_check_all.
I will try to take a look soon for a better generic solution.

Sam


Re: [Qemu-devel] [PATCH] spapr_vscsi: Error handling fixes

2012-06-19 Thread Benjamin Herrenschmidt
On Tue, 2012-06-19 at 13:31 +0200, Andreas Färber wrote:
> Tab alert. :)

Argh :-) Gotta get an emacs config for use with qemu :-)

Cheers,
Ben.




Re: [Qemu-devel] [PATCH] spapr_vscsi: Error handling fixes

2012-06-19 Thread Alexander Graf

On 19.06.2012, at 22:18, Benjamin Herrenschmidt wrote:

> On Tue, 2012-06-19 at 13:31 +0200, Andreas Färber wrote:
>> Tab alert. :)
> 
> Argh :-) Gotta get an emacs config for use with qemu :-)

So do you want to resend? I can just fix it on the fly too.


Alex



Regarding improving ple handler (vcpu_on_spin)

2012-06-19 Thread Raghavendra K T

In the PLE handler code, the last_boosted_vcpu (lbv) variable serves
as the reference point for where to start when we enter the handler.

lbv = kvm->lbv;
for each vcpu i of kvm:
    if i is eligible:
        if yield_to(i) is successful:
            lbv = i

Currently this variable is per VM and it is set after we do
yield_to(target); unfortunately, it may take a little longer than we
expect (depending on the target's lag in the rb tree) to come back
again after a successful yield and set the value.

So when several ple_handler entries happen before it is set,
all of them start from the same place (and overall RR is also slower).

Also, statistical analysis (below) shows that lbv is not very well
distributed with the current approach.

Naturally, the first approach is to move lbv before yield_to, without
bothering about the failure case, to make RR fast (this was in Rik's V4
vcpu_on_spin patch series).

But when I did performance analysis in the no-overcommit scenario,
I saw violent/cascaded directed yields happening, leading to more
CPU wasted in spinning (huge degradation in 1x and improvement in 3x;
I assume this was the reason it was moved after yield_to in V5 of the
vcpu_on_spin series).

The second approach I tried was:
(1) get rid of the per-kvm lbv variable, and
(2) have everybody who enters the handler start from a random vcpu as
the reference point.

The above gave a good distribution of starting points (and a performance
improvement in the 32-vcpu guest I tested), and also, IMO, it scales well
for larger VMs.
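
The sketch below shows the idea (kvm_vcpu_eligible() is just a
placeholder for the existing checks -- vcpu == me, waitqueue_active(),
etc. -- and this is not the exact patch I tested):

/* Sketch: scan from a random reference point instead of the per-VM
 * last_boosted_vcpu; the two passes handle the wrap-around. */
static void kvm_vcpu_on_spin_random(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int start = random32() % atomic_read(&kvm->online_vcpus);
	int i, pass;

	for (pass = 0; pass < 2; pass++) {
		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (!pass && i < start)
				continue;	/* skip up to the reference point */
			else if (pass && i >= start)
				break;		/* wrapped all the way around */
			if (!kvm_vcpu_eligible(me, vcpu))
				continue;
			if (kvm_vcpu_yield_to(vcpu) > 0)
				return;		/* successful directed yield */
		}
	}
}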

Analysis
=
Four 32 vcpu guest running with one of them running kernbench.

"PLE handler yield stat" is the statistics for the successfully
yielded case (for 32 vcpus).

"PLE handler start stat" is the statistics for the frequency of
each vcpu index as the starting point (for 32 vcpus).

snapshot1
=
PLE handler yield stat :
274391  33088  32554  46688  46653  48742  48055  37491
38839  31799  28974  30303  31466  45936  36208  51580
32754  53441  28956  30738  37940  37693  26183  40022
31725  41879  23443  35826  40985  30447  37352  35445  

PLE handler start stat :
433590  383318  204835  169981  193508  203954  175960  139373
153835  125245  118532  140092  135732  134903  119349  149467
109871  160404  117140  120554  144715  125099  108527  125051
111416  141385  94815  138387  154710  116270  123130  173795

snapshot2

PLE handler yield stat :
1957091  59383  67866  65474  100335  77683  80958  64073
53783  44620  80131  81058  66493  56677  74222  74974
42398  132762  48982  70230  78318  65198  54446  104793
59937  57974  73367  96436  79922  59476  58835  63547  

PLE handler start stat :
2555089  611546  461121  346769  435889  452398  407495  314403
354277  298006  364202  461158  344783  288263  342165  357270
270887  451660  300020  332120  378403  317848  307969  414282
351443  328501  352840  426094  375050  330016  347540  371819

So questions I have in mind is,

1. Do you think going for randomizing last_boosted_vcpu and getting rid
of the per-VM variable is better?

2. Can/Do we have a mechanism by which we can decide not to yield to a
vcpu that is doing frequent PLE exits (possibly because it is doing
unnecessary busy-waits), or instead do yield_to a better candidate?

On a side note: with the pv patches I have tried doing yield_to on a
kicked VCPU in the vcpu_block path, and it gives some performance
improvement.

Please let me know if you have any comments/suggestions.



Re: [Qemu-devel] [PATCH] spapr_vscsi: Error handling fixes

2012-06-19 Thread Benjamin Herrenschmidt
On Tue, 2012-06-19 at 22:20 +0200, Alexander Graf wrote:
> So do you want to resend? I can just fix it on the fly too.

If you can then sure, please do :-)

Cheers,
Ben.




[PATCH] kvm: handle last_boosted_vcpu = 0 case

2012-06-19 Thread Rik van Riel
On Wed, 20 Jun 2012 01:50:50 +0530
Raghavendra K T  wrote:

> 
> In the PLE handler code, the last_boosted_vcpu (lbv) variable serves
> as the reference point for where to start when we enter the handler.

> Also, statistical analysis (below) shows that lbv is not very well
> distributed with the current approach.

You are the second person to spot this bug today (yes, today).

Due to time zones, the first person has not had a chance yet to
test the patch below, which might fix the issue...

Please let me know how it goes.

8<

If last_boosted_vcpu == 0, then we fall through all test cases and
may end up with all VCPUs pouncing on vcpu 0.  With a large enough
guest, this can result in enormous runqueue lock contention, which
can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.

Signed-off-by: Rik van Riel 
---
 virt/kvm/kvm_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 */
for (pass = 0; pass < 2 && !yielded; pass++) {
kvm_for_each_vcpu(i, vcpu, kvm) {
-   if (!pass && i < last_boosted_vcpu) {
+   if (!pass && i <= last_boosted_vcpu) {
i = last_boosted_vcpu;
continue;
} else if (pass && i > last_boosted_vcpu)



Re: [PATCH v3] spapr: Add "memop" hypercall

2012-06-19 Thread Alexander Graf

On 19.06.2012, at 08:21, Benjamin Herrenschmidt wrote:

> This adds a qemu-specific hypervisor call to the pseries machine
> which allows one to do what amounts to memmove, memcpy and xor over
> regions of physical memory such as the framebuffer.
> 
> This is the simplest way to get usable framebuffer speed from
> SLOF since the framebuffer isn't mapped in the VRMA and so would
> otherwise require an hcall per 8 bytes access.
> 
> The performance is still not great but usable, and can be improved
> with a more complex implementation of the hcall itself if needed.
> 
> This also adds some documentation for the qemu-specific hypercalls
> that we add to PAPR along with a new qemu,hypertas-functions property
> that mirrors ibm,hypertas-functions and provides some discoverability
> for the new calls.
> 
> Note: I chose not to advertise H_RTAS to the guest via that mechanism.
> This is done on purpose; the guest uses the normal RTAS interfaces
> provided by qemu (including SLOF) which internally calls H_RTAS.
> 
> We might in the future implement part (or even all) of RTAS inside the
> guest like IBM's firmware does and replace H_RTAS with some finer grained
> set of private hypercalls.
> 
> Signed-off-by: Benjamin Herrenschmidt 

Thanks, applied to ppc-next.


Alex



Re: [PATCH 1/2] ppc64: Rudimentary Support for extra page sizes on server CPUs

2012-06-19 Thread Alexander Graf

On 19.06.2012, at 07:56, Benjamin Herrenschmidt wrote:

> More recent Power server chips (i.e. based on the 64 bit hash MMU)
> support more than just the traditional 4k and 16M page sizes.  This
> can get quite complicated, because which page sizes are supported,
> which combinations are supported within an MMU segment and how these
> page sizes are encoded both in the SLB entry and the hash PTE can vary
> depending on the CPU model (they are not specified by the
> architecture).  In addition the firmware or hypervisor may not permit
> use of certain page sizes, for various reasons.  Whether various page
> sizes are supported on KVM, for example, depends on whether the PR or
> HV variant of KVM is in use, and on the page size of the memory
> backing the guest's RAM.
> 
> This patch adds information to the CPUState and cpu defs to describe
> the supported page sizes and encodings.  Since TCG does not yet
> support any extended page sizes, we just set this to NULL in the
> static CPU definitions, expanding this to the default 4k and 16M page
> sizes when we initialize the cpu state.  When using KVM, however, we
> instead determine available page sizes using the new
> KVM_PPC_GET_SMMU_INFO call.  For old kernels without that call, we use
> some defaults, with some guesswork which should do the right thing for
> existing HV and PR implementations.  The fallback might not be correct
> for future versions, but that's ok, because they'll have
> KVM_PPC_GET_SMMU_INFO.
> 
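
For reference, userspace can probe the new call roughly like this
(sketch; kvm_get_fallback_smmu_info() is a hypothetical fallback
helper, not necessarily the name used in the patch):

/* Sketch: ask the kernel for the supported page sizes, falling back
 * to the guessed 4k + 16M defaults on kernels without the ioctl. */
struct kvm_ppc_smmu_info info;

if (kvm_vm_ioctl(kvm_state, KVM_PPC_GET_SMMU_INFO, &info) < 0) {
    kvm_get_fallback_smmu_info(&info);   /* hypothetical fallback */
}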
> Signed-off-by: Benjamin Herrenschmidt 
> Signed-off-by: David Gibson 

Thanks, applied both to ppc-next.


Alex



Re: [PATCH] spapr_vscsi: Error handling fixes

2012-06-19 Thread Alexander Graf

On 19.06.2012, at 08:02, Benjamin Herrenschmidt wrote:

> We were incorrectly g_free'ing an object that isn't allocated
> in one error path, and failed to release it completely in another.
> 
> This fixes qemu crashes with some cases of IO errors.
> 
> Signed-off-by: Benjamin Herrenschmidt 

Thanks, applied to ppc-next.


Alex



[PATCH v2] Fixes related to processing of qemu's -numa option

2012-06-19 Thread Chegu Vinod
From: root 

Changes since v1:


 - Use bitmap functions that are already in qemu (instead
   of cpu_set_t macros)
 - Added a check for endvalue >= max_cpus.
 - Fix to address the round-robin assignment (for the case
   when CPUs are not explicitly specified)

Note: Continuing to use a new constant for
  allocation of the cpumask (max_cpus was
  not getting set early enough).
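
For illustration, the range handling then looks roughly like this
(sketch; node_cpumask[] as a qemu bitmap and the exact message are
illustrative, not necessarily the patch text):

/* Sketch: clamp the parsed cpus=<from>-<to> range and mark the CPUs
 * in a variable-width qemu bitmap instead of a fixed cpu_set_t. */
if (endvalue >= max_cpus) {
    endvalue = max_cpus - 1;
    fprintf(stderr, "qemu: NUMA: cpu range capped to max_cpus (%d)\n",
            max_cpus);
}
bitmap_set(node_cpumask[nodenr], value, endvalue - value + 1);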

---

v1:
--

The -numa option to qemu is used to create [fake] numa nodes
and expose them to the guest OS instance.

There are a couple of issues with the -numa option:

a) The maximum number of VCPUs that can be specified for a guest
   while using qemu's -numa option is 64. Due to a typecasting issue,
   when the number of VCPUs is > 32 the VCPUs don't show up
   under the specified [fake] numa nodes.

b) KVM currently supports 160 VCPUs per guest, while qemu's -numa
   option only supports up to 64 VCPUs per guest.

This patch addresses these two issues.

Below are examples of (a) and (b)

a) >32 VCPUs are specified with the -numa option:

/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
71:01:01 \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

...

Upstream qemu :
--

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
6 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 32 33 34 35 36 37 38 39 40 41
node 0 size: 131072 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51
node 1 size: 131072 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59
node 2 size: 131072 MB
node 3 cpus: 30
node 3 size: 131072 MB
node 4 cpus:
node 4 size: 131072 MB
node 5 cpus: 31
node 5 size: 131072 MB

With the patch applied :
---

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
6 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 131072 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 131072 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 131072 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 131072 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49
node 4 size: 131072 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59
node 5 size: 131072 MB

b) >64 VCPUs specified with -numa option:

/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu 
Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
 \
-smp sockets=8,cores=10,threads=1 \
-numa node,nodeid=0,cpus=0-9,mem=64g \
-numa node,nodeid=1,cpus=10-19,mem=64g \
-numa node,nodeid=2,cpus=20-29,mem=64g \
-numa node,nodeid=3,cpus=30-39,mem=64g \
-numa node,nodeid=4,cpus=40-49,mem=64g \
-numa node,nodeid=5,cpus=50-59,mem=64g \
-numa node,nodeid=6,cpus=60-69,mem=64g \
-numa node,nodeid=7,cpus=70-79,mem=64g \
-m 524288 \
-name vm1 \
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
-drive 
file=/dev/libvirt_lvm/vm.img,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native
 \
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 \
-monitor stdio \
-net nic,macaddr=52:54:00:71:01:01 \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

...

Upstream qemu :
--

only 63 CPUs in NUMA mode supported.
only 64 CPUs in NUMA mode supported.
QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
8 nodes
node 0 cpus: 6 7 8 9 38 39 40 41 70 71 72 73
node 0 size: 65536 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51 74 75 
76 77 78 79
node 1 size: 65536 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59 60 61
node 2 size: 65536 MB
node 3 cpus: 30 62
node 3 size: 65536 MB
node 4 cpus:
node 4 size: 65536 MB
node 5 cpus:
node 5 size: 65536 MB
node 6 cpus: 31 63
node 6 size: 65536 MB
node 7 cpus: 0 1 2 3 4 5 32 33 34 35 36 37 64 65 66 67 68 69
node 7 size: 65536 MB

With the patch applied :
---

QEMU 1.1.50 monitor - type 'help' for more information
(qemu) info numa
8 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 65536 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 65536 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 65536 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 65536 MB
node 4 cpus: 40 41 42 43 44 45 46 47 48 49
node 4 size: 65536 MB
node 5 cpus: 50 51 52 53 54 55 56 57 58 59
node 5 size: 65536 MB
node 6 cpus: 60 61 62 63 64 65 66 67 68 69
node 6 size: 65536 MB
node 7 cpus: 70 71 72 73 74 75 76 77 78 79
node 7 size: 65536 MB

Signed-off-by: Chegu Vinod , Jim Hull , 
Craig Hada 
---
 cpus.c   |3 ++-
 hw/pc.c  |4 +++-
 sysemu.h |3 ++-
 vl.c |   48 ++--
 4 files changed, 33 insertions(+), 25 deletions(-)

diff --git a/cpus.c b/cpus.c
index b182b3d..89ce04d 100644
--- a/cpus.c
+++ b/cpus.c
@@ -36,6 +36,7 @@
 #include "cpus.h"
 #include "qtest.h"
 #include "main-loop.h"
+#incl

how to map a host virtual address (hva) to guest address space (gva/gpa/gfn)

2012-06-19 Thread sheng qiu
Hi all,

Is there any existing function in KVM that can map a host virtual
address (hva) into the guest address space, so that the guest VM can
access it?
I.e., translate an hva to a gpa, gva or gfn, so the guest VM can use
that converted address for access?

Thanks,
Sheng


VMM OSX port

2012-06-19 Thread Gene Crucean
Hey everyone,

I just subscribed to this list so please forgive me if I goof something up.

I was looking into the possibility of porting VMM to OSX. Is this on the
todo list already? Or something that anyone else is hoping for? Since it's
all written in Python (with GTK for the UI), it doesn't seem like it would
be too bad of a port. The hard part is finding all of the prerequisites for OSX.

Would someone who knows this code better than I do know of any walls I might
run into? Of course the gnome libs won't be available, but I was thinking
those would get replaced with Cocoa interface libs.

I've never ported anything before (probably obvious :) so any tips/advice
are welcome.




(Yay!)
 python  >= 2.4
 pygtk2 >= 1.99.12-6
 libxml2-python >= 2.6.23

(Not sure if these exist for osx)
 libvirt-python >= 0.4.0
 dbus-python >= 0.61
 vte >= 0.12.2
 gtk-vnc >= 0.0.1
 python-virtinst >= 0.300.0
 PolicyKit >= 0.6

(Most likely don't exist for osx :)
 gnome-python2-gconf >= 1.99.11-7
 gnome-keyring >= 0.4.9
 gnome-python-desktop >= 2.15.4





--
Gene Crucean - Emmy winning - Oscar nominated VFX Supervisor / iOS-OSX
Developer / Filmmaker / Photographer
** Freelance for hire **
www.genecrucean.com

~~ Please use my website's contact form on www.genecrucean.com for any
personal emails. Thanks. I may not get them at this address. ~~


Re: how to map a host virtual address (hva) to guest address space (gva/gpa/gfn)

2012-06-19 Thread Mehul Chadha
On Tue, Jun 19, 2012 at 6:09 PM, sheng qiu  wrote:
> Hi all,
>
> Is there any existing function in KVM that can map a host virtual
> address (hva) into the guest address space, so that the guest VM can
> access it?
> I.e., translate an hva to a gpa, gva or gfn, so the guest VM can use
> that converted address for access?
I am not aware of a function that exists in kvm, but you can write one using
__gfn_to_memslot() as a reference implementation.

You can check whether the hva falls in a particular memslot using the
userspace_addr member of the memslot structure, and then calculate the
corresponding gfn from that.
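
For example, something along these lines (hva_to_gfn() is a
hypothetical helper, not an existing KVM function; locking around the
memslot walk is elided):

/* Sketch: find the memslot whose HVA range contains @hva and
 * translate to a gfn; returns false if no slot matches. */
static bool hva_to_gfn(struct kvm *kvm, unsigned long hva, gfn_t *gfn)
{
	struct kvm_memslots *slots = kvm_memslots(kvm);
	struct kvm_memory_slot *memslot;

	kvm_for_each_memslot(memslot, slots) {
		unsigned long start = memslot->userspace_addr;
		unsigned long size = memslot->npages << PAGE_SHIFT;

		if (hva >= start && hva < start + size) {
			*gfn = memslot->base_gfn +
			       ((hva - start) >> PAGE_SHIFT);
			return true;
		}
	}
	return false;
}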
>
> Thanks,
> Sheng


Re: [PATCH v8 10/15] ARM: KVM: Inject IRQs and FIQs from userspace

2012-06-19 Thread Christoffer Dall
On Tue, Jun 19, 2012 at 4:49 AM, Avi Kivity  wrote:
> On 06/18/2012 11:56 PM, Christoffer Dall wrote:
>> On Mon, Jun 18, 2012 at 9:32 AM, Avi Kivity  wrote:
>>> On 06/15/2012 10:08 PM, Christoffer Dall wrote:
 From: Christoffer Dall 

 Userspace can inject IRQs and FIQs through the KVM_IRQ_LINE VM ioctl.
 This ioctl is used since the semantics are in fact two lines that can be
 either raised or lowered on the VCPU - the IRQ and FIQ lines.

 KVM needs to know which VCPU it must operate on and whether the FIQ or
 IRQ line is raised/lowered. Hence both pieces of information are packed
 in the kvm_irq_level->irq field. The irq field value will be:
   IRQ: vcpu_index << 1
   FIQ: (vcpu_index << 1) | 1

 This is documented in Documentation/kvm/api.txt.

 The effect of the ioctl is simply to raise/lower the
 corresponding irq_line field on the VCPU struct, which will cause the
 world-switch code to raise/lower virtual interrupts when running the
 guest on next switch. The wait_for_interrupt flag is also cleared for
 raised IRQs or FIQs causing an idle VCPU to become active again. CPUs
 in guest mode are kicked to make sure they refresh their interrupt status.
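
 From userspace, that encoding looks roughly like this (sketch; vm_fd,
 vcpu_index and is_fiq are illustrative):

 /* Sketch: raise the IRQ or FIQ line of a given VCPU from userspace. */
 struct kvm_irq_level irq;

 irq.irq   = (vcpu_index << 1) | (is_fiq ? 1 : 0);
 irq.level = 1;                      /* 1 = raise, 0 = lower */
 if (ioctl(vm_fd, KVM_IRQ_LINE, &irq) < 0)
     perror("KVM_IRQ_LINE");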
>>>

 +static int kvm_arch_vm_ioctl_irq_line(struct kvm *kvm,
 +                                   struct kvm_irq_level *irq_level)
 +{
 +     int mask;
 +     unsigned int vcpu_idx;
 +     struct kvm_vcpu *vcpu;
 +     unsigned long old, new, *ptr;
 +
 +     vcpu_idx = irq_level->irq >> 1;
 +     if (vcpu_idx >= KVM_MAX_VCPUS)
 +             return -EINVAL;
 +
 +     vcpu = kvm_get_vcpu(kvm, vcpu_idx);
 +     if (!vcpu)
 +             return -EINVAL;
 +
 +     if ((irq_level->irq & 1) == KVM_ARM_IRQ_LINE)
 +             mask = HCR_VI;
 +     else /* KVM_ARM_FIQ_LINE */
 +             mask = HCR_VF;
 +
 +     trace_kvm_set_irq(irq_level->irq, irq_level->level, 0);
 +
 +     ptr = (unsigned long *)&vcpu->arch.irq_lines;
 +     do {
 +             old = ACCESS_ONCE(*ptr);
 +             if (irq_level->level)
 +                     new = old | mask;
 +             else
 +                     new = old & ~mask;
 +
 +             if (new == old)
 +                     return 0; /* no change */
 +     } while (cmpxchg(ptr, old, new) != old);
>>>
>>> Isn't this a complicated
>>>
>>>
>>>   if (level)
>>>       set_bit()
>>>   else
>>>       clear_bit()
>>>
>>> ?
>>>
>>
>> we need to atomically know if we changed either the FIQ/IRQ fields and
>> atomically update without locking.
>
> So use test_and_set_bit()/test_and_clear_bit().
>
>> (I think you suggested this
>> approach, in fact).
>
> I think so too, but it only makes sense if you need to consider both
> fields simultaneously (which you don't here).  For example, if IRQ was
> asserted while FIQ was already raised, you don't need to kick.  But
> that's not the case according to the below.
>

ok, fair enough. I simplified it a bit.
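
Something like this (a sketch, assuming hypothetical
HCR_VI_BIT/HCR_VF_BIT bit-index counterparts of the HCR_VI/HCR_VF
masks, so it may not match the respin exactly):

/* Sketch: one atomic bit op per line change; kick only on a change. */
unsigned long *lines = (unsigned long *)&vcpu->arch.irq_lines;
int bit = ((irq_level->irq & 1) == KVM_ARM_IRQ_LINE) ? HCR_VI_BIT
						     : HCR_VF_BIT;
bool changed;

if (irq_level->level)
	changed = !test_and_set_bit(bit, lines);
else
	changed = test_and_clear_bit(bit, lines);

if (changed)
	kvm_vcpu_kick(vcpu);

return 0;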

>>
 +
 +     /*
 +      * The vcpu irq_lines field was updated, wake up sleeping VCPUs and
 +      * trigger a world-switch round on the running physical CPU to set 
 the
 +      * virtual IRQ/FIQ fields in the HCR appropriately.
 +      */
 +     kvm_vcpu_kick(vcpu);
>>>
>>> No need to wake when the line is asserted so you can make this
>>> conditional on level.
>>>
>>
>> we need to trigger a world switch to update the virtualized register
>> from the actual running physical CPU if we changed any of the IRQ/FIQ
>> fields. We return in the loop above if we didn't change anything.
>
> Okay.
>
>
>>
 +
 +     return 0;
 +}
 +
  long kvm_arch_vcpu_ioctl(struct file *filp,
                        unsigned int ioctl, unsigned long arg)
  {
 @@ -298,7 +345,20 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, 
 struct kvm_dirty_log *log)
  long kvm_arch_vm_ioctl(struct file *filp,
                      unsigned int ioctl, unsigned long arg)
  {
 -     return -EINVAL;
 +     struct kvm *kvm = filp->private_data;
 +     void __user *argp = (void __user *)arg;
 +
 +     switch (ioctl) {
 +     case KVM_IRQ_LINE: {
 +             struct kvm_irq_level irq_event;
 +
 +             if (copy_from_user(&irq_event, argp, sizeof irq_event))
 +                     return -EFAULT;
 +             return kvm_arch_vm_ioctl_irq_line(kvm, &irq_event);
 +     }
 +     default:
 +             return -EINVAL;
 +     }
  }
>>>
>>> Should be in common code guarded by the define introduced previously.
>>>
>>>
>>
>> you mean like this: ?
>>
>>
>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> index a9f209b..5bf2193 100644
>> --- a/arch/arm/kvm/arm.c
>> +++ b/arch/arm/kvm/arm.c
>> @@ -535,8 +535,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcp

Re: [PATCH v8 11/15] ARM: KVM: World-switch implementation

2012-06-19 Thread Christoffer Dall
On Tue, Jun 19, 2012 at 5:16 AM, Avi Kivity  wrote:
> On 06/19/2012 01:05 AM, Christoffer Dall wrote:
>>> Premature, but this is sad.  I suggest you split vmid generation from
>>> next available vmid.  This allows you to make the generation counter
>>> atomic so it may be read outside the lock.
>>>
>>> You can do
>>>
>>>    if (likely(kvm->arch.vmid_gen == atomic_read(&kvm_vmid_gen)))
>>>           return;
>>>
>>>    spin_lock(...
>>>
>>
>> I knew you were going to say something here :), please take a look at
>> this and see if you agree:
>
> It looks reasonable wrt locking.
>
>> +
>> +     /* First user of a new VMID generation? */
>> +     if (unlikely(kvm_next_vmid == 0)) {
>> +             atomic64_inc(&kvm_vmid_gen);
>> +             kvm_next_vmid = 1;
>> +
>> +             /* This does nothing on UP */
>> +             smp_call_function(reset_vm_context, NULL, 1);
>
> Does this imply a memory barrier?  If not, smp_mb__after_atomic_inc().
>

yes, it implies a memory barrier.

>> +
>> +             /*
>> +              * On SMP we know no other CPUs can use this CPU's or
>> +              * each other's VMID since the kvm_vmid_lock blocks
>> +              * them from reentry to the guest.
>> +              */
>> +
>> +             reset_vm_context(NULL);
>
> These two lines can be folded as on_each_cpu().
>
> Don't you have a race here where you can increment the generation just
> before guest entry?

I don't think I do.

>
> cpu0                       cpu1
> (vmid=0, gen=1)            (gen=0)
> -- --
> gen == global_gen, return
>
>                           gen != global_gen
>                           increment global_gen
>                           smp_call_function
> reset_vm_context
>                           vmid=0
>
> enter with vmid=0          enter with vmid=0

I can't see how the scenario above can happen. First, no-one can run
with vmid 0 - it is reserved for the host. If we bump global_gen on
cpuN, then since we do smp_call_function(x, x, wait=1) we are now sure
that after this call, all cpus(N-1) potentially being inside a VM will
have exited, called reset_vm_context, but before they can re-enter
into the guest, they will call update_vttbr, and if their generation
counter differs from global_gen, they will try to grab that spinlock
and everything should happen in order.

>
> You must recheck gen after disabling interrupts to ensure global_gen
> didn't bump after update_vttbr but before entry.  x86 has a lot of this,
> see vcpu_enter_guest() and vcpu->requests (doesn't apply directly to
> your case but may come in useful later).
>
>>
 +
 +/**
 + * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest 
 code
 + * @vcpu:    The VCPU pointer
 + * @run:     The kvm_run structure pointer used for userspace state 
 exchange
 + *
 + * This function is called through the VCPU_RUN ioctl called from user 
 space. It
 + * will execute VM code in a loop until the time slice for the process is 
 used
 + * or some emulation is needed from user space in which case the function 
 will
 + * return with return value 0 and with the kvm_run structure filled in 
 with the
 + * required data for the requested emulation.
 + */
  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
  {
 -     return -EINVAL;
 +     int ret = 0;
 +     sigset_t sigsaved;
 +
 +     if (vcpu->sigset_active)
 +             sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
 +
 +     run->exit_reason = KVM_EXIT_UNKNOWN;
 +     while (run->exit_reason == KVM_EXIT_UNKNOWN) {
>>>
>>> It's not a good idea to read stuff from run unless it's part of the ABI,
>>> since userspace can play with it while you're reading it.  It's harmless
>>> here but in other places it can lead to a vulnerability.
>>>
>>
>> ok, so in this case, userspace can 'suppress' an MMIO or interrupt
>> window operation.
>>
>> I can change to keep some local variable to maintain the state and
>> have some API convention for emulation functions, if you feel strongly
>> about it, but otherwise it feels to me like the code will be more
>> complicated. Do you have other ideas?
>
> x86 uses:
>
>  0 - return to userspace (run prepared)
>  1 - return to guest (run untouched)
>  -ESOMETHING - return to userspace
>

yeah, we can do that I guess, that's fair.
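
i.e. something like (sketch; run_guest_once() and handle_exit() are
placeholder names):

/* Sketch of the x86-style convention: 1 = re-enter the guest,
 * 0 = return to userspace with run prepared, < 0 = error. */
int ret = 1;

while (ret > 0) {
	ret = run_guest_once(vcpu, run);      /* one guest entry/exit */
	if (ret > 0)
		ret = handle_exit(vcpu, run); /* decode and handle the exit */
}
return ret;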

> as return values from handlers and for locals (usually named 'r').
>
>
> --
> error compiling committee.c: too many arguments to function
>
>


Re: [PATCH v8 11/15] ARM: KVM: World-switch implementation

2012-06-19 Thread Christoffer Dall
On Tue, Jun 19, 2012 at 11:27 PM, Christoffer Dall
 wrote:
> On Tue, Jun 19, 2012 at 5:16 AM, Avi Kivity  wrote:
>> On 06/19/2012 01:05 AM, Christoffer Dall wrote:
 Premature, but this is sad.  I suggest you split vmid generation from
 next available vmid.  This allows you to make the generation counter
 atomic so it may be read outside the lock.

 You can do

    if (likely(kvm->arch.vmid_gen == atomic_read(&kvm_vmid_gen)))
           return;

    spin_lock(...

>>>
>>> I knew you were going to say something here :), please take a look at
>>> this and see if you agree:
>>
>> It looks reasonable wrt locking.
>>
>>> +
>>> +     /* First user of a new VMID generation? */
>>> +     if (unlikely(kvm_next_vmid == 0)) {
>>> +             atomic64_inc(&kvm_vmid_gen);
>>> +             kvm_next_vmid = 1;
>>> +
>>> +             /* This does nothing on UP */
>>> +             smp_call_function(reset_vm_context, NULL, 1);
>>
>> Does this imply a memory barrier?  If not, smp_mb__after_atomic_inc().
>>
>
> yes, it implies a memory barrier.
>
>>> +
>>> +             /*
>>> +              * On SMP we know no other CPUs can use this CPU's or
>>> +              * each other's VMID since the kvm_vmid_lock blocks
>>> +              * them from reentry to the guest.
>>> +              */
>>> +
>>> +             reset_vm_context(NULL);
>>
>> These two lines can be folded as on_each_cpu().
>>
>> Don't you have a race here where you can increment the generation just
>> before guest entry?
>
> I don't think I do.
>

uh, strike that, I do.

>>
>> cpu0                       cpu1
>> (vmid=0, gen=1)            (gen=0)
>> -- --
>> gen == global_gen, return
>>
>>                           gen != global_gen
>>                           increment global_gen
>>                           smp_call_function
>> reset_vm_context
>>                           vmid=0
>>
>> enter with vmid=0          enter with vmid=0
>
> I can't see how the scenario above can happen. First, no-one can run
> with vmid 0 - it is reserved for the host. If we bump global_gen on
> cpuN, then since we do smp_call_function(x, x, wait=1) we are now sure
> that after this call, all cpus(N-1) potentially being inside a VM will
> have exited, called reset_vm_context, but before they can re-enter
> into the guest, they will call update_vttbr, and if their generation
> counter differs from global_gen, they will try to grab that spinlock
> and everything should happen in order.
>

The whole vmid=0 thing confused me a bit. The point is that since we moved
the generation check outside the spin_lock, we have to re-check. I see:

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 19fe3b0..74760af 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -313,6 +313,24 @@ static void reset_vm_context(void *info)
 }

 /**
+ * check_new_vmid_gen - check whether a new VMID generation is needed
+ * @kvm: The VM whose VMID generation to check
+ *
+ * return true if there is a new generation of VMIDs being used
+ *
+ * The hardware supports only 256 values with the value zero reserved for the
+ * host, so we check if an assigned value belongs to a previous generation,
+ * which requires us to assign a new value. If we're the first to use a
+ * VMID for the new generation, we must flush necessary caches and TLBs on all
+ * CPUs.
+ */
+static bool check_new_vmid_gen(struct kvm *kvm)
+{
+   return unlikely(kvm->arch.vmid_gen != atomic64_read(&kvm_vmid_gen));
+}
+
+/**
  * update_vttbr - Update the VTTBR with a valid VMID before the guest runs
  * @kvm   The guest that we are about to run
  *
@@ -324,15 +342,7 @@ static void update_vttbr(struct kvm *kvm)
 {
phys_addr_t pgd_phys;

-   /*
-*  Check that the VMID is still valid.
-*  (The hardware supports only 256 values with the value zero
-*   reserved for the host, so we check if an assigned value belongs to
-*   a previous generation, which which requires us to assign a new
-*   value. If we're the first to use a VMID for the new generation,
-*   we must flush necessary caches and TLBs on all CPUs.)
-*/
-   if (likely(kvm->arch.vmid_gen == atomic64_read(&kvm_vmid_gen)))
+   if (!check_new_vmid_gen(kvm))
return;

spin_lock(&kvm_vmid_lock);
@@ -504,6 +514,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 */
preempt_disable();
local_irq_disable();
+
+   if (check_new_vmid_gen(kvm)) {
+   local_irq_enable();
+   preempt_enable();
+   continue;
+   }
+
kvm_guest_enter();
vcpu->mode = IN_GUEST_MODE;

>>
>> You must recheck gen after disabling interrupts to ensure global_gen
>> didn't bump after update_vttbr but before entry.  x86 has a lot of this,
>> see vcpu_ente

Re: [PATCH 3/3] virtio-blk: Add bio-based IO path for virtio-blk

2012-06-19 Thread Asias He

On 06/19/2012 02:21 PM, Dor Laor wrote:

On 06/19/2012 05:51 AM, Asias He wrote:

On 06/18/2012 07:39 PM, Sasha Levin wrote:

On Mon, 2012-06-18 at 14:14 +0300, Dor Laor wrote:

On 06/18/2012 01:05 PM, Rusty Russell wrote:

On Mon, 18 Jun 2012 16:03:23 +0800, Asias He wrote:

On 06/18/2012 03:46 PM, Rusty Russell wrote:

On Mon, 18 Jun 2012 14:53:10 +0800, Asias He
wrote:

This patch introduces bio-based IO path for virtio-blk.


Why make it optional?


The request-based IO path is useful for users who do not want to bypass
the IO scheduler in the guest kernel, e.g. users with spinning disks.
Users with fast disk devices, e.g. SSDs, can use the bio-based IO path.


Users using a spinning disk still get IO scheduling in the host though.
What benefit is there in doing it in the guest as well?


The IO scheduler waits for requests to merge and thus batches IOs
together. It's not important w.r.t. spinning disks since the host can do
it, but it causes far fewer vmexits, which is the key issue for VMs.


Is the amount of exits caused by virtio-blk significant at all with
EVENT_IDX?


Yes. EVENT_IDX reduces the number of notifies and interrupts. Let's take
the interrupt as an example: the guest fires 200K requests at the host,
and the number of interrupts is about 6K thanks to EVENT_IDX. The ratio is
200K / 6K = 33. The ratio of merging is 4K / 200K = 20.



In this case, why don't you always recommend bio over request based?


This case shows that the IO scheduler's merging in the guest saves a lot of
requests to the host side. Why should I recommend bio over request-based here?


--
Asias




Implementing a Virtio host module for 802.11 media

2012-06-19 Thread Song Gao
Greetings! 

I want to develop a Virtio host module to emulate 802.11 wireless networks. 
Here's the motivation:

I am a PhD student working on wireless networks. I've been using a cluster of
32 KVM guest nodes to emulate wireless nodes and test mobile ad-hoc routing
protocols. I use "iptables" MAC address filtering to emulate the physical
isolation of wireless nodes caused by distance, and use the "tc" command to
control the bandwidth. However, nodes within communication range share the
wireless media, which means they can interfere with each other and there are
packet collisions; as a result, the bandwidth of each node is also affected
by the nodes around it. I find that no matter how I use the "tc" command to
do the traffic control, it's always hard to approach the real media-sharing
model of 802.11 wireless networks. I have to synchronize not only the virtual
location, but also the real-time network usage of each node. Even after
synchronizing these, it's still only approximate, and it's ugly. So I have
begun to believe that I'm trying to solve the problem the wrong way. I should
probably emulate these things in the KVM host.

I want to emulate a real-world wireless network, where interference and
packet collisions can take place just like in real wireless communication.
The model does not have to describe the real media very precisely, but it
should consider the factors affecting wireless bandwidth and delay.

I have some Linux development and network experience, but I'm quite new to the
community. I don't really have much kernel development experience, but I'll be
pleased to learn. Do you have any advice about where I can get started?
Any suggestions (or collaboration, if anybody is interested) will be
appreciated.

Thanks! 

Kind regards,
-- 
Song Gao
https://profiles.google.com/song.gao.beta/about


