Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 09:36 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 08:15:36PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 06:49 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
 On 08/28/2013 05:46 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
>>> Or what if the desc is moved to another rmap, but then it
>>> is moved back to the initial rmap (at another place in the desc list),
>>> so the check here will not catch that we need to restart the walk?
>>
>> It is okay. We always add the new desc at the head, so we will walk
>> all the entries in this case.
>>
> Which raises another question: what if a desc is added at the front of the
> list, behind the point where the lockless walker currently is?

 That case is a new spte being added into the rmap. We need not care about
 the new sptes, since they will set the dirty-bitmap and can then be
 write-protected next time.

>>> OK.
>>>
>
>> Right?
> Not sure. While the lockless walker works on a desc, the rmap can be
> completely destroyed and recreated again, in any order.

 I think the situation is very similar to include/linux/rculist_nulls.h
>>> include/linux/rculist_nulls.h is for implementing hash tables, so they
>>> may not care about the add/del/lookup race, for instance, but maybe we do
>>> (you are saying above that we do not), so similarity does not prove
>>> correctness for our case.
>>
>> We do not care about "add" and "del" either when looking up the rmap. In
>> the "add" case it is okay, for the reason I explained above. In the "del"
>> case, the spte becomes non-present and we flush all TLBs immediately, so
>> it is also okay.
>>
>> I always use a stupid way to check correctness, that is, enumerating all
>> the cases we may meet. In this patch, we may meet these cases:
>>
>> 1) kvm deletes a desc before the one we are currently on
>>    that desc has already been checked, so we do not need to care about it.
>>
>> 2) kvm deletes a desc after the one we are currently on
>>    Since we always add/del at the head desc, we can be sure the current
>>    desc has also been deleted, so we will then meet case 3).
>>
>> 3) kvm deletes the desc that we are currently on
>>    3.a): the desc stays in the slab cache (it is not reused).
>>          all spte entries are empty, so fn() will skip the non-present
>>          sptes, and desc->more is
>>          3.a.1) still pointing to the next desc, so we continue the lookup,
>>          3.a.2) or the "nulls" value, which means we have reached the last
>>                 one, so we finish the walk.
>>
>>    3.b): the desc has been allocated from the slab cache and is being
>>          initialized; we will see "desc->more == NULL" and then restart
>>          the walk. It's okay.
>>
>>    3.c): the desc is added to a rmap or pte_list again.
>>          3.c.1): the desc is added to the current rmap again.
>>                  the new desc always acts as the head desc, so we will
>>                  walk all entries; some entries are checked twice, but no
>>                  entry can be missed. It is okay.
>>
>>          3.c.2): the desc is added to another rmap or pte_list.
>>                  since kvm_set_memory_region() and get_dirty are
>>                  serialized by the slots-lock, the "nulls" can not be
>>                  reused during the lookup. We will then meet a different
>>                  "nulls" at the end of the walk, which causes a rewalk.
>>
>> I know checking the algorithm like this is really silly; do you have a
>> better idea?
>>
> Not silly, but it is easy to miss cases. For instance, 3.c.3 can be:
>  the desc is added to another rmap, then we move to another desc on the
>  wrong rmap; this other desc is also deleted and reinserted into the
>  original rmap. Seems like the justification from 3.c.1 applies to that
>  too, though.
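
To make the case analysis above concrete, here is a minimal, self-contained
C sketch of a nulls-terminated lockless walk. It is a model of the idea
under discussion, not the actual KVM code: the struct layout, the nulls
encoding, and the helper names are all assumptions, and real code would
load the pointers with rcu_dereference() under rcu_read_lock().

struct pte_list_desc {
	unsigned long *sptes[3];
	struct pte_list_desc *more;	/* next desc, or a "nulls" marker */
};

static int desc_is_a_nulls(struct pte_list_desc *d)
{
	return (unsigned long)d & 1;	/* low bit tags a nulls value */
}

/* Assumed encoding: the rmap head's address, tagged in the low bit, so
 * the terminating nulls identifies which list it belongs to. */
static struct pte_list_desc *desc_make_nulls(unsigned long *pte_list)
{
	return (struct pte_list_desc *)((unsigned long)pte_list | 1);
}

static void pte_list_walk_lockless(unsigned long *pte_list,
				   void (*fn)(unsigned long *spte))
{
	struct pte_list_desc *desc;
	int i;

restart:
	if (!(*pte_list & 1))
		return;			/* empty, or a single spte: no desc list */
	desc = (struct pte_list_desc *)(*pte_list & ~1ul);

	while (!desc_is_a_nulls(desc)) {
		if (!desc)
			goto restart;	/* case 3.b: desc->more not set up yet */
		for (i = 0; i < 3; i++)
			if (desc->sptes[i])	/* case 3.a: skip non-present */
				fn(desc->sptes[i]);
		desc = desc->more;
	}
	if (desc != desc_make_nulls(pte_list))
		goto restart;		/* case 3.c.2: foreign nulls, so rewalk */
}
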
> 
>>> BTW I do not see
>>> rcu_assign_pointer()/rcu_dereference() in your patches which hints on
>>
>> IIUC, we can not directly use rcu_assign_pointer(); it does something like
>> p = v to assign a pointer to a pointer. But in our case, we need:
>>    *pte_list = (unsigned long)desc | 1;
>> From Documentation/RCU/whatisRCU.txt:
> 
> The updater uses this function to assign a new value to an RCU-protected 
> pointer.
> 
> This is what we do, no? (assuming slot->arch.rmap[] is what rcu protects here)
> The fact that the value is not a real pointer should not matter.
> 

Okay. Will change that code to:

+
+#define rcu_assign_head_desc(pte_list_p, value)\
+   rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), (unsigned long *)(value))
+
 /*
  * Pte mapping structures:
  *
@@ -1006,14 +1010,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
desc->sptes[1] = spte;
desc_mark_nulls(pte_list, desc);

-   /*
-* Esure the old spte has been updated into desc, so
-* that the another side can not get the

Re: [PATCH 07/23] KVM: PPC: Book3S PR: Use 64k host pages where possible

2013-08-28 Thread Paul Mackerras
On Thu, Aug 29, 2013 at 01:24:04AM +0200, Alexander Graf wrote:
> 
> On 06.08.2013, at 06:19, Paul Mackerras wrote:
> 
> > +#ifdef CONFIG_PPC_64K_PAGES
> > +   /*
> > +* Mark this as a 64k segment if the host is using
> > +* 64k pages, the host MMU supports 64k pages and
> > +* the guest segment page size is >= 64k,
> > +* but not if this segment contains the magic page.
> 
> What's the problem with the magic page? As long as we map the magic page as a 
> host 64k page and access only the upper 4k (which we handle today already) we 
> should be set, no?

If we use a 64k host HPTE to map the magic page, then we are taking up
64k of the guest address space, and I was concerned that the guest
might ask to map the magic page at address X and then map something
else at address X+4k or X-4k.  If we use a 64k host HPTE then we would
tromp on those nearby mappings.  If you think the guest will never try
to create any nearby mappings, then we could relax this restriction.
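
The overlap concern is easy to see with a little address arithmetic. Below
is an illustrative check (not code from the patch): a 64k host HPTE covers
the entire 64k-aligned frame, so a separate guest mapping at X + 4k or
X - 4k within that frame would be trampled.

#include <stdbool.h>
#include <stdint.h>

#define SZ_64K 0x10000ull

static bool same_64k_frame(uint64_t ea1, uint64_t ea2)
{
	return (ea1 & ~(SZ_64K - 1)) == (ea2 & ~(SZ_64K - 1));
}

/* e.g. magic page at 0x12340000: same_64k_frame(0x12340000, 0x12341000)
 * is true, so a 4k guest mapping at X + 4k shares the 64k HPTE. */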

Paul.



Re: [PATCH 06/23] KVM: PPC: Book3S PR: Allow guest to use 64k pages

2013-08-28 Thread Paul Mackerras
On Thu, Aug 29, 2013 at 12:56:40AM +0200, Alexander Graf wrote:
> 
> On 06.08.2013, at 06:18, Paul Mackerras wrote:
> 
> > #ifdef CONFIG_PPC_BOOK3S_64
> > -   /* default to book3s_64 (970fx) */
> > +   /*
> > +* Default to the same as the host if we're on a POWER7[+],
> > +* otherwise default to PPC970FX.
> > +*/
> > vcpu->arch.pvr = 0x3C0301;
> > +   if (cpu_has_feature(CPU_FTR_ARCH_206))
> > +   vcpu->arch.pvr = mfspr(SPRN_PVR);
> 
> Unrelated change? Also, why? Any reasonable user space these days should set 
> PVR anyways.

The issue is that the most widely-deployed userspace user of KVM
(i.e., QEMU) does the KVM_PPC_GET_SMMU_INFO ioctl *before* it tells
KVM what it wants the guest PVR to be.  Originally I had
kvm_vm_ioctl_get_smmu_info() returning the 64k page size only if the
BOOK3S_HFLAG_MULTI_PGSIZE flag was set, so I had to add this change so
that userspace would see the 64k page size.  So yes, I could probably
remove this hunk now.

> > 
> > +   /* 64k large page size */
> > +   info->sps[2].page_shift = 16;
> > +   info->sps[2].slb_enc = SLB_VSID_L | SLB_VSID_LP_01;
> > +   info->sps[2].enc[0].page_shift = 16;
> > +   info->sps[2].enc[0].pte_enc = 1;
> 
> We only support this with BOOK3S_HFLAG_MULTI_PGSIZE, no?

The virtual machine implemented by PR KVM supports 64k pages on any
hardware, since it is implementing the POWER MMU in software.  That's
why I didn't make it depend on that flag.  That means that we rely on
userspace to filter out any capabilities that don't apply to the
machine it wants to emulate.  We can't do that filtering here because
userspace queries the MMU capabilities before it sets the PVR.

Regards,
Paul.



Re: [PATCH 04/23] KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu

2013-08-28 Thread Paul Mackerras
On Thu, Aug 29, 2013 at 12:00:53AM +0200, Alexander Graf wrote:
> 
> On 06.08.2013, at 06:16, Paul Mackerras wrote:
> 
> > kvm_start_lightweight:
> > +   /* Copy registers into shadow vcpu so we can access them in real mode */
> > +   GET_SHADOW_VCPU(r3)
> > +   bl  FUNC(kvmppc_copy_to_svcpu)
> 
> This will clobber r3 and r4, no? We need to restore them from the stack here 
> I would think.

You're right.  We don't need to restore r3 since we don't actually use
it, but we do need to restore r4.

> > #ifdef CONFIG_PPC_BOOK3S_32
> > /* We set segments as unused segments when invalidating them. So
> >  * treat the respective fault as segment fault. */
> > -   if (svcpu->sr[kvmppc_get_pc(vcpu) >> SID_SHIFT] == SR_INVALID) {
> > -   kvmppc_mmu_map_segment(vcpu, kvmppc_get_pc(vcpu));
> > -   r = RESUME_GUEST;
> > +   {
> > +   struct kvmppc_book3s_shadow_vcpu *svcpu;
> > +   u32 sr;
> > +
> > +   svcpu = svcpu_get(vcpu);
> > +   sr = svcpu->sr[kvmppc_get_pc(vcpu) >> SID_SHIFT];
> 
> Doesn't this break two concurrently running guests now that we don't copy the 
> shadow vcpu anymore? Just move the sr array to a kmalloc'ed area until the 
> whole vcpu is kmalloc'ed. Then you can get rid of all shadow vcpu code.

This is 32-bit only... the svcpu is already kmalloc'ed, so I'm not
sure what you're asking for here or why you think this would break
with multiple guests.

Paul.



Re: [RFC][PATCH 2/6] cpus: release allocated vcpu objects and exit vcpu thread

2013-08-28 Thread Andreas Färber
On 29.08.2013 04:09, Chen Fan wrote:
> After ACPI gets a signal to eject a vcpu, it notifies the vcpu
> thread that it needs to exit; before the vcpu exits, it releases
> the vcpu-related objects.
> 
> Signed-off-by: Chen Fan 
> ---
>  cpus.c   | 36 
>  hw/acpi/piix4.c  | 16 
>  include/qom/cpu.h|  9 +
>  include/sysemu/kvm.h |  1 +
>  kvm-all.c| 26 ++
>  5 files changed, 84 insertions(+), 4 deletions(-)
> 
> diff --git a/cpus.c b/cpus.c
> index 70cc617..6b793cb 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -697,6 +697,30 @@ void async_run_on_cpu(CPUState *cpu, void (*func)(void 
> *data), void *data)
>  qemu_cpu_kick(cpu);
>  }
>  
> +static void qemu_kvm_destroy_vcpu(CPUState *cpu)
> +{
> +CPUState *pcpu, *pcpu1;
> +
> +pcpu = first_cpu;
> +pcpu1 = NULL;
> +
> +while (pcpu) {
> +if (pcpu == cpu && pcpu1) {
> +pcpu1->next_cpu = cpu->next_cpu;
> +break;
> +}
> +pcpu1 = pcpu;
> +pcpu = pcpu->next_cpu;
> +}

No, no, no. :) I specifically posted the "QOM CPUState, part 12" series
early to avoid exactly such code appearing! Give me a few minutes to
apply that to qom-cpu and then please rebase your work on
git://github.com/afaerber/qemu-cpu.git qom-cpu
using QTAILQ macro and --subject-prefix="RFC qom-cpu v2" for the next
version of the series.

Also, why is this only in the KVM code path? Isn't this needed for TCG
as well?
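
For reference, the QTAILQ-based unlink being requested would reduce the
hand-rolled loop above to a one-liner. This is a sketch under the
assumption that the qom-cpu series gives CPUState a QTAILQ entry field
named node and a global list head named cpus (both names are assumptions):

#include "qemu/queue.h"
#include "qom/cpu.h"

static void cpu_list_remove(CPUState *cpu)
{
    QTAILQ_REMOVE(&cpus, cpu, node);    /* replaces the while loop above */
}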

> +
> +if (kvm_destroy_vcpu(cpu) < 0) {
> +fprintf(stderr, "kvm_destroy_vcpu failed.\n");
> +exit(1);
> +}
> +
> +qdev_free(DEVICE(X86_CPU(cpu)));

DEVICE(cpu) should be sufficient.

> +}
> +
>  static void flush_queued_work(CPUState *cpu)
>  {
>  struct qemu_work_item *wi;
> @@ -788,6 +812,11 @@ static void *qemu_kvm_cpu_thread_fn(void *arg)
>  }
>  }
>  qemu_kvm_wait_io_event(cpu);
> +if (cpu->exit && !cpu_can_run(cpu)) {
> +qemu_kvm_destroy_vcpu(cpu);
> +qemu_mutex_unlock(&qemu_global_mutex);
> +return NULL;
> +}
>  }
>  
>  return NULL;
> @@ -1080,6 +1109,13 @@ static void qemu_dummy_start_vcpu(CPUState *cpu)
>  }
>  }
>  
> +void qemu_down_vcpu(CPUState *cpu)
> +{
> +cpu->stop = true;
> +cpu->exit = true;
> +qemu_cpu_kick(cpu);
> +}
> +
>  void qemu_init_vcpu(CPUState *cpu)
>  {
>  cpu->nr_cores = smp_cores;
> diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c
> index 1aaa7a4..44bc809 100644
> --- a/hw/acpi/piix4.c
> +++ b/hw/acpi/piix4.c
> @@ -611,10 +611,18 @@ static const MemoryRegionOps piix4_pci_ops = {
>  },
>  };
>  
> -static void acpi_piix_eject_vcpu(int64_t cpuid)
> +static void acpi_piix_eject_vcpu(PIIX4PMState *s, int64_t cpuid)
>  {
> -/* TODO: eject a vcpu, release allocated vcpu and exit the vcpu pthread. 
>  */
> -PIIX4_DPRINTF("vcpu: %" PRIu64 " need to be ejected.\n", cpuid);
> +CPUStatus *cpus = &s->gpe_cpu;
> +CPUState *cs = NULL;
> +
> +cs = qemu_get_cpu(cpuid);

Are you sure this is correct as 0-based index? Igor?

> +if (cs == NULL) {
> +return;
> +}
> +
> +cpus->old_sts[cpuid / 8] &= ~(1 << (cpuid % 8));
> +qemu_down_vcpu(cs);
>  }
>  
>  static uint64_t cpu_status_read(void *opaque, hwaddr addr, unsigned int size)
> @@ -647,7 +655,7 @@ static void cpu_status_write(void *opaque, hwaddr addr, 
> uint64_t data,
>  }
>  
>  if (cpuid != 0) {
> -acpi_piix_eject_vcpu(cpuid);
> +acpi_piix_eject_vcpu(s, cpuid);
>  }
>  }
>  
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 3e49936..fa8ec8a 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -180,6 +180,7 @@ struct CPUState {
>  bool created;
>  bool stop;
>  bool stopped;
> +bool exit;
>  volatile sig_atomic_t exit_request;
>  volatile sig_atomic_t tcg_exit_req;
>  uint32_t interrupt_request;
> @@ -489,6 +490,14 @@ void cpu_exit(CPUState *cpu);
>  void cpu_resume(CPUState *cpu);
>  
>  /**
> + * qemu_down_vcpu:
> + * @cpu: The vCPU will to down.
> + *
> + * Down a vCPU.
> + */
> +void qemu_down_vcpu(CPUState *cpu);

The naming and documentation sound wrong language-wise to me, but I am
not a native speaker either. Maybe "tear down" instead of "down"? Or
simply qemu_request_vcpu_removal() or something like that?

> +
> +/**
>   * qemu_init_vcpu:
>   * @cpu: The vCPU to initialize.
>   *
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index de74411..fd85605 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -158,6 +158,7 @@ int kvm_has_intx_set_mask(void);
>  
>  int kvm_init_vcpu(CPUState *cpu);
>  int kvm_cpu_exec(CPUState *cpu);
> +int kvm_destroy_vcpu(CPUState *cpu);
>  
>  #ifdef NEED_CPU_H
>  
> diff --git a/kvm-all.c b/kvm-all.c
> index 716860f..fda3601 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -225,6 +225,32 @@ static v

Re: [PATCH] KVM: x86: should release vcpu when closing vcpu fd

2013-08-28 Thread Gleb Natapov
On Thu, Aug 29, 2013 at 11:06:47AM +0800, chenfan wrote:
> On Wed, 2013-08-28 at 11:21 +0300, Gleb Natapov wrote:
> > On Wed, Aug 28, 2013 at 04:08:10PM +0800, Chen Fan wrote:
> > > If we want to remove a vcpu from vm, we should not be only to
> > > put kvm, we need to decrease online vcpus and free struct kvm_vcpu
> > > when closing a vcpu file descriptor.
> > > 
> > It is much more complicated than what you do here. kvm->vcpus[] is
> > accessed without locking, and kvm->online_vcpus is assumed to be greater
> > than the max occupied index in kvm->vcpus[]. Attempts were made before to
> > make it possible to destroy individual vcpus separately from destroying
> > the VM, but they have been unsuccessful thus far.
> > 
> Oh, the motivation of this patch is QEMU implementing cpu hot-remove: I
> want to remove the vcpu from KVM after QEMU closes the kvm_fd, and I need
> some advice for this feature. If this patch is not comprehensive enough, I
> would like to know the reasons in more detail. Could you tell me?
> 
I told you above.
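
The objection is worth spelling out. Here is a userspace model of it
(illustrative only, not kernel code; the loop mimics the bounded lockless
scan that the kvm_for_each_vcpu walkers perform):

#include <stdio.h>

#define MAX_VCPUS 4

static void *vcpus[MAX_VCPUS];
static int online_vcpus;

static void release_vcpu_like_the_patch(int idx)
{
	(void)idx;			/* the patch ignores *which* vcpu died */
	online_vcpus--;
	vcpus[online_vcpus] = NULL;	/* always clears the *last* slot */
}

int main(void)
{
	vcpus[0] = (void *)0xa;
	vcpus[1] = (void *)0xb;
	vcpus[2] = (void *)0xc;
	online_vcpus = 3;

	release_vcpu_like_the_patch(1);	/* even though vcpu 1 was closed */

	/* A concurrent lockless walker still sees the stale slot 1: */
	for (int i = 0; i < online_vcpus && vcpus[i]; i++)
		printf("walker touches vcpu slot %d -> %p\n", i, vcpus[i]);
	/* In the kernel, slot 1 would now point at freed memory. */
	return 0;
}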

> Thanks,
> Chen
> 
> > > Signed-off-by: Chen Fan 
> > > ---
> > >  arch/x86/kvm/x86.c   | 14 ++
> > >  include/linux/kvm_host.h |  1 +
> > >  virt/kvm/kvm_main.c  |  4 +++-
> > >  3 files changed, 18 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index d21bce5..3370de1 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -6904,6 +6904,20 @@ static void kvm_free_vcpus(struct kvm *kvm)
> > >   mutex_unlock(&kvm->lock);
> > >  }
> > >  
> > > +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct kvm *kvm = vcpu->kvm;
> > > +
> > > + mutex_lock(&kvm->lock);
> > > + atomic_dec(&kvm->online_vcpus);
> > > + kvm->vcpus[atomic_read(&kvm->online_vcpus)] = NULL;
> > > + mutex_unlock(&kvm->lock);
> > > +
> > > + kvm_clear_async_pf_completion_queue(vcpu);
> > > + kvm_unload_vcpu_mmu(vcpu);
> > > + kvm_arch_vcpu_free(vcpu);
> > > +}
> > > +
> > >  void kvm_arch_sync_events(struct kvm *kvm)
> > >  {
> > >   kvm_free_all_assigned_devices(kvm);
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index a63d83e..e0811f97 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -631,6 +631,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm 
> > > *kvm, unsigned int id);
> > >  int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
> > >  int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
> > >  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
> > > +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu);
> > >  
> > >  int kvm_arch_hardware_enable(void *garbage);
> > >  void kvm_arch_hardware_disable(void *garbage);
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 1580dd4..442014b 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -1873,8 +1873,10 @@ static int kvm_vcpu_mmap(struct file *file, struct 
> > > vm_area_struct *vma)
> > >  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
> > >  {
> > >   struct kvm_vcpu *vcpu = filp->private_data;
> > > + struct kvm *kvm = vcpu->kvm;
> > >  
> > > - kvm_put_kvm(vcpu->kvm);
> > > + kvm_arch_vcpu_release(vcpu);
> > > + kvm_put_kvm(kvm);
> > >   return 0;
> > >  }
> > >  
> > > -- 
> > > 1.8.1.4
> > > 
> > 
> > --
> > Gleb.
> 

--
Gleb.


Re: [PATCH] KVM: x86: should release vcpu when closing vcpu fd

2013-08-28 Thread chenfan
On Wed, 2013-08-28 at 11:21 +0300, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 04:08:10PM +0800, Chen Fan wrote:
> > If we want to remove a vcpu from vm, we should not be only to
> > put kvm, we need to decrease online vcpus and free struct kvm_vcpu
> > when closing a vcpu file descriptor.
> > 
> It is much more complicated than what you do here. kvm->vcpus[] is
> accessed without locking, and kvm->online_vcpus is assumed to be greater
> than the max occupied index in kvm->vcpus[]. Attempts were made before to
> make it possible to destroy individual vcpus separately from destroying
> the VM, but they have been unsuccessful thus far.
> 
Oh, the motivation of this patch is QEMU implementing cpu hot-remove: I
want to remove the vcpu from KVM after QEMU closes the kvm_fd, and I need
some advice for this feature. If this patch is not comprehensive enough, I
would like to know the reasons in more detail. Could you tell me?

Thanks,
Chen

> > Signed-off-by: Chen Fan 
> > ---
> >  arch/x86/kvm/x86.c   | 14 ++
> >  include/linux/kvm_host.h |  1 +
> >  virt/kvm/kvm_main.c  |  4 +++-
> >  3 files changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d21bce5..3370de1 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6904,6 +6904,20 @@ static void kvm_free_vcpus(struct kvm *kvm)
> > mutex_unlock(&kvm->lock);
> >  }
> >  
> > +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu)
> > +{
> > +   struct kvm *kvm = vcpu->kvm;
> > +
> > +   mutex_lock(&kvm->lock);
> > +   atomic_dec(&kvm->online_vcpus);
> > +   kvm->vcpus[atomic_read(&kvm->online_vcpus)] = NULL;
> > +   mutex_unlock(&kvm->lock);
> > +
> > +   kvm_clear_async_pf_completion_queue(vcpu);
> > +   kvm_unload_vcpu_mmu(vcpu);
> > +   kvm_arch_vcpu_free(vcpu);
> > +}
> > +
> >  void kvm_arch_sync_events(struct kvm *kvm)
> >  {
> > kvm_free_all_assigned_devices(kvm);
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a63d83e..e0811f97 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -631,6 +631,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, 
> > unsigned int id);
> >  int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
> >  int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
> >  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
> > +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu);
> >  
> >  int kvm_arch_hardware_enable(void *garbage);
> >  void kvm_arch_hardware_disable(void *garbage);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1580dd4..442014b 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1873,8 +1873,10 @@ static int kvm_vcpu_mmap(struct file *file, struct 
> > vm_area_struct *vma)
> >  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
> >  {
> > struct kvm_vcpu *vcpu = filp->private_data;
> > +   struct kvm *kvm = vcpu->kvm;
> >  
> > -   kvm_put_kvm(vcpu->kvm);
> > +   kvm_arch_vcpu_release(vcpu);
> > +   kvm_put_kvm(kvm);
> > return 0;
> >  }
> >  
> > -- 
> > 1.8.1.4
> > 
> 
> --
>   Gleb.




[PATCH v3 3/3] tile: enable VIRTIO support for KVM

2013-08-28 Thread Chris Metcalf
This change enables support for a virtio-based console, a virtio
network driver, and a virtio block driver.

We remove some debug code in relocate_kernel_64.S that made raw
calls to the hv_console_putc Tilera hypervisor API, since everything
now should funnel through the early_hv_write() API.

Signed-off-by: Chris Metcalf 
---
 arch/tile/Kconfig   |   3 +
 arch/tile/include/asm/kvm_para.h|  20 ++
 arch/tile/include/asm/kvm_virtio.h  |  26 ++
 arch/tile/include/uapi/asm/Kbuild   |   1 +
 arch/tile/include/uapi/asm/kvm.h|   5 +
 arch/tile/include/uapi/asm/kvm_virtio.h |  60 +
 arch/tile/kernel/Makefile   |   1 +
 arch/tile/kernel/early_printk.c |  16 ++
 arch/tile/kernel/hvglue.S   |   1 +
 arch/tile/kernel/kvm_virtio.c   | 430 
 arch/tile/kernel/relocate_kernel_64.S   |   9 +-
 11 files changed, 570 insertions(+), 2 deletions(-)
 create mode 100644 arch/tile/include/asm/kvm_para.h
 create mode 100644 arch/tile/include/asm/kvm_virtio.h
 create mode 100644 arch/tile/include/uapi/asm/kvm_virtio.h
 create mode 100644 arch/tile/kernel/kvm_virtio.c

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index e89aae8..4e8524b 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -370,6 +370,9 @@ config KVM_GUEST
bool "Build kernel as guest for KVM"
default n
depends on TILEGX
+   select VIRTIO
+   select VIRTIO_RING
+   select VIRTIO_CONSOLE
---help---
  This will build a kernel that runs at a lower protection level
  than the default kernel and is suitable to run under KVM.
diff --git a/arch/tile/include/asm/kvm_para.h b/arch/tile/include/asm/kvm_para.h
new file mode 100644
index 000..c8c31d5
--- /dev/null
+++ b/arch/tile/include/asm/kvm_para.h
@@ -0,0 +1,20 @@
+/*
+ * Copyright 2013 Tilera Corporation. All Rights Reserved.
+ *
+ *   This program is free software; you can redistribute it and/or
+ *   modify it under the terms of the GNU General Public License
+ *   as published by the Free Software Foundation, version 2.
+ *
+ *   This program is distributed in the hope that it will be useful, but
+ *   WITHOUT ANY WARRANTY; without even the implied warranty of
+ *   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ *   NON INFRINGEMENT.  See the GNU General Public License for
+ *   more details.
+ */
+#ifndef _ASM_TILE_KVM_PARA_H
+#define _ASM_TILE_KVM_PARA_H
+
+#include 
+
+int hcall_virtio(unsigned long instrument, unsigned long mem);
+#endif /* _ASM_TILE_KVM_PARA_H */
diff --git a/arch/tile/include/asm/kvm_virtio.h 
b/arch/tile/include/asm/kvm_virtio.h
new file mode 100644
index 000..8faa959
--- /dev/null
+++ b/arch/tile/include/asm/kvm_virtio.h
@@ -0,0 +1,26 @@
+/*
+ * Copyright 2013 Tilera Corporation. All Rights Reserved.
+ *
+ *   This program is free software; you can redistribute it and/or
+ *   modify it under the terms of the GNU General Public License
+ *   as published by the Free Software Foundation, version 2.
+ *
+ *   This program is distributed in the hope that it will be useful, but
+ *   WITHOUT ANY WARRANTY; without even the implied warranty of
+ *   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
+ *   NON INFRINGEMENT.  See the GNU General Public License for
+ *   more details.
+ */
+#ifndef _ASM_TILE_KVM_VIRTIO_H
+#define _ASM_TILE_KVM_VIRTIO_H
+
+#include 
+
+
+struct kvm_device {
+   struct virtio_device vdev;
+   struct kvm_device_desc *desc;
+   unsigned long desc_pa;
+};
+
+#endif /* _ASM_TILE_KVM_VIRTIO_H */
diff --git a/arch/tile/include/uapi/asm/Kbuild 
b/arch/tile/include/uapi/asm/Kbuild
index 89022a5..f07cc24 100644
--- a/arch/tile/include/uapi/asm/Kbuild
+++ b/arch/tile/include/uapi/asm/Kbuild
@@ -8,6 +8,7 @@ header-y += cachectl.h
 header-y += hardwall.h
 header-y += kvm.h
 header-y += kvm_para.h
+header-y += kvm_virtio.h
 header-y += mman.h
 header-y += ptrace.h
 header-y += setup.h
diff --git a/arch/tile/include/uapi/asm/kvm.h b/arch/tile/include/uapi/asm/kvm.h
index aa7b97f..4346520 100644
--- a/arch/tile/include/uapi/asm/kvm.h
+++ b/arch/tile/include/uapi/asm/kvm.h
@@ -149,6 +149,9 @@
  */
 #define KVM_OTHER_HCALL  128
 
+/* Hypercall index for virtio. */
+#define KVM_HCALL_virtio 128
+
 /* One greater than the maximum hypercall number. */
 #define KVM_NUM_HCALLS   256
 
@@ -256,6 +259,8 @@ struct kvm_sync_regs {
KVM_EMULATE(get_ipi_pte) \
KVM_EMULATE(set_pte_super_shift) \
KVM_EMULATE(set_speed) \
+   /* For others */ \
+   USER_HCALL(virtio)
 
 #endif
 
diff --git a/arch/tile/include/uapi/asm/kvm_virtio.h 
b/arch/tile/include/uapi/asm/kvm_virtio.h
new file mode 100644
index 000..d94f535
--- /dev/null
+++ b/arch/tile/include/uapi/asm/kvm_virtio.h
@@ -0,0 +1,60 @@
+/*
+ * Copyright 2013 Tilera Corporation. All Rights Reserved.
+ *
+ *   This program is free softwar

[PATCH v3 2/3] tile: enable building as a paravirtualized KVM_GUEST

2013-08-28 Thread Chris Metcalf
This commit enables a special configure option to build the kernel to
run at PL1.  In this mode, the client can run under a KVM host kernel;
it can also run under the older Tilera hypervisor that ran the
operating system at PL1 by default.

The PL1 kernel runs with half the virtual address space and half the
physical address space of the PL2 kernel, so <asm/page.h> reflects
those constants being halved accordingly.
more generic ("intrpt1" references from the old PL1 kernel
nomenclature have now been normalized to "intrpt"), and simplify some
other nomenclature (using MEM_SV_START instead of MEM_SV_INTRPT to
reflect the fact that where the supervisor starts is the fact of
interest, not that it happens to start with the interrupt vectors).

The simulator support for reflecting ELF binary data back out, in
association with simulator backtrace information, needs some additional
extension.
guest kernel, but not for processes running within the guest kernel;
additional work in the simulator is required to be able to provide the
necessary path information for that to work.  For now we disable
simulator notifications within the guest kernel.

The timer interrupt for the guest uses the AUX_TILE_TIMER hardware,
leaving the regular TILE_TIMER for the host.

Signed-off-by: Chris Metcalf 
---
 arch/tile/Kconfig  | 14 +++--
 arch/tile/include/asm/module.h | 10 --
 arch/tile/include/asm/page.h   | 63 +++---
 arch/tile/include/asm/pgtable_32.h |  2 +-
 arch/tile/include/asm/pgtable_64.h |  3 +-
 arch/tile/include/asm/processor.h  |  2 +-
 arch/tile/include/asm/switch_to.h  | 25 ---
 arch/tile/include/asm/timex.h  |  8 +
 arch/tile/kernel/head_32.S |  4 +--
 arch/tile/kernel/head_64.S |  6 ++--
 arch/tile/kernel/intvec_32.S   |  6 ++--
 arch/tile/kernel/intvec_64.S   | 34 +++-
 arch/tile/kernel/process.c |  2 ++
 arch/tile/kernel/setup.c   |  8 ++---
 arch/tile/kernel/sysfs.c   |  4 +++
 arch/tile/kernel/time.c| 14 -
 arch/tile/kernel/traps.c   |  2 +-
 arch/tile/kernel/vmlinux.lds.S | 10 +++---
 arch/tile/mm/elf.c |  2 ++
 arch/tile/mm/init.c|  8 ++---
 20 files changed, 145 insertions(+), 82 deletions(-)

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 3bc8fb7..e89aae8 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -126,7 +126,7 @@ config TILEGX
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_KPROBES
select HAVE_KRETPROBES
-   select HAVE_KVM
+   select HAVE_KVM if !KVM_GUEST
 
 config TILEPRO
def_bool !TILEGX
@@ -366,11 +366,19 @@ config HARDWALL
bool "Hardwall support to allow access to user dynamic network"
default y
 
+config KVM_GUEST
+   bool "Build kernel as guest for KVM"
+   default n
+   depends on TILEGX
+   ---help---
+ This will build a kernel that runs at a lower protection level
+ than the default kernel and is suitable to run under KVM.
+
 config KERNEL_PL
int "Processor protection level for kernel"
range 1 2
-   default 2 if TILEGX
-   default 1 if !TILEGX
+   default 2 if TILEGX && !KVM_GUEST
+   default 1 if !TILEGX || KVM_GUEST
---help---
  Since MDE 4.2, the Tilera hypervisor runs the kernel
  at PL2 by default.  If running under an older hypervisor,
diff --git a/arch/tile/include/asm/module.h b/arch/tile/include/asm/module.h
index 44ed07c..a8b546b 100644
--- a/arch/tile/include/asm/module.h
+++ b/arch/tile/include/asm/module.h
@@ -16,7 +16,6 @@
 #define _ASM_TILE_MODULE_H
 
 #include 
-
 #include 
 
 /* We can't use modules built with different page sizes. */
@@ -28,6 +27,13 @@
 # define MODULE_PGSZ ""
 #endif
 
+/* Tag guest Linux, since it uses different SPRs, etc. */
+#if CONFIG_KERNEL_PL == 2
+#define MODULE_PL ""
+#else
+#define MODULE_PL " guest"
+#endif
+
 /* We don't really support no-SMP so tag if someone tries. */
 #ifdef CONFIG_SMP
 #define MODULE_NOSMP ""
@@ -35,6 +41,6 @@
 #define MODULE_NOSMP " nosmp"
 #endif
 
-#define MODULE_ARCH_VERMAGIC CHIP_ARCH_NAME MODULE_PGSZ MODULE_NOSMP
+#define MODULE_ARCH_VERMAGIC CHIP_ARCH_NAME MODULE_PGSZ MODULE_PL MODULE_NOSMP
 
 #endif /* _ASM_TILE_MODULE_H */
diff --git a/arch/tile/include/asm/page.h b/arch/tile/include/asm/page.h
index b4f96c0..2c991f2 100644
--- a/arch/tile/include/asm/page.h
+++ b/arch/tile/include/asm/page.h
@@ -148,8 +148,17 @@ static inline __attribute_const__ int get_order(unsigned 
long size)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 #endif
 
+#ifdef CONFIG_KVM_GUEST
+/* Paravirtualized guests get half the VA, and thus half the PA. */
+#define MAX_PA_WIDTH (CHIP_PA_WIDTH() - 1)
+#define MAX_VA_WIDTH (CHIP_VA_WIDTH() - 1)
+#else
+#define MAX_PA_WIDTH CHIP_PA_WIDTH()
+#define MAX_VA_WIDTH CHIP_VA_W

Re: [PATCH v2] tile: support KVM for tilegx

2013-08-28 Thread Chris Metcalf
On 8/26/2013 8:04 AM, Gleb Natapov wrote:
> On Sun, Aug 25, 2013 at 09:26:47PM -0400, Chris Metcalf wrote:
>> On 8/25/2013 7:39 AM, Gleb Natapov wrote:
>>> On Mon, Aug 12, 2013 at 04:24:11PM -0400, Chris Metcalf wrote:
 This change provides the initial framework support for KVM on tilegx.
 Basic virtual disk and networking is supported.

>>> This needs to be broken down to more reviewable patches.
>> I already broke out one pre-requisite patch that wasn't strictly KVM-related:
>>
>> https://lkml.org/lkml/2013/8/12/339
>>
>> In addition, we've separately arranged to support booting our kernels in a 
>> way that is compatible with the Tilera booter running at the highest 
>> privilege level, which enables multiple kernel privilege levels:
>>
>> https://lkml.org/lkml/2013/5/2/468
>>
>> How would you recommend further breaking down this patch?  It's pretty much 
>> just the basic support for minimal KVM.  I suppose I could break out all the 
>> I/O related stuff into a separate patch, though it wouldn't amount to much; 
>> perhaps the console could also be broken out separately.  Any other 
>> suggestions?
>>
> First of all please break out the host and guest bits. Also the I/O related
> stuff, like you suggest (so that the guest PV bits will be in a separate
> patch), and changes to common code (not much as far as I see) with an
> explanation of why they are needed. (Why is kvm_vcpu_kick() not needed, for
> instance?)

I broke it down into three pieces in the end: the basic host support, the basic 
guest PV support, and the virtio/console support.  The first piece is still 
much the biggest.  I found that the generic kvm_vcpu_kick() is fine, so I 
removed the custom version (which predated the generic version in our internal 
tree).  Explanations are now in the git commit comments.

>>> Also can you
>>> describe the implementation a little bit? Does the tile arch have a
>>> virtualization extension this implementation uses, or is it a trap and
>>> emulate approach? If the latter, does it run unmodified guest kernels?
>>> What userspace are you using with this implementation?
>> We could do full virtualization via trap and emulate, but we've elected to 
>> do a para-virtualized approach.  Userspace runs at PL (privilege level) 0, 
>> the guest kernel runs at PL1, and the host runs at PL2.  We have available 
>> per-PL resources for various things, and take advantage of having two 
>> on-chip timers (for example) to handle timing for the host and guest 
>> kernels.  We run the same userspace with either the host or the guest.
>>
> OK, thanks for the explanation. Why have you decided to do PV over trap
> and emulate?

Performance and simplicity; I added comments to the git commit to provide a 
rationale.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com



Re: [PATCH 07/23] KVM: PPC: Book3S PR: Use 64k host pages where possible

2013-08-28 Thread Alexander Graf

On 06.08.2013, at 06:19, Paul Mackerras wrote:

> Currently, PR KVM uses 4k pages for the host-side mappings of guest
> memory, regardless of the host page size.  When the host page size is
> 64kB, we might as well use 64k host page mappings for guest mappings
> of 64kB and larger pages and for guest real-mode mappings.  However,
> the magic page has to remain a 4k page.
> 
> To implement this, we first add another flag bit to the guest VSID
> values we use, to indicate that this segment is one where host pages
> should be mapped using 64k pages.  For segments with this bit set
> we set the bits in the shadow SLB entry to indicate a 64k base page
> size.  When faulting in host HPTEs for this segment, we make them
> 64k HPTEs instead of 4k.  We record the pagesize in struct hpte_cache
> for use when invalidating the HPTE.
> 
> Signed-off-by: Paul Mackerras 
> ---
> arch/powerpc/include/asm/kvm_book3s.h |  6 --
> arch/powerpc/kvm/book3s_32_mmu.c  |  1 +
> arch/powerpc/kvm/book3s_64_mmu.c  | 35 ++-
> arch/powerpc/kvm/book3s_64_mmu_host.c | 27 +--
> arch/powerpc/kvm/book3s_pr.c  |  1 +
> 5 files changed, 57 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> b/arch/powerpc/include/asm/kvm_book3s.h
> index 175f876..322b539 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -66,6 +66,7 @@ struct hpte_cache {
>   u64 pfn;
>   ulong slot;
>   struct kvmppc_pte pte;
> + int pagesize;
> };
> 
> struct kvmppc_vcpu_book3s {
> @@ -113,8 +114,9 @@ struct kvmppc_vcpu_book3s {
> #define CONTEXT_GUEST 1
> #define CONTEXT_GUEST_END 2
> 
> -#define VSID_REAL0x0fc0ULL
> -#define VSID_BAT 0x0fb0ULL
> +#define VSID_REAL0x07c0ULL
> +#define VSID_BAT 0x07b0ULL
> +#define VSID_64K 0x0800ULL
> #define VSID_1T   0x1000ULL
> #define VSID_REAL_DR  0x2000ULL
> #define VSID_REAL_IR  0x4000ULL
> diff --git a/arch/powerpc/kvm/book3s_32_mmu.c 
> b/arch/powerpc/kvm/book3s_32_mmu.c
> index c8cefdd..af04553 100644
> --- a/arch/powerpc/kvm/book3s_32_mmu.c
> +++ b/arch/powerpc/kvm/book3s_32_mmu.c
> @@ -308,6 +308,7 @@ static int kvmppc_mmu_book3s_32_xlate(struct kvm_vcpu 
> *vcpu, gva_t eaddr,
>   ulong mp_ea = vcpu->arch.magic_page_ea;
> 
>   pte->eaddr = eaddr;
> + pte->page_size = MMU_PAGE_4K;
> 
>   /* Magic page override */
>   if (unlikely(mp_ea) &&
> diff --git a/arch/powerpc/kvm/book3s_64_mmu.c 
> b/arch/powerpc/kvm/book3s_64_mmu.c
> index d5fa26c..658ccd7 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu.c
> @@ -542,6 +542,16 @@ static void kvmppc_mmu_book3s_64_tlbie(struct kvm_vcpu 
> *vcpu, ulong va,
>   kvmppc_mmu_pte_vflush(vcpu, va >> 12, mask);
> }
> 
> +#ifdef CONFIG_PPC_64K_PAGES
> +static int segment_contains_magic_page(struct kvm_vcpu *vcpu, ulong esid)
> +{
> + ulong mp_ea = vcpu->arch.magic_page_ea;
> +
> + return mp_ea && !(vcpu->arch.shared->msr & MSR_PR) &&
> + (mp_ea >> SID_SHIFT) == esid;
> +}
> +#endif
> +
> static int kvmppc_mmu_book3s_64_esid_to_vsid(struct kvm_vcpu *vcpu, ulong 
> esid,
>u64 *vsid)
> {
> @@ -549,11 +559,13 @@ static int kvmppc_mmu_book3s_64_esid_to_vsid(struct 
> kvm_vcpu *vcpu, ulong esid,
>   struct kvmppc_slb *slb;
>   u64 gvsid = esid;
>   ulong mp_ea = vcpu->arch.magic_page_ea;
> + int pagesize = MMU_PAGE_64K;
> 
>   if (vcpu->arch.shared->msr & (MSR_DR|MSR_IR)) {
>   slb = kvmppc_mmu_book3s_64_find_slbe(vcpu, ea);
>   if (slb) {
>   gvsid = slb->vsid;
> + pagesize = slb->base_page_size;
>   if (slb->tb) {
>   gvsid <<= SID_SHIFT_1T - SID_SHIFT;
>   gvsid |= esid & ((1ul << (SID_SHIFT_1T - 
> SID_SHIFT)) - 1);
> @@ -564,28 +576,41 @@ static int kvmppc_mmu_book3s_64_esid_to_vsid(struct 
> kvm_vcpu *vcpu, ulong esid,
> 
>   switch (vcpu->arch.shared->msr & (MSR_DR|MSR_IR)) {
>   case 0:
> - *vsid = VSID_REAL | esid;
> + gvsid = VSID_REAL | esid;
>   break;
>   case MSR_IR:
> - *vsid = VSID_REAL_IR | gvsid;
> + gvsid |= VSID_REAL_IR;
>   break;
>   case MSR_DR:
> - *vsid = VSID_REAL_DR | gvsid;
> + gvsid |= VSID_REAL_DR;
>   break;
>   case MSR_DR|MSR_IR:
>   if (!slb)
>   goto no_slb;
> 
> - *vsid = gvsid;
>   break;
>   default:
>   BUG();
>   break;
>   }
> 
> +#ifdef CONFIG_PPC_64K_PAGES
> + /*
> +  * Mark this as a 64k segment if the host is using
> +  * 64k pages, the h

Re: [PATCH 06/23] KVM: PPC: Book3S PR: Allow guest to use 64k pages

2013-08-28 Thread Alexander Graf

On 06.08.2013, at 06:18, Paul Mackerras wrote:

> This adds the code to interpret 64k HPTEs in the guest hashed page
> table (HPT), 64k SLB entries, and to tell the guest about 64k pages
> in kvm_vm_ioctl_get_smmu_info().  Guest 64k pages are still shadowed
> by 4k pages.
> 
> This also adds another hash table to the four we have already in
> book3s_mmu_hpte.c to allow us to find all the PTEs that we have
> instantiated that match a given 64k guest page.
> 
> The tlbie instruction changed starting with POWER6 to use a bit in
> the RB operand to indicate large page invalidations, and to use other
> RB bits to indicate the base and actual page sizes and the segment
> size.  64k pages came in slightly earlier, with POWER5++.  At present
> we use one bit in vcpu->arch.hflags to indicate that the emulated
> cpu supports 64k pages and also has the new tlbie definition.  If
> we ever want to support emulation of POWER5++, we will need to use
> another bit.
> 
> Signed-off-by: Paul Mackerras 
> ---
> arch/powerpc/include/asm/kvm_asm.h|  1 +
> arch/powerpc/include/asm/kvm_book3s.h |  6 +++
> arch/powerpc/include/asm/kvm_host.h   |  4 ++
> arch/powerpc/kvm/book3s_64_mmu.c  | 92 +++
> arch/powerpc/kvm/book3s_mmu_hpte.c| 50 +++
> arch/powerpc/kvm/book3s_pr.c  | 30 +++-
> 6 files changed, 173 insertions(+), 10 deletions(-)
> 
> 

[...]

> @@ -1127,8 +1144,13 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm 
> *kvm, unsigned int id)
>   goto uninit_vcpu;
> 
> #ifdef CONFIG_PPC_BOOK3S_64
> - /* default to book3s_64 (970fx) */
> + /*
> +  * Default to the same as the host if we're on a POWER7[+],
> +  * otherwise default to PPC970FX.
> +  */
>   vcpu->arch.pvr = 0x3C0301;
> + if (cpu_has_feature(CPU_FTR_ARCH_206))
> + vcpu->arch.pvr = mfspr(SPRN_PVR);

Unrelated change? Also, why? Any reasonable user space these days should set 
PVR anyways.

> #else
>   /* default to book3s_32 (750) */
>   vcpu->arch.pvr = 0x84202;
> @@ -1331,6 +1353,12 @@ int kvm_vm_ioctl_get_smmu_info(struct kvm *kvm, struct 
> kvm_ppc_smmu_info *info)
>   info->sps[1].enc[0].page_shift = 24;
>   info->sps[1].enc[0].pte_enc = 0;
> 
> + /* 64k large page size */
> + info->sps[2].page_shift = 16;
> + info->sps[2].slb_enc = SLB_VSID_L | SLB_VSID_LP_01;
> + info->sps[2].enc[0].page_shift = 16;
> + info->sps[2].enc[0].pte_enc = 1;

We only support this with BOOK3S_HFLAG_MULTI_PGSIZE, no?


Alex



Re: [PATCH 05/23] KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate()

2013-08-28 Thread Alexander Graf

On 06.08.2013, at 06:18, Paul Mackerras wrote:

> This reworks kvmppc_mmu_book3s_64_xlate() to make it check the large
> page bit in the hashed page table entries (HPTEs) it looks at, and
> to simplify and streamline the code.  The checking of the first dword
> of each HPTE is now done with a single mask and compare operation,
> and all the code dealing with the matching HPTE, if we find one,
> is consolidated in one place in the main line of the function flow.
> 
> Signed-off-by: Paul Mackerras 

Thanks, applied to kvm-ppc-queue.


Alex



Re: [PATCH 02/23] KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX

2013-08-28 Thread Alexander Graf

On 06.08.2013, at 06:14, Paul Mackerras wrote:

> Currently the code assumes that once we load up guest FP/VSX or VMX
> state into the CPU, it stays valid in the CPU registers until we
> explicitly flush it to the thread_struct.  However, on POWER7,
> copy_page() and memcpy() can use VMX.  These functions do flush the
> VMX state to the thread_struct before using VMX instructions, but if
> this happens while we have guest state in the VMX registers, and we
> then re-enter the guest, we don't reload the VMX state from the
> thread_struct, leading to guest corruption.  This has been observed
> to cause guest processes to segfault.
> 
> To fix this, we check before re-entering the guest that all of the
> bits corresponding to facilities owned by the guest, as expressed
> in vcpu->arch.guest_owned_ext, are set in current->thread.regs->msr.
> Any bits that have been cleared correspond to facilities that have
> been used by kernel code and thus flushed to the thread_struct, so
> for them we reload the state from the thread_struct.
> 
> We also need to check current->thread.regs->msr before calling
> giveup_fpu() or giveup_altivec(), since if the relevant bit is
> clear, the state has already been flushed to the thread_struct and
> to flush it again would corrupt it.
> 
> Signed-off-by: Paul Mackerras 
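
For orientation, a sketch of the re-entry check the commit message
describes; the helper name and the exact reload calls here are
assumptions, not the verbatim patch:

static void reload_lost_guest_ext(struct kvm_vcpu *vcpu)
{
	/* Facilities the guest owns, but whose MSR bits the kernel has
	 * since cleared, were flushed to the thread_struct: */
	ulong lost_ext = vcpu->arch.guest_owned_ext &
			 ~current->thread.regs->msr;

	if (lost_ext & MSR_FP)
		load_up_fpu();		/* pull FP/VSX state back in */
#ifdef CONFIG_ALTIVEC
	if (lost_ext & MSR_VEC)
		load_up_altivec();	/* pull VMX state back in */
#endif
	current->thread.regs->msr |= lost_ext;
}
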

Thanks, applied to kvm-ppc-queue.


Alex



Re: [PATCH 03/23] KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls

2013-08-28 Thread Alexander Graf

On 06.08.2013, at 06:15, Paul Mackerras wrote:

> It turns out that if we exit the guest due to a hcall instruction (sc 1),
> and the loading of the instruction in the guest exit path fails for any
> reason, the call to kvmppc_ld() in kvmppc_get_last_inst() fetches the
> instruction after the hcall instruction rather than the hcall itself.
> This in turn means that the instruction doesn't get recognized as an
> hcall in kvmppc_handle_exit_pr() but gets passed to the guest kernel
> as a sc instruction.  That usually results in the guest kernel getting
> a return code of 38 (ENOSYS) from an hcall, which often triggers a
> BUG_ON() or other failure.
> 
> This fixes the problem by adding a new variant of kvmppc_get_last_inst()
> called kvmppc_get_last_sc(), which fetches the instruction if necessary
> from pc - 4 rather than pc.
> 
> Signed-off-by: Paul Mackerras 
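
The variant reads roughly like the sketch below (assumption-based; see the
patch itself for the real definition). The point is simply that on an sc
exit the saved PC is already past the instruction, so the fallback load
must come from pc - 4:

static u32 kvmppc_get_last_sc(struct kvm_vcpu *vcpu)
{
	ulong pc = kvmppc_get_pc(vcpu) - 4;	/* the hcall, not its successor */

	if (vcpu->arch.last_inst == KVM_INST_FETCH_FAILED)
		kvmppc_ld(vcpu, &pc, sizeof(u32),
			  &vcpu->arch.last_inst, false);
	return vcpu->arch.last_inst;
}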

Thanks, applied to kvm-ppc-queue.


Alex



Re: [PATCH 01/23] KVM: PPC: Book3S: Fix compile error in XICS emulation

2013-08-28 Thread Alexander Graf

On 06.08.2013, at 06:13, Paul Mackerras wrote:

> Commit 8e44ddc3f3 ("powerpc/kvm/book3s: Add support for H_IPOLL and
> H_XIRR_X in XICS emulation") added a call to get_tb() but didn't
> include the header that defines it, and on some configs this means
> book3s_xics.c fails to compile:
> 
> arch/powerpc/kvm/book3s_xics.c: In function ‘kvmppc_xics_hcall’:
> arch/powerpc/kvm/book3s_xics.c:812:3: error: implicit declaration of function 
> ‘get_tb’ [-Werror=implicit-function-declaration]
> 
> Cc: sta...@vger.kernel.org [v3.10]
> Signed-off-by: Paul Mackerras 

Thanks, applied to kvm-ppc-queue.


Alex



Re: [PATCH 04/23] KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu

2013-08-28 Thread Alexander Graf

On 06.08.2013, at 06:16, Paul Mackerras wrote:

> Currently PR-style KVM keeps the volatile guest register values
> (R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than
> the main kvm_vcpu struct.  For 64-bit, the shadow_vcpu exists in two
> places, a kmalloc'd struct and in the PACA, and it gets copied back
> and forth in kvmppc_core_vcpu_load/put(), because the real-mode code
> can't rely on being able to access the kmalloc'd struct.
> 
> This changes the code to copy the volatile values into the shadow_vcpu
> as one of the last things done before entering the guest.  Similarly
> the values are copied back out of the shadow_vcpu to the kvm_vcpu
> immediately after exiting the guest.  We arrange for interrupts to be
> still disabled at this point so that we can't get preempted on 64-bit
> and end up copying values from the wrong PACA.
> 
> This means that the accessor functions in kvm_book3s.h for these
> registers are greatly simplified, and are same between PR and HV KVM.
> In places where accesses to shadow_vcpu fields are now replaced by
> accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs.
> Finally, on 64-bit, we don't need the kmalloc'd struct at all any more.
> 
> With this, the time to read the PVR one million times in a loop went
> from 582.1ms to 584.3ms (averages of 10 values), a difference which is
> not statistically significant given the variability of the results
> (the standard deviations were 9.5ms and 8.6ms respectively).  A version
> of the patch that used loops to copy the GPR values increased that time
> by around 5% to 611.2ms, so the loop has been unrolled.
> 
> Signed-off-by: Paul Mackerras 
> ---
> arch/powerpc/include/asm/kvm_book3s.h | 220 +-
> arch/powerpc/include/asm/kvm_book3s_asm.h |   6 +-
> arch/powerpc/include/asm/kvm_host.h   |   1 +
> arch/powerpc/kernel/asm-offsets.c |   4 +-
> arch/powerpc/kvm/book3s_emulate.c |   8 +-
> arch/powerpc/kvm/book3s_interrupts.S  |  26 +++-
> arch/powerpc/kvm/book3s_pr.c  | 122 -
> arch/powerpc/kvm/book3s_rmhandlers.S  |   5 -
> arch/powerpc/kvm/trace.h  |   7 +-
> 9 files changed, 156 insertions(+), 243 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> b/arch/powerpc/include/asm/kvm_book3s.h
> index fa19e2f..a8897c1 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -198,140 +198,76 @@ extern void kvm_return_point(void);
> #include 
> #endif
> 
> -#ifdef CONFIG_KVM_BOOK3S_PR
> -
> -static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu *vcpu)
> -{
> - return to_book3s(vcpu)->hior;
> -}
> -
> -static inline void kvmppc_update_int_pending(struct kvm_vcpu *vcpu,
> - unsigned long pending_now, unsigned long old_pending)
> -{
> - if (pending_now)
> - vcpu->arch.shared->int_pending = 1;
> - else if (old_pending)
> - vcpu->arch.shared->int_pending = 0;
> -}
> -
> static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, ulong val)
> {
> - if ( num < 14 ) {
> - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
> - svcpu->gpr[num] = val;
> - svcpu_put(svcpu);
> - to_book3s(vcpu)->shadow_vcpu->gpr[num] = val;
> - } else
> - vcpu->arch.gpr[num] = val;
> + vcpu->arch.gpr[num] = val;
> }
> 
> static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, int num)
> {
> - if ( num < 14 ) {
> - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
> - ulong r = svcpu->gpr[num];
> - svcpu_put(svcpu);
> - return r;
> - } else
> - return vcpu->arch.gpr[num];
> + return vcpu->arch.gpr[num];
> }
> 
> static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
> {
> - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
> - svcpu->cr = val;
> - svcpu_put(svcpu);
> - to_book3s(vcpu)->shadow_vcpu->cr = val;
> + vcpu->arch.cr = val;
> }
> 
> static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu)
> {
> - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
> - u32 r;
> - r = svcpu->cr;
> - svcpu_put(svcpu);
> - return r;
> + return vcpu->arch.cr;
> }
> 
> static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, u32 val)
> {
> - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
> - svcpu->xer = val;
> - to_book3s(vcpu)->shadow_vcpu->xer = val;
> - svcpu_put(svcpu);
> + vcpu->arch.xer = val;
> }
> 
> static inline u32 kvmppc_get_xer(struct kvm_vcpu *vcpu)
> {
> - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu);
> - u32 r;
> - r = svcpu->xer;
> - svcpu_put(svcpu);
> - return r;
> + return vcpu->arch.xer;
> }
> 
> static inline void kvmppc_set_ctr(struct kvm_vcpu *vcpu, ulong val)
> {
> - struct kvmppc_bo

Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 14:23:58 -0700 Kent Overstreet  wrote:

> > I found things to be quite the opposite - it took 5 minutes of staring,
> > head-scratching, double-checking and penny-dropping before I was
> > confident that the newly-added code actually has nothing at all to do
> > with the current code.  Putting it in the same file was misleading, and
> > I got misled.
> 
> Ok... and I could see how the fact that it currently _doesn't_ have
> anything to do with the existing code would be confusing...
> 
> Do you think that if/when it's making use of the ida rewrite it'll be
> ok? Or would you still prefer to have it in a new file

I'm constitutionally reluctant to ever assume that any out-of-tree code
will be merged.  Maybe you'll get hit by a bus, and maybe the code
sucks ;)

Are you sure that the two things are so tangled together that they must
live in the same file?  If there's some nice layering between ida and
percpu_ida then perhaps such a physical separation would remain
appropriate?

> (and if so, any preference on the naming?)

percpu_ida.c?


Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Kent Overstreet
On Wed, Aug 28, 2013 at 02:10:19PM -0700, Andrew Morton wrote:
> On Wed, 28 Aug 2013 14:00:10 -0700 Kent Overstreet  wrote:
> 
> > On Wed, Aug 28, 2013 at 01:25:50PM -0700, Andrew Morton wrote:
> > > On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet  
> > > wrote:
> > > 
> > > > Fixup patch, addressing Andrew's review feedback:
> > > 
> > > Looks reasonable.
> > > 
> > > >  lib/idr.c   | 38 +-
> > > 
> > > I still don't think it should be in this file.
> > > 
> > > You say that some as-yet-unmerged patches will tie the new code into
> > > the old ida code.  But will it do it in a manner which requires that
> > > the two reside in the same file?
> > 
> > Not require, no - but it's just intimate enough with my ida rewrite that
> > I think it makes sense; it makes some use of stuff that should be
> > internal to the ida code.
> > 
> > Mostly just sharing the lock though, since I got rid of the ida
> > interfaces that don't do locking, but percpu ida needs a lock that also
> > covers what ida needs.
> > 
> > It also makes use of a ganged allocation interface, but there's no real
> > reason ida can't expose that, it's just unlikely to be useful to
> > anything but percpu ida.
> > 
> > The other reason I think it makes sense to live in idr.c is more for
> > users of the code; as you pointed out as far as the user's perspective
> > percpu ida isn't doing anything fundamentally different from ida, so I
> > think it makes sense for the code to live in the same place as a
> > kindness to future kernel developers who are trying to find their way
> > around the various library code.
> 
> I found things to be quite the opposite - it took 5 minutes of staring,
> head-scratching, double-checking and penny-dropping before I was
> confident that the newly-added code actually has nothing at all to do
> with the current code.  Putting it in the same file was misleading, and
> I got misled.

Ok... and I could see how the fact that it currently _doesn't_ have
anything to do with the existing code would be confusing...

Do you think that if/when it's making use of the ida rewrite it'll be
ok? Or would you still prefer to have it in a new file (and if so, any
preference on the naming?)


Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 14:12:17 -0700 Kent Overstreet  wrote:

> How's this look?
> 
> diff --git a/lib/idr.c b/lib/idr.c
> index 15c021c..a3f8e9a 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -1288,6 +1288,11 @@ static inline unsigned alloc_local_tag(struct percpu_ida *pool,
>   * Safe to be called from interrupt context (assuming it isn't passed
>   * __GFP_WAIT, of course).
>   *
> + * @gfp indicates whether or not to wait until a free id is available
> + * (it's not used for internal memory allocations); thus if passed
> + * __GFP_WAIT we may sleep however long it takes until another thread
> + * frees an id (same semantics as a mempool).

Looks good.  Mentioning the mempool thing is effective - people
understand that.


Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Kent Overstreet
On Wed, Aug 28, 2013 at 01:50:42PM -0700, Andrew Morton wrote:
> On Wed, 28 Aug 2013 13:44:54 -0700 Kent Overstreet  wrote:
> 
> > > > > What guarantees that this wait will terminate?
> > > > 
> > > > It seems fairly clear to me from the break statement a couple lines up;
> > > > if we were passed __GFP_WAIT we terminate iff we succesfully allocated a
> > > > tag. If we weren't passed __GFP_WAIT we never actually sleep.
> > > 
> > > OK ;)  Let me rephrase.  What guarantees that a tag will become available?
> > > 
> > > If what we have here is an open-coded __GFP_NOFAIL then that is
> > > potentially problematic.
> > 
> > It's the same semantics as a mempool, really - it'll succeed when a tag
> > gets freed.
> 
> OK, that's reasonable if the code is being used to generate IO tags -
> we expect the in-flight tags to eventually be returned.
> 
> But if a client of this code is using the allocator for something
> totally different, there is no guarantee that the act of waiting will
> result in any tags being returned.

Yeah, and I did wonder a bit whether the waiting mechanism belonged in
the percpu ida code; arguably (certainly just looking at this code, not
any of the users) if it belongs in this code it should be common to
regular ida, not specific to percpu ida.

For now I've decided to punt on changing that, since all the percpu ida
users I've come across do want the waiting mechanism, but none of the
regular ida users I've looked at want it. There's probably a reason for
that I haven't thought of yet.

> (These are core design principles/constraints which should be
> explicitly documented in a place where future readers will see them!)

*nod* I suppose it should be said explicitly that the gfp_t parameter
indicates whether or not to wait until a _tag_ is available, and not
some internal memory allocation or something.

How's this look?

diff --git a/lib/idr.c b/lib/idr.c
index 15c021c..a3f8e9a 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -1288,6 +1288,11 @@ static inline unsigned alloc_local_tag(struct percpu_ida *pool,
  * Safe to be called from interrupt context (assuming it isn't passed
  * __GFP_WAIT, of course).
  *
+ * @gfp indicates whether or not to wait until a free id is available (it's not
+ * used for internal memory allocations); thus if passed __GFP_WAIT we may sleep
+ * however long it takes until another thread frees an id (same semantics as a
+ * mempool).
+ *
  * Will not fail if passed __GFP_WAIT.
  */
 int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
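
To make the two modes concrete, here's a minimal sketch of a caller (a
hypothetical pool of 128 tags; setup error handling elided):

struct percpu_ida pool;	/* assume percpu_ida_init(&pool, 128) succeeded */

int get_tag_nonblocking(void)
{
	/* no __GFP_WAIT in GFP_NOWAIT: returns a negative value right away
	 * if both the percpu and the global freelists are empty */
	return percpu_ida_alloc(&pool, GFP_NOWAIT);
}

int get_tag_blocking(void)
{
	/* GFP_KERNEL includes __GFP_WAIT: sleeps, mempool-style, until
	 * another thread calls percpu_ida_free() - will not fail */
	return percpu_ida_alloc(&pool, GFP_KERNEL);
}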


Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 14:00:10 -0700 Kent Overstreet  wrote:

> On Wed, Aug 28, 2013 at 01:25:50PM -0700, Andrew Morton wrote:
> > On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet  
> > wrote:
> > 
> > > Fixup patch, addressing Andrew's review feedback:
> > 
> > Looks reasonable.
> > 
> > >  lib/idr.c   | 38 +-
> > 
> > I still don't think it should be in this file.
> > 
> > You say that some as-yet-unmerged patches will tie the new code into
> > the old ida code.  But will it do it in a manner which requires that
> > the two reside in the same file?
> 
> Not require, no - but it's just intimate enough with my ida rewrite that
> I think it makes sense; it makes some use of stuff that should be
> internal to the ida code.
> 
> Mostly just sharing the lock though, since I got rid of the ida
> interfaces that don't do locking, but percpu ida needs a lock that also
> covers what ida needs.
> 
> It also makes use of a ganged allocation interface, but there's no real
> reason ida can't expose that, it's just unlikely to be useful to
> anything but percpu ida.
> 
> The other reason I think it makes sense to live in idr.c is more for
> users of the code; as you pointed out as far as the user's perspective
> percpu ida isn't doing anything fundamentally different from ida, so I
> think it makes sense for the code to live in the same place as a
> kindness to future kernel developers who are trying to find their way
> around the various library code.

I found things to be quite the opposite - it took 5 minutes of staring,
head-scratching, double-checking and penny-dropping before I was
confident that the newly-added code actually has nothing at all to do
with the current code.  Putting it in the same file was misleading, and
I got misled.


Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Kent Overstreet
On Wed, Aug 28, 2013 at 01:25:50PM -0700, Andrew Morton wrote:
> On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet  wrote:
> 
> > Fixup patch, addressing Andrew's review feedback:
> 
> Looks reasonable.
> 
> >  lib/idr.c   | 38 +-
> 
> I still don't think it should be in this file.
> 
> You say that some as-yet-unmerged patches will tie the new code into
> the old ida code.  But will it do it in a manner which requires that
> the two reside in the same file?

Not require, no - but it's just intimate enough with my ida rewrite that
I think it makes sense; it makes some use of stuff that should be
internal to the ida code.

Mostly just sharing the lock though, since I got rid of the ida
interfaces that don't do locking, but percpu ida needs a lock that also
covers what ida needs.

It also makes use of a ganged allocation interface, but there's no real
reason ida can't expose that, it's just unlikely to be useful to
anything but percpu ida.

The other reason I think it makes sense to live in idr.c is more for
users of the code; as you pointed out as far as the user's perspective
percpu ida isn't doing anything fundamentally different from ida, so I
think it makes sense for the code to live in the same place as a
kindness to future kernel developers who are trying to find their way
around the various library code.


Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 13:44:54 -0700 Kent Overstreet  wrote:

> > > > What guarantees that this wait will terminate?
> > > 
> > > It seems fairly clear to me from the break statement a couple lines up;
> > > if we were passed __GFP_WAIT we terminate iff we successfully allocated a
> > > tag. If we weren't passed __GFP_WAIT we never actually sleep.
> > 
> > OK ;)  Let me rephrase.  What guarantees that a tag will become available?
> > 
> > If what we have here is an open-coded __GFP_NOFAIL then that is
> > potentially problematic.
> 
> It's the same semantics as a mempool, really - it'll succeed when a tag
> gets freed.

OK, that's reasonable if the code is being used to generate IO tags -
we expect the in-flight tags to eventually be returned.

But if a client of this code is using the allocator for something
totally different, there is no guarantee that the act of waiting will
result in any tags being returned.

(These are core design principles/constraints which should be
explicitly documented in a place where future readers will see them!)



Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Kent Overstreet
On Wed, Aug 28, 2013 at 01:23:32PM -0700, Andrew Morton wrote:
> On Wed, 28 Aug 2013 12:53:17 -0700 Kent Overstreet  wrote:
> 
> > > > +   while (1) {
> > > > +   spin_lock(&pool->lock);
> > > > +
> > > > +   /*
> > > > +* prepare_to_wait() must come before steal_tags(), in case
> > > > +* percpu_ida_free() on another cpu flips a bit in
> > > > +* cpus_have_tags
> > > > +*
> > > > +* global lock held and irqs disabled, don't need percpu lock
> > > > +*/
> > > > +   prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> > > > +
> > > > +   if (!tags->nr_free)
> > > > +   alloc_global_tags(pool, tags);
> > > > +   if (!tags->nr_free)
> > > > +   steal_tags(pool, tags);
> > > > +
> > > > +   if (tags->nr_free) {
> > > > +   tag = tags->freelist[--tags->nr_free];
> > > > +   if (tags->nr_free)
> > > > +   set_bit(smp_processor_id(),
> > > > +   pool->cpus_have_tags);
> > > > +   }
> > > > +
> > > > +   spin_unlock(&pool->lock);
> > > > +   local_irq_restore(flags);
> > > > +
> > > > +   if (tag >= 0 || !(gfp & __GFP_WAIT))
> > > > +   break;
> > > > +
> > > > +   schedule();
> > > > +
> > > > +   local_irq_save(flags);
> > > > +   tags = this_cpu_ptr(pool->tag_cpu);
> > > > +   }
> > > 
> > > What guarantees that this wait will terminate?
> > 
> > It seems fairly clear to me from the break statement a couple lines up;
> > if we were passed __GFP_WAIT we terminate iff we successfully allocated a
> > tag. If we weren't passed __GFP_WAIT we never actually sleep.
> 
> OK ;)  Let me rephrase.  What guarantees that a tag will become available?
> 
> If what we have here is an open-coded __GFP_NOFAIL then that is
> potentially problematic.

It's the same semantics as a mempool, really - it'll succeed when a tag
gets freed.  If we are sleeping then there isn't really anything else we
can do, there isn't anything we're trying in the __GFP_WAIT case that
we're not trying in the GFP_NOWAIT case.


Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet  wrote:

> Fixup patch, addressing Andrew's review feedback:

Looks reasonable.

>  lib/idr.c   | 38 +-

I still don't think it should be in this file.

You say that some as-yet-unmerged patches will tie the new code into
the old ida code.  But will it do it in a manner which requires that
the two reside in the same file?



Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 12:53:17 -0700 Kent Overstreet  wrote:

> > > + while (1) {
> > > + spin_lock(&pool->lock);
> > > +
> > > + /*
> > > +  * prepare_to_wait() must come before steal_tags(), in case
> > > +  * percpu_ida_free() on another cpu flips a bit in
> > > +  * cpus_have_tags
> > > +  *
> > > +  * global lock held and irqs disabled, don't need percpu lock
> > > +  */
> > > + prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> > > +
> > > + if (!tags->nr_free)
> > > + alloc_global_tags(pool, tags);
> > > + if (!tags->nr_free)
> > > + steal_tags(pool, tags);
> > > +
> > > + if (tags->nr_free) {
> > > + tag = tags->freelist[--tags->nr_free];
> > > + if (tags->nr_free)
> > > + set_bit(smp_processor_id(),
> > > + pool->cpus_have_tags);
> > > + }
> > > +
> > > + spin_unlock(&pool->lock);
> > > + local_irq_restore(flags);
> > > +
> > > + if (tag >= 0 || !(gfp & __GFP_WAIT))
> > > + break;
> > > +
> > > + schedule();
> > > +
> > > + local_irq_save(flags);
> > > + tags = this_cpu_ptr(pool->tag_cpu);
> > > + }
> > 
> > What guarantees that this wait will terminate?
> 
> It seems fairly clear to me from the break statement a couple lines up;
> if we were passed __GFP_WAIT we terminate iff we successfully allocated a
> tag. If we weren't passed __GFP_WAIT we never actually sleep.

OK ;)  Let me rephrase.  What guarantees that a tag will become available?

If what we have here is an open-coded __GFP_NOFAIL then that is
potentially problematic.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Kent Overstreet
Fixup patch, addressing Andrew's review feedback:

Signed-off-by: Kent Overstreet 
---
 include/linux/idr.h |  2 +-
 lib/idr.c   | 38 +-
 2 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index f0db12b..cdf39be 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -267,7 +267,7 @@ struct percpu_ida {
 * will just keep looking - but the bitmap _must_ be set whenever a
 * percpu freelist does have tags.
 */
-   unsigned long   *cpus_have_tags;
+   cpumask_t   cpus_have_tags;
 
struct {
spinlock_t  lock;
diff --git a/lib/idr.c b/lib/idr.c
index 26495e1..15c021c 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -1178,7 +1178,13 @@ EXPORT_SYMBOL(ida_init);
 #define IDA_PCPU_SIZE  ((IDA_PCPU_BATCH_MOVE * 3) / 2)
 
 struct percpu_ida_cpu {
+   /*
+* Even though this is percpu, we need a lock for tag stealing by remote
+* CPUs:
+*/
spinlock_t  lock;
+
+   /* nr_free/freelist form a stack of free IDs */
	unsigned	nr_free;
	unsigned	freelist[];
 };
@@ -1209,21 +1215,21 @@ static inline void steal_tags(struct percpu_ida *pool,
unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
struct percpu_ida_cpu *remote;
 
-   for (cpus_have_tags = bitmap_weight(pool->cpus_have_tags, nr_cpu_ids);
+   for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
 cpus_have_tags * IDA_PCPU_SIZE > pool->nr_tags / 2;
 cpus_have_tags--) {
-   cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu);
+   cpu = cpumask_next(cpu, &pool->cpus_have_tags);
 
-   if (cpu == nr_cpu_ids)
-   cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);
+   if (cpu >= nr_cpu_ids)
+   cpu = cpumask_first(&pool->cpus_have_tags);
 
-   if (cpu == nr_cpu_ids)
+   if (cpu >= nr_cpu_ids)
BUG();
 
pool->cpu_last_stolen = cpu;
remote = per_cpu_ptr(pool->tag_cpu, cpu);
 
-   clear_bit(cpu, pool->cpus_have_tags);
+   cpumask_clear_cpu(cpu, &pool->cpus_have_tags);
 
if (remote == tags)
continue;
@@ -1246,6 +1252,10 @@ static inline void steal_tags(struct percpu_ida *pool,
}
 }
 
+/*
+ * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
+ * our percpu freelist:
+ */
 static inline void alloc_global_tags(struct percpu_ida *pool,
 struct percpu_ida_cpu *tags)
 {
@@ -1317,8 +1327,8 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
if (tags->nr_free) {
tag = tags->freelist[--tags->nr_free];
if (tags->nr_free)
-   set_bit(smp_processor_id(),
-   pool->cpus_have_tags);
+   cpumask_set_cpu(smp_processor_id(),
+   &pool->cpus_have_tags);
}
 
spin_unlock(&pool->lock);
@@ -1363,8 +1373,8 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
spin_unlock(&tags->lock);
 
if (nr_free == 1) {
-   set_bit(smp_processor_id(),
-   pool->cpus_have_tags);
+   cpumask_set_cpu(smp_processor_id(),
+   &pool->cpus_have_tags);
wake_up(&pool->wait);
}
 
@@ -1398,7 +1408,6 @@ EXPORT_SYMBOL_GPL(percpu_ida_free);
 void percpu_ida_destroy(struct percpu_ida *pool)
 {
free_percpu(pool->tag_cpu);
-   kfree(pool->cpus_have_tags);
free_pages((unsigned long) pool->freelist,
   get_order(pool->nr_tags * sizeof(unsigned)));
 }
@@ -1428,7 +1437,7 @@ int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags)
 
/* Guard against overflow */
if (nr_tags > (unsigned) INT_MAX + 1) {
-   pr_err("tags.c: nr_tags too large\n");
+   pr_err("percpu_ida_init(): nr_tags too large\n");
return -EINVAL;
}
 
@@ -1442,11 +1451,6 @@ int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags)
 
pool->nr_free = nr_tags;
 
-   pool->cpus_have_tags = kzalloc(BITS_TO_LONGS(nr_cpu_ids) *
-  sizeof(unsigned long), GFP_KERNEL);
-   if (!pool->cpus_have_tags)
-   goto err;
-
pool->tag_cpu = __alloc_percpu(sizeof(struct percpu_ida_cpu) +
   IDA_PCPU_SIZE * sizeof(unsigned),
   sizeof(unsigned));
-- 
1.8.4.rc3


Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Kent Overstreet
On Tue, Aug 20, 2013 at 02:31:57PM -0700, Andrew Morton wrote:
> On Fri, 16 Aug 2013 23:09:06 + "Nicholas A. Bellinger" 
>  wrote:
> > +   /*
> > +* Bitmap of cpus that (may) have tags on their percpu freelists:
> > +* steal_tags() uses this to decide when to steal tags, and which cpus
> > +* to try stealing from.
> > +*
> > +* It's ok for a freelist to be empty when its bit is set - steal_tags()
> > +* will just keep looking - but the bitmap _must_ be set whenever a
> > +* percpu freelist does have tags.
> > +*/
> > +   unsigned long   *cpus_have_tags;
> 
> Why not cpumask_t?

I hadn't encountered it before - looks like it's probably what I want.

I don't see any explanation for the parallel set of operations for
working on cpumasks - e.g. next_cpu()/cpumask_next(). For now I'm going
with the cpumask_* versions - is that what I want?

If you can have a look at the fixup patch that'll be most appreciated.

> > +   struct {
> > +   spinlock_t  lock;
> > +   /*
> > +* When we go to steal tags from another cpu (see steal_tags()),
> > +* we want to pick a cpu at random. Cycling through them every
> > +* time we steal is a bit easier and more or less equivalent:
> > +*/
> > +   unsigned	cpu_last_stolen;
> > +
> > +   /* For sleeping on allocation failure */
> > +   wait_queue_head_t   wait;
> > +
> > +   /*
> > +* Global freelist - it's a stack where nr_free points to the
> > +* top
> > +*/
> > +   unsigned	nr_free;
> > +   unsigned	*freelist;
> > +   } ____cacheline_aligned_in_smp;
> 
> Why the cacheline_aligned_in_smp?

It's separating the RW stuff that isn't always touched from the RO stuff
that's used on every allocation.
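
Roughly the layout I'm going for, as a sketch (not the actual struct
from the patch):

struct example_pool {
	/* read-mostly fields, touched on every allocation */
	unsigned			nr_tags;
	struct example_cpu __percpu	*tag_cpu;

	/* slow-path state: kept on its own cacheline so contention on the
	 * lock doesn't bounce the read-mostly fields between cpus */
	struct {
		spinlock_t	lock;
		unsigned	nr_free;
		unsigned	*freelist;
	} ____cacheline_aligned_in_smp;
};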

> 
> > +};
> > 
> > ...
> >
> > +
> > +/* Percpu IDA */
> > +
> > +/*
> > + * Number of tags we move between the percpu freelist and the global 
> > freelist at
> > + * a time
> 
> "between a percpu freelist" would be more accurate?

No, because when we're stealing tags we always grab all of the remote
percpu freelist's tags - IDA_PCPU_BATCH_MOVE is only used when moving
to/from the global freelist.
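
For reference, the move_tags() helper those global-freelist transfers
go through is, roughly, just a batched stack-to-stack copy (a sketch):

static inline void move_tags(unsigned *dst, unsigned *dst_nr,
			     unsigned *src, unsigned *src_nr,
			     unsigned nr)
{
	/* pop nr entries off the top of src's stack... */
	*src_nr -= nr;
	/* ...and push them onto the top of dst's */
	memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
	*dst_nr += nr;
}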

> 
> > + */
> > +#define IDA_PCPU_BATCH_MOVE	32U
> > +
> > +/* Max size of percpu freelist, */
> > +#define IDA_PCPU_SIZE  ((IDA_PCPU_BATCH_MOVE * 3) / 2)
> > +
> > +struct percpu_ida_cpu {
> > +   spinlock_t  lock;
> > +   unsigned	nr_free;
> > +   unsigned	freelist[];
> > +};
> 
> Data structure needs documentation.  There's one of these per cpu.  I
> guess nr_free and freelist are clear enough.  The presence of a lock
> in a percpu data structure is a surprise.  It's for cross-cpu stealing,
> I assume?

Yeah, I'll add some comments.

> > +static inline void alloc_global_tags(struct percpu_ida *pool,
> > +struct percpu_ida_cpu *tags)
> > +{
> > +   move_tags(tags->freelist, &tags->nr_free,
> > + pool->freelist, &pool->nr_free,
> > + min(pool->nr_free, IDA_PCPU_BATCH_MOVE));
> > +}
> 
> Document this function?

Will do

> > +   while (1) {
> > +   spin_lock(&pool->lock);
> > +
> > +   /*
> > +* prepare_to_wait() must come before steal_tags(), in case
> > +* percpu_ida_free() on another cpu flips a bit in
> > +* cpus_have_tags
> > +*
> > +* global lock held and irqs disabled, don't need percpu lock
> > +*/
> > +   prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> > +
> > +   if (!tags->nr_free)
> > +   alloc_global_tags(pool, tags);
> > +   if (!tags->nr_free)
> > +   steal_tags(pool, tags);
> > +
> > +   if (tags->nr_free) {
> > +   tag = tags->freelist[--tags->nr_free];
> > +   if (tags->nr_free)
> > +   set_bit(smp_processor_id(),
> > +   pool->cpus_have_tags);
> > +   }
> > +
> > +   spin_unlock(&pool->lock);
> > +   local_irq_restore(flags);
> > +
> > +   if (tag >= 0 || !(gfp & __GFP_WAIT))
> > +   break;
> > +
> > +   schedule();
> > +
> > +   local_irq_save(flags);
> > +   tags = this_cpu_ptr(pool->tag_cpu);
> > +   }
> 
> What guarantees that this wait will terminate?

It seems fairly clear to me from the break statement a couple lines up;
if we were passed __GFP_WAIT we terminate iff we successfully allocated a
tag. If we weren't passed __GFP_WAIT we never actually sleep.

I can add a comment if you think it needs one.

> > +   finish_wait(&pool->wait, &wait);
> > +   return tag;
> > +}
> > +EXPORT_SYMBOL_GPL(percpu_ida_alloc);
> > +
> > +/**
> > + * percpu_ida_free - 

Re: VGA passthrough of GTX 660 to KVM guest

2013-08-28 Thread Gustav Sorenson
Hello Alex,

> Ok, so it's probably X.  You might be able to start your guest from here
> and get VGA through the gtx660 (assuming nothing else is broken with
> radeon VGA arbiter support).  You could also run startx to confirm that
> it's X doing all those VGA arbiter writes.  Thanks,

After I issue startx in runlevel 1, dmesg indeed does show numerous
vgaarb: this pci device is not a vga device
messages, but also once this:
vgaarb: device changed decodes:
PCI:0000:01:00.0,olddecodes=io+mem,decodes=none:owns=none

After having booted into runlevel 1 again, not starting X, I have
tried to invoke qemu without any disk and vfio-pci devices, as
described in my first mail, to see whether I could see SeaBIOS output
on the screen I passed through. Both times I tried this, my complete
host system seemed to freeze, then suddenly restarted itself.

When I boot the Windows DomU with pci-assign, the Windows Device
Manager does recognize a second graphics card, but labels it as a
standard VGA device rather than a GTX 660. Also, it says "device could
not be started (Code 10)".

Unfortunately, I haven't noticed the following up until now: Early
during boot, I am shown the message "kvm: no hardware support". After
booting has completed, dmesg | grep -i kvm gives
[0.949028] kvm: no hardware support
[0.949156] kvm: Nested Virtualization enabled
[0.949160] kvm: Nested Paging enabled

I'm a bit confused, since it seems like kvm says AMD-V is not enabled,
but still seems to work. Maybe it's related to the issue at hand?

Please let me know what I can do next to help resolve this.

Thank you very much!


Re: [Qemu-devel] Direct guest device access from nested guest

2013-08-28 Thread Lluís Vilanova
Jan Kiszka writes:

> On 2013-08-28 20:12, Lluís Vilanova wrote:
>> Jan Kiszka writes:
>> [...]
 Is it possible to give a nested guest direct access to a device on the 
 guest?
 (more specifically, an AHCI controller).
>> 
>>> Nope, we are lacking support for emulating or (securely) forwarding
>>> VT-d/IOMMU features to the first level guest. Would be cool to have,
>>> just not yet there. But I've talked to Intel people recently, and they
>>> are considering to support some nested VT-d with KVM.
>> 
>> Thanks a lot. I've been told there's some patches floating around to add such
>> support, but I suppose they've been long outdated and only work as POCs.

> I haven't seen anything in public.

This is what I've found:

  https://lists.nongnu.org/archive/html/qemu-devel/2011-04/msg01970.html


> PS: You have Mail-Followup-To set in your answers - people will drop you
> from CC this way.

Yes, my mail client tried to be clever but didn't know I'm not subscribed to the
KVM list :)


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: Direct guest device access from nested guest

2013-08-28 Thread Lluís Vilanova
Jan Kiszka writes:
[...]
>> Is it possible to give a nested guest direct access to a device on the guest?
>> (more specifically, an AHCI controller).

> Nope, we are lacking support for emulating or (securely) forwarding
> VT-d/IOMMU features to the first level guest. Would be cool to have,
> just not yet there. But I've talked to Intel people recently, and they
> are considering to support some nested VT-d with KVM.

Thanks a lot. I've been told there's some patches floating around to add such
support, but I suppose they've been long outdated and only work as POCs.


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: Direct guest device access from nested guest

2013-08-28 Thread Jan Kiszka
On 2013-08-28 20:12, Lluís Vilanova wrote:
> Jan Kiszka writes:
> [...]
>>> Is it possible to give a nested guest direct access to a device on the 
>>> guest?
>>> (more specifically, an AHCI controller).
> 
>> Nope, we are lacking support for emulating or (securely) forwarding
>> VT-d/IOMMU features to the first level guest. Would be cool to have,
>> just not yet there. But I've talked to Intel people recently, and they
>> are considering to support some nested VT-d with KVM.
> 
> Thanks a lot. I've been told there's some patches floating around to add such
> support, but I suppose they've been long outdated and only work as POCs.

I haven't seen anything in public.

Jan

PS: You have Mail-Followup-To set in your answers - people will drop you
from CC this way.

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: VGA passthrough of GTX 660 to KVM guest

2013-08-28 Thread Alex Williamson
On Wed, 2013-08-28 at 19:42 +0200, Gustav Sorenson wrote:
> Hello,
> 
> yes, 1:00.0 is the GTX 660, and yes, I'm running X on the host.
> When I boot into runlevel 1 (on Mint, which is Debian-based, so that
> corresponds to a single-user runlevel without X, although I saw a
> not-very-fancy boot splashscreen), this is what dmesg | grep -i vga
> gives me:
> 
> [0.000000] Console: colour VGA+ 80x25
> [0.324894] vgaarb: device added:
> PCI:0000:00:01.0,decodes=io+mem,owns=io+mem,locks=none
> [0.324903] vgaarb: device added:
> PCI:0000:01:00.0,decodes=io+mem,owns=none,locks=none
> [0.324905] vgaarb: loaded
> [0.324906] vgaarb: bridge control possible 0000:01:00.0
> [0.324907] vgaarb: no bridge control possible 0000:00:01.0
> [   16.459085] hda-intel 0000:01:00.1: Handle VGA-switcheroo audio client

Ok, so it's probably X.  You might be able to start your guest from here
and get VGA through the gtx660 (assuming nothing else is broken with
radeon VGA arbiter support).  You could also run startx to confirm that
it's X doing all those VGA arbiter writes.  Thanks,

Alex



Re: [PATCH V13 4/4] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock

2013-08-28 Thread Rob Landley

On 08/26/2013 03:48:36 AM, Raghavendra K T wrote:
KVM_HC_KICK_CPU hypercall added to wakeup halted vcpu in paravirtual
spinlock enabled guest.

KVM_FEATURE_PV_UNHALT enables guest to check whether pv spinlock can be
enabled in guest.

Thanks Vatsa for rewriting KVM_HC_KICK_CPU
Cc: Rob Landley 


Looks like documentation to me:

Acked-by: Rob Landley 

Rob


Re: VGA passthrough of GTX 660 to KVM guest

2013-08-28 Thread Gustav Sorenson
Hello,

yes, 1:00.0 is the GTX 660, and yes, I'm running X on the host.
When I boot into runlevel 1 (on Mint, which is Debian-based, so that
corresponds to a single-user runlevel without X, although I saw a
not-very-fancy boot splashscreen), this is what dmesg | grep -i vga
gives me:

[0.000000] Console: colour VGA+ 80x25
[0.324894] vgaarb: device added:
PCI:0000:00:01.0,decodes=io+mem,owns=io+mem,locks=none
[0.324903] vgaarb: device added:
PCI:0000:01:00.0,decodes=io+mem,owns=none,locks=none
[0.324905] vgaarb: loaded
[0.324906] vgaarb: bridge control possible 0000:01:00.0
[0.324907] vgaarb: no bridge control possible 0000:00:01.0
[   16.459085] hda-intel 0000:01:00.1: Handle VGA-switcheroo audio client

Thanks!


Re: Direct guest device access from nested guest

2013-08-28 Thread Jan Kiszka
On 2013-08-28 16:28, Lluís Vilanova wrote:
> Hi,
> 
> I want to get the following setup, but don't know how (or if it's even
> possible):
> 
> * A guest VM with two AHCI controllers, with one device each. One of the AHCI
>   controllers provides the VM's disk ("system"), while the other provides
>   another disk ("nested") and uses a different emulation driver in QEMU 
> (ahci2):
> 
>   host$ qemu-system-x86_64 -enable-kvm \
> -drive id=system,file=system.img,if=none \
> -device ahci,id=ahci \
> -device ide-drive,drive=system,bus=ahci.0 \
> -drive id=nested,file=nested.img,if=none \
> -device ahci2,id=ahci2 \
> -device ide-drive,drive=nested,bus=ahci2.0
> 
> * A nested guest VM using the guest's (its host) AHCI2 controller.
> 
>   I've tried assigning the AHCI2 device to the nested guest using "pci-assign"
>   and "vfio", but without any luck.
> 
>   The culprit seems to be that I cannot get the nested guest to have
>   an IOMMU.
> 
> Is it possible to give a nested guest direct access to a device on the guest?
> (more specifically, an AHCI controller).

Nope, we are lacking support for emulating or (securely) forwarding
VT-d/IOMMU features to the first level guest. Would be cool to have,
just not yet there. But I've talked to Intel people recently, and they
are considering to support some nested VT-d with KVM.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: VGA passthrough of GTX 660 to KVM guest

2013-08-28 Thread Alex Williamson
On Wed, 2013-08-28 at 18:29 +0200, Gustav Sorenson wrote:
> Hi Alex,
> 
> thanks for the info you provided, although I have to admit I didn't
> understand all of it - just so you don't overestimate my knowledge. :)
> 
> Here's the output you requested:
> 
> [0.320000] vgaarb: device added:
> PCI:0000:00:01.0,decodes=io+mem,owns=io+mem,locks=none
> [0.321120] vgaarb: device added:
> PCI:0000:01:00.0,decodes=io+mem,owns=none,locks=none
> [0.321122] vgaarb: loaded
> [0.321123] vgaarb: bridge control possible 0000:01:00.0
> [0.321123] vgaarb: no bridge control possible 0000:00:01.0
> [   16.325255] hda-intel 0000:01:00.1: Handle VGA-switcheroo audio client
> [   36.736262] vgaarb: this pci device is not a vga device
> [   36.736274] vgaarb: this pci device is not a vga device
> [   36.736292] vgaarb: this pci device is not a vga device
> [   36.736302] vgaarb: this pci device is not a vga device
> [   36.736312] vgaarb: this pci device is not a vga device
> [   36.736322] vgaarb: this pci device is not a vga device
> [   36.736332] vgaarb: this pci device is not a vga device
> [   36.736342] vgaarb: this pci device is not a vga device
> [   36.736352] vgaarb: this pci device is not a vga device
> [   36.736362] vgaarb: this pci device is not a vga device
> [   36.736373] vgaarb: this pci device is not a vga device
> [   36.736384] vgaarb: this pci device is not a vga device
> [   36.736395] vgaarb: this pci device is not a vga device
> [   36.736406] vgaarb: this pci device is not a vga device
> [   36.736417] vgaarb: this pci device is not a vga device
> [   36.736428] vgaarb: this pci device is not a vga device
> [   36.736440] vgaarb: this pci device is not a vga device
> [   36.736452] vgaarb: this pci device is not a vga device
> [   36.736464] vgaarb: this pci device is not a vga device
> [   36.736475] vgaarb: this pci device is not a vga device
> [   36.736487] vgaarb: this pci device is not a vga device
> [   36.736499] vgaarb: this pci device is not a vga device
> [   36.736511] vgaarb: this pci device is not a vga device
> [   36.736524] vgaarb: this pci device is not a vga device
> [   36.736537] vgaarb: this pci device is not a vga device
> [   36.736549] vgaarb: this pci device is not a vga device
> [   36.736571] vgaarb: device changed decodes:
> PCI:0000:01:00.0,olddecodes=io+mem,decodes=none:owns=none

This is a problem.  Something told the VGA arbiter that 1:00.0 (assume
this is the gtx660) does not decode VGA I/O or memory.  That means we
can't route it to the card when we try to assign it.  These "this pci
device is not a vga device" messages come from the write function
of /dev/vga_arbiter, so something in userspace is doing this, and
apparently doing it very arbitrarily since it must have tried every
device in the system.  Are you running an X server on the host?  If so,
can you boot to a runlevel that doesn't start X and see if these
messages are still present?  Thanks,

Alex

> [   36.736593] vgaarb: this pci device is not a vga device
> [   36.736607] vgaarb: this pci device is not a vga device
> [   36.736620] vgaarb: this pci device is not a vga device
> 
> > Of course even if you get VGA working, you're still likely to get a Code
> > 43 trying to enable the Nvidia driver, but I appreciate any efforts to
> > help figure that out.  Thanks,
> 
> I'm happy to help where I can.
> 
> Thank you very much for your efforts!





Re: VGA passthrough of GTX 660 to KVM guest

2013-08-28 Thread Gustav Sorenson
Hi Alex,

thanks for the info you provided, although I have to admit I didn't
understand all of it - just so you don't overestimate my knowledge. :)

Here's the output you requested:

[0.320000] vgaarb: device added:
PCI:0000:00:01.0,decodes=io+mem,owns=io+mem,locks=none
[0.321120] vgaarb: device added:
PCI:0000:01:00.0,decodes=io+mem,owns=none,locks=none
[0.321122] vgaarb: loaded
[0.321123] vgaarb: bridge control possible 0000:01:00.0
[0.321123] vgaarb: no bridge control possible 0000:00:01.0
[   16.325255] hda-intel 0000:01:00.1: Handle VGA-switcheroo audio client
[   36.736262] vgaarb: this pci device is not a vga device
[   36.736274] vgaarb: this pci device is not a vga device
[   36.736292] vgaarb: this pci device is not a vga device
[   36.736302] vgaarb: this pci device is not a vga device
[   36.736312] vgaarb: this pci device is not a vga device
[   36.736322] vgaarb: this pci device is not a vga device
[   36.736332] vgaarb: this pci device is not a vga device
[   36.736342] vgaarb: this pci device is not a vga device
[   36.736352] vgaarb: this pci device is not a vga device
[   36.736362] vgaarb: this pci device is not a vga device
[   36.736373] vgaarb: this pci device is not a vga device
[   36.736384] vgaarb: this pci device is not a vga device
[   36.736395] vgaarb: this pci device is not a vga device
[   36.736406] vgaarb: this pci device is not a vga device
[   36.736417] vgaarb: this pci device is not a vga device
[   36.736428] vgaarb: this pci device is not a vga device
[   36.736440] vgaarb: this pci device is not a vga device
[   36.736452] vgaarb: this pci device is not a vga device
[   36.736464] vgaarb: this pci device is not a vga device
[   36.736475] vgaarb: this pci device is not a vga device
[   36.736487] vgaarb: this pci device is not a vga device
[   36.736499] vgaarb: this pci device is not a vga device
[   36.736511] vgaarb: this pci device is not a vga device
[   36.736524] vgaarb: this pci device is not a vga device
[   36.736537] vgaarb: this pci device is not a vga device
[   36.736549] vgaarb: this pci device is not a vga device
[   36.736571] vgaarb: device changed decodes:
PCI:0000:01:00.0,olddecodes=io+mem,decodes=none:owns=none
[   36.736593] vgaarb: this pci device is not a vga device
[   36.736607] vgaarb: this pci device is not a vga device
[   36.736620] vgaarb: this pci device is not a vga device

> Of course even if you get VGA working, you're still likely to get a Code
> 43 trying to enable the Nvidia driver, but I appreciate any efforts to
> help figure that out.  Thanks,

I'm happy to help where I can.

Thank you very much for your efforts!


Re: VGA passthrough of GTX 660 to KVM guest

2013-08-28 Thread Alex Williamson
On Wed, 2013-08-28 at 17:12 +0200, Gustav Sorenson wrote:
> Hi Alex,
> 
> thanks for your answer.
> 
> Do you happen to know someone who might be able to judge whether or
> not the radeon driver could be patched to support passthrough? Also,
> could you please point me to the fixes to i915 you mentioned? I think
> I might write to the radeon list to see whether they have ideas how to
> tackle this issue, and I assume it may help them if they know what
> fixed VGA arbitration on Haswell.
> 
> Thanks again!

VGA Arbiter:
http://lists.freedesktop.org/archives/intel-gfx/2013-August/031963.html
i915:
http://lists.freedesktop.org/archives/intel-gfx/2013-August/032295.html

The changes are very device dependent.  In the case of newer Intel HD
graphics the i915 driver does not need VGA memory access and only
occasionally needs VGA I/O access.  We can therefore disable VGA memory
completely so that it doesn't need to participate in arbitration of
that.  Then the arbiter will disable the I/O bit in the PCI command
register when another device requests VGA.
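
For reference, the heart of that i915 change is just telling the
arbiter what the driver actually decodes, along these lines (a sketch;
the exact decode flags for a given device are an assumption here):

	/* in the host graphics driver: we never decode VGA memory, so only
	 * legacy I/O needs to keep participating in VGA arbitration */
	vga_set_legacy_decoding(pdev, VGA_RSRC_LEGACY_IO);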

Perhaps if you can send `dmesg | grep -i vga` from your system we can
see how the arbiter is getting setup.  On an Intel system, a simple test
to disable VGA to IGD is to disable I/O and memory in the PCI command
register:

setpci -s 00:02.0 4.w=0

(of course you really want to read it first to restore it later, you
only need to clear the bottom 2 bits)

Of course even if you get VGA working, you're still likely to get a Code
43 trying to enable the Nvidia driver, but I appreciate any efforts to
help figure that out.  Thanks,

Alex



Re: VGA passthrough of GTX 660 to KVM guest

2013-08-28 Thread Gustav Sorenson
Hi Alex,

thanks for your answer.

Do you happen to know someone who might be able to judge whether or
not the radeon driver could be patched to support passthrough? Also,
could you please point me to the fixes to i915 you mentioned? I think
I might write to the radeon list to see whether they have ideas how to
tackle this issue, and I assume it may help them if they know what
fixed VGA arbitration on Haswell.

Thanks again!


[PATCH] vfio-pci: Test for extended config space

2013-08-28 Thread Alex Williamson
Having PCIe/PCI-X capability isn't enough to assume that there are
extended capabilities.  Both specs define that the first capability
header is all zero if there are no extended capabilities.  Testing
for this avoids an erroneous message about hiding capability 0x0 at
offset 0x100.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci_config.c |   11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 8477963..bcf3624 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1069,6 +1069,7 @@ static int vfio_vc_cap_len(struct vfio_pci_device *vdev, u16 pos)
 static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos)
 {
struct pci_dev *pdev = vdev->pdev;
+   u32 dword;
u16 word;
u8 byte;
int ret;
@@ -1082,7 +1083,9 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos)
return pcibios_err_to_errno(ret);
 
if (PCI_X_CMD_VERSION(word)) {
-   vdev->extended_caps = true;
+   /* Test for extended capabilities */
+   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
+   vdev->extended_caps = (dword != 0);
return PCI_CAP_PCIX_SIZEOF_V2;
} else
return PCI_CAP_PCIX_SIZEOF_V0;
@@ -1094,9 +1097,11 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos)
 
return byte;
case PCI_CAP_ID_EXP:
-   /* length based on version */
-   vdev->extended_caps = true;
+   /* Test for extended capabilities */
+   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
+   vdev->extended_caps = (dword != 0);
 
+   /* length based on version */
if ((pcie_caps_reg(pdev) & PCI_EXP_FLAGS_VERS) == 1)
return PCI_CAP_EXP_ENDPOINT_SIZEOF_V1;
else



Re: KVM: x86: update masterclock when kvmclock_offset is calculated (v2)

2013-08-28 Thread Gleb Natapov
On Tue, Aug 27, 2013 at 11:55:29PM -0300, Marcelo Tosatti wrote:
> 
> The offset to add to the hosts monotonic time, kvmclock_offset, is
> calculated against the monotonic time at KVM_SET_CLOCK ioctl time.
> 
> Request a master clock update at this time, to reduce a potentially
> unbounded difference between the values of the masterclock and
> the clock value used to calculate kvmclock_offset.
> 
> Signed-off-by: Marcelo Tosatti 
> 
Applied, thanks.

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d21bce5..0a93354 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1457,6 +1457,29 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>  #endif
>  }
>  
> +static void kvm_gen_update_masterclock(struct kvm *kvm)
> +{
> +#ifdef CONFIG_X86_64
> + int i;
> + struct kvm_vcpu *vcpu;
> + struct kvm_arch *ka = &kvm->arch;
> +
> + spin_lock(&ka->pvclock_gtod_sync_lock);
> + kvm_make_mclock_inprogress_request(kvm);
> + /* no guest entries from this point */
> + pvclock_update_vm_gtod_copy(kvm);
> +
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
> +
> + /* guest entries allowed */
> + kvm_for_each_vcpu(i, vcpu, kvm)
> + clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);
> +
> + spin_unlock(&ka->pvclock_gtod_sync_lock);
> +#endif
> +}
> +
>  static int kvm_guest_time_update(struct kvm_vcpu *v)
>  {
>   unsigned long flags, this_tsc_khz;
> @@ -3806,6 +3829,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
>   delta = user_ns.clock - now_ns;
>   local_irq_enable();
>   kvm->arch.kvmclock_offset = delta;
> + kvm_gen_update_masterclock(kvm);
>   break;
>   }
>   case KVM_GET_CLOCK: {
> @@ -5689,29 +5713,6 @@ static void process_nmi(struct kvm_vcpu *vcpu)
>   kvm_make_request(KVM_REQ_EVENT, vcpu);
>  }
>  
> -static void kvm_gen_update_masterclock(struct kvm *kvm)
> -{
> -#ifdef CONFIG_X86_64
> - int i;
> - struct kvm_vcpu *vcpu;
> - struct kvm_arch *ka = &kvm->arch;
> -
> - spin_lock(&ka->pvclock_gtod_sync_lock);
> - kvm_make_mclock_inprogress_request(kvm);
> - /* no guest entries from this point */
> - pvclock_update_vm_gtod_copy(kvm);
> -
> - kvm_for_each_vcpu(i, vcpu, kvm)
> - set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
> -
> - /* guest entries allowed */
> - kvm_for_each_vcpu(i, vcpu, kvm)
> - clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests);
> -
> - spin_unlock(&ka->pvclock_gtod_sync_lock);
> -#endif
> -}
> -
>  static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
>  {
>   u64 eoi_exit_bitmap[4];
> 

--
Gleb.


Re: VGA passthrough of GTX 660 to KVM guest

2013-08-28 Thread Alex Williamson
On Wed, 2013-08-28 at 15:11 +0200, Gustav Sorenson wrote:
> Hello everyone,
> 
> I started a thread on this list on July 12, named "AMD integrated
> graphics passthrough to KVM guest," so if you should have a deja-vu,
> that's probably why, since the heart of the problem is still the same.
> 
> I've got an AMD A10 6800 APU with integrated graphics, running on an
> ASRock FM2A75 Pro4 mainboard with the latest firmware. Since my last
> thread, I've acquired an EVGA GTX 660 graphics card, which I'd like to
> pass through to a KVM guest.
> 
> I'm following this guide to achieve my goal, including compiling the
> suggested patched versions of the kernel, seabios and qemu:
> https://bbs.archlinux.org/viewtopic.php?id=162768
> 
> The integrated graphics adapter is the primary one, and the only one
> that is used by the host system. nouveau and nvidia modules are
> blacklisted and not in lsmod.
> lspci -vvv output: http://pastebin.com/S6znCLK7
> lspci -t output: http://pastebin.com/YH01atEy
> 
> Contrary to what I hoped, invoking
> qemu-system-x86_64 -enable-kvm -M q35 -m 1024 -cpu host -smp
> 2,sockets=1,cores=2,threads=1 -bios /usr/share/qemu/bios.bin -vga none
> -device 
> ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1
> -device vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on
> -device vfio-pci,host=01:00.1,bus=root.1,addr=00.1
> after assigning the devices to vfio as suggested in the guide, does
> not show anything on the display attached to the GTX 660, or even wake
> the display up from standby, but instead gives this output on the
> console:
> qemu-system-x86_64: -device
> vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on:
> Warning, device 0000:01:00.0 does not support reset
> qemu-system-x86_64: -device
> vfio-pci,host=01:00.1,bus=root.1,addr=00.1: Warning, device
> 0000:01:00.1 does not support reset
> More often than not, this also completely freezes the host system.
> 
> Later in the forum I linked, it is suggested to pass the graphics card
> not as the guest's primary, but use pci-assign instead:
> qemu-system-x86_64 -enable-kvm -m 2048 -cpu host -smp
> 2,sockets=1,cores=2,threads=1 -bios /usr/share/qemu/bios.bin -vga std
> -vnc :0 -drive file=/mnt/orthowin/current-new.img,format=qcow2
> -usbdevice tablet -net user,restrict=y -usb -device
> pci-assign,host=01:00.0 -device pci-assign,host=01:00.1
> 
> This allows me to boot the Windows 7 64bit guest. After installing the
> nVidia drivers, shutting the guest down, rebooting the host and
> booting the guest again, the Windows device manager shows "Code 43"
> for the graphics card (which even is correctly labelled as GTX 660),
> and the nVidia user interface doesn't show any GPU at all.


I've beaten my head against Code 43 as well, in fact with the same card.
I'm using vfio on an AMD 990fx system (no APU); the card works
great as an assigned VGA device and the nvidia driver works in the
guest.  When I move the gtx660 to a Haswell system with IGD, I
get stuck with a Code 43.  I am able to get VGA working to the card, but
it required fixes to i915 in the host to get VGA arbitration to work.
The radeon driver may not be doing VGA arbitration correctly either if
you couldn't get VGA with vfio.

The only thing I can think might be causing problems is that the gtx660
is a PCIe gen3 card and the 990fx is only capable of gen2.  The mismatch
between our QEMU root ports and the physical device may be too great
when the card is running gen3.  However, if I set my BIOS to force gen2,
it doesn't make any difference.  I also get the same Code 43 when I
attach the GPU to a QEMU 440fx configuration, so PCIe spec revisions may
be a red herring.  The nouveau driver seems to work though.  Thanks,

Alex

> What could I try to resolve this issue?
> I realize that I might have omitted some information that may be
> important for troubleshooting, but I'm afraid I'm not experienced
> enough to even know what may be relevant. Please let me know if you'd
> like to have some further information.
> Please also note that I'm not running the latest kernel/qemu/seabios,
> so maybe some changes made since I compiled 3.10.1 might help me? If
> so, please suggest a commit or release to try. I'm not afraid of
> messing up my operating system install, but if anything suggested
> might brick the hardware, I'd like to know in advance. :)
> 
> I'm very grateful for any kind of input!





[PATCH 08/10] kvm: i386: fix LAPIC TSC deadline timer save/restore

2013-08-28 Thread Gleb Natapov
From: Marcelo Tosatti 

The configuration of the timer represented by MSR_IA32_TSCDEADLINE depends on:

- APIC LVT Timer register.
- TSC value.

Change the order to respect the dependency: restoring MSR_IA32_TSCDEADLINE
before the LVT timer register and TSC it depends on can leave the timer
misconfigured.

Signed-off-by: Marcelo Tosatti 
Signed-off-by: Paolo Bonzini 
---
 target-i386/kvm.c |   29 ++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 7bb8455..58f7bb7 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1073,6 +1073,26 @@ static void kvm_msr_entry_set(struct kvm_msr_entry *entry,
 entry->data = value;
 }
 
+static int kvm_put_tscdeadline_msr(X86CPU *cpu)
+{
+CPUX86State *env = &cpu->env;
+struct {
+struct kvm_msrs info;
+struct kvm_msr_entry entries[1];
+} msr_data;
+struct kvm_msr_entry *msrs = msr_data.entries;
+
+if (!has_msr_tsc_deadline) {
+return 0;
+}
+
+kvm_msr_entry_set(&msrs[0], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
+
+msr_data.info.nmsrs = 1;
+
+return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, &msr_data);
+}
+
 static int kvm_put_msrs(X86CPU *cpu, int level)
 {
 CPUX86State *env = &cpu->env;
@@ -1096,9 +1116,6 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 if (has_msr_tsc_adjust) {
 kvm_msr_entry_set(&msrs[n++], MSR_TSC_ADJUST, env->tsc_adjust);
 }
-if (has_msr_tsc_deadline) {
-kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
-}
 if (has_msr_misc_enable) {
 kvm_msr_entry_set(&msrs[n++], MSR_IA32_MISC_ENABLE,
   env->msr_ia32_misc_enable);
@@ -1808,6 +1825,12 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
 return ret;
 }
 }
+
+ret = kvm_put_tscdeadline_msr(x86_cpu);
+if (ret < 0) {
+return ret;
+}
+
 ret = kvm_put_vcpu_events(x86_cpu, level);
 if (ret < 0) {
 return ret;
-- 
1.7.10.4



[PATCH 03/10] kvm: migrate vPMU state

2013-08-28 Thread Gleb Natapov
From: Paolo Bonzini 

Reviewed-by: Gleb Natapov 
Signed-off-by: Paolo Bonzini 
---
 target-i386/cpu.h |   23 
 target-i386/kvm.c |   93 ++---
 target-i386/machine.c |   44 +++
 3 files changed, 155 insertions(+), 5 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index af4c0f7..31de265 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -305,6 +305,8 @@
 #define MSR_TSC_ADJUST  0x003b
 #define MSR_IA32_TSCDEADLINE    0x6e0
 
+#define MSR_P6_PERFCTR0 0xc1
+
 #define MSR_MTRRcap 0xfe
 #define MSR_MTRRcap_VCNT8
 #define MSR_MTRRcap_FIXRANGE_SUPPORT    (1 << 8)
@@ -318,6 +320,8 @@
 #define MSR_MCG_STATUS  0x17a
 #define MSR_MCG_CTL 0x17b
 
+#define MSR_P6_EVNTSEL0 0x186
+
 #define MSR_IA32_PERF_STATUS    0x198
 
 #define MSR_IA32_MISC_ENABLE    0x1a0
@@ -343,6 +347,14 @@
 
 #define MSR_MTRRdefType 0x2ff
 
+#define MSR_CORE_PERF_FIXED_CTR0    0x309
+#define MSR_CORE_PERF_FIXED_CTR1    0x30a
+#define MSR_CORE_PERF_FIXED_CTR2    0x30b
+#define MSR_CORE_PERF_FIXED_CTR_CTRL    0x38d
+#define MSR_CORE_PERF_GLOBAL_STATUS 0x38e
+#define MSR_CORE_PERF_GLOBAL_CTRL   0x38f
+#define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
+
 #define MSR_MC0_CTL 0x400
 #define MSR_MC0_STATUS  0x401
 #define MSR_MC0_ADDR    0x402
@@ -721,6 +733,9 @@ typedef struct {
 #define CPU_NB_REGS CPU_NB_REGS32
 #endif
 
+#define MAX_FIXED_COUNTERS 3
+#define MAX_GP_COUNTERS    (MSR_IA32_PERF_STATUS - MSR_P6_EVNTSEL0)
+
 #define NB_MMU_MODES 3
 
 typedef enum TPRAccess {
@@ -816,6 +831,14 @@ typedef struct CPUX86State {
 uint64_t msr_ia32_misc_enable;
 uint64_t msr_ia32_feature_control;
 
+uint64_t msr_fixed_ctr_ctrl;
+uint64_t msr_global_ctrl;
+uint64_t msr_global_status;
+uint64_t msr_global_ovf_ctrl;
+uint64_t msr_fixed_counters[MAX_FIXED_COUNTERS];
+uint64_t msr_gp_counters[MAX_GP_COUNTERS];
+uint64_t msr_gp_evtsel[MAX_GP_COUNTERS];
+
 /* exception/interrupt handling */
 int error_code;
 int exception_is_int;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 84ac00a..513ae52 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -71,6 +71,9 @@ static bool has_msr_misc_enable;
 static bool has_msr_kvm_steal_time;
 static int lm_capable_kernel;
 
+static bool has_msr_architectural_pmu;
+static uint32_t num_architectural_pmu_counters;
+
 bool kvm_allows_irq0_override(void)
 {
 return !kvm_irqchip_in_kernel() || kvm_has_gsi_routing();
@@ -581,6 +584,25 @@ int kvm_arch_init_vcpu(CPUState *cs)
 break;
 }
 }
+
+if (limit >= 0x0a) {
+uint32_t ver;
+
+cpu_x86_cpuid(env, 0x0a, 0, &ver, &unused, &unused, &unused);
+if ((ver & 0xff) > 0) {
+has_msr_architectural_pmu = true;
+num_architectural_pmu_counters = (ver & 0xff00) >> 8;
+
+/* Shouldn't be more than 32, since that's the number of bits
+ * available in EBX to tell us _which_ counters are available.
+ * Play it safe.
+ */
+if (num_architectural_pmu_counters > MAX_GP_COUNTERS) {
+num_architectural_pmu_counters = MAX_GP_COUNTERS;
+}
+}
+}
+
 cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
 
 for (i = 0x80000000; i <= limit; i++) {
@@ -1052,7 +1074,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 struct kvm_msr_entry entries[100];
 } msr_data;
 struct kvm_msr_entry *msrs = msr_data.entries;
-int n = 0;
+int n = 0, i;
 
 kvm_msr_entry_set(&msrs[n++], MSR_IA32_SYSENTER_CS, env->sysenter_cs);
 kvm_msr_entry_set(&msrs[n++], MSR_IA32_SYSENTER_ESP, env->sysenter_esp);
@@ -1094,9 +1116,8 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 }
 }
 /*
- * The following paravirtual MSRs have side effects on the guest or are
- * too heavy for normal writeback. Limit them to reset or full state
- * updates.
+ * The following MSRs have side effects on the guest or are too heavy
+ * for normal writeback. Limit them to reset or full state updates.
  */
 if (level >= KVM_PUT_RESET_STATE) {
 kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME,
@@ -1114,6 +1135,33 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 kvm_msr_entry_set(&msrs[n++], MSR_KVM_STEAL_TIME,
   env->steal_time_msr);
 }
+if (has_msr_architectural_pmu) {
+/* Stop the counter.  */
+kvm_msr_entry_set(&msrs[n++], MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
+kvm_msr_entry_set(&msrs[n++], MSR_CORE_PERF_GLOBAL_CTRL, 0);
+
+/* Set the counter values.  */
+for (i = 0; i < MAX_FI

[PATCH 07/10] kvm-all.c: max_cpus should not exceed KVM vcpu limit

2013-08-28 Thread Gleb Natapov
From: Marcelo Tosatti 

maxcpus, which specifies the maximum number of hotpluggable CPUs,
should not exceed KVM's vcpu limit.

Signed-off-by: Marcelo Tosatti 
[Reword message. - Paolo]
Signed-off-by: Paolo Bonzini 
---
 kvm-all.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/kvm-all.c b/kvm-all.c
index ef52a0f..a2d4978 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1400,6 +1400,13 @@ int kvm_init(void)
 goto err;
 }
 
+if (max_cpus > max_vcpus) {
+ret = -EINVAL;
+fprintf(stderr, "Number of hotpluggable cpus requested (%d) exceeds max cpus "
+"supported by KVM (%d)\n", max_cpus, max_vcpus);
+goto err;
+}
+
 s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0);
 if (s->vmfd < 0) {
 #ifdef TARGET_S390X
-- 
1.7.10.4



[PATCH 06/10] kvm: Simplify kvm_handle_io

2013-08-28 Thread Gleb Natapov
From: Jan Kiszka 

Now that cpu_in/out is just a wrapper around address_space_rw, we can
also call the latter directly. As host endianness == guest endianness,
there is no need for the memory access helpers st*_p/ld*_p either.

Signed-off-by: Jan Kiszka 
Signed-off-by: Paolo Bonzini 
---
 kvm-all.c |   28 ++--
 1 file changed, 2 insertions(+), 26 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index bfa4aac..ef52a0f 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1508,32 +1508,8 @@ static void kvm_handle_io(uint16_t port, void *data, int direction, int size,
 uint8_t *ptr = data;
 
 for (i = 0; i < count; i++) {
-if (direction == KVM_EXIT_IO_IN) {
-switch (size) {
-case 1:
-stb_p(ptr, cpu_inb(port));
-break;
-case 2:
-stw_p(ptr, cpu_inw(port));
-break;
-case 4:
-stl_p(ptr, cpu_inl(port));
-break;
-}
-} else {
-switch (size) {
-case 1:
-cpu_outb(port, ldub_p(ptr));
-break;
-case 2:
-cpu_outw(port, lduw_p(ptr));
-break;
-case 4:
-cpu_outl(port, ldl_p(ptr));
-break;
-}
-}
-
+address_space_rw(&address_space_io, port, ptr, size,
+ direction == KVM_EXIT_IO_OUT);
 ptr += size;
 }
 }
-- 
1.7.10.4



[PATCH 04/10] kvm: add KVM_IRQFD_FLAG_RESAMPLE support

2013-08-28 Thread Gleb Natapov
From: Vincenzo Maffione 

Added an EventNotifier* parameter to
kvm-all.c:kvm_irqchip_add_irqfd_notifier(), in order to give KVM
another eventfd to be used as "resamplefd". See the documentation
in the linux kernel sources in Documentation/virtual/kvm/api.txt
(section 4.75) for more details.
When the added parameter is passed NULL, the behaviour of the
function is unchanged with respect to the previous versions.
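
A minimal usage sketch (illustrative only; the notifier names and virq
are hypothetical and assumed to be set up elsewhere):

    /* Request a level-triggered irqfd with a resample eventfd; passing
     * NULL instead of &resample_notifier keeps the old behaviour.
     */
    if (kvm_irqchip_add_irqfd_notifier(kvm_state, &irq_notifier,
                                       &resample_notifier, virq) < 0) {
        error_report("irqfd setup failed");
    }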

Reviewed-by: Paolo Bonzini 
Signed-off-by: Vincenzo Maffione 
Signed-off-by: Paolo Bonzini 
---
 hw/misc/vfio.c |4 ++--
 hw/virtio/virtio-pci.c |2 +-
 include/sysemu/kvm.h   |3 ++-
 kvm-all.c  |   17 +
 4 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index ad8ce77..54af34a 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -646,7 +646,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
 vector->virq = msg ? kvm_irqchip_add_msi_route(kvm_state, *msg) : -1;
 if (vector->virq < 0 ||
 kvm_irqchip_add_irqfd_notifier(kvm_state, &vector->interrupt,
-   vector->virq) < 0) {
+   NULL, vector->virq) < 0) {
 if (vector->virq >= 0) {
 kvm_irqchip_release_virq(kvm_state, vector->virq);
 vector->virq = -1;
@@ -814,7 +814,7 @@ retry:
 vector->virq = kvm_irqchip_add_msi_route(kvm_state, msg);
 if (vector->virq < 0 ||
 kvm_irqchip_add_irqfd_notifier(kvm_state, &vector->interrupt,
-   vector->virq) < 0) {
+   NULL, vector->virq) < 0) {
 qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
 vfio_msi_interrupt, NULL, vector);
 }
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index c38cfd1..c4db407 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -508,7 +508,7 @@ static int kvm_virtio_pci_irqfd_use(VirtIOPCIProxy *proxy,
 VirtQueue *vq = virtio_get_queue(proxy->vdev, queue_no);
 EventNotifier *n = virtio_queue_get_guest_notifier(vq);
 int ret;
-ret = kvm_irqchip_add_irqfd_notifier(kvm_state, n, irqfd->virq);
+ret = kvm_irqchip_add_irqfd_notifier(kvm_state, n, NULL, irqfd->virq);
 return ret;
 }
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index f8ac448..ce3efaf 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -309,7 +309,8 @@ int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg);
 int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg);
 void kvm_irqchip_release_virq(KVMState *s, int virq);
 
-int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, int virq);
+int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,
+   EventNotifier *rn, int virq);
 int kvm_irqchip_remove_irqfd_notifier(KVMState *s, EventNotifier *n, int virq);
 void kvm_pc_gsi_handler(void *opaque, int n, int level);
 void kvm_pc_setup_irq_routing(bool pci_enabled);
diff --git a/kvm-all.c b/kvm-all.c
index 4fb4ccb..bfa4aac 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1230,7 +1230,8 @@ static int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
 return kvm_update_routing_entry(s, &kroute);
 }
 
-static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int virq, bool assign)
+static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int rfd, int virq,
+bool assign)
 {
 struct kvm_irqfd irqfd = {
 .fd = fd,
@@ -1238,6 +1239,11 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int virq, bool assign)
 .flags = assign ? 0 : KVM_IRQFD_FLAG_DEASSIGN,
 };
 
+if (rfd != -1) {
+irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
+irqfd.resamplefd = rfd;
+}
+
 if (!kvm_irqfds_enabled()) {
 return -ENOSYS;
 }
@@ -1276,14 +1282,17 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
 }
 #endif /* !KVM_CAP_IRQ_ROUTING */
 
-int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, int virq)
+int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,
+   EventNotifier *rn, int virq)
 {
-return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), virq, true);
+return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n),
+   rn ? event_notifier_get_fd(rn) : -1, virq, true);
 }
 
 int kvm_irqchip_remove_irqfd_notifier(KVMState *s, EventNotifier *n, int virq)
 {
-return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), virq, false);
+return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), -1, virq,
+   false);
 }
 
 static int kvm_irqchip_create(KVMState *s)
-- 
1.7.10.4


[PATCH 01/10] Initialize IA32_FEATURE_CONTROL MSR in reset and migration

2013-08-28 Thread Gleb Natapov
From: Arthur Chunqi Li 

A recent KVM patch adds IA32_FEATURE_CONTROL support. QEMU needs
to clear this MSR when a vCPU is reset and to preserve its value
across migration. This patch adds that behavior.

Signed-off-by: Arthur Chunqi Li 
Signed-off-by: Gleb Natapov 
---
 target-i386/cpu.h |2 ++
 target-i386/kvm.c |4 
 target-i386/machine.c |   22 ++
 3 files changed, 28 insertions(+)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index cedefdc..3a52f94 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -301,6 +301,7 @@
 #define MSR_IA32_APICBASE_BSP   (1<<8)
 #define MSR_IA32_APICBASE_ENABLE(1<<11)
 #define MSR_IA32_APICBASE_BASE  (0xfffff<<12)
+#define MSR_IA32_FEATURE_CONTROL0x003a
 #define MSR_TSC_ADJUST  0x003b
 #define MSR_IA32_TSCDEADLINE0x6e0
 
@@ -813,6 +814,7 @@ typedef struct CPUX86State {
 
 uint64_t mcg_status;
 uint64_t msr_ia32_misc_enable;
+uint64_t msr_ia32_feature_control;
 
 /* exception/interrupt handling */
 int error_code;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 3c9d10a..84ac00a 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1121,6 +1121,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 if (hyperv_vapic_recommended()) {
 kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0);
 }
+kvm_msr_entry_set(&msrs[n++], MSR_IA32_FEATURE_CONTROL, env->msr_ia32_feature_control);
 }
 if (env->mcg_cap) {
 int i;
@@ -1345,6 +1346,7 @@ static int kvm_get_msrs(X86CPU *cpu)
 if (has_msr_misc_enable) {
 msrs[n++].index = MSR_IA32_MISC_ENABLE;
 }
+msrs[n++].index = MSR_IA32_FEATURE_CONTROL;
 
 if (!env->tsc_valid) {
 msrs[n++].index = MSR_IA32_TSC;
@@ -1443,6 +1445,8 @@ static int kvm_get_msrs(X86CPU *cpu)
 case MSR_IA32_MISC_ENABLE:
 env->msr_ia32_misc_enable = msrs[i].data;
 break;
+case MSR_IA32_FEATURE_CONTROL:
+env->msr_ia32_feature_control = msrs[i].data;
 default:
 if (msrs[i].index >= MSR_MC0_CTL &&
 msrs[i].index < MSR_MC0_CTL + (env->mcg_cap & 0xff) * 4) {
diff --git a/target-i386/machine.c b/target-i386/machine.c
index f9ec581..0d2088e 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -435,6 +435,14 @@ static bool misc_enable_needed(void *opaque)
 return env->msr_ia32_misc_enable != MSR_IA32_MISC_ENABLE_DEFAULT;
 }
 
+static bool feature_control_needed(void *opaque)
+{
+X86CPU *cpu = opaque;
+CPUX86State *env = &cpu->env;
+
+return env->msr_ia32_feature_control != 0;
+}
+
 static const VMStateDescription vmstate_msr_ia32_misc_enable = {
 .name = "cpu/msr_ia32_misc_enable",
 .version_id = 1,
@@ -446,6 +454,17 @@ static const VMStateDescription vmstate_msr_ia32_misc_enable = {
 }
 };
 
+static const VMStateDescription vmstate_msr_ia32_feature_control = {
+.name = "cpu/msr_ia32_feature_control",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(env.msr_ia32_feature_control, X86CPU),
+VMSTATE_END_OF_LIST()
+}
+};
+
 const VMStateDescription vmstate_x86_cpu = {
 .name = "cpu",
 .version_id = 12,
@@ -571,6 +590,9 @@ const VMStateDescription vmstate_x86_cpu = {
 }, {
 .vmsd = &vmstate_msr_ia32_misc_enable,
 .needed = misc_enable_needed,
+}, {
+.vmsd = &vmstate_msr_ia32_feature_control,
+.needed = feature_control_needed,
 } , {
 /* empty */
 }
-- 
1.7.10.4



[PATCH 02/10] target-i386: remove tabs from target-i386/cpu.h

2013-08-28 Thread Gleb Natapov
From: Paolo Bonzini 

Signed-off-by: Paolo Bonzini 
---
 target-i386/cpu.h |  192 ++---
 1 file changed, 96 insertions(+), 96 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 3a52f94..af4c0f7 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -37,9 +37,9 @@
 #define TARGET_HAS_ICE 1
 
 #ifdef TARGET_X86_64
-#define ELF_MACHINEEM_X86_64
+#define ELF_MACHINE EM_X86_64
 #else
-#define ELF_MACHINEEM_386
+#define ELF_MACHINE EM_386
 #endif
 
 #define CPUArchState struct CPUX86State
@@ -98,10 +98,10 @@
 #define DESC_TSS_BUSY_MASK (1 << 9)
 
 /* eflags masks */
-#define CC_C   0x0001
-#define CC_P   0x0004
-#define CC_A   0x0010
-#define CC_Z   0x0040
+#define CC_C0x0001
+#define CC_P0x0004
+#define CC_A0x0010
+#define CC_Z0x0040
 #define CC_S0x0080
 #define CC_O0x0800
 
@@ -109,14 +109,14 @@
 #define IOPL_SHIFT 12
 #define VM_SHIFT   17
 
-#define TF_MASK0x0100
-#define IF_MASK0x0200
-#define DF_MASK0x0400
-#define IOPL_MASK  0x3000
-#define NT_MASK0x4000
-#define RF_MASK0x0001
-#define VM_MASK0x0002
-#define AC_MASK0x0004
+#define TF_MASK 0x0100
+#define IF_MASK 0x0200
+#define DF_MASK 0x0400
+#define IOPL_MASK   0x3000
+#define NT_MASK 0x4000
+#define RF_MASK 0x0001
+#define VM_MASK 0x0002
+#define AC_MASK 0x0004
 #define VIF_MASK0x0008
 #define VIP_MASK0x0010
 #define ID_MASK 0x0020
@@ -238,28 +238,28 @@
 #define DR7_TYPE_IO_RW   0x2
 #define DR7_TYPE_DATA_RW 0x3
 
-#define PG_PRESENT_BIT 0
-#define PG_RW_BIT  1
-#define PG_USER_BIT2
-#define PG_PWT_BIT 3
-#define PG_PCD_BIT 4
-#define PG_ACCESSED_BIT5
-#define PG_DIRTY_BIT   6
-#define PG_PSE_BIT 7
-#define PG_GLOBAL_BIT  8
-#define PG_NX_BIT  63
+#define PG_PRESENT_BIT  0
+#define PG_RW_BIT   1
+#define PG_USER_BIT 2
+#define PG_PWT_BIT  3
+#define PG_PCD_BIT  4
+#define PG_ACCESSED_BIT 5
+#define PG_DIRTY_BIT6
+#define PG_PSE_BIT  7
+#define PG_GLOBAL_BIT   8
+#define PG_NX_BIT   63
 
 #define PG_PRESENT_MASK  (1 << PG_PRESENT_BIT)
-#define PG_RW_MASK  (1 << PG_RW_BIT)
-#define PG_USER_MASK(1 << PG_USER_BIT)
-#define PG_PWT_MASK (1 << PG_PWT_BIT)
-#define PG_PCD_MASK (1 << PG_PCD_BIT)
+#define PG_RW_MASK   (1 << PG_RW_BIT)
+#define PG_USER_MASK (1 << PG_USER_BIT)
+#define PG_PWT_MASK  (1 << PG_PWT_BIT)
+#define PG_PCD_MASK  (1 << PG_PCD_BIT)
 #define PG_ACCESSED_MASK (1 << PG_ACCESSED_BIT)
-#define PG_DIRTY_MASK   (1 << PG_DIRTY_BIT)
-#define PG_PSE_MASK (1 << PG_PSE_BIT)
-#define PG_GLOBAL_MASK  (1 << PG_GLOBAL_BIT)
+#define PG_DIRTY_MASK(1 << PG_DIRTY_BIT)
+#define PG_PSE_MASK  (1 << PG_PSE_BIT)
+#define PG_GLOBAL_MASK   (1 << PG_GLOBAL_BIT)
 #define PG_HI_USER_MASK  0x7ff0000000000000LL
-#define PG_NX_MASK  (1LL << PG_NX_BIT)
+#define PG_NX_MASK   (1LL << PG_NX_BIT)
 
 #define PG_ERROR_W_BIT 1
 
@@ -269,32 +269,32 @@
 #define PG_ERROR_RSVD_MASK 0x08
 #define PG_ERROR_I_D_MASK  0x10
 
-#define MCG_CTL_P  (1ULL<<8)   /* MCG_CAP register available */
-#define MCG_SER_P  (1ULL<<24) /* MCA recovery/new status bits */
+#define MCG_CTL_P   (1ULL<<8)   /* MCG_CAP register available */
+#define MCG_SER_P   (1ULL<<24) /* MCA recovery/new status bits */
 
-#define MCE_CAP_DEF(MCG_CTL_P|MCG_SER_P)
-#define MCE_BANKS_DEF  10
+#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P)
+#define MCE_BANKS_DEF   10
 
-#define MCG_STATUS_RIPV(1ULL<<0)   /* restart ip valid */
-#define MCG_STATUS_EIPV(1ULL<<1)   /* ip points to correct instruction */
-#define MCG_STATUS_MCIP(1ULL<<2)   /* machine check in progress */
+#define MCG_STATUS_RIPV (1ULL<<0)   /* restart ip valid */
+#define MCG_STATUS_EIPV (1ULL<<1)   /* ip points to correct instruction */
+#define MCG_STATUS_MCIP (1ULL<<2)   /* machine check in progress */
 
-#define MCI_STATUS_VAL (1ULL<<63)  /* valid error */
-#define MCI_STATUS_OVER(1ULL<<62)  /* previous errors lost */
-#define MCI_STATUS_UC  (1ULL<<61)  /* uncorrected error */
-#define MCI_STATUS_EN  (1ULL<<60)  /* error enabled */
-#define MCI_STATUS_MISCV (1ULL<<59) /* misc error reg. valid */
-#define MCI_STATUS_ADDRV (1ULL<<58) /* addr reg. valid */
-#define MCI_STATUS_PCC (1ULL<<57)  /* processor context corrupt */
-#define MCI_STATUS_S   (1ULL<<56)  /* Signaled machine check */
-#define MCI_STATUS_AR  (1ULL<<55)  /* Action required */
+#define MCI_STATUS_VAL   (1ULL<<63)  /* valid error */
+#define MCI_STATUS_OVER  (1ULL<<62)  /* previous errors lost */
+#define MCI_STA

[PATCH v3 00/10] [PULL] qemu-kvm.git uq/master queue

2013-08-28 Thread Gleb Natapov
Anthony,

This obsoletes "[PULL v2 0/9] KVM changes for 2013-08-23"

The following changes since commit f03d07d4683b2e8325a7cb60b4e14b977b1a869c:

  Merge remote-tracking branch 'quintela/migration.next' into staging (2013-07-23 10:57:23 -0500)

are available in the git repository at:


  git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master

for you to fetch changes up to 821c808bd1863efc2c1e977800ae77db633a185c:

  kvm-stub: fix compilation (2013-08-28 17:07:02 +0300)


Arthur Chunqi Li (1):
  Initialize IA32_FEATURE_CONTROL MSR in reset and migration

Jan Kiszka (1):
  kvm: Simplify kvm_handle_io

Liu Jinsong (1):
  kvm: x86: fix setting IA32_FEATURE_CONTROL with nested VMX disabled

Marcelo Tosatti (2):
  kvm-all.c: max_cpus should not exceed KVM vcpu limit
  kvm: i386: fix LAPIC TSC deadline timer save/restore

Paolo Bonzini (3):
  target-i386: remove tabs from target-i386/cpu.h
  kvm: migrate vPMU state
  kvm-stub: fix compilation

Vincenzo Maffione (1):
  kvm: add KVM_IRQFD_FLAG_RESAMPLE support

Wei Yang (1):
  kvm: shorten the parameter list for get_real_device()

 hw/i386/kvm/pci-assign.c |9 +-
 hw/misc/vfio.c   |4 +-
 hw/virtio/virtio-pci.c   |2 +-
 include/sysemu/kvm.h |3 +-
 kvm-all.c|   52 +--
 kvm-stub.c   |3 +-
 target-i386/cpu.h|  217 ++
 target-i386/kvm.c|  139 +++--
 target-i386/machine.c|   66 ++
 9 files changed, 351 insertions(+), 144 deletions(-)


[PATCH 09/10] kvm: shorten the parameter list for get_real_device()

2013-08-28 Thread Gleb Natapov
From: Wei Yang 

get_real_device() has 5 parameters, but the last 4 are already contained
in the first one (a structure).

This patch removes the last 4 parameters and uses them directly from the
first parameter.

Acked-by: Alex Williamson 
Signed-off-by: Wei Yang 
Signed-off-by: Paolo Bonzini 
---
 hw/i386/kvm/pci-assign.c |9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
index ff33dc8..73941b2 100644
--- a/hw/i386/kvm/pci-assign.c
+++ b/hw/i386/kvm/pci-assign.c
@@ -568,8 +568,7 @@ static int get_real_device_id(const char *devpath, uint16_t *val)
 return get_real_id(devpath, "device", val);
 }
 
-static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg,
-   uint8_t r_bus, uint8_t r_dev, uint8_t r_func)
+static int get_real_device(AssignedDevice *pci_dev)
 {
 char dir[128], name[128];
 int fd, r = 0, v;
@@ -582,7 +581,8 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg,
 dev->region_number = 0;
 
 snprintf(dir, sizeof(dir), "/sys/bus/pci/devices/%04x:%02x:%02x.%x/",
- r_seg, r_bus, r_dev, r_func);
+ pci_dev->host.domain, pci_dev->host.bus,
+ pci_dev->host.slot, pci_dev->host.function);
 
 snprintf(name, sizeof(name), "%sconfig", dir);
 
@@ -1769,8 +1769,7 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
 memcpy(dev->emulate_config_write, dev->emulate_config_read,
sizeof(dev->emulate_config_read));
 
-if (get_real_device(dev, dev->host.domain, dev->host.bus,
-dev->host.slot, dev->host.function)) {
+if (get_real_device(dev)) {
 error_report("pci-assign: Error: Couldn't get real device (%s)!",
  dev->dev.qdev.id);
 goto out;
-- 
1.7.10.4



[PATCH 10/10] kvm-stub: fix compilation

2013-08-28 Thread Gleb Natapov
From: Paolo Bonzini 

Non-KVM targets fail compilation on the uq/master branch.
Fix the prototype of kvm_irqchip_add_irqfd_notifier to match
the one in kvm-all.c.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Gleb Natapov 
---
 kvm-stub.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kvm-stub.c b/kvm-stub.c
index 7b2233a..806b044 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -135,7 +135,8 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
 return -ENOSYS;
 }
 
-int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, int virq)
+int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,
+   EventNotifier *rn, int virq)
 {
 return -ENOSYS;
 }
-- 
1.7.10.4



[PATCH 05/10] kvm: x86: fix setting IA32_FEATURE_CONTROL with nested VMX disabled

2013-08-28 Thread Gleb Natapov
From: Liu Jinsong 

This patch fixes the bug https://bugs.launchpad.net/qemu-kvm/+bug/1207623

IA32_FEATURE_CONTROL is pointless if the VMX and SMX bits are not exposed
in cpuid.1.ecx of the vcpu; in that case, current qemu-kvm returns an
error from kvm_put_msrs or kvm_get_msrs.

Signed-off-by: Liu Jinsong 
Signed-off-by: Paolo Bonzini 
---
 target-i386/kvm.c |   17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 513ae52..7bb8455 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -65,6 +65,7 @@ static bool has_msr_star;
 static bool has_msr_hsave_pa;
 static bool has_msr_tsc_adjust;
 static bool has_msr_tsc_deadline;
+static bool has_msr_feature_control;
 static bool has_msr_async_pf_en;
 static bool has_msr_pv_eoi_en;
 static bool has_msr_misc_enable;
@@ -666,6 +667,12 @@ int kvm_arch_init_vcpu(CPUState *cs)
 
 qemu_add_vm_change_state_handler(cpu_update_state, env);
 
+c = cpuid_find_entry(&cpuid_data.cpuid, 1, 0);
+if (c) {
+has_msr_feature_control = !!(c->ecx & CPUID_EXT_VMX) ||
+  !!(c->ecx & CPUID_EXT_SMX);
+}
+
 cpuid_data.cpuid.padding = 0;
 r = kvm_vcpu_ioctl(cs, KVM_SET_CPUID2, &cpuid_data);
 if (r) {
@@ -1169,7 +1176,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 if (hyperv_vapic_recommended()) {
 kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0);
 }
-kvm_msr_entry_set(&msrs[n++], MSR_IA32_FEATURE_CONTROL, env->msr_ia32_feature_control);
+if (has_msr_feature_control) {
+kvm_msr_entry_set(&msrs[n++], MSR_IA32_FEATURE_CONTROL,
+  env->msr_ia32_feature_control);
+}
 }
 if (env->mcg_cap) {
 int i;
@@ -1394,7 +1404,9 @@ static int kvm_get_msrs(X86CPU *cpu)
 if (has_msr_misc_enable) {
 msrs[n++].index = MSR_IA32_MISC_ENABLE;
 }
-msrs[n++].index = MSR_IA32_FEATURE_CONTROL;
+if (has_msr_feature_control) {
+msrs[n++].index = MSR_IA32_FEATURE_CONTROL;
+}
 
 if (!env->tsc_valid) {
 msrs[n++].index = MSR_IA32_TSC;
@@ -1509,6 +1521,7 @@ static int kvm_get_msrs(X86CPU *cpu)
 break;
 case MSR_IA32_FEATURE_CONTROL:
 env->msr_ia32_feature_control = msrs[i].data;
+break;
 default:
 if (msrs[i].index >= MSR_MC0_CTL &&
 msrs[i].index < MSR_MC0_CTL + (env->mcg_cap & 0xff) * 4) {
-- 
1.7.10.4



Re: [PATCH] arch: powerpc: kvm: add signed type cast for comparation

2013-08-28 Thread Alexander Graf

On 05.08.2013, at 07:50, Chen Gang wrote:

> On 08/05/2013 12:34 PM, Paul Mackerras wrote:
>> On Mon, Jul 22, 2013 at 02:32:35PM +0800, Chen Gang wrote:
 'rmls' is 'unsigned long', and lpcr_rmls() returns a negative number on
 failure, so a type cast is needed for the comparison.
 
 'lpid' is 'unsigned long', and kvmppc_alloc_lpid() returns a negative
 number on failure, so a type cast is needed for the comparison.
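
(For illustration -- the unsigned-comparison pitfall being fixed, in
minimal form; 'rma_size' here is a hypothetical argument:)

	unsigned long rmls = lpcr_rmls(rma_size); /* may hold a negative errno */

	if (rmls < 0)        /* always false: rmls is unsigned */
		return -EINVAL;
	if ((long)rmls < 0)  /* the cast restores the sign, as the patch does */
		return -EINVAL;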
 
 
 Signed-off-by: Chen Gang 
>> Looks right, thanks.
>> 
>> Acked-by: Paul Mackerras 
>> 
>> 
> 
> Thank you very much.

Thanks, applied to kvm-ppc-queue.


Alex



Re: [PATCH] KVM: PPC: Book3S PR: return appropriate error when allocation fails

2013-08-28 Thread Alexander Graf

On 17.07.2013, at 17:10, Thadeu Lima de Souza Cascardo wrote:

> err was overwritten by a previous function call, and checked to be 0. If
> the following page allocation fails, 0 is going to be returned instead
> of -ENOMEM.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo 

Thanks, applied to kvm-ppc-queue.


Alex



Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Gleb Natapov
On Wed, Aug 28, 2013 at 08:15:36PM +0800, Xiao Guangrong wrote:
> On 08/28/2013 06:49 PM, Gleb Natapov wrote:
> > On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
> >> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
> >>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
> > Or what if desc is moved to another rmap, but then it
> > is moved back to initial rmap (but another place in the desc list) so
> > the check here will not catch that we need to restart walking?
> 
>  It is okay. We always add the new desc to the head, then we will walk
>  all the entires under this case.
> 
> >>> Which races another question: What if desc is added in front of the list
> >>> behind the point where lockless walker currently is?
> >>
> >> That case is new spte is being added into the rmap. We need not to care the
> >> new sptes since it will set the dirty-bitmap then they can be 
> >> write-protected
> >> next time.
> >>
> > OK.
> > 
> >>>
>  Right?
> >>> Not sure. While lockless walker works on a desc rmap can be completely
> >>> destroyed and recreated again. It can be any order.
> >>
> >> I think the thing is very similar as include/linux/rculist_nulls.h
> > include/linux/rculist_nulls.h is for implementing hash tables, so they
> > may not care about add/del/lookup race for instance, but may be we are
> > (you are saying above that we are not), so similarity does not prove
> > correctness for our case. 
> 
> We do not care the "add" and "del" too when lookup the rmap. Under the "add"
> case, it is okay, the reason i have explained above. Under the "del" case,
> the spte becomes unpresent and flush all tlbs immediately, so it is also okay.
> 
> I always use a stupid way to check the correctness, that is enumerating
> all cases we may meet, in this patch, we may meet these cases:
> 
> 1) kvm deletes the desc before we are current on
>that descs have been checked, do not need to care it.
> 
> 2) kvm deletes the desc after we are currently on
>Since we always add/del the head desc, we can sure the current desc has 
> been
>deleted, then we will meet case 3).
> 
> 3) kvm deletes the desc that we are currently on
>3.a): the desc stays in slab cache (do not be reused).
>  all spte entires are empty, then the fn() will skip the nonprsent 
> spte,
>  and desc->more is
>  3.a.1) still pointing to next-desc, then we will continue the lookup
>  3.a.2) or it is the "nulls list", that means we reach the last one,
> then finish the walk.
> 
>3.b): the desc is alloc-ed from slab cache and it's being initialized.
>  we will see "desc->more == NULL" then restart the walking. It's okay.
> 
>3.c): the desc is added to rmap or pte_list again.
>  3.c.1): the desc is added to the current rmap again.
>the new desc always acts as the head desc, then we will walk
>  all entries, some entries are double checked and not entry
>  can be missed. It is okay.
> 
>  3.c.2): the desc is added to another rmap or pte_list
>  since kvm_set_memory_region() and get_dirty are serial by 
> slots-lock.
>  so the "nulls" can not be reused during lookup. Then we we 
> will
>  meet the different "nulls" at the end of walking that will 
> cause
>  rewalk.
> 
> I know check the algorithm like this is really silly, do you have other idea?
> 
Not silly, but it is easy to miss cases. For instance, a 3.c.3 can be:
 the desc is added to another rmap, then we move to another desc on the
 wrong rmap, and this other desc is also deleted and reinserted into the
 original rmap. Seems like the justification from 3.c.1 applies to that
 too, though.

> > BTW I do not see
> > rcu_assign_pointer()/rcu_dereference() in your patches which hints on
> 
> IIUC, We can not directly use rcu_assign_pointer(), that is something like:
> p = v to assign a pointer to a pointer. But in our case, we need:
>*pte_list = (unsigned long)desc | 1;
From Documentation/RCU/whatisRCU.txt:

The updater uses this function to assign a new value to an RCU-protected 
pointer.

This is what we do, no? (assuming slot->arch.rmap[] is what rcu protects here)
The fact that the value is not a correct pointer should not matter.
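
A sketch of that suggestion (illustrative; whether the type-checking
macros accept a plain unsigned long lvalue depends on the kernel
version):

	/*
	 * Publish the tagged value with the write barrier implied by
	 * rcu_assign_pointer(), instead of the open-coded smp_wmb().
	 */
	rcu_assign_pointer(*pte_list, (unsigned long)desc | 1);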

> 
> So i add the smp_wmb() by myself:
>   /*
>* Ensure the old spte has been updated into desc, so
>* that the other side can not get the desc from pte_list
>* but miss the old spte.
>*/
>   smp_wmb();
> 
>   *pte_list = (unsigned long)desc | 1;
> 
> But i missed it when inserting a empty desc, in that case, we need the barrier
> too since we should make desc->more visible before assign it to pte_list to
> avoid the lookup side seeing the invalid "nulls".
> 
> I also use own code instead of rcu_dereference():
> pte_list_walk_lockless():
>   pte_

VGA passthrough of GTX 660 to KVM guest

2013-08-28 Thread Gustav Sorenson
Hello everyone,

I started a thread on this list on July 12, named "AMD integrated
graphics passthrough to KVM guest," so if you should have a deja-vu,
that's probably why, since the heart of the problem is still the same.

I've got an AMD A10 6800 APU with integrated graphics, running on an
ASRock FM2A75 Pro4 mainboard with the latest firmware. Since my last
thread, I've acquired an EVGA GTX 660 graphics card, which I'd like to
pass through to a KVM guest.

I'm following this guide to achieve my goal, including compiling the
suggested patched versions of the kernel, seabios and qemu:
https://bbs.archlinux.org/viewtopic.php?id=162768

The integrated graphics adapter is the primary one, and the only one
that is used by the host system. nouveau and nvidia modules are
blacklisted and not in lsmod.
lspci -vvv output: http://pastebin.com/S6znCLK7
lspci -t output: http://pastebin.com/YH01atEy

Contrary to what I hoped, invoking
qemu-system-x86_64 -enable-kvm -M q35 -m 1024 -cpu host -smp
2,sockets=1,cores=2,threads=1 -bios /usr/share/qemu/bios.bin -vga none
-device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1
-device vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on
-device vfio-pci,host=01:00.1,bus=root.1,addr=00.1
after assigning the devices to vfio as suggested in the guide, does
not show anything on the display attached to the GTX 660, or even wake
the display up from standby, but instead gives this output on the
console:
qemu-system-x86_64: -device
vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on:
Warning, device 0000:01:00.0 does not support reset
qemu-system-x86_64: -device
vfio-pci,host=01:00.1,bus=root.1,addr=00.1: Warning, device
0000:01:00.1 does not support reset
More often than not, this also completely freezes the host system.

Later in the forum I linked, it is suggested to pass the graphics card
not as the guest's primary, but use pci-assign instead:
qemu-system-x86_64 -enable-kvm -m 2048 -cpu host -smp
2,sockets=1,cores=2,threads=1 -bios /usr/share/qemu/bios.bin -vga std
-vnc :0 -drive file=/mnt/orthowin/current-new.img,format=qcow2
-usbdevice tablet -net user,restrict=y -usb -device
pci-assign,host=01:00.0 -device pci-assign,host=01:00.1

This allows me to boot the Windows 7 64bit guest. After installing the
nVidia drivers, shutting the guest down, rebooting the host and
booting the guest again, the Windows device manager shows "Code 43"
for the graphics card (which even is correctly labelled as GTX 660),
and the nVidia user interface doesn't show any GPU at all.

What could I try to resolve this issue?
I realize that I might have omitted some information that may be
important for troubleshooting, but I'm afraid I'm not experienced
enough to even know what may be relevant. Please let me know if you'd
like to have some further information.
Please also note that I'm not running the latest kernel/qemu/seabios,
so maybe some changes made since I compiled 3.10.1 might help me? If
so, please suggest a commit or release to try. I'm not afraid of
messing up my operating system install, but if anything suggested
might brick the hardware, I'd like to know in advance. :)

I'm very grateful for any kind of input!


Re: KVM: x86: update masterclock when kvmclock_offset is calculated

2013-08-28 Thread Paolo Bonzini
On 28/08/2013 04:52, Marcelo Tosatti wrote:
> On Thu, Aug 22, 2013 at 07:05:20PM +0200, Paolo Bonzini wrote:
>> On 20/08/2013 20:20, Marcelo Tosatti wrote:
>>>
>>> The offset to add to the hosts monotonic time, kvmclock_offset, is
>>> calculated against the monotonic time at KVM_SET_CLOCK ioctl time.
>>>
>>> Request a master clock update at this time, to reduce a potentially
>>> unbounded difference between the values of the masterclock and
>>> the clock value used to calculate kvmclock_offset.
>>>
>>> Signed-off-by: Marcelo Tosatti 
>>>
>>> Index: linux-2.6-kvmclock-fixes/arch/x86/kvm/x86.c
>>> ===
>>> --- linux-2.6-kvmclock-fixes.orig/arch/x86/kvm/x86.c
>>> +++ linux-2.6-kvmclock-fixes/arch/x86/kvm/x86.c
>>> @@ -3806,6 +3806,7 @@ long kvm_arch_vm_ioctl(struct file *filp
>>> delta = user_ns.clock - now_ns;
>>> local_irq_enable();
>>> kvm->arch.kvmclock_offset = delta;
>>> +   kvm_gen_update_masterclock(kvm);
>>> break;
>>> }
>>> case KVM_GET_CLOCK: {
>>
>> Reviewed-by: Paolo Bonzini 
>>
>> While reviewing this patch, which BTW looks good, I noticed the handling
>> of KVM_REQ_MCLOCK_INPROGRESS, the dummy request that is never processed
>> and is only used to block guest entry.
>>
>> It seems to me that this bit is not necessary.  After
>> KVM_REQ_CLOCK_UPDATE is issued, no guest entries will happen because
>> kvm_guest_time_update will try to take the pvclock_gtod_sync_lock,
>> currently taken by kvm_gen_update_masterclock.
> 
> Not entirely clear, to cancel guest entry the bit is necessary:
> 
> if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
> || need_resched() || signal_pending(current)) {
> vcpu->mode = OUTSIDE_GUEST_MODE;
> smp_wmb();
> local_irq_enable();
> preempt_enable();
> r = 1;
> goto cancel_injection;
> }

Yes, KVM_REQ_CLOCK_UPDATE is enough to cancel guest entries too.  See
code below.

>> Thus, you do not need the dummy request.  You can simply issue
>> KVM_REQ_CLOCK_UPDATE before calling pvclock_update_vm_gtod_copy (with
>> the side effect of exiting VCPUs).  VCPUs will stall in
>> kvm_guest_time_update until pvclock_gtod_sync_lock is released by
>> kvm_gen_update_masterclock.  What do you think?
> 
> Not sure its safe. Can you describe the safety of your proposal in more
> detail ?

Here is the code I was thinking of:

spin_lock(&ka->pvclock_gtod_sync_lock);
make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);

/*
 * No guest entries from this point: VCPUs will be spinning
 * on pvclock_gtod_sync_lock in kvm_guest_time_update.
 */
pvclock_update_vm_gtod_copy(kvm);

/*
 * Let kvm_guest_time_update continue: entering the guest
 * is now allowed too.
 */
spin_unlock(&ka->pvclock_gtod_sync_lock);

KVM_REQ_CLOCK_UPDATE is used to cancel guest entry and execute
kvm_guest_time_update.  But kvm_guest_time_update will spin on
pvclock_gtod_sync_lock until pvclock_update_vm_gtod_copy exits and
kvm_gen_update_masterclock releases the spinlock.

>> On top of this, optionally the spinlock could become an rw_semaphore
>> so that clock updates for different VCPUs will not be serialized.  The
>> effect is probably not visible, though.
> 
> Still not clear of the benefits, but this area certainly welcomes
> performance improvements (the global kick is one thing we discussed 
> and that should be improved).

This unfortunately does not eliminate the global kick, so there is
likely no performance improvement yet.  It simplifies the logic a bit
though.

The change I suggested above is to make pvclock_gtod_sync_lock an rwsem
or, probably better, a seqlock.  VCPUs reading ka->use_master_clock,
ka->master_cycle_now, ka->master_kernel_ns can then run concurrently,
with no locked operations in kvm_guest_time_update.
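
A sketch of the seqlock variant (illustrative; ka->pvclock_sc is a
hypothetical seqcount protecting the master clock fields):

	bool use_master_clock;
	u64 host_tsc;
	s64 kernel_ns;
	unsigned seq;

	do {
		seq = read_seqcount_begin(&ka->pvclock_sc);
		use_master_clock = ka->use_master_clock;
		host_tsc = ka->master_cycle_now;
		kernel_ns = ka->master_kernel_ns;
	} while (read_seqcount_retry(&ka->pvclock_sc, seq));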

Paolo


Re: [libvirt] [PATCH v2] kvm: warn if num cpus is greater than num recommended

2013-08-28 Thread Eric Blake
On 08/28/2013 01:45 AM, Andrew Jones wrote:
>> What I'm more worried about is what number is libvirt supposed to show
>> to the end user, and should libvirt enforce the lower recommended max,
>> or the larger kernel absolute max?  Which of the two values does the QMP
>> 'MachineInfo' type return in its 'cpu-max' field during the
>> 'query-machines' command?  Should we be modifying QMP to return both
>> values, so that libvirt can also expose the logic to the end user of
>> allowing a recommended vs. larger development max?
>>
> 
> Machine definitions maintain yet another 'max_cpus'. And it appears that
> qmp would return that value. It would probably be best if it returned
> max(qemu_machine.max_cpus, kvm_max_cpus) though.
>  
> I'm starting to think that we should just keep things simple for most of
> the virt stack by sticking to enforcing the larger developer max. And
> then on a production kernel we should just compile KVM_MAX_VCPUS =
> KVM_SOFT_MAX_VCPUS and be done with it. With that thought, this patch
> could be dropped too. The alternative seems to be supporting a run-time
> selectable experimental mode throughout the whole virt stack.

Indeed - if it is a number you are unwilling to support, don't compile
it into the kernel in the first place.  Allowing arbitrary limits that
are lower than the maximum imply policy, and policy implies touching the
stack (because someone, somewhere in the stack, will have good reason
for setting policy different than the lowest layer); fix the maximum
instead, and the whole stack complies without having to worry about
policy.  IMO, this is a case where fewer knobs is better.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 06:49 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
> Or what if desc is moved to another rmap, but then it
> is moved back to initial rmap (but another place in the desc list) so
> the check here will not catch that we need to restart walking?

 It is okay. We always add the new desc to the head, then we will walk
 all the entires under this case.

>>> Which races another question: What if desc is added in front of the list
>>> behind the point where lockless walker currently is?
>>
>> That case is new spte is being added into the rmap. We need not to care the
>> new sptes since it will set the dirty-bitmap then they can be write-protected
>> next time.
>>
> OK.
> 
>>>
 Right?
>>> Not sure. While lockless walker works on a desc rmap can be completely
>>> destroyed and recreated again. It can be any order.
>>
>> I think the thing is very similar as include/linux/rculist_nulls.h
> include/linux/rculist_nulls.h is for implementing hash tables, so they
> may not care about add/del/lookup race for instance, but may be we are
> (you are saying above that we are not), so similarity does not prove
> correctness for our case. 

We do not care the "add" and "del" too when lookup the rmap. Under the "add"
case, it is okay, the reason i have explained above. Under the "del" case,
the spte becomes unpresent and flush all tlbs immediately, so it is also okay.

I always use a stupid way to check correctness, that is, enumerating
all the cases we may meet. In this patch, we may meet these cases:

1) kvm deletes a desc before the one we are currently on:
   those descs have already been checked; we do not need to care about them.

2) kvm deletes a desc after the one we are currently on:
   since we always add/del at the head desc, we can be sure the current desc
   has been deleted too, so we will meet case 3).

3) kvm deletes the desc that we are currently on:
   3.a): the desc stays in the slab cache (it is not reused yet).
 all spte entries are empty, so fn() will skip the non-present sptes,
 and desc->more is
 3.a.1) still pointing to the next desc, so we continue the lookup
 3.a.2) or it is the "nulls list", which means we have reached the
last one, so we finish the walk.

   3.b): the desc is alloc-ed from the slab cache and is being initialized.
 we will see "desc->more == NULL" and restart the walking. It's okay.

   3.c): the desc is added to a rmap or pte_list again.
 3.c.1): the desc is added to the current rmap again.
 the new desc always acts as the head desc, so we will walk
 all entries; some entries are double-checked, but no entry
 can be missed. It is okay.

 3.c.2): the desc is added to another rmap or pte_list.
 since kvm_set_memory_region() and get_dirty are serialized by
 slots-lock, the "nulls" can not be reused during the lookup. Then
 we will meet a different "nulls" at the end of the walking, which
 will cause a rewalk.

I know checking the algorithm like this is really silly; do you have another idea?

> BTW I do not see
> rcu_assign_pointer()/rcu_dereference() in your patches which hints on

IIUC, we can not directly use rcu_assign_pointer(), which is something like
p = v, assigning a pointer to a pointer. But in our case, we need:
   *pte_list = (unsigned long)desc | 1;

So I add the smp_wmb() by myself:
	/*
	 * Ensure the old spte has been updated into desc, so
	 * that the other side can not get the desc from pte_list
 * but miss the old spte.
 */
smp_wmb();

*pte_list = (unsigned long)desc | 1;

But I missed it when inserting an empty desc; in that case, we need the
barrier too, since we should make desc->more visible before assigning it to
pte_list, to avoid the lookup side seeing an invalid "nulls".
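
A sketch of where that barrier would go, based on the pte_list_add()
code earlier in this series (illustrative only):

	new_desc = mmu_alloc_pte_list_desc(vcpu);
	new_desc->more = desc;
	/*
	 * Make new_desc->more visible before the new desc is published,
	 * so the lookup side never sees an uninitialized desc.
	 */
	smp_wmb();
	*pte_list = (unsigned long)new_desc | 1;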

I also use my own code instead of rcu_dereference():
pte_list_walk_lockless():
pte_list_value = ACCESS_ONCE(*pte_list);
if (!pte_list_value)
return;

if (!(pte_list_value & 1))
return fn((u64 *)pte_list_value);

/*
 * fetch pte_list before read sptes in the desc, see the comments
 * in pte_list_add().
 *
 * There is the data dependence since the desc is got from pte_list.
 */
smp_read_barrier_depends();

That part can be replaced by rcu_dereference().

> incorrect usage of RCU. I think any access to slab pointers will need to
> use those.

Removing a desc does not need it, I think, since we do not mind seeing the
old info. (hlist_nulls_del_rcu() does not use rcu_dereference() either.)




Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Gleb Natapov
On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
> > On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
> >>> Or what if desc is moved to another rmap, but then it
> >>> is moved back to initial rmap (but another place in the desc list) so
> >>> the check here will not catch that we need to restart walking?
> >>
> >> It is okay. We always add the new desc to the head, then we will walk
> >> all the entires under this case.
> >>
> > Which races another question: What if desc is added in front of the list
> > behind the point where lockless walker currently is?
> 
> That case is new spte is being added into the rmap. We need not to care the
> new sptes since it will set the dirty-bitmap then they can be write-protected
> next time.
> 
OK.

> > 
> >> Right?
> > Not sure. While lockless walker works on a desc rmap can be completely
> > destroyed and recreated again. It can be any order.
> 
> I think the thing is very similar as include/linux/rculist_nulls.h
include/linux/rculist_nulls.h is for implementing hash tables, so they
may not care about the add/del/lookup race, for instance, but maybe we do
(you are saying above that we do not), so the similarity does not prove
correctness for our case. BTW, I do not see
rcu_assign_pointer()/rcu_dereference() in your patches, which hints at
incorrect usage of RCU. I think any access to the slab pointers will need
to use those.
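
For instance, in the lockless walk the ACCESS_ONCE() plus the explicit
dependency barrier could be written as (sketch):

	desc = rcu_dereference(desc->more);	/* ordered load of the next desc */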

--
Gleb.


Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 05:46 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
>>> Or what if desc is moved to another rmap, but then it
>>> is moved back to initial rmap (but another place in the desc list) so
>>> the check here will not catch that we need to restart walking?
>>
>> It is okay. We always add the new desc to the head, then we will walk
>> all the entires under this case.
>>
> Which races another question: What if desc is added in front of the list
> behind the point where lockless walker currently is?

That case is a new spte being added into the rmap. We need not care about
the new sptes, since they will set the dirty-bitmap and can then be
write-protected next time.

> 
>> Right?
> Not sure. While lockless walker works on a desc rmap can be completely
> destroyed and recreated again. It can be any order.

I think the thing is very similar to include/linux/rculist_nulls.h



Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Gleb Natapov
On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
> > Or what if desc is moved to another rmap, but then it
> > is moved back to initial rmap (but another place in the desc list) so
> > the check here will not catch that we need to restart walking?
> 
> It is okay. We always add the new desc to the head, then we will walk
> all the entires under this case.
> 
Which raises another question: what if a desc is added at the front of the
list, behind the point where the lockless walker currently is?

> Right?
Not sure. While the lockless walker works on a desc, the rmap can be
completely destroyed and recreated again. It can happen in any order.

--
Gleb.


Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 05:20 PM, Gleb Natapov wrote:
> On Tue, Jul 30, 2013 at 09:02:07PM +0800, Xiao Guangrong wrote:
>> The basic idea is from nulls list which uses a nulls to indicate
>> whether the desc is moved to different pte-list
>>
>> Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/kvm/mmu.c | 51 ++-
>>  1 file changed, 50 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 36caf6a..f8fc0cc 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1010,6 +1010,14 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>>  desc->sptes[0] = (u64 *)*pte_list;
>>  desc->sptes[1] = spte;
>>  desc_mark_nulls(pte_list, desc);
>> +
>> +/*
>> + * Esure the old spte has been updated into desc, so
>> + * that the another side can not get the desc from pte_list
>> + * but miss the old spte.
>> + */
>> +smp_wmb();
>> +
>>  *pte_list = (unsigned long)desc | 1;
>>  return 1;
>>  }
>> @@ -1131,6 +1139,47 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
>>  WARN_ON(desc_get_nulls_value(desc) != pte_list);
>>  }
>>  
>> +/* The caller should hold rcu lock. */
>> +typedef void (*pte_list_walk_lockless_fn) (u64 *spte, int level);
>> +static void pte_list_walk_lockless(unsigned long *pte_list,
>> +   pte_list_walk_lockless_fn fn, int level)
>> +{
>> +struct pte_list_desc *desc;
>> +unsigned long pte_list_value;
>> +int i;
>> +
>> +restart:
>> +pte_list_value = ACCESS_ONCE(*pte_list);
>> +if (!pte_list_value)
>> +return;
>> +
>> +if (!(pte_list_value & 1))
>> +return fn((u64 *)pte_list_value, level);
>> +
>> +/*
>> + * fetch pte_list before read sptes in the desc, see the comments
>> + * in pte_list_add().
>> + *
>> + * There is the data dependence since the desc is got from pte_list.
>> + */
>> +smp_read_barrier_depends();
>> +
>> +desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
>> +while (!desc_is_a_nulls(desc)) {
>> +for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
>> +fn(desc->sptes[i], level);
>> +
>> +desc = ACCESS_ONCE(desc->more);
>> +
>> +/* It is being initialized. */
>> +if (unlikely(!desc))
>> +goto restart;
>> +}
>> +
>> +if (unlikely(desc_get_nulls_value(desc) != pte_list))
>> +goto restart;
> So is it really enough to guaranty safety and correctness? What if desc
> is moved to another rmap while we walking it so fn() is called on
> incorrect sptes? 

Then fn() will do needless write-protection. It is unnecessary work,
but it is acceptable for this rare case.

There is a bug, though: we can not detect the mapping level from the rmap
since the desc can be moved, as you said; that can cause us to do
write-protection on a middle-level spte. We can fix it by getting the
mapping level from sp->role.level instead, since the sp can not be reused
while the rcu lock is held.
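
A sketch of that fix (illustrative; page_header(__pa()) is the usual
KVM idiom for finding the shadow page that contains an spte):

	struct kvm_mmu_page *sp = page_header(__pa(spte));
	int level = sp->role.level;	/* stable while the rcu lock is held */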

> Or what if desc is moved to another rmap, but then it
> is moved back to initial rmap (but another place in the desc list) so
> the check here will not catch that we need to restart walking?

It is okay. We always add the new desc to the head, so we will walk
all the entries in this case.

Right?



Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker

2013-08-28 Thread Gleb Natapov
On Tue, Jul 30, 2013 at 09:02:07PM +0800, Xiao Guangrong wrote:
> The basic idea is from nulls list which uses a nulls to indicate
> whether the desc is moved to different pte-list
> 
> Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c | 51 ++-
>  1 file changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 36caf6a..f8fc0cc 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1010,6 +1010,14 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>   desc->sptes[0] = (u64 *)*pte_list;
>   desc->sptes[1] = spte;
>   desc_mark_nulls(pte_list, desc);
> +
> + /*
> +  * Ensure the old spte has been updated into desc, so
> +  * that the other side can not get the desc from pte_list
> +  * but miss the old spte.
> +  */
> + smp_wmb();
> +
>   *pte_list = (unsigned long)desc | 1;
>   return 1;
>   }
> @@ -1131,6 +1139,47 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
>   WARN_ON(desc_get_nulls_value(desc) != pte_list);
>  }
>  
> +/* The caller should hold rcu lock. */
> +typedef void (*pte_list_walk_lockless_fn) (u64 *spte, int level);
> +static void pte_list_walk_lockless(unsigned long *pte_list,
> +pte_list_walk_lockless_fn fn, int level)
> +{
> + struct pte_list_desc *desc;
> + unsigned long pte_list_value;
> + int i;
> +
> +restart:
> + pte_list_value = ACCESS_ONCE(*pte_list);
> + if (!pte_list_value)
> + return;
> +
> + if (!(pte_list_value & 1))
> + return fn((u64 *)pte_list_value, level);
> +
> + /*
> +  * fetch pte_list before read sptes in the desc, see the comments
> +  * in pte_list_add().
> +  *
> +  * There is the data dependence since the desc is got from pte_list.
> +  */
> + smp_read_barrier_depends();
> +
> + desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
> + while (!desc_is_a_nulls(desc)) {
> + for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> + fn(desc->sptes[i], level);
> +
> + desc = ACCESS_ONCE(desc->more);
> +
> + /* It is being initialized. */
> + if (unlikely(!desc))
> + goto restart;
> + }
> +
> + if (unlikely(desc_get_nulls_value(desc) != pte_list))
> + goto restart;
So is it really enough to guarantee safety and correctness? What if the desc
is moved to another rmap while we are walking it, so that fn() is called on
incorrect sptes? Or what if the desc is moved to another rmap, but then it
is moved back to the initial rmap (but at another place in the desc list), so
that the check here will not catch that we need to restart the walk?


> +}
> +
>  static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
>   struct kvm_memory_slot *slot)
>  {
> @@ -4557,7 +4606,7 @@ int kvm_mmu_module_init(void)
>  {
>   pte_list_desc_cache = kmem_cache_create("pte_list_desc",
>   sizeof(struct pte_list_desc),
> - 0, 0, NULL);
> + 0, SLAB_DESTROY_BY_RCU, NULL);
>   if (!pte_list_desc_cache)
>   goto nomem;
>  
> -- 
> 1.8.1.4

--
Gleb.


Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 04:58 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 04:37:32PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 04:12 PM, Gleb Natapov wrote:
>>
 +
 +  rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
 +  desc = (struct pte_list_desc *)(*pte_list & ~1ul);
 +
 +  /* No empty position in the desc. */
 +  if (desc->sptes[PTE_LIST_EXT - 1]) {
 +  struct pte_list_desc *new_desc;
 +  new_desc = mmu_alloc_pte_list_desc(vcpu);
 +  new_desc->more = desc;
 +  desc = new_desc;
 +  *pte_list = (unsigned long)desc | 1;
}
 -  return count;
 +
 +  free_pos = find_first_free(desc);
 +  desc->sptes[free_pos] = spte;
 +  return count_spte_number(desc);
>>> Should it be count_spte_number(desc) - 1? The function should returns
>>> the number of pte entries before the spte was added.
>>
>> Yes. We have handled it count_spte_number(), we count the number like this:
>>
>>  return first_free + desc_num * PTE_LIST_EXT;
>>
>> The first_free is indexed from 0.
>>
> Suppose when pte_list_add() is called there is one full desc, so the
> number that should be returned is PTE_LIST_EXT, correct? But since
> before calling count_spte_number() one more desc will be added and
> desc->sptes[0] will be set in it the first_free in count_spte_number
> will be 1 and PTE_LIST_EXT + 1 will be returned.

Oh, yes, you are right. Will fix it in the next version; thanks for
pointing it out.

> 
>> Maybe it is clearer to let count_spte_number() return the real number.
>>
>>>
  }
  
  static void
 -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
 - int i, struct pte_list_desc *prev_desc)
 +pte_list_desc_remove_entry(unsigned long *pte_list,
 + struct pte_list_desc *desc, int i)
  {
 -  int j;
 +  struct pte_list_desc *first_desc;
 +  int last_used;
 +
 +  first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
 +  last_used = find_last_used(first_desc);
  
 -  for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
 -  ;
 -  desc->sptes[i] = desc->sptes[j];
 -  desc->sptes[j] = NULL;
 -  if (j != 0)
 +  /*
 +   * Move the entry from the first desc to this position we want
 +   * to remove.
 +   */
 +  desc->sptes[i] = first_desc->sptes[last_used];
 +  first_desc->sptes[last_used] = NULL;
 +
>>> What if desc == first_desc and i < last_used. You still move spte
>>> backwards so lockless walk may have already examined entry at i and
>>> will miss spte that was moved there from last_used position, no?
>>
>> Right. I noticed it too and fixed in the v2 which is being tested.
>> I fixed it by bottom-up walk desc, like this:
>>
>> pte_list_walk_lockless():
>>
>>  desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
>>  while (!desc_is_a_nulls(desc)) {
>>  /*
>>   * We should do bottom-up walk since we always use the
>>   * bottom entry to replace the deleted entry if only
>>   * one desc is used in the rmap when a spte is removed.
>>   * Otherwise the moved entry will be missed.
>>   */
> I would call it top-down walk since we are walking from big indices to
> smaller once.

Okay, will fix the comments.

> 
>>  for (i = PTE_LIST_EXT - 1; i >= 0; i--)
>>  fn(desc->sptes[i]);
>>
>>  desc = ACCESS_ONCE(desc->more);
>>
>>  /* It is being initialized. */
>>  if (unlikely(!desc))
>>  goto restart;
>>  }
>>
>> How about this?
>>
> Tricky, very very tricky :)
> 
>>>
 +  /* No valid entry in this desc, we can free this desc now. */
 +  if (!first_desc->sptes[0]) {
 +  struct pte_list_desc *next_desc = first_desc->more;
 +
 +  /*
 +   * Only one entry existing but still use a desc to store it?
 +   */
 +  WARN_ON(!next_desc);
 +
 +  mmu_free_pte_list_desc(first_desc);
 +  first_desc = next_desc;
 +  *pte_list = (unsigned long)first_desc | 1ul;
return;
 -  if (!prev_desc && !desc->more)
 -  *pte_list = (unsigned long)desc->sptes[0];
 -  else
 -  if (prev_desc)
 -  prev_desc->more = desc->more;
 -  else
 -  *pte_list = (unsigned long)desc->more | 1;
 -  mmu_free_pte_list_desc(desc);
 +  }
 +
 +  WARN_ON(!first_desc->sptes[0]);
 +
 +  /*
 +   * Only one entry in this desc, move the entry to the head
 +   * then the desc can be freed.
 +   */
 +  if (!first_desc->sptes[1] && !first_desc->more) {
 +  *pte_list = (unsigned long)first_desc->sptes[0];
 +  mmu_free_pte_list_desc(first_desc);
 

Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list

2013-08-28 Thread Gleb Natapov
On Wed, Aug 28, 2013 at 04:37:32PM +0800, Xiao Guangrong wrote:
> On 08/28/2013 04:12 PM, Gleb Natapov wrote:
> 
> >> +
> >> +  rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
> >> +  desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> >> +
> >> +  /* No empty position in the desc. */
> >> +  if (desc->sptes[PTE_LIST_EXT - 1]) {
> >> +  struct pte_list_desc *new_desc;
> >> +  new_desc = mmu_alloc_pte_list_desc(vcpu);
> >> +  new_desc->more = desc;
> >> +  desc = new_desc;
> >> +  *pte_list = (unsigned long)desc | 1;
> >>}
> >> -  return count;
> >> +
> >> +  free_pos = find_first_free(desc);
> >> +  desc->sptes[free_pos] = spte;
> >> +  return count_spte_number(desc);
> > Should it be count_spte_number(desc) - 1? The function should returns
> > the number of pte entries before the spte was added.
> 
> Yes. We have handled it count_spte_number(), we count the number like this:
> 
>   return first_free + desc_num * PTE_LIST_EXT;
> 
> The first_free is indexed from 0.
> 
Suppose that when pte_list_add() is called there is one full desc, so the
number that should be returned is PTE_LIST_EXT, correct? But since, before
count_spte_number() is called, one more desc will be added and
desc->sptes[0] will be set in it, first_free in count_spte_number()
will be 1 and PTE_LIST_EXT + 1 will be returned.
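
(A sketch of the fix implied here, reusing the patch's helpers:)

	free_pos = find_first_free(desc);
	desc->sptes[free_pos] = spte;
	/* Return how many sptes were mapped before this one was added. */
	return count_spte_number(desc) - 1;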

> Maybe it is clearer to let count_spte_number() return the real number.
> 
> > 
> >>  }
> >>  
> >>  static void
> >> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc 
> >> *desc,
> >> - int i, struct pte_list_desc *prev_desc)
> >> +pte_list_desc_remove_entry(unsigned long *pte_list,
> >> + struct pte_list_desc *desc, int i)
> >>  {
> >> -  int j;
> >> +  struct pte_list_desc *first_desc;
> >> +  int last_used;
> >> +
> >> +  first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> >> +  last_used = find_last_used(first_desc);
> >>  
> >> -  for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
> >> -  ;
> >> -  desc->sptes[i] = desc->sptes[j];
> >> -  desc->sptes[j] = NULL;
> >> -  if (j != 0)
> >> +  /*
> >> +   * Move the entry from the first desc to this position we want
> >> +   * to remove.
> >> +   */
> >> +  desc->sptes[i] = first_desc->sptes[last_used];
> >> +  first_desc->sptes[last_used] = NULL;
> >> +
> > What if desc == first_desc and i < last_used. You still move spte
> > backwards so lockless walk may have already examined entry at i and
> > will miss spte that was moved there from last_used position, no?
> 
> Right. I noticed it too and fixed it in the v2, which is being tested.
> I fixed it by walking the desc bottom-up, like this:
> 
> pte_list_walk_lockless():
> 
>   desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
>   while (!desc_is_a_nulls(desc)) {
>   /*
>* We should do bottom-up walk since we always use the
>* bottom entry to replace the deleted entry if only
>* one desc is used in the rmap when a spte is removed.
>* Otherwise the moved entry will be missed.
>*/
I would call it top-down walk since we are walking from big indices to
smaller ones.

>   for (i = PTE_LIST_EXT - 1; i >= 0; i--)
>   fn(desc->sptes[i]);
> 
>   desc = ACCESS_ONCE(desc->more);
> 
>   /* It is being initialized. */
>   if (unlikely(!desc))
>   goto restart;
>   }
> 
> How about this?
> 
Tricky, very very tricky :)
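The interleaving that makes the original bottom-up (index 0 upwards)
scan unsafe can be drawn as a timeline (an illustrative schedule; names
follow the patch):

/*
 * remover                              lockless walker (i = 0, 1, ...)
 * -------                              -------------------------------
 *                                      visits desc->sptes[0]
 * desc->sptes[0] = first_desc->sptes[last_used];
 * first_desc->sptes[last_used] = NULL;
 *                                      visits sptes[1] .. sptes[last_used]
 *                                      -> the moved spte is never visited
 *
 * Scanning each desc from PTE_LIST_EXT - 1 down to 0 ensures the
 * destination slot (the smaller index) is visited no earlier than the
 * source slot, so a moved entry is seen at least once; seeing an entry
 * twice is harmless for write protection.
 */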

> > 
> >> +  /* No valid entry in this desc, we can free this desc now. */
> >> +  if (!first_desc->sptes[0]) {
> >> +  struct pte_list_desc *next_desc = first_desc->more;
> >> +
> >> +  /*
> >> +   * Only one entry existing but still use a desc to store it?
> >> +   */
> >> +  WARN_ON(!next_desc);
> >> +
> >> +  mmu_free_pte_list_desc(first_desc);
> >> +  first_desc = next_desc;
> >> +  *pte_list = (unsigned long)first_desc | 1ul;
> >>return;
> >> -  if (!prev_desc && !desc->more)
> >> -  *pte_list = (unsigned long)desc->sptes[0];
> >> -  else
> >> -  if (prev_desc)
> >> -  prev_desc->more = desc->more;
> >> -  else
> >> -  *pte_list = (unsigned long)desc->more | 1;
> >> -  mmu_free_pte_list_desc(desc);
> >> +  }
> >> +
> >> +  WARN_ON(!first_desc->sptes[0]);
> >> +
> >> +  /*
> >> +   * Only one entry in this desc, move the entry to the head
> >> +   * then the desc can be freed.
> >> +   */
> >> +  if (!first_desc->sptes[1] && !first_desc->more) {
> >> +  *pte_list = (unsigned long)first_desc->sptes[0];
> >> +  mmu_free_pte_list_desc(first_desc);
> >> +  }
> >>  }
> >>  
> >>  static void pte_list_remove(u64 *spte, unsigned long *pte_list)
> >>  {
> >>struct pte_list_desc *desc;
> >> -  struct pte_list_desc *prev_desc;
> >>int i;
> >>  

Re: [PATCH 08/12] KVM: MMU: introduce nulls desc

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 04:40 PM, Gleb Natapov wrote:

>>  static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
>> @@ -1200,7 +1221,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct 
>> rmap_iterator *iter)
>>   */
>>  static u64 *rmap_get_next(struct rmap_iterator *iter)
>>  {
>> -if (iter->desc) {
>> +if (iter->desc && !desc_is_a_nulls(iter->desc)) {
>>  if (iter->pos < PTE_LIST_EXT - 1) {
>>  u64 *sptep;
>>  
>> @@ -1212,7 +1233,7 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
>>  
>>  iter->desc = iter->desc->more;
>>  
> I'd rather do:
>   iter->desc = desc_is_a_nulls(iter->desc) ? NULL : iter->desc;
> here and drop two desc_is_a_nulls() checks in this function.

Nice, will do it in the next version. Thanks!



[PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

2013-08-28 Thread Alexey Kardashevskiy
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeting an IOMMU TCE table without passing
them to user space, which saves time on switching to user space and back.

Both real and virtual modes are supported. The kernel tries to
handle a TCE request in real mode; if that fails, it passes the request
to the virtual mode handler to complete the operation. If the virtual
mode handler fails as well, the request is passed to user space.

The first user of this is VFIO on POWER. Trampolines to the VFIO external
user API functions are required for this patch.

This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
of map/unmap requests. The device supports a single attribute which is
a struct with LIOBN and IOMMU fd. When the attribute is set, the device
establishes the connection between KVM and VFIO.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
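For readers unfamiliar with the KVM device API, the intended user space
flow is roughly the sketch below (error handling is omitted; the linkage
struct and device type come from this series, group_fd is assumed to be
an open VFIO group fd, and placing the constant in .attr rather than
.group is an assumption based on the documentation in this patch):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: vm_fd is a KVM VM fd, group_fd an open VFIO group fd. */
static int link_liobn_to_group(int vm_fd, int group_fd, __u64 liobn)
{
        struct kvm_create_device cd = {
                .type = KVM_DEV_TYPE_SPAPR_TCE_IOMMU,
        };
        struct kvm_create_spapr_tce_iommu_linkage link = {
                .liobn = liobn,
                .fd = group_fd,
                .flags = 0,
        };
        struct kvm_device_attr attr = {
                .attr = KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE,
                .addr = (__u64)(unsigned long)&link,
        };

        if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd))
                return -1;

        /* cd.fd now refers to the new device; setting the attribute
         * establishes the KVM <-> VFIO connection described above. */
        return ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);
}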

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexey Kardashevskiy 

---

Changes:
v9:
* KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
KVM device
* release_spapr_tce_table() is not shared between different TCE types
* reduced the patch size by moving VFIO external API
trampolines to a separate patch
* moved documentation from Documentation/virtual/kvm/api.txt to
Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

v8:
* fixed warnings from checkpatch.pl

2013/07/11:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64
* kvmppc_gpa_to_hva_and_get also returns host phys address. It is not of much
use here, but the next patch, for hugepage support, will use it more.

2013/07/06:
* added realmode arch_spin_lock to protect TCE table from races
in real and virtual modes
* POWERPC IOMMU API is changed to support real mode
* iommu_take_ownership and iommu_release_ownership are protected by
iommu_table's locks
* VFIO external user API use rewritten
* multiple small fixes

2013/06/27:
* tce_list page is now referenced in order to protect it from accidental
invalidation during H_PUT_TCE_INDIRECT execution
* added use of the external user VFIO API

2013/06/05:
* changed capability number
* changed ioctl number
* update the doc article number

2013/05/20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when user mode
did not register the TCE table in the kernel; in all other cases the virtual
mode handler is expected to do the job
---
 .../virtual/kvm/devices/spapr_tce_iommu.txt|  37 +++
 arch/powerpc/include/asm/kvm_host.h|   4 +
 arch/powerpc/kvm/book3s_64_vio.c   | 310 -
 arch/powerpc/kvm/book3s_64_vio_hv.c| 122 
 arch/powerpc/kvm/powerpc.c |   1 +
 include/linux/kvm_host.h   |   1 +
 virt/kvm/kvm_main.c|   5 +
 7 files changed, 477 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt 
b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
new file mode 100644
index 000..4bc8fc3
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
@@ -0,0 +1,37 @@
+SPAPR TCE IOMMU device
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+
+Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
+
+Groups:
+  KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
+  Attributes: single attribute with pair { LIOBN, IOMMU fd}
+
+This is a completely made-up device which provides an API to link
+a logical bus number (LIOBN) and an IOMMU group. User space has
+to create a new SPAPR TCE IOMMU device per logical bus.
+
+LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
+(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
+IOMMU group is a minimal isolated device set which can be passed to
+the user space via VFIO.
+
+Right after creation the device is in an uninitialized state and requires
+a KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute to be set.
+The attribute contains liobn, IOMMU fd and flags:
+
+struct kvm_create_spapr_tce_iommu_linkage {
+   __u64 liobn;
+   __u32 fd;
+   __u32 flags;
+};
+
+The user space creates the SPAPR TCE IOMMU device, obtains
+an IOMMU fd via VFIO ABI and sets the attribute to the SPAPR TCE IOMMU
+device. At the moment of setting the attribute, the SPAPR TCE IOMMU
+device links LIOBN to IOMMU group and makes necessary steps
+to make sure 

[PATCH v9 13/13] KVM: PPC: Add hugepage support for IOMMU in-kernel handling

2013-08-28 Thread Alexey Kardashevskiy
This adds special support for huge pages (16MB) in real mode.

The reference counting cannot be easily done for such pages in real
mode (when MMU is off) so we added a hash table of huge pages.
It is populated in virtual mode and get_page is called just once
per huge page. Real mode handlers check if the requested page is
in the hash table, then no reference counting is done, otherwise
an exit to virtual mode happens. The hash table is released at KVM
exit.

At the moment the fastest card available for tests uses up to 9 huge
pages so walking through this hash table does not cost much.
However this can change and we may want to optimize this.
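The real mode lookup described above is a short lock-free hash walk; a
sketch using the definitions added later in this patch (the function
name and the exact containment check are assumptions):

/* Sketch: find a pre-registered huge page by guest physical address.
 * Returns NULL when the page has not been hashed yet, in which case
 * the real mode handler exits to virtual mode to do get_page(). */
static struct kvmppc_spapr_iommu_hugepage *
kvmppc_rm_hugepage_find(struct kvmppc_spapr_tce_table *stt, unsigned long gpa)
{
        struct kvmppc_spapr_iommu_hugepage *hp;

        hash_for_each_possible_rcu_notrace(stt->hash_tab, hp, hash_node,
                        KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)) {
                if ((gpa >= hp->gpa) && (gpa < hp->gpa + hp->size))
                        return hp;
        }

        return NULL;
}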

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexey Kardashevskiy 

---

Changes:
2013/07/12:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
for KVM_BOOK3S_64

2013/06/27:
* list of huge pages replaced with a hashtable for better performance
* spinlock removed from real mode; it now only protects insertion of new
huge page descriptors into the hashtable

2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints a warning if it is called twice for the same
huge page, as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
 arch/powerpc/include/asm/kvm_host.h |  25 
 arch/powerpc/kernel/iommu.c |   6 +-
 arch/powerpc/kvm/book3s_64_vio.c| 122 ++--
 arch/powerpc/kvm/book3s_64_vio_hv.c |  32 +-
 4 files changed, 176 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index c1a039d..b970d26 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -184,9 +185,33 @@ struct kvmppc_spapr_tce_table {
struct iommu_group *grp;/* used for IOMMU groups */
struct kvm_create_spapr_tce_iommu_linkage link;
struct vfio_group *vfio_grp;/* used for IOMMU groups */
+   DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */
+   spinlock_t hugepages_write_lock;/* used for IOMMU groups */
struct page *pages[0];
 };
 
+/*
+ * The KVM guest can be backed with 16MB pages.
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)hash_32(gpa >> 24, 32)
+
+struct kvmppc_spapr_iommu_hugepage {
+   struct hlist_node hash_node;
+   unsigned long gpa;  /* Guest physical address */
+   unsigned long hpa;  /* Host physical address */
+   struct page *page;  /* page struct of the very first subpage */
+   unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
 struct kvmppc_linear_info {
void*base_virt;
unsigned longbase_pfn;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ff0cd90..d0593c9 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long 
entry,
if (!pg) {
ret = -EAGAIN;
} else if (PageCompound(pg)) {
-   ret = -EAGAIN;
+   /* Hugepages will be released at KVM exit */
+   ret = 0;
} else {
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
@@ -1010,6 +1011,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned 
long entry,
struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT);
if (!pg) {
ret = -EAGAIN;
+   } else if (PageCompound(pg)) {
+   /* Hugepages will be released at KVM exit */
+   ret = 0;
} else {
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(pg);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 95f9e1a..1851778 100644
--- a/arch/po

[PATCH v9 10/13] KVM: PPC: remove warning from kvmppc_core_destroy_vm

2013-08-28 Thread Alexey Kardashevskiy
Book3S KVM implements in-kernel TCE tables via a list of kvmppc_spapr_tce_table
structs (created per KVM). Entries in the list are per LIOBN
(logical bus number) and have a TCE table so DMA hypercalls (such as
H_PUT_TCE) can convert LIOBN to a TCE table.

The entry in the list is created via KVM_CREATE_SPAPR_TCE ioctl which
returns an anonymous fd. This fd is used to map the TCE table to the user
space and it also defines the lifetime of the TCE table in the kernel.
Every list item also holds a link to KVM so when KVM is about to be
destroyed, all kvmppc_spapr_tce_table objects are expected to be
released and removed from the global list. And this is what the warning
verifies.

The upcoming support of VFIO IOMMU will extend kvmppc_spapr_tce_table use.
Unlike emulated devices, it will create kvmppc_spapr_tce_table structs
via the new KVM device API which opens an anonymous fd
(as KVM_CREATE_SPAPR_TCE does) but whose release callback does not call
the KVM device's destroy callback immediately. Instead, KVM device
destruction is postponed until KVM destruction, but that happens after the
arch-specific KVM destroy function, so the warning raises a false alarm.

This removes the warning as it never triggers in real life and there is
no obvious place to put it.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/book3s_hv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9e823ad..5f15ff7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1974,7 +1974,6 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
kvmppc_rtas_tokens_free(kvm);
 
kvmppc_free_hpt(kvm);
-   WARN_ON(!list_empty(&kvm->arch.spapr_tce_tables));
 }
 
 /* These are stubs for now */
-- 
1.8.4.rc4



[PATCH v9 11/13] KVM: PPC: add trampolines for VFIO external API

2013-08-28 Thread Alexey Kardashevskiy
KVM is going to use VFIO's external API. However KVM must be able to operate
even when VFIO is not compiled in or loaded, so KVM is linked to VFIO dynamically.

This adds proxy functions for VFIO external API.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/book3s_64_vio.c | 49 
 1 file changed, 49 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index cae1099..047b94c 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -42,6 +44,53 @@
 
 #define ERROR_ADDR  ((void *)~(unsigned long)0x0)
 
+/*
+ * Dynamically linked version of the external user VFIO API.
+ *
+ * As IOMMU group access control is implemented by VFIO,
+ * there is an API to verify that a specific process can own
+ * a group. As KVM may run when VFIO is not loaded, KVM is not
+ * linked statically to VFIO; instead wrappers are used.
+ */
+struct vfio_group *kvmppc_vfio_group_get_external_user(struct file *filep)
+{
+   struct vfio_group *ret;
+   struct vfio_group * (*proc)(struct file *) =
+   symbol_get(vfio_group_get_external_user);
+   if (!proc)
+   return NULL;
+
+   ret = proc(filep);
+   symbol_put(vfio_group_get_external_user);
+
+   return ret;
+}
+
+void kvmppc_vfio_group_put_external_user(struct vfio_group *group)
+{
+   void (*proc)(struct vfio_group *) =
+   symbol_get(vfio_group_put_external_user);
+   if (!proc)
+   return;
+
+   proc(group);
+   symbol_put(vfio_group_put_external_user);
+}
+
+int kvmppc_vfio_external_user_iommu_id(struct vfio_group *group)
+{
+   int ret;
+   int (*proc)(struct vfio_group *) =
+   symbol_get(vfio_external_user_iommu_id);
+   if (!proc)
+   return -EINVAL;
+
+   ret = proc(group);
+   symbol_put(vfio_external_user_iommu_id);
+
+   return ret;
+}
+
 static long kvmppc_stt_npages(unsigned long window_size)
 {
return ALIGN((window_size >> SPAPR_TCE_SHIFT)
-- 
1.8.4.rc4



[PATCH v9 01/13] KVM: PPC: POWERNV: move iommu_add_device earlier

2013-08-28 Thread Alexey Kardashevskiy
The current implementation of IOMMU on sPAPR does not use iommu_ops
and therefore does not call IOMMU API's bus_set_iommu() which
1) sets iommu_ops for a bus
2) registers a bus notifier
Instead, PCI devices are added to IOMMU groups from
subsys_initcall_sync(tce_iommu_init) which does basically the same
thing without using iommu_ops callbacks.

However the Freescale PAMU driver (https://lkml.org/lkml/2013/7/1/158)
implements iommu_ops, and when tce_iommu_init is called, every PCI device
is already added to some group, so there is a conflict.

This patch does 2 things:
1. removes the loop in which PCI devices were added to groups and
adds explicit iommu_add_device() calls to add devices as soon as they get
the iommu_table pointer assigned to them.
2. moves a bus notifier to powernv code in order to avoid conflict with
the notifier from the Freescale driver (see the sketch below).

iommu_add_device() and iommu_del_device() are public now.
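The powernv side of item 2 is not visible in the hunks below; it is
essentially the notifier removed from arch/powerpc/kernel/iommu.c,
re-registered from platform code. A sketch (the pnv_* names are
assumptions; the iommu_group check matches the v8 changelog note below):

static int pnv_tce_iommu_bus_notifier(struct notifier_block *nb,
                unsigned long action, void *data)
{
        struct device *dev = data;

        switch (action) {
        case BUS_NOTIFY_ADD_DEVICE:
                return iommu_add_device(dev);
        case BUS_NOTIFY_DEL_DEVICE:
                /* only remove devices that actually joined a group */
                if (dev->iommu_group)
                        iommu_del_device(dev);
                return 0;
        default:
                return 0;
        }
}

static struct notifier_block pnv_tce_iommu_bus_nb = {
        .notifier_call = pnv_tce_iommu_bus_notifier,
};

/* registered from powernv init code:
 * bus_register_notifier(&pci_bus_type, &pnv_tce_iommu_bus_nb); */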

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v8:
* added the check for iommu_group!=NULL before removing device from a group
as suggested by Wei Yang 

v2:
* added a helper - set_iommu_table_base_and_group - which does
set_iommu_table_base() and iommu_add_device()
---
 arch/powerpc/include/asm/iommu.h|  9 +++
 arch/powerpc/kernel/iommu.c | 41 +++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  8 +++---
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  2 +-
 arch/powerpc/platforms/powernv/pci.c| 33 ++-
 arch/powerpc/platforms/pseries/iommu.c  |  8 +++---
 6 files changed, 55 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c34656a..19ad77f 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -103,6 +103,15 @@ extern struct iommu_table *iommu_init_table(struct 
iommu_table * tbl,
int nid);
 extern void iommu_register_group(struct iommu_table *tbl,
 int pci_domain_number, unsigned long pe_num);
+extern int iommu_add_device(struct device *dev);
+extern void iommu_del_device(struct device *dev);
+
+static inline void set_iommu_table_base_and_group(struct device *dev,
+ void *base)
+{
+   set_iommu_table_base(dev, base);
+   iommu_add_device(dev);
+}
 
 extern int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
struct scatterlist *sglist, int nelems,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b20ff17..15f8ca8 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1105,7 +1105,7 @@ void iommu_release_ownership(struct iommu_table *tbl)
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-static int iommu_add_device(struct device *dev)
+int iommu_add_device(struct device *dev)
 {
struct iommu_table *tbl;
int ret = 0;
@@ -1134,46 +1134,13 @@ static int iommu_add_device(struct device *dev)
 
return ret;
 }
+EXPORT_SYMBOL_GPL(iommu_add_device);
 
-static void iommu_del_device(struct device *dev)
+void iommu_del_device(struct device *dev)
 {
iommu_group_remove_device(dev);
 }
-
-static int iommu_bus_notifier(struct notifier_block *nb,
- unsigned long action, void *data)
-{
-   struct device *dev = data;
-
-   switch (action) {
-   case BUS_NOTIFY_ADD_DEVICE:
-   return iommu_add_device(dev);
-   case BUS_NOTIFY_DEL_DEVICE:
-   iommu_del_device(dev);
-   return 0;
-   default:
-   return 0;
-   }
-}
-
-static struct notifier_block tce_iommu_bus_nb = {
-   .notifier_call = iommu_bus_notifier,
-};
-
-static int __init tce_iommu_init(void)
-{
-   struct pci_dev *pdev = NULL;
-
-   BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE);
-
-   for_each_pci_dev(pdev)
-   iommu_add_device(&pdev->dev);
-
-   bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
-   return 0;
-}
-
-subsys_initcall_sync(tce_iommu_init);
+EXPORT_SYMBOL_GPL(iommu_del_device);
 
 #else
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index d8140b1..756bb58 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -440,7 +440,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, 
struct pci_dev *pdev
return;
 
pe = &phb->ioda.pe_array[pdn->pe_number];
-   set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+   set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
 }
 
 static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
@@ -448,7 +448,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, 
struct pci_bus *bus)
struct pci_dev *dev;
 
list_for_each_entry(dev, &bus->devices, bus_list)

[PATCH v9 00/13] KVM: PPC: IOMMU in-kernel handling of VFIO

2013-08-28 Thread Alexey Kardashevskiy
This accelerates VFIO DMA operations on POWER by moving them
into the kernel.

This depends on VFIO external API patch which is on its way to upstream.

Changes:
v9:
* replaced the "link logical bus number to IOMMU group" ioctl to KVM
with a KVM device doing the same thing, i.e. the actual changes are in
these 3 patches:
  KVM: PPC: reserve a capability and KVM device type for realmode VFIO
  KVM: PPC: remove warning from kvmppc_core_destroy_vm
  KVM: PPC: Add support for IOMMU in-kernel handling

* moved some VFIO external API bits to a separate patch to reduce the size
of the "KVM: PPC: Add support for IOMMU in-kernel handling" patch

* fixed code style problems reported by checkpatch.pl.

v8:
* fixed comments about capabilities numbers

v7:
* rebased on v3.11-rc3.
* VFIO external user API will go through VFIO tree so it is
excluded from this series.
* As nobody ever reacted to "hashtable: add
hash_for_each_possible_rcu_notrace()",
Ben suggested pushing it via his tree, so I included it in the series.
* realmode_(get|put)_page is reworked.

More details in the individual patch comments.

Alexey Kardashevskiy (13):
  KVM: PPC: POWERNV: move iommu_add_device earlier
  hashtable: add hash_for_each_possible_rcu_notrace()
  KVM: PPC: reserve a capability number for multitce support
  KVM: PPC: reserve a capability and KVM device type for realmode VFIO
  powerpc: Prepare to support kernel handling of IOMMU map/unmap
  powerpc: add real mode support for dma operations on powernv
  KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
  KVM: PPC: Add support for multiple-TCE hcalls
  powerpc/iommu: rework to support realmode
  KVM: PPC: remove warning from kvmppc_core_destroy_vm
  KVM: PPC: add trampolines for VFIO external API
  KVM: PPC: Add support for IOMMU in-kernel handling
  KVM: PPC: Add hugepage support for IOMMU in-kernel handling

 Documentation/virtual/kvm/api.txt  |  26 +
 .../virtual/kvm/devices/spapr_tce_iommu.txt|  37 ++
 arch/powerpc/include/asm/iommu.h   |  18 +-
 arch/powerpc/include/asm/kvm_host.h|  38 ++
 arch/powerpc/include/asm/kvm_ppc.h |  16 +-
 arch/powerpc/include/asm/machdep.h |  12 +
 arch/powerpc/include/asm/pgtable-ppc64.h   |   2 +
 arch/powerpc/include/uapi/asm/kvm.h|   8 +
 arch/powerpc/kernel/iommu.c| 243 +
 arch/powerpc/kvm/Kconfig   |   1 +
 arch/powerpc/kvm/book3s_64_vio.c   | 597 -
 arch/powerpc/kvm/book3s_64_vio_hv.c| 408 +-
 arch/powerpc/kvm/book3s_hv.c   |  42 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S|   8 +-
 arch/powerpc/kvm/book3s_pr_papr.c  |  35 ++
 arch/powerpc/kvm/powerpc.c |   4 +
 arch/powerpc/mm/init_64.c  |  50 +-
 arch/powerpc/platforms/powernv/pci-ioda.c  |  57 +-
 arch/powerpc/platforms/powernv/pci-p5ioc2.c|   2 +-
 arch/powerpc/platforms/powernv/pci.c   |  75 ++-
 arch/powerpc/platforms/powernv/pci.h   |   3 +-
 arch/powerpc/platforms/pseries/iommu.c |   8 +-
 include/linux/hashtable.h  |  15 +
 include/linux/kvm_host.h   |   1 +
 include/linux/mm.h |  14 +
 include/linux/page-flags.h |   4 +-
 include/uapi/linux/kvm.h   |   3 +
 virt/kvm/kvm_main.c|   5 +
 28 files changed, 1564 insertions(+), 168 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

-- 
1.8.4.rc4



[PATCH v9 02/13] hashtable: add hash_for_each_possible_rcu_notrace()

2013-08-28 Thread Alexey Kardashevskiy
This adds hash_for_each_possible_rcu_notrace(), which is basically
a notrace clone of hash_for_each_possible_rcu(); the latter cannot be
used in real mode due to its tracing/debugging capability.

Signed-off-by: Alexey Kardashevskiy 

---
Changes:
v8:
* fixed warnings from checkpatch.pl
---
 include/linux/hashtable.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index a9df51f..519b6e2 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node)
member)
 
 /**
+ * hash_for_each_possible_rcu_notrace - iterate over all possible objects
+ * hashing to the same bucket in an rcu enabled hashtable
+ * @name: hashtable to iterate
+ * @obj: the type * to use as a loop cursor for each entry
+ * @member: the name of the hlist_node within the struct
+ * @key: the key of the objects to iterate over
+ *
+ * This is the same as hash_for_each_possible_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hash_for_each_possible_rcu_notrace(name, obj, member, key) \
+   hlist_for_each_entry_rcu_notrace(obj, \
+   &name[hash_min(key, HASH_BITS(name))], member)
+
+/**
  * hash_for_each_possible_safe - iterate over all possible objects hashing to 
the
  * same bucket safe against removals
  * @name: hashtable to iterate
-- 
1.8.4.rc4



[PATCH v9 03/13] KVM: PPC: reserve a capability number for multitce support

2013-08-28 Thread Alexey Kardashevskiy
This reserves a capability number for upcoming support
of the H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls,
which support multiple DMA map/unmap operations per call.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
2013/07/16:
* changed the number
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index acccd08..99c2533 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_ARM_EL1_32BIT 93
+#define KVM_CAP_SPAPR_MULTITCE 94
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
1.8.4.rc4



[PATCH v9 05/13] powerpc: Prepare to support kernel handling of IOMMU map/unmap

2013-08-28 Thread Alexey Kardashevskiy
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU sends requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).

To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.

This adds an API function realmode_pfn_to_page() to get page struct when
MMU is off.

This adds to MM a new function put_page_unless_one() which drops a page
reference if the counter is bigger than one. It is going to be used when MMU is off
(for example, real mode on PPC64) and we want to make sure that page
release will not happen in real mode as it may crash the kernel in
a horrible way.
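The include/linux/mm.h hunk is not shown below; by analogy with
get_page_unless_zero(), the helper is presumably a one-line wrapper like
this sketch (the _count field name matches struct page of that era):

/* Sketch: drop a reference only when it is not the last one, so the
 * final release path (which may touch virtual addresses) never runs
 * in real mode. Returns 0 when the caller must defer the put_page()
 * to virtual mode. */
static inline int put_page_unless_one(struct page *page)
{
        return atomic_add_unless(&page->_count, -1, 1);
}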

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Cc: linux...@kvack.org
Cc: Benjamin Herrenschmidt 
Cc: Andrew Morton 
Reviewed-by: Paul Mackerras 
Signed-off-by: Paul Mackerras 
Signed-off-by: Alexey Kardashevskiy 

---

Changes:
2013/07/25 (v7):
* removed realmode_put_page and added put_page_unless_one() instead.
The name has been chosen to conform to the already existing
get_page_unless_zero().
* removed realmode_get_page. Instead, get_page_unless_zero() should be used

2013/07/10:
* adjusted comment (removed sentence about virtual mode)
* get_page_unless_zero replaced with atomic_inc_not_zero to minimize
effect of a possible get_page_unless_zero() rework (if it ever happens).

2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.

2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
 arch/powerpc/include/asm/pgtable-ppc64.h |  2 ++
 arch/powerpc/mm/init_64.c| 50 +++-
 include/linux/mm.h   | 14 +
 include/linux/page-flags.h   |  4 ++-
 4 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index 46db094..4a191c4 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -394,6 +394,8 @@ static inline void mark_hpte_slot_valid(unsigned char 
*hpte_slot_array,
hpte_slot_array[index] = hidx << 4 | 0x1 << 3;
 }
 
+struct page *realmode_pfn_to_page(unsigned long pfn);
+
 static inline char *get_hpte_slot_array(pmd_t *pmdp)
 {
/*
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index d0cd9e4..8cf345a 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -300,5 +300,53 @@ void vmemmap_free(unsigned long start, unsigned long end)
 {
 }
 
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fall back to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * realmode_pfn_to_page functions can fail due to:
+ * 1) As real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct vmemmap_backing *vmem_back;
+   struct page *page;
+   unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+   unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
 
+   for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+   if (pg_va < vmem_back->virt_addr)
+   continue;
+
+   /* Check that page struct is not split between real pages */
+   if ((pg_va + sizeof(struct page)) >
+   (vmem_back->virt_addr + page_size))
+   return NULL;
+
+   page = (struct page *) (vmem_back->phys + pg_va -
+   vmem_back->virt_addr);
+   return page;
+   }
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+   struct page *page = pfn_to_page(pfn);
+   return page;
+}
+EXPORT_SYMBO

[PATCH v9 08/13] KVM: PPC: Add support for multiple-TCE hcalls

2013-08-28 Thread Alexey Kardashevskiy
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO
devices or emulated PCI.  These calls allow adding multiple entries
(up to 512) into the TCE table in one call, which saves time on
transitions to/from real mode.

This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in virtual
mode in case the real mode handler fails for some reason.

This adds a function to convert a guest physical address to a host
virtual address in order to parse a TCE list from H_PUT_TCE_INDIRECT.
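The conversion itself is standard memslot arithmetic; a sketch of the
virtual mode variant (the function name and the ERROR_ADDR convention
follow this file, but the exact body is an assumption):

/* Sketch: translate a guest physical address to the host virtual
 * address backing it, via the memslot covering the gfn. */
static void *kvmppc_gpa_to_hva(struct kvm *kvm, unsigned long gpa)
{
        unsigned long gfn = gpa >> PAGE_SHIFT;
        struct kvm_memory_slot *memslot;

        memslot = search_memslots(kvm_memslots(kvm), gfn);
        if (!memslot)
                return ERROR_ADDR;      /* not covered by any slot */

        return (void *)(__gfn_to_hva_memslot(memslot, gfn) +
                        (gpa & (PAGE_SIZE - 1)));
}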

This also implements the KVM_CAP_PPC_MULTITCE capability. When present,
the hypercalls mentioned above may or may not be processed successfully
in the kernel based fast path. If they can not be handled by the kernel,
they will get passed on to user space. So user space still has to have
an implementation for these despite the in kernel acceleration.

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexey Kardashevskiy 

---
Changelog:
v8:
* fixed warnings from checkpatch.pl

2013/08/01 (v7):
* realmode_get_page/realmode_put_page use was replaced with
get_page_unless_zero/put_page_unless_one

2013/07/11:
* addressed many, many comments from maintainers

2013/07/06:
* fixed a number of wrong get_page()/put_page() calls

2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc

2013/06/05:
* fixed a typo about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with failthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
 Documentation/virtual/kvm/api.txt   |  26 +++
 arch/powerpc/include/asm/kvm_host.h |   9 ++
 arch/powerpc/include/asm/kvm_ppc.h  |  16 +-
 arch/powerpc/kvm/book3s_64_vio.c| 132 +++-
 arch/powerpc/kvm/book3s_64_vio_hv.c | 270 
 arch/powerpc/kvm/book3s_hv.c|  41 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   8 +-
 arch/powerpc/kvm/book3s_pr_papr.c   |  35 +
 arch/powerpc/kvm/powerpc.c  |   3 +
 9 files changed, 506 insertions(+), 34 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index ef925ea..1c8942a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2382,6 +2382,32 @@ calls by the guest for that service will be passed to 
userspace to be
 handled.
 
 
+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space. This significantly accelerates DMA operations for PPC KVM guests.
+User space should expect that its handlers for these hypercalls
+are not going to be called if user space previously registered LIOBN
+in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).
+
+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+The hypercalls mentioned above may or may not be processed successfully
+in the kernel based fast path. If they can not be handled by the kernel,
+they will get passed on to user space. So user space still has to have
+an implementation for these despite the in kernel acceleration.
+
+This capability is always enabled.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index af326cd..a23f132 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -609,6 +610,14 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+   unsigned long *tce_tmp_hpas;/* TCE cache for TCE_PUT_INDIRECT */
+   enum {
+   TCERM_NONE,
+   TCERM_GETPAGE,
+   TCERM_PUTTCE,
+   TCERM_PUTLIST,
+   } tce_rm_fail;  /* failed stage of request pr
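The api.txt hunk above notes the capability is always enabled; user
space still probes it with the standard KVM_CHECK_EXTENSION ioctl on the
/dev/kvm fd. A minimal sketch:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: returns non-zero when the kernel handles multi-TCE
 * hypercalls itself. */
static int have_multitce(void)
{
        int ret, kvm_fd = open("/dev/kvm", O_RDWR);

        if (kvm_fd < 0)
                return 0;
        ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_MULTITCE) > 0;
        close(kvm_fd);
        return ret;
}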

[PATCH v9 06/13] powerpc: add real mode support for dma operations on powernv

2013-08-28 Thread Alexey Kardashevskiy
The existing TCE machine calls (tce_build and tce_free) only support
virtual mode as they call __raw_writeq for TCE invalidation, which
fails in real mode.

This introduces tce_build_rm and tce_free_rm real mode versions
which do mostly the same but use "Store Doubleword Caching Inhibited
Indexed" instruction for TCE invalidation.

This new feature is going to be utilized by real mode support of VFIO.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v8:
* fixed checkpatch.pl warnings

2013/11/07:
* added comment why stdcix cannot be used in virtual mode

2013/08/07:
* tested on p7ioc and fixed a bug with realmode addresses
---
 arch/powerpc/include/asm/machdep.h| 12 
 arch/powerpc/platforms/powernv/pci-ioda.c | 49 +++
 arch/powerpc/platforms/powernv/pci.c  | 42 ++
 arch/powerpc/platforms/powernv/pci.h  |  3 +-
 4 files changed, 87 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index 8b48090..07dd3b1 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -78,6 +78,18 @@ struct machdep_calls {
long index);
void(*tce_flush)(struct iommu_table *tbl);
 
+   /* _rm versions are for real mode use only */
+   int (*tce_build_rm)(struct iommu_table *tbl,
+long index,
+long npages,
+unsigned long uaddr,
+enum dma_data_direction direction,
+struct dma_attrs *attrs);
+   void(*tce_free_rm)(struct iommu_table *tbl,
+   long index,
+   long npages);
+   void(*tce_flush_rm)(struct iommu_table *tbl);
+
void __iomem *  (*ioremap)(phys_addr_t addr, unsigned long size,
   unsigned long flags, void *caller);
void(*iounmap)(volatile void __iomem *token);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 756bb58..8cba234 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -70,6 +70,16 @@ define_pe_printk_level(pe_err, KERN_ERR);
 define_pe_printk_level(pe_warn, KERN_WARNING);
 define_pe_printk_level(pe_info, KERN_INFO);
 
+/*
+ * stdcix is only supposed to be used in hypervisor real mode as per
+ * the architecture spec
+ */
+static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr)
+{
+   __asm__ __volatile__("stdcix %0,0,%1"
+   : : "r" (val), "r" (paddr) : "memory");
+}
+
 static int pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
unsigned long pe;
@@ -454,10 +464,13 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe 
*pe, struct pci_bus *bus)
}
 }
 
-static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
-u64 *startp, u64 *endp)
+static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe,
+struct iommu_table *tbl,
+u64 *startp, u64 *endp, bool rm)
 {
-   u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+   u64 __iomem *invalidate = rm ?
+   (u64 __iomem *)pe->tce_inval_reg_phys :
+   (u64 __iomem *)tbl->it_index;
unsigned long start, end, inc;
 
start = __pa(startp);
@@ -484,7 +497,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct 
iommu_table *tbl,
 
 mb(); /* Ensure above stores are visible */
 while (start <= end) {
-__raw_writeq(start, invalidate);
+   if (rm)
+   __raw_rm_writeq(start, invalidate);
+   else
+   __raw_writeq(start, invalidate);
 start += inc;
 }
 
@@ -496,10 +512,12 @@ static void pnv_pci_ioda1_tce_invalidate(struct 
iommu_table *tbl,
 
 static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe,
 struct iommu_table *tbl,
-u64 *startp, u64 *endp)
+u64 *startp, u64 *endp, bool rm)
 {
unsigned long start, end, inc;
-   u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index;
+   u64 __iomem *invalidate = rm ?
+   (u64 __iomem *)pe->tce_inval_reg_phys :
+   (u64 __iomem *)tbl->it_index;
 
/* We'll invalidate DMA address in PE scope */
start = 0x2ul << 60;
@@ -515,22 +533,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
pnv_ioda_pe *pe,
mb();
 
while (start <= end) {
-   __raw_writeq(start, invalidate);
+   if (rm)
+

[PATCH v9 07/13] KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently

2013-08-28 Thread Alexey Kardashevskiy
It does not make much sense to have KVM on book3s 64-bit and
not to have the IOMMU bits for PCI passthrough support, as it costs little
and allows VFIO to function on book3s KVM.

Having IOMMU_API always enabled makes it unnecessary to have a lot of
"#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those
ifdef's we could have only user space emulated devices accelerated
(but not VFIO) which do not seem to be very useful.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index c55c538..3b2b761 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -59,6 +59,7 @@ config KVM_BOOK3S_64
depends on PPC_BOOK3S_64
select KVM_BOOK3S_64_HANDLER
select KVM
+   select SPAPR_TCE_IOMMU
---help---
  Support running unmodified book3s_64 and book3s_32 guest kernels
  in virtual machines on book3s_64 host processors.
-- 
1.8.4.rc4



[PATCH v9 09/13] powerpc/iommu: rework to support realmode

2013-08-28 Thread Alexey Kardashevskiy
The TCE tables handling may differ for real and virtual modes so
additional ppc_md.tce_build_rm/ppc_md.tce_free_rm/ppc_md.tce_flush_rm
handlers were introduced earlier.

So this adds the following:
1. support for the new ppc_md calls;
2. ability for iommu_tce_build to process multiple entries per
call;
3. arch_spin_lock to protect TCE table from races in both real and virtual
modes (see the sketch after this list);
4. proper TCE table protection from races with the existing IOMMU code
in iommu_take_ownership/iommu_release_ownership;
5. hwaddr variable renamed to hpa as it better describes what it
actually represents;
6. iommu_tce_direction is static now as it is not called from anywhere else.

This will be used by upcoming real mode support of VFIO on POWER.
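Item 3 amounts to taking a raw arch spinlock around every TCE table
update, since the lockdep/tracing-instrumented spin_lock() is not safe
with the MMU off. A sketch of the pattern, using the it_rm_lock field
added below (the surrounding loop and variable names are assumptions):

        /* Sketch: update npages TCE entries under the raw lock; works
         * in both real and virtual mode because arch_spin_lock() never
         * calls out to tracing or lockdep code. */
        arch_spin_lock(&tbl->it_rm_lock);

        for (i = 0; i < npages; ++i) {
                if (rm)
                        ret = ppc_md.tce_build_rm(tbl, entry + i, 1,
                                        hpas[i], direction, NULL);
                else
                        ret = ppc_md.tce_build(tbl, entry + i, 1,
                                        hpas[i], direction, NULL);
                if (ret)
                        break;
        }

        arch_spin_unlock(&tbl->it_rm_lock);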

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v8:
* fixed warnings from checkpatch.pl
---
 arch/powerpc/include/asm/iommu.h |   9 +-
 arch/powerpc/kernel/iommu.c  | 198 ++-
 2 files changed, 136 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 19ad77f..71ee525 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -78,6 +78,7 @@ struct iommu_table {
unsigned long *it_map;   /* A simple allocation bitmap for now */
 #ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
+   arch_spinlock_t it_rm_lock;
 #endif
 };
 
@@ -161,9 +162,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table 
*tbl,
 extern int iommu_tce_put_param_check(struct iommu_table *tbl,
unsigned long ioba, unsigned long tce);
 extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
-   unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-   unsigned long entry);
+   unsigned long *hpas, unsigned long npages, bool rm);
+extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+   unsigned long npages, bool rm);
 extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
 extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
@@ -173,7 +174,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
 extern void iommu_release_ownership(struct iommu_table *tbl);
 
-extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 15f8ca8..ff0cd90 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl,
kfree(name);
 }
 
-enum dma_data_direction iommu_tce_direction(unsigned long tce)
+static enum dma_data_direction iommu_tce_direction(unsigned long tce)
 {
if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
return DMA_BIDIRECTIONAL;
@@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long 
tce)
else
return DMA_NONE;
 }
-EXPORT_SYMBOL_GPL(iommu_tce_direction);
 
 void iommu_flush_tce(struct iommu_table *tbl)
 {
@@ -972,73 +971,117 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
 
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
-   unsigned long oldtce;
-   struct iommu_pool *pool = get_pool(tbl, entry);
-
-   spin_lock(&(pool->lock));
-
-   oldtce = ppc_md.tce_get(tbl, entry);
-   if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-   ppc_md.tce_free(tbl, entry, 1);
-   else
-   oldtce = 0;
-
-   spin_unlock(&(pool->lock));
-
-   return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
 int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
 {
-   unsigned long oldtce;
-   struct page *page;
-
-   for ( ; pages; --pages, ++entry) {
-   oldtce = iommu_clear_tce(tbl, entry);
-   if (!oldtce)
-   continue;
-
-   page = pfn_to_page(oldtce >> PAGE_SHIFT);
-   WARN_ON(!page);
-   if (page) {
-   if (oldtce & TCE_PCI_WRITE)
-   SetPageDirty(page);
-   put_page(page);
-   }
-   }
-
-   return 0;
+   return iommu_free_tces(tbl, entry, pages, false);
 }
 EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
 
-/*
- * hwaddr is a kernel virtual address here (0xc... bazillion),
- * tce_build converts it to a physical address.
- */
+int iommu_free_tces(struct iommu_table *tbl, unsigned long entry,
+   unsigned long npages, bool rm)
+{
+   int i, ret = 0, to_free = 0;
+
+   if (

Re: [PATCH 08/12] KVM: MMU: introduce nulls desc

2013-08-28 Thread Gleb Natapov
On Tue, Jul 30, 2013 at 09:02:06PM +0800, Xiao Guangrong wrote:
> It is like a nulls list and we use the pte-list as the nulls value, which can
> help us detect whether the "desc" has been moved to another rmap; then we can
> re-walk the rmap if that happened
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c | 35 ---
>  1 file changed, 28 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 3013bb1..36caf6a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -918,6 +918,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
> large_gfn)
>   return level - 1;
>  }
>  
> +static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc 
> *desc)
> +{
> + unsigned long marker;
> +
> + marker = (unsigned long)pte_list | 1UL;
> + desc->more = (struct pte_list_desc *)marker;
> +}
> +
> +static bool desc_is_a_nulls(struct pte_list_desc *desc)
> +{
> + return (unsigned long)desc & 1;
> +}
> +
> +static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc)
> +{
> + return (unsigned long *)((unsigned long)desc & ~1);
> +}
> +
>  static int __find_first_free(struct pte_list_desc *desc)
>  {
>   int i;
> @@ -956,7 +974,7 @@ static int count_spte_number(struct pte_list_desc *desc)
>  
>   first_free = __find_first_free(desc);
>  
> - for (desc_num = 0; desc->more; desc = desc->more)
> + for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more)
>   desc_num++;
>  
>   return first_free + desc_num * PTE_LIST_EXT;
> @@ -991,6 +1009,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>   desc = mmu_alloc_pte_list_desc(vcpu);
>   desc->sptes[0] = (u64 *)*pte_list;
>   desc->sptes[1] = spte;
> + desc_mark_nulls(pte_list, desc);
>   *pte_list = (unsigned long)desc | 1;
>   return 1;
>   }
> @@ -1036,7 +1055,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
>   /*
>* Only one entry existing but still use a desc to store it?
>*/
> - WARN_ON(!next_desc);
> + WARN_ON(desc_is_a_nulls(next_desc));
>  
>   mmu_free_pte_list_desc(first_desc);
>   first_desc = next_desc;
> @@ -1050,7 +1069,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list,
>* Only one entry in this desc, move the entry to the head
>* then the desc can be freed.
>*/
> - if (!first_desc->sptes[1] && !first_desc->more) {
> + if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) {
>   *pte_list = (unsigned long)first_desc->sptes[0];
>   mmu_free_pte_list_desc(first_desc);
>   }
> @@ -1077,7 +1096,7 @@ static void pte_list_remove(u64 *spte, unsigned long 
> *pte_list)
>  
>   rmap_printk("pte_list_remove:  %p many->many\n", spte);
>   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> - while (desc) {
> + while (!desc_is_a_nulls(desc)) {
>   for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
>   if (desc->sptes[i] == spte) {
>   pte_list_desc_remove_entry(pte_list,
> @@ -1103,11 +1122,13 @@ static void pte_list_walk(unsigned long *pte_list, 
> pte_list_walk_fn fn)
>   return fn((u64 *)*pte_list);
>  
>   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> - while (desc) {
> + while (!desc_is_a_nulls(desc)) {
>   for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
>   fn(desc->sptes[i]);
>   desc = desc->more;
>   }
> +
> + WARN_ON(desc_get_nulls_value(desc) != pte_list);
>  }
>  
>  static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
> @@ -1200,7 +1221,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct 
> rmap_iterator *iter)
>   */
>  static u64 *rmap_get_next(struct rmap_iterator *iter)
>  {
> - if (iter->desc) {
> + if (iter->desc && !desc_is_a_nulls(iter->desc)) {
>   if (iter->pos < PTE_LIST_EXT - 1) {
>   u64 *sptep;
>  
> @@ -1212,7 +1233,7 @@ static u64 *rmap_get_next(struct rmap_iterator *iter)
>  
>   iter->desc = iter->desc->more;
>  
I'd rather do:
iter->desc = desc_is_a_nulls(iter->desc) ? NULL : iter->desc;
here and drop two desc_is_a_nulls() checks in this function.

> - if (iter->desc) {
> + if (!desc_is_a_nulls(iter->desc)) {
>   iter->pos = 0;
>   /* desc->sptes[0] cannot be NULL */
>   return iter->desc->sptes[iter->pos];
> -- 
> 1.8.1.4

--
Gleb.
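Putting the pieces together, a lockless walker pairs two checks: a NULL
'more' pointer means the desc is still being initialized, and a nulls
value other than our own pte_list means the desc migrated to another
rmap; both cases restart the walk. A sketch assembled from the fragments
quoted in this thread (the signature, and the assumption that fn()
skips non-present entries, follow the discussion):

static void pte_list_walk_lockless(unsigned long *pte_list,
                                   pte_list_walk_fn fn)
{
        struct pte_list_desc *desc;
        unsigned long pte_list_value;
        int i;

restart:
        pte_list_value = ACCESS_ONCE(*pte_list);
        if (!pte_list_value)
                return;

        if (!(pte_list_value & 1)) {
                fn((u64 *)pte_list_value);
                return;
        }

        desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
        while (!desc_is_a_nulls(desc)) {
                /* top-down scan; see the PATCH 07/12 discussion */
                for (i = PTE_LIST_EXT - 1; i >= 0; i--)
                        fn(desc->sptes[i]);

                desc = ACCESS_ONCE(desc->more);
                if (unlikely(!desc))    /* still being initialized */
                        goto restart;
        }

        /* reached a nulls that is not ours: the desc moved rmaps */
        if (desc_get_nulls_value(desc) != pte_list)
                goto restart;
}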


[PATCH v9 04/13] KVM: PPC: reserve a capability and KVM device type for realmode VFIO

2013-08-28 Thread Alexey Kardashevskiy
This reserves a capability number for upcoming support
of VFIO-IOMMU DMA operations in real mode.

This reserves a number for a new "SPAPR TCE IOMMU" KVM device
which is going to manage lifetime of SPAPR TCE IOMMU object.

This defines an attribute of the "SPAPR TCE IOMMU" KVM device
which is going to be used for initialization.

Signed-off-by: Alexey Kardashevskiy 

---
Changes:
v9:
* KVM ioctl is replaced with "SPAPR TCE IOMMU" KVM device type with
KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute

2013/08/15:
* fixed a typo in comments
* fixed commit message which says what uses ioctls 0xad and 0xae

2013/07/16:
* changed the number

2013/07/11:
* changed order in a file, added comment about a gap in ioctl number
---
 arch/powerpc/include/uapi/asm/kvm.h | 8 
 include/uapi/linux/kvm.h| 2 ++
 2 files changed, 10 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..c1ae1e5 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -511,4 +511,12 @@ struct kvm_get_htab_header {
 #define  KVM_XICS_MASKED   (1ULL << 41)
 #define  KVM_XICS_PENDING  (1ULL << 42)
 
+/* SPAPR TCE IOMMU device specification */
+struct kvm_create_spapr_tce_iommu_linkage {
+   __u64 liobn;
+   __u32 fd;
+   __u32 flags;
+};
+#define KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE   0
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 99c2533..9d20630 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_ARM_EL1_32BIT 93
 #define KVM_CAP_SPAPR_MULTITCE 94
+#define KVM_CAP_SPAPR_TCE_IOMMU 95
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -843,6 +844,7 @@ struct kvm_device_attr {
 #define KVM_DEV_TYPE_FSL_MPIC_20   1
 #define KVM_DEV_TYPE_FSL_MPIC_42   2
 #define KVM_DEV_TYPE_XICS  3
+#define KVM_DEV_TYPE_SPAPR_TCE_IOMMU   4
 
 /*
  * ioctls for VM fds
-- 
1.8.4.rc4



Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 04:12 PM, Gleb Natapov wrote:

>> +
>> +rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
>> +desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>> +
>> +/* No empty position in the desc. */
>> +if (desc->sptes[PTE_LIST_EXT - 1]) {
>> +struct pte_list_desc *new_desc;
>> +new_desc = mmu_alloc_pte_list_desc(vcpu);
>> +new_desc->more = desc;
>> +desc = new_desc;
>> +*pte_list = (unsigned long)desc | 1;
>>  }
>> -return count;
>> +
>> +free_pos = find_first_free(desc);
>> +desc->sptes[free_pos] = spte;
>> +return count_spte_number(desc);
> Should it be count_spte_number(desc) - 1? The function should returns
> the number of pte entries before the spte was added.

Yes. We have handled it in count_spte_number(); we count the number like this:

return first_free + desc_num * PTE_LIST_EXT;

The first_free is indexed from 0.

Maybe it is clearer to let count_spte_number() return the real number.

> 
>>  }
>>  
>>  static void
>> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc 
>> *desc,
>> -   int i, struct pte_list_desc *prev_desc)
>> +pte_list_desc_remove_entry(unsigned long *pte_list,
>> +   struct pte_list_desc *desc, int i)
>>  {
>> -int j;
>> +struct pte_list_desc *first_desc;
>> +int last_used;
>> +
>> +first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
>> +last_used = find_last_used(first_desc);
>>  
>> -for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
>> -;
>> -desc->sptes[i] = desc->sptes[j];
>> -desc->sptes[j] = NULL;
>> -if (j != 0)
>> +/*
>> + * Move the entry from the first desc to this position we want
>> + * to remove.
>> + */
>> +desc->sptes[i] = first_desc->sptes[last_used];
>> +first_desc->sptes[last_used] = NULL;
>> +
> What if desc == first_desc and i < last_used. You still move spte
> backwards so lockless walk may have already examined entry at i and
> will miss spte that was moved there from last_used position, no?

Right. I noticed it too and fixed in the v2 which is being tested.
I fixed it by bottom-up walk desc, like this:

pte_list_walk_lockless():

desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
while (!desc_is_a_nulls(desc)) {
/*
 * We should do bottom-up walk since we always use the
 * bottom entry to replace the deleted entry if only
 * one desc is used in the rmap when a spte is removed.
 * Otherwise the moved entry will be missed.
 */
for (i = PTE_LIST_EXT - 1; i >= 0; i--)
fn(desc->sptes[i]);

desc = ACCESS_ONCE(desc->more);

/* It is being initialized. */
if (unlikely(!desc))
goto restart;
}

How about this?
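
To make the ordering argument concrete, a sketch of the miss scenario
(assuming PTE_LIST_EXT == 3; the entries A/B/C are illustrative only):

    /*
     * A top-down walk can miss a moved entry:
     *
     *   desc: [A][B][C]    walker has read slots 0 and 1, next is slot 2
     *   A is removed  ->   C is moved into slot 0: [C][B][NULL]
     *   walker reads slot 2, finds NULL, and C is never visited.
     *
     * Walking bottom-up (2, 1, 0), the walker either saw C in slot 2
     * before the move or will still find it in slot 0 afterwards.
     */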

> 
>> +/* No valid entry in this desc, we can free this desc now. */
>> +if (!first_desc->sptes[0]) {
>> +struct pte_list_desc *next_desc = first_desc->more;
>> +
>> +/*
>> + * Only one entry existing but still use a desc to store it?
>> + */
>> +WARN_ON(!next_desc);
>> +
>> +mmu_free_pte_list_desc(first_desc);
>> +first_desc = next_desc;
>> +*pte_list = (unsigned long)first_desc | 1ul;
>>  return;
>> -if (!prev_desc && !desc->more)
>> -*pte_list = (unsigned long)desc->sptes[0];
>> -else
>> -if (prev_desc)
>> -prev_desc->more = desc->more;
>> -else
>> -*pte_list = (unsigned long)desc->more | 1;
>> -mmu_free_pte_list_desc(desc);
>> +}
>> +
>> +WARN_ON(!first_desc->sptes[0]);
>> +
>> +/*
>> + * Only one entry in this desc, move the entry to the head
>> + * then the desc can be freed.
>> + */
>> +if (!first_desc->sptes[1] && !first_desc->more) {
>> +*pte_list = (unsigned long)first_desc->sptes[0];
>> +mmu_free_pte_list_desc(first_desc);
>> +}
>>  }
>>  
>>  static void pte_list_remove(u64 *spte, unsigned long *pte_list)
>>  {
>>  struct pte_list_desc *desc;
>> -struct pte_list_desc *prev_desc;
>>  int i;
>>  
>>  if (!*pte_list) {
>> -printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
>> -BUG();
>> -} else if (!(*pte_list & 1)) {
>> +WARN(1, KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
> Why change BUG() to WARN() here and below?

WARN(1, "xxx") can replace two lines in the origin code. And personally,
i prefer WARN() to BUG() since sometimes BUG() can stop my box and i need to
get the full log by using kdump.

If you object it, i will change it back in the next version. :)
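
For reference, the two patterns side by side (the WARN() line is exactly
what the hunk above uses):

    /* before: two lines, and BUG() kills the box */
    printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
    BUG();

    /* after: one line that logs a full backtrace and lets the box live */
    WARN(1, KERN_ERR "pte_list_remove: %p 0->BUG\n", spte);
    return;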

> 
>> +return;
>> +}
>> +
>> +if (!(*pte_list & 1)) {

Re: [PATCH] KVM: x86: should release vcpu when closing vcpu fd

2013-08-28 Thread Gleb Natapov
On Wed, Aug 28, 2013 at 04:08:10PM +0800, Chen Fan wrote:
> If we want to remove a vcpu from a vm, we should not only put kvm; we
> also need to decrease online_vcpus and free the struct kvm_vcpu when
> closing a vcpu file descriptor.
> 
It is much more complicated than what you do here. kvm->vcpus[] is accessed
without locking and kvm->online_vcpus is assumed to be greater than the
maximum occupied index in kvm->vcpus[]. Attempts were made before to make it
possible to destroy individual vcpus separately from destroying the VM, but
they have been unsuccessful thus far.
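
For illustration, the lockless reader pattern in question is roughly the
kvm_for_each_vcpu() definition from include/linux/kvm_host.h:

    /*
     * Readers iterate the vcpu array with no lock held; they rely on
     * online_vcpus being an upper bound on the occupied slots and on
     * slots below it never becoming NULL underneath them.
     */
    #define kvm_for_each_vcpu(idx, vcpup, kvm) \
            for (idx = 0; \
                 idx < atomic_read(&kvm->online_vcpus) && \
                 (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
                 idx++)

Decrementing online_vcpus and clearing a slot, as the patch does, can race
with a reader that has already sampled the old count.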

> Signed-off-by: Chen Fan 
> ---
>  arch/x86/kvm/x86.c   | 14 ++
>  include/linux/kvm_host.h |  1 +
>  virt/kvm/kvm_main.c  |  4 +++-
>  3 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d21bce5..3370de1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6904,6 +6904,20 @@ static void kvm_free_vcpus(struct kvm *kvm)
>   mutex_unlock(&kvm->lock);
>  }
>  
> +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu)
> +{
> + struct kvm *kvm = vcpu->kvm;
> +
> + mutex_lock(&kvm->lock);
> + atomic_dec(&kvm->online_vcpus);
> + kvm->vcpus[atomic_read(&kvm->online_vcpus)] = NULL;
> + mutex_unlock(&kvm->lock);
> +
> + kvm_clear_async_pf_completion_queue(vcpu);
> + kvm_unload_vcpu_mmu(vcpu);
> + kvm_arch_vcpu_free(vcpu);
> +}
> +
>  void kvm_arch_sync_events(struct kvm *kvm)
>  {
>   kvm_free_all_assigned_devices(kvm);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a63d83e..e0811f97 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -631,6 +631,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
>  int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
>  int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
>  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
> +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu);
>  
>  int kvm_arch_hardware_enable(void *garbage);
>  void kvm_arch_hardware_disable(void *garbage);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1580dd4..442014b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1873,8 +1873,10 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
>  {
>   struct kvm_vcpu *vcpu = filp->private_data;
> + struct kvm *kvm = vcpu->kvm;
>  
> - kvm_put_kvm(vcpu->kvm);
> + kvm_arch_vcpu_release(vcpu);
> + kvm_put_kvm(kvm);
>   return 0;
>  }
>  
> -- 
> 1.8.1.4
> 

--
Gleb.


Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list

2013-08-28 Thread Gleb Natapov
On Tue, Jul 30, 2013 at 09:02:05PM +0800, Xiao Guangrong wrote:
> Change the algorithm to:
> 1) always add the new desc as the first desc (pointed to by parent_ptes/rmap),
>    which is good for implementing rcu-nulls-list-like lockless rmap
>    walking
> 
> 2) always move the entry in the first desc to the position we want
>    to remove when removing a spte from the parent_ptes/rmap (backward-move).
>    It is good for us to implement lockless rmap walking since in the current
>    code, when a spte is deleted from the "desc", another spte in the last
>    "desc" will be moved to this position to replace the deleted one. If the
>    deleted one has been accessed and we do not access the replaced one, the
>    replaced one is missed when we do the lockless walk.
>    To fix this case, we do not backward-move the spte; instead, we
>    forward-move the entry: when a spte is deleted, we move the entry in the
>    first desc to that position.
> 
> Both of these can also reduce cache misses
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c | 182 ++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 125 insertions(+), 57 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 5a40564..3013bb1 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -918,6 +918,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
>   return level - 1;
>  }
>  
> +static int __find_first_free(struct pte_list_desc *desc)
> +{
> + int i;
> +
> + for (i = 0; i < PTE_LIST_EXT; i++)
> + if (!desc->sptes[i])
> + break;
> + return i;
> +}
> +
> +static int find_first_free(struct pte_list_desc *desc)
> +{
> + int free = __find_first_free(desc);
> +
> + WARN_ON(free >= PTE_LIST_EXT);
> + return free;
> +}
> +
> +static int find_last_used(struct pte_list_desc *desc)
> +{
> + int used = __find_first_free(desc) - 1;
> +
> + WARN_ON(used < 0 || used >= PTE_LIST_EXT);
> + return used;
> +}
> +
> +/*
> + * TODO: we can encode the desc number into the rmap/parent_ptes
> + * since at least 10 physical/virtual address bits are reserved
> + * on x86. It is worthwhile if it shows that the desc walking is
> + * a performance issue.
> + */
> +static int count_spte_number(struct pte_list_desc *desc)
> +{
> + int first_free, desc_num;
> +
> + first_free = __find_first_free(desc);
> +
> + for (desc_num = 0; desc->more; desc = desc->more)
> + desc_num++;
> +
> + return first_free + desc_num * PTE_LIST_EXT;
> +}
> +
>  /*
>   * Pte mapping structures:
>   *
> @@ -934,92 +978,116 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
>   unsigned long *pte_list)
>  {
>   struct pte_list_desc *desc;
> - int i, count = 0;
> + int free_pos;
>  
>   if (!*pte_list) {
>   rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte);
>   *pte_list = (unsigned long)spte;
> - } else if (!(*pte_list & 1)) {
> + return 0;
> + }
> +
> + if (!(*pte_list & 1)) {
>   rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte);
>   desc = mmu_alloc_pte_list_desc(vcpu);
>   desc->sptes[0] = (u64 *)*pte_list;
>   desc->sptes[1] = spte;
>   *pte_list = (unsigned long)desc | 1;
> - ++count;
> - } else {
> - rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
> - desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> - while (desc->sptes[PTE_LIST_EXT-1] && desc->more) {
> - desc = desc->more;
> - count += PTE_LIST_EXT;
> - }
> - if (desc->sptes[PTE_LIST_EXT-1]) {
> - desc->more = mmu_alloc_pte_list_desc(vcpu);
> - desc = desc->more;
> - }
> - for (i = 0; desc->sptes[i]; ++i)
> - ++count;
> - desc->sptes[i] = spte;
> + return 1;
> + }
> +
> + rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte);
> + desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> +
> + /* No empty position in the desc. */
> + if (desc->sptes[PTE_LIST_EXT - 1]) {
> + struct pte_list_desc *new_desc;
> + new_desc = mmu_alloc_pte_list_desc(vcpu);
> + new_desc->more = desc;
> + desc = new_desc;
> + *pte_list = (unsigned long)desc | 1;
>   }
> - return count;
> +
> + free_pos = find_first_free(desc);
> + desc->sptes[free_pos] = spte;
> + return count_spte_number(desc);
Should it be count_spte_number(desc) - 1? The function should return
the number of pte entries before the spte was added.

>  }
>  
>  static void
> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
> -int i, struct pte_list_desc *prev_desc)
> +

[PATCH] KVM: x86: should release vcpu when closing vcpu fd

2013-08-28 Thread Chen Fan
If we want to remove a vcpu from a vm, we should not only put kvm; we
also need to decrease online_vcpus and free the struct kvm_vcpu when
closing a vcpu file descriptor.

Signed-off-by: Chen Fan 
---
 arch/x86/kvm/x86.c   | 14 ++
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c  |  4 +++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d21bce5..3370de1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6904,6 +6904,20 @@ static void kvm_free_vcpus(struct kvm *kvm)
mutex_unlock(&kvm->lock);
 }
 
+void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu)
+{
+   struct kvm *kvm = vcpu->kvm;
+
+   mutex_lock(&kvm->lock);
+   atomic_dec(&kvm->online_vcpus);
+   kvm->vcpus[atomic_read(&kvm->online_vcpus)] = NULL;
+   mutex_unlock(&kvm->lock);
+
+   kvm_clear_async_pf_completion_queue(vcpu);
+   kvm_unload_vcpu_mmu(vcpu);
+   kvm_arch_vcpu_free(vcpu);
+}
+
 void kvm_arch_sync_events(struct kvm *kvm)
 {
kvm_free_all_assigned_devices(kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a63d83e..e0811f97 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -631,6 +631,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
+void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu);
 
 int kvm_arch_hardware_enable(void *garbage);
 void kvm_arch_hardware_disable(void *garbage);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1580dd4..442014b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1873,8 +1873,10 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
struct kvm_vcpu *vcpu = filp->private_data;
+   struct kvm *kvm = vcpu->kvm;
 
-   kvm_put_kvm(vcpu->kvm);
+   kvm_arch_vcpu_release(vcpu);
+   kvm_put_kvm(kvm);
return 0;
 }
 
-- 
1.8.1.4



Re: [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified

2013-08-28 Thread Xiao Guangrong
On 08/28/2013 03:23 PM, Gleb Natapov wrote:
> On Tue, Jul 30, 2013 at 09:02:04PM +0800, Xiao Guangrong wrote:
>> Relax the tlb flush condition since we will write-protect the spte outside of
>> mmu-lock. Note lockless write-protection only marks the writable spte as
>> readonly, and the spte can be writable only if both SPTE_HOST_WRITEABLE and
>> SPTE_MMU_WRITEABLE are set (which are tested by spte_is_locklessly_modifiable)
>>
>> This patch is used to avoid this kind of race:
>>
>>    VCPU 0                          VCPU 1
>> lockless write protection:
>>    set spte.w = 0
>>                                    lock mmu-lock
>>
>>                                    write protect the spte to sync the shadow page,
>>                                    see spte.w = 0, then without flush tlb
>>
>>                                    unlock mmu-lock
>>
>>                                    !!! At this point, the shadow page can still be
>>                                    writable due to the corrupt tlb entry
>>    Flush all TLB
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/kvm/mmu.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 58283bf..5a40564 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -600,7 +600,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>>   * we always atomicly update it, see the comments in
>>   * spte_has_volatile_bits().
>>   */
>> -if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
>> +if (spte_is_locklessly_modifiable(old_spte) &&
>> +  !is_writable_pte(new_spte))
>>  ret = true;
> This will needlessly flush tlbs when dirty logging is not in use (the common
> case) and the old spte is non-writable. Can you estimate how serious the
> performance hit is?

If the write-protection is not caused by dirty logging, the spte is always
writable when SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE are set. In other
words, spte_is_locklessly_modifiable(old_spte) is the same as
is_writable_pte(old_spte) in the common case.

There are two cases that cause an unnecessary TLB flush:
1) the guest read-faults on a spte that was write-protected by dirty logging
   and uses a readonly host pfn to fix it.
   This is really rare since a read access on a readonly spte cannot trigger
   #PF.

2) the guest requires write-protection caused by syncing shadow pages.
   This is only needed if EPT is disabled, and in most cases the guest has
   many sptes that need to be write-protected, so an unnecessary TLB flush
   is rare too.
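
For context, the predicate being tested looks roughly like this in mmu.c
(a sketch, not necessarily verbatim):

    /*
     * A spte may be locklessly made writable again only if both the
     * host-writable and mmu-writable soft bits are still set.
     */
    static bool spte_is_locklessly_modifiable(u64 spte)
    {
            return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
                    (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
    }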



Re: [libvirt] [PATCH v2] kvm: warn if num cpus is greater than num recommended

2013-08-28 Thread Andrew Jones


- Original Message -
> On 08/23/2013 07:24 AM, Andrew Jones wrote:
> > The comment in kvm_max_vcpus() states that it's using the recommended
> > procedure from the kernel API documentation to get the max number
> > of vcpus that kvm supports. It is, but by always returning the
> > maximum number supported. The maximum number should only be used
> > for development purposes. qemu should check KVM_CAP_NR_VCPUS for
> > the recommended number of vcpus. This patch adds a warning if a user
> > specifies a number of cpus between the recommended and max.
> > 
> > v2:
> > Incorporate tests for max_cpus, which specifies the maximum number
> > of hotpluggable cpus. An additional note is that the message for
> > the fail case was slightly changed, 'exceeds max cpus' to
> > 'exceeds the maximum cpus'. If this is an unacceptable change for
> > users like libvirt, then I'll need to spin a v3.
> 
> A quick grep of libvirt does not show any dependence on the particular
> wording "exceeds max cpus", so you are probably fine changing that.
> 
> What I'm more worried about is what number is libvirt supposed to show
> to the end user, and should libvirt enforce the lower recommended max,
> or the larger kernel absolute max?  Which of the two values does the QMP
> 'MachineInfo' type return in its 'cpu-max' field during the
> 'query-machines' command?  Should we be modifying QMP to return both
> values, so that libvirt can also expose the logic to the end user of
> allowing a recommended vs. larger development max?
> 

Machine definitions maintain yet another 'max_cpus', and it appears that
QMP would return that value. It would probably be best if it returned
max(qemu_machine.max_cpus, kvm_max_cpus), though.
 
I'm starting to think that we should just keep things simple for most of
the virt stack by sticking to enforcing the larger developer max. And
then on a production kernel we should just compile KVM_MAX_VCPUS =
KVM_SOFT_MAX_VCPUS and be done with it. With that thought, this patch
could be dropped too. The alternative seems to be supporting a run-time
selectable experimental mode throughout the whole virt stack.
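
For reference, a minimal sketch of the capability checks under discussion
(the wiring around kvm_check_extension() is an assumption, not a quote from
the patch):

    /* recommended and absolute limits reported by the kernel */
    int soft_max = kvm_check_extension(s, KVM_CAP_NR_VCPUS);
    int hard_max = kvm_check_extension(s, KVM_CAP_MAX_VCPUS);

    if (hard_max <= 0)
            hard_max = soft_max;    /* older kernels lack KVM_CAP_MAX_VCPUS */

    if (smp_cpus > hard_max) {
            /* hard failure: exceeds the maximum cpus */
    } else if (smp_cpus > soft_max) {
            /* warn: between the recommended and the maximum number */
    }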

drew


Re: [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified

2013-08-28 Thread Gleb Natapov
On Tue, Jul 30, 2013 at 09:02:04PM +0800, Xiao Guangrong wrote:
> Relax the tlb flush condition since we will write-protect the spte outside of
> mmu-lock. Note lockless write-protection only marks the writable spte as
> readonly, and the spte can be writable only if both SPTE_HOST_WRITEABLE and
> SPTE_MMU_WRITEABLE are set (which are tested by spte_is_locklessly_modifiable)
> 
> This patch is used to avoid this kind of race:
> 
>    VCPU 0                          VCPU 1
> lockless write protection:
>    set spte.w = 0
>                                    lock mmu-lock
>
>                                    write protect the spte to sync the shadow page,
>                                    see spte.w = 0, then without flush tlb
>
>                                    unlock mmu-lock
>
>                                    !!! At this point, the shadow page can still be
>                                    writable due to the corrupt tlb entry
>    Flush all TLB
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 58283bf..5a40564 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -600,7 +600,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>* we always atomicly update it, see the comments in
>* spte_has_volatile_bits().
>*/
> - if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
> + if (spte_is_locklessly_modifiable(old_spte) &&
> +   !is_writable_pte(new_spte))
>   ret = true;
This will needlessly flush tlbs when dirty logging is not in use (the common
case) and the old spte is non-writable. Can you estimate how serious the
performance hit is?

>  
>   if (!shadow_accessed_mask)
> -- 
> 1.8.1.4

--
Gleb.