Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
On 08/28/2013 09:36 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 08:15:36PM +0800, Xiao Guangrong wrote:
>> On 08/28/2013 06:49 PM, Gleb Natapov wrote:
>>> On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
On 08/28/2013 05:46 PM, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
>>> Or what if desc is moved to another rmap, but then it
>>> is moved back to initial rmap (but another place in the desc list) so
>>> the check here will not catch that we need to restart walking?
>>
>> It is okay. We always add the new desc to the head, then we will walk
>> all the entries under this case.
>>
> Which raises another question: what if desc is added in front of the list
> behind the point where the lockless walker currently is?

That case is when a new spte is being added into the rmap. We need not care
about the new sptes, since they will set the dirty-bitmap and can be
write-protected next time.

>>> OK.
>>>
>
>> Right?
> Not sure. While the lockless walker works on a desc, the rmap can be
> completely destroyed and recreated again. It can be in any order.

I think the thing is very similar to include/linux/rculist_nulls.h

>>> include/linux/rculist_nulls.h is for implementing hash tables, so they
>>> may not care about add/del/lookup races for instance, but maybe we are
>>> (you are saying above that we are not), so similarity does not prove
>>> correctness for our case.
>>
>> We do not care about the "add" and "del" either when looking up the rmap.
>> In the "add" case, it is okay, for the reason I have explained above. In
>> the "del" case, the spte becomes non-present and all TLBs are flushed
>> immediately, so it is also okay.
>>
>> I always use a stupid way to check the correctness, that is, enumerating
>> all the cases we may meet. In this patch, we may meet these cases:
>>
>> 1) kvm deletes a desc before the one we are currently on:
>>    those descs have already been checked; we need not care about them.
>>
>> 2) kvm deletes a desc after the one we are currently on:
>>    since we always add/delete the head desc, we can be sure the current
>>    desc has been deleted; then we will meet case 3).
>>
>> 3) kvm deletes the desc that we are currently on:
>>    3.a): the desc stays in the slab cache (it is not reused).
>>          All spte entries are empty, so fn() will skip the non-present
>>          sptes, and desc->more is
>>          3.a.1) still pointing to the next desc, so we continue the lookup;
>>          3.a.2) or it is the "nulls list", which means we have reached the
>>                 last one, so we finish the walk.
>>
>>    3.b): the desc is allocated from the slab cache and is being initialized.
>>          We will see "desc->more == NULL" and restart the walking. It's
>>          okay.
>>
>>    3.c): the desc is added to an rmap or pte_list again.
>>          3.c.1): the desc is added to the current rmap again.
>>                  The new desc always acts as the head desc, so we will walk
>>                  all entries; some entries are double-checked and no entry
>>                  can be missed. It is okay.
>>
>>          3.c.2): the desc is added to another rmap or pte_list.
>>                  Since kvm_set_memory_region() and get_dirty are serialized
>>                  by slots-lock, the "nulls" cannot be reused during the
>>                  lookup. Then we will meet a different "nulls" at the end
>>                  of the walk, which will cause a rewalk.
>>
>> I know checking the algorithm like this is really silly; do you have
>> another idea?
>>
> Not silly, but easy to miss cases. For instance, 3.c.3 can be:
> the desc is added to another rmap, then we move to another desc on the
> wrong rmap, and this other desc is also deleted and reinserted into the
> original rmap. Seems like the justification from 3.c.1 applies to that
> too, though.
>
>>> BTW I do not see
>>> rcu_assign_pointer()/rcu_dereference() in your patches, which hints on
>>
>> IIUC, we can not directly use rcu_assign_pointer(); that is something like:
>> p = v to assign a pointer to a pointer.
>> But in our case, we need:
>>	*pte_list = (unsigned long)desc | 1;
>>
> From Documentation/RCU/whatisRCU.txt:
>
>     The updater uses this function to assign a new value to an
>     RCU-protected pointer.
>
> This is what we do, no? (assuming slot->arch.rmap[] is what rcu protects
> here) The fact that the value is not a correct pointer should not matter.
>

Okay. Will change that code to:

+
+#define rcu_assign_head_desc(pte_list_p, value)	\
+	rcu_assign_pointer(*(unsigned long __rcu **)(pte_list_p), (unsigned long *)(value))
+
 /*
  * Pte mapping structures:
  *
@@ -1006,14 +1010,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
 		desc->sptes[1] = spte;
 		desc_mark_nulls(pte_list, desc);
-		/*
-		 * Esure the old spte has been updated into desc, so
-		 * that the another side can not get the
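The walk-and-restart behaviour enumerated in the cases above can be sketched with a toy nulls-terminated desc list, in the spirit of include/linux/rculist_nulls.h: the chain ends in an odd "nulls" value derived from the rmap head it belongs to, so a walker that gets migrated onto a different rmap sees a foreign terminator and restarts. This is a single-threaded illustration only; the names (`pte_desc`, `walk_rmap`, `make_nulls`) are made up for the sketch and are not the actual KVM MMU code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NDESC_SPTES 4

struct pte_desc {
	uint64_t *sptes[NDESC_SPTES];
	struct pte_desc *more;	/* next desc, or a "nulls" terminator */
};

/* Encode the rmap identity as an odd (non-pointer) terminator value. */
static struct pte_desc *make_nulls(unsigned long *rmap)
{
	return (struct pte_desc *)((uintptr_t)rmap | 1);
}

static int is_nulls(struct pte_desc *p)
{
	return (uintptr_t)p & 1;
}

static void touch_spte(uint64_t *spte)
{
	(void)spte;	/* stand-in for the real per-spte callback */
}

/* Walk all present sptes; return the count, or -1 if a restart is needed. */
static int walk_rmap(unsigned long *rmap, void (*fn)(uint64_t *spte))
{
	struct pte_desc *desc = (struct pte_desc *)(*rmap & ~1ul);
	int found = 0, i;

	while (!is_nulls(desc)) {
		for (i = 0; i < NDESC_SPTES; i++)
			if (desc->sptes[i]) {	/* skip non-present entries */
				fn(desc->sptes[i]);
				found++;
			}
		desc = desc->more;
	}
	/* Terminator from a different rmap: the desc was moved; rewalk. */
	if (desc != make_nulls(rmap))
		return -1;
	return found;
}
```

A walk that ends on its own rmap's nulls value completes normally; one that ends on a foreign nulls value reports that a rewalk is needed, which is exactly case 3.c.2 above.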
Re: [PATCH 07/23] KVM: PPC: Book3S PR: Use 64k host pages where possible
On Thu, Aug 29, 2013 at 01:24:04AM +0200, Alexander Graf wrote:
> 
> On 06.08.2013, at 06:19, Paul Mackerras wrote:
> 
> > +#ifdef CONFIG_PPC_64K_PAGES
> > +	/*
> > +	 * Mark this as a 64k segment if the host is using
> > +	 * 64k pages, the host MMU supports 64k pages and
> > +	 * the guest segment page size is >= 64k,
> > +	 * but not if this segment contains the magic page.
> 
> What's the problem with the magic page? As long as we map the magic page as a
> host 64k page and access only the upper 4k (which we handle today already) we
> should be set, no?

If we use a 64k host HPTE to map the magic page, then we are taking up
64k of the guest address space, and I was concerned that the guest
might ask to map the magic page at address X and then map something
else at address X+4k or X-4k.  If we use a 64k host HPTE then we would
tromp on those nearby mappings.

If you think the guest will never try to create any nearby mappings,
then we could relax this restriction.

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
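The collision Paul describes falls out of simple alignment arithmetic: a 64k host HPTE backs the entire 64k-aligned region containing the magic page, so a guest 4k mapping at a neighboring address inside that region would be tromped on, while separate 4k HPTEs never overlap. A small sketch (helper names are hypothetical, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Base of the page covering addr for a given page shift (12 = 4k, 16 = 64k). */
static uint64_t page_base(uint64_t addr, unsigned shift)
{
	return addr & ~(((uint64_t)1 << shift) - 1);
}

/* Two guest addresses collide if a single host HPTE would back both. */
static int same_host_page(uint64_t a, uint64_t b, unsigned shift)
{
	return page_base(a, shift) == page_base(b, shift);
}
```

With a 64k-aligned magic page at X, X and X+4k land in the same 64k host page, which is the overlap Paul wants to avoid; with 4k host pages they are independent.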
Re: [PATCH 06/23] KVM: PPC: Book3S PR: Allow guest to use 64k pages
On Thu, Aug 29, 2013 at 12:56:40AM +0200, Alexander Graf wrote:
> 
> On 06.08.2013, at 06:18, Paul Mackerras wrote:
> 
> >  #ifdef CONFIG_PPC_BOOK3S_64
> > -	/* default to book3s_64 (970fx) */
> > +	/*
> > +	 * Default to the same as the host if we're on a POWER7[+],
> > +	 * otherwise default to PPC970FX.
> > +	 */
> >  	vcpu->arch.pvr = 0x3C0301;
> > +	if (cpu_has_feature(CPU_FTR_ARCH_206))
> > +		vcpu->arch.pvr = mfspr(SPRN_PVR);
> 
> Unrelated change? Also, why? Any reasonable user space these days should set
> PVR anyways.

The issue is that the most widely-deployed userspace user of KVM (i.e.,
QEMU) does the KVM_PPC_GET_SMMU_INFO ioctl *before* it tells KVM what
it wants the guest PVR to be.  Originally I had
kvm_vm_ioctl_get_smmu_info() returning the 64k page size only if the
BOOK3S_HFLAG_MULTI_PGSIZE flag was set, so I had to add this change so
that userspace would see the 64k page size.  So yes, I could probably
remove this hunk now.

> > +	/* 64k large page size */
> > +	info->sps[2].page_shift = 16;
> > +	info->sps[2].slb_enc = SLB_VSID_L | SLB_VSID_LP_01;
> > +	info->sps[2].enc[0].page_shift = 16;
> > +	info->sps[2].enc[0].pte_enc = 1;
> 
> We only support this with BOOK3S_HFLAG_MULTI_PGSIZE, no?

The virtual machine implemented by PR KVM supports 64k pages on any
hardware, since it is implementing the POWER MMU in software.  That's
why I didn't make it depend on that flag.  That means that we rely on
userspace to filter out any capabilities that don't apply to the
machine it wants to emulate.  We can't do that filtering here because
userspace queries the MMU capabilities before it sets the PVR.

Regards,
Paul.
Re: [PATCH 04/23] KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu
On Thu, Aug 29, 2013 at 12:00:53AM +0200, Alexander Graf wrote:
> 
> On 06.08.2013, at 06:16, Paul Mackerras wrote:
> 
> >  kvm_start_lightweight:
> > +	/* Copy registers into shadow vcpu so we can access them in real mode */
> > +	GET_SHADOW_VCPU(r3)
> > +	bl	FUNC(kvmppc_copy_to_svcpu)
> 
> This will clobber r3 and r4, no? We need to restore them from the stack here
> I would think.

You're right.  We don't need to restore r3 since we don't actually use
it, but we do need to restore r4.

> >  #ifdef CONFIG_PPC_BOOK3S_32
> >  	/* We set segments as unused segments when invalidating them. So
> >  	 * treat the respective fault as segment fault. */
> > -	if (svcpu->sr[kvmppc_get_pc(vcpu) >> SID_SHIFT] == SR_INVALID) {
> > -		kvmppc_mmu_map_segment(vcpu, kvmppc_get_pc(vcpu));
> > -		r = RESUME_GUEST;
> > +	{
> > +		struct kvmppc_book3s_shadow_vcpu *svcpu;
> > +		u32 sr;
> > +
> > +		svcpu = svcpu_get(vcpu);
> > +		sr = svcpu->sr[kvmppc_get_pc(vcpu) >> SID_SHIFT];
> 
> Doesn't this break two concurrently running guests now that we don't copy the
> shadow vcpu anymore? Just move the sr array to a kmalloc'ed area until the
> whole vcpu is kmalloc'ed. Then you can get rid of all shadow vcpu code.

This is 32-bit only... the svcpu is already kmalloc'ed, so I'm not sure
what you're asking for here or why you think this would break with
multiple guests.

Paul.
Re: [RFC][PATCH 2/6] cpus: release allocated vcpu objects and exit vcpu thread
Am 29.08.2013 04:09, schrieb Chen Fan:
> After ACPI gets a signal to eject a vcpu, it will notify
> the vcpu thread of needing to exit; before the vcpu exits,
> it will release the vcpu-related objects.
> 
> Signed-off-by: Chen Fan
> ---
>  cpus.c               | 36
>  hw/acpi/piix4.c      | 16
>  include/qom/cpu.h    |  9 +
>  include/sysemu/kvm.h |  1 +
>  kvm-all.c            | 26 ++
>  5 files changed, 84 insertions(+), 4 deletions(-)
> 
> diff --git a/cpus.c b/cpus.c
> index 70cc617..6b793cb 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -697,6 +697,30 @@ void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
>      qemu_cpu_kick(cpu);
>  }
>  
> +static void qemu_kvm_destroy_vcpu(CPUState *cpu)
> +{
> +    CPUState *pcpu, *pcpu1;
> +
> +    pcpu = first_cpu;
> +    pcpu1 = NULL;
> +
> +    while (pcpu) {
> +        if (pcpu == cpu && pcpu1) {
> +            pcpu1->next_cpu = cpu->next_cpu;
> +            break;
> +        }
> +        pcpu1 = pcpu;
> +        pcpu = pcpu->next_cpu;
> +    }

No, no, no. :) I specifically posted the "QOM CPUState, part 12" series
early to avoid exactly such code appearing! Give me a few minutes to
apply that to qom-cpu and then please rebase your work on
git://github.com/afaerber/qemu-cpu.git qom-cpu using the QTAILQ macros
and --subject-prefix="RFC qom-cpu v2" for the next version of the
series.

Also, why is this only in the KVM code path? Isn't this needed for TCG
as well?

> +
> +    if (kvm_destroy_vcpu(cpu) < 0) {
> +        fprintf(stderr, "kvm_destroy_vcpu failed.\n");
> +        exit(1);
> +    }
> +
> +    qdev_free(DEVICE(X86_CPU(cpu)));

DEVICE(cpu) should be sufficient.

> +}
> +
>  static void flush_queued_work(CPUState *cpu)
>  {
>      struct qemu_work_item *wi;
> @@ -788,6 +812,11 @@ static void *qemu_kvm_cpu_thread_fn(void *arg)
>              }
>          }
>          qemu_kvm_wait_io_event(cpu);
> +        if (cpu->exit && !cpu_can_run(cpu)) {
> +            qemu_kvm_destroy_vcpu(cpu);
> +            qemu_mutex_unlock(&qemu_global_mutex);
> +            return NULL;
> +        }
>      }
>  
>      return NULL;
> @@ -1080,6 +1109,13 @@ static void qemu_dummy_start_vcpu(CPUState *cpu)
>      }
>  }
>  
> +void qemu_down_vcpu(CPUState *cpu)
> +{
> +    cpu->stop = true;
> +    cpu->exit = true;
> +    qemu_cpu_kick(cpu);
> +}
> +
>  void qemu_init_vcpu(CPUState *cpu)
>  {
>      cpu->nr_cores = smp_cores;
> diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c
> index 1aaa7a4..44bc809 100644
> --- a/hw/acpi/piix4.c
> +++ b/hw/acpi/piix4.c
> @@ -611,10 +611,18 @@ static const MemoryRegionOps piix4_pci_ops = {
>      },
>  };
>  
> -static void acpi_piix_eject_vcpu(int64_t cpuid)
> +static void acpi_piix_eject_vcpu(PIIX4PMState *s, int64_t cpuid)
>  {
> -    /* TODO: eject a vcpu, release allocated vcpu and exit the vcpu pthread. */
> -    PIIX4_DPRINTF("vcpu: %" PRIu64 " need to be ejected.\n", cpuid);
> +    CPUStatus *cpus = &s->gpe_cpu;
> +    CPUState *cs = NULL;
> +
> +    cs = qemu_get_cpu(cpuid);

Are you sure this is correct as 0-based index? Igor?

> +    if (cs == NULL) {
> +        return;
> +    }
> +
> +    cpus->old_sts[cpuid / 8] &= ~(1 << (cpuid % 8));
> +    qemu_down_vcpu(cs);
>  }
>  
>  static uint64_t cpu_status_read(void *opaque, hwaddr addr, unsigned int size)
> @@ -647,7 +655,7 @@ static void cpu_status_write(void *opaque, hwaddr addr, uint64_t data,
>      }
>  
>      if (cpuid != 0) {
> -        acpi_piix_eject_vcpu(cpuid);
> +        acpi_piix_eject_vcpu(s, cpuid);
>      }
>  }
>  
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 3e49936..fa8ec8a 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -180,6 +180,7 @@ struct CPUState {
>      bool created;
>      bool stop;
>      bool stopped;
> +    bool exit;
>      volatile sig_atomic_t exit_request;
>      volatile sig_atomic_t tcg_exit_req;
>      uint32_t interrupt_request;
> @@ -489,6 +490,14 @@ void cpu_exit(CPUState *cpu);
>  void cpu_resume(CPUState *cpu);
>  
>  /**
> + * qemu_down_vcpu:
> + * @cpu: The vCPU will to down.
> + *
> + * Down a vCPU.
> + */
> +void qemu_down_vcpu(CPUState *cpu);

The naming and documentation sounds wrong language-wise to me, but I am
not a native speaker either. Maybe "tear down" instead of "down"? Or
simply qemu_request_vcpu_removal() or something like that?

> + 
> +/**
>   * qemu_init_vcpu:
>   * @cpu: The vCPU to initialize.
>   *
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index de74411..fd85605 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -158,6 +158,7 @@ int kvm_has_intx_set_mask(void);
>  
>  int kvm_init_vcpu(CPUState *cpu);
>  int kvm_cpu_exec(CPUState *cpu);
> +int kvm_destroy_vcpu(CPUState *cpu);
>  
>  #ifdef NEED_CPU_H
>  
> diff --git a/kvm-all.c b/kvm-all.c
> index 716860f..fda3601 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -225,6 +225,32 @@ static v
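For reference, the rework Andreas is asking for replaces the open-coded predecessor walk over `->next_cpu` with a tail-queue removal. QEMU's QTAILQ macros mirror the BSD TAILQ macros used in this standalone sketch; `ToyCPU` is illustrative, not QEMU's actual CPUState.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Toy stand-in for CPUState, kept on a tail queue instead of a
 * hand-rolled singly-linked list. */
struct ToyCPU {
	int index;
	TAILQ_ENTRY(ToyCPU) node;
};

TAILQ_HEAD(CPUList, ToyCPU);

/* Removal is a single O(1) macro -- no manual predecessor search. */
static void cpu_list_remove(struct CPUList *cpus, struct ToyCPU *cpu)
{
	TAILQ_REMOVE(cpus, cpu, node);
}

static int cpu_list_len(struct CPUList *cpus)
{
	struct ToyCPU *c;
	int n = 0;

	TAILQ_FOREACH(c, cpus, node)
		n++;
	return n;
}
```

Besides being shorter, the doubly-linked queue removes the corner case the patch had to special-case (removing the head, where `pcpu1` is still NULL).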
Re: [PATCH] KVM: x86: should release vcpu when closing vcpu fd
On Thu, Aug 29, 2013 at 11:06:47AM +0800, chenfan wrote:
> On Wed, 2013-08-28 at 11:21 +0300, Gleb Natapov wrote:
> > On Wed, Aug 28, 2013 at 04:08:10PM +0800, Chen Fan wrote:
> > > If we want to remove a vcpu from the vm, we should not only
> > > put kvm; we need to decrease online vcpus and free struct kvm_vcpu
> > > when closing a vcpu file descriptor.
> > >
> > It is much more complicated than what you do here. kvm->vcpus[] is
> > accessed without locking and kvm->online_vcpus is assumed to be greater
> > than the max occupied index in kvm->vcpus[]. Attempts were made to make
> > it possible to destroy individual vcpus separately from destroying the
> > VM before, but they were unsuccessful thus far.
> >
> Oh, the motivation of this patch is for QEMU to implement cpu
> hot-remove. I want to remove the vcpu from KVM after QEMU closes the
> kvm_fd. I need some advice for this feature. If this patch is not
> comprehensive enough, I want to know the reasons in more detail.
> Could you tell me?
> 
I told you above.
> Thanks,
> Chen
> 
> > > Signed-off-by: Chen Fan
> > > ---
> > >  arch/x86/kvm/x86.c       | 14 ++
> > >  include/linux/kvm_host.h |  1 +
> > >  virt/kvm/kvm_main.c      |  4 +++-
> > >  3 files changed, 18 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index d21bce5..3370de1 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -6904,6 +6904,20 @@ static void kvm_free_vcpus(struct kvm *kvm)
> > >  	mutex_unlock(&kvm->lock);
> > >  }
> > >  
> > > +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu)
> > > +{
> > > +	struct kvm *kvm = vcpu->kvm;
> > > +
> > > +	mutex_lock(&kvm->lock);
> > > +	atomic_dec(&kvm->online_vcpus);
> > > +	kvm->vcpus[atomic_read(&kvm->online_vcpus)] = NULL;
> > > +	mutex_unlock(&kvm->lock);
> > > +
> > > +	kvm_clear_async_pf_completion_queue(vcpu);
> > > +	kvm_unload_vcpu_mmu(vcpu);
> > > +	kvm_arch_vcpu_free(vcpu);
> > > +}
> > > +
> > >  void kvm_arch_sync_events(struct kvm *kvm)
> > >  {
> > >  	kvm_free_all_assigned_devices(kvm);
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index a63d83e..e0811f97 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -631,6 +631,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
> > >  int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
> > >  int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
> > >  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
> > > +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu);
> > >  
> > >  int kvm_arch_hardware_enable(void *garbage);
> > >  void kvm_arch_hardware_disable(void *garbage);
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 1580dd4..442014b 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -1873,8 +1873,10 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
> > >  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
> > >  {
> > >  	struct kvm_vcpu *vcpu = filp->private_data;
> > > +	struct kvm *kvm = vcpu->kvm;
> > >  
> > > -	kvm_put_kvm(vcpu->kvm);
> > > +	kvm_arch_vcpu_release(vcpu);
> > > +	kvm_put_kvm(kvm);
> > >  	return 0;
> > >  }
> > > 
> > > -- 
> > > 1.8.1.4
> > 
> > -- 
> > Gleb.

-- 
Gleb.
Re: [PATCH] KVM: x86: should release vcpu when closing vcpu fd
On Wed, 2013-08-28 at 11:21 +0300, Gleb Natapov wrote:
> On Wed, Aug 28, 2013 at 04:08:10PM +0800, Chen Fan wrote:
> > If we want to remove a vcpu from the vm, we should not only
> > put kvm; we need to decrease online vcpus and free struct kvm_vcpu
> > when closing a vcpu file descriptor.
> >
> It is much more complicated than what you do here. kvm->vcpus[] is
> accessed without locking and kvm->online_vcpus is assumed to be greater
> than the max occupied index in kvm->vcpus[]. Attempts were made to make
> it possible to destroy individual vcpus separately from destroying the
> VM before, but they were unsuccessful thus far.
> 
Oh, the motivation of this patch is for QEMU to implement cpu
hot-remove. I want to remove the vcpu from KVM after QEMU closes the
kvm_fd. I need some advice for this feature. If this patch is not
comprehensive enough, I want to know the reasons in more detail.
Could you tell me?

Thanks,
Chen

> > Signed-off-by: Chen Fan
> > ---
> >  arch/x86/kvm/x86.c       | 14 ++
> >  include/linux/kvm_host.h |  1 +
> >  virt/kvm/kvm_main.c      |  4 +++-
> >  3 files changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d21bce5..3370de1 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6904,6 +6904,20 @@ static void kvm_free_vcpus(struct kvm *kvm)
> >  	mutex_unlock(&kvm->lock);
> >  }
> >  
> > +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm *kvm = vcpu->kvm;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	atomic_dec(&kvm->online_vcpus);
> > +	kvm->vcpus[atomic_read(&kvm->online_vcpus)] = NULL;
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	kvm_clear_async_pf_completion_queue(vcpu);
> > +	kvm_unload_vcpu_mmu(vcpu);
> > +	kvm_arch_vcpu_free(vcpu);
> > +}
> > +
> >  void kvm_arch_sync_events(struct kvm *kvm)
> >  {
> >  	kvm_free_all_assigned_devices(kvm);
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index a63d83e..e0811f97 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -631,6 +631,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
> >  int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
> >  int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
> >  void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
> > +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu);
> >  
> >  int kvm_arch_hardware_enable(void *garbage);
> >  void kvm_arch_hardware_disable(void *garbage);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1580dd4..442014b 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1873,8 +1873,10 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
> >  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
> >  {
> >  	struct kvm_vcpu *vcpu = filp->private_data;
> > +	struct kvm *kvm = vcpu->kvm;
> >  
> > -	kvm_put_kvm(vcpu->kvm);
> > +	kvm_arch_vcpu_release(vcpu);
> > +	kvm_put_kvm(kvm);
> >  	return 0;
> >  }
> > 
> > -- 
> > 1.8.1.4

-- 
Gleb.
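Gleb's objection can be made concrete with a toy model: lockless readers assume every `kvm->vcpus[]` slot below `online_vcpus` holds a valid vcpu, but the patch's release path decrements the count and clears the slot at the *old tail*, so releasing a middle vcpu leaves a freed pointer visible below the count while a still-live vcpu vanishes from the array. This is an illustration only, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VCPUS 4

/* Toy stand-in for struct kvm: an array of vcpu pointers plus a count. */
struct toy_kvm {
	void *vcpus[MAX_VCPUS];
	int online_vcpus;
};

/* Mimics the patch's kvm_arch_vcpu_release(): decrement the count and
 * NULL the slot at the new count (the old tail), even though the vcpu
 * actually being freed may live at a different index. */
static void *release_like_patch(struct toy_kvm *kvm, int idx)
{
	void *freed = kvm->vcpus[idx];	/* this vcpu's memory is freed */

	kvm->online_vcpus--;
	kvm->vcpus[kvm->online_vcpus] = NULL;
	return freed;
}
```

Fixing this would require either compacting the array under a lock that the lockless readers also observe, or an RCU-style publication scheme, which is presumably why the earlier attempts Gleb mentions were unsuccessful.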
[PATCH v3 3/3] tile: enable VIRTIO support for KVM
This change enables support for a virtio-based console, network support, and block driver support. We remove some debug code in relocate_kernel_64.S that made raw calls to the hv_console_putc Tilera hypervisor API, since everything now should funnel through the early_hv_write() API. Signed-off-by: Chris Metcalf --- arch/tile/Kconfig | 3 + arch/tile/include/asm/kvm_para.h| 20 ++ arch/tile/include/asm/kvm_virtio.h | 26 ++ arch/tile/include/uapi/asm/Kbuild | 1 + arch/tile/include/uapi/asm/kvm.h| 5 + arch/tile/include/uapi/asm/kvm_virtio.h | 60 + arch/tile/kernel/Makefile | 1 + arch/tile/kernel/early_printk.c | 16 ++ arch/tile/kernel/hvglue.S | 1 + arch/tile/kernel/kvm_virtio.c | 430 arch/tile/kernel/relocate_kernel_64.S | 9 +- 11 files changed, 570 insertions(+), 2 deletions(-) create mode 100644 arch/tile/include/asm/kvm_para.h create mode 100644 arch/tile/include/asm/kvm_virtio.h create mode 100644 arch/tile/include/uapi/asm/kvm_virtio.h create mode 100644 arch/tile/kernel/kvm_virtio.c diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig index e89aae8..4e8524b 100644 --- a/arch/tile/Kconfig +++ b/arch/tile/Kconfig @@ -370,6 +370,9 @@ config KVM_GUEST bool "Build kernel as guest for KVM" default n depends on TILEGX + select VIRTIO + select VIRTIO_RING + select VIRTIO_CONSOLE ---help--- This will build a kernel that runs at a lower protection level than the default kernel and is suitable to run under KVM. diff --git a/arch/tile/include/asm/kvm_para.h b/arch/tile/include/asm/kvm_para.h new file mode 100644 index 000..c8c31d5 --- /dev/null +++ b/arch/tile/include/asm/kvm_para.h @@ -0,0 +1,20 @@ +/* + * Copyright 2013 Tilera Corporation. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation, version 2. 
+ * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for + * more details. + */ +#ifndef _ASM_TILE_KVM_PARA_H +#define _ASM_TILE_KVM_PARA_H + +#include + +int hcall_virtio(unsigned long instrument, unsigned long mem); +#endif /* _ASM_TILE_KVM_PARA_H */ diff --git a/arch/tile/include/asm/kvm_virtio.h b/arch/tile/include/asm/kvm_virtio.h new file mode 100644 index 000..8faa959 --- /dev/null +++ b/arch/tile/include/asm/kvm_virtio.h @@ -0,0 +1,26 @@ +/* + * Copyright 2013 Tilera Corporation. All Rights Reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation, version 2. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for + * more details. 
+ */ +#ifndef _ASM_TILE_KVM_VIRTIO_H +#define _ASM_TILE_KVM_VIRTIO_H + +#include + + +struct kvm_device { + struct virtio_device vdev; + struct kvm_device_desc *desc; + unsigned long desc_pa; +}; + +#endif /* _ASM_TILE_KVM_VIRTIO_H */ diff --git a/arch/tile/include/uapi/asm/Kbuild b/arch/tile/include/uapi/asm/Kbuild index 89022a5..f07cc24 100644 --- a/arch/tile/include/uapi/asm/Kbuild +++ b/arch/tile/include/uapi/asm/Kbuild @@ -8,6 +8,7 @@ header-y += cachectl.h header-y += hardwall.h header-y += kvm.h header-y += kvm_para.h +header-y += kvm_virtio.h header-y += mman.h header-y += ptrace.h header-y += setup.h diff --git a/arch/tile/include/uapi/asm/kvm.h b/arch/tile/include/uapi/asm/kvm.h index aa7b97f..4346520 100644 --- a/arch/tile/include/uapi/asm/kvm.h +++ b/arch/tile/include/uapi/asm/kvm.h @@ -149,6 +149,9 @@ */ #define KVM_OTHER_HCALL 128 +/* Hypercall index for virtio. */ +#define KVM_HCALL_virtio 128 + /* One greater than the maximum hypercall number. */ #define KVM_NUM_HCALLS 256 @@ -256,6 +259,8 @@ struct kvm_sync_regs { KVM_EMULATE(get_ipi_pte) \ KVM_EMULATE(set_pte_super_shift) \ KVM_EMULATE(set_speed) \ + /* For others */ \ + USER_HCALL(virtio) #endif diff --git a/arch/tile/include/uapi/asm/kvm_virtio.h b/arch/tile/include/uapi/asm/kvm_virtio.h new file mode 100644 index 000..d94f535 --- /dev/null +++ b/arch/tile/include/uapi/asm/kvm_virtio.h @@ -0,0 +1,60 @@ +/* + * Copyright 2013 Tilera Corporation. All Rights Reserved. + * + * This program is free softwar
[PATCH v3 2/3] tile: enable building as a paravirtualized KVM_GUEST
This commit enables a special configure option to build the kernel to
run at PL1.  In this mode, the client can run under a KVM host kernel;
it can also run under the older Tilera hypervisor that ran the
operating system at PL1 by default.

The PL1 kernel runs with half the virtual address space and half the
physical address space of the PL2 kernel, so we modify those constants
appropriately.  We make some things a little more generic ("intrpt1"
references from the old PL1 kernel nomenclature have now been
normalized to "intrpt"), and simplify some other nomenclature (using
MEM_SV_START instead of MEM_SV_INTRPT to reflect the fact that where
the supervisor starts is the fact of interest, not that it happens to
start with the interrupt vectors).

The simulator support for reflecting Elf binary data back out
associated with simulator backtrace information needs some additional
extension.  It currently will report backtrace information for the
guest kernel, but not for processes running within the guest kernel;
additional work in the simulator is required to be able to provide the
necessary path information for that to work.  For now we disable
simulator notifications within the guest kernel.

The timer interrupt for the guest uses the AUX_TILE_TIMER hardware,
leaving the regular TILE_TIMER for the host.
Signed-off-by: Chris Metcalf --- arch/tile/Kconfig | 14 +++-- arch/tile/include/asm/module.h | 10 -- arch/tile/include/asm/page.h | 63 +++--- arch/tile/include/asm/pgtable_32.h | 2 +- arch/tile/include/asm/pgtable_64.h | 3 +- arch/tile/include/asm/processor.h | 2 +- arch/tile/include/asm/switch_to.h | 25 --- arch/tile/include/asm/timex.h | 8 + arch/tile/kernel/head_32.S | 4 +-- arch/tile/kernel/head_64.S | 6 ++-- arch/tile/kernel/intvec_32.S | 6 ++-- arch/tile/kernel/intvec_64.S | 34 +++- arch/tile/kernel/process.c | 2 ++ arch/tile/kernel/setup.c | 8 ++--- arch/tile/kernel/sysfs.c | 4 +++ arch/tile/kernel/time.c| 14 - arch/tile/kernel/traps.c | 2 +- arch/tile/kernel/vmlinux.lds.S | 10 +++--- arch/tile/mm/elf.c | 2 ++ arch/tile/mm/init.c| 8 ++--- 20 files changed, 145 insertions(+), 82 deletions(-) diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig index 3bc8fb7..e89aae8 100644 --- a/arch/tile/Kconfig +++ b/arch/tile/Kconfig @@ -126,7 +126,7 @@ config TILEGX select HAVE_FTRACE_MCOUNT_RECORD select HAVE_KPROBES select HAVE_KRETPROBES - select HAVE_KVM + select HAVE_KVM if !KVM_GUEST config TILEPRO def_bool !TILEGX @@ -366,11 +366,19 @@ config HARDWALL bool "Hardwall support to allow access to user dynamic network" default y +config KVM_GUEST + bool "Build kernel as guest for KVM" + default n + depends on TILEGX + ---help--- + This will build a kernel that runs at a lower protection level + than the default kernel and is suitable to run under KVM. + config KERNEL_PL int "Processor protection level for kernel" range 1 2 - default 2 if TILEGX - default 1 if !TILEGX + default 2 if TILEGX && !KVM_GUEST + default 1 if !TILEGX || KVM_GUEST ---help--- Since MDE 4.2, the Tilera hypervisor runs the kernel at PL2 by default. 
If running under an older hypervisor, diff --git a/arch/tile/include/asm/module.h b/arch/tile/include/asm/module.h index 44ed07c..a8b546b 100644 --- a/arch/tile/include/asm/module.h +++ b/arch/tile/include/asm/module.h @@ -16,7 +16,6 @@ #define _ASM_TILE_MODULE_H #include - #include /* We can't use modules built with different page sizes. */ @@ -28,6 +27,13 @@ # define MODULE_PGSZ "" #endif +/* Tag guest Linux, since it uses different SPRs, etc. */ +#if CONFIG_KERNEL_PL == 2 +#define MODULE_PL "" +#else +#define MODULE_PL " guest" +#endif + /* We don't really support no-SMP so tag if someone tries. */ #ifdef CONFIG_SMP #define MODULE_NOSMP "" @@ -35,6 +41,6 @@ #define MODULE_NOSMP " nosmp" #endif -#define MODULE_ARCH_VERMAGIC CHIP_ARCH_NAME MODULE_PGSZ MODULE_NOSMP +#define MODULE_ARCH_VERMAGIC CHIP_ARCH_NAME MODULE_PGSZ MODULE_PL MODULE_NOSMP #endif /* _ASM_TILE_MODULE_H */ diff --git a/arch/tile/include/asm/page.h b/arch/tile/include/asm/page.h index b4f96c0..2c991f2 100644 --- a/arch/tile/include/asm/page.h +++ b/arch/tile/include/asm/page.h @@ -148,8 +148,17 @@ static inline __attribute_const__ int get_order(unsigned long size) #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #endif +#ifdef CONFIG_KVM_GUEST +/* Paravirtualized guests get half the VA, and thus half the PA. */ +#define MAX_PA_WIDTH (CHIP_PA_WIDTH() - 1) +#define MAX_VA_WIDTH (CHIP_VA_WIDTH() - 1) +#else +#define MAX_PA_WIDTH CHIP_PA_WIDTH() +#define MAX_VA_WIDTH CHIP_VA_W
Re: [PATCH v2] tile: support KVM for tilegx
On 8/26/2013 8:04 AM, Gleb Natapov wrote:
> On Sun, Aug 25, 2013 at 09:26:47PM -0400, Chris Metcalf wrote:
>> On 8/25/2013 7:39 AM, Gleb Natapov wrote:
>>> On Mon, Aug 12, 2013 at 04:24:11PM -0400, Chris Metcalf wrote:
>>>> This change provides the initial framework support for KVM on tilegx.
>>>> Basic virtual disk and networking is supported.
>>> This needs to be broken down to more reviewable patches.
>> I already broke out one pre-requisite patch that wasn't strictly KVM-related:
>>
>> https://lkml.org/lkml/2013/8/12/339
>>
>> In addition, we've separately arranged to support booting our kernels in a
>> way that is compatible with the Tilera booter running at the highest
>> privilege level, which enables multiple kernel privilege levels:
>>
>> https://lkml.org/lkml/2013/5/2/468
>>
>> How would you recommend further breaking down this patch? It's pretty much
>> just the basic support for minimal KVM. I suppose I could break out all the
>> I/O related stuff into a separate patch, though it wouldn't amount to much;
>> perhaps the console could also be broken out separately. Any other
>> suggestions?
>>
> First of all please break out host and guest bits. Also the I/O related
> stuff, like you suggest (so that guest PV bits will be in a separate
> patch), and changes to common code (not much as far as I see) with an
> explanation of why each is needed. (Why is kvm_vcpu_kick() not needed,
> for instance?)

I broke it down into three pieces in the end: the basic host support,
the basic guest PV support, and the virtio/console support.  The first
piece is still much the biggest.

I found that the generic kvm_vcpu_kick() is fine, so I removed the
custom version (which predated the generic version in our internal
tree).  Explanations are now in the git commit comments.

>>> Also can you
>>> describe the implementation a little bit? Does the tile arch have
>>> virtualization extensions this implementation uses, or is it a trap
>>> and emulate approach? If the latter, does it run unmodified guest
>>> kernels? What userspace are you
>>> using with this implementation?
>> We could do full virtualization via trap and emulate, but we've elected to
>> do a para-virtualized approach.  Userspace runs at PL (privilege level) 0,
>> the guest kernel runs at PL1, and the host runs at PL2.  We have available
>> per-PL resources for various things, and take advantage of having two
>> on-chip timers (for example) to handle timing for the host and guest
>> kernels.  We run the same userspace with either the host or the guest.
>>
> OK, thanks for the explanation. Why have you decided to do PV over trap
> and emulate?

Performance and simplicity; I added comments to the git commit to
provide a rationale.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com
Re: [PATCH 07/23] KVM: PPC: Book3S PR: Use 64k host pages where possible
On 06.08.2013, at 06:19, Paul Mackerras wrote: > Currently, PR KVM uses 4k pages for the host-side mappings of guest > memory, regardless of the host page size. When the host page size is > 64kB, we might as well use 64k host page mappings for guest mappings > of 64kB and larger pages and for guest real-mode mappings. However, > the magic page has to remain a 4k page. > > To implement this, we first add another flag bit to the guest VSID > values we use, to indicate that this segment is one where host pages > should be mapped using 64k pages. For segments with this bit set > we set the bits in the shadow SLB entry to indicate a 64k base page > size. When faulting in host HPTEs for this segment, we make them > 64k HPTEs instead of 4k. We record the pagesize in struct hpte_cache > for use when invalidating the HPTE. > > Signed-off-by: Paul Mackerras > --- > arch/powerpc/include/asm/kvm_book3s.h | 6 -- > arch/powerpc/kvm/book3s_32_mmu.c | 1 + > arch/powerpc/kvm/book3s_64_mmu.c | 35 ++- > arch/powerpc/kvm/book3s_64_mmu_host.c | 27 +-- > arch/powerpc/kvm/book3s_pr.c | 1 + > 5 files changed, 57 insertions(+), 13 deletions(-) > > diff --git a/arch/powerpc/include/asm/kvm_book3s.h > b/arch/powerpc/include/asm/kvm_book3s.h > index 175f876..322b539 100644 > --- a/arch/powerpc/include/asm/kvm_book3s.h > +++ b/arch/powerpc/include/asm/kvm_book3s.h > @@ -66,6 +66,7 @@ struct hpte_cache { > u64 pfn; > ulong slot; > struct kvmppc_pte pte; > + int pagesize; > }; > > struct kvmppc_vcpu_book3s { > @@ -113,8 +114,9 @@ struct kvmppc_vcpu_book3s { > #define CONTEXT_GUEST 1 > #define CONTEXT_GUEST_END 2 > > -#define VSID_REAL0x0fc0ULL > -#define VSID_BAT 0x0fb0ULL > +#define VSID_REAL0x07c0ULL > +#define VSID_BAT 0x07b0ULL > +#define VSID_64K 0x0800ULL > #define VSID_1T 0x1000ULL > #define VSID_REAL_DR 0x2000ULL > #define VSID_REAL_IR 0x4000ULL > diff --git a/arch/powerpc/kvm/book3s_32_mmu.c > b/arch/powerpc/kvm/book3s_32_mmu.c > index c8cefdd..af04553 100644 > --- 
a/arch/powerpc/kvm/book3s_32_mmu.c > +++ b/arch/powerpc/kvm/book3s_32_mmu.c > @@ -308,6 +308,7 @@ static int kvmppc_mmu_book3s_32_xlate(struct kvm_vcpu > *vcpu, gva_t eaddr, > ulong mp_ea = vcpu->arch.magic_page_ea; > > pte->eaddr = eaddr; > + pte->page_size = MMU_PAGE_4K; > > /* Magic page override */ > if (unlikely(mp_ea) && > diff --git a/arch/powerpc/kvm/book3s_64_mmu.c > b/arch/powerpc/kvm/book3s_64_mmu.c > index d5fa26c..658ccd7 100644 > --- a/arch/powerpc/kvm/book3s_64_mmu.c > +++ b/arch/powerpc/kvm/book3s_64_mmu.c > @@ -542,6 +542,16 @@ static void kvmppc_mmu_book3s_64_tlbie(struct kvm_vcpu > *vcpu, ulong va, > kvmppc_mmu_pte_vflush(vcpu, va >> 12, mask); > } > > +#ifdef CONFIG_PPC_64K_PAGES > +static int segment_contains_magic_page(struct kvm_vcpu *vcpu, ulong esid) > +{ > + ulong mp_ea = vcpu->arch.magic_page_ea; > + > + return mp_ea && !(vcpu->arch.shared->msr & MSR_PR) && > + (mp_ea >> SID_SHIFT) == esid; > +} > +#endif > + > static int kvmppc_mmu_book3s_64_esid_to_vsid(struct kvm_vcpu *vcpu, ulong > esid, >u64 *vsid) > { > @@ -549,11 +559,13 @@ static int kvmppc_mmu_book3s_64_esid_to_vsid(struct > kvm_vcpu *vcpu, ulong esid, > struct kvmppc_slb *slb; > u64 gvsid = esid; > ulong mp_ea = vcpu->arch.magic_page_ea; > + int pagesize = MMU_PAGE_64K; > > if (vcpu->arch.shared->msr & (MSR_DR|MSR_IR)) { > slb = kvmppc_mmu_book3s_64_find_slbe(vcpu, ea); > if (slb) { > gvsid = slb->vsid; > + pagesize = slb->base_page_size; > if (slb->tb) { > gvsid <<= SID_SHIFT_1T - SID_SHIFT; > gvsid |= esid & ((1ul << (SID_SHIFT_1T - > SID_SHIFT)) - 1); > @@ -564,28 +576,41 @@ static int kvmppc_mmu_book3s_64_esid_to_vsid(struct > kvm_vcpu *vcpu, ulong esid, > > switch (vcpu->arch.shared->msr & (MSR_DR|MSR_IR)) { > case 0: > - *vsid = VSID_REAL | esid; > + gvsid = VSID_REAL | esid; > break; > case MSR_IR: > - *vsid = VSID_REAL_IR | gvsid; > + gvsid |= VSID_REAL_IR; > break; > case MSR_DR: > - *vsid = VSID_REAL_DR | gvsid; > + gvsid |= VSID_REAL_DR; > break; > case MSR_DR|MSR_IR: 
> if (!slb) > goto no_slb; > > - *vsid = gvsid; > break; > default: > BUG(); > break; > } > > +#ifdef CONFIG_PPC_64K_PAGES > + /* > + * Mark this as a 64k segment if the host is using > + * 64k pages, the h
Re: [PATCH 06/23] KVM: PPC: Book3S PR: Allow guest to use 64k pages
On 06.08.2013, at 06:18, Paul Mackerras wrote: > This adds the code to interpret 64k HPTEs in the guest hashed page > table (HPT), 64k SLB entries, and to tell the guest about 64k pages > in kvm_vm_ioctl_get_smmu_info(). Guest 64k pages are still shadowed > by 4k pages. > > This also adds another hash table to the four we have already in > book3s_mmu_hpte.c to allow us to find all the PTEs that we have > instantiated that match a given 64k guest page. > > The tlbie instruction changed starting with POWER6 to use a bit in > the RB operand to indicate large page invalidations, and to use other > RB bits to indicate the base and actual page sizes and the segment > size. 64k pages came in slightly earlier, with POWER5++. At present > we use one bit in vcpu->arch.hflags to indicate that the emulated > cpu supports 64k pages and also has the new tlbie definition. If > we ever want to support emulation of POWER5++, we will need to use > another bit. > > Signed-off-by: Paul Mackerras > --- > arch/powerpc/include/asm/kvm_asm.h| 1 + > arch/powerpc/include/asm/kvm_book3s.h | 6 +++ > arch/powerpc/include/asm/kvm_host.h | 4 ++ > arch/powerpc/kvm/book3s_64_mmu.c | 92 +++ > arch/powerpc/kvm/book3s_mmu_hpte.c| 50 +++ > arch/powerpc/kvm/book3s_pr.c | 30 +++- > 6 files changed, 173 insertions(+), 10 deletions(-) > > [...] > @@ -1127,8 +1144,13 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm > *kvm, unsigned int id) > goto uninit_vcpu; > > #ifdef CONFIG_PPC_BOOK3S_64 > - /* default to book3s_64 (970fx) */ > + /* > + * Default to the same as the host if we're on a POWER7[+], > + * otherwise default to PPC970FX. > + */ > vcpu->arch.pvr = 0x3C0301; > + if (cpu_has_feature(CPU_FTR_ARCH_206)) > + vcpu->arch.pvr = mfspr(SPRN_PVR); Unrelated change? Also, why? Any reasonable user space these days should set PVR anyways. 
> #else > /* default to book3s_32 (750) */ > vcpu->arch.pvr = 0x84202; > @@ -1331,6 +1353,12 @@ int kvm_vm_ioctl_get_smmu_info(struct kvm *kvm, struct > kvm_ppc_smmu_info *info) > info->sps[1].enc[0].page_shift = 24; > info->sps[1].enc[0].pte_enc = 0; > > + /* 64k large page size */ > + info->sps[2].page_shift = 16; > + info->sps[2].slb_enc = SLB_VSID_L | SLB_VSID_LP_01; > + info->sps[2].enc[0].page_shift = 16; > + info->sps[2].enc[0].pte_enc = 1; We only support this with BOOK3S_HFLAG_MULTI_PGSIZE, no? Alex
Re: [PATCH 05/23] KVM: PPC: Book3S PR: Rework kvmppc_mmu_book3s_64_xlate()
On 06.08.2013, at 06:18, Paul Mackerras wrote: > This reworks kvmppc_mmu_book3s_64_xlate() to make it check the large > page bit in the hashed page table entries (HPTEs) it looks at, and > to simplify and streamline the code. The checking of the first dword > of each HPTE is now done with a single mask and compare operation, > and all the code dealing with the matching HPTE, if we find one, > is consolidated in one place in the main line of the function flow. > > Signed-off-by: Paul Mackerras Thanks, applied to kvm-ppc-queue. Alex
Re: [PATCH 02/23] KVM: PPC: Book3S PR: Don't corrupt guest state when kernel uses VMX
On 06.08.2013, at 06:14, Paul Mackerras wrote: > Currently the code assumes that once we load up guest FP/VSX or VMX > state into the CPU, it stays valid in the CPU registers until we > explicitly flush it to the thread_struct. However, on POWER7, > copy_page() and memcpy() can use VMX. These functions do flush the > VMX state to the thread_struct before using VMX instructions, but if > this happens while we have guest state in the VMX registers, and we > then re-enter the guest, we don't reload the VMX state from the > thread_struct, leading to guest corruption. This has been observed > to cause guest processes to segfault. > > To fix this, we check before re-entering the guest that all of the > bits corresponding to facilities owned by the guest, as expressed > in vcpu->arch.guest_owned_ext, are set in current->thread.regs->msr. > Any bits that have been cleared correspond to facilities that have > been used by kernel code and thus flushed to the thread_struct, so > for them we reload the state from the thread_struct. > > We also need to check current->thread.regs->msr before calling > giveup_fpu() or giveup_altivec(), since if the relevant bit is > clear, the state has already been flushed to the thread_struct and > to flush it again would corrupt it. > > Signed-off-by: Paul Mackerras Thanks, applied to kvm-ppc-queue. Alex
Re: [PATCH 03/23] KVM: PPC: Book3S PR: Make instruction fetch fallback work for system calls
On 06.08.2013, at 06:15, Paul Mackerras wrote: > It turns out that if we exit the guest due to a hcall instruction (sc 1), > and the loading of the instruction in the guest exit path fails for any > reason, the call to kvmppc_ld() in kvmppc_get_last_inst() fetches the > instruction after the hcall instruction rather than the hcall itself. > This in turn means that the instruction doesn't get recognized as an > hcall in kvmppc_handle_exit_pr() but gets passed to the guest kernel > as a sc instruction. That usually results in the guest kernel getting > a return code of 38 (ENOSYS) from an hcall, which often triggers a > BUG_ON() or other failure. > > This fixes the problem by adding a new variant of kvmppc_get_last_inst() > called kvmppc_get_last_sc(), which fetches the instruction if necessary > from pc - 4 rather than pc. > > Signed-off-by: Paul Mackerras Thanks, applied to kvm-ppc-queue. Alex
Re: [PATCH 01/23] KVM: PPC: Book3S: Fix compile error in XICS emulation
On 06.08.2013, at 06:13, Paul Mackerras wrote: > Commit 8e44ddc3f3 ("powerpc/kvm/book3s: Add support for H_IPOLL and > H_XIRR_X in XICS emulation") added a call to get_tb() but didn't > include the header that defines it, and on some configs this means > book3s_xics.c fails to compile: > > arch/powerpc/kvm/book3s_xics.c: In function ‘kvmppc_xics_hcall’: > arch/powerpc/kvm/book3s_xics.c:812:3: error: implicit declaration of function > ‘get_tb’ [-Werror=implicit-function-declaration] > > Cc: sta...@vger.kernel.org [v3.10] > Signed-off-by: Paul Mackerras Thanks, applied to kvm-ppc-queue. Alex
Re: [PATCH 04/23] KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu
On 06.08.2013, at 06:16, Paul Mackerras wrote: > Currently PR-style KVM keeps the volatile guest register values > (R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than > the main kvm_vcpu struct. For 64-bit, the shadow_vcpu exists in two > places, a kmalloc'd struct and in the PACA, and it gets copied back > and forth in kvmppc_core_vcpu_load/put(), because the real-mode code > can't rely on being able to access the kmalloc'd struct. > > This changes the code to copy the volatile values into the shadow_vcpu > as one of the last things done before entering the guest. Similarly > the values are copied back out of the shadow_vcpu to the kvm_vcpu > immediately after exiting the guest. We arrange for interrupts to be > still disabled at this point so that we can't get preempted on 64-bit > and end up copying values from the wrong PACA. > > This means that the accessor functions in kvm_book3s.h for these > registers are greatly simplified, and are same between PR and HV KVM. > In places where accesses to shadow_vcpu fields are now replaced by > accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs. > Finally, on 64-bit, we don't need the kmalloc'd struct at all any more. > > With this, the time to read the PVR one million times in a loop went > from 582.1ms to 584.3ms (averages of 10 values), a difference which is > not statistically significant given the variability of the results > (the standard deviations were 9.5ms and 8.6ms respectively). A version > of the patch that used loops to copy the GPR values increased that time > by around 5% to 611.2ms, so the loop has been unrolled. 
> > Signed-off-by: Paul Mackerras > --- > arch/powerpc/include/asm/kvm_book3s.h | 220 +- > arch/powerpc/include/asm/kvm_book3s_asm.h | 6 +- > arch/powerpc/include/asm/kvm_host.h | 1 + > arch/powerpc/kernel/asm-offsets.c | 4 +- > arch/powerpc/kvm/book3s_emulate.c | 8 +- > arch/powerpc/kvm/book3s_interrupts.S | 26 +++- > arch/powerpc/kvm/book3s_pr.c | 122 - > arch/powerpc/kvm/book3s_rmhandlers.S | 5 - > arch/powerpc/kvm/trace.h | 7 +- > 9 files changed, 156 insertions(+), 243 deletions(-) > > diff --git a/arch/powerpc/include/asm/kvm_book3s.h > b/arch/powerpc/include/asm/kvm_book3s.h > index fa19e2f..a8897c1 100644 > --- a/arch/powerpc/include/asm/kvm_book3s.h > +++ b/arch/powerpc/include/asm/kvm_book3s.h > @@ -198,140 +198,76 @@ extern void kvm_return_point(void); > #include > #endif > > -#ifdef CONFIG_KVM_BOOK3S_PR > - > -static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu *vcpu) > -{ > - return to_book3s(vcpu)->hior; > -} > - > -static inline void kvmppc_update_int_pending(struct kvm_vcpu *vcpu, > - unsigned long pending_now, unsigned long old_pending) > -{ > - if (pending_now) > - vcpu->arch.shared->int_pending = 1; > - else if (old_pending) > - vcpu->arch.shared->int_pending = 0; > -} > - > static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, ulong val) > { > - if ( num < 14 ) { > - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); > - svcpu->gpr[num] = val; > - svcpu_put(svcpu); > - to_book3s(vcpu)->shadow_vcpu->gpr[num] = val; > - } else > - vcpu->arch.gpr[num] = val; > + vcpu->arch.gpr[num] = val; > } > > static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, int num) > { > - if ( num < 14 ) { > - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); > - ulong r = svcpu->gpr[num]; > - svcpu_put(svcpu); > - return r; > - } else > - return vcpu->arch.gpr[num]; > + return vcpu->arch.gpr[num]; > } > > static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val) > { > - struct kvmppc_book3s_shadow_vcpu *svcpu = 
svcpu_get(vcpu); > - svcpu->cr = val; > - svcpu_put(svcpu); > - to_book3s(vcpu)->shadow_vcpu->cr = val; > + vcpu->arch.cr = val; > } > > static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu) > { > - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); > - u32 r; > - r = svcpu->cr; > - svcpu_put(svcpu); > - return r; > + return vcpu->arch.cr; > } > > static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, u32 val) > { > - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); > - svcpu->xer = val; > - to_book3s(vcpu)->shadow_vcpu->xer = val; > - svcpu_put(svcpu); > + vcpu->arch.xer = val; > } > > static inline u32 kvmppc_get_xer(struct kvm_vcpu *vcpu) > { > - struct kvmppc_book3s_shadow_vcpu *svcpu = svcpu_get(vcpu); > - u32 r; > - r = svcpu->xer; > - svcpu_put(svcpu); > - return r; > + return vcpu->arch.xer; > } > > static inline void kvmppc_set_ctr(struct kvm_vcpu *vcpu, ulong val) > { > - struct kvmppc_bo
Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments
On Wed, 28 Aug 2013 14:23:58 -0700 Kent Overstreet wrote: > > I found things to be quite the opposite - it took 5 minutes of staring, > > head-scratching, double-checking and penny-dropping before I was > > confident that the newly-added code actually has nothing at all to do > > with the current code. Putting it in the same file was misleading, and > > I got misled. > > Ok... and I could see how the fact that it currently _doesn't_ have > anything to do with the existing code would be confusing... > > Do you think that if/when it's making use of the ida rewrite it'll be > ok? Or would you still prefer to have it in a new file I'm constitutionally reluctant to ever assume that any out-of-tree code will be merged. Maybe you'll get hit by a bus, and maybe the code sucks ;) Are you sure that the two things are so tangled together that they must live in the same file? If there's some nice layering between ida and percpu_ida then perhaps such a physical separation would remain appropriate? > (and if so, any preference on the naming?) percpu_ida.c?
Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments
On Wed, Aug 28, 2013 at 02:10:19PM -0700, Andrew Morton wrote: > On Wed, 28 Aug 2013 14:00:10 -0700 Kent Overstreet wrote: > > > On Wed, Aug 28, 2013 at 01:25:50PM -0700, Andrew Morton wrote: > > > On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet > > > wrote: > > > > > > > Fixup patch, addressing Andrew's review feedback: > > > > > > Looks reasonable. > > > > > > > lib/idr.c | 38 +- > > > > > > I still don't think it should be in this file. > > > > > > You say that some as-yet-unmerged patches will tie the new code into > > > the old ida code. But will it do it in a manner which requires that > > > the two reside in the same file? > > > > Not require, no - but it's just intimate enough with my ida rewrite that > > I think it makes sense; it makes some use of stuff that should be > > internal to the ida code. > > > > Mostly just sharing the lock though, since I got rid of the ida > > interfaces that don't do locking, but percpu ida needs a lock that also > > covers what ida needs. > > > > It also makes use of a ganged allocation interface, but there's no real > > reason ida can't expose that, it's just unlikely to be useful to > > anything but percpu ida. > > > > The other reason I think it makes sense to live in idr.c is more for > > users of the code; as you pointed out as far as the user's perspective > > percpu ida isn't doing anything fundamentally different from ida, so I > > think it makes sense for the code to live in the same place as a > > kindness to future kernel developers who are trying to find their way > > around the various library code. > > I found things to be quite the opposite - it took 5 minutes of staring, > head-scratching, double-checking and penny-dropping before I was > confident that the newly-added code actually has nothing at all to do > with the current code. Putting it in the same file was misleading, and > I got misled. Ok... 
and I could see how the fact that it currently _doesn't_ have anything to do with the existing code would be confusing... Do you think that if/when it's making use of the ida rewrite it'll be ok? Or would you still prefer to have it in a new file (and if so, any preference on the naming?)
Re: [PATCH-v3 1/4] idr: Percpu ida
On Wed, 28 Aug 2013 14:12:17 -0700 Kent Overstreet wrote: > How's this look? > > diff --git a/lib/idr.c b/lib/idr.c > index 15c021c..a3f8e9a 100644 > --- a/lib/idr.c > +++ b/lib/idr.c > @@ -1288,6 +1288,11 @@ static inline unsigned alloc_local_tag(struct > percpu_ida *pool, > * Safe to be called from interrupt context (assuming it isn't passed > * __GFP_WAIT, of course). > * > + * @gfp indicates whether or not to wait until a free id is available (it's > not > + * used for internal memory allocations); thus if passed __GFP_WAIT we may > sleep > + * however long it takes until another thread frees an id (same semantics as > a > + * mempool). Looks good. Mentioning the mempool thing is effective - people understand that.
Re: [PATCH-v3 1/4] idr: Percpu ida
On Wed, Aug 28, 2013 at 01:50:42PM -0700, Andrew Morton wrote: > On Wed, 28 Aug 2013 13:44:54 -0700 Kent Overstreet wrote: > > > > > > What guarantees that this wait will terminate? > > > > > > > > It seems fairly clear to me from the break statement a couple lines up; > > > > if we were passed __GFP_WAIT we terminate iff we succesfully allocated a > > > > tag. If we weren't passed __GFP_WAIT we never actually sleep. > > > > > > OK ;) Let me rephrase. What guarantees that a tag will become available? > > > > > > If what we have here is an open-coded __GFP_NOFAIL then that is > > > potentially problematic. > > > > It's the same semantics as a mempool, really - it'll succeed when a tag > > gets freed. > > OK, that's reasonable if the code is being used to generate IO tags - > we expect the in-flight tags to eventually be returned. > > But if a client of this code is using the allocator for something > totally different, there is no guarantee that the act of waiting will > result in any tags being returned. Yeah, and I did wonder a bit whether the waiting mechanism belonged in the percpu ida code; arguably (certainly just looking at this code, not any of the users) if it belongs in this code it should be common to regular ida, not specific to percpu ida. For now I've just decided to punt on changing that for now, since all the percpu ida users I've come across do want the waiting mechanism, but none of the regular ida users that I've looked at want it. There's probably a reason for that I haven't thought of yet. > (These are core design principles/constraints which should be > explicitly documented in a place where future readers will see them!) *nod* I suppose it should be said explicitly that the gfp_t parameter indicates whether or not to wait until a _tag_ is available, and not some internal memory allocation or something. How's this look? 
diff --git a/lib/idr.c b/lib/idr.c index 15c021c..a3f8e9a 100644 --- a/lib/idr.c +++ b/lib/idr.c @@ -1288,6 +1288,11 @@ static inline unsigned alloc_local_tag(struct percpu_ida *pool, * Safe to be called from interrupt context (assuming it isn't passed * __GFP_WAIT, of course). * + * @gfp indicates whether or not to wait until a free id is available (it's not + * used for internal memory allocations); thus if passed __GFP_WAIT we may sleep + * however long it takes until another thread frees an id (same semantics as a + * mempool). + * * Will not fail if passed __GFP_WAIT. */ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments
On Wed, 28 Aug 2013 14:00:10 -0700 Kent Overstreet wrote: > On Wed, Aug 28, 2013 at 01:25:50PM -0700, Andrew Morton wrote: > > On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet > > wrote: > > > > > Fixup patch, addressing Andrew's review feedback: > > > > Looks reasonable. > > > > > lib/idr.c | 38 +- > > > > I still don't think it should be in this file. > > > > You say that some as-yet-unmerged patches will tie the new code into > > the old ida code. But will it do it in a manner which requires that > > the two reside in the same file? > > Not require, no - but it's just intimate enough with my ida rewrite that > I think it makes sense; it makes some use of stuff that should be > internal to the ida code. > > Mostly just sharing the lock though, since I got rid of the ida > interfaces that don't do locking, but percpu ida needs a lock that also > covers what ida needs. > > It also makes use of a ganged allocation interface, but there's no real > reason ida can't expose that, it's just unlikely to be useful to > anything but percpu ida. > > The other reason I think it makes sense to live in idr.c is more for > users of the code; as you pointed out as far as the user's perspective > percpu ida isn't doing anything fundamentally different from ida, so I > think it makes sense for the code to live in the same place as a > kindness to future kernel developers who are trying to find their way > around the various library code. I found things to be quite the opposite - it took 5 minutes of staring, head-scratching, double-checking and penny-dropping before I was confident that the newly-added code actually has nothing at all to do with the current code. Putting it in the same file was misleading, and I got misled.
Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments
On Wed, Aug 28, 2013 at 01:25:50PM -0700, Andrew Morton wrote: > On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet wrote: > > > Fixup patch, addressing Andrew's review feedback: > > Looks reasonable. > > > lib/idr.c | 38 +- > > I still don't think it should be in this file. > > You say that some as-yet-unmerged patches will tie the new code into > the old ida code. But will it do it in a manner which requires that > the two reside in the same file? Not require, no - but it's just intimate enough with my ida rewrite that I think it makes sense; it makes some use of stuff that should be internal to the ida code. Mostly just sharing the lock though, since I got rid of the ida interfaces that don't do locking, but percpu ida needs a lock that also covers what ida needs. It also makes use of a ganged allocation interface, but there's no real reason ida can't expose that, it's just unlikely to be useful to anything but percpu ida. The other reason I think it makes sense to live in idr.c is more for users of the code; as you pointed out as far as the user's perspective percpu ida isn't doing anything fundamentally different from ida, so I think it makes sense for the code to live in the same place as a kindness to future kernel developers who are trying to find their way around the various library code.
Re: [PATCH-v3 1/4] idr: Percpu ida
On Wed, 28 Aug 2013 13:44:54 -0700 Kent Overstreet wrote: > > > > What guarantees that this wait will terminate? > > > > > > It seems fairly clear to me from the break statement a couple lines up; > > > if we were passed __GFP_WAIT we terminate iff we successfully allocated a > > > tag. If we weren't passed __GFP_WAIT we never actually sleep. > > > > OK ;) Let me rephrase. What guarantees that a tag will become available? > > > > If what we have here is an open-coded __GFP_NOFAIL then that is > > potentially problematic. > > It's the same semantics as a mempool, really - it'll succeed when a tag > gets freed. OK, that's reasonable if the code is being used to generate IO tags - we expect the in-flight tags to eventually be returned. But if a client of this code is using the allocator for something totally different, there is no guarantee that the act of waiting will result in any tags being returned. (These are core design principles/constraints which should be explicitly documented in a place where future readers will see them!)
Re: [PATCH-v3 1/4] idr: Percpu ida
On Wed, Aug 28, 2013 at 01:23:32PM -0700, Andrew Morton wrote: > On Wed, 28 Aug 2013 12:53:17 -0700 Kent Overstreet wrote: > > > > > + while (1) { > > > > + spin_lock(&pool->lock); > > > > + > > > > + /* > > > > +* prepare_to_wait() must come before steal_tags(), in > > > > case > > > > +* percpu_ida_free() on another cpu flips a bit in > > > > +* cpus_have_tags > > > > +* > > > > +* global lock held and irqs disabled, don't need > > > > percpu lock > > > > +*/ > > > > + prepare_to_wait(&pool->wait, &wait, > > > > TASK_UNINTERRUPTIBLE); > > > > + > > > > + if (!tags->nr_free) > > > > + alloc_global_tags(pool, tags); > > > > + if (!tags->nr_free) > > > > + steal_tags(pool, tags); > > > > + > > > > + if (tags->nr_free) { > > > > + tag = tags->freelist[--tags->nr_free]; > > > > + if (tags->nr_free) > > > > + set_bit(smp_processor_id(), > > > > + pool->cpus_have_tags); > > > > + } > > > > + > > > > + spin_unlock(&pool->lock); > > > > + local_irq_restore(flags); > > > > + > > > > + if (tag >= 0 || !(gfp & __GFP_WAIT)) > > > > + break; > > > > + > > > > + schedule(); > > > > + > > > > + local_irq_save(flags); > > > > + tags = this_cpu_ptr(pool->tag_cpu); > > > > + } > > > > > > What guarantees that this wait will terminate? > > > > It seems fairly clear to me from the break statement a couple lines up; > > if we were passed __GFP_WAIT we terminate iff we succesfully allocated a > > tag. If we weren't passed __GFP_WAIT we never actually sleep. > > OK ;) Let me rephrase. What guarantees that a tag will become available? > > If what we have here is an open-coded __GFP_NOFAIL then that is > potentially problematic. It's the same semantics as a mempool, really - it'll succeed when a tag gets freed. If we are sleeping then there isn't really anything else we can do, there isn't anything we're trying in the __GFP_WAIT case that we're not trying in the GFP_NOWAIT case. 
Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments
On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet wrote: > Fixup patch, addressing Andrew's review feedback: Looks reasonable. > lib/idr.c | 38 +- I still don't think it should be in this file. You say that some as-yet-unmerged patches will tie the new code into the old ida code. But will it do it in a manner which requires that the two reside in the same file?
Re: [PATCH-v3 1/4] idr: Percpu ida
On Wed, 28 Aug 2013 12:53:17 -0700 Kent Overstreet wrote: > > > + while (1) { > > > + spin_lock(&pool->lock); > > > + > > > + /* > > > + * prepare_to_wait() must come before steal_tags(), in case > > > + * percpu_ida_free() on another cpu flips a bit in > > > + * cpus_have_tags > > > + * > > > + * global lock held and irqs disabled, don't need percpu lock > > > + */ > > > + prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE); > > > + > > > + if (!tags->nr_free) > > > + alloc_global_tags(pool, tags); > > > + if (!tags->nr_free) > > > + steal_tags(pool, tags); > > > + > > > + if (tags->nr_free) { > > > + tag = tags->freelist[--tags->nr_free]; > > > + if (tags->nr_free) > > > + set_bit(smp_processor_id(), > > > + pool->cpus_have_tags); > > > + } > > > + > > > + spin_unlock(&pool->lock); > > > + local_irq_restore(flags); > > > + > > > + if (tag >= 0 || !(gfp & __GFP_WAIT)) > > > + break; > > > + > > > + schedule(); > > > + > > > + local_irq_save(flags); > > > + tags = this_cpu_ptr(pool->tag_cpu); > > > + } > > > > What guarantees that this wait will terminate? > > It seems fairly clear to me from the break statement a couple lines up; > if we were passed __GFP_WAIT we terminate iff we successfully allocated a > tag. If we weren't passed __GFP_WAIT we never actually sleep. OK ;) Let me rephrase. What guarantees that a tag will become available? If what we have here is an open-coded __GFP_NOFAIL then that is potentially problematic.
[PATCH] percpu ida: Switch to cpumask_t, add some comments
Fixup patch, addressing Andrew's review feedback: Signed-off-by: Kent Overstreet --- include/linux/idr.h | 2 +- lib/idr.c | 38 +- 2 files changed, 22 insertions(+), 18 deletions(-) diff --git a/include/linux/idr.h b/include/linux/idr.h index f0db12b..cdf39be 100644 --- a/include/linux/idr.h +++ b/include/linux/idr.h @@ -267,7 +267,7 @@ struct percpu_ida { * will just keep looking - but the bitmap _must_ be set whenever a * percpu freelist does have tags. */ - unsigned long *cpus_have_tags; + cpumask_t cpus_have_tags; struct { spinlock_t lock; diff --git a/lib/idr.c b/lib/idr.c index 26495e1..15c021c 100644 --- a/lib/idr.c +++ b/lib/idr.c @@ -1178,7 +1178,13 @@ EXPORT_SYMBOL(ida_init); #define IDA_PCPU_SIZE ((IDA_PCPU_BATCH_MOVE * 3) / 2) struct percpu_ida_cpu { + /* +* Even though this is percpu, we need a lock for tag stealing by remote +* CPUs: +*/ spinlock_t lock; + + /* nr_free/freelist form a stack of free IDs */ unsignednr_free; unsignedfreelist[]; }; @@ -1209,21 +1215,21 @@ static inline void steal_tags(struct percpu_ida *pool, unsigned cpus_have_tags, cpu = pool->cpu_last_stolen; struct percpu_ida_cpu *remote; - for (cpus_have_tags = bitmap_weight(pool->cpus_have_tags, nr_cpu_ids); + for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags); cpus_have_tags * IDA_PCPU_SIZE > pool->nr_tags / 2; cpus_have_tags--) { - cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu); + cpu = cpumask_next(cpu, &pool->cpus_have_tags); - if (cpu == nr_cpu_ids) - cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids); + if (cpu >= nr_cpu_ids) + cpu = cpumask_first(&pool->cpus_have_tags); - if (cpu == nr_cpu_ids) + if (cpu >= nr_cpu_ids) BUG(); pool->cpu_last_stolen = cpu; remote = per_cpu_ptr(pool->tag_cpu, cpu); - clear_bit(cpu, pool->cpus_have_tags); + cpumask_clear_cpu(cpu, &pool->cpus_have_tags); if (remote == tags) continue; @@ -1246,6 +1252,10 @@ static inline void steal_tags(struct percpu_ida *pool, } } +/* + * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global 
freelist, and push them onto + * our percpu freelist: + */ static inline void alloc_global_tags(struct percpu_ida *pool, struct percpu_ida_cpu *tags) { @@ -1317,8 +1327,8 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp) if (tags->nr_free) { tag = tags->freelist[--tags->nr_free]; if (tags->nr_free) - set_bit(smp_processor_id(), - pool->cpus_have_tags); + cpumask_set_cpu(smp_processor_id(), + &pool->cpus_have_tags); } spin_unlock(&pool->lock); @@ -1363,8 +1373,8 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag) spin_unlock(&tags->lock); if (nr_free == 1) { - set_bit(smp_processor_id(), - pool->cpus_have_tags); + cpumask_set_cpu(smp_processor_id(), + &pool->cpus_have_tags); wake_up(&pool->wait); } @@ -1398,7 +1408,6 @@ EXPORT_SYMBOL_GPL(percpu_ida_free); void percpu_ida_destroy(struct percpu_ida *pool) { free_percpu(pool->tag_cpu); - kfree(pool->cpus_have_tags); free_pages((unsigned long) pool->freelist, get_order(pool->nr_tags * sizeof(unsigned))); } @@ -1428,7 +1437,7 @@ int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags) /* Guard against overflow */ if (nr_tags > (unsigned) INT_MAX + 1) { - pr_err("tags.c: nr_tags too large\n"); + pr_err("percpu_ida_init(): nr_tags too large\n"); return -EINVAL; } @@ -1442,11 +1451,6 @@ int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags) pool->nr_free = nr_tags; - pool->cpus_have_tags = kzalloc(BITS_TO_LONGS(nr_cpu_ids) * - sizeof(unsigned long), GFP_KERNEL); - if (!pool->cpus_have_tags) - goto err; - pool->tag_cpu = __alloc_percpu(sizeof(struct percpu_ida_cpu) + IDA_PCPU_SIZE * sizeof(unsigned), sizeof(unsigned)); -- 1.8.4.rc3 -- To unsubscribe from this
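The round-robin scan that steal_tags() does over cpus_have_tags can be sketched in userspace. This is an illustrative analogue, not kernel API: a plain unsigned long stands in for the cpumask_t, and the function name is ours; it expresses the same wrap-around pattern the patch writes with cpumask_next()/cpumask_first().

```c
#include <assert.h>

/* Starting just past cpu_last_stolen, find the next set bit in a mask
 * of nr_cpu_ids bits; when the end is reached, wrap to the first set
 * bit -- the same scan steal_tags() performs with cpumask_next() and
 * cpumask_first() in the patch above. */
static int next_cpu_wrap(unsigned long mask, int nr_cpu_ids, int last)
{
	int cpu;

	for (cpu = last + 1; cpu < nr_cpu_ids; cpu++)
		if (mask & (1UL << cpu))
			return cpu;
	for (cpu = 0; cpu < nr_cpu_ids; cpu++)	/* wrapped around */
		if (mask & (1UL << cpu))
			return cpu;
	return nr_cpu_ids;			/* no candidate cpus left */
}
```

Note that, as in the patch, an empty mask yields nr_cpu_ids, which is the case steal_tags() treats as a BUG() because the bitmap must be set whenever a percpu freelist has tags.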
Re: [PATCH-v3 1/4] idr: Percpu ida
On Tue, Aug 20, 2013 at 02:31:57PM -0700, Andrew Morton wrote: > On Fri, 16 Aug 2013 23:09:06 + "Nicholas A. Bellinger" > wrote: > > + /* > > +* Bitmap of cpus that (may) have tags on their percpu freelists: > > +* steal_tags() uses this to decide when to steal tags, and which cpus > > +* to try stealing from. > > +* > > +* It's ok for a freelist to be empty when its bit is set - steal_tags() > > +* will just keep looking - but the bitmap _must_ be set whenever a > > +* percpu freelist does have tags. > > +*/ > > + unsigned long *cpus_have_tags; > > Why not cpumask_t? I hadn't encountered it before - looks like it's probably what I want. I don't see any explanation for the parallel set of operations for working on cpumasks - e.g. next_cpu()/cpumask_next(). For now I'm going with the cpumask_* versions, is that what I want?o If you can have a look at the fixup patch that'll be most appreciated. > > + struct { > > + spinlock_t lock; > > + /* > > +* When we go to steal tags from another cpu (see steal_tags()), > > +* we want to pick a cpu at random. Cycling through them every > > +* time we steal is a bit easier and more or less equivalent: > > +*/ > > + unsignedcpu_last_stolen; > > + > > + /* For sleeping on allocation failure */ > > + wait_queue_head_t wait; > > + > > + /* > > +* Global freelist - it's a stack where nr_free points to the > > +* top > > +*/ > > + unsignednr_free; > > + unsigned*freelist; > > + } cacheline_aligned_in_smp; > > Why the cacheline_aligned_in_smp? It's separating the RW stuff that isn't always touched from the RO stuff that's used on every allocation. > > > +}; > > > > ... > > > > + > > +/* Percpu IDA */ > > + > > +/* > > + * Number of tags we move between the percpu freelist and the global > > freelist at > > + * a time > > "between a percpu freelist" would be more accurate? 
No, because when we're stealing tags we always grab all of the remote percpu freelist's tags - IDA_PCPU_BATCH_MOVE is only used when moving to/from the global freelist. > > > + */ > > +#define IDA_PCPU_BATCH_MOVE32U > > + > > +/* Max size of percpu freelist, */ > > +#define IDA_PCPU_SIZE ((IDA_PCPU_BATCH_MOVE * 3) / 2) > > + > > +struct percpu_ida_cpu { > > + spinlock_t lock; > > + unsignednr_free; > > + unsignedfreelist[]; > > +}; > > Data structure needs documentation. There's one of these per cpu. I > guess nr_free and freelist are clear enough. The presence of a lock > in a percpu data structure is a surprise. It's for cross-cpu stealing, > I assume? Yeah, I'll add some comments. > > +static inline void alloc_global_tags(struct percpu_ida *pool, > > +struct percpu_ida_cpu *tags) > > +{ > > + move_tags(tags->freelist, &tags->nr_free, > > + pool->freelist, &pool->nr_free, > > + min(pool->nr_free, IDA_PCPU_BATCH_MOVE)); > > +} > > Document this function? Will do > > + while (1) { > > + spin_lock(&pool->lock); > > + > > + /* > > +* prepare_to_wait() must come before steal_tags(), in case > > +* percpu_ida_free() on another cpu flips a bit in > > +* cpus_have_tags > > +* > > +* global lock held and irqs disabled, don't need percpu lock > > +*/ > > + prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE); > > + > > + if (!tags->nr_free) > > + alloc_global_tags(pool, tags); > > + if (!tags->nr_free) > > + steal_tags(pool, tags); > > + > > + if (tags->nr_free) { > > + tag = tags->freelist[--tags->nr_free]; > > + if (tags->nr_free) > > + set_bit(smp_processor_id(), > > + pool->cpus_have_tags); > > + } > > + > > + spin_unlock(&pool->lock); > > + local_irq_restore(flags); > > + > > + if (tag >= 0 || !(gfp & __GFP_WAIT)) > > + break; > > + > > + schedule(); > > + > > + local_irq_save(flags); > > + tags = this_cpu_ptr(pool->tag_cpu); > > + } > > What guarantees that this wait will terminate? 
It seems fairly clear to me from the break statement a couple lines up; if we were passed __GFP_WAIT we terminate iff we successfully allocated a tag. If we weren't passed __GFP_WAIT we never actually sleep. I can add a comment if you think it needs one. > > + finish_wait(&pool->wait, &wait); > > + return tag; > > +} > > +EXPORT_SYMBOL_GPL(percpu_ida_alloc); > > + > > +/** > > + * percpu_ida_free -
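The stealing-vs-batching distinction drawn above (steal_tags() grabs the remote freelist's entire count, alloc_global_tags() caps the transfer at IDA_PCPU_BATCH_MOVE) hinges on the move_tags() helper. A minimal sketch of it, with the signature inferred from the call sites quoted in this thread rather than copied from lib/idr.c:

```c
#include <assert.h>
#include <string.h>

/* Pop `nr` tags off the top of the `src` stack and push them onto
 * `dst`.  steal_tags() passes the remote freelist's full nr_free;
 * alloc_global_tags() passes min(pool->nr_free, IDA_PCPU_BATCH_MOVE). */
static void move_tags(unsigned *dst, unsigned *nr_dst,
		      unsigned *src, unsigned *nr_src,
		      unsigned nr)
{
	*nr_src -= nr;
	memcpy(dst + *nr_dst, src + *nr_src, sizeof(unsigned) * nr);
	*nr_dst += nr;
}
```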
Re: VGA passthrough of GTX 660 to KVM guest
Hello Alex, > Ok, so it's probably X. You might be able to start your guest from here > and get VGA through the gtx660 (assuming nothing else is broken with > radeon VGA arbiter support). You could also run startx to confirm that > it's X doing all those VGA arbiter writes. Thanks, After I issue startx in runlevel 1, dmesg indeed does show numerous vgaarb: this pci device is not a vga device messages, but also once this: vgaarb: device changed decodes: PCI::01:00.0,olddecodes=io+mem,decodes=none:owns=none After having booted into runlevel 1 again, not starting X, I have tried to invoke qemu without any disk and vfio-pci devices, as described in my first mail, to see whether I could see SeaBIOS output on the screen I passed through. Both times I tried this now, my complete host system seemed to have frozen, then suddenly restarted itself. When I boot the Windows DomU with pci-assign, the Windows Device Manager does recognize a second graphics card, but does not label it as a GTX 660, but a standard VGA device. Also, it says "device could not be started (Code 10)". Unfortunately, I haven't noticed the following up until now: Early during boot, I am shown the message "kvm: no hardware support". After booting has completed, dmesg | grep -i kvm gives [0.949028] kvm: no hardware support [0.949156] kvm: Nested Virtualization enabled [0.949160] kvm: Nested Paging enabled I'm a bit confused, since it seems like kvm says AMD-V is not enabled, but still seems to work. Maybe it's related to the issue at hand? Please let me know what I can do next to help resolve this. Thank you very much!
Re: [Qemu-devel] Direct guest device access from nested guest
Jan Kiszka writes: > On 2013-08-28 20:12, Lluís Vilanova wrote: >> Jan Kiszka writes: >> [...] Is it possible to give a nested guest direct access to a device on the guest? (more specifically, an AHCI controller). >> >>> Nope, we are lacking support for emulating or (securely) forwarding >>> VT-d/IOMMU features to the first level guest. Would be cool to have, >>> just not yet there. But I've talked to Intel people recently, and they >>> are considering to support some nested VT-d with KVM. >> >> Thanks a lot. I've been told there's some patches floating around to add such >> support, but I suppose they've been long outdated and only work as POCs. > I haven't seen anything in public. This is what I've found: https://lists.nongnu.org/archive/html/qemu-devel/2011-04/msg01970.html > PS: You have Mail-Followup-To set in your answers - people will drop you > from CC this way. Yes, my mail client tried to be clever but didn't know I'm not subscribed to the KVM list :) Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Re: Direct guest device access from nested guest
Jan Kiszka writes: [...] >> Is it possible to give a nested guest direct access to a device on the guest? >> (more specifically, an AHCI controller). > Nope, we are lacking support for emulating or (securely) forwarding > VT-d/IOMMU features to the first level guest. Would be cool to have, > just not yet there. But I've talked to Intel people recently, and they > are considering to support some nested VT-d with KVM. Thanks a lot. I've been told there's some patches floating around to add such support, but I suppose they've been long outdated and only work as POCs. Lluis -- "And it's much the same thing with knowledge, for whenever you learn something new, the whole world becomes that much richer." -- The Princess of Pure Reason, as told by Norton Juster in The Phantom Tollbooth
Re: Direct guest device access from nested guest
On 2013-08-28 20:12, Lluís Vilanova wrote: > Jan Kiszka writes: > [...] >>> Is it possible to give a nested guest direct access to a device on the >>> guest? >>> (more specifically, an AHCI controller). > >> Nope, we are lacking support for emulating or (securely) forwarding >> VT-d/IOMMU features to the first level guest. Would be cool to have, >> just not yet there. But I've talked to Intel people recently, and they >> are considering to support some nested VT-d with KVM. > > Thanks a lot. I've been told there's some patches floating around to add such > support, but I suppose they've been long outdated and only work as POCs. I haven't seen anything in public. Jan PS: You have Mail-Followup-To set in your answers - people will drop you from CC this way. -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux
Re: VGA passthrough of GTX 660 to KVM guest
On Wed, 2013-08-28 at 19:42 +0200, Gustav Sorenson wrote: > Hello, > > yes, 1:00.0 is the GTX 660, and yes, I'm running X on the host. > When I boot into runlevel 1 (on Mint, which is Debian-based, so that > corresponds to a single-user runlevel without X, although I saw a > not-very-fancy boot splashscreen), this is what dmesg | grep -i vga > gives me: > > [0.00] Console: colour VGA+ 80x25 > [0.324894] vgaarb: device added: > PCI::00:01.0,decodes=io+mem,owns=io+mem,locks=none > [0.324903] vgaarb: device added: > PCI::01:00.0,decodes=io+mem,owns=none,locks=none > [0.324905] vgaarb: loaded > [0.324906] vgaarb: bridge control possible :01:00.0 > [0.324907] vgaarb: no bridge control possible :00:01.0 > [ 16.459085] hda-intel :01:00.1: Handle VGA-switcheroo audio client Ok, so it's probably X. You might be able to start your guest from here and get VGA through the gtx660 (assuming nothing else is broken with radeon VGA arbiter support). You could also run startx to confirm that it's X doing all those VGA arbiter writes. Thanks, Alex
Re: [PATCH V13 4/4] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
On 08/26/2013 03:48:36 AM, Raghavendra K T wrote: KVM_HC_KICK_CPU hypercall added to wakeup halted vcpu in paravirtual spinlock enabled guest. KVM_FEATURE_PV_UNHALT enables guest to check whether pv spinlock can be enabled in guest. Thanks Vatsa for rewriting KVM_HC_KICK_CPU Cc: Rob Landley Looks like documentation to me: Acked-by: Rob Landley Rob
Re: VGA passthrough of GTX 660 to KVM guest
Hello, yes, 1:00.0 is the GTX 660, and yes, I'm running X on the host. When I boot into runlevel 1 (on Mint, which is Debian-based, so that corresponds to a single-user runlevel without X, although I saw a not-very-fancy boot splashscreen), this is what dmesg | grep -i vga gives me: [0.00] Console: colour VGA+ 80x25 [0.324894] vgaarb: device added: PCI::00:01.0,decodes=io+mem,owns=io+mem,locks=none [0.324903] vgaarb: device added: PCI::01:00.0,decodes=io+mem,owns=none,locks=none [0.324905] vgaarb: loaded [0.324906] vgaarb: bridge control possible :01:00.0 [0.324907] vgaarb: no bridge control possible :00:01.0 [ 16.459085] hda-intel :01:00.1: Handle VGA-switcheroo audio client Thanks!
Re: Direct guest device access from nested guest
On 2013-08-28 16:28, Lluís Vilanova wrote: > Hi, > > I want to get the following setup, but don't know how (or if it's even > possible): > > * A guest VM with two AHCI controllers, with one device each. One of the AHCI > controllers provides the VM's disk ("system"), while the other provides > another disk ("nested") and uses a different emulation driver in QEMU > (ahci2): > > host$ qemu-system-x86_64 -enable-kvm \ > -drive id=system,file=system.img,if=none \ > -device ahci,id=ahci \ > -device ide-drive,drive=system,bus=ahci.0 \ > -drive id=nested,file=nested.img,if=none \ > -device ahci2,id=ahci2 \ > -device ide-drive,drive=nested,bus=ahci2.0 > > * A nested guest VM using the guest's (its host) AHCI2 controller. > > I've tried assigning the AHCI2 device to the nested guest using "pci-assign" > and "vfio", but without any luck. > > The culprit of the problem seems to be I cannot get the nested guest to have > an IOMMU. > > Is it possible to give a nested guest direct access to a device on the guest? > (more specifically, an AHCI controller). Nope, we are lacking support for emulating or (securely) forwarding VT-d/IOMMU features to the first level guest. Would be cool to have, just not yet there. But I've talked to Intel people recently, and they are considering to support some nested VT-d with KVM. Jan -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: VGA passthrough of GTX 660 to KVM guest
On Wed, 2013-08-28 at 18:29 +0200, Gustav Sorenson wrote: > Hi Alex, > > thanks for the info you provided, although I have to admit I didn't > understand all of it - just so you don't overestimate my knowledge. :) > > Here's the output you requested: > > [0.32] vgaarb: device added: > PCI::00:01.0,decodes=io+mem,owns=io+mem,locks=none > [0.321120] vgaarb: device added: > PCI::01:00.0,decodes=io+mem,owns=none,locks=none > [0.321122] vgaarb: loaded > [0.321123] vgaarb: bridge control possible :01:00.0 > [0.321123] vgaarb: no bridge control possible :00:01.0 > [ 16.325255] hda-intel :01:00.1: Handle VGA-switcheroo audio client > [ 36.736262] vgaarb: this pci device is not a vga device > [ 36.736274] vgaarb: this pci device is not a vga device > [ 36.736292] vgaarb: this pci device is not a vga device > [ 36.736302] vgaarb: this pci device is not a vga device > [ 36.736312] vgaarb: this pci device is not a vga device > [ 36.736322] vgaarb: this pci device is not a vga device > [ 36.736332] vgaarb: this pci device is not a vga device > [ 36.736342] vgaarb: this pci device is not a vga device > [ 36.736352] vgaarb: this pci device is not a vga device > [ 36.736362] vgaarb: this pci device is not a vga device > [ 36.736373] vgaarb: this pci device is not a vga device > [ 36.736384] vgaarb: this pci device is not a vga device > [ 36.736395] vgaarb: this pci device is not a vga device > [ 36.736406] vgaarb: this pci device is not a vga device > [ 36.736417] vgaarb: this pci device is not a vga device > [ 36.736428] vgaarb: this pci device is not a vga device > [ 36.736440] vgaarb: this pci device is not a vga device > [ 36.736452] vgaarb: this pci device is not a vga device > [ 36.736464] vgaarb: this pci device is not a vga device > [ 36.736475] vgaarb: this pci device is not a vga device > [ 36.736487] vgaarb: this pci device is not a vga device > [ 36.736499] vgaarb: this pci device is not a vga device > [ 36.736511] vgaarb: this pci device is not a vga device > [ 
36.736524] vgaarb: this pci device is not a vga device > [ 36.736537] vgaarb: this pci device is not a vga device > [ 36.736549] vgaarb: this pci device is not a vga device > [ 36.736571] vgaarb: device changed decodes: > PCI::01:00.0,olddecodes=io+mem,decodes=none:owns=none This is a problem. Something told the VGA arbiter that 1:00.0 (assume this is the gtx660) does not decode VGA I/O or memory. That means we can't route it to the card when we try to assign it. These "this pci device is not a vga device" messages come from the write function of /dev/vga_arbiter, so something in userspace is doing this, and apparently doing it very arbitrarily since it must have tried ever device in the system. Are you running an X server on the host? If so, can you boot to a runlevel that doesn't start X and see if these messages are still present? Thanks, Alex > [ 36.736593] vgaarb: this pci device is not a vga device > [ 36.736607] vgaarb: this pci device is not a vga device > [ 36.736620] vgaarb: this pci device is not a vga device > > > Of course even if you get VGA working, you're still likely to get a Code > > 43 trying to enable the Nvidia driver, but I appreciate any efforts to > > help figure that out. Thanks, > > I'm happy to help where I can. > > Thank you very much for your efforts! -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: VGA passthrough of GTX 660 to KVM guest
Hi Alex, thanks for the info you provided, although I have to admit I didn't understand all of it - just so you don't overestimate my knowledge. :) Here's the output you requested: [0.32] vgaarb: device added: PCI::00:01.0,decodes=io+mem,owns=io+mem,locks=none [0.321120] vgaarb: device added: PCI::01:00.0,decodes=io+mem,owns=none,locks=none [0.321122] vgaarb: loaded [0.321123] vgaarb: bridge control possible :01:00.0 [0.321123] vgaarb: no bridge control possible :00:01.0 [ 16.325255] hda-intel :01:00.1: Handle VGA-switcheroo audio client [ 36.736262] vgaarb: this pci device is not a vga device [ 36.736274] vgaarb: this pci device is not a vga device [ 36.736292] vgaarb: this pci device is not a vga device [ 36.736302] vgaarb: this pci device is not a vga device [ 36.736312] vgaarb: this pci device is not a vga device [ 36.736322] vgaarb: this pci device is not a vga device [ 36.736332] vgaarb: this pci device is not a vga device [ 36.736342] vgaarb: this pci device is not a vga device [ 36.736352] vgaarb: this pci device is not a vga device [ 36.736362] vgaarb: this pci device is not a vga device [ 36.736373] vgaarb: this pci device is not a vga device [ 36.736384] vgaarb: this pci device is not a vga device [ 36.736395] vgaarb: this pci device is not a vga device [ 36.736406] vgaarb: this pci device is not a vga device [ 36.736417] vgaarb: this pci device is not a vga device [ 36.736428] vgaarb: this pci device is not a vga device [ 36.736440] vgaarb: this pci device is not a vga device [ 36.736452] vgaarb: this pci device is not a vga device [ 36.736464] vgaarb: this pci device is not a vga device [ 36.736475] vgaarb: this pci device is not a vga device [ 36.736487] vgaarb: this pci device is not a vga device [ 36.736499] vgaarb: this pci device is not a vga device [ 36.736511] vgaarb: this pci device is not a vga device [ 36.736524] vgaarb: this pci device is not a vga device [ 36.736537] vgaarb: this pci device is not a vga device [ 36.736549] vgaarb: this pci 
device is not a vga device [ 36.736571] vgaarb: device changed decodes: PCI::01:00.0,olddecodes=io+mem,decodes=none:owns=none [ 36.736593] vgaarb: this pci device is not a vga device [ 36.736607] vgaarb: this pci device is not a vga device [ 36.736620] vgaarb: this pci device is not a vga device > Of course even if you get VGA working, you're still likely to get a Code > 43 trying to enable the Nvidia driver, but I appreciate any efforts to > help figure that out. Thanks, I'm happy to help where I can. Thank you very much for your efforts!
Re: VGA passthrough of GTX 660 to KVM guest
On Wed, 2013-08-28 at 17:12 +0200, Gustav Sorenson wrote: > Hi Alex, > > thanks for your answer. > > Do you happen to know someone who might be able to judge whether or > not the radeon driver could be patched to support passthrough? Also, > could you please point me to the fixes to i915 you mentioned? I think > I might write to the radeon list to see whether they have ideas how to > tackle this issue, and I assume it may help them if they know what > fixed VGA arbitration on Haswell. > > Thanks again! VGA Arbiter: http://lists.freedesktop.org/archives/intel-gfx/2013-August/031963.html i915: http://lists.freedesktop.org/archives/intel-gfx/2013-August/032295.html The changes are very device dependent. In the case of newer Intel HD graphics the i915 driver does not need VGA memory access and only occasionally needs VGA I/O access. We can therefore disable VGA memory completely so that it doesn't need to participate in arbitration of that. Then the arbiter will disable the I/O bit in the PCI command register when another device requests VGA. Perhaps if you can send `dmesg | grep -i vga` from your system we can see how the arbiter is getting setup. On an Intel system, a simple test to disable VGA to IGD is to disable I/O and memory in the PCI command register: setpci -s 00:02.0 4.w=0 (of course you really want to read it first to restore it later, you only need to clear the bottom 2 bits) Of course even if you get VGA working, you're still likely to get a Code 43 trying to enable the Nvidia driver, but I appreciate any efforts to help figure that out. Thanks, Alex
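The read-first/clear-bits/restore sequence Alex describes can be sketched as below. The starting value 0407 is made up for illustration, and 00:02.0 is the poster's Intel IGD; both will differ on other systems. The bottom two bits of the PCI COMMAND register (config offset 4) are I/O-space enable (bit 0) and memory-space enable (bit 1).

```shell
orig=0407                                # as read by: setpci -s 00:02.0 4.w
new=$(printf '%04x' $((0x$orig & ~3)))   # clear bits 0 and 1 only
echo "setpci -s 00:02.0 4.w=$new"        # apply; later restore with 4.w=$orig
```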
Re: VGA passthrough of GTX 660 to KVM guest
Hi Alex, thanks for your answer. Do you happen to know someone who might be able to judge whether or not the radeon driver could be patched to support passthrough? Also, could you please point me to the fixes to i915 you mentioned? I think I might write to the radeon list to see whether they have ideas how to tackle this issue, and I assume it may help them if they know what fixed VGA arbitration on Haswell. Thanks again!
[PATCH] vfio-pci: Test for extended config space
Having PCIe/PCI-X capability isn't enough to assume that there are extended capabilities. Both specs define that the first capability header is all zero if there are no extended capabilities. Testing for this avoids an erroneous message about hiding capability 0x0 at offset 0x100. Signed-off-by: Alex Williamson --- drivers/vfio/pci/vfio_pci_config.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 8477963..bcf3624 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -1069,6 +1069,7 @@ static int vfio_vc_cap_len(struct vfio_pci_device *vdev, u16 pos) static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos) { struct pci_dev *pdev = vdev->pdev; + u32 dword; u16 word; u8 byte; int ret; @@ -1082,7 +1083,9 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos) return pcibios_err_to_errno(ret); if (PCI_X_CMD_VERSION(word)) { - vdev->extended_caps = true; + /* Test for extended capabilities */ + pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword); + vdev->extended_caps = (dword != 0); return PCI_CAP_PCIX_SIZEOF_V2; } else return PCI_CAP_PCIX_SIZEOF_V0; @@ -1094,9 +1097,11 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 cap, u8 pos) return byte; case PCI_CAP_ID_EXP: - /* length based on version */ - vdev->extended_caps = true; + /* Test for extended capabilities */ + pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword); + vdev->extended_caps = (dword != 0); + /* length based on version */ if ((pcie_caps_reg(pdev) & PCI_EXP_FLAGS_VERS) == 1) return PCI_CAP_EXP_ENDPOINT_SIZEOF_V1; else -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
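The check the patch adds can be pulled out as pure helpers. The helper names here are ours, but the field layout (capability ID in bits 15:0, version in 19:16, next pointer in 31:20) is the standard PCIe extended capability header; per the commit message, an all-zero first header at offset 0x100 means no extended capabilities exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_CFG_SPACE_SIZE 0x100	/* extended config space starts here */

/* An all-zero first extended capability header means the device has no
 * extended capabilities, even if it carries a PCIe or PCI-X capability. */
static bool has_extended_caps(uint32_t first_ecap_hdr)
{
	return first_ecap_hdr != 0;
}

static uint16_t ecap_id(uint32_t hdr)      { return hdr & 0xffff; }
static uint8_t  ecap_version(uint32_t hdr) { return (hdr >> 16) & 0xf; }
static uint16_t ecap_next(uint32_t hdr)    { return (hdr >> 20) & 0xffc; }
```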
Re: KVM: x86: update masterclock when kvmclock_offset is calculated (v2)
On Tue, Aug 27, 2013 at 11:55:29PM -0300, Marcelo Tosatti wrote: > > The offset to add to the hosts monotonic time, kvmclock_offset, is > calculated against the monotonic time at KVM_SET_CLOCK ioctl time. > > Request a master clock update at this time, to reduce a potentially > unbounded difference between the values of the masterclock and > the clock value used to calculate kvmclock_offset. > > Signed-off-by: Marcelo Tosatti > Applied, thanks. > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index d21bce5..0a93354 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -1457,6 +1457,29 @@ static void pvclock_update_vm_gtod_copy(struct kvm > *kvm) > #endif > } > > +static void kvm_gen_update_masterclock(struct kvm *kvm) > +{ > +#ifdef CONFIG_X86_64 > + int i; > + struct kvm_vcpu *vcpu; > + struct kvm_arch *ka = &kvm->arch; > + > + spin_lock(&ka->pvclock_gtod_sync_lock); > + kvm_make_mclock_inprogress_request(kvm); > + /* no guest entries from this point */ > + pvclock_update_vm_gtod_copy(kvm); > + > + kvm_for_each_vcpu(i, vcpu, kvm) > + set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests); > + > + /* guest entries allowed */ > + kvm_for_each_vcpu(i, vcpu, kvm) > + clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests); > + > + spin_unlock(&ka->pvclock_gtod_sync_lock); > +#endif > +} > + > static int kvm_guest_time_update(struct kvm_vcpu *v) > { > unsigned long flags, this_tsc_khz; > @@ -3806,6 +3829,7 @@ long kvm_arch_vm_ioctl(struct file *filp, > delta = user_ns.clock - now_ns; > local_irq_enable(); > kvm->arch.kvmclock_offset = delta; > + kvm_gen_update_masterclock(kvm); > break; > } > case KVM_GET_CLOCK: { > @@ -5689,29 +5713,6 @@ static void process_nmi(struct kvm_vcpu *vcpu) > kvm_make_request(KVM_REQ_EVENT, vcpu); > } > > -static void kvm_gen_update_masterclock(struct kvm *kvm) > -{ > -#ifdef CONFIG_X86_64 > - int i; > - struct kvm_vcpu *vcpu; > - struct kvm_arch *ka = &kvm->arch; > - > - spin_lock(&ka->pvclock_gtod_sync_lock); > - 
kvm_make_mclock_inprogress_request(kvm); > - /* no guest entries from this point */ > - pvclock_update_vm_gtod_copy(kvm); > - > - kvm_for_each_vcpu(i, vcpu, kvm) > - set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests); > - > - /* guest entries allowed */ > - kvm_for_each_vcpu(i, vcpu, kvm) > - clear_bit(KVM_REQ_MCLOCK_INPROGRESS, &vcpu->requests); > - > - spin_unlock(&ka->pvclock_gtod_sync_lock); > -#endif > -} > - > static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu) > { > u64 eoi_exit_bitmap[4]; > -- Gleb.
Re: VGA passthrough of GTX 660 to KVM guest
On Wed, 2013-08-28 at 15:11 +0200, Gustav Sorenson wrote: > Hello everyone, > > I started a thread on this list on July 12, named "AMD integrated > graphics passthrough to KVM guest," so if you should have a deja-vu, > that's probably why, since the heart of the problem is still the same. > > I've got an AMD A10 6800 APU with integrated graphics, running on aan > ASRock FM2A75 Pro4 mainboard with the latest firmware. Since my last > thread, I've acquired an EVGA GTX 660 graphics card, which I'd like to > pass through to a KVM guest. > > I'm following this guide to achieve my goal, including compiling the > suggested patched versions of the kernel, seabios and qemu: > https://bbs.archlinux.org/viewtopic.php?id=162768 > > The integrated graphics adapter is the primary one, and the only one > that is used by the host system. nouveau and nvidia modules are > blacklisted and not in lsmod. > lspci -vvv output: http://pastebin.com/S6znCLK7 > lspci -t output: http://pastebin.com/YH01atEy > > Contrary to what I hoped, invoking > qemu-system-x86_64 -enable-kvm -M q35 -m 1024 -cpu host -smp > 2,sockets=1,cores=2,threads=1 -bios /usr/share/qemu/bios.bin -vga none > -device > ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 > -device vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on > -device vfio-pci,host=01:00.1,bus=root.1,addr=00.1 > after assigning the devices to vfio as suggested in the guide, does > not show anything on the display attached to the GTX 660, or even wake > the display up from standby, but instead gives this output on the > console: > qemu-system-x86_64: -device > vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on: > Warning, device :01:00.0 does not support reset > qemu-system-x86_64: -device > vfio-pci,host=01:00.1,bus=root.1,addr=00.1: Warning, device > :01:00.1 does not support reset > More often than not, this also completely freezes the host system. 
> Later in the forum I linked, it is suggested to pass the graphics card
> not as the guest's primary, but use pci-assign instead:
> qemu-system-x86_64 -enable-kvm -m 2048 -cpu host -smp
> 2,sockets=1,cores=2,threads=1 -bios /usr/share/qemu/bios.bin -vga std
> -vnc :0 -drive file=/mnt/orthowin/current-new.img,format=qcow2
> -usbdevice tablet -net user,restrict=y -usb -device
> pci-assign,host=01:00.0 -device pci-assign,host=01:00.1
>
> This allows me to boot the Windows 7 64bit guest. After installing the
> nVidia drivers, shutting the guest down, rebooting the host and
> booting the guest again, the Windows device manager shows "Code 43"
> for the graphics card (which even is correctly labelled as GTX 660),
> and the nVidia user interface doesn't show any GPU at all.

I've beaten my head against Code 43 as well, in fact with the same card.
I'm using vfio on an AMD 990fx system (no APU); the card works great as
an assigned VGA device and the nvidia driver works in the guest. When I
move the gtx660 to a Haswell-based system with IGD, I get stuck with a
Code 43. I am able to get VGA working to the card, but it required fixes
to i915 in the host to get VGA arbitration to work. The radeon driver
may not be doing VGA arbitration correctly either if you couldn't get
VGA with vfio.

The only thing I can think might be causing problems is that the gtx660
is a PCIe gen3 card and the 990fx is only capable of gen2. The mismatch
between our QEMU root ports and the physical device may be too great
when the card is running gen3. However, if I set my BIOS to force gen2,
it doesn't make any difference. I also get the same Code 43 when I
attach the GPU to a QEMU 440fx configuration, so PCIe spec revisions may
be a red herring. The nouveau driver seems to work though.

Thanks,
Alex

> What could I try to resolve this issue?
> I realize that I might have omitted some information that may be
> important for troubleshooting, but I'm afraid I'm not experienced
> enough to even know what may be relevant. Please let me know if you'd
> like to have some further information.
> Please also note that I'm not running the latest kernel/qemu/seabios,
> so maybe some changes made since I compiled 3.10.1 might help me? If
> so, please suggest a commit or release to try. I'm not afraid of
> messing up my operating system install, but if anything suggested
> might brick the hardware, I'd like to know in advance. :)
>
> I'm very grateful for any kind of input!
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/10] kvm: i386: fix LAPIC TSC deadline timer save/restore
From: Marcelo Tosatti The configuration of the timer represented by MSR_IA32_TSCDEADLINE depends on: - APIC LVT Timer register. - TSC value. Change the order to respect the dependency. Signed-off-by: Marcelo Tosatti Signed-off-by: Paolo Bonzini --- target-i386/kvm.c | 29 ++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 7bb8455..58f7bb7 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1073,6 +1073,26 @@ static void kvm_msr_entry_set(struct kvm_msr_entry *entry, entry->data = value; } +static int kvm_put_tscdeadline_msr(X86CPU *cpu) +{ +CPUX86State *env = &cpu->env; +struct { +struct kvm_msrs info; +struct kvm_msr_entry entries[1]; +} msr_data; +struct kvm_msr_entry *msrs = msr_data.entries; + +if (!has_msr_tsc_deadline) { +return 0; +} + +kvm_msr_entry_set(&msrs[0], MSR_IA32_TSCDEADLINE, env->tsc_deadline); + +msr_data.info.nmsrs = 1; + +return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, &msr_data); +} + static int kvm_put_msrs(X86CPU *cpu, int level) { CPUX86State *env = &cpu->env; @@ -1096,9 +1116,6 @@ static int kvm_put_msrs(X86CPU *cpu, int level) if (has_msr_tsc_adjust) { kvm_msr_entry_set(&msrs[n++], MSR_TSC_ADJUST, env->tsc_adjust); } -if (has_msr_tsc_deadline) { -kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSCDEADLINE, env->tsc_deadline); -} if (has_msr_misc_enable) { kvm_msr_entry_set(&msrs[n++], MSR_IA32_MISC_ENABLE, env->msr_ia32_misc_enable); @@ -1808,6 +1825,12 @@ int kvm_arch_put_registers(CPUState *cpu, int level) return ret; } } + +ret = kvm_put_tscdeadline_msr(x86_cpu); +if (ret < 0) { +return ret; +} + ret = kvm_put_vcpu_events(x86_cpu, level); if (ret < 0) { return ret; -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/10] kvm: migrate vPMU state
From: Paolo Bonzini Reviewed-by: Gleb Natapov Signed-off-by: Paolo Bonzini --- target-i386/cpu.h | 23 target-i386/kvm.c | 93 ++--- target-i386/machine.c | 44 +++ 3 files changed, 155 insertions(+), 5 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index af4c0f7..31de265 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -305,6 +305,8 @@ #define MSR_TSC_ADJUST 0x003b #define MSR_IA32_TSCDEADLINE0x6e0 +#define MSR_P6_PERFCTR0 0xc1 + #define MSR_MTRRcap 0xfe #define MSR_MTRRcap_VCNT8 #define MSR_MTRRcap_FIXRANGE_SUPPORT(1 << 8) @@ -318,6 +320,8 @@ #define MSR_MCG_STATUS 0x17a #define MSR_MCG_CTL 0x17b +#define MSR_P6_EVNTSEL0 0x186 + #define MSR_IA32_PERF_STATUS0x198 #define MSR_IA32_MISC_ENABLE0x1a0 @@ -343,6 +347,14 @@ #define MSR_MTRRdefType 0x2ff +#define MSR_CORE_PERF_FIXED_CTR00x309 +#define MSR_CORE_PERF_FIXED_CTR10x30a +#define MSR_CORE_PERF_FIXED_CTR20x30b +#define MSR_CORE_PERF_FIXED_CTR_CTRL0x38d +#define MSR_CORE_PERF_GLOBAL_STATUS 0x38e +#define MSR_CORE_PERF_GLOBAL_CTRL 0x38f +#define MSR_CORE_PERF_GLOBAL_OVF_CTRL 0x390 + #define MSR_MC0_CTL 0x400 #define MSR_MC0_STATUS 0x401 #define MSR_MC0_ADDR0x402 @@ -721,6 +733,9 @@ typedef struct { #define CPU_NB_REGS CPU_NB_REGS32 #endif +#define MAX_FIXED_COUNTERS 3 +#define MAX_GP_COUNTERS(MSR_IA32_PERF_STATUS - MSR_P6_EVNTSEL0) + #define NB_MMU_MODES 3 typedef enum TPRAccess { @@ -816,6 +831,14 @@ typedef struct CPUX86State { uint64_t msr_ia32_misc_enable; uint64_t msr_ia32_feature_control; +uint64_t msr_fixed_ctr_ctrl; +uint64_t msr_global_ctrl; +uint64_t msr_global_status; +uint64_t msr_global_ovf_ctrl; +uint64_t msr_fixed_counters[MAX_FIXED_COUNTERS]; +uint64_t msr_gp_counters[MAX_GP_COUNTERS]; +uint64_t msr_gp_evtsel[MAX_GP_COUNTERS]; + /* exception/interrupt handling */ int error_code; int exception_is_int; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 84ac00a..513ae52 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -71,6 +71,9 @@ static bool has_msr_misc_enable; 
static bool has_msr_kvm_steal_time; static int lm_capable_kernel; +static bool has_msr_architectural_pmu; +static uint32_t num_architectural_pmu_counters; + bool kvm_allows_irq0_override(void) { return !kvm_irqchip_in_kernel() || kvm_has_gsi_routing(); @@ -581,6 +584,25 @@ int kvm_arch_init_vcpu(CPUState *cs) break; } } + +if (limit >= 0x0a) { +uint32_t ver; + +cpu_x86_cpuid(env, 0x0a, 0, &ver, &unused, &unused, &unused); +if ((ver & 0xff) > 0) { +has_msr_architectural_pmu = true; +num_architectural_pmu_counters = (ver & 0xff00) >> 8; + +/* Shouldn't be more than 32, since that's the number of bits + * available in EBX to tell us _which_ counters are available. + * Play it safe. + */ +if (num_architectural_pmu_counters > MAX_GP_COUNTERS) { +num_architectural_pmu_counters = MAX_GP_COUNTERS; +} +} +} + cpu_x86_cpuid(env, 0x8000, 0, &limit, &unused, &unused, &unused); for (i = 0x8000; i <= limit; i++) { @@ -1052,7 +1074,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level) struct kvm_msr_entry entries[100]; } msr_data; struct kvm_msr_entry *msrs = msr_data.entries; -int n = 0; +int n = 0, i; kvm_msr_entry_set(&msrs[n++], MSR_IA32_SYSENTER_CS, env->sysenter_cs); kvm_msr_entry_set(&msrs[n++], MSR_IA32_SYSENTER_ESP, env->sysenter_esp); @@ -1094,9 +1116,8 @@ static int kvm_put_msrs(X86CPU *cpu, int level) } } /* - * The following paravirtual MSRs have side effects on the guest or are - * too heavy for normal writeback. Limit them to reset or full state - * updates. + * The following MSRs have side effects on the guest or are too heavy + * for normal writeback. Limit them to reset or full state updates. */ if (level >= KVM_PUT_RESET_STATE) { kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME, @@ -1114,6 +1135,33 @@ static int kvm_put_msrs(X86CPU *cpu, int level) kvm_msr_entry_set(&msrs[n++], MSR_KVM_STEAL_TIME, env->steal_time_msr); } +if (has_msr_architectural_pmu) { +/* Stop the counter. 
*/ +kvm_msr_entry_set(&msrs[n++], MSR_CORE_PERF_FIXED_CTR_CTRL, 0); +kvm_msr_entry_set(&msrs[n++], MSR_CORE_PERF_GLOBAL_CTRL, 0); + +/* Set the counter values. */ +for (i = 0; i < MAX_FI
[PATCH 07/10] kvm-all.c: max_cpus should not exceed KVM vcpu limit
From: Marcelo Tosatti

maxcpus, which specifies the maximum number of hotpluggable CPUs, should
not exceed KVM's vcpu limit.

Signed-off-by: Marcelo Tosatti
[Reword message. - Paolo]
Signed-off-by: Paolo Bonzini
---
 kvm-all.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kvm-all.c b/kvm-all.c
index ef52a0f..a2d4978 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1400,6 +1400,13 @@ int kvm_init(void)
         goto err;
     }
 
+    if (max_cpus > max_vcpus) {
+        ret = -EINVAL;
+        fprintf(stderr, "Number of hotpluggable cpus requested (%d) exceeds max cpus "
+                "supported by KVM (%d)\n", max_cpus, max_vcpus);
+        goto err;
+    }
+
     s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0);
     if (s->vmfd < 0) {
 #ifdef TARGET_S390X
--
1.7.10.4
[PATCH 06/10] kvm: Simplify kvm_handle_io
From: Jan Kiszka

Now that cpu_in/out is just a wrapper around address_space_rw, we can
also call the latter directly. As host endianness == guest endianness,
there is no need for the memory access helpers st*_p/ld*_p as well.

Signed-off-by: Jan Kiszka
Signed-off-by: Paolo Bonzini
---
 kvm-all.c | 28 ++--
 1 file changed, 2 insertions(+), 26 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index bfa4aac..ef52a0f 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1508,32 +1508,8 @@ static void kvm_handle_io(uint16_t port, void *data, int direction, int size,
     uint8_t *ptr = data;
 
     for (i = 0; i < count; i++) {
-        if (direction == KVM_EXIT_IO_IN) {
-            switch (size) {
-            case 1:
-                stb_p(ptr, cpu_inb(port));
-                break;
-            case 2:
-                stw_p(ptr, cpu_inw(port));
-                break;
-            case 4:
-                stl_p(ptr, cpu_inl(port));
-                break;
-            }
-        } else {
-            switch (size) {
-            case 1:
-                cpu_outb(port, ldub_p(ptr));
-                break;
-            case 2:
-                cpu_outw(port, lduw_p(ptr));
-                break;
-            case 4:
-                cpu_outl(port, ldl_p(ptr));
-                break;
-            }
-        }
-
+        address_space_rw(&address_space_io, port, ptr, size,
+                         direction == KVM_EXIT_IO_OUT);
         ptr += size;
     }
 }
--
1.7.10.4
[PATCH 04/10] kvm: add KVM_IRQFD_FLAG_RESAMPLE support
From: Vincenzo Maffione Added an EventNotifier* parameter to kvm-all.c:kvm_irqchip_add_irqfd_notifier(), in order to give KVM another eventfd to be used as "resamplefd". See the documentation in the linux kernel sources in Documentation/virtual/kvm/api.txt (section 4.75) for more details. When the added parameter is passed NULL, the behaviour of the function is unchanged with respect to the previous versions. Reviewed-by: Paolo Bonzini Signed-off-by: Vincenzo Maffione Signed-off-by: Paolo Bonzini --- hw/misc/vfio.c |4 ++-- hw/virtio/virtio-pci.c |2 +- include/sysemu/kvm.h |3 ++- kvm-all.c | 17 + 4 files changed, 18 insertions(+), 8 deletions(-) diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c index ad8ce77..54af34a 100644 --- a/hw/misc/vfio.c +++ b/hw/misc/vfio.c @@ -646,7 +646,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr, vector->virq = msg ? kvm_irqchip_add_msi_route(kvm_state, *msg) : -1; if (vector->virq < 0 || kvm_irqchip_add_irqfd_notifier(kvm_state, &vector->interrupt, - vector->virq) < 0) { + NULL, vector->virq) < 0) { if (vector->virq >= 0) { kvm_irqchip_release_virq(kvm_state, vector->virq); vector->virq = -1; @@ -814,7 +814,7 @@ retry: vector->virq = kvm_irqchip_add_msi_route(kvm_state, msg); if (vector->virq < 0 || kvm_irqchip_add_irqfd_notifier(kvm_state, &vector->interrupt, - vector->virq) < 0) { + NULL, vector->virq) < 0) { qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt), vfio_msi_interrupt, NULL, vector); } diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c index c38cfd1..c4db407 100644 --- a/hw/virtio/virtio-pci.c +++ b/hw/virtio/virtio-pci.c @@ -508,7 +508,7 @@ static int kvm_virtio_pci_irqfd_use(VirtIOPCIProxy *proxy, VirtQueue *vq = virtio_get_queue(proxy->vdev, queue_no); EventNotifier *n = virtio_queue_get_guest_notifier(vq); int ret; -ret = kvm_irqchip_add_irqfd_notifier(kvm_state, n, irqfd->virq); +ret = kvm_irqchip_add_irqfd_notifier(kvm_state, n, NULL, irqfd->virq); return ret; } diff --git 
a/include/sysemu/kvm.h b/include/sysemu/kvm.h index f8ac448..ce3efaf 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -309,7 +309,8 @@ int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg); int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg); void kvm_irqchip_release_virq(KVMState *s, int virq); -int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, int virq); +int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, + EventNotifier *rn, int virq); int kvm_irqchip_remove_irqfd_notifier(KVMState *s, EventNotifier *n, int virq); void kvm_pc_gsi_handler(void *opaque, int n, int level); void kvm_pc_setup_irq_routing(bool pci_enabled); diff --git a/kvm-all.c b/kvm-all.c index 4fb4ccb..bfa4aac 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -1230,7 +1230,8 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg) return kvm_update_routing_entry(s, &kroute); } -static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int virq, bool assign) +static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int rfd, int virq, +bool assign) { struct kvm_irqfd irqfd = { .fd = fd, @@ -1238,6 +1239,11 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int virq, bool assign) .flags = assign ? 0 : KVM_IRQFD_FLAG_DEASSIGN, }; +if (rfd != -1) { +irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE; +irqfd.resamplefd = rfd; +} + if (!kvm_irqfds_enabled()) { return -ENOSYS; } @@ -1276,14 +1282,17 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg) } #endif /* !KVM_CAP_IRQ_ROUTING */ -int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, int virq) +int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, + EventNotifier *rn, int virq) { -return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), virq, true); +return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), + rn ? 
event_notifier_get_fd(rn) : -1, virq, true); } int kvm_irqchip_remove_irqfd_notifier(KVMState *s, EventNotifier *n, int virq) { -return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), virq, false); +return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), -1, virq, + false); } static int kvm_irqchip_create(KVMState *s) -- 1.7.10.4
[PATCH 01/10] Initialize IA32_FEATURE_CONTROL MSR in reset and migration
From: Arthur Chunqi Li The recent KVM patch adds IA32_FEATURE_CONTROL support. QEMU needs to clear this MSR when reset vCPU and keep the value of it when migration. This patch add this feature. Signed-off-by: Arthur Chunqi Li Signed-off-by: Gleb Natapov --- target-i386/cpu.h |2 ++ target-i386/kvm.c |4 target-i386/machine.c | 22 ++ 3 files changed, 28 insertions(+) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index cedefdc..3a52f94 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -301,6 +301,7 @@ #define MSR_IA32_APICBASE_BSP (1<<8) #define MSR_IA32_APICBASE_ENABLE(1<<11) #define MSR_IA32_APICBASE_BASE (0xf<<12) +#define MSR_IA32_FEATURE_CONTROL0x003a #define MSR_TSC_ADJUST 0x003b #define MSR_IA32_TSCDEADLINE0x6e0 @@ -813,6 +814,7 @@ typedef struct CPUX86State { uint64_t mcg_status; uint64_t msr_ia32_misc_enable; +uint64_t msr_ia32_feature_control; /* exception/interrupt handling */ int error_code; diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 3c9d10a..84ac00a 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1121,6 +1121,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level) if (hyperv_vapic_recommended()) { kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0); } +kvm_msr_entry_set(&msrs[n++], MSR_IA32_FEATURE_CONTROL, env->msr_ia32_feature_control); } if (env->mcg_cap) { int i; @@ -1345,6 +1346,7 @@ static int kvm_get_msrs(X86CPU *cpu) if (has_msr_misc_enable) { msrs[n++].index = MSR_IA32_MISC_ENABLE; } +msrs[n++].index = MSR_IA32_FEATURE_CONTROL; if (!env->tsc_valid) { msrs[n++].index = MSR_IA32_TSC; @@ -1443,6 +1445,8 @@ static int kvm_get_msrs(X86CPU *cpu) case MSR_IA32_MISC_ENABLE: env->msr_ia32_misc_enable = msrs[i].data; break; +case MSR_IA32_FEATURE_CONTROL: +env->msr_ia32_feature_control = msrs[i].data; default: if (msrs[i].index >= MSR_MC0_CTL && msrs[i].index < MSR_MC0_CTL + (env->mcg_cap & 0xff) * 4) { diff --git a/target-i386/machine.c b/target-i386/machine.c index f9ec581..0d2088e 100644 --- 
a/target-i386/machine.c +++ b/target-i386/machine.c @@ -435,6 +435,14 @@ static bool misc_enable_needed(void *opaque) return env->msr_ia32_misc_enable != MSR_IA32_MISC_ENABLE_DEFAULT; } +static bool feature_control_needed(void *opaque) +{ +X86CPU *cpu = opaque; +CPUX86State *env = &cpu->env; + +return env->msr_ia32_feature_control != 0; +} + static const VMStateDescription vmstate_msr_ia32_misc_enable = { .name = "cpu/msr_ia32_misc_enable", .version_id = 1, @@ -446,6 +454,17 @@ static const VMStateDescription vmstate_msr_ia32_misc_enable = { } }; +static const VMStateDescription vmstate_msr_ia32_feature_control = { +.name = "cpu/msr_ia32_feature_control", +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT64(env.msr_ia32_feature_control, X86CPU), +VMSTATE_END_OF_LIST() +} +}; + const VMStateDescription vmstate_x86_cpu = { .name = "cpu", .version_id = 12, @@ -571,6 +590,9 @@ const VMStateDescription vmstate_x86_cpu = { }, { .vmsd = &vmstate_msr_ia32_misc_enable, .needed = misc_enable_needed, +}, { +.vmsd = &vmstate_msr_ia32_feature_control, +.needed = feature_control_needed, } , { /* empty */ } -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/10] target-i386: remove tabs from target-i386/cpu.h
From: Paolo Bonzini Signed-off-by: Paolo Bonzini --- target-i386/cpu.h | 192 ++--- 1 file changed, 96 insertions(+), 96 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 3a52f94..af4c0f7 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -37,9 +37,9 @@ #define TARGET_HAS_ICE 1 #ifdef TARGET_X86_64 -#define ELF_MACHINEEM_X86_64 +#define ELF_MACHINE EM_X86_64 #else -#define ELF_MACHINEEM_386 +#define ELF_MACHINE EM_386 #endif #define CPUArchState struct CPUX86State @@ -98,10 +98,10 @@ #define DESC_TSS_BUSY_MASK (1 << 9) /* eflags masks */ -#define CC_C 0x0001 -#define CC_P 0x0004 -#define CC_A 0x0010 -#define CC_Z 0x0040 +#define CC_C0x0001 +#define CC_P0x0004 +#define CC_A0x0010 +#define CC_Z0x0040 #define CC_S0x0080 #define CC_O0x0800 @@ -109,14 +109,14 @@ #define IOPL_SHIFT 12 #define VM_SHIFT 17 -#define TF_MASK0x0100 -#define IF_MASK0x0200 -#define DF_MASK0x0400 -#define IOPL_MASK 0x3000 -#define NT_MASK0x4000 -#define RF_MASK0x0001 -#define VM_MASK0x0002 -#define AC_MASK0x0004 +#define TF_MASK 0x0100 +#define IF_MASK 0x0200 +#define DF_MASK 0x0400 +#define IOPL_MASK 0x3000 +#define NT_MASK 0x4000 +#define RF_MASK 0x0001 +#define VM_MASK 0x0002 +#define AC_MASK 0x0004 #define VIF_MASK0x0008 #define VIP_MASK0x0010 #define ID_MASK 0x0020 @@ -238,28 +238,28 @@ #define DR7_TYPE_IO_RW 0x2 #define DR7_TYPE_DATA_RW 0x3 -#define PG_PRESENT_BIT 0 -#define PG_RW_BIT 1 -#define PG_USER_BIT2 -#define PG_PWT_BIT 3 -#define PG_PCD_BIT 4 -#define PG_ACCESSED_BIT5 -#define PG_DIRTY_BIT 6 -#define PG_PSE_BIT 7 -#define PG_GLOBAL_BIT 8 -#define PG_NX_BIT 63 +#define PG_PRESENT_BIT 0 +#define PG_RW_BIT 1 +#define PG_USER_BIT 2 +#define PG_PWT_BIT 3 +#define PG_PCD_BIT 4 +#define PG_ACCESSED_BIT 5 +#define PG_DIRTY_BIT6 +#define PG_PSE_BIT 7 +#define PG_GLOBAL_BIT 8 +#define PG_NX_BIT 63 #define PG_PRESENT_MASK (1 << PG_PRESENT_BIT) -#define PG_RW_MASK (1 << PG_RW_BIT) -#define PG_USER_MASK(1 << PG_USER_BIT) -#define PG_PWT_MASK (1 << PG_PWT_BIT) -#define 
PG_PCD_MASK (1 << PG_PCD_BIT) +#define PG_RW_MASK (1 << PG_RW_BIT) +#define PG_USER_MASK (1 << PG_USER_BIT) +#define PG_PWT_MASK (1 << PG_PWT_BIT) +#define PG_PCD_MASK (1 << PG_PCD_BIT) #define PG_ACCESSED_MASK (1 << PG_ACCESSED_BIT) -#define PG_DIRTY_MASK (1 << PG_DIRTY_BIT) -#define PG_PSE_MASK (1 << PG_PSE_BIT) -#define PG_GLOBAL_MASK (1 << PG_GLOBAL_BIT) +#define PG_DIRTY_MASK(1 << PG_DIRTY_BIT) +#define PG_PSE_MASK (1 << PG_PSE_BIT) +#define PG_GLOBAL_MASK (1 << PG_GLOBAL_BIT) #define PG_HI_USER_MASK 0x7ff0LL -#define PG_NX_MASK (1LL << PG_NX_BIT) +#define PG_NX_MASK (1LL << PG_NX_BIT) #define PG_ERROR_W_BIT 1 @@ -269,32 +269,32 @@ #define PG_ERROR_RSVD_MASK 0x08 #define PG_ERROR_I_D_MASK 0x10 -#define MCG_CTL_P (1ULL<<8) /* MCG_CAP register available */ -#define MCG_SER_P (1ULL<<24) /* MCA recovery/new status bits */ +#define MCG_CTL_P (1ULL<<8) /* MCG_CAP register available */ +#define MCG_SER_P (1ULL<<24) /* MCA recovery/new status bits */ -#define MCE_CAP_DEF(MCG_CTL_P|MCG_SER_P) -#define MCE_BANKS_DEF 10 +#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P) +#define MCE_BANKS_DEF 10 -#define MCG_STATUS_RIPV(1ULL<<0) /* restart ip valid */ -#define MCG_STATUS_EIPV(1ULL<<1) /* ip points to correct instruction */ -#define MCG_STATUS_MCIP(1ULL<<2) /* machine check in progress */ +#define MCG_STATUS_RIPV (1ULL<<0) /* restart ip valid */ +#define MCG_STATUS_EIPV (1ULL<<1) /* ip points to correct instruction */ +#define MCG_STATUS_MCIP (1ULL<<2) /* machine check in progress */ -#define MCI_STATUS_VAL (1ULL<<63) /* valid error */ -#define MCI_STATUS_OVER(1ULL<<62) /* previous errors lost */ -#define MCI_STATUS_UC (1ULL<<61) /* uncorrected error */ -#define MCI_STATUS_EN (1ULL<<60) /* error enabled */ -#define MCI_STATUS_MISCV (1ULL<<59) /* misc error reg. valid */ -#define MCI_STATUS_ADDRV (1ULL<<58) /* addr reg. 
valid */ -#define MCI_STATUS_PCC (1ULL<<57) /* processor context corrupt */ -#define MCI_STATUS_S (1ULL<<56) /* Signaled machine check */ -#define MCI_STATUS_AR (1ULL<<55) /* Action required */ +#define MCI_STATUS_VAL (1ULL<<63) /* valid error */ +#define MCI_STATUS_OVER (1ULL<<62) /* previous errors lost */ +#define MCI_STA
[PATCH v3 00/10] [PULL] qemu-kvm.git uq/master queue
Anthony,

This obsoletes "[PULL v2 0/9] KVM changes for 2013-08-23"

The following changes since commit f03d07d4683b2e8325a7cb60b4e14b977b1a869c:

  Merge remote-tracking branch 'quintela/migration.next' into staging (2013-07-23 10:57:23 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master

for you to fetch changes up to 821c808bd1863efc2c1e977800ae77db633a185c:

  kvm-stub: fix compilation (2013-08-28 17:07:02 +0300)

Arthur Chunqi Li (1):
      Initialize IA32_FEATURE_CONTROL MSR in reset and migration

Jan Kiszka (1):
      kvm: Simplify kvm_handle_io

Liu Jinsong (1):
      kvm: x86: fix setting IA32_FEATURE_CONTROL with nested VMX disabled

Marcelo Tosatti (2):
      kvm-all.c: max_cpus should not exceed KVM vcpu limit
      kvm: i386: fix LAPIC TSC deadline timer save/restore

Paolo Bonzini (3):
      target-i386: remove tabs from target-i386/cpu.h
      kvm: migrate vPMU state
      kvm-stub: fix compilation

Vincenzo Maffione (1):
      kvm: add KVM_IRQFD_FLAG_RESAMPLE support

Wei Yang (1):
      kvm: shorten the parameter list for get_real_device()

 hw/i386/kvm/pci-assign.c |    9 +-
 hw/misc/vfio.c           |    4 +-
 hw/virtio/virtio-pci.c   |    2 +-
 include/sysemu/kvm.h     |    3 +-
 kvm-all.c                |   52 +--
 kvm-stub.c               |    3 +-
 target-i386/cpu.h        |  217 ++
 target-i386/kvm.c        |  139 +++--
 target-i386/machine.c    |   66 ++
 9 files changed, 351 insertions(+), 144 deletions(-)
[PATCH 09/10] kvm: shorten the parameter list for get_real_device()
From: Wei Yang

get_real_device() has 5 parameters, and the last 4 are already contained
in the first structure. This patch removes the last 4 parameters and
uses them directly from the first parameter.

Acked-by: Alex Williamson
Signed-off-by: Wei Yang
Signed-off-by: Paolo Bonzini
---
 hw/i386/kvm/pci-assign.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
index ff33dc8..73941b2 100644
--- a/hw/i386/kvm/pci-assign.c
+++ b/hw/i386/kvm/pci-assign.c
@@ -568,8 +568,7 @@ static int get_real_device_id(const char *devpath, uint16_t *val)
     return get_real_id(devpath, "device", val);
 }
 
-static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg,
-                           uint8_t r_bus, uint8_t r_dev, uint8_t r_func)
+static int get_real_device(AssignedDevice *pci_dev)
 {
     char dir[128], name[128];
     int fd, r = 0, v;
@@ -582,7 +581,8 @@
     dev->region_number = 0;
 
     snprintf(dir, sizeof(dir), "/sys/bus/pci/devices/%04x:%02x:%02x.%x/",
-             r_seg, r_bus, r_dev, r_func);
+             pci_dev->host.domain, pci_dev->host.bus,
+             pci_dev->host.slot, pci_dev->host.function);
 
     snprintf(name, sizeof(name), "%sconfig", dir);
 
@@ -1769,8 +1769,7 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
     memcpy(dev->emulate_config_write, dev->emulate_config_read,
            sizeof(dev->emulate_config_read));
 
-    if (get_real_device(dev, dev->host.domain, dev->host.bus,
-                        dev->host.slot, dev->host.function)) {
+    if (get_real_device(dev)) {
         error_report("pci-assign: Error: Couldn't get real device (%s)!",
                      dev->dev.qdev.id);
         goto out;
--
1.7.10.4
[PATCH 10/10] kvm-stub: fix compilation
From: Paolo Bonzini

Non-KVM targets fail compilation on the uq/master branch. Fix the
prototype of kvm_irqchip_add_irqfd_notifier to match the one in
kvm-all.c.

Signed-off-by: Paolo Bonzini
Signed-off-by: Gleb Natapov
---
 kvm-stub.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kvm-stub.c b/kvm-stub.c
index 7b2233a..806b044 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -135,7 +135,8 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
     return -ENOSYS;
 }
 
-int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, int virq)
+int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,
+                                   EventNotifier *rn, int virq)
 {
     return -ENOSYS;
 }
--
1.7.10.4
[PATCH 05/10] kvm: x86: fix setting IA32_FEATURE_CONTROL with nested VMX disabled
From: Liu Jinsong This patch is to fix the bug https://bugs.launchpad.net/qemu-kvm/+bug/1207623 IA32_FEATURE_CONTROL is pointless if not expose VMX or SMX bits to cpuid.1.ecx of vcpu. Current qemu-kvm will error return when kvm_put_msrs or kvm_get_msrs. Signed-off-by: Liu Jinsong Signed-off-by: Paolo Bonzini --- target-i386/kvm.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 513ae52..7bb8455 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -65,6 +65,7 @@ static bool has_msr_star; static bool has_msr_hsave_pa; static bool has_msr_tsc_adjust; static bool has_msr_tsc_deadline; +static bool has_msr_feature_control; static bool has_msr_async_pf_en; static bool has_msr_pv_eoi_en; static bool has_msr_misc_enable; @@ -666,6 +667,12 @@ int kvm_arch_init_vcpu(CPUState *cs) qemu_add_vm_change_state_handler(cpu_update_state, env); +c = cpuid_find_entry(&cpuid_data.cpuid, 1, 0); +if (c) { +has_msr_feature_control = !!(c->ecx & CPUID_EXT_VMX) || + !!(c->ecx & CPUID_EXT_SMX); +} + cpuid_data.cpuid.padding = 0; r = kvm_vcpu_ioctl(cs, KVM_SET_CPUID2, &cpuid_data); if (r) { @@ -1169,7 +1176,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level) if (hyperv_vapic_recommended()) { kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0); } -kvm_msr_entry_set(&msrs[n++], MSR_IA32_FEATURE_CONTROL, env->msr_ia32_feature_control); +if (has_msr_feature_control) { +kvm_msr_entry_set(&msrs[n++], MSR_IA32_FEATURE_CONTROL, + env->msr_ia32_feature_control); +} } if (env->mcg_cap) { int i; @@ -1394,7 +1404,9 @@ static int kvm_get_msrs(X86CPU *cpu) if (has_msr_misc_enable) { msrs[n++].index = MSR_IA32_MISC_ENABLE; } -msrs[n++].index = MSR_IA32_FEATURE_CONTROL; +if (has_msr_feature_control) { +msrs[n++].index = MSR_IA32_FEATURE_CONTROL; +} if (!env->tsc_valid) { msrs[n++].index = MSR_IA32_TSC; @@ -1509,6 +1521,7 @@ static int kvm_get_msrs(X86CPU *cpu) break; case MSR_IA32_FEATURE_CONTROL: 
env->msr_ia32_feature_control = msrs[i].data; +break; default: if (msrs[i].index >= MSR_MC0_CTL && msrs[i].index < MSR_MC0_CTL + (env->mcg_cap & 0xff) * 4) { -- 1.7.10.4
Re: [PATCH] arch: powerpc: kvm: add signed type cast for comparation
On 05.08.2013, at 07:50, Chen Gang wrote:

> On 08/05/2013 12:34 PM, Paul Mackerras wrote:
>> On Mon, Jul 22, 2013 at 02:32:35PM +0800, Chen Gang wrote:
>>> 'rmls' is 'unsigned long', lpcr_rmls() will return a negative number
>>> when failure occurs, so it needs a type cast for the comparison.
>>>
>>> 'lpid' is 'unsigned long', kvmppc_alloc_lpid() returns a negative
>>> number when failure occurs, so it needs a type cast for the comparison.
>>>
>>> Signed-off-by: Chen Gang
>>
>> Looks right, thanks.
>>
>> Acked-by: Paul Mackerras
>
> Thank you very much.

Thanks, applied to kvm-ppc-queue.

Alex
Re: [PATCH] KVM: PPC: Book3S PR: return appropriate error when allocation fails
On 17.07.2013, at 17:10, Thadeu Lima de Souza Cascardo wrote:

> err was overwritten by a previous function call, and checked to be 0. If
> the following page allocation fails, 0 is going to be returned instead
> of -ENOMEM.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo

Thanks, applied to kvm-ppc-queue.

Alex
Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
On Wed, Aug 28, 2013 at 08:15:36PM +0800, Xiao Guangrong wrote:
> On 08/28/2013 06:49 PM, Gleb Natapov wrote:
> > On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote:
> >> On 08/28/2013 05:46 PM, Gleb Natapov wrote:
> >>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote:
> > Or what if desc is moved to another rmap, but then it
> > is moved back to the initial rmap (but another place in the desc list) so
> > the check here will not catch that we need to restart walking?
>
> It is okay. We always add the new desc to the head, so we will walk
> all the entries in this case.
>
> >>> Which raises another question: what if desc is added in front of the list
> >>> behind the point where the lockless walker currently is?
> >>
> >> That case means a new spte is being added into the rmap. We do not need
> >> to care about the new sptes, since they will set the dirty-bitmap and
> >> can be write-protected next time.
> >>
> > OK.
> >
> >>> > Right?
> >>> Not sure. While the lockless walker works on a desc, the rmap can be
> >>> completely destroyed and recreated again. It can be in any order.
> >>
> >> I think the situation is very similar to include/linux/rculist_nulls.h
> > include/linux/rculist_nulls.h is for implementing hash tables, so they
> > may not care about the add/del/lookup race for instance, but maybe we do
> > (you are saying above that we do not), so similarity does not prove
> > correctness for our case.
>
> We do not care about "add" and "del" either when looking up the rmap. In
> the "add" case it is okay, for the reason I explained above. In the "del"
> case, the spte becomes non-present and all TLBs are flushed immediately,
> so it is also okay.
>
> I always use a stupid way to check correctness, which is enumerating
> all the cases we may meet. In this patch, we may meet these cases:
>
> 1) kvm deletes the desc before the one we are currently on
>    that desc has already been checked, so we do not need to care about it.
> 2) kvm deletes the desc after the one we are currently on
>    Since we always add/del the head desc, we can be sure the current desc
>    has been deleted, then we will meet case 3).
>
> 3) kvm deletes the desc that we are currently on
>    3.a): the desc stays in the slab cache (is not reused).
>          all spte entries are empty, so fn() will skip the non-present
>          sptes, and desc->more is
>          3.a.1) still pointing to the next desc, then we will continue
>                 the lookup
>          3.a.2) or it is the "nulls list", which means we reached the
>                 last one, then the walk is finished.
>
>    3.b): the desc is alloc-ed from the slab cache and is being
>          initialized; we will see "desc->more == NULL" and restart the
>          walk. It's okay.
>
>    3.c): the desc is added to a rmap or pte_list again.
>          3.c.1): the desc is added to the current rmap again.
>                  the new desc always acts as the head desc, so we will
>                  walk all entries; some entries are double-checked, and
>                  no entry can be missed. It is okay.
>
>          3.c.2): the desc is added to another rmap or pte_list.
>                  since kvm_set_memory_region() and get_dirty are
>                  serialized by slots-lock, the "nulls" can not be reused
>                  during the lookup. Then we will meet a different "nulls"
>                  at the end of the walk, which will cause a rewalk.
>
> I know checking the algorithm like this is really silly, do you have
> another idea?

Not silly, but easy to miss cases. For instance 3.c.3 can be: the desc is added to another rmap, then we move to another desc on the wrong rmap, and this other desc is also deleted and reinserted into the original rmap. Seems like the justification from 3.c.1 applies to that too, though.

> > BTW I do not see
> > rcu_assign_pointer()/rcu_dereference() in your patches which hints on

> IIUC, we can not directly use rcu_assign_pointer(); that is something like
> "p = v" to assign a pointer to a pointer. But in our case, we need:
>
>     *pte_list = (unsigned long)desc | 1;
>
> From Documentation/RCU/whatisRCU.txt: "The updater uses this function to
> assign a new value to an RCU-protected pointer."
This is what we do, no? (assuming slot->arch.rmap[] is what rcu protects here) The fact that the value is not a correct pointer should not matter.

> So i add the smp_wmb() by myself:
>
>     /*
>      * Ensure the old spte has been updated into desc, so
>      * that the other side can not get the desc from pte_list
>      * but miss the old spte.
>      */
>     smp_wmb();
>
>     *pte_list = (unsigned long)desc | 1;
>
> But i missed it when inserting an empty desc; in that case, we need the
> barrier too, since we should make desc->more visible before assigning it
> to pte_list, to avoid the lookup side seeing the invalid "nulls".
>
> I also use my own code instead of rcu_dereference():
>
> pte_list_walk_lockless():
>     pte_
VGA passthrough of GTX 660 to KVM guest
Hello everyone,

I started a thread on this list on July 12, named "AMD integrated graphics passthrough to KVM guest," so if you have a deja-vu, that's probably why, since the heart of the problem is still the same.

I've got an AMD A10 6800 APU with integrated graphics, running on an ASRock FM2A75 Pro4 mainboard with the latest firmware. Since my last thread, I've acquired an EVGA GTX 660 graphics card, which I'd like to pass through to a KVM guest. I'm following this guide to achieve my goal, including compiling the suggested patched versions of the kernel, seabios and qemu: https://bbs.archlinux.org/viewtopic.php?id=162768

The integrated graphics adapter is the primary one, and the only one that is used by the host system. The nouveau and nvidia modules are blacklisted and not in lsmod.

lspci -vvv output: http://pastebin.com/S6znCLK7
lspci -t output: http://pastebin.com/YH01atEy

Contrary to what I hoped, invoking

qemu-system-x86_64 -enable-kvm -M q35 -m 1024 -cpu host \
  -smp 2,sockets=1,cores=2,threads=1 -bios /usr/share/qemu/bios.bin -vga none \
  -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 \
  -device vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on \
  -device vfio-pci,host=01:00.1,bus=root.1,addr=00.1

after assigning the devices to vfio as suggested in the guide, does not show anything on the display attached to the GTX 660, or even wake the display up from standby, but instead gives this output on the console:

qemu-system-x86_64: -device vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on: Warning, device 0000:01:00.0 does not support reset
qemu-system-x86_64: -device vfio-pci,host=01:00.1,bus=root.1,addr=00.1: Warning, device 0000:01:00.1 does not support reset

More often than not, this also completely freezes the host system.
Later in the forum I linked, it is suggested to pass the graphics card not as the guest's primary, but to use pci-assign instead:

qemu-system-x86_64 -enable-kvm -m 2048 -cpu host -smp 2,sockets=1,cores=2,threads=1 \
  -bios /usr/share/qemu/bios.bin -vga std -vnc :0 \
  -drive file=/mnt/orthowin/current-new.img,format=qcow2 -usbdevice tablet \
  -net user,restrict=y -usb -device pci-assign,host=01:00.0 -device pci-assign,host=01:00.1

This allows me to boot the Windows 7 64bit guest. After installing the nVidia drivers, shutting the guest down, rebooting the host and booting the guest again, the Windows device manager shows "Code 43" for the graphics card (which is even correctly labelled as GTX 660), and the nVidia user interface doesn't show any GPU at all.

What could I try to resolve this issue? I realize that I might have omitted some information that may be important for troubleshooting, but I'm afraid I'm not experienced enough to even know what may be relevant. Please let me know if you'd like to have some further information.

Please also note that I'm not running the latest kernel/qemu/seabios, so maybe some changes made since I compiled 3.10.1 might help me? If so, please suggest a commit or release to try. I'm not afraid of messing up my operating system install, but if anything suggested might brick the hardware, I'd like to know in advance. :)

I'm very grateful for any kind of input!
Re: KVM: x86: update masterclock when kvmclock_offset is calculated
Il 28/08/2013 04:52, Marcelo Tosatti ha scritto: > On Thu, Aug 22, 2013 at 07:05:20PM +0200, Paolo Bonzini wrote: >> Il 20/08/2013 20:20, Marcelo Tosatti ha scritto: >>> >>> The offset to add to the hosts monotonic time, kvmclock_offset, is >>> calculated against the monotonic time at KVM_SET_CLOCK ioctl time. >>> >>> Request a master clock update at this time, to reduce a potentially >>> unbounded difference between the values of the masterclock and >>> the clock value used to calculate kvmclock_offset. >>> >>> Signed-off-by: Marcelo Tosatti >>> >>> Index: linux-2.6-kvmclock-fixes/arch/x86/kvm/x86.c >>> === >>> --- linux-2.6-kvmclock-fixes.orig/arch/x86/kvm/x86.c >>> +++ linux-2.6-kvmclock-fixes/arch/x86/kvm/x86.c >>> @@ -3806,6 +3806,7 @@ long kvm_arch_vm_ioctl(struct file *filp >>> delta = user_ns.clock - now_ns; >>> local_irq_enable(); >>> kvm->arch.kvmclock_offset = delta; >>> + kvm_gen_update_masterclock(kvm); >>> break; >>> } >>> case KVM_GET_CLOCK: { >> >> Reviewed-by: Paolo Bonzini >> >> While reviewing this patch, which BTW looks good, I noticed the handling >> of KVM_REQ_MCLOCK_INPROGRESS, the dummy request that is never processed >> and is only used to block guest entry. >> >> It seems to me that this bit is not necessary. After >> KVM_REQ_CLOCK_UPDATE is issued, no guest entries will happen because >> kvm_guest_time_update will try to take the pvclock_gtod_sync_lock, >> currently taken by kvm_gen_update_masterclock. > > Not entirely clear, to cancel guest entry the bit is necessary: > > if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests > || need_resched() || signal_pending(current)) { > vcpu->mode = OUTSIDE_GUEST_MODE; > smp_wmb(); > local_irq_enable(); > preempt_enable(); > r = 1; > goto cancel_injection; > } Yes, KVM_REQ_CLOCK_UPDATE is enough to cancel guest entries too. See code below. >> Thus, you do not need the dummy request. 
>> You can simply issue KVM_REQ_CLOCK_UPDATE before calling
>> pvclock_update_vm_gtod_copy (with the side effect of exiting VCPUs).
>> VCPUs will stall in kvm_guest_time_update until pvclock_gtod_sync_lock
>> is released by kvm_gen_update_masterclock. What do you think?
>
> Not sure its safe. Can you describe the safety of your proposal in more
> detail ?

Here is the code I was thinking of:

    spin_lock(&ka->pvclock_gtod_sync_lock);
    make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);

    /*
     * No guest entries from this point: VCPUs will be spinning
     * on pvclock_gtod_sync_lock in kvm_guest_time_update.
     */
    pvclock_update_vm_gtod_copy(kvm);

    /*
     * Let kvm_guest_time_update continue: entering the guest
     * is now allowed too.
     */
    spin_unlock(&ka->pvclock_gtod_sync_lock);

KVM_REQ_CLOCK_UPDATE is used to cancel guest entry and execute kvm_guest_time_update. But kvm_guest_time_update will spin on pvclock_gtod_sync_lock until pvclock_update_vm_gtod_copy exits and kvm_gen_update_masterclock releases the spinlock.

>> On top of this, optionally the spinlock could become an rw_semaphore
>> so that clock updates for different VCPUs will not be serialized. The
>> effect is probably not visible, though.
>
> Still not clear of the benefits, but this area certainly welcomes
> performance improvements (the global kick is one thing we discussed
> and that should be improved).

This unfortunately does not eliminate the global kick, so there is likely no performance improvement yet. It simplifies the logic a bit, though.

The change I suggested above is to make pvclock_gtod_sync_lock an rwsem or, probably better, a seqlock. VCPUs reading ka->use_master_clock, ka->master_cycle_now and ka->master_kernel_ns can then run concurrently, with no locked operations in kvm_guest_time_update.

Paolo
Re: [libvirt] [PATCH v2] kvm: warn if num cpus is greater than num recommended
On 08/28/2013 01:45 AM, Andrew Jones wrote: >> What I'm more worried about is what number is libvirt supposed to show >> to the end user, and should libvirt enforce the lower recommended max, >> or the larger kernel absolute max? Which of the two values does the QMP >> 'MachineInfo' type return in its 'cpu-max' field during the >> 'query-machines' command? Should we be modifying QMP to return both >> values, so that libvirt can also expose the logic to the end user of >> allowing a recommended vs. larger development max? >> > > Machine definitions maintain yet another 'max_cpus'. And it appears that > qmp would return that value. It would probably be best if it returned > max(qemu_machine.max_cpus, kvm_max_cpus) though. > > I'm starting to think that we should just keep things simple for most of > the virt stack by sticking to enforcing the larger developer max. And > then on a production kernel we should just compile KVM_MAX_VCPUS = > KVM_SOFT_MAX_VCPUS and be done with it. With that thought, this patch > could be dropped too. The alternative seems to be supporting a run-time > selectable experimental mode throughout the whole virt stack. Indeed - if it is a number you are unwilling to support, don't compile it into the kernel in the first place. Allowing arbitrary limits that are lower than the maximum imply policy, and policy implies touching the stack (because someone, somewhere in the stack, will have good reason for setting policy different than the lowest layer); fix the maximum instead, and the whole stack complies without having to worry about policy. IMO, this is a case where fewer knobs is better. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
On 08/28/2013 06:49 PM, Gleb Natapov wrote: > On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote: >> On 08/28/2013 05:46 PM, Gleb Natapov wrote: >>> On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote: > Or what if desc is moved to another rmap, but then it > is moved back to initial rmap (but another place in the desc list) so > the check here will not catch that we need to restart walking? It is okay. We always add the new desc to the head, then we will walk all the entires under this case. >>> Which races another question: What if desc is added in front of the list >>> behind the point where lockless walker currently is? >> >> That case is new spte is being added into the rmap. We need not to care the >> new sptes since it will set the dirty-bitmap then they can be write-protected >> next time. >> > OK. > >>> Right? >>> Not sure. While lockless walker works on a desc rmap can be completely >>> destroyed and recreated again. It can be any order. >> >> I think the thing is very similar as include/linux/rculist_nulls.h > include/linux/rculist_nulls.h is for implementing hash tables, so they > may not care about add/del/lookup race for instance, but may be we are > (you are saying above that we are not), so similarity does not prove > correctness for our case. We do not care the "add" and "del" too when lookup the rmap. Under the "add" case, it is okay, the reason i have explained above. Under the "del" case, the spte becomes unpresent and flush all tlbs immediately, so it is also okay. I always use a stupid way to check the correctness, that is enumerating all cases we may meet, in this patch, we may meet these cases: 1) kvm deletes the desc before we are current on that descs have been checked, do not need to care it. 2) kvm deletes the desc after we are currently on Since we always add/del the head desc, we can sure the current desc has been deleted, then we will meet case 3). 
3) kvm deletes the desc that we are currently on
   3.a): the desc stays in the slab cache (is not reused).
         all spte entries are empty, so fn() will skip the non-present
         sptes, and desc->more is
         3.a.1) still pointing to the next desc, then we will continue
                the lookup
         3.a.2) or it is the "nulls list", which means we reached the
                last one, then the walk is finished.

   3.b): the desc is alloc-ed from the slab cache and is being
         initialized; we will see "desc->more == NULL" and restart the
         walk. It's okay.

   3.c): the desc is added to a rmap or pte_list again.
         3.c.1): the desc is added to the current rmap again.
                 the new desc always acts as the head desc, so we will
                 walk all entries; some entries are double-checked, and
                 no entry can be missed. It is okay.

         3.c.2): the desc is added to another rmap or pte_list.
                 since kvm_set_memory_region() and get_dirty are
                 serialized by slots-lock, the "nulls" can not be reused
                 during the lookup. Then we will meet a different "nulls"
                 at the end of the walk, which will cause a rewalk.

I know checking the algorithm like this is really silly, do you have another idea?

> BTW I do not see
> rcu_assign_pointer()/rcu_dereference() in your patches which hints on

IIUC, we can not directly use rcu_assign_pointer(); that is something like
"p = v" to assign a pointer to a pointer. But in our case, we need:

    *pte_list = (unsigned long)desc | 1;

So i add the smp_wmb() by myself:

    /*
     * Ensure the old spte has been updated into desc, so
     * that the other side can not get the desc from pte_list
     * but miss the old spte.
     */
    smp_wmb();

    *pte_list = (unsigned long)desc | 1;

But i missed it when inserting an empty desc; in that case, we need the
barrier too, since we should make desc->more visible before assigning it
to pte_list, to avoid the lookup side seeing the invalid "nulls".
I also use my own code instead of rcu_dereference():

pte_list_walk_lockless():
    pte_list_value = ACCESS_ONCE(*pte_list);
    if (!pte_list_value)
        return;

    if (!(pte_list_value & 1))
        return fn((u64 *)pte_list_value);

    /*
     * fetch pte_list before read sptes in the desc, see the comments
     * in pte_list_add().
     *
     * There is the data dependence since the desc is got from pte_list.
     */
    smp_read_barrier_depends();

That part can be replaced by rcu_dereference().

> incorrect usage of RCU. I think any access to slab pointers will need to
> use those.

Removing a desc does not need it, i think, since we do not mind seeing the old info. (hlist_nulls_del_rcu() does not use rcu_dereference() either)
Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
On Wed, Aug 28, 2013 at 06:13:43PM +0800, Xiao Guangrong wrote: > On 08/28/2013 05:46 PM, Gleb Natapov wrote: > > On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote: > >>> Or what if desc is moved to another rmap, but then it > >>> is moved back to initial rmap (but another place in the desc list) so > >>> the check here will not catch that we need to restart walking? > >> > >> It is okay. We always add the new desc to the head, then we will walk > >> all the entires under this case. > >> > > Which races another question: What if desc is added in front of the list > > behind the point where lockless walker currently is? > > That case is new spte is being added into the rmap. We need not to care the > new sptes since it will set the dirty-bitmap then they can be write-protected > next time. > OK. > > > >> Right? > > Not sure. While lockless walker works on a desc rmap can be completely > > destroyed and recreated again. It can be any order. > > I think the thing is very similar as include/linux/rculist_nulls.h include/linux/rculist_nulls.h is for implementing hash tables, so they may not care about add/del/lookup race for instance, but may be we are (you are saying above that we are not), so similarity does not prove correctness for our case. BTW I do not see rcu_assign_pointer()/rcu_dereference() in your patches which hints on incorrect usage of RCU. I think any access to slab pointers will need to use those. -- Gleb.
Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
On 08/28/2013 05:46 PM, Gleb Natapov wrote: > On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote: >>> Or what if desc is moved to another rmap, but then it >>> is moved back to initial rmap (but another place in the desc list) so >>> the check here will not catch that we need to restart walking? >> >> It is okay. We always add the new desc to the head, then we will walk >> all the entires under this case. >> > Which races another question: What if desc is added in front of the list > behind the point where lockless walker currently is? That case is new spte is being added into the rmap. We need not to care the new sptes since it will set the dirty-bitmap then they can be write-protected next time. > >> Right? > Not sure. While lockless walker works on a desc rmap can be completely > destroyed and recreated again. It can be any order. I think the thing is very similar as include/linux/rculist_nulls.h
Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
On Wed, Aug 28, 2013 at 05:33:49PM +0800, Xiao Guangrong wrote: > > Or what if desc is moved to another rmap, but then it > > is moved back to initial rmap (but another place in the desc list) so > > the check here will not catch that we need to restart walking? > > It is okay. We always add the new desc to the head, then we will walk > all the entires under this case. > Which races another question: What if desc is added in front of the list behind the point where lockless walker currently is? > Right? Not sure. While lockless walker works on a desc rmap can be completely destroyed and recreated again. It can be any order. -- Gleb.
Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
On 08/28/2013 05:20 PM, Gleb Natapov wrote: > On Tue, Jul 30, 2013 at 09:02:07PM +0800, Xiao Guangrong wrote: >> The basic idea is from nulls list which uses a nulls to indicate >> whether the desc is moved to different pte-list >> >> Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused >> >> Signed-off-by: Xiao Guangrong >> --- >> arch/x86/kvm/mmu.c | 51 ++- >> 1 file changed, 50 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c >> index 36caf6a..f8fc0cc 100644 >> --- a/arch/x86/kvm/mmu.c >> +++ b/arch/x86/kvm/mmu.c >> @@ -1010,6 +1010,14 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 >> *spte, >> desc->sptes[0] = (u64 *)*pte_list; >> desc->sptes[1] = spte; >> desc_mark_nulls(pte_list, desc); >> + >> +/* >> + * Esure the old spte has been updated into desc, so >> + * that the another side can not get the desc from pte_list >> + * but miss the old spte. >> + */ >> +smp_wmb(); >> + >> *pte_list = (unsigned long)desc | 1; >> return 1; >> } >> @@ -1131,6 +1139,47 @@ static void pte_list_walk(unsigned long *pte_list, >> pte_list_walk_fn fn) >> WARN_ON(desc_get_nulls_value(desc) != pte_list); >> } >> >> +/* The caller should hold rcu lock. */ >> +typedef void (*pte_list_walk_lockless_fn) (u64 *spte, int level); >> +static void pte_list_walk_lockless(unsigned long *pte_list, >> + pte_list_walk_lockless_fn fn, int level) >> +{ >> +struct pte_list_desc *desc; >> +unsigned long pte_list_value; >> +int i; >> + >> +restart: >> +pte_list_value = ACCESS_ONCE(*pte_list); >> +if (!pte_list_value) >> +return; >> + >> +if (!(pte_list_value & 1)) >> +return fn((u64 *)pte_list_value, level); >> + >> +/* >> + * fetch pte_list before read sptes in the desc, see the comments >> + * in pte_list_add(). >> + * >> + * There is the data dependence since the desc is got from pte_list. 
>> + */
>> +    smp_read_barrier_depends();
>> +
>> +    desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
>> +    while (!desc_is_a_nulls(desc)) {
>> +        for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
>> +            fn(desc->sptes[i], level);
>> +
>> +        desc = ACCESS_ONCE(desc->more);
>> +
>> +        /* It is being initialized. */
>> +        if (unlikely(!desc))
>> +            goto restart;
>> +    }
>> +
>> +    if (unlikely(desc_get_nulls_value(desc) != pte_list))
>> +        goto restart;
>
> So is it really enough to guarantee safety and correctness? What if desc
> is moved to another rmap while we are walking it, so fn() is called on
> incorrect sptes?

Then fn() will do needless write-protection. It is unnecessary work, but it is acceptable for the rare case.

There is a bug that we can not detect the mapping level from the rmap since the desc can be moved, as you said; it can cause us to do write-protection on a middle-level spte. We can fix it by getting the mapping level from sp->role.level, since the sp can not be reused while rcu is held.

> Or what if desc is moved to another rmap, but then it
> is moved back to initial rmap (but another place in the desc list) so
> the check here will not catch that we need to restart walking?

It is okay. We always add the new desc to the head, then we will walk all the entries under this case.

Right?
Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
On Tue, Jul 30, 2013 at 09:02:07PM +0800, Xiao Guangrong wrote: > The basic idea is from nulls list which uses a nulls to indicate > whether the desc is moved to different pte-list > > Thanks to SLAB_DESTROY_BY_RCU, the desc can be quickly reused > > Signed-off-by: Xiao Guangrong > --- > arch/x86/kvm/mmu.c | 51 ++- > 1 file changed, 50 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 36caf6a..f8fc0cc 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -1010,6 +1010,14 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 > *spte, > desc->sptes[0] = (u64 *)*pte_list; > desc->sptes[1] = spte; > desc_mark_nulls(pte_list, desc); > + > + /* > + * Esure the old spte has been updated into desc, so > + * that the another side can not get the desc from pte_list > + * but miss the old spte. > + */ > + smp_wmb(); > + > *pte_list = (unsigned long)desc | 1; > return 1; > } > @@ -1131,6 +1139,47 @@ static void pte_list_walk(unsigned long *pte_list, > pte_list_walk_fn fn) > WARN_ON(desc_get_nulls_value(desc) != pte_list); > } > > +/* The caller should hold rcu lock. */ > +typedef void (*pte_list_walk_lockless_fn) (u64 *spte, int level); > +static void pte_list_walk_lockless(unsigned long *pte_list, > +pte_list_walk_lockless_fn fn, int level) > +{ > + struct pte_list_desc *desc; > + unsigned long pte_list_value; > + int i; > + > +restart: > + pte_list_value = ACCESS_ONCE(*pte_list); > + if (!pte_list_value) > + return; > + > + if (!(pte_list_value & 1)) > + return fn((u64 *)pte_list_value, level); > + > + /* > + * fetch pte_list before read sptes in the desc, see the comments > + * in pte_list_add(). > + * > + * There is the data dependence since the desc is got from pte_list. 
> + */ > + smp_read_barrier_depends(); > + > + desc = (struct pte_list_desc *)(pte_list_value & ~1ul); > + while (!desc_is_a_nulls(desc)) { > + for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) > + fn(desc->sptes[i], level); > + > + desc = ACCESS_ONCE(desc->more); > + > + /* It is being initialized. */ > + if (unlikely(!desc)) > + goto restart; > + } > + > + if (unlikely(desc_get_nulls_value(desc) != pte_list)) > + goto restart; So is it really enough to guarantee safety and correctness? What if desc is moved to another rmap while we are walking it, so fn() is called on incorrect sptes? Or what if desc is moved to another rmap, but then it is moved back to initial rmap (but another place in the desc list) so the check here will not catch that we need to restart walking? > +} > + > static unsigned long *__gfn_to_rmap(gfn_t gfn, int level, > struct kvm_memory_slot *slot) > { > @@ -4557,7 +4606,7 @@ int kvm_mmu_module_init(void) > { > pte_list_desc_cache = kmem_cache_create("pte_list_desc", > sizeof(struct pte_list_desc), > - 0, 0, NULL); > + 0, SLAB_DESTROY_BY_RCU, NULL); > if (!pte_list_desc_cache) > goto nomem; > > -- > 1.8.1.4 -- Gleb.
Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list
On 08/28/2013 04:58 PM, Gleb Natapov wrote: > On Wed, Aug 28, 2013 at 04:37:32PM +0800, Xiao Guangrong wrote: >> On 08/28/2013 04:12 PM, Gleb Natapov wrote: >> + + rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte); + desc = (struct pte_list_desc *)(*pte_list & ~1ul); + + /* No empty position in the desc. */ + if (desc->sptes[PTE_LIST_EXT - 1]) { + struct pte_list_desc *new_desc; + new_desc = mmu_alloc_pte_list_desc(vcpu); + new_desc->more = desc; + desc = new_desc; + *pte_list = (unsigned long)desc | 1; } - return count; + + free_pos = find_first_free(desc); + desc->sptes[free_pos] = spte; + return count_spte_number(desc); >>> Should it be count_spte_number(desc) - 1? The function should returns >>> the number of pte entries before the spte was added. >> >> Yes. We have handled it count_spte_number(), we count the number like this: >> >> return first_free + desc_num * PTE_LIST_EXT; >> >> The first_free is indexed from 0. >> > Suppose when pte_list_add() is called there is one full desc, so the > number that should be returned is PTE_LIST_EXT, correct? But since > before calling count_spte_number() one more desc will be added and > desc->sptes[0] will be set in it the first_free in count_spte_number > will be 1 and PTE_LIST_EXT + 1 will be returned. Oh, yes, you are right. Will fix it in the next version, thanks for you pointing it out. > >> Maybe it is clearer to let count_spte_number() return the real number. 
>> >>> } static void -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc, - int i, struct pte_list_desc *prev_desc) +pte_list_desc_remove_entry(unsigned long *pte_list, + struct pte_list_desc *desc, int i) { - int j; + struct pte_list_desc *first_desc; + int last_used; + + first_desc = (struct pte_list_desc *)(*pte_list & ~1ul); + last_used = find_last_used(first_desc); - for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j) - ; - desc->sptes[i] = desc->sptes[j]; - desc->sptes[j] = NULL; - if (j != 0) + /* + * Move the entry from the first desc to this position we want + * to remove. + */ + desc->sptes[i] = first_desc->sptes[last_used]; + first_desc->sptes[last_used] = NULL; + >>> What if desc == first_desc and i < last_used. You still move spte >>> backwards so lockless walk may have already examined entry at i and >>> will miss spte that was moved there from last_used position, no? >> >> Right. I noticed it too and fixed in the v2 which is being tested. >> I fixed it by bottom-up walk desc, like this: >> >> pte_list_walk_lockless(): >> >> desc = (struct pte_list_desc *)(pte_list_value & ~1ul); >> while (!desc_is_a_nulls(desc)) { >> /* >> * We should do bottom-up walk since we always use the >> * bottom entry to replace the deleted entry if only >> * one desc is used in the rmap when a spte is removed. >> * Otherwise the moved entry will be missed. >> */ > I would call it top-down walk since we are walking from big indices to > smaller once. Okay, will fix the comments. > >> for (i = PTE_LIST_EXT - 1; i >= 0; i--) >> fn(desc->sptes[i]); >> >> desc = ACCESS_ONCE(desc->more); >> >> /* It is being initialized. */ >> if (unlikely(!desc)) >> goto restart; >> } >> >> How about this? >> > Tricky, very very tricky :) > >>> + /* No valid entry in this desc, we can free this desc now. */ + if (!first_desc->sptes[0]) { + struct pte_list_desc *next_desc = first_desc->more; + + /* + * Only one entry existing but still use a desc to store it? 
+ */ + WARN_ON(!next_desc); + + mmu_free_pte_list_desc(first_desc); + first_desc = next_desc; + *pte_list = (unsigned long)first_desc | 1ul; return; - if (!prev_desc && !desc->more) - *pte_list = (unsigned long)desc->sptes[0]; - else - if (prev_desc) - prev_desc->more = desc->more; - else - *pte_list = (unsigned long)desc->more | 1; - mmu_free_pte_list_desc(desc); + } + + WARN_ON(!first_desc->sptes[0]); + + /* + * Only one entry in this desc, move the entry to the head + * then the desc can be freed. + */ + if (!first_desc->sptes[1] && !first_desc->more) { + *pte_list = (unsigned long)first_desc->sptes[0]; + mmu_free_pte_list_desc(first_desc);
Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list
On Wed, Aug 28, 2013 at 04:37:32PM +0800, Xiao Guangrong wrote: > On 08/28/2013 04:12 PM, Gleb Natapov wrote: > > >> + > >> + rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte); > >> + desc = (struct pte_list_desc *)(*pte_list & ~1ul); > >> + > >> + /* No empty position in the desc. */ > >> + if (desc->sptes[PTE_LIST_EXT - 1]) { > >> + struct pte_list_desc *new_desc; > >> + new_desc = mmu_alloc_pte_list_desc(vcpu); > >> + new_desc->more = desc; > >> + desc = new_desc; > >> + *pte_list = (unsigned long)desc | 1; > >>} > >> - return count; > >> + > >> + free_pos = find_first_free(desc); > >> + desc->sptes[free_pos] = spte; > >> + return count_spte_number(desc); > > Should it be count_spte_number(desc) - 1? The function should returns > > the number of pte entries before the spte was added. > > Yes. We have handled it count_spte_number(), we count the number like this: > > return first_free + desc_num * PTE_LIST_EXT; > > The first_free is indexed from 0. > Suppose when pte_list_add() is called there is one full desc, so the number that should be returned is PTE_LIST_EXT, correct? But since before calling count_spte_number() one more desc will be added and desc->sptes[0] will be set in it the first_free in count_spte_number will be 1 and PTE_LIST_EXT + 1 will be returned. > Maybe it is clearer to let count_spte_number() return the real number. 
> > 
> >>  }
> >>  
> >>  static void
> >> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc *desc,
> >> -			   int i, struct pte_list_desc *prev_desc)
> >> +pte_list_desc_remove_entry(unsigned long *pte_list,
> >> +			   struct pte_list_desc *desc, int i)
> >>  {
> >> -	int j;
> >> +	struct pte_list_desc *first_desc;
> >> +	int last_used;
> >> +
> >> +	first_desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> >> +	last_used = find_last_used(first_desc);
> >>  
> >> -	for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j)
> >> -		;
> >> -	desc->sptes[i] = desc->sptes[j];
> >> -	desc->sptes[j] = NULL;
> >> -	if (j != 0)
> >> +	/*
> >> +	 * Move the entry from the first desc to this position we want
> >> +	 * to remove.
> >> +	 */
> >> +	desc->sptes[i] = first_desc->sptes[last_used];
> >> +	first_desc->sptes[last_used] = NULL;
> >> +
> > What if desc == first_desc and i < last_used? You still move the spte
> > backwards so the lockless walk may have already examined the entry at i
> > and will miss the spte that was moved there from the last_used position, no?
> 
> Right. I noticed it too and fixed it in the v2 which is being tested.
> I fixed it by walking the descs bottom-up, like this:
> 
> pte_list_walk_lockless():
> 
> 	desc = (struct pte_list_desc *)(pte_list_value & ~1ul);
> 	while (!desc_is_a_nulls(desc)) {
> 		/*
> 		 * We should do bottom-up walk since we always use the
> 		 * bottom entry to replace the deleted entry if only
> 		 * one desc is used in the rmap when a spte is removed.
> 		 * Otherwise the moved entry will be missed.
> 		 */
I would call it top-down walk since we are walking from big indices to
smaller ones.

> 		for (i = PTE_LIST_EXT - 1; i >= 0; i--)
> 			fn(desc->sptes[i]);
> 
> 		desc = ACCESS_ONCE(desc->more);
> 
> 		/* It is being initialized. */
> 		if (unlikely(!desc))
> 			goto restart;
> 	}
> 
> How about this?
> 
Tricky, very very tricky :)

> > 
> >> +	/* No valid entry in this desc, we can free this desc now.
> >> +	 */
> >> +	if (!first_desc->sptes[0]) {
> >> +		struct pte_list_desc *next_desc = first_desc->more;
> >> +
> >> +		/*
> >> +		 * Only one entry existing but still use a desc to store it?
> >> +		 */
> >> +		WARN_ON(!next_desc);
> >> +
> >> +		mmu_free_pte_list_desc(first_desc);
> >> +		first_desc = next_desc;
> >> +		*pte_list = (unsigned long)first_desc | 1ul;
> >> 		return;
> >> -	if (!prev_desc && !desc->more)
> >> -		*pte_list = (unsigned long)desc->sptes[0];
> >> -	else
> >> -		if (prev_desc)
> >> -			prev_desc->more = desc->more;
> >> -		else
> >> -			*pte_list = (unsigned long)desc->more | 1;
> >> -	mmu_free_pte_list_desc(desc);
> >> +	}
> >> +
> >> +	WARN_ON(!first_desc->sptes[0]);
> >> +
> >> +	/*
> >> +	 * Only one entry in this desc, move the entry to the head
> >> +	 * then the desc can be freed.
> >> +	 */
> >> +	if (!first_desc->sptes[1] && !first_desc->more) {
> >> +		*pte_list = (unsigned long)first_desc->sptes[0];
> >> +		mmu_free_pte_list_desc(first_desc);
> >> +	}
> >>  }
> >>  
> >>  static void pte_list_remove(u64 *spte, unsigned long *pte_list)
> >>  {
> >> 	struct pte_list_desc *desc;
> >> -	struct pte_list_desc *prev_desc;
> >> 	int i;
> >> 
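Both points debated in this thread, the off-by-one in count_spte_number() and the "top-down" walk order, can be reproduced in a few lines of standalone user-space C. This is a simplified model, not the kernel code: there is no RCU and no nulls marker (the desc list simply ends at NULL), PTE_LIST_EXT is shrunk to 3, and sptes are plain numbers with 0 meaning an empty slot.

```c
#include <stddef.h>

#define PTE_LIST_EXT 3	/* shrunk from the kernel value for illustration */

struct pte_list_desc {
	unsigned long sptes[PTE_LIST_EXT];	/* 0 == empty slot */
	struct pte_list_desc *more;
};

/* The counting formula quoted in the thread: index of the first free
 * slot in the head desc plus PTE_LIST_EXT per full desc below it. */
static int count_spte_number(struct pte_list_desc *desc)
{
	int first_free, desc_num;

	for (first_free = 0; first_free < PTE_LIST_EXT; first_free++)
		if (!desc->sptes[first_free])
			break;

	for (desc_num = 0; desc->more; desc = desc->more)
		desc_num++;

	return first_free + desc_num * PTE_LIST_EXT;
}

/* The "top-down" walk order: highest index first within each desc, so
 * an entry moved from the last used slot into a lower slot of the same
 * desc cannot be skipped by a concurrent walker.  Visited entries are
 * recorded into 'out'; the return value is how many were visited. */
static int walk_top_down(struct pte_list_desc *desc,
			 unsigned long *out, int max)
{
	int n = 0, i;

	for (; desc; desc = desc->more)
		for (i = PTE_LIST_EXT - 1; i >= 0; i--)
			if (desc->sptes[i] && n < max)
				out[n++] = desc->sptes[i];
	return n;
}
```

With one full desc and a freshly prepended head holding a single entry, count_spte_number() returns PTE_LIST_EXT + 1, exactly the post-insertion count discussed above.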
Re: [PATCH 08/12] KVM: MMU: introduce nulls desc
On 08/28/2013 04:40 PM, Gleb Natapov wrote: >> static unsigned long *__gfn_to_rmap(gfn_t gfn, int level, >> @@ -1200,7 +1221,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct >> rmap_iterator *iter) >> */ >> static u64 *rmap_get_next(struct rmap_iterator *iter) >> { >> -if (iter->desc) { >> +if (iter->desc && !desc_is_a_nulls(iter->desc)) { >> if (iter->pos < PTE_LIST_EXT - 1) { >> u64 *sptep; >> >> @@ -1212,7 +1233,7 @@ static u64 *rmap_get_next(struct rmap_iterator *iter) >> >> iter->desc = iter->desc->more; >> > I'd rather do: > iter->desc = desc_is_a_nulls(iter->desc) ? NULL : iter->desc; > here and drop two desc_is_a_nulls() checks in this function. Nice, will do it in the next version. Thanks! -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT and
H_STUFF_TCE requests targeted at an IOMMU TCE table without passing them
to user space, which saves time on switching to user space and back.

Both real and virtual modes are supported. The kernel tries to handle
a TCE request in real mode; if that fails, it passes the request to
virtual mode to complete the operation. If the virtual mode handler
fails as well, the request is passed to user space.

The first user of this is VFIO on POWER. Trampolines to the VFIO
external user API functions are required for this patch.

This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
number (LIOBN) with a VFIO IOMMU group fd and enable in-kernel handling
of map/unmap requests. The device supports a single attribute which is
a struct with LIOBN and IOMMU fd. When the attribute is set, the device
establishes the connection between KVM and VFIO.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexey Kardashevskiy 

---
Changes:
v9:
* KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
KVM device
* release_spapr_tce_table() is not shared between different TCE types
* reduced the patch size by moving VFIO external API trampolines to
a separate patch
* moved documentation from Documentation/virtual/kvm/api.txt to
Documentation/virtual/kvm/devices/spapr_tce_iommu.txt

v8:
* fixed warnings from check_patch.pl

2013/07/11:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled for
KVM_BOOK3S_64
* kvmppc_gpa_to_hva_and_get also returns host phys address. Not much
sense for this here but the next patch for hugepages support will use
it more.
2013/07/06: * added realmode arch_spin_lock to protect TCE table from races in real and virtual modes * POWERPC IOMMU API is changed to support real mode * iommu_take_ownership and iommu_release_ownership are protected by iommu_table's locks * VFIO external user API use rewritten * multiple small fixes 2013/06/27: * tce_list page is referenced now in order to protect it from accident invalidation during H_PUT_TCE_INDIRECT execution * added use of the external user VFIO API 2013/06/05: * changed capability number * changed ioctl number * update the doc article number 2013/05/20: * removed get_user() from real mode handlers * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there translated TCEs, tries realmode_get_page() on those and if it fails, it passes control over the virtual mode handler which tries to finish the request handling * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit on a page * The only reason to pass the request to user mode now is when the user mode did not register TCE table in the kernel, in all other cases the virtual mode handler is expected to do the job --- .../virtual/kvm/devices/spapr_tce_iommu.txt| 37 +++ arch/powerpc/include/asm/kvm_host.h| 4 + arch/powerpc/kvm/book3s_64_vio.c | 310 - arch/powerpc/kvm/book3s_64_vio_hv.c| 122 arch/powerpc/kvm/powerpc.c | 1 + include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c| 5 + 7 files changed, 477 insertions(+), 3 deletions(-) create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt new file mode 100644 index 000..4bc8fc3 --- /dev/null +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt @@ -0,0 +1,37 @@ +SPAPR TCE IOMMU device + +Capability: KVM_CAP_SPAPR_TCE_IOMMU +Architectures: powerpc + +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU + +Groups: + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE + Attributes: single 
 	attribute with pair { LIOBN, IOMMU fd }
+
+This is a completely made-up device which provides an API to link
+a logical bus number (LIOBN) and an IOMMU group. The user space has
+to create a new SPAPR TCE IOMMU device per logical bus.
+
+LIOBN is a PCI bus identifier from PPC64-server (sPAPR) DMA hypercalls
+(H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE).
+An IOMMU group is a minimal isolated device set which can be passed to
+user space via VFIO.
+
+Right after creation the device is in an uninitialized state and requires
+a KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute to be set.
+The attribute contains the liobn, the IOMMU fd and flags:
+
+struct kvm_create_spapr_tce_iommu_linkage {
+	__u64 liobn;
+	__u32 fd;
+	__u32 flags;
+};
+
+The user space creates the SPAPR TCE IOMMU device, obtains
+an IOMMU fd via the VFIO ABI and sets the attribute on the SPAPR TCE IOMMU
+device. At the moment of setting the attribute, the SPAPR TCE IOMMU
+device links the LIOBN to the IOMMU group and makes the necessary steps
+to make sure
[PATCH v9 13/13] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
This adds special support for huge pages (16MB) in real mode. Reference
counting cannot easily be done for such pages in real mode (when the MMU
is off), so we added a hash table of huge pages. It is populated in
virtual mode and get_page() is called just once per huge page. Real mode
handlers check whether the requested page is in the hash table; if it
is, no reference counting is done, otherwise an exit to virtual mode
happens. The hash table is released at KVM exit.

At the moment the fastest card available for tests uses up to 9 huge
pages so walking through this hash table does not cost much. However
this can change and we may want to optimize this.

Signed-off-by: Paul Mackerras 
Signed-off-by: Alexey Kardashevskiy 

---
Changes:
2013/07/12:
* removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled for
KVM_BOOK3S_64

2013/06/27:
* list of huge pages replaced with a hashtable for better performance
* spinlock removed from real mode; it now only protects insertion of
new huge page descriptors into the hashtable

2013/06/05:
* fixed compile error when CONFIG_IOMMU_API=n

2013/05/20:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints a warning if it is called twice for
the same huge page as the real mode handler is expected to fail just
once - when a huge page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list
and when used in the virtual mode hcall handler (can be optimized but
it will make the patch less nice).
--- arch/powerpc/include/asm/kvm_host.h | 25 arch/powerpc/kernel/iommu.c | 6 +- arch/powerpc/kvm/book3s_64_vio.c| 122 ++-- arch/powerpc/kvm/book3s_64_vio_hv.c | 32 +- 4 files changed, 176 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index c1a039d..b970d26 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -31,6 +31,7 @@ #include #include #include +#include #include #include #include @@ -184,9 +185,33 @@ struct kvmppc_spapr_tce_table { struct iommu_group *grp;/* used for IOMMU groups */ struct kvm_create_spapr_tce_iommu_linkage link; struct vfio_group *vfio_grp;/* used for IOMMU groups */ + DECLARE_HASHTABLE(hash_tab, ilog2(64)); /* used for IOMMU groups */ + spinlock_t hugepages_write_lock;/* used for IOMMU groups */ struct page *pages[0]; }; +/* + * The KVM guest can be backed with 16MB pages. + * In this case, we cannot do page counting from the real mode + * as the compound pages are used - they are linked in a list + * with pointers as virtual addresses which are inaccessible + * in real mode. + * + * The code below keeps a 16MB pages list and uses page struct + * in real mode if it is already locked in RAM and inserted into + * the list or switches to the virtual mode where it can be + * handled in a usual manner. 
+ */ +#define KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)hash_32(gpa >> 24, 32) + +struct kvmppc_spapr_iommu_hugepage { + struct hlist_node hash_node; + unsigned long gpa; /* Guest physical address */ + unsigned long hpa; /* Host physical address */ + struct page *page; /* page struct of the very first subpage */ + unsigned long size; /* Huge page size (always 16MB at the moment) */ +}; + struct kvmppc_linear_info { void*base_virt; unsigned longbase_pfn; diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index ff0cd90..d0593c9 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -999,7 +999,8 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, if (!pg) { ret = -EAGAIN; } else if (PageCompound(pg)) { - ret = -EAGAIN; + /* Hugepages will be released at KVM exit */ + ret = 0; } else { if (oldtce & TCE_PCI_WRITE) SetPageDirty(pg); @@ -1010,6 +1011,9 @@ int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, struct page *pg = pfn_to_page(oldtce >> PAGE_SHIFT); if (!pg) { ret = -EAGAIN; + } else if (PageCompound(pg)) { + /* Hugepages will be released at KVM exit */ + ret = 0; } else { if (oldtce & TCE_PCI_WRITE) SetPageDirty(pg); diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 95f9e1a..1851778 100644 --- a/arch/po
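The hashing scheme above, `KVMPPC_SPAPR_HUGEPAGE_HASH(gpa)` computing `hash_32(gpa >> 24, 32)`, can be illustrated standalone. The `hash_32()` below is a stand-in multiplicative hash (the kernel's exact constant and implementation may differ); the point it demonstrates is that shifting by 24 bits first reduces the guest physical address to its 16MB frame, so every subpage of one huge page lands in the same bucket.

```c
/* User-space sketch of the hugepage lookup keying: the gpa is reduced
 * to its 16MB frame (gpa >> 24) before hashing, so all subpages of one
 * huge page share a bucket.  hash_32() is a kernel-like multiplicative
 * hash, an assumption rather than the exact kernel code. */
#include <stdint.h>

#define GOLDEN_RATIO_32 0x9e370001u	/* assumed kernel-like constant */

static uint32_t hash_32(uint32_t val, unsigned int bits)
{
	/* Multiply, then keep the top 'bits' bits as the bucket index. */
	return (uint32_t)(val * GOLDEN_RATIO_32) >> (32 - bits);
}

static uint32_t hugepage_hash(uint64_t gpa, unsigned int bits)
{
	return hash_32((uint32_t)(gpa >> 24), bits);
}
```

With the 64-bucket table from the patch (`ilog2(64)` = 6 bits), any two addresses inside the same 16MB page hash identically.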
[PATCH v9 10/13] KVM: PPC: remove warning from kvmppc_core_destroy_vm
Book3S KVM implements in-kernel TCE tables via a list of
kvmppc_spapr_tce_table structs (created per KVM instance). Entries in
the list are per LIOBN (logical bus number) and have a TCE table, so DMA
hypercalls (such as H_PUT_TCE) can convert a LIOBN to a TCE table.

The entry in the list is created via the KVM_CREATE_SPAPR_TCE ioctl
which returns an anonymous fd. This fd is used to map the TCE table to
user space and it also defines the lifetime of the TCE table in the
kernel. Every list item also holds a link to KVM, so when KVM is about
to be destroyed, all kvmppc_spapr_tce_table objects are expected to be
released and removed from the global list. And this is what the warning
verifies.

The upcoming support of VFIO IOMMU will extend kvmppc_spapr_tce_table
use. Unlike emulated devices, it will create kvmppc_spapr_tce_table
structs via the new KVM device API which opens an anonymous fd (as
KVM_CREATE_SPAPR_TCE does) but the release callback does not call the
KVM device's destroy callback immediately. Instead, KVM device
destruction is postponed till KVM destruction, but that happens after
the arch-specific KVM destroy function, so the warning raises a false
alarm.

This removes the warning as it never triggers in real life and there is
no obvious place to put it.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/book3s_hv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9e823ad..5f15ff7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1974,7 +1974,6 @@ void kvmppc_core_destroy_vm(struct kvm *kvm)
 	kvmppc_rtas_tokens_free(kvm);
 
 	kvmppc_free_hpt(kvm);
-	WARN_ON(!list_empty(&kvm->arch.spapr_tce_tables));
 }
 
 /* These are stubs for now */
-- 
1.8.4.rc4
[PATCH v9 11/13] KVM: PPC: add trampolines for VFIO external API
KVM is going to use VFIO's external API. However KVM can operate even
when VFIO is not compiled in or loaded, so KVM is linked to VFIO
dynamically. This adds proxy functions for the VFIO external API.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kvm/book3s_64_vio.c | 49
 1 file changed, 49 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index cae1099..047b94c 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@
 #include
 #include
 #include
+#include
+#include
 
 #include
 #include
@@ -42,6 +44,53 @@
 
 #define ERROR_ADDR ((void *)~(unsigned long)0x0)
 
+/*
+ * Dynamically linked version of the external user VFIO API.
+ *
+ * As IOMMU group access control is implemented by VFIO,
+ * there is an API to verify that a specific process can own
+ * a group. As KVM may run when VFIO is not loaded, KVM is not
+ * linked statically to VFIO, instead wrappers are used.
+ */
+struct vfio_group *kvmppc_vfio_group_get_external_user(struct file *filep)
+{
+	struct vfio_group *ret;
+	struct vfio_group * (*proc)(struct file *) =
+			symbol_get(vfio_group_get_external_user);
+	if (!proc)
+		return NULL;
+
+	ret = proc(filep);
+	symbol_put(vfio_group_get_external_user);
+
+	return ret;
+}
+
+void kvmppc_vfio_group_put_external_user(struct vfio_group *group)
+{
+	void (*proc)(struct vfio_group *) =
+			symbol_get(vfio_group_put_external_user);
+	if (!proc)
+		return;
+
+	proc(group);
+	symbol_put(vfio_group_put_external_user);
+}
+
+int kvmppc_vfio_external_user_iommu_id(struct vfio_group *group)
+{
+	int ret;
+	int (*proc)(struct vfio_group *) =
+			symbol_get(vfio_external_user_iommu_id);
+	if (!proc)
+		return -EINVAL;
+
+	ret = proc(group);
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
+
 static long kvmppc_stt_npages(unsigned long window_size)
 {
 	return ALIGN((window_size >> SPAPR_TCE_SHIFT)
-- 
1.8.4.rc4
[PATCH v9 01/13] KVM: PPC: POWERNV: move iommu_add_device earlier
The current implementation of IOMMU on sPAPR does not use iommu_ops and therefore does not call IOMMU API's bus_set_iommu() which 1) sets iommu_ops for a bus 2) registers a bus notifier Instead, PCI devices are added to IOMMU groups from subsys_initcall_sync(tce_iommu_init) which does basically the same thing without using iommu_ops callbacks. However Freescale PAMU driver (https://lkml.org/lkml/2013/7/1/158) implements iommu_ops and when tce_iommu_init is called, every PCI device is already added to some group so there is a conflict. This patch does 2 things: 1. removes the loop in which PCI devices were added to groups and adds explicit iommu_add_device() calls to add devices as soon as they get the iommu_table pointer assigned to them. 2. moves a bus notifier to powernv code in order to avoid conflict with the notifier from Freescale driver. iommu_add_device() and iommu_del_device() are public now. Signed-off-by: Alexey Kardashevskiy --- Changes: v8: * added the check for iommu_group!=NULL before removing device from a group as suggested by Wei Yang v2: * added a helper - set_iommu_table_base_and_group - which does set_iommu_table_base() and iommu_add_device() --- arch/powerpc/include/asm/iommu.h| 9 +++ arch/powerpc/kernel/iommu.c | 41 +++-- arch/powerpc/platforms/powernv/pci-ioda.c | 8 +++--- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 2 +- arch/powerpc/platforms/powernv/pci.c| 33 ++- arch/powerpc/platforms/pseries/iommu.c | 8 +++--- 6 files changed, 55 insertions(+), 46 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index c34656a..19ad77f 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -103,6 +103,15 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, int nid); extern void iommu_register_group(struct iommu_table *tbl, int pci_domain_number, unsigned long pe_num); +extern int iommu_add_device(struct device *dev); +extern void iommu_del_device(struct 
device *dev); + +static inline void set_iommu_table_base_and_group(struct device *dev, + void *base) +{ + set_iommu_table_base(dev, base); + iommu_add_device(dev); +} extern int iommu_map_sg(struct device *dev, struct iommu_table *tbl, struct scatterlist *sglist, int nelems, diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index b20ff17..15f8ca8 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1105,7 +1105,7 @@ void iommu_release_ownership(struct iommu_table *tbl) } EXPORT_SYMBOL_GPL(iommu_release_ownership); -static int iommu_add_device(struct device *dev) +int iommu_add_device(struct device *dev) { struct iommu_table *tbl; int ret = 0; @@ -1134,46 +1134,13 @@ static int iommu_add_device(struct device *dev) return ret; } +EXPORT_SYMBOL_GPL(iommu_add_device); -static void iommu_del_device(struct device *dev) +void iommu_del_device(struct device *dev) { iommu_group_remove_device(dev); } - -static int iommu_bus_notifier(struct notifier_block *nb, - unsigned long action, void *data) -{ - struct device *dev = data; - - switch (action) { - case BUS_NOTIFY_ADD_DEVICE: - return iommu_add_device(dev); - case BUS_NOTIFY_DEL_DEVICE: - iommu_del_device(dev); - return 0; - default: - return 0; - } -} - -static struct notifier_block tce_iommu_bus_nb = { - .notifier_call = iommu_bus_notifier, -}; - -static int __init tce_iommu_init(void) -{ - struct pci_dev *pdev = NULL; - - BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE); - - for_each_pci_dev(pdev) - iommu_add_device(&pdev->dev); - - bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb); - return 0; -} - -subsys_initcall_sync(tce_iommu_init); +EXPORT_SYMBOL_GPL(iommu_del_device); #else diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index d8140b1..756bb58 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -440,7 +440,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb 
*phb, struct pci_dev *pdev return; pe = &phb->ioda.pe_array[pdn->pe_number]; - set_iommu_table_base(&pdev->dev, &pe->tce32_table); + set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table); } static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) @@ -448,7 +448,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) struct pci_dev *dev; list_for_each_entry(dev, &bus->devices, bus_list)
[PATCH v9 00/13] KVM: PPC: IOMMU in-kernel handling of VFIO
This accelerates VFIO DMA operations on POWER by moving them into kernel. This depends on VFIO external API patch which is on its way to upstream. Changes: v9: * replaced the "link logical bus number to IOMMU group" ioctl to KVM with a KVM device doing the same thing, i.e. the actual changes are in these 3 patches: KVM: PPC: reserve a capability and KVM device type for realmode VFIO KVM: PPC: remove warning from kvmppc_core_destroy_vm KVM: PPC: Add support for IOMMU in-kernel handling * moved some VFIO external API bits to a separate patch to reduce the size of the "KVM: PPC: Add support for IOMMU in-kernel handling" patch * fixed code style problems reported by checkpatch.pl. v8: * fixed comments about capabilities numbers v7: * rebased on v3.11-rc3. * VFIO external user API will go through VFIO tree so it is excluded from this series. * As nobody ever reacted on "hashtable: add hash_for_each_possible_rcu_notrace()", Ben suggested to push it via his tree so I included it to the series. * realmode_(get|put)_page is reworked. More details in the individual patch comments. 
Alexey Kardashevskiy (13): KVM: PPC: POWERNV: move iommu_add_device earlier hashtable: add hash_for_each_possible_rcu_notrace() KVM: PPC: reserve a capability number for multitce support KVM: PPC: reserve a capability and KVM device type for realmode VFIO powerpc: Prepare to support kernel handling of IOMMU map/unmap powerpc: add real mode support for dma operations on powernv KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently KVM: PPC: Add support for multiple-TCE hcalls powerpc/iommu: rework to support realmode KVM: PPC: remove warning from kvmppc_core_destroy_vm KVM: PPC: add trampolines for VFIO external API KVM: PPC: Add support for IOMMU in-kernel handling KVM: PPC: Add hugepage support for IOMMU in-kernel handling Documentation/virtual/kvm/api.txt | 26 + .../virtual/kvm/devices/spapr_tce_iommu.txt| 37 ++ arch/powerpc/include/asm/iommu.h | 18 +- arch/powerpc/include/asm/kvm_host.h| 38 ++ arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/include/asm/machdep.h | 12 + arch/powerpc/include/asm/pgtable-ppc64.h | 2 + arch/powerpc/include/uapi/asm/kvm.h| 8 + arch/powerpc/kernel/iommu.c| 243 + arch/powerpc/kvm/Kconfig | 1 + arch/powerpc/kvm/book3s_64_vio.c | 597 - arch/powerpc/kvm/book3s_64_vio_hv.c| 408 +- arch/powerpc/kvm/book3s_hv.c | 42 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S| 8 +- arch/powerpc/kvm/book3s_pr_papr.c | 35 ++ arch/powerpc/kvm/powerpc.c | 4 + arch/powerpc/mm/init_64.c | 50 +- arch/powerpc/platforms/powernv/pci-ioda.c | 57 +- arch/powerpc/platforms/powernv/pci-p5ioc2.c| 2 +- arch/powerpc/platforms/powernv/pci.c | 75 ++- arch/powerpc/platforms/powernv/pci.h | 3 +- arch/powerpc/platforms/pseries/iommu.c | 8 +- include/linux/hashtable.h | 15 + include/linux/kvm_host.h | 1 + include/linux/mm.h | 14 + include/linux/page-flags.h | 4 +- include/uapi/linux/kvm.h | 3 + virt/kvm/kvm_main.c| 5 + 28 files changed, 1564 insertions(+), 168 deletions(-) create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt -- 1.8.4.rc4 -- To 
[PATCH v9 02/13] hashtable: add hash_for_each_possible_rcu_notrace()
This adds hash_for_each_possible_rcu_notrace() which is basically
a notrace clone of hash_for_each_possible_rcu(), which cannot be used
in real mode due to its tracing/debugging capability.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v8:
* fixed warnings from check_patch.pl
---
 include/linux/hashtable.h | 15
 1 file changed, 15 insertions(+)

diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
index a9df51f..519b6e2 100644
--- a/include/linux/hashtable.h
+++ b/include/linux/hashtable.h
@@ -174,6 +174,21 @@ static inline void hash_del_rcu(struct hlist_node *node)
 		member)
 
 /**
+ * hash_for_each_possible_rcu_notrace - iterate over all possible objects hashing
+ * to the same bucket in an rcu enabled hashtable
+ * @name: hashtable to iterate
+ * @obj: the type * to use as a loop cursor for each entry
+ * @member: the name of the hlist_node within the struct
+ * @key: the key of the objects to iterate over
+ *
+ * This is the same as hash_for_each_possible_rcu() except that it does
+ * not do any RCU debugging or tracing.
+ */
+#define hash_for_each_possible_rcu_notrace(name, obj, member, key)	\
+	hlist_for_each_entry_rcu_notrace(obj,				\
+		&name[hash_min(key, HASH_BITS(name))], member)
+
+/**
  * hash_for_each_possible_safe - iterate over all possible objects hashing to the
  * same bucket safe against removals
  * @name: hashtable to iterate
-- 
1.8.4.rc4
[PATCH v9 03/13] KVM: PPC: reserve a capability number for multitce support
This is to reserve a capability number for upcoming support of
the H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls which support
multiple DMA map/unmap operations per call.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
2013/07/16:
* changed the number
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index acccd08..99c2533 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_RTAS 91
 #define KVM_CAP_IRQ_XICS 92
 #define KVM_CAP_ARM_EL1_32BIT 93
+#define KVM_CAP_SPAPR_MULTITCE 94
 
 #ifdef KVM_CAP_IRQ_ROUTING
-- 
1.8.4.rc4
[PATCH v9 05/13] powerpc: Prepare to support kernel handling of IOMMU map/unmap
The current VFIO-on-POWER implementation supports only user mode driven
mapping, i.e. QEMU is sending requests to map/unmap pages. However this
approach is really slow, so we want to move that to KVM.

Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement it as a "fast" hypercall directly in "real mode" (the
processor is still in the guest context but the MMU is off).

To be able to do that, we need to provide some facilities to access the
struct page count within that real mode environment as things like the
sparsemem vmemmap mappings aren't accessible.

This adds an API function realmode_pfn_to_page() to get the page struct
when the MMU is off.

This adds to MM a new function put_page_unless_one() which drops a page
if the counter is bigger than 1. It is going to be used when the MMU is
off (for example, real mode on PPC64) and we want to make sure that page
release will not happen in real mode as it may crash the kernel in
a horrible way.

CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.

Cc: linux...@kvack.org
Cc: Benjamin Herrenschmidt 
Cc: Andrew Morton 
Reviewed-by: Paul Mackerras 
Signed-off-by: Paul Mackerras 
Signed-off-by: Alexey Kardashevskiy 

---
Changes:
2013/07/25 (v7):
* removed realmode_put_page and added put_page_unless_one() instead.
The name has been chosen to conform to the already existing
get_page_unless_zero().
* removed realmode_get_page. Instead, get_page_unless_zero() should be used

2013/07/10:
* adjusted comment (removed sentence about virtual mode)
* get_page_unless_zero replaced with atomic_inc_not_zero to minimize
the effect of a possible get_page_unless_zero() rework (if it ever
happens).

2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.
2013/05/20: * PageTail() is replaced by PageCompound() in order to have the same checks for whether the page is huge in realmode_get_page() and realmode_put_page() --- arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++ arch/powerpc/mm/init_64.c| 50 +++- include/linux/mm.h | 14 + include/linux/page-flags.h | 4 ++- 4 files changed, 68 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index 46db094..4a191c4 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -394,6 +394,8 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array, hpte_slot_array[index] = hidx << 4 | 0x1 << 3; } +struct page *realmode_pfn_to_page(unsigned long pfn); + static inline char *get_hpte_slot_array(pmd_t *pmdp) { /* diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index d0cd9e4..8cf345a 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -300,5 +300,53 @@ void vmemmap_free(unsigned long start, unsigned long end) { } -#endif /* CONFIG_SPARSEMEM_VMEMMAP */ +/* + * We do not have access to the sparsemem vmemmap, so we fallback to + * walking the list of sparsemem blocks which we already maintain for + * the sake of crashdump. In the long run, we might want to maintain + * a tree if performance of that linear walk becomes a problem. + * + * realmode_pfn_to_page functions can fail due to: + * 1) As real sparsemem blocks do not lay in RAM continously (they + * are in virtual address space which is not available in the real mode), + * the requested page struct can be split between blocks so get_page/put_page + * may fail. + * 2) When huge pages are used, the get_page/put_page API will fail + * in real mode as the linked addresses in the page struct are virtual + * too. 
+ */ +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct vmemmap_backing *vmem_back; + struct page *page; + unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; + unsigned long pg_va = (unsigned long) pfn_to_page(pfn); + for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) { + if (pg_va < vmem_back->virt_addr) + continue; + + /* Check that page struct is not split between real pages */ + if ((pg_va + sizeof(struct page)) > + (vmem_back->virt_addr + page_size)) + return NULL; + + page = (struct page *) (vmem_back->phys + pg_va - + vmem_back->virt_addr); + return page; + } + + return NULL; +} +EXPORT_SYMBOL_GPL(realmode_pfn_to_page); + +#elif defined(CONFIG_FLATMEM) + +struct page *realmode_pfn_to_page(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + return page; +} +EXPORT_SYMBO
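The realmode_pfn_to_page() walk shown above (SPARSEMEM_VMEMMAP case) can be modelled in user space to make the boundary check visible. The sketch below uses toy sizes and a stand-in `PAGE_STRUCT_SIZE`; it mirrors the three outcomes of the kernel function: a hit translated by the block's virt-to-phys offset, a refusal (NULL, here 0) when the page struct straddles a block boundary, and a miss when no block covers the address.

```c
/* User-space model of the linear vmemmap-block walk: find the block
 * covering the page struct's virtual address, refuse page structs that
 * are split between real pages, and translate virt -> phys by the
 * block's offset.  Sizes and the 0 "not found" sentinel are toys. */
#include <stdint.h>
#include <stddef.h>

#define PAGE_STRUCT_SIZE 64UL	/* stand-in for sizeof(struct page) */

struct vmemmap_backing {
	struct vmemmap_backing *list;
	unsigned long phys;
	unsigned long virt_addr;
};

/* block_size plays the role of the vmemmap page size. Returns the
 * physical address of the page struct, or 0 when it cannot be used. */
static unsigned long realmode_lookup(struct vmemmap_backing *head,
				     unsigned long pg_va,
				     unsigned long block_size)
{
	struct vmemmap_backing *vb;

	for (vb = head; vb; vb = vb->list) {
		if (pg_va < vb->virt_addr)
			continue;
		/* Check that the page struct is not split between blocks. */
		if (pg_va + PAGE_STRUCT_SIZE > vb->virt_addr + block_size)
			return 0;
		return vb->phys + (pg_va - vb->virt_addr);
	}
	return 0;
}
```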
[PATCH v9 08/13] KVM: PPC: Add support for multiple-TCE hcalls
This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for user space emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call, which saves time on transition to/from real mode. This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in virtual mode in case the real mode handler fails for some reason. This adds a function to convert a guest physical address to a host virtual address in order to parse a TCE list from H_PUT_TCE_INDIRECT. This also implements the KVM_CAP_PPC_MULTITCE capability. When present, the hypercalls mentioned above may or may not be processed successfully in the kernel-based fast path. If they cannot be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in-kernel acceleration. 
Signed-off-by: Paul Mackerras Signed-off-by: Alexey Kardashevskiy --- Changelog: v8: * fixed warnings from check_patch.pl 2013/08/01 (v7): * realmode_get_page/realmode_put_page use was replaced with get_page_unless_zero/put_page_unless_one 2013/07/11: * addressed many, many comments from maintainers 2013/07/06: * fixed number of wrong get_page()/put_page() calls 2013/06/27: * fixed clear of BUSY bit in kvmppc_lookup_pte() * H_PUT_TCE_INDIRECT does realmode_get_page() now * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64 * updated doc 2013/06/05: * fixed mistype about IBMVIO in the commit message * updated doc and moved it to another section * changed capability number 2013/05/21: * added kvm_vcpu_arch::tce_tmp * removed cleanup if put_indirect failed, instead we do not even start writing to TCE table if we cannot get TCEs from the user and they are invalid * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item) * fixed bug with failthrough for H_IPI * removed all get_user() from real mode handlers * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public) --- Documentation/virtual/kvm/api.txt | 26 +++ arch/powerpc/include/asm/kvm_host.h | 9 ++ arch/powerpc/include/asm/kvm_ppc.h | 16 +- arch/powerpc/kvm/book3s_64_vio.c| 132 +++- arch/powerpc/kvm/book3s_64_vio_hv.c | 270 arch/powerpc/kvm/book3s_hv.c| 41 - arch/powerpc/kvm/book3s_hv_rmhandlers.S | 8 +- arch/powerpc/kvm/book3s_pr_papr.c | 35 + arch/powerpc/kvm/powerpc.c | 3 + 9 files changed, 506 insertions(+), 34 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index ef925ea..1c8942a 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2382,6 +2382,32 @@ calls by the guest for that service will be passed to userspace to be handled. 
+4.86 KVM_CAP_PPC_MULTITCE + +Capability: KVM_CAP_PPC_MULTITCE +Architectures: ppc +Type: vm + +This capability means the kernel is capable of handling hypercalls +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user +space. This significantly accelerates DMA operations for PPC KVM guests. +User space should expect that its handlers for these hypercalls +are not going to be called if user space previously registered LIOBN +in KVM (via KVM_CREATE_SPAPR_TCE or similar calls). + +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest, +user space might have to advertise it for the guest. For example, +IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is +present in the "ibm,hypertas-functions" device-tree property. + +The hypercalls mentioned above may or may not be processed successfully +in the kernel based fast path. If they can not be handled by the kernel, +they will get passed on to user space. So user space still has to have +an implementation for these despite the in kernel acceleration. + +This capability is always enabled. + + 5. The kvm_run structure diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index af326cd..a23f132 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -609,6 +610,14 @@ struct kvm_vcpu_arch { spinlock_t tbacct_lock; u64 busy_stolen; u64 busy_preempt; + + unsigned long *tce_tmp_hpas;/* TCE cache for TCE_PUT_INDIRECT */ + enum { + TCERM_NONE, + TCERM_GETPAGE, + TCERM_PUTTCE, + TCERM_PUTLIST, + } tce_rm_fail; /* failed stage of request pr
[PATCH v9 06/13] powerpc: add real mode support for dma operations on powernv
The existing TCE machine calls (tce_build and tce_free) only support virtual mode as they call __raw_writeq for TCE invalidation what fails in real mode. This introduces tce_build_rm and tce_free_rm real mode versions which do mostly the same but use "Store Doubleword Caching Inhibited Indexed" instruction for TCE invalidation. This new feature is going to be utilized by real mode support of VFIO. Signed-off-by: Alexey Kardashevskiy --- Changes: v8: * fixed check_patch.pl warnings 2013/11/07: * added comment why stdcix cannot be used in virtual mode 2013/08/07: * tested on p7ioc and fixed a bug with realmode addresses --- arch/powerpc/include/asm/machdep.h| 12 arch/powerpc/platforms/powernv/pci-ioda.c | 49 +++ arch/powerpc/platforms/powernv/pci.c | 42 ++ arch/powerpc/platforms/powernv/pci.h | 3 +- 4 files changed, 87 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index 8b48090..07dd3b1 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -78,6 +78,18 @@ struct machdep_calls { long index); void(*tce_flush)(struct iommu_table *tbl); + /* _rm versions are for real mode use only */ + int (*tce_build_rm)(struct iommu_table *tbl, +long index, +long npages, +unsigned long uaddr, +enum dma_data_direction direction, +struct dma_attrs *attrs); + void(*tce_free_rm)(struct iommu_table *tbl, + long index, + long npages); + void(*tce_flush_rm)(struct iommu_table *tbl); + void __iomem * (*ioremap)(phys_addr_t addr, unsigned long size, unsigned long flags, void *caller); void(*iounmap)(volatile void __iomem *token); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 756bb58..8cba234 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -70,6 +70,16 @@ define_pe_printk_level(pe_err, KERN_ERR); define_pe_printk_level(pe_warn, KERN_WARNING); define_pe_printk_level(pe_info, 
KERN_INFO); +/* + * stdcix is only supposed to be used in hypervisor real mode as per + * the architecture spec + */ +static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr) +{ + __asm__ __volatile__("stdcix %0,0,%1" + : : "r" (val), "r" (paddr) : "memory"); +} + static int pnv_ioda_alloc_pe(struct pnv_phb *phb) { unsigned long pe; @@ -454,10 +464,13 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) } } -static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, -u64 *startp, u64 *endp) +static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe, +struct iommu_table *tbl, +u64 *startp, u64 *endp, bool rm) { - u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index; + u64 __iomem *invalidate = rm ? + (u64 __iomem *)pe->tce_inval_reg_phys : + (u64 __iomem *)tbl->it_index; unsigned long start, end, inc; start = __pa(startp); @@ -484,7 +497,10 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, mb(); /* Ensure above stores are visible */ while (start <= end) { -__raw_writeq(start, invalidate); + if (rm) + __raw_rm_writeq(start, invalidate); + else + __raw_writeq(start, invalidate); start += inc; } @@ -496,10 +512,12 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, struct iommu_table *tbl, -u64 *startp, u64 *endp) +u64 *startp, u64 *endp, bool rm) { unsigned long start, end, inc; - u64 __iomem *invalidate = (u64 __iomem *)tbl->it_index; + u64 __iomem *invalidate = rm ? + (u64 __iomem *)pe->tce_inval_reg_phys : + (u64 __iomem *)tbl->it_index; /* We'll invalidate DMA address in PE scope */ start = 0x2ul << 60; @@ -515,22 +533,25 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, mb(); while (start <= end) { - __raw_writeq(start, invalidate); + if (rm) +
[PATCH v9 07/13] KVM: PPC: enable IOMMU_API for KVM_BOOK3S_64 permanently
It does not make much sense to have KVM in book3s-64bit and not to have IOMMU bits for PCI pass through support as it costs little and allows VFIO to function on book3s-kvm. Having IOMMU_API always enabled makes it unnecessary to have a lot of "#ifdef IOMMU_API" in arch/powerpc/kvm/book3s_64_vio*. With those ifdef's we could have only user space emulated devices accelerated (but not VFIO) which do not seem to be very useful. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kvm/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index c55c538..3b2b761 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -59,6 +59,7 @@ config KVM_BOOK3S_64 depends on PPC_BOOK3S_64 select KVM_BOOK3S_64_HANDLER select KVM + select SPAPR_TCE_IOMMU ---help--- Support running unmodified book3s_64 and book3s_32 guest kernels in virtual machines on book3s_64 host processors. -- 1.8.4.rc4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v9 09/13] powerpc/iommu: rework to support realmode
TCE table handling may differ for real and virtual modes, so additional ppc_md.tce_build_rm/ppc_md.tce_free_rm/ppc_md.tce_flush_rm handlers were introduced earlier. So this adds the following: 1. support for the new ppc_md calls; 2. the ability for iommu_tce_build to process multiple entries per call; 3. an arch_spin_lock to protect the TCE table from races in both real and virtual modes; 4. proper TCE table protection from races with the existing IOMMU code in iommu_take_ownership/iommu_release_ownership; 5. the hwaddr variable renamed to hpa as it better describes what it actually represents; 6. iommu_tce_direction made static as it is not called from anywhere else. This will be used by the upcoming real mode support of VFIO on POWER. Signed-off-by: Alexey Kardashevskiy --- Changes: v8: * fixed warnings from check_patch.pl --- arch/powerpc/include/asm/iommu.h | 9 +- arch/powerpc/kernel/iommu.c | 198 ++- 2 files changed, 136 insertions(+), 71 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 19ad77f..71ee525 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -78,6 +78,7 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ #ifdef CONFIG_IOMMU_API struct iommu_group *it_group; + arch_spinlock_t it_rm_lock; #endif }; @@ -161,9 +162,9 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl, extern int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce); extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, - unsigned long hwaddr, enum dma_data_direction direction); -extern unsigned long iommu_clear_tce(struct iommu_table *tbl, - unsigned long entry); + unsigned long *hpas, unsigned long npages, bool rm); +extern int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, + unsigned long npages, bool rm); extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long 
entry, unsigned long pages); extern int iommu_put_tce_user_mode(struct iommu_table *tbl, @@ -173,7 +174,5 @@ extern void iommu_flush_tce(struct iommu_table *tbl); extern int iommu_take_ownership(struct iommu_table *tbl); extern void iommu_release_ownership(struct iommu_table *tbl); -extern enum dma_data_direction iommu_tce_direction(unsigned long tce); - #endif /* __KERNEL__ */ #endif /* _ASM_IOMMU_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 15f8ca8..ff0cd90 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -903,7 +903,7 @@ void iommu_register_group(struct iommu_table *tbl, kfree(name); } -enum dma_data_direction iommu_tce_direction(unsigned long tce) +static enum dma_data_direction iommu_tce_direction(unsigned long tce) { if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE)) return DMA_BIDIRECTIONAL; @@ -914,7 +914,6 @@ enum dma_data_direction iommu_tce_direction(unsigned long tce) else return DMA_NONE; } -EXPORT_SYMBOL_GPL(iommu_tce_direction); void iommu_flush_tce(struct iommu_table *tbl) { @@ -972,73 +971,117 @@ int iommu_tce_put_param_check(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_put_param_check); -unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry) -{ - unsigned long oldtce; - struct iommu_pool *pool = get_pool(tbl, entry); - - spin_lock(&(pool->lock)); - - oldtce = ppc_md.tce_get(tbl, entry); - if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)) - ppc_md.tce_free(tbl, entry, 1); - else - oldtce = 0; - - spin_unlock(&(pool->lock)); - - return oldtce; -} -EXPORT_SYMBOL_GPL(iommu_clear_tce); - int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages) { - unsigned long oldtce; - struct page *page; - - for ( ; pages; --pages, ++entry) { - oldtce = iommu_clear_tce(tbl, entry); - if (!oldtce) - continue; - - page = pfn_to_page(oldtce >> PAGE_SHIFT); - WARN_ON(!page); - if (page) { - if (oldtce & TCE_PCI_WRITE) - 
SetPageDirty(page); - put_page(page); - } - } - - return 0; + return iommu_free_tces(tbl, entry, pages, false); } EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages); -/* - * hwaddr is a kernel virtual address here (0xc... bazillion), - * tce_build converts it to a physical address. - */ +int iommu_free_tces(struct iommu_table *tbl, unsigned long entry, + unsigned long npages, bool rm) +{ + int i, ret = 0, to_free = 0; + + if (
Re: [PATCH 08/12] KVM: MMU: introduce nulls desc
On Tue, Jul 30, 2013 at 09:02:06PM +0800, Xiao Guangrong wrote: > It likes nulls list and we use the pte-list as the nulls which can help us to > detect whether the "desc" is moved to anther rmap then we can re-walk the rmap > if that happened > > Signed-off-by: Xiao Guangrong > --- > arch/x86/kvm/mmu.c | 35 --- > 1 file changed, 28 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 3013bb1..36caf6a 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -918,6 +918,24 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t > large_gfn) > return level - 1; > } > > +static void desc_mark_nulls(unsigned long *pte_list, struct pte_list_desc > *desc) > +{ > + unsigned long marker; > + > + marker = (unsigned long)pte_list | 1UL; > + desc->more = (struct pte_list_desc *)marker; > +} > + > +static bool desc_is_a_nulls(struct pte_list_desc *desc) > +{ > + return (unsigned long)desc & 1; > +} > + > +static unsigned long *desc_get_nulls_value(struct pte_list_desc *desc) > +{ > + return (unsigned long *)((unsigned long)desc & ~1); > +} > + > static int __find_first_free(struct pte_list_desc *desc) > { > int i; > @@ -956,7 +974,7 @@ static int count_spte_number(struct pte_list_desc *desc) > > first_free = __find_first_free(desc); > > - for (desc_num = 0; desc->more; desc = desc->more) > + for (desc_num = 0; !desc_is_a_nulls(desc->more); desc = desc->more) > desc_num++; > > return first_free + desc_num * PTE_LIST_EXT; > @@ -991,6 +1009,7 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 *spte, > desc = mmu_alloc_pte_list_desc(vcpu); > desc->sptes[0] = (u64 *)*pte_list; > desc->sptes[1] = spte; > + desc_mark_nulls(pte_list, desc); > *pte_list = (unsigned long)desc | 1; > return 1; > } > @@ -1036,7 +1055,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list, > /* >* Only one entry existing but still use a desc to store it? 
>*/ > - WARN_ON(!next_desc); > + WARN_ON(desc_is_a_nulls(next_desc)); > > mmu_free_pte_list_desc(first_desc); > first_desc = next_desc; > @@ -1050,7 +1069,7 @@ pte_list_desc_remove_entry(unsigned long *pte_list, >* Only one entry in this desc, move the entry to the head >* then the desc can be freed. >*/ > - if (!first_desc->sptes[1] && !first_desc->more) { > + if (!first_desc->sptes[1] && desc_is_a_nulls(first_desc->more)) { > *pte_list = (unsigned long)first_desc->sptes[0]; > mmu_free_pte_list_desc(first_desc); > } > @@ -1077,7 +1096,7 @@ static void pte_list_remove(u64 *spte, unsigned long > *pte_list) > > rmap_printk("pte_list_remove: %p many->many\n", spte); > desc = (struct pte_list_desc *)(*pte_list & ~1ul); > - while (desc) { > + while (!desc_is_a_nulls(desc)) { > for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) > if (desc->sptes[i] == spte) { > pte_list_desc_remove_entry(pte_list, > @@ -1103,11 +1122,13 @@ static void pte_list_walk(unsigned long *pte_list, > pte_list_walk_fn fn) > return fn((u64 *)*pte_list); > > desc = (struct pte_list_desc *)(*pte_list & ~1ul); > - while (desc) { > + while (!desc_is_a_nulls(desc)) { > for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i) > fn(desc->sptes[i]); > desc = desc->more; > } > + > + WARN_ON(desc_get_nulls_value(desc) != pte_list); > } > > static unsigned long *__gfn_to_rmap(gfn_t gfn, int level, > @@ -1200,7 +1221,7 @@ static u64 *rmap_get_first(unsigned long rmap, struct > rmap_iterator *iter) > */ > static u64 *rmap_get_next(struct rmap_iterator *iter) > { > - if (iter->desc) { > + if (iter->desc && !desc_is_a_nulls(iter->desc)) { > if (iter->pos < PTE_LIST_EXT - 1) { > u64 *sptep; > > @@ -1212,7 +1233,7 @@ static u64 *rmap_get_next(struct rmap_iterator *iter) > > iter->desc = iter->desc->more; > I'd rather do: iter->desc = desc_is_a_nulls(iter->desc) ? NULL : iter->desc; here and drop two desc_is_a_nulls() checks in this function. 
> - if (iter->desc) { > + if (!desc_is_a_nulls(iter->desc)) { > iter->pos = 0; > /* desc->sptes[0] cannot be NULL */ > return iter->desc->sptes[iter->pos]; > -- > 1.8.1.4 -- Gleb.
[PATCH v9 04/13] KVM: PPC: reserve a capability and KVM device type for realmode VFIO
This reserves a capability number for upcoming support of VFIO-IOMMU DMA operations in real mode. This reserves a number for a new "SPAPR TCE IOMMU" KVM device which is going to manage lifetime of SPAPR TCE IOMMU object. This defines an attribute of the "SPAPR TCE IOMMU" KVM device which is going to be used for initialization. Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * KVM ioctl is replaced with "SPAPR TCE IOMMU" KVM device type with KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE attribute 2013/08/15: * fixed mistype in comments * fixed commit message which says what uses ioctls 0xad and 0xae 2013/07/16: * changed the number 2013/07/11: * changed order in a file, added comment about a gap in ioctl number --- arch/powerpc/include/uapi/asm/kvm.h | 8 include/uapi/linux/kvm.h| 2 ++ 2 files changed, 10 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 0fb1a6e..c1ae1e5 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -511,4 +511,12 @@ struct kvm_get_htab_header { #define KVM_XICS_MASKED (1ULL << 41) #define KVM_XICS_PENDING (1ULL << 42) +/* SPAPR TCE IOMMU device specification */ +struct kvm_create_spapr_tce_iommu_linkage { + __u64 liobn; + __u32 fd; + __u32 flags; +}; +#define KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE 0 + #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 99c2533..9d20630 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_IRQ_XICS 92 #define KVM_CAP_ARM_EL1_32BIT 93 #define KVM_CAP_SPAPR_MULTITCE 94 +#define KVM_CAP_SPAPR_TCE_IOMMU 95 #ifdef KVM_CAP_IRQ_ROUTING @@ -843,6 +844,7 @@ struct kvm_device_attr { #define KVM_DEV_TYPE_FSL_MPIC_20 1 #define KVM_DEV_TYPE_FSL_MPIC_42 2 #define KVM_DEV_TYPE_XICS 3 +#define KVM_DEV_TYPE_SPAPR_TCE_IOMMU 4 /* * ioctls for VM fds -- 1.8.4.rc4 -- To unsubscribe from this list: send the line 
"unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list
On 08/28/2013 04:12 PM, Gleb Natapov wrote: >> + >> +rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte); >> +desc = (struct pte_list_desc *)(*pte_list & ~1ul); >> + >> +/* No empty position in the desc. */ >> +if (desc->sptes[PTE_LIST_EXT - 1]) { >> +struct pte_list_desc *new_desc; >> +new_desc = mmu_alloc_pte_list_desc(vcpu); >> +new_desc->more = desc; >> +desc = new_desc; >> +*pte_list = (unsigned long)desc | 1; >> } >> -return count; >> + >> +free_pos = find_first_free(desc); >> +desc->sptes[free_pos] = spte; >> +return count_spte_number(desc); > Should it be count_spte_number(desc) - 1? The function should returns > the number of pte entries before the spte was added. Yes. We have handled it count_spte_number(), we count the number like this: return first_free + desc_num * PTE_LIST_EXT; The first_free is indexed from 0. Maybe it is clearer to let count_spte_number() return the real number. > >> } >> >> static void >> -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc >> *desc, >> - int i, struct pte_list_desc *prev_desc) >> +pte_list_desc_remove_entry(unsigned long *pte_list, >> + struct pte_list_desc *desc, int i) >> { >> -int j; >> +struct pte_list_desc *first_desc; >> +int last_used; >> + >> +first_desc = (struct pte_list_desc *)(*pte_list & ~1ul); >> +last_used = find_last_used(first_desc); >> >> -for (j = PTE_LIST_EXT - 1; !desc->sptes[j] && j > i; --j) >> -; >> -desc->sptes[i] = desc->sptes[j]; >> -desc->sptes[j] = NULL; >> -if (j != 0) >> +/* >> + * Move the entry from the first desc to this position we want >> + * to remove. >> + */ >> +desc->sptes[i] = first_desc->sptes[last_used]; >> +first_desc->sptes[last_used] = NULL; >> + > What if desc == first_desc and i < last_used. You still move spte > backwards so lockless walk may have already examined entry at i and > will miss spte that was moved there from last_used position, no? Right. I noticed it too and fixed in the v2 which is being tested. 
I fixed it by bottom-up walk desc, like this: pte_list_walk_lockless(): desc = (struct pte_list_desc *)(pte_list_value & ~1ul); while (!desc_is_a_nulls(desc)) { /* * We should do bottom-up walk since we always use the * bottom entry to replace the deleted entry if only * one desc is used in the rmap when a spte is removed. * Otherwise the moved entry will be missed. */ for (i = PTE_LIST_EXT - 1; i >= 0; i--) fn(desc->sptes[i]); desc = ACCESS_ONCE(desc->more); /* It is being initialized. */ if (unlikely(!desc)) goto restart; } How about this? > >> +/* No valid entry in this desc, we can free this desc now. */ >> +if (!first_desc->sptes[0]) { >> +struct pte_list_desc *next_desc = first_desc->more; >> + >> +/* >> + * Only one entry existing but still use a desc to store it? >> + */ >> +WARN_ON(!next_desc); >> + >> +mmu_free_pte_list_desc(first_desc); >> +first_desc = next_desc; >> +*pte_list = (unsigned long)first_desc | 1ul; >> return; >> -if (!prev_desc && !desc->more) >> -*pte_list = (unsigned long)desc->sptes[0]; >> -else >> -if (prev_desc) >> -prev_desc->more = desc->more; >> -else >> -*pte_list = (unsigned long)desc->more | 1; >> -mmu_free_pte_list_desc(desc); >> +} >> + >> +WARN_ON(!first_desc->sptes[0]); >> + >> +/* >> + * Only one entry in this desc, move the entry to the head >> + * then the desc can be freed. >> + */ >> +if (!first_desc->sptes[1] && !first_desc->more) { >> +*pte_list = (unsigned long)first_desc->sptes[0]; >> +mmu_free_pte_list_desc(first_desc); >> +} >> } >> >> static void pte_list_remove(u64 *spte, unsigned long *pte_list) >> { >> struct pte_list_desc *desc; >> -struct pte_list_desc *prev_desc; >> int i; >> >> if (!*pte_list) { >> -printk(KERN_ERR "pte_list_remove: %p 0->BUG\n", spte); >> -BUG(); >> -} else if (!(*pte_list & 1)) { >> +WARN(1, KERN_ERR "pte_list_remove: %p 0->BUG\n", spte); > Why change BUG() to WARN() here and below? WARN(1, "xxx") can replace two lines in the origin code. 
And personally, I prefer WARN() to BUG() since sometimes BUG() can stop my box and I need to get the full log by using kdump. If you object, I will change it back in the next version. :) > >> +return; >> +} >> + >> +if (!(*pte_list & 1)) {
Re: [PATCH] KVM: x86: should release vcpu when closing vcpu fd
On Wed, Aug 28, 2013 at 04:08:10PM +0800, Chen Fan wrote: > If we want to remove a vcpu from vm, we should not be only to > put kvm, we need to decrease online vcpus and free struct kvm_vcpu > when closing a vcpu file descriptor. > It is much more complicated than what you do here. kvm->vcpus[] is accessed without locking and kvm->online_vcpus is assumed to be greater than the max occupied index in kvm->vcpus[]. Attempts were made to make it possible to destroy individual vcpus separately from destroying the VM before, but they were unsuccessful thus far. > Signed-off-by: Chen Fan > --- > arch/x86/kvm/x86.c | 14 ++ > include/linux/kvm_host.h | 1 + > virt/kvm/kvm_main.c | 4 +++- > 3 files changed, 18 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index d21bce5..3370de1 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -6904,6 +6904,20 @@ static void kvm_free_vcpus(struct kvm *kvm) > mutex_unlock(&kvm->lock); > } > > +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu) > +{ > + struct kvm *kvm = vcpu->kvm; > + > + mutex_lock(&kvm->lock); > + atomic_dec(&kvm->online_vcpus); > + kvm->vcpus[atomic_read(&kvm->online_vcpus)] = NULL; > + mutex_unlock(&kvm->lock); > + > + kvm_clear_async_pf_completion_queue(vcpu); > + kvm_unload_vcpu_mmu(vcpu); > + kvm_arch_vcpu_free(vcpu); > +} > + > void kvm_arch_sync_events(struct kvm *kvm) > { > kvm_free_all_assigned_devices(kvm); > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index a63d83e..e0811f97 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -631,6 +631,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, > unsigned int id); > int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu); > int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu); > void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu); > +void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu); > > int kvm_arch_hardware_enable(void *garbage); > void kvm_arch_hardware_disable(void *garbage); 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 1580dd4..442014b 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1873,8 +1873,10 @@ static int kvm_vcpu_mmap(struct file *file, struct > vm_area_struct *vma) > static int kvm_vcpu_release(struct inode *inode, struct file *filp) > { > struct kvm_vcpu *vcpu = filp->private_data; > + struct kvm *kvm = vcpu->kvm; > > - kvm_put_kvm(vcpu->kvm); > + kvm_arch_vcpu_release(vcpu); > + kvm_put_kvm(kvm); > return 0; > } > > -- > 1.8.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb.
Re: [PATCH 07/12] KVM: MMU: redesign the algorithm of pte_list
On Tue, Jul 30, 2013 at 09:02:05PM +0800, Xiao Guangrong wrote: > Change the algorithm to: > 1) always add new desc to the first desc (pointed by parent_ptes/rmap) >that is good to implement rcu-nulls-list-like lockless rmap >walking > > 2) always move the entry in first desc to the the position we want >to remove when remove a spte in the parent_ptes/rmap (backward-move). >It is good for us to implement lockless rmap walk since in the current >code, when a spte is deleted from the "desc", another spte in the last >"desc" will be moved to this position to replace the deleted one. If the >deleted one has been accessed and we do not access the replaced one, the >replaced one is missed when we do lockless walk. >To fix this case, we do not backward move the spte, instead, we forward >move the entry: when a spte is deleted, we move the entry in the first >desc to that position > > Both of these also can reduce cache miss > > Signed-off-by: Xiao Guangrong > --- > arch/x86/kvm/mmu.c | 182 > - > 1 file changed, 125 insertions(+), 57 deletions(-) > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 5a40564..3013bb1 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -918,6 +918,50 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t > large_gfn) > return level - 1; > } > > +static int __find_first_free(struct pte_list_desc *desc) > +{ > + int i; > + > + for (i = 0; i < PTE_LIST_EXT; i++) > + if (!desc->sptes[i]) > + break; > + return i; > +} > + > +static int find_first_free(struct pte_list_desc *desc) > +{ > + int free = __find_first_free(desc); > + > + WARN_ON(free >= PTE_LIST_EXT); > + return free; > +} > + > +static int find_last_used(struct pte_list_desc *desc) > +{ > + int used = __find_first_free(desc) - 1; > + > + WARN_ON(used < 0 || used >= PTE_LIST_EXT); > + return used; > +} > + > +/* > + * TODO: we can encode the desc number into the rmap/parent_ptes > + * since at least 10 physical/virtual address bits are reserved > + * on x86. 
It is worthwhile if it shows that the desc walking is > + * a performance issue. > + */ > +static int count_spte_number(struct pte_list_desc *desc) > +{ > + int first_free, desc_num; > + > + first_free = __find_first_free(desc); > + > + for (desc_num = 0; desc->more; desc = desc->more) > + desc_num++; > + > + return first_free + desc_num * PTE_LIST_EXT; > +} > + > /* > * Pte mapping structures: > * > @@ -934,92 +978,116 @@ static int pte_list_add(struct kvm_vcpu *vcpu, u64 > *spte, > unsigned long *pte_list) > { > struct pte_list_desc *desc; > - int i, count = 0; > + int free_pos; > > if (!*pte_list) { > rmap_printk("pte_list_add: %p %llx 0->1\n", spte, *spte); > *pte_list = (unsigned long)spte; > - } else if (!(*pte_list & 1)) { > + return 0; > + } > + > + if (!(*pte_list & 1)) { > rmap_printk("pte_list_add: %p %llx 1->many\n", spte, *spte); > desc = mmu_alloc_pte_list_desc(vcpu); > desc->sptes[0] = (u64 *)*pte_list; > desc->sptes[1] = spte; > *pte_list = (unsigned long)desc | 1; > - ++count; > - } else { > - rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte); > - desc = (struct pte_list_desc *)(*pte_list & ~1ul); > - while (desc->sptes[PTE_LIST_EXT-1] && desc->more) { > - desc = desc->more; > - count += PTE_LIST_EXT; > - } > - if (desc->sptes[PTE_LIST_EXT-1]) { > - desc->more = mmu_alloc_pte_list_desc(vcpu); > - desc = desc->more; > - } > - for (i = 0; desc->sptes[i]; ++i) > - ++count; > - desc->sptes[i] = spte; > + return 1; > + } > + > + rmap_printk("pte_list_add: %p %llx many->many\n", spte, *spte); > + desc = (struct pte_list_desc *)(*pte_list & ~1ul); > + > + /* No empty position in the desc. 
*/ > + if (desc->sptes[PTE_LIST_EXT - 1]) { > + struct pte_list_desc *new_desc; > + new_desc = mmu_alloc_pte_list_desc(vcpu); > + new_desc->more = desc; > + desc = new_desc; > + *pte_list = (unsigned long)desc | 1; > } > - return count; > + > + free_pos = find_first_free(desc); > + desc->sptes[free_pos] = spte; > + return count_spte_number(desc); Should it be count_spte_number(desc) - 1? The function should returns the number of pte entries before the spte was added. > } > > static void > -pte_list_desc_remove_entry(unsigned long *pte_list, struct pte_list_desc > *desc, > -int i, struct pte_list_desc *prev_desc) > +
[PATCH] KVM: x86: should release vcpu when closing vcpu fd
If we want to remove a vcpu from a vm, we should not only put kvm; we also
need to decrease online_vcpus and free the struct kvm_vcpu when closing a
vcpu file descriptor.

Signed-off-by: Chen Fan
---
 arch/x86/kvm/x86.c       | 14 ++++++++++++++
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      |  4 +++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d21bce5..3370de1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6904,6 +6904,20 @@ static void kvm_free_vcpus(struct kvm *kvm)
 	mutex_unlock(&kvm->lock);
 }
 
+void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	mutex_lock(&kvm->lock);
+	atomic_dec(&kvm->online_vcpus);
+	kvm->vcpus[atomic_read(&kvm->online_vcpus)] = NULL;
+	mutex_unlock(&kvm->lock);
+
+	kvm_clear_async_pf_completion_queue(vcpu);
+	kvm_unload_vcpu_mmu(vcpu);
+	kvm_arch_vcpu_free(vcpu);
+}
+
 void kvm_arch_sync_events(struct kvm *kvm)
 {
 	kvm_free_all_assigned_devices(kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a63d83e..e0811f97 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -631,6 +631,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
+void kvm_arch_vcpu_release(struct kvm_vcpu *vcpu);
 
 int kvm_arch_hardware_enable(void *garbage);
 void kvm_arch_hardware_disable(void *garbage);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1580dd4..442014b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1873,8 +1873,10 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
 	struct kvm_vcpu *vcpu = filp->private_data;
+	struct kvm *kvm = vcpu->kvm;
 
-	kvm_put_kvm(vcpu->kvm);
+	kvm_arch_vcpu_release(vcpu);
+	kvm_put_kvm(kvm);
 	return 0;
 }
--
1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
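A userspace model of the bookkeeping the patch adds (locking, atomics, and the arch teardown calls are elided, and the names here are invented for the sketch). Note that it clears the slot at the decremented count, which matches the vcpu being released only when that vcpu was the last one created:

```c
#include <assert.h>
#include <stddef.h>

#define KVM_MAX_VCPUS 8

struct kvm_vcpu { int id; };

struct kvm {
	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
	int online_vcpus;
};

/*
 * Mirrors the patch's kvm_arch_vcpu_release() bookkeeping: drop the online
 * count, then NULL the slot at the new count, i.e. always the last slot.
 */
static void vcpu_release_bookkeeping(struct kvm *kvm)
{
	kvm->online_vcpus--;
	kvm->vcpus[kvm->online_vcpus] = NULL;
}
```

Releasing with two vcpus online clears vcpus[1] and leaves vcpus[0] intact, which is correct for the last-created vcpu but would orphan the wrong entry if a lower-numbered vcpu fd were closed first.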
Re: [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified
On 08/28/2013 03:23 PM, Gleb Natapov wrote:
> On Tue, Jul 30, 2013 at 09:02:04PM +0800, Xiao Guangrong wrote:
>> Relax the tlb flush condition since we will write-protect the spte out of mmu
>> lock. Note lockless write-protection only marks the writable spte to readonly
>> and the spte can be writable only if both SPTE_HOST_WRITEABLE and
>> SPTE_MMU_WRITEABLE are set (that are tested by spte_is_locklessly_modifiable)
>>
>> This patch is used to avoid this kind of race:
>>
>>    VCPU 0                          VCPU 1
>> lockless write protection:
>>    set spte.w = 0
>>                                    lock mmu-lock
>>
>>                                    write protection the spte to sync shadow page,
>>                                    see spte.w = 0, then without flush tlb
>>
>>                                    unlock mmu-lock
>>
>>                                    !!! At this point, the shadow page can still be
>>                                    writable due to the corrupt tlb entry
>> Flush all TLB
>>
>> Signed-off-by: Xiao Guangrong
>> ---
>>  arch/x86/kvm/mmu.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 58283bf..5a40564 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -600,7 +600,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>>  	 * we always atomically update it, see the comments in
>>  	 * spte_has_volatile_bits().
>>  	 */
>> -	if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
>> +	if (spte_is_locklessly_modifiable(old_spte) &&
>> +	    !is_writable_pte(new_spte))
>>  		ret = true;
> This will needlessly flush tlbs when dirty logging is not in use (common
> case) and the old spte is non-writable. Can you estimate how serious the
> performance hit is?

If the write protection is not caused by dirty log, the spte is always
writable when SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE are set. In other
words, spte_is_locklessly_modifiable(old_spte) is the same as
is_writable_pte(old_spte) in the common case.

There are two cases causing unnecessary TLB flush:

1) the guest read-faults on a spte write-protected by dirty log and uses a
   readonly host pfn to fix it. This is really rare since a read access on
   the readonly spte can not trigger #PF.

2) the guest requires write protection caused by syncing shadow pages. This
   is only needed if ept is disabled, and in most cases the guest has many
   sptes that need to be write-protected, so an unnecessary TLB flush is
   rare too.
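The trade-off being debated can be modeled outside the kernel. In this sketch the bit positions are illustrative assumptions (the real masks live in mmu.c and differ), but the shape of the two predicates follows the quoted diff:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit layout, NOT the kernel's actual shadow-PTE masks. */
#define PT_WRITABLE_MASK	(1ull << 1)
#define SPTE_HOST_WRITEABLE	(1ull << 60)
#define SPTE_MMU_WRITEABLE	(1ull << 61)

static bool is_writable_pte(uint64_t spte)
{
	return spte & PT_WRITABLE_MASK;
}

/*
 * Writable-at-any-moment: a lockless writer may set PT_WRITABLE_MASK as
 * long as both the host and mmu writable bits are set, even if the
 * writable bit is clear right now.
 */
static bool spte_is_locklessly_modifiable(uint64_t spte)
{
	return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
	       (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
}

/* Old rule: flush only if the spte was actually writable. */
static bool need_flush_old(uint64_t old_spte, uint64_t new_spte)
{
	return is_writable_pte(old_spte) && !is_writable_pte(new_spte);
}

/*
 * Patched rule: also flush when the spte was merely lockless-modifiable,
 * closing the race in the changelog at the cost of extra flushes.
 */
static bool need_flush_new(uint64_t old_spte, uint64_t new_spte)
{
	return spte_is_locklessly_modifiable(old_spte) &&
	       !is_writable_pte(new_spte);
}
```

The interesting input is an old spte that is lockless-modifiable but not currently writable: the patched rule flushes where the old rule did not, which is exactly the extra cost Gleb asks about; when dirty log is not in use the two predicates agree, matching the argument above.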
Re: [libvirt] [PATCH v2] kvm: warn if num cpus is greater than num recommended
----- Original Message -----
> On 08/23/2013 07:24 AM, Andrew Jones wrote:
>> The comment in kvm_max_vcpus() states that it's using the recommended
>> procedure from the kernel API documentation to get the max number
>> of vcpus that kvm supports. It is, but by always returning the
>> maximum number supported. The maximum number should only be used
>> for development purposes. qemu should check KVM_CAP_NR_VCPUS for
>> the recommended number of vcpus. This patch adds a warning if a user
>> specifies a number of cpus between the recommended and max.
>>
>> v2:
>>  Incorporate tests for max_cpus, which specifies the maximum number
>>  of hotpluggable cpus. An additional note is that the message for
>>  the fail case was slightly changed, 'exceeds max cpus' to
>>  'exceeds the maximum cpus'. If this is an unacceptable change for
>>  users like libvirt, then I'll need to spin a v3.
>
> A quick grep of libvirt does not show any dependence on the particular
> wording "exceeds max cpus", so you are probably fine changing that.
>
> What I'm more worried about is what number is libvirt supposed to show
> to the end user, and should libvirt enforce the lower recommended max,
> or the larger kernel absolute max? Which of the two values does the QMP
> 'MachineInfo' type return in its 'cpu-max' field during the
> 'query-machines' command? Should we be modifying QMP to return both
> values, so that libvirt can also expose the logic to the end user of
> allowing a recommended vs. larger development max?
>

Machine definitions maintain yet another 'max_cpus', and it appears that
QMP would return that value. It would probably be best if it returned
max(qemu_machine.max_cpus, kvm_max_cpus) though.

I'm starting to think that we should just keep things simple for most of
the virt stack by sticking to enforcing the larger developer max, and
then on a production kernel we should just compile
KVM_MAX_VCPUS = KVM_SOFT_MAX_VCPUS and be done with it. With that
thought, this patch could be dropped too. The alternative seems to be
supporting a run-time selectable experimental mode throughout the whole
virt stack.

drew
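The policy the patch describes reduces to a three-way check against the two limits. This is only an illustrative sketch: the helper name and return convention are invented here, and the real code obtains the limits from the KVM_CAP_NR_VCPUS and KVM_CAP_MAX_VCPUS check-extension ioctls rather than taking them as parameters:

```c
#include <assert.h>

/*
 * recommended: the soft limit (KVM_CAP_NR_VCPUS in the real code).
 * max:         the development hard limit (KVM_CAP_MAX_VCPUS).
 * Returns 0 = ok, 1 = supported but warn, -1 = refuse to start
 * ("exceeds the maximum cpus").
 */
static int check_vcpu_count(int nr_cpus, int recommended, int max)
{
	if (nr_cpus > max)
		return -1;	/* beyond the kernel absolute max: error out */
	if (nr_cpus > recommended)
		return 1;	/* development range: run, but print a warning */
	return 0;		/* within the recommended limit */
}
```

The limit values in the usage below are made up; the point is only the ordering recommended < max and the warn band between them.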
Re: [PATCH 06/12] KVM: MMU: flush tlb if the spte can be locklessly modified
On Tue, Jul 30, 2013 at 09:02:04PM +0800, Xiao Guangrong wrote:
> Relax the tlb flush condition since we will write-protect the spte out of mmu
> lock. Note lockless write-protection only marks the writable spte to readonly
> and the spte can be writable only if both SPTE_HOST_WRITEABLE and
> SPTE_MMU_WRITEABLE are set (that are tested by spte_is_locklessly_modifiable)
>
> This patch is used to avoid this kind of race:
>
>    VCPU 0                          VCPU 1
> lockless write protection:
>    set spte.w = 0
>                                    lock mmu-lock
>
>                                    write protection the spte to sync shadow page,
>                                    see spte.w = 0, then without flush tlb
>
>                                    unlock mmu-lock
>
>                                    !!! At this point, the shadow page can still be
>                                    writable due to the corrupt tlb entry
> Flush all TLB
>
> Signed-off-by: Xiao Guangrong
> ---
>  arch/x86/kvm/mmu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 58283bf..5a40564 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -600,7 +600,8 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>  	 * we always atomically update it, see the comments in
>  	 * spte_has_volatile_bits().
>  	 */
> -	if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
> +	if (spte_is_locklessly_modifiable(old_spte) &&
> +	    !is_writable_pte(new_spte))
>  		ret = true;

This will needlessly flush tlbs when dirty logging is not in use (common
case) and the old spte is non-writable. Can you estimate how serious the
performance hit is?

>
>  	if (!shadow_accessed_mask)
> --
> 1.8.1.4

--
Gleb.