RE: [PATCH v2 2/2] KVM: x86: Add lowest-priority support for vt-d posted-interrupts
> -----Original Message-----
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Yang Zhang
> Sent: Monday, December 21, 2015 9:50 AM
> To: Wu, Feng; pbonz...@redhat.com; rkrc...@redhat.com
> Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH v2 2/2] KVM: x86: Add lowest-priority support for vt-d
> posted-interrupts
>
> On 2015/12/16 9:37, Feng Wu wrote:
> > Use vector-hashing to deliver lowest-priority interrupts for
> > VT-d posted-interrupts.
> >
> > Signed-off-by: Feng Wu
> > ---
> >  arch/x86/kvm/lapic.c | 67 ++++++++++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/lapic.h |  2 ++
> >  arch/x86/kvm/vmx.c   | 12 ++++++++--
> >  3 files changed, 79 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index e29001f..d4f2c8f 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -854,6 +854,73 @@ out:
> >  }
> >
> >  /*
> > + * This routine handles lowest-priority interrupts using vector-hashing
> > + * mechanism. As an example, modern Intel CPUs use this method to handle
> > + * lowest-priority interrupts.
> > + *
> > + * Here are the details of the vector-hashing mechanism:
> > + * 1. For lowest-priority interrupts, store all the possible destination
> > + *    vCPUs in an array.
> > + * 2. Use "guest vector % max number of destination vCPUs" to find the right
> > + *    destination vCPU in the array for the lowest-priority interrupt.
> > + */
> > +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
> > +					      struct kvm_lapic_irq *irq)
> > +{
> > +	struct kvm_apic_map *map;
> > +	struct kvm_vcpu *vcpu = NULL;
> > +
> > +	if (irq->shorthand)
> > +		return NULL;
> > +
> > +	rcu_read_lock();
> > +	map = rcu_dereference(kvm->arch.apic_map);
> > +
> > +	if (!map)
> > +		goto out;
> > +
> > +	if ((irq->dest_mode != APIC_DEST_PHYSICAL) &&
> > +	    kvm_lowest_prio_delivery(irq)) {
> > +		u16 cid;
> > +		int i, idx = 0;
> > +		unsigned long bitmap = 1;
> > +		unsigned int dest_vcpus = 0;
> > +		struct kvm_lapic **dst = NULL;
> > +
> > +
> > +		if (!kvm_apic_logical_map_valid(map))
> > +			goto out;
> > +
> > +		apic_logical_id(map, irq->dest_id, &cid, (u16 *)&bitmap);
> > +
> > +		if (cid >= ARRAY_SIZE(map->logical_map))
> > +			goto out;
> > +
> > +		dst = map->logical_map[cid];
> > +
> > +		for_each_set_bit(i, &bitmap, 16) {
> > +			if (!dst[i] && !kvm_lapic_enabled(dst[i]->vcpu)) {
> > +				clear_bit(i, &bitmap);
> > +				continue;
> > +			}
> > +		}
> > +
> > +		dest_vcpus = hweight16(bitmap);
> > +
> > +		if (dest_vcpus != 0) {
> > +			idx = kvm_vector_2_index(irq->vector, dest_vcpus,
> > +						 &bitmap, 16);
> > +			vcpu = dst[idx - 1]->vcpu;
> > +		}
> > +	}
> > +
> > +out:
> > +	rcu_read_unlock();
> > +	return vcpu;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_intr_vector_hashing_dest);
> > +
> > +/*
> >   * Add a pending IRQ into lapic.
> >   * Return 1 if successfully added and 0 if discarded.
> >   */
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index 6890ef0..52bffce 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -172,4 +172,6 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
> >  			struct kvm_vcpu **dest_vcpu);
> >  int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
> >  		       const unsigned long *bitmap, u32 bitmap_size);
> > +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
> > +					      struct kvm_lapic_irq *irq);
> >  #endif
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 5eb56ed..3f89189 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -10702,8 +10702,16 @@ static int vmx_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
> >  		 */
> >
> >  		kvm_set_msi_irq(e, &irq);
> > -		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu))
> > -			continue;
> > +
> > +		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu)) {
> > +			if (!kvm_vector_hashing_enabled() ||
> > +			    irq.delivery_mode != APIC_DM_LOWEST)
> > +				continue;
> > +
> > +			vcpu = kvm_intr_vector_hashing_dest(kvm, &irq);
> > +			if (!vcpu)
> > +				continue;
> > +		}
>
> I am a little confused with the 'continue'. If the destination is not
> single vcpu, shouldn't we roll back to use non-PI mode?

Here is the
Re: [PATCH v2 2/2] KVM: x86: Add lowest-priority support for vt-d posted-interrupts
On 2015/12/21 9:55, Wu, Feng wrote:
>> -----Original Message-----
>> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
>> ow...@vger.kernel.org] On Behalf Of Yang Zhang
>> Sent: Monday, December 21, 2015 9:50 AM
>> To: Wu, Feng; pbonz...@redhat.com; rkrc...@redhat.com
>> Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
>> Subject: Re: [PATCH v2 2/2] KVM: x86: Add lowest-priority support for vt-d
>> posted-interrupts
>>
>> On 2015/12/16 9:37, Feng Wu wrote:
>>> Use vector-hashing to deliver lowest-priority interrupts for
>>> VT-d posted-interrupts.
>>>
>>> Signed-off-by: Feng Wu
>>> ---
>>>  arch/x86/kvm/lapic.c | 67 ++++++++++++++++++++++++++++++++++++++++
>>>  arch/x86/kvm/lapic.h |  2 ++
>>>  arch/x86/kvm/vmx.c   | 12 ++++++++--
>>>  3 files changed, 79 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index e29001f..d4f2c8f 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -854,6 +854,73 @@ out:
>>>  }
>>>
>>>  /*
>>> + * This routine handles lowest-priority interrupts using vector-hashing
>>> + * mechanism. As an example, modern Intel CPUs use this method to handle
>>> + * lowest-priority interrupts.
>>> + *
>>> + * Here are the details of the vector-hashing mechanism:
>>> + * 1. For lowest-priority interrupts, store all the possible destination
>>> + *    vCPUs in an array.
>>> + * 2. Use "guest vector % max number of destination vCPUs" to find the right
>>> + *    destination vCPU in the array for the lowest-priority interrupt.
>>> + */
>>> +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
>>> +					      struct kvm_lapic_irq *irq)
>>> +{
>>> +	struct kvm_apic_map *map;
>>> +	struct kvm_vcpu *vcpu = NULL;
>>> +
>>> +	if (irq->shorthand)
>>> +		return NULL;
>>> +
>>> +	rcu_read_lock();
>>> +	map = rcu_dereference(kvm->arch.apic_map);
>>> +
>>> +	if (!map)
>>> +		goto out;
>>> +
>>> +	if ((irq->dest_mode != APIC_DEST_PHYSICAL) &&
>>> +	    kvm_lowest_prio_delivery(irq)) {
>>> +		u16 cid;
>>> +		int i, idx = 0;
>>> +		unsigned long bitmap = 1;
>>> +		unsigned int dest_vcpus = 0;
>>> +		struct kvm_lapic **dst = NULL;
>>> +
>>> +
>>> +		if (!kvm_apic_logical_map_valid(map))
>>> +			goto out;
>>> +
>>> +		apic_logical_id(map, irq->dest_id, &cid, (u16 *)&bitmap);
>>> +
>>> +		if (cid >= ARRAY_SIZE(map->logical_map))
>>> +			goto out;
>>> +
>>> +		dst = map->logical_map[cid];
>>> +
>>> +		for_each_set_bit(i, &bitmap, 16) {
>>> +			if (!dst[i] && !kvm_lapic_enabled(dst[i]->vcpu)) {
>>> +				clear_bit(i, &bitmap);
>>> +				continue;
>>> +			}
>>> +		}
>>> +
>>> +		dest_vcpus = hweight16(bitmap);
>>> +
>>> +		if (dest_vcpus != 0) {
>>> +			idx = kvm_vector_2_index(irq->vector, dest_vcpus,
>>> +						 &bitmap, 16);
>>> +			vcpu = dst[idx - 1]->vcpu;
>>> +		}
>>> +	}
>>> +
>>> +out:
>>> +	rcu_read_unlock();
>>> +	return vcpu;
>>> +}
>>> +EXPORT_SYMBOL_GPL(kvm_intr_vector_hashing_dest);
>>> +
>>> +/*
>>>   * Add a pending IRQ into lapic.
>>>   * Return 1 if successfully added and 0 if discarded.
>>>   */
>>> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
>>> index 6890ef0..52bffce 100644
>>> --- a/arch/x86/kvm/lapic.h
>>> +++ b/arch/x86/kvm/lapic.h
>>> @@ -172,4 +172,6 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
>>>  			struct kvm_vcpu **dest_vcpu);
>>>  int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
>>>  		       const unsigned long *bitmap, u32 bitmap_size);
>>> +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
>>> +					      struct kvm_lapic_irq *irq);
>>>  #endif
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 5eb56ed..3f89189 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -10702,8 +10702,16 @@ static int vmx_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
>>>  		 */
>>>
>>>  		kvm_set_msi_irq(e, &irq);
>>> -		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu))
>>> -			continue;
>>> +
>>> +		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu)) {
>>> +			if (!kvm_vector_hashing_enabled() ||
>>> +			    irq.delivery_mode != APIC_DM_LOWEST)
>>> +				continue;
>>> +
>>> +			vcpu = kvm_intr_vector_hashing_dest(kvm, &irq);
>>> +			if (!vcpu)
>>> +				continue;
>>> +		}
>>
>> I am a little confused with the 'continue'. If the destination is not
>> single vcpu, shouldn't we roll back to use non-PI mode?
>
> Here is the logic:
> - If it is single destination, we will use PI no matter it is fixed or
>   lowest-priority.
> - If it is not single destination:
>    a) It is fixed, we will use non-PI
>    b) It is lowest-priority and vector-hashing is enabled, we will
Re: [PATCH kernel] vfio: Add explicit alignments in vfio_iommu_spapr_tce_create
On Fri, Dec 18, 2015 at 12:35:47PM +1100, Alexey Kardashevskiy wrote:
> The vfio_iommu_spapr_tce_create struct has 4x32bit and 2x64bit fields
> which should have resulted in sizeof(vfio_iommu_spapr_tce_create) equal
> to 32 bytes. However, due to gcc's default alignment, the actual
> size of this struct is 40 bytes.
>
> This fills the gaps with __resv1/2 fields.
>
> This should not cause any change in behavior.
>
> Signed-off-by: Alexey Kardashevskiy

Oops, that was a bit sloppy.  Oh well.

Acked-by: David Gibson

> ---
>  include/uapi/linux/vfio.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9fd7b5d..d117233 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -568,8 +568,10 @@ struct vfio_iommu_spapr_tce_create {
> 	__u32 flags;
> 	/* in */
> 	__u32 page_shift;
> +	__u32 __resv1;
> 	__u64 window_size;
> 	__u32 levels;
> +	__u32 __resv2;
> 	/* out */
> 	__u64 start_addr;
> };

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH v2 1/2] KVM: x86: Use vector-hashing to deliver lowest-priority interrupts
On 2015/12/16 9:37, Feng Wu wrote:
> Use vector-hashing to deliver lowest-priority interrupts. As an
> example, modern Intel CPUs in server platforms use this method to
> handle lowest-priority interrupts.
>
> Signed-off-by: Feng Wu
> ---
>  arch/x86/kvm/irq_comm.c | 27 ++++++++++++++++++-----
>  arch/x86/kvm/lapic.c    | 57 ++++++++++++++++++++++++++++++---------
>  arch/x86/kvm/lapic.h    |  2 ++
>  arch/x86/kvm/x86.c      |  9 ++++++++
>  arch/x86/kvm/x86.h      |  1 +
>  5 files changed, 81 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> index 84b96d3..c8c5f61 100644
> --- a/arch/x86/kvm/irq_comm.c
> +++ b/arch/x86/kvm/irq_comm.c
> @@ -32,6 +32,7 @@
>  #include "ioapic.h"
>
>  #include "lapic.h"
> +#include "x86.h"
>
>  static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
>  			   struct kvm *kvm, int irq_source_id, int level,
> @@ -53,8 +54,10 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
>  int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  		struct kvm_lapic_irq *irq, unsigned long *dest_map)
>  {
> -	int i, r = -1;
> +	int i, r = -1, idx = 0;
>  	struct kvm_vcpu *vcpu, *lowest = NULL;
> +	unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
> +	unsigned int dest_vcpus = 0;
>
>  	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
>  			kvm_lowest_prio_delivery(irq)) {
> @@ -65,6 +68,8 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  	if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r, dest_map))
>  		return r;
>
> +	memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
> +
>  	kvm_for_each_vcpu(i, vcpu, kvm) {
>  		if (!kvm_apic_present(vcpu))
>  			continue;
> @@ -78,13 +83,25 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  				r = 0;
>  			r += kvm_apic_set_irq(vcpu, irq, dest_map);
>  		} else if (kvm_lapic_enabled(vcpu)) {
> -			if (!lowest)
> -				lowest = vcpu;
> -			else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
> -				lowest = vcpu;
> +			if (!kvm_vector_hashing_enabled()) {
> +				if (!lowest)
> +					lowest = vcpu;
> +				else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
> +					lowest = vcpu;
> +			} else {
> +				__set_bit(vcpu->vcpu_id, dest_vcpu_bitmap);
> +				dest_vcpus++;
> +			}
>  		}
>  	}
>
> +	if (dest_vcpus != 0) {
> +		idx = kvm_vector_2_index(irq->vector, dest_vcpus,
> +					 dest_vcpu_bitmap, KVM_MAX_VCPUS);
> +
> +		lowest = kvm_get_vcpu(kvm, idx - 1);
> +	}
> +
>  	if (lowest)
>  		r = kvm_apic_set_irq(lowest, irq, dest_map);
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index ecd4ea1..e29001f 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -678,6 +678,22 @@ bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
>  	}
>  }
>
> +int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
> +		       const unsigned long *bitmap, u32 bitmap_size)
> +{
> +	u32 mod;
> +	int i, idx = 0;
> +
> +	mod = vector % dest_vcpus;
> +
> +	for (i = 0; i <= mod; i++) {
> +		idx = find_next_bit(bitmap, bitmap_size, idx) + 1;
> +		BUG_ON(idx > bitmap_size);
> +	}
> +
> +	return idx;
> +}
> +
>  bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
>  		struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map)
>  {
> @@ -731,17 +747,38 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
>  		dst = map->logical_map[cid];
>
>  		if (kvm_lowest_prio_delivery(irq)) {
> -			int l = -1;
> -			for_each_set_bit(i, &bitmap, 16) {
> -				if (!dst[i])
> -					continue;
> -				if (l < 0)
> -					l = i;
> -				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
> -					l = i;
> +			if (!kvm_vector_hashing_enabled()) {
> +				int l = -1;
> +				for_each_set_bit(i, &bitmap, 16) {
> +					if (!dst[i])
> +						continue;
> +					if (l < 0)
> +						l = i;
> +					else if
Re: [PATCH v2 2/2] KVM: x86: Add lowest-priority support for vt-d posted-interrupts
On 2015/12/16 9:37, Feng Wu wrote:
> Use vector-hashing to deliver lowest-priority interrupts for
> VT-d posted-interrupts.
>
> Signed-off-by: Feng Wu
> ---
>  arch/x86/kvm/lapic.c | 67 ++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/lapic.h |  2 ++
>  arch/x86/kvm/vmx.c   | 12 ++++++++--
>  3 files changed, 79 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index e29001f..d4f2c8f 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -854,6 +854,73 @@ out:
>  }
>
>  /*
> + * This routine handles lowest-priority interrupts using vector-hashing
> + * mechanism. As an example, modern Intel CPUs use this method to handle
> + * lowest-priority interrupts.
> + *
> + * Here are the details of the vector-hashing mechanism:
> + * 1. For lowest-priority interrupts, store all the possible destination
> + *    vCPUs in an array.
> + * 2. Use "guest vector % max number of destination vCPUs" to find the right
> + *    destination vCPU in the array for the lowest-priority interrupt.
> + */
> +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
> +					      struct kvm_lapic_irq *irq)
> +{
> +	struct kvm_apic_map *map;
> +	struct kvm_vcpu *vcpu = NULL;
> +
> +	if (irq->shorthand)
> +		return NULL;
> +
> +	rcu_read_lock();
> +	map = rcu_dereference(kvm->arch.apic_map);
> +
> +	if (!map)
> +		goto out;
> +
> +	if ((irq->dest_mode != APIC_DEST_PHYSICAL) &&
> +	    kvm_lowest_prio_delivery(irq)) {
> +		u16 cid;
> +		int i, idx = 0;
> +		unsigned long bitmap = 1;
> +		unsigned int dest_vcpus = 0;
> +		struct kvm_lapic **dst = NULL;
> +
> +
> +		if (!kvm_apic_logical_map_valid(map))
> +			goto out;
> +
> +		apic_logical_id(map, irq->dest_id, &cid, (u16 *)&bitmap);
> +
> +		if (cid >= ARRAY_SIZE(map->logical_map))
> +			goto out;
> +
> +		dst = map->logical_map[cid];
> +
> +		for_each_set_bit(i, &bitmap, 16) {
> +			if (!dst[i] && !kvm_lapic_enabled(dst[i]->vcpu)) {
> +				clear_bit(i, &bitmap);
> +				continue;
> +			}
> +		}
> +
> +		dest_vcpus = hweight16(bitmap);
> +
> +		if (dest_vcpus != 0) {
> +			idx = kvm_vector_2_index(irq->vector, dest_vcpus,
> +						 &bitmap, 16);
> +			vcpu = dst[idx - 1]->vcpu;
> +		}
> +	}
> +
> +out:
> +	rcu_read_unlock();
> +	return vcpu;
> +}
> +EXPORT_SYMBOL_GPL(kvm_intr_vector_hashing_dest);
> +
> +/*
>   * Add a pending IRQ into lapic.
>   * Return 1 if successfully added and 0 if discarded.
>   */
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 6890ef0..52bffce 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -172,4 +172,6 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
>  			struct kvm_vcpu **dest_vcpu);
>  int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
>  		       const unsigned long *bitmap, u32 bitmap_size);
> +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
> +					      struct kvm_lapic_irq *irq);
>  #endif
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5eb56ed..3f89189 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -10702,8 +10702,16 @@ static int vmx_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
>  		 */
>
>  		kvm_set_msi_irq(e, &irq);
> -		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu))
> -			continue;
> +
> +		if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu)) {
> +			if (!kvm_vector_hashing_enabled() ||
> +			    irq.delivery_mode != APIC_DM_LOWEST)
> +				continue;
> +
> +			vcpu = kvm_intr_vector_hashing_dest(kvm, &irq);
> +			if (!vcpu)
> +				continue;
> +		}

I am a little confused with the 'continue'. If the destination is not
single vcpu, shouldn't we roll back to use non-PI mode?

>  		vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
>  		vcpu_info.vector = irq.vector;

-- 
best regards
yang
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] KVM: x86: Use vector-hashing to deliver lowest-priority interrupts
On 2015/12/21 9:50, Wu, Feng wrote:
>> -----Original Message-----
>> From: Yang Zhang [mailto:yang.zhang...@gmail.com]
>> Sent: Monday, December 21, 2015 9:46 AM
>> To: Wu, Feng; pbonz...@redhat.com; rkrc...@redhat.com
>> Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
>> Subject: Re: [PATCH v2 1/2] KVM: x86: Use vector-hashing to deliver lowest-
>> priority interrupts
>>
>> On 2015/12/16 9:37, Feng Wu wrote:
>>> Use vector-hashing to deliver lowest-priority interrupts. As an
>>> example, modern Intel CPUs in server platforms use this method to
>>> handle lowest-priority interrupts.
>>>
>>> Signed-off-by: Feng Wu
>>> ---
>>>  arch/x86/kvm/irq_comm.c | 27 ++++++++++++++++++-----
>>>  arch/x86/kvm/lapic.c    | 57 ++++++++++++++++++++++++++++++---------
>>>  arch/x86/kvm/lapic.h    |  2 ++
>>>  arch/x86/kvm/x86.c      |  9 ++++++++
>>>  arch/x86/kvm/x86.h      |  1 +
>>>  5 files changed, 81 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
>>> index 84b96d3..c8c5f61 100644
>>> --- a/arch/x86/kvm/irq_comm.c
>>> +++ b/arch/x86/kvm/irq_comm.c
>>> @@ -32,6 +32,7 @@
>>>  #include "ioapic.h"
>>>
>>>  #include "lapic.h"
>>> +#include "x86.h"
>>>
>>>  static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
>>>  			   struct kvm *kvm, int irq_source_id, int level,
>>> @@ -53,8 +54,10 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
>>>  int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>>>  		struct kvm_lapic_irq *irq, unsigned long *dest_map)
>>>  {
>>> -	int i, r = -1;
>>> +	int i, r = -1, idx = 0;
>>>  	struct kvm_vcpu *vcpu, *lowest = NULL;
>>> +	unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
>>> +	unsigned int dest_vcpus = 0;
>>>
>>>  	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
>>>  			kvm_lowest_prio_delivery(irq)) {
>>> @@ -65,6 +68,8 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>>>  	if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r, dest_map))
>>>  		return r;
>>>
>>> +	memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
>>> +
>>>  	kvm_for_each_vcpu(i, vcpu, kvm) {
>>>  		if (!kvm_apic_present(vcpu))
>>>  			continue;
>>> @@ -78,13 +83,25 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>>>  				r = 0;
>>>  			r += kvm_apic_set_irq(vcpu, irq, dest_map);
>>>  		} else if (kvm_lapic_enabled(vcpu)) {
>>> -			if (!lowest)
>>> -				lowest = vcpu;
>>> -			else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
>>> -				lowest = vcpu;
>>> +			if (!kvm_vector_hashing_enabled()) {
>>> +				if (!lowest)
>>> +					lowest = vcpu;
>>> +				else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
>>> +					lowest = vcpu;
>>> +			} else {
>>> +				__set_bit(vcpu->vcpu_id, dest_vcpu_bitmap);
>>> +				dest_vcpus++;
>>> +			}
>>>  		}
>>>  	}
>>>
>>> +	if (dest_vcpus != 0) {
>>> +		idx = kvm_vector_2_index(irq->vector, dest_vcpus,
>>> +					 dest_vcpu_bitmap, KVM_MAX_VCPUS);
>>> +
>>> +		lowest = kvm_get_vcpu(kvm, idx - 1);
>>> +	}
>>> +
>>>  	if (lowest)
>>>  		r = kvm_apic_set_irq(lowest, irq, dest_map);
>>>
>>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>>> index ecd4ea1..e29001f 100644
>>> --- a/arch/x86/kvm/lapic.c
>>> +++ b/arch/x86/kvm/lapic.c
>>> @@ -678,6 +678,22 @@ bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
>>>  	}
>>>  }
>>>
>>> +int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
>>> +		       const unsigned long *bitmap, u32 bitmap_size)
>>> +{
>>> +	u32 mod;
>>> +	int i, idx = 0;
>>> +
>>> +	mod = vector % dest_vcpus;
>>> +
>>> +	for (i = 0; i <= mod; i++) {
>>> +		idx = find_next_bit(bitmap, bitmap_size, idx) + 1;
>>> +		BUG_ON(idx > bitmap_size);
>>> +	}
>>> +
>>> +	return idx;
>>> +}
>>> +
>>>  bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
>>>  		struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map)
>>>  {
>>> @@ -731,17 +747,38 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
>>>  		dst = map->logical_map[cid];
>>>
>>>  		if (kvm_lowest_prio_delivery(irq)) {
>>> -			int l = -1;
>>> -			for_each_set_bit(i, &bitmap, 16) {
>>> -				if (!dst[i])
>>> -					continue;
>>> -				if (l < 0)
>>> -					l = i;
>>> -				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
>>> -					l = i;
>>> +			if
RE: [PATCH v2 1/2] KVM: x86: Use vector-hashing to deliver lowest-priority interrupts
> -----Original Message-----
> From: Yang Zhang [mailto:yang.zhang...@gmail.com]
> Sent: Monday, December 21, 2015 9:46 AM
> To: Wu, Feng; pbonz...@redhat.com; rkrc...@redhat.com
> Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH v2 1/2] KVM: x86: Use vector-hashing to deliver lowest-
> priority interrupts
>
> On 2015/12/16 9:37, Feng Wu wrote:
> > Use vector-hashing to deliver lowest-priority interrupts. As an
> > example, modern Intel CPUs in server platforms use this method to
> > handle lowest-priority interrupts.
> >
> > Signed-off-by: Feng Wu
> > ---
> >  arch/x86/kvm/irq_comm.c | 27 ++++++++++++++++++-----
> >  arch/x86/kvm/lapic.c    | 57 ++++++++++++++++++++++++++++++---------
> >  arch/x86/kvm/lapic.h    |  2 ++
> >  arch/x86/kvm/x86.c      |  9 ++++++++
> >  arch/x86/kvm/x86.h      |  1 +
> >  5 files changed, 81 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> > index 84b96d3..c8c5f61 100644
> > --- a/arch/x86/kvm/irq_comm.c
> > +++ b/arch/x86/kvm/irq_comm.c
> > @@ -32,6 +32,7 @@
> >  #include "ioapic.h"
> >
> >  #include "lapic.h"
> > +#include "x86.h"
> >
> >  static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> >  			   struct kvm *kvm, int irq_source_id, int level,
> > @@ -53,8 +54,10 @@ static int kvm_set_ioapic_irq(struct kvm_kernel_irq_routing_entry *e,
> >  int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> >  		struct kvm_lapic_irq *irq, unsigned long *dest_map)
> >  {
> > -	int i, r = -1;
> > +	int i, r = -1, idx = 0;
> >  	struct kvm_vcpu *vcpu, *lowest = NULL;
> > +	unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
> > +	unsigned int dest_vcpus = 0;
> >
> >  	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
> >  			kvm_lowest_prio_delivery(irq)) {
> > @@ -65,6 +68,8 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> >  	if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r, dest_map))
> >  		return r;
> >
> > +	memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
> > +
> >  	kvm_for_each_vcpu(i, vcpu, kvm) {
> >  		if (!kvm_apic_present(vcpu))
> >  			continue;
> > @@ -78,13 +83,25 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> >  				r = 0;
> >  			r += kvm_apic_set_irq(vcpu, irq, dest_map);
> >  		} else if (kvm_lapic_enabled(vcpu)) {
> > -			if (!lowest)
> > -				lowest = vcpu;
> > -			else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
> > -				lowest = vcpu;
> > +			if (!kvm_vector_hashing_enabled()) {
> > +				if (!lowest)
> > +					lowest = vcpu;
> > +				else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
> > +					lowest = vcpu;
> > +			} else {
> > +				__set_bit(vcpu->vcpu_id, dest_vcpu_bitmap);
> > +				dest_vcpus++;
> > +			}
> >  		}
> >  	}
> >
> > +	if (dest_vcpus != 0) {
> > +		idx = kvm_vector_2_index(irq->vector, dest_vcpus,
> > +					 dest_vcpu_bitmap, KVM_MAX_VCPUS);
> > +
> > +		lowest = kvm_get_vcpu(kvm, idx - 1);
> > +	}
> > +
> >  	if (lowest)
> >  		r = kvm_apic_set_irq(lowest, irq, dest_map);
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index ecd4ea1..e29001f 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -678,6 +678,22 @@ bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
> >  	}
> >  }
> >
> > +int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
> > +		       const unsigned long *bitmap, u32 bitmap_size)
> > +{
> > +	u32 mod;
> > +	int i, idx = 0;
> > +
> > +	mod = vector % dest_vcpus;
> > +
> > +	for (i = 0; i <= mod; i++) {
> > +		idx = find_next_bit(bitmap, bitmap_size, idx) + 1;
> > +		BUG_ON(idx > bitmap_size);
> > +	}
> > +
> > +	return idx;
> > +}
> > +
> >  bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> >  		struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map)
> >  {
> > @@ -731,17 +747,38 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> >  		dst = map->logical_map[cid];
> >
> >  		if (kvm_lowest_prio_delivery(irq)) {
> > -			int l = -1;
> > -			for_each_set_bit(i, &bitmap, 16) {
> > -				if (!dst[i])
> > -					continue;
> > -				if (l < 0)
> > -					l = i;
> > -				else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
> > -
RE: [PATCH v4 5/5] kvm/x86: Hyper-V kvm exit
Hello! Replying to everything in one message.

> > As far as I understand this code, KVM_EXIT_HYPERV is called when one
> > of three MSRs is accessed. But shouldn't we have implemented
> > instead something more generic, like KVM_EXIT_REG_IO, which would
> > work similarly to KVM_EXIT_PIO or KVM_EXIT_MMIO, but carry register
> > code and value?
>
> Yes, we considered that. There were actually patches for this as well.

Ah, I missed them, what a pity. There are lots of patches, I don't review
them all. Actually, I noticed the change only after it appeared in
linux-next.

> However, in this case the register is still emulated in the kernel, and
> userspace just gets informed of the new value.

I see, but this doesn't change the semantics. All we need to do is to tell
userland that "the register has been written". Additionally to this we
could do whatever we want, including caching the data in the kernel, using
it in the kernel, and processing reads in the kernel.

> If we do get that, we will just rename KVM_EXIT_HYPERV to
> KVM_EXIT_MSR_ACCESS, and KVM_EXIT_HYPERV_SYNIC to
> KVM_EXIT_MSR_HYPERV_SYNIC, and struct kvm_hyperv_exit to kvm_msr_exit.

Actually, I see this in a more generic way, something like:

		/* KVM_EXIT_REG_ACCESS */
		struct {
			__u64 reg;
			__u64 data;
			__u8  is_write;
		} mmio;

'data' and 'is_write' are self-explanatory, 'reg' would be a generalized
register code, the same as used for KVM_(GET|SET)_ONE_REG:
- for ARM64: ARM64_SYS_REG(op0, op1, crn, crm, op2) - see
  http://lxr.free-electrons.com/source/arch/arm64/include/uapi/asm/kvm.h#L189
- for x86: to be defined (I know, we don't use ..._ONE_REG operations here
  yet), like X86_MSR_REG(id), where the macro itself is:
  #define X86_MSR_REG(id) (KVM_REG_X86 | KVM_REG_X86_MSR | KVM_REG_SIZE_U64 | (id))
- for other architectures: to be defined in a similar way, once needed.

> On brief inspection of Andrey's patch (I have not been following
> closely) it looks like the kvm_hyperv_exit struct that's returned to
> userspace contains more data (control, evt_page, and msg_page fields)
> than simply the value of the MSR, so would the desired SynIC exit fit
> into a general-purpose exit for MSR emulation?

I have looked at the code too, and these three fields are nothing more
than the values of the respective MSRs:

	case HV_X64_MSR_SCONTROL:
		synic->control = data;
		if (!host)
			synic_exit(synic, msr);
		break;
	case HV_X64_MSR_SIEFP:
		if (data & HV_SYNIC_SIEFP_ENABLE)
			if (kvm_clear_guest(vcpu->kvm,
					    data & PAGE_MASK, PAGE_SIZE)) {
				ret = 1;
				break;
			}
		synic->evt_page = data;
		if (!host)
			synic_exit(synic, msr);
		break;
	case HV_X64_MSR_SIMP:
		if (data & HV_SYNIC_SIMP_ENABLE)
			if (kvm_clear_guest(vcpu->kvm,
					    data & PAGE_MASK, PAGE_SIZE)) {
				ret = 1;
				break;
			}
		synic->msg_page = data;
		if (!host)
			synic_exit(synic, msr);
		break;

So, every time one of these three MSRs is written, we get a vmexit with
the values of all three registers, and that's all. We could easily have
'synic_exit(synic, msr, data)' in all three cases, and I think userspace
could easily deal with the proposed KVM_EXIT_REG_ACCESS, just caching the
data internally if needed.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
[PATCH v2 3/4] x86/vdso: Remove pvclock fixmap machinery
Acked-by: Paolo Bonzini
Signed-off-by: Andy Lutomirski
---
 arch/x86/entry/vdso/vclock_gettime.c |  1 -
 arch/x86/entry/vdso/vma.c            |  1 +
 arch/x86/include/asm/fixmap.h        |  5 -----
 arch/x86/include/asm/pvclock.h       |  5 -----
 arch/x86/kernel/kvmclock.c           |  6 ------
 arch/x86/kernel/pvclock.c            | 24 ------------------------
 6 files changed, 1 insertion(+), 41 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 5dd363d54348..59a98c25bde7 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -45,7 +45,6 @@ extern u8 pvclock_page
 #include
 #include
-#include
 #include

 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index aa828191c654..b8f69e264ac4 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index f80d70009ff8..6d7d0e52ed5a 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -19,7 +19,6 @@
 #include
 #include
 #include
-#include
 #ifdef CONFIG_X86_32
 #include
 #include
@@ -72,10 +71,6 @@ enum fixed_addresses {
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
 	VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
 #endif
-#ifdef CONFIG_PARAVIRT_CLOCK
-	PVCLOCK_FIXMAP_BEGIN,
-	PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
-#endif
 #endif
 	FIX_DBGP_BASE,
 	FIX_EARLYCON_MEM_BASE,
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 571dad355bbc..fdcc04020636 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info {
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));

 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
-#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
-				 int size);
-struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);

 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ec1b06dc82d2..72cef58693c7 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 {
 #ifdef CONFIG_X86_64
 	int cpu;
-	int ret;
 	u8 flags;
 	struct pvclock_vcpu_time_info *vcpu_time;
 	unsigned int size;
@@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 		return 1;
 	}

-	if ((ret = pvclock_init_vsyscall(hv_clock, size))) {
-		put_cpu();
-		return ret;
-	}
-
 	put_cpu();

 	kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 2f355d229a58..99bfc025111d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,

 	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
-
-#ifdef CONFIG_X86_64
-/*
- * Initialize the generic pvclock vsyscall state. This will allocate
- * a/some page(s) for the per-vcpu pvclock information, set up a
- * fixmap mapping for the page(s)
- */
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
-				 int size)
-{
-	int idx;
-
-	WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
-
-	for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
-		__set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
-			     __pa(i) + (idx*PAGE_SIZE),
-			     PAGE_KERNEL_VVAR);
-	}
-
-	return 0;
-}
-#endif
-- 
2.5.0
[PATCH v2 2/4] x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
Acked-by: Paolo Bonzini
Signed-off-by: Andy Lutomirski
---
 arch/x86/entry/vdso/vclock_gettime.c  | 20
 arch/x86/entry/vdso/vdso-layout.lds.S |  3
 arch/x86/entry/vdso/vdso2c.c          |  3
 arch/x86/entry/vdso/vma.c             | 13
 arch/x86/include/asm/pvclock.h        |  9
 arch/x86/include/asm/vdso.h           |  1
 arch/x86/kernel/kvmclock.c            |  5
 7 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index c325ba1bdddf..5dd363d54348 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void)
 }
 #endif

+#ifdef CONFIG_PARAVIRT_CLOCK
+extern u8 pvclock_page
+	__attribute__((visibility("hidden")));
+#endif
+
 #ifndef BUILD_VDSO32

 #include
@@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)

 #ifdef CONFIG_PARAVIRT_CLOCK

-static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
 {
-	const struct pvclock_vsyscall_time_info *pvti_base;
-	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
-	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
-
-	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
-
-	pvti_base = (struct pvclock_vsyscall_time_info *)
-		    __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
-
-	return &pvti_base[offset];
+	return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
 }

 static notrace cycle_t vread_pvclock(int *mode)
 {
-	const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
+	const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
 	cycle_t ret;
 	u64 tsc, pvti_tsc;
 	u64 last, delta, pvti_system_time;
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index de2c921025f5..4158acc17df0 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -25,7 +25,7 @@ SECTIONS
 	 * segment.
 	 */

-	vvar_start = . - 2 * PAGE_SIZE;
+	vvar_start = . - 3 * PAGE_SIZE;
 	vvar_page = vvar_start;

 	/* Place all vvars at the offsets in asm/vvar.h. */
@@ -36,6 +36,7 @@ SECTIONS
 #undef EMIT_VVAR

 	hpet_page = vvar_start + PAGE_SIZE;
+	pvclock_page = vvar_start + 2 * PAGE_SIZE;

 	. = SIZEOF_HEADERS;
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 785d9922b106..491020b2826d 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -73,6 +73,7 @@ enum {
 	sym_vvar_start,
 	sym_vvar_page,
 	sym_hpet_page,
+	sym_pvclock_page,
 	sym_VDSO_FAKE_SECTION_TABLE_START,
 	sym_VDSO_FAKE_SECTION_TABLE_END,
 };
@@ -80,6 +81,7 @@ enum {
 const int special_pages[] = {
 	sym_vvar_page,
 	sym_hpet_page,
+	sym_pvclock_page,
 };

 struct vdso_sym {
@@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = {
 	[sym_vvar_start] = {"vvar_start", true},
 	[sym_vvar_page] = {"vvar_page", true},
 	[sym_hpet_page] = {"hpet_page", true},
+	[sym_pvclock_page] = {"pvclock_page", true},
 	[sym_VDSO_FAKE_SECTION_TABLE_START] = {
 		"VDSO_FAKE_SECTION_TABLE_START", false
 	},
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 64df47148160..aa828191c654 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 		.name = "[vvar]",
 		.pages = no_pages,
 	};
+	struct pvclock_vsyscall_time_info *pvti;

 	if (calculate_addr) {
 		addr = vdso_addr(current->mm->start_stack,
@@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 	}
 #endif

+	pvti = pvclock_pvti_cpu0_va();
+	if (pvti && image->sym_pvclock_page) {
+		ret = remap_pfn_range(vma,
+				      text_start + image->sym_pvclock_page,
+				      __pa(pvti) >> PAGE_SHIFT,
+				      PAGE_SIZE,
+				      PAGE_READONLY);
+
+		if (ret)
+			goto up_fail;
+	}
+
 up_fail:
 	if (ret)
 		current->mm->context.vdso = NULL;
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 7a6bed5c08bc..571dad355bbc 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,6 +4,15 @@
 #include
 #include

+#ifdef CONFIG_KVM_GUEST
+extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline struct
[PATCH v2 0/4] x86: KVM vdso and clock improvements
NB: patch 1 doesn't really belong here, but it makes this a lot easier
for me to test. Patch 1, if it's okay at all, should go through the
kvm tree. The rest should probably go through tip:x86/vdso once
they're reviewed.

I'll do a followup to enable vdso pvclock on 32-bit guests. I'm not
currently set up to test it. (The KVM people could also do it very
easily on top of these patches.)

Changes from v1:
 - Dropped patch 1
 - Added Paolo's review and acks
 - Fixed a build issue on some configs

Andy Lutomirski (4):
  x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
  x86/vdso: Remove pvclock fixmap machinery
  x86/vdso: Enable vdso pvclock access on all vdso variants

 arch/x86/entry/vdso/vclock_gettime.c  | 151
 arch/x86/entry/vdso/vdso-layout.lds.S |   3
 arch/x86/entry/vdso/vdso2c.c          |   3
 arch/x86/entry/vdso/vma.c             |  14
 arch/x86/include/asm/fixmap.h         |   5
 arch/x86/include/asm/pvclock.h        |  14
 arch/x86/include/asm/vdso.h           |   1
 arch/x86/kernel/kvmclock.c            |  11
 arch/x86/kernel/pvclock.c             |  24
 9 files changed, 107 insertions(+), 119 deletions(-)

--
2.5.0
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/4] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
From: Andy Lutomirski

The pvclock vdso code was too abstracted to understand easily and
excessively paranoid. Simplify it for a huge speedup.

This opens the door for additional simplifications, as the vdso no
longer accesses the pvti for any vcpu other than vcpu 0.

Before, vclock_gettime using kvm-clock took about 45ns on my machine.
With this change, it takes 29ns, which is almost as fast as the pure
TSC implementation.

Reviewed-by: Paolo Bonzini
Signed-off-by: Andy Lutomirski
---
 arch/x86/entry/vdso/vclock_gettime.c | 81
 1 file changed, 46 insertions(+), 35 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index ca94fa649251..c325ba1bdddf 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)

 static notrace cycle_t vread_pvclock(int *mode)
 {
-	const struct pvclock_vsyscall_time_info *pvti;
+	const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
 	cycle_t ret;
-	u64 last;
-	u32 version;
-	u8 flags;
-	unsigned cpu, cpu1;
-
+	u64 tsc, pvti_tsc;
+	u64 last, delta, pvti_system_time;
+	u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;

 	/*
-	 * Note: hypervisor must guarantee that:
-	 * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
-	 * 2. that per-CPU pvclock time info is updated if the
-	 *    underlying CPU changes.
-	 * 3. that version is increased whenever underlying CPU
-	 *    changes.
+	 * Note: The kernel and hypervisor must guarantee that cpu ID
+	 * number maps 1:1 to per-CPU pvclock time info.
 	 *
+	 * Because the hypervisor is entirely unaware of guest userspace
+	 * preemption, it cannot guarantee that per-CPU pvclock time
+	 * info is updated if the underlying CPU changes or that that
+	 * version is increased whenever underlying CPU changes.
+	 *
+	 * On KVM, we are guaranteed that pvti updates for any vCPU are
+	 * atomic as seen by *all* vCPUs.  This is an even stronger
+	 * guarantee than we get with a normal seqlock.
+	 *
+	 * On Xen, we don't appear to have that guarantee, but Xen still
+	 * supplies a valid seqlock using the version field.
+	 *
+	 * We only do pvclock vdso timing at all if
+	 * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
+	 * mean that all vCPUs have matching pvti and that the TSC is
+	 * synced, so we can just look at vCPU 0's pvti.
 	 */
-	do {
-		cpu = __getcpu() & VGETCPU_CPU_MASK;
-		/* TODO: We can put vcpu id into higher bits of pvti.version.
-		 * This will save a couple of cycles by getting rid of
-		 * __getcpu() calls (Gleb).
-		 */
-
-		pvti = get_pvti(cpu);
-
-		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
-
-		/*
-		 * Test we're still on the cpu as well as the version.
-		 * We could have been migrated just after the first
-		 * vgetcpu but before fetching the version, so we
-		 * wouldn't notice a version change.
-		 */
-		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
-	} while (unlikely(cpu != cpu1 ||
-			  (pvti->pvti.version & 1) ||
-			  pvti->pvti.version != version));
-
-	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+
+	if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
 		*mode = VCLOCK_NONE;
+		return 0;
+	}
+
+	do {
+		version = pvti->version;
+
+		/* This is also a read barrier, so we'll read version first. */
+		tsc = rdtsc_ordered();
+
+		pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
+		pvti_tsc_shift = pvti->tsc_shift;
+		pvti_system_time = pvti->system_time;
+		pvti_tsc = pvti->tsc_timestamp;
+
+		/* Make sure that the version double-check is last. */
+		smp_rmb();
+	} while (unlikely((version & 1) || version != pvti->version));
+
+	delta = tsc - pvti_tsc;
+	ret = pvti_system_time +
+	      pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
+				  pvti_tsc_shift);

 	/* refer to tsc.c read_tsc() comment for rationale */
 	last = gtod->cycle_last;
--
2.5.0
[PATCH v2 4/4] x86/vdso: Enable vdso pvclock access on all vdso variants
Now that pvclock doesn't require access to the fixmap, all vdso
variants can use it.

The kernel side isn't wired up for 32-bit kernels yet, but this
covers 32-bit and x32 userspace on 64-bit kernels.

Acked-by: Paolo Bonzini
Signed-off-by: Andy Lutomirski
---
 arch/x86/entry/vdso/vclock_gettime.c | 91
 1 file changed, 40 insertions(+), 51 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 59a98c25bde7..8602f06c759f 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -17,8 +17,10 @@
 #include
 #include
 #include
+#include
 #include
 #include
+#include

 #define gtod (&VVAR(vsyscall_gtod_data))

@@ -43,10 +45,6 @@ extern u8 pvclock_page

 #ifndef BUILD_VDSO32

-#include
-#include
-#include
-
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
 	long ret;
@@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
 	return ret;
 }

-#ifdef CONFIG_PARAVIRT_CLOCK
+#else
+
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+{
+	long ret;
+
+	asm(
+		"mov %%ebx, %%edx \n"
+		"mov %2, %%ebx \n"
+		"call __kernel_vsyscall \n"
+		"mov %%edx, %%ebx \n"
+		: "=a" (ret)
+		: "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
+		: "memory", "edx");
+	return ret;
+}
+
+notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+{
+	long ret;
+
+	asm(
+		"mov %%ebx, %%edx \n"
+		"mov %2, %%ebx \n"
+		"call __kernel_vsyscall \n"
+		"mov %%edx, %%ebx \n"
+		: "=a" (ret)
+		: "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
+		: "memory", "edx");
+	return ret;
+}
+
+#endif
+
+#ifdef CONFIG_PARAVIRT_CLOCK

 static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
 {
 	return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
@@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode)
 	do {
 		version = pvti->version;

-		/* This is also a read barrier, so we'll read version first. */
-		tsc = rdtsc_ordered();
+		smp_rmb();
+		tsc = rdtsc_ordered();

 		pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
 		pvti_tsc_shift = pvti->tsc_shift;
 		pvti_system_time = pvti->system_time;
@@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode)
 		pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
 				    pvti_tsc_shift);

-	/* refer to tsc.c read_tsc() comment for rationale */
+	/* refer to vread_tsc() comment for rationale */
 	last = gtod->cycle_last;

 	if (likely(ret >= last))
@@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode)
 }
 #endif

-#else
-
-notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
-{
-	long ret;
-
-	asm(
-		"mov %%ebx, %%edx \n"
-		"mov %2, %%ebx \n"
-		"call __kernel_vsyscall \n"
-		"mov %%edx, %%ebx \n"
-		: "=a" (ret)
-		: "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
-		: "memory", "edx");
-	return ret;
-}
-
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-	long ret;
-
-	asm(
-		"mov %%ebx, %%edx \n"
-		"mov %2, %%ebx \n"
-		"call __kernel_vsyscall \n"
-		"mov %%edx, %%ebx \n"
-		: "=a" (ret)
-		: "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
-		: "memory", "edx");
-	return ret;
-}
-
-#ifdef CONFIG_PARAVIRT_CLOCK
-
-static notrace cycle_t vread_pvclock(int *mode)
-{
-	*mode = VCLOCK_NONE;
-	return 0;
-}
-#endif
-
-#endif
-
 notrace static cycle_t vread_tsc(void)
 {
 	cycle_t ret = (cycle_t)rdtsc_ordered();
--
2.5.0
RE: [Qemu-devel] [PATCH] SeaBios: Fix reset procedure reentrancy problem on qemu-kvm platform
> -----Original Message-----
> From: Kevin O'Connor [mailto:ke...@koconnor.net]
> Sent: Saturday, December 19, 2015 11:12 PM
>
> On Sat, Dec 19, 2015 at 12:03:15PM +, Gonglei (Arei) wrote:
> > Maybe the root cause is not NMI but INTR, so yield() can open the
> > hardware interrupt, and then execute the interrupt handler, but the
> > interrupt handler makes the SeaBIOS stack broken, so that the BSP
> > can't execute the instruction and an exception occurs, VM_EXIT to
> > Kmod, which is an infinite loop. But I don't have any proofs except
> > the surface phenomenon.
>
> I can't see any reason why allowing interrupts at this location would
> be a problem.
>
Does it have any relationship with the *extra stack* of SeaBIOS?

> > Kevin, can we drop yield() in smp_setup()?
>
> It's possible to eliminate this instance of yield, but I think it
> would just push the crash to the next time interrupts are enabled.
>
Perhaps. I'm not sure.

> > Is it really useful and allowable for SeaBIOS? Maybe for other
> > components? I'm not sure. Because we found that when SeaBIOS is
> > booting, if we inject an NMI by QMP, the guest will get *stuck*. And
> > the kvm tracing log is the same as for the current problem.
>
> If you apply the patches you had to prevent that NMI crash problem,
> does it also prevent the above crash?
>
Yes, but we cannot prevent the NMI injection (though I'll submit some
patches to forbid users' NMI injection after NMI_EN is disabled via
bit 7 of RTC port 0x70).

Regards,
-Gonglei
[PULL] vhost: cleanups and fixes
The following changes since commit 9f9499ae8e6415cefc4fe0a96ad0e27864353c89:

  Linux 4.4-rc5 (2015-12-13 17:42:58 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 74a599f09bec7419b2490039f0fb33bc8581ef7c:

  virtio/s390: handle error values in irb (2015-12-17 10:37:33 +0200)

virtio: fixes on top of 4.4-rc5

This includes a single fix for virtio ccw error handling.

Signed-off-by: Michael S. Tsirkin

Cornelia Huck (1):
      virtio/s390: handle error values in irb

 drivers/s390/virtio/virtio_ccw.c | 62
 1 file changed, 37 insertions(+), 25 deletions(-)
Re: [Qemu-devel] [PATCH] SeaBios: Fix reset procedure reentrancy problem on qemu-kvm platform
On Sun, Dec 20, 2015 at 09:49:54AM +, Gonglei (Arei) wrote:
> > From: Kevin O'Connor [mailto:ke...@koconnor.net]
> > Sent: Saturday, December 19, 2015 11:12 PM
> > On Sat, Dec 19, 2015 at 12:03:15PM +, Gonglei (Arei) wrote:
> > > Maybe the root cause is not NMI but INTR, so yield() can open the
> > > hardware interrupt, and then execute the interrupt handler, but the
> > > interrupt handler makes the SeaBIOS stack broken, so that the BSP
> > > can't execute the instruction and an exception occurs, VM_EXIT to
> > > Kmod, which is an infinite loop. But I don't have any proofs except
> > > the surface phenomenon.
> >
> > I can't see any reason why allowing interrupts at this location would
> > be a problem.
> >
> Does it have any relationship with the *extra stack* of SeaBIOS?

None that I can see.  Also, the kvm trace seems to show the code trying
to execute at rip=0x03 - that will crash long before the extra stack is
used.

> > > Kevin, can we drop yield() in smp_setup()?
> >
> > It's possible to eliminate this instance of yield, but I think it
> > would just push the crash to the next time interrupts are enabled.
> >
> Perhaps. I'm not sure.
>
> > > Is it really useful and allowable for SeaBIOS? Maybe for other
> > > components? I'm not sure. Because we found that when SeaBIOS is
> > > booting, if we inject an NMI by QMP, the guest will get *stuck*. And
> > > the kvm tracing log is the same as for the current problem.
> >
> > If you apply the patches you had to prevent that NMI crash problem,
> > does it also prevent the above crash?
> >
> Yes, but we cannot prevent the NMI injection (though I'll submit some
> patches to forbid users' NMI injection after NMI_EN is disabled via
> bit 7 of RTC port 0x70).

-Kevin