Re: [PATCH v2] KVM: Synthesize G bit for all segments.
On 2014-07-08 06:17, Alok Kataria wrote:

Thanks Jan and Paolo for looking at the change, I have added a comment in svm_get_segment. Joerg, please consider this for the next merge.

--

From: Jim Mattson jmatt...@vmware.com

We have noticed that qemu-kvm hangs early in the BIOS when running nested under some versions of VMware ESXi. We believe the problem is that KVM assumes the platform preserves the 'G' bit for any segment register. The SVM specification itemizes the segment attribute bits that are observed by the CPU, but the (G)ranularity bit is not one of the bits itemized, for any segment. Though current AMD CPUs keep track of the (G)ranularity bit for all segment registers other than CS, the specification does not require it. VMware's virtual CPU may not track the (G)ranularity bit for any segment register. Since KVM already synthesizes the (G)ranularity bit for the CS segment, it should do so for all segments. The patch below does that, and helps get rid of the hangs.

Patch applies on top of Linus' tree.

Signed-off-by: Jim Mattson jmatt...@vmware.com
Signed-off-by: Alok N Kataria akata...@vmware.com

Index: linux-2.6/arch/x86/kvm/svm.c
===================================================================
--- linux-2.6.orig/arch/x86/kvm/svm.c	2014-07-07 15:32:52.724368183 +0530
+++ linux-2.6/arch/x86/kvm/svm.c	2014-07-08 09:30:29.124431069 +0530
@@ -1415,7 +1415,13 @@
 	var->avl = (s->attrib >> SVM_SELECTOR_AVL_SHIFT) & 1;
 	var->l = (s->attrib >> SVM_SELECTOR_L_SHIFT) & 1;
 	var->db = (s->attrib >> SVM_SELECTOR_DB_SHIFT) & 1;
-	var->g = (s->attrib >> SVM_SELECTOR_G_SHIFT) & 1;
+
+	/*
+	 * SVM spec doesn't require the platform to track the G bit for all
+	 * segments, so similar to CS, let's synthesize this bit for all
+	 * segments.

Either I misunderstand the reference to CS or it no longer applies once the patch is in. I would suggest removing that part of the sentence.

Jan

+	 */
+	var->g = s->limit & 0xf0000 ? 1 : 0;

 	/*
 	 * AMD's VMCB does not have an explicit unusable field, so emulate it
@@ -1424,14 +1430,6 @@
 	var->unusable = !var->present || (var->type == 0);

 	switch (seg) {
-	case VCPU_SREG_CS:
-		/*
-		 * SVM always stores 0 for the 'G' bit in the CS selector in
-		 * the VMCB on a VMEXIT. This hurts cross-vendor migration:
-		 * Intel's VMENTRY has a check on the 'G' bit.
-		 */
-		var->g = s->limit & 0xf0000 ? 1 : 0;
-		break;
 	case VCPU_SREG_TR:
 		/*
 		 * Work around a bug where the busy flag in the tr selector

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
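The core of the patch is deriving the 'G' bit from the segment limit rather than trusting the VMCB attribute byte. A minimal standalone sketch of that derivation (the helper name is hypothetical; the real patch assigns var->g inline in svm_get_segment()):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Heuristic from the patch: if the effective segment limit has any of
 * bits 16-19 set, report G=1, otherwise G=0.  This reconstructs the
 * granularity bit from state the CPU does preserve (the limit),
 * instead of trusting an attribute bit the SVM spec does not
 * guarantee.  (Hypothetical standalone helper for illustration.)
 */
static int synthesize_g_bit(uint32_t limit)
{
    return (limit & 0xf0000) ? 1 : 0;
}
```

A 64 KiB flat limit yields G=0, while any limit above that range yields G=1, which is what Intel's vmentry checks expect.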
Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.
/me reminds you of 78 char text wrap.

On Wed, Jul 09, 2014 at 07:32:09PM +, Liang, Kan wrote:

Sure; but what I meant was, check_msr() is broken when run on such a kernel. You need to fix check_msr() to return failure on these 'ignored' MSRs; after all, they don't function as expected, they're effectively broken.

The function is designed to check whether the MSRs can be safely accessed (no #GP). It cannot guarantee the correctness of the MSRs. If KVM applied patch 2 and the guest applied patch 1, then from the guest's perspective the MSRs can be accessed (no #GP triggered), so returning true is expected. It is not broken.

You're not understanding. I know you wrote that function to do that. I'm saying that's wrong. Look at check_hw_exists(): it explicitly checks for fake MSRs and reports them broken. These fake MSRs _ARE_ broken; they do not behave as expected. Not crashing is not the right consideration here, we're interested in higher-order correct behaviour.

The only unexpected thing for the guest is that the counting/sampling result for LBR/extra regs is always 0. But the patch is a short-term fix to stop things from crashing, so I think it should be acceptable.

Patch 2 is fine; patch 1, in particular your check_msr() routine, is not.
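Peter's point can be made concrete with a simulated MSR: a probe that only checks for faults passes on a "fake" MSR, while a write/read-back probe (the check_hw_exists() approach he refers to) catches it. All names below are hypothetical stand-ins for illustration, not the kernel's code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A "fake" MSR silently ignores writes and always reads back its old
 * value, which is how a hypervisor can hide LBR/extra_regs MSRs
 * without ever raising #GP. */
struct sim_msr {
    uint64_t value;
    bool ignore_writes;
};

static uint64_t sim_rdmsr(struct sim_msr *m) { return m->value; }

static void sim_wrmsr(struct sim_msr *m, uint64_t v)
{
    if (!m->ignore_writes)
        m->value = v;
}

/* The stronger check: flip some bits, write, and read back.  An MSR
 * whose writes are dropped is reported broken even though accessing
 * it never faults. */
static bool msr_works(struct sim_msr *m)
{
    uint64_t old = sim_rdmsr(m);
    uint64_t probe = old ^ 0xffffUL;
    bool ok;

    sim_wrmsr(m, probe);
    ok = (sim_rdmsr(m) == probe);
    sim_wrmsr(m, old);          /* restore the original value */
    return ok;
}
```

The restore step matters: a probe must leave a genuinely working MSR in its original state.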
Re: [PATCH] KVM: svm: virtualize MSR reads from MC4_MISC1
On Wed, Jul 09, 2014 at 06:26:23PM +0200, Paolo Bonzini wrote:

On 26/06/2014 14:22, Matthias Lange wrote:

Linux' AMD MCE code tries to read from the MC4_MISC1 (0xc0000408) MSR. Because this read is not virtualized within KVM, a #GP is injected into the guest. This patch handles guest reads from MC4_MISC1 and returns 0 to the guest.

Signed-off-by: Matthias Lange matthias.la...@kernkonzept.com
---
 arch/x86/kvm/x86.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 24d70d4..d5a305b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2530,6 +2530,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 			return 1;
 		data = vcpu->arch.osvw.status;
 		break;
+	case 0xc0000408: /* MC4_MISC1 */
+		data = 0;
+		break;
 	default:
 		if (kvm_pmu_msr(vcpu, msr))
 			return kvm_pmu_get_msr(vcpu, msr, pdata);

Why don't you have to do the same with MC4_MISC0?

MC4_MISC0 (0x413) is handled in kvm_get_msr_common by this statement:

case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
	return get_msr_mce(vcpu, msr, pdata);

and get_msr_mce returns a sane value from the vcpu mce_banks.

Is your kernel compiled with or without CONFIG_PARAVIRT? Paravirt does

.read_msr = native_read_msr_safe,
.write_msr = native_write_msr_safe,

so you shouldn't get these #GP exceptions.

I used the default i386_defconfig to build the guest kernel. CONFIG_PARAVIRT is disabled in this setup.

Matthias.
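The shape of the fix is a new arm in kvm_get_msr_common()'s dispatch: known MSRs return a synthesized value, anything unhandled ends up as a #GP in the guest. A much-simplified sketch (the MSR number is taken from the patch; everything else is illustrative, not KVM's real code):

```c
#include <assert.h>
#include <stdint.h>

#define MC4_MISC1 0xc0000408u   /* AMD MC4_MISC1, as in the patch */

/* Simplified model of the dispatch in kvm_get_msr_common(): return 0
 * on success, 1 on failure.  A failure would cause KVM to inject #GP
 * into the guest; the patch turns the MC4_MISC1 case into a
 * successful read-as-zero instead.  (Hypothetical sketch only.) */
static int get_msr_sketch(uint32_t msr, uint64_t *pdata)
{
    switch (msr) {
    case MC4_MISC1:
        *pdata = 0;     /* the patch: read as zero instead of #GP */
        return 0;
    default:
        return 1;       /* unhandled: guest would get a #GP */
    }
}
```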
[PATCH] powerpc/kvm: Create proper names for the kvm_host_state PMU fields
We have two arrays in kvm_host_state that contain register values for the PMU. Currently we only create an asm-offsets symbol for the base of the arrays, and do the array offset in the assembly code. Creating an asm-offsets symbol for each field individually makes the code much nicer to read, particularly for the MMCRx/SIxR/SDAR fields, and might have helped us notice the recent double restore bug we had in this code.

Signed-off-by: Michael Ellerman m...@ellerman.id.au
---
This is on top of "powerpc/kvm: Remove redundant save of SIER AND MMCR2".

 arch/powerpc/kernel/asm-offsets.c       | 17 +++--
 arch/powerpc/kvm/book3s_hv_interrupts.S | 30 +++---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 32
 3 files changed, 46 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index f5995a912213..9adadb1db9b8 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -645,8 +645,21 @@ int main(void)
 	HSTATE_FIELD(HSTATE_SAVED_XIRR, saved_xirr);
 	HSTATE_FIELD(HSTATE_HOST_IPI, host_ipi);
 	HSTATE_FIELD(HSTATE_PTID, ptid);
-	HSTATE_FIELD(HSTATE_MMCR, host_mmcr);
-	HSTATE_FIELD(HSTATE_PMC, host_pmc);
+	HSTATE_FIELD(HSTATE_MMCR0, host_mmcr[0]);
+	HSTATE_FIELD(HSTATE_MMCR1, host_mmcr[1]);
+	HSTATE_FIELD(HSTATE_MMCRA, host_mmcr[2]);
+	HSTATE_FIELD(HSTATE_SIAR, host_mmcr[3]);
+	HSTATE_FIELD(HSTATE_SDAR, host_mmcr[4]);
+	HSTATE_FIELD(HSTATE_MMCR2, host_mmcr[5]);
+	HSTATE_FIELD(HSTATE_SIER, host_mmcr[6]);
+	HSTATE_FIELD(HSTATE_PMC1, host_pmc[0]);
+	HSTATE_FIELD(HSTATE_PMC2, host_pmc[1]);
+	HSTATE_FIELD(HSTATE_PMC3, host_pmc[2]);
+	HSTATE_FIELD(HSTATE_PMC4, host_pmc[3]);
+	HSTATE_FIELD(HSTATE_PMC5, host_pmc[4]);
+	HSTATE_FIELD(HSTATE_PMC6, host_pmc[5]);
+	HSTATE_FIELD(HSTATE_PMC7, host_pmc[6]);
+	HSTATE_FIELD(HSTATE_PMC8, host_pmc[7]);
 	HSTATE_FIELD(HSTATE_PURR, host_purr);
 	HSTATE_FIELD(HSTATE_SPURR, host_spurr);
 	HSTATE_FIELD(HSTATE_DSCR, host_dscr);
diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S b/arch/powerpc/kvm/book3s_hv_interrupts.S
index 731be7478b27..16ef3163a8b6 100644
--- a/arch/powerpc/kvm/book3s_hv_interrupts.S
+++ b/arch/powerpc/kvm/book3s_hv_interrupts.S
@@ -97,15 +97,15 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
 	mfspr	r5, SPRN_MMCR1
 	mfspr	r9, SPRN_SIAR
 	mfspr	r10, SPRN_SDAR
-	std	r7, HSTATE_MMCR(r13)
-	std	r5, HSTATE_MMCR + 8(r13)
-	std	r6, HSTATE_MMCR + 16(r13)
-	std	r9, HSTATE_MMCR + 24(r13)
-	std	r10, HSTATE_MMCR + 32(r13)
+	std	r7, HSTATE_MMCR0(r13)
+	std	r5, HSTATE_MMCR1(r13)
+	std	r6, HSTATE_MMCRA(r13)
+	std	r9, HSTATE_SIAR(r13)
+	std	r10, HSTATE_SDAR(r13)
 BEGIN_FTR_SECTION
 	mfspr	r9, SPRN_SIER
-	std	r8, HSTATE_MMCR + 40(r13)
-	std	r9, HSTATE_MMCR + 48(r13)
+	std	r8, HSTATE_MMCR2(r13)
+	std	r9, HSTATE_SIER(r13)
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 	mfspr	r3, SPRN_PMC1
 	mfspr	r5, SPRN_PMC2
@@ -117,15 +117,15 @@ BEGIN_FTR_SECTION
 	mfspr	r10, SPRN_PMC7
 	mfspr	r11, SPRN_PMC8
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
-	stw	r3, HSTATE_PMC(r13)
-	stw	r5, HSTATE_PMC + 4(r13)
-	stw	r6, HSTATE_PMC + 8(r13)
-	stw	r7, HSTATE_PMC + 12(r13)
-	stw	r8, HSTATE_PMC + 16(r13)
-	stw	r9, HSTATE_PMC + 20(r13)
+	stw	r3, HSTATE_PMC1(r13)
+	stw	r5, HSTATE_PMC2(r13)
+	stw	r6, HSTATE_PMC3(r13)
+	stw	r7, HSTATE_PMC4(r13)
+	stw	r8, HSTATE_PMC5(r13)
+	stw	r9, HSTATE_PMC6(r13)
 BEGIN_FTR_SECTION
-	stw	r10, HSTATE_PMC + 24(r13)
-	stw	r11, HSTATE_PMC + 28(r13)
+	stw	r10, HSTATE_PMC7(r13)
+	stw	r11, HSTATE_PMC8(r13)
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
 31:
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 868347ef09fd..d68ecb33b52a 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -87,20 +87,20 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
 	cmpwi	r4, 0
 	beq	23f	/* skip if not */
 BEGIN_FTR_SECTION
-	ld	r3, HSTATE_MMCR(r13)
+	ld	r3, HSTATE_MMCR0(r13)
 	andi.	r4, r3, MMCR0_PMAO_SYNC | MMCR0_PMAO
 	cmpwi	r4, MMCR0_PMAO
 	beql	kvmppc_fix_pmao
 END_FTR_SECTION_IFSET(CPU_FTR_PMAO_BUG)
-	lwz	r3, HSTATE_PMC(r13)
-	lwz	r4, HSTATE_PMC + 4(r13)
-	lwz	r5, HSTATE_PMC + 8(r13)
-	lwz	r6, HSTATE_PMC + 12(r13)
-	lwz	r8, HSTATE_PMC + 16(r13)
-	lwz	r9, HSTATE_PMC + 20(r13)
+	lwz	r3, HSTATE_PMC1(r13)
+	lwz
Re: [PATCH] powerpc/kvm: Create proper names for the kvm_host_state PMU fields
On 10.07.14 11:34, Michael Ellerman wrote:

We have two arrays in kvm_host_state that contain register values for the PMU. Currently we only create an asm-offsets symbol for the base of the arrays, and do the array offset in the assembly code. Creating an asm-offsets symbol for each field individually makes the code much nicer to read, particularly for the MMCRx/SIxR/SDAR fields, and might have helped us notice the recent double restore bug we had in this code.

Signed-off-by: Michael Ellerman m...@ellerman.id.au

Acked-by: Alexander Graf ag...@suse.de

I still think this whole code path should just be C though.

Alex
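The mechanism behind the patch is worth spelling out: asm-offsets works by having C compute offsetof() values that assembly then uses symbolically, and that works just as well for individual array elements as for struct fields. A hypothetical miniature (the struct stands in for kvm_host_state):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the relevant part of kvm_host_state (hypothetical
 * layout for illustration). */
struct host_state {
    unsigned long host_mmcr[7];
    unsigned int  host_pmc[8];
};

/* What asm-offsets generates: one named constant per array element,
 * so assembly can say HSTATE_MMCR1(r13) instead of HSTATE_MMCR + 8(r13). */
#define HSTATE_MMCR0 offsetof(struct host_state, host_mmcr[0])
#define HSTATE_MMCR1 offsetof(struct host_state, host_mmcr[1])
#define HSTATE_PMC1  offsetof(struct host_state, host_pmc[0])
#define HSTATE_PMC2  offsetof(struct host_state, host_pmc[1])
```

Because the compiler computes the offsets, the assembly can no longer get the "+ N" arithmetic wrong, which is exactly the class of bug the commit message mentions.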
[3.11.y.z extended stable] Patch MIPS: KVM: Remove redundant NULL checks before kfree() has been added to staging queue
This is a note to let you know that I have just added a patch titled

    MIPS: KVM: Remove redundant NULL checks before kfree()

to the linux-3.11.y-queue branch of the 3.11.y.z extended stable tree which can be found at:

http://kernel.ubuntu.com/git?p=ubuntu/linux.git;a=shortlog;h=refs/heads/linux-3.11.y-queue

If you, or anyone else, feels it should not be added to this tree, please reply to this email.

For more information about the 3.11.y.z tree, see https://wiki.ubuntu.com/Kernel/Dev/ExtendedStable

Thanks.
-Luis

--

From 58d267632e7edae4b73fe577261713f68a0038d8 Mon Sep 17 00:00:00 2001
From: James Hogan james.ho...@imgtec.com
Date: Thu, 29 May 2014 10:16:44 +0100
Subject: MIPS: KVM: Remove redundant NULL checks before kfree()

commit c6c0a6637f9da54f9472144d44f71cf847f92e20 upstream.

The kfree() function already NULL checks the parameter, so remove the redundant NULL checks before kfree() calls in arch/mips/kvm/.

Signed-off-by: James Hogan james.ho...@imgtec.com
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Gleb Natapov g...@kernel.org
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle r...@linux-mips.org
Cc: linux-m...@linux-mips.org
Cc: Sanjay Lal sanj...@kymasys.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
Signed-off-by: Luis Henriques luis.henriq...@canonical.com
---
 arch/mips/kvm/kvm_mips.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
index 426345ac6f6e..7e78af0e57de 100644
--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -149,9 +149,7 @@ void kvm_mips_free_vcpus(struct kvm *kvm)
 		if (kvm->arch.guest_pmap[i] != KVM_INVALID_PAGE)
 			kvm_mips_release_pfn_clean(kvm->arch.guest_pmap[i]);
 	}
-
-	if (kvm->arch.guest_pmap)
-		kfree(kvm->arch.guest_pmap);
+	kfree(kvm->arch.guest_pmap);
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		kvm_arch_vcpu_free(vcpu);
@@ -384,12 +382,8 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 
 	kvm_mips_dump_stats(vcpu);
 
-	if (vcpu->arch.guest_ebase)
-		kfree(vcpu->arch.guest_ebase);
-
-	if (vcpu->arch.kseg0_commpage)
-		kfree(vcpu->arch.kseg0_commpage);
-
+	kfree(vcpu->arch.guest_ebase);
+	kfree(vcpu->arch.kseg0_commpage);
 }
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
-- 
1.9.1
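The cleanup relies on a documented guarantee: like standard C free(), the kernel's kfree() is a no-op when passed NULL, so "if (p) kfree(p);" is always reducible to "kfree(p);". A small userspace illustration using free() (the counter is only there so the behaviour can be observed):

```c
#include <assert.h>
#include <stdlib.h>

static int frees_done;

/* free(NULL) is defined to do nothing (C standard; kfree() documents
 * the same guarantee), so callers never need their own NULL check. */
static void counted_free(void *p)
{
    free(p);            /* safe even when p == NULL */
    frees_done++;
}
```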
Re: [Qemu-devel] [PATCH v5 00/12] KVM Support for MIPS32 Processors
On 17 June 2014 23:10, James Hogan james.ho...@imgtec.com wrote:

The patchset depends on v4 of "target-mips: implement UserLocal Register". I'm aiming for QEMU 2.1, hopefully it isn't too late to get some final review. Thanks to everybody who has already taken part in review. This patchset implements KVM support for MIPS32 processors, using Trap & Emulation.

I was looking at what happens for MMIO accesses to nonexistent memory when we're running under KVM, and interestingly this patchset means MIPS is now the only CPU which both (a) supports KVM and (b) has an implementation of the do_unassigned_access() CPU hook.

Does the current code support a guest attempt to access unassigned physical addresses? I had a look at the code and it seems like mips_cpu_unassigned_access() will end up calling cpu_restore_state() and cpu_loop_exit(), which I think will probably crash if you call them when using KVM rather than TCG...

More generally, there doesn't really seem to be provision in the KVM KVM_EXIT_MMIO API for returning "this access failed". I guess in theory userspace could do all the "figure out how to adjust CPU state to do exception entry" work and then run the VCPU, but that seems like quite a lot of work which the kernel already knows how to do; is there some way to provide a simpler API that lets userspace just inform the kernel that it needs to fault the access?

thanks
-- PMM
Re: [PATCH v2] KVM: Synthesize G bit for all segments.
On 10/07/2014 10:55, Jan Kiszka wrote:

+	/*
+	 * SVM spec doesn't require the platform to track the G bit for all
+	 * segments, so similar to CS, let's synthesize this bit for all
+	 * segments.

Either I misunderstand the reference to CS or it no longer applies once the patch is in. I would suggest removing that part of the sentence.

Something like this:

	/*
	 * The SVM spec doesn't require the platform to track the 'G' bit for
	 * all segments.  Current processors track it for all segments except
	 * CS, but other hypervisors may not do so.  So let's synthesize this
	 * bit always to help running KVM nested.  It also helps cross-vendor
	 * migration, because Intel's vmentry has a check on the 'G' bit.
	 */

Paolo
Re: [Qemu-devel] [PATCH v5 00/12] KVM Support for MIPS32 Processors
On 10/07/2014 14:17, Peter Maydell wrote:

More generally, there doesn't really seem to be provision in the KVM KVM_EXIT_MMIO API for returning "this access failed". I guess in theory userspace could do all the "figure out how to adjust CPU state to do exception entry" work and then run the VCPU, but that seems like quite a lot of work which the kernel already knows how to do; is there some way to provide a simpler API that lets userspace just inform the kernel that it needs to fault the access?

There are 3 free padding bytes in struct kvm_run's mmio field. It's possible to add a per-VM capability and have the kernel check one of these bytes when the capability is set.

Paolo
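Paolo's suggestion in miniature: the mmio exit record has spare padding after is_write, so one padding byte can carry an "access failed" flag without changing the struct's size or layout, gated by a new capability. The field names below mirror the mmio member of the uapi struct, but the flag itself is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of struct kvm_run's mmio exit record.  Repurposing one
 * padding byte keeps the ABI: size and field offsets are unchanged,
 * and old kernels that never set the byte leave it zero. */
struct mmio_exit {
    uint64_t phys_addr;
    uint8_t  data[8];
    uint32_t len;
    uint8_t  is_write;
    uint8_t  access_failed;   /* hypothetical: repurposed padding[0] */
    uint8_t  padding[2];      /* two spare bytes remain */
};
```

Userspace would only consult access_failed after negotiating the (hypothetical) capability, so existing users are unaffected.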
[PATCH/RFC 2/5] KVM: s390: move finalization of SIGP STOP orders to kvm_s390_vcpu_stop
From: David Hildenbrand d...@linux.vnet.ibm.com

Let's move the finalization of SIGP STOP and SIGP STOP AND STORE STATUS orders to the point where the VCPU is actually stopped. This change is needed to prepare for a user space driven VCPU state change. The action_bits may only be cleared when setting the cpu state to STOPPED while holding the local irq lock.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/kvm/intercept.c | 31 ---
 arch/s390/kvm/kvm-s390.c  |  8
 2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c
index a0b586c..ac6b325 100644
--- a/arch/s390/kvm/intercept.c
+++ b/arch/s390/kvm/intercept.c
@@ -56,32 +56,25 @@ static int handle_noop(struct kvm_vcpu *vcpu)
 static int handle_stop(struct kvm_vcpu *vcpu)
 {
 	int rc = 0;
+	unsigned int action_bits;
 
 	vcpu->stat.exit_stop_request++;
-	spin_lock_bh(&vcpu->arch.local_int.lock);
-
 	trace_kvm_s390_stop_request(vcpu->arch.local_int.action_bits);
 
-	if (vcpu->arch.local_int.action_bits & ACTION_STOP_ON_STOP) {
-		kvm_s390_vcpu_stop(vcpu);
-		vcpu->arch.local_int.action_bits &= ~ACTION_STOP_ON_STOP;
-		VCPU_EVENT(vcpu, 3, "%s", "cpu stopped");
-		rc = -EOPNOTSUPP;
-	}
+	action_bits = vcpu->arch.local_int.action_bits;
 
-	if (vcpu->arch.local_int.action_bits & ACTION_STORE_ON_STOP) {
-		vcpu->arch.local_int.action_bits &= ~ACTION_STORE_ON_STOP;
-		/* store status must be called unlocked. Since local_int.lock
-		 * only protects local_int.* and not guest memory we can give
-		 * up the lock here */
-		spin_unlock_bh(&vcpu->arch.local_int.lock);
+	if (!(action_bits & ACTION_STOP_ON_STOP))
+		return 0;
+
+	if (action_bits & ACTION_STORE_ON_STOP) {
 		rc = kvm_s390_vcpu_store_status(vcpu,
 						KVM_S390_STORE_STATUS_NOADDR);
-		if (rc >= 0)
-			rc = -EOPNOTSUPP;
-	} else
-		spin_unlock_bh(&vcpu->arch.local_int.lock);
-	return rc;
+		if (rc)
+			return rc;
+	}
+
+	kvm_s390_vcpu_stop(vcpu);
+	return -EOPNOTSUPP;
 }
 
 static int handle_validity(struct kvm_vcpu *vcpu)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 2f3e14f..c507789 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1494,7 +1494,15 @@ void kvm_s390_vcpu_stop(struct kvm_vcpu *vcpu)
 	spin_lock_bh(&vcpu->kvm->arch.start_stop_lock);
 	online_vcpus = atomic_read(&vcpu->kvm->online_vcpus);
 
+	/* Need to lock access to action_bits to avoid a SIGP race condition */
+	spin_lock_bh(&vcpu->arch.local_int.lock);
 	atomic_set_mask(CPUSTAT_STOPPED, &vcpu->arch.sie_block->cpuflags);
+
+	/* SIGP STOP and SIGP STOP AND STORE STATUS has been fully processed */
+	vcpu->arch.local_int.action_bits &=
+		~(ACTION_STOP_ON_STOP | ACTION_STORE_ON_STOP);
+	spin_unlock_bh(&vcpu->arch.local_int.lock);
+
 	__disable_ibs_on_vcpu(vcpu);
 
 	for (i = 0; i < online_vcpus; i++) {
-- 
1.8.4.2
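The restructured handler is easier to see without the locking and tracing around it. A sketch with hypothetical stubs (store always succeeds here; the real code calls kvm_s390_vcpu_store_status() and kvm_s390_vcpu_stop()):

```c
#include <assert.h>

#define ACTION_STOP_ON_STOP  (1 << 0)
#define ACTION_STORE_ON_STOP (1 << 1)
#define EOPNOTSUPP_SKETCH    95    /* stand-in for EOPNOTSUPP */

/* Stub for kvm_s390_vcpu_store_status(); always succeeds here. */
static int store_status_stub(void) { return 0; }

/* Stand-in for kvm_s390_vcpu_stop(): finalization (clearing the
 * action bits) now happens at the point the VCPU really stops. */
static void vcpu_stop_stub(unsigned int *action_bits)
{
    *action_bits &= ~(ACTION_STOP_ON_STOP | ACTION_STORE_ON_STOP);
}

/* The reworked handle_stop() control flow in miniature: snapshot the
 * bits, bail out if no STOP is pending, store status first if asked,
 * and only then stop the VCPU. */
static int handle_stop_sketch(unsigned int *action_bits)
{
    unsigned int bits = *action_bits;
    int rc;

    if (!(bits & ACTION_STOP_ON_STOP))
        return 0;

    if (bits & ACTION_STORE_ON_STOP) {
        rc = store_status_stub();
        if (rc)
            return rc;
    }

    vcpu_stop_stub(action_bits);
    return -EOPNOTSUPP_SKETCH;
}
```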
[PATCH/RFC 0/5] KVM: s390: Let user space control the cpu states
Paolo,

here is a preview of a rework of CPU state on s390x. Since we extend MP_STATE I wanted to send an RFC upfront. I will wait for some feedback and send a proper pull request in the next week.

This series enables the KVM_SET_MP_STATE ioctl on s390 to make the cpu state settable by user space. This helps us to avoid races in SIGP/Reset handling that happen because some SIGPs are handled in QEMU, others in the KERNEL. As soon as the new interface is touched, user space takes complete control of the cpu states. Otherwise the old way is used. Therefore, the new kernel should continue to work fine with old QEMUs.

David Hildenbrand (5):
  KVM: s390: allow only one SIGP STOP (AND STORE STATUS) at a time
  KVM: s390: move finalization of SIGP STOP orders to kvm_s390_vcpu_stop
  KVM: s390: remove __cpu_is_stopped and expose is_vcpu_stopped
  KVM: prepare for KVM_(S|G)ET_MP_STATE on other architectures
  KVM: s390: implement KVM_(S|G)ET_MP_STATE for user space state control

 Documentation/virtual/kvm/api.txt | 31 ++-
 arch/s390/include/asm/kvm_host.h  |  1 +
 arch/s390/kvm/diag.c              |  3 ++-
 arch/s390/kvm/intercept.c         | 32 ++--
 arch/s390/kvm/kvm-s390.c          | 52 +++
 arch/s390/kvm/kvm-s390.h          | 10 ++--
 arch/s390/kvm/sigp.c              |  7 +-
 include/uapi/linux/kvm.h          |  7 +-
 8 files changed, 98 insertions(+), 45 deletions(-)

-- 
1.8.4.2
[PATCH/RFC 3/5] KVM: s390: remove __cpu_is_stopped and expose is_vcpu_stopped
From: David Hildenbrand d...@linux.vnet.ibm.com

The function __cpu_is_stopped is not used any more. Let's remove it and expose the function is_vcpu_stopped instead, which is actually what we want. This patch also converts an open-coded check for CPUSTAT_STOPPED to is_vcpu_stopped().

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Acked-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/kvm/kvm-s390.c | 7 +--
 arch/s390/kvm/kvm-s390.h | 4 ++--
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index c507789..3428953 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -926,7 +926,7 @@ static int kvm_arch_vcpu_ioctl_set_initial_psw(struct kvm_vcpu *vcpu, psw_t psw)
 {
 	int rc = 0;
 
-	if (!(atomic_read(&vcpu->arch.sie_block->cpuflags) & CPUSTAT_STOPPED))
+	if (!is_vcpu_stopped(vcpu))
 		rc = -EBUSY;
 	else {
 		vcpu->run->psw_mask = psw.mask;
@@ -1413,11 +1413,6 @@ int kvm_s390_vcpu_store_status(struct kvm_vcpu *vcpu, unsigned long addr)
 	return kvm_s390_store_status_unloaded(vcpu, addr);
 }
 
-static inline int is_vcpu_stopped(struct kvm_vcpu *vcpu)
-{
-	return atomic_read(&(vcpu)->arch.sie_block->cpuflags) & CPUSTAT_STOPPED;
-}
-
 static void __disable_ibs_on_vcpu(struct kvm_vcpu *vcpu)
 {
 	kvm_check_request(KVM_REQ_ENABLE_IBS, vcpu);
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index a8655ed..77ed846 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -45,9 +45,9 @@ do { \
 	  d_args); \
 } while (0)
 
-static inline int __cpu_is_stopped(struct kvm_vcpu *vcpu)
+static inline int is_vcpu_stopped(struct kvm_vcpu *vcpu)
 {
-	return atomic_read(&vcpu->arch.sie_block->cpuflags) & CPUSTAT_STOP_INT;
+	return atomic_read(&vcpu->arch.sie_block->cpuflags) & CPUSTAT_STOPPED;
 }
 
 static inline int kvm_is_ucontrol(struct kvm *kvm)
-- 
1.8.4.2
[PATCH/RFC 0/5] s390x/kvm: track the logical cpu state in QEMU and propagate it to kvm
This is the qemu part of kernel series "Let user space control the cpu states".

Christian Borntraeger (1):
  update linux headers with cpustate changes

David Hildenbrand (4):
  s390x/kvm: introduce proper states for s390 cpus
  s390x/kvm: proper use of the cpu states OPERATING and STOPPED
  s390x/kvm: test whether a cpu is STOPPED when checking has_work
  s390x/kvm: propagate s390 cpu state to kvm

 hw/s390x/ipl.c            |   2 +-
 hw/s390x/s390-virtio.c    |  32 --
 linux-headers/linux/kvm.h |   7 ++-
 target-s390x/cpu.c        | 106 +++---
 target-s390x/cpu.h        |  33 +--
 target-s390x/helper.c     |  11 ++---
 target-s390x/kvm.c        |  49 ++---
 trace-events              |   6 +++
 8 files changed, 179 insertions(+), 67 deletions(-)

-- 
1.8.4.2
[PATCH/RFC 1/5] KVM: s390: allow only one SIGP STOP (AND STORE STATUS) at a time
From: David Hildenbrand d...@linux.vnet.ibm.com

A SIGP STOP (AND STORE STATUS) order is complete as soon as the VCPU has been stopped. This patch makes sure that only one SIGP STOP (AND STORE STATUS) may be pending at a time (as defined by the architecture). If the action_bits are still set, a SIGP STOP has been issued but not completed yet. The VCPU is busy for further SIGP STOP orders.

Also set the CPUSTAT_STOP_INT after the action_bits variable has been modified (the same order that is used when injecting a KVM_S390_SIGP_STOP from userspace).

Both changes are needed in preparation for a user space driven VCPU state change (to avoid race conditions).

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 arch/s390/kvm/sigp.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kvm/sigp.c b/arch/s390/kvm/sigp.c
index 43079a4..fd7fb5c 100644
--- a/arch/s390/kvm/sigp.c
+++ b/arch/s390/kvm/sigp.c
@@ -136,6 +136,11 @@ static int __inject_sigp_stop(struct kvm_s390_local_interrupt *li, int action)
 	inti->type = KVM_S390_SIGP_STOP;
 
 	spin_lock_bh(&li->lock);
+	if (li->action_bits & ACTION_STOP_ON_STOP) {
+		/* another SIGP STOP is pending */
+		rc = SIGP_CC_BUSY;
+		goto out;
+	}
 	if ((atomic_read(li->cpuflags) & CPUSTAT_STOPPED)) {
 		kfree(inti);
 		if ((action & ACTION_STORE_ON_STOP) != 0)
@@ -144,8 +149,8 @@ static int __inject_sigp_stop(struct kvm_s390_local_interrupt *li, int action)
 	}
 	list_add_tail(&inti->list, &li->list);
 	atomic_set(&li->active, 1);
-	atomic_set_mask(CPUSTAT_STOP_INT, li->cpuflags);
 	li->action_bits |= action;
+	atomic_set_mask(CPUSTAT_STOP_INT, li->cpuflags);
 	if (waitqueue_active(&li->wq))
 		wake_up_interruptible(&li->wq);
 out:
-- 
1.8.4.2
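The behavioural change can be shown standalone: a second STOP order arriving while one is still pending now gets the "busy" condition code instead of being queued. A hypothetical sketch (SIGP condition code 2 is "busy"; the boolean stands in for the ACTION_STOP_ON_STOP bit in action_bits):

```c
#include <assert.h>
#include <stdbool.h>

#define SIGP_CC_ORDER_CODE_ACCEPTED 0
#define SIGP_CC_BUSY                2

/* Stand-in for li->action_bits & ACTION_STOP_ON_STOP: true while a
 * previously injected STOP has not yet been finalized. */
static int inject_stop_sketch(bool *stop_pending)
{
    if (*stop_pending)
        return SIGP_CC_BUSY;   /* only one STOP at a time */
    *stop_pending = true;      /* set before raising the stop
                                  interrupt, as in the patch */
    return SIGP_CC_ORDER_CODE_ACCEPTED;
}
```

Once handle_stop() finalizes the order and clears the pending bit, a new STOP is accepted again.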
[PATCH/RFC 4/5] KVM: prepare for KVM_(S|G)ET_MP_STATE on other architectures
From: David Hildenbrand d...@linux.vnet.ibm.com

Highlight the aspects of the ioctls that are actually specific to x86 and ia64. As the defined restrictions (irqchip) and mp states may not apply to other architectures, these parts are flagged to belong to x86 and ia64. In preparation for the use of KVM_(S|G)ET_MP_STATE by s390.

Fix a spelling error (KVM_SET_MP_STATE vs. KVM_SET_MPSTATE) on the way.

Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com
Acked-by: Christian Borntraeger borntrae...@de.ibm.com
Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 Documentation/virtual/kvm/api.txt | 21 -
 include/uapi/linux/kvm.h          |  3 ++-
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 0fe3649..904c61c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -988,18 +988,20 @@ uniprocessor guests).
 
 Possible values are:
 
- - KVM_MP_STATE_RUNNABLE:       the vcpu is currently running
+ - KVM_MP_STATE_RUNNABLE:       the vcpu is currently running [x86, ia64]
 - KVM_MP_STATE_UNINITIALIZED:  the vcpu is an application processor (AP)
-                                which has not yet received an INIT signal
+                                which has not yet received an INIT signal [x86,
+                                ia64]
 - KVM_MP_STATE_INIT_RECEIVED:  the vcpu has received an INIT signal, and is
-                                now ready for a SIPI
+                                now ready for a SIPI [x86, ia64]
 - KVM_MP_STATE_HALTED:         the vcpu has executed a HLT instruction and
-                                is waiting for an interrupt
+                                is waiting for an interrupt [x86, ia64]
 - KVM_MP_STATE_SIPI_RECEIVED:  the vcpu has just received a SIPI (vector
-                                accessible via KVM_GET_VCPU_EVENTS)
+                                accessible via KVM_GET_VCPU_EVENTS) [x86, ia64]
 
-This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
-irqchip, the multiprocessing state must be maintained by userspace.
+On x86 and ia64, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
+in-kernel irqchip, the multiprocessing state must be maintained by userspace on
+these architectures.
 
 4.39 KVM_SET_MP_STATE
 
@@ -1013,8 +1015,9 @@ Returns: 0 on success; -1 on error
 Sets the vcpu's current multiprocessing state; see KVM_GET_MP_STATE for
 arguments.
 
-This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
-irqchip, the multiprocessing state must be maintained by userspace.
+On x86 and ia64, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an
+in-kernel irqchip, the multiprocessing state must be maintained by userspace on
+these architectures.
 
 4.40 KVM_SET_IDENTITY_MAP_ADDR
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index e11d8f1..37d4ec6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -399,8 +399,9 @@ struct kvm_vapic_addr {
 	__u64 vapic_addr;
 };
 
-/* for KVM_SET_MPSTATE */
+/* for KVM_SET_MP_STATE */
 
+/* not all states are valid on all architectures */
 #define KVM_MP_STATE_RUNNABLE          0
 #define KVM_MP_STATE_UNINITIALIZED     1
 #define KVM_MP_STATE_INIT_RECEIVED     2
-- 
1.8.4.2
[PATCH/RFC 2/5] s390x/kvm: introduce proper states for s390 cpus
From: David Hildenbrand d...@linux.vnet.ibm.com Until now, when a s390 cpu was stopped or halted, the number of running CPUs was tracked in a global variable. This was problematic for migration, so Jason came up with a per-cpu running state. As it turns out, we want to track the full logical state of a target vcpu, so we need real s390 cpu states. This patch is based on an initial patch by Jason Herne, but was heavily rewritten when adding the cpu states STOPPED and OPERATING. On the way we move add_del_running to cpu.c (the declaration is already in cpu.h) and modify the users where appropriate. Please note that the cpu is still set to be stopped when it is halted, which is wrong. This will be fixed in the next patch. The LOAD and CHECK-STOP state will not be used in the first step. Signed-off-by: Jason J. Herne jjhe...@us.ibm.com Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com [folded Jason's patch into David's patch to avoid add/remove same lines] --- hw/s390x/s390-virtio.c | 32 target-s390x/cpu.c | 43 +++ target-s390x/cpu.h | 14 ++ 3 files changed, 57 insertions(+), 32 deletions(-) diff --git a/hw/s390x/s390-virtio.c b/hw/s390x/s390-virtio.c index 93c7ace..91baa18 100644 --- a/hw/s390x/s390-virtio.c +++ b/hw/s390x/s390-virtio.c @@ -124,38 +124,6 @@ static void s390_virtio_register_hcalls(void) s390_virtio_hcall_set_status); } -/* - * The number of running CPUs. On s390 a shutdown is the state of all CPUs - * being either stopped or disabled (for interrupts) waiting. We have to - * track this number to call the shutdown sequence accordingly. This - * number is modified either on startup or while holding the big qemu lock. 
- */ -static unsigned s390_running_cpus; - -void s390_add_running_cpu(S390CPU *cpu) -{ -CPUState *cs = CPU(cpu); - -if (cs-halted) { -s390_running_cpus++; -cs-halted = 0; -cs-exception_index = -1; -} -} - -unsigned s390_del_running_cpu(S390CPU *cpu) -{ -CPUState *cs = CPU(cpu); - -if (cs-halted == 0) { -assert(s390_running_cpus = 1); -s390_running_cpus--; -cs-halted = 1; -cs-exception_index = EXCP_HLT; -} -return s390_running_cpus; -} - void s390_init_ipl_dev(const char *kernel_filename, const char *kernel_cmdline, const char *initrd_filename, diff --git a/target-s390x/cpu.c b/target-s390x/cpu.c index c3082b7..a6edd07 100644 --- a/target-s390x/cpu.c +++ b/target-s390x/cpu.c @@ -224,6 +224,49 @@ static void s390_cpu_finalize(Object *obj) #endif } +#if !defined(CONFIG_USER_ONLY) +static unsigned s390_count_running_cpus(void) +{ +CPUState *cpu; +int nr_running = 0; + +CPU_FOREACH(cpu) { +uint8_t state = S390_CPU(cpu)-env.cpu_state; +if (state == CPU_STATE_OPERATING || +state == CPU_STATE_LOAD) { +nr_running++; +} +} + +return nr_running; +} + +void s390_add_running_cpu(S390CPU *cpu) +{ +CPUState *cs = CPU(cpu); + +if (cs-halted) { +cpu-env.cpu_state = CPU_STATE_OPERATING; +cs-halted = 0; +cs-exception_index = -1; +} +} + +unsigned s390_del_running_cpu(S390CPU *cpu) +{ +CPUState *cs = CPU(cpu); + +if (cs-halted == 0) { +assert(s390_count_running_cpus() = 1); +cpu-env.cpu_state = CPU_STATE_STOPPED; +cs-halted = 1; +cs-exception_index = EXCP_HLT; +} + +return s390_count_running_cpus(); +} +#endif + static const VMStateDescription vmstate_s390_cpu = { .name = cpu, .unmigratable = 1, diff --git a/target-s390x/cpu.h b/target-s390x/cpu.h index b13761d..af22ba6 100644 --- a/target-s390x/cpu.h +++ b/target-s390x/cpu.h @@ -141,6 +141,20 @@ typedef struct CPUS390XState { QEMUTimer *tod_timer; QEMUTimer *cpu_timer; + +/* + * The cpu state represents the logical state of a cpu. In contrast to other + * architectures, there is a difference between a halt and a stop on s390. 
+ * If all cpus are either stopped (including check stop) or in the disabled
+ * wait state, the vm can be shut down.
+ */
+#define CPU_STATE_UNINITIALIZED 0x00
+#define CPU_STATE_STOPPED       0x01
+#define CPU_STATE_CHECK_STOP    0x02
+#define CPU_STATE_OPERATING     0x03
+#define CPU_STATE_LOAD          0x04
+    uint8_t cpu_state;
+
 } CPUS390XState;

 #include "cpu-qom.h"
--
1.8.4.2
[PATCH/RFC 4/5] s390x/kvm: test whether a cpu is STOPPED when checking has_work
From: David Hildenbrand d...@linux.vnet.ibm.com If a cpu is stopped, it must never be allowed to run and no interrupt may wake it up. A cpu also has to be unhalted if it is halted and has work to do - this scenario wasn't hit in kvm case yet, as only disabled wait is processed within QEMU. Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com --- target-s390x/cpu.c | 6 ++ target-s390x/kvm.c | 5 + 2 files changed, 11 insertions(+) diff --git a/target-s390x/cpu.c b/target-s390x/cpu.c index c5ab98f..1d32f5a 100644 --- a/target-s390x/cpu.c +++ b/target-s390x/cpu.c @@ -72,6 +72,12 @@ static bool s390_cpu_has_work(CPUState *cs) S390CPU *cpu = S390_CPU(cs); CPUS390XState *env = cpu-env; +/* stopped cpus can never run */ +if (env-cpu_state == CPU_STATE_STOPPED || +env-cpu_state == CPU_STATE_CHECK_STOP) { +return false; +} + return (cs-interrupt_request CPU_INTERRUPT_HARD) (env-psw.mask PSW_MASK_EXT); } diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c index db2e42c..00125f1 100644 --- a/target-s390x/kvm.c +++ b/target-s390x/kvm.c @@ -553,6 +553,11 @@ void kvm_arch_post_run(CPUState *cpu, struct kvm_run *run) int kvm_arch_process_async_events(CPUState *cs) { +if (cs-halted CPU_GET_CLASS(cs)-has_work(cs)) { +/* has_work will take care of stopped cpus */ +s390_cpu_unhalt(S390_CPU(cs)); +} + return cs-halted; } -- 1.8.4.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH/RFC 3/5] s390x/kvm: proper use of the cpu states OPERATING and STOPPED
From: David Hildenbrand d...@linux.vnet.ibm.com This patch makes sure that halting a cpu and stopping a cpu are two different things. Stopping a cpu will also set the cpu halted - this is needed for common infrastructure to work (note that the stop and stopped flag cannot be used for our purpose because they are already used by other mechanisms). A cpu can be halted (waiting) when it is operating. If interrupts are disabled, this is called a disabled wait, as it can't be woken up anymore. A stopped cpu is treated like a disabled wait cpu, but in order to prepare for a proper cpu state synchronization with the kvm part, we need to track the real logical state of a cpu. Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com Reviewed-by: Christian Borntraeger borntrae...@de.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com --- hw/s390x/ipl.c| 2 +- target-s390x/cpu.c| 78 +-- target-s390x/cpu.h| 14 ++--- target-s390x/helper.c | 11 ++-- target-s390x/kvm.c| 11 trace-events | 5 6 files changed, 75 insertions(+), 46 deletions(-) diff --git a/hw/s390x/ipl.c b/hw/s390x/ipl.c index 4fa9cff..3b77c9a 100644 --- a/hw/s390x/ipl.c +++ b/hw/s390x/ipl.c @@ -176,7 +176,7 @@ static void s390_ipl_reset(DeviceState *dev) } } -s390_add_running_cpu(cpu); +s390_cpu_set_state(CPU_STATE_OPERATING, cpu); } static void s390_ipl_class_init(ObjectClass *klass, void *data) diff --git a/target-s390x/cpu.c b/target-s390x/cpu.c index a6edd07..c5ab98f 100644 --- a/target-s390x/cpu.c +++ b/target-s390x/cpu.c @@ -26,7 +26,9 @@ #include cpu.h #include qemu-common.h #include qemu/timer.h +#include qemu/error-report.h #include hw/hw.h +#include trace.h #ifndef CONFIG_USER_ONLY #include sysemu/arch_init.h #endif @@ -81,7 +83,7 @@ static void s390_cpu_load_normal(CPUState *s) S390CPU *cpu = S390_CPU(s); cpu-env.psw.addr = ldl_phys(s-as, 4) PSW_MASK_ESA_ADDR; cpu-env.psw.mask = PSW_MASK_32 | PSW_MASK_64; -s390_add_running_cpu(cpu); 
+s390_cpu_set_state(CPU_STATE_OPERATING, cpu); } #endif @@ -93,11 +95,8 @@ static void s390_cpu_reset(CPUState *s) CPUS390XState *env = cpu-env; env-pfault_token = -1UL; -s390_del_running_cpu(cpu); scc-parent_reset(s); -#if !defined(CONFIG_USER_ONLY) -s-halted = 1; -#endif +s390_cpu_set_state(CPU_STATE_STOPPED, cpu); tlb_flush(s, 1); } @@ -135,9 +134,8 @@ static void s390_cpu_full_reset(CPUState *s) S390CPUClass *scc = S390_CPU_GET_CLASS(cpu); CPUS390XState *env = cpu-env; -s390_del_running_cpu(cpu); - scc-parent_reset(s); +s390_cpu_set_state(CPU_STATE_STOPPED, cpu); memset(env, 0, offsetof(CPUS390XState, cpu_num)); @@ -147,12 +145,7 @@ static void s390_cpu_full_reset(CPUState *s) env-pfault_token = -1UL; -/* set halted to 1 to make sure we can add the cpu in - * s390_ipl_cpu code, where CPUState::halted is set back to 0 - * after incrementing the cpu counter */ #if !defined(CONFIG_USER_ONLY) -s-halted = 1; - if (kvm_enabled()) { kvm_s390_reset_vcpu(cpu); } @@ -201,10 +194,7 @@ static void s390_cpu_initfn(Object *obj) env-tod_basetime = 0; env-tod_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, s390x_tod_timer, cpu); env-cpu_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, s390x_cpu_timer, cpu); -/* set CPUState::halted state to 1 to avoid decrementing the running - * cpu counter in s390_cpu_reset to a negative number at - * initial ipl */ -cs-halted = 1; +s390_cpu_set_state(CPU_STATE_STOPPED, cpu); #endif env-cpu_num = cpu_num++; env-ext_index = -1; @@ -225,6 +215,12 @@ static void s390_cpu_finalize(Object *obj) } #if !defined(CONFIG_USER_ONLY) +static bool disabled_wait(CPUState *cpu) +{ +return cpu-halted !(S390_CPU(cpu)-env.psw.mask +(PSW_MASK_IO | PSW_MASK_EXT | PSW_MASK_MCHECK)); +} + static unsigned s390_count_running_cpus(void) { CPUState *cpu; @@ -234,34 +230,60 @@ static unsigned s390_count_running_cpus(void) uint8_t state = S390_CPU(cpu)-env.cpu_state; if (state == CPU_STATE_OPERATING || state == CPU_STATE_LOAD) { -nr_running++; +if (!disabled_wait(cpu)) { 
+nr_running++; +} } } return nr_running; } -void s390_add_running_cpu(S390CPU *cpu) +unsigned int s390_cpu_halt(S390CPU *cpu) { CPUState *cs = CPU(cpu); +trace_cpu_halt(cs-cpu_index); -if (cs-halted) { -cpu-env.cpu_state = CPU_STATE_OPERATING; -cs-halted = 0; -cs-exception_index = -1; +if (!cs-halted) { +cs-halted = 1; +cs-exception_index = EXCP_HLT; } + +return s390_count_running_cpus(); } -unsigned s390_del_running_cpu(S390CPU *cpu) +void
[PATCH/RFC 5/5] s390x/kvm: propagate s390 cpu state to kvm
From: David Hildenbrand d...@linux.vnet.ibm.com Let QEMU propagate the cpu state to kvm. If kvm doesn't yet support it, it is silently ignored as kvm will still handle the cpu state itself in that case. The state is not synced back, thus kvm won't have a chance to actively modify the cpu state. To do so, control has to be given back to QEMU (which is already done so in all relevant cases). Setting of the cpu state can fail either because kvm doesn't support the interface yet, or because the state is invalid/not supported. Failed attempts will be traced Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com Reviewed-by: Thomas Huth th...@linux.vnet.ibm.com Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com --- target-s390x/cpu.c | 3 +++ target-s390x/cpu.h | 5 + target-s390x/kvm.c | 33 + trace-events | 1 + 4 files changed, 42 insertions(+) diff --git a/target-s390x/cpu.c b/target-s390x/cpu.c index 1d32f5a..cf62ebd 100644 --- a/target-s390x/cpu.c +++ b/target-s390x/cpu.c @@ -289,6 +289,9 @@ unsigned int s390_cpu_set_state(uint8_t cpu_state, S390CPU *cpu) cpu_state); exit(1); } +if (kvm_enabled() cpu-env.cpu_state != cpu_state) { +kvm_s390_set_cpu_state(cpu, cpu_state); +} cpu-env.cpu_state = cpu_state; return s390_count_running_cpus(); diff --git a/target-s390x/cpu.h b/target-s390x/cpu.h index 892726c..ce4a6ca 100644 --- a/target-s390x/cpu.h +++ b/target-s390x/cpu.h @@ -1088,6 +1088,7 @@ int kvm_s390_assign_subch_ioeventfd(EventNotifier *notifier, uint32_t sch, int vq, bool assign); int kvm_s390_cpu_restart(S390CPU *cpu); void kvm_s390_clear_cmma_callback(void *opaque); +int kvm_s390_set_cpu_state(S390CPU *cpu, uint8_t cpu_state); #else static inline void kvm_s390_io_interrupt(uint16_t subchannel_id, uint16_t subchannel_nr, @@ -1114,6 +1115,10 @@ static inline int kvm_s390_cpu_restart(S390CPU *cpu) static inline void kvm_s390_clear_cmma_callback(void *opaque) { } +static inline int 
kvm_s390_set_cpu_state(S390CPU *cpu, uint8_t cpu_state) +{ +return -ENOSYS; +} #endif static inline void cmma_reset(S390CPU *cpu) diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c index 00125f1..d8ecd0b 100644 --- a/target-s390x/kvm.c +++ b/target-s390x/kvm.c @@ -1289,3 +1289,36 @@ int kvm_s390_assign_subch_ioeventfd(EventNotifier *notifier, uint32_t sch, } return kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, kick); } + +int kvm_s390_set_cpu_state(S390CPU *cpu, uint8_t cpu_state) +{ +struct kvm_mp_state mp_state = {}; +int ret; + +switch (cpu_state) { +case CPU_STATE_STOPPED: +mp_state.mp_state = KVM_MP_STATE_STOPPED; +break; +case CPU_STATE_CHECK_STOP: +mp_state.mp_state = KVM_MP_STATE_CHECK_STOP; +break; +case CPU_STATE_OPERATING: +mp_state.mp_state = KVM_MP_STATE_OPERATING; +break; +case CPU_STATE_LOAD: +mp_state.mp_state = KVM_MP_STATE_LOAD; +break; +default: +error_report(Requested CPU state is not a valid S390 CPU state: %u, + cpu_state); +exit(1); +} + +ret = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MP_STATE, mp_state); +if (ret) { +trace_kvm_failed_cpu_state_set(CPU(cpu)-cpu_index, cpu_state, + strerror(-ret)); +} + +return ret; +} diff --git a/trace-events b/trace-events index c652606..56a7dab 100644 --- a/trace-events +++ b/trace-events @@ -1301,6 +1301,7 @@ mhp_pc_dimm_assigned_address(uint64_t addr) 0x%PRIx64 # target-s390x/kvm.c kvm_enable_cmma(int rc) CMMA: enabling with result code %d kvm_clear_cmma(int rc) CMMA: clearing with result code %d +kvm_failed_cpu_state_set(int cpu_index, uint8_t state, const char *msg) Warning: Unable to set cpu %d state % PRIu8 to KVM: %s # target-s390x/cpu.c cpu_set_state(int cpu_index, uint8_t state) setting cpu %d state to % PRIu8 -- 1.8.4.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH/RFC 5/5] KVM: s390: implement KVM_(S|G)ET_MP_STATE for user space state control
From: David Hildenbrand d...@linux.vnet.ibm.com This patch - adds s390 specific MP states to linux headers and documents them - implements the KVM_{SET,GET}_MP_STATE ioctls - enables KVM_CAP_MP_STATE - allows user space to control the VCPU state on s390. If user space sets the VCPU state using the ioctl KVM_SET_MP_STATE, we can disable manual changing of the VCPU state and trust user space to do the right thing. Signed-off-by: David Hildenbrand d...@linux.vnet.ibm.com Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com Acked-by: Christian Borntraeger borntrae...@de.ibm.com Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com --- Documentation/virtual/kvm/api.txt | 10 -- arch/s390/include/asm/kvm_host.h | 1 + arch/s390/kvm/diag.c | 3 ++- arch/s390/kvm/intercept.c | 3 ++- arch/s390/kvm/kvm-s390.c | 37 + arch/s390/kvm/kvm-s390.h | 6 ++ include/uapi/linux/kvm.h | 4 7 files changed, 56 insertions(+), 8 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 904c61c..a41465b 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -974,7 +974,7 @@ for vm-wide capabilities. 4.38 KVM_GET_MP_STATE Capability: KVM_CAP_MP_STATE -Architectures: x86, ia64 +Architectures: x86, ia64, s390 Type: vcpu ioctl Parameters: struct kvm_mp_state (out) Returns: 0 on success; -1 on error @@ -998,6 +998,12 @@ Possible values are: is waiting for an interrupt [x86, ia64] - KVM_MP_STATE_SIPI_RECEIVED: the vcpu has just received a SIPI (vector accessible via KVM_GET_VCPU_EVENTS) [x86, ia64] + - KVM_MP_STATE_STOPPED: the vcpu is stopped [s390] + - KVM_MP_STATE_CHECK_STOP: the vcpu is in a special error state [s390] + - KVM_MP_STATE_OPERATING: the vcpu is operating (running or halted) + [s390] + - KVM_MP_STATE_LOAD:the vcpu is in a special load/startup state + [s390] On x86 and ia64, this ioctl is only useful after KVM_CREATE_IRQCHIP. 
Without an in-kernel irqchip, the multiprocessing state must be maintained by userspace on @@ -1007,7 +1013,7 @@ these architectures. 4.39 KVM_SET_MP_STATE Capability: KVM_CAP_MP_STATE -Architectures: x86, ia64 +Architectures: x86, ia64, s390 Type: vcpu ioctl Parameters: struct kvm_mp_state (in) Returns: 0 on success; -1 on error diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h index 4181d7b..c2ba020 100644 --- a/arch/s390/include/asm/kvm_host.h +++ b/arch/s390/include/asm/kvm_host.h @@ -418,6 +418,7 @@ struct kvm_arch{ int css_support; int use_irqchip; int use_cmma; + int user_cpu_state_ctrl; struct s390_io_adapter *adapters[MAX_S390_IO_ADAPTERS]; wait_queue_head_t ipte_wq; spinlock_t start_stop_lock; diff --git a/arch/s390/kvm/diag.c b/arch/s390/kvm/diag.c index 0161675..59bd8f9 100644 --- a/arch/s390/kvm/diag.c +++ b/arch/s390/kvm/diag.c @@ -176,7 +176,8 @@ static int __diag_ipl_functions(struct kvm_vcpu *vcpu) return -EOPNOTSUPP; } - kvm_s390_vcpu_stop(vcpu); + if (!kvm_s390_user_cpu_state_ctrl(vcpu-kvm)) + kvm_s390_vcpu_stop(vcpu); vcpu-run-s390_reset_flags |= KVM_S390_RESET_SUBSYSTEM; vcpu-run-s390_reset_flags |= KVM_S390_RESET_IPL; vcpu-run-s390_reset_flags |= KVM_S390_RESET_CPU_INIT; diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c index ac6b325..eaf4629 100644 --- a/arch/s390/kvm/intercept.c +++ b/arch/s390/kvm/intercept.c @@ -73,7 +73,8 @@ static int handle_stop(struct kvm_vcpu *vcpu) return rc; } - kvm_s390_vcpu_stop(vcpu); + if (!kvm_s390_user_cpu_state_ctrl(vcpu-kvm)) + kvm_s390_vcpu_stop(vcpu); return -EOPNOTSUPP; } diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 3428953..fdf88f7 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -167,6 +167,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_DEVICE_CTRL: case KVM_CAP_ENABLE_CAP_VM: case KVM_CAP_VM_ATTRIBUTES: + case KVM_CAP_MP_STATE: r = 1; break; case KVM_CAP_NR_VCPUS: @@ -595,7 +596,8 @@ static 
void kvm_s390_vcpu_initial_reset(struct kvm_vcpu *vcpu) vcpu-arch.sie_block-pp = 0; vcpu-arch.pfault_token = KVM_S390_PFAULT_TOKEN_INVALID; kvm_clear_async_pf_completion_queue(vcpu); - kvm_s390_vcpu_stop(vcpu); + if (!kvm_s390_user_cpu_state_ctrl(vcpu-kvm)) + kvm_s390_vcpu_stop(vcpu); kvm_s390_clear_local_irqs(vcpu); } @@ -980,13 +982,34 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, int
[PATCH/RFC 1/5] update linux headers with cpustate changes
Will do a proper headers sync when ready.

Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 linux-headers/linux/kvm.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index f5d2c38..9b584da 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -399,13 +399,18 @@ struct kvm_vapic_addr {
 	__u64 vapic_addr;
 };

-/* for KVM_SET_MPSTATE */
+/* for KVM_SET_MP_STATE */
+/* not all states are valid on all architectures */
 #define KVM_MP_STATE_RUNNABLE       0
 #define KVM_MP_STATE_UNINITIALIZED  1
 #define KVM_MP_STATE_INIT_RECEIVED  2
 #define KVM_MP_STATE_HALTED         3
 #define KVM_MP_STATE_SIPI_RECEIVED  4
+#define KVM_MP_STATE_STOPPED        5
+#define KVM_MP_STATE_CHECK_STOP     6
+#define KVM_MP_STATE_OPERATING      7
+#define KVM_MP_STATE_LOAD           8

 struct kvm_mp_state {
 	__u32 mp_state;
--
1.8.4.2
Re: [PATCH/RFC 0/5] s390x/kvm: track the logical cpu state in QEMU and propagate it to kvm
This is the qemu part of kernel series "Let user space control the cpu states".

Christian Borntraeger (1):
  update linux headers with cpustate changes

David Hildenbrand (4):
  s390x/kvm: introduce proper states for s390 cpus
  s390x/kvm: proper use of the cpu states OPERATING and STOPPED
  s390x/kvm: test whether a cpu is STOPPED when checking has_work
  s390x/kvm: propagate s390 cpu state to kvm

 hw/s390x/ipl.c            |   2 +-
 hw/s390x/s390-virtio.c    |  32 --
 linux-headers/linux/kvm.h |   7 ++-
 target-s390x/cpu.c        | 106 +++---
 target-s390x/cpu.h        |  33 +--
 target-s390x/helper.c     |  11 ++---
 target-s390x/kvm.c        |  49 ++---
 trace-events              |   6 +++
 8 files changed, 179 insertions(+), 67 deletions(-)

Looks good to me.
Re: [PATCH/RFC 0/5] s390x/kvm: track the logical cpu state in QEMU and propagate it to kvm

> Looks good to me

@all thought it was the final internal review :)
[PATCH] arm/arm64: KVM: Support KVM_CAP_READONLY_MEM
When userspace loads code and data in a read-only memory regions, KVM needs to be able to handle this on arm and arm64. Specifically this is used when running code directly from a read-only flash device; the common scenario is a UEFI blob loaded with the -bios option in QEMU. To avoid looking through the memslots twice and to reuse the hva error checking of gfn_to_hva_prot(), add a new gfn_to_hva_memslot_prot() function and refactor gfn_to_hva_prot() to use this function. Signed-off-by: Christoffer Dall christoffer.d...@linaro.org --- Note that if you want to test this with QEMU, you need to update the uapi headers. You can also grab the branch below from my qemu git tree with the temporary update headers patch applied on top of Peter Maydell's -bios in -M virt support patches: git://git.linaro.org/people/christoffer.dall/qemu-arm.git virt-for-uefi arch/arm/include/uapi/asm/kvm.h | 1 + arch/arm/kvm/arm.c| 1 + arch/arm/kvm/mmu.c| 15 --- arch/arm64/include/uapi/asm/kvm.h | 1 + include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c | 11 +-- 6 files changed, 22 insertions(+), 9 deletions(-) diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h index e6ebdd3..51257fd 100644 --- a/arch/arm/include/uapi/asm/kvm.h +++ b/arch/arm/include/uapi/asm/kvm.h @@ -25,6 +25,7 @@ #define __KVM_HAVE_GUEST_DEBUG #define __KVM_HAVE_IRQ_LINE +#define __KVM_HAVE_READONLY_MEM #define KVM_REG_SIZE(id) \ (1U (((id) KVM_REG_SIZE_MASK) KVM_REG_SIZE_SHIFT)) diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index d7424ef..037adda 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -188,6 +188,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_ONE_REG: case KVM_CAP_ARM_PSCI: case KVM_CAP_ARM_PSCI_0_2: + case KVM_CAP_READONLY_MEM: r = 1; break; case KVM_CAP_COALESCED_MMIO: diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 0f6f642..d606d86 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -745,14 +745,13 @@ static bool 
transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap) } static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, - struct kvm_memory_slot *memslot, + struct kvm_memory_slot *memslot, unsigned long hva, unsigned long fault_status) { int ret; bool write_fault, writable, hugetlb = false, force_pte = false; unsigned long mmu_seq; gfn_t gfn = fault_ipa PAGE_SHIFT; - unsigned long hva = gfn_to_hva(vcpu-kvm, gfn); struct kvm *kvm = vcpu-kvm; struct kvm_mmu_memory_cache *memcache = vcpu-arch.mmu_page_cache; struct vm_area_struct *vma; @@ -861,7 +860,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run) unsigned long fault_status; phys_addr_t fault_ipa; struct kvm_memory_slot *memslot; - bool is_iabt; + unsigned long hva; + bool is_iabt, write_fault, writable; gfn_t gfn; int ret, idx; @@ -882,7 +882,10 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run) idx = srcu_read_lock(vcpu-kvm-srcu); gfn = fault_ipa PAGE_SHIFT; - if (!kvm_is_visible_gfn(vcpu-kvm, gfn)) { + memslot = gfn_to_memslot(vcpu-kvm, gfn); + hva = gfn_to_hva_memslot_prot(memslot, gfn, writable); + write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu)); + if (kvm_is_error_hva(hva) || (write_fault !writable)) { if (is_iabt) { /* Prefetch Abort on I/O address */ kvm_inject_pabt(vcpu, kvm_vcpu_get_hfar(vcpu)); @@ -908,9 +911,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run) goto out_unlock; } - memslot = gfn_to_memslot(vcpu-kvm, gfn); - - ret = user_mem_abort(vcpu, fault_ipa, memslot, fault_status); + ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status); if (ret == 0) ret = 1; out_unlock: diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index e633ff8..f4ec5a6 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -37,6 +37,7 @@ #define __KVM_HAVE_GUEST_DEBUG #define __KVM_HAVE_IRQ_LINE +#define __KVM_HAVE_READONLY_MEM #define KVM_REG_SIZE(id) 
\ (1U (((id) KVM_REG_SIZE_MASK) KVM_REG_SIZE_SHIFT)) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 970c681..8636b7f 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -544,6 +544,8 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); unsigned long gfn_to_hva(struct kvm *kvm, gfn_t
[PATCH v2 09/10] target-arm/kvm.c: better error reporting
From: Alex Bennée a...@bennee.com When we have a problem syncing CP registers between kvm-qemu it's a lot more useful to have the names of the registers in the log than just a random abort() and core dump. Signed-off-by: Alex Bennée alex.ben...@linaro.org --- v2 - less verbose log message - fix checkpatch warnings diff --git a/target-arm/kvm.c b/target-arm/kvm.c index 319784d..72e242d 100644 --- a/target-arm/kvm.c +++ b/target-arm/kvm.c @@ -279,6 +279,16 @@ void kvm_arm_register_device(MemoryRegion *mr, uint64_t devid, uint64_t group, memory_region_ref(kd-mr); } +static void failed_cpreg_operation(ARMCPU *cpu, uint64_t regidx, int ret, + const char *func) +{ +uint32_t cpreg_id = kvm_to_cpreg_id(regidx); +ARMCPRegInfo *cpreg = g_hash_table_lookup(cpu-cp_regs, cpreg_id); +qemu_log_mask(LOG_UNIMP, + %s: failed (%d) KVM reg op %PRIx64 (%s)\n, + func, ret, regidx, cpreg ? cpreg-name : unknown); +} + bool write_kvmstate_to_list(ARMCPU *cpu) { CPUState *cs = CPU(cpu); @@ -309,6 +319,7 @@ bool write_kvmstate_to_list(ARMCPU *cpu) abort(); } if (ret) { +failed_cpreg_operation(cpu, regidx, ret, __func__); ok = false; } } @@ -345,6 +356,7 @@ bool write_list_to_kvmstate(ARMCPU *cpu) * you tried to set a register which is constant with * a different value from what it actually contains. */ +failed_cpreg_operation(cpu, regidx, ret, __func__); ok = false; } } -- 2.0.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 10/10] target-arm/kvm: make reg sync code common between kvm32/64
Before we launch a guest we query KVM for the list of co-processor registers it knows about which is used later for save/restore of machine state. The logic is identical for both 32-bit and 64-bit so I've moved it all into the common code and simplified the exit paths (as failure = exit). This list may well have more registers than are known by the TCG emulation which is not necessarily a problem but it does stop us from migrating between KVM and TCG hosted guests. I've added some additional checking to report those registers under -d unimp. Signed-off-by: Alex Bennée alex.ben...@linaro.org diff --git a/target-arm/kvm.c b/target-arm/kvm.c index 72e242d..a2895dc 100644 --- a/target-arm/kvm.c +++ b/target-arm/kvm.c @@ -21,6 +21,7 @@ #include sysemu/kvm.h #include kvm_arm.h #include cpu.h +#include internals.h #include hw/arm/arm.h const KVMCapabilityInfo kvm_arch_required_capabilities[] = { @@ -289,6 +290,130 @@ static void failed_cpreg_operation(ARMCPU *cpu, uint64_t regidx, int ret, func, ret, regidx, cpreg ? cpreg-name : unknown); } +static int compare_u64(const void *a, const void *b) +{ +if (*(uint64_t *)a *(uint64_t *)b) { +return 1; +} +if (*(uint64_t *)a *(uint64_t *)b) { +return -1; +} +return 0; +} + +static bool reg_syncs_via_tuple_list(uint64_t regidx) +{ +/* Return true if the regidx is a register we should synchronize + * via the cpreg_tuples array (ie is not a core reg we sync by + * hand in kvm_arch_get/put_registers()) + */ +switch (regidx KVM_REG_ARM_COPROC_MASK) { +case KVM_REG_ARM_CORE: +#ifdef KVM_REG_ARM_VFP +case KVM_REG_ARM_VFP: +#endif +return false; +default: +return true; +} +} + +/* + * Fetch a list of registers from KVM that we will need to be able to + * migrate the state. These registers may or may not map onto real + * hardware registers but either way QEMU uses the KVM_GET/SET_ONE_REG + * api to copy their state back and forth when required. 
+ * + * For migration between KVM and TCG both models need to understand + * the same set of registers. + * + * If we exit due to failure we would leak memory but we'll be exiting + * anyway so the return path is kept simple. + */ +bool kvm_arm_sync_register_list(CPUState *cs) +{ +struct kvm_reg_list rl; +struct kvm_reg_list *rlp; +int i, j, ret, arraylen; +ARMCPU *cpu = ARM_CPU(cs); + +/* Populate the cpreg list based on the kernel's idea + * of what registers exist (and throw away the TCG-created list). + */ +rl.n = 0; +ret = kvm_vcpu_ioctl(cs, KVM_GET_REG_LIST, rl); +if (ret != -E2BIG) { +return FALSE; +} + +rlp = g_malloc(sizeof(struct kvm_reg_list) + (rl.n * sizeof(uint64_t))); +rlp-n = rl.n; +ret = kvm_vcpu_ioctl(cs, KVM_GET_REG_LIST, rlp); +if (ret) { +fprintf(stderr, %s: failed to get register list\n, __func__); +return FALSE; +} +/* Sort the list we get back from the kernel, since cpreg_tuples + * must be in strictly ascending order. + */ +qsort(rlp-reg, rlp-n, sizeof(rlp-reg[0]), compare_u64); + +/* Count how many of these registers we'll actually sync through + * the cpreg_indexes mechanism and overwrite the existing TCG + * built array of registers. 
+ */ +for (i = 0, arraylen = 0; i rlp-n; i++) { +uint64_t regidx = rlp-reg[i]; +if (reg_syncs_via_tuple_list(regidx)) { +gboolean found = FALSE; +arraylen++; +for (j = 0; j cpu-cpreg_array_len; j++) { +if (regidx == cpu-cpreg_indexes[j]) { +found = TRUE; +break; +} +} +if (!found) { +qemu_log_mask(LOG_UNIMP, + %s: TCG missing definition of %PRIx64\n, + __func__, regidx); +} +} +} + +cpu-cpreg_indexes = g_renew(uint64_t, cpu-cpreg_indexes, arraylen); +cpu-cpreg_values = g_renew(uint64_t, cpu-cpreg_values, arraylen); +cpu-cpreg_vmstate_indexes = g_renew(uint64_t, cpu-cpreg_vmstate_indexes, + arraylen); +cpu-cpreg_vmstate_values = g_renew(uint64_t, cpu-cpreg_vmstate_values, +arraylen); +cpu-cpreg_array_len = arraylen; +cpu-cpreg_vmstate_array_len = arraylen; + +for (i = 0, arraylen = 0; i rlp-n; i++) { +uint64_t regidx = rlp-reg[i]; +if (!reg_syncs_via_tuple_list(regidx)) { +continue; +} +switch (regidx KVM_REG_SIZE_MASK) { +case KVM_REG_SIZE_U32: +case KVM_REG_SIZE_U64: +break; +default: +fprintf(stderr, +%s: un-handled register size (%PRIx64) in kernel list\n, +__func__, regidx); +return FALSE; +} +cpu-cpreg_indexes[arraylen] =
[PATCH 0/2] KVM: PPC: Fix Mac OS X guests on non-Apple hosts
While trying to get Mac-on-Linux to work on a POWER7 box I stumbled over two issues in our book3s_32 MMU emulation. With these issues fixed and a hack to disable magic page mapping (very hard to get right with 64k pages in this case) I can successfully run Mac OS X guests on a POWER7 host.

Alex

Alexander Graf (2):
  KVM: PPC: Deflect page write faults properly in kvmppc_st
  KVM: PPC: Book3S: Stop PTE lookup on write errors

 arch/powerpc/kvm/book3s.c        | 6 ++++--
 arch/powerpc/kvm/book3s_32_mmu.c | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

--
1.8.1.4
[PATCH 1/2] KVM: PPC: Deflect page write faults properly in kvmppc_st
When we have a page that we're not allowed to write to, xlate() will already tell us -EPERM on lookup of that page. With the code as is we change it into a "page missing" error which a guest may get confused about. Instead, just tell the caller about the -EPERM directly. This fixes Mac OS X guests when run with DCBZ32 emulation.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index bd75902..9624c56 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -418,11 +418,13 @@ int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr,
 	      bool data)
 {
 	struct kvmppc_pte pte;
+	int r;
 
 	vcpu->stat.st++;
 
-	if (kvmppc_xlate(vcpu, *eaddr, data, true, &pte))
-		return -ENOENT;
+	r = kvmppc_xlate(vcpu, *eaddr, data, true, &pte);
+	if (r < 0)
+		return r;
 
 	*eaddr = pte.raddr;
-- 
1.8.1.4
[PATCH 2/2] KVM: PPC: Book3S: Stop PTE lookup on write errors
When a page lookup failed because we're not allowed to write to the page, we should not overwrite that value with another lookup on the second PTEG which will return "page not found". Instead, we should just tell the caller that we had a permission problem. This fixes Mac OS X guests looping endlessly in page lookup code for me.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_32_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c
index 93503bb..cd0b073 100644
--- a/arch/powerpc/kvm/book3s_32_mmu.c
+++ b/arch/powerpc/kvm/book3s_32_mmu.c
@@ -335,7 +335,7 @@ static int kvmppc_mmu_book3s_32_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 	if (r < 0)
 		r = kvmppc_mmu_book3s_32_xlate_pte(vcpu, eaddr, pte,
 						   data, iswrite, true);
-	if (r < 0)
+	if (r == -ENOENT)
 		r = kvmppc_mmu_book3s_32_xlate_pte(vcpu, eaddr, pte,
 						   data, iswrite, false);
-- 
1.8.1.4
[PATCH V5 1/2] perf ignore LBR and extra_regs
From: Kan Liang kan.li...@intel.com

x86, perf: Protect LBR and extra_regs against KVM lying

With -cpu host, KVM reports LBR and extra_regs support if the host has support. When the guest perf driver then tries to access the LBR or extra_regs MSRs, every such access #GPs, since KVM doesn't actually handle LBR and extra_regs emulation. So check access to the related MSRs once at initialization time to avoid the erroneous accesses at runtime.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y (for host kernel), and CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel). Start the guest with -cpu host. Run perf record with --branch-any or --branch-filter in guest to trigger LBR #GP. Run perf stat offcore events (e.g. LLC-loads/LLC-load-misses ...) in guest to trigger offcore_rsp #GP.

Signed-off-by: Kan Liang kan.li...@intel.com

V2: Move the check code to initialization time.
V3: Add flag for each extra register. Check all LBR MSRs at initialization time.
V4: Remove lbr_msr_access. For LBR msr, simply set lbr_nr to 0 if check_msr failed. Disable all extra msrs in creation places if check_msr failed.
V5: Fix check_msr broken
    Don't check any more MSRs after the first fail
    Return error when checking fail to stop creating the event
    Remove the checking code path which never get
---
 arch/x86/kernel/cpu/perf_event.c       |  3 +++
 arch/x86/kernel/cpu/perf_event.h       | 45 ++++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel.c | 38 +++++++++++++++++++++++++++-
 3 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 2bdfbff..a7c5e4b 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
 			continue;
 		if (event->attr.config1 & ~er->valid_mask)
 			return -EINVAL;
+		/* Check if the extra msrs can be safely accessed */
+		if (!x86_pmu.extra_msr_access[er->idx])
+			return -EFAULT;
 
 		reg->idx = er->idx;
 		reg->config = event->attr.config1;
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3b2f9bd..992c678 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -464,6 +464,12 @@ struct x86_pmu {
 	 */
 	struct extra_reg *extra_regs;
 	unsigned int er_flags;
+	/*
+	 * EXTRA REG MSR can be accessed
+	 * The extra registers are completely unrelated to each other.
+	 * So it needs a flag for each extra register.
+	 */
+	bool extra_msr_access[EXTRA_REG_MAX];
 
 	/*
 	 * Intel host/guest support (KVM)
@@ -525,6 +531,45 @@ extern u64 __read_mostly hw_cache_extra_regs
 				[PERF_COUNT_HW_CACHE_OP_MAX]
 				[PERF_COUNT_HW_CACHE_RESULT_MAX];
 
+/*
+ * Under certain circumstances, access certain MSR may cause #GP.
+ * The function tests if the input MSR can be safely accessed.
+ */
+static inline bool check_msr(unsigned long msr)
+{
+	u64 val_old, val_new, val_tmp;
+
+	/*
+	 * Read the current value, change it and read it back to see if it
+	 * matches, this is needed to detect certain hardware emulators
+	 * (qemu/kvm) that don't trap on the MSR access and always return 0s.
+	 */
+	if (rdmsrl_safe(msr, &val_old))
+		goto msr_fail;
+	/*
+	 * Only change it slightly,
+	 * since the higher bits of some MSRs cannot be updated by wrmsrl.
+	 * E.g. MSR_LBR_TOS
+	 */
+	val_tmp = val_old ^ 0x3UL;
+	if (wrmsrl_safe(msr, val_tmp) ||
+	    rdmsrl_safe(msr, &val_new))
+		goto msr_fail;
+
+	if (val_new != val_tmp)
+		goto msr_fail;
+
+	/*
+	 * Here it's sure that the MSR can be safely accessed.
+	 * Restore the old value and return.
+	 */
+	wrmsrl(msr, val_old);
+
+	return true;
+
+msr_fail:
+	return false;
+}
+
 u64 x86_perf_event_update(struct perf_event *event);
 
 static inline unsigned int x86_pmu_config_addr(int index)
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..953b2b2 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2262,7 +2262,8 @@ __init int intel_pmu_init(void)
 	union cpuid10_ebx ebx;
 	struct event_constraint *c;
 	unsigned int unused;
-	int version;
+	int version, i;
+	struct extra_reg *er;
 
 	if (!cpu_has(boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
 		switch (boot_cpu_data.x86) {
@@ -2565,6 +2566,41 @@ __init int intel_pmu_init(void)
 		}
 	}
 
+	/*
+	 * Access LBR MSR may cause #GP
[PATCH V5 2/2] kvm: ignore LBR and extra_reg
From: Kan Liang kan.li...@intel.com

With -cpu host, KVM reports LBR and extra_regs support, so the perf driver may access the LBR and extra_regs MSRs. However, there is no LBR and extra_regs virtualization support yet. This could cause the guest to crash. As a workaround, KVM simply ignores the LBR and extra_regs MSRs to lie to the guest.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y (for host kernel), and CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel). Start the guest with -cpu host. Run perf record with --branch-any or --branch-filter in guest to trigger LBR #GP. Run perf stat offcore events (e.g. LLC-loads/LLC-load-misses ...) in guest to trigger offcore_rsp #GP.

Signed-off-by: Andi Kleen a...@linux.intel.com
Signed-off-by: Kan Liang kan.li...@intel.com

V3: add MSR_LBR_TOS
V4: add MSR_LBR_SELECT and MSR_PEBS_LD_LAT_THRESHOLD
V5: set_msr should return 0 to lie to the guest
---
 arch/x86/kvm/pmu.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index cbecaa9..5fd5b44 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -331,6 +331,18 @@ bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr)
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
 		ret = pmu->version > 1;
 		break;
+	case MSR_OFFCORE_RSP_0:
+	case MSR_OFFCORE_RSP_1:
+	case MSR_LBR_SELECT:
+	case MSR_PEBS_LD_LAT_THRESHOLD:
+	case MSR_LBR_TOS:
+	/* At most 8-deep LBR for core and atom */
+	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+	/* 16-deep LBR for core i3/i5/i7 series processors */
+	case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+	case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+		return 1; /* to avoid crashes */
 	default:
 		ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
 			get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0)
@@ -358,6 +370,19 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
 	case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
 		*data = pmu->global_ovf_ctrl;
 		return 0;
+	case MSR_OFFCORE_RSP_0:
+	case MSR_OFFCORE_RSP_1:
+	case MSR_LBR_SELECT:
+	case MSR_PEBS_LD_LAT_THRESHOLD:
+	case MSR_LBR_TOS:
+	/* At most 8-deep LBR for core and atom */
+	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+	/* 16-deep LBR for core i3/i5/i7 series processors */
+	case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+	case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+		*data = 0;
+		return 0;
 	default:
 		if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
 		    (pmc = get_fixed_pmc(pmu, index))) {
@@ -409,6 +434,19 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 0;
 		}
 		break;
+	case MSR_OFFCORE_RSP_0:
+	case MSR_OFFCORE_RSP_1:
+	case MSR_LBR_SELECT:
+	case MSR_PEBS_LD_LAT_THRESHOLD:
+	case MSR_LBR_TOS:
+	/* At most 8-deep LBR for core and atom */
+	case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+	/* 16-deep LBR for core i3/i5/i7 series processors */
+	case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+	case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+		/* dummy for now */
+		return 0;
 	default:
 		if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
 		    (pmc = get_fixed_pmc(pmu, index))) {
-- 
1.8.3.1
Re: [PATCH] arm64: fix VTTBR_BADDR_MASK
On Wed, Jul 09, 2014 at 11:17:04AM -0500, Joel Schopp wrote:
> The current calculation for VTTBR_BADDR_MASK masks only 39 bits and not all 40 bits. That last bit is important as some systems allocate from near the top of the available address space. This patch is necessary to run KVM on an aarch64 SOC I have been testing.
>
> Signed-off-by: Joel Schopp joel.sch...@amd.com
> ---
>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 3d69030..b39e93f 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -148,7 +148,7 @@
>  #endif
>
>  #define VTTBR_BADDR_SHIFT (VTTBR_X - 1)
> -#define VTTBR_BADDR_MASK  (((1LLU << (40 - VTTBR_X)) - 1) << VTTBR_BADDR_SHIFT)
> +#define VTTBR_BADDR_MASK  (0xffffffffffLLU) /* bits 0-39 */
>  #define VTTBR_VMID_SHIFT  (48LLU)
>  #define VTTBR_VMID_MASK   (0xffLLU << VTTBR_VMID_SHIFT)

While this is obviously fixing a bug, it doesn't feel like the right short-term fix. I'll have to go back and read the definitions of x in BADDR[47:x] for VTTBR_EL2 exactly again, but the intended use of VTTBR_BADDR_MASK (and the only occurrence of it in C-code) is to deal with alignment of the allocated pgd. So while all the MSB in the physical address range of your system should be used, we still need to check that the pgd we allocate is actually aligned properly, because writing any bits in [x-1:0] as non-zero can potentially cause unpredictable behavior and the translation walk results can be corrupted.

So, you do need to use a proper shift. I'm fine with fixing it for the assumed 40-bit physical address space for now, but you need to figure out the right value of 'x' for BADDR[47:x], and you should change the code in update_vttbr() as part of this patch, something like BUG_ON(pgd_phys & ~VTTBR_BADDR_MASK), and make sure the pgd is allocated at the right alignment, or ensure that the PGD is large enough that we can just mask off the lower bits and still contain the required number of entries.

Thanks,
-Christoffer
Re: [PATCH] arm64: fix VTTBR_BADDR_MASK
On 07/10/2014 03:25 PM, Christoffer Dall wrote:
> On Wed, Jul 09, 2014 at 11:17:04AM -0500, Joel Schopp wrote:
>> The current calculation for VTTBR_BADDR_MASK masks only 39 bits and not all 40 bits. That last bit is important as some systems allocate from near the top of the available address space. This patch is necessary to run KVM on an aarch64 SOC I have been testing.
>>
>> Signed-off-by: Joel Schopp joel.sch...@amd.com
>> ---
>>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>> index 3d69030..b39e93f 100644
>> --- a/arch/arm64/include/asm/kvm_arm.h
>> +++ b/arch/arm64/include/asm/kvm_arm.h
>> @@ -148,7 +148,7 @@
>>  #endif
>>
>>  #define VTTBR_BADDR_SHIFT (VTTBR_X - 1)
>> -#define VTTBR_BADDR_MASK  (((1LLU << (40 - VTTBR_X)) - 1) << VTTBR_BADDR_SHIFT)
>> +#define VTTBR_BADDR_MASK  (0xffffffffffLLU) /* bits 0-39 */
>>  #define VTTBR_VMID_SHIFT  (48LLU)
>>  #define VTTBR_VMID_MASK   (0xffLLU << VTTBR_VMID_SHIFT)
>
> While this is obviously fixing a bug, it doesn't feel like the right short-term fix. I'll have to go back and read the definitions of x in BADDR[47:x] for VTTBR_EL2 exactly again, but the intended use of VTTBR_BADDR_MASK (and the only occurrence of it in C-code) is to deal with alignment of the allocated pgd.

I think there is some confusion. Before the patch VTTBR_BADDR_MASK always evaluated to 0x7fffffffffLLU; after the change it always evaluates to 0xffffffffffLLU. Neither before nor after the patch is it dealing with alignment. Any bits it throws away (bits 40-47) are most significant, not least significant. I could have rewritten the macro like:

#define VTTBR_BADDR_MASK (((1LLU << (40 - VTTBR_X + 1)) - 1) << VTTBR_BADDR_SHIFT)

to correct the bug but it's my opinion that the existing code is quite obfuscated, which is how the bug happened in the first place. It seemed easier to just actually mask the bits in a straightforward and easy to understand manner.
I even added a comment so nobody has to count the fs ;)
Re: [PATCH] arm64: fix VTTBR_BADDR_MASK
On 07/10/2014 04:02 PM, Joel Schopp wrote:
> [...]
> I could have rewritten the macro like:
>
> #define VTTBR_BADDR_MASK (((1LLU << (40 - VTTBR_X + 1)) - 1) << VTTBR_BADDR_SHIFT)
>
> to correct the bug but it's my opinion that the existing code is quite obfuscated which is how the bug happened in the first place. It seemed easier to just actually mask the bits in a straightforward and easy to understand manner.
> I even added a comment so nobody has to count the fs ;)

I hate to reply to my own email correcting myself. But you were correct. I will fix and resend a v2.
[PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation
Current implementation of the helper function pci_vfs_assigned() is a little complex: to get the number of VFs assigned to a VM, it accesses a low-level configuration space register and then loops traversing the device tree. This patch introduces an atomic reference counter for VFs assigned to a VM in struct pci_sriov, and composes two more helper functions, pci_sriov_assign_device() and pci_sriov_deassign_device(), to replace direct manipulation of the device flag and meanwhile increase and decrease the counter.

Passed building on 3.15.5

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 drivers/pci/iov.c   | 65 ++++++++++++++++++++++++++++-----------------------
 drivers/pci/pci.h   |  1 +
 include/linux/pci.h |  4 +++
 3 files changed, 41 insertions(+), 29 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index de7a747..72e267f 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -382,6 +382,7 @@ found:
 	iov->nres = nres;
 	iov->ctrl = ctrl;
 	iov->total_VFs = total;
+	atomic_set(&iov->VFs_assigned_cnt, 0);
 	iov->offset = offset;
 	iov->stride = stride;
 	iov->pgsz = pgsz;
@@ -603,45 +604,51 @@ int pci_num_vf(struct pci_dev *dev)
 EXPORT_SYMBOL_GPL(pci_num_vf);
 
 /**
- * pci_vfs_assigned - returns number of VFs are assigned to a guest
- * @dev: the PCI device
+ * pci_vfs_assigned - returns number of VFs are assigned to VM
+ * @dev: the physical PCI device that contains the VFs.
  *
- * Returns number of VFs belonging to this device that are assigned to a guest.
+ * Returns number of VFs belonging to this device that are assigned to VM.
  * If device is not a physical function returns 0.
  */
 int pci_vfs_assigned(struct pci_dev *dev)
 {
-	struct pci_dev *vfdev;
-	unsigned int vfs_assigned = 0;
-	unsigned short dev_id;
-
-	/* only search if we are a PF */
 	if (!dev->is_physfn)
 		return 0;
+	if (dev->sriov)
+		return atomic_read(&dev->sriov->VFs_assigned_cnt);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_vfs_assigned);
 
-	/*
-	 * determine the device ID for the VFs, the vendor ID will be the
-	 * same as the PF so there is no need to check for that one
-	 */
-	pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
-
-	/* loop through all the VFs to see if we own any that are assigned */
-	vfdev = pci_get_device(dev->vendor, dev_id, NULL);
-	while (vfdev) {
-		/*
-		 * It is considered assigned if it is a virtual function with
-		 * our dev as the physical function and the assigned bit is set
-		 */
-		if (vfdev->is_virtfn && (vfdev->physfn == dev) &&
-		    (vfdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED))
-			vfs_assigned++;
-
-		vfdev = pci_get_device(dev->vendor, dev_id, vfdev);
-	}
+/**
+ * pci_sriov_assign_device - assign device to VM
+ * @pdev: the device to be assigned.
+ */
+void pci_sriov_assign_device(struct pci_dev *pdev)
+{
+	pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
+	if (pdev->is_virtfn && !pdev->is_physfn)
+		if (pdev->physfn)
+			if (pdev->physfn->sriov)
+				atomic_inc(&pdev->physfn->sriov->
+					   VFs_assigned_cnt);
+}
+EXPORT_SYMBOL_GPL(pci_sriov_assign_device);
 
-	return vfs_assigned;
+/**
+ * pci_sriov_deassign_device - deassign device from VM
+ * @pdev: the device to be deassigned.
+ */
+void pci_sriov_deassign_device(struct pci_dev *pdev)
+{
+	pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+	if (pdev->is_virtfn && !pdev->is_physfn)
+		if (pdev->physfn)
+			if (pdev->physfn->sriov)
+				atomic_dec(&pdev->physfn->sriov->
+					   VFs_assigned_cnt);
 }
-EXPORT_SYMBOL_GPL(pci_vfs_assigned);
+EXPORT_SYMBOL_GPL(pci_sriov_deassign_device);
 
 /**
  * pci_sriov_set_totalvfs -- reduce the TotalVFs available
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6bd0822..d17bda2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -235,6 +235,7 @@ struct pci_sriov {
 	u32 pgsz;		/* page size for BAR alignment */
 	u8 link;		/* Function Dependency Link */
 	u16 driver_max_VFs;	/* max num VFs driver supports */
+	atomic_t VFs_assigned_cnt;	/* counter of VFs assigned to VM */
 	struct pci_dev *dev;	/* lowest numbered PF */
 	struct pci_dev *self;	/* this PF */
 	struct mutex lock;	/* lock for VF bus */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index aab57b4..5cf6833 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1601,6 +1601,8 @@ int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
 void pci_disable_sriov(struct pci_dev *dev);
 int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
[PATCH 2/4] xen-pciback: use PCI VFs assignment helper functions
New VFs reference counter mechanism and VFs assignment helper functions are introduced to PCI SRIOV; use them instead of manipulating the device flag directly.

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 drivers/xen/xen-pciback/pci_stub.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 62fcd48..1d7b747 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -133,7 +133,7 @@ static void pcistub_device_release(struct kref *kref)
 	xen_pcibk_config_free_dyn_fields(dev);
 	xen_pcibk_config_free_dev(dev);
 
-	dev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+	pci_sriov_deassign_device(dev);
 	pci_dev_put(dev);
 
 	kfree(psdev);
@@ -404,7 +404,7 @@ static int pcistub_init_device(struct pci_dev *dev)
 	dev_dbg(&dev->dev, "reset device\n");
 	xen_pcibk_reset_device(dev);
 
-	dev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
+	pci_sriov_assign_device(dev);
 	return 0;
 
 config_release:
-- 
1.7.1
[PATCH 4/4] i40e: use PCI VFs assignment helper function simplify i40e_vfs_are_assigned()
New VFs reference counter mechanism and VFs assignment helper functions are introduced to PCI SRIOV; use them instead of manipulating the device flag directly.

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 17 ++---------------
 1 files changed, 2 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 02c11a7..781040e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -693,22 +693,9 @@ complete_reset:
 static bool i40e_vfs_are_assigned(struct i40e_pf *pf)
 {
 	struct pci_dev *pdev = pf->pdev;
-	struct pci_dev *vfdev;
-
-	/* loop through all the VFs to see if we own any that are assigned */
-	vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, I40E_DEV_ID_VF, NULL);
-	while (vfdev) {
-		/* if we don't own it we don't care */
-		if (vfdev->is_virtfn && pci_physfn(vfdev) == pdev) {
-			/* if it is assigned we cannot release it */
-			if (vfdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED)
-				return true;
-		}
-
-		vfdev = pci_get_device(PCI_VENDOR_ID_INTEL,
-				       I40E_DEV_ID_VF,
-				       vfdev);
-	}
 
+	if (pci_vfs_assigned(pdev))
+		return true;
 	return false;
 }
-- 
1.7.1
[PATCH 3/4] KVM: use PCI VFs assignment helper functions
New VFs reference counter mechanism and VFs assignment helper functions are introduced to PCI SRIOV; use them instead of manipulating the device flag directly.

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 virt/kvm/assigned-dev.c | 2 +-
 virt/kvm/iommu.c        | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index bf06577..80b477f 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -302,7 +302,7 @@ static void kvm_free_assigned_device(struct kvm *kvm,
 	else
 		pci_restore_state(assigned_dev->dev);
 
-	assigned_dev->dev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+	pci_sriov_deassign_device(assigned_dev->dev);
 
 	pci_release_regions(assigned_dev->dev);
 	pci_disable_device(assigned_dev->dev);
diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index 0df7d4b..6d3f5ef 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -194,7 +194,7 @@ int kvm_assign_device(struct kvm *kvm,
 		goto out_unmap;
 	}
 
-	pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
+	pci_sriov_assign_device(pdev);
 
 	dev_info(&pdev->dev, "kvm assign device\n");
 
@@ -220,7 +220,7 @@ int kvm_deassign_device(struct kvm *kvm,
 
 	iommu_detach_device(domain, &pdev->dev);
 
-	pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+	pci_sriov_deassign_device(pdev);
 
 	dev_info(&pdev->dev, "kvm deassign device\n");
-- 
1.7.1
[PATCH] KVM: PPC: Book3S: Add hack for split real mode
Today we handle split real mode by mapping both instruction and data faults into a special virtual address space that only exists during the split mode phase. This is good enough to catch 32bit Linux guests that use split real mode for copy_from/to_user. In this case we're always prefixed with 0xc0000000 for our instruction pointer and can map the user space process freely below there. However, that approach fails when we're running KVM inside of KVM. Here the 1st level last_inst reader may well be in the same virtual page as a 2nd level interrupt handler. It also fails when running Mac OS X guests. Here we have a 4G/4G split, so a kernel copy_from/to_user implementation can easily overlap with user space addresses. The architecturally correct way to fix this would be to implement an instruction interpreter in KVM that kicks in whenever we go into split real mode. This interpreter however would not receive a great amount of testing and be a lot of bloat for a reasonably isolated corner case. So I went back to the drawing board and tried to come up with a way to make split real mode work with a single flat address space. And then I realized that we could get away with the same trick that makes it work for Linux: Whenever we see an instruction address during split real mode that may collide, we just move it higher up the virtual address space to a place that hopefully does not collide (keep your fingers crossed!). That approach does work surprisingly well. I am able to successfully run Mac OS X guests with KVM and QEMU (no split real mode hacks like MOL) when I apply a tiny timing probe hack to QEMU. I'd say this is a win over even more broken split real mode :).
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/kvm_asm.h    |  1 +
 arch/powerpc/include/asm/kvm_book3s.h |  3 +++
 arch/powerpc/kvm/book3s.c             | 20 ++++++++++++++
 arch/powerpc/kvm/book3s_pr.c          | 48 +++++++++++++++++++++++++++++++
 4 files changed, 72 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h
index 9601741..3f3e530 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -131,6 +131,7 @@
 #define BOOK3S_HFLAG_NATIVE_PS		0x8
 #define BOOK3S_HFLAG_MULTI_PGSIZE	0x10
 #define BOOK3S_HFLAG_NEW_TLBIE		0x20
+#define BOOK3S_HFLAG_SPLIT_HACK		0x40
 
 #define RESUME_FLAG_NV		(1<<0)	/* Reload guest nonvolatile state? */
 #define RESUME_FLAG_HOST	(1<<1)	/* Resume host? */
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 8ac5392..cb7e661 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -324,4 +324,7 @@ static inline bool is_kvmppc_resume_guest(int r)
 /* LPIDs we support with this build -- runtime limit may be lower */
 #define KVMPPC_NR_LPIDS			(LPID_RSVD + 1)
 
+#define SPLIT_HACK_MASK			0xfff0
+#define SPLIT_HACK_OFFS			0xfe20
+
 #endif /* __ASM_KVM_BOOK3S_H__ */
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 9624c56..98fc18d 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -72,6 +72,17 @@ void kvmppc_core_load_guest_debugstate(struct kvm_vcpu *vcpu)
 {
 }
 
+void kvmppc_unfixup_split_real(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) {
+		ulong pc = kvmppc_get_pc(vcpu);
+		if ((pc & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
+			kvmppc_set_pc(vcpu, pc & ~SPLIT_HACK_MASK);
+		vcpu->arch.hflags &= ~BOOK3S_HFLAG_SPLIT_HACK;
+	}
+}
+EXPORT_SYMBOL_GPL(kvmppc_unfixup_split_real);
+
 static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu *vcpu)
 {
 	if (!is_kvmppc_hv_enabled(vcpu->kvm))
@@ -118,6 +129,7 @@ static inline bool kvmppc_critical_section(struct kvm_vcpu *vcpu)
 
 void kvmppc_inject_interrupt(struct kvm_vcpu *vcpu, int vec, u64 flags)
 {
+	kvmppc_unfixup_split_real(vcpu);
 	kvmppc_set_srr0(vcpu, kvmppc_get_pc(vcpu));
 	kvmppc_set_srr1(vcpu, kvmppc_get_msr(vcpu) | flags);
 	kvmppc_set_pc(vcpu, kvmppc_interrupt_offset(vcpu) + vec);
@@ -384,6 +396,14 @@ static int kvmppc_xlate(struct kvm_vcpu *vcpu, ulong eaddr, bool data,
 		pte->may_write = true;
 		pte->may_execute = true;
 		r = 0;
+
+		if ((kvmppc_get_msr(vcpu) & (MSR_IR | MSR_DR)) == MSR_DR &&
+		    !data) {
+			ulong pc = kvmppc_get_pc(vcpu);
+			if ((vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) &&
+			    ((eaddr & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS))
+				pte->raddr &= ~SPLIT_HACK_MASK;
+		}
 	}
 
 	return r;
diff --git a/arch/powerpc/kvm/book3s_pr.c
Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation
Since there's no 0th patch, I guess I'll comment here. This series is not bisectable; patch 1 breaks the existing implementation. I'd suggest:

patch 1 - fix i40e
patch 2 - create assign/deassign that uses dev_flags
patch 3 - convert users to new interface
patch 4 - convert interface to use atomic_t

IMHO, the assigned flag is a horrible hack and I don't know why drivers like xenback need to use it. KVM needs to use it because it doesn't actually have a driver to bind to when a device is assigned; it's happy to assign devices without any driver.

Thanks, Alex

On Fri, 2014-07-11 at 08:47 +0800, Ethan Zhao wrote:
> The current implementation of the helper function pci_vfs_assigned() is a little complex: to get the number of VFs assigned to a VM, it accesses a low-level configuration space register and then loops while traversing the device tree.
>
> This patch introduces an atomic reference counter in struct pci_sriov for VFs that are assigned to a VM, and adds two more helper functions, pci_sriov_assign_device() and pci_sriov_deassign_device(), to replace direct manipulation of the device flag while incrementing and decrementing the counter.
>
> Passed building on 3.15.5
>
> Signed-off-by: Ethan Zhao ethan.z...@oracle.com
> ---
>  drivers/pci/iov.c   | 65 ++++++++++++++++++++++++++-----------------
>  drivers/pci/pci.h   |  1 +
>  include/linux/pci.h |  4 +++
>  3 files changed, 41 insertions(+), 29 deletions(-)
>
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index de7a747..72e267f 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -382,6 +382,7 @@ found:
>  	iov->nres = nres;
>  	iov->ctrl = ctrl;
>  	iov->total_VFs = total;
> +	atomic_set(&iov->VFs_assigned_cnt, 0);
>  	iov->offset = offset;
>  	iov->stride = stride;
>  	iov->pgsz = pgsz;
> @@ -603,45 +604,51 @@ int pci_num_vf(struct pci_dev *dev)
>  EXPORT_SYMBOL_GPL(pci_num_vf);
>
>  /**
> - * pci_vfs_assigned - returns number of VFs are assigned to a guest
> - * @dev: the PCI device
> + * pci_vfs_assigned - returns number of VFs that are assigned to a VM
> + * @dev: the physical PCI device that contains the VFs.
>   *
> - * Returns number of VFs belonging to this device that are assigned to a guest.
> + * Returns number of VFs belonging to this device that are assigned to a VM.
>   * If device is not a physical function returns 0.
>   */
>  int pci_vfs_assigned(struct pci_dev *dev)
>  {
> -	struct pci_dev *vfdev;
> -	unsigned int vfs_assigned = 0;
> -	unsigned short dev_id;
> -
> -	/* only search if we are a PF */
>  	if (!dev->is_physfn)
>  		return 0;
> +	if (dev->sriov)
> +		return atomic_read(&dev->sriov->VFs_assigned_cnt);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_vfs_assigned);
>
> -	/*
> -	 * determine the device ID for the VFs, the vendor ID will be the
> -	 * same as the PF so there is no need to check for that one
> -	 */
> -	pci_read_config_word(dev, dev->sriov->pos + PCI_SRIOV_VF_DID, &dev_id);
> -
> -	/* loop through all the VFs to see if we own any that are assigned */
> -	vfdev = pci_get_device(dev->vendor, dev_id, NULL);
> -	while (vfdev) {
> -		/*
> -		 * It is considered assigned if it is a virtual function with
> -		 * our dev as the physical function and the assigned bit is set
> -		 */
> -		if (vfdev->is_virtfn && (vfdev->physfn == dev) &&
> -		    (vfdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED))
> -			vfs_assigned++;
> -
> -		vfdev = pci_get_device(dev->vendor, dev_id, vfdev);
> -	}
> +/**
> + * pci_sriov_assign_device - assign device to VM
> + * @pdev: the device to be assigned.
> + */
> +void pci_sriov_assign_device(struct pci_dev *pdev)
> +{
> +	pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
> +	if (pdev->is_virtfn && !pdev->is_physfn)
> +		if (pdev->physfn)
> +			if (pdev->physfn->sriov)
> +				atomic_inc(&pdev->physfn->sriov->VFs_assigned_cnt);
> +}
> +EXPORT_SYMBOL_GPL(pci_sriov_assign_device);
>
> -	return vfs_assigned;
> +/**
> + * pci_sriov_deassign_device - deassign device from VM
> + * @pdev: the device to be deassigned.
> + */
> +void pci_sriov_deassign_device(struct pci_dev *pdev)
> +{
> +	pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
> +	if (pdev->is_virtfn && !pdev->is_physfn)
> +		if (pdev->physfn)
> +			if (pdev->physfn->sriov)
> +				atomic_dec(&pdev->physfn->sriov->VFs_assigned_cnt);
>  }
> -EXPORT_SYMBOL_GPL(pci_vfs_assigned);
> +EXPORT_SYMBOL_GPL(pci_sriov_deassign_device);
>
>  /**
>   * pci_sriov_set_totalvfs -- reduce the TotalVFs available
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 6bd0822..d17bda2 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -235,6 +235,7 @@ struct pci_sriov {
>  	u32 pgsz;		/* page size for BAR alignment */
>  	u8 link;		/* Function
Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation
Alex,

Thanks for your review. When I chose the patch order, I thought about the issue you raise for quite a while and tried to make every patch as independent of the others as possible, so that we can bisect when we hit a problem. I will take more time to figure out a better patch order.

Thanks, Ethan

On 2014/7/11 9:48, Alex Williamson wrote:
> Since there's no 0th patch, I guess I'll comment here. This series is not bisectable; patch 1 breaks the existing implementation. I'd suggest:
> patch 1 - fix i40e

i40e can only be fixed with the new interface, so it couldn't be the first one.

> patch 2 - create assign/deassign that uses dev_flags

This will be the first?

> patch 3 - convert users to new interface

Has to be a later step.

> patch 4 - convert interface to use atomic_t

Could it be a standalone step? Let me think about it.

> IMHO, the assigned flag is a horrible hack and I don't know why drivers like xenback need to use it. KVM needs to use it because it doesn't actually have a driver to bind to when a device is assigned; it's happy to assign devices without any driver.
>
> Thanks, Alex
Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation
On Fri, 2014-07-11 at 10:10 +0800, ethan zhao wrote:
> i40e can only be fixed with the new interface, so it couldn't be the first one.

It looks like i40e just has its own copy of pci_vfs_assigned(); why can't your current patch 4/4 be applied now?
Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation
On 2014/7/11 10:22, Alex Williamson wrote:
> It looks like i40e just has its own copy of pci_vfs_assigned(); why can't your current patch 4/4 be applied now?

Yes, i40e has its local copy of pci_vfs_assigned(); it could be simplified with the new interface. In other words, it's a user of the new interface.

Thanks, Ethan
Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation
On Fri, 2014-07-11 at 10:29 +0800, ethan zhao wrote:
> Yes, i40e has its local copy of pci_vfs_assigned(); it could be simplified with the new interface. In other words, it's a user of the new interface.

It's a user of the existing interface. You're missing the point: abstract all the users of the assigned dev_flags first, then you're free to fix the implementation of the interface in one shot without breaking anything.
Re: [PATCH 1/4] PCI: introduce VFs reference counter and simplify pci_vfs_assigned() implementation
On 2014/7/11 10:33, Alex Williamson wrote:
> It's a user of the existing interface. You're missing the point: abstract all the users of the assigned dev_flags first, then you're free to fix the implementation of the interface in one shot without breaking anything.

Great, you hit the center this time. Agreed!
[PATCH][RESEND] KVM: nVMX: Fix vmptrld fail and vmwrite error when L1 goes down
This bug can be triggered when L1 goes down directly with enable_shadow_vmcs:

[ 6413.158950] kvm: vmptrld (null)/7800 failed
[ 6413.158954] vmwrite error: reg 401e value 4 (err 1)
[ 6413.158957] CPU: 0 PID: 4840 Comm: qemu-system-x86 Tainted: G OE 3.16.0kvm+ #2
[ 6413.158958] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A05 12/05/2013
[ 6413.158963] Call Trace:
[ 6413.158968] [81741de9] dump_stack+0x45/0x56
[ 6413.158972] [a059fa08] vmwrite_error+0x2c/0x2e [kvm_intel]
[ 6413.158974] [a05938bf] vmcs_writel+0x1f/0x30 [kvm_intel]
[ 6413.158976] [a059a97f] free_nested.part.73+0x5f/0x170 [kvm_intel]
[ 6413.158978] [a059ab13] vmx_free_vcpu+0x33/0x70 [kvm_intel]
[ 6413.158991] [a0360324] kvm_arch_vcpu_free+0x44/0x50 [kvm]
[ 6413.158998] [a0360f92] kvm_arch_destroy_vm+0xf2/0x1f0 [kvm]

Commit 26a865 (KVM: VMX: fix use after free of vmx->loaded_vmcs) fixed a use-after-free bug by moving free_loaded_vmcs() before free_nested(). However, this frees loaded_vmcs->vmcs prematurely, so vmptrld loads a NULL pointer while syncing the shadow vmcs to vmcs12. In addition, the vmwrite that disables the shadow vmcs and resets VMCS_LINK_POINTER fails, since there is no valid current VMCS.

This patch fixes it by skipping the kfree of the vmcs02_list item for the current VMCS in nested_free_all_saved_vmcss() and kfreeing it after free_loaded_vmcs(). This also avoids the use-after-free bug.
Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
---
 arch/x86/kvm/vmx.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 021d84a..6e427ac 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5762,10 +5762,11 @@ static void nested_free_all_saved_vmcss(struct vcpu_vmx *vmx)
 {
 	struct vmcs02_list *item, *n;
 	list_for_each_entry_safe(item, n, &vmx->nested.vmcs02_pool, list) {
-		if (vmx->loaded_vmcs != &item->vmcs02)
+		if (vmx->loaded_vmcs != &item->vmcs02) {
 			free_loaded_vmcs(&item->vmcs02);
-		list_del(&item->list);
-		kfree(item);
+			list_del(&item->list);
+			kfree(item);
+		}
 	}
 	vmx->nested.vmcs02_num = 0;
@@ -7526,10 +7527,16 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmcs02_list *item;
 
 	free_vpid(vmx);
-	free_loaded_vmcs(vmx->loaded_vmcs);
 	free_nested(vmx);
+	free_loaded_vmcs(vmx->loaded_vmcs);
+	if (vmx->nested.vmxon)
+		list_for_each_entry(item, &vmx->nested.vmcs02_pool, list) {
+			list_del(&item->list);
+			kfree(item);
+		}
 	kfree(vmx->guest_msrs);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, vmx);
--
1.9.1
Re: [PATCH] powerpc/kvm: Create proper names for the kvm_host_state PMU fields
On Thu, 2014-07-10 at 12:16 +0200, Alexander Graf wrote:
> On 10.07.14 11:34, Michael Ellerman wrote:
>> We have two arrays in kvm_host_state that contain register values for the PMU. Currently we only create an asm-offsets symbol for the base of the arrays, and do the array offset in the assembly code. Creating an asm-offsets symbol for each field individually makes the code much nicer to read, particularly for the MMCRx/SIxR/SDAR fields, and might have helped us notice the recent double restore bug we had in this code.
>>
>> Signed-off-by: Michael Ellerman m...@ellerman.id.au
>
> Acked-by: Alexander Graf ag...@suse.de

Thanks.

> I still think this whole code path should just be C though.

Yeah, it probably should.

cheers
[PATCH 2/2 V2] PCI: implement VFs assignment reference counter
The current implementation of the helper function pci_vfs_assigned() is a little complex: to get the number of VFs assigned to a VM, it accesses a low-level configuration space register and then loops while traversing the device tree.

This patch introduces an atomic reference counter in struct pci_sriov for VFs that are assigned to a VM, and simplifies the code in pci_vfs_assigned().

v2: reorder the patchset according to the suggestion from alex.william...@redhat.com

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 drivers/pci/iov.c | 45 ++++++++++++---------------------
 drivers/pci/pci.h |  1 +
 2 files changed, 18 insertions(+), 28 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index c082523..5478a0c 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -382,6 +382,7 @@ found:
 	iov->nres = nres;
 	iov->ctrl = ctrl;
 	iov->total_VFs = total;
+	atomic_set(&iov->VFs_assigned_cnt, 0);
 	iov->offset = offset;
 	iov->stride = stride;
 	iov->pgsz = pgsz;
@@ -603,43 +604,21 @@ int pci_num_vf(struct pci_dev *dev)
 EXPORT_SYMBOL_GPL(pci_num_vf);
 
 /**
- * pci_vfs_assigned - returns number of VFs are assigned to a guest
- * @dev: the PCI device
+ * pci_vfs_assigned - returns number of VFs that are assigned to a VM
+ * @dev: the physical PCI device that contains the VFs.
  *
- * Returns number of VFs belonging to this device that are assigned to a guest.
+ * Returns number of VFs belonging to this device that are assigned to a VM.
  * If device is not a physical function returns 0.
*/ int pci_vfs_assigned(struct pci_dev *dev) { - struct pci_dev *vfdev; - unsigned int vfs_assigned = 0; - unsigned short dev_id; - /* only search if we are a PF */ if (!dev-is_physfn) return 0; - /* -* determine the device ID for the VFs, the vendor ID will be the -* same as the PF so there is no need to check for that one -*/ - pci_read_config_word(dev, dev-sriov-pos + PCI_SRIOV_VF_DID, dev_id); - - /* loop through all the VFs to see if we own any that are assigned */ - vfdev = pci_get_device(dev-vendor, dev_id, NULL); - while (vfdev) { - /* -* It is considered assigned if it is a virtual function with -* our dev as the physical function and the assigned bit is set -*/ - if (vfdev-is_virtfn (vfdev-physfn == dev) - (vfdev-dev_flags PCI_DEV_FLAGS_ASSIGNED)) - vfs_assigned++; - - vfdev = pci_get_device(dev-vendor, dev_id, vfdev); - } - - return vfs_assigned; + if (dev-sriov) + return atomic_read(dev-sriov-VFs_assigned_cnt); + return 0; } EXPORT_SYMBOL_GPL(pci_vfs_assigned); @@ -650,6 +629,11 @@ EXPORT_SYMBOL_GPL(pci_vfs_assigned); void pci_sriov_assign_device(struct pci_dev *pdev) { pdev-dev_flags |= PCI_DEV_FLAGS_ASSIGNED; + if (pdev-is_virtfn !pdev-is_physfn) + if (pdev-physfn) + if (pdev-physfn-sriov) + atomic_inc(pdev-physfn-sriov- + VFs_assigned_cnt); } EXPORT_SYMBOL_GPL(pci_sriov_assign_device); @@ -660,6 +644,11 @@ EXPORT_SYMBOL_GPL(pci_sriov_assign_device); void pci_sriov_deassign_device(struct pci_dev *pdev) { pdev-dev_flags = ~PCI_DEV_FLAGS_ASSIGNED; + if (pdev-is_virtfn !pdev-is_physfn) + if (pdev-physfn) + if (pdev-physfn-sriov) + atomic_dec(pdev-physfn-sriov- + VFs_assigned_cnt); } EXPORT_SYMBOL_GPL(pci_sriov_deassign_device); diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 6bd0822..d17bda2 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -235,6 +235,7 @@ struct pci_sriov { u32 pgsz; /* page size for BAR alignment */ u8 link;/* Function Dependency Link */ u16 driver_max_VFs; /* max num VFs driver supports */ + atomic_t 
VFs_assigned_cnt; /* counter of VFs assigned to VM */ struct pci_dev *dev;/* lowest numbered PF */ struct pci_dev *self; /* this PF */ struct mutex lock; /* lock for VF bus */ -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
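The counting scheme above — bump an atomic counter on assign/deassign and read it back in pci_vfs_assigned() — can be sketched as a userspace analogue with C11 atomics. The struct and function names below are illustrative stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace sketch (not kernel code) of the patch's counting scheme:
 * assign/deassign adjust an atomic counter on the PF, and the query
 * becomes a single atomic load instead of a device-tree walk. */
struct fake_sriov {
    atomic_int vfs_assigned_cnt;   /* counter of VFs assigned to a VM */
};

void assign_vf(struct fake_sriov *s)
{
    atomic_fetch_add(&s->vfs_assigned_cnt, 1);
}

void deassign_vf(struct fake_sriov *s)
{
    atomic_fetch_sub(&s->vfs_assigned_cnt, 1);
}

int vfs_assigned(struct fake_sriov *s)
{
    return atomic_load(&s->vfs_assigned_cnt);
}
```

Because the increments and decrements are atomic, no lock is needed around the query, which mirrors why the patch can drop the pci_get_device() loop.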
[PATCH 1/2 V2] PCI: introduce device assignment interface and refactor related code
This patch introduces two new device assignment functions, pci_sriov_assign_device() and pci_sriov_deassign_device(), alongside the existing pci_vfs_assigned(). Together they form the VF assignment management interface, used to assign/deassign a device to a VM and to query the VF reference counter, instead of manipulating the device flag directly. This patch also refactors the related code and makes it atomic.

v2: reorder the patchset and make it bisectable and atomic, with clear steps between interface definition and implementation, according to the suggestion from alex.william...@redhat.com

Signed-off-by: Ethan Zhao ethan.z...@oracle.com
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 17 ++--------
 drivers/pci/iov.c                                  | 20 ++++++++++++
 drivers/xen/xen-pciback/pci_stub.c                 |  4 ++--
 include/linux/pci.h                                |  4 ++++
 virt/kvm/assigned-dev.c                            |  2 +-
 virt/kvm/iommu.c                                   |  4 ++--
 6 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 02c11a7..781040e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -693,22 +693,9 @@ complete_reset:
 static bool i40e_vfs_are_assigned(struct i40e_pf *pf)
 {
 	struct pci_dev *pdev = pf->pdev;
-	struct pci_dev *vfdev;
-
-	/* loop through all the VFs to see if we own any that are assigned */
-	vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, I40E_DEV_ID_VF, NULL);
-	while (vfdev) {
-		/* if we don't own it we don't care */
-		if (vfdev->is_virtfn && pci_physfn(vfdev) == pdev) {
-			/* if it is assigned we cannot release it */
-			if (vfdev->dev_flags & PCI_DEV_FLAGS_ASSIGNED)
-				return true;
-		}
-		vfdev = pci_get_device(PCI_VENDOR_ID_INTEL,
-				       I40E_DEV_ID_VF,
-				       vfdev);
-	}
+	if (pci_vfs_assigned(pdev))
+		return true;
 	return false;
 }

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index de7a747..c082523 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -644,6 +644,26 @@ int pci_vfs_assigned(struct pci_dev *dev)
 EXPORT_SYMBOL_GPL(pci_vfs_assigned);

 /**
+ * pci_sriov_assign_device - assign device to VM
+ * @pdev: the device to be assigned.
+ */
+void pci_sriov_assign_device(struct pci_dev *pdev)
+{
+	pdev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_assign_device);
+
+/**
+ * pci_sriov_deassign_device - deassign device from VM
+ * @pdev: the device to be deassigned.
+ */
+void pci_sriov_deassign_device(struct pci_dev *pdev)
+{
+	pdev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_deassign_device);
+
+/**
  * pci_sriov_set_totalvfs -- reduce the TotalVFs available
  * @dev: the PCI PF device
  * @numvfs: number that should be used for TotalVFs supported

diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index 62fcd48..1d7b747 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -133,7 +133,7 @@ static void pcistub_device_release(struct kref *kref)
 	xen_pcibk_config_free_dyn_fields(dev);
 	xen_pcibk_config_free_dev(dev);

-	dev->dev_flags &= ~PCI_DEV_FLAGS_ASSIGNED;
+	pci_sriov_deassign_device(dev);
 	pci_dev_put(dev);

 	kfree(psdev);
@@ -404,7 +404,7 @@ static int pcistub_init_device(struct pci_dev *dev)
 	dev_dbg(&dev->dev, "reset device\n");
 	xen_pcibk_reset_device(dev);

-	dev->dev_flags |= PCI_DEV_FLAGS_ASSIGNED;
+	pci_sriov_assign_device(dev);
 	return 0;

 config_release:

diff --git a/include/linux/pci.h b/include/linux/pci.h
index aab57b4..ddfcceb 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1603,6 +1603,8 @@ int pci_num_vf(struct pci_dev *dev);
 int pci_vfs_assigned(struct pci_dev *dev);
 int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs);
 int pci_sriov_get_totalvfs(struct pci_dev *dev);
+void pci_sriov_assign_device(struct pci_dev *dev);
+void pci_sriov_deassign_device(struct pci_dev *dev);
 #else
 static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
 { return -ENODEV; }
@@ -1614,6 +1616,8 @@ static inline int pci_sriov_set_totalvfs(struct pci_dev *dev, u16 numvfs)
 { return 0; }
 static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 { return 0; }
+static inline void pci_sriov_assign_device(struct pci_dev *dev) { }
+static inline void pci_sriov_deassign_device(struct pci_dev *dev) { }
 #endif

 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)

diff --git a/virt/kvm/assigned-dev.c
Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
On 09.07.14 00:59, Stewart Smith wrote:

Hi! Thanks for review, much appreciated!

Alexander Graf ag...@suse.de writes:
On 08.07.14 07:06, Stewart Smith wrote:

@@ -1528,6 +1535,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 	int i, need_vpa_update;
 	int srcu_idx;
 	struct kvm_vcpu *vcpus_to_update[threads_per_core];
+	phys_addr_t phy_addr, tmp;

Please put the variable declarations into the if () branch so that the compiler can catch potential leaks :)

ack. will fix.

@@ -1590,9 +1598,48 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 	srcu_idx = srcu_read_lock(&vc->kvm->srcu);

+	/* If we have a saved list of L2/L3, restore it */
+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && vc->mpp_buffer) {
+		phy_addr = virt_to_phys((void *)vc->mpp_buffer);
+#if defined(CONFIG_PPC_4K_PAGES)
+		phy_addr = (phy_addr + 8*4096) & ~(8*4096);

get_free_pages() is automatically aligned to the order, no?

That's what Paul reckoned too, and then we've attempted to find anywhere that documents that behaviour. Happen to be able to point to docs/source that say this is part of the API?

Phew - it's probably buried somewhere. I could only find this document saying that we always get order-aligned allocations: http://www.thehackademy.net/madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html

Mel, do you happen to have any pointer to something that explicitly (or even properly implicitly) says that get_free_pages() returns order-aligned memory?

+#endif
+		tmp = phy_addr & PPC_MPPE_ADDRESS_MASK;
+		tmp = tmp | PPC_MPPE_WHOLE_TABLE;
+
+		/* For sanity, abort any 'save' requests in progress */
+		asm volatile(PPC_LOGMPP(R1) : : "r" (tmp));
+
+		/* Initiate a cache-load request */
+		mtspr(SPRN_MPPR, tmp);
+	}

In fact, this whole block up here could be a function, no?

It could, perfectly happy for it to be one. Will fix.

+
+	/* Allocate memory before switching out of guest so we don't
+	   trash L2/L3 with memory allocation stuff */
+	if (cpu_has_feature(CPU_FTR_ARCH_207S) && !vc->mpp_buffer) {
+		vc->mpp_buffer = __get_free_pages(GFP_KERNEL|__GFP_ZERO,
+						  MPP_BUFFER_ORDER);

get_order(64 * 1024)? Also, why allocate it here and not on vcore creation?

There's also the possibility of saving/restoring part of the L3 cache as well, and I was envisioning a future patch to this which checks a flag in vcore (maybe exposed via sysfs or whatever mechanism is applicable) if it should save/restore L2 or L2/L3, so thus it makes a bit more sense allocating it there rather than elsewhere. There's also no real reason to fail to create a vcore if we can't allocate a buffer for L2/L3 cache contents - retrying later is perfectly harmless.

If we failed during core creation just don't save/restore L2 cache contents at all. I really prefer to have allocation and deallocation all at init time - and such low order allocations will most likely succeed.

Let's leave the L3 cache bits for later when we know whether it actually has an impact. I personally doubt it :).

Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
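As an aside on the `(phy_addr + 8*4096) & ~(8*4096)` line quoted in this thread: the conventional round-up-to-a-power-of-two idiom subtracts one inside the mask, i.e. `(x + N - 1) & ~(N - 1)`. A minimal sketch of that idiom (plain C, hypothetical helper name):

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of n, where n is a power of two.
 * Note the mask is ~(n - 1); masking with ~n instead would clear only
 * the single bit n, not the low-order bits below the alignment. */
uint64_t round_up_pow2(uint64_t x, uint64_t n)
{
    return (x + n - 1) & ~(n - 1);
}
```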
Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
On Thu, Jul 10, 2014 at 01:05:47PM +0200, Alexander Graf wrote:

[...]

Mel, do you happen to have any pointer to something that explicitly (or even properly implicitly) says that get_free_pages() returns order-aligned memory?

I did not read the whole thread so I lack context and will just answer this part. There is no guarantee that pages are returned in PFN order for multiple requests to the page allocator. This is the relevant comment in rmqueue_bulk:

	/*
	 * Split buddy pages returned by expand() are received here
	 * in physical page order. The page is added to the callers and
	 * list and the list head then moves forward. From the callers
	 * perspective, the linked list is ordered by page number in
	 * some conditions. This is useful for IO devices that can
	 * merge IO requests if the physical pages are ordered
	 * properly.
	 */

It will probably be true early in the lifetime of the system but the mileage will vary on systems with a lot of uptime. If you depend on this behaviour for correctness then you will have a bad day. High-order page requests to the page allocator are guaranteed to be in physical order. However, this does not apply to vmalloc() where allocations are only guaranteed to be virtually contiguous.

--
Mel Gorman
SUSE Labs
Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
On 10.07.14 15:07, Mel Gorman wrote:

[...]

It will probably be true early in the lifetime of the system but the mileage will vary on systems with a lot of uptime. If you depend on this behaviour for correctness then you will have a bad day. High-order page requests to the page allocator are guaranteed to be in physical order. However, this does not apply to vmalloc() where allocations are only guaranteed to be virtually contiguous.

Hrm, ok to be very concrete: Does __get_free_pages(..., 4); on a 4k page size system give me a 64k aligned pointer? :)

Alex
Re: [PATCH v2] Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
On Thu, Jul 10, 2014 at 03:17:16PM +0200, Alexander Graf wrote:

[...]

Hrm, ok to be very concrete: Does __get_free_pages(..., 4); on a 4k page size system give me a 64k aligned pointer? :)

Yes.

--
Mel Gorman
SUSE Labs
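Mel's answer can be checked with plain arithmetic: an order-n buddy block spans PAGE_SIZE << n bytes and starts on a boundary of its own size, so an order-4 allocation with 4K pages lands on a 64K boundary. A small sketch of that arithmetic (not a call into the kernel allocator; the helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_4K 4096ull

/* Size in bytes of an order-n buddy block with 4K pages. */
uint64_t order_block_size(unsigned int order)
{
    return PAGE_SIZE_4K << order;
}

/* True if addr sits on a boundary of its order's block size,
 * i.e. all bits below the block size are clear. */
int is_order_aligned(uint64_t addr, unsigned int order)
{
    return (addr & (order_block_size(order) - 1)) == 0;
}
```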
[PATCH 0/2] KVM: PPC: Fix Mac OS X guests on non-Apple hosts
While trying to get Mac-on-Linux to work on a POWER7 box I stumbled over two issues in our book3s_32 MMU emulation. With these issues fixed and a hack to disable magic page mapping (very hard to get right with 64k pages in this case) I can successfully run Mac OS X guests on a POWER7 host.

Alex

Alexander Graf (2):
  KVM: PPC: Deflect page write faults properly in kvmppc_st
  KVM: PPC: Book3S: Stop PTE lookup on write errors

 arch/powerpc/kvm/book3s.c        | 6 ++++--
 arch/powerpc/kvm/book3s_32_mmu.c | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

--
1.8.1.4
[PATCH 1/2] KVM: PPC: Deflect page write faults properly in kvmppc_st
When we have a page that we're not allowed to write to, xlate() will already tell us -EPERM on lookup of that page. With the code as is we change it into a "page missing" error which a guest may get confused about. Instead, just tell the caller about the -EPERM directly.

This fixes Mac OS X guests when run with DCBZ32 emulation.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index bd75902..9624c56 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -418,11 +418,13 @@ int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int size, void *ptr,
 	      bool data)
 {
 	struct kvmppc_pte pte;
+	int r;

 	vcpu->stat.st++;

-	if (kvmppc_xlate(vcpu, *eaddr, data, true, &pte))
-		return -ENOENT;
+	r = kvmppc_xlate(vcpu, *eaddr, data, true, &pte);
+	if (r < 0)
+		return r;

 	*eaddr = pte.raddr;
--
1.8.1.4
[PATCH 2/2] KVM: PPC: Book3S: Stop PTE lookup on write errors
When a page lookup failed because we're not allowed to write to the page, we should not overwrite that value with another lookup on the second PTEG which will return "page not found". Instead, we should just tell the caller that we had a permission problem.

This fixes Mac OS X guests looping endlessly in page lookup code for me.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/book3s_32_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_32_mmu.c b/arch/powerpc/kvm/book3s_32_mmu.c
index 93503bb..cd0b073 100644
--- a/arch/powerpc/kvm/book3s_32_mmu.c
+++ b/arch/powerpc/kvm/book3s_32_mmu.c
@@ -335,7 +335,7 @@ static int kvmppc_mmu_book3s_32_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 	if (r < 0)
 		r = kvmppc_mmu_book3s_32_xlate_pte(vcpu, eaddr, pte,
 						   data, iswrite, true);
-	if (r < 0)
+	if (r == -ENOENT)
 		r = kvmppc_mmu_book3s_32_xlate_pte(vcpu, eaddr, pte,
 						   data, iswrite, false);
--
1.8.1.4
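The control flow this patch establishes — retry the secondary PTEG only on "not found", propagate a permission error unchanged — reduces to a tiny sketch. xlate_result() is a hypothetical stand-in that takes the two lookup outcomes directly:

```c
#include <assert.h>
#include <errno.h>

/* Combine the results of a primary and secondary PTEG lookup: the
 * secondary lookup only applies when the primary reported -ENOENT;
 * any other error (e.g. -EPERM) reaches the caller unchanged. */
int xlate_result(int primary, int secondary)
{
    int r = primary;
    if (r == -ENOENT)
        r = secondary;
    return r;
}
```

With the pre-patch condition (`r < 0`), a primary -EPERM would have been clobbered by the secondary lookup's -ENOENT, which is exactly the looping behavior described above.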
[PATCH] KVM: PPC: Book3S: Add hack for split real mode
Today we handle split real mode by mapping both instruction and data faults into a special virtual address space that only exists during the split mode phase. This is good enough to catch 32bit Linux guests that use split real mode for copy_from/to_user. In this case we're always prefixed with 0xc000 for our instruction pointer and can map the user space process freely below there. However, that approach fails when we're running KVM inside of KVM. Here the 1st level last_inst reader may well be in the same virtual page as a 2nd level interrupt handler. It also fails when running Mac OS X guests. Here we have a 4G/4G split, so a kernel copy_from/to_user implementation can easily overlap with user space addresses. The architecturally correct way to fix this would be to implement an instruction interpreter in KVM that kicks in whenever we go into split real mode. This interpreter however would not receive a great amount of testing and be a lot of bloat for a reasonably isolated corner case. So I went back to the drawing board and tried to come up with a way to make split real mode work with a single flat address space. And then I realized that we could get away with the same trick that makes it work for Linux: Whenever we see an instruction address during split real mode that may collide, we just move it higher up the virtual address space to a place that hopefully does not collide (keep your fingers crossed!). That approach does work surprisingly well. I am able to successfully run Mac OS X guests with KVM and QEMU (no split real mode hacks like MOL) when I apply a tiny timing probe hack to QEMU. I'd say this is a win over even more broken split real mode :). 
Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/include/asm/kvm_asm.h    |  1 +
 arch/powerpc/include/asm/kvm_book3s.h |  3 +++
 arch/powerpc/kvm/book3s.c             | 20 ++++++++++++++
 arch/powerpc/kvm/book3s_pr.c          | 48 +++++++++++++++++++++++++++
 4 files changed, 72 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_asm.h b/arch/powerpc/include/asm/kvm_asm.h
index 9601741..3f3e530 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -131,6 +131,7 @@
 #define BOOK3S_HFLAG_NATIVE_PS		0x8
 #define BOOK3S_HFLAG_MULTI_PGSIZE	0x10
 #define BOOK3S_HFLAG_NEW_TLBIE		0x20
+#define BOOK3S_HFLAG_SPLIT_HACK		0x40

 #define RESUME_FLAG_NV		(1<<0)	/* Reload guest nonvolatile state? */
 #define RESUME_FLAG_HOST	(1<<1)	/* Resume host? */

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 8ac5392..cb7e661 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -324,4 +324,7 @@ static inline bool is_kvmppc_resume_guest(int r)
 /* LPIDs we support with this build -- runtime limit may be lower */
 #define KVMPPC_NR_LPIDS	(LPID_RSVD + 1)

+#define SPLIT_HACK_MASK		0xfff0
+#define SPLIT_HACK_OFFS		0xfe20
+
 #endif /* __ASM_KVM_BOOK3S_H__ */

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 9624c56..98fc18d 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -72,6 +72,17 @@ void kvmppc_core_load_guest_debugstate(struct kvm_vcpu *vcpu)
 {
 }

+void kvmppc_unfixup_split_real(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) {
+		ulong pc = kvmppc_get_pc(vcpu);
+		if ((pc & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS)
+			kvmppc_set_pc(vcpu, pc & ~SPLIT_HACK_MASK);
+		vcpu->arch.hflags &= ~BOOK3S_HFLAG_SPLIT_HACK;
+	}
+}
+EXPORT_SYMBOL_GPL(kvmppc_unfixup_split_real);
+
 static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu *vcpu)
 {
 	if (!is_kvmppc_hv_enabled(vcpu->kvm))
@@ -118,6 +129,7 @@ static inline bool kvmppc_critical_section(struct kvm_vcpu *vcpu)
 void kvmppc_inject_interrupt(struct kvm_vcpu *vcpu, int vec, u64 flags)
 {
+	kvmppc_unfixup_split_real(vcpu);
 	kvmppc_set_srr0(vcpu, kvmppc_get_pc(vcpu));
 	kvmppc_set_srr1(vcpu, kvmppc_get_msr(vcpu) | flags);
 	kvmppc_set_pc(vcpu, kvmppc_interrupt_offset(vcpu) + vec);
@@ -384,6 +396,14 @@ static int kvmppc_xlate(struct kvm_vcpu *vcpu, ulong eaddr, bool data,
 		pte->may_write = true;
 		pte->may_execute = true;
 		r = 0;
+
+		if ((kvmppc_get_msr(vcpu) & (MSR_IR | MSR_DR)) == MSR_DR &&
+		    !data) {
+			ulong pc = kvmppc_get_pc(vcpu);
+			if ((vcpu->arch.hflags & BOOK3S_HFLAG_SPLIT_HACK) &&
+			    ((eaddr & SPLIT_HACK_MASK) == SPLIT_HACK_OFFS))
+				pte->raddr &= ~SPLIT_HACK_MASK;
+		}
 	}

 	return r;

diff --git a/arch/powerpc/kvm/book3s_pr.c
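The mask/offset arithmetic in kvmppc_unfixup_split_real() and kvmppc_xlate() above boils down to two operations: test whether an address's top bits match the relocated hack region, and clear those bits again. A sketch with illustrative 32-bit constants (the patch's real SPLIT_HACK_MASK/SPLIT_HACK_OFFS values differ):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; the kernel patch defines its own values. */
#define HACK_MASK 0xfff00000u
#define HACK_OFFS 0xfe200000u

/* An address is in the split-hack region when its masked top bits
 * equal the region offset. */
int in_hack_region(uint32_t addr)
{
    return (addr & HACK_MASK) == HACK_OFFS;
}

/* Undo the fixup by clearing the relocation bits, recovering the
 * original low-address instruction pointer. */
uint32_t unfixup_addr(uint32_t addr)
{
    return addr & ~HACK_MASK;
}
```

This mirrors why the trick is cheap: moving an address into the region is an OR with the offset, and moving it back is a single AND with the inverted mask.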