[COMMIT master] vhost-net: fix reversed logic in mask notifiers
From: Michael S. Tsirkin m...@redhat.com

When the guest notifier is assigned, we set the mask notifier, which will assign the kvm irqfd. When the guest notifier is unassigned, the mask notifier is unset, which should unassign the kvm irqfd. We do this by calling the mask notifier, telling it to mask the vector - unless the vector is already masked, in which case the irqfd is unassigned already. The logic in unassign was reversed, which left the kvm irqfd assigned.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Acked-by: Gerd Hoffmann kra...@redhat.com
Acked-by: Amit Shah amit.s...@redhat.com
Reported-by: Amit Shah amit.s...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/msix.c b/hw/msix.c
index 8f9a621..1398680 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -617,6 +617,7 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
     assert(opaque);
     assert(!dev->msix_mask_notifier_opaque[vector]);
 
+    /* Unmask the new notifier unless vector is masked. */
     if (msix_is_masked(dev, vector)) {
         return 0;
     }
@@ -638,12 +639,13 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned vector)
     assert(dev->msix_mask_notifier);
     assert(dev->msix_mask_notifier_opaque[vector]);
 
+    /* Mask the old notifier unless it is already masked. */
     if (msix_is_masked(dev, vector)) {
         return 0;
     }
     r = dev->msix_mask_notifier(dev, vector,
                                 dev->msix_mask_notifier_opaque[vector],
-                                msix_is_masked(dev, vector));
+                                !msix_is_masked(dev, vector));
     if (r < 0) {
         return r;
     }

--
To unsubscribe from this list: send the line "unsubscribe kvm-commits" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[COMMIT master] device-assignment: don't truncate MSIX capabilities table size
From: Alex Williamson alex.william...@redhat.com

PCI_MSIX_TABSIZE is 0x07ff.

Reported-by: Juan Quintela quint...@redhat.com
Signed-off-by: Alex Williamson alex.william...@redhat.com
Acked-by: Chris Wright chr...@redhat.com
Acked-by: Juan Quintela quint...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index d8e7cb4..e254203 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1073,7 +1073,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
     else
         pos = pci_dev->cap.start;
 
-    entries_max_nr = pci_dev->config[pos + 2];
+    entries_max_nr = *(uint16_t *)(pci_dev->config + pos + 2);
     entries_max_nr &= PCI_MSIX_TABSIZE;
     entries_max_nr += 1;
 
@@ -1255,8 +1255,8 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
         entry_nr = assigned_dev_pci_read_word(pci_dev, pos + 2) &
             PCI_MSIX_TABSIZE;
         pci_dev->config[pci_dev->cap.start + pci_dev->cap.length] = 0x11;
-        pci_dev->config[pci_dev->cap.start +
-                        pci_dev->cap.length + 2] = entry_nr;
+        *(uint16_t *)(pci_dev->config + pci_dev->cap.start +
+                      pci_dev->cap.length + 2) = entry_nr;
         msix_table_entry = assigned_dev_pci_read_long(pci_dev,
                                                       pos + PCI_MSIX_TABLE);
         *(uint32_t *)(pci_dev->config + pci_dev->cap.start +
[COMMIT master] KVM: MMU: don't check PT_WRITABLE_MASK directly
From: Gui Jianfeng guijianf...@cn.fujitsu.com

Since we have is_writable_pte(), make use of it.

Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d92984d..136e160 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2965,7 +2965,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 		pt = sp->spt;
 		for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
 			/* avoid RMW */
-			if (pt[i] & PT_WRITABLE_MASK)
+			if (is_writable_pte(pt[i]))
 				pt[i] &= ~PT_WRITABLE_MASK;
 	}
 	kvm_flush_remote_tlbs(kvm);
@@ -3400,7 +3400,7 @@ void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
 	struct kvm_mmu_page *rev_sp;
 	gfn_t gfn;
 
-	if (*sptep & PT_WRITABLE_MASK) {
+	if (is_writable_pte(*sptep)) {
 		rev_sp = page_header(__pa(sptep));
 		gfn = kvm_mmu_page_get_gfn(rev_sp, sptep - rev_sp->spt);
 
@@ -3449,7 +3449,7 @@ static void check_writable_mappings_rmap(struct kvm_vcpu *vcpu)
 			if (!(ent & PT_PRESENT_MASK))
 				continue;
-			if (!(ent & PT_WRITABLE_MASK))
+			if (!is_writable_pte(ent))
 				continue;
 			inspect_spte_has_rmap(vcpu->kvm, &pt[i]);
 		}
@@ -3483,7 +3483,7 @@ static void audit_write_protection(struct kvm_vcpu *vcpu)
 
 		spte = rmap_next(vcpu->kvm, rmapp, NULL);
 		while (spte) {
-			if (*spte & PT_WRITABLE_MASK)
+			if (is_writable_pte(*spte))
 				printk(KERN_ERR "%s: (%s) shadow page has "
 				       "writable mappings: gfn %lx role %x\n",
 				       __func__, audit_msg, sp->gfn,
[COMMIT master] KVM: SVM: Fix EFER.LME being stripped
From: Zachary Amsden zams...@redhat.com

Must set the VCPU register to be the guest notion of EFER even if that setting is not valid on hardware. This was masked by the set in set_efer until 7657fd5ace88e8092f5f3a84117e093d7b893f26 broke that. The fix is simply to set the VCPU register before stripping bits.

Signed-off-by: Zachary Amsden zams...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5b313e9..298918e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -286,11 +286,11 @@ static inline void flush_guest_tlb(struct kvm_vcpu *vcpu)
 
 static void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
 {
+	vcpu->arch.efer = efer;
 	if (!npt_enabled && !(efer & EFER_LMA))
 		efer &= ~EFER_LME;
 
 	to_svm(vcpu)->vmcb->save.efer = efer | EFER_SVME;
-	vcpu->arch.efer = efer;
 }
 
 static int is_external_interrupt(u32 info)
[COMMIT master] KVM: SVM: fix compiling warning from erratum 383 fix
From: Xiao Guangrong xiaoguangr...@cn.fujitsu.com

Fix:

arch/x86/kvm/svm.c: In function ‘svm_handle_mce’:
arch/x86/kvm/svm.c:1490: warning: unused variable ‘kvm_run’

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 298918e..9c68a65 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1487,8 +1487,6 @@ static void svm_handle_mce(struct vcpu_svm *svm)
 	 * Erratum 383 triggered. Guest state is corrupt so kill the
 	 * guest.
 	 */
-	struct kvm_run *kvm_run = svm->vcpu.run;
-
 	pr_err("KVM: Guest triggered AMD Erratum 383\n");
 
 	set_bit(KVM_REQ_TRIPLE_FAULT, &svm->vcpu.requests);
[COMMIT master] KVM: MMU: Allow spte.w=1 for gpte.w=0 and cr0.wp=0 only in shadow mode
From: Avi Kivity a...@redhat.com

When tdp is enabled, the guest's cr0.wp shouldn't have any effect on spte permissions.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 136e160..39dd8d3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1881,7 +1881,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	spte |= (u64)pfn << PAGE_SHIFT;
 
 	if ((pte_access & ACC_WRITE_MASK)
-	    || (write_fault && !is_write_protection(vcpu) && !user_fault)) {
+	    || (!tdp_enabled && write_fault && !is_write_protection(vcpu)
+		&& !user_fault)) {
 
 		if (level > PT_PAGE_TABLE_LEVEL &&
 		    has_wrprotected_page(vcpu->kvm, gfn, level)) {
[COMMIT master] KVM: MMU: Remove user access when allowing kernel access to gpte.w=0 page
From: Avi Kivity a...@redhat.com

If cr0.wp=0, we have to allow the guest kernel access to a page with pte.w=0. We do that by setting spte.w=1, since the host cr0.wp must remain set so the host can write protect pages. Once we allow write access, we must remove user access, otherwise we mistakenly allow the user to write the page.

Reviewed-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 39dd8d3..56f8c3c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1894,6 +1894,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 
 		spte |= PT_WRITABLE_MASK;
 
+		if (!tdp_enabled && !(pte_access & ACC_WRITE_MASK))
+			spte &= ~PT_USER_MASK;
+
 		/*
 		 * Optimization: for pte sync, if spte was writable the hash
 		 * lookup is unnecessary (and expensive). Write protection
[COMMIT master] KVM: MMU: Document cr0.wp emulation
From: Avi Kivity a...@redhat.com

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/Documentation/kvm/mmu.txt b/Documentation/kvm/mmu.txt
index 2201dcb..1e7ecdd 100644
--- a/Documentation/kvm/mmu.txt
+++ b/Documentation/kvm/mmu.txt
@@ -298,6 +298,25 @@ Host translation updates:
   - look up affected sptes through reverse map
   - drop (or update) translations
 
+Emulating cr0.wp
+================
+
+If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
+works for the guest kernel, not guest userspace.  When the guest
+cr0.wp=1, this does not present a problem.  However when the guest cr0.wp=0,
+we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
+semantics require allowing any guest kernel access plus user read access).
+
+We handle this by mapping the permissions to two possible sptes, depending
+on fault type:
+
+- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
+  disallows user access)
+- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
+  write access)
+
+(user write faults generate a #PF)
+
 Further reading
 ===============
[COMMIT master] KVM: Fix order passed to iommu_unmap
From: Jan Kiszka jan.kis...@siemens.com

This is obviously a left-over from the old interface taking the size. Apparently a mostly harmless issue with the current iommu_unmap implementation.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Acked-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index 5adc578..673c88a 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -273,7 +273,7 @@ static void kvm_iommu_put_pages(struct kvm *kvm,
 		pfn = phys >> PAGE_SHIFT;
 
 		/* Unmap address from IO address space */
-		order = iommu_unmap(domain, gfn_to_gpa(gfn), PAGE_SIZE);
+		order = iommu_unmap(domain, gfn_to_gpa(gfn), 0);
 		unmap_pages = 1ULL << order;
 
 		/* Unpin all pages we just unmapped to not leak any memory */
[COMMIT master] KVM: ia64: Add missing spin_unlock in kvm_arch_hardware_enable()
From: Julia Lawall ju...@diku.dk

Add a spin_unlock missing on the error path. The semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression E1;
@@

* spin_lock(E1,...);
  <+... when != E1
  if (...) {
    ... when != E1
*   return ...;
  }
  ...+>
* spin_unlock(E1,...);
// </smpl>

Signed-off-by: Julia Lawall ju...@diku.dk
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 29f6075..91760e8 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -144,6 +144,7 @@ int kvm_arch_hardware_enable(void *garbage)
 			VP_INIT_ENV : VP_INIT_ENV_INITALIZE,
 			__pa(kvm_vm_buffer), KVM_VM_BUFFER_BASE, &tmp_base);
 	if (status != 0) {
+		spin_unlock(&vp_lock);
 		printk(KERN_WARNING"kvm: Failed to Enable VT Support\n");
 		return -EINVAL;
 	}
RE: [PATCH 1/4] Fix tboot enabled macro
Jan Kiszka wrote:
> Wang, Shane wrote:
>> Avi Kivity wrote:
>>> On 05/26/2010 10:25 AM, Jan Kiszka wrote:
>>>>> This is for CONFIG_INTEL_TXT enabled?
>>>>
>>>> Good point but needs to be solved differently. tboot, the variable
>>>> that is checked by the original header, is not exported to modules.
>>>> I wonder how this worked out for you... Solution should be: hack
>>>> tboot_enabled to kvm_tboot_enabled and unconditionally define that
>>>> to 0 for older kernels. If tboot is actually enabled in hardware,
>>>> KVM may not load, but I'm unsure if it's OK to assume tboot == 1 for
>>>> that case or if that will cause breakage if it's off instead -
>>>> CC'ing the KVM patch author.
>>>
>>> Worst case it doesn't load. I don't think it's a problem since
>>> enabling tboot will be rare for older kernels.
>>
>> tboot is not 0 if the tboot module is run before the kernel. If tboot
>> is enabled in hardware (I assume you mean if Intel TXT is enabled in
>> hardware) but the tboot module is not run, or old kernels don't
>> support the tboot module, we still have the outside_smx bit in the
>> feature MSR. Why might KVM not load?
>
> If we have to hard-wire tboot_enabled in kvm-kmod to 0, KVM may not
> test all required bits and erroneously assume VTX would be disabled. So
> I wondered what would happen if we hard-wired it to 1, pretending that
> the tboot module is loaded. Would we gain something without losing on
> some other end? If not, I would simply leave things as they are now
> (i.e. always assuming tboot absence).
>
> Thanks,
> Jan

Why is VTX assumed to be disabled? tboot_enabled == 0 but (msr & FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX) == 1 if you have VT enabled. If you have VT enabled, VMX outside SMX is 1 always.

Shane
About upgrade KVM
Hi folks,

KVM virtualizer host - Debian 5.0

This virtual machine has been running for some time. IIRC the KVM was downloaded from the KVM website. Its version is quite old.

$ sudo kvm | head -1
QEMU PC emulator version 0.9.1 (kvm-72), Copyright (c) 2003-2008 Fabrice Bellard

$ apt-cache policy qemu-kvm
qemu-kvm:
  Installed: (none)
  Candidate: 0.12.4+dfsg-1~bpo50+1
  Version table:
     0.12.4+dfsg-1~bpo50+1 0
          1 http://www.backports.org lenny-backports/main Packages

I tried to upgrade it via System -> Administration -> Synaptic Package Manager:
- qemu-kvm
- qemu-kvm-dbg

On marking them to install, the following warning popped up:

  Mark additional required changes?
  To be removed: kvm
  To be added: libasyncns0 libcap1 libpulse0

My questions are:
1) On deleting the old package, would it affect the running virtual machine?
2) Is there any way to come back in case of problems?

Please advise. TIA

B.R.
Stephen L
RE: [PATCH 1/4] Fix tboot enabled macro
Wang, Shane wrote:
> Why is VTX assumed to be disabled? tboot_enabled == 0 but
> (msr & FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX) == 1 if you have VT
> enabled. If you have VT enabled, VMX outside SMX is 1 always.

BTW, in hardware:
- VT is enabled, TXT is enabled: then outside = 1, inside = 1;
- VT is enabled, TXT is disabled: then outside = 1, inside = 0;
- VT is disabled: then outside = 0.

Shane
Re: [PATCH] device-assignment: don't truncate MSIX capabilities table size
On 05/26/10 14:48, Avi Kivity wrote:
> On 05/26/2010 03:27 AM, Juan Quintela wrote:
>> BTW, I also noticed the lack of pci_set_long() and friend functions,
>> but arrived at the same conclusion as you: all the device assignment
>> code assumes that the world is x86_64 :)
>
> IIRC it used to work on ia64 as well.

Same endianness, unless you run HP-UX on the CPU.

Jes
[PATCH] kvm mmu: optimizations when tdp is in use
In case of using tdp, checking for write-protected pages isn't needed, and the quadrant also doesn't need to be calculated.

Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0bb9f17..ce4bbd3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -495,10 +495,13 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
 	max_level = kvm_x86_ops->get_lpage_level() < host_level ?
 		kvm_x86_ops->get_lpage_level() : host_level;
 
+	if (tdp_enabled)
+		goto done;
+
 	for (level = PT_DIRECTORY_LEVEL; level <= max_level; ++level)
 		if (has_wrprotected_page(vcpu->kvm, large_gfn, level))
 			break;
-
+done:
 	return level - 1;
 }
 
@@ -1346,7 +1349,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	if (role.direct)
 		role.cr4_pae = 0;
 	role.access = access;
-	if (vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) {
+	if (!tdp_enabled && vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) {
 		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
--
1.6.5.2
Re: [PATCH] KVM test: Add perfmon into the guest tests
On 05/25/10 05:05, Chen Cao wrote:
> The perfmon2 API provides access to the hardware performance counters
> of modern processors.
>
> Dependency: to compile the source code of the test, the following
> packages should be installed:
>   glibc-static-2.11.1-6.x86_64
>   glibc-headers-2.11.1-6.x86_64
>   glibc-common-2.11.1-6.x86_64
>   glibc-devel-2.11.1-6.x86_64
>   glibc-2.11.1-6.x86_64
>
> Note,
> 1. libpfm uses the Performance Monitor Unit (PMU) on the processors,
>    but this unit is not provided by kvm currently, i.e. the test
>    should fail in kvm guests.
> 2. According to the README file of perfmon-tests-0.3, a 2.6.24 or
>    higher Linux kernel (with perfmon v2.8 or higher) is needed to run
>    the tests.

I thought perfmon2 was deprecated in favor of perf_event.c?

The only reference left for perfmon2 in the kernel is in the ia64 tree, and KVM/ia64 seems to be pretty dead these days.

Cheers,
Jes
[PATCH] kvm mmu: don't check PT_WRITABLE_MASK directly
Since we have is_writable_pte(), make use of it.

Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ce4bbd3..441a5d8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2923,7 +2923,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
 		pt = sp->spt;
 		for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
 			/* avoid RMW */
-			if (pt[i] & PT_WRITABLE_MASK)
+			if (is_writable_pte(pt[i]))
 				pt[i] &= ~PT_WRITABLE_MASK;
 	}
 	kvm_flush_remote_tlbs(kvm);
@@ -3358,7 +3358,7 @@ void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
 	struct kvm_mmu_page *rev_sp;
 	gfn_t gfn;
 
-	if (*sptep & PT_WRITABLE_MASK) {
+	if (is_writable_pte(*sptep)) {
 		rev_sp = page_header(__pa(sptep));
 		gfn = rev_sp->gfns[sptep - rev_sp->spt];
 
@@ -3408,7 +3408,7 @@ static void check_writable_mappings_rmap(struct kvm_vcpu *vcpu)
 			if (!(ent & PT_PRESENT_MASK))
 				continue;
-			if (!(ent & PT_WRITABLE_MASK))
+			if (!is_writable_pte(ent))
 				continue;
 			inspect_spte_has_rmap(vcpu->kvm, &pt[i]);
 		}
@@ -3442,7 +3442,7 @@ static void audit_write_protection(struct kvm_vcpu *vcpu)
 
 		spte = rmap_next(vcpu->kvm, rmapp, NULL);
 		while (spte) {
-			if (*spte & PT_WRITABLE_MASK)
+			if (is_writable_pte(*spte))
 				printk(KERN_ERR "%s: (%s) shadow page has "
 				       "writable mappings: gfn %lx role %x\n",
 				       __func__, audit_msg, sp->gfn,
--
1.6.5.2
Re: [PATCH] Add Documentation/kvm/msr.txt
On 05/26/2010 09:04 PM, Glauber Costa wrote:
> This patch adds a file that documents the usage of KVM-specific MSRs.

Looks good. A few comments:

> +Custom MSR list
> +===============
> +
> +The current supported Custom MSR list is:
> +
> +MSR_KVM_WALL_CLOCK: 0x11
> +
> +	data: physical address of a memory area.

Which must be in guest RAM (i.e., don't point it somewhere random and expect the hypervisor to allocate it for you). Must be aligned to 4 bytes (we don't enforce it though).

> +MSR_KVM_SYSTEM_TIME: 0x12
> +
> +	data: physical address of a memory area. This memory is expected to
> +	hold a copy of the following structure:

In guest RAM. What are the alignment restrictions? I don't think we can restrict to less than 4 bytes without breaking guests retroactively.

> +
> +	struct pvclock_vcpu_time_info {
> +		u32   version;
> +		u32   pad0;
> +		u64   tsc_timestamp;
> +		u64   system_time;
> +		u32   tsc_to_system_mul;
> +		s8    tsc_shift;
> +		u8    flags;
> +		u8    pad[2];
> +	} __attribute__((__packed__)); /* 32 bytes */
> +
> +	whose data will be filled in by the hypervisor periodically. Only one
> +	write, or registration, is needed for each VCPU. The interval between
> +	updates of this structure is arbitrary, and implementation-dependent.
> +
> +	version: guest has to check version before and after grabbing
> +	time information, and check that they are both equal and even.
> +	An odd version indicates an in-progress update.
> +
> +	tsc_timestamp: the tsc value at the current VCPU, at the time
> +	of the update of this structure. Guests can subtract this value
> +	from current tsc to derive a notion of elapsed time since the
> +	structure update.
> +
> +	system_time: the current system time at the time this structure
> +	was last updated. Unit is nanoseconds.

What is the baseline for system_time? Guest boot?

> +	tsc_to_system_mul: a function of the tsc frequency. One has
> +	to multiply any tsc-related quantity by this value to get
> +	a value in nanoseconds, besides dividing by 2^tsc_shift
> +
> +	tsc_shift: cycle to nanosecond divider, as a power of two, to
> +	allow for shift rights. One has to shift right any tsc-related
> +	quantity by this value to get a value in nanoseconds, besides
> +	multiplying by tsc_to_system_mul.

Would be good to write down the formula for calculating time here.

> +	flags: bits in this field indicate extended capabilities
> +	coordinated between the guest and the hypervisor. Availability
> +	of specific flags has to be checked in 0x40000001 cpuid leaf.
> +	Refer to cpuid.txt for details.

Need to document bit 0 here for completeness. cpuid.txt documents how to discover it, here we document how to use it.

> +	Availability of this MSR must be checked via bit 0 in 0x40000001
> +	cpuid leaf prior to usage.
> +
> +	This MSR falls outside the reserved KVM range, and may be removed
> +	in the future. Its usage is deprecated.
> +
> +MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00
> +
> +	data and functioning: same as MSR_KVM_WALL_CLOCK. Use this instead.
> +
> +	Availability of this MSR must be checked via bit 3 in 0x40000001
> +	cpuid leaf prior to usage.
> +
> +MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01
> +
> +	data and functioning: same as MSR_KVM_SYSTEM_TIME. Use this instead.
> +
> +	Availability of this MSR must be checked via bit 3 in 0x40000001
> +	cpuid leaf prior to usage.

Please detail the suggested algorithm for using the msrs (new then old).

--
error compiling committee.c: too many arguments to function
Re: issue about virtio-net driver to support vhost mergeable buffer with zero-copy to support PS mode
On Thu, May 27, 2010 at 09:21:02AM +0800, Xin, Xiaohui wrote:
> Michael,
> I'm now looking into the vhost mergeable buffer, and I tried to use it
> to support PS mode with zero-copy. And I found an issue there that
> forces me to modify the guest virtio-net driver. When the guest
> virtio-net driver submits mergeable buffers, it submits multiple pages
> outside. In the zero-copy case, vhost cannot know which page is used
> to put the header and which page is used to put the payload. Then
> vhost can only reserve 12 bytes in each page. That means the
> page_offset of the payload DMAed into the guest buffer is always 12
> bytes. But the guest virtio-net driver always uses offset 0 to put the
> data (see receive_mergeable()). That's where zero-copy with mergeable
> buffers must modify the driver. Have I missed something here? And what
> do you think about it?
>
> Thanks
> Xiaohui

Maybe you can teach the hardware to skip the first 12 bytes: qemu will call an ioctl telling the hardware what the virtio header size is. This is how we plan to do it for tap.

Alternatively, buffers can be used in any order. So we can have the hardware use N buffers for the packet, and then have vhost put the header in buffer N+1.

--
MST
Re: About upgrade KVM
> On marking them to install, the following warning popped up:
>
>   Mark additional required changes?
>   To be removed: kvm
>   To be added: libasyncns0 libcap1 libpulse0
>
> My questions are:
> 1) On deleting the old package, would it affect the running virtual
>    machine?

IMHO they won't be affected until shutdown of the guest and/or the host.

> 2) Is there any way to come back in case of problems?

Backup the old kvm package (look in /var/cache/apt/archives) before the upgrade. Be aware that some command line switches changed between 0.9.1 and 0.12.4. And upgrade your host kernel too (2.6.32 is in backports) for a newer version of the kvm kernel module (I don't know if kvm-72 works with 0.12.4).

- Thomas
Re: [PATCH] kvm mmu: optimizations when tdp is in use
On 05/27/2010 11:06 AM, Gui Jianfeng wrote:
> In case of using tdp, checking for write-protected pages isn't needed,
> and the quadrant also doesn't need to be calculated.
>
> Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
> ---
>  arch/x86/kvm/mmu.c |    7 +++++--
>  1 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 0bb9f17..ce4bbd3 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -495,10 +495,13 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
>  	max_level = kvm_x86_ops->get_lpage_level() < host_level ?
>  		kvm_x86_ops->get_lpage_level() : host_level;
>
> +	if (tdp_enabled)
> +		goto done;
> +
>  	for (level = PT_DIRECTORY_LEVEL; level <= max_level; ++level)
>  		if (has_wrprotected_page(vcpu->kvm, large_gfn, level))
>  			break;
> -
> +done:
>  	return level - 1;
>  }

We also use ->write_count to prevent mapping the end of a non-large-page-aligned memslot with a large spte. Undocumented in mmu.txt, I'll post a patch.

--
error compiling committee.c: too many arguments to function
RE: [PATCH 1/4] Fix tboot enabled macro
Jan Kiszka wrote:
> If TXT is on and VT is locked but KVM sees tboot_enabled == 0, it
> won't check for FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX during setup
> and may consider VT unavailable.

If VT is locked and TXT is on while tboot_enabled == 0, then it will check VMXON_OUTSIDE_SMX. But at this point, if VT is on (still locked), the function will return 0, which means vmx is not disabled by the BIOS, correct?

> Moreover, if VT is not locked in that case, KVM will also not set
> FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX during hardware_enable,
> likely leaving VT off then, no?

Sure, KVM will not set VMXON_INSIDE_SMX, but will set VMXON_OUTSIDE_SMX. In that case, this means VT is on.

> So my question is: Would it cause any harm to assume TXT being always
> on, even if it wasn't?

A bit confused. Do you mean hardware TXT always on, i.e. set FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX = 1 always? That's fine. No problem. No harm.

Or do you mean set tboot_enabled = 1 always? If so, in case the hardware TXT is disabled (FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX = 0), then KVM will think vmx is disabled if the feature MSR is locked.

Shane
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote:
> Add a new kernel API to create a singlethread workqueue and attach its
> task to the current task's cgroup and cpumask.
>
> Signed-off-by: Sridhar Samudrala s...@us.ibm.com

Could someone familiar with the workqueue code please comment on whether this patch is suitable for 2.6.35? It is needed to fix the case where a vhost user might cause a kernel thread to consume more CPU than allowed by the cgroup. Should I merge it through the vhost tree? Ack for this?

Thanks!

> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 9466e86..6d6f301 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -211,6 +211,7 @@ __create_workqueue_key(const char *name, int singlethread,
>  #define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0)
>  #define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0)
>
> +extern struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name);
>  extern void destroy_workqueue(struct workqueue_struct *wq);
>
>  extern int queue_work(struct workqueue_struct *wq, struct work_struct *work);
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 5bfb213..6ba226e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -35,6 +35,7 @@
>  #include <linux/lockdep.h>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/workqueue.h>
> +#include <linux/cgroup.h>
>
>  /*
>   * The per-CPU workqueue (if single thread, we always use the first
> @@ -193,6 +194,45 @@ static const struct cpumask *cpu_singlethread_map __read_mostly;
>   */
>  static cpumask_var_t cpu_populated_map __read_mostly;
>
> +static struct task_struct *get_singlethread_wq_task(struct workqueue_struct *wq)
> +{
> +	return (per_cpu_ptr(wq->cpu_wq, singlethread_cpu))->thread;
> +}
> +
> +/* Create a singlethread workqueue and attach its task to the current
> + * task's cgroup and set its cpumask to the current task's cpumask.
> + */
> +struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name)
> +{
> +	struct workqueue_struct *wq;
> +	struct task_struct *task;
> +	cpumask_var_t mask;
> +
> +	wq = create_singlethread_workqueue(name);
> +	if (!wq)
> +		goto out;
> +
> +	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> +		goto err;
> +
> +	if (sched_getaffinity(current->pid, mask))
> +		goto err;
> +
> +	task = get_singlethread_wq_task(wq);
> +	if (sched_setaffinity(task->pid, mask))
> +		goto err;
> +
> +	if (cgroup_attach_task_current_cg(task))
> +		goto err;
> +out:
> +	return wq;
> +err:
> +	destroy_workqueue(wq);
> +	wq = NULL;
> +	goto out;
> +}
> +EXPORT_SYMBOL_GPL(create_singlethread_workqueue_in_current_cg);
> +
>  /* If it's single threaded, it isn't in the list of workqueues. */
>  static inline int is_wq_single_threaded(struct workqueue_struct *wq)
>  {
Re: [PATCH 1/4] Fix tboot enabled macro
Wang, Shane wrote: Jan Kiszka wrote: If TXT is on and VT is locked but KVM sees tboot_enabled == 0, it won't check for FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX during setup and may consider VT unavailable. If vt is locked, txt is on, tboot_enabled = 0, then it will check VMXON_OUTSIDE_SMX. But at this point, if vt is on (still locked), the fn will return 0, which means vmx is not disabled by bios, correct? Moreover, if VT is not locked in that case, KVM will also not set FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX during hardware_enable, likely leaving VT off then, no? Sure, KVM will not set VMXON_INSIDE_SMX, but will set VMXON_OUTSIDE_SMX. In that case, this means vt is on. So my question is: Would it cause any harm to assume TXT being always on, even if it wasn't? A bit confused. Do you mean hardware TXT always on, i.e. set FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX = 1 always? That's fine. No problem. No harm. Or, do you mean set tboot_enabled = 1 always? The latter. As we have no clue about the actual state (tboot is not exported on older kernels), we are forced to assume some reasonable state. if so, in case that the hardware TXT is disabled (FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX = 0), then KVM will think vmx is disabled if feature msr is locked. Then let's leave it as it was before the tboot changes to VMX: assume !tboot_enabled(). Thanks for explaining, Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Copy and paste feature across guest and host
Just installed Fedora 13 as a guest on KVM. However, there is no cross-platform copy and paste feature. I believe I have set up this feature on another guest sometime before; unfortunately I can't find the relevant document. Could you please shed some light on this? A pointer would be appreciated. TIA Did you try: # modprobe virtio-copypaste ? Seriously, qemu does not make it easy (well, its GUI does not make most things easy) and you'll need a tool which synchronizes the clipboard between two machines (google for qemu copy paste?). -- Tomasz Chmielewski http://wpkg.org
RE: [PATCH 1/4] Fix tboot enabled macro
Jan Kiszka wrote: The latter. As we have no clue about the actual state (tboot is not exported on older kernels), we are forced to assume some reasonable state. Are you trying to load the latest KVM on the older kernels? Shane
[Autotest PATCH - updated with current tree] KVM test: Add perfmon into the guest tests
perfmon2 API provides access to the hardware performance counters of modern processors. Dependency, To compile the source code of the test, the following packages should be installed, glibc-static-2.11.1-6.x86_64 glibc-headers-2.11.1-6.x86_64 glibc-common-2.11.1-6.x86_64 glibc-devel-2.11.1-6.x86_64 glibc-2.11.1-6.x86_64 Note, 1. libpfm uses the Performance Monitor Unit (PMU) on the processors, but this unit is not provided by kvm currently, i.e. the test should fail in kvm guests. And this test can be used as a reminder that kvm still lack the PMU virtualization. 2. According to the README file of perfmon-tests-0.3, 2.6.24 or higer Linux kernel (with perfmon v2.8 or higher) is needed to run the tests. Signed-off-by: Chen Cao k...@redhat.com --- 0 files changed, 0 insertions(+), 0 deletions(-) diff --git a/client/tests/kvm/autotest_control/perfmon.control b/client/tests/kvm/autotest_control/perfmon.control new file mode 100644 index 000..d3f5190 --- /dev/null +++ b/client/tests/kvm/autotest_control/perfmon.control @@ -0,0 +1,16 @@ +TIME=SHORT +AUTHOR = Stephane Eranian eran...@google.com +DOC = +This is a simple series of test for the perfmon2 API which +provides access to the hardware performance counters of modern +processors. 
+ +Information about perfmon2 at: +http://perfmon2.sf.net + +NAME = 'perfmon' +TEST_CLASS = 'kernel' +TEST_CATEGORY = 'Functional' +TEST_TYPE = 'client' + +job.run_test('perfmon') diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index 5349034..599e4c3 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -154,6 +154,8 @@ variants: test_control_file = hwclock.control - rtc: test_control_file = rtc.control +- perfmon: +test_control_file = perfmon.control - linux_s3: install setup unattended_install type = linux_s3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix SVM longmode guests
On 05/27/2010 04:09 AM, Zachary Amsden wrote: In recent testing, I discovered guests failed to boot on my AMD box. Bisecting revealed an EFER-related change caused the problem; here is the fix. Yikes, applied, thanks. -- error compiling committee.c: too many arguments to function
[Autotest PATCH] client test - perfmon: Patch the source code of perfmon to remove the -Werror switch
On some OSes, the compilation will failed because of this switch, ... gcc -O2 -g -Wall -Werror +-I/home/kcao/projects/autotest/client/tests/perfmon/perfmon-tests-0.3/libpfm-3. +52/lib/../include -D_REENTRANT -DCONFIG_PFMLIB_ARCH_X86_64 -I. -c +pfmlib_os_linux.c cc1: warnings being treated as errors pfmlib_os_linux.c: In function ‘pfm_init_syscalls_sysfs’: pfmlib_os_linux.c:398: error: ignoring return value of ‘fscanf’, declared with +attribute warn_unused_result make[2]: *** [pfmlib_os_linux.o] Error 1 make[2]: Leaving directory +`/home/kcao/projects/autotest/client/tests/perfmon/perfmon-tests-0.3/libpfm-3.5 +2/lib' make[1]: *** [lib] Error 2 Signed-off-by: Chen Cao k...@redhat.com --- client/tests/perfmon/perfmon.py |8 +++ client/tests/perfmon/remove-werror-switch.patch | 55 +++ 2 files changed, 63 insertions(+), 0 deletions(-) create mode 100644 client/tests/perfmon/remove-werror-switch.patch diff --git a/client/tests/perfmon/perfmon.py b/client/tests/perfmon/perfmon.py index ec1145f..39c6fb2 100644 --- a/client/tests/perfmon/perfmon.py +++ b/client/tests/perfmon/perfmon.py @@ -10,6 +10,14 @@ class perfmon(test.test): tarball = utils.unmap_url(self.bindir, tarball, self.tmpdir) utils.extract_tarball_to_dir(tarball, self.srcdir) os.chdir(self.srcdir) +# Apply the patch to remove the -Werror switch, +# because there are warnings while compiling, +# if the compiler sticks to the rules, the source building +# will fail, although it seems that fedora and rhel ignore +# these warnings. 
+p1 = 'patch -p1 ../remove-werror-switch.patch' +utils.system(p1) + utils.system('make') diff --git a/client/tests/perfmon/remove-werror-switch.patch b/client/tests/perfmon/remove-werror-switch.patch new file mode 100644 index 000..6a5062c --- /dev/null +++ b/client/tests/perfmon/remove-werror-switch.patch @@ -0,0 +1,55 @@ +diff --git a/Makefile b/Makefile +index 6ba6fba..884274d 100644 +--- a/Makefile b/Makefile +@@ -28,6 +28,7 @@ TOPDIR := $(shell if [ $$PWD != ]; then echo $$PWD; else pwd; fi) + include config.mk + + DIRS=tests ++PATCHNAME=remove-libpfm-werror.patch + + all: libpfm + @echo Compiling for \'$(ARCH)\' target +@@ -36,6 +37,7 @@ all: libpfm + libpfm: + @echo Compiling $(LIBPFM) + @tar zxf $(LIBPFM).tar.gz ++ @(cd $(LIBPFM) patch -p1 ../$(PATCHNAME) cd ..) + @ln -sf $(LIBPFM) libpfm + @$(MAKE) -C $(LIBPFM) lib + clean: +diff --git a/config.mk b/config.mk +index a110111..7321f2d 100644 +--- a/config.mk b/config.mk +@@ -163,7 +163,7 @@ INSTALL=install + LN=ln -sf + PFMINCDIR=$(TOPDIR)/libpfm/include + PFMLIBDIR=$(TOPDIR)/libpfm/lib +-DBG?=-g -Wall -Werror ++DBG?=-g -Wall + # gcc/mips64 bug + ifeq ($(CONFIG_PFMLIB_ARCH_SICORTEX),y) + OPTIM?=-O +diff --git a/remove-libpfm-werror.patch b/remove-libpfm-werror.patch +new file mode 100644 +index 000..252aaa0 +--- /dev/null b/remove-libpfm-werror.patch +@@ -0,0 +1,13 @@ ++diff --git a/config.mk b/config.mk ++index 8c76f59..bbf1fc0 100644 ++--- a/config.mk + b/config.mk ++@@ -164,7 +164,7 @@ INSTALL=install ++ LN=ln -sf ++ PFMINCDIR=$(TOPDIR)/include ++ PFMLIBDIR=$(TOPDIR)/lib ++-DBG?=-g -Wall -Werror +++DBG?=-g -Wall ++ # gcc/mips64 bug ++ ifeq ($(CONFIG_PFMLIB_ARCH_SICORTEX),y) ++ OPTIM?=-O +-- +1.7.0.4 + -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6] KVM: x86: Enable XSAVE/XRSTOR for guest
From: Dexuan Cui dexuan@intel.com This patch enable guest to use XSAVE/XRSTOR instructions. We assume that host_xcr0 would use all possible bits that OS supported. And we loaded xcr0 in the same way we handled fpu - do it as late as we can. Signed-off-by: Dexuan Cui dexuan@intel.com Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/include/asm/kvm_host.h |2 + arch/x86/include/asm/vmx.h |1 + arch/x86/kvm/kvm_cache_regs.h |6 ++ arch/x86/kvm/vmx.c | 16 + arch/x86/kvm/x86.c | 120 -- include/linux/kvm_host.h|2 +- 6 files changed, 139 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index d08bb4a..b16356b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -302,6 +302,7 @@ struct kvm_vcpu_arch { } update_pte; struct fpu guest_fpu; + u64 xcr0; gva_t mmio_fault_cr2; struct kvm_pio_request pio; @@ -605,6 +606,7 @@ int kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val); unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu); void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw); void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l); +void kvm_set_xcr0(struct kvm_vcpu *vcpu, u64 xcr0); int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata); int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data); diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 9e6779f..346ea66 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -266,6 +266,7 @@ enum vmcs_field { #define EXIT_REASON_EPT_VIOLATION 48 #define EXIT_REASON_EPT_MISCONFIG 49 #define EXIT_REASON_WBINVD 54 +#define EXIT_REASON_XSETBV 55 /* * Interruption-information format diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h index d2a98f8..6491ac8 100644 --- a/arch/x86/kvm/kvm_cache_regs.h +++ b/arch/x86/kvm/kvm_cache_regs.h @@ -71,4 +71,10 @@ static inline ulong kvm_read_cr4(struct kvm_vcpu *vcpu) return 
kvm_read_cr4_bits(vcpu, ~0UL); } +static inline u64 kvm_read_edx_eax(struct kvm_vcpu *vcpu) +{ + return (kvm_register_read(vcpu, VCPU_REGS_RAX) -1u) + | ((u64)(kvm_register_read(vcpu, VCPU_REGS_RDX) -1u) 32); +} + #endif diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 99ae513..c55d57d 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -36,6 +36,8 @@ #include asm/vmx.h #include asm/virtext.h #include asm/mce.h +#include asm/i387.h +#include asm/xcr.h #include trace.h @@ -3354,6 +3356,19 @@ static int handle_wbinvd(struct kvm_vcpu *vcpu) return 1; } +static int handle_xsetbv(struct kvm_vcpu *vcpu) +{ + u64 new_bv = kvm_read_edx_eax(vcpu); + + if (kvm_register_read(vcpu, VCPU_REGS_RCX) != 0) { + kvm_inject_gp(vcpu, 0); + return 1; + } + kvm_set_xcr0(vcpu, new_bv); + skip_emulated_instruction(vcpu); + return 1; +} + static int handle_apic_access(struct kvm_vcpu *vcpu) { return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE; @@ -3632,6 +3647,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { [EXIT_REASON_TPR_BELOW_THRESHOLD] = handle_tpr_below_threshold, [EXIT_REASON_APIC_ACCESS] = handle_apic_access, [EXIT_REASON_WBINVD] = handle_wbinvd, + [EXIT_REASON_XSETBV] = handle_xsetbv, [EXIT_REASON_TASK_SWITCH] = handle_task_switch, [EXIT_REASON_MCE_DURING_VMENTRY] = handle_machine_check, [EXIT_REASON_EPT_VIOLATION] = handle_ept_violation, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 7be1d36..e7acc9d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -64,6 +64,7 @@ (~(unsigned long)(X86_CR4_VME | X86_CR4_PVI | X86_CR4_TSD | X86_CR4_DE\ | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \ | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR \ + | X86_CR4_OSXSAVE \ | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE)) #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR) @@ -149,6 +150,13 @@ struct kvm_stats_debugfs_item debugfs_entries[] = { { NULL } }; +u64 __read_mostly host_xcr0; + +static inline u32 bit(int bitno) +{ + return 1 (bitno 31); 
+} + static void kvm_on_user_return(struct user_return_notifier *urn) { unsigned slot; @@ -473,6 +481,52 @@ void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw) } EXPORT_SYMBOL_GPL(kvm_lmsw); +int __kvm_set_xcr0(struct kvm_vcpu *vcpu, u64 xcr0) +{ + if (kvm_x86_ops-get_cpl(vcpu) != 0) + return 1; + if (!(xcr0 XSTATE_FP)) + return 1; + if
[PATCH] KVM: x86: XSAVE/XRSTOR live migration support
This patch enable save/restore of xsave state. Signed-off-by: Sheng Yang sh...@linux.intel.com --- arch/x86/include/asm/kvm.h | 29 arch/x86/kvm/x86.c | 79 include/linux/kvm.h|6 +++ 3 files changed, 114 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm.h b/arch/x86/include/asm/kvm.h index ff90055..d3f4d9f 100644 --- a/arch/x86/include/asm/kvm.h +++ b/arch/x86/include/asm/kvm.h @@ -22,6 +22,7 @@ #define __KVM_HAVE_XEN_HVM #define __KVM_HAVE_VCPU_EVENTS #define __KVM_HAVE_DEBUGREGS +#define __KVM_HAVE_XSAVE /* Architectural interrupt line count. */ #define KVM_NR_INTERRUPTS 256 @@ -299,4 +300,32 @@ struct kvm_debugregs { __u64 reserved[9]; }; +/* for KVM_CAP_XSAVE */ +struct kvm_xsave { + struct { + __u16 cwd; + __u16 swd; + __u16 twd; + __u16 fop; + __u64 rip; + __u64 rdp; + __u32 mxcsr; + __u32 mxcsr_mask; + __u32 st_space[32]; + __u32 xmm_space[64]; + __u32 padding[12]; + __u32 sw_reserved[12]; + } i387; + struct { + __u64 xstate_bv; + __u64 reserved1[2]; + __u64 reserved2[5]; + } xsave_hdr; + struct { + __u32 ymmh_space[64]; + } ymmh; + __u64 xcr0; + __u32 padding[256]; +}; + #endif /* _ASM_X86_KVM_H */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e7acc9d..5badba2 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1711,6 +1711,9 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_MCE: r = KVM_MAX_MCE_BANKS; break; + case KVM_CAP_XSAVE: + r = cpu_has_xsave; + break; default: r = 0; break; @@ -2363,6 +2366,59 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, return 0; } +static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, + struct kvm_xsave *guest_xsave) +{ + struct xsave_struct *xsave = vcpu-arch.guest_fpu.state-xsave; + + if (!cpu_has_xsave) + return; + + guest_xsave-i387.cwd = xsave-i387.cwd; + guest_xsave-i387.swd = xsave-i387.swd; + guest_xsave-i387.twd = xsave-i387.twd; + guest_xsave-i387.fop = xsave-i387.fop; + guest_xsave-i387.rip = xsave-i387.rip; + guest_xsave-i387.rdp = 
xsave-i387.rdp; + memcpy(guest_xsave-i387.st_space, xsave-i387.st_space, 128); + memcpy(guest_xsave-i387.xmm_space, xsave-i387.xmm_space, + sizeof guest_xsave-i387.xmm_space); + + guest_xsave-xsave_hdr.xstate_bv = xsave-xsave_hdr.xstate_bv; + memcpy(guest_xsave-ymmh.ymmh_space, xsave-ymmh.ymmh_space, + sizeof xsave-ymmh.ymmh_space); + + guest_xsave-xcr0 = vcpu-arch.xcr0; +} + +static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu, + struct kvm_xsave *guest_xsave) +{ + struct xsave_struct *xsave = vcpu-arch.guest_fpu.state-xsave; + + if (!cpu_has_xsave) + return -EINVAL; + + xsave-i387.cwd = guest_xsave-i387.cwd; + xsave-i387.swd = guest_xsave-i387.swd; + xsave-i387.twd = guest_xsave-i387.twd; + xsave-i387.fop = guest_xsave-i387.fop; + xsave-i387.rip = guest_xsave-i387.rip; + xsave-i387.rdp = guest_xsave-i387.rdp; + memcpy(xsave-i387.st_space, guest_xsave-i387.st_space, 128); + memcpy(xsave-i387.xmm_space, guest_xsave-i387.xmm_space, + sizeof guest_xsave-i387.xmm_space); + + xsave-xsave_hdr.xstate_bv = guest_xsave-xsave_hdr.xstate_bv; + memcpy(xsave-ymmh.ymmh_space, guest_xsave-ymmh.ymmh_space, + sizeof guest_xsave-ymmh.ymmh_space); + + /* set_xsave may override the initial value of xcr0... */ + if (guest_xsave-xcr0 != 0) + kvm_set_xcr0(vcpu, guest_xsave-xcr0); + return 0; +} + long kvm_arch_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -2564,6 +2620,29 @@ long kvm_arch_vcpu_ioctl(struct file *filp, r = kvm_vcpu_ioctl_x86_set_debugregs(vcpu, dbgregs); break; } + case KVM_GET_XSAVE: { + struct kvm_xsave xsave; + + kvm_vcpu_ioctl_x86_get_xsave(vcpu, xsave); + + r = -EFAULT; + if (copy_to_user(argp, xsave, +sizeof(struct kvm_xsave))) + break; + r = 0; + break; + } + case KVM_SET_XSAVE: { + struct kvm_xsave xsave; + + r = -EFAULT; + if (copy_from_user(xsave, argp, +
[PATCH] qemu: Enable XSAVE related CPUID
We can support it in KVM now. The initial values are the minimal requirement of XSAVE capable processor. Signed-off-by: Sheng Yang sh...@linux.intel.com --- target-i386/cpuid.c | 32 1 files changed, 32 insertions(+), 0 deletions(-) diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c index eebf038..cbf5595 100644 --- a/target-i386/cpuid.c +++ b/target-i386/cpuid.c @@ -1067,6 +1067,38 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *ecx = 0; *edx = 0; break; +case 0xD: +/* Processor Extended State */ +if (!(env-cpuid_ext_features CPUID_EXT_XSAVE)) { +*eax = 0; +*ebx = 0; +*ecx = 0; +*edx = 0; +break; +} +if (count == 0) { +*eax = 0x7; /* FP | SSE | YMM */ +*ebx = 0x340; /* FP + SSE + YMM size */ +*ecx = 0x340; /* FP + SSE + YMM size */ +*edx = 0; +} else if (count == 1) { +/* eax = 1, so we can continue with others */ +*eax = 1; +*ebx = 0; +*ecx = 0; +*edx = 0; +} else if (count == 2) { +*eax = 0x100; /* YMM size */ +*ebx = 0x240; /* YMM offset */ +*ecx = 0; +*edx = 0; +} else { +*eax = 0; +*ebx = 0; +*ecx = 0; +*edx = 0; +} +break; case 0x8000: *eax = env-cpuid_xlevel; *ebx = env-cpuid_vendor1; -- 1.7.0.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] qemu: kvm: Enable XSAVE live migration support
Signed-off-by: Sheng Yang sh...@linux.intel.com --- qemu-kvm-x86.c| 77 + qemu-kvm.c| 12 +++ qemu-kvm.h| 14 + target-i386/cpu.h |5 +++ target-i386/machine.c | 20 + 5 files changed, 109 insertions(+), 19 deletions(-) diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 73b4af7..a2e2896 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -776,6 +776,7 @@ void kvm_arch_load_regs(CPUState *env, int level) { struct kvm_regs regs; struct kvm_fpu fpu; +struct kvm_xsave xsave; struct kvm_sregs sregs; struct kvm_msr_entry msrs[100]; int rc, n, i; @@ -806,16 +807,35 @@ void kvm_arch_load_regs(CPUState *env, int level) kvm_set_regs(env, regs); -memset(fpu, 0, sizeof fpu); -fpu.fsw = env-fpus ~(7 11); -fpu.fsw |= (env-fpstt 7) 11; -fpu.fcw = env-fpuc; -for (i = 0; i 8; ++i) - fpu.ftwx |= (!env-fptags[i]) i; -memcpy(fpu.fpr, env-fpregs, sizeof env-fpregs); -memcpy(fpu.xmm, env-xmm_regs, sizeof env-xmm_regs); -fpu.mxcsr = env-mxcsr; -kvm_set_fpu(env, fpu); +if (kvm_check_extension(kvm_state, KVM_CAP_XSAVE)) { +memset(xsave, 0, sizeof xsave); +xsave.i387.swd = env-fpus ~(7 11); +xsave.i387.swd |= (env-fpstt 7) 11; +xsave.i387.cwd = env-fpuc; +for (i = 0; i 8; ++i) +xsave.i387.twd |= (!env-fptags[i]) i; +memcpy(xsave.i387.st_space, env-fpregs, +sizeof env-fpregs); +memcpy(xsave.i387.xmm_space, env-xmm_regs, +sizeof env-xmm_regs); +xsave.i387.mxcsr = env-mxcsr; +xsave.xsave_hdr.xstate_bv = env-xstate_bv; +memcpy(xsave.ymmh.ymmh_space, env-ymmh_regs, +sizeof env-ymmh_regs); +xsave.xcr0 = env-xcr0; +kvm_set_xsave(env, xsave); +} else { +memset(fpu, 0, sizeof fpu); +fpu.fsw = env-fpus ~(7 11); +fpu.fsw |= (env-fpstt 7) 11; +fpu.fcw = env-fpuc; +for (i = 0; i 8; ++i) +fpu.ftwx |= (!env-fptags[i]) i; +memcpy(fpu.fpr, env-fpregs, sizeof env-fpregs); +memcpy(fpu.xmm, env-xmm_regs, sizeof env-xmm_regs); +fpu.mxcsr = env-mxcsr; +kvm_set_fpu(env, fpu); +} memset(sregs.interrupt_bitmap, 0, sizeof(sregs.interrupt_bitmap)); if (env-interrupt_injected = 0) { @@ -938,6 +958,7 @@ void 
kvm_arch_save_regs(CPUState *env) { struct kvm_regs regs; struct kvm_fpu fpu; +struct kvm_xsave xsave; struct kvm_sregs sregs; struct kvm_msr_entry msrs[100]; uint32_t hflags; @@ -969,15 +990,33 @@ void kvm_arch_save_regs(CPUState *env) env-eflags = regs.rflags; env-eip = regs.rip; -kvm_get_fpu(env, fpu); -env-fpstt = (fpu.fsw 11) 7; -env-fpus = fpu.fsw; -env-fpuc = fpu.fcw; -for (i = 0; i 8; ++i) - env-fptags[i] = !((fpu.ftwx i) 1); -memcpy(env-fpregs, fpu.fpr, sizeof env-fpregs); -memcpy(env-xmm_regs, fpu.xmm, sizeof env-xmm_regs); -env-mxcsr = fpu.mxcsr; +if (kvm_check_extension(kvm_state, KVM_CAP_XSAVE)) { +kvm_get_xsave(env, xsave); +env-fpstt = (xsave.i387.swd 11) 7; +env-fpus = xsave.i387.swd; +env-fpuc = xsave.i387.cwd; +for (i = 0; i 8; ++i) +env-fptags[i] = !((xsave.i387.twd i) 1); +memcpy(env-fpregs, xsave.i387.st_space, +sizeof env-fpregs); +memcpy(env-xmm_regs, xsave.i387.xmm_space, +sizeof env-xmm_regs); +env-mxcsr = xsave.i387.mxcsr; +env-xstate_bv = xsave.xsave_hdr.xstate_bv; +memcpy(env-ymmh_regs, xsave.ymmh.ymmh_space, +sizeof env-ymmh_regs); +env-xcr0 = xsave.xcr0; +} else { +kvm_get_fpu(env, fpu); +env-fpstt = (fpu.fsw 11) 7; +env-fpus = fpu.fsw; +env-fpuc = fpu.fcw; +for (i = 0; i 8; ++i) +env-fptags[i] = !((fpu.ftwx i) 1); +memcpy(env-fpregs, fpu.fpr, sizeof env-fpregs); +memcpy(env-xmm_regs, fpu.xmm, sizeof env-xmm_regs); +env-mxcsr = fpu.mxcsr; +} kvm_get_sregs(env, sregs); diff --git a/qemu-kvm.c b/qemu-kvm.c index 35a4c8a..c472b96 100644 --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -485,6 +485,18 @@ int kvm_set_mpstate(CPUState *env, struct kvm_mp_state *mp_state) } #endif +#ifdef KVM_CAP_XSAVE +int kvm_get_xsave(CPUState *env, struct kvm_xsave *xsave) +{ +return kvm_vcpu_ioctl(env, KVM_GET_XSAVE, xsave); +} + +int kvm_set_xsave(CPUState *env, struct kvm_xsave *xsave) +{ +return kvm_vcpu_ioctl(env, KVM_SET_XSAVE, xsave); +} +#endif + static int handle_mmio(CPUState *env) { unsigned long addr = env-kvm_run-mmio.phys_addr; diff --git a/qemu-kvm.h 
b/qemu-kvm.h index 6f6c6d8..0fb4638 100644 --- a/qemu-kvm.h +++ b/qemu-kvm.h @@ -300,6 +300,20 @@ int kvm_get_mpstate(CPUState *env, struct kvm_mp_state *mp_state);
Re: [PATCH v6] KVM: x86: Enable XSAVE/XRSTOR for guest
On 05/27/2010 12:47 PM, Sheng Yang wrote: From: Dexuan Cui dexuan@intel.com This patch enable guest to use XSAVE/XRSTOR instructions. We assume that host_xcr0 would use all possible bits that OS supported. And we loaded xcr0 in the same way we handled fpu - do it as late as we can. Looks good. What's the outlook on a unit test? -- error compiling committee.c: too many arguments to function
Re: [PATCH] KVM test: Add perfmon into the guest tests
On Thu, May 27, 2010 at 10:08:49AM +0200, Jes Sorensen wrote: On 05/25/10 05:05, Chen Cao wrote: perfmon2 API provides access to the hardware performance counters of modern processors. Dependency, To compile the source code of the test, the following packages should be installed, glibc-static-2.11.1-6.x86_64 glibc-headers-2.11.1-6.x86_64 glibc-common-2.11.1-6.x86_64 glibc-devel-2.11.1-6.x86_64 glibc-2.11.1-6.x86_64 Note, 1. libpfm uses the Performance Monitor Unit (PMU) on the processors, but this unit is not provided by kvm currently, i.e. the test should fail in kvm guests. 2. According to the README file of perfmon-tests-0.3, a 2.6.24 or higher Linux kernel (with perfmon v2.8 or higher) is needed to run the tests. I thought perfmon2 was deprecated in favor of perf_event.c ? The only reference left for perfmon2 in the kernel is in the ia64 tree, and KVM/ia64 seems to be pretty dead these days. Jes, Thank you for the reminder. I had not noticed your mail this afternoon and resent the patches for the perfmon test; it seems that I may have made a mistake. By the way, could you tell me how to verify that perfmon2 was deprecated in favor of perf_event.c, other than by looking into the kernel code? Regards, Cao, Chen 2010/05/27
[PATCH 1/3] vhost: fix to check the return value of copy_to/from_user() correctly
copy_to/from_user() returns the number of bytes that could not be copied. So we need to check if it is not zero, and in that case, we should return the error number -EFAULT rather than directly return the return value from copy_to/from_user(). Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- drivers/vhost/vhost.c | 51 ++-- 1 files changed, 28 insertions(+), 23 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 5c9c657..9633a3c 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -320,10 +320,8 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m) { struct vhost_memory mem, *newmem, *oldmem; unsigned long size = offsetof(struct vhost_memory, regions); - long r; - r = copy_from_user(mem, m, size); - if (r) - return r; + if (copy_from_user(mem, m, size)) + return -EFAULT; if (mem.padding) return -EOPNOTSUPP; if (mem.nregions VHOST_MEMORY_MAX_NREGIONS) @@ -333,11 +331,10 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m) return -ENOMEM; memcpy(newmem, mem, size); - r = copy_from_user(newmem-regions, m-regions, - mem.nregions * sizeof *m-regions); - if (r) { + if (copy_from_user(newmem-regions, m-regions, + mem.nregions * sizeof *m-regions)) { kfree(newmem); - return r; + return -EFAULT; } if (!memory_access_ok(d, newmem, vhost_has_feature(d, VHOST_F_LOG_ALL))) @@ -389,9 +386,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) r = -EBUSY; break; } - r = copy_from_user(s, argp, sizeof s); - if (r 0) + if (copy_from_user(s, argp, sizeof s)) { + r = -EFAULT; break; + } if (!s.num || s.num 0x || (s.num (s.num - 1))) { r = -EINVAL; break; @@ -405,9 +403,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) r = -EBUSY; break; } - r = copy_from_user(s, argp, sizeof s); - if (r 0) + if (copy_from_user(s, argp, sizeof s)) { + r = -EFAULT; break; + } if (s.num 0x) { r = -EINVAL; break; @@ -419,12 
+418,14 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) case VHOST_GET_VRING_BASE: s.index = idx; s.num = vq-last_avail_idx; - r = copy_to_user(argp, s, sizeof s); + if (copy_to_user(argp, s, sizeof s)) + r = -EFAULT; break; case VHOST_SET_VRING_ADDR: - r = copy_from_user(a, argp, sizeof a); - if (r 0) + if (copy_from_user(a, argp, sizeof a)) { + r = -EFAULT; break; + } if (a.flags ~(0x1 VHOST_VRING_F_LOG)) { r = -EOPNOTSUPP; break; @@ -477,9 +478,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) vq-used = (void __user *)(unsigned long)a.used_user_addr; break; case VHOST_SET_VRING_KICK: - r = copy_from_user(f, argp, sizeof f); - if (r 0) + if (copy_from_user(f, argp, sizeof f)) { + r = -EFAULT; break; + } eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd); if (IS_ERR(eventfp)) { r = PTR_ERR(eventfp); @@ -492,9 +494,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) filep = eventfp; break; case VHOST_SET_VRING_CALL: - r = copy_from_user(f, argp, sizeof f); - if (r 0) + if (copy_from_user(f, argp, sizeof f)) { + r = -EFAULT; break; + } eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd); if (IS_ERR(eventfp)) { r = PTR_ERR(eventfp); @@ -510,9 +513,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) filep = eventfp; break; case VHOST_SET_VRING_ERR: - r = copy_from_user(f, argp, sizeof f); - if (r
[PATCH 2/3] vhost-net: fix to check the return value of copy_to/from_user() correctly
copy_to/from_user() returns the number of bytes that could not be copied. So we need to check if it is not zero, and in that case, we should return the error number -EFAULT rather than directly return the return value from copy_to/from_user(). Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- drivers/vhost/net.c | 14 +++--- 1 files changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index aa88911..0f41c91 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -593,17 +593,17 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl, int r; switch (ioctl) { case VHOST_NET_SET_BACKEND: - r = copy_from_user(backend, argp, sizeof backend); - if (r 0) - return r; + if (copy_from_user(backend, argp, sizeof backend)) + return -EFAULT; return vhost_net_set_backend(n, backend.index, backend.fd); case VHOST_GET_FEATURES: features = VHOST_FEATURES; - return copy_to_user(featurep, features, sizeof features); + if (copy_to_user(featurep, features, sizeof features)) + return -EFAULT; + return 0; case VHOST_SET_FEATURES: - r = copy_from_user(features, featurep, sizeof features); - if (r 0) - return r; + if (copy_from_user(features, featurep, sizeof features)) + return -EFAULT; if (features ~VHOST_FEATURES) return -EOPNOTSUPP; return vhost_net_set_features(n, features); -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: x86: XSAVE/XRSTOR live migration support
On 05/27/2010 12:48 PM, Sheng Yang wrote: This patch enable save/restore of xsave state. Signed-off-by: Sheng Yangsh...@linux.intel.com --- arch/x86/include/asm/kvm.h | 29 arch/x86/kvm/x86.c | 79 include/linux/kvm.h|6 +++ Documentation/kvm/api.txt + +/* for KVM_CAP_XSAVE */ +struct kvm_xsave { + struct { + __u16 cwd; + __u16 swd; + __u16 twd; + __u16 fop; + __u64 rip; + __u64 rdp; + __u32 mxcsr; + __u32 mxcsr_mask; + __u32 st_space[32]; + __u32 xmm_space[64]; + __u32 padding[12]; + __u32 sw_reserved[12]; + } i387; + struct { + __u64 xstate_bv; + __u64 reserved1[2]; + __u64 reserved2[5]; + } xsave_hdr; + struct { + __u32 ymmh_space[64]; + } ymmh; + __u64 xcr0; + __u32 padding[256]; +}; Need to reserve way more space here for future xsave growth. I think at least 4K. LRB wa 32x512bit = 1K (though it probably isn't a candidate for vmx). Would be good to get an opinion from your processor architects. I don't think we need to detail the contents of the structures since they're described by the SDM; so we can have just a large array that is 1:1 with the xsave as saved by the fpu. If we do that then xcr0 needs to be in a separate structure, say kvm_xcr, with a flags field and reserved space of its own for future xcr growth. @@ -2363,6 +2366,59 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, return 0; } +static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, + struct kvm_xsave *guest_xsave) +{ + struct xsave_struct *xsave =vcpu-arch.guest_fpu.state-xsave; + + if (!cpu_has_xsave) + return; Hm, it would be nice to make it backward compatible and return the legacy fpu instead. I think the layouts are compatible? 
+ + guest_xsave-i387.cwd = xsave-i387.cwd; + guest_xsave-i387.swd = xsave-i387.swd; + guest_xsave-i387.twd = xsave-i387.twd; + guest_xsave-i387.fop = xsave-i387.fop; + guest_xsave-i387.rip = xsave-i387.rip; + guest_xsave-i387.rdp = xsave-i387.rdp; + memcpy(guest_xsave-i387.st_space, xsave-i387.st_space, 128); + memcpy(guest_xsave-i387.xmm_space, xsave-i387.xmm_space, + sizeof guest_xsave-i387.xmm_space); + + guest_xsave-xsave_hdr.xstate_bv = xsave-xsave_hdr.xstate_bv; + memcpy(guest_xsave-ymmh.ymmh_space, xsave-ymmh.ymmh_space, + sizeof xsave-ymmh.ymmh_space); And we can do a big memcpy here. But we need to limit it to what the host actually allocated. + + guest_xsave-xcr0 = vcpu-arch.xcr0; +} + long kvm_arch_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -2564,6 +2620,29 @@ long kvm_arch_vcpu_ioctl(struct file *filp, r = kvm_vcpu_ioctl_x86_set_debugregs(vcpu,dbgregs); break; } + case KVM_GET_XSAVE: { + struct kvm_xsave xsave; Too big for stack (especially if we reserve room for growth). diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 23ea022..5006761 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -524,6 +524,9 @@ struct kvm_enable_cap { #define KVM_CAP_PPC_OSI 52 #define KVM_CAP_PPC_UNSET_IRQ 53 #define KVM_CAP_ENABLE_CAP 54 +#ifdef __KVM_HAVE_XSAVE +#define KVM_CAP_XSAVE 55 +#endif Might make sense to have a separate KVM_CAP_XCR, just for consistency. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] vhost: fix the memory leak which will happen when memory_access_ok fails
We need to free newmem when vhost_set_memory() fails to complete. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- drivers/vhost/vhost.c | 4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 9633a3c..1241a22 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -337,8 +337,10 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m) return -EFAULT; } - if (!memory_access_ok(d, newmem, vhost_has_feature(d, VHOST_F_LOG_ALL))) + if (!memory_access_ok(d, newmem, vhost_has_feature(d, VHOST_F_LOG_ALL))) { + kfree(newmem); return -EFAULT; + } oldmem = d->memory; rcu_assign_pointer(d->memory, newmem); synchronize_rcu(); -- 1.7.0.4
Re: [PATCH] KVM test: Add perfmon into the guest tests
On 05/27/10 11:53, Chen Cao wrote: Jes, Thank you for reminding. I have not noticed your mail this afternoon and resent the patches for the perfmon test, it seems that i may have made a mistake. by the way, could you tell me how to verify that perfmon2 was deprecated in favor of perf_event.c, except looking into the kernel code? Hi Chen, I went and checked with people in the community, and I can confirm that perfmon2 is deprecated for everything but ia64. Since KVM/ia64 is pretty stale and I doubt it even works, I would not recommend for pushing perfmon2 tests into autotest at this point. As for how to see it, then I don't think there really is a way except for looking at the code. Cheers, Jes PS: Please don't CC closed mailing lists when posting to public lists like the KVM list. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: Fix order passed to iommu_unmap
On Wed, May 26, 2010 at 03:36:33PM -0400, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com This is obviously a left-over from the old interface taking the size. Apparently a mostly harmless issue with the current iommu_unmap implementation. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Uh, good catch. Acked-by: Joerg Roedel joerg.roe...@amd.com --- I think this one is even closer to the right order. :) virt/kvm/iommu.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c index 5adc578..673c88a 100644 --- a/virt/kvm/iommu.c +++ b/virt/kvm/iommu.c @@ -273,7 +273,7 @@ static void kvm_iommu_put_pages(struct kvm *kvm, pfn = phys >> PAGE_SHIFT; /* Unmap address from IO address space */ - order = iommu_unmap(domain, gfn_to_gpa(gfn), PAGE_SIZE); + order = iommu_unmap(domain, gfn_to_gpa(gfn), 0); unmap_pages = 1ULL << order; /* Unpin all pages we just unmapped to not leak any memory */
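The mistake above is easy to make because iommu_unmap() takes a page *order* (log2 of the number of base pages), not a byte size, so passing PAGE_SIZE (4096) as the third argument asked for an absurdly large region. A small standalone sketch of the order arithmetic (4K base pages assumed):

```c
#include <assert.h>

/* An IOMMU unmap "order" is log2 of the number of base pages covered:
 * order 0 -> 1 page, order 9 -> 512 pages (one 2M huge mapping), etc. */
static unsigned long long pages_for_order(unsigned int order)
{
    return 1ULL << order;
}

static unsigned long long bytes_for_order(unsigned int order)
{
    return pages_for_order(order) << 12; /* assumes a 4K base page */
}
```

So the correct call for a single 4K page passes order 0, and the returned order feeds the follow-up computation `unmap_pages = 1ULL << order` in the patched code.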
Re: [PATCH 2/2] KVM: MMU: fix relaxing permission
On 05/26/2010 05:46 AM, Xiao Guangrong wrote: There is a relaxing permission operation in set_spte(): if the guest's CR0.WP is not set and a R/W #PF occurs at supervisor level, the mapping path might be set to writable, and the user is then wrongly allowed to write. @@ -1859,8 +1859,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, spte |= (u64)pfn << PAGE_SHIFT; - if ((pte_access & ACC_WRITE_MASK) - || (write_fault && !is_write_protection(vcpu) && !user_fault)) { + if (pte_access & ACC_WRITE_MASK) { The host always sets cr0.wp (in shadow mode) so we can write protect page tables. So when the guest clears cr0.wp, we emulate a gpte with gpte.w=0 and gpte.u=1 in two ways: - spte.w=1, spte.u=0: this will allow the guest kernel to write but trap on guest user access - spte.w=0, spte.u=1: allows guest user access but traps on guest kernel writes If the guest attempts an access that is currently disallowed, we switch to the other spte encoding.
Re: [PATCH 1/2] KVM: SVM: fix compiling warning
On 05/26/2010 05:44 AM, Xiao Guangrong wrote: fix: arch/x86/kvm/svm.c: In function ‘is_erratum_383’: arch/x86/kvm/svm.c:1459: warning: integer constant is too large for ‘long’ type arch/x86/kvm/svm.c: In function ‘svm_handle_mce’: arch/x86/kvm/svm.c:1490: warning: unused variable ‘kvm_run’ Applied (partially, got the other part from Jan already), thanks.
Re: [PATCH 1/4] Fix tboot enabled macro
On 05/27/2010 12:27 PM, Wang, Shane wrote: Jan Kiszka wrote: The latter. As we have no clue about the actual state (tboot is not exported on older kernels), we are forced to assume some reasonable state. Are you trying to load the latest KVM on the older kernels? He is, look at kvm-kmod: http://www.linux-kvm.org/page/Code#building_an_external_module_with_older_kernels (Jan was tricked into becoming the kvm-kmod maintainer)
Re: [PATCH v6] KVM: x86: Enable XSAVE/XRSTOR for guest
On Thursday 27 May 2010 17:52:36 Avi Kivity wrote: On 05/27/2010 12:47 PM, Sheng Yang wrote: From: Dexuan Cui dexuan@intel.com This patch enables the guest to use XSAVE/XRSTOR instructions. We assume that host_xcr0 would use all possible bits that the OS supported. And we loaded xcr0 in the same way we handled fpu - do it as late as we can. Looks good. What's the outlook on a unit test? Working on it now... -- regards Yang, Sheng
Re: [PATCH] KVM: x86: XSAVE/XRSTOR live migration support
On Thursday 27 May 2010 18:02:31 Avi Kivity wrote: On 05/27/2010 12:48 PM, Sheng Yang wrote: This patch enable save/restore of xsave state. Signed-off-by: Sheng Yangsh...@linux.intel.com --- arch/x86/include/asm/kvm.h | 29 arch/x86/kvm/x86.c | 79 include/linux/kvm.h |6 +++ Documentation/kvm/api.txt + Yes... +/* for KVM_CAP_XSAVE */ +struct kvm_xsave { + struct { + __u16 cwd; + __u16 swd; + __u16 twd; + __u16 fop; + __u64 rip; + __u64 rdp; + __u32 mxcsr; + __u32 mxcsr_mask; + __u32 st_space[32]; + __u32 xmm_space[64]; + __u32 padding[12]; + __u32 sw_reserved[12]; + } i387; + struct { + __u64 xstate_bv; + __u64 reserved1[2]; + __u64 reserved2[5]; + } xsave_hdr; + struct { + __u32 ymmh_space[64]; + } ymmh; + __u64 xcr0; + __u32 padding[256]; +}; Need to reserve way more space here for future xsave growth. I think at least 4K. LRB wa 32x512bit = 1K (though it probably isn't a candidate for vmx). Would be good to get an opinion from your processor architects. Would check it. I don't think we need to detail the contents of the structures since they're described by the SDM; so we can have just a large array that is 1:1 with the xsave as saved by the fpu. Um, I've tried that, but failed mysteriously... Would check what's wrong. If we do that then xcr0 needs to be in a separate structure, say kvm_xcr, with a flags field and reserved space of its own for future xcr growth. I meant to put it into sregs, but found it's already full... How about extended sregs? @@ -2363,6 +2366,59 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, return 0; } +static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, + struct kvm_xsave *guest_xsave) +{ + struct xsave_struct *xsave =vcpu-arch.guest_fpu.state-xsave; + + if (!cpu_has_xsave) + return; Hm, it would be nice to make it backward compatible and return the legacy fpu instead. I think the layouts are compatible? Sound good. 
But seems we still need KVM_CAP_XSAVE to use this interface, and other processors would still go FPU interface. Seems didn't improve much? + + guest_xsave-i387.cwd = xsave-i387.cwd; + guest_xsave-i387.swd = xsave-i387.swd; + guest_xsave-i387.twd = xsave-i387.twd; + guest_xsave-i387.fop = xsave-i387.fop; + guest_xsave-i387.rip = xsave-i387.rip; + guest_xsave-i387.rdp = xsave-i387.rdp; + memcpy(guest_xsave-i387.st_space, xsave-i387.st_space, 128); + memcpy(guest_xsave-i387.xmm_space, xsave-i387.xmm_space, + sizeof guest_xsave-i387.xmm_space); + + guest_xsave-xsave_hdr.xstate_bv = xsave-xsave_hdr.xstate_bv; + memcpy(guest_xsave-ymmh.ymmh_space, xsave-ymmh.ymmh_space, + sizeof xsave-ymmh.ymmh_space); And we can do a big memcpy here. But we need to limit it to what the host actually allocated. Would try. + + guest_xsave-xcr0 = vcpu-arch.xcr0; +} + long kvm_arch_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -2564,6 +2620,29 @@ long kvm_arch_vcpu_ioctl(struct file *filp, r = kvm_vcpu_ioctl_x86_set_debugregs(vcpu,dbgregs); break; } + case KVM_GET_XSAVE: { + struct kvm_xsave xsave; Too big for stack (especially if we reserve room for growth). Oops... diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 23ea022..5006761 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -524,6 +524,9 @@ struct kvm_enable_cap { #define KVM_CAP_PPC_OSI 52 #define KVM_CAP_PPC_UNSET_IRQ 53 #define KVM_CAP_ENABLE_CAP 54 +#ifdef __KVM_HAVE_XSAVE +#define KVM_CAP_XSAVE 55 +#endif Might make sense to have a separate KVM_CAP_XCR, just for consistency. Maybe EXTENDED_SREGS? But still every future field in the struct need a CAP... -- regards Yang, Sheng -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] vhost: fix to check the return value of copy_to/from_user() correctly
On Thu, May 27, 2010 at 06:58:03PM +0900, Takuya Yoshikawa wrote: copy_to/from_user() returns the number of bytes that could not be copied. So we need to check if it is not zero, and in that case, we should return the error number -EFAULT rather than directly return the return value from copy_to/from_user(). Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp Thanks, applied. --- drivers/vhost/vhost.c | 51 ++-- 1 files changed, 28 insertions(+), 23 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 5c9c657..9633a3c 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -320,10 +320,8 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m) { struct vhost_memory mem, *newmem, *oldmem; unsigned long size = offsetof(struct vhost_memory, regions); - long r; - r = copy_from_user(mem, m, size); - if (r) - return r; + if (copy_from_user(mem, m, size)) + return -EFAULT; if (mem.padding) return -EOPNOTSUPP; if (mem.nregions VHOST_MEMORY_MAX_NREGIONS) @@ -333,11 +331,10 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m) return -ENOMEM; memcpy(newmem, mem, size); - r = copy_from_user(newmem-regions, m-regions, -mem.nregions * sizeof *m-regions); - if (r) { + if (copy_from_user(newmem-regions, m-regions, +mem.nregions * sizeof *m-regions)) { kfree(newmem); - return r; + return -EFAULT; } if (!memory_access_ok(d, newmem, vhost_has_feature(d, VHOST_F_LOG_ALL))) @@ -389,9 +386,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) r = -EBUSY; break; } - r = copy_from_user(s, argp, sizeof s); - if (r 0) + if (copy_from_user(s, argp, sizeof s)) { + r = -EFAULT; break; + } if (!s.num || s.num 0x || (s.num (s.num - 1))) { r = -EINVAL; break; @@ -405,9 +403,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) r = -EBUSY; break; } - r = copy_from_user(s, argp, sizeof s); - if (r 0) + if (copy_from_user(s, argp, sizeof 
s)) { + r = -EFAULT; break; + } if (s.num 0x) { r = -EINVAL; break; @@ -419,12 +418,14 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) case VHOST_GET_VRING_BASE: s.index = idx; s.num = vq-last_avail_idx; - r = copy_to_user(argp, s, sizeof s); + if (copy_to_user(argp, s, sizeof s)) + r = -EFAULT; break; case VHOST_SET_VRING_ADDR: - r = copy_from_user(a, argp, sizeof a); - if (r 0) + if (copy_from_user(a, argp, sizeof a)) { + r = -EFAULT; break; + } if (a.flags ~(0x1 VHOST_VRING_F_LOG)) { r = -EOPNOTSUPP; break; @@ -477,9 +478,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) vq-used = (void __user *)(unsigned long)a.used_user_addr; break; case VHOST_SET_VRING_KICK: - r = copy_from_user(f, argp, sizeof f); - if (r 0) + if (copy_from_user(f, argp, sizeof f)) { + r = -EFAULT; break; + } eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd); if (IS_ERR(eventfp)) { r = PTR_ERR(eventfp); @@ -492,9 +494,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) filep = eventfp; break; case VHOST_SET_VRING_CALL: - r = copy_from_user(f, argp, sizeof f); - if (r 0) + if (copy_from_user(f, argp, sizeof f)) { + r = -EFAULT; break; + } eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd); if (IS_ERR(eventfp)) { r = PTR_ERR(eventfp); @@ -510,9 +513,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) filep = eventfp; break; case VHOST_SET_VRING_ERR: - r = copy_from_user(f, argp, sizeof
Re: [PATCH 2/3] vhost-net: fix to check the return value of copy_to/from_user() correctly
On Thu, May 27, 2010 at 07:01:58PM +0900, Takuya Yoshikawa wrote: copy_to/from_user() returns the number of bytes that could not be copied. So we need to check if it is not zero, and in that case, we should return the error number -EFAULT rather than directly return the return value from copy_to/from_user(). Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp Thanks, applied. --- drivers/vhost/net.c | 14 +++--- 1 files changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index aa88911..0f41c91 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -593,17 +593,17 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl, int r; switch (ioctl) { case VHOST_NET_SET_BACKEND: - r = copy_from_user(backend, argp, sizeof backend); - if (r 0) - return r; + if (copy_from_user(backend, argp, sizeof backend)) + return -EFAULT; return vhost_net_set_backend(n, backend.index, backend.fd); case VHOST_GET_FEATURES: features = VHOST_FEATURES; - return copy_to_user(featurep, features, sizeof features); + if (copy_to_user(featurep, features, sizeof features)) + return -EFAULT; + return 0; case VHOST_SET_FEATURES: - r = copy_from_user(features, featurep, sizeof features); - if (r 0) - return r; + if (copy_from_user(features, featurep, sizeof features)) + return -EFAULT; if (features ~VHOST_FEATURES) return -EOPNOTSUPP; return vhost_net_set_features(n, features); -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
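The root cause in both vhost patches above is the same: copy_to_user()/copy_from_user() return the number of bytes they could *not* copy — always >= 0 — so a `if (r < 0)` check can never fire, and a partial copy is silently treated as success. A userspace sketch with a stand-in copier (function names here are hypothetical, not kernel API) shows the difference between the broken and fixed patterns:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for copy_from_user(): returns bytes NOT copied (0 on success).
 * The 'fail' parameter simulates how many trailing bytes fault. */
static unsigned long fake_copy_from_user(void *dst, const void *src,
                                         unsigned long n, unsigned long fail)
{
    if (fail > n)
        fail = n;
    memcpy(dst, src, n - fail);
    return fail; /* never negative */
}

/* Old, broken pattern: treats the return value as a negative errno. */
static long ioctl_old(unsigned long fail)
{
    char dst[16], src[16] = { 0 };
    long r = (long)fake_copy_from_user(dst, src, sizeof(dst), fail);
    if (r < 0)          /* never true: r is a byte count */
        return r;
    return 0;           /* partial copy reported as success */
}

/* Fixed pattern: any nonzero return becomes -EFAULT. */
static long ioctl_new(unsigned long fail)
{
    char dst[16], src[16] = { 0 };
    if (fake_copy_from_user(dst, src, sizeof(dst), fail))
        return -14;     /* -EFAULT */
    return 0;
}
```

With 4 bytes faulting, the old check "succeeds" while the fixed one correctly reports -EFAULT.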
Re: [PATCH 3/3] vhost: fix the memory leak which will happen when memory_access_ok fails
On Thu, May 27, 2010 at 07:03:56PM +0900, Takuya Yoshikawa wrote: We need to free newmem when vhost_set_memory() fails to complete. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- Thanks, applied. drivers/vhost/vhost.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 9633a3c..1241a22 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -337,8 +337,10 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m) return -EFAULT; } - if (!memory_access_ok(d, newmem, vhost_has_feature(d, VHOST_F_LOG_ALL))) + if (!memory_access_ok(d, newmem, vhost_has_feature(d, VHOST_F_LOG_ALL))) { + kfree(newmem); return -EFAULT; + } oldmem = d-memory; rcu_assign_pointer(d-memory, newmem); synchronize_rcu(); -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
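The fix adds a kfree() on the failing branch. The invariant worth spelling out: after the allocation, every exit path must either free the buffer or hand ownership off (in the real function, to rcu_assign_pointer()). A standalone userspace sketch of the control flow, with names simplified from vhost_set_memory() and error codes as plain integers:

```c
#include <assert.h>
#include <stdlib.h>

/* Every exit after malloc() must free or publish the buffer exactly once. */
static int set_memory_sketch(int copy_ok, int access_ok)
{
    void *newmem = malloc(128);

    if (!newmem)
        return -12;         /* -ENOMEM */
    if (!copy_ok) {
        free(newmem);       /* copy from user failed */
        return -14;         /* -EFAULT */
    }
    if (!access_ok) {
        free(newmem);       /* the branch the patch fixes: was leaked */
        return -14;
    }
    free(newmem);           /* the real code publishes via RCU instead */
    return 0;
}
```

Tools like kmemleak (or valgrind for this userspace sketch) catch the original version, where the `!access_ok` branch returned without freeing.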
[GIT PULL net-next-2.6] vhost-net cleanups
David, The following tree is on top of net-next-2.6. Please merge it for 2.6.36. Thanks! The following changes since commit 7a9b149212f3716c598afe973b6261fd58453b7a: Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6 (2010-05-20 21:26:12 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next Jeff Dike (1): vhost-net: minor cleanup Michael S. Tsirkin (1): vhost: whitespace fix Tobias Klauser (1): vhost: Storage class should be before const qualifier drivers/vhost/net.c | 13 ++--- drivers/vhost/vhost.c |4 ++-- 2 files changed, 8 insertions(+), 9 deletions(-) -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM: MMU: fix relaxing permission
Avi Kivity wrote: On 05/26/2010 05:46 AM, Xiao Guangrong wrote: There is a relaxing permission operation in set_spte(): if guest's CR0.WP is not set and R/W #PF occurs in supervisor-level, the mapping path might set to writable, then user can allow to write. @@ -1859,8 +1859,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, spte |= (u64)pfn PAGE_SHIFT; -if ((pte_access ACC_WRITE_MASK) -|| (write_fault !is_write_protection(vcpu) !user_fault)) { +if (pte_access ACC_WRITE_MASK) { The host always sets cr0.wp (in shadow mode) so we can write protect page tables. So when the guest clears cr0.wp, we emulate a gpte with gpte.w=0 and gpte.u=1 in two ways: - spte.w=1, spte.u=0: this will allow the guest kernel to write but trap on guest user access - spte.w=0, spte.u=1: allows guest user access but traps on guest kernel writes If the guest attempts an access that is currently disallowed, we switch to the other spte encoding. Avi, Thanks for your explanation, but i not see where to implement what you say, could you please point it out for me? :-( And, i think use 'spte.w=1, spte.u=0' to emulate 'guest cr0.wp=0 and gpte.w=0' is not a good way since it can completely stop user process access, but in this case, user process is usually read and kernel lazily to write, just like vdso, it will generate a lots of #PF -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] KVM: MMU: fix relaxing permission
On 05/27/2010 02:00 PM, Xiao Guangrong wrote: The host always sets cr0.wp (in shadow mode) so we can write protect page tables. So when the guest clears cr0.wp, we emulate a gpte with gpte.w=0 and gpte.u=1 in two ways: - spte.w=1, spte.u=0: this will allow the guest kernel to write but trap on guest user access - spte.w=0, spte.u=1: allows guest user access but traps on guest kernel writes If the guest attempts an access that is currently disallowed, we switch to the other spte encoding. Avi, Thanks for your explanation, but i not see where to implement what you say, could you please point it out for me? :-( b70ccb0b3fd removed it accidentally: - } else - /* -* Kernel mode access. Fail if it's a read-only page and -* supervisor write protection is enabled. -*/ - if (!writable_shadow) { - if (is_write_protection(vcpu)) - return 0; - *shadow_ent = ~PT_USER_MASK; - } :( And, i think use 'spte.w=1, spte.u=0' to emulate 'guest cr0.wp=0 and gpte.w=0' is not a good way since it can completely stop user process access, but in this case, user process is usually read and kernel lazily to write, just like vdso, it will generate a lots of #PF As soon as the guest kernel stops writing we switch back to gpte.w=gpte.u=1 and the guest can access it completely. For the case where both the kernel and userspace use interleaved access, you are right, but I don't see a better way, do you? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: x86: XSAVE/XRSTOR live migration support
On 05/27/2010 01:33 PM, Sheng Yang wrote: If we do that then xcr0 needs to be in a separate structure, say kvm_xcr, with a flags field and reserved space of its own for future xcr growth. I meant to put it into sregs, but found it's already full... How about extended sregs? Isn't this what xcr means? xtended control register? +static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, + struct kvm_xsave *guest_xsave) +{ + struct xsave_struct *xsave =vcpu-arch.guest_fpu.state-xsave; + + if (!cpu_has_xsave) + return; Hm, it would be nice to make it backward compatible and return the legacy fpu instead. I think the layouts are compatible? Sound good. But seems we still need KVM_CAP_XSAVE to use this interface, and other processors would still go FPU interface. Seems didn't improve much? I would like the new interface to be used in all cases, this way we can deprecate the old one in a few years. diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 23ea022..5006761 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -524,6 +524,9 @@ struct kvm_enable_cap { #define KVM_CAP_PPC_OSI 52 #define KVM_CAP_PPC_UNSET_IRQ 53 #define KVM_CAP_ENABLE_CAP 54 +#ifdef __KVM_HAVE_XSAVE +#define KVM_CAP_XSAVE 55 +#endif Might make sense to have a separate KVM_CAP_XCR, just for consistency. Maybe EXTENDED_SREGS? But still every future field in the struct need a CAP... Might do struct kvm_xcr { __u32 xcr; __u32 reserved; __u64 value; }; struct kvm_xcrs { __u32 nr_xcrs; __u32 flags; struct kvm_xcr xcrs[KVM_MAX_XCRS]; ... reserved; }; which would allow new xcrs to be added easily. You'll need to change kvm_set_xcr0() to kvm_set_xcr() for this to work though. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
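Avi's struct sketch relies on a common ABI pattern: a counted, flagged array lets new registers be added later without defining a new ioctl or capability per field. A concrete (hypothetical — the cap and padding sizes are illustrative, not the merged ABI) layout, with the validation a kernel handler would do before iterating:

```c
#include <assert.h>
#include <stdint.h>

#define KVM_MAX_XCRS 16       /* hypothetical cap on array entries */

struct kvm_xcr {
    uint32_t xcr;             /* which extended control register (0 = XCR0) */
    uint32_t reserved;
    uint64_t value;
};

struct kvm_xcrs {
    uint32_t nr_xcrs;         /* how many entries of xcrs[] are valid */
    uint32_t flags;           /* must be zero until a meaning is defined */
    struct kvm_xcr xcrs[KVM_MAX_XCRS];
    uint64_t padding[16];     /* room for future growth */
};

/* Reject out-of-range counts and unknown flags before touching the array. */
static int xcrs_valid(const struct kvm_xcrs *x)
{
    return x->nr_xcrs <= KVM_MAX_XCRS && x->flags == 0;
}
```

Because each kvm_xcr entry names its own register, adding a future XCR needs no struct change at all — only a larger nr_xcrs from userspace.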
[PATCH 1/3] KVM: MMU: Allow spte.w=1 for gpte.w=0 and cr0.wp=0 only in shadow mode
When tdp is enabled, the guest's cr0.wp shouldn't have any effect on spte permissions. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/mmu.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 136e160..39dd8d3 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1881,7 +1881,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, spte |= (u64)pfn << PAGE_SHIFT; if ((pte_access & ACC_WRITE_MASK) - || (write_fault && !is_write_protection(vcpu) && !user_fault)) { + || (!tdp_enabled && write_fault && !is_write_protection(vcpu) && + !user_fault)) { if (level > PT_PAGE_TABLE_LEVEL && has_wrprotected_page(vcpu->kvm, gfn, level)) { -- 1.7.1
[PATCH 0/3] Fix cr0.wp=0 emulation
Xiao didn't find the nice cr0.wp=0 trick because it wasn't there, removed accidentally by b70ccb0b3fd. Restore the trick and document it. Avi Kivity (3): KVM: MMU: Allow spte.w=1 for gpte.w=0 and cr0.wp=0 only in shadow mode KVM: MMU: Remove user access when allowing kernel access to gpte.w=0 page KVM: MMU: Document cr0.wp emulation Documentation/kvm/mmu.txt | 18 ++ arch/x86/kvm/mmu.c | 6 +- 2 files changed, 23 insertions(+), 1 deletions(-)
[PATCH 2/3] KVM: MMU: Remove user access when allowing kernel access to gpte.w=0 page
If cr0.wp=0, we have to allow the guest kernel access to a page with pte.w=0. We do that by setting spte.w=1, since the host cr0.wp must remain set so the host can write protect pages. Once we allow write access, we must remove user access otherwise we mistakenly allow the user to write the page. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/mmu.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 39dd8d3..56f8c3c 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1894,6 +1894,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, spte |= PT_WRITABLE_MASK; + if (!tdp_enabled && !(pte_access & ACC_WRITE_MASK)) + spte &= ~PT_USER_MASK; + /* * Optimization: for pte sync, if spte was writable the hash * lookup is unnecessary (and expensive). Write protection -- 1.7.1
[PATCH 3/3] KVM: MMU: Document cr0.wp emulation
Signed-off-by: Avi Kivity a...@redhat.com --- Documentation/kvm/mmu.txt | 18 ++ 1 files changed, 18 insertions(+), 0 deletions(-) diff --git a/Documentation/kvm/mmu.txt b/Documentation/kvm/mmu.txt index 2201dcb..6528ad9 100644 --- a/Documentation/kvm/mmu.txt +++ b/Documentation/kvm/mmu.txt @@ -298,6 +298,24 @@ Host translation updates: - look up affected sptes through reverse map - drop (or update) translations +Emulating cr0.wp + + +If tdp is not enabled, the host must keep cr0.wp=1 so page write protection +works for the guest kernel, not guest userspace. When the guest +cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0, +we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte. + +We handle this by mapping the permissions to two possible sptes, depending +on fault type: + +- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access, + disallows user access) +- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel + write access) + +(user write faults generate a #PF) + Further reading === -- 1.7.1
Re: [PATCH 2/2] KVM: MMU: fix relaxing permission
On 05/27/2010 02:18 PM, Avi Kivity wrote: And, i think use 'spte.w=1, spte.u=0' to emulate 'guest cr0.wp=0 and gpte.w=0' is not a good way since it can completely stop user process access, but in this case, user process is usually read and kernel lazily to write, just like vdso, it will generate a lots of #PF As soon as the guest kernel stops writing we switch back to gpte.w=gpte.u=1 and the guest can access it completely. For the case where both the kernel and userspace use interleaved access, you are right, but I don't see a better way, do you? To expand, we only set spte.w=1 on write faults. So if the guest only reads the page, we'll instantiate an spte with u=1 and w=0.
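The switching scheme Avi describes can be condensed into a small decision function: for a gpte with w=0, u=1 under guest cr0.wp=0, the spte encoding is chosen by which kind of access faulted. The bit values below are illustrative, not the real x86 spte layout:

```c
#include <assert.h>

#define SPTE_W (1u << 1)   /* illustrative writable bit */
#define SPTE_U (1u << 2)   /* illustrative user-accessible bit */

/* Guest cr0.wp=0, gpte.w=0, gpte.u=1: the host keeps cr0.wp=1, so no
 * single spte grants both kernel writes and user reads; we pick one
 * encoding per fault and re-encode when the other access traps. */
static unsigned int spte_for_fault(int write_fault, int user_fault)
{
    if (write_fault && !user_fault)
        return SPTE_W;     /* kernel write ok; user access traps */
    return SPTE_U;         /* user access ok; kernel write traps */
}
```

On a read-only workload only the SPTE_U encoding is ever instantiated, which is why the extra faults Xiao worries about occur only under interleaved kernel-write/user-read access.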
[GIT PULL net-2.6] vhost-net: error handling fixes
David, The following tree includes fixes dealing with error handling in vhost-net. It is on top of net-2.6. Please merge it for 2.6.35. Thanks! The following changes since commit 8a74ad60a546b13bd1096b2a61a7a5c6fd9ae17c: net: fix lock_sock_bh/unlock_sock_bh (2010-05-27 00:30:53 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net Krishna Kumar (1): vhost: Fix host panic if ioctl called with wrong index Takuya Yoshikawa (3): vhost: fix to check the return value of copy_to/from_user() correctly vhost-net: fix to check the return value of copy_to/from_user() correctly vhost: fix the memory leak which will happen when memory_access_ok fails drivers/vhost/net.c | 14 ++-- drivers/vhost/vhost.c | 57 +++- 2 files changed, 39 insertions(+), 32 deletions(-) -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] device-assignment: don't truncate MSIX capabilities table size
On 05/26/2010 02:08 AM, Alex Williamson wrote: PCI_MSIX_TABSIZE is 0x07ff Applied, thanks.
Re: [PATCHv2-RFC 0/2] virtio: put last seen used index into ring itself
On 05/26/2010 10:50 PM, Michael S. Tsirkin wrote: Here's a rewrite of the original patch with a new layout. I haven't tested it yet so no idea how this performs, but I think this addresses the cache bounce issue raised by Avi. Posting for early flames/comments. Generally, the Host end of the virtio ring doesn't need to see where Guest is up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed. For example, host can reduce the number of interrupts by detecting that the guest is currently handling previous buffers. We add a feature bit so the guest can tell the host that it's writing out the current value there, if it wants to use that. This differs from original approach in that the used index is put after avail index (they are typically written out together). To avoid cache bounces on descriptor access, and make future extensions easier, we put the ring itself at start of page, and move the control after it. I missed the spec patch, can you repost it? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] KVM: MMU: Remove user access when allowing kernel access to gpte.w=0 page
Avi Kivity wrote: If cr0.wp=0, we have to allow the guest kernel access to a page with pte.w=0. We do that by setting spte.w=1, since the host cr0.wp must remain set so the host can write protect pages. Once we allow write access, we must remove user access otherwise we mistakenly allow the user to write the page. Yeah, it's really a nice way :-) Reviewed-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/mmu.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 39dd8d3..56f8c3c 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1894,6 +1894,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, spte |= PT_WRITABLE_MASK; + if (!tdp_enabled && !(pte_access & ACC_WRITE_MASK)) + spte &= ~PT_USER_MASK; + /* * Optimization: for pte sync, if spte was writable the hash * lookup is unnecessary (and expensive). Write protection
Re: [PATCH] KVM: x86: Propagate fpu_alloc errors
On 05/25/2010 05:01 PM, Jan Kiszka wrote: Memory allocation may fail. Propagate such errors. Applied, thanks. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/svm.c | 7 ++- arch/x86/kvm/vmx.c | 4 +++- arch/x86/kvm/x86.c | 11 +-- An indication that fpu init should be in common code...

Re: [PATCH] kvm mmu: optimizations when tdp is in use
On Thu, May 27, 2010 at 04:06:34PM +0800, Gui Jianfeng wrote: In case of using tdp, checking for write-protected pages isn't needed, and the quadrant doesn't need to be calculated either. Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 0bb9f17..ce4bbd3 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -495,10 +495,13 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn) max_level = kvm_x86_ops->get_lpage_level() < host_level ? kvm_x86_ops->get_lpage_level() : host_level; + if (tdp_enabled) + goto done; + This is wrong. write_count is initialized for alignment purposes, not only write protected pages. See __kvm_set_memory_region in virt/kvm/kvm_main.c. for (level = PT_DIRECTORY_LEVEL; level <= max_level; ++level) if (has_wrprotected_page(vcpu->kvm, large_gfn, level)) break; - +done: return level - 1; } @@ -1346,7 +1349,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, if (role.direct) role.cr4_pae = 0; role.access = access; - if (vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) { + if (!tdp_enabled && vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) { quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level)); quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1; role.quadrant = quadrant; -- 1.6.5.2
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
On 05/27, Michael S. Tsirkin wrote: On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote: Add a new kernel API to create a singlethread workqueue and attach it's task to current task's cgroup and cpumask. Signed-off-by: Sridhar Samudrala s...@us.ibm.com Could someone familiar with workqueue code please comment on whether this patch is suitable for 2.6.35? It is needed to fix the case where vhost user might cause a kernel thread to consume more CPU than allowed by the cgroup. Should I merge it through the vhost tree? Ack for this? I don't understand the reasons for this patch, but this doesn't matter. I don't really see any need to change workqueue.c, +static struct task_struct *get_singlethread_wq_task(struct workqueue_struct *wq) +{ + return (per_cpu_ptr(wq-cpu_wq, singlethread_cpu))-thread; +} (Not sure this trivial static helper with the single caller makes sense, but see below) +/* Create a singlethread workqueue and attach it's task to the current task's + * cgroup and set it's cpumask to the current task's cpumask. 
+ */ +struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name) +{ + struct workqueue_struct *wq; + struct task_struct *task; + cpumask_var_t mask; + + wq = create_singlethread_workqueue(name); + if (!wq) + goto out; + + if (!alloc_cpumask_var(mask, GFP_KERNEL)) + goto err; + + if (sched_getaffinity(current-pid, mask)) + goto err; + + task = get_singlethread_wq_task(wq); + if (sched_setaffinity(task-pid, mask)) + goto err; + + if (cgroup_attach_task_current_cg(task)) + goto err; +out: + return wq; +err: + destroy_workqueue(wq); + wq = NULL; + goto out; +} Instead, cgroup.c (or whoever needs this) can do struct move_struct { struct work_struct work; int ret; }; static void move_func(struct work_struct *work) { struct move_struct *move = container_of(...); if (cgroup_attach_task_current_cg(current)) ret = -EANY; } static struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name) { struct workqueue_struct *wq; struct move_struct move = { .work = __WORK_INITIALIZER(move_func); }; wq = create_singlethread_workqueue(name); if (!wq) return NULL; queue_work(move.work); flush_work(move.work); if (move.ret) { destroy_workqueue(wq); wq = NULL; } return wq; } Or. Just export wq_per_cpu() from workqueue.c (probably with a better name) and use it like the patch does. But, imho, create_singlethread_workqueue_in_current_cg() does not belong to workqueue.c. Oleg. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu: Enable XSAVE related CPUID
On 05/27/2010 12:50 PM, Sheng Yang wrote: We can support it in KVM now. The initial values are the minimal requirement of XSAVE capable processor. Signed-off-by: Sheng Yangsh...@linux.intel.com --- target-i386/cpuid.c | 32 1 files changed, 32 insertions(+), 0 deletions(-) diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c index eebf038..cbf5595 100644 --- a/target-i386/cpuid.c +++ b/target-i386/cpuid.c @@ -1067,6 +1067,38 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *ecx = 0; *edx = 0; break; +case 0xD: +/* Processor Extended State */ +if (!(env-cpuid_ext_features CPUID_EXT_XSAVE)) { +*eax = 0; +*ebx = 0; +*ecx = 0; +*edx = 0; +break; +} +if (count == 0) { +*eax = 0x7; /* FP | SSE | YMM */ +*ebx = 0x340; /* FP + SSE + YMM size */ +*ecx = 0x340; /* FP + SSE + YMM size */ For -cpu host, we should pick these from KVM_GET_SUPPORTED_CPUID. For canned cpu types (e.g. qemu64), we need to return what we always did. We can also add a new cpu type that has them built in (there's no cpu on the market with avx, right?) +*edx = 0; +} else if (count == 1) { +/* eax = 1, so we can continue with others */ +*eax = 1; +*ebx = 0; +*ecx = 0; +*edx = 0; +} else if (count == 2) { +*eax = 0x100; /* YMM size */ +*ebx = 0x240; /* YMM offset */ +*ecx = 0; +*edx = 0; These, too. +} else { +*eax = 0; +*ebx = 0; +*ecx = 0; +*edx = 0; +} +break; case 0x8000: *eax = env-cpuid_xlevel; *ebx = env-cpuid_vendor1; -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Copy and paste feature across guest and host
On 05/27/2010 12:17 PM, Tomasz Chmielewski wrote: Just installed Fedora 13 as a guest on KVM. However there is no cross-platform copy and paste feature. I trust I have set up this feature on another guest sometime before. Unfortunately I can't find the relevant document. Could you please shed some light? A pointer would be appreciated. TIA Did you try; # modprobe virtio-copypaste ? Seriously, qemu does not make it easy (well, its GUI does not make most things easy) and you'll need a tool which synchronizes the clipboard between two machines (google for qemu copy paste?). There is no cut & paste at the moment. The plan is to enable it through virtio-serial and have spice/vnc use it. Cannot guarantee a date but it shouldn't be too long.
Re: [PATCH 10/17] arch/ia64/kvm: Add missing spin_unlock
On 05/26/2010 06:57 PM, Julia Lawall wrote: From: Julia Lawall ju...@diku.dk Add a spin_unlock missing on the error path. The semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression E1; @@ * spin_lock(E1,...); +... when != E1 if (...) { ... when != E1 * return ...; } ...+ * spin_unlock(E1,...); // </smpl> Applied, thanks.
[PATCH] KVM: MMU: Document large pages
Signed-off-by: Avi Kivity a...@redhat.com --- Documentation/kvm/mmu.txt | 23 +++ 1 file changed, 23 insertions(+), 0 deletions(-) diff --git a/Documentation/kvm/mmu.txt b/Documentation/kvm/mmu.txt index 1e7ecdd..6a0cab2 100644 --- a/Documentation/kvm/mmu.txt +++ b/Documentation/kvm/mmu.txt @@ -317,6 +317,29 @@ on fault type: (user write faults generate a #PF) +Large pages +=== + +The mmu supports all combinations of large and small guest and host pages. +Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as +two separate 2M pages, on both guest and host, since the mmu always uses PAE +paging. + +To instantiate a large spte, four constraints must be satisfied: + +- the spte must point to a large host page +- the guest pte must be a large pte of at least equivalent size (if tdp is + enabled, there is no guest pte and this condition is satisfied) +- if the spte will be writeable, the large page frame may not overlap any + write-protected pages +- the guest pte must be wholly contained by a single memory slot + +To check the last two conditions, the mmu maintains a ->write_count set of +arrays for each memory slot and large page size. Every write protected page +causes its write_count to be incremented, thus preventing instantiation of +a large spte. The frames at the end of an unaligned memory slot have +artificially inflated ->write_counts so they can never be instantiated. + Further reading === -- 1.7.1
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
On 05/27, Michael S. Tsirkin wrote: On Thu, May 27, 2010 at 02:44:48PM +0200, Oleg Nesterov wrote: On 05/27, Michael S. Tsirkin wrote: On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote: Add a new kernel API to create a singlethread workqueue and attach its task to current task's cgroup and cpumask. Signed-off-by: Sridhar Samudrala s...@us.ibm.com Could someone familiar with workqueue code please comment on whether this patch is suitable for 2.6.35? It is needed to fix the case where vhost user might cause a kernel thread to consume more CPU than allowed by the cgroup. Should I merge it through the vhost tree? Ack for this? I don't understand the reasons for this patch, but this doesn't matter. Depending on userspace application, driver can create a lot of work for a workqueue to handle. By making the workqueue thread belong in a cgroup, we make it possible to limit the CPU and other resources thus consumed. OK, I see, thanks for your explanation. In case I wasn't clear... I didn't mean I dislike this idea, only the implementation of create_singlethread_workqueue_in_current_cg(), it doesn't belong to workqueue.c imho. Oleg.
Re: [PATCH] vhost-net: fix reversed logic in mask notifiers
On 05/25/2010 05:00 PM, Michael S. Tsirkin wrote: When guest notifier is assigned, we set mask notifier, which will assign kvm irqfd. When guest notifier is unassigned, mask notifier is unset, which should unassign kvm irqfd. The way to do this is to call mask notifier telling it to mask the vector. This, unless vector is already masked which unassigns irqfd already. The logic in unassign was reversed, which left kvm irqfd assigned. This patch is qemu-kvm only as irqfd is not upstream. Applied, thanks.
Re: [PATCH] qemu: Enable XSAVE related CPUID
On Thursday 27 May 2010 20:56:17 Avi Kivity wrote: On 05/27/2010 12:50 PM, Sheng Yang wrote: We can support it in KVM now. The initial values are the minimal requirement of XSAVE capable processor. Signed-off-by: Sheng Yangsh...@linux.intel.com --- target-i386/cpuid.c | 32 1 files changed, 32 insertions(+), 0 deletions(-) diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c index eebf038..cbf5595 100644 --- a/target-i386/cpuid.c +++ b/target-i386/cpuid.c @@ -1067,6 +1067,38 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *ecx = 0; *edx = 0; break; +case 0xD: +/* Processor Extended State */ +if (!(env-cpuid_ext_features CPUID_EXT_XSAVE)) { +*eax = 0; +*ebx = 0; +*ecx = 0; +*edx = 0; +break; +} +if (count == 0) { +*eax = 0x7; /* FP | SSE | YMM */ +*ebx = 0x340; /* FP + SSE + YMM size */ +*ecx = 0x340; /* FP + SSE + YMM size */ For -cpu host, we should pick these from KVM_GET_SUPPORTED_CPUID. For canned cpu types (e.g. qemu64), we need to return what we always did. Yes, I also prefer this way. didn't do this because it's somehow out of current QEmu CPUID setting mechanism. We can also add a new cpu type that has them built in (there's no cpu on the market with avx, right?) Right... Let's use -cpu host now. +*edx = 0; +} else if (count == 1) { +/* eax = 1, so we can continue with others */ +*eax = 1; +*ebx = 0; +*ecx = 0; +*edx = 0; +} else if (count == 2) { +*eax = 0x100; /* YMM size */ +*ebx = 0x240; /* YMM offset */ +*ecx = 0; +*edx = 0; These, too. Sure. -- regards Yang, Sheng +} else { +*eax = 0; +*ebx = 0; +*ecx = 0; +*edx = 0; +} +break; case 0x8000: *eax = env-cpuid_xlevel; *ebx = env-cpuid_vendor1; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2-RFC 0/2] virtio: put last seen used index into ring itself
On Thu, May 27, 2010 at 03:07:52PM +0300, Avi Kivity wrote: I missed the spec patch, can you repost it? Still work in progress, but here it is. Note I am still debating with myself whether we should split avail idx and flags into separate cache lines. diff --git a/virtio-spec.lyx b/virtio-spec.lyx index ed35893..150e5a8 100644 --- a/virtio-spec.lyx +++ b/virtio-spec.lyx @@ -1803,6 +1803,36 @@ next \emph default descriptor entry (modulo the ring size). This starts at 0, and increases. +\change_inserted 0 1274966643 + +\end_layout + +\begin_layout Standard + +\change_inserted 0 1274968378 +When PUBLISH_USED feature flag has +\emph on +not +\emph default + been negotiated, the ring follows the +\begin_inset Quotes eld +\end_inset + +flags +\begin_inset Quotes erd +\end_inset + + and the +\begin_inset Quotes eld +\end_inset + +idx +\begin_inset Quotes erd +\end_inset + + fields: +\change_unchanged + \end_layout \begin_layout Standard @@ -1845,7 +1875,134 @@ struct vring_avail { \end_layout +\begin_layout Standard + +\change_inserted 0 1274968432 +\begin_inset CommandInset label +LatexCommand label +name PUBLISH_USED-feature + +\end_inset + +When PUBLISH_USED feature flag has been negotiated, the control structure + including the +\begin_inset Quotes eld +\end_inset + +flags and the +\begin_inset Quotes eld +\end_inset + +idx +\begin_inset Quotes erd +\end_inset + + fields follows the ring. 
+ This leaves the room for the +\begin_inset Quotes eld +\end_inset + +last_seen_used_idx +\begin_inset Quotes erd +\end_inset + + field, which indicates the most recent +\begin_inset Quotes eld +\end_inset + +idx +\begin_inset Quotes erd +\end_inset + + value observed by guest in the used ring (see +\begin_inset CommandInset ref +LatexCommand ref +reference sub:Used-Ring + +\end_inset + + below): +\end_layout + +\begin_layout Standard + +\change_inserted 0 1274967396 +\begin_inset listings +inline false +status open + +\begin_layout Plain Layout + +\change_inserted 0 1274967404 + +struct vring_avail { +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1274967405 + + u16 ring[qsz]; /* qsz is the Queue Size field read from device */ +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1274967533 + +#define VRING_AVAIL_F_NO_INTERRUPT 1 +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1274967533 + + u16 flags; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1274967533 + + u16 idx; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1274968345 + + u16 last_seen_used_idx; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1274967396 + +}; +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Standard + +\change_inserted 0 1274967715 +If the ring is large enough, the second layout maintains the control and + ring structures on separate cache lines. 
+\end_layout + \begin_layout Subsection + +\change_inserted 0 1274968415 +\begin_inset CommandInset label +LatexCommand label +name sub:Used-Ring + +\end_inset + + +\change_unchanged Used Ring \end_layout @@ -2391,12 +2548,20 @@ status open \begin_layout Plain Layout -while (vq-last_seen_used != vring-used.idx) { +while (vq-last_seen_used +\change_inserted 0 1274968316 +_idx +\change_unchanged + != vring-used.idx) { \end_layout \begin_layout Plain Layout -struct vring_used_elem *e = vring.used-ring[vq-last_seen_used%vsz]; +struct vring_used_elem *e = vring.used-ring[vq-last_seen_used +\change_inserted 0 1274968326 +_idx +\change_unchanged +%vsz]; \end_layout \begin_layout Plain Layout @@ -2406,7 +2571,11 @@ while (vq-last_seen_used != vring-used.idx) { \begin_layout Plain Layout -vq-last_seen_used++; +vq-last_seen_used +\change_inserted 0 1274968321 +_idx +\change_unchanged +++; \end_layout \begin_layout Plain Layout @@ -2419,6 +2588,13 @@ while (vq-last_seen_used != vring-used.idx) { \end_layout +\begin_layout Standard + +\change_inserted 0 1274968252 +If PUBLISH_USED feature is negotiated, last_seen_used value should be published + to the device in the avail ring. 
+\end_layout + \begin_layout Subsection Dealing With Configuration Changes \end_layout @@ -2986,6 +3162,47 @@ struct vring_avail { \begin_layout Plain Layout }; +\change_inserted 0 1274966477 + +\end_layout + +\begin_layout Plain Layout + +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1274966484 + +struct vring_avail_ctrl { +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1274966489 + +__u16 flags; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1274966494 + +__u16 idx; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1274966499 + +__u16 last_used_idx; +\end_layout + +\begin_layout Plain Layout + +\change_inserted 0 1274966474 + +}; \end_layout \begin_layout Plain Layout @@ -3349,6 +3566,28 @@ reference sub:Indirect-Descriptors \end_inset .
Re: Copy and paste feature across guest and host
On 27.05.2010 15:19, Dor Laor wrote: On 05/27/2010 12:17 PM, Tomasz Chmielewski wrote: Just installed Fedora 13 as a guest on KVM. However there is no cross-platform copy and paste feature. I trust I have set up this feature on another guest sometime before. Unfortunately I can't find the relevant document. Could you please shed some light? A pointer would be appreciated. TIA Did you try; # modprobe virtio-copypaste ? Seriously, qemu does not make it easy (well, its GUI does not make most things easy) and you'll need a tool which synchronizes the clipboard between two machines (google for qemu copy paste?). There is no cut & paste at the moment. The plan is to enable it through virtio-serial and have spice/vnc use it. Cannot guarantee a date but it shouldn't be too long. Maybe NX / FreeNX will suit your needs. Regards, Markus
Re: [Qemu-devel] [PATCH 0/2] Fix scsi-generic breakage in upstream qemu-kvm.git
On Thu, 2010-05-20 at 15:18 +0200, Kevin Wolf wrote: Am 17.05.2010 18:45, schrieb Nicholas A. Bellinger: From: Nicholas Bellinger n...@linux-iscsi.org Greetings, Attached are the updated patches following hch's comments to fix scsi-generic device breakage with find_image_format() and refresh_total_sectors(). These are being resent as the last attachments where in MBOX format from git-format-patch. Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org Thanks, applied all to the block branch, even though I forgot to reply here. Kevin Hi Kevin, Thanks for accepting the series. There is one more piece of breakage that Chris Krumme found in block.c:find_image_format() in the original patch. Please apply the patch to add the missing bdrv_delete() for the SG_IO case below. Thanks for pointing this out Chris! Best, --nab [PATCH] [block]: Add missing bdrv_delete() for SG_IO BlockDriver in find_image_format() This patch adds a missing bdrv_delete() call in find_image_format() so that a SG_IO BlockDriver properly releases the temporary BlockDriverState *bs created from bdrv_file_open() Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org Reported-by: Chris Krumme chris.kru...@windriver.com --- block.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/block.c b/block.c index 7a379dc..88dbc00 100644 --- a/block.c +++ b/block.c @@ -334,8 +334,10 @@ static BlockDriver *find_image_format(const char *filename) return NULL; /* Return the raw BlockDriver * to scsi-generic devices */ -if (bs-sg) +if (bs-sg) { +bdrv_delete(bs); return bdrv_find_format(raw); +} ret = bdrv_pread(bs, 0, buf, sizeof(buf)); bdrv_delete(bs); -- 1.5.6.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Add Documentation/kvm/msr.txt
On Thu, May 27, 2010 at 11:15:43AM +0300, Avi Kivity wrote: On 05/26/2010 09:04 PM, Glauber Costa wrote: This patch adds a file that documents the usage of KVM-specific MSRs. Looks good. A few comments: + +Custom MSR list + + +The current supported Custom MSR list is: + +MSR_KVM_WALL_CLOCK: 0x11 + +data: physical address of a memory area. Which must be in guest RAM (i.e., don't point it somewhere random and expect the hypervisor to allocate it for you). Must be aligned to 4 bytes (we don't enforce it though). I don't see the reason for it. If this is a requirement, our own implementation is failing to meet it.
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
Hello, On 05/27/2010 03:12 PM, Michael S. Tsirkin wrote: I don't understand the reasons for this patch, but this doesn't matter. Depending on userspace application, driver can create a lot of work for a workqueue to handle. By making the workqueue thread belong in a cgroup, we make it possible to limit the CPU and other resources thus consumed. Hmmm I don't really get it. The unit of scheduling in workqueue is a work. Unless you're gonna convert every driver to use this special kind of workqueue (and what happens when multiple tasks from different cgroups share the driver?), I can't see how this is gonna be useful. If you really wanna impose cgroup control on workqueue items, you'll have to do it per work item which might lead to the problem of priority inversion. Can you please describe what you're trying to do in more detail? Thank you. -- tejun
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
On Thu, 2010-05-27 at 14:44 +0200, Oleg Nesterov wrote: On 05/27, Michael S. Tsirkin wrote: On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote: Add a new kernel API to create a singlethread workqueue and attach it's task to current task's cgroup and cpumask. Signed-off-by: Sridhar Samudrala s...@us.ibm.com Could someone familiar with workqueue code please comment on whether this patch is suitable for 2.6.35? It is needed to fix the case where vhost user might cause a kernel thread to consume more CPU than allowed by the cgroup. Should I merge it through the vhost tree? Ack for this? I don't understand the reasons for this patch, but this doesn't matter. I don't really see any need to change workqueue.c, +static struct task_struct *get_singlethread_wq_task(struct workqueue_struct *wq) +{ + return (per_cpu_ptr(wq-cpu_wq, singlethread_cpu))-thread; +} (Not sure this trivial static helper with the single caller makes sense, but see below) +/* Create a singlethread workqueue and attach it's task to the current task's + * cgroup and set it's cpumask to the current task's cpumask. 
+ */ +struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name) +{ + struct workqueue_struct *wq; + struct task_struct *task; + cpumask_var_t mask; + + wq = create_singlethread_workqueue(name); + if (!wq) + goto out; + + if (!alloc_cpumask_var(mask, GFP_KERNEL)) + goto err; + + if (sched_getaffinity(current-pid, mask)) + goto err; + + task = get_singlethread_wq_task(wq); + if (sched_setaffinity(task-pid, mask)) + goto err; + + if (cgroup_attach_task_current_cg(task)) + goto err; +out: + return wq; +err: + destroy_workqueue(wq); + wq = NULL; + goto out; +} Instead, cgroup.c (or whoever needs this) can do struct move_struct { struct work_struct work; int ret; }; static void move_func(struct work_struct *work) { struct move_struct *move = container_of(...); if (cgroup_attach_task_current_cg(current)) We are trying to attach the task associated with workqueue to the current task's cgroup. So what we need is cgroup_attach_task_current_cg(wq-task); However there is no interface currently that exports the task associated with a workqueue. It is hidden in cpu_workqueue_struct and is only accessible within workqueue.c. ret = -EANY; } static struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name) { struct workqueue_struct *wq; struct move_struct move = { .work = __WORK_INITIALIZER(move_func); }; wq = create_singlethread_workqueue(name); if (!wq) return NULL; queue_work(move.work); flush_work(move.work); if (move.ret) { destroy_workqueue(wq); wq = NULL; } return wq; } Or. Just export wq_per_cpu() from workqueue.c (probably with a better name) and use it like the patch does. This requires that struct cpu_workqueue_struct and struct workqueue_struct are made externally visible by moving them to kernel/workqueue.h. Instead what about adding the simple helper get_singlethread_wq_task() in workqueue.c and exporting it. I can add create_singlethread_workqueue_in_current_cg() to cgroup.c using this helper routine. 
Thanks Sridhar
Re: [PATCH 3/3] KVM test: Add implementation of network based unattended installation
On Wed, 2010-05-19 at 17:20 +0800, Jason Wang wrote: This patch could let the unattended installation be done through the following methods: - cdrom: the original method which does the installation from cdrom - url: installing the linux guest from http or ftp, tree url was specified through url - nfs: installing the linux guest from nfs. the server address was specified through nfs_server, and the directory was specified through nfs_dir For url and nfs installation, the extra_params need to be configured to specify the location of unattended files: - If the unattended file in the tree is used, extra_params= append ks=floppy and unattended_file params need to be specified in the configuration file. - If the unattended file located at a remote server is used, the unattended_file option must be none and extra_params= append ks=http://xxx; needs to be specified in the configuration file, and don't forget to add the finish notification part. The --kernel and --initrd were used directly for the network installation instead of the tftp/bootp param because the user mode network is too slow to do this. Hi Jason, I had reviewed your patchset, implemented the suggestions I pointed out, and I have one more thing we need to work out before we commit this - We absolutely need to provide examples on the config file, decently commented. So, could you please provide examples: * HTTP install using remote kickstart * NFS install using local kickstart And re-send the patchset? I will send the patchset with my modifications directly to you and will wait on your follow up. Thanks!
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
On Thu, May 27, 2010 at 06:15:54PM +0200, Tejun Heo wrote: Hello, On 05/27/2010 03:12 PM, Michael S. Tsirkin wrote: I don't understand the reasons for this patch, but this doesn't matter. Depending on userspace application, driver can create a lot of work for a workqueue to handle. By making the workqueue thread belong in a cgroup, we make it possible to limit the CPU and other resources thus consumed. Hmmm I don't really get it. The unit of scheduling in workqueue is a work. Yes. However, we use cgroups to limit when the workqueue itself is scheduled. This affects all of the work done on this workqueue, so it's a bit of a blunt instrument. Thus we are not trying to apply this to all drivers, we intend to start with vhost-net. Unless you're gonna convert every driver to use this special kind of workqueue (and what happens when multiple tasks from different cgroups share the driver?), We'll then create a workqueue per task. Each workqueue will have the right cgroup. But we are not trying to solve the problem for every driver. I can't see how this is gonna be useful. If you really wanna impose cgroup control on workqueue items, you'll have to do it per work item which might lead to the problem of priority inversion. Exactly. cgroup is per-workqueue not per work item. If driver wants to let administrators control priority for different kinds of items separately, driver will have to submit them to separate workqueues. Can you please describe what you're trying to do in more detail? Thank you. vhost-net driver is under control from userspace, it queues potentially a lot of work into the workqueue, which might load the system beyond the cgroup limits. And staying within cgroups limits is important for virtualization where vhost is used. -- tejun
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
On Thu, May 27, 2010 at 09:24:18AM -0700, Sridhar Samudrala wrote: On Thu, 2010-05-27 at 14:44 +0200, Oleg Nesterov wrote: On 05/27, Michael S. Tsirkin wrote: On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote: Add a new kernel API to create a singlethread workqueue and attach it's task to current task's cgroup and cpumask. Signed-off-by: Sridhar Samudrala s...@us.ibm.com Could someone familiar with workqueue code please comment on whether this patch is suitable for 2.6.35? It is needed to fix the case where vhost user might cause a kernel thread to consume more CPU than allowed by the cgroup. Should I merge it through the vhost tree? Ack for this? I don't understand the reasons for this patch, but this doesn't matter. I don't really see any need to change workqueue.c, +static struct task_struct *get_singlethread_wq_task(struct workqueue_struct *wq) +{ + return (per_cpu_ptr(wq-cpu_wq, singlethread_cpu))-thread; +} (Not sure this trivial static helper with the single caller makes sense, but see below) +/* Create a singlethread workqueue and attach it's task to the current task's + * cgroup and set it's cpumask to the current task's cpumask. 
+ */ +struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name) +{ + struct workqueue_struct *wq; + struct task_struct *task; + cpumask_var_t mask; + + wq = create_singlethread_workqueue(name); + if (!wq) + goto out; + + if (!alloc_cpumask_var(&mask, GFP_KERNEL)) + goto err; + + if (sched_getaffinity(current->pid, mask)) + goto err; + + task = get_singlethread_wq_task(wq); + if (sched_setaffinity(task->pid, mask)) + goto err; + + if (cgroup_attach_task_current_cg(task)) + goto err; +out: + return wq; +err: + destroy_workqueue(wq); + wq = NULL; + goto out; +} Instead, cgroup.c (or whoever needs this) can do struct move_struct { struct work_struct work; int ret; }; static void move_func(struct work_struct *work) { struct move_struct *move = container_of(...); if (cgroup_attach_task_current_cg(current)) We are trying to attach the task associated with the workqueue to the current task's cgroup. So what we need is cgroup_attach_task_current_cg(wq->task); However there is no interface currently that exports the task associated with a workqueue. It is hidden in cpu_workqueue_struct and is only accessible within workqueue.c. ret = -EANY; } static struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name) { struct workqueue_struct *wq; struct move_struct move = { .work = __WORK_INITIALIZER(move.work, move_func), }; wq = create_singlethread_workqueue(name); if (!wq) return NULL; queue_work(wq, &move.work); flush_work(&move.work); if (move.ret) { destroy_workqueue(wq); wq = NULL; } return wq; } Or. Just export wq_per_cpu() from workqueue.c (probably with a better name) and use it like the patch does. This requires that struct cpu_workqueue_struct and struct workqueue_struct are made externally visible by moving them to kernel/workqueue.h. Instead what about adding the simple helper get_singlethread_wq_task() in workqueue.c and exporting it. I can add create_singlethread_workqueue_in_current_cg() to cgroup.c using this helper routine. 
Or to our driver, if that's more palatable. Thanks Sridhar
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
Hello, On 05/27/2010 06:39 PM, Michael S. Tsirkin wrote: Unless you're gonna convert every driver to use this special kind of workqueue (and what happens when multiple tasks from different cgroups share the driver?), We'll then create a workqueue per task. Each workqueue will have the right cgroup. But we are not trying to solve the problem for every driver. Ah... I see. You're gonna use multiple workqueues. One concern that I have is that this is abuse of the workqueue interface to a certain level and depends on the implementation detail of workqueue rather than its intended usage model. stop_machine() was a similar case and in the end it was better served by a different mechanism built on kthread directly (cpu_stop). Wouldn't it be cleaner to use kthread directly for your case too? You're basically trying to use workqueue as a frontend to kthread, so... Thanks. -- tejun
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
What I am actually worried about is Tejun's rework, I am not sure cmwq has the "this thread services that wq" property... On 05/27, Sridhar Samudrala wrote: On Thu, 2010-05-27 at 14:44 +0200, Oleg Nesterov wrote: Instead, cgroup.c (or whoever needs this) can do struct move_struct { struct work_struct work; int ret; }; static void move_func(struct work_struct *work) { struct move_struct *move = container_of(...); if (cgroup_attach_task_current_cg(current)) We are trying to attach the task associated with the workqueue to the current task's cgroup. So what we need is cgroup_attach_task_current_cg(wq->task); I do not see cgroup_attach_task_current_cg() in Linus's tree and thus I do not know what exactly it does, and of course the code above is only a template. But I think this is easy. Just add struct cgroup *cgrp into move_struct and then move_func() can do cgroup_attach_task(move->cgrp, current)? Or. Just export wq_per_cpu() from workqueue.c (probably with a better name) and use it like the patch does. This requires that struct cpu_workqueue_struct and struct workqueue_struct are made externally visible by moving them to kernel/workqueue.h. no, no, Instead what about adding the simple helper get_singlethread_wq_task() in workqueue.c and exporting it. Indeed, this is what I meant. But. I disagree with get_singlethread_wq_task(). If we add this helper, it should work with the multi-threaded wq's too, and needs the int cpu parameter, ignored when is_wq_single_threaded(). So. Either rename wq_per_cpu() and export it (once again, I do not mean we should move the body to workqueue.h!), or create the new helper which just calls wq_per_cpu(). I can add create_singlethread_workqueue_in_current_cg() to cgroup.c using this helper routine. Imho, this is better. But please note that it is possible to do without any changes in workqueue.[ch] afaics, see above. Oleg. 
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
On Thu, May 27, 2010 at 06:56:20PM +0200, Tejun Heo wrote: Hello, On 05/27/2010 06:39 PM, Michael S. Tsirkin wrote: Unless you're gonna convert every driver to use this special kind of workqueue (and what happens when multiple tasks from different cgroups share the driver?), We'll then create a workqueue per task. Each workqueue will have the right cgroup. But we are not trying to solve the problem for every driver. Ah... I see. You're gonna use multiple workqueues. One concern that I have is that this is abuse of the workqueue interface to a certain level and depends on the implementation detail of workqueue rather than its intended usage model. Well, this is why I proposed adding a new API for creating a workqueue within workqueue.c, rather than exposing the task and attaching it to cgroups in our driver: so that workqueue maintainers can fix the implementation if it ever changes. And after all, it's an internal API, we can always change it later if we need. stop_machine() was a similar case and in the end it was better served by a different mechanism built on kthread directly (cpu_stop). Wouldn't it be cleaner to use kthread directly for your case too? You're basically trying to use workqueue as a frontend to kthread, so... Thanks. Well, yes, but we are using APIs like flush_work etc. These are very handy. It seems much easier than rolling our own queue on top of kthread. Makes sense? -- tejun
[PATCH v3] Add Documentation/kvm/msr.txt
This patch adds a file that documents the usage of KVM-specific MSRs. [ v2: added comments from Randy ] [ v3: added comments from Avi ] Signed-off-by: Glauber Costa glom...@redhat.com --- Documentation/kvm/msr.txt | 152 + 1 files changed, 152 insertions(+), 0 deletions(-) create mode 100644 Documentation/kvm/msr.txt diff --git a/Documentation/kvm/msr.txt b/Documentation/kvm/msr.txt new file mode 100644 index 000..8cdcb6d --- /dev/null +++ b/Documentation/kvm/msr.txt @@ -0,0 +1,152 @@ +KVM-specific MSRs. +Glauber Costa glom...@redhat.com, Red Hat Inc, 2010 += + +KVM makes use of some custom MSRs to service some requests. +At present, this facility is only used by kvmclock. + +Custom MSRs have a range reserved for them, that goes from +0x4b564d00 to 0x4b564dff. There are MSRs outside this area, +but they are deprecated and their use is discouraged. + +Custom MSR list + + +The current supported Custom MSR list is: + +MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00 + + data: physical address of a memory area which must be in guest RAM. + This memory is expected to hold a copy of the following structure: + + struct pvclock_wall_clock { + u32 version; + u32 sec; + u32 nsec; + } __attribute__((__packed__)); + + whose data will be filled in by the hypervisor. The hypervisor is only + guaranteed to update this data at the moment of MSR write. + Users that want to reliably query this information more than once have + to write more than once to this MSR. Fields have the following meanings: + + version: guest has to check version before and after grabbing + time information and check that they are both equal and even. + An odd version indicates an in-progress update. + + sec: number of seconds for wallclock. + + nsec: number of nanoseconds for wallclock. + + Note that although MSRs are per-CPU entities, the effect of this + particular MSR is global. + + Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid + leaf prior to usage. 
+ +MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 + + data: 4-byte aligned physical address of a memory area which must be in + guest RAM, plus an enable bit in bit 0. This memory is expected to hold + a copy of the following structure: + + struct pvclock_vcpu_time_info { + u32 version; + u32 pad0; + u64 tsc_timestamp; + u64 system_time; + u32 tsc_to_system_mul; + s8 tsc_shift; + u8 flags; + u8 pad[2]; + } __attribute__((__packed__)); /* 32 bytes */ + + whose data will be filled in by the hypervisor periodically. Only one + write, or registration, is needed for each VCPU. The interval between + updates of this structure is arbitrary and implementation-dependent. + The hypervisor may update this structure at any time it sees fit until + anything with bit0 == 0 is written to it. + + Fields have the following meanings: + + version: guest has to check version before and after grabbing + time information and check that they are both equal and even. + An odd version indicates an in-progress update. + + tsc_timestamp: the tsc value at the current VCPU at the time + of the update of this structure. Guests can subtract this value + from current tsc to derive a notion of elapsed time since the + structure update. + + system_time: a host notion of monotonic time, including sleep + time at the time this structure was last updated. Unit is + nanoseconds. + + tsc_to_system_mul: a function of the tsc frequency. One has + to multiply any tsc-related quantity by this value to get + a value in nanoseconds, besides dividing by 2^tsc_shift + + tsc_shift: cycle to nanosecond divider, as a power of two, to + allow for shift rights. One has to shift right any tsc-related + quantity by this value to get a value in nanoseconds, besides + multiplying by tsc_to_system_mul. 
+ + With this information, guests can derive per-CPU time by + doing: + + time = (current_tsc - tsc_timestamp) + time = (time * tsc_to_system_mul) >> tsc_shift + time = time + system_time + + flags: bits in this field indicate extended capabilities + coordinated between the guest and the hypervisor. Availability + of specific flags has to be checked in 0x40000001 cpuid leaf. +
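The even/odd version protocol that both MSR sections describe can be sketched as guest-side C. This is an illustrative userspace model (the struct layout is copied from the patch; the function name and the omission of memory barriers are my own), not code from the patch:

```c
#include <stdint.h>
#include <assert.h>

struct pvclock_wall_clock {
    uint32_t version;
    uint32_t sec;
    uint32_t nsec;
} __attribute__((__packed__));

/* Read sec/nsec, retrying while the hypervisor is mid-update.
 * Retries when the version changed across the reads, or when it is
 * odd (an in-progress update), per the documented protocol. */
static void wall_clock_read(const volatile struct pvclock_wall_clock *wc,
                            uint32_t *sec, uint32_t *nsec)
{
    uint32_t v1, v2;
    do {
        v1 = wc->version;
        *sec  = wc->sec;
        *nsec = wc->nsec;
        v2 = wc->version;
    } while (v1 != v2 || (v1 & 1));
}
```

A real guest would additionally insert compiler/CPU barriers around the version loads so the field reads cannot be reordered; they are left out of this sketch for brevity.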
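The per-CPU time derivation above is plain fixed-point arithmetic and can be checked numerically. A minimal sketch that follows the documented formula literally (the helper and its parameter values are illustrative; the in-kernel implementation also carries 32.32 fixed-point scaling and permits a negative tsc_shift, which this sketch does not handle):

```c
#include <stdint.h>
#include <assert.h>

/* Compute guest time in nanoseconds from a pvclock-style snapshot:
 *   time = ((current_tsc - tsc_timestamp) * tsc_to_system_mul)
 *              >> tsc_shift
 *        + system_time
 * Assumes tsc_shift >= 0, i.e. the "shift rights" case in the text. */
static uint64_t pvclock_time(uint64_t current_tsc, uint64_t tsc_timestamp,
                             uint32_t tsc_to_system_mul, int8_t tsc_shift,
                             uint64_t system_time)
{
    uint64_t delta = current_tsc - tsc_timestamp;   /* elapsed cycles */
    delta = (delta * tsc_to_system_mul) >> tsc_shift;
    return delta + system_time;
}
```

For example, with a TSC delta of 1000, tsc_to_system_mul = 4, tsc_shift = 1 and system_time = 500, the formula yields (1000 * 4 >> 1) + 500 = 2500 ns.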
Re: Clock jumps
On Thu, May 27, 2010 at 11:32 AM, Bernhard Schmidt be...@birkenwald.de wrote: Alexander Graf ag...@suse.de wrote: Do you have ntpd running inside the guest? I have a bug report lying around about 2.6.33 with kvm-clock jumping in time when ntpd is used: https://bugzilla.novell.com/show_bug.cgi?id=582260 I want to chime in here, I have a very similar problem, but not with ntpd in the guest. The host was a HP ProLiant DL320 G5p with a Dualcore Xeon3075. System was a Debian Lenny with a custom 2.6.33 host kernel and a custom qemu-kvm 0.11.0 .deb ported from Ubuntu. The host is synced with ntpd. The guests are various Debian Lenny/Squeeze VMs, with a custom kernel (2.6.33 at the moment) with kvm-clock. Exclusively amd64 guest kernels, but one system has i386 userland. With this setup once in a while (maybe every other week) one VM would have a sudden clock jump, 6-12 hours into the future. No kernel messages or other log entries than applications complaining about the clock jump after the fact. Other VMs were unaffected. Yesterday I did an upgrade to Debian Squeeze. This involved a new qemu-kvm (0.12.4), but not a new host kernel. I also upgraded the guest kernels from 2.6.33 to 2.6.33.4. First of all, after the reboot the host clock was totally unreliable. I had a constant skew of up to five seconds per minute in the host clock, which of course affected the VMs as well. This problem went away when I changed from tsc into hpet on the host. The host does CPU frequency scaling which is, as far as I know, known to affect TSC stability. I think I remember messages about tsc being unstable in earlier boots, maybe the detection did just not work this time. I'd be very interested in hearing more about the host side issue. So this happened with the same kernel that you were using before, with no trouble? Could you also send dmesg output from this boot? And if you can find any older dmesg logs to compare with, send those too? 
thanks -john
[PATCH 0/1] ceph/rbd block driver for qemu-kvm (v2)
Hi, Based on the review notes Blue Swirl sent us after my last mail, Yehuda cleaned up the header files. The patch is much smaller now and I hope that you accept it for inclusion. To build it, you will need the testing (or unstable) git head of ceph now. The required header files will be part of the next release of ceph (0.20.3). In case you didn't read my last posting, here is the short description again: This patch is a block driver for the distributed file system Ceph (http://ceph.newdream.net/). Ceph was included in the Linux v2.6.34 kernel. However, this driver uses librados (which is part of the Ceph server) for direct access to the Ceph object store and is running entirely in userspace. Therefore it is called rbd - rados block device. The basic idea is to stripe a VM block device over (by default) 4MB objects stored in the Ceph distributed object store. This is very similar to what the sheepdog project is doing, but uses the ceph server as a storage backend. If you don't plan on using the entire ceph filesystem you may leave out the metadata service of ceph. Yehuda Sadeh helped me with the implementation and put some additional usage information on the Ceph-Wiki (http://ceph.newdream.net/wiki/Kvm-rbd). He has also written a Linux kernel driver to make an rbd image accessible as a block device. Regards, Christian --- Makefile |3 + Makefile.objs |1 + block/rbd.c | 584 + block/rbd_types.h | 52 + configure | 27 +++ 5 files changed, 667 insertions(+), 0 deletions(-) create mode 100644 block/rbd.c create mode 100644 block/rbd_types.h
Re: [PATCH] Add Documentation/kvm/msr.txt
On 05/27/2010 06:02 AM, Glauber Costa wrote: On Thu, May 27, 2010 at 11:15:43AM +0300, Avi Kivity wrote: On 05/26/2010 09:04 PM, Glauber Costa wrote: This patch adds a file that documents the usage of KVM-specific MSRs. Looks good. A few comments: + +Custom MSR list + + +The current supported Custom MSR list is: + +MSR_KVM_WALL_CLOCK: 0x11 + + data: physical address of a memory area. Which must be in guest RAM (i.e., don't point it somewhere random and expect the hypervisor to allocate it for you). Must be aligned to 4 bytes (we don't enforce it though). I don't see the reason for it. If this is a requirement, our own implementation is failing to meet it. It's so the atomic write actually is atomic. Stating a 4-byte alignment requirement prevents the wall clock from crossing a page boundary. Zach
Re: [PATCH] Add Documentation/kvm/msr.txt
On Thu, May 27, 2010 at 10:13:12AM -1000, Zachary Amsden wrote: On 05/27/2010 06:02 AM, Glauber Costa wrote: On Thu, May 27, 2010 at 11:15:43AM +0300, Avi Kivity wrote: On 05/26/2010 09:04 PM, Glauber Costa wrote: This patch adds a file that documents the usage of KVM-specific MSRs. Looks good. A few comments: + +Custom MSR list + + +The current supported Custom MSR list is: + +MSR_KVM_WALL_CLOCK: 0x11 + + data: physical address of a memory area. Which must be in guest RAM (i.e., don't point it somewhere random and expect the hypervisor to allocate it for you). Must be aligned to 4 bytes (we don't enforce it though). I don't see the reason for it. If this is a requirement, our own implementation is failing to meet it. It's so the atomic write actually is atomic. Which atomic write? This is the wallclock, we do no atomic writes for querying it. Not to confuse with system time (the other msr). Stating a 4-byte alignment requirement prevents the wall clock from crossing a page boundary. Yes, but why require it? Reading the wallclock is not a hot path for anybody, is usually done just once, and crossing a page boundary here does not pose any correctness issue.
Re: [PATCH] Add Documentation/kvm/msr.txt
On 05/27/2010 10:36 AM, Glauber Costa wrote: On Thu, May 27, 2010 at 10:13:12AM -1000, Zachary Amsden wrote: On 05/27/2010 06:02 AM, Glauber Costa wrote: On Thu, May 27, 2010 at 11:15:43AM +0300, Avi Kivity wrote: On 05/26/2010 09:04 PM, Glauber Costa wrote: This patch adds a file that documents the usage of KVM-specific MSRs. Looks good. A few comments: + +Custom MSR list + + +The current supported Custom MSR list is: + +MSR_KVM_WALL_CLOCK: 0x11 + + data: physical address of a memory area. Which must be in guest RAM (i.e., don't point it somewhere random and expect the hypervisor to allocate it for you). Must be aligned to 4 bytes (we don't enforce it though). I don't see the reason for it. If this is a requirement, our own implementation is failing to meet it. It's so the atomic write actually is atomic. Which atomic write? This is the wallclock, we do no atomic writes for querying it. Not to confuse with system time (the other msr). Stating a 4-byte alignment requirement prevents the wall clock from crossing a page boundary. Yes, but why require it? Reading the wallclock is not a hot path for anybody, is usually done just once, and crossing a page boundary here does not pose any correctness issue. Little-endian non-atomic page crossing writes will write the small part of the wallclock first, so another CPU may observe the following wallclock sequence: 0x01ff .. 0x0100 .. 0x0200 Big-endian writes also have similar failure: 0x01ff .. 0x02ff .. 0x0200 This won't happen if there is a single instruction write of the wall clock word.
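The torn-write sequences Zachary describes are easy to reproduce in userspace by splitting a 16-bit store into two byte stores, the way a page-crossing write would be split by hardware. This is a small demonstration of the failure mode, not KVM code; it assumes a little-endian host:

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Simulate a non-atomic little-endian 16-bit store that has been split
 * (e.g. at a page boundary): the low byte lands first, the high byte
 * second. Returns the value a concurrent reader would observe between
 * the two byte writes. Assumes a little-endian host. */
static uint16_t torn_le_store(uint16_t *slot, uint16_t newval)
{
    uint8_t *b = (uint8_t *)slot;
    uint16_t observed;

    b[0] = (uint8_t)(newval & 0xff);           /* low byte written first */
    memcpy(&observed, slot, sizeof(observed)); /* concurrent reader here */
    b[1] = (uint8_t)(newval >> 8);             /* high byte completes it */
    return observed;
}
```

Starting from 0x01ff and storing 0x0200, the reader between the two byte writes observes 0x0100, exactly the intermediate value in the little-endian sequence in the mail.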
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
Hello, Michael. On 05/27/2010 07:32 PM, Michael S. Tsirkin wrote: Well, this is why I proposed adding a new API for creating a workqueue within workqueue.c, rather than exposing the task and attaching it to cgroups in our driver: so that workqueue maintainers can fix the implementation if it ever changes. And after all, it's an internal API, we can always change it later if we need. ... Well, yes, but we are using APIs like flush_work etc. These are very handy. It seems much easier than rolling our own queue on top of kthread. The thing is that this kind of one-off usage becomes problematic when you're trying to change the implementation detail. All current workqueue users don't care which thread they run on, and they shouldn't, as each work owns the context only for the duration the work is executing. If these fundamental guidelines are followed, the implementation can be improved in a pretty much transparent way, but when you start depending on specific implementation details, things become messy pretty quickly. If this type of usage were more common, adding a proper way to account work usage according to cgroups would make sense, but that's not the case here, and I removed the only exception case recently while trying to implement cmwq. If this is added, it would be the only one in the whole kernel which makes such extra assumptions. One way or the other, workqueue needs to be improved and I don't really think adding the single exception at this point is a good idea. The thing I realized after the stop_machine conversion was that there was no reason to use workqueue there at all. There already are more than enough not-too-difficult synchronization constructs, and if you're using a thread for dedicated purposes, code complexity isn't that different either way. Plus, it would also be clearer why dedicated threads are required there. So, I strongly suggest using a kthread. 
If there are issues which are noticeably difficult to solve with kthread, we can definitely talk about that and think about things again. Thank you. -- tejun
Re: Clock jumps
On 05/27/2010 08:32 AM, Bernhard Schmidt wrote: Alexander Graf ag...@suse.de wrote: Hi, Do you have ntpd running inside the guest? I have a bug report lying around about 2.6.33 with kvm-clock jumping in time when ntpd is used: https://bugzilla.novell.com/show_bug.cgi?id=582260 I want to chime in here, I have a very similar problem, but not with ntpd in the guest. The host was a HP ProLiant DL320 G5p with a Dualcore Xeon3075. System was a Debian Lenny with a custom 2.6.33 host kernel and a custom qemu-kvm 0.11.0 .deb ported from Ubuntu. The host is synced with ntpd. The guests are various Debian Lenny/Squeeze VMs, with a custom kernel (2.6.33 at the moment) with kvm-clock. Exclusively amd64 guest kernels, but one system has i386 userland. With this setup once in a while (maybe every other week) one VM would have a sudden clock jump, 6-12 hours into the future. No kernel messages or other log entries than applications complaining about the clock jump after the fact. Other VMs were unaffected. Yesterday I did an upgrade to Debian Squeeze. This involved a new qemu-kvm (0.12.4), but not a new host kernel. I also upgraded the guest kernels from 2.6.33 to 2.6.33.4. First of all, after the reboot the host clock was totally unreliable. I had a constant skew of up to five seconds per minute in the host clock, which of course affected the VMs as well. This problem went away when I changed from tsc into hpet on the host. The host does CPU frequency scaling which is, as far as I know, known to affect TSC stability. I think I remember messages about tsc being unstable in earlier boots, maybe the detection did just not work this time. Worse, the clock jump issues in the guest appeared much more often than before. The higher loaded VMs did not survive ten minutes without jumping several hours ahead. Situation has stabilized after setting clocksource hpet in the guest immediately after boot. So it seems kvm-clock has some issues here. 
I've seen a preliminary patch floating around on the ML by Zachary Amsden, but I haven't tried it yet. It talks of backward warps, but so far I've only seen forward warps (the VM time suddenly jumps into the future), so it might be unrelated. I have an AMD Turion TL-52 machine with unreliable TSC. It varies with CPU frequency, which is okay, we can compensate for that, but worse, it has broken clocking when in C1E idle. Apparently, it divides down the clock input to an idle core, so it only runs at 1/16 or whatever of the rate, and adds a multiplier to the TSC increment, so it scales by 16 instead of by 1 (whatever the actual numbers are I don't know, but this illustrates the point). When it wakes up to service a cache probe from another core, it now runs with full clock rate ... and still uses the multiplier for the TSC increment. The effect is that idle CPUs have TSC which may increase faster than running CPUs. Given time, this delta can add to a very large number (in theory, it's a random walk, but it can go very far off). If a VM is running on this CPU and happens to match the idle pattern without switching CPUs, time can effectively run accelerated on that VM, and very rapidly things are going to get confused. Newer kernels should detect the host clock being unreliable quite quickly; my F13 machine detects it right away at boot. I have server side fixes for this kvm-clock which seem to give me a stable clock on this machine, but for true SMP stability, you will need Glauber's guest side changes to kvmclock as well. It is impossible to guarantee strictly monotonic clocksource across multiple CPUs when frequency is dynamically changing (and also because of the C1E idle problems). There is one remaining problem to fix, the reset of TSC on reboot in SMP will destabilize the TSCs again, but now I've actually got VMs running again (different bug), that shouldn't be long. 
Zach
Re: Clock jumps
On 27.05.2010 21:08, john stultz wrote: Hi John, I'd be very interested in hearing more about the host side issue. So this happened with the same kernel that you were using before, with no trouble? Correct. Could you also send dmesg output from this boot? And if you can find any older dmesg logs to compare with, send those too? See http://users.birkenwald.de/~berni/temp/dmesg-lenny and http://users.birkenwald.de/~berni/temp/dmesg-squeeze . Although running on the same kernel binary the initrd changed greatly when upgrading, so ordering/timing between those two is off. Note that the dmesg output is captured right after boot. I think I remember seeing a TSC unstable message pretty soon after boot, but I might be mixing it up with my other AMD-based KVM server. I don't hold normal (non-boot) logs that long, so I can't tell for sure. If you need any more info feel free to contact me. Bernhard
Re: Clock jumps
On 27.05.2010 23:53, Zachary Amsden wrote: Hello Zachary, I have server side fixes for this kvm-clock which seem to give me a stable clock on this machine, but for true SMP stability, you will need Glauber's guest side changes to kvmclock as well. It is impossible to guarantee strictly monotonic clocksource across multiple CPUs when frequency is dynamically changing (and also because of the C1E idle problems). Is all this relevant only when the host is on TSC? Because I have seen these jumps when the host was on HPET and the guests were using kvm-clock. Anyway, can you send me both patches? I'd like to try it, but I have completely lost track where the up-to-date patches are. Bernhard
Re: Clock jumps
On 05/27/2010 12:12 PM, Bernhard Schmidt wrote: On 27.05.2010 23:53, Zachary Amsden wrote: Hello Zachary, I have server side fixes for this kvm-clock which seem to give me a stable clock on this machine, but for true SMP stability, you will need Glauber's guest side changes to kvmclock as well. It is impossible to guarantee strictly monotonic clocksource across multiple CPUs when frequency is dynamically changing (and also because of the C1E idle problems). Is all this relevant only when the host is on TSC? Because I have seen these jumps when the host was on HPET and the guests were using kvm-clock. Anyway, can you send me both patches? I'd like to try it, but I have completely lost track where the up-to-date patches are. There's more than two, there's quite a bit, I'll send it to the list soon. Zach
Re: Clock jumps
On 05/27/2010 12:12 PM, Bernhard Schmidt wrote: On 27.05.2010 23:53, Zachary Amsden wrote: Hello Zachary, I have server side fixes for this kvm-clock which seem to give me a stable clock on this machine, but for true SMP stability, you will need Glauber's guest side changes to kvmclock as well. It is impossible to guarantee strictly monotonic clocksource across multiple CPUs when frequency is dynamically changing (and also because of the C1E idle problems). Is all this relevant only when the host is on TSC? Because I have seen these jumps when the host was on HPET and the guests were using kvm-clock. It doesn't matter what the host uses (although the host on TSC with unstable TSC can make things worse), tsc and kvmclock sources in the guest will be unstable regardless.
Re: [GIT PULL net-2.6] vhost-net: error handling fixes
From: Michael S. Tsirkin m...@redhat.com Date: Thu, 27 May 2010 14:57:14 +0300 David, The following tree includes fixes dealing with error handling in vhost-net. It is on top of net-2.6. Please merge it for 2.6.35. Thanks! The following changes since commit 8a74ad60a546b13bd1096b2a61a7a5c6fd9ae17c: net: fix lock_sock_bh/unlock_sock_bh (2010-05-27 00:30:53 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net Pulled, thanks Michael.
Re: Clock jumps
On Thu, 2010-05-27 at 23:48 +0200, Bernhard Schmidt wrote: On 27.05.2010 21:08, john stultz wrote: I'd be very interested in hearing more about the host side issue. So this happened with the same kernel that you were using before, with no trouble? Correct. Could you also send dmesg output from this boot? And if you can find any older dmesg logs to compare with, send those too? See http://users.birkenwald.de/~berni/temp/dmesg-lenny and http://users.birkenwald.de/~berni/temp/dmesg-squeeze . Although running on the same kernel binary the initrd changed greatly when upgrading, so ordering/timing between those two is off. Note that the dmesg output is captured right after boot. I think I remember seeing a TSC unstable message pretty soon after boot, but I might be mixing it up with my other AMD-based KVM server. I don't hold normal (non-boot) logs that long, so I can't tell for sure. Looking at the diff: --- dmesg-lenny 2010-05-27 16:45:33.0 -0700 +++ dmesg-squeeze 2010-05-27 16:46:14.0 -0700 @@ -132,8 +132,8 @@ console [ttyS1] enabled hpet clockevent registered Fast TSC calibration using PIT -Detected 2660.398 MHz processor. -Calibrating delay loop (skipped), value calculated using timer frequency.. 5320.79 BogoMIPS (lpj=10641592) +Detected 2613.324 MHz processor. +Calibrating delay loop (skipped), value calculated using timer frequency.. 5226.64 BogoMIPS (lpj=10453296) Security Framework initialized Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes) Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes) @@ -160,7 +160,7 @@ CPU0: Intel(R) Xeon(R) CPU3075 @ 2.66GHz stepping 0b Booting Node 0, Processors #1 Brought up 2 CPUs -Total of 2 processors activated (10640.79 BogoMIPS). +Total of 2 processors activated (10546.63 BogoMIPS). 
 NET: Registered protocol family 16
 ACPI: bus type pci registered
 PCI: MMCONFIG for domain [bus 00-ff] at [mem 0xe000-0xefff] (base 0xe000)

So you can see in the above that during the second boot the TSC calibration was badly mis-calculated. This was the cause of the skew. Not sure how that might be linked to the distro upgrade. It could have been something like SMI damage during the calibration time, but I thought the calibration loop watched for that.

Bernhard: I expect with all those vms, this machine isn't rebooted frequently. So could you look at the logs to see how much the Detected .yyy MHz processor. line varies by across a few other boots (if they still exist)?

Ingo/Thomas: Any thoughts, should we be considering dropping the quick_pit_calibrate() code and always do the slower route?

thanks
-john
Re: [PATCH] Add Documentation/kvm/msr.txt
On Thu, May 27, 2010 at 11:02:35AM -1000, Zachary Amsden wrote:
On 05/27/2010 10:36 AM, Glauber Costa wrote:
On Thu, May 27, 2010 at 10:13:12AM -1000, Zachary Amsden wrote:
On 05/27/2010 06:02 AM, Glauber Costa wrote:
On Thu, May 27, 2010 at 11:15:43AM +0300, Avi Kivity wrote:
On 05/26/2010 09:04 PM, Glauber Costa wrote:

This patch adds a file that documents the usage of KVM-specific MSRs.

Looks good. A few comments:

+
+Custom MSR list
+
+
+The current supported Custom MSR list is:
+
+MSR_KVM_WALL_CLOCK: 0x11
+
+data: physical address of a memory area.

Which must be in guest RAM (i.e., don't point it somewhere random and expect the hypervisor to allocate it for you). Must be aligned to 4 bytes (we don't enforce it though).

I don't see the reason for it. If this is a requirement, our own implementation is failing to meet it.

It's so the atomic write actually is atomic.

Which atomic write? This is the wallclock, we do no atomic writes for querying it. Not to confuse with system time (the other MSR).

Stating a 4-byte alignment requirement prevents the wall clock from crossing a page boundary.

Yes, but why require it? Reading the wallclock is not a hot path for anybody, it is usually done just once, and crossing a page boundary here does not pose any correctness issue.

Little-endian non-atomic page-crossing writes will write the low part of the wallclock first, so another CPU may observe the following wallclock sequence: 0x01ff .. 0x0100 .. 0x0200. Big-endian writes have a similar failure: 0x01ff .. 0x02ff .. 0x0200. This won't happen if there is a single-instruction write of the wall clock word.

We already specify that users can only trust the value of the wallclock after we have an even version field. When we start the update, and during the time of all writes to it, it is odd, and thus, invalid. The ABI guarantees to the guest that we'll only bump version after we're done updating. So why bother?
Re: [PATCH] kvm mmu: optimizations when tdp is in use
Marcelo Tosatti wrote:
On Thu, May 27, 2010 at 04:06:34PM +0800, Gui Jianfeng wrote:

When tdp is in use, checking for write-protected pages isn't needed, and the quadrant doesn't need to be calculated either.

Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0bb9f17..ce4bbd3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -495,10 +495,13 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
 	max_level = kvm_x86_ops->get_lpage_level() < host_level ?
 		kvm_x86_ops->get_lpage_level() : host_level;
 
+	if (tdp_enabled)
+		goto done;
+

This is wrong. write_count is initialized for alignment purposes, not only write protected pages. See __kvm_set_memory_region in virt/kvm/kvm_main.c.

thanks, avi also pointed this out.

Gui

 	for (level = PT_DIRECTORY_LEVEL; level <= max_level; ++level)
 		if (has_wrprotected_page(vcpu->kvm, large_gfn, level))
 			break;
-
+done:
 	return level - 1;
 }
 
@@ -1346,7 +1349,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	if (role.direct)
 		role.cr4_pae = 0;
 	role.access = access;
-	if (vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) {
+	if (!tdp_enabled && vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) {
 		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
-- 
1.6.5.2

-- 
Regards
Gui Jianfeng