Re: [Qemu-devel] [PATCH 0/3] switch to seavgabios
Negative result here too: I don't see anything on screen at startup. From the log, with the latest qemu-kvm git version:

  Running option rom at c180:3d4e
  Running option rom at c180:3da2
  Running option rom at c180:3df6
  Running option rom at c580:0003
  qemu-system-x86_64: /root/download/qemu/git/qemu-kvm/exec.c:2641: register_subpage: Assertion `existing.mr->subpage || existing.mr == &io_mem_unassigned' failed.

Startup is done with the following command:

  /root/download/qemu/git/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 \
    -drive file=1.img,media=disk,if=scsi,bus=0,unit=0 \
    -drive file=2.img,media=disk,if=scsi,bus=0,unit=1 \
    -drive file=3.img,media=disk,if=scsi,bus=0,unit=2 \
    -boot order=cad,menu=on -m 2048 -k de -vga vmware -vnc :0 \
    -bios /root/download/seabios/git/seabios/out/bios.bin \
    -chardev stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios \
    -option-rom BIOS/V4.19/8xx_64.rom \
    -device rtl8139,mac=00:02:44:92:87:6a,vlan=0,romfile= \
    -net tap,ifname=tap0,script=no,downscript=no,vlan=0

The backtrace isn't valid. The old integrated BIOS is valid and works well.

Ciao,
Gerhard

On 17.04.2012 12:11, Gerd Hoffmann wrote:

Hi,

This patch series switches the vgabios binaries shipped by qemu from the lgpl'ed vgabios to the seabios version. There should be no guest-visible changes (especially no regressions) in theory. The only known (and intentional) exception is the vesa 2.0 protected mode interface, which is not implemented by the seavgabios.

There are no known issues with linux and windows guests. Testing with other, more exotic guests is very welcome.

Note #1: Just pull from git to test; I've zapped the big binary patches so you can't 'git am' them.

Note #2: This goes on top of the seabios-1.7.0 pull request just sent, so pulling this will pull the seabios update too.
please give it a spin,
  Gerd

The following changes since commit 6b034aa138716a515c88f7894940d5d0aff2f3ed:

  seabios: update to 1.7.0 (2012-04-17 10:51:41 +0200)

are available in the git repository at:

  git://git.kraxel.org/qemu seavgabios-1.7.0

Gerd Hoffmann (3):
      Add seavgabios build rules to roms/Makefile
      Update vgabios binaries
      Switch isa vga to seavgabios too.

 hw/vga-isa.c               |    2 +-
 hw/vga_int.h               |    1 -
 pc-bios/vgabios-cirrus.bin |  Bin 35840 -> 37888 bytes
 pc-bios/vgabios-isavga.bin |  Bin 0 -> 37888 bytes
 pc-bios/vgabios-qxl.bin    |  Bin 40448 -> 37888 bytes
 pc-bios/vgabios-stdvga.bin |  Bin 40448 -> 37888 bytes
 pc-bios/vgabios-vmware.bin |  Bin 40448 -> 37888 bytes
 roms/Makefile              |   13 +
 roms/config.vga.cirrus     |    3 +++
 roms/config.vga.isavga     |    3 +++
 roms/config.vga.qxl        |    6 ++
 roms/config.vga.stdvga     |    3 +++
 roms/config.vga.vmware     |    6 ++
 13 files changed, 35 insertions(+), 2 deletions(-)
 create mode 100644 pc-bios/vgabios-isavga.bin
 create mode 100644 roms/config.vga.cirrus
 create mode 100644 roms/config.vga.isavga
 create mode 100644 roms/config.vga.qxl
 create mode 100644 roms/config.vga.stdvga
 create mode 100644 roms/config.vga.vmware
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 14/16] KVM: MMU: fast path of handling guest page fault
On 04/18/2012 09:47 AM, Marcelo Tosatti wrote: > On Fri, Apr 13, 2012 at 06:16:33PM +0800, Xiao Guangrong wrote: >> If the the present bit of page fault error code is set, it indicates >> the shadow page is populated on all levels, it means what we do is >> only modify the access bit which can be done out of mmu-lock >> >> Currently, in order to simplify the code, we only fix the page fault >> caused by write-protect on the fast path >> >> Signed-off-by: Xiao Guangrong >> --- >> arch/x86/kvm/mmu.c | 205 >> >> arch/x86/kvm/paging_tmpl.h |3 + >> 2 files changed, 192 insertions(+), 16 deletions(-) >> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c >> index efa5d59..fc91667 100644 >> --- a/arch/x86/kvm/mmu.c >> +++ b/arch/x86/kvm/mmu.c >> @@ -446,6 +446,13 @@ static bool __check_direct_spte_mmio_pf(u64 spte) >> } >> #endif >> >> +static bool spte_wp_by_dirty_log(u64 spte) >> +{ >> +WARN_ON(is_writable_pte(spte)); >> + >> +return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT); >> +} >> + >> static bool spte_has_volatile_bits(u64 spte) >> { >> if (!shadow_accessed_mask) >> @@ -454,9 +461,18 @@ static bool spte_has_volatile_bits(u64 spte) >> if (!is_shadow_present_pte(spte)) >> return false; >> >> -if ((spte & shadow_accessed_mask) && >> - (!is_writable_pte(spte) || (spte & shadow_dirty_mask))) >> -return false; >> +if (spte & shadow_accessed_mask) { >> +if (is_writable_pte(spte)) >> +return !(spte & shadow_dirty_mask); >> + >> +/* >> + * If the spte is write-protected by dirty-log, it may >> + * be marked writable on fast page fault path, then CPU >> + * can modify the Dirty bit. 
>> + */ >> +if (!spte_wp_by_dirty_log(spte)) >> +return false; >> +} >> >> return true; >> } >> @@ -1109,26 +1125,18 @@ static void drop_spte(struct kvm *kvm, u64 *sptep) >> rmap_remove(kvm, sptep); >> } >> >> -static bool spte_wp_by_dirty_log(u64 spte) >> -{ >> -WARN_ON(is_writable_pte(spte)); >> - >> -return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT); >> -} >> - >> static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large, >> bool *flush, bool page_table_protect) >> { >> u64 spte = *sptep; >> >> if (is_writable_pte(spte)) { >> -*flush |= true; >> - >> if (large) { >> pgprintk("rmap_write_protect(large): spte %p %llx\n", >> spte, *spte); >> BUG_ON(!is_large_pte(spte)); >> >> +*flush |= true; >> drop_spte(kvm, sptep); >> --kvm->stat.lpages; >> return; >> @@ -1137,6 +1145,9 @@ static void spte_write_protect(struct kvm *kvm, u64 >> *sptep, bool large, >> goto reset_spte; >> } >> >> +/* We need flush tlbs in this case: the fast page fault path >> + * can mark the spte writable after we read the sptep. >> + */ >> if (page_table_protect && spte_wp_by_dirty_log(spte)) >> goto reset_spte; >> >> @@ -1144,6 +1155,8 @@ static void spte_write_protect(struct kvm *kvm, u64 >> *sptep, bool large, >> >> reset_spte: >> rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte); >> + >> +*flush |= true; >> spte = spte & ~PT_WRITABLE_MASK; >> if (page_table_protect) >> spte |= SPTE_WRITE_PROTECT; >> @@ -2767,18 +2780,172 @@ exit: >> return ret; >> } >> >> +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn, >> + u32 error_code) >> +{ >> +unsigned long *rmap; >> + >> +/* >> + * #PF can be fast only if the shadow page table is present and it >> + * is caused by write-protect, that means we just need change the >> + * W bit of the spte which can be done out of mmu-lock. 
>> + */ >> +if (!(error_code & PFERR_PRESENT_MASK) || >> + !(error_code & PFERR_WRITE_MASK)) >> +return false; >> + >> +rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL); >> + >> +/* Quickly check the page can be writable. */ >> +if (test_bit(PTE_LIST_WP_BIT, ACCESS_ONCE(rmap))) >> +return false; >> + >> +return true; >> +} >> + >> +static bool >> +fast_pf_fix_indirect_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, >> + u64 *sptep, u64 spte, gfn_t gfn) >> +{ >> +pfn_t pfn; >> +bool ret = false; >> + >> +/* >> + * For the indirect spte, it is hard to get a stable gfn since >> + * we just use a cmpxchg to avoid all the races which is not >> + * enough to avoid the ABA problem: the host can arbitrarily >> + * change spte and the mapping from gfn to pfh. >> + * >> + * What w
Re: [PATCH v2 07/16] KVM: MMU: introduce for_each_pte_list_spte
On 04/17/2012 10:47 PM, Takuya Yoshikawa wrote:
> On Mon, 16 Apr 2012 11:36:25 +0800
> Xiao Guangrong wrote:
>
>> I tested it with kernbench, no regression is found.
>
> Because kernbench is not at all a good test for this.
>
>> It is not a problem since the iter and spte should be in the cache.
>
> I have checked dirty-log-perf myself with this patch [01-07].
>
> GET_DIRTY_LOG for 1GB dirty pages has become more than 15% slower.

Thanks for your test! Unbelievable; I will do more tests and check it more carefully. Could you please publish your tool, so that I can reproduce this and find the real reason?

> Note: if you had checked the worst case latency with this patch series,
> you should have noticed this regression by yourself.
>
> Auto-test and kernbench are not enough to see the effect of this work.

I will check whether your tool is better than kernbench/autotest after reviewing your tool.
[PATCH] kvm: lock slots_lock around device assignment
As pointed out by Jason Baron, when assigning a device to a guest we first set the iommu domain pointer, which enables mapping and unmapping of memory slots to the iommu. This leaves a window where this path is enabled, but we haven't synchronized the iommu mappings to the existing memory slots. Thus a slot being removed at that point could send us down unexpected code paths, removing non-existent pinnings and iommu mappings.

Take the slots_lock around creating the iommu domain and the initial mappings, as well as around iommu teardown, to avoid this race.

Signed-off-by: Alex Williamson
---
 virt/kvm/iommu.c |   23 +++
 1 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index fec1723..e9fff98 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -240,9 +240,13 @@ int kvm_iommu_map_guest(struct kvm *kvm)
 		return -ENODEV;
 	}
 
+	mutex_lock(&kvm->slots_lock);
+
 	kvm->arch.iommu_domain = iommu_domain_alloc(&pci_bus_type);
-	if (!kvm->arch.iommu_domain)
-		return -ENOMEM;
+	if (!kvm->arch.iommu_domain) {
+		r = -ENOMEM;
+		goto out_unlock;
+	}
 
 	if (!allow_unsafe_assigned_interrupts &&
 	    !iommu_domain_has_cap(kvm->arch.iommu_domain,
@@ -253,17 +257,16 @@ int kvm_iommu_map_guest(struct kvm *kvm)
 		       " module option.\n", __func__);
 		iommu_domain_free(kvm->arch.iommu_domain);
 		kvm->arch.iommu_domain = NULL;
-		return -EPERM;
+		r = -EPERM;
+		goto out_unlock;
 	}
 
 	r = kvm_iommu_map_memslots(kvm);
 	if (r)
-		goto out_unmap;
-
-	return 0;
+		kvm_iommu_unmap_memslots(kvm);
 
-out_unmap:
-	kvm_iommu_unmap_memslots(kvm);
+out_unlock:
+	mutex_unlock(&kvm->slots_lock);
 	return r;
 }
 
@@ -340,7 +343,11 @@ int kvm_iommu_unmap_guest(struct kvm *kvm)
 	if (!domain)
 		return 0;
 
+	mutex_lock(&kvm->slots_lock);
 	kvm_iommu_unmap_memslots(kvm);
+	kvm->arch.iommu_domain = NULL;
+	mutex_unlock(&kvm->slots_lock);
+
 	iommu_domain_free(domain);
 	return 0;
 }
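The window being closed here can be sketched with a userspace analogue: the "domain" pointer only becomes visible while the same lock that serializes slot updates is held, so a concurrent slot remover can never observe a half-initialized domain. The struct and function names below are illustrative stand-ins (a pthread mutex plays the role of kvm->slots_lock), not kernel code:

```c
#include <pthread.h>
#include <stdlib.h>

struct vm {
    pthread_mutex_t slots_lock;
    void *iommu_domain;   /* non-NULL only once mappings are in sync */
    int mapped_slots;
};

static int map_guest(struct vm *vm)
{
    pthread_mutex_lock(&vm->slots_lock);

    vm->iommu_domain = malloc(1);   /* stand-in for iommu_domain_alloc() */
    if (!vm->iommu_domain) {
        pthread_mutex_unlock(&vm->slots_lock);
        return -1;
    }
    /* initial slot sync happens under the same lock, closing the race */
    vm->mapped_slots = 4;           /* stand-in for kvm_iommu_map_memslots() */

    pthread_mutex_unlock(&vm->slots_lock);
    return 0;
}

static void unmap_guest(struct vm *vm)
{
    pthread_mutex_lock(&vm->slots_lock);
    vm->mapped_slots = 0;           /* stand-in for kvm_iommu_unmap_memslots() */
    free(vm->iommu_domain);
    vm->iommu_domain = NULL;
    pthread_mutex_unlock(&vm->slots_lock);
}
```

The point of the pattern is that both the pointer publication and the state it guards change atomically with respect to anyone else holding the lock.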
Re: [PATCH v2 14/16] KVM: MMU: fast path of handling guest page fault
On Fri, Apr 13, 2012 at 06:16:33PM +0800, Xiao Guangrong wrote:
> If the present bit of the page fault error code is set, it indicates
> the shadow page is populated on all levels; it means what we do is
> only modify the access bit, which can be done out of mmu-lock.
>
> Currently, in order to simplify the code, we only fix the page fault
> caused by write-protect on the fast path.
>
> Signed-off-by: Xiao Guangrong
> ---
>  arch/x86/kvm/mmu.c         |  205
>  arch/x86/kvm/paging_tmpl.h |    3 +
>  2 files changed, 192 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index efa5d59..fc91667 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -446,6 +446,13 @@ static bool __check_direct_spte_mmio_pf(u64 spte)
>  }
>  #endif
>
> +static bool spte_wp_by_dirty_log(u64 spte)
> +{
> +	WARN_ON(is_writable_pte(spte));
> +
> +	return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT);
> +}
> +
>  static bool spte_has_volatile_bits(u64 spte)
>  {
>  	if (!shadow_accessed_mask)
> @@ -454,9 +461,18 @@ static bool spte_has_volatile_bits(u64 spte)
>  	if (!is_shadow_present_pte(spte))
>  		return false;
>
> -	if ((spte & shadow_accessed_mask) &&
> -	      (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
> -		return false;
> +	if (spte & shadow_accessed_mask) {
> +		if (is_writable_pte(spte))
> +			return !(spte & shadow_dirty_mask);
> +
> +		/*
> +		 * If the spte is write-protected by dirty-log, it may
> +		 * be marked writable on the fast page fault path, then
> +		 * the CPU can modify the Dirty bit.
> +		 */
> +		if (!spte_wp_by_dirty_log(spte))
> +			return false;
> +	}
>
>  	return true;
>  }
> @@ -1109,26 +1125,18 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
>  		rmap_remove(kvm, sptep);
>  }
>
> -static bool spte_wp_by_dirty_log(u64 spte)
> -{
> -	WARN_ON(is_writable_pte(spte));
> -
> -	return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT);
> -}
> -
>  static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
>  			       bool *flush, bool page_table_protect)
>  {
>  	u64 spte = *sptep;
>
>  	if (is_writable_pte(spte)) {
> -		*flush |= true;
> -
>  		if (large) {
>  			pgprintk("rmap_write_protect(large): spte %p %llx\n",
>  				 spte, *spte);
>  			BUG_ON(!is_large_pte(spte));
>
> +			*flush |= true;
>  			drop_spte(kvm, sptep);
>  			--kvm->stat.lpages;
>  			return;
> @@ -1137,6 +1145,9 @@ static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
>  		goto reset_spte;
>  	}
>
> +	/* We need to flush tlbs in this case: the fast page fault path
> +	 * can mark the spte writable after we read the sptep.
> +	 */
>  	if (page_table_protect && spte_wp_by_dirty_log(spte))
>  		goto reset_spte;
>
> @@ -1144,6 +1155,8 @@ static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
>
>  reset_spte:
>  	rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
> +
> +	*flush |= true;
>  	spte = spte & ~PT_WRITABLE_MASK;
>  	if (page_table_protect)
>  		spte |= SPTE_WRITE_PROTECT;
> @@ -2767,18 +2780,172 @@ exit:
>  	return ret;
>  }
>
> +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn,
> +				   u32 error_code)
> +{
> +	unsigned long *rmap;
> +
> +	/*
> +	 * #PF can be fast only if the shadow page table is present and it
> +	 * is caused by write-protect; that means we just need to change
> +	 * the W bit of the spte, which can be done out of mmu-lock.
> +	 */
> +	if (!(error_code & PFERR_PRESENT_MASK) ||
> +	      !(error_code & PFERR_WRITE_MASK))
> +		return false;
> +
> +	rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);
> +
> +	/* Quickly check whether the page can be writable. */
> +	if (test_bit(PTE_LIST_WP_BIT, ACCESS_ONCE(rmap)))
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool
> +fast_pf_fix_indirect_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> +			  u64 *sptep, u64 spte, gfn_t gfn)
> +{
> +	pfn_t pfn;
> +	bool ret = false;
> +
> +	/*
> +	 * For the indirect spte, it is hard to get a stable gfn since
> +	 * we just use a cmpxchg to avoid all the races, which is not
> +	 * enough to avoid the ABA problem: the host can arbitrarily
> +	 * change the spte and the mapping from gfn to pfn.
> +	 *
> +	 * What w
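The lockless core of the fast path above is a single compare-and-swap on the spte: the W bit is only set if the spte has not changed since it was read, otherwise the code falls back to the slow path under mmu_lock. A userspace sketch with C11 atomics, where the bit positions and names are illustrative stand-ins for KVM's actual spte encoding:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PT_WRITABLE_MASK   (1ull << 1)    /* illustrative bit positions, */
#define SPTE_ALLOW_WRITE   (1ull << 60)   /* not KVM's real layout       */
#define SPTE_WRITE_PROTECT (1ull << 61)

/* Try to mark a dirty-log write-protected spte writable again with one
 * cmpxchg. Returns false if the spte is not a candidate, or if someone
 * modified it between the read and the update (the caller would then
 * retry under the lock). */
static bool fast_pf_fix_spte(_Atomic uint64_t *sptep, uint64_t old_spte)
{
    /* Only sptes write-protected purely by dirty logging qualify. */
    if (!(old_spte & SPTE_ALLOW_WRITE) || (old_spte & SPTE_WRITE_PROTECT))
        return false;

    uint64_t new_spte = old_spte | PT_WRITABLE_MASK;

    /* Fails if *sptep no longer equals the snapshot we based new_spte on. */
    return atomic_compare_exchange_strong(sptep, &old_spte, new_spte);
}
```

As the quoted comment notes, a bare cmpxchg is not enough for indirect sptes because of the ABA problem: the spte could be zapped and repopulated with the same bits while the gfn-to-pfn mapping changed underneath.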
[ANNOUNCE] qemu-kvm-1.0.1
qemu-kvm-1.0.1 is now available. This release is based on the upstream qemu 1.0.1, plus kvm-specific enhancements. Please see the original QEMU 1.0.1 release announcement [1] for details.

This release can be used with the kvm kernel modules provided by your distribution kernel, or by the modules in the kvm-kmod package, such as kvm-kmod-3.1.

http://www.linux-kvm.org

[1] http://lists.gnu.org/archive/html/qemu-stable/2012-02/msg1.html
Question about host CPU usage/allocation by KVM
Greetings everybody,

Can anybody please point me to code/documentation regarding the following questions I have:

- What does it actually mean to use the "-smp N" option, in terms of CPU sharing between the host and the guest?
- How are guest CPUs mapped to host CPUs (if at all)?
- Is it possible to hard-allocate one or more CPUs to a particular guest, preventing the host from using them?
- Is it possible to limit guest usage of host CPUs?
- (Do my questions make sense?)

I am running stock ubuntu-natty 2.6.38-8, with KVM 0.14.0.

Thanks!
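A partial answer in sketch form: with KVM, each guest vcpu created by "-smp N" is an ordinary host thread, scheduled like any other, so the usual Linux affinity tools apply — taskset(1), cgroup cpusets, or sched_setaffinity(2) from code. A minimal Linux-only example of pinning the calling thread to one CPU (the CPU number here is arbitrary, and "hard-allocating" a CPU away from the host additionally needs isolcpus/cpusets):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single host CPU via sched_setaffinity(2).
 * A management tool would do the same to a qemu vcpu thread's tid.
 * Returns 0 on success, -1 (with errno set) on failure. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0 /* 0 == calling thread */, sizeof(set), &set);
}
```

Limiting (rather than pinning) guest CPU usage is typically done with the cgroup cpu controller instead.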
Re: [PATCHv4] device-assignment: don't touch pci command register
On 2012-04-17 14:10, Michael S. Tsirkin wrote: > Real command register is under kernel control: > it includes bits for triggering SERR, marking > BARs as invalid and such which are all under host > kernel control. > > While there's no known bug this triggers - since qemu does its > best to make guest state match device state - > it seems safer to avoid touching this register as much as > possible. > > With this patch, we don't touch any bits > except bus master which is ok to put under guest control > and intx mask which kvm interrupt sharing machinery > explicitly allows. > > Note: PCI_STATUS bears looking into as well. > > Tested-by: Alex Williamson > Signed-off-by: Michael S. Tsirkin > --- > > Changes from v3: > - use pci_word_test_and_clear_mask to clear bits in a clean way > Changes from v2: > - whitespace fix > Changes from v1: > - fix intx mask > > hw/device-assignment.c | 20 ++-- > 1 files changed, 6 insertions(+), 14 deletions(-) > > diff --git a/hw/device-assignment.c b/hw/device-assignment.c > index 89823f1..17a3a93 100644 > --- a/hw/device-assignment.c > +++ b/hw/device-assignment.c > @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, > uint16_t r_seg, > FILE *f; > unsigned long long start, end, size, flags; > uint16_t id; > -struct stat statbuf; > PCIRegion *rp; > PCIDevRegions *dev = &pci_dev->real_device; > > @@ -610,12 +609,8 @@ again: > pci_dev->dev.config[2] = id & 0xff; > pci_dev->dev.config[3] = (id & 0xff00) >> 8; > > -/* dealing with virtual function device */ > -snprintf(name, sizeof(name), "%sphysfn/", dir); > -if (!stat(name, &statbuf)) { > -/* always provide the written value on readout */ > -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); > -} > +pci_word_test_and_clear_mask(pci_dev->emulate_config_write + PCI_COMMAND, > + PCI_COMMAND_MASTER | > PCI_COMMAND_INTX_DISABLE); > > dev->region_number = r; > return 0; > @@ -782,13 +778,9 @@ static int assign_device(AssignedDevice *dev) > "cause host memory 
corruption if the device issues DMA write > " > "requests!\n"); > } > if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && > kvm_has_intx_set_mask()) { > assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; > - > -/* hide host-side INTx masking from the guest */ > -dev->emulate_config_read[PCI_COMMAND + 1] |= > -PCI_COMMAND_INTX_DISABLE >> 8; > } > > r = kvm_assign_pci_device(kvm_state, &assigned_dev_data); > @@ -1631,10 +1624,10 @@ static void reset_assigned_device(DeviceState *dev) > } > > /* > - * When a 0 is written to the command register, the device is logically > + * When a 0 is written to the bus master register, the device is > logically > * disconnected from the PCI bus. This avoids further DMA transfers. > */ > -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); > +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); > } > > static int assigned_initfn(struct PCIDevice *pci_dev) > @@ -1658,7 +1651,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) > * device initialization. > */ > assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); > -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); > assigned_dev_direct_config_read(dev, PCI_STATUS, 2); > assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); > assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); Acked-by: Jan Kiszka -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
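For reference, the pci_word_test_and_clear_mask() call the patch relies on boils down to a little-endian 16-bit read-modify-write on a byte array (here the per-byte emulation mask at the PCI_COMMAND offset). A standalone sketch of that behavior — a simplification of qemu's helper, not a verbatim copy:

```c
#include <stdint.h>

/* Read a little-endian 16-bit word from a config-space-style byte
 * buffer, clear the bits in mask, write the word back, and return
 * the bits that were previously set. */
static uint16_t word_test_and_clear_mask(uint8_t *config, uint16_t mask)
{
    uint16_t val = (uint16_t)(config[0] | (config[1] << 8));
    uint16_t was_set = val & mask;

    val &= (uint16_t)~mask;
    config[0] = (uint8_t)(val & 0xff);
    config[1] = (uint8_t)(val >> 8);
    return was_set;
}
```

In the patch, clearing PCI_COMMAND_MASTER (0x0004) and PCI_COMMAND_INTX_DISABLE (0x0400) in emulate_config_write hands exactly those two command bits to the guest while everything else stays under host kernel control.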
Re: [PATCHv4] device-assignment: don't touch pci command register
On Tue, 2012-04-17 at 15:10 +0300, Michael S. Tsirkin wrote: > Real command register is under kernel control: > it includes bits for triggering SERR, marking > BARs as invalid and such which are all under host > kernel control. > > While there's no known bug this triggers - since qemu does its > best to make guest state match device state - > it seems safer to avoid touching this register as much as > possible. > > With this patch, we don't touch any bits > except bus master which is ok to put under guest control > and intx mask which kvm interrupt sharing machinery > explicitly allows. > > Note: PCI_STATUS bears looking into as well. > > Tested-by: Alex Williamson > Signed-off-by: Michael S. Tsirkin Acked-by: Alex Williamson > --- > > Changes from v3: > - use pci_word_test_and_clear_mask to clear bits in a clean way > Changes from v2: > - whitespace fix > Changes from v1: > - fix intx mask > > hw/device-assignment.c | 20 ++-- > 1 files changed, 6 insertions(+), 14 deletions(-) > > diff --git a/hw/device-assignment.c b/hw/device-assignment.c > index 89823f1..17a3a93 100644 > --- a/hw/device-assignment.c > +++ b/hw/device-assignment.c > @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, > uint16_t r_seg, > FILE *f; > unsigned long long start, end, size, flags; > uint16_t id; > -struct stat statbuf; > PCIRegion *rp; > PCIDevRegions *dev = &pci_dev->real_device; > > @@ -610,12 +609,8 @@ again: > pci_dev->dev.config[2] = id & 0xff; > pci_dev->dev.config[3] = (id & 0xff00) >> 8; > > -/* dealing with virtual function device */ > -snprintf(name, sizeof(name), "%sphysfn/", dir); > -if (!stat(name, &statbuf)) { > -/* always provide the written value on readout */ > -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); > -} > +pci_word_test_and_clear_mask(pci_dev->emulate_config_write + PCI_COMMAND, > + PCI_COMMAND_MASTER | > PCI_COMMAND_INTX_DISABLE); > > dev->region_number = r; > return 0; > @@ -782,13 +778,9 @@ static int 
assign_device(AssignedDevice *dev) > "cause host memory corruption if the device issues DMA write > " > "requests!\n"); > } > if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && > kvm_has_intx_set_mask()) { > assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; > - > -/* hide host-side INTx masking from the guest */ > -dev->emulate_config_read[PCI_COMMAND + 1] |= > -PCI_COMMAND_INTX_DISABLE >> 8; > } > > r = kvm_assign_pci_device(kvm_state, &assigned_dev_data); > @@ -1631,10 +1624,10 @@ static void reset_assigned_device(DeviceState *dev) > } > > /* > - * When a 0 is written to the command register, the device is logically > + * When a 0 is written to the bus master register, the device is > logically > * disconnected from the PCI bus. This avoids further DMA transfers. > */ > -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); > +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); > } > > static int assigned_initfn(struct PCIDevice *pci_dev) > @@ -1658,7 +1651,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) > * device initialization. > */ > assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); > -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); > assigned_dev_direct_config_read(dev, PCI_STATUS, 2); > assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); > assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
On 04/17/2012 02:44 AM, Avi Kivity wrote: On 04/11/2012 04:39 AM, zhangyanfei wrote: This patch set exports offsets of VMCS fields as note information for kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve runtime state of guest machine image, such as registers, in host machine's crash dump as VMCS format. The problem is that VMCS internal is hidden by Intel in its specification. So, we reverse engineering it in the way implemented in this patch set. Please note that this processing never affects any existing kvm logic. The VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO. Here is an example: Processor: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz $cat /sys/kernel/vmcsinfo 1cba8c0 2000 crash> rd -p 1cba8c0 1000 1cba8c0: 127b0009 53434d56 {...VMCS 1cba8d0: 4f464e49 4e4f495349564552 INFOREVISION 1cba8e0: 49460a643d44495f 5f4e495028444c45 _ID=d.FIELD(PIN_ 1cba8f0: 4d565f4445534142 4f435f434558455f BASED_VM_EXEC_CO 1cba900: 303d294c4f52544e 0a30383130343831 NTROL)=01840180. 1cba910: 504328444c454946 5f44455341425f55 FIELD(CPU_BASED_ 1cba920: 5f434558455f4d56 294c4f52544e4f43 VM_EXEC_CONTROL) 1cba930: 393130343931303d 28444c4549460a30 =01940190.FIELD( 1cba940: 5241444e4f434553 4558455f4d565f59 SECONDARY_VM_EXE 1cba950: 4f52544e4f435f43 30346566303d294c C_CONTROL)=0fe40 1cba960: 4c4549460a306566 4958455f4d562844 fe0.FIELD(VM_EXI 1cba970: 4f52544e4f435f54 346531303d29534c T_CONTROLS)=01e4 1cba980: 4549460a30653130 4e455f4d5628444c 01e0.FIELD(VM_EN 1cba990: 544e4f435f595254 33303d29534c4f52 TRY_CONTROLS)=03 1cba9a0: 460a303133303431 45554728444c4549 140310.FIELD(GUE 1cba9b0: 45535f53455f5453 3d29524f5443454c ST_ES_SELECTOR)= 1cba9c0: 4549460a30303530 545345554728444c 0500.FIELD(GUEST 1cba9d0: 454c45535f53435f 35303d29524f5443 _CS_SELECTOR)=05 .. TODO: 1. In kexec-tools, get VMCSINFO via sysfs and dump it as note information into vmcore. 2. Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process core file. 
To do this, we will modify the kernel core dumper, gdb gcore and crash gcore.

3. Dump the guest image from the qemu-process core file into a vmcore.

Taking a step back, can you describe the problem scenario you're fixing here?

Note that assuming that VMCS fields map to offsets within memory is not guaranteed to work for future processors. There is no guarantee that fields won't be compressed or stored with inverted bit ordering.

Regards,

Anthony Liguori
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 07:15 PM, Jan Kiszka wrote:
> On 2012-04-17 14:06, Avi Kivity wrote:
>> On 04/17/2012 03:03 PM, Gleb Natapov wrote:
>>>>
>>>> KVM_MAX_VCPUS.
>>>>
>>> Ah, so you are worried about a malicious guest configuring the pit to
>>> broadcast to all its vcpus.
>>
>> Yes - it can introduce huge amounts of latency this way, which is exactly
>> what Jan is trying to prevent.
>>
>> Though I'm not sure spin_lock_irq() in the realtime tree actually
>> disables irqs (but it's certainly not a good idea in mainline; it's
>> nasty even with just the spinlock).
>
> This depends on how you declare the spin lock type - raw or normal. The
> former will disable irqs, the latter not even preemption (but become a
> mutex).

Yes (and I see no reason to use raw spinlocks here). Still, for mainline, are we okay with 254 IPIs? Maybe it's not so bad and I'm overinflating the problem.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 2012-04-17 14:06, Avi Kivity wrote:
> On 04/17/2012 03:03 PM, Gleb Natapov wrote:
>>>
>>> KVM_MAX_VCPUS.
>>>
>> Ah, so you are worried about malicious guest configuring pit to
>> broadcast to all its vcpus.
>
> Yes - it can introduce huge amounts of latency this way which is exactly
> what Jan is trying to prevent.
>
> Though I'm not sure spin_lock_irq() in the realtime tree actually
> disables irqs (but it's certainly not a good idea in mainline; it's
> nasty even with just the spinlock).

This depends on how you declare the spin lock type - raw or normal. The former will disable irqs, the latter not even preemption (but becomes a mutex).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 00/13] KVM: MMU: fast page fault
On 04/17/2012 05:54 PM, Takuya Yoshikawa wrote:
> On Tue, 17 Apr 2012 15:41:39 +0300
> Avi Kivity wrote:
>
>>> Since there are many known algorithms to predict hot memory pages,
>>> the userspace will be able to tune the frequency of GET_DIRTY_LOG for such
>>> parts not to get too many faults repeatedly, if we can restrict the range
>>> of pages to protect.
>>>
>>> This is the fine-grained control.
>>
>> Do you want per-page control, or just range-based?
>
> Difficult question.
>
> For live migration, range-based control may be enough due to the locality
> of WWS.

What's WWS?

I don't see a reason to assume locality in gpa space (or in gva space either, for some workloads).

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 00/13] KVM: MMU: fast page fault
On Tue, 17 Apr 2012 15:41:39 +0300
Avi Kivity wrote:

>> Since there are many known algorithms to predict hot memory pages,
>> the userspace will be able to tune the frequency of GET_DIRTY_LOG for such
>> parts not to get too many faults repeatedly, if we can restrict the range
>> of pages to protect.
>>
>> This is the fine-grained control.
>
> Do you want per-page control, or just range-based?

Difficult question.

For live migration, range-based control may be enough due to the locality of WWS.

Thanks,
	Takuya
Re: [PATCH v2 07/16] KVM: MMU: introduce for_each_pte_list_spte
On Mon, 16 Apr 2012 11:36:25 +0800
Xiao Guangrong wrote:

> I tested it with kernbench, no regression is found.

Because kernbench is not at all a good test for this.

> It is not a problem since the iter and spte should be in the cache.

I have checked dirty-log-perf myself with this patch [01-07].

GET_DIRTY_LOG for 1GB of dirty pages has become more than 15% slower.

This is not a pure cleanup; it introduces a significant regression to rmap handling. This is the reason why I asked to remove "cleanup patches" from this patch series. Had it been really trivial cleanup, I would not have asked that.

Note: if you had checked the worst-case latency with this patch series, you should have noticed this regression by yourself.

Auto-test and kernbench are not enough to see the effect of this work.

Thanks,
	Takuya
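For a sense of the scale behind the 15% figure: KVM_GET_DIRTY_LOG reports one bit per guest page, so 1GB of 4KB pages is 262,144 bits whose sptes must be found via the rmap and write-protected on every iteration — which is why per-spte overhead in the rmap walk shows up so clearly in a dirty-log benchmark. A trivial consumer-side bitmap scan, for illustration only (not KVM code):

```c
#include <stddef.h>
#include <stdint.h>

/* Count set bits in a dirty bitmap laid out as 64-bit words, as a
 * userspace consumer of KVM_GET_DIRTY_LOG would when deciding which
 * pages to resend during live migration. nbits is assumed to be a
 * multiple of 64 for simplicity. */
static size_t count_dirty_pages(const uint64_t *bitmap, size_t nbits)
{
    size_t n = 0;

    for (size_t i = 0; i < nbits / 64; i++)
        n += (size_t)__builtin_popcountll(bitmap[i]);
    return n;
}
```

The userspace scan is cheap; the contested cost in this thread is the kernel-side rmap walk that re-protects each of those pages.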
Re: KVM call agenda for April, Tuesday 17
Juan Quintela wrote:
> Hi
>
> Please send in any agenda items you are interested in covering.

As there are no topics, we cancel today's call.

Happy hacking, Juan.
Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
On 4/17/2012 2:49 AM, Gleb Natapov wrote: On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote: On 4/16/2012 5:18 AM, Gleb Natapov wrote: On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote: On 04/11/2012 01:21 PM, Chegu Vinod wrote: Hello, While running an AIM7 (workfile.high_systime) in a single 40-way (or a single 60-way KVM guest) I noticed pretty bad performance when the guest was booted with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220 (RHEL6.2) kernel. For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest kernel was nearly 12x better ! How many CPUs your host has? 80 Cores on the DL980. (i.e. 8 Westmere sockets). So you are not oversubscribing CPUs at all. Are those real cores or including HT? HT is off. Do you have other cpus hogs running on the host while testing the guest? Nope. Sometimes I do run the utilities like "perf" or "sar" or "mpstat" on the numa node 0 (where the guest is not running). I was using numactl to bind the qemu of the 40-way guests to numa nodes : 4-7 ( or for a 60-way guest binding them to nodes 2-7) /etc/qemu-ifup tap0 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7 /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -enable-kvm \ -m 65536 -smp 40 \ -name vm1 -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \ -drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci .0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ -monitor stdio \ -net nic,macaddr=<..mac_addr..> \ -net tap,ifname=tap0,script=no,downscript=no \ -vnc :4 /etc/qemu-ifdown tap0 I knew that there will be a few additional temporary qemu worker threads created... i.e. 
some over subscription will be there. 4 nodes above have 40 real cores, yes? Yes . Other than the qemu's related threads and some of the generic per-cpu Linux kernel threads (e.g. migration etc) there isn't anything else running on these Numa nodes. Can you try to run upstream kernel without binding at all and check the performance? I shall re-run and get back to you with this info. Typically for the native runs... binding the workload results in better numbers. Hence I choose to do the binding for the guest too...i.e. on the same numa nodes as the native case for virt. vs. native comparison purposes. Having said that ...In the past I had seen a couple of cases where the non-binded guest performed better than the native case. Need to re-run and dig into this further... Will have to retry by doing some explicit pinning of the vcpus to native cores (without using virsh). Turned on function tracing and found that there appears to be more time being spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220 guest. Looks like you may be running into the ticket spinlock code. During the early RHEL 6 days, Gleb came up with a patch to automatically disable ticket spinlocks when running inside a KVM guest. IIRC that patch got rejected upstream at the time, with upstream developers preferring to wait for a "better solution". If such a better solution is not on its way upstream now (two years later), maybe we should just merge Gleb's patch upstream for the time being? I think the pv spinlock that is actively discussed currently should address the issue, but I am not sure someone tests it against non-ticket lock in a guest to see which one performs better. I did see that discussion...seems to have originated from the Xen context. Yes, The problem is the same for both hypervisors. -- Gleb. 
Thanks Vinod -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
On 04/17/2012 02:25 PM, Wen Congyang wrote: > > > >> > >> For scenario 2, we also want the guest's registers values to be dumped into > >> qemu process's core file when qemu process crashes. This is the task of > >> TODO-list 2. > > > > Why? If qemu crashed it is because of an internal qemu fault. If any > > guest registers were involved, they would have been decoded by qemu > > previously and would be present in the stack trace (for example mmio > > address/data). > > Hmm, IIRC, if qemu meets some critical error, it will call abort() or > assert(). > The guest registers are stored in the kernel, and qemu does not call > cpu_synchronize_state() to get guest registers. So I do not understand > why the registers would be present in the stack trace... There are two cases. One case is where the problem was not caused directly by guest action, for example a segmentation fault in the block layer or the VNC server. In this case the guest registers are immaterial. The other case is where the problem was directly caused by guest action, for example an mmio write to a device register triggered an error. In this case kvm emulates the mmio instruction and returns KVM_EXIT_MMIO; it can be seen in the kvm_run page. The address/data pair is propagated by the qemu memory core all the way to the device callback. So the instruction and register contents are unneeded for debugging the crash. Is there a scenario where the guest registers help towards debugging a qemu crash? -- error compiling committee.c: too many arguments to function
Re: [PATCHv2] device-assignment: don't touch real command register
On Tue, Apr 17, 2012 at 01:24:24PM +0200, Jan Kiszka wrote: > > +/* Pass bus master writes to device. */ > > +pci_dev->emulate_config_write[PCI_COMMAND] &= ~PCI_COMMAND_MASTER; > > +pci_dev->emulate_config_write[PCI_COMMAND + 1] &= > > ~(PCI_COMMAND_INTX_DISABLE >> 8); > > Comment doesn't fully match the code. Will fix. > And I bet you aren't using > checkpatch... ;) I didn't run it but it doesn't complain.
Re: [PATCH 00/13] KVM: MMU: fast page fault
On 04/17/2012 03:37 PM, Takuya Yoshikawa wrote: > On Tue, 17 Apr 2012 10:51:40 +0300 > Avi Kivity wrote: > > > That's true with the write protect everything approach we use now. But > > it's not true with range-based write protection, where you issue > > GET_DIRTY_LOG on a range of pages and only need to re-write-protect them. > > > > (the motivation for that is to decrease the time between GET_DIRTY_LOG > > and sending the page; as the time increases, the chances that the page > > got re-dirtied go up). > > Thank you for explaining this. > > I was planning to give the userspace more freedom. > > Since there are many known algorithms to predict hot memory pages, > the userspace will be able to tune the frequency of GET_DIRTY_LOG for such > parts not to get too many faults repeatedly, if we can restrict the range > of pages to protect. > > This is the fine-grained control. Do you want per-page control, or just range-based? -- error compiling committee.c: too many arguments to function
Re: [PATCH 00/13] KVM: MMU: fast page fault
On Tue, 17 Apr 2012 10:51:40 +0300 Avi Kivity wrote: > That's true with the write protect everything approach we use now. But > it's not true with range-based write protection, where you issue > GET_DIRTY_LOG on a range of pages and only need to re-write-protect them. > > (the motivation for that is to decrease the time between GET_DIRTY_LOG > and sending the page; as the time increases, the chances that the page > got re-dirtied go up). Thank you for explaining this. I was planning to give the userspace more freedom. Since there are many known algorithms to predict hot memory pages, the userspace will be able to tune the frequency of GET_DIRTY_LOG for such parts not to get too many faults repeatedly, if we can restrict the range of pages to protect. This is the fine-grained control. Thanks, Takuya
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
At 04/17/2012 06:59 PM, Avi Kivity Wrote: > On 04/17/2012 01:51 PM, zhangyanfei wrote: >> 于 2012年04月17日 15:44, Avi Kivity 写道: >>> On 04/11/2012 04:39 AM, zhangyanfei wrote: This patch set exports offsets of VMCS fields as note information for kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve runtime state of guest machine image, such as registers, in host machine's crash dump as VMCS format. The problem is that VMCS internal is hidden by Intel in its specification. So, we reverse engineering it in the way implemented in this patch set. Please note that this processing never affects any existing kvm logic. The VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO. Here is an example: Processor: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz $cat /sys/kernel/vmcsinfo 1cba8c0 2000 crash> rd -p 1cba8c0 1000 1cba8c0: 127b0009 53434d56 {...VMCS 1cba8d0: 4f464e49 4e4f495349564552 INFOREVISION 1cba8e0: 49460a643d44495f 5f4e495028444c45 _ID=d.FIELD(PIN_ 1cba8f0: 4d565f4445534142 4f435f434558455f BASED_VM_EXEC_CO 1cba900: 303d294c4f52544e 0a30383130343831 NTROL)=01840180. 1cba910: 504328444c454946 5f44455341425f55 FIELD(CPU_BASED_ 1cba920: 5f434558455f4d56 294c4f52544e4f43 VM_EXEC_CONTROL) 1cba930: 393130343931303d 28444c4549460a30 =01940190.FIELD( 1cba940: 5241444e4f434553 4558455f4d565f59 SECONDARY_VM_EXE 1cba950: 4f52544e4f435f43 30346566303d294c C_CONTROL)=0fe40 1cba960: 4c4549460a306566 4958455f4d562844 fe0.FIELD(VM_EXI 1cba970: 4f52544e4f435f54 346531303d29534c T_CONTROLS)=01e4 1cba980: 4549460a30653130 4e455f4d5628444c 01e0.FIELD(VM_EN 1cba990: 544e4f435f595254 33303d29534c4f52 TRY_CONTROLS)=03 1cba9a0: 460a303133303431 45554728444c4549 140310.FIELD(GUE 1cba9b0: 45535f53455f5453 3d29524f5443454c ST_ES_SELECTOR)= 1cba9c0: 4549460a30303530 545345554728444c 0500.FIELD(GUEST 1cba9d0: 454c45535f53435f 35303d29524f5443 _CS_SELECTOR)=05 .. TODO: 1. In kexec-tools, get VMCSINFO via sysfs and dump it as note information into vmcore. 2. 
Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process core file. To do this, we will modify kernel core dumper, gdb gcore and crash gcore. 3. Dump guest image from the qemu-process core file into a vmcore. >>> >>> Taking a step back, can you describe the problem scenario you're fixing >>> here? >>> >> Considering two scenarios below: >> 1. Host panics, guests running on that host will also be dumped into >> host's vmcore. >> 2. Qemu process is core dumped (by gdb gcore or kernel core dumper), and >> its corresponding guest will be included in the core file. >> >> We want to create the guest machine's crash dump from host machine's vmcore >> or qemu process's core file. Unfortunately, we cannot get the guest's >> registers >> values in both scenarios. >> >> For scenario 1, some key registers (CR0, CR3...) of the guest machine are >> stored >> in VMCS region. But VMCS internal is hidden by Intel specification. So this >> patch set aims to get offsets of fields in VMCS region and export it as note >> information for kdump. > > Okay. Do you expect it to help in debugging the crash? Did you have > cases where it would help? > >> >> For scenario 2, we also want the guest's registers values to be dumped into >> qemu process's core file when qemu process crashes. This is the task of >> TODO-list 2. > > Why? If qemu crashed it is because of an internal qemu fault. If any > guest registers were involved, they would have been decoded by qemu > previously and would be present in the stack trace (for example mmio > address/data). Hmm, IIRC, if qemu meets some critical error, it will call abort() or assert(). The guest registers are stored in the kernel, and qemu does not call cpu_synchronize_state() to get guest registers. So I do not understand why the registers would be present in the stack trace... Thanks Wen Congyang > >> Is this what you want? >> > > Yes. I'm trying to understand if the feature would be useful in real life.
Re: [PATCH v2 10/16] KVM: MMU: fast check whether page is writable
On 04/17/2012 03:41 PM, Avi Kivity wrote: > On 04/17/2012 06:55 AM, Xiao Guangrong wrote: >> On 04/16/2012 07:47 PM, Avi Kivity wrote: >> >>> On 04/16/2012 01:20 PM, Xiao Guangrong wrote: >> >> It is used to avoid the unnecessary overload > > It's overloading me :( > Sorry. >>> >>> The trick is to send those in separate patchset so the maintainer >>> doesn't notice. >>> >> >> >> Thanks for your suggestion, I will pay more attention to it in the >> future. >> >> For this patch, what did you mean by "those"? You mean the whole >> rmap.PTE_LIST_WP_BIT (fast check for shadow page table write protection >> and host write protection) or just host_page_write_protect >> (for KSM only)? > > All of it. Let's start with just modifying sptes concurrently and only > later add reading bits from rmap concurrently, if it proves necessary. > Okay, I agree. >> >> If we do not have rmap.PTE_LIST_WP_BIT, there may be a regression on >> the shadow mmu. >> >> Hmm, do I need to implement rmap.PTE_LIST_WP_BIT before fast page fault? > > Let's try to measure the effect without rmap.PTE_LIST_WP_BIT. Usually > PTE chains for page tables are short so the effect would be small. Of > course we can't tell about all guests. > It is not about the rmap's sptes; it is about sp.sync write protection: if a synced shadow page is written, the fast page fault path is triggered even with no migration and no framebuffer. I ran kernbench 10 times and took the average value, without xwindow: keep rmap.PTE_LIST_WP_BIT: 53.494 comment out the rmap.PTE_LIST_WP_BIT check in page_fault_can_be_fast: 53.948 Anyway, for ease of review, let's merge fast page fault first and discuss this in a separate patchset later.
[PATCHv4] device-assignment: don't touch pci command register
Real command register is under kernel control: it includes bits for triggering SERR, marking BARs as invalid and such which are all under host kernel control. While there's no known bug this triggers - since qemu does its best to make guest state match device state - it seems safer to avoid touching this register as much as possible. With this patch, we don't touch any bits except bus master which is ok to put under guest control and intx mask which kvm interrupt sharing machinery explicitly allows. Note: PCI_STATUS bears looking into as well. Tested-by: Alex Williamson Signed-off-by: Michael S. Tsirkin --- Changes from v3: - use pci_word_test_and_clear_mask to clear bits in a clean way Changes from v2: - whitespace fix Changes from v1: - fix intx mask hw/device-assignment.c | 20 ++-- 1 files changed, 6 insertions(+), 14 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..17a3a93 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg, FILE *f; unsigned long long start, end, size, flags; uint16_t id; -struct stat statbuf; PCIRegion *rp; PCIDevRegions *dev = &pci_dev->real_device; @@ -610,12 +609,8 @@ again: pci_dev->dev.config[2] = id & 0xff; pci_dev->dev.config[3] = (id & 0xff00) >> 8; -/* dealing with virtual function device */ -snprintf(name, sizeof(name), "%sphysfn/", dir); -if (!stat(name, &statbuf)) { -/* always provide the written value on readout */ -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); -} +pci_word_test_and_clear_mask(pci_dev->emulate_config_write + PCI_COMMAND, + PCI_COMMAND_MASTER | PCI_COMMAND_INTX_DISABLE); dev->region_number = r; return 0; @@ -782,13 +778,9 @@ static int assign_device(AssignedDevice *dev) "cause host memory corruption if the device issues DMA write " "requests!\n"); } if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && kvm_has_intx_set_mask()) { assigned_dev_data.flags |= 
KVM_DEV_ASSIGN_PCI_2_3; - -/* hide host-side INTx masking from the guest */ -dev->emulate_config_read[PCI_COMMAND + 1] |= -PCI_COMMAND_INTX_DISABLE >> 8; } r = kvm_assign_pci_device(kvm_state, &assigned_dev_data); @@ -1631,10 +1624,10 @@ static void reset_assigned_device(DeviceState *dev) } /* - * When a 0 is written to the command register, the device is logically + * When a 0 is written to the bus master register, the device is logically * disconnected from the PCI bus. This avoids further DMA transfers. */ -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); } static int assigned_initfn(struct PCIDevice *pci_dev) @@ -1658,7 +1651,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) * device initialization. */ assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); assigned_dev_direct_config_read(dev, PCI_STATUS, 2); assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); -- 1.7.9.111.gf3fb0 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 03:03 PM, Gleb Natapov wrote: > > > > KVM_MAX_VCPUS. > > > Ah, so you are worried about malicious guest configuring pit to > broadcast to all its vcpus. Yes - it can introduce huge amounts of latency this way which is exactly what Jan is trying to prevent. Though I'm not sure spin_lock_irq() in the realtime tree actually disables irqs (but it's certainly not a good idea in mainline; it's nasty even with just the spinlock). -- error compiling committee.c: too many arguments to function
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On Tue, Apr 17, 2012 at 03:00:10PM +0300, Avi Kivity wrote: > On 04/17/2012 02:05 PM, Gleb Natapov wrote: > > On Tue, Apr 17, 2012 at 01:42:31PM +0300, Avi Kivity wrote: > > > On 04/17/2012 01:31 PM, Gleb Natapov wrote: > > > > On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > > > > > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > > > > > It isn't, since you need to send an IPI. > > > > > > > > > > > > > That is exactly what I forget whether you can send IPI from there :) > > > > > > Anyway this is another reason. > > > > > > > > > > > > > > > > Actually I was wrong. You can't smp_call_function_single() from irq > > > > > context (deadlocks if two vcpus do that), but we send a reschedule > > > > > interrupt. So it should work. > > > > > > > > > Ah, good point. So if we'll use irqsave versions of spinlocks we can > > > > drop kthread? > > > > > > Do we want 254 IPIs to be issued from irq context? Could be slow. > > > > > Where this number is coming from? > > KVM_MAX_VCPUS. > Ah, so you are worried about malicious guest configuring pit to broadcast to all its vcpus. > > > We can make the unicast case run from irq context and defer the > > > multicast to a thread. > > > > > We do not know if gsi is multicasted in the pit code though. > > > > We don't have to do it in the pit code. The .set method can decide > whether it wants a direct path or go through a thread (which thread? > passed as a parameter?) > Agree. -- Gleb.
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 02:05 PM, Gleb Natapov wrote: > On Tue, Apr 17, 2012 at 01:42:31PM +0300, Avi Kivity wrote: > > On 04/17/2012 01:31 PM, Gleb Natapov wrote: > > > On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > > > > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > > > > It isn't, since you need to send an IPI. > > > > > > > > > > > That is exactly what I forget whether you can send IPI from there :) > > > > > Anyway this is another reason. > > > > > > > > > > > > > Actually I was wrong. You can't smp_call_function_single() from irq > > > > context (deadlocks if two vcpus do that), but we send a reschedule > > > > interrupt. So it should work. > > > > > > > Ah, good point. So if we'll use irqsave versions of spinlocks we can > > > drop kthread? > > > > Do we want 254 IPIs to be issued from irq context? Could be slow. > > > Where this number is coming from? KVM_MAX_VCPUS. > > We can make the unicast case run from irq context and defer the > > multicast to a thread. > > > We do not know if gsi is multicasted in the pit code though. > We don't have to do it in the pit code. The .set method can decide whether it wants a direct path or go through a thread (which thread? passed as a parameter?) -- error compiling committee.c: too many arguments to function
Re: [PATCH] KVM: x86: Run PIT work in own kthread
On 04/17/2012 02:18 PM, Jan Kiszka wrote: > On 2012-04-17 12:27, Avi Kivity wrote: > > On 04/16/2012 09:26 PM, Jan Kiszka wrote: > >> We can't run PIT IRQ injection work in the interrupt context of the host > >> timer. This would allow the user to influence the handler complexity by > >> asking for a broadcast to a large number of VCPUs. Therefore, this work > >> was pushed into workqueue context in 9d244caf2e. However, this prevents > >> prioritizing the PIT injection over other task as workqueues share > >> kernel threads. > >> > >> This replaces the workqueue with a kthread worker and gives that thread > >> a name in the format "kvm-pit/". That allows to > >> identify and adjust the kthread priority according to the VM process > >> parameters. > > > > Is this a new ABI? > > Yep. Scripts will use it, maybe even QEMU. Do you want this to appear in > api.txt? You already know the answer. Please make it explicit that the thread is optional. This way, if we gain real-time workqueue support, or eliminate it through some other trick, userspace won't break. -- error compiling committee.c: too many arguments to function
Re: [PATCHv2] device-assignment: don't touch real command register
On 2012-04-17 13:24, Jan Kiszka wrote: > On 2012-04-16 21:53, Michael S. Tsirkin wrote: >> Real command register is under kernel control: >> it includes bits for triggering SERR, marking >> BARs as invalid and such which are under host >> kernel control. Don't touch any except bus master >> which is ok to put under guest control and intx >> mask which kvm interrupt sharing machinery >> explicitly allows. >> >> Signed-off-by: Michael S. Tsirkin >> >> --- >> Compiled only, don't have a setup for assignment >> testing ATM. Is anyone interested enough to test >> and report? >> >> Changes from v1: >> - fix intx mask handling >> >> hw/device-assignment.c | 24 >> 1 files changed, 8 insertions(+), 16 deletions(-) >> >> diff --git a/hw/device-assignment.c b/hw/device-assignment.c >> index 89823f1..6cdcc17 100644 >> --- a/hw/device-assignment.c >> +++ b/hw/device-assignment.c >> @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, >> uint16_t r_seg, >> FILE *f; >> unsigned long long start, end, size, flags; >> uint16_t id; >> -struct stat statbuf; >> PCIRegion *rp; >> PCIDevRegions *dev = &pci_dev->real_device; >> >> @@ -610,12 +609,9 @@ again: >> pci_dev->dev.config[2] = id & 0xff; >> pci_dev->dev.config[3] = (id & 0xff00) >> 8; >> >> -/* dealing with virtual function device */ >> -snprintf(name, sizeof(name), "%sphysfn/", dir); >> -if (!stat(name, &statbuf)) { >> -/* always provide the written value on readout */ >> -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); >> -} >> +/* Pass bus master writes to device. */ >> +pci_dev->emulate_config_write[PCI_COMMAND] &= ~PCI_COMMAND_MASTER; >> +pci_dev->emulate_config_write[PCI_COMMAND + 1] &= >> ~(PCI_COMMAND_INTX_DISABLE >> 8); > > Comment doesn't fully match the code. And I bet you aren't using > checkpatch... 
;) > >> >> dev->region_number = r; >> return 0; >> @@ -782,13 +778,10 @@ static int assign_device(AssignedDevice *dev) >> "cause host memory corruption if the device issues DMA >> write " >> "requests!\n"); >> } >> -if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && >> -kvm_has_intx_set_mask()) { >> -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; >> >> -/* hide host-side INTx masking from the guest */ >> -dev->emulate_config_read[PCI_COMMAND + 1] |= >> -PCI_COMMAND_INTX_DISABLE >> 8; >> +if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && >> +kvm_has_intx_set_mask()) { >> +assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; >> } >> >> r = kvm_assign_pci_device(kvm_state, &assigned_dev_data); >> @@ -1631,10 +1624,10 @@ static void reset_assigned_device(DeviceState *dev) >> } >> >> /* >> - * When a 0 is written to the command register, the device is logically >> + * When a 0 is written to the bus master register, the device is >> logically >> * disconnected from the PCI bus. This avoids further DMA transfers. >> */ >> -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); >> +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); >> } >> >> static int assigned_initfn(struct PCIDevice *pci_dev) >> @@ -1658,7 +1651,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) >> * device initialization. >> */ >> assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); >> -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); >> assigned_dev_direct_config_read(dev, PCI_STATUS, 2); >> assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); >> assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); > > Patch looks otherwise fine to me. Err, I mean v3 on which I actually wanted to comment. 
Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [PATCHv2] device-assignment: don't touch real command register
On 2012-04-16 21:53, Michael S. Tsirkin wrote: > Real command register is under kernel control: > it includes bits for triggering SERR, marking > BARs as invalid and such which are under host > kernel control. Don't touch any except bus master > which is ok to put under guest control and intx > mask which kvm interrupt sharing machinery > explicitly allows. > > Signed-off-by: Michael S. Tsirkin > > --- > Compiled only, don't have a setup for assignment > testing ATM. Is anyone interested enough to test > and report? > > Changes from v1: > - fix intx mask handling > > hw/device-assignment.c | 24 > 1 files changed, 8 insertions(+), 16 deletions(-) > > diff --git a/hw/device-assignment.c b/hw/device-assignment.c > index 89823f1..6cdcc17 100644 > --- a/hw/device-assignment.c > +++ b/hw/device-assignment.c > @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, > uint16_t r_seg, > FILE *f; > unsigned long long start, end, size, flags; > uint16_t id; > -struct stat statbuf; > PCIRegion *rp; > PCIDevRegions *dev = &pci_dev->real_device; > > @@ -610,12 +609,9 @@ again: > pci_dev->dev.config[2] = id & 0xff; > pci_dev->dev.config[3] = (id & 0xff00) >> 8; > > -/* dealing with virtual function device */ > -snprintf(name, sizeof(name), "%sphysfn/", dir); > -if (!stat(name, &statbuf)) { > -/* always provide the written value on readout */ > -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); > -} > +/* Pass bus master writes to device. */ > +pci_dev->emulate_config_write[PCI_COMMAND] &= ~PCI_COMMAND_MASTER; > +pci_dev->emulate_config_write[PCI_COMMAND + 1] &= > ~(PCI_COMMAND_INTX_DISABLE >> 8); Comment doesn't fully match the code. And I bet you aren't using checkpatch... 
;) > > dev->region_number = r; > return 0; > @@ -782,13 +778,10 @@ static int assign_device(AssignedDevice *dev) > "cause host memory corruption if the device issues DMA write > " > "requests!\n"); > } > -if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && > -kvm_has_intx_set_mask()) { > -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; > > -/* hide host-side INTx masking from the guest */ > -dev->emulate_config_read[PCI_COMMAND + 1] |= > -PCI_COMMAND_INTX_DISABLE >> 8; > +if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && > + kvm_has_intx_set_mask()) { > + assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; > } > > r = kvm_assign_pci_device(kvm_state, &assigned_dev_data); > @@ -1631,10 +1624,10 @@ static void reset_assigned_device(DeviceState *dev) > } > > /* > - * When a 0 is written to the command register, the device is logically > + * When a 0 is written to the bus master register, the device is > logically > * disconnected from the PCI bus. This avoids further DMA transfers. > */ > -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); > +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); > } > > static int assigned_initfn(struct PCIDevice *pci_dev) > @@ -1658,7 +1651,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) > * device initialization. > */ > assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); > -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); > assigned_dev_direct_config_read(dev, PCI_STATUS, 2); > assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); > assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); Patch looks otherwise fine to me. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: x86: Run PIT work in own kthread
On 2012-04-17 12:27, Avi Kivity wrote: > On 04/16/2012 09:26 PM, Jan Kiszka wrote: >> We can't run PIT IRQ injection work in the interrupt context of the host >> timer. This would allow the user to influence the handler complexity by >> asking for a broadcast to a large number of VCPUs. Therefore, this work >> was pushed into workqueue context in 9d244caf2e. However, this prevents >> prioritizing the PIT injection over other task as workqueues share >> kernel threads. >> >> This replaces the workqueue with a kthread worker and gives that thread >> a name in the format "kvm-pit/". That allows to >> identify and adjust the kthread priority according to the VM process >> parameters. > > Is this a new ABI? Yep. Scripts will use it, maybe even QEMU. Do you want this to appear in api.txt? > > Patch looks reasonable. > Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On Tue, Apr 17, 2012 at 01:42:31PM +0300, Avi Kivity wrote: > On 04/17/2012 01:31 PM, Gleb Natapov wrote: > > On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > > > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > > > It isn't, since you need to send an IPI. > > > > > > > > > That is exactly what I forget whether you can send IPI from there :) > > > > Anyway this is another reason. > > > > > > > > > > Actually I was wrong. You can't smp_call_function_single() from irq > > > context (deadlocks if two vcpus do that), but we send a reschedule > > > interrupt. So it should work. > > > > > Ah, good point. So if we'll use irqsave versions of spinlocks we can > > drop kthread? > > Do we want 254 IPIs to be issued from irq context? Could be slow. > Where this number is coming from? > We can make the unicast case run from irq context and defer the > multicast to a thread. > We do not know if gsi is multicasted in the pit code though. -- Gleb.
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
On 04/17/2012 01:51 PM, zhangyanfei wrote: > 于 2012年04月17日 15:44, Avi Kivity 写道: > > On 04/11/2012 04:39 AM, zhangyanfei wrote: > >> This patch set exports offsets of VMCS fields as note information for > >> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve > >> runtime state of guest machine image, such as registers, in host > >> machine's crash dump as VMCS format. The problem is that VMCS > >> internal is hidden by Intel in its specification. So, we reverse > >> engineering it in the way implemented in this patch set. Please note > >> that this processing never affects any existing kvm logic. The > >> VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO. > >> > >> Here is an example: > >> Processor: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz > >> > >> $cat /sys/kernel/vmcsinfo > >> 1cba8c0 2000 > >> > >> crash> rd -p 1cba8c0 1000 > >> 1cba8c0: 127b0009 53434d56 {...VMCS > >> 1cba8d0: 4f464e49 4e4f495349564552 INFOREVISION > >> 1cba8e0: 49460a643d44495f 5f4e495028444c45 _ID=d.FIELD(PIN_ > >> 1cba8f0: 4d565f4445534142 4f435f434558455f BASED_VM_EXEC_CO > >> 1cba900: 303d294c4f52544e 0a30383130343831 NTROL)=01840180. 
> >> 1cba910: 504328444c454946 5f44455341425f55 FIELD(CPU_BASED_ > >> 1cba920: 5f434558455f4d56 294c4f52544e4f43 VM_EXEC_CONTROL) > >> 1cba930: 393130343931303d 28444c4549460a30 =01940190.FIELD( > >> 1cba940: 5241444e4f434553 4558455f4d565f59 SECONDARY_VM_EXE > >> 1cba950: 4f52544e4f435f43 30346566303d294c C_CONTROL)=0fe40 > >> 1cba960: 4c4549460a306566 4958455f4d562844 fe0.FIELD(VM_EXI > >> 1cba970: 4f52544e4f435f54 346531303d29534c T_CONTROLS)=01e4 > >> 1cba980: 4549460a30653130 4e455f4d5628444c 01e0.FIELD(VM_EN > >> 1cba990: 544e4f435f595254 33303d29534c4f52 TRY_CONTROLS)=03 > >> 1cba9a0: 460a303133303431 45554728444c4549 140310.FIELD(GUE > >> 1cba9b0: 45535f53455f5453 3d29524f5443454c ST_ES_SELECTOR)= > >> 1cba9c0: 4549460a30303530 545345554728444c 0500.FIELD(GUEST > >> 1cba9d0: 454c45535f53435f 35303d29524f5443 _CS_SELECTOR)=05 > >> .. > >> > >> TODO: > >> 1. In kexec-tools, get VMCSINFO via sysfs and dump it as note information > >> into vmcore. > >> 2. Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process > >> core file. To do this, we will modify kernel core dumper, gdb gcore > >> and crash gcore. > >> 3. Dump guest image from the qemu-process core file into a vmcore. > >> > > > > Taking a step back, can you describe the problem scenario you're fixing > > here? > > > Considering two scenarios below: > 1. Host panics, guests running on that host will also be dumped into >host's vmcore. > 2. Qemu process is core dumped (by gdb gcore or kernel core dumper), and >its coresponding guest will be included in the core file. > > We want to create the guest machine's crash dump from host machine's vmcore > or qemu process's core file. Unfortunately, we cannot get the guest's > registers > values in both scenarios. > > For scenario 1, some key registers (CR0, CR3...) of the guest machine are > stored > in VMCS region. But VMCS internal is hidden by Intel specification. 
So this > patch set aims to get offsets of fields in VMCS region and export it as note > information for kdump. Okay. Do you expect it to help in debugging the crash? Did you have cases where it would help? > > For scenario 2, we also want the guest's register values to be dumped into > qemu process's core file when qemu process crashes. This is the task of > TODO-list 2. Why? If qemu crashed it is because of an internal qemu fault. If any guest registers were involved, they would have been decoded by qemu previously and would be present in the stack trace (for example mmio address/data). > Is this what you want? > Yes. I'm trying to understand if the feature would be useful in real life. -- error compiling committee.c: too many arguments to function
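For anyone eyeballing the `crash> rd` dump quoted in this thread by hand: each 64-bit word stores eight ASCII bytes of the VMCSINFO text least-significant byte first, which is why `53434d56` reads back as "VMCS". A minimal sketch of the decoding (the helper name is mine, not part of the patch set):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode one 64-bit little-endian word from the crash hex dump into
 * the eight ASCII bytes it stores, lowest byte first. */
static void decode_word(uint64_t w, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = (char)((w >> (8 * i)) & 0xff);
    out[8] = '\0';
}
```

For example, the word `4e4f495349564552` at 1cba8d0 decodes to "REVISION".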
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
于 2012年04月17日 15:44, Avi Kivity 写道: > On 04/11/2012 04:39 AM, zhangyanfei wrote: >> This patch set exports offsets of VMCS fields as note information for >> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve >> runtime state of guest machine image, such as registers, in host >> machine's crash dump as VMCS format. The problem is that VMCS >> internal is hidden by Intel in its specification. So, we reverse >> engineering it in the way implemented in this patch set. Please note >> that this processing never affects any existing kvm logic. The >> VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO. >> >> Here is an example: >> Processor: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz >> >> $cat /sys/kernel/vmcsinfo >> 1cba8c0 2000 >> >> crash> rd -p 1cba8c0 1000 >> 1cba8c0: 127b0009 53434d56 {...VMCS >> 1cba8d0: 4f464e49 4e4f495349564552 INFOREVISION >> 1cba8e0: 49460a643d44495f 5f4e495028444c45 _ID=d.FIELD(PIN_ >> 1cba8f0: 4d565f4445534142 4f435f434558455f BASED_VM_EXEC_CO >> 1cba900: 303d294c4f52544e 0a30383130343831 NTROL)=01840180. >> 1cba910: 504328444c454946 5f44455341425f55 FIELD(CPU_BASED_ >> 1cba920: 5f434558455f4d56 294c4f52544e4f43 VM_EXEC_CONTROL) >> 1cba930: 393130343931303d 28444c4549460a30 =01940190.FIELD( >> 1cba940: 5241444e4f434553 4558455f4d565f59 SECONDARY_VM_EXE >> 1cba950: 4f52544e4f435f43 30346566303d294c C_CONTROL)=0fe40 >> 1cba960: 4c4549460a306566 4958455f4d562844 fe0.FIELD(VM_EXI >> 1cba970: 4f52544e4f435f54 346531303d29534c T_CONTROLS)=01e4 >> 1cba980: 4549460a30653130 4e455f4d5628444c 01e0.FIELD(VM_EN >> 1cba990: 544e4f435f595254 33303d29534c4f52 TRY_CONTROLS)=03 >> 1cba9a0: 460a303133303431 45554728444c4549 140310.FIELD(GUE >> 1cba9b0: 45535f53455f5453 3d29524f5443454c ST_ES_SELECTOR)= >> 1cba9c0: 4549460a30303530 545345554728444c 0500.FIELD(GUEST >> 1cba9d0: 454c45535f53435f 35303d29524f5443 _CS_SELECTOR)=05 >> .. >> >> TODO: >> 1. 
In kexec-tools, get VMCSINFO via sysfs and dump it as note information >> into vmcore. >> 2. Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process >> core file. To do this, we will modify kernel core dumper, gdb gcore >> and crash gcore. >> 3. Dump guest image from the qemu-process core file into a vmcore. >> > > Taking a step back, can you describe the problem scenario you're fixing > here? > Considering two scenarios below: 1. Host panics, guests running on that host will also be dumped into host's vmcore. 2. Qemu process is core dumped (by gdb gcore or kernel core dumper), and its corresponding guest will be included in the core file. We want to create the guest machine's crash dump from host machine's vmcore or qemu process's core file. Unfortunately, we cannot get the guest's register values in either scenario. For scenario 1, some key registers (CR0, CR3...) of the guest machine are stored in VMCS region. But VMCS internal is hidden by Intel specification. So this patch set aims to get offsets of fields in VMCS region and export it as note information for kdump. For scenario 2, we also want the guest's register values to be dumped into qemu process's core file when qemu process crashes. This is the task of TODO-list 2. Is this what you want? Thanks Zhang Yanfei
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 01:42 PM, Avi Kivity wrote: > On 04/17/2012 01:31 PM, Gleb Natapov wrote: > > On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > > > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > > > It isn't, since you need to send an IPI. > > > > > > > > > That is exactly what I forget whether you can send IPI from there :) > > > > Anyway this is another reason. > > > > > > > > > > Actually I was wrong. You can't smp_call_function_single() from irq > > > context (deadlocks if two vcpus do that), but we send a reschedule > > > interrupt. So it should work. > > > > > Ah, good point. So if we'll use irqsave versions of spinlocks we can > > drop kthread? > > Do we want 254 IPIs to be issued from irq context? Could be slow. > > We can make the unicast case run from irq context and defer the > multicast to a thread. (that should help eventfd and device assignment) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 01:31 PM, Gleb Natapov wrote: > On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > > It isn't, since you need to send an IPI. > > > > > > > That is exactly what I forget whether you can send IPI from there :) > > > Anyway this is another reason. > > > > > > > Actually I was wrong. You can't smp_call_function_single() from irq > > context (deadlocks if two vcpus do that), but we send a reschedule > > interrupt. So it should work. > > > Ah, good point. So if we'll use irqsave versions of spinlocks we can > drop kthread? Do we want 254 IPIs to be issued from irq context? Could be slow. We can make the unicast case run from irq context and defer the multicast to a thread. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > It isn't, since you need to send an IPI. > > > > > That is exactly what I forget whether you can send IPI from there :) > > Anyway this is another reason. > > > > Actually I was wrong. You can't smp_call_function_single() from irq > context (deadlocks if two vcpus do that), but we send a reschedule > interrupt. So it should work. > Ah, good point. So if we'll use irqsave versions of spinlocks we can drop kthread? -- Gleb.
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > It isn't, since you need to send an IPI. > > > That is exactly what I forget whether you can send IPI from there :) > Anyway this is another reason. > Actually I was wrong. You can't smp_call_function_single() from irq context (deadlocks if two vcpus do that), but we send a reschedule interrupt. So it should work. -- error compiling committee.c: too many arguments to function
Re: [PATCH] KVM: x86: Run PIT work in own kthread
On 04/16/2012 09:26 PM, Jan Kiszka wrote: > We can't run PIT IRQ injection work in the interrupt context of the host > timer. This would allow the user to influence the handler complexity by > asking for a broadcast to a large number of VCPUs. Therefore, this work > was pushed into workqueue context in 9d244caf2e. However, this prevents > prioritizing the PIT injection over other task as workqueues share > kernel threads. > > This replaces the workqueue with a kthread worker and gives that thread > a name in the format "kvm-pit/". That allows to > identify and adjust the kthread priority according to the VM process > parameters. Is this a new ABI? Patch looks reasonable. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
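For context, the kthread_worker conversion described in the changelog would look roughly like this. This is a kernel-only sketch against the ~3.x kthread_worker API, not the actual patch; the struct fields and the pid lookup are illustrative:

```c
/* Not compiled here: sketch of replacing the PIT workqueue with a
 * dedicated, nameable worker thread. */
struct kvm_pit {
	struct kthread_worker worker;     /* replaces struct workqueue_struct * */
	struct task_struct *worker_task;  /* handle for renicing / RT priority */
	struct kthread_work expired;      /* replaces struct work_struct */
	/* ... */
};

/* at PIT creation */
init_kthread_worker(&pit->worker);
pit->worker_task = kthread_run(kthread_worker_fn, &pit->worker,
			       "kvm-pit/%d", pid_nr(pid));
init_kthread_work(&pit->expired, pit_do_work);

/* from the hrtimer callback */
queue_kthread_work(&pit->worker, &pit->expired);
```

Because the worker has a stable name and a task_struct of its own, management tools can chrt/renice it per VM — which anonymous workqueue threads don't allow. That per-VM thread name is the "new ABI" question Avi raises.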
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On Tue, Apr 17, 2012 at 01:23:52PM +0300, Avi Kivity wrote: > On 04/17/2012 12:31 PM, Gleb Natapov wrote: > > On Mon, Apr 16, 2012 at 06:33:26PM +0200, Jan Kiszka wrote: > > > On 2010-06-16 23:11, Chris Lalancette wrote: > > > > We really want to "kvm_set_irq" during the hrtimer callback, > > > > but that is risky because that is during interrupt context. > > > > Instead, offload the work to a workqueue, which is a bit safer > > > > and should provide most of the same functionality. > > > > > > Unfortunately, workqueues do not have fixed kthread associations (and > > > "kvm-pit-wq" would be too unspecific when running multiple VMs). So I > > > just realized that this subtly breaks the ability to run KVM guests with > > > RT priority (boot managers with timeouts hang as the workqueue starves). > > > > > > Before throwing some kthread_worker at this, could someone help me > > > recalling what was "risky" here? That the PIT IRQ may have to be > > > broadcast to a large number of VCPUs? I would offer to include this > > > information in my changelog. ;) > > > > > ioapic and pic irq_set functions uses spinlocks, but not irqsave > > variants. Also I am always forget whether is is safe to kick vcpu from > > irq context. > > It isn't, since you need to send an IPI. > That is exactly what I forget whether you can send IPI from there :) Anyway this is another reason. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] KVM updates for the 3.4 merge window
On 04/17/2012 12:34 PM, Alexander Graf wrote: >>> tree but his tree is in linux-next as well. There's no reason not to do >>> that. >>> >>> That way, his next branch gets linux-next coverage whether it's in my >>> tree or not, and I pull when I put the final powerpc-next together, >>> which gives me a chance to do a quick vet "just in case" and sort out >>> any major conflict before it all goes to Linus. >>> >> Sure, that works too. > Or do like I do with Kumar for FSL stuff... his stuff gets pulled via my > > Sounds even easier to me. So how do I get my tree into linux-next? Send a note to Stephen Rothwell (copied). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 12:31 PM, Gleb Natapov wrote: > On Mon, Apr 16, 2012 at 06:33:26PM +0200, Jan Kiszka wrote: > > On 2010-06-16 23:11, Chris Lalancette wrote: > > > We really want to "kvm_set_irq" during the hrtimer callback, > > > but that is risky because that is during interrupt context. > > > Instead, offload the work to a workqueue, which is a bit safer > > > and should provide most of the same functionality. > > > > Unfortunately, workqueues do not have fixed kthread associations (and > > "kvm-pit-wq" would be too unspecific when running multiple VMs). So I > > just realized that this subtly breaks the ability to run KVM guests with > > RT priority (boot managers with timeouts hang as the workqueue starves). > > > > Before throwing some kthread_worker at this, could someone help me > > recalling what was "risky" here? That the PIT IRQ may have to be > > broadcast to a large number of VCPUs? I would offer to include this > > information in my changelog. ;) > > > ioapic and pic irq_set functions uses spinlocks, but not irqsave > variants. Also I am always forget whether is is safe to kick vcpu from > irq context. It isn't, since you need to send an IPI. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote: > On 4/16/2012 5:18 AM, Gleb Natapov wrote: > >On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote: > >>On 04/11/2012 01:21 PM, Chegu Vinod wrote: > >>>Hello, > >>> > >>>While running an AIM7 (workfile.high_systime) in a single 40-way (or a > >>>single > >>>60-way KVM guest) I noticed pretty bad performance when the guest was > >>>booted > >>>with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220 > >>>(RHEL6.2) kernel. > >>>For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better > >>>than > >>>the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older > >>>guest > >>>kernel was nearly 12x better ! > >How many CPUs your host has? > > 80 Cores on the DL980. (i.e. 8 Westmere sockets). > So you are not oversubscribing CPUs at all. Are those real cores or including HT? Do you have other cpus hogs running on the host while testing the guest? > I was using numactl to bind the qemu of the 40-way guests to numa > nodes : 4-7 ( or for a 60-way guest > binding them to nodes 2-7) > > /etc/qemu-ifup tap0 > > numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7 > /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu > Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme > -enable-kvm \ > -m 65536 -smp 40 \ > -name vm1 -chardev > socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait > \ > -drive > file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none > -device virtio-blk-pci,scsi=off,bus=pci > .0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ > -monitor stdio \ > -net nic,macaddr=<..mac_addr..> \ > -net tap,ifname=tap0,script=no,downscript=no \ > -vnc :4 > > /etc/qemu-ifdown tap0 > > > I knew that there will be a few additional temporary qemu worker > threads created... i.e. some over > subscription will be there. 
> 4 nodes above have 40 real cores, yes? Can you try to run upstream kernel without binding at all and check the performance? > > Will have to retry by doing some explicit pinning of the vcpus to > native cores (without using virsh). > > >>>Turned on function tracing and found that there appears to be more time > >>>being > >>>spent around the lock code in the 3.3.1 guest when compared to the > >>>2.6.32-220 > >>>guest. > >>Looks like you may be running into the ticket spinlock > >>code. During the early RHEL 6 days, Gleb came up with a > >>patch to automatically disable ticket spinlocks when > >>running inside a KVM guest. > >> > >>IIRC that patch got rejected upstream at the time, > >>with upstream developers preferring to wait for a > >>"better solution". > >> > >>If such a better solution is not on its way upstream > >>now (two years later), maybe we should just merge > >>Gleb's patch upstream for the time being? > >I think the pv spinlock that is actively discussed currently should > >address the issue, but I am not sure someone tests it against non-ticket > >lock in a guest to see which one performs better. > > I did see that discussion...seems to have originated from the Xen context. > Yes, The problem is the same for both hypervisors. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] KVM updates for the 3.4 merge window
On 04/17/2012 09:20 AM, Avi Kivity wrote: On 04/17/2012 02:05 AM, Benjamin Herrenschmidt wrote: On Mon, 2012-04-16 at 15:53 +0300, Avi Kivity wrote: kvm.git next is exposed to linux-next, where they get tested quite a lot. Granted it's mostly build testing, and people are unlikely to test kvm there, but they will test the non-kvm bits that creep in there. The alternative would be that I don't have a -next tree, just collect patches and immediately send them to Avi. That way the main kvm tree would be broken more often, but at least we don't get these horrible synchronization latencies. That works too. Don't post immediately; 2-3 week batches would reduce noise. Or do like I do with Kumar for FSL stuff... his stuff gets pulled via my tree but his tree is in linux-next as well. There's no reason not to do that. That way, his next branch gets linux-next coverage whether it's in my tree or not, and I pull when I put the final powerpc-next together, which gives me a chance to do a quick vet "just in case" and sort out any major conflict before it all goes to Linus. Sure, that works too. Sounds even easier to me. So how do I get my tree into linux-next? Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On Mon, Apr 16, 2012 at 06:33:26PM +0200, Jan Kiszka wrote: > On 2010-06-16 23:11, Chris Lalancette wrote: > > We really want to "kvm_set_irq" during the hrtimer callback, > > but that is risky because that is during interrupt context. > > Instead, offload the work to a workqueue, which is a bit safer > > and should provide most of the same functionality. > > Unfortunately, workqueues do not have fixed kthread associations (and > "kvm-pit-wq" would be too unspecific when running multiple VMs). So I > just realized that this subtly breaks the ability to run KVM guests with > RT priority (boot managers with timeouts hang as the workqueue starves). > > Before throwing some kthread_worker at this, could someone help me > recalling what was "risky" here? That the PIT IRQ may have to be > broadcast to a large number of VCPUs? I would offer to include this > information in my changelog. ;) > ioapic and pic irq_set functions use spinlocks, but not the irqsave variants. Also, I always forget whether it is safe to kick a vcpu from irq context. -- Gleb.
Re: [PATCHv1 dont apply] RFC: kvm eoi PV using shared memory
On 04/16/2012 03:18 PM, Michael S. Tsirkin wrote: > On Mon, Apr 16, 2012 at 02:24:46PM +0300, Gleb Natapov wrote: > > On Mon, Apr 16, 2012 at 02:09:20PM +0300, Michael S. Tsirkin wrote: > > > Thanks very much for the review. I'll address the comments. > > > Some questions on your comments below. > > > > > > On Mon, Apr 16, 2012 at 01:08:07PM +0300, Gleb Natapov wrote: > > > > > @@ -37,6 +38,8 @@ > > > > > #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01 > > > > > #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 > > > > > #define MSR_KVM_STEAL_TIME 0x4b564d03 > > > > > +#define MSR_KVM_EOI_EN 0x4b564d04 > > > > > +#define MSR_KVM_EOI_DISABLED 0x0L > > > > This is a valid gpa. Follow the other MSRs' example, i.e. align the address to, > > > > let's say, a dword, and use the lsb as the enable bit. > > > > > > We only need a single byte, since this is per-CPU - > > > it's better to save the memory, so no alignment is required. > > > An explicit disable msr would also address this, right? > > > > > We do not have a shortage of memory. > > Better make all MSRs work the same > > way. > > I agree it's nice to have EOI and ASYNC_PF look similar > but wasting memory is also bad. I'll ponder this some more. Wasting three bytes? > > BTW have you added the new MSR to the msrs_to_save array? I forgot to > > check. > > I didn't yet. Trying to understand how that will affect > cross-version migration - any input? It will just work. -- error compiling committee.c: too many arguments to function
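The convention Gleb is pointing at — used by e.g. MSR_KVM_ASYNC_PF_EN — packs an aligned guest physical address and an enable bit into one MSR value, so no separate "disabled" sentinel is needed. A minimal userspace sketch with dword alignment as he suggests (helper names are mine, not KVM's):

```c
#include <assert.h>
#include <stdint.h>

#define KVM_MSR_ENABLED 1ULL        /* lsb of the MSR value */

/* Pack a dword-aligned gpa plus the enable bit into one MSR value. */
static uint64_t msr_pack(uint64_t gpa)
{
    return (gpa & ~0x3ULL) | KVM_MSR_ENABLED;
}

static int msr_enabled(uint64_t v)  { return (int)(v & KVM_MSR_ENABLED); }
static uint64_t msr_gpa(uint64_t v) { return v & ~0x3ULL; }
```

Since gpa 0 with the enable bit clear is the natural "disabled" state, the extra MSR_KVM_EOI_DISABLED define becomes unnecessary.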
Re: [PATCHv1 dont apply] RFC: kvm eoi PV using shared memory
On 04/16/2012 01:08 PM, Gleb Natapov wrote: > > } > > > > + > > + if (kvm_para_has_feature(KVM_FEATURE_EOI)) { > > + kvm_guest_native_apic_write = apic->write; > > + apic->write = kvm_guest_apic_write; > > + wrmsrl(MSR_KVM_EOI_EN, __pa(&__get_cpu_var(apic_eoi))); > > + } > Can happen on more than one cpu. After it happens on two, > kvm_guest_apic_write() > will call itself. It's also hacky. Try static_cpu_has() in the apic code (and add an eoi method). -- error compiling committee.c: too many arguments to function
Re: [PATCH v3]KVM: PPC: Use clockevent multiplier and shifter for decrementer
On 04/17/2012 05:55 AM, Bharat Bhushan wrote: Time for which the hrtimer is started for decrementer emulation is calculated using tb_ticks_per_usec. While hrtimer uses the clockevent for DEC reprogramming (if needed) and which calculate timebase ticks using the multiplier and shifter mechanism implemented within clockevent layer. It was observed that this conversion (timebase->time->timebase) are not correct because the mechanism are not consistent. In our setup it adds 2% jitter. With this patch clockevent multiplier and shifter mechanism are used when starting hrtimer for decrementer emulation. Now the jitter is< 0.5%. Signed-off-by: Bharat Bhushan --- v3: - decrementer_clockevent symbol exported. arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/time.c |3 ++- arch/powerpc/kvm/emulate.c |5 +++-- 3 files changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index 7eb10fb..b3c7959 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern unsigned long tb_ticks_per_jiffy; extern unsigned long tb_ticks_per_usec; extern unsigned long tb_ticks_per_sec; +extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 567dd7c..4f7a285 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -105,7 +105,7 @@ static int decrementer_set_next_event(unsigned long evt, static void decrementer_set_mode(enum clock_event_mode mode, struct clock_event_device *dev); -static struct clock_event_device decrementer_clockevent = { +struct clock_event_device decrementer_clockevent = { .name = "decrementer", .rating = 200, .irq= 0, @@ -113,6 +113,7 @@ static struct clock_event_device decrementer_clockevent = { .set_mode = decrementer_set_mode, .features = CLOCK_EVT_FEAT_ONESHOT, }; 
+EXPORT_SYMBOL(decrementer_clockevent); DEFINE_PER_CPU(u64, decrementers_next_tb); static DEFINE_PER_CPU(struct clock_event_device, decrementers); diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c index afc9154..c8b5206 100644 --- a/arch/powerpc/kvm/emulate.c +++ b/arch/powerpc/kvm/emulate.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -104,8 +105,8 @@ void kvmppc_emulate_dec(struct kvm_vcpu *vcpu) */ dec_time = vcpu->arch.dec; - dec_time *= 1000; - do_div(dec_time, tb_ticks_per_usec); Sorry for the late nitpicking, but could you please add a comment here that states that we can use the host's decrementer calculations because the guest's timebase ticks at the same speed? Alex + dec_time = dec_time<< decrementer_clockevent.shift; + do_div(dec_time, decrementer_clockevent.mult); dec_nsec = do_div(dec_time, NSEC_PER_SEC); hrtimer_start(&vcpu->arch.dec_timer, ktime_set(dec_time, dec_nsec), HRTIMER_MODE_REL); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
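The conversion the patch adopts is the exact inverse of the clockevent layer's ns-to-ticks path, so the two directions now cancel instead of accumulating jitter through the intermediate tb_ticks_per_usec rounding. A userspace sketch of the arithmetic with toy values (the kernel derives mult/shift via its clocksource/clockevent calibration helpers; the 500 MHz timebase below is made up):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* ns -> device ticks, as the clockevent layer programs the decrementer. */
static uint64_t ns_to_ticks(uint64_t ns, uint32_t mult, uint32_t shift)
{
    return (ns * mult) >> shift;     /* fine for the small toy values here */
}

/* ticks -> ns, as the patched kvmppc_emulate_dec() now computes it:
 * dec_time = (dec << shift) / mult. */
static uint64_t ticks_to_ns(uint64_t ticks, uint32_t mult, uint32_t shift)
{
    return (ticks << shift) / mult;  /* dec is 32 bits, so no overflow */
}
```

Because both directions use the same mult/shift pair, ns_to_ticks(ticks_to_ns(d)) returns d, which is what removes the jitter.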
Re: [PATCHv1 dont apply] RFC: kvm eoi PV using shared memory
On Mon, Apr 16, 2012 at 10:01:29PM +0300, Michael S. Tsirkin wrote: > On Mon, Apr 16, 2012 at 08:51:16PM +0300, Gleb Natapov wrote: > > > > Only when eoi is pending. This is rare. > > > > > > This is exactly while guest handles an interrupt. > > > It's not all that rare at all: e.g. device > > > drivers cause an exit from interrupt > > > handler by doing io. > > So eoi will be coalesced with io that device driver does. Exactly what > > we want. > > It won't. While we handle interrupts eoi is still set. > So there will be a couple of tests + read from userspace > Wasted not a huge overhead but it's the principle of the thing. > Linux acks irq before calling device handler. Windows for some devices does IO to a device before acking. -- Gleb.
Re: EuroSec'12 Presentation (ASLR reduces effect of KSM)
From: Marcelo Tosatti Subject: Re: EuroSec'12 Presentation (ASLR reduces effect of KSM) Date: Mon, 16 Apr 2012 21:29:31 -0300 > On Mon, Apr 16, 2012 at 03:52:10PM +0900, Kuniyasu Suzaki wrote: > > > > Marcelo, > > > > From: Marcelo Tosatti > > Subject: Re: EuroSec'12 Presentation (ASLR reduces effect of KSM) > > Date: Fri, 13 Apr 2012 21:47:47 -0300 > > > > > On Thu, Apr 12, 2012 at 08:24:57PM +0900, Kuniyasu Suzaki wrote: > > > > > > > > Dear, > > > > > > > > I made a presentation which measures OS security functions (ASLR, > > > > Memory Sanitization, and Cache Page Flushing) on memory deduplication > > > > "KSM with VKM" at EuroSec 2012. > > > > > > > > The title is "Effects of Memory Randomization, Sanitization and Page > > > > Cache on Memory Deduplication". > > > > # This is one of the papers related to my memory deduplication research. > > > > > > > > The slide is downloadable. > > > > > > > > http://www.slideshare.net/suzaki/eurosec2012-effects-of-memory-randomization-sanitization-and-page-cache-on-memory-deduplication-by-ksuzaki > > > > The paper will be downloadable from the ACM Digital Library. > > > > > > > > The results show ASLR reduces the effect of memory deduplication. > > > > Please tell me if you have comments. Thank you. > > > > > > > > -- > > > > Kuniyasu Suzaki, National Institute of Advanced Industrial Science > > > > and Technology, > > > > http://staff.aist.go.jp/k.suzaki > > > > > > Very nice. ASLR is supposed to increase the number of unshared pages > > > because translation tables that contain addresses of symbols will > > > differ for every instance of an executable. > > > > Thank you for the good suggestion. > > Anyway, how large are the translation tables? > > One entry per symbol that is accessed outside of the object or > main executable, one table per shared object (GOT and PLT tables). See > the ELF documentation. > > > In our experience, ASLR on 4 GuestOS (Linux) increased the memory > > consumption more than 50MB.
> > Does the translation table in a linux take more than 10MB? > > Increased memory consumption is due to larger number of pagetables > (which is necessary to cover larger virtual address space). > Increased number of unshared pages can be explained by translation > tables. GOT and PLT tables include virtual addresses, which are randomized by ASLR. The addresses are shown in /proc/PID/task/PID/maps. The pagetables are the tables that translate from virtual addresses to physical addresses. The memory pages which include the GOT and PLT tables and pagetables are not identical (unshared), and their size is not small. -- suzaki > > > Can you share additional information about "HICAMP (hardware memory > > > deduplication)" ? > > > > The detail of HICAMP was presented at ASPLOS 2011. > > ASPLOS 2011 paper http://dl.acm.org/citation.cfm?id=2151007&preflayout=tabs > > > > -- > > suzaki >
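Suzaki's point — that GOT/PLT pages hold absolute, per-process randomized addresses and therefore defeat page-level deduplication even when the code pages are identical — can be illustrated with a toy model (the symbol offsets and base addresses below are made up):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy "GOT page": absolute addresses of the same three symbols.
 * Two different ASLR bases make the page bytes differ, so a
 * byte-comparing deduplicator like KSM cannot merge the pages. */
static void fill_got(uint64_t aslr_base, uint64_t got[3])
{
    const uint64_t sym_off[3] = { 0x1040, 0x20a0, 0x3f10 };
    for (int i = 0; i < 3; i++)
        got[i] = aslr_base + sym_off[i];
}
```

With ASLR disabled (same base in every instance) the pages come out byte-identical and merge; with randomized bases they never do, which matches the measured drop in KSM sharing.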
Re: [PATCHv1 dont apply] RFC: kvm eoi PV using shared memory
On Mon, Apr 16, 2012 at 09:56:02PM +0300, Michael S. Tsirkin wrote: > On Mon, Apr 16, 2012 at 08:37:37PM +0300, Gleb Natapov wrote: > > On Mon, Apr 16, 2012 at 08:24:16PM +0300, Michael S. Tsirkin wrote: > > > On Mon, Apr 16, 2012 at 06:10:11PM +0300, Gleb Natapov wrote: > > > > > > lapic changes should be minimal. > > > > > > > > > > Exactly my motivation. > > > > > > > > > My patch removes 13 lines more :) > > > > > > Haven't checked whether your patch is correct yet > > > but I see it's checking the eoi register on each exit. > > > > > Only if eoi.pending == true on exit. > > it's checking eoi.pending always and eoi register when > that is true. > We already have if there, so we can reuse it. Instead of checking vapic_addr in kvm_lapic_sync_from_vapic() let it check apic_attention bitmask. vapic_addr and eoi.pending will set bit there and if() will become if(!apic_attention) return. Zero overhead. > > > I think it's clear this would make code a bit shorter > > > (not necessarily clearer as we are relying on ) but > > > as I said even though the extra overhead is likely > > > negligible I have a feeling it's a wrong approach since > > > this won't scale as we add more features. > > > > > > Let's see what others think. > > > > > What I do not like about not calling eoi here is that we > > have to reason about all the paths into apic and check that > > we clear isr on all of them. > > Not at all. Instead, check each time before we read isr > or ppr. Or inject interrupt, or ... Luckily lapic code calls apic_update_ppr() a lot (just in case), so adding a check there likely covers any relevant entry into the apic, but this is exactly the reasoning I would like to avoid if possible :) I do not see a bug per se with your current scheme, otherwise I would point it out. > > > And that is not all. eoi handler > > set event request, is it OK to skip it? May be, or may be not. > > I think it's for reinjection of lower prio interrupt. Me too.
But skipping setting that bit should not be taken lightly. We need to know for sure. Current code tries to err on safe side and prefers to set event request when not needed instead of missing it when it is needed. Those rare cases are hard to debug. > Since eoi optimization is disabled in that case we don't need to set > event request. > > > > > > I also find the logic easier to follow as is - > > > > > it is contained in lapic.c without relying > > > > > on being called from x86.c as just the right moment. > > > > > > > > > See the patch. It change nothing outside of lapic.c. > > > > > > True but you rely on x86.c to call kvm_sync_lapic_from_vapic > > > at the right time before injecting interrupts. > > Not before injecting interrupts, but on vmexit. > > sync is called on entry, not on exit, no? No. Look at the code. Entry is to late. > > > > > I haven't checked whether that is always the case but > > > to me, this makes the code less clear and more fragile. > > > > > We call a lot of apic functions from x86.c. That's were interrupt > > injection happens. > > > > > Again, it appears to be a matter of taste. > > > > > > -- > > > MST > > > > -- > > Gleb. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: writeback cache + h700 controller w/1gb nvcache, corruption on power loss
On Mon, Apr 16, 2012 at 9:51 AM, Ron Edison wrote:
> I would be very interested in how to ensure the guests are sending
> flushes. I'm unfamiliar with the example you gave, where is that
> configured?

"mount -o barrier=1 /dev/sda /mnt" is a mount option for ext3 and ext4
file systems. You probably don't want this actually, since you have a
battery-backed RAID controller. See below for more.

> Primarily the guests are CentOS 4, 5 or 6. I am also curious if it
> would be advisable to switch to writethrough cache on each guest
> virtual disk and leave writeback enabled on the controller, and if
> that would adversely affect performance of the guests.

The most conservative modes are cache=writethrough (uses the host page
cache) and cache=directsync (does not use the host page cache). They
both ensure that every single write is flushed to disk, so they carry a
performance penalty. cache=directsync minimizes stress on host memory
because it bypasses the page cache.

Since you have a non-volatile cache in your RAID controller you can
also use cache=none. This also bypasses the host page cache but does
not flush every single write. The guest may still send flushes, but
even if it does not, the writes are going to the RAID controller's
non-volatile cache.

> The disk corruption experienced was indeed lost data -- an fsck was
> necessary for 4 of the guests to boot at all in RW mode; they first
> came up read-only. In the case of one of the guests there was actually
> file data lost after fsck was manually run in single-user mode after
> reboot. In some cases these were config files, in others database
> indexes, etc. This one of the 4 guests, the one with the most severe
> corruption, was not usable and we had to revert to a backup and pull
> as much current data out of it as possible.

Since you used QEMU -drive cache=writeback, data loss is expected on
host power failure. cache=writeback uses the (volatile) host page
cache, and therefore data may not have made it to the RAID controller
before power was lost.

Guest file system recovery - either a quick journal replay or a painful
fsck - is also expected on host power failure. The file systems are
dirty since the guest stopped executing without cleanly unmounting its
file systems. If you use cache=none or cache=directsync then you should
get a quick journal replay, and the risk of a painful fsck should be
reduced (most/all of the data will have been preserved).

Stefan
Re: Virtio network performance on Debian
On 12.04.2012 11:42, Hans-Kristian Bakke wrote:
> Hi
>
> For some reason I am not able to get good network performance using
> virtio/vhost-net on a Debian KVM host (perhaps also valid for Ubuntu
> hosts then).

The issue has been identified, after Hans-Kristian gave me access to
his machine and I did a lot of testing. And as usual, the root cause
was very stupid... ;)

In the last release of the Debian qemu-kvm package I changed the way
the Debian package version string propagates to the build procedure --
before, it was a patch grabbing values from debian/version, but in the
last release I used the --with-pkgversion configure flag. This resulted
in the following change:

- QEMU emulator version 1.0 (qemu-kvm-1.0 Debian 1.0+dfsg-8), Copyright (c) 2003-2008 Fabrice Bellard
+ QEMU emulator version 1.0 (Debian qemu-kvm 1.0+dfsg-9), Copyright (c) 2003-2008 Fabrice Bellard

As it turns out, libvirt parses `qemu -version' output and looks for
the " (qemu-kvm-" string in there, and if it is found, libvirt enables
some "extra" features. Important for us was setting the IFF_VNET_HDR
flag for the tap device -- apparently this flag makes a HUGE difference
in networking speed, especially when using vhost_net.

Obviously this is a change unique to Debian, and I never thought about
such an.. "interesting" effect it may give us. This is actually a
libvirt bug, since support for vnet_hdr can be determined by other
means, and since upstream qemu now has almost everything from qemu-kvm,
it wants to be fast too. But since qemu[-kvm] has a long history of
changing features, it is difficult to blame libvirt that much... Oh
well.

Thanks!

/mjt
Re: [PATCH 00/13] KVM: MMU: fast page fault
On 04/17/2012 09:26 AM, Xiao Guangrong wrote:
> On 04/16/2012 11:49 PM, Takuya Yoshikawa wrote:
> > Although O(1) is actually O(1) for the GET_DIRTY_LOG thread, it adds
> > some overhead to page fault handling. We may need to hold mmu_lock
> > for properly handling O(1)'s write protection, and ~500 write
> > protections will not be so cheap. And there is no answer to the
> > question of how to achieve slot-wise write protection.
>
> Actually no.
>
> We do not increase the overhead of page faults for migration. The
> number of page faults with O(1) is the same as write-protecting all
> sptes.

That's true with the write-protect-everything approach we use now. But
it's not true with range-based write protection, where you issue
GET_DIRTY_LOG on a range of pages and only need to re-write-protect
them. (The motivation for that is to decrease the time between
GET_DIRTY_LOG and sending the page; as that time increases, the chances
that the page got re-dirtied go up.)

That doesn't mean O(1) is unusable for this, just that it requires more
thought. Especially with direct maps, we can write-enable pages very
quickly.

> And, we can also avoid holding mmu_lock to write-protect PML4s; we can
> use a generation number, and notify the mmu to update its page table
> when dirty logging is enabled.

Generation numbers are also useful for O(1) invalidation.

> Anyway, no performance data, no truth. Let me implement it first.

--
error compiling committee.c: too many arguments to function
Re: [GIT PULL] KVM updates for the 3.4 merge window
On 04/17/2012 02:05 AM, Benjamin Herrenschmidt wrote:
> On Mon, 2012-04-16 at 15:53 +0300, Avi Kivity wrote:
> > kvm.git next is exposed to linux-next, where they get tested quite a
> > lot. Granted it's mostly build testing, and people are unlikely to
> > test kvm there, but they will test the non-kvm bits that creep in
> > there.
> >
> > > The alternative would be that I don't have a -next tree, just
> > > collect patches and immediately send them to Avi. That way the main
> > > kvm tree would be broken more often, but at least we don't get these
> > > horrible synchronization latencies.
> >
> > That works too. Don't post immediately; 2-3 week batches would reduce
> > noise.
>
> Or do like I do with Kumar for FSL stuff... his stuff gets pulled via
> my tree but his tree is in linux-next as well. There's no reason not to
> do that.
>
> That way, his next branch gets linux-next coverage whether it's in my
> tree or not, and I pull when I put the final powerpc-next together,
> which gives me a chance to do a quick vet "just in case" and sort out
> any major conflict before it all goes to Linus.

Sure, that works too.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
On 04/11/2012 04:39 AM, zhangyanfei wrote:
> This patch set exports offsets of VMCS fields as note information for
> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve the
> runtime state of a guest machine image, such as registers, from the
> host machine's crash dump in VMCS format. The problem is that the VMCS
> internals are hidden by Intel in its specification. So we reverse
> engineer them in the way implemented in this patch set. Please note
> that this processing never affects any existing kvm logic. The
> VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO.
>
> Here is an example:
> Processor: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
>
> $ cat /sys/kernel/vmcsinfo
> 1cba8c0 2000
>
> crash> rd -p 1cba8c0 1000
> 1cba8c0:  127b0009 53434d56                    {...VMCS
> 1cba8d0:  4f464e49 4e4f495349564552            INFOREVISION
> 1cba8e0:  49460a643d44495f 5f4e495028444c45    _ID=d.FIELD(PIN_
> 1cba8f0:  4d565f4445534142 4f435f434558455f    BASED_VM_EXEC_CO
> 1cba900:  303d294c4f52544e 0a30383130343831    NTROL)=01840180.
> 1cba910:  504328444c454946 5f44455341425f55    FIELD(CPU_BASED_
> 1cba920:  5f434558455f4d56 294c4f52544e4f43    VM_EXEC_CONTROL)
> 1cba930:  393130343931303d 28444c4549460a30    =01940190.FIELD(
> 1cba940:  5241444e4f434553 4558455f4d565f59    SECONDARY_VM_EXE
> 1cba950:  4f52544e4f435f43 30346566303d294c    C_CONTROL)=0fe40
> 1cba960:  4c4549460a306566 4958455f4d562844    fe0.FIELD(VM_EXI
> 1cba970:  4f52544e4f435f54 346531303d29534c    T_CONTROLS)=01e4
> 1cba980:  4549460a30653130 4e455f4d5628444c    01e0.FIELD(VM_EN
> 1cba990:  544e4f435f595254 33303d29534c4f52    TRY_CONTROLS)=03
> 1cba9a0:  460a303133303431 45554728444c4549    140310.FIELD(GUE
> 1cba9b0:  45535f53455f5453 3d29524f5443454c    ST_ES_SELECTOR)=
> 1cba9c0:  4549460a30303530 545345554728444c    0500.FIELD(GUEST
> 1cba9d0:  454c45535f53435f 35303d29524f5443    _CS_SELECTOR)=05
> ..
>
> TODO:
> 1. In kexec-tools, get VMCSINFO via sysfs and dump it as note
>    information into vmcore.
> 2. Dump the VMCS region of each guest vcpu and VMCSINFO into the
>    qemu-process core file. To do this, we will modify the kernel core
>    dumper, gdb gcore and crash gcore.
> 3. Dump the guest image from the qemu-process core file into a vmcore.

Taking a step back, can you describe the problem scenario you're fixing
here?

--
error compiling committee.c: too many arguments to function
Re: [PATCH v2 10/16] KVM: MMU: fask check whether page is writable
On 04/17/2012 06:55 AM, Xiao Guangrong wrote:
> On 04/16/2012 07:47 PM, Avi Kivity wrote:
> > On 04/16/2012 01:20 PM, Xiao Guangrong wrote:
> > > > > It is used to avoid the unnecessary overload
> > > >
> > > > It's overloading me :(
> > >
> > > Sorry.
> >
> > The trick is to send those in a separate patchset so the maintainer
> > doesn't notice.
>
> Thanks for your suggestion, I will pay more attention to it in the
> future.
>
> For this patch, what did you mean by "those"? Do you mean the whole of
> rmap.PTE_LIST_WP_BIT (fast check for shadow page table write
> protection and host write protection) or just host_page_write_protect
> (for KSM only)?

All of it. Let's start with just modifying sptes concurrently, and only
later add reading bits from the rmap concurrently, if it proves
necessary.

> If we do not have rmap.PTE_LIST_WP_BIT, there may be a regression on
> the shadow mmu.
>
> Hmm, do I need to implement rmap.PTE_LIST_WP_BIT, then fast page
> fault?

Let's try to measure the effect without rmap.PTE_LIST_WP_BIT. Usually
PTE chains for page tables are short, so the effect would be small. Of
course we can't tell about all guests.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] DOS VM problem with QEMU-KVM and newer kernels
On 04/17/2012 09:57 AM, Gleb Natapov wrote:
> > Status is as follows:
> > 1.) Kernel 3.4.x DIDN'T fix the problem
> > 2.) Reverting f1c1da2bde712812a3e0f9a7a7ebe7a916a4b5f4 FIXED the
> >     problem.
> >
> > So the bug is still present in 3.2, 3.3 and 3.4-rc, and a possible
> > fix doesn't work. Should be fixed in the 3.4 release.
> >
> > How to proceed further?
> >
> Can you post an image of your VM somewhere?

If you can't, then those traces would be a start.

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC V5 2/6] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
On 04/12/2012 05:59 AM, Marcelo Tosatti wrote:
> On Wed, Apr 11, 2012 at 09:06:29PM -0300, Marcelo Tosatti wrote:
> > On Fri, Mar 23, 2012 at 01:37:04PM +0530, Raghavendra K T wrote:
> > > From: Srivatsa Vaddagiri
> > > [...]
> > > 	barrier();
> >
> > Is it always OK to erase the notification, even in case an unrelated
> > event such as an interrupt was the source of the wakeup? Note I am
> > only asking whether it is OK to lose a notification, not requesting
> > a change to an atomic test-and-clear.

Yes.. got your point. IMO, this is the only (safe) place where it can
clear the kicked (pv_unhalted) flag, since the vcpu is going to be
runnable. And you are also right in having a concern about an unwanted
clear of the flag, since that would eventually result in vcpu/VM hangs.
Hope I did not miss anything.

> It would be nice to have a comment explaining it.

Sure, will do that.
Re: [PATCH RFC V5 3/6] kvm : Add unhalt msr to aid (live) migration
On 04/12/2012 05:45 AM, Marcelo Tosatti wrote:
> On Fri, Mar 23, 2012 at 01:37:26PM +0530, Raghavendra K T wrote:
> > From: Raghavendra K T
> > [...]
>
> Unless there is a reason to use an MSR, this should use a normal ioctl
> such as KVM_{GET,SET}_MP_STATE.

I agree with you. In the current implementation, since we are not doing
any communication between host/guest (on this flag), I too felt an MSR
is overkill for this.

IMO, a patch like the one below should do the job, which I am planning
to include in the next version of the patch. Let me know if you foresee
any side effects.

---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa44292..5c81a66 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5691,7 +5691,9 @@ int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 
 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
 				    struct kvm_mp_state *mp_state)
 {
-	mp_state->mp_state = vcpu->arch.mp_state;
+	mp_state->mp_state = (vcpu->arch.mp_state == KVM_MP_STATE_HALTED &&
+			      vcpu->pv_unhalted) ?
+			     KVM_MP_STATE_RUNNABLE : vcpu->arch.mp_state;
 	return 0;
 }