[Bug?] qemu abort when trying to passthrough BCM5719 Gigabit Ethernet
Hi all,

When I try to pass through a BCM5719 Gigabit Ethernet device to a guest
using the qemu master branch, qemu aborts with:

    kvm_set_phys_mem: error registering slot: Bad address

qemu command:

#./qemu/qemu/x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 -vnc :99 \
    -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x3 \
    -drive file=/home/suse11_sp3_64,if=none,id=drive-scsi0-0-0-0,format=raw,cache=none,aio=native \
    -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 \
    -device pci-assign,host=01:00.1,id=mydevice -net none

Info about guest and host:

host OS: 3.16.5
guest OS: Novell SuSE Linux Enterprise Server 11 SP3

#cat /proc/cpuinfo
processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz
stepping        : 4
microcode       : 0x416
cpu MHz         : 1926.875
cache size      : 20480 KB
physical id     : 1
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 47
initial apicid  : 47
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 4005.35
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

gdb info:

(gdb) bt
#0  0x71ce9989 in raise () from /usr/lib64/libc.so.6
#1  0x71ceb098 in abort () from /usr/lib64/libc.so.6
#2  0x556275cf in kvm_set_phys_mem (section=0x7fffedcea790, add=true) at /home/qemu/qemu/kvm-all.c:711
#3  0x5562980f in address_space_update_topology_pass (as=as@entry=0x55d01ca0, adding=adding@entry=true, new_view=, new_view=, old_view=0x7fffe8022a90, old_view=0x7fffe8022a90) at /home/qemu/qemu/memory.c:752
#4  0x5562b910 in address_space_update_topology (as=0x55d01ca0) at /home/qemu/qemu/memory.c:767
#5  memory_region_transaction_commit () at /home/qemu/qemu/memory.c:808
#6  0x557a75b4 in pci_update_mappings (d=0x562ba9f0) at hw/pci/pci.c:1113
#7  0x557a7932 in pci_default_write_config (d=d@entry=0x562ba9f0, addr=addr@entry=20, val_in=val_in@entry=4294967295, l=l@entry=4) at hw/pci/pci.c:1165
#8  0x5566c17e in assigned_dev_pci_write_config (pci_dev=0x562ba9f0, address=20, val=4294967295, len=4) at /home/qemu/qemu/hw/i386/kvm/pci-assign.c:1196
#9  0x55628fea in access_with_adjusted_size (addr=addr@entry=0, value=value@entry=0x7fffedceaae0, size=size@entry=4, access_size_min=, access_size_max=, access=0x55629160, mr=0x56231f00) at /home/qemu/qemu/memory.c:480
#10 0x5562dbf7 in memory_region_dispatch_write (size=4, data=18446744073709551615, addr=0, mr=0x56231f00) at /home/qemu/qemu/memory.c:1122
#11 io_mem_write (mr=mr@entry=0x56231f00, addr=0, val=, size=4) at /home/qemu/qemu/memory.c:1958
#12 0x555f8963 in address_space_rw (as=0x55d01d80, addr=addr@entry=3324, buf=0x77fec000 "\377\377\377\377", len=len@entry=4, is_write=is_write@entry=true) at /home/qemu/qemu/exec.c:2145
#13 0x55628491 in kvm_handle_io (count=1, size=4, direction=, data=, port=3324) at /home/qemu/qemu/kvm-all.c:1614
#14 kvm_cpu_exec (cpu=cpu@entry=0x5620e610) at /home/qemu/qemu/kvm-all.c:1771
#15 0x55617182 in qemu_kvm_cpu_thread_fn (arg=0x5620e610) at /home/qemu/qemu/cpus.c:953
#16 0x76ba2df3 in start_thread () from /usr/lib64/libpthread.so.0
#17 0x71daa3dd in clone () from /usr/lib64/libc.so.6

messages log:

Oct 10 07:43:18 localhost kernel: kvm: zapping shadow pages for mmio generation wraparound
Oct 10 07:43:27 localhost kernel: kvm [13251]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xabcd
Oct 10 07:43:28 localhost kernel: intel_iommu_map: iommu width (48) is not sufficient for the mapped address (fe001000)
Oct 10 07:43:28 localhost kernel: kvm_iommu_map_address: iommu failed to map pfn=94880

After bisecting the commits, I found that this commit leads to the problem:

commit 57271d63c4d93352406704d540453c43a4a241a7
Author: Paolo Bonzini
Date:   Thu Nov 7 17:14:37 2013 +0100

    exec: make address spaces 64-bit wide

    As an alternative to commit 818f86b (exec: limit system memory size,
    2013-11-04) let's just make all address spaces 64-bit wide.  This
    eliminates problems with phys_page_find ignoring bits above
    TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
    consequently
Re: [PATCH] Documentation: virtual: kvm: correct one bit description in APF case
On 11/10/2014 03:19, Tiejun Chen wrote:
> When commit 6adba5274206 (KVM: Let host know whether the guest can
> handle async PF in non-userspace context.) was introduced, bit 2
> actually remained reserved and should be zero. Instead, bit 1 is 1 to
> indicate that asynchronous page faults can be injected when the vcpu
> is in cpl == 0; see also, in the file kvm_para.h:
> #define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1).
>
> Signed-off-by: Tiejun Chen
> ---
>  Documentation/virtual/kvm/msr.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
> index 6d470ae..2a71c8f 100644
> --- a/Documentation/virtual/kvm/msr.txt
> +++ b/Documentation/virtual/kvm/msr.txt
> @@ -168,7 +168,7 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
>  	64 byte memory area which must be in guest RAM and must be
>  	zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1
>  	when asynchronous page faults are enabled on the vcpu 0 when
> -	disabled. Bit 2 is 1 if asynchronous page faults can be injected
> +	disabled. Bit 1 is 1 if asynchronous page faults can be injected
>  	when vcpu is in cpl == 0.
>
>  	First 4 byte of 64 byte memory location will be written to by

Thanks, applying locally.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bug: host crash when a VF is passed through to a VM
Hi all,

kernel: 3.0.93-0.8-default
qemu: 1.5

crash log:

[134397.708857] BUG: unable to handle kernel NULL pointer dereference at 0012
[134397.717334] IP: [] iommu_disable_dev_iotlb+0x15/0x30
[134397.724268] PGD 0
[134397.726686] Oops: [#1] SMP
[134397.730335] kbox: Begin to handle event info
[134397.734992] kbox: kbox: Enter into handle die dump while current state:Dump State Init
[134397.751054]
[134398.043275] kbox: End handling event info
[134398.047757] CPU 1
[134398.049678] Modules linked in: mlx4_en(FX) mlx4_core(FX) compat(FX) openvswitch crc32c libcrc32c gre nm_dev(FN) ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables uvpdump i
[134398.125560] Supported: No, Unsupported modules are loaded
[134398.131338]
[134398.133220] Pid: 183980, comm: qemu-kvm Tainted: GF NX 3.0.93-0.8-default #1 HUAWEI TECHNOLOGIES CO.,LTD. CH80TXSUA/CH80TXSUA
[134398.146013] RIP: 0010:[] [] iommu_disable_dev_iotlb+0x15/0x30
[134398.155649] RSP: 0018:8817f0e13be0 EFLAGS: 00010202
[134398.161342] RAX: 0002 RBX: 880bf34e6600 RCX: 0007
[134398.169158] RDX: 880bf34e6650 RSI: 0292 RDI: 880bf2011000
[134398.176970] RBP: R08: dead00200200 R09: dead00100100
[134398.184793] R10: 8817f5d4c380 R11: 8128e350 R12: 880bf34e6640
[134398.192549] R13: 880bf5d3de80 R14: 880653e4a000 R15: 880bf34d7858
[134398.200360] FS: 7f6d17fff980() GS:880c3ee2() knlGS:
[134398.209109] CS: 0010 DS: ES: CR0: 8005003b
[134398.215223] CR2: 0012 CR3: 00096ed95000 CR4: 001427e0
[134398.223000] DR0: DR1: DR2:
[134398.230812] DR3: DR6: 0ff0 DR7: 0400
[134398.238630] Process qemu-kvm (pid: 183980, threadinfo 8817f0e12000, task 8817804ac300)
[134398.247906] Stack:
[134398.250260] 8128f756 ffed 880bf34d7840 0292
[134398.258383] 880bf34d7640 880653e4a090 ffed 880653e4a000
[134398.266509] 880653e4a000 880bee6c4878 812938c6 880bebe17d60
[134398.274659] Call Trace:
[134398.277503] [] domain_remove_one_dev_info+0x156/0x280
[134398.284582] [] intel_iommu_attach_device+0x156/0x170
[134398.291583] [] kvm_assign_device+0x73/0x150 [kvm]
[134398.298360] [] kvm_vm_ioctl_assign_device+0x247/0x3c0 [kvm]
[134398.306303] [] kvm_vm_ioctl_assigned_device+0x2fc/0x6a0 [kvm]
[134398.314401] [] kvm_vm_ioctl+0x101/0x300 [kvm]
[134398.320789] [] do_vfs_ioctl+0x8b/0x3b0
[134398.326566] [] sys_ioctl+0xa1/0xb0
[134398.331996] [] system_call_fastpath+0x16/0x1b
[134398.338376] [<7f6d15b77e57>] 0x7f6d15b77e56
[134398.343373] Code: c6 05 97 dc de 00 01 eb a7 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 8b 7f 28 48 85 ff 74 12 48 8b 87 30 09 00 00 48 85 c0 74 06 40 10 01 75 05 f3 c3 0f 1f 00 e9 ab 61 00 00 66 66
[134398.364424] RIP [] iommu_disable_dev_iotlb+0x15/0x30
[134398.371441] RSP
[134398.375314] CR2: 0012
[PATCH] Documentation: virtual: kvm: correct one bit description in APF case
When commit 6adba5274206 (KVM: Let host know whether the guest can
handle async PF in non-userspace context.) was introduced, bit 2
actually remained reserved and should be zero. Instead, bit 1 is 1 to
indicate that asynchronous page faults can be injected when the vcpu is
in cpl == 0; see also, in the file kvm_para.h:
#define KVM_ASYNC_PF_SEND_ALWAYS (1 << 1).

Signed-off-by: Tiejun Chen
---
 Documentation/virtual/kvm/msr.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 6d470ae..2a71c8f 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -168,7 +168,7 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
 	64 byte memory area which must be in guest RAM and must be
 	zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1
 	when asynchronous page faults are enabled on the vcpu 0 when
-	disabled. Bit 2 is 1 if asynchronous page faults can be injected
+	disabled. Bit 1 is 1 if asynchronous page faults can be injected
 	when vcpu is in cpl == 0.

 	First 4 byte of 64 byte memory location will be written to by
--
1.9.1
Re: [PATCH v3 3/3] arm64: Allow 48-bits VA space without ARM_SMMU
On Fri, Oct 10, 2014 at 04:10:53PM +0100, Catalin Marinas wrote:
> On Fri, Oct 10, 2014 at 11:14:30AM +0100, Christoffer Dall wrote:
> > Now when KVM has been reworked to support 48-bit host VA space, we can
> > allow systems to be configured with this option. However, the ARM SMMU
> > driver also needs to be tweaked for 48-bit support, so only allow the
> > config option to be set when not including support for the SMMU.
> >
> > Signed-off-by: Christoffer Dall
> > ---
> >  arch/arm64/Kconfig | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index fd4e81a..a76c6c3b 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -200,7 +200,7 @@ config ARM64_VA_BITS_42
> >
> >  config ARM64_VA_BITS_48
> >  	bool "48-bit"
> > -	depends on BROKEN
> > +	depends on !ARM_SMMU
> >
> >  endchoice
>
> I think we should rather merge this separately via the arm64 tree once
> we test 48-bit VA some more (and as you noticed, there is a bug
> already).

Sounds like a good idea, I'll apply the other two to kvmarm.

Thanks for the reviews!
-Christoffer
Re: [PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows
On 09/23/2014 11:56 PM, Alex Williamson wrote:
> On Tue, 2014-09-23 at 13:01 +1000, Alexey Kardashevskiy wrote:
>> This defines and implements VFIO IOMMU API which lets the userspace
>> create and remove DMA windows.
>>
>> This updates VFIO_IOMMU_SPAPR_TCE_GET_INFO to return the number of
>> available windows and page mask.
>>
>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE
>> to allow the user space to create and remove window(s).
>>
>> The VFIO IOMMU driver does basic sanity checks and calls corresponding
>> SPAPR TCE functions. At the moment only IODA2 (POWER8 PCI host bridge)
>> implements them.
>>
>> This advertises VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via
>> VFIO_IOMMU_SPAPR_TCE_GET_INFO.
>>
>> This calls platform DDW reset() callback when IOMMU is being disabled
>> to reset the DMA configuration to its original state.
>>
>> Signed-off-by: Alexey Kardashevskiy
>> ---
>>  drivers/vfio/vfio_iommu_spapr_tce.c | 135 ++--
>>  include/uapi/linux/vfio.h           |  25 ++-
>>  2 files changed, 153 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
>> index 0dccbc4..b518891 100644
>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>> @@ -190,18 +190,25 @@ static void tce_iommu_disable(struct tce_container *container)
>>
>>  	container->enabled = false;
>>
>> -	if (!container->grp || !current->mm)
>> +	if (!container->grp)
>>  		return;
>>
>>  	data = iommu_group_get_iommudata(container->grp);
>>  	if (!data || !data->iommu_owner || !data->ops->get_table)
>>  		return;
>>
>> -	tbl = data->ops->get_table(data, 0);
>> -	if (!tbl)
>> -		return;
>> +	if (current->mm) {
>> +		tbl = data->ops->get_table(data, 0);
>> +		if (tbl)
>> +			decrement_locked_vm(tbl);
>>
>> -	decrement_locked_vm(tbl);
>> +		tbl = data->ops->get_table(data, 1);
>> +		if (tbl)
>> +			decrement_locked_vm(tbl);
>> +	}
>> +
>> +	if (data->ops->reset)
>> +		data->ops->reset(data);
>>  }
>>
>>  static void *tce_iommu_open(unsigned long arg)
>> @@ -243,7 +250,7 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  			unsigned int cmd, unsigned long arg)
>>  {
>>  	struct tce_container *container = iommu_data;
>> -	unsigned long minsz;
>> +	unsigned long minsz, ddwsz;
>>  	long ret;
>>
>>  	switch (cmd) {
>> @@ -288,6 +295,28 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  		info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
>>  		info.flags = 0;
>>
>> +		ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
>> +				page_size_mask);
>> +
>> +		if (info.argsz == ddwsz) {
>
> >=
>
>> +			if (data->ops->query && data->ops->create &&
>> +					data->ops->remove) {
>> +				info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW;
>
> I think you want to set this flag regardless of whether the user has
> provided space for it.  A valid use model is to call with the minimum
> size and look at the flags to determine if it needs to be called again
> with a larger size.
>
>> +
>> +				ret = data->ops->query(data,
>> +						&info.current_windows,
>> +						&info.windows_available,
>> +						&info.page_size_mask);
>> +				if (ret)
>> +					return ret;
>> +			} else {
>> +				info.current_windows = 0;
>> +				info.windows_available = 0;
>> +				info.page_size_mask = 0;
>> +			}
>> +			minsz = ddwsz;
>
> It's not really any longer the min size, is it?
>
>> +		}
>> +
>>  		if (copy_to_user((void __user *)arg, &info, minsz))
>>  			return -EFAULT;
>>
>> @@ -412,12 +441,106 @@ static long tce_iommu_ioctl(void *iommu_data,
>>  		tce_iommu_disable(container);
>>  		mutex_unlock(&container->lock);
>>  		return 0;
>> +
>>  	case VFIO_EEH_PE_OP:
>>  		if (!container->grp)
>>  			return -ENODEV;
>>
>>  		return vfio_spapr_iommu_eeh_ioctl(container->grp,
>>  						  cmd, arg);
>> +
>> +	case VFIO_IOMMU_SPAPR_TCE_CREATE: {
>> +		struct vfio_iommu_spapr_tce_create create;
>> +		struct spapr_tce_iommu_group *data;
>> +		struct iommu_table *tbl;
>> +
>> +		if (WARN_ON(!container->grp))
>
> redux previous comment on this warning
>
>> +
[PATCH v2 0/2] deadline TSC multi injection fix
The first patch refactors code and introduces a minor should-not-happen
optimization. The second patch fixes the bug and improves handling of
passed deadline timers.

I still haven't looked at Amit's proposed optimization (not recomputing
the value of the hrtimer on rewrites) long enough to be able to sign it
off, sorry.

Radim Krčmář (2):
  KVM: x86: add apic_timer_expired()
  KVM: x86: fix deadline tsc interrupt injection

 arch/x86/kvm/lapic.c | 47 +--
 1 file changed, 25 insertions(+), 22 deletions(-)

--
2.1.0
[PATCH v2 2/2] KVM: x86: fix deadline tsc interrupt injection
The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
situation where we lose a pending deadline timer in a MSR write.
Losing it is fine, because it effectively occurs before the timer
fired, so we should be able to cancel or postpone it.

Another problem comes from interaction with QEMU, or other userspace
that can set the deadline MSR without a good reason while a timer is
already pending: one guest deadline request results in more than one
interrupt, because one is injected immediately on the MSR write from
userspace and one through the hrtimer later.

The solution is to remove the injection when replacing a pending timer
and, to improve the usual QEMU path, to inject without an hrtimer when
the deadline has already passed.

Signed-off-by: Radim Krčmář
Reported-by: Nadav Amit
---
 arch/x86/kvm/lapic.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a3cf68b..8b30099 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1116,9 +1116,10 @@ static void start_apic_timer(struct kvm_lapic *apic)
 	if (likely(tscdeadline > guest_tsc)) {
 		ns = (tscdeadline - guest_tsc) * 1000000ULL;
 		do_div(ns, this_tsc_khz);
-	}
-	hrtimer_start(&apic->lapic_timer.timer,
-		ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+		hrtimer_start(&apic->lapic_timer.timer,
+			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
+	} else
+		apic_timer_expired(apic);

 	local_irq_restore(flags);
 }
@@ -1375,9 +1376,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
 		return;

 	hrtimer_cancel(&apic->lapic_timer.timer);
-	/* Inject here so clearing tscdeadline won't override new value */
-	if (apic_has_pending_timer(vcpu))
-		kvm_inject_apic_timer_irqs(vcpu);
 	apic->lapic_timer.tscdeadline = data;
 	start_apic_timer(apic);
 }
--
2.1.0
[PATCH v2 1/2] KVM: x86: add apic_timer_expired()
Make the code reusable.

If the timer was already pending, we shouldn't be waiting in a queue,
so wake_up can be skipped, simplifying the path.

There is no 'reinject' case => the comment is removed. The current
race behaves correctly.

Signed-off-by: Radim Krčmář
---
Not sure about the FIXME, and it might motivate someone to redo the
implicit checking as well :)

 arch/x86/kvm/lapic.c | 37 +
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b8345dd..a3cf68b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1034,6 +1034,26 @@ static void update_divide_count(struct kvm_lapic *apic)
 				apic->divide_count);
 }

+static void apic_timer_expired(struct kvm_lapic *apic)
+{
+	struct kvm_vcpu *vcpu = apic->vcpu;
+	wait_queue_head_t *q = &vcpu->wq;
+
+	/*
+	 * Note: KVM_REQ_PENDING_TIMER is implicitly checked in
+	 * vcpu_enter_guest.
+	 */
+	if (atomic_read(&apic->lapic_timer.pending))
+		return;
+
+	atomic_inc(&apic->lapic_timer.pending);
+	/* FIXME: this code should not know anything about vcpus */
+	kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
+
+	if (waitqueue_active(q))
+		wake_up_interruptible(q);
+}
+
 static void start_apic_timer(struct kvm_lapic *apic)
 {
 	ktime_t now;
@@ -1538,23 +1558,8 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
 {
 	struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
 	struct kvm_lapic *apic = container_of(ktimer, struct kvm_lapic, lapic_timer);
-	struct kvm_vcpu *vcpu = apic->vcpu;
-	wait_queue_head_t *q = &vcpu->wq;

-	/*
-	 * There is a race window between reading and incrementing, but we do
-	 * not care about potentially losing timer events in the !reinject
-	 * case anyway. Note: KVM_REQ_PENDING_TIMER is implicitly checked
-	 * in vcpu_enter_guest.
-	 */
-	if (!atomic_read(&ktimer->pending)) {
-		atomic_inc(&ktimer->pending);
-		/* FIXME: this code should not know anything about vcpus */
-		kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
-	}
-
-	if (waitqueue_active(q))
-		wake_up_interruptible(q)
+	apic_timer_expired(apic);

 	if (lapic_is_periodic(apic)) {
 		hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
--
2.1.0
[Bug 81841] amd-iommu: kernel BUG & lockup after shutting down KVM guest using PCI passthrough/PCIe bridge
https://bugzilla.kernel.org/show_bug.cgi?id=81841

Marti Raudsepp changed:

           What      |Removed |Added
           -------------------------------
           Status    |NEW     |RESOLVED
           Resolution|---     |CODE_FIX

--- Comment #23 from Marti Raudsepp ---
Closing bug, ACS patch merged to mainline Linux:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3587e625fe24a2d1cd1891fc660c3313151a368c

Thanks Joerg.

--
You are receiving this mail because:
You are watching the assignee of the bug.
Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access
2014-10-10 05:07+0300, Nadav Amit:
> When read access is performed using a readable code segment, the "conforming"
> and "non-conforming" checks should not be done. As a result, a read using a
> non-conforming readable code segment fails.
>
> This is according to Intel SDM 5.6.1 ("Accessing Data in Code Segments").
>
> One exception is the case of a conforming code segment. The SDM says: "Use a
> code-segment override prefix (CS) to read a readable... [it is] valid because
> the DPL of the code segment selected by the CS register is the same as the
> CPL." This is misleading since CS.DPL may be lower (numerically) than CPL, and
> CS would still be accessible. The emulator should avoid privilege level
> checks for data reads using CS.

Ah, after stripping faulty presumptions, I'm not sure this change is
enough ... shouldn't we also skip the check on conforming code segments?

  Method 2 is always valid because the privilege level of a conforming
  code segment is effectively the same as the CPL, regardless of its DPL.

And we couldn't read it from less privileged modes.

> The fix is not to perform the "non-conforming" checks if the access is not a
> fetch, and never to perform the checks for CS.
>
> ---
> v1->v2: Privilege level checks are always skipped for CS
>
> Signed-off-by: Nadav Amit
> ---
>  arch/x86/kvm/emulate.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index a46207a..0fee0a0 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -661,9 +661,9 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
>  			goto bad;
>  		}
>  		cpl = ctxt->ops->cpl(ctxt);
> -		if (!(desc.type & 8)) {
> -			/* data segment */
> -			if (cpl > desc.dpl)
> +		if (!fetch) {
> +			/* data segment or readable code segment */
> +			if (cpl > desc.dpl && addr.seg != VCPU_SREG_CS)
>  				goto bad;
>  		} else if ((desc.type & 8) && !(desc.type & 4)) {
>  			/* nonconforming code segment */
> --
> 1.9.1
Re: [PATCH v3 3/3] arm64: Allow 48-bits VA space without ARM_SMMU
On Fri, Oct 10, 2014 at 11:14:30AM +0100, Christoffer Dall wrote:
> Now when KVM has been reworked to support 48-bit host VA space, we can
> allow systems to be configured with this option. However, the ARM SMMU
> driver also needs to be tweaked for 48-bit support, so only allow the
> config option to be set when not including support for the SMMU.
>
> Signed-off-by: Christoffer Dall
> ---
>  arch/arm64/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index fd4e81a..a76c6c3b 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -200,7 +200,7 @@ config ARM64_VA_BITS_42
>
>  config ARM64_VA_BITS_48
>  	bool "48-bit"
> -	depends on BROKEN
> +	depends on !ARM_SMMU
>
>  endchoice

I think we should rather merge this separately via the arm64 tree once
we test 48-bit VA some more (and as you noticed, there is a bug
already).

--
Catalin
Re: [PATCH v3 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2
On Fri, Oct 10, 2014 at 11:14:28AM +0100, Christoffer Dall wrote:
> This patch adds the necessary support for all host kernel PGSIZE and
> VA_SPACE configuration options for both EL2 and the Stage-2 page tables.
>
> However, for 40bit and 42bit PARange systems, the architecture mandates
> that VTCR_EL2.SL0 is maximum 1, resulting in fewer levels of stage-2
> page tables than levels of host kernel page tables. At the same time,
> for systems with a PARange > 42bit, we limit the IPA range by always
> setting VTCR_EL2.T0SZ to 24.
>
> To solve the situation with different levels of page tables for Stage-2
> translation than the host kernel page tables, we allocate a dummy PGD
> with pointers to our actual initial level Stage-2 page table, in order
> for us to reuse the kernel pgtable manipulation primitives. Reproducing
> all these in KVM does not look pretty and unnecessarily complicates the
> 32-bit side.
>
> Systems with a PARange < 40bits are not yet supported.
>
> [ I have reworked this patch from its original form submitted by
>   Jungseok to take the architecture constraints into consideration.
>   There were too many changes from the original patch for me to
>   preserve the authorship. Thanks to Catalin Marinas for his help in
>   figuring out a good solution to this challenge. I have also fixed
>   various bugs and missing error code handling from the original
>   patch. - Christoffer ]
>
> Cc: Marc Zyngier
> Cc: Catalin Marinas
> Signed-off-by: Jungseok Lee
> Signed-off-by: Christoffer Dall

Reviewed-by: Catalin Marinas
Re: [PATCH v3 2/3] arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE
On Fri, Oct 10, 2014 at 11:14:29AM +0100, Christoffer Dall wrote:
> When creating or moving a memslot, make sure the IPA space is within the
> addressable range of the guest. Otherwise, user space can create too
> large a memslot and KVM would try to access potentially unallocated page
> table entries when inserting entries in the Stage-2 page tables.
>
> Signed-off-by: Christoffer Dall

Acked-by: Catalin Marinas
Re: [PATCH kvm-kmod 0/3] First round of kvm-kmod fixes for 3.18 merge window
On 2014-10-09 13:53, Paolo Bonzini wrote:
> Patches are relative to the next branch of kvm-kmod.git.
>
> Paolo
>
> Paolo Bonzini (3):
>   FOLL_TRIED is not available before 3.18
>   the MMU notifier clear_flush_young callback changed in 3.18
>   redefine is_zero_pfn to not rely on zero_pfn
>
>  external-module-compat-comm.h |  5 -
>  sync                          | 16 
>  2 files changed, 20 insertions(+), 1 deletion(-)

Thanks, applied.

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
On 08/10/2014 12:06, Radim Krčmář wrote:
> KVM: x86: fix deadline tsc interrupt injection
>
> The check in kvm_set_lapic_tscdeadline_msr() was trying to prevent a
> situation where we lose a pending deadline timer in a MSR write.
> Losing it is fine, because it effectively occurs before the timer fired,
> so we should be able to cancel or postpone it.
>
> Another problem comes from interaction with QEMU, or other userspace
> that can set deadline MSR without a good reason, when timer is already
> pending: one guest's deadline request results in more than one
> interrupt because one is injected immediately on MSR write from
> userspace and one through hrtimer later.
>
> The solution is to remove the injection when replacing a pending timer
> and to improve the usual QEMU path, we inject without a hrtimer when the
> deadline has already passed.
>
> Signed-off-by: Radim Krčmář
> Reported-by: Nadav Amit
> ---
>  arch/x86/kvm/lapic.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b8345dd..51428dd 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic)
>  	if (likely(tscdeadline > guest_tsc)) {
>  		ns = (tscdeadline - guest_tsc) * 1000000ULL;
>  		do_div(ns, this_tsc_khz);
> +		hrtimer_start(&apic->lapic_timer.timer,
> +			ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
> +	} else {
> +		atomic_inc(&ktimer->pending);
> +		kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>  	}
> -	hrtimer_start(&apic->lapic_timer.timer,
> -		ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
>
>  	local_irq_restore(flags);
>  }
> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
>  		return;
>
>  	hrtimer_cancel(&apic->lapic_timer.timer);
> -	/* Inject here so clearing tscdeadline won't override new value */
> -	if (apic_has_pending_timer(vcpu))
> -		kvm_inject_apic_timer_irqs(vcpu);
>  	apic->lapic_timer.tscdeadline = data;
>  	start_apic_timer(apic);
>  }

Radim, the patch looks good but please extract this code:

	/*
	 * There is a race window between reading and incrementing, but we do
	 * not care about potentially losing timer events in the !reinject
	 * case anyway. Note: KVM_REQ_PENDING_TIMER is implicitly checked
	 * in vcpu_enter_guest.
	 */
	if (!atomic_read(&ktimer->pending)) {
		atomic_inc(&ktimer->pending);
		/* FIXME: this code should not know anything about vcpus */
		kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
	}

	if (waitqueue_active(q))
		wake_up_interruptible(q);

to a new "static void apic_timer_expired(struct kvm_lapic *apic)"
function, and call it from both apic_timer_fn and start_apic_timer.

Also, we should not need to do wake_up_interruptible() unless we have
changed ktimer->pending from zero to non-zero.

Paolo
Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
Il 10/10/2014 14:51, Nadav Amit ha scritto: >>> Second, I think that the solution I proposed would perform better. >>> Currently, there are many unnecessary cancellations and setups of the >>> timer. This solution does not resolve this problem. >> >> I think it does. You do not get an hrtimer_start if tscdeadline <= >> guest_tsc. To avoid useless cancels, either check hrtimer_is_enqueued >> before calling hrtimer_cancel, or go straight to the source and avoid >> taking the lock in the easy cases: >> >> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c >> index 1c2fe7de2842..6ce725007424 100644 >> --- a/kernel/time/hrtimer.c >> +++ b/kernel/time/hrtimer.c >> @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer) >> { >> struct hrtimer_clock_base *base; >> unsigned long flags; >> -int ret = -1; >> +unsigned long state = timer->state; >> +int ret; >> + >> +if (state & HRTIMER_STATE_ENQUEUED) >> +return 0; >> +if (state & HRTIMER_STATE_CALLBACK) >> +return -1; >> >> base = lock_hrtimer_base(timer, &flags); >> >> +ret = -1; >> if (!hrtimer_callback_running(timer)) >> ret = remove_hrtimer(timer, base); > Wouldn’t this change would cause cancellations never to succeed (the first > check would always be true if the timer is active)? Ehm, there is a missing ! in that first "if". >>> Last, I think that having less interrupts on deadline changes is not >>> completely according to the SDM which says: "If software disarms the >>> timer or postpones the deadline, race conditions may result in the >>> delivery of a spurious timer interrupt.” It never says interrupts may >>> be lost if you reprogram the deadline before you check it expired. >> >> But the case when you rewrite the same value to the MSR is neither >> disarming nor postponing. You would be getting two interrupts for the >> same event. That is why I agree with Radim that checking host_initiated >> is wrong. 
> > I understand, and Radim's solution seems functionally fine, now that I am > fully awake and understand it. > I still think that if tscdeadline > guest_tsc, then reprogramming the > deadline with the same value, as QEMU does, would result in unwarranted > overhead. The overhead is about two atomic operations (70 clock cycles?). Still, there are plenty of other micro-optimizations possible: 1) instead of incrementing timer->pending, set it to 1 2) change it to test_and_set_bit and only set PENDING_TIMER if the result was zero 3) non-atomically test PENDING_TIMER before (atomically) clearing it 4) return bool from kvm_inject_apic_timer_irqs and only clear PENDING_TIMER if a timer interrupt was actually injected. (1) or (2) would remove one atomic operation when reprogramming a passed deadline with the same value. (3) or (4) would remove one atomic operation in the case where the cause of the exit is not an expired timer. Any takers? > Perhaps it would be enough not to reprogram the timer if tscdeadline value > does not change (by either guest or host). Yes, and that would just be your patch without host_initiated. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
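The four micro-optimizations listed above can be sketched outside the kernel. The following is a hypothetical stand-alone C model (C11 atomics instead of the kernel's atomic_t/bitops API; all names are illustrative, not KVM's): mark_timer_expired() implements (1)+(2) by collapsing pending into a flag and only raising the request when the flag was previously clear, and consume_timer_irq() implements (3) by testing non-atomically before doing the atomic clear.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-alone model -- not the kernel's atomic_t/bitops API. */
static atomic_int timer_pending;        /* (1): a flag, not a counter */
static atomic_int req_pending_timer;    /* models KVM_REQ_PENDING_TIMER */

/* (2): one atomic exchange; only raise the request if the flag was
 * previously clear, so re-expiring an already-pending deadline adds
 * no extra request. */
static void mark_timer_expired(void)
{
    if (atomic_exchange(&timer_pending, 1) == 0)
        atomic_store(&req_pending_timer, 1);
}

/* (3): test non-atomically first; the atomic clear only happens on the
 * slow path where a timer interrupt is actually outstanding. */
static bool consume_timer_irq(void)
{
    if (atomic_load_explicit(&timer_pending, memory_order_relaxed) == 0)
        return false;
    atomic_store(&timer_pending, 0);
    return true;
}
```

With this shape, reprogramming an already-passed deadline costs one atomic exchange, and an exit whose cause is not an expired timer costs no atomic operation at all.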
Re: [patch 3/4] KVM: MMU: reload request from GET_DIRTY_LOG path
On Wed, Oct 08, 2014 at 04:22:31PM -0300, Marcelo Tosatti wrote: > > > > > > Argh, lets try again: > > > > > > skip_pinned = true > > > -- > > > > > > mark page dirty, keep spte intact > > > > > > called from get dirty log path. > > > > > > skip_pinned = false > > > --- > > > reload remote mmu > > > destroy pinned spte. > > > > > > called from: dirty log enablement, rmap write protect (unused for pinned > > > sptes) > > > > > > > > > Note this behaviour is your suggestion: > > > > Yes, I remember that and I thought we will not need this skip_pinned > > at all. For rmap write protect case there shouldn't be any pinned pages, > > but why dirty log enablement sets skip_pinned to false? Why not mark > > pinned pages as dirty just like you do in get dirty log path? > > Because if its a large spte, it must be nuked (or marked read-only, > which for pinned sptes, is not possible). > If a large page has one small page pinned inside it, its spte will be marked as pinned, correct? We did nuke large ptes here until very recently: c126d94f2c90ed9d, but we cannot drop a pte here anyway without kicking all vcpus out of guest mode, but do you need an additional skip_pinned parameter? Why not check if the spte is large instead? So why not have a per-slot pinned page list (Xiao suggested the same) and do: spte_write_protect() { if (is_pinned(spte)) { if (large(spte)) // cannot drop while vcpus are running mmu_reload_pinned_vcpus(); else return false; } } get_dirty_log() { for_each(pinned pages i) mark_dirty(i); } -- Gleb.
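The per-slot pinned-page scheme sketched in the pseudocode above can be modeled as follows. This is a toy illustration of the suggested control flow only; all structures and names (slot, spte, the reload_requested flag standing in for mmu_reload_pinned_vcpus()) are invented for the example rather than taken from KVM:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy structures invented for illustration. */
struct spte { bool pinned; bool is_large; };

struct slot {
    size_t npinned;              /* pages on the per-slot pinned list */
    unsigned long dirty_bitmap;
    bool reload_requested;       /* stands in for mmu_reload_pinned_vcpus() */
};

/* Pinned small sptes are simply never write-protected (return false);
 * a pinned large spte cannot be dropped while vcpus are running, so
 * request a reload of all vcpus instead. */
static bool spte_write_protect(struct slot *s, struct spte *sp)
{
    if (sp->pinned) {
        if (sp->is_large)
            s->reload_requested = true;
        return false;
    }
    return true;   /* normal spte: caller may clear the writable bit */
}

/* GET_DIRTY_LOG path: report every pinned page as dirty. */
static void get_dirty_log(struct slot *s)
{
    for (size_t i = 0; i < s->npinned; i++)
        s->dirty_bitmap |= 1UL << i;
}
```

The point of the design is that the dirty-log path never needs to touch pinned sptes at all: it just reports them dirty unconditionally, and only the large-page case forces vcpus out of guest mode.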
Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
On Oct 10, 2014, at 12:45 PM, Paolo Bonzini wrote: > On 10/10/2014 03:55, Nadav Amit wrote: >>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c >>> index b8345dd..51428dd 100644 >>> --- a/arch/x86/kvm/lapic.c >>> +++ b/arch/x86/kvm/lapic.c >>> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic) >>> if (likely(tscdeadline > guest_tsc)) { >>> ns = (tscdeadline - guest_tsc) * 1000000ULL; >>> do_div(ns, this_tsc_khz); >>> + hrtimer_start(&apic->lapic_timer.timer, >>> + ktime_add_ns(now, ns), HRTIMER_MODE_ABS); >>> + } else { >>> + atomic_inc(&ktimer->pending); >>> + kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu); >>> } >>> - hrtimer_start(&apic->lapic_timer.timer, >>> - ktime_add_ns(now, ns), HRTIMER_MODE_ABS); >>> >>> local_irq_restore(flags); >>> } >>> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu >>> *vcpu, u64 data) >>> return; >>> >>> hrtimer_cancel(&apic->lapic_timer.timer); >>> - /* Inject here so clearing tscdeadline won't override new value */ >>> - if (apic_has_pending_timer(vcpu)) >>> - kvm_inject_apic_timer_irqs(vcpu); >>> apic->lapic_timer.tscdeadline = data; >>> start_apic_timer(apic); >>> } >> >> Perhaps I am missing something, but I don’t see how it solves the problem I >> encountered. >> Recall the scenario: >> 1. A TSC deadline timer interrupt is pending. >> 2. TSC deadline was still not cleared (which happens during vcpu_run). >> 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr. >> >> It appears that in such scenario the guest would still get spurious >> interrupt for no reason, as ktimer->pending may already be increased in >> apic_timer_fn. > > If ktimer->pending == 2 you still get only one interrupt, not two. Am I > misunderstanding your objection? You understood me, and I was wrong... > >> Second, I think that the solution I proposed would perform better. >> Currently, there are many unnecessary cancellations and setups of the >> timer. 
This solution does not resolve this problem. > > I think it does. You do not get an hrtimer_start if tscdeadline <= > guest_tsc. To avoid useless cancels, either check hrtimer_is_enqueued > before calling hrtimer_cancel, or go straight to the source and avoid > taking the lock in the easy cases: > > diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c > index 1c2fe7de2842..6ce725007424 100644 > --- a/kernel/time/hrtimer.c > +++ b/kernel/time/hrtimer.c > @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer) > { > struct hrtimer_clock_base *base; > unsigned long flags; > - int ret = -1; > + unsigned long state = timer->state; > + int ret; > + > + if (state & HRTIMER_STATE_ENQUEUED) > + return 0; > + if (state & HRTIMER_STATE_CALLBACK) > + return -1; > > base = lock_hrtimer_base(timer, &flags); > > + ret = -1; > if (!hrtimer_callback_running(timer)) > ret = remove_hrtimer(timer, base); Wouldn’t this change cause cancellations never to succeed (the first check would always be true if the timer is active)? > > >> Last, I think that having less interrupts on deadline changes is not >> completely according to the SDM which says: "If software disarms the >> timer or postpones the deadline, race conditions may result in the >> delivery of a spurious timer interrupt.” It never says interrupts may >> be lost if you reprogram the deadline before you check it expired. > > But the case when you rewrite the same value to the MSR is neither > disarming nor postponing. You would be getting two interrupts for the > same event. That is why I agree with Radim that checking host_initiated > is wrong. I understand, and Radim's solution seems functionally fine, now that I am fully awake and understand it. I still think that if tscdeadline > guest_tsc, then reprogramming the deadline with the same value, as QEMU does, would result in unwarranted overhead. 
Perhaps it would be enough not to reprogram the timer if tscdeadline value does not change (by either guest or host). Thanks, Nadav
Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
2014-10-10 11:45+0200, Paolo Bonzini: > On 10/10/2014 03:55, Nadav Amit wrote: > >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > >> index b8345dd..51428dd 100644 > >> --- a/arch/x86/kvm/lapic.c > >> +++ b/arch/x86/kvm/lapic.c > >> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic) > >>if (likely(tscdeadline > guest_tsc)) { > >>ns = (tscdeadline - guest_tsc) * 1000000ULL; > >>do_div(ns, this_tsc_khz); > >> + hrtimer_start(&apic->lapic_timer.timer, > >> + ktime_add_ns(now, ns), HRTIMER_MODE_ABS); > >> + } else { > >> + atomic_inc(&ktimer->pending); > >> + kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu); > >>} > >> - hrtimer_start(&apic->lapic_timer.timer, > >> - ktime_add_ns(now, ns), HRTIMER_MODE_ABS); > >> > >>local_irq_restore(flags); > >>} > >> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu > >> *vcpu, u64 data) > >>return; > >> > >>hrtimer_cancel(&apic->lapic_timer.timer); > >> - /* Inject here so clearing tscdeadline won't override new value */ > >> - if (apic_has_pending_timer(vcpu)) > >> - kvm_inject_apic_timer_irqs(vcpu); > >>apic->lapic_timer.tscdeadline = data; > >>start_apic_timer(apic); > >> } > > > > Perhaps I am missing something, but I don’t see how it solves the problem I > > encountered. > > Recall the scenario: > > 1. A TSC deadline timer interrupt is pending. > > 2. TSC deadline was still not cleared (which happens during vcpu_run). > > 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr. > > > > It appears that in such scenario the guest would still get spurious > > interrupt for no reason, as ktimer->pending may already be increased in > > apic_timer_fn. > > If ktimer->pending == 2 you still get only one interrupt, not two. Am I > misunderstanding your objection? (I don't see it either, the patch removed kvm_inject_apic_timer_irqs(), so we inject as if the MSR write didn't happen.) > > Second, I think that the solution I proposed would perform better. 
> > Currently, there are many unnecessary cancellations and setups of the > > timer. This solution does not resolve this problem. > > I think it does. You do not get an hrtimer_start if tscdeadline <= > guest_tsc. To avoid useless cancels, either check hrtimer_is_enqueued > before calling hrtimer_cancel, or go straight to the source and avoid > taking the lock in the easy cases: I think the useless cancels were when the timer is set to the same value in the future. Which makes sense to optimize, but I wasn't sure how it would affect the guest if the TSC offset was changing or TSC itself wasn't stable ... so I chose not to do it. > diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c > index 1c2fe7de2842..6ce725007424 100644 > --- a/kernel/time/hrtimer.c > +++ b/kernel/time/hrtimer.c > @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer) > { > struct hrtimer_clock_base *base; > unsigned long flags; > - int ret = -1; > + unsigned long state = timer->state; > + int ret; > + > + if (state & HRTIMER_STATE_ENQUEUED) > + return 0; It doesn't try very hard to cancel it ;) > + if (state & HRTIMER_STATE_CALLBACK) > + return -1; > > base = lock_hrtimer_base(timer, &flags); > > + ret = -1; > if (!hrtimer_callback_running(timer)) > ret = remove_hrtimer(timer, base); > > Last, I think that having less interrupts on deadline changes is not > > completely according to the SDM which says: "If software disarms the > > timer or postpones the deadline, race conditions may result in the > > delivery of a spurious timer interrupt.” It never says interrupts may > > be lost if you reprogram the deadline before you check it expired. Yeah, too bad it wasn't written in a formally defined language ... They could be just reserving design space for concurrent implementation, not making race conditions mandatory. 
Our race windows are much bigger -- a disarm that is hundreds of cycles in the future can easily be hit after a msr write vm-exit, but it would be very unlikely to get an interrupt on bare metal. And thinking about the meaning of postpone or disarm -- software doesn't want the old interrupt anymore, so it would just add useless work. (And I liked the code better without it, but it'd be ok if we fixed all the cases.) > But the case when you rewrite the same value to the MSR is neither > disarming nor postponing. You would be getting two interrupts for the > same event. That is why I agree with Radim that checking host_initiated > is wrong. The current implementation also injects two interrupts when the timer is set to a lower nonzero value, which should give us just one under both interpretations.
Re: [question] Is there a plan to introduce a unified co-scheduling mechanism to CFS ?
>> Hi, >> >> Is it worthwhile to introduce a unified co-scheduling mechanism to CFS? >> Because multiple cooperating threads or tasks frequently synchronize with each other, >> not executing them concurrently would only increase the latency of synchronization. >> For example, a thread blocking on a spinlock while waiting for another thread to release it >> might reduce its waiting time by being executed concurrently with the thread which holds the spinlock. >> In a virtualization scenario, co-scheduling multiple vcpus (which belong to the same VM) is more desirable >> when several cooperating threads/tasks are running in the guest. >> >> Is there a plan for this work? > > Please refer to the gang scheduler. > Is there a mechanism to dynamically detect which vcpus belong to the same gang? Maybe a cooperation degree could be used as the threshold for deciding which vcpus belong to the same gang, just a wild thought. > Regards, > Wanpeng Li >> >> Thanks, >> Zhang Haoyu
Re: [question] Is there a plan to introduce a unified co-scheduling mechanism to CFS ?
On 10/10/14, 7:37 PM, Zhang Haoyu wrote: Hi, Is it worthwhile to introduce a unified co-scheduling mechanism to CFS? Because multiple cooperating threads or tasks frequently synchronize with each other, not executing them concurrently would only increase the latency of synchronization. For example, a thread blocking on a spinlock while waiting for another thread to release it might reduce its waiting time by being executed concurrently with the thread which holds the spinlock. In a virtualization scenario, co-scheduling multiple vcpus (which belong to the same VM) is more desirable when several cooperating threads/tasks are running in the guest. Is there a plan for this work? Please refer to the gang scheduler. Regards, Wanpeng Li Thanks, Zhang Haoyu
[question] Is there a plan to introduce a unified co-scheduling mechanism to CFS ?
Hi, Is it worthwhile to introduce a unified co-scheduling mechanism to CFS? Because multiple cooperating threads or tasks frequently synchronize with each other, not executing them concurrently would only increase the latency of synchronization. For example, a thread blocking on a spinlock while waiting for another thread to release it might reduce its waiting time by being executed concurrently with the thread which holds the spinlock. In a virtualization scenario, co-scheduling multiple vcpus (which belong to the same VM) is more desirable when several cooperating threads/tasks are running in the guest. Is there a plan for this work? Thanks, Zhang Haoyu
[PATCH v3 0/3] arm/arm64: KVM: Host 48-bit VA support and IPA limits
The following two patches fix up some missing memory handling in KVM/arm64. The first patch supports a 48-bit virtual address space, which involves supporting a different number of levels of page tables in the host kernel and the stage-2 page tables. The second patch ensures userspace cannot create memory slots with too large an IPA space given VTCR_EL2.T0SZ = 24. Finally, we enable 48-bit VA support in Linux. The following host configurations have been tested with KVM on APM Mustang: 1) 4KB + 39 bits VA space 2) 4KB + 48 bits VA space 3) 64KB + 39 bits VA space 4) 64KB + 48 bits VA space Tested on TC2 for regressions. Christoffer Dall (3): arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2 arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE arm64: Allow 48-bits VA space without ARM_SMMU arch/arm/include/asm/kvm_mmu.h | 29 +++- arch/arm/kvm/arm.c | 2 +- arch/arm/kvm/mmu.c | 142 +++ arch/arm64/Kconfig | 2 +- arch/arm64/include/asm/kvm_mmu.h | 125 +++--- 5 files changed, 258 insertions(+), 42 deletions(-) -- 2.1.2.330.g565301e.dirty
[PATCH v3 3/3] arm64: Allow 48-bits VA space without ARM_SMMU
Now that KVM has been reworked to support a 48-bit host VA space, we can allow systems to be configured with this option. However, the ARM SMMU driver also needs to be tweaked for 48-bit support, so only allow the config option to be set when not including support for the SMMU. Signed-off-by: Christoffer Dall --- arch/arm64/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fd4e81a..a76c6c3b 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -200,7 +200,7 @@ config ARM64_VA_BITS_42 config ARM64_VA_BITS_48 bool "48-bit" - depends on BROKEN + depends on !ARM_SMMU endchoice -- 2.1.2.330.g565301e.dirty
[PATCH v3 2/3] arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE
When creating or moving a memslot, make sure the IPA space is within the addressable range of the guest. Otherwise, user space can create too large a memslot and KVM would try to access potentially unallocated page table entries when inserting entries in the Stage-2 page tables. Signed-off-by: Christoffer Dall --- arch/arm/kvm/mmu.c | 8 1 file changed, 8 insertions(+) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 6e87233..d664bff 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -994,6 +994,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run) goto out_unlock; } + /* Userspace should not be able to register out-of-bounds IPAs */ + VM_BUG_ON(fault_ipa >= KVM_PHYS_SIZE); + ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status); if (ret == 0) ret = 1; @@ -1218,6 +1221,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, struct kvm_userspace_memory_region *mem, enum kvm_mr_change change) { + if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) { + if (memslot->base_gfn + memslot->npages >= + (KVM_PHYS_SIZE >> PAGE_SHIFT)) + return -EFAULT; + } return 0; } -- 2.1.2.330.g565301e.dirty
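The range check added in kvm_arch_prepare_memory_region() above can be tried in isolation. Below is a stand-alone sketch, assuming a 40-bit IPA space with 4K pages (KVM_PHYS_SHIFT = 40, PAGE_SHIFT = 12 -- illustrative constants, not imported from KVM headers) and mirroring the patch's >= comparison:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants: 40-bit IPA space, 4K pages. */
#define PAGE_SHIFT     12
#define KVM_PHYS_SHIFT 40
#define KVM_PHYS_SIZE  (UINT64_C(1) << KVM_PHYS_SHIFT)

/* A slot is acceptable when its one-past-the-end gfn stays below the
 * number of addressable gfns -- the inverse of the patch's
 * "base_gfn + npages >= (KVM_PHYS_SIZE >> PAGE_SHIFT)" rejection. */
static bool memslot_in_range(uint64_t base_gfn, uint64_t npages)
{
    return base_gfn + npages < (KVM_PHYS_SIZE >> PAGE_SHIFT);
}
```

With these constants the guest has 2^28 addressable gfns, so a 1 GiB slot at gfn 0 passes while any slot touching gfn 2^28 - 1 or beyond is rejected.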
[PATCH v3 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2
This patch adds the necessary support for all host kernel PGSIZE and VA_SPACE configuration options for both EL2 and the Stage-2 page tables. However, for 40bit and 42bit PARange systems, the architecture mandates that VTCR_EL2.SL0 is maximum 1, resulting in fewer levels of stage-2 page tables than levels of host kernel page tables. At the same time, for systems with a PARange > 42bit, we limit the IPA range by always setting VTCR_EL2.T0SZ to 24. To solve the situation with different levels of page tables for Stage-2 translation than the host kernel page tables, we allocate a dummy PGD with pointers to our actual initial level Stage-2 page table, in order for us to reuse the kernel pgtable manipulation primitives. Reproducing all these in KVM does not look pretty and unnecessarily complicates the 32-bit side. Systems with a PARange < 40bits are not yet supported. [ I have reworked this patch from its original form submitted by Jungseok to take the architecture constraints into consideration. There were too many changes from the original patch for me to preserve the authorship. Thanks to Catalin Marinas for his help in figuring out a good solution to this challenge. I have also fixed various bugs and missing error code handling from the original patch. - Christoffer ] Cc: Marc Zyngier Cc: Catalin Marinas Signed-off-by: Jungseok Lee Signed-off-by: Christoffer Dall --- Changes [v2 -> v3]: - Calculate PTRS_PER_S2_PGD and KVM_PREALLOC_LEVEL based on PGDIR_SHIFT instead of complicated conditionals. 
- Add explanatory comments to the above definitions - Single definition of kvm_prealloc_hwpgd() and friends - Integrated tests for KVM_PREALLOC_LEVEL into kvm_pXd_table_empty() Changes [v1 -> v2]: - Use KVM_PREALLOC_LEVELS directly instead of C-variable indirection - Factored out the config changes to separate patch - Use __GFP_ZERO instead of memset - Fixed error return path in kvm_alloc_stage2_pgd() - Added WARN_ON if pgd_none() returns true - Changed some macro definitions and names arch/arm/include/asm/kvm_mmu.h | 29 - arch/arm/kvm/arm.c | 2 +- arch/arm/kvm/mmu.c | 134 +++ arch/arm64/include/asm/kvm_mmu.h | 125 +--- 4 files changed, 249 insertions(+), 41 deletions(-) diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 3f688b4..3b0fdd0 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -37,6 +37,11 @@ */ #define TRAMPOLINE_VA UL(CONFIG_VECTORS_BASE) +/* + * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels. 
+ */ +#define KVM_MMU_CACHE_MIN_PAGES2 + #ifndef __ASSEMBLY__ #include @@ -83,6 +88,11 @@ static inline void kvm_clean_pgd(pgd_t *pgd) clean_dcache_area(pgd, PTRS_PER_S2_PGD * sizeof(pgd_t)); } +static inline void kvm_clean_pmd(pmd_t *pmd) +{ + clean_dcache_area(pmd, PTRS_PER_PMD * sizeof(pmd_t)); +} + static inline void kvm_clean_pmd_entry(pmd_t *pmd) { clean_pmd_entry(pmd); @@ -123,10 +133,23 @@ static inline bool kvm_page_empty(void *ptr) } -#define kvm_pte_table_empty(ptep) kvm_page_empty(ptep) -#define kvm_pmd_table_empty(pmdp) kvm_page_empty(pmdp) -#define kvm_pud_table_empty(pudp) (0) +#define kvm_pte_table_empty(kvm, ptep) kvm_page_empty(ptep) +#define kvm_pmd_table_empty(kvm, pmdp) kvm_page_empty(pmdp) +#define kvm_pud_table_empty(kvm, pudp) (0) + +#define KVM_PREALLOC_LEVEL 0 +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd) +{ + return 0; +} + +static inline void kvm_free_hwpgd(struct kvm *kvm) { } + +static inline void *kvm_get_hwpgd(struct kvm *kvm) +{ + return kvm->arch.pgd; +} struct kvm; diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 45e5f67..9e193c8 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -409,7 +409,7 @@ static void update_vttbr(struct kvm *kvm) kvm_next_vmid++; /* update vttbr to be used with the new vmid */ - pgd_phys = virt_to_phys(kvm->arch.pgd); + pgd_phys = virt_to_phys(kvm_get_hwpgd(kvm)); BUG_ON(pgd_phys & ~VTTBR_BADDR_MASK); vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK; kvm->arch.vttbr = pgd_phys | vmid; diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index eea0306..6e87233 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -42,7 +42,7 @@ static unsigned long hyp_idmap_start; static unsigned long hyp_idmap_end; static phys_addr_t hyp_idmap_vector; -#define pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t)) +#define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t)) #define kvm_pmd_huge(_x) (pmd_huge(_x) || pmd_trans_huge(_x)) @@ -134,7 +134,7 @@ 
static void unmap_ptes(struct kvm *kvm, pmd_t *pmd, } } while (pte++, addr += PAGE_SIZE, addr != end); - if (kvm_pte_table_empty(
Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
On 10/10/2014 03:55, Nadav Amit wrote: >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c >> index b8345dd..51428dd 100644 >> --- a/arch/x86/kvm/lapic.c >> +++ b/arch/x86/kvm/lapic.c >> @@ -1096,9 +1096,12 @@ static void start_apic_timer(struct kvm_lapic *apic) >> if (likely(tscdeadline > guest_tsc)) { >> ns = (tscdeadline - guest_tsc) * 1000000ULL; >> do_div(ns, this_tsc_khz); >> +hrtimer_start(&apic->lapic_timer.timer, >> +ktime_add_ns(now, ns), HRTIMER_MODE_ABS); >> +} else { >> +atomic_inc(&ktimer->pending); >> +kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu); >> } >> -hrtimer_start(&apic->lapic_timer.timer, >> -ktime_add_ns(now, ns), HRTIMER_MODE_ABS); >> >> local_irq_restore(flags); >> } >> @@ -1355,9 +1358,6 @@ void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu >> *vcpu, u64 data) >> return; >> >> hrtimer_cancel(&apic->lapic_timer.timer); >> -/* Inject here so clearing tscdeadline won't override new value */ >> -if (apic_has_pending_timer(vcpu)) >> -kvm_inject_apic_timer_irqs(vcpu); >> apic->lapic_timer.tscdeadline = data; >> start_apic_timer(apic); >> } > > Perhaps I am missing something, but I don’t see how it solves the problem I > encountered. > Recall the scenario: > 1. A TSC deadline timer interrupt is pending. > 2. TSC deadline was still not cleared (which happens during vcpu_run). > 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline msr. > > It appears that in such scenario the guest would still get spurious > interrupt for no reason, as ktimer->pending may already be increased in > apic_timer_fn. If ktimer->pending == 2 you still get only one interrupt, not two. Am I misunderstanding your objection? > Second, I think that the solution I proposed would perform better. > Currently, there are many unnecessary cancellations and setups of the > timer. This solution does not resolve this problem. I think it does. You do not get an hrtimer_start if tscdeadline <= guest_tsc. 
To avoid useless cancels, either check hrtimer_is_enqueued before calling hrtimer_cancel, or go straight to the source and avoid taking the lock in the easy cases: diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 1c2fe7de2842..6ce725007424 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1043,10 +1043,17 @@ int hrtimer_try_to_cancel(struct hrtimer *timer) { struct hrtimer_clock_base *base; unsigned long flags; - int ret = -1; + unsigned long state = timer->state; + int ret; + + if (state & HRTIMER_STATE_ENQUEUED) + return 0; + if (state & HRTIMER_STATE_CALLBACK) + return -1; base = lock_hrtimer_base(timer, &flags); + ret = -1; if (!hrtimer_callback_running(timer)) ret = remove_hrtimer(timer, base); > Last, I think that having less interrupts on deadline changes is not > completely according to the SDM which says: "If software disarms the > timer or postpones the deadline, race conditions may result in the > delivery of a spurious timer interrupt.” It never says interrupts may > be lost if you reprogram the deadline before you check it expired. But the case when you rewrite the same value to the MSR is neither disarming nor postponing. You would be getting two interrupts for the same event. That is why I agree with Radim that checking host_initiated is wrong. Paolo
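For reference, the arithmetic in the start_apic_timer() hunk discussed in this thread converts a TSC-tick delta into nanoseconds: tsc_khz ticks per millisecond means tsc_khz / 10^6 ticks per nanosecond, hence the multiply-then-divide. A stand-alone model (plain 64-bit division standing in for the kernel's do_div()):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the delta-to-ns conversion in start_apic_timer():
 * ns = delta * 10^6 / tsc_khz.  Note the multiply can overflow u64
 * for extremely large deltas; the kernel tolerates the same limit. */
static uint64_t tsc_delta_to_ns(uint64_t tscdeadline, uint64_t guest_tsc,
                                uint32_t this_tsc_khz)
{
    uint64_t ns = (tscdeadline - guest_tsc) * UINT64_C(1000000);
    return ns / this_tsc_khz;   /* stands in for do_div(ns, this_tsc_khz) */
}
```

For example, on a 2 GHz TSC (tsc_khz = 2000000) a delta of 2 * 10^9 ticks comes out as 10^9 ns, i.e. one second.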
Re: [PATCH v2 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2
On Thu, Oct 09, 2014 at 02:36:26PM +0100, Catalin Marinas wrote: > On Thu, Oct 09, 2014 at 12:01:37PM +0100, Christoffer Dall wrote: > > On Wed, Oct 08, 2014 at 10:47:04AM +0100, Catalin Marinas wrote: > > > On Tue, Oct 07, 2014 at 08:39:54PM +0100, Christoffer Dall wrote: > > > > +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd) > > > > +{ > > > > + pud_t *pud; > > > > + pmd_t *pmd; > > > > + unsigned int order, i; > > > > + unsigned long hwpgd; > > > > + > > > > + if (KVM_PREALLOC_LEVEL == 0) > > > > + return 0; > > > > + > > > > + order = get_order(PTRS_PER_S2_PGD); > > > > > > Isn't order always 0 here? Based on our IRC discussion, PTRS_PER_S2_PGD > > > is 16 or less and the order should not be used. > > > > no, if the kernel has 4K pages and 4 levels, then PGDIR_SHIFT is 39, and > > KVM_PHYS_SHIFT stays 40, so that means PTRS_PER_S2_PGD becomes 2, which > > means we concatenate two first level stage-2 page tables, which means we > > need to allocate two consecutive pages, giving us an order of 1, not 0. > > So if PTRS_PER_S2_PGD is 2, how come get_order(PTRS_PER_S2_PGD) == 1? My > reading of the get_order() macro is that get_order(2) == 0. > > Did you mean get_order(PTRS_PER_S2_PGD * PAGE_SIZE)? Ah, you're right. Sorry. Yes, that's what I meant. > > Or you could define a PTRS_PER_S2_PGD_SHIFT as (KVM_PHYS_SHIFT - > PGDIR_SHIFT) and use this as the order directly. > That's better. I also experimented with defining S2_HWPGD_ORDER or S2_PREALLOC_ORDER, but it didn't look much clearer, so I'm sticking with PTRS_PER_S2_PGD_SHIFT. > > > > + hwpgd = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order); > > > > > > I assume you need __get_free_pages() for alignment. > > > > yes, would you prefer a comment to that fact? > > No, that's fine. > Thanks, -Christoffer
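The get_order() confusion in the exchange above is easy to reproduce with a small model. The kernel's get_order() takes a size in bytes and returns the smallest order such that (1 << order) pages cover it, so get_order(2) is 0 while get_order(2 * PAGE_SIZE) is 1. The sketch below is a simplified re-implementation for illustration (assuming 4K pages), not the kernel macro itself:

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Minimal model of the kernel's get_order(): the smallest `order`
 * such that (1 << order) pages of PAGE_SIZE bytes cover `size` bytes. */
static unsigned int get_order_model(unsigned long size)
{
    unsigned long pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
    unsigned int order = 0;

    while ((1UL << order) < pages)
        order++;
    return order;
}
```

This is exactly why passing PTRS_PER_S2_PGD (an entry count, not a byte size) gives order 0 and why the suggested PTRS_PER_S2_PGD_SHIFT avoids the unit mismatch entirely.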