[PATCH RESEND] Recognize PCID feature
This patch makes Qemu recognize the PCID feature specified from configuration or command line options.

Signed-off-by: Junjie Mao junjie@intel.com
---
 target-i386/cpu.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 5521709..efc6ece 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -50,7 +50,7 @@ static const char *ext_feature_name[] = {
     "ds_cpl", "vmx", "smx", "est",
     "tm2", "ssse3", "cid", NULL,
     "fma", "cx16", "xtpr", "pdcm",
-    NULL, NULL, "dca", "sse4.1|sse4_1",
+    NULL, "pcid", "dca", "sse4.1|sse4_1",
     "sse4.2|sse4_2", "x2apic", "movbe", "popcnt",
     "tsc-deadline", "aes", "xsave", "osxsave",
     "avx", NULL, NULL, "hypervisor",
--
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
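For reference, a minimal C sketch (illustrative only, not the QEMU source) of what this table does: ext_feature_name[] maps CPUID leaf 1 ECX bit positions to flag names, and the patch changes slot 17 from NULL to "pcid". Only the entries near the patched slot are filled in here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sketch of the feature-name table around the patched slot;
 * names follow the ext_feature_name[] array in target-i386/cpu.c. */
static const char *ext_feature_name_sketch[32] = {
    [12] = "fma",  [13] = "cx16", [14] = "xtpr", [15] = "pdcm",
    [16] = NULL,                /* still reserved */
    [17] = "pcid",              /* the slot this patch fills */
    [18] = "dca",  [19] = "sse4.1|sse4_1",
    [20] = "sse4.2|sse4_2", [21] = "x2apic",
};

/* Look up the name for a leaf-1 ECX feature bit; NULL if reserved/unknown. */
static const char *ecx_feature_name(int bit)
{
    return (bit >= 0 && bit < 32) ? ext_feature_name_sketch[bit] : NULL;
}
```

With this table in place, "-cpu qemu64,+pcid" style flag parsing can resolve the name to bit 17 of the leaf-1 ECX word, which is where the hardware reports PCID.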
Re: [RFC PATCH v2 03/21][SeaBIOS] acpi-dsdt: Implement functions for memory hotplug
On Tue, Jul 17, 2012 at 03:23:00PM +0800, Wen Congyang wrote:

+    Method(MESC, 0) {
+        // Local5 = active memdevice bitmap
+        Store (MES, Local5)
+        // Local2 = last read byte from bitmap
+        Store (Zero, Local2)
+        // Local0 = memory device iterator
+        Store (Zero, Local0)
+        While (LLess(Local0, SizeOf(MEON))) {
+            // Local1 = MEON flag for this memory device
+            Store(DerefOf(Index(MEON, Local0)), Local1)
+            If (And(Local0, 0x07)) {
+                // Shift down previously read bitmap byte
+                ShiftRight(Local2, 1, Local2)
+            } Else {
+                // Read next byte from memdevice bitmap
+                Store(DerefOf(Index(Local5, ShiftRight(Local0, 3))), Local2)
+            }
+            // Local3 = active state for this memory device
+            Store(And(Local2, 1), Local3)
+
+            If (LNotEqual(Local1, Local3)) {

There are two ways to hot remove a memory device:
1. dimm_del
2. echo 1 > /sys/bus/acpi/devices/PNP0C80:XX/eject

In the 2nd case, we cannot hotplug this memory device again, because both Local1 and Local3 are 1. So, I think the MEON flag for this memory device should be set to 0 in method _EJ0, or method _PS3 should be implemented for the memory device.

good catch. Both internal seabios state (MEON) and the machine qemu bitmap (mems_sts in hw/acpi_piix4.c) have to be updated when the ejection comes from OSPM action. I will implement a _PS3 method that updates the MEON flag and also signals qemu to change the mems_sts bitmap.

thanks,
- Vasilis
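The AML loop above fetches a fresh bitmap byte every eighth device and shifts it right otherwise, so device i's active bit ends up being bit (i & 7) of byte (i >> 3). As a sanity check, here is a hedged C model of that indexing (a hypothetical helper, not SeaBIOS code):

```c
#include <assert.h>
#include <stdint.h>

/* C model of MESC's bitmap walk: Local0 is the device index; every 8th
 * device reads bitmap[Local0 >> 3] into Local2, otherwise Local2 is
 * shifted right one bit. Net effect shown below. */
static int memdev_active(const uint8_t *bitmap, unsigned dev)
{
    /* DerefOf(Index(bitmap, dev >> 3)), shifted down (dev & 7) times */
    return (bitmap[dev >> 3] >> (dev & 7)) & 1;
}
```

This makes the bug in the thread concrete: after an OSPM-initiated eject, the bitmap bit (Local3) stays 1 while MEON (Local1) also stays 1, so the LNotEqual branch never fires for a re-plug.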
Re: [PATCH v5 0/4] kvm: level irqfd and new eoifd
On Thu, Jul 19, 2012 at 12:48:07PM -0600, Alex Williamson wrote: On Thu, 2012-07-19 at 20:45 +0300, Michael S. Tsirkin wrote: On Thu, Jul 19, 2012 at 11:29:38AM -0600, Alex Williamson wrote: On Thu, 2012-07-19 at 19:59 +0300, Michael S. Tsirkin wrote: On Mon, Jul 16, 2012 at 02:33:38PM -0600, Alex Williamson wrote: v5: - irqfds now have a one-to-one mapping with eoifds to prevent users from consuming all of kernel memory by repeatedly creating eoifds from a single irqfd. - implement a kvm_clear_irq() which does a test_and_clear_bit of the irq_state, only updating the pic/ioapic if changes and allowing the caller to know if anything was done. I added this onto the end as it's essentially an optimization on the previous design. It's hard to tell if there's an actual performance benefit to this. - dropped eoifd gsi support patch as it was only an FYI. Thanks, Alex So 3/4, 4/4 are racy and I think you convinced me it's best to drop it for now. I hope that fact that we already scan all vcpus under spinlock for level interrupts is enough to justify adding a lock here. To summarize issues still outstanding with 1/2, 2/2: (a) - source id lingering after irqfd was destroyed/deassigned prevents assigning a new irqfd (b) - if same irqfd is deassigned and re-assigned, this seems to succeed but does not give any more EOIs (c) - document that user needs to re-inject interrupts injected by level IRQFD after migration as they are cleared Hope this helps! Thanks, I'm refining and testing a re-write. One thing I also noticed is that we don't do anything when the eoifd is closed. We'll cleanup when kvm is closed, but that can leave a lot of stray eoifds, and therefore used irq_source_ids tied up. So, I think I need to pull in a lot of the irqfd code just to be able to catch the POLLHUP and do cleanup. I don't think it's worth it. With ioeventfd we have the same issue and we don't care: userspace should just DEASSIGN before close. 
With irqfd we committed to support cleanup by close but it happens kind of naturally since we poll irqfd anyway. It's there for irqfd for historical reasons. You're not dealing with such a limited resource for ioeventfds though. It's pretty easily conceivable we could run out of irq source IDs. Running out of fds is also very conceivable. Not deassigning before close is a userspace bug anyway. Fixing (a) is a simple flush, so I already added that. To solve (b), I think that saving the irqfd eventfd ctx was a bad idea. I actually think we should just fix it. Scan eoifds when closing/opening irqfds and bind/unbind source id. Hmm, IMHO we had no business holding onto an eventfd ctx. That was an ugly implementation detail forced by the desire to allow the same eventfd to be used in multiple eoifds. The fallout from that leaves a lasting tie between the eoifd and the future use of that eventfd. I can imagine the scenario you present is just one of the glitches and I really don't want to have one interface disable another. Looks like this disabling is inherent in what we want eoifd to do. You bind irqfd and eoifd. If irqfd is deassigned, eoifd will not get any more events, it is disabled. Whether it keeps the pointer to source id internally or not does not matter to the user. The new api I will propose to solve it is that kvm_irqfd returns a token (or key) when used as a level irqfd (actually the irq source id, but the user shouldn't care what it is). We pass that into eoifd instead of the irqfd. That means that if the irqfd is closed and re-configured, the user will get a new key and should have no expectation that it's tied to the previous eoifd. I'll add a comment for (c). Thanks, Alex Hmm, another API rewrite, when I felt it is finally stabilizing. Maybe it's the right thing to do but it does feel like we change userspace ABI just because we have run into an implementation difficulty. Pls note I'm offline next week so won't have time to review soon. 
We could return the key in the struct kvm_irqfd if it adds anything, but I felt returning the key was preferable and is compatible with the existing ABI. Thanks,

Alex

You say it is preferable but I wonder what it buys users compared to using the fd directly - it is certainly more work for userspace to keep track of it.

-- MST
Re: [PATCH RESEND 5/5] vhost-blk: Add vhost-blk support
On Thu, Jul 19, 2012 at 2:09 PM, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Jul 19, 2012 at 08:05:42AM -0500, Anthony Liguori wrote: Of course, the million dollar question is why would using AIO in the kernel be faster than using AIO in userspace? Actually for me a more important question is how does it compare with virtio-blk dataplane? Hi Khoa, I think you have results of data-plane and vhost-blk? Is the vhost-blk version identical to Asias' recent patches? Stefan
Re: [PATCH] KVM: PIC: call ack notifiers for irqs that are dropped from irr
On Tue, Jul 17, 2012 at 02:59:11PM +0300, Gleb Natapov wrote:

After commit 242ec97c358256 PIT interrupts are no longer delivered after PIC reset. It happens because PIT injects an interrupt only if the previous one was acked, but since on PIC reset it is dropped from irr it will never be delivered and hence acknowledged. Fix that by calling the ack notifier on PIC reset.

Signed-off-by: Gleb Natapov g...@redhat.com

diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 81cf4fa..f09e790 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -305,6 +305,7 @@ static void pic_ioport_write(void *opaque, u32 addr, u32 val)
 	addr &= 1;
 	if (addr == 0) {
 		if (val & 0x10) {
+			u8 edge_irr = s->irr & ~s->elcr;
 			s->init4 = val & 1;
 			s->last_irr = 0;
 			s->irr &= s->elcr;
@@ -322,6 +323,9 @@ static void pic_ioport_write(void *opaque, u32 addr, u32 val)
 			if (val & 0x08)
 				pr_pic_unimpl("level sensitive irq not supported");
+			for (irq = 0; irq < PIC_NUM_PINS/2; irq++)
+				if (edge_irr & (1 << irq))
+					pic_clear_isr(s, irq);
 		} else if (val & 0x08) {
 			if (val & 0x04)
 				s->poll = 1;

--
			Gleb.

Can you modify kvm_pic_reset (currently unused BTW) to conform to 9ed049c3b6230b6898 ? It checks for APIC handling interrupts before acking.
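In effect the fix computes, at ICW1 time, which pending edge-triggered lines are about to be lost and runs pic_clear_isr() (and thus the ack notifiers) for each of them. A hedged model of just that mask computation (illustrative, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the fix: IRQs that are pending (set in irr) and edge-triggered
 * (clear in elcr) are dropped by "irr &= elcr" on init, so their ack
 * notifiers must be called. Returns the mask of such IRQs. */
static uint8_t edge_irr_dropped_on_init(uint8_t irr, uint8_t elcr)
{
    return irr & (uint8_t)~elcr;
}
```

For example, with the PIT pending on edge-triggered IRQ0 (irr = 0x01, elcr bit 0 clear), the mask contains IRQ0, so the PIT's ack notifier fires and it re-injects after reset.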
Re: [PATCH 5/9] KVM: MMU: fast check write-protect for direct mmu
On Fri, Jul 20, 2012 at 11:45:59AM +0800, Xiao Guangrong wrote:

BTW, there are some bug fix patches on the -master branch that do not exist on the -next branch:

commit: f411930442e01f9cf1bf4df41ff7e89476575c4d
commit: 85b7059169e128c57a3a8a3e588fb89cb2031da1

This causes code conflicts if we do the development on -next.

See the auto-next branch. http://www.linux-kvm.org/page/Kvm-Git-Workflow
Re: [PATCH 2/9] KVM: x86: simplify read_emulated
On Fri, Jul 20, 2012 at 10:17:36AM +0800, Xiao Guangrong wrote:
On 07/20/2012 07:58 AM, Marcelo Tosatti wrote:

-	}
+	rc = ctxt->ops->read_emulated(ctxt, addr, mc->data + mc->end, size,
+				      &ctxt->exception);
+	if (rc != X86EMUL_CONTINUE)
+		return rc;
+
+	mc->end += size;
+
+read_cached:
+	memcpy(dest, mc->data + mc->pos, size);

What prevents a read_emulated(size > 8) call, with mc->pos == (mc->end - 8) now?

Marcelo,

The splitting has been done in emulator_read_write_onepage:

	while (bytes) {
		unsigned now = min(bytes, 8U);

		frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
		frag->gpa = gpa;
		frag->data = val;
		frag->len = now;
		frag->write_readonly_mem = (ret == -EPERM);

		gpa += now;
		val += now;
		bytes -= now;
	}

So i think it is safe to remove the splitting in read_emulated.

Yes, it is fine to remove it. But the splitting in emulate.c prevented the case of _cache read_ with size > 8 beyond the end of mc->data. That case must be handled in read_emulated.

What prevents a read_emulated(size > 8) call, with mc->pos == (mc->end - 8) now?
Re: [PATCH 5/9] KVM: MMU: fast check write-protect for direct mmu
On Fri, Jul 20, 2012 at 10:34:28AM +0800, Xiao Guangrong wrote:
On 07/20/2012 08:39 AM, Marcelo Tosatti wrote:
On Tue, Jul 17, 2012 at 09:53:29PM +0800, Xiao Guangrong wrote:

If there are no indirect shadow pages we need not protect any gfn; this is always true for direct mmu without nested.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com

Xiao,

What is the motivation? Numbers please.

mmu_need_write_protect is the common path for both soft-mmu and hard-mmu; checking indirect_shadow_pages can skip the hash-table walk for the case where tdp is enabled without a nested guest.

I mean motivation as in an observation that it is a bottleneck.

I will post the numbers after I do the performance test.

In fact, what case was the original indirect_shadow_pages conditional in kvm_mmu_pte_write optimizing again?

They are different paths: mmu_need_write_protect is the real page fault path, and kvm_mmu_pte_write is caused by mmio emulation.

Sure. What i am asking is, what use case is the indirect_shadow_pages check optimizing? What scenario, what workload?

See the "When to optimize" section of http://en.wikipedia.org/wiki/Program_optimization.

Can't remember why indirect_shadow_pages was introduced in kvm_mmu_pte_write.
Re: [RFC-v3 4/4] tcm_vhost: Initial merge for vhost level target fabric driver
On Wed, Jul 18, 2012 at 02:20:58PM -0700, Nicholas A. Bellinger wrote:
On Wed, 2012-07-18 at 19:09 +0300, Michael S. Tsirkin wrote:
On Wed, Jul 18, 2012 at 12:59:32AM +0000, Nicholas A. Bellinger wrote:

<SNIP>

Changelog v2 -> v3:

  Unlock on error in tcm_vhost_drop_nexus() (DanC)
  Fix strlen() doesn't count the terminator (DanC)
  Call kfree() on an error path (DanC)
  Convert tcm_vhost_write_pending to use target_execute_cmd (hch + nab)
  Fix another strlen() off by one in tcm_vhost_make_tport (DanC)
  Add option under drivers/staging/Kconfig, and move to drivers/vhost/tcm/
  as requested by MST (nab)

---
 drivers/staging/Kconfig       |    2 +
 drivers/vhost/Makefile        |    2 +
 drivers/vhost/tcm/Kconfig     |    6 +
 drivers/vhost/tcm/Makefile    |    1 +
 drivers/vhost/tcm/tcm_vhost.c | 1611 +++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/tcm/tcm_vhost.h |   74 ++
 6 files changed, 1696 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/tcm/Kconfig
 create mode 100644 drivers/vhost/tcm/Makefile
 create mode 100644 drivers/vhost/tcm/tcm_vhost.c
 create mode 100644 drivers/vhost/tcm/tcm_vhost.h

Really sorry about making you run around like that, I did not mean moving all of tcm to a directory, just adding tcm/Kconfig or adding drivers/vhost/Kconfig.tcm because eventually it's easier to keep it all together in one place.

Er, apologies for the slight mis-understanding here.. Moving back now + fixing up the Kbuild bits.

I'm going offline in several hours and am on vacation for a week starting tomorrow. So to make 3.6, and if you intend to merge through my tree, the best bet is if you can send the final version real soon now.

-- MST
Biweekly upstream qemu-kvm test report (using autotest + manual) - Week 28
Folks,
Please find the result of upstream testing. This time we got a kernel panic error while compiling the mainline kernel (3.5-rc7). Hence we could verify only mainline qemu-kvm. We are analysing the failures and will raise the bugs with the appropriate community.

Host Kernel: 3.1.0-7.fc16.x86_64
KVM Version: 1.1.50 (qemu-kvm-devel)
Date: Thu Jul 19 17:51:29 2012
Stat: 59 tests executed - 40 passed, 19 failed
Number of Bugs raised: 2
  https://bugzilla.kernel.org/show_bug.cgi?id=44901
  https://github.com/autotest/autotest/issues/467

Tests Failed:
..................................................................
Test Name                                                             Result    Run time
..................................................................
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.autotest.ffsb      FAIL      29
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.autotest.disktest  FAIL      24
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.autotest.hackbench FAIL      22
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.autotest.cpu_hotplug FAIL    57
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.block_stream       FAIL      159
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.linux_s3           FAIL      303
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cpuflags.boot_guest.qemu_boot_cpu_model FAIL 2280
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cpuflags.boot_guest.qemu_boot_cpu_model_and_flags FAIL 2483
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cpuflags.stress_guest.qemu_test_boot_guest_and_try_flags_under_load FAIL 2859
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cpuflags.stress_guest.qemu_test_online_offline_guest_CPUs FAIL 2619
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cpuflags.stress_guest.qemu_test_migration_with_additional_flags FAIL 2665
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cgroup.blkio_throttle FAIL   2
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cgroup.blkio_throttle_multi FAIL 344
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cgroup.cpu_share   FAIL      1
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cgroup.cpuset_cpus FAIL      1
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cgroup.freezer     FAIL      2
-----------------------------------------------------------------

Tests Passed:
..................................................................
Test Name                                                             Result    Run time
..................................................................
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.autotest.dbench    PASS      131
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.autotest.ebizzy    PASS      22
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.autotest.stress    PASS      88
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.autotest.sleeptest PASS      55
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.autotest.iozone    PASS      540
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.jumbo              PASS      537
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cgroup.blkio_bandwidth PASS  28
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cgroup.cpuset_cpus_switching PASS 95
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cgroup.devices_access PASS   15
kvm.qed.virtio_blk.smp2.virtio_net.RHEL.6.2.x86_64.cgroup.memory_limit PASS
Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
On Thu, Jul 05, 2012 at 06:29:54PM +0800, Jason Wang wrote: This patch let the virtio_net driver can negotiate the number of queues it wishes to use through control virtqueue and export an ethtool interface to let use tweak it. As current multiqueue virtio-net implementation has optimizations on per-cpu virtuqueues, so only two modes were support: - single queue pair mode - multiple queue paris mode, the number of queues matches the number of vcpus The single queue mode were used by default currently due to regression of multiqueue mode in some test (especially in stream test). Since virtio core does not support paritially deleting virtqueues, so during mode switching the whole virtqueue were deleted and the driver would re-create the virtqueues it would used. btw. The queue number negotiating were defered to .ndo_open(), this is because only after feature negotitaion could we send the command to control virtqueue (as it may also use event index). Signed-off-by: Jason Wang jasow...@redhat.com --- drivers/net/virtio_net.c | 171 ++- include/linux/virtio_net.h |7 ++ 2 files changed, 142 insertions(+), 36 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 7410187..3339eeb 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -88,6 +88,7 @@ struct receive_queue { struct virtnet_info { u16 num_queue_pairs;/* # of RX/TX vq pairs */ + u16 total_queue_pairs; struct send_queue *sq[MAX_QUEUES] cacheline_aligned_in_smp; struct receive_queue *rq[MAX_QUEUES] cacheline_aligned_in_smp; @@ -137,6 +138,8 @@ struct padded_vnet_hdr { char padding[6]; }; +static const struct ethtool_ops virtnet_ethtool_ops; + static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq) { int ret = virtqueue_get_queue_index(vq); @@ -802,22 +805,6 @@ static void virtnet_netpoll(struct net_device *dev) } #endif -static int virtnet_open(struct net_device *dev) -{ - struct virtnet_info *vi = netdev_priv(dev); - int i; - - for (i = 0; i 
vi-num_queue_pairs; i++) { - /* Make sure we have some buffers: if oom use wq. */ - if (!try_fill_recv(vi-rq[i], GFP_KERNEL)) - queue_delayed_work(system_nrt_wq, -vi-rq[i]-refill, 0); - virtnet_napi_enable(vi-rq[i]); - } - - return 0; -} - /* * Send command via the control virtqueue and check status. Commands * supported by the hypervisor, as indicated by feature bits, should @@ -873,6 +860,43 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi) rtnl_unlock(); } +static int virtnet_set_queues(struct virtnet_info *vi) +{ + struct scatterlist sg; + struct net_device *dev = vi-dev; + sg_init_one(sg, vi-num_queue_pairs, sizeof(vi-num_queue_pairs)); + + if (!vi-has_cvq) + return -EINVAL; + + if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MULTIQUEUE, + VIRTIO_NET_CTRL_MULTIQUEUE_QNUM, sg, 1, 0)){ + dev_warn(dev-dev, Fail to set the number of queue pairs to + %d\n, vi-num_queue_pairs); + return -EINVAL; + } + + return 0; +} + +static int virtnet_open(struct net_device *dev) +{ + struct virtnet_info *vi = netdev_priv(dev); + int i; + + for (i = 0; i vi-num_queue_pairs; i++) { + /* Make sure we have some buffers: if oom use wq. 
*/ + if (!try_fill_recv(vi-rq[i], GFP_KERNEL)) + queue_delayed_work(system_nrt_wq, +vi-rq[i]-refill, 0); + virtnet_napi_enable(vi-rq[i]); + } + + virtnet_set_queues(vi); + + return 0; +} + static int virtnet_close(struct net_device *dev) { struct virtnet_info *vi = netdev_priv(dev); @@ -1013,12 +1037,6 @@ static void virtnet_get_drvinfo(struct net_device *dev, } -static const struct ethtool_ops virtnet_ethtool_ops = { - .get_drvinfo = virtnet_get_drvinfo, - .get_link = ethtool_op_get_link, - .get_ringparam = virtnet_get_ringparam, -}; - #define MIN_MTU 68 #define MAX_MTU 65535 @@ -1235,7 +1253,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi) err: if (ret names) - for (i = 0; i vi-num_queue_pairs * 2; i++) + for (i = 0; i total_vqs * 2; i++) kfree(names[i]); kfree(names); @@ -1373,7 +1391,6 @@ static int virtnet_probe(struct virtio_device *vdev) mutex_init(vi-config_lock); vi-config_enable = true; INIT_WORK(vi-config_work, virtnet_config_changed_work); - vi-num_queue_pairs = num_queue_pairs; /* If we can receive ANY GSO packets, we must allocate large ones. */
Re: [PATCH 2/9] KVM: x86: simplify read_emulated
On 07/20/2012 06:58 PM, Marcelo Tosatti wrote:
On Fri, Jul 20, 2012 at 10:17:36AM +0800, Xiao Guangrong wrote:
On 07/20/2012 07:58 AM, Marcelo Tosatti wrote:

[quoted read_emulated hunk and emulator_read_write_onepage splitting loop snipped - same as earlier in the thread]

What prevents a read_emulated(size > 8) call, with mc->pos == (mc->end - 8) now?

You mean the mmio region is partly cached? I think it can not happen. Now, we pass the whole size to emulator_read_write_onepage(); after it is finished, it saves the whole data into mc->data[], so the cache-read can always get the whole data from mc->data[].
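The splitting loop Xiao quotes caps each mmio fragment at 8 bytes while the cache in mc->data[] grows by the full request size, which is why the cache can never end up only partly filled. A hedged sketch of just the chunking arithmetic (simplified types, not the kernel code):

```c
#include <assert.h>

/* Simplified model of the emulator_read_write_onepage() loop: a request
 * of `bytes` becomes ceil(bytes/8) fragments of at most 8 bytes each.
 * The real code also records per-fragment gpa/data pointers. */
static int mmio_fragment_count(unsigned bytes)
{
    int n = 0;
    while (bytes) {
        unsigned now = bytes < 8u ? bytes : 8u;  /* min(bytes, 8U) */
        bytes -= now;
        n++;
    }
    return n;
}
```

So a 20-byte access produces fragments of 8, 8, and 4 bytes, but all 20 bytes land contiguously in the cache before any cached read is served.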
Re: [PATCH] KVM: PIC: call ack notifiers for irqs that are dropped from irr
On Fri, Jul 20, 2012 at 08:58:56AM -0300, Marcelo Tosatti wrote:
On Tue, Jul 17, 2012 at 02:59:11PM +0300, Gleb Natapov wrote:

[quoted i8259.c patch snipped - same hunk as earlier in the thread]

Can you modify kvm_pic_reset (currently unused BTW) to conform to 9ed049c3b6230b6898 ? It checks for APIC handling interrupts before acking.

Since it is not used we can make it do anything. But preferably in a separate patch, otherwise the bug fix will be obfuscated by the code move.

--
			Gleb.
Re: [PATCH 5/9] KVM: MMU: fast check write-protect for direct mmu
On 07/20/2012 07:09 PM, Marcelo Tosatti wrote:
On Fri, Jul 20, 2012 at 10:34:28AM +0800, Xiao Guangrong wrote:
On 07/20/2012 08:39 AM, Marcelo Tosatti wrote:
On Tue, Jul 17, 2012 at 09:53:29PM +0800, Xiao Guangrong wrote:

[earlier exchange snipped]

Sure. What i am asking is, what use case is the indirect_shadow_pages check optimizing? What scenario, what workload?

Sorry, Marcelo, i do not know why i completely misunderstood your mail. :(

I am not sure whether this is a bottleneck; i just got it from code review. I will measure it to see if we can get benefit from it. :p

See the "When to optimize" section of http://en.wikipedia.org/wiki/Program_optimization.

Can't remember why indirect_shadow_pages was introduced in kvm_mmu_pte_write.

Please refer to: https://lkml.org/lkml/2011/5/18/174
Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
On Thu, Jul 05, 2012 at 06:29:53PM +0800, Jason Wang wrote: This patch converts virtio_net to a multi queue device. After negotiated VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs, and driver could read the number from config space. The driver expects the number of rx/tx queue paris is equal to the number of vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some optimization were introduced: - Txq selection is based on the processor id in order to avoid contending a lock whose owner may exits to host. - Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns the queue pairs. Signed-off-by: Krishna Kumar krkum...@in.ibm.com Signed-off-by: Jason Wang jasow...@redhat.com Overall fine. I think it is best to smash the following patch into this one, so that default behavior does not jump to mq then back. some comments below: mostly nits, and a minor bug. If you are worried the patch is too big, it can be split differently - rework to use send_queue/receive_queue structures, no functional changes. - add multiqueue but this is not a must. --- drivers/net/virtio_net.c | 645 ++- include/linux/virtio_net.h |2 + 2 files changed, 452 insertions(+), 195 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 1db445b..7410187 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -26,6 +26,7 @@ #include linux/scatterlist.h #include linux/if_vlan.h #include linux/slab.h +#include linux/interrupt.h static int napi_weight = 128; module_param(napi_weight, int, 0444); @@ -41,6 +42,8 @@ module_param(gso, bool, 0444); #define VIRTNET_SEND_COMMAND_SG_MAX2 #define VIRTNET_DRIVER_VERSION 1.0.0 +#define MAX_QUEUES 256 + struct virtnet_stats { struct u64_stats_sync tx_syncp; struct u64_stats_sync rx_syncp; Would be a bit better not to have artificial limits like that. Maybe allocate arrays at probe time, then we can take whatever the device gives us? 
@@ -51,43 +54,69 @@ struct virtnet_stats { u64 rx_packets; }; -struct virtnet_info { - struct virtio_device *vdev; - struct virtqueue *rvq, *svq, *cvq; - struct net_device *dev; +/* Internal representation of a send virtqueue */ +struct send_queue { + /* Virtqueue associated with this send _queue */ + struct virtqueue *vq; + + /* TX: fragments + linear part + virtio header */ + struct scatterlist sg[MAX_SKB_FRAGS + 2]; +}; + +/* Internal representation of a receive virtqueue */ +struct receive_queue { + /* Virtqueue associated with this receive_queue */ + struct virtqueue *vq; + + /* Back pointer to the virtnet_info */ + struct virtnet_info *vi; + struct napi_struct napi; - unsigned int status; /* Number of input buffers, and max we've ever had. */ unsigned int num, max; + /* Work struct for refilling if we run low on memory. */ + struct delayed_work refill; + + /* Chain pages by the private ptr. */ + struct page *pages; + + /* RX: fragments + linear part + virtio header */ + struct scatterlist sg[MAX_SKB_FRAGS + 2]; +}; + +struct virtnet_info { + u16 num_queue_pairs;/* # of RX/TX vq pairs */ + + struct send_queue *sq[MAX_QUEUES] cacheline_aligned_in_smp; + struct receive_queue *rq[MAX_QUEUES] cacheline_aligned_in_smp; The assumption is a tx/rx pair is handled on the same cpu, yes? If yes maybe make it a single array to improve cache locality a bit? struct queue_pair { struct send_queue sq; struct receive_queue rq; }; + struct virtqueue *cvq; + + struct virtio_device *vdev; + struct net_device *dev; + unsigned int status; + /* I like... big packets and I cannot lie! */ bool big_packets; /* Host will merge rx buffers for big packets (shake it! shake it!) */ bool mergeable_rx_bufs; + /* Has control virtqueue */ + bool has_cvq; + won't checking (cvq != NULL) be enough? /* enable config space updates */ bool config_enable; /* Active statistics */ struct virtnet_stats __percpu *stats; - /* Work struct for refilling if we run low on memory. 
*/ - struct delayed_work refill; - /* Work struct for config space updates */ struct work_struct config_work; /* Lock for config space updates */ struct mutex config_lock; - - /* Chain pages by the private ptr. */ - struct page *pages; - - /* fragments + linear part + virtio header */ - struct scatterlist rx_sg[MAX_SKB_FRAGS + 2]; - struct scatterlist tx_sg[MAX_SKB_FRAGS + 2]; }; struct skb_vnet_hdr { @@ -108,6 +137,22 @@ struct padded_vnet_hdr { char padding[6]; };
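The commit message's "txq selection is based on the processor id" policy amounts to a cpu-to-queue mapping. A hedged sketch of that policy (the real driver reads smp_processor_id(); here the cpu is a parameter so the mapping is explicit, and this is not the driver's actual function):

```c
#include <assert.h>

/* Per-cpu txq selection as described in the commit message: with one
 * queue pair per vcpu the mapping is identity, so two cpus never
 * contend on the same tx queue's lock; a modulo keeps it safe if there
 * are more cpus than queue pairs. */
static unsigned txq_for_cpu(unsigned cpu, unsigned num_queue_pairs)
{
    return cpu % num_queue_pairs;
}
```

This is the design point being reviewed: avoiding lock contention with a lock owner that may have exited to the host, at the cost of assuming the rx/tx pair is processed on the cpu that owns it (hence the affinity hints).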
Re: Unexpected host I/O load
After some additional troubleshooting under the guidance of a friend, this appears to be a libvirt issue. I have opened the following bug for it: https://bugzilla.redhat.com/show_bug.cgi?id=841918. Thanks to any who spent time looking at this. Brian
Re: [Autotest] Biweekly upstream qemu-kvm test report (using autotest + manual) - Week 28
Hi,

It seems there is no virtio_console testing, which I have never seen pass. Also, why are some of the cases below manual, and why are they not in the default tests.cfg?

Lei

On Fri, Jul 20, 2012 at 8:20 PM, Prem Karat prem.ka...@linux.vnet.ibm.com wrote:

[full test report quoted in the reply snipped - identical to the report above]
Re: [PATCH v9 01/16] ARM: add mem_type prot_pte accessor
Am 03.07.2012 10:59, schrieb Christoffer Dall:
> From: Marc Zyngier marc.zyng...@arm.com
>
> The KVM hypervisor mmu code requires requires access to the

"requires requires access" -> "requires access"?

Andreas

> mem_type prot_pte field when setting up page tables pointing to a
> device. Unfortunately, the mem_type structure is opaque. Add an
> accessor (get_mem_type_prot_pte()) to retrieve the prot_pte value.
>
> Signed-off-by: Marc Zyngier marc.zyng...@arm.com
> Signed-off-by: Christoffer Dall c.d...@virtualopensystems.com

--
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [PATCH] KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended
On Fri, Jul 20, 2012 at 10:04:34AM +0900, Takuya Yoshikawa wrote: On Wed, 18 Jul 2012 17:52:46 -0300 Marcelo Tosatti mtosa...@redhat.com wrote:

Can't understand, can you please expand more clearly?

I think mmu pages are not worth freeing under usual memory pressure, especially when we have EPT/NPT on. What's happening: shrink_slab() vainly calls mmu_shrink() with the default batch size 128, and mmu_shrink() takes a long time to zap mmu pages far fewer than the requested number, usually just freeing one. Sadly, KVM may recreate the page soon after that. Since we set the seeks 10 times greater than the default, total_scan is very small and shrink_slab() just wastes time freeing such a small amount of may-be-reallocated-soon memory: I want it to use the time for scanning other objects instead. Actually the total amount of memory used for mmu pages is not huge in the case of EPT/NPT on: maybe smaller than that of rmap?

rmap size is a function of mmu pages, so mmu_shrink indirectly releases rmap also.

So, it's clear that no one wants mmu pages to be freed as other objects. Sure, our seeks size prevents shrink_slab() from calling mmu_shrink() usually. But what if administrators want to drop clean caches on the host? Documentation/sysctl/vm.txt says: Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free. To free pagecache: echo 1 > /proc/sys/vm/drop_caches. To free dentries and inodes: echo 2 > /proc/sys/vm/drop_caches. To free pagecache, dentries and inodes: echo 3 > /proc/sys/vm/drop_caches. I don't want mmu pages to be freed in such cases.

drop_caches should be used on special occasions. I would not worry about it.

So, how about stopping reporting/returning the total number of used mmu pages to shrink_slab()? If we do so, it will think that there are not enough objects to get memory back from KVM.

No, it's important to be able to release memory quickly in low memory conditions.
I bet the reasoning behind the current seeks value (10*default) is close to arbitrary. mmu_shrink can be smarter, by freeing pages which are less likely to be used. IIRC Avi had some nice ideas for LRU-like schemes (search the archives). You can also consider the fact that freeing a higher level pagetable frees all of its children (that is quite dumb actually, sequential shrink passes should free only pages with no children).

In the case of shadow paging, guests can do bad things and allocate an enormous number of mmu pages, so we should report such exceeded numbers to shrink_slab() as freeable objects, not the total. A guest idle for 2 months should not have its mmu pages in memory.

|--- needed ---|--- freeable under memory pressure ---|

We may be able to use n_max_mmu_pages for this: the shrinker tries to free mmu pages unless the number reaches the goal.

Thanks,
Takuya
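Takuya's "needed vs. freeable" split above could be sketched as follows. This is only an illustration of the proposal, not kernel code; the function and parameter names are mine, and n_max_mmu_pages is the threshold he suggests, not an existing shrinker hook.

```c
#include <assert.h>

/* Sketch of the idea: report to shrink_slab() only the pages that
 * exceed the guest's working-set goal (n_max_mmu_pages), rather than
 * the total number of allocated mmu pages. Anything at or below the
 * goal is "needed" and never offered to the shrinker. */
static unsigned long mmu_freeable_count(unsigned long n_used_mmu_pages,
                                        unsigned long n_max_mmu_pages)
{
    if (n_used_mmu_pages <= n_max_mmu_pages)
        return 0;   /* nothing worth freeing under normal pressure */
    return n_used_mmu_pages - n_max_mmu_pages;
}
```

With this, a guest at or under its goal reports zero freeable objects, so shrink_slab() spends its scan budget on other caches instead.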
Re: [Autotest] Biweekly upstream qemu-kvm test report (using autotest + manual) - Week 28
* On 2012-07-20 22:52:21 +0800, lei yang (yanglei.f...@gmail.com) wrote:

> Hi, it seems there is no virtio_console testing, which I have never
> seen pass. And why are some of the below cases manual, and why are
> they not in the

Because we have no guest agent test case in autotest now, though I'm working on it. lol

> default tests.cfg?
>
> Lei

[quoted upstream test report snipped; see the full report earlier in this thread]
Re: [Autotest] Biweekly upstream qemu-kvm test report (using autotest + manual) - Week 28
On Fri, Jul 20, 2012 at 11:18 PM, Qingtang Zhou qz...@redhat.com wrote:

> Because we have no guest agent test case in autotest now, though I'm
> working on it. lol

The guest agent test case you mentioned, is it the manual cases in this thread? It also seems only some of the cases, not all, are selected for the biweekly upstream test. Is there a strategy for selecting which cases to run?

[quoted upstream test report snipped; see the full report earlier in this thread]
[PATCH v6 0/2] kvm: level irqfd and new eoifd
v6: So we're back to just the first two patches; unfortunately the diffstat got bigger though. The reason for that is that I discovered we don't do anything on release of an eoifd. We clean up if the kvm vm is released, but we're dealing with a constrained resource of irq source IDs, so I think it's best that we clean up to make sure those are returned. To do that we need nearly the same infrastructure as irqfd, we just only watch for POLLHUP. So while there's more code here, the structure and function names line up identically to irqfd.

The other big change here is that KVM_IRQFD returns a token when setting up a level irqfd. We use this token to associate the eoifd with the right source. This means we have to put the struct _source_ids on a list so we can find them. This removes the weird interaction we were headed towards, where the eoifd is associated with the eventfd of the irqfd. There's potentially more flexibility for the future here too, as we might come up with other interfaces that can return a source ID key. Perhaps some future KVM_IRQFD will allow specifying a key for re-attaching. Anyway, the sequence Michael pointed out, where an irqfd is de-assigned then re-assigned, now results in a new key instead of leaving the user wondering if it re-associates back to the eoifd.

Also added workqueue flushes on assign, since releasing either object now results in a lazy release via workqueue. This ensures we re-claim any source IDs we can.

Thanks,
Alex

---

Alex Williamson (2):
 kvm: KVM_EOIFD, an eventfd for EOIs
 kvm: Extend irqfd to support level interrupts

Documentation/virtual/kvm/api.txt | 32 ++-
arch/x86/kvm/x86.c | 3
include/linux/kvm.h | 18 +
include/linux/kvm_host.h | 17 +
virt/kvm/eventfd.c | 463 -
virt/kvm/kvm_main.c | 11 +
6 files changed, 536 insertions(+), 8 deletions(-)
[PATCH v6 1/2] kvm: Extend irqfd to support level interrupts
In order to inject a level interrupt from an external source using an irqfd, we need to allocate a new irq_source_id. This allows us to assert and (later) de-assert an interrupt line independently from users of KVM_IRQ_LINE and avoid lost interrupts.

We also add what may appear like a bit of excessive infrastructure around an object for storing this irq_source_id. However, notice that we only provide a way to assert the interrupt here. A follow-on interface will make use of the same irq_source_id to allow de-assert.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
Documentation/virtual/kvm/api.txt | 11 +++
arch/x86/kvm/x86.c | 1
include/linux/kvm.h | 3 +
include/linux/kvm_host.h | 4 +
virt/kvm/eventfd.c | 128 +++-
5 files changed, 139 insertions(+), 8 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index bf33aaa..3911e62 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1936,7 +1936,7 @@
 Capability: KVM_CAP_IRQFD
 Architectures: x86
 Type: vm ioctl
 Parameters: struct kvm_irqfd (in)
-Returns: 0 on success, -1 on error
+Returns: 0 (or >= 0) on success, -1 on error

 Allows setting an eventfd to directly trigger a guest interrupt.
 kvm_irqfd.fd specifies the file descriptor to use as the eventfd and
@@ -1946,6 +1946,15 @@
 the guest using the specified gsi pin. The irqfd is removed using
 the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd and
 kvm_irqfd.gsi.

+The KVM_IRQFD_FLAG_LEVEL flag indicates the gsi input is for a level
+triggered interrupt. In this case a new irqchip input is allocated
+which is logically OR'd with other inputs, allowing multiple sources
+to independently assert level interrupts. The KVM_IRQFD_FLAG_LEVEL
+is only necessary on setup, teardown is identical to that above. The
+return value when called with this flag is a key (>= 0) which may be
+used to associate this irqfd with other ioctls.
+KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
+
 4.76 KVM_PPC_ALLOCATE_HTAB

 Capability: KVM_CAP_PPC_ALLOC_HTAB

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59b5950..9ded39d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2170,6 +2170,7 @@ int kvm_dev_ioctl_check_extension(long ext)
	case KVM_CAP_GET_TSC_KHZ:
	case KVM_CAP_PCI_2_3:
	case KVM_CAP_KVMCLOCK_CTRL:
+	case KVM_CAP_IRQFD_LEVEL:
		r = 1;
		break;
	case KVM_CAP_COALESCED_MMIO:

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 2ce09aa..b2e6e4f 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
+#define KVM_CAP_IRQFD_LEVEL 81

 #ifdef KVM_CAP_IRQ_ROUTING
@@ -683,6 +684,8 @@ struct kvm_xen_hvm_config {
 #endif

 #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+/* Available with KVM_CAP_IRQFD_LEVEL */
+#define KVM_IRQFD_FLAG_LEVEL (1 << 1)

 struct kvm_irqfd {
	__u32 fd;

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b70b48b..c73f071 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -285,6 +285,10 @@ struct kvm {
		struct list_head items;
	} irqfds;
	struct list_head ioeventfds;
+	struct {
+		struct mutex lock;
+		struct list_head items;
+	} irqsources;
 #endif
	struct kvm_vm_stat stat;
	struct kvm_arch arch;

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 7d7e2aa..878cb52 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -36,6 +36,66 @@
 #include "iodev.h"

 /*
+ * An irq_source_id can be created from KVM_IRQFD for level interrupt
+ * injections and shared with other interfaces for EOI or de-assert.
+ * Create an object with reference counting to make it easy to use.
+ */
+struct _irq_source {
+	int id; /* the IRQ source ID */
+	int gsi;
+	struct kvm *kvm;
+	struct list_head list;
+	struct kref kref;
+};
+
+static void _irq_source_release(struct kref *kref)
+{
+	struct _irq_source *source =
+		container_of(kref, struct _irq_source, kref);
+
+	/* This also de-asserts */
+	kvm_free_irq_source_id(source->kvm, source->id);
+	list_del(&source->list);
+	kfree(source);
+}
+
+static void _irq_source_put(struct _irq_source *source)
+{
+	if (source) {
+		mutex_lock(&source->kvm->irqsources.lock);
+		kref_put(&source->kref, _irq_source_release);
+		mutex_unlock(&source->kvm->irqsources.lock);
+	}
+}
+
+static struct _irq_source *_irq_source_alloc(struct kvm *kvm, int gsi)
+{
+	struct _irq_source *source;
+	int id;
+
+	source = kzalloc(sizeof(*source), GFP_KERNEL);
+	if
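For context, userspace would drive the new flag roughly as sketched below. The struct mirrors the kvm_irqfd layout from this patch (normally it comes from <linux/kvm.h>); the helper name make_level_irqfd is illustrative, and the actual ioctl call is shown only in a comment because it needs a live KVM VM file descriptor.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the proposed kvm_irqfd layout (see the kvm.h hunk
 * above): 3 x __u32 plus 20 pad bytes, 32 bytes total. */
struct kvm_irqfd {
    uint32_t fd;      /* eventfd to trigger the interrupt */
    uint32_t gsi;     /* irqchip pin */
    uint32_t flags;
    uint8_t  pad[20];
};

#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
#define KVM_IRQFD_FLAG_LEVEL    (1 << 1)  /* proposed in this patch */

/* Fill in the argument for a level-triggered irqfd. The real call
 * would then be:
 *     int key = ioctl(vm_fd, KVM_IRQFD, &irqfd);
 * where a return value >= 0 is the source-ID key usable with
 * follow-on ioctls such as KVM_EOIFD. */
static struct kvm_irqfd make_level_irqfd(int efd, uint32_t gsi)
{
    struct kvm_irqfd irqfd = {
        .fd    = (uint32_t)efd,
        .gsi   = gsi,
        .flags = KVM_IRQFD_FLAG_LEVEL,
    };
    return irqfd;
}
```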
[PATCH v6 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
This new ioctl enables an eventfd to be triggered when an EOI is written for a specified irqchip pin. The first user of this will be external device assignment through VFIO, using a level irqfd for asserting a PCI INTx interrupt and this interface for de-assert and notification once the interrupt is serviced. Here we make use of the reference counting of the _irq_source object, allowing us to share it with an irqfd and clean up regardless of the release order.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
Documentation/virtual/kvm/api.txt | 21 ++
arch/x86/kvm/x86.c | 2
include/linux/kvm.h | 15 ++
include/linux/kvm_host.h | 13 +
virt/kvm/eventfd.c | 335 +
virt/kvm/kvm_main.c | 11 +
6 files changed, 397 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 3911e62..8cd6b36 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1989,6 +1989,27 @@
+4.77 KVM_EOIFD
+
+Capability: KVM_CAP_EOIFD
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_eoifd (in)
+Returns: 0 on success, < 0 on error
+
+KVM_EOIFD allows userspace to receive interrupt EOI notification
+through an eventfd. kvm_eoifd.fd specifies the eventfd used for
+notification. KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
+once assigned. KVM_EOIFD also requires additional bits set in
+kvm_eoifd.flags to bind to the proper interrupt line. The
+KVM_EOIFD_FLAG_LEVEL_IRQFD flag indicates that kvm_eoifd.key is provided
+and is a key from a level triggered interrupt (configured from
+KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL). The EOI notification is bound
+to the same GSI and irqchip input as the irqfd. Both kvm_eoifd.key
+and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and
+de-assignment of KVM_EOIFD.
+A level irqfd may only be bound to a single eoifd.
+KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of
+KVM_EOIFD_FLAG_LEVEL_IRQFD.

 5. The kvm_run structure

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9ded39d..8f3164e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2171,6 +2171,8 @@ int kvm_dev_ioctl_check_extension(long ext)
	case KVM_CAP_PCI_2_3:
	case KVM_CAP_KVMCLOCK_CTRL:
	case KVM_CAP_IRQFD_LEVEL:
+	case KVM_CAP_EOIFD:
+	case KVM_CAP_EOIFD_LEVEL_IRQFD:
		r = 1;
		break;
	case KVM_CAP_COALESCED_MMIO:

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index b2e6e4f..effb916 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -619,6 +619,8 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
 #define KVM_CAP_IRQFD_LEVEL 81
+#define KVM_CAP_EOIFD 82
+#define KVM_CAP_EOIFD_LEVEL_IRQFD 83

 #ifdef KVM_CAP_IRQ_ROUTING
@@ -694,6 +696,17 @@ struct kvm_irqfd {
	__u8 pad[20];
 };

+#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
+/* Available with KVM_CAP_EOIFD_LEVEL_IRQFD */
+#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
+
+struct kvm_eoifd {
+	__u32 fd;
+	__u32 flags;
+	__u32 key;
+	__u8 pad[20];
+};
+
 struct kvm_clock_data {
	__u64 clock;
	__u32 flags;
@@ -834,6 +847,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_PPC_GET_SMMU_INFO _IOR(KVMIO, 0xa6, struct kvm_ppc_smmu_info)
 /* Available with KVM_CAP_PPC_ALLOC_HTAB */
 #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
+/* Available with KVM_CAP_EOIFD */
+#define KVM_EOIFD _IOW(KVMIO, 0xa8, struct kvm_eoifd)

 /*
  * ioctls for vcpu fds

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c73f071..01e72a6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -289,6 +289,10 @@ struct kvm {
		struct mutex lock;
		struct list_head items;
	} irqsources;
+	struct {
+		spinlock_t lock;
+		struct list_head items;
+	} eoifds;
 #endif
	struct kvm_vm_stat stat;
	struct kvm_arch arch;
@@ -832,6 +836,8 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
 void kvm_irqfd_release(struct kvm *kvm);
 void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
 int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
+int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args);
+void kvm_eoifd_release(struct kvm *kvm);

 #else
@@ -857,6 +863,13 @@
 static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 {
	return -ENOSYS;
 }

+static inline int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
+{
+	return -ENOSYS;
+}
+
+static inline void
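Userspace binding of an eoifd to a level irqfd would look roughly like the sketch below. The struct mirrors the proposed kvm_eoifd layout; the helper names are illustrative, and the real calls (commented) need a live KVM VM file descriptor.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of the proposed kvm_eoifd layout (see the kvm.h hunk
 * above): 3 x __u32 plus 20 pad bytes, 32 bytes total. */
struct kvm_eoifd {
    uint32_t fd;      /* eventfd signalled on EOI */
    uint32_t flags;
    uint32_t key;     /* key returned by KVM_IRQFD with FLAG_LEVEL */
    uint8_t  pad[20];
};

#define KVM_EOIFD_FLAG_DEASSIGN    (1 << 0)
#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)

/* Bind an eventfd to EOIs for the level irqfd identified by `key`.
 * The real call would be: ioctl(vm_fd, KVM_EOIFD, &eoifd); */
static struct kvm_eoifd make_eoifd(int efd, uint32_t key)
{
    struct kvm_eoifd eoifd = {
        .fd    = (uint32_t)efd,
        .flags = KVM_EOIFD_FLAG_LEVEL_IRQFD,
        .key   = key,
    };
    return eoifd;
}

/* Per the documentation above, de-assignment must repeat both the
 * key and KVM_EOIFD_FLAG_LEVEL_IRQFD alongside the DEASSIGN flag. */
static struct kvm_eoifd make_eoifd_deassign(int efd, uint32_t key)
{
    struct kvm_eoifd eoifd = make_eoifd(efd, key);
    eoifd.flags |= KVM_EOIFD_FLAG_DEASSIGN;
    return eoifd;
}
```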
[PATCH 2/2] kvm: kvmclock: eliminate kvmclock offset when time page count goes to zero
When a guest is migrated, a time offset is generated in order to maintain the correct kvmclock based time for the guest. Detect when all kvmclock time pages are deleted so that the kvmclock offset can be safely reset to zero.

Cc: Glauber Costa glom...@redhat.com
Cc: Zachary Amsden zams...@gmail.com
Signed-off-by: Bruce Rogers brog...@suse.com
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 5 -
2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index db7c1f2..112415c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -524,6 +524,7 @@ struct kvm_arch {
	unsigned long irq_sources_bitmap;
	s64 kvmclock_offset;
+	unsigned int n_time_pages;
	raw_spinlock_t tsc_write_lock;
	u64 last_tsc_nsec;
	u64 last_tsc_write;

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 14c290d..350c51b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1511,6 +1511,8 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
	if (vcpu->arch.time_page) {
		kvm_release_page_dirty(vcpu->arch.time_page);
		vcpu->arch.time_page = NULL;
+		if (--vcpu->kvm->arch.n_time_pages == 0)
+			vcpu->kvm->arch.kvmclock_offset = 0;
	}
 }
@@ -1624,7 +1626,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
		if (is_error_page(vcpu->arch.time_page)) {
			kvm_release_page_clean(vcpu->arch.time_page);
			vcpu->arch.time_page = NULL;
-		}
+		} else
+			vcpu->kvm->arch.n_time_pages++;
		break;
	}
	case MSR_KVM_ASYNC_PF_EN:
--
1.7.7
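The counting logic of this patch can be sketched in isolation as below. Names are illustrative (the real code lives in kvmclock_reset() and kvm_set_msr_common() as shown in the diff): the offset is cleared only when the last registered time page goes away.

```c
#include <assert.h>

/* Sketch of the patch's state: count active kvmclock time pages and
 * clear the migration-generated offset once the last one is gone. */
struct kvm_clock_state {
    unsigned int n_time_pages;
    long long    kvmclock_offset;   /* ns offset carried over migration */
};

/* Mirrors the "else vcpu->kvm->arch.n_time_pages++" path. */
static void time_page_added(struct kvm_clock_state *s)
{
    s->n_time_pages++;
}

/* Mirrors the kvmclock_reset() path: when no vcpu still has a time
 * page, the offset is stale migration state and can be zeroed. */
static void time_page_removed(struct kvm_clock_state *s)
{
    if (--s->n_time_pages == 0)
        s->kvmclock_offset = 0;
}
```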
[PATCH 0/2] kvm: kvmclock: fix kvmclock reboot after migrate issues
When a linux guest live migrates to a new host and subsequently reboots, the guest no longer has the correct time. This is due to a failure to apply the kvmclock offset to the wall clock time. The first patch addresses this failure directly, while the second patch detects when the offset is no longer needed, and zeroes the offset as a matter of cleaning up migration state which is no longer relevant. Both patches address the issue, but in different ways.

Bruce Rogers (2):
 kvm: kvmclock: apply kvmclock offset to guest wall clock time
 kvm: kvmclock: eliminate kvmclock offset when time page count goes to zero

arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 9 -
2 files changed, 9 insertions(+), 1 deletions(-)
--
1.7.7
[PATCH 1/2] kvm: kvmclock: apply kvmclock offset to guest wall clock time
When a guest migrates to a new host, the system time difference from the previous host is used in the updates to the kvmclock system time visible to the guest, resulting in a continuation of correct kvmclock based guest timekeeping. The wall clock component of the kvmclock provided time is currently not updated with this same time offset. Since the Linux guest caches the wall clock based time, this discrepancy is not noticed until the guest is rebooted. After reboot the guest's time calculations are off. This patch adjusts the wall clock by the kvmclock_offset, resulting in correct guest time after a reboot.

Cc: Glauber Costa glom...@redhat.com
Cc: Zachary Amsden zams...@gmail.com
Signed-off-by: Bruce Rogers brog...@suse.com
---
arch/x86/kvm/x86.c | 4
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index be6d549..14c290d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -907,6 +907,10 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
	 */
	getboottime(&boot);
+	if (kvm->arch.kvmclock_offset) {
+		struct timespec ts = ns_to_timespec(kvm->arch.kvmclock_offset);
+		boot = timespec_sub(boot, ts);
+	}
	wc.sec = boot.tv_sec;
	wc.nsec = boot.tv_nsec;
	wc.version = version;
--
1.7.7
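The adjustment above is plain timespec arithmetic: boot time minus the kvmclock offset converted from nanoseconds. A standalone sketch (local analogues of the kernel's ns_to_timespec()/timespec_sub(); struct and function names are mine):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

struct ts { int64_t tv_sec; int64_t tv_nsec; };

/* Analogue of ns_to_timespec(): normalize so 0 <= tv_nsec < 1e9. */
static struct ts ns_to_ts(int64_t ns)
{
    struct ts t = { ns / NSEC_PER_SEC, ns % NSEC_PER_SEC };
    if (t.tv_nsec < 0) { t.tv_sec -= 1; t.tv_nsec += NSEC_PER_SEC; }
    return t;
}

/* Analogue of timespec_sub() with the same normalization. */
static struct ts ts_sub(struct ts a, struct ts b)
{
    struct ts r = { a.tv_sec - b.tv_sec, a.tv_nsec - b.tv_nsec };
    if (r.tv_nsec < 0) { r.tv_sec -= 1; r.tv_nsec += NSEC_PER_SEC; }
    return r;
}

/* The patch's adjustment: boot -= ns_to_timespec(kvmclock_offset),
 * applied only when an offset exists. */
static struct ts adjust_boot(struct ts boot, int64_t kvmclock_offset_ns)
{
    if (kvmclock_offset_ns)
        boot = ts_sub(boot, ns_to_ts(kvmclock_offset_ns));
    return boot;
}
```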
Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler
On Wed, Jul 18, 2012 at 07:07:17PM +0530, Raghavendra K T wrote:

Currently the Pause Loop Exit (PLE) handler does a directed yield to a random vcpu on PL-exit. We already have filtering while choosing the candidate to yield_to. This change adds more checks while choosing a candidate to yield_to. On large vcpu guests, there is a high probability of yielding to the same vcpu that had recently done a pause-loop exit. Such a yield can lead to the vcpu spinning again. The patchset keeps track of the pause loop exit and gives a chance to a vcpu which has: (a) not done a pause loop exit at all (it is probably a preempted lock-holder); (b) been skipped in the last iteration because it did a pause loop exit, and has probably become eligible now (the next eligible lock holder). This concept also helps in cpu relax interception cases which use the same handler.

Changes since V4:
- Naming change (Avi): struct ple == struct spin_loop; cpu_relax_intercepted == in_spin_loop; vcpu_check_and_update_eligible == vcpu_eligible_for_directed_yield
- Mark vcpu in spinloop as not eligible to avoid influence of the previous exit

Changes since V3:
- Arch specific fixes/changes (Christian)

Changes since v2:
- Move ple structure to common code (Avi)
- Rename pause_loop_exited to cpu_relax_intercepted (Avi)
- Add config HAVE_KVM_CPU_RELAX_INTERCEPT (Avi)
- Drop superfluous curly braces (Ingo)

Changes since v1:
- Add more documentation for structure and algorithm, and rename plo == ple (Rik)
- Change dy_eligible initial value to false (otherwise the very first directed yield will not be skipped) (Nikunj)
- Fix up signoff/from issue

Future enhancements:
(1) Currently we have a boolean to decide on the eligibility of a vcpu. It would be nice to get feedback on a guest (32 vcpu) on whether we can improve further with an integer counter (with counter = say f(log n)).
(2) We have not considered system load during iteration of vcpus. With that information we can limit the scan and also decide whether schedule() is better.
[I am able to use #kicked vcpus to decide on this, but maybe there are better ideas, like information from the global loadavg.]
(3) We can exploit this further with PV patches, since they also know about the next eligible lock-holder.

Summary: There is a very good improvement for kvm based guests on PLE machines. V5 has a huge improvement for kernbench.

kernbench (time in sec, lower is better):
      base_rik   stdev    patched   stdev    %improve
1x    49.2300    1.0171   22.6842   0.3073   117.0233 %
2x    91.9358    1.7768   53.9608   1.0154   70.37516 %

ebizzy (records/sec, higher is better):
      base_rik    stdev     patched     stdev      %improve
1x    1129.2500   28.6793   2125.6250   32.8239    88.23334 %
2x    1892.3750   75.1112   2377.1250   181.6822   25.61596 %

Note: The patches are tested on x86.

Links:
V4: https://lkml.org/lkml/2012/7/16/80
V3: https://lkml.org/lkml/2012/7/12/437
V2: https://lkml.org/lkml/2012/7/10/392
V1: https://lkml.org/lkml/2012/7/9/32

Raghavendra K T (3):
 config: Add config to support ple or cpu relax optimzation
 kvm : Note down when cpu relax intercepted or pause loop exited
 kvm : Choose a better candidate for directed yield
---
arch/s390/kvm/Kconfig | 1 +
arch/x86/kvm/Kconfig | 1 +
include/linux/kvm_host.h | 39 +++
virt/kvm/Kconfig | 3 +++
virt/kvm/kvm_main.c | 41 +
5 files changed, 85 insertions(+), 0 deletions(-)

Reviewed-by: Marcelo Tosatti mtosa...@redhat.com
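The eligibility rule described in the cover letter, using the field names from its changelog (in_spin_loop, dy_eligible), can be sketched as below. This is an illustration of the stated algorithm, not the patchset's exact code.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-vcpu spin-loop state, per the cover letter's naming. */
struct vcpu_spin_loop {
    bool in_spin_loop;  /* cpu_relax intercepted / pause-loop exited */
    bool dy_eligible;   /* was skipped last round, give it a chance now */
};

/* A vcpu is a good directed-yield target if it (a) never pause-loop
 * exited (probably a preempted lock-holder), or (b) did, but was
 * already skipped once and may now be the next eligible lock holder.
 * A spinning vcpu's eligibility is toggled each round so it is not
 * skipped forever. */
static bool eligible_for_directed_yield(struct vcpu_spin_loop *v)
{
    bool eligible = !v->in_spin_loop ||
                    (v->in_spin_loop && v->dy_eligible);

    if (v->in_spin_loop)
        v->dy_eligible = !v->dy_eligible;

    return eligible;
}
```

A vcpu that just pause-loop exited is skipped on the first pass and becomes eligible on the next, matching cases (a) and (b) above.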
Re: [RFC-v3 4/4] tcm_vhost: Initial merge for vhost level target fabric driver
On Fri, 2012-07-20 at 15:03 +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2012 at 02:20:58PM -0700, Nicholas A. Bellinger wrote:
> > On Wed, 2012-07-18 at 19:09 +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jul 18, 2012 at 12:59:32AM +, Nicholas A. Bellinger wrote:
> > > >
> > > > <SNIP>
> > > >
> > > > Changelog v2 -> v3:
> > > >   Unlock on error in tcm_vhost_drop_nexus() (DanC)
> > > >   Fix strlen() doesn't count the terminator (DanC)
> > > >   Call kfree() on an error path (DanC)
> > > >   Convert tcm_vhost_write_pending to use target_execute_cmd (hch + nab)
> > > >   Fix another strlen() off by one in tcm_vhost_make_tport (DanC)
> > > >   Add option under drivers/staging/Kconfig, and move to
> > > >   drivers/vhost/tcm/ as requested by MST (nab)
> > > >
> > > >  drivers/staging/Kconfig       |    2 +
> > > >  drivers/vhost/Makefile        |    2 +
> > > >  drivers/vhost/tcm/Kconfig     |    6 +
> > > >  drivers/vhost/tcm/Makefile    |    1 +
> > > >  drivers/vhost/tcm/tcm_vhost.c | 1611 +
> > > >  drivers/vhost/tcm/tcm_vhost.h |   74 ++
> > > >  6 files changed, 1696 insertions(+), 0 deletions(-)
> > > >  create mode 100644 drivers/vhost/tcm/Kconfig
> > > >  create mode 100644 drivers/vhost/tcm/Makefile
> > > >  create mode 100644 drivers/vhost/tcm/tcm_vhost.c
> > > >  create mode 100644 drivers/vhost/tcm/tcm_vhost.h
> > >
> > > Really sorry about making you run around like that, I did not mean
> > > moving all of tcm to a directory, just adding tcm/Kconfig or adding
> > > drivers/vhost/Kconfig.tcm because eventually it's easier to keep it
> > > all together in one place.
> >
> > Er, apologies for the slight mis-understanding here..  Moving back now
> > + fixing up the Kbuild bits.
>
> I'm going offline in several hours and am on vacation for a week
> starting tomorrow. So to make 3.6, and if you intend to merge through
> my tree, the best bet is if you can send the final version real soon
> now.

Ok, thanks for the heads up here..

So aside from Greg-KH's feedback to avoid the drivers/staging/ Kconfig
include usage, and one more bugfix from DanC from this morning, those
are the only pending changes for RFC-v4.

If it's OK I'd prefer to take these via target-pending with the
necessary Acked-By's, especially if you'll be AFK next week..

Would you like to see a RFC-v4 with these changes included..?

Thank you,

--nab
Re: [RFC-v3 4/4] tcm_vhost: Initial merge for vhost level target fabric driver
On Fri, 2012-07-20 at 11:00 -0700, Nicholas A. Bellinger wrote:
> On Fri, 2012-07-20 at 15:03 +0300, Michael S. Tsirkin wrote:
> > On Wed, Jul 18, 2012 at 02:20:58PM -0700, Nicholas A. Bellinger wrote:
> > > On Wed, 2012-07-18 at 19:09 +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 18, 2012 at 12:59:32AM +, Nicholas A. Bellinger wrote:
> > > > >
> > > > > <SNIP>
> > > > >
> > > > > Changelog v2 -> v3:
> > > > >   Unlock on error in tcm_vhost_drop_nexus() (DanC)
> > > > >   Fix strlen() doesn't count the terminator (DanC)
> > > > >   Call kfree() on an error path (DanC)
> > > > >   Convert tcm_vhost_write_pending to use target_execute_cmd (hch + nab)
> > > > >   Fix another strlen() off by one in tcm_vhost_make_tport (DanC)
> > > > >   Add option under drivers/staging/Kconfig, and move to
> > > > >   drivers/vhost/tcm/ as requested by MST (nab)
> > > > >
> > > > >  drivers/staging/Kconfig       |    2 +
> > > > >  drivers/vhost/Makefile        |    2 +
> > > > >  drivers/vhost/tcm/Kconfig     |    6 +
> > > > >  drivers/vhost/tcm/Makefile    |    1 +
> > > > >  drivers/vhost/tcm/tcm_vhost.c | 1611 +
> > > > >  drivers/vhost/tcm/tcm_vhost.h |   74 ++
> > > > >  6 files changed, 1696 insertions(+), 0 deletions(-)
> > > > >  create mode 100644 drivers/vhost/tcm/Kconfig
> > > > >  create mode 100644 drivers/vhost/tcm/Makefile
> > > > >  create mode 100644 drivers/vhost/tcm/tcm_vhost.c
> > > > >  create mode 100644 drivers/vhost/tcm/tcm_vhost.h
> > > >
> > > > Really sorry about making you run around like that, I did not mean
> > > > moving all of tcm to a directory, just adding tcm/Kconfig or adding
> > > > drivers/vhost/Kconfig.tcm because eventually it's easier to keep it
> > > > all together in one place.
> > >
> > > Er, apologies for the slight mis-understanding here..  Moving back
> > > now + fixing up the Kbuild bits.
> >
> > I'm going offline in several hours and am on vacation for a week
> > starting tomorrow. So to make 3.6, and if you intend to merge through
> > my tree, the best bet is if you can send the final version real soon
> > now.
>
> Ok, thanks for the heads up here..
>
> So aside from Greg-KH's feedback to avoid the drivers/staging/ Kconfig
> include usage, and one more bugfix from DanC from this morning, those
> are the only pending changes for RFC-v4.
>
> If it's OK I'd prefer to take these via target-pending with the
> necessary Acked-By's, especially if you'll be AFK next week..
>
> Would you like to see a RFC-v4 with these changes included..?
>
> Thank you,

Actually sorry, the patch from DanC is for target core, and not a
tcm_vhost specific change..

So really the only thing left to resolve for an initial merge is
Greg-KH's comments wrt to drivers/staging Kconfig usage..

Are you OK with just adding CONFIG_STAGING following Greg-KH's
feedback..?
Re: [PATCHv2] kvm: fix race with level interrupts
On Thu, Jul 19, 2012 at 01:45:20PM +0300, Michael S. Tsirkin wrote:
> When more than 1 source id is in use for the same GSI, we have the
> following race related to handling of irq_states:
>
> CPU 0 clears bit 0. CPU 0 reads irq_state as 0.
> CPU 1 sets level to 1. CPU 1 calls kvm_ioapic_set_irq(1).
> CPU 0 calls kvm_ioapic_set_irq(0).
> Now ioapic thinks the level is 0 but irq_state is not 0.
>
> Fix by performing all irq_states bitmap handling under the pic/ioapic
> lock. This also removes the need for atomics with irq_states handling.
>
> Reported-by: Gleb Natapov g...@redhat.com
> Signed-off-by: Michael S. Tsirkin m...@redhat.com
> ---

Applied, thanks.

> Changes from v1:
> 	Address comments by Gleb and Alex:
> 	renamed some variables for clarity
> 	renamed kvm_irq_line_state -> __kvm_irq_line_state
>
> Any chance we can put this in 3.5? I know level IRQs are not widely
> used, which is likely why this went unnoticed for so long, but still ...

http://yarchive.net/comp/linux/merge_window.html

From: Linus Torvalds torva...@linux-foundation.org

The thing is, I don't take bug fixes late in the -rc just because they
are bug fixes. And I really shouldn't. If it's an old bug, and doesn't
cause an oops or a security issue, it had damn well better wait for the
next merge window.

There is absolutely _no_ reason to just blindly fix bugs at the end of
the rc stage, because quite frankly, the risks coming from fixing a bug
is often bigger than the advantage.
Re: [PATCH] kvm: drop parameter validation
On Thu, Jul 19, 2012 at 02:13:13PM +0300, Michael S. Tsirkin wrote:
> We validate irq pin number when routing is setup, so code handling
> illegal irq # in pic and ioapic on each injection is never called.
> Drop it.
>
> Signed-off-by: Michael S. Tsirkin m...@redhat.com
> ---
>
> Note: this is on top of [PATCHv2] kvm: fix race with level interrupts
> as these patches touch the same code.

"kvm: fix race with level interrupts" has been applied to next (the
branch which contains the next merge window group), which is in freeze
mode (only critical fixes are accepted). This patch does not fall into
that category, please resend once Linus pulls the dependency.
Re: [PATCH RESEND 0/5] Add vhost-blk support
On Fri, Jul 13, 2012 at 04:55:06PM +0800, Asias He wrote:
> Hi folks,
>
> [I am resending to fix the broken thread in the previous one.]
>
> This patchset adds vhost-blk support. vhost-blk is an in-kernel
> virtio-blk device accelerator. Compared to the userspace virtio-blk
> implementation, vhost-blk gives about 5% to 15% performance
> improvement.
>
> Asias He (5):
>   aio: Export symbols and struct kiocb_batch for in kernel aio usage
>   eventfd: Export symbol eventfd_file_create()
>   vhost: Make vhost a separate module
>   vhost-net: Use VHOST_NET_FEATURES for vhost-net
>   vhost-blk: Add vhost-blk support

OK so given the state it's in, and assuming you think it is helpful to
let it mature in tree and not out of tree, I think it's reasonable to
try to do it like tcm_vhost is going to do it:
- send me changes to vhost core ASAP (and keep it minimal, e.g. use
  your own header file to export to userspace)
- for other stuff - put in drivers/staging, and ask Greg to merge

-- 
MST
Re: [PATCH 2/9] KVM: x86: simplify read_emulated
On Fri, Jul 20, 2012 at 09:15:44PM +0800, Xiao Guangrong wrote:
> On 07/20/2012 06:58 PM, Marcelo Tosatti wrote:
> > On Fri, Jul 20, 2012 at 10:17:36AM +0800, Xiao Guangrong wrote:
> > > On 07/20/2012 07:58 AM, Marcelo Tosatti wrote:
> > > > > -}
> > > > > +	rc = ctxt->ops->read_emulated(ctxt, addr, mc->data + mc->end, size,
> > > > > +				      &ctxt->exception);
> > > > > +	if (rc != X86EMUL_CONTINUE)
> > > > > +		return rc;
> > > > > +
> > > > > +	mc->end += size;
> > > > > +
> > > > > +read_cached:
> > > > > +	memcpy(dest, mc->data + mc->pos, size);
> > > >
> > > > What prevents read_emulated(size > 8) call, with
> > > > mc->pos == (mc->end - 8) now?
> > >
> > > Marcelo,
> > >
> > > The splitting has been done in emulator_read_write_onepage:
> > >
> > > 	while (bytes) {
> > > 		unsigned now = min(bytes, 8U);
> > >
> > > 		frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
> > > 		frag->gpa = gpa;
> > > 		frag->data = val;
> > > 		frag->len = now;
> > > 		frag->write_readonly_mem = (ret == -EPERM);
> > >
> > > 		gpa += now;
> > > 		val += now;
> > > 		bytes -= now;
> > > 	}
> > >
> > > So i think it is safe to remove the splitting in read_emulated.
> >
> > Yes, it is fine to remove it. But splitting in emulate.c prevented
> > the case of _cache read_ with size > 8 beyond end of mc->data. Must
> > handle that case in read_emulated.
> >
> > What prevents read_emulated(size > 8) call, with
> > mc->pos == (mc->end - 8) now?
>
> You mean the mmio region is partly cached? I think it can not happen.
>
> Now, we pass the whole size to emulator_read_write_onepage(), after it
> is finished, it saves the whole data into mc->data[], so, the
> cache-read can always get the whole data from mc->data[].

I mean that nothing prevents a caller from reading beyond the end of
the mc->data array (but then again, this was the previous behavior).

ACK
[PATCH 1/2] kvm: kvmclock: apply kvmclock offset to guest wall clock time
When a guest migrates to a new host, the system time difference from the
previous host is used in the updates to the kvmclock system time visible
to the guest, resulting in a continuation of correct kvmclock based
guest timekeeping.

The wall clock component of the kvmclock provided time is currently not
updated with this same time offset. Since the Linux guest caches the
wall clock based time, this discrepancy is not noticed until the guest
is rebooted. After reboot the guest's time calculations are off.

This patch adjusts the wall clock by the kvmclock_offset, resulting in
correct guest time after a reboot.

Cc: Glauber Costa glom...@redhat.com
Cc: Zachary Amsden zams...@redhat.com
Signed-off-by: Bruce Rogers brog...@suse.com
---
 arch/x86/kvm/x86.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index be6d549..14c290d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -907,6 +907,10 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
 	 */
 	getboottime(&boot);
 
+	if (kvm->arch.kvmclock_offset) {
+		struct timespec ts = ns_to_timespec(kvm->arch.kvmclock_offset);
+		boot = timespec_sub(boot, ts);
+	}
 	wc.sec = boot.tv_sec;
 	wc.nsec = boot.tv_nsec;
 	wc.version = version;
-- 
1.7.7
[PATCH 2/2] kvm: kvmclock: eliminate kvmclock offset when time page count goes to zero
When a guest is migrated, a time offset is generated in order to
maintain the correct kvmclock based time for the guest. Detect when all
kvmclock time pages are deleted so that the kvmclock offset can be
safely reset to zero.

Cc: Glauber Costa glom...@redhat.com
Cc: Zachary Amsden zams...@redhat.com
Signed-off-by: Bruce Rogers brog...@suse.com
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/x86.c              |    5 ++++-
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index db7c1f2..112415c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -524,6 +524,7 @@ struct kvm_arch {
 	unsigned long irq_sources_bitmap;
 	s64 kvmclock_offset;
+	unsigned int n_time_pages;
 	raw_spinlock_t tsc_write_lock;
 	u64 last_tsc_nsec;
 	u64 last_tsc_write;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 14c290d..350c51b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1511,6 +1511,8 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.time_page) {
 		kvm_release_page_dirty(vcpu->arch.time_page);
 		vcpu->arch.time_page = NULL;
+		if (--vcpu->kvm->arch.n_time_pages == 0)
+			vcpu->kvm->arch.kvmclock_offset = 0;
 	}
 }
 
@@ -1624,7 +1626,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		if (is_error_page(vcpu->arch.time_page)) {
 			kvm_release_page_clean(vcpu->arch.time_page);
 			vcpu->arch.time_page = NULL;
-		}
+		} else
+			vcpu->kvm->arch.n_time_pages++;
 		break;
 	}
 	case MSR_KVM_ASYNC_PF_EN:
-- 
1.7.7
[PATCH 0/2] kvm: kvmclock: fix kvmclock reboot after migrate issues
When a linux guest live migrates to a new host and subsequently reboots,
the guest no longer has the correct time. This is due to a failure to
apply the kvmclock offset to the wall clock time.

The first patch addresses this failure directly, while the second patch
detects when the offset is no longer needed, and zeroes the offset as a
matter of cleaning up migration state which is no longer relevant. Both
patches address the issue, but in different ways.

Bruce Rogers (2):
  kvm: kvmclock: apply kvmclock offset to guest wall clock time
  kvm: kvmclock: eliminate kvmclock offset when time page count goes to
    zero

 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/x86.c              |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletions(-)

-- 
1.7.7
Re: [PATCH RESEND 5/5] vhost-blk: Add vhost-blk support
Michael S. Tsirkin m...@redhat.com writes:
> On Thu, Jul 19, 2012 at 08:05:42AM -0500, Anthony Liguori wrote:
> > Of course, the million dollar question is why would using AIO in the
> > kernel be faster than using AIO in userspace?
>
> Actually for me a more important question is how does it compare with
> virtio-blk dataplane?
>
> -- 
> MST

I'm not even asking for a benchmark comparison.

It's the same API being called from a kernel thread vs. a userspace
thread. Why would there be a 60% performance difference between the
two? That doesn't make any sense.

There's got to be a better justification for putting this in the kernel
than just that we can.

I completely understand why Christoph's suggestion of submitting BIOs
directly would be faster. There's no way to do that in userspace.

Regards,

Anthony Liguori
Re: [PATCH 2/2 v5] KVM: PPC: booke: Add watchdog emulation
On 07/20/2012 12:00 AM, Bharat Bhushan wrote:
> This patch adds the watchdog emulation in KVM. The watchdog emulation
> is enabled by the KVM_ENABLE_CAP(KVM_CAP_PPC_WDT) ioctl. A kernel
> timer is used for watchdog emulation and emulates the h/w watchdog
> state machine. On watchdog timer expiry, it exits to QEMU if TCR.WRC
> is non-zero. QEMU can reset/shutdown etc. depending upon how it is
> configured.
>
> Signed-off-by: Liu Yu yu@freescale.com
> Signed-off-by: Scott Wood scottw...@freescale.com
> Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
> [bharat.bhus...@freescale.com: reworked patch]

Typically the [] note goes immediately before your signoff (but after
the others).

> +static void arm_next_watchdog(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long nr_jiffies;
> +
> +	spin_lock(&vcpu->arch.wdt_lock);
> +	nr_jiffies = watchdog_next_timeout(vcpu);
> +	/*
> +	 * If the number of jiffies of watchdog timer >= NEXT_TIMER_MAX_DELTA
> +	 * then do not run the watchdog timer as this can break timer APIs.
> +	 */
> +	if (nr_jiffies < NEXT_TIMER_MAX_DELTA)
> +		mod_timer(&vcpu->arch.wdt_timer, jiffies + nr_jiffies);
> +	else
> +		del_timer(&vcpu->arch.wdt_timer);
> +	spin_unlock(&vcpu->arch.wdt_lock);
> +}

This needs to be an irqsave lock.

> @@ -386,13 +387,23 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
>  #ifdef CONFIG_KVM_EXIT_TIMING
>  	mutex_init(&vcpu->arch.exit_timing_lock);
>  #endif
> -
> +#ifdef CONFIG_BOOKE
> +	spin_lock_init(&vcpu->arch.wdt_lock);
> +	/* setup watchdog timer once */
> +	setup_timer(&vcpu->arch.wdt_timer, kvmppc_watchdog_func,
> +		    (unsigned long)vcpu);
> +#endif
>  	return 0;
>  }

Can you do this in kvmppc_booke_init()?

>  void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>  {
>  	kvmppc_mmu_destroy(vcpu);
> +#ifdef CONFIG_BOOKE
> +	spin_lock(&vcpu->arch.wdt_lock);
> +	del_timer(&vcpu->arch.wdt_timer);
> +	spin_unlock(&vcpu->arch.wdt_lock);
> +#endif
>  }

Don't acquire the lock here, but use del_timer_sync().

-Scott
Re: [PATCH RESEND 5/5] vhost-blk: Add vhost-blk support
On 07/21/2012 04:56 AM, Anthony Liguori wrote:
> Michael S. Tsirkin m...@redhat.com writes:
> > On Thu, Jul 19, 2012 at 08:05:42AM -0500, Anthony Liguori wrote:
> > > Of course, the million dollar question is why would using AIO in
> > > the kernel be faster than using AIO in userspace?
> >
> > Actually for me a more important question is how does it compare
> > with virtio-blk dataplane?
>
> I'm not even asking for a benchmark comparison.
>
> It's the same API being called from a kernel thread vs. a userspace
> thread. Why would there be a 60% performance difference between the
> two? That doesn't make any sense.

Please read the commit log again. I am not saying vhost-blk vs. the
userspace implementation gives 60% improvement. I am saying this
vhost-blk vs. the original vhost-blk gives 60% improvement:

   This patch is based on Liu Yuan's implementation with various
   improvements and bug fixes. Notably, this patch makes guest notify
   and host completion processing in parallel, which gives about 60%
   performance improvement compared to Liu Yuan's implementation.

> There's got to be a better justification for putting this in the
> kernel than just that we can.
>
> I completely understand why Christoph's suggestion of submitting BIOs
> directly would be faster. There's no way to do that in userspace.

Well. With Zach and Dave's new in-kernel aio API, the aio usage in
kernel is much simpler than in userspace. This is a potential reason
the in-kernel one is better than the userspace one. I am working on it
right now. And for block based images, as suggested by Christoph, we
can submit bios directly. This is another potential reason.

Why can't we just go further to see if we can improve the IO stack from
the guest kernel side all the way down to the host kernel side? We can
not do that if we stick to doing everything in userspace (qemu).

-- 
Asias