Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
Hi,

> For (2), you cannot use bus=X,addr=Y because it makes assumptions
> about the PCI topology which may change in newer -M pc's.

Why should the PCI topology for 'pc' ever change?

We'll probably get q35 support some day, but when this lands I expect we'll see a new machine type 'q35', so '-M q35' will pick the ich9 chipset (which will have a different pci topology of course) and '-M pc' will pick the existing piix chipset (which will continue to look like it looks today).

cheers,
  Gerd

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:

    +        return;
    +    }
    +
    +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
    +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
    +                    blk_req->reqs[0].opaque);

Same here.

    +    bdrv_flush(bs);

This looks really strange. What is this supposed to do? One point is that you write it immediately after bdrv_aio_write, so you get an fsync for which you don't know whether it includes the current write request or not. Which data do you want to get flushed to the disk?

I was expecting to flush the aio request that was just initiated. Am I misunderstanding the function?

Seems so. The function names don't use really clear terminology either, so you're not the first one to fall into this trap. Basically we have:

* qemu_aio_flush() waits for all AIO requests to complete. I think you wanted to have exactly this, but only for a single block device. Such a function doesn't exist yet.

* bdrv_flush() makes sure that all successfully completed requests are written to disk (by calling fsync).

* bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. it runs the fsync in the thread pool.

Then what I wanted to do is call qemu_aio_flush first, then bdrv_flush. It should be like live migration.

Okay, that makes sense. :-)

The other thing is that you introduce a bdrv_flush for each request, basically forcing everyone into something very similar to writethrough mode. I'm sure this will have a big impact on performance.

The reason is to avoid inversion of queued requests. Although processing one-by-one is heavy, wouldn't having requests flushed to disk out of order break the disk image?

No, that's fine. If a guest issues two requests at the same time, they may complete in any order. You just need to make sure that you don't call the completion callback before the request really has completed.

We need to flush requests, meaning aio and fsync, before sending the final state of the guests, to make sure we can switch to the secondary safely.

In theory I think you could just re-submit the requests on the secondary if they had not completed yet. But you're right, let's keep things simple for the start.

I'm just starting to wonder if the guest won't time out the requests if they are queued for too long. Even more, with IDE, it can only handle one request at a time, so not completing requests doesn't sound like a good idea at all. In what intervals is the event-tap queue flushed?

The requests are flushed once each transaction completes. So it's not at specific intervals.

Right. So when is a transaction completed? This is the time that a single request will take.

On the other hand, if you complete before actually writing out, you don't get timeouts, but you signal success to the guest when the request could still fail. What would you do in this case? With a writeback cache mode we're fine, we can just fail the next flush (until then nothing is guaranteed to be on disk and order doesn't matter either), but with cache=writethrough we're in serious trouble. Have you thought about this problem? Maybe we end up having to flush the event-tap queue for each single write in writethrough mode.

Yes, and that's what I'm trying to do at this point.

Oh, I must have missed that code. Which patch/function should I look at?

I know that performance matters a lot, but sacrificing reliability for performance now isn't a good idea.

I first want to lay the ground, and then focus on optimization. Note that without dirty bitmap optimization, Kemari suffers a lot in sending RAM. Anthony and I discussed taking this approach at KVM Forum.

I agree, starting simple makes sense.

Kevin
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-19 20:32, Blue Swirl wrote:

On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori aligu...@linux.vnet.ibm.com wrote:

On 01/19/2011 07:15 AM, Markus Armbruster wrote:

So they interact with KVM (need kvm_state), and they interact with the emulated PCI bus. Could you elaborate on the fundamental difference between the two interactions that makes you choose the (hypothetical) KVM bus over the PCI bus as device parent?

It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instance, using a linear index).

I think kvm_state should not be a property of any device or bus. It should be split into more logical pieces. Some parts of it could remain in CPUState, because they are associated with a VCPU. Also, for example, irqfd could be considered a similar object to the char or block devices provided by QEMU to devices. Would it make sense to introduce new host types for passing parts of kvm_state to devices? I'd also make the coalesced MMIO stuff part of the memory object. We are not passing any state references when using cpu_physical_memory_rw(), but that could be changed.

There are currently no VCPU-specific bits remaining in kvm_state. It may be a good idea to introduce an arch-specific kvm_state and move related bits over. It may also once be feasible to carve out memory management related fields if we have proper abstractions for that, but I'm not completely sure here. Anyway, all these things are secondary. The primary topic here is how to deal with kvm_state and its fields that have VM-global scope.

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On Thu, Jan 20, 2011 at 09:44:05AM +0100, Gerd Hoffmann wrote:
> Hi,
>
>> For (2), you cannot use bus=X,addr=Y because it makes assumptions
>> about the PCI topology which may change in newer -M pc's.
>
> Why should the PCI topology for 'pc' ever change?
>
> We'll probably get q35 support some day, but when this lands I expect
> we'll see a new machine type 'q35', so '-M q35' will pick the ich9
> chipset (which will have a different pci topology of course) and
> '-M pc' will pick the existing piix chipset (which will continue to
> look like it looks today).

If the topology does ever change (eg in the kind of way Anthony suggests, where the first bus only has the graphics card), I think libvirt is going to need a little work to adapt to the new topology, regardless of whether we currently specify a bus= arg to -device or not. I'm not sure there's anything we could do to future-proof ourselves against that kind of change.

Regards,
Daniel
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
2011/1/20 Kevin Wolf kw...@redhat.com:
> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:

    +        return;
    +    }
    +
    +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
    +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
    +                    blk_req->reqs[0].opaque);

Same here.

    +    bdrv_flush(bs);

This looks really strange. What is this supposed to do? One point is that you write it immediately after bdrv_aio_write, so you get an fsync for which you don't know whether it includes the current write request or not. Which data do you want to get flushed to the disk?

I was expecting to flush the aio request that was just initiated. Am I misunderstanding the function?

Seems so. The function names don't use really clear terminology either, so you're not the first one to fall into this trap. Basically we have:

* qemu_aio_flush() waits for all AIO requests to complete. I think you wanted to have exactly this, but only for a single block device. Such a function doesn't exist yet.

* bdrv_flush() makes sure that all successfully completed requests are written to disk (by calling fsync).

* bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. it runs the fsync in the thread pool.

Then what I wanted to do is call qemu_aio_flush first, then bdrv_flush. It should be like live migration.

Okay, that makes sense. :-)

The other thing is that you introduce a bdrv_flush for each request, basically forcing everyone into something very similar to writethrough mode. I'm sure this will have a big impact on performance.

The reason is to avoid inversion of queued requests. Although processing one-by-one is heavy, wouldn't having requests flushed to disk out of order break the disk image?

No, that's fine. If a guest issues two requests at the same time, they may complete in any order. You just need to make sure that you don't call the completion callback before the request really has completed.

We need to flush requests, meaning aio and fsync, before sending the final state of the guests, to make sure we can switch to the secondary safely.

In theory I think you could just re-submit the requests on the secondary if they had not completed yet. But you're right, let's keep things simple for the start.

I'm just starting to wonder if the guest won't time out the requests if they are queued for too long. Even more, with IDE, it can only handle one request at a time, so not completing requests doesn't sound like a good idea at all. In what intervals is the event-tap queue flushed?

The requests are flushed once each transaction completes. So it's not at specific intervals.

Right. So when is a transaction completed? This is the time that a single request will take.

The transaction is completed when the vm state is sent to the secondary, and the primary receives the ack to it. Please let me know if the answer is too vague. What I can tell is that it can't be super fast.

On the other hand, if you complete before actually writing out, you don't get timeouts, but you signal success to the guest when the request could still fail. What would you do in this case? With a writeback cache mode we're fine, we can just fail the next flush (until then nothing is guaranteed to be on disk and order doesn't matter either), but with cache=writethrough we're in serious trouble. Have you thought about this problem? Maybe we end up having to flush the event-tap queue for each single write in writethrough mode.

Yes, and that's what I'm trying to do at this point.

Oh, I must have missed that code. Which patch/function should I look at?

Maybe I mis-answered your question. The device may receive timeouts. If timeouts didn't happen, the requests are flushed one-by-one in writethrough because we're calling qemu_aio_flush and bdrv_flush together.

Yoshi

I know that performance matters a lot, but sacrificing reliability for performance now isn't a good idea.

I first want to lay the ground, and then focus on optimization. Note that without dirty bitmap optimization, Kemari suffers a lot in sending RAM. Anthony and I discussed taking this approach at KVM Forum.

I agree, starting simple makes sense.

Kevin
Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock
On Wed, Jan 19, 2011 at 10:53:52AM -0800, Jeremy Fitzhardinge wrote:
>> The reason for wanting this should be clear I guess, it allows PI.
>
> Well, if we can expand the spinlock to include an owner, then all this
> becomes moot...

How so? Having an owner will not eliminate the need for pv-ticketlocks afaict. We still need a mechanism to reduce latency in scheduling the next thread-in-waiting, which is achieved by your patches.

- vatsa
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
> 2011/1/20 Kevin Wolf kw...@redhat.com:
>> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:

    +        return;
    +    }
    +
    +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
    +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
    +                    blk_req->reqs[0].opaque);

Same here.

    +    bdrv_flush(bs);

This looks really strange. What is this supposed to do? One point is that you write it immediately after bdrv_aio_write, so you get an fsync for which you don't know whether it includes the current write request or not. Which data do you want to get flushed to the disk?

I was expecting to flush the aio request that was just initiated. Am I misunderstanding the function?

Seems so. The function names don't use really clear terminology either, so you're not the first one to fall into this trap. Basically we have:

* qemu_aio_flush() waits for all AIO requests to complete. I think you wanted to have exactly this, but only for a single block device. Such a function doesn't exist yet.

* bdrv_flush() makes sure that all successfully completed requests are written to disk (by calling fsync).

* bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. it runs the fsync in the thread pool.

Then what I wanted to do is call qemu_aio_flush first, then bdrv_flush. It should be like live migration.

Okay, that makes sense. :-)

The other thing is that you introduce a bdrv_flush for each request, basically forcing everyone into something very similar to writethrough mode. I'm sure this will have a big impact on performance.

The reason is to avoid inversion of queued requests. Although processing one-by-one is heavy, wouldn't having requests flushed to disk out of order break the disk image?

No, that's fine. If a guest issues two requests at the same time, they may complete in any order. You just need to make sure that you don't call the completion callback before the request really has completed.

We need to flush requests, meaning aio and fsync, before sending the final state of the guests, to make sure we can switch to the secondary safely.

In theory I think you could just re-submit the requests on the secondary if they had not completed yet. But you're right, let's keep things simple for the start.

I'm just starting to wonder if the guest won't time out the requests if they are queued for too long. Even more, with IDE, it can only handle one request at a time, so not completing requests doesn't sound like a good idea at all. In what intervals is the event-tap queue flushed?

The requests are flushed once each transaction completes. So it's not at specific intervals.

Right. So when is a transaction completed? This is the time that a single request will take.

The transaction is completed when the vm state is sent to the secondary, and the primary receives the ack to it. Please let me know if the answer is too vague. What I can tell is that it can't be super fast.

On the other hand, if you complete before actually writing out, you don't get timeouts, but you signal success to the guest when the request could still fail. What would you do in this case? With a writeback cache mode we're fine, we can just fail the next flush (until then nothing is guaranteed to be on disk and order doesn't matter either), but with cache=writethrough we're in serious trouble. Have you thought about this problem? Maybe we end up having to flush the event-tap queue for each single write in writethrough mode.

Yes, and that's what I'm trying to do at this point.

Oh, I must have missed that code. Which patch/function should I look at?

Maybe I mis-answered your question. The device may receive timeouts.

We should pay attention that the guest does not see timeouts. I'm not expecting that I/O will be super fast, and as long as it is only a performance problem we can live with it.

However, as soon as the guest gets timeouts, it reports I/O errors and eventually offlines the block device. At this point it's not a performance problem any more, but a correctness problem. This is why I suggested that we flush the event-tap queue (i.e. complete the transaction) immediately after an I/O request has been issued, instead of waiting for other events that would complete the transaction.

If timeouts didn't happen, the requests are flushed one-by-one in writethrough because we're calling qemu_aio_flush and bdrv_flush together.

I think this is what we must do.

Kevin
EPT: Misconfiguration
I'm suddenly getting lots of the following errors on a server running 2.36.7, but I have no idea what it means:

2011-01-20T12:41:18.358603+01:00 phy005 kernel: EPT: Misconfiguration.
2011-01-20T12:41:18.358621+01:00 phy005 kernel: EPT: GPA: 0x3dbff6b0
2011-01-20T12:41:18.358624+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x50743e007 level 4
2011-01-20T12:41:18.358627+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x523de2007 level 3
2011-01-20T12:41:18.358629+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x62336f007 level 2
2011-01-20T12:41:18.360109+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1
2011-01-20T12:41:18.360137+01:00 phy005 kernel: ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
2011-01-20T12:41:18.360151+01:00 phy005 kernel: [ cut here ]
2011-01-20T12:41:18.360155+01:00 phy005 kernel: WARNING: at arch/x86/kvm/vmx.c:3425 handle_ept_misconfig+0x152/0x1d8 [kvm_intel]()
2011-01-20T12:41:18.360160+01:00 phy005 kernel: Hardware name: X8DTU
2011-01-20T12:41:18.363296+01:00 phy005 kernel: Modules linked in: tun ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm igb i2c_i801 iTCO_wdt i2c_core ioatdma joydev iTCO_vendor_support serio_raw dca 3w_9xxx [last unloaded: scsi_wait_scan]
2011-01-20T12:41:18.363312+01:00 phy005 kernel: Pid: 3595, comm: qemu-kvm Tainted: G D W 2.6.34.7-66.tilaa.fc13.x86_64 #1
2011-01-20T12:41:18.363314+01:00 phy005 kernel: Call Trace:
2011-01-20T12:41:18.364385+01:00 phy005 kernel: [8104d11f] warn_slowpath_common+0x7c/0x94
2011-01-20T12:41:18.364455+01:00 phy005 kernel: [8104d14b] warn_slowpath_null+0x14/0x16
2011-01-20T12:41:18.364462+01:00 phy005 kernel: [a00ba7fb] handle_ept_misconfig+0x152/0x1d8 [kvm_intel]
2011-01-20T12:41:18.364466+01:00 phy005 kernel: [a00bb401] vmx_handle_exit+0x204/0x23a [kvm_intel]
2011-01-20T12:41:18.370619+01:00 phy005 kernel: [a0075998] kvm_arch_vcpu_ioctl_run+0x7cd/0xa74 [kvm]
2011-01-20T12:41:18.370731+01:00 phy005 kernel: [a00645ba] kvm_vcpu_ioctl+0xfd/0x56e [kvm]
2011-01-20T12:41:18.370737+01:00 phy005 kernel: [8100a60e] ? apic_timer_interrupt+0xe/0x20
2011-01-20T12:41:18.370741+01:00 phy005 kernel: [8111aa2f] vfs_ioctl+0x32/0xa6
2011-01-20T12:41:18.371562+01:00 phy005 kernel: [8111afa2] do_vfs_ioctl+0x483/0x4c9
2011-01-20T12:41:18.371577+01:00 phy005 kernel: [8111b03e] sys_ioctl+0x56/0x79
2011-01-20T12:41:18.371581+01:00 phy005 kernel: [81009c72] system_call_fastpath+0x16/0x1b
2011-01-20T12:41:18.372244+01:00 phy005 kernel: ---[ end trace 7d57b311d4a5b22c ]---
2011-01-20T12:41:57.568322+01:00 phy005 kernel: general protection fault: [#2] SMP
2011-01-20T12:41:57.568335+01:00 phy005 kernel: last sysfs file: /sys/devices/system/cpu/cpu15/topology/thread_siblings
2011-01-20T12:41:57.568339+01:00 phy005 kernel: CPU 0

Kind regards,

Ruben
Re: EPT: Misconfiguration
On Thu, Jan 20, 2011 at 12:48, Ruben Kerkhof ru...@rubenkerkhof.com wrote:
> I'm suddenly getting lots of the following errors on a server running
> 2.36.7, but I have no idea what it means:

Sorry, that should be 2.34.7.

Kind regards,

Ruben
Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock
On Wed, Jan 19, 2011 at 10:53:52AM -0800, Jeremy Fitzhardinge wrote:
>> I didn't really read the patch, and I totally forgot everything from
>> when I looked at the Xen series, but does the Xen/KVM hypercall
>> interface for this include the vcpu to await the kick from? My guess
>> is not, since the ticket locks used don't know who the owner is,
>> which is of course, sad. There are FIFO spinlock implementations
>> that can do this though.. although I think they all have a bigger
>> memory footprint.
>
> At least in the Xen code, a current owner isn't very useful, because
> we need the current owner to kick the *next* owner to life at release
> time, which we can't do without some structure recording which ticket
> belongs to which cpu.

If we had a yield-to [1] sort of interface _and_ information on which vcpu owns a lock, then lock-spinners can yield-to the owning vcpu, while the unlocking vcpu can yield-to the next-vcpu-in-waiting. The key here is not to sleep when waiting for locks (as implemented by the current patch series, which can put other VMs at an advantage by giving them more time than they are entitled to), and also to ensure that the lock-owner as well as the next-in-line lock-owner are not unduly made to wait for cpu.

Is there a way we can dynamically expand the size of the lock only upon contention, to include additional information like the owning vcpu? Have the lock point to a per-cpu area upon contention where additional details can be stored, perhaps?

1. https://lkml.org/lkml/2011/1/14/44

- vatsa
[REPOST][PATCH 0/3] Unmapped page cache control (v3)
The following series implements page cache control; this is a split-out version of patch 1 of version 3 of the page cache optimization patches posted earlier.

Previous posting: http://lwn.net/Articles/419564/

The previous few revisions received a lot of comments, and I've tried to address as many of those as possible in this revision. The last series was reviewed by Christoph Lameter. There were comments on overlap with Nick's changes; I don't feel these changes impact Nick's work, and integration can/will be considered as the patches evolve, if need be.

Detailed Description

This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario:

- In a virtualized environment with cache=writethrough, we see double caching (one copy in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.

- The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis.

A lot of the code is borrowed from the zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VM machines, and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications, or used to start additional guests/balance memory in the host.

KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so.

There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid reaping page cache ever so frequently, while at the same time providing control. The sysctl min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim.

Data from the previous patchsets can be found at https://lkml.org/lkml/2010/11/30/79

Size measurement:

CONFIG_UNMAPPED_PAGECACHE_CONTROL and CONFIG_NUMA enabled
# size mm/built-in.o
   text    data     bss     dec     hex filename
 419431 1883047  140888 2443366  254866 mm/built-in.o

CONFIG_UNMAPPED_PAGECACHE_CONTROL disabled, CONFIG_NUMA enabled
# size mm/built-in.o
   text    data     bss     dec     hex filename
 418908 1883023  140888 2442819  254643 mm/built-in.o

---

Balbir Singh (3):
      Move zone_reclaim() outside of CONFIG_NUMA
      Refactor zone_reclaim code
      Provide control over unmapped pages

 Documentation/kernel-parameters.txt |    8 ++
 include/linux/mmzone.h              |    4 +
 include/linux/swap.h                |   21 +-
 init/Kconfig                        |   12 +++
 kernel/sysctl.c                     |   20 +++--
 mm/page_alloc.c                     |    9 ++
 mm/vmscan.c                         |  132 +++
 7 files changed, 175 insertions(+), 31 deletions(-)

--
	Balbir Singh
[REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)
This patch moves zone_reclaim and associated helpers outside CONFIG_NUMA. This infrastructure is reused in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |    4 ++--
 include/linux/swap.h   |    4 ++--
 kernel/sysctl.c        |   18 +-
 mm/vmscan.c            |    2 --
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
 	 */
 	unsigned long		lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-	int node;
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
 	unsigned long		min_unmapped_pages;
+#ifdef CONFIG_NUMA
+	int node;
 	unsigned long		min_slab_pages;
 #endif
 	struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 84375e4..ac5c06e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,11 +253,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a00fdef..e40040e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,15 +1211,6 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
-#ifdef CONFIG_NUMA
-	{
-		.procname	= "zone_reclaim_mode",
-		.data		= &zone_reclaim_mode,
-		.maxlen		= sizeof(zone_reclaim_mode),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-		.extra1		= &zero,
-	},
 	{
 		.procname	= "min_unmapped_ratio",
 		.data		= &sysctl_min_unmapped_ratio,
@@ -1229,6 +1220,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+#ifdef CONFIG_NUMA
+	{
+		.procname	= "zone_reclaim_mode",
+		.data		= &zone_reclaim_mode,
+		.maxlen		= sizeof(zone_reclaim_mode),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+	},
 	{
 		.procname	= "min_slab_ratio",
 		.data		= &sysctl_min_slab_ratio,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42a4859..e841cae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2740,7 +2740,6 @@ static int __init kswapd_init(void)
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -2950,7 +2949,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable
[REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)
Changelog v3:
1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages

Refactor zone_reclaim, move reusable functionality outside of zone_reclaim. Make zone_reclaim_unmapped_pages modular.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 mm/vmscan.c |   35 +++
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e841cae..3b25423 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_pages(struct zone *zone, struct scan_control *sc,
+			unsigned long nr_pages)
+{
+	int priority;
+	/*
+	 * Free memory by calling shrink zone with increasing
+	 * priorities until we have enough memory freed.
+	 */
+	priority = ZONE_RECLAIM_PRIORITY;
+	do {
+		shrink_zone(priority, zone, sc);
+		priority--;
+	} while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -2823,7 +2844,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2847,17 +2867,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-		/*
-		 * Free memory by calling shrink zone with increasing
-		 * priorities until we have enough memory freed.
-		 */
-		priority = ZONE_RECLAIM_PRIORITY;
-		do {
-			shrink_zone(priority, zone, &sc);
-			priority--;
-		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-	}
+	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+		zone_reclaim_pages(zone, &sc, nr_pages);
 
 	nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
 	if (nr_slab_pages0 > zone->min_slab_pages) {
[REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)
Changelog v2 1. Use a config option to enable the code (Andrew Morton) 2. Explain the magic tunables in the code or at-least attempt to explain them (General comment) 3. Hint uses of the boot parameter with unlikely (Andrew Morton) 4. Use better names (balanced is not a good naming convention) Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- Documentation/kernel-parameters.txt |8 +++ include/linux/swap.h| 21 ++-- init/Kconfig| 12 kernel/sysctl.c |2 + mm/page_alloc.c |9 +++ mm/vmscan.c | 97 +++ 6 files changed, 142 insertions(+), 7 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index dd8fe2b..f52b0bd 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined in the file [X86] Set unknown_nmi_panic=1 early on boot. + unmapped_page_control + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL + is enabled. It controls the amount of unmapped memory + that is present in the system. This boot option plus + vm.min_unmapped_ratio (sysctl) provide granular control + over how much unmapped page cache can exist in the system + before kswapd starts reclaiming unmapped page cache pages. + usbcore.autosuspend= [USB] The autosuspend time delay (in seconds) used for newly-detected USB devices (default 2). 
This diff --git a/include/linux/swap.h b/include/linux/swap.h index ac5c06e..773d7e5 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -253,19 +253,32 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) extern int sysctl_min_unmapped_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; -extern int sysctl_min_slab_ratio; #else -#define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) { return 0; } #endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +extern bool should_reclaim_unmapped_pages(struct zone *zone); +#else +static inline bool should_reclaim_unmapped_pages(struct zone *zone) +{ + return false; +} +#endif + +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; +extern int sysctl_min_slab_ratio; +#else +#define zone_reclaim_mode 0 +#endif + extern int page_evictable(struct page *page, struct vm_area_struct *vma); extern void scan_mapping_unevictable_pages(struct address_space *); diff --git a/init/Kconfig b/init/Kconfig index 3eb22ad..78c9169 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -782,6 +782,18 @@ endif # NAMESPACES config MM_OWNER bool +config UNMAPPED_PAGECACHE_CONTROL + bool Provide control over unmapped page cache + default n + help + This option adds support for controlling unmapped page cache + via a boot parameter (unmapped_page_control). The boot parameter + with sysctl (vm.min_unmapped_ratio) control the total number + of unmapped pages in the system. This feature is useful if + you want to limit the amount of unmapped page cache or want + to reduce page cache duplication in a virtualized environment. 
+ If unsure say 'N' + config SYSFS_DEPRECATED bool enable deprecated sysfs features to support old userspace tools depends on SYSFS diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e40040e..ab2c60a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = { .extra1 = zero, }, #endif +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA) { .procname = min_unmapped_ratio, .data = sysctl_min_unmapped_ratio, @@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = { .extra1 = zero, .extra2 = one_hundred, }, +#endif #ifdef CONFIG_NUMA { .procname = zone_reclaim_mode, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1845a97..1c9fbab 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1662,6 +1662,9 @@ zonelist_scan: unsigned long mark;
Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock
On Thu, 2011-01-20 at 17:29 +0530, Srivatsa Vaddagiri wrote: If we had a yield-to [1] sort of interface _and_ information on which vcpu owns a lock, then lock-spinners can yield-to the owning vcpu, and then I'd nak it for being stupid ;-) really, yield*() is retarded, never even consider using it. If you've got the actual owner you can do full blown PI, which is tons better than a 'do-something-random' call. The only reason the whole non-virt pause loop filtering muck uses it is because it really doesn't know anything, and do-something is pretty much all it can do. It's a broken interface.
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
2011/1/20 Kevin Wolf kw...@redhat.com: Am 20.01.2011 11:39, schrieb Yoshiaki Tamura: 2011/1/20 Kevin Wolf kw...@redhat.com: Am 20.01.2011 06:19, schrieb Yoshiaki Tamura: + return; + } + + bdrv_aio_writev(bs, blk_req-reqs[0].sector, blk_req-reqs[0].qiov, + blk_req-reqs[0].nb_sectors, blk_req-reqs[0].cb, + blk_req-reqs[0].opaque); Same here. + bdrv_flush(bs); This looks really strange. What is this supposed to do? One point is that you write it immediately after bdrv_aio_write, so you get an fsync for which you don't know if it includes the current write request or if it doesn't. Which data do you want to get flushed to the disk? I was expecting to flush the aio request that was just initiated. Am I misunderstanding the function? Seems so. The function names don't use really clear terminology either, so you're not the first one to fall in this trap. Basically we have: * qemu_aio_flush() waits for all AIO requests to complete. I think you wanted to have exactly this, but only for a single block device. Such a function doesn't exist yet. * bdrv_flush() makes sure that all successfully completed requests are written to disk (by calling fsync) * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run the fsync in the thread pool Then what I wanted to do is, call qemu_aio_flush first, then bdrv_flush. It should be like live migration. Okay, that makes sense. :-) The other thing is that you introduce a bdrv_flush for each request, basically forcing everyone to something very similar to writethrough mode. I'm sure this will have a big impact on performance. The reason is to avoid inversion of queued requests. Although processing one-by-one is heavy, wouldn't having requests flushed to disk out of order break the disk image? No, that's fine. If a guest issues two requests at the same time, they may complete in any order. You just need to make sure that you don't call the completion callback before the request really has completed. 
We need to flush requests, meaning aio and fsync, before sending the final state of the guests, to make sure we can switch to the secondary safely. In theory I think you could just re-submit the requests on the secondary if they had not completed yet. But you're right, let's keep things simple for the start. I'm just starting to wonder if the guest won't timeout the requests if they are queued for too long. Even more, with IDE, it can only handle one request at a time, so not completing requests doesn't sound like a good idea at all. In what intervals is the event-tap queue flushed? The requests are flushed once each transaction completes. So it's not with specific intervals. Right. So when is a transaction completed? This is the time that a single request will take. The transaction is completed when the vm state is sent to the secondary, and the primary receives the ack to it. Please let me know if the answer is too vague. What I can tell is that it can't be super fast. On the other hand, if you complete before actually writing out, you don't get timeouts, but you signal success to the guest when the request could still fail. What would you do in this case? With a writeback cache mode we're fine, we can just fail the next flush (until then nothing is guaranteed to be on disk and order doesn't matter either), but with cache=writethrough we're in serious trouble. Have you thought about this problem? Maybe we end up having to flush the event-tap queue for each single write in writethrough mode. Yes, and that's what I'm trying to do at this point. Oh, I must have missed that code. Which patch/function should I look at? Maybe I miss-answered to your question. The device may receive timeouts. We should pay attention that the guest does not see timeouts. I'm not expecting that I/O will be super fast, and as long as it is only a performance problem we can live with it. 
However, as soon as the guest gets timeouts it reports I/O errors and eventually offlines the block device. At this point it's not a performance problem any more, but a correctness problem. This is why I suggested that we flush the event-tap queue (i.e. complete the transaction) immediately after an I/O request has been issued, instead of waiting for other events that would complete the transaction. Right. event-tap doesn't queue at a specific interval. It'll schedule the transaction as a bh once events are tapped. The purpose of the queue is to store requests initiated during the transaction. So I believe the current implementation should be doing what you're expecting. However, if the guest dirtied a huge amount of RAM and initiated block requests, we may get timeouts even if we started the transaction right away. Yoshi If timeouts didn't happen, the requests are flushed one-by-one in writethrough because we're calling qemu_aio_flush and bdrv_flush
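The flush terminology sorted out earlier in this thread can be made concrete with a toy model (hypothetical names and counters — this is not the real QEMU block layer): an aio-style flush drains in-flight requests, while a bdrv_flush-style fsync only makes already-completed writes durable, so the pre-migration path wants both, in that order.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one block device: requests move from in_flight to
 * completed (AIO completion) to durable (fsync). */
struct toy_bs {
    int in_flight;   /* submitted AIO not yet completed */
    int completed;   /* completed but not yet fsync'ed  */
    int durable;     /* on stable storage               */
};

/* Like qemu_aio_flush(): wait for all AIO to complete (here: move
 * everything in flight to "completed"). */
static void toy_aio_flush(struct toy_bs *bs)
{
    bs->completed += bs->in_flight;
    bs->in_flight = 0;
}

/* Like bdrv_flush(): fsync successfully completed requests only --
 * it says nothing about requests that are still in flight. */
static void toy_bdrv_flush(struct toy_bs *bs)
{
    bs->durable += bs->completed;
    bs->completed = 0;
}

/* The order discussed in the thread: drain AIO first, then fsync,
 * so the final state sent to the secondary is backed by stable data. */
static bool toy_flush_for_migration(struct toy_bs *bs)
{
    toy_aio_flush(bs);
    toy_bdrv_flush(bs);
    return bs->in_flight == 0 && bs->completed == 0;
}
```

Calling only the fsync step would leave the in-flight requests unaccounted for, which is exactly the trap the thread warns about.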
[PATCH 1/1]: Log more information on IO_PAGE_FAULT event.
Prints more information when an IO_PAGE_FAULT event occurs. Signed-off-by: Prasad Joshi prasadjoshi...@gmail.com --- diff --git a/arch/x86/include/asm/amd_iommu_types.h b/arch/x86/include/asm/amd_iommu_types.h index e3509fc..add56b3 100644 --- a/arch/x86/include/asm/amd_iommu_types.h +++ b/arch/x86/include/asm/amd_iommu_types.h @@ -96,6 +96,16 @@ #define EVENT_FLAGS_MASK 0xfff #define EVENT_FLAGS_SHIFT 0x10 +#define EVENT_FLAGS_RESERVED1 0x0 +#define EVENT_FLAGS_RESERVED_SIZE 0x3 +#define EVENT_FLAGS_INT (EVENT_FLAGS_RESERVED1 + EVENT_FLAGS_RESERVED_SIZE) +#define EVENT_FLAGS_PR (EVENT_FLAGS_INT + 0x1) +#define EVENT_FLAGS_RW (EVENT_FLAGS_PR + 0x1) +#define EVENT_FLAGS_PE (EVENT_FLAGS_RW + 0x1) +#define EVENT_FLAGS_RZ (EVENT_FLAGS_PE + 0x1) +#define EVENT_FLAGS_TR (EVENT_FLAGS_RZ + 0x1) +#define EVENT_FLAGS_RESERVED2 (EVENT_FLAGS_TR + 0x1) + /* feature control bits */ #define CONTROL_IOMMU_EN 0x00ULL #define CONTROL_HT_TUN_EN 0x01ULL diff --git a/arch/x86/kernel/amd_iommu.c b/arch/x86/kernel/amd_iommu.c index 57ca777..a158852 100644 --- a/arch/x86/kernel/amd_iommu.c +++ b/arch/x86/kernel/amd_iommu.c @@ -283,6 +283,44 @@ static void dump_command(unsigned long phys_addr) pr_err("AMD-Vi: CMD[%d]: %08x\n", i, cmd->data[i]); } +static void inline dump_page_fault(int flags) +{ + int i = (flags >> EVENT_FLAGS_INT) & 0x1; + int pr = (flags >> EVENT_FLAGS_PR) & 0x1; + int rw = (flags >> EVENT_FLAGS_RW) & 0x1; + int pe = (flags >> EVENT_FLAGS_PE) & 0x1; + int rz = (flags >> EVENT_FLAGS_RZ) & 0x1; + int tr = (flags >> EVENT_FLAGS_TR) & 0x1; + + const char *i_pr[] = { + "to page marked not present", + "to page marked present", + "to interrupt marked blocked", + "to interrupt marked remapped" + }; + + const char *tr_s[2] = { + "transaction", + "translation" + }; + + const char *type[2] = { + "read", + "write" + }; + + const char *permission[2] = { + "peripheral had permission", + "peripheral had no permission", + }; + + pr_err("AMD-Vi: \t%s %s", tr_s[tr], i_pr[(i << 1) | pr]); + pr_err("AMD-Vi: \t%s type: %s", tr_s[tr], type[rw]); + if (pr) { + pr_err("AMD-Vi: \t%s", permission[pe]); + } +} + static void iommu_print_event(struct amd_iommu *iommu, void *__evt) { u32 *event = __evt; @@ -307,6 +345,7 @@ static void iommu_print_event(struct amd_iommu *iommu, void *__evt) "domain=0x%04x address=0x%016llx flags=0x%04x]\n", PCI_BUS(devid), PCI_SLOT(devid), PCI_FUNC(devid), domid, address, flags); + dump_page_fault(flags); break; case EVENT_TYPE_DEV_TAB_ERR: printk(DEV_TAB_HARDWARE_ERROR device=%02x:%02x.%x
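The decoding in dump_page_fault() is plain bit extraction followed by table lookups. A self-contained sketch of the same pattern, using the bit positions the patch defines (the helper names here are ours, not the kernel's):

```c
#include <assert.h>
#include <string.h>

/* Bit positions from the patch: three reserved bits, then I, PR, RW,
 * PE, RZ, TR in the event-log flags word. */
enum { FLAG_INT = 3, FLAG_PR = 4, FLAG_RW = 5, FLAG_PE = 6, FLAG_RZ = 7, FLAG_TR = 8 };

/* Extract a single flag bit, the way dump_page_fault() does. */
static int flag_bit(int flags, int pos)
{
    return (flags >> pos) & 0x1;
}

/* Mirrors the i_pr[] lookup in the patch: the index combines the
 * interrupt bit and the present bit as (I << 1) | PR. */
static const char *fault_target(int flags)
{
    static const char *i_pr[] = {
        "to page marked not present",
        "to page marked present",
        "to interrupt marked blocked",
        "to interrupt marked remapped",
    };

    return i_pr[(flag_bit(flags, FLAG_INT) << 1) | flag_bit(flags, FLAG_PR)];
}
```

For example, a flags word with only the PR bit set decodes as a fault to a present page, while the INT bit alone selects the blocked-interrupt message.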
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
Am 20.01.2011 14:50, schrieb Yoshiaki Tamura: 2011/1/20 Kevin Wolf kw...@redhat.com: Am 20.01.2011 11:39, schrieb Yoshiaki Tamura: 2011/1/20 Kevin Wolf kw...@redhat.com: Am 20.01.2011 06:19, schrieb Yoshiaki Tamura: +return; +} + +bdrv_aio_writev(bs, blk_req-reqs[0].sector, blk_req-reqs[0].qiov, +blk_req-reqs[0].nb_sectors, blk_req-reqs[0].cb, +blk_req-reqs[0].opaque); Same here. +bdrv_flush(bs); This looks really strange. What is this supposed to do? One point is that you write it immediately after bdrv_aio_write, so you get an fsync for which you don't know if it includes the current write request or if it doesn't. Which data do you want to get flushed to the disk? I was expecting to flush the aio request that was just initiated. Am I misunderstanding the function? Seems so. The function names don't use really clear terminology either, so you're not the first one to fall in this trap. Basically we have: * qemu_aio_flush() waits for all AIO requests to complete. I think you wanted to have exactly this, but only for a single block device. Such a function doesn't exist yet. * bdrv_flush() makes sure that all successfully completed requests are written to disk (by calling fsync) * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run the fsync in the thread pool Then what I wanted to do is, call qemu_aio_flush first, then bdrv_flush. It should be like live migration. Okay, that makes sense. :-) The other thing is that you introduce a bdrv_flush for each request, basically forcing everyone to something very similar to writethrough mode. I'm sure this will have a big impact on performance. The reason is to avoid inversion of queued requests. Although processing one-by-one is heavy, wouldn't having requests flushed to disk out of order break the disk image? No, that's fine. If a guest issues two requests at the same time, they may complete in any order. 
You just need to make sure that you don't call the completion callback before the request really has completed. We need to flush requests, meaning aio and fsync, before sending the final state of the guests, to make sure we can switch to the secondary safely. In theory I think you could just re-submit the requests on the secondary if they had not completed yet. But you're right, let's keep things simple for the start. I'm just starting to wonder if the guest won't timeout the requests if they are queued for too long. Even more, with IDE, it can only handle one request at a time, so not completing requests doesn't sound like a good idea at all. In what intervals is the event-tap queue flushed? The requests are flushed once each transaction completes. So it's not with specific intervals. Right. So when is a transaction completed? This is the time that a single request will take. The transaction is completed when the vm state is sent to the secondary, and the primary receives the ack to it. Please let me know if the answer is too vague. What I can tell is that it can't be super fast. On the other hand, if you complete before actually writing out, you don't get timeouts, but you signal success to the guest when the request could still fail. What would you do in this case? With a writeback cache mode we're fine, we can just fail the next flush (until then nothing is guaranteed to be on disk and order doesn't matter either), but with cache=writethrough we're in serious trouble. Have you thought about this problem? Maybe we end up having to flush the event-tap queue for each single write in writethrough mode. Yes, and that's what I'm trying to do at this point. Oh, I must have missed that code. Which patch/function should I look at? Maybe I miss-answered to your question. The device may receive timeouts. We should pay attention that the guest does not see timeouts. 
I'm not expecting that I/O will be super fast, and as long as it is only a performance problem we can live with it. However, as soon as the guest gets timeouts it reports I/O errors and eventually offlines the block device. At this point it's not a performance problem any more, but a correctness problem. This is why I suggested that we flush the event-tap queue (i.e. complete the transaction) immediately after an I/O request has been issued, instead of waiting for other events that would complete the transaction. Right. event-tap doesn't queue at a specific interval. It'll schedule the transaction as a bh once events are tapped. The purpose of the queue is to store requests initiated during the transaction. Ok, now I got it. :-) So the patches are already doing the best we can do. So I believe the current implementation should be doing what you're expecting. However, if the guest dirtied a huge amount of RAM and initiated block requests, we may get timeouts even if we started the transaction right away. Right.
Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock
On Thu, Jan 20, 2011 at 02:41:46PM +0100, Peter Zijlstra wrote: On Thu, 2011-01-20 at 17:29 +0530, Srivatsa Vaddagiri wrote: If we had a yield-to [1] sort of interface _and_ information on which vcpu owns a lock, then lock-spinners can yield-to the owning vcpu, and then I'd nak it for being stupid ;-) really, yield*() is retarded, never even consider using it. If you've got the actual owner you can do full blown PI, which is tons better than a 'do-something-random' call. Yes definitely that would be much better than yield-to. The only reason the whole non-virt pause loop filtering muck uses it is because it really doesn't know anything, and do-something is pretty much all it can do. It's a broken interface. - vatsa
Re: [REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)
On Thu, 20 Jan 2011, Balbir Singh wrote: --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -253,11 +253,11 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +extern int sysctl_min_unmapped_ratio; +extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #ifdef CONFIG_NUMA extern int zone_reclaim_mode; -extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; -extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else #define zone_reclaim_mode 0 So the end result of this patch is that zone reclaim is compiled into vmscan.o even on !NUMA configurations, but since zone_reclaim_mode == 0 no one can ever call that code?
Re: [REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)
Reviewed-by: Christoph Lameter c...@linux.com
[PATCH] KVM test: virtio_console: Add kernel crash logging
Change cleanup function. 1) Try cleanup faster way. 2) Try restart ssh session. 3) Try restart machine. *) If test end with kernel crash directly try to restart machine. Signed-off-by: Jiri Zupka jzu...@redhat.com --- client/tests/kvm/scripts/virtio_guest.py | 79 - client/tests/kvm/tests/virtio_console.py | 204 +++--- 2 files changed, 179 insertions(+), 104 deletions(-) diff --git a/client/tests/kvm/scripts/virtio_guest.py b/client/tests/kvm/scripts/virtio_guest.py index 0038f48..498bf6a 100755 --- a/client/tests/kvm/scripts/virtio_guest.py +++ b/client/tests/kvm/scripts/virtio_guest.py @@ -10,11 +10,12 @@ Auxiliary script used to send data between ports on guests. import threading from threading import Thread import os, time, select, re, random, sys, array -import fcntl, array, subprocess, traceback, signal +import fcntl, subprocess, traceback, signal DEBUGPATH = /sys/kernel/debug SYSFSPATH = /sys/class/virtio-ports/ +exiting = False class VirtioGuest: @@ -70,32 +71,42 @@ class VirtioGuest: else: viop_names = os.listdir('%s/virtio-ports' % (DEBUGPATH)) for name in viop_names: -f = open(%s/virtio-ports/%s % (DEBUGPATH, name), 'r') +open_db_file = %s/virtio-ports/%s % (DEBUGPATH, name) +f = open(open_db_file, 'r') port = {} +file = [] for line in iter(f): -m = re.match((\S+): (\S+), line) -port[m.group(1)] = m.group(2) - -if (port['is_console'] == yes): -port[path] = /dev/hvc%s % (port[console_vtermno]) -# Console works like a serialport -else: -port[path] = /dev/%s % name - -if (not os.path.exists(port['path'])): -print FAIL: %s not exist % port['path'] - -sysfspath = SYSFSPATH + name -if (not os.path.isdir(sysfspath)): -print FAIL: %s not exist % (sysfspath) - -info_name = sysfspath + /name -port_name = self._readfile(info_name).strip() -if (port_name != port[name]): -print (FAIL: Port info not match \n%s - %s\n%s - %s % - (info_name , port_name, -%s/virtio-ports/%s % (DEBUGPATH, name), -port[name])) +file.append(line) +try: +for line in file: +m = re.match((\S+): 
(\S+), line) +port[m.group(1)] = m.group(2) + +if (port['is_console'] == yes): +port[path] = /dev/hvc%s % (port[console_vtermno]) +# Console works like a serialport +else: +port[path] = /dev/%s % name + +if (not os.path.exists(port['path'])): +print FAIL: %s not exist % port['path'] + +sysfspath = SYSFSPATH + name +if (not os.path.isdir(sysfspath)): +print FAIL: %s not exist % (sysfspath) + +info_name = sysfspath + /name +port_name = self._readfile(info_name).strip() +if (port_name != port[name]): +print (FAIL: Port info not match \n%s - %s\n%s - %s % + (info_name , port_name, +%s/virtio-ports/%s % (DEBUGPATH, name), +port[name])) +except AttributeError: +print (In file + open_db_file + +are bad data\n+ .join(file).strip()) +print (FAIL: Fail file data.) +return ports[port['name']] = port f.close() @@ -109,6 +120,8 @@ class VirtioGuest: self.ports = self._get_port_status() +if self.ports == None: +return for item in in_files: if (item[1] != self.ports[item[0]][is_console]): print self.ports @@ -619,7 +632,7 @@ class VirtioGuest: buf = if ret[0]: buf = os.read(in_f[0], buffer) -print (PASS: Rest in socket: + buf) +print (PASS: Rest in socket: ) + str(buf[10]) def is_alive(): @@ -645,13 +658,20 @@ def compile(): sys.exit() +def guest_exit(): +global exiting +exiting = True +os.kill(os.getpid(), signal.SIGUSR1) + + def worker(virt): Worker thread (infinite) loop of virtio_guest. +global exiting print PASS: Start -while True: +while not exiting: str = raw_input() try: exec str @@ -661,7 +681,6 @@ def worker(virt):
Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)
On Thu, 20 Jan 2011, Balbir Singh wrote: + unmapped_page_control + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL + is enabled. It controls the amount of unmapped memory + that is present in the system. This boot option plus + vm.min_unmapped_ratio (sysctl) provide granular control min_unmapped_ratio is there to guarantee that zone reclaim does not reclaim all unmapped pages. What you want here is a max_unmapped_ratio. { @@ -2297,6 +2320,12 @@ loop_again: shrink_active_list(SWAP_CLUSTER_MAX, zone, &sc, priority, 0); + /* + * We do unmapped page reclaim once here and once + * below, so that we don't lose out + */ + reclaim_unmapped_pages(priority, zone, &sc); + if (!zone_watermark_ok_safe(zone, order, H. Okay that means background reclaim does it. If so then we also want zone reclaim to be able to work in the background I think. max_unmapped_ratio could also be useful to the zone reclaim logic.
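The objection here is about direction: min_unmapped_ratio is a floor that zone reclaim must not go below, while the proposed feature wants a ceiling that triggers reclaim once exceeded. A toy illustration of the two checks (struct and function names are illustrative only, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_zone {
    unsigned long unmapped_pages;
    unsigned long min_unmapped;   /* floor: never reclaim below this   */
    unsigned long max_unmapped;   /* ceiling: start reclaim above this */
};

/* zone_reclaim-style check: reclaiming is only worthwhile while we
 * remain above the guaranteed floor of unmapped pages. */
static bool may_reclaim_unmapped(const struct toy_zone *z)
{
    return z->unmapped_pages > z->min_unmapped;
}

/* The max_unmapped_ratio Christoph asks for: trigger reclaim once the
 * unmapped page cache grows past a ceiling. */
static bool should_reclaim_unmapped(const struct toy_zone *z)
{
    return z->unmapped_pages > z->max_unmapped;
}
```

With both checks in place, reclaim would start above the ceiling and stop before crossing the floor, rather than overloading one tunable with both meanings.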
[PATCH] vhost: force vhost off for non-MSI guests
When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments? hw/vhost.c |4 +++- hw/virtio-net.c |6 -- hw/virtio-pci.c |3 +++ hw/virtio.h |2 ++ 4 files changed, 12 insertions(+), 3 deletions(-) diff --git a/hw/vhost.c b/hw/vhost.c index 1d09ed0..c79765a 100644 --- a/hw/vhost.c +++ b/hw/vhost.c @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev) r = vdev-binding-set_guest_notifiers(vdev-binding_opaque, true); if (r 0) { -fprintf(stderr, Error binding guest notifier: %d\n, -r); + if (r != -EVIRTIO_DISABLED) { + fprintf(stderr, Error binding guest notifier: %d\n, -r); + } goto fail_notifiers; } diff --git a/hw/virtio-net.c b/hw/virtio-net.c index ccb3e63..5de3fee 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status) if (!n-vhost_started) { int r = vhost_net_start(tap_get_vhost_net(n-nic-nc.peer), n-vdev); if (r 0) { -error_report(unable to start vhost net: %d: - falling back on userspace virtio, -r); +if (r != -EVIRTIO_DISABLED) { +error_report(unable to start vhost net: %d: + falling back on userspace virtio, -r); +} } else { n-vhost_started = 1; } diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index dd8887a..dbf4be0 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign) 
EventNotifier *notifier = virtio_queue_get_guest_notifier(vq); if (assign) { +if (!msix_enabled(proxy-pci_dev)) { +return -EVIRTIO_DISABLED; +} int r = event_notifier_init(notifier, 0); if (r 0) { return r; diff --git a/hw/virtio.h b/hw/virtio.h index d8546d5..53bbdba 100644 --- a/hw/virtio.h +++ b/hw/virtio.h @@ -98,6 +98,8 @@ typedef struct { void (*vmstate_change)(void * opaque, bool running); } VirtIOBindings; +#define EVIRTIO_DISABLED ERANGE + #define VIRTIO_PCI_QUEUE_MAX 64 #define VIRTIO_NO_VECTOR 0x -- 1.7.3.2.91.g446ac -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
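The policy this patch implements amounts to: try to start vhost, treat the -EVIRTIO_DISABLED sentinel as a silent fall-back to userspace virtio, and report any other error. A minimal sketch of that decision (the sentinel is from the patch; the stub and helper names are hypothetical):

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#define EVIRTIO_DISABLED ERANGE   /* sentinel value chosen in the patch */

/* Stub: pretend guest-notifier setup fails with the sentinel whenever
 * the guest has not enabled MSI-X, as virtio_pci_set_guest_notifier
 * does in the patch. */
static int set_guest_notifiers_stub(int msix_enabled)
{
    return msix_enabled ? 0 : -EVIRTIO_DISABLED;
}

/* Mirrors the virtio_net_vhost_status() logic: silent fall-back on the
 * sentinel, an error message for anything else.  Returns 1 if vhost
 * started, 0 if we fall back on userspace virtio. */
static int try_start_vhost(int msix_enabled, char *errbuf, size_t len)
{
    int r = set_guest_notifiers_stub(msix_enabled);

    if (r < 0) {
        if (r != -EVIRTIO_DISABLED)
            snprintf(errbuf, len, "unable to start vhost net: %d", -r);
        return 0;   /* fall back on userspace virtio */
    }
    return 1;
}
```

Whether the fall-back should stay silent — rather than warn or abort — is exactly what the review reply below questions.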
Re: [PATCH v3 1/3] KVM: fix rcu usage warning in kvm_arch_vcpu_ioctl_set_sregs()
On Thu, Jan 20, 2011 at 11:03:31AM +0800, Xiao Guangrong wrote: On 01/20/2011 02:13 AM, Marcelo Tosatti wrote: On Wed, Jan 19, 2011 at 01:16:23PM +0800, Xiao Guangrong wrote: On 01/12/2011 03:39 PM, Xiao Guangrong wrote: Fix: [ 1001.499596] === [ 1001.499599] [ INFO: suspicious rcu_dereference_check() usage. ] [ 1001.499601] --- [ 1001.499604] include/linux/kvm_host.h:301 invoked rcu_dereference_check() without protection! .. [ 1001.499636] Pid: 6035, comm: qemu-system-x86 Not tainted 2.6.37-rc6+ #62 [ 1001.499638] Call Trace: [ 1001.499644] [] lockdep_rcu_dereference+0x9d/0xa5 [ 1001.499653] [] gfn_to_memslot+0x8d/0xc8 [kvm] [ 1001.499661] [] gfn_to_hva+0x16/0x3f [kvm] [ 1001.499669] [] kvm_read_guest_page+0x1e/0x5e [kvm] [ 1001.499681] [] kvm_read_guest_page_mmu+0x53/0x5e [kvm] [ 1001.499699] [] load_pdptrs+0x3f/0x9c [kvm] [ 1001.499705] [] ? vmx_set_cr0+0x507/0x517 [kvm_intel] [ 1001.499717] [] kvm_arch_vcpu_ioctl_set_sregs+0x1f3/0x3c0 [kvm] [ 1001.499727] [] kvm_vcpu_ioctl+0x6a5/0xbc5 [kvm] Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com Ping ...? Applied this fix. For the make_all_cpus_request optimization, can you show numbers with this new version? Because now there is LOCK# similarly to the spinlock. Marcelo, Sure :-), there is the simply test result of kernbench: Before patch: real5m6.493s user3m57.847s sys 9m7.115s real5m1.750s user4m0.109s sys 9m10.192s After patch: real5m0.140s user3m57.956s sys 8m58.339s real4m56.314s user4m0.303s sys 8m55.774s Nice. One disadvantageous side effect for the kvm_vcpu_kick path is that it can race with make_all_cpus_request, which is possibly doing unrelated, slower work (IPI'ing other vcpus, waiting for response). Looks ok, but lets wait for more careful reviews before applying. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkinm...@redhat.com I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments? 
 hw/vhost.c      |    4 +++-
 hw/virtio-net.c |    6 ++++--
 hw/virtio-pci.c |    3 +++
 hw/virtio.h     |    2 ++
 4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1d09ed0..c79765a 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
     r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ccb3e63..5de3fee 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
         if (!n->vhost_started) {
             int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
             if (r < 0) {
-                error_report("unable to start vhost net: %d: "
-                             "falling back on userspace virtio", -r);
+                if (r != -EVIRTIO_DISABLED) {
+                    error_report("unable to start vhost net: %d: "
+                                 "falling back on userspace virtio", -r);
+                }
             } else {
                 n->vhost_started = 1;
             }

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index dd8887a..dbf4be0 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
     EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
     if (assign) {
+        if (!msix_enabled(&proxy->pci_dev)) {
+            return -EVIRTIO_DISABLED;
+        }
         int r = event_notifier_init(notifier, 0);
         if (r < 0) {
             return r;

diff --git a/hw/virtio.h b/hw/virtio.h
index d8546d5..53bbdba 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -98,6 +98,8 @@ typedef struct {
     void (*vmstate_change)(void * opaque, bool running);
 } VirtIOBindings;

+#define EVIRTIO_DISABLED ERANGE
+
 #define VIRTIO_PCI_QUEUE_MAX 64

 #define VIRTIO_NO_VECTOR 0xffff

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
2011/1/20 Kevin Wolf kw...@redhat.com: On 20.01.2011 14:50, Yoshiaki Tamura wrote: 2011/1/20 Kevin Wolf kw...@redhat.com: On 20.01.2011 11:39, Yoshiaki Tamura wrote: 2011/1/20 Kevin Wolf kw...@redhat.com: On 20.01.2011 06:19, Yoshiaki Tamura wrote: + return; + } + + bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov, + blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb, + blk_req->reqs[0].opaque); Same here. + bdrv_flush(bs); This looks really strange. What is this supposed to do? One point is that you write it immediately after bdrv_aio_writev, so you get an fsync for which you don't know if it includes the current write request or if it doesn't. Which data do you want to get flushed to the disk? I was expecting to flush the aio request that was just initiated. Am I misunderstanding the function? Seems so. The function names don't use really clear terminology either, so you're not the first one to fall in this trap. Basically we have: * qemu_aio_flush() waits for all AIO requests to complete. I think you wanted to have exactly this, but only for a single block device. Such a function doesn't exist yet. * bdrv_flush() makes sure that all successfully completed requests are written to disk (by calling fsync) * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run the fsync in the thread pool Then what I wanted to do is, call qemu_aio_flush first, then bdrv_flush. It should be like live migration. Okay, that makes sense. :-) The other thing is that you introduce a bdrv_flush for each request, basically forcing everyone to something very similar to writethrough mode. I'm sure this will have a big impact on performance. The reason is to avoid inversion of queued requests. Although processing one-by-one is heavy, wouldn't having requests flushed to disk out of order break the disk image? No, that's fine. If a guest issues two requests at the same time, they may complete in any order.
You just need to make sure that you don't call the completion callback before the request really has completed. We need to flush requests, meaning aio and fsync, before sending the final state of the guests, to make sure we can switch to the secondary safely. In theory I think you could just re-submit the requests on the secondary if they had not completed yet. But you're right, let's keep things simple for the start. I'm just starting to wonder if the guest won't timeout the requests if they are queued for too long. Even more, with IDE, it can only handle one request at a time, so not completing requests doesn't sound like a good idea at all. In what intervals is the event-tap queue flushed? The requests are flushed once each transaction completes. So it's not with specific intervals. Right. So when is a transaction completed? This is the time that a single request will take. The transaction is completed when the vm state is sent to the secondary, and the primary receives the ack to it. Please let me know if the answer is too vague. What I can tell is that it can't be super fast. On the other hand, if you complete before actually writing out, you don't get timeouts, but you signal success to the guest when the request could still fail. What would you do in this case? With a writeback cache mode we're fine, we can just fail the next flush (until then nothing is guaranteed to be on disk and order doesn't matter either), but with cache=writethrough we're in serious trouble. Have you thought about this problem? Maybe we end up having to flush the event-tap queue for each single write in writethrough mode. Yes, and that's what I'm trying to do at this point. Oh, I must have missed that code. Which patch/function should I look at? Maybe I miss-answered to your question. The device may receive timeouts. We should pay attention that the guest does not see timeouts. 
I'm not expecting that I/O will be super fast, and as long as it is only a performance problem we can live with it. However, as soon as the guest gets timeouts it reports I/O errors and eventually offlines the block device. At this point it's not a performance problem any more, but also a correctness problem. This is why I suggested that we flush the event-tap queue (i.e. complete the transaction) immediately after an I/O request has been issued instead of waiting for other events that would complete the transaction. Right. event-tap doesn't queue at a specific interval. It'll schedule the transaction as a bh once events are tapped. The purpose of the queue is to store requests initiated during the transaction. Ok, now I got it. :-) So the patches are already doing the best we can do. So I believe the current implementation should be doing what you're expecting. However, if the guest dirtied a huge amount of ram and initiated block requests, we may get timeouts even we
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin <m...@redhat.com> I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that the user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says "use vhost where it helps", but then I think this is what everyone wants to do anyway, and libvirt already sets vhost=on, so I prefer redefining the meaning of an existing flag. Maybe this is best handled by a documentation update? We always said: "use vhost=on to enable experimental in kernel accelerator\n" — note 'enable', not 'require'. This is similar to how we specify nvectors: you can not make the guest use the feature.
How about this:

diff --git a/qemu-options.hx b/qemu-options.hx
index 898561d..3c937c1 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1061,6 +1061,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
     "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
     "use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
     "use vhost=on to enable experimental in kernel accelerator\n"
+    "    (note: vhost=on has no effect unless guest uses MSI-X)\n"
     "use 'vhostfd=h' to connect to an already opened vhost net device\n"
 #endif
     "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin <m...@redhat.com> --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments? Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED? -Sridhar
[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at
https://bugzilla.kernel.org/show_bug.cgi?id=27052

Marcelo Tosatti mtosa...@redhat.com changed:

           What      |Removed |Added
           CC        |        |mtosa...@redhat.com

--- Comment #4 from Marcelo Tosatti mtosa...@redhat.com 2011-01-20 17:28:40 ---
Nicolas, this should be fixed by the attached patch, queued for 2.6.36-stable.

-- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are watching the assignee of the bug.
[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at
https://bugzilla.kernel.org/show_bug.cgi?id=27052

--- Comment #5 from Marcelo Tosatti mtosa...@redhat.com 2011-01-20 17:30:38 ---
Created an attachment (id=44522) -- (https://bugzilla.kernel.org/attachment.cgi?id=44522) KVM: MMU: fix rmap_remove on non present sptes
Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote: On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin <m...@redhat.com> --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments? Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED? -Sridhar The error is reported by virtio-pci which does not know about vhost. I started with EVIRTIO_MSIX_DISABLED and made it shorter. Would EVIRTIO_MSIX_DISABLED be better?
Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock
On 01/20/2011 03:42 AM, Srivatsa Vaddagiri wrote: On Wed, Jan 19, 2011 at 10:53:52AM -0800, Jeremy Fitzhardinge wrote: The reason for wanting this should be clear I guess, it allows PI. Well, if we can expand the spinlock to include an owner, then all this becomes moot... How so? Having an owner will not eliminate the need for pv-ticketlocks afaict. We still need a mechanism to reduce latency in scheduling the next thread-in-waiting, which is achieved by your patches. No, sorry, I should have been clearer. I meant that going to the effort of not increasing the lock size to record "in slowpath" state would be moot. J
Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock
On 01/20/2011 03:59 AM, Srivatsa Vaddagiri wrote: At least in the Xen code, a current owner isn't very useful, because we need the current owner to kick the *next* owner to life at release time, which we can't do without some structure recording which ticket belongs to which cpu. If we had a yield-to [1] sort of interface _and_ information on which vcpu owns a lock, then lock-spinners can yield-to the owning vcpu, while the unlocking vcpu can yield-to the next-vcpu-in-waiting. Perhaps, but the core problem is how to find next-vcpu-in-waiting efficiently. Once you have that info, there's a number of things you can usefully do with it. The key here is not to sleep when waiting for locks (as implemented by the current patch-series, which can put other VMs at an advantage by giving them more time than they are entitled to) Why? If a VCPU can't make progress because it's waiting for some resource, then why not schedule something else instead? Presumably when the VCPU does become runnable, the scheduler will credit its previous blocked state and let it run in preference to something else. and also to ensure that the lock-owner as well as the next-in-line lock-owner are not unduly made to wait for cpu. Is there a way we can dynamically expand the size of lock only upon contention to include additional information like owning vcpu? Have the lock point to a per-cpu area upon contention where additional details can be stored perhaps? As soon as you add a pointer to the lock, you're increasing its size. If we had a pointer in there already, then all of this would be moot. If auxiliary per-lock data is uncommon, then using a hash keyed on lock address would be one way to do it. J
Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. We need this if we want to avoid bouncing vfio INTx through qemu too. Signed-off-by: Michael S. Tsirkin <m...@redhat.com> --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments?

 hw/vhost.c      |    4 +++-
 hw/virtio-net.c |    6 ++++--
 hw/virtio-pci.c |    3 +++
 hw/virtio.h     |    2 ++
 4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1d09ed0..c79765a 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
     r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }

style - the above is tab indented.

         goto fail_notifiers;
     }

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ccb3e63..5de3fee 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
         if (!n->vhost_started) {
             int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
             if (r < 0) {
-                error_report("unable to start vhost net: %d: "
-                             "falling back on userspace virtio", -r);
+                if (r != -EVIRTIO_DISABLED) {
+                    error_report("unable to start vhost net: %d: "
+                                 "falling back on userspace virtio", -r);
+                }
             } else {
                 n->vhost_started = 1;
             }

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index dd8887a..dbf4be0 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
     EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
     if (assign) {
+        if (!msix_enabled(&proxy->pci_dev)) {
+            return -EVIRTIO_DISABLED;
+        }
         int r = event_notifier_init(notifier, 0);
         if (r < 0) {
             return r;

diff --git a/hw/virtio.h b/hw/virtio.h
index d8546d5..53bbdba 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -98,6 +98,8 @@ typedef struct {
     void (*vmstate_change)(void * opaque, bool running);
 } VirtIOBindings;

+#define EVIRTIO_DISABLED ERANGE
+
 #define VIRTIO_PCI_QUEUE_MAX 64

 #define VIRTIO_NO_VECTOR 0xffff

I'm not a fan of having this special return value. Why doesn't virtio-pci only setup the set_guest_notifiers function pointer when msix is enabled? If not that, it could at least expose some virtio_foo_enabled() type feature that vhost could check. Thanks, Alex
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-19 20:32, Blue Swirl wrote: On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori aligu...@linux.vnet.ibm.com wrote: On 01/19/2011 07:15 AM, Markus Armbruster wrote: So they interact with KVM (need kvm_state), and they interact with the emulated PCI bus. Could you elaborate on the fundamental difference between the two interactions that makes you choose the (hypothetical) KVM bus over the PCI bus as device parent? It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instancing, using a linear index). I think kvm_state should not be a property of any device or bus. It should be split to more logical pieces. Some parts of it could remain in CPUState, because they are associated with a VCPU. Also, for example irqfd could be considered to be similar object to char or block devices provided by QEMU to devices. Would it make sense to introduce new host types for passing parts of kvm_state to devices? I'd also make coalesced MMIO stuff part of memory object. We are not passing any state references when using cpu_physical_memory_rw(), but that could be changed. There are currently no VCPU-specific bits remaining in kvm_state. I think fields vcpu_events, robust_singlestep, debugregs, kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the same for all VCPUs but still they are sort of CPU properties. I'm not sure about fd field. It may be a good idea to introduce an arch-specific kvm_state and move related bits over. 
This should probably contain only irqchip_in_kernel, pit_in_kernel and many_ioeventfds, maybe fd. It may also once be feasible to carve out memory management related fields if we have proper abstractions for that, but I'm not completely sure here. I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region, migration_log into the memory object. Anyway, all these things are secondary. The primary topic here is how to deal with kvm_state and its fields that have VM-global scope. If it is an opaque blob which contains various unrelated stuff, no clear place will be found. By the way, we don't have a QEMUState but instead use globals. Perhaps this should be reorganized as well. For fd field, maybe even using a global variable could be justified, since it is used for direct access to kernel, not unlike a system call. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/20/2011 03:33 AM, Jan Kiszka wrote: On 2011-01-19 20:32, Blue Swirl wrote: On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori aligu...@linux.vnet.ibm.com wrote: On 01/19/2011 07:15 AM, Markus Armbruster wrote: So they interact with KVM (need kvm_state), and they interact with the emulated PCI bus. Could you elaborate on the fundamental difference between the two interactions that makes you choose the (hypothetical) KVM bus over the PCI bus as device parent? It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instancing, using a linear index). I think kvm_state should not be a property of any device or bus. It should be split to more logical pieces. Some parts of it could remain in CPUState, because they are associated with a VCPU. Also, for example irqfd could be considered to be similar object to char or block devices provided by QEMU to devices. Would it make sense to introduce new host types for passing parts of kvm_state to devices? I'd also make coalesced MMIO stuff part of memory object. We are not passing any state references when using cpu_physical_memory_rw(), but that could be changed. There are currently no VCPU-specific bits remaining in kvm_state. It may be a good idea to introduce an arch-specific kvm_state and move related bits over. It may also once be feasible to carve out memory management related fields if we have proper abstractions for that, but I'm not completely sure here. Anyway, all these things are secondary. The primary topic here is how to deal with kvm_state and its fields that have VM-global scope. 
The debate is really: 1) should we remove all passing of kvm_state and just assume it's static 2) deal with a couple places in the code where we need to figure out how to get at kvm_state I think we've only identified 1 real instance of (2) and it's resulted in some good discussions about how to model KVM devices vs. emulated devices. Honestly, (1) just stinks. I see absolutely no advantage to it at all. In the very worst case scenario, the thing we need to do is just reference an extern variable in a few places. That completely avoids all of the modelling discussions for now (while leaving for placeholder FIXMEs so the problem can be tackled later). I don't understand the resistance here. Regards, Anthony Liguori Jan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/20/2011 02:44 AM, Gerd Hoffmann wrote: Hi, For (2), you cannot use bus=X,addr=Y because it makes assumptions about the PCI topology which may change in newer -M pc's. Why should the PCI topology for 'pc' ever change? We'll probably get q35 support some day, but when this lands I expect we'll see a new machine type 'q35', so '-m q35' will pick the ich9 chipset (which will have a different pci topology of course) and '-m pc' will pick the existing piix chipset (which will continue to look like it looks today). But then what's the default machine type? When I say -M pc, I really mean the default machine. At some point, qemu-system-x86_64 -device virtio-net-pci,addr=2.0 Is not going to be a reliable way to invoke qemu because there's no way we can guarantee that slot 2 isn't occupied by a chipset device or some other default device. Regards, Anthony Liguori cheers, Gerd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/20/2011 04:33 AM, Daniel P. Berrange wrote: On Thu, Jan 20, 2011 at 09:44:05AM +0100, Gerd Hoffmann wrote: Hi, For (2), you cannot use bus=X,addr=Y because it makes assumptions about the PCI topology which may change in newer -M pc's. Why should the PCI topology for 'pc' ever change? We'll probably get q35 support some day, but when this lands I expect we'll see a new machine type 'q35', so '-m q35' will pick the ich9 chipset (which will have a different pci topology of course) and '-m pc' will pick the existing piix chipset (which will continue to look like it looks today). If the topology does ever change (eg in the kind of way anthony suggests, first bus only has the graphics card), I think libvirt is going to need a little work to adapt to the new topology, regardless of whether we currently specify a bus= arg to -device or not. I'm not sure there's anything we could do to future proof us to that kind of change. I assume that libvirt today assumes that it can use a set of PCI slots in bus 0, correct? Probably in the range 3-31? Such assumptions are very likely to break. Regards, Anthony Liguori Regards, Daniel -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at
https://bugzilla.kernel.org/show_bug.cgi?id=27052

--- Comment #6 from prochazka prochazka.nico...@gmail.com 2011-01-20 19:45:49 ---
hello, I do not understand, the patch seems to be already applied on the 2.6.37 kernel tree, and my tests are based on this release. NP.
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On Thu, Jan 20, 2011 at 7:37 PM, Anthony Liguori aligu...@linux.vnet.ibm.com wrote:
On 01/20/2011 03:33 AM, Jan Kiszka wrote:
On 2011-01-19 20:32, Blue Swirl wrote:
On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori aligu...@linux.vnet.ibm.com wrote:
On 01/19/2011 07:15 AM, Markus Armbruster wrote:

So they interact with KVM (need kvm_state), and they interact with the emulated PCI bus. Could you elaborate on the fundamental difference between the two interactions that makes you choose the (hypothetical) KVM bus over the PCI bus as device parent?

It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instance, using a linear index).

I think kvm_state should not be a property of any device or bus. It should be split into more logical pieces. Some parts of it could remain in CPUState, because they are associated with a VCPU. Also, for example, irqfd could be considered an object similar to the char or block devices provided by QEMU to devices. Would it make sense to introduce new host types for passing parts of kvm_state to devices? I'd also make the coalesced MMIO stuff part of the memory object. We are not passing any state references when using cpu_physical_memory_rw(), but that could be changed.

There are currently no VCPU-specific bits remaining in kvm_state. It may be a good idea to introduce an arch-specific kvm_state and move related bits over. It may also once be feasible to carve out memory-management-related fields if we have proper abstractions for that, but I'm not completely sure here. Anyway, all these things are secondary.

The primary topic here is how to deal with kvm_state and its fields that have VM-global scope. The debate is really:

1) should we remove all passing of kvm_state and just assume it's static
2) deal with a couple of places in the code where we need to figure out how to get at kvm_state

I think we've only identified 1 real instance of (2), and it's resulted in some good discussions about how to model KVM devices vs. emulated devices. Honestly, (1) just stinks. I see absolutely no advantage to it at all.

Fully agree.

In the very worst case scenario, the thing we need to do is just reference an extern variable in a few places. That completely avoids all of the modelling discussions for now (while leaving placeholder FIXMEs so the problem can be tackled later).

I think KVMState was designed to match the KVM ioctl interface: all the stuff that is needed for talking to KVM or received from KVM is there. But I think this shouldn't be a design driver. If the only pieces of kvm_state that are needed by the devices are irqchip_in_kernel, pit_in_kernel and many_ioeventfds, the problem of passing kvm_state to devices becomes very different. Each of these is just a single bit, affecting only a few devices. Perhaps they could be device properties which the board level sets when KVM is used?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-20 20:27, Blue Swirl wrote:
On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka jan.kis...@siemens.com wrote:
On 2011-01-19 20:32, Blue Swirl wrote:
On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori aligu...@linux.vnet.ibm.com wrote:
On 01/19/2011 07:15 AM, Markus Armbruster wrote:

So they interact with KVM (need kvm_state), and they interact with the emulated PCI bus. Could you elaborate on the fundamental difference between the two interactions that makes you choose the (hypothetical) KVM bus over the PCI bus as device parent?

It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instance, using a linear index).

I think kvm_state should not be a property of any device or bus. It should be split into more logical pieces. Some parts of it could remain in CPUState, because they are associated with a VCPU. Also, for example, irqfd could be considered an object similar to the char or block devices provided by QEMU to devices. Would it make sense to introduce new host types for passing parts of kvm_state to devices? I'd also make the coalesced MMIO stuff part of the memory object. We are not passing any state references when using cpu_physical_memory_rw(), but that could be changed.

There are currently no VCPU-specific bits remaining in kvm_state.

I think the fields vcpu_events, robust_singlestep, debugregs, kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the same for all VCPUs, but they are still sort of CPU properties. I'm not sure about the fd field.

They are all properties of the currently loaded KVM subsystem in the host kernel. They can't change while KVM's root fd is opened. Replicating this static information into each and every VCPU state would be crazy. In fact, services like kvm_has_vcpu_events() already encode this: they are static functions without any kvm_state reference that simply return the content of those fields. Totally inconsistent with this, we force the caller of kvm_check_extension to pass a handle. This is part of my problem with the current situation and any halfhearted steps in this context. Either we work towards eliminating the static KVMState *kvm_state in kvm-all.c, or towards eliminating KVMState.

It may be a good idea to introduce an arch-specific kvm_state and move related bits over.

This should probably contain only irqchip_in_kernel, pit_in_kernel and many_ioeventfds, maybe fd.

fd is the root file descriptor you need for a few KVM services that are not bound to a specific VM - e.g. feature queries. It's not arch-specific. Arch-specific are e.g. the robust_singlestep or xsave feature states.

It may also once be feasible to carve out memory-management-related fields if we have proper abstractions for that, but I'm not completely sure here.

I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region, migration_log into the memory object.

vmfd is the VM-scope file descriptor you need at machine level. The rest logically belongs to a memory object, but I haven't looked at technical details yet.

Anyway, all these things are secondary. The primary topic here is how to deal with kvm_state and its fields that have VM-global scope.

If it is an opaque blob which contains various unrelated stuff, no clear place will be found.

We aren't moving fields yet (and we shouldn't). We first of all need to establish the handle distribution (which apparently requires quite some work in areas beyond KVM).

By the way, we don't have a QEMUState but instead use globals. Perhaps this should be reorganized as well. For the fd field, maybe even using a global variable could be justified, since it is used for direct access to the kernel, not unlike a system call.

The fd field is part of this discussion. Making it global (but local to the kvm subsystem) was an implicit part of my original suggestion. I've no problem with something like a QEMUState, or better a MachineState, that would also include a few KVM-specific fields like the vmfd - just like we already do for CPUState (or should we better introduce a KVM CPU bus... ;) ).

Jan
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-20 20:37, Anthony Liguori wrote:
On 01/20/2011 03:33 AM, Jan Kiszka wrote:
On 2011-01-19 20:32, Blue Swirl wrote:
On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori aligu...@linux.vnet.ibm.com wrote:
On 01/19/2011 07:15 AM, Markus Armbruster wrote:

So they interact with KVM (need kvm_state), and they interact with the emulated PCI bus. Could you elaborate on the fundamental difference between the two interactions that makes you choose the (hypothetical) KVM bus over the PCI bus as device parent?

It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instance, using a linear index).

I think kvm_state should not be a property of any device or bus. It should be split into more logical pieces. Some parts of it could remain in CPUState, because they are associated with a VCPU. Also, for example, irqfd could be considered an object similar to the char or block devices provided by QEMU to devices. Would it make sense to introduce new host types for passing parts of kvm_state to devices? I'd also make the coalesced MMIO stuff part of the memory object. We are not passing any state references when using cpu_physical_memory_rw(), but that could be changed.

There are currently no VCPU-specific bits remaining in kvm_state. It may be a good idea to introduce an arch-specific kvm_state and move related bits over. It may also once be feasible to carve out memory-management-related fields if we have proper abstractions for that, but I'm not completely sure here. Anyway, all these things are secondary.

The primary topic here is how to deal with kvm_state and its fields that have VM-global scope. The debate is really:

1) should we remove all passing of kvm_state and just assume it's static
2) deal with a couple of places in the code where we need to figure out how to get at kvm_state

I think we've only identified 1 real instance of (2), and it's resulted in some good discussions about how to model KVM devices vs. emulated devices. Honestly, (1) just stinks. I see absolutely no advantage to it at all. In the very worst case scenario, the thing we need to do is just reference an extern variable in a few places. That completely avoids all of the modelling discussions for now (while leaving placeholder FIXMEs so the problem can be tackled later).

The PCI bus discussion is surely an interesting outcome, but it is now almost completely off-topic for the original, far less critical issue (as we were discussing internals). I don't understand the resistance here. IMHO, most suggestions on the table are still over-designed (like a KVMBus that only passes a kvm_state - or do you have more features in mind for it?). The idea I love most so far is establishing a machine state that also carries those few KVM bits which correspond to the KVM extension of CPUState. But in the end I want an implementable consensus that helps moving forward with the main topic: the overdue KVM upstream merge. I just do not have a clear picture yet.

Jan
[RFC -v6 PATCH 1/8] sched: check the right -nr_running in yield_task_fair
With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq. Yielding to a task from another cfs_rq may be worthwhile, since a process calling yield typically cannot use the CPU right now. Therefore, we want to check the per-CPU nr_running, not the cgroup-local one.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched_fair.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c62ebae..7b338ac 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1304,7 +1304,7 @@ static void yield_task_fair(struct rq *rq)
 	/*
 	 * Are we the only task in the tree?
 	 */
-	if (unlikely(cfs_rq->nr_running == 1))
+	if (unlikely(rq->nr_running == 1))
 		return;
 
 	clear_buddies(cfs_rq, se);
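The distinction the changelog draws can be illustrated with a toy userspace model (hypothetical Python sketch, not kernel code): under group scheduling a task's own cfs_rq may count only itself, while the per-CPU runqueue still has other runnable work that a yield could benefit.

```python
# Runnable tasks on one CPU, grouped by task group (each group has
# its own cfs_rq under CONFIG_FAIR_GROUP_SCHED).
rq = {"groupA": ["task1"], "groupB": ["task2", "task3"]}

cfs_rq_nr_running = len(rq["groupA"])               # what the old check saw
rq_nr_running = sum(len(t) for t in rq.values())    # per-CPU total

# Old check: task1 looks alone in its group, so yield bails out early...
assert cfs_rq_nr_running == 1
# ...even though the CPU does have other runnable work to give time to.
assert rq_nr_running == 3
```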
[RFC -v6 PATCH 2/8] sched: limit the scope of clear_buddies
The clear_buddies function does not seem to play well with the concept of hierarchical runqueues. In the following tree, task groups are represented by 'G', tasks by 'T', next by 'n' and last by 'l'.

         (nl)
         /  \
    G(nl)    G
    /  \      \
 T(l)  T(n)    T

This situation can arise when task T(n) is woken up and the previously running task T(l) is marked last. When clear_buddies is called from either T(l) or T(n), the next and last buddies of the group G(nl) will both be cleared. This is not the desired result, since we would like to be able to find the other type of buddy in many cases. This is especially a worry when implementing yield_task_fair through the buddy system.

The fix is simple: only clear the buddy type that the task itself is marked as. As an added bonus, we stop walking up the tree when the buddy has already been cleared or points elsewhere.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched_fair.c |   30 +++++++++++++++++++++++-------
 1 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f4ee445..0321473 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -784,19 +784,35 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	__enqueue_entity(cfs_rq, se);
 }
 
-static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void __clear_buddies_last(struct sched_entity *se)
 {
-	if (!se || cfs_rq->last == se)
-		cfs_rq->last = NULL;
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->last == se)
+			cfs_rq->last = NULL;
+		else
+			break;
+	}
+}
 
-	if (!se || cfs_rq->next == se)
-		cfs_rq->next = NULL;
+static void __clear_buddies_next(struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->next == se)
+			cfs_rq->next = NULL;
+		else
+			break;
+	}
 }
 
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	for_each_sched_entity(se)
-		__clear_buddies(cfs_rq_of(se), se);
+	if (cfs_rq->last == se)
+		__clear_buddies_last(se);
+
+	if (cfs_rq->next == se)
+		__clear_buddies_next(se);
 }
 
 static void
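The behaviour the changelog describes - clearing one buddy type without disturbing the other, and stopping the upward walk early - can be shown with a small userspace model (hypothetical Python sketch, not the kernel code; names are illustrative):

```python
class CfsRq:
    # one pair of buddy marks per runqueue level
    def __init__(self):
        self.last = None
        self.next = None

def clear_buddies_last(levels):
    # Walk from the task's own level toward the root, clearing the
    # 'last' mark; stop as soon as a level's mark points elsewhere.
    for rq, entity in levels:
        if rq.last == entity:
            rq.last = None
        else:
            break

# Tree from the changelog: group G contains T(l) ('last') and T(n)
# ('next'); G itself carries both marks at the root level.
root, group = CfsRq(), CfsRq()
root.last = root.next = "G"
group.last, group.next = "Tl", "Tn"

# Clearing only the 'last' chain for T(l) leaves the 'next' chain
# intact, so the scheduler can still find T(n).
clear_buddies_last([(group, "Tl"), (root, "G")])
assert group.last is None and group.next == "Tn"
assert root.last is None and root.next == "G"
```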
[RFC -v6 PATCH 3/8] sched: use a buddy to implement yield_task_fair
Use the buddy mechanism to implement yield_task_fair. This allows us to skip onto the next highest priority se at every level in the CFS tree, unless doing so would introduce gross unfairness in CPU time distribution.

We order the buddy selection in pick_next_entity to check yield first, then last, then next. We need next to be able to override yield, because it is possible for the next and yield tasks to be different processes in the same sub-tree of the CFS tree. When they are, we need to go into that sub-tree regardless of the yield hint, and pick the correct entity once we get to the right level.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..e4e57ff 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -327,7 +327,7 @@ struct cfs_rq {
 	 * 'curr' points to currently running entity on this cfs_rq.
 	 * It is set to NULL otherwise (i.e when none are currently running).
 	 */
-	struct sched_entity *curr, *next, *last;
+	struct sched_entity *curr, *next, *last, *yield;
 
 	unsigned int nr_spread_over;
 
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ad946fd..f701a51 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -384,6 +384,22 @@ static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
 	return rb_entry(left, struct sched_entity, run_node);
 }
 
+static struct sched_entity *__pick_second_entity(struct cfs_rq *cfs_rq)
+{
+	struct rb_node *left = cfs_rq->rb_leftmost;
+	struct rb_node *second;
+
+	if (!left)
+		return NULL;
+
+	second = rb_next(left);
+
+	if (!second)
+		second = left;
+
+	return rb_entry(second, struct sched_entity, run_node);
+}
+
 static struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
 	struct rb_node *last = rb_last(&cfs_rq->tasks_timeline);
@@ -806,6 +822,17 @@ static void __clear_buddies_next(struct sched_entity *se)
 	}
 }
 
+static void __clear_buddies_yield(struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->yield == se)
+			cfs_rq->yield = NULL;
+		else
+			break;
+	}
+}
+
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	if (cfs_rq->last == se)
@@ -813,6 +840,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	if (cfs_rq->next == se)
 		__clear_buddies_next(se);
+
+	if (cfs_rq->yield == se)
+		__clear_buddies_yield(se);
 }
 
 static void
@@ -926,13 +956,27 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int
 wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
 
+/*
+ * Pick the next process, keeping these things in mind, in this order:
+ * 1) keep things fair between processes/task groups
+ * 2) pick the 'next' process, since someone really wants that to run
+ * 3) pick the 'last' process, for cache locality
+ * 4) do not run the 'yield' process, if something else is available
+ */
 static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se = __pick_next_entity(cfs_rq);
 	struct sched_entity *left = se;
 
-	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
-		se = cfs_rq->next;
+	/*
+	 * Avoid running the yield buddy, if running something else can
+	 * be done without getting too unfair.
+	 */
+	if (cfs_rq->yield == se) {
+		struct sched_entity *second = __pick_second_entity(cfs_rq);
+		if (wakeup_preempt_entity(second, left) < 1)
+			se = second;
+	}
 
 	/*
 	 * Prefer last buddy, try to return the CPU to a preempted task.
@@ -940,6 +984,12 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
 		se = cfs_rq->last;
 
+	/*
+	 * Someone really wants this to run. If it's not unfair, run it.
+	 */
+	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+		se = cfs_rq->next;
+
 	clear_buddies(cfs_rq, se);
 
 	return se;
@@ -1096,52 +1146,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }
 
-/*
- * sched_yield() support is very simple - we dequeue and enqueue.
- *
- * If compat_yield is turned on then we requeue to the end of the tree.
- */
-static void yield_task_fair(struct rq *rq)
-{
-	struct task_struct *curr = rq->curr;
-	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
-	struct sched_entity *rightmost, *se = &curr->se;
-
-	/*
-	 * Are we the only task in the tree?
-	 */
-	if (unlikely(rq->nr_running == 1))
-		return;
-
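The resulting precedence in pick_next_entity - 'next' overrides 'last', which overrides stepping off the yielder, all gated by the same fairness check - can be sketched in a toy model (hypothetical Python, not kernel code; `preempt_ok` is a crude stand-in for `wakeup_preempt_entity() < 1`):

```python
class SE:
    def __init__(self, name, vruntime):
        self.name, self.vruntime = name, vruntime

def preempt_ok(cand, left, gran=2):
    # Stand-in for "wakeup_preempt_entity(cand, left) < 1": a candidate
    # may run if it is not too far ahead of the fairest entity.
    return cand is not None and cand.vruntime - left.vruntime < gran

def pick_next(sorted_rq, yield_se=None, last=None, nxt=None):
    left = sorted_rq[0]              # leftmost = fairest entity
    se = left
    if yield_se is se and len(sorted_rq) > 1:
        second = sorted_rq[1]        # step off the yielder if fair
        if preempt_ok(second, left):
            se = second
    if preempt_ok(last, left):
        se = last                    # prefer 'last' for cache locality
    if preempt_ok(nxt, left):
        se = nxt                     # 'next' has the final say
    return se

a, b, c = SE("A", 10), SE("B", 11), SE("C", 11)
rq = [a, b, c]
assert pick_next(rq).name == "A"                      # plain leftmost
assert pick_next(rq, yield_se=a).name == "B"          # yielder skipped
assert pick_next(rq, last=b).name == "B"              # last buddy wins
assert pick_next(rq, yield_se=a, nxt=c).name == "C"   # next overrides
```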
[RFC -v6 PATCH 4/8] sched: Add yield_to(task, preempt) functionality
From: Mike Galbraith efa...@gmx.de

Currently only implemented for fair class tasks.

Add a yield_to_task() method to the fair scheduling class, allowing the caller of yield_to() to accelerate another thread in its thread group / task group.

Implemented via a scheduler hint, using cfs_rq->next to encourage the target being selected. We can rely on pick_next_entity to keep things fair, so no one can accelerate a thread that has already used its fair share of CPU time. This also means callers should only call yield_to when they really mean it. Calling it too often can result in the scheduler just ignoring the hint.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..6c43fc4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1047,6 +1047,7 @@ struct sched_class {
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task) (struct rq *rq);
+	bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
@@ -1943,6 +1944,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p)		do { } while (0)
 #endif
 
+extern bool yield_to(struct task_struct *p, bool preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index e4e57ff..1f38ed2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5270,6 +5270,64 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ *
+ * Returns true if we indeed boosted the target task.
+ */
+bool __sched yield_to(struct task_struct *p, bool preempt)
+{
+	struct task_struct *curr = current;
+	struct rq *rq, *p_rq;
+	unsigned long flags;
+	bool yielded = 0;
+
+	local_irq_save(flags);
+	rq = this_rq();
+
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	while (task_rq(p) != p_rq) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	if (!curr->sched_class->yield_to_task)
+		goto out;
+
+	if (curr->sched_class != p->sched_class)
+		goto out;
+
+	if (task_running(p_rq, p) || p->state)
+		goto out;
+
+	if (!same_thread_group(p, curr))
+		goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+	if (task_group(p) != task_group(curr))
+		goto out;
+#endif
+
+	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+
+	if (yielded)
+		yield();
+
+	return yielded;
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f701a51..097e936 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1800,6 +1800,23 @@ static void yield_task_fair(struct rq *rq)
 	set_yield_buddy(se);
 }
 
+static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt)
+{
+	struct sched_entity *se = &p->se;
+
+	if (!se->on_rq)
+		return false;
+
+	/* Tell the scheduler that we'd really like pse to run next. */
+	set_next_buddy(se);
+
+	/* Make p's CPU reschedule; pick_next_entity takes care of fairness. */
+	if (preempt)
+		resched_task(rq->curr);
+
+	return true;
+}
+
 #ifdef CONFIG_SMP
 /**
  * Fair scheduling class load-balancing methods:
@@ -3993,6 +4010,7 @@ static const struct sched_class fair_sched_class = {
 	.enqueue_task		= enqueue_task_fair,
 	.dequeue_task		= dequeue_task_fair,
 	.yield_task		= yield_task_fair,
+	.yield_to_task		= yield_to_task_fair,
 
 	.check_preempt_curr	= check_preempt_wakeup,
[RFC -v6 PATCH 8/8] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic slowdowns of certain workloads, we instead use yield_to to get another VCPU in the same KVM guest to run sooner. This seems to give a 10-15% speedup in certain workloads, versus not having PLE at all.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d56ed5..fab2250 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -187,6 +187,7 @@ struct kvm {
 #endif
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	atomic_t online_vcpus;
+	int last_boosted_vcpu;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 86c4905..8b761ba 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1292,18 +1292,55 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);
 
-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
-	ktime_t expires;
-	DEFINE_WAIT(wait);
-
-	prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
-
-	/* Sleep for 100 us, and hope lock-holder got scheduled */
-	expires = ktime_add_ns(ktime_get(), 100000UL);
-	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+	struct kvm *kvm = me->kvm;
+	struct kvm_vcpu *vcpu;
+	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+	int yielded = 0;
+	int pass;
+	int i;
 
-	finish_wait(&vcpu->wq, &wait);
+	/*
+	 * We boost the priority of a VCPU that is runnable but not
+	 * currently running, because it got preempted by something
+	 * else and called schedule in __vcpu_run.  Hopefully that
+	 * VCPU is holding the lock that we need and will release it.
+	 * We approximate round-robin by starting at the last boosted VCPU.
+	 */
+	for (pass = 0; pass < 2 && !yielded; pass++) {
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			struct task_struct *task = NULL;
+			struct pid *pid;
+			if (!pass && i <= last_boosted_vcpu) {
+				i = last_boosted_vcpu;
+				continue;
+			} else if (pass && i > last_boosted_vcpu)
+				break;
+			if (vcpu == me)
+				continue;
+			if (waitqueue_active(&vcpu->wq))
+				continue;
+			rcu_read_lock();
+			pid = rcu_dereference(vcpu->pid);
+			if (pid)
+				task = get_pid_task(vcpu->pid, PIDTYPE_PID);
+			rcu_read_unlock();
+			if (!task)
+				continue;
+			if (task->flags & PF_VCPU) {
+				put_task_struct(task);
+				continue;
+			}
+			if (yield_to(task, 1)) {
+				put_task_struct(task);
+				kvm->last_boosted_vcpu = i;
+				yielded = 1;
+				break;
+			}
+			put_task_struct(task);
+		}
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
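The two-pass round-robin scan above can be modelled in userspace (hypothetical Python sketch; the real loop also skips halted VCPUs and takes RCU/task references):

```python
def boost_candidates(n_vcpus, last_boosted, me):
    # Return the order in which VCPUs are considered for a boost.
    order = []
    for pass_ in range(2):
        for i in range(n_vcpus):
            if not pass_ and i <= last_boosted:
                continue            # pass 0: only VCPUs after the mark
            if pass_ and i > last_boosted:
                break               # pass 1: stop where pass 0 started
            if i == me:
                continue            # never yield to ourselves
            order.append(i)
    return order

# 6 VCPUs, VCPU 0 is spinning, VCPU 3 was boosted last time:
# the scan wraps around, visiting every other VCPU exactly once.
assert boost_candidates(6, 3, me=0) == [4, 5, 1, 2, 3]
assert boost_candidates(4, 3, me=1) == [0, 2, 3]
```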
[RFC -v6 PATCH 7/8] kvm: keep track of which task is running a KVM vcpu
Keep track of which task is running a KVM vcpu. This helps us figure out later what task to wake up if we want to boost a vcpu that got preempted. Unfortunately there are no guarantees that the same task always keeps the same vcpu, so we can only track the task across a single run of the vcpu.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..9d56ed5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
 #endif
 	int vcpu_id;
 	struct mutex mutex;
+	struct pid *pid;
 	int cpu;
 	atomic_t guest_mode;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..86c4905 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -185,6 +185,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->cpu = -1;
 	vcpu->kvm = kvm;
 	vcpu->vcpu_id = id;
+	vcpu->pid = NULL;
 	init_waitqueue_head(&vcpu->wq);
 
 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -208,6 +209,8 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_init);
 
 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
+	if (vcpu->pid)
+		put_pid(vcpu->pid);
 	kvm_arch_vcpu_uninit(vcpu);
 	free_page((unsigned long)vcpu->run);
 }
@@ -1456,6 +1459,14 @@ static long kvm_vcpu_ioctl(struct file *filp,
 		r = -EINVAL;
 		if (arg)
 			goto out;
+		if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
+			/* The thread running this VCPU changed. */
+			struct pid *oldpid = vcpu->pid;
+			struct pid *newpid = get_task_pid(current, PIDTYPE_PID);
+			rcu_assign_pointer(vcpu->pid, newpid);
+			synchronize_rcu();
+			put_pid(oldpid);
+		}
 		r = kvm_arch_vcpu_ioctl_run(vcpu, vcpu->run);
 		break;
 	case KVM_GET_REGS: {
[RFC -v6 PATCH 0/8] directed yield for Pause Loop Exiting
When running SMP virtual machines, it is possible for one VCPU to be spinning on a spinlock while the VCPU that holds the spinlock is not currently running, because the host scheduler preempted it to run something else.

Both Intel and AMD CPUs have a feature that detects when a virtual CPU is spinning on a lock and will trap to the host. The current KVM code sleeps for a bit whenever that happens, which results in, e.g., a 64 VCPU Windows guest taking forever and a bit to boot up. This is because the VCPU holding the lock is actually running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with spinlock detection resulting in one guest giving up its CPU time to the others. Instead of spinning, it ends up simply not running much at all.

This patch series aims to fix that, by having a VCPU that spins give the remainder of its timeslice to another VCPU in the same guest before yielding the CPU - one that is runnable but got preempted, hopefully the lock holder.

-v6:
- implement yield_task_fair in a way that works with task groups, this allows me to actually get a performance improvement!
- fix another race Avi pointed out, the code should be good now

-v5:
- fix the race condition Avi pointed out, by tracking vcpu->pid
- also allows us to yield to vcpu tasks that got preempted while in qemu userspace

-v4:
- change to newer version of Mike Galbraith's yield_to implementation
- chainsaw out some code from Mike that looked like a great idea, but turned out to give weird interactions in practice

-v3:
- more cleanups
- change to Mike Galbraith's yield_to implementation
- yield to spinning VCPUs, this seems to work better in some situations and has little downside potential

-v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff yet, since it is not entirely clear how to do that right (suggestions welcome)

Benchmark results:

Two 4-CPU KVM guests are pinned to the same 4 physical CPUs. One guest runs the AMQP performance test, the other guest runs 0, 2 or 4 infinite loops, for CPU overcommit factors of 1, 1.5 and 2. The AMQP perftest is run 30 times, with 8 and 16 threads.

8thr     no overcommit   1.5x overcommit   2x overcommit
no PLE          223801            135137          104951
PLE             224135            141105          118744

16thr    no overcommit   1.5x overcommit   2x overcommit
no PLE          222424            126175          105299
PLE             222534            138082          132945

Note: this is with the KVM guests NOT running inside cgroups. There seems to be a CPU load balancing issue with cgroup fair group scheduling, which often results in one guest getting only 80% CPU time and the other guest 320%. That will have to be fixed to get meaningful results with cgroups.

CPU time division between the AMQP guest and the infinite loop guest was not exactly fair, but the guests got close to the same amount of CPU time in each test run. There is a substantial amount of randomness in CPU time division between guests, but the performance improvement is consistent between multiple runs.

--
All rights reversed.
[RFC -v6 PATCH 6/8] export pid symbols needed for kvm_vcpu_on_spin
Export the symbols required for a race-free kvm_vcpu_on_spin.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..adc8f47 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -191,6 +191,7 @@ void __put_task_struct(struct task_struct *tsk)
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
+EXPORT_SYMBOL_GPL(__put_task_struct);
 
 /*
  * macro override instead of weak attribute alias, to workaround
diff --git a/kernel/pid.c b/kernel/pid.c
index 39b65b6..02f2212 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -435,6 +435,7 @@ struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
 	rcu_read_unlock();
 	return pid;
 }
+EXPORT_SYMBOL_GPL(get_task_pid);
 
 struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
 {
@@ -446,6 +447,7 @@ struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
 	rcu_read_unlock();
 	return result;
 }
+EXPORT_SYMBOL_GPL(get_pid_task);
 
 struct pid *find_get_pid(pid_t nr)
 {
[RFC -v6 PATCH 5/8] sched: drop superfluous tests from yield_to
Fairness is enforced by pick_next_entity, so we can drop some superfluous tests from yield_to.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched.c |    8 --------
 1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1f38ed2..398eedf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5307,14 +5307,6 @@ again:
 	if (task_running(p_rq, p) || p->state)
 		goto out;
 
-	if (!same_thread_group(p, curr))
-		goto out;
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	if (task_group(p) != task_group(curr))
-		goto out;
-#endif
-
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 
 out:
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On Thu, Jan 20, 2011 at 9:22 PM, Jan Kiszka jan.kis...@web.de wrote:
On 2011-01-20 20:27, Blue Swirl wrote:
On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka jan.kis...@siemens.com wrote:
On 2011-01-19 20:32, Blue Swirl wrote:
On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori aligu...@linux.vnet.ibm.com wrote:
On 01/19/2011 07:15 AM, Markus Armbruster wrote:

So they interact with KVM (need kvm_state), and they interact with the emulated PCI bus. Could you elaborate on the fundamental difference between the two interactions that makes you choose the (hypothetical) KVM bus over the PCI bus as device parent?

It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instance, using a linear index).

I think kvm_state should not be a property of any device or bus. It should be split into more logical pieces. Some parts of it could remain in CPUState, because they are associated with a VCPU. Also, for example, irqfd could be considered an object similar to the char or block devices provided by QEMU to devices. Would it make sense to introduce new host types for passing parts of kvm_state to devices? I'd also make the coalesced MMIO stuff part of the memory object. We are not passing any state references when using cpu_physical_memory_rw(), but that could be changed.

There are currently no VCPU-specific bits remaining in kvm_state.

I think the fields vcpu_events, robust_singlestep, debugregs, kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the same for all VCPUs, but they are still sort of CPU properties. I'm not sure about the fd field.

They are all properties of the currently loaded KVM subsystem in the host kernel. They can't change while KVM's root fd is opened. Replicating this static information into each and every VCPU state would be crazy.

Then each CPUX86State could have a pointer to a common structure.

In fact, services like kvm_has_vcpu_events() already encode this: they are static functions without any kvm_state reference that simply return the content of those fields. Totally inconsistent with this, we force the caller of kvm_check_extension to pass a handle. This is part of my problem with the current situation and any halfhearted steps in this context. Either we work towards eliminating the static KVMState *kvm_state in kvm-all.c, or towards eliminating KVMState.

If the CPU-related fields are accessible through CPUState, the handle should be available.

It may be a good idea to introduce an arch-specific kvm_state and move related bits over.

This should probably contain only irqchip_in_kernel, pit_in_kernel and many_ioeventfds, maybe fd.

fd is the root file descriptor you need for a few KVM services that are not bound to a specific VM - e.g. feature queries. It's not arch-specific. Arch-specific are e.g. the robust_singlestep or xsave feature states.

By arch, do you mean the guest CPU architecture? Those are not machine features.

It may also once be feasible to carve out memory-management-related fields if we have proper abstractions for that, but I'm not completely sure here.

I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region, migration_log into the memory object.

vmfd is the VM-scope file descriptor you need at machine level. The rest logically belongs to a memory object, but I haven't looked at technical details yet.

Anyway, all these things are secondary. The primary topic here is how to deal with kvm_state and its fields that have VM-global scope.

If it is an opaque blob which contains various unrelated stuff, no clear place will be found.

We aren't moving fields yet (and we shouldn't). We first of all need to establish the handle distribution (which apparently requires quite some work in areas beyond KVM).

But I think this is exactly the problem. If the handle is for the current KVMState, you'll indeed need it in various places and passing it around will be cumbersome. By moving the fields around, the information should be available more naturally.

By the way, we don't have a QEMUState but instead use globals. Perhaps this should be reorganized as well. For the fd field, maybe even using a global variable could be justified, since it is used for direct access to the kernel, not unlike a system call.

The fd field is part of this discussion. Making it global (but local to the kvm subsystem) was an implicit part of my original suggestion. I've no problem with something like a QEMUState, or better a MachineState, that would also include a few KVM-specific fields like the vmfd - just like we already do for CPUState (or should we
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-20 21:02, Blue Swirl wrote: I think KVMState was designed to match KVM ioctl interface: all stuff that is needed for talking to KVM or received from KVM are there. But I think this shouldn't be a design driver. Agreed. The nice cleanup would probably be the complete assimilation of KVMState by something bigger of comparable scope. If a machine was brought up with KVM support, every device that refers to this machine (as it is supposed to become part of it) should be able to use KVM services in order to accelerate its model. If the only pieces of kvm_state that are needed by the devices are irqchip_in_kernel, pit_in_kernel and many_ioeventfds, the problem of passing kvm_state to devices becomes very different. Each of these are just single bits, affecting only a few devices. Perhaps they could be device properties which the board level sets when KVM is used? Forget about the static capabilities for now. The core of kvm_state are handles that enable you to use KVM services and maybe state fields that have machine scope (unless we find more local homes like a memory object). Those need to be accessible by the kvm layer when servicing requests of components that are related to that very same machine. Jan
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-20 22:40, Blue Swirl wrote: On Thu, Jan 20, 2011 at 9:22 PM, Jan Kiszka jan.kis...@web.de wrote: On 2011-01-20 20:27, Blue Swirl wrote: On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka jan.kis...@siemens.com wrote: On 2011-01-19 20:32, Blue Swirl wrote: On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori aligu...@linux.vnet.ibm.com wrote: On 01/19/2011 07:15 AM, Markus Armbruster wrote: So they interact with KVM (need kvm_state), and they interact with the emulated PCI bus. Could you elaborate on the fundamental difference between the two interactions that makes you choose the (hypothetical) KVM bus over the PCI bus as device parent? It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instancing, using a linear index). I think kvm_state should not be a property of any device or bus. It should be split to more logical pieces. Some parts of it could remain in CPUState, because they are associated with a VCPU. Also, for example irqfd could be considered to be similar object to char or block devices provided by QEMU to devices. Would it make sense to introduce new host types for passing parts of kvm_state to devices? I'd also make coalesced MMIO stuff part of memory object. We are not passing any state references when using cpu_physical_memory_rw(), but that could be changed. There are currently no VCPU-specific bits remaining in kvm_state. I think fields vcpu_events, robust_singlestep, debugregs, kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the same for all VCPUs but still they are sort of CPU properties. I'm not sure about fd field. 
They are all properties of the currently loaded KVM subsystem in the host kernel. They can't change while KVM's root fd is opened. Replicating this static information into each and every VCPU state would be crazy. Then each CPUX86State could have a pointer to common structure. That already exists. In fact, services like kvm_has_vcpu_events() already encode this: they are static functions without any kvm_state reference that simply return the content of those fields. Totally inconsistent to this, we force the caller of kvm_check_extension to pass a handle. This is part of my problem with the current situation and any halfhearted steps in this context. Either we work towards eliminating static KVMState *kvm_state in kvm-all.c or eliminating KVMState. If the CPU related fields are accessible through CPUState, the handle should be available. It may be a good idea to introduce an arch-specific kvm_state and move related bits over. This should probably contain only irqchip_in_kernel, pit_in_kernel and many_ioeventfds, maybe fd. fd is that root file descriptor you need for a few KVM services that are not bound to a specific VM - e.g. feature queries. It's not arch specific. Arch specific are e.g. robust_singlestep or xsave feature states. By arch you mean guest CPU architecture? They are not machine features. No, they are practically static host features. It may also once be feasible to carve out memory management related fields if we have proper abstractions for that, but I'm not completely sure here. I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region, migration_log into the memory object. vmfd is the VM-scope file descriptor you need at machine-level. The rest logically belongs to a memory object, but I haven't looked at technical details yet. Anyway, all these things are secondary. The primary topic here is how to deal with kvm_state and its fields that have VM-global scope. 
If it is an opaque blob which contains various unrelated stuff, no clear place will be found. We aren't moving fields yet (and we shouldn't). We first of all need to establish the handle distribution (which apparently requires quite some work in areas beyond KVM). But I think this is exactly the problem. If the handle is for the current KVMState, you'll indeed need it in various places and passing it around will be cumbersome. By moving the fields around, the information should be available more naturally. Yeah, if we had a MachineState or if we could agree on introducing it, I'm with you again. Improving the currently cumbersome KVM API interaction was the main motivation for my original patch. Jan
Re: suspending in kvm and resuming in qemu
On Tue, Jan 18, 2011 at 04:08:33AM -0800, Mehul Chadha wrote: Hi, I have been trying to get suspending and resuming done across kvm and qemu. While resuming a suspended state in kvm, an error was generated saying unknown section kvmclock. I modified loadvm in qemu to ignore the kvmclock section. Now, when I suspend and resume, the qemu screen just stalls at the suspended state and it seems nothing is executing. Please give any input on where I should start looking. Thanks, Mehul The guest relies on kvmclock support, which is not implemented by qemu alone. So you'd have to disable kvmclock support (with the no-kvmclock kernel boot option) in the guest.
Re: [PATCH] KVM: x86: Remove user space triggerable MCE error message
On Sat, Jan 15, 2011 at 10:00:53AM +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com This case is a pure user space error we do not need to record. Moreover, it can be misused to flood the kernel log. Remove it. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Applied, thanks.
Re: [PATCH v2] device-assignment: Properly terminate vmsd.fields
On Mon, Jan 17, 2011 at 10:17:49AM -0700, Alex Williamson wrote: The vmsd code expects the fields structure to be properly terminated, not NULL. An assigned device should never be saved or restored, and recent qemu fixes to the no_migrate flag should ensure this, but let's avoid setting the wrong precedent. Signed-off-by: Alex Williamson alex.william...@redhat.com --- v2: - Change to commit log only, was device-assignment: Add fields to VMStateDescription. Recent qemu.git no_migrate changes avoid potential segfault, but this should still be applied for correctness. hw/device-assignment.c | 5 ++++- 1 files changed, 4 insertions(+), 1 deletions(-) Applied, thanks.
Re: [PATCH] virtio-pci: mask notifier error handling fixups
On Tue, Jan 18, 2011 at 03:42:57PM +0200, Michael S. Tsirkin wrote: Fix virtio-pci error handling in the mask notifiers: be careful to undo exactly what we did so far. Reported-by: Alex Williamson alex.william...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This is for qemu-kvm only. hw/virtio-pci.c | 12 ++++++++++++ 1 files changed, 12 insertions(+), 0 deletions(-) Applied, thanks.
Re: [PATCH] KVM: fix build warning within __kvm_set_memory_region() on s390
On Mon, Jan 17, 2011 at 09:21:08PM +0100, Heiko Carstens wrote: From: Heiko Carstens heiko.carst...@de.ibm.com Get rid of this warning: CC arch/s390/kvm/../../../virt/kvm/kvm_main.o arch/s390/kvm/../../../virt/kvm/kvm_main.c:596:12: warning: 'kvm_create_dirty_bitmap' defined but not used The only caller of the function is within a !CONFIG_S390 section, so add the same ifdef around kvm_create_dirty_bitmap() as well. Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com Applied, thanks.
Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally
On Mon, Jan 17, 2011 at 08:47:39AM +0800, Huang Ying wrote: Hi, Andrew, On Sun, 2011-01-16 at 23:35 +0800, Avi Kivity wrote: On 01/14/2011 03:37 AM, Huang Ying wrote: On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote: On 01/13/2011 10:42 AM, Huang Ying wrote: Make __get_user_pages return -EHWPOISON for HWPOISON pages only if FOLL_HWPOISON is specified. With this patch, the interested callers can distinguish HWPOISON pages from general FAULT pages, while other callers will still get -EFAULT for those pages, so the user space interface need not be changed. get_user_pages_hwpoison is added as a variant of get_user_pages that can return -EHWPOISON for HWPOISON pages. This feature is needed by KVM, where UCR MCE should be relayed to the guest for HWPOISON pages, while instruction emulation and MMIO will be tried for general FAULT pages. The idea comes from Andrew Morton. Signed-off-by: Huang Ying ying.hu...@intel.com Cc: Andrew Morton a...@linux-foundation.org

+#ifdef CONFIG_MEMORY_FAILURE
+int get_user_pages_hwpoison(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long start, int nr_pages, int write,
+			    int force, struct page **pages,
+			    struct vm_area_struct **vmas);
+#else

Since we'd also like to add get_user_pages_noio(), perhaps adding a flags field to get_user_pages() is better. Or export __get_user_pages()? That's better, yes. Do you think it is a good idea to export __get_user_pages() instead of adding get_user_pages_noio() and get_user_pages_hwpoison()? Better if Andrew and/or Linus decide it.
Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, 2011-01-20 at 19:47 +0200, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote: On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments? Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED? -Sridhar The error is reported by virtio-pci which does not know about vhost. I started with EVIRTIO_MSIX_DISABLED and made it shorter. Would EVIRTIO_MSIX_DISABLED be better? I think so. This makes it more clear.
-Sridhar

 hw/vhost.c      |  4 +++-
 hw/virtio-net.c |  6 ++++--
 hw/virtio-pci.c |  3 +++
 hw/virtio.h     |  2 ++
 4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1d09ed0..c79765a 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
     r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        if (r != -EVIRTIO_DISABLED) {
+            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+        }
         goto fail_notifiers;
     }

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ccb3e63..5de3fee 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
     if (!n->vhost_started) {
         int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
         if (r < 0) {
-            error_report("unable to start vhost net: %d: "
-                         "falling back on userspace virtio", -r);
+            if (r != -EVIRTIO_DISABLED) {
+                error_report("unable to start vhost net: %d: "
+                             "falling back on userspace virtio", -r);
+            }
         } else {
             n->vhost_started = 1;
         }

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index dd8887a..dbf4be0 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
     EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);

     if (assign) {
+        if (!msix_enabled(&proxy->pci_dev)) {
+            return -EVIRTIO_DISABLED;
+        }
         int r = event_notifier_init(notifier, 0);
         if (r < 0) {
             return r;

diff --git a/hw/virtio.h b/hw/virtio.h
index d8546d5..53bbdba 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -98,6 +98,8 @@ typedef struct {
     void (*vmstate_change)(void * opaque, bool running);
 } VirtIOBindings;

+#define EVIRTIO_DISABLED ERANGE
+
 #define VIRTIO_PCI_QUEUE_MAX 64
 #define VIRTIO_NO_VECTOR 0x
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkinm...@redhat.com I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says use vhost where it helps but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. In the very least, there needs to be a vhost=force. Having some sort of friendly default policy is fine but we need to provide a mechanism for a user to have the final say. If you want to redefine vhost=on to really mean, use the friendly default, that's fine by me, but only if the vhost=force option exists. I actually would think libvirt would want to use vhost=force. Debugging with vhost=on is going to be a royal pain in the ass if a user reports bad performance. Given the libvirt XML, you can't actually tell from the guest and the XML whether or not vhost was actually in use or not. 
Regards, Anthony Liguori Maybe this is best handled by a documentation update? We always said: use vhost=on to enable experimental in kernel accelerator\n note 'enable' not 'require'. This is similar to how we specify nvectors: you can not make the guest use the feature. How about this:

diff --git a/qemu-options.hx b/qemu-options.hx
index 898561d..3c937c1 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1061,6 +1061,7 @@ DEF(net, HAS_ARG, QEMU_OPTION_net,
     "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
     "use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
     "use vhost=on to enable experimental in kernel accelerator\n"
+    "(note: vhost=on has no effect unless guest uses MSI-X)\n"
     "use 'vhostfd=h' to connect to an already opened vhost net device\n"
 #endif
     "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Thu, 2011-01-20 at 18:23 -0600, Anthony Liguori wrote: On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkinm...@redhat.com I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says use vhost where it helps but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. In the very least, there needs to be a vhost=force. Having some sort of friendly default policy is fine but we need to provide a mechanism for a user to have the final say. If you want to redefine vhost=on to really mean, use the friendly default, that's fine by me, but only if the vhost=force option exists. I actually would think libvirt would want to use vhost=force. Debugging with vhost=on is going to be a royal pain in the ass if a user reports bad performance. 
Given the libvirt XML, you can't actually tell from the guest and the XML whether or not vhost was actually in use or not. If we add a force option, let's please distinguish hotplug from VM creation time. The latter can abort. Hotplug should print an error and fail the initfn. Thanks, Alex
Re: Flow Control and Port Mirroring Revisited
[ Trimmed Eric from CC list as vger was complaining that it is too long ] On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote: So it won't be all that simple to implement well, and before we try, I'd like to know whether there are applications that are helped by it. For example, we could try to measure latency at various pps and see whether the backpressure helps. netperf has -b, -w flags which might help these measurements. Those options are enabled when one adds --enable-burst to the pre-compilation ./configure of netperf (one doesn't have to recompile netserver). However, if one is also looking at latency statistics via the -j option in the top-of-trunk, or simply at the histogram with --enable-histogram on the ./configure and a verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf from: Hi, I have constructed a test where I run an un-paced UDP_STREAM test in one guest and a paced omni rr test in another guest at the same time. Briefly, I get the following results from the omni test:

1. Omni test only:        MEAN_LATENCY=272.00
2. Omni and stream test:  MEAN_LATENCY=3423.00
3. cpu and net_cls group: MEAN_LATENCY=493.00
   As per 2 plus cgroups are created for each guest and guest tasks added to the groups
4. 100Mbit/s class:       MEAN_LATENCY=273.00
   As per 3 plus the net_cls groups each have a 100MBit/s HTB class
5. cpu.shares=128:        MEAN_LATENCY=652.00
   As per 4 plus the cpu groups have cpu.shares set to 128
6. Busy CPUs:             MEAN_LATENCY=15126.00
   As per 5 but the CPUs are made busy using a simple shell while loop

There is a bit of noise in the results as the two netperf invocations aren't started at exactly the same moment. For reference, my netperf invocations are:

netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr -k foo -b 1 -w 200 -m 200

foo contains:

PROTOCOL
THROUGHPUT,THROUGHPUT_UNITS
LOCAL_SEND_THROUGHPUT
LOCAL_RECV_THROUGHPUT
REMOTE_SEND_THROUGHPUT
REMOTE_RECV_THROUGHPUT
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
LOCAL_CPU_UTIL,REMOTE_CPU_UTIL
Re: Flow Control and Port Mirroring Revisited
Simon Horman wrote: [ Trimmed Eric from CC list as vger was complaining that it is too long ] ... I have constructed a test where I run an un-paced UDP_STREAM test in one guest and a paced omni rr test in another guest at the same time. Breifly I get the following results from the omni test.. ... There is a bit of noise in the results as the two netperf invocations aren't started at exactly the same moment For reference, my netperf invocations are: netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12 netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr -k foo -b 1 -w 200 -m 200 Since the -b and -w are in the test-specific portion, this test was not actually paced. The -w will have been ignored entirely (IIRC) and the -b will have attempted to set the burst size of a --enable-burst ./configured netperf. If netperf was ./configured that way, it will have had two rr transactions in flight at one time - the regular one and then the one additional from the -b option. If netperf was not ./configured with --enable-burst then a warning message should have been emitted. Also, I am guessing you wanted TCP_NODELAY set, and that is -D but not a global -D. I'm reasonably confident the -m 200 will have been ignored, but it would be best to drop it. So, I think your second line needs to be: netperf.omni -p 12866 -c -C -H 172.17.60.216 -t omni -j -v 2 -b 1 -w 200 -- -r 1 -d rr -k foo -D If you want the request and response sizes to be 200 bytes, use -r 200 (test-specific). Also, if you ./configure with --enable-omni first, that netserver will understand both omni and non-omni tests at the same time and you don't have to have a second netserver on a different control port. You can also go-in to config.h after the ./configure and unset WANT_MIGRATION and then UDP_STREAM in netperf will be the true classic UDP_STREAM code rather than the migrated to omni path. 
foo contains PROTOCOL THROUGHPUT,THROUGHPUT_UNITS LOCAL_SEND_THROUGHPUT LOCAL_RECV_THROUGHPUT REMOTE_SEND_THROUGHPUT REMOTE_RECV_THROUGHPUT RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY LOCAL_CPU_UTIL,REMOTE_CPU_UTIL As the -k file parsing option didn't care until recently (within the hour or so), I think it didn't matter that you had more than four lines (assuming that is a verbatim cat of foo). However, if you pull the *current* top of trunk, it will probably start to care - I'm in the midst of adding support for direct output selection in the -k, -o and -O options and also cleaning-up the omni printing code to the point where there is only the one routine parsing the output selection file. Currently that is the one for human output, which has a four line restriction. I will try to make it smarter as I go. happy benchmarking, rick jones
[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at
https://bugzilla.kernel.org/show_bug.cgi?id=27052 --- Comment #7 from Marcelo Tosatti mtosa...@redhat.com 2011-01-21 03:27:36 --- Nicolas, My bad. Can you please try the following patch. -- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are watching the assignee of the bug.
[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at
https://bugzilla.kernel.org/show_bug.cgi?id=27052 --- Comment #8 from Marcelo Tosatti mtosa...@redhat.com 2011-01-21 03:29:36 --- Created an attachment (id=44552) -- (https://bugzilla.kernel.org/attachment.cgi?id=44552) update sp->gfns on pte update path
[PATCH] V2 Handle guest access to BBL_CR_CTL3 MSR
[Resubmit of prior version which contained a wayward patch hunk. Thanks Marcelo] A correction to Intel cpu model CPUID data (patch queued) caused winxp to BSOD when booted with a Penryn model. This was traced to the CPUID model field correction from 6 to 23 (as is proper for a Penryn class of cpu). Only in this case does the problem surface. The cause for this failure is winxp accessing the BBL_CR_CTL3 MSR, which is unsupported by current kvm, appears to be a legacy MSR not fully characterized yet existing in current silicon, and is apparently carried forward in MSR space to accommodate vintage code as here. It is not yet conclusive whether this MSR implements any of its legacy functionality or is just an ornamental dud for compatibility. While I found no silicon version specific documentation link to this MSR, a general description exists in Intel's developer's reference which agrees with the functional behavior of other bootloader/kernel code I've examined accessing BBL_CR_CTL3. Regrettably winxp appears to be setting bit #19, called out as reserved in the above document. So to minimally accommodate this MSR, kvm msr get will provide the equivalent mock data and kvm msr write will simply toss the guest passed data without interpretation. While this treatment of BBL_CR_CTL3 addresses the immediate problem, the approach may be modified pending clarification from Intel.
Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 4d0dfa0..5bfafb6 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -38,6 +38,7 @@
 #define MSR_MTRRcap			0x00fe

 #define MSR_IA32_BBL_CR_CTL		0x0119
+#define MSR_IA32_BBL_CR_CTL3		0x011e

 #define MSR_IA32_SYSENTER_CS		0x0174
 #define MSR_IA32_SYSENTER_ESP		0x0175
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bcc0efc..04d6c55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1592,6 +1592,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 		} else
 			return set_msr_hyperv(vcpu, msr, data);
 		break;
+	case MSR_IA32_BBL_CR_CTL3:
+		/* Drop writes to this legacy MSR -- see rdmsr
+		 * counterpart for further detail.
+		 */
+		pr_unimpl(vcpu, "ignored wrmsr: 0x%x data %llx\n", msr, data);
+		break;
 	default:
 		if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
 			return xen_hvm_config(vcpu, data);
@@ -1846,6 +1852,19 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 		} else
 			return get_msr_hyperv(vcpu, msr, pdata);
 		break;
+	case MSR_IA32_BBL_CR_CTL3:
+		/* This legacy MSR exists but isn't fully documented in current
+		 * silicon. It is however accessed by winxp in very narrow
+		 * scenarios where it sets bit #19, itself documented as
+		 * a reserved bit. Best effort attempt to source coherent
+		 * read data here should the balance of the register be
+		 * interpreted by the guest:
+		 *
+		 * L2 cache control register 3: 64GB range, 256KB size,
+		 * enabled, latency 0x1, configured
+		 */
+		data = 0xbe702111;
+		break;
 	default:
 		if (!ignore_msrs) {
 			pr_unimpl(vcpu, "unhandled rdmsr: 0x%x\n", msr);

--
john.coo...@redhat.com
Re: [REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)
* Christoph Lameter c...@linux.com [2011-01-20 08:49:27]:

> On Thu, 20 Jan 2011, Balbir Singh wrote:
>
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -253,11 +253,11 @@ extern int vm_swappiness;
> >  extern int remove_mapping(struct address_space *mapping, struct page *page);
> >  extern long vm_total_pages;
> >
> > +extern int sysctl_min_unmapped_ratio;
> > +extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
> >  #ifdef CONFIG_NUMA
> >  extern int zone_reclaim_mode;
> > -extern int sysctl_min_unmapped_ratio;
> >  extern int sysctl_min_slab_ratio;
> > -extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
> >  #else
> >  #define zone_reclaim_mode 0
>
> So the end result of this patch is that zone reclaim is compiled into
> vmscan.o even on !NUMA configurations but since zone_reclaim_mode == 0
> noone can ever call that code?

The third patch fixes this with the introduction of a config option
(cut-copy-paste below). If someone were to bisect to this point, what
you say is correct.

+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask,
                                unsigned int order)
 {
 	return 0;
 }
 #endif

Thanks for the review!

--
	Three Cheers,
	Balbir
Re: [REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)
* Christoph Lameter c...@linux.com [2011-01-20 08:50:40]:

> Reviewed-by: Christoph Lameter c...@linux.com

Thanks for the review!

--
	Three Cheers,
	Balbir
Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)
* Christoph Lameter c...@linux.com [2011-01-20 09:00:09]:

> On Thu, 20 Jan 2011, Balbir Singh wrote:
>
> > +	unmapped_page_control
> > +			[KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
> > +			is enabled. It controls the amount of unmapped memory
> > +			that is present in the system. This boot option plus
> > +			vm.min_unmapped_ratio (sysctl) provide granular control
>
> min_unmapped_ratio is there to guarantee that zone reclaim does not
> reclaim all unmapped pages. What you want here is a max_unmapped_ratio.

I thought about that; the logic for reusing min_unmapped_ratio was to
keep a limit beyond which unmapped page cache shrinking should stop. I
think you are suggesting max_unmapped_ratio as the point at which
shrinking should begin, right?

> > @@ -2297,6 +2320,12 @@ loop_again:
> >  				shrink_active_list(SWAP_CLUSTER_MAX, zone,
> >  							sc, priority, 0);
> >
> > +			/*
> > +			 * We do unmapped page reclaim once here and once
> > +			 * below, so that we don't lose out
> > +			 */
> > +			reclaim_unmapped_pages(priority, zone, sc);
> > +
> >  			if (!zone_watermark_ok_safe(zone, order,
>
> Hmm. Okay that means background reclaim does it. If so then we also
> want zone reclaim to be able to work in the background I think.

Anything specific you had in mind? It works for me in testing, but is
there anything specific that stands out in your mind that needs to be
done?

Thanks for the review!

--
	Three Cheers,
	Balbir