Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On 10/01/2009 09:24 PM, Gregory Haskins wrote: Virtualization is about not doing that. Sometimes it's necessary (when you have made unfixable design mistakes), but just to replace a bus, with no advantages to the guest that has to be changed (other hypervisors or hypervisorless deployment scenarios aren't). The problem is that your continued assertion that there is no advantage to the guest is a completely unsubstantiated claim. As it stands right now, I have a public git tree that, to my knowledge, is the fastest KVM PV networking implementation around. It also has capabilities that are demonstrably not found elsewhere, such as the ability to render generic shared-memory interconnects (scheduling, timers), interrupt-priority (qos), and interrupt-coalescing (exit-ratio reduction). I designed each of these capabilities after carefully analyzing where KVM was coming up short. Those are facts. I can't easily prove which of my new features alone are what makes it special per se, because I don't have unit tests for each part that breaks it down. What I _can_ state is that it's the fastest and most feature-rich KVM-PV tree that I am aware of, and others may download and test it themselves to verify my claims. If you wish to introduce a feature which has downsides (and to me, vbus has downsides) then you must prove it is necessary on its own merits. venet is pretty cool, but I need proof before I believe its performance is due to vbus and not to venet-host. The disproof, on the other hand, would be a counter-example that still meets all the performance and feature criteria under all the same conditions while maintaining the existing ABI. To my knowledge, this doesn't exist. mst is working on it and we should have it soon. Therefore, if you believe my work is irrelevant, show me a git tree that accomplishes the same feats in a binary-compatible way, and I'll rethink my position.
Until then, complaining about lack of binary compatibility is pointless since it is not an insurmountable proposition, and the one and only available solution declares it a required casualty. Fine, let's defer it until vhost-net is up and running. Well, Xen requires pre-translation (since the guest has to give the host (which is just another guest) permissions to access the data). Actually I am not sure that it does require pre-translation. You might be able to use the memctx->copy_to/copy_from scheme in post-translation as well, since those would be able to communicate with something like the Xen kernel. But I suppose either method would result in extra exits, so there is no distinct benefit to using vbus there... as you say below, they're just different. The biggest difference is that my proposed model gets around the notion that the entire guest address space can be represented by an arbitrary pointer. For instance, the copy_to/copy_from routines take a GPA, but may use something indirect like a DMA controller to access that GPA. On the other hand, virtio fully expects a viable pointer to come out of the interface, IIUC. This is in part what makes vbus more adaptable to non-virt. No, virtio doesn't expect a pointer (this is what makes Xen possible). vhost does; but nothing prevents an interested party from adapting it. An interesting thing here is that you don't even need a fancy multi-homed setup to see the effects of my exit-ratio reduction work: even single-port configurations suffer from the phenomenon, since many devices have multiple signal flows (e.g. network adapters tend to have at least 3 flows: rx-ready, tx-complete, and control-events (link-state, etc.)). What's worse is that the flows are often indirectly related (for instance, many host adapters will free tx skbs during rx operations, so you tend to get bursts of tx-completes at the same time as rx-ready). If the flows map 1:1 with the IDT, they will suffer the same problem.
You can simply use the same vector for both rx and tx and poll both at every interrupt. Yes, but that has its own problems: e.g. additional exits, or at least additional overhead figuring out what happened each time. If you're just coalescing tx and rx, it's an additional memory read (which you have anyway in the vbus interrupt queue). This is even more important as we scale out to MQ, which may have dozens of queue pairs. You really want finer-grained signal-path decode if you want peak performance. MQ definitely wants per-queue or per-queue-pair vectors, and it definitely doesn't want all interrupts to be serviced by a single interrupt queue (you could/should make the queue per-vcpu). It's important to note here that we are actually looking at the interrupt rate, not the exit rate (which is usually a multiple of the interrupt rate, since you have to factor in as many as three exits per interrupt (IPI, window, EOI)). Therefore we saved about 18k interrupts in this 10-second burst, but we may have actually saved
Re: [Qemu-devel] Release plan for 0.12.0
On 10/01/2009 11:13 PM, Luiz Capitulino wrote: If we're going to support the protocol for 0.12, I'd like most of the code merged by the end of October. Four weeks... Not so much time, but let's try. There are two major issues that may delay QMP. Firstly, we are still in the infrastructure/design phase, which naturally takes time. Maybe when handlers start getting converted en masse things will be faster. I sure hope so. Maybe someone can pitch in if not. Secondly: testing. I have a very ugly Python script to test the already-converted handlers. The problem is not only the ugliness; the right way to do this would be to use kvm-autotest. So, I was planning to take a detailed look at it and perhaps start writing tests for QMP as each handler is converted. The Right Thing, but it takes time. I think this could be done by having autotest use two monitors, one with the machine protocol and one with the human protocol, trying the machine protocol first and falling back if the command is not supported. Hopefully we can get the autotest people to work on it so we parallelize development. They'll also give user-oriented feedback, which can be valuable. Are you using a standard JSON parser with your test script? That's an additional validation. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
On Fri, Oct 02, 2009 at 04:19:27PM -0400, Gregory Haskins wrote: What: xinterface is a mechanism that allows kernel modules external to the kvm.ko proper to interface with a running guest. It accomplishes this by creating an abstracted interface which does not expose any private details of the guest or its related KVM structures, and provides a mechanism to find and bind to this interface at run-time. Why: There are various subsystems that would like to interact with a KVM guest which are ideally suited to exist outside the domain of the kvm.ko core logic. For instance, external pci-passthrough, virtual-bus, and virtio-net modules are currently under development. In order for these modules to successfully interact with the guest, they need, at the very least, various interfaces for signaling IO events, pointer translation, and possibly memory mapping. The signaling case is covered by the recent introduction of the irqfd/ioeventfd mechanisms. This patch provides a mechanism to cover the other cases. Note that today we only expose pointer-translation related functions, but more could be added at a future date as needs arise. Example usage: QEMU instantiates a guest, and an external module foo that desires the ability to interface with the guest (say via open("/dev/foo")). QEMU may then pass the kvmfd to foo via an ioctl, such as: ioctl(foofd, FOO_SET_VMID, kvmfd). Upon receipt, the foo module can issue kvm_xinterface_bind(kvmfd) to acquire the proper context. Internally, the struct kvm* and associated struct module* will remain pinned at least until the foo module calls kvm_xinterface_put(). --- /dev/null +++ b/virt/kvm/xinterface.c @@ -0,0 +1,409 @@ +/* + * KVM module interface - Allows external modules to interface with a guest + * + * Copyright 2009 Novell. All Rights Reserved.
+ *
+ * Author:
+ *      Gregory Haskins ghask...@novell.com
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/highmem.h>
+#include <linux/module.h>
+#include <linux/mmu_context.h>
+#include <linux/kvm_host.h>
+#include <linux/kvm_xinterface.h>
+
+struct _xinterface {
+	struct kvm             *kvm;
+	struct task_struct     *task;
+	struct mm_struct       *mm;
+	struct kvm_xinterface   intf;
+	struct kvm_memory_slot *slotcache[NR_CPUS];
+};
+
+struct _xvmap {
+	struct kvm_memory_slot *memslot;
+	unsigned long           npages;
+	struct kvm_xvmap        vmap;
+};
+
+static struct _xinterface *
+to_intf(struct kvm_xinterface *intf)
+{
+	return container_of(intf, struct _xinterface, intf);
+}
+
+#define _gfn_to_hva(gfn, memslot) \
+	(memslot->userspace_addr + (gfn - memslot->base_gfn) * PAGE_SIZE)
+
+/*
+ * gpa_to_hva() - translate a guest-physical to host-virtual using
+ *                a per-cpu cache of the memslot.
+ *
+ * The gfn_to_memslot() call is relatively expensive, and the gpa access
+ * patterns exhibit a high degree of locality.
+ * Therefore, let's cache
+ * the last slot used on a per-cpu basis to optimize the lookup.
+ *
+ * assumes slots_lock held for read
+ */
+static unsigned long
+gpa_to_hva(struct _xinterface *_intf, unsigned long gpa)
+{
+	int cpu = get_cpu();
+	unsigned long gfn = gpa >> PAGE_SHIFT;
+	struct kvm_memory_slot *memslot = _intf->slotcache[cpu];
+	unsigned long addr = 0;
+
+	if (!memslot
+	    || gfn < memslot->base_gfn
+	    || gfn >= memslot->base_gfn + memslot->npages) {
+
+		memslot = gfn_to_memslot(_intf->kvm, gfn);
+		if (!memslot)
+			goto out;
+
+		_intf->slotcache[cpu] = memslot;
+	}
+
+	addr = _gfn_to_hva(gfn, memslot) + offset_in_page(gpa);
+
+out:
+	put_cpu();
+
+	return addr;

Please optimize gfn_to_memslot() instead, so everybody benefits. It shows very often on profiles.

+	page_list = (struct page **)__get_free_page(GFP_KERNEL);
+	if (!page_list)
+		return NULL;
+
+	down_write(&mm->mmap_sem);
+
+	ret = get_user_pages(p, mm, addr, npages, 1, 0, page_list, NULL);
+	if (ret < 0)
+		goto out;
+
+	ptr = vmap(page_list, npages, VM_MAP, PAGE_KERNEL);
+	if (ptr)
+
[ kvm-Bugs-2868883 ] netkvm.sys stops sending/receiving on Windows Server 2003 VM
Bugs item #2868883, was opened at 2009-09-28 16:27 Message generated for change (Comment added) made by amontezuma You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2868883&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Mark Weaver (mdw21) Assigned to: Nobody/Anonymous (nobody) Summary: netkvm.sys stops sending/receiving on Windows Server 2003 VM Initial Comment: This usually happens within an hour or two of starting the interface. It can be cured temporarily by disabling/enabling the adapter within Windows. I've run the Windows interface with log level set to 2 -- when traffic stops it still logs outgoing traffic as normal but ParaNdis_ProcessRxPath stops being logged. I suspect this is to do with the traffic content or timing as I cannot reproduce this with iperf, but only with external traffic to a website hosted on the machine. What further steps can I take to debug this issue?
Host details: 2 x dual core xeons: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz stepping: 6 cpu MHz : 2327.685 cache size : 6144 KB physical id : 0 siblings: 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority bogomips: 4655.37 clflush size: 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management: kernel is 2.6.31 from kernel.org, userspace is debian lenny, all 64-bit qemu is qemu-kvm-0.10.6 Guest details: Windows Server 2003 32-bit qemu is started as:
qemu-system-x86_64 \
 -boot c \
 -drive file=/data/vms/stooge/boot.raw,if=virtio,boot=on,cache=off \
 -m 3072 \
 -smp 1 \
 -vnc 10.80.80.89:2 \
 -k en-gb \
 -net nic,model=virtio,macaddr=DE:AD:BE:EF:11:29 \
 -net tap,ifname=tap0 \
 -localtime \
 -usb -usbdevice tablet \
 -mem-path /hugepages
-- Comment By: amontezuma (amontezuma) Date: 2009-10-03 23:08 Message: This is probably the same bug as all these others: https://sourceforge.net/tracker/?func=detail&aid=2506814&group_id=180599&atid=893831 https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1771262&group_id=180599 https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2327497&group_id=180599 It is the most serious and annoying bug in KVM IMHO, and it has been there for SO LONG. -- Comment By: Mark Weaver (mdw21) Date: 2009-09-29 16:00 Message: 1. Could you please attach the log? Too big for sf.net; I have put a log here: http://www.blushingpenguin.com/kvm/netkvm.log.bz2 During that log, it appeared that outgoing packets were still being transmitted; however, incoming packets were not being received.
This was verified by running ping on the guest and using tcpdump on the host. After a while packets started being received again. The pattern can be seen with: grep received netkvm.log foo up to 2235.26074219 packets are being pulled out regularly -- generally 1-2 packets at a time. After that they start being pulled out irregularly and in greater numbers. After 2870.28808594 normal service is resumed. 2. Could you be more specific on the scenario? Are you running some tests or network application? It's running websites under IIS. I tried to reproduce this issue with various iperf scenarios but failed to do so. 3. You could raise the debug level even more, to level 6 - that would give information about the rings (how much space is left, etc.). I have raised the level to 7 (the level of the log linked to above). 4. In the code you could add debug prints to ParaNdis5_MiniportISR to check if the driver even receives the interrupt. It appears that DEBUG_EXIT_STATUS(7, (ULONG)b); is in the function ParaNdis5_MiniportISR, so I assume this is sufficient. (5). Another thing to test - could you please run the guest without the /hugepages option. The same issue occurs without hugepages. -- Comment By: Yan Vugenfirer (yanv) Date: 2009-09-29 14:51 Message: Another thing to test - could you please run the guest without the /hugepages option.
Re: kvm guest: hrtimer: interrupt too slow
Michael, Can you please give the patch below a try? (without the acpi_pm timer or priority adjustments for the guest). On Tue, Sep 29, 2009 at 05:12:17PM +0400, Michael Tokarev wrote: Hello. I'm having quite an... unusable system here. It's not really a regression with 0.11.0, it was something similar before, but with 0.11.0 and/or 2.6.31 it became much worse. The thing is that after some uptime, the kvm guest prints something like this: hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns after which the (guest) system speed becomes very slow. The above message is from a 2.6.31 guest running with 0.11.0 on a 2.6.31 host. Before, I tried it with 0.10.6 and 2.6.30 or 2.6.27, and the deltas were a bit less than that: hrtimer: interrupt too slow, forcing clock min delta to 15415 ns hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns It seems the way hrtimer_interrupt_hanging calculates min_delta is wrong (especially for virtual machines). The guest vcpu can be scheduled out during the execution of the hrtimer callbacks (and the callbacks themselves can do operations that translate to blocking operations in the hypervisor). So high min_delta values can be calculated if, for example, a single hrtimer_interrupt run takes two host time slices to execute, while some other higher priority task runs for N slices in between. Using the hrtimer_interrupt execution time (which can be the worst case at any given time) as the min_delta is problematic. So simply increase min_delta_ns by 50% once every detected failure, which will eventually lead to an acceptable threshold (the algorithm should scale back down to a lower min_delta, to adjust back to wealthier times, too).
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 49da79a..8997978 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-static int force_clock_reprogram;
-
 /*
  * After 5 iteration's attempts, we consider that hrtimer_interrupt()
  * is hanging, which could happen with something that slows the interrupt
- * such as the tracing. Then we force the clock reprogramming for each future
- * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
- * threshold that we will overwrite.
- * The next tick event will be scheduled to 3 times we currently spend on
- * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
- * 1/4 of their time to process the hrtimer interrupts. This is enough to
- * let it running without serious starvation.
+ * such as the tracing, so we increase min_delta_ns.
  */
 static inline void
-hrtimer_interrupt_hanging(struct clock_event_device *dev,
-			ktime_t try_time)
+hrtimer_interrupt_hanging(struct clock_event_device *dev)
 {
-	force_clock_reprogram = 1;
-	dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
-	printk(KERN_WARNING "hrtimer: interrupt too slow, "
-		"forcing clock min delta to %lu ns\n", dev->min_delta_ns);
+	dev->min_delta_ns += dev->min_delta_ns >> 1;
+	if (printk_ratelimit())
+		printk(KERN_WARNING "hrtimer: interrupt too slow, "
+			"forcing clock min delta to %lu ns\n",
+			dev->min_delta_ns);
 }
 
 /*
  * High resolution timer interrupt
@@ -1276,7 +1268,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 retry:
 	/* 5 retries is enough to notice a hang */
 	if (!(++nr_retries % 5))
-		hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
+		hrtimer_interrupt_hanging(dev);
 
 	now = ktime_get();
@@ -1342,7 +1334,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	/* Reprogramming necessary ?
 */
 	if (expires_next.tv64 != KTIME_MAX) {
-		if (tick_program_event(expires_next, force_clock_reprogram))
+		if (tick_program_event(expires_next, 0))
 			goto retry;
 	}
 }
Re: INFO: task journal:337 blocked for more than 120 seconds
On 10/2/2009 2:30 PM, Jeremy Fitzhardinge wrote: On 09/30/09 14:11, Shirley Ma wrote: Anybody found this problem before? I kept hitting this issue with a 2.6.31 guest kernel, even with a simple network test. INFO: task kjournald:337 blocked for more than 120 seconds. echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message. kjournald D 0041 0 337 2 0x My test is totally being blocked. I'm assuming from the lists you've posted to that this is under KVM? What disk drivers are you using (virtio or emulated)? Can you get a full stack backtrace of kjournald? Kevin Bowling submitted a RH bug against Xen with apparently the same symptoms (https://bugzilla.redhat.com/show_bug.cgi?id=526627). I'm wondering if there's a core kernel bug here, which is perhaps more easily triggered by the changed timing in a virtual machine. Thanks, J I've had a stable system thus far by appending clocksource=jiffies to the kernel boot line. The default clocksource is otherwise xen. The dmesg boot warnings in my bugzilla report still occur. Regards, Kevin Bowling http://www.analograils.com/
Re: [PATCH 00/27] Add KVM support for Book3s_64 (PPC64) hosts v4
On Sat, 2009-10-03 at 20:59 +1000, Benjamin Herrenschmidt wrote: On Sat, 2009-10-03 at 12:08 +0200, Avi Kivity wrote: So these MSRs can be modified by the hypervisor? Otherwise you'd cache them in the guest with no hypervisor involvement, right? (just making sure :) There's one MSR :-) Among others, it can be altered by the act of taking an interrupt (for example, it contains the PR bit, which means user vs. supervisor, things like that). For a bit more context... On PowerPC, all those special registers are called SPRs (special registers, surprise ! :-) They are generally accessed via mfspr/mtspr instructions that encode the SPR number, though some of them can also have dedicated instructions or be set as a side effect of some instructions or events etc... MSR is a bit special here because it's not per se an SPR. It's the Machine State Register; in the core, it's in the fast path of a whole bunch of pipeline stages, and it contains the state of things such as the current privilege level, the state of MMU translation for I and D, the interrupt enable bit, etc... It's accessed via specific mfmsr/mtmsr instructions (to simplify, as there are other instructions that modify the MSR as a side effect; interrupts do that too, etc...). So the MSR warrants special treatment for KVM. Other SPRs may or may not, depending on what they are. Some are just storage, like the SPRGs; some contain a copy of the previous PC and MSR when taking an interrupt (SRR0 and SRR1) and are used by the rfi instruction to restore them when returning from an interrupt; and some are totally unrelated (such as the decrementer, which is our core timer facility) or other processor-specific registers containing various things like cache configuration etc... The main issue with kernel entry / exit performance, though, revolves around MSR, SPRG and SRR0/1 accesses.
SPRGs could -almost- be entirely guest-cached, but since the goal is to save a register to use as scratch at a time when no register can be clobbered, saving a register to them must fit in one instruction that has no side effects. The typical option we are thinking about here is a store-absolute to an address that KVM can then map to some per-CPU storage page. Things like SRR0/SRR1 can be replaced by similar loads/stores as long as the HV sets them appropriately with the original MSR (or emulated MSR) and PC when directing an interrupt to the guest, and knows where to retrieve the content set by the kernel when emulating an rfi instruction. The MSR can always be read from cache by the guest, as long as the HV knows how to alter its cached value when directing an interrupt to the guest or emulating another of those instructions that can affect it (such as rfi, of course), etc... So in our case, that (relatively small) level of paravirt provides a tremendous performance boost, since every guest interrupt (syscall, etc...) goes down from something like a good dozen emulation traps to maybe a couple, just for the base entry/exit path from the kernel. This is very different from the issues around PV that you guys had in the x86 world related to MMU emulation; though in our case PV may also prove useful, as our MMU structure is very different, that is a completely orthogonal matter. Cheers, Ben.