Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-10-03 Thread Avi Kivity

On 10/01/2009 09:24 PM, Gregory Haskins wrote:



Virtualization is about not doing that.  Sometimes it's necessary (when
you have made unfixable design mistakes), but not just to replace a bus
with no advantages to the guest that has to be changed (other
hypervisors or hypervisor-less deployment scenarios aren't such advantages).
 

The problem is that your continued assertion that there is no advantage
to the guest is a completely unsubstantiated claim.  As it stands right
now, I have a public git tree that, to my knowledge, is the fastest KVM
PV networking implementation around.  It also has capabilities that are
demonstrably not found elsewhere, such as the ability to render generic
shared-memory interconnects (scheduling, timers), interrupt-priority
(qos), and interrupt-coalescing (exit-ratio reduction).  I designed each
of these capabilities after carefully analyzing where KVM was coming up
short.

Those are facts.

I can't easily prove which of my new features alone are what makes it
special per se, because I don't have unit tests for each part that
breaks it down.  What I _can_ state is that it's the fastest and most
feature-rich KVM-PV tree that I am aware of, and others may download and
test it themselves to verify my claims.
   


If you wish to introduce a feature which has downsides (and to me, vbus 
has downsides) then you must prove it is necessary on its own merits.  
venet is pretty cool but I need proof before I believe its performance 
is due to vbus and not to venet-host.



The disproof, on the other hand, would be in a counter example that
still meets all the performance and feature criteria under all the same
conditions while maintaining the existing ABI.  To my knowledge, this
doesn't exist.
   


mst is working on it and we should have it soon.


Therefore, if you believe my work is irrelevant, show me a git tree that
accomplishes the same feats in a binary compatible way, and I'll rethink
my position.  Until then, complaining about lack of binary compatibility
is pointless since it is not an insurmountable proposition, and the one
and only available solution declares it a required casualty.
   


Fine, let's defer it until vhost-net is up and running.


Well, Xen requires pre-translation (since the guest has to give the host
(which is just another guest) permissions to access the data).
 

Actually I am not sure that it does require pre-translation.  You might
be able to use the memctx->copy_to/copy_from scheme in post-translation
as well, since those would be able to communicate with something like the
Xen kernel.  But I suppose either method would result in extra exits, so
there is no distinct benefit to using vbus there... as you say below,
they're just different.

The biggest difference is that my proposed model gets around the notion
that the entire guest address space can be represented by an arbitrary
pointer.  For instance, the copy_to/copy_from routines take a GPA, but
may use something indirect like a DMA controller to access that GPA.  On
the other hand, virtio fully expects a viable pointer to come out of the
interface iiuc.  This is in part what makes vbus more adaptable to non-virt.
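
To make that concrete, a GPA-based memory context might look roughly
like the sketch below (names and signatures are invented for
illustration; this is not the actual vbus memctx API):

/*
 * Illustrative sketch only: a memory context whose copy routines take a
 * guest-physical address (GPA) rather than a host-virtual pointer.  The
 * names below are hypothetical, not the real vbus interfaces.
 */
struct memctx;

struct memctx_ops {
	/* copy len bytes from host buffer src into guest memory at gpa */
	int (*copy_to)(struct memctx *ctx, unsigned long gpa,
		       const void *src, unsigned long len);
	/* copy len bytes from guest memory at gpa into host buffer dst */
	int (*copy_from)(struct memctx *ctx, void *dst,
			 unsigned long gpa, unsigned long len);
	void (*release)(struct memctx *ctx);
};

struct memctx {
	const struct memctx_ops *ops;
	void *priv;	/* backend state: an mm, a grant table, a DMA engine, ... */
};

/*
 * The device model only ever deals in GPAs; whether the backend resolves
 * them through a page-table walk, a grant mapping, or a DMA controller
 * stays hidden behind ops.
 */
static inline int memctx_copy_to(struct memctx *ctx, unsigned long gpa,
				 const void *src, unsigned long len)
{
	return ctx->ops->copy_to(ctx, gpa, src, len);
}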
   


No, virtio doesn't expect a pointer (this is what makes Xen possible).  
vhost does; but nothing prevents an interested party from adapting it.



An interesting thing here is that you don't even need a fancy
multi-homed setup to see the effects of my exit-ratio reduction work:
even single-port configurations suffer from the phenomenon, since many
devices have multiple signal flows (e.g. network adapters tend to have
at least 3 flows: rx-ready, tx-complete, and control-events (link-state,
etc.)).  What's worse is that the flows are often indirectly related (for
instance, many host adapters will free tx skbs during rx operations, so
you tend to get bursts of tx-completes at the same time as rx-ready).  If
the flows map 1:1 with the IDT, they will suffer the same problem.

   

You can simply use the same vector for both rx and tx and poll both at
every interrupt.
 

Yes, but that has its own problems: e.g. additional exits or at least
additional overhead figuring out what happens each time.


If you're just coalescing tx and rx, it's an additional memory read 
(which you have anyway in the vbus interrupt queue).
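
For reference, a single shared rx/tx vector along those lines might look
roughly like this sketch (struct vnic, vring_has_work() and the
process_* helpers are placeholders, not real virtio API):

#include <linux/types.h>
#include <linux/interrupt.h>

/* placeholder types/helpers for the sketch */
struct vnic {
	struct vring *rx_ring;	/* rx-ready flow */
	struct vring *tx_ring;	/* tx-complete flow */
};
extern bool vring_has_work(struct vring *vr);
extern void process_rx(struct vnic *vnic);
extern void process_tx_completions(struct vnic *vnic);

/*
 * One vector for both flows: every interrupt checks both rings, so the
 * extra cost is one memory read per "other" ring per interrupt.
 */
static irqreturn_t vnic_interrupt(int irq, void *dev_id)
{
	struct vnic *vnic = dev_id;
	irqreturn_t ret = IRQ_NONE;

	if (vring_has_work(vnic->rx_ring)) {
		process_rx(vnic);
		ret = IRQ_HANDLED;
	}

	if (vring_has_work(vnic->tx_ring)) {
		process_tx_completions(vnic);
		ret = IRQ_HANDLED;
	}

	return ret;
}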



This is even
more important as we scale out to MQ, which may have dozens of queue
pairs.  You really want finer-grained signal-path decode if you want
peak performance.
   


MQ definitely wants per-queue or per-queue-pair vectors, and it 
definitely doesn't want all interrupts to be serviced by a single 
interrupt queue (you could/should make the queue per-vcpu).



It's important to note here that we are actually looking at the interrupt
rate, not the exit rate (which is usually a multiple of the interrupt
rate, since you have to factor in as many as three exits per interrupt
(IPI, window, EOI)).  Therefore we saved about 18k interrupts in this 10
second burst, but we may have actually saved 

Re: [Qemu-devel] Release plan for 0.12.0

2009-10-03 Thread Avi Kivity

On 10/01/2009 11:13 PM, Luiz Capitulino wrote:

If we're going to support the protocol for 0.12, I'd like most of the
code merged by the end of October.
 

  Four weeks... Not so much time, but let's try.

  There are two major issues that may delay QMP.

  Firstly, we are still in the infrastructure/design phase, which
naturally takes time. Maybe when handlers start getting converted
en masse things will be faster.
   


I sure hope so.  Maybe someone can pitch in if not.


  Secondly: testing. I have a very ugly Python script to test the
already converted handlers. The problem is not only the ugliness;
the right way to do this would be to use kvm-autotest. So, I was
planning to take a detailed look at it and perhaps start writing
tests for QMP right when each handler is converted. The Right Thing,
but it takes time.
   


I think this could be done by having autotest use two monitors, one with 
the machine protocol and one with the human protocol, trying first the 
machine protocol and falling back if the command is not supported.


Hopefully we can get the autotest people to work on it so we parallelize 
development.  They'll also give user-oriented feedback which can be 
valuable.


Are you using a standard json parser with your test script?  That's an 
additional validation.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-03 Thread Marcelo Tosatti
On Fri, Oct 02, 2009 at 04:19:27PM -0400, Gregory Haskins wrote:
 What: xinterface is a mechanism that allows kernel modules external to
 the kvm.ko proper to interface with a running guest.  It accomplishes
 this by creating an abstracted interface which does not expose any
 private details of the guest or its related KVM structures, and provides
 a mechanism to find and bind to this interface at run-time.
 
 Why: There are various subsystems that would like to interact with a KVM
 guest which are ideally suited to exist outside the domain of the kvm.ko
 core logic. For instance, external pci-passthrough, virtual-bus, and
 virtio-net modules are currently under development.  In order for these
 modules to successfully interact with the guest, they need, at the very
 least, various interfaces for signaling IO events, pointer translation,
 and possibly memory mapping.
 
 The signaling case is covered by the recent introduction of the
 irqfd/ioeventfd mechanisms.  This patch provides a mechanism to cover the
 other cases.  Note that today we only expose pointer-translation related
 functions, but more could be added at a future date as needs arise.
 
 Example usage: QEMU instantiates a guest, and an external module foo
 that desires the ability to interface with the guest (say via
 open("/dev/foo")).  QEMU may then pass the kvmfd to foo via an
 ioctl, such as: ioctl(foofd, FOO_SET_VMID, kvmfd).  Upon receipt, the
 foo module can issue kvm_xinterface_bind(kvmfd) to acquire
 the proper context.  Internally, the struct kvm* and associated
 struct module* will remain pinned at least until the foo module calls
 kvm_xinterface_put().
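
For context, the consumer side of the example above might look roughly
like the following sketch; the foo module, its state, and the error
handling are hypothetical, and only kvm_xinterface_bind() /
kvm_xinterface_put() come from the patch description:

#include <linux/err.h>
#include <linux/kvm_xinterface.h>

/* hypothetical per-instance state of the external "foo" module */
struct foo {
	struct kvm_xinterface *intf;
};

static long foo_set_vmid(struct foo *foo, int kvmfd)
{
	struct kvm_xinterface *intf;

	/*
	 * Binds to the guest identified by kvmfd; per the description this
	 * pins the struct kvm and kvm.ko until kvm_xinterface_put().
	 * ERR_PTR-style error reporting is an assumption here.
	 */
	intf = kvm_xinterface_bind(kvmfd);
	if (IS_ERR(intf))
		return PTR_ERR(intf);

	foo->intf = intf;
	return 0;
}

static void foo_release(struct foo *foo)
{
	if (foo->intf) {
		kvm_xinterface_put(foo->intf);
		foo->intf = NULL;
	}
}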

 --- /dev/null
 +++ b/virt/kvm/xinterface.c
 @@ -0,0 +1,409 @@
 +/*
 + * KVM module interface - Allows external modules to interface with a guest
 + *
 + * Copyright 2009 Novell.  All Rights Reserved.
 + *
 + * Author:
 + *  Gregory Haskins ghask...@novell.com
 + *
 + * This file is free software; you can redistribute it and/or modify
 + * it under the terms of version 2 of the GNU General Public License
 + * as published by the Free Software Foundation.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 + * GNU General Public License for more details.
 + *
 + * You should have received a copy of the GNU General Public License
 + * along with this program; if not, write to the Free Software Foundation,
 + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
 + */
 +
 +#include <linux/mm.h>
 +#include <linux/vmalloc.h>
 +#include <linux/highmem.h>
 +#include <linux/module.h>
 +#include <linux/mmu_context.h>
 +#include <linux/kvm_host.h>
 +#include <linux/kvm_xinterface.h>
 +
 +struct _xinterface {
 + struct kvm *kvm;
 + struct task_struct *task;
 + struct mm_struct   *mm;
 + struct kvm_xinterface   intf;
 + struct kvm_memory_slot *slotcache[NR_CPUS];
 +};
 +
 +struct _xvmap {
 +	struct kvm_memory_slot *memslot;
 +	unsigned long		npages;
 +	struct kvm_xvmap	vmap;
 +};
 +
 +static struct _xinterface *
 +to_intf(struct kvm_xinterface *intf)
 +{
 + return container_of(intf, struct _xinterface, intf);
 +}
 +
 +#define _gfn_to_hva(gfn, memslot) \
 +	(memslot->userspace_addr + (gfn - memslot->base_gfn) * PAGE_SIZE)
 +
 +/*
 + * gpa_to_hva() - translate a guest-physical to host-virtual using
 + * a per-cpu cache of the memslot.
 + *
 + * The gfn_to_memslot() call is relatively expensive, and the gpa access
 + * patterns exhibit a high degree of locality.  Therefore, let's cache
 + * the last slot used on a per-cpu basis to optimize the lookup
 + *
 + * assumes slots_lock held for read
 + */
 +static unsigned long
 +gpa_to_hva(struct _xinterface *_intf, unsigned long gpa)
 +{
 +	int cpu = get_cpu();
 +	unsigned long		gfn = gpa >> PAGE_SHIFT;
 +	struct kvm_memory_slot *memslot = _intf->slotcache[cpu];
 +	unsigned long		addr = 0;
 +
 +	if (!memslot
 +	    || gfn < memslot->base_gfn
 +	    || gfn >= memslot->base_gfn + memslot->npages) {
 +
 +		memslot = gfn_to_memslot(_intf->kvm, gfn);
 +		if (!memslot)
 +			goto out;
 +
 +		_intf->slotcache[cpu] = memslot;
 +	}
 +
 +	addr = _gfn_to_hva(gfn, memslot) + offset_in_page(gpa);
 +
 +out:
 + put_cpu();
 +
 + return addr;

Please optimize gfn_to_memslot() instead, so everybody benefits. It
shows very often on profiles.
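
One possible shape for that, sketched against the 2.6.31-era slot layout
(field names and the static cache are illustrative only; a real version
would need proper per-VM or per-cpu storage and the usual locking):

#include <linux/kvm_host.h>

/*
 * Sketch: remember the last memslot that matched and try it before
 * scanning the array.  kvm->memslots[] / kvm->nmemslots follow the
 * 2.6.31-era layout; slots_lock held for read is assumed, as in the
 * existing callers.
 */
static struct kvm_memory_slot *gfn_to_memslot_cached(struct kvm *kvm, gfn_t gfn)
{
	static struct kvm_memory_slot *last;	/* illustrative only */
	struct kvm_memory_slot *slot = last;
	int i;

	if (slot && gfn >= slot->base_gfn &&
	    gfn < slot->base_gfn + slot->npages)
		return slot;

	for (i = 0; i < kvm->nmemslots; i++) {
		slot = &kvm->memslots[i];
		if (gfn >= slot->base_gfn &&
		    gfn < slot->base_gfn + slot->npages) {
			last = slot;
			return slot;
		}
	}

	return NULL;
}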

 +
 + page_list = (struct page **) __get_free_page(GFP_KERNEL);
 + if (!page_list)
 + return NULL;
 +
 +	down_write(&mm->mmap_sem);
 +
 +	ret = get_user_pages(p, mm, addr, npages, 1, 0, page_list, NULL);
 +	if (ret < 0)
 + goto out;
 +
 + ptr = vmap(page_list, npages, VM_MAP, PAGE_KERNEL);
 + if (ptr)
 + 

[ kvm-Bugs-2868883 ] netkvm.sys stops sending/receiving on Windows Server 2003 VM

2009-10-03 Thread SourceForge.net
Bugs item #2868883, was opened at 2009-09-28 16:27
Message generated for change (Comment added) made by amontezuma
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2868883&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Mark Weaver (mdw21)
Assigned to: Nobody/Anonymous (nobody)
Summary: netkvm.sys stops sending/receiving on Windows Server 2003 VM

Initial Comment:
This usually happens within an hour or two of starting the interface.  It can 
be cured temporarily by disabling/enabling the adapter within Windows.  I've 
run the Windows interface with log level set to 2 -- when traffic stops it 
still logs outgoing traffic as normal but ParaNdis_ProcessRxPath stops being 
logged.  I suspect this is to do with the traffic content or timing as I cannot 
reproduce this with iperf, but only with external traffic to a website hosted 
on the machine. 

What further steps can I take to debug this issue?

Host details:

2 x dual core xeons:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 23
model name  : Intel(R) Xeon(R) CPU   E5410  @ 2.33GHz
stepping: 6
cpu MHz : 2327.685
cache size  : 6144 KB
physical id : 0
siblings: 4
core id : 0
cpu cores   : 4
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est 
tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips: 4655.37
clflush size: 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

kernel is 2.6.31 from kernel.org, userspace is debian lenny, all 64-bit
qemu is qemu-kvm-0.10.6

Guest details:
Windows Server 2003 32-bit

qemu is started as:
qemu-system-x86_64 \
-boot c \
-drive file=/data/vms/stooge/boot.raw,if=virtio,boot=on,cache=off \
-m 3072 \
-smp 1 \
-vnc 10.80.80.89:2 \
-k en-gb \
-net nic,model=virtio,macaddr=DE:AD:BE:EF:11:29 \
-net tap,ifname=tap0 \
-localtime \
-usb -usbdevice tablet \
-mem-path /hugepages 


--

Comment By: amontezuma (amontezuma)
Date: 2009-10-03 23:08

Message:
This is probably the same bug as all these others:
https://sourceforge.net/tracker/?func=detail&aid=2506814&group_id=180599&atid=893831
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1771262&group_id=180599
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2327497&group_id=180599
it is the most serious and annoying bug in KVM imho and it has been there
for SO LONG.

--

Comment By: Mark Weaver (mdw21)
Date: 2009-09-29 16:00

Message:
 1. Could you please attach the log

Too big for sf.net, I have put a log here:

http://www.blushingpenguin.com/kvm/netkvm.log.bz2

During that log, it appeared that outgoing packets were still being
transmitted; however, incoming packets were not being received.  This was
verified by running ping on the guest and using tcpdump on the host.
After a while packets started being received again.  The pattern can be
seen with:

grep received netkvm.log > foo

Up to 2235.26074219, packets are being pulled out regularly -- generally
1-2 packets at a time.  After that they start being pulled out
irregularly and in greater numbers.  After 2870.28808594 normal service
is resumed.

 2. Could you be more specific on the scenario? Are you running some
tests
 or network application?

It's running websites under IIS.  I tried to reproduce this issue with
various
iperf scenarios but failed to do so.

 3. You could raise debug level even more to level 6 - that would give
the
 information about the rings (how much space is left and etc)

I have raised the level to 7 (the level of the log linked to above).  

 4. In the code you could add debug prints to ParaNdis5_MiniportISR to
 check if the driver even receives the interrupt.

It appears that DEBUG_EXIT_STATUS(7, (ULONG)b); is in the function 
ParaNdis5_MiniportISR so I assume this is sufficient.

 (5). Another thing to test - could you please run the guest without
/hugepages
option.

The same issue occurs without hugepages.


--

Comment By: Yan Vugenfirer (yanv)
Date: 2009-09-29 14:51

Message:
Another thing to test - could you please run the guest without /hugepages
option.




Re: kvm guest: hrtimer: interrupt too slow

2009-10-03 Thread Marcelo Tosatti
Michael,

Can you please give the patch below a try? (without the acpi_pm timer 
or priority adjustments for the guest).

On Tue, Sep 29, 2009 at 05:12:17PM +0400, Michael Tokarev wrote:
 Hello.

 I'm having quite an... unusable system here.
 It's not really a regression with 0.11.0 --
 it was something similar before, but with
 0.11.0 and/or 2.6.31 it became much worse.

 The thing is that after some uptime, the kvm
 guest prints something like this:

 hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns

 after which the (guest) system speed becomes
 very slow.  The above message is from a
 2.6.31 guest running with 0.11.0 and a 2.6.31
 host.  Before that I tried it with 0.10.6 and
 2.6.30 or 2.6.27, and the deltas were a
 bit less than that:

 hrtimer: interrupt too slow, forcing clock min delta to 15415 ns
 hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns

It seems the way hrtimer_interrupt_hanging calculates min_delta is
wrong (especially for virtual machines). The guest vcpu can be scheduled
out during the execution of the hrtimer callbacks (and the callbacks
themselves can do operations that translate to blocking operations in
the hypervisor).

So high min_delta values can be calculated if, for example, a single
hrtimer_interrupt run takes two host time slices to execute, while some
other higher priority task runs for N slices in between.

Using the hrtimer_interrupt execution time (which can be the worst
case at any given time) as the min_delta is problematic.

So simply increase min_delta_ns by 50% on every detected failure,
which will eventually lead to an acceptable threshold (the algorithm
should scale back down to a lower min_delta, to adjust back to wealthier
times, too).

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 49da79a..8997978 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer)
 
 #ifdef CONFIG_HIGH_RES_TIMERS
 
-static int force_clock_reprogram;
-
 /*
  * After 5 iteration's attempts, we consider that hrtimer_interrupt()
  * is hanging, which could happen with something that slows the interrupt
- * such as the tracing. Then we force the clock reprogramming for each future
- * hrtimer interrupts to avoid infinite loops and use the min_delta_ns
- * threshold that we will overwrite.
- * The next tick event will be scheduled to 3 times we currently spend on
- * hrtimer_interrupt(). This gives a good compromise, the cpus will spend
- * 1/4 of their time to process the hrtimer interrupts. This is enough to
- * let it running without serious starvation.
+ * such as the tracing, so we increase min_delta_ns.
  */
 
 static inline void
-hrtimer_interrupt_hanging(struct clock_event_device *dev,
-   ktime_t try_time)
+hrtimer_interrupt_hanging(struct clock_event_device *dev)
 {
-	force_clock_reprogram = 1;
-	dev->min_delta_ns = (unsigned long)try_time.tv64 * 3;
-	printk(KERN_WARNING "hrtimer: interrupt too slow, "
-		"forcing clock min delta to %lu ns\n", dev->min_delta_ns);
+	dev->min_delta_ns += dev->min_delta_ns >> 1;
+	if (printk_ratelimit())
+		printk(KERN_WARNING "hrtimer: interrupt too slow, "
+			"forcing clock min delta to %lu ns\n",
+			dev->min_delta_ns);
 }
 /*
  * High resolution timer interrupt
@@ -1276,7 +1268,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
  retry:
/* 5 retries is enough to notice a hang */
if (!(++nr_retries % 5))
-   hrtimer_interrupt_hanging(dev, ktime_sub(ktime_get(), now));
+   hrtimer_interrupt_hanging(dev);
 
now = ktime_get();
 
@@ -1342,7 +1334,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 
/* Reprogramming necessary ? */
if (expires_next.tv64 != KTIME_MAX) {
-   if (tick_program_event(expires_next, force_clock_reprogram))
+   if (tick_program_event(expires_next, 0))
goto retry;
}
 }


Re: INFO: task journal:337 blocked for more than 120 seconds

2009-10-03 Thread Kevin Bowling

On 10/2/2009 2:30 PM, Jeremy Fitzhardinge wrote:

On 09/30/09 14:11, Shirley Ma wrote:
   

Anybody found this problem before? I kept hitting this issue for 2.6.31
guest kernel even with a simple network test.

INFO: task kjournal:337 blocked for more than 120 seconds.
echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.

kjournald   D 0041  0   337 2 0x

My test is totally being blocked.
 

I'm assuming from the lists you've posted to that this is under KVM?
What disk drivers are you using (virtio or emulated)?

Can you get a full stack backtrace of kjournald?

Kevin Bowling submitted a RH bug against Xen with apparently the same
symptoms (https://bugzilla.redhat.com/show_bug.cgi?id=526627).  I'm
wondering if there's a core kernel bug here, which is perhaps more
easily triggered by the changed timing in a virtual machine.

Thanks,
 J
   


I've had a stable system thus far by appending clocksource=jiffies to 
the kernel boot line.  The default clocksource is otherwise xen.


The dmesg boot warnings in my bugzilla report still occur.

Regards,
Kevin Bowling
http://www.analograils.com/


Re: [PATCH 00/27] Add KVM support for Book3s_64 (PPC64) hosts v4

2009-10-03 Thread Benjamin Herrenschmidt
On Sat, 2009-10-03 at 20:59 +1000, Benjamin Herrenschmidt wrote:
 On Sat, 2009-10-03 at 12:08 +0200, Avi Kivity wrote:
  
  So these MSRs can be modified by the hypervisor?  Otherwise you'd cache 
  them in the guest with no hypervisor involvement, right?  (just making 
  sure :)
 
 There's one MSR :-) Among others, it can be altered by the act of
 taking an interrupt (for example, it contains the PR bit, which means
 user vs. supervisor, things like that).

For a bit more context...

On PowerPC, all those special registers are called SPRs (special
purpose registers, surprise! :-)

They are generally accessed via mfspr/mtspr instructions that encode
the SPR number, though some of them can also have dedicated instructions
or be set as a side effect of some instructions or events etc...

MSR is a bit special here because it's not per se an SPR.  It's the
Machine State Register: in the core, it's in the fast path of a whole
bunch of pipeline stages, and it contains the state of things such as
the current privilege level, the state of MMU translation for I and D,
the interrupt enable bit, etc... It's accessed via specific mfmsr/mtmsr
instructions (to simplify as there are other instructions that modify
the MSR as a side effect, interrupts do that too, etc...).

So the MSR warrants special treatment for KVM. Other SPRs may or may not
depending on what they are. Some are just storage like the SPRGs, some
contain a copy of the previous PC and MSR when taking an interrupt (SRR0
and SRR1) and are used by the rfi instruction to restore them when
returning from an interrupt, and some are totally unrelated (such as
the decrementer which is our core timer facility) or other processor
specific registers containing various things like cache configuration
etc...

The main issue with kernel entry / exit performance, though, revolves
around MSR, SPRG and SRR0/1 accesses.  SPRGs could -almost- be entirely
guest-cached, but since the goal is to save a register to use as scratch
at a time when no register can be clobbered, saving a register to them
must fit in one instruction that has no side effect. The typical option
we are thinking about here is a store-absolute to an address that KVM
can then map to some per-CPU storage page.

Things like SRR0/SRR1 can be replaced by similar load/stores as long as
the HV sets them appropriately with the original MSR (or emulated MSR)
and PC when directing an interrupt to the guest, and knows where to
retrieve the content set by the kernel when emulating an rfi
instruction.  The MSR can always be read from the cache by the guest as
long as the HV knows how to alter its cached value when directing
an interrupt to the guest or emulating another of those instructions
that can affect it (such as rfi of course), etc...
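
As a rough illustration of that shared-page idea (the layout below is
purely hypothetical, not a proposed ABI):

/*
 * Hypothetical per-vcpu page shared between guest and hypervisor.  The
 * guest reads/writes these fields with ordinary loads and stores instead
 * of trapping on mfspr/mtspr/mfmsr; the HV updates them when it injects
 * an interrupt or emulates rfi.
 */
struct ppc_pv_state {
	unsigned long sprg[4];	/* scratch slots standing in for SPRG0-3 */
	unsigned long srr0;	/* saved PC on interrupt entry */
	unsigned long srr1;	/* saved (emulated) MSR on interrupt entry */
	unsigned long msr;	/* the guest's cached view of the MSR */
};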

So in our case, that (relatively small) level of paravirt provides a
tremendous performance boost, since every guest interrupt (syscall,
etc...) goes down from something like a good dozen emulation traps
to maybe a couple just for the base entry/exit path from the kernel.

This is very different from the issues around PV that you guys had in
the x86 world related to MMU emulation; though in our case PV may also
prove useful, as our MMU structure is very different, that is a
completely orthogonal matter.

Cheers,
Ben.

