Re: [PATCH 0/9][RFC] KVM virtio_net performance

2008-07-26 Thread Rusty Russell
On Saturday 26 July 2008 19:45:36 Avi Kivity wrote:
> Mark McLoughlin wrote:
> > Hey,
> >   Here's a bunch of patches attempting to improve the performance
> > of virtio_net. This is more an RFC rather than a patch submission
> > since, as can be seen below, not all patches actually improve the
> > perfomance measurably.
> >
> >   I've tried hard to test each of these patches with as stable and
> > informative a benchmark as I could find. The first benchmark is a
> > netperf[1] based throughput benchmark and the second uses a flood
> > ping[2] to measure latency differences.
> >
> >   Each set of figures is min/average/max/standard deviation. The
> > first set is Gb/s and the second is milliseconds.
> >
> >   The network configuration used was very simple - the guest with
> > a virtio_net interface and the host with a tap interface and static
> > IP addresses assigned to both - e.g. there was no bridge in the host
> > involved and iptables was disable in both the host and guest.
> >
> >   I used:
> >
> >   1) kvm-71-26-g6152996 with the patches that follow
> >
> >   2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from
> >  219:bbd2611289c5 applied; these are the patches have just been
> >  submitted to Linus
> >
> >   The conclusions I draw are:
> >
> >   1) The length of the tx mitigation timer makes quite a difference to
> >  throughput achieved; we probably need a good heuristic for
> >  adjusting this on the fly.
>
> The tx mitigation timer is just one part of the equation; the other is
> the virtio ring window size, which is now fixed.
>
> Using a maximum sized window is good when the guest and host are running
> flat out, doing nothing but networking.  When throughput drops (because
> the guest is spending cpu on processing, or simply because the other
> side is not keeping up), we need to drop the windows size so as to
> retain acceptable latencies.
>
> The tx timer can then be set to "a bit after the end of the window",
> acting as a safety belt in case the throughput changes.

Interestingly, I played a little with a threshold patch.  Unfortunately it 
seemed to just add YA variable to the mix; it'd take real research to figure 
out how to adjust the threshold and timeout values appropriately for 
different situations.

> This isn't too good.  Low latency is important for nfs clients (or other
> request/response workloads).  I think we can keep these low by adjusting
> the virtio window (for example, on an idle system it should be 1), so
> that the tx mitigation timer only fires when the workload transitions
> from throughput to request/response.

>From reading this thread, this seems like an implementation bug.  Don't 
suppress notifications on an empty ring (ie. threshold will be 1 when nothing 
is happening).

Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9][RFC] KVM virtio_net performance

2008-07-26 Thread Rusty Russell
On Saturday 26 July 2008 19:45:36 Avi Kivity wrote:
> >   5) Eliminating an extra copy on the host->guest path only makes a
> >  barely measurable difference.
>
> That's expected on a host->guest test.  Zero copy is mostly important
> for guest->external, and with zerocopy already enabled in the guest
> (sendfile or nfs server workloads).

Note that this will also need the copyless tun patches.  If we're doing one 
copy from userspace->kernel anyway, this extra copy is probably lost in the 
noise.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 2/3] KVM: irq ack notification

2008-07-26 Thread Marcelo Tosatti
Based on a patch from: Ben-Ami Yassour <[EMAIL PROTECTED]>
which was based on a patch from: Amit Shah <[EMAIL PROTECTED]>

Notify IRQ acking on PIC/APIC emulation. The previous patch missed two things:

- Edge triggered interrupts on IOAPIC
- PIC reset with IRR/ISR set should be equivalent to ack (LAPIC probably
needs something similar).

Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
CC: Amit Shah <[EMAIL PROTECTED]>
CC: Ben-Ami Yassour <[EMAIL PROTECTED]>

Index: kvm/arch/x86/kvm/i8259.c
===
--- kvm.orig/arch/x86/kvm/i8259.c
+++ kvm/arch/x86/kvm/i8259.c
@@ -159,9 +159,10 @@ static inline void pic_intack(struct kvm
s->irr &= ~(1 << irq);
 }
 
-int kvm_pic_read_irq(struct kvm_pic *s)
+int kvm_pic_read_irq(struct kvm *kvm)
 {
int irq, irq2, intno;
+   struct kvm_pic *s = pic_irqchip(kvm);
 
irq = pic_get_irq(&s->pics[0]);
if (irq >= 0) {
@@ -187,12 +188,21 @@ int kvm_pic_read_irq(struct kvm_pic *s)
intno = s->pics[0].irq_base + irq;
}
pic_update_irq(s);
+   kvm_notify_acked_irq(kvm, irq);
 
return intno;
 }
 
 void kvm_pic_reset(struct kvm_kpic_state *s)
 {
+   int irq;
+   struct kvm *kvm = s->pics_state->irq_request_opaque;
+
+   for (irq = 0; irq < PIC_NUM_PINS; irq++) {
+   if (!(s->imr & (1 << irq)) && (s->irr & (1 << irq) ||
+   s->isr & (1 << irq)))
+   kvm_notify_acked_irq(kvm, irq);
+   }
s->last_irr = 0;
s->irr = 0;
s->imr = 0;
Index: kvm/arch/x86/kvm/irq.c
===
--- kvm.orig/arch/x86/kvm/irq.c
+++ kvm/arch/x86/kvm/irq.c
@@ -72,7 +72,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcp
if (kvm_apic_accept_pic_intr(v)) {
s = pic_irqchip(v->kvm);
s->output = 0;  /* PIC */
-   vector = kvm_pic_read_irq(s);
+   vector = kvm_pic_read_irq(v->kvm);
}
}
return vector;
Index: kvm/arch/x86/kvm/irq.h
===
--- kvm.orig/arch/x86/kvm/irq.h
+++ kvm/arch/x86/kvm/irq.h
@@ -63,11 +63,12 @@ struct kvm_pic {
void *irq_request_opaque;
int output; /* intr from master PIC */
struct kvm_io_device dev;
+   void (*ack_notifier)(void *opaque, int irq);
 };
 
 struct kvm_pic *kvm_create_pic(struct kvm *kvm);
 void kvm_pic_set_irq(void *opaque, int irq, int level);
-int kvm_pic_read_irq(struct kvm_pic *s);
+int kvm_pic_read_irq(struct kvm *kvm);
 void kvm_pic_update_irq(struct kvm_pic *s);
 
 static inline struct kvm_pic *pic_irqchip(struct kvm *kvm)
Index: kvm/virt/kvm/ioapic.c
===
--- kvm.orig/virt/kvm/ioapic.c
+++ kvm/virt/kvm/ioapic.c
@@ -39,6 +39,7 @@
 
 #include "ioapic.h"
 #include "lapic.h"
+#include "irq.h"
 
 #if 0
 #define ioapic_debug(fmt,arg...) printk(KERN_WARNING fmt,##arg)
@@ -285,26 +286,31 @@ void kvm_ioapic_set_irq(struct kvm_ioapi
}
 }
 
-static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int gsi)
+static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int gsi,
+   int trigger_mode)
 {
union ioapic_redir_entry *ent;
 
ent = &ioapic->redirtbl[gsi];
-   ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
 
-   ent->fields.remote_irr = 0;
-   if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
-   ioapic_service(ioapic, gsi);
+   kvm_notify_acked_irq(ioapic->kvm, gsi);
+
+   if (trigger_mode == IOAPIC_LEVEL_TRIG) {
+   ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
+   ent->fields.remote_irr = 0;
+   if (!ent->fields.mask && (ioapic->irr & (1 << gsi)))
+   ioapic_service(ioapic, gsi);
+   }
 }
 
-void kvm_ioapic_update_eoi(struct kvm *kvm, int vector)
+void kvm_ioapic_update_eoi(struct kvm *kvm, int vector, int trigger_mode)
 {
struct kvm_ioapic *ioapic = kvm->arch.vioapic;
int i;
 
for (i = 0; i < IOAPIC_NUM_PINS; i++)
if (ioapic->redirtbl[i].fields.vector == vector)
-   __kvm_ioapic_update_eoi(ioapic, i);
+   __kvm_ioapic_update_eoi(ioapic, i, trigger_mode);
 }
 
 static int ioapic_in_range(struct kvm_io_device *this, gpa_t addr,
Index: kvm/virt/kvm/ioapic.h
===
--- kvm.orig/virt/kvm/ioapic.h
+++ kvm/virt/kvm/ioapic.h
@@ -58,6 +58,7 @@ struct kvm_ioapic {
} redirtbl[IOAPIC_NUM_PINS];
struct kvm_io_device dev;
struct kvm *kvm;
+   void (*ack_notifier)(void *opaque, int irq);
 };
 
 #ifdef DEBUG
@@ -87,7 +88,7 @@ static inline int irqchip_in_kernel(stru
 
 struct kvm_vcpu

[patch 3/3] KVM: PIT: fix injection logic and count

2008-07-26 Thread Marcelo Tosatti
The PIT injection logic is problematic under the following cases:

1) If there is a higher priority vector to be delivered by the time
kvm_pit_timer_intr_post is invoked ps->inject_pending won't be set. 
This opens the possibility for missing many PIT event injections (say if
guest executes hlt at this point).

2) ps->inject_pending is racy with more than two vcpus. Since there's no locking
around read/dec of pt->pending, two vcpu's can inject two interrupts for a 
single
pt->pending count.

Fix 1 by using an irq ack notifier: only reinject when the previous irq 
has been acked. Fix 2 with appropriate locking around manipulation of 
pending count and irq_ack by the injection / ack paths.

Also, count_load_time should be set at the time the count is reloaded,
not when the interrupt is injected (BTW, LAPIC uses the same apparently
broken scheme, could someone explain what was the reasoning behind
that? kvm_apic_timer_intr_post).

Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>

Index: kvm/arch/x86/kvm/i8254.c
===
--- kvm.orig/arch/x86/kvm/i8254.c
+++ kvm/arch/x86/kvm/i8254.c
@@ -207,6 +207,8 @@ static int __pit_timer_fn(struct kvm_kpi
 
pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
pt->scheduled = ktime_to_ns(pt->timer.expires);
+   if (pt->period)
+   ps->channels[0].count_load_time = pt->timer.expires;
 
return (pt->period == 0 ? 0 : 1);
 }
@@ -215,12 +217,22 @@ int pit_has_pending_timer(struct kvm_vcp
 {
struct kvm_pit *pit = vcpu->kvm->arch.vpit;
 
-   if (pit && vcpu->vcpu_id == 0 && pit->pit_state.inject_pending)
+   if (pit && vcpu->vcpu_id == 0 && pit->pit_state.irq_ack)
return atomic_read(&pit->pit_state.pit_timer.pending);
-
return 0;
 }
 
+void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
+{
+   struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
+irq_ack_notifier);
+   spin_lock(&ps->inject_lock);
+   if (atomic_dec_return(&ps->pit_timer.pending) < 0)
+   WARN_ON(1);
+   ps->irq_ack = 1;
+   spin_unlock(&ps->inject_lock);
+}
+
 static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
 {
struct kvm_kpit_state *ps;
@@ -255,8 +267,9 @@ static void destroy_pit_timer(struct kvm
hrtimer_cancel(&pt->timer);
 }
 
-static void create_pit_timer(struct kvm_kpit_timer *pt, u32 val, int is_period)
+static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int is_period)
 {
+   struct kvm_kpit_timer *pt = &ps->pit_timer;
s64 interval;
 
interval = muldiv64(val, NSEC_PER_SEC, KVM_PIT_FREQ);
@@ -268,6 +281,7 @@ static void create_pit_timer(struct kvm_
pt->period = (is_period == 0) ? 0 : interval;
pt->timer.function = pit_timer_fn;
atomic_set(&pt->pending, 0);
+   ps->irq_ack = 1;
 
hrtimer_start(&pt->timer, ktime_add_ns(ktime_get(), interval),
  HRTIMER_MODE_ABS);
@@ -302,11 +316,11 @@ static void pit_load_count(struct kvm *k
case 1:
 /* FIXME: enhance mode 4 precision */
case 4:
-   create_pit_timer(&ps->pit_timer, val, 0);
+   create_pit_timer(ps, val, 0);
break;
case 2:
case 3:
-   create_pit_timer(&ps->pit_timer, val, 1);
+   create_pit_timer(ps, val, 1);
break;
default:
destroy_pit_timer(&ps->pit_timer);
@@ -520,7 +534,7 @@ void kvm_pit_reset(struct kvm_pit *pit)
mutex_unlock(&pit->pit_state.lock);
 
atomic_set(&pit->pit_state.pit_timer.pending, 0);
-   pit->pit_state.inject_pending = 1;
+   pit->pit_state.irq_ack = 1;
 }
 
 struct kvm_pit *kvm_create_pit(struct kvm *kvm)
@@ -534,6 +548,7 @@ struct kvm_pit *kvm_create_pit(struct kv
 
mutex_init(&pit->pit_state.lock);
mutex_lock(&pit->pit_state.lock);
+   spin_lock_init(&pit->pit_state.inject_lock);
 
/* Initialize PIO device */
pit->dev.read = pit_ioport_read;
@@ -555,6 +570,9 @@ struct kvm_pit *kvm_create_pit(struct kv
pit_state->pit = pit;
hrtimer_init(&pit_state->pit_timer.timer,
 CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+   pit_state->irq_ack_notifier.gsi = 0;
+   pit_state->irq_ack_notifier.irq_acked = kvm_pit_ack_irq;
+   kvm_register_irq_ack_notifier(kvm, &pit_state->irq_ack_notifier);
mutex_unlock(&pit->pit_state.lock);
 
kvm_pit_reset(pit);
@@ -592,37 +610,19 @@ void kvm_inject_pit_timer_irqs(struct kv
struct kvm_kpit_state *ps;
 
if (vcpu && pit) {
+   int inject = 0;
ps = &pit->pit_state;
 
-   /* Try to inject pending interrupts when:
-* 1. Pending exists
-* 2. Last interrupt was accepted or waited for too long time*/
-  

[patch 0/3] fix PIT injection

2008-07-26 Thread Marcelo Tosatti
The in-kernel PIT emulation can either inject too many or too few
interrupts.

-- 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/3] KVM: Add irq ack notifier list

2008-07-26 Thread Marcelo Tosatti
From: Avi Kivity <[EMAIL PROTECTED]>

This can be used by kvm subsystems that are interested in when
interrupts are acked, for example time drift compensation.

Signed-off-by: Avi Kivity <[EMAIL PROTECTED]>
---
 arch/x86/kvm/irq.c |   22 ++
 arch/x86/kvm/irq.h |5 +
 include/asm-x86/kvm_host.h |7 +++
 3 files changed, 34 insertions(+), 0 deletions(-)


Index: kvm/arch/x86/kvm/irq.c
===
--- kvm.orig/arch/x86/kvm/irq.c
+++ kvm/arch/x86/kvm/irq.c
@@ -111,3 +111,25 @@ void kvm_set_irq(struct kvm *kvm, int ir
kvm_ioapic_set_irq(kvm->arch.vioapic, irq, level);
kvm_pic_set_irq(pic_irqchip(kvm), irq, level);
 }
+
+void kvm_notify_acked_irq(struct kvm *kvm, unsigned gsi)
+{
+   struct kvm_irq_ack_notifier *kian;
+   struct hlist_node *n;
+
+   hlist_for_each_entry(kian, n, &kvm->arch.irq_ack_notifier_list, link)
+   if (kian->gsi == gsi)
+   kian->irq_acked(kian);
+}
+
+void kvm_register_irq_ack_notifier(struct kvm *kvm,
+  struct kvm_irq_ack_notifier *kian)
+{
+   hlist_add_head(&kian->link, &kvm->arch.irq_ack_notifier_list);
+}
+
+void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
+struct kvm_irq_ack_notifier *kian)
+{
+   hlist_del(&kian->link);
+}
Index: kvm/arch/x86/kvm/irq.h
===
--- kvm.orig/arch/x86/kvm/irq.h
+++ kvm/arch/x86/kvm/irq.h
@@ -83,6 +83,11 @@ static inline int irqchip_in_kernel(stru
 void kvm_pic_reset(struct kvm_kpic_state *s);
 
 void kvm_set_irq(struct kvm *kvm, int irq, int level);
+void kvm_notify_acked_irq(struct kvm *kvm, unsigned gsi);
+void kvm_register_irq_ack_notifier(struct kvm *kvm,
+  struct kvm_irq_ack_notifier *kian);
+void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
+struct kvm_irq_ack_notifier *kian);
 
 void kvm_timer_intr_post(struct kvm_vcpu *vcpu, int vec);
 void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu);
Index: kvm/include/asm-x86/kvm_host.h
===
--- kvm.orig/include/asm-x86/kvm_host.h
+++ kvm/include/asm-x86/kvm_host.h
@@ -319,6 +319,12 @@ struct kvm_mem_alias {
gfn_t target_gfn;
 };
 
+struct kvm_irq_ack_notifier {
+   struct hlist_node link;
+   unsigned gsi;
+   void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
+};
+
 struct kvm_arch{
int naliases;
struct kvm_mem_alias aliases[KVM_ALIAS_SLOTS];
@@ -334,6 +340,7 @@ struct kvm_arch{
struct kvm_pic *vpic;
struct kvm_ioapic *vioapic;
struct kvm_pit *vpit;
+   struct hlist_head irq_ack_notifier_list;
 
int round_robin_prev_vcpu;
unsigned int tss_addr;

-- 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9][RFC] KVM virtio_net performance

2008-07-26 Thread Bill Davidsen

Anthony Liguori wrote:

Hi Mark,

Mark McLoughlin wrote:

Hey,
  Here's a bunch of patches attempting to improve the performance
of virtio_net. This is more an RFC rather than a patch submission
since, as can be seen below, not all patches actually improve the
perfomance measurably.
  


I'm still seeing the same problem I saw with my patch series.  Namely, 
dhclient fails to get a DHCP address.  Rusty noticed that RX has a lot 
more packets received then it should so we're suspicious that we're 
getting packet corruption.


I have been discussing this (on this list) in another thread. Putting 
tcpdump on the eth0 device in the VM, the br0 device in the host, and 
the eth0 (physical NIC) in the host, you can see that when the VM 
generates a DHCP request it shows up on the br0 in the host, but never 
gets sent on the wire by eth0.


That's the point of failure, at least using RHEL5/FC6/kvm-66 as the 
environment.


Configuring the tap device with a static address, here's what I get with 
iperf:


w/o patches:

guest->host: 625 Mbits/sec
host->guest: 825 Mbits/sec

w/patches

guest->host:  2.02 Gbits/sec
host->guest: 1.89 Gbits/sec

guest lo: 4.35 Gbits/sec
host lo: 4.36 Gbits/sec

This is with KVM GUEST configured FWIW.

Regards,

Anthony Liguori



--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Simple way of putting a VM on a LAN

2008-07-26 Thread Bill Davidsen

Stuart Jansen wrote:

On Fri, 2008-07-25 at 12:44 -0400, Bill Davidsen wrote:
But when the host is really on the network, it uses DHCP to set the IP, 
while in a VM it never sends any DHCP packets, the setting of the IP 
times out, and I wind up with no IP until I set it. I have checked with 
tcpdump, the DHCP requests for IP appear on the bridge, but not on the 
eth0 NIC, and so are never seen by the DHCP server.


Do you see this problem, or have any information about it? Obviously 
suggestions on fixing this are needed, since the dhcp server is a 
candidate for virtualization in the future.


Just to be certain of the obvious, you added eth0 to the bridge, right?

brctl addif br0 eth0


Yes, everything works except the DHCP discovery. Once I bring up the VM 
NIC by hand and set the default route everything works really well with 
TCP, UDP, and ICMP, as well as the usual ARP packets, etc.


And the tap device is active, right?

ifconfig tap0 up


UP and based on something I saw in another script is tried adding 
promiscuous, which really didn't change anything.


Assuming it isn't something so obvious, I'm suspecting spanning tree.

brctl stp br0 off


It was never on (unless it was turned on by something more automated 
than my fingers, but it's definitely off now, and make no difference.


For a test I modified the network setup to a static IP and routing. That 
did work, although it is undesirable, since it invites having the DNS 
wrong. I moved to DHCP to be sure that the IPs are always right, a 
master list gets turned into entries in both dhchd.conf and the 
appropriate DNS files (forward and reverse lookeps are always right, 
too). Shot myself in that foot way back in ARPAnet days :-(


Thanks for the ideas, I have one more, but I have to do a little 
research before I can ask an intelligent question.


--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e1000 and PXE issues

2008-07-26 Thread Greg Kurtzer
On Sat, Jul 26, 2008 at 1:12 AM, Avi Kivity <[EMAIL PROTECTED]> wrote:
> Greg Kurtzer wrote:
>>
>> Hello,
>>
>> I noticed some problems with the e1000 implementation in kvm >= 70. At
>> first glance it seemed liked a PXE problem as it would not acknowledge
>> the DHCP offer from the server. I tried several different Etherboot
>> ROM images and version 5.2.6 seemed to work. That version isn't PXE
>> compliant so I built an ELF image to boot, and it downloaded it very,
>> very, very, very slowly (as in about 10 minutes) but it did end up
>> working.
>>
>> This all worked perfectly with version 69 and previous.
>>
>
> There are two patches to e1000 in kvm-70; can you try backing them out
> (patch -Rp1 < test.patch) to see which one is guilty?
>
> Candidates attached.

Reverted both of these patches yet the problem remains. :(

This is easy to reproduce (at least on my build). if there is a DHCP
server on the network, just do a network boot on the e1000. It makes a
correct DHCPDISCOVER, but never responds to the DHCPOFFER (it should
do a DHCPREQUEST next). No packets are getting lost according tcpdump
on the master.

The console is showing:

Probing pci nic...
[e1000-82540em]Ethernet addr: 00:04:21:DE:99:55
Searching for server (DHCP)No IP address
.No IP address
.No IP address
.No IP address
.No IP address
.No IP address
.No IP address
...

Thanks again for the help!

-- 
Greg Kurtzer
http://www.infiscale.com/
http://www.runlevelzero.net/
http://www.perceus.org/
http://www.caoslinux.org/
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer

2008-07-26 Thread Mark McLoughlin
On Sat, 2008-07-26 at 12:48 +0300, Avi Kivity wrote:
> Mark McLoughlin wrote:
> > The current virtio_net tx timer is 2ns, which doesn't
> > make any sense. Set it to a more reasonable 150us
> > instead.
> >
> > Signed-off-by: Mark McLoughlin <[EMAIL PROTECTED]>
> > ---
> >  qemu/hw/virtio-net.c |2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
> > index 2e57e5a..31867f1 100644
> > --- a/qemu/hw/virtio-net.c
> > +++ b/qemu/hw/virtio-net.c
> > @@ -26,7 +26,7 @@
> >  #define VIRTIO_NET_F_MAC   5
> >  #define VIRTIO_NET_F_GS0   6
> >  
> > -#define TX_TIMER_INTERVAL (1000 / 500)
> > +#define TX_TIMER_INTERVAL (15) /* 150 us */
> >  
> 
> Ouch.

Well, not so much - and I should have explained why.

Even though virtio-net is requesting a 2ns tx timer, it actually gets
limited to MIN_TIMER_REARM_US which is currently 250us.

However, even though the timer itself will only fire after 250us,
expire_time is only set to +2ns, so we'll get the timeout callback next
time qemu_run_timers() is called from the mainloop.

I think this might account for a lot of the jitter in the throughput
numbers - the effective tx window size is anywhere between 2ns and 250us
depending on e.g. whether there is rx data available on the tap fd.

Cheers,
Mark.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: device assignemnt: updated patches

2008-07-26 Thread Han, Weidong
Avi Kivity wrote:
> Han, Weidong wrote:
>>> How are we standing with merging the changes to the VT-d driver,
>>> which are a prerequisite? 
>>> 
>> 
>> You mean push the changes into VT-d driver first. But the VT-d driver
>> modification patch maybe need some changes during following
>> development. How about making it stable in KVM first, then push it
>> into VT-d driver? 
>> 
> 
> Yeah, we do have a chicken-and-egg situation here. I guess I can carry
> that patch while we're working this out.
> 
> Longer term, we will have to move to a vendor neutral API so as to
> support amd iommu and non-x86 iommus.

I agree arch-independent support is direction. But we should make one
work first, then make it generic later.

Randy (Weidong)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/9] kvm: qemu: Fix virtio_net tx timer

2008-07-26 Thread Avi Kivity

Mark McLoughlin wrote:

The current virtio_net tx timer is 2ns, which doesn't
make any sense. Set it to a more reasonable 150us
instead.

Signed-off-by: Mark McLoughlin <[EMAIL PROTECTED]>
---
 qemu/hw/virtio-net.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 2e57e5a..31867f1 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -26,7 +26,7 @@
 #define VIRTIO_NET_F_MAC   5
 #define VIRTIO_NET_F_GS0   6
 
-#define TX_TIMER_INTERVAL (1000 / 500)

+#define TX_TIMER_INTERVAL (15) /* 150 us */
 
  


Ouch.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9][RFC] KVM virtio_net performance

2008-07-26 Thread Avi Kivity

Mark McLoughlin wrote:

Hey,
  Here's a bunch of patches attempting to improve the performance
of virtio_net. This is more an RFC rather than a patch submission
since, as can be seen below, not all patches actually improve the
perfomance measurably.

  I've tried hard to test each of these patches with as stable and
informative a benchmark as I could find. The first benchmark is a
netperf[1] based throughput benchmark and the second uses a flood
ping[2] to measure latency differences.

  Each set of figures is min/average/max/standard deviation. The
first set is Gb/s and the second is milliseconds.

  The network configuration used was very simple - the guest with
a virtio_net interface and the host with a tap interface and static
IP addresses assigned to both - e.g. there was no bridge in the host
involved and iptables was disable in both the host and guest.

  I used:

  1) kvm-71-26-g6152996 with the patches that follow

  2) Linus's v2.6.26-5752-g93ded9b with Rusty's virtio patches from
 219:bbd2611289c5 applied; these are the patches have just been
 submitted to Linus

  The conclusions I draw are:

  1) The length of the tx mitigation timer makes quite a difference to
 throughput achieved; we probably need a good heuristic for
 adjusting this on the fly.
  


The tx mitigation timer is just one part of the equation; the other is 
the virtio ring window size, which is now fixed.


Using a maximum sized window is good when the guest and host are running 
flat out, doing nothing but networking.  When throughput drops (because 
the guest is spending cpu on processing, or simply because the other 
side is not keeping up), we need to drop the windows size so as to 
retain acceptable latencies.


The tx timer can then be set to "a bit after the end of the window", 
acting as a safety belt in case the throughput changes.



  4) Dropping the global mutex while reading GSO packets from the tap
 interface gives a nice speedup. This highlights the global mutex
 as a general perfomance issue.

  


Not sure whether this is safe.  What's stopping the guest from accessing 
virtio and changing some state?



  5) Eliminating an extra copy on the host->guest path only makes a
 barely measurable difference.

  


That's expected on a host->guest test.  Zero copy is mostly important 
for guest->external, and with zerocopy already enabled in the guest 
(sendfile or nfs server workloads).



Anyway, the figures:

  netperf, 10x20s runs (Gb/s)  |   guest->host  |   host->guest
  
-++---
  baseline | 1.520/ 1.573/ 1.610/ 0.034 | 1.160/ 1.357/ 
1.630/ 0.165
  50us tx timer + rearm| 1.050/ 1.086/ 1.110/ 0.017 | 1.710/ 1.832/ 
1.960/ 0.092
  250us tx timer + rearm   | 1.700/ 1.764/ 1.880/ 0.064 | 0.900/ 1.203/ 
1.580/ 0.205
  150us tx timer + rearm   | 1.520/ 1.602/ 1.690/ 0.044 | 1.670/ 1.928/ 
2.150/ 0.141
  no ring-full heuristic   | 1.480/ 1.569/ 1.710/ 0.066 | 1.610/ 1.857/ 
2.140/ 0.153
  VIRTIO_F_NOTIFY_ON_EMPTY | 1.470/ 1.554/ 1.650/ 0.054 | 1.770/ 1.960/ 
2.170/ 0.119
  recv NO_NOTIFY   | 1.530/ 1.604/ 1.680/ 0.047 | 1.780/ 1.944/ 
2.190/ 0.129
  GSO  | 4.120/ 4.323/ 4.420/ 0.099 | 6.540/ 7.033/ 
7.340/ 0.244
  ring size == 256 | 4.050/ 4.406/ 4.560/ 0.143 | 6.280/ 7.236/ 
8.280/ 0.613
  ring size == 512 | 4.420/ 4.600/ 4.960/ 0.140 | 6.470/ 7.205/ 
7.510/ 0.314
  drop mutex during tapfd read | 4.320/ 4.578/ 4.790/ 0.161 | 8.370/ 8.589/ 
8.730/ 0.120
  aligouri zero-copy   | 4.510/ 4.694/ 4.960/ 0.148 | 8.430/ 8.614/ 
8.840/ 0.142
  


Very impressive numbers; much better than I expected.  The host->guest 
numbers are around 100x better than the original emulated card througput 
we got from kvm.



  ping -f -c 10 (ms)   |   guest->host  |   host->guest
  
-++---
  baseline | 0.060/ 0.459/ 7.602/ 0.846 | 0.067/ 0.331/ 
2.517/ 0.057
  50us tx timer + rearm| 0.081/ 0.143/ 7.436/ 0.374 | 0.093/ 0.133/ 
1.883/ 0.026
  250us tx timer + rearm   | 0.302/ 0.463/ 7.580/ 0.849 | 0.297/ 0.344/ 
2.128/ 0.028
  150us tx timer + rearm   | 0.197/ 0.323/ 7.671/ 0.740 | 0.199/ 0.245/ 
7.836/ 0.037
  no ring-full heuristic   | 0.182/ 0.324/ 7.688/ 0.753 | 0.199/ 0.243/ 
2.197/ 0.030
  VIRTIO_F_NOTIFY_ON_EMPTY | 0.197/ 0.321/ 7.447/ 0.730 | 0.196/ 0.242/ 
2.218/ 0.032
  recv NO_NOTIFY   | 0.186/ 0.321/ 7.520/ 0.732 | 0.200/ 0.233/ 
2.216/ 0.028
  GSO  | 0.178/ 0.324/ 7.667/ 0.736 | 0.147/ 0.246/ 
1.361/ 0.024
  ring size == 256 | 0.184/ 0.323/ 7.674/ 0.728 | 0.199/ 0.243/ 
2.181/ 0.028
  ring size == 512 | (not measured) | (not 
measured)
  dr

Re: device assignemnt: updated patches

2008-07-26 Thread Avi Kivity

Han, Weidong wrote:

How are we standing with merging the changes to the VT-d driver, which
are a prerequisite?



You mean push the changes into VT-d driver first. But the VT-d driver
modification patch maybe need some changes during following development.
How about making it stable in KVM first, then push it into VT-d driver?
  


Yeah, we do have a chicken-and-egg situation here. I guess I can carry 
that patch while we're working this out.


Longer term, we will have to move to a vendor neutral API so as to 
support amd iommu and non-x86 iommus.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: device assignemnt: updated patches

2008-07-26 Thread Han, Weidong
Avi Kivity wrote:
> Ben-Ami Yassour wrote:
>> Following are the device assignment patches with the fixes of the
>> comments that were sent for the previous version.
>> 
>> Here is the list of changes that were made with respect to the
>> previous version: 
>> 1. Replace the interrupt ack hook patches with the notifiers list
>> patch by Avi. 
>> 2. Remove code from ioapic - the ack notifier is called before the
>> checking the irr bit. 
>> 3. Remove the interrupt ack work queue and handle it directly.
>> 4. Minimize the function for finding an assigned device.
>> 5. Remove the pt_lock, after making the changes above it is no
>> longer needed. 
>> 6. Move declarations from kvm_para.h
>> 7. Add irq array to ioctl API. Note that this is only a change to the
>> API, currently only single irq is supported.
>> 8. Renaming: use "assigned device" and not "passthrough device"
>> 9. Fix device release error path
>> 10. Moving the assigned devices list pointer to the device struct
>> itself and remove the extra structure.
>> 
>> Pending comment: shared guest interrputs are not tested.
>> 
>> 
> 
> (and not implemented, either).
> 
> Good progress; there are still many issues, but this is inevitable
> with 
> such a large and complex patchset.
> 
> How are we standing with merging the changes to the VT-d driver, which
> are a prerequisite?

You mean push the changes into VT-d driver first. But the VT-d driver
modification patch maybe need some changes during following development.
How about making it stable in KVM first, then push it into VT-d driver?

Randy (Weidong)

> 
> 
> --
> Do not meddle in the internals of kernels, for they are subtle and
> quick to panic. 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qcow and updates

2008-07-26 Thread Avi Kivity

Bill Davidsen wrote:
This question came out of a discussion of nesting (or stringing if you 
like) disk images using qcow.


If I have an unpatched install image, say CentOS-5.2 (box "A" below). 
and I make a copy of that with qcow and apply patches to that copy 
(box "B" below), everything is just fine. But if I should make a 
working qcow image from the patched version (box "C" below), which 
works fine initially, what would happen if I applied patches to the 
"current" version? I assume that would mess up the application 
version, since it's a physical copy, but I'd like to be sure.


 __
| [A]  |
| orig release |
|__|
||
   {qcow copy|
||
 
| [B]|
| patched to current |
||
||
   {qcow copy}
||
 _
| [C] |
| fully config. app. svr. |
|_|

I could play with union filesystem, network mounting of /var and /usr, 
and other tricks, but I thought I'd check that this really is a 
problem, and the current version needs a "real" copy to the app 
server, and then the app server needs to be patched after that.





Changing anything except the last image in the chain is liable to 
break.  In other words: once you use an image as a base for another 
image, this first image must never change.



--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH retry] arch/ia64/kvm/kvm-ia64.c: Add local_irq_restore in error handling code

2008-07-26 Thread Avi Kivity

Julia Lawall wrote:

From: Julia Lawall <[EMAIL PROTECTED]>

There is a call to local_irq_restore in the normal exit case, so it would
seem that there should be one on an error return as well.

The semantic patch that finds this problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
  


Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device assignemnt: updated patches

2008-07-26 Thread Avi Kivity

Ben-Ami Yassour wrote:

Following are the device assignment patches with the fixes of the
comments that were sent for the previous version.

Here is the list of changes that were made with respect to the previous
version:
1. Replace the interrupt ack hook patches with the notifiers list patch
by Avi.
2. Remove code from ioapic - the ack notifier is called before the
checking the irr bit.
3. Remove the interrupt ack work queue and handle it directly.
4. Minimize the function for finding an assigned device.
5. Remove the pt_lock, after making the changes above it is no longer
needed.
6. Move declarations from kvm_para.h
7. Add irq array to ioctl API. Note that this is only a change to the
API, currently only single irq is supported.
8. Renaming: use "assigned device" and not "passthrough device"
9. Fix device release error path
10. Moving the assigned devices list pointer to the device struct itself
and remove the extra structure. 


Pending comment: shared guest interrputs are not tested.

  


(and not implemented, either).

Good progress; there are still many issues, but this is inevitable with 
such a large and complex patchset.


How are we standing with merging the changes to the VT-d driver, which 
are a prerequisite?



--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] KVM: Device assignemnt with VT-d

2008-07-26 Thread Avi Kivity

Ben-Ami Yassour wrote:

From: Kay, Allen M <[EMAIL PROTECTED]>

This patch includes the functions to support VT-d for passthrough
devices.

[Ben: fixed memory pinning, cleanup]

Signed-off-by: Kay, Allen M <[EMAIL PROTECTED]>
Signed-off-by: Weidong Han <[EMAIL PROTECTED]>
Signed-off-by: Ben-Ami Yassour <[EMAIL PROTECTED]>
---
 arch/x86/kvm/Makefile  |2 +-
 arch/x86/kvm/vtd.c |  182 
 arch/x86/kvm/x86.c |   11 +++
 include/asm-x86/kvm_host.h |3 +
 include/linux/kvm_host.h   |6 ++
 virt/kvm/kvm_main.c|8 ++-
 6 files changed, 210 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kvm/vtd.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index d0e940b..5d9d079 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -11,7 +11,7 @@ endif
 EXTRA_CFLAGS += -Ivirt/kvm -Iarch/x86/kvm
 
 kvm-objs := $(common-objs) x86.o mmu.o x86_emulate.o i8259.o irq.o lapic.o \

-   i8254.o
+   i8254.o vtd.o
 obj-$(CONFIG_KVM) += kvm.o
 kvm-intel-objs = vmx.o
 obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
diff --git a/arch/x86/kvm/vtd.c b/arch/x86/kvm/vtd.c
new file mode 100644
index 000..7a3cf4e
--- /dev/null
+++ b/arch/x86/kvm/vtd.c
@@ -0,0 +1,182 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Copyright (C) 2006-2008 Intel Corporation
+ * Author: Allen M. Kay <[EMAIL PROTECTED]>
+ * Author: Weidong Han <[EMAIL PROTECTED]>
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int kvm_iommu_unmap_memslots(struct kvm *kvm);
+
+int kvm_iommu_map_pages(struct kvm *kvm,
+   gfn_t base_gfn, unsigned long npages)
+{
+   gfn_t gfn = base_gfn;
+   pfn_t pfn;
+   int i, rc;
+   struct dmar_domain *domain = kvm->arch.intel_iommu_domain;
+
+   if (!domain)
+   return -EFAULT;
+
+   for (i = 0; i < npages; i++) {
+   pfn = gfn_to_pfn(kvm, gfn);
+   if (!is_mmio_pfn(pfn)) {
+   rc = intel_iommu_page_mapping(domain,
+ gfn << PAGE_SHIFT,
+ pfn << PAGE_SHIFT,
  


This overflows on i386, where gfn and pfn are longs while gpas and hpas 
are u64s.


gfn_to_gpa() gets the first one right.  There should also be a 
pfn_to_phys(), but there isn't.



+ PAGE_SIZE,
+ DMA_PTE_READ |
+ DMA_PTE_WRITE);
+   if (rc)
+   kvm_release_pfn_clean(pfn);
  


You're never actually returning rc, so any errors will be swallowed 
silently.



+   } else {
+   printk(KERN_DEBUG "kvm_iommu_map_page:"
+  "invalid pfn=%lx\n", pfn);
+   return 0;
+   }
+
+   gfn++;
+   }
+   return 0;
  


Isn't this slow on large guests? (not a merge barrier).


+}
+
+int kvm_iommu_map_guest(struct kvm *kvm,
+   struct kvm_assigned_dev *assigned_dev)
+{
+   struct pci_dev *pdev = NULL;
+
+   printk(KERN_DEBUG "VT-d direct map: host bdf = %x:%x:%x\n",
+  assigned_dev->host.busnr,
+  PCI_SLOT(assigned_dev->host.devfn),
+  PCI_FUNC(assigned_dev->host.devfn));
+
+   for_each_pci_dev(pdev) {
+   if ((pdev->bus->number == assigned_dev->host.busnr) &&
+   (pdev->devfn == assigned_dev->host.devfn)) {
+   break;
+   }
+   }
  


Why not keep pdev in kvm_assigned_dev?  We've claimed it, so it's ours.


+
+   if (pdev == NULL) {
+   if (kvm->arch.intel_iommu_domain) {
+   intel_iommu_domain_exit(kvm->arch.intel_iommu_domain);
+   kvm->arch.intel_iommu_domain = NULL;
+   }
+   return -ENODEV;
+   }
+
+   kvm->arch.intel_iommu_domain = intel_iommu_domain_alloc(pdev);
+
+   if (kvm_iommu_map_memslots(kvm)) {
+   kvm_iommu_unmap_memslots(kvm);
+   return -EFAULT;
  


EFAULT is for the kernel touching an invalid userspace page.  You should

Re: [PATCH 2/4] KVM: pci device assignment

2008-07-26 Thread Avi Kivity

Ben-Ami Yassour wrote:

Based on a patch from: Amit Shah <[EMAIL PROTECTED]>

This patch adds support for handling PCI devices that are assigned to
the guest.

The device to be assigned to the guest is registered in the host kernel
and interrupt delivery is handled. If a device is already assigned, or
the device driver for it is still loaded on the host, the device
assignment
is failed by conveying a -EBUSY reply to the userspace.

Devices that share their interrupt line are not supported at the moment.

By itself, this patch will not make devices work within the guest.
The VT-d extension is required to enable the device to perform DMA.
Another alternative is PVDMA.

 
+struct kvm_assigned_dev_kernel

+*kvm_find_assigned_dev(struct list_head *head,
  


Keep these two on the same line.


+
+static int
+kvm_vm_ioctl_device_assignment(struct kvm *kvm,
  


Ditto.


+  struct kvm_assigned_dev *assigned_dev)
+{
+   int r = 0;
+   struct kvm_assigned_dev_kernel *match;
+   struct pci_dev *dev;
+
+   if (assigned_dev->host.num_valid_irqs != 1) {
+   printk(KERN_INFO "%s: Unsupported number of irqs %d\n",
+  __func__, assigned_dev->host.num_valid_irqs);
+   return -EINVAL;
+   }
  


We also support zero irqs.  While this patch doesn't do much with the 
information (only claim the device), this is still useful.




diff --git a/include/asm-x86/kvm.h b/include/asm-x86/kvm.h
index 8f13749..12b4b25 100644
--- a/include/asm-x86/kvm.h
+++ b/include/asm-x86/kvm.h
@@ -208,4 +208,5 @@ struct kvm_pit_channel_state {
 struct kvm_pit_state {
struct kvm_pit_channel_state channels[3];
 };
+
 #endif
  


Unrelated.


diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index e2864e6..34eb3e7 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -325,6 +325,25 @@ struct kvm_irq_ack_notifier {
void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
 };
 
+/* For assigned devices, we schedule work in the system workqueue to

+ * inject interrupts into the guest when an interrupt occurs on the
+ * physical device and also when the guest acks the interrupt.
+ */
+struct kvm_assigned_dev_work {
+   struct work_struct work;
+   struct kvm_assigned_dev_kernel *assigned_dev;
+};
+
+struct kvm_assigned_dev_kernel {
+   struct kvm_irq_ack_notifier ack_notifier;
+   struct list_head list;
+   struct kvm_assigned_dev_info guest;
+   struct kvm_assigned_dev_info host;
+   struct kvm_assigned_dev_work int_work;
  


Just put the work_struct here, and use container_of() instead of 
->assigned_dev.



 struct kvm_arch{
int naliases;
struct kvm_mem_alias aliases[KVM_ALIAS_SLOTS];
@@ -337,6 +356,7 @@ struct kvm_arch{
 * Hash table of struct kvm_mmu_page.
 */
struct list_head active_mmu_pages;
+   struct list_head assigned_dev_head;
  


This is useful for non-x86 (except s390).  It's okay to leave it to the 
ia64/ppc maintainers, though.



struct kvm_pic *vpic;
struct kvm_ioapic *vioapic;
struct kvm_pit *vpit;
diff --git a/include/asm-x86/kvm_para.h b/include/asm-x86/kvm_para.h
index 76f3921..3aa1731 100644
--- a/include/asm-x86/kvm_para.h
+++ b/include/asm-x86/kvm_para.h
@@ -143,5 +143,4 @@ static inline unsigned int kvm_arch_para_features(void)
 }
 
 #endif

-
  


Spurious.


 #endif
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 6edba45..c436c08 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -382,6 +382,7 @@ struct kvm_trace_rec {
 #define KVM_CAP_PV_MMU 13
 #define KVM_CAP_MP_STATE 14
 #define KVM_CAP_COALESCED_MMIO 15
+#define KVM_CAP_DEVICE_ASSIGNMENT 16
 
 /*

  * ioctls for VM fds
@@ -411,6 +412,8 @@ struct kvm_trace_rec {
_IOW(KVMIO,  0x67, struct kvm_coalesced_mmio_zone)
 #define KVM_UNREGISTER_COALESCED_MMIO \
_IOW(KVMIO,  0x68, struct kvm_coalesced_mmio_zone)
+#define KVM_UPDATE_ASSIGNED_DEVICE _IOR(KVMIO, 0x69,   \
+   struct kvm_assigned_dev)
 
 /*

  * ioctls for vcpu fds
@@ -475,4 +478,22 @@ struct kvm_trace_rec {
 #define KVM_TRC_STLB_INVAL   (KVM_TRC_HANDLER + 0x18)
 #define KVM_TRC_PPC_INSTR(KVM_TRC_HANDLER + 0x19)
 
+#define ASSIGNED_DEV_MAX_IRQ 16

+
+/* Stores information for identifying host PCI devices assigned to the
+ * guest: this is used in the host kernel and in the userspace.
+ */
+struct kvm_assigned_dev_info {
+   __u32 busnr;
+   __u32 devfn;
+   __u32 irq[ASSIGNED_DEV_MAX_IRQ];
+   __u32 num_valid_irqs; /* currently only 1 is supported */
+};
  


Move num_valid_irqs before the array.  Add padding so the structure is a 
multiple of 64 bits (needed so that the i386 and x86_64 ABIs are identical).




+
+/* Mapping between host and guest PCI device */
+struct kvm_assigned_dev {
+   struct kvm_assigned_dev_info guest;
+   

Re: [PATCH 1/4] KVM: Add irq ack notifier list

2008-07-26 Thread Avi Kivity

Ben-Ami Yassour wrote:

From: Avi Kivity <[EMAIL PROTECTED]>

This can be used by kvm subsystems that are interested in when
interrupts
are acked, for example time drift compenstation.

  


Please add the pic and ioapic glue code to this patch.


Signed-off-by: Avi Kivity <[EMAIL PROTECTED]>
  


(You need to add your signoff to patches you send; even if you didn't 
modify them).



--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e1000 and PXE issues

2008-07-26 Thread Avi Kivity

Greg Kurtzer wrote:

Hello,

I noticed some problems with the e1000 implementation in kvm >= 70. At
first glance it seemed liked a PXE problem as it would not acknowledge
the DHCP offer from the server. I tried several different Etherboot
ROM images and version 5.2.6 seemed to work. That version isn't PXE
compliant so I built an ELF image to boot, and it downloaded it very,
very, very, very slowly (as in about 10 minutes) but it did end up
working.

This all worked perfectly with version 69 and previous.
  


There are two patches to e1000 in kvm-70; can you try backing them out 
(patch -Rp1 < test.patch) to see which one is guilty?


Candidates attached.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

commit 2ec717a98c8b8e0c8f1ddc562115ca81d2dc024e
Author: Laurent Vivier <[EMAIL PROTECTED]>
Date:   Fri May 30 16:07:31 2008 +0200

kvm: qemu: coalesced MMIO support (e1000)

This patch defines coalesced MMIO zones for e1000 ethernet card.

Signed-off-by: Laurent Vivier <[EMAIL PROTECTED]>
Signed-off-by: Avi Kivity <[EMAIL PROTECTED]>

diff --git a/qemu/hw/e1000.c b/qemu/hw/e1000.c
index 01f8983..5b3a365 100644
--- a/qemu/hw/e1000.c
+++ b/qemu/hw/e1000.c
@@ -26,6 +26,7 @@
 #include "hw.h"
 #include "pci.h"
 #include "net.h"
+#include "qemu-kvm.h"
 
 #include "e1000_hw.h"
 
@@ -938,6 +939,18 @@ e1000_mmio_map(PCIDevice *pci_dev, int region_num,
 
 d->mmio_base = addr;
 cpu_register_physical_memory(addr, PNPMMIO_SIZE, d->mmio_index);
+
+if (kvm_enabled()) {
+	int i;
+uint32_t excluded_regs[] = {
+E1000_MDIC, E1000_ICR, E1000_ICS, E1000_IMS,
+E1000_IMC, E1000_TCTL, E1000_TDT, PNPMMIO_SIZE
+};
+qemu_kvm_register_coalesced_mmio(addr, excluded_regs[0]);
+for (i = 0; excluded_regs[i] != PNPMMIO_SIZE; i++)
+qemu_kvm_register_coalesced_mmio(addr + excluded_regs[i] + 4,
+ excluded_regs[i + 1] - excluded_regs[i] - 4);
+}
 }
 
 static int
commit 404ce95cb202ca6015beb26e9b870f91c0309ee0
Author: Anthony Liguori <[EMAIL PROTECTED]>
Date:   Wed May 7 16:40:58 2008 -0500

kvm: qemu: fix e1000 can_receive handler

The current logic of the can_receive handler is to allow packets whenever the
receiver is disabled or when there are descriptors available in the ring.

I think the logic ought to be to allow packets whenever the receiver is enabled
and there are descriptors available in the ring.

Signed-off-by: Anthony Liguori <[EMAIL PROTECTED]>
Signed-off-by: Avi Kivity <[EMAIL PROTECTED]>

diff --git a/qemu/hw/e1000.c b/qemu/hw/e1000.c
index 0728539..01f8983 100644
--- a/qemu/hw/e1000.c
+++ b/qemu/hw/e1000.c
@@ -520,8 +520,8 @@ e1000_can_receive(void *opaque)
 {
 E1000State *s = opaque;
 
-return (!(s->mac_reg[RCTL] & E1000_RCTL_EN) ||
-s->mac_reg[RDH] != s->mac_reg[RDT]);
+return ((s->mac_reg[RCTL] & E1000_RCTL_EN) &&
+	s->mac_reg[RDH] != s->mac_reg[RDT]);
 }
 
 static void


Re: [patch 1/2] KVM: x86: do not entry guest mode if vcpu is not runnable

2008-07-26 Thread Avi Kivity

Marcelo Tosatti wrote:

If a vcpu has been offlined, or not initialized at all, signals
requesting userspace work to be performed will result in KVM attempting
to re-entry guest mode.

Problem is that the in-kernel irqchip emulation happily executes HALTED
state vcpu's. This breaks "savevm" on Windows SMP installation (that
only boots up a single vcpu), for example.

Fix it by blocking halted vcpu's at kvm_arch_vcpu_ioctl_run(). 


Change the promotion from halted to running to happen in the vcpu
context. Use the information available in kvm_vcpu_block(), and the
current mpstate to make the decision:

- If there's an in-kernel timer or irq event the halted->running
promotion evaluation can be performed, no need for userspace assistance.

- If there's a signal, there's either userspace work to be performed
in the vcpu's context or irqchip emulation is in userspace.

This has the nice side effect of avoiding userspace exit in case 
of irq injection to a halted vcpu from the iothread.


Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>

Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -2505,17 +2505,25 @@ void kvm_arch_exit(void)
kvm_mmu_module_exit();
 }
 
+static void kvm_vcpu_promote_runnable(struct kvm_vcpu *vcpu)

+{
+   if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
+   vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+}
+
 int kvm_emulate_halt(struct kvm_vcpu *vcpu)
 {
++vcpu->stat.halt_exits;
KVMTRACE_0D(HLT, vcpu, handler);
if (irqchip_in_kernel(vcpu->kvm)) {
+   int ret;
  


Missing blank line.


@@ -2978,10 +2986,12 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_v
if (vcpu->sigset_active)
sigprocmask(SIG_SETMASK, &vcpu->sigset, &sigsaved);
 
-	if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {

-   kvm_vcpu_block(vcpu);
-   r = -EAGAIN;
-   goto out;
+   if (unlikely(!kvm_arch_vcpu_runnable(vcpu))) {
+   if (kvm_vcpu_block(vcpu)) {
+   r = -EAGAIN;
+   goto out;
+   }
+   kvm_vcpu_promote_runnable(vcpu);
}
  



Any reason this is not in __vcpu_run()?

Our main loop could look like

  while (no reason to stop)
if (runnable)
 enter guest
else
 block
deal with aftermath

kvm_emulate_halt would then simply modify the mp state.

 
 	/* re-sync apic's tpr */

Index: kvm/include/linux/kvm_host.h
===
--- kvm.orig/include/linux/kvm_host.h
+++ kvm/include/linux/kvm_host.h
@@ -199,7 +199,7 @@ struct kvm_memory_slot *gfn_to_memslot(s
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
-void kvm_vcpu_block(struct kvm_vcpu *vcpu);

+int kvm_vcpu_block(struct kvm_vcpu *vcpu);
 void kvm_resched(struct kvm_vcpu *vcpu);
 void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
 void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
Index: kvm/virt/kvm/kvm_main.c
===
--- kvm.orig/virt/kvm/kvm_main.c
+++ kvm/virt/kvm/kvm_main.c
@@ -818,9 +818,10 @@ void mark_page_dirty(struct kvm *kvm, gf
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
-void kvm_vcpu_block(struct kvm_vcpu *vcpu)
+int kvm_vcpu_block(struct kvm_vcpu *vcpu)
 {
DEFINE_WAIT(wait);
+   int ret = 0;
 
 	for (;;) {

prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
@@ -831,8 +832,10 @@ void kvm_vcpu_block(struct kvm_vcpu *vcp
break;
if (kvm_arch_vcpu_runnable(vcpu))
break;
-   if (signal_pending(current))
+   if (signal_pending(current)) {
+   ret = 1;
break;
+   }
  


This is ambiguous.  Multiple exit conditions could be true at the same 
time (vcpu becomes runnable _and_ signal is pending), so you can't trust 
the return code.  It doesn't affect the usage in the rest of the patch 
(I think), but it is best to avoid such subtlety.


Can this be done by setting a KVM_REQ_UNHALT bit in vcpu->requests?

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html