date:20101011

Re: system_powerdown not working for qemu-kvm 0.12.4?

2010-10-11 Thread Avi Kivity


 On 10/11/2010 07:53 PM, Ruben Kerkhof wrote:

On Mon, Oct 11, 2010 at 18:57, Avi Kivity  wrote:
>On 10/11/2010 06:29 PM, Ruben Kerkhof wrote:
>>
>>  Hi Avi,
>>
>>  On Sun, Aug 15, 2010 at 13:00, Avi Kivitywrote:
>>
>>  >I'm betting 73b48d914f9 is the cause, but let's see the full bisect.
>>
>>  system_powerdown with FreeBSD 8.1 works on 73b48d914f9.
>>
>>  I've bisected the issue to commit 76e617c.
>>
>>  Let me know if I should do some other tests.
>
>  It works on 73b48d914f9 but fails on 76e617c due to the seabios change.
>
>  Can you try 76e617c but with SeaBIOS 0.5.0?  If that works, keep qemu-kvm on
>  76e617c but bisect SeaBIOS from 0.5.0 to 0.5.1 to see which commit broke
>  freebsd.
>

Sure,

5c99b6c984682ddb1d4543a7e27a1f4ca633e6a6 is the first bad commit
commit 5c99b6c984682ddb1d4543a7e27a1f4ca633e6a6
Author: Kevin O'Connor
Date:   Wed Dec 30 12:36:22 2009 -0500

 Commit compiled dsdt file; misc comment updates.

 Commit new dsdt with recent changes.
 Add some misc comments.
 Also, fix uninitialized warning in mptable code.

Regards, Ruben


Gleb, Kevin, any ideas?

(summary: qemu-kvm doesn't acpi shutdown freebsd 8.1 with this commit; 
qemu.git does.  May be due to interrupt polarity which kvm implements 
but qemu does not)


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] KVM: Emulation MSI-X mask bits for assigned devices

2010-10-11 Thread Sheng Yang

On Monday 11 October 2010 18:01:00 Michael S. Tsirkin wrote:
> On Mon, Oct 11, 2010 at 05:28:30PM +0800, Sheng Yang wrote:
> > On Sunday 03 October 2010 19:12:47 Michael S. Tsirkin wrote:
> > > On Tue, Sep 28, 2010 at 05:44:10PM +0800, Sheng Yang wrote:
> > > > This patch enable per-vector mask for assigned devices using MSI-X.
> > > > 
> > > > Signed-off-by: Sheng Yang 
> > > 
> > > I think I see an issue here, noted below.  Some general comments:
> > > - The mask bit seems broken for injecting interrupts from
> > > 
> > >   userspace (with interrupts/ioctls).
> > >   I think we must test it on injection path.
> > 
> > I am not quite understand how it related to userspace interrupt injection
> > here... This patch only cover assigned devices for now.
> 
> Well, this is a kernel/userspace interface, if it's broken for userspace
> injection now we'll have to go through pain to fix it in a compatible
> way later when we want to use it for userspace injection.
> You might want to ask why we want the kernel to support making
> userspace-injected interrupts when userspace can just avoid injecting
> them, and the answer would be that with irqfd the injection might be
> handled in a separate process.

OK, I've understood how it related to userspace interrupt injection. But I 
still 
can't see why the interface is broken...
> 
> We currently handle this by destroying irqfd when irq is masked,
> an ioctl instead would be much faster.
> 
> > > - We'll need a way to support the pending bit.
> > > 
> > >   Disabling interrupts will not let us do it.
> > 
> > We may need a way to support pending bit, though I don't know which guest
> > has used it... And we can still know if there is interrupt pending by
> > check the real hardware's pending bit if it's necessary.
> 
> That's what I'm saying: since instead of masking the vector in hardware
> you disable irq in the APIC, the pending bit that we read from hardware
> will not have the correct value.

Are you sure? This disable_irq() has nothing to do with APIC. The disable 
callback 
in msi_chip didn't do anything but mark the IRQ status as IRQ_DISABLED, and the 
follow interrupt(if there are any) would be acked and masked, using mask 
callback 
in msi_chip. 

irq_to_desc() need to be exported for my initial version, in order to use mask 
callback. But later I think it would be clear and better if we use general IRQ 
function to do it. And I don't think the current solution would prevent us from 
reading hardware pending bits.

--
regards
Yang, Sheng

> 
> If we fix this, pending bit handling can be done by userspace.
> 
> > (And we haven't seen any problem by
> > leaving the bit 0 so far, and it's not in this patch's scope.)
> 
> I don't know about anyone using this, either, but the PCI spec does
> require support of polling mode where the pending bit is polled instead
> of interrupts. So yes, not a high priority to implement, but let's give
> the way we intend to support this in the future some thought.
> 
> > > > ---
> > > > 
> > > >  arch/x86/kvm/x86.c   |1 +
> > > >  include/linux/kvm.h  |9 -
> > > >  include/linux/kvm_host.h |1 +
> > > >  virt/kvm/assigned-dev.c  |   39
> > 
> > +++
> > 
> > > >  4 files changed, 49 insertions(+), 1 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 8412c91..e6933e6 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -1927,6 +1927,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > 
> > > > case KVM_CAP_DEBUGREGS:
> > > > case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > 
> > > > case KVM_CAP_XSAVE:
> > > > +   case KVM_CAP_DEVICE_MSIX_MASK:
> > > > r = 1;
> > > > break;
> > > > 
> > > > case KVM_CAP_COALESCED_MMIO:
> > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > index 919ae53..f2b7cdc 100644
> > > > --- a/include/linux/kvm.h
> > > > +++ b/include/linux/kvm.h
> > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > 
> > > >  #endif
> > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > 
> > > > +#ifdef __KVM_HAVE_MSIX
> > > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > > +#endif
> > > > 
> > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > 
> > > > @@ -787,11 +790,15 @@ struct kvm_assigned_msix_nr {
> > > > 
> > > >  };
> > > >  
> > > >  #define KVM_MAX_MSIX_PER_DEV   256
> > > > 
> > > > +
> > > > +#define KVM_MSIX_FLAG_MASK 1
> > > > +
> > > > 
> > > >  struct kvm_assigned_msix_entry {
> > > >  
> > > > __u32 assigned_dev_id;
> > > > __u32 gsi;
> > > > __u16 entry; /* The index of entry in the MSI-X table */
> > > > 
> > > > -   __u16 padding[3];
> > > > +   __u16 flags;
> > > > +   __u16 padding[2];
> > > > 
> > > >  };
> > > >  
> > > >  #endif /* __LINUX_KVM_H */
> > > > 
> > > > diff --git a/include/linux/kvm_hos

iPhone with KVM?

2010-10-11 Thread Jun Koi

hi,

i have a guest Windows on KVM, with iTunes installed on that. then i
let my guest to have direct access to my iPhone connecting to my
physical USB port, using "usb_add" command.
but i got a serious problem: "usb_add" doesnt seem to work, as my
guest Windows never sees my iPhone.

so it seems that giving guest Windows the direct access to USB port is
not enough. any idea why this happens? any solution?

thanks,
Jun
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: how to debug unhandled vm exit: 0x11?

2010-10-11 Thread Neo Jia

On Mon, Oct 11, 2010 at 7:27 AM, Avi Kivity  wrote:
>  On 10/11/2010 08:49 AM, Neo Jia wrote:
>>
>> On Sun, Oct 10, 2010 at 11:30 PM, Avi Kivity  wrote:
>> >    On 10/11/2010 07:46 AM, Neo Jia wrote:
>> >>
>> >>  BTW, I have a question about saving FPU, especially those XMM
>> >>  registers. I don't see an explicit save FPU after exiting guest due to
>> >>  an exception (MMIO writes). The only thing I saw about fpu operation
>> >>  is fpu restore right before loading guest.
>> >>
>> >>  Is there anything I missed here?
>> >
>> >  kvm_put_guest_fpu.
>>
>> I found that function and it will be called by vcpu_put eventually
>> inside kvm_arch_vcpu_ioctl_run, but kvm_mmu_page_fault is called much
>> earlier than that inside kvm exit exception handler. so, the fxsave
>> data for the guest image might not be saved at that moment, when I am
>> going to emulate this instruction?
>
> Just call it when you want to be sure it is in memory.

that helps and looks like my implementation is working. As I am
debugging with kvm-88 version still, I need to integrate back to HEAD
and add unit test.

after that I will send out a patch for review.

Thanks,
Neo

>
> --
> error compiling committee.c: too many arguments to function
>
>



-- 
I would remember that if researchers were not ambitious
probably today we haven't the technology we are using!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.

2010-10-11 Thread Xin, Xiaohui

>-Original Message-
>From: Eric Dumazet [mailto:eric.duma...@gmail.com]
>Sent: Monday, October 11, 2010 11:42 PM
>To: David Miller
>Cc: Xin, Xiaohui; net...@vger.kernel.org; kvm@vger.kernel.org;
>linux-ker...@vger.kernel.org; m...@redhat.com; mi...@elte.hu;
>herb...@gondor.apana.org.au; jd...@linux.intel.com
>Subject: Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() 
>specially.
>
>Le lundi 11 octobre 2010 à 08:27 -0700, David Miller a écrit :
>> From: "Xin, Xiaohui" 
>> Date: Mon, 11 Oct 2010 15:06:05 +0800
>>
>> > That's to avoid the new cache miss caused by using destructor_arg in data 
>> > path
>> > like skb_release_data().
>> > That's based on the comment from Eric Dumazet on v7 patches.
>>
>> Thanks for the explanation.
>
>Anyway, frags[] must be the last field of "struct skb_shared_info"
>since commit fed66381 (net: pskb_expand_head() optimization)
>
>It seems Xin worked on a quite old tree.
>
>
I will rebase soon.

Thanks
Xiaohui
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH RFC] tun: dma engine support

2010-10-11 Thread Michael S. Tsirkin

Simple hack to use dma engine for tun RX.
Only one skb in flight at the moment.

Signed-off-by: Michael S. Tsirkin 
---

I am still looking at handling multiple skbs, but
sending this out for early flames and improvement suggestions.

Loopback testing seems to show only minor performance gains:
this is not really suprising as data is hot in cache already.
Where I would expect this to help more is with incoming
traffic from an external NIC. This still needs to be tested.

 drivers/dma/Kconfig   |2 +-
 drivers/dma/iovlock.c |2 +-
 drivers/net/tun.c |  389 -
 3 files changed, 390 insertions(+), 3 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 9520cf0..7e82c00 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -202,7 +202,7 @@ comment "DMA Clients"
depends on DMA_ENGINE
 
 config NET_DMA
-   bool "Network: TCP receive copy offload"
+   bool "Network: TCP/TUN receive copy offload"
depends on DMA_ENGINE && NET
default (INTEL_IOATDMA || FSL_DMA)
help
diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
index c6917e8..121d7fd 100644
--- a/drivers/dma/iovlock.c
+++ b/drivers/dma/iovlock.c
@@ -138,7 +138,7 @@ void dma_unpin_iovec_pages(struct dma_pinned_list 
*pinned_list)
 
kfree(pinned_list);
 }
-
+EXPORT_SYMBOL_GPL(dma_unpin_iovec_pages);
 
 /*
  * We have already pinned down the pages we will be using in the iovecs.
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 55f3a3e..ddbfbc8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -62,6 +62,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -70,6 +72,9 @@
 #include 
 #include 
 
+int tun_dma_copybreak = 0x1;
+module_param_named(dma_copybreak, tun_dma_copybreak, int, 0644);
+MODULE_PARM_DESC(debug_level, "Use DMA engine for messages of this length and 
up");
 /* Uncomment to enable debugging */
 /* #define TUN_DEBUG 1 */
 
@@ -547,6 +552,364 @@ static inline struct sk_buff *tun_alloc_skb(struct 
tun_struct *tun,
return skb;
 }
 
+#ifdef CONFIG_NET_DMA
+/* The below duplicates code from net/core and drivers/dma
+ * with the minor twist that these functions work on a const
+ * iovec with an offset. TODO: move it there? */
+static int num_pages_spanned(void __user * iov_base, size_t iov_len)
+{
+   return
+   ((PAGE_ALIGN((unsigned long)iov_base + iov_len) -
+   ((unsigned long)iov_base & PAGE_MASK)) >> PAGE_SHIFT);
+}
+
+/*
+ * Pin down all the iovec pages needed for len bytes.
+ * Return a struct dma_pinned_list to keep track of pages pinned down.
+ *
+ * We are allocating a single chunk of memory, and then carving it up into
+ * 3 sections, the latter 2 whose size depends on the number of iovecs and the
+ * total number of pages, respectively.
+ */
+static struct dma_pinned_list *dma_pin_const_iovec_pages(const struct iovec 
*iov,
+  size_t iov_offset, 
size_t len)
+{
+   struct dma_pinned_list *local_list;
+   struct page **pages;
+   int i;
+   int ret;
+   int nr_iovecs = 0;
+   int iovec_len_used = 0;
+   int iovec_pages_used = 0;
+   void __user *iov_base;
+   size_t iov_len;
+
+   /* determine how many iovecs/pages there are, up front */
+   do {
+   /* Skip offset as required. */
+   iov_len = iov[nr_iovecs].iov_len;
+   if (iov_offset >= iovec_len_used + iov_len) {
+   iov_offset -= iov_len;
+   ++iov;
+   continue;
+   }
+   iov_base = iov[nr_iovecs].iov_base;
+   if (!iovec_len_used) {
+   iov_base += iov_offset;
+   iov_len -= iov_offset;
+   }
+   iovec_len_used += iov_len;
+   iovec_pages_used += num_pages_spanned(iov_base, iov_len);
+   nr_iovecs++;
+   } while (iovec_len_used < len);
+
+   /* single kmalloc for pinned list, page_list[], and the page arrays */
+   local_list = kmalloc(sizeof(*local_list)
+   + (nr_iovecs * sizeof (struct dma_page_list))
+   + (iovec_pages_used * sizeof (struct page*)), GFP_KERNEL);
+   if (!local_list)
+   goto out;
+
+   /* list of pages starts right after the page list array */
+   pages = (struct page **) &local_list->page_list[nr_iovecs];
+
+   local_list->nr_iovecs = 0;
+
+   for (i = 0; i < nr_iovecs; i++) {
+   struct dma_page_list *page_list = &local_list->page_list[i];
+
+   iov_len = iov[i].iov_len + iov_offset;
+   iov_base = iov[i].iov_base + iov_offset;
+   iov_offset = 0;
+   len -= iov_len;
+
+   page_list->nr_pages = num_pages_spanned(iov_base, iov_len);
+   page_list->base_address = iov_base;
+
+

Re: 2.6.35-rc1 regression with pvclock and smp guests

2010-10-11 Thread Zachary Amsden


On 10/08/2010 10:59 PM, Arjan Koers wrote:

On 2010-10-09 08:29, Michael Tokarev wrote:
...
   

The result is that no released linux kernel boots
in smp in kvm, which is a linux virtual machine.
That's irony, isn't it?

I wonder how distributions (which are almost all based
on 2.6.32 nowadays) will deal with the issue.. ;)
 

It looks like Debian solved it on their 2.6.32 guest by
reverting the commit that makes it hang:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=588426
   


That's not a wise choice, the commit is needed to prevent clocks going 
backwards.  It then caused some fallout issues with clobbers, which I 
believe hpa fixed, but there were several rounds of it.


Glauber, perhaps, has a better idea of what patches are needed for the 
host side kvmclock.  I've mostly been working on the server side.


To solve the wider range of problems, distributions converging on 2.6.32 
will need all of the fixes backported, both server and host.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: Exceed 1GB/s with virtio-net ?

2010-10-11 Thread matthew . r . rohrer

I had a similar problem with virtio.  With kernels newer than 2.6.31.5 and KVM 
versions greater than 11 my throughput decreased dramatically.  While with 
kernel 2.6.31.5 and version 11 software I could only get around 1.5 Gbps max 
with a single iperf thread.  Moving to higher versions of the software only 
produced poorer results, with the 2.6.35.4 kernel and version 0.12.50 KVM/QEMU 
yielded only ~.418 Gbps.  The only way I have found to achieve more than ~1.5 
Gbps was to use the PCI Pass-through option.  With that option I can get 
roughly 6 Gbps with a single iperf thread and in the 8 Gb range for multiple 
threads.

--Matt

-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of 
Alex Williamson
Sent: Thursday, September 16, 2010 9:01 AM
To: Thibault VINCENT
Cc: kvm@vger.kernel.org
Subject: Re: Exceed 1GB/s with virtio-net ?

On Tue, 2010-09-14 at 18:14 +0200, Thibault VINCENT wrote:
> On 13/09/2010 19:34, Alex Williamson wrote:
> > On Mon, Sep 13, 2010 at 4:32 AM, Thibault VINCENT
> >  wrote:
> >> Hello
> >>
> >> I'm trying to achieve higher than gigabit transferts over a virtio NIC
> >> with no success, and I can't find a recent bug or discussion about such
> >> an issue.
> >>
> >> The simpler test consist of two VM running on a high-end blade server
> >> with 4 cores and 4GB RAM each, and a virtio NIC dedicated to the
> >> inter-VM communication. On the host, the two vnet interfaces are
> >> enslaved into a bridge. I use a combination of 2.6.35 on the host and
> >> 2.6.32 in the VMs.
> >> Running iperf or netperf on these VMs, with TCP or UDP, result in
> >> ~900Mbits/s transferts. This is what could be expected of a 1G
> >> interface, and indeed the e1000 emulation performs similar.
> >>
> >> Changing the txqueuelen, MTU, and offloading settings on every interface
> >> (bridge/tap/virtio_net) didn't improve the speed, nor did the
> >> installation of irqbalance and the increase in CPU and RAM.
> >>
> >> Is this normal ? Is the multiple queue patch intended to address this ?
> >> It's quite possible I missed something :)
> > 
> > I'm able to achieve quite a bit more than 1Gbps using virtio-net
> > between 2 guests on the same host connected via an internal bridge.
> > With the virtio-net TX bottom half handler I can easily hit 7Gbps TCP
> > and 10+Gbps UDP using netperf (TCP_STREAM/UDP_STREAM tests).  Even
> > without the bottom half patches (not yet in qemu-kvm.git), I can get
> > ~5Gbps.  Maybe you could describe your setup further, host details,
> > bridge setup, guests, specific tests, etc...  Thanks,
> 
> Thanks Alex, I don't use the bottom half patches but anything between
> 3Gbps and 5Gbps would be fine. Here are some more details:
> 
> Host
> -
> Dell M610 ; 2 x Xeon X5650 ; 6 x 8GB
> Debian Squeeze amd64
> qemu-kvm 0.12.5+dfsg-1
> kernel 2.6.35-1 amd64 (Debian Experimental)
> 
> Guests
> ---
> Debian Squeeze amd64
> kernel 2.6.35-1 amd64 (Debian Experimental)
> 
> To measure the throughput between the guests, I do the following.
> 
> On the host:
>  * create a bridge
># brctl addbr br_test
># ifconfig br_test 1.1.1.1 up
>  * start two guests
># kvm -enable-kvm -m 4096 -smp 4 -drive
> file=/dev/vg/deb0,id=0,boot=on,format=raw -device
> virtio-blk-pci,drive=0,id=0 -device
> virtio-net-pci,vlan=0,id=1,mac=52:54:00:cf:6a:b0 -net
> tap,vlan=0,name=hostnet0
># kvm -enable-kvm -m 4096 -smp 4 -drive
> file=/dev/vg/deb1,id=0,boot=on,format=raw -device
> virtio-blk-pci,drive=0,id=0 -device
> virtio-net-pci,vlan=0,id=1,mac=52:54:00:cf:6a:b1 -net
> tap,vlan=0,name=hostnet0
>  * add guests to the bridge
># brctl addif br_test tap0
># brctl addif br_test tap1
> 
> On the first guest:
>  # ifconfig eth0 1.1.1.2 up
>  # iperf -s -i 1
> 
> On the second guest:
>  # ifconfig eth0 1.1.1.3 up
>  # iperf -c 1.1.1.2 -i 1
> 
> Client connecting to 1.1.1.2, TCP port 5001
> TCP window size: 16.0 KByte (default)
> 
> [  3] local 1.1.1.3 port 43510 connected with 1.1.1.2 port 5001
> [ ID] Interval   Transfer Bandwidth
> [  3]  0.0- 1.0 sec  80.7 MBytes677 Mbits/sec
> [  3]  1.0- 2.0 sec102 MBytes855 Mbits/sec
> [  3]  2.0- 3.0 sec101 MBytes847 Mbits/sec
> [  3]  3.0- 4.0 sec104 MBytes873 Mbits/sec
> [  3]  4.0- 5.0 sec104 MBytes874 Mbits/sec
> [  3]  5.0- 6.0 sec105 MBytes881 Mbits/sec
> [  3]  6.0- 7.0 sec103 MBytes862 Mbits/sec
> [  3]  7.0- 8.0 sec101 MBytes848 Mbits/sec
> [  3]  8.0- 9.0 sec105 MBytes878 Mbits/sec
> [  3]  9.0-10.0 sec105 MBytes882 Mbits/sec
> [  3]  0.0-10.0 sec  1011 MBytes848 Mbits/sec
> 
> On the host again:
>  # iperf -c 1.1.1.1 -i 1
> 
> Client connecting to 1.1.1.1, TCP port 5001
> TCP window size: 16.0 KByte (default)
>

Re: [PATCH] plugin_kvm: fix kvm_exit rip formatting

2010-10-11 Thread Steven Rostedt

On Mon, 2010-10-11 at 10:58 +0200, Avi Kivity wrote:
> %0xlx is not a valid printf format.
> 
> Signed-off-by: Avi Kivity 

Applied, thanks!

-- Steve

> ---
>  plugin_kvm.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/plugin_kvm.c b/plugin_kvm.c
> index cdd52c7..724143d 100644
> --- a/plugin_kvm.c
> +++ b/plugin_kvm.c
> @@ -153,7 +153,7 @@ static int kvm_exit_handler(struct trace_seq *s, struct 
> record *record,
>  
>   trace_seq_printf(s, "reason %s", find_vmx_reason(val));
>  
> - pevent_print_num_field(s, " rip %0xlx", event, "guest_rip", record, 1);
> + pevent_print_num_field(s, " rip 0x%lx", event, "guest_rip", record, 1);
>  
>   return 0;
>  }


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [libvirt-users] Problems with libvirt / qemu

2010-10-11 Thread Brian Jackson

On Monday, October 11, 2010 11:57:27 am Dan Johansson wrote:
> Yes, the image contains an OS (it works if I start the guest manually).
> 
> On Monday 11 October 2010 03.41:29 jbuy0710 wrote:
> > The image contains an OS or not?
> > If not, you can't choose to boot from disk like ""
> > 
> > 2010/10/10 Dan Johansson :
> > > HI,
> > > 
> > > I have a small problem with libvirt / qemu. I have created a guest and
> > > when I start it from the command-line the guests starts OK, but when I
> > > start the guest through libvirt with "virsh start" I get "Booting from
> > > Hard Disk... Boot failed: not a bootable disk
> > > No bootable device"
> > > 
> > > This is the command-line I use to start the guest (which works)
> > > "cd /var/lib/kvm/Wilmer;  /usr/bin/qemu-system-x86_64 --enable-kvm \
> > > 
> > >-net nic,vlan=1,model=rtl8139,macaddr=DE:ED:BE:EF:01:03 -net
> > >tap,vlan=1,ifname=qtap13,script=no,downscript=no \ -net
> > >nic,vlan=3,model=rtl8139,macaddr=DE:ED:BE:EF:03:03 -net
> > >tap,vlan=3,ifname=qtap33,script=no,downscript=no \ -m 2048 -k
> > >de-ch -vnc :3 -daemonize \
> > >Wilmer.qcow2"


My guess would be that libvirt is using the -drive syntax with if=ide and 
boot=on. Those 2 options together are known to be broken in certain versions 
of qemu/kvm. You can find out by checking in the libvirt logs to see what kvm 
command it's running to start the guest.


> > > 
> > > The libvirt XML-file was created using "virsh domxml-from-native
> > > qemu-argv" and this is the result of that conversion:  > > type='kvm'>
> > > 
> > >  wilmer
> > >  a421968d-0573-1356-8cb7-32caff525a03
> > >  2097152
> > >  2097152
> > >  2
> > >  
> > >  
> > >hvm
> > >
> > >  
> > >  
> > >  
> > >  
> > >
> > >  
> > >  
> > >  
> > >  destroy
> > >  restart
> > >  destroy
> > >  
> > >  
> > >/usr/bin/qemu-system-x86_64
> > >
> > >
> > >  
> > >  
> > >  
> > >
> > >
> > >
> > >
> > >   > >  function='0x1'/>
> > >
> > >
> > >
> > >
> > >  
> > >  
> > >  
> > >  
> > >   > >  function='0x0'/>
> > >
> > >
> > >
> > >
> > >  
> > >  
> > >  
> > >  
> > >   > >  function='0x0'/>
> > >
> > >
> > >
> > >
> > >
> > >
> > >  
> > >   > >  function='0x0'/>
> > >
> > >
> > >  
> > >  
> > > 
> > > 
> > > 
> > > 
> > > Anyone seeing something obvious that I have missed?
> > > 
> > >  Regards,
> > > 
> > > --
> > > Dan Johansson, 
> > > ***
> > > This message is printed on 100% recycled electrons!
> > > ***
> > > 
> > > ___
> > > libvirt-users mailing list
> > > libvirt-us...@redhat.com
> > > https://www.redhat.com/mailman/listinfo/libvirt-users
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

KVM call agenda for Oct 12

2010-10-11 Thread Juan Quintela


Please send in any agenda items you are interested in covering.

thanks, Juan.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 6/8] Add RAM -> physical addr mapping in MCE simulation

2010-10-11 Thread Marcelo Tosatti

From: Huang Ying 

In QEMU-KVM, physical address != RAM address. While MCE simulation
needs physical address instead of RAM address. So
kvm_physical_memory_addr_from_ram() is implemented to do the
conversion, and it is invoked before being filled in the IA32_MCi_ADDR
MSR.

Reported-by: Dean Nelson 
Signed-off-by: Huang Ying 
Signed-off-by: Marcelo Tosatti 

Index: qemu/kvm-all.c
===
--- qemu.orig/kvm-all.c
+++ qemu/kvm-all.c
@@ -137,6 +137,24 @@ static KVMSlot *kvm_lookup_overlapping_s
 return found;
 }
 
+int kvm_physical_memory_addr_from_ram(KVMState *s, ram_addr_t ram_addr,
+  target_phys_addr_t *phys_addr)
+{
+int i;
+
+for (i = 0; i < ARRAY_SIZE(s->slots); i++) {
+KVMSlot *mem = &s->slots[i];
+
+if (ram_addr >= mem->phys_offset &&
+ram_addr < mem->phys_offset + mem->memory_size) {
+*phys_addr = mem->start_addr + (ram_addr - mem->phys_offset);
+return 1;
+}
+}
+
+return 0;
+}
+
 static int kvm_set_user_memory_region(KVMState *s, KVMSlot *slot)
 {
 struct kvm_userspace_memory_region mem;
Index: qemu/kvm.h
===
--- qemu.orig/kvm.h
+++ qemu/kvm.h
@@ -174,6 +174,9 @@ static inline void cpu_synchronize_post_
 }
 }
 
+int kvm_physical_memory_addr_from_ram(KVMState *s, ram_addr_t ram_addr,
+  target_phys_addr_t *phys_addr);
+
 #endif
 int kvm_set_ioeventfd_mmio_long(int fd, uint32_t adr, uint32_t val, bool 
assign);
 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 5/8] Export qemu_ram_addr_from_host

2010-10-11 Thread Marcelo Tosatti

To be used by next patches.

Signed-off-by: Marcelo Tosatti 

Index: qemu/cpu-common.h
===
--- qemu.orig/cpu-common.h
+++ qemu/cpu-common.h
@@ -47,7 +47,8 @@ void qemu_ram_free(ram_addr_t addr);
 /* This should only be used for ram local to a device.  */
 void *qemu_get_ram_ptr(ram_addr_t addr);
 /* This should not be used by devices.  */
-ram_addr_t qemu_ram_addr_from_host(void *ptr);
+int qemu_ram_addr_from_host(void *ptr, ram_addr_t *ram_addr);
+ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr);
 
 int cpu_register_io_memory(CPUReadMemoryFunc * const *mem_read,
CPUWriteMemoryFunc * const *mem_write,
Index: qemu/exec.c
===
--- qemu.orig/exec.c
+++ qemu/exec.c
@@ -2086,7 +2086,7 @@ static inline void tlb_update_dirty(CPUT
 if ((tlb_entry->addr_write & ~TARGET_PAGE_MASK) == IO_MEM_RAM) {
 p = (void *)(unsigned long)((tlb_entry->addr_write & TARGET_PAGE_MASK)
 + tlb_entry->addend);
-ram_addr = qemu_ram_addr_from_host(p);
+ram_addr = qemu_ram_addr_from_host_nofail(p);
 if (!cpu_physical_memory_is_dirty(ram_addr)) {
 tlb_entry->addr_write |= TLB_NOTDIRTY;
 }
@@ -2938,23 +2938,31 @@ void *qemu_get_ram_ptr(ram_addr_t addr)
 return NULL;
 }
 
-/* Some of the softmmu routines need to translate from a host pointer
-   (typically a TLB entry) back to a ram offset.  */
-ram_addr_t qemu_ram_addr_from_host(void *ptr)
+int qemu_ram_addr_from_host(void *ptr, ram_addr_t *ram_addr)
 {
 RAMBlock *block;
 uint8_t *host = ptr;
 
 QLIST_FOREACH(block, &ram_list.blocks, next) {
 if (host - block->host < block->length) {
-return block->offset + (host - block->host);
+*ram_addr = block->offset + (host - block->host);
+return 0;
 }
 }
+return -1;
+}
 
-fprintf(stderr, "Bad ram pointer %p\n", ptr);
-abort();
+/* Some of the softmmu routines need to translate from a host pointer
+   (typically a TLB entry) back to a ram offset.  */
+ram_addr_t qemu_ram_addr_from_host_nofail(void *ptr)
+{
+ram_addr_t ram_addr;
 
-return 0;
+if (qemu_ram_addr_from_host(ptr, &ram_addr)) {
+fprintf(stderr, "Bad ram pointer %p\n", ptr);
+abort();
+}
+return ram_addr;
 }
 
 static uint32_t unassigned_mem_readb(void *opaque, target_phys_addr_t addr)
@@ -3703,7 +3711,7 @@ void cpu_physical_memory_unmap(void *buf
 {
 if (buffer != bounce.buffer) {
 if (is_write) {
-ram_addr_t addr1 = qemu_ram_addr_from_host(buffer);
+ram_addr_t addr1 = qemu_ram_addr_from_host_nofail(buffer);
 while (access_len) {
 unsigned l;
 l = TARGET_PAGE_SIZE;
Index: qemu/exec-all.h
===
--- qemu.orig/exec-all.h
+++ qemu/exec-all.h
@@ -334,7 +334,7 @@ static inline tb_page_addr_t get_page_ad
 }
 p = (void *)(unsigned long)addr
 + env1->tlb_table[mmu_idx][page_index].addend;
-return qemu_ram_addr_from_host(p);
+return qemu_ram_addr_from_host_nofail(p);
 }
 #endif
 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 2/8] iothread: use signalfd

2010-10-11 Thread Marcelo Tosatti

Block SIGALRM, SIGIO and consume them via signalfd.

Signed-off-by: Marcelo Tosatti 

Index: qemu/cpus.c
===
--- qemu.orig/cpus.c
+++ qemu/cpus.c
@@ -33,6 +33,7 @@
 #include "exec-all.h"
 
 #include "cpus.h"
+#include "compatfd.h"
 
 #ifdef SIGRTMIN
 #define SIG_IPI (SIGRTMIN+4)
@@ -329,14 +330,75 @@ static QemuCond qemu_work_cond;
 
 static void tcg_init_ipi(void);
 static void kvm_init_ipi(CPUState *env);
-static void unblock_io_signals(void);
+static sigset_t block_io_signals(void);
+
+/* If we have signalfd, we mask out the signals we want to handle and then
+ * use signalfd to listen for them.  We rely on whatever the current signal
+ * handler is to dispatch the signals when we receive them.
+ */
+static void sigfd_handler(void *opaque)
+{
+int fd = (unsigned long) opaque;
+struct qemu_signalfd_siginfo info;
+struct sigaction action;
+ssize_t len;
+
+while (1) {
+do {
+len = read(fd, &info, sizeof(info));
+} while (len == -1 && errno == EINTR);
+
+if (len == -1 && errno == EAGAIN) {
+break;
+}
+
+if (len != sizeof(info)) {
+printf("read from sigfd returned %zd: %m\n", len);
+return;
+}
+
+sigaction(info.ssi_signo, NULL, &action);
+if ((action.sa_flags & SA_SIGINFO) && action.sa_sigaction) {
+action.sa_sigaction(info.ssi_signo,
+(siginfo_t *)&info, NULL);
+} else if (action.sa_handler) {
+action.sa_handler(info.ssi_signo);
+}
+}
+}
+
+static int qemu_signalfd_init(sigset_t mask)
+{
+int sigfd;
+
+sigfd = qemu_signalfd(&mask);
+if (sigfd == -1) {
+fprintf(stderr, "failed to create signalfd\n");
+return -errno;
+}
+
+fcntl_setfl(sigfd, O_NONBLOCK);
+
+qemu_set_fd_handler2(sigfd, NULL, sigfd_handler, NULL,
+ (void *)(unsigned long) sigfd);
+
+return 0;
+}
 
 int qemu_init_main_loop(void)
 {
 int ret;
+sigset_t blocked_signals;
 
 cpu_set_debug_excp_handler(cpu_debug_handler);
 
+blocked_signals = block_io_signals();
+
+ret = qemu_signalfd_init(blocked_signals);
+if (ret)
+return ret;
+
+/* Note eventfd must be drained before signalfd handlers run */
 ret = qemu_event_init();
 if (ret)
 return ret;
@@ -347,7 +409,6 @@ int qemu_init_main_loop(void)
 qemu_mutex_init(&qemu_global_mutex);
 qemu_mutex_lock(&qemu_global_mutex);
 
-unblock_io_signals();
 qemu_thread_self(&io_thread);
 
 return 0;
@@ -586,19 +647,22 @@ static void kvm_init_ipi(CPUState *env)
 }
 }
 
-static void unblock_io_signals(void)
+static sigset_t block_io_signals(void)
 {
 sigset_t set;
 
+/* SIGUSR2 used by posix-aio-compat.c */
 sigemptyset(&set);
 sigaddset(&set, SIGUSR2);
-sigaddset(&set, SIGIO);
-sigaddset(&set, SIGALRM);
 pthread_sigmask(SIG_UNBLOCK, &set, NULL);
 
 sigemptyset(&set);
+sigaddset(&set, SIGIO);
+sigaddset(&set, SIGALRM);
 sigaddset(&set, SIG_IPI);
 pthread_sigmask(SIG_BLOCK, &set, NULL);
+
+return set;
 }
 
 void qemu_mutex_lock_iothread(void)


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 8/8] Add savevm/loadvm support for MCE

2010-10-11 Thread Marcelo Tosatti

Port qemu-kvm's

commit 1bab5d11545d8de5facf46c28630085a2f9651ae
Author: Huang Ying 
Date:   Wed Mar 3 16:52:46 2010 +0800

Add savevm/loadvm support for MCE

MCE registers are saved/load into/from CPUState in
kvm_arch_save/load_regs. To simulate the MCG_STATUS clearing upon
reset, MSR_MCG_STATUS is set to 0 for KVM_PUT_RESET_STATE.

Signed-off-by: Marcelo Tosatti 

Index: qemu/target-i386/kvm.c
===
--- qemu.orig/target-i386/kvm.c
+++ qemu/target-i386/kvm.c
@@ -774,7 +774,7 @@ static int kvm_put_msrs(CPUState *env, i
 struct kvm_msr_entry entries[100];
 } msr_data;
 struct kvm_msr_entry *msrs = msr_data.entries;
-int n = 0;
+int i, n = 0;
 
 kvm_msr_entry_set(&msrs[n++], MSR_IA32_SYSENTER_CS, env->sysenter_cs);
 kvm_msr_entry_set(&msrs[n++], MSR_IA32_SYSENTER_ESP, env->sysenter_esp);
@@ -794,6 +794,18 @@ static int kvm_put_msrs(CPUState *env, i
   env->system_time_msr);
 kvm_msr_entry_set(&msrs[n++], MSR_KVM_WALL_CLOCK, env->wall_clock_msr);
 }
+#ifdef KVM_CAP_MCE
+if (env->mcg_cap) {
+if (level == KVM_PUT_RESET_STATE)
+kvm_msr_entry_set(&msrs[n++], MSR_MCG_STATUS, env->mcg_status);
+else if (level == KVM_PUT_FULL_STATE) {
+kvm_msr_entry_set(&msrs[n++], MSR_MCG_STATUS, env->mcg_status);
+kvm_msr_entry_set(&msrs[n++], MSR_MCG_CTL, env->mcg_ctl);
+for (i = 0; i < (env->mcg_cap & 0xff) * 4; i++)
+kvm_msr_entry_set(&msrs[n++], MSR_MC0_CTL + i, 
env->mce_banks[i]);
+}
+}
+#endif
 
 msr_data.info.nmsrs = n;
 
@@ -1001,6 +1013,15 @@ static int kvm_get_msrs(CPUState *env)
 msrs[n++].index = MSR_KVM_SYSTEM_TIME;
 msrs[n++].index = MSR_KVM_WALL_CLOCK;
 
+#ifdef KVM_CAP_MCE
+if (env->mcg_cap) {
+msrs[n++].index = MSR_MCG_STATUS;
+msrs[n++].index = MSR_MCG_CTL;
+for (i = 0; i < (env->mcg_cap & 0xff) * 4; i++)
+msrs[n++].index = MSR_MC0_CTL + i;
+}
+#endif
+
 msr_data.info.nmsrs = n;
 ret = kvm_vcpu_ioctl(env, KVM_GET_MSRS, &msr_data);
 if (ret < 0)
@@ -1043,6 +1064,22 @@ static int kvm_get_msrs(CPUState *env)
 case MSR_KVM_WALL_CLOCK:
 env->wall_clock_msr = msrs[i].data;
 break;
+#ifdef KVM_CAP_MCE
+case MSR_MCG_STATUS:
+env->mcg_status = msrs[i].data;
+break;
+case MSR_MCG_CTL:
+env->mcg_ctl = msrs[i].data;
+break;
+#endif
+default:
+#ifdef KVM_CAP_MCE
+if (msrs[i].index >= MSR_MC0_CTL &&
+msrs[i].index < MSR_MC0_CTL + (env->mcg_cap & 0xff) * 4) {
+env->mce_banks[msrs[i].index - MSR_MC0_CTL] = msrs[i].data;
+break;
+}
+#endif
 }
 }
 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.35-rc1 regression with pvclock and smp guests

2010-10-11 Thread Marcelo Tosatti

On Mon, Oct 11, 2010 at 12:53:26PM -0500, Anthony Liguori wrote:
> On 10/08/2010 09:27 PM, Zachary Amsden wrote:
> >On 10/08/2010 03:10 PM, Arjan Koers wrote:
> >>On 2010-10-09 00:06, Marcelo Tosatti wrote:
> >>>On Thu, Oct 07, 2010 at 04:47:11PM -1000, Zachary Amsden wrote:
> On 10/07/2010 02:12 PM, Arjan Koers wrote:
> >On 2010-10-03 01:42, Zachary Amsden wrote:
> >...
> >>Umm...  do you guys have this commit?  This is supposed
> >>to address the
> >>issue where the guest keeps resetting the TSC.  A guest
> >>which does that
> >>will break kvmclock.  It only happens on SMP, and it's
> >>much worse on AMD
> >>CPUs...
> >>
> >>sound like your scenario.
> >>
> >>commit bd59fc8ff95126f27b7a0df1b6cc602aa428812d
> >>Author: Zachary Amsden
> >>Date:   Thu Aug 19 22:07:26 2010 -1000
> >This commit fixes the problem:
> >
> >commit aad07c4f92bae2edaa42bcef84c2afdd0d082458
> >Author: Zachary Amsden
> >Date:   Thu Aug 19 22:07:19 2010 -1000
> >
> > KVM: x86: Move TSC reset out of vmcb_init
> >
> > The VMCB is reset whenever we receive a startup IPI,
> >so Linux is setting
> > TSC back to zero happens very late in the boot
> >process and destabilizing
> > the TSC.  Instead, just set TSC to zero once at VCPU
> >creation time.
> >
> > Why the separate patch?  So git-bisect is your friend.
> Okay, apparently I need to go poke around 2.6.35 and see what
> patches made it there and what patches didn't.
> >>>Backports attached. Michael, Arjan, please give them a try.
> >>>
> >>Thanks for the patches.
> >>
> >>Successfully tested with 2.6.34.7, 2.6.35.7 and 2.6.36-rc7 host
> >>(with a 2.6.35.7 guest).
> >>
> >>It failed with a 2.6.32.24 host. The patch applied, but
> >>pvclock_clocksource_read on the guest is still producing wrong
> >>results for CPU 1 while it's booting. I'll re-check tomorrow.
> >
> >There's a lot of work I've done and also a lot of work done by
> >Glauber Costa on kvmclock that recently went upstream.
> 
> If pvclock is broken on 2.6.32-stable, then shouldn't we port these
> patches to the stable tree or in the very least, black list pvclock
> in stable?

The minimal fixes will be backported as soon as they appear on linux-2.6.git.

> 
> Regards,
> 
> Anthony Liguori
> 
> >It's unlikely that you'll be bug free without all of those patches
> >applied; most of the patches were not just enhancements, but
> >contained bugfixes as well as improved operation conditions.  On
> >top of this, the patches are highly interdependent because of
> >close code proximity.  I suggest applying the following commits to
> >your branch (newest listed first; apply in reverse order):
> >
> >12b1164fa498997bf72070e6a81418197e283716
> >bfa075b75d8786380a7bca1215d4c7d1485d18dd
> >82e7988a2088781175a22b09631bce97cd5ed177
> >bfb3f3326c915b1800dc65d10ca09fbd548353d2
> >1377ff23ae2bf49c76f8f498ca81050878b9666a
> >9a088cc32488cfb9f60dca5972155ba13f39eb83
> >e06a1a6cbe4e9f4c766595483a9b345d5b48bda7
> >da908f2fb4e783c2a4de751fb90f11a0dd041161
> >cf839f5da2b0779b9ec8b990f851fb4e7d681da0
> >cbc59a098486494d9a49537dcb9c969210a8306d
> >5cd459cdde725bb5c3a7feef6e074e7da70490c9
> >d578d4d72e3d2154901123f40c9fa7de1f85ae73
> >bd59fc8ff95126f27b7a0df1b6cc602aa428812d
> >e5e7675b0b9bf8eb0b806145a2fe173b5bb0e908
> >bf0fb4a42ba7eb362f4013bd2e93209666793e66
> >69403a558097a9bd333736d58a4cb69ea6e2a0ac
> >a87834bdb7ff9117da7f164e8cee638f2c51f9b7
> >91308e2fecddb6fc63feaf4cef3400f5cbea6619
> >fd03465c0648cd12d7333269b80d902d0a8516dd
> >aad07c4f92bae2edaa42bcef84c2afdd0d082458
> >280372e494634d0a2cba3956721be16fc4f989bf
> >1e6145f6fd7899d1f34e4ac00a8558d82a8d704a
> >ec01d2eb0a74a6d95823fb6e320298473faf12be
> >3e05d29fe45508625e2a73db3d1bfb54f30731ff
> >
> >Since the issue appears resolved, I'm going to continue working upstream.
> >
> >Zach
> >-- 
> >To unsubscribe from this list: send the line "unsubscribe kvm" in
> >the body of a message to majord...@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 7/8] MCE: Relay UCR MCE to guest

2010-10-11 Thread Marcelo Tosatti

Port qemu-kvm's

commit 4b62fff1101a7ad77553147717a8bd3bf79df7ef
Author: Huang Ying 
Date:   Mon Sep 21 10:43:25 2009 +0800

MCE: Relay UCR MCE to guest

UCR (uncorrected recovery) MCE is supported in recent Intel CPUs,
where some hardware error such as some memory error can be reported
without PCC (processor context corrupted). To recover from such MCE,
the corresponding memory will be unmapped, and all processes accessing
the memory will be killed via SIGBUS.

For KVM, if QEMU/KVM is killed, all guest processes will be killed
too. So we relay SIGBUS from host OS to guest system via a UCR MCE
injection. Then guest OS can isolate corresponding memory and kill
necessary guest processes only. SIGBUS sent to main thread (not VCPU
threads) will be broadcast to all VCPU threads as UCR MCE.

Signed-off-by: Marcelo Tosatti 

Index: qemu/cpus.c
===
--- qemu.orig/cpus.c
+++ qemu/cpus.c
@@ -34,6 +34,10 @@
 
 #include "cpus.h"
 #include "compatfd.h"
+#ifdef CONFIG_LINUX
+#include 
+#include 
+#endif
 
 #ifdef SIGRTMIN
 #define SIG_IPI (SIGRTMIN+4)
@@ -41,6 +45,10 @@
 #define SIG_IPI SIGUSR1
 #endif
 
+#ifndef PR_MCE_KILL
+#define PR_MCE_KILL 33
+#endif
+
 static CPUState *next_cpu;
 
 /***/
@@ -498,28 +506,77 @@ static void qemu_tcg_wait_io_event(void)
 }
 }
 
+static void sigbus_reraise(void)
+{
+sigset_t set;
+struct sigaction action;
+
+memset(&action, 0, sizeof(action));
+action.sa_handler = SIG_DFL;
+if (!sigaction(SIGBUS, &action, NULL)) {
+raise(SIGBUS);
+sigemptyset(&set);
+sigaddset(&set, SIGBUS);
+sigprocmask(SIG_UNBLOCK, &set, NULL);
+}
+perror("Failed to re-raise SIGBUS!\n");
+abort();
+}
+
+static void sigbus_handler(int n, struct qemu_signalfd_siginfo *siginfo,
+   void *ctx)
+{
+#if defined(TARGET_I386)
+if (kvm_on_sigbus(siginfo->ssi_code, (void *)(intptr_t)siginfo->ssi_addr))
+#endif
+sigbus_reraise();
+}
+
 static void qemu_kvm_eat_signal(CPUState *env, int timeout)
 {
 struct timespec ts;
 int r, e;
 siginfo_t siginfo;
 sigset_t waitset;
+sigset_t chkset;
 
 ts.tv_sec = timeout / 1000;
 ts.tv_nsec = (timeout % 1000) * 100;
 
 sigemptyset(&waitset);
 sigaddset(&waitset, SIG_IPI);
+sigaddset(&waitset, SIGBUS);
 
-qemu_mutex_unlock(&qemu_global_mutex);
-r = sigtimedwait(&waitset, &siginfo, &ts);
-e = errno;
-qemu_mutex_lock(&qemu_global_mutex);
+do {
+qemu_mutex_unlock(&qemu_global_mutex);
 
-if (r == -1 && !(e == EAGAIN || e == EINTR)) {
-fprintf(stderr, "sigtimedwait: %s\n", strerror(e));
-exit(1);
-}
+r = sigtimedwait(&waitset, &siginfo, &ts);
+e = errno;
+
+qemu_mutex_lock(&qemu_global_mutex);
+
+if (r == -1 && !(e == EAGAIN || e == EINTR)) {
+fprintf(stderr, "sigtimedwait: %s\n", strerror(e));
+exit(1);
+}
+
+switch (r) {
+case SIGBUS:
+#ifdef TARGET_I386
+if (kvm_on_sigbus_vcpu(env, siginfo.si_code, siginfo.si_addr))
+#endif
+sigbus_reraise();
+break;
+default:
+break;
+}
+
+r = sigpending(&chkset);
+if (r == -1) {
+fprintf(stderr, "sigpending: %s\n", strerror(e));
+exit(1);
+}
+} while (sigismember(&chkset, SIG_IPI) || sigismember(&chkset, SIGBUS));
 }
 
 static void qemu_kvm_wait_io_event(CPUState *env)
@@ -645,6 +702,7 @@ static void kvm_init_ipi(CPUState *env)
 
 pthread_sigmask(SIG_BLOCK, NULL, &set);
 sigdelset(&set, SIG_IPI);
+sigdelset(&set, SIGBUS);
 r = kvm_set_signal_mask(env, &set);
 if (r) {
 fprintf(stderr, "kvm_set_signal_mask: %s\n", strerror(r));
@@ -655,6 +713,7 @@ static void kvm_init_ipi(CPUState *env)
 static sigset_t block_io_signals(void)
 {
 sigset_t set;
+struct sigaction action;
 
 /* SIGUSR2 used by posix-aio-compat.c */
 sigemptyset(&set);
@@ -665,8 +724,15 @@ static sigset_t block_io_signals(void)
 sigaddset(&set, SIGIO);
 sigaddset(&set, SIGALRM);
 sigaddset(&set, SIG_IPI);
+sigaddset(&set, SIGBUS);
 pthread_sigmask(SIG_BLOCK, &set, NULL);
 
+memset(&action, 0, sizeof(action));
+action.sa_flags = SA_SIGINFO;
+action.sa_sigaction = (void (*)(int, siginfo_t*, void*))sigbus_handler;
+sigaction(SIGBUS, &action, NULL);
+prctl(PR_MCE_KILL, 1, 1, 0, 0);
+
 return set;
 }
 
Index: qemu/kvm.h
===
--- qemu.orig/kvm.h
+++ qemu/kvm.h
@@ -110,6 +110,9 @@ int kvm_arch_init_vcpu(CPUState *env);
 
 void kvm_arch_reset_vcpu(CPUState *env);
 
+int kvm_on_sigbus_vcpu(CPUState *env, int code, void *addr);
+int kvm_on_sigbus(int code, void *addr);
+
 struct kvm_guest_debug;
 s

[patch 0/8] port qemu-kvm's MCE support (v3)

2010-10-11 Thread Marcelo Tosatti

Port qemu-kvm's KVM MCE (Machine Check Exception) handling to qemu. It
allows qemu to propagate MCEs to the guest.

v2:
- rename do_qemu_ram_addr_from_host.
- fix kvm_on_sigbus/kvm_on_sigbus_vcpu naming.
- fix bank register restoration (Dean Nelson).

v3:
- condition MCE generation on MCE_SEG_P bit (Huang Ying).


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 4/8] kvm: x86: add mce support

2010-10-11 Thread Marcelo Tosatti

Port qemu-kvm's MCE support

commit c68b2374c9048812f488e00ffb95db66c0bc07a7
Author: Huang Ying 
Date:   Mon Jul 20 10:00:53 2009 +0800

Add MCE simulation support to qemu/kvm

KVM ioctls are used to initialize MCE simulation and inject MCE. The
real MCE simulation is implemented in Linux kernel. The Kernel part
has been merged.

Signed-off-by: Marcelo Tosatti 

Index: qemu/target-i386/helper.c
===
--- qemu.orig/target-i386/helper.c
+++ qemu/target-i386/helper.c
@@ -27,6 +27,7 @@
 #include "exec-all.h"
 #include "qemu-common.h"
 #include "kvm.h"
+#include "kvm_x86.h"
 
 //#define DEBUG_MMU
 
@@ -1030,6 +1031,11 @@ void cpu_inject_x86_mce(CPUState *cenv, 
 if (bank >= bank_num || !(status & MCI_STATUS_VAL))
 return;
 
+if (kvm_enabled()) {
+kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc);
+return;
+}
+
 /*
  * if MSR_MCG_CTL is not all 1s, the uncorrected error
  * reporting is disabled
Index: qemu/target-i386/kvm.c
===
--- qemu.orig/target-i386/kvm.c
+++ qemu/target-i386/kvm.c
@@ -27,6 +27,7 @@
 #include "hw/pc.h"
 #include "hw/apic.h"
 #include "ioport.h"
+#include "kvm_x86.h"
 
 #ifdef CONFIG_KVM_PARA
 #include 
@@ -167,6 +168,67 @@ static int get_para_features(CPUState *e
 }
 #endif
 
+#ifdef KVM_CAP_MCE
+static int kvm_get_mce_cap_supported(KVMState *s, uint64_t *mce_cap,
+ int *max_banks)
+{
+int r;
+
+r = kvm_ioctl(s, KVM_CHECK_EXTENSION, KVM_CAP_MCE);
+if (r > 0) {
+*max_banks = r;
+return kvm_ioctl(s, KVM_X86_GET_MCE_CAP_SUPPORTED, mce_cap);
+}
+return -ENOSYS;
+}
+
+static int kvm_setup_mce(CPUState *env, uint64_t *mcg_cap)
+{
+return kvm_vcpu_ioctl(env, KVM_X86_SETUP_MCE, mcg_cap);
+}
+
+static int kvm_set_mce(CPUState *env, struct kvm_x86_mce *m)
+{
+return kvm_vcpu_ioctl(env, KVM_X86_SET_MCE, m);
+}
+
+struct kvm_x86_mce_data
+{
+CPUState *env;
+struct kvm_x86_mce *mce;
+};
+
+static void kvm_do_inject_x86_mce(void *_data)
+{
+struct kvm_x86_mce_data *data = _data;
+int r;
+
+r = kvm_set_mce(data->env, data->mce);
+if (r < 0)
+perror("kvm_set_mce FAILED");
+}
+#endif
+
+void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
+uint64_t mcg_status, uint64_t addr, uint64_t misc)
+{
+#ifdef KVM_CAP_MCE
+struct kvm_x86_mce mce = {
+.bank = bank,
+.status = status,
+.mcg_status = mcg_status,
+.addr = addr,
+.misc = misc,
+};
+struct kvm_x86_mce_data data = {
+.env = cenv,
+.mce = &mce,
+};
+
+run_on_cpu(cenv, kvm_do_inject_x86_mce, &data);
+#endif
+}
+
 int kvm_arch_init_vcpu(CPUState *env)
 {
 struct {
@@ -274,6 +336,28 @@ int kvm_arch_init_vcpu(CPUState *env)
 
 cpuid_data.cpuid.nent = cpuid_i;
 
+#ifdef KVM_CAP_MCE
+if (((env->cpuid_version >> 8)&0xF) >= 6
+&& (env->cpuid_features&(CPUID_MCE|CPUID_MCA)) == (CPUID_MCE|CPUID_MCA)
+&& kvm_check_extension(env->kvm_state, KVM_CAP_MCE) > 0) {
+uint64_t mcg_cap;
+int banks;
+
+if (kvm_get_mce_cap_supported(env->kvm_state, &mcg_cap, &banks))
+perror("kvm_get_mce_cap_supported FAILED");
+else {
+if (banks > MCE_BANKS_DEF)
+banks = MCE_BANKS_DEF;
+mcg_cap &= MCE_CAP_DEF;
+mcg_cap |= banks;
+if (kvm_setup_mce(env, &mcg_cap))
+perror("kvm_setup_mce FAILED");
+else
+env->mcg_cap = mcg_cap;
+}
+}
+#endif
+
 return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, &cpuid_data);
 }
 
Index: qemu/target-i386/kvm_x86.h
===
--- /dev/null
+++ qemu/target-i386/kvm_x86.h
@@ -0,0 +1,21 @@
+/*
+ * QEMU KVM support
+ *
+ * Copyright (C) 2009 Red Hat Inc.
+ * Copyright IBM, Corp. 2008
+ *
+ * Authors:
+ *  Anthony Liguori   
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef __KVM_X86_H__
+#define __KVM_X86_H__
+
+void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t status,
+uint64_t mcg_status, uint64_t addr, uint64_t misc);
+
+#endif


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 1/8] signalfd compatibility

2010-10-11 Thread Marcelo Tosatti

Port qemu-kvm's signalfd compat code.

commit 5a7fdd0abd7cd24dac205317a4195446ab8748b5
Author: Anthony Liguori 
Date:   Wed May 7 11:55:47 2008 -0500

Use signalfd() in io-thread

This patch reworks the IO thread to use signalfd() instead of sigtimedwait()
This will eliminate the need to use SIGIO everywhere.

Signed-off-by: Marcelo Tosatti 

Index: qemu/compatfd.c
===
--- /dev/null
+++ qemu/compatfd.c
@@ -0,0 +1,117 @@
+/*
+ * signalfd/eventfd compatibility
+ *
+ * Copyright IBM, Corp. 2008
+ *
+ * Authors:
+ *  Anthony Liguori   
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include "compatfd.h"
+
+#include 
+#include 
+
+struct sigfd_compat_info
+{
+sigset_t mask;
+int fd;
+};
+
+static void *sigwait_compat(void *opaque)
+{
+struct sigfd_compat_info *info = opaque;
+int err;
+sigset_t all;
+
+sigfillset(&all);
+sigprocmask(SIG_BLOCK, &all, NULL);
+
+do {
+siginfo_t siginfo;
+
+err = sigwaitinfo(&info->mask, &siginfo);
+if (err == -1 && errno == EINTR) {
+err = 0;
+continue;
+}
+
+if (err > 0) {
+char buffer[128];
+size_t offset = 0;
+
+memcpy(buffer, &err, sizeof(err));
+while (offset < sizeof(buffer)) {
+ssize_t len;
+
+len = write(info->fd, buffer + offset,
+sizeof(buffer) - offset);
+if (len == -1 && errno == EINTR)
+continue;
+
+if (len <= 0) {
+err = -1;
+break;
+}
+
+offset += len;
+}
+}
+} while (err >= 0);
+
+return NULL;
+}
+
+static int qemu_signalfd_compat(const sigset_t *mask)
+{
+pthread_attr_t attr;
+pthread_t tid;
+struct sigfd_compat_info *info;
+int fds[2];
+
+info = malloc(sizeof(*info));
+if (info == NULL) {
+errno = ENOMEM;
+return -1;
+}
+
+if (pipe(fds) == -1) {
+free(info);
+return -1;
+}
+
+qemu_set_cloexec(fds[0]);
+qemu_set_cloexec(fds[1]);
+
+memcpy(&info->mask, mask, sizeof(*mask));
+info->fd = fds[1];
+
+pthread_attr_init(&attr);
+pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
+
+pthread_create(&tid, &attr, sigwait_compat, info);
+
+pthread_attr_destroy(&attr);
+
+return fds[0];
+}
+
+int qemu_signalfd(const sigset_t *mask)
+{
+#if defined(CONFIG_SIGNALFD)
+int ret;
+
+ret = syscall(SYS_signalfd, -1, mask, _NSIG / 8);
+if (ret != -1) {
+qemu_set_cloexec(ret);
+return ret;
+}
+#endif
+
+return qemu_signalfd_compat(mask);
+}
Index: qemu/compatfd.h
===
--- /dev/null
+++ qemu/compatfd.h
@@ -0,0 +1,43 @@
+/*
+ * signalfd/eventfd compatibility
+ *
+ * Copyright IBM, Corp. 2008
+ *
+ * Authors:
+ *  Anthony Liguori   
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_COMPATFD_H
+#define QEMU_COMPATFD_H
+
+#include 
+
+struct qemu_signalfd_siginfo {
+uint32_t ssi_signo;   /* Signal number */
+int32_t  ssi_errno;   /* Error number (unused) */
+int32_t  ssi_code;/* Signal code */
+uint32_t ssi_pid; /* PID of sender */
+uint32_t ssi_uid; /* Real UID of sender */
+int32_t  ssi_fd;  /* File descriptor (SIGIO) */
+uint32_t ssi_tid; /* Kernel timer ID (POSIX timers) */
+uint32_t ssi_band;/* Band event (SIGIO) */
+uint32_t ssi_overrun; /* POSIX timer overrun count */
+uint32_t ssi_trapno;  /* Trap number that caused signal */
+int32_t  ssi_status;  /* Exit status or signal (SIGCHLD) */
+int32_t  ssi_int; /* Integer sent by sigqueue(2) */
+uint64_t ssi_ptr; /* Pointer sent by sigqueue(2) */
+uint64_t ssi_utime;   /* User CPU time consumed (SIGCHLD) */
+uint64_t ssi_stime;   /* System CPU time consumed (SIGCHLD) */
+uint64_t ssi_addr;/* Address that generated signal
+ (for hardware-generated signals) */
+uint8_t  pad[48]; /* Pad size to 128 bytes (allow for
+ additional fields in the future) */
+};
+
+int qemu_signalfd(const sigset_t *mask);
+
+#endif
Index: qemu/Makefile.objs
===
--- qemu.orig/Makefile.objs
+++ qemu/Makefile.objs
@@ -121,6 +121,7 @@ common-obj-y += $(addprefix ui/, $(ui-ob
 
 common-obj-y += iov.o acl.o
 common-obj-$(CONFIG_THREAD) += qemu-thread.o
+common-obj-$(CONFIG_IOTHREAD) += compatfd.o
 common-obj-y += notify.o event_notifier.o
 common-obj-y += qemu-timer.o
 
Index: qemu/configure

[patch 3/8] Expose thread_id in info cpus

2010-10-11 Thread Marcelo Tosatti

commit ce6325ff1af34dbaee91c8d28e792277e43f1227
Author: Glauber Costa 
Date:   Wed Mar 5 17:01:10 2008 -0300

Augment info cpus

This patch exposes the thread id associated with each
cpu through the already well known 'info cpus' interface.

Signed-off-by: Marcelo Tosatti 

Index: qemu/cpu-defs.h
===
--- qemu.orig/cpu-defs.h
+++ qemu/cpu-defs.h
@@ -197,6 +197,7 @@ typedef struct CPUWatchpoint {
 int nr_cores;  /* number of cores within this CPU package */\
 int nr_threads;/* number of threads within this CPU */  \
 int running; /* Nonzero if cpu is currently running(usermode).  */  \
+int thread_id;  \
 /* user data */ \
 void *opaque;   \
 \
Index: qemu/cpus.c
===
--- qemu.orig/cpus.c
+++ qemu/cpus.c
@@ -539,6 +539,7 @@ static void *kvm_cpu_thread_fn(void *arg
 
 qemu_mutex_lock(&qemu_global_mutex);
 qemu_thread_self(env->thread);
+env->thread_id = get_thread_id();
 if (kvm_enabled())
 kvm_init_vcpu(env);
 
@@ -578,6 +579,10 @@ static void *tcg_cpu_thread_fn(void *arg
 while (!qemu_system_ready)
 qemu_cond_timedwait(&qemu_system_cond, &qemu_global_mutex, 100);
 
+for (env = first_cpu; env != NULL; env = env->next_cpu) {
+env->thread_id = get_thread_id();
+}
+
 while (1) {
 cpu_exec_all();
 qemu_tcg_wait_io_event();
Index: qemu/exec.c
===
--- qemu.orig/exec.c
+++ qemu/exec.c
@@ -637,6 +637,7 @@ void cpu_exec_init(CPUState *env)
 env->numa_node = 0;
 QTAILQ_INIT(&env->breakpoints);
 QTAILQ_INIT(&env->watchpoints);
+env->thread_id = get_thread_id();
 *penv = env;
 #if defined(CONFIG_USER_ONLY)
 cpu_list_unlock();
Index: qemu/osdep.c
===
--- qemu.orig/osdep.c
+++ qemu/osdep.c
@@ -44,6 +44,10 @@
 extern int madvise(caddr_t, size_t, int);
 #endif
 
+#ifdef CONFIG_LINUX
+#include 
+#endif
+
 #ifdef CONFIG_EVENTFD
 #include 
 #endif
@@ -200,6 +204,17 @@ int qemu_create_pidfile(const char *file
 return 0;
 }
 
+int get_thread_id(void)
+{
+#if defined (_WIN32)
+return GetCurrentThreadId();
+#elif defined (__linux__)
+return syscall(SYS_gettid);
+#else
+return getpid();
+#endif
+}
+
 #ifdef _WIN32
 
 /* mingw32 needs ffs for compilations without optimization. */
Index: qemu/osdep.h
===
--- qemu.orig/osdep.h
+++ qemu/osdep.h
@@ -126,6 +126,7 @@ void qemu_vfree(void *ptr);
 int qemu_madvise(void *addr, size_t len, int advice);
 
 int qemu_create_pidfile(const char *filename);
+int get_thread_id(void);
 
 #ifdef _WIN32
 int ffs(int i);
Index: qemu/monitor.c
===
--- qemu.orig/monitor.c
+++ qemu/monitor.c
@@ -878,6 +878,9 @@ static void print_cpu_iter(QObject *obj,
 monitor_printf(mon, " (halted)");
 }
 
+monitor_printf(mon, " thread_id=%" PRId64 " ",
+   qdict_get_int(cpu, "thread_id"));
+
 monitor_printf(mon, "\n");
 }
 
@@ -922,6 +925,7 @@ static void do_info_cpus(Monitor *mon, Q
 #elif defined(TARGET_MIPS)
 qdict_put(cpu, "PC", qint_from_int(env->active_tc.PC));
 #endif
+qdict_put(cpu, "thread_id", qint_from_int(env->thread_id));
 
 qlist_append(cpu_list, cpu);
 }


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: system_powerdown not working for qemu-kvm 0.12.4?

2010-10-11 Thread Ruben Kerkhof

On Mon, Oct 11, 2010 at 18:57, Avi Kivity  wrote:
>  On 10/11/2010 06:29 PM, Ruben Kerkhof wrote:
>>
>> Hi Avi,
>>
>> On Sun, Aug 15, 2010 at 13:00, Avi Kivity  wrote:
>>
>> >  I'm betting 73b48d914f9 is the cause, but let's see the full bisect.
>>
>> system_powerdown with FreeBSD 8.1 works on 73b48d914f9.
>>
>> I've bisected the issue to commit 76e617c.
>>
>> Let me know if I should do some other tests.
>
> It works on 73b48d914f9 but fails on 76e617c due to the seabios change.
>
> Can you try 76e617c but with SeaBIOS 0.5.0?  If that works, keep qemu-kvm on
> 76e617c but bisect SeaBIOS from 0.5.0 to 0.5.1 to see which commit broke
> freebsd.
>

Sure,

5c99b6c984682ddb1d4543a7e27a1f4ca633e6a6 is the first bad commit
commit 5c99b6c984682ddb1d4543a7e27a1f4ca633e6a6
Author: Kevin O'Connor 
Date:   Wed Dec 30 12:36:22 2009 -0500

Commit compiled dsdt file; misc comment updates.

Commit new dsdt with recent changes.
Add some misc comments.
Also, fix uninitialized warning in mptable code.

Regards, Ruben
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.35-rc1 regression with pvclock and smp guests

2010-10-11 Thread Anthony Liguori


On 10/08/2010 09:27 PM, Zachary Amsden wrote:

On 10/08/2010 03:10 PM, Arjan Koers wrote:

On 2010-10-09 00:06, Marcelo Tosatti wrote:

On Thu, Oct 07, 2010 at 04:47:11PM -1000, Zachary Amsden wrote:

On 10/07/2010 02:12 PM, Arjan Koers wrote:

On 2010-10-03 01:42, Zachary Amsden wrote:
...
Umm...  do you guys have this commit?  This is supposed to 
address the
issue where the guest keeps resetting the TSC.  A guest which 
does that
will break kvmclock.  It only happens on SMP, and it's much worse 
on AMD

CPUs...

sound like your scenario.

commit bd59fc8ff95126f27b7a0df1b6cc602aa428812d
Author: Zachary Amsden
Date:   Thu Aug 19 22:07:26 2010 -1000

This commit fixes the problem:

commit aad07c4f92bae2edaa42bcef84c2afdd0d082458
Author: Zachary Amsden
Date:   Thu Aug 19 22:07:19 2010 -1000

 KVM: x86: Move TSC reset out of vmcb_init

 The VMCB is reset whenever we receive a startup IPI, so Linux 
is setting
 TSC back to zero happens very late in the boot process and 
destabilizing
 the TSC.  Instead, just set TSC to zero once at VCPU creation 
time.


 Why the separate patch?  So git-bisect is your friend.

Okay, apparently I need to go poke around 2.6.35 and see what
patches made it there and what patches didn't.

Backports attached. Michael, Arjan, please give them a try.


Thanks for the patches.

Successfully tested with 2.6.34.7, 2.6.35.7 and 2.6.36-rc7 host
(with a 2.6.35.7 guest).

It failed with a 2.6.32.24 host. The patch applied, but
pvclock_clocksource_read on the guest is still producing wrong
results for CPU 1 while it's booting. I'll re-check tomorrow.


There's a lot of work I've done and also a lot of work done by Glauber 
Costa on kvmclock that recently went upstream.


If pvclock is broken on 2.6.32-stable, then shouldn't we port these 
patches to the stable tree or in the very least, black list pvclock in 
stable?


Regards,

Anthony Liguori

It's unlikely that you'll be bug free without all of those patches 
applied; most of the patches were not just enhancements, but contained 
bugfixes as well as improved operation conditions.  On top of this, 
the patches are highly interdependent because of close code 
proximity.  I suggest applying the following commits to your branch 
(newest listed first; apply in reverse order):


12b1164fa498997bf72070e6a81418197e283716
bfa075b75d8786380a7bca1215d4c7d1485d18dd
82e7988a2088781175a22b09631bce97cd5ed177
bfb3f3326c915b1800dc65d10ca09fbd548353d2
1377ff23ae2bf49c76f8f498ca81050878b9666a
9a088cc32488cfb9f60dca5972155ba13f39eb83
e06a1a6cbe4e9f4c766595483a9b345d5b48bda7
da908f2fb4e783c2a4de751fb90f11a0dd041161
cf839f5da2b0779b9ec8b990f851fb4e7d681da0
cbc59a098486494d9a49537dcb9c969210a8306d
5cd459cdde725bb5c3a7feef6e074e7da70490c9
d578d4d72e3d2154901123f40c9fa7de1f85ae73
bd59fc8ff95126f27b7a0df1b6cc602aa428812d
e5e7675b0b9bf8eb0b806145a2fe173b5bb0e908
bf0fb4a42ba7eb362f4013bd2e93209666793e66
69403a558097a9bd333736d58a4cb69ea6e2a0ac
a87834bdb7ff9117da7f164e8cee638f2c51f9b7
91308e2fecddb6fc63feaf4cef3400f5cbea6619
fd03465c0648cd12d7333269b80d902d0a8516dd
aad07c4f92bae2edaa42bcef84c2afdd0d082458
280372e494634d0a2cba3956721be16fc4f989bf
1e6145f6fd7899d1f34e4ac00a8558d82a8d704a
ec01d2eb0a74a6d95823fb6e320298473faf12be
3e05d29fe45508625e2a73db3d1bfb54f30731ff

Since the issue appears resolved, I'm going to continue working upstream.

Zach
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Using 1920x1080 resolution with kvm

2010-10-11 Thread Kenni Lund

2010/10/10 Aniruddha :
> I would like to use a resolution 1920x1080 for my Windows XP Guest.

Leave KVM at the default resolution (eg. don't specify any
resolution), activate remote desktop in Windows XP and connect with
rdesktop or some other RDP client to the guest.

The RDP protocol will give you the resolution you request and will
also be much more responsive/faster than if you do VNC to a KVM guest
running at some high native resolution. VNC to a Windows guest should
only be needed for installing Windows and for debugging when Windows
doesn't boot.

Best regards
Kenni
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/2] vhost: potential integer overflows

2010-10-11 Thread Al Viro

On Mon, Oct 11, 2010 at 07:22:57PM +0200, Dan Carpenter wrote:
> I did an audit for potential integer overflows of values which get passed
> to access_ok() and here are the results.

FWIW, UINT_MAX is wrong here.  What you want is maximal size_t value.

> Signed-off-by: Dan Carpenter 
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index dd3d6f7..c2aa12c 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -429,6 +429,14 @@ static int vq_access_ok(unsigned int num,
>   struct vring_avail __user *avail,
>   struct vring_used __user *used)
>  {
> +
> + if (num > UINT_MAX / sizeof *desc)
> + return 0;
> + if (num > UINT_MAX / sizeof *avail->ring - sizeof *avail)
> + return 0;
> + if (num > UINT_MAX / sizeof *used->ring - sizeof *used)
> + return 0;
> +
>   return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
>  access_ok(VERIFY_READ, avail,
>sizeof *avail + num * sizeof *avail->ring) &&
> @@ -447,6 +455,9 @@ int vhost_log_access_ok(struct vhost_dev *dev)
>  /* Caller should have vq mutex and device mutex */
>  static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user 
> *log_base)
>  {
> + if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used)
> + return 0;
> +
>   return vq_memory_access_ok(log_base, vq->dev->memory,
>   vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
>   (!vq->log_used || log_access_ok(log_base, vq->log_addr,
> @@ -606,12 +617,17 @@ static long vhost_set_vring(struct vhost_dev *d, int 
> ioctl, void __user *argp)
>   }
>  
>   /* Also validate log access for used ring if enabled. */
> - if ((a.flags & (0x1 << VHOST_VRING_F_LOG)) &&
> - !log_access_ok(vq->log_base, a.log_guest_addr,
> + if (a.flags & (0x1 << VHOST_VRING_F_LOG)) {
> + if (vq->num > UINT_MAX / sizeof *vq->used->ring 
> - sizeof *vq->used) {
> + r = -EINVAL;
> + break;
> + }
> + if (!log_access_ok(vq->log_base, 
> a.log_guest_addr,
>  sizeof *vq->used +
>  vq->num * sizeof *vq->used->ring)) {
> - r = -EINVAL;
> - break;
> + r = -EINVAL;
> + break;
> + }
>   }
>   }
>  
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 2/2] vhost: fix return code for log_access_ok()

2010-10-11 Thread Dan Carpenter

access_ok() returns 1 if it's OK otherwise it should return 0.

Signed-off-by: Dan Carpenter 

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c2aa12c..f82fe57 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -371,7 +371,7 @@ static int log_access_ok(void __user *log_base, u64 addr, 
unsigned long sz)
/* Make sure 64 bit math will not overflow. */
if (a > ULONG_MAX - (unsigned long)log_base ||
a + (unsigned long)log_base > ULONG_MAX)
-   return -EFAULT;
+   return 0;
 
return access_ok(VERIFY_WRITE, log_base + a,
 (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[patch 1/2] vhost: potential integer overflows

2010-10-11 Thread Dan Carpenter

I did an audit for potential integer overflows of values which get passed
to access_ok() and here are the results.

Signed-off-by: Dan Carpenter 

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index dd3d6f7..c2aa12c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -429,6 +429,14 @@ static int vq_access_ok(unsigned int num,
struct vring_avail __user *avail,
struct vring_used __user *used)
 {
+
+   if (num > UINT_MAX / sizeof *desc)
+   return 0;
+   if (num > UINT_MAX / sizeof *avail->ring - sizeof *avail)
+   return 0;
+   if (num > UINT_MAX / sizeof *used->ring - sizeof *used)
+   return 0;
+
return access_ok(VERIFY_READ, desc, num * sizeof *desc) &&
   access_ok(VERIFY_READ, avail,
 sizeof *avail + num * sizeof *avail->ring) &&
@@ -447,6 +455,9 @@ int vhost_log_access_ok(struct vhost_dev *dev)
 /* Caller should have vq mutex and device mutex */
 static int vq_log_access_ok(struct vhost_virtqueue *vq, void __user *log_base)
 {
+   if (vq->num > UINT_MAX / sizeof *vq->used->ring - sizeof *vq->used)
+   return 0;
+
return vq_memory_access_ok(log_base, vq->dev->memory,
vhost_has_feature(vq->dev, VHOST_F_LOG_ALL)) &&
(!vq->log_used || log_access_ok(log_base, vq->log_addr,
@@ -606,12 +617,17 @@ static long vhost_set_vring(struct vhost_dev *d, int 
ioctl, void __user *argp)
}
 
/* Also validate log access for used ring if enabled. */
-   if ((a.flags & (0x1 << VHOST_VRING_F_LOG)) &&
-   !log_access_ok(vq->log_base, a.log_guest_addr,
+   if (a.flags & (0x1 << VHOST_VRING_F_LOG)) {
+   if (vq->num > UINT_MAX / sizeof *vq->used->ring 
- sizeof *vq->used) {
+   r = -EINVAL;
+   break;
+   }
+   if (!log_access_ok(vq->log_base, 
a.log_guest_addr,
   sizeof *vq->used +
   vq->num * sizeof *vq->used->ring)) {
-   r = -EINVAL;
-   break;
+   r = -EINVAL;
+   break;
+   }
}
}
 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] KVM: Emulation MSI-X mask bits for assigned devices

2010-10-11 Thread Marcelo Tosatti

On Mon, Oct 11, 2010 at 05:49:17PM +0800, Sheng Yang wrote:
> On Friday 01 October 2010 00:18:10 Marcelo Tosatti wrote:
> > On Tue, Sep 28, 2010 at 05:44:10PM +0800, Sheng Yang wrote:
> > > This patch enable per-vector mask for assigned devices using MSI-X.
> > > 
> > > Signed-off-by: Sheng Yang 
> > > ---
> > > 
> > >  arch/x86/kvm/x86.c   |1 +
> > >  include/linux/kvm.h  |9 -
> > >  include/linux/kvm_host.h |1 +
> > >  virt/kvm/assigned-dev.c  |   39 
> +++
> > >  4 files changed, 49 insertions(+), 1 deletions(-)
> > >  
> > >  
> > >  #include 
> > > 
> > > +#include 
> > > +
> > > 
> > >  #include "irq.h"
> > >  
> > >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct
> > >  list_head *head,
> > > 
> > > @@ -666,6 +668,30 @@ msix_nr_out:
> > >   return r;
> > >  
> > >  }
> > > 
> > > +static void update_msix_mask(struct kvm_assigned_dev_kernel
> > > *assigned_dev, +   int index)
> > > +{
> > > + int irq;
> > > +
> > > + if (!assigned_dev->dev->msix_enabled ||
> > > + !(assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > > + return;
> > > +
> > > + irq = assigned_dev->host_msix_entries[index].vector;
> > > +
> > > + ASSERT(irq != 0);
> > > +
> > > + if (assigned_dev->guest_msix_entries[index].flags &
> > > + KVM_ASSIGNED_MSIX_MASK)
> > > + disable_irq(irq);
> > > + else {
> > > + enable_irq(irq);
> > > + if (assigned_dev->guest_msix_entries[index].flags &
> > > + KVM_ASSIGNED_MSIX_PENDING)
> > > + schedule_work(&assigned_dev->interrupt_work);
> > > + }
> > > +}
> > > +
> > 
> > Should flush the workqueue after disabling irq. As you say, the
> > schedule_work should be unnecessary.
> > 
> > Also must be careful with races.
> 
> Do we need to flush the workqueue? I think the return of writing MSI-X mask 
> bit 
> doesn't have implicit meaning that we have handled all pending interrupts. So 
> I 
> think leave the work later should also be fine.

The point is the interrupt will be delivered to the guest after the mask
bit is set. AFAICS, it should either be either before or after (before
is easier to implement, flushing the workqueue).

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [libvirt-users] Problems with libvirt / qemu

2010-10-11 Thread Dan Johansson

Yes, the image contains an OS (it works if I start the guest manually).

On Monday 11 October 2010 03.41:29 jbuy0710 wrote:
> The image contains an OS or not?
> If not, you can't choose to boot from disk like ""
> 
> 2010/10/10 Dan Johansson :
> > HI,
> >
> > I have a small problem with libvirt / qemu. I have created a guest and when 
> > I start it from the command-line the guests starts OK, but when I start the 
> > guest through libvirt with "virsh start" I get "Booting from Hard Disk...
> > Boot failed: not a bootable disk
> > No bootable device"
> >
> > This is the command-line I use to start the guest (which works)
> > "cd /var/lib/kvm/Wilmer;  /usr/bin/qemu-system-x86_64 --enable-kvm \
> >-net nic,vlan=1,model=rtl8139,macaddr=DE:ED:BE:EF:01:03 -net 
> > tap,vlan=1,ifname=qtap13,script=no,downscript=no \
> >-net nic,vlan=3,model=rtl8139,macaddr=DE:ED:BE:EF:03:03 -net 
> > tap,vlan=3,ifname=qtap33,script=no,downscript=no \
> >-m 2048 -k de-ch -vnc :3 -daemonize \
> >Wilmer.qcow2"
> >
> >
> > The libvirt XML-file was created using "virsh domxml-from-native qemu-argv" 
> > and this is the result of that conversion:
> > 
> >  wilmer
> >  a421968d-0573-1356-8cb7-32caff525a03
> >  2097152
> >  2097152
> >  2
> >  
> >hvm
> >
> >  
> >  
> >
> >  
> >  
> >  destroy
> >  restart
> >  destroy
> >  
> >/usr/bin/qemu-system-x86_64
> >
> >  
> >  
> >  
> >
> >
> >   > function='0x1'/>
> >
> >
> >  
> >  
> >  
> >  
> >   > function='0x0'/>
> >
> >
> >  
> >  
> >  
> >  
> >   > function='0x0'/>
> >
> >
> >
> >
> >  
> >   > function='0x0'/>
> >
> >  
> > 
> >
> >
> > Anyone seeing something obvious that I have missed?
> >
> >  Regards,
> > --
> > Dan Johansson, 
> > ***
> > This message is printed on 100% recycled electrons!
> > ***
> >
> > ___
> > libvirt-users mailing list
> > libvirt-us...@redhat.com
> > https://www.redhat.com/mailman/listinfo/libvirt-users
> >
> 

-- 
Dan Johansson, 
***
This message is printed on 100% recycled electrons!
***
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: system_powerdown not working for qemu-kvm 0.12.4?

2010-10-11 Thread Avi Kivity


 On 10/11/2010 06:29 PM, Ruben Kerkhof wrote:

Hi Avi,

On Sun, Aug 15, 2010 at 13:00, Avi Kivity  wrote:

>  I'm betting 73b48d914f9 is the cause, but let's see the full bisect.

system_powerdown with FreeBSD 8.1 works on 73b48d914f9.

I've bisected the issue to commit 76e617c.

Let me know if I should do some other tests.


It works on 73b48d914f9 but fails on 76e617c due to the seabios change.

Can you try 76e617c but with SeaBIOS 0.5.0?  If that works, keep 
qemu-kvm on 76e617c but bisect SeaBIOS from 0.5.0 to 0.5.1 to see which 
commit broke freebsd.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: system_powerdown not working for qemu-kvm 0.12.4?

2010-10-11 Thread Ruben Kerkhof

Hi Avi,

On Sun, Aug 15, 2010 at 13:00, Avi Kivity  wrote:

> I'm betting 73b48d914f9 is the cause, but let's see the full bisect.

system_powerdown with FreeBSD 8.1 works on 73b48d914f9.

I've bisected the issue to commit 76e617c.

Let me know if I should do some other tests.

Regards, Ruben
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] device-assignment: Allow PCI to manage the option ROM

2010-10-11 Thread Alex Williamson

On Mon, 2010-10-11 at 17:21 +0200, Michael S. Tsirkin wrote:
> On Mon, Oct 11, 2010 at 09:15:52AM -0600, Alex Williamson wrote:
> > On Sat, 2010-10-09 at 23:44 +0200, Michael S. Tsirkin wrote:
> > > On Fri, Oct 08, 2010 at 09:12:52AM -0600, Alex Williamson wrote:
> > > > On Fri, 2010-10-08 at 10:40 +0200, Michael S. Tsirkin wrote:
> > > > > On Thu, Oct 07, 2010 at 10:02:25PM -0600, Alex Williamson wrote:
> > > > > > On Fri, 2010-10-08 at 00:45 +0200, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Oct 07, 2010 at 11:34:01AM -0600, Alex Williamson wrote:
> > > > > > > > On Thu, 2010-10-07 at 19:18 +0200, Michael S. Tsirkin wrote:
> > > > > > > > > On Mon, Oct 04, 2010 at 03:26:30PM -0600, Alex Williamson 
> > > > > > > > > wrote:
> > > > > > > > > > --- a/hw/device-assignment.c
> > > > > > > > > > +++ b/hw/device-assignment.c
> > > > > > > > ...
> > > > > > > > > > @@ -1644,58 +1621,64 @@ void add_assigned_devices(PCIBus 
> > > > > > > > > > *bus, const char **devices, int n_devices)
> > > > > > > > > >   */
> > > > > > > > > >  static void assigned_dev_load_option_rom(AssignedDevice 
> > > > > > > > > > *dev)
> > > > > > > > > >  {
> > > > > > > > > > -int size, len, ret;
> > > > > > > > > > -void *buf;
> > > > > > > > > > +char name[32], rom_file[64];
> > > > > > > > > >  FILE *fp;
> > > > > > > > > > -uint8_t i = 1;
> > > > > > > > > > -char rom_file[64];
> > > > > > > > > > +uint8_t val;
> > > > > > > > > > +struct stat st;
> > > > > > > > > > +void *ptr;
> > > > > > > > > > +
> > > > > > > > > > +/* If loading ROM from file, pci handles it */
> > > > > > > > > > +if (dev->dev.romfile || !dev->dev.rom_bar)
> > > > > > > > > > +return;
> > > > > > > > > >  
> > > > > > > > > >  snprintf(rom_file, sizeof(rom_file),
> > > > > > > > > >   
> > > > > > > > > > "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/rom",
> > > > > > > > > >   dev->host.seg, dev->host.bus, dev->host.dev, 
> > > > > > > > > > dev->host.func);
> > > > > > > > > >  
> > > > > > > > > > -if (access(rom_file, F_OK))
> > > > > > > > > > +if (stat(rom_file, &st)) {
> > > > > > > > > >  return;
> > > > > > > > > > +}
> > > > > > > > > >  
> > > > > > > > > 
> > > > > > > > > Just a note that stat on the ROM sysfs file returns window 
> > > > > > > > > size,
> > > > > > > > > not the ROM size. So this allocates more ram than really 
> > > > > > > > > necessary for
> > > > > > > > > ROM. Real size is returned by fread.
> > > > > > > > > 
> > > > > > > > > Do we care?
> > > > > > > > 
> > > > > > > > That was my intention with using stat.  I thought that by 
> > > > > > > > default the
> > > > > > > > ROM BAR should match physical hardware, so even if the contents 
> > > > > > > > could be
> > > > > > > > rounded down to a smaller size, we maintain the size of the 
> > > > > > > > physical
> > > > > > > > device.  To use the minimum size, the contents could be 
> > > > > > > > extracted using
> > > > > > > > pci-sysfs and passed with the romfile option, or the ROM could 
> > > > > > > > be
> > > > > > > > disabled altogether with the rombar=0 option.  Sound reasonable?
> > > > > > > > Thanks,
> > > > > > > > 
> > > > > > > > Alex
> > > > > > > 
> > > > > > > For BAR size yes, but we do not need the buffer full of 0xff as 
> > > > > > > it is
> > > > > > > never accessed: let's have buffer size match real ROM, avoid 
> > > > > > > wasting
> > > > > > > memory: this can come up to megabytes easily.
> > > > > > > Makes sense?
> > > > > > 
> > > > > > I tend to doubt that hardware vendors are going to waste money 
> > > > > > putting
> > > > > > seriously oversized eeproms on devices.  It does seem pretty 
> > > > > > typical to
> > > > > > find graphics cards with 128K ROM BARs where the actual ROM squeezes
> > > > > > just under 64K, but that's a long way from megabytes of wasted 
> > > > > > memory.
> > > > > > The only device I have with a ROM BAR in the megabytes is an 82576, 
> > > > > > but
> > > > > > it comes up as an invalid rom through pci-sysfs, so we skip it.  I
> > > > > > assume that just means someone was lazy and didn't bother to fuse a
> > > > > > transistor that disables the ROM BAR, leaving it at it's maximum
> > > > > > aperture w/ no eeprom to back it.  Anyone know?  Examples to the
> > > > > > contrary welcome.
> > > > > > 
> > > > > > So I think the question comes down to whether there's any value to
> > > > > > trying to exactly mimic the resource layout of the device.  I'm 
> > > > > > doubtful
> > > > > > that there is, but at the potential cost of 10-100s of KBs of 
> > > > > > memory, I
> > > > > > thought it might be worthwhile.  If you feel strongly otherwise, 
> > > > > > I'll
> > > > > > follow-up with a patch to size it by the actual readable contents.
> > > > > > Thanks,
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > > I actually agree sizing ROM BAR exactly the same as the device
> > > > > is a good idea. I just though

Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.

2010-10-11 Thread Eric Dumazet

Le lundi 11 octobre 2010 à 08:27 -0700, David Miller a écrit :
> From: "Xin, Xiaohui" 
> Date: Mon, 11 Oct 2010 15:06:05 +0800
> 
> > That's to avoid the new cache miss caused by using destructor_arg in data 
> > path
> > like skb_release_data().
> > That's based on the comment from Eric Dumazet on v7 patches.
> 
> Thanks for the explanation.

Anyway, frags[] must be the last field of "struct skb_shared_info"
since commit fed66381 (net: pskb_expand_head() optimization)

It seems Xin worked on a quite old tree.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.

2010-10-11 Thread David Miller

From: "Xin, Xiaohui" 
Date: Mon, 11 Oct 2010 15:06:05 +0800

> That's to avoid the new cache miss caused by using destructor_arg in data path
> like skb_release_data().
> That's based on the comment from Eric Dumazet on v7 patches.

Thanks for the explanation.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] device-assignment: Allow PCI to manage the option ROM

2010-10-11 Thread Michael S. Tsirkin

On Mon, Oct 11, 2010 at 09:15:52AM -0600, Alex Williamson wrote:
> On Sat, 2010-10-09 at 23:44 +0200, Michael S. Tsirkin wrote:
> > On Fri, Oct 08, 2010 at 09:12:52AM -0600, Alex Williamson wrote:
> > > On Fri, 2010-10-08 at 10:40 +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Oct 07, 2010 at 10:02:25PM -0600, Alex Williamson wrote:
> > > > > On Fri, 2010-10-08 at 00:45 +0200, Michael S. Tsirkin wrote:
> > > > > > On Thu, Oct 07, 2010 at 11:34:01AM -0600, Alex Williamson wrote:
> > > > > > > On Thu, 2010-10-07 at 19:18 +0200, Michael S. Tsirkin wrote:
> > > > > > > > On Mon, Oct 04, 2010 at 03:26:30PM -0600, Alex Williamson wrote:
> > > > > > > > > --- a/hw/device-assignment.c
> > > > > > > > > +++ b/hw/device-assignment.c
> > > > > > > ...
> > > > > > > > > @@ -1644,58 +1621,64 @@ void add_assigned_devices(PCIBus 
> > > > > > > > > *bus, const char **devices, int n_devices)
> > > > > > > > >   */
> > > > > > > > >  static void assigned_dev_load_option_rom(AssignedDevice *dev)
> > > > > > > > >  {
> > > > > > > > > -int size, len, ret;
> > > > > > > > > -void *buf;
> > > > > > > > > +char name[32], rom_file[64];
> > > > > > > > >  FILE *fp;
> > > > > > > > > -uint8_t i = 1;
> > > > > > > > > -char rom_file[64];
> > > > > > > > > +uint8_t val;
> > > > > > > > > +struct stat st;
> > > > > > > > > +void *ptr;
> > > > > > > > > +
> > > > > > > > > +/* If loading ROM from file, pci handles it */
> > > > > > > > > +if (dev->dev.romfile || !dev->dev.rom_bar)
> > > > > > > > > +return;
> > > > > > > > >  
> > > > > > > > >  snprintf(rom_file, sizeof(rom_file),
> > > > > > > > >   "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/rom",
> > > > > > > > >   dev->host.seg, dev->host.bus, dev->host.dev, 
> > > > > > > > > dev->host.func);
> > > > > > > > >  
> > > > > > > > > -if (access(rom_file, F_OK))
> > > > > > > > > +if (stat(rom_file, &st)) {
> > > > > > > > >  return;
> > > > > > > > > +}
> > > > > > > > >  
> > > > > > > > 
> > > > > > > > Just a note that stat on the ROM sysfs file returns window size,
> > > > > > > > not the ROM size. So this allocates more ram than really 
> > > > > > > > necessary for
> > > > > > > > ROM. Real size is returned by fread.
> > > > > > > > 
> > > > > > > > Do we care?
> > > > > > > 
> > > > > > > That was my intention with using stat.  I thought that by default 
> > > > > > > the
> > > > > > > ROM BAR should match physical hardware, so even if the contents 
> > > > > > > could be
> > > > > > > rounded down to a smaller size, we maintain the size of the 
> > > > > > > physical
> > > > > > > device.  To use the minimum size, the contents could be extracted 
> > > > > > > using
> > > > > > > pci-sysfs and passed with the romfile option, or the ROM could be
> > > > > > > disabled altogether with the rombar=0 option.  Sound reasonable?
> > > > > > > Thanks,
> > > > > > > 
> > > > > > > Alex
> > > > > > 
> > > > > > For BAR size yes, but we do not need the buffer full of 0xff as it 
> > > > > > is
> > > > > > never accessed: let's have buffer size match real ROM, avoid wasting
> > > > > > memory: this can come up to megabytes easily.
> > > > > > Makes sense?
> > > > > 
> > > > > I tend to doubt that hardware vendors are going to waste money putting
> > > > > seriously oversized eeproms on devices.  It does seem pretty typical 
> > > > > to
> > > > > find graphics cards with 128K ROM BARs where the actual ROM squeezes
> > > > > just under 64K, but that's a long way from megabytes of wasted memory.
> > > > > The only device I have with a ROM BAR in the megabytes is an 82576, 
> > > > > but
> > > > > it comes up as an invalid rom through pci-sysfs, so we skip it.  I
> > > > > assume that just means someone was lazy and didn't bother to fuse a
> > > > > transistor that disables the ROM BAR, leaving it at it's maximum
> > > > > aperture w/ no eeprom to back it.  Anyone know?  Examples to the
> > > > > contrary welcome.
> > > > > 
> > > > > So I think the question comes down to whether there's any value to
> > > > > trying to exactly mimic the resource layout of the device.  I'm 
> > > > > doubtful
> > > > > that there is, but at the potential cost of 10-100s of KBs of memory, 
> > > > > I
> > > > > thought it might be worthwhile.  If you feel strongly otherwise, I'll
> > > > > follow-up with a patch to size it by the actual readable contents.
> > > > > Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > > I actually agree sizing ROM BAR exactly the same as the device
> > > > is a good idea. I just thought we can save the extra memory
> > > > by not allocating the RAM in question, and writing code
> > > > to return 0xff on reads within the BAR but outside ROM.
> > > > And no, I don't feel strongly about this optimization.
> > > > 
> > > 
> > > Ok, so you're looking for something like below.  We can no longer map
> > > the ROM into the guest,
> > > but it's a ROM, so we don't care about speed.
> > 
>

Re: [PATCH 2/2] device-assignment: Allow PCI to manage the option ROM

2010-10-11 Thread Alex Williamson

On Sat, 2010-10-09 at 23:44 +0200, Michael S. Tsirkin wrote:
> On Fri, Oct 08, 2010 at 09:12:52AM -0600, Alex Williamson wrote:
> > On Fri, 2010-10-08 at 10:40 +0200, Michael S. Tsirkin wrote:
> > > On Thu, Oct 07, 2010 at 10:02:25PM -0600, Alex Williamson wrote:
> > > > On Fri, 2010-10-08 at 00:45 +0200, Michael S. Tsirkin wrote:
> > > > > On Thu, Oct 07, 2010 at 11:34:01AM -0600, Alex Williamson wrote:
> > > > > > On Thu, 2010-10-07 at 19:18 +0200, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Oct 04, 2010 at 03:26:30PM -0600, Alex Williamson wrote:
> > > > > > > > --- a/hw/device-assignment.c
> > > > > > > > +++ b/hw/device-assignment.c
> > > > > > ...
> > > > > > > > @@ -1644,58 +1621,64 @@ void add_assigned_devices(PCIBus *bus, 
> > > > > > > > const char **devices, int n_devices)
> > > > > > > >   */
> > > > > > > >  static void assigned_dev_load_option_rom(AssignedDevice *dev)
> > > > > > > >  {
> > > > > > > > -int size, len, ret;
> > > > > > > > -void *buf;
> > > > > > > > +char name[32], rom_file[64];
> > > > > > > >  FILE *fp;
> > > > > > > > -uint8_t i = 1;
> > > > > > > > -char rom_file[64];
> > > > > > > > +uint8_t val;
> > > > > > > > +struct stat st;
> > > > > > > > +void *ptr;
> > > > > > > > +
> > > > > > > > +/* If loading ROM from file, pci handles it */
> > > > > > > > +if (dev->dev.romfile || !dev->dev.rom_bar)
> > > > > > > > +return;
> > > > > > > >  
> > > > > > > >  snprintf(rom_file, sizeof(rom_file),
> > > > > > > >   "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/rom",
> > > > > > > >   dev->host.seg, dev->host.bus, dev->host.dev, 
> > > > > > > > dev->host.func);
> > > > > > > >  
> > > > > > > > -if (access(rom_file, F_OK))
> > > > > > > > +if (stat(rom_file, &st)) {
> > > > > > > >  return;
> > > > > > > > +}
> > > > > > > >  
> > > > > > > 
> > > > > > > Just a note that stat on the ROM sysfs file returns window size,
> > > > > > > not the ROM size. So this allocates more ram than really 
> > > > > > > necessary for
> > > > > > > ROM. Real size is returned by fread.
> > > > > > > 
> > > > > > > Do we care?
> > > > > > 
> > > > > > That was my intention with using stat.  I thought that by default 
> > > > > > the
> > > > > > ROM BAR should match physical hardware, so even if the contents 
> > > > > > could be
> > > > > > rounded down to a smaller size, we maintain the size of the physical
> > > > > > device.  To use the minimum size, the contents could be extracted 
> > > > > > using
> > > > > > pci-sysfs and passed with the romfile option, or the ROM could be
> > > > > > disabled altogether with the rombar=0 option.  Sound reasonable?
> > > > > > Thanks,
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > > For BAR size yes, but we do not need the buffer full of 0xff as it is
> > > > > never accessed: let's have buffer size match real ROM, avoid wasting
> > > > > memory: this can come up to megabytes easily.
> > > > > Makes sense?
> > > > 
> > > > I tend to doubt that hardware vendors are going to waste money putting
> > > > seriously oversized eeproms on devices.  It does seem pretty typical to
> > > > find graphics cards with 128K ROM BARs where the actual ROM squeezes
> > > > just under 64K, but that's a long way from megabytes of wasted memory.
> > > > The only device I have with a ROM BAR in the megabytes is an 82576, but
> > > > it comes up as an invalid rom through pci-sysfs, so we skip it.  I
> > > > assume that just means someone was lazy and didn't bother to fuse a
> > > > transistor that disables the ROM BAR, leaving it at it's maximum
> > > > aperture w/ no eeprom to back it.  Anyone know?  Examples to the
> > > > contrary welcome.
> > > > 
> > > > So I think the question comes down to whether there's any value to
> > > > trying to exactly mimic the resource layout of the device.  I'm doubtful
> > > > that there is, but at the potential cost of 10-100s of KBs of memory, I
> > > > thought it might be worthwhile.  If you feel strongly otherwise, I'll
> > > > follow-up with a patch to size it by the actual readable contents.
> > > > Thanks,
> > > > 
> > > > Alex
> > > 
> > > I actually agree sizing ROM BAR exactly the same as the device
> > > is a good idea. I just thought we can save the extra memory
> > > by not allocating the RAM in question, and writing code
> > > to return 0xff on reads within the BAR but outside ROM.
> > > And no, I don't feel strongly about this optimization.
> > > 
> > 
> > Ok, so you're looking for something like below.  We can no longer map
> > the ROM into the guest,
> > but it's a ROM, so we don't care about speed.
> 
> Why can't we map ROM? Map full pages, leave 0xff unmapped.
> The reason there will be such is because BAR is power of 2.

If I understand correctly, you're suggesting we round the ROM up to a
power of two, allocate a full buffer to back that, and map that to the
guest.  If the physical device has a larger ROM BAR, the remainder

Re: how to debug unhandled vm exit: 0x11?

2010-10-11 Thread Avi Kivity


 On 10/11/2010 04:31 PM, Avi Kivity wrote:


btw, I just found an old branch of mine that implements movdqa, I'll 
see if I can refresh and repost it.




It's pretty hard to rebase; it's in kvm.git sse-mmio if you're interested.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Network Patch set V4

2010-10-11 Thread pradeep

Hi Lucas, i covered below  tests on RHEL 5.5 , RHEL 6 guests with vhost
enbled.  Please find the below errors and cause for errros.

I also attached few other test cases can be added to out TODO list of
network patchset. I am working on couple of these issues.

1. Nice_promisc:
-
With RHEL 5.5: PASS

With RHEL 6: FAIL

its failing to login to the RHEL6 guest by serial. 


02:07:06 ERROR| Test failed: TestFail: Could not log into guest 'vm2'


2. nicdriver_unload 


With RHEL 5.5: FAIL

With RHEL 6: FAIL


03:17:26 DEBUG| Got shell prompt -- logged in
03:17:26 DEBUG| Contents of environment: {'address_cache':
{'00:1a:4a:65:09:09': '192.168.122.66', '9a:52:2f:62:dd:52':
'192.168.122.39', '00:1a:64:12:e4:c1': '9.126.89.23',
'9a:52:2f:62:1a:b8': '192.168.122.95', '9a:52:2f:62:e6:67':
'192.168.122.176', '9a:52:2f:62:8c:b8': '192.168.122.59',
'9a:52:2f:62:0b:38': '192.168.122.7'}, 'vm__vm2': , 'tcpdump': , 'version': 0} 03:17:26 INFO | ['iteration.1'] 03:17:26
0x2314710>ERROR| Exception escaping from test: Traceback (most recent
0x2314710>call last): File
0x2314710>"/home/pradeep/vhost_net/autotest/client/common_lib/test.py",
0x2314710>line 412, in _exec _call_test_function(self.execute, *p_args,
0x2314710>**p_dargs) File
0x2314710>"/home/pradeep/vhost_net/autotest/client/common_lib/test.py",
0x2314710>line 598, in _call_test_function return func(*args, **dargs)
  File "/home/pradeep/vhost_net/autotest/client/common_lib/test.py",
line 284, in execute postprocess_profiled_run, args, dargs)
  File "/home/pradeep/vhost_net/autotest/client/common_lib/test.py",
line 202, in _call_run_once
self.run_once_profiling(postprocess_profiled_run, *args, **dargs) File
"/home/pradeep/vhost_net/autotest/client/common_lib/test.py", line 308,
in run_once_profiling self.run_once(*args, **dargs) File
"/home/pradeep/vhost_net/autotest/client/tests/kvm/kvm.py", line 73, in
run_once run_func(self, params, env) File
"/home/pradeep/vhost_net/autotest/client/tests/kvm/tests/nicdriver_unload.py",
line 27, in run_nicdriver_unload raise error.TestFail("Could not log
into guest '%s'" % vm.name)

3. Netperf
--


RHEL 5.5: FAIL


RHEL6: FAIL

some tests fail with TCP_CRR, UDP_RR are failed. 
Looks like a bug..

4.multicast
-
RHEL 5.5: fail
RHEL 6:   PASS

09:14:44 DEBUG| Command failed; status: 147, output:
join_mcast_pid:8638

[1]+  Stopped python /tmp/join_mcast.py 20 225.0.0 1
09:14:44 INFO | Initial ping test, mcast: 225.0.0.1

09:14:44 DEBUG| PING 225.0.0.1 (225.0.0.1) from 9.126.89.201
e1000_0_5900: 56(84) bytes of data.
09:15:03 DEBUG| 


09:15:03 ERROR| Test failed: TestFail:  Ping return non-zero value PING
225.0.0.1 (225.0.0.1) from 9.126.89.201 e1000_0_5900: 56(84) bytes of
data.

5. mac_chnage
-

RHEL 5.5: PASS
RHEL6: FAIL

 Trying to log into guest 'vm2' by serial
login fail

This is the reason for nic_promisc also.

6. jumbo:
--

RHEL:5.5 FAIL
RHEL6: FAIL

09:50:55 DEBUG| PING 10.168.0.9 (10.168.0.9) from 9.126.89.201
e1000_0_5900: 16082(16110) bytes of data.
09:50:57 DEBUG| 
09:50:57 DEBUG| --- 10.168.0.9 ping statistics ---
09:50:57 DEBUG| 1 packets transmitted, 0 received, 100% packet loss,
time 2134ms
09:50:57 DEBUG| 
09:50:57 DEBUG| (Process terminated with status 1)
09:50:58 DEBUG| PING 10.168.0.9 (10.168.0.9) from 9.126.89.201
e1000_0_5900: 16082(16110) bytes of data.
09:51:00 DEBUG| 
09:51:00 DEBUG| --- 10.168.0.9 ping statistics ---
09:51:00 DEBUG| 1 packets transmitted, 0 received, 100% packet loss,
time 2047ms
09:51:00 DEBUG| 
09:51:00 DEBUG| (Process terminated with status 1)
09:51:01 DEBUG| PING 10.168.0.9 (10.168.0.9) from 9.126.89.201
e1000_0_5900: 16082(16110) bytes of data.
09:51:03 DEBUG| 
09:51:03 DEBUG| --- 10.168.0.9 ping statistics ---
09:51:03 DEBUG| 1 packets transmitted, 0 received, 100% packet loss,
time 2041ms
09:51:03 DEBUG| 
09:51:03 DEBUG| (Process terminated with status 1)
09:51:04 DEBUG| PING 10.168.0.9 (10.168.0.9) from 9.126.89.201
e1000_0_5900: 16082(16110) bytes of data.
09:51:06 DEBUG| 
09:51:06 DEBUG| --- 10.168.0.9 ping statistics ---
09:51:06 DEBUG| 1 packets transmitted, 0 received, 100% packet loss,
time 2036ms
09:51:06 DEBUG| 
09:51:06 DEBUG| (Process terminated with status 1)
09:51:07 DEBUG| Timeout elapsed
09:51:07 DEBUG| e1000_0_5900 Link encap:Ethernet  HWaddr
1E:1D:84:CF:CD:6B  


04:45:11 DEBUG| Running 'arp -d 192.168.122.115 -i e1000_0_5900'
04:45:11 ERROR| Test failed: TestError: MTU is not as expected even
after 10 seconds


7. file_transfer

RHEL 5.5: FAIL

file transfer pass from host to guest.
 file transfer fail while transferring from guest to host

Need to debug further for this.


RHEL 6: pass


8. ethtool
---
RHEL 5.5:  PASS
RHEL6: FAIL

05:26:44 DEBUG| Command failed; status: 92, output: Cannot set large
receive offload settings: Operation not supported
05:26:44 ERROR| Fail to enable lro

05:29:44 ERROR| Test failed: TestError: Enab

Re: how to debug unhandled vm exit: 0x11?

2010-10-11 Thread Avi Kivity

 On 10/11/2010 04:27 PM, Avi Kivity wrote:

 On 10/11/2010 08:49 AM, Neo Jia wrote:

On Sun, Oct 10, 2010 at 11:30 PM, Avi Kivity  wrote:
>On 10/11/2010 07:46 AM, Neo Jia wrote:
>>
>>  BTW, I have a question about saving FPU, especially those XMM
>>  registers. I don't see an explicit save FPU after exiting guest 
due to

>>  an exception (MMIO writes). The only thing I saw about fpu operation
>>  is fpu restore right before loading guest.
>>
>>  Is there anything I missed here?
>
>  kvm_put_guest_fpu.

I found that function and it will be called by vcpu_put eventually
inside kvm_arch_vcpu_ioctl_run, but kvm_mmu_page_fault is called much
earlier than that inside kvm exit exception handler. so, the fxsave
data for the guest image might not be saved at that moment, when I am
going to emulate this instruction?

Just call it when you want to be sure it is in memory.

btw, I just found an old branch of mine that implements movdqa, I'll see 
if I can refresh and repost it.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: how to debug unhandled vm exit: 0x11?

2010-10-11 Thread Avi Kivity

 On 10/11/2010 08:49 AM, Neo Jia wrote:

On Sun, Oct 10, 2010 at 11:30 PM, Avi Kivity  wrote:
>On 10/11/2010 07:46 AM, Neo Jia wrote:
>>
>>  BTW, I have a question about saving FPU, especially those XMM
>>  registers. I don't see an explicit save FPU after exiting guest due to
>>  an exception (MMIO writes). The only thing I saw about fpu operation
>>  is fpu restore right before loading guest.
>>
>>  Is there anything I missed here?
>
>  kvm_put_guest_fpu.

I found that function and it will be called by vcpu_put eventually
inside kvm_arch_vcpu_ioctl_run, but kvm_mmu_page_fault is called much
earlier than that inside kvm exit exception handler. so, the fxsave
data for the guest image might not be saved at that moment, when I am
going to emulate this instruction?

Just call it when you want to be sure it is in memory.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [V2 PATCH 2/2] KVM test: Add vhost-net support

2010-10-11 Thread Lucas Meneghel Rodrigues

On Fri, 2010-10-08 at 21:46 -0400, Jason Wang wrote:
> Hi Lucas:
> 
> Vhost-net backend is only valid when we use tap, so it should work with
> "nic_mode=tap". Maybe we could add a comment above the "vhost=on" to warn the
> user to use tap when he need vhost.

Ok, fair enough, that's what I've done for v3, that I just sent to the
mailing list for future record and commited it:

http://autotest.kernel.org/changeset/4846

Thank you very much!

Lucas

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] KVM test: Add vhost-net support

2010-10-11 Thread Lucas Meneghel Rodrigues

Vhost is a kernel-level backend for virtio. This patch add a nic_param
named "vhost" to enable/disable vhost backend.

Changes from v3:
- Added comment explaining that the feature only works with nic_mode =
  tap.
- Instead of commenting out the option vhost, leave it uncommented but
  with default = no.

Changes from v2:
- Rebase against latest HEAD

Signed-off-by: Jason Wang 
---
 client/tests/kvm/kvm_vm.py |7 +--
 client/tests/kvm/tests_base.cfg.sample |6 --
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py
index 3388b14..848ff63 100755
--- a/client/tests/kvm/kvm_vm.py
+++ b/client/tests/kvm/kvm_vm.py
@@ -282,9 +282,11 @@ class VM:
 
 def add_net(help, vlan, mode, ifname=None, script=None,
 downscript=None, tftp=None, bootfile=None, hostfwd=[],
-netdev_id=None):
+netdev_id=None, vhost=False):
 if has_option(help, "netdev"):
 cmd = " -netdev %s,id=%s" % (mode, netdev_id)
+if vhost:
+cmd +=",vhost=on"
 else:
 cmd = " -net %s,vlan=%d" % (mode, vlan)
 if mode == "tap":
@@ -448,7 +450,8 @@ class VM:
 self.get_ifname(vlan),
 script, downscript, tftp,
 nic_params.get("bootp"), redirs,
-self.netdev_id[vlan])
+self.netdev_id[vlan],
+nic_params.get("vhost")=="yes")
 # Proceed to next NIC
 vlan += 1
 
diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index 8d62c42..769d750 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -699,9 +699,11 @@ variants:
 supported_features = "tx rx sg tso gso gro lro"
 - virtio_net:
 nic_model = virtio
-# you can add advanced attributes on nic_extra_params
-# such as mrg_rxbuf
+# You can add advanced attributes on nic_extra_params such as mrg_rxbuf
 #nic_extra_params =
+# You can set vhost = yes to enable the vhost kernel backend
+# (This only works if nic_mode=tap)
+vhost = no
 jumbo:
 mtu = 65520
 ethtool:
-- 
1.7.2.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] KVM: Don't reset mmu context unnecessarily when updating EFER

2010-10-11 Thread Avi Kivity

The only bit of EFER that affects the mmu is NX, and this is already
accounted for (LME only takes effect when changing cr0).

Based on a patch by Hillf Danton.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/x86.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7127a13..f3f86b2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -835,7 +835,6 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
kvm_x86_ops->set_efer(vcpu, efer);
 
vcpu->arch.mmu.base_role.nxe = (efer & EFER_NX) && !tdp_enabled;
-   kvm_mmu_reset_context(vcpu);
 
/* Update reserved bits */
if ((efer ^ old_efer) & EFER_NX)
-- 
1.7.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] KVM: Emulation MSI-X mask bits for assigned devices

2010-10-11 Thread Michael S. Tsirkin

On Mon, Oct 11, 2010 at 05:28:30PM +0800, Sheng Yang wrote:
> On Sunday 03 October 2010 19:12:47 Michael S. Tsirkin wrote:
> > On Tue, Sep 28, 2010 at 05:44:10PM +0800, Sheng Yang wrote:
> > > This patch enable per-vector mask for assigned devices using MSI-X.
> > > 
> > > Signed-off-by: Sheng Yang 
> > 
> > I think I see an issue here, noted below.  Some general comments:
> > - The mask bit seems broken for injecting interrupts from
> >   userspace (with interrupts/ioctls).
> >   I think we must test it on injection path.
> 
> I am not quite understand how it related to userspace interrupt injection 
> here... 
> This patch only cover assigned devices for now.

Well, this is a kernel/userspace interface, if it's broken for userspace
injection now we'll have to go through pain to fix it in a compatible
way later when we want to use it for userspace injection.
You might want to ask why we want the kernel to support making
userspace-injected interrupts when userspace can just avoid injecting
them, and the answer would be that with irqfd the injection might be
handled in a separate process.

We currently handle this by destroying irqfd when irq is masked,
an ioctl instead would be much faster.

> > - We'll need a way to support the pending bit.
> >   Disabling interrupts will not let us do it.
> 
> We may need a way to support pending bit, though I don't know which guest has 
> used 
> it... And we can still know if there is interrupt pending by check the real 
> hardware's pending bit if it's necessary.

That's what I'm saying: since instead of masking the vector in hardware
you disable irq in the APIC, the pending bit that we read from hardware
will not have the correct value.

If we fix this, pending bit handling can be done by userspace.

> (And we haven't seen any problem by 
> leaving the bit 0 so far, and it's not in this patch's scope.)

I don't know about anyone using this, either, but the PCI spec does
require support of polling mode where the pending bit is polled instead
of interrupts. So yes, not a high priority to implement, but let's give
the way we intend to support this in the future some thought.

> > > ---
> > > 
> > >  arch/x86/kvm/x86.c   |1 +
> > >  include/linux/kvm.h  |9 -
> > >  include/linux/kvm_host.h |1 +
> > >  virt/kvm/assigned-dev.c  |   39 
> +++
> > >  4 files changed, 49 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 8412c91..e6933e6 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -1927,6 +1927,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > 
> > >   case KVM_CAP_DEBUGREGS:
> > >   case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > 
> > >   case KVM_CAP_XSAVE:
> > > + case KVM_CAP_DEVICE_MSIX_MASK:
> > >   r = 1;
> > >   break;
> > >   
> > >   case KVM_CAP_COALESCED_MMIO:
> > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > index 919ae53..f2b7cdc 100644
> > > --- a/include/linux/kvm.h
> > > +++ b/include/linux/kvm.h
> > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > 
> > >  #endif
> > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > 
> > > +#ifdef __KVM_HAVE_MSIX
> > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > +#endif
> > > 
> > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > 
> > > @@ -787,11 +790,15 @@ struct kvm_assigned_msix_nr {
> > > 
> > >  };
> > >  
> > >  #define KVM_MAX_MSIX_PER_DEV 256
> > > 
> > > +
> > > +#define KVM_MSIX_FLAG_MASK   1
> > > +
> > > 
> > >  struct kvm_assigned_msix_entry {
> > >  
> > >   __u32 assigned_dev_id;
> > >   __u32 gsi;
> > >   __u16 entry; /* The index of entry in the MSI-X table */
> > > 
> > > - __u16 padding[3];
> > > + __u16 flags;
> > > + __u16 padding[2];
> > > 
> > >  };
> > >  
> > >  #endif /* __LINUX_KVM_H */
> > > 
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 0b89d00..a43405c 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -415,6 +415,7 @@ struct kvm_irq_ack_notifier {
> > > 
> > >  };
> > >  
> > >  #define KVM_ASSIGNED_MSIX_PENDING0x1
> > > 
> > > +#define KVM_ASSIGNED_MSIX_MASK   0x2
> > > 
> > >  struct kvm_guest_msix_entry {
> > >  
> > >   u32 vector;
> > >   u16 entry;
> > > 
> > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > index 7c98928..15b8c32 100644
> > > --- a/virt/kvm/assigned-dev.c
> > > +++ b/virt/kvm/assigned-dev.c
> > > @@ -17,6 +17,8 @@
> > > 
> > >  #include 
> > >  #include 
> > >  #include 
> > > 
> > > +#include 
> > > +
> > > 
> > >  #include "irq.h"
> > >  
> > >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct
> > >  list_head *head,
> > > 
> > > @@ -666,6 +668,30 @@ msix_nr_out:
> > >   return r;
> > >  
> > >  }
> > > 
> > > +static void update_msix_mask(struct kvm_assigned_dev_kernel
> > > *assigned_dev, +

Re: [PATCH 1/3] KVM: Emulation MSI-X mask bits for assigned devices

2010-10-11 Thread Sheng Yang

On Friday 01 October 2010 00:18:10 Marcelo Tosatti wrote:
> On Tue, Sep 28, 2010 at 05:44:10PM +0800, Sheng Yang wrote:
> > This patch enable per-vector mask for assigned devices using MSI-X.
> > 
> > Signed-off-by: Sheng Yang 
> > ---
> > 
> >  arch/x86/kvm/x86.c   |1 +
> >  include/linux/kvm.h  |9 -
> >  include/linux/kvm_host.h |1 +
> >  virt/kvm/assigned-dev.c  |   39 
+++
> >  4 files changed, 49 insertions(+), 1 deletions(-)
> >  
> >  
> >  #include 
> > 
> > +#include 
> > +
> > 
> >  #include "irq.h"
> >  
> >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct
> >  list_head *head,
> > 
> > @@ -666,6 +668,30 @@ msix_nr_out:
> > return r;
> >  
> >  }
> > 
> > +static void update_msix_mask(struct kvm_assigned_dev_kernel
> > *assigned_dev, + int index)
> > +{
> > +   int irq;
> > +
> > +   if (!assigned_dev->dev->msix_enabled ||
> > +   !(assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > +   return;
> > +
> > +   irq = assigned_dev->host_msix_entries[index].vector;
> > +
> > +   ASSERT(irq != 0);
> > +
> > +   if (assigned_dev->guest_msix_entries[index].flags &
> > +   KVM_ASSIGNED_MSIX_MASK)
> > +   disable_irq(irq);
> > +   else {
> > +   enable_irq(irq);
> > +   if (assigned_dev->guest_msix_entries[index].flags &
> > +   KVM_ASSIGNED_MSIX_PENDING)
> > +   schedule_work(&assigned_dev->interrupt_work);
> > +   }
> > +}
> > +
> 
> Should flush the workqueue after disabling irq. As you say, the
> schedule_work should be unnecessary.
> 
> Also must be careful with races.

Do we need to flush the workqueue? I think the return of writing MSI-X mask bit 
doesn't have implicit meaning that we have handled all pending interrupts. So I 
think leave the work later should also be fine.

--
regards
Yang, Sheng

 
> >  static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> >  
> >struct kvm_assigned_msix_entry *entry)
> >  
> >  {
> > 
> > @@ -688,6 +714,19 @@ static int kvm_vm_ioctl_set_msix_entry(struct kvm
> > *kvm,
> > 
> > adev->guest_msix_entries[i].entry = entry->entry;
> > adev->guest_msix_entries[i].vector = entry->gsi;
> > adev->host_msix_entries[i].entry = entry->entry;
> > 
> > +   if ((entry->flags & KVM_MSIX_FLAG_MASK) &&
> > +   !(adev->guest_msix_entries[i].flags &
> > +   KVM_ASSIGNED_MSIX_MASK)) {
> > +   adev->guest_msix_entries[i].flags |=
> > +   KVM_ASSIGNED_MSIX_MASK;
> > +   update_msix_mask(adev, i);
> > +   } else if (!(entry->flags & KVM_MSIX_FLAG_MASK) &&
> > +   (adev->guest_msix_entries[i].flags &
> > +   KVM_ASSIGNED_MSIX_MASK)) {
> > +   adev->guest_msix_entries[i].flags &=
> > +   ~KVM_ASSIGNED_MSIX_MASK;
> > +   update_msix_mask(adev, i);
> > +   }
> > 
> > break;
> > 
> > }
> > 
> > if (i == adev->entries_nr) {
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/3] Emulate MSI-X mask bits for assigned devices

2010-10-11 Thread Sheng Yang

On Friday 01 October 2010 00:44:51 Marcelo Tosatti wrote:
> On Tue, Sep 28, 2010 at 05:44:09PM +0800, Sheng Yang wrote:
> > Hi Avi & Marcelo
> > 
> > This patchset would add emulation of MSI-X mask bit for assigned devices
> > in QEmu.
> > 
> > BTW: We are also purposed an acceleration of MSI-X mask bit for KVM - to
> > get it done in kernel. That's because sometime MSI-X mask bit was
> > accessed very frequently by guest and emulation in QEmu cost a lot(tens
> > to hundreds percent CPU utilization sometime), e.g. guest using Linux
> > 2.6.18 or under some heavy workload. The method has been proved
> > efficient in Xen, but it would require VMM to handle MSI-X table. Is
> > that OK with you?
> 
> It would be good to understand where that cost is, before. Eg. do you
> commit kvm irq route frequently when guest unmasks?

I don't think it would cause performance issue, but maybe need another bulk of 
codes in kernel. I would post the in-kernel patch later, which is easier for 
discuss. :)

--
regards
Yang, Sheng
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] KVM: Emulation MSI-X mask bits for assigned devices

2010-10-11 Thread Sheng Yang

On Sunday 03 October 2010 21:03:16 Avi Kivity wrote:
>   On 09/30/2010 07:41 AM, Sheng Yang wrote:
> > On Wednesday 29 September 2010 16:36:25 Avi Kivity wrote:
> > >On 09/28/2010 11:44 AM, Sheng Yang wrote:
> > >  >  This patch enable per-vector mask for assigned devices using MSI-X.
> > >  >  
> > >  >  +
> > >  >  
> > >  >struct kvm_assigned_msix_entry {
> > >  >
> > >  >__u32 assigned_dev_id;
> > >  >__u32 gsi;
> > >  >__u16 entry; /* The index of entry in the MSI-X table */
> > >  >  
> > >  >  - __u16 padding[3];
> > >  >  + __u16 flags;
> > >  >  + __u16 padding[2];
> > >  >  
> > >  >};
> > >  
> > >  Given that this field wasn't a flag field previously, we can't expect
> > >  it to be zero.  So before we can check it, userspace needs to tell us
> > >  that is knows the field is not padding, but a flags field.
> > >  
> > >  You can use KVM_ENABLE_CAP for that.
> > 
> > OK, I would use it. And it's per vcpu ioctl though, I would like to use
> > it on setting for per vm.
> 
> Yes.  It's easier to have KVM_ASSIGN_SET_MSIX_ENTRY2 that knows about
> the flags field.

Well, maybe it's a waste when we still have room in old one...

I think KVM_ENABLE_CAP maybe good enough.
> 
> btw, this should be documented in Documentation/kvm/api.txt.

OK.

--
regards
Yang, Sheng
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] KVM: Emulation MSI-X mask bits for assigned devices

2010-10-11 Thread Sheng Yang

On Sunday 03 October 2010 19:12:47 Michael S. Tsirkin wrote:
> On Tue, Sep 28, 2010 at 05:44:10PM +0800, Sheng Yang wrote:
> > This patch enable per-vector mask for assigned devices using MSI-X.
> > 
> > Signed-off-by: Sheng Yang 
> 
> I think I see an issue here, noted below.  Some general comments:
> - The mask bit seems broken for injecting interrupts from
>   userspace (with interrupts/ioctls).
>   I think we must test it on injection path.

I am not quite understand how it related to userspace interrupt injection 
here... 
This patch only cover assigned devices for now.

> - We'll need a way to support the pending bit.
>   Disabling interrupts will not let us do it.

We may need a way to support pending bit, though I don't know which guest has 
used 
it... And we can still know if there is interrupt pending by check the real 
hardware's pending bit if it's necessary. (And we haven't seen any problem by 
leaving the bit 0 so far, and it's not in this patch's scope.)
 
> > ---
> > 
> >  arch/x86/kvm/x86.c   |1 +
> >  include/linux/kvm.h  |9 -
> >  include/linux/kvm_host.h |1 +
> >  virt/kvm/assigned-dev.c  |   39 
+++
> >  4 files changed, 49 insertions(+), 1 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 8412c91..e6933e6 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1927,6 +1927,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > 
> > case KVM_CAP_DEBUGREGS:
> > case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > 
> > case KVM_CAP_XSAVE:
> > +   case KVM_CAP_DEVICE_MSIX_MASK:
> > r = 1;
> > break;
> > 
> > case KVM_CAP_COALESCED_MMIO:
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index 919ae53..f2b7cdc 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > 
> >  #endif
> >  #define KVM_CAP_PPC_GET_PVINFO 57
> >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > 
> > +#ifdef __KVM_HAVE_MSIX
> > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > +#endif
> > 
> >  #ifdef KVM_CAP_IRQ_ROUTING
> > 
> > @@ -787,11 +790,15 @@ struct kvm_assigned_msix_nr {
> > 
> >  };
> >  
> >  #define KVM_MAX_MSIX_PER_DEV   256
> > 
> > +
> > +#define KVM_MSIX_FLAG_MASK 1
> > +
> > 
> >  struct kvm_assigned_msix_entry {
> >  
> > __u32 assigned_dev_id;
> > __u32 gsi;
> > __u16 entry; /* The index of entry in the MSI-X table */
> > 
> > -   __u16 padding[3];
> > +   __u16 flags;
> > +   __u16 padding[2];
> > 
> >  };
> >  
> >  #endif /* __LINUX_KVM_H */
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 0b89d00..a43405c 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -415,6 +415,7 @@ struct kvm_irq_ack_notifier {
> > 
> >  };
> >  
> >  #define KVM_ASSIGNED_MSIX_PENDING  0x1
> > 
> > +#define KVM_ASSIGNED_MSIX_MASK 0x2
> > 
> >  struct kvm_guest_msix_entry {
> >  
> > u32 vector;
> > u16 entry;
> > 
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index 7c98928..15b8c32 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > @@ -17,6 +17,8 @@
> > 
> >  #include 
> >  #include 
> >  #include 
> > 
> > +#include 
> > +
> > 
> >  #include "irq.h"
> >  
> >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct
> >  list_head *head,
> > 
> > @@ -666,6 +668,30 @@ msix_nr_out:
> > return r;
> >  
> >  }
> > 
> > +static void update_msix_mask(struct kvm_assigned_dev_kernel
> > *assigned_dev, + int index)
> > +{
> > +   int irq;
> > +
> > +   if (!assigned_dev->dev->msix_enabled ||
> > +   !(assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > +   return;
> > +
> > +   irq = assigned_dev->host_msix_entries[index].vector;
> > +
> > +   ASSERT(irq != 0);
> > +
> > +   if (assigned_dev->guest_msix_entries[index].flags &
> > +   KVM_ASSIGNED_MSIX_MASK)
> > +   disable_irq(irq);
> > +   else {
> > +   enable_irq(irq);
> > +   if (assigned_dev->guest_msix_entries[index].flags &
> > +   KVM_ASSIGNED_MSIX_PENDING)
> > +   schedule_work(&assigned_dev->interrupt_work);
> > +   }
> > +}
> > +
> 
> What happens if guest masks an entry and then we hot-unplug the
> device and remove it from guest? It looks like interrupt
> will stay disabled?

I don't think so. pci_disable_msix() which was called in hot-unplug path would 
recycle all IRQs used by the device. It should be the same as VM shutdown.

Also before the IRQ was recycled, I believe the same dynamic IRQ wouldn't be 
used 
by other devices.

--
regards
Yang, Sheng

> 
> >  static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> >  
> >struct kvm_assigned_msix_entry *entry)
> >  
> >  {
> > 
> > @@ -688,6 +714,19 @@ static int

[PATCH] plugin_kvm: fix kvm_exit rip formatting

2010-10-11 Thread Avi Kivity

%0xlx is not a valid printf format.

Signed-off-by: Avi Kivity 
---
 plugin_kvm.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/plugin_kvm.c b/plugin_kvm.c
index cdd52c7..724143d 100644
--- a/plugin_kvm.c
+++ b/plugin_kvm.c
@@ -153,7 +153,7 @@ static int kvm_exit_handler(struct trace_seq *s, struct 
record *record,
 
trace_seq_printf(s, "reason %s", find_vmx_reason(val));
 
-   pevent_print_num_field(s, " rip %0xlx", event, "guest_rip", record, 1);
+   pevent_print_num_field(s, " rip 0x%lx", event, "guest_rip", record, 1);
 
return 0;
 }
-- 
1.7.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH trace-cmd] plugin_kvm: remove stray line from kvm_emulate_insn_handler()

2010-10-11 Thread Avi Kivity


 On 09/26/2010 11:38 AM, Avi Kivity wrote:

'guest_rip' is not part of kvm_emulate_insn.

Signed-off-by: Avi Kivity
---
  plugin_kvm.c |2 --
  1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/plugin_kvm.c b/plugin_kvm.c
index 00cac5a..cdd52c7 100644
--- a/plugin_kvm.c
+++ b/plugin_kvm.c
@@ -199,8 +199,6 @@ static int kvm_emulate_insn_handler(struct trace_seq *s, 
struct record *record,
trace_seq_printf(s, "%llx:%llx: %s%s", csbase, rip, disasm,
 failed ? " FAIL" : "");

-   pevent_print_num_field(s, " rip %0xlx", event, "guest_rip", record, 1);
-
return 0;
  }



Ping?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-11 Thread Krishna Kumar2

"Michael S. Tsirkin"  wrote on 10/06/2010 07:04:31 PM:

> On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> > For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> > for degradation for 1 stream case:
>
> I thought about possible RX/TX contention reasons, and I realized that
> we get/put the mm counter all the time.  So I write the following: I
> haven't seen any performance gain from this in a single queue case, but
> maybe this will help multiqueue?

Sorry for the delay, I was sick last couple of days. The results
with your patch are (%'s over original code):

Code   BW%   CPU%   RemoteCPU
MQ (#txq=16)   31.4% 38.42% 6.41%
MQ+MST (#txq=16)   28.3% 18.9%  -10.77%

The patch helps CPU utilization but didn't help single stream
drop.

Thanks,

- KK

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH v12 10/17] Add a hook to intercept external buffers from NIC driver.

2010-10-11 Thread Xin, Xiaohui

>-Original Message-
>From: Eric Dumazet [mailto:eric.duma...@gmail.com]
>Sent: Thursday, September 30, 2010 10:22 PM
>To: Xin, Xiaohui
>Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
>m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
>herb...@gondor.apana.org.au;
>jd...@linux.intel.com
>Subject: Re: [PATCH v12 10/17] Add a hook to intercept external buffers from 
>NIC driver.
>
>Le jeudi 30 septembre 2010 à 22:04 +0800, xiaohui@intel.com a
>écrit :
>> From: Xin Xiaohui 
>>
>> The hook is called in netif_receive_skb().
>> Signed-off-by: Xin Xiaohui 
>> Signed-off-by: Zhao Yu 
>> Reviewed-by: Jeff Dike 
>> ---
>>  net/core/dev.c |   35 +++
>>  1 files changed, 35 insertions(+), 0 deletions(-)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index c11e32c..83172b8 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2517,6 +2517,37 @@ err:
>>  EXPORT_SYMBOL(netdev_mp_port_prep);
>>  #endif
>>
>> +#if defined(CONFIG_MEDIATE_PASSTHRU) ||
>defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
>> +/* Add a hook to intercept mediate passthru(zero-copy) packets,
>> + * and insert it to the socket queue owned by mp_port specially.
>> + */
>> +static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
>> +   struct packet_type **pt_prev,
>> +   int *ret,
>> +   struct net_device *orig_dev)
>> +{
>> +struct mp_port *mp_port = NULL;
>> +struct sock *sk = NULL;
>> +
>> +if (!dev_is_mpassthru(skb->dev))
>> +return skb;
>> +mp_port = skb->dev->mp_port;
>> +
>> +if (*pt_prev) {
>> +*ret = deliver_skb(skb, *pt_prev, orig_dev);
>> +*pt_prev = NULL;
>> +}
>> +
>> +sk = mp_port->sock->sk;
>> +skb_queue_tail(&sk->sk_receive_queue, skb);
>> +sk->sk_state_change(sk);
>> +
>> +return NULL;
>> +}
>> +#else
>> +#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
>> +#endif
>> +
>>  /**
>>   *  netif_receive_skb - process receive buffer from network
>>   *  @skb: buffer to process
>> @@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
>>  ncls:
>>  #endif
>>
>> +/* To intercept mediate passthru(zero-copy) packets here */
>> +skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
>> +if (!skb)
>> +goto out;
>>  skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
>>  if (!skb)
>>  goto out;
>
>This does not apply to current net-next-2.6
>
>We now have dev->rx_handler (currently for bridge or macvlan)
>
>
Ok. Thanks, will rebase to fix that.

Thanks
Xiaohui

>commit ab95bfe01f9872459c8678572ccadbf646badad0
>Author: Jiri Pirko 
>Date:   Tue Jun 1 21:52:08 2010 +
>
>net: replace hooks in __netif_receive_skb V5
>
>What this patch does is it removes two receive frame hooks (for bridge and 
> for
>macvlan) from __netif_receive_skb. These are replaced them with a single
>hook for both. It only supports one hook per device because it makes no
>sense to do bridging and macvlan on the same device.
>
>Then a network driver (of virtual netdev like macvlan or bridge) can 
> register
>an rx_handler for needed net device.
>
>Signed-off-by: Jiri Pirko 
>Signed-off-by: Stephen Hemminger 
>Signed-off-by: David S. Miller 
>
>

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.

2010-10-11 Thread Xin, Xiaohui

>-Original Message-
>From: David Miller [mailto:da...@davemloft.net]
>Sent: Friday, October 01, 2010 3:15 PM
>To: Xin, Xiaohui
>Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
>m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; 
>jd...@linux.intel.com
>Subject: Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() 
>specially.
>
>From: xiaohui@intel.com
>Date: Thu, 30 Sep 2010 22:04:23 +0800
>
>> @@ -197,10 +197,11 @@ struct skb_shared_info {
>>  union skb_shared_tx tx_flags;
>>  struct sk_buff  *frag_list;
>>  struct skb_shared_hwtstamps hwtstamps;
>> -skb_frag_t  frags[MAX_SKB_FRAGS];
>>  /* Intermediate layers must ensure that destructor_arg
>>   * remains valid until skb destructor */
>>  void *  destructor_arg;
>> +
>> +skb_frag_t  frags[MAX_SKB_FRAGS];
>>  };
>>
>>  /* The structure is for a skb which pages may point to
>
>Why are you moving frags[] to the end like this?

That's to avoid the new cache miss caused by using destructor_arg in data path
like skb_release_data().
That's based on the comment from Eric Dumazet on v7 patches.

Thanks
Xiaohui
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

53 matches

Mail list logo