Re: [PATCH] KVM: VMX: Translate interrupt shadow when waiting on NMI window

2010-05-03 Thread Gleb Natapov
On Wed, Apr 21, 2010 at 05:14:04PM +0200, Jan Kiszka wrote:
  No you don't. I was told that software should be prepared to handle NMI
  after MOV SS. What part of the SDM does this contradict? I found nothing
  in the latest SDM.
 
 [ updated to March 2010 version ]
 
 To sum up the scenario again, I think it started with
 
 • If the “NMI-window exiting” VM-execution control is 1, a VM exit occurs
   before execution of any instruction if there is no virtual-NMI blocking
   and there is no blocking of events by MOV SS (see Table 21-3). (A logical
   processor may also prevent such a VM exit if there is blocking of events
   by STI.) Such a VM exit occurs immediately after VM entry if the above
   conditions are true (see Section 23.6.6).
 
 
 We included STI into the NMI shadow, but we /may/ get early exits on
 some processors according to the statement above. According to your
 latest info, we can also get that when the MOV SS shadow is on!? But
 simply allowing NMI injection under MOV SS is not possible:
 
 23.3 CHECKING AND LOADING GUEST STATE
 23.3.1.5 Checks on Guest Non-Register State
 
 • Interruptibility state.
   ...
   — Bit 1 (blocking by MOV-SS) must be 0 if the valid bit (bit 31) in the
     VM-entry interruption-information field is 1 and the interruption type
     (bits 10:8) in that field has value 2, indicating non-maskable
     interrupt (NMI).
 
 
 And doing this for STI sounds risky too:
 
   — A processor may require bit 0 (blocking by STI) to be 0 if the valid
     bit (bit 31) in the VM-entry interruption-information field is 1 and
     the interruption type (bits 10:8) in that field has value 2,
     indicating NMI. Other processors may not make this requirement.
 
 
 Should we start stepping over the shadow like we do for svm?
 
Intel's answer is that the text above describes model-specific behaviour
on bare metal, which can block NMI for one instruction after STI/MOV SS;
since software should not rely on model-specific behaviour, we can
safely inject an NMI after STI/MOV SS (clearing the blocked-by-STI/MOV-SS
bits before injecting).
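
A minimal sketch of that approach (illustrative only, not the actual patch;
the helper name is hypothetical), assuming the vmcs_read32()/vmcs_write32()
helpers and the GUEST_INTR_STATE_* constants from arch/x86/kvm/vmx.c:

	/* Drop the STI/MOV SS shadow so the NMI VM-entry checks pass. */
	static void clear_interrupt_shadow_for_nmi(void)
	{
		u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);

		interruptibility &= ~(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS);
		vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, interruptibility);
	}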

--
Gleb.


Re: qemu-kvm.0.12.2 aborts on linux

2010-05-03 Thread Gleb Natapov
On Sun, May 02, 2010 at 08:09:56AM -0700, K D wrote:
 After I added code to raise ulimits for qemu, I don't see any memory-related 
 issues. I'm trying to spawn a VM on an embedded Linux with no window manager 
 etc. With '-curses' it goes into 'VGA Blank Mode' and stops there. Not sure 
 what it is doing. Any clues?
 
Haven't used the '-curses' option for a long time. Have you provided a
bootable disk? Does your guest boot into graphical mode or text mode?

 thanks for help.
 
  
 
 
 
 
 
 
 From: Gleb Natapov g...@redhat.com
 To: K D kdca...@yahoo.com
 Cc: a...@redhat.com; mtosa...@redhat.com; kvm@vger.kernel.org
 Sent: Sat, May 1, 2010 10:23:16 PM
 Subject: Re: qemu-kvm.0.12.2 aborts on linux
 
 On Thu, Apr 29, 2010 at 07:35:02AM -0700, K D wrote:
  Here are my ulimits
  
  $ ulimit -a
  core file size  (blocks, -c) unlimited
  data seg size   (kbytes, -d) unlimited
  scheduling priority (-e) 0
  file size   (blocks, -f) unlimited
  pending signals (-i) 71680
  max locked memory   (kbytes, -l) 32
  max memory size (kbytes, -m) unlimited
  open files  (-n) 1024
  pipe size(512 bytes, -p) 8
  POSIX message queues (bytes, -q) 819200
  real-time priority  (-r) 0
  stack size  (kbytes, -s) 8192
  cpu time   (seconds, -t) unlimited
  max user processes  (-u) 1024
  virtual memory  (kbytes, -v) 204800
 Your virtual memory is limited to 200M. Make it unlimited.
 
  file locks  (-x) unlimited
  $
  
  I'm using monta vista distribution. I went through qemu code and there is 
  no place to raise rlimits. didn't want to touch it.
  
  thanks for looking.
  
  
  
  
  
  
  From: Gleb Natapov g...@redhat.com
  To: K D kdca...@yahoo.com
  Cc: a...@redhat.com; mtosa...@redhat.com; kvm@vger.kernel.org
  Sent: Thu, April 29, 2010 3:17:06 AM
  Subject: Re: qemu-kvm.0.12.2 aborts on linux
  
  On Wed, Apr 28, 2010 at 10:19:37AM -0700, K D wrote:
   Am using yahoo mail and my mails to this mailer gets rejected every time 
   saying message has HTML content etc. Should I use some other mail tool? 
   Below is my issue.
   
   I am trying to get KVM/qemu running on Linux. I compiled 2.6.27.10, 
   enabling the KVM and KVM-for-Intel options at configure time. My box is 
   running with this KVM-enabled kernel. I also built qemu-kvm-0.12.2 
   using the above kernel headers etc. I enabled virtualization in the BIOS. I 
   didn't try to install any guest from CD etc. I made a hard disk image, 
   installed grub on it, copied kernel, initrd onto it. Now when I try to 
   create VM as below, it crashes with the following backtrace. What could be 
   going wrong?
   
   Also, when I try to use -m 256, malloc (or posix_memalign) fails with 
   ENOMEM. So right now -m 128, which is the default, works. Why is that? I 
   have 4G RAM in my setup and my native Linux is using less than 1G. Are 
   there some rlimits for qemu that I need to raise?
   
   Sounds like I'm doing some basic stuff wrong. I'm using bios, vapic, 
   pxe-rtl bin straight from the qemu-kvm dir.
   
   If I don't do '-nographic' it runs into some malloc failure inside 
   some vga routine. I pasted that backtrace too below.
   
  What's your Linux distribution? The first trace below shows that 
  pthread_create() failed, which is strange. What does ulimit -a show?
  
  
   Appreciate your help.
   thanks
   
   qemu-system-x86_64 -hda /dev/shm/vmhd.img -bios ./bios.bin --option-rom 
   ./vapic.bin -curses -nographic -vga none -option-rom ./pxe-rtl8139.bin
   
   #0 0x414875a6 in raise () from /lib/libc.so.6
   (gdb) bt
   #0 0x414875a6 in raise () from /lib/libc.so.6
   #1 0x4148ad18 in abort () from /lib/libc.so.6
    #2 0x080b4cb3 in die2 (err=<value optimized out>,
    what=0x81f2662 "pthread_create") at posix-aio-compat.c:80
    #3 0x080b5682 in thread_create (arg=<value optimized out>,
    start_routine=<value optimized out>, attr=<value optimized out>,
    thread=<value optimized out>) at posix-aio-compat.c:118
    #4 spawn_thread () at posix-aio-compat.c:379
    #5 qemu_paio_submit (aiocb=0x846b550) at posix-aio-compat.c:390
    #6 0x080b57cb in paio_submit (bs=0x843c008, fd=5, sector_num=0,
    qiov=0x84cefb8, nb_sectors=512, cb=0x81cb950 <dma_bdrv_cb>,
    opaque=0x84cef80, type=1) at posix-aio-compat.c:584
    #7 0x080cc7b8 in raw_aio_submit (type=<value optimized out>,
    opaque=<value optimized out>, cb=<value optimized out>,
    nb_sectors=<value optimized out>, qiov=<value optimized out>,
    sector_num=<value optimized out>, bs=<value optimized out>)
    at block/raw-posix.c:562
    #8 raw_aio_readv (bs=0x843c008, sector_num=0, qiov=0x84cefb8,
    nb_sectors=1,
    cb=0x81cb950 <dma_bdrv_cb>, opaque=0x84cef80) at block/raw-posix.c:570
    #9 0x080b0593 in bdrv_aio_readv (bs=0x843c008, sector_num=0,
    qiov=0x84cefb8,
    nb_sectors=1, cb=0x81cb950 <dma_bdrv_cb>, 

Re: Booting/installing WindowsNT

2010-05-03 Thread Andre Przywara

Michael Tokarev wrote:

02.05.2010 14:04, Avi Kivity wrote:

On 05/01/2010 12:40 AM, Michael Tokarev wrote:

01.05.2010 00:59, Michael Tokarev wrote:

Apparently with current kvm stable (0.12.3)
Windows NT 4.0 does not install anymore.

With default -cpu, it boots, displays the
Inspecting your hardware configuration
message and BSODs with STOP: 0x003E
error as shown here:
http://www.corpit.ru/mjt/winnt4_1.gif
With -cpu pentium the situation is a
bit better, it displays:

Microsoft (R) Windows NT (TM) Version 4.0 (Build 1381).
1 System Processor [512 MB Memory] Multiprocessor Kernel



Michael,

can you try -cpu kvm64? This should be somewhat in between -cpu host and 
-cpu qemu64.
Also look in dmesg for uncaught rd/wrmsrs. In case you find something 
there, please try:

# modprobe kvm ignore_msrs=1
(You have to unload the modules first)

Regards,
Andre.

--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712



Re: Booting/installing WindowsNT

2010-05-03 Thread Avi Kivity

On 05/03/2010 11:24 AM, Andre Przywara wrote:


can you try -cpu kvm64? This should be somewhat in between -cpu host 
and -cpu qemu64.
Also look in dmesg for uncaught rd/wrmsrs. In case you find something 
there, please try:

# modprobe kvm ignore_msrs=1
(You have to unload the modules first)


It's unlikely NT 4 will touch msrs, it's used to running on very old 
hardware (pre-msr, even).  I expect it's a problem with the bios or 
vendor/model/blah.


(worth trying anyway)

--
error compiling committee.c: too many arguments to function



Re: Booting/installing WindowsNT

2010-05-03 Thread Andre Przywara

Avi Kivity wrote:

On 05/03/2010 11:24 AM, Andre Przywara wrote:


can you try -cpu kvm64? This should be somewhat in between -cpu host 
and -cpu qemu64.
Also look in dmesg for uncaught rd/wrmsrs. In case you find something 
there, please try:

# modprobe kvm ignore_msrs=1
(You have to unload the modules first)


It's unlikely NT 4 will touch msrs, it's used to running on very old 
hardware (pre-msr, even).  I expect it's a problem with the bios or 
vendor/model/blah.
That is probably right, although it could be that it uses some MSRs if it 
finds the CPUID bit set. But I just wanted to make sure that's not the issue.


I made a diff of the CPUID flags between qemu64 and Michael's host; the 
difference is:

only on qemu64: up hypervisor
up is a Linux pseudo-flag for an SMP kernel running on a UP host, and the 
hypervisor bit was introduced much later than even SP6. I guess that is 
OK, but note that Linux supposedly recognizes a single-VCPU guest with 
-cpu host as SMP. But I guess that is not an issue for NT4, since it does 
not know about multi-core CPUs and CPUID-based topology detection.


only on host: vme ht mmxext fxsr_opt rdtscp 3dnowext 3dnow rep_good 
extd_apicid
cmp_legacy svm extapic cr8_legacy 3dnowprefetch lbrv 


Most of the features are far younger than NT4; only vme sticks out.
This is a pretty old flag, but I am not sure if that matters. My 
documentation reads:
VME: virtual-mode enhancements. CR4.VME, CR4.PVI, software interrupt 
indirection, expansion of the TSS with the software interrupt indirection 
bitmap, EFLAGS.VIF, EFLAGS.VIP.


Let's just rule that out:

Michael, can you try to use -cpu host,-vme and see if that makes a 
difference?


Thanks,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12



Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-05-03 Thread Yoshiaki Tamura
2010/4/23 Avi Kivity a...@redhat.com:
 On 04/23/2010 04:22 PM, Anthony Liguori wrote:

 I currently don't have data, but I'll prepare it.
 There were two things I wanted to avoid.

 1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
 2. Calling write() every time even when we want to send multiple pages at
 once.

 I think 2 may be negligible.
 But 1 seems to be problematic if we want to make the latency as small as
 possible, no?


 Copying often has strange CPU characteristics depending on whether the
 data is already in cache.  It's better to drive these sorts of optimizations
 through performance measurement because changes are not always obvious.

 Copying always introduces more cache pollution, so even if the data is in
 the cache, it is worthwhile (not disagreeing with the need to measure).

Anthony,

I measured how long it takes to send all guest pages during migration, and I
would like to share the information in this message.  For convenience, I
modified the code to do normal migration rather than live migration, which
means the buffered file is not used here.

In summary, the performance improvement from using writev instead of
write/send over GbE seems to be negligible; however, when the underlying
network was fast (InfiniBand with IPoIB in this case), writev performed 17%
faster than write/send, and therefore it may be worthwhile to introduce vectors.

Since QEMU compresses pages, I copied a junk file to tmpfs to dirty pages so
that QEMU would transfer a fair number of pages.  After setting up the guest,
I used cpu_get_real_ticks() to measure the time spent in the while loop
calling ram_save_block() in ram_save_live().  I removed the
qemu_file_rate_limit() call to disable the buffered-file rate limiting, so
all of the pages would be transferred in the first round.

I measured 10 times for each case and took the average and standard deviation.
Considering the results, I think the number of trials was enough.  In addition
to the time duration, the number of writev/write calls and the number of pages
which were compressed (dup)/not compressed (nodup) are shown.

Test Environment:
CPU: 2x Intel Xeon Dual Core 3GHz
Mem size: 6GB
Network: GbE, InfiniBand (IPoIB)

Host OS: Fedora 11 (kernel 2.6.34-rc1)
Guest OS: Fedora 11 (kernel 2.6.33)
Guest Mem size: 512MB

* GbE writev
time (sec): 35.732 (std 0.002)
write count: 4 (std 0)
writev count: 8269 (std 1)
dup count: 36157 (std 124)
nodup count: 1016808 (std 147)

* GbE write
time (sec): 35.780 (std 0.164)
write count: 127367 (21)
writev count: 0 (std 0)
dup count: 36134 (std 108)
nodup count: 1016853 (std 165)

* IPoIB writev
time (sec): 13.889 (std 0.155)
write count: 4 (std 0)
writev count: 8267 (std 1)
dup count: 36147 (std 105)
nodup count: 1016838 (std 111)

* IPoIB write
time (sec): 16.777 (std 0.239)
write count: 127364 (24)
writev count: 0 (std 0)
dup count: 36173 (std 169)
nodup count: 1016840 (std 190)

Although the improvement wasn't obvious when the network was GbE, introducing
writev may be worthwhile when we focus on faster networks like InfiniBand/10GE.

I agree with separating this optimization from the main logic of Kemari, since
this modification must be done widely and carefully at the same time.
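
(For reference, below is a minimal sketch of the pattern being measured:
batching page-sized buffers into a single writev() instead of one write()
per page. The helper name send_pages_vectored is hypothetical, not QEMU code.)

	#include <sys/uio.h>
	#include <unistd.h>

	#define TARGET_PAGE_SIZE 4096
	#define MAX_IOV 64

	/* Batch up to MAX_IOV guest pages into one syscall. */
	static ssize_t send_pages_vectored(int fd, void **pages, int n)
	{
		struct iovec iov[MAX_IOV];
		int i;

		if (n > MAX_IOV)
			n = MAX_IOV;
		for (i = 0; i < n; i++) {
			iov[i].iov_base = pages[i];
			iov[i].iov_len = TARGET_PAGE_SIZE;
		}
		return writev(fd, iov, n);	/* vs. n separate write() calls */
	}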

Thanks,

Yoshi


Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-03 Thread Michael S. Tsirkin
On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
 This patch adds mergeable receive buffer support to vhost_net.
 
 Signed-off-by: David L Stevens dlstev...@us.ibm.com

I've been doing some more testing before sending out a pull
request, and I see a drastic performance degradation in guest to host
traffic when this is applied but mergeable buffers are not in use
by userspace (existing qemu-kvm userspace).

This is both with and without my patch on top.

Without patch:
[...@tuck ~]$ sh runtest 2>&1 | tee 
ser-meregeable-disabled-kernel-only-tun-only.log
Starting netserver at port 12865
set_up_server could not establish a listen endpoint for  port 12865 with family 
AF_UNSPEC
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.0.0.4 (11.0.0.4) 
port 0 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00    9107.26   89.20    33.85    0.802   2.436

With patch:
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.0.0.4 (11.0.0.4) 
port 0 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00      35.00   2.21     0.62     5.181   11.575


For ease of testing, I put this on my tree
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-broken

Please take a look.
Thanks!

-- 
MST


[qemu-kvm tests PATCH] qemu-kvm tests: fix linker script problem

2010-05-03 Thread Naphtali Sprei
This is a fix to a previous patch by me.
It's on the 'next' branch, as of now.

commit 848bd0c89c83814023cf51c72effdbc7de0d18b7 caused the linker script
itself (flat.lds) to become part of the linked objects, which messed up
the output file; one such problem is that the symbol edata is no longer
the last symbol.

Signed-off-by: Naphtali Sprei nsp...@redhat.com
---
 kvm/user/config-x86-common.mak |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kvm/user/config-x86-common.mak b/kvm/user/config-x86-common.mak
index 61cc2f0..ad7aeac 100644
--- a/kvm/user/config-x86-common.mak
+++ b/kvm/user/config-x86-common.mak
@@ -19,7 +19,7 @@ CFLAGS += -m$(bits)
 libgcc := $(shell $(CC) -m$(bits) --print-libgcc-file-name)
 
 FLATLIBS = test/lib/libcflat.a $(libgcc)
-%.flat: %.o $(FLATLIBS) flat.lds
+%.flat: %.o $(FLATLIBS)
$(CC) $(CFLAGS) -nostdlib -o $@ -Wl,-T,flat.lds $^ $(FLATLIBS)
 
 tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
-- 
1.6.3.3



Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-05-03 Thread Anthony Liguori

On 05/03/2010 04:32 AM, Yoshiaki Tamura wrote:

2010/4/23 Avi Kivity a...@redhat.com:

On 04/23/2010 04:22 PM, Anthony Liguori wrote:

I currently don't have data, but I'll prepare it.
There were two things I wanted to avoid.

1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
2. Calling write() every time even when we want to send multiple pages at
once.

I think 2 may be negligible.
But 1 seems to be problematic if we want to make the latency as small as
possible, no?


Copying often has strange CPU characteristics depending on whether the
data is already in cache.  It's better to drive these sorts of optimizations
through performance measurement because changes are not always obvious.

Copying always introduces more cache pollution, so even if the data is in
the cache, it is worthwhile (not disagreeing with the need to measure).
 

Anthony,

I measured how long it takes to send all guest pages during migration, and I
would like to share the information in this message.  For convenience, I
modified the code to do normal migration rather than live migration, which
means the buffered file is not used here.

In summary, the performance improvement from using writev instead of
write/send over GbE seems to be negligible; however, when the underlying
network was fast (InfiniBand with IPoIB in this case), writev performed 17%
faster than write/send, and therefore it may be worthwhile to introduce vectors.

Since QEMU compresses pages, I copied a junk file to tmpfs to dirty pages so
that QEMU would transfer a fair number of pages.  After setting up the guest,
I used cpu_get_real_ticks() to measure the time spent in the while loop
calling ram_save_block() in ram_save_live().  I removed the
qemu_file_rate_limit() call to disable the buffered-file rate limiting, so
all of the pages would be transferred in the first round.

I measured 10 times for each case and took the average and standard deviation.
Considering the results, I think the number of trials was enough.  In addition
to the time duration, the number of writev/write calls and the number of pages
which were compressed (dup)/not compressed (nodup) are shown.

Test Environment:
CPU: 2x Intel Xeon Dual Core 3GHz
Mem size: 6GB
Network: GbE, InfiniBand (IPoIB)

Host OS: Fedora 11 (kernel 2.6.34-rc1)
Guest OS: Fedora 11 (kernel 2.6.33)
Guest Mem size: 512MB

* GbE writev
time (sec): 35.732 (std 0.002)
write count: 4 (std 0)
writev count: 8269 (std 1)
dup count: 36157 (std 124)
nodup count: 1016808 (std 147)

* GbE write
time (sec): 35.780 (std 0.164)
write count: 127367 (21)
writev count: 0 (std 0)
dup count: 36134 (std 108)
nodup count: 1016853 (std 165)

* IPoIB writev
time (sec): 13.889 (std 0.155)
write count: 4 (std 0)
writev count: 8267 (std 1)
dup count: 36147 (std 105)
nodup count: 1016838 (std 111)

* IPoIB write
time (sec): 16.777 (std 0.239)
write count: 127364 (24)
writev count: 0 (std 0)
dup count: 36173 (std 169)
nodup count: 1016840 (std 190)

Although the improvement wasn't obvious when the network was GbE, introducing
writev may be worthwhile when we focus on faster networks like InfiniBand/10GE.

I agree with separating this optimization from the main logic of Kemari, since
this modification must be done widely and carefully at the same time.


Okay.  It looks clear that it's a win, so let's split it out of the main
series and treat it separately.  I imagine we'll see even more positive
results on 10 gbit, particularly if we move migration out into a separate
thread.


Regards,

Anthony Liguori


Thanks,

Yoshi
   




Re: [PATCH 1/3 v2] KVM MMU: make kvm_mmu_zap_page() return the number of zapped sp in total.

2010-05-03 Thread Gui Jianfeng
Marcelo Tosatti wrote:
 On Fri, Apr 23, 2010 at 01:58:22PM +0800, Gui Jianfeng wrote:
 Currently, in kvm_mmu_change_mmu_pages(kvm, page), used_pages-- is
 performed after calling kvm_mmu_zap_page() regardless of whether the page
 is actually reclaimed, but a root sp won't be reclaimed by
 kvm_mmu_zap_page(). So making kvm_mmu_zap_page() return the total number
 of reclaimed sps makes more sense. A new flag is passed to
 kvm_mmu_zap_page() to indicate whether the top page was reclaimed.

 Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
 ---
  arch/x86/kvm/mmu.c |   53 
 +++
  1 files changed, 36 insertions(+), 17 deletions(-)
 
 Gui, 
 
 There will be only a few pinned roots, and there is no need for
 kvm_mmu_change_mmu_pages to be precise at that level (pages will be
 reclaimed through kvm_unmap_hva eventually).

Hi Marcelo

Actually, it doesn't only affect kvm_mmu_change_mmu_pages() but also
kvm_mmu_remove_some_alloc_mmu_pages(), which is called by the mmu shrink
routine. This can cause the upper layer to get a wrong number, so I think
this should be fixed. Here is an updated version.

---
From: Gui Jianfeng guijianf...@cn.fujitsu.com

Currently, in kvm_mmu_change_mmu_pages(kvm, page), used_pages-- is performed
after calling kvm_mmu_zap_page() regardless of whether the page is actually
reclaimed, but a root sp won't be reclaimed by kvm_mmu_zap_page(). So making
kvm_mmu_zap_page() return the total number of reclaimed sps makes more sense.
A new flag is passed to kvm_mmu_zap_page() to indicate whether the top page
was reclaimed. kvm_mmu_remove_some_alloc_mmu_pages() also relies on
kvm_mmu_zap_page() to return the total reclaimed number.

Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   53 +++
 1 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 51eb6d6..e545da8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1194,12 +1194,13 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
	--kvm->stat.mmu_unsync;
 }
 
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+   int *self_deleted);
 
 static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
	if (sp->role.cr4_pae != !!is_pae(vcpu)) {
-		kvm_mmu_zap_page(vcpu->kvm, sp);
+		kvm_mmu_zap_page(vcpu->kvm, sp, NULL);
		return 1;
	}
 
@@ -1207,7 +1208,7 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
	kvm_flush_remote_tlbs(vcpu->kvm);
	kvm_unlink_unsync_page(vcpu->kvm, sp);
	if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
-		kvm_mmu_zap_page(vcpu->kvm, sp);
+		kvm_mmu_zap_page(vcpu->kvm, sp, NULL);
		return 1;
	}
return 1;
}
 
@@ -1478,7 +1479,7 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
struct kvm_mmu_page *sp;
 
for_each_sp(pages, sp, parents, i) {
-   kvm_mmu_zap_page(kvm, sp);
+   kvm_mmu_zap_page(kvm, sp, NULL);
		mmu_pages_clear_parents(&parents);
zapped++;
}
@@ -1488,7 +1489,8 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
return zapped;
 }
 
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+   int *self_deleted)
 {
int ret;
 
@@ -1505,11 +1507,16 @@ static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
	if (!sp->root_count) {
		hlist_del(&sp->hash_link);
		kvm_mmu_free_page(kvm, sp);
+		/* Count self */
+		ret++;
+		if (self_deleted)
+			*self_deleted = 1;
	} else {
		sp->role.invalid = 1;
		list_move(&sp->link, &kvm->arch.active_mmu_pages);
kvm_reload_remote_mmus(kvm);
}
+
kvm_mmu_reset_last_pte_updated(kvm);
return ret;
 }
@@ -1538,8 +1545,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages)
 
	page = container_of(kvm->arch.active_mmu_pages.prev,
			    struct kvm_mmu_page, link);
-	used_pages -= kvm_mmu_zap_page(kvm, page);
-	used_pages--;
+	used_pages -= kvm_mmu_zap_page(kvm, page, NULL);
	}
	kvm_nr_mmu_pages = used_pages;
	kvm->arch.n_free_mmu_pages = 0;
@@ -1558,6 +1564,8 @@ static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
struct kvm_mmu_page *sp;
struct hlist_node 

[PATCH] KVM: Get rid of KVM_REQ_KICK

2010-05-03 Thread Avi Kivity
KVM_REQ_KICK poisons vcpu->requests by having a bit set during normal
operation.  This causes the fast path check for a clear vcpu->requests
to fail all the time, triggering tons of atomic operations.

Fix by replacing KVM_REQ_KICK with a vcpu->guest_mode atomic.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/x86.c   |   17 ++---
 include/linux/kvm_host.h |1 +
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6b2ce1d..307094a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4499,13 +4499,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
	if (vcpu->fpu_active)
		kvm_load_guest_fpu(vcpu);
 
-	local_irq_disable();
+	atomic_set(&vcpu->guest_mode, 1);
+	smp_wmb();
 
-	clear_bit(KVM_REQ_KICK, &vcpu->requests);
-	smp_mb__after_clear_bit();
+	local_irq_disable();
 
-	if (vcpu->requests || need_resched() || signal_pending(current)) {
-		set_bit(KVM_REQ_KICK, &vcpu->requests);
+	if (!atomic_read(&vcpu->guest_mode) || vcpu->requests
+	    || need_resched() || signal_pending(current)) {
+		atomic_set(&vcpu->guest_mode, 0);
+   smp_wmb();
local_irq_enable();
preempt_enable();
r = 1;
@@ -4550,7 +4552,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (hw_breakpoint_active())
hw_breakpoint_restore();
 
-	set_bit(KVM_REQ_KICK, &vcpu->requests);
+	atomic_set(&vcpu->guest_mode, 0);
+   smp_wmb();
local_irq_enable();
 
	++vcpu->stat.exits;
@@ -5470,7 +5473,7 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
 
me = get_cpu();
	if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
-		if (!test_and_set_bit(KVM_REQ_KICK, &vcpu->requests))
+		if (atomic_xchg(&vcpu->guest_mode, 0))
smp_send_reschedule(cpu);
put_cpu();
 }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ce027d5..a020fa2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
int vcpu_id;
struct mutex mutex;
int   cpu;
+   atomic_t guest_mode;
struct kvm_run *run;
unsigned long requests;
unsigned long guest_debug;
-- 
1.7.0.4



[PATCH] KVM: VMX: Avoid writing HOST_CR0 every entry

2010-05-03 Thread Avi Kivity
cr0.ts may change between entries, so we copy cr0 to HOST_CR0 before each
entry.  That is slow, so instead, set HOST_CR0 to have TS set unconditionally
(which is a safe value), and issue a clts() just before exiting vcpu context
if the task indeed owns the fpu.

Saves ~50 cycles/exit.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/vmx.c |9 +++--
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 875b785..9cfdc0e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -753,6 +753,8 @@ static void __vmx_load_host_state(struct vcpu_vmx *vmx)
	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
	}
 #endif
+	if (current_thread_info()->status & TS_USEDFPU)
+   clts();
 }
 
 static void vmx_load_host_state(struct vcpu_vmx *vmx)
@@ -2441,7 +2443,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, !!bypass_guest_pf);
vmcs_write32(CR3_TARGET_COUNT, 0);   /* 22.2.1 */
 
-   vmcs_writel(HOST_CR0, read_cr0());  /* 22.2.3 */
+   vmcs_writel(HOST_CR0, read_cr0() | X86_CR0_TS);  /* 22.2.3 */
vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
vmcs_writel(HOST_CR3, read_cr3());  /* 22.2.3  FIXME: shadow tables */
 
@@ -3794,11 +3796,6 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
vmx_set_interrupt_shadow(vcpu, 0);
 
-   /*
-* Loading guest fpu may have cleared host cr0.ts
-*/
-   vmcs_writel(HOST_CR0, read_cr0());
-
	asm(
		/* Store host registers */
		"push %%"R"dx; push %%"R"bp;"
-- 
1.7.0.4



Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-05-03 Thread Yoshiaki Tamura
2010/5/3 Anthony Liguori aligu...@linux.vnet.ibm.com:
 On 05/03/2010 04:32 AM, Yoshiaki Tamura wrote:

 2010/4/23 Avi Kivity a...@redhat.com:


 On 04/23/2010 04:22 PM, Anthony Liguori wrote:


 I currently don't have data, but I'll prepare it.
 There were two things I wanted to avoid.

 1. Pages to be copied to QEMUFile buf through qemu_put_buffer.
 2. Calling write() every time even when we want to send multiple pages
 at once.

 I think 2 may be negligible.
 But 1 seems to be problematic if we want to make the latency as small
 as possible, no?


 Copying often has strange CPU characteristics depending on whether the
 data is already in cache.  It's better to drive these sorts of
 optimizations through performance measurement because changes are not
 always obvious.


 Copying always introduces more cache pollution, so even if the data is in
 the cache, it is worthwhile (not disagreeing with the need to measure).


 Anthony,

 I measured how long it takes to send all guest pages during migration, and
 I would like to share the information in this message.  For convenience, I
 modified the code to do normal migration rather than live migration, which
 means the buffered file is not used here.

 In summary, the performance improvement from using writev instead of
 write/send over GbE seems to be negligible; however, when the underlying
 network was fast (InfiniBand with IPoIB in this case), writev performed
 17% faster than write/send, and therefore it may be worthwhile to
 introduce vectors.

 Since QEMU compresses pages, I copied a junk file to tmpfs to dirty pages
 so that QEMU would transfer a fair number of pages.  After setting up the
 guest, I used cpu_get_real_ticks() to measure the time spent in the while
 loop calling ram_save_block() in ram_save_live().  I removed the
 qemu_file_rate_limit() call to disable the buffered-file rate limiting,
 so all of the pages would be transferred in the first round.

 I measured 10 times for each case and took the average and standard
 deviation.  Considering the results, I think the number of trials was
 enough.  In addition to the time duration, the number of writev/write
 calls and the number of pages which were compressed (dup)/not compressed
 (nodup) are shown.

 Test Environment:
 CPU: 2x Intel Xeon Dual Core 3GHz
 Mem size: 6GB
 Network: GbE, InfiniBand (IPoIB)

 Host OS: Fedora 11 (kernel 2.6.34-rc1)
 Guest OS: Fedora 11 (kernel 2.6.33)
 Guest Mem size: 512MB

 * GbE writev
 time (sec): 35.732 (std 0.002)
 write count: 4 (std 0)
 writev count: 8269 (std 1)
 dup count: 36157 (std 124)
 nodup count: 1016808 (std 147)

 * GbE write
 time (sec): 35.780 (std 0.164)
 write count: 127367 (21)
 writev count: 0 (std 0)
 dup count: 36134 (std 108)
 nodup count: 1016853 (std 165)

 * IPoIB writev
 time (sec): 13.889 (std 0.155)
 write count: 4 (std 0)
 writev count: 8267 (std 1)
 dup count: 36147 (std 105)
 nodup count: 1016838 (std 111)

 * IPoIB write
 time (sec): 16.777 (std 0.239)
 write count: 127364 (24)
 writev count: 0 (std 0)
 dup count: 36173 (std 169)
 nodup count: 1016840 (std 190)

 Although the improvement wasn't obvious when the network was GbE,
 introducing writev may be worthwhile when we focus on faster networks
 like InfiniBand/10GE.

 I agree with separating this optimization from the main logic of Kemari,
 since this modification must be done widely and carefully at the same time.


 Okay.  It looks clear that it's a win, so let's split it out of the main
 series and treat it separately.  I imagine we'll see even more positive
 results on 10 gbit, particularly if we move migration out into a separate
 thread.

Great!
I also wanted to test with 10GE but I'm physically away from my office
now, and can't set up the test environment.  I'll measure the numbers
w/ 10GE next week.

BTW, I was thinking of writing a patch to use separate threads for both the
sender and receiver sides of migration.  Kemari especially needs a separate
receiver thread, so that the monitor can accept commands from other HA
tools.  Is someone already working on this?  If not, I would add it to
my task list :-)

Thanks,

Yoshi


 Regards,

 Anthony Liguori

 Thanks,

 Yoshi


Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-03 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 03:34:11 AM:

 On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
  This patch adds mergeable receive buffer support to vhost_net.
  
  Signed-off-by: David L Stevens dlstev...@us.ibm.com
 
 I've been doing some more testing before sending out a pull
 request, and I see a drastic performance degradation in guest to host
 traffic when this is applied but mergeable buffers are not in use
 by userspace (existing qemu-kvm userspace).

Actually, I wouldn't expect it to work at all; the qemu-kvm
patch (particularly the feature bit setting bug fix) is required.
Without it, I think the existing code will tell the guest to use
mergeable buffers while turning it off in vhost. That was completely
non-functional for me -- what version of qemu-kvm are you using?
What I did to test w/o mergeable buffers is turn off the
bit in VHOST_FEATURES. I'll recheck these, but qemu-kvm definitely
must be updated; the original doesn't correctly handle feature bits.
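
(For reference, a sketch of that kind of test: masking the mergeable-buffers
bit out of an advertised feature set. VIRTIO_NET_F_MRG_RXBUF is the standard
virtio-net bit; the exact mask and helper below are illustrative, not the
patch's code.)

	#include <stdint.h>

	#define VIRTIO_NET_F_MRG_RXBUF 15	/* from linux/virtio_net.h */

	/* Drop mergeable RX buffers from the features offered to the guest. */
	static uint64_t mask_mrg_rxbuf(uint64_t features)
	{
		return features & ~(1ULL << VIRTIO_NET_F_MRG_RXBUF);
	}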

+-DLS



[qemu-kvm tests PATCH] qemu-kvm tests: merged stringio into emulator

2010-05-03 Thread Naphtali Sprei
Based on the 'next' branch.

Changed the stringio test case into C code and merged it into the emulator
test case. Removed traces of the stringio test case.

Signed-off-by: Naphtali Sprei nsp...@redhat.com
---
 kvm/user/config-x86-common.mak |2 --
 kvm/user/config-x86_64.mak |4 ++--
 kvm/user/test/x86/README   |1 -
 kvm/user/test/x86/emulator.c   |   31 ---
 kvm/user/test/x86/stringio.S   |   36 
 5 files changed, 30 insertions(+), 44 deletions(-)
 delete mode 100644 kvm/user/test/x86/stringio.S

diff --git a/kvm/user/config-x86-common.mak b/kvm/user/config-x86-common.mak
index 61cc2f0..0ff9425 100644
--- a/kvm/user/config-x86-common.mak
+++ b/kvm/user/config-x86-common.mak
@@ -56,8 +56,6 @@ $(TEST_DIR)/realmode.flat: $(TEST_DIR)/realmode.o
 
 $(TEST_DIR)/realmode.o: bits = 32
 
-$(TEST_DIR)/stringio.flat: $(cstart.o) $(TEST_DIR)/stringio.o
-
 $(TEST_DIR)/msr.flat: $(cstart.o) $(TEST_DIR)/msr.o
 
 arch_clean:
diff --git a/kvm/user/config-x86_64.mak b/kvm/user/config-x86_64.mak
index 03c91f2..3403dc3 100644
--- a/kvm/user/config-x86_64.mak
+++ b/kvm/user/config-x86_64.mak
@@ -5,7 +5,7 @@ ldarch = elf64-x86-64
 CFLAGS += -D__x86_64__
 
 tests = $(TEST_DIR)/access.flat $(TEST_DIR)/sieve.flat \
-$(TEST_DIR)/stringio.flat $(TEST_DIR)/emulator.flat \
-$(TEST_DIR)/hypercall.flat $(TEST_DIR)/apic.flat
+ $(TEST_DIR)/emulator.flat $(TEST_DIR)/hypercall.flat \
+ $(TEST_DIR)/apic.flat
 
 include config-x86-common.mak
diff --git a/kvm/user/test/x86/README b/kvm/user/test/x86/README
index e2ede5c..ab5a2ae 100644
--- a/kvm/user/test/x86/README
+++ b/kvm/user/test/x86/README
@@ -10,6 +10,5 @@ realmode: goes back to realmode, shld, push/pop, mov 
immediate, cmp immediate, a
  io, eflags instructions (clc, cli, etc.), jcc short, jcc near, call, 
long jmp, xchg
 sieve: heavy memory access with no paging and with paging static and with 
paging vmalloc'ed
 smptest: run smp_id() on every cpu and compares return value to number
-stringio: outs forward and backward
 tsc: write to tsc(0) and write to tsc(1000) and read it back
 vmexit: long loops for each: cpuid, vmcall, mov_from_cr8, mov_to_cr8, 
inl_pmtimer, ipi, ipi+halt
diff --git a/kvm/user/test/x86/emulator.c b/kvm/user/test/x86/emulator.c
index 4967d1f..5406062 100644
--- a/kvm/user/test/x86/emulator.c
+++ b/kvm/user/test/x86/emulator.c
@@ -3,6 +3,7 @@
 #include "libcflat.h"
 
 #define memset __builtin_memset
+#define TESTDEV_IO_PORT 0xe0
 
 int fails, tests;
 
@@ -17,6 +18,29 @@ void report(const char *name, int result)
}
 }
 
+static char str[] = "abcdefghijklmnop";
+
+void test_stringio()
+{
+	unsigned char r = 0;
+	asm volatile("cld \n\t"
+		     "movw %0, %%dx \n\t"
+		     "rep outsb \n\t"
+		     : : "i"((short)TESTDEV_IO_PORT),
+		       "S"(str), "c"(sizeof(str) - 1));
+	asm volatile("inb %1, %0\n\t" : "=a"(r) : "i"((short)TESTDEV_IO_PORT));
+	report("outsb up", r == str[sizeof(str) - 2]); /* last char */
+
+	asm volatile("std \n\t"
+		     "movw %0, %%dx \n\t"
+		     "rep outsb \n\t"
+		     : : "i"((short)TESTDEV_IO_PORT),
+		       "S"(str + sizeof(str) - 2), "c"(sizeof(str) - 1));
+	asm volatile("cld \n\t" : : );
+	asm volatile("in %1, %0\n\t" : "=a"(r) : "i"((short)TESTDEV_IO_PORT));
+	report("outsb down", r == str[0]);
+}
+
 void test_cmps_one(unsigned char *m1, unsigned char *m3)
 {
void *rsi, *rdi;
@@ -93,8 +117,8 @@ void test_cmps(void *mem)
m1[i] = m2[i] = m3[i] = i;
	for (int i = 100; i < 200; ++i)
m1[i] = (m3[i] = m2[i] = i) + 1;
-test_cmps_one(m1, m3);
-test_cmps_one(m1, m2);
+   test_cmps_one(m1, m3);
+   test_cmps_one(m1, m2);
 }
 
 void test_cr8(void)
@@ -271,7 +295,8 @@ int main()
 
test_smsw();
test_lmsw();
-test_ljmp(mem);
+   test_ljmp(mem);
+   test_stringio();
 
	printf("\nSUMMARY: %d tests, %d failures\n", tests, fails);
return fails ? 1 : 0;
diff --git a/kvm/user/test/x86/stringio.S b/kvm/user/test/x86/stringio.S
deleted file mode 100644
index 461621c..000
--- a/kvm/user/test/x86/stringio.S
+++ /dev/null
@@ -1,36 +0,0 @@
-
-.data
-
-.macro str name, value
-
-\name :	.long 1f-2f
-2:	.ascii "\value"
-1: 
-.endm  
-
-TESTDEV_PORT = 0xf1
-
-	str forward, "forward"
-	str backward, "backward"
-   
-.text
-
-.global main
-main:
-   cld
-   movl forward, %ecx
-   lea  4+forward, %rsi
-   movw $TESTDEV_PORT, %dx
-   rep outsb
-
-   std
-   movl backward, %ecx
-   lea 4+backward-1(%rcx), %rsi
-   movw $TESTDEV_PORT, %dx
-   rep outsb
-   
-   mov $0, %rsi
-   call exit
-
-
-
-- 
1.6.3.3


[PATCH v2 0/7] Pvclock fixes , version two

2010-05-03 Thread Glauber Costa
Here is a new version of this series.

I am definitely leaving any warp calculations out, as Jeremy wisely
points out that Chuck Norris should perish before we warp.

Also, in this series, I am using KVM_GET_SUPPORTED_CPUID to export
our features to userspace, as Avi suggests (patch 4/7), and starting
to use the flag that indicates possible tsc stability (patch 7/7), although
it is still off by default.

Glauber Costa (7):
  Enable pvclock flags in vcpu_time_info structure
  Add a global synchronization point for pvclock
  change msr numbers for kvmclock
  export paravirtual cpuid flags in KVM_GET_SUPPORTED_CPUID
  Try using new kvm clock msrs
  don't compute pvclock adjustments if we trust the tsc
  Tell the guest we'll warn it about tsc stability

 arch/x86/include/asm/kvm_para.h|   13 
 arch/x86/include/asm/pvclock-abi.h |4 ++-
 arch/x86/include/asm/pvclock.h |1 +
 arch/x86/kernel/kvmclock.c |   56 ++-
 arch/x86/kernel/pvclock.c  |   37 +++
 arch/x86/kvm/x86.c |   37 +++-
 6 files changed, 125 insertions(+), 23 deletions(-)



[PATCH v2 4/7] export paravirtual cpuid flags in KVM_GET_SUPPORTED_CPUID

2010-05-03 Thread Glauber Costa
Right now, we are using individual KVM_CAP entities to tell
userspace which cpuid bits we support. This is suboptimal, since it
creates a delay between a feature arriving in the host and
becoming available to the guest.

A much better mechanism is to list para features in KVM_GET_SUPPORTED_CPUID.
This makes userspace automatically aware of what we provide. And if we
ever add a new cpuid bit in the future, we would have to do that again,
which creates some complexity and delay in feature adoption.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 arch/x86/include/asm/kvm_para.h |4 
 arch/x86/kvm/x86.c  |   27 +++
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 9734808..f019f8c 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -16,6 +16,10 @@
 #define KVM_FEATURE_CLOCKSOURCE	0
 #define KVM_FEATURE_NOP_IO_DELAY	1
 #define KVM_FEATURE_MMU_OP		2
+/* This indicates that the new set of kvmclock msrs
+ * are available. The use of 0x11 and 0x12 is deprecated.
+ */
+#define KVM_FEATURE_CLOCKSOURCE2	3
 
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eb84947..8a7cdda 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1971,6 +1971,20 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
		}
		break;
	}
+	case 0x40000000: {
+		char signature[] = "KVMKVMKVM";
+		u32 *sigptr = (u32 *)signature;
+		entry->eax = 1;
+		entry->ebx = sigptr[0];
+		entry->ecx = sigptr[1];
+		entry->edx = sigptr[2];
+		break;
+	}
+	case 0x40000001:
+		entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
+			     (1 << KVM_FEATURE_NOP_IO_DELAY) |
+			     (1 << KVM_FEATURE_CLOCKSOURCE2);
+		break;
	case 0x80000000:
		entry->eax = min(entry->eax, 0x8000001a);
		break;
@@ -2017,6 +2031,19 @@ static int kvm_dev_ioctl_get_supported_cpuid(struct kvm_cpuid2 *cpuid,
	for (func = 0x80000001; func <= limit && nent < cpuid->nent; ++func)
		do_cpuid_ent(&cpuid_entries[nent], func, 0,
			     &nent, cpuid->nent);
+
+
+	r = -E2BIG;
+	if (nent >= cpuid->nent)
+		goto out_free;
+
+	do_cpuid_ent(&cpuid_entries[nent], 0x40000000, 0, &nent, cpuid->nent);
+	limit = cpuid_entries[nent - 1].eax;
+	for (func = 0x40000001; func <= limit && nent < cpuid->nent; ++func)
+		do_cpuid_ent(&cpuid_entries[nent], func, 0,
+			     &nent, cpuid->nent);
+
	r = -E2BIG;
	if (nent >= cpuid->nent)
		goto out_free;
-- 
1.6.2.2



[PATCH v2 7/7] Tell the guest we'll warn it about tsc stability

2010-05-03 Thread Glauber Costa
This patch puts up the flag that tells the guest that we'll warn it
about the tsc being trustworthy or not. For now, we also say
it is not.
---
 arch/x86/kvm/x86.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8a7cdda..63a2acc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -855,6 +855,8 @@ static void kvm_write_guest_time(struct kvm_vcpu *v)
	vcpu->hv_clock.system_time = ts.tv_nsec +
				     (NSEC_PER_SEC * (u64)ts.tv_sec) +
				     v->kvm->arch.kvmclock_offset;
 
+	vcpu->hv_clock.flags = 0;
+
/*
 * The interface expects us to write an even number signaling that the
 * update is finished. Since the guest won't see the intermediate
@@ -1983,7 +1985,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
	case 0x40000001:
		entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
			     (1 << KVM_FEATURE_NOP_IO_DELAY) |
-			     (1 << KVM_FEATURE_CLOCKSOURCE2);
+			     (1 << KVM_FEATURE_CLOCKSOURCE2) |
+			     (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_TSC);
		break;
	case 0x80000000:
		entry->eax = min(entry->eax, 0x8000001a);
-- 
1.6.2.2



[PATCH v2 6/7] don't compute pvclock adjustments if we trust the tsc

2010-05-03 Thread Glauber Costa
If the HV tells us we can fully trust the TSC, skip any
correction.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 arch/x86/include/asm/kvm_para.h|5 +
 arch/x86/include/asm/pvclock-abi.h |1 +
 arch/x86/kernel/kvmclock.c |3 +++
 arch/x86/kernel/pvclock.c  |4 
 4 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index f019f8c..6f1b878 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,11 @@
  */
 #define KVM_FEATURE_CLOCKSOURCE2	3
 
+/* The last 8 bits are used to indicate how to interpret the flags field
+ * in pvclock structure. If no bits are set, all flags are ignored.
+ */
+#define KVM_FEATURE_CLOCKSOURCE_STABLE_TSC 24
+
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
index ec5c41a..b123bd7 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -39,5 +39,6 @@ struct pvclock_wall_clock {
u32   nsec;
 } __attribute__((__packed__));
 
+#define PVCLOCK_STABLE_TSC (1 << 0)
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PVCLOCK_ABI_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 59c740f..518de01 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -215,4 +215,7 @@ void __init kvmclock_init(void)
	clocksource_register(&kvm_clock);
	pv_info.paravirt_enabled = 1;
	pv_info.name = "KVM";
+
+   if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_TSC))
+   pvclock_set_flags(PVCLOCK_STABLE_TSC);
 }
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 3dff5fa..0d4fe0c 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -135,6 +135,10 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
		barrier();
	} while (version != src->version);
 
+	if ((valid_flags & PVCLOCK_STABLE_TSC) &&
+	    (shadow.flags & PVCLOCK_STABLE_TSC))
+		return ret;
+
/*
 * Assumption here is that last_value, a global accumulator, always goes
 * forward. If we are less than that, we should not be much smaller.
-- 
1.6.2.2



[PATCH v2 5/7] Try using new kvm clock msrs

2010-05-03 Thread Glauber Costa
We have now added a new set of clock-related msrs to replace the old
ones. In theory, we could just try to use them and get a return value
indicating they do not exist, due to our use of kvm_write_msr_save.

However, kvm clock registration happens very early, and if we ever
try to write to a non-existent MSR, we raise a lethal #GP, since our
idt handlers are not in place yet.

So this patch tests for a cpuid feature exported by the host to
decide which set of msrs is supported.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 arch/x86/kernel/kvmclock.c |   53 ++-
 1 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index feaeb0d..59c740f 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -29,6 +29,8 @@
 #define KVM_SCALE 22
 
 static int kvmclock = 1;
+static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
 
 static int parse_no_kvmclock(char *arg)
 {
@@ -54,7 +56,8 @@ static unsigned long kvm_get_wallclock(void)
 
	low = (int)__pa_symbol(&wall_clock);
	high = ((u64)__pa_symbol(&wall_clock) >> 32);
-	native_write_msr(MSR_KVM_WALL_CLOCK, low, high);
+
+	native_write_msr(msr_kvm_wall_clock, low, high);
 
	vcpu_time = &get_cpu_var(hv_clock);
	pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
@@ -130,7 +133,8 @@ static int kvm_register_clock(char *txt)
	high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
	       cpu, high, low, txt);
-	return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
+
+	return native_write_msr_safe(msr_kvm_system_time, low, high);
 }
 
 #ifdef CONFIG_X86_LOCAL_APIC
@@ -165,14 +169,14 @@ static void __init kvm_smp_prepare_boot_cpu(void)
 #ifdef CONFIG_KEXEC
 static void kvm_crash_shutdown(struct pt_regs *regs)
 {
-   native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+   native_write_msr(msr_kvm_system_time, 0, 0);
native_machine_crash_shutdown(regs);
 }
 #endif
 
 static void kvm_shutdown(void)
 {
-   native_write_msr_safe(MSR_KVM_SYSTEM_TIME, 0, 0);
+   native_write_msr(msr_kvm_system_time, 0, 0);
native_machine_shutdown();
 }
 
@@ -181,27 +185,34 @@ void __init kvmclock_init(void)
if (!kvm_para_available())
return;
 
-	if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
-		if (kvm_register_clock("boot clock"))
-			return;
-		pv_time_ops.sched_clock = kvm_clock_read;
-		x86_platform.calibrate_tsc = kvm_get_tsc_khz;
-		x86_platform.get_wallclock = kvm_get_wallclock;
-		x86_platform.set_wallclock = kvm_set_wallclock;
+	if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
+		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
+		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
+	} else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
+		return;
+
+	printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+	       msr_kvm_system_time, msr_kvm_wall_clock);
+
+	if (kvm_register_clock("boot clock"))
+		return;
+	pv_time_ops.sched_clock = kvm_clock_read;
+	x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+	x86_platform.get_wallclock = kvm_get_wallclock;
+	x86_platform.set_wallclock = kvm_set_wallclock;
 #ifdef CONFIG_X86_LOCAL_APIC
-   x86_cpuinit.setup_percpu_clockev =
-   kvm_setup_secondary_clock;
+   x86_cpuinit.setup_percpu_clockev =
+   kvm_setup_secondary_clock;
 #endif
 #ifdef CONFIG_SMP
-   smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+   smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
 #endif
-   machine_ops.shutdown  = kvm_shutdown;
+   machine_ops.shutdown  = kvm_shutdown;
 #ifdef CONFIG_KEXEC
-   machine_ops.crash_shutdown  = kvm_crash_shutdown;
+   machine_ops.crash_shutdown  = kvm_crash_shutdown;
 #endif
-   kvm_get_preset_lpj();
-	clocksource_register(&kvm_clock);
-	pv_info.paravirt_enabled = 1;
-	pv_info.name = "KVM";
-	}
+	kvm_get_preset_lpj();
+	clocksource_register(&kvm_clock);
+	pv_info.paravirt_enabled = 1;
+	pv_info.name = "KVM";
 }
-- 
1.6.2.2



[PATCH v2 3/7] change msr numbers for kvmclock

2010-05-03 Thread Glauber Costa
Avi pointed out a while ago that those MSRs fall into the Pentium
PMU range. So the idea here is to add new ones and, after a while,
deprecate the old ones.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 arch/x86/include/asm/kvm_para.h |4 
 arch/x86/kvm/x86.c  |7 ++-
 2 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index ffae142..9734808 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -20,6 +20,10 @@
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
+/* Custom MSRs fall in the range 0x4b564d00-0x4b564dff */
+#define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
+#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
+
 #define KVM_MAX_MMU_OP_BATCH   32
 
 /* Operations for KVM_HC_MMU_OP */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6b2ce1d..eb84947 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -664,9 +664,10 @@ static inline u32 bit(int bitno)
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	5
+#define KVM_SAVE_MSRS_BEGIN	7
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
+   MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
HV_X64_MSR_APIC_ASSIST_PAGE,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1192,10 +1193,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
	case MSR_IA32_MISC_ENABLE:
		vcpu->arch.ia32_misc_enable_msr = data;
		break;
+	case MSR_KVM_WALL_CLOCK_NEW:
	case MSR_KVM_WALL_CLOCK:
		vcpu->kvm->arch.wall_clock = data;
		kvm_write_wall_clock(vcpu->kvm, data);
		break;
+	case MSR_KVM_SYSTEM_TIME_NEW:
	case MSR_KVM_SYSTEM_TIME: {
		if (vcpu->arch.time_page) {
			kvm_release_page_dirty(vcpu->arch.time_page);
@@ -1467,9 +1470,11 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
		data = vcpu->arch.efer;
		break;
	case MSR_KVM_WALL_CLOCK:
+	case MSR_KVM_WALL_CLOCK_NEW:
		data = vcpu->kvm->arch.wall_clock;
		break;
	case MSR_KVM_SYSTEM_TIME:
+	case MSR_KVM_SYSTEM_TIME_NEW:
		data = vcpu->arch.time;
break;
case MSR_IA32_P5_MC_ADDR:
-- 
1.6.2.2



[PATCH v2 2/7] Add a global synchronization point for pvclock

2010-05-03 Thread Glauber Costa
In recent stress tests, it was found that pvclock-based systems
could seriously warp in smp systems. Using Ingo's time-warp-test.c,
I could trigger a scenario as bad as 1.5 million warps a minute in some
systems (to be fair, it wasn't that bad in most of them). Investigating
further, I found out that such warps were caused by the very offset-based
calculation pvclock is based on.

This happens even on some machines that report constant_tsc in their tsc
flags, especially on multi-socket ones.

Two reads of the same kernel timestamp at approximately the same time will
likely have tsc timestamps from different occasions too. This means the
delta we calculate is unpredictable at best, and can probably be smaller
on a cpu that is legitimately reading the clock at a later occasion.

Some adjustments on the host could make this window less likely to happen,
but still, it pretty much poses as an intrinsic problem of the mechanism.

A while ago, I thought about using a shared variable anyway, to hold the
clock's last state, but gave up due to the high contention locking was
likely to introduce, possibly rendering the thing useless on big machines.
I argue, however, that locking is not necessary.

We do a read-and-return sequence in pvclock, and between read and return,
the global value can have changed. However, it can only have changed
by means of an addition of a positive value. So if we detect that our
clock timestamp is less than the current global, we know that we need to
return a higher one, even though it is not exactly the one we compared to.

OTOH, if we detect we're greater than the current time source, we atomically
replace the value with our new reading. This does cause contention on big
boxes (but big here means *BIG*), but it seems like a good trade-off, since
it provides us with a time source guaranteed to be stable wrt time warps.

After this patch is applied, I don't see a single warp in time during 5 days
of execution, in any of the machines I saw them before.
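
For reference, below is a minimal standalone sketch of the same lock-free
clamp, written against C11 atomics rather than the kernel's atomic64_*
helpers; read_raw_clock() is a hypothetical stand-in for the raw pvclock
read sequence, not a real API:

#include <stdatomic.h>
#include <stdint.h>

extern uint64_t read_raw_clock(void);	/* hypothetical raw read */

static _Atomic uint64_t last_value;

uint64_t monotonic_read(void)
{
	uint64_t ret = read_raw_clock();
	uint64_t last = atomic_load(&last_value);

	/* If a later timestamp was already published, return it instead;
	 * otherwise try to publish ours. A failed CAS reloads 'last', so
	 * we either retry or accept the newer global value. */
	while (ret > last) {
		if (atomic_compare_exchange_weak(&last_value, &last, ret))
			return ret;
	}
	return last;
}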

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Jeremy Fitzhardinge jer...@goop.org
CC: Avi Kivity a...@redhat.com
CC: Marcelo Tosatti mtosa...@redhat.com
CC: Zachary Amsden zams...@redhat.com
---
 arch/x86/kernel/pvclock.c |   24 
 1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index aa2262b..3dff5fa 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -118,11 +118,14 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
return pv_tsc_khz;
 }
 
+static atomic64_t last_value = ATOMIC64_INIT(0);
+
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
struct pvclock_shadow_time shadow;
unsigned version;
cycle_t ret, offset;
+   u64 last;
 
do {
version = pvclock_get_time_values(shadow, src);
@@ -132,6 +135,27 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
		barrier();
	} while (version != src->version);
 
+	/*
+	 * Assumption here is that last_value, a global accumulator, always goes
+	 * forward. If we are less than that, we should not be much smaller.
+	 * We assume there is an error margin we're inside, and then the
+	 * correction does not sacrifice accuracy.
+	 *
+	 * For reads: global may have changed between test and return,
+	 * but this means someone else poked the clock at a later time.
+	 * We just need to make sure we are not seeing a backwards event.
+	 *
+	 * For updates: last_value = ret is not enough, since two vcpus could be
+	 * updating at the same time, and one of them could be slightly behind,
+	 * making the assumption that last_value always goes forward fail to hold.
+	 */
+	last = atomic64_read(&last_value);
+	do {
+		if (ret < last)
+			return last;
+		last = atomic64_cmpxchg(&last_value, last, ret);
+	} while (unlikely(last != ret));
+
return ret;
 }
 
-- 
1.6.2.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/7] Enable pvclock flags in vcpu_time_info structure

2010-05-03 Thread Glauber Costa
This patch removes one padding byte and transforms it into a flags
field. New versions of guests using pvclock will query these flags
upon each read.

Flags, however, will only be interpreted when the guest decides to.
It uses the pvclock_set_flags function to signal that a specific
set of flags should be taken into consideration. Which flags are valid
is usually settled via HV negotiation.
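
As an illustration only (the feature and flag names below are placeholders,
not defined by this patch), a guest that has negotiated a stable-TSC
indication with the hypervisor might enable it like this:

	/* hypothetical guest init path; names are illustrative */
	if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2))
		pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);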

Signed-off-by: Glauber Costa glom...@redhat.com
CC: Jeremy Fitzhardinge jer...@goop.org
---
 arch/x86/include/asm/pvclock-abi.h |3 ++-
 arch/x86/include/asm/pvclock.h |1 +
 arch/x86/kernel/pvclock.c  |9 +
 3 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/pvclock-abi.h 
b/arch/x86/include/asm/pvclock-abi.h
index 6d93508..ec5c41a 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -29,7 +29,8 @@ struct pvclock_vcpu_time_info {
u64   system_time;
u32   tsc_to_system_mul;
s8tsc_shift;
-   u8pad[3];
+   u8flags;
+   u8pad[2];
 } __attribute__((__packed__)); /* 32 bytes */
 
 struct pvclock_wall_clock {
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 53235fd..cd02f32 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+void pvclock_set_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
struct pvclock_vcpu_time_info *vcpu,
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 03801f2..aa2262b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -31,8 +31,16 @@ struct pvclock_shadow_time {
u32 tsc_to_nsec_mul;
int tsc_shift;
u32 version;
+   u8  flags;
 };
 
+static u8 valid_flags = 0;
+
+void pvclock_set_flags(u8 flags)
+{
+   valid_flags = flags;
+}
+
 /*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
@@ -91,6 +99,7 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
		dst->system_timestamp  = src->system_time;
		dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
		dst->tsc_shift         = src->tsc_shift;
+		dst->flags             = src->flags;
		rmb();		/* test version after fetching data */
	} while ((src->version & 1) || (dst->version != src->version));
 
-- 
1.6.2.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-03 Thread Michael S. Tsirkin
On Mon, May 03, 2010 at 08:39:08AM -0700, David Stevens wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 03:34:11 AM:
 
  On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
   This patch adds mergeable receive buffer support to vhost_net.
   
   Signed-off-by: David L Stevens dlstev...@us.ibm.com
  
  I've been doing some more testing before sending out a pull
  request, and I see a drastic performance degradation in guest to host
  traffic when this is applied but mergeable buffers are not in use
  by userspace (existing qemu-kvm userspace).
 
 Actually, I wouldn't expect it to work at all;
 the qemu-kvm
 patch (particularly the feature bit setting bug fix) is required.

Which bugfix is that?

 Without it, I think the existing code will tell the guest to use
 mergeable buffers while turning it off in vhost.

It should not do that, specifically
vhost_net_get_features does:
features &= ~(1 << VIRTIO_NET_F_MRG_RXBUF);
unconditionally. This was added with:
5751995a20e77cd9d61d00f7390401895fa172a6

I forced mergeable buffers off with -global virtio-net-pci.mrg_rxbuf=off
and it did not seem to help either.

 That was completely
 non-functional for me -- what version of qemu-kvm are you using?

992cc816c42f2e93db033919a9ddbfcd1da4 or later should work well
AFAIK.

 What I did to test w/o mergeable buffers is turn off the
 bit in VHOST_FEATURES.

It should be enough to force mergeable buffers to off by qemu command
line: -global virtio-net-pci.mrg_rxbuf=off 

 I'll recheck these, but qemu-kvm definitely
 must be updated; the original doesn't correctly handle feature bits.
 
 +-DLS

Hmm, I don't see the bug.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-05-03 Thread Anthony Liguori

On 05/03/2010 10:36 AM, Yoshiaki Tamura wrote:


Great!
I also wanted to test with 10GE but I'm physically away from my office
now, and can't set up the test environment.  I'll measure the numbers
w/ 10GE next week.

BTW, I was thinking to write a patch to separate threads for both
sender and receiver of migration.  Kemari especially needs a separate
thread receiver, so that monitor can accepts commands from other HA
tools.  Is someone already working on this?  If not, I would add it to
my task list :-)
   


So far, no one (to my knowledge at least) is working on this.

Regards,

Anthony Liguori


Thanks,

Yoshi

   

Regards,

Anthony Liguori

 

Thanks,

Yoshi
   


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv7] add mergeable buffers support to vhost_net

2010-05-03 Thread David Stevens
Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 08:56:14 AM:

 On Mon, May 03, 2010 at 08:39:08AM -0700, David Stevens wrote:
  Michael S. Tsirkin m...@redhat.com wrote on 05/03/2010 03:34:11 AM:
  
   On Wed, Apr 28, 2010 at 01:57:12PM -0700, David L Stevens wrote:
This patch adds mergeable receive buffer support to vhost_net.

Signed-off-by: David L Stevens dlstev...@us.ibm.com
   
   I've been doing some more testing before sending out a pull
   request, and I see a drastic performance degradation in guest to 
host
   traffic when this is applied but mergeable buffers are not in use
   by userspace (existing qemu-kvm userspace).
  
  Actually, I wouldn't expect it to work at all;
  the qemu-kvm
  patch (particularly the feature bit setting bug fix) is required.
 
 Which bugfix is that?

Actually, I see you put that upstream already--
commit dc14a397812b91dd0d48b03d1b8f66a251542369  in
Avi's tree is the one I was talking about.
I'll look further.

+-DLS

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 16/22] KVM: MMU: Track page fault data in struct vcpu

2010-05-03 Thread Joerg Roedel
On Tue, Apr 27, 2010 at 03:58:36PM +0300, Avi Kivity wrote:
 So we probably need to upgrade gva_t to a u64.  Please send this as
 a separate patch, and test on i386 hosts.

Are there _any_ regular tests of KVM on i386 hosts? For me this is
terribly broken (also after I fixed the issue which gave me a
VMEXIT_INVALID at the first vmrun).

Joerg


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: VIA Nano support

2010-05-03 Thread Rusty Burchfield
On Sun, May 2, 2010 at 1:21 AM, Yuhong Bao yuhongbao_...@hotmail.com wrote:
 What about details?

This old thread seems to have more details.
http://www.mail-archive.com/kvm@vger.kernel.org/msg11704.html

Hopefully they are of more use to you than me ;-)

Thanks,
Rusty
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Booting/installing WindowsNT

2010-05-03 Thread Michael Tokarev

03.05.2010 12:24, Andre Przywara wrote:

Michael Tokarev wrote:

02.05.2010 14:04, Avi Kivity wrote:

On 05/01/2010 12:40 AM, Michael Tokarev wrote:

01.05.2010 00:59, Michael Tokarev wrote:

Apparently with current kvm stable (0.12.3)
Windows NT 4.0 does not install anymore.

With default -cpu, it boots, displays the
Inspecting your hardware configuration
message and BSODs with STOP: 0x003E
error as shown here:
http://www.corpit.ru/mjt/winnt4_1.gif
With -cpu pentium the situation is a
bit better, it displays:

Microsoft (R) Windows NT (TM) Version 4.0 (Build 1381).
1 System Processor [512 MB Memory] Multiprocessor Kernel



Michael,

can you try -cpu kvm64? This should be somewhat in between -cpu host and
-cpu qemu64.


It stops during boot with the same STOP 0x003E message.


Also look in dmesg for uncaught rd/wrmsrs. In case you find something
there, please try:


Nothing in host dmesg, nothing at all.

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Booting/installing WindowsNT

2010-05-03 Thread Michael Tokarev

03.05.2010 13:28, Andre Przywara wrote:

Avi Kivity wrote:

On 05/03/2010 11:24 AM, Andre Przywara wrote:


can you try -cpu kvm64? This should be somewhat in between -cpu host
and -cpu qemu64.
Also look in dmesg for uncaught rd/wrmsrs. In case you find
something there, please try:
# modprobe kvm ignore_msrs=1
(You have to unload the modules first)


It's unlikely NT 4 will touch msrs, it's used to running on very old
hardware (pre-msr, even). I expect it's a problem with the bios or
vendor/model/blah.

That is probably right, although it could be that it uses some MSRs if it
finds the CPUID bit set. But I just wanted to make sure that's not the issue.

I made a diff of the CPUID flags between qemu64 and Michael's host, the
difference is:
only on qemu64: up hypervisor
up is a Linux pseudo flag for an SMP kernel running on a UP host; the
hypervisor bit was introduced much later than even SP6. I guess
that is OK, but note that Linux supposedly recognizes the single VCPU
guest with -cpu host as SMP. But I guess that is not an issue for NT4,
since it does not know about multi-core CPUs and CPUID based topology
detection.

only on host: vme ht mmxext fxsr_opt rdtscp 3dnowext 3dnow rep_good
extd_apicid
cmp_legacy svm extapic cr8_legacy 3dnowprefetch lbrv
Most of the features are far younger than NT4, only vme sticks out.
This is pretty old flag, but I am not sure if that matters. My
documentation reads:
VME: virtual-mode enhancements. CR4.VME, CR4.PVI, software interrupt
indirection, expansion of the TSS with the software indirection bitmap,
EFLAGS.VIF, EFLAGS.VIP.

Let's just rule that out:

Michael, can you try to use -cpu host,-vme and see if that makes a
difference?


With -cpu host,-vme winNT boots just fine as with just -cpu host.

I also tried with -cpu qemu64 and kvm64, with +vme and -vme (4
combinations in total) - in all cases winNT crashes with the
same 0x003E error.  So it appears that vme makes no
difference.

While trying some other combinations, I noticed that with
-cpu pentium, while the install CD stops after loading
kernel (no BSoD as with other cases, it just stalls),
it stops... differently than in other cases.  It sits
there doing exactly nothing, all host CPUs are idle,
there's no progress of any sort.  In all other cases
the guest cpu is working at full speed all the time.

Just.. random observation, yet another one... ;)

Thanks!

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [LKML] Re: [PATCH V3] drivers/uio/uio_pci_generic.c: allow access for non-privileged processes

2010-05-03 Thread Konrad Rzeszutek Wilk
On Thu, Apr 29, 2010 at 12:29:40PM -0700, Tom Lyon wrote:
 Michael, et al - sorry for the delay, but I've been digesting the comments 
 and researching new approaches.
 
 I think the plan for V4 will be to take things entirely out of the UIO 
 framework, and instead have a driver which supports user mode use of 
 well-behaved PCI devices. I would like to use read and write to support 
 access to memory regions, IO regions,  or PCI config space. Config space is a 
 bitch because not everything is safe to read or write, but I've come up with 
 a table driven approach which can be run-time extended for non-compliant 
 devices (under root control) which could then enable non-privileged users. 
 For instance, OHCI 1394 devices use a dword in config space which is not 
 formatted as a PCI capability; root can use sysfs to enable access:
  echo offset readbits writebits > /sys/dev/pci/devices/0000:xx:xx.x/yyy/config_permit
 
 
 A well-behaved PCI device must have memory BARs >= 4K for mmapping, must 
 have separate memory space for MSI-X that does not need mmapping
 by the user driver, must support PCI 2.3 interrupt masking, and must not 
 go totally crazy with PCI config space (tg3 is real ugly, e1000 is fine).

How about page aligned BARs?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fix -mem-path with hugetlbfs

2010-05-03 Thread Marcelo Tosatti

Avi, please apply to both master and uq/master.

---

Fall back to qemu_vmalloc in case file_ram_alloc fails.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/exec.c b/exec.c
index 467a0e7..7125a93 100644
--- a/exec.c
+++ b/exec.c
@@ -2821,8 +2821,12 @@ ram_addr_t qemu_ram_alloc(ram_addr_t size)
     if (mem_path) {
 #if defined (__linux__) && !defined(TARGET_S390X)
         new_block->host = file_ram_alloc(size, mem_path);
-        if (!new_block->host)
-            exit(1);
+        if (!new_block->host) {
+            new_block->host = qemu_vmalloc(size);
+#ifdef MADV_MERGEABLE
+            madvise(new_block->host, size, MADV_MERGEABLE);
+#endif
+        }
 #else
         fprintf(stderr, "-mem-path option unsupported\n");
         exit(1);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] first round of vhost-net enhancements for net-next

2010-05-03 Thread Michael S. Tsirkin
David,
The following tree includes a couple of enhancements that help vhost-net.
Please pull them for net-next. Another set of patches is under
debugging/testing and I hope to get them ready in time for 2.6.35,
so there may be another pull request later.

Thanks!

The following changes since commit 7ef527377b88ff05fb122a47619ea506c631c914:

  Merge branch 'master' of 
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 (2010-05-02 22:02:06 
-0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost

Michael S. Tsirkin (2):
  tun: add ioctl to modify vnet header size
  macvtap: add ioctl to modify vnet header size

 drivers/net/macvtap.c  |   31 +++
 drivers/net/tun.c  |   32 
 include/linux/if_tun.h |2 ++
 3 files changed, 57 insertions(+), 8 deletions(-)

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] x86: eliminate TS_XSAVE

2010-05-03 Thread H. Peter Anvin
On 05/02/2010 10:44 AM, Avi Kivity wrote:
 On 05/02/2010 08:38 PM, Brian Gerst wrote:
 On Sun, May 2, 2010 at 10:53 AM, Avi Kivitya...@redhat.com  wrote:

 The fpu code currently uses current-thread_info-status  TS_XSAVE as
 a way to distinguish between XSAVE capable processors and older processors.
 The decision is not really task specific; instead we use the task status to
 avoid a global memory reference - the value should be the same across all
 threads.

 Eliminate this tie-in into the task structure by using an alternative
 instruction keyed off the XSAVE cpu feature; this results in shorter and
 faster code, without introducing a global memory reference.
  
 I think you should either just use cpu_has_xsave, or extend this use
 of alternatives to all cpu features.  It doesn't make sense to only do
 it for xsave.

 
 I was trying to avoid a performance regression relative to the current 
 code, as it appears that some care was taken to avoid the memory reference.
 
 I agree that it's probably negligible compared to the save/restore 
 code.  If the x86 maintainers agree as well, I'll replace it with 
 cpu_has_xsave.
 

I asked Suresh to comment on this, since he wrote the original code.  He
did confirm that the intent was to avoid a global memory reference.

-hpa

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IOzone test: Introduce postprocessing module v2

2010-05-03 Thread Lucas Meneghel Rodrigues
This module contains code to postprocess IOzone data
in a convenient way so we can generate performance graphs
and condensed data. The graph generation part depends
on gnuplot, but if the utility is not present,
functionality will gracefully degrade.
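
The condensed data boils down to geometric means of the throughput samples
(see the module docstring below); purely for reference, here is the same
reduction sketched in C, computed in the log domain so the product of many
large samples cannot overflow:

#include <math.h>

double geometric_mean(const double *v, int n)
{
	double logsum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		logsum += log(v[i]);
	return exp(logsum / n);
}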

Use the postprocessing module introduced on the previous
patch, use it to analyze results and write performance
graphs and performance tables.

Also, in order for other tests to be able to use the
postprocessing code, added the right __init__.py
files, so a simple

from autotest_lib.client.tests.iozone import postprocessing

will work

Note: Martin, as patch will ignore and not create the
zero-sized files (high time we move to git), if the changes
look good to you I can commit them all at once, making sure
all files are created.

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/tests/iozone/common.py |8 +
 client/tests/iozone/iozone.py |   25 ++-
 client/tests/iozone/postprocessing.py |  487 +
 3 files changed, 515 insertions(+), 5 deletions(-)
 create mode 100644 client/tests/__init__.py
 create mode 100644 client/tests/iozone/__init__.py
 create mode 100644 client/tests/iozone/common.py
 create mode 100755 client/tests/iozone/postprocessing.py

diff --git a/client/tests/__init__.py b/client/tests/__init__.py
new file mode 100644
index 000..e69de29
diff --git a/client/tests/iozone/__init__.py b/client/tests/iozone/__init__.py
new file mode 100644
index 000..e69de29
diff --git a/client/tests/iozone/common.py b/client/tests/iozone/common.py
new file mode 100644
index 000..ce78b85
--- /dev/null
+++ b/client/tests/iozone/common.py
@@ -0,0 +1,8 @@
+import os, sys
+dirname = os.path.dirname(sys.modules[__name__].__file__)
+client_dir = os.path.abspath(os.path.join(dirname, .., ..))
+sys.path.insert(0, client_dir)
+import setup_modules
+sys.path.pop(0)
+setup_modules.setup(base_path=client_dir,
+root_module_name=autotest_lib.client)
diff --git a/client/tests/iozone/iozone.py b/client/tests/iozone/iozone.py
index fa3fba4..03c2c04 100755
--- a/client/tests/iozone/iozone.py
+++ b/client/tests/iozone/iozone.py
@@ -1,5 +1,6 @@
 import os, re
 from autotest_lib.client.bin import test, utils
+import postprocessing
 
 
 class iozone(test.test):
@@ -63,17 +64,19 @@ class iozone(test.test):
 self.results = utils.system_output('%s %s' % (cmd, args))
        self.auto_mode = ("-a" in args)
 
-path = os.path.join(self.resultsdir, 'raw_output_%s' % self.iteration)
-raw_output_file = open(path, 'w')
-raw_output_file.write(self.results)
-raw_output_file.close()
+self.results_path = os.path.join(self.resultsdir,
+ 'raw_output_%s' % self.iteration)
+self.analysisdir = os.path.join(self.resultsdir,
+'analysis_%s' % self.iteration)
+
+utils.open_write_close(self.results_path, self.results)
 
 
 def __get_section_name(self, desc):
 return desc.strip().replace(' ', '_')
 
 
-def postprocess_iteration(self):
+def generate_keyval(self):
 keylist = {}
 
 if self.auto_mode:
@@ -150,3 +153,15 @@ class iozone(test.test):
 keylist[key_name] = result
 
 self.write_perf_keyval(keylist)
+
+
+def postprocess_iteration(self):
+self.generate_keyval()
+if self.auto_mode:
+a = postprocessing.IOzoneAnalyzer(list_files=[self.results_path],
+  output_dir=self.analysisdir)
+a.analyze()
+p = postprocessing.IOzonePlotter(results_file=self.results_path,
+ output_dir=self.analysisdir)
+p.plot_all()
+
diff --git a/client/tests/iozone/postprocessing.py 
b/client/tests/iozone/postprocessing.py
new file mode 100755
index 000..c995aea
--- /dev/null
+++ b/client/tests/iozone/postprocessing.py
@@ -0,0 +1,487 @@
+#!/usr/bin/python
+"""
+Postprocessing module for IOzone. It is capable of picking results from an
+IOzone run, calculating the geometric mean for all throughput results for
+a given file size or record size, and then generating a series of 2D and 3D
+graphs. The graph generation functionality depends on gnuplot, and if it
+is not present, functionality degrades gracefully.
+
+@copyright: Red Hat 2010
+"""
+import os, sys, optparse, logging, math, time
+import common
+from autotest_lib.client.common_lib import logging_config, logging_manager
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils, os_dep
+
+
+_LABELS = ['file_size', 'record_size', 'write', 'rewrite', 'read', 'reread',
+   'randread', 'randwrite', 'bkwdread', 'recordrewrite', 'strideread',
+   'fwrite', 'frewrite', 'fread', 'freread']
+
+
+def unique(list):
+    """
+    Return a list of the elements in list, but without duplicates.

Qemu-KVM 0.12.3 and Multipath - Assertion

2010-05-03 Thread Peter Lieven

Hi Qemu/KVM Devel Team,

i'm using qemu-kvm 0.12.3 with latest Kernel 2.6.33.3.
As backend we use open-iSCSI with dm-multipath.

Multipath is configured to queue i/o if no path is available.

If we create a failure on all paths, qemu starts to consume 100%
CPU due to i/o waits which is ok so far.

One odd thing: the monitor interface is not responding any more ...

What is a real blocker is that KVM crashes with:
kvm: /usr/src/qemu-kvm-0.12.3/hw/ide/internal.h:507: bmdma_active_if: 
Assertion `bmdma->unit != (uint8_t)-1' failed.


after the multipath has reestablished at least one path.

Any ideas? I remember this was working with earlier kernel/kvm/qemu 
versions.


Thanks,
Peter


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM test: Add new subtest iozone_windows

2010-05-03 Thread Lucas Meneghel Rodrigues
Following the new IOzone postprocessing changes, add a new
KVM subtest iozone_windows, which takes advantage of the
fact that there's a windows build for the test, so we can
ship it on winutils.iso and run it, this way providing
the ability to track IO performance for windows guests as well.
The new test imports the postprocessing library directly
from iozone, so it can postprocess the results right after
the benchmark is finished on the windows guest.

I'll update winutils.iso on the download page soon.

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/tests/kvm/tests/iozone_windows.py |   40 ++
 client/tests/kvm/tests_base.cfg.sample   |7 -
 2 files changed, 46 insertions(+), 1 deletions(-)
 create mode 100644 client/tests/kvm/tests/iozone_windows.py

diff --git a/client/tests/kvm/tests/iozone_windows.py 
b/client/tests/kvm/tests/iozone_windows.py
new file mode 100644
index 000..86ec2c4
--- /dev/null
+++ b/client/tests/kvm/tests/iozone_windows.py
@@ -0,0 +1,40 @@
+import logging, time, os
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils
+from autotest_lib.client.tests.iozone import postprocessing
+import kvm_subprocess, kvm_test_utils, kvm_utils
+
+
+def run_iozone_windows(test, params, env):
+    """
+    Run IOzone for windows on a windows guest:
+    1) Log into a guest
+    2) Execute the IOzone test contained in the winutils.iso
+    3) Get results
+    4) Postprocess it with the IOzone postprocessing module
+
+    @param test: kvm test object
+    @param params: Dictionary with the test parameters
+    @param env: Dictionary with test environment.
+    """
+    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+    session = kvm_test_utils.wait_for_login(vm)
+    results_path = os.path.join(test.resultsdir,
+                                'raw_output_%s' % test.iteration)
+    analysisdir = os.path.join(test.resultsdir, 'analysis_%s' % test.iteration)
+
+    # Run IOzone and record its results
+    c = params.get("iozone_cmd")
+    t = int(params.get("iozone_timeout"))
+    logging.info("Running IOzone command on guest, timeout %ss", t)
+    results = session.get_command_output(command=c, timeout=t)
+    utils.open_write_close(results_path, results)
+
+    # Postprocess the results using the IOzone postprocessing module
+    logging.info("Iteration succeeded, postprocessing")
+    a = postprocessing.IOzoneAnalyzer(list_files=[results_path],
+                                      output_dir=analysisdir)
+    a.analyze()
+    p = postprocessing.IOzonePlotter(results_file=results_path,
+                                     output_dir=analysisdir)
+    p.plot_all()
diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index f1104ed..e21f3c8 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -214,6 +214,11 @@ variants:
 dst_rsc_dir = C:\
 autoit_entry = C:\autoit\stub\stub.au3
 
+- iozone_windows: unattended_install
+type = iozone_windows
+iozone_cmd = D:\IOzone\iozone.exe -a
+iozone_timeout = 3600
+
 - guest_s4: install setup unattended_install
 type = guest_s4
 check_s4_support_cmd = grep -q disk /sys/power/state
@@ -417,7 +422,7 @@ variants:
 variants:
 # Linux section
 - @Linux:
-no autoit
+no autoit iozone_windows
 shutdown_command = shutdown -h now
 reboot_command = shutdown -r now
 status_test_command = echo $?
-- 
1.7.0.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] first round of vhost-net enhancements for net-next

2010-05-03 Thread David Miller
From: Michael S. Tsirkin m...@redhat.com
Date: Tue, 4 May 2010 00:32:45 +0300

 The following tree includes a couple of enhancements that help vhost-net.
 Please pull them for net-next. Another set of patches is under
 debugging/testing and I hope to get them ready in time for 2.6.35,
 so there may be another pull request later.

Pulled, thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] first round of vhost-net enhancements for net-next

2010-05-03 Thread David Miller
From: David Miller da...@davemloft.net
Date: Mon, 03 May 2010 15:07:29 -0700 (PDT)

 From: Michael S. Tsirkin m...@redhat.com
 Date: Tue, 4 May 2010 00:32:45 +0300
 
 The following tree includes a couple of enhancements that help vhost-net.
 Please pull them for net-next. Another set of patches is under
 debugging/testing and I hope to get them ready in time for 2.6.35,
 so there may be another pull request later.
 
 Pulled, thanks.

Nevermind, reverted.

Do you even compile test what you send to people?

drivers/net/macvtap.c: In function ‘macvtap_ioctl’:
drivers/net/macvtap.c:713: warning: control reaches end of non-void function

You're really batting 1000 today Michael...


Re: [GIT PULL] first round of vhost-net enhancements for net-next

2010-05-03 Thread Michael S. Tsirkin
On Mon, May 03, 2010 at 03:08:29PM -0700, David Miller wrote:
 From: David Miller da...@davemloft.net
 Date: Mon, 03 May 2010 15:07:29 -0700 (PDT)
 
  From: Michael S. Tsirkin m...@redhat.com
  Date: Tue, 4 May 2010 00:32:45 +0300
  
  The following tree includes a couple of enhancements that help vhost-net.
  Please pull them for net-next. Another set of patches is under
  debugging/testing and I hope to get them ready in time for 2.6.35,
  so there may be another pull request later.
  
  Pulled, thanks.
 
 Nevermind, reverted.
 
 Do you even compile test what you send to people?
 
 drivers/net/macvtap.c: In function ‘macvtap_ioctl’:
 drivers/net/macvtap.c:713: warning: control reaches end of non-void function
 
 You're really batting 1000 today Michael...

Ouch. Should teach me not to send out stuff after midnight. Sorry about
the noise.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 2/6] remove support for !KVM_CAP_SET_TSS_ADDR

2010-05-03 Thread Marcelo Tosatti
Kernels < 2.6.24 hardcoded slot 0 for the location of VMX rmode TSS
pages. Remove support for it.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: qemu-kvm-memslot/qemu-kvm.c
===
--- qemu-kvm-memslot.orig/qemu-kvm.c
+++ qemu-kvm-memslot/qemu-kvm.c
@@ -157,24 +157,7 @@ static void init_slots(void)
 
 static int get_free_slot(kvm_context_t kvm)
 {
-int i;
-int tss_ext;
-
-#if defined(KVM_CAP_SET_TSS_ADDR) && !defined(__s390__)
-tss_ext = kvm_ioctl(kvm_state, KVM_CHECK_EXTENSION, KVM_CAP_SET_TSS_ADDR);
-#else
-tss_ext = 0;
-#endif
-
-/*
- * on older kernels where the set tss ioctl is not supported we must save
- * slot 0 to hold the extended memory, as the vmx will use the last 3
- * pages of this slot.
- */
-if (tss_ext > 0)
-i = 0;
-else
-i = 1;
+int i = 0;
 
 for (; i < KVM_MAX_NUM_MEM_REGIONS; ++i)
 if (!slots[i].len)
Index: qemu-kvm-memslot/qemu-kvm-x86.c
===
--- qemu-kvm-memslot.orig/qemu-kvm-x86.c
+++ qemu-kvm-memslot/qemu-kvm-x86.c
@@ -35,7 +35,6 @@ static int lm_capable_kernel;
 
 int kvm_set_tss_addr(kvm_context_t kvm, unsigned long addr)
 {
-#ifdef KVM_CAP_SET_TSS_ADDR
int r;
 /*
  * Tell fw_cfg to notify the BIOS to reserve the range.
@@ -45,22 +44,16 @@ int kvm_set_tss_addr(kvm_context_t kvm, 
 exit(1);
 }
 
-   r = kvm_ioctl(kvm_state, KVM_CHECK_EXTENSION, KVM_CAP_SET_TSS_ADDR);
-   if (r > 0) {
-   r = kvm_vm_ioctl(kvm_state, KVM_SET_TSS_ADDR, addr);
-   if (r < 0) {
-   fprintf(stderr, "kvm_set_tss_addr: %m\n");
-   return r;
-   }
-   return 0;
+   r = kvm_vm_ioctl(kvm_state, KVM_SET_TSS_ADDR, addr);
+   if (r < 0) {
+   fprintf(stderr, "kvm_set_tss_addr: %m\n");
+   return r;
}
-#endif
-   return -ENOSYS;
+   return 0;
 }
 
 static int kvm_init_tss(kvm_context_t kvm)
 {
-#ifdef KVM_CAP_SET_TSS_ADDR
int r;
 
r = kvm_ioctl(kvm_state, KVM_CHECK_EXTENSION, KVM_CAP_SET_TSS_ADDR);
@@ -74,9 +67,9 @@ static int kvm_init_tss(kvm_context_t kvm)
			fprintf(stderr, "kvm_init_tss: unable to set tss addr\n");
return r;
}
-
+   } else {
+   fprintf(stderr, "kvm does not support KVM_CAP_SET_TSS_ADDR\n");
}
-#endif
return 0;
 }
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 4/6] remove unused kvm_dirty_bitmap array

2010-05-03 Thread Marcelo Tosatti
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: qemu-kvm-memslot/qemu-kvm.c
===
--- qemu-kvm-memslot.orig/qemu-kvm.c
+++ qemu-kvm-memslot/qemu-kvm.c
@@ -2154,7 +2154,6 @@ void kvm_set_phys_mem(target_phys_addr_t
  * dirty pages logging
  */
 /* FIXME: use unsigned long pointer instead of unsigned char */
-unsigned char *kvm_dirty_bitmap = NULL;
 int kvm_physical_memory_set_dirty_tracking(int enable)
 {
 int r = 0;
@@ -2163,17 +2162,9 @@ int kvm_physical_memory_set_dirty_tracki
 return 0;
 
 if (enable) {
-if (!kvm_dirty_bitmap) {
-unsigned bitmap_size = BITMAP_SIZE(phys_ram_size);
-kvm_dirty_bitmap = qemu_malloc(bitmap_size);
-r = kvm_dirty_pages_log_enable_all(kvm_context);
-}
+r = kvm_dirty_pages_log_enable_all(kvm_context);
 } else {
-if (kvm_dirty_bitmap) {
-r = kvm_dirty_pages_log_reset(kvm_context);
-qemu_free(kvm_dirty_bitmap);
-kvm_dirty_bitmap = NULL;
-}
+r = kvm_dirty_pages_log_reset(kvm_context);
 }
 return r;
 }


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 5/6] introduce qemu_ram_map

2010-05-03 Thread Marcelo Tosatti
Which allows drivers to register an mmap region into ram block mappings.
To be used by device assignment driver.
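
As a sketch of the intended usage (illustrative only: resource_fd, bar_size
and guest_bar_addr are hypothetical inputs, e.g. taken from the assigned
device's sysfs resource file):

#include <sys/mman.h>

static void map_assigned_bar(int resource_fd, ram_addr_t bar_size,
                             target_phys_addr_t guest_bar_addr)
{
    void *bar = mmap(NULL, bar_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, resource_fd, 0);
    if (bar != MAP_FAILED) {
        ram_addr_t offset = qemu_ram_map(bar_size, bar);
        cpu_register_physical_memory(guest_bar_addr, bar_size, offset);
    }
}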

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: qemu-kvm/cpu-common.h
===
--- qemu-kvm.orig/cpu-common.h
+++ qemu-kvm/cpu-common.h
@@ -40,6 +40,7 @@ static inline void cpu_register_physical
 }
 
 ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
+ram_addr_t qemu_ram_map(ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(ram_addr_t);
 void qemu_ram_free(ram_addr_t addr);
 /* This should only be used for ram local to a device.  */
Index: qemu-kvm/exec.c
===
--- qemu-kvm.orig/exec.c
+++ qemu-kvm/exec.c
@@ -2805,6 +2805,34 @@ static void *file_ram_alloc(ram_addr_t m
 }
 #endif
 
+ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
+{
+    RAMBlock *new_block;
+
+    size = TARGET_PAGE_ALIGN(size);
+    new_block = qemu_malloc(sizeof(*new_block));
+
+    new_block->host = host;
+
+    new_block->offset = last_ram_offset;
+    new_block->length = size;
+
+    new_block->next = ram_blocks;
+    ram_blocks = new_block;
+
+    phys_ram_dirty = qemu_realloc(phys_ram_dirty,
+        (last_ram_offset + size) >> TARGET_PAGE_BITS);
+    memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
+           0xff, size >> TARGET_PAGE_BITS);
+
+    last_ram_offset += size;
+
+    if (kvm_enabled())
+        kvm_setup_guest_memory(new_block->host, size);
+
+    return new_block->offset;
+}
+
 ram_addr_t qemu_ram_alloc(ram_addr_t size)
 {
 RAMBlock *new_block;


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/6] remove alias support

2010-05-03 Thread Marcelo Tosatti
Aliases were added to workaround kvm's inability to destroy
memory regions. This was fixed in 2.6.29, and advertised via
KVM_CAP_DESTROY_MEMORY_REGION_WORKS.

Also, alias support will be removed from the kernel in July 2010.
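
For reference, userspace probes for such a capability with the usual
KVM_CHECK_EXTENSION pattern (a minimal sketch; kvm_fd is an open /dev/kvm
descriptor):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns nonzero if the kernel destroys memory slots correctly. */
static int destroy_region_works(int kvm_fd)
{
    int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION,
                  KVM_CAP_DESTROY_MEMORY_REGION_WORKS);
    return r > 0;
}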

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: qemu-kvm/qemu-kvm.c
===
--- qemu-kvm.orig/qemu-kvm.c
+++ qemu-kvm/qemu-kvm.c
@@ -1076,20 +1076,6 @@ int kvm_deassign_pci_device(kvm_context_
 }
 #endif
 
-int kvm_destroy_memory_region_works(kvm_context_t kvm)
-{
-int ret = 0;
-
-#ifdef KVM_CAP_DESTROY_MEMORY_REGION_WORKS
-ret =
-kvm_ioctl(kvm_state, KVM_CHECK_EXTENSION,
-  KVM_CAP_DESTROY_MEMORY_REGION_WORKS);
-if (ret <= 0)
-ret = 0;
-#endif
-return ret;
-}
-
 int kvm_reinject_control(kvm_context_t kvm, int pit_reinject)
 {
 #ifdef KVM_CAP_REINJECT_CONTROL
@@ -2055,11 +2041,6 @@ int kvm_main_loop(void)
 return 0;
 }
 
-#ifdef TARGET_I386
-static int destroy_region_works = 0;
-#endif
-
-
 #if !defined(TARGET_I386)
 int kvm_arch_init_irq_routing(void)
 {
@@ -2071,6 +2052,10 @@ extern int no_hpet;
 
 static int kvm_create_context(void)
 {
+static const char upgrade_note[] =
+"Please upgrade to at least kernel 2.6.29 or recent kvm-kmod\n"
+"(see http://sourceforge.net/projects/kvm).\n";
+
 int r;
 
 if (!kvm_irqchip) {
@@ -2094,9 +2079,16 @@ static int kvm_create_context(void)
 return -1;
 }
 }
-#ifdef TARGET_I386
-destroy_region_works = kvm_destroy_memory_region_works(kvm_context);
-#endif
+
+/* There was a nasty bug in < kvm-80 that prevents memory slots from being
+ * destroyed properly.  Since we rely on this capability, refuse to work
+ * with any kernel without this capability. */
+if (!kvm_check_extension(kvm_state, KVM_CAP_DESTROY_MEMORY_REGION_WORKS)) {
+fprintf(stderr,
+"KVM kernel module broken (DESTROY_MEMORY_REGION).\n%s",
+upgrade_note);
+return -EINVAL;
+}
 
 r = kvm_arch_init_irq_routing();
 if (r  0) {
@@ -2134,73 +2126,11 @@ static int kvm_create_context(void)
 return 0;
 }
 
-#ifdef TARGET_I386
-static int must_use_aliases_source(target_phys_addr_t addr)
-{
-if (destroy_region_works)
-return false;
-if (addr == 0xa0000 || addr == 0xa8000)
-return true;
-return false;
-}
-
-static int must_use_aliases_target(target_phys_addr_t addr)
-{
-if (destroy_region_works)
-return false;
-if (addr >= 0xe0000000 && addr < 0x100000000ull)
-return true;
-return false;
-}
-
-static struct mapping {
-target_phys_addr_t phys;
-ram_addr_t ram;
-ram_addr_t len;
-} mappings[50];
-static int nr_mappings;
-
-static struct mapping *find_ram_mapping(ram_addr_t ram_addr)
-{
-struct mapping *p;
-
-for (p = mappings; p < mappings + nr_mappings; ++p) {
-if (p->ram <= ram_addr && ram_addr < p->ram + p->len) {
-return p;
-}
-}
-return NULL;
-}
-
-static struct mapping *find_mapping(target_phys_addr_t start_addr)
-{
-struct mapping *p;
-
-for (p = mappings; p < mappings + nr_mappings; ++p) {
-if (p->phys <= start_addr && start_addr < p->phys + p->len) {
-return p;
-}
-}
-return NULL;
-}
-
-static void drop_mapping(target_phys_addr_t start_addr)
-{
-struct mapping *p = find_mapping(start_addr);
-
-if (p)
-*p = mappings[--nr_mappings];
-}
-#endif
-
 void kvm_set_phys_mem(target_phys_addr_t start_addr, ram_addr_t size,
   ram_addr_t phys_offset)
 {
 int r = 0;
 unsigned long area_flags;
-#ifdef TARGET_I386
-struct mapping *p;
-#endif
 
 if (start_addr + size > phys_ram_size) {
 phys_ram_size = start_addr + size;
@@ -2210,19 +2140,13 @@ void kvm_set_phys_mem(target_phys_addr_t
 area_flags = phys_offset & ~TARGET_PAGE_MASK;
 
 if (area_flags != IO_MEM_RAM) {
-#ifdef TARGET_I386
-if (must_use_aliases_source(start_addr)) {
-kvm_destroy_memory_alias(kvm_context, start_addr);
-return;
-}
-if (must_use_aliases_target(start_addr))
-return;
-#endif
 while (size > 0) {
-p = find_mapping(start_addr);
-if (p) {
-kvm_unregister_memory_area(kvm_context, p->phys, p->len);
-drop_mapping(p->phys);
+int slot;
+
+slot = get_slot(start_addr);
+if (slot != -1) {
+kvm_unregister_memory_area(kvm_context, slots[slot].phys_addr,
+   slots[slot].len);
 }
 start_addr += TARGET_PAGE_SIZE;
 if (size > TARGET_PAGE_SIZE) {
@@ -2241,30 +2165,12 @@ void kvm_set_phys_mem(target_phys_addr_t
 if (area_flags >= TLB_MMIO)
 return;
 
-#ifdef TARGET_I386
-if (must_use_aliases_source(start_addr)) {
-p = find_ram_mapping(phys_offset);
- 

[patch 3/6] remove unused kvm_get_dirty_pages

2010-05-03 Thread Marcelo Tosatti
And make kvm_unregister_memory_area static.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: qemu-kvm-memslot/qemu-kvm.c
===
--- qemu-kvm-memslot.orig/qemu-kvm.c
+++ qemu-kvm-memslot/qemu-kvm.c
@@ -639,8 +639,8 @@ void kvm_destroy_phys_mem(kvm_context_t 
 free_slot(memory.slot);
 }
 
-void kvm_unregister_memory_area(kvm_context_t kvm, uint64_t phys_addr,
-unsigned long size)
+static void kvm_unregister_memory_area(kvm_context_t kvm, uint64_t phys_addr,
+   unsigned long size)
 {
 
 int slot = get_container_slot(phys_addr, size);
@@ -667,14 +667,6 @@ static int kvm_get_map(kvm_context_t kvm
 return 0;
 }
 
-int kvm_get_dirty_pages(kvm_context_t kvm, unsigned long phys_addr, void *buf)
-{
-int slot;
-
-slot = get_slot(phys_addr);
-return kvm_get_map(kvm, KVM_GET_DIRTY_LOG, slot, buf);
-}
-
 int kvm_get_dirty_pages_range(kvm_context_t kvm, unsigned long phys_addr,
   unsigned long len, void *opaque,
   int (*cb)(unsigned long start,
Index: qemu-kvm-memslot/qemu-kvm.h
===
--- qemu-kvm-memslot.orig/qemu-kvm.h
+++ qemu-kvm-memslot/qemu-kvm.h
@@ -381,14 +381,11 @@ void *kvm_create_phys_mem(kvm_context_t,
   unsigned long len, int log, int writable);
 void kvm_destroy_phys_mem(kvm_context_t, unsigned long phys_start,
   unsigned long len);
-void kvm_unregister_memory_area(kvm_context_t, uint64_t phys_start,
-unsigned long len);
 
 int kvm_is_containing_region(kvm_context_t kvm, unsigned long phys_start,
  unsigned long size);
 int kvm_register_phys_mem(kvm_context_t kvm, unsigned long phys_start,
   void *userspace_addr, unsigned long len, int log);
-int kvm_get_dirty_pages(kvm_context_t, unsigned long phys_addr, void *buf);
 int kvm_get_dirty_pages_range(kvm_context_t kvm, unsigned long phys_addr,
   unsigned long end_addr, void *opaque,
   int (*cb)(unsigned long start,


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 6/6] use upstream memslot management code

2010-05-03 Thread Marcelo Tosatti
Drop qemu-kvm's implementation in favour of qemu's, they are
functionally equivalent.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: qemu-kvm/qemu-kvm.c
===
--- qemu-kvm.orig/qemu-kvm.c
+++ qemu-kvm/qemu-kvm.c
@@ -76,21 +76,11 @@ static int io_thread_sigfd = -1;
 
 static CPUState *kvm_debug_cpu_requested;
 
-static uint64_t phys_ram_size;
-
 #ifdef CONFIG_KVM_DEVICE_ASSIGNMENT
 /* The list of ioperm_data */
 static QLIST_HEAD(, ioperm_data) ioperm_head;
 #endif
 
-//#define DEBUG_MEMREG
-#ifdef DEBUG_MEMREG
-#define DPRINTF(fmt, args...) \
-    do { fprintf(stderr, "%s:%d " fmt, __func__, __LINE__, ##args); } while (0)
-#else
-#define DPRINTF(fmt, args...) do {} while (0)
-#endif
-
 #define ALIGN(x, y) (((x)+(y)-1) & ~((y)-1))
 
 int kvm_abi = EXPECTED_KVM_API_VERSION;
@@ -137,198 +127,6 @@ static inline void clear_gsi(kvm_context
 DPRINTF("Invalid GSI %u\n", gsi);
 }
 
-struct slot_info {
-unsigned long phys_addr;
-unsigned long len;
-unsigned long userspace_addr;
-unsigned flags;
-int logging_count;
-};
-
-struct slot_info slots[KVM_MAX_NUM_MEM_REGIONS];
-
-static void init_slots(void)
-{
-int i;
-
-for (i = 0; i < KVM_MAX_NUM_MEM_REGIONS; ++i)
-slots[i].len = 0;
-}
-
-static int get_free_slot(kvm_context_t kvm)
-{
-int i = 0;
-
-for (; i < KVM_MAX_NUM_MEM_REGIONS; ++i)
-if (!slots[i].len)
-return i;
-return -1;
-}
-
-static void register_slot(int slot, unsigned long phys_addr,
-  unsigned long len, unsigned long userspace_addr,
-  unsigned flags)
-{
-slots[slot].phys_addr = phys_addr;
-slots[slot].len = len;
-slots[slot].userspace_addr = userspace_addr;
-slots[slot].flags = flags;
-}
-
-static void free_slot(int slot)
-{
-slots[slot].len = 0;
-slots[slot].logging_count = 0;
-}
-
-static int get_slot(unsigned long phys_addr)
-{
-int i;
-
-for (i = 0; i < KVM_MAX_NUM_MEM_REGIONS; ++i) {
-if (slots[i].len && slots[i].phys_addr <= phys_addr &&
-(slots[i].phys_addr + slots[i].len - 1) >= phys_addr)
-return i;
-}
-return -1;
-}
-
-/* Returns -1 if this slot is not totally contained on any other,
- * and the number of the slot otherwise */
-static int get_container_slot(uint64_t phys_addr, unsigned long size)
-{
-int i;
-
-for (i = 0; i < KVM_MAX_NUM_MEM_REGIONS; ++i)
-if (slots[i].len && slots[i].phys_addr <= phys_addr &&
-(slots[i].phys_addr + slots[i].len) >= phys_addr + size)
-return i;
-return -1;
-}
-
-int kvm_is_containing_region(kvm_context_t kvm, unsigned long phys_addr,
- unsigned long size)
-{
-int slot = get_container_slot(phys_addr, size);
-if (slot == -1)
-return 0;
-return 1;
-}
-
-/*
- * dirty pages logging control
- */
-static int kvm_dirty_pages_log_change(kvm_context_t kvm,
-  unsigned long phys_addr, unsigned flags,
-  unsigned mask)
-{
-int r = -1;
-int slot = get_slot(phys_addr);
-
-if (slot == -1) {
-fprintf(stderr, "BUG: %s: invalid parameters\n", __FUNCTION__);
-return 1;
-}
-
-flags = (slots[slot].flags & ~mask) | flags;
-if (flags == slots[slot].flags)
-return 0;
-slots[slot].flags = flags;
-
-{
-struct kvm_userspace_memory_region mem = {
-.slot = slot,
-.memory_size = slots[slot].len,
-.guest_phys_addr = slots[slot].phys_addr,
-.userspace_addr = slots[slot].userspace_addr,
-.flags = slots[slot].flags,
-};
-
-
-DPRINTF("slot %d start %llx len %llx flags %x\n",
-mem.slot, mem.guest_phys_addr, mem.memory_size, mem.flags);
-r = kvm_vm_ioctl(kvm_state, KVM_SET_USER_MEMORY_REGION, &mem);
-if (r < 0)
-fprintf(stderr, "%s: %m\n", __FUNCTION__);
-}
-return r;
-}
-
-static int kvm_dirty_pages_log_change_all(kvm_context_t kvm,
-  int (*change)(kvm_context_t kvm,
-uint64_t start,
-uint64_t len))
-{
-int i, r;
-
-for (i = r = 0; i < KVM_MAX_NUM_MEM_REGIONS && r == 0; i++) {
-if (slots[i].len)
-r = change(kvm, slots[i].phys_addr, slots[i].len);
-}
-return r;
-}
-
-int kvm_dirty_pages_log_enable_slot(kvm_context_t kvm, uint64_t phys_addr,
-uint64_t len)
-{
-int slot = get_slot(phys_addr);
-
-DPRINTF("start %" PRIx64 " len %" PRIx64 "\n", phys_addr, len);
-if (slot == -1) {
-fprintf(stderr, "BUG: %s: invalid parameters\n", __func__);
-return -EINVAL;
-}
-
-if (slots[slot].logging_count++)
-return 0;
-
-return kvm_dirty_pages_log_change(kvm, 

[patch 0/6] qemu-kvm: use upstream memslot code

2010-05-03 Thread Marcelo Tosatti
See individual patches for details.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IOzone test: Introduce postprocessing module v2

2010-05-03 Thread Martin Bligh
The only thing that strikes me is whether the gnuplot support
should be abstracted out a bit. See tko/plotgraph.py?

On Mon, May 3, 2010 at 2:52 PM, Lucas Meneghel Rodrigues l...@redhat.com 
wrote:
 This module contains code to postprocess IOzone data
 in a convenient way so we can generate performance graphs
 and condensed data. The graph generation part depends
 on gnuplot, but if the utility is not present,
 functionality will gracefully degrade.

 Use the postprocessing module introduced on the previous
 patch, use it to analyze results and write performance
 graphs and performance tables.

 Also, in order for other tests to be able to use the
 postprocessing code, added the right __init__.py
 files, so a simple

 from autotest_lib.client.tests.iozone import postprocessing

 will work

 Note: Martin, as patch will ignore and not create the
 zero-sized files (high time we move to git), if the changes
 look good to you I can commit them all at once, making sure
 all files are created.

 Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
 ---
  client/tests/iozone/common.py         |    8 +
  client/tests/iozone/iozone.py         |   25 ++-
  client/tests/iozone/postprocessing.py |  487 
 +
  3 files changed, 515 insertions(+), 5 deletions(-)
  create mode 100644 client/tests/__init__.py
  create mode 100644 client/tests/iozone/__init__.py
  create mode 100644 client/tests/iozone/common.py
  create mode 100755 client/tests/iozone/postprocessing.py

 diff --git a/client/tests/__init__.py b/client/tests/__init__.py
 new file mode 100644
 index 000..e69de29
 diff --git a/client/tests/iozone/__init__.py b/client/tests/iozone/__init__.py
 new file mode 100644
 index 000..e69de29
 diff --git a/client/tests/iozone/common.py b/client/tests/iozone/common.py
 new file mode 100644
 index 000..ce78b85
 --- /dev/null
 +++ b/client/tests/iozone/common.py
 @@ -0,0 +1,8 @@
 +import os, sys
 +dirname = os.path.dirname(sys.modules[__name__].__file__)
 +client_dir = os.path.abspath(os.path.join(dirname, .., ..))
 +sys.path.insert(0, client_dir)
 +import setup_modules
 +sys.path.pop(0)
 +setup_modules.setup(base_path=client_dir,
 +                    root_module_name=autotest_lib.client)
 diff --git a/client/tests/iozone/iozone.py b/client/tests/iozone/iozone.py
 index fa3fba4..03c2c04 100755
 --- a/client/tests/iozone/iozone.py
 +++ b/client/tests/iozone/iozone.py
 @@ -1,5 +1,6 @@
  import os, re
  from autotest_lib.client.bin import test, utils
 +import postprocessing


  class iozone(test.test):
 @@ -63,17 +64,19 @@ class iozone(test.test):
         self.results = utils.system_output('%s %s' % (cmd, args))
          self.auto_mode = ("-a" in args)

 -        path = os.path.join(self.resultsdir, 'raw_output_%s' % 
 self.iteration)
 -        raw_output_file = open(path, 'w')
 -        raw_output_file.write(self.results)
 -        raw_output_file.close()
 +        self.results_path = os.path.join(self.resultsdir,
 +                                         'raw_output_%s' % self.iteration)
 +        self.analysisdir = os.path.join(self.resultsdir,
 +                                        'analysis_%s' % self.iteration)
 +
 +        utils.open_write_close(self.results_path, self.results)


     def __get_section_name(self, desc):
         return desc.strip().replace(' ', '_')


 -    def postprocess_iteration(self):
 +    def generate_keyval(self):
         keylist = {}

         if self.auto_mode:
 @@ -150,3 +153,15 @@ class iozone(test.test):
                             keylist[key_name] = result

         self.write_perf_keyval(keylist)
 +
 +
 +    def postprocess_iteration(self):
 +        self.generate_keyval()
 +        if self.auto_mode:
 +            a = postprocessing.IOzoneAnalyzer(list_files=[self.results_path],
 +                                              output_dir=self.analysisdir)
 +            a.analyze()
 +            p = postprocessing.IOzonePlotter(results_file=self.results_path,
 +                                             output_dir=self.analysisdir)
 +            p.plot_all()
 +
 diff --git a/client/tests/iozone/postprocessing.py 
 b/client/tests/iozone/postprocessing.py
 new file mode 100755
 index 000..c995aea
 --- /dev/null
 +++ b/client/tests/iozone/postprocessing.py
 @@ -0,0 +1,487 @@
 +#!/usr/bin/python
 +"""
 +Postprocessing module for IOzone. It is capable of picking results from an
 +IOzone run, calculating the geometric mean for all throughput results for
 +a given file size or record size, and then generating a series of 2D and 3D
 +graphs. The graph generation functionality depends on gnuplot, and if it
 +is not present, functionality degrades gracefully.
 +
 +@copyright: Red Hat 2010
 +"""
 +import os, sys, optparse, logging, math, time
 +import common
 +from autotest_lib.client.common_lib import logging_config, logging_manager
 +from autotest_lib.client.common_lib import error
 +from autotest_lib.client.bin import utils, os_dep
 +
 +
 +_LABELS = 

Re: [PATCH] IOzone test: Introduce postprocessing module v2

2010-05-03 Thread Lucas Meneghel Rodrigues
On Mon, 2010-05-03 at 16:52 -0700, Martin Bligh wrote:
 only thing that strikes me is whether the gnuplot support
 should be abstracted out a bit. See tko/plotgraph.py ?

I thought about it. Ideally, we would do all the plotting using a python
library, such as matplotlib, which has a decent API. However, I spent
quite some time trying to figure out how to draw the surface graphs
using matplotlib and in the end, I gave up (3D support in that lib is
just starting). There are some other libs, such as mayavi
(http://mayavi.sourceforge.net) that I would like to try out in the near
future. Your code in plotgraph.py is aimed at 2D graphs, a good
candidate for replacement using matplotlib (their support for 2D is
excellent).

So, instead of spending much time encapsulating gnuplot in a nice API,
I'd prefer to keep this intermediate work (it does the job anyway) and
get back to this subject when possible. What do you think?

 On Mon, May 3, 2010 at 2:52 PM, Lucas Meneghel Rodrigues l...@redhat.com 
 wrote:
  This module contains code to postprocess IOzone data
  in a convenient way so we can generate performance graphs
  and condensed data. The graph generation part depends
  on gnuplot, but if the utility is not present,
  functionality will gracefully degrade.
 
  Use the postprocessing module introduced on the previous
  patch, use it to analyze results and write performance
  graphs and performance tables.
 
  Also, in order for other tests to be able to use the
  postprocessing code, added the right __init__.py
  files, so a simple
 
  from autotest_lib.client.tests.iozone import postprocessing
 
  will work
 
  Note: Martin, as patch will ignore and not create the
  zero-sized files (high time we move to git), if the changes
  look good to you I can commit them all at once, making sure
  all files are created.
 
  Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
  ---
   client/tests/iozone/common.py |8 +
   client/tests/iozone/iozone.py |   25 ++-
   client/tests/iozone/postprocessing.py |  487 
  +
   3 files changed, 515 insertions(+), 5 deletions(-)
   create mode 100644 client/tests/__init__.py
   create mode 100644 client/tests/iozone/__init__.py
   create mode 100644 client/tests/iozone/common.py
   create mode 100755 client/tests/iozone/postprocessing.py
 
  diff --git a/client/tests/__init__.py b/client/tests/__init__.py
  new file mode 100644
  index 000..e69de29
  diff --git a/client/tests/iozone/__init__.py 
  b/client/tests/iozone/__init__.py
  new file mode 100644
  index 000..e69de29
  diff --git a/client/tests/iozone/common.py b/client/tests/iozone/common.py
  new file mode 100644
  index 000..ce78b85
  --- /dev/null
  +++ b/client/tests/iozone/common.py
  @@ -0,0 +1,8 @@
  +import os, sys
  +dirname = os.path.dirname(sys.modules[__name__].__file__)
  +client_dir = os.path.abspath(os.path.join(dirname, .., ..))
  +sys.path.insert(0, client_dir)
  +import setup_modules
  +sys.path.pop(0)
  +setup_modules.setup(base_path=client_dir,
  +root_module_name=autotest_lib.client)
  diff --git a/client/tests/iozone/iozone.py b/client/tests/iozone/iozone.py
  index fa3fba4..03c2c04 100755
  --- a/client/tests/iozone/iozone.py
  +++ b/client/tests/iozone/iozone.py
  @@ -1,5 +1,6 @@
   import os, re
   from autotest_lib.client.bin import test, utils
  +import postprocessing
 
 
   class iozone(test.test):
  @@ -63,17 +64,19 @@ class iozone(test.test):
  self.results = utils.system_output('%s %s' % (cmd, args))
   self.auto_mode = ("-a" in args)
 
  -path = os.path.join(self.resultsdir, 'raw_output_%s' % 
  self.iteration)
  -raw_output_file = open(path, 'w')
  -raw_output_file.write(self.results)
  -raw_output_file.close()
  +self.results_path = os.path.join(self.resultsdir,
  + 'raw_output_%s' % self.iteration)
  +self.analysisdir = os.path.join(self.resultsdir,
  +'analysis_%s' % self.iteration)
  +
  +utils.open_write_close(self.results_path, self.results)
 
 
  def __get_section_name(self, desc):
  return desc.strip().replace(' ', '_')
 
 
  -def postprocess_iteration(self):
  +def generate_keyval(self):
  keylist = {}
 
  if self.auto_mode:
  @@ -150,3 +153,15 @@ class iozone(test.test):
  keylist[key_name] = result
 
  self.write_perf_keyval(keylist)
  +
  +
  +def postprocess_iteration(self):
  +self.generate_keyval()
  +if self.auto_mode:
  +a = 
  postprocessing.IOzoneAnalyzer(list_files=[self.results_path],
  +  output_dir=self.analysisdir)
  +a.analyze()
  +p = 
  postprocessing.IOzonePlotter(results_file=self.results_path,
  + 

KVM: x86: properly update ready_for_interrupt_injection

2010-05-03 Thread Marcelo Tosatti

The recent changes to emulate string instructions without entering guest
mode exposed a bug where pending interrupts are not properly reflected
in ready_for_interrupt_injection.

The result is that userspace overwrites a previously queued interrupt
when irqchips are emulated in qemu.

Fix by always updating state before returning to userspace.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6b2ce1d..dff08e5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4653,7 +4653,6 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
}
 
	srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
-   post_kvm_run_save(vcpu);
 
vapic_exit(vcpu);
 
@@ -4703,6 +4702,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
r = __vcpu_run(vcpu);
 
 out:
+   post_kvm_run_save(vcpu);
	if (vcpu->sigset_active)
		sigprocmask(SIG_SETMASK, &sigsaved, NULL);
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[no subject]

2010-05-03 Thread Terry

unsubscribe kvm

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio-spec: document block CMD and FLUSH

2010-05-03 Thread Rusty Russell
On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:
 I took a stab at documenting CMD and FLUSH request types in virtio
 block.  Christoph, could you look over this please?
 
 I note that the interface seems full of warts to me,
 this might be a first step to cleaning them.

ISTR Christoph had withdrawn some patches in this area, and I was waiting
for him to resubmit?

I've given up on figuring out the block device.  What seem to me to be sane
semantics along the lines of memory barriers are foreign to disk people: they
want (and depend on) flushing everywhere.

For example, tdb transactions do not require a flush; they only require what
I would call a barrier: that prior data be written out before any future data.
Surely that would be more efficient in general than a flush!  In fact, TDB
wants only writes to *that file* (and metadata) written out first; it has no
ordering issues with other I/O on the same device.

A generic I/O interface would allow you to specify "this request depends on
these outstanding requests" and leave it at that.  It might have some sync
flush command for dumb applications and OSes.  The userspace API might not be
as precise, and only allow such a barrier against all prior writes on this fd.
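
To make that concrete: a toy model of such an interface, in plain Python,
purely illustrative and not any existing kernel or qemu API. Each request
names the outstanding requests it depends on, and the queue may issue a
request as soon as those, and only those, have completed.

    class Request(object):
        def __init__(self, name, depends_on=()):
            self.name = name
            self.depends_on = set(depends_on)  # must finish before this issues
            self.done = False

    class DependencyQueue(object):
        # Toy model: ordering is expressed per request,
        # not as a device-wide flush.
        def __init__(self):
            self.pending = []

        def submit(self, req):
            self.pending.append(req)

        def issuable(self):
            # A request may be issued once its own dependencies are done,
            # regardless of unrelated outstanding I/O on the same device.
            return [r for r in self.pending
                    if not r.done and all(d.done for d in r.depends_on)]

    # tdb-style transaction: the commit record depends only on the prior
    # data writes to *that file*, never on unrelated in-flight I/O.
    data1 = Request('tdb-data-1')
    data2 = Request('tdb-data-2')
    unrelated = Request('other-io')
    commit = Request('tdb-commit', depends_on=[data1, data2])

    queue = DependencyQueue()
    for r in (data1, data2, unrelated, commit):
        queue.submit(r)

    data1.done = data2.done = True        # the data writes complete...
    assert commit in queue.issuable()     # ...so the commit may be issued,
    assert unrelated in queue.issuable()  # and unrelated I/O never blocked it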

ISTR someone mentioning a desire for such an API years ago, so CC'ing the
usual I/O suspects...

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IOzone test: Introduce postprocessing module v2

2010-05-03 Thread Martin Bligh
Yup, fair enough. Go ahead and check it in. If we end up doing
this in another test, we should make an abstraction.

On Mon, May 3, 2010 at 5:39 PM, Lucas Meneghel Rodrigues l...@redhat.com 
wrote:
 On Mon, 2010-05-03 at 16:52 -0700, Martin Bligh wrote:
 only thing that strikes me is whether the gnuplot support
 should be abstracted out a bit. See tko/plotgraph.py ?

 I thought about it. Ideally, we would do all the plotting using a python
 library, such as matplotlib, which has a decent API. However, I spent
 quite some time trying to figure out how to draw the surface graphs
 using matplotlib, and in the end I gave up (3D support in that lib is
 just starting). There are some other libs, such as mayavi
 (http://mayavi.sourceforge.net), that I would like to try out in the
 near future. Your code in plotgraph.py is aimed at 2D graphs, a good
 candidate for replacement using matplotlib (its 2D support is
 excellent).

 So, instead of spending much time encapsulating gnuplot in a nice API,
 I'd prefer to keep this intermediate work (it does the job anyway) and
 get back to this subject when possible. What do you think?
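
As an illustration of the matplotlib direction mentioned above, a 2D helper
of the kind that could eventually replace a gnuplot-driven graph is quite
small. This is only a sketch, with hypothetical names and made-up example
numbers, not code from autotest or plotgraph.py:

    import matplotlib
    matplotlib.use('Agg')  # render headless; test machines have no display
    import matplotlib.pyplot as plt

    def plot_2d(x, y, title, xlabel, ylabel, output_path):
        # Hypothetical helper: one performance curve rendered to a PNG.
        fig, ax = plt.subplots()
        ax.plot(x, y, marker='o')
        ax.set_title(title)
        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
        ax.grid(True)
        fig.savefig(output_path)
        plt.close(fig)

    # Made-up example numbers: throughput (KB/s) versus file size (KB).
    plot_2d([64, 128, 256, 512], [180000, 175000, 169000, 160000],
            'IOzone write throughput', 'file size (KB)', 'throughput (KB/s)',
            'write_throughput.png')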

 On Mon, May 3, 2010 at 2:52 PM, Lucas Meneghel Rodrigues l...@redhat.com 
 wrote:
  This module contains code to postprocess IOzone data
  in a convenient way so we can generate performance graphs
  and condensed data. The graph generation part depends
  on gnuplot, but if the utility is not present,
  functionality will gracefully degrade.
 
  Use the postprocessing module introduced in the previous
  patch to analyze results and write performance graphs
  and performance tables.
 
  Also, in order for other tests to be able to use the
  postprocessing code, I added the right __init__.py
  files, so that a simple
 
  from autotest_lib.client.tests.iozone import postprocessing
 
  will work
 
  Note: Martin, since patch(1) will ignore and not create the
  zero-sized files (high time we moved to git), if the changes
  look good to you I can commit them all at once, making sure
  all files are created.
 
  Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
  ---
   client/tests/iozone/common.py         |    8 +
   client/tests/iozone/iozone.py         |   25 ++-
   client/tests/iozone/postprocessing.py |  487 +
   3 files changed, 515 insertions(+), 5 deletions(-)
   create mode 100644 client/tests/__init__.py
   create mode 100644 client/tests/iozone/__init__.py
   create mode 100644 client/tests/iozone/common.py
   create mode 100755 client/tests/iozone/postprocessing.py
 
  diff --git a/client/tests/__init__.py b/client/tests/__init__.py
  new file mode 100644
  index 000..e69de29
  diff --git a/client/tests/iozone/__init__.py b/client/tests/iozone/__init__.py
  new file mode 100644
  index 000..e69de29
  diff --git a/client/tests/iozone/common.py b/client/tests/iozone/common.py
  new file mode 100644
  index 000..ce78b85
  --- /dev/null
  +++ b/client/tests/iozone/common.py
  @@ -0,0 +1,8 @@
  +import os, sys
  +dirname = os.path.dirname(sys.modules[__name__].__file__)
  +client_dir = os.path.abspath(os.path.join(dirname, "..", ".."))
  +sys.path.insert(0, client_dir)
  +import setup_modules
  +sys.path.pop(0)
  +setup_modules.setup(base_path=client_dir,
  +                    root_module_name="autotest_lib.client")
  diff --git a/client/tests/iozone/iozone.py b/client/tests/iozone/iozone.py
  index fa3fba4..03c2c04 100755
  --- a/client/tests/iozone/iozone.py
  +++ b/client/tests/iozone/iozone.py
  @@ -1,5 +1,6 @@
   import os, re
   from autotest_lib.client.bin import test, utils
  +import postprocessing
 
 
   class iozone(test.test):
  @@ -63,17 +64,19 @@ class iozone(test.test):
          self.results = utils.system_output('%s %s' % (cmd, args))
           self.auto_mode = ("-a" in args)
 
  -        path = os.path.join(self.resultsdir, 'raw_output_%s' % self.iteration)
  -        raw_output_file = open(path, 'w')
  -        raw_output_file.write(self.results)
  -        raw_output_file.close()
  +        self.results_path = os.path.join(self.resultsdir,
  +                                         'raw_output_%s' % self.iteration)
  +        self.analysisdir = os.path.join(self.resultsdir,
  +                                        'analysis_%s' % self.iteration)
  +
  +        utils.open_write_close(self.results_path, self.results)
 
 
      def __get_section_name(self, desc):
          return desc.strip().replace(' ', '_')
 
 
  -    def postprocess_iteration(self):
  +    def generate_keyval(self):
          keylist = {}
 
          if self.auto_mode:
  @@ -150,3 +153,15 @@ class iozone(test.test):
                              keylist[key_name] = result
 
          self.write_perf_keyval(keylist)
  +
  +
  +    def postprocess_iteration(self):
  +        self.generate_keyval()
  +        if self.auto_mode:
  +            a = postprocessing.IOzoneAnalyzer(list_files=[self.results_path],
  +                                              output_dir=self.analysisdir)

KVM call agenda for May 4

2010-05-03 Thread Chris Wright
Please send in any agenda items you are interested in covering.

If we have a lack of agenda items, I'll cancel this week's call.

thanks,
-chris
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Qemu-KVM 0.12.3 and Multipath - Assertion

2010-05-03 Thread André Weidemann

Hi Peter,
On 03.05.2010 23:26, Peter Lieven wrote:


Hi Qemu/KVM Devel Team,

I'm using qemu-kvm 0.12.3 with the latest kernel, 2.6.33.3.
As the backend we use open-iSCSI with dm-multipath.

Multipath is configured to queue I/O if no path is available.

If we create a failure on all paths, qemu starts to consume 100%
CPU due to I/O waits, which is OK so far.

One odd thing: the monitor interface is not responding any more ...

What is a real blocker is that KVM crashes with:
kvm: /usr/src/qemu-kvm-0.12.3/hw/ide/internal.h:507: bmdma_active_if:
Assertion `bmdma->unit != (uint8_t)-1' failed.

after multipath has reestablished at least one path.

Any ideas? I remember this was working with earlier kernel/kvm/qemu
versions.



I have the same issue on my machine, although I am using local storage 
(LVM or a physical disk) to write my data to.
I reported the "Assertion failed" crash to the list on March 17th. Marcelo
and Avi asked some questions back then, but I don't know if they have
come up with a fix for it.


Regards
 André
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html