Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout

2011-11-10 Thread Sasha Levin
On Fri, Nov 11, 2011 at 6:24 AM, Rusty Russell  wrote:
> On Wed, 09 Nov 2011 22:57:28 +0200, Sasha Levin  
> wrote:
>> On Wed, 2011-11-09 at 22:52 +0200, Michael S. Tsirkin wrote:
>> > On Wed, Nov 09, 2011 at 10:24:47PM +0200, Sasha Levin wrote:
>> > > It'll be a bit harder deprecating it in the future.
>> >
>> > Harder than ... what ?
>>
>> Harder than allowing devices not to present it at all if the new layout
>> config is used. Right now the simple implementation is to use MMIO for
>> the config and device-specific data, and fall back to legacy for ISR and
>> notifications (and therefore, this is probably how everybody will
>> implement it), which means that when you do want to deprecate legacy,
>> there will be extra work to be done then, instead of doing it now.
>
> Indeed, I'd like to see two changes to your proposal:
>
> (1) It should be all or nothing.  If a driver can find the virtio header
>    capability, it should only use the capabilities.  Otherwise, it
>    should fall back to legacy.  Your draft suggests a mix is possible;
>    I prefer a clean failure (ie. one day don't present a BAR 0 *at
>    all*, so ancient drivers just fail to load.).
>
> (2) There's no huge win in keeping the same layout.  Let's make some
>    cleanups.  There are more users ahead of us than behind us (I
>    hope!).

Actually, if we're already doing cleanups, here are two more suggestions:

1. Make the 64-bit features one big 64-bit block, instead of having 32
bits in one place and 32 in another (sketched below).
2. Remove the reserved fields from the config (the ones created by
moving the ISR and the notifications out).
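
A sketch of what suggestion 1 would look like (illustrative structs; the
field names are not from the spec draft):

#include <stdint.h>

/* today: feature bits split across two 32-bit fields in different
 * places; suggested cleanup: one contiguous 64-bit block */
struct virtio_cfg_split {
	uint32_t guest_features;	/* low 32 bits here...            */
	/* ... unrelated fields in between ... */
	uint32_t guest_features_hi;	/* ...high 32 bits somewhere else */
};

struct virtio_cfg_clean {
	uint64_t guest_features;	/* one big 64-bit block */
};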

> But I think this is the right direction!
>
> Thanks,
> Rusty.
>

Also, an unrelated question: with PIO, requests were ordered, which
means that if we wrote to the queue selector and then read from a
queue register, we would read the correct queue's info.
Is the same thing guaranteed with MMIO? If we write to the queue
selector and immediately read the queue info, will we read the right
info, or is there a slight chance that the accesses get reordered, so
that we read the queue info first and write to the selector later?
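
For illustration, the access pattern in question as a Linux-style driver
snippet (a minimal sketch; the register offsets are hypothetical, not
from the spec draft):

#include <linux/io.h>

#define VIRTIO_CFG_QUEUE_SEL	0x0e	/* hypothetical offset */
#define VIRTIO_CFG_QUEUE_NUM	0x0c	/* hypothetical offset */

static u16 read_queue_size(void __iomem *base, u16 queue)
{
	writew(queue, base + VIRTIO_CFG_QUEUE_SEL);
	/* PCI ordering pushes posted writes to a device ahead of a later
	 * read completion from the same device, so this read should
	 * observe the selector write; whether the spec needs to state
	 * that guarantee explicitly is exactly the question above. */
	return readw(base + VIRTIO_CFG_QUEUE_NUM);
}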


Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout

2011-11-10 Thread Rusty Russell
On Wed, 09 Nov 2011 22:57:28 +0200, Sasha Levin  wrote:
> On Wed, 2011-11-09 at 22:52 +0200, Michael S. Tsirkin wrote:
> > On Wed, Nov 09, 2011 at 10:24:47PM +0200, Sasha Levin wrote:
> > > It'll be a bit harder deprecating it in the future.
> > 
> > Harder than ... what ?
> 
> Harder than allowing devices not to present it at all if the new layout
> config is used. Right now the simple implementation is to use MMIO for
> the config and device-specific data, and fall back to legacy for ISR and
> notifications (and therefore, this is probably how everybody will
> implement it), which means that when you do want to deprecate legacy,
> there will be extra work to be done then, instead of doing it now.

Indeed, I'd like to see two changes to your proposal:

(1) It should be all or nothing.  If a driver can find the virtio header
capability, it should only use the capabilities.  Otherwise, it
should fall back to legacy.  Your draft suggests a mix is possible;
I prefer a clean failure (ie. one day don't present a BAR 0 *at
all*, so ancient drivers just fail to load.).

(2) There's no huge win in keeping the same layout.  Let's make some
cleanups.  There are more users ahead of us than behind us (I
hope!).

But I think this is the right direction!

Thanks,
Rusty.


Re: [PATCH v4 00/11] KVM: x86: optimize for writing guest page

2011-11-10 Thread Xiao Guangrong

On 11/10/2011 10:05 PM, Avi Kivity wrote:

On 11/10/2011 03:28 PM, Xiao Guangrong wrote:


I have tested RHEL.6.1 setup/boot/reboot/shutdown and the complete
output of scan_results.py is attached.

The result shows the performance is improved:
before:  After:
570      529
555      538
552      531
546      528
553      559
553      527
550      523
553      533
547      538
550      526

What do you think about it? :)


Well, either I was sloppy in my measurements, or maybe RHEL 6 is very
different from F9 (unlikely).  I'll measure it again and see.



Thanks for your time. :)


btw, this is with ept=0, yes?



Yeah.


[PATCH] Virt: Add Fedora 16 to the list of guests

2011-11-10 Thread Lucas Meneghel Rodrigues
Also, make it the default for virt tests.

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/guest-os.cfg.sample |   27 +++
 client/tests/kvm/tests.cfg.sample|   10 
 client/virt/unattended/Fedora-16.ks  |   39 ++
 client/virt/virt_utils.py|6 ++--
 4 files changed, 74 insertions(+), 8 deletions(-)
 create mode 100644 client/virt/unattended/Fedora-16.ks

diff --git a/client/tests/kvm/guest-os.cfg.sample b/client/tests/kvm/guest-os.cfg.sample
index 1e19c52..cd9427e 100644
--- a/client/tests/kvm/guest-os.cfg.sample
+++ b/client/tests/kvm/guest-os.cfg.sample
@@ -328,6 +328,33 @@ variants:
 md5sum_cd1 = c122a2a4f478da4a3d2d12396e84244e
 md5sum_1m_cd1 = c02f37e293bbc85be02a7c850a61273a
 
+- 16.32:
+image_name = f16-32
+unattended_install:
+unattended_file = unattended/Fedora-16.ks
+#floppy = images/f16-32/ks.vfd
+cdrom_unattended = images/f16-32/ks.iso
+kernel = images/f16-32/vmlinuz
+initrd = images/f16-32/initrd.img
+unattended_install.cdrom:
+cdrom_cd1 = isos/linux/Fedora-16-i386-DVD.iso
+md5sum_cd1 = 0d64ab6b1b800827a9c83d95395b3da0
+md5sum_1m_cd1 = 3f616b5034980cadeefe67dbca79cf99
+
+- 16.64:
+image_name = f16-64
+unattended_install:
+unattended_file = unattended/Fedora-16.ks
+#floppy = images/f16-64/ks.vfd
+cdrom_unattended = images/f16-64/ks.iso
+kernel = images/f16-64/vmlinuz
+initrd = images/f16-64/initrd.img
+unattended_install.cdrom:
+cdrom_cd1 = isos/linux/Fedora-16-x86_64-DVD.iso
+md5sum_cd1 = bb38ea1fe4b2fc69e7a6e15cf1c69c91
+md5sum_1m_cd1 = e25ea147176f24239d38a46f501bd25e
+
+
 - RHEL:
 no setup
 shell_prompt = "^\[.*\][\#\$]\s*$"
diff --git a/client/tests/kvm/tests.cfg.sample b/client/tests/kvm/tests.cfg.sample
index a30158b..4b217ee 100644
--- a/client/tests/kvm/tests.cfg.sample
+++ b/client/tests/kvm/tests.cfg.sample
@@ -78,7 +78,7 @@ variants:
 only unattended_install.cdrom, boot, shutdown
 
 # Runs qemu, f15 64 bit guest OS, install, boot, shutdown
-- @qemu_f15_quick:
+- @qemu_f16_quick:
 # We want qemu for this run
 qemu_binary = /usr/bin/qemu
 qemu_img_binary = /usr/bin/qemu-img
@@ -90,13 +90,13 @@ variants:
 only up
 only no_pci_assignable
 only smallpages
-only Fedora.15.64
+only Fedora.16.64
 only unattended_install.cdrom, boot, shutdown
 # qemu needs -enable-kvm on the cmdline
 extra_params += ' -enable-kvm'
 
 # Runs qemu-kvm, f15 64 bit guest OS, install, boot, shutdown
-- @qemu_kvm_f15_quick:
+- @qemu_kvm_f16_quick:
 # We want qemu-kvm for this run
 qemu_binary = /usr/bin/qemu-kvm
 qemu_img_binary = /usr/bin/qemu-img
@@ -106,7 +106,7 @@ variants:
 only smp2
 only no_pci_assignable
 only smallpages
-only Fedora.15.64
+only Fedora.16.64
 only unattended_install.cdrom, boot, shutdown
 
 # Runs your own guest image (qcow2, can be adjusted), all migration tests
@@ -140,4 +140,4 @@ variants:
 #kill_unresponsive_vms.* ?= no
 
 # Choose your test list from the testsets defined
-only qemu_kvm_f15_quick
+only qemu_kvm_f16_quick
diff --git a/client/virt/unattended/Fedora-16.ks b/client/virt/unattended/Fedora-16.ks
new file mode 100644
index 000..eb52c1f
--- /dev/null
+++ b/client/virt/unattended/Fedora-16.ks
@@ -0,0 +1,39 @@
+install
+KVM_TEST_MEDIUM
+text
+reboot
+lang en_US
+keyboard us
+network --bootproto dhcp
+rootpw 123456
+firewall --enabled --ssh
+selinux --enforcing
+timezone --utc America/New_York
+firstboot --disable
+bootloader --location=mbr --append="console=tty0 console=ttyS0,115200"
+zerombr
+poweroff
+
+clearpart --all --initlabel
+autopart
+
+%packages
+@base
+@development-libs
+@development-tools
+%end
+
+%post --interpreter /usr/bin/python
+import socket, os
+os.system('grubby --remove-args="rhgb quiet" --update-kernel=$(grubby --default-kernel)')
+os.system('dhclient')
+os.system('chkconfig sshd on')
+os.system('iptables -F')
+os.system('echo 0 > /selinux/enforce')
+server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+server.bind(('', 12323))
+server.listen(1)
+(client, addr) = server.accept()
+client.send("done")
+client.close()
+%end
diff --git a/client/vi

Re: OpenBSD 5.0 kernel panic in AMD K10 cpu power state

2011-11-10 Thread Andre Przywara

On 11/10/2011 09:46 AM, Avi Kivity wrote:

(re-adding cc)


On 11/09/2011 09:35 PM, Walter Haidinger wrote:

Am 09.11.2011 14:40, schrieb Avi Kivity:

Actually, it looks like an OpenBSD bug.  According to the AMD
documentation:


Well, the OpenBSD developers are very confident that it is
a bug in the KVM cpu emulation and _not_ in OpenBSD.

Basically they say that [despite -cpu host], the emulated
cpu does not look like a real cpu, but a _non-existent_ one.
Virtualization should look like _existing_ hardware.


That is true.  But OpenBSD is not following the vendor's recommendation
for how software should access the hardware.


Since the list archive at
http://marc.info/?l=openbsd-misc&m=132077741910464&w=2
lags a bit, I'm attaching some parts of the thread below:

However, please remember it's OpenBSD, so the tone is, let's just
say, rough.


Less than expected, actually.


The panic you hit is for an msr read, not a write. I'm aware those
registers are read-only. The CPUID check isn't done; it matches on
all family 10 and/or higher AMD processors. They're pretending to be
an AMD K10 processor. On all real hardware I've tested this works
fine. If you wish to be pedantic, patches are welcome.


Avi, thanks for taking care of that.

The manual is clear here: no CPUID bit, no MSRs. Besides that, the
emulated ACPI tables probably also don't provide any info here, right?
The fact that it runs "on all family 10 and/or higher AMD processors"
is just an empirical observation, not a law. You would be astonished
what can be fused off...


We already had a similar discussion here about unconditional AMD
northbridge PCI accesses in the Linux kernel when certain AMD CPU
family/model/steppings are detected (...but every AMD CPU has a
northbridge...). We (as virtualization guys) should not step back so
easily here, especially since the spec is so clear. That spec argument
should actually appeal to the OpenBSD guys, too; I got the impression
that their design is, well, actually well designed.




So they're actually open to adding the cpuid check.


They sent me a patch as a workaround, which:


The previous patch avoids touching the msr at all if ACPI indicates
speed scaling is unavailable, this should prevent your panic.


with -cpu host, OpenBSD dmesg showed the 1100T:

cpu0: AMD Phenom(tm) II X6 1100T Processor ("AuthenticAMD" 686-class, 512KB L2 
cache) 3.31 GHz cpu0:
FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SSE3,CX16,POPCNT
...
bios0: vendor Bochs version "Bochs" date 01/01/2007 bios0: Bochs
Bochs

They shouldn't be pretending to be AMD, especially if that emulation
is very incompatible.


but the bug is in the Linux KVM:


They're pretending to be an AMD K10 processor.


Exactly.  What they are doing is wrong. They are pretending to be an
AMD K10 processor _badly_, and then they think they can say "oh, but
you need to check all these other registers too". A machine with that
setup has never physically existed.


Is this all because I used -cpu host?



-cpu host is not to blame, you could get the same result from other
combinations of cpu model and family.

I'll look at adding support for this MSR; should be simple.  But in
general processor features need to be qualified by cpuid, not by model.


I guess emulating part of the P-states will open up a can of worms.
Besides the generic MSRs (0xC001006[1-3]) there are actual
family-specific ones which are selected by the CPUID family, so you
would end up emulating them, too. I have a hard time thinking of a
strategy for emulating this in general. So unless there is a real
framework for dealing with P-state "hints" from the guest OS, I'd be
reluctant about quick-and-dirty emulations.


Thanks,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany



Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori

On 11/10/2011 12:27 PM, Anthony Liguori wrote:

On 11/10/2011 02:55 AM, Avi Kivity wrote:

If we have to delay the release for a month to get it right, we should.
Not that I think we have to.



Adding libvirt to the discussion.

What does libvirt actually do in the monitor prior to migration completing on
the destination? The least invasive way of doing delayed open of block devices
is probably to make -incoming create a monitor and run a main loop before the
block devices (and full device model) is initialized. Since this isolates the
changes strictly to migration, I'd feel okay doing this for 1.0 (although it
might need to be in the stable branch).


This won't work.  libvirt needs things to be initialized.  Plus, once loadvm 
gets to loading the device model, the device model (and BDSes) need to be fully 
initialized.


I think I've convinced myself that without proper clustered shared storage, 
cache=none is a hard requirement.  That goes for iSCSI and NFS.  I don't see a 
way to do migration safely with NFS, and there's no way to really solve the page 
cache problem with iSCSI.


Even with the reopen, it's racing against the close on the source.  If you look 
at Daniel's description of what libvirt is doing and then compare that to Juan's 
patches, there's a race condition regarding whether the source gets closed 
before the reopen happens.  cache=none seems to be the only way to solve this.


Live migration with qcow2 or any other image format is just not going to work 
right now even with proper clustered storage.  I think doing a block level flush 
cache interface and letting block devices decide how to do it is the best approach.


Regards,

Anthony Liguori


I know a monitor can run like this, as I've done it before, but some of the
commands will not behave as expected, so it's pretty important to be comfortable
with what commands are actually being used in this mode.

Regards,

Anthony Liguori




Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread Stepan Moskovchenko

On 11/10/2011 9:09 AM, Joerg Roedel wrote:
The plan is to have a single DMA-API implementation for all IOMMU 
drivers (X86 and ARM) which just uses the IOMMU-API. But to make this 
performing reasonalbly well a few changes to the IOMMU-API are 
required. I already have some ideas which we can discuss if you want.


I have been experimenting with an iommu_map_range call, which maps a 
given scatterlist of discontiguous physical pages into a contiguous 
virtual region at a given IOVA. This has some performance advantages 
over just calling iommu_map iteratively. First, it reduces the amount of 
table walking / calculation needed for mapping each page, given how you 
know that all the pages will be mapped into a single 
virtually-contiguous region (so in most cases, the first-level table 
calculation can be reused). Second, it allows one to defer the TLB (and 
sometimes cache) maintenance operations until the entire scatterlist has 
been mapped, rather than doing a TLB invalidate after mapping each page, 
as would have been the case if iommu_map were just being called from 
within a loop. Granted, just using iommu_map many times may be 
acceptable on the slow path, but I have seen significant performance 
gains when using this approach on the fast path.
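
A sketch of the shape of such a call (this is the poster's experimental
interface, not the mainline IOMMU API; domain_map_one() and
domain_flush_tlb() are placeholder driver hooks):

#include <linux/iommu.h>
#include <linux/scatterlist.h>

/* placeholder hooks a real driver would supply */
extern int domain_map_one(struct iommu_domain *d, unsigned long iova,
			  phys_addr_t pa, size_t len, int prot);
extern void domain_flush_tlb(struct iommu_domain *d);

int iommu_map_range(struct iommu_domain *domain, unsigned long iova,
		    struct scatterlist *sg, unsigned int nents, int prot)
{
	struct scatterlist *s;
	unsigned int i;
	int ret;

	for_each_sg(sg, s, nents, i) {
		/* The IOVA range is contiguous, so first-level table
		 * calculations can usually be reused across entries. */
		ret = domain_map_one(domain, iova, sg_phys(s), s->length, prot);
		if (ret)
			return ret;
		iova += s->length;
	}
	/* One TLB/cache maintenance operation for the whole scatterlist,
	 * instead of an invalidate after every page. */
	domain_flush_tlb(domain);
	return 0;
}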


Steve



Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori

On 11/10/2011 02:06 PM, Daniel P. Berrange wrote:

On Thu, Nov 10, 2011 at 01:11:42PM -0600, Anthony Liguori wrote:

On 11/10/2011 12:42 PM, Daniel P. Berrange wrote:

On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:

What does libvirt actually do in the monitor prior to migration
completing on the destination?  The least invasive way of doing
delayed open of block devices is probably to make -incoming create a
monitor and run a main loop before the block devices (and full
device model) is initialized.  Since this isolates the changes
strictly to migration, I'd feel okay doing this for 1.0 (although it
might need to be in the stable branch).


The way migration works with libvirt wrt QEMU interactions is now
as follows

  1. Destination.
Run   qemu -incoming ...args...
Query chardevs via monitor
Query vCPU threads via monitor
Set disk / vnc passwords


Since RHEL carries Juan's patch, and Juan's patch doesn't handle
disk passwords gracefully, how does libvirt cope with that?


No idea, that's the first I've heard of any patch that causes
problems with passwords in QEMU.


My guess is that migration with a password protected qcow2 file isn't a common 
test-case.


Regards,

Anthony Liguori



Daniel




Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Daniel P. Berrange
On Thu, Nov 10, 2011 at 01:11:42PM -0600, Anthony Liguori wrote:
> On 11/10/2011 12:42 PM, Daniel P. Berrange wrote:
> >On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:
> >>What does libvirt actually do in the monitor prior to migration
> >>completing on the destination?  The least invasive way of doing
> >>delayed open of block devices is probably to make -incoming create a
> >>monitor and run a main loop before the block devices (and full
> >>device model) is initialized.  Since this isolates the changes
> >>strictly to migration, I'd feel okay doing this for 1.0 (although it
> >>might need to be in the stable branch).
> >
> >The way migration works with libvirt wrt QEMU interactions is now
> >as follows
> >
> >  1. Destination.
> >Run   qemu -incoming ...args...
> >Query chardevs via monitor
> >Query vCPU threads via monitor
> >Set disk / vnc passwords
> 
> Since RHEL carries Juan's patch, and Juan's patch doesn't handle
> disk passwords gracefully, how does libvirt cope with that?

No idea, that's the first I've heard of any patch that causes
problems with passwords in QEMU.

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-10 Thread Nadav Har'El
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT":
> This is all correct, but the code in question parses the EPT12 table
> using the ia32 page table format.  They're sufficiently similar so that
> it works, but it isn't correct.
> 
> Bit 0: EPT readable, ia32 present
> Bit 1: Writable; ia32 meaning dependent on cr0.wp
> Bit 2: EPT executable, ia32 user (so, this implementation will interpret
> a non-executable EPT mapping, if someone could find a use for it, as a
> L2 kernel only mapping)
>

This is a very good point.

I was under the mistaken (?) impression that the page-table shadowing
code will just copy these bits as-is from the shadowed table (EPT12) to the
shadow table (EPT02), without caring what they actually mean. I knew we had
a problem when building, not copying, PTEs, and hence the patch to
link_shadow_page).

Also I realized we sometimes need to actually walk the TDP EPT12+cr3 (e.g.,
to see if an EPT violation is L1's fault), but I thought this was just the
normal TDP walk, which already knows how to correctly read the EPT
table.

> walk_addr() will also write to bits 6/7, which the L1 won't expect.

I didn't notice this :(

Back to the drawing board, I guess. I need to figure out exactly what
needs to be fixed, and how to do this with the least obtrusive changes to
the existing use case (normal shadow page tables, and nested EPT).
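
For reference, the low-bit mismatch Avi points out, written out (bit
positions per the Intel SDM; the macro names are illustrative):

/* EPT PTE low bits */		/* ia32 PTE low bits                  */
#define EPT_READ  (1ull << 0)	/* bit 0: Present                     */
#define EPT_WRITE (1ull << 1)	/* bit 1: Writable (cr0.wp-dependent) */
#define EPT_EXEC  (1ull << 2)	/* bit 2: User/Supervisor             */
/* ia32 PTEs also carry Accessed/Dirty bits that walk_addr() writes
 * back, which an L1 using the EPT format does not expect. */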

-- 
Nadav Har'El|  Thursday, Nov 10 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Learn from mistakes of others; you won't
http://nadav.harel.org.il   |live long enough to make them all yourself


Re: [PATCH] kvm tools: Allow retrieval about PTY redirection in 'kvm stat'

2011-11-10 Thread Pekka Enberg
On Tue, 2011-11-01 at 18:34 +0200, Sasha Levin wrote:
> This patch adds an option to 'kvm stat' to provide information about
> the redirection of a terminal to a PTY device.
> 
> Usage:
>   'kvm stat -p [term] -n [instance_name]'
> 
> Will print information about the redirection of terminal 'term' in
> instance 'instance_name'.
> 
> Cc: Osier Yang 
> Signed-off-by: Sasha Levin 

Ping? Should I apply this patch? Is it actually useful for libvirt?

> ---
>  tools/kvm/Documentation/kvm-stat.txt |2 +
>  tools/kvm/builtin-stat.c |   39 ++---
>  tools/kvm/include/kvm/kvm-ipc.h  |1 +
>  tools/kvm/include/kvm/term.h |7 ++
>  tools/kvm/term.c |   25 -
>  5 files changed, 69 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/kvm/Documentation/kvm-stat.txt b/tools/kvm/Documentation/kvm-stat.txt
> index ce5ab54..5284aa9 100644
> --- a/tools/kvm/Documentation/kvm-stat.txt
> +++ b/tools/kvm/Documentation/kvm-stat.txt
> @@ -17,3 +17,5 @@ For a list of running instances see 'kvm list'.
>  
>  Commands:
>   --memory, -mDisplay memory statistics
> + --pty, -p   Display information about terminal's pty device.
> diff --git a/tools/kvm/builtin-stat.c b/tools/kvm/builtin-stat.c
> index e28eb5b..2a46900 100644
> --- a/tools/kvm/builtin-stat.c
> +++ b/tools/kvm/builtin-stat.c
> @@ -4,6 +4,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include 
>  #include 
> @@ -18,6 +20,7 @@ struct stat_cmd {
>  };
>  
>  static bool mem;
> +static int pty = -1;
>  static bool all;
>  static int instance;
>  static const char *instance_name;
> @@ -30,6 +33,7 @@ static const char * const stat_usage[] = {
>  static const struct option stat_options[] = {
>   OPT_GROUP("Commands options:"),
>   OPT_BOOLEAN('m', "memory", &mem, "Display memory statistics"),
> + OPT_INTEGER('p', "PTY info", &pty, "Display PTY path for given terminal"),
>   OPT_GROUP("Instance options:"),
>   OPT_BOOLEAN('a', "all", &all, "All instances"),
>   OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
> @@ -104,15 +108,40 @@ static int do_memstat(const char *name, int sock)
>   return 0;
>  }
>  
> +static int do_pty(const char *name, int sock)
> +{
> + struct pty_cmd cmd = {KVM_IPC_TRM_PTY, 0, pty};
> + int r;
> + char pty_path[PATH_MAX] = {0};
> +
> + r = xwrite(sock, &cmd, sizeof(cmd));
> + if (r < 0)
> + return r;
> +
> + r = xread(sock, pty_path, PATH_MAX);
> + if (r < 0)
> + return r;
> +
> + printf("Instance %s mapped term %d to: %s\n", name, pty, pty_path);
> +
> + return 0;
> +}
> +
>  int kvm_cmd_stat(int argc, const char **argv, const char *prefix)
>  {
>   parse_stat_options(argc, argv);
>  
> - if (!mem)
> + if (!mem && pty == -1)
>   usage_with_options(stat_usage, stat_options);
>  
> - if (mem && all)
> - return kvm__enumerate_instances(do_memstat);
> + if (all) {
> + if (mem)
> + kvm__enumerate_instances(do_memstat);
> + if (pty != -1)
> + kvm__enumerate_instances(do_pty);
> +
> + return 0;
> + }
>  
>   if (instance_name == NULL &&
>   instance == 0)
> @@ -125,7 +125,9 @@ int kvm_cmd_stat(int argc, const char **argv, const char *prefix)
>   die("Failed locating instance");
>  
>   if (mem)
> - return do_memstat(instance_name, instance);
> + do_memstat(instance_name, instance);
> + if (pty != -1)
> + do_pty(instance_name, instance);
>  
>   return 0;
>  }
> diff --git a/tools/kvm/include/kvm/kvm-ipc.h b/tools/kvm/include/kvm/kvm-ipc.h
> index 731767f..1d9599b 100644
> --- a/tools/kvm/include/kvm/kvm-ipc.h
> +++ b/tools/kvm/include/kvm/kvm-ipc.h
> @@ -17,6 +17,7 @@ enum {
>   KVM_IPC_RESUME  = 5,
>   KVM_IPC_STOP= 6,
>   KVM_IPC_PID = 7,
> + KVM_IPC_TRM_PTY = 8,
>  };
>  
> int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 len, u8 *msg));
> diff --git a/tools/kvm/include/kvm/term.h b/tools/kvm/include/kvm/term.h
> index 37ec731..06d5b4e 100644
> --- a/tools/kvm/include/kvm/term.h
> +++ b/tools/kvm/include/kvm/term.h
> @@ -2,10 +2,17 @@
>  #define KVM__TERM_H
>  
>  #include 
> +#include 
>  
>  #define CONSOLE_8250 1
>  #define CONSOLE_VIRTIO   2
>  
> +struct pty_cmd {
> + u32 type;
> + u32 len;
> + int pty;
> +};
> +
>  int term_putc_iov(int who, struct iovec *iov, int iovcnt, int term);
>  int term_getc_iov(int who, struct iovec *iov, int iovcnt, int term);
>  int term_putc(int who, char *addr, int cnt, int term);
> diff --git a/tools/kvm/term.c b/tools/kvm/term.c
> index fb5d71c..4e0d946 100644
> --- a/tools/kvm/term.c
> +++ b/tools/kvm/term.c
> @@ -13,7 +13,7 @@
>  #include "kvm/util.h"
>  #include "kvm/kvm.h"
>  #

Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread David Woodhouse
On Thu, 2011-11-10 at 18:09 +0100, Joerg Roedel wrote:
> The requirement for the DMA-API is, that the IOTLB must be consistent
> with existing mappings, and only with the parts that are really mapped.
> The unmapped parts are not important.
> 
> This allows nice optimizations like your 'batched unmap' on the Intel
> IOMMU driver side. The AMD IOMMU driver uses a round-robin bitmap
> allocator for the IO addresses which makes it very easy to flush certain
> IOTLB ranges only before they are reused.

... which implies that a mapping, once made, might *never* actually get
torn down until we loop and start reusing address space? That has
interesting security implications. Is it true even for devices which
have been assigned to a VM and then unassigned?

> >   - ... unless booted with 'intel_iommu=strict', in which case we do the
> > unmap and IOTLB flush immediately before returning to the driver.
> 
> There is something similar on the AMD IOMMU side. There it is called
> unmap_flush.

OK, so that definitely wants consolidating into a generic option.

> >   - But the IOMMU API for virtualisation is different. In fact that
> > doesn't seem to flush the IOTLB at all. Which is probably a bug.
> 
> Well, *current* requirement is, that the IOTLB is in sync with the
> page-table at every time. This is true for the iommu_map and especially
> for the iommu_unmap function. It means basically that the unmapped area
> needs to be flushed out of the IOTLBs before iommu_unmap returns.
> 
> Some time ago I proposed the iommu_commit() interface which changes
> these requirements. With this interface the requirement is that after a
> couple of map/unmap operations the IOMMU-API user has to call
> iommu_commit() to make these changes visible to the hardware (so mostly
> sync the IOTLBs). As discussed at that time this would make sense for
> the Intel and AMD IOMMU drivers.

I would *really* want to keep those off the fast path (thinking mostly
about DMA API here, since that's the performance issue). But as long as
we can achieve that, that's fine.

> > What is acceptable, though? That batched unmap is quite important for
> > performance, because it means that we don't have to bash on the hardware
> > and wait for a flush to complete in the fast path of network driver RX,
> > for example.
> 
> Have you considered a round-robin bitmap-allocator? It allows quite nice
> flushing behavior.

Yeah, I was just looking through the AMD code with a view to doing
something similar. I was thinking of using that technique *within* each
larger range allocated from the whole address space.

> > If we move to a model where we have a separate ->flush_iotlb() call, we
> > need to be careful that we still allow necessary optimisations to
> > happen.
> 
> With iommu_commit() this should be possible, still.
> 
> > I'm looking at fixing performance issues in the Intel IOMMU code, with
> > its virtual address space allocation (the rbtree-based one in iova.c
> > that nobody else uses, which has a single spinlock that *all* CPUs bash
> > on when they need to allocate).
> > 
> > The plan is, vaguely, to allocate large chunks of space to each CPU, and
> > then for each CPU to allocate from its own region first, thus ensuring
> > that the common case doesn't bounce locks between CPUs. It'll be rare
> > for one CPU to have to touch a subregion 'belonging' to another CPU, so
> > lock contention should be drastically reduced.
> 
> Thats an interesting issue. It exists on the AMD IOMMU side too, the
> bitmap-allocator runs in a per-domain spinlock which can get high
> contention. I am not sure how per-cpu chunks of the address space scale
> to large numbers of cpus, though, given that some devices only have a
> small address range that they can address.

I don't care about performance of broken hardware. If we have a single
*global* "subrange" for the <4GiB range of address space, that's
absolutely fine by me.

But also, it's not *so* much of an issue to divide the space up even
when it's limited. The idea was not to have it *strictly* per-CPU, but
just for a CPU to try allocating from "its own" subrange first, and then
fall back to allocating a new subrange, and *then* fall back to
allocating from subranges "belonging" to other CPUs. It's not that the
allocation from a subrange would be lockless — it's that the lock would
almost never leave the l1 cache of the CPU that *normally* uses that
subrange.

With batched unmaps, the CPU doing the unmap may end up taking the lock
occasionally, and bounce cache lines then. But it's infrequent enough
that it shouldn't be a performance problem.
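
A toy sketch of that allocation policy (a user-space pthreads stand-in,
not the intel-iommu code; assume sub[] has been initialized with each
CPU's base range):

#include <pthread.h>

#define NR_CPUS	64

struct subrange {
	pthread_mutex_t lock;
	unsigned long next, end;
};

static struct subrange sub[NR_CPUS];

static long iova_alloc(int cpu, unsigned long npages)
{
	int i;

	/* Try this CPU's own subrange first; only on fallback do we
	 * touch (and bounce the lock cache line of) another CPU's. */
	for (i = 0; i < NR_CPUS; i++) {
		struct subrange *s = &sub[(cpu + i) % NR_CPUS];
		long iova = -1;

		pthread_mutex_lock(&s->lock);
		if (s->next + npages <= s->end) {
			iova = s->next;
			s->next += npages;
		}
		pthread_mutex_unlock(&s->lock);
		if (iova >= 0)
			return iova;
	}
	return -1;	/* would carve out a fresh subrange here */
}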

> I have been thinking about some lockless algorithms for the
> bitmap-allocator. But the ideas are not finalized yet, so I still don't
> know if they will work out at all :)

As explained above, I wasn't going for lockless. I was going for
lock-contention-less. Or at least mostly :)

Do you think that approach sounds reasonable?

> The plan is to have a single DMA-API i

Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori

On 11/10/2011 12:42 PM, Daniel P. Berrange wrote:

On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:

What does libvirt actually do in the monitor prior to migration
completing on the destination?  The least invasive way of doing
delayed open of block devices is probably to make -incoming create a
monitor and run a main loop before the block devices (and full
device model) is initialized.  Since this isolates the changes
strictly to migration, I'd feel okay doing this for 1.0 (although it
might need to be in the stable branch).


The way migration works with libvirt wrt QEMU interactions is now
as follows

  1. Destination.
Run   qemu -incoming ...args...
Query chardevs via monitor
Query vCPU threads via monitor
Set disk / vnc passwords


Since RHEL carries Juan's patch, and Juan's patch doesn't handle disk passwords 
gracefully, how does libvirt cope with that?


Regards,

Anthony Liguori


Set netdev link states
Set balloon target

  2. Source
Set  migration speed
Set  migration max downtime
Run  migrate command (detached)
while 1
   Query migration status
   if status is failed or success
 break;

  3. Destination
   If final status was success
  Run  'cont' in monitor
   else
  kill QEMU process

  4. Source
   If final status was success and 'cont' on dest succeeded
  kill QEMU process
   else
  Run 'cont' in monitor


In older libvirt, the bits from step 4, would actually take place
at the end of step 2. This meant we could end up with no QEMU
on either the source or dest, if starting CPUs on the dest QEMU
failed for some reason.


We would still really like to have a 'query-migrate' command for
the destination, so that we can confirm that the destination has
consumed all incoming migrate data successfully, rather than just
blindly starting CPUs and hoping for the best.

Regards,
Daniel




Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Daniel P. Berrange
On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:
> What does libvirt actually do in the monitor prior to migration
> completing on the destination?  The least invasive way of doing
> delayed open of block devices is probably to make -incoming create a
> monitor and run a main loop before the block devices (and full
> device model) is initialized.  Since this isolates the changes
> strictly to migration, I'd feel okay doing this for 1.0 (although it
> might need to be in the stable branch).

The way migration works with libvirt wrt QEMU interactions is now
as follows

 1. Destination.
   Run   qemu -incoming ...args...
   Query chardevs via monitor
   Query vCPU threads via monitor
   Set disk / vnc passwords
   Set netdev link states
   Set balloon target

 2. Source
   Set  migration speed
   Set  migration max downtime
   Run  migrate command (detached)
   while 1
  Query migration status
  if status is failed or success
break;

 3. Destination
  If final status was success
 Run  'cont' in monitor
  else
 kill QEMU process

 4. Source
  If final status was success and 'cont' on dest succeeded
 kill QEMU process
  else
 Run 'cont' in monitor


In older libvirt, the bits from step 4, would actually take place
at the end of step 2. This meant we could end up with no QEMU
on either the source or dest, if starting CPUs on the dest QEMU
failed for some reason.


We would still really like to have a 'query-migrate' command for
the destination, so that we can confirm that the destination has
consumed all incoming migrate data successfully, rather than just
blindly starting CPUs and hoping for the best.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori

On 11/10/2011 02:55 AM, Avi Kivity wrote:

On 11/09/2011 07:35 PM, Anthony Liguori wrote:

On 11/09/2011 11:02 AM, Avi Kivity wrote:

On 11/09/2011 06:39 PM, Anthony Liguori wrote:


Migration with qcow2 is not a supported feature for 1.0.  Migration is
only supported with raw images using coherent shared storage[1].

[1] NFS is only coherent with close-to-open which right now is not
good enough for migration.


Say what?


Due to block format probing, we read at least the first sector of the
disk during start up.

Strictly going by what NFS guarantees, since we don't open on the
destination *after* a close on the source, we aren't guaranteed to
see what's written by the source.

In practice, because of block format probing, unless we're using
cache=none, the first sector can be out of sync with the source on the
destination.  If you use cache=none on a Linux client with at least a
Linux NFS server, you should be relatively safe.



IMO, this should be a release blocker.  qemu 1.0 only supporting
migration on enterprise storage?

If we have to delay the release for a month to get it right, we should.
Not that I think we have to.



Adding libvirt to the discussion.

What does libvirt actually do in the monitor prior to migration completing on 
the destination?  The least invasive way of doing delayed open of block devices 
is probably to make -incoming create a monitor and run a main loop before the 
block devices (and full device model) is initialized.  Since this isolates the 
changes strictly to migration, I'd feel okay doing this for 1.0 (although it 
might need to be in the stable branch).


I know a monitor can run like this, as I've done it before, but some of the 
commands will not behave as expected, so it's pretty important to be comfortable 
with what commands are actually being used in this mode.
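
Schematically, the startup ordering being proposed (illustrative
control flow only; the helper names here are not QEMU's actual
functions):

/* today, block devices are opened (and probed) before the migration
 * stream arrives; the delayed-open idea reorders startup: */
if (incoming_migration) {
	monitor_init();			/* libvirt connects, issues its
					   pre-migration commands       */
	main_loop_wait_for_migration();	/* consume the incoming stream  */
	bdrv_open_all();		/* first open of the images      */
	device_model_init();		/*   happens only now           */
	vm_start();
}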


Regards,

Anthony Liguori


Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori

On 11/10/2011 04:41 AM, Kevin Wolf wrote:

Am 09.11.2011 22:01, schrieb Anthony Liguori:

On 11/09/2011 03:00 PM, Michael S. Tsirkin wrote:

On Wed, Nov 09, 2011 at 02:22:02PM -0600, Anthony Liguori wrote:

On 11/09/2011 02:18 PM, Michael S. Tsirkin wrote:

On Wed, Nov 09, 2011 at 11:35:54AM -0600, Anthony Liguori wrote:

On 11/09/2011 11:02 AM, Avi Kivity wrote:

On 11/09/2011 06:39 PM, Anthony Liguori wrote:


Migration with qcow2 is not a supported feature for 1.0.  Migration is
only supported with raw images using coherent shared storage[1].

[1] NFS is only coherent with close-to-open which right now is not
good enough for migration.


Say what?


Due to block format probing, we read at least the first sector of
the disk during start up.


A simple solution is not to do any probing before the VM is first
started on the incoming path.

Any issues with this?



http://mid.gmane.org/1284213896-12705-4-git-send-email-aligu...@us.ibm.com
I think Kevin wanted open to get delayed.

Regards,

Anthony Liguori


So, this patchset just needs to be revived and polished up?


What I took from the feedback was that Kevin wanted to defer open until the
device model started.  That eliminates the need to reopen or have a invalidation
callback.

I think it would be good for Kevin to comment here though because I might have
misunderstood his feedback.


Your approach was to delay reads, but still keep the image open. I think
I worried that we might have additional reads somewhere that we don't
know about, and this is why I proposed delaying the open as well, so
that any read would always fail.

I believe just reopening the image is (almost?) as good and it's way
easier to do, so I would be inclined to do that for 1.0.


I don't think reopen is good enough without delaying CHS probing too.  That 
information is still potentially out of date.  I don't think you can fix this 
problem without delaying CHS probing at least.
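
For context, the startup probing that makes this racy, schematically
(an illustrative sketch, not QEMU's actual block layer code):

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* The destination reads sector 0 at startup to guess the image format
 * (and later the CHS geometry) while the source may still hold the
 * file open with dirty data; over NFS this read can return stale
 * bytes, since close-to-open consistency has not kicked in yet. */
static int probe_is_qcow2(const char *filename)
{
	uint8_t buf[512];
	int fd = open(filename, O_RDONLY);

	if (fd < 0)
		return -1;
	if (read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
		close(fd);
		return -1;
	}
	close(fd);
	return memcmp(buf, "QFI\xfb", 4) == 0;	/* qcow2 magic */
}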


Regards,

Anthony Liguori



I'm not 100% sure about cases like iscsi, where reopening doesn't help.
I think delaying the open doesn't help there either if you migrate from
A to B and then back from B to A, you could still get old data. So for
iscsi probably cache=none remains the only safe choice, whatever we do.

Kevin





Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori

On 11/10/2011 02:55 AM, Avi Kivity wrote:

On 11/09/2011 07:35 PM, Anthony Liguori wrote:

On 11/09/2011 11:02 AM, Avi Kivity wrote:

On 11/09/2011 06:39 PM, Anthony Liguori wrote:


Migration with qcow2 is not a supported feature for 1.0.  Migration is
only supported with raw images using coherent shared storage[1].

[1] NFS is only coherent with close-to-open which right now is not
good enough for migration.


Say what?


Due to block format probing, we read at least the first sector of the
disk during start up.

Strictly going by what NFS guarantees, since we don't open on the
destination *after* a close on the source, we aren't guaranteed to
see what's written by the source.

In practice, because of block format probing, unless we're using
cache=none, the first sector can be out of sync with the source on the
destination.  If you use cache=none on a Linux client with at least a
Linux NFS server, you should be relatively safe.



IMO, this should be a release blocker.  qemu 1.0 only supporting
migration on enterprise storage?


No, this is not going to block the release.

You can't dump patches on the ML during -rc for an issue that has been 
understood for well over a year simply because it's release time.


If this was so important, it should have been fixed a year ago in the proper 
way.

Regards,

Anthony Liguori



If we have to delay the release for a month to get it right, we should.
Not that I think we have to.





Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori

On 11/10/2011 10:50 AM, Juan Quintela wrote:

Kevin Wolf  wrote:


What I took from the feedback was that Kevin wanted to defer open until the
device model started.  That eliminates the need to reopen or have a invalidation
callback.

I think it would be good for Kevin to comment here though because I might have
misunderstood his feedback.


Your approach was to delay reads, but still keep the image open. I think
I worried that we might have additional reads somewhere that we don't
know about, and this is why I proposed delaying the open as well, so
that any read would always fail.

I believe just reopening the image is (almost?) as good and it's way
easier to do, so I would be inclined to do that for 1.0.

I'm not 100% sure about cases like iscsi, where reopening doesn't help.
I think delaying the open doesn't help there either if you migrate from
A to B and then back from B to A, you could still get old data. So for
iscsi probably cache=none remains the only safe choice, whatever we do.


iSCSI and NFS only work with cache=none.  Even on NFS with close+open,
we have trouble if anything else has the file opened (think libvirt,
guestfs, whatever).


Reopening with iSCSI is strictly an issue with the in-kernel initiator, right? 
libiscsi should be safe with a delayed open I would imagine.


Regards,

Anthony Liguori

  I really think that anything different from
cache=none for iSCSI or NFS is just betting (and yes, it took a while
for Christoph to convince me; I was trying to do a "poor man's"
distributed lock manager, and as everybody knows, it is a _difficult_
problem to solve).

Later, Juan.





Re: qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Juan Quintela
Avi Kivity  wrote:
> On 11/09/2011 07:35 PM, Anthony Liguori wrote:
>> On 11/09/2011 11:02 AM, Avi Kivity wrote:
>>> On 11/09/2011 06:39 PM, Anthony Liguori wrote:

 Migration with qcow2 is not a supported feature for 1.0.  Migration is
 only supported with raw images using coherent shared storage[1].

 [1] NFS is only coherent with close-to-open which right now is not
 good enough for migration.
>>>
>>> Say what?
>>
>> Due to block format probing, we read at least the first sector of the
>> disk during start up.
>>
>> Strictly going by what NFS guarantees, since we don't open on the
>> destination *after* a close on the source, we aren't guaranteed to
>> see what's written by the source.
>>
>> In practice, because of block format probing, unless we're using
>> cache=none, the first sector can be out of sync with the source on the
>> destination.  If you use cache=none on a Linux client with at least a
>> Linux NFS server, you should be relatively safe.
>>
>
> IMO, this should be a release blocker.  qemu 1.0 only supporting
> migration on enterprise storage?
>
> If we have to delay the release for a month to get it right, we should. 
> Not that I think we have to.

I kind of agree here, but it is not my call.  Patch 1/2 has been used
on RHEL for almost 3 years, so it should be safe (TM).

Later, Juan.


Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread Joerg Roedel
On Thu, Nov 10, 2011 at 03:28:50PM +, David Woodhouse wrote:

> Which brings me to another question I have been pondering... do we even
> have a consensus on exactly *when* the IOTLB should be flushed?

Well, sort of, there is still the outstanding idea of the
iommu_commit() interface for the IOMMU-API.

> Even just for the Intel IOMMU, we have three different behaviours:
> 
>   - For DMA API users by default, we do 'batched unmap', so a mapping
> may be active for a period of time after the driver has requested
> that it be unmapped.

The requirement for the DMA-API is, that the IOTLB must be consistent
with existing mappings, and only with the parts that are really mapped.
The unmapped parts are not important.

This allows nice optimizations like your 'batched unmap' on the Intel
IOMMU driver side. The AMD IOMMU driver uses a round-robin bitmap
allocator for the IO addresses which makes it very easy to flush certain
IOTLB ranges only before they are reused.
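
Schematically, the round-robin property that enables this (a toy
sketch, not the AMD driver's actual code): addresses come from a
monotonically advancing cursor, so a freed range is not reused until
the cursor wraps, and one IOTLB flush at wrap-around covers everything
freed in between.

#define NR_PAGES 4096

static unsigned char bitmap[NR_PAGES];	/* 1 = IOVA page in use */
static unsigned long cursor;
static int need_flush;

static long rr_alloc_page(void)
{
	unsigned long i;

	for (i = 0; i < NR_PAGES; i++) {
		unsigned long idx = cursor++ % NR_PAGES;

		if (idx == 0)
			need_flush = 1;	/* wrapped: flush IOTLB before reuse */
		if (!bitmap[idx]) {
			bitmap[idx] = 1;
			return idx;
		}
	}
	return -1;	/* address space exhausted */
}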

>   - ... unless booted with 'intel_iommu=strict', in which case we do the
> unmap and IOTLB flush immediately before returning to the driver.

There is something similar on the AMD IOMMU side. There it is called
unmap_flush.

>   - But the IOMMU API for virtualisation is different. In fact that
> doesn't seem to flush the IOTLB at all. Which is probably a bug.

Well, *current* requirement is, that the IOTLB is in sync with the
page-table at every time. This is true for the iommu_map and especially
for the iommu_unmap function. It means basically that the unmapped area
needs to be flushed out of the IOTLBs before iommu_unmap returns.

Some time ago I proposed the iommu_commit() interface which changes
these requirements. With this interface the requirement is that after a
couple of map/unmap operations the IOMMU-API user has to call
iommu_commit() to make these changes visible to the hardware (so mostly
sync the IOTLBs). As discussed at that time this would make sense for
the Intel and AMD IOMMU drivers.

> What is acceptable, though? That batched unmap is quite important for
> performance, because it means that we don't have to bash on the hardware
> and wait for a flush to complete in the fast path of network driver RX,
> for example.

Have you considered a round-robin bitmap-allocator? It allows quite nice
flushing behavior.

> If we move to a model where we have a separate ->flush_iotlb() call, we
> need to be careful that we still allow necessary optimisations to
> happen.

With iommu_commit() this should be possible, still.

> I'm looking at fixing performance issues in the Intel IOMMU code, with
> its virtual address space allocation (the rbtree-based one in iova.c
> that nobody else uses, which has a single spinlock that *all* CPUs bash
> on when they need to allocate).
> 
> The plan is, vaguely, to allocate large chunks of space to each CPU, and
> then for each CPU to allocate from its own region first, thus ensuring
> that the common case doesn't bounce locks between CPUs. It'll be rare
> for one CPU to have to touch a subregion 'belonging' to another CPU, so
> lock contention should be drastically reduced.

Thats an interesting issue. It exists on the AMD IOMMU side too, the
bitmap-allocator runs in a per-domain spinlock which can get high
contention. I am not sure how per-cpu chunks of the address space scale
to large numbers of cpus, though, given that some devices only have a
small address range that they can address.

I have been thinking about some lockless algorithms for the
bitmap-allocator. But the ideas are not finalized yet, so I still don't
know if they will work out at all :)
The basic idea builds around the fact, that most allocations using the
DMA-API fit into one page. So probably we can split the address-space
into a region for one-page allocations which can be accessed without
locks and another region for larger allocations which still need locks.

> Should I be planning to drop the DMA API support from intel-iommu.c
> completely, and have the new allocator just call into the IOMMU API
> functions instead? Other people have been looking at that, haven't they?

Yes, Marek Szyprowski from the ARM side is looking into this already,
but his patches are very ARM specific and not suitable for x86 yet.

> Is there any code? Or special platform-specific requirements for such a
> generic wrapper that I might not have thought of? Details about when to
> flush the IOTLB are one such thing which might need special handling for
> certain hardware...

The plan is to have a single DMA-API implementation for all IOMMU
drivers (X86 and ARM) which just uses the IOMMU-API. But to make this
perform reasonably well a few changes to the IOMMU-API are required.
I already have some ideas which we can discuss if you want.


Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Re

Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls

2011-11-10 Thread Marcelo Tosatti
On Thu, Nov 10, 2011 at 05:49:42PM +0100, Alexander Graf wrote:
> >>  Documentation/virtual/kvm/api.txt |   47 ++
> >>  arch/powerpc/kvm/powerpc.c|   51 +
> >>  include/linux/kvm.h   |   32 +++
> >>  3 files changed, 130 insertions(+), 0 deletions(-)
> >I don't see the benefit of this generalization, the current structure where
> >context information is hardcoded in the data transmitted works well.
> 
> Well, unfortunately it doesn't work quite as well for us because we
> are a much more evolving platform. Also, there are a lot of edges
> and corners of the architecture that simply aren't implemented in
> KVM as of now. I want to have something extensible enough so we
> don't break the ABI along the way.

You still have to agree on format between userspace and kernel, right?
If either party fails to conform to that, you're doomed.

The problem with two interfaces is potential ambiguity: is
register X implemented through KVM_GET_ONE_REG and also through
KVM_GET_XYZ_REGISTER_SET? If it's accessible by two interfaces, what is
the register writeback order? Is there a plan to convert, etc.?

If you agree these concerns are valid, perhaps this interface can be PPC
specific.



Re: qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Juan Quintela
Kevin Wolf  wrote:

>> What I took from the feedback was that Kevin wanted to defer open until the 
>> device model started.  That eliminates the need to reopen or have a 
>> invalidation 
>> callback.
>> 
>> I think it would be good for Kevin to comment here though because I might 
>> have 
>> misunderstood his feedback.
>
> Your approach was to delay reads, but still keep the image open. I think
> I worried that we might have additional reads somewhere that we don't
> know about, and this is why I proposed delaying the open as well, so
> that any read would always fail.
>
> I believe just reopening the image is (almost?) as good and it's way
> easier to do, so I would be inclined to do that for 1.0.
>
> I'm not 100% sure about cases like iscsi, where reopening doesn't help.
> I think delaying the open doesn't help there either if you migrate from
> A to B and then back from B to A, you could still get old data. So for
> iscsi probably cache=none remains the only safe choice, whatever we do.

iSCSI and NFS only work with cache=none.  Even on NFS with close+open,
we have trouble if anything else has the file opened (think libvirt,
guestfs, whatever).  I really think that anything different from
cache=none for iSCSI or NFS is just betting (and yes, it took a while
for Christoph to convince me; I was trying to do a "poor man's"
distributed lock manager, and as everybody knows, it is a _difficult_
problem to solve).

Later, Juan.


Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls

2011-11-10 Thread Alexander Graf

On 11/10/2011 05:05 PM, Marcelo Tosatti wrote:

On Mon, Oct 31, 2011 at 08:53:11AM +0100, Alexander Graf wrote:

Right now we transfer a static struct every time we want to get or set
registers. Unfortunately, over time we realize that there are more of
these than we thought of before and the extensibility and flexibility of
transferring a full struct every time is limited.

So this is a new approach to the problem. With these new ioctls, we can
get and set a single register that is identified by an ID. This allows for
very precise and limited transmittal of data. When we later realize that
it's a better idea to shove over multiple registers at once, we can reuse
most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS
interface.

The only downside I see to this one is that it needs to pad to 1024 bits
(hardware is already on 512-bit registers, so I wanted to leave some room),
which is slightly too much for transmitting only 64 bits. But if that's all
the tradeoff we have to make for an extensible interface, I'd say go
for it nevertheless.

Signed-off-by: Alexander Graf
---
  Documentation/virtual/kvm/api.txt |   47 ++
  arch/powerpc/kvm/powerpc.c|   51 +
  include/linux/kvm.h   |   32 +++
  3 files changed, 130 insertions(+), 0 deletions(-)

I don't see the benefit of this generalization, the current structure where
context information is hardcoded in the data transmitted works well.


Well, unfortunately it doesn't work quite as well for us because we are 
a much more evolving platform. Also, there are a lot of edges and 
corners of the architecture that simply aren't implemented in KVM as of 
now. I want to have something extensible enough so we don't break the 
ABI along the way.
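
For reference, the shape of an ID-addressed register interface (an
illustrative sketch; the exact struct in the patch may differ, with the
1024-bit padding the description mentions reflected in the value field):

#include <linux/types.h>

struct kvm_one_reg {
	__u64 id;	/* encodes architecture, size, and which register */
	__u8  val[128];	/* value, padded to 1024 bits for future growth   */
};

/* hypothetical userspace usage:
 *	struct kvm_one_reg reg = { .id = REG_ID_GPR(3) };
 *	ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
 */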



Alex



Memory sync algorithm during migration

2011-11-10 Thread Oliver Hookins
Hi,

I am performing some benchmarks on KVM migration on two different types of VM.
One has 4GB RAM and the other 32GB. More or less idle, the 4GB VM takes about 20
seconds to migrate on our hardware while the 32GB VM takes about a minute.

With a reasonable amount of memory activity going on (in the hundreds of MB per
second) the 32GB VM takes 3.5 minutes to migrate, but the 4GB VM never
completes. Intuitively this tells me there is some watermarking of dirty
pages going on that is not particularly efficient when the dirty-page
ratio is high relative to total memory, but I may be completely incorrect.

Could anybody fill me in on what might be going on here? We're using libvirt
0.8.2 and kvm-83-224.el5.centos.1
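
As a rough back-of-envelope model of that intuition (all numbers below are
assumptions for illustration, not measurements): pre-copy migration only
converges while the guest dirties memory more slowly than the link can
transfer it, and the leftover to re-send shrinks geometrically by
dirty_rate/bandwidth per pass:

/* Toy model of iterative pre-copy convergence.  The rates are
 * illustrative assumptions, not measurements of any setup. */
#include <stdio.h>

int main(void)
{
	double ram_mb = 4096;		/* 4GB guest */
	double bw_mb_s = 1000;		/* effective migration bandwidth */
	double dirty_mb_s = 300;	/* "hundreds of MB per second" */
	double left = ram_mb;
	int pass;

	for (pass = 1; pass <= 30 && left >= 1; pass++) {
		double secs = left / bw_mb_s;	/* time to send this pass */
		left = dirty_mb_s * secs;	/* redirtied meanwhile */
		printf("pass %2d: %9.1f MB still dirty\n", pass, left);
	}
	return 0;
}

If dirty_mb_s approaches or exceeds bw_mb_s, "left" never shrinks and the
migration never completes, which would match the 4GB result above.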


Re: [PATCH] Set numa topology for max_cpus

2011-11-10 Thread Marcelo Tosatti
On Wed, Oct 26, 2011 at 02:19:00PM +0200, Vasilis Liaskovitis wrote:
> qemu-kvm passes numa/SRAT topology information for smp_cpus to SeaBIOS. 
> However
> SeaBIOS always expects to setup max_cpus number of SRAT cpu entries
> (MaxCountCPUs variable in build_srat function of SeaBIOS). When qemu-kvm runs
> with smp_cpus != max_cpus (e.g. -smp 2,maxcpus=4), SeaBIOS will mistakenly use
> memory SRAT info for setting up CPU SRAT entries for the offline CPUs. Wrong
> SRAT memory entries are also created. This breaks NUMA in a guest.
> Fix by setting up SRAT info for max_cpus in qemu-kvm.
> 
> Signed-off-by: Vasilis Liaskovitis 

Applied to uq/master, thanks.



Re: [PATCH RFC v2 0/2] Initial support for Microsoft Hyper-V.

2011-11-10 Thread Marcelo Tosatti
On Sun, Oct 23, 2011 at 05:39:47PM +0200, Vadim Rozenfeld wrote:
> With the following series of patches we are starting to implement
> some basic Microsoft Hyper-V Enlightenment functionality. This series
> is mostly about adding support for relaxed timing, spinlock,
> and virtual apic.
> 
> For more Hyper-V related information please see:
> "Hypervisor Functional Specification v2.0: For Windows Server 2008 R2" at
> http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=18673
> 
> Changelog:
>  v2->v1
>   - remove KVM_CAP_IRQCHIP ifdef,
>   - remove CONFIG_HYPERV config option,
>   - move KVM leaves to new location (0x4100),
>   - cosmetic changes.
>  v0->v1
>   - move hyper-v parameters under cpu category,
>   - move hyper-v stuff to target-i386 directory,
>   - make CONFIG_HYPERV enabled by default for
> i386-softmmu and x86_64-softmmu configurations,
>   - rearrange the patches from v0,
>   - set HV_X64_MSR_HYPERCALL, HV_X64_MSR_GUEST_OS_ID,
> and HV_X64_MSR_APIC_ASSIST_PAGE to 0 on system reset.

Paolo, Jan, can you ack please? 

IMO making it TCG-friendly is nice, but it can be done later.



Re: [PATCH] kvm: x86: Drop redundant apic base and tpr update from kvm_get_sregs

2011-11-10 Thread Marcelo Tosatti
On Wed, Oct 26, 2011 at 01:09:45PM +0200, Jan Kiszka wrote:
> The latter was already commented out, the former is redundant as well.
> We always get the latest changes after return from the guest via
> kvm_arch_post_run.
> 
> Signed-off-by: Jan Kiszka 

Applied, thanks.



Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls

2011-11-10 Thread Marcelo Tosatti
On Mon, Oct 31, 2011 at 08:53:11AM +0100, Alexander Graf wrote:
> Right now we transfer a static struct every time we want to get or set
> registers. Unfortunately, over time we realize that there are more of
> these than we thought of before and the extensibility and flexibility of
> transferring a full struct every time is limited.
> 
> So this is a new approach to the problem. With these new ioctls, we can
> get and set a single register that is identified by an ID. This allows for
> very precise and limited transmittal of data. When we later realize that
> it's a better idea to shove over multiple registers at once, we can reuse
> most of the infrastructure and simply implement a GET_MANY_REGS / 
> SET_MANY_REGS
> interface.
> 
> The only downside I see to this one is that it needs to pad to 1024 bits
> (hardware is already on 512 bit registers, so I wanted to leave some room)
> which is slightly too much for transmitting only 64 bits. But if that's all
> the tradeoff we have to do for getting an extensible interface, I'd say go
> for it nevertheless.
> 
> Signed-off-by: Alexander Graf 
> ---
>  Documentation/virtual/kvm/api.txt |   47 ++
>  arch/powerpc/kvm/powerpc.c        |   51 +
>  include/linux/kvm.h               |   32 +++
>  3 files changed, 130 insertions(+), 0 deletions(-)

I don't see the benefit of this generalization, the current structure where 
context information is hardcoded in the data transmitted works well.

Avi?

> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index ab1136f..a23fe62 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1482,6 +1482,53 @@ is supported; 2 if the processor requires all virtual 
> machines to have
>  an RMA, or 1 if the processor can use an RMA but doesn't require it,
>  because it supports the Virtual RMA (VRMA) facility.
>  
> +4.64 KVM_SET_ONE_REG
> +
> +Capability: KVM_CAP_ONE_REG
> +Architectures: all
> +Type: vcpu ioctl
> +Parameters: struct kvm_one_reg (in)
> +Returns: 0 on success, negative value on failure
> +
> +struct kvm_one_reg {
> +   __u64 id;
> +   union {
> +   __u8 reg8;
> +   __u16 reg16;
> +   __u32 reg32;
> +   __u64 reg64;
> +   __u8 reg128[16];
> +   __u8 reg256[32];
> +   __u8 reg512[64];
> +   __u8 reg1024[128];
> +   } u;
> +};
> +
> +Using this ioctl, a single vcpu register can be set to a specific value
> +defined by user space with the passed in struct kvm_one_reg. There can
> +be architecture agnostic and architecture specific registers. Each have
> +their own range of operation and their own constants and width. To keep
> +track of the implemented registers, find a list below:
> +
> +  Arch  |   Register    | Width (bits)
> +        |               |
> +
> +4.65 KVM_GET_ONE_REG
> +
> +Capability: KVM_CAP_ONE_REG
> +Architectures: all
> +Type: vcpu ioctl
> +Parameters: struct kvm_one_reg (in and out)
> +Returns: 0 on success, negative value on failure
> +
> +This ioctl allows one to receive the value of a single register implemented
> +in a vcpu. The register to read is indicated by the "id" field of the
> +kvm_one_reg struct passed in. On success, the register value can be found
> +in the respective width field of the struct after this call.
> +
> +The list of registers accessible using this interface is identical to the
> +list in 4.64.
> +
>  5. The kvm_run structure
>  
>  Application code obtains a pointer to the kvm_run structure by
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index e75c5ac..39cdb3f 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -214,6 +214,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>   case KVM_CAP_PPC_UNSET_IRQ:
>   case KVM_CAP_PPC_IRQ_LEVEL:
>   case KVM_CAP_ENABLE_CAP:
> + case KVM_CAP_ONE_REG:
>   r = 1;
>   break;
>  #ifndef CONFIG_KVM_BOOK3S_64_HV
> @@ -627,6 +628,32 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu 
> *vcpu,
>   return r;
>  }
>  
> +static int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu,
> +   struct kvm_one_reg *reg)
> +{
> + int r = -EINVAL;
> +
> + switch (reg->id) {
> + default:
> + break;
> + }
> +
> + return r;
> +}
> +
> +static int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu,
> +   struct kvm_one_reg *reg)
> +{
> + int r = -EINVAL;
> +
> + switch (reg->id) {
> + default:
> + break;
> + }
> +
> + return r;
> +}
> +
>  int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
>  struct kvm_mp_state *mp_state)
>  {
> @@ -666,6 +693,30 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>  
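
For concreteness, userspace would drive the proposed pair roughly like
this (a sketch against the structures in this patch; no register IDs are
defined yet, so the ID below is a made-up placeholder):

/* Sketch of calling the proposed ioctls from userspace.  The
 * register ID is a placeholder: this patch defines an empty
 * register list. */
#include <linux/kvm.h>
#include <sys/ioctl.h>

#define REG_ID_PLACEHOLDER 0x0ULL	/* hypothetical */

static int set_reg64(int vcpu_fd, __u64 id, __u64 val)
{
	struct kvm_one_reg reg = { .id = id };

	reg.u.reg64 = val;
	return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
}

static int get_reg64(int vcpu_fd, __u64 id, __u64 *val)
{
	struct kvm_one_reg reg = { .id = id };
	int r = ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);

	if (r == 0)
		*val = reg.u.reg64;
	return r;
}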

Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Stefan Hajnoczi
On Thu, Nov 10, 2011 at 2:47 PM, Markus Armbruster  wrote:
> Pekka Enberg  writes:
>
>> Hi Anthony,
>>
>> On Thu, Nov 10, 2011 at 3:43 PM, Anthony Liguori  
>> wrote:
>>> It's not just the qcow2 implementation or even the block layer.  This pull
>>> requests adds a userspace TCP/IP stack to the kernel and yet netdev isn't on
>>> the CC and there are no Ack's from anyone from the networking stack.  I'm
>>> fairly sure if they knew what was happening here they would object.
>>
>> It's something we consider extremely important because it allows easy
>> non-root networking. But you're right, we definitely ought to ping the
>> networking folks before the next merge window.
>
> The problem is real.  The solution "duplicate in user space" sucks.  If
> your engaging with the kernel networking folks leads to one that doesn't
> suck, we should bathe you in free beer.

Look at disks, the problem is addressed by the udisks daemon on dbus.
Anything can try talking to it.  If you have permissions or can get
the user to authenticate then you can manipulate LVM volumes, mount
file systems, etc.  We could do something similar for tap networking.

The Ubuntu folks seem to want VDE instead to solve the same problem.
Create a VDE switch with a single tap device on startup.  Then let all
VMs talk to the VDE without privileges.

I don't think going through VDE is nice or performant; it would be better
to add virtual network functionality over dbus, just like udisks did for
disks.

Stefan


Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread David Woodhouse
On Thu, 2011-11-10 at 14:17 +0800, Kai Huang wrote:
> And another question: have we considered the IOTLB flush operation? I
> think we need to implement similar logic when flushing the DVMA range.
> Intel VT-d's manual says software needs to specify the appropriate
> mask value to flush large pages, but it does not say we need to
> exactly match the page size as it was mapped. I guess it's not
> necessary for Intel IOMMU, but other vendors' IOMMUs may have such
> a limitation (or some other limitations). In my understanding the
> current implementation does not provide page size information for the
> particular DVMA ranges that have been mapped, so it's not easy to
> implement the IOTLB flush code (e.g., we may need to walk through the
> page table to find out the actual page size). Maybe we can also add
> iommu_ops->flush_iotlb?

Which brings me to another question I have been pondering... do we even
have a consensus on exactly *when* the IOTLB should be flushed?

Even just for the Intel IOMMU, we have three different behaviours:

  - For DMA API users by default, we do 'batched unmap', so a mapping
may be active for a period of time after the driver has requested
that it be unmapped.

  - ... unless booted with 'intel_iommu=strict', in which case we do the
unmap and IOTLB flush immediately before returning to the driver.

  - But the IOMMU API for virtualisation is different. In fact that
doesn't seem to flush the IOTLB at all. Which is probably a bug.

What is acceptable, though? That batched unmap is quite important for
performance, because it means that we don't have to bash on the hardware
and wait for a flush to complete in the fast path of network driver RX,
for example.

If we move to a model where we have a separate ->flush_iotlb() call, we
need to be careful that we still allow necessary optimisations to
happen.

Since I have the right people on Cc and the iommu list is still down,
and it's vaguely tangentially related...

I'm looking at fixing performance issues in the Intel IOMMU code, with
its virtual address space allocation (the rbtree-based one in iova.c
that nobody else uses, which has a single spinlock that *all* CPUs bash
on when they need to allocate).

The plan is, vaguely, to allocate large chunks of space to each CPU, and
then for each CPU to allocate from its own region first, thus ensuring
that the common case doesn't bounce locks between CPUs. It'll be rare
for one CPU to have to touch a subregion 'belonging' to another CPU, so
lock contention should be drastically reduced.
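
Concretely, something like this is what I mean by per-CPU chunks (a very
rough sketch; names, sizes and the refill path are illustrative, not from
iova.c):

/* Rough sketch: each CPU carves small allocations out of a big chunk
 * it grabbed from the global allocator, so the global lock is only
 * taken on refill. */
struct iova_cpu_cache {
	unsigned long next;	/* bump pointer within this CPU's chunk */
	unsigned long end;
};

static DEFINE_PER_CPU(struct iova_cpu_cache, iova_cache);

/* stands in for the existing locked rbtree path (hypothetical name) */
unsigned long iova_alloc_slow(unsigned long pages);

static unsigned long iova_alloc_fast(unsigned long pages)
{
	struct iova_cpu_cache *c = get_cpu_ptr(&iova_cache);
	unsigned long ret = 0;

	if (c->next + pages <= c->end) {
		ret = c->next;		/* common case: no global lock */
		c->next += pages;
	}
	put_cpu_ptr(&iova_cache);

	return ret ? ret : iova_alloc_slow(pages); /* refill under lock */
}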

Should I be planning to drop the DMA API support from intel-iommu.c
completely, and have the new allocator just call into the IOMMU API
functions instead? Other people have been looking at that, haven't they?
Is there any code? Or special platform-specific requirements for such a
generic wrapper that I might not have thought of? Details about when to
flush the IOTLB are one such thing which might need special handling for
certain hardware...

-- 
dwmw2



Re: [PATCH 01/10] nEPT: Module option

2011-11-10 Thread Avi Kivity
On 11/10/2011 05:14 PM, Nadav Har'El wrote:
> On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module 
> option":
> > > By "this", do you mean without the "nested_ept" option, or without the
> > > hypothetical "EPT on shadow page tables" feature?
> > 
> > Er, both.  The feature should be controlled on a per-guest basis, not
> > per host.
> >..
> > It's just redundant, since we do need a per-guest control.
>
> I agreed that per-guest control would have been nicer, but since we
> > don't have an API for specifying that per guest (EPT is not,
> > unfortunately, a CPUID feature), I thought that at least a host-level
> flag would be useful.
>
> Why would it be useful? I agree it isn't the most important option since
> sliced bread, but if, for example, one day we discover a bug with nested
> EPT, L0 can disable it for all L1 guests and basically force them to use
> shadow page tables on EPT.

Or we just fix the bug.

> It was also useful for me to have this option for benchmarking, because
> I can force back the old shadow-on-EPT method with just a single option
> in L0 (instead of needing to give "ept=0" option in L1s).

When we have the per-guest controls, we can tell userspace to tell the
kernel disable guest EPT.

> If you really don't like the existence of this option, I can easily
> remove it of course.

Yes please.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-10 Thread Avi Kivity
On 11/10/2011 04:40 PM, Nadav Har'El wrote:
> On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU 
> context for nested EPT":
> > > +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> > > +{
> > > + int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
> >...
> > > + vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
> >...
> > 
> > kvm_init_shadow_mmu() will cause ->page_fault to be set to something
> > like paging64_page_fault(), which is geared to reading EPT ptes.  How
> > does this work?

s/EPT/ia32/

>
> Hi,
>
> I'm afraid I didn't understand the problem.
>
> Nested EPT's merging of two EPT tables (EPT01 and EPT12) works just like
> normal shadow page tables' merging of two CR3s (host cr3 and guest cr3):
>
> When L0 receives a "page fault" from L2 (actually an EPT violation - real
> guest #PF don't cause exits), L0 first looks it up in the shadowed table,
> which is basically EPT12. If the address is there, L0 handles the fault itself
> (updating the shadow EPT table, EPT02 using the normal shadow pte building
> code). But if the address wasn't in the shadowed page table (EPT12),
> mmu->inject_page_fault() is called, which in our case actually causes L1 to
> get an EPT-violation (not #PF - see kvm_propagate_fault()).
>
> Please note that all this logic is shared with the existing nested NPT
> code (which itself shared most of the code with the preexisting shadow
> page tables code). All this code sharing makes it really difficult to
> understand at first glance why the code is really working, but once you
> understand why one of these cases works, the others work similarly.
> And it does in fact work - in typical cases which I tried, at least.
>
> If you still think I'm missing something, I won't be entirely surprised
> ( :-) ), so let me know.

This is all correct, but the code in question parses the EPT12 table
using the ia32 page table format.  They're sufficiently similar so that
it works, but it isn't correct.

Bit 0: EPT readable, ia32 present
Bit 1: Writable; ia32 meaning dependent on cr0.wp
Bit 2: EPT executable, ia32 user (so, this implementation will interpret
a non-executable EPT mapping, if someone could find a use for it, as a
L2 kernel only mapping)
Bits 3-5: EPT memory type, ia32 PWT/PCD (similar but different),
Accessed bit
Bit 6: EPT Ignore PAT, ia32 dirty
Bit 7: EPT ignored, ia32 PAT
Bit 8: EPT ignored, ia32 global
Bit 63: EPT ignored, ia32 NX

walk_addr() will also write to bits 6/7, which the L1 won't expect.
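
In mask form, the collisions look like this (values as defined in KVM's
mmu.h and vmx.h; a comparison recap only, not a patch):

/* Same bit, different meaning under EPT: */
#define PT_PRESENT_MASK		(1ULL << 0)	/* EPT: readable */
#define PT_WRITABLE_MASK	(1ULL << 1)	/* EPT: writable, no cr0.wp */
#define PT_USER_MASK		(1ULL << 2)	/* EPT: executable */
#define PT_ACCESSED_MASK	(1ULL << 5)	/* EPT: top bit of memtype */
#define PT_DIRTY_MASK		(1ULL << 6)	/* EPT: ignore-PAT */
#define PT64_NX_MASK		(1ULL << 63)	/* EPT: ignored */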

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 01/10] nEPT: Module option

2011-11-10 Thread Nadav Har'El
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module 
option":
> > By "this", do you mean without the "nested_ept" option, or without the
> > hypothetical "EPT on shadow page tables" feature?
> 
> Er, both.  The feature should be controlled on a per-guest basis, not
> per host.
>..
> It's just redundant, since we do need a per-guest control.

I agreed that per-guest control would have been nicer, but since we
don't have an API for specifying that per guest (EPT is not,
unfortunately, a CPUID feature), I thought that at least a host-level
flag would be useful.

Why would it be useful? I agree it isn't the most important option since
sliced bread, but if, for example, one day we discover a bug with nested
EPT, L0 can disable it for all L1 guests and basically force them to use
shadow page tables on EPT.
It was also useful for me to have this option for benchmarking, because
I can force back the old shadow-on-EPT method with just a single option
in L0 (instead of needing to give "ept=0" option in L1s).

If you really don't like the existence of this option, I can easily
remove it of course.


-- 
Nadav Har'El|  Thursday, Nov 10 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Guarantee: this email is 100% free of
http://nadav.harel.org.il   |magnetic monopoles, or your money back!


Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread Joerg Roedel
On Thu, Nov 10, 2011 at 10:35:34PM +0800, cody wrote:
> Yes I totally agree page-size is not required for unmap operations
> and should not be added as parameter to map/unmap operations. I am
> not saying the unmap operation, but the IOTLB flush operation. My
> point is we may also need to add similar logic in the IOTLB flush
> code (such as in the Intel IOMMU driver) to guarantee that when issuing
> an IOTLB flush command for a large page, we will still meet the hardware
> limitation on flushing large pages. Seems for Intel IOMMU the only
> limitation is that the mask value (indicating the number of 4k-pages)
> cannot be smaller than the value needed to cover the large page, and
> current Intel IOMMU driver code seems to cover this case well. I am not
> familiar with how the AMD IOMMU issues IOTLB flush commands, but it should be
> able to handle this large page case too. So at this moment, this
> patch should not have any issues :)

The map-operation actually takes a size, as it should. The idea is to
change this function to a map_page interface which takes a page-size
parameter, but that's another story.
The IOTLB flushing is not exposed by the IOMMU-API anyway. Whatever is
necessary to do that is the business of the IOMMU driver. So in
the unmap-path the driver finds out the page-size to unmap and can
immediately flush the IOTLB for that page.
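
In driver terms, roughly (helper names are hypothetical stand-ins for
driver internals, and the prototype is simplified):

/* Sketch: the driver recovers the mapped page size while tearing
 * down the entry and sizes the IOTLB flush to match. */
static size_t my_unmap(struct iommu_domain *domain, unsigned long iova)
{
	size_t page_size = my_pgtable_clear(domain, iova); /* walks the table */

	my_flush_iotlb(domain, iova, page_size); /* mask sized to the page */
	return page_size;
}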


Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Markus Armbruster
Pekka Enberg  writes:

> Hi Anthony,
>
> On Thu, Nov 10, 2011 at 3:43 PM, Anthony Liguori  
> wrote:
>> It's not just the qcow2 implementation or even the block layer.  This pull
>> request adds a userspace TCP/IP stack to the kernel and yet netdev isn't on
>> the CC and there are no Ack's from anyone from the networking stack.  I'm
>> fairly sure if they knew what was happening here they would object.
>
> It's something we consider extremely important because it allows easy
> non-root networking. But you're right, we definitely ought to ping the
> networking folks before the next merge window.

The problem is real.  The solution "duplicate in user space" sucks.  If
> your engaging with the kernel networking folks leads to one that doesn't
suck, we should bathe you in free beer.


Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-10 Thread Nadav Har'El
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT":
> > +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> > +{
> > +   int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
>...
> > +   vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
>...
> 
> kvm_init_shadow_mmu() will cause ->page_fault to be set to something
> like paging64_page_fault(), which is geared to reading EPT ptes.  How
> does this work?

Hi,

I'm afraid I didn't understand the problem.

Nested EPT's merging of two EPT tables (EPT01 and EPT12) works just like
normal shadow page tables' merging of two CR3s (host cr3 and guest cr3):

When L0 receives a "page fault" from L2 (actually an EPT violation - real
guest #PF don't cause exits), L0 first looks it up in the shadowed table,
which is basically EPT12. If the address is there, L0 handles the fault itself
(updating the shadow EPT table, EPT02 using the normal shadow pte building
code). But if the address wasn't in the shadowed page table (EPT12),
mmu->inject_page_fault() is called, which in our case actually causes L1 to
get an EPT-violation (not #PF - see kvm_propagate_fault()).

Please note that all this logic is shared with the existing nested NPT
code (which itself shared most of the code with the preexisting shadow
page tables code). All this code sharing makes it really difficult to
understand at first glance why the code is really working, but once you
understand why one of these cases works, the others work similarly.
And it does in fact work - in typical cases which I tried, at least.

If you still think I'm missing something, I won't be entirely surprised
( :-) ), so let me know.

Nadav.


-- 
Nadav Har'El|  Thursday, Nov 10 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I put a dollar in one of those change
http://nadav.harel.org.il   |machines. Nothing changed.


Re: [PATCH 01/10] nEPT: Module option

2011-11-10 Thread Avi Kivity
On 11/10/2011 04:21 PM, Nadav Har'El wrote:
> On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module 
> option":
> > On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> > > Add a module option "nested_ept" determining whether to enable Nested EPT.
> >...
> > > In the future, we can support emulation of EPT for L1 *always*, even when 
> > > L0
> > > itself doesn't have EPT. This so-called "EPT on shadow page tables" mode
> > > has some theoretical advantages over the baseline "shadow page tables on
> > > shadow page tables" mode typically used when EPT is not available to L0 -
> > > namely that L2's cr3 changes and page faults can be handled in L0 and do 
> > > not
> > > need to be propagated to L1. However, currently we do not support this 
> > > mode,
> > > and it is becoming less interesting as newer processors all support EPT.
> > >
> > >
> > 
> > I think we can live without this.
>
> By "this", do you mean without the "nested_ept" option, or without the
> hypothetical "EPT on shadow page tables" feature?

Er, both.  The feature should be controlled on a per-guest basis, not
per host.  And while emulating EPT on shadow is possible, we have enough
complexity already, I think, and non-EPT hosts are getting rarer.

> If the former, then I agree we can "live" without it, but since it was
> trivial to add, I don't see what harm it can do, and it's nice that we
> can return with a single L0 option to the old shadow-on-ept paging.
> Is there anything specific you don't like about having this option?

It's just redundant, since we do need a per-guest control.

> About the latter, I agree - as I said, there isn't much point to go and
> write this (quite complicated) 3-level shadowing when all new processors
> have EPT anyway. So I didn't.
>
> > But we do need a way to control what
> > features are exposed to the guest, for compatibility and live migration
> > purposes, as we do with cpuid.  So we need some way for host userspace
> > to write to the vmx read-only feature reporting MSRs.
>
> I think this is a general issue (which we already discussed earlier),
> of nested VMX and not specific to nested EPT. I already put all the
> capabilities which the MSRs report into variables initialized in a single
> function, nested_vmx_setup_ctls_msrs(), so once we devise an appropriate
> userspace interface to set these, we can do so easily.

Yes.

> Does nested SVM also have a similar problem, of whether or not it
> advertises new or optional SVM features to L1? If it does have this
> problem, how was it solved there?

svm cpu features are, funnily enough, reported by cpuid, so the existing
KVM_GET_SUPPORTED_CPUID/KVM_SET_CPUID2 method works.  We need a similar
KVM_SET_READONLY_MSRS or something.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread cody

On 11/10/2011 09:08 PM, Joerg Roedel wrote:
> On Thu, Nov 10, 2011 at 08:16:16PM +0800, cody wrote:
>> On 11/10/2011 03:31 PM, Ohad Ben-Cohen wrote:
>>> On Thu, Nov 10, 2011 at 8:17 AM, Kai Huang wrote:
>>>> Seems the unmap function doesn't take phys as a parameter, does this mean
>>>> domain->ops->unmap will walk through the page table to find out the
>>>> actual page size?
>>> The short answer is yes, and furthermore, we also consider removing
>>> the size param from domain->ops->unmap entirely at some point.
>>>
>>> We had a long discussion about it, please see:
>>>
>>> https://lkml.org/lkml/2011/10/10/234
>> Yes I've seen your discussion, I followed this thread from the beginning :)
>>
>> How about the IOTLB flush? As I said, I think we need to consider
>> that an IOMMU (even one that does not exist now) may have some limitation
>> on IOTLB flushes, and hiding the page size from the IOTLB flush code may
>> hurt performance, or even worse, trigger undefined behavior.
> We can only care about IOMMUs that exist today or ones that will exist
> and we already know of.
> In general for the hardware I know of a page-size is not required for
> implementing unmap operations. Requiring this would imply that any user
> of the IOMMU-API needs to keep track of the page-sizes used to map a
> given area.
> This would be a huge burden which is not really necessary because the
> IOMMU driver already has this information and can return it to the user.
> So if you want to change that you need a very good reason for it.

Yes I totally agree page-size is not required for unmap operations and
should not be added as a parameter to map/unmap operations. I am not
talking about the unmap operation, but the IOTLB flush operation. My
point is we may also need to add similar logic in the IOTLB flush code
(such as in the Intel IOMMU driver) to guarantee that when issuing an
IOTLB flush command for a large page, we still meet the hardware
limitation on flushing large pages. Seems for Intel IOMMU the only
limitation is that the mask value (indicating the number of 4k-pages)
cannot be smaller than the value needed to cover the large page, and
current Intel IOMMU driver code seems to cover this case well. I am not
familiar with how the AMD IOMMU issues IOTLB flush commands, but it
should be able to handle this large page case too. So at this moment,
this patch should not have any issues :)


-cody




Re: [PATCH 01/10] nEPT: Module option

2011-11-10 Thread Nadav Har'El
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 01/10] nEPT: Module 
option":
> On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> > Add a module option "nested_ept" determining whether to enable Nested EPT.
>...
> > In the future, we can support emulation of EPT for L1 *always*, even when L0
> > itself doesn't have EPT. This so-called "EPT on shadow page tables" mode
> > has some theoretical advantages over the baseline "shadow page tables on
> > shadow page tables" mode typically used when EPT is not available to L0 -
> > namely that L2's cr3 changes and page faults can be handled in L0 and do not
> > need to be propagated to L1. However, currently we do not support this mode,
> > and it is becoming less interesting as newer processors all support EPT.
> >
> >
> 
> I think we can live without this.

By "this", do you mean without the "nested_ept" option, or without the
hypothetical "EPT on shadow page tables" feature?

If the former, then I agree we can "live" without it, but since it was
trivial to add, I don't see what harm it can do, and it's nice that we
can return with a single L0 option to the old shadow-on-ept paging.
Is there anything specific you don't like about having this option?

About the latter, I agree - as I said, there isn't much point to go and
write this (quite complicated) 3-level shadowing when all new processors
have EPT anyway. So I didn't.

> But we do need a way to control what
> features are exposed to the guest, for compatibility and live migration
> purposes, as we do with cpuid.  So we need some way for host userspace
> to write to the vmx read-only feature reporting MSRs.

I think this is a general issue (which we already discussed earlier),
of nested VMX and not specific to nested EPT. I already put all the
capabilities which the MSRs report into variables initialized in a single
function, nested_vmx_setup_ctls_msrs(), so once we devise an appropriate
userspace interface to set these, we can do so easily.

Does nested SVM also have a similar problem, of whether or not it
advertises new or optional SVM features to L1? If it does have this
problem, how was it solved there?

-- 
Nadav Har'El|  Thursday, Nov 10 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I considered atheism but there weren't
http://nadav.harel.org.il   |enough holidays.


Re: [PATCH 04/14] KVM: PPC: e500: MMU API

2011-11-10 Thread Avi Kivity
On 11/10/2011 04:20 PM, Alexander Graf wrote:
>> The size is 16 on 32-bit ppc -- the alignment of __u64 forces this.  It
>> looks like this is different in the 32-bit x86 ABI.
>>
>> We can pad explicitly if you prefer.
>
> I would prefer if we keep this stable :). There's no good reason to
> pad it - ppc64 creates the same struct definition.
>
>> There are over 500 entries currently, and QEMU could make it much larger
>> if it wants to decrease guest-visible faults on certain workloads.
>>
>> It's not the most important feature, indeed we currently ignore the
>> bitmap entirely.  But it could be useful depending on how the API is
>> used in the future, and I don't think we gain much by dropping it at
>> this point.  Alex, any thoughts?
>
> The kernel can always opt in to ignore the field if it chooses to, so
> I don't see the point in dropping it. There shouldn't be an alignment
> problem in the first place :).

Ok.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 13/14] KVM: PPC: E500: Support hugetlbfs

2011-11-10 Thread Alexander Graf

On 10/31/2011 02:38 PM, Avi Kivity wrote:
> On 10/31/2011 09:53 AM, Alexander Graf wrote:
>> With hugetlbfs support emerging on e500, we should also support KVM
>> backing its guest memory by it.
>>
>> This patch adds support for hugetlbfs into the e500 shadow mmu code.
>>
>>
>> @@ -673,12 +674,31 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
>>  			pfn &= ~(tsize_pages - 1);
>>  			break;
>>  		}
>> +	} else if (vma && hva >= vma->vm_start &&
>> +	            (vma->vm_flags & VM_HUGETLB)) {
>> +		unsigned long psize = vma_kernel_pagesize(vma);
>
> Leading spaces spotted.

Oh no! You really do read whitespace :).


Alex



Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls

2011-11-10 Thread Alexander Graf

On 10/31/2011 02:36 PM, Avi Kivity wrote:
> On 10/31/2011 09:53 AM, Alexander Graf wrote:
>> Right now we transfer a static struct every time we want to get or set
>> registers. Unfortunately, over time we realize that there are more of
>> these than we thought of before and the extensibility and flexibility of
>> transferring a full struct every time is limited.
>>
>> So this is a new approach to the problem. With these new ioctls, we can
>> get and set a single register that is identified by an ID. This allows for
>> very precise and limited transmittal of data. When we later realize that
>> it's a better idea to shove over multiple registers at once, we can reuse
>> most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS
>> interface.
>>
>> The only downside I see to this one is that it needs to pad to 1024 bits
>> (hardware is already on 512 bit registers, so I wanted to leave some room)
>> which is slightly too much for transmitting only 64 bits. But if that's all
>> the tradeoff we have to do for getting an extensible interface, I'd say go
>> for it nevertheless.
>
> Do we want this for x86 too?  How often do we want just one register?

I'm not sure. Depends on your user space I suppose :). If you want a
simple debugging tool that exposes register poking directly to user
space, then it can be handy.

>> +4.64 KVM_SET_ONE_REG
>> +
>> +Capability: KVM_CAP_ONE_REG
>> +Architectures: all
>> +Type: vcpu ioctl
>> +Parameters: struct kvm_one_reg (in)
>> +Returns: 0 on success, negative value on failure
>> +
>> +struct kvm_one_reg {
>> +	__u64 id;
>
> would be better to have a register set (in x86 terms,
> gpr/x86/sse/cr/xcr/msr/special) and an ID within the set.  __u64 is
> excessive, I hope.

Yeah, we have that in the ID. But since the sets are arch specific I'd
rather keep the definition of which parts of the ID are used for the set
and which are used for the actual register id inside that set to the arch.

>> +	union {
>> +		__u8 reg8;
>> +		__u16 reg16;
>> +		__u32 reg32;
>> +		__u64 reg64;
>> +		__u8 reg128[16];
>> +		__u8 reg256[32];
>> +		__u8 reg512[64];
>> +		__u8 reg1024[128];
>> +	} u;
>> +};
>> +
>> +Using this ioctl, a single vcpu register can be set to a specific value
>> +defined by user space with the passed in struct kvm_one_reg. There can
>> +be architecture agnostic and architecture specific registers. Each have
>> +their own range of operation and their own constants and width. To keep
>> +track of the implemented registers, find a list below:
>> +
>> +  Arch  |   Register    | Width (bits)
>> +        |               |
>> +
>
> One possible issue is that certain registers have mutually exclusive
> values, so you may need to issue multiple calls to get the right
> sequence.  You probably don't have that on ppc.

I'm fairly sure we don't. But even if so, it's the same as running code
inside the guest, so it should come naturally, no?



Alex



Re: [PATCH 04/14] KVM: PPC: e500: MMU API

2011-11-10 Thread Alexander Graf

On 11/01/2011 05:16 PM, Scott Wood wrote:
> On 11/01/2011 03:58 AM, Avi Kivity wrote:
>> On 10/31/2011 10:12 PM, Scott Wood wrote:
>>>>> +4.59 KVM_DIRTY_TLB
>>>>> +
>>>>> +Capability: KVM_CAP_SW_TLB
>>>>> +Architectures: ppc
>>>>> +Type: vcpu ioctl
>>>>> +Parameters: struct kvm_dirty_tlb (in)
>>>>> +Returns: 0 on success, -1 on error
>>>>> +
>>>>> +struct kvm_dirty_tlb {
>>>>> +	__u64 bitmap;
>>>>> +	__u32 num_dirty;
>>>>> +};
>>>> This is not 32/64 bit safe.  e500 is 32-bit only, yes?
>>> e5500 is 64-bit -- we don't support it with KVM yet, but it's planned.
>> but what if someone wants to emulate an e500 on a ppc64?  maybe it's
>> better to add padding here.
>>> What is unsafe about it?  Are you picturing TLBs with more than 4
>>> billion entries?
>> sizeof(struct kvm_tlb_dirty) == 12 for 32-bit userspace, but == 16 for
>> 64-bit userspace and the kernel.  ABI structures must have the same
>> alignment and size for 32/64 bit userspace, or they need compat handling.
> The size is 16 on 32-bit ppc -- the alignment of __u64 forces this.  It
> looks like this is different in the 32-bit x86 ABI.
>
> We can pad explicitly if you prefer.

I would prefer if we keep this stable :). There's no good reason to pad
it - ppc64 creates the same struct definition. There shouldn't be any
alignment issues.

>>>> Another alternative is to drop the num_dirty field (and let the kernel
>>>> compute it instead, shouldn't take long?), and have the third argument
>>>> to ioctl() reference the bitmap directly.
>>> The idea was to make it possible for the kernel to apply a threshold
>>> above which it would be better to ignore the bitmap entirely and flush
>>> everything:
>>>
>>> http://www.spinics.net/lists/kvm/msg50079.html
>>>
>>> Currently we always just flush everything, and QEMU always says
>>> everything is dirty when it makes a change, but the API is there if needed.
>> Right, but you don't need num_dirty for it.  There are typically only a
>> few dozen entries, yes?  It should take a trivial amount of time to
>> calculate its weight.
> There are over 500 entries currently, and QEMU could make it much larger
> if it wants to decrease guest-visible faults on certain workloads.
>
> It's not the most important feature, indeed we currently ignore the
> bitmap entirely.  But it could be useful depending on how the API is
> used in the future, and I don't think we gain much by dropping it at
> this point.  Alex, any thoughts?


The kernel can always opt in to ignore the field if it chooses to, so I 
don't see the point in dropping it. There shouldn't be an alignment 
problem in the first place :).
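
For reference, the layout question in numbers (an illustrative recap of
the argument above, not part of the patch):

struct kvm_dirty_tlb {
	__u64 bitmap;
	__u32 num_dirty;
};
/* 32-bit ppc and all 64-bit ABIs: __u64 forces 8-byte struct
 * alignment, so sizeof == 16 (4 bytes of tail padding).
 * 32-bit x86 only aligns __u64 to 4 bytes, so sizeof == 12 there --
 * irrelevant for ppc, which is the point being made above. */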



Alex



Re: [PATCH v4 00/11] KVM: x86: optimize for writing guest page

2011-11-10 Thread Avi Kivity
On 11/10/2011 03:28 PM, Xiao Guangrong wrote:
>
> I have tested RHEL.6.1 setup/boot/reboot/shutdown and the complete
> output of scan_results.py is attached.
>
> The result shows the performance is improved:
> before:  after:
> 570      529
> 555      538
> 552      531
> 546      528
> 553      559
> 553      527
> 550      523
> 553      533
> 547      538
> 550      526
> How do you think about it? :)

Well, either I was sloppy in my measurements, or maybe RHEL 6 is very
different from F9 (unlikely).  I'll measure it again and see.

btw, this is with ept=0, yes?

-- 
error compiling committee.c: too many arguments to function



Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Pekka Enberg
Hi Anthony,

On Thu, Nov 10, 2011 at 3:43 PM, Anthony Liguori  wrote:
> It's not just the qcow2 implementation or even the block layer.  This pull
> request adds a userspace TCP/IP stack to the kernel and yet netdev isn't on
> the CC and there are no Ack's from anyone from the networking stack.  I'm
> fairly sure if they knew what was happening here they would object.

It's something we consider extremely important because it allows easy
non-root networking. But you're right, we definitely ought to ping the
networking folks before the next merge window.

On Thu, Nov 10, 2011 at 3:43 PM, Anthony Liguori  wrote:
>> Would you be interested in spending another 30 seconds to find out some
>> more issues? :-)
>
> I could, provided you could take the things you want to do differently and
> submit them as patches to qemu.git instead of creating a new tool.
>
> There are lots of people on qemu-devel than can provide deep review of this
> type of code.  That's the advantage of working in qemu.git.

I fully understand if you don't want to spend your time reviewing the
KVM tool code. I'm just saying that if you do spend another 30
seconds next time you're bored, I'm all ears and happy to fix any
issues you point out.

Pekka


Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Anthony Liguori

On 11/10/2011 12:46 AM, Pekka Enberg wrote:
> Hi Anthony,
>
>> 1) The RTC emulation is limited to emulating CMOS and only the few fields used
>> to store the date and time. If code is added to arch/x86 that tries to make
>> use of a CMOS field for something useful, kvm-tool is going to fall over.
>>
>> None of the register A/B/C logic is implemented and none of the timer logic is
>> implemented. I imagine this requires kernel command line hackery to keep the
>> kernel from throwing up.
>
> The "fake it until you make it" design principle is actually something Ingo
> suggested early on and has been a really important factor in getting us to
> where we are right now.
>
> Not that I disagree with you. I think we should definitely clean up our
> hardware emulation code.
>
>> If a kernel change that works on bare metal but breaks kvm-tool because
>> kvm-tool is incomplete is committed, is that a regression that requires
>> reverting the change in arch/x86?
>
> If it's the KVM tool being silly, obviously not.
>
>> 2) The qcow2 code is a filesystem implemented in userspace. Image formats are
>> file systems. It really should be reviewed by the filesystem maintainers.
>> There is absolutely no attempt made to synchronize the metadata during write
>> operations, which means that you do not have crash consistency of the
>> metadata.
>>
>> If you experience a power failure or kvm-tool crashes, your image will get
>> corrupted. I highly doubt a file system would ever be merged into Linux that
>> was this naive about data integrity.
>
> The QCOW2 code is lagging behind because we lost the main developer. It's
> forced read-only because of the issues you mention. If you think it's a merge
> blocker, we can drop it completely from the tree until the issues are sorted
> out.

It's not just the qcow2 implementation or even the block layer.  This pull
request adds a userspace TCP/IP stack to the kernel and yet netdev isn't on the
CC and there are no Ack's from anyone from the networking stack.  I'm fairly
sure if they knew what was happening here they would object.

And the implementation isn't even strictly needed.  You can just as well
achieve the same goal using tun/tap with a privileged helper[1].

>> I found these three issues in the course of about 30 seconds of looking
>> through the kvm-tool code. I'm sure if other people with expertise in these
>> areas looked through the code, they would find a lot more issues. I'm sure I
>> could find many, many more issues.
>
> Thanks for the review!
>
> Would you be interested in spending another 30 seconds to find out some more
> issues? :-)

I could, provided you could take the things you want to do differently and
submit them as patches to qemu.git instead of creating a new tool.

There are lots of people on qemu-devel that can provide deep review of this
type of code.  That's the advantage of working in qemu.git.


[1] 
http://mid.gmane.org/1320086191-23641-1-git-send-email-cor...@linux.vnet.ibm.com

Regards,

Anthony Liguori


Re: [PATCH v4 00/11] KVM: x86: optimize for writing guest page

2011-11-10 Thread Xiao Guangrong

On 11/06/2011 11:35 PM, Avi Kivity wrote:
> On 11/04/2011 11:16 AM, Xiao Guangrong wrote:
>>
>> I have done kernbench tests several times on my desktop, and it shows
>> very well:
>>
>> before patchset:
>> real 212.27
>> real 213.47
>> real 204.99
>> real 200.58
>> real 199.99
>> real 199.94
>> real 201.51
>> real 199.83
>> real 198.19
>> real 205.13
>>
>> after patchset:
>> real 199.90
>> real 201.89
>> real 194.54
>> real 188.71
>> real 185.75
>> real 187.70
>> real 188.99
>> real 188.53
>> real 186.29
>> real 188.25
>>
>> I will test it on our server using kvm-autotest, could you share
>> your config file please?
>
> # Copy this file to tests.cfg and edit it.
> #
> # This file contains the test set definitions. Define your test sets here.



Thanks Avi!

I have tested RHEL.6.1 setup/boot/reboot/shutdown and the complete output of 
scan_results.py is attached.


The result shows the performance is improved:
before:  after:
570      529
555      538
552      531
546      528
553      559
553      527
550      523
553      533
547      538
550      526

How do you think about it? :)


result.tar.bz2
Description: application/bzip


Re: [Qemu-devel] [PATCH] i386: derive '-cpu host' from KVM_GET_SUPPORTED_CPUID

2011-11-10 Thread Avi Kivity
On 11/09/2011 08:21 PM, Sasha Levin wrote:
> On Wed, 2011-11-09 at 20:00 +0200, Avi Kivity wrote:
> > On 11/09/2011 07:56 PM, Anthony Liguori wrote:
> > > On 11/09/2011 07:44 AM, Avi Kivity wrote:
> > >> The fact that a host cpu supports a feature doesn't mean that QEMU
> > >> and KVM
> > >> will also support it, yet -cpuid host brings host features wholesale.
> > >>
> > >> We need to whitelist each feature separately to make sure we support it.
> > >> This patch adds KVM whitelisting (by simply using
> > >> KVM_GET_SUPPORTED_CPUID
> > >> instead of the CPUID instruction).
> > >>
> > >> Signed-off-by: Avi Kivity
> > >
> > > This seems like a 1.0 candidate, yes?
> > 
> > There is a distinct possibility this will uncover bugs in kvm's
> > KVM_GET_SUPPORTED_CPUID.  Those won't be qemu bugs, so I think it's good
> > for 1.0.
> > 
>
> Avi, we have a problem in the KVM tool of KVM_GET_SUPPORTED_CPUID
> sometimes returning -E2BIG. I've sent a mail about it some time ago, but
> we couldn't really find the reason.
>
> It's somewhat non-deterministic, and there's no sure way to reproduce it,
> but it doesn't happen that rarely.
>
> The block of code that uses it from usermode is pretty simple:
>
> struct kvm_cpuid2 *kvm_cpuid;
>
> kvm_cpuid = calloc(1, sizeof(*kvm_cpuid) + 
>   MAX_KVM_CPUID_ENTRIES * sizeof(*kvm_cpuid->entries));
>
> kvm_cpuid->nent = MAX_KVM_CPUID_ENTRIES;
> if (ioctl(vcpu->kvm->sys_fd, KVM_GET_SUPPORTED_CPUID, kvm_cpuid) < 0)
>   die_perror("KVM_GET_SUPPORTED_CPUID failed");
>
> MAX_KVM_CPUID_ENTRIES is set to 100, which is more than the 80 defined
> in the kernel, so it shouldn't be an issue. It wouldn't explain the
> non-deterministic behavior either.
>
> QEMU's code around it allows it to hide the bug if it does happen:
>
> uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t function,
>uint32_t index, int reg)
> {
>  struct kvm_cpuid2 *cpuid;
>  int i, max;
>  uint32_t ret = 0;
>  uint32_t cpuid_1_edx;
>  int has_kvm_features = 0;
>
>  max = 1;
>  while ((cpuid = try_get_cpuid(s, max)) == NULL) {
>  max *= 2;
>  }
> [snip]
>
> Which means that if it fails it will silently retry until it makes it.
>
> Any guess on why it might happen?
>

No idea.  If you run your code block in a loop, how soon will it reproduce?
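
Something along these lines, reusing the allocation from the snippet you
quoted (a fragment, not a full program; needs <errno.h>):

/* Hammer KVM_GET_SUPPORTED_CPUID and count how often -E2BIG shows up.
 * Reuses kvm_cpuid / MAX_KVM_CPUID_ENTRIES / vcpu->kvm->sys_fd from
 * the snippet above; the iteration count is arbitrary. */
int i, e2big = 0;

for (i = 0; i < 1000000; i++) {
	kvm_cpuid->nent = MAX_KVM_CPUID_ENTRIES;
	if (ioctl(vcpu->kvm->sys_fd, KVM_GET_SUPPORTED_CPUID, kvm_cpuid) < 0 &&
	    errno == E2BIG)
		e2big++;
}
printf("%d E2BIG out of %d calls\n", e2big, i);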

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread Joerg Roedel
On Thu, Nov 10, 2011 at 08:16:16PM +0800, cody wrote:
> On 11/10/2011 03:31 PM, Ohad Ben-Cohen wrote:
> >On Thu, Nov 10, 2011 at 8:17 AM, Kai Huang  wrote:
> >>Seems the unmap function doesn't take phys as a parameter, does this mean
> >>domain->ops->unmap will walk through the page table to find out the
> >>actual page size?
> >The short answer is yes, and furthermore, we also consider to remove
> >the size param from domain->ops->unmap entirely at some point.
> >
> >We had a long discussion about it, please see:
> >
> >https://lkml.org/lkml/2011/10/10/234
> Yes I've seen your discussion, I followed this thread from beginning:)
> 
> How about the IOTLB flush? As I said I think we need to consider
> that IOMMU (even does not exist now) may have some limitation on
> IOTLB flush, and hiding page size from IOTLB flush code may hurt
> performance, or even worse, trigger undefined behaviors.

We can only care about IOMMUs that exist today or ones that will exist
and we already know of.
In general for the hardware I know of a page-size is not required for
implementing unmap operations. Requiring this would imply that any user
of the IOMMU-API needs to keep track of the page-sizes used to map a
given area.
This would be a huge burden which is not really necessary because the
IOMMU driver already has this information and can return it to the user.
So if you want to change that you need a very good reason for it.


Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT

2011-11-10 Thread Orit Wasserman
On 11/10/2011 11:59 AM, Nadav Har'El wrote:
> When the existing KVM MMU code creates a shadow page table, it assumes it
> has the normal x86 page table format. This is obviously correct for normal
> shadow page tables, and also correct for AMD's NPT.
> Unfortunately, Intel's EPT page tables differ in subtle ways from ordinary
> page tables, so when we create a shadow EPT table (i.e., in nested EPT),
> we need to slightly modify the way in which this table is built.
> 
> In particular, when mmu.c's link_shadow_page() creates non-leaf page table
> entries, it used to enable the "present", "accessed", "writable" and "user"
> flags on these entries. While this is correct for ordinary page tables, it
> is wrong in EPT tables - where these bits actually have completely different
> meaning (compare PT_*_MASK from mmu.h to VMX_EPT_*_MASK from vmx.h).
> In particular, leaving the code as-is causes bit 5 of the PTE to be turned on
> (supposedly for PT_ACCESSED_MASK), which is a reserved bit in EPT and causes
> an "EPT Misconfiguration" failure.
> 
> So we must move link_shadow_page's list of extra bits to a new mmu context
> field, which is set differently for nested EPT.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/include/asm/kvm_host.h |1 +
>  arch/x86/kvm/mmu.c  |   16 +---
>  arch/x86/kvm/paging_tmpl.h  |6 --
>  arch/x86/kvm/vmx.c  |3 +++
>  4 files changed, 17 insertions(+), 9 deletions(-)
> 
> --- .before/arch/x86/include/asm/kvm_host.h   2011-11-10 11:33:59.0 
> +0200
> +++ .after/arch/x86/include/asm/kvm_host.h2011-11-10 11:33:59.0 
> +0200
> @@ -287,6 +287,7 @@ struct kvm_mmu {
>   bool nx;
>  
>   u64 pdptrs[4]; /* pae */
> + u64 link_shadow_page_set_bits;
>  };
>  
>  struct kvm_vcpu_arch {
> --- .before/arch/x86/kvm/vmx.c2011-11-10 11:33:59.0 +0200
> +++ .after/arch/x86/kvm/vmx.c 2011-11-10 11:33:59.0 +0200
> @@ -6485,6 +6485,9 @@ static int nested_ept_init_mmu_context(s
>   vcpu->arch.mmu.get_pdptr = nested_ept_get_pdptr;
>   vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
>   vcpu->arch.mmu.shadow_root_level = get_ept_level();
> + vcpu->arch.mmu.link_shadow_page_set_bits =
> + VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
> + VMX_EPT_EXECUTABLE_MASK;
>  
>   vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
>  
> --- .before/arch/x86/kvm/mmu.c2011-11-10 11:33:59.0 +0200
> +++ .after/arch/x86/kvm/mmu.c 2011-11-10 11:33:59.0 +0200
> @@ -1782,14 +1782,9 @@ static void shadow_walk_next(struct kvm_
>   return __shadow_walk_next(iterator, *iterator->sptep);
>  }
>  
> -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
> +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 
> set_bits)
>  {
> - u64 spte;
> -
> - spte = __pa(sp->spt)
> - | PT_PRESENT_MASK | PT_ACCESSED_MASK
> - | PT_WRITABLE_MASK | PT_USER_MASK;
> - mmu_spte_set(sptep, spte);
> + mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
>  }
>  
>  static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
> @@ -3366,6 +3361,13 @@ int kvm_init_shadow_mmu(struct kvm_vcpu 
>   vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
>   vcpu->arch.mmu.base_role.smep_andnot_wp
>   = smep && !is_write_protection(vcpu);
> + /*
> +  * link_shadow() should apply these bits in shadow page tables, and
> +  * in shadow NPT tables (nested NPT). For nested EPT, different bits
> +  * apply.
> +  */
> + vcpu->arch.mmu.link_shadow_page_set_bits = PT_PRESENT_MASK |
> + PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
>  
>   return r;
>  }
> --- .before/arch/x86/kvm/paging_tmpl.h2011-11-10 11:33:59.0 
> +0200
> +++ .after/arch/x86/kvm/paging_tmpl.h 2011-11-10 11:33:59.0 +0200
> @@ -515,7 +515,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
>   goto out_gpte_changed;
>  
>   if (sp)
> - link_shadow_page(it.sptep, sp);
> + link_shadow_page(it.sptep, sp,
> + vcpu->arch.mmu.link_shadow_page_set_bits);
>   }
>  
>   for (;
> @@ -535,7 +536,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
>  
>   sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
> true, direct_access, it.sptep);
> - link_shadow_page(it.sptep, sp);
> + link_shadow_page(it.sptep, sp,
> + vcpu->arch.mmu.link_shadow_page_set_bits);
>   }
>  
>   clear_sp_write_flooding_count(it.sptep);
We need to consider the permissions in L1's EPT table. What if an entry is
read-only?


Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT

2011-11-10 Thread Avi Kivity
On 11/10/2011 02:21 PM, Avi Kivity wrote:
> On 11/10/2011 01:03 PM, Nadav Har'El wrote:
> > On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 04/10] nEPT: Fix 
> > page table format in nested EPT":
> > > > @@ -287,6 +287,7 @@ struct kvm_mmu {
> > > > bool nx;
> > > >  
> > > > u64 pdptrs[4]; /* pae */
> > > > +   u64 link_shadow_page_set_bits;
> > >...
> > > > +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 
> > > > set_bits)
> > > >  {
> > > > -   u64 spte;
> > > > -
> > > > -   spte = __pa(sp->spt)
> > > > -   | PT_PRESENT_MASK | PT_ACCESSED_MASK
> > > > -   | PT_WRITABLE_MASK | PT_USER_MASK;
> > > > -   mmu_spte_set(sptep, spte);
> > > > +   mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
> > > >  }
> > > >
> > > 
> > > Minor nit: you can just use link_shadow_page_set_bits here instead of
> > > passing it around (unless later you have a different value for the
> > > parameter?)
> >
> > The problem was that link_shadow_page did not take an kvm_mmu parameter,
> > so I don't know where to find this link_shadow_page_set_bits. So either
> > I pass the pointer to the entire kvm_mmu to link_shadow_page, or I just
> > pass the only field which I need... I thought that passing the single
> > field I need was cleaner - but I can easily change it if you prefer to
> > pass the kvm_mmu.
>
> Ah, doesn't matter either way.
>

On second thoughts, passing the mmu is better for future maintainability.

-- 
error compiling committee.c: too many arguments to function



[PATCHv3 00/10] KVM in-guest performance monitoring

2011-11-10 Thread Gleb Natapov
This patchset exposes an emulated version 2 architectural performance
monitoring unit to KVM guests.  The PMU is emulated using perf_events,
so the host kernel can multiplex host-wide, host-user, and guest events
on the available counter resources.

The patches are against next branch on kvm.git.

If you want to try running perf in a guest, you need to apply the patch
below to qemu-kvm and use -cpu host on the qemu command line. But DO NOT
TRY those patches without applying [1][2] to the host kernel first.
Don't tell me I didn't warn you!

[1] https://lkml.org/lkml/2011/10/18/390
[2] https://lkml.org/lkml/2011/10/23/163
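
For a quick sanity check from inside the guest, RDPMC can be issued
directly once a counter has been programmed (an illustration, not part of
the patchset; CR4.PCE must permit user-mode RDPMC, otherwise this faults
with #GP):

static inline unsigned long long rdpmc(unsigned int counter)
{
	unsigned int lo, hi;

	/* EDX:EAX <- value of the counter selected by ECX */
	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
	return ((unsigned long long)hi << 32) | lo;
}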

Changelog:
 v1->v2
  - put index into struct kvm_pmc instead of calculating it
  - use locked version of bitops
  - inject pmi from irq work if vcpu was not in a guest mode during NMI
  - provide a stub for perf_get_x86_pmu_capability() for !PERF_EVENTS
 v2->v3
  - minor style change/comment clarification
  - add perf patch to disable arch event not supported by a CPU
  - create perf events as pinned

Avi Kivity (6):
  KVM: Expose kvm_lapic_local_deliver()
  KVM: Add generic RDPMC support
  KVM: SVM: Intercept RDPMC
  KVM: VMX: Intercept RDPMC
  KVM: x86 emulator: fix RDPMC privilege check
  KVM: x86 emulator: implement RDPMC (0F 33)

Gleb Natapov (4):
  KVM: Expose a version 2 architectural PMU to guests
  x86, perf: disable non available architectural events.
  perf, x86: expose perf capability to other modules.
  KVM: Expose the architectural performance monitoring CPUID leaf

 arch/x86/include/asm/kvm_emulate.h |1 +
 arch/x86/include/asm/kvm_host.h|   49 +++
 arch/x86/include/asm/perf_event.h  |   29 ++
 arch/x86/kernel/cpu/perf_event.c   |   11 +
 arch/x86/kernel/cpu/perf_event.h   |5 +
 arch/x86/kernel/cpu/perf_event_intel.c |   29 ++-
 arch/x86/kvm/Kconfig   |1 +
 arch/x86/kvm/Makefile  |2 +-
 arch/x86/kvm/emulate.c |   13 +-
 arch/x86/kvm/lapic.c   |2 +-
 arch/x86/kvm/lapic.h   |1 +
 arch/x86/kvm/pmu.c |  531 
 arch/x86/kvm/svm.c |   15 +
 arch/x86/kvm/vmx.c |   15 +-
 arch/x86/kvm/x86.c |   76 -
 include/linux/kvm_host.h   |2 +
 16 files changed, 763 insertions(+), 19 deletions(-)
 create mode 100644 arch/x86/kvm/pmu.c


diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index f17..ff2a0ca 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -1178,11 +1178,20 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 *edx = 0;
 break;
 case 0xA:
-/* Architectural Performance Monitoring Leaf */
-*eax = 0;
-*ebx = 0;
-*ecx = 0;
-*edx = 0;
+   if (kvm_enabled()) {
+KVMState *s = env->kvm_state;
+
+*eax = kvm_arch_get_supported_cpuid(s, 0xA, count, R_EAX);
+*ebx = kvm_arch_get_supported_cpuid(s, 0xA, count, R_EBX);
+*ecx = kvm_arch_get_supported_cpuid(s, 0xA, count, R_ECX);
+*edx = kvm_arch_get_supported_cpuid(s, 0xA, count, R_EDX);
+   } else {
+   /* Architectural Performance Monitoring Leaf */
+   *eax = 0;
+   *ebx = 0;
+   *ecx = 0;
+   *edx = 0;
+   }
 break;
 case 0xD:
 /* Processor Extended State */
-- 
1.7.7.1



[PATCHv3 09/10] KVM: x86 emulator: fix RDPMC privilege check

2011-11-10 Thread Gleb Natapov
From: Avi Kivity 

RDPMC is only privileged if CR4.PCE=0.  check_rdpmc() already implements this,
so all we need to do is drop the Priv flag.
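
In outline, the rule check_rdpmc() enforces is just this (a paraphrased
sketch, not the exact emulator code):

static int rdpmc_is_privileged(u64 cr4, int cpl)
{
	/* RDPMC at CPL > 0 is allowed only when CR4.PCE is set;
	 * the caller injects #GP when this returns true. */
	return !(cr4 & X86_CR4_PCE) && cpl != 0;
}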

Signed-off-by: Avi Kivity 
Signed-off-by: Gleb Natapov 
---
 arch/x86/kvm/emulate.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 8547958..c0ee85b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3254,7 +3254,7 @@ static struct opcode twobyte_table[256] = {
DI(ImplicitOps | Priv, wrmsr),
IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
DI(ImplicitOps | Priv, rdmsr),
-   DIP(ImplicitOps | Priv, rdpmc, check_rdpmc),
+   DIP(ImplicitOps, rdpmc, check_rdpmc),
I(ImplicitOps | VendorSpecific, em_sysenter),
I(ImplicitOps | Priv | VendorSpecific, em_sysexit),
N, N,
-- 
1.7.7.1



[PATCHv3 04/10] KVM: SVM: Intercept RDPMC

2011-11-10 Thread Gleb Natapov
From: Avi Kivity 

Intercept RDPMC and forward it to the PMU emulation code.

Signed-off-by: Avi Kivity 
Signed-off-by: Gleb Natapov 
---
 arch/x86/kvm/svm.c |   15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index e32243e..5fa553b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1014,6 +1014,7 @@ static void init_vmcb(struct vcpu_svm *svm)
set_intercept(svm, INTERCEPT_NMI);
set_intercept(svm, INTERCEPT_SMI);
set_intercept(svm, INTERCEPT_SELECTIVE_CR0);
+   set_intercept(svm, INTERCEPT_RDPMC);
set_intercept(svm, INTERCEPT_CPUID);
set_intercept(svm, INTERCEPT_INVD);
set_intercept(svm, INTERCEPT_HLT);
@@ -2770,6 +2771,19 @@ static int emulate_on_interception(struct vcpu_svm *svm)
return emulate_instruction(&svm->vcpu, 0) == EMULATE_DONE;
 }
 
+static int rdpmc_interception(struct vcpu_svm *svm)
+{
+   int err;
+
+   if (!static_cpu_has(X86_FEATURE_NRIPS))
+   return emulate_on_interception(svm);
+
+   err = kvm_rdpmc(&svm->vcpu);
+   kvm_complete_insn_gp(&svm->vcpu, err);
+
+   return 1;
+}
+
 bool check_selective_cr0_intercepted(struct vcpu_svm *svm, unsigned long val)
 {
unsigned long cr0 = svm->vcpu.arch.cr0;
@@ -3190,6 +3204,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm) = 
{
[SVM_EXIT_SMI]  = nop_on_interception,
[SVM_EXIT_INIT] = nop_on_interception,
[SVM_EXIT_VINTR]= interrupt_window_interception,
+   [SVM_EXIT_RDPMC]= rdpmc_interception,
[SVM_EXIT_CPUID]= cpuid_interception,
[SVM_EXIT_IRET] = iret_interception,
[SVM_EXIT_INVD] = emulate_on_interception,
-- 
1.7.7.1



[PATCHv3 08/10] KVM: Expose the architectural performance monitoring CPUID leaf

2011-11-10 Thread Gleb Natapov
Provide a CPUID leaf that describes the emulated PMU.
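
For reference, a guest can decode the leaf with the unions this series
uses (an illustrative sketch; field meanings per the SDM):

	union cpuid10_eax eax;
	union cpuid10_edx edx;
	unsigned int ebx, ecx;

	cpuid(0xa, &eax.full, &ebx, &ecx, &edx.full);
	/* eax.split.version_id   - PMU version (2 with this patchset)
	 * eax.split.num_counters - number of GP counters
	 * eax.split.bit_width    - width of the GP counters
	 * edx.split.num_counters_fixed, edx.split.bit_width_fixed
	 * ebx                    - bitmask of *unavailable* arch events
	 */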

Signed-off-by: Gleb Natapov 
---
 arch/x86/kvm/x86.c |   30 +-
 1 files changed, 29 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b88426c..2c44b05 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2544,6 +2544,35 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
u32 function,
}
case 9:
break;
+   case 0xa: { /* Architectural Performance Monitoring */
+   struct x86_pmu_capability cap;
+   union cpuid10_eax eax;
+   union cpuid10_edx edx;
+
+   perf_get_x86_pmu_capability(&cap);
+
+   /*
+* Only support guest architectural pmu on a host
+* with architectural pmu.
+*/
+   if (!cap.version)
+   memset(&cap, 0, sizeof(cap));
+
+   eax.split.version_id = min(cap.version, 2);
+   eax.split.num_counters = cap.num_counters_gp;
+   eax.split.bit_width = cap.bit_width_gp;
+   eax.split.mask_length = cap.events_mask_len;
+
+   edx.split.num_counters_fixed = cap.num_counters_fixed;
+   edx.split.bit_width_fixed = cap.bit_width_fixed;
+   edx.split.reserved = 0;
+
+   entry->eax = eax.full;
+   entry->ebx = cap.events_mask;
+   entry->ecx = 0;
+   entry->edx = edx.full;
+   break;
+   }
/* function 0xb has additional index. */
case 0xb: {
int i, level_type;
@@ -2638,7 +2667,6 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
u32 function,
case 3: /* Processor serial number */
case 5: /* MONITOR/MWAIT */
case 6: /* Thermal management */
-   case 0xA: /* Architectural Performance Monitoring */
case 0x8007: /* Advanced power management */
case 0xC002:
case 0xC003:
-- 
1.7.7.1



[PATCHv3 10/10] KVM: x86 emulator: implement RDPMC (0F 33)

2011-11-10 Thread Gleb Natapov
From: Avi Kivity 

Signed-off-by: Avi Kivity 
Signed-off-by: Gleb Natapov 
---
 arch/x86/include/asm/kvm_emulate.h |1 +
 arch/x86/kvm/emulate.c |   13 -
 arch/x86/kvm/x86.c |7 +++
 3 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h 
b/arch/x86/include/asm/kvm_emulate.h
index 9a4acf4..ab4092e 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -181,6 +181,7 @@ struct x86_emulate_ops {
int (*set_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong value);
int (*set_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 data);
int (*get_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 
*pdata);
+   int (*read_pmc)(struct x86_emulate_ctxt *ctxt, u32 pmc, u64 *pdata);
void (*halt)(struct x86_emulate_ctxt *ctxt);
void (*wbinvd)(struct x86_emulate_ctxt *ctxt);
int (*fix_hypercall)(struct x86_emulate_ctxt *ctxt);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c0ee85b..d76a852 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2623,6 +2623,17 @@ static int em_rdtsc(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
+static int em_rdpmc(struct x86_emulate_ctxt *ctxt)
+{
+   u64 pmc;
+
+   if (ctxt->ops->read_pmc(ctxt, ctxt->regs[VCPU_REGS_RCX], &pmc))
+   return emulate_gp(ctxt, 0);
+   ctxt->regs[VCPU_REGS_RAX] = (u32)pmc;
+   ctxt->regs[VCPU_REGS_RDX] = pmc >> 32;
+   return X86EMUL_CONTINUE;
+}
+
 static int em_mov(struct x86_emulate_ctxt *ctxt)
 {
ctxt->dst.val = ctxt->src.val;
@@ -3254,7 +3265,7 @@ static struct opcode twobyte_table[256] = {
DI(ImplicitOps | Priv, wrmsr),
IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
DI(ImplicitOps | Priv, rdmsr),
-   DIP(ImplicitOps, rdpmc, check_rdpmc),
+   IIP(ImplicitOps, em_rdpmc, rdpmc, check_rdpmc),
I(ImplicitOps | VendorSpecific, em_sysenter),
I(ImplicitOps | Priv | VendorSpecific, em_sysexit),
N, N,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c44b05..f78b48c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4662,6 +4662,12 @@ static int emulator_set_msr(struct x86_emulate_ctxt 
*ctxt,
return kvm_set_msr(emul_to_vcpu(ctxt), msr_index, data);
 }
 
+static int emulator_read_pmc(struct x86_emulate_ctxt *ctxt,
+u32 pmc, u64 *pdata)
+{
+   return kvm_pmu_read_pmc(emul_to_vcpu(ctxt), pmc, pdata);
+}
+
 static void emulator_halt(struct x86_emulate_ctxt *ctxt)
 {
emul_to_vcpu(ctxt)->arch.halt_request = 1;
@@ -4714,6 +4720,7 @@ static struct x86_emulate_ops emulate_ops = {
.set_dr  = emulator_set_dr,
.set_msr = emulator_set_msr,
.get_msr = emulator_get_msr,
+   .read_pmc= emulator_read_pmc,
.halt= emulator_halt,
.wbinvd  = emulator_wbinvd,
.fix_hypercall   = emulator_fix_hypercall,
-- 
1.7.7.1



[PATCHv3 02/10] KVM: Expose a version 2 architectural PMU to guests

2011-11-10 Thread Gleb Natapov
Use perf_events to emulate an architectural PMU, version 2.

Based on PMU version 1 emulation by Avi Kivity.
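
The core idea, greatly simplified (a sketch, not the patch itself;
overflow_fn and pmc stand in for the real callback and per-counter
state): a guest write to a PERFEVTSELx MSR is translated into a host
perf event counting for the vcpu task, restricted to the privilege
levels the guest asked for:

	struct perf_event_attr attr = {
		.type		= PERF_TYPE_RAW,
		.size		= sizeof(attr),
		.config		= eventsel & (ARCH_PERFMON_EVENTSEL_EVENT |
					      ARCH_PERFMON_EVENTSEL_UMASK),
		.exclude_user	= !(eventsel & ARCH_PERFMON_EVENTSEL_USR),
		.exclude_kernel	= !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
	};
	struct perf_event *event;

	event = perf_event_create_kernel_counter(&attr, -1 /* any cpu */,
						 current, overflow_fn, pmc);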

Signed-off-by: Gleb Natapov 
---
 arch/x86/include/asm/kvm_host.h |   48 
 arch/x86/kvm/Kconfig|1 +
 arch/x86/kvm/Makefile   |2 +-
 arch/x86/kvm/pmu.c  |  531 +++
 arch/x86/kvm/x86.c  |   24 ++-
 include/linux/kvm_host.h|2 +
 6 files changed, 598 insertions(+), 10 deletions(-)
 create mode 100644 arch/x86/kvm/pmu.c

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d83264..5807a49 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -16,10 +16,12 @@
 #include <linux/mmu_notifier.h>
 #include <linux/tracepoint.h>
 #include <linux/cpumask.h>
+#include <linux/irq_work.h>
 
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
 #include <linux/kvm_types.h>
+#include <linux/perf_event.h>
 
 #include <asm/pvclock-abi.h>
 #include <asm/desc.h>
@@ -289,6 +291,37 @@ struct kvm_mmu {
u64 pdptrs[4]; /* pae */
 };
 
+enum pmc_type {
+   KVM_PMC_GP = 0,
+   KVM_PMC_FIXED,
+};
+
+struct kvm_pmc {
+   enum pmc_type type;
+   u8 idx;
+   u64 counter;
+   u64 eventsel;
+   struct perf_event *perf_event;
+   struct kvm_vcpu *vcpu;
+};
+
+struct kvm_pmu {
+   unsigned nr_arch_gp_counters;
+   unsigned nr_arch_fixed_counters;
+   unsigned available_event_types;
+   u64 fixed_ctr_ctrl;
+   u64 global_ctrl;
+   u64 global_status;
+   u64 global_ovf_ctrl;
+   u64 counter_bitmask[2];
+   u64 global_ctrl_mask;
+   u8 version;
+   struct kvm_pmc gp_counters[X86_PMC_MAX_GENERIC];
+   struct kvm_pmc fixed_counters[X86_PMC_MAX_FIXED];
+   struct irq_work irq_work;
+   u64 reprogram_pmi;
+};
+
 struct kvm_vcpu_arch {
/*
 * rip and regs accesses must go through
@@ -422,6 +455,8 @@ struct kvm_vcpu_arch {
unsigned access;
gfn_t mmio_gfn;
 
+   struct kvm_pmu pmu;
+
/* used for guest single stepping over the given code position */
unsigned long singlestep_rip;
 
@@ -881,4 +916,17 @@ extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, 
gfn_t gfn);
 
 void kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
 
+int kvm_is_in_guest(void);
+
+void kvm_pmu_init(struct kvm_vcpu *vcpu);
+void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
+void kvm_pmu_reset(struct kvm_vcpu *vcpu);
+void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu);
+bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr);
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
+int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
+void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
+void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ff5790d..c27dd11 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -35,6 +35,7 @@ config KVM
select KVM_MMIO
select TASKSTATS
select TASK_DELAY_ACCT
+   select PERF_EVENTS
---help---
  Support hosting fully virtualized guest machines using hardware
  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index f15501f..cfca03f 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -12,7 +12,7 @@ kvm-$(CONFIG_IOMMU_API)   += $(addprefix 
../../../virt/kvm/, iommu.o)
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)
 
 kvm-y  += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
-  i8254.o timer.o
+  i8254.o timer.o pmu.o
 kvm-intel-y+= vmx.o
 kvm-amd-y  += svm.o
 
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
new file mode 100644
index 000..1888aa4
--- /dev/null
+++ b/arch/x86/kvm/pmu.c
@@ -0,0 +1,531 @@
+/*
+ * Kernel-based Virtual Machine -- Performance Monitoring Unit support
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates.
+ *
+ * Authors:
+ *   Avi Kivity   
+ *   Gleb Natapov 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <linux/types.h>
+#include <linux/kvm_host.h>
+#include <linux/perf_event.h>
+#include "x86.h"
+#include "lapic.h"
+
+static struct kvm_arch_event_perf_mapping {
+   u8 eventsel;
+   u8 unit_mask;
+   unsigned event_type;
+   bool inexact;
+} arch_events[] = {
+   /* Index must match CPUID 0x0A.EBX bit vector */
+   [0] = { 0x3c, 0x00, PERF_COUNT_HW_CPU_CYCLES },
+   [1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },
+   [2] = { 0x3c, 0x01, PERF_COUNT_HW_BUS_CYCLES  },
+   [3] = { 0x2e, 0x4f, PERF_COUNT_HW_CACHE_REFERENCES },
+   [4] = { 0x2e, 0x41, PERF_COUNT_HW_CACHE_MISSES },
+   [5] = { 0xc4, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
+   [6] = { 0xc5, 0x00, PERF_COUNT_HW_BRANCH_MISSES },
+};
+
+/* mapping between fixed pmc index and arch_event

[PATCHv3 03/10] KVM: Add generic RDPMC support

2011-11-10 Thread Gleb Natapov
From: Avi Kivity 

Add a helper function that emulates the RDPMC instruction operation.

Signed-off-by: Avi Kivity 
Signed-off-by: Gleb Natapov 
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/x86.c  |   15 +++
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5807a49..422824c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -756,6 +756,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 
data);
 
 unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu);
 void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+bool kvm_rdpmc(struct kvm_vcpu *vcpu);
 
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 52a8666..b88426c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -815,6 +815,21 @@ int kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned 
long *val)
 }
 EXPORT_SYMBOL_GPL(kvm_get_dr);
 
+bool kvm_rdpmc(struct kvm_vcpu *vcpu)
+{
+   u32 ecx = kvm_register_read(vcpu, VCPU_REGS_RCX);
+   u64 data;
+   int err;
+
+   err = kvm_pmu_read_pmc(vcpu, ecx, &data);
+   if (err)
+   return err;
+   kvm_register_write(vcpu, VCPU_REGS_RAX, (u32)data);
+   kvm_register_write(vcpu, VCPU_REGS_RDX, data >> 32);
+   return err;
+}
+EXPORT_SYMBOL_GPL(kvm_rdpmc);
+
 /*
  * List of msr numbers which we expose to userspace through KVM_GET_MSRS
  * and KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST.
-- 
1.7.7.1



[PATCHv3 07/10] perf, x86: expose perf capability to other modules.

2011-11-10 Thread Gleb Natapov
KVM needs to know the host's perf capabilities to decide which PMU it can
expose to a
guest.

Signed-off-by: Gleb Natapov 
---
 arch/x86/include/asm/perf_event.h |   15 +++
 arch/x86/kernel/cpu/perf_event.c  |   11 +++
 2 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index c6998bc..5487ad6 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -95,6 +95,15 @@ union cpuid10_edx {
unsigned int full;
 };
 
+struct x86_pmu_capability {
+   int version;
+   int num_counters_gp;
+   int num_counters_fixed;
+   int bit_width_gp;
+   int bit_width_fixed;
+   unsigned int events_mask;
+   int events_mask_len;
+};
 
 /*
  * Fixed-purpose performance events:
@@ -216,6 +225,7 @@ struct perf_guest_switch_msr {
 };
 
 extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
 #else
 static inline perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
 {
@@ -223,6 +233,11 @@ static inline perf_guest_switch_msr 
*perf_guest_get_msrs(int *nr)
return NULL;
 }
 
+static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
+{
+   memset(cap, 0, sizeof(*cap));
+}
+
 static inline void perf_events_lapic_init(void){ }
 #endif
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6408910..5af5996 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1570,3 +1570,14 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 
return misc;
 }
+
+void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
+{
+   cap->version = x86_pmu.version;
+   cap->num_counters_gp = x86_pmu.num_counters;
+   cap->num_counters_fixed = x86_pmu.num_counters_fixed;
+   cap->bit_width_gp = cap->bit_width_fixed = x86_pmu.cntval_bits;
+   cap->events_mask = (unsigned int)x86_pmu.events_maskl;
+   cap->events_mask_len = x86_pmu.events_mask_len;
+}
+EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability);
-- 
1.7.7.1



[PATCHv3 01/10] KVM: Expose kvm_lapic_local_deliver()

2011-11-10 Thread Gleb Natapov
From: Avi Kivity 

Needed to deliver performance monitoring interrupts.
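
The intended use (a sketch; the pmu.c patch in this series does
essentially this when a counter overflows):

void kvm_deliver_pmi(struct kvm_vcpu *vcpu)
{
	/* Raise the PMI through the guest APIC's performance-counter
	 * LVT entry, just as real hardware would. */
	if (vcpu->arch.apic)
		kvm_apic_local_deliver(vcpu->arch.apic, APIC_LVTPC);
}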

Signed-off-by: Avi Kivity 
Signed-off-by: Gleb Natapov 
---
 arch/x86/kvm/lapic.c |2 +-
 arch/x86/kvm/lapic.h |1 +
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 54abb40..e87e43e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1120,7 +1120,7 @@ int apic_has_pending_timer(struct kvm_vcpu *vcpu)
return 0;
 }
 
-static int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
+int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
 {
u32 reg = apic_get_reg(apic, lvt_type);
int vector, mode, trig_mode;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 138e8cc..6f4ce25 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -34,6 +34,7 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest);
 int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
+int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
 
 u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
 void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
-- 
1.7.7.1



[PATCHv3 06/10] x86, perf: disable non available architectural events.

2011-11-10 Thread Gleb Natapov
Intel CPUs report non-available architectural events in cpuid leaf
0AH.EBX. Use it to disable events that are not available according
to the CPU.

Signed-off-by: Gleb Natapov 
---
 arch/x86/include/asm/perf_event.h  |   14 ++
 arch/x86/kernel/cpu/perf_event.h   |5 +
 arch/x86/kernel/cpu/perf_event_intel.c |   29 -
 3 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index f61c62f..c6998bc 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -57,6 +57,7 @@
(1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
 
 #define ARCH_PERFMON_BRANCH_MISSES_RETIRED 6
+#define ARCH_PERFMON_EVENTS_COUNT  7
 
 /*
  * Intel "Architectural Performance Monitoring" CPUID
@@ -72,6 +73,19 @@ union cpuid10_eax {
unsigned int full;
 };
 
+union cpuid10_ebx {
+   struct {
+   unsigned int no_unhalted_core_cycles:1;
+   unsigned int no_instructions_retired:1;
+   unsigned int no_unhalted_reference_cycles:1;
+   unsigned int no_llc_reference:1;
+   unsigned int no_llc_misses:1;
+   unsigned int no_branch_instruction_retired:1;
+   unsigned int no_branch_misses_retired:1;
+   } split;
+   unsigned int full;
+};
+
 union cpuid10_edx {
struct {
unsigned int num_counters_fixed:5;
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index b9698d4..cd0ebcd 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -259,6 +259,11 @@ struct x86_pmu {
int num_counters_fixed;
int cntval_bits;
u64 cntval_mask;
+   union {
+   unsigned long events_maskl;
+   unsigned long 
events_mask[BITS_TO_LONGS(ARCH_PERFMON_EVENTS_COUNT)];
+   };
+   int events_mask_len;
int apic;
u64 max_period;
struct event_constraint *
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index e09ca20..301369a 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1544,13 +1544,23 @@ static void intel_clovertown_quirks(void)
x86_pmu.pebs_constraints = NULL;
 }
 
+static int intel_event_id_to_hw_id[] = {
+   PERF_COUNT_HW_CPU_CYCLES,
+   PERF_COUNT_HW_INSTRUCTIONS,
+   PERF_COUNT_HW_BUS_CYCLES,
+   PERF_COUNT_HW_CACHE_REFERENCES,
+   PERF_COUNT_HW_CACHE_MISSES,
+   PERF_COUNT_HW_BRANCH_INSTRUCTIONS,
+   PERF_COUNT_HW_BRANCH_MISSES,
+};
+
 __init int intel_pmu_init(void)
 {
union cpuid10_edx edx;
union cpuid10_eax eax;
+   union cpuid10_ebx ebx;
unsigned int unused;
-   unsigned int ebx;
-   int version;
+   int version, bit;
 
if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
switch (boot_cpu_data.x86) {
@@ -1566,8 +1576,8 @@ __init int intel_pmu_init(void)
 * Check whether the Architectural PerfMon supports
 * Branch Misses Retired hw_event or not.
 */
-   cpuid(10, &eax.full, &ebx, &unused, &edx.full);
-   if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED)
+   cpuid(10, &eax.full, &ebx.full, &unused, &edx.full);
+   if (eax.split.mask_length < ARCH_PERFMON_EVENTS_COUNT)
return -ENODEV;
 
version = eax.split.version_id;
@@ -1643,7 +1653,7 @@ __init int intel_pmu_init(void)
/* UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 */
intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 
0x1803fb1;
 
-   if (ebx & 0x40) {
+   if (ebx.split.no_branch_misses_retired) {
/*
 * Erratum AAJ80 detected, we work it around by using
 * the BR_MISP_EXEC.ANY event. This will over-count
@@ -1651,6 +1661,7 @@ __init int intel_pmu_init(void)
 * architectural event which is often completely bogus:
 */
intel_perfmon_event_map[PERF_COUNT_HW_BRANCH_MISSES] = 
0x7f89;
+   ebx.split.no_branch_misses_retired = 0;
 
pr_cont("erratum AAJ80 worked around, ");
}
@@ -1729,5 +1740,13 @@ __init int intel_pmu_init(void)
break;
}
}
+   x86_pmu.events_maskl= ebx.full;
+   x86_pmu.events_mask_len = eax.split.mask_length;
+
+   /* disable events that are reported as not present by cpuid */
+   for_each_set_bit(bit, x86_pmu.events_mask,
+   min(x86_pmu.events_mask_len, x86_pmu.max_events))
+   intel_perfmon_event_map

[PATCHv3 05/10] KVM: VMX: Intercept RDPMC

2011-11-10 Thread Gleb Natapov
From: Avi Kivity 

Intercept RDPMC and forward it to the PMU emulation code.

Signed-off-by: Avi Kivity 
Signed-off-by: Gleb Natapov 
---
 arch/x86/kvm/vmx.c |   15 ++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6e28d58..a6535ba 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1956,6 +1956,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #endif
CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
+   CPU_BASED_RDPMC_EXITING |
CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
/*
 * We can allow some features even when not supported by the
@@ -2414,7 +2415,8 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
  CPU_BASED_USE_TSC_OFFSETING |
  CPU_BASED_MWAIT_EXITING |
  CPU_BASED_MONITOR_EXITING |
- CPU_BASED_INVLPG_EXITING;
+ CPU_BASED_INVLPG_EXITING |
+ CPU_BASED_RDPMC_EXITING;
 
if (yield_on_hlt)
min |= CPU_BASED_HLT_EXITING;
@@ -4615,6 +4617,16 @@ static int handle_invlpg(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static int handle_rdpmc(struct kvm_vcpu *vcpu)
+{
+   int err;
+
+   err = kvm_rdpmc(vcpu);
+   kvm_complete_insn_gp(vcpu, err);
+
+   return 1;
+}
+
 static int handle_wbinvd(struct kvm_vcpu *vcpu)
 {
skip_emulated_instruction(vcpu);
@@ -5565,6 +5577,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu 
*vcpu) = {
[EXIT_REASON_HLT] = handle_halt,
[EXIT_REASON_INVD]= handle_invd,
[EXIT_REASON_INVLPG]  = handle_invlpg,
+   [EXIT_REASON_RDPMC]   = handle_rdpmc,
[EXIT_REASON_VMCALL]  = handle_vmcall,
[EXIT_REASON_VMCLEAR] = handle_vmclear,
[EXIT_REASON_VMLAUNCH]= handle_vmlaunch,
-- 
1.7.7.1



Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-10 Thread Avi Kivity
On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> KVM's existing shadow MMU code already supports nested TDP. To use it, we
> need to set up a new "MMU context" for nested EPT, and create a few callbacks
> for it (nested_ept_*()). We then need to switch back and forth between this
> nested context and the regular MMU context when switching between L1 and L2.
>
> +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> +{
> + int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
> +
> + vcpu->arch.mmu.set_cr3   = vmx_set_cr3;
> + vcpu->arch.mmu.get_cr3   = nested_ept_get_cr3;
> + vcpu->arch.mmu.get_pdptr = nested_ept_get_pdptr;
> + vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
> + vcpu->arch.mmu.shadow_root_level = get_ept_level();
> +
> + vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
> +
> + return r;
> +}
> +
>

kvm_init_shadow_mmu() will cause ->page_fault to be set to something
like paging64_page_fault(), which is geared to reading EPT ptes.  How
does this work?

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/10] nEPT: Nested EPT support for Nested VMX

2011-11-10 Thread Avi Kivity
On 11/10/2011 11:57 AM, Nadav Har'El wrote:
> The following patches add nested EPT support to Nested VMX.
>
> Nested EPT means emulating EPT for an L1 guest, allowing it to use EPT when
> running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
> its own cr3 and take its own page faults without either of L0 or L1 getting
> involved. In many workloads this significantly improves L2's performance over
> the previous two alternatives (shadow page tables over ept, and shadow page
> tables over shadow page tables). Our paper [1] described these three options,
> and the advantages of nested EPT ("multidimensional paging").
>
> Nested EPT is enabled by default (if the hardware supports EPT), so users do
> not have to do anything special to enjoy the performance improvement that
> this patch gives to L2 guests.
>
> Just as a non-scientific, non-representative indication of the kind of
> dramatic performance improvement you may see in workloads that have a lot of
> context switches and page faults, here is a measurement of the time
> an example single-threaded "make" took in L2 (kvm over kvm):
>
>  shadow over shadow: 105 seconds
>  ("ept=0" forces this)
>
>  shadow over EPT: 87 seconds
>  (the previous default; Can be forced now with "nested_ept=0")
>
>  EPT over EPT: 29 seconds
>  (the default after this patch)
>
> Note that the same test on L1 (with EPT) took 25 seconds, so for this example
> workload, performance of nested virtualization is now very close to that of
> single-level virtualization.
>
>

This patchset is missing a fairly hairy patch that makes reading L2
virtual addresses work.  The standard example is L1 passing a bit of
hardware (emulated in L0) to a L2; when L2 accesses it, the instruction
will fault and need to be handled in L0, transparently to L1.  The
emulation can cause a fault to be injected to L2, or an EPT violation
or misconfiguration injected to L1.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 01/10] nEPT: Module option

2011-11-10 Thread Avi Kivity
On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> Add a module option "nested_ept" determining whether to enable Nested EPT.
>
> Nested EPT means emulating EPT for an L1 guest so that L1 can use EPT when
> running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
> its own cr3 and take its own page faults without either of L0 or L1 getting
> involved. This often significantly improves L2's performance over the
> previous two alternatives (shadow page tables over ept, and shadow page
> tables over shadow page tables).
>
> nested_ept is currently enabled by default (when nested VMX is enabled),
> unless L0 doesn't have EPT or disabled it with ept=0.
>
> Users would not normally want to explicitly disable this option. One reason
> why one might want to disable it is to force L1 to make do without the EPT
> capability, when anticipating a future need to migrate this L1 to another
> host which doesn't have EPT. Note that currently there is no API to turn off
> nested EPT for just a single L1 guest. However, obviously, an individual L1
> guest may choose not to use EPT - the nested_cpu_has_ept() checks if L1
> actually used EPT when running L2.
>
> In the future, we can support emulation of EPT for L1 *always*, even when L0
> itself doesn't have EPT. This so-called "EPT on shadow page tables" mode
> has some theoretical advantages over the baseline "shadow page tables on
> shadow page tables" mode typically used when EPT is not available to L0 -
> namely that L2's cr3 changes and page faults can be handled in L0 and do not
> need to be propagated to L1. However, currently we do not support this mode,
> and it is becoming less interesting as newer processors all support EPT.
>
>

I think we can live without this.  But we do need a way to control what
features are exposed to the guest, for compatibility and live migration
purposes, as we do with cpuid.  So we need some way for host userspace
to write to the vmx read-only feature reporting MSRs.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT

2011-11-10 Thread Avi Kivity
On 11/10/2011 01:03 PM, Nadav Har'El wrote:
> On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 04/10] nEPT: Fix 
> page table format in nested EPT":
> > > @@ -287,6 +287,7 @@ struct kvm_mmu {
> > >   bool nx;
> > >  
> > >   u64 pdptrs[4]; /* pae */
> > > + u64 link_shadow_page_set_bits;
> >...
> > > +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 
> > > set_bits)
> > >  {
> > > - u64 spte;
> > > -
> > > - spte = __pa(sp->spt)
> > > - | PT_PRESENT_MASK | PT_ACCESSED_MASK
> > > - | PT_WRITABLE_MASK | PT_USER_MASK;
> > > - mmu_spte_set(sptep, spte);
> > > + mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
> > >  }
> > >
> > 
> > Minor nit: you can just use link_shadow_page_set_bits here instead of
> > passing it around (unless later you have a different value for the
> > parameter?)
>
> The problem was that link_shadow_page did not take an kvm_mmu parameter,
> so I don't know where to find this link_shadow_page_set_bits. So either
> I pass the pointer to the entire kvm_mmu to link_shadow_page, or I just
> pass the only field which I need... I thought that passing the single
> field I need was cleaner - but I can easily change it if you prefer to
> pass the kvm_mmu.

Ah, doesn't matter either way.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 08/10] nEPT: Nested INVEPT

2011-11-10 Thread Avi Kivity
On 11/10/2011 12:01 PM, Nadav Har'El wrote:
> If we let L1 use EPT, we should probably also support the INVEPT instruction.
>
> + case VMX_EPT_EXTENT_CONTEXT:
> + if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
> + nested_vmx_failValid(vcpu,
> + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> + else {
> + /*
> +  * We efficiently handle the common case, of L1
> +  * invalidating the last eptp it used to run L2.
> +  * TODO: Instead of saving one last_eptp02, look up
> +  * operand.eptp in the shadow EPT table cache, to
> +  * find its shadow. Then last_eptp02 won't be needed.
> +  */
> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + if (vmcs12 && nested_cpu_has_ept(vmcs12) &&
> + (vmcs12->ept_pointer == operand.eptp) &&
> + vmx->nested.last_eptp02)
> + ept_sync_context(vmx->nested.last_eptp02);
> + else
> + ept_sync_global();

Are either of these needed?  Won't a write to a shadowed EPT table cause
them anyway?

> + nested_vmx_succeed(vcpu);
> + }
> + break;
> + case VMX_EPT_EXTENT_INDIVIDUAL_ADDR:
> + if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT))
> + nested_vmx_failValid(vcpu,
> + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> + else {
> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + if (vmcs12 && nested_cpu_has_ept(vmcs12) &&
> + (vmcs12->ept_pointer == operand.eptp) &&
> + vmx->nested.last_eptp02)
> + ept_sync_individual_addr(
> + vmx->nested.last_eptp02, operand.gpa);

Same here.

> + else
> + ept_sync_global();
> + nested_vmx_succeed(vcpu);
> + }
> + break;
> + default:
> + nested_vmx_failValid(vcpu,
> + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> + }
> +
> + skip_emulated_instruction(vcpu);
> + return 1;
> +}
> +
>

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread cody

On 11/10/2011 03:31 PM, Ohad Ben-Cohen wrote:

> On Thu, Nov 10, 2011 at 8:17 AM, Kai Huang  wrote:
>> Seems the unmap function doesn't take phys as a parameter; does this
>> mean domain->ops->unmap will walk through the page table to find out
>> the actual page size?
>
> The short answer is yes, and furthermore, we are also considering
> removing the size param from domain->ops->unmap entirely at some point.
>
> We had a long discussion about it, please see:
>
> https://lkml.org/lkml/2011/10/10/234

Yes, I've seen your discussion; I followed this thread from the beginning :)

How about the IOTLB flush? As I said, I think we need to consider that an
IOMMU (even one that does not exist yet) may have limitations on IOTLB
flushes, and hiding the page size from the IOTLB flush code may hurt
performance or, even worse, trigger undefined behavior.


-cody


Re: [PATCH 6/9] perf: expose perf capability to other modules.

2011-11-10 Thread Jason Wessel
On 11/10/2011 02:58 AM, Frederic Weisbecker wrote:
> On Mon, Nov 07, 2011 at 02:45:17PM +, Will Deacon wrote:
>> Hi Frederic,
>>
>> On Wed, Nov 02, 2011 at 07:42:04AM +, Frederic Weisbecker wrote:
>>> On Tue, Nov 01, 2011 at 10:20:04AM -0600, David Ahern wrote:
 Right. Originally it could be enabled/disabled. Right now it cannot be,
 but I believe Frederic is working on making it configurable again.

 David
>>> Yep. Will Deacon is working on making the breakpoints able to process
>>> pure arch information (ie: without being forced to use the perf attr
>>> as a midlayer to define them).
>>>
>>> Once we have that I can separate the breakpoints implementation from perf
>>> and make it opt-able.
>> How do you foresee kdb fitting into this? I see that currently [on x86] we
>> cook up perf_event structures with a specific overflow handler set. If we
>> want to move this over to using a completely arch-defined structure, then
>> we're going to end up with an overflow handler field in both perf_event
>> *and* the arch-specific structure, which doesn't feel right to me.
>>
>> Of course, if the goal is only to separate ptrace (i.e. user debugging) from
>> the perf dependency then we don't need the overflow handler because we'll
>> always just send SIGTRAP to the current task.
>>
>> Any ideas?
> I don't know if we want to convert x86/kgdb to use pure arch breakpoints.
> If kgdb one day wants to extend this use to generic code, it may be a good
> idea to keep the things as is. I don't know, I'm adding Jason in Cc.

I think the important part is to share the allocation code (meaning who owns
which breakpoint slots).  This is why kgdb/kdb allocates the perf structures.
The kgdb code will also directly write data to the slots once it has reserved
them; it would be good to share that code too, but it was not shared because it
was not usable early enough in the boot cycle on x86.

Certainly there are others who could consume the same infrastructure such as 
kprobes.

Jason.


Re: [PATCHv2 6/9] perf: expose perf capability to other modules.

2011-11-10 Thread Gleb Natapov
On Tue, Nov 08, 2011 at 03:12:27PM +0100, Peter Zijlstra wrote:
> > I do not want to introduce
> > incidental regressions. For instance the patch below will introduce
> > regression on my Nehalem cpu. It reports value 0x44 in cpuid10.ebx which
> > means that unhalted_reference_cycles is not available (bit set means
> > event is not available), but event still works! Actually it is listed as
> > supported by the cpu in Table A-4 SDM 3B. Go figure. 
> 
> We'd better figure out why your machine says that. It could be we need
> another quirk for the nehalem machines, it could be your BIOS is smoking
> crack and there's nothing we can do about it.
> 
Looks like on Nehalem the 013c event does not count unhalted reference
cycles as the spec says it should, but instead increments at a 133MHz
frequency. So disabling it should be the right thing to do.

--
Gleb.


Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT

2011-11-10 Thread Nadav Har'El
On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 04/10] nEPT: Fix page 
table format in nested EPT":
> > @@ -287,6 +287,7 @@ struct kvm_mmu {
> > bool nx;
> >  
> > u64 pdptrs[4]; /* pae */
> > +   u64 link_shadow_page_set_bits;
>...
> > +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 
> > set_bits)
> >  {
> > -   u64 spte;
> > -
> > -   spte = __pa(sp->spt)
> > -   | PT_PRESENT_MASK | PT_ACCESSED_MASK
> > -   | PT_WRITABLE_MASK | PT_USER_MASK;
> > -   mmu_spte_set(sptep, spte);
> > +   mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
> >  }
> >
> 
> Minor nit: you can just use link_shadow_page_set_bits here instead of
> passing it around (unless later you have a different value for the
> parameter?)

The problem was that link_shadow_page did not take an kvm_mmu parameter,
so I don't know where to find this link_shadow_page_set_bits. So either
I pass the pointer to the entire kvm_mmu to link_shadow_page, or I just
pass the only field which I need... I thought that passing the single
field I need was cleaner - but I can easily change it if you prefer to
pass the kvm_mmu.

Thanks,

Nadav.

-- 
Nadav Har'El|  Thursday, Nov 10 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |I had a lovely evening. Unfortunately,
http://nadav.harel.org.il   |this wasn't it. - Groucho Marx


Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Kevin Wolf
On 09.11.2011 22:01, Anthony Liguori wrote:
> On 11/09/2011 03:00 PM, Michael S. Tsirkin wrote:
>> On Wed, Nov 09, 2011 at 02:22:02PM -0600, Anthony Liguori wrote:
>>> On 11/09/2011 02:18 PM, Michael S. Tsirkin wrote:
 On Wed, Nov 09, 2011 at 11:35:54AM -0600, Anthony Liguori wrote:
> On 11/09/2011 11:02 AM, Avi Kivity wrote:
>> On 11/09/2011 06:39 PM, Anthony Liguori wrote:
>>>
>>> Migration with qcow2 is not a supported feature for 1.0.  Migration is
>>> only supported with raw images using coherent shared storage[1].
>>>
>>> [1] NFS is only coherent with close-to-open which right now is not
>>> good enough for migration.
>>
>> Say what?
>
> Due to block format probing, we read at least the first sector of
> the disk during start up.

 A simple solution is not to do any probing before the VM is first
 started on the incoming path.

 Any issues with this?

>>>
>>> http://mid.gmane.org/1284213896-12705-4-git-send-email-aligu...@us.ibm.com
>>> I think Kevin wanted open to get delayed.
>>>
>>> Regards,
>>>
>>> Anthony Liguori
>>
>> So, this patchset just needs to be revived and polished up?
> 
> What I took from the feedback was that Kevin wanted to defer open until the 
> device model started.  That eliminates the need to reopen or have an 
> invalidation 
> callback.
> 
> I think it would be good for Kevin to comment here though because I might 
> have 
> misunderstood his feedback.

Your approach was to delay reads, but still keep the image open. I think
I worried that we might have additional reads somewhere that we don't
know about, and this is why I proposed delaying the open as well, so
that any read would always fail.

I believe just reopening the image is (almost?) as good and it's way
easier to do, so I would be inclined to do that for 1.0.

I'm not 100% sure about cases like iscsi, where reopening doesn't help.
I think delaying the open doesn't help there either: if you migrate from
A to B and then back from B to A, you could still get old data. So for
iscsi probably cache=none remains the only safe choice, whatever we do.

Kevin


Re: [PATCH 04/10] nEPT: Fix page table format in nested EPT

2011-11-10 Thread Avi Kivity
On 11/10/2011 11:59 AM, Nadav Har'El wrote:
> When the existing KVM MMU code creates a shadow page table, it assumes it
> has the normal x86 page table format. This is obviously correct for normal
> shadow page tables, and also correct for AMD's NPT.
> Unfortunately, Intel's EPT page tables differ in subtle ways from ordinary
> page tables, so when we create a shadow EPT table (i.e., in nested EPT),
> we need to slightly modify the way in which this table is built.
>
> In particular, when mmu.c's link_shadow_page() creates non-leaf page table
> entries, it used to enable the "present", "accessed", "writable" and "user"
> flags on these entries. While this is correct for ordinary page tables, it
> is wrong in EPT tables - where these bits actually have completely different
> meaning (compare PT_*_MASK from mmu.h to VMX_EPT_*_MASK from vmx.h).
> In particular, leaving the code as-is causes bit 5 of the PTE to be turned on
> (supposedly for PT_ACCESSED_MASK), which is a reserved bit in EPT and causes
> an "EPT Misconfiguration" failure.
>
> So we must move link_shadow_page's list of extra bits to a new mmu context
> field, which is set differently for nested EPT.
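
To make the mismatch concrete (an annotation, with mask values taken from
mmu.h and vmx.h):

/* Legacy non-leaf bits: PT_PRESENT|PT_WRITABLE|PT_USER|PT_ACCESSED = 0x27
 *   bit 0: x86 "present"  -> EPT "readable"    (coincidentally harmless)
 *   bit 1: x86 "writable" -> EPT "writable"    (coincidentally harmless)
 *   bit 2: x86 "user"     -> EPT "executable"  (unintended meaning)
 *   bit 5: x86 "accessed" -> reserved in EPT   => EPT misconfiguration
 * What nested EPT wants instead:
 *   VMX_EPT_READABLE|WRITABLE|EXECUTABLE = 0x07
 */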
>
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/include/asm/kvm_host.h |1 +
>  arch/x86/kvm/mmu.c  |   16 +---
>  arch/x86/kvm/paging_tmpl.h  |6 --
>  arch/x86/kvm/vmx.c  |3 +++
>  4 files changed, 17 insertions(+), 9 deletions(-)
>
> --- .before/arch/x86/include/asm/kvm_host.h   2011-11-10 11:33:59.0 
> +0200
> +++ .after/arch/x86/include/asm/kvm_host.h2011-11-10 11:33:59.0 
> +0200
> @@ -287,6 +287,7 @@ struct kvm_mmu {
>   bool nx;
>  
>   u64 pdptrs[4]; /* pae */
> + u64 link_shadow_page_set_bits;
>  };
>  
>  struct kvm_vcpu_arch {
> --- .before/arch/x86/kvm/vmx.c2011-11-10 11:33:59.0 +0200
> +++ .after/arch/x86/kvm/vmx.c 2011-11-10 11:33:59.0 +0200
> @@ -6485,6 +6485,9 @@ static int nested_ept_init_mmu_context(s
>   vcpu->arch.mmu.get_pdptr = nested_ept_get_pdptr;
>   vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
>   vcpu->arch.mmu.shadow_root_level = get_ept_level();
> + vcpu->arch.mmu.link_shadow_page_set_bits =
> + VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
> + VMX_EPT_EXECUTABLE_MASK;
>  
>   vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
>  
> --- .before/arch/x86/kvm/mmu.c2011-11-10 11:33:59.0 +0200
> +++ .after/arch/x86/kvm/mmu.c 2011-11-10 11:33:59.0 +0200
> @@ -1782,14 +1782,9 @@ static void shadow_walk_next(struct kvm_
>   return __shadow_walk_next(iterator, *iterator->sptep);
>  }
>  
> -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
> +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 
> set_bits)
>  {
> - u64 spte;
> -
> - spte = __pa(sp->spt)
> - | PT_PRESENT_MASK | PT_ACCESSED_MASK
> - | PT_WRITABLE_MASK | PT_USER_MASK;
> - mmu_spte_set(sptep, spte);
> + mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
>  }
>

Minor nit: you can just use link_shadow_page_set_bits here instead of
passing it around (unless later you have a different value for the
parameter?)

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-10 Thread Avi Kivity
On 11/10/2011 11:58 AM, Nadav Har'El wrote:
> KVM's existing shadow MMU code already supports nested TDP. To use it, we
> need to set up a new "MMU context" for nested EPT, and create a few callbacks
> for it (nested_ept_*()). We then need to switch back and forth between this
> nested context and the regular MMU context when switching between L1 and L2.
>
> +
> +static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
> + struct x86_exception *fault)
> +{
> + struct vmcs12 *vmcs12;
> + nested_vmx_vmexit(vcpu);
> + vmcs12 = get_vmcs12(vcpu);
> + /*
> +  * Note no need to set vmcs12->vm_exit_reason as it is already copied
> +  * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
> +  */

Not in all cases.  For example, L0 may emulate an L2 instruction, which
then faults at the EPT level.

> + vmcs12->exit_qualification = fault->error_code;
> + vmcs12->guest_physical_address = fault->address;
> +}

What about the guest linear address field?
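
For concreteness, roughly what answering that would take (every name
below is made up for illustration; x86_exception carries no linear
address today):

	/* hypothetical: propagate GUEST_LINEAR_ADDRESS on an EPT
	 * violation caused by a guest-linear access */
	if (fault->gla_valid) {				/* made-up flag  */
		vmcs12->guest_linear_address = fault->gla; /* made-up field */
		vmcs12->exit_qualification |= (1 << 7);	/* GLA-valid bit */
	}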

-- 
error compiling committee.c: too many arguments to function



[PATCH 09/10] nEPT: Documentation

2011-11-10 Thread Nadav Har'El
Update the documentation to no longer say that nested EPT is not supported.

Signed-off-by: Nadav Har'El 
---
 Documentation/virtual/kvm/nested-vmx.txt |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- .before/Documentation/virtual/kvm/nested-vmx.txt2011-11-10 
11:33:59.0 +0200
+++ .after/Documentation/virtual/kvm/nested-vmx.txt 2011-11-10 
11:33:59.0 +0200
@@ -38,8 +38,8 @@ The current code supports running Linux 
 Only 64-bit guest hypervisors are supported.
 
 Additional patches for running Windows under guest KVM, and Linux under
-guest VMware server, and support for nested EPT, are currently running in
-the lab, and will be sent as follow-on patchsets.
+guest VMware server, are currently running in the lab, and will be sent as
+follow-on patchsets.
 
 
 Running nested VMX


[PATCH 10/10] nEPT: Miscellaneous cleanups

2011-11-10 Thread Nadav Har'El
Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:34:00.0 +0200
@@ -611,7 +611,6 @@ static void nested_release_page_clean(st
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
@@ -875,8 +874,7 @@ static inline bool nested_cpu_has2(struc
(vmcs12->secondary_vm_exec_control & bit);
 }
 
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
-   struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
@@ -6020,7 +6018,7 @@ static int vmx_handle_exit(struct kvm_vc
 
if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
!(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
-   get_vmcs12(vcpu), vcpu)))) {
+   get_vmcs12(vcpu))))) {
if (vmx_interrupt_allowed(vcpu)) {
vmx->soft_vnmi_blocked = 0;
} else if (vmx->vnmi_blocked_time > 10LL &&


[PATCH 07/10] nEPT: Advertise EPT to L1

2011-11-10 Thread Nadav Har'El
Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:59.0 +0200
@@ -1908,6 +1908,7 @@ static u32 nested_vmx_secondary_ctls_low
 static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
/*
@@ -1980,6 +1981,16 @@ static __init void nested_vmx_setup_ctls
nested_vmx_secondary_ctls_low = 0;
nested_vmx_secondary_ctls_high &=
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+   if (nested_ept)
+   nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+
+   /* ept capabilities */
+   if (nested_ept) {
+   nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+   nested_vmx_ept_caps &= vmx_capability.ept;
+   } else
+   nested_vmx_ept_caps = 0;
+
 }
 
 static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
@@ -2079,8 +2090,8 @@ static int vmx_get_vmx_msr(struct kvm_vc
nested_vmx_secondary_ctls_high);
break;
case MSR_IA32_VMX_EPT_VPID_CAP:
-   /* Currently, no nested ept or nested vpid */
-   *pdata = 0;
+   /* Currently, no nested vpid support */
+   *pdata = nested_vmx_ept_caps;
break;
default:
return 0;


[PATCH 08/10] nEPT: Nested INVEPT

2011-11-10 Thread Nadav Har'El
If we let L1 use EPT, we should probably also support the INVEPT instruction.

Signed-off-by: Nadav Har'El 
---
 arch/x86/include/asm/vmx.h |2 
 arch/x86/kvm/vmx.c |  112 +++
 2 files changed, 114 insertions(+)

--- .before/arch/x86/include/asm/vmx.h  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/include/asm/vmx.h   2011-11-10 11:33:59.0 +0200
@@ -279,6 +279,7 @@ enum vmcs_field {
 #define EXIT_REASON_APIC_ACCESS 44
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
+#define EXIT_REASON_INVEPT 50
 #define EXIT_REASON_WBINVD 54
 #define EXIT_REASON_XSETBV 55
 
@@ -404,6 +405,7 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT   (1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT   (1ull << 17)
+#define VMX_EPT_INVEPT_BIT (1ull << 20)
 #define VMX_EPT_EXTENT_INDIVIDUAL_BIT  (1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT (1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT  (1ull << 26)
--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:59.0 +0200
@@ -351,6 +351,8 @@ struct nested_vmx {
struct list_head vmcs02_pool;
int vmcs02_num;
u64 vmcs01_tsc_offset;
+   /* Remember last EPT02, for single-context INVEPT optimization */
+   u64 last_eptp02;
/* L2 must run next, and mustn't decide to exit to L1. */
bool nested_run_pending;
/*
@@ -1987,6 +1989,10 @@ static __init void nested_vmx_setup_ctls
/* ept capabilities */
if (nested_ept) {
nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+   nested_vmx_ept_caps |=
+   VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+   VMX_EPT_EXTENT_CONTEXT_BIT |
+   VMX_EPT_EXTENT_INDIVIDUAL_BIT;
nested_vmx_ept_caps &= vmx_capability.ept;
} else
nested_vmx_ept_caps = 0;
@@ -5568,6 +5574,105 @@ static int handle_vmptrst(struct kvm_vcp
return 1;
 }
 
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+   u32 vmx_instruction_info;
+   unsigned long type;
+   gva_t gva;
+   struct x86_exception e;
+   struct {
+   u64 eptp, gpa;
+   } operand;
+
+
+   if (!nested_ept || !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   /* According to the Intel VMX instruction reference, the memory
+* operand is read even if it isn't needed (e.g., for type==global)
+*/
+   vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+   vmx_instruction_info, &gva))
+   return 1;
+   if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+   sizeof(operand), &e)) {
+   kvm_inject_page_fault(vcpu, &e);
+   return 1;
+   }
+
+   type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+   switch (type) {
+   case VMX_EPT_EXTENT_GLOBAL:
+   if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT))
+   nested_vmx_failValid(vcpu,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+   else {
+   ept_sync_global();
+   nested_vmx_succeed(vcpu);
+   }
+   break;
+   case VMX_EPT_EXTENT_CONTEXT:
+   if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
+   nested_vmx_failValid(vcpu,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+   else {
+   /*
+* We efficiently handle the common case, of L1
+* invalidating the last eptp it used to run L2.
+* TODO: Instead of saving one last_eptp02, look up
+* operand.eptp in the shadow EPT table cache, to
+* find its shadow. Then last_eptp02 won't be needed.
+*/
+   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   if (vmcs12 && nested_cpu_has_ept(vmcs12) &&
+   (vmcs12->ept_pointer == operand.eptp) &&
+   vmx->nest

[PATCH 06/10] nEPT: Some additional comments

2011-11-10 Thread Nadav Har'El
Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:59.0 +0200
@@ -5815,7 +5815,20 @@ static bool nested_vmx_exit_handled(stru
return nested_cpu_has2(vmcs12,
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
case EXIT_REASON_EPT_VIOLATION:
+   /*
+* L0 always deals with the EPT violation. If nested EPT is
+* used, and the nested mmu code discovers that the address is
+* missing in the guest EPT table (EPT12), the EPT violation
+* will be injected with nested_ept_inject_page_fault()
+*/
+   return 0;
case EXIT_REASON_EPT_MISCONFIG:
+   /*
+* L2 never directly uses L1's EPT, but rather L0's own EPT
+* table (shadow on EPT) or a merged EPT table that L0 built
+* (EPT on EPT). So any problem with the structure of the
+* table is L0's fault.
+*/
return 0;
case EXIT_REASON_WBINVD:
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_WBINVD_EXITING);
@@ -6736,7 +6749,12 @@ static void prepare_vmcs02(struct kvm_vc
vmx_set_cr4(vcpu, vmcs12->guest_cr4);
vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12));
 
-   /* shadow page tables on either EPT or shadow page tables */
+   /*
+* Note that kvm_set_cr3() and kvm_mmu_reset_context() will do the
+* right thing, and set GUEST_CR3 and/or EPT_POINTER in all supported
+* settings: 1. shadow page tables on shadow page tables, 2. shadow
+* page tables on EPT, 3. EPT on EPT.
+*/
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
 
@@ -7075,7 +7093,6 @@ void load_vmcs12_host_state(struct kvm_v
 
if (nested_cpu_has_ept(vmcs12))
nested_ept_uninit_mmu_context(vcpu);
-   /* shadow page tables on either EPT or shadow page tables */
kvm_set_cr3(vcpu, vmcs12->host_cr3);
kvm_mmu_reset_context(vcpu);
 


[PATCH 05/10] nEPT: Fix wrong test in kvm_set_cr3

2011-11-10 Thread Nadav Har'El
kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch just
comments it out.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.
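
For reference, the three translation layers involved (my diagram, not
part of the patch):

  L2 virtual address  --(L2 page tables, cr3 set by L2)--> L2 physical
  L2 physical address --(EPT12, maintained by L1)--------> L1 physical
  L1 physical address --(EPT01, maintained by L0)--------> host physical

With nested EPT, vmcs12->guest_cr3 points into the L2 physical space, so
gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT), which looks up L1 physical
memory slots, is simply the wrong lookup.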

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/x86.c |   11 ---
 1 file changed, 11 deletions(-)

--- .before/arch/x86/kvm/x86.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/x86.c   2011-11-10 11:33:59.0 +0200
@@ -690,17 +690,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, u
 */
}
 
-   /*
-* Does the new cr3 value map to physical memory? (Note, we
-* catch an invalid cr3 even in real-mode, because it would
-* cause trouble later on when we turn on paging anyway.)
-*
-* A real CPU would silently accept an invalid cr3 and would
-* attempt to use it - with largely undefined (and often hard
-* to debug) behavior on the guest side.
-*/
-   if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
-   return 1;
vcpu->arch.cr3 = cr3;
__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
vcpu->arch.mmu.new_cr3(vcpu);


[PATCH 04/10] nEPT: Fix page table format in nested EPT

2011-11-10 Thread Nadav Har'El
When the existing KVM MMU code creates a shadow page table, it assumes it
has the normal x86 page table format. This is obviously correct for normal
shadow page tables, and also correct for AMD's NPT.
Unfortunately, Intel's EPT page tables differ in subtle ways from ordinary
page tables, so when we create a shadow EPT table (i.e., in nested EPT),
we need to slightly modify the way in which this table is built.

In particular, when mmu.c's link_shadow_page() creates non-leaf page table
entries, it used to enable the "present", "accessed", "writable" and "user"
flags on these entries. While this is correct for ordinary page tables, it
is wrong in EPT tables - where these bits actually have completely different
meaning (compare PT_*_MASK from mmu.h to VMX_EPT_*_MASK from vmx.h).
In particular, leaving the code as-is causes bit 5 of the PTE to be turned on
(supposedly for PT_ACCESSED_MASK), which is a reserved bit in EPT and causes
an "EPT Misconfiguration" failure.

So we must move link_shadow_page's list of extra bits to a new mmu context
field, which is set differently for nested EPT.
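
To illustrate (the mask values below mirror mmu.h and vmx.h of this era;
the side-by-side listing is mine, not part of the patch):

/* x86 page tables (mmu.h) */
#define PT_PRESENT_MASK         (1ULL << 0)  /* present */
#define PT_WRITABLE_MASK        (1ULL << 1)  /* writable */
#define PT_USER_MASK            (1ULL << 2)  /* user-accessible */
#define PT_ACCESSED_MASK        (1ULL << 5)  /* accessed */

/* EPT tables (vmx.h) */
#define VMX_EPT_READABLE_MASK   (1ULL << 0)  /* readable */
#define VMX_EPT_WRITABLE_MASK   (1ULL << 1)  /* writable */
#define VMX_EPT_EXECUTABLE_MASK (1ULL << 2)  /* executable */
/* Bit 5 belongs to the leaf-only memory-type field and is reserved in
 * non-leaf EPT entries, so setting PT_ACCESSED_MASK in a shadow EPT
 * entry yields an EPT misconfiguration exit. */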

Signed-off-by: Nadav Har'El 
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |   16 +---
 arch/x86/kvm/paging_tmpl.h  |6 --
 arch/x86/kvm/vmx.c  |3 +++
 4 files changed, 17 insertions(+), 9 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h 2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/include/asm/kvm_host.h  2011-11-10 11:33:59.0 +0200
@@ -287,6 +287,7 @@ struct kvm_mmu {
bool nx;
 
u64 pdptrs[4]; /* pae */
+   u64 link_shadow_page_set_bits;
 };
 
 struct kvm_vcpu_arch {
--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:59.0 +0200
@@ -6485,6 +6485,9 @@ static int nested_ept_init_mmu_context(s
vcpu->arch.mmu.get_pdptr = nested_ept_get_pdptr;
vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
vcpu->arch.mmu.shadow_root_level = get_ept_level();
+   vcpu->arch.mmu.link_shadow_page_set_bits =
+   VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+   VMX_EPT_EXECUTABLE_MASK;
 
vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
 
--- .before/arch/x86/kvm/mmu.c  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/mmu.c   2011-11-10 11:33:59.0 +0200
@@ -1782,14 +1782,9 @@ static void shadow_walk_next(struct kvm_
return __shadow_walk_next(iterator, *iterator->sptep);
 }
 
-static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp)
+static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp, u64 set_bits)
 {
-   u64 spte;
-
-   spte = __pa(sp->spt)
-   | PT_PRESENT_MASK | PT_ACCESSED_MASK
-   | PT_WRITABLE_MASK | PT_USER_MASK;
-   mmu_spte_set(sptep, spte);
+   mmu_spte_set(sptep, __pa(sp->spt) | set_bits);
 }
 
 static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
@@ -3366,6 +3361,13 @@ int kvm_init_shadow_mmu(struct kvm_vcpu 
vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
vcpu->arch.mmu.base_role.smep_andnot_wp
= smep && !is_write_protection(vcpu);
+   /*
+* link_shadow_page() should apply these bits in shadow page tables, and
+* in shadow NPT tables (nested NPT). For nested EPT, different bits
+* apply.
+*/
+   vcpu->arch.mmu.link_shadow_page_set_bits = PT_PRESENT_MASK |
+   PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
 
return r;
 }
--- .before/arch/x86/kvm/paging_tmpl.h  2011-11-10 11:33:59.0 +0200
+++ .after/arch/x86/kvm/paging_tmpl.h   2011-11-10 11:33:59.0 +0200
@@ -515,7 +515,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
goto out_gpte_changed;
 
if (sp)
-   link_shadow_page(it.sptep, sp);
+   link_shadow_page(it.sptep, sp,
+   vcpu->arch.mmu.link_shadow_page_set_bits);
}
 
for (;
@@ -535,7 +536,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
 
sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
  true, direct_access, it.sptep);
-   link_shadow_page(it.sptep, sp);
+   link_shadow_page(it.sptep, sp,
+   vcpu->arch.mmu.link_shadow_page_set_bits);
}
 
clear_sp_write_flooding_count(it.sptep);


[PATCH 03/10] nEPT: Fix cr3 handling in nested exit and entry

2011-11-10 Thread Nadav Har'El
The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   30 ++
 1 file changed, 30 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:58.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:58.0 +0200
@@ -6737,6 +6737,17 @@ static void prepare_vmcs02(struct kvm_vc
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
 
+   /*
+* Additionally, except when L0 is using shadow page tables, L1 or
+* L2 controls guest_cr3 for L2, so they may also have saved PDPTEs
+*/
+   if (enable_ept) {
+   vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+   vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+   vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+   vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+   }
+
kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -6968,6 +6979,25 @@ void prepare_vmcs12(struct kvm_vcpu *vcp
vmcs12->guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
+   /*
+* In some cases (usually, nested EPT), L2 is allowed to change its
+* own CR3 without exiting. If it has changed it, we must keep it.
+* Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+* by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+*/
+   if (enable_ept)
+   vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+   /*
+* Additionally, except when L0 is using shadow page tables, L1 or
+* L2 controls guest_cr3 for L2, so save the PDPTEs
+*/
+   if (enable_ept) {
+   vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+   vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+   vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+   vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+   }
+
/* TODO: These cannot have changed unless we have MSR bitmaps and
 * the relevant bit asks not to trap the change */
vmcs12->guest_ia32_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);


[PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-10 Thread Nadav Har'El
KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). We then need to switch back and forth between this
nested context and the regular MMU context when switching between L1 and L2.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   60 +++
 1 file changed, 60 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:58.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:58.0 +0200
@@ -6443,6 +6443,59 @@ static void vmx_set_supported_cpuid(u32 
entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+/* Callbacks for nested_ept_init_mmu_context: */
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+   /* return the page table to be shadowed - in our case, EPT12 */
+   return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static u64 nested_ept_get_pdptr(struct kvm_vcpu *vcpu, int index)
+{
+   /*
+* This function is called (as mmu.get_pdptr()) in mmu.c to help read
+* a to-be-shadowed page table in PAE (3-level) format. However, the
+* EPT table we're now shadowing (this is the nested EPT mmu) must
+* always have 4 levels, and is not in PAE format, so this function
+* should never be called.
+*/
+   kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+   return 0;
+}
+
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+   struct x86_exception *fault)
+{
+   struct vmcs12 *vmcs12;
+   nested_vmx_vmexit(vcpu);
+   vmcs12 = get_vmcs12(vcpu);
+   /*
+* Note no need to set vmcs12->vm_exit_reason as it is already copied
+* from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
+*/
+   vmcs12->exit_qualification = fault->error_code;
+   vmcs12->guest_physical_address = fault->address;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+   int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
+
+   vcpu->arch.mmu.set_cr3   = vmx_set_cr3;
+   vcpu->arch.mmu.get_cr3   = nested_ept_get_cr3;
+   vcpu->arch.mmu.get_pdptr = nested_ept_get_pdptr;
+   vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+   vcpu->arch.mmu.shadow_root_level = get_ept_level();
+
+   vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
+
+   return r;
+}
+
+static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
+{
+   vcpu->arch.walk_mmu = &vcpu->arch.mmu;
+}
+
 /*
  * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
  * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
@@ -6652,6 +6705,11 @@ static void prepare_vmcs02(struct kvm_vc
vmx_flush_tlb(vcpu);
}
 
+   if (nested_cpu_has_ept(vmcs12)) {
+   kvm_mmu_unload(vcpu);
+   nested_ept_init_mmu_context(vcpu);
+   }
+
if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
vcpu->arch.efer = vmcs12->guest_ia32_efer;
if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
@@ -6982,6 +7040,8 @@ void load_vmcs12_host_state(struct kvm_v
vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
kvm_set_cr4(vcpu, vmcs12->host_cr4);
 
+   if (nested_cpu_has_ept(vmcs12))
+   nested_ept_uninit_mmu_context(vcpu);
/* shadow page tables on either EPT or shadow page tables */
kvm_set_cr3(vcpu, vmcs12->host_cr3);
kvm_mmu_reset_context(vcpu);


[PATCH 01/10] nEPT: Module option

2011-11-10 Thread Nadav Har'El
Add a module option "nested_ept" determining whether to enable Nested EPT.

Nested EPT means emulating EPT for an L1 guest so that L1 can use EPT when
running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
its own cr3 and take its own page faults without either of L0 or L1 getting
involved. This often significantly improves L2's performance over the
previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables).

nested_ept is currently enabled by default (when nested VMX is enabled),
unless L0 doesn't have EPT or disabled it with ept=0.

Users would not normally want to explicitly disable this option. One reason
why one might want to disable it is to force L1 to make do without the EPT
capability, when anticipating a future need to migrate this L1 to another
host which doesn't have EPT. Note that currently there is no API to turn off
nested EPT for just a single L1 guest. However, obviously, an individual L1
guest may choose not to use EPT - the nested_cpu_has_ept() checks if L1
actually used EPT when running L2.

In the future, we can support emulation of EPT for L1 *always*, even when L0
itself doesn't have EPT. This so-called "EPT on shadow page tables" mode
has some theoretical advantages over the baseline "shadow page tables on
shadow page tables" mode typically used when EPT is not available to L0 -
namely that L2's cr3 changes and page faults can be handled in L0 and do not
need to be propagated to L1. However, currently we do not support this mode,
and it is becoming less interesting as newer processors all support EPT.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |   12 
 1 file changed, 12 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-11-10 11:33:58.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2011-11-10 11:33:58.0 +0200
@@ -83,6 +83,10 @@ module_param(fasteoi, bool, S_IRUGO);
 static int __read_mostly nested = 0;
 module_param(nested, bool, S_IRUGO);
 
+/* Whether L0 emulates EPT for its L1 guests. It doesn't mean L1 must use it */
+static int __read_mostly nested_ept = 1;
+module_param(nested_ept, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST  \
(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK \
@@ -875,6 +879,11 @@ static inline bool nested_cpu_has_virtua
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2642,6 +2651,9 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_ple())
ple_gap = 0;
 
+   if (!nested || !enable_ept)
+   nested_ept = 0;
+
if (nested)
nested_vmx_setup_ctls_msrs();
 


[PATCH 0/10] nEPT: Nested EPT support for Nested VMX

2011-11-10 Thread Nadav Har'El
The following patches add nested EPT support to Nested VMX.

Nested EPT means emulating EPT for an L1 guest, allowing it to use EPT when
running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set
its own cr3 and take its own page faults without either of L0 or L1 getting
involved. In many workloads this significantly improves L2's performance over
the previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables). Our paper [1] described these three options,
and the advantages of nested EPT ("multidimensional paging").

Nested EPT is enabled by default (if the hardware supports EPT), so users do
not have to do anything special to enjoy the performance improvement that
this patch gives to L2 guests.

Just as a non-scientific, non-representative indication of the kind of
dramatic performance improvement you may see in workloads that have a lot of
context switches and page faults, here is a measurement of the time
an example single-threaded "make" took in L2 (kvm over kvm):

 shadow over shadow: 105 seconds
 ("ept=0" forces this)

 shadow over EPT: 87 seconds
 (the previous default; Can be forced now with "nested_ept=0")

 EPT over EPT: 29 seconds
 (the default after this patch)

Note that the same test on L1 (with EPT) took 25 seconds, so for this example
workload, performance of nested virtualization is now very close to that of
single-level virtualization.


[1] "The Turtles Project: Design and Implementation of Nested Virtualization",
http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Patch statistics:
-

 Documentation/virtual/kvm/nested-vmx.txt |4 
 arch/x86/include/asm/kvm_host.h  |1 
 arch/x86/include/asm/vmx.h   |2 
 arch/x86/kvm/mmu.c   |   16 -
 arch/x86/kvm/paging_tmpl.h   |6 
 arch/x86/kvm/vmx.c   |  259 -
 arch/x86/kvm/x86.c   |   11 
 7 files changed, 269 insertions(+), 30 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab


Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Avi Kivity
On 11/10/2011 11:49 AM, Sasha Levin wrote:
> >
> > It does, but the hypervisor can only access the guest's images, and a
> > few internal files (like the qemu-kvm executable and its libraries).
>
> What about devices? You let the guest read and write to devices as
> well (/dev/kvm for example, or network devices).

They're all protected.  /dev/kvm is obviously rw for anyone, but it
can't be used to transfer information.

-- 
error compiling committee.c: too many arguments to function



Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Sasha Levin
On Thu, Nov 10, 2011 at 11:43 AM, Avi Kivity  wrote:
> On 11/10/2011 11:34 AM, Sasha Levin wrote:
>> On Thu, Nov 10, 2011 at 11:23 AM, Avi Kivity  wrote:
>> > On 11/10/2011 11:14 AM, Sasha Levin wrote:
>> >> > Trying and failing.  sVirt will deny access to all files except those
>> >> > explicitly allowed by libvirt.
>> >>
>> >> It still allows the guest to read more than enough files which it
>> >> shouldn't be reading.
>> >>
>> >> Unless you configure sVirt on a per-guest basis...
>> >
>> > sVirt is per-guest.
>>
>> It still would mean that the guest can access any file (actually, even
>> device, no?) the hypervisor can access.
>
> It does, but the hypervisor can only access the guest's images, and a
> few internal files (like the qemu-kvm executable and its libraries).

What about devices? You let the guest read and write to devices as
well (/dev/kvm for example, or network devices).


Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Markus Armbruster
Sasha Levin  writes:

> On Thu, Nov 10, 2011 at 10:57 AM, Markus Armbruster  wrote:
>> Sasha Levin  writes:
[...]
>>> I'm actually not sure why KVM tool got QCOW support in the first
>>> place. You can have anything QCOW provides if you use btrfs (among
>>> several other FSs).
>>
>> Maybe it's just me, but isn't it weird to have a filesystem (QCOW2)
>> sitting in the kernel sources that you can't mount(2)?
>>
>
> It's not really a filesystem, it's a disk image :)

Sloppy language on my part, sorry about that.

It's a transport for blocks.  We have a few of those in the kernel
already: block devices.  Including loop devices and DRBD.  You use a
filesystem to interpret their contents.  The resulting stack is what
gets mounted.  Adding another transport for blocks to the kernel that
cannot be used that way strikes me as weird.

[...]


Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Avi Kivity
On 11/10/2011 11:34 AM, Sasha Levin wrote:
> On Thu, Nov 10, 2011 at 11:23 AM, Avi Kivity  wrote:
> > On 11/10/2011 11:14 AM, Sasha Levin wrote:
> >> > Trying and failing.  sVirt will deny access to all files except those
> >> > explicitly allowed by libvirt.
> >>
> >> It still allows the guest to read more than enough files which it
> >> shouldn't be reading.
> >>
> >> Unless you configure sVirt on a per-guest basis...
> >
> > sVirt is per-guest.
>
> It still would mean that the guest can access any file (actually, even
> device, no?) the hypervisor can access.

It does, but the hypervisor can only access the guest's images, and a
few internal files (like the qemu-kvm executable and its libraries).

-- 
error compiling committee.c: too many arguments to function



Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Sasha Levin
On Thu, Nov 10, 2011 at 11:23 AM, Avi Kivity  wrote:
> On 11/10/2011 11:14 AM, Sasha Levin wrote:
>> > Trying and failing.  sVirt will deny access to all files except those
>> > explicitly allowed by libvirt.
>>
>> It still allows the guest to read more than enough files which it
>> shouldn't be reading.
>>
>> Unless you configure sVirt on a per-guest basis...
>
> sVirt is per-guest.

It still would mean that the guest can access any file (actually, even
device, no?) the hypervisor can access.


[PATCH v4-rebased 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread Ohad Ben-Cohen
When mapping a memory region, split it to page sizes as supported
by the iommu hardware. Always prefer bigger pages, when possible,
in order to reduce the TLB pressure.

The logic to do that is now added to the IOMMU core, so neither the iommu
drivers themselves nor users of the IOMMU API have to duplicate it.

This allows a more lenient granularity of mappings; traditionally the
IOMMU API took 'order' (of a page) as a mapping size, and directly let
the low level iommu drivers handle the mapping, but now that the IOMMU
core can split arbitrary memory regions into pages, we can remove this
limitation, so users don't have to split those regions by themselves.
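
As a rough standalone sketch of the size-selection loop added below (my
rework for illustration; the names and the worked example are mine, not
the kernel code):

#include <stdio.h>

/* Pick the largest page that fits the remaining size, the alignment of
 * iova|paddr, and the hardware's supported-page-size bitmap. The real
 * iommu_map() additionally rejects inputs not aligned to the smallest
 * supported page up front. */
static unsigned long pick_pgsize(unsigned long iova, unsigned long paddr,
                                 unsigned long size, unsigned long bitmap)
{
        unsigned long addr_merge = iova | paddr;
        unsigned long pgsize;
        unsigned int pgsize_idx;

        /* index of the largest power of two that still fits 'size' */
        pgsize_idx = 8 * sizeof(long) - 1 - __builtin_clzl(size);

        /* both addresses must be aligned to the chosen page size */
        if (addr_merge) {
                unsigned int align_idx = __builtin_ctzl(addr_merge);
                if (align_idx < pgsize_idx)
                        pgsize_idx = align_idx;
        }

        /* mask of acceptable sizes, restricted to supported ones */
        pgsize = ((1UL << (pgsize_idx + 1)) - 1) & bitmap;

        /* highest remaining bit is the page size to use */
        return 1UL << (8 * sizeof(long) - 1 - __builtin_clzl(pgsize));
}

int main(void)
{
        unsigned long bitmap = (1UL << 12) | (1UL << 21); /* 4KiB, 2MiB */
        unsigned long iova = 0, size = 0x201000;          /* 2MiB + 4KiB */

        while (size) {
                unsigned long pg = pick_pgsize(iova, iova, size, bitmap);
                printf("map 0x%lx bytes at iova 0x%lx\n", pg, iova);
                iova += pg;
                size -= pg;
        }
        /* prints one 2MiB mapping followed by one 4KiB mapping */
        return 0;
}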

Currently the supported page sizes are advertised once and they then
remain static. That works well for OMAP and MSM but it would probably
not fly well with Intel's hardware, where the page size capabilities
may differ between several DMA remapping devices.

register_iommu() currently sets a default pgsize behavior, so we can convert
the IOMMU drivers in subsequent patches. After all the drivers
are converted, the temporary default settings will be removed.

Mainline users of the IOMMU API (kvm and omap-iovmm) are adapted
to deal with bytes instead of page order.

Many thanks to Joerg Roedel  for significant review!

Signed-off-by: Ohad Ben-Cohen 
Cc: David Brown 
Cc: David Woodhouse 
Cc: Joerg Roedel 
Cc: Stepan Moskovchenko 
Cc: KyongHo Cho 
Cc: Hiroshi DOYU 
Cc: Laurent Pinchart 
Cc: kvm@vger.kernel.org
---
 drivers/iommu/iommu.c  |  131 +++-
 drivers/iommu/omap-iovmm.c |   17 ++
 include/linux/iommu.h  |   20 ++-
 virt/kvm/iommu.c   |8 +-
 4 files changed, 144 insertions(+), 32 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7a2953d..b278458 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -16,6 +16,8 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
  */
 
+#define pr_fmt(fmt) "%s: " fmt, __func__
+
 #include 
 #include 
 #include 
@@ -47,6 +49,16 @@ int bus_set_iommu(struct bus_type *bus, struct iommu_ops 
*ops)
if (bus->iommu_ops != NULL)
return -EBUSY;
 
+   /*
+* Set the default pgsize values, which retain the existing
+* IOMMU API behavior: drivers will be called to map
+* regions that are sized/aligned to order of 4KiB pages.
+*
+* This will be removed once all drivers are migrated.
+*/
+   if (!ops->pgsize_bitmap)
+   ops->pgsize_bitmap = ~0xFFFUL;
+
bus->iommu_ops = ops;
 
/* Do IOMMU specific setup for this bus-type */
@@ -157,34 +169,125 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
 EXPORT_SYMBOL_GPL(iommu_domain_has_cap);
 
 int iommu_map(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, int gfp_order, int prot)
+ phys_addr_t paddr, size_t size, int prot)
 {
-   size_t size;
+   unsigned long orig_iova = iova;
+   unsigned int min_pagesz;
+   size_t orig_size = size;
+   int ret = 0;
 
if (unlikely(domain->ops->map == NULL))
return -ENODEV;
 
-   size = PAGE_SIZE << gfp_order;
+   /* find out the minimum page size supported */
+   min_pagesz = 1 << __ffs(domain->ops->pgsize_bitmap);
+
+   /*
+* both the virtual address and the physical one, as well as
+* the size of the mapping, must be aligned (at least) to the
+* size of the smallest page supported by the hardware
+*/
+   if (!IS_ALIGNED(iova | paddr | size, min_pagesz)) {
+   pr_err("unaligned: iova 0x%lx pa 0x%lx size 0x%lx min_pagesz "
+   "0x%x\n", iova, (unsigned long)paddr,
+   (unsigned long)size, min_pagesz);
+   return -EINVAL;
+   }
+
+   pr_debug("map: iova 0x%lx pa 0x%lx size 0x%lx\n", iova,
+   (unsigned long)paddr, (unsigned long)size);
+
+   while (size) {
+   unsigned long pgsize, addr_merge = iova | paddr;
+   unsigned int pgsize_idx;
+
+   /* Max page size that still fits into 'size' */
+   pgsize_idx = __fls(size);
+
+   /* need to consider alignment requirements ? */
+   if (likely(addr_merge)) {
+   /* Max page size allowed by both iova and paddr */
+   unsigned int align_pgsize_idx = __ffs(addr_merge);
+
+   pgsize_idx = min(pgsize_idx, align_pgsize_idx);
+   }
+
+   /* build a mask of acceptable page sizes */
+   pgsize = (1UL << (pgsize_idx + 1)) - 1;
+
+   /* throw away page sizes not supported by the hardware */
+   pgsize &= domain->ops->pgsize_bitmap;
 
-   BUG_ON(!IS_ALIGNED(iova | paddr, size));
+   /* make s

Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Avi Kivity
On 11/10/2011 11:14 AM, Sasha Levin wrote:
> > Trying and failing.  sVirt will deny access to all files except those
> > explicitly allowed by libvirt.
>
> It still allows the guest to read more than enough files which it
> shouldn't be reading.
>
> Unless you configure sVirt on a per-guest basis...

sVirt is per-guest.

-- 
error compiling committee.c: too many arguments to function



Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Sasha Levin
On Thu, Nov 10, 2011 at 11:09 AM, Avi Kivity  wrote:
> On 11/10/2011 11:04 AM, Sasha Levin wrote:
>> On Thu, Nov 10, 2011 at 10:57 AM, Markus Armbruster  
>> wrote:
>> > Sasha Levin  writes:
>> >
>> >> On Thu, Nov 10, 2011 at 9:57 AM, Markus Armbruster  
>> >> wrote:
>> > [...]
>> >>> Start with a clean read/write raw image.  Probing declares it raw.
>> >>> Guest writes QCOW signature to it, with a backing file of its choice.
>> >>>
>> >>> Restart with the same image.  Probing declares it QCOW2.  Guest can read
>> >>> the backing file.  Oops.
>> >>
>> >> Thats an excellent scenario why you'd want to have 'Secure KVM' with
>> >> seccomp filters :)
>> >
>> > Yup.
>> >
>> > For what it's worth, sVirt (use SELinux to secure virtualization)
>> > mitigates the problem.  Doesn't mean we couldn't use "Secure KVM".
>>
>> How does it do that? You have a hypervisor trying to read
>> arbitrary files on the host FS, no?
>
> Trying and failing.  sVirt will deny access to all files except those
> explicitly allowed by libvirt.

It still allows the guest to read more than enough files which it
shouldn't be reading.

Unless you configure sVirt on a per-guest basis...


Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Avi Kivity
On 11/10/2011 11:04 AM, Sasha Levin wrote:
> On Thu, Nov 10, 2011 at 10:57 AM, Markus Armbruster  wrote:
> > Sasha Levin  writes:
> >
> >> On Thu, Nov 10, 2011 at 9:57 AM, Markus Armbruster  
> >> wrote:
> > [...]
> >>> Start with a clean read/write raw image.  Probing declares it raw.
> >>> Guest writes QCOW signature to it, with a backing file of its choice.
> >>>
> >>> Restart with the same image.  Probing declares it QCOW2.  Guest can read
> >>> the backing file.  Oops.
> >>
> >> Thats an excellent scenario why you'd want to have 'Secure KVM' with
> >> seccomp filters :)
> >
> > Yup.
> >
> > For what it's worth, sVirt (use SELinux to secure virtualization)
> > mitigates the problem.  Doesn't mean we couldn't use "Secure KVM".
>
> How does it do that? You have a hypervisor trying to read
> arbitrary files on the host FS, no?

Trying and failing.  sVirt will deny access to all files except those
explicitly allowed by libvirt.

-- 
error compiling committee.c: too many arguments to function



Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Sasha Levin
On Thu, Nov 10, 2011 at 10:57 AM, Markus Armbruster  wrote:
> Sasha Levin  writes:
>
>> On Thu, Nov 10, 2011 at 9:57 AM, Markus Armbruster  wrote:
> [...]
>>> Start with a clean read/write raw image.  Probing declares it raw.
>>> Guest writes QCOW signature to it, with a backing file of its choice.
>>>
>>> Restart with the same image.  Probing declares it QCOW2.  Guest can read
>>> the backing file.  Oops.
>>
>> Thats an excellent scenario why you'd want to have 'Secure KVM' with
>> seccomp filters :)
>
> Yup.
>
> For what it's worth, sVirt (use SELinux to secure virtualization)
> mitigates the problem.  Doesn't mean we couldn't use "Secure KVM".

How does it do that? You have a hypervisor trying to read
arbitrary files on the host FS, no?

>> I'm actually not sure why KVM tool got QCOW support in the first
>> place. You can have anything QCOW provides if you use btrfs (among
>> several other FSs).
>
> Maybe it's just me, but isn't it weird to have a filesystem (QCOW2)
> sitting in the kernel sources that you can't mount(2)?
>

It's not really a filesystem, it's a disk image :)

When we did the initial QCOW patches this issue (in some form) came
up. The main concern there was that we shouldn't be duplicating QCOW
code and instead be using a 'libdiskimage' or something like that.

Since nothing like that existed at that time, and splitting it out of
QEMU wasn't trivial, we ended up agreeing on doing a rewrite of the
code.

The point you raised could be solved if we do end up having a usermode
lib which can handle disk images.


Re: [PATCH 6/9] perf: expose perf capability to other modules.

2011-11-10 Thread Frederic Weisbecker
On Mon, Nov 07, 2011 at 02:45:17PM +, Will Deacon wrote:
> Hi Frederic,
> 
> On Wed, Nov 02, 2011 at 07:42:04AM +, Frederic Weisbecker wrote:
> > On Tue, Nov 01, 2011 at 10:20:04AM -0600, David Ahern wrote:
> > > Right. Originally it could be enabled/disabled. Right now it cannot be,
> > > but I believe Frederic is working on making it configurable again.
> > > 
> > > David
> > 
> > Yep. Will Deacon is working on making the breakpoints able to process
> > pure arch information (ie: without being forced to use the perf attr
> > as a midlayer to define them).
> > 
> > Once we have that I can separate the breakpoints implementation from perf
> > and make it opt-able.
> 
> How do you foresee kdb fitting into this? I see that currently [on x86] we
> cook up perf_event structures with a specific overflow handler set. If we
> want to move this over to using a completely arch-defined structure, then
> we're going to end up with an overflow handler field in both perf_event
> *and* the arch-specific structure, which doesn't feel right to me.
> 
> Of course, if the goal is only to separate ptrace (i.e. user debugging) from
> the perf dependency then we don't need the overflow handler because we'll
> always just send SIGTRAP to the current task.
> 
> Any ideas?

I don't know if we want to convert x86/kgdb to use pure arch breakpoints.
If kgdb one day wants to extend this use to generic code, it may be a good
idea to keep the things as is. I don't know, I'm adding Jason in Cc.

In any case I think we have a problem if we want to default to sending a
SIGTRAP. Look at this:

bp = per_cpu(bp_per_reg[i], cpu);
/*
 * Reset the 'i'th TRAP bit in dr6 to denote completion of
 * exception handling
 */
(*dr6_p) &= ~(DR_TRAP0 << i);
/*
 * bp can be NULL due to lazy debug register switching
 * or due to concurrent perf counter removing.
 */
if (!bp) {
rcu_read_unlock();
break;
}

perf_bp_event(bp, args->regs);


I don't have the details about how lazy the debug register switching
can be. We also want to avoid locking between the perf event
scheduling (removal) path and the breakpoint triggering path.

A solution is to look at the ptrace breakpoints in the thread
struct and see if the one in the index is there. That can reside
in its own callback or as a fallback in hw_breakpoint_handler().
I don't feel strongly about choosing either of those solutions.


Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Markus Armbruster
Sasha Levin  writes:

> On Thu, Nov 10, 2011 at 9:57 AM, Markus Armbruster  wrote:
[...]
>> Start with a clean read/write raw image.  Probing declares it raw.
>> Guest writes QCOW signature to it, with a backing file of its choice.
>>
>> Restart with the same image.  Probing declares it QCOW2.  Guest can read
>> the backing file.  Oops.
>
> Thats an excellent scenario why you'd want to have 'Secure KVM' with
> seccomp filters :)

Yup.

For what it's worth, sVirt (use SELinux to secure virtualization)
mitigates the problem.  Doesn't mean we couldn't use "Secure KVM".

> I'm actually not sure why KVM tool got QCOW support in the first
> place. You can have anything QCOW provides if you use btrfs (among
> several other FSs).

Maybe it's just me, but isn't it weird to have a filesystem (QCOW2)
sitting in the kernel sources that you can't mount(2)?


Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Avi Kivity
On 11/09/2011 07:35 PM, Anthony Liguori wrote:
> On 11/09/2011 11:02 AM, Avi Kivity wrote:
>> On 11/09/2011 06:39 PM, Anthony Liguori wrote:
>>>
>>> Migration with qcow2 is not a supported feature for 1.0.  Migration is
>>> only supported with raw images using coherent shared storage[1].
>>>
>>> [1] NFS is only coherent with close-to-open which right now is not
>>> good enough for migration.
>>
>> Say what?
>
> Due to block format probing, we read at least the first sector of the
> disk during start up.
>
> Strictly going by what NFS guarantees, since we don't open on the
> destination *after* as close on the source, we aren't guaranteed to
> see what's written by the source.
>
> In practice, because of block format probing, unless we're using
> cache=none, the first sector can be out of sync with the source on the
> destination.  If you use cache=none on a Linux client with at least a
> Linux NFS server, you should be relatively safe.
>

IMO, this should be a release blocker.  qemu 1.0 only supporting
migration on enterprise storage?

If we have to delay the release for a month to get it right, we should. 
Not that I think we have to.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout

2011-11-10 Thread Michael S. Tsirkin
On Wed, Nov 09, 2011 at 11:13:56PM +0200, Sasha Levin wrote:
> On Wed, 2011-11-09 at 23:14 +0200, Michael S. Tsirkin wrote:
> > On Wed, Nov 09, 2011 at 10:57:28PM +0200, Sasha Levin wrote:
> > > On Wed, 2011-11-09 at 22:52 +0200, Michael S. Tsirkin wrote:
> > > > On Wed, Nov 09, 2011 at 10:24:47PM +0200, Sasha Levin wrote:
> > > > > On Wed, 2011-11-09 at 21:59 +0200, Michael S. Tsirkin wrote:
> > > > > 
> > > > > [snip]
> > > > > 
> > > > > > +\begin_layout Enumerate
> > > > > > +Reset the device.
> > > > > > + This is not required on initial start up.
> > > > > > +\end_layout
> > > > > > +
> > > > > > +\begin_layout Enumerate
> > > > > > +The ACKNOWLEDGE status bit is set: we have noticed the device.
> > > > > > +\end_layout
> > > > > > +
> > > > > > +\begin_layout Enumerate
> > > > > > +The DRIVER status bit is set: we know how to drive the device.
> > > > > > +\end_layout
> > > > > > +
> > > > > > +\begin_layout Enumerate
> > > > > > +
> > > > > > +\change_inserted 1986246365 1320838089
> > > > > > +PCI capability list scan, detecting virtio configuration layout 
> > > > > > using Virtio
> > > > > > + Structure PCI capabilities.
> > > > > 
> > > > > Does the legacy space always gets mapped from BAR0?
> > > > > 
> > > > > If yes,
> > > > 
> > > > Yes and this is repeated in several places. Not clear? How can this
> > > > be made clearer?
> > > 
> > > Do you mean comments such as "For backwards compatibility, devices
> > > should also present legacy configuration space in the first I/O region
> > > of the PCI device"? What I understood from it is that the device should
> > > have a legacy config in case it's used with an older guest, but I didn't
> > > understand from it that the legacy config will be used even if new
> > > layout is present.
> > 
> > Yes, this is what I meant. A new guest is required to use the new space
> > and not the legacy one. So you don't need a legacy space for the device at all.
> > But practically, we'll need to support old guests for a long while.
> > 
> > > > > It'll be a bit harder deprecating it in the future.
> > > > 
> > > > Harder than ... what ?
> > > 
> > > Harder than allowing devices not to present it at all if new layout
> > > config is used.
> > 
> > Yes, it's allowed if you know you have a new guest. It says
> > explicitly that drivers are required to use new capabilities
> > if they are there.
> > 
> > > Right now the simple implementation is to use MMIO for
> > > config and device specific, and let it fallback to legacy for ISR and
> > > notifications (and therefore, this is probably how everybody will
> > > implement it), which means that when you do want to deprecate legacy,
> > > there will be extra work to be done then, instead of doing it now.
> > 
> > If hypervisors don't implement the new layout then drivers will
> > have to keep supporting the old one. I don't think we can do
> > much about that.
> > 
> > > > IMO there's no way to put legacy anywhere except the first BAR
> > > > without breaking existing guests.
> > > 
> > > It's not about where we put legacy, it's about how easy it is to drop
> > > legacy entirely.
> > 
> > We can only do this after all guests and hypervisors are updated. When
> > they are, we can drop legacy from drivers and hypervisors, and
> > I don't see a way to make it easier.
> 
> Well, in that case, why is the PCI cap probing #4 in the device
> init list? Shouldn't we be getting the layout and mapping it before we
> write to the status byte?

True, this is actually how it's done in the driver.
Good catch, I'll correct the text, thanks.

> -- 
> 
> Sasha.


Re: OpenBSD 5.0 kernel panic in AMD K10 cpu power state

2011-11-10 Thread Avi Kivity
(re-adding cc)


On 11/09/2011 09:35 PM, Walter Haidinger wrote:
> On 09.11.2011 14:40, Avi Kivity wrote:
> > Actually, it looks like an OpenBSD bug.  According to the AMD 
> > documentation:
>
> Well, the OpenBSD developers are very confident that it is
> a bug in the KVM cpu emulation and _not_ in OpenBSD.
>
> Basically they say that [despite -cpu host], the emulated
> cpu does not look like a real, but _non-existent_ cpu.
> Virtualization should look like _existing_ hardware.

That is true.  But OpenBSD is not following the vendor's recommendation
for how software should access the hardware.

> Since the list archive at 
> http://marc.info/?l=openbsd-misc&m=132077741910464&w=2
> lags a bit, I'm attaching some parts of the thread below:
>
> However, please remember it's OpenBSD, so the tone is, let's just
> say, rough.

Less than expected, actually.

> > The panic you hit is for an msr read, not a write. I'm aware those 
> > registers are read-only. The CPUID check isn't done, it matches on 
> > all family 10 and/or higher AMD processors. They're pretending to be
> >  an AMD K10 processor. On all real hardware I've tested this works 
> > fine. If you wish to be pedantic, patches are welcome.

So they're actually open to adding the cpuid check.

> They sent me a patch as a workaround, which:
>
> > The previous patch avoids touching the msr at all if ACPI indicates 
> > speed scaling is unavailable, this should prevent your panic.
>
> with -cpu host, OpenBSD dmesg showed the 1100T:
> >> cpu0: AMD Phenom(tm) II X6 1100T Processor ("AuthenticAMD" 686-class, 
> >> 512KB L2 cache) 3.31 GHz cpu0:
> >> FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SSE3,CX16,POPCNT
> >> ...
> >> bios0: vendor Bochs version "Bochs" date 01/01/2007 bios0: Bochs
> >> Bochs
> > They shouldn't be pretending to be AMD, especially if that emulation
> > is very incompatible.
>
> but the bug is in the Linux KVM:
>
> >> They're pretending to be an AMD K10 processor.
> >> 
> > Exactly.  What they are doing is wrong. They are pretending to be a 
> > AMD K10 processor _badly_, and then they think they can say "oh, but 
> > you need to check all these other registers too". A machine with that
> > setup has never physically existed.
>
> Is this all because I used -cpu host? 
>

-cpu host is not to blame, you could get the same result from other
combinations of cpu model and family.

I'll look at adding support for this MSR; should be simple.  But in
general processor features need to be qualified by cpuid, not by model.
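
For the P-state MSRs in question, "qualified by cpuid" would look
roughly like this (my sketch, not code from OpenBSD or KVM; CPUID
Fn8000_0007 EDX bit 7 is AMD's HwPstate flag, which advertises the
P-state MSRs):

#include <stdint.h>

static void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
                  uint32_t *c, uint32_t *d)
{
        __asm__ volatile("cpuid"
                         : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                         : "a"(leaf));
}

/* Only touch the P-state MSRs if the CPU actually advertises them,
 * instead of assuming every family >= 0x10 part has them. */
static int have_hw_pstate(void)
{
        uint32_t a, b, c, d;

        cpuid(0x80000000, &a, &b, &c, &d);      /* max extended leaf */
        if (a < 0x80000007)
                return 0;
        cpuid(0x80000007, &a, &b, &c, &d);      /* power management */
        return !!(d & (1u << 7));               /* HwPstate */
}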

-- 
error compiling committee.c: too many arguments to function


