date:20111110

Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout

2011-11-10 Thread Sasha Levin

On Fri, Nov 11, 2011 at 6:24 AM, Rusty Russell  wrote:
> On Wed, 09 Nov 2011 22:57:28 +0200, Sasha Levin  
> wrote:
>> On Wed, 2011-11-09 at 22:52 +0200, Michael S. Tsirkin wrote:
>> > On Wed, Nov 09, 2011 at 10:24:47PM +0200, Sasha Levin wrote:
>> > > It'll be a bit harder deprecating it in the future.
>> >
>> > Harder than ... what ?
>>
>> Harder than allowing devices not to present it at all if new layout
>> config is used. Right now the simple implementation is to use MMIO for
>> config and device specific, and let it fallback to legacy for ISR and
>> notifications (and therefore, this is probably how everybody will
>> implement it), which means that when you do want to deprecate legacy,
>> there will be extra work to be done then, instead of doing it now.
>
> Indeed, I'd like to see two changes to your proposal:
>
> (1) It should be all or nothing.  If a driver can find the virtio header
>    capability, it should only use the capabilties.  Otherwise, it
>    should fall back to legacy.  Your draft suggests a mix is possible;
>    I prefer a clean failure (ie. one day don't present a BAR 0 *at
>    all*, so ancient drivers just fail to load.).
>
> (2) There's no huge win in keeping the same layout.  Let's make some
>    cleanups.  There are more users ahead of us then behind us (I
>    hope!).

Actually, if we already do cleanups, here are two more suggestions:

1. Make 64bit features a one big 64bit block, instead of having 32bits
in one place and 32 in another.
2. Remove the reserved fields out of the config (the ones that were
caused by moving the ISR and the notifications out).

> But I think this is the right direction!
>
> Thanks,
> Rusty.
>

Also, an unrelated questions: With PIO, requests were ordered, which
means that if we wrote to the queue selector and then read from a
queue register we would read the correct queue info.
Is the same thing assured to us with MMIO? If we write to a queue
selector and immediately read from queue info would we be reading the
right info, or is there the slight chance that it would get reordered
and we would be reading queue info first and writing to the selector
later?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2 RFC] virtio-spec: flexible configuration layout

2011-11-10 Thread Rusty Russell

On Wed, 09 Nov 2011 22:57:28 +0200, Sasha Levin  wrote:
> On Wed, 2011-11-09 at 22:52 +0200, Michael S. Tsirkin wrote:
> > On Wed, Nov 09, 2011 at 10:24:47PM +0200, Sasha Levin wrote:
> > > It'll be a bit harder deprecating it in the future.
> > 
> > Harder than ... what ?
> 
> Harder than allowing devices not to present it at all if new layout
> config is used. Right now the simple implementation is to use MMIO for
> config and device specific, and let it fallback to legacy for ISR and
> notifications (and therefore, this is probably how everybody will
> implement it), which means that when you do want to deprecate legacy,
> there will be extra work to be done then, instead of doing it now.

Indeed, I'd like to see two changes to your proposal:

(1) It should be all or nothing.  If a driver can find the virtio header
capability, it should only use the capabilties.  Otherwise, it
should fall back to legacy.  Your draft suggests a mix is possible;
I prefer a clean failure (ie. one day don't present a BAR 0 *at
all*, so ancient drivers just fail to load.).

(2) There's no huge win in keeping the same layout.  Let's make some
cleanups.  There are more users ahead of us then behind us (I
hope!).

But I think this is the right direction!

Thanks,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 00/11] KVM: x86: optimize for writing guest page

2011-11-10 Thread Xiao Guangrong


On 11/10/2011 10:05 PM, Avi Kivity wrote:

On 11/10/2011 03:28 PM, Xiao Guangrong wrote:


I have tested RHEL.6.1 setup/boot/reboot/shutdown and the complete
output of scan_results.py is attached.

The result shows the performance is improved:
before:After:
570529
555538
552531
546528
553559
553527
550523
553533
547538
550526

How do you think about it? :)


Well, either I was sloppy in my measurements, or maybe RHEL 6 is very
different from F9 (unlikely).  I'll measure it again and see.



Thanks for your time. :)


btw, this is with ept=0, yes?



Yeah.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Virt: Add Fedora 16 to the list of guests

2011-11-10 Thread Lucas Meneghel Rodrigues

Also, make it the default for virt tests.

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/guest-os.cfg.sample |   27 +++
 client/tests/kvm/tests.cfg.sample|   10 
 client/virt/unattended/Fedora-16.ks  |   39 ++
 client/virt/virt_utils.py|6 ++--
 4 files changed, 74 insertions(+), 8 deletions(-)
 create mode 100644 client/virt/unattended/Fedora-16.ks

diff --git a/client/tests/kvm/guest-os.cfg.sample 
b/client/tests/kvm/guest-os.cfg.sample
index 1e19c52..cd9427e 100644
--- a/client/tests/kvm/guest-os.cfg.sample
+++ b/client/tests/kvm/guest-os.cfg.sample
@@ -328,6 +328,33 @@ variants:
 md5sum_cd1 = c122a2a4f478da4a3d2d12396e84244e
 md5sum_1m_cd1 = c02f37e293bbc85be02a7c850a61273a
 
+- 16.32:
+image_name = f16-32
+unattended_install:
+unattended_file = unattended/Fedora-16.ks
+#floppy = images/f16-32/ks.vfd
+cdrom_unattended = images/f16-32/ks.iso
+kernel = images/f16-32/vmlinuz
+initrd = images/f16-32/initrd.img
+unattended_install.cdrom:
+cdrom_cd1 = isos/linux/Fedora-16-i386-DVD.iso
+md5sum_cd1 = 0d64ab6b1b800827a9c83d95395b3da0
+md5sum_1m_cd1 = 3f616b5034980cadeefe67dbca79cf99
+
+- 16.64:
+image_name = f16-64
+unattended_install:
+unattended_file = unattended/Fedora-16.ks
+#floppy = images/f16-64/ks.vfd
+cdrom_unattended = images/f16-64/ks.iso
+kernel = images/f16-64/vmlinuz
+initrd = images/f16-64/initrd.img
+unattended_install.cdrom:
+cdrom_cd1 = isos/linux/Fedora-16-x86_64-DVD.iso
+md5sum_cd1 = bb38ea1fe4b2fc69e7a6e15cf1c69c91
+md5sum_1m_cd1 = e25ea147176f24239d38a46f501bd25e
+
+
 - RHEL:
 no setup
 shell_prompt = "^\[.*\][\#\$]\s*$"
diff --git a/client/tests/kvm/tests.cfg.sample 
b/client/tests/kvm/tests.cfg.sample
index a30158b..4b217ee 100644
--- a/client/tests/kvm/tests.cfg.sample
+++ b/client/tests/kvm/tests.cfg.sample
@@ -78,7 +78,7 @@ variants:
 only unattended_install.cdrom, boot, shutdown
 
 # Runs qemu, f15 64 bit guest OS, install, boot, shutdown
-- @qemu_f15_quick:
+- @qemu_f16_quick:
 # We want qemu for this run
 qemu_binary = /usr/bin/qemu
 qemu_img_binary = /usr/bin/qemu-img
@@ -90,13 +90,13 @@ variants:
 only up
 only no_pci_assignable
 only smallpages
-only Fedora.15.64
+only Fedora.16.64
 only unattended_install.cdrom, boot, shutdown
 # qemu needs -enable-kvm on the cmdline
 extra_params += ' -enable-kvm'
 
 # Runs qemu-kvm, f15 64 bit guest OS, install, boot, shutdown
-- @qemu_kvm_f15_quick:
+- @qemu_kvm_f16_quick:
 # We want qemu-kvm for this run
 qemu_binary = /usr/bin/qemu-kvm
 qemu_img_binary = /usr/bin/qemu-img
@@ -106,7 +106,7 @@ variants:
 only smp2
 only no_pci_assignable
 only smallpages
-only Fedora.15.64
+only Fedora.16.64
 only unattended_install.cdrom, boot, shutdown
 
 # Runs your own guest image (qcow2, can be adjusted), all migration tests
@@ -140,4 +140,4 @@ variants:
 #kill_unresponsive_vms.* ?= no
 
 # Choose your test list from the testsets defined
-only qemu_kvm_f15_quick
+only qemu_kvm_f16_quick
diff --git a/client/virt/unattended/Fedora-16.ks 
b/client/virt/unattended/Fedora-16.ks
new file mode 100644
index 000..eb52c1f
--- /dev/null
+++ b/client/virt/unattended/Fedora-16.ks
@@ -0,0 +1,39 @@
+install
+KVM_TEST_MEDIUM
+text
+reboot
+lang en_US
+keyboard us
+network --bootproto dhcp
+rootpw 123456
+firewall --enabled --ssh
+selinux --enforcing
+timezone --utc America/New_York
+firstboot --disable
+bootloader --location=mbr --append="console=tty0 console=ttyS0,115200"
+zerombr
+poweroff
+
+clearpart --all --initlabel
+autopart
+
+%packages
+@base
+@development-libs
+@development-tools
+%end
+
+%post --interpreter /usr/bin/python
+import socket, os
+os.system('grubby --remove-args="rhgb quiet" --update-kernel=$(grubby 
--default-kernel)')
+os.system('dhclient')
+os.system('chkconfig sshd on')
+os.system('iptables -F')
+os.system('echo 0 > /selinux/enforce')
+server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+server.bind(('', 12323))
+server.listen(1)
+(client, addr) = server.accept()
+client.send("done")
+client.close()
+%end
diff --git a/client/vi

Re: OpenBSD 5.0 kernel panic in AMD K10 cpu power state

2011-11-10 Thread Andre Przywara

On 11/10/2011 09:46 AM, Avi Kivity wrote:

(re-adding cc)

On 11/09/2011 09:35 PM, Walter Haidinger wrote:

Am 09.11.2011 14:40, schrieb Avi Kivity:

Actually, it looks like an OpenBSD bug. According to the AMD
documentation:

Well, the OpenBSD developers are very confident that is
a bug in the KVM cpu emulation and _not_ in OpenBSD.

Basically they say that [despite -cpu host], the emulated
cpu does not look like a real, but _non-existant_ cpu.
Virtualization should look like _existing_ hardware.

That is true. But OpenBSD is not following the vendor's recommendation
for how software should access the hardware.

Since the list archive at
http://marc.info/?l=openbsd-misc&m=132077741910464&w=2
lags a bit, I'm attaching some parts of the thread below:

However, please remember it's OpenBSD, so the tone is, let's just
say, rough.

Less than expected, actually.

The panic you hit is for an msr read, not a write. I'm aware those
registers are read-only. The CPUID check isn't done, it matches on
all family 10 and/or higher AMD processors. They're pretending to be
an AMD K10 processor. On all real hardware I've tested this works
fine. If you wish to be pedantic, patches are welcome.

Avi, thanks for caring of that.

The manual is clear here: no CPUID bit, no MSRs. Beside that the
emulated ACPI tables probably also don't provide any info here, right?
The fact that it runs: "on all family 10 and/or higher AMD processors"
is just an empiric observation, not a law. You would be astonished what
can be fused off...

We had a similar discussion here with unconditional AMD Northbridge PCI
accesses when detecting certain AMD CPU family/model/steppings in the
Linux kernel already (...but every AMD CPU has a northbridge...)
We (as virtualization guys) should not step back so easily here,
especially if the spec is so clear. That spec argument should actually
appeal to the OpenBSD guys, too. I got the impression that their design
is, well, actually well designed.

So they're actually open to adding the cpuid check.

They sent me a patch as a workaround, which:

The previous patch avoids touching the msr at all if ACPI indicates
speed scaling is unavailable, this should prevent your panic.

with -cpu host, OpenBSD dmesg showed the 1100T:

cpu0: AMD Phenom(tm) II X6 1100T Processor ("AuthenticAMD" 686-class, 512KB L2
cache) 3.31 GHz cpu0:
FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SSE3,CX16,POPCNT
...
bios0: vendor Bochs version "Bochs" date 01/01/2007 bios0: Bochs
Bochs

They shouldn't be pretending to be AMD, especially if that emulation
is very incompatible.

but the bug is in the Linux KVM:

They're pretending to be an AMD K10 processor.

Exactly. What they are doing is wrong. They are pretending to be a
AMD K10 processor _badly_, and then they think they can say "oh, but
you need to check all these other registers too". A machine with that
setup has never physically existed.

Is this all because I used -cpu host?

-cpu host is not to blame, you could get the same result from other
combinations of cpu model and family.

I'll look at adding support for this MSR; should be simple. But in
general processor features need to be qualified by cpuid, not by model.

I guess emulating part of P-states will open up a can of worms. Beside
the generic MSRs (0xC001006[1-3]) there are actual family specific ones
which are selected by the CPUID family. So you would end up emulating
them, too. I have a hard time to think about a strategy how to emulate
this in general. So unless there is a real framework for dealing with
P-state "hints" from the guest OS, I'd be reluctant with quick and dirty
emulations.

Thanks,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori


On 11/10/2011 12:27 PM, Anthony Liguori wrote:

On 11/10/2011 02:55 AM, Avi Kivity wrote:

If we have to delay the release for a month to get it right, we should.
Not that I think we have to.



Adding libvirt to the discussion.

What does libvirt actually do in the monitor prior to migration completing on
the destination? The least invasive way of doing delayed open of block devices
is probably to make -incoming create a monitor and run a main loop before the
block devices (and full device model) is initialized. Since this isolates the
changes strictly to migration, I'd feel okay doing this for 1.0 (although it
might need to be in the stable branch).


This won't work.  libvirt needs things to be initialized.  Plus, once loadvm 
gets to loading the device model, the device model (and BDSes) need to be fully 
initialized.


I think I've convinced myself that without proper clustered shared storage, 
cache=none is a hard requirement.  That goes for iSCSI and NFS.  I don't see a 
way to do migration safely with NFS and there's no way to really solve the page 
cache problem with iSCSI.


Even with the reopen, it's racing against the close on the source.  If you look 
at Daniel's description of what libvirt is doing and then compare that to Juan's 
patches, there's a race condition regarding whether the source gets closed 
before the reopen happens.  cache=none seems to be the only way to solve this.


Live migration with qcow2 or any other image format is just not going to work 
right now even with proper clustered storage.  I think doing a block level flush 
cache interface and letting block devices decide how to do it is the best approach.


Regards,

Anthony Liguori


I know a monitor can run like this as I've done it before but some of the
commands will not behave as expected so it's pretty important to be comfortable
with what commands are actually being used in this mode.

Regards,

Anthony Liguori


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread Stepan Moskovchenko


On 11/10/2011 9:09 AM, Joerg Roedel wrote:
The plan is to have a single DMA-API implementation for all IOMMU 
drivers (X86 and ARM) which just uses the IOMMU-API. But to make this 
performing reasonalbly well a few changes to the IOMMU-API are 
required. I already have some ideas which we can discuss if you want.


I have been experimenting with an iommu_map_range call, which maps a 
given scatterlist of discontiguous physical pages into a contiguous 
virtual region at a given IOVA. This has some performance advantages 
over just calling iommu_map iteratively. First, it reduces the amount of 
table walking / calculation needed for mapping each page, given how you 
know that all the pages will be mapped into a single 
virtually-contiguous region (so in most cases, the first-level table 
calculation can be reused). Second, it allows one to defer the TLB (and 
sometimes cache) maintenance operations until the entire scatterlist has 
been mapped, rather than doing a TLB invalidate after mapping each page, 
as would have been the case if iommu_map were just being called from 
within a loop. Granted, just using iommu_map many times may be 
acceptable on the slow path, but I have seen significant performance 
gains when using this approach on the fast path.


Steve

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori


On 11/10/2011 02:06 PM, Daniel P. Berrange wrote:

On Thu, Nov 10, 2011 at 01:11:42PM -0600, Anthony Liguori wrote:

On 11/10/2011 12:42 PM, Daniel P. Berrange wrote:

On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:

What does libvirt actually do in the monitor prior to migration
completing on the destination?  The least invasive way of doing
delayed open of block devices is probably to make -incoming create a
monitor and run a main loop before the block devices (and full
device model) is initialized.  Since this isolates the changes
strictly to migration, I'd feel okay doing this for 1.0 (although it
might need to be in the stable branch).


The way migration works with libvirt wrt QEMU interactions is now
as follows

  1. Destination.
Run   qemu -incoming ...args...
Query chardevs via monitor
Query vCPU threads via monitor
Set disk / vnc passwords


Since RHEL carries Juan's patch, and Juan's patch doesn't handle
disk passwords gracefully, how does libvirt cope with that?


No idea, that's the first I've heard of any patch that causes
problems with passwords in QEMU.


My guess is that migration with a password protected qcow2 file isn't a common 
test-case.


Regards,

Anthony Liguori



Daniel


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Daniel P. Berrange

On Thu, Nov 10, 2011 at 01:11:42PM -0600, Anthony Liguori wrote:
> On 11/10/2011 12:42 PM, Daniel P. Berrange wrote:
> >On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:
> >>What does libvirt actually do in the monitor prior to migration
> >>completing on the destination?  The least invasive way of doing
> >>delayed open of block devices is probably to make -incoming create a
> >>monitor and run a main loop before the block devices (and full
> >>device model) is initialized.  Since this isolates the changes
> >>strictly to migration, I'd feel okay doing this for 1.0 (although it
> >>might need to be in the stable branch).
> >
> >The way migration works with libvirt wrt QEMU interactions is now
> >as follows
> >
> >  1. Destination.
> >Run   qemu -incoming ...args...
> >Query chardevs via monitor
> >Query vCPU threads via monitor
> >Set disk / vnc passwords
> 
> Since RHEL carries Juan's patch, and Juan's patch doesn't handle
> disk passwords gracefully, how does libvirt cope with that?

No idea, that's the first I've heard of any patch that causes
problems with passwords in QEMU.

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 02/10] nEPT: MMU context for nested EPT

2011-11-10 Thread Nadav Har'El

On Thu, Nov 10, 2011, Avi Kivity wrote about "Re: [PATCH 02/10] nEPT: MMU 
context for nested EPT":
> This is all correct, but the code in question parses the EPT12 table
> using the ia32 page table format.  They're sufficiently similar so that
> it works, but it isn't correct.
> 
> Bit 0: EPT readable, ia32 present
> Bit 1: Writable; ia32 meaning dependent on cr0.wp
> Bit 2: EPT executable, ia32 user (so, this implementation will interpret
> a non-executable EPT mapping, if someone could find a use for it, as a
> L2 kernel only mapping)
>

This is a very good point.

I was under the mistaken (?) impression that the page-table shadowing
code will just copy these bits as-is from the shadowed table (EPT12) to the
shadow table (EPT02), without caring what they actually mean. I knew we had
a problem when building, not copying, PTEs, and hence the patch to
link_shadow_page).

Also I realized we sometimes need to actually walk the TDP EPT12+cr3 (e.g.,
to see if an EPT violation is L1's fault), but I thought this was just the
normal TDP walk, which already knows how to correctly read the EPT
table.

> walk_addr() will also write to bits 6/7, which the L1 won't expect.

I didn't notice this :(

Back to the drawing board, I guess. I need to figure out exactly what
needs to be fixed, and how to do this with the least obtrusive changes to
the existing use case (normal shadow page tables, and nested EPT).

-- 
Nadav Har'El|  Thursday, Nov 10 2011, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Learn from mistakes of others; you won't
http://nadav.harel.org.il   |live long enough to make them all yourself
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm tools: Allow retrieval about PTY redirection in 'kvm stat'

2011-11-10 Thread Pekka Enberg

On Tue, 2011-11-01 at 18:34 +0200, Sasha Levin wrote:
> This patch adds an option to provide information about redirection
> of terminal redirection to a PTY device within 'kvm stat'.
> 
> Usage:
>   'kvm stat -p [term] -n [instance_name]'
> 
> Will print information about redirection of terminal 'term' int instance
> 'instance_name'.
> 
> Cc: Osier Yang 
> Signed-off-by: Sasha Levin 

Ping? Should I apply this patch? Is it actually useful for libvirt?

> ---
>  tools/kvm/Documentation/kvm-stat.txt |2 +
>  tools/kvm/builtin-stat.c |   39 ++---
>  tools/kvm/include/kvm/kvm-ipc.h  |1 +
>  tools/kvm/include/kvm/term.h |7 ++
>  tools/kvm/term.c |   25 -
>  5 files changed, 69 insertions(+), 5 deletions(-)
> 
> diff --git a/tools/kvm/Documentation/kvm-stat.txt 
> b/tools/kvm/Documentation/kvm-stat.txt
> index ce5ab54..5284aa9 100644
> --- a/tools/kvm/Documentation/kvm-stat.txt
> +++ b/tools/kvm/Documentation/kvm-stat.txt
> @@ -17,3 +17,5 @@ For a list of running instances see 'kvm list'.
>  
>  Commands:
>   --memory, -mDisplay memory statistics
> + --pty, -p   Display information about terminal's pty
> + device.
> diff --git a/tools/kvm/builtin-stat.c b/tools/kvm/builtin-stat.c
> index e28eb5b..2a46900 100644
> --- a/tools/kvm/builtin-stat.c
> +++ b/tools/kvm/builtin-stat.c
> @@ -4,6 +4,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include 
>  #include 
> @@ -18,6 +20,7 @@ struct stat_cmd {
>  };
>  
>  static bool mem;
> +static int pty = -1;
>  static bool all;
>  static int instance;
>  static const char *instance_name;
> @@ -30,6 +33,7 @@ static const char * const stat_usage[] = {
>  static const struct option stat_options[] = {
>   OPT_GROUP("Commands options:"),
>   OPT_BOOLEAN('m', "memory", &mem, "Display memory statistics"),
> + OPT_INTEGER('p', "PTY info", &pty, "Display PTY path for given 
> terminal"),
>   OPT_GROUP("Instance options:"),
>   OPT_BOOLEAN('a', "all", &all, "All instances"),
>   OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
> @@ -104,15 +108,40 @@ static int do_memstat(const char *name, int sock)
>   return 0;
>  }
>  
> +static int do_pty(const char *name, int sock)
> +{
> + struct pty_cmd cmd = {KVM_IPC_TRM_PTY, 0, pty};
> + int r;
> + char pty_path[PATH_MAX] = {0};
> +
> + r = xwrite(sock, &cmd, sizeof(cmd));
> + if (r < 0)
> + return r;
> +
> + r = xread(sock, pty_path, PATH_MAX);
> + if (r < 0)
> + return r;
> +
> + printf("Instance %s mapped term %d to: %s\n", name, pty, pty_path);
> +
> + return 0;
> +}
> +
>  int kvm_cmd_stat(int argc, const char **argv, const char *prefix)
>  {
>   parse_stat_options(argc, argv);
>  
> - if (!mem)
> + if (!mem && pty == -1)
>   usage_with_options(stat_usage, stat_options);
>  
> - if (mem && all)
> - return kvm__enumerate_instances(do_memstat);
> + if (all) {
> + if (mem)
> + kvm__enumerate_instances(do_memstat);
> + if (pty != -1)
> + kvm__enumerate_instances(do_pty);
> +
> + return 0;
> + }
>  
>   if (instance_name == NULL &&
>   instance == 0)
> @@ -125,7 +154,9 @@ int kvm_cmd_stat(int argc, const char **argv, const char 
> *prefix)
>   die("Failed locating instance");
>  
>   if (mem)
> - return do_memstat(instance_name, instance);
> + do_memstat(instance_name, instance);
> + if (pty != -1)
> + do_pty(instance_name, instance);
>  
>   return 0;
>  }
> diff --git a/tools/kvm/include/kvm/kvm-ipc.h b/tools/kvm/include/kvm/kvm-ipc.h
> index 731767f..1d9599b 100644
> --- a/tools/kvm/include/kvm/kvm-ipc.h
> +++ b/tools/kvm/include/kvm/kvm-ipc.h
> @@ -17,6 +17,7 @@ enum {
>   KVM_IPC_RESUME  = 5,
>   KVM_IPC_STOP= 6,
>   KVM_IPC_PID = 7,
> + KVM_IPC_TRM_PTY = 8,
>  };
>  
>  int kvm_ipc__register_handler(u32 type, void (*cb)(int fd, u32 type, u32 
> len, u8 *msg));
> diff --git a/tools/kvm/include/kvm/term.h b/tools/kvm/include/kvm/term.h
> index 37ec731..06d5b4e 100644
> --- a/tools/kvm/include/kvm/term.h
> +++ b/tools/kvm/include/kvm/term.h
> @@ -2,10 +2,17 @@
>  #define KVM__TERM_H
>  
>  #include 
> +#include 
>  
>  #define CONSOLE_8250 1
>  #define CONSOLE_VIRTIO   2
>  
> +struct pty_cmd {
> + u32 type;
> + u32 len;
> + int pty;
> +};
> +
>  int term_putc_iov(int who, struct iovec *iov, int iovcnt, int term);
>  int term_getc_iov(int who, struct iovec *iov, int iovcnt, int term);
>  int term_putc(int who, char *addr, int cnt, int term);
> diff --git a/tools/kvm/term.c b/tools/kvm/term.c
> index fb5d71c..4e0d946 100644
> --- a/tools/kvm/term.c
> +++ b/tools/kvm/term.c
> @@ -13,7 +13,7 @@
>  #include "kvm/util.h"
>  #include "kvm/kvm.h"
>  #

Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread David Woodhouse

On Thu, 2011-11-10 at 18:09 +0100, Joerg Roedel wrote:
> The requirement for the DMA-API is, that the IOTLB must be consistent
> with existing mappings, and only with the parts that are really mapped.
> The unmapped parts are not important.
> 
> This allows nice optimizations like your 'batched unmap' on the Intel
> IOMMU driver side. The AMD IOMMU driver uses a round-robin bitmap
> allocator for the IO addresses which makes it very easy to flush certain
> IOTLB ranges only before they are reused.

... which implies that a mapping, once made, might *never* actually get
torn down until we loop and start reusing address space? That has
interesting security implications. Is it true even for devices which
have been assigned to a VM and then unassigned?

> >   - ... unless booted with 'intel_iommu=strict', in which case we do the
> > unmap and IOTLB flush immediately before returning to the driver.
> 
> There is something similar on the AMD IOMMU side. There it is called
> unmap_flush.

OK, so that definitely wants consolidating into a generic option.

> >   - But the IOMMU API for virtualisation is different. In fact that
> > doesn't seem to flush the IOTLB at all. Which is probably a bug.
> 
> Well, *current* requirement is, that the IOTLB is in sync with the
> page-table at every time. This is true for the iommu_map and especially
> for the iommu_unmap function. It means basically that the unmapped area
> needs to be flushed out of the IOTLBs before iommu_unmap returns.
> 
> Some time ago I proposed the iommu_commit() interface which changes
> these requirements. With this interface the requirement is that after a
> couple of map/unmap operations the IOMMU-API user has to call
> iommu_commit() to make these changes visible to the hardware (so mostly
> sync the IOTLBs). As discussed at that time this would make sense for
> the Intel and AMD IOMMU drivers.

I would *really* want to keep those off the fast path (thinking mostly
about DMA API here, since that's the performance issue). But as long as
we can achieve that, that's fine.

> > What is acceptable, though? That batched unmap is quite important for
> > performance, because it means that we don't have to bash on the hardware
> > and wait for a flush to complete in the fast path of network driver RX,
> > for example.
> 
> Have you considered a round-robin bitmap-allocator? It allows quite nice
> flushing behavior.

Yeah, I was just looking through the AMD code with a view to doing
something similar. I was thinking of using that technique *within* each
larger range allocated from the whole address space.

> > If we move to a model where we have a separate ->flush_iotlb() call, we
> > need to be careful that we still allow necessary optimisations to
> > happen.
> 
> With iommu_commit() this should be possible, still.
> 
> > I'm looking at fixing performance issues in the Intel IOMMU code, with
> > its virtual address space allocation (the rbtree-based one in iova.c
> > that nobody else uses, which has a single spinlock that *all* CPUs bash
> > on when they need to allocate).
> > 
> > The plan is, vaguely, to allocate large chunks of space to each CPU, and
> > then for each CPU to allocate from its own region first, thus ensuring
> > that the common case doesn't bounce locks between CPUs. It'll be rare
> > for one CPU to have to touch a subregion 'belonging' to another CPU, so
> > lock contention should be drastically reduced.
> 
> Thats an interesting issue. It exists on the AMD IOMMU side too, the
> bitmap-allocator runs in a per-domain spinlock which can get high
> contention. I am not sure how per-cpu chunks of the address space scale
> to large numbers of cpus, though, given that some devices only have a
> small address range that they can address.

I don't care about performance of broken hardware. If we have a single
*global* "subrange" for the <4GiB range of address space, that's
absolutely fine by me.

But also, it's not *so* much of an issue to divide the space up even
when it's limited. The idea was not to have it *strictly* per-CPU, but
just for a CPU to try allocating from "its own" subrange first, and then
fall back to allocating a new subrange, and *then* fall back to
allocating from subranges "belonging" to other CPUs. It's not that the
allocation from a subrange would be lockless — it's that the lock would
almost never leave the l1 cache of the CPU that *normally* uses that
subrange.

With batched unmaps, the CPU doing the unmap may end up taking the lock
occasionally, and bounce cache lines then. But it's infrequent enough
that it shouldn't be a performance problem.

> I have been thinking about some lockless algorithms for the
> bitmap-allocator. But the ideas are not finalized yet, so I still don't
> know if they will work out at all :)

As explained above, I wasn't going for lockless. I was going for
lock-contention-less. Or at least mostly :)

Do you think that approach sounds reasonable?

> The plan is to have a single DMA-API i

Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori


On 11/10/2011 12:42 PM, Daniel P. Berrange wrote:

On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:

What does libvirt actually do in the monitor prior to migration
completing on the destination?  The least invasive way of doing
delayed open of block devices is probably to make -incoming create a
monitor and run a main loop before the block devices (and full
device model) is initialized.  Since this isolates the changes
strictly to migration, I'd feel okay doing this for 1.0 (although it
might need to be in the stable branch).


The way migration works with libvirt wrt QEMU interactions is now
as follows

  1. Destination.
Run   qemu -incoming ...args...
Query chardevs via monitor
Query vCPU threads via monitor
Set disk / vnc passwords


Since RHEL carries Juan's patch, and Juan's patch doesn't handle disk passwords 
gracefully, how does libvirt cope with that?


Regards,

Anthony Liguori


Set netdev link states
Set balloon target

  2. Source
Set  migration speed
Set  migration max downtime
Run  migrate command (detached)
while 1
   Query migration status
   if status is failed or success
 break;

  3. Destination
   If final status was success
  Run  'cont' in monitor
   else
  kill QEMU process

  4. Source
   If final status was success and 'cont' on dest succeeded
  kill QEMU process
   else
  Run 'cont' in monitor


In older libvirt, the bits from step 4, would actually take place
at the end of step 2. This meant we could end up with no QEMU
on either the source or dest, if starting CPUs on the dest QEMU
failed for some reason.


We would still really like to have a 'query-migrate' command for
the destination, so that we can confirm that the destination has
consumed all incoming migrate data successfully, rather than just
blindly starting CPUs and hoping for the best.

Regards,
Daniel


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Daniel P. Berrange

On Thu, Nov 10, 2011 at 12:27:30PM -0600, Anthony Liguori wrote:
> What does libvirt actually do in the monitor prior to migration
> completing on the destination?  The least invasive way of doing
> delayed open of block devices is probably to make -incoming create a
> monitor and run a main loop before the block devices (and full
> device model) is initialized.  Since this isolates the changes
> strictly to migration, I'd feel okay doing this for 1.0 (although it
> might need to be in the stable branch).

The way migration works with libvirt wrt QEMU interactions is now
as follows

 1. Destination.
   Run   qemu -incoming ...args...
   Query chardevs via monitor
   Query vCPU threads via monitor
   Set disk / vnc passwords
   Set netdev link states
   Set balloon target

 2. Source
   Set  migration speed
   Set  migration max downtime
   Run  migrate command (detached)
   while 1
  Query migration status
  if status is failed or success
break;

 3. Destination
  If final status was success
 Run  'cont' in monitor
  else
 kill QEMU process

 4. Source
  If final status was success and 'cont' on dest succeeded
 kill QEMU process
  else
 Run 'cont' in monitor

In older libvirt, the bits from step 4, would actually take place
at the end of step 2. This meant we could end up with no QEMU
on either the source or dest, if starting CPUs on the dest QEMU
failed for some reason.

We would still really like to have a 'query-migrate' command for
the destination, so that we can confirm that the destination has
consumed all incoming migrate data successfully, rather than just
blindly starting CPUs and hoping for the best.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori


On 11/10/2011 02:55 AM, Avi Kivity wrote:

On 11/09/2011 07:35 PM, Anthony Liguori wrote:

On 11/09/2011 11:02 AM, Avi Kivity wrote:

On 11/09/2011 06:39 PM, Anthony Liguori wrote:


Migration with qcow2 is not a supported feature for 1.0.  Migration is
only supported with raw images using coherent shared storage[1].

[1] NFS is only coherent with close-to-open which right now is not
good enough for migration.


Say what?


Due to block format probing, we read at least the first sector of the
disk during start up.

Strictly going by what NFS guarantees, since we don't open on the
destination *after* as close on the source, we aren't guaranteed to
see what's written by the source.

In practice, because of block format probing, unless we're using
cache=none, the first sector can be out of sync with the source on the
destination.  If you use cache=none on a Linux client with at least a
Linux NFS server, you should be relatively safe.



IMO, this should be a release blocker.  qemu 1.0 only supporting
migration on enterprise storage?

If we have to delay the release for a month to get it right, we should.
Not that I think we have to.



Adding libvirt to the discussion.

What does libvirt actually do in the monitor prior to migration completing on 
the destination?  The least invasive way of doing delayed open of block devices 
is probably to make -incoming create a monitor and run a main loop before the 
block devices (and full device model) is initialized.  Since this isolates the 
changes strictly to migration, I'd feel okay doing this for 1.0 (although it 
might need to be in the stable branch).


I know a monitor can run like this as I've done it before but some of the 
commands will not behave as expected so it's pretty important to be comfortable 
with what commands are actually being used in this mode.


Regards,

Anthony Liguori
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori


On 11/10/2011 04:41 AM, Kevin Wolf wrote:

Am 09.11.2011 22:01, schrieb Anthony Liguori:

On 11/09/2011 03:00 PM, Michael S. Tsirkin wrote:

On Wed, Nov 09, 2011 at 02:22:02PM -0600, Anthony Liguori wrote:

On 11/09/2011 02:18 PM, Michael S. Tsirkin wrote:

On Wed, Nov 09, 2011 at 11:35:54AM -0600, Anthony Liguori wrote:

On 11/09/2011 11:02 AM, Avi Kivity wrote:

On 11/09/2011 06:39 PM, Anthony Liguori wrote:


Migration with qcow2 is not a supported feature for 1.0.  Migration is
only supported with raw images using coherent shared storage[1].

[1] NFS is only coherent with close-to-open which right now is not
good enough for migration.


Say what?


Due to block format probing, we read at least the first sector of
the disk during start up.


A simple solution is not to do any probing before the VM is first
started on the incoming path.

Any issues with this?



http://mid.gmane.org/1284213896-12705-4-git-send-email-aligu...@us.ibm.com
I think Kevin wanted open to get delayed.

Regards,

Anthony Liguori


So, this patchset just needs to be revived and polished up?


What I took from the feedback was that Kevin wanted to defer open until the
device model started.  That eliminates the need to reopen or have a invalidation
callback.

I think it would be good for Kevin to comment here though because I might have
misunderstood his feedback.


Your approach was to delay reads, but still keep the image open. I think
I worried that we might have additional reads somewhere that we don't
know about, and this is why I proposed delaying the open as well, so
that any read would always fail.

I believe just reopening the image is (almost?) as good and it's way
easier to do, so I would be inclined to do that for 1.0.


I don't think reopen is good enough without delaying CHS probing too.  That 
information is still potentially out of date.  I don't think you can fix this 
problem without delaying CHS probing at least.


Regards,

Anthony Liguori



I'm not 100% sure about cases like iscsi, where reopening doesn't help.
I think delaying the open doesn't help there either if you migrate from
A to B and then back from B to A, you could still get old data. So for
iscsi probably cache=none remains the only safe choice, whatever we do.

Kevin



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori


On 11/10/2011 02:55 AM, Avi Kivity wrote:

On 11/09/2011 07:35 PM, Anthony Liguori wrote:

On 11/09/2011 11:02 AM, Avi Kivity wrote:

On 11/09/2011 06:39 PM, Anthony Liguori wrote:


Migration with qcow2 is not a supported feature for 1.0.  Migration is
only supported with raw images using coherent shared storage[1].

[1] NFS is only coherent with close-to-open which right now is not
good enough for migration.


Say what?


Due to block format probing, we read at least the first sector of the
disk during start up.

Strictly going by what NFS guarantees, since we don't open on the
destination *after* as close on the source, we aren't guaranteed to
see what's written by the source.

In practice, because of block format probing, unless we're using
cache=none, the first sector can be out of sync with the source on the
destination.  If you use cache=none on a Linux client with at least a
Linux NFS server, you should be relatively safe.



IMO, this should be a release blocker.  qemu 1.0 only supporting
migration on enterprise storage?


No, this is not going to block the release.

You can't dump patches on the ML during -rc for an issue that has been 
understood for well over a year simply because it's release time.


If this was so important, it should have been fixed a year ago in the proper 
way.

Regards,

Anthony Liguori



If we have to delay the release for a month to get it right, we should.
Not that I think we have to.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Anthony Liguori


On 11/10/2011 10:50 AM, Juan Quintela wrote:

Kevin Wolf  wrote:


What I took from the feedback was that Kevin wanted to defer open until the
device model started.  That eliminates the need to reopen or have a invalidation
callback.

I think it would be good for Kevin to comment here though because I might have
misunderstood his feedback.


Your approach was to delay reads, but still keep the image open. I think
I worried that we might have additional reads somewhere that we don't
know about, and this is why I proposed delaying the open as well, so
that any read would always fail.

I believe just reopening the image is (almost?) as good and it's way
easier to do, so I would be inclined to do that for 1.0.

I'm not 100% sure about cases like iscsi, where reopening doesn't help.
I think delaying the open doesn't help there either if you migrate from
A to B and then back from B to A, you could still get old data. So for
iscsi probably cache=none remains the only safe choice, whatever we do.


iSCSI and NFS only works with cache=none.  Even on NFS with close+open,
we have troubles if anything else has the file opened (think libvirt,
guestfs, whatever).


Reopening with iSCSI is strictly an issue with the in-kernel initiator, right? 
libiscsi should be safe with a delayed open I would imagine.


Regards,

Anthony Liguori

  I really think that anynthing different of

cache=none from iSCSI or NFS is just betting (and yes, it took a while
for Christoph to convince me, I was trying to a "poor man" distributed
lock manager, and as everybody knows, it is a _difficult_ problem to
solve.).

Later, Juan.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Juan Quintela

Avi Kivity  wrote:
> On 11/09/2011 07:35 PM, Anthony Liguori wrote:
>> On 11/09/2011 11:02 AM, Avi Kivity wrote:
>>> On 11/09/2011 06:39 PM, Anthony Liguori wrote:

 Migration with qcow2 is not a supported feature for 1.0.  Migration is
 only supported with raw images using coherent shared storage[1].

 [1] NFS is only coherent with close-to-open which right now is not
 good enough for migration.
>>>
>>> Say what?
>>
>> Due to block format probing, we read at least the first sector of the
>> disk during start up.
>>
>> Strictly going by what NFS guarantees, since we don't open on the
>> destination *after* as close on the source, we aren't guaranteed to
>> see what's written by the source.
>>
>> In practice, because of block format probing, unless we're using
>> cache=none, the first sector can be out of sync with the source on the
>> destination.  If you use cache=none on a Linux client with at least a
>> Linux NFS server, you should be relatively safe.
>>
>
> IMO, this should be a release blocker.  qemu 1.0 only supporting
> migration on enterprise storage?
>
> If we have to delay the release for a month to get it right, we should. 
> Not that I think we have to.

I kind of agree here, but it is not my call.  Patch 1/2 have been used
on RHEL for almost 3 years, so it should be safe (TM).

Later, Juan.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread Joerg Roedel

On Thu, Nov 10, 2011 at 03:28:50PM +, David Woodhouse wrote:

> Which brings me to another question I have been pondering... do we even
> have a consensus on exactly *when* the IOTLB should be flushed?

Well, sort of, there is still the outstanding idea of the
iommu_commit() interface for the IOMMU-API.

> Even just for the Intel IOMMU, we have three different behaviours:
> 
>   - For DMA API users by default, we do 'batched unmap', so a mapping
> may be active for a period of time after the driver has requested
> that it be unmapped.

The requirement for the DMA-API is, that the IOTLB must be consistent
with existing mappings, and only with the parts that are really mapped.
The unmapped parts are not important.

This allows nice optimizations like your 'batched unmap' on the Intel
IOMMU driver side. The AMD IOMMU driver uses a round-robin bitmap
allocator for the IO addresses which makes it very easy to flush certain
IOTLB ranges only before they are reused.

>   - ... unless booted with 'intel_iommu=strict', in which case we do the
> unmap and IOTLB flush immediately before returning to the driver.

There is something similar on the AMD IOMMU side. There it is called
unmap_flush.

>   - But the IOMMU API for virtualisation is different. In fact that
> doesn't seem to flush the IOTLB at all. Which is probably a bug.

Well, *current* requirement is, that the IOTLB is in sync with the
page-table at every time. This is true for the iommu_map and especially
for the iommu_unmap function. It means basically that the unmapped area
needs to be flushed out of the IOTLBs before iommu_unmap returns.

Some time ago I proposed the iommu_commit() interface which changes
these requirements. With this interface the requirement is that after a
couple of map/unmap operations the IOMMU-API user has to call
iommu_commit() to make these changes visible to the hardware (so mostly
sync the IOTLBs). As discussed at that time this would make sense for
the Intel and AMD IOMMU drivers.

> What is acceptable, though? That batched unmap is quite important for
> performance, because it means that we don't have to bash on the hardware
> and wait for a flush to complete in the fast path of network driver RX,
> for example.

Have you considered a round-robin bitmap-allocator? It allows quite nice
flushing behavior.

> If we move to a model where we have a separate ->flush_iotlb() call, we
> need to be careful that we still allow necessary optimisations to
> happen.

With iommu_commit() this should be possible, still.

> I'm looking at fixing performance issues in the Intel IOMMU code, with
> its virtual address space allocation (the rbtree-based one in iova.c
> that nobody else uses, which has a single spinlock that *all* CPUs bash
> on when they need to allocate).
> 
> The plan is, vaguely, to allocate large chunks of space to each CPU, and
> then for each CPU to allocate from its own region first, thus ensuring
> that the common case doesn't bounce locks between CPUs. It'll be rare
> for one CPU to have to touch a subregion 'belonging' to another CPU, so
> lock contention should be drastically reduced.

Thats an interesting issue. It exists on the AMD IOMMU side too, the
bitmap-allocator runs in a per-domain spinlock which can get high
contention. I am not sure how per-cpu chunks of the address space scale
to large numbers of cpus, though, given that some devices only have a
small address range that they can address.

I have been thinking about some lockless algorithms for the
bitmap-allocator. But the ideas are not finalized yet, so I still don't
know if they will work out at all :)
The basic idea builds around the fact, that most allocations using the
DMA-API fit into one page. So probably we can split the address-space
into a region for one-page allocations which can be accessed without
locks and another region for larger allocations which still need locks.

> Should I be planning to drop the DMA API support from intel-iommu.c
> completely, and have the new allocator just call into the IOMMU API
> functions instead? Other people have been looking at that, haven't they?

Yes, Marek Szyprowski from the ARM side is looking into this already,
but his patches are very ARM specific and not suitable for x86 yet.

> Is there any code? Or special platform-specific requirements for such a
> generic wrapper that I might not have thought of? Details about when to
> flush the IOTLB are one such thing which might need special handling for
> certain hardware...

The plan is to have a single DMA-API implementation for all IOMMU
drivers (X86 and ARM) which just uses the IOMMU-API. But to make this
performing reasonalbly well a few changes to the IOMMU-API are required.
I already have some ideas which we can discuss if you want.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Re

Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls

2011-11-10 Thread Marcelo Tosatti

On Thu, Nov 10, 2011 at 05:49:42PM +0100, Alexander Graf wrote:
> >>  Documentation/virtual/kvm/api.txt |   47 
> >> ++
> >>  arch/powerpc/kvm/powerpc.c|   51 
> >> +
> >>  include/linux/kvm.h   |   32 +++
> >>  3 files changed, 130 insertions(+), 0 deletions(-)
> >I don't see the benefit of this generalization, the current structure where
> >context information is hardcoded in the data transmitted works well.
> 
> Well, unfortunately it doesn't work quite as well for us because we
> are a much more evolving platform. Also, there are a lot of edges
> and corners of the architecture that simply aren't implemented in
> KVM as of now. I want to have something extensible enough so we
> don't break the ABI along the way.

You still have to agree on format between userspace and kernel, right?
If either party fails to conform to that, you're doomed.

The problem with two interfaces is potential ambiguity: is
register X implemented through KVM_GET_ONE_REG and also through
KVM_GET_XYZ_REGISTER_SET ? If its accessible by two interfaces, what is
the register writeback order? Is there a plan to convert, etc.

If you agree these concerns are valid, perhaps this interface can be PPC
specific.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions

2011-11-10 Thread Juan Quintela

Kevin Wolf  wrote:

>> What I took from the feedback was that Kevin wanted to defer open until the 
>> device model started.  That eliminates the need to reopen or have a 
>> invalidation 
>> callback.
>> 
>> I think it would be good for Kevin to comment here though because I might 
>> have 
>> misunderstood his feedback.
>
> Your approach was to delay reads, but still keep the image open. I think
> I worried that we might have additional reads somewhere that we don't
> know about, and this is why I proposed delaying the open as well, so
> that any read would always fail.
>
> I believe just reopening the image is (almost?) as good and it's way
> easier to do, so I would be inclined to do that for 1.0.
>
> I'm not 100% sure about cases like iscsi, where reopening doesn't help.
> I think delaying the open doesn't help there either if you migrate from
> A to B and then back from B to A, you could still get old data. So for
> iscsi probably cache=none remains the only safe choice, whatever we do.

iSCSI and NFS only works with cache=none.  Even on NFS with close+open,
we have troubles if anything else has the file opened (think libvirt,
guestfs, whatever).  I really think that anynthing different of
cache=none from iSCSI or NFS is just betting (and yes, it took a while
for Christoph to convince me, I was trying to a "poor man" distributed
lock manager, and as everybody knows, it is a _difficult_ problem to
solve.).

Later, Juan.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls

2011-11-10 Thread Alexander Graf


On 11/10/2011 05:05 PM, Marcelo Tosatti wrote:

On Mon, Oct 31, 2011 at 08:53:11AM +0100, Alexander Graf wrote:

Right now we transfer a static struct every time we want to get or set
registers. Unfortunately, over time we realize that there are more of
these than we thought of before and the extensibility and flexibility of
transferring a full struct every time is limited.

So this is a new approach to the problem. With these new ioctls, we can
get and set a single register that is identified by an ID. This allows for
very precise and limited transmittal of data. When we later realize that
it's a better idea to shove over multiple registers at once, we can reuse
most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS
interface.

The only downpoint I see to this one is that it needs to pad to 1024 bits
(hardware is already on 512 bit registers, so I wanted to leave some room)
which is slightly too much for transmitting only 64 bits. But if that's all
the tradeoff we have to do for getting an extensible interface, I'd say go
for it nevertheless.

Signed-off-by: Alexander Graf
---
  Documentation/virtual/kvm/api.txt |   47 ++
  arch/powerpc/kvm/powerpc.c|   51 +
  include/linux/kvm.h   |   32 +++
  3 files changed, 130 insertions(+), 0 deletions(-)

I don't see the benefit of this generalization, the current structure where
context information is hardcoded in the data transmitted works well.


Well, unfortunately it doesn't work quite as well for us because we are 
a much more evolving platform. Also, there are a lot of edges and 
corners of the architecture that simply aren't implemented in KVM as of 
now. I want to have something extensible enough so we don't break the 
ABI along the way.



Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Memory sync algorithm during migration

2011-11-10 Thread Oliver Hookins

Hi,

I am performing some benchmarks on KVM migration on two different types of VM.
One has 4GB RAM and the other 32GB. More or less idle, the 4GB VM takes about 20
seconds to migrate on our hardware while the 32GB VM takes about a minute.

With a reasonable amount of memory activity going on (in the hundreds of MB per
second) the 32GB VM takes 3.5 minutes to migrate, but the 4GB VM never
completes. Intuitively this tells me there is some watermarking of dirty pages
going on that is not particularly efficient when the dirty pages ratio is high
compared to total memory, but I may be completely incorrect.

Could anybody fill me in on what might be going on here? We're using libvirt
0.8.2 and kvm-83-224.el5.centos.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Set numa topology for max_cpus

2011-11-10 Thread Marcelo Tosatti

On Wed, Oct 26, 2011 at 02:19:00PM +0200, Vasilis Liaskovitis wrote:
> qemu-kvm passes numa/SRAT topology information for smp_cpus to SeaBIOS. 
> However
> SeaBIOS always expects to setup max_cpus number of SRAT cpu entries
> (MaxCountCPUs variable in build_srat function of Seabios). When qemu-kvm runs
> with smp_cpus != max_cpus (e.g. -smp 2,maxcpus=4), Seabios will mistakenly use
> memory SRAT info for setting up CPU SRAT entries for the offline CPUs. Wrong
> SRAT memory entries are also created. This breaks NUMA in a guest.
> Fix by setting up SRAT info for max_cpus in qemu-kvm.
> 
> Signed-off-by: Vasilis Liaskovitis 

Applied to uq/master, thanks.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC v2 0/2] Initial support for Microsoft Hyper-V.

2011-11-10 Thread Marcelo Tosatti

On Sun, Oct 23, 2011 at 05:39:47PM +0200, Vadim Rozenfeld wrote:
> With the following series of patches we are starting to implement
> some basic Microsoft Hyper-V Enlightenment functionality. This series
> is mostly about adding support for relaxed timing, spinlock,
> and virtual apic.
> 
> For more Hyper-V related information please see:
> "Hypervisor Functional Specification v2.0: For Windows Server 2008 R2" at
> http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=18673
> 
> Changelog:
>  v2->v1
>   - remove KVM_CAP_IRQCHIP ifdef,
>   - remove CONFIG_HYPERV config option,
>   - move KVM leaves to new location (0x4100),
>   - cosmetic changes.
>  v0->v1
>   - move hyper-v parameters under cpu category,
>   - move hyper-v stuff to target-i386 directory,
>   - make CONFIG_HYPERV enabled by default for
> i386-softmmu and x86_64-softmmu configurations,
>   - rearrange the patches from v0,
>   - set HV_X64_MSR_HYPERCALL, HV_X64_MSR_GUEST_OS_ID,
> and HV_X64_MSR_APIC_ASSIST_PAGE to 0 on system reset.

Paolo, Jan, can you ack please? 

IMO making it TCG-friendly is nice, but, can be done later.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] kvm: x86: Drop redundant apic base and tpr update from kvm_get_sregs

2011-11-10 Thread Marcelo Tosatti

On Wed, Oct 26, 2011 at 01:09:45PM +0200, Jan Kiszka wrote:
> The latter was already commented out, the former is redundant as well.
> We always get the latest changes after return from the guest via
> kvm_arch_post_run.
> 
> Signed-off-by: Jan Kiszka 

Applied, thanks.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls

2011-11-10 Thread Marcelo Tosatti

On Mon, Oct 31, 2011 at 08:53:11AM +0100, Alexander Graf wrote:
> Right now we transfer a static struct every time we want to get or set
> registers. Unfortunately, over time we realize that there are more of
> these than we thought of before and the extensibility and flexibility of
> transferring a full struct every time is limited.
> 
> So this is a new approach to the problem. With these new ioctls, we can
> get and set a single register that is identified by an ID. This allows for
> very precise and limited transmittal of data. When we later realize that
> it's a better idea to shove over multiple registers at once, we can reuse
> most of the infrastructure and simply implement a GET_MANY_REGS / 
> SET_MANY_REGS
> interface.
> 
> The only downpoint I see to this one is that it needs to pad to 1024 bits
> (hardware is already on 512 bit registers, so I wanted to leave some room)
> which is slightly too much for transmitting only 64 bits. But if that's all
> the tradeoff we have to do for getting an extensible interface, I'd say go
> for it nevertheless.
> 
> Signed-off-by: Alexander Graf 
> ---
>  Documentation/virtual/kvm/api.txt |   47 ++
>  arch/powerpc/kvm/powerpc.c|   51 
> +
>  include/linux/kvm.h   |   32 +++
>  3 files changed, 130 insertions(+), 0 deletions(-)

I don't see the benefit of this generalization, the current structure where 
context information is hardcoded in the data transmitted works well.

Avi?

> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index ab1136f..a23fe62 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1482,6 +1482,53 @@ is supported; 2 if the processor requires all virtual 
> machines to have
>  an RMA, or 1 if the processor can use an RMA but doesn't require it,
>  because it supports the Virtual RMA (VRMA) facility.
>  
> +4.64 KVM_SET_ONE_REG
> +
> +Capability: KVM_CAP_ONE_REG
> +Architectures: all
> +Type: vcpu ioctl
> +Parameters: struct kvm_one_reg (in)
> +Returns: 0 on success, negative value on failure
> +
> +struct kvm_one_reg {
> +   __u64 id;
> +   union {
> +   __u8 reg8;
> +   __u16 reg16;
> +   __u32 reg32;
> +   __u64 reg64;
> +   __u8 reg128[16];
> +   __u8 reg256[32];
> +   __u8 reg512[64];
> +   __u8 reg1024[128];
> +   } u;
> +};
> +
> +Using this ioctl, a single vcpu register can be set to a specific value
> +defined by user space with the passed in struct kvm_one_reg. There can
> +be architecture agnostic and architecture specific registers. Each have
> +their own range of operation and their own constants and width. To keep
> +track of the implemented registers, find a list below:
> +
> +  Arch  |   Register| Width (bits)
> +|   |
> +
> +4.65 KVM_GET_ONE_REG
> +
> +Capability: KVM_CAP_ONE_REG
> +Architectures: all
> +Type: vcpu ioctl
> +Parameters: struct kvm_one_reg (in and out)
> +Returns: 0 on success, negative value on failure
> +
> +This ioctl allows to receive the value of a single register implemented
> +in a vcpu. The register to read is indicated by the "id" field of the
> +kvm_one_reg struct passed in. On success, the register value can be found
> +in the respective width field of the struct after this call.
> +
> +The list of registers accessible using this interface is identical to the
> +list in 4.64.
> +
>  5. The kvm_run structure
>  
>  Application code obtains a pointer to the kvm_run structure by
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index e75c5ac..39cdb3f 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -214,6 +214,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>   case KVM_CAP_PPC_UNSET_IRQ:
>   case KVM_CAP_PPC_IRQ_LEVEL:
>   case KVM_CAP_ENABLE_CAP:
> + case KVM_CAP_ONE_REG:
>   r = 1;
>   break;
>  #ifndef CONFIG_KVM_BOOK3S_64_HV
> @@ -627,6 +628,32 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu 
> *vcpu,
>   return r;
>  }
>  
> +static int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu,
> +   struct kvm_one_reg *reg)
> +{
> + int r = -EINVAL;
> +
> + switch (reg->id) {
> + default:
> + break;
> + }
> +
> + return r;
> +}
> +
> +static int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu,
> +   struct kvm_one_reg *reg)
> +{
> + int r = -EINVAL;
> +
> + switch (reg->id) {
> + default:
> + break;
> + }
> +
> + return r;
> +}
> +
>  int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
>  struct kvm_mp_state *mp_state)
>  {
> @@ -666,6 +693,30 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>

Re: [RFC/GIT PULL] Linux KVM tool for v3.2

2011-11-10 Thread Stefan Hajnoczi

On Thu, Nov 10, 2011 at 2:47 PM, Markus Armbruster  wrote:
> Pekka Enberg  writes:
>
>> Hi Anthony,
>>
>> On Thu, Nov 10, 2011 at 3:43 PM, Anthony Liguori  
>> wrote:
>>> It's not just the qcow2 implementation or even the block layer.  This pull
>>> requests adds a userspace TCP/IP stack to the kernel and yet netdev isn't on
>>> the CC and there are no Ack's from anyone from the networking stack.  I'm
>>> fairly sure if they knew what was happening here they would object.
>>
>> It's something we consider extremely important because it allows easy
>> non-root networking. But you're right, we definitely ought to ping the
>> networking folks before the next merge window.
>
> The problem is real.  The solution "duplicate in user space" sucks.  If
> you engaging with the kernel networking folks leads to one that doesn't
> suck, we should bathe you in free beer.

Look at disks, the problem is addressed by the udisks daemon on dbus.
Anything can try talking to it.  If you have permissions or can get
the user to authenticate then you can manipulate LVM volumes, mount
file systems, etc.  We could do something similar for tap networking.

The Ubuntu folks seem to want VDE instead to solve the same problem.
Create a VDE switch with a single tap device on startup.  Then let all
VMs talk to the VDE without privileges.

I don't think going through VDE is nice or performant, would be better
to have add virtual network functionality over dbus just like udisks
did for disks.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread David Woodhouse

On Thu, 2011-11-10 at 14:17 +0800, Kai Huang wrote:
> And another question: have we considered the IOTLB flush operation? I
> think we need to implement similar logic when flush the DVMA range.
> Intel VT-d's manual says software needs to specify the appropriate
> mask value to flush large pages, but it does not say we need to
> exactly match the page size as it is mapped. I guess it's not
> necessary for Intel IOMMU, but other vendor's IOMMU may have such
> limitation (or some other limitations). In my understanding current
> implementation does not provide page size information for particular
> DVMA ranges that has been mapped, and it's not flexible to implement
> IOTLB flush code (ex, we may need to walk through page table to find
> out actual page size). Maybe we can also add iommu_ops->flush_iotlb ? 

Which brings me to another question I have been pondering... do we even
have a consensus on exactly *when* the IOTLB should be flushed?

Even just for the Intel IOMMU, we have three different behaviours:

  - For DMA API users by default, we do 'batched unmap', so a mapping
may be active for a period of time after the driver has requested
that it be unmapped.

  - ... unless booted with 'intel_iommu=strict', in which case we do the
unmap and IOTLB flush immediately before returning to the driver.

  - But the IOMMU API for virtualisation is different. In fact that
doesn't seem to flush the IOTLB at all. Which is probably a bug.

What is acceptable, though? That batched unmap is quite important for
performance, because it means that we don't have to bash on the hardware
and wait for a flush to complete in the fast path of network driver RX,
for example.

If we move to a model where we have a separate ->flush_iotlb() call, we
need to be careful that we still allow necessary optimisations to
happen.

Since I have the right people on Cc and the iommu list is still down,
and it's vaguely tangentially related...

I'm looking at fixing performance issues in the Intel IOMMU code, with
its virtual address space allocation (the rbtree-based one in iova.c
that nobody else uses, which has a single spinlock that *all* CPUs bash
on when they need to allocate).

The plan is, vaguely, to allocate large chunks of space to each CPU, and
then for each CPU to allocate from its own region first, thus ensuring
that the common case doesn't bounce locks between CPUs. It'll be rare
for one CPU to have to touch a subregion 'belonging' to another CPU, so
lock contention should be drastically reduced.

Should I be planning to drop the DMA API support from intel-iommu.c
completely, and have the new allocator just call into the IOMMU API
functions instead? Other people have been looking at that, haven't they?
Is there any code? Or special platform-specific requirements for such a
generic wrapper that I might not have thought of? Details about when to
flush the IOTLB are one such thing which might need special handling for
certain hardware...

-- 
dwmw2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 >

1 - 100 of 107 matches

Mail list logo