Re: KVM qemu-kvm "ext4_fill_flex_info()" Denial of Service Vulnerability

2012-04-11 Thread Avi Kivity
On 04/10/2012 10:39 PM, Agostino Sarubbo wrote:
> Hi all.
>
> Yesterday, secunia has released an advisory about qemu-kvm.
> https://secunia.com/advisories/48645/
>
> This seems to describe an 'old' kernel bug, but I don't know if there is a 
> 'link' between the ext4 issue and kvm.
>
> Can you explain a bit this issue?
>

Appears to be 100% unrelated to kvm.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] virtio_blk: Add help function to format mass of disks

2012-04-11 Thread Michael S. Tsirkin
On Tue, Apr 10, 2012 at 03:28:05PM +0800, Ren Mingxin wrote:
> The current virtio block's naming algorithm just supports 18278
> (26^3 + 26^2 + 26) disks. If there are more virtio blocks than that,
> there will be disks with the same name.
> 
> Based on commit 3e1a7ff8a0a7b948f2684930166954f9e8e776fe, I add
> the function "virtblk_name_format()" for virtio block to support
> naming a larger number of disks.
> 
> Signed-off-by: Ren Mingxin 

Applied, thanks everyone.

> ---
>  drivers/block/virtio_blk.c |   38 ++
>  1 files changed, 26 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index c4a60ba..86516c8 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -374,6 +374,31 @@ static int init_vq(struct virtio_blk *vblk)
>   return err;
>  }
>  
> +static int virtblk_name_format(char *prefix, int index, char *buf, int buflen)
> +{
> + const int base = 'z' - 'a' + 1;
> + char *begin = buf + strlen(prefix);
> + char *end = buf + buflen;
> + char *p;
> + int unit;
> +
> + p = end - 1;
> + *p = '\0';
> + unit = base;
> + do {
> + if (p == begin)
> + return -EINVAL;
> + *--p = 'a' + (index % unit);
> + index = (index / unit) - 1;
> + } while (index >= 0);
> +
> + memmove(begin, p, end - p);
> + memcpy(buf, prefix, strlen(prefix));
> +
> + return 0;
> +}
> +
>  static int __devinit virtblk_probe(struct virtio_device *vdev)
>  {
>   struct virtio_blk *vblk;
> @@ -442,18 +467,7 @@ static int __devinit virtblk_probe(struct virtio_device 
> *vdev)
>  
>   q->queuedata = vblk;
>  
> - if (index < 26) {
> - sprintf(vblk->disk->disk_name, "vd%c", 'a' + index % 26);
> - } else if (index < (26 + 1) * 26) {
> - sprintf(vblk->disk->disk_name, "vd%c%c",
> - 'a' + index / 26 - 1, 'a' + index % 26);
> - } else {
> - const unsigned int m1 = (index / 26 - 1) / 26 - 1;
> - const unsigned int m2 = (index / 26 - 1) % 26;
> - const unsigned int m3 =  index % 26;
> - sprintf(vblk->disk->disk_name, "vd%c%c%c",
> - 'a' + m1, 'a' + m2, 'a' + m3);
> - }
> + virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
>  
>   vblk->disk->major = major;
>   vblk->disk->first_minor = index_to_minor(index);
> -- 
> 1.7.1
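
For readers who want to see how the naming scheme above extends past vdz,
here is a small user-space sketch of the same bijective base-26 algorithm.
It mirrors the logic of virtblk_name_format(); the buffer size and the "vd"
prefix are chosen for illustration only, and the sample indices at the end
were worked out by hand from the algorithm, not taken from the patch.

/* Standalone sketch of the naming algorithm used by the patch above. */
#include <stdio.h>
#include <string.h>

static int name_format(const char *prefix, int index, char *buf, int buflen)
{
        const int base = 'z' - 'a' + 1;
        char *begin = buf + strlen(prefix);
        char *end = buf + buflen;
        char *p = end - 1;

        *p = '\0';
        do {
                if (p == begin)
                        return -1;              /* buffer too small */
                *--p = 'a' + (index % base);
                index = (index / base) - 1;
        } while (index >= 0);

        memmove(begin, p, end - p);             /* move digits up to the prefix */
        memcpy(buf, prefix, strlen(prefix));
        return 0;
}

int main(void)
{
        int samples[] = { 0, 25, 26, 18277, 18278 };
        char name[32];
        unsigned i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                name_format("vd", samples[i], name, sizeof(name));
                printf("index %5d -> %s\n", samples[i], name);
        }
        return 0;       /* prints vda, vdz, vdaa, vdzzz, vdaaaa */
}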


Re: [PATCH 2/4] KVM: VMX: Add functions to fill VMCSINFO

2012-04-11 Thread Avi Kivity
On 04/11/2012 04:50 AM, zhangyanfei wrote:
> This patch implements the feature that, at initialization of the
> kvm_intel module, fills VMCSINFO with a VMCS revision identifier
> and encoded offsets of VMCS fields. The reason why we put the
> VMCSINFO processing at the initialization of kvm_intel module
> is that it's dangerous to rob VMX resources while kvm module is
> loaded.

Maybe it should be done by a separate module.

> +
> + kvm_cpu_vmxon(__pa(per_cpu(vmxarea, raw_smp_processor_id())));
> + vmcs_load(vmcs);

Should do this after writing into the vmcs directly (vmcs_load() may
cache some information for vmcs_read()).

> +
> + VMCSINFO_REVISION_ID(vmcs->revision_id);
> +
> + /*
> +  * Write encoded offsets into VMCS data for later vmcs_read.
> +  */
> + for (offset = FIELD_START; offset < vmcs_config.size;
> +  offset += sizeof(u16))
> + *(u16 *)((char *)vmcs + offset) = ENCODING_OFFSET(offset);

This assumes vmcs field contents use the same encoding as
vmread/vmwrite.  I guess it's a reasonable assumption.


-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump

2012-04-11 Thread Avi Kivity
On 04/11/2012 04:39 AM, zhangyanfei wrote:
> This patch set exports offsets of VMCS fields as note information for
> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
> runtime state of guest machine image, such as registers, in host
> machine's crash dump in VMCS format. The problem is that the VMCS
> internal layout is hidden by Intel in its specification. So, we reverse
> engineer it in the way implemented in this patch set. Please note
> that this processing never affects any existing kvm logic. The
> VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO.
>
> Here is an example:
> Processor: Intel(R) Core(TM)2 Duo CPU E7500  @ 2.93GHz
>
> $cat /sys/kernel/vmcsinfo
> 1cba8c0 2000
>
> crash> rd -p 1cba8c0 1000
>  1cba8c0:  127b0009 53434d56   {...VMCS
>  1cba8d0:  4f464e49 4e4f495349564552   INFOREVISION
>  1cba8e0:  49460a643d44495f 5f4e495028444c45   _ID=d.FIELD(PIN_
>  1cba8f0:  4d565f4445534142 4f435f434558455f   BASED_VM_EXEC_CO
>  1cba900:  303d294c4f52544e 0a30383130343831   NTROL)=01840180.
>  1cba910:  504328444c454946 5f44455341425f55   FIELD(CPU_BASED_
>  1cba920:  5f434558455f4d56 294c4f52544e4f43   VM_EXEC_CONTROL)
>  1cba930:  393130343931303d 28444c4549460a30   =01940190.FIELD(
>  1cba940:  5241444e4f434553 4558455f4d565f59   SECONDARY_VM_EXE
>  1cba950:  4f52544e4f435f43 30346566303d294c   C_CONTROL)=0fe40
>  1cba960:  4c4549460a306566 4958455f4d562844   fe0.FIELD(VM_EXI
>  1cba970:  4f52544e4f435f54 346531303d29534c   T_CONTROLS)=01e4
>  1cba980:  4549460a30653130 4e455f4d5628444c   01e0.FIELD(VM_EN
>  1cba990:  544e4f435f595254 33303d29534c4f52   TRY_CONTROLS)=03
>  1cba9a0:  460a303133303431 45554728444c4549   140310.FIELD(GUE
>  1cba9b0:  45535f53455f5453 3d29524f5443454c   ST_ES_SELECTOR)=
>  1cba9c0:  4549460a30303530 545345554728444c   0500.FIELD(GUEST
>  1cba9d0:  454c45535f53435f 35303d29524f5443   _CS_SELECTOR)=05
>  ..

Would be nicer to have a simple binary encoding instead
of this.
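
A hypothetical binary layout along those lines -- the struct names and field
sizes below are invented purely to illustrate the suggestion, not something
that was proposed or merged -- could be as small as a header plus an array of
encoding/offset pairs:

#include <stdint.h>

struct vmcsinfo_entry {
        uint32_t field_encoding;        /* architectural VMCS field encoding */
        uint16_t offset;                /* byte offset inside the VMCS region */
        uint16_t pad;
};

struct vmcsinfo {
        uint32_t revision_id;           /* first 32 bits of the VMCS region */
        uint32_t nr_entries;
        struct vmcsinfo_entry entries[];        /* nr_entries records follow */
};

A consumer such as kexec-tools could then scan the entries array directly
instead of parsing a text dump like the one shown above.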

> TODO:
>   1. In kexec-tools, get VMCSINFO via sysfs and dump it as note information
>  into vmcore.
>   2. Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process
>  core file. To do this, we will modify kernel core dumper, gdb gcore
>  and crash gcore.


Seems excessive.  Why do you want vmcs information in qemu cores?  A
qemu crash is very rarely related to kvm, let alone the vmcs.  I
understand that you may want it in a kernel core dump, though I've never
needed to myself.  Can you outline a case where this data was needed?

>   3. Dump guest image from the qemu-process core file into a vmcore.

For this perhaps a different approach is better - modify the core dumper
to call kvm to extract the relevant vmcs information into an elf note. 
This way there is no need to reconstruct the guest data from the
offsets.  It's also more reliable, since vmread can access cached fields
that direct memory access cannot.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-11 Thread Avi Kivity
On 04/11/2012 04:47 AM, Marcelo Tosatti wrote:
> On Tue, Apr 10, 2012 at 01:04:13PM +0300, Avi Kivity wrote:
> > On 04/09/2012 10:46 PM, Marcelo Tosatti wrote:
> > > Perhaps the mmu_lock hold times by get_dirty are a large component here?
> > 
> > That's my concern, because it affects the scaling of migration for wider
> > guests.
> > 
> > > If that can be alleviated, not only RO->RW faults benefit.
> > 
> > Those are the most common types of faults on modern hardware, no?
>
> Depends on your workload, of course. If there is memory pressure,
> 0->PRESENT might be very frequent. My point is that reduction of
> mmu_lock contention is a good thing overall.
>

Agreed.

-- 
error compiling committee.c: too many arguments to function



Re: New git workflow

2012-04-11 Thread Avi Kivity
On 04/06/2012 03:02 PM, Takuya Yoshikawa wrote:
> On Thu, 05 Apr 2012 20:02:44 +0300
> Avi Kivity  wrote:
>
> > In a recent conversation, Linus persuaded me that it's time for change
> > in our git workflow; the following will bring it in line with the
> > current practices of most trees.
> > 
> > The current 'master' branch will be abandoned (still available for
> > reviewing history).  The new branch structure will be as follows:
>
> Please update Documentation/virtual/kvm/review-checklist.txt as well:
>   2.  Patches should be against kvm.git master branch.
>
>

Yes, but let's wait a while until the workflow stabilizes.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump

2012-04-11 Thread zhangyanfei
On 2012-04-11 16:56, Avi Kivity wrote:
> On 04/11/2012 04:39 AM, zhangyanfei wrote:
>> This patch set exports offsets of VMCS fields as note information for
>> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve
>> runtime state of guest machine image, such as registers, in host
>> machine's crash dump as VMCS format. The problem is that VMCS
>> internal is hidden by Intel in its specification. So, we reverse
>> engineering it in the way implemented in this patch set. Please note
>> that this processing never affects any existing kvm logic. The
>> VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO.
>>
>> Here is an example:
>> Processor: Intel(R) Core(TM)2 Duo CPU E7500  @ 2.93GHz
>>
>> $cat /sys/kernel/vmcsinfo
>> 1cba8c0 2000
>>
>> crash> rd -p 1cba8c0 1000
>>  1cba8c0:  127b0009 53434d56   {...VMCS
>>  1cba8d0:  4f464e49 4e4f495349564552   INFOREVISION
>>  1cba8e0:  49460a643d44495f 5f4e495028444c45   _ID=d.FIELD(PIN_
>>  1cba8f0:  4d565f4445534142 4f435f434558455f   BASED_VM_EXEC_CO
>>  1cba900:  303d294c4f52544e 0a30383130343831   NTROL)=01840180.
>>  1cba910:  504328444c454946 5f44455341425f55   FIELD(CPU_BASED_
>>  1cba920:  5f434558455f4d56 294c4f52544e4f43   VM_EXEC_CONTROL)
>>  1cba930:  393130343931303d 28444c4549460a30   =01940190.FIELD(
>>  1cba940:  5241444e4f434553 4558455f4d565f59   SECONDARY_VM_EXE
>>  1cba950:  4f52544e4f435f43 30346566303d294c   C_CONTROL)=0fe40
>>  1cba960:  4c4549460a306566 4958455f4d562844   fe0.FIELD(VM_EXI
>>  1cba970:  4f52544e4f435f54 346531303d29534c   T_CONTROLS)=01e4
>>  1cba980:  4549460a30653130 4e455f4d5628444c   01e0.FIELD(VM_EN
>>  1cba990:  544e4f435f595254 33303d29534c4f52   TRY_CONTROLS)=03
>>  1cba9a0:  460a303133303431 45554728444c4549   140310.FIELD(GUE
>>  1cba9b0:  45535f53455f5453 3d29524f5443454c   ST_ES_SELECTOR)=
>>  1cba9c0:  4549460a30303530 545345554728444c   0500.FIELD(GUEST
>>  1cba9d0:  454c45535f53435f 35303d29524f5443   _CS_SELECTOR)=05
>>  ..
> 
> Would be nicer to have a simple binary encoding   instead
> of this.

Agreed.

> 
>> TODO:
>>   1. In kexec-tools, get VMCSINFO via sysfs and dump it as note information
>>  into vmcore.
>>   2. Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process
>>  core file. To do this, we will modify kernel core dumper, gdb gcore
>>  and crash gcore.
> 
> 
> Seems excessive.  Why do you want vmcs information in qemu cores?  A
> qemu crash is very rarely related to kvm, let alone the vmcs.  I
> understand that you may want it in a kernel core dump, though I've never
> needed to myself.  Can you outline a case where this data was needed?
> 

If a qemu process hits a fatal error that causes it to be core dumped by the
kernel, the guest running on top of that qemu process will be included in the
qemu core file. But without vmcsinfo information in the qemu core file, we
cannot get the guest's state (register values), and therefore cannot make a
complete guest vmcore.

>>   3. Dump guest image from the qemu-process core file into a vmcore.
> 
> For this perhaps a different approach is better - modify the core dumper
> to call kvm to extract the relevant vmcs information into an elf note. 
> This way there is no need to reconstruct the guest data from the
> offsets.  It's also more reliable, since vmread can access cached fields
> that direct memory access cannot.
> 

Is this approach a replacement for TODO 2? That is to say, when generating
a qemu core via the kernel core dumper, we could call kvm to extract the
relevant vmcs information into an elf note instead of dumping VMCSINFO and
the whole vmcs regions.



Re: vhost-blk development

2012-04-11 Thread Stefan Hajnoczi
On Tue, Apr 10, 2012 at 6:25 PM, Michael Baysek  wrote:
> Well, I'm trying to determine which I/O method currently has the very least 
> performance overhead and gives the best performance for both reads and writes.
>
> I am doing my testing by putting the entire guest onto a ramdisk.  I'm 
> working on an i5-760 with 16GB RAM with VT-d enabled.  I am running the 
> standard Centos 6 kernel with 0.12.1.2 release of qemu-kvm that comes stock 
> on Centos 6.  The guest is configured with 512 MB RAM, using 4 cpu cores,
> with its /dev/vda being the ramdisk on the host.

Results collected for ramdisk usually do not reflect the performance
you get with a real disk or SSD.  I suggest using the host/guest
configuration you want to deploy.

> I've been using iozone 3.98 with -O -l32 -i0 -i1 -i2 -e -+n -r4K -s250M to 
> measure performance.

I haven't looked up the options but I think you need -I to use
O_DIRECT and bypass the guest page cache - otherwise you are not
benchmarking I/O performance but overall file system/page cache
performance.
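
For reference, what direct I/O boils down to at the syscall level is roughly
the minimal sketch below. The 4096-byte alignment is an assumption about the
underlying block size; O_DIRECT requires suitably aligned buffers, offsets
and lengths.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const size_t align = 4096, len = 4096; /* assumed block size */
        void *buf;
        int fd;

        if (argc < 2 || posix_memalign(&buf, align, len)) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY | O_DIRECT);        /* bypass page cache */
        if (fd < 0) {
                perror("open");
                return 1;
        }

        if (pread(fd, buf, len, 0) < 0)         /* aligned offset and length */
                perror("pread");

        close(fd);
        free(buf);
        return 0;
}

Without direct I/O, reads may be served from the guest's page cache and the
numbers mostly reflect memory bandwidth rather than the virtio block path.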

Stefan


Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump

2012-04-11 Thread Joerg Roedel
Hi,

On Wed, Apr 11, 2012 at 09:39:43AM +0800, zhangyanfei wrote:
> The problem is that VMCS internal is hidden by Intel in its
> specification. So, we reverse engineering it in the way implemented in
> this patch set.

Have you made sure this layout is the same on all uarchitectures that
implement VMX?


Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [PATCH 2/4] KVM: VMX: Add functions to fill VMCSINFO

2012-04-11 Thread zhangyanfei
On 2012-04-11 16:48, Avi Kivity wrote:
> On 04/11/2012 04:50 AM, zhangyanfei wrote:
>> This patch is to implement the feature that at initialization of
>> kvm_intel module, fills VMCSINFO with a VMCS revision identifier,
>> and encoded offsets of VMCS fields. The reason why we put the
>> VMCSINFO processing at the initialization of kvm_intel module
>> is that it's dangerous to rob VMX resources while kvm module is
>> loaded.
> 
> Maybe it should be done by a separate module.
> 

If we put the vmcsinfo processing at the initialization of the kvm_intel
module, VMCSINFO is filled as soon as the kvm_intel module is loaded. And
because the vmcsinfo processing runs at kvm_intel initialization, no kvm
guests are running yet, so it will not rob any VMX resources.

If it is done by a separate module, I am afraid this module may not be
loaded when the kernel needs VMCSINFO.

>> +
>> +kvm_cpu_vmxon(__pa(per_cpu(vmxarea, raw_smp_processor_id())));
>> +vmcs_load(vmcs);
> 
> Should do this after writing into the vmcs directly (vmcs_load() may
> cache some information for vmcs_read()).
>

Hmm, thanks for pointing this.



Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump

2012-04-11 Thread Avi Kivity
On 04/11/2012 01:21 PM, Joerg Roedel wrote:
> Hi,
>
> On Wed, Apr 11, 2012 at 09:39:43AM +0800, zhangyanfei wrote:
> > The problem is that VMCS internal is hidden by Intel in its
> > specification. So, we reverse engineering it in the way implemented in
> > this patch set.
>
> Have you made sure this layout is the same on all uarchitectures that
> implment VMX?

He's determining the layout at runtime.  It should even work with kvm's
vmx implementation.

It's vulnerable to two issues:
- fields that are cached in the processor and not flushed to memory
(perhaps just make sure to VMXOFF before dumping memory)
- fields that are encoded differently in memory than VMREAD/VMWRITE

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump

2012-04-11 Thread zhangyanfei
On 2012-04-11 18:21, Joerg Roedel wrote:
> Hi,
> 
> On Wed, Apr 11, 2012 at 09:39:43AM +0800, zhangyanfei wrote:
>> The problem is that VMCS internal is hidden by Intel in its
>> specification. So, we reverse engineering it in the way implemented in
>> this patch set.
> 
> Have you made sure this layout is the same on all uarchitectures that
> implment VMX?
> 
> 
>   Joerg
> 

The layout differs between VMCS revision identifiers. The VMCS revision
identifier is contained in the first 32 bits of the VMCS region, and the
revision identifier may differ between architectures.

for example, there are two processors below:
Processor 1: Intel(R) Xeon(R) CPU E7540  @ 2.00GHz with 24 cores
REVISION_ID=e
FIELD(PIN_BASED_VM_EXEC_CONTROL)=05540550
FIELD(CPU_BASED_VM_EXEC_CONTROL)=05440540
FIELD(SECONDARY_VM_EXEC_CONTROL)=054c0548
FIELD(VM_EXIT_CONTROLS) =057c0578
FIELD(VM_ENTRY_CONTROLS)=05940590
..

Processor 2: Intel(R) Core(TM)2 Duo CPU E7500  @ 2.93GHz
REVISION_ID=d
FIELD(PIN_BASED_VM_EXEC_CONTROL)=01840180
FIELD(CPU_BASED_VM_EXEC_CONTROL)=01940190
FIELD(SECONDARY_VM_EXEC_CONTROL)=0fe40fe0
FIELD(VM_EXIT_CONTROLS) =01e401e0
FIELD(VM_ENTRY_CONTROLS)=03140310

The purpose of getting the VMCSINFO of one architecture is to debug guests
that were running on that same architecture, so it is not a problem that the
layouts differ between architectures.

Thanks
Zhang Yanfei
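
As a concrete illustration of "the first 32 bits": a tool that consumes a
dumped VMCS region could pick the revision identifier out as sketched below.
The buffer is assumed to already hold the raw region (for example, read from
a dump file), and the 0xd value is just the E7500 example quoted above.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The revision identifier occupies the first 32 bits of the VMCS region. */
static uint32_t vmcs_revision_id(const void *vmcs_region)
{
        uint32_t rev;

        memcpy(&rev, vmcs_region, sizeof(rev));
        return rev;
}

int main(void)
{
        /* Fake little-endian region for demonstration: revision id 0xd. */
        uint8_t region[4096] = { 0x0d, 0x00, 0x00, 0x00 };

        printf("REVISION_ID=%x\n", vmcs_revision_id(region));
        return 0;
}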



Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump

2012-04-11 Thread Avi Kivity
On 04/11/2012 01:12 PM, zhangyanfei wrote:
> > 
> >> TODO:
> >>   1. In kexec-tools, get VMCSINFO via sysfs and dump it as note information
> >>  into vmcore.
> >>   2. Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process
> >>  core file. To do this, we will modify kernel core dumper, gdb gcore
> >>  and crash gcore.
> > 
> > 
> > Seems excessive.  Why do you want vmcs information in qemu cores?  A
> > qemu crash is very rarely related to kvm, let alone the vmcs.  I
> > understand that you may want it in a kernel core dump, though I've never
> > needed to myself.  Can you outline a case where this data was needed?
> > 
>
> If a qemu process comes to a fatal error that causes itself to be core dumped
> by kernel, the running guest based on the qemu process will be included in 
> that
> qemu core file. But with no vmcsinfo information in qemu core file, we could 
> not
> get the guest's states(registers' values), then we could not make a complete
> guest vmcore.

We can't anyway.  Many registers (GPRs except RSP, fpu) are not stored
in the VMCS, but in kvm data structures.

So for this case we'd want a kvm callback to execute (that would make it
work cross vendor, too).

>
> >>   3. Dump guest image from the qemu-process core file into a vmcore.
> > 
> > For this perhaps a different approach is better - modify the core dumper
> > to call kvm to extract the relevant vmcs information into an elf note. 
> > This way there is no need to reconstruct the guest data from the
> > offsets.  It's also more reliable, since vmread can access cached fields
> > that direct memory access cannot.
> > 
>
> Does this approach is a replacement for TODO 2 ? That is to say, when 
> generating
> a qemu core by kernel core dumper, we could call kvm to extract the relevant 
> vmcs
> information into an elf note instead of VMCSINFO and the whole vmcs regions.

Yes.  I'm not convinced it's important though.

-- 
error compiling committee.c: too many arguments to function



[PATCH being tested] KVM: Reduce mmu_lock contention during dirty logging by cond_resched()

2012-04-11 Thread Takuya Yoshikawa
I am now testing the following patch.

Note: this technique is used in several subsystems, e.g. jbd.

Although people tend to say that holding mmu_lock during get_dirty is
always a problem, my impression is slightly different.

When we call get_dirty, most of the hot memory pages have already been
written at least once and faults are becoming rare.

Actually I rarely saw rescheduling due to mmu_lock contention when
I tested this patch locally -- though not enough.

In contrast, if we do O(1), we need to write protect 511 pages soon
after the get_dirty and the chance of mmu_lock contention may increase
if multiple VCPUs try to write to memory.

Anyway, this patch is small and seems effective.

Takuya

===
From: Takuya Yoshikawa 

get_dirty_log() needs to hold mmu_lock during write protecting dirty
pages and this can be long when there are many dirty pages to protect.

As the guest can take faults during that time, this may result in a
severe latency problem which would prevent the system from scaling.

This patch mitigates this by checking mmu_lock contention for every 2K
dirty pages we protect: we have selected this value since it took about
100us to get 2K dirty pages.

TODO: more numbers.

Signed-off-by: Takuya Yoshikawa 
---
 arch/x86/include/asm/kvm_host.h |6 +++---
 arch/x86/kvm/mmu.c  |   12 +---
 arch/x86/kvm/x86.c  |   18 +-
 3 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f624ca7..26b39c1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -712,9 +712,9 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-struct kvm_memory_slot *slot,
-gfn_t gfn_offset, unsigned long mask);
+int kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 29ad6f9..b88c5cc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1081,20 +1081,26 @@ static int __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp, int level
  *
  * Used when we do not need to care about huge page mappings: e.g. during dirty
  * logging we do not have any such mappings.
+ *
+ * Returns the number of pages protected by this.
  */
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-struct kvm_memory_slot *slot,
-gfn_t gfn_offset, unsigned long mask)
+int kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long mask)
 {
unsigned long *rmapp;
+   int nr_protected = 0;
 
while (mask) {
rmapp = &slot->rmap[gfn_offset + __ffs(mask)];
__rmap_write_protect(kvm, rmapp, PT_PAGE_TABLE_LEVEL);
+   ++nr_protected;
 
/* clear the first set bit */
mask &= mask - 1;
}
+
+   return nr_protected;
 }
 
 static int rmap_write_protect(struct kvm *kvm, u64 gfn)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0d9a578..b636669 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3092,7 +3092,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
unsigned long n, i;
unsigned long *dirty_bitmap;
unsigned long *dirty_bitmap_buffer;
-   bool is_dirty = false;
+   int nr_protected = 0;
 
mutex_lock(&kvm->slots_lock);
 
@@ -3121,15 +3121,23 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
if (!dirty_bitmap[i])
continue;
 
-   is_dirty = true;
-
mask = xchg(&dirty_bitmap[i], 0);
dirty_bitmap_buffer[i] = mask;
 
offset = i * BITS_PER_LONG;
-   kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
+   nr_protected += kvm_mmu_write_protect_pt_masked(kvm, memslot,
+   offset, mask);
+   if (nr_protected > 2048) {
+   if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
+   kvm_flush_remote_tlbs(kvm);
+   spin_unlock(&kvm->mmu_lock);
+   cond_resched();

Re: [PATCH 2/4] KVM: VMX: Add functions to fill VMCSINFO

2012-04-11 Thread Avi Kivity
On 04/11/2012 01:34 PM, zhangyanfei wrote:
> On 2012-04-11 16:48, Avi Kivity wrote:
> > On 04/11/2012 04:50 AM, zhangyanfei wrote:
> >> This patch is to implement the feature that at initialization of
> >> kvm_intel module, fills VMCSINFO with a VMCS revision identifier,
> >> and encoded offsets of VMCS fields. The reason why we put the
> >> VMCSINFO processing at the initialization of kvm_intel module
> >> is that it's dangerous to rob VMX resources while kvm module is
> >> loaded.
> > 
> > Maybe it should be done by a separate module.
> > 
>
> If we put vmcsinfo processing at the initialization of kvm_intel module,
> as soon as the kvm_intel module is loaded, VMCSINFO is filled. And it is
> because vmcsinfo processing is at the initialization of kvm_intel module,
> no kvm guests are running, so it will not rob any VMX resources.
>
> If it is done by a separate module, I am afraid this module may not be
> loaded when the kernel needs VMCSINFO.
>

You can make the module autoload when the vmx cpufeature is detected. 
But then there is an ordering problem wrt kvm-intel.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-11 Thread Takuya Yoshikawa
On Tue, 10 Apr 2012 19:58:44 +0800
Xiao Guangrong  wrote:

> No, i do not really agree with that.
> 
> We really can get great benefit from O(1) especially if lockless write-protect
> is introduced for O(1), live migration is very useful for cloud computing
> architecture to balance the overload on all nodes.

Recently, you said to me that you were not familiar with live migration.
Actually you did not know the basics of pre-copy live migration.

I know live migration better than you because NTT has Kemari and it uses
live migration infrastructure.  My work originated from the data
I got while profiling Kemari.

SRCU-less dirty logging was also motivated by the pressures from scheduler
developers.  Everything was really needed.


Have you ever used live migration for real service?

I cannot say whether O(1) is OK with me without any real background.

Takuya


Re: New git workflow

2012-04-11 Thread Paul Mackerras
On Sun, Apr 08, 2012 at 02:33:32PM +0300, Avi Kivity wrote:
> On 04/05/2012 08:02 PM, Avi Kivity wrote:
> > I'll publish the new branches tomorrow, with any luck.
> 
> There wasn't any luck, so it's only ready today.  To allow chance for
> review, I'm publishing next as next-candidate.
> 
> Paul/Alex, please review the powerpc bits.  Specifically:
> 
>   system.h is gone, so I moved the prototype of load_up_fpu() to
>  and added a #include (8fae845f4956d).
>   E6500 was added in upstream in parallel with the split of kvm
> E500/E500MC.  I guessed which part of the #ifdef E6500 was to go into,
> but please verify (73196cd364a2, 06aae86799c1b).

All looks OK as far as I can see, but I have asked Scott Wood to
double-check the e500/e500mc bits.

Paul.


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-11 Thread Xiao Guangrong
On 04/11/2012 08:15 PM, Takuya Yoshikawa wrote:

> On Tue, 10 Apr 2012 19:58:44 +0800
> Xiao Guangrong  wrote:
> 
>> No, i do not really agree with that.
>>
>> We really can get great benefit from O(1) especially if lockless 
>> write-protect
>> is introduced for O(1), live migration is very useful for cloud computing
>> architecture to balance the overload on all nodes.
> 
> Recently, you said to me that you were not familiar with live migration.
> Actually you did not know the basics of pre-copy live migration.
> 
> I know live migration better than you because NTT has Kemari and it uses
> live migration infrastructure.  My work is originated from the data
> I got during profiling Kemari.


Well, my point is that live migration is so useful that it is worth
improving; your own description also proves this point.

What did you really want to say that I missed?

> 
> SRCU-less dirty logging was also motivated by the pressures from scheduler
> developers.  Everything was really needed.
> 


Totally agree. Please note, I did not negate your contribution to dirty
logging at all.

> 
> Have you ever used live migration for real service?
> 


I admit that you know live migration better than I do, but that does not
hinder our discussion. If you think I was wrong, you are welcome to correct
me at any time.

> I cannot say whether O(1) is OK with me without any real background.
> 


Okay, let us compare the performance numbers after O(1) is implemented.



Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Guido Winkelmann
Hi,

Nested virtualization on Intel does not work for me with qemu-kvm. As soon as
the third layer OS (second virtualised) is starting the Linux kernel, the
entire second layer freezes up. The last thing I can see on the console of the
third layer system before it freezes is "Decompressing Linux... " (no "done",
though). When starting without the nofb option, the kernel still manages to set
the screen resolution before freezing.

Grub/Syslinux still work, but are extremely slow.

Both the first layer OS (i.e. the one running on bare metal) and the second 
layer OS are 64-bit-Fedora 16 with Kernel 3.3.1-3.fc16.x86_64. On both the 
first and second layer OS, the kvm_intel modules are loaded with nested=Y 
parameter. (I've also tried with nested=N in the second layer. Didn't change 
anything.)
Qemu-kvm was originally the Fedora-shipped 0.14, but I have since upgraded to 
1.0. (Using rpmbuild with the specfile and patches from 
http://pkgs.fedoraproject.org/gitweb/?p=qemu.git;a=blob;f=qemu.spec;hb=HEAD)

The second layer machine has this CPU specification in libvirt on the first 
layer OS:

  
Nehalem

  

which results in this qemu commandline (from libvirt's logs):

LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-
kvm -S -M pc-0.15 -cpu kvm64,+lahf_lm,+popcnt,+sse4.2,+sse4.1,+ssse3,+vmx -
enable-kvm -m 8192 -smp 8,sockets=8,cores=1,threads=1 -name vshost1 -uuid 
192b8c4b-0ded-07aa-2545-d7fef4cd897f -nodefconfig -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vshost1.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -
no-acpi -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
file=/data/vshost1.img,if=none,id=drive-virtio-disk0,format=qcow2 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-
disk0,bootindex=1 -drive file=/data/Fedora-16-x86_64-
netinst.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -
device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
tap,fd=21,id=hostnet0,vhost=on,vhostfd=22 -device virtio-net-
pci,netdev=hostnet0,id=net0,mac=52:54:00:84:7d:46,bus=pci.0,addr=0x3 -netdev 
tap,fd=23,id=hostnet1,vhost=on,vhostfd=24 -device virtio-net-
pci,netdev=hostnet1,id=net1,mac=52:54:00:84:8d:46,bus=pci.0,addr=0x4 -vnc 
127.0.0.1:0,password -k de -vga cirrus -device virtio-balloon-
pci,id=balloon0,bus=pci.0,addr=0x6

I have also tried some other combinations for the cpu element, like changing 
the model to core2duo and/or including all the features reported by libvirt's 
capabilities command.

The third level machine does not have a cpu element in libvirt, and its 
commandline looks like this:

LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-
kvm -S -M pc-0.14 -enable-kvm -m 8192 -smp 4,sockets=4,cores=1,threads=1 -name 
gentoo -uuid 3cdcc902-4520-df25-92ac-31ca5c707a50 -nodefconfig -nodefaults -
chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/gentoo.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-acpi -drive 
file=/data/gentoo.img,if=none,id=drive-virtio-disk0,format=qcow2 -device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -
drive file=/data/install-amd64-
minimal-20120223.iso,if=none,media=cdrom,id=drive-
ide0-1-0,readonly=on,format=raw -device ide-
drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -netdev 
tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 -device virtio-net-
pci,netdev=hostnet0,id=net0,mac=52:54:00:84:6d:46,bus=pci.0,addr=0x3 -usb -vnc 
127.0.0.1:0,password -k de -vga cirrus -device virtio-balloon-
pci,id=balloon0,bus=pci.0,addr=0x5

The third layer OS is a recent Gentoo minimal install (amd64), but somehow I 
don't think that matters at this point...

The metal is a Dell PowerEdge R710 server with two Xeon E5520 CPUs. I've tried 
updating the machine's BIOS and other firmware to the latest version. That 
took a lot of time and a lot of searching on Dell websites, but didn't change 
anything.

Does anyone have any idea what might be going wrong here or how I could debug 
this further?

Guido


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Orit Wasserman
On 04/11/2012 03:44 PM, Guido Winkelmann wrote:
> Hi,
> 
> Nested virtualization on Intel does not work for me with qemu-kvm. As soon as 
> the third layer OS (second virtualised) is starting the Linux kernel, the 
> entire second layer freezes up. The last thing I can see console of the third 
> layer system before it freezes is "Decompressing Linux... ". (no "done", 
> though). When starting without nofb option, the kernel still manages to set 
> the screen resolution before freezing.
> 
> Grub/Syslinux still work, but are extremely slow.
> 
> Both the first layer OS (i.e. the one running on bare metal) and the second 
> layer OS are 64-bit-Fedora 16 with Kernel 3.3.1-3.fc16.x86_64. On both the 
> first and second layer OS, the kvm_intel modules are loaded with nested=Y 
> parameter. (I've also tried with nested=N in the second layer. Didn't change 
> anything.)
> Qemu-kvm was originally the Fedora-shipped 0.14, but I have since upgraded to 
> 1.0. (Using rpmbuild with the specfile and patches from 
> http://pkgs.fedoraproject.org/gitweb/?p=qemu.git;a=blob;f=qemu.spec;hb=HEAD)
> 
> The second layer machine has this CPU specification in libvirt on the first 
> layer OS:
> 
>   <cpu>
>     <model>Nehalem</model>
>   </cpu>
> 
> which results in this qemu commandline (from libvirt's logs):
> 
> LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-
> kvm -S -M pc-0.15 -cpu kvm64,+lahf_lm,+popcnt,+sse4.2,+sse4.1,+ssse3,+vmx -
> enable-kvm -m 8192 -smp 8,sockets=8,cores=1,threads=1 -name vshost1 -uuid 
> 192b8c4b-0ded-07aa-2545-d7fef4cd897f -nodefconfig -nodefaults -chardev 
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/vshost1.monitor,server,nowait
>  
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -
> no-acpi -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
> file=/data/vshost1.img,if=none,id=drive-virtio-disk0,format=qcow2 -device 
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-
> disk0,bootindex=1 -drive file=/data/Fedora-16-x86_64-
> netinst.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -
> device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev 
> tap,fd=21,id=hostnet0,vhost=on,vhostfd=22 -device virtio-net-
> pci,netdev=hostnet0,id=net0,mac=52:54:00:84:7d:46,bus=pci.0,addr=0x3 -netdev 
> tap,fd=23,id=hostnet1,vhost=on,vhostfd=24 -device virtio-net-
> pci,netdev=hostnet1,id=net1,mac=52:54:00:84:8d:46,bus=pci.0,addr=0x4 -vnc 
> 127.0.0.1:0,password -k de -vga cirrus -device virtio-balloon-
> pci,id=balloon0,bus=pci.0,addr=0x6
> 
> I have also tried some other combinations for the cpu element, like changing 
> the model to core2duo and/or including all the features reported by libvirt's 
> capabalities command.
> 
> The third level machine does not have a cpu element in libvirt, and its 
> commandline looks like this:
> 
> LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-
> kvm -S -M pc-0.14 -enable-kvm -m 8192 -smp 4,sockets=4,cores=1,threads=1 
> -name 
> gentoo -uuid 3cdcc902-4520-df25-92ac-31ca5c707a50 -nodefconfig -nodefaults -
> chardev 
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/gentoo.monitor,server,nowait 
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-acpi 
> -drive 
> file=/data/gentoo.img,if=none,id=drive-virtio-disk0,format=qcow2 -device 
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -
> drive file=/data/install-amd64-
> minimal-20120223.iso,if=none,media=cdrom,id=drive-
> ide0-1-0,readonly=on,format=raw -device ide-
> drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -netdev 
> tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 -device virtio-net-
> pci,netdev=hostnet0,id=net0,mac=52:54:00:84:6d:46,bus=pci.0,addr=0x3 -usb 
> -vnc 
> 127.0.0.1:0,password -k de -vga cirrus -device virtio-balloon-
> pci,id=balloon0,bus=pci.0,addr=0x5
> 
> The third layer OS is a recent Gentoo minimal install (amd64), but somehow I 
> don't think that matters at this point...
> 
> The metal is a Dell PowerEdge R710 server with two Xeon E5520 CPUs. I've 
> tried 
> updating the machine's BIOS and other firmware to the latest version. That 
> took a lot of time and a lot of searching on Dell websites, but didn't change 
> anything.
> 
> Does anyone have any idea what might be going wrong here or how I could debug 
> this further?

I'm not sure if this is the problem, but I noticed that the second layer and
the third layer have the same memory size (8G). How about trying to reduce
the memory for the third layer?

Orit

> 
>   Guido

Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Guido Winkelmann
On Wednesday, 11 April 2012, 16:29:55, you wrote:
> I'm not sure if this is the problem but I noticed that the second layer and
> the third layer have the same memory size (8G), how about trying to reduce
> the memory for the third layer ?

I tried reducing the third layer to 1G. That didn't change anything.

Guido


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-11 Thread Takuya Yoshikawa
On Wed, 11 Apr 2012 20:38:57 +0800
Xiao Guangrong  wrote:

> Well, my point is that live migration is so very useful that it is worth
> to be improved, the description of your also proves this point.
> 
> What is your really want to say but i missed?

How to improve and what we should pay for that.

Note that I am not objecting to O(1) itself.

Do you remember that when we discussed O(1) issue last year, with Avi,
the agreement was that we should take more time and look carefully
with more measurements to confirm if it's really worthwhile.

The point is whether we should do O(1) now, including near future.

My opinion is that we should do what we can do now and wait for feedback
from real users.

Before making the current code stable, I do not want to see it replaced
so dramatically.  Otherwise when can we use live migration with enough
confidence?  There may be other subtle bugs we should fix now.

In addition, XBZRLE and post-copy are now being developed in QEMU.


What do you think about this Avi, Marcelo?


I am testing the current live migration to see when and for what it can
be used.  I really want to see it become stable and usable for real
services.


> Okay, let us to compare the performance number after O(1) implemented.

From my experience, I want to say that live migration performance is very
difficult to reason about.  That is the problem I am now struggling with.

I developed dirty-log-perf unit-test for that but that was not enough.

Needless to say, checking the correctness is harder.


So I really do not want to see drastic change now without any real need
or feedback from real users -- this is my point.


Thanks,
Takuya


Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-11 Thread Avi Kivity
On 04/11/2012 05:14 PM, Takuya Yoshikawa wrote:
> On Wed, 11 Apr 2012 20:38:57 +0800
> Xiao Guangrong  wrote:
>
> > Well, my point is that live migration is so very useful that it is worth
> > to be improved, the description of your also proves this point.
> > 
> > What is your really want to say but i missed?
>
> How to improve and what we should pay for that.
>
> Note that I am not objecting to O(1) itself.
>
> Do you remember that when we discussed O(1) issue last year, with Avi,
> the agreement was that we should take more time and look carefully
> with more measurements to confirm if it's really worthwhile.
>
> The point is whether we should do O(1) now, including near future.
>
> My opinion is that we should do what we can do now and wait for feedback
> from real users.
>
> Before making the current code stable, I do not want to see it replaced
> so dramatically.  Otherwise when can we use live migration with enough
> confidence?  There may be another subtle bugs we should fix now.
>
> In addition, XBRZLE and post-copy is now being developed in QEMU.
>
>
> What do you think about this Avi, Marcelo?

Currently the main performance bottleneck for migration is qemu, which
is single threaded and generally inefficient.  However I am sure that
once the qemu bottlenecks will be removed we'll encounter kvm problems,
particularly with wide (many vcpus) and large (lots of memory) guests. 
So it's a good idea to improve in this area.  I agree we'll need to
measure each change, perhaps with a test program until qemu catches up.

> I am testing the current live migration to see when and for what it can
> be used.  I really want to see it become stable and usable for real
> services.

Well, it's used in production now.

> > Okay, let us to compare the performance number after O(1) implemented.
>
> From my experience, I want to say that live migration is very difficult
> to say about performance.  That is the problem I am now struggling with.
>
> I developed dirty-log-perf unit-test for that but that was not enough.
>
> Needless to say, checking the correctness is harder.
>
>
> So I really do not want to see drastic change now without any real need
> or feedback from real users -- this is my point.
>

It's a good point, we should avoid change for its own sake.

-- 
error compiling committee.c: too many arguments to function



Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Orit Wasserman
On 04/11/2012 04:43 PM, Guido Winkelmann wrote:
> Am Mittwoch, 11. April 2012, 16:29:55 schrieben Sie:
>> I'm not sure if this is the problem but I noticed that the second layer and
>> the third layer have the same memory size (8G), how about trying to reduce
>> the memory for the third layer ?
> 
> I tried reducing the third layer to 1G. That didn't change anything.

There is a patch fixing nVMX in 3.4:
http://www.mail-archive.com/kvm@vger.kernel.org/msg68951.html
You can try it.

Orit
> 
>   Guido


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Nadav Har'El
On Wed, Apr 11, 2012, Guido Winkelmann wrote about "Nested virtualization on 
Intel does not work - second level freezes when third level is starting":
> Nested virtualization on Intel does not work for me with qemu-kvm. As soon as 
> the third layer OS (second virtualised) is starting the Linux kernel, the 
> entire second layer freezes up. The last thing I can see console of the third 

Hi,

From your description, I understand that "ordinary" (2-level) nested
virtualization is working for you (host, guest and 2nd-level guest), and it's
the third nesting level (guest's guest's guest) which is broken?

This is the second report of this nature in a week (see the previous
report in https://bugzilla.kernel.org/show_bug.cgi?id=43068 - the
details there are different), so I guess I'll need to find the time
to give this issue some attention. L3 did work for me when the nested
VMX patches were included in KVM, so either something broke since, or
(perhaps more likely) your slightly different setups have features that
my setup didn't.

But in any case, like I explain in the aforementioned URL, even if L3 would
work, in the current implementation it would be extremely slow - perhaps to
the point of being unusable (I think you saw this with grub performance in L3).
So I wonder if you'd really want to use it, even if it worked... Just
curious, what were you thinking of doing with L3?

Nadav.


-- 
Nadav Har'El| Wednesday, Apr 11 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If I were two-faced, would I be wearing
http://nadav.harel.org.il   |this one? Abraham Lincoln


Re: Linux Crash Caused By KVM?

2012-04-11 Thread Avi Kivity
On 04/11/2012 05:11 AM, Peijie Yu wrote:
> Hi, all
>   I have met some problems while utilizing KVM.
>   The test environment is:
> Summary:Dell R610, 1 x Xeon E5645 2.40GHz, 47.1GB / 48GB 1333MHz DDR3
> System: Dell PowerEdge R610 (Dell 08GXHX)
> Processors: 1 (of 2) x Xeon E5645 2.40GHz 5860MHz FSB (HT enabled,
> 6 cores, 24 threads)
> Memory: 47.1GB / 48GB 1333MHz DDR3 == 12 x 4GB
> Disk:   sda: 299GB (72%) JBOD
> Disk:   sdb (host9): 5.0TB JBOD == 1 x VIRTUAL-DISK
> Disk:   sdc (host11): 5.0TB JBOD == 1 x VIRTUAL-DISK
> Disk:   sdd (host12): 5.0TB JBOD == 1 x VIRTUAL-DISK
> Disk:   sde (host10): 5.0TB JBOD == 1 x VIRTUAL-DISK
> Disk-Control:   mpt2sas0: LSI Logic / Symbios Logic SAS2008
> PCI-Express Fusion-MPT SAS-2 [Falcon]
> Disk-Control:   host9:
> Disk-Control:   host10:
> Disk-Control:   host11:
> Disk-Control:   host12:
> Chipset:Intel 82801IB (ICH9)
> Network:br1 (bridge): 14:fe:b5:dc:2c:6e
> Network:em1 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
> 14:fe:b5:dc:2c:6e, 1000Mb/s 
> Network:em2 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
> 14:fe:b5:dc:2c:70, 1000Mb/s 
> Network:em3 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
> 14:fe:b5:dc:2c:72, 1000Mb/s 
> Network:em4 (bnx2): Broadcom NetXtreme II BCM5709 Gigabit,
> 14:fe:b5:dc:2c:74, 1000Mb/s 
> Network:vnet0 (tun): fe:16:3e:49:fb:05, 10Mb/s 
> Network:vnet1 (tun): fe:16:3e:cb:c0:d1, 10Mb/s 
> Network:vnet2 (tun): fe:16:3e:1e:c1:c4, 10Mb/s 
> Network:vnet3 (tun): fe:16:3e:d5:58:f4, 10Mb/s 
> Network:vnet4 (tun): fe:16:3e:15:b4:16, 10Mb/s 
> Network:vnet5 (tun): fe:16:3e:d2:07:47, 10Mb/s 
> Network:vnet6 (tun): fe:16:3e:e1:2b:b9, 10Mb/s 
> OS: RHEL Server 6.1 (Santiago), Linux
> 2.6.32-220.2.1.el6.x86_64 x86_64, 64-bit
> BIOS:   Dell 3.0.0 01/31/2011
>
>   And during the time I utilized KVM, some issues happened:
>   1.   Host Crash Caused by
>   a.   Kernel Panic
>   31   KERNEL: 
> /usr/lib/debug/lib/modules/2.6.32-131.12.1.el6.x86_64/vmlinux
>   32 DUMPFILE: ../vmcore_2012.13.46  [PARTIAL DUMP]
>   33 CPUS: 24
>   34 DATE: Wed Jan 11 13:34:13 2012
>   35   UPTIME: 25 days, 04:11:05
>   36 LOAD AVERAGE: 223.16, 172.97, 158.23
>   37TASKS: 1464
>   38 NODENAME: dell2.localdomain
>   39  RELEASE: 2.6.32-131.12.1.el6.x86_64
>   40  VERSION: #1 SMP Sun Jul 31 16:44:56 EDT 2011
>   41  MACHINE: x86_64  (2394 Mhz)
>   42   MEMORY: 48 GB
>   43PANIC: "kernel BUG at arch/x86/kernel/traps.c:547!"
>   44  PID: 11851
>   45  COMMAND: "qemu-kvm"
>   46 TASK: 880c071c3500  [THREAD_INFO: 880c132d8000]
>   47  CPU: 1
>   48STATE: TASK_RUNNING (PANIC)
>   49
>   50 PID: 11851  TASK: 880c071c3500  CPU: 1   COMMAND: "qemu-kvm"
>   51  #0 [880028207be0] machine_kexec at 810310cb
>   52  #1 [880028207c40] crash_kexec at 810b6392
>   53  #2 [880028207d10] oops_end at 814de670
>   54  #3 [880028207d40] die at 8100f2eb
>   55  #4 [880028207d70] do_trap at 814ddf64
>   56  #5 [880028207dd0] do_invalid_op at 8100ceb5
>   57  #6 [880028207e70] invalid_op at 8100bf5b
>   58 [exception RIP: do_nmi+554]
>   59 RIP: 814de43a  RSP: 880028207f28  RFLAGS: 00010002
>   60 RAX: 880c132d9fd8  RBX: 880028207f58  RCX: c101
>   61 RDX: 8800  RSI:   RDI: 880028207f58
>   62 RBP: 880028207f48   R8: 88005ebf9800   R9: 880028203fc0
>   63 R10: 0034  R11: 03e8  R12: cc20
>   64 R13: 816024a0  R14: 88005ebf9800  R15: 7000
>   65 ORIG_RAX:   CS: 0010  SS: 0018
>   66  #7 [880028207f50] nmi at 814ddc90
>   67 [exception RIP: bad_to_user+37]
>   68 RIP: 814e4e2b  RSP: 880028207bb0  RFLAGS: 00010046
>   69 RAX: 880c132d9fd8  RBX: 880c132d9c48  RCX: 0001
>   70 RDX:   RSI: 0001000b  RDI: 880028207c08
>   71 RBP: 880028207c48   R8: 88005ebf9800   R9: 880028203fc0
>   72 R10: 0034  R11: 03e8  R12: cc20
>   73 R13: 816024a0  R14: 88005ebf9800  R15: 7000
>   74 ORIG_RAX:   CS: 0010  SS: 0018
>   75 ---  ---
>
>  For this problem, I found that the panic is caused by
> BUG_ON(in_nmi()), which means an NMI happened during another NMI context.
> But I checked the Intel Technical Manual and found "While an NMI
> interrupt handler is executing, the processor disables additional
> calls to the NMI handler until the next IRET instruction is executed."
> So, how can this happen?
>

The NMI path for kvm is different; the processor exits from the guest
with NM

[PATCH] kvm: dont clear TMR on EOI

2012-04-11 Thread Michael S. Tsirkin
The Intel spec says that TMR needs to be set/cleared
when IRR is set, but kvm also clears it on EOI.

I did some tests on a real (AMD based) system,
and I see the same TMR values both before
and after EOI, so I think it's a minor bug in kvm.

This patch fixes TMR to be set/cleared on IRR set
only as per spec.

And now that we don't clear TMR, we can save
an atomic read of TMR on EOI that's not propagated
to ioapic, by checking whether ioapic needs
a specific vector first and calculating
the mode afterwards.

Signed-off-by: Michael S. Tsirkin 
---
 arch/x86/kvm/lapic.c |   19 +--
 virt/kvm/ioapic.c|   10 +++---
 virt/kvm/ioapic.h|1 +
 3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 8584322..992b4ea 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -92,6 +92,11 @@ static inline int apic_test_and_clear_vector(int vec, void 
*bitmap)
return test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 }
 
+static inline int apic_test_vector(int vec, void *bitmap)
+{
+   return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+}
+
 static inline void apic_set_vector(int vec, void *bitmap)
 {
set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
@@ -480,7 +485,6 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct 
kvm_vcpu *vcpu2)
 static void apic_set_eoi(struct kvm_lapic *apic)
 {
int vector = apic_find_highest_isr(apic);
-   int trigger_mode;
/*
 * Not every write EOI will has corresponding ISR,
 * one example is when Kernel check timer on setup_IO_APIC
@@ -491,12 +495,15 @@ static void apic_set_eoi(struct kvm_lapic *apic)
apic_clear_vector(vector, apic->regs + APIC_ISR);
apic_update_ppr(apic);
 
-   if (apic_test_and_clear_vector(vector, apic->regs + APIC_TMR))
-   trigger_mode = IOAPIC_LEVEL_TRIG;
-   else
-   trigger_mode = IOAPIC_EDGE_TRIG;
-   if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI))
+   if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI) &&
+   kvm_ioapic_handles_vector(apic->vcpu->kvm, vector)) {
+   int trigger_mode;
+   if (apic_test_vector(vector, apic->regs + APIC_TMR))
+   trigger_mode = IOAPIC_LEVEL_TRIG;
+   else
+   trigger_mode = IOAPIC_EDGE_TRIG;
kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
+   }
kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
 }
 
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index dcaf272c26..26fd54d 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -254,13 +254,17 @@ static void __kvm_ioapic_update_eoi(struct kvm_ioapic 
*ioapic, int vector,
}
 }
 
+bool kvm_ioapic_handles_vector(struct kvm *kvm, int vector)
+{
+   struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+   smp_rmb();
+   return test_bit(vector, ioapic->handled_vectors);
+}
+
 void kvm_ioapic_update_eoi(struct kvm *kvm, int vector, int trigger_mode)
 {
struct kvm_ioapic *ioapic = kvm->arch.vioapic;
 
-   smp_rmb();
-   if (!test_bit(vector, ioapic->handled_vectors))
-   return;
spin_lock(&ioapic->lock);
__kvm_ioapic_update_eoi(ioapic, vector, trigger_mode);
spin_unlock(&ioapic->lock);
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index 0b190c3..32872a0 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -71,6 +71,7 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct 
kvm_lapic *source,
int short_hand, int dest, int dest_mode);
 int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2);
 void kvm_ioapic_update_eoi(struct kvm *kvm, int vector, int trigger_mode);
+bool kvm_ioapic_handles_vector(struct kvm *kvm, int vector);
 int kvm_ioapic_init(struct kvm *kvm);
 void kvm_ioapic_destroy(struct kvm *kvm);
 int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level);
-- 
1.7.9.111.gf3fb0
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/2] kvm: iommu unmap fixes

2012-04-11 Thread Alex Williamson
This is a documentation change only from the previous version.
After discussing it, there is a potential page leak as noted
in the updated changelog for the first patch.  Thanks,

Alex

---

Alex Williamson (2):
  kvm: unpin guest and free iommu domain after deassign last device
  kvm: unmap pages from the iommu when slots are removed


 include/linux/kvm_host.h |6 ++
 virt/kvm/assigned-dev.c  |3 +++
 virt/kvm/iommu.c |8 +++-
 virt/kvm/kvm_main.c  |5 +++--
 4 files changed, 19 insertions(+), 3 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] kvm: unmap pages from the iommu when slots are removed

2012-04-11 Thread Alex Williamson
We've been adding new mappings, but not destroying old mappings.
This can lead to a page leak as pages are pinned using
get_user_pages, but only unpinned with put_page if they still
exist in the memslots list on vm shutdown.  A memslot that is
destroyed while an iommu domain is enabled for the guest will
therefore result in an elevated page reference count that is
never cleared.

Additionally, without this fix, the iommu is only programmed
with the first translation for a gpa.  This can result in
peer-to-peer errors if a mapping is destroyed and replaced by a
new mapping at the same gpa as the iommu will still be pointing
to the original, pinned memory address.

Signed-off-by: Alex Williamson 
---

 include/linux/kvm_host.h |6 ++
 virt/kvm/iommu.c |7 ++-
 virt/kvm/kvm_main.c  |5 +++--
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 665a260..72cbf08 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -596,6 +596,7 @@ void kvm_free_irq_source_id(struct kvm *kvm, int 
irq_source_id);
 
 #ifdef CONFIG_IOMMU_API
 int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot);
+void kvm_iommu_unmap_pages(struct kvm *kvm, struct kvm_memory_slot *slot);
 int kvm_iommu_map_guest(struct kvm *kvm);
 int kvm_iommu_unmap_guest(struct kvm *kvm);
 int kvm_assign_device(struct kvm *kvm,
@@ -609,6 +610,11 @@ static inline int kvm_iommu_map_pages(struct kvm *kvm,
return 0;
 }
 
+static inline void kvm_iommu_unmap_pages(struct kvm *kvm,
+struct kvm_memory_slot *slot)
+{
+}
+
 static inline int kvm_iommu_map_guest(struct kvm *kvm)
 {
return -ENODEV;
diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index a457d21..fec1723 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -310,6 +310,11 @@ static void kvm_iommu_put_pages(struct kvm *kvm,
}
 }
 
+void kvm_iommu_unmap_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+   kvm_iommu_put_pages(kvm, slot->base_gfn, slot->npages);
+}
+
 static int kvm_iommu_unmap_memslots(struct kvm *kvm)
 {
int idx;
@@ -320,7 +325,7 @@ static int kvm_iommu_unmap_memslots(struct kvm *kvm)
slots = kvm_memslots(kvm);
 
kvm_for_each_memslot(memslot, slots)
-   kvm_iommu_put_pages(kvm, memslot->base_gfn, memslot->npages);
+   kvm_iommu_unmap_pages(kvm, memslot);
 
srcu_read_unlock(&kvm->srcu, idx);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 42b7393..9739b53 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -808,12 +808,13 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (r)
goto out_free;
 
-   /* map the pages in iommu page table */
+   /* map/unmap the pages in iommu page table */
if (npages) {
r = kvm_iommu_map_pages(kvm, &new);
if (r)
goto out_free;
-   }
+   } else
+   kvm_iommu_unmap_pages(kvm, &old);
 
r = -ENOMEM;
slots = kmemdup(kvm->memslots, sizeof(struct kvm_memslots),

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] kvm: unpin guest and free iommu domain after deassign last device

2012-04-11 Thread Alex Williamson
Unpin the guest and free the iommu domain if there are no longer
any devices attached.

Signed-off-by: Alex Williamson 
---

 virt/kvm/assigned-dev.c |3 +++
 virt/kvm/iommu.c|1 +
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 01f572c..01e7c37 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -765,6 +765,9 @@ static int kvm_vm_ioctl_deassign_device(struct kvm *kvm,
 
kvm_free_assigned_device(kvm, match);
 
+   if (list_empty(&kvm->arch.assigned_dev_head))
+   kvm_iommu_unmap_guest(kvm);
+
 out:
mutex_unlock(&kvm->lock);
return r;
diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index fec1723..ee4c236 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -342,5 +342,6 @@ int kvm_iommu_unmap_guest(struct kvm *kvm)
 
kvm_iommu_unmap_memslots(kvm);
iommu_domain_free(domain);
+   kvm->arch.iommu_domain = NULL;
return 0;
 }

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/2] kvm: iommu unmap fixes

2012-04-11 Thread Greg KH
On Wed, Apr 11, 2012 at 09:51:43AM -0600, Alex Williamson wrote:
> This is a documentation change only from the previous version.
> After discussing it, there is a potential page leak as noted
> in the updated changelog for the first patch.  Thanks,



This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read Documentation/stable_kernel_rules.txt
for how to do this properly.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/2] kvm: iommu unmap fixes

2012-04-11 Thread Alex Williamson
On Wed, 2012-04-11 at 08:59 -0700, Greg KH wrote:
> On Wed, Apr 11, 2012 at 09:51:43AM -0600, Alex Williamson wrote:
> > This is a documentation change only from the previous version.
> > After discussing it, there is a potential page leak as noted
> > in the updated changelog for the first patch.  Thanks,
> 
> 
> 
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree.  Please read Documentation/stable_kernel_rules.txt
> for how to do this properly.
> 
> 

Sorry, Greg.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Guido Winkelmann
Am Mittwoch, 11. April 2012, 17:38:14 schrieb Nadav Har'El:
> On Wed, Apr 11, 2012, Guido Winkelmann wrote about "Nested virtualization on 
> Intel does not work - second level freezes when third level is starting":
> > Nested virtualization on Intel does not work for me with qemu-kvm. As soon
> > as the third layer OS (second virtualised) is starting the Linux kernel,
> > the entire second layer freezes up. The last thing I can see console of
> > the third
> Hi,
> 
> From your description, I understand that "ordinary" (2-level) nested
> virtualization working for you (host, guest and 2nd-level guest), and it's
> the third nesting level (guest's guest's guest) which is broken?

No, even 2-level nesting is broken. I can run Host->Guest, but not 
Host->Guest->2nd Level Guest. I haven't even tried with a third virtualized 
level.

I suppose the misunderstanding happened because, in my original mail, I was 
counting the host as one level.

> This is the second report of this nature in a week (see the previous
> report in https://bugzilla.kernel.org/show_bug.cgi?id=43068 - the
> details there are different), so I guess I'll need to find the time
> to give this issue some attention. L3 did work for me when the nested
> VMX patches were included in KVM, so either something broke since, or
> (perhaps more likely) your slightly different setups have features that
> my setup didn't.
> 
> But in any case, like I explain in the aforementioned URL, even if L3 would
> work, in the current implementation it would be extremenly slow - perhaps to
> the point of being unusable (I think you saw this with grub performance in
> L3). So I wonder if you'd really want to use it, even if it worked... Just
> curious, what were you thinking of doing with L3?

I was trying to test network setups that involve migrating VMs between hosts a 
lot, and I was hoping to be able to use only one physical server for that.

As I said, I really only need one level of nesting for that (i.e. two levels 
of virtualization, three levels of OSes when counting the host).

Guido
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: vhost-blk development

2012-04-11 Thread Michael Baysek
In this particular case, I did intend to deploy these instances directly to 
the ramdisk.  I want to squeeze every drop of performance out of these 
instances for use cases with lots of concurrent accesses.   I thought it 
would be possible to achieve improvements an order of magnitude or more 
over SSD, but it seems not to be the case (so far).  

I am purposefully not using O_DIRECT since most workloads will not be using 
it, although I did notice better performance when I did use it.  I did
already identify the page cache as a hindrance as well.
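
For reference, "using O_DIRECT" here means opening the test file or device with
the flag and doing block-aligned transfers, which bypasses the guest page cache.
A minimal illustrative sketch (the path and sizes are made up, error handling
trimmed); it corresponds to what the -I option mentioned below enables in
iozone:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(void)
	{
		void *buf;
		int fd = open("/dev/vda", O_RDONLY | O_DIRECT);	/* illustrative path */

		if (fd < 0)
			return 1;
		/* O_DIRECT requires sector-aligned buffers and transfer sizes. */
		if (posix_memalign(&buf, 4096, 4096))
			return 1;
		if (pread(fd, buf, 4096, 0) < 0)
			return 1;
		free(buf);
		close(fd);
		return 0;
	}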

I seem to have hit some performance ceilings inside of the kvm guests that 
are much lower than that of the host they are running on.  I am seeing a 
lot more interrupts and context switches on the parent than I am in the 
guests, and I am looking for any and all ways to cut these down.  

I had read somewhere that vhost-blk may help.  However, those patches were 
posted on qemu-devel in 2010, with some activity on LKML in 2011, but not 
much since.  I feared that the reason they are still not merged might be 
bugs, incomplete implementation, or something of the sort.  

Anyhow, I thank you for your quick and timely responses.  I have spent some 
weeks investigating ways to boost performance in this use case and I am 
left with few remaining options.  I hope I have communicated clearly what I 
am trying to accomplish, and why I am inquiring specifically about vhost-blk.  

Regards,

-Mike


- Original Message -
From: "Stefan Hajnoczi" 
To: "Michael Baysek" 
Cc: kvm@vger.kernel.org
Sent: Wednesday, April 11, 2012 3:19:48 AM
Subject: Re: vhost-blk development

On Tue, Apr 10, 2012 at 6:25 PM, Michael Baysek  wrote:
> Well, I'm trying to determine which I/O method currently has the very least 
> performance overhead and gives the best performance for both reads and writes.
>
> I am doing my testing by putting the entire guest onto a ramdisk.  I'm 
> working on an i5-760 with 16GB RAM with VT-d enabled.  I am running the 
> standard Centos 6 kernel with 0.12.1.2 release of qemu-kvm that comes stock 
> on Centos 6.  The guest is configured with 512 MB RAM, using, 4 cpu cores 
> with it's /dev/vda being the ramdisk on the host.

Results collected for ramdisk usually do not reflect the performance
you get with a real disk or SSD.  I suggest using the host/guest
configuration you want to deploy.

> I've been using iozone 3.98 with -O -l32 -i0 -i1 -i2 -e -+n -r4K -s250M to 
> measure performance.

I haven't looked up the options but I think you need -I to use
O_DIRECT and bypass the guest page cache - otherwise you are not
benchmarking I/O performance but overall file system/page cache
performance.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Guido Winkelmann
Am Mittwoch, 11. April 2012, 17:25:12 schrieben Sie:
> On 04/11/2012 04:43 PM, Guido Winkelmann wrote:
> > Am Mittwoch, 11. April 2012, 16:29:55 schrieben Sie:
> >> I'm not sure if this is the problem but I noticed that the second layer
> >> and
> >> the third layer have the same memory size (8G), how about trying to
> >> reduce
> >> the memory for the third layer ?
> > 
> > I tried reducing the third layer to 1G. That didn't change anything.
> 
> There is a patch for fixing nVMX in 3.4.
> http://www.mail-archive.com/kvm@vger.kernel.org/msg68951.html
> you can try it .

Should I use that patch on the host, or on the first virtualized layer, or on 
both? (Compiling 3.4-rc2 for both for now... )

Guido
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS

2012-04-11 Thread Chegu Vinod

Hello,

While running AIM7 (workfile.high_systime) in a single 40-way (or a single
60-way) KVM guest, I noticed pretty bad performance when the guest was booted
with the 3.3.1 kernel compared to the same guest booted with the 2.6.32-220
(RHEL6.2) kernel.

I'm still trying to dig more into the details here. Wondering if some changes
in the upstream kernel (i.e. since 2.6.32-220) might be causing this to show up
in a guest environment (esp. for this system-intensive workload).

Has anyone else observed this kind of behavior? Is it a known issue with a fix
in the pipeline? If not, are there any special knobs/tunables that one needs to
explicitly set/clear etc. when using newer kernels like 3.3.1 in a guest?

I have included some info. below. 

Also any pointers on what else I could capture that would be helpful.

Thanks!
Vinod

---

Platform used:
DL980 G7 (80 cores + 128G RAM).  Hyper-threading is turned off.

Workload used:
AIM7  (workfile.high_systime) and using RAM disks. This is 
primarily a cpu intensive workload...not much i/o. 

Software used :
qemu-system-x86_64   :  1.0.50(i.e. latest as of about a week or so ago).
Native/Host  OS  :  3.3.1 (SLUB allocator explicitly enabled)
Guest-RunA   OS  :  2.6.32-220 (i.e. RHEL6.2 kernel)
Guest-RunB   OS  :  3.3.1

Guest was pinned on :
numa node: 4,5,6,7   ->   40VCPUs + 64G   (i.e. 40-way guest)
numa node: 2,3,4,5,7  ->  60VCPUs + 96G   (i.e. 60-way guest)

For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x better
than Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run, the older
guest kernel was nearly 12x better!

For the Guest-RunB (3.3.1) case I ran "mpstat -P ALL 1" on the host and
observed that a very high % of time was being spent by the CPUs outside guest
mode and mostly in the host (i.e. sys). Looking at the "perf" traces, it seemed
like there were long pauses in the guest, perhaps waiting for the
zone->lru_lock as part of release_pages(), and this caused the VT PLE-related
code to kick in on the host.

Turned on function tracing and found that there appears to be more time being
spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
guest.  Here is a small sampling of these traces... Notice the time stamp jump 
around "_spin_lock_irqsave <-release_pages" in the case of Guest-RunB. 


1) 40-way Guest-RunA (2.6.32-220 kernel):
-


#   TASK-PID   CPU#  TIMESTAMP  FUNCTION

   <...>-32147 [020] 145783.127452: native_flush_tlb <-flush_tlb_mm
   <...>-32147 [020] 145783.127452: free_pages_and_swap_cache <-unmap_region
   <...>-32147 [020] 145783.127452: lru_add_drain <-free_pages_and_swap_cache
   <...>-32147 [020] 145783.127452: release_pages <-free_pages_and_swap_cache
   <...>-32147 [020] 145783.127452: _spin_lock_irqsave <-release_pages
   <...>-32147 [020] 145783.127452: __mod_zone_page_state <-release_pages
   <...>-32147 [020] 145783.127452: mem_cgroup_del_lru_list <-release_pages

...

   <...>-32147 [022] 145783.133536: release_pages <-free_pages_and_swap_cache
   <...>-32147 [022] 145783.133536: _spin_lock_irqsave <-release_pages
   <...>-32147 [022] 145783.133536: __mod_zone_page_state <-release_pages
   <...>-32147 [022] 145783.133536: mem_cgroup_del_lru_list <-release_pages
   <...>-32147 [022] 145783.133537: lookup_page_cgroup <-mem_cgroup_del_lru_list




2) 40-way Guest-RunB (3.3.1):
-


#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
   <...>-16459 [009]  101757.383125: free_pages_and_swap_cache <-tlb_flush_mmu
   <...>-16459 [009]  101757.383125: lru_add_drain <-free_pages_and_swap_cache
   <...>-16459 [009]  101757.383125: release_pages <-free_pages_and_swap_cache
   <...>-16459 [009]  101757.383125: _raw_spin_lock_irqsave <-release_pages
   <...>-16459 [009] d... 101757.384861: mem_cgroup_lru_del_list <-release_pages
   <...>-16459 [009] d... 101757.384861: lookup_page_cgroup <-mem_cgroup_lru_del_list




   <...>-16459 [009] .N.. 101757.390385: release_pages <-free_pages_and_swap_cache
   <...>-16459 [009] .N.. 101757.390385: _raw_spin_lock_irqsave <-release_pages
   <...>-16459 [009] dN.. 101757.392983: mem_cgroup_lru_del_list <-release_pages
   <...>-16459 [009] dN.. 101757.392983: lookup_page_cgroup <-mem_cgroup_lru_del_list
   <...>-16459 [009] dN.. 101757.392983: __mod_zone_page_state <-release_pages




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Kashyap Chamarthy
On Wed, Apr 11, 2012 at 6:14 PM, Guido Winkelmann
 wrote:
> Hi,
>
> Nested virtualization on Intel does not work for me with qemu-kvm. As soon as
> the third layer OS (second virtualised) is starting the Linux kernel, the
> entire second layer freezes up. The last thing I can see console of the third
> layer system before it freezes is "Decompressing Linux... ". (no "done",
> though). When starting without nofb option, the kernel still manages to set
> the screen resolution before freezing.
>
> Grub/Syslinux still work, but are extremely slow.
>
> Both the first layer OS (i.e. the one running on bare metal) and the second
> layer OS are 64-bit-Fedora 16 with Kernel 3.3.1-3.fc16.x86_64. On both the
> first and second layer OS, the kvm_intel modules are loaded with nested=Y
> parameter. (I've also tried with nested=N in the second layer. Didn't change
> anything.)
> Qemu-kvm was originally the Fedora-shipped 0.14, but I have since upgraded to
> 1.0. (Using rpmbuild with the specfile and patches from
> http://pkgs.fedoraproject.org/gitweb/?p=qemu.git;a=blob;f=qemu.spec;hb=HEAD)
>
> The second layer machine has this CPU specification in libvirt on the first
> layer OS:
>
>  
>    Nehalem
>    
>  
>
> which results in this qemu commandline (from libvirt's logs):
>
> LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-
> kvm -S -M pc-0.15 -cpu kvm64,+lahf_lm,+popcnt,+sse4.2,+sse4.1,+ssse3,+vmx -
> enable-kvm -m 8192 -smp 8,sockets=8,cores=1,threads=1 -name vshost1 -uuid
> 192b8c4b-0ded-07aa-2545-d7fef4cd897f -nodefconfig -nodefaults -chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/vshost1.monitor,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -
> no-acpi -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
> file=/data/vshost1.img,if=none,id=drive-virtio-disk0,format=qcow2 -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-
> disk0,bootindex=1 -drive file=/data/Fedora-16-x86_64-
> netinst.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -
> device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev
> tap,fd=21,id=hostnet0,vhost=on,vhostfd=22 -device virtio-net-
> pci,netdev=hostnet0,id=net0,mac=52:54:00:84:7d:46,bus=pci.0,addr=0x3 -netdev
> tap,fd=23,id=hostnet1,vhost=on,vhostfd=24 -device virtio-net-
> pci,netdev=hostnet1,id=net1,mac=52:54:00:84:8d:46,bus=pci.0,addr=0x4 -vnc
> 127.0.0.1:0,password -k de -vga cirrus -device virtio-balloon-
> pci,id=balloon0,bus=pci.0,addr=0x6
>
> I have also tried some other combinations for the cpu element, like changing
> the model to core2duo and/or including all the features reported by libvirt's
> capabalities command.
>
> The third level machine does not have a cpu element in libvirt, and its
> commandline looks like this:
>
> LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/bin/qemu-
> kvm -S -M pc-0.14 -enable-kvm -m 8192 -smp 4,sockets=4,cores=1,threads=1 -name
> gentoo -uuid 3cdcc902-4520-df25-92ac-31ca5c707a50 -nodefconfig -nodefaults -
> chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/gentoo.monitor,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-acpi -drive
> file=/data/gentoo.img,if=none,id=drive-virtio-disk0,format=qcow2 -device
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -
> drive file=/data/install-amd64-
> minimal-20120223.iso,if=none,media=cdrom,id=drive-
> ide0-1-0,readonly=on,format=raw -device ide-
> drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -netdev
> tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 -device virtio-net-
> pci,netdev=hostnet0,id=net0,mac=52:54:00:84:6d:46,bus=pci.0,addr=0x3 -usb -vnc
> 127.0.0.1:0,password -k de -vga cirrus -device virtio-balloon-
> pci,id=balloon0,bus=pci.0,addr=0x5
>
> The third layer OS is a recent Gentoo minimal install (amd64), but somehow I
> don't think that matters at this point...
>
> The metal is a Dell PowerEdge R710 server with two Xeon E5520 CPUs. I've tried
> updating the machine's BIOS and other firmware to the latest version. That
> took a lot of time and a lot of searching on Dell websites, but didn't change
> anything.
>
> Does anyone have any idea what might be going wrong here or how I could debug
> this further?

Interesting. I tried this recently (a couple of months ago) with this configuration:
==
1/ Physical Host (Host hypervisor/Bare metal) --
Config: Intel(R) Xeon(R) CPU(4 cores/socket); 10GB Memory; CPU Freq –
2GHz; Running latest Fedora-16(Minimal foot-print, @core only with
Virt pkgs;x86_64; kernel-3.1.8-2.fc16.x86_64

2/ Regualr Guest (Or Guest Hypervisor) --
Config: 4GB Memory; 4vCPU; 20GB Raw disk image with cache =’none’ to
have decent I/O; Minimal, @core F16; And same virt-packages as
Physical Host; x86_64

3/ Nested Guest (Guest installed inside the Regular Guest) --
Config: 2GB Memor

Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Guido Winkelmann
Am Mittwoch, 11. April 2012, 23:14:07 schrieb Kashyap Chamarthy:
> On Wed, Apr 11, 2012 at 6:14 PM, Guido Winkelmann
>  wrote:
[...]
> Here is my complete notes on nested virtualization w/ Intel  --
> http://kashyapc.wordpress.com/2012/01/14/nested-virtualization-with-kvm-inte
> l/

I had already found you blog post via Google :). In fact, I used some of the 
things you wrote for setting up my system, in the hopes that I would have more 
luck.

BTW, the line for modprobe.d/dist.conf should read

options kvm_intel nested=1

In your blog post, the "options" keyword is missing.

> My result: I was able to ssh into the nested guest (guest installed
> inside the regular guest),  but, after a reboot, the nested-guest
> loses the IP rendering it inaccessible.(Info: the regular-guest has a
> bridged IP, and nested-guest has a NATed IP)
> 
> Refer the comments in the above post for some more discussion. Though
> I haven't tried the suggestion of  'updating your system firmware and
> disabling VT for Direct I/O Access if you are able in the firmware' .
> And I wonder how does turning it off can alleviate the prob.

Yeah, I've seen that comment, that was what prompted me to update the server's 
firmware. The Dell BIOS does not offer the option to disable VT for Direct I/O 
Access, though.

> And my AMD notes  is here(which was completely successful) --
> http://kashyapc.wordpress.com/2012/01/18/nested-virtualization-with-kvm-and-
> amd/

Unfortunately, AMD is not an option for me, at least not in this particular 
context.

Guido
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Guido Winkelmann
Am Mittwoch, 11. April 2012, 17:25:12 schrieb Orit Wasserman:
> On 04/11/2012 04:43 PM, Guido Winkelmann wrote:
> > Am Mittwoch, 11. April 2012, 16:29:55 schrieben Sie:
> >> I'm not sure if this is the problem but I noticed that the second layer
> >> and
> >> the third layer have the same memory size (8G), how about trying to
> >> reduce
> >> the memory for the third layer ?
> > 
> > I tried reducing the third layer to 1G. That didn't change anything.
> 
> There is a patch for fixing nVMX in 3.4.
> http://www.mail-archive.com/kvm@vger.kernel.org/msg68951.html
> you can try it .

That worked, though the VM inside the VM is still very slow...

Guido
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Orit Wasserman
On 04/11/2012 09:37 PM, Guido Winkelmann wrote:
> Am Mittwoch, 11. April 2012, 17:25:12 schrieb Orit Wasserman:
>> On 04/11/2012 04:43 PM, Guido Winkelmann wrote:
>>> Am Mittwoch, 11. April 2012, 16:29:55 schrieben Sie:
 I'm not sure if this is the problem but I noticed that the second layer
 and
 the third layer have the same memory size (8G), how about trying to
 reduce
 the memory for the third layer ?
>>>
>>> I tried reducing the third layer to 1G. That didn't change anything.
>>
>> There is a patch for fixing nVMX in 3.4.
>> http://www.mail-archive.com/kvm@vger.kernel.org/msg68951.html
>> you can try it .
> 
> That worked, though the VM inside the VM is still very slow...

you can try Nadav's nEPT (Nested EPT) patches; they should help with the
performance.

Orit

> 
>   Guido

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Kashyap Chamarthy
On Wed, Apr 11, 2012 at 11:34 PM, Guido Winkelmann
 wrote:
> Am Mittwoch, 11. April 2012, 23:14:07 schrieb Kashyap Chamarthy:
>> On Wed, Apr 11, 2012 at 6:14 PM, Guido Winkelmann
>>  wrote:
> [...]
>> Here is my complete notes on nested virtualization w/ Intel  --
>> http://kashyapc.wordpress.com/2012/01/14/nested-virtualization-with-kvm-inte
>> l/
>
> I had already found you blog post via Google :). In fact, I used some of the
> things you wrote for setting up my system, in the hopes that I would have more
> luck.
>
> BTW, the line for modprobe.d/dist.conf should read
>
> options kvm_intel nested=1

Really? I wonder how it went that far then. Thanks for that, I'll make
the correction and re-try it.

And I see in your other email, it seemed to work for you (w/ the patch
indicated in that email). Good to know. I'll give it a try sometime
this weekend.

>
> In your blog post, the "options" keyword is missing.
>
>> My result: I was able to ssh into the nested guest (guest installed
>> inside the regular guest),  but, after a reboot, the nested-guest
>> loses the IP rendering it inaccessible.(Info: the regular-guest has a
>> bridged IP, and nested-guest has a NATed IP)
>>
>> Refer the comments in the above post for some more discussion. Though
>> I haven't tried the suggestion of  'updating your system firmware and
>> disabling VT for Direct I/O Access if you are able in the firmware' .
>> And I wonder how does turning it off can alleviate the prob.
>
> Yeah, I've seen that comment, that was what prompted me to update the server's
> firmware. The Dell BIOS does not offer the option to disable VT for Direct I/O
> Access, though.
>
>> And my AMD notes  is here(which was completely successful) --
>> http://kashyapc.wordpress.com/2012/01/18/nested-virtualization-with-kvm-and-
>> amd/
>
> Unfortunately, AMD is not an option for me, at least not in this particular
> context.
>
>        Guido
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Orit Wasserman
On 04/11/2012 08:00 PM, Guido Winkelmann wrote:
> Am Mittwoch, 11. April 2012, 17:25:12 schrieben Sie:
>> On 04/11/2012 04:43 PM, Guido Winkelmann wrote:
>>> Am Mittwoch, 11. April 2012, 16:29:55 schrieben Sie:
 I'm not sure if this is the problem but I noticed that the second layer
 and
 the third layer have the same memory size (8G), how about trying to
 reduce
 the memory for the third layer ?
>>>
>>> I tried reducing the third layer to 1G. That didn't change anything.
>>
>> There is a patch for fixing nVMX in 3.4.
>> http://www.mail-archive.com/kvm@vger.kernel.org/msg68951.html
>> you can try it .
> 
> Should I use that patch on the host, or on the first virtualized layer, or on 
> both? (Compiling 3.4-rc2 for both for now... )
On the host (we refer to it as L0 by the way).

Orit

> 
>   Guido



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux Crash Caused By KVM?

2012-04-11 Thread Eric Northup
On Wed, Apr 11, 2012 at 7:45 AM, Avi Kivity  wrote:
> On 04/11/2012 05:11 AM, Peijie Yu wrote:
>>      For this problem, i found that panic is caused by
>> BUG_ON(in_nmi()) which means NMI happened during another NMI Context;
>> But i check the Intel Technical Manual and found "While an NMI
>> interrupt handler is executing, the processor disables additional
>> calls to the NMI handler until the next IRET instruction is executed."
>> So, how this happen?
>>
>
> The NMI path for kvm is different; the processor exits from the guest
> with NMIs blocked, then executes kvm code until it issues "int $2" in
> vmx_complete_interrupts(). If an IRET is executed in this path, then
> NMIs will be unblocked and nested NMIs may occur.
>
> One way this can happen is if we access the vmap area and incur a fault,
> between the VMEXIT and invoking the NMI handler. Or perhaps the NMI
> handler itself generates a fault. Or we have a debug exception in that path.
>
> Is this reproducible?

As an FYI, there have been BIOSes whose SMI handlers ran IRETs.  So
the NMI blocking can go away surprisingly.

See 29.8 "NMI handling while in SMM" in the Intel SDM vol 3.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New git workflow

2012-04-11 Thread Scott Wood

On 04/11/2012 07:23 AM, Paul Mackerras wrote:

On Sun, Apr 08, 2012 at 02:33:32PM +0300, Avi Kivity wrote:

On 04/05/2012 08:02 PM, Avi Kivity wrote:

I'll publish the new branches tomorrow, with any luck.


There wasn't any luck, so it's only ready today.  To allow chance for
review, I'm publishing next as next-candidate.

Paul/Alex, please review the powerpc bits.  Specifically:

   system.h is gone, so I moved the prototype of load_up_fpu() to
  and added a #include (8fae845f4956d).
   E6500 was added in upstream in parallel with the split of kvm
E500/E500MC.  I guessed which part of the #ifdef E6500 was to go into,
but please verify (73196cd364a2, 06aae86799c1b).


All looks OK as far as I can see, but I have asked Scott Wood to
double-check the e500/e500mc bits.


e6500 should have CPU_FTR_EMB_HV.  Otherwise looks OK.

-Scott

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] powerpc/e6500: add CPU_FTR_EMB_HV to CPU table

2012-04-11 Thread Scott Wood
e6500 support (commit 10241842fbe900276634fee8d37ec48a7d8a762f, 
"powerpc: Add initial e6500 cpu support" and the introduction of
CPU_FTR_EMB_HV (commit 73196cd364a2d972d73fa08da9d81ca3215bed68,
"KVM: PPC: e500mc support") collided during merge, leaving e6500's CPU
table entry missing CPU_FTR_EMB_HV.

Signed-off-by: Scott Wood 
---
Fixup patch for the KVM merge as requested by Marcelo.

 arch/powerpc/include/asm/cputable.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h 
b/arch/powerpc/include/asm/cputable.h
index 67c34af..50d82c8 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -395,7 +395,7 @@ extern const char *powerpc_base_platform;
 #define CPU_FTRS_E6500 (CPU_FTR_USE_TB | CPU_FTR_NODSISRALIGN | \
CPU_FTR_L2CSR | CPU_FTR_LWSYNC | CPU_FTR_NOEXECUTE | \
CPU_FTR_DBELL | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \
-   CPU_FTR_DEBUG_LVL_EXC)
+   CPU_FTR_DEBUG_LVL_EXC | CPU_FTR_EMB_HV)
 #define CPU_FTRS_GENERIC_32(CPU_FTR_COMMON | CPU_FTR_NODSISRALIGN)
 
 /* 64-bit CPUs */
-- 
1.7.7.rc3.4.g8d714

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nested virtualization on Intel does not work - second level freezes when third level is starting

2012-04-11 Thread Nadav Har'El
On Wed, Apr 11, 2012, Guido Winkelmann wrote about "Re: Nested virtualization 
on Intel does not work - second level freezes when third level is starting":
> No, even 2-level nesting is broken. I can run Host->Guest, but not 
> Host->Guest->2nd Level Guest. I haven't even tried with a third virtualized 
> level.

I see. I guess I completely misunderstood what you reported. Sorry.

I think Orit was right. 3.3rc5 had a regression in the nested support,
which I discovered and Avi Kivity fixed; I didn't notice this before
now, but unfortunately the fix only got to 3.4rc1 and never made it into
3.3 (I just verified, it's not in 3.3.1 but it is in 3.4).
This bug displayed itself similarly to what you saw (L1 would hang when
running L2).

If you can run a later kernel, I hope the problem will be solved.

Otherwise, perhaps you can patch your kernel with the following patch
and try again?

--- .before/arch/x86/kvm/vmx.c  2012-03-19 18:34:24.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2012-03-19 18:34:24.0 +0200
@@ -2210,6 +2210,10 @@ static int vmx_set_msr(struct kvm_vcpu *
msr = find_msr_entry(vmx, msr_index);
if (msr) {
msr->data = data;
+   if (msr - vmx->guest_msrs < vmx->save_nmsrs)
+   kvm_set_shared_msr(msr->index, msr->data,
+   msr->mask);
break;
}



-- 
Nadav Har'El|  Thursday, Apr 12 2012, 
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If you tell the truth, you don't have to
http://nadav.harel.org.il   |remember anything.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] KVM: Introduce direct MSI message injection for in-kernel irqchips

2012-04-11 Thread Marcelo Tosatti
On Thu, Mar 29, 2012 at 06:15:41PM +0200, Jan Kiszka wrote:
> Currently, MSI messages can only be injected to in-kernel irqchips by
> defining a corresponding IRQ route for each message. This is not only
> unhandy if the MSI messages are generated "on the fly" by user space,

The MSI message format is configured at device configuration time, and once
it's settled, it does not change. This should be an infrequent operation,
no? (I am trying to understand what you mean by "on the fly" here.)

If that is the case, the real problem is that irq routing tables do not
handle large numbers of vectors? And isn't that limitation also an issue
if you'd like to add more IOAPICs, for example?
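
For context, injecting a message "on the fly" with the proposed interface would
look roughly like the sketch below from user space. KVM_SIGNAL_MSI and struct
kvm_msi come from the patch quoted underneath; the address/data values encode an
example fixed-delivery MSI (physical destination APIC ID 0, vector 0x30) and are
made up for illustration:

	#include <linux/kvm.h>
	#include <string.h>
	#include <sys/ioctl.h>

	/* vm_fd is a VM file descriptor obtained via KVM_CREATE_VM. */
	static int inject_msi(int vm_fd)
	{
		struct kvm_msi msi;

		memset(&msi, 0, sizeof(msi));	/* flags and pad must be zero */
		msi.address_lo = 0xfee00000;	/* dest APIC ID 0, physical mode */
		msi.address_hi = 0;
		msi.data = 0x0030;		/* fixed delivery, vector 0x30 */

		return ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);
	}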

> IRQ routes are a limited resource that user space as to manage
> carefully.
> 
> By providing a direct injection path, we can both avoid using up limited
> resources and simplify the necessary steps for user land.
> 
> Signed-off-by: Jan Kiszka 
> ---
> 
> Changes in v3:
>  - align return code doc to reality
>  - rename SET_MSI -> SIGNAL_MSI
> 
>  Documentation/virtual/kvm/api.txt |   21 +
>  arch/x86/kvm/Kconfig  |1 +
>  include/linux/kvm.h   |   11 +++
>  virt/kvm/Kconfig  |3 +++
>  virt/kvm/kvm_main.c   |   21 +
>  5 files changed, 57 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 81ff39f..ed27d1b 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1482,6 +1482,27 @@ See KVM_ASSIGN_DEV_IRQ for the data structure.  The 
> target device is specified
>  by assigned_dev_id.  In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
>  evaluated.
>  
> +4.61 KVM_SIGNAL_MSI
> +
> +Capability: KVM_CAP_SIGNAL_MSI
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_msi (in)
> +Returns: >0 on delivery, 0 if guest blocked the MSI, and -1 on error
> +
> +Directly inject a MSI message. Only valid with in-kernel irqchip that handles
> +MSI messages.
> +
> +struct kvm_msi {
> + __u32 address_lo;
> + __u32 address_hi;
> + __u32 data;
> + __u32 flags;
> + __u8  pad[16];
> +};
> +
> +No flags are defined so far. The corresponding field must be 0.
> +
>  4.62 KVM_CREATE_SPAPR_TCE
>  
>  Capability: KVM_CAP_SPAPR_TCE
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 1a7fe86..a28f338 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -36,6 +36,7 @@ config KVM
>   select TASKSTATS
>   select TASK_DELAY_ACCT
>   select PERF_EVENTS
> + select HAVE_KVM_MSI
>   ---help---
> Support hosting fully virtualized guest machines using hardware
> virtualization extensions.  You will need a fairly recent
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 7a9dd4b..225b452 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -590,6 +590,7 @@ struct kvm_ppc_pvinfo {
>  #define KVM_CAP_SYNC_REGS 74
>  #define KVM_CAP_PCI_2_3 75
>  #define KVM_CAP_KVMCLOCK_CTRL 76
> +#define KVM_CAP_SIGNAL_MSI 77
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -715,6 +716,14 @@ struct kvm_one_reg {
>   __u64 addr;
>  };
>  
> +struct kvm_msi {
> + __u32 address_lo;
> + __u32 address_hi;
> + __u32 data;
> + __u32 flags;
> + __u8  pad[16];
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> @@ -789,6 +798,8 @@ struct kvm_s390_ucas_mapping {
>  /* Available with KVM_CAP_PCI_2_3 */
>  #define KVM_ASSIGN_SET_INTX_MASK  _IOW(KVMIO,  0xa4, \
>  struct kvm_assigned_pci_dev)
> +/* Available with KVM_CAP_SIGNAL_MSI */
> +#define KVM_SIGNAL_MSI_IOW(KVMIO,  0xa5, struct kvm_msi)
>  
>  /*
>   * ioctls for vcpu fds
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index f63ccb0..28694f4 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -18,3 +18,6 @@ config KVM_MMIO
>  
>  config KVM_ASYNC_PF
> bool
> +
> +config HAVE_KVM_MSI
> +   bool
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index a612bc8..3aeb7ab 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2063,6 +2063,24 @@ static long kvm_vm_ioctl(struct file *filp,
>   mutex_unlock(&kvm->lock);
>   break;
>  #endif
> +#ifdef CONFIG_HAVE_KVM_MSI
> + case KVM_SIGNAL_MSI: {
> + struct kvm_kernel_irq_routing_entry route;
> + struct kvm_msi msi;

Zero them (future proof).

> +
> + r = -EFAULT;
> + if (copy_from_user(&msi, argp, sizeof msi))
> + goto out;
> + r = -EINVAL;
> + if (!irqchip_in_kernel(kvm) || msi.flags != 0)
> + goto out;
> + route.msi.address_lo = msi.address_lo;
> + route.msi.address_hi = msi.address_hi;
> + route.msi.data = msi.data;
> + r =  kvm_set_msi(&route, kvm, KVM_USERSPAC

Re: [PATCH 00/13] KVM: MMU: fast page fault

2012-04-11 Thread Takuya Yoshikawa
On Wed, 11 Apr 2012 17:21:30 +0300
Avi Kivity  wrote:

> Currently the main performance bottleneck for migration is qemu, which
> is single threaded and generally inefficient.  However I am sure that
> once the qemu bottlenecks will be removed we'll encounter kvm problems,
> particularly with wide (many vcpus) and large (lots of memory) guests. 
> So it's a good idea to improve in this area.  I agree we'll need to
> measure each change, perhaps with a test program until qemu catches up.

I agree.

I am especially interested in XBRLE + current srcu-less.

> > I am testing the current live migration to see when and for what it can
> > be used.  I really want to see it become stable and usable for real
> > services.

> Well, it's used in production now.

About RHEL6 e.g., yes of course and we are ...

My comment was about the current srcu-less work and whether I can make it
stable enough in this rc-cycle.  I think it will enlarge the real use cases to
some extent.

> > So I really do not want to see drastic change now without any real need
> > or feedback from real users -- this is my point.

> It's a good point, we should avoid change for its own sake.

Yes, especially because live migration users are limited to those who have
such services.

I hope that kernel developers start using it in their desktops!!!???

Thanks,
Takuya
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V5 3/6] kvm : Add unhalt msr to aid (live) migration

2012-04-11 Thread Marcelo Tosatti
On Fri, Mar 23, 2012 at 01:37:26PM +0530, Raghavendra K T wrote:
> From: Raghavendra K T 
> 
> Currently guest does not need to know pv_unhalt state and intended to be
> used via GET/SET_MSR ioctls  during migration.
> 
> Signed-off-by: Raghavendra K T 
> ---
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 9234f13..46f9751 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -40,6 +40,7 @@
>  #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
>  #define MSR_KVM_ASYNC_PF_EN 0x4b564d02
>  #define MSR_KVM_STEAL_TIME  0x4b564d03
> +#define MSR_KVM_PV_UNHALT   0x4b564d04
>  
>  struct kvm_steal_time {
>   __u64 steal;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index bd5ef91..38e6c47 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -784,12 +784,13 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
>   * kvm-specific. Those are put in the beginning of the list.
>   */
>  
> -#define KVM_SAVE_MSRS_BEGIN  9
> +#define KVM_SAVE_MSRS_BEGIN  10
>  static u32 msrs_to_save[] = {
>   MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
>   MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
>   HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
>   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> + MSR_KVM_PV_UNHALT,
>   MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
>   MSR_STAR,
>  #ifdef CONFIG_X86_64
> @@ -1606,7 +1607,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
> u64 data)
>   kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
>  
>   break;
> -
> + case MSR_KVM_PV_UNHALT:
> + vcpu->pv_unhalted = (u32) data;
> + break;
>   case MSR_IA32_MCG_CTL:
>   case MSR_IA32_MCG_STATUS:
>   case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
> @@ -1917,6 +1920,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
> u64 *pdata)
>   case MSR_KVM_STEAL_TIME:
>   data = vcpu->arch.st.msr_val;
>   break;
> + case MSR_KVM_PV_UNHALT:
> + data = (u64)vcpu->pv_unhalted;
> + break;
>   case MSR_IA32_P5_MC_ADDR:
>   case MSR_IA32_P5_MC_TYPE:
>   case MSR_IA32_MCG_CAP:

Unless there is a reason to use an MSR, this should use a normal ioctl
such as KVM_{GET,SET}_MP_STATE.
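
For comparison, migrating this kind of per-vcpu state through the existing
interface looks roughly like the sketch below (KVM_GET_MP_STATE/KVM_SET_MP_STATE
and struct kvm_mp_state are the existing vcpu ioctls; the helper names and the
omitted error handling are illustrative):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	/* Save on the source, restore on the destination; vcpu_fd is a vcpu fd. */
	static int save_mp_state(int vcpu_fd, struct kvm_mp_state *state)
	{
		return ioctl(vcpu_fd, KVM_GET_MP_STATE, state);
	}

	static int restore_mp_state(int vcpu_fd, const struct kvm_mp_state *state)
	{
		return ioctl(vcpu_fd, KVM_SET_MP_STATE, state);
	}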

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V5 2/6] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2012-04-11 Thread Marcelo Tosatti
On Fri, Mar 23, 2012 at 01:37:04PM +0530, Raghavendra K T wrote:
> From: Srivatsa Vaddagiri 
> 
> KVM_HC_KICK_CPU allows the calling vcpu to kick another vcpu out of halt 
> state.
> 
> The presence of these hypercalls is indicated to guest via
> KVM_FEATURE_PV_UNHALT/KVM_CAP_PV_UNHALT.
> 
> Signed-off-by: Srivatsa Vaddagiri 
> Signed-off-by: Suzuki Poulose 
> Signed-off-by: Raghavendra K T 
> ---
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 734c376..9234f13 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -16,12 +16,14 @@
>  #define KVM_FEATURE_CLOCKSOURCE  0
>  #define KVM_FEATURE_NOP_IO_DELAY 1
>  #define KVM_FEATURE_MMU_OP   2
> +
>  /* This indicates that the new set of kvmclock msrs
>   * are available. The use of 0x11 and 0x12 is deprecated
>   */
>  #define KVM_FEATURE_CLOCKSOURCE23
>  #define KVM_FEATURE_ASYNC_PF 4
>  #define KVM_FEATURE_STEAL_TIME   5
> +#define KVM_FEATURE_PV_UNHALT6
>  
>  /* The last 8 bits are used to indicate how to interpret the flags field
>   * in pvclock structure. If no bits are set, all flags are ignored.
> @@ -32,6 +34,7 @@
>  #define MSR_KVM_SYSTEM_TIME 0x12
>  
>  #define KVM_MSR_ENABLED 1
> +
>  /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
>  #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
>  #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 89b02bf..61388b9 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -408,7 +408,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
> u32 function,
>(1 << KVM_FEATURE_NOP_IO_DELAY) |
>(1 << KVM_FEATURE_CLOCKSOURCE2) |
>(1 << KVM_FEATURE_ASYNC_PF) |
> -  (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> +  (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> +  (1 << KVM_FEATURE_PV_UNHALT);
>  
>   if (sched_info_on())
>   entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9cbfc06..bd5ef91 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2079,6 +2079,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>   case KVM_CAP_XSAVE:
>   case KVM_CAP_ASYNC_PF:
>   case KVM_CAP_GET_TSC_KHZ:
> + case KVM_CAP_PV_UNHALT:
>   r = 1;
>   break;
>   case KVM_CAP_COALESCED_MMIO:
> @@ -4913,6 +4914,30 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>   return 1;
>  }
>  
> +/*
> + * kvm_pv_kick_cpu_op:  Kick a vcpu.
> + *
> + * @apicid - apicid of vcpu to be kicked.
> + */
> +static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
> +{
> + struct kvm_vcpu *vcpu = NULL;
> + int i;
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (!kvm_apic_present(vcpu))
> + continue;
> +
> + if (kvm_apic_match_dest(vcpu, 0, 0, apicid, 0))
> + break;
> + }
> + if (vcpu) {
> + vcpu->pv_unhalted = 1;
> + smp_mb();
> + kvm_vcpu_kick(vcpu);
> + }
> +}
> +
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  {
>   unsigned long nr, a0, a1, a2, a3, ret;
> @@ -4946,6 +4971,10 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   case KVM_HC_VAPIC_POLL_IRQ:
>   ret = 0;
>   break;
> + case KVM_HC_KICK_CPU:
> + kvm_pv_kick_cpu_op(vcpu->kvm, a0);
> + ret = 0;
> + break;
>   default:
>   ret = -KVM_ENOSYS;
>   break;
> @@ -6174,6 +6203,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
>   !vcpu->arch.apf.halted)
>   || !list_empty_careful(&vcpu->async_pf.done)
>   || vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
> + || vcpu->pv_unhalted
>   || atomic_read(&vcpu->arch.nmi_queued) ||
>   (kvm_arch_interrupt_allowed(vcpu) &&
>kvm_cpu_has_interrupt(vcpu));
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 68e67e5..e822d96 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -558,6 +558,7 @@ struct kvm_ppc_pvinfo {
>  #define KVM_CAP_PPC_PAPR 68
>  #define KVM_CAP_S390_GMAP 71
>  #define KVM_CAP_TSC_DEADLINE_TIMER 72
> +#define KVM_CAP_PV_UNHALT 73
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 900c763..433ae97 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -158,6 +158,7 @@ struct kvm_vcpu {
>  #endif
>  
>   struct kvm_vcpu_arch arch;
> + int pv_unhalted;
>  };
>  
>  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
> diff --git a/include/linux/kvm_para.h b/inclu

Re: [PATCH RFC V5 0/6] kvm : Paravirt-spinlock support for KVM guests

2012-04-11 Thread Marcelo Tosatti
On Thu, Mar 29, 2012 at 12:02:29AM +0530, Raghavendra K T wrote:
> On 03/23/2012 01:35 PM, Raghavendra K T wrote:
> >The 6-patch series to follow this email extends KVM-hypervisor and Linux 
> >guest
> >running on KVM-hypervisor to support pv-ticket spinlocks, based on Xen's
> >implementation.
> >
> >One hypercall is introduced in KVM hypervisor,that allows a vcpu to kick
> >another vcpu out of halt state.
> >The blocking of vcpu is done using halt() in (lock_spinning) slowpath.
> >one MSR is added to aid live migration.
> >
> >Changes in V5:
> >- rebased to 3.3-rc6
> >- added PV_UNHALT_MSR that would help in live migration (Avi)
> >- removed PV_LOCK_KICK vcpu request and pv_unhalt flag (re)added.
> 
> Sorry for pinging
> I know it is busy time. But I hope to get response on these patches
> in your free time, so that I can target next merge window for this.
> (whether it has reached some good state or it is heading in reverse
> direction!). it would really boost my morale.
> especially MSR stuff and dropping vcpu request bit for PV unhalt.
> 
> - Raghu

Looks good. Only the MSR appears to be an abuse, since there is no need
to expose the info to the guest.
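
For reference, the guest side of the scheme described in the cover letter above
boils down to halting in the lock slowpath and kicking the waiter via the new
hypercall on unlock. A rough sketch follows; kvm_hypercall1, KVM_HC_KICK_CPU,
halt() and x86_cpu_to_apicid are real, while the other helper names are
illustrative and not taken from the series:

	/* Waiter side: called from the ticket-lock slowpath. */
	static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
	{
		note_waiting_cpu(lock, want);		/* illustrative bookkeeping */
		if (!ticket_is_ours(lock, want))	/* illustrative re-check */
			halt();		/* sleep until kicked or an interrupt arrives */
		clear_waiting_cpu();			/* illustrative */
	}

	/* Unlocker side: wake the vcpu waiting on the next ticket. */
	static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
	{
		int cpu = find_waiting_cpu(lock, ticket);	/* illustrative */

		if (cpu >= 0)
			kvm_hypercall1(KVM_HC_KICK_CPU,
				       per_cpu(x86_cpu_to_apicid, cpu));
	}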

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC V5 2/6] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2012-04-11 Thread Marcelo Tosatti
On Wed, Apr 11, 2012 at 09:06:29PM -0300, Marcelo Tosatti wrote:
> On Fri, Mar 23, 2012 at 01:37:04PM +0530, Raghavendra K T wrote:
> > From: Srivatsa Vaddagiri 
> > 
> > KVM_HC_KICK_CPU allows the calling vcpu to kick another vcpu out of halt 
> > state.
> > 
> > The presence of these hypercalls is indicated to guest via
> > KVM_FEATURE_PV_UNHALT/KVM_CAP_PV_UNHALT.
> > 
> > Signed-off-by: Srivatsa Vaddagiri 
> > Signed-off-by: Suzuki Poulose 
> > Signed-off-by: Raghavendra K T 
> > ---
> > diff --git a/arch/x86/include/asm/kvm_para.h 
> > b/arch/x86/include/asm/kvm_para.h
> > index 734c376..9234f13 100644
> > --- a/arch/x86/include/asm/kvm_para.h
> > +++ b/arch/x86/include/asm/kvm_para.h
> > @@ -16,12 +16,14 @@
> >  #define KVM_FEATURE_CLOCKSOURCE0
> >  #define KVM_FEATURE_NOP_IO_DELAY   1
> >  #define KVM_FEATURE_MMU_OP 2
> > +
> >  /* This indicates that the new set of kvmclock msrs
> >   * are available. The use of 0x11 and 0x12 is deprecated
> >   */
> >  #define KVM_FEATURE_CLOCKSOURCE23
> >  #define KVM_FEATURE_ASYNC_PF   4
> >  #define KVM_FEATURE_STEAL_TIME 5
> > +#define KVM_FEATURE_PV_UNHALT  6
> >  
> >  /* The last 8 bits are used to indicate how to interpret the flags field
> >   * in pvclock structure. If no bits are set, all flags are ignored.
> > @@ -32,6 +34,7 @@
> >  #define MSR_KVM_SYSTEM_TIME 0x12
> >  
> >  #define KVM_MSR_ENABLED 1
> > +
> >  /* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
> >  #define MSR_KVM_WALL_CLOCK_NEW  0x4b564d00
> >  #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index 89b02bf..61388b9 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -408,7 +408,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
> > u32 function,
> >  (1 << KVM_FEATURE_NOP_IO_DELAY) |
> >  (1 << KVM_FEATURE_CLOCKSOURCE2) |
> >  (1 << KVM_FEATURE_ASYNC_PF) |
> > -(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> > +(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> > +(1 << KVM_FEATURE_PV_UNHALT);
> >  
> > if (sched_info_on())
> > entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9cbfc06..bd5ef91 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2079,6 +2079,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > case KVM_CAP_XSAVE:
> > case KVM_CAP_ASYNC_PF:
> > case KVM_CAP_GET_TSC_KHZ:
> > +   case KVM_CAP_PV_UNHALT:
> > r = 1;
> > break;
> > case KVM_CAP_COALESCED_MMIO:
> > @@ -4913,6 +4914,30 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
> > return 1;
> >  }
> >  
> > +/*
> > + * kvm_pv_kick_cpu_op:  Kick a vcpu.
> > + *
> > + * @apicid - apicid of vcpu to be kicked.
> > + */
> > +static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
> > +{
> > +   struct kvm_vcpu *vcpu = NULL;
> > +   int i;
> > +
> > +   kvm_for_each_vcpu(i, vcpu, kvm) {
> > +   if (!kvm_apic_present(vcpu))
> > +   continue;
> > +
> > +   if (kvm_apic_match_dest(vcpu, 0, 0, apicid, 0))
> > +   break;
> > +   }
> > +   if (vcpu) {
> > +   vcpu->pv_unhalted = 1;
> > +   smp_mb();
> > +   kvm_vcpu_kick(vcpu);
> > +   }
> > +}
> > +
> >  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >  {
> > unsigned long nr, a0, a1, a2, a3, ret;
> > @@ -4946,6 +4971,10 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> > case KVM_HC_VAPIC_POLL_IRQ:
> > ret = 0;
> > break;
> > +   case KVM_HC_KICK_CPU:
> > +   kvm_pv_kick_cpu_op(vcpu->kvm, a0);
> > +   ret = 0;
> > +   break;
> > default:
> > ret = -KVM_ENOSYS;
> > break;
> > @@ -6174,6 +6203,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
> > !vcpu->arch.apf.halted)
> > || !list_empty_careful(&vcpu->async_pf.done)
> > || vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
> > +   || vcpu->pv_unhalted
> > || atomic_read(&vcpu->arch.nmi_queued) ||
> > (kvm_arch_interrupt_allowed(vcpu) &&
> >  kvm_cpu_has_interrupt(vcpu));
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index 68e67e5..e822d96 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -558,6 +558,7 @@ struct kvm_ppc_pvinfo {
> >  #define KVM_CAP_PPC_PAPR 68
> >  #define KVM_CAP_S390_GMAP 71
> >  #define KVM_CAP_TSC_DEADLINE_TIMER 72
> > +#define KVM_CAP_PV_UNHALT 73
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 900c763..433ae97 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @

Re: [PATCH v2 1/2] kvm: unmap pages from the iommu when slots are removed

2012-04-11 Thread Marcelo Tosatti
On Wed, Apr 11, 2012 at 09:51:49AM -0600, Alex Williamson wrote:
> We've been adding new mappings, but not destroying old mappings.
> This can lead to a page leak as pages are pinned using
> get_user_pages, but only unpinned with put_page if they still
> exist in the memslots list on vm shutdown.  A memslot that is
> destroyed while an iommu domain is enabled for the guest will
> therefore result in an elevated page reference count that is
> never cleared.
> 
> Additionally, without this fix, the iommu is only programmed
> with the first translation for a gpa.  This can result in
> peer-to-peer errors if a mapping is destroyed and replaced by a
> new mapping at the same gpa as the iommu will still be pointing
> to the original, pinned memory address.
> 
> Signed-off-by: Alex Williamson 

Applied, thanks.
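
The rule the fix restores can be illustrated with a generic sketch (this is
not the kvm iommu code itself): every page pinned when a mapping is created
must be unpinned again when that mapping is torn down, not only at VM
shutdown.

#include <linux/mm.h>

/*
 * Generic pin/unpin pairing, for illustration only: pages pinned with
 * get_user_pages_fast() keep an elevated refcount until put_page() is
 * called, so destroying a mapping without unpinning leaks the pages.
 */
static void unpin_pages(struct page **pages, int npages)
{
        while (npages--)
                put_page(pages[npages]);
}

static int pin_pages(unsigned long uaddr, int npages, struct page **pages)
{
        int got = get_user_pages_fast(uaddr, npages, 1, pages);

        if (got < npages) {
                if (got > 0)
                        unpin_pages(pages, got);
                return -EFAULT;
        }
        return 0;
}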



Re: [PATCH] kvm-amd: Auto-load on CPUs with SVM.

2012-04-11 Thread Marcelo Tosatti
On Wed, Mar 28, 2012 at 11:32:28AM -0700, Josh Triplett wrote:
> Enable x86 feature-based autoloading for the kvm-amd module on CPUs
> with X86_FEATURE_SVM.
> 
> Signed-off-by: Josh Triplett 
> ---
> 
> On Wed, Mar 28, 2012 at 01:26:01PM +0200, Avi Kivity wrote:
> > On 03/21/2012 08:33 AM, Josh Triplett wrote:
> > > Enable x86 feature-based autoloading for the kvm-intel module on CPUs
> > > with X86_FEATURE_VMX.
> > 
> > Thanks, applied.
> 
> As promised, the corresponding patch for kvm-amd.
> 
>  arch/x86/kvm/svm.c |7 +++
>  1 files changed, 7 insertions(+), 0 deletions(-)

Applied, thanks.
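
For reference, feature-based autoloading boils down to the module exporting a
CPU feature match table; a sketch of the shape such a hook takes (the actual
hunk is not quoted above, so details may differ):

#include <linux/module.h>
#include <asm/cpufeature.h>
#include <asm/cpu_device_id.h>

/*
 * With a table like this in the module, udev/modprobe can load kvm-amd
 * automatically on any CPU advertising the SVM feature flag.
 */
static const struct x86_cpu_id svm_cpu_id[] = {
        X86_FEATURE_MATCH(X86_FEATURE_SVM),
        {}
};
MODULE_DEVICE_TABLE(x86cpu, svm_cpu_id);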



Re: [PATCH] kvm: dont clear TMR on EOI

2012-04-11 Thread Marcelo Tosatti
On Wed, Apr 11, 2012 at 06:49:55PM +0300, Michael S. Tsirkin wrote:
> Intel spec says that TMR needs to be set/cleared
> when IRR is set, but kvm also clears it on  EOI.
> 
> I did some tests on a real (AMD based) system,
> and I see same TMR values both before
> and after EOI, so I think it's a minor bug in kvm.
> 
> This patch fixes TMR to be set/cleared on IRR set
> only as per spec.
> 
> And now that we don't clear TMR, for an EOI that is not propagated
> to the ioapic we can avoid the atomic read of TMR, by first checking
> whether the ioapic needs the vector at all and only calculating the
> trigger mode afterwards.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  arch/x86/kvm/lapic.c |   19 +--
>  virt/kvm/ioapic.c|   10 +++---
>  virt/kvm/ioapic.h|1 +
>  3 files changed, 21 insertions(+), 9 deletions(-)

Looks OK, ioapic_service -> accept_apic_irq will set TMR 
again if IRR is raised. Gleb, can you review please?
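
A toy model of the invariant the patch enforces (not the lapic code itself):
the TMR bit for a vector is written only at injection time, together with
IRR, and the EOI path merely reads it to pick edge versus level.

#include <linux/types.h>
#include <linux/bitops.h>

#define TOY_APIC_VECTORS        256

/* Toy model only: TMR records the trigger mode of the injected
 * interrupt and is left untouched by EOI. */
struct toy_apic {
        DECLARE_BITMAP(irr, TOY_APIC_VECTORS);
        DECLARE_BITMAP(tmr, TOY_APIC_VECTORS);
};

static void toy_accept_irq(struct toy_apic *apic, int vector, bool level)
{
        if (level)
                set_bit(vector, apic->tmr);
        else
                clear_bit(vector, apic->tmr);
        set_bit(vector, apic->irr);
}

static bool toy_eoi_is_level(struct toy_apic *apic, int vector)
{
        return test_bit(vector, apic->tmr);     /* no clearing on EOI */
}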




Re: [PATCH V9 1/1] Guest stop notification

2012-04-11 Thread Marcelo Tosatti
On Sat, Apr 07, 2012 at 06:17:47AM +0530, Raghavendra K T wrote:
> From: Eric B Munson 
> 
> Often when a guest is stopped from the qemu console, it will report spurious
> soft lockup warnings on resume.  There are kernel patches being discussed that
> will give the host the ability to tell the guest that it is being stopped and
> should ignore the soft lockup warning that this generates.  This patch uses
> the qemu Notifier system to tell the guest it is about to be stopped.
> 
> Signed-off-by: Eric B Munson  
> Signed-off-by: Raghavendra K T 

Applied, thanks.
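
A minimal sketch of the hook such a notification can hang off in qemu,
assuming the VM change state handler API of that era (the applied patch is
not quoted above, so the callback it actually registers may differ):

#include "sysemu.h"     /* qemu_add_vm_change_state_handler(), RunState */

/* Sketch only: this callback fires on every VM run state change; when
 * the VM stops, this is the point at which the host can tell the guest
 * it is being paused so spurious soft lockup warnings are suppressed. */
static void guest_stop_notify(void *opaque, int running, RunState state)
{
    if (!running) {
        /* hypothetical: notify each vcpu, e.g. via the per-vcpu ioctl
         * proposed in the kernel-side patches mentioned above */
    }
}

static void register_guest_stop_notifier(void)
{
    qemu_add_vm_change_state_handler(guest_stop_notify, NULL);
}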



Guest hang with > 5 virtio disks with recent kernels and qemu-kvm git

2012-04-11 Thread Alexandre DERUMIER
Hi,
I'm a contributor to the Proxmox 2 distribution.

We use the latest qemu-kvm git version, and users report guest hangs at udev
start, during virtio device initialization.

http://forum.proxmox.com/threads/9057-virtio-net-crashing-after-upgrade-to-proxmox-2-0
(screenshots are available in the forum thread)


This happens if we have:
 - 5 or more virtio disks, or
 - 4 virtio disks and 1 or more virtio nics.

working guests
--
- guests with a 2.6.32 kernel, like debian squeeze, boot fine
- debian wheezy with the squeeze 2.6.32 kernel boots fine.

non working guests
---
gentoo with the following kernels hangs at udev start:
- 3.0.17
- 3.1.6
- 3.2.1
- 3.2.12

- centos 6.2 with its 2.6.32 kernel plus backported patches also hangs.
- debian wheezy with the 3.2 kernel also hangs.



The same guests/kernels boot fine with qemu-kvm 0.15.

So I can't tell if it's a kernel problem or a qemu-kvm problem.


command line sample:

 /usr/bin/kvm -id 100 -chardev 
socket,id=monitor,path=/var/run/qemu-server/100.mon,server,nowait -mon 
chardev=monitor,mode=readline -vnc 
unix:/var/run/qemu-server/100.vnc,x509,password -pidfile 
/var/run/qemu-server/100.pid -daemonize -usbdevice tablet -name centos-6.2 -smp 
sockets=1,cores=4 -nodefaults -boot menu=on -vga cirrus -localtime -k en-us 
-drive 
file=/dev/disk5/vm-100-disk-1,if=none,id=drive-virtio3,aio=native,cache=none 
-device virtio-blk-pci,drive=drive-virtio3,id=virtio3,bus=pci.0,addr=0xd -drive 
file=/dev/disk3/vm-100-disk-1,if=none,id=drive-virtio1,aio=native,cache=none 
-device virtio-blk-pci,drive=drive-virtio1,id=virtio1,bus=pci.0,addr=0xb -drive 
if=none,id=drive-ide2,media=cdrom,aio=native -device 
ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200 -drive 
file=/dev/disk2/vm-100-disk-1,if=none,id=drive-virtio0,aio=native,cache=none 
-device 
virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=102 
-drive 
file=/dev/disk4/vm-100-disk-1,if=none,id=drive-virtio2,aio=native,cache=none 
-device virtio-blk-pci,drive=drive-virtio2,id=virtio2,bus=pci.0,addr=0xc -m 
8192 -netdev 
type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge,vhost=on
 -device virtio-net-pci,mac=6A:A3:E9:EA:51:17,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300


I have tried with/without vhost and with different PCI addresses, with the same results.


tests made
---

- 3 virtio disks + 1 virtio-net = OK
- 3 virtio disks + 2 virtio-net = OK
- 3 virtio disks + 1 scsi (lsi) disk + 1 virtio-net = OK
- 3 virtio disks + 1 scsi (lsi) disk + 2 virtio-net = OK
- 4 virtio disks + 1 virtio-net = NOK (hang at net init on the virtio-net)
- 4 virtio disks + 1 e1000 = OK
- 4 virtio disks + 1 e1000 + 1 virtio-net = NOK (hang at net init on the 
virtio-net)
- 5 virtio disks + 1 e1000 = NOK (udevadm settle timeout on disk N°5, which 
becomes unusable)
- 5 virtio disks + 2 virtio-net = NOK (udevadm settle timeout on disk N°5 + 
hang on the virtio-net)
- 5 virtio disks + 3 virtio-net = NOK (udev settle timeout on disk N°5 + hang 
on the first virtio-net)


Can someone reproduce the problem?



Best Regards,
Alexandre Derumier