Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-15 Thread Gleb Natapov
On Mon, May 16, 2011 at 02:11:40AM +0300, Nadav Har'El wrote:
> On Thu, May 12, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested 
> VMX, v9":
> > Ah, yes.  For live migration to work, all vmcb state must be accessible 
> > via vendor-independent accessors once an exit is completely handled.  
> > For example, GPRs are accessible via kvm_register_read(), and without 
> > nesting, interrupt state is stowed in the interrupt queue, but if you 
> > keep IDT_VECTORING_INFO live between exit and entry, you can lose it if 
> > you migrate at this point.
> 
> Hi, I can quite easily save this state in a different place that does get
> saved - the easiest would be to use vmcs12, which has fields for exactly the
> state we want to save (and they are rewritten anyway when we exit to L1).
> 
This will not address the problem that the state will not be visible to
generic logic in x86.c.

> Avi, would you like me to use this sort of solution to avoid the extra
> state? Of course, live migration with nested VMX probably still doesn't
> work for a dozen other reasons anyway :(
> 
> Or do you consider this not enough, and think it necessary that nested VMX
> use exactly the same logic as nested SVM does - namely, tricks like SVM's
> "exit_required" instead of our different tricks?
> 
Given the two solutions, I prefer the SVM one. Yes, I know that you asked Avi :)

--
Gleb.


Re: [PATCH 10/13] kvm/powerpc: Add support for Book3S processors in hypervisor mode

2011-05-15 Thread Paul Mackerras
On Sun, May 15, 2011 at 11:58:12PM +0200, Alexander Graf wrote:
> 
> On 11.05.2011, at 12:44, Paul Mackerras wrote:

> > +#ifdef CONFIG_KVM_BOOK3S_NONHV
> 
> I really liked how you called the .c file _pr - why call it NONHV now?

I agree, CONFIG_KVM_BOOK3S_PR would be better, I'll change it.

> > diff --git a/arch/powerpc/include/asm/paca.h 
> > b/arch/powerpc/include/asm/paca.h
> > index 7412676..8dba5f6 100644
> > --- a/arch/powerpc/include/asm/paca.h
> > +++ b/arch/powerpc/include/asm/paca.h
> > @@ -149,6 +149,16 @@ struct paca_struct {
> > #ifdef CONFIG_KVM_BOOK3S_HANDLER
> > /* We use this to store guest state in */
> > struct kvmppc_book3s_shadow_vcpu shadow_vcpu;
> > +#ifdef CONFIG_KVM_BOOK3S_64_HV
> > +   struct kvm_vcpu *kvm_vcpu;
> > +   u64 dabr;
> > +   u64 host_mmcr[3];
> > +   u32 host_pmc[6];
> > +   u64 host_purr;
> > +   u64 host_spurr;
> > +   u64 host_dscr;
> > +   u64 dec_expires;
> 
> Hrm. I'd say either push those into shadow_vcpu for HV mode or get
> rid of the shadow_vcpu reference. I'd probably prefer the former.

These fields hold pieces of host state that we need to save and restore
across execution of a guest; they don't apply to any specific guest or
vcpu.  That's why I didn't put them in shadow_vcpu, which is specifically
for one vcpu in one guest.

Given that book3s_pr copies the shadow_vcpu into and out of the paca,
I thought it best not to add fields to it that are only live while we
are in the guest.  True, these fields only exist for book3s_hv, but if
we later on make it possible to select book3s_pr vs. book3s_hv at
runtime, we won't want to be copying these fields into and out of the
paca when book3s_pr is active.

Maybe we need another struct, kvm_host_state or something like that,
to save this sort of state.
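For illustration, a minimal sketch of what such a struct might look like
(editorial sketch only; the name and field list are assumptions taken from
the paca fields quoted above, not something from the patch):

#include <linux/types.h>

struct kvmppc_host_state {
        /* host register state saved before entering a guest and
         * restored afterwards, independent of any particular vcpu */
        u64 dabr;
        u64 host_mmcr[3];
        u32 host_pmc[6];
        u64 host_purr;
        u64 host_spurr;
        u64 host_dscr;
        u64 dec_expires;
};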

> > @@ -65,6 +98,7 @@ config KVM_440
> > bool "KVM support for PowerPC 440 processors"
> > depends on EXPERIMENTAL && 44x
> > select KVM
> > +   select KVM_MMIO
> 
> e500 should also select MMIO, no?

Good point, I'll fix that.

> > +long kvmppc_alloc_hpt(struct kvm *kvm)
> > +{
> > +   unsigned long hpt;
> > +   unsigned long lpid;
> > +
> > +   hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|__GFP_NOWARN,
> > +  HPT_ORDER - PAGE_SHIFT);
> 
> This would end up failing quite often, no?

In practice it seems to be OK, possibly because the machines we're
testing this on have plenty of memory.  Maybe we should get qemu to
allocate the HPT using hugetlbfs so the memory will come from the
reserved page pool.  It does need to be physically contiguous and
aligned on a multiple of its size -- that's a hardware requirement.
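For illustration, a user-space sketch of the hugetlbfs idea (editorial
sketch only, not part of the series; the mount point, file name and the
16 MB size are assumptions):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t hpt_size = 16UL << 20;   /* assumed HPT size */
        /* hypothetical hugetlbfs mount point and file name */
        int fd = open("/dev/hugepages/kvm-hpt", O_CREAT | O_RDWR, 0600);
        void *hpt;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* A 16 MB huge page is physically contiguous and naturally
         * aligned, which satisfies the hardware requirement above. */
        hpt = mmap(NULL, hpt_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, 0);
        if (hpt == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /* qemu would then hand this region to the kernel for use as
         * the guest HPT */
        munmap(hpt, hpt_size);
        close(fd);
        return 0;
}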

> > +   kvm->arch.sdr1 = __pa(hpt) | (HPT_ORDER - 18);
> > +   kvm->arch.lpid = lpid;
> > +   kvm->arch.host_sdr1 = mfspr(SPRN_SDR1);
> > +   kvm->arch.host_lpid = mfspr(SPRN_LPID);
> > +   kvm->arch.host_lpcr = mfspr(SPRN_LPCR);
> 
> How do these correlate with the guest's hv mmu? I'd like to keep the
> code clean enough so we can potentially use it for PR mode as well :). 

The host SDR1 and LPID are different from the guest's.  That is, the
guest has its own HPT which is quite separate from the host's.  The
host values could be saved in global variables, though; there's no
real need for each VM to have its own copy, except that doing it this
way simplifies the low-level assembly code a little.
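As an aside, a sketch of that alternative (editorial sketch only; the
globals and the function name are hypothetical, the mfspr/SPRN_* usage
follows the hunk quoted above):

static u64 kvmppc_host_sdr1, kvmppc_host_lpid, kvmppc_host_lpcr;

/* capture the host MMU registers once, e.g. at module init time */
static void kvmppc_save_host_mmu_state(void)
{
        kvmppc_host_sdr1 = mfspr(SPRN_SDR1);
        kvmppc_host_lpid = mfspr(SPRN_LPID);
        kvmppc_host_lpcr = mfspr(SPRN_LPCR);
}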

> > +   /* First see what page size we have */
> > +   psize = user_page_size(mem->userspace_addr);
> > +   /* For now, only allow 16MB pages */
> 
> The reason to go for 16MB pages is because of the host mmu code, not
> the guest hv mmu. So please at least #ifdef the code to HV so we
> document that correlation. 

I'm not sure what you mean by that.  The reason for going for 16MB
pages initially is for performance (this way the guest can use 16MB
pages for its linear mapping) and to avoid having to deal with the
pages being paged or swapped on the host side.  Could you explain the
"because of the host mmu code" part of your comment further?

What would adding #ifdef CONFIG_KVM_BOOK3S_64_HV achieve in a file
whose name ends in _hv.c and which only gets compiled when
CONFIG_KVM_BOOK3S_64_HV=y?

> > +int kvmppc_mmu_hv_init(void)
> > +{
> > +   if (!cpu_has_feature(CPU_FTR_HVMODE_206))
> > +   return 0;
> 
> Return 0 for "it doesn't work" might not be the right exit code ;).

Good point, I'll make it -ENXIO or something.
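For concreteness, a sketch of the agreed change (editorial; only the error
path is shown, the rest of the function is elided in the quoted hunk):

int kvmppc_mmu_hv_init(void)
{
        if (!cpu_has_feature(CPU_FTR_HVMODE_206))
                return -ENXIO;  /* no usable hypervisor mode on this CPU */
        /* ... remaining initialisation ... */
        return 0;
}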

> > +static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t 
> > eaddr,
> > +   struct kvmppc_pte *gpte, bool data)
> > +{
> > +   return -ENOENT;
> 
> Can't you just call the normal book3s_64 mmu code here? Without
> xlate, doing ppc_ld doesn't work, which means that reading out the
> faulting guest instruction breaks. We'd also need it for PR mode :). 

With book3s_hv we currently have no situations where we need to read
out the faulting guest instruction.  

We could use the normal code here if we had the guest HPT mapped into
the q

Re: [PATCH 10/13] kvm/powerpc: Add support for Book3S processors in hypervisor mode

2011-05-15 Thread Paul Mackerras
On Thu, May 12, 2011 at 12:07:17PM +0300, Avi Kivity wrote:
> On 05/11/2011 01:44 PM, Paul Mackerras wrote:

> >--- a/include/linux/kvm.h
> >+++ b/include/linux/kvm.h
> >@@ -161,6 +161,7 @@ struct kvm_pit_config {
> >  #define KVM_EXIT_NMI  16
> >  #define KVM_EXIT_INTERNAL_ERROR   17
> >  #define KVM_EXIT_OSI  18
> >+#define KVM_EXIT_PAPR_HCALL   19
> >
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  #define KVM_INTERNAL_ERROR_EMULATION 1
> >@@ -264,6 +265,11 @@ struct kvm_run {
> > struct {
> > __u64 gprs[32];
> > } osi;
> >+struct {
> >+__u64 nr;
> >+__u64 ret;
> >+__u64 args[9];
> >+} papr_hcall;
> > /* Fix the size of the union. */
> > char padding[256];
> > };
> 
> Please document this in Documentation/kvm/api.txt.

I'll add a description of the basic calling convention in the next
version of the patches.  The full description of all the possible
hypercalls is in the PAPR version 2.4 document (826 pages) on the
www.power.org website.  You have to become a power.org member to
download it, but membership is free for individual developers.
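For illustration, a minimal sketch of how userspace might consume the new
exit, using only the kvm_run fields added above (editorial sketch;
handle_papr_hcall() is a hypothetical helper, not part of the patch):

#include <linux/kvm.h>

extern __u64 handle_papr_hcall(__u64 nr, __u64 *args); /* hypothetical */

static void handle_exit(struct kvm_run *run)
{
        switch (run->exit_reason) {
        case KVM_EXIT_PAPR_HCALL:
                /* nr selects the hypercall, args[] carries its arguments;
                 * the result is passed back to the guest in ret. */
                run->papr_hcall.ret = handle_papr_hcall(run->papr_hcall.nr,
                                                        run->papr_hcall.args);
                break;
        default:
                break;
        }
}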

Paul.


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-15 Thread Nadav Har'El
On Thu, May 12, 2011, Avi Kivity wrote about "Re: [PATCH 0/30] nVMX: Nested 
VMX, v9":
> Ah, yes.  For live migration to work, all vmcb state must be accessible 
> via vendor-independent accessors once an exit is completely handled.  
> For example, GPRs are accessible via kvm_register_read(), and without 
> nesting, interrupt state is stowed in the interrupt queue, but if you 
> keep IDT_VECTORING_INFO live between exit and entry, you can lose it if 
> you migrate at this point.

Hi, I can quite easily save this state in a different place that does get
saved - the easiest would be to use vmcs12, which has fields for exactly the
state we want to save (and they are rewritten anyway when we exit to L1).
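For illustration, a minimal sketch of this idea (editorial sketch, not a
patch from this series; it assumes vmcs12 mirrors the architectural VMCS
field names):

static void nested_stash_idt_vectoring_info(struct vmcs12 *vmcs12)
{
        /* Stash the live IDT-vectoring state into vmcs12 right after a
         * vmexit, so it lives in state that is already saved and is
         * rewritten on an exit to L1 anyway. */
        vmcs12->idt_vectoring_info_field =
                vmcs_read32(IDT_VECTORING_INFO_FIELD);
        vmcs12->idt_vectoring_error_code =
                vmcs_read32(IDT_VECTORING_ERROR_CODE);
        vmcs12->vm_exit_instruction_len =
                vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
}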

Avi, would you like me to use this sort of solution to avoid the extra
state? Of course, live migration with nested VMX probably still doesn't
work for a dozen other reasons anyway :(

Or do you consider this not enough, and think it necessary that nested VMX
use exactly the same logic as nested SVM does - namely, tricks like SVM's
"exit_required" instead of our different tricks?

Nadav.

-- 
Nadav Har'El|   Monday, May 16 2011, 12 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Corduroy pillows - they're making
http://nadav.harel.org.il   |headlines!


Re: X86EMUL_PROPAGATE_FAULT

2011-05-15 Thread kvm

As regards this issue, I don't think it's a guest OS bug, since when I run
the same OS without the "RET" emulation I get no error. If this bug is
related to the RET emulation code, could you please provide a fix?

thanks,

MS

On 05/05/2011 06:05 PM, Matteo wrote:

Hello to everybody,

I am working on KVM from kernel version 2.6.38
and I'm facing a new problem with an emulated instruction
whose implementation is already in kvm.

The error shows up after the emulation of the "RET" instruction (opcode
byte C3).

When trying to emulate the instruction at the address popped off the stack
by the RET, I get an X86EMUL_PROPAGATE_FAULT error due to gpa ==
UNMAPPED_GVA, as you can see in the following debug trace:

---8<-
x86_decode_insn:2705 -> Starting New Instruction Decode

x86_decode_insn:2709 -> c->eip = ctxt->eip = 3226138255
x86_decode_insn:2759 -> Opcode -> c3
x86_decode_insn:2928 -> Decode and fetch the source operand
x86_decode_insn:2931 -> SrcNone
x86_decode_insn:3015 -> Decode and fetch the second source operand
x86_decode_insn:3018 -> Src2None
x86_decode_insn:3044 -> Decode and fetch the destination operand
x86_decode_insn:3089 -> ImplicitOps
x86_decode_insn:3092 -> No Destination Operand
x86_emulate_instruction:4458 -> Returned from x86_decode_insn with r = 0

x86_emulate_insn:3194 -> starting special_insn...
x86_emulate_insn:3196 -> c->eip = 3226138256
x86_emulate_insn:3565 -> starting writeback...
writeback:1178 -> c->eip = 2147483648
x86_emulate_instruction:4538 -> Return from x86_emulate_insn with 
code r = 0
---8<--- 



So the next instruction will be emulated by reading the opcode at
eip=2147483648, as stated above,

but the emulation fails with the following debug trace:

---8<--- 


x86_decode_insn:2705 -> Starting New Instruction Decode

x86_decode_insn:2709 -> c->eip = ctxt->eip = 2147483648
x86_decode_insn:2757 -> Read opcode from eip
kvm_read_guest_virt_helper:3724 -> gpa == UNMAPPED_GVA return 
X86EMUL_PROPAGATE_FAULT

do_fetch_insn_byte:573 -> ops->fetch returns an error
---8< 



The instruction has returned to an EIP that is outside RAM (2147483648 is 
0x80000000), so kvm is unable to fetch the next instruction.  This is 
likely due to a bug (in kvm or the guest) that occurred much earlier.






Re: [Qemu-devel] [RFC] live snapshot, live merge, live block migration

2011-05-15 Thread Jagane Sundar

Hello Dor,

One important advantage of live snapshot over live backup is support of
multiple (consecutive) live snapshots while there can be only a single
live backup at one time.

This is why I tend to think that although live backup carries some benefit
(no merge required), live snapshot + live merge are a more robust
mechanism.



The two things that concern me regarding the
live snapshot/live merge approach are:
1. The performance impact of having
multiple active snapshots.
2. The robustness of this solution in the face of
disk errors, etc. If any one of the snapshot
files were to get corrupted, the whole VM is
adversely impacted.

The primary goal of the Livebackup architecture was to have zero
performance impact on the running VM.

Livebackup impacts the VM's performance only while the backup client
is connected to qemu to transfer the modified blocks, which should
be, say, 15 minutes a day for a VM on a daily backup schedule.

One useful thing to do is to evaluate the important use cases
for this technology, and then decide which approach makes
the most sense. As an example, let me state this use case:
- An IaaS cloud, where VMs are always on, running off of a local
  disk, and need to be backed up once a day or so.

Can you list some of the other use cases that live snapshot and
live merge were designed to solve? Perhaps we can put up a
single wiki page that describes all of these proposals.

Thanks,
Jagane



Re: snapmirror functionality using qcow2 and rsync?

2011-05-15 Thread Dor Laor

On 05/15/2011 10:18 PM, Fred van Zwieten wrote:

Hello Jagane,

My original idea was this: Have a data disk implemented as an LV. Use
drbd or dm-replicator to asynchronously replicate the vg to a remote
location. When I make a snapshot of the LV, the snapshot will be


It can work but it might be way too expensive if you have many unrelated 
LVs on that VG.



replicated along with it. Have an application quiesce itself and make
the LV snapshot (doing it from within a vm is also possible, but a bit
more challenging) on the source site. This means there is an
application consistent snapshot available. The problem is that the lvm
administration also needs to be replicated to the second site. Quite
difficult, I think.

Another idea, not using lvm but qcow2, is to create a snapshot in a
separate file from the master file. The snapshot file holds the
changed blocks. Then, when you create a second snapshot, the changes
from that point onwards will be placed in the second snapshot, making
the first one static. The first one can be used for replication. When
that is done, it can be deleted. A new snapshot is created and will
function as the new target for changed blocks and the snapshot before
that one will be replicated and so on. This means 2 snapshot files are
needed to make this work. The snapshots on the other site will be


There is no need to create 2 snapshots. A single snapshot is enough.
In order to have efficient diff sync we'll only need a way to poke into 
the image format file to get the dirty blocks. We can add a verb to 
qemu-img to do that.



stored next to a copy of the master file. When needed a vm can be
started on any of these snapshots. Guess what, the NetApp SnapMirror
mechanism also uses two snapshots to make it work.


In theory we can even utilize the storage vendor's snapshot mechanism, 
NetApp in this case, and create snapshots using it rather than 
qcow2/lvm/other.




I know NetApp and LVM but am rather new to qcow2, so I am not sure if
I'm suggesting anything remotely sane here; forgive me if that's the
case.

Greetz,

Fred



On Sun, May 15, 2011 at 7:16 PM, Jagane Sundar  wrote:

Hello Fred,

As Stefan points out, there is a lot of interest in such functionality.

I am the author of a feature called Livebackup - which is aimed at
solving exactly this problem - backup of live virtual machines. The
additional constraint I place is that backup of the VM should be
enabled regardless of the type of storage in use - file on local disk,
file on NAS, LVM logical volume on local disk, LVM volume on an
iSCSI volume.

More information is at:
http://wiki.qemu.org/Features/Livebackup

I investigated the approach that you outline below - LVM snapshot
followed by transfer of the blocks out of the VM host. The problem
I ran into was that there was no easy way to identify the blocks that
were modified since the last backup was taken. To the best of my
knowledge, rsync uses checksums of entire files to determine whether
a file was modified or not - in our use case the whole virtual disk
is one single file, so rsync will not work well.

One of the 'features' of Livebackup is that I chose to invent my own
rudimentary protocol for transferring the blocks out of the VM
host. iSCSI seems like a standard protocol that would suit the
purpose; however it does not fit well with the architecture of
Livebackup - I would probably need to implement an iSCSI target
function inside Livebackup.

Thanks,
Jagane

On 5/15/2011 7:11 AM, Stefan Hajnoczi wrote:


On Sun, May 15, 2011 at 1:16 PM, Fred van Zwieten
  wrote:


Background:
NetApp Snapshot functionality gives you application consistent
snapshots of data. Just inform the app a snapshot is about to be made;
depending on the app, it needs to go into some sort of backup mode,
or just stop and flush outstanding I/O. Next, a snapshot is made and
everything just runs on. Because of the underlying WAFL filesystem
design, the snapshot always points to the blocks at the time of the
creation without needing to do any COW.

Layered on top of this functionality is SnapMirror, where the delta
between this snapshot and a previous snapshot (both being static in
nature), is asynchronously replicated to a second system. There, this
delta is applied to a copy of the disk as a local snapshot.

This setup gives you application consistent data disks on a remote
location as a form of disaster tolerancy. The RPO is the
snapshot/snapmirror frequency.

KVM:
My question is rather simple. Could something like this be implemented
with kvm-img and rsync and/or lvm? I've played with the backing_file
option, but this means I have to shut down the vm and boot it from the new
file to make this work.


This recent thread might interest you:
http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg00733.html

Basically you have to cobble it together yourself at the moment but
there is activity around live snapshots and dirty block tracking, so
hopefully they will be available as APIs in the future.

Stefan





Re: snapmirror functionality using qcow2 and rsync?

2011-05-15 Thread Fred van Zwieten
Hello Jagane,

My original idea was this: Have a data disk implemented as an LV. Use
drbd or dm-replicator to asynchronously replicate the vg to a remote
location. When I make a snapshot of the LV, the snapshot will be
replicated along with it. Have an application quiesce itself and make
the LV snapshot (doing it from within a vm is also possible, but a bit
more challenging) on the source site. This means there is an
application consistent snapshot available. The problem is that the lvm
administration also needs to be replicated to the second site. Quite
difficult, I think.

Another idea, not using lvm but qcow2, is to create a snapshot in a
separate file from the master file. The snapshot file holds the
changed blocks. Then, when you create a second snapshot, the changes
from that point onwards will be placed in the second snapshot, making
the first one static. The first one can be used for replication. When
that is done, it can be deleted. A new snapshot is created and will
function as the new target for changed blocks and the snapshot before
that one will be replicated and so on. This means 2 snapshot files are
needed to make this work. The snapshots on the other site will be
stored next to a copy of the master file. When needed a vm can be
started on any of these snapshots. Guess what, the NetApp SnapMirror
mechanism also uses two snapshots to make it work.

I know NetApp and LVM but am rather new to qcow2, so I am not sure if
I'm suggesting anything remotely sane here; forgive me if that's the
case.

Greetz,

Fred



On Sun, May 15, 2011 at 7:16 PM, Jagane Sundar  wrote:
> Hello Fred,
>
> As Stefan points out, there is a lot of interest in such functionality.
>
> I am the author of a feature called Livebackup - which is aimed at
> solving exactly this problem - backup of live virtual machines. The
> additional constraint I place is that backup of the VM should be
> enabled regardless of the type of storage in use - file on local disk,
> file on NAS, LVM logical volume on local disk, LVM volume on an
> iSCSI volume.
>
> More information is at:
> http://wiki.qemu.org/Features/Livebackup
>
> I investigated the approach that you outline below - LVM snapshot
> followed by transfer of the blocks out of the VM host. The problem
> I ran into was that there was no easy way to identify the blocks that
> were modified since the last backup was taken. To the best of my
> knowledge, rsync uses checksums of entire files to determine whether
> a file was modified or not - in our use case the whole virtual disk
> is one single file, so rsync will not work well.
>
> One of the 'features' of Livebackup is that I chose to invent my own
> rudimentary protocol for transferring the blocks out of the VM
> host. iSCSI seems like a standard protocol that would suit the
> purpose; however it does not fit well with the architecture of
> Livebackup - I would probably need to implement an iSCSI target
> function inside Livebackup.
>
> Thanks,
> Jagane
>
> On 5/15/2011 7:11 AM, Stefan Hajnoczi wrote:
>>
>> On Sun, May 15, 2011 at 1:16 PM, Fred van Zwieten
>>  wrote:
>>>
>>> Background:
>>> NetApp Snapshot functionality gives you application consistent
>>> snapshots of data. Just inform the app a snapshot is about to be made;
>>> depending on the app, it needs to go into some sort of backup mode,
>>> or just stop and flush outstanding I/O. Next, a snapshot is made and
>>> everything just runs on. Because of the underlying WAFL filesystem
>>> design, the snapshot always points to the blocks at the time of the
>>> creation without needing to do any COW.
>>>
>>> Layered on top of this functionality is SnapMirror, where the delta
>>> between this snapshot and a previous snapshot (both being static in
>>> nature), is asynchronously replicated to a second system. There, this
>>> delta is applied to a copy of the disk as a local snapshot.
>>>
>>> This setup gives you application consistent data disks on a remote
>>> location as a form of disaster tolerancy. The RPO is the
>>> snapshot/snapmirror frequency.
>>>
>>> KVM:
>>> My question is rather simple. Could something like this be implemented
>>> with kvm-img and rsync and/or lvm? I've played with the backing_file
>>> option, but this means I have to shut down the vm and boot it from the new
>>> file to make this work.
>>
>> This recent thread might interest you:
>> http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg00733.html
>>
>> Basically you have to cobble it together yourself at the moment but
>> there is activity around live snapshots and dirty block tracking, so
>> hopefully they will be available as APIs in the future.
>>
>> Stefan
>
>


Re: snapmirror functionality using qcow2 and rsync?

2011-05-15 Thread Jagane Sundar

Hello Fred,

As Stefan points out, there is a lot of interest in such functionality.

I am the author of a feature called Livebackup - which is aimed at
solving exactly this problem - backup of live virtual machines. The
additional constraint I place is that backup of the VM should be
enabled regardless of the type of storage in use - file on local disk,
file on NAS, LVM logical volume on local disk, LVM volume on an
iSCSI volume.

More information is at:
http://wiki.qemu.org/Features/Livebackup

I investigated the approach that you outline below - LVM snapshot
followed by transfer of the blocks out of the VM host. The problem
I ran into was that there was no easy way to identify the blocks that
were modified since the last backup was taken. To the best of my
knowledge, rsync uses checksums of entire files to determine whether
a file was modified or not - in our use case the whole virtual disk
is one single file, so rsync will not work well.

One of the 'features' of Livebackup is that I chose to invent my own
rudimentary protocol for transferring the blocks out of the VM
host. iSCSI seems like a standard protocol that would suit the
purpose; however it does not fit well with the architecture of
Livebackup - I would probably need to implement an iSCSI target
function inside Livebackup.

Thanks,
Jagane

On 5/15/2011 7:11 AM, Stefan Hajnoczi wrote:

On Sun, May 15, 2011 at 1:16 PM, Fred van Zwieten  wrote:

Background:
NetApp Snapshot functionality gives you application consistent
snapshots of data. Just inform the app a snapshot is about to be made;
depending on the app, it needs to go into some sort of backup mode,
or just stop and flush outstanding I/O. Next, a snapshot is made and
everything just runs on. Because of the underlying WAFL filesystem
design, the snapshot always points to the blocks at the time of the
creation without needing to do any COW.

Layered on top of this functionality is SnapMirror, where the delta
between this snapshot and a previous snapshot (both being static in
nature), is asynchronously replicated to a second system. There, this
delta is applied to a copy of the disk as a local snapshot.

This setup gives you application consistent data disks on a remote
location as a form of disaster tolerancy. The RPO is the
snapshot/snapmirror frequency.

KVM:
My question is rather simple. Could something like this be implemented
with kvm-img and rsync and/or lvm? I've played with the backing_file
option, but this means I have to shut down the vm and boot it from the new
file to make this work.

This recent thread might interest you:
http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg00733.html

Basically you have to cobble it together yourself at the moment but
there is activity around live snapshots and dirty block tracking, so
hopefully they will be available as APIs in the future.

Stefan




[PATCH v2 7/7] KVM: MMU: cleanup for dropping parent pte

2011-05-15 Thread Xiao Guangrong
Introduce drop_parent_pte to remove the parent pte's rmap entry and
clear the parent pte

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c |   21 -
 1 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3fab3c2..2d14434 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1084,6 +1084,13 @@ static void mmu_page_remove_parent_pte(struct 
kvm_mmu_page *sp,
pte_list_remove(parent_pte, &sp->parent_ptes);
 }
 
+static void drop_parent_pte(struct kvm_mmu_page *sp,
+   u64 *parent_pte)
+{
+   mmu_page_remove_parent_pte(sp, parent_pte);
+   __set_spte(parent_pte, shadow_trap_nonpresent_pte);
+}
+
 static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
   u64 *parent_pte, int direct)
 {
@@ -1560,8 +1567,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, 
u64 *sptep,
if (child->role.access == direct_access)
return;
 
-   mmu_page_remove_parent_pte(child, sptep);
-   __set_spte(sptep, shadow_trap_nonpresent_pte);
+   drop_parent_pte(child, sptep);
kvm_flush_remote_tlbs(vcpu->kvm);
}
 }
@@ -1578,7 +1584,7 @@ static void mmu_page_zap_pte(struct kvm *kvm, struct 
kvm_mmu_page *sp,
drop_spte(kvm, spte, shadow_trap_nonpresent_pte);
else {
child = page_header(pte & PT64_BASE_ADDR_MASK);
-   mmu_page_remove_parent_pte(child, spte);
+   drop_parent_pte(child, spte);
}
}
__set_spte(spte, shadow_trap_nonpresent_pte);
@@ -1613,10 +1619,8 @@ static void kvm_mmu_unlink_parents(struct kvm *kvm, 
struct kvm_mmu_page *sp)
 {
u64 *parent_pte;
 
-   while ((parent_pte = pte_list_next(&sp->parent_ptes, NULL))) {
-   kvm_mmu_put_page(sp, parent_pte);
-   __set_spte(parent_pte, shadow_trap_nonpresent_pte);
-   }
+   while ((parent_pte = pte_list_next(&sp->parent_ptes, NULL)))
+   drop_parent_pte(sp, parent_pte);
 }
 
 static int mmu_zap_unsync_children(struct kvm *kvm,
@@ -2046,8 +2050,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
u64 pte = *sptep;
 
child = page_header(pte & PT64_BASE_ADDR_MASK);
-   mmu_page_remove_parent_pte(child, sptep);
-   __set_spte(sptep, shadow_trap_nonpresent_pte);
+   drop_parent_pte(child, sptep);
kvm_flush_remote_tlbs(vcpu->kvm);
} else if (pfn != spte_to_pfn(*sptep)) {
pgprintk("hfn old %llx new %llx\n",
-- 
1.7.4.4



[PATCH v2 6/7] KVM: MMU: cleanup for kvm_mmu_page_unlink_children

2011-05-15 Thread Xiao Guangrong
Clean up by factoring out the operation shared by kvm_mmu_page_unlink_children
and mmu_pte_write_zap_pte

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c |   66 ++-
 1 files changed, 23 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c0b16dd..3fab3c2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1566,32 +1566,33 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, 
u64 *sptep,
}
 }
 
+static void mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
+u64 *spte)
+{
+   u64 pte;
+   struct kvm_mmu_page *child;
+
+   pte = *spte;
+   if (is_shadow_present_pte(pte)) {
+   if (is_last_spte(pte, sp->role.level))
+   drop_spte(kvm, spte, shadow_trap_nonpresent_pte);
+   else {
+   child = page_header(pte & PT64_BASE_ADDR_MASK);
+   mmu_page_remove_parent_pte(child, spte);
+   }
+   }
+   __set_spte(spte, shadow_trap_nonpresent_pte);
+   if (is_large_pte(pte))
+   --kvm->stat.lpages;
+}
+
 static void kvm_mmu_page_unlink_children(struct kvm *kvm,
 struct kvm_mmu_page *sp)
 {
unsigned i;
-   u64 *pt;
-   u64 ent;
-
-   pt = sp->spt;
 
-   for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
-   ent = pt[i];
-
-   if (is_shadow_present_pte(ent)) {
-   if (!is_last_spte(ent, sp->role.level)) {
-   ent &= PT64_BASE_ADDR_MASK;
-   mmu_page_remove_parent_pte(page_header(ent),
-  &pt[i]);
-   } else {
-   if (is_large_pte(ent))
-   --kvm->stat.lpages;
-   drop_spte(kvm, &pt[i],
- shadow_trap_nonpresent_pte);
-   }
-   }
-   pt[i] = shadow_trap_nonpresent_pte;
-   }
+   for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
+   mmu_page_zap_pte(kvm, sp, sp->spt + i);
 }
 
 static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
@@ -3069,27 +3070,6 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_unload);
 
-static void mmu_pte_write_zap_pte(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp,
- u64 *spte)
-{
-   u64 pte;
-   struct kvm_mmu_page *child;
-
-   pte = *spte;
-   if (is_shadow_present_pte(pte)) {
-   if (is_last_spte(pte, sp->role.level))
-   drop_spte(vcpu->kvm, spte, shadow_trap_nonpresent_pte);
-   else {
-   child = page_header(pte & PT64_BASE_ADDR_MASK);
-   mmu_page_remove_parent_pte(child, spte);
-   }
-   }
-   __set_spte(spte, shadow_trap_nonpresent_pte);
-   if (is_large_pte(pte))
-   --vcpu->kvm->stat.lpages;
-}
-
 static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu,
  struct kvm_mmu_page *sp, u64 *spte,
  const void *new)
@@ -3271,7 +3251,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
spte = &sp->spt[page_offset / sizeof(*spte)];
while (npte--) {
entry = *spte;
-   mmu_pte_write_zap_pte(vcpu, sp, spte);
+   mmu_page_zap_pte(vcpu->kvm, sp, spte);
if (gentry &&
  !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
  & mask.word))
-- 
1.7.4.4



[PATCH v2 5/7] KVM: MMU: remove the arithmetic of parent pte rmap

2011-05-15 Thread Xiao Guangrong
Parent pte rmap and page rmap are very similar, so use the same code
for both

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |7 +--
 arch/x86/kvm/mmu.c  |  189 +-
 2 files changed, 46 insertions(+), 150 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0d824e4..e8a68f8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -227,14 +227,10 @@ struct kvm_mmu_page {
 * in this shadow page.
 */
DECLARE_BITMAP(slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
-   bool multimapped; /* More than one parent_pte? */
bool unsync;
int root_count;  /* Currently serving as active root */
unsigned int unsync_children;
-   union {
-   u64 *parent_pte;   /* !multimapped */
-   struct hlist_head parent_ptes; /* multimapped, kvm_pte_chain */
-   };
+   unsigned long parent_ptes;  /* Reverse mapping for parent_pte */
DECLARE_BITMAP(unsync_child_bitmap, 512);
 };
 
@@ -346,7 +342,6 @@ struct kvm_vcpu_arch {
 * put it here to avoid allocation */
struct kvm_pv_mmu_op_buffer mmu_op_buffer;
 
-   struct kvm_mmu_memory_cache mmu_pte_chain_cache;
struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5ba347b..c0b16dd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -182,9 +182,6 @@ struct kvm_shadow_walk_iterator {
 shadow_walk_okay(&(_walker));  \
 shadow_walk_next(&(_walker)))
 
-typedef void (*mmu_parent_walk_fn) (struct kvm_mmu_page *sp, u64 *spte);
-
-static struct kmem_cache *pte_chain_cache;
 static struct kmem_cache *pte_list_desc_cache;
 static struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
@@ -397,12 +394,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
 {
int r;
 
-   r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_chain_cache,
-  pte_chain_cache, 4);
-   if (r)
-   goto out;
r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
-  pte_list_desc_cache, 4 + PTE_PREFETCH_NUM);
+  pte_list_desc_cache, 8 + PTE_PREFETCH_NUM);
if (r)
goto out;
r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
@@ -416,8 +409,6 @@ out:
 
 static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
-   mmu_free_memory_cache(&vcpu->arch.mmu_pte_chain_cache,
-   pte_chain_cache);
mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
pte_list_desc_cache);
mmu_free_memory_cache_page(&vcpu->arch.mmu_page_cache);
@@ -435,17 +426,6 @@ static void *mmu_memory_cache_alloc(struct 
kvm_mmu_memory_cache *mc,
return p;
 }
 
-static struct kvm_pte_chain *mmu_alloc_pte_chain(struct kvm_vcpu *vcpu)
-{
-   return mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_chain_cache,
- sizeof(struct kvm_pte_chain));
-}
-
-static void mmu_free_pte_chain(struct kvm_pte_chain *pc)
-{
-   kmem_cache_free(pte_chain_cache, pc);
-}
-
 static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
 {
return mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache,
@@ -721,6 +701,26 @@ static void pte_list_remove(u64 *spte, unsigned long 
*pte_list)
}
 }
 
+typedef void (*pte_list_walk_fn) (u64 *spte);
+static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
+{
+   struct pte_list_desc *desc;
+   int i;
+
+   if (!*pte_list)
+   return;
+
+   if (!(*pte_list & 1))
+   return fn((u64 *)*pte_list);
+
+   desc = (struct pte_list_desc *)(*pte_list & ~1ul);
+   while (desc) {
+   for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
+   fn(desc->sptes[i]);
+   desc = desc->more;
+   }
+}
+
 /*
  * Take gfn and return the reverse mapping to it.
  */
@@ -1069,134 +1069,52 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn)
return gfn & ((1 << KVM_MMU_HASH_SHIFT) - 1);
 }
 
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
-  u64 *parent_pte, int direct)
-{
-   struct kvm_mmu_page *sp;
-
-   sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache, sizeof 
*sp);
-   sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache, PAGE_SIZE);
-   if (!direct)
-   sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache,
-   

[PATCH v2 4/7] KVM: MMU: abstract the operation of rmap

2011-05-15 Thread Xiao Guangrong
Abstract the rmap operations into generic pte_list operations, so that a
later patch can reuse them for the reverse mapping of the parent pte

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |2 +-
 arch/x86/kvm/mmu.c  |  260 +--
 2 files changed, 140 insertions(+), 122 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 152601a..0d824e4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -347,7 +347,7 @@ struct kvm_vcpu_arch {
struct kvm_pv_mmu_op_buffer mmu_op_buffer;
 
struct kvm_mmu_memory_cache mmu_pte_chain_cache;
-   struct kvm_mmu_memory_cache mmu_rmap_desc_cache;
+   struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
struct kvm_mmu_memory_cache mmu_page_cache;
struct kvm_mmu_memory_cache mmu_page_header_cache;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ad520d4..5ba347b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -148,7 +148,7 @@ module_param(oos_shadow, bool, 0644);
 #define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | PT_USER_MASK \
| PT64_NX_MASK)
 
-#define RMAP_EXT 4
+#define PTE_LIST_EXT 4
 
 #define ACC_EXEC_MASK1
 #define ACC_WRITE_MASK   PT_WRITABLE_MASK
@@ -164,9 +164,9 @@ module_param(oos_shadow, bool, 0644);
 
 #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
 
-struct kvm_rmap_desc {
-   u64 *sptes[RMAP_EXT];
-   struct kvm_rmap_desc *more;
+struct pte_list_desc {
+   u64 *sptes[PTE_LIST_EXT];
+   struct pte_list_desc *more;
 };
 
 struct kvm_shadow_walk_iterator {
@@ -185,7 +185,7 @@ struct kvm_shadow_walk_iterator {
 typedef void (*mmu_parent_walk_fn) (struct kvm_mmu_page *sp, u64 *spte);
 
 static struct kmem_cache *pte_chain_cache;
-static struct kmem_cache *rmap_desc_cache;
+static struct kmem_cache *pte_list_desc_cache;
 static struct kmem_cache *mmu_page_header_cache;
 static struct percpu_counter kvm_total_used_mmu_pages;
 
@@ -401,8 +401,8 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
   pte_chain_cache, 4);
if (r)
goto out;
-   r = mmu_topup_memory_cache(&vcpu->arch.mmu_rmap_desc_cache,
-  rmap_desc_cache, 4 + PTE_PREFETCH_NUM);
+   r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
+  pte_list_desc_cache, 4 + PTE_PREFETCH_NUM);
if (r)
goto out;
r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
@@ -416,8 +416,10 @@ out:
 
 static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
-   mmu_free_memory_cache(&vcpu->arch.mmu_pte_chain_cache, pte_chain_cache);
-   mmu_free_memory_cache(&vcpu->arch.mmu_rmap_desc_cache, rmap_desc_cache);
+   mmu_free_memory_cache(&vcpu->arch.mmu_pte_chain_cache,
+   pte_chain_cache);
+   mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
+   pte_list_desc_cache);
mmu_free_memory_cache_page(&vcpu->arch.mmu_page_cache);
mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache,
mmu_page_header_cache);
@@ -444,15 +446,15 @@ static void mmu_free_pte_chain(struct kvm_pte_chain *pc)
kmem_cache_free(pte_chain_cache, pc);
 }
 
-static struct kvm_rmap_desc *mmu_alloc_rmap_desc(struct kvm_vcpu *vcpu)
+static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_vcpu *vcpu)
 {
-   return mmu_memory_cache_alloc(&vcpu->arch.mmu_rmap_desc_cache,
- sizeof(struct kvm_rmap_desc));
+   return mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_list_desc_cache,
+ sizeof(struct pte_list_desc));
 }
 
-static void mmu_free_rmap_desc(struct kvm_rmap_desc *rd)
+static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
 {
-   kmem_cache_free(rmap_desc_cache, rd);
+   kmem_cache_free(pte_list_desc_cache, pte_list_desc);
 }
 
 static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
@@ -590,67 +592,42 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn)
 }
 
 /*
- * Take gfn and return the reverse mapping to it.
- */
-
-static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
-{
-   struct kvm_memory_slot *slot;
-   struct kvm_lpage_info *linfo;
-
-   slot = gfn_to_memslot(kvm, gfn);
-   if (likely(level == PT_PAGE_TABLE_LEVEL))
-   return &slot->rmap[gfn - slot->base_gfn];
-
-   linfo = lpage_info_slot(gfn, slot, level);
-
-   return &linfo->rmap_pde;
-}
-
-/*
- * Reverse mapping data structures:
+ * Pte mapping structures:
  *
- * If rmapp bit zero is zero, then rmapp point to the shadw page table entry
- * that points to page_address(page).
+ * If pte_list bit zero is zero, then pte_list point to the 

[PATCH v2 3/7] KVM: fix uninitialized warning

2011-05-15 Thread Xiao Guangrong
Fix:

 warning: ‘cs_sel’ may be used uninitialized in this function
 warning: ‘ss_sel’ may be used uninitialized in this function

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/emulate.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 291c872..ea32340 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2015,7 +2015,7 @@ emulate_sysexit(struct x86_emulate_ctxt *ctxt, struct 
x86_emulate_ops *ops)
struct desc_struct cs, ss;
u64 msr_data;
int usermode;
-   u16 cs_sel, ss_sel;
+   u16 cs_sel = 0, ss_sel = 0;
 
/* inject #GP if in real mode or Virtual 8086 mode */
if (ctxt->mode == X86EMUL_MODE_REAL ||
-- 
1.7.4.4



[PATCH v2 2/7] KVM: use __copy_to_user/__clear_user to write guest page

2011-05-15 Thread Xiao Guangrong
Simply use __copy_to_user/__clear_user to write the guest page, since we
have already verified the user address when the memslot was set

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/x86.c  |4 ++--
 virt/kvm/kvm_main.c |4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 77c9d86..a419bd1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1388,7 +1388,7 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 
msr, u64 data)
return 1;
kvm_x86_ops->patch_hypercall(vcpu, instructions);
((unsigned char *)instructions)[3] = 0xc3; /* ret */
-   if (copy_to_user((void __user *)addr, instructions, 4))
+   if (__copy_to_user((void __user *)addr, instructions, 4))
return 1;
kvm->arch.hv_hypercall = data;
break;
@@ -1415,7 +1415,7 @@ static int set_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
  HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT);
if (kvm_is_error_hva(addr))
return 1;
-   if (clear_user((void __user *)addr, PAGE_SIZE))
+   if (__clear_user((void __user *)addr, PAGE_SIZE))
return 1;
vcpu->arch.hv_vapic = data;
break;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 22cdb96..3962899 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1342,7 +1342,7 @@ int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, 
const void *data,
addr = gfn_to_hva(kvm, gfn);
if (kvm_is_error_hva(addr))
return -EFAULT;
-   r = copy_to_user((void __user *)addr + offset, data, len);
+   r = __copy_to_user((void __user *)addr + offset, data, len);
if (r)
return -EFAULT;
mark_page_dirty(kvm, gfn);
@@ -1402,7 +1402,7 @@ int kvm_write_guest_cached(struct kvm *kvm, struct 
gfn_to_hva_cache *ghc,
if (kvm_is_error_hva(ghc->hva))
return -EFAULT;
 
-   r = copy_to_user((void __user *)ghc->hva, data, len);
+   r = __copy_to_user((void __user *)ghc->hva, data, len);
if (r)
return -EFAULT;
mark_page_dirty_in_slot(kvm, ghc->memslot, ghc->gpa >> PAGE_SHIFT);
-- 
1.7.4.4



[PATCH v2 1/7] KVM: MMU: optimize pte write path if don't have protected sp

2011-05-15 Thread Xiao Guangrong
Simply return from the kvm_mmu_pte_write path if no shadow page is
write-protected; this way we avoid walking all shadow pages and taking
the mmu lock

Signed-off-by: Xiao Guangrong 
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.c  |9 +
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d2ac8e2..152601a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -441,6 +441,7 @@ struct kvm_arch {
unsigned int n_used_mmu_pages;
unsigned int n_requested_mmu_pages;
unsigned int n_max_mmu_pages;
+   unsigned int indirect_shadow_pages;
atomic_t invlpg_counter;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2841805..ad520d4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -498,6 +498,7 @@ static void account_shadowed(struct kvm *kvm, gfn_t gfn)
linfo = lpage_info_slot(gfn, slot, i);
linfo->write_count += 1;
}
+   kvm->arch.indirect_shadow_pages++;
 }
 
 static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
@@ -513,6 +514,7 @@ static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
linfo->write_count -= 1;
WARN_ON(linfo->write_count < 0);
}
+   kvm->arch.indirect_shadow_pages--;
 }
 
 static int has_wrprotected_page(struct kvm *kvm,
@@ -3233,6 +3235,13 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
int level, npte, invlpg_counter, r, flooded = 0;
bool remote_flush, local_flush, zap_page;
 
+   /*
+* If we don't have indirect shadow pages, it means no page is
+* write-protected, so we can exit simply.
+*/
+   if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
+   return;
+
zap_page = remote_flush = local_flush = false;
offset = offset_in_page(gpa);
 
-- 
1.7.4.4


[PATCH 2/2] KVM: VMX: always_inline VMREADs

2011-05-15 Thread Avi Kivity
vmcs_readl() and friends are really short, but gcc thinks they are long because 
of
the out-of-line exception handlers.  Mark them always_inline to clear the
misunderstanding.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/vmx.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3df6b2c..750b0ff 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -587,7 +587,7 @@ static inline void ept_sync_individual_addr(u64 eptp, gpa_t 
gpa)
}
 }
 
-static unsigned long vmcs_readl(unsigned long field)
+static __always_inline unsigned long vmcs_readl(unsigned long field)
 {
unsigned long value;
 
@@ -596,17 +596,17 @@ static unsigned long vmcs_readl(unsigned long field)
return value;
 }
 
-static u16 vmcs_read16(unsigned long field)
+static __always_inline u16 vmcs_read16(unsigned long field)
 {
return vmcs_readl(field);
 }
 
-static u32 vmcs_read32(unsigned long field)
+static __always_inline u32 vmcs_read32(unsigned long field)
 {
return vmcs_readl(field);
 }
 
-static u64 vmcs_read64(unsigned long field)
+static __always_inline u64 vmcs_read64(unsigned long field)
 {
 #ifdef CONFIG_X86_64
return vmcs_readl(field);
-- 
1.7.4.3



[PATCH 0/2] Inline vmcs_readl()

2011-05-15 Thread Avi Kivity
The following two patches optimize vmcs_readl() to a single VMREAD instruction,
instead of a call to a 7-instruction function.

Avi Kivity (2):
  KVM: VMX: Move VMREAD cleanup to exception handler
  KVM: VMX: always_inline VMREADs

 arch/x86/include/asm/kvm_host.h |6 +-
 arch/x86/kvm/vmx.c  |   16 +---
 2 files changed, 14 insertions(+), 8 deletions(-)

-- 
1.7.4.3



[PATCH 1/2] KVM: VMX: Move VMREAD cleanup to exception handler

2011-05-15 Thread Avi Kivity
We clean up a failed VMREAD by clearing the output register.  Do
it in the exception handler instead of unconditionally.  This is
worthwhile since there are more than a hundred call sites.

Signed-off-by: Avi Kivity 
---
 arch/x86/include/asm/kvm_host.h |6 +-
 arch/x86/kvm/vmx.c  |8 +---
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d2ac8e2..db4b654 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -830,11 +830,12 @@ enum {
 asmlinkage void kvm_spurious_fault(void);
 extern bool kvm_rebooting;
 
-#define __kvm_handle_fault_on_reboot(insn) \
+#define kvm_handle_fault_on_reboot(insn, cleanup_insn) \
"666: " insn "\n\t" \
"668: \n\t"   \
".pushsection .fixup, \"ax\" \n" \
"667: \n\t" \
+   cleanup_insn "\n\t"   \
"cmpb $0, kvm_rebooting \n\t" \
"jne 668b \n\t"   \
__ASM_SIZE(push) " $666b \n\t"\
@@ -844,6 +845,9 @@ extern bool kvm_rebooting;
_ASM_PTR " 666b, 667b \n\t" \
".popsection"
 
+#define __kvm_handle_fault_on_reboot(insn) \
+   kvm_handle_fault_on_reboot(insn, "")
+
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4c3fa0f..3df6b2c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -43,6 +43,8 @@
 #include "trace.h"
 
 #define __ex(x) __kvm_handle_fault_on_reboot(x)
+#define __ex_clear(x, reg) \
+   kvm_handle_fault_on_reboot(x, "xor " reg " , " reg)
 
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
@@ -587,10 +589,10 @@ static inline void ept_sync_individual_addr(u64 eptp, 
gpa_t gpa)
 
 static unsigned long vmcs_readl(unsigned long field)
 {
-   unsigned long value = 0;
+   unsigned long value;
 
-   asm volatile (__ex(ASM_VMX_VMREAD_RDX_RAX)
- : "+a"(value) : "d"(field) : "cc");
+   asm volatile (__ex_clear(ASM_VMX_VMREAD_RDX_RAX, "%0")
+ : "=a"(value) : "d"(field) : "cc");
return value;
 }
 
-- 
1.7.4.3



Re: snapmirror functionality using qcow2 and rsync?

2011-05-15 Thread Stefan Hajnoczi
On Sun, May 15, 2011 at 1:16 PM, Fred van Zwieten  wrote:
> Background:
> NetApp Snapshot functionality gives you application consistent
> snapshots of data. Just inform the app a snapshot is about to be made;
> depending on the app, it needs to go into some sort of backup mode,
> or just stop and flush outstanding I/O. Next, a snapshot is made and
> everything just runs on. Because of the underlying WAFL filesystem
> design, the snapshot always points to the blocks at the time of the
> creation without needing to do any COW.
>
> Layered on top of this functionality is SnapMirror, where the delta
> between this snapshot and a previous snapshot (both being static in
> nature), is asynchronously replicated to a second system. There, this
> delta is applied to a copy of the disk as a local snapshot.
>
> This setup gives you application consistent data disks on a remote
> location as a form of disaster tolerancy. The RPO is the
> snapshot/snapmirror frequency.
>
> KVM:
> My question is rather simple. Could something like this be implemented
> with kvm-img and rsync and/or lvm? I've played with the backing_file
> option, but this means I have to shut down the vm and boot it from the new
> file to make this work.

This recent thread might interest you:
http://lists.gnu.org/archive/html/qemu-devel/2011-05/msg00733.html

Basically you have to cobble it together yourself at the moment but
there is activity around live snapshots and dirty block tracking, so
hopefully they will be available as APIs in the future.

Stefan


Re: [PATCH 09/18] virtio: use avail_event index

2011-05-15 Thread Michael S. Tsirkin
On Mon, May 09, 2011 at 02:03:26PM +0930, Rusty Russell wrote:
> On Wed, 4 May 2011 23:51:47 +0300, "Michael S. Tsirkin"  
> wrote:
> > Use the new avail_event feature to reduce the number
> > of exits from the guest.
> 
> Figures here would be nice :)

You mean ASCII art in comments?

> > @@ -228,6 +237,12 @@ add_head:
> >  * new available array entries. */
> > virtio_wmb();
> > vq->vring.avail->idx++;
> > +   /* If the driver never bothers to kick in a very long while,
> > +* avail index might wrap around. If that happens, invalidate
> > +* kicked_avail index we stored. TODO: make sure all drivers
> > +* kick at least once in 2^16 and remove this. */
> > +   if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
> > +   vq->kicked_avail_valid = false;
> 
> If they don't, they're already buggy.  Simply do:
> WARN_ON(vq->vring.avail->idx == vq->kicked_avail);

Hmm, but does it say that somewhere?
It seems that most drivers use locking to prevent
posting more buffers while they drain the used ring,
and also kick at least once every vq->num buffers,
which I guess would work out fine in the end,
as vq->num is never as large as 2^15.

> > +static bool vring_notify(struct vring_virtqueue *vq)
> > +{
> > +   u16 old, new;
> > +   bool v;
> > +   if (!vq->event)
> > +   return !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY);
> > +
> > +   v = vq->kicked_avail_valid;
> > +   old = vq->kicked_avail;
> > +   new = vq->kicked_avail = vq->vring.avail->idx;
> > +   vq->kicked_avail_valid = true;
> > +   if (unlikely(!v))
> > +   return true;
> 
> This is the only place you actually used kicked_avail_valid.  Is it
> possible to initialize it in such a way that you can remove this?

If we kill the code above as you suggested, likely yes.
Maybe an explicit flag is more obvious?

> > @@ -482,6 +517,8 @@ void vring_transport_features(struct virtio_device 
> > *vdev)
> > break;
> > case VIRTIO_RING_F_USED_EVENT_IDX:
> > break;
> > +   case VIRTIO_RING_F_AVAIL_EVENT_IDX:
> > +   break;
> > default:
> > /* We don't understand this bit. */
> > clear_bit(i, vdev->features);
> 
> Does this belong in a prior patch?
> 
> Thanks,
> Rusty.

Well if we don't support the feature in the ring we should not
ack the feature, right?


Re: [PATCH 14/18] virtio: add api for delayed callbacks

2011-05-15 Thread Michael S. Tsirkin
On Mon, May 09, 2011 at 03:27:33PM +0930, Rusty Russell wrote:
> On Wed, 4 May 2011 23:52:33 +0300, "Michael S. Tsirkin"  
> wrote:
> > Add an API that tells the other side that callbacks
> > should be delayed until a lot of work has been done.
> > Implement using the new used_event feature.
> 
> Since you're going to add a capacity query anyway, why not add the
> threshold argument here?

I thought that if we keep the API kind of generic
there might be more of a chance that future transports
will be able to implement it. For example, with an
old host we can't commit to a specific index.

> 
> Then the caller can choose how much space it needs.  Maybe net and block
> will want different things...
> 
> Cheers,
> Rusty.

Hmm, maybe. OK.

-- 
MST


Re: [PATCH 08/18] virtio_ring: support for used_event idx feature

2011-05-15 Thread Michael S. Tsirkin
On Mon, May 09, 2011 at 01:47:32PM +0930, Rusty Russell wrote:
> On Wed, 4 May 2011 23:51:38 +0300, "Michael S. Tsirkin"  
> wrote:
> > Add support for the used_event idx feature: when enabling
> > interrupts, publish the current avail index value to
> > the host so that we get interrupts on the next update.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >  drivers/virtio/virtio_ring.c |   14 ++
> >  1 files changed, 14 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 507d6eb..3a3ed75 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -320,6 +320,14 @@ void *virtqueue_get_buf(struct virtqueue *_vq, 
> > unsigned int *len)
> > ret = vq->data[i];
> > detach_buf(vq, i);
> > vq->last_used_idx++;
> > +   /* If we expect an interrupt for the next entry, tell host
> > +* by writing event index and flush out the write before
> > +* the read in the next get_buf call. */
> > +   if (!(vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)) {
> > +   vring_used_event(&vq->vring) = vq->last_used_idx;
> > +   virtio_mb();
> > +   }
> > +
> 
> Hmm, so you're still using the avail->flags; it's just if thresholding
> is enabled the host will ignore it?
> 
> It's a little subtle, but it keeps this patch small.

Right, that's exactly why I do it this way.

> Perhaps we'll want to make it more explicit later.
> 
> Thanks,
> Rusty.

Yes, e.g. it might be better to avoid touching that cache line,
and track the current status in a private field in the guest.
But I was unable to measure any effect from doing it either way.
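
A sketch of that alternative, with hypothetical names (not code from this
series): keep a guest-private shadow of the interrupt-enable state and test
that instead of re-reading the shared ring:

	/* In struct vring_virtqueue: private shadow of whether callbacks
	 * are currently wanted, updated in enable_cb()/disable_cb(). */
	bool cb_enabled;

	/* In virtqueue_get_buf(), instead of reading avail->flags: */
	if (vq->cb_enabled) {
		vring_used_event(&vq->vring) = vq->last_used_idx;
		virtio_mb();
	}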

-- 
MST


Re: [PATCH 06/18] virtio_ring: avail event index interface

2011-05-15 Thread Michael S. Tsirkin
On Mon, May 09, 2011 at 01:43:15PM +0930, Rusty Russell wrote:
> On Wed, 4 May 2011 23:51:19 +0300, "Michael S. Tsirkin"  
> wrote:
> > Define a new feature bit for the host to
> > declare that it uses an avail_event index
> > (like Xen) instead of a feature bit
> > to enable/disable interrupts.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >  include/linux/virtio_ring.h |   11 ---
> >  1 files changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
> > index f5c1b75..f791772 100644
> > --- a/include/linux/virtio_ring.h
> > +++ b/include/linux/virtio_ring.h
> > @@ -32,6 +32,9 @@
> >  /* The Guest publishes the used index for which it expects an interrupt
> >   * at the end of the avail ring. Host should ignore the avail->flags 
> > field. */
> >  #define VIRTIO_RING_F_USED_EVENT_IDX   29
> > +/* The Host publishes the avail index for which it expects a kick
> > + * at the end of the used ring. Guest should ignore the used->flags field. 
> > */
> > +#define VIRTIO_RING_F_AVAIL_EVENT_IDX  32
> 
> Are you really sure we want to separate the two?  Seems a little simpler
> to have one bit to mean "we're publishing our threshold".  For someone
> implementing this from scratch, it's a little simpler.
> 
> Or are there cases where the old style makes more sense?
> 
> Thanks,
> Rusty.

Hmm, it makes debugging easier, as each side can disable
publishing separately - I used this all the time when I saw
e.g. networking get stuck, to work out whether I needed to
investigate the interrupt path or the exit handling.

But I'm not hung up on this.

Let me know pls.

-- 
MST


Re: snapmirror functionality using qcow2 and rsync?

2011-05-15 Thread Fred van Zwieten
Hello,

The website said user questions are welcome and for one-off questions
I could just mail, so here goes:

Background:
NetApp Snapshot functionality gives you application-consistent
snapshots of data. You just inform the app that a snapshot is about to
be made; depending on the app, it needs to go into some sort of backup
mode, or just stop and flush outstanding I/O. Next, a snapshot is made
and everything keeps running. Because of the underlying WAFL filesystem
design, the snapshot always points to the blocks as they were at
creation time, without needing to do any COW.

Layered on top of this functionality is SnapMirror, where the delta
between this snapshot and a previous snapshot (both being static in
nature) is asynchronously replicated to a second system. There, the
delta is applied to a copy of the disk as a local snapshot.

This setup gives you application-consistent data disks at a remote
location as a form of disaster tolerance. The RPO is the
snapshot/SnapMirror frequency.

KVM:
My question is rather simple: could something like this be implemented
with kvm-img and rsync and/or LVM? I've played with the backing_file
option, but this means I have to shut down the VM and boot it from the
new file to make this work.

Greetz,

Fred


[PATCH V2] kvm tools: Return correct values from disk IOV functions

2011-05-15 Thread Sasha Levin
Currently read/write IOV functions return an incorrect
value instead of the amount of bytes read/written.

This incorrect value may cause errors within the virtio layer.

Return correct amount of bytes read/written from _iov functions.
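
(As a worked example, assuming SECTOR_SHIFT is 9: a read that starts at
sector 100 and fills two 512-byte iovecs used to return 102 << 9 = 52224,
the absolute end offset in bytes, rather than the (102 - 100) << 9 = 1024
bytes actually transferred.)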

Changes in V2:
 - Un-inline the functions.
 - Add some comments regarding the expected return value.

Signed-off-by: Sasha Levin 
---
 tools/kvm/disk-image.c |   38 
 tools/kvm/include/kvm/disk-image.h |   33 +-
 2 files changed, 40 insertions(+), 31 deletions(-)

diff --git a/tools/kvm/disk-image.c b/tools/kvm/disk-image.c
index 1a704b3..9a7a914 100644
--- a/tools/kvm/disk-image.c
+++ b/tools/kvm/disk-image.c
@@ -172,3 +172,41 @@ void disk_image__close(struct disk_image *disk)
 
free(disk);
 }
+
+/* Fill iov with disk data, starting from sector 'sector'. Return amount of 
bytes read. */
+inline ssize_t disk_image__read_sector_iov(struct disk_image *disk, u64 
sector, const struct iovec *iov, int iovcount)
+{
+   u64 first_sector = sector;
+
+   if (disk->ops->read_sector_iov)
+   return disk->ops->read_sector_iov(disk, sector, iov, iovcount);
+
+   while (iovcount--) {
+   if (disk->ops->read_sector(disk, sector, iov->iov_base, 
iov->iov_len) < 0)
+   return -1;
+
+   sector += iov->iov_len >> SECTOR_SHIFT;
+   iov++;
+   }
+
+   return (sector - first_sector) << SECTOR_SHIFT;
+}
+
+/* Write iov to disk, starting from sector 'sector'. Return amount of bytes 
written. */
+inline ssize_t disk_image__write_sector_iov(struct disk_image *disk, u64 
sector, const struct iovec *iov, int iovcount)
+{
+   u64 first_sector = sector;
+
+   if (disk->ops->write_sector_iov)
+   return disk->ops->write_sector_iov(disk, sector, iov, iovcount);
+
+   while (iovcount--) {
+   if (disk->ops->write_sector(disk, sector, iov->iov_base, 
iov->iov_len) < 0)
+   return -1;
+
+   sector += iov->iov_len >> SECTOR_SHIFT;
+   iov++;
+   }
+
+   return (sector - first_sector) << SECTOR_SHIFT;
+}
\ No newline at end of file
diff --git a/tools/kvm/include/kvm/disk-image.h 
b/tools/kvm/include/kvm/disk-image.h
index 9d7b572..aa5b12b 100644
--- a/tools/kvm/include/kvm/disk-image.h
+++ b/tools/kvm/include/kvm/disk-image.h
@@ -40,36 +40,7 @@ static inline int disk_image__write_sector(struct disk_image 
*disk, u64 sector,
return disk->ops->write_sector(disk, sector, src, src_len);
 }
 
-static inline ssize_t disk_image__read_sector_iov(struct disk_image *disk, u64 
sector, const struct iovec *iov, int iovcount)
-{
-   if (disk->ops->read_sector_iov)
-   return disk->ops->read_sector_iov(disk, sector, iov, iovcount);
-
-   while (iovcount--) {
-   if (disk->ops->read_sector(disk, sector, iov->iov_base, 
iov->iov_len) < 0)
-   return -1;
-
-   sector += iov->iov_len >> SECTOR_SHIFT;
-   iov++;
-   }
-
-   return sector << SECTOR_SHIFT;
-}
-
-static inline ssize_t disk_image__write_sector_iov(struct disk_image *disk, 
u64 sector, const struct iovec *iov, int iovcount)
-{
-   if (disk->ops->write_sector_iov)
-   return disk->ops->write_sector_iov(disk, sector, iov, iovcount);
-
-   while (iovcount--) {
-   if (disk->ops->write_sector(disk, sector, iov->iov_base, 
iov->iov_len) < 0)
-   return -1;
-
-   sector += iov->iov_len >> SECTOR_SHIFT;
-   iov++;
-   }
-
-   return sector << SECTOR_SHIFT;
-}
+inline ssize_t disk_image__read_sector_iov(struct disk_image *disk, u64 
sector, const struct iovec *iov, int iovcount);
+inline ssize_t disk_image__write_sector_iov(struct disk_image *disk, u64 
sector, const struct iovec *iov, int iovcount);
 
 #endif /* KVM__DISK_IMAGE_H */
-- 
1.7.5.rc3



Re: [PATCH 5/7] KVM: MMU: remove the arithmetic of parent pte rmap

2011-05-15 Thread Xiao Guangrong
On 05/15/2011 04:23 PM, Avi Kivity wrote:
> On 05/15/2011 12:44 AM, Xiao Guangrong wrote:
>> Parent pte rmap and page rmap are very similar, so use the same arithmetic 
>> for
>> them
>>
> 
> While they're similar, this is going to cause a lot of confusion because the 
> name 'rmap' already has a deep meaning, at least for me.
> 
> Maybe we can call the abstract structure 'spte_list' and define operations on 
> that, then implement rmap_* and *_parent_pte as wrappers around spte_list.
> 

OK, will fix the confusing name in the next version, thanks!
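
For illustration, the kind of wrapper being suggested might look like this
(names and fields are hypothetical, not from this series):

/* Generic singly-linked list of sptes; rmap_add()/rmap_remove() and
 * mmu_page_add_parent_pte()/mmu_page_remove_parent_pte() become thin
 * wrappers around it. */
static int spte_list_add(struct kvm_vcpu *vcpu, u64 *spte,
                         unsigned long *spte_list);
static void spte_list_remove(u64 *spte, unsigned long *spte_list);

static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
                                    struct kvm_mmu_page *sp, u64 *parent_pte)
{
        if (parent_pte)
                spte_list_add(vcpu, parent_pte, &sp->parent_ptes);
}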


Re: [PATCH 1/7] KVM: MMU: optimize pte write path if don't have protected sp

2011-05-15 Thread Avi Kivity

On 05/15/2011 11:33 AM, Xiao Guangrong wrote:
> >
> > These atomic ops are always called from within the spinlock, so we don't
> > need an atomic_t here.
> >
> > Sorry, I should have noticed this on the first version.
>
> We read indirect_shadow_pages atomically on pte write path, that is
> allowed out of mmu_lock

Reading is fine:

  #define atomic_read(v)  (*(volatile int *)&(v)->counter)

--
error compiling committee.c: too many arguments to function



Re: [PATCH 1/7] KVM: MMU: optimize pte write path if don't have protected sp

2011-05-15 Thread Xiao Guangrong
On 05/15/2011 04:20 PM, Avi Kivity wrote:
> On 05/15/2011 12:35 AM, Xiao Guangrong wrote:
>> Simply return from kvm_mmu_pte_write path if no shadow page is
>> write-protected, then we can avoid to walk all shadow pages and hold
>> mmu-lock
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 2841805..971e2d2 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -498,6 +498,7 @@ static void account_shadowed(struct kvm *kvm, gfn_t gfn)
>>   linfo = lpage_info_slot(gfn, slot, i);
>>   linfo->write_count += 1;
>>   }
>> +atomic_inc(&kvm->arch.indirect_shadow_pages);
>>   }
>>
>>   static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
>> @@ -513,6 +514,7 @@ static void unaccount_shadowed(struct kvm *kvm, gfn_t 
>> gfn)
>>   linfo->write_count -= 1;
>>   WARN_ON(linfo->write_count<  0);
>>   }
>> +atomic_dec(&kvm->arch.indirect_shadow_pages);
>>   }
> 
> These atomic ops are always called from within the spinlock, so we don't need 
> an atomic_t here.
> 
> Sorry, I should have noticed this on the first version.

We read indirect_shadow_pages atomically on pte write path, that is allowed out 
of mmu_lock
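
(For readers without the full patch in front of them, the check on the pte
write path is presumably along these lines; a sketch, not the exact hunk:)

	/* Early in kvm_mmu_pte_write(), before taking mmu_lock: if there
	 * are no indirect (write-protected) shadow pages, there is nothing
	 * to do.  This is the lock-free read being discussed. */
	if (!atomic_read(&vcpu->kvm->arch.indirect_shadow_pages))
		return;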


Re: [PATCH 5/7] KVM: MMU: remove the arithmetic of parent pte rmap

2011-05-15 Thread Avi Kivity

On 05/15/2011 12:44 AM, Xiao Guangrong wrote:
> Parent pte rmap and page rmap are very similar, so use the same
> arithmetic for them

While they're similar, this is going to cause a lot of confusion because
the name 'rmap' already has a deep meaning, at least for me.

Maybe we can call the abstract structure 'spte_list' and define
operations on that, then implement rmap_* and *_parent_pte as wrappers
around spte_list.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 1/7] KVM: MMU: optimize pte write path if don't have protected sp

2011-05-15 Thread Avi Kivity

On 05/15/2011 12:35 AM, Xiao Guangrong wrote:
> Simply return from kvm_mmu_pte_write path if no shadow page is
> write-protected, then we can avoid to walk all shadow pages and hold
> mmu-lock
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2841805..971e2d2 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -498,6 +498,7 @@ static void account_shadowed(struct kvm *kvm, gfn_t gfn)
>  		linfo = lpage_info_slot(gfn, slot, i);
>  		linfo->write_count += 1;
>  	}
> +	atomic_inc(&kvm->arch.indirect_shadow_pages);
>  }
>
>  static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
> @@ -513,6 +514,7 @@ static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
>  		linfo->write_count -= 1;
>  		WARN_ON(linfo->write_count < 0);
>  	}
> +	atomic_dec(&kvm->arch.indirect_shadow_pages);
>  }

These atomic ops are always called from within the spinlock, so we don't
need an atomic_t here.

Sorry, I should have noticed this on the first version.





--
error compiling committee.c: too many arguments to function
