Re: [Qemu-devel] [RFC]VM live snapshot proposal

2014-03-04 Thread Wenchao Xia

On 2014/3/4 17:05, Paolo Bonzini wrote:

On 04/03/2014 09:54, Stefan Hajnoczi wrote:

Are there any other proposals to implement vm-snapshot?

See the discussion by Paolo and Andrea about post-copy migration, which
adds kernel memory management features for tracking userspace page
faults. Perhaps you can use that infrastructure to trap guest writes.


That infrastructure actually traps guest reads too. But that's fine: the
trapped accesses are a superset of guest writes, so the image will still
be consistent.


Paolo

I heard that the kernel is going to have an API that lets userspace catch
memory operations which previously could only be caught by kernel code. I
am not sure of its current status, but if the kernel has it, QEMU can use
it more gracefully than by modifying the migration code.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature?

2013-08-15 Thread Wenchao Xia

On 2013-8-15 15:49, Stefan Hajnoczi wrote:

On Thu, Aug 15, 2013 at 10:26:36AM +0800, Wenchao Xia wrote:

On 2013-8-14 15:53, Stefan Hajnoczi wrote:

On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia  wrote:

On 2013-8-13 16:21, Stefan Hajnoczi wrote:


On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia wrote:


On 2013-8-12 19:33, Stefan Hajnoczi wrote:


On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh  wrote:



--On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi wrote:


The idea that was discussed on qemu-de...@nongnu.org uses fork(2) to
capture the state of guest RAM and then send it back to the parent
process.  The guest is only paused for a brief instant during fork(2)
and can continue to run afterwards.





How would you capture the state of emulated hardware which might not
be in the guest RAM?




Exactly the same way vmsave works today.  It calls the device's save
functions which serialize state to file.

The difference between today's vmsave and the fork(2) approach is that
QEMU does not need to wait for guest RAM to be written to file before
resuming the guest.

Stefan


 I have a worry about what glib says:

"On Unix, the GLib mainloop is incompatible with fork(). Any program
using the mainloop must either exec() or exit() from the child without
returning to the mainloop. "



This is fine, the child just writes out the memory pages and exits.
It never returns to the glib mainloop.
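The fork(2)-and-exit pattern Stefan describes can be sketched in plain C. This is a minimal illustration, not QEMU code: "guest RAM" is stood in for by an anonymous buffer, and the snapshot path just dumps bytes to a file. The point is that the child sees a copy-on-write snapshot of the pre-fork contents and calls _exit() without ever returning to a mainloop, while the parent keeps mutating the memory.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write the child's (pre-fork) view of "guest RAM" to a file, then _exit().
 * The child must never return to the GLib mainloop, hence _exit(2). */
int snapshot_ram(const unsigned char *ram, size_t len, const char *path)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: its address space is a COW snapshot taken at fork time,
         * so the parent's later writes are invisible here. */
        FILE *f = fopen(path, "wb");
        if (!f || fwrite(ram, 1, len, f) != len) {
            _exit(1);
        }
        fclose(f);
        _exit(0);          /* do not return to the parent's event loop */
    }
    return 0;              /* parent continues running the guest */
}
```

The parent can overwrite the buffer immediately after snapshot_ram() returns; once the child has exited, the file holds the pre-fork contents.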


 There is another way to do it: intercept the write in kvm.ko (or other
kernel code). Since the key is intercepting the memory change, we can
already do it in userspace in TCG mode, so we could add the missing part
in KVM mode. Another benefit of this way is that the memory used can be
controlled. For example, with an ioctl(), set a buffer of a fixed size in
which kernel code keeps the intercepted write data; this can avoid
frequent switches back to userspace QEMU code. When the buffer is full,
return to userspace QEMU code and let it save the data to disk. I haven't
checked the exact behavior of Intel guest mode when handling a page
fault, so I can't estimate the performance cost of switching between
guest mode and root mode, but it should not be worse than fork().



The fork(2) approach is portable, covers both KVM and TCG, and doesn't
require kernel changes.  A kvm.ko kernel change also won't be
supported on existing KVM hosts.  These are big drawbacks and the
kernel approach would need to be significantly better than plain old
fork(2) to make it worthwhile.

Stefan


I think the advantage is that memory usage is predictable, so a memory
usage peak can be avoided by always saving the changed pages first.
fork() does not know which pages are changed. I am not sure whether this
would be a serious issue when the server's memory is heavily consumed,
for example, a 24G host emulating two 11G guests to provide a powerful
virtual server.


Memory usage is predictable but guest uptime is unpredictable because
it waits until memory is written out.  This defeats the point of
"live" savevm.  The guest may be stalled arbitrarily.


   I think it is adjustable. There is not much difference from fork(),
except that we get more precise control over the changed pages.
   The kernel intercepts the change and stores the changed page in
another page, similar to fork(). When userspace QEMU code executes, it
saves some pages to disk. The buffer can be used like a lubricant: when
buffer = MAX, it equals fork() and the guest runs more lively; when
buffer = 0, the guest runs less lively. I think it allows the user to
find a good balance point with a parameter.
   It is harder to implement; I just want to show the idea.


You are right.  You could set a bigger buffer size to increase guest
uptime.


The fork child can minimize the chance of out-of-memory by using
madvise(MADV_DONTNEED) after pages have been written out.

   It seems there is no way to make sure the written-out pages are the
changed ones, so there is a good chance the written page is an unchanged
one still in use by the other QEMU process.


The KVM dirty log tells you which pages were touched.  The fork child
process could give priority to the pages which have been touched by the
guest.  They must be written out and marked madvise(MADV_DONTNEED) as
soon as possible.
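The KVM dirty log (retrieved with the KVM_GET_DIRTY_LOG ioctl) is a bitmap with one bit per guest page. Independent of the KVM plumbing, the "touched pages first" ordering Stefan suggests is just a bitmap walk; a sketch of the prioritization logic, with the ioctl itself assumed away:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill 'order' with page indices: pages whose bit is set in the dirty
 * bitmap come first (write them out and MADV_DONTNEED them early, since
 * they are the ones forcing COW copies), clean pages follow.
 * Returns the number of dirty pages found. */
size_t dirty_first_order(const uint8_t *bitmap, size_t npages, size_t *order)
{
    size_t ndirty = 0, pos = 0;

    for (size_t i = 0; i < npages; i++)          /* dirty pages first */
        if (bitmap[i / 8] & (1u << (i % 8))) {
            order[pos++] = i;
            ndirty++;
        }

    for (size_t i = 0; i < npages; i++)          /* then clean pages */
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            order[pos++] = i;

    return ndirty;
}
```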

  Hmm, if the dirty log still works normally in the child process,
reflecting the memory status of the parent rather than the child, then
the problem could be solved by having the child tell the parent to wait
a while when there are too many dirty pages. But I haven't checked
whether kvm.ko behaves like that.



I haven't looked at the vmsave data format yet to see if memory pages
can be saved in random order, but this might work.  It reduces the
likelihood of copy-on-write memory growth.

Stefan




--
Best Regards

Wenchao Xia



Re: [Qemu-devel] Are there plans to achieve ram live Snapshot feature?

2013-08-14 Thread Wenchao Xia
On 2013-8-14 15:53, Stefan Hajnoczi wrote:
> On Wed, Aug 14, 2013 at 3:54 AM, Wenchao Xia wrote:
>> On 2013-8-13 16:21, Stefan Hajnoczi wrote:
>>
>>> On Tue, Aug 13, 2013 at 4:53 AM, Wenchao Xia wrote:
>>>>
>>>> On 2013-8-12 19:33, Stefan Hajnoczi wrote:
>>>>
>>>>> On Mon, Aug 12, 2013 at 12:26 PM, Alex Bligh  wrote:
>>>>>>
>>>>>>
>>>>>> --On 12 August 2013 11:59:03 +0200 Stefan Hajnoczi wrote:
>>>>>>
>>>>>>> The idea that was discussed on qemu-de...@nongnu.org uses fork(2) to
>>>>>>> capture the state of guest RAM and then send it back to the parent
>>>>>>> process.  The guest is only paused for a brief instant during fork(2)
>>>>>>> and can continue to run afterwards.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> How would you capture the state of emulated hardware which might not
>>>>>> be in the guest RAM?
>>>>>
>>>>>
>>>>>
>>>>> Exactly the same way vmsave works today.  It calls the device's save
>>>>> functions which serialize state to file.
>>>>>
>>>>> The difference between today's vmsave and the fork(2) approach is that
>>>>> QEMU does not need to wait for guest RAM to be written to file before
>>>>> resuming the guest.
>>>>>
>>>>> Stefan
>>>>>
>>>> I have a worry about what glib says:
>>>>
>>>> "On Unix, the GLib mainloop is incompatible with fork(). Any program
>>>> using the mainloop must either exec() or exit() from the child without
>>>> returning to the mainloop. "
>>>
>>>
>>> This is fine, the child just writes out the memory pages and exits.
>>> It never returns to the glib mainloop.
>>>
>>>> There is another way to do it: intercept the write in kvm.ko (or other
>>>> kernel code). Since the key is intercepting the memory change, we can
>>>> already do it in userspace in TCG mode, so we could add the missing part
>>>> in KVM mode. Another benefit of this way is that the memory used can be
>>>> controlled. For example, with an ioctl(), set a buffer of a fixed size in
>>>> which kernel code keeps the intercepted write data; this can avoid
>>>> frequent switches back to userspace QEMU code. When the buffer is full,
>>>> return to userspace QEMU code and let it save the data to disk. I haven't
>>>> checked the exact behavior of Intel guest mode when handling a page
>>>> fault, so I can't estimate the performance cost of switching between
>>>> guest mode and root mode, but it should not be worse than fork().
>>>
>>>
>>> The fork(2) approach is portable, covers both KVM and TCG, and doesn't
>>> require kernel changes.  A kvm.ko kernel change also won't be
>>> supported on existing KVM hosts.  These are big drawbacks and the
>>> kernel approach would need to be significantly better than plain old
>>> fork(2) to make it worthwhile.
>>>
>>> Stefan
>>>
>> I think the advantage is that memory usage is predictable, so a memory
>> usage peak can be avoided by always saving the changed pages first.
>> fork() does not know which pages are changed. I am not sure whether this
>> would be a serious issue when the server's memory is heavily consumed,
>> for example, a 24G host emulating two 11G guests to provide a powerful
>> virtual server.
> 
> Memory usage is predictable but guest uptime is unpredictable because
> it waits until memory is written out.  This defeats the point of
> "live" savevm.  The guest may be stalled arbitrarily.
> 
  I think it is adjustable. There is not much difference from fork(),
except that we get more precise control over the changed pages.
  The kernel intercepts the change and stores the changed page in
another page, similar to fork(). When userspace QEMU code executes, it
saves some pages to disk. The buffer can be used like a lubricant: when
buffer = MAX, it equals fork() and the guest runs more lively; when
buffer = 0, the guest runs less lively. I think it allows the user to
find a good balance point with a parameter.
  It is harder to implement; I just want to show the idea.

> The fork child can minimize the chance of out-of-memory by using
> madvise(MADV_DONTNEED) after pages have been written out.
  It seems there is no way to make sure the written-out pages are the
changed ones, so there is a good chance the written page is an unchanged
one still in use by the other QEMU process.

> 
> The way fork handles memory overcommit on Linux is configurable, but I
> guess in a situation where memory runs out the Out-of-Memory Killer
> will kill a process (probably QEMU since it is hogging so much
> memory).
> 
> The risk of OOM can be avoided by running the traditional vmsave which
> stops the guest instead of using "live" vmsave.
> 
> The other option is to live migrate to file, but the disadvantage there
> is that you cannot choose exactly when the state is saved; it happens
> sometime after live migration is initiated.
> 
> There are trade-offs with all the approaches, it depends on what is
> most important to you.
> 
> Stefan
> 


-- 
Best Regards

Wenchao Xia







Re: [PATCH] vhost-scsi: return -ENOENT when no matching tcm_vhost_tpg found

2013-06-11 Thread wenchao

cc to Greg for 3.9.


On Tue, May 28, 2013 at 04:54:44PM +0800, Wenchao Xia wrote:

The ioctl for VHOST_SCSI_SET_ENDPOINT reports a "file exists" error when I
forget to set things up correctly in configfs, which confuses the user.
Actually it fails to find a matching tpg, so change the error value.

Signed-off-by: Wenchao Xia 


Acked-by: Asias He 

BTW, it would be nice to print more informative info in QEMU when the
wwpn is not available as well.


---
  drivers/vhost/scsi.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 7014202..6325b1d 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1219,7 +1219,7 @@ static int vhost_scsi_set_endpoint(
}
ret = 0;
} else {
-   ret = -EEXIST;
+   ret = -ENOENT;
}

/*
--
1.7.1
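From userspace the change is visible only through errno. Along the lines of Asias's suggestion about more informative messages, a caller of the VHOST_SCSI_SET_ENDPOINT ioctl can now tell the cases apart; the ioctl itself is elided here, and this helper is purely illustrative, not QEMU code:

```c
#include <errno.h>
#include <string.h>

/* Map the ioctl failure to an actionable message.  Before the patch the
 * kernel returned -EEXIST for a missing tpg, which read as "file exists"
 * and pointed users in the wrong direction; -ENOENT now identifies the
 * real cause. */
const char *set_endpoint_strerror(int err)
{
    switch (err) {
    case ENOENT:
        return "no matching tcm_vhost_tpg: check the configfs wwpn";
    case EEXIST:
        return "endpoint already set";
    default:
        return strerror(err);
    }
}
```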









Re: [RFC] provide an API to userspace doing memory snapshot

2013-04-16 Thread Wenchao Xia

On 2013-4-16 13:51, Stefan Hajnoczi wrote:

On Mon, Apr 15, 2013 at 09:03:36PM +0800, Wenchao Xia wrote:

   I'd like to add/export a function that allows a userspace program
to take a snapshot of a region of memory. Since it is not implemented yet
I will describe it as a C API; it is quite simple now, and if it is
worthwhile I'll improve the interface later:


We talked about a simple approach using fork(2) on IRC yesterday.

Is this email outdated?

Stefan


  No, after the discussion on IRC, I agree that fork() is a simpler
method, and it can land in QEMU quickly, which users want.
  On further consideration, I still think a KVM memory snapshot would be
the long-term solution:
  The root of the problem is the acceleration module, kvm.ko; when QEMU
does not use it, there is no trouble. That means the acceleration module
is missing a function the caller requires. My instinct is: when an
acceleration module replaces a pure software one, it should try to
provide all the parts, or at least not prevent software from filling the
gap, and doing so brings benefits, so I hope to add it.
  My API description is old; the core is COW pages, and I may redesign
it if that seems reasonable.


[RFC] provide an API to userspace doing memory snapshot

2013-04-15 Thread Wenchao Xia
Hi,
  I'd like to add/export a function that allows a userspace program
to take a snapshot of a region of memory. Since it is not implemented yet
I will describe it as a C API; it is quite simple now, and if it is
worthwhile I'll improve the interface later:

Simple prototype:
C API in userspace:
/*
 *   This function will mark a section of memory as COW and return
 * a new virtual address for it. A userspace program can dump out the
 * content as a snapshot while other threads continue to modify the
 * content of the region.
 *   @addr: the virtual address to be snapshotted.
 *   @length: the length of the region.
 *   This function returns a new virtual address which can be used as
 * the snapshot. Returns NULL on failure.
 */
void *memory_snapshot_create(void *addr, uint64_t length);

/*
 *   This function will free the memory snapshot.
 *   @addr: the snapshot virtual address to be freed; it should be the
 * address returned by memory_snapshot_create().
 */
void memory_snapshot_delete(void *addr);

In kernel space:
  The pages at those virtual addresses will be marked as COW. Take a
page with physical address P0 as an example: it will have two virtual
addresses, the old A0 and the new A1. When the page is modified, the
kernel should create a new page P1 with the same contents and map A1 to
P1. When NUMA is used, P1 can be a slower page.
  It is quite like fork(), but COWs only part of the pages. Maybe add it
as an ioctl() in kvm.ko, and change the input/output to a structure
describing guest memory state.

Why bring it to kernel space:
  Compared with fork():
  1 takes less RAM and less halt time, by avoiding marking unnecessary
pages COW.
  2 takes less RAM if the API can return a bitmap letting qemu consume
dirty pages first.
  3 a much nicer userspace programming model: no need to pipe or
memcpy(), which brings better performance.
  4 optimization space in kernel space, since snapshotted pages can be
put into slower memory under NUMA when a change comes.
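The proposed API is unimplemented, but its intended call pattern can be mimicked in userspace with a plain eager copy, losing exactly the benefits listed above (it copies everything up front instead of COW-ing lazily), which is the argument for pushing it into the kernel. A stand-in sketch; the function names come from the proposal, the eager-copy body is mine:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for the proposed kernel API: instead of marking
 * the range COW, eagerly duplicate it.  The observable semantics match
 * the proposal (the returned region keeps the old contents while the
 * original is modified), but without the RAM and halt-time savings of
 * real kernel-side COW. */
void *memory_snapshot_create(void *addr, uint64_t length)
{
    void *snap = malloc(length);
    if (snap)
        memcpy(snap, addr, length);
    return snap;                     /* NULL on failure, as proposed */
}

void memory_snapshot_delete(void *addr)
{
    free(addr);
}
```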



Re: [Qemu-devel] [PATCH 0/5] vhost-scsi: Add support for host virtualized target

2013-04-01 Thread Wenchao Xia
Hi, Nicholas
  Has this series been merged into qemu 1.4? If not, I am rebasing it to
upstream; I hope no one else is working on that.

> From: Nicholas Bellinger 
> 
> Hello Anthony & Co,
> 
> This is the fourth installment to add host virtualized target support for
> the mainline tcm_vhost fabric driver using Linux v3.6-rc into QEMU 1.3.0-rc.
> 
> The series is available directly from the following git branch:
> 
> git://git.kernel.org/pub/scm/virt/kvm/nab/qemu-kvm.git vhost-scsi-for-1.3
> 
> Note the code is cut against yesterday's QEMU head, and despite the name
> of the tree it is based upon mainline qemu.org git code + has thus far been
> running overnight with > 100K IOPs small block 4k workloads using v3.6-rc2+
> based target code with RAMDISK_DR backstores.
> 
> Other than some minor fuzz between jumping from QEMU 1.2.0 -> 1.2.50, this
> series is functionally identical to what's been posted for vhost-scsi RFC-v3
> to qemu-devel.
> 
> Please consider applying these patches for an initial vhost-scsi merge into
> QEMU 1.3.0-rc code, or let us know what else you'd like to see addressed in
> order for this series to merge.
> 
> Thank you!
> 
> --nab
> 
> Nicholas Bellinger (2):
>monitor: Rename+move net_handle_fd_param -> monitor_handle_fd_param
>virtio-scsi: Set max_target=0 during vhost-scsi operation
> 
> Stefan Hajnoczi (3):
>vhost: Pass device path to vhost_dev_init()
>vhost-scsi: add -vhost-scsi host device for use with tcm-vhost
>virtio-scsi: Add start/stop functionality for vhost-scsi
> 
>   configure|   10 +++
>   hw/Makefile.objs |1 +
>   hw/qdev-properties.c |   41 +++
>   hw/vhost-scsi.c  |  190 
> ++
>   hw/vhost-scsi.h  |   62 
>   hw/vhost.c   |5 +-
>   hw/vhost.h   |3 +-
>   hw/vhost_net.c   |2 +-
>   hw/virtio-pci.c  |2 +
>   hw/virtio-scsi.c |   55 ++-
>   hw/virtio-scsi.h |1 +
>   monitor.c|   18 +
>   monitor.h|1 +
>   net.c|   18 -
>   net.h|2 -
>   net/socket.c |2 +-
>   net/tap.c|4 +-
>   qemu-common.h|1 +
>   qemu-config.c|   19 +
>   qemu-options.hx  |    4 +
>   vl.c |   18 +
>   21 files changed, 431 insertions(+), 28 deletions(-)
>   create mode 100644 hw/vhost-scsi.c
>   create mode 100644 hw/vhost-scsi.h
> 


-- 
Best Regards

Wenchao Xia
