Re: [RFC]VM live snapshot proposal

2014-03-05 Thread Andrea Arcangeli
On Wed, Mar 05, 2014 at 01:52:14AM +, Huangpeng (Peter) wrote:
 Hi, Andrea
 
 Where can I get the dev-git-branch?
 I can use it to try the snapshot prototype coding.

You can find the current status in the origin/master branch here
http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git

However, userlandfd is still missing, so it's not yet good for
transparent userfaults when the access is triggered by O_DIRECT or other
gup users (those would currently return an error to userland if they
hit a userfault vma, and we don't want userland to ever get an error,
or the modifications to userland would be too big).

userlandfd will let the kernel wait on an event from the migration
thread and it will talk with the migration thread directly. So
userland won't be able to notice the userfault happening inside a
write() or kvm ioctl() syscall (you could notice only if you strace
the migration thread). That's more efficient too so the host scheduler
can directly switch to the migration thread without having to return
to userland first. And after remap_anon_pages completes and the host
scheduler runs the vcpu or I/O thread again, gup_fast can continue
from kernel mode where it stopped again without unnecessary exits to
userland. Making the kernel speak directly to the migration thread is
somewhat more tricky at the kernel level than what you find in aa.git
right now, but it is worth it to be transparent to all syscalls that
would trip on userfaults with gup_fast.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC]VM live snapshot proposal

2014-03-05 Thread Andrea Arcangeli
Hi,

On Tue, Mar 04, 2014 at 01:35:53AM +, Huangpeng (Peter) wrote:
  
  Hi Paolo,
  
  On Mon, Mar 03, 2014 at 02:47:31PM +0100, Paolo Bonzini wrote:
   I'm not sure what's the status of the kernel infrastructure for
   post-copy.  Andrea?
  
  sys_userfaultfd is still work in progress but it shouldn't be much work
  left to completion. madvise(MADV_USERFAULT) and remap_anon_pages() are
  complete for a while.
 
 http://qemu-project.org/Features/PostCopyLiveMigration
 From the feature description, post-copy uses memory copy, so this
 infrastructure will solve this problem, but do not help snapshot, am I right?

Correct, there's no copy with this infrastructure, other than whatever
data copy may be happening inside the network receive protocol
for skb linearization into userland memory. With RDMA or zerocopy DMA
receive mechanisms, there may be no copy at all.


Re: [RFC]VM live snapshot proposal

2014-03-04 Thread Stefan Hajnoczi
On Tue, Mar 04, 2014 at 01:02:44AM +, Huangpeng (Peter) wrote:
  But back to the options:
  
  If the host has enough free memory to fork QEMU, a small helper process can
  be used to save the copy-on-write memory snapshot (thanks to fork(2)
  semantics).  The hard part about the fork(2) approach is that QEMU isn't
  really designed to fork, so work is necessary to reach a quiescent state
  for the child process.
  
  If there is not enough memory to fork, then a synchronous approach to
  catching guest memory writes is needed.  I'm not sure if a good mechanism
  for that exists but the simplest would be mprotect(2) and a signal handler
  (which will make the guest run very slowly).
  
  Stefan
 
 In real production environment, memory over-commit or use as much memory as
 possible may be the normal case, so the fork semantics cannot meet the needs. 
  

Yes, I think you're right.  The fork approach only works in the easy
case where there is plenty of free host memory.

 Is there any other proposals to implement vm-snapshot?

See the discussion by Paolo and Andrea about post-copy migration, which
adds kernel memory management features for tracking userspace page
faults.  Perhaps you can use that infrastructure to trap guest writes.

Stefan


Re: [RFC]VM live snapshot proposal

2014-03-04 Thread Paolo Bonzini

On 04/03/2014 09:54, Stefan Hajnoczi wrote:

Is there any other proposals to implement vm-snapshot?

See the discussion by Paolo and Andrea about post-copy migration, which
adds kernel memory management features for tracking userspace page
faults.  Perhaps you can use that infrastructure to trap guest writes.


That infrastructure actually traps guest reads too.  But it's fine, as 
they are a superset of guest writes and the image will still be consistent.


Paolo


RE: [RFC]VM live snapshot proposal

2014-03-04 Thread Huangpeng (Peter)
  Is there any other proposals to implement vm-snapshot?
 
 See the discussion by Paolo and Andrea about post-copy migration, which adds
 kernel memory management features for tracking userspace page faults.
 Perhaps you can use that infrastructure to trap guest writes.
 
 Stefan

I will look into Paolo's new infrastructure first, and post new progress later.
Thanks

RE: [RFC]VM live snapshot proposal

2014-03-04 Thread Huangpeng (Peter)
Hi, Andrea

Where can I get the dev-git-branch?
I can use it to try the snapshot prototype coding.

Thanks.

 -----Original Message-----
 From: Andrea Arcangeli [mailto:aarca...@redhat.com]
 Sent: Tuesday, March 04, 2014 3:52 AM
 To: Paolo Bonzini
 Cc: Kevin Wolf; Stefan Hajnoczi; Huangpeng (Peter); qemu-de...@nongnu.org;
 Wenchao Xia; Pavel Hrdina; KVM devel mailing list; Zhanghailiang
 Subject: Re: [RFC]VM live snapshot proposal
 
 Hi Paolo,
 
 On Mon, Mar 03, 2014 at 02:47:31PM +0100, Paolo Bonzini wrote:
  I'm not sure what's the status of the kernel infrastructure for
  post-copy.  Andrea?
 
 sys_userfaultfd is still work in progress but it shouldn't be much work
 left to completion. madvise(MADV_USERFAULT) and remap_anon_pages() are
 complete for a while.


Re: [RFC]VM live snapshot proposal

2014-03-03 Thread Stefan Hajnoczi
On Mon, Mar 03, 2014 at 01:13:41AM +, Huangpeng (Peter) wrote:

Just to summarize the idea of live savevm for people joining the
discussion:

It should be possible to save a snapshot of the guest (including memory,
devices, and disk) without noticeable downtime.

The 'savevm' command pauses the guest until the snapshot has been
completed and therefore doesn't meet the requirements.

 Here I have another proposal: based on the live-migration scheme, add
 consistent memory state tracking and saving.
 The idea is simple:
 1. In the first round, use live migration to save all memory to a
    snapshot file.
 2. Intercept memory modifications, save the old pages to a temporary
    file, and mark the dirty bits.
 3. Merge the temporary file into the original snapshot file.
 
 Detailed process:
 (1) Pause the VM
 (2) Save the device state to a temporary file (already supported by
     live migration)
 (3) Take the disk snapshot
 (4) Enable the page dirty log and the old-dirty-page save function
     (which we need to add)
 (5) Resume the VM
 (6) In the first round of iteration, save the entire contents of the VM
     memory pages to the snapshot file
 (7) In the second round of iteration, save the old pages to the
     snapshot file
 (8) Merge the device state pre-saved in the temporary file into the
     snapshot file
 (9) End the RAM snapshot and do some cleanup work
 
 Because memory modifications may happen in kvm, qemu, or vhost, the key
 part is how we can provide a common page-modify-tracking-and-saving API.
 We completed a prototype by simply adding a modified-page tracking/saving
 function in qemu, and it seems to work fine.

Yes, this is the tricky part.  To be honest, I think this is the reason
no one has submitted patches - it's a hard task and the win isn't that
great (you can already migrate to file).

But back to the options:

If the host has enough free memory to fork QEMU, a small helper process
can be used to save the copy-on-write memory snapshot (thanks to fork(2)
semantics).  The hard part about the fork(2) approach is that QEMU isn't
really designed to fork, so work is necessary to reach a quiescent state
for the child process.
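
A minimal sketch of the fork(2) idea, written in Python rather than inside
QEMU. The bytearray stands in for guest RAM, and fork_snapshot() and demo()
are hypothetical names chosen for illustration. It only shows the
copy-on-write semantics and does nothing about bringing a real QEMU process
to a quiescent state:

```python
import os
import tempfile

PAGE_SIZE = 4096


def fork_snapshot(memory, path):
    """Fork a helper that writes its copy-on-write view of `memory` to `path`.

    Returns the helper's pid; the caller keeps running and may modify `memory`.
    """
    pid = os.fork()
    if pid == 0:
        # Child: sees memory exactly as it was at fork() time.
        status = 1
        try:
            with open(path, "wb") as f:
                f.write(memory)
            status = 0
        finally:
            os._exit(status)  # never return into the parent's code path
    return pid


def demo():
    memory = bytearray(b"A" * (4 * PAGE_SIZE))  # stand-in for guest RAM
    original = bytes(memory)
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        pid = fork_snapshot(memory, path)
        # The parent (the "guest") keeps dirtying memory while the helper saves.
        memory[0:PAGE_SIZE] = b"B" * PAGE_SIZE
        _, status = os.waitpid(pid, 0)
        assert os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
        with open(path, "rb") as f:
            saved = f.read()
    finally:
        os.unlink(path)
    return original, saved
```

The saved file matches memory as it was when fork() was called, even though
the parent overwrote the first page afterwards. The cost is that the host
must be able to hold a private copy of every page dirtied in the meantime.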

If there is not enough memory to fork, then a synchronous approach to
catching guest memory writes is needed.  I'm not sure if a good
mechanism for that exists but the simplest would be mprotect(2) and a
signal handler (which will make the guest run very slowly).

Stefan


Re: [RFC]VM live snapshot proposal

2014-03-03 Thread Kevin Wolf
On 03.03.2014 at 13:32, Stefan Hajnoczi wrote:
 On Mon, Mar 03, 2014 at 01:13:41AM +, Huangpeng (Peter) wrote:
 
 Just to summarize the idea of live savevm for people joining the
 discussion:
 
 It should be possible to save a snapshot of the guest (including memory,
 devices, and disk) without noticeable downtime.
 
 The 'savevm' command pauses the guest until the snapshot has been
 completed and therefore doesn't meet the requirements.
 
  Here I have another proposal: based on the live-migration scheme, add
  consistent memory state tracking and saving.
  The idea is simple:
  1. In the first round, use live migration to save all memory to a
     snapshot file.
  2. Intercept memory modifications, save the old pages to a temporary
     file, and mark the dirty bits.
  3. Merge the temporary file into the original snapshot file.

Why do you need a temporary file for this? Couldn't you directly store
the memory to its final destination in the snapshot file?

  Detailed process:
  (1) Pause the VM
  (2) Save the device state to a temporary file (already supported by
      live migration)
  (3) Take the disk snapshot
  (4) Enable the page dirty log and the old-dirty-page save function
      (which we need to add)
  (5) Resume the VM
  (6) In the first round of iteration, save the entire contents of the VM
      memory pages to the snapshot file
  (7) In the second round of iteration, save the old pages to the
      snapshot file
  (8) Merge the device state pre-saved in the temporary file into the
      snapshot file
  (9) End the RAM snapshot and do some cleanup work
  
  Because memory modifications may happen in kvm, qemu, or vhost, the key
  part is how we can provide a common page-modify-tracking-and-saving API.
  We completed a prototype by simply adding a modified-page tracking/saving
  function in qemu, and it seems to work fine.
 
 Yes, this is the tricky part.  To be honest, I think this is the reason
 no one has submitted patches - it's a hard task and the win isn't that
 great (you can already migrate to file).

So why don't we simply reuse the existing migration code?

Kevin


Re: [RFC]VM live snapshot proposal

2014-03-03 Thread Paolo Bonzini

On 03/03/2014 13:32, Stefan Hajnoczi wrote:

If there is not enough memory to fork, then a synchronous approach to
catching guest memory writes is needed.  I'm not sure if a good
mechanism for that exists but the simplest would be mprotect(2) and a
signal handler (which will make the guest run very slowly).


I think we'll be adding such a mechanism, but for guest memory reads, 
for postcopy migration.  Perhaps it could be reused for live snapshotting?


Paolo



Re: [RFC]VM live snapshot proposal

2014-03-03 Thread Paolo Bonzini

On 03/03/2014 13:55, Kevin Wolf wrote:

  Because memory modifications may happen in kvm, qemu, or vhost, the key
  part is how we can provide a common page-modify-tracking-and-saving API.
  We completed a prototype by simply adding a modified-page tracking/saving
  function in qemu, and it seems to work fine.


 Yes, this is the tricky part.  To be honest, I think this is the reason
 no one has submitted patches - it's a hard task and the win isn't that
 great (you can already migrate to file).

So why don't we simply reuse the existing migration code?


I think this is different in the same way that block-backup and 
block-mirror are different.  Huangpeng's proposal would let you make a 
consistent snapshot of disks and RAM.


Paolo


Re: [RFC]VM live snapshot proposal

2014-03-03 Thread Kevin Wolf
On 03.03.2014 at 14:19, Paolo Bonzini wrote:
 On 03/03/2014 13:55, Kevin Wolf wrote:
   Because memory modifications may happen in kvm, qemu, or vhost, the key
   part is how we can provide a common page-modify-tracking-and-saving API.
   We completed a prototype by simply adding a modified-page
   tracking/saving function in qemu, and it seems to work fine.
 
  Yes, this is the tricky part.  To be honest, I think this is the reason
  no one has submitted patches - it's a hard task and the win isn't that
  great (you can already migrate to file).
 So why don't we simply reuse the existing migration code?
 
 I think this is different in the same way that block-backup and
 block-mirror are different.  Huangpeng's proposal would let you make
 a consistent snapshot of disks and RAM.

Right. Though the point isn't about consistency (doing the disk snapshot
when memory has converged would be consistent as well), but about
having the snapshot semantically right at the time when the monitor
command is issued instead of only starting it then and being consistent
at the point of completion.
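
To make "consistent at the point of completion" concrete, here is a toy
pre-copy loop in Python. It is only a simulation under stated assumptions:
pages are list entries, guest writes are fed in round by round, and
precopy_save() is a hypothetical name, not QEMU's migration code. A real
pre-copy run would pause the guest for the final round instead of waiting
for the dirty set to drain by itself:

```python
def precopy_save(memory, guest_writes):
    """Pre-copy: copy all pages, then re-copy dirtied pages until none are left.

    `guest_writes` is a list of rounds; each round is a list of
    (page_index, new_bytes) writes the running guest performs during that
    round. Returns the saved image, which matches memory at completion time.
    """
    saved = {}
    dirty = set(range(len(memory)))  # first round: every page is "dirty"
    rounds = iter(guest_writes)
    while dirty:
        to_copy, dirty = dirty, set()
        for idx in sorted(to_copy):
            saved[idx] = memory[idx]
        # The guest kept running and dirtied pages while we were copying.
        for idx, data in next(rounds, []):
            memory[idx] = data
            dirty.add(idx)
    return [saved[i] for i in range(len(memory))]
```

With memory [b"aaaa", b"bbbb", b"cccc"] and two rounds of guest writes, the
returned image equals the memory contents when the loop ends, not the
contents at the time the save was requested.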

This is indeed like pre/post-copy live migration, and probably both
options have their uses. I would suggest starting with the easy one, and
adding the post-copy feature on top.

Kevin


Re: [RFC]VM live snapshot proposal

2014-03-03 Thread Paolo Bonzini

On 03/03/2014 14:30, Kevin Wolf wrote:

  So why don't we simply reuse the existing migration code?
 I think this is different in the same way that block-backup and
 block-mirror are different.  Huangpeng's proposal would let you make
 a consistent snapshot of disks and RAM.
Right. Though the point isn't about consistency (doing the disk snapshot
when memory has converged would be consistent as well), but about
having the snapshot semantically right at the time when the monitor
command is issued instead of only starting it then and being consistent
at the point of completion.


Right---though it's not entirely true that migration only affects the 
point in time where you have consistency.  For example, with migration 
you cannot use the guest agent for freeze/thaw and, even if we changed 
the code to allow that, the pause would be much longer than for live 
snapshots or block-backup.



This is indeed like pre/post-copy live migration, and probably both
options have their uses. I would suggest starting with the easy one, and
adding the post-copy feature on top.


The feature matrix for migration and snapshot

                      disk      RAM       internal snapshot
non-live              yes (0)   yes (0)   yes
live, disk only       yes (1)   N/A       yes (2)
live, pre-copy        yes (3)   yes       no
live, post-copy       yes (4)   no        no
live, point-in-time   yes (5)   no        no

(0) just stop VM while doing normal pre-copy migration
(1) blockdev-snapshot-sync
(2) blockdev-snapshot-internal-sync
(3) block-stream
(4) drive-mirror
(5) drive-backup

By the easy one you mean live savevm with snapshot at the end of RAM
migration, I guess.  But the functionality is already available using
migration, while point-in-time snapshots actually add new functionality.
I'm not sure what's the status of the kernel infrastructure for
post-copy.  Andrea?


Paolo


Re: [RFC]VM live snapshot proposal

2014-03-03 Thread Kevin Wolf
On 03.03.2014 at 14:47, Paolo Bonzini wrote:
 On 03/03/2014 14:30, Kevin Wolf wrote:
   So why don't we simply reuse the existing migration code?
  I think this is different in the same way that block-backup and
  block-mirror are different.  Huangpeng's proposal would let you make
  a consistent snapshot of disks and RAM.
 Right. Though the point isn't about consistency (doing the disk snapshot
 when memory has converged would be consistent as well), but about
 having the snapshot semantically right at the time when the monitor
 command is issued instead of only starting it then and being consistent
 at the point of completion.
 
 Right---though it's not entirely true that migration only affects
 the point in time where you have consistency.  For example, with
 migration you cannot use the guest agent for freeze/thaw and, even
 if we changed the code to allow that, the pause would be much longer
 than for live snapshots or block-backup.
 
 This is indeed like pre/post-copy live migration, and probably both
 options have their uses. I would suggest starting with the easy one, and
 adding the post-copy feature on top.
 
 The feature matrix for migration and snapshot
 
                       disk      RAM       internal snapshot
 non-live              yes (0)   yes (0)   yes
 live, disk only       yes (1)   N/A       yes (2)
 live, pre-copy        yes (3)   yes       no
 live, post-copy       yes (4)   no        no
 live, point-in-time   yes (5)   no        no
 
 (0) just stop VM while doing normal pre-copy migration
 (1) blockdev-snapshot-sync
 (2) blockdev-snapshot-internal-sync
 (3) block-stream
 (4) drive-mirror
 (5) drive-backup
 
 By the easy one you mean live savevm with snapshot at the end of
 RAM migration, I guess.  But the functionality is already available
 using migration, while point-in-time snapshots actually add new
 functionality.  I'm not sure what's the status of the kernel
 infrastructure for post-copy.  Andrea?

Yes, it's available, though not with internal snapshots, only with
RAM snapshots stored in an external file.

An incremental next step would be to avoid writing dirtied memory to two
places: internal snapshots are a random-access interface rather than a
streaming one, so you can overwrite the original place instead of
appending the new copy. That would already be a small advantage.
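
A small Python sketch of that difference. The record layout (8-byte page
index followed by the page) and both function names are made up for
illustration and are neither QEMU's migration stream nor the qcow2 vmstate
format:

```python
import os
import tempfile

PAGE_SIZE = 4096


def save_streaming(f, page_index, data):
    """Streaming format: every (re)sent page is appended as a new record."""
    f.seek(0, os.SEEK_END)
    f.write(page_index.to_bytes(8, "little") + data)


def save_random_access(fd, page_index, data):
    """Random-access format: a page always lives at the same file offset."""
    os.pwrite(fd, data, page_index * PAGE_SIZE)


def demo():
    pages = [bytes([i]) * PAGE_SIZE for i in range(4)]
    with tempfile.TemporaryFile() as stream, tempfile.TemporaryFile() as image:
        for i, page in enumerate(pages):
            save_streaming(stream, i, page)
            save_random_access(image.fileno(), i, page)
        # Page 2 is dirtied by the guest and has to be saved a second time.
        new_page = b"\xff" * PAGE_SIZE
        save_streaming(stream, 2, new_page)
        save_random_access(image.fileno(), 2, new_page)
        stream.flush()
        stream_size = os.fstat(stream.fileno()).st_size
        image_size = os.fstat(image.fileno()).st_size
        page2 = os.pread(image.fileno(), PAGE_SIZE, 2 * PAGE_SIZE)
    return stream_size, image_size, page2
```

After re-saving one page, the streamed file holds five records while the
random-access image still holds exactly four pages, with page 2 replaced
in place.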

Once you have this infrastructure, it's probably also a bit easier to
plug in any post-copy/point-in-time features that the migration code can
(be improved to) provide.

Kevin


Re: [RFC]VM live snapshot proposal

2014-03-03 Thread Andrea Arcangeli
Hi Paolo,

On Mon, Mar 03, 2014 at 02:47:31PM +0100, Paolo Bonzini wrote:
 I'm not sure what's the status of the kernel infrastructure for
 post-copy.  Andrea?

sys_userfaultfd is still a work in progress, but there shouldn't be much
work left until completion. madvise(MADV_USERFAULT) and
remap_anon_pages() have been complete for a while.


RE: [RFC]VM live snapshot proposal

2014-03-03 Thread Huangpeng (Peter)
 
 Yes, this is the tricky part.  To be honest, I think this is the reason no
 one has submitted patches - it's a hard task and the win isn't that great
 (you can already migrate to file).

Yes, many places have to be considered. Although the scenarios are limited,
users such as those running library experiments may need to revert
repeatedly to the same VM state (memory state + disk state).

The key part is tracking and saving the consistent state right at snapshot
time. kvm/qemu/vhost already implement dirty tracking, and my proposal will
add common save-old-page APIs to save the consistent state. Is this the
right way, or do you have other suggestions?
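
A toy Python model of the save-old-page idea, to show why the result is
consistent as of snapshot time. This is not the prototype: the class and
method names are hypothetical, pages are plain list entries, and old pages
are kept in a dict instead of a temporary file:

```python
class PointInTimeSnapshot:
    """Toy copy-before-write tracker; all names here are hypothetical."""

    def __init__(self, memory):
        self.memory = memory               # list of page contents ("guest RAM")
        self.old_pages = {}                # page index -> content at snapshot time
        self.saved = [None] * len(memory)  # the snapshot image being built

    def guest_write(self, idx, data):
        # Intercept the write: keep the old page the first time it is dirtied,
        # unless the background copy has already saved that page.
        if self.saved[idx] is None and idx not in self.old_pages:
            self.old_pages[idx] = self.memory[idx]
        self.memory[idx] = data

    def copy_page(self, idx):
        # Background copy: prefer the preserved old page over live memory.
        self.saved[idx] = self.old_pages.pop(idx, self.memory[idx])


def demo():
    memory = [b"a", b"b", b"c"]
    at_snapshot_time = list(memory)
    snap = PointInTimeSnapshot(memory)
    snap.copy_page(0)
    snap.guest_write(0, b"A")  # already saved, nothing to preserve
    snap.guest_write(1, b"B")  # not saved yet, old page is preserved
    snap.guest_write(1, b"X")  # second write to the same page, no new copy
    snap.copy_page(1)
    snap.copy_page(2)
    return at_snapshot_time, snap.saved, memory
```

The saved image equals the memory contents at the moment the snapshot
started, regardless of how the guest writes interleave with the background
copy.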

 But back to the options:
 
 If the host has enough free memory to fork QEMU, a small helper process can
 be used to save the copy-on-write memory snapshot (thanks to fork(2)
 semantics).  The hard part about the fork(2) approach is that QEMU isn't
 really designed to fork, so work is necessary to reach a quiescent state
 for the child process.
 
 If there is not enough memory to fork, then a synchronous approach to
 catching guest memory writes is needed.  I'm not sure if a good mechanism
 for that exists but the simplest would be mprotect(2) and a signal handler
 (which will make the guest run very slowly).
 
 Stefan

In a real production environment, memory over-commit, or using as much
memory as possible, may be the normal case, so the fork approach cannot
meet the need.

Are there any other proposals to implement VM snapshots?

Thanks.


RE: [RFC]VM live snapshot proposal

2014-03-03 Thread Huangpeng (Peter)


   Here I have another proposal: based on the live-migration scheme,
   add consistent memory state tracking and saving.
   The idea is simple:
   1. In the first round, use live migration to save all memory to a
      snapshot file.
   2. Intercept memory modifications, save the old pages to a temporary
      file, and mark the dirty bits.
   3. Merge the temporary file into the original snapshot file.
 
 Why do you need a temporary file for this? Couldn't you directly store the
 memory to its final destination in the snapshot file?
 

Writing to the same snapshot file requires taking write protection into
account. For now we implemented the prototype in the simplest way; if this
proposal is accepted, we will consider it.

thanks.

RE: [RFC]VM live snapshot proposal

2014-03-03 Thread Huangpeng (Peter)
  I think this is different in the same way that block-backup and
  block-mirror are different.  Huangpeng's proposal would let you make a
  consistent snapshot of disks and RAM.
 
 Right. Though the point isn't about consistency (doing the disk snapshot when
 memory has converged would be consistent as well), but about having the
 snapshot semantically right at the time when the monitor command is issued
 instead of only starting it then and being consistent at the point of
 completion.
 
 This is indeed like pre/post-copy live migration, and probably both options
 have their uses. I would suggest starting with the easy one, and adding the
 post-copy feature on top.
 

Good suggestion. The latest post-copy patches seem to have been updated
2 years ago:
https://github.com/yamahata/qemu

One question:
Can post-copy fall back if exceptions happen during post-copy?

Thanks



RE: [RFC]VM live snapshot proposal

2014-03-03 Thread Huangpeng (Peter)
 
 Hi Paolo,
 
 On Mon, Mar 03, 2014 at 02:47:31PM +0100, Paolo Bonzini wrote:
  I'm not sure what's the status of the kernel infrastructure for
  post-copy.  Andrea?
 
 sys_userfaultfd is still work in progress but it shouldn't be much work
 left to completion. madvise(MADV_USERFAULT) and remap_anon_pages() are
 complete for a while.

http://qemu-project.org/Features/PostCopyLiveMigration
From the feature description, post-copy uses a memory copy, so this
infrastructure will solve that problem, but does not help snapshots.
Am I right?

Thanks
