Re: [RFC]VM live snapshot proposal
On Wed, Mar 05, 2014 at 01:52:14AM +, Huangpeng (Peter) wrote:
> Hi, Andrea
>
> Where can I get the dev-git-branch? I can use it to try the snapshot prototype coding.

You can find the current status in the origin/master branch here:

http://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git

However, userlandfd is still missing, so it's not yet good for transparent userfault when it's O_DIRECT or other gup users triggering the access (those would currently return an error to userland if they hit a userfault vma, and we don't want userland to ever get an error, or the modifications to userland would be too big).

userlandfd will let the kernel wait on an event from the migration thread and it will talk with the migration thread directly. So userland won't be able to notice the userfault happening inside a write() or kvm ioctl() syscall (you could notice only if you strace the migration thread).

That's more efficient too, so the host scheduler can directly switch to the migration thread without having to return to userland first. And after remap_anon_pages completes and the host scheduler runs the vcpu or I/O thread again, gup_fast can continue from kernel mode where it stopped, without unnecessary exits to userland.

Making the kernel speak directly to the migration thread is somewhat more tricky at the kernel level than what you find in aa.git right now, but it is worth it to be transparent to all syscalls that would trip on userfaults with gup_fast.

--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC]VM live snapshot proposal
Hi,

On Tue, Mar 04, 2014 at 01:35:53AM +, Huangpeng (Peter) wrote:
> > Hi Paolo,
> >
> > On Mon, Mar 03, 2014 at 02:47:31PM +0100, Paolo Bonzini wrote:
> > > I'm not sure what's the status of the kernel infrastructure for post-copy. Andrea?
> >
> > sys_userfaultfd is still work in progress, but there shouldn't be much work left to completion. madvise(MADV_USERFAULT) and remap_anon_pages() have been complete for a while.
>
> http://qemu-project.org/Features/PostCopyLiveMigration
>
> From the feature description, post-copy uses memory copy, so this infrastructure will solve this problem, but does not help snapshot, am I right?

Correct, there's no copy with this infrastructure, other than whatever data copy may be happening inside the network receive protocol for skb linearization into userland memory. With RDMA or zerocopy DMA receive mechanisms, there may be no copy at all.
Re: [RFC]VM live snapshot proposal
On Tue, Mar 04, 2014 at 01:02:44AM +, Huangpeng (Peter) wrote:
> > But back to the options:
> >
> > If the host has enough free memory to fork QEMU, a small helper process can be used to save the copy-on-write memory snapshot (thanks to fork(2) semantics).
> >
> > The hard part about the fork(2) approach is that QEMU isn't really designed to fork, so work is necessary to reach a quiescent state for the child process.
> >
> > If there is not enough memory to fork, then a synchronous approach to catching guest memory writes is needed. I'm not sure if a good mechanism for that exists but the simplest would be mprotect(2) and a signal handler (which will make the guest run very slowly).
> >
> > Stefan
>
> In a real production environment, memory over-commit, or using as much memory as possible, may be the normal case, so the fork semantics cannot meet the needs.

Yes, I think you're right. The fork approach only works in the easy case where there is plenty of free host memory.

> Are there any other proposals to implement vm-snapshot?

See the discussion by Paolo and Andrea about post-copy migration, which adds kernel memory management features for tracking userspace page faults. Perhaps you can use that infrastructure to trap guest writes.

Stefan
Re: [RFC]VM live snapshot proposal
On 04/03/2014 09:54, Stefan Hajnoczi wrote:
> > Are there any other proposals to implement vm-snapshot?
>
> See the discussion by Paolo and Andrea about post-copy migration, which adds kernel memory management features for tracking userspace page faults. Perhaps you can use that infrastructure to trap guest writes.

That infrastructure actually traps guest reads too. But it's fine, as the trapped accesses are a superset of guest writes and the image will still be consistent.

Paolo
RE: [RFC]VM live snapshot proposal
> > Are there any other proposals to implement vm-snapshot?
>
> See the discussion by Paolo and Andrea about post-copy migration, which adds kernel memory management features for tracking userspace page faults. Perhaps you can use that infrastructure to trap guest writes.
>
> Stefan

I will look into Paolo's new infrastructure first, and post new progress later.

Thanks
RE: [RFC]VM live snapshot proposal
Hi, Andrea

Where can I get the dev-git-branch? I can use it to try the snapshot prototype coding.

Thanks.

-----Original Message-----
From: Andrea Arcangeli [mailto:aarca...@redhat.com]
Sent: Tuesday, March 04, 2014 3:52 AM
To: Paolo Bonzini
Cc: Kevin Wolf; Stefan Hajnoczi; Huangpeng (Peter); qemu-de...@nongnu.org; Wenchao Xia; Pavel Hrdina; KVM devel mailing list; Zhanghailiang
Subject: Re: [RFC]VM live snapshot proposal

Hi Paolo,

On Mon, Mar 03, 2014 at 02:47:31PM +0100, Paolo Bonzini wrote:
> I'm not sure what's the status of the kernel infrastructure for post-copy. Andrea?

sys_userfaultfd is still work in progress but it shouldn't be much work left to completion. madvise(MADV_USERFAULT) and remap_anon_pages() are complete for a while.
Re: [RFC]VM live snapshot proposal
On Mon, Mar 03, 2014 at 01:13:41AM +, Huangpeng (Peter) wrote:
> > Just to summarize the idea of live savevm for people joining the discussion:
> >
> > It should be possible to save a snapshot of the guest (including memory, devices, and disk) without noticeable downtime. The 'savevm' command pauses the guest until the snapshot has been completed and therefore doesn't meet the requirements.
>
> Here I have another proposal: based on the live-migration scheme, add consistent memory state tracking and saving.
>
> The idea is simple:
> 1. In the first round, use live migration to save all memory to a snapshot file.
> 2. Intercept memory modifications, save the old pages to a temporary file and mark dirty bits.
> 3. Merge the temporary file into the original snapshot file.
>
> Detailed process:
> (1) Pause VM
> (2) Save the device status to a temporary file (live-migration already supported)
> (3) Make disk snapshot
> (4) Enable page dirty log and old dirty pages save function (which we need to add)
> (5) Resume VM
> (6) Begin the first round of iteration, we save the entire contents of the VM memory pages to the snapshot file
> (7) In the second round of iteration, we save the old page to the snapshot file
> (8) Merge data of device status which is pre-saved in temporary files to the snapshot file
> (9) End ram snapshot and some cleanup work
>
> Because memory modifications may happen in kvm, qemu, or vhost, the key part is how we can provide a common page-modify-tracking-and-saving API. We completed a prototype by simply adding a modified-page tracking/saving function in qemu, and it seems to work fine.

Yes, this is the tricky part. To be honest, I think this is the reason no one has submitted patches - it's a hard task and the win isn't that great (you can already migrate to file).

But back to the options:

If the host has enough free memory to fork QEMU, a small helper process can be used to save the copy-on-write memory snapshot (thanks to fork(2) semantics).

The hard part about the fork(2) approach is that QEMU isn't really designed to fork, so work is necessary to reach a quiescent state for the child process.

If there is not enough memory to fork, then a synchronous approach to catching guest memory writes is needed. I'm not sure if a good mechanism for that exists but the simplest would be mprotect(2) and a signal handler (which will make the guest run very slowly).

Stefan
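A minimal sketch of the fork(2) idea described above, written in Python rather than QEMU's C and assuming a POSIX host. The `bytearray` stands in for guest RAM, and `snapshot_via_fork` and `PAGE_SIZE` are hypothetical names, not QEMU interfaces. It illustrates only the copy-on-write semantics; it does not address the hard part Stefan mentions (bringing a multi-threaded QEMU to a quiescent state before forking).

```python
import os

PAGE_SIZE = 4096  # assumed page size, for illustration only


def snapshot_via_fork(memory: bytearray, path: str) -> int:
    """Fork a helper that writes `memory` as it was at fork time to `path`.

    Returns the helper's PID; the caller is expected to wait for it.
    """
    pid = os.fork()
    if pid == 0:
        # Child: its view of `memory` is a copy-on-write image frozen at
        # the moment of fork(), whatever the parent writes afterwards.
        status = 1
        try:
            with open(path, "wb") as f:
                f.write(memory)
            status = 0
        finally:
            os._exit(status)  # leave without running the parent's cleanup
    return pid
```

The parent can keep modifying `memory` right after the call and later collect the helper with `os.waitpid(pid, 0)`. Every page the parent touches in the meantime is duplicated by the kernel, which is why the approach needs spare host memory.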
Re: [RFC]VM live snapshot proposal
On 03.03.2014 at 13:32, Stefan Hajnoczi wrote:
> On Mon, Mar 03, 2014 at 01:13:41AM +, Huangpeng (Peter) wrote:
> > > Just to summarize the idea of live savevm for people joining the discussion:
> > >
> > > It should be possible to save a snapshot of the guest (including memory, devices, and disk) without noticeable downtime. The 'savevm' command pauses the guest until the snapshot has been completed and therefore doesn't meet the requirements.
> >
> > Here I have another proposal: based on the live-migration scheme, add consistent memory state tracking and saving.
> >
> > The idea is simple:
> > 1. In the first round, use live migration to save all memory to a snapshot file.
> > 2. Intercept memory modifications, save the old pages to a temporary file and mark dirty bits.
> > 3. Merge the temporary file into the original snapshot file.

Why do you need a temporary file for this? Couldn't you directly store the memory to its final destination in the snapshot file?

> > Detailed process:
> > (1) Pause VM
> > (2) Save the device status to a temporary file (live-migration already supported)
> > (3) Make disk snapshot
> > (4) Enable page dirty log and old dirty pages save function (which we need to add)
> > (5) Resume VM
> > (6) Begin the first round of iteration, we save the entire contents of the VM memory pages to the snapshot file
> > (7) In the second round of iteration, we save the old page to the snapshot file
> > (8) Merge data of device status which is pre-saved in temporary files to the snapshot file
> > (9) End ram snapshot and some cleanup work
> >
> > Because memory modifications may happen in kvm, qemu, or vhost, the key part is how we can provide a common page-modify-tracking-and-saving API. We completed a prototype by simply adding a modified-page tracking/saving function in qemu, and it seems to work fine.
>
> Yes, this is the tricky part. To be honest, I think this is the reason no one has submitted patches - it's a hard task and the win isn't that great (you can already migrate to file).

So why don't we simply reuse the existing migration code?

Kevin
Re: [RFC]VM live snapshot proposal
On 03/03/2014 13:32, Stefan Hajnoczi wrote:
> If there is not enough memory to fork, then a synchronous approach to catching guest memory writes is needed. I'm not sure if a good mechanism for that exists but the simplest would be mprotect(2) and a signal handler (which will make the guest run very slowly).

I think we'll be adding such a mechanism, but for guest memory reads, for postcopy migration. Perhaps it could be reused for live snapshotting?

Paolo
Re: [RFC]VM live snapshot proposal
On 03/03/2014 13:55, Kevin Wolf wrote:
> > > Because memory modifications may happen in kvm, qemu, or vhost, the key part is how we can provide a common page-modify-tracking-and-saving API. We completed a prototype by simply adding a modified-page tracking/saving function in qemu, and it seems to work fine.
> >
> > Yes, this is the tricky part. To be honest, I think this is the reason no one has submitted patches - it's a hard task and the win isn't that great (you can already migrate to file).
>
> So why don't we simply reuse the existing migration code?

I think this is different in the same way that block-backup and block-mirror are different. Huangpeng's proposal would let you make a consistent snapshot of disks and RAM.

Paolo
Re: [RFC]VM live snapshot proposal
On 03.03.2014 at 14:19, Paolo Bonzini wrote:
> On 03/03/2014 13:55, Kevin Wolf wrote:
> > > > Because memory modifications may happen in kvm, qemu, or vhost, the key part is how we can provide a common page-modify-tracking-and-saving API. We completed a prototype by simply adding a modified-page tracking/saving function in qemu, and it seems to work fine.
> > >
> > > Yes, this is the tricky part. To be honest, I think this is the reason no one has submitted patches - it's a hard task and the win isn't that great (you can already migrate to file).
> >
> > So why don't we simply reuse the existing migration code?
>
> I think this is different in the same way that block-backup and block-mirror are different. Huangpeng's proposal would let you make a consistent snapshot of disks and RAM.

Right. Though the point isn't about consistency (doing the disk snapshot when memory has converged would be consistent as well), but about having the snapshot semantically right at the time when the monitor command is issued instead of only starting it then and being consistent at the point of completion.

This is indeed like pre/post-copy live migration, and probably both options have their uses. I would suggest starting with the easy one, and adding the post-copy feature on top.

Kevin
Re: [RFC]VM live snapshot proposal
On 03/03/2014 14:30, Kevin Wolf wrote:
> > > So why don't we simply reuse the existing migration code?
> >
> > I think this is different in the same way that block-backup and block-mirror are different. Huangpeng's proposal would let you make a consistent snapshot of disks and RAM.
>
> Right. Though the point isn't about consistency (doing the disk snapshot when memory has converged would be consistent as well), but about having the snapshot semantically right at the time when the monitor command is issued instead of only starting it then and being consistent at the point of completion.

Right, though it's not entirely true that migration only affects the point in time where you have consistency. For example, with migration you cannot use the guest agent for freeze/thaw and, even if we changed the code to allow that, the pause would be much longer than for live snapshots or block-backup.

> This is indeed like pre/post-copy live migration, and probably both options have their uses. I would suggest starting with the easy one, and adding the post-copy feature on top.

The feature matrix for migration and snapshot:

                       disk       RAM        internal snapshot
non-live               yes (0)    yes (0)    yes
live, disk only        yes (1)    N/A        yes (2)
live, pre-copy         yes (3)    yes        no
live, post-copy        yes (4)    no         no
live, point-in-time    yes (5)    no         no

(0) just stop VM while doing normal pre-copy migration
(1) blockdev-snapshot-sync
(2) blockdev-snapshot-internal-sync
(3) block-stream
(4) drive-mirror
(5) drive-backup

By the easy one you mean live savevm with snapshot at the end of RAM migration, I guess. But the functionality is already available using migration, while point-in-time snapshots actually add new functionality.

I'm not sure what's the status of the kernel infrastructure for post-copy. Andrea?

Paolo
Re: [RFC]VM live snapshot proposal
On 03.03.2014 at 14:47, Paolo Bonzini wrote:
> On 03/03/2014 14:30, Kevin Wolf wrote:
> > > > So why don't we simply reuse the existing migration code?
> > >
> > > I think this is different in the same way that block-backup and block-mirror are different. Huangpeng's proposal would let you make a consistent snapshot of disks and RAM.
> >
> > Right. Though the point isn't about consistency (doing the disk snapshot when memory has converged would be consistent as well), but about having the snapshot semantically right at the time when the monitor command is issued instead of only starting it then and being consistent at the point of completion.
>
> Right, though it's not entirely true that migration only affects the point in time where you have consistency. For example, with migration you cannot use the guest agent for freeze/thaw and, even if we changed the code to allow that, the pause would be much longer than for live snapshots or block-backup.
>
> > This is indeed like pre/post-copy live migration, and probably both options have their uses. I would suggest starting with the easy one, and adding the post-copy feature on top.
>
> The feature matrix for migration and snapshot:
>
>                        disk       RAM        internal snapshot
> non-live               yes (0)    yes (0)    yes
> live, disk only        yes (1)    N/A        yes (2)
> live, pre-copy         yes (3)    yes        no
> live, post-copy        yes (4)    no         no
> live, point-in-time    yes (5)    no         no
>
> (0) just stop VM while doing normal pre-copy migration
> (1) blockdev-snapshot-sync
> (2) blockdev-snapshot-internal-sync
> (3) block-stream
> (4) drive-mirror
> (5) drive-backup
>
> By the easy one you mean live savevm with snapshot at the end of RAM migration, I guess. But the functionality is already available using migration, while point-in-time snapshots actually add new functionality.
>
> I'm not sure what's the status of the kernel infrastructure for post-copy. Andrea?

Yes, it's available, though not with internal snapshots, only with RAM snapshots stored in an external file.

An incremental next step would be to avoid writing dirtied memory to two places: internal snapshots aren't a streaming interface but a random-access one, so you can overwrite the original place instead of appending the new copy. That would already be a small advantage.

Once you have this infrastructure, it's probably also a bit easier to plug in any post-copy/point-in-time features that the migration code can (be improved to) provide.

Kevin
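The point about random access can be illustrated with a toy Python model, assuming a POSIX host for `os.pwrite`. A plain file stands in for the snapshot storage and a `bytearray` for guest RAM; `write_page`, `save_all_pages` and `rewrite_dirty_pages` are hypothetical helpers, not part of QEMU's savevm or migration code. Each page has a fixed offset, so a page that was dirtied after the first pass is rewritten in place and the file never grows beyond the size of RAM.

```python
import os

PAGE_SIZE = 4096  # assumed page size, for illustration only


def write_page(fd: int, page: int, data: bytes) -> None:
    """Write one page at its fixed offset, replacing any earlier copy."""
    offset = page * PAGE_SIZE
    view = memoryview(data)
    while len(view) > 0:  # tolerate short writes
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


def save_all_pages(fd: int, memory: bytearray) -> None:
    """First pass: store every page of `memory`."""
    for page in range(len(memory) // PAGE_SIZE):
        start = page * PAGE_SIZE
        write_page(fd, page, bytes(memory[start:start + PAGE_SIZE]))


def rewrite_dirty_pages(fd: int, memory: bytearray, dirty: set) -> None:
    """Later pass: overwrite only the pages reported dirty."""
    for page in sorted(dirty):
        start = page * PAGE_SIZE
        write_page(fd, page, bytes(memory[start:start + PAGE_SIZE]))
```

A streaming destination would instead append the second copy of each dirtied page, so the saved image would hold both the stale and the current version.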
Re: [RFC]VM live snapshot proposal
Hi Paolo,

On Mon, Mar 03, 2014 at 02:47:31PM +0100, Paolo Bonzini wrote:
> I'm not sure what's the status of the kernel infrastructure for post-copy. Andrea?

sys_userfaultfd is still work in progress, but there shouldn't be much work left to completion. madvise(MADV_USERFAULT) and remap_anon_pages() have been complete for a while.
RE: [RFC]VM live snapshot proposal
> Yes, this is the tricky part. To be honest, I think this is the reason no one has submitted patches - it's a hard task and the win isn't that great (you can already migrate to file).

Yes, lots of places have to be considered. Though scenarios are limited, users like library experiments may need to revert repeatedly to the same vm-state (memory state + disk state). The key part is tracking and saving the consistent state right at snapshot time. kvm/qemu/vhost already implement dirty tracking, and my proposal will add common save-old-page APIs to save the consistent state. Is this the right way, or do you have other suggestions?

> But back to the options:
>
> If the host has enough free memory to fork QEMU, a small helper process can be used to save the copy-on-write memory snapshot (thanks to fork(2) semantics).
>
> The hard part about the fork(2) approach is that QEMU isn't really designed to fork, so work is necessary to reach a quiescent state for the child process.
>
> If there is not enough memory to fork, then a synchronous approach to catching guest memory writes is needed. I'm not sure if a good mechanism for that exists but the simplest would be mprotect(2) and a signal handler (which will make the guest run very slowly).
>
> Stefan

In a real production environment, memory over-commit, or using as much memory as possible, may be the normal case, so the fork semantics cannot meet the needs.

Are there any other proposals to implement vm-snapshot?

Thanks.
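The "save the old page before it is first modified" idea can be modelled in a few lines of Python. This is a toy under stated assumptions, not the prototype mentioned in the thread: `SnapshotTracker` and its methods are hypothetical names, a `bytearray` stands in for guest RAM, and all writers (kvm, qemu, vhost in the real system) are collapsed into a single `write()` entry point, which is exactly the part the thread identifies as hard.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class SnapshotTracker:
    """Toy model: keep the snapshot-time copy of every page written later."""

    def __init__(self, memory: bytearray):
        self.memory = memory
        self.num_pages = len(memory) // PAGE_SIZE
        self.saved = {}      # page index -> contents at snapshot time
        self.active = False

    def begin(self) -> None:
        """Mark the snapshot point."""
        self.saved.clear()
        self.active = True

    def write(self, addr: int, data: bytes) -> None:
        """Single hypothetical entry point for every memory writer."""
        if addr < 0 or addr + len(data) > len(self.memory):
            raise ValueError("write outside guest memory")
        if self.active and data:
            first = addr // PAGE_SIZE
            last = (addr + len(data) - 1) // PAGE_SIZE
            for page in range(first, last + 1):
                if page not in self.saved:  # only the first write matters
                    start = page * PAGE_SIZE
                    self.saved[page] = bytes(self.memory[start:start + PAGE_SIZE])
        self.memory[addr:addr + len(data)] = data

    def finish(self) -> bytes:
        """Build the image: a saved old page wins over current contents."""
        image = bytearray()
        for page in range(self.num_pages):
            start = page * PAGE_SIZE
            image += self.saved.get(page, self.memory[start:start + PAGE_SIZE])
        self.active = False
        return bytes(image)
```

Calling `begin()`, letting writes go through `write()`, and then calling `finish()` yields the memory contents as of `begin()`, regardless of what was written in between. The `saved` dictionary plays the role of the temporary file in the proposal.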
RE: [RFC]VM live snapshot proposal
> > Here I have another proposal: based on the live-migration scheme, add consistent memory state tracking and saving.
> >
> > The idea is simple:
> > 1. In the first round, use live migration to save all memory to a snapshot file.
> > 2. Intercept memory modifications, save the old pages to a temporary file and mark dirty bits.
> > 3. Merge the temporary file into the original snapshot file.
>
> Why do you need a temporary file for this? Couldn't you directly store the memory to its final destination in the snapshot file?

Writing to the same snapshot file needs to take write protection into account. Currently we have implemented the prototype in the simplest way, and if this proposal is accepted we will consider it.

Thanks.
RE: [RFC]VM live snapshot proposal
> > I think this is different in the same way that block-backup and block-mirror are different. Huangpeng's proposal would let you make a consistent snapshot of disks and RAM.
>
> Right. Though the point isn't about consistency (doing the disk snapshot when memory has converged would be consistent as well), but about having the snapshot semantically right at the time when the monitor command is issued instead of only starting it then and being consistent at the point of completion.
>
> This is indeed like pre/post-copy live migration, and probably both options have their uses. I would suggest starting with the easy one, and adding the post-copy feature on top.

Good suggestion. The latest post-copy patches seem to have been updated 2 years ago:

https://github.com/yamahata/qemu

One question: can post-copy fall back if exceptions happen during post-copy?

Thanks
RE: [RFC]VM live snapshot proposal
> Hi Paolo,
>
> On Mon, Mar 03, 2014 at 02:47:31PM +0100, Paolo Bonzini wrote:
> > I'm not sure what's the status of the kernel infrastructure for post-copy. Andrea?
>
> sys_userfaultfd is still work in progress, but there shouldn't be much work left to completion. madvise(MADV_USERFAULT) and remap_anon_pages() have been complete for a while.

http://qemu-project.org/Features/PostCopyLiveMigration

From the feature description, post-copy uses memory copy, so this infrastructure will solve this problem, but does not help snapshot, am I right?

Thanks