[Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-07 Thread Isaku Yamahata
This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
on which we'll give a talk at KVM-forum.
The purpose of this mail is to let developers know about it in advance
so that we can get better feedback on its design/implementation approach
before we start implementing it.


Background
==
* What is postcopy live migration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.

The benefit is that total migration time is shorter, because each page is
transferred only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, which enables quick load balancing.
For details, please refer to the papers.

We believe this is useful for others, so we'd like to merge this
feature into upstream qemu/kvm. The existing implementation we have
right now is very ad hoc because it was written for academic research.
For the upstream merge we're starting to re-design/implement it, and
we'd like to get feedback early.  Although many improvements/optimizations
are possible, we should first implement/merge a simple, clean, but
extensible version, and then improve/optimize it later.

Postcopy live migration will be introduced as an optional feature. The
existing precopy live migration remains the default behavior.


* related links:
project page
http://sites.google.com/site/grivonhome/quick-kvm-migration

Enabling Instantaneous Relocation of Virtual Machines with a
Lightweight VMM Extension,
(proof-of-concept, ad-hoc prototype. not a new design)
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf

Reactive consolidation of virtual machines enabled by postcopy live migration
(advantage for VM consolidation)
http://portal.acm.org/citation.cfm?id=1996125
http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf

Qemu wiki
http://wiki.qemu.org/Features/PostCopyLiveMigration


Design/Implementation
=
The basic idea of postcopy livemigration is to use a sort of distributed
shared memory between the migration source and destination.

The migration procedure looks like:
  - start migration
    stop the guest VM on the source and send the machine state, except
    guest RAM, to the destination
  - resume the guest VM on the destination without guest RAM contents
  - hook guest access to pages, and pull page contents from the source;
    this continues until all the pages are pulled to the destination

  The big picture is depicted at
  http://wiki.qemu.org/File:Postcopy-livemigration.png


There are several design points.
  - who takes care of pulling page contents.
    An independent daemon vs. a thread in qemu.
    The daemon approach is preferable because an independent daemon makes
    it easy to debug the postcopy memory mechanism without qemu.
    If required, it wouldn't be difficult to convert the daemon into
    a thread in qemu.

  - connection between the source and the destination
    The connection used for live migration can be re-used after the machine
    state has been sent.

  - transfer protocol
    The existing migration protocol can be extended.

  - hooking guest RAM access
    Introduce a character device to handle page faults.
    When a page fault occurs, it queues a page request up to the user space
    daemon at the destination. The daemon pulls the page contents from the
    source and feeds them back through the character device, which resolves
    the page fault. (A rough sketch of this daemon loop follows below.)
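
To make the character device flow above concrete, here is a minimal sketch
of the destination-side daemon loop. It assumes a hypothetical /dev/umem
device that reports faulting guest frame numbers through read() and accepts
page contents through write(); the device name, the message layout, and
fetch_page_from_source() are placeholders invented for illustration, not an
existing interface.

/* Hypothetical destination-side daemon: read page-fault requests from an
 * assumed character device, fetch the page contents from the migration
 * source, and write them back so the kernel can resolve the fault. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096

struct fault_request {            /* assumed format reported by the device */
    uint64_t gfn;                 /* guest frame number that faulted */
};

struct page_reply {               /* assumed format accepted by the device */
    uint64_t gfn;
    uint8_t  data[PAGE_SIZE];
};

/* Placeholder: a real daemon would send (gfn, size) to the source over a
 * connected socket and read the page contents back. */
static int fetch_page_from_source(int src_fd, uint64_t gfn, uint8_t *buf)
{
    (void)src_fd;
    (void)gfn;
    memset(buf, 0, PAGE_SIZE);
    return 0;
}

int main(void)
{
    int dev_fd = open("/dev/umem", O_RDWR);   /* assumed device node */
    int src_fd = -1;                          /* connect_to_source() elided */
    struct fault_request req;
    struct page_reply reply;

    if (dev_fd < 0) {
        perror("open /dev/umem");
        return EXIT_FAILURE;
    }
    /* Serve faults until every page has been pulled to the destination. */
    while (read(dev_fd, &req, sizeof(req)) == (ssize_t)sizeof(req)) {
        reply.gfn = req.gfn;
        if (fetch_page_from_source(src_fd, req.gfn, reply.data) < 0) {
            fprintf(stderr, "failed to fetch gfn %llu\n",
                    (unsigned long long)req.gfn);
            break;
        }
        /* Writing the reply is assumed to populate the page and wake the
         * faulting vCPU thread. */
        if (write(dev_fd, &reply, sizeof(reply)) != (ssize_t)sizeof(reply)) {
            perror("write reply");
            break;
        }
    }
    close(dev_fd);
    return 0;
}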


* More on hooking guest RAM access
There are several candidates for the implementation. Our preference is
the character device approach.

  - inserting hooks everywhere in qemu/kvm
    This is impractical.

  - backing store for guest RAM
    A block device or a file can be used to back guest RAM, so guest RAM
    access is hooked through the backing store.

    pros
    - no new device driver is needed
    cons
    - future improvements would be difficult
    - some KVM host features (KSM, THP) wouldn't work

  - character device
    qemu mmap()s a dedicated character device, and page faults on it are
    hooked (see the mmap() sketch after this list).

    pros
    - straightforward approach
    - future improvements would be easy
    cons
    - a new driver is needed
    - some KVM host features (KSM, THP) wouldn't work
      They check whether a given VMA is anonymous. This can be fixed.

  - swap device
    When creating the guest, it is set up as if all the guest RAM were
    swapped out to a dedicated swap device, which may be an nbd disk (or
    some kind of user space block device, BUSE?).
    When the VM tries to access memory, swap-in is triggered and IO to the
    swap device is issued. The IO is then routed to the daemon in user
    space via the nbd protocol (or BUSE, AOE, iSCSI...). The daemon pulls
    pages from the migration source and services the IO request.

    pros
    - after the page transfer is complete, everything is the same as in the
      normal case
    - no new device driver is needed
    cons
    - future improvements would be difficult
    - administration: setting up nbd and the swap device
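
For the character device candidate, the qemu side might look roughly like
the sketch below. /dev/umem and the UMEM_INIT ioctl are invented names used
only to illustrate how guest RAM could be backed by such a device; the real
interface would have to be defined together with the driver.

/* Sketch: obtain guest RAM by mmap()ing an assumed postcopy character
 * device instead of anonymous memory, so that first-touch faults are
 * routed to the userspace daemon that pulls pages from the source. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

static void *postcopy_alloc_guest_ram(size_t ram_size)
{
    int fd = open("/dev/umem", O_RDWR);       /* assumed device node */
    void *ram;

    if (fd < 0) {
        return NULL;
    }
#ifdef UMEM_INIT                               /* hypothetical ioctl */
    if (ioctl(fd, UMEM_INIT, &ram_size) < 0) {
        close(fd);
        return NULL;
    }
#endif
    /* Faults on this mapping are expected to block until the daemon has
     * supplied the page contents pulled from the migration source. */
    ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                 /* mapping keeps its own ref */
    return ram == MAP_FAILED ? NULL : ram;
}

int main(void)
{
    void *ram = postcopy_alloc_guest_ram(512UL << 20);   /* 512 MB example */

    if (ram == NULL) {
        fprintf(stderr, "postcopy RAM backing unavailable\n");
        return EXIT_FAILURE;
    }
    return 0;
}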

Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Dor Laor

On 08/08/2011 06:24 AM, Isaku Yamahata wrote:

This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
on which we'll give a talk at KVM-forum.
The purpose of this mail is to letting developers know it in advance
so that we can get better feedback on its design/implementation approach
early before our starting to implement it.


Background
==
* What is postcopy live migration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.

The benefit is that total migration time is shorter, because each page is
transferred only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, which enables quick load balancing.
For details, please refer to the papers.

We believe this is useful for others so that we'd like to merge this
feature into the upstream qemu/kvm. The existing implementation that
we have right now is very ad-hoc because it's for academic research.
For the upstream merge, we're starting to re-design/implement it and
we'd like to get feedback early.  Although many improvements/optimizations
are possible, we should implement/merge the simple/clean, but extensible
as well, one at first and then improve/optimize it later.

postcopy livemigration will be introduced as optional feature. The existing
precopy livemigration remains as default behavior.


* related links:
project page
http://sites.google.com/site/grivonhome/quick-kvm-migration

Enabling Instantaneous Relocation of Virtual Machines with a
Lightweight VMM Extension,
(proof-of-concept, ad-hoc prototype. not a new design)
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf

Reactive consolidation of virtual machines enabled by postcopy live migration
(advantage for VM consolidation)
http://portal.acm.org/citation.cfm?id=1996125
http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf

Qemu wiki
http://wiki.qemu.org/Features/PostCopyLiveMigration


Design/Implementation
=
The basic idea of postcopy livemigration is to use a sort of distributed
shared memory between the migration source and destination.

The migration procedure looks like
   - start migration
 stop the guest VM on the source and send the machine states except
 guest RAM to the destination
   - resume the guest VM on the destination without guest RAM contents
   - Hook guest access to pages, and pull page contents from the source
 This continues until all the pages are pulled to the destination

   The big picture is depicted at
   http://wiki.qemu.org/File:Postcopy-livemigration.png


That's terrific  (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).

Advantages:
- No down time due to memory copying.
- Efficient: reduces the needed traffic since there is no need to re-send
  pages.
- Reduces overall RAM consumption on the source and destination, as
  opposed to current live migration (where both the source and the
  destination allocate the memory until the live migration completes).
  We can free copied memory once the destination guest has received it
  and save RAM.
- Increases parallelism for SMP guests: multiple virtual CPUs can handle
  their own demand paging. Less time holding a global lock, less thread
  contention.
- Virtual machines are using more and more memory resources; for a
  virtual machine with a very large working set, doing live migration
  with reasonable down time is impossible today.

Disadvantages:
- During the live migration the guest will run slower than with
  today's live migration. We need to remember that even today
  guests suffer a performance penalty on the source during the
  COW stage (memory copy).
- Failure of the source, the destination, or the network will cause
  us to lose the running virtual machine. Those failures are very
  rare.
  If there is shared storage we can keep a copy of the memory there,
  which can be recovered in case of such a failure.

Overall, it looks like a better approach for the vast majority of cases.
Hope it will get merged to kvm and become the default way.




There are several design points.
   - who takes care of pulling page contents.
 an independent daemon vs a thread in qemu
 The daemon approach is preferable because an independent daemon makes
 it easy to debug the postcopy memory mechanism without qemu.
 If required, it wouldn't be difficult to convert a daemon into
 a thread in qemu

   - connection between the source and the destinati

Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Stefan Hajnoczi
On Mon, Aug 8, 2011 at 4:24 AM, Isaku Yamahata  wrote:
> This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
> on which we'll give a talk at KVM-forum.

I'm curious if this approach is compatible with asynchronous page
faults?  The idea there was to tell the guest about a page fault so it
can continue to do useful work in the meantime (if the fault was in
guest userspace).

Stefan



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Yaniv Kaul

On 08/08/2011 12:20, Dor Laor wrote:

On 08/08/2011 06:24 AM, Isaku Yamahata wrote:

This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
on which we'll give a talk at KVM-forum.
The purpose of this mail is to letting developers know it in advance
so that we can get better feedback on its design/implementation approach
early before our starting to implement it.


Background
==
* What is postcopy live migration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.

The benefit is that total migration time is shorter, because each page is
transferred only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, which enables quick load balancing.
For details, please refer to the papers.

We believe this is useful for others so that we'd like to merge this
feature into the upstream qemu/kvm. The existing implementation that
we have right now is very ad-hoc because it's for academic research.
For the upstream merge, we're starting to re-design/implement it and
we'd like to get feedback early.  Although many 
improvements/optimizations

are possible, we should implement/merge the simple/clean, but extensible
as well, one at first and then improve/optimize it later.

postcopy livemigration will be introduced as optional feature. The 
existing

precopy livemigration remains as default behavior.


* related links:
project page
http://sites.google.com/site/grivonhome/quick-kvm-migration

Enabling Instantaneous Relocation of Virtual Machines with a
Lightweight VMM Extension,
(proof-of-concept, ad-hoc prototype. not a new design)
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf

Reactive consolidation of virtual machines enabled by postcopy live 
migration

(advantage for VM consolidation)
http://portal.acm.org/citation.cfm?id=1996125
http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf 



Qemu wiki
http://wiki.qemu.org/Features/PostCopyLiveMigration


Design/Implementation
=
The basic idea of postcopy livemigration is to use a sort of distributed
shared memory between the migration source and destination.

The migration procedure looks like
   - start migration
 stop the guest VM on the source and send the machine states except
 guest RAM to the destination
   - resume the guest VM on the destination without guest RAM contents
   - Hook guest access to pages, and pull page contents from the source
 This continues until all the pages are pulled to the destination

   The big picture is depicted at
   http://wiki.qemu.org/File:Postcopy-livemigration.png


That's terrific  (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).

Advantages:
- No down time due to memory copying.
- Efficient, reduce needed traffic no need to re-send pages.
- Reduce overall RAM consumption of the source and destination
as opposed from current live migration (both the source and the
destination allocate the memory until the live migration
completes). We can free copied memory once the destination guest
received it and save RAM.
- Increase parallelism for SMP guests we can have multiple
virtual CPU handle their demand paging . Less time to hold a
global lock, less thread contention.
- Virtual machines are using more and more memory resources ,
for a virtual machine with very large working set doing live
migration with reasonable down time is impossible today.

Disadvantageous:
- During the live migration the guest will run slower than in
today's live migration. We need to remember that even today
guests suffer from performance penalty on the source during the
COW stage (memory copy).
- Failure of the source or destination or the network will cause
us to lose the running virtual machine. Those failures are very
rare.


I highly doubt that's acceptable in enterprise deployments.


In case there is shared storage we can store a copy of the
memory there , that can be recovered in case of such failure .

Overall, it looks like a better approach for the vast majority of cases.
Hope it will get merged to kvm and become the default way.




There are several design points.
   - who takes care of pulling page contents.
 an independent daemon vs a thread in qemu
 The daemon approach is preferable because an independent daemon makes
 it easy to debug the postcopy memory mechanism without qemu.
 If required, it wouldn'

Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Isaku Yamahata
On Mon, Aug 08, 2011 at 10:38:35AM +0100, Stefan Hajnoczi wrote:
> On Mon, Aug 8, 2011 at 4:24 AM, Isaku Yamahata  wrote:
> > This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
> > on which we'll give a talk at KVM-forum.
> 
> I'm curious if this approach is compatible with asynchronous page
> faults?  The idea there was to tell the guest about a page fault so it
> can continue to do useful work in the meantime (if the fault was in
> guest userspace).

Yes. It's quite possible to inject an async page fault into the guest
when the faulted page isn't available on the destination. At the same
time the page will be requested from the migration source.
I think it's not so difficult.
-- 
yamahata



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Nadav Har'El
> >* What's is postcopy livemigration
> >It is is yet another live migration mechanism for Qemu/KVM, which
> >implements the migration technique known as "postcopy" or "lazy"
> >migration. Just after the "migrate" command is invoked, the execution
> >host of a VM is instantaneously switched to a destination host.

Sounds like a cool idea.

> >The benefit is, total migration time is shorter because it transfer
> >a page only once. On the other hand precopy may repeat sending same pages
> >again and again because they can be dirtied.
> >The switching time from the source to the destination is several
> >hunderds mili seconds so that it enables quick load balancing.
> >For details, please refer to the papers.

While these are the obvious benefits, the possible downside (that, as
always, depends on the workload) is the amount of time that the guest
workload runs more slowly than usual, waiting for pages it needs to
continue. There is a whole spectrum between the guest pausing completely
(which would solve all the problems of migration, but is often considered
unacceptable) and running at full speed. Is it acceptable that the guest
runs at 90% speed during the migration? 50%? 10%?
I guess we have nothing to lose from having both options, and choosing
the most appropriate technique for each guest!

> That's terrific  (nice video also)!
> Orit and myself had the exact same idea too (now we can't patent it..).

I think a new implementation is not the only reason why you cannot patent
this idea :-) Demand-paged migration has actually been discussed (and done)
for nearly a quarter of a century (!) in the area of *process* migration.

The first use I'm aware of was in CMU's Accent in 1987 - see [1].
Another paper, [2], written in 1991, discusses how process migration is done
in UCB's Sprite operating system, and evaluates the various alternatives
common at the time (20 years ago), including what it calls "lazy copying",
which is more-or-less the same thing as "post copy". Mosix (a project which,
in some sense, is still alive today) also used some sort of cross between
pre-copying (of dirty pages) and copying on-demand of clean pages (from
their backing store on the source machine).


References
[1] "Attacking the Process Migration Bottleneck"
 http://www.nd.edu/~dthain/courses/cse598z/fall2004/papers/accent.pdf
[2]  "Transparent Process Migration: Design Alternatives and the Sprite
 Implementation"
 http://nd.edu/~dthain/courses/cse598z/fall2004/papers/sprite-migration.pdf

> Advantages:
> - Virtual machines are using more and more memory resources ,
> for a virtual machine with very large working set doing live
> migration with reasonable down time is impossible today.

If a guest actually constantly uses (working set) most of its allocated
memory, it will basically be unable to do any significant amount of work
on the destination VM until this large working set is transferred to the
destination. So in this scenario, "post copying" doesn't give any
significant advantages over plain-old "pause guest and send it to the
destination". Or am I missing something?

> Disadvantageous:
> - During the live migration the guest will run slower than in
> today's live migration. We need to remember that even today
> guests suffer from performance penalty on the source during the
> COW stage (memory copy).

I wonder if something like asynchronous page faults can help somewhat with
multi-process guest workloads (and modified (PV) guest OS). 

> - Failure of the source or destination or the network will cause
> us to lose the running virtual machine. Those failures are very
> rare.

How is this different from a VM running on a single machine that fails?
Just that the small probability of failure (roughly) doubles for the
relatively-short duration of the transfer?


-- 
Nadav Har'El|   Monday, Aug  8 2011, 8 Av 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If glory comes after death, I'm not in a
http://nadav.harel.org.il   |hurry. (Latin proverb)



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Dor Laor

On 08/08/2011 01:59 PM, Nadav Har'El wrote:

* What is postcopy live migration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.


Sounds like a cool idea.


The benefit is that total migration time is shorter, because each page is
transferred only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, which enables quick load balancing.
For details, please refer to the papers.


While these are the obvious benefits, the possible downside (that, as
always, depends on the workload) is the amount of time that the guest
workload runs more slowly than usual, waiting for pages it needs to
continue. There are a whole spectrum between the guest pausing completely
(which would solve all the problems of migration, but is often considered
unacceptible) and running at full-speed. Is it acceptable that the guest
runs at 90% speed during the migration? 50%? 10%?
I guess we could have nothing to lose from having both options, and choosing
the most appropriate technique for each guest!


+1




That's terrific  (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).


I think new implementation is not the only reason why you cannot patent
this idea :-) Demand-paged migration has actually been discussed (and done)
for nearly a quarter of a century (!) in the area of *process* migration.

The first use I'm aware of was in CMU's Accent 1987 - see [1].
Another paper, [2], written in 1991, discusses how process migration is done
in UCB's Sprite operating system, and evaluates the various alternatives
common at the time (20 years ago), including what it calls "lazy copying"
is more-or-less the same thing as "post copy". Mosix (a project which, in some
sense, is still alive to day) also used some sort of cross between pre-copying
(of dirty pages) and copying on-demand of clean pages (from their backing
store on the source machine).


References
[1] "Attacking the Process Migration Bottleneck"
  http://www.nd.edu/~dthain/courses/cse598z/fall2004/papers/accent.pdf


w/o reading the internals, patents enable you to implement an existing
idea in a new field. Anyway, there won't be any patent in this case.
Still, let's have the kvm innovation merged.



[2]  "Transparent Process Migration: Design Alternatives and the Sprite
  Implementation"
  http://nd.edu/~dthain/courses/cse598z/fall2004/papers/sprite-migration.pdf


Advantages:
 - Virtual machines are using more and more memory resources ,
 for a virtual machine with very large working set doing live
 migration with reasonable down time is impossible today.


If a guest actually constantly uses (working set) most of its allocated
memory, it will basically be unable to do any significant amount of work
on the destination VM until this large working set is transfered to the
destination. So in this scenario, "post copying" doesn't give any
significant advantages over plain-old "pause guest and send it to the
destination". Or am I missing something?


There is one key advantage in this scheme/use case - if you have a guest 
with a very large working set, you'll need a very large downtime in 
order to migrate it with today's algorithm. With post copy (aka 
streaming/demand paging), the guest won't have any downtime but will run 
slower than expected.


There are guests today that are impractical to live migrate.

btw: Even today, marking pages RO also carries some performance penalty.




Disadvantageous:
 - During the live migration the guest will run slower than in
 today's live migration. We need to remember that even today
 guests suffer from performance penalty on the source during the
 COW stage (memory copy).


I wonder if something like asynchronous page faults can help somewhat with
multi-process guest workloads (and modified (PV) guest OS).


They should come into play to some extent. Note that only newer Linux
guests will benefit from them.





 - Failure of the source or destination or the network will cause
 us to lose the running virtual machine. Those failures are very
 rare.


How is this different from a VM running on a single machine that fails?
Just that the small probability of failure (roughly) doubles for the
relatively-short duration of the transfer?


Exactly my point, this is not a major disadvantage because of this low 
probability.





Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Avi Kivity

On 08/08/2011 06:24 AM, Isaku Yamahata wrote:

This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
on which we'll give a talk at KVM-forum.
The purpose of this mail is to letting developers know it in advance
so that we can get better feedback on its design/implementation approach
early before our starting to implement it.


Interesting; what is the impact of increased latency on memory reads?




There are several design points.
   - who takes care of pulling page contents.
 an independent daemon vs a thread in qemu
 The daemon approach is preferable because an independent daemon makes
 it easy to debug the postcopy memory mechanism without qemu.
 If required, it wouldn't be difficult to convert a daemon into
 a thread in qemu


Isn't this equivalent to touching each page in sequence?

Care must be taken that we don't post too many requests, or it could 
affect the latency of synchronous accesses by the guest.




   - connection between the source and the destination
 The connection for live migration can be re-used after sending machine
 state.

   - transfer protocol
 The existing protocol that exists today can be extended.

   - hooking guest RAM access
 Introduce a character device to handle page fault.
 When page fault occurs, it queues page request up to user space daemon
 at the destination. And the daemon pulls page contents from the source
 and serves it into the character device. Then the page fault is resolved.


This doesn't play well with host swapping, transparent hugepages, or 
ksm, does it?


I see you note this later on.


* More on hooking guest RAM access
There are several candidate for the implementation. Our preference is
character device approach.

   - inserting hooks into everywhere in qemu/kvm
 This is impractical

   - backing store for guest ram
 a block device or a file can be used to back guest RAM.
 Thus hook the guest ram access.

 pros
 - new device driver isn't needed.
 cons
 - future improvement would be difficult
 - some KVM host feature(KSM, THP) wouldn't work

   - character device
 qemu mmap() the dedicated character device, and then hook page fault.

 pros
 - straightforward approach
 - future improvement would be easy
 cons
 - new driver is needed
 - some KVM host feature(KSM, THP) wouldn't work
   They check if a given VMA is anonymous. This can be fixed.

   - swap device
 When creating guest, it is set up as if all the guest RAM is swapped out
 to a dedicated swap device, which may be nbd disk (or some kind of user
 space block device, BUSE?).
 When the VM tries to access memory, swap-in is triggered and IO to the
 swap device is issued. Then the IO to swap is routed to the daemon
 in user space with nbd protocol (or BUSE, AOE, iSCSI...). The daemon pulls
 pages from the migration source and services the IO request.

 pros
 - After the page transfer is complete, everything is same as normal case.
 - no new device driver is needed
 cons
 - future improvement would be difficult
 - administration: setting up nbd, swap device



Using a swap device would be my preference.  We'd still be using 
anonymous memory so thp/ksm/ordinary swap still work.


It would need to be a special kind of swap device since we only want to 
swap in, and never out, to that device.  We'd also need a special way of 
telling the kernel that memory comes from that device.  In that respect
it's similar to your second option.


Maybe we should use a backing file (using nbd) and have a madvise() call 
that converts the vma to anonymous memory once the migration is finished.
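
A sketch of that idea follows; MADV_CONVERT_ANON stands in for the proposed
(currently non-existent) madvise() extension, and /dev/nbd0 is assumed to be
an nbd export of the source guest's RAM.

/* Sketch: back guest RAM with a private file-backed mapping during
 * postcopy, then "detach" it from the backing once migration finishes. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t ram_size = 512UL << 20;            /* 512 MB example */
    int fd = open("/dev/nbd0", O_RDONLY);     /* nbd export of source RAM */
    void *ram;

    if (fd < 0) {
        perror("open backing device");
        return EXIT_FAILURE;
    }
    /* MAP_PRIVATE: reads fault pages in from the backing device, writes
     * become copy-on-write anonymous pages on the destination. */
    ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (ram == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* ... run the guest; pages are pulled from the source on first read ... */

#ifdef MADV_CONVERT_ANON                       /* hypothetical new flag */
    /* Once the last page has been copied, convert the VMA to anonymous
     * memory so KSM/THP/swap behave as they do for a normal guest. */
    madvise(ram, ram_size, MADV_CONVERT_ANON);
#endif
    close(fd);
    return 0;
}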


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Anthony Liguori

On 08/08/2011 04:20 AM, Dor Laor wrote:

On 08/08/2011 06:24 AM, Isaku Yamahata wrote:

This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
on which we'll give a talk at KVM-forum.
The purpose of this mail is to letting developers know it in advance
so that we can get better feedback on its design/implementation approach
early before our starting to implement it.


Background
==
* What is postcopy live migration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.

The benefit is that total migration time is shorter, because each page is
transferred only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, which enables quick load balancing.
For details, please refer to the papers.

We believe this is useful for others so that we'd like to merge this
feature into the upstream qemu/kvm. The existing implementation that
we have right now is very ad-hoc because it's for academic research.
For the upstream merge, we're starting to re-design/implement it and
we'd like to get feedback early. Although many improvements/optimizations
are possible, we should implement/merge the simple/clean, but extensible
as well, one at first and then improve/optimize it later.

postcopy livemigration will be introduced as optional feature. The
existing
precopy livemigration remains as default behavior.


* related links:
project page
http://sites.google.com/site/grivonhome/quick-kvm-migration

Enabling Instantaneous Relocation of Virtual Machines with a
Lightweight VMM Extension,
(proof-of-concept, ad-hoc prototype. not a new design)
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf

Reactive consolidation of virtual machines enabled by postcopy live
migration
(advantage for VM consolidation)
http://portal.acm.org/citation.cfm?id=1996125
http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf


Qemu wiki
http://wiki.qemu.org/Features/PostCopyLiveMigration


Design/Implementation
=
The basic idea of postcopy livemigration is to use a sort of distributed
shared memory between the migration source and destination.

The migration procedure looks like
- start migration
stop the guest VM on the source and send the machine states except
guest RAM to the destination
- resume the guest VM on the destination without guest RAM contents
- Hook guest access to pages, and pull page contents from the source
This continues until all the pages are pulled to the destination

The big picture is depicted at
http://wiki.qemu.org/File:Postcopy-livemigration.png


That's terrific (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).

Advantages:
- No down time due to memory copying.


But non-deterministic down time due to network latency while trying to 
satisfy a page fault.



- Efficient, reduce needed traffic no need to re-send pages.


It's not quite that simple.  Post-copy needs to introduce a protocol 
capable of requesting pages.


I think in presenting something like this, it's important to collect 
quite a bit of performance data.  I'd suggest doing runs while running 
jitterd in the guest to attempt to quantify the actual downtime 
experienced too.


http://git.codemonkey.ws/cgit/jitterd.git/

There's a lot of potential in something like this, but it's not obvious 
to me whether it's a net win.  Should make for a very interesting 
presentation :-)



- Reduce overall RAM consumption of the source and destination
as opposed from current live migration (both the source and the
destination allocate the memory until the live migration
completes). We can free copied memory once the destination guest
received it and save RAM.
- Increase parallelism for SMP guests we can have multiple
virtual CPU handle their demand paging . Less time to hold a
global lock, less thread contention.
- Virtual machines are using more and more memory resources ,
for a virtual machine with very large working set doing live
migration with reasonable down time is impossible today.


This is really just a limitation of our implementation.  In theory, 
pre-copy allows you to exert fine grain resource control over the guest 
which you can use to encourage convergence.



Disadvantageous:
- During the live migration the guest will run slower than in
today's live migration. We need to remember that even today
guests suffer from performance penalty on the source during the
COW stage (memory copy).
- Failure of the source or destination or the network will cause
us to lose the running virtual machine. Those failures are very
rare.
In case

Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Dor Laor

On 08/08/2011 03:32 PM, Anthony Liguori wrote:

On 08/08/2011 04:20 AM, Dor Laor wrote:

On 08/08/2011 06:24 AM, Isaku Yamahata wrote:

This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
on which we'll give a talk at KVM-forum.
The purpose of this mail is to letting developers know it in advance
so that we can get better feedback on its design/implementation approach
early before our starting to implement it.


Background
==
* What is postcopy live migration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.

The benefit is that total migration time is shorter, because each page is
transferred only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, which enables quick load balancing.
For details, please refer to the papers.

We believe this is useful for others so that we'd like to merge this
feature into the upstream qemu/kvm. The existing implementation that
we have right now is very ad-hoc because it's for academic research.
For the upstream merge, we're starting to re-design/implement it and
we'd like to get feedback early. Although many
improvements/optimizations
are possible, we should implement/merge the simple/clean, but extensible
as well, one at first and then improve/optimize it later.

postcopy livemigration will be introduced as optional feature. The
existing
precopy livemigration remains as default behavior.


* related links:
project page
http://sites.google.com/site/grivonhome/quick-kvm-migration

Enabling Instantaneous Relocation of Virtual Machines with a
Lightweight VMM Extension,
(proof-of-concept, ad-hoc prototype. not a new design)
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf

Reactive consolidation of virtual machines enabled by postcopy live
migration
(advantage for VM consolidation)
http://portal.acm.org/citation.cfm?id=1996125
http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf



Qemu wiki
http://wiki.qemu.org/Features/PostCopyLiveMigration


Design/Implementation
=
The basic idea of postcopy livemigration is to use a sort of distributed
shared memory between the migration source and destination.

The migration procedure looks like
- start migration
stop the guest VM on the source and send the machine states except
guest RAM to the destination
- resume the guest VM on the destination without guest RAM contents
- Hook guest access to pages, and pull page contents from the source
This continues until all the pages are pulled to the destination

The big picture is depicted at
http://wiki.qemu.org/File:Postcopy-livemigration.png


That's terrific (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).

Advantages:
- No down time due to memory copying.


But non-deterministic down time due to network latency while trying to
satisfy a page fault.


True but it is possible to limit it with some dedicated network or 
bandwidth reservation.





- Efficient, reduce needed traffic no need to re-send pages.


It's not quite that simple. Post-copy needs to introduce a protocol
capable of requesting pages.


Just another subsection.. (kidding), still it shouldn't be too 
complicated, just an offset+pagesize and return page_content/error
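
A minimal sketch of such a request/reply pair, assuming invented field names
and a packed on-wire layout; the real protocol would presumably be carried as
an extension of qemu's existing migration stream:

/* Sketch of a page-request message pair: the destination asks for an
 * offset plus page size, the source answers with a status and, on
 * success, the page contents. */
#include <stdint.h>

#define POSTCOPY_PAGE_OK    0
#define POSTCOPY_PAGE_ERROR 1

struct postcopy_page_request {
    uint64_t ram_offset;          /* offset into guest RAM, page aligned */
    uint32_t page_size;           /* usually 4096, larger for hugepages */
} __attribute__((packed));

struct postcopy_page_reply {
    uint64_t ram_offset;          /* echoed so replies can be matched */
    uint32_t status;              /* POSTCOPY_PAGE_OK or POSTCOPY_PAGE_ERROR */
    /* followed by page_size bytes of page contents when status is OK */
} __attribute__((packed));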




I think in presenting something like this, it's important to collect
quite a bit of performance data. I'd suggest doing runs while running
jitterd in the guest to attempt to quantify the actual downtime
experienced too.

http://git.codemonkey.ws/cgit/jitterd.git/


and also comparing the time it takes to run various benchmarks like
iozone/netperf/linpack/..




There's a lot of potential in something like this, but it's not obvious
to me whether it's a net win. Should make for a very interesting
presentation :-)


- Reduce overall RAM consumption of the source and destination
as opposed from current live migration (both the source and the
destination allocate the memory until the live migration
completes). We can free copied memory once the destination guest
received it and save RAM.
- Increase parallelism for SMP guests we can have multiple
virtual CPU handle their demand paging . Less time to hold a
global lock, less thread contention.
- Virtual machines are using more and more memory resources ,
for a virtual machine with very large working set doing live
migration with reasonable down time is impossible today.


This is really just a limitation of our implementation. In theory,
pre-copy allows you to exert fine grain resource control over the guest
which you can use to encourage convergence.


But 

Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Anthony Liguori

On 08/08/2011 10:11 AM, Dor Laor wrote:

On 08/08/2011 03:32 PM, Anthony Liguori wrote:

On 08/08/2011 04:20 AM, Dor Laor wrote:


That's terrific (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).

Advantages:
- No down time due to memory copying.


But non-deterministic down time due to network latency while trying to
satisfy a page fault.


True but it is possible to limit it with some dedicated network or
bandwidth reservation.


Yup.  Any technique that uses RDMA (which is basically what this is) 
requires dedicated network resources.



- Efficient, reduce needed traffic no need to re-send pages.


It's not quite that simple. Post-copy needs to introduce a protocol
capable of requesting pages.


Just another subsection.. (kidding), still it shouldn't be too
complicated, just an offset+pagesize and return page_content/error


What I meant by this is that there is potentially a lot of round trip 
overhead.  Pre-copy migration works well with reasonably high-latency
network connections because the downtime is capped only by the maximum
latency of sending from one point to another.


But with something like this, the total downtime is 
2*max_latency*nb_pagefaults.  That's potentially pretty high.


So it may be desirable to try to reduce nb_pagefaults by prefaulting in 
pages, etc.  Suffice to say, this ends up getting complicated and may 
end up burning network traffic too.



This is really just a limitation of our implementation. In theory,
pre-copy allows you to exert fine grain resource control over the guest
which you can use to encourage convergence.


But a very large guest w/ large working set that changes more frequent
than the network bandwidth might always need huge down time with the
current system.


In theory, you can do things like reduce the guest's priority to reduce
the amount of work it can do in order to encourage convergence.
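
As one crude illustration of that kind of throttling (not an existing qemu
mechanism), the management layer could renice the qemu process during
precopy; in practice a cgroup CPU cap would give finer control. qemu_pid is
assumed to be known by the caller.

/* Sketch: lower the scheduling priority of a qemu process so the guest
 * dirties memory more slowly while precopy tries to converge.  Restoring
 * the old priority after migration is left out. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

static int throttle_guest(pid_t qemu_pid)
{
    if (setpriority(PRIO_PROCESS, qemu_pid, 19) < 0) {   /* 19 = lowest */
        perror("setpriority");
        return -1;
    }
    return 0;
}

int main(void)
{
    return throttle_guest(getpid()) ? 1 : 0;   /* demo: throttle ourselves */
}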



One thing I think we need to do is put together a live migration
roadmap. We've got a lot of invasive efforts underway with live
migration and I fear that without some planning and serialization, some
of this useful work will get lost.


Some of them are parallel. I think all the readers here agree that post
copy migration should be an option while we need to maintain the current
one.


I actually think they need to be done mostly in sequence while cleaning 
up some of the current infrastructure.  I don't think we really should 
make any major changes (beyond maybe the separate thread) until we 
eliminate QEMUFile.


There's so much overhead involved in using QEMUFile today, I think it's 
hard to talk about performance data when we've got a major bottleneck 
sitting in the middle.


Regards,

Anthony Liguori



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Avi Kivity

On 08/08/2011 06:29 PM, Anthony Liguori wrote:



- Efficient, reduce needed traffic no need to re-send pages.


It's not quite that simple. Post-copy needs to introduce a protocol
capable of requesting pages.


Just another subsection.. (kidding), still it shouldn't be too
complicated, just an offset+pagesize and return page_content/error


What I meant by this is that there is potentially a lot of round trip 
overhead.  Pre-copy migration works well with reasonable high latency 
network connections because the downtime is capped only by the maximum 
latency sending from one point to another.


But with something like this, the total downtime is 
2*max_latency*nb_pagefaults.  That's potentially pretty high.


Let's be generous and assume that the latency is dominated by page copy 
time.  So the total downtime is equal to the first live migration pass, 
~20 sec for 2GB on 1GbE.  It's distributed over potentially even more 
time, though.  If the guest does a lot of I/O, it may not be noticeable 
(esp. if we don't copy over pages read from disk).  If the guest is 
cpu/memory bound, it'll probably suck badly.
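
A rough back-of-the-envelope check of that figure, assuming about 1 Gb/s of
usable bandwidth and ignoring protocol overhead:

\[
  t \approx \frac{2\,\mathrm{GiB}}{1\,\mathrm{Gb/s}}
    = \frac{2 \times 2^{30} \times 8\ \mathrm{bit}}{10^{9}\ \mathrm{bit/s}}
    \approx 17\ \mathrm{s}
\]

which is consistent with the ~20 sec quoted above once protocol and
page-fault round-trip overhead are added.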




So it may be desirable to try to reduce nb_pagefaults by prefaulting 
in pages, etc.  Suffice to say, this ends up getting complicated and 
may end up burning network traffic too.


Yeah, and prefaulting in the background adds latency to synchronous 
requests.


This really needs excellent networking resources to work well.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Cleber Rosa

On 08/08/2011 07:47 AM, Dor Laor wrote:

On 08/08/2011 01:59 PM, Nadav Har'El wrote:

* What is postcopy live migration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.


Sounds like a cool idea.


The benefit is that total migration time is shorter, because each page is
transferred only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, which enables quick load balancing.
For details, please refer to the papers.


While these are the obvious benefits, the possible downside (that, as
always, depends on the workload) is the amount of time that the guest
workload runs more slowly than usual, waiting for pages it needs to
continue. There are a whole spectrum between the guest pausing 
completely
(which would solve all the problems of migration, but is often 
considered

unacceptible) and running at full-speed. Is it acceptable that the guest
runs at 90% speed during the migration? 50%? 10%?
I guess we could have nothing to lose from having both options, and 
choosing

the most appropriate technique for each guest!


Not sure if it's possible to have smart heuristics on guest memory page 
faults, but maybe a technique that reads ahead more pages if a given 
pattern is detected may help to lower the impact.




+1




That's terrific  (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).


I think new implementation is not the only reason why you cannot patent
this idea :-) Demand-paged migration has actually been discussed (and 
done)
for nearly a quarter of a century (!) in the area of *process* 
migration.


The first use I'm aware of was in CMU's Accent 1987 - see [1].
Another paper, [2], written in 1991, discusses how process migration 
is done

in UCB's Sprite operating system, and evaluates the various alternatives
common at the time (20 years ago), including what it calls "lazy 
copying"
is more-or-less the same thing as "post copy". Mosix (a project 
which, in some
sense, is still alive to day) also used some sort of cross between 
pre-copying
(of dirty pages) and copying on-demand of clean pages (from their 
backing

store on the source machine).


References
[1] "Attacking the Process Migration Bottleneck"
  
http://www.nd.edu/~dthain/courses/cse598z/fall2004/papers/accent.pdf


w/o reading the internals, patents enable you to implement an existing 
idea on a new field. Anyway, there won't be no patent in this case. 
Still let's have the kvm innovation merged.



[2]  "Transparent Process Migration: Design Alternatives and the Sprite
  Implementation"
  
http://nd.edu/~dthain/courses/cse598z/fall2004/papers/sprite-migration.pdf



Advantages:
 - Virtual machines are using more and more memory resources ,
 for a virtual machine with very large working set doing live
 migration with reasonable down time is impossible today.


If a guest actually constantly uses (working set) most of its allocated
memory, it will basically be unable to do any significant amount of work
on the destination VM until this large working set is transfered to the
destination. So in this scenario, "post copying" doesn't give any
significant advantages over plain-old "pause guest and send it to the
destination". Or am I missing something?


There is one key advantage in this scheme/use case - if you have a 
guest with a very large working set, you'll need a very large downtime 
in order to migrate it with today's algorithm. With post copy (aka 
streaming/demand paging), the guest won't have any downtime but will 
run slower than expected.


There are guests today that is impractical to really live migrate them.

btw: Even today, marking pages RO also carries some performance penalty.




Disadvantageous:
 - During the live migration the guest will run slower than in
 today's live migration. We need to remember that even today
 guests suffer from performance penalty on the source during 
the

 COW stage (memory copy).


I wonder if something like asynchronous page faults can help somewhat 
with

multi-process guest workloads (and modified (PV) guest OS).


They should come in to play for some extent. Note that only newer 
Linux guest will enjoy of them.




 - Failure of the source or destination or the network will 
cause
 us to lose the running virtual machine. Those failures are 
very

 rare.


How is this different from a VM running on a single machine that fails?
Just that the small probability of failure (roughly) doubles for the
relatively-short duration of the transfer?


Exactly my point, this is not a major disadvantage because of this low 
pr

Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Anthony Liguori

On 08/08/2011 11:52 AM, Cleber Rosa wrote:

On 08/08/2011 07:47 AM, Dor Laor wrote:

On 08/08/2011 01:59 PM, Nadav Har'El wrote:

* What is postcopy live migration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.


Sounds like a cool idea.


The benefit is that total migration time is shorter, because each page is
transferred only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, which enables quick load balancing.
For details, please refer to the papers.


While these are the obvious benefits, the possible downside (that, as
always, depends on the workload) is the amount of time that the guest
workload runs more slowly than usual, waiting for pages it needs to
continue. There are a whole spectrum between the guest pausing
completely
(which would solve all the problems of migration, but is often
considered
unacceptible) and running at full-speed. Is it acceptable that the guest
runs at 90% speed during the migration? 50%? 10%?
I guess we could have nothing to lose from having both options, and
choosing
the most appropriate technique for each guest!


Not sure if it's possible to have smart heuristics on guest memory page
faults, but maybe a technique that reads ahead more pages if a given
pattern is detected may help to lower the impact.


It's got to be a user choice.  Post-copy can mean unbounded downtime for 
a guest with no way to mitigate it.  It's impossible to cancel a 
post-copy migration.


I actually think the use-cases for post-copy are fairly limited in an 
enterprise environment.


Regards,

Anthony Liguori



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Anthony Liguori

On 08/08/2011 10:36 AM, Avi Kivity wrote:

On 08/08/2011 06:29 PM, Anthony Liguori wrote:



- Efficient, reduce needed traffic no need to re-send pages.


It's not quite that simple. Post-copy needs to introduce a protocol
capable of requesting pages.


Just another subsection.. (kidding), still it shouldn't be too
complicated, just an offset+pagesize and return page_content/error


What I meant by this is that there is potentially a lot of round trip
overhead. Pre-copy migration works well with reasonable high latency
network connections because the downtime is capped only by the maximum
latency sending from one point to another.

But with something like this, the total downtime is
2*max_latency*nb_pagefaults. That's potentially pretty high.


Let's be generous and assume that the latency is dominated by page copy
time. So the total downtime is equal to the first live migration pass,
~20 sec for 2GB on 1GbE. It's distributed over potentially even more
time, though. If the guest does a lot of I/O, it may not be noticeable
(esp. if we don't copy over pages read from disk). If the guest is
cpu/memory bound, it'll probably suck badly.



So it may be desirable to try to reduce nb_pagefaults by prefaulting
in pages, etc. Suffice to say, this ends up getting complicated and
may end up burning network traffic too.


Yeah, and prefaulting in the background adds latency to synchronous
requests.

This really needs excellent networking resources to work well.


Yup, it's very similar to other technologies using RDMA (single system 
image, lock step execution, etc.).


Regards,

Anthony Liguori








Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Dor Laor

On 08/08/2011 06:59 PM, Anthony Liguori wrote:

On 08/08/2011 10:36 AM, Avi Kivity wrote:

On 08/08/2011 06:29 PM, Anthony Liguori wrote:



- Efficient, reduce needed traffic no need to re-send pages.


It's not quite that simple. Post-copy needs to introduce a protocol
capable of requesting pages.


Just another subsection.. (kidding), still it shouldn't be too
complicated, just an offset+pagesize and return page_content/error


What I meant by this is that there is potentially a lot of round trip
overhead. Pre-copy migration works well with reasonable high latency
network connections because the downtime is capped only by the maximum
latency sending from one point to another.

But with something like this, the total downtime is
2*max_latency*nb_pagefaults. That's potentially pretty high.


Let's be generous and assume that the latency is dominated by page copy
time. So the total downtime is equal to the first live migration pass,
~20 sec for 2GB on 1GbE. It's distributed over potentially even more
time, though. If the guest does a lot of I/O, it may not be noticeable
(esp. if we don't copy over pages read from disk). If the guest is
cpu/memory bound, it'll probably suck badly.



So it may be desirable to try to reduce nb_pagefaults by prefaulting
in pages, etc. Suffice to say, this ends up getting complicated and
may end up burning network traffic too.


It is complicated but can help (like pre-faulting the working-set
pages). Beyond that, async page faults will help a bit.
Lastly, if a guest has several apps, those that are memory intensive
might suffer, but lightweight apps will function nicely.
It provides extra flexibility over the current protocol (which still has
value for some of the loads).




Yeah, and prefaulting in the background adds latency to synchronous
requests.

This really needs excellent networking resources to work well.


Yup, it's very similar to other technologies using RDMA (single system
image, lock step execution, etc.).

Regards,

Anthony Liguori











Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Anthony Liguori

On 08/08/2011 04:40 AM, Yaniv Kaul wrote:

On 08/08/2011 12:20, Dor Laor wrote:

On 08/08/2011 06:24 AM, Isaku Yamahata wrote:



Design/Implementation
=
The basic idea of postcopy livemigration is to use a sort of distributed
shared memory between the migration source and destination.

The migration procedure looks like
- start migration
stop the guest VM on the source and send the machine states except
guest RAM to the destination
- resume the guest VM on the destination without guest RAM contents
- Hook guest access to pages, and pull page contents from the source
This continues until all the pages are pulled to the destination

The big picture is depicted at
http://wiki.qemu.org/File:Postcopy-livemigration.png


That's terrific (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).

Advantages:
- No down time due to memory copying.
- Efficient, reduce needed traffic no need to re-send pages.
- Reduce overall RAM consumption of the source and destination
as opposed from current live migration (both the source and the
destination allocate the memory until the live migration
completes). We can free copied memory once the destination guest
received it and save RAM.
- Increase parallelism for SMP guests we can have multiple
virtual CPU handle their demand paging . Less time to hold a
global lock, less thread contention.
- Virtual machines are using more and more memory resources ,
for a virtual machine with very large working set doing live
migration with reasonable down time is impossible today.

Disadvantageous:
- During the live migration the guest will run slower than in
today's live migration. We need to remember that even today
guests suffer from performance penalty on the source during the
COW stage (memory copy).
- Failure of the source or destination or the network will cause
us to lose the running virtual machine. Those failures are very
rare.


I highly doubt that's acceptable in enterprise deployments.


I don't think you can make blanket statements about enterprise deployments.

A lot of enterprises are increasingly building fault tolerance into 
their applications expecting that the underlying hardware will fail. 
With cloud environments like EC2 that experience failure on a pretty 
regular basis, this is just becoming all the more common.


So I really don't view this as a critical issue.  It certainly would be 
if it were the only mechanism available but as long as we can also 
support pre-copy migration it would be fine.


Regards,

Anthony Liguori



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Isaku Yamahata
On Mon, Aug 08, 2011 at 10:47:09PM +0300, Dor Laor wrote:
> On 08/08/2011 06:59 PM, Anthony Liguori wrote:
>> On 08/08/2011 10:36 AM, Avi Kivity wrote:
>>> On 08/08/2011 06:29 PM, Anthony Liguori wrote:

>>> - Efficient, reduce needed traffic no need to re-send pages.
>>
>> It's not quite that simple. Post-copy needs to introduce a protocol
>> capable of requesting pages.
>
> Just another subsection.. (kidding), still it shouldn't be too
> complicated, just an offset+pagesize and return page_content/error

 What I meant by this is that there is potentially a lot of round trip
 overhead. Pre-copy migration works well with reasonable high latency
 network connections because the downtime is capped only by the maximum
 latency sending from one point to another.

 But with something like this, the total downtime is
 2*max_latency*nb_pagefaults. That's potentially pretty high.
>>>
>>> Let's be generous and assume that the latency is dominated by page copy
>>> time. So the total downtime is equal to the first live migration pass,
>>> ~20 sec for 2GB on 1GbE. It's distributed over potentially even more
>>> time, though. If the guest does a lot of I/O, it may not be noticeable
>>> (esp. if we don't copy over pages read from disk). If the guest is
>>> cpu/memory bound, it'll probably suck badly.
>>>

 So it may be desirable to try to reduce nb_pagefaults by prefaulting
 in pages, etc. Suffice to say, this ends up getting complicated and
 may end up burning network traffic too.
>
> It is complicated but can help (like pre faulting working set size  
> pages). Beyond that async page fault will help a bit.
> Lastly, if a guest has several apps, those that are memory intensive  
> might suffer but light weight apps will function nicely.
> It provides extra flexibility over the current protocol (that still has  
> value for some of the loads).

We can also combine postcopy with precopy.
For example, the migration is started in precopy mode at the beginning
and then at some point it is switched into postcopy mode.
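
A rough sketch of what such a hybrid control flow could look like on the
source side (the struct and all helper names below are made up for
illustration; this is not existing qemu code):

    /* Hypothetical hybrid precopy+postcopy loop -- sketch only. */
    struct mig_state {
        int  max_precopy_passes;    /* bound on iterative precopy rounds  */
        long postcopy_threshold;    /* switch when few dirty pages remain */
    };
    extern long dirty_pages_remaining(struct mig_state *s);    /* hypothetical */
    extern void precopy_send_dirty_pages(struct mig_state *s); /* hypothetical */
    extern void switch_to_postcopy(struct mig_state *s);       /* hypothetical */

    static void migrate_hybrid(struct mig_state *s)
    {
        int pass = 0;

        /* Phase 1: ordinary precopy passes push the bulk of guest RAM
         * while the guest keeps running on the source. */
        while (pass < s->max_precopy_passes &&
               dirty_pages_remaining(s) > s->postcopy_threshold) {
            precopy_send_dirty_pages(s);
            pass++;
        }

        /* Phase 2: stop the source, send the device state, resume the
         * guest on the destination; the remaining pages are pulled on
         * demand (postcopy). */
        switch_to_postcopy(s);
    }

This keeps the bulk transfer in precopy while capping the number of pages
that later have to be demand-fetched over the network.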

>
>>>
>>> Yeah, and prefaulting in the background adds latency to synchronous
>>> requests.
>>>
>>> This really needs excellent networking resources to work well.
>>
>> Yup, it's very similar to other technologies using RDMA (single system
>> image, lock step execution, etc.).
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>>
>>
>>
>

-- 
yamahata



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Isaku Yamahata
On Mon, Aug 08, 2011 at 03:38:54PM +0300, Avi Kivity wrote:
> On 08/08/2011 06:24 AM, Isaku Yamahata wrote:
>> This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
>> on which we'll give a talk at KVM-forum.
>> The purpose of this mail is to letting developers know it in advance
>> so that we can get better feedback on its design/implementation approach
>> early before our starting to implement it.
>
> Interesting; what is the impact of increased latency on memory reads?

Many people have already discussed it at length in another thread. :-)
That's much more than I expected.


>> There are several design points.
>>- who takes care of pulling page contents.
>>  an independent daemon vs a thread in qemu
>>  The daemon approach is preferable because an independent daemon would
>>  easy for debug postcopy memory mechanism without qemu.
>>  If required, it wouldn't be difficult to convert a daemon into
>>  a thread in qemu
>
> Isn't this equivalent to touching each page in sequence?

No. I don't get the point of your question.


> Care must be taken that we don't post too many requests, or it could  
> affect the latency of synchronous accesses by the guest.

Yes.


>>- connection between the source and the destination
>>  The connection for live migration can be re-used after sending machine
>>  state.
>>
>>- transfer protocol
>>  The existing protocol that exists today can be extended.
>>
>>- hooking guest RAM access
>>  Introduce a character device to handle page fault.
>>  When page fault occurs, it queues page request up to user space daemon
>>  at the destination. And the daemon pulls page contents from the source
>>  and serves it into the character device. Then the page fault is 
>> resovlved.
>
> This doesn't play well with host swapping, transparent hugepages, or  
> ksm, does it?

No. At least it wouldn't be so difficult to fix, although I haven't looked
at ksm and thp so closely yet.
Although the vma is backed by the device, the populated page is
anonymous (by MAP_PRIVATE or by the driver returning an anonymous page),
so swapping, thp and ksm should work.
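
For illustration, the destination-side fault handler of such a character
device could look roughly like the sketch below. This is not working driver
code: postcopy_fetch_page() is a made-up placeholder for "queue the request
to the daemon and block until the contents arrive", and the sketch glosses
over exactly the anonymous-rmap details being discussed here.

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Hypothetical: ask the user space daemon for this page and block
     * until it has been pulled from the migration source. */
    extern void postcopy_fetch_page(struct vm_area_struct *vma,
                                    pgoff_t pgoff, struct page *page);

    static int postcopy_vm_fault(struct vm_area_struct *vma,
                                 struct vm_fault *vmf)
    {
        struct page *page = alloc_page(GFP_HIGHUSER_MOVABLE);

        if (!page)
            return VM_FAULT_OOM;

        postcopy_fetch_page(vma, vmf->pgoff, page);

        /* A real driver would additionally have to make the inserted
         * page look anonymous to the rest of the mm (rmap setup etc.)
         * so that swap/THP/KSM can deal with it later on. */
        vmf->page = page;
        return 0;
    }

    static const struct vm_operations_struct postcopy_vm_ops = {
        .fault = postcopy_vm_fault,
    };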


> I see you note this later on.
>
>> * More on hooking guest RAM access
>> There are several candidate for the implementation. Our preference is
>> character device approach.
>>
>>- inserting hooks into everywhere in qemu/kvm
>>  This is impractical
>>
>>- backing store for guest ram
>>  a block device or a file can be used to back guest RAM.
>>  Thus hook the guest ram access.
>>
>>  pros
>>  - new device driver isn't needed.
>>  cons
>>  - future improvement would be difficult
>>  - some KVM host feature(KSM, THP) wouldn't work
>>
>>- character device
>>  qemu mmap() the dedicated character device, and then hook page fault.
>>
>>  pros
>>  - straght forward approach
>>  - future improvement would be easy
>>  cons
>>  - new driver is needed
>>  - some KVM host feature(KSM, THP) wouldn't work
>>They checks if a given VMA is anonymous. This can be fixed.
>>
>>- swap device
>>  When creating guest, it is set up as if all the guest RAM is swapped out
>>  to a dedicated swap device, which may be nbd disk (or some kind of user
>>  space block device, BUSE?).
>>  When the VM tries to access memory, swap-in is triggered and IO to the
>>  swap device is issued. Then the IO to swap is routed to the daemon
>>  in user space with nbd protocol (or BUSE, AOE, iSCSI...). The daemon 
>> pulls
>>  pages from the migration source and services the IO request.
>>
>>  pros
>>  - After the page transfer is complete, everything is same as normal 
>> case.
>>  - no new device driver isn't needed
>>  cons
>>  - future improvement would be difficult
>>  - administration: setting up nbd, swap device
>>
>
> Using a swap device would be my preference.  We'd still be using  
> anonymous memory so thp/ksm/ordinary swap still work.
>
> It would need to be a special kind of swap device since we only want to  
> swap in, and never out, to that device.  We'd also need a special way of  
> telling the kernel that memory comes from that device.  In that it's  
> similar your second option.
>
> Maybe we should use a backing file (using nbd) and have a madvise() call  
> that converts the vma to anonymous memory once the migration is finished.

Whichever option we take, I'd like to somehow convert the vma into an
anonymous area after the migration completes, i.e. nulling vma->vm_ops.
(The pages are already anonymous.)

It seems troublesome, involving complicated races/locking, so I'm not sure
it's worthwhile.
-- 
yamahata



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-10 Thread Avi Kivity

On 08/09/2011 05:33 AM, Isaku Yamahata wrote:

On Mon, Aug 08, 2011 at 03:38:54PM +0300, Avi Kivity wrote:
>  On 08/08/2011 06:24 AM, Isaku Yamahata wrote:
>>  This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
>>  on which we'll give a talk at KVM-forum.
>>  The purpose of this mail is to letting developers know it in advance
>>  so that we can get better feedback on its design/implementation approach
>>  early before our starting to implement it.
>
>  Interesting; what is the impact of increased latency on memory reads?

Many people have already discussed it at length in another thread. :-)
That's much more than I expected.


Can you point me to the discussion?



>>  There are several design points.
>> - who takes care of pulling page contents.
>>   an independent daemon vs a thread in qemu
>>   The daemon approach is preferable because an independent daemon would
>>   easy for debug postcopy memory mechanism without qemu.
>>   If required, it wouldn't be difficult to convert a daemon into
>>   a thread in qemu
>
>  Isn't this equivalent to touching each page in sequence?

No. I don't get the point of your question.


If you have a qemu thread that does

   for (each guest page)
       sum += *(char *)page;

doesn't that effectively pull all pages from the source node?

(but maybe I'm assuming that the kernel takes care of things and this 
isn't the case?)



>>
>> - hooking guest RAM access
>>   Introduce a character device to handle page fault.
>>   When page fault occurs, it queues page request up to user space daemon
>>   at the destination. And the daemon pulls page contents from the source
>>   and serves it into the character device. Then the page fault is 
resovlved.
>
>  This doesn't play well with host swapping, transparent hugepages, or
>  ksm, does it?

No. At least it wouldn't be so difficult to fix, although I haven't looked
at ksm and thp so closely yet.
Although the vma is backed by the device, the populated page is
anonymous (by MAP_PRIVATE or by the driver returning an anonymous page),
so swapping, thp and ksm should work.


I'm not 100% sure, but I think that thp and ksm need the vma to be 
anonymous, not just the page.



>
>  It would need to be a special kind of swap device since we only want to
>  swap in, and never out, to that device.  We'd also need a special way of
>  telling the kernel that memory comes from that device.  In that it's
>  similar your second option.
>
>  Maybe we should use a backing file (using nbd) and have a madvise() call
>  that converts the vma to anonymous memory once the migration is finished.

Whichever option we take, I'd like to somehow convert the vma into an
anonymous area after the migration completes, i.e. nulling vma->vm_ops.
(The pages are already anonymous.)

It seems troublesome, involving complicated races/locking, so I'm not sure
it's worthwhile.


Andrea, what's your take on this?

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-10 Thread Isaku Yamahata
On Wed, Aug 10, 2011 at 04:55:32PM +0300, Avi Kivity wrote:
> On 08/09/2011 05:33 AM, Isaku Yamahata wrote:
>> On Mon, Aug 08, 2011 at 03:38:54PM +0300, Avi Kivity wrote:
>> >  On 08/08/2011 06:24 AM, Isaku Yamahata wrote:
>> >>  This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
>> >>  on which we'll give a talk at KVM-forum.
>> >>  The purpose of this mail is to letting developers know it in advance
>> >>  so that we can get better feedback on its design/implementation approach
>> >>  early before our starting to implement it.
>> >
>> >  Interesting; what is the impact of increased latency on memory reads?
>>
>> Many people has already discussed it much in another thread. :-)
>> That's much more than I expected.
>
> Can you point me to the discussion?

I misunderstood your question.
Please refer to the papers, which include evaluation results covering
network latency; they discuss it in detail.
The presentation that we will give at the KVM Forum also includes
some results.


>> >>  There are several design points.
>> >> - who takes care of pulling page contents.
>> >>   an independent daemon vs a thread in qemu
>> >>   The daemon approach is preferable because an independent daemon 
>> >> would
>> >>   easy for debug postcopy memory mechanism without qemu.
>> >>   If required, it wouldn't be difficult to convert a daemon into
>> >>   a thread in qemu
>> >
>> >  Isn't this equivalent to touching each page in sequence?
>>
>> No. I don't get your point of this question.
>
> If you have a qemu thread that does
>
>for (each guest page)
>sum += *(char *)page;
>
> doesn't that effectively pull all pages from the source node?
>
> (but maybe I'm assuming that the kernel takes care of things and this  
> isn't the case?)

Now I see your point. Right, it doesn't matter who starts the access
to guest RAM.
My point is that, after the page fault, someone has to resolve the fault
by sending a request for the page to the migration source.

I think daemon vs. thread isn't a big issue anyway.
If nbd with a swap device is used, the IO requests may be sent to the source
directly.
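
As an illustration of that flow, the destination-side daemon could be as
simple as the sketch below. The wire format (offset + size request, raw
page contents in the reply) and the read()/write() interface of the
character device are made up here; only the overall shape matters.

    #include <stdint.h>
    #include <unistd.h>

    struct page_req {
        uint64_t offset;    /* guest RAM offset of the faulting page */
        uint32_t size;      /* page size */
    };

    static void serve_faults(int dev_fd, int src_fd)
    {
        struct page_req req;
        char buf[4096];

        for (;;) {
            /* The char device queues one request per guest page fault. */
            if (read(dev_fd, &req, sizeof(req)) != sizeof(req))
                break;
            if (req.size > sizeof(buf))
                break;

            /* Forward the request to the migration source and read the
             * page contents back. */
            write(src_fd, &req, sizeof(req));
            read(src_fd, buf, req.size);

            /* Serve the contents into the device; this resolves the
             * fault and lets the guest vcpu continue. */
            write(dev_fd, buf, req.size);
        }
    }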


>> >> - hooking guest RAM access
>> >>   Introduce a character device to handle page fault.
>> >>   When page fault occurs, it queues page request up to user space 
>> >> daemon
>> >>   at the destination. And the daemon pulls page contents from the 
>> >> source
>> >>   and serves it into the character device. Then the page fault is 
>> >> resovlved.
>> >
>> >  This doesn't play well with host swapping, transparent hugepages, or
>> >  ksm, does it?
>>
>> No. At least it wouldn't be so difficult to fix it, I haven't looked ksm,
>> thp so closely though.
>> Although the vma is backed by the device, the populated page is
>> anonymous. (by MMAP_PRIVATE or the deriver returning anonymous page)
>> So swapping, thp, ksm should work.
>
> I'm not 100% sure, but I think that thp and ksm need the vma to be  
> anonymous, not just the page.

Yes, they seem to check that not only the page but also the vma is anonymous.
I'd like to hear from Andrea before digging into the code deeply.


>> >  It would need to be a special kind of swap device since we only want to
>> >  swap in, and never out, to that device.  We'd also need a special way of
>> >  telling the kernel that memory comes from that device.  In that it's
>> >  similar your second option.
>> >
>> >  Maybe we should use a backing file (using nbd) and have a madvise() call
>> >  that converts the vma to anonymous memory once the migration is finished.
>>
>> With whichever options, I'd like to convert the vma into anonymous area
>> after the migration completes somehow. i.e. nulling vma->vm_ops.
>> (The pages are already anonymous.)
>>
>> It seems troublesome involving complicated races/lockings. So I'm not sure
>> it's worthwhile.
>
> Andrea, what's your take on this?

I'd also like to hear from those who are familiar with ksm/thp.

If it is possible to convert the vma into an anonymous one, then whether it
is a swap device or backed by a device/file wouldn't matter with respect to
ksm and thp. Does acquiring mmap_sem suffice?
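
For concreteness, the operation being discussed is roughly the sketch below.
This is a conceptual sketch only; whether taking mmap_sem for write is
actually sufficient, and which races remain against page faults, khugepaged
and ksmd, is exactly the open question here.

    #include <linux/file.h>
    #include <linux/mm.h>

    /* Detach the postcopy backing once migration has completed. */
    static void postcopy_make_vma_anonymous(struct mm_struct *mm,
                                            struct vm_area_struct *vma)
    {
        down_write(&mm->mmap_sem);
        if (vma->vm_file) {
            fput(vma->vm_file);
            vma->vm_file = NULL;
        }
        vma->vm_ops = NULL;    /* the populated pages are already anonymous */
        up_write(&mm->mmap_sem);
    }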

thanks,
-- 
yamahata



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-11 Thread Andrea Arcangeli
Hello everyone,

so basically this is a tradeoff between not having a long latency for
the migration to succeed and reducing the total network traffic (and
CPU load) in the migration source and destination and reducing the
memory footprint a bit, by adding an initial latency to the memory
accesses on the destination of the migration (i.e. causing a more
significant and noticeable slowdown to the guest).

It's more or less as if, when the guest starts on the destination
node, it finds all its memory swapped out to a network swap
device, so it needs to do I/O for the first access (side note: and
hopefully it won't run out of memory while the memory is being copied to
the destination node, or the guest will crash).

On Thu, Aug 11, 2011 at 11:19:19AM +0900, Isaku Yamahata wrote:
> On Wed, Aug 10, 2011 at 04:55:32PM +0300, Avi Kivity wrote:
> > I'm not 100% sure, but I think that thp and ksm need the vma to be  
> > anonymous, not just the page.
> 
> Yes, they seems to check if not only the page is anonymous, but also the vma.
> I'd like to hear from Andrea before digging into the code deeply.

The vma doesn't need to be anonymous for THP; an mmap of /dev/zero
MAP_PRIVATE is also backed by THP. But it must be close to anonymous
and not have special VM_IO/PFNMAP flags, or khugepaged/ksm will not
scan it. ->vm_file itself isn't checked by THP/KSM (sure for THP
because of the /dev/zero example which I explicitly fixed, as it wasn't
fully handled initially). NOTE: a chardevice won't work on RHEL6
because I didn't allow /dev/zero to use it there (it wasn't an
important enough feature and it was more risky), but upstream it should
work already.

A chardevice doing this may work, even if it would be simpler/cleaner
if this was still an anonymous vma. A chardevice could act similar to
/dev/zero MAP_PRIVATE. In theory KSM should work on /dev/zero too, you
can test that if you want. But a chardevice will require dealing with
permissions when we don't actually need special permissions for this.
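
A quick user space way to check the /dev/zero MAP_PRIVATE behaviour on a
given kernel (just a test sketch, error handling mostly omitted): map a few
MB, touch it, and look at the AnonHugePages counter for that vma in
/proc/<pid>/smaps.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 8 * 1024 * 1024;
        int fd = open("/dev/zero", O_RDONLY);
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE, fd, 0);

        if (p == MAP_FAILED)
            return 1;
        memset(p, 1, len);    /* fault everything in (writes => private pages) */
        printf("mapped at %p; check AnonHugePages in /proc/%d/smaps\n",
               p, (int)getpid());
        pause();              /* keep the mapping alive while inspecting it */
        return 0;
    }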

Another problem is that you can't migrate the stuff using hugepages, or it'd
multiply the latency 512 times (a 2M hugepage is 512 4K pages; with 2M
contiguous access it won't make a difference, but if the guest is accessing
memory randomly it would make a difference). So you will have to rely on
khugepaged to collapse the hugepages later. That should work, but initially
the guest will run slower even when the migration is already fully completed.

> If it is possible to convert the vma into anonymous, swap device or
> backed by device/file wouldn't matter in respect to ksm and thp.
> Acquiring mmap_sem suffices?

A swap device would require root permissions and we don't want qemu to
mangle over the swapdevices automatically. It'd be bad to add new
admin requirements, few people would use it. Ideally the migration API
should remain the same and it should be an internal tweak in qemu to
select which migration mode to use beforehand.

Even if it was a swap device, it'd still require special operations to
set up swap entries in the process pagetables before the pages
exist. A swap device may give more complication than it solves.

If it was only KVM accessing the guest physical memory, we could just
handle it in KVM and call get_user_pages_fast; if that fails and it's
the first ever invocation, we just talk with QEMU to get the page and
establish it by hand. But qemu can also write to memory, and if it's a
partial write and the guest reads the not-yet-written part with
get_user_pages_fast+spte establishment, it'll go wrong. Maybe qemu is
already doing all checks on the pages it's going to write and we could
hook there too from the qemu side.
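
A conceptual sketch of that KVM-only idea (not actual KVM code;
request_page_from_qemu() is a made-up placeholder for the "talk with QEMU"
step):

    #include <linux/mm.h>

    extern void request_page_from_qemu(unsigned long hva);    /* hypothetical */

    /* Resolve a guest page on first access, KVM side only -- sketch. */
    static struct page *postcopy_hva_to_page(unsigned long hva)
    {
        struct page *page;

        /* Fast path: the page is already present on the destination. */
        if (get_user_pages_fast(hva, 1, 1, &page) == 1)
            return page;

        /* First access: ask qemu (or the daemon) to fetch the page from
         * the migration source, then retry. */
        request_page_from_qemu(hva);
        if (get_user_pages_fast(hva, 1, 1, &page) == 1)
            return page;

        return NULL;
    }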

Another, more generic (not KVM centric) way that will not require a special
chardev or a special daemon could be a new pair of syscalls:

sys_set_muserload(unsigned long start, unsigned long len, int signal)
sys_muserload(void *from, void *to)

When sys_set_muserload is called the region start,start+len gets
covered by muserload swap entries that trigger special page faults.

When anything touches the memory with an muserload swap entry still
set, the thread gets a signal with force_sig_info_fault(si_signo =
signal), and the signal handler will get the faulting address in
info.si_addr. The signal handler is then responsible for calling
sys_muserload after talking to the thread that does the TCP send()/recv()
with the qemu source. The recv(mmap(4096), 4096) should
generate a page in the destination node in some random (aligned)
mapping.

Then the muserload(tcp_received_page_address,
guest_faulting_physical_address_from_info_si_addr) does get_user_pages
on tcp_received_page_address, takes the page away from
tcp_received_page_address (clears the pte at that address), adjusts
page->index for the new vma, and maps the page zerocopy atomically into
the new "guest_faulting_physical_address_from_info_si_addr" address,
if and only if the pagetable at that address is still of muserload
type. Then the signal handler does munmap(tcp_received_page_address, 4096)
to truncate/free the vma.
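
To make the proposed flow concrete, a destination-side user of this (not yet
existing) API might look like the sketch below; the syscall prototypes, the
choice of SIGUSR1 and fetch_page_from_source() are all assumptions made for
illustration only.

    #include <signal.h>
    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096

    /* The syscalls proposed above -- prototypes assumed for the sketch. */
    extern long sys_set_muserload(unsigned long start, unsigned long len, int sig);
    extern long sys_muserload(void *from, void *to);

    /* Hypothetical: recv() one page from the source into a fresh,
     * page-aligned private mapping and return its address. */
    extern void *fetch_page_from_source(void *guest_addr);

    static void muserload_handler(int sig, siginfo_t *info, void *uctx)
    {
        void *fault_addr = (void *)((unsigned long)info->si_addr &
                                    ~(unsigned long)(PAGE_SIZE - 1));
        void *tmp = fetch_page_from_source(fault_addr);

        /* Move the received page into place, but only if the pte at
         * fault_addr is still a muserload entry. */
        sys_muserload(tmp, fault_addr);
        munmap(tmp, PAGE_SIZE);
    }

    static void arm_postcopy(void *guest_ram, size_t ram_size)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = muserload_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGUSR1, &sa, NULL);

        /* Cover all of guest RAM with muserload swap entries so that the
         * first touch of each page raises SIGUSR1 with si_addr set. */
        sys_set_muserload((unsigned long)guest_ram, ram_size, SIGUSR1);
    }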