Re: [Qemu-devel] live block copy/stream/snapshot discussion

2011-07-14 Thread Stefan Hajnoczi
Here is the latest interface.  I'm now updating existing patches to
implement and test it (not yet using generic image stream):

http://wiki.qemu.org/Features/LiveBlockMigration/ImageStreamingAPI

=Changelog=
v2:
* Remove iteration interface where management tool drives individual
copy iterations
* Add block_stream_cancel command (like migrate_cancel)
* Add 'base' common backing file argument to block_stream
* Replace QError object in BLOCK_STREAM_COMPLETED with an error string
* Add error documentation

=Image streaming API=

The stream commands populate an image file by streaming data from its backing
file.  Once all blocks have been streamed, the dependency on the original
backing image is removed.  The stream commands can be used to implement
post-copy live block migration and rapid deployment.

The block_stream command starts streaming the image file.  When streaming
completes successfully or with an error, the BLOCK_STREAM_COMPLETED event is
raised.

The progress of a streaming operation can be polled using query-block-stream.
This returns information regarding how much of the image has been streamed.

The block_stream_cancel command stops streaming the image file.  The image file
retains its backing file.  A new streaming operation can be started at a later
time.

The command synopses are as follows:

 block_stream
 ------------

 Copy data from a backing file into a block device.

 The block streaming operation is performed in the background until the entire
 backing file has been copied.  This command returns immediately once streaming
 has started.  The status of ongoing block streaming operations can be checked
 with query-block-stream.  The operation can be stopped before it has completed
 using the block_stream_cancel command.

 If a base file is specified then sectors are not copied from that base file and
 its backing chain.  When streaming completes the image file will have the base
 file as its backing file.  This can be used to stream a subset of the backing
 file chain instead of flattening the entire image.

 On successful completion the image file is updated to drop the backing file.

 Arguments:

 - device: device name (json-string)
 - base:   common backing file (json-string, optional)

 Errors:

 DeviceInUse:    streaming is already active on this device
 DeviceNotFound: device name is invalid
 NotSupported:   image streaming is not supported by this device

 Events:

 On completion the BLOCK_STREAM_COMPLETED event is raised with the following
 fields:

 - device: device name (json-string)
 - len:    size of the device, in bytes (json-int)
 - offset: last offset of completed I/O, in bytes (json-int)
 - error:  error message (json-string, only on error)

 The completion event is raised both on success and on failure.

 Examples:

 -> { "execute": "block_stream", "arguments": { "device": "virtio0" } }
 <- { "return":  {} }
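
To make the start command and the completion event above more concrete, here is a minimal Python sketch. It only builds the request and interprets the event fields listed above (device, base, len, offset, error). The function names and the sample error text are made up for illustration, and no monitor transport is shown.

```python
import json


def block_stream_cmd(device, base=None):
    """Build a block_stream request; 'base' is optional, as in the spec above."""
    args = {"device": device}
    if base is not None:
        args["base"] = base
    return {"execute": "block_stream", "arguments": args}


def describe_completion(fields):
    """Summarise the fields of a BLOCK_STREAM_COMPLETED event.

    The 'error' field is present only on failure.
    """
    device = fields["device"]
    if "error" not in fields:
        return f"Streaming {device} completed"
    # Integer percentage of the device reached before the failure.
    percent = 100 * fields["offset"] // fields["len"] if fields["len"] else 0
    return f"Streaming {device} failed at {percent}%: {fields['error']}"


if __name__ == "__main__":
    print(json.dumps(block_stream_cmd("virtio0")))
    # Hypothetical failure payload; the error text is invented sample data.
    print(describe_completion(
        {"device": "virtio0", "len": 300, "offset": 100, "error": "I/O error"}))
```

Because the event is raised on both success and failure, a caller that does not poll can still report how far streaming got from this one payload.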

 block_stream_cancel
 -------------------

 Stop an active block streaming operation.

 This command returns once the active block streaming operation has been
 stopped.  It is an error to call this command if no operation is in progress.

 The image file retains its backing file unless the streaming operation happens
 to complete just as it is being cancelled.

 A new block streaming operation can be started at a later time to finish
 copying all data from the backing file.

 Arguments:

 - device: device name (json-string)

 Errors:

 DeviceNotActive: streaming is not active on this device
 DeviceInUse:     cancellation already in progress

 Examples:

 -> { "execute": "block_stream_cancel", "arguments": { "device": "virtio0" } }
 <- { "return":  {} }

 query-block-stream
 ------------------

 Show progress of ongoing block_stream operations.

 Return a json-array of all block streaming operations.  If no operation is
 active then return an empty array.  Each operation is a json-object with the
 following data:

 - device: device name (json-string)
 - len:    size of the device, in bytes (json-int)
 - offset: ending offset of the completed I/O, in bytes (json-int)

 Example:

 -> { "execute": "query-block-stream" }
 <- { "return": [
        { "device": "virtio0", "len": 10737418240, "offset": 709632 }
      ]
    }
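
As a small illustration of how a management tool might turn this reply into a progress figure, the following Python sketch divides offset by len for each operation. The function name is made up, and treating a zero-length device as already complete is an assumption of the sketch, not part of the interface.

```python
def stream_progress(operations):
    """Map each device name to the fraction streamed so far (0.0 to 1.0)."""
    progress = {}
    for op in operations:
        length = op["len"]
        # Assumption: a zero-length device has nothing left to stream.
        progress[op["device"]] = op["offset"] / length if length else 1.0
    return progress


if __name__ == "__main__":
    # The reply from the example above.
    reply = {"return": [
        {"device": "virtio0", "len": 10737418240, "offset": 709632},
    ]}
    for device, fraction in stream_progress(reply["return"]).items():
        print(f"{device}: {fraction:.4%}")
```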

=How live block copy works=

Live block copy does the following:

# Create and switch to the destination file: snapshot_blkdev virtio-blk0 destination.$fmt $fmt
# Stream the base into the image file: block_stream -a virtio-blk0
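
The two steps can be sketched as a helper that produces the monitor command lines in order. This is only an illustration: the helper name is made up, the command strings mirror the ones quoted above, and sending them to the monitor is left out.

```python
def live_block_copy_commands(device, destination, fmt):
    """Return the two monitor commands for live block copy, in order."""
    return [
        # Step 1: create the destination image and switch the device to it.
        f"snapshot_blkdev {device} {destination} {fmt}",
        # Step 2: stream the base into the new image file.
        f"block_stream -a {device}",
    ]


if __name__ == "__main__":
    for line in live_block_copy_commands("virtio-blk0", "destination.qcow2", "qcow2"):
        print(line)
```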

Stefan



Re: [Qemu-devel] live block copy/stream/snapshot discussion

2011-07-14 Thread Kevin Wolf
On 14.07.2011 11:39, Stefan Hajnoczi wrote:
  Events:
 
  On completion the BLOCK_STREAM_COMPLETED event is raised with the following
  fields:
 
  - device: device name (json-string)
  - len:    size of the device, in bytes (json-int)
  - offset: last offset of completed I/O, in bytes (json-int)
  - error:  error message (json-string, only on error)
 
  The completion event is raised both on success and on failure.

Why do len/offset matter in a completion event?

Other than that it looks good to me.

Kevin



Re: [Qemu-devel] live block copy/stream/snapshot discussion

2011-07-14 Thread Kevin Wolf
On 14.07.2011 12:00, Stefan Hajnoczi wrote:
 On Thu, Jul 14, 2011 at 10:55 AM, Kevin Wolf kw...@redhat.com wrote:
 On 14.07.2011 11:39, Stefan Hajnoczi wrote:
  Events:

  On completion the BLOCK_STREAM_COMPLETED event is raised with the following
  fields:

  - device: device name (json-string)
  - len:    size of the device, in bytes (json-int)
  - offset: last offset of completed I/O, in bytes (json-int)
  - error:  error message (json-string, only on error)

  The completion event is raised both on success and on failure.

 Why do len/offset matter in a completion event?
 
 For completeness.  You could see it as telling you how much progress
 was made before an error occurred.  In the success case offset will
 always be equal to len.  But in the error case you get the last
 completed progress before error, which could be useful (for example if
 you weren't polling but want to display "Streaming virtio-blk0 failed
 at 33%").

Makes sense.

We also need to define the possible error messages, and probably use an
enum instead of a string.

Kevin



Re: [Qemu-devel] live block copy/stream/snapshot discussion

2011-07-13 Thread Stefan Hajnoczi
On Tue, Jul 12, 2011 at 5:10 PM, Kevin Wolf kw...@redhat.com wrote:
 On 12.07.2011 17:45, Stefan Hajnoczi wrote:
 The command synopses are as follows:

 block_stream
 ------------

 Copy data from a backing file into a block device.

 If the optional 'all' argument is true, this operation is performed in the
 background until the entire backing file has been copied.  The status of
 ongoing block_stream operations can be checked with query-block-stream.

 Not sure if it's a good idea to use a bool argument to turn a command
 into its opposite. I think having a separate command for stopping would
 be cleaner. Something for the QMP folks to decide, though.

 git branch new_branch
 git branch -D new_branch

 Makes sense to me :)

 I don't think you should compare a command line option to a programming
 interface. Having a git_create_branch(const char *name, bool delete)
 would really look strange. Anyway, probably a matter of taste.

 A hint that separate commands would make sense is that the stop command
 won't need the other arguments that the start command gets ('all' and
 'base').

I can see your point.  Splitting the command might make the code more
straightforward and eliminate the need for checking invalid argument
combinations.

 Arguments:

 - all:    copy entire device (json-bool, optional)
 - stop:   stop copying to device (json-bool, optional)
 - device: device name (json-string)

 It must be possible to specify backing file that will be
 active after streaming finishes (data from that file will not
 be streamed into active file, of course).

 Yes, I think the common base image belongs here.

 Right.  We need to specify it by filename:

   - base: filename of base file (json-string, optional)

   Sectors are not copied from the base file and its backing file
   chain.  The following describes this feature:
     Before: base <- sn1 <- sn2 <- sn3 <- vm.img
     After:  base <- vm.img

 Does this imply that a rebase -u happens always after completion?

Yes.  The current implementation removes the backing file when
streaming completes.  I think this is the right thing to do since all
sectors are now allocated - there is no way to use the backing file
anymore.

If we don't change the backing file on streaming completion, then the
user has to issue an extra command.  There's nothing to gain by doing
that so I think rebase -u should happen on completion.

 With all = false, where does the streaming begin?

 Streaming begins at the start of the image.

 Do you have something like the current streaming offset in the state of 
 each BlockDriverState?

 Yes, there is a StreamState for each block device that has an
 in-progress operation.  The progress is saved between block_stream
 (without -a) invocations so the caller does not need to specify the
 streaming offset as an argument.

 Thanks for pointing out these weaknesses in the documentation.  It
 should really be explained fully.

 I think we also need to describe error cases. For example, what happens
 if you try to start streaming while it's already in progress?

Yes, will do.

 Return:

 - device: device name (json-string)
 - len:    size of the device, in bytes (json-int)
 - offset: ending offset of the completed I/O, in bytes (json-int)

 So you only get the reply when the request has completed? With the
 current monitor, this means that QMP is blocked while we stream, doesn't
 it? How are you supposed to send the stop command then?

 Incomplete documentation again, sorry.  The block_stream command
 behaves as follows:

 1. block_stream all returns immediately and the BLOCK_STREAM_COMPLETED
 event is raised when streaming completes either successfully or with
 an error.

 2. block_stream stop returns when the in-progress streaming operation
 has been safely stopped.

 3. block_stream returns when one iteration of streaming has completed.

 Two of three examples below have an empty return value instead, so they
 are not compliant to this specification.

 I will update the documentation, the non-all invocations do not return 
 anything.

 Okay, then I don't understand what the 'offset' return value means. The
 text says offset of the completed I/O. If all=true immediately
 returns, shouldn't it always be 0?

The 'offset' value gives you an indication of progress when using the
iteration interface.  You don't need to separately call
query-block-stream, instead you can use the return value from the
iteration interface to get progress information.

However, let's drop iteration.

 I find it rather disturbing that a command like 'change' has made it
 into QMP... Anyway, I don't think this is really what we need.

 We have two switches to do. The first one happens before starting the
 copy: Creating the copy, with the source as its backing file, and
 switching to that. The monitor command to achieve this is snapshot_blkdev.

 I don't think that creating image files in QEMU is going to work when
 running KVM with libvirt (SELinux).  The QEMU process does not 

Re: [Qemu-devel] live block copy/stream/snapshot discussion

2011-07-12 Thread Kevin Wolf
On 11.07.2011 18:32, Marcelo Tosatti wrote:
 On Mon, Jul 11, 2011 at 03:47:15PM +0100, Stefan Hajnoczi wrote:
 Kevin, Marcelo,
 I'd like to reach agreement on the QMP/HMP APIs for live block copy
 and image streaming.  Libvirt has acked the image streaming APIs that
 Adam proposed and I think they are a good fit for the feature.  I have
 described that API below for your review (it's exactly what the QED
 Image Streaming patches provide).

 Marcelo: Are you happy with this API for live block copy?  Also please
 take a look at the switch command that I am proposing.

 Image streaming API
 ===================

 For leaf images with copy-on-read semantics, the stream commands allow the
 user to populate local blocks by manually streaming them from the backing
 image.
 Once all blocks have been streamed, the dependency on the original backing
 image can be removed.  Therefore, stream commands can be used to implement
 post-copy live block migration and rapid deployment.

 The block_stream command can be used to stream a single cluster, to
 start streaming the entire device, and to cancel an active stream.  It
 is easiest to allow the block_stream command to manage streaming for the
 entire device but a management tool could use single cluster mode to
 throttle the I/O rate.

As discussed earlier, having the management send requests for each
single cluster doesn't make any sense at all. It wouldn't only throttle
the I/O rate but bring it down to a level that makes it unusable. What
you really want is to allow the management to give us a range (offset +
length) that qemu should stream.

 The command synopses are as follows:

 block_stream
 ------------

 Copy data from a backing file into a block device.

 If the optional 'all' argument is true, this operation is performed in the
 background until the entire backing file has been copied.  The status of
 ongoing block_stream operations can be checked with query-block-stream.

Not sure if it's a good idea to use a bool argument to turn a command
into its opposite. I think having a separate command for stopping would
be cleaner. Something for the QMP folks to decide, though.

 Arguments:

 - all:    copy entire device (json-bool, optional)
 - stop:   stop copying to device (json-bool, optional)
 - device: device name (json-string)
 
 It must be possible to specify backing file that will be
 active after streaming finishes (data from that file will not 
 be streamed into active file, of course).

Yes, I think the common base image belongs here.

With all = false, where does the streaming begin? Do you have something
like the current streaming offset in the state of each
BlockDriverState? As I said above, I would prefer adding offset and
length to the arguments.

 Return:

 - device: device name (json-string)
 - len:    size of the device, in bytes (json-int)
 - offset: ending offset of the completed I/O, in bytes (json-int)

So you only get the reply when the request has completed? With the
current monitor, this means that QMP is blocked while we stream, doesn't
it? How are you supposed to send the stop command then?

Two of three examples below have an empty return value instead, so they
are not compliant to this specification.

 Examples:

 -> { "execute": "block_stream", "arguments": { "device": "virtio0" } }
 <- { "return":  { "device": "virtio0", "len": 10737418240, "offset": 512 } }

 -> { "execute": "block_stream", "arguments": { "all": true, "device": "virtio0" } }
 <- { "return": {} }

 -> { "execute": "block_stream", "arguments": { "stop": true, "device": "virtio0" } }
 <- { "return": {} }

 query-block-stream
 ------------------

 Show progress of ongoing block_stream operations.

 Return a json-array of all operations.  If no operation is active then an
 empty array will be returned.  Each operation is a json-object with the
 following data:

 - device: device name (json-string)
 - len:    size of the device, in bytes (json-int)
 - offset: ending offset of the completed I/O, in bytes (json-int)

 Example:

 -> { "execute": "query-block-stream" }
 <- { "return": [
        { "device": "virtio0", "len": 10737418240, "offset": 709632 }
      ]
    }

When block_stream is changed, this will have to make the same changes.

 Block device switching API
 ==========================

 Extend the 'change' command to support changing the image file without
 media change notification.

 Perhaps we should take the opportunity to add a format argument for
 image files?

 change
 ------

 Change a removable medium or VNC configuration.

 Arguments:

 - device: device name (json-string)
 - target: filename or item (json-string)
 - arg: additional argument (json-string, optional)
 - notify: whether to notify guest, defaults to true (json-bool, optional)

 Examples:

 1. Change a removable medium

 -> { "execute": "change",
      "arguments": { "device": "ide1-cd0",
                     "target": "/srv/images/Fedora-12-x86_64-DVD.iso" } }
 <- { "return": {} }

 2. Change a disk without media change notification

 -> { "execute": "change",
   

Re: [Qemu-devel] live block copy/stream/snapshot discussion

2011-07-12 Thread Kevin Wolf
On 12.07.2011 17:45, Stefan Hajnoczi wrote:
 Image streaming API
 ===================

 For leaf images with copy-on-read semantics, the stream commands allow the
 user to populate local blocks by manually streaming them from the backing
 image.
 Once all blocks have been streamed, the dependency on the original backing
 image can be removed.  Therefore, stream commands can be used to implement
 post-copy live block migration and rapid deployment.

 The block_stream command can be used to stream a single cluster, to
 start streaming the entire device, and to cancel an active stream.  It
 is easiest to allow the block_stream command to manage streaming for the
 entire device but a management tool could use single cluster mode to
 throttle the I/O rate.

 As discussed earlier, having the management send requests for each
 single cluster doesn't make any sense at all. It wouldn't only throttle
 the I/O rate but bring it down to a level that makes it unusable. What
 you really want is to allow the management to give us a range (offset +
 length) that qemu should stream.
 
 I feel that an iteration interface is problematic whether the
 management tool or QEMU decide what to stream.  Let's have just the
 background streaming operation.
 
 The problem with byte ranges is two-fold.  The management tool doesn't
 know which regions of the image are allocated so it may do a lot of
 nop calls to already-allocated regions with no intelligence as to
 where the next sensible offset for streaming is.  Secondly, because
 the progress and performance of image streaming depend largely on
 whether or not clusters are allocated (it is very fast when a cluster
 is already allocated and we have no work to do), offsets are bad
 indicators of progress to the user.  I think it's best not to expose
 these details to the management tool at all.
 
 The only reason for the iteration interface was to punt I/O throttling
 to the management tool.  I think it would be easier to just throttle
 inside the streaming function.
 
 Kevin: Are you happy with dropping the iteration interface?
 Adam: Is there a libvirt requirement for iteration or could we support
 background copy only?

Okay, works for me.

 The command synopses are as follows:

 block_stream
 ------------

 Copy data from a backing file into a block device.

 If the optional 'all' argument is true, this operation is performed in the
 background until the entire backing file has been copied.  The status of
 ongoing block_stream operations can be checked with query-block-stream.

 Not sure if it's a good idea to use a bool argument to turn a command
 into its opposite. I think having a separate command for stopping would
 be cleaner. Something for the QMP folks to decide, though.
 
 git branch new_branch
 git branch -D new_branch
 
 Makes sense to me :)

I don't think you should compare a command line option to a programming
interface. Having a git_create_branch(const char *name, bool delete)
would really look strange. Anyway, probably a matter of taste.

A hint that separate commands would make sense is that the stop command
won't need the other arguments that the start command gets ('all' and
'base').

 Arguments:

 - all:    copy entire device (json-bool, optional)
 - stop:   stop copying to device (json-bool, optional)
 - device: device name (json-string)

 It must be possible to specify backing file that will be
 active after streaming finishes (data from that file will not
 be streamed into active file, of course).

 Yes, I think the common base image belongs here.
 
 Right.  We need to specify it by filename:
 
   - base: filename of base file (json-string, optional)
 
   Sectors are not copied from the base file and its backing file
   chain.  The following describes this feature:
     Before: base <- sn1 <- sn2 <- sn3 <- vm.img
     After:  base <- vm.img

Does this imply that a rebase -u happens always after completion?

 With all = false, where does the streaming begin?
 
 Streaming begins at the start of the image.
 
 Do you have something like the current streaming offset in the state of 
 each BlockDriverState?
 
 Yes, there is a StreamState for each block device that has an
 in-progress operation.  The progress is saved between block_stream
 (without -a) invocations so the caller does not need to specify the
 streaming offset as an argument.
 
 Thanks for pointing out these weaknesses in the documentation.  It
 should really be explained fully.

I think we also need to describe error cases. For example, what happens
if you try to start streaming while it's already in progress?

 Return:

 - device: device name (json-string)
 - len:    size of the device, in bytes (json-int)
 - offset: ending offset of the completed I/O, in bytes (json-int)

 So you only get the reply when the request has completed? With the
 current monitor, this means that QMP is blocked while we stream, doesn't
 it? How are you supposed to send the stop command then?
 
 Incomplete documentation again, 

Re: [Qemu-devel] live block copy/stream/snapshot discussion

2011-07-12 Thread Stefan Hajnoczi
On Tue, Jul 12, 2011 at 9:06 AM, Kevin Wolf kw...@redhat.com wrote:
 On 11.07.2011 18:32, Marcelo Tosatti wrote:
 On Mon, Jul 11, 2011 at 03:47:15PM +0100, Stefan Hajnoczi wrote:
 Kevin, Marcelo,
 I'd like to reach agreement on the QMP/HMP APIs for live block copy
 and image streaming.  Libvirt has acked the image streaming APIs that
 Adam proposed and I think they are a good fit for the feature.  I have
 described that API below for your review (it's exactly what the QED
 Image Streaming patches provide).

 Marcelo: Are you happy with this API for live block copy?  Also please
 take a look at the switch command that I am proposing.

 Image streaming API
 ===================

 For leaf images with copy-on-read semantics, the stream commands allow the
 user to populate local blocks by manually streaming them from the backing
 image.
 Once all blocks have been streamed, the dependency on the original backing
 image can be removed.  Therefore, stream commands can be used to implement
 post-copy live block migration and rapid deployment.

 The block_stream command can be used to stream a single cluster, to
 start streaming the entire device, and to cancel an active stream.  It
 is easiest to allow the block_stream command to manage streaming for the
 entire device but a management tool could use single cluster mode to
 throttle the I/O rate.

 As discussed earlier, having the management send requests for each
 single cluster doesn't make any sense at all. It wouldn't only throttle
 the I/O rate but bring it down to a level that makes it unusable. What
 you really want is to allow the management to give us a range (offset +
 length) that qemu should stream.

I feel that an iteration interface is problematic whether the
management tool or QEMU decide what to stream.  Let's have just the
background streaming operation.

The problem with byte ranges is two-fold.  The management tool doesn't
know which regions of the image are allocated so it may do a lot of
nop calls to already-allocated regions with no intelligence as to
where the next sensible offset for streaming is.  Secondly, because
the progress and performance of image streaming depend largely on
whether or not clusters are allocated (it is very fast when a cluster
is already allocated and we have no work to do), offsets are bad
indicators of progress to the user.  I think it's best not to expose
these details to the management tool at all.
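
To illustrate why offsets are a poor progress indicator, here is a minimal Python sketch with a made-up allocation map. The function name and the map are hypothetical; the point is only that the amount of copying left does not track the offset when part of the image is already allocated.

```python
def remaining_work(allocated, start_cluster=0):
    """Count clusters at or after start_cluster that still need copying.

    allocated[i] is True when cluster i is already present in the image
    file, so streaming has no work to do for it.
    """
    return sum(1 for present in allocated[start_cluster:] if not present)


if __name__ == "__main__":
    # Hypothetical image: the first half is allocated, the second half is not.
    allocated = [True] * 4 + [False] * 4
    halfway = len(allocated) // 2
    # Moving the offset from 0% to 50% leaves the remaining work unchanged,
    # because nothing had to be copied for the already-allocated half.
    print(remaining_work(allocated), remaining_work(allocated, halfway))
```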

The only reason for the iteration interface was to punt I/O throttling
to the management tool.  I think it would be easier to just throttle
inside the streaming function.
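
One way to picture throttling inside the streaming loop is a helper that says how long to sleep after each chunk. This is a sketch under my own assumptions and not what the patches do: the function name, the bytes-per-second limit, and the average-rate policy are all hypothetical.

```python
def throttle_delay(bytes_copied, elapsed_seconds, limit_bytes_per_second):
    """Seconds to sleep so the average copy rate stays at or below the limit."""
    if limit_bytes_per_second <= 0:
        return 0.0  # assumption: a non-positive limit means "unthrottled"
    # Earliest time at which this many bytes may have been copied.
    earliest = bytes_copied / limit_bytes_per_second
    return max(0.0, earliest - elapsed_seconds)


if __name__ == "__main__":
    # 2 MB copied in 1 s with a 1 MB/s limit: the loop is ahead of schedule.
    print(throttle_delay(2_000_000, 1.0, 1_000_000))
```

The streaming loop would call this after each chunk and sleep for the returned time, so the management tool never has to drive individual iterations.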

Kevin: Are you happy with dropping the iteration interface?
Adam: Is there a libvirt requirement for iteration or could we support
background copy only?

 The command synopses are as follows:

 block_stream
 ------------

 Copy data from a backing file into a block device.

 If the optional 'all' argument is true, this operation is performed in the
 background until the entire backing file has been copied.  The status of
 ongoing block_stream operations can be checked with query-block-stream.

 Not sure if it's a good idea to use a bool argument to turn a command
 into its opposite. I think having a separate command for stopping would
 be cleaner. Something for the QMP folks to decide, though.

git branch new_branch
git branch -D new_branch

Makes sense to me :)

 Arguments:

 - all:    copy entire device (json-bool, optional)
 - stop:   stop copying to device (json-bool, optional)
 - device: device name (json-string)

 It must be possible to specify backing file that will be
 active after streaming finishes (data from that file will not
 be streamed into active file, of course).

 Yes, I think the common base image belongs here.

Right.  We need to specify it by filename:

  - base: filename of base file (json-string, optional)

  Sectors are not copied from the base file and its backing file
  chain.  The following describes this feature:
    Before: base <- sn1 <- sn2 <- sn3 <- vm.img
    After:  base <- vm.img
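
The effect of 'base' on the backing chain can be sketched in Python as follows. This models only the file names, not any image data. The function name is made up, and chains are written as lists running from the deepest backing file to the active image.

```python
def chain_after_streaming(chain, base=None):
    """Return the backing chain once streaming has completed.

    Without 'base' the active image ends up standalone.  With 'base', the
    images between base and the active image are streamed in, and base
    (with its own backing chain) stays as the backing file.
    """
    active = chain[-1]
    if base is None:
        return [active]
    if base not in chain[:-1]:
        raise ValueError(f"{base} is not a backing file of {active}")
    return chain[: chain.index(base) + 1] + [active]


if __name__ == "__main__":
    chain = ["base", "sn1", "sn2", "sn3", "vm.img"]
    print(chain_after_streaming(chain, base="base"))
    print(chain_after_streaming(chain))
```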

 With all = false, where does the streaming begin?

Streaming begins at the start of the image.

 Do you have something like the current streaming offset in the state of 
 each BlockDriverState?

Yes, there is a StreamState for each block device that has an
in-progress operation.  The progress is saved between block_stream
(without -a) invocations so the caller does not need to specify the
streaming offset as an argument.

Thanks for pointing out these weaknesses in the documentation.  It
should really be explained fully.

 Return:

 - device: device name (json-string)
 - len:    size of the device, in bytes (json-int)
 - offset: ending offset of the completed I/O, in bytes (json-int)

 So you only get the reply when the request has completed? With the
 current monitor, this means that QMP is blocked while we stream, doesn't
 it? How are you supposed to send the stop command then?

Incomplete documentation again, sorry.  The block_stream 

Re: [Qemu-devel] live block copy/stream/snapshot discussion

2011-07-12 Thread Adam Litke


On 07/12/2011 10:45 AM, Stefan Hajnoczi wrote:
 On Tue, Jul 12, 2011 at 9:06 AM, Kevin Wolf kw...@redhat.com wrote:
 On 11.07.2011 18:32, Marcelo Tosatti wrote:
 On Mon, Jul 11, 2011 at 03:47:15PM +0100, Stefan Hajnoczi wrote:
 Kevin, Marcelo,
 I'd like to reach agreement on the QMP/HMP APIs for live block copy
 and image streaming.  Libvirt has acked the image streaming APIs that
 Adam proposed and I think they are a good fit for the feature.  I have
 described that API below for your review (it's exactly what the QED
 Image Streaming patches provide).

 Marcelo: Are you happy with this API for live block copy?  Also please
 take a look at the switch command that I am proposing.

 Image streaming API
 ===================

 For leaf images with copy-on-read semantics, the stream commands allow the
 user to populate local blocks by manually streaming them from the backing
 image.
 Once all blocks have been streamed, the dependency on the original backing
 image can be removed.  Therefore, stream commands can be used to implement
 post-copy live block migration and rapid deployment.

 The block_stream command can be used to stream a single cluster, to
 start streaming the entire device, and to cancel an active stream.  It
 is easiest to allow the block_stream command to manage streaming for the
 entire device but a management tool could use single cluster mode to
 throttle the I/O rate.

 As discussed earlier, having the management send requests for each
 single cluster doesn't make any sense at all. It wouldn't only throttle
 the I/O rate but bring it down to a level that makes it unusable. What
 you really want is to allow the management to give us a range (offset +
 length) that qemu should stream.
 
 I feel that an iteration interface is problematic whether the
 management tool or QEMU decide what to stream.  Let's have just the
 background streaming operation.
 
 The problem with byte ranges is two-fold.  The management tool doesn't
 know which regions of the image are allocated so it may do a lot of
 nop calls to already-allocated regions with no intelligence as to
 where the next sensible offset for streaming is.  Secondly, because
 the progress and performance of image streaming depend largely on
 whether or not clusters are allocated (it is very fast when a cluster
 is already allocated and we have no work to do), offsets are bad
 indicators of progress to the user.  I think it's best not to expose
 these details to the management tool at all.
 
 The only reason for the iteration interface was to punt I/O throttling
 to the management tool.  I think it would be easier to just throttle
 inside the streaming function.
 
 Kevin: Are you happy with dropping the iteration interface?
 Adam: Is there a libvirt requirement for iteration or could we support
 background copy only?

There is no hard requirement for iteration in libvirt.  However, I think
there is a requirement that we report some sort of progress to an end
user.  These operations can easily take many minutes (even hours) and
such a long-running operation needs to report progress.  I think the
current information returned by 'query-block-stream' is appropriate for
this purpose and should definitely be maintained.
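
For illustration, progress reporting with background-only streaming could be a simple polling loop like the Python sketch below. The 'query' callable stands in for whatever transport issues query-block-stream and returns its list; that callable, the function name, and the reporting format are assumptions of the sketch.

```python
import time


def wait_for_stream(query, device, interval=1.0, report=print, sleep=time.sleep):
    """Report progress until 'device' no longer appears in the query result.

    'query' is a caller-supplied function returning the list that
    query-block-stream would return (hypothetical transport, not shown).
    """
    while True:
        ops = {op["device"]: op for op in query()}
        if device not in ops:
            return  # no active operation left for this device
        op = ops[device]
        report(f"{device}: {op['offset']}/{op['len']} bytes")
        sleep(interval)


if __name__ == "__main__":
    # Canned replies standing in for a real monitor connection.
    replies = iter([
        [{"device": "virtio0", "len": 100, "offset": 10}],
        [{"device": "virtio0", "len": 100, "offset": 60}],
        [],
    ])
    wait_for_stream(lambda: next(replies), "virtio0", sleep=lambda s: None)
```

The loop only shows that the operation has gone away; whether it succeeded or failed would still come from the completion event.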

-- 
Adam Litke
IBM Linux Technology Center



Re: [Qemu-devel] live block copy/stream/snapshot discussion

2011-07-11 Thread Stefan Hajnoczi
On Tue, Jul 05, 2011 at 05:17:49PM +0300, Dor Laor wrote:
 Anthony advised to clone 
 http://wiki.qemu.org/index.php?title=Features/LiveBlockMigrationFuture
 to the list in order to encourage discussion, so here it is:
 
  qemu is expected to support these features (some already implemented):
 
 = Live features =
 
 == Live block copy ==
 
Ability to copy 1+ virtual disk from the source backing file/block
device to a new target that is accessible by the host. The copy is
supposed to be executed transparently while the VM runs.
 
 == Live snapshots and live snapshot merge ==
 
Live snapshot is already incorporated (by Jes) in qemu (still need
virt-agent work to freeze the guest FS).
Live snapshot merge is required in order to reduce the overhead
caused by the additional snapshots (sometimes over a raw device).
We'll use live copy to do the live merge.

This line seems outdated.  Kevin and Marcelo have suggested a separate
live commit operation that does not use the unified block copy/image
streaming mechanism.

 = Solutions =
 
 == Non shared storage ==
 
Either use iSCSI (target and initiator), NBD, or a proprietary qemu
solution. iSCSI is in theory the best, but there is a problem
dealing with COW images: iSCSI cannot report the COW level or
detect unallocated blocks. This might force us to use a
proprietary solution.
An interesting option (by Orit Wasserman) was to use iSCSI for
exporting the images externally to the qemu level, with qemu accessing
them as if they were a local device. This can work well with almost no
effort. What do we do with chains of COW files? We create up to N
such iSCSI connections, one for every COW file in the chain.

If there is a discovery mechanism to locate LUNs then it would be
possible to use this approach.

However, using iSCSI but placing all the copy-on-write intelligence into
the QEMU initiator is overkill since we need to support SAN/NAS
appliances that provide snapshots, copy-on-write, and thin provisioning
anyway.  If you look at what other hypervisors are doing, they are
trying to offload as much storage processing onto the appliance as
possible.

We probably want the appliance to do those operations for us, so
implementing them in the initiator for some cases is duplicating that
code and making the system more complex.

The real problem is that we're lacking a library interface to manage
volumes, including snapshots.  I don't think that QEMU needs to drive
this interface.  It should be libvirt (which deals with storage pools
and volumes today already).

Once we do have an interface defined, I think it makes less sense
implementing all of this in QEMU when this storage management
functionality really belongs in NAS/SAN appliances and software targets.

 
 == Live block migration ==
 
Use the streaming approach + regular live migration + iSCSI:
Execute regular live migration and at the end of it, start streaming.
If there is no shared storage, use the external iSCSI and behave as
if the image is local. At the end of the streaming operation there
will be a new local base image.
 
 == Block mirror layer ==
 
It was invented in order to duplicate write I/Os to the source and
destination images. It prevents the potential race when both qemu
and the management crash at the end of the block copy stage and it
is unknown whether management should pick the source or the
destination.
 
 == Streaming ==
 
No need for mirror since only the destination changes and is
writable.
 
 == Block copy background task ==
 
Can be shared between block copy and streaming
 
 == Live snapshot ==
 
It can be seen as a (local) stream that preserves the current COW
chain.
 
 = Use cases =
 
  1. Basic streaming, single base master image on source storage, need
 to be instantiated on destination storage
 
  The base image is a single level COW format (file or lvm).
  The base is RO and only the new destination is RW. base' is empty at
  the beginning. The base image content is copied in the
  background to base'. At the end of the operation, base' is a
  standalone image that no longer depends on the base image.
 
  a. Case of a shared storage streaming guest boot
 
  Before:  src storage: base    dst storage: none
  After:   src storage: base    dst storage: base'
 
  b. Case of no shared storage streaming guest boot
 Everything is the same, except that we use an external iSCSI target
 on the src host and an external iSCSI initiator on the destination
 host. Qemu boots from the destination by using the iSCSI access. This
 is transparent to qemu (except for a cmd syntax change). Once the
 streaming is over, we can live drop the usage of iSCSI and open
 the image directly (some sort of null live copy).
 
  c. Live 

Re: [Qemu-devel] live block copy/stream/snapshot discussion

2011-07-11 Thread Stefan Hajnoczi
Kevin, Marcelo,
I'd like to reach agreement on the QMP/HMP APIs for live block copy
and image streaming.  Libvirt has acked the image streaming APIs that
Adam proposed and I think they are a good fit for the feature.  I have
described that API below for your review (it's exactly what the QED
Image Streaming patches provide).

Marcelo: Are you happy with this API for live block copy?  Also please
take a look at the switch command that I am proposing.

Image streaming API
===================

For leaf images with copy-on-read semantics, the stream commands allow the user
to populate local blocks by manually streaming them from the backing image.
Once all blocks have been streamed, the dependency on the original backing
image can be removed.  Therefore, stream commands can be used to implement
post-copy live block migration and rapid deployment.

The block_stream command can be used to stream a single cluster, to
start streaming the entire device, and to cancel an active stream.  It
is easiest to allow the block_stream command to manage streaming for the
entire device but a management tool could use single cluster mode to
throttle the I/O rate.

The command synopses are as follows:

block_stream
------------

Copy data from a backing file into a block device.

If the optional 'all' argument is true, this operation is performed in the
background until the entire backing file has been copied.  The status of
ongoing block_stream operations can be checked with query-block-stream.

Arguments:

- all:    copy entire device (json-bool, optional)
- stop:   stop copying to device (json-bool, optional)
- device: device name (json-string)

Return:

- device: device name (json-string)
- len:    size of the device, in bytes (json-int)
- offset: ending offset of the completed I/O, in bytes (json-int)

Examples:

-> { "execute": "block_stream", "arguments": { "device": "virtio0" } }
<- { "return":  { "device": "virtio0", "len": 10737418240, "offset": 512 } }

-> { "execute": "block_stream", "arguments": { "all": true,
                                               "device": "virtio0" } }
<- { "return": {} }

-> { "execute": "block_stream", "arguments": { "stop": true,
                                               "device": "virtio0" } }
<- { "return": {} }
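
As a rough illustration of the three invocation modes above, a management tool might assemble the messages like this. This is only a sketch in Python: the helper name block_stream_command and its copy_all parameter are hypothetical, and rejecting 'all' together with 'stop' is an assumption, since the proposal does not say how that combination should behave.

```python
import json


def block_stream_command(device, copy_all=False, stop=False):
    """Build the proposed block_stream QMP message as a plain dict."""
    if copy_all and stop:
        # Assumption of this sketch: the proposal does not define the
        # behaviour when both flags are set, so reject the combination.
        raise ValueError("'all' and 'stop' cannot both be set")
    arguments = {"device": device}
    if copy_all:
        arguments["all"] = True   # stream the entire device in the background
    if stop:
        arguments["stop"] = True  # cancel an active stream
    return {"execute": "block_stream", "arguments": arguments}


if __name__ == "__main__":
    # Single-cluster mode, whole-device mode, and cancellation.
    print(json.dumps(block_stream_command("virtio0")))
    print(json.dumps(block_stream_command("virtio0", copy_all=True)))
    print(json.dumps(block_stream_command("virtio0", stop=True)))
```

Omitting both flags gives the single-cluster form, which is the mode a management tool would call repeatedly if it wanted to throttle the I/O rate itself.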

query-block-stream
------------------

Show progress of ongoing block_stream operations.

Return a json-array of all operations.  If no operation is active then an empty
array will be returned.  Each operation is a json-object with the following
data:

- device: device name (json-string)
- len:    size of the device, in bytes (json-int)
- offset: ending offset of the completed I/O, in bytes (json-int)

Example:

-> { "execute": "query-block-stream" }
<- { "return": [
       { "device": "virtio0", "len": 10737418240, "offset": 709632 }
     ]
   }
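
A management tool polling query-block-stream would probably turn len and offset into a completion percentage. The following Python sketch shows one way to do that; the function name stream_progress is hypothetical, and treating a zero-length device as complete is an assumption of the sketch.

```python
def stream_progress(operations):
    """Map each device name to the percentage of the image streamed so far."""
    progress = {}
    for op in operations:
        length = op["len"]
        if length == 0:
            # Assumption: nothing to copy means the stream is complete.
            progress[op["device"]] = 100.0
        else:
            progress[op["device"]] = 100.0 * op["offset"] / length
    return progress


if __name__ == "__main__":
    # The reply from the example above, already parsed from JSON.
    reply = {"return": [
        {"device": "virtio0", "len": 10737418240, "offset": 709632},
    ]}
    for device, percent in stream_progress(reply["return"]).items():
        print(f"{device}: {percent:.4f}% streamed")
```

An empty array (no active operation) simply yields an empty mapping.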


Block device switching API
==========================

Extend the 'change' command to support changing the image file without
media change notification.

Perhaps we should take the opportunity to add a format argument for
image files?

change
------

Change a removable medium or VNC configuration.

Arguments:

- device: device name (json-string)
- target: filename or item (json-string)
- arg: additional argument (json-string, optional)
- notify: whether to notify guest, defaults to true (json-bool, optional)

Examples:

1. Change a removable medium

-> { "execute": "change",
     "arguments": { "device": "ide1-cd0",
                    "target": "/srv/images/Fedora-12-x86_64-DVD.iso" } }
<- { "return": {} }

2. Change a disk without media change notification

-> { "execute": "change",
     "arguments": { "device": "virtio-blk0",
                    "target": "/srv/images/vm_1.img",
                    "notify": false } }

3. Change VNC password

-> { "execute": "change",
     "arguments": { "device": "vnc", "target": "password",
                    "arg": "foobar1" } }
<- { "return": {} }
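
To make the defaulting of the proposed 'notify' argument concrete, here is a small Python sketch of a message builder. The helper name change_command is hypothetical; the only behaviour taken from the proposal is that 'notify' defaults to true, so the sketch sends it only when the caller asks for no notification.

```python
def change_command(device, target, arg=None, notify=True):
    """Build the extended 'change' QMP message as a plain dict."""
    arguments = {"device": device, "target": target}
    if arg is not None:
        arguments["arg"] = arg
    if not notify:
        # 'notify' defaults to true in the proposal, so only send it
        # when the guest must not see a media change notification.
        arguments["notify"] = False
    return {"execute": "change", "arguments": arguments}


if __name__ == "__main__":
    print(change_command("ide1-cd0", "/srv/images/Fedora-12-x86_64-DVD.iso"))
    print(change_command("virtio-blk0", "/srv/images/vm_1.img", notify=False))
```

Existing callers that never pass 'notify' keep producing exactly the message they send today.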

How live block copy works
=========================

Live block copy does the following:

1. Create the destination file: qemu-img create -f $cow_fmt -o
backing_file=$base destination.$cow_fmt
2. Switch to the destination file: change -n virtio-blk0 /srv/images/vm_1.img
3. Stream the base into the image file: block_stream -a virtio-blk0
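
The three steps can be strung together by a management tool roughly as follows. This Python sketch only builds the command strings and runs nothing; the function name live_block_copy_steps and the example paths are hypothetical, and 'change -n' and 'block_stream -a' are the monitor forms proposed in this mail, not necessarily what any released QEMU accepts.

```python
import shlex


def live_block_copy_steps(device, base, destination, cow_fmt):
    """Return the shell command and the two monitor commands, in order."""
    create = ["qemu-img", "create", "-f", cow_fmt,
              "-o", f"backing_file={base}", destination]
    return [
        # 1. Create the destination image backed by the current image.
        " ".join(shlex.quote(part) for part in create),
        # 2. Switch the device to the destination without notifying the guest.
        f"change -n {device} {destination}",
        # 3. Stream the base into the destination in the background.
        f"block_stream -a {device}",
    ]


if __name__ == "__main__":
    for step in live_block_copy_steps("virtio-blk0", "/srv/images/base.img",
                                      "/srv/images/vm_1.img", "qed"):
        print(step)
```

The sketch uses the same destination path for steps 1 and 2, since the image created in step 1 is the one the device is switched to.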

Stefan



Re: [Qemu-devel] live block copy/stream/snapshot discussion

2011-07-11 Thread Marcelo Tosatti
On Mon, Jul 11, 2011 at 03:47:15PM +0100, Stefan Hajnoczi wrote:
 Kevin, Marcelo,
 I'd like to reach agreement on the QMP/HMP APIs for live block copy
 and image streaming.  Libvirt has acked the image streaming APIs that
 Adam proposed and I think they are a good fit for the feature.  I have
 described that API below for your review (it's exactly what the QED
 Image Streaming patches provide).
 
 Marcelo: Are you happy with this API for live block copy?  Also please
 take a look at the switch command that I am proposing.
 
 Image streaming API
 ===================
 
 For leaf images with copy-on-read semantics, the stream commands allow the user
 to populate local blocks by manually streaming them from the backing image.
 Once all blocks have been streamed, the dependency on the original backing
 image can be removed.  Therefore, stream commands can be used to implement
 post-copy live block migration and rapid deployment.
 
 The block_stream command can be used to stream a single cluster, to
 start streaming the entire device, and to cancel an active stream.  It
 is easiest to allow the block_stream command to manage streaming for the
 entire device but a management tool could use single cluster mode to
 throttle the I/O rate.
 
 The command synopses are as follows:
 
 block_stream
 ------------
 
 Copy data from a backing file into a block device.
 
 If the optional 'all' argument is true, this operation is performed in the
 background until the entire backing file has been copied.  The status of
 ongoing block_stream operations can be checked with query-block-stream.
 
 Arguments:
 
 - all:    copy entire device (json-bool, optional)
 - stop:   stop copying to device (json-bool, optional)
 - device: device name (json-string)

It must be possible to specify the backing file that will be
active after streaming finishes (data from that file will not
be streamed into the active file, of course).
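
One way to picture this request is as a cut in the backing chain: everything between the active image and the chosen backing file is streamed, while that file and anything beneath it stays where it is. The Python sketch below is only an illustration; the function name images_to_stream and the file names are hypothetical.

```python
def images_to_stream(backing_chain, base=None):
    """Return the backing files whose data must be copied into the active image.

    backing_chain lists the backing files of the active image, nearest
    first and root last.  If base is given, it stays the backing file
    after streaming, so neither it nor anything beneath it is copied.
    """
    if base is None:
        return list(backing_chain)
    if base not in backing_chain:
        raise ValueError(f"{base!r} is not in the backing chain")
    return backing_chain[:backing_chain.index(base)]


if __name__ == "__main__":
    chain = ["snap2.qed", "snap1.qed", "base.raw"]
    print(images_to_stream(chain))               # stream everything
    print(images_to_stream(chain, "base.raw"))   # keep base.raw as backing file
```

With no base given the result is the whole chain, which matches the behaviour of the API as proposed.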

 Return:
 
 - device: device name (json-string)
 - len:    size of the device, in bytes (json-int)
 - offset: ending offset of the completed I/O, in bytes (json-int)
 
 Examples:
 
 -> { "execute": "block_stream", "arguments": { "device": "virtio0" } }
 <- { "return":  { "device": "virtio0", "len": 10737418240, "offset": 512 } }
 
 -> { "execute": "block_stream", "arguments": { "all": true,
                                                "device": "virtio0" } }
 <- { "return": {} }
 
 -> { "execute": "block_stream", "arguments": { "stop": true,
                                                "device": "virtio0" } }
 <- { "return": {} }
 
 query-block-stream
 ------------------
 
 Show progress of ongoing block_stream operations.
 
 Return a json-array of all operations.  If no operation is active then an empty
 array will be returned.  Each operation is a json-object with the following
 data:
 
 - device: device name (json-string)
 - len:    size of the device, in bytes (json-int)
 - offset: ending offset of the completed I/O, in bytes (json-int)
 
 Example:
 
 -> { "execute": "query-block-stream" }
 <- { "return": [
        { "device": "virtio0", "len": 10737418240, "offset": 709632 }
      ]
    }
 
 
 Block device switching API
 ==========================
 
 Extend the 'change' command to support changing the image file without
 media change notification.
 
 Perhaps we should take the opportunity to add a format argument for
 image files?
 
 change
 ------
 
 Change a removable medium or VNC configuration.
 
 Arguments:
 
 - device: device name (json-string)
 - target: filename or item (json-string)
 - arg: additional argument (json-string, optional)
 - notify: whether to notify guest, defaults to true (json-bool, optional)
 
 Examples:
 
 1. Change a removable medium
 
 -> { "execute": "change",
      "arguments": { "device": "ide1-cd0",
                     "target": "/srv/images/Fedora-12-x86_64-DVD.iso" } }
 <- { "return": {} }
 
 2. Change a disk without media change notification
 
 -> { "execute": "change",
      "arguments": { "device": "virtio-blk0",
                     "target": "/srv/images/vm_1.img",
                     "notify": false } }
 
 3. Change VNC password
 
 -> { "execute": "change",
      "arguments": { "device": "vnc", "target": "password",
                     "arg": "foobar1" } }
 <- { "return": {} }
 
 How live block copy works
 =========================
 
 Live block copy does the following:
 
 1. Create the destination file: qemu-img create -f $cow_fmt -o
 backing_file=$base destination.$cow_fmt
 2. Switch to the destination file: change -n virtio-blk0 /srv/images/vm_1.img

The snapshot command (snapshot_blkdev) can be used for these two steps.

 3. Stream the base into the image file: block_stream -a virtio-blk0
 
 Stefan



[Qemu-devel] live block copy/stream/snapshot discussion

2011-07-05 Thread Dor Laor
Anthony advised to clone 
http://wiki.qemu.org/index.php?title=Features/LiveBlockMigrationFuture 
to the list in order to encourage discussion, so here it is:


 qemu is expected to support these features (some already implemented):

= Live features =

== Live block copy ==

   Ability to copy one or more virtual disks from the source backing
   file/block device to a new target that is accessible by the host.
   The copy is supposed to be executed transparently while the VM runs.

== Live snapshots and live snapshot merge ==

   Live snapshot is already incorporated (by Jes) in qemu (still need
   virt-agent work to freeze the guest FS).
   Live snapshot merge is required in order to reduce the overhead
   caused by the additional snapshots (sometimes over a raw device).
   We'll use live copy to do the live merge.

== Image streaming (Copy on read) ==
   Ability to start guest execution while the parent image resides
   remotely and each block access is replicated to a local copy (image
   format snapshot).
   Such functionality can be hooked together with live block migration
   instead of the 'post copy' method.
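
The copy-on-read idea can be modelled in a few lines of Python. This is a toy sketch, not QEMU's implementation: the class name CopyOnReadImage, the per-cluster dictionary, and the method names are all hypothetical, and the "remote" parent is just an in-memory list.

```python
class CopyOnReadImage:
    """Toy local image that fills itself from a read-only parent image."""

    def __init__(self, backing):
        self.backing = backing  # read-only parent, one entry per cluster
        self.local = {}         # cluster index -> data held locally

    def read(self, index):
        # Copy on read: the first access fetches the cluster from the
        # parent and keeps a local copy for later reads.
        if index not in self.local:
            self.local[index] = self.backing[index]
        return self.local[index]

    def write(self, index, data):
        # Guest writes only ever touch the local image.
        self.local[index] = data

    def stream_all(self):
        # Background streaming: copy every cluster not yet held locally.
        for index in range(len(self.backing)):
            if index not in self.local:
                self.local[index] = self.backing[index]

    def is_standalone(self):
        # Once every cluster is local, the parent is no longer needed.
        return len(self.local) == len(self.backing)


if __name__ == "__main__":
    image = CopyOnReadImage([b"a", b"b", b"c"])
    image.read(1)
    image.write(0, b"z")
    image.stream_all()
    print(image.is_standalone(), image.read(0), image.read(2))
```

Note that streaming never overwrites a cluster the guest has already written, which is why the guest can keep running while the copy proceeds.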

== Live block migration (pre/post) ==

   Beyond live block copy we'll sometimes need to move both the storage
   and the guest. There are two main approaches here:
   - pre copy
     First live copy the image and only then live migrate the VM.
     It is a simpler and safer approach for the management app, but if
     the purpose of the whole live block migration is to balance the
     CPU load, it won't be practical since copying an image of
     100GB will take too long.
   - post copy (streaming / copy on read)
     First live migrate the VM, then stream its blocks online.
     It's a better approach for HA/load balancing but it might make
     management complex (need to keep the source VM alive, handle
     failures).
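
To put a rough number on "too long" for the pre copy case, the arithmetic is simply image size divided by sustained copy throughput. The Python sketch below uses an assumed throughput of 100 MB/s, which is an illustrative figure and not a measurement from this thread; the function name copy_seconds is hypothetical.

```python
def copy_seconds(image_bytes, bytes_per_second):
    """Time needed to copy an image at a constant sustained throughput."""
    if bytes_per_second <= 0:
        raise ValueError("throughput must be positive")
    return image_bytes / bytes_per_second


if __name__ == "__main__":
    GB = 10**9
    MB = 10**6
    # Assumed figure: 100 MB/s sustained copy rate for a 100GB image.
    seconds = copy_seconds(100 * GB, 100 * MB)
    print(f"{seconds:.0f} s, about {seconds / 60:.1f} minutes")
```

Under that assumption the VM cannot move for roughly a quarter of an hour, which defeats the purpose when the goal is to relieve CPU load quickly.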

   In addition there are two cases for the storage access:

   1. Shared storage
      Live block copy enables this capability; it seems like a rare
      case for live block migration.
   2. There are some cases where there is no NFS/SAN storage and live
      migration is needed. It should be similar to VMware's Storage
      VMotion.
  http://www.vmware.com/files/pdf/VMware-Storage-VMotion-DS-EN.pdf
  http://www.vmware.com/products/storage-vmotion/features.html

== Using external dirty block bitmap ==

   FVD has an option to use an external dirty block bitmap file in
   addition to the regular mapping/data files.
   We can consider using it for live block migration and live merge too.
   It can also allow 3rd party tools to calculate
   diffs between the snapshots.
   There is a big downside though, since it will make management
   complicated and there is the risk of the image and its bitmap file
   getting out of sync. It's a much better choice to have the qemu-img
   tool be the single interface to the dirty block bitmap data.

= Solutions =

== Non shared storage ==

   Either use iSCSI (target and initiator), NBD, or a proprietary qemu
   solution. iSCSI is in theory the best, but there is a problem
   dealing with COW images: iSCSI cannot report the COW level or
   detect unallocated blocks. This might force us to use a
   proprietary solution.
   An interesting option (by Orit Wasserman) is to use iSCSI to
   export the images outside of qemu, so that qemu accesses them
   as if they were local devices. This can work with almost no
   effort. What do we do with chains of COW files? We create up to N
   such iSCSI connections, one for every COW file in the chain.

== Live block migration ==

   Use the streaming approach + regular live migration + iSCSI:
   Execute regular live migration and at the end of it, start streaming.
   If there is no shared storage, use the external iSCSI and behave as
   if the image is local. At the end of the streaming operation there
   will be a new local base image.

== Block mirror layer ==

   It was invented in order to duplicate write I/Os to the source and
   destination images. It prevents the potential race when both qemu
   and the management crash at the end of the block copy stage and it
   is unknown whether management should pick the source or the
   destination.
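
The core of the mirror idea, every guest write landing on both images, can be shown with a toy model. This Python sketch is an illustration only: the class name MirroredDisk is hypothetical, the images are in-memory buffers, and a real mirror layer would also need write ordering and failure handling, which are left out here.

```python
class MirroredDisk:
    """Toy block device that duplicates every write to two images."""

    def __init__(self, size):
        self.source = bytearray(size)
        self.destination = bytearray(size)

    def write(self, offset, data):
        end = offset + len(data)
        if offset < 0 or end > len(self.source):
            raise ValueError("write outside the device")
        # Apply the same write to both images so that either one can be
        # picked after a crash without losing guest data.
        self.source[offset:end] = data
        self.destination[offset:end] = data


if __name__ == "__main__":
    disk = MirroredDisk(16)
    disk.write(4, b"abcd")
    print(disk.source == disk.destination)
```

Because both images stay identical for guest writes, management does not need to know at which exact point a crash happened to choose a valid image.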

== Streaming ==

   No need for mirror since only the destination changes and is
   writable.

== Block copy background task ==

   Can be shared between block copy and streaming

== Live snapshot ==

   It can be seen as a (local) stream that preserves the current COW
   chain.

= Use cases =

 1. Basic streaming, single base master image on source storage, need
to be instantiated on destination storage

 The base image is a single level COW format (file or lvm).
 The base is RO and only new destination is RW. base' is empty at
 the beginning. The base image content is being copied in the
 background to base'. At the end of the operation,