Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-31 Thread Paolo Bonzini
On 31/05/2012 10:44, Roni Luxenberg wrote:
 a continuous replication application expects to get all I/Os in the same
 order as issued by the guest, irrespective of the (rate of the) flushes done
 within the guest.

Does real hardware give such consistency, unless you disable caching?

 This is required to be able to support more advanced features like cross-VM
 consistency, where the application protects a group of VMs (possibly running
 on different hosts) forming a single logical application/service.
 Does the design strictly maintain this property?

No.  But this is only a problem with the implementation, not with the API.

 under this design, and assuming an async implementation, is the same block
 that is written a few times in a row by the guest guaranteed to be
 received by the continuous replication agent the same exact number of
 times, without any of the writes being overwritten?

No.

Paolo



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-31 Thread ronnie sahlberg
On Fri, May 25, 2012 at 11:25 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 25/05/2012 14:09, Stefan Hajnoczi wrote:
 
  Perhaps that could simply be a new qemu-img subcommand?  It should be possible
  to run it while the VM is offline.  Then the file that is produced could
  be fed to blockdev-dirty-enable.
 For both continuous replication and incremental backups we cannot
 require that the guest is shut down in order to collect the dirty
 bitmap, I think.

 Yes, that is a problem for internal snapshots.  For external snapshots,
 see the drive-mirror command's sync parameter.  Perhaps we can add a
 blockdev-dirty-fill command that adds allocated sectors up to a given
 base to the dirty bitmap.

 I think we really need a libvirt API because a local file not only has
 permissions issues but also is not network transparent.  The continuous
 replication server runs on another machine, how will it access the dirty
 bitmap file?

 This is still using a push model where the dirty data is sent from
 QEMU to the replication server, so the dirty bitmap is not needed on the
 machine that runs the replication server---only on the machine that runs
 the VM (to preserve the bitmap across VM shutdowns including power
 loss).  It has to be stored on shared storage if you plan to run the VM
 from multiple hosts.

Why reinvent the wheel?
Wouldn't it be much better to externalize the snapshotting?

Some/many filesystems support snapshotting today, and most non-consumer-grade
block storage devices support it too.


So a different way to do this would be to use a mechanism to quiesce the
backing file/device and then call out to an external agent to snapshot the
backing file or the backing device.
Other external tools can then be used to compute a dense delta between
this new snapshot and the previous snapshot and transfer it to the other
side.

I think all filesystems that support snapshotting at the file level
support APIs for a cheap way to compute the block deltas.
I would imagine all midrange or better block storage devices that
support LUN snapshotting provide this too.


So why do snapshotting and computation of snapshot deltas in qemu?
Why not just externalize it: if you want snapshotting and
incremental replication, you must use a system where the file/block
layer supports it.


regards
ronnie sahlberg



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-31 Thread Geert Jansen


On 05/30/2012 05:06 PM, Paolo Bonzini wrote:

I think it's beginning to dawn on me that what you have is correct, when
I combine this:



2) target flushes do not have to coincide with a source flush.  Writes
after the last source flush _can_ be inconsistent between the source and
the destination!  What matters is that all writes up to the last source
flush are consistent.


with the statement you made earlier that the drive-mirror coroutine 
issues a target flush *after* a target write returns *and* the dirty 
count is zero.


However, I'm thinking that this design has two undesirable properties.
Both properties have a high impact if you assume the replication
appliance is high bandwidth but also potentially high latency (high
latency because it runs in a guest, and is multiplexing I/Os for many
other VMs).


1) Target flushes are not guaranteed to happen at all. If the latency of 
the target is higher than the maximum interval between writes to the 
source, the bitmap will always be dirty when a write to the target 
returns, and a target flush will never be issued.


2) The fact that drive-mirror waits for acknowledgments of writes to the 
target means that there is at most one I/O outstanding and throughput is 
bound by latency.
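
To put illustrative numbers on this (they are made up, not measured): with one
outstanding 64 KB write and a 10 ms round trip to the appliance, the mirror
moves at most 64 KB per 10 ms, i.e. roughly 6 MB/s, however much bandwidth is
actually available.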


The Promela model is a bit out of my league, unfortunately :)

Regards,
Geert



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-31 Thread Paolo Bonzini
On 31/05/2012 13:08, Geert Jansen wrote:
 1) Target flushes are not guaranteed to happen at all. If the latency of
 the target is higher than the maximum interval between writes to the
 source, the bitmap will always be dirty when a write to the target
 returns, and a target flush will never be issued.

That's a very good point.

 2) The fact that drive-mirror waits for acknowledgments of writes to the
 target means that there is at most one I/O outstanding and throughput is
 bound by latency.

That can be fixed, I'll add it on my ever-growing todo list. :)

Paolo



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-31 Thread Roni Luxenberg
- Original Message -
 On 30/05/2012 14:34, Geert Jansen wrote:
  
  On 05/29/2012 02:52 PM, Paolo Bonzini wrote:
  
   Does the drive-mirror coroutine send the writes to the target in the
   same order as they are sent to the source? I assume so.
  
   No, it doesn't.  It's asynchronous; for continuous replication, the
   target knows that it has a consistent view whenever it sees a flush on
   the NBD stream.  Flushing the target is needed anyway before writing the
   dirty bitmap, so the target might as well exploit them to get
   information about the state of the source.
 
   The target _must_ flush to disk when it receives a flush command, no
   matter how close they are.  It _may_ choose to snapshot the disk, for
   example establishing one new snapshot every 5 seconds.
   
   Interesting. So it works quite differently than I had assumed. Some
   follow-up questions, hope you don't mind...
  
   * I assume a flush roughly corresponds to an fsync() in the guest
   OS?
 
  Yes (or a metadata flush from the guest OS filesystem, since our guest
  models do not support attaching the FUA bit to single writes).
 

a continuous replication application expects to get all I/Os in the same order
as issued by the guest, irrespective of the (rate of the) flushes done within
the guest. This is required to be able to support more advanced features like
cross-VM consistency, where the application protects a group of VMs (possibly
running on different hosts) forming a single logical application/service.
Does the design strictly maintain this property?

   * Writes will not be re-ordered over a flush boundary, right?
 
 More or less.  This for example is a valid ordering:
 
 write sector 0
  write 0 returns
 flush
 write sector 1
  write 1 returns
  flush returns
 
  However, writes that have already returned will not be re-ordered over a
  flush boundary.
 
   A synchronous implementation is not forbidden by the spec (by design),
   but at the moment it's a bit more complex to implement because, as you
   mention, it requires buffering the I/O data on the host.
   
   So if I understand correctly, you'd only be keeping a list of
   (len, offset) tuples without any data, and drive-mirror then reads the
   data from the disk image? If that is the case, how do you handle a flush?
   Does a flush need to wait for drive-mirror to drain the entire outgoing
   queue to the target before it can complete? If not, how do you prevent
   writes that happen after a flush from overwriting the data that will be
   sent to the target in case that hasn't reached the flush point yet.
 
  The key is that:
  
  1) you only flush the target when you have a consistent image of the
  source on the destination, and the replication server only creates a
  snapshot when it receives a flush.  Thus, the server does not create a
  consistent snapshot unless the client was able to keep pace with the
  guest.
  
  2) target flushes do not have to coincide with a source flush.  Writes
  after the last source flush _can_ be inconsistent between the source
  and the destination!  What matters is that all writes up to the last
  source flush are consistent.
 
 Say the guest starts with (4 characters = 1 sector) aaaa bbbb cccc on
 disk
 
 and then the following happens
 
 guest             disk             dirty count   mirroring
 ----------------  ---------------  ------------  ----------------------------
                   aaaa bbbb cccc   0
 write 1 = xxxx    aaaa xxxx cccc   1
 FLUSH
 write 1 = yyyy    aaaa yyyy cccc
                                                  dirty bitmap: sector 1 dirty
 write 2 = zzzz    aaaa yyyy zzzz   2
                                    1             copy sector 1 = yyyy
                                    0             copy sector 2 = zzzz
                                                  FLUSH
                                                  dirty bitmap: all clean
 write 0 = vvvv    vvvv yyyy zzzz
 write 0 = wwww    wwww yyyy zzzz
 
 and then a power loss happens on the source.
 
 The guest now has the dirty bitmap saying all clean even though the
 source now is wwww yyyy zzzz and the destination aaaa yyyy zzzz.
 However, this is not a problem because both are consistent with the last
 flush.

under this design, and assuming an async implementation, is the same block that
is written a few times in a row by the guest guaranteed to be received by the
continuous replication agent the same exact number of times, without any of the
writes being overwritten?

Thanks,
Roni



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-30 Thread Geert Jansen


On 05/29/2012 02:52 PM, Paolo Bonzini wrote:


Does the drive-mirror coroutine send the writes to the target in the
same order as they are sent to the source? I assume so.


No, it doesn't.  It's asynchronous; for continuous replication, the
target knows that it has a consistent view whenever it sees a flush on
the NBD stream.  Flushing the target is needed anyway before writing the
dirty bitmap, so the target might as well exploit them to get
information about the state of the source.

The target _must_ flush to disk when it receives a flush command, no
matter how close they are.  It _may_ choose to snapshot the disk, for
example establishing one new snapshot every 5 seconds.


Interesting. So it works quite differently than I had assumed. Some
follow-up questions, hope you don't mind...


 * I assume a flush roughly corresponds to an fsync() in the guest OS? 
Or is it more frequent than that?


 * Writes will not be re-ordered over a flush boundary, right?


A synchronous implementation is not forbidden by the spec (by design),
but at the moment it's a bit more complex to implement because, as you
mention, it requires buffering the I/O data on the host.


So if I understand correctly, you'd only be keeping a list of
(len, offset) tuples without any data, and drive-mirror then reads the
data from the disk image? If that is the case, how do you handle a flush?
Does a flush need to wait for drive-mirror to drain the entire outgoing
queue to the target before it can complete? If not, how do you prevent writes
that happen after a flush from overwriting the data that will be sent to
the target in case that hasn't reached the flush point yet.


If so, that could have a significant performance impact on the guest.


After the copy phase is done, in order to avoid race conditions, the
bitmap should be reset and mirroring should start directly and
atomically. Is that currently handled by your design?


Yes, this is already all correct.


OK, I think I was confused by your description of drive-mirror in the
wiki. It says that it starts mirroring, but what it also does is copy
the source to the target before that. It is clear from the description
of the sync option though.


Thanks,
Geert



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-30 Thread Paolo Bonzini
On 30/05/2012 14:34, Geert Jansen wrote:
 
 On 05/29/2012 02:52 PM, Paolo Bonzini wrote:
 
 Does the drive-mirror coroutine send the writes to the target in the
 same order as they are sent to the source? I assume so.

 No, it doesn't.  It's asynchronous; for continuous replication, the
 target knows that it has a consistent view whenever it sees a flush on
 the NBD stream.  Flushing the target is needed anyway before writing the
 dirty bitmap, so the target might as well exploit them to get
 information about the state of the source.

 The target _must_ flush to disk when it receives a flush command, no
 matter how close they are.  It _may_ choose to snapshot the disk, for
 example establishing one new snapshot every 5 seconds.
 
 Interesting. So it works quite differently than I had assumed. Some
 follow-up questions, hope you don't mind...
 
  * I assume a flush roughly corresponds to an fsync() in the guest OS?

Yes (or a metadata flush from the guest OS filesystem, since our guest
models do not support attaching the FUA bit to single writes).

  * Writes will not be re-ordered over a flush boundary, right?

More or less.  This for example is a valid ordering:

write sector 0
 write 0 returns
flush
write sector 1
 write 1 returns
 flush returns

However, writes that have already returned will not be re-ordered over a
flush boundary.

 A synchronous implementation is not forbidden by the spec (by design),
 but at the moment it's a bit more complex to implement because, as you
 mention, it requires buffering the I/O data on the host.
 
 So if I understand correctly, you'd only be keeping a list of
 (len, offset) tuples without any data, and drive-mirror then reads the
 data from the disk image? If that is the case, how do you handle a flush?
 Does a flush need to wait for drive-mirror to drain the entire outgoing
 queue to the target before it can complete? If not, how do you prevent writes
 that happen after a flush from overwriting the data that will be sent to
 the target in case that hasn't reached the flush point yet.

The key is that:

1) you only flush the target when you have a consistent image of the
source on the destination, and the replication server only creates a
snapshot when it receives a flush.  Thus, the server does not create a
consistent snapshot unless the client was able to keep pace with the guest.

2) target flushes do not have to coincide with a source flush.  Writes
after the last source flush _can_ be inconsistent between the source and
the destination!  What matters is that all writes up to the last source
flush are consistent.

Say the guest starts with (4 characters = 1 sector) aaaa bbbb cccc on
disk

and then the following happens

guest             disk             dirty count   mirroring
----------------  ---------------  ------------  ----------------------------
                  aaaa bbbb cccc   0
write 1 = xxxx    aaaa xxxx cccc   1
FLUSH
write 1 = yyyy    aaaa yyyy cccc
                                                 dirty bitmap: sector 1 dirty
write 2 = zzzz    aaaa yyyy zzzz   2
                                   1             copy sector 1 = yyyy
                                   0             copy sector 2 = zzzz
                                                 FLUSH
                                                 dirty bitmap: all clean
write 0 = vvvv    vvvv yyyy zzzz
write 0 = wwww    wwww yyyy zzzz

and then a power loss happens on the source.

The guest now has the dirty bitmap saying all clean even though the
source now is wwww yyyy zzzz and the destination aaaa yyyy zzzz.
However, this is not a problem because both are consistent with the last
flush.

I attach a Promela model of the algorithm I'm going to implement.  It's
not exactly the one I posted upthread; I successfully ran this one
through a model checker, so this one works. :)  (I tested 3 sectors / 1
write, i.e. the case above, and 2 sectors / 3 writes. It should be
enough given that it goes exhaustively through the entire state space).

I have another model with two concurrent writers, but it is quite messy
and I don't think it adds much.

It shouldn't be hard to follow, the only tricky thing is that multiple
branches in an if or do can be true, and if so all paths will be
explored by the model checker.  An else is only executed if all the
other paths are false.
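
For reference, here is a rough Python replay of the trace above.  It is
illustrative only -- not QEMU code, and no substitute for the Promela model,
since it checks the property for this single trace rather than the whole
state space:

# Replay of the example trace (3 sectors, placeholder contents as above).
source = ["aaaa", "bbbb", "cccc"]     # guest-visible disk
dest = list(source)                   # mirror target, initially in sync
written_since_flush = set()           # sectors written after the last source FLUSH

def guest_write(sector, data):
    source[sector] = data
    written_since_flush.add(sector)

def source_flush():
    written_since_flush.clear()

def mirror_copy(sector):
    dest[sector] = source[sector]     # the asynchronous copy reads the *current* contents

guest_write(1, "xxxx"); source_flush()
guest_write(1, "yyyy"); guest_write(2, "zzzz")
mirror_copy(1); mirror_copy(2)        # dirty count reaches zero, the target flushes
guest_write(0, "vvvv"); guest_write(0, "wwww")

# power loss on the source: everything up to the last source flush must agree
for s in range(3):
    assert source[s] == dest[s] or s in written_since_flush
print("source:", source, "dest:", dest)   # they differ only in sector 0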

 Yes, this is already all correct.
 
 OK, i think i was confused by your description of drive-mirror in the
 wiki. It says that starts mirroring, but what it also does is that it
 copies the source to the target before it does that. It is clear from
 the description of the sync option though.

Yes, the sync option simply fills in the dirty bitmap before starting
the actual loop.

Paolo
/*
 * Formal model for disk synchronization.
 *
 * Copyright (C) 2012 Red Hat, Inc.
 * Author: Paolo Bonzini pbonz...@redhat.com
 *
 * State space explodes real fast:
 *
 *   SEC \ MAX   1   2  

Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-29 Thread Geert Jansen

Hi,

On 05/24/2012 04:19 PM, Paolo Bonzini wrote:


Here is how the bitmaps are handled when doing I/O on the source:
- after writing to the source:
   - clear bit in the volatile in-flight bitmap
   - set bit in the persistent dirty bitmap

- after flushing the source:
   - msync the persistent bitmap to disk


Here is how the bitmaps are handled in the drive-mirror coroutine:
- before reading from the source:
   - set bit in the volatile in-flight bitmap

- after writing to the target:
   - if the dirty count will become zero, flush the target
   - if the bit is still set in the in-flight bitmap, clear bit in the
 persistent dirty bitmap
   - clear bit in the volatile in-flight bitmap


I have a few questions, apologies if some of these are obvious..

I assume the target can be any QEmu block driver including e.g. NBD? A 
networked block driver would be required for a continuous replication 
solution.


Does the drive-mirror coroutine send the writes to the target in the 
same order as they are sent to the source? I assume so.


Does the drive-mirror coroutine require that writes are acknowledged? 
I'd assume so, as you mention that the bit from the persistent bitmap is 
cleared after a write, so you'd need to know the write arrived otherwise 
you cannot safely clear the bit.


If the two above are true (sending in-order, and requiring acknowledgment
of writes by the target), then I assume there is a need to keep an
in-memory list with the I/Os that still need to be sent to the target?
That list could get too large if e.g. the target cannot keep up or
becomes unavailable. When this happens, the dirty bitmap is needed to
re-establish a synchronized state between the two images.


For this re-sync, I think there will be two phases. The first phase
would send blocks marked as dirty by the bitmap. I assume these would be
sent in arbitrary order, not the order in which they were sent to the
source, right?


After the copy phase is done, in order to avoid race conditions, the
bitmap should be reset and mirroring should start directly and
atomically. Is that currently handled by your design?


Also probably the target would need some kind of signal that the copy
ended and that we are now mirroring, because this is when writes are
in-order again, and therefore only in this phase can the solution
provide crash-consistent protection. In the copy phase no crash
consistency can be provided, if I am not mistaken.


Finally, again if I am not mistaken, I think that the scenario where
synchronization is lost with the target is exactly the same as when you
need to do an initial copy, except that in the latter case all bits in
the bitmap are set, right?


Regards,
Geert



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-29 Thread Paolo Bonzini
On 29/05/2012 13:57, Geert Jansen wrote:
 I assume the target can be any QEmu block driver including e.g. NBD? A
 networked block driver would be required for a continuous replication
 solution.

Yes.

 Does the drive-mirror coroutine send the writes to the target in the
 same order as they are sent to the source? I assume so.

No, it doesn't.  It's asynchronous; for continuous replication, the
target knows that it has a consistent view whenever it sees a flush on
the NBD stream.  Flushing the target is needed anyway before writing the
dirty bitmap, so the target might as well exploit them to get
information about the state of the source.

The target _must_ flush to disk when it receives a flush command, no
matter how close they are.  It _may_ choose to snapshot the disk, for
example establishing one new snapshot every 5 seconds.

A synchronous implementation is not forbidden by the spec (by design),
but at the moment it's a bit more complex to implement because, as you
mention, it requires buffering the I/O data on the host.

 Does the drive-mirror coroutine require that writes are acknowledged?
 I'd assume so, as you mention that the bit from the persistent bitmap is
 cleared after a write, so you'd need to know the write arrived otherwise
 you cannot safely clear the bit.

Yes, the drive-mirror coroutine will not go on to the next write until
the previous one is acked.

 For this re-sync, i think there will be two phases. The first phase
 would send blocks marked as dirty by the bitmap. I assume these would be
 sent in arbitrary order, not the order in which they were sent to the
 source, right?
 
 After the copy phase is done, in order to avoid race conditions, the
 bitmap should be reset and mirroring should start directly and
 atomically. Is that currently handed by your design?

Yes, this is already all correct.

 Also probably the target would need some kind of signal that the copy
 ended and that we are now mirroring because this is when writes are
 in-order again, and therefore only in this phase the solution can
 provide crash consistent protection. In the copy phase no crash
 consistency can be provided if i am not mistaken.

The copy phase will not have flushes (they are kind of useless).

 Finally, again if i am not mistaken, I think that the scenario where
 synchronization is lost with the target is exactly the same as when you
 need to do an initial copy, except that in the latter case all bits in
 the bitmap are set, right?

Yes.

Paolo



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-25 Thread Stefan Hajnoczi
On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:
 changes from v1:
 - added per-job iostatus
 - added description of persistent dirty bitmap
 
 The same content is also at
 http://wiki.qemu.org/Features/LiveBlockMigration/1.2
 
 
 QMP changes for error handling
 ==
 
 * query-block-jobs: BlockJobInfo gets two new fields, paused and
 io-status.  The job-specific iostatus is completely separate from the
 block device iostatus.
 
 
 * block-stream: I would still like to add on_error to the existing
 block-stream command, if only to ease unit testing.  Concerns about the
 stability of the API can be handled by adding introspection (exporting
 the schema), which is not hard to do.  The new option is an enum with
 the following possible values:
 
 'report': The behavior is the same as in 1.1.  An I/O error will
 complete the job immediately with an error code.
 
 'ignore': An I/O error, respectively during a read or a write, will be
 ignored.  For streaming, the job will complete with an error and the
 backing file will be left in place.  For mirroring, the sector will be
 marked again as dirty and re-examined later.
 
 'stop': The job will be paused, and the job iostatus (which can be
 examined with query-block-jobs) is updated.
 
 'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

'stop' and 'enospc' must raise a QMP event so the user is notified when
the job is paused.  Are the details on this missing from this draft?

Stefan




Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-25 Thread Kevin Wolf
On 25.05.2012 10:28, Stefan Hajnoczi wrote:
 On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:
 changes from v1:
 - added per-job iostatus
 - added description of persistent dirty bitmap

 The same content is also at
 http://wiki.qemu.org/Features/LiveBlockMigration/1.2


 QMP changes for error handling
 ==

 * query-block-jobs: BlockJobInfo gets two new fields, paused and
 io-status.  The job-specific iostatus is completely separate from the
 block device iostatus.


 * block-stream: I would still like to add on_error to the existing
 block-stream command, if only to ease unit testing.  Concerns about the
 stability of the API can be handled by adding introspection (exporting
 the schema), which is not hard to do.  The new option is an enum with
 the following possible values:

 'report': The behavior is the same as in 1.1.  An I/O error will
 complete the job immediately with an error code.

 'ignore': An I/O error, respectively during a read or a write, will be
 ignored.  For streaming, the job will complete with an error and the
 backing file will be left in place.  For mirroring, the sector will be
 marked again as dirty and re-examined later.

 'stop': The job will be paused, and the job iostatus (which can be
 examined with query-block-jobs) is updated.

 'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

May I quote the next two lines as well?

In all cases, even for 'report', the I/O error is reported as a QMP
event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR.

 'stop' and 'enospc' must raise a QMP event so the user is notified when
 the job is paused.  Are the details on this missing from this draft?

No, just from your quote. :-)

Kevin



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-25 Thread Paolo Bonzini
On 24/05/2012 18:57, Eric Blake wrote:
 On 05/24/2012 07:41 AM, Paolo Bonzini wrote:
 changes from v1:
 - added per-job iostatus
 - added description of persistent dirty bitmap

 The same content is also at
 http://wiki.qemu.org/Features/LiveBlockMigration/1.2

 
 * query-block-jobs: BlockJobInfo gets two new fields, paused and
 io-status.  The job-specific iostatus is completely separate from the
 block device iostatus.
 
 Is it still true that for mirror jobs, whether we are mirroring is still
 determined by whether 'len'=='offset'?

Yes.

 * block-job-complete: force completion of mirroring and switching of the
 device to the target, not related to the rest of the proposal.
 Synchronously opens backing files if needed, asynchronously completes
 the job.
 
 Can this be made part of a 'transaction'?  Likewise, can
 'block-job-cancel' be made part of a 'transaction'?

Both of them are asynchronous so they would not create an atomic
snapshot.  We could add it later, in the meanwhile you can wrap with
fsfreeze/fsthaw.

 But now that you are adding the possibility of mirroring reverting
 to copying, there is a race where I can probe and see that we are
 in mirroring, then issue a 'block-job-cancel' to effect a copy operation,
 but in the meantime things reverted, and the cancel ends up leaving me
 with an incomplete copy.

Hmm, that's right.  But then this can only happen if you have an error
in the target.  I can make block-job-cancel _not_ resume a paused job.
Would that satisfy your needs?

 Persistent dirty bitmap
 ===

 A persistent dirty bitmap can be used by management for two reasons.
 When mirroring is used for continuous replication of storage, to record
 I/O operations that happened while the replication server is not
 connected or unavailable.  When mirroring is used for storage migration,
 to check after a management crash whether the VM must be restarted with
 the source or the destination.
 
 Is there a particular file format for the dirty bitmap?  Is there a
 header, or is it just straight bitmap, where the size of the file is an
 exact function of size of the file that it maps?

I think it could be just a straight bitmap.
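
If it does end up being a straight, headerless bitmap, the file size is a
simple function of the image size.  A quick illustrative sketch (the 64 KiB
granularity is only an example, nothing in this thread fixes it):

def bitmap_file_size(image_size, granularity=64 * 1024):
    """Bytes needed for a headerless bitmap with one bit per `granularity` bytes."""
    bits = -(-image_size // granularity)    # ceil(image_size / granularity)
    return -(-bits // 8)                    # ceil(bits / 8)

print(bitmap_file_size(10 * 1024 ** 3))     # 10 GiB image -> 20480 bytes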

 management can restart the virtual
 machine with /mnt/dest/diskname.img.  If it has even a single zero bit,
 
 s/zero/non-zero/

Doh, of course.

Paolo



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-25 Thread Paolo Bonzini
On 24/05/2012 17:32, Dor Laor wrote:
 I didn't understand whether the persistent dirty bitmap needs to be
 flushed. This bitmap actually controls the persistent known state of the
 destination image. Since w/ mirroring we always have the source in full
 state condition, we can choose to lazily update the destination w/ a risk
 of losing some content from the last flush (of the destination only side).

Flushing the dirty bitmap after writing to the target can indeed be
tuned for the application.  However, it is not optional to msync the
bitmap when flushing the source.  If the source has a power loss, it has
to know what to retransmit.

 This way one can pick the frequency of flushing the persistent bitmap
 (and the respective target I/O writes).  Continuous replication can choose
 a time-based fashion, such as every 5 seconds.

But then the target is not able to restore a consistent state (which is
a state where the dirty bitmap is all-zeros).

The scheme above is roughly what DRBD does.  But in any case,
optimizations need to be worked out with a model checker, it's too delicate.

Paolo



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-25 Thread Stefan Hajnoczi
On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:
 Persistent dirty bitmap
 ===
 
 A persistent dirty bitmap can be used by management for two reasons.
 When mirroring is used for continuous replication of storage, to record
 I/O operations that happened while the replication server is not
 connected or unavailable.

For incremental backups we also need a dirty bitmap API.  This allows
backup software to determine which blocks are dirty in a snapshot.
The backup software will only copy out the dirty blocks.

(For external snapshots the dirty bitmap is actually the qemu-io -c
map of is_allocated clusters.  But for internal snapshots it would be a
diff of image metadata which results in a real bitmap.)

So it seems like a dirty bitmap API will be required for continuous
replication (so the server can find out what it missed) and for
incremental backup.  If there is commonality here we should work
together so the same API can do both.

Stefan




Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-25 Thread Paolo Bonzini
On 25/05/2012 11:43, Stefan Hajnoczi wrote:
 On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:
 Persistent dirty bitmap
 ===

 A persistent dirty bitmap can be used by management for two reasons.
 When mirroring is used for continuous replication of storage, to record
 I/O operations that happened while the replication server is not
 connected or unavailable.
 
 For incremental backups we also need a dirty bitmap API.  This allows
 backup software to determine which blocks are dirty in a snapshot.
 The backup software will only copy out the dirty blocks.
 
 (For external snapshots the dirty bitmap is actually the qemu-io -c
 map of is_allocated clusters.  But for internal snapshots it would be a
 diff of image metadata which results in a real bitmap.)

Perhaps that could simply be a new qemu-img subcommand?  It should be possible
to run it while the VM is offline.  Then the file that is produced could
be fed to blockdev-dirty-enable.

Paolo

 So it seems like a dirty bitmap API will be required for continuous
 replication (so the server can find out what it missed) and for
 incremental backup.  If there is commonality here we should work
 together so the same API can do both.
 
 Stefan
 




Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-25 Thread Stefan Hajnoczi
On Fri, May 25, 2012 at 01:17:04PM +0200, Paolo Bonzini wrote:
 On 25/05/2012 11:43, Stefan Hajnoczi wrote:
  On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:
  Persistent dirty bitmap
  ===
 
  A persistent dirty bitmap can be used by management for two reasons.
  When mirroring is used for continuous replication of storage, to record
  I/O operations that happened while the replication server is not
  connected or unavailable.
  
  For incremental backups we also need a dirty bitmap API.  This allows
  backup software to determine which blocks are dirty in a snapshot.
  The backup software will only copy out the dirty blocks.
  
  (For external snapshots the dirty bitmap is actually the qemu-io -c
  map of is_allocated clusters.  But for internal snapshots it would be a
  diff of image metadata which results in a real bitmap.)
 
 Perhaps that could simply be a new qemu-img subcommand?  It should be possible
 to run it while the VM is offline.  Then the file that is produced could
 be fed to blockdev-dirty-enable.

For both continuous replication and incremental backups we cannot
require that the guest is shut down in order to collect the dirty
bitmap, I think.

Also, anything to do with internal snapshots needs to be online since
QEMU will still have the image file open and be writing to it.  With
backing files we have a little bit more wiggle room when QEMU has
backing files open read-only.

I think we really need a libvirt API because a local file not only has
permissions issues but also is not network transparent.  The continuous
replication server runs on another machine, how will it access the dirty
bitmap file?

Stefan




Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-25 Thread Paolo Bonzini
On 25/05/2012 14:09, Stefan Hajnoczi wrote:
  
  Perhaps that could simply be a new qemu-img subcommand?  It should be possible
  to run it while the VM is offline.  Then the file that is produced could
  be fed to blockdev-dirty-enable.
 For both continuous replication and incremental backups we cannot
 require that the guest is shut down in order to collect the dirty
 bitmap, I think.

Yes, that is a problem for internal snapshots.  For external snapshots,
see the drive-mirror command's sync parameter.  Perhaps we can add a
blockdev-dirty-fill command that adds allocated sectors up to a given
base to the dirty bitmap.

 I think we really need a libvirt API because a local file not only has
 permissions issues but also is not network transparent.  The continuous
 replication server runs on another machine, how will it access the dirty
 bitmap file?

This is still using a push model where the dirty data is sent from
QEMU to the replication server, so the dirty bitmap is not needed on the
machine that runs the replication server---only on the machine that runs
the VM (to preserve the bitmap across VM shutdowns including power
loss).  It has to be stored on shared storage if you plan to run the VM
from multiple hosts.

Paolo



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-25 Thread Eric Blake
On 05/25/2012 02:48 AM, Paolo Bonzini wrote:

 * block-job-complete: force completion of mirroring and switching of the
 device to the target, not related to the rest of the proposal.
 Synchronously opens backing files if needed, asynchronously completes
 the job.

 Can this be made part of a 'transaction'?  Likewise, can
 'block-job-cancel' be made part of a 'transaction'?
 
 Both of them are asynchronous so they would not create an atomic
 snapshot.  We could add it later, in the meanwhile you can wrap with
 fsfreeze/fsthaw.

It doesn't have to be right away, I just want to make sure that we
aren't excluding it from a possible future extension, because it _does_
sound useful.

 
 But now that you are adding the possibility of mirroring reverting
 to copying, there is a race where I can probe and see that we are
  in mirroring, then issue a 'block-job-cancel' to effect a copy operation,
 but in the meantime things reverted, and the cancel ends up leaving me
 with an incomplete copy.
 
 Hmm, that's right.  But then this can only happen if you have an error
 in the target.  I can make block-job-cancel _not_ resume a paused job.
 Would that satisfy your needs?

I'm not sure I follow what you are asking.  My scenario is:

call 'drive-mirror' to start a job
'block-job-complete' fails because job is not ready, but the job is not
affected
wait for the event telling me we are in mirroring phase
start issuing my call to 'block-job-complete' to pivot
something happens where we are no longer mirroring
'block-job-complete' fails because we are not mirroring - good

call 'drive-mirror' to start a job
calling 'block-job-cancel' would abort the job, which is not what I want
wait for the event telling me we are in mirroring phase
start issuing my call to 'block-job-cancel' to cleanly leave the copy behind
something happens where we are no longer mirroring
'block-job-cancel' completes, but did not leave a complete mirror - bad

On the other hand, if I'm _not_ trying to make a clean copy, then I want
'block-job-cancel' to work as fast as possible, no matter what.

I'm not sure why having block-job-cancel resume or not resume a job
would make a difference.  What I really am asking for here is a way to
have some command (perhaps 'block-job-complete' but with an optional
flag set to a non-default value) that says I want to complete the job as
a clean copy, but revert back to the source rather than pivot to the
destination, and to cleanly fail with the job still around for
additional actions if I cannot get a clean copy at the current moment,
in the same way that the default 'block-job-complete' cleanly fails but
does not kill the job if I'm not mirroring yet.

-- 
Eric Blake   ebl...@redhat.com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-25 Thread Luiz Capitulino
On Thu, 24 May 2012 15:41:29 +0200
Paolo Bonzini pbonz...@redhat.com wrote:

 * block-stream: I would still like to add on_error to the existing
 block-stream command, if only to ease unit testing.  Concerns about the
 stability of the API can be handled by adding introspection (exporting
 the schema), which is not hard to do.  The new option is an enum with
 the following possible values:
 
 'report': The behavior is the same as in 1.1.  An I/O error will
 complete the job immediately with an error code.
 
 'ignore': An I/O error, respectively during a read or a write, will be
 ignored.  For streaming, the job will complete with an error and the
 backing file will be left in place.  For mirroring, the sector will be
 marked again as dirty and re-examined later.
 
 'stop': The job will be paused, and the job iostatus (which can be
 examined with query-block-jobs) is updated.
 
 'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.
 
 In all cases, even for 'report', the I/O error is reported as a QMP
 event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR.
 
 After cancelling a job, the job implementation MAY choose to treat stop
 and enospc values as report, i.e. complete the job immediately with an
 error code, as long as block_job_is_cancelled(job) returns true when the
 completion callback is called.
 
   Open problem: There could be unrecoverable errors in which the job
   will always fail as if rerror/werror were set to report (example:
   error while switching backing files).  Does it make sense to fire an
   event before the point in time where such errors can happen?

You mean, you fire the event before the point the error can happen but
the operation keeps running if it doesn't fail?

If that's the case, I think that the returned error is enough for the mngt app
to decide what to do.

 * block-job-pause: A new QMP command.  Takes a block device (drive),
 pauses an active background block operation on that device.  This
 command returns immediately after marking the active background block
 operation for pausing.  It is an error to call this command if no
 operation is in progress.  The operation will pause as soon as possible
 (it won't pause if the job is being cancelled).  No event is emitted
 when the operation is actually paused.  Cancelling a paused job
 automatically resumes it.

Is pausing guaranteed to succeed?



[Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-24 Thread Paolo Bonzini
changes from v1:
- added per-job iostatus
- added description of persistent dirty bitmap

The same content is also at
http://wiki.qemu.org/Features/LiveBlockMigration/1.2


QMP changes for error handling
==

* query-block-jobs: BlockJobInfo gets two new fields, paused and
io-status.  The job-specific iostatus is completely separate from the
block device iostatus.


* block-stream: I would still like to add on_error to the existing
block-stream command, if only to ease unit testing.  Concerns about the
stability of the API can be handled by adding introspection (exporting
the schema), which is not hard to do.  The new option is an enum with
the following possible values:

'report': The behavior is the same as in 1.1.  An I/O error will
complete the job immediately with an error code.

'ignore': An I/O error, respectively during a read or a write, will be
ignored.  For streaming, the job will complete with an error and the
backing file will be left in place.  For mirroring, the sector will be
marked again as dirty and re-examined later.

'stop': The job will be paused, and the job iostatus (which can be
examined with query-block-jobs) is updated.

'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

In all cases, even for 'report', the I/O error is reported as a QMP
event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR.

After cancelling a job, the job implementation MAY choose to treat stop
and enospc values as report, i.e. complete the job immediately with an
error code, as long as block_job_is_cancelled(job) returns true when the
completion callback is called.

  Open problem: There could be unrecoverable errors in which the job
  will always fail as if rerror/werror were set to report (example:
  error while switching backing files).  Does it make sense to fire an
  event before the point in time where such errors can happen?
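
For illustration, this is roughly how the new option could be exercised over
a QMP socket.  Only the report/ignore/stop/enospc values and the
BLOCK_JOB_ERROR event come from this proposal; the exact option spelling
('on-error' below), the socket path and the device name are assumptions:

import json, socket

sock = socket.socket(socket.AF_UNIX)
sock.connect("/tmp/qmp.sock")        # assumes -qmp unix:/tmp/qmp.sock,server
chan = sock.makefile("rw")
json.loads(chan.readline())          # discard the QMP greeting

def qmp(cmd, **args):
    chan.write(json.dumps({"execute": cmd, "arguments": args}) + "\n")
    chan.flush()
    while True:
        msg = json.loads(chan.readline())
        if "event" in msg:           # e.g. BLOCK_JOB_ERROR; with 'stop' the job is now paused
            print("event:", msg["event"], msg.get("data"))
            continue
        return msg                   # {"return": ...} or {"error": ...}

qmp("qmp_capabilities")              # leave capabilities negotiation mode
print(qmp("block-stream", device="ide0-hd0", **{"on-error": "stop"}))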


* block-job-pause: A new QMP command.  Takes a block device (drive),
pauses an active background block operation on that device.  This
command returns immediately after marking the active background block
operation for pausing.  It is an error to call this command if no
operation is in progress.  The operation will pause as soon as possible
(it won't pause if the job is being cancelled).  No event is emitted
when the operation is actually paused.  Cancelling a paused job
automatically resumes it.


* block-job-resume: A new QMP command.  Takes a block device (drive),
resume a paused background block operation on that device.  This command
returns immediately after resuming a paused background block operation.
 It is an error to call this command if no operation is in progress.

A successful block-job-resume operation also resets the iostatus on the
job that is passed.

  Rationale: block-job-resume is required to restart a job that had
  on_error behavior set to 'stop' or 'enospc'.  Adding block-job-pause
  makes it simpler to test the new feature.


Other points specific to mirroring
==

* query-block-jobs: The returned JSON object will grow an additional
member, target.  The target field is a dictionary with two fields,
info and stats (resembling the output of query-block and
query-blockstats but for the mirroring target).  Member device of the
BlockInfo structure will be made optional.

  Rationale: this allows libvirt to observe the high watermark of qcow2
  mirroring targets.

If present, the target has its own iostatus.  It is set when the job is
paused due to an error on the target (together with sending a
BLOCK_JOB_ERROR event). block-job-resume resets it.
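
Purely as an illustration of the shape described above (field spellings and
the exact nesting are guesses, not a fixed schema), one element of a
query-block-jobs reply for a mirror job could look like:

example_job = {
    "type": "mirror",
    "device": "ide0-hd0",
    "len": 21474836480,
    "offset": 21474836480,        # len == offset: the mirroring phase was reached
    "paused": False,              # new field
    "io-status": "ok",            # new, job-specific iostatus
    "target": {                   # new: info/stats about the mirroring target
        "info": {"file": "/mnt/dest/diskname.img", "format": "qcow2"},
        "stats": {"wr_highest_offset": 17179869184},   # the high watermark libvirt wants
    },
}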


* drive-mirror: activates mirroring to a second block device (optionally
creating the image on that second block device).  Compared to the
earlier versions, the full argument is replaced by an enum option
sync with three values:

- top: copies data in the topmost image to the destination

- full: copies data from all images to the destination

- dirty: copies clusters that are marked in the dirty bitmap to the
destination (see below)
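
As a concrete but purely illustrative example of the sync values above (every
argument name other than sync is an assumption, since this note does not spell
out the full signature):

import json

drive_mirror = {
    "execute": "drive-mirror",
    "arguments": {
        "device": "ide0-hd0",              # assumed argument names
        "target": "/mnt/dest/diskname.img",
        "sync": "full",                    # or "top", or "dirty" to reuse the dirty bitmap
    },
}
print(json.dumps(drive_mirror))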


* block-job-complete: force completion of mirroring and switching of the
device to the target, not related to the rest of the proposal.
Synchronously opens backing files if needed, asynchronously completes
the job.


* MIRROR_STATE_CHANGE: new event, triggered every time the
block-job-complete becomes available/unavailable.  Contains the device
name (like device: 'ide0-hd0'), and the state (synced: true/false).


Persistent dirty bitmap
===

A persistent dirty bitmap can be used by management for two reasons.
When mirroring is used for continuous replication of storage, to record
I/O operations that happened while the replication server is not
connected or unavailable.  When mirroring is used for storage migration,
to check after a management crash whether the VM must be restarted with
the source or the destination.

The dirty bitmap is synchronized on every bdrv_flush (or on every 

Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-24 Thread Ori Mamluk

On 24/05/2012 16:41, Paolo Bonzini wrote:

The dirty bitmap is managed by these QMP commands:

* blockdev-dirty-enable: takes a file name used for the dirty bitmap,
and an optional granularity.  Setting the granularity will not be
supported in the initial version.

* query-block-dirty: returns statistics about the dirty bitmap: right
now the granularity, the number of bits that are set, and whether QEMU
is using the dirty bitmap or just adding to it.

* blockdev-dirty-disable: disable the dirty bitmap.



When do bits get cleared from the bitmap?
using the dirty bitmap or just adding to it - I'm not sure I
understand what you mean. What's the difference?


Thanks,
Ori



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-24 Thread Paolo Bonzini
On 24/05/2012 16:00, Ori Mamluk wrote:
 
 The dirty bitmap is managed by these QMP commands:

 * blockdev-dirty-enable: takes a file name used for the dirty bitmap,
 and an optional granularity.  Setting the granularity will not be
 supported in the initial version.

 * query-block-dirty: returns statistics about the dirty bitmap: right
 now the granularity, the number of bits that are set, and whether QEMU
 is using the dirty bitmap or just adding to it.

 * blockdev-dirty-disable: disable the dirty bitmap.


 When do bits get cleared from the bitmap?

drive-mirror clears bits from the bitmap as it processes the writes.

In addition to the persistent dirty bitmap, QEMU keeps an in-flight
bitmap.  The in-flight bitmap does not need to be persistent.


Here is how the bitmaps are handled when doing I/O on the source:
- after writing to the source:
  - clear bit in the volatile in-flight bitmap
  - set bit in the persistent dirty bitmap

- after flushing the source:
  - msync the persistent bitmap to disk


Here is how the bitmaps are handled in the drive-mirror coroutine:
- before reading from the source:
  - set bit in the volatile in-flight bitmap

- after writing to the target:
  - if the dirty count will become zero, flush the target
  - if the bit is still set in the in-flight bitmap, clear bit in the
persistent dirty bitmap
  - clear bit in the volatile in-flight bitmap
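
To restate those rules in executable form, here is a rough Python sketch.  It
is illustrative only, not the QEMU implementation; persist(), flush_target(),
read_source() and write_target() are stand-ins for msync, bdrv_flush and the
real I/O paths:

dirty = set()         # persistent dirty bitmap, one entry per dirty sector
in_flight = set()     # volatile in-flight bitmap

def persist(bitmap):       print("msync bitmap:", sorted(bitmap))
def flush_target():        print("flush target")
def read_source(sector):   return "data-%d" % sector
def write_target(s, data): print("write target sector", s, data)

def guest_write(sector):
    in_flight.discard(sector)    # clear bit in the volatile in-flight bitmap
    dirty.add(sector)            # set bit in the persistent dirty bitmap

def guest_flush():
    persist(dirty)               # msync the persistent bitmap to disk

def mirror_one(sector):
    in_flight.add(sector)        # before reading from the source
    write_target(sector, read_source(sector))
    if dirty == {sector} and sector in in_flight:
        flush_target()           # the dirty count is about to become zero
    if sector in in_flight:      # no guest write raced with this copy
        dirty.discard(sector)    # safe to clear the persistent dirty bit
    in_flight.discard(sector)    # clear the volatile in-flight bit

guest_write(1); guest_flush(); mirror_one(1)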

 using the dirty bitmap or just adding to it - I'm not sure I
 understand what you mean. what's the difference?

Processing the data and removing from the bitmap (mirroring active), or
just setting dirty bits (mirroring inactive).

Paolo



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-24 Thread Dor Laor

On 05/24/2012 05:19 PM, Paolo Bonzini wrote:

On 24/05/2012 16:00, Ori Mamluk wrote:



The dirty bitmap is managed by these QMP commands:

* blockdev-dirty-enable: takes a file name used for the dirty bitmap,
and an optional granularity.  Setting the granularity will not be
supported in the initial version.

* query-block-dirty: returns statistics about the dirty bitmap: right
now the granularity, the number of bits that are set, and whether QEMU
is using the dirty bitmap or just adding to it.

* blockdev-dirty-disable: disable the dirty bitmap.



When do bits get cleared from the bitmap?


drive-mirror clears bits from the bitmap as it processes the writes.

In addition to the persistent dirty bitmap, QEMU keeps an in-flight
bitmap.  The in-flight bitmap does not need to be persistent.


Here is how the bitmaps are handled when doing I/O on the source:
- after writing to the source:
   - clear bit in the volatile in-flight bitmap
   - set bit in the persistent dirty bitmap

- after flushing the source:
   - msync the persistent bitmap to disk


Here is how the bitmaps are handled in the drive-mirror coroutine:
- before reading from the source:
   - set bit in the volatile in-flight bitmap

- after writing to the target:
   - if the dirty count will become zero, flush the target
   - if the bit is still set in the in-flight bitmap, clear bit in the
 persistent dirty bitmap
   - clear bit in the volatile in-flight bitmap


I didn't understand whether the persistent dirty bitmap needs to be
flushed. This bitmap actually controls the persistent known state of the
destination image. Since w/ mirroring we always have the source in full
state condition, we can choose to lazily update the destination w/ a risk
of losing some content from the last flush (of the destination only side).


This way one can pick the frequency of flushing the persistent bitmap
(and the respective target I/O writes). Continuous replication can choose
a time-based fashion, such as every 5 seconds. A standard mirroring
job for live copy purposes can pick just to flush once at the end of the
copy process.


Dor



using the dirty bitmap or just adding to it - I'm not sure I
understand what you mean. what's the difference?


Processing the data and removing from the bitmap (mirroring active), or
just setting dirty bits (mirroring inactive).

Paolo






Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-24 Thread Eric Blake
On 05/24/2012 07:41 AM, Paolo Bonzini wrote:
 changes from v1:
 - added per-job iostatus
 - added description of persistent dirty bitmap
 
 The same content is also at
 http://wiki.qemu.org/Features/LiveBlockMigration/1.2
 

 * query-block-jobs: BlockJobInfo gets two new fields, paused and
 io-status.  The job-specific iostatus is completely separate from the
 block device iostatus.

Is it still true that for mirror jobs, whether we are mirroring is still
determined by whether 'len'=='offset'?

 * drive-mirror: activates mirroring to a second block device (optionally
 creating the image on that second block device).  Compared to the
 earlier versions, the full argument is replaced by an enum option
 sync with three values:
 
 - top: copies data in the topmost image to the destination
 
 - full: copies data from all images to the destination
 
 - dirty: copies clusters that are marked in the dirty bitmap to the
 destination (see below)

Different, but at least RHEL used the name __com.redhat_drive-mirror, so
libvirt can cope with the difference.

 
 
 * block-job-complete: force completion of mirroring and switching of the
 device to the target, not related to the rest of the proposal.
 Synchronously opens backing files if needed, asynchronously completes
 the job.

Can this be made part of a 'transaction'?  Likewise, can
'block-job-cancel' be made part of a 'transaction'?  Having those two
commands transactionable means that you could copy multiple disks at the
same point in time (block-job-cancel) or pivot multiple disks leaving
the former files consistent at the same point in time
(block-job-complete).  It doesn't have to be done in the first round,
but we should make sure we are not precluding this for future growth.

Also, for the purposes of copying but not pivoting, you only have a safe
copy if 'len'=='offset' at the time of the cancel.  But now that you are
adding the possibility of mirroring reverting to copying, there is a
race where I can probe and see that we are in mirroring, then issue a
'block-job-cancel' to effect a copy operation, but in the meantime
things reverted, and the cancel ends up leaving me with an incomplete
copy.  Maybe 'block-job-complete' should be given an optional boolean
parameter; by default or if the parameter is true, we pivot, but if
false, then we do the same as 'block-job-cancel' to effect a safe copy
if we are in mirroring, while erroring out if we are not in mirroring,
leaving 'block-job-cancel' as a way to always cancel a job but no longer
a safe way to guarantee a copy operation.


 Persistent dirty bitmap
 ===
 
 A persistent dirty bitmap can be used by management for two reasons.
 When mirroring is used for continuous replication of storage, to record
 I/O operations that happened while the replication server is not
 connected or unavailable.  When mirroring is used for storage migration,
 to check after a management crash whether the VM must be restarted with
 the source or the destination.

Is there a particular file format for the dirty bitmap?  Is there a
header, or is it just a straight bitmap, where the size of the file is an
exact function of the size of the file that it maps?

 
 If management crashes between (6) and (7), it can examine the dirty
 bitmap on disk.  If it is all-zeros,

Obviously, this would be all-zeros in the map portion of the file, any
header portion would not impact this.

 management can restart the virtual
 machine with /mnt/dest/diskname.img.  If it has even a single zero bit,

s/zero/non-zero/

 management can restart the virtual machine with the persistent dirty
 bitmap enabled, and later issue again a drive-mirror command to restart
 from step 4.
 
 Paolo
 

-- 
Eric Blake   ebl...@redhat.com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org


