Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-31 Thread Geert Jansen


On 05/30/2012 05:06 PM, Paolo Bonzini wrote:

I think it's beginning to dawn on me that what you have is correct, when 
I combine this:



2) target flushes do not have to coincide with a source flush.  Writes
after the last source flush _can_ be inconsistent between the source and
the destination!  What matters is that all writes up to the last source
flush are consistent.


with the statement you made earlier that the drive-mirror coroutine 
issues a target flush *after* a target write returns *and* the dirty 
count is zero.


However, I'm thinking that this design has two undesirable properties. 
Both properties have a high impact if you assume the replication 
appliance is high bandwidth but also potentially high latency (high 
latency because it runs in a guest, and is multiplexing I/Os for many 
other VMs).


1) Target flushes are not guaranteed to happen at all. If the latency of 
the target is higher than the maximum interval between writes to the 
source, the bitmap will always be dirty when a write to the target 
returns, and a target flush will never be issued.
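
To make point 1 concrete, here is how I picture the completion path (a 
minimal sketch with my own names, not the actual QEMU code): the flush 
is only issued when the dirty count happens to reach zero, so a source 
that dirties clusters at least as fast as the target drains them can 
postpone the flush forever.

#include <stdio.h>

static long dirty_count;              /* clusters still to be mirrored */

static void target_flush(void)
{
    printf("flush sent to target\n");
}

/* called when a write to the target completes */
static void on_target_write_done(void)
{
    dirty_count--;
    if (dirty_count == 0) {
        target_flush();
    }
}

int main(void)
{
    dirty_count = 2;                  /* two clusters waiting to be copied */
    for (int i = 0; i < 5; i++) {
        dirty_count++;                /* a guest write dirties a new cluster */
        on_target_write_done();       /* one mirror write completes */
    }
    /* dirty_count is still 2: it never reached zero, so no flush was
     * ever issued, which is exactly the concern in point 1 above */
    return 0;
}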


2) The fact that drive-mirror waits for acknowledgments of writes to the 
target means that there is at most one I/O outstanding and throughput is 
bound by latency.
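
To put a (made-up, purely illustrative) number on point 2: with 64 KiB 
mirror writes and a 10 ms round trip to the appliance, a single 
outstanding I/O caps mirroring throughput at 64 KiB every 10 ms, i.e. 
roughly 6 MB/s, no matter how much bandwidth is available. Only allowing 
several writes in flight would lift that cap.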


The Promela model is a bit out of my league, unfortunately :)

Regards,
Geert



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-30 Thread Geert Jansen


On 05/29/2012 02:52 PM, Paolo Bonzini wrote:


Does the drive-mirror coroutine send the writes to the target in the
same order as they are sent to the source? I assume so.


No, it doesn't.  It's asynchronous; for continuous replication, the
target knows that it has a consistent view whenever it sees a flush on
the NBD stream.  Flushing the target is needed anyway before writing the
dirty bitmap, so the target might as well exploit those flushes to get 
information about the state of the source.

The target _must_ flush to disk when it receives a flush command, no
matter how close together the flushes are.  It _may_ choose to snapshot
the disk, for example establishing one new snapshot every 5 seconds.


Interesting. So it works quite differently from what I had assumed. Some 
follow-up questions, I hope you don't mind...


 * I assume a flush roughly corresponds to an fsync() in the guest OS? 
Or is it more frequent than that?


 * Writes will not be re-ordered over a flush boundary, right?
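
Just to check that I'm reading this right, I picture the target side 
roughly like this (entirely my own sketch and names, nothing from the 
spec or QEMU; SNAPSHOT_INTERVAL is an invented example parameter):

#include <fcntl.h>
#include <time.h>
#include <unistd.h>

#define SNAPSHOT_INTERVAL 5        /* seconds between optional snapshots */

static time_t last_snapshot;

static void take_snapshot(void)
{
    /* e.g. create a new snapshot of the replica -- optional */
}

static void handle_flush_from_source(int replica_fd)
{
    fdatasync(replica_fd);         /* mandatory: the flush must reach disk */

    time_t now = time(NULL);
    if (now - last_snapshot >= SNAPSHOT_INTERVAL) {
        take_snapshot();           /* the replica is consistent here, so it
                                    * is a safe point to snapshot */
        last_snapshot = now;
    }
}

int main(void)
{
    int fd = open("replica.img", O_RDWR | O_CREAT, 0644);
    handle_flush_from_source(fd);
    close(fd);
    return 0;
}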


A synchronous implementation is not forbidden by the spec (by design),
but at the moment it's a bit more complex to implement because, as you
mention, it requires buffering the I/O data on the host.


So if I understand correctly, you'd only be keeping a list of
(len, offset) tuples without any data, and drive-mirror then reads the 
data from the disk image? If that is the case, how do you handle a flush? 
Does a flush need to wait for drive-mirror to drain the entire outgoing 
queue to the target before it can complete? If not, how do you prevent 
writes that happen after a flush from overwriting data that still has to 
be sent to the target and hasn't reached the flush point yet?


If so, that could have significant performance impact on the guest.
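
To make that question concrete, this is the behaviour I'm asking about 
(all names invented, just a sketch of my reading): a guest flush on the 
source would have to wait until the queue of (offset, len) records has 
been drained to the target, because any later guest write could 
overwrite data that drive-mirror still has to read from the source 
image.

#include <stdlib.h>

struct pending_io {
    long long offset;              /* no data is buffered, only the range */
    long long len;
    struct pending_io *next;
};

static struct pending_io *mirror_queue;

static void mirror_send_one(struct pending_io *io)
{
    /* read io->len bytes at io->offset from the source image and
     * write them to the target (omitted) */
}

static void guest_flush_on_source(void)
{
    /* drain everything that was queued before the flush... */
    while (mirror_queue) {
        struct pending_io *io = mirror_queue;
        mirror_queue = io->next;
        mirror_send_one(io);
        free(io);
    }
    /* ...and only then acknowledge the flush to the guest, which is
     * the performance impact mentioned above */
}

int main(void)
{
    struct pending_io *io = malloc(sizeof(*io));
    io->offset = 0;
    io->len = 4096;
    io->next = NULL;
    mirror_queue = io;
    guest_flush_on_source();
    return 0;
}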


After the copy phase is done, in order to avoid race conditions, the
bitmap should be reset and mirroring should start directly and
atomically. Is that currently handled by your design?


Yes, this is already all correct.


OK, I think I was confused by your description of drive-mirror in the 
wiki. It says that it starts mirroring, but it also copies the source to 
the target before the mirroring starts. That is clear from the 
description of the sync option, though.


Thanks,
Geert



Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]

2012-05-29 Thread Geert Jansen

Hi,

On 05/24/2012 04:19 PM, Paolo Bonzini wrote:


Here is how the bitmaps are handled when doing I/O on the source:
- after writing to the source:
   - clear bit in the volatile in-flight bitmap
   - set bit in the persistent dirty bitmap

- after flushing the source:
   - msync the persistent bitmap to disk


Here is how the bitmaps are handled in the drive-mirror coroutine:
- before reading from the source:
   - set bit in the volatile in-flight bitmap

- after writing to the target:
   - if the dirty count will become zero, flush the target
   - if the bit is still set in the in-flight bitmap, clear bit in the
 persistent dirty bitmap
   - clear bit in the volatile in-flight bitmap
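
Before the questions, let me write the above out as a sketch so you can 
tell me if I have misread it (the bitmap layout, the helper names and 
the CLUSTERS constant are all mine, not the real implementation):

#include <stdbool.h>

#define CLUSTERS 1024

static bool in_flight[CLUSTERS];    /* volatile, memory only */
static bool dirty[CLUSTERS];        /* persistent, mmap'ed file in reality */
static long dirty_count;

static void msync_dirty_bitmap(void) { /* msync() the persistent mapping */ }
static void flush_target(void)        { /* flush on the stream to the target */ }

/* guest I/O on the source */

static void after_source_write(int c)
{
    in_flight[c] = false;           /* any mirror copy in flight is now stale */
    if (!dirty[c]) {
        dirty[c] = true;
        dirty_count++;
    }
}

static void after_source_flush(void)
{
    msync_dirty_bitmap();
}

/* drive-mirror coroutine */

static void mirror_one_cluster(int c)
{
    in_flight[c] = true;
    /* ...read cluster c from the source, write it to the target... */
    if (dirty_count == 1) {
        flush_target();             /* the dirty count is about to become zero */
    }
    if (in_flight[c]) {             /* no guest write raced with the copy */
        dirty[c] = false;
        dirty_count--;
    }
    in_flight[c] = false;
}

int main(void)
{
    after_source_write(7);          /* the guest dirties cluster 7 */
    after_source_flush();
    mirror_one_cluster(7);          /* the mirror copies it and flushes */
    return 0;
}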


I have a few questions, apologies if some of these are obvious.

I assume the target can be any QEMU block driver, including for example 
NBD? A networked block driver would be required for a continuous 
replication solution.


Does the drive-mirror coroutine send the writes to the target in the 
same order as they are sent to the source? I assume so.


Does the drive-mirror coroutine require that writes are acknowledged? 
I'd assume so: you mention that the bit in the persistent bitmap is 
cleared after a write, so you need to know the write arrived, otherwise 
you cannot safely clear the bit.


If the two above are true (writes are sent in order, and acknowledgment 
by the target is required), then I assume there is a need to keep an 
in-memory list of the I/Os that still need to be sent to the target? 
That list could grow too large if, for example, the target cannot keep 
up or becomes unavailable. When this happens, the dirty bitmap is needed 
to re-establish a synchronized state between the two images.


For this re-sync, I think there will be two phases. The first phase 
would send the blocks marked as dirty in the bitmap. I assume these 
would be sent in arbitrary order, not the order in which they were 
issued to the source, right?
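
Again just to pin down my mental model of that first phase (my own 
sketch, nothing from the actual code): I picture the coroutine simply 
walking the dirty bitmap and copying whatever is set, with no relation 
to the order in which the guest originally wrote the data.

#include <stdbool.h>

#define CLUSTERS 1024

static bool dirty[CLUSTERS];        /* persistent dirty bitmap */

static void copy_cluster_to_target(int c)
{
    /* read cluster c from the source image, write it to the target */
}

static void resync_phase(void)
{
    for (int c = 0; c < CLUSTERS; c++) {   /* bitmap order, not write order */
        if (dirty[c]) {
            copy_cluster_to_target(c);
            dirty[c] = false;
        }
    }
    /* only after this loop finishes can in-order mirroring resume and
     * the target become crash consistent again */
}

int main(void)
{
    dirty[3] = dirty[42] = true;    /* pretend two clusters are out of sync */
    resync_phase();
    return 0;
}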


After the copy phase is done, in order to avoid race conditions, the 
bitmap should be reset and mirroring should start directly and 
atomically. Is that currently handled by your design?


Also, the target would probably need some kind of signal that the copy 
has ended and that we are now mirroring, because this is when writes are 
in order again, and therefore only in this phase can the solution 
provide crash-consistent protection. In the copy phase no crash 
consistency can be provided, if I am not mistaken.


Finally, again if I am not mistaken, I think that the scenario where 
synchronization with the target is lost is exactly the same as an 
initial copy, except that in the latter case all bits in the bitmap are 
set, right?


Regards,
Geert