Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 31/05/2012 10:44, Roni Luxenberg wrote:

a continuous replication application expects to get all IOs in the same order as issued by the guest irrespective of the (rate of the) flushes done within the guest.

Does real hardware give such consistency, unless you disable caching?

This is required to be able to support more advanced features like cross-VM consistency, where the application protects a group of VMs (possibly running on different hosts) forming a single logical application/service. Does the design strictly maintain this property?

No. But this is only a problem with the implementation, not with the API.

under this design and assuming an async implementation, is the same block that is written a few times in a row by the guest guaranteed to be received by the continuous replication agent the same exact number of times, without overwriting any of the writes?

No.

Paolo
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On Fri, May 25, 2012 at 11:25 PM, Paolo Bonzini pbonz...@redhat.com wrote:

On 25/05/2012 14:09, Stefan Hajnoczi wrote:

Perhaps that could simply be a new qemu-img subcommand? It should be possible to run it while the VM is offline. Then the file that is produced could be fed to blockdev-dirty-enable.

For both continuous replication and incremental backups we cannot require that the guest is shut down in order to collect the dirty bitmap, I think.

Yes, that is a problem for internal snapshots. For external snapshots, see the drive-mirror command's sync parameter. Perhaps we can add a blockdev-dirty-fill command that adds allocated sectors up to a given base to the dirty bitmap.

I think we really need a libvirt API because a local file not only has permissions issues but also is not network transparent. The continuous replication server runs on another machine, how will it access the dirty bitmap file?

This is still using a push model where the dirty data is sent from QEMU to the replication server, so the dirty bitmap is not needed on the machine that runs the replication server---only on the machine that runs the VM (to preserve the bitmap across VM shutdowns including power loss). It has to be stored on shared storage if you plan to run the VM from multiple hosts.

Why reinvent the wheel? Wouldn't it be much better to externalize the snapshotting? Some/many filesystems support snapshotting today. A whole lot of / most non-consumer-grade block storage devices support it too.

So a different way to do this would be to use a mechanism to quiesce the backing file/device and then call out to an external agent to snapshot the backing file or the backing device. Other external tools can then be used to compute a dense delta between this new snapshot and the previous snapshot and transfer it to the other side. I think all filesystems that support snapshotting at the file level support APIs for a cheap way to compute the block deltas. I would imagine all midrange or better block storage devices that support LUN snapshotting provide this too.

So why do snapshotting and computation of snapshot deltas in qemu? Why not just externalize it: if you want snapshotting and incremental replication, you must use a system where the file/block layer supports it.

regards
ronnie sahlberg
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 05/30/2012 05:06 PM, Paolo Bonzini wrote:

I think it's beginning to dawn on me that what you have is correct, when I combine this:

2) target flushes do not have to coincide with a source flush. Writes after the last source flush _can_ be inconsistent between the source and the destination! What matters is that all writes up to the last source flush are consistent.

with the statement you made earlier that the drive-mirror coroutine issues a target flush *after* a target write returns *and* the dirty count is zero.

However, I'm thinking that this design has two undesirable properties. Both properties have a high impact if you assume the replication appliance is high bandwidth but also potentially high latency (high latency because it runs in a guest, and is multiplexing I/Os for many different other VMs).

1) Target flushes are not guaranteed to happen at all. If the latency of the target is higher than the maximum interval between writes to the source, the bitmap will always be dirty when a write to the target returns, and a target flush will never be issued.

2) The fact that drive-mirror waits for acknowledgments of writes to the target means that there is at most one I/O outstanding, and throughput is bound by latency.

The Promela model is a bit out of my league, unfortunately :)

Regards,
Geert
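Property 2 is easy to quantify: with only one outstanding request, throughput cannot exceed one request per round trip. A back-of-the-envelope sketch in Python (the request size and latency below are made-up illustration values, not measurements of QEMU or of any appliance):

    # With a single outstanding I/O, throughput is bounded by round-trip latency.
    def max_throughput(request_bytes, rtt_seconds):
        # at most one request completes per round trip
        return request_bytes / rtt_seconds

    # e.g. 64 KiB mirror writes against a 10 ms round trip to the appliance
    print(max_throughput(64 * 1024, 0.010) / (1024 * 1024), "MiB/s")  # ~6.25 MiB/s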
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 31/05/2012 13:08, Geert Jansen wrote:

1) Target flushes are not guaranteed to happen at all. If the latency of the target is higher than the maximum interval between writes to the source, the bitmap will always be dirty when a write to the target returns, and a target flush will never be issued.

That's a very good point.

2) The fact that drive-mirror waits for acknowledgments of writes to the target means that there is at most one I/O outstanding and throughput is bound by latency.

That can be fixed, I'll add it to my ever-growing todo list. :)

Paolo
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
- Original Message -

On 30/05/2012 14:34, Geert Jansen wrote:

On 05/29/2012 02:52 PM, Paolo Bonzini wrote:

Does the drive-mirror coroutine send the writes to the target in the same order as they are sent to the source? I assume so.

No, it doesn't. It's asynchronous; for continuous replication, the target knows that it has a consistent view whenever it sees a flush on the NBD stream. Flushing the target is needed anyway before writing the dirty bitmap, so the target might as well exploit them to get information about the state of the source.

The target _must_ flush to disk when it receives a flush command, no matter how close they are. It _may_ choose to snapshot the disk, for example establishing one new snapshot every 5 seconds.

Interesting. So it works quite differently than I had assumed. Some follow-up questions, hope you don't mind...

* I assume a flush roughly corresponds to an fsync() in the guest OS?

Yes (or a metadata flush from the guest OS filesystem, since our guest models do not support attaching the FUA bit to single writes).

a continuous replication application expects to get all IOs in the same order as issued by the guest irrespective of the (rate of the) flushes done within the guest. This is required to be able to support more advanced features like cross-VM consistency, where the application protects a group of VMs (possibly running on different hosts) forming a single logical application/service. Does the design strictly maintain this property?

* Writes will not be re-ordered over a flush boundary, right?

More or less. This for example is a valid ordering:

    write sector 0
    write 0 returns
    flush
    write sector 1
    write 1 returns
    flush returns

However, writes that have already returned will not be re-ordered over a flush boundary.

A synchronous implementation is not forbidden by the spec (by design), but at the moment it's a bit more complex to implement because, as you mention, it requires buffering the I/O data on the host.

So if I understand correctly, you'd only be keeping a list of (len, offset) tuples without any data, and drive-mirror then reads the data from the disk image? If that is the case, how do you handle a flush? Does a flush need to wait for drive-mirror to drain the entire outgoing queue to the target before it can complete? If not, how do you prevent writes that happen after a flush from overwriting the data that will be sent to the target, in case that hasn't reached the flush point yet?

The key is that:

1) you only flush the target when you have a consistent image of the source on the destination, and the replication server only creates a snapshot when it receives a flush. Thus, the server does not create a consistent snapshot unless the client was able to keep pace with the guest.

2) target flushes do not have to coincide with a source flush. Writes after the last source flush _can_ be inconsistent between the source and the destination! What matters is that all writes up to the last source flush are consistent.

Say the guest starts with a four-sector disk (one character per sector in the original example) and then the following happens:

    guest             disk            dirty count    mirroring
    -----------------------------------------------------------
                                      0
    write sector 1
    FLUSH
                      write 1 lands   1              dirty bitmap: sector 1 dirty
    write sector 2                    2
                                      1              copy sector 1
                                      0              copy sector 2
                                                     FLUSH
                                                     dirty bitmap: all clean
    write sector 0
                      write 0 lands

and then a power loss happens on the source. The guest now has the dirty bitmap saying all clean even though the source (which already has the last write to sector 0) and the destination (which does not) now differ. However, this is not a problem because both are consistent with the last flush.
under this design and assuming an async implementation, is the same block that is written a few times in a row by the guest guaranteed to be received by the continuous replication agent the same exact number of times, without overwriting any of the writes?

Thanks,
Roni
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 05/29/2012 02:52 PM, Paolo Bonzini wrote:

Does the drive-mirror coroutine send the writes to the target in the same order as they are sent to the source? I assume so.

No, it doesn't. It's asynchronous; for continuous replication, the target knows that it has a consistent view whenever it sees a flush on the NBD stream. Flushing the target is needed anyway before writing the dirty bitmap, so the target might as well exploit them to get information about the state of the source. The target _must_ flush to disk when it receives a flush command, no matter how close they are. It _may_ choose to snapshot the disk, for example establishing one new snapshot every 5 seconds.

Interesting. So it works quite differently than I had assumed. Some follow-up questions, hope you don't mind...

* I assume a flush roughly corresponds to an fsync() in the guest OS? Or is it more frequent than that?

* Writes will not be re-ordered over a flush boundary, right?

A synchronous implementation is not forbidden by the spec (by design), but at the moment it's a bit more complex to implement because, as you mention, it requires buffering the I/O data on the host.

So if I understand correctly, you'd only be keeping a list of (len, offset) tuples without any data, and drive-mirror then reads the data from the disk image? If that is the case, how do you handle a flush? Does a flush need to wait for drive-mirror to drain the entire outgoing queue to the target before it can complete? If not, how do you prevent writes that happen after a flush from overwriting the data that will be sent to the target, in case that hasn't reached the flush point yet? If so, that could have a significant performance impact on the guest.

After the copy phase is done, in order to avoid race conditions, the bitmap should be reset and mirroring should start directly and atomically. Is that currently handled by your design?

Yes, this is already all correct.

OK, I think I was confused by your description of drive-mirror in the wiki. It says that it starts mirroring, but what it also does is copy the source to the target before that. It is clear from the description of the sync option though.

Thanks,
Geert
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 30/05/2012 14:34, Geert Jansen wrote:

On 05/29/2012 02:52 PM, Paolo Bonzini wrote:

Does the drive-mirror coroutine send the writes to the target in the same order as they are sent to the source? I assume so.

No, it doesn't. It's asynchronous; for continuous replication, the target knows that it has a consistent view whenever it sees a flush on the NBD stream. Flushing the target is needed anyway before writing the dirty bitmap, so the target might as well exploit them to get information about the state of the source. The target _must_ flush to disk when it receives a flush command, no matter how close they are. It _may_ choose to snapshot the disk, for example establishing one new snapshot every 5 seconds.

Interesting. So it works quite differently than I had assumed. Some follow-up questions, hope you don't mind...

* I assume a flush roughly corresponds to an fsync() in the guest OS?

Yes (or a metadata flush from the guest OS filesystem, since our guest models do not support attaching the FUA bit to single writes).

* Writes will not be re-ordered over a flush boundary, right?

More or less. This for example is a valid ordering:

    write sector 0
    write 0 returns
    flush
    write sector 1
    write 1 returns
    flush returns

However, writes that have already returned will not be re-ordered over a flush boundary.

A synchronous implementation is not forbidden by the spec (by design), but at the moment it's a bit more complex to implement because, as you mention, it requires buffering the I/O data on the host.

So if I understand correctly, you'd only be keeping a list of (len, offset) tuples without any data, and drive-mirror then reads the data from the disk image? If that is the case, how do you handle a flush? Does a flush need to wait for drive-mirror to drain the entire outgoing queue to the target before it can complete? If not, how do you prevent writes that happen after a flush from overwriting the data that will be sent to the target, in case that hasn't reached the flush point yet?

The key is that:

1) you only flush the target when you have a consistent image of the source on the destination, and the replication server only creates a snapshot when it receives a flush. Thus, the server does not create a consistent snapshot unless the client was able to keep pace with the guest.

2) target flushes do not have to coincide with a source flush. Writes after the last source flush _can_ be inconsistent between the source and the destination! What matters is that all writes up to the last source flush are consistent.

Say the guest starts with a four-sector disk (one character per sector in the original example) and then the following happens:

    guest             disk            dirty count    mirroring
    -----------------------------------------------------------
                                      0
    write sector 1
    FLUSH
                      write 1 lands   1              dirty bitmap: sector 1 dirty
    write sector 2                    2
                                      1              copy sector 1
                                      0              copy sector 2
                                                     FLUSH
                                                     dirty bitmap: all clean
    write sector 0
                      write 0 lands

and then a power loss happens on the source. The guest now has the dirty bitmap saying all clean even though the source (which already has the last write to sector 0) and the destination (which does not) now differ. However, this is not a problem because both are consistent with the last flush.

I attach a Promela model of the algorithm I'm going to implement. It's not exactly the one I posted upthread; I successfully ran this one through a model checker, so this one works. :) (I tested 3 sectors / 1 write, i.e. the case above, and 2 sectors / 3 writes. It should be enough given that it goes exhaustively through the entire state space). I have another model with two concurrent writers, but it is quite messy and I don't think it adds much.
It shouldn't be hard to follow; the only tricky thing is that multiple branches in an "if" or "do" can be true, and if so all paths will be explored by the model checker. An "else" is only executed if all the other paths are false.

Yes, this is already all correct.

OK, I think I was confused by your description of drive-mirror in the wiki. It says that it starts mirroring, but what it also does is copy the source to the target before that. It is clear from the description of the sync option though.

Yes, the sync option simply fills in the dirty bitmap before starting the actual loop.

Paolo

/*
 * Formal model for disk synchronization.
 *
 * Copyright (C) 2012 Red Hat, Inc.
 * Author: Paolo Bonzini pbonz...@redhat.com
 *
 * State space explodes real fast:
 *
 *   SEC \ MAX    1    2
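To make the target-side contract concrete, here is a minimal Python sketch of a replication server that follows the rule described above: writes are applied as they arrive (the image is not consistent in between), and only a flush marks a point that is safe to keep. The class and the whole-file copy standing in for a snapshot are illustrative assumptions, not the NBD server or any existing agent.

    import shutil

    class ReplicationTarget:
        """Sketch: only a FLUSH from the mirroring job marks a consistent point."""

        def __init__(self, image_path):
            self.image_path = image_path
            self.snapshots = 0

        def write(self, offset, data):
            # Writes may arrive in a different order than the guest issued them;
            # between flushes the image is not guaranteed to be consistent.
            with open(self.image_path, "r+b") as f:
                f.seek(offset)
                f.write(data)

        def flush(self):
            # A flush means everything up to the last source flush has been
            # copied, so this state is safe to keep (a plain file copy stands
            # in for whatever cheap snapshot mechanism the storage provides).
            self.snapshots += 1
            shutil.copyfile(self.image_path,
                            "%s.snap%d" % (self.image_path, self.snapshots))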
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
Hi,

On 05/24/2012 04:19 PM, Paolo Bonzini wrote:

Here is how the bitmaps are handled when doing I/O on the source:

- after writing to the source:
  - clear bit in the volatile in-flight bitmap
  - set bit in the persistent dirty bitmap
- after flushing the source:
  - msync the persistent bitmap to disk

Here is how the bitmaps are handled in the drive-mirror coroutine:

- before reading from the source:
  - set bit in the volatile in-flight bitmap
- after writing to the target:
  - if the dirty count will become zero, flush the target
  - if the bit is still set in the in-flight bitmap, clear bit in the persistent dirty bitmap
  - clear bit in the volatile in-flight bitmap

I have a few questions, apologies if some of these are obvious...

I assume the target can be any QEMU block driver, including e.g. NBD? A networked block driver would be required for a continuous replication solution.

Does the drive-mirror coroutine send the writes to the target in the same order as they are sent to the source? I assume so.

Does the drive-mirror coroutine require that writes are acknowledged? I'd assume so, as you mention that the bit from the persistent bitmap is cleared after a write, so you'd need to know the write arrived, otherwise you cannot safely clear the bit.

If the two above are true (sending in-order, and requiring acknowledgment of writes by the target), then I assume there is a need to keep an in-memory list with the I/Os that still need to be sent to the target? That list could get too large if e.g. the target cannot keep up or becomes unavailable. When this happens, the dirty bitmap is needed to re-establish synchronized state again between the two images.

For this re-sync, I think there will be two phases. The first phase would send blocks marked as dirty by the bitmap. I assume these would be sent in arbitrary order, not the order in which they were sent to the source, right? After the copy phase is done, in order to avoid race conditions, the bitmap should be reset and mirroring should start directly and atomically. Is that currently handled by your design?

Also, probably the target would need some kind of signal that the copy ended and that we are now mirroring, because this is when writes are in-order again, and therefore only in this phase the solution can provide crash-consistent protection. In the copy phase no crash consistency can be provided, if I am not mistaken.

Finally, again if I am not mistaken, I think that the scenario where synchronization is lost with the target is exactly the same as when you need to do an initial copy, except that in the latter case all bits in the bitmap are set, right?

Regards,
Geert
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 29/05/2012 13:57, Geert Jansen wrote:

I assume the target can be any QEMU block driver, including e.g. NBD? A networked block driver would be required for a continuous replication solution.

Yes.

Does the drive-mirror coroutine send the writes to the target in the same order as they are sent to the source? I assume so.

No, it doesn't. It's asynchronous; for continuous replication, the target knows that it has a consistent view whenever it sees a flush on the NBD stream. Flushing the target is needed anyway before writing the dirty bitmap, so the target might as well exploit them to get information about the state of the source.

The target _must_ flush to disk when it receives a flush command, no matter how close they are. It _may_ choose to snapshot the disk, for example establishing one new snapshot every 5 seconds.

A synchronous implementation is not forbidden by the spec (by design), but at the moment it's a bit more complex to implement because, as you mention, it requires buffering the I/O data on the host.

Does the drive-mirror coroutine require that writes are acknowledged? I'd assume so, as you mention that the bit from the persistent bitmap is cleared after a write, so you'd need to know the write arrived, otherwise you cannot safely clear the bit.

Yes, the drive-mirror coroutine will not go on to the next write until the previous one is acked.

For this re-sync, I think there will be two phases. The first phase would send blocks marked as dirty by the bitmap. I assume these would be sent in arbitrary order, not the order in which they were sent to the source, right? After the copy phase is done, in order to avoid race conditions, the bitmap should be reset and mirroring should start directly and atomically. Is that currently handled by your design?

Yes, this is already all correct.

Also, probably the target would need some kind of signal that the copy ended and that we are now mirroring, because this is when writes are in-order again, and therefore only in this phase the solution can provide crash-consistent protection. In the copy phase no crash consistency can be provided, if I am not mistaken.

The copy phase will not have flushes (they are kind of useless).

Finally, again if I am not mistaken, I think that the scenario where synchronization is lost with the target is exactly the same as when you need to do an initial copy, except that in the latter case all bits in the bitmap are set, right?

Yes.

Paolo
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:

changes from v1:
- added per-job iostatus
- added description of persistent dirty bitmap

The same content is also at http://wiki.qemu.org/Features/LiveBlockMigration/1.2

QMP changes for error handling
==============================

* query-block-jobs: BlockJobInfo gets two new fields, paused and io-status. The job-specific iostatus is completely separate from the block device iostatus.

* block-stream: I would still like to add on_error to the existing block-stream command, if only to ease unit testing. Concerns about the stability of the API can be handled by adding introspection (exporting the schema), which is not hard to do. The new option is an enum with the following possible values:

  'report': The behavior is the same as in 1.1. An I/O error will complete the job immediately with an error code.

  'ignore': An I/O error, respectively during a read or a write, will be ignored. For streaming, the job will complete with an error and the backing file will be left in place. For mirroring, the sector will be marked again as dirty and re-examined later.

  'stop': The job will be paused, and the job iostatus (which can be examined with query-block-jobs) is updated.

  'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

'stop' and 'enospc' must raise a QMP event so the user is notified when the job is paused. Are the details on this missing from this draft?

Stefan
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 25.05.2012 10:28, Stefan Hajnoczi wrote:

On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:

changes from v1:
- added per-job iostatus
- added description of persistent dirty bitmap

The same content is also at http://wiki.qemu.org/Features/LiveBlockMigration/1.2

QMP changes for error handling
==============================

* query-block-jobs: BlockJobInfo gets two new fields, paused and io-status. The job-specific iostatus is completely separate from the block device iostatus.

* block-stream: I would still like to add on_error to the existing block-stream command, if only to ease unit testing. Concerns about the stability of the API can be handled by adding introspection (exporting the schema), which is not hard to do. The new option is an enum with the following possible values:

  'report': The behavior is the same as in 1.1. An I/O error will complete the job immediately with an error code.

  'ignore': An I/O error, respectively during a read or a write, will be ignored. For streaming, the job will complete with an error and the backing file will be left in place. For mirroring, the sector will be marked again as dirty and re-examined later.

  'stop': The job will be paused, and the job iostatus (which can be examined with query-block-jobs) is updated.

  'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

May I quote the next two lines as well?

  In all cases, even for 'report', the I/O error is reported as a QMP event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR.

'stop' and 'enospc' must raise a QMP event so the user is notified when the job is paused. Are the details on this missing from this draft?

No, just from your quote. :-)

Kevin
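A sketch of how a management application might drive the proposed error handling. The command and event names follow the proposal quoted above (with BLOCK_JOB_COMPLETED assumed as the usual job-completion event); qmp_command() and wait_for_event() are placeholder helpers and 'virtio0' a made-up device name, not an existing library:

    # Hypothetical management-side loop for on_error='stop'.
    def stream_with_stop_on_error(qmp_command, wait_for_event):
        qmp_command("block-stream", {"device": "virtio0", "on_error": "stop"})
        while True:
            event = wait_for_event(["BLOCK_JOB_ERROR", "BLOCK_JOB_COMPLETED"])
            if event["event"] == "BLOCK_JOB_COMPLETED":
                return
            # BLOCK_JOB_ERROR: the job is now paused and its per-job iostatus is set.
            jobs = qmp_command("query-block-jobs", {})
            print("job paused, io-status:", jobs[0]["io-status"])
            # ... fix the underlying problem (e.g. free space for ENOSPC) ...
            qmp_command("block-job-resume", {"device": "virtio0"})  # resets iostatus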
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 24/05/2012 18:57, Eric Blake wrote:

On 05/24/2012 07:41 AM, Paolo Bonzini wrote:

changes from v1:
- added per-job iostatus
- added description of persistent dirty bitmap

The same content is also at http://wiki.qemu.org/Features/LiveBlockMigration/1.2

* query-block-jobs: BlockJobInfo gets two new fields, paused and io-status. The job-specific iostatus is completely separate from the block device iostatus.

Is it still true that for mirror jobs, whether we are mirroring is still determined by whether 'len'=='offset'?

Yes.

* block-job-complete: force completion of mirroring and switching of the device to the target, not related to the rest of the proposal. Synchronously opens backing files if needed, asynchronously completes the job.

Can this be made part of a 'transaction'? Likewise, can 'block-job-cancel' be made part of a 'transaction'?

Both of them are asynchronous, so they would not create an atomic snapshot. We could add it later; in the meanwhile you can wrap with fsfreeze/fsthaw.

But now that you are adding the possibility of mirroring reverting to copying, there is a race where I can probe and see that we are in mirroring, then issue a 'block-job-cancel' to affect a copy operation, but in the meantime things reverted, and the cancel ends up leaving me with an incomplete copy.

Hmm, that's right. But then this can only happen if you have an error in the target. I can make block-job-cancel _not_ resume a paused job. Would that satisfy your needs?

Persistent dirty bitmap
=======================

A persistent dirty bitmap can be used by management for two reasons. When mirroring is used for continuous replication of storage, to record I/O operations that happened while the replication server is not connected or unavailable. When mirroring is used for storage migration, to check after a management crash whether the VM must be restarted with the source or the destination.

Is there a particular file format for the dirty bitmap? Is there a header, or is it just a straight bitmap, where the size of the file is an exact function of the size of the file that it maps?

I think it could be just a straight bitmap.

management can restart the virtual machine with /mnt/dest/diskname.img. If it has even a single zero bit,

s/zero/non-zero/

Doh, of course.

Paolo
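If it really is a straight header-less bitmap, the file size follows directly from the image size and the granularity. A quick sketch of the arithmetic (the 64 KiB granularity is only an example value, not something the proposal fixes):

    def bitmap_file_size(image_bytes, granularity_bytes=64 * 1024):
        """One bit per granularity-sized chunk, rounded up to whole bytes."""
        bits = -(-image_bytes // granularity_bytes)   # ceiling division
        return -(-bits // 8)

    # e.g. a 100 GiB image tracked at 64 KiB granularity -> 204800 bytes (~200 KiB)
    print(bitmap_file_size(100 * 1024**3))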
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 24/05/2012 17:32, Dor Laor wrote:

I didn't understand whether the persistent dirty bitmap needs to be flushed. This bitmap actually controls the persistent known state of the destination image. Since with mirroring we always have the source in a full-state condition, we can choose to lazily update the destination, with a risk of losing some content from the last flush (on the destination side only).

Flushing the dirty bitmap after writing to the target can indeed be tuned for the application. However, it is not optional to msync the bitmap when flushing the source. If the source has a power loss, it has to know what to retransmit.

This way one can pick the frequency of flushing the persistent bitmap (and the respective target I/O writes). Continuous replication can choose a time-based fashion, such as every 5 seconds.

But then the target is not able to restore a consistent state (which is a state where the dirty bitmap is all-zeros). The scheme above is roughly what DRBD does. But in any case, optimizations need to be worked out with a model checker, it's too delicate.

Paolo
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:

Persistent dirty bitmap
=======================

A persistent dirty bitmap can be used by management for two reasons. When mirroring is used for continuous replication of storage, to record I/O operations that happened while the replication server is not connected or unavailable.

For incremental backups we also need a dirty bitmap API. This allows backup software to determine which blocks are dirty in a snapshot. The backup software will only copy out the dirty blocks. (For external snapshots the dirty bitmap is actually the qemu-io -c map of is_allocated clusters. But for internal snapshots it would be a diff of image metadata which results in a real bitmap.)

So it seems like a dirty bitmap API will be required for continuous replication (so the server can find out what it missed) and for incremental backup. If there is commonality here we should work together so the same API can do both.

Stefan
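For the external-snapshot case, the "dirty bitmap" can be derived from allocation information alone: whatever is allocated in the topmost image is what differs from its backing file. A sketch of that conversion, assuming the allocated ranges have already been extracted (for example from qemu-io -c map output; the parsing itself and the 64 KiB granularity are assumptions):

    def bitmap_from_allocated_ranges(image_bytes, allocated_ranges,
                                     granularity_bytes=64 * 1024):
        """allocated_ranges: iterable of (offset, length) byte ranges that are
        allocated in the topmost image.  Returns a bytearray with one bit per
        granularity-sized chunk, usable as a dirty bitmap for the backup."""
        nbits = -(-image_bytes // granularity_bytes)
        bitmap = bytearray(-(-nbits // 8))
        for offset, length in allocated_ranges:
            first = offset // granularity_bytes
            last = (offset + length - 1) // granularity_bytes
            for bit in range(first, last + 1):
                bitmap[bit // 8] |= 1 << (bit % 8)
        return bitmap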
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 25/05/2012 11:43, Stefan Hajnoczi wrote:

On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:

Persistent dirty bitmap
=======================

A persistent dirty bitmap can be used by management for two reasons. When mirroring is used for continuous replication of storage, to record I/O operations that happened while the replication server is not connected or unavailable.

For incremental backups we also need a dirty bitmap API. This allows backup software to determine which blocks are dirty in a snapshot. The backup software will only copy out the dirty blocks. (For external snapshots the dirty bitmap is actually the qemu-io -c map of is_allocated clusters. But for internal snapshots it would be a diff of image metadata which results in a real bitmap.)

Perhaps that could simply be a new qemu-img subcommand? It should be possible to run it while the VM is offline. Then the file that is produced could be fed to blockdev-dirty-enable.

Paolo

So it seems like a dirty bitmap API will be required for continuous replication (so the server can find out what it missed) and for incremental backup. If there is commonality here we should work together so the same API can do both.

Stefan
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On Fri, May 25, 2012 at 01:17:04PM +0200, Paolo Bonzini wrote:

On 25/05/2012 11:43, Stefan Hajnoczi wrote:

On Thu, May 24, 2012 at 03:41:29PM +0200, Paolo Bonzini wrote:

Persistent dirty bitmap
=======================

A persistent dirty bitmap can be used by management for two reasons. When mirroring is used for continuous replication of storage, to record I/O operations that happened while the replication server is not connected or unavailable.

For incremental backups we also need a dirty bitmap API. This allows backup software to determine which blocks are dirty in a snapshot. The backup software will only copy out the dirty blocks. (For external snapshots the dirty bitmap is actually the qemu-io -c map of is_allocated clusters. But for internal snapshots it would be a diff of image metadata which results in a real bitmap.)

Perhaps that could simply be a new qemu-img subcommand? It should be possible to run it while the VM is offline. Then the file that is produced could be fed to blockdev-dirty-enable.

For both continuous replication and incremental backups we cannot require that the guest is shut down in order to collect the dirty bitmap, I think. Also, anything to do with internal snapshots needs to be online, since QEMU will still have the image file open and be writing to it. With backing files we have a little bit more wiggle room when QEMU has backing files open read-only.

I think we really need a libvirt API because a local file not only has permissions issues but also is not network transparent. The continuous replication server runs on another machine, how will it access the dirty bitmap file?

Stefan
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 25/05/2012 14:09, Stefan Hajnoczi wrote:

Perhaps that could simply be a new qemu-img subcommand? It should be possible to run it while the VM is offline. Then the file that is produced could be fed to blockdev-dirty-enable.

For both continuous replication and incremental backups we cannot require that the guest is shut down in order to collect the dirty bitmap, I think.

Yes, that is a problem for internal snapshots. For external snapshots, see the drive-mirror command's sync parameter. Perhaps we can add a blockdev-dirty-fill command that adds allocated sectors up to a given base to the dirty bitmap.

I think we really need a libvirt API because a local file not only has permissions issues but also is not network transparent. The continuous replication server runs on another machine, how will it access the dirty bitmap file?

This is still using a push model where the dirty data is sent from QEMU to the replication server, so the dirty bitmap is not needed on the machine that runs the replication server---only on the machine that runs the VM (to preserve the bitmap across VM shutdowns including power loss). It has to be stored on shared storage if you plan to run the VM from multiple hosts.

Paolo
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 05/25/2012 02:48 AM, Paolo Bonzini wrote:

* block-job-complete: force completion of mirroring and switching of the device to the target, not related to the rest of the proposal. Synchronously opens backing files if needed, asynchronously completes the job.

Can this be made part of a 'transaction'? Likewise, can 'block-job-cancel' be made part of a 'transaction'?

Both of them are asynchronous so they would not create an atomic snapshot. We could add it later, in the meanwhile you can wrap with fsfreeze/fsthaw.

It doesn't have to be right away, I just want to make sure that we aren't excluding it from a possible future extension, because it _does_ sound useful.

But now that you are adding the possibility of mirroring reverting to copying, there is a race where I can probe and see that we are in mirroring, then issue a 'block-job-cancel' to affect a copy operation, but in the meantime things reverted, and the cancel ends up leaving me with an incomplete copy.

Hmm, that's right. But then this can only happen if you have an error in the target. I can make block-job-cancel _not_ resume a paused job. Would that satisfy your needs?

I'm not sure I follow what you are asking. My scenario is:

  call 'drive-mirror' to start a job
  'block-job-complete' fails because job is not ready, but the job is not affected
  wait for the event telling me we are in mirroring phase
  start issuing my call to 'block-job-complete' to pivot
  something happens where we are no longer mirroring
  'block-job-complete' fails because we are not mirroring - good

  call 'drive-mirror' to start a job
  calling 'block-job-cancel' would abort the job, which is not what I want
  wait for the event telling me we are in mirroring phase
  start issuing my call to 'block-job-cancel' to cleanly leave the copy behind
  something happens where we are no longer mirroring
  'block-job-cancel' completes, but did not leave a complete mirror - bad

On the other hand, if I'm _not_ trying to make a clean copy, then I want 'block-job-cancel' to work as fast as possible, no matter what. I'm not sure why having block-job-cancel resume or not resume a job would make a difference.

What I really am asking for here is a way to have some command (perhaps 'block-job-complete' but with an optional flag set to a non-default value) that says I want to complete the job as a clean copy, but revert back to the source rather than pivot to the destination, and to cleanly fail with the job still around for additional actions if I cannot get a clean copy at the current moment, in the same way that the default 'block-job-complete' cleanly fails but does not kill the job if I'm not mirroring yet.

--
Eric Blake   ebl...@redhat.com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
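What Eric is asking for could look roughly like the following on the management side; the 'pivot' parameter is purely hypothetical (it is the optional boolean the mail proposes, not an existing QMP argument), and qmp_command() is a placeholder for a QMP client:

    # Hypothetical: finish the job as a clean point-in-time copy *without*
    # pivoting to the destination, failing cleanly if we are not mirroring.
    def finish_copy_without_pivot(qmp_command, device="ide0-hd0"):
        try:
            qmp_command("block-job-complete", {"device": device, "pivot": False})
        except Exception as err:  # a real client would catch its QMP error type
            # We were no longer in the mirroring phase; the job is still alive,
            # so wait for it to become synced again and retry, instead of being
            # left with a silently incomplete copy as block-job-cancel would.
            print("copy not consistent yet:", err)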
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On Thu, 24 May 2012 15:41:29 +0200 Paolo Bonzini pbonz...@redhat.com wrote:

* block-stream: I would still like to add on_error to the existing block-stream command, if only to ease unit testing. Concerns about the stability of the API can be handled by adding introspection (exporting the schema), which is not hard to do. The new option is an enum with the following possible values:

  'report': The behavior is the same as in 1.1. An I/O error will complete the job immediately with an error code.

  'ignore': An I/O error, respectively during a read or a write, will be ignored. For streaming, the job will complete with an error and the backing file will be left in place. For mirroring, the sector will be marked again as dirty and re-examined later.

  'stop': The job will be paused, and the job iostatus (which can be examined with query-block-jobs) is updated.

  'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

  In all cases, even for 'report', the I/O error is reported as a QMP event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR. After cancelling a job, the job implementation MAY choose to treat stop and enospc values as report, i.e. complete the job immediately with an error code, as long as block_job_is_cancelled(job) returns true when the completion callback is called.

  Open problem: There could be unrecoverable errors in which the job will always fail as if rerror/werror were set to report (example: error while switching backing files). Does it make sense to fire an event before the point in time where such errors can happen?

You mean, you fire the event before the point the error can happen but the operation keeps running if it doesn't fail? If that's the case, I think that the returned error is enough for the mngt app to decide what to do.

* block-job-pause: A new QMP command. Takes a block device (drive), pauses an active background block operation on that device. This command returns immediately after marking the active background block operation for pausing. It is an error to call this command if no operation is in progress. The operation will pause as soon as possible (it won't pause if the job is being cancelled). No event is emitted when the operation is actually paused. Cancelling a paused job automatically resumes it.

Is pausing guaranteed to succeed?
[Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
changes from v1:
- added per-job iostatus
- added description of persistent dirty bitmap

The same content is also at http://wiki.qemu.org/Features/LiveBlockMigration/1.2

QMP changes for error handling
==============================

* query-block-jobs: BlockJobInfo gets two new fields, paused and io-status. The job-specific iostatus is completely separate from the block device iostatus.

* block-stream: I would still like to add on_error to the existing block-stream command, if only to ease unit testing. Concerns about the stability of the API can be handled by adding introspection (exporting the schema), which is not hard to do. The new option is an enum with the following possible values:

  'report': The behavior is the same as in 1.1. An I/O error will complete the job immediately with an error code.

  'ignore': An I/O error, respectively during a read or a write, will be ignored. For streaming, the job will complete with an error and the backing file will be left in place. For mirroring, the sector will be marked again as dirty and re-examined later.

  'stop': The job will be paused, and the job iostatus (which can be examined with query-block-jobs) is updated.

  'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

  In all cases, even for 'report', the I/O error is reported as a QMP event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR. After cancelling a job, the job implementation MAY choose to treat stop and enospc values as report, i.e. complete the job immediately with an error code, as long as block_job_is_cancelled(job) returns true when the completion callback is called.

  Open problem: There could be unrecoverable errors in which the job will always fail as if rerror/werror were set to report (example: error while switching backing files). Does it make sense to fire an event before the point in time where such errors can happen?

* block-job-pause: A new QMP command. Takes a block device (drive), pauses an active background block operation on that device. This command returns immediately after marking the active background block operation for pausing. It is an error to call this command if no operation is in progress. The operation will pause as soon as possible (it won't pause if the job is being cancelled). No event is emitted when the operation is actually paused. Cancelling a paused job automatically resumes it.

* block-job-resume: A new QMP command. Takes a block device (drive), resumes a paused background block operation on that device. This command returns immediately after resuming a paused background block operation. It is an error to call this command if no operation is in progress. A successful block-job-resume operation also resets the iostatus on the job that is passed.

  Rationale: block-job-resume is required to restart a job that had on_error behavior set to 'stop' or 'enospc'. Adding block-job-pause makes it simpler to test the new feature.

Other points specific to mirroring
==================================

* query-block-jobs: The returned JSON object will grow an additional member, target. The target field is a dictionary with two fields, info and stats (resembling the output of query-block and query-blockstats but for the mirroring target). Member device of the BlockInfo structure will be made optional.

  Rationale: this allows libvirt to observe the high watermark of qcow2 mirroring targets.

  If present, the target has its own iostatus. It is set when the job is paused due to an error on the target (together with sending a BLOCK_JOB_ERROR event). block-job-resume resets it.
* drive-mirror: activates mirroring to a second block device (optionally creating the image on that second block device). Compared to the earlier versions, the full argument is replaced by an enum option sync with three values:

  - top: copies data in the topmost image to the destination
  - full: copies data from all images to the destination
  - dirty: copies clusters that are marked in the dirty bitmap to the destination (see below)

* block-job-complete: force completion of mirroring and switching of the device to the target, not related to the rest of the proposal. Synchronously opens backing files if needed, asynchronously completes the job.

* MIRROR_STATE_CHANGE: new event, triggered every time the block-job-complete becomes available/unavailable. Contains the device name (like device: 'ide0-hd0'), and the state (synced: true/false).

Persistent dirty bitmap
=======================

A persistent dirty bitmap can be used by management for two reasons. When mirroring is used for continuous replication of storage, to record I/O operations that happened while the replication server is not connected or unavailable. When mirroring is used for storage migration, to check after a management crash whether the VM must be restarted with the source or the destination.

The dirty bitmap is synchronized on every bdrv_flush (or on every
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 24/05/2012 16:41, Paolo Bonzini wrote:

The dirty bitmap is managed by these QMP commands:

* blockdev-dirty-enable: takes a file name used for the dirty bitmap, and an optional granularity. Setting the granularity will not be supported in the initial version.

* query-block-dirty: returns statistics about the dirty bitmap: right now the granularity, the number of bits that are set, and whether QEMU is using the dirty bitmap or just adding to it.

* blockdev-dirty-disable: disable the dirty bitmap.

When do bits get cleared from the bitmap?

"using the dirty bitmap or just adding to it" - I'm not sure I understand what you mean. What's the difference?

Thanks,
Ori
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 24/05/2012 16:00, Ori Mamluk wrote:

The dirty bitmap is managed by these QMP commands:

* blockdev-dirty-enable: takes a file name used for the dirty bitmap, and an optional granularity. Setting the granularity will not be supported in the initial version.

* query-block-dirty: returns statistics about the dirty bitmap: right now the granularity, the number of bits that are set, and whether QEMU is using the dirty bitmap or just adding to it.

* blockdev-dirty-disable: disable the dirty bitmap.

When do bits get cleared from the bitmap?

drive-mirror clears bits from the bitmap as it processes the writes. In addition to the persistent dirty bitmap, QEMU keeps an in-flight bitmap. The in-flight bitmap does not need to be persistent.

Here is how the bitmaps are handled when doing I/O on the source:

- after writing to the source:
  - clear bit in the volatile in-flight bitmap
  - set bit in the persistent dirty bitmap
- after flushing the source:
  - msync the persistent bitmap to disk

Here is how the bitmaps are handled in the drive-mirror coroutine:

- before reading from the source:
  - set bit in the volatile in-flight bitmap
- after writing to the target:
  - if the dirty count will become zero, flush the target
  - if the bit is still set in the in-flight bitmap, clear bit in the persistent dirty bitmap
  - clear bit in the volatile in-flight bitmap

"using the dirty bitmap or just adding to it" - I'm not sure I understand what you mean. What's the difference?

Processing the data and removing from the bitmap (mirroring active), or just setting dirty bits (mirroring inactive).

Paolo
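A condensed sketch of that bookkeeping, with the bitmaps as plain Python sets and the actual block-layer operations reduced to stubs. It only illustrates the rules listed above (the ordering subtleties that the Promela model later in the thread checks are glossed over); it is not QEMU code:

    in_flight = set()          # volatile in-flight bitmap
    dirty = set()              # persistent dirty bitmap

    def read_from_source(sector): return b"\0" * 512   # stub
    def write_to_target(sector, data): pass            # stub
    def flush_target(): pass                            # stub
    def msync_bitmap(bitmap): pass                      # stub: persist bitmap to disk

    def guest_write_completed(sector):
        # after writing to the source
        in_flight.discard(sector)
        dirty.add(sector)

    def source_flush_completed():
        # after flushing the source
        msync_bitmap(dirty)

    def mirror_one_sector(sector):
        in_flight.add(sector)              # before reading from the source
        data = read_from_source(sector)
        write_to_target(sector, data)
        # after writing to the target:
        if dirty <= {sector}:              # the dirty count will become zero
            flush_target()
        if sector in in_flight:            # no guest write raced with this copy
            dirty.discard(sector)
        in_flight.discard(sector)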
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 05/24/2012 05:19 PM, Paolo Bonzini wrote:

On 24/05/2012 16:00, Ori Mamluk wrote:

The dirty bitmap is managed by these QMP commands:

* blockdev-dirty-enable: takes a file name used for the dirty bitmap, and an optional granularity. Setting the granularity will not be supported in the initial version.

* query-block-dirty: returns statistics about the dirty bitmap: right now the granularity, the number of bits that are set, and whether QEMU is using the dirty bitmap or just adding to it.

* blockdev-dirty-disable: disable the dirty bitmap.

When do bits get cleared from the bitmap?

drive-mirror clears bits from the bitmap as it processes the writes. In addition to the persistent dirty bitmap, QEMU keeps an in-flight bitmap. The in-flight bitmap does not need to be persistent.

Here is how the bitmaps are handled when doing I/O on the source:

- after writing to the source:
  - clear bit in the volatile in-flight bitmap
  - set bit in the persistent dirty bitmap
- after flushing the source:
  - msync the persistent bitmap to disk

Here is how the bitmaps are handled in the drive-mirror coroutine:

- before reading from the source:
  - set bit in the volatile in-flight bitmap
- after writing to the target:
  - if the dirty count will become zero, flush the target
  - if the bit is still set in the in-flight bitmap, clear bit in the persistent dirty bitmap
  - clear bit in the volatile in-flight bitmap

I didn't understand whether the persistent dirty bitmap needs to be flushed. This bitmap actually controls the persistent known state of the destination image. Since with mirroring we always have the source in a full-state condition, we can choose to lazily update the destination, with a risk of losing some content from the last flush (on the destination side only).

This way one can pick the frequency of flushing the persistent bitmap (and the respective target I/O writes). Continuous replication can choose a time-based fashion, such as every 5 seconds. A standard mirroring job for live-copy purposes can pick to flush just once at the end of the copy process.

Dor

"using the dirty bitmap or just adding to it" - I'm not sure I understand what you mean. What's the difference?

Processing the data and removing from the bitmap (mirroring active), or just setting dirty bits (mirroring inactive).

Paolo
Re: [Qemu-devel] Block job commands in QEMU 1.2 [v2, including support for replication]
On 05/24/2012 07:41 AM, Paolo Bonzini wrote:

changes from v1:
- added per-job iostatus
- added description of persistent dirty bitmap

The same content is also at http://wiki.qemu.org/Features/LiveBlockMigration/1.2

* query-block-jobs: BlockJobInfo gets two new fields, paused and io-status. The job-specific iostatus is completely separate from the block device iostatus.

Is it still true that for mirror jobs, whether we are mirroring is still determined by whether 'len'=='offset'?

* drive-mirror: activates mirroring to a second block device (optionally creating the image on that second block device). Compared to the earlier versions, the full argument is replaced by an enum option sync with three values:

  - top: copies data in the topmost image to the destination
  - full: copies data from all images to the destination
  - dirty: copies clusters that are marked in the dirty bitmap to the destination (see below)

Different, but at least RHEL used the name __com.redhat_drive-mirror, so libvirt can cope with the difference.

* block-job-complete: force completion of mirroring and switching of the device to the target, not related to the rest of the proposal. Synchronously opens backing files if needed, asynchronously completes the job.

Can this be made part of a 'transaction'? Likewise, can 'block-job-cancel' be made part of a 'transaction'? Having those two commands transactionable means that you could copy multiple disks at the same point in time (block-job-cancel) or pivot multiple disks leaving the former files consistent at the same point in time (block-job-complete). It doesn't have to be done in the first round, but we should make sure we are not precluding this for future growth.

Also, for the purposes of copying but not pivoting, you only have a safe copy if 'len'=='offset' at the time of the cancel. But now that you are adding the possibility of mirroring reverting to copying, there is a race where I can probe and see that we are in mirroring, then issue a 'block-job-cancel' to affect a copy operation, but in the meantime things reverted, and the cancel ends up leaving me with an incomplete copy. Maybe 'block-job-complete' should be given an optional boolean parameter; by default or if the parameter is true, we pivot, but if false, then we do the same as 'block-job-cancel' to affect a safe copy if we are in mirroring, while erroring out if we are not in mirroring, leaving 'block-job-cancel' as a way to always cancel a job but no longer a safe way to guarantee a copy operation.

Persistent dirty bitmap
=======================

A persistent dirty bitmap can be used by management for two reasons. When mirroring is used for continuous replication of storage, to record I/O operations that happened while the replication server is not connected or unavailable. When mirroring is used for storage migration, to check after a management crash whether the VM must be restarted with the source or the destination.

Is there a particular file format for the dirty bitmap? Is there a header, or is it just a straight bitmap, where the size of the file is an exact function of the size of the file that it maps?

If management crashes between (6) and (7), it can examine the dirty bitmap on disk. If it is all-zeros,

Obviously, this would be all-zeros in the map portion of the file; any header portion would not impact this.

management can restart the virtual machine with /mnt/dest/diskname.img.
If it has even a single zero bit,

s/zero/non-zero/

management can restart the virtual machine with the persistent dirty bitmap enabled, and later issue again a drive-mirror command to restart from step 4.

Paolo

--
Eric Blake   ebl...@redhat.com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
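The recovery check being discussed here boils down to scanning the bitmap file after the management crash; a sketch, assuming the straight header-less bitmap format discussed earlier in the thread and a made-up bitmap file name:

    def destination_is_safe(bitmap_path):
        """After a management crash: an all-zero dirty bitmap means the copy
        completed and the destination image is safe to use."""
        with open(bitmap_path, "rb") as f:
            return not any(f.read())

    # if destination_is_safe("/mnt/dest/diskname.bitmap"):
    #     restart the VM from /mnt/dest/diskname.img
    # else:
    #     restart from the source with the dirty bitmap enabled and issue
    #     drive-mirror with sync='dirty' later to resume the copy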