On 26/12/2014 04:31, Yang Hongyang wrote:
> Please feel free to comment.
> We want as many comments and as much feedback as possible, thanks in advance.

Hi Yang,

I think it's possible to build COLO block replication from many basic
blocks that are already in QEMU.  The only new piece would be the disk
buffer on the secondary.

         virtio-blk       ||
             ^            ||                            .----------
             |            ||                            | Secondary
        1 Quorum          ||                            '----------
         /      \         ||
        /        \        ||
   Primary      2 NBD  ------->  2 NBD
     disk       client    ||     server                  virtio-blk
                          ||        ^                         ^
--------.                 ||        |                         |
Primary |                 ||  Secondary disk <--------- COLO buffer 3
--------'                 ||                   backing


1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM.  The read pattern patches for quorum
(http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
be used/extended to make the primary always read from the local disk
instead of going through NBD.

2) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

3) The disk on the secondary is represented by a custom block device
("COLO buffer").  The disk buffer's backing image is the secondary disk,
and the disk buffer uses bdrv_add_before_write_notifier to implement
copy-on-write, similar to block/backup.c.

4) Checkpointing can use new bdrv_prepare_checkpoint and
bdrv_do_checkpoint members in BlockDriver to discard the COLO buffer,
similar to your patches (you did not explain why you do checkpointing in
two steps).  Failover instead is done with bdrv_commit or can even be
done without stopping the secondary (live commit, block/commit.c).
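
As a toy illustration of points 3 and 4, here is a small in-memory model
of the buffer's copy-on-write, checkpoint and failover behaviour.  It is
only a sketch under stated assumptions: it is not QEMU code, every name
in it (ToyColoBuffer, toy_primary_write, and so on) is hypothetical, and
one byte stands in for one sector.

```c
/* Toy in-memory model of the COLO buffer; none of these names exist in QEMU. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_NB_SECTORS 8

typedef struct ToyColoBuffer {
    char backing[TOY_NB_SECTORS];  /* secondary disk (the backing image) */
    char buffer[TOY_NB_SECTORS];   /* sectors held in the COLO buffer */
    bool present[TOY_NB_SECTORS];  /* true if buffer[] overrides backing[] */
} ToyColoBuffer;

void toy_init(ToyColoBuffer *s, char fill)
{
    memset(s->backing, fill, sizeof(s->backing));
    memset(s->buffer, 0, sizeof(s->buffer));
    memset(s->present, 0, sizeof(s->present));
}

/* Write arriving from the primary VM through the NBD server. */
void toy_primary_write(ToyColoBuffer *s, size_t sector, char val)
{
    assert(sector < TOY_NB_SECTORS);
    if (!s->present[sector]) {
        /* Stand-in for a before-write notifier: save the old data first. */
        s->buffer[sector] = s->backing[sector];
        s->present[sector] = true;
    }
    s->backing[sector] = val;
}

/* Write issued by the secondary VM: it only ever touches the buffer. */
void toy_secondary_write(ToyColoBuffer *s, size_t sector, char val)
{
    assert(sector < TOY_NB_SECTORS);
    s->buffer[sector] = val;
    s->present[sector] = true;
}

/* Read issued by the secondary VM: buffer first, then the backing image. */
char toy_secondary_read(const ToyColoBuffer *s, size_t sector)
{
    assert(sector < TOY_NB_SECTORS);
    return s->present[sector] ? s->buffer[sector] : s->backing[sector];
}

/* Checkpoint: discard the buffer, so the secondary sees the primary's data. */
void toy_checkpoint(ToyColoBuffer *s)
{
    memset(s->present, 0, sizeof(s->present));
}

/* Failover: commit the buffer into the backing image, then empty it. */
void toy_failover(ToyColoBuffer *s)
{
    for (size_t i = 0; i < TOY_NB_SECTORS; i++) {
        if (s->present[i]) {
            s->backing[i] = s->buffer[i];
        }
    }
    memset(s->present, 0, sizeof(s->present));
}
```

In this model a primary write never changes what the secondary VM reads
until toy_checkpoint() runs, and toy_failover() leaves the backing image
equal to the secondary VM's view, which is roughly the role bdrv_commit
would play in the real design.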


The missing parts are:

1) NBD server on the backing image of the COLO buffer.  This means the
backing image needs its own BlockBackend.  Apart from this, no new
infrastructure is needed to receive writes on the secondary.

2) Read pattern support for quorum needs to be extended to suit
the COLO primary.  It may be simpler or faster to write a simple
"replication" driver that writes to N children but always reads from the
first.  But in any case initial tests can be done with the quorum
driver, even without read pattern support.  Again, all the network
infrastructure to replicate writes already exists in QEMU.

3) Of course the disk buffer itself.
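
To make the "replication" driver idea in missing part 2 concrete, here is
a minimal sketch of a device that writes to N children but always reads
from the first.  It is a toy model, not a QEMU BlockDriver: all names
(ToyChild, ToyReplication, toy_repl_write, toy_repl_read) are
hypothetical, one byte stands in for one sector, and reporting an error
when any child fails is an assumption rather than something the design
above specifies.

```c
/* Toy model of a write-to-all, read-from-first device; not QEMU code. */
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 4
#define TOY_REPL_SECTORS 8

typedef struct ToyChild {
    char data[TOY_REPL_SECTORS];
    int fail_writes;             /* non-zero simulates an unreachable child */
} ToyChild;

typedef struct ToyReplication {
    ToyChild *children[TOY_MAX_CHILDREN];  /* children[0] is the local disk */
    size_t nb_children;
} ToyReplication;

/*
 * Write to every child.  Returns 0 if all writes succeeded, -1 if at
 * least one child failed (the remaining children are still written).
 */
int toy_repl_write(ToyReplication *r, size_t sector, char val)
{
    int ret = 0;

    assert(sector < TOY_REPL_SECTORS);
    assert(r->nb_children > 0 && r->nb_children <= TOY_MAX_CHILDREN);
    for (size_t i = 0; i < r->nb_children; i++) {
        ToyChild *c = r->children[i];
        if (c->fail_writes) {
            ret = -1;
            continue;
        }
        c->data[sector] = val;
    }
    return ret;
}

/* Always read from the first child, so reads never go over the network. */
char toy_repl_read(const ToyReplication *r, size_t sector)
{
    assert(sector < TOY_REPL_SECTORS);
    assert(r->nb_children > 0);
    return r->children[0]->data[sector];
}
```

With the primary disk as children[0] and the NBD client as children[1],
reads stay local while every write is mirrored to the secondary host.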

Paolo

> Thanks,
> Yang.
> 
> Wen Congyang (1):
>   PoC: Block replication for COLO
> 
> Yang Hongyang (1):
>   Block: Block replication design for COLO
> 
>  block.c                   |  48 +++++++
>  block/blkcolo.c           | 338 ++++++++++++++++++++++++++++++++++++++++++++++
>  docs/blkcolo.txt          |  85 ++++++++++++
>  include/block/block.h     |   6 +
>  include/block/block_int.h |  21 +++
>  5 files changed, 498 insertions(+)
>  create mode 100644 block/blkcolo.c
>  create mode 100644 docs/blkcolo.txt
> 
