On 09/29/2016 03:46 AM, zhanghailiang wrote:
> Introduce the design of COLO, and how to test it.
> 
> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
> ---
>  docs/COLO-FT.txt | 190 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 190 insertions(+)
>  create mode 100644 docs/COLO-FT.txt
> 

> +
> +== Background ==
> +Virtual machine (VM) replication is a well known technique for providing
> +application-agnostic software-implemented hardware fault tolerance
> +"non-stop service".

Do you want s/tolerance/tolerance, also known as/ ?


> +== Architecture ==
> +
> +The architecture of COLO is shown in the bellow diagram.

s/bellow diagram/diagram below/

> +It consists of a pair of networked physical nodes:
> +The primary node running the PVM, and the secondary node running the SVM
> +to maintain a valid replica of the PVM.
> +PVM and SVM execute in parallel and generate output of response packets for
> +client requests according to the application semantics.
> +
> +The incoming packets from the client or external network are received by the
> +primary node, and then forwarded to the secondary node, so that Both the PVM

s/Both/both/

> +and the SVM are stimulated with the same requests.
> +
> +COLO receives the outbound packets from both the PVM and SVM and compares them
> +before allowing the output to be sent to clients.
> +
> +The SVM is qualified as a valid replica of the PVM, as long as it generates
> +identical responses to all client requests. Once the differences in the outputs
> +are detected between the PVM and SVM, COLO withholds transmission of the
> +outbound packets until it has successfully synchronized the PVM state to the SVM.
> +
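
For what it's worth, the compare-then-checkpoint behaviour described above
could be sketched roughly as follows. This is only a toy model, not how the
COLO proxy is implemented: the class, its fields and the checkpoint callback
are hypothetical names, and releasing the PVM's copy of the packet is an
assumption, since the text does not say which copy goes out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OutputComparator:
    """Toy model of COLO's output comparison (all names are hypothetical)."""

    checkpoint: Callable[[], None]  # synchronizes PVM state to the SVM
    released: List[bytes] = field(default_factory=list)
    checkpoints: int = 0

    def on_output(self, pvm_packet: bytes, svm_packet: bytes) -> None:
        if pvm_packet != svm_packet:
            # Outputs diverged: withhold the packet until the SVM has been
            # resynchronized from the PVM.
            self.checkpoint()
            self.checkpoints += 1
        # Assumption: the PVM's copy is the one sent to the client.
        self.released.append(pvm_packet)
```

Identical outputs pass straight through; a mismatch forces one checkpoint
before the packet is released.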

> +== Components introduction ==
> +
> +You can see there are several components in COLO's diagram of architecture.
> +Their functions are described as bellow.

s/as bellow/below/

> +
> +HeartBeat:
> +Runs on both the primary and secondary nodes, to periodically check platform
> +availability. When the primary node suffers a hardware fail-stop failure,
> +the heartbeat stops responding, the secondary node will trigger a failover
> +as soon as it determines the absence.
> +
> +COLO disk Manager:
> +When primary VM writes data into image, the colo disk manger captures this data
> +and send it to secondary VM’s which makes sure the context of secondary VM's

s/send/sends/

> +image is consentient with the context of primary VM 's image.

s/consentient/consistent/
s/VM 's/VM's/

> +For more details, please refer to docs/block-replication.txt.
> +
> +Checkpoint/Failover Controller:
> +Modifications of save/restore flow to realize continuous migration,
> +to make sure the state of VM in Secondary side always be consistent with VM in

s/always be/is always/

> +Primary side.
> +
> +COLO Proxy:
> +Delivers packets to Primary and Seconday, and then compare the responses from
> +both side. Then decide whether to start a checkpoint according to some rules.
> +
> +Note:
> + a. HeartBeat is not been realized, so you need to trigger failover process

s/is/has/
s/realized/implemented yet/

Is this note going to be stale once heartbeat is implemented?

> +    by using 'x-colo-lost-heartbeat' command.
> + b. COLO proxy compents is work-in-process, it only support periodic checkpoint

s/compents is/components are a/

> +    mode now, just as Micro-checkpointing.
> +

> +3. On Primary VM's QEMU monitor, issue command:
> +{'execute':'qmp_capabilities'}
> +{ 'execute': 'human-monitor-command',
> +  'arguments': {'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xx.xx.xx.xx,file.port=8889,file.export=colo-disk0,node-name=node0'}}

It would be really nice if we could get this done through QMP
blockdev-add instead of HMP drive_add.

> +
> +Before issuing '{ "execute": "x-colo-lost-heartbeat" }' command, we have to
> +issue block related command to stop block replication.
> +Primary:
> +  Remove the nbd child from the quorum:
> +  { 'execute': 'x-blockdev-change', 'arguments': {'parent': 'colo-disk0', 'child': 'children.1'}}
> +  { 'execute': 'human-monitor-command','arguments': {'command-line': 'drive_del blk-buddy0'}}
> +  Note: there is no qmp command to remove the blockdev now

Don't we have x-blockdev-del?
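
Something along these lines is what I have in mind, untested. The helper
function is hypothetical, 'node0' is simply the node-name from the drive_add
example earlier in the patch, and whether a node created through HMP
drive_add can actually be removed this way is an assumption that needs
checking.

```python
import json


def blockdev_del_command(node_name: str) -> str:
    """Build a QMP x-blockdev-del request for the given node name."""
    return json.dumps({"execute": "x-blockdev-del",
                       "arguments": {"node-name": node_name}})


# Assumption: 'node0' is the node to delete, and it is eligible for deletion.
print(blockdev_del_command("node0"))
```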

> +
> +Secondary:
> +  The primary host is down, so we should do the following thing:
> +  { 'execute': 'nbd-server-stop' }
> +
> +== TODO ==
> +1. Support continuously VM replication.

s/continuously/continuous/

> +2. Support shared storage.
> +3. Develop the heartbeat part.
> +4. Reduce checkpoint VM’s downtime while do checkpoint.

s/do/doing/

> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
