On Mon, Mar 11, 2013 at 12:33:27AM -0400, michael.r.hines.mrhi...@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhi...@us.ibm.com>
>
>
> Signed-off-by: Michael R. Hines <mrhi...@us.ibm.com>
> ---
>  docs/rdma.txt | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 93 insertions(+)
>  create mode 100644 docs/rdma.txt
>
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> new file mode 100644
> index 0000000..a38ce1c
> --- /dev/null
> +++ b/docs/rdma.txt
> @@ -0,0 +1,93 @@
> +Changes since v2:
> +
> +- TCP channel has been eliminated. All device state uses SEND messages
> +- type 'migrate rdma:host:port' to start the migration
> +- QEMUFileRDMA is introduced
> +- librdmacm calls distinguished from qemu RDMA calls
> +- lots of code cleanup
> +
> +RDMA-based live migration protocol
> +==================================
> +
> +We use two kinds of RDMA messages:
> +
> +1. RDMA WRITES (to the receiver)
> +2. RDMA SEND (for non-live state, like devices and CPU)

Something's missing here.  Don't you need to know remote addresses
before doing RDMA writes?
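To be concrete (generic ibverbs, nothing below is code from this patch):
the WRITE work request itself carries a remote address and an rkey, so
the destination has to register the RAM block and get (addr, rkey) to
the source over some control channel before the first write can be
posted.  Roughly:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Destination: register the RAM block; mr->rkey plus the buffer's
 * address are what the source needs to learn out of band. */
struct ibv_mr *expose_ram(struct ibv_pd *pd, void *ram, size_t len)
{
    return ibv_reg_mr(pd, ram, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}

/* Source: every RDMA WRITE names the remote address and rkey. */
int write_chunk(struct ibv_qp *qp, struct ibv_mr *local_mr,
                void *chunk, uint32_t len,
                uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)chunk,
        .length = len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode              = IBV_WR_RDMA_WRITE,
        .sg_list             = &sge,
        .num_sge             = 1,
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr;

    return ibv_post_send(qp, &wr, &bad_wr);
}

The document should say where that (addr, rkey) exchange happens.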
> +
> +First, migration-rdma.c does the initial connection establishment
> +using the URI 'rdma:host:port' on the QMP command line.
> +
> +Second, the normal live migration process kicks in for 'pc.ram'.
> +
> +During the iterative phase of the migration, only RDMA WRITE messages
> +are used. Messages are grouped into "chunks" which get pinned by
> +the hardware in 64-page increments. Each chunk is acknowledged in
> +the Queue Pair's completion queue (not the individual pages).
> +
> +During iteration of RAM, there are no messages sent, just RDMA writes.
> +During the last iteration, once the devices and CPU are ready to be
> +sent, we begin to use the RDMA SEND messages.

It's unclear whether you are switching modes here.  If yes, assuming
CPU/device state is only sent during the last iteration would break
post-migration, so it is probably not a good choice for a protocol.

> +Due to the asynchronous nature of RDMA, the receiver of the migration
> +must post Receive work requests in the queue *before* a SEND work request
> +can be posted.
> +
> +To achieve this, both sides perform an initial 'barrier' synchronization.
> +Before the barrier, we already know that both sides have a receive work
> +request posted,

How?

> and then both sides exchange and block on the completion
> +queue waiting for each other to know the other peer is alive and ready
> +to send the rest of the live migration state (qemu_send/recv_barrier()).

How much?

> +At this point, the use of QEMUFile between both sides for communication
> +proceeds as normal.
> +
> +The difference between TCP and SEND comes in migration-rdma.c: since
> +we cannot simply dump the bytes into a socket, a SEND message must
> +instead be preceded by one side instructing the other side *exactly*
> +how many bytes the SEND message will contain.

Instructing how?  Presumably you use some protocol for this?

> +Each time a SEND is received, the receiver buffers the message and
> +divvies out the bytes from the SEND to the qemu_loadvm_state() function
> +until all the bytes from the buffered SEND message have been exhausted.
> +
> +Before the SEND is exhausted, the receiver sends an 'ack' SEND back
> +to the sender to let the savevm_state_* functions know that they
> +can resume and start generating more SEND messages.

The above two paragraphs seem very opaque to me.  What's an 'ack' SEND?
How do you know whether a SEND is exhausted?

> +This ping-pong of SEND messages

BTW, if by ping-pong you mean something like this:

	source:      "I have X bytes"
	destination: "ok, send me X bytes"
	source sends X bytes

then you could put the address in the destination response and use RDMA
for sending the X bytes.  It's up to you, but it might simplify the
protocol, as the only thing you would SEND would be buffer management
messages.
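Spelled out as wire messages (hypothetical names, just a sketch of the
suggestion above, not anything in this patch):

#include <stdint.h>

/* source -> destination, via SEND: "I have X bytes" */
struct announce_msg {
    uint64_t len;
};

/* destination -> source, via SEND: "ok, send me X bytes, here" */
struct ready_msg {
    uint64_t len;          /* how much the receive buffer can take  */
    uint64_t remote_addr;  /* where to RDMA WRITE the bytes         */
    uint32_t rkey;         /* rkey of the registered receive buffer */
};

The source then posts an IBV_WR_RDMA_WRITE_WITH_IMM of len bytes to
(remote_addr, rkey); the _WITH_IMM variant makes a completion pop up
on the destination's completion queue when the data has landed, so no
further SEND is needed to say "done".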
> happens until the live migration completes.

Any way to tear down the connection in case of errors?

> +
> +USAGE
> +===============================
> +
> +Compiling:
> +
> +$ ./configure --enable-rdma --target-list=x86_64-softmmu
> +
> +$ make
> +
> +Command-line on the Source machine AND Destination:
> +
> +$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or
> +whatever is the MAX of your RDMA device
> +
> +Finally, perform the actual migration:
> +
> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
> +
> +PERFORMANCE
> +===================
> +
> +Using a 40gbps InfiniBand link performing a worst-case stress test:
> +
> +1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +   average worst-case throughput approximately 30 gbps (a little
> +   better than the paper)
> +2. TCP throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +   approximately 8 gbps (using IPoIB, IP over InfiniBand)
> +
> +Average downtime (stop time) ranges between 28 and 33 milliseconds.
> +
> +An *exhaustive* paper (2010) with additional performance details is
> +linked on the QEMU wiki:
> +
> +http://wiki.qemu.org/Features/RDMALiveMigration
> --
> 1.7.10.4
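On the teardown question above: librdmacm gives you the pieces; a
generic sketch of an error/cleanup path (again, not code from this
patch) could look like:

#include <rdma/rdma_cma.h>

static void migration_rdma_teardown(struct rdma_cm_id *id,
                                    struct ibv_mr *ram_mr)
{
    rdma_disconnect(id);       /* peer gets RDMA_CM_EVENT_DISCONNECTED */
    rdma_destroy_qp(id);       /* QP created with rdma_create_qp()     */
    if (ram_mr)
        ibv_dereg_mr(ram_mr);  /* unpin the registered RAM             */
    rdma_destroy_id(id);
}

Documenting when each side calls this (and how a mid-stream error is
signalled to the peer) would answer the question.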