On Mon, Mar 11, 2013 at 12:33:27AM -0400, michael.r.hines.mrhi...@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhi...@us.ibm.com>
>
>
> Signed-off-by: Michael R. Hines <mrhi...@us.ibm.com>
> ---
>  docs/rdma.txt | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 93 insertions(+)
>  create mode 100644 docs/rdma.txt
>
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> new file mode 100644
> index 0000000..a38ce1c
> --- /dev/null
> +++ b/docs/rdma.txt
> @@ -0,0 +1,93 @@
> +Changes since v2:
> +
> +- TCP channel has been eliminated. All device state uses SEND messages
> +- type 'migrate rdma:host:port' to start the migration
> +- QEMUFileRDMA is introduced
> +- librdmacm calls distinguished from qemu RDMA calls
> +- lots of code cleanup
> +
> +RDMA-based live migration protocol
> +==================================
> +
> +We use two kinds of RDMA messages:
> +
> +1. RDMA WRITES (to the receiver)
> +2. RDMA SEND (for non-live state, like devices and CPU)

Something's missing here.  Don't you need to know remote addresses
before doing RDMA writes?
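To be concrete (generic ibverbs, nothing below is code from this patch):
the WRITE work request itself carries a remote address and an rkey, so
the destination has to register the RAM block and get (addr, rkey) to
the source over some control channel before the first write can be
posted.  Roughly:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Destination: register the RAM block; mr->rkey plus the buffer's
 * address are what the source needs to learn out of band. */
struct ibv_mr *expose_ram(struct ibv_pd *pd, void *ram, size_t len)
{
    return ibv_reg_mr(pd, ram, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}

/* Source: every RDMA WRITE names the remote address and rkey. */
int write_chunk(struct ibv_qp *qp, struct ibv_mr *local_mr,
                void *chunk, uint32_t len,
                uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)chunk,
        .length = len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode              = IBV_WR_RDMA_WRITE,
        .sg_list             = &sge,
        .num_sge             = 1,
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr;

    return ibv_post_send(qp, &wr, &bad_wr);
}

The document should say where that (addr, rkey) exchange happens.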
> +
> +First, migration-rdma.c does the initial connection establishment
> +using the URI 'rdma:host:port' on the QMP command line.
> +
> +Second, the normal live migration process kicks in for 'pc.ram'.
> +
> +During the iterative phase of the migration, only RDMA WRITE messages
> +are used. Messages are grouped into "chunks" which get pinned by
> +the hardware in 64-page increments. Each chunk is acknowledged in
> +the Queue Pair's completion queue (not the individual pages).
> +
> +During iteration of RAM, there are no messages sent, just RDMA writes.
> +During the last iteration, once the devices and CPU are ready to be
> +sent, we begin to use the RDMA SEND messages.

It's unclear whether you are switching modes here.  If yes, assuming
CPU/device state is only sent during the last iteration would break
post-migration, so it is probably not a good choice for a protocol.

> +Due to the asynchronous nature of RDMA, the receiver of the migration
> +must post Receive work requests in the queue *before* a SEND work request
> +can be posted.
> +
> +To achieve this, both sides perform an initial 'barrier' synchronization.
> +Before the barrier, we already know that both sides have a receive work
> +request posted,

How?

> and then both sides exchange and block on the completion
> +queue waiting for each other to know the other peer is alive and ready
> +to send the rest of the live migration state (qemu_send/recv_barrier()).

How much?

> +At this point, the use of QEMUFile between both sides for communication
> +proceeds as normal.
> +
> +The difference between TCP and SEND comes in migration-rdma.c: since
> +we cannot simply dump the bytes into a socket, a SEND message must
> +instead be preceded by one side instructing the other side *exactly*
> +how many bytes the SEND message will contain.

Instructing how?  Presumably you use some protocol for this?

> +Each time a SEND is received, the receiver buffers the message and
> +divvies out the bytes from the SEND to the qemu_loadvm_state() function
> +until all the bytes from the buffered SEND message have been exhausted.
> +
> +Before the SEND is exhausted, the receiver sends an 'ack' SEND back
> +to the sender to let the savevm_state_* functions know that they
> +can resume and start generating more SEND messages.

The above two paragraphs seem very opaque to me.  What's an 'ack' SEND?
How do you know whether a SEND is exhausted?

> +This ping-pong of SEND messages

BTW, if by ping-pong you mean something like this:

	source:      "I have X bytes"
	destination: "ok, send me X bytes"
	source sends X bytes

then you could put the address in the destination response and use RDMA
for sending the X bytes.  It's up to you, but it might simplify the
protocol, as the only thing you would SEND would be buffer management
messages.
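Spelled out as wire messages (hypothetical names, just a sketch of the
suggestion above, not anything in this patch):

#include <stdint.h>

/* source -> destination, via SEND: "I have X bytes" */
struct announce_msg {
    uint64_t len;
};

/* destination -> source, via SEND: "ok, send me X bytes, here" */
struct ready_msg {
    uint64_t len;          /* how much the receive buffer can take  */
    uint64_t remote_addr;  /* where to RDMA WRITE the bytes         */
    uint32_t rkey;         /* rkey of the registered receive buffer */
};

The source then posts an IBV_WR_RDMA_WRITE_WITH_IMM of len bytes to
(remote_addr, rkey); the _WITH_IMM variant makes a completion pop up
on the destination's completion queue when the data has landed, so no
further SEND is needed to say "done".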
> happens until the live migration completes.

Any way to tear down the connection in case of errors?

> +
> +USAGE
> +===============================
> +
> +Compiling:
> +
> +$ ./configure --enable-rdma --target-list=x86_64-softmmu
> +
> +$ make
> +
> +Command-line on the Source machine AND Destination:
> +
> +$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or
> +whatever is the MAX of your RDMA device
> +
> +Finally, perform the actual migration:
> +
> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
> +
> +PERFORMANCE
> +===================
> +
> +Using a 40gbps InfiniBand link performing a worst-case stress test:
> +
> +1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +   average worst-case throughput approximately 30 gbps (a little
> +   better than the paper)
> +2. TCP throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +   approximately 8 gbps (using IPoIB, IP over InfiniBand)
> +
> +Average downtime (stop time) ranges between 28 and 33 milliseconds.
> +
> +An *exhaustive* paper (2010) with additional performance details is
> +linked on the QEMU wiki:
> +
> +http://wiki.qemu.org/Features/RDMALiveMigration
> --
> 1.7.10.4
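On the teardown question above: librdmacm gives you the pieces; a
generic sketch of an error/cleanup path (again, not code from this
patch) could look like:

#include <rdma/rdma_cma.h>

static void migration_rdma_teardown(struct rdma_cm_id *id,
                                    struct ibv_mr *ram_mr)
{
    rdma_disconnect(id);       /* peer gets RDMA_CM_EVENT_DISCONNECTED */
    rdma_destroy_qp(id);       /* QP created with rdma_create_qp()     */
    if (ram_mr)
        ibv_dereg_mr(ram_mr);  /* unpin the registered RAM             */
    rdma_destroy_id(id);
}

Documenting when each side calls this (and how a mid-stream error is
signalled to the peer) would answer the question.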