On Mon, Mar 18, 2013 at 04:24:44PM -0400, Michael R. Hines wrote:
> On 03/18/2013 06:40 AM, Michael S. Tsirkin wrote:
> >I think there are two things here, API documentation and protocol
> >documentation, protocol documentation still needs some more work.
> >Also if what I understand from this document is correct this
> >breaks memory overcommit on destination which needs to be fixed.
> >
> >I think something chunk-based on the destination side is required
> >as well. You also can't trust the source to tell you the chunk
> >size it could be malicious and ask for too much. Maybe source
> >gives chunk size hint and destination responds with what it wants
> >to use.
>
> Do we allow ballooning *during* the live migration? Is that necessary?

Probably but I haven't mentioned ballooning at all.
memory overcommit != ballooning

> Would it be sufficient to inform the destination which pages are ballooned
> and then only register the ones that the VM actually owns?

I haven't thought about it.

> >Is there any feature and/or version negotiation? How are we going to
> >handle compatibility when we extend the protocol?
> You mean, on top of the protocol versioning that's already
> builtin to QEMUFile? inside qemu_savevm_state_begin()?

I mean for protocol things like credit negotiation, which are unrelated
to high level QEMUFile.

> Should I piggy-back an additional protocol version number
> before QEMUFile sends its version number?

CM can exchange a bit of data during connection setup, maybe use that?

> >So how does destination know it's ok to send anything to source? I
> >suspect this is wrong. When using CM you must post on RQ before
> >completing the connection negotiation, not after it's done.
>
> This is already handled by the RDMA connection manager (librdmacm).
>
> The library already has functions like listen() and accept() the same
> way that TCP does.
>
> Once these functions return success, we have a guarantee that both
> sides of the connection have already posted the appropriate work
> requests sufficient for driving the migration.

Not if you don't post anything. librdmacm does not post requests. So
everyone posts 1 buffer on RQ during connection setup?
OK though this is not what the document said, I was under the impression
this is done after connection setup.

>
> >>+2. We transmit an empty SEND to let the sender know that
> >>+ we are *ready* to receive some bytes from QEMUFileRDMA.
> >>+ These bytes will come in the form of another SEND.
> >Using an empty message seems somewhat hacky, a fixed header in the
> >message would let you do more things if protocol is ever extended.
>
> Great idea.......
> I'll add a struct RDMAHeader to each send
> message in the next RFC which includes a version number.
>
> (Until now, there were *only* QEMUFile bytes, nothing else,
> so I didn't have any reason for a formal structure.)
> >
> >OK to summarize flow control: at any time there's either 0 or 1
> >outstanding buffers in RQ. At each time only one side can talk.
> >Destination always goes first, then source, etc. At each time a
> >single send message can be passed. Just FYI, this means you are
> >often at 0 buffers in RQ and IIRC 0 buffers is a worst-case path
> >for infiniband. It's better to keep at least 1 buffer in RQ at
> >all times, so prepost 2 initially so it would fluctuate between 1
> >and 2.
>
> That's correct. Having 0 buffers is not possible - sending
> a message with 0 buffers would throw an error. The "protocol"
> as I described ensures that there is always one buffer posted
> before waiting for another message to arrive.

So # of buffers goes 0 -> 1 -> 0 -> 1.
What I am saying is you should have an extra buffer
so it goes 1 -> 2 -> 1 -> 2
otherwise you keep hitting slow path in RQ processing:
each time you consume the last buffer, IIRC receiver sends
an ACK to sender saying "hey this is the last buffer, slow down".
You don't want that.

> I avoided "better" flow control because the non-live state
> is so small in comparison to the pc.ram contents that would be sent.
> The non-live state is in the range of kilobytes, so it seemed silly to
> have more rigorous flow control....

I think it's good enough, just add an extra unused buffer to make
hardware happy.

> >>+Migration of pc.ram:
> >>+===============================
> >>+
> >>+At the beginning of the migration, (migration-rdma.c),
> >>+the sender and the receiver populate the list of RAMBlocks
> >>+to be registered with each other into a structure.
> >Could you add the packet format here as well please?
> >Need to document endian-ness etc.
>
> There is no packet format for pc.ram.

The 'structure' above is passed using SEND so there is a format.

> It's just bytes - raw RDMA
> writes of each 4K page, because the memory must be registered
> before the RDMA write can begin.
>
> (As discussed, there will be a format for SEND, though - so I'll
> take care of that in my next RFC).
>
> > > Yes but we also need to report errors detected during migration.
> >Need to document how this is done. We also need to report success.
> Acknowledged - I'll add more verbosity to the different error conditions.
>
> - Michael R. Hines
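
To make the fixed-header and endian-ness points above concrete, here is a rough sketch of what a versioned header on each SEND could look like. This is illustration only: the thread names struct RDMAHeader and a version number, but the other fields (type, len), the 12-byte wire size, the big-endian encoding and the helper names are all assumptions, not what the next RFC will define.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: only "struct RDMAHeader" and "version number"
 * come from the discussion; everything else here is an assumption. */
#define RDMA_HDR_VERSION  1u
#define RDMA_HDR_WIRE_LEN 12u   /* three 32-bit fields on the wire */

typedef struct {
    uint32_t version;  /* protocol version, checked by the receiver */
    uint32_t type;     /* message type (ready, QEMUFile bytes, error, ...) */
    uint32_t len;      /* number of payload bytes following the header */
} RDMAHeader;

/* Store a 32-bit value in big-endian (network) byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Load a 32-bit big-endian value regardless of host byte order. */
static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Serialize the header into buf.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t rdma_header_encode(const RDMAHeader *h, uint8_t *buf, size_t buflen)
{
    if (buflen < RDMA_HDR_WIRE_LEN) {
        return 0;
    }
    put_be32(buf,     h->version);
    put_be32(buf + 4, h->type);
    put_be32(buf + 8, h->len);
    return RDMA_HDR_WIRE_LEN;
}

/* Parse a header from buf.
 * Returns 0 on success, -1 on a short buffer or a version mismatch. */
int rdma_header_decode(const uint8_t *buf, size_t buflen, RDMAHeader *out)
{
    if (buflen < RDMA_HDR_WIRE_LEN) {
        return -1;
    }
    out->version = get_be32(buf);
    out->type    = get_be32(buf + 4);
    out->len     = get_be32(buf + 8);
    return out->version == RDMA_HDR_VERSION ? 0 : -1;
}
```

Encoding field by field rather than sending the raw struct keeps the wire format independent of host byte order and compiler padding, and the receiver can reject an unknown version before touching the payload.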