Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport

Michael S. Tsirkin Mon, 18 Mar 2013 14:27:10 -0700

On Mon, Mar 18, 2013 at 04:24:44PM -0400, Michael R. Hines wrote:
> On 03/18/2013 06:40 AM, Michael S. Tsirkin wrote:
> >I think there are two things here: API documentation and protocol
> >documentation. The protocol documentation still needs some more work.
> >Also, if what I understand from this document is correct, this
> >breaks memory overcommit on the destination, which needs to be fixed.
> >
> >I think something chunk-based on the destination side is required
> >as well. You also can't trust the source to tell you the chunk
> >size; it could be malicious and ask for too much. Maybe source
> >gives chunk size hint and destination responds with what it wants
> >to use.
> 
> Do we allow ballooning *during* the live migration? Is that necessary?


Probably but I haven't mentioned ballooning at all.

memory overcommit != ballooning

> Would it be sufficient to inform the destination which pages are ballooned
> and then only register the ones that the VM actually owns?

I haven't thought about it.

> >Is there any feature and/or version negotiation? How are we going to
> >handle compatibility when we extend the protocol?
> You mean, on top of the protocol versioning that's already
> builtin to QEMUFile? inside qemu_savevm_state_begin()?

I mean for protocol things like credit negotiation, which are unrelated
to high level QEMUFile.

> Should I piggy-back an additional protocol version number
> before QEMUFile sends its version number?

CM can exchange a bit of data during connection setup, maybe use that?
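
To make the suggestion concrete, here is a minimal sketch of a
version/capability blob that could travel as CM private data (librdmacm's
struct rdma_conn_param has private_data and private_data_len fields for
this). The struct name, field layout, and function names below are
assumptions made up for illustration; nothing like this is in the posted
series. Fields are big-endian on the wire, and the unpack side accepts a
longer buffer because the transport may deliver more private data bytes
than were sent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability blob; layout is an assumption, not the patch's. */
#define MIG_CAPS_WIRE_LEN 8

struct mig_caps {
    uint32_t version;   /* highest protocol version this side supports */
    uint32_t flags;     /* optional-feature bits */
};

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Serialize big-endian. Returns bytes written, or 0 if buf is too small. */
size_t mig_caps_pack(const struct mig_caps *c, uint8_t *buf, size_t len)
{
    assert(c != NULL && buf != NULL);
    if (len < MIG_CAPS_WIRE_LEN) {
        return 0;
    }
    put_be32(buf, c->version);
    put_be32(buf + 4, c->flags);
    return MIG_CAPS_WIRE_LEN;
}

/* Returns 0 on success, -1 if the received blob is too short. */
int mig_caps_unpack(struct mig_caps *c, const uint8_t *buf, size_t len)
{
    assert(c != NULL && buf != NULL);
    if (len < MIG_CAPS_WIRE_LEN) {
        return -1;
    }
    c->version = get_be32(buf);
    c->flags = get_be32(buf + 4);
    return 0;
}

/* Both sides settle on the lower version and the common feature bits. */
struct mig_caps mig_caps_negotiate(const struct mig_caps *local,
                                   const struct mig_caps *remote)
{
    struct mig_caps out;
    out.version = local->version < remote->version ? local->version
                                                   : remote->version;
    out.flags = local->flags & remote->flags;
    return out;
}
```

Each side would pack its own mig_caps into the connect/accept parameters
and run mig_caps_negotiate() on what it receives, so neither side has to
trust the other's claimed maximum.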

> >So how does destination know it's ok to send anything to source? I
> >suspect this is wrong. When using CM you must post on RQ before
> >completing the connection negotiation, not after it's done.
> 
> This is already handled by the RDMA connection manager (librdmacm).
> 
> The library already has functions like listen() and accept() the same
> way that TCP does.
> 
> Once these functions return success, we have a guarantee that both
> sides of the connection have already posted the appropriate work
> requests sufficient for driving the migration.

Not if you don't post anything. librdmacm does not post requests. So
everyone posts 1 buffer on RQ during connection setup?
OK, though this is not what the document said; I was under the impression
this is done after connection setup.

> 
> >>+2. We transmit an empty SEND to let the sender know that
> >>+   we are *ready* to receive some bytes from QEMUFileRDMA.
> >>+   These bytes will come in the form of another SEND.
> >Using an empty message seems somewhat hacky, a fixed header in the
> >message would let you do more things if protocol is ever extended.
> 
> Great idea....... I'll add a struct RDMAHeader to each send
> message in the next RFC which includes a version number.
> 
> (Until now, there were *only* QEMUFile bytes, nothing else,
> so I didn't have any reason for a formal structure.)
> 
> 
> >OK to summarize flow control: at any time there's either 0 or 1
> >outstanding buffers in RQ. At each time only one side can talk.
> >Destination always goes first, then source, etc. At each time a
> >single send message can be passed. Just FYI, this means you are
> >often at 0 buffers in RQ and IIRC 0 buffers is a worst-case path
> >for InfiniBand. It's better to keep at least 1 buffer in RQ at
> >all times, so prepost 2 initially so it would fluctuate between 1
> >and 2.
> 
> That's correct. Having 0 buffers is not possible - sending
> a message with 0 buffers would throw an error. The "protocol"
> as I described ensures that there is always one buffer posted
> before waiting for another message to arrive.

So # of buffers goes 0 -> 1 -> 0 -> 1.
What I am saying is you should have an extra buffer
so it goes 1 -> 2 -> 1 -> 2
otherwise you keep hitting slow path in RQ processing:
each time you consume the last buffer, IIRC receiver sends
an ACK to sender saying "hey this is the last buffer, slow down".
You don't want that.
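
A toy model of the buffer accounting, just to show the difference between
preposting 1 and 2 receive buffers. This is plain arithmetic with no
verbs calls; rq_simulate() and struct rq_stats are hypothetical names
invented for this sketch, and the model assumes the receiver reposts one
buffer after every message it consumes.

```c
#include <assert.h>

/* Result of a simulated ping-pong exchange on one side's receive queue. */
struct rq_stats {
    int min_depth;      /* lowest number of buffers seen in the RQ */
    int empty_events;   /* how many times the RQ dropped to zero */
};

/*
 * 'prepost' buffers are posted up front; each round one incoming SEND
 * consumes a buffer and the receiver then reposts one.
 */
struct rq_stats rq_simulate(int prepost, int rounds)
{
    struct rq_stats s = { prepost, 0 };
    int depth = prepost;

    assert(prepost >= 1 && rounds >= 0);
    for (int i = 0; i < rounds; i++) {
        depth--;                      /* incoming SEND consumes a buffer */
        if (depth < s.min_depth) {
            s.min_depth = depth;
        }
        if (depth == 0) {
            s.empty_events++;         /* last buffer gone: the slow path */
        }
        depth++;                      /* receiver reposts one buffer */
    }
    return s;
}
```

With prepost = 1 the queue empties on every round; with prepost = 2 it
never drops below 1, at the cost of one extra unused buffer.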

> I avoided "better" flow control because the non-live state
> is so small in comparison to the pc.ram contents that would be sent.
> The non-live state is in the range of kilobytes, so it seemed silly to
> have more rigorous flow control....

I think it's good enough, just add an extra unused buffer to make
hardware happy.

> >>+Migration of pc.ram:
> >>+===============================
> >>+
> >>+At the beginning of the migration, (migration-rdma.c),
> >>+the sender and the receiver populate the list of RAMBlocks
> >>+to be registered with each other into a structure.
> >Could you add the packet format here as well please?
> >Need to document endian-ness etc.
> 
> There is no packet format for pc.ram.

The 'structure' above is passed using SEND so there is
a format.
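
For illustration, a minimal sketch of what a documented SEND header could
look like, with the endianness stated and the length checked on receipt.
The field set (version, type, payload_len), the 12-byte size, and all
names here are assumptions for the sake of the example, not the layout of
the RDMAHeader proposed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEND_HDR_LEN 12u   /* three 32-bit fields, all big-endian on the wire */

/* Hypothetical fixed header that would precede every SEND payload. */
struct send_hdr {
    uint32_t version;      /* protocol version */
    uint32_t type;         /* message type, e.g. "ready" or "QEMUFile data" */
    uint32_t payload_len;  /* bytes following the header */
};

static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write the header into the first SEND_HDR_LEN bytes of out. */
void send_hdr_encode(const struct send_hdr *h, uint8_t *out)
{
    assert(h != NULL && out != NULL);
    store_be32(out, h->version);
    store_be32(out + 4, h->type);
    store_be32(out + 8, h->payload_len);
}

/*
 * Decode and validate a received message of msg_len bytes.
 * Returns 0 on success, -1 if the message is shorter than a header or
 * the peer claims more payload than actually arrived.
 */
int send_hdr_decode(struct send_hdr *h, const uint8_t *msg, size_t msg_len)
{
    assert(h != NULL && msg != NULL);
    if (msg_len < SEND_HDR_LEN) {
        return -1;
    }
    h->version = load_be32(msg);
    h->type = load_be32(msg + 4);
    h->payload_len = load_be32(msg + 8);
    if (h->payload_len > msg_len - SEND_HDR_LEN) {
        return -1;
    }
    return 0;
}
```

Checking payload_len against the number of bytes actually received keeps
the receiver from trusting a length supplied by the other side.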

> It's just bytes - raw RDMA
> writes of each 4K page, because the memory must be registered
> before the RDMA write can begin.
> 
> (As discussed, there will be a format for SEND, though - so I'll
> take care of that in my next RFC).
> 
> >Yes, but we also need to report errors detected during migration.
> >Need to document how this is done. We also need to report success.
> Acknowledged - I'll add more verbosity to the different error conditions.
> 
> - Michael R. Hines
