Peter Xu <pet...@redhat.com> writes:

> On Wed, Jul 10, 2024 at 05:16:36PM -0300, Fabiano Rosas wrote:
>> Peter Xu <pet...@redhat.com> writes:
>> 
>> > On Wed, Jul 10, 2024 at 01:10:37PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <pet...@redhat.com> writes:
>> >> 
>> >> > On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
>> >> >> > Or graphically:
>> >> >> > 
>> >> >> > 1) client fills the active slot with data. Channels point to nothing
>> >> >> >    at this point:
>> >> >> >   [a]      <-- active slot
>> >> >> >   [][][][] <-- free slots, one per-channel
>> >> >> > 
>> >> >> >   [][][][] <-- channels' p->data pointers
>> >> >> > 
>> >> >> > 2) multifd_send() swaps the pointers inside the client slot. Channels
>> >> >> >    still point to nothing:
>> >> >> >   []
>> >> >> >   [a][][][]
>> >> >> > 
>> >> >> >   [][][][]
>> >> >> > 
>> >> >> > 3) multifd_send() finds an idle channel and updates its pointer:
>> >> >> 
>> >> >> It seems the action "finds an idle channel" is in step 2 rather than
>> >> >> step 3, which means the free slot is selected based on the id of the
>> >> >> channel found, am I understanding correctly?
>> >> >
>> >> > I think you're right.
>> >> >
>> >> > Actually I also feel like the description here is ambiguous, even
>> >> > though I think I get what Fabiano wanted to say.
>> >> >
>> >> > The free slot should be the first step of step 2+3, here what Fabiano
>> >> > really wanted to suggest is we move the free buffer array from multifd
>> >> > channels into the callers, then the caller can pass in whatever data to
>> >> > send.
>> >> >
>> >> > So I think maybe it's cleaner to write it as this in code (note: I
>> >> > didn't really change the code, just some ordering and comments):
>> >> >
>> >> > ===8<===
>> >> > @@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
>> >> >       */
>> >> >      active_slot = slots->active;
>> >> >      slots->active = slots->free[p->id];
>> >> > -    p->data = active_slot;
>> >> > -
>> >> > -    /*
>> >> > -     * By the next time we arrive here, the channel will certainly
>> >> > -     * have consumed the active slot. Put it back on the free list
>> >> > -     * now.
>> >> > -     */
>> >> >      slots->free[p->id] = active_slot;
>> >> >  
>> >> > +    /* Assign the current active slot to the chosen thread */
>> >> > +    p->data = active_slot;
>> >> > ===8<===
>> >> >
>> >> > The comment I removed is slightly misleading to me too, because right
>> >> > now active_slot contains data that hasn't yet been delivered to
>> >> > multifd, so we're "putting it back to free list" not because it's
>> >> > free, but because we know it won't get used until the multifd send
>> >> > thread consumes it (because before that the thread will be busy, and
>> >> > we won't use the buffer if so in upcoming send()s).
>> >> >
>> >> > And then when I'm looking at this again, I think maybe it's a slight
>> >> > overkill, and maybe we can still keep the "opaque data" managed by 
>> >> > multifd.
>> >> > One reason might be that I don't expect the "opaque data" payload keep
>> >> > growing at all: it should really be either RAM or device state as I
>> >> > commented elsewhere in a relevant thread, after all it's a thread model
>> >> > only for migration purpose to move vmstates..
>> >> 
>> >> Some amount of flexibility needs to be baked in. For instance, what
>> >> about the handshake procedure? Don't we want to use multifd threads to
>> >> put some information on the wire for that as well?
>> >
>> > Is this an orthogonal question?
>> 
>> I don't think so. You say the payload data should be either RAM or
>> device state. I'm asking what other types of data do we want the multifd
>> channel to transmit and suggesting we need to allow room for the
>> addition of that, whatever it is. One thing that comes to mind that is
>> neither RAM nor device state is some form of handshake or capabilities
>> negotiation.
>
> Indeed what I thought was multifd payload should be either ram or device,
> nothing else.  The worst case is we can add one more into the union, but I
> can't think of one.

What about the QEMUFile traffic? There's an iov in there. I have been
thinking of replacing some of qemu-file.c guts with calls to
multifd. Instead of several qemu_put_byte() we could construct an iov
and give it to multifd for transferring, call multifd_sync at the end and
get rid of the QEMUFile entirely. I don't have that completely laid out
at the moment, but I think it should be possible. I get concerned about
making assumptions on the types of data we're ever going to want to
transmit. I bet someone thought in the past that multifd would never be
used for anything other than ram.
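
Roughly what I have in mind, just as a sketch (multifd_queue_iov() is
made up to illustrate the direction, the header layout is invented, and
I'm going from memory on multifd_send_sync_main()):

===8<===
static bool send_state_via_multifd(uint32_t section_id,
                                   void *state_buf, size_t state_len)
{
    /* static so the buffer outlives the enqueue; real code would use
     * memory owned by the payload object instead */
    static uint32_t hdr;
    struct iovec iov[2];

    /* build the iov directly instead of qemu_put_*() into a QEMUFile */
    hdr = cpu_to_be32(section_id);
    iov[0].iov_base = &hdr;
    iov[0].iov_len = sizeof(hdr);
    iov[1].iov_base = state_buf;        /* caller-owned opaque blob */
    iov[1].iov_len = state_len;

    if (!multifd_queue_iov(iov, 2)) {   /* hypothetical enqueue */
        return false;
    }
    multifd_send_sync_main();           /* existing sync point */
    return true;
}
===8<===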

>
> I wonder why handshake needs to be done per-thread.  I was naturally
> thinking the handshake should happen sequentially, talking over everything
> including multifd.

Well, it would still be thread-based. Just that it would be 1 thread and
it would not be managed by multifd. I don't see the point of that. We
could make everything multifd-based. Any piece of data that needs to
reach the other side of the migration could be sent through multifd, no?

Also, when you say "per-thread", that's the model we're trying to get
away from. There should be nothing "per-thread", the threads just
consume the data produced by the clients. Anything "per-thread" that is
not strictly related to the thread model should go away. For instance,
p->page_size, p->page_count, p->write_flags, p->flags, etc. None of
these should be in MultiFDSendParams. That thing should be (say)
MultifdChannelState and contain only the semaphores and control flags
for the threads.
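
Something along these lines is what I picture for the channel side (the
fields are only illustrative, mostly copied from what MultiFDSendParams
already has; not a concrete proposal):

===8<===
typedef struct {
    uint8_t id;                 /* channel index */
    QemuThread thread;          /* the send thread itself */
    QemuSemaphore sem;          /* kicked when there's work to do */
    QemuSemaphore sem_sync;     /* sync with the migration thread */
    QemuMutex mutex;
    bool pending_job;
    bool running;
    /* opaque payload, filled and owned by whichever client enqueued */
    MultiFDSendData *data;
} MultifdChannelState;
===8<===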

It would be nice if we could once and for all have a model that can
dispatch data transfers without having to fiddle with threading all the
time. Any time someone wants to do something different in the migration
code, another random qemu_thread_create() ends up flying around.

>  IMO multifd to have these threads are mostly for the
> sake of performance.  I sometimes think we have some tiny places where we
> "over-engineered" multifd, e.g. on attaching ZLIB/ZSTD/... flags on each
> packet header, even if they should never change, and that is the part of
> thing we can put into handshake too, and after handshake we should assume
> both sides and all threads are in sync.  There's no need to worry
> compressor per-packet, per-channel.  It could be a global thing and done
> upfront, even if Libvirt didn't guarantee those.

Yep, I agree. And if any client needs such granularity, it can add a
sub-header of its own.
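
For example (purely hypothetical layout, just to show what I mean by a
client-owned sub-header):

===8<===
typedef struct {
    uint32_t compression;   /* meaning private to the RAM client */
    uint32_t reserved;
} __attribute__((packed)) MultiFDRAMSubHeader;
===8<===

The multifd packet header stays fixed and the client prepends whatever
it needs to its own payload.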

>> 
>> >
>> > What I meant above is it looks fine to me to keep "device state" in
>> > multifd.c, as long as it is not only about VFIO.
>> >
>> > What you were saying seems to be about how to identify this is a device
>> > state, then I just hope VFIO shares the same flag with any future device
>> > that would also like to send its state via multifd, like:
>> >
>> > #define MULTIFD_FLAG_DEVICE_STATE (32 << 1)
>> >
>> > Then set it in MultiFDPacket_t.flags.  The dest qemu should route that
>> > packet to the device vmsd / save_entry for parsing.
>> 
>> Sure, that part I agree with, no issue here.
>> 
>> >
>> >> 
>> >> > Putting it managed by multifd thread should involve less change than
>> >> > this series, but it could look like this:
>> >> >
>> >> > typedef enum {
>> >> >     MULTIFD_PAYLOAD_RAM = 0,
>> >> >     MULTIFD_PAYLOAD_DEVICE_STATE = 1,
>> >> > } MultifdPayloadType;
>> >> >
>> >> > typedef union {
>> >> >     MultiFDPages_t ram_payload;
>> >> >     MultifdDeviceState_t device_payload;
>> >> > } MultifdPayload;
>> >> >
>> >> > struct MultiFDSendData {
>> >> >     MultifdPayloadType type;
>> >> >     MultifdPayload data;
>> >> > };
>> >> 
>> >> Is that an union up there? So you want to simply allocate in multifd the
>> >
>> > Yes.
>> >
>> >> max amount of memory between the two types of payload? But then we'll
>> >
>> > Yes.
>> >
>> >> need a memset(p->data, 0, ...) at every round of sending to avoid giving
>> >> stale data from one client to another. That doesn't work with the
>> >
>> > I think as long as the one to enqueue will always setup the fields, we
>> > don't need to do memset.  I am not sure if it's a major concern to always
>> > set all the relevant fields in the multifd enqueue threads.  It sounds
>> > like something we should be doing anyway.
>> 
>> Well, writing to a region of memory that was "owned" by another multifd
>> client and already has a bunch of data there is somewhat prone to
>> bugs. Just forget to set something and now things start to behave
>> weirdly. I guess that's just the price of having an union. I'm not
>> against that, but I would maybe prefer to have each client hold its own
>> data and not have to think about anything else. Much of this feeling
>> comes from how the RAM code currently works (more on that below).
>
> IIUC so far not using union is fine.  However in the future what if we
> extend device states to vcpus?  In that case the vcpu will need to have its
> own array of SendData[], even though it will share the same structure with
> what VFIO would use.
>
> If with an union, and if we can settle it'll be either PAGES or DEV_STATE,
> then when vcpu state is involved, vcpu will simply enqueue the same
> DEV_STATE.

Yeah, the union is advantageous from that aspect.
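
The way I picture the client side with the union (sketch only; the
device_payload fields are whatever the VFIO series ends up defining,
and multifd_enqueue() is hand-waved, see further down):

===8<===
static MultiFDSendData *vfio_send_data;   /* VFIO's cached SendData */

static void vfio_queue_state(uint32_t instance_id)
{
    MultiFDSendData *d = vfio_send_data;

    d->type = MULTIFD_PAYLOAD_DEVICE_STATE;
    d->data.device_payload.instance_id = instance_id;
    /* ... idstr, the state buffer itself, etc ... */

    /* hand the filled cache to multifd, get a free one back */
    multifd_enqueue(&vfio_send_data);
}
===8<===

A vcpu-state user would do exactly the same, just tagging the same
DEV_STATE type.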

> One thing to mention is that when with an union we may probably need to get
> rid of multifd_send_state->pages already.

Hehe, please don't do this like "oh, by the way...". This is a major
pain point. I've been complaining about that "holding of client data"
since the first time I read that code. So if you're going to propose
something, it needs to account for that.

> The object can't be a global
> cache (in which case so far it's N+1, N being n_multifd_channels, while "1"
> is the extra buffer as only RAM uses it).  In the union world we'll need to
> allocate M+N SendData, where N is still the n_multifd_channels, and M is
> the number of users, in VFIO's case, VFIO allocates the cached SendData and
> use that to enqueue, right after enqueue it'll get a free one by switching
> it with another one in the multifd's array[N].  Same to RAM.  Then there'll
> be N+2 SendData and VFIO/RAM needs to free their own SendData when cleanup
> (multifd owns the N per-thread only).
>

At first sight, that seems to work. It's similar to this series, but
you're moving the free slots back into the channels. Should I keep
SendData as an actual separate array instead of multiple p->data?

Let me know, I'll write some code and see what it looks like.
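
In the meantime, here's roughly how I read the enqueue path under that
scheme (sketch only; names are invented and multifd_pick_idle_channel()
is hand-waved, the point is just the ownership swap):

===8<===
static bool multifd_enqueue(MultiFDSendData **client_data)
{
    MultiFDSendParams *p = multifd_pick_idle_channel();
    MultiFDSendData *tmp;

    if (!p) {
        return false;
    }

    /* give the filled buffer to the channel, take its free one back */
    tmp = p->data;
    p->data = *client_data;
    *client_data = tmp;

    qemu_sem_post(&p->sem);   /* kick the send thread */
    return true;
}
===8<===

So multifd only ever owns the N per-channel buffers and each client
keeps exactly one of its own, which is the M part.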

>> 
>> >
>> >> current ram migration because it wants p->pages to remain active across
>> >> several calls of multifd_queue_page().
>> >
>> > I don't think I followed here.
>> >
>> > What I meant: QEMU maintains SendData[8], now a bunch of pages arrives, it
>> > enqueues "pages" into a free slot index 2 (set type=pages), then before
>> > thread 2 finished sending the bunch of pages, SendData[2] will always
>> > represent those pages without being used by anything else. What did I miss?
>> >
>> 
>> You're missing multifd_send_state->pages and the fact that it holds
>> unsent data on behalf of the client. At every call to
>> multifd_queue_pages(), the RAM code expects the previously filled pages
>> structure to be there. Since we intend to have more than one multifd
>> client, now the other client (say device state) might run, it will take
>> that slot and fill it with its own stuff (or rather fill p->send_data
>> and multifd_send_pages() switches the pointer). Next call to
>> multifd_queue_pages(), it will take multifd_send_state->pages and
>> there'll be garbage there.
>
> Please see above to see whether that may answer here too; in general I
> think we need to drop multifd_send_state->pages, but let SendData caches be
> managed by the users, so instead of multifd managing N+1 SendData, it only
> manages N, leaving the rest to possible users of multifd.  It also means
> any multifd enqueue user must first provide a valid cache object of
> SendData.
>
>> 
>> The code is not: take a free slot from the next idle channel and fill it
>> with data.
>> 
>> It is: take from multifd_send_state the active slot which *might* have
>> previously been consumed by the last channel and (continue to) fill it
>> with data.
>> 
>> "might", because successive calls to multifd_queue_page() don't need to
>> call multifd_send_page() to flush to the channel.
>> 
>> >> 
>> >> >
>> >> > Then the "enum" makes sure the payload only consumes the max of both
>> >> > types; a side benefit to save some memory.
>> >> >
>> >> > I think we need to make sure MultifdDeviceState_t is generic enough
>> >> > so that it will work for mostly everything (especially normal VMSDs).
>> >> > In this case the VFIO series should be good as that was currently
>> >> > defined as:
>> >> >
>> >> > typedef struct {
>> >> >     MultiFDPacketHdr_t hdr;
>> >> >
>> >> >     char idstr[256] QEMU_NONSTRING;
>> >> >     uint32_t instance_id;
>> >> >
>> >> >     /* size of the next packet that contains the actual data */
>> >> >     uint32_t next_packet_size;
>> >> > } __attribute__((packed)) MultiFDPacketDeviceState_t;
>> >> 
>> >> This is the packet, a different thing. Not sure if your paragraph above
>> >> means to talk about that or really MultifdDeviceState, which is what is
>> >> exchanged between the multifd threads and the client code.
>> >
>> > I meant the wire protocol looks great from that POV.  We may need similar
>> > thing for the type==device_state slots just to be generic.
>> >
>> >> 
>> >> >
>> >> > IIUC that was what we need exactly with idstr+instance_id, so as to
>> >> > nail down exactly where the "opaque device state" should go, then load
>> >> > it with a buffer-based loader when it's ready (starting from VFIO, to
>> >> > get rid of qemufile).  For VMSDs in the future, if ever possible, that
>> >> > should be a modified version of vmstate_load() where it may take
>> >> > buffers, not qemufiles.
>> >> >
>> >> > To Maciej: please see whether above makes sense to you, and if you also
>> >> > agree please consider that with your VFIO work.
>> >> >
>> >> > Thanks,
>> >> >
>> >> >> 
>> >> >> >   []
>> >> >> >   [a][][][]
>> >> >> > 
>> >> >> >   [a][][][]
>> >> >> >   ^idle
>> >> 
>> 
