On 4/3/2024 5:39, Peter Xu wrote:>>>>> >>>>> I'm not 100% sure such thing (no matter here or moved into it, which >>>>> does look cleaner) would work for us. >>>>> >>>>> The problem is I still don't yet see an ordering restricted on top of >>>>> (1) >>>>> accept() happens, and (2) receive LISTEN cmd here. What happens if >>>>> the >>>>> accept() request is not yet received when reaching LISTEN? Or is it >>>>> always guaranteed the accept(fd) will always be polled here? >>>>> >>>>> For example, the source QEMU (no matter pre-7.2 or later) will always >>>>> setup the preempt channel asynchrounously, then IIUC it can connect() >>>>> after sending the whole chunk of packed data which should include this >>>>> LISTEN. I think it means it's not guaranteed this will 100% work, but >>>>> maybe further reduce the possibility of the race. >>>> >>>> I think the following code: >>>> >>>> postcopy_start() -> >>>> postcopy_preempt_establish_channel() -> >>>> qemu_sem_wait(&s->postcopy_qemufile_src_sem); >>>> >>>> can guarantee that the connect() syscall is successful so the destination >>>> side >>>> receives the connect() request before it loads the LISTEN command, >>>> otherwise >>>> it won't post the sem: >>>> >>>> postcopy_preempt_send_channel_new() -> >>>> postcopy_preempt_send_channel_done() -> >>>> qemu_sem_post(&s->postcopy_qemufile_src_sem); >>>> >>> >>> Yes. But as mentioned in another thread, connect() and accept() are async. >>> So in theory accept() could still come later after the LISTEN command. >> >> IIUC accept() is the callback and will be triggered by the connect() event. >> >> The reason accept() is not called in the destination is the main loop doesn't >> get a chance to handle other events (connect()), so if we can guarantee >> connect() is before LISTEN, then when handling the LISTEN cmd, we yield to >> the >> main loop and the connect() event is there, then we can handle it by calling >> the >> accept(): >> >> qio_net_listener_channel_func >> qio_channel_socket_accept >> qemu_accept >> accept >> >> so it seems the case accept() comes alter after LISTEN is in our expectation? > > The major thing uncertain to me is "accept() will return with a valid fd" > on dest host is not guaranteed to order against anything. > > For example, I won't be surprised if a kernel implementation provides an > async model of "accept()" syscall, so that even if the other side returned > with "connect()", the "accept()" can still fetch nothing if the async model > will need a delay for the new channel to be finally delivered to the > "accept()" thread context. It just sounds like tricky to rely on such > thing. > > What I proposed below shouldn't rely on any kernel details on how accept() > could be implemented, it simply waits for the fd to be created before doing > anything else (including creating the preempt thread and process packed > data).
Thanks for the detailed explanation! > >> >>> >>>>> >>>>> One right fix that I can think of is moving the sem_wait(&done) into >>>>> the main thread too, so we wait for the sem _before_ reading the >>>>> packed data, so there's no chance of fault. However I don't think >>>>> sem_wait() will be smart enough to yield when in a coroutine.. In the >>>>> long term run I think we should really make migration loadvm to do >>>>> work in the thread rather than the main thread. I think it means we >>>>> have one more example to be listed in this todo so that's preferred.. >>>>> >>>>> https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration >>>>> _destination >>>>> >>>>> I attached such draft patch below, but I'm not sure it'll work. Let >>>>> me know how both of you think about it. >>>> >>>> Sadly it doesn't work, there is an unknown segfault. > > Could you paste the stack of the segfault ("(gdb) thread apply all bt")? > Or help to figure out what is wrong? > > Since I cannot reproduce myself, I may need your help debugging this. If > you agree with what I said above and agree on such fix, please also feel > free to go ahead and fix the segfault. We should change the following line from while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) { to while (qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) { After that fix, test passed and no segfault. Given that the test shows a yield to the main loop won't introduce much overhead (<1ms), how about first yield unconditionally, then we enter the while loop to wait for several ms and yield periodically?