On Thu, Aug 03, 2023 at 11:45:38AM -0300, Fabiano Rosas wrote:
> Peter Xu <pet...@redhat.com> writes:
> 
> > On Wed, Aug 02, 2023 at 04:58:38PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <pet...@redhat.com> writes:
> >> 
> >> > On Wed, Aug 02, 2023 at 11:36:43AM -0300, Fabiano Rosas wrote:
> >> >> This function currently has a straightforward part, which is waiting
> >> >> for the thread to join, and a complicated part, which is doing a
> >> >> qemu_file_shutdown() on the return path file.
> >> >>
> >> >> The shutdown is tricky because all calls to qemu_file_shutdown() set
> >> >> f->last_error to -EIO, which means we can never know whether an error
> >> >> is an actual error or whether we cleanly shut down the file earlier.
> >> >>
> >> >> This is particularly bothersome for postcopy because it would send
> >> >> the return path thread into the retry routine, which would wait on
> >> >> postcopy_pause_rp_sem and consequently block the main thread. We
> >> >> haven't had reports of this, so I must presume we never reach here
> >> >> with postcopy.
> >> >>
> >> >> The shutdown call is also racy: since it doesn't take the
> >> >> qemu_file_lock, it could NULL-dereference if the return path thread
> >> >> happens to be in the middle of the critical region at
> >> >> migration_release_dst_files().
> >> >
> >> > After you rework the thread model on resume, shall we move
> >> > migration_release_dst_files() into the migration thread to be after
> >> > the pthread_join()? I assume then we don't even need a mutex to
> >> > protect it?
> >> >
> >> 
> >> I just need to figure out if it's ok to move the postcopy_qemufile_src
> >> cleanup along. No idea why it is there in the first place. I see you
> >> moved it from postcopy_pause and we're about to move it back to the
> >> exact same place =D
> > 
> > It was there because the old postcopy-preempt was sending data via
> > postcopy_qemufile_src from the migration thread, while postcopy_pause
> > is also in the migration thread context.
> > 
> > Then we had 9358982744 ("migration: Send requested page directly in
> > rp-return thread"), where we moved that "send page" operation into the
> > return path thread to reduce latencies. After that move the file handle
> > can be accessed by more than one thread, so I changed it to always
> > operate in the return path thread; then no race should happen.
> 
> Thanks for the context.
> 
> > With your change, the return path will vanish before the migration
> > thread accesses it later (so, as mentioned above, it must be after
> > pthread_join() succeeded), then I assume it'll also be fine to have it
> > back in the migration thread.
> > 
> > Or perhaps just take the file lock?
> 
> There's also migrate_fd_cleanup and migrate_fd_cancel that can touch
> these files. We might need to lock anyway, let's see.

The cancel path shouldn't clear the QEMUFile*, so I assume it's fine.
That's based on the assumption that qemu_file_shutdown() is actually
thread safe (say, the shutdown() syscall is thread-safe for sockets).
But yeah, that depends on some extra knowledge, so as you said below it
would be good to just always take the lock, because that shouldn't hurt.

> 
> In general I'd like to drop all of these "ok not to lock, because...",
> it's too easy for code to change and the assumptions to stop being
> true. IMHO it's not worth it to gain performance by not taking a lock
> when the data is still shared and there's nothing stopping someone in
> the future from accessing it concurrently.

Yes, I agree with you. If you want, we can move that from the rp thread
to the migration thread, but with the lock added altogether. Then the rp
thread can always be guaranteed to have the file there, which should
still be helpful; sometimes even with a lock we still need to handle the
QEMUFile* == NULL case, but then the rp thread wouldn't need to care
about it.
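
Roughly what I have in mind (completely untested, only to illustrate the
idea; it assumes we keep protecting these pointers with the existing
qemu_file_lock):

    /* Migration thread: only clean up after the rp thread is gone */
    qemu_thread_join(&ms->rp_state.rp_thread);
    ms->rp_state.rp_thread_created = false;
    migration_release_dst_files(ms);

    /*
     * Anyone else that still pokes at the file (e.g. to kick it with a
     * shutdown) takes the lock and handles the NULL case:
     */
    WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
        if (ms->rp_state.from_dst_file) {
            qemu_file_shutdown(ms->rp_state.from_dst_file);
        }
    }
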
> 
> >> 
> >> >> 
> >> >> Move this more complicated part of the code to a separate routine so
> >> >> we can wait on the thread without all of this baggage.
> >> >
> >> > I think you mentioned "some nuance" on having mark_source_rp_bad() in
> >> > await_return_path_close_on_source(). I remember I tried to look into
> >> > that "nuance" too a long time ago, but I forgot what it was. It would
> >> > be great if you could share some details.
> >> >
> >> 
> >> Well, mark_source_rp_bad() at await_return_path_close_on_source() is
> >> basically useless:
> >> 
> >> - We only call mark_source_rp_bad() if s->to_dst_file has an error, and
> >>   migration_completion() already checks that condition and fails the
> >>   migration anyway.
> >> 
> >> - If to_dst_file has an error, chances are the destination already did
> >>   its cleanup by this point, so from_dst_file would already have an
> >>   errno due to that. At qemu_fill_buffer(), the len == 0 case basically
> >>   means "the other end finished cleanly". We still set -EIO in that
> >>   case, I don't know why. Possibly because not all backends will have
> >>   the same semantics for len == 0.
> 
> > I don't know either, but I checked it's from a555b8092a ("qemu-file:
> > Don't do IO after shutdown"). Maybe there's a better way to do it so we
> > could identify the difference, but yes, that can be for later.
> 
> That is not the -EIO I'm talking about, but I'm glad you mentioned it,
> because I just noticed we might need to revert ac7d25b816 ("qemu-file:
> remove shutdown member").
> 
> It did a purely logical change, which is to drop the f->shutdown flag,
> but that has the side effect that now we cannot differentiate an orderly
> shutdown from an IO error.
> 
> I get that perhaps ideally all of the code would be resilient enough to
> handle a shutdown in the same way as an IO error, but I'm not sure we
> are prepared for that.
> 
> Now, the actual -EIO I was mentioning is this one in qemu-file.c:
> 
> static ssize_t coroutine_mixed_fn qemu_fill_buffer(QEMUFile *f)
> {
>     ...
>     ...
>     do {
>         len = qio_channel_read(f->ioc,
>                                (char *)f->buf + pending,
>                                IO_BUF_SIZE - pending,
>                                &local_error);
>         if (len == QIO_CHANNEL_ERR_BLOCK) {
>             if (qemu_in_coroutine()) {
>                 qio_channel_yield(f->ioc, G_IO_IN);
>             } else {
>                 qio_channel_wait(f->ioc, G_IO_IN);
>             }
>         } else if (len < 0) {
>             len = -EIO;
>         }
>     } while (len == QIO_CHANNEL_ERR_BLOCK);
> 
>     if (len > 0) {
>         f->buf_size += len;
>         f->total_transferred += len;
>     } else if (len == 0) {
>         qemu_file_set_error_obj(f, -EIO, local_error);    <---- THIS
>     } else {
>         qemu_file_set_error_obj(f, len, local_error);
>     }
> 
>     return len;
> }
> 
> The recvmsg manual says:
> 
> "When a stream socket peer has performed an orderly shutdown, the return
> value will be 0 (the traditional "end-of-file" return)."
> 
> So both issues combined put us in a situation where there's never an
> "orderly shutdown"; everything is an error. I wouldn't say that's wrong,
> but it's not clear to me whether we consciously made that design
> decision.

Here, reaching qemu_fill_buffer() normally means we already expect
something to arrive (it can be a read where we're waiting for a header
byte to come, for example), so _maybe_ it still makes sense to error out:
even if the other side does close() cleanly, it still doesn't do it at
the right time. But I also agree this should not be treated as a failure
of the qemufile layer (which is the transport only); instead it should be
the upper layer (the migration protocol) that sets this error. I guess
these two layers are just more tightly coupled than expected.

The qemufile layer is not well designed, IMHO, starting from
qemu_get_byte() directly returning the byte without being able to fail
at all. IOW, one needs to call qemu_get_byte() and then
qemu_file_get_error() to know whether the get_byte() was valid.

That goes a bit too far, because qemufile is another level of issue. For
now, if you have a solid idea on reviving f->shutdown and making
"migration is cancelled" different from any other kind of error, feel
free to go ahead; that looks reasonable to me so far.
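
To be explicit about what "reviving" could look like: maybe something as
simple as the below (untested sketch; qemu_file_is_shutdown() is just a
name I made up):

    struct QEMUFile {
        ...
        int last_error;
        Error *last_error_obj;
        /* re-added: only ever set by qemu_file_shutdown() */
        bool shutdown;
    };

    int qemu_file_shutdown(QEMUFile *f)
    {
        int ret = 0;

        f->shutdown = true;

        /* keep setting the error first, so existing callers still bail out */
        if (!f->last_error) {
            qemu_file_set_error(f, -EIO);
        }

        if (!qio_channel_has_feature(f->ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)) {
            return -ENOSYS;
        }
        if (qio_channel_shutdown(f->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL) < 0) {
            ret = -EIO;
        }

        return ret;
    }

    /* ...and let the few places that care tell the two cases apart */
    bool qemu_file_is_shutdown(QEMUFile *f)
    {
        return f->shutdown;
    }

Then a cancelled migration could check the flag instead of guessing from
-EIO, without changing how every other error path behaves. But that's
your call.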

Thanks,

--
Peter Xu