* Peter Xu (pet...@redhat.com) wrote: > On Wed, Jul 21, 2021 at 10:55:00AM +0100, Dr. David Alan Gilbert wrote: > > * Peter Xu (pet...@redhat.com) wrote: > > > We have a logic in await_return_path_close_on_source() that we will > > > explicitly > > > shutdown the socket when migration encounters errors. However it could > > > be racy > > > because from_dst_file could have been reset right after checking it but > > > before > > > passing it to qemu_file_shutdown() by the rp_thread. > > > > > > Fix it by shutdown() on the src file instead. Since they must be a pair > > > of > > > qemu files, shutdown on either of them will work the same. > > > > > > Since at it, drop the check for from_dst_file directly, which makes the > > > behavior even more predictable. > > > > So while the existing code maybe racy, I'm not sure that this change > > keeps the semantics; the channel may well have dup()'d the fd's for the > > two directions, and I'm not convinced that a shutdown() on one will > > necessarily impact the other; and if the shutdown doesn't happen the > > rp_thread might not exit, and we might block on the koin. > > dup() seems fine as long as the backend file/socket is the same; but I get the > point here, e.g., potentially from/to channels can indeed be different ones. > > > > > Why don't we solve this a different way - how about we move the: > > ms->rp_state.from_dst_file = NULL; > > qemu_fclose(rp); > > > > out of the source_return_path_thread and put it in > > await_return_path_close_on_source, immediately after the join? > > Then we *know* that the the rp thread isn't messing with it. > > Yes that looks working for this special case of when rp_thread quits. > > It's just that it'll make things a bit more complicated, previously > from_dst_file is only reset in rp_thread (for either paused or completed > migration), now we handle "migration completes" a bit different. > > This also reminded me that maybe we can also use the qemu_file_lock mutex as > we > use to try protect access to to_dst_file, because I think from_dst_file is > potentially racy in the same way. Even if we moved the reset to migration > thread, we still have e.g. migrate_fd_cancel() calling: > > if (s->rp_state.from_dst_file) { > qemu_file_shutdown(s->rp_state.from_dst_file); > } > > I think that is also racy too when running in the main thread, as either the > migration thread or rp_thread could have reset it right after the check. > > So.. maybe I start to use the qemu_file_lock to cover from_dst_file cases too? > Then I'll cover at leasd both migrate_fd_cancel() and this case.
Yes, I think that's best; but try to do as little as possible with the lock held. Dave > Thanks, > > -- > Peter Xu > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK