On Thu, 19 Sept 2024 at 12:59, Peter Xu <pet...@redhat.com> wrote:
>
> On Thu, Sep 19, 2024 at 10:08:25AM +0100, Peter Maydell wrote:
> > Thanks for looking at the issues with the migration tests.
> > This run went through first time without my needing to retry any
> > jobs, so fingers crossed that we have at least improved the reliability.
> > (I have a feeling there's still something funny with the k8s runners,
> > but that's not migration-test specific, it's just that test tends
> > to be the longest running and so most likely to be affected.)
>
> Kudos all go to Fabiano for debugging the hard problem.
>
> And yes, please let either of us know if it fails again, we can either keep
> looking, or still can disable it when necessary (if it takes long to debug).
On the subject of potential races in the migration code, there are a
couple of outstanding Coverity issues that might be worth looking at.
If they're false positives, let me know and I can reclassify them in
Coverity.

CID 1527402: in migrate_fd_cleanup(), Coverity thinks there's a race
because we read s->to_dst_file in the "if (s->to_dst_file)" check
without holding the qemu_file_lock. This might be a false positive,
because the race Coverity identifies only happens if two threads call
migrate_fd_cleanup() at the same time, which is probably not permitted.
(But OTOH taking a mutex gets you any necessary memory barriers
for free...)

CID 1527413: in postcopy_pause_incoming() we read
mis->postcopy_qemufile_dst without holding the
postcopy_prio_thread_mutex which we use to protect the write to that
field, so Coverity thinks there's a race if two threads call this
function at once.

(The only other migration Coverity issue is CID 1560071, which is the
"better to use pstrcpy()" not-really-a-bug we discussed in another
thread.)

thanks
-- PMM