Re: [PULL v2 0/6] Migration 20240917 patches

Fabiano Rosas Thu, 19 Sep 2024 07:25:15 -0700

Peter Maydell <peter.mayd...@linaro.org> writes:

> On Thu, 19 Sept 2024 at 12:59, Peter Xu <pet...@redhat.com> wrote:
>>
>> On Thu, Sep 19, 2024 at 10:08:25AM +0100, Peter Maydell wrote:
>> > Thanks for looking at the issues with the migration tests.
>> > This run went through first time without my needing to retry any
>> > jobs, so fingers crossed that we have at least improved the reliability.
>> > (I have a feeling there's still something funny with the k8s runners,
>> > but that's not migration-test specific, it's just that test tends
>> > to be the longest running and so most likely to be affected.)
>>
>> Kudos all go to Fabiano for debugging the hard problem.
>>
>> And yes, please let either of us know if it fails again, we can either keep
>> looking, or still can disable it when necessary (if it takes long to debug).
>
> On the subject of potential races in the migration code,
> there's a couple of outstanding Coverity issues that might
> be worth looking at. If they're false-positives let me know
> and I can reclassify them in Coverity.
>
> CID 1527402: In migrate_fd_cleanup() Coverity thinks there's
> a race because we read s->to_dst_file in the "if (s->to_dst_file)"
> check without holding the qemu_file_lock. This might be a
> false-positive because the race Coverity identifies happens
> if two threads both call migrate_fd_cleanup() at the same
> time, which is probably not permitted. (But OTOH taking a
> mutex gets you for free any necessary memory barriers...)


Yes, we shouldn't rely on mental gymnastics to prove that there's no
concurrent access.

@peterx, that RH bug you showed me could very well be caused by this
race, except that I don't see how fd_cleanup could race with
itself. Just having the lock would probably save us time even thinking
about it.

>
> CID 1527413: In postcopy_pause_incoming() we read
> mis->postcopy_qemufile_dst without holding the
> postcopy_prio_thread_mutex which we use to protect the write
> to that field, so Coverity thinks there's a race if two
> threads call this function at once.

At first sight, it seems like a real problem. We did a good pass on
these races on the source side, but the destination side hasn't been
investigated yet.

Unfortunately, these QEMUFile races are not trivial to fix due to
several design pain points, such as:

- the QEMUFile pointer validity being sometimes used to imply no error
  has happened before;

- the various shutdown() calls that serve both as a way to kick a read()
  that's stuck, but also to cause some other part of the code to realise
  there has been an error (due to the point above);

- the yank feature which has weird semantics regarding whether it
  operates on an iochannel or qemufile;

- migrate_fd_cancel() that _can_ run concurrently with anything else;

- the need to ensure the other end of migration also reacts to
  error/cancel on this side;

>
> (The only other migration Coverity issue is CID 1560071,
> which is the "better to use pstrcpy()" not-really-a-bug
> we discussed in another thread.)
>
> thanks
> -- PMM

Re: [PULL v2 0/6] Migration 20240917 patches

Reply via email to