Peter Xu <pet...@redhat.com> writes:

> On Mon, Mar 04, 2024 at 09:04:51PM +0000, Daniel P. Berrangé wrote:
>> On Mon, Mar 04, 2024 at 05:15:05PM -0300, Fabiano Rosas wrote:
>> > Peter Xu <pet...@redhat.com> writes:
>> > 
>> > > On Mon, Mar 04, 2024 at 08:53:24PM +0800, Peter Xu wrote:
>> > >> On Mon, Mar 04, 2024 at 12:42:25PM +0000, Daniel P. Berrangé wrote:
>> > >> > On Mon, Mar 04, 2024 at 08:35:36PM +0800, Peter Xu wrote:
>> > >> > > Fabiano,
>> > >> > > 
>> > >> > > On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
>> > >> > > > => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
>> > >> > > 
>> > >> > > I'm curious how long the final fdatasync() normally takes for you
>> > >> > > when you run this test.
>> > 
>> > I measured and it takes ~4s for the live migration and ~2s for the
>> > non-live. I didn't notice this before because the VM goes into
>> > postmigrate, so it's paused anyway.
>
> In my case it took tens of seconds at least, possibly minutes; I didn't
> measure it.
>
> I could have dirtied harder, or I just had a slower disk.  IIUC the worst
> case is all cache dirty (not yet written back by the kernel), say 100GB.
> Assuming a disk bandwidth of 1GB/s (that's what my test machine's hard
> drive gets with dd in 1M chunks on a 10GB file, even without a sync..),
> that is about 100s, so IIUC it could take 1min or more in reality.
>
>> > 
>> > >> > > 
>> > >> > > I finally got a relatively large system today and gave it a quick shot
>> > >> > > with a 128G (100G busy dirty) mapped-ram snapshot over 8 multifd
>> > >> > > channels.  The migration save/load works fine, so I don't think there's
>> > >> > > anything wrong with the patchset. However, when save completes (I need
>> > >> > > to stop the workload, as my disk isn't fast enough I guess..) I always
>> > >> > > hit a super long hang of QEMU in fdatasync() on XFS, during which the
>> > >> > > main thread is in UNINTERRUPTIBLE state.
>> > >> > 
>> > >> > That isn't very surprising. If you don't have O_DIRECT enabled, then
>> > >> > all that disk I/O from the migrate is going to be in RAM, and thus the
>> > >> > fdatasync() is likely to trigger writing out a lot of data.
>> > >> > 
>> > >> > Blocking the main QEMU thread though is pretty unhelpful. That suggests
>> > >> > the data sync needs to be moved to a non-main thread.
>> > >> 
>> > >> Perhaps migration thread itself can also be a candidate, then.
>> > >> 
>> > >> > 
>> > >> > With O_DIRECT meanwhile there should be essentially no hit from fdatasync.
>> > >> 
>> > >> From a user's POV, the update to COMPLETED status feels like a natural
>> > >> marker that the flush is done.  If that makes sense, maybe we can do the
>> > >> sync before setting COMPLETED.
>> > 
>> > At the migration completion I believe the multifd threads will have
>> > already cleaned up and dropped the reference to the channel, it might be
>> > too late then.
>> > 
>> > In the multifd threads, we'll be wasting (like we are today) the extra
>> > syscalls after the first sync succeeds.
>> > 
>> > >> 
>> > >> No matter which thread does that sync, it's still a pity that it'll go
>> > >> into UNINTERRUPTIBLE during fdatasync(); whoever wants to e.g. attach a
>> > >> gdb to it to have a look will also hang.
>> > >
>> > > Or... would it be nicer if we got rid of the fdatasync() and left that to
>> > > the upper layers?  QEMU already supported file: migration and it never
>> > > managed cache behavior; thinking about it, this smells like something
>> > > that shouldn't be done in QEMU, and mapped-ram is nothing special to me
>> > > in this regard.
>> > >
>> > > The user should be able to control that either manually (sync), or
>> > > Libvirt can do it after QEMU quits; after all, Libvirt holds the fd
>> > > itself?  That would get rid of the above UNINTERRUPTIBLE /
>> > > un-debuggable period in QEMU.  Another side benefit: rather than
>> > > holding all of QEMU's resources (especially guest RAM) while waiting
>> > > for a super slow disk flush, Libvirt / the upper layer can do the flush
>> > > separately after releasing all the QEMU resources first.
>> > 
>> > I like the idea of QEMU having a self-contained
>> > implementation. Especially since we'll add O_DIRECT support, which is
>> > already quite heavy-handed if we're talking about managing cache
>> > behavior.
>
> O_DIRECT is optionally selected by the user by setting the new parameter
> first, so the user is still in full control - it's still user's decision on
> how cache should be managed, even if QEMU needs explicit changes to support
> and expose the new parameter.
>
> For fdatasync(), I think it's slightly different in that it doesn't require
> anything implemented in QEMU, as the snapshot is always in the form of a
> file, and file is pretty common concept which well supports sync semantics
> separately.  Instead of providing yet another parameter to control it, we
> can just avoid that datasync.
>
> Besides the reasons I already described above, I think it's also legal
> for a user to temporarily flush a VM to disk (in paused state), run some
> RAM-intensive loads (which can immediately make use of the guest's RAM
> once it is freed, but may _not_ always require a page cache flush), then
> relaunch the VM.  In that case keeping some cache around might already
> help to speed up relaunching by avoiding unnecessary swap-ins/swap-outs.
>
>> > 
>> > However, it's not trivial to find the right place to add the sync.
>> > Wherever we put it there will be some implications, such as ensuring the
>> > sync works even after migration failure, avoiding concurrent cleanup,
>> > etc.
>> > 
>> > In any case, I don't think it's correct to have the sync at
>> > qio_channel_close(), now that we've seen it might block for a long
>> > time. We could at the very least have a qio_channel_flush()[1] which the
>> > QIOChannelFile implements with fdatasync(). Then the clients can choose
>> > when to sync.
>> 
>> Yes, I agree with de-coupling it.
>
> Yes, that decoupling makes sense to me.  That definitely clears up some of
> my previous confusion.
>
> The follow-up question is whether we should require a qio_channel_flush()
> by default anywhere around the end of migration for mapped-ram; there I
> lean towards removing it completely.  In any case, considering how long it
> could hang QEMU (possibly for minutes), we may want to change that
> behavior for 9.0 if possible.

Ok, I'll remove it for 9.0 then. And I guess I'll also remove the flush
completely since there are no other users except for migration.
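
For reference, here is a minimal sketch of the decoupling idea quoted
above: close never syncs, and a separate, optional flush hook does the
fdatasync() so the caller chooses when (and whether) to pay for it. This
is NOT QEMU's QIOChannel API and not what will be merged; the names
Channel, file_channel_flush, channel_flush, channel_close and
demo_flush_seconds are all hypothetical, and the temporary file under
/tmp is just an assumption for the demo.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical, simplified channel; NOT QEMU's QIOChannel. */
typedef struct Channel Channel;
struct Channel {
    int fd;
    /* Optional hook: NULL means this channel type has nothing to flush. */
    int (*flush)(Channel *c);
};

/* File-backed flush: may block for a long time if the page cache is dirty. */
static int file_channel_flush(Channel *c)
{
    return fdatasync(c->fd);
}

/* Explicit flush; a no-op (returns 0) for channels without a hook. */
static int channel_flush(Channel *c)
{
    assert(c != NULL);
    return c->flush ? c->flush(c) : 0;
}

/* Close never syncs implicitly, so it cannot stall on writeback itself. */
static int channel_close(Channel *c)
{
    int ret = close(c->fd);
    c->fd = -1;
    return ret;
}

/*
 * Write 1 MiB of buffered data to a temporary file and return how many
 * seconds the explicit flush took, or -1.0 on error.
 */
static double demo_flush_seconds(void)
{
    char path[] = "/tmp/flush-sketch-XXXXXX";
    char buf[4096];
    struct timespec t0, t1;
    Channel c = { .fd = mkstemp(path), .flush = file_channel_flush };
    double secs = -1.0;

    if (c.fd < 0) {
        return -1.0;
    }
    memset(buf, 0xab, sizeof(buf));
    for (int i = 0; i < 256; i++) {
        if (write(c.fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            goto out;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (channel_flush(&c) == 0) {
        clock_gettime(CLOCK_MONOTONIC, &t1);
        secs = (double)(t1.tv_sec - t0.tv_sec) +
               (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    }
out:
    channel_close(&c);
    unlink(path);
    return secs;
}
```

Calling demo_flush_seconds() from a small main() and printing the result
is enough to see how long the flush takes on a given disk. A caller that
leaves the sync to an upper layer would simply skip channel_flush() and
go straight to channel_close().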
