On Fri, Mar 31, 2023 at 08:56:01AM +0100, Daniel P. Berrangé wrote:
> On Thu, Mar 30, 2023 at 06:01:51PM -0400, Peter Xu wrote:
> > On Thu, Mar 30, 2023 at 03:03:20PM -0300, Fabiano Rosas wrote:
> > > From: Nikolay Borisov <nbori...@suse.com>
> > > 
> > > Implement 'fixed-ram' feature. The core of the feature is to ensure
> > > that each RAM page of the migration stream has a specific offset in
> > > the resulting migration file. The reasons why we'd want such behavior
> > > are twofold:
> > > 
> > >  - When doing a 'fixed-ram' migration the resulting file will have a
> > >    bounded size, since pages which are dirtied multiple times will
> > >    always go to a fixed location in the file, rather than constantly
> > >    being appended to a sequential stream. This eliminates cases where
> > >    a VM with, say, 1G of RAM can result in a migration file that's
> > >    tens of GBs when the workload constantly redirties memory.
> > > 
> > >  - It paves the way to implement DIO-enabled save/restore of the
> > >    migration stream as the pages are ensured to be written at aligned
> > >    offsets.
> > > 
> > > The feature requires changing the stream format. First, a bitmap is
> > > introduced which tracks which pages have been written (i.e. dirtied)
> > > during migration; it is subsequently written to the resulting file,
> > > again at a fixed location for every RAMBlock. Zero pages are ignored
> > > as they'd read back as zero on the destination as well. With the
> > > changed format, the data would look like the following:
> > > 
> > > |name len|name|used_len|pc*|bitmap_size|pages_offset|bitmap|pages|
> > 
> > What happens with huge pages?  Would page size matter here?
> > 
> > I would assume it's fine if it uses a constant (small) page size, which
> > should match the granule at which QEMU tracks dirtying (IIUC the host
> > page size, not the guest's).
> > 
> > But I haven't given it further thought yet; maybe it would be
> > worthwhile in all cases to record the page size here to be explicit,
> > otherwise the meaning of the bitmap may not be clear (and then
> > bitmap_size would just be a field for sanity checking too).
> 
> I think recording the page sizes is an anti-feature in this case.
> 
> The migration format / state needs to reflect the guest ABI, but we
> need to be free to have a different backend config behind that on
> either side of the save/restore.
> 
> IOW, if I start a QEMU with 2 GB of RAM, I should be free to use
> small pages initially and after restore use 2 x 1 GB hugepages,
> or vice versa.
> 
> The important thing with the pages that are saved into the file
> is that they are a 1:1 mapping of guest RAM regions to file offsets.
> IOW, the 2 GB of guest RAM is always a contiguous 2 GB region
> in the file.
> 
> If the src VM used 1 GB pages, we would be writing a full 2 GB
> of data assuming both pages were dirty.
> 
> If the src VM used 4k pages, we would be writing some subset of
> the 2 GB of data, and the rest would be unwritten.
> 
> Either way, when reading back the data we can restore it into either
> 1 GB pages or 4k pages, because any places that were unwritten
> originally will read back as zeros.

I think the page size information is already there, because there's a
bitmap embedded in the format, at least in the current proposal, and the
bitmap can only be defined with some page size provided in one form or
another.

Here I agree the backend can change before/after a migration (live or
not).  The question, though, is whether the page size matters for the
snapshot layout itself, rather than for what backend the loaded QEMU
instance will use.
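
To illustrate what I mean (the names below are made up for the example,
not the actual patch code): both the bitmap length and the per-page file
offsets have to be derived from whatever page size the bitmap was built
with, so that page size is effectively baked into the layout:

  #include <stdint.h>

  static inline uint64_t fixed_ram_bitmap_size(uint64_t used_len,
                                               uint64_t page_size)
  {
      uint64_t nr_pages = used_len / page_size;  /* pages covered by bitmap */
      return (nr_pages + 7) / 8;                 /* one bit per page, bytes */
  }

  static inline uint64_t fixed_ram_page_offset(uint64_t pages_offset,
                                               uint64_t page_nr,
                                               uint64_t page_size)
  {
      /* 1:1 mapping: page N of the block always lands at the same offset */
      return pages_offset + page_nr * page_size;
  }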

> 
> > If postcopy might be an option, we'd want the page size to be the host
> > page size, because then looking up the bitmap will be straightforward
> > when deciding whether we should copy over the page (UFFDIO_COPY) or
> > fill it in with zeros (UFFDIO_ZEROPAGE).
> 
> This format is only intended for the case where we are migrating to
> a random-access medium, aka a file, because the fixed RAM mappings
> to disk mean that we need to seek back to the original location to
> re-write pages that get dirtied. It isn't suitable for a live
> migration stream, and thus postcopy is inherently out of scope.

Yes, I've commented also in the cover letter, but I can expand a bit.

I mean supporting postcopy only when loading, not when saving.

Saving to a file definitely cannot work with postcopy because there's no
dest QEMU running.

Loading from file, OTOH, can work together with postcopy.

Right now, AFAICT, the current approach is to precopy-load the whole
guest image from the proposed snapshot format (if I can call it just a
snapshot).

What I want to say is that we can consider supporting postcopy on
loading: we start an "empty" dest QEMU node, and when any page fault is
triggered we resolve it with userfault by looking the page up in the
snapshot file, rather than sending a request back to the source.  I
mention that because there would be two major benefits, which I touched
on quickly in reply to the cover letter, but I can also expand here:

  - Firstly, the snapshot format stores pages at linear offsets, which
    means that when we find a page missing we can look it up in the
    snapshot image in O(1) time.

  - Secondly, the page doesn't need to go over the wire, nor do we need
    to send a request to the src QEMU or anyone else.  All we need is to
    test the bit in the snapshot bitmap, then:

    - If the page was written, do UFFDIO_COPY to resolve the fault,
    - If it was not, do UFFDIO_ZEROPAGE (for non-hugetlb; hugetlb can
      use a fake UFFDIO_COPY of a zero page instead)

So this is a perfect testing ground for using postcopy in a very efficient
way against a file snapshot.
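
To make it a bit more concrete, a very rough sketch of the fault
resolution path could look like the below (RAMBlockSnapshot and the
helper names are made up for the example; only the uffd ioctls are the
real interface):

  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <stdint.h>
  #include <stdbool.h>

  typedef struct RAMBlockSnapshot {
      int            file_fd;       /* the snapshot file                    */
      uint64_t       pages_offset;  /* where this block's pages start       */
      unsigned long *bitmap;        /* 1 bit per page: written during save? */
      uint64_t       page_size;     /* page size the bitmap was built with  */
      void          *host_addr;     /* where the block is mapped in dest    */
  } RAMBlockSnapshot;

  static bool snapshot_page_written(RAMBlockSnapshot *rb, uint64_t page_nr)
  {
      return rb->bitmap[page_nr / (8 * sizeof(unsigned long))] &
             (1UL << (page_nr % (8 * sizeof(unsigned long))));
  }

  /* Resolve one fault at 'fault_addr' inside the block. */
  static int snapshot_resolve_fault(int uffd, RAMBlockSnapshot *rb,
                                    void *fault_addr, void *scratch_page)
  {
      uint64_t page_nr = ((uintptr_t)fault_addr - (uintptr_t)rb->host_addr)
                         / rb->page_size;
      uint64_t dst = (uintptr_t)rb->host_addr + page_nr * rb->page_size;

      if (snapshot_page_written(rb, page_nr)) {
          /* Written during save: O(1) lookup at a linear file offset. */
          if (pread(rb->file_fd, scratch_page, rb->page_size,
                    rb->pages_offset + page_nr * rb->page_size) < 0) {
              return -1;
          }
          struct uffdio_copy copy = {
              .dst = dst,
              .src = (uintptr_t)scratch_page,
              .len = rb->page_size,
          };
          return ioctl(uffd, UFFDIO_COPY, &copy);
      }

      /* Not in the snapshot: it was zero at save time. */
      struct uffdio_zeropage zp = {
          .range = { .start = dst, .len = rb->page_size },
      };
      return ioctl(uffd, UFFDIO_ZEROPAGE, &zp);
  }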

Thanks,

-- 
Peter Xu

