On Wed, Nov 01, 2023 at 12:24:22PM -0400, Peter Xu wrote:
> On Wed, Nov 01, 2023 at 03:52:18PM +0000, Daniel P. Berrangé wrote:
> > On Wed, Nov 01, 2023 at 11:23:37AM -0400, Peter Xu wrote:
> > > On Wed, Oct 25, 2023 at 10:39:58AM +0100, Daniel P. Berrangé wrote:
> > > > If I'm reading the code correctly the new format has some padding
> > > > such that each "ramblock pages" region starts on a 1 MB boundary.
> > > > 
> > > > eg so we get:
> > > > 
> > > >  --------------------------------
> > > >  | ramblock 1 header            |
> > > >  --------------------------------
> > > >  | ramblock 1 fixed-ram header  |
> > > >  --------------------------------
> > > >  | padding to next 1MB boundary |
> > > >  | ...                          |
> > > >  --------------------------------
> > > >  | ramblock 1 pages             |
> > > >  | ...                          |
> > > >  --------------------------------
> > > >  | ramblock 2 header            |
> > > >  --------------------------------
> > > >  | ramblock 2 fixed-ram header  |
> > > >  --------------------------------
> > > >  | padding to next 1MB boundary |
> > > >  | ...                          |
> > > >  --------------------------------
> > > >  | ramblock 2 pages             |
> > > >  | ...                          |
> > > >  --------------------------------
> > > >  | ...                          |
> > > >  --------------------------------
> > > >  | RAM_SAVE_FLAG_EOS            |
> > > >  --------------------------------
> > > >  | ...                          |
> > > >  --------------------------------
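
For illustration only (a sketch, not code from the series; the header and
ramblock sizes below are made up), the padding in the diagram amounts to
rounding the write offset up to the next 1 MB boundary after each
ramblock's headers:

    #include <inttypes.h>
    #include <stdio.h>

    #define REGION_ALIGN (1024 * 1024)          /* assumed: the 1 MB boundary above */

    /* round an offset up to the next multiple of align (power of two) */
    static uint64_t align_up(uint64_t offset, uint64_t align)
    {
        return (offset + align - 1) & ~(align - 1);
    }

    int main(void)
    {
        uint64_t offset = 0;
        uint64_t hdr_len = 4096;                   /* made-up combined header size */
        uint64_t ramblock_len = 256 * 1024 * 1024; /* made-up "ramblock 1" size */

        offset += hdr_len;                         /* ramblock header + fixed-ram header */
        offset = align_up(offset, REGION_ALIGN);   /* padding to next 1 MB boundary */
        printf("ramblock 1 pages start at %" PRIu64 "\n", offset);

        offset += ramblock_len;                    /* the pages themselves */
        printf("ramblock 2 header starts at %" PRIu64 "\n", offset);
        return 0;
    }
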
> > > 
> > > When reading the series, I was thinking about one more thing: whether
> > > fixed-ram would like to leverage compression in the future?
> > 
> > Libvirt currently supports compression of saved state images, so yes,
> > I think compression is a desirable feature.
> 
> Ah, yeah this will work too; one more copy as you mentioned below, but I
> assume that's not a major concern so far (or... will it be?).
> 
> > 
> > Due to libvirt's architecture it does compression on the stream, and
> > the final step in the sequence bounce-buffers the data into the suitably
> > aligned memory required for O_DIRECT.
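
As a rough sketch of that bounce-buffering step (not libvirt's actual code;
the 4 KiB alignment and the function name are assumptions), the compressor
output gets copied once more into suitably aligned memory before the
O_DIRECT write:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define DIO_ALIGN 4096   /* assumed O_DIRECT alignment requirement */

    /* Copy compressor output into an aligned bounce buffer and write it out.
     * 'fd' is assumed to be open with O_DIRECT and 'off' already aligned. */
    static ssize_t write_compressed_chunk(int fd, const void *zbuf, size_t zlen,
                                          off_t off)
    {
        size_t padded = (zlen + DIO_ALIGN - 1) & ~(size_t)(DIO_ALIGN - 1);
        void *bounce;
        ssize_t ret;

        if (posix_memalign(&bounce, DIO_ALIGN, padded) != 0) {
            return -1;
        }
        memset(bounce, 0, padded);
        memcpy(bounce, zbuf, zlen);     /* this is the extra copy mentioned above */

        ret = pwrite(fd, bounce, padded, off);
        free(bounce);
        return ret;
    }
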
> > 
> > > To be exact, not really fixed-ram as a feature, but non-live snapshot as
> > > the real use case.  More below.
> > > 
> > > I just noticed that compression can be a great feature to have for such a
> > > use case, where the image size can be shrunk further, noticeably.  In this
> > > case, the speed of savevm may not matter as much as the image size (as
> > > compression can take some more cpu overhead): the VM will be stopped anyway.
> > > 
> > > With current fixed-ram layout, we probably can't have compression due to
> > > two reasons:
> > > 
> > >   - We offset each page with page alignment in the final image, and that's
> > >     where fixed-ram as the term comes from; more fundamentally,
> > > 
> > >   - We allow the src VM to run (with auto-pause dropped as planned; even
> > >     if we plan to guarantee it won't run, QEMU still can't take that as
> > >     guaranteed), so we need page granularity when storing pages, and then
> > >     it's hard to know the size of each page after compression.
> > > 
> > > With the guarantee that the VM is stopped, I think compression should be
> > > easy to get?  Because once we drop the page-granule requirement, we can
> > > compress in chunks, storing the compressed binary in the image, with each
> > > page written only once.  We may lose O_DIRECT, but we can consider
> > > hardware accelerators for [de]compression if necessary.
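
As a hedged illustration of that chunked idea (the on-disk framing is
assumed and zlib is chosen arbitrarily; none of this is from the series),
each chunk could be stored as its compressed length followed by the
compressed bytes, since the length is only known after compressing:

    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>

    /* Append "compressed length + compressed bytes" for one chunk of guest pages.
     * 'out' must have room for sizeof(uint64_t) + compressBound(chunk_len) bytes.
     * Returns the number of bytes written to 'out', or -1 on error. */
    static long store_compressed_chunk(uint8_t *out, const uint8_t *pages,
                                       size_t chunk_len)
    {
        uLongf zlen = compressBound(chunk_len);
        uint64_t hdr;

        if (compress2(out + sizeof(uint64_t), &zlen, pages, chunk_len,
                      Z_BEST_SPEED) != Z_OK) {
            return -1;
        }
        hdr = zlen;                       /* size only known after compressing */
        memcpy(out, &hdr, sizeof(hdr));
        return (long)(sizeof(hdr) + zlen);
    }
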
> > 
> > We can keep O_DIRECT if we buffer in QEMU between compressor output
> > and disk I/O, which is what libvirt does. QEMU would still be saving
> > at least one extra copy compared to the libvirt approach.
> > 
> > 
> > The fixed RAM layout was primarily intended to allow easy parallel
> > I/O without needing any synchronization between threads. In theory
> > fixed RAM layout even allows you to do something fun like
> > 
> >    mapped_addr = mmap(NULL, ramblocksize, PROT_READ, MAP_SHARED,
> >                       save_state_fd, offset);
> >    memcpy(ramblock, mapped_addr, ramblocksize);
> >    munmap(mapped_addr, ramblocksize);
> > 
> > which would still be buffered I/O without O_DIRECT, but might be better
> > than many write() calls as you avoid thousands of syscalls.
> > 
> > Anyway back to compression, I think if you wanted to allow for parallel
> > I/O, then it would require a different "fixed ram" approach, where each
> > multifd thread requested use of a 64 MB region, compressed until that
> > was full, then asked for another 64 MB region, repeat until done.
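
A minimal sketch of that scheme (names and structure assumed, not proposed
code): each save thread claims the next free file region with an atomic
counter and compresses into it until full, so no locking is needed on the
write path:

    #include <stdatomic.h>
    #include <stdint.h>

    #define REGION_SIZE (64ULL * 1024 * 1024)   /* the 64 MB from the example above */

    /* next unclaimed offset in the output file, shared by all save threads */
    static _Atomic uint64_t next_region_offset;

    /* Each thread claims a region, compresses into it until it is full, then
     * claims another, so no two threads ever write to overlapping offsets. */
    static uint64_t claim_region(void)
    {
        return atomic_fetch_add(&next_region_offset, REGION_SIZE);
    }
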
> 
> Right, we need a constant buffer per-thread if so.
> 
> > 
> > The reason we didn't want to break up the file format into regions like
> > this is because we wanted to allow for flexibility in configuring the
> > save / restore. eg you might save using 7 threads, but restore using
> > 3 threads. We didn't want the on-disk layout to have any structural
> > artifact that was related to the number of threads saving data, as that
> > would make restore less efficient. eg 2 threads would process 2 chunks
> > each and 1 thread would process 3 chunks, which is unbalanced.
> 
> I didn't follow why the image needs to contain thread number
> information.

It doesn't contain thread number information directly, but it can
be implicit in the data layout.

If you want parallel I/O, each thread has to know it is the only
one reading/writing to a particular region of the file. With the
fixed RAM layout in this series, the file offset directly maps
to the memory region. So if a thread has been given a guest page
to save it knows it will be the only thing writing to the file
at that offset. There is no relationship at all between the
number of threads and the file layout.
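
A minimal sketch of that property (hypothetical names, not the actual code
in the series): the file offset of a guest page is a pure function of the
page index and the ramblock's base offset, so any thread can issue its own
pwrite() without coordination:

    #include <stdint.h>
    #include <unistd.h>

    /* Write one guest page to its fixed location in the save file.
     * block_file_offset is where this ramblock's pages region starts
     * (known from its fixed-ram header); page_size is the guest page size. */
    static ssize_t save_one_page(int fd, const void *page_data,
                                 uint64_t block_file_offset,
                                 uint64_t page_index, size_t page_size)
    {
        off_t offset = block_file_offset + page_index * page_size;

        /* no locking needed: no other thread is ever handed this page,
         * so no other thread writes at this offset */
        return pwrite(fd, page_data, page_size, offset);
    }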

If you can't directly map pages to file offsets, then you need
some other way to lay out data such that each thread can safely
write. If you split up a file based on fixed size chunks, then
the number of chunks you end up with in the file is likely to be
a multiple of the number of threads you had saving data.

This means if you restore using a different number of threads,
you can't evenly assign file chunks to each restore thread.

There's no info about thread IDs in the file, but the data layout
reflects how the work was split across the save threads.

> Assuming decompression can do the same by assigning different chunks to each
> decompress thread, no matter how many there are.
> 
> Would that work?

Again you get uneven workloads if the number of restore threads is
different from the number of save threads, as some threads will have to
process more chunks than other threads. If the chunks are small this
might not matter; if they are big it could matter.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

