Re: Revisiting parallel save/restore

2024-05-01 Thread Jim Fehlig via Devel

On 4/26/24 4:04 AM, Daniel P. Berrangé wrote:

On Wed, Apr 17, 2024 at 05:12:27PM -0600, Jim Fehlig via Devel wrote:

A good starting point on this journey is supporting the new mapped-ram
capability in qemu 9.0 [2]. Since mapped-ram is a new on-disk format, I
assume we'll need a new QEMU_SAVE_VERSION 3 when using it? Otherwise I'm not
sure how to detect if a saved image is in mapped-ram format vs the existing,
sequential stream format.


Yes, we'll need to support 'mapped-ram', so that's a good first step.

A question is whether we make that feature mandatory for all save images,
or implied by another feature (parallel save), or a directly controllable
opt-in feature.

The former breaks back compat with existing libvirt, while the latter two
options are net new and so have no compat implications.

In terms of actual data blocks written on disk, mapped-ram should be the
same size as, or smaller than, the existing format.

In terms of logical file size, however, mapped-ram will almost always be
larger.

This is because mapped-ram will result in a file whose logical size matches
the guest RAM size, plus some header overhead, while being sparse so not
all blocks are written.

If tools handling save images aren't sparse-aware this could come across
as a surprise and even be considered a regression.
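
As an illustration of the sparse-awareness point, a tool can tell the two
apart by comparing the logical size against the allocated blocks; a minimal
sketch (not from any patch):

  #include <stdio.h>
  #include <sys/stat.h>

  /* Print logical vs allocated size of a save image; a mapped-ram image
   * will typically report a logical size close to the guest RAM size
   * while allocating far fewer blocks. */
  int report_sparseness(const char *path)
  {
      struct stat st;

      if (stat(path, &st) < 0)
          return -1;
      printf("%s: logical %lld bytes, allocated %lld bytes\n", path,
             (long long)st.st_size,
             (long long)st.st_blocks * 512); /* st_blocks is in 512-byte units */
      return 0;
  }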

Mapped ram is needed for parallel saves since it lets each thread write
to a specific region of the file.

Mapped ram is good for non-parallel saves too though, because the mapping
of RAM into the file is aligned suitably to allow O_DIRECT to be used.
Currently libvirt has to tunnel over its iohelper to futz with the alignment
needed for O_DIRECT. This makes it desirable to use in general, but back
compat hurts...
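
For context on the alignment issue: O_DIRECT generally requires the user
buffer, the file offset and the transfer length to all be aligned to the
block size, which is what the iohelper currently papers over. A minimal
sketch of a direct write, assuming a 4 KiB alignment requirement:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define ALIGN 4096  /* assumed block size; the real value is device dependent */

  /* Write one aligned block at an aligned offset using O_DIRECT. */
  int write_direct(const char *path, off_t offset, const void *data, size_t len)
  {
      void *buf;
      ssize_t n;
      int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);

      if (fd < 0)
          return -1;
      if (posix_memalign(&buf, ALIGN, ALIGN) != 0) {  /* buffer must be aligned too */
          close(fd);
          return -1;
      }
      memset(buf, 0, ALIGN);
      memcpy(buf, data, len < ALIGN ? len : ALIGN);
      n = pwrite(fd, buf, ALIGN, offset);  /* offset and length must be multiples of ALIGN */
      free(buf);
      close(fd);
      return n == ALIGN ? 0 : -1;
  }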


Looking at what we did in the past

The first time, we stole an element from 'uint32_t unused[..]' in the
save header to add the 'compressed' field, and bumped the
version. This prevented old libvirt from reading the files, which was
needed since adding compression was a non-backwards-compatible
change. We could have carried on using version 1 for non-compressed
images, but we didn't for some reason. It was a hard compat break.


Hmm, libvirt's implementation of compression seems to conflict with mapped-ram. 
AFAIK, mapped-ram requires a seekable fd. Should the two be mutually exclusive?




The next time, we stole an element from 'uint32_t unused[..]' in the
save header to add the 'cookie_len' field, but did NOT bump
the version. 'unused' is always all zeroes, so new libvirt could
detect whether the cookie was present by the length being non-zero.
Old libvirt would still load the image, but would ignore
the cookie data. This was largely harmless.

This time mapped-ram is a non-compatible change, so we need to
ensure old libvirt won't try to read the files, which suggests
either a save version bump, or we could abuse the 'compressed'
field to indicate 'mapped-ram' as a form of compression.

If we did a save version bump, we might want to carry on using
v2 for non-mapped-ram images.
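
To make the layout concrete, the header being discussed is roughly the
following; this is an approximation from memory rather than the exact
libvirt definition, so field names and the unused[] count may differ:

  #include <stdint.h>

  #define QEMU_SAVE_MAGIC   "LibvirtQemudSave"
  #define QEMU_SAVE_VERSION 2

  typedef struct {
      char     magic[sizeof(QEMU_SAVE_MAGIC) - 1];
      uint32_t version;      /* bumped 1 -> 2 when 'compressed' was added */
      uint32_t data_len;     /* length of the domain XML following the header */
      uint32_t was_running;
      uint32_t compressed;   /* the field stolen from unused[] on the v2 bump */
      uint32_t cookie_len;   /* stolen later without a bump: old libvirt sees a
                                non-zero value where zeroes used to be and just
                                ignores it */
      uint32_t unused[14];   /* always zero; candidates for a mapped-ram flag
                                or a 'features' bitmap for a v3 bump */
  } virQEMUSaveHeader;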


IIUC, mapped-ram cannot be used with the existing 'fd:' migration URI and
instead must use 'file:'. Does qemu advertise support for that? I couldn't
find it. If not, 'file:' (available in qemu 8.2) predates mapped-ram, so in
theory we could live without the advertisement.


'mapped-ram' is reported in QMP as a MigrationCapability, so I think we
can probe for it directly.

Yes, it is exclusively for use with 'file:' protocol. If we want to use
FD passing, then we can still do that with 'file:', by using QEMU's
generic /dev/fdset/NNN approach we have with block devices.
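
Roughly, the QMP sequence would look something like this (untested sketch;
the offset would cover whatever header libvirt writes before the qemu data):

  {"execute": "migrate-set-capabilities",
   "arguments": {"capabilities": [{"capability": "mapped-ram", "state": true}]}}

  (pass the already-open save file FD over the monitor socket via SCM_RIGHTS)
  {"execute": "add-fd", "arguments": {"fdset-id": 1}}

  {"execute": "migrate",
   "arguments": {"uri": "file:/dev/fdset/1,offset=0x..."}}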



It's also not clear when we want to enable the mapped-ram capability. Should
it always be enabled if supported by the underlying qemu? One motivation for
creating mapped-ram was to support direct I/O of the migration stream in
qemu, in which case it could be tied to VIR_DOMAIN_SAVE_BYPASS_CACHE. E.g.
the mapped-ram capability is enabled when the user specifies
VIR_DOMAIN_SAVE_BYPASS_CACHE && the user-provided path results in a seekable fd
&& qemu supports mapped-ram?


One option is to be lazy and have a /etc/libvirt/qemu.conf setting for the
save format version, defaulting to the latest, v3. Release-note that
admin/host provisioning apps must set it to v2 if back compat is
needed with old libvirt. If we assume new -> old save image loading
is relatively rare, that's probably good enough.
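
i.e. something like this in qemu.conf (hypothetical setting name):

  # 3 = new mapped-ram format, 2 = old sequential stream format; set to 2
  # if images must stay loadable by older libvirt
  save_image_version = 3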

IOW, we can

  * Bump save version to 3
  * Use v3 by default


Using mapped-ram by default but not supporting compression would be a
regression, right? E.g. 'virsh save vm-name /some/path' would suddenly fail if
the user's /etc/libvirt/qemu.conf contained 'save_image_format = "lzop"'.


Regards,
Jim


  * Add a SAVE_PARALLEL flag which implies mapped-ram, reject if v2
  * Use mapped RAM with BYPASS_CACHE for v3, old approach for v2
  * Steal another unused field to indicate use of mapped-ram, or perhaps
    future-proof it by declaring a 'features' field. So we don't need to
    bump version again, just make
Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-01 Thread Peter Xu
On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > What I worry about more is whether this is really how we want to keep rdma in
> > qemu, and that's also why I was trying to request some serious
> > performance measurements comparing rdma vs. nics.  And here when I say
> > "we" I mean both the QEMU community and any company that will support keeping
> > rdma around.
> > 
> > The problem is: if NICs are now fast enough to perform at least on par with
> > rdma, and if they have a lower overall maintenance cost, does that mean rdma
> > migration will only be used by whoever already has it in their products and
> > wants to keep it?  In that case we should simply ask new users to stick with
> > tcp, and the number of rdma users should only drop, not increase.
> > 
> > It also seems destined that most new migration features will not support
> > rdma: see how many old features we are dropping from migration now (which rdma
> > _might_ still leverage, but maybe not), and how much of what we add is multifd
> > related, which will probably not apply to rdma at all.  So in general what
> > I am worried about is a lose-lose situation, where it might be easier for such
> > a company to either stick with an old qemu (depending on whether other new
> > features are needed besides RDMA alone), or to do periodic rebases with RDMA
> > kept downstream only.
> 
> I don't know much about the origins of RDMA support in QEMU and why
> this particular design was taken. It is indeed a huge maint burden to
> have a completely different code flow for RDMA with 4000+ lines of
> custom protocol signalling which is barely understandable.
> 
> I would note that /usr/include/rdma/rsocket.h provides a higher level
> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> type could almost[1] trivially have supported RDMA. There would have
> been almost no RDMA code required in the migration subsystem, and all
> the modern features like compression, multifd, post-copy, etc would
> "just work".
> 
> I guess the 'rsocket.h' shim may well limit some of the possible
> performance gains, but it might still have been a better tradeoff
> to have not quite so good peak performance, but with massively
> less maint burden.

My understanding so far is that RDMA is solely about performance and nothing
else, so it's a question of whether existing rdma users would accept that if
it means running slower.

Jinpu mentioned the explicit usage of ib verbs, but I'm mostly just quoting
that term as I don't really know such details:

https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/

So I'm not sure whether that applies here too, in that having a qiochannel
wrapper may not allow direct access to those ib verbs.

Thanks,

> 
> With regards,
> Daniel
> 
> [1] "almost" trivially, because the poll() integration for rsockets
> requires a bit more magic sauce since rsockets FDs are not
> really FDs from the kernel's POV. Still, QIOChannel can likely
> abstract that problem.
> -- 
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
> 

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-01 Thread Daniel P. Berrangé
On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> What I worry about more is whether this is really how we want to keep rdma in
> qemu, and that's also why I was trying to request some serious
> performance measurements comparing rdma vs. nics.  And here when I say
> "we" I mean both the QEMU community and any company that will support keeping
> rdma around.
> 
> The problem is: if NICs are now fast enough to perform at least on par with
> rdma, and if they have a lower overall maintenance cost, does that mean rdma
> migration will only be used by whoever already has it in their products and
> wants to keep it?  In that case we should simply ask new users to stick with
> tcp, and the number of rdma users should only drop, not increase.
> 
> It also seems destined that most new migration features will not support
> rdma: see how many old features we are dropping from migration now (which rdma
> _might_ still leverage, but maybe not), and how much of what we add is multifd
> related, which will probably not apply to rdma at all.  So in general what
> I am worried about is a lose-lose situation, where it might be easier for such
> a company to either stick with an old qemu (depending on whether other new
> features are needed besides RDMA alone), or to do periodic rebases with RDMA
> kept downstream only.

I don't know much about the origins of RDMA support in QEMU and why
this particular design was taken. It is indeed a huge maint burden to
have a completely different code flow for RDMA with 4000+ lines of
custom protocol signalling which is barely understandable.

I would note that /usr/include/rdma/rsocket.h provides a higher level
API that is a 1-1 match of the normal kernel 'sockets' API. If we had
leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
type could almost[1] trivially have supported RDMA. There would have
been almost no RDMA code required in the migration subsystem, and all
the modern features like compression, multifd, post-copy, etc would
"just work".

I guess the 'rsocket.h' shim may well limit some of the possible
performance gains, but it might still have been a better tradeoff
to have not quite so good peak performance, but with massively
less maint burden.

With regards,
Daniel

[1] "almost" trivially, because the poll() integration for rsockets
requires a bit more magic sauce since rsockets FDs are not
really FDs from the kernel's POV. Still, QIOChannel can likely
abstract that problem.
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|


Re: [PATCH v3 0/6] migration removals & deprecations

2024-05-01 Thread Peter Xu
On Tue, Apr 30, 2024 at 11:27:31AM -0300, Fabiano Rosas wrote:
> since v2:
> 
> - removed some more stuff which I missed:
>   blk/inc options from hmp-commands.hx
>   the entire ram-compress.h
>   unused declarations from options.h
>   unused compression functions from qemu-file.c
> 
> - removed must_remove_block_options earlier in the 'blk' patch
> 
> - added a deprecation warning to outgoing/incoming fd
> 
> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1272385260
> 
> v2:
> https://lore.kernel.org/r/20240426131408.25410-1-faro...@suse.de
> v1:
> https://lore.kernel.org/r/20240425150939.19268-1-faro...@suse.de
> 
> Hi everyone,
> 
> Here's some cleaning up of deprecated code. It removes the old block
> migration and compression code. Both have suitable replacements in the
> form of the blockdev-mirror driver and multifd compression,
> respectively.
> 
> There's also a deprecation for fd: + file to cope with the fact that
> the new MigrationAddress API defines transports instead of protocols
> (loose terms) like the old string API did. So we cannot map 1:1 from
> fd: to any transport because fd: allows *both* file migration and
> socket migration.
> 
> Fabiano Rosas (6):
>   migration: Remove 'skipped' field from MigrationStats
>   migration: Remove 'inc' option from migrate command
>   migration: Remove 'blk/-b' option from migrate commands
>   migration: Remove block migration
>   migration: Remove non-multifd compression
>   migration: Deprecate fd: for file migration

Reviewed-by: Peter Xu 

-- 
Peter Xu


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-01 Thread Peter Xu
On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote:
> On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> > Peter Xu  writes:
> > 
> > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > >> Hi All (and Peter),
> > >
> > > Hi, Michael,
> > >
> > >> 
> > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> > >> (highly irregular for a male) and yes, that's my real last name:
> > >> https://www.linkedin.com/in/mrgalaxy/
> > >> 
> > >> I'm the original author of the RDMA implementation. I've been discussing
> > >> with Yu Zhang for a little bit about potentially handing over 
> > >> maintainership
> > >> of the codebase to his team.
> > >> 
> > >> I simply have zero access to RoCE or Infiniband hardware at all,
> > >> unfortunately, so I've never been able to run tests or use what I wrote
> > >> at work, and as all of you know, if you don't have a way to test something,
> > >> then you can't maintain it.
> > >> 
> > >> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> > >> they feel comfortable training his team to maintain the codebase (and run
> > >> tests) while they learn about it.
> > >
> > > The "while learning" part is fine at least to me.  IMHO the "ownership" to
> > > the code, or say, taking over the responsibility, may or may not need 100%
> > > mastering the code base first.  There should still be some fundamental
> > > confidence to work on the code though as a starting point, then it's about
> > > serious use case to back this up, and careful testings while getting more
> > > familiar with it.
> > 
> > How much experience we expect of maintainers depends on the subsystem
> > and other circumstances.  The hard requirement isn't experience, it's
> > trust.  See the recent attack on xz.
> > 
> > I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> > I'm merely reminding y'all what's at stake.
> 
> I think we shouldn't overly obsess[1] about 'xz', because the overwhelmingly
> common scenario is that volunteer maintainers are honest people. QEMU is
> in a massively better peer review situation. With xz there was basically no
> oversight of the new maintainer. With QEMU, we have oversight from 1000's
> of people on the list, a huge pool of general maintainers, the specific
> migration maintainers, and the release manager merging code.
> 
> With a lack of historical experience with QEMU maintainership, I'd suggest
> that new RDMA volunteers would start by adding themselves to the "MAINTAINERS"
> file with only the 'Reviewer' classification. The main migration maintainers
> would still handle pull requests, but wait for a R-b from one of the RDMA
> volunteers. After some period of time the RDMA folks could graduate to full
> maintainer status if the migration maintainers needed to reduce their load.
> I suspect that might prove unnecessary though, given RDMA isn't an area of
> code with a high turnover of patches.

Right, and we can do that as a start; it also follows our normal practice of
starting as a Reviewer before maintaining something.  I even considered Zhijian
to be the previous rdma go-to guy / maintainer, no matter what role he used
to have in the MAINTAINERS file.

Here IMHO it's more about whether any company is willing to stand up and
provide help, without yet tying that to being able to send pull requests in
the near future or even longer term.

What I worry about more is whether this is really how we want to keep rdma in
qemu, and that's also why I was trying to request some serious
performance measurements comparing rdma vs. nics.  And here when I say
"we" I mean both the QEMU community and any company that will support keeping
rdma around.

The problem is: if NICs are now fast enough to perform at least on par with
rdma, and if they have a lower overall maintenance cost, does that mean rdma
migration will only be used by whoever already has it in their products and
wants to keep it?  In that case we should simply ask new users to stick with
tcp, and the number of rdma users should only drop, not increase.

It also seems destined that most new migration features will not support
rdma: see how many old features we are dropping from migration now (which rdma
_might_ still leverage, but maybe not), and how much of what we add is multifd
related, which will probably not apply to rdma at all.  So in general what
I am worried about is a lose-lose situation, where it might be easier for such
a company to either stick with an old qemu (depending on whether other new
features are needed besides RDMA alone), or to do periodic rebases with RDMA
kept downstream only.

So even if we want to keep RDMA around, I hope with this chance we can at
least get a clear picture of when we should still suggest RDMA to any new
user (and the reasons behind it). Or perhaps we simply shouldn't suggest RDMA
to any new user at all (because, at the least, they would lose many new
migration features).

Thanks,

-- 
Peter Xu