On Thu, Sep 2, 2021 at 6:59 AM Daniel P. Berrangé <berra...@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 06:49:06AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé <berra...@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berra...@redhat.com> wrote:
> > > > >
> > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > > Hello Daniel, thanks for the feedback !
> > > > > >
> > > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berra...@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > > > >
> > > > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > > > flags for qio_channel_write*().
> > > > > > > >
> > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > > > other data being sent at the default copying approach.
> > > > > > > >
> > > > > > > > Signed-off-by: Leonardo Bras <leob...@redhat.com>
> > > > > > > > ---
> > > > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > > > >  migration/multifd.h      | 3 ++-
> > > > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > > >
> > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > > > >          }
> > > > > > > >
> > > > > > > >          if (used) {
> > > > > > > > -            ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > > > +            ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > > > +                                                      &local_err);
> > > > > > >
> > > > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > > > resource usage implications
> > > > > > >
> > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > > >
> > > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > > >
> > > > > > You are correct, I unfortunately missed this part in the docs :(
> > > > > >
> > > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > > implies locked memory (eg PCI assignment)
> > > > > >
> > > > > > Do you mean the limit a user has on locking memory?
> > > > >
> > > > > Yes, by default the limit QEMU sees will be something very small.
> > > > >
> > > > > > If so, that makes sense. I remember I needed to raise the upper limit of
> > > > > > locked memory for the user before using it, or add a capability to qemu
> > > > > > first.
> > > > > >
> > > > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > > > zerocopy=on in qemu code.
> > > > > > If it fails, we can print an error message and fall back to not using
> > > > > > zerocopy (following the idea of a new io_async_writev() I mentioned in
> > > > > > the previous mail).
> > > > >
> > > > > Currently the ability to lock memory is something that has to be configured
> > > > > when QEMU starts, and it requires libvirt to grant suitable permissions to
> > > > > QEMU. Memory locking is generally undesirable because it prevents memory
> > > > > overcommit. Or rather, if you are allowing memory overcommit, then allowing
> > > > > memory locking is a way to kill your entire host.
> > > >
> > > > You mean it's gonna consume too much memory, or something else?
> > >
> > > Essentially yes.
> >
> > Well, maybe we can check for available memory before doing that,
> > but maybe it's too much effort.
>
> From a mgmt app POV, we assume QEMU is untrustworthy, so the mgmt
> app needs to enforce policy based on the worst case that the
> VM configuration allows for.
>
> Checking current available memory is not viable, because even
> if you see 10 GB free, QEMU can't know if that free memory is
> there to satisfy other VMs' worst case needs, or its own. QEMU
> needs to be explicitly told whether it's OK to use locked memory,
> and an enforcement policy applied to it.
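Just to make sure we are talking about the same limit: the one QEMU would run
into here is RLIMIT_MEMLOCK, right? For illustration, the kind of check I had
in mind earlier would look roughly like this (standalone sketch with a made-up
helper name, not code from this series):

#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Coarse check: can the current RLIMIT_MEMLOCK soft limit accommodate
 * pinning 'bytes' of guest RAM?  This is only a guard for deciding
 * whether to even try zerocopy; the kernel charges pinned pages per
 * send, so a runtime fallback is still needed even when this returns
 * true.
 */
static bool memlock_limit_allows(size_t bytes)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        return false;
    }
    if (lim.rlim_cur == RLIM_INFINITY) {
        return true;
    }
    return lim.rlim_cur >= bytes;
}

But I agree this only answers "could it ever fit", not "is it OK policy-wise".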
Yeah, it makes sense to let the mgmt app deal with that and enable/disable
the MSG_ZEROCOPY on migration whenever it sees fit.

> > > Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> > > overcommit ratio of 1.5, i.e. we'll run VMs with 64*1.5 GB of RAM
> > > total.
> > >
> > > So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage,
> > > which exceeds physical RAM. Most of the time this may well be fine,
> > > as the VMs don't concurrently need their full RAM allocation, and
> > > worst case they'll get pushed to swap as the kernel re-shares memory
> > > in response to load. So perhaps each VM only needs 20 GB resident at
> > > any time, but over time one VM can burst up to 32 GB and then 12 GB
> > > of it gets swapped out later when inactive.
> > >
> > > But now consider if we allowed 2 of the VMs to lock memory for
> > > purposes of migration. Those 2 VMs can now pin 64 GB of memory in
> > > the worst case, leaving no free memory for the 3rd VM or for the OS.
> > > This will likely take down the entire host, regardless of swap
> > > availability.
> > >
> > > IOW, if you are overcommitting RAM you have to be extremely careful
> > > about allowing any VM to lock memory. If you do decide to allow
> > > memory locking, you need to make sure that the worst case locked
> > > memory amount still leaves enough unlocked memory for the OS to be
> > > able to effectively manage the overcommit load via swap. We
> > > definitely can't grant memory locking to VMs at startup in this
> > > scenario, and if we grant it at runtime, we need to be able to
> > > revoke it again later.
> > >
> > > These overcommit numbers are a bit more extreme than you'd usually
> > > see in the real world, but they illustrate the general problem. Also
> > > bear in mind that QEMU has memory overhead beyond the guest RAM
> > > block, which varies over time, making accounting quite hard. We also
> > > have to assume that QEMU could have been compromised by a guest
> > > breakout, so we can't assume that migration will play nice - we have
> > > to assume the worst case possible, given the process ulimits.
> >
> > Yeah, that makes sense. Thanks for this illustration and elucidation !
> >
> > I assume there is no way of asking the OS to lock memory such that, if
> > there is no space available, it fails and rolls back the locking.
>
> Yes & no. On most Linux configs though it ends up being no, because
> instead of getting a nice failure, when host memory pressure occurs,
> the OOM Killer wakes up and just culls processes.

Oh, right, the OOM Killer :)

> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Thanks!

Best regards,
Leonardo
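PS: for the runtime fallback itself, this is roughly what I picture at the
socket level (plain POSIX sketch with made-up function names, not the final
QIOChannel interface; reading the completion notifications from the error
queue is omitted):

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60          /* value on most architectures */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Opt the socket in once, before the first MSG_ZEROCOPY send. */
static int enable_zerocopy(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/*
 * Try a zerocopy send first; on ENOBUFS (optmem limit, locked-pages
 * ulimit, or SO_ZEROCOPY not set) retry as an ordinary copying send
 * instead of failing the stream.  A real implementation also has to
 * keep the buffer alive until the kernel reports completion via
 * MSG_ERRQUEUE; that part is not shown here.
 */
static ssize_t send_with_zerocopy_fallback(int fd, const void *buf, size_t len)
{
    ssize_t ret = send(fd, buf, len, MSG_ZEROCOPY);

    if (ret < 0 && errno == ENOBUFS) {
        ret = send(fd, buf, len, 0);
    }
    return ret;
}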