On Wed, Jun 15, 2022 at 6:03 PM Eugenio Perez Martin <epere...@redhat.com> wrote:
>
> On Wed, Jun 15, 2022 at 5:04 AM Jason Wang <jasow...@redhat.com> wrote:
> >
> > On Tue, Jun 14, 2022 at 5:32 PM Eugenio Perez Martin
> > <epere...@redhat.com> wrote:
> > >
> > > On Tue, Jun 14, 2022 at 10:20 AM Jason Wang <jasow...@redhat.com> wrote:
> > > >
> > > > On Tue, Jun 14, 2022 at 4:14 PM Eugenio Perez Martin
> > > > <epere...@redhat.com> wrote:
> > > > >
> > > > > On Tue, Jun 14, 2022 at 10:02 AM Jason Wang <jasow...@redhat.com>
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Jun 14, 2022 at 12:32 AM Eugenio Perez Martin
> > > > > > <epere...@redhat.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 8, 2022 at 9:28 PM Eugenio Perez Martin
> > > > > > > <epere...@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 8, 2022 at 7:51 AM Jason Wang <jasow...@redhat.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On 2022/5/20 03:12, Eugenio Pérez wrote:
> > > > > > > > > > The control virtqueue is used by the networking device to
> > > > > > > > > > accept various commands from the driver. It is a must to
> > > > > > > > > > support multiqueue and other configurations.
> > > > > > > > > >
> > > > > > > > > > Shadow VirtQueue (SVQ) already makes migration of virtqueue
> > > > > > > > > > states possible, effectively intercepting them so qemu can
> > > > > > > > > > track which regions of memory are dirtied by device action
> > > > > > > > > > and need migration. However, this does not cover the
> > > > > > > > > > networking device state seen by the driver, which is
> > > > > > > > > > changed by CVQ messages like MAC address updates.
> > > > > > > > > >
> > > > > > > > > > To solve that, this series uses the proposed SVQ
> > > > > > > > > > infrastructure to intercept the networking control messages
> > > > > > > > > > used by the device.
> > > > > > > > > > This way, qemu is able to update the VirtIONet device
> > > > > > > > > > model and to migrate it.
> > > > > > > > > >
> > > > > > > > > > However, intercepting all queues would slow down device
> > > > > > > > > > data forwarding. To solve that, only the CVQ must be
> > > > > > > > > > intercepted all the time. This is achieved using the ASID
> > > > > > > > > > infrastructure, which allows different translations for
> > > > > > > > > > different virtqueues. The most up-to-date kernel part of
> > > > > > > > > > ASID is proposed at [1].
> > > > > > > > > >
> > > > > > > > > > You can run qemu in two modes after applying this series:
> > > > > > > > > > only intercepting the cvq with x-cvq-svq=on, or
> > > > > > > > > > intercepting all the virtqueues by adding x-svq=on to the
> > > > > > > > > > cmdline:
> > > > > > > > > >
> > > > > > > > > > -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,id=vhost-vdpa0,x-cvq-svq=on,x-svq=on
> > > > > > > > > >
> > > > > > > > > > The first three patches enable the update of the
> > > > > > > > > > virtio-net device model for each CVQ message acknowledged
> > > > > > > > > > by the device.
> > > > > > > > > >
> > > > > > > > > > Patches 5 to 9 enable each SVQ to copy the buffers to
> > > > > > > > > > QEMU's VA. This allows simplifying the memory mapping,
> > > > > > > > > > instead of mapping all of the guest's memory as in the
> > > > > > > > > > data virtqueues.
> > > > > > > > > >
> > > > > > > > > > Patch 10 allows injecting control messages into the
> > > > > > > > > > device. This makes it possible to set the device state
> > > > > > > > > > both at QEMU startup and at the live migration
> > > > > > > > > > destination. In the future, this may also be used to
> > > > > > > > > > emulate _F_ANNOUNCE.
> > > > > > > > > > Patch 11 updates the kernel headers, but it assigns
> > > > > > > > > > provisional numbers to the needed ioctls because they are
> > > > > > > > > > still not accepted in the kernel.
> > > > > > > > > >
> > > > > > > > > > Patches 12-16 enable setting the features of the net
> > > > > > > > > > device model on the vdpa device at device start.
> > > > > > > > > >
> > > > > > > > > > The last ones enable the separated ASID and SVQ.
> > > > > > > > > >
> > > > > > > > > > Comments are welcome.
> > > > > > > > >
> > > > > > > > > As discussed, I think we need to split this huge series into
> > > > > > > > > smaller ones:
> > > > > > > > >
> > > > > > > > > 1) shadow CVQ only, this makes rx-filter-event work
> > > > > > > > > 2) ASID support for CVQ
> > > > > > > > >
> > > > > > > > > And for 1) we need to consider whether or not it could be
> > > > > > > > > simplified.
> > > > > > > > >
> > > > > > > > > Or do it in reverse order, since if we do 1) first, we may
> > > > > > > > > have security issues.
> > > > > > > >
> > > > > > > > I'm ok with both, but I also think 2) before 1) might make
> > > > > > > > more sense. There is no way to only shadow CVQ otherwise ATM.
> > > > > > >
> > > > > > > On second thought, that order is kind of harder.
> > > > > > >
> > > > > > > If we only map CVQ buffers, we need to either:
> > > > > > > a. Copy them to controlled buffers
> > > > > > > b. Track properly when to unmap them
> > > > > >
> > > > > > Just to make sure we're on the same page:
> > > > > >
> > > > > > I meant we can start with e.g. having a dedicated ASID for CVQ
> > > > > > but still using CVQ passthrough.
> > > > >
> > > > > That would imply duplicating all the memory listener updates to
> > > > > both ASIDs. That part of the code would need to be reverted later.
> > > > > I'm ok with that, but I'm not sure it's worth doing it that way.
> > > >
> > > > I don't get why it is related to memory listeners. The only change
> > > > is:
> > > >
> > > > 1) read the groups
> > > > 2) set cvq to be an independent asid
> > > > 3) update CVQ's IOTLB with its own ASID
> > >
> > > How do we track the mappings of step 3) without a copy?
> >
> > So let me try to explain: what I propose is to split the patches, so
> > the above could be the first part. Since we know:
> >
> > 1) CVQ is passthrough to the guest right now
> > 2) CVQ will use an independent ASID
> >
> > it doesn't harm to implement those first. It's unrelated to the policy
> > (e.g. how to shadow CVQ).
> >
> > > If we don't copy the buffers to qemu's IOVA, we need to track when
> > > to unmap the CVQ buffers' memory. Many CVQ buffers could be in the
> > > same page, so we need to refcount them (or a similar solution).
> >
> > Can we use a fixed mapping instead of the dynamic ones?
>
> That implies either implementing something like a memory ring (of what
> size?), or effectively duplicating the memory listener mappings.
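The per-page bookkeeping that tracking unmaps without a copy would need can be sketched roughly like this. This is an invented illustration, not code from the series; it only shows why several CVQ buffers sharing one guest page force a refcount before any page can be safely unmapped from the device:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define MAX_PAGES  64

/*
 * Hypothetical sketch of "alternative b": track how many CVQ buffers
 * currently live in each guest page, so a page is only unmapped from
 * the device when the last buffer inside it is returned.
 */
struct page_ref {
    uint64_t pfn;      /* guest page frame number */
    unsigned refcount; /* buffers currently mapped inside this page */
};

static struct page_ref refs[MAX_PAGES];

static struct page_ref *find_ref(uint64_t pfn)
{
    for (size_t i = 0; i < MAX_PAGES; i++) {
        if (refs[i].refcount && refs[i].pfn == pfn) {
            return &refs[i];
        }
    }
    return NULL;
}

/*
 * Called when a CVQ buffer becomes visible to the device.
 * Returns 1 if the page needs a new device mapping, 0 if it is
 * already mapped, -1 on table overflow.
 */
static int map_buffer_page(uint64_t iova)
{
    uint64_t pfn = iova >> PAGE_SHIFT;
    struct page_ref *r = find_ref(pfn);

    if (r) {
        r->refcount++;
        return 0;               /* page already mapped for the device */
    }
    for (size_t i = 0; i < MAX_PAGES; i++) {
        if (!refs[i].refcount) {
            refs[i] = (struct page_ref){ .pfn = pfn, .refcount = 1 };
            return 1;           /* caller must map this page now */
        }
    }
    return -1;                  /* tracking table full */
}

/*
 * Called when the device is done with a buffer.
 * Returns 1 if the page can now be unmapped safely.
 */
static int unmap_buffer_page(uint64_t iova)
{
    struct page_ref *r = find_ref(iova >> PAGE_SHIFT);

    assert(r && r->refcount);
    return --r->refcount == 0;
}
```

As the thread notes, this machinery is unneeded in the final copy-based design, which is part of the argument for copying instead.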
I'm not sure I get this. But it's mainly the CVQ buffer + the CVQ
virtqueue. It should be possible if we:

1) allocate something like a buffer of several megabytes
2) only process one CVQ command from the guest at once

?

> I'm not against that, but it's something we'd need to remove in the
> final solution. Using the order presented here avoids that.
>
> > > This series copies the buffers to an independent buffer in qemu
> > > memory to avoid that. Once you copy them, we have the problem you
> > > point out at some later patch: the guest controls the buffers, so
> > > qemu must understand CVQ so the guest cannot trick it. All of this
> > > is orthogonal to ASID. At that point, we have this series except
> > > for the asid part and the injection of CVQ buffers at the LM
> > > destination, don't we?
> >
> > So we have several pieces:
> >
> > 1) ASID support
> > 2) Shadow CVQ only
> > 3) State restoring
> >
> > I hope we can split them into independent series. If we want to
> > shadow CVQ first, we need to prove that it is safe without ASID.
> >
> > > CVQ buffers can be copied into the qemu IOVA space and offered to
> > > the device. Much like the SVQ vrings, the copied buffers will not
> > > be accessible from the guest. The hw device (as "non emulated cvq")
> > > will receive a lot of dma updates, but it's temporary. We can add
> > > ASID on top of that as a means to:
> > > - Not shadow the data plane (fundamental to the intended use case
> > >   of vdpa).
> > > - Not pollute the data plane's DMA mappings.
> >
> > ?
> >
> > > > > > Then do other stuff on top.
> > > > > > >
> > > > > > > Alternative a. has the same problems exposed in this RFC:
> > > > > > > it's hard (and unneeded in the final version) to know the
> > > > > > > size to copy.
> > > > > > > Alternative b. also requires things not needed in the final
> > > > > > > version, like counting the number of times each page is
> > > > > > > mapped and unmapped.
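The fixed-mapping idea above can be sketched as follows; the names and the 2 MiB size are invented for illustration, not taken from the series. One bounce buffer is mapped into the CVQ's address space once at start and reused for every command, so no dynamic map/unmap (and none of alternative b's refcounting) is needed as long as only one CVQ command is in flight at a time:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed bounce buffer, mapped to the device once at start. */
enum { CVQ_BOUNCE_SIZE = 2 * 1024 * 1024 };

static uint8_t cvq_bounce[CVQ_BOUNCE_SIZE];
static int cvq_busy;            /* enforces "one command at once" */

/*
 * Copy one guest command into the pre-mapped buffer and hand it to the
 * device. The device only ever sees cvq_bounce, never guest memory.
 * Returns 0 on success, -1 if busy or the command does not fit.
 */
static int cvq_submit(const uint8_t *guest_cmd, size_t len)
{
    if (cvq_busy || len > CVQ_BOUNCE_SIZE) {
        return -1;
    }
    cvq_busy = 1;
    memcpy(cvq_bounce, guest_cmd, len);
    /* ...kick the device here; completion arrives asynchronously... */
    return 0;
}

/* Runs when the device marks the buffer as used; frees it for reuse. */
static void cvq_complete(void)
{
    cvq_busy = 0;
}
```

Serializing commands this way trades CVQ throughput (which is tiny anyway) for a single, permanent mapping.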
> > > > > > > So I'll go with the first alternative, which is also the
> > > > > > > proposed order of the RFC. What security issues do you expect
> > > > > > > beyond the comments in this series?
> > > > > >
> > > > > > If we shadow CVQ without ASID, the guest may guess the IOVA of
> > > > > > CVQ and try to peek/modify it?
> > > > >
> > > > > It works the same way as the data vqs, we're just updating the
> > > > > device model in the middle. It should imply the exact same risk
> > > > > as updating an emulated NIC control plane (including
> > > > > vhost-kernel / vhost-user).
> > > >
> > > > Not sure I got you here. For vhost-kernel and vhost-user, CVQ's
> > > > buffer is owned by the guest.
> > >
> > > The same way they control the data plane when all data virtqueues
> > > are shadowed for dirty page tracking (more on the risk of qemu
> > > updating the device model below).
> >
> > Ok.
> >
> > > > But if we shadow CVQ without ASID, the CVQ buffer is owned by QEMU
> > > > and there's no way to prevent guests from accessing it?
> > >
> > > With SVQ the memory exposed to the device is already shadowed. The
> > > guest cannot access the CVQ buffers' memory the same way it cannot
> > > access the SVQ vrings.
> >
> > Ok, I think I kind of get you; it looks like we have different
> > assumptions here. So if we only shadow CVQ, it will have security
> > issues, since RX/TX is not shadowed. If we shadow CVQ as well as
> > TX/RX, there's no security issue, since each IOVA is validated and
> > the descriptors are prepared by Qemu.
>
> Right. I expected to maintain the all-shadowed-or-nothing behavior,
> sorry if I was not clear.
>
> > This goes back to another question: what's the order of the series?
>
> I think the shortest path is to follow the order of this series. I
> tried to reorder it your way, but the ASID patches have to come with a
> lot of CVQ patches if we want proper validation.
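The point that "qemu must understand CVQ so the guest cannot trick it" boils down to parsing and validating every control command before acting on it. A minimal sketch of that kind of check is below; the constants follow the virtio-net spec, but the function itself is a made-up stand-in, not the series' virtio_net_handle_ctrl_iov():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Status byte values defined by the virtio-net spec. */
#define VIRTIO_NET_OK  0
#define VIRTIO_NET_ERR 1

/* Command class/cmd values from the spec, for the MAC example. */
#define VIRTIO_NET_CTRL_MAC          1
#define VIRTIO_NET_CTRL_MAC_ADDR_SET 1

/* Control header layout per the virtio-net spec: class then cmd. */
struct virtio_net_ctrl_hdr {
    uint8_t class;
    uint8_t cmd;
};

/*
 * Illustrative validator: QEMU owns the shadowed CVQ buffer, so it
 * must reject anything malformed before updating the device model or
 * forwarding the command. Only MAC_ADDR_SET is handled here.
 */
static uint8_t cvq_validate(const uint8_t *buf, size_t len)
{
    struct virtio_net_ctrl_hdr hdr;

    if (len < sizeof(hdr)) {
        return VIRTIO_NET_ERR;          /* truncated header */
    }
    hdr.class = buf[0];
    hdr.cmd = buf[1];

    if (hdr.class == VIRTIO_NET_CTRL_MAC &&
        hdr.cmd == VIRTIO_NET_CTRL_MAC_ADDR_SET) {
        /* payload must be exactly one 6-byte MAC address */
        return len - sizeof(hdr) == 6 ? VIRTIO_NET_OK : VIRTIO_NET_ERR;
    }
    return VIRTIO_NET_ERR;              /* unknown command: reject */
}
```

Since the same parsing runs for emulated NICs and other vhost backends, a guest able to exploit it would hit those backends too, which is the equivalence argued above.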
Ok, so if this is the case, let's just split this series and keep the
order.

> We can take the long route if we either implement a fixed ring
> buffer, memory listener cloning, or another use case (sub-slicing?).
> But I expect more issues to arise there.
>
> I have another question, actually: is it ok to implement the cvq use
> case but not to merge the x-svq parameter? The more I think about the
> parameter, the more I see it's better to leave it as a separate patch
> for testing until we shape the complete series, where it's unneeded.

That's fine.

Thanks

> Thanks!
>
> > Thanks
> >
> > > > If in the case of vhost-kernel/vhost-user there's a way for the
> > > > guest to exploit buffers owned by Qemu, it should be a bug.
> > >
> > > The only extra step is the call to virtio_net_handle_ctrl_iov
> > > (extracted from virtio_net_handle_ctrl). If a guest can exploit
> > > that in SVQ mode, it can exploit it too with other vhost backends
> > > as far as I see.
> > >
> > > > Thanks
> > > >
> > > > > Roughly speaking, it's just to propose patches 01 to 03, with
> > > > > your comments. That already meets use cases like rx filter
> > > > > notifications for devices with only one ASID.
> > >
> > > This part of my mail is not correct; we need to add a few patches
> > > of this series on top :). If not, it would be exploitable.
> > >
> > > Thanks!
> > >
> > > > Thanks!
> > > >
> > > > > > Thanks
> > > > >
> > > > > Thanks!
> > > > >
> > > > > > > > Can we do as with the previous base SVQ patches? They were
> > > > > > > > merged although there is still no way to enable SVQ.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > Thoughts?
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > > TODO:
> > > > > > > > > > * Fall back on regular CVQ if QEMU cannot isolate it in
> > > > > > > > > >   its own ASID for any reason, blocking migration. This
> > > > > > > > > >   is tricky, since it can cause the VM to become
> > > > > > > > > >   unmigratable, so some way of blocking it must be used.
> > > > > > > > > > * Review failure paths; some have TODO notes, others
> > > > > > > > > >   don't.
> > > > > > > > > >
> > > > > > > > > > Changes from rfc v7:
> > > > > > > > > > * Don't map all guest space in ASID 1 but copy all the
> > > > > > > > > >   buffers. No need for more memory listeners.
> > > > > > > > > > * Move the net backend start callback to SVQ.
> > > > > > > > > > * Wait for the CVQ commands used by the device at SVQ
> > > > > > > > > >   start, avoiding races.
> > > > > > > > > > * Changed ioctls, but they're provisional anyway.
> > > > > > > > > > * Reorder commits so refactoring and code-adding ones
> > > > > > > > > >   are closer to their usage.
> > > > > > > > > > * Usual cleaning: better tracing, docs, patch
> > > > > > > > > >   messages, ...
> > > > > > > > > >
> > > > > > > > > > Changes from rfc v6:
> > > > > > > > > > * Fix bad iotlb update order when batching was enabled.
> > > > > > > > > > * Add reference counting to iova_tree so cleaning is
> > > > > > > > > >   simpler.
> > > > > > > > > >
> > > > > > > > > > Changes from rfc v5:
> > > > > > > > > > * Fix bad calculation of the cvq end group when MQ is
> > > > > > > > > >   not acked by the guest.
> > > > > > > > > > Changes from rfc v4:
> > > > > > > > > > * Add missing tracing.
> > > > > > > > > > * Add multiqueue support.
> > > > > > > > > > * Use the already sent version for replacing g_memdup.
> > > > > > > > > > * Care with memory management.
> > > > > > > > > >
> > > > > > > > > > Changes from rfc v3:
> > > > > > > > > > * Fix bad returning of descriptors to the SVQ list.
> > > > > > > > > >
> > > > > > > > > > Changes from rfc v2:
> > > > > > > > > > * Fix use-after-free.
> > > > > > > > > >
> > > > > > > > > > Changes from rfc v1:
> > > > > > > > > > * Rebase to latest master.
> > > > > > > > > > * Configure ASID instead of assuming cvq asid != data
> > > > > > > > > >   vqs asid.
> > > > > > > > > > * Update the device model so (MAC) state can be migrated
> > > > > > > > > >   too.
> > > > > > > > > >
> > > > > > > > > > [1] https://lkml.kernel.org/kvm/20220224212314.1326-1-gda...@xilinx.com/
> > > > > > > > > >
> > > > > > > > > > Eugenio Pérez (21):
> > > > > > > > > >   virtio-net: Expose ctrl virtqueue logic
> > > > > > > > > >   vhost: Add custom used buffer callback
> > > > > > > > > >   vdpa: control virtqueue support on shadow virtqueue
> > > > > > > > > >   virtio: Make virtqueue_alloc_element non-static
> > > > > > > > > >   vhost: Add vhost_iova_tree_find
> > > > > > > > > >   vdpa: Add map/unmap operation callback to SVQ
> > > > > > > > > >   vhost: move descriptor translation to
> > > > > > > > > >     vhost_svq_vring_write_descs
> > > > > > > > > >   vhost: Add SVQElement
> > > > > > > > > >   vhost: Add svq copy desc mode
> > > > > > > > > >   vhost: Add vhost_svq_inject
> > > > > > > > > >   vhost: Update kernel headers
> > > > > > > > > >   vdpa: delay set_vring_ready after DRIVER_OK
> > > > > > > > > >   vhost: Add ShadowVirtQueueStart operation
> > > > > > > > > >   vhost: Make possible to check for device exclusive vq group
> > > > > > > > > >   vhost: add vhost_svq_poll
> > > > > > > > > >   vdpa: Add vhost_vdpa_start_control_svq
> > > > > > > > > >   vdpa: Add asid attribute to vdpa device
> > > > > > > > > >   vdpa: Extract get features part from
> > > > > > > > > >     vhost_vdpa_get_max_queue_pairs
> > > > > > > > > >   vhost: Add reference counting to vhost_iova_tree
> > > > > > > > > >   vdpa: Add x-svq to NetdevVhostVDPAOptions
> > > > > > > > > >   vdpa: Add x-cvq-svq
> > > > > > > > > >
> > > > > > > > > >  qapi/net.json                                |  13 +-
> > > > > > > > > >  hw/virtio/vhost-iova-tree.h                  |   7 +-
> > > > > > > > > >  hw/virtio/vhost-shadow-virtqueue.h           |  61 ++-
> > > > > > > > > >  include/hw/virtio/vhost-vdpa.h               |   3 +
> > > > > > > > > >  include/hw/virtio/vhost.h                    |   3 +
> > > > > > > > > >  include/hw/virtio/virtio-net.h               |   4 +
> > > > > > > > > >  include/hw/virtio/virtio.h                   |   1 +
> > > > > > > > > >  include/standard-headers/linux/vhost_types.h |  11 +-
> > > > > > > > > >  linux-headers/linux/vhost.h                  |  25 +-
> > > > > > > > > >  hw/net/vhost_net.c                           |   5 +-
> > > > > > > > > >  hw/net/virtio-net.c                          |  84 +++--
> > > > > > > > > >  hw/virtio/vhost-iova-tree.c                  |  35 +-
> > > > > > > > > >  hw/virtio/vhost-shadow-virtqueue.c           | 378 ++++++++++++++++---
> > > > > > > > > >  hw/virtio/vhost-vdpa.c                       | 206 +++++++++-
> > > > > > > > > >  hw/virtio/virtio.c                           |   2 +-
> > > > > > > > > >  net/vhost-vdpa.c                             | 294 ++++++++++++++-
> > > > > > > > > >  hw/virtio/trace-events                       |  10 +-
> > > > > > > > > >  17 files changed, 1012 insertions(+), 130 deletions(-)