On 01/27/2016 06:56 AM, Alex Williamson wrote:
> On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
>>> From: Alex Williamson [mailto:alex.william...@redhat.com]
>>> Sent: Wednesday, January 27, 2016 6:27 AM
>>>
>>> On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
>>>>> From: Alex Williamson [mailto:alex.william...@redhat.com]
>>>>> Sent: Wednesday, January 27, 2016 6:08 AM
>>>>>
>>>>>>>>
>>>>>>>> Today KVMGT (not using VFIO yet) registers I/O emulation callbacks
>>>>>>>> with KVM, so VM MMIO accesses are forwarded to KVMGT directly for
>>>>>>>> emulation in the kernel. If we reuse the above R/W flags, the whole
>>>>>>>> emulation path would be unnecessarily long, with an obvious
>>>>>>>> performance impact. We either need a new flag here to indicate
>>>>>>>> in-kernel emulation (a deviation from passthrough support), or
>>>>>>>> alternatively just hide the region (let KVMGT handle I/O emulation
>>>>>>>> itself, like today).
>>>>>>>
>>>>>>> That sounds like a future optimization TBH. There's very strict
>>>>>>> layering between vfio and kvm. Physical device assignment could make
>>>>>>> use of it as well, avoiding a round trip through userspace when an
>>>>>>> ioread/write would do. Userspace also needs to orchestrate those
>>>>>>> kinds of accelerators; there might be cases where userspace wants to
>>>>>>> see those transactions for debugging or manipulating the device. We
>>>>>>> can't simply take shortcuts to provide such direct access. Thanks,
>>>>>>>
>>>>>> But we have to balance such debugging flexibility against acceptable
>>>>>> performance. To me the latter is more important, otherwise there'd be
>>>>>> no real usage of this technique, while for debugging there are other
>>>>>> alternatives (e.g. ftrace). Consider an extreme case with 100k
>>>>>> traps/second and then see how much impact a 2-3x longer emulation
>>>>>> path can bring...
>>>>>
>>>>> Are you jumping to the conclusion that it cannot be done with proper
>>>>> layering in place? Performance is important, but it's not an excuse
>>>>> to abandon designing interfaces between independent components.
>>>>> Thanks,
>>>>>
>>>> The two are not contradictory. My point is to remove the unnecessarily
>>>> long trip where possible. On further thought, yes, we can reuse the
>>>> existing read/write flags:
>>>> - KVMGT will expose a private control variable indicating whether
>>>> in-kernel delivery is required;
>>>
>>> But in-kernel delivery is never *required*. Wouldn't userspace want to
>>> deliver in-kernel any time it possibly could?
>>>
>>>> - when the variable is true, KVMGT will register in-kernel MMIO
>>>> emulation callbacks, and VM MMIO requests will be delivered to KVMGT
>>>> directly;
>>>> - when the variable is false, KVMGT will not register anything.
>>>> VM MMIO requests will then be delivered to Qemu, and ioread/write
>>>> will be used to finally reach the KVMGT emulation logic;
>>>
>>> No, that means the interface is entirely dependent on a backdoor
>>> through KVM. Why can't userspace (QEMU) do something like register an
>>> MMIO region with KVM that is handled via a provided file descriptor
>>> and offset? Couldn't KVM then call the file ops without a kernel exit?
>>> Thanks,
>>>
>> Could you elaborate this thought? If it can achieve the purpose w/o
>> a kernel exit, we can definitely adapt to it. :-)
>
> I only thought of it when replying to the last email and have been
> doing some research, but we already do quite a bit of synchronization
> through file descriptors.
> The kvm-vfio pseudo device uses a group file descriptor to ensure a
> user has access to a group, allowing some degree of interaction
> between modules. Eventfds and irqfds already make use of f_ops on
> file descriptors to poke data. So, if KVM had information that an
> MMIO region was backed by a file descriptor for which it already has
> a reference via fdget() (and verified access rights and whatnot), then
> it ought to be a simple matter to get to f_ops->read/write knowing the
> base offset of that MMIO region. Perhaps it could even simply use
> __vfs_read/write(). Then we've got a proper reference to the file
> descriptor for ownership purposes, and we've transparently jumped
> across modules without any implicit knowledge of the other end. Could
> it work?
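To make that concrete, here is a minimal sketch of how a write to an
fd-backed MMIO range might be dispatched without leaving the kernel.
All the names are invented for illustration (this is not an existing
KVM interface); only kernel_write(), fget()/fdget() and the list
helpers are real kernel APIs. The read side would mirror it with
kernel_read().

/* Hypothetical sketch only: kvm_fd_mmio_region, the fd_mmio_regions
 * list on struct kvm, and kvm_fd_mmio_write() are invented names. */
struct kvm_fd_mmio_region {
	gpa_t base;		/* guest-physical base of the range */
	u64 len;		/* length of the range in bytes */
	struct file *filp;	/* reference held via fget()/fdget() */
	loff_t offset;		/* offset of the range within filp */
	struct list_head list;
};

/* Would run in the MMIO exit path before punting to userspace.
 * Returns bytes handled, or 0 if no fd-backed range matches. */
static int kvm_fd_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr,
			     int len, const void *v)
{
	struct kvm_fd_mmio_region *r;
	ssize_t ret;

	list_for_each_entry(r, &vcpu->kvm->fd_mmio_regions, list) {
		if (addr < r->base || addr + len > r->base + r->len)
			continue;
		/* Cross the module boundary through the file itself,
		 * with no exit to userspace; kernel_write() wraps the
		 * set_fs() dance needed to pass a kernel buffer down
		 * to f_op->write (v4.4-era signature, offset by value). */
		ret = kernel_write(r->filp, v, len,
				   r->offset + (addr - r->base));
		return ret < 0 ? 0 : ret;
	}
	return 0;
}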
This is OK for KVMGT: going from fops to the vgpu device-model would
always be simple. The only question is, how is the KVM hypervisor
supposed to get the fd on VM exits? Copy-and-pasting the current
implementation of vcpu_mmio_write(), it seems nothing but the GPA and
len are provided:

static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
			   const void *v)
{
	int handled = 0;
	int n;

	do {
		n = min(len, 8);
		if (!(vcpu->arch.apic &&
		      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev,
					  addr, n, v)) &&
		    kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
			break;
		handled += n;
		addr += n;
		len -= n;
		v += n;
	} while (len);

	return handled;
}

If we back a GPA range with a fd, won't this also be a 'backdoor'?

> Thanks,
>
> Alex

--
Thanks,
Jike
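On the question of how KVM gets the fd: under Alex's proposal it would
presumably arrive ahead of time, via a registration call from
userspace (QEMU), not at VM-exit time; the exit path would then only
match the GPA against registered ranges. A hypothetical
registration-side sketch, reusing the invented names above (the ioctl
handler and uAPI struct are illustrative, not real KVM uAPI):

/* Hypothetical uAPI argument, as it might be copied in from
 * userspace; these fields are invented for illustration. */
struct kvm_fd_mmio {
	__u64 gpa;	/* guest-physical base of the MMIO range */
	__u64 len;	/* length of the range in bytes */
	__s32 fd;	/* vfio/vgpu device fd backing the range */
	__u64 offset;	/* offset of the range within that fd */
};

/* Hypothetical VM ioctl handler: KVM takes its own reference to the
 * file up front, so the VM-exit hot path never resolves an fd. */
static int kvm_vm_ioctl_set_fd_mmio(struct kvm *kvm,
				    struct kvm_fd_mmio *args)
{
	struct kvm_fd_mmio_region *r;

	r = kzalloc(sizeof(*r), GFP_KERNEL);
	if (!r)
		return -ENOMEM;

	r->filp = fget(args->fd);	/* held until unregistration */
	if (!r->filp) {
		kfree(r);
		return -EBADF;
	}
	r->base = args->gpa;
	r->len = args->len;
	r->offset = args->offset;
	list_add(&r->list, &kvm->fd_mmio_regions);	/* locking elided */
	return 0;
}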