Hi Paolo,

> Slightly rephrased:
>
> (1) add shadow doorbell buffer and ioeventfd support into QEMU NVMe
> emulation, which will reduce # of VM-exits and make them less expensive
> (reduce VCPU latency).
>
> (2) add iothread support to QEMU NVMe emulation. This can also be used
> to eliminate VM-exits because iothreads can do adaptive polling.
>
> (1) and (2) seem okay for at most 1.5 months, especially if you already
> have experience with QEMU.

Thanks a lot for rephrasing it to make it clearer. Yes, I think (1) and (2)
should be achievable in 1-1.5 months. What needs to be added on top of FEMU
is ioeventfd support for QEMU NVMe and using an iothread for the polling
(the current FEMU implementation uses a periodic timer to poll the shadow
buffer directly; moving to an iothread should deliver better performance).
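To make the timer-versus-iothread point concrete, here is a tiny standalone
sketch of the data flow I have in mind (plain C and pthreads, nothing
QEMU-specific, and all names are invented for illustration): a dedicated
polling thread watches the shadow doorbell buffer for new tail values, so no
doorbell MMIO write has to be trapped per I/O.

/*
 * Standalone sketch of shadow-doorbell polling (not QEMU code; names
 * are made up). The "guest" publishes new SQ tail values into a shared
 * buffer and a dedicated device-side thread picks them up.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_QUEUES 4

static _Atomic uint32_t shadow_sq_tail[NUM_QUEUES];  /* shared buffer */

static void *device_poll_thread(void *arg)
{
    uint32_t last_seen[NUM_QUEUES] = { 0 };
    (void)arg;

    for (;;) {
        for (int q = 0; q < NUM_QUEUES; q++) {
            uint32_t tail = atomic_load(&shadow_sq_tail[q]);
            if (tail != last_seen[q]) {
                /* New submissions: process entries last_seen[q]..tail-1. */
                printf("SQ %d: tail %u -> %u\n", q, last_seen[q], tail);
                last_seen[q] = tail;
            }
        }
        usleep(10);  /* a real iothread would poll adaptively instead */
    }
    return NULL;
}

int main(void)
{
    pthread_t poller;
    pthread_create(&poller, NULL, device_poll_thread, NULL);

    /* "Guest" side: ring the shadow doorbell a few times. */
    for (uint32_t i = 1; i <= 5; i++) {
        atomic_store(&shadow_sq_tail[i % NUM_QUEUES], i);
        usleep(1000);
    }
    sleep(1);
    return 0;
}

In the real implementation the loop would of course live in an iothread and
fall back to ioeventfd notification when polling finds nothing, but the data
flow would be the same.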
> Including a RAM disk backend in QEMU would be nice too, and it may
> interest you as it would reduce the delta between upstream QEMU and
> FEMU. So this could be another idea.

Glad you're also interested in this part. This can definitely be part of
the project.

> For (3), there is work in progress to add multiqueue support to QEMU's
> block device layer. We're hoping to get the infrastructure part in
> (removing the AioContext lock) during the first half of 2018. As you
> say, we can see what the workload will be.

Thanks for letting me know. Could you provide a link to the ongoing
multiqueue implementation? I would like to learn how it is done. :)

> However, the main issue that I'd love to see tackled is interrupt
> mitigation. With higher rates of I/O ops and high queue depth (e.g.
> 32), it's common for the guest to become slower when you introduce
> optimizations in QEMU. The reason is that lower latency causes higher
> interrupt rates and that in turn slows down the guest. If you have any
> ideas on how to work around this, I would love to hear about it.

Yeah, interrupt overhead (host-to-guest notification) is indeed a headache.
I have thought about this, and one intuitive optimization is to add
interrupt coalescing support to QEMU NVMe. We could use some heuristic to
batch I/O completions back to the guest, thus reducing the number of
interrupts. The heuristic could be time-window based (i.e., for I/Os
completed within the same time window, we raise only one interrupt per CQ).
I believe there are several research papers that achieve direct interrupt
delivery without exits for para-virtual devices, but those need KVM-side
modifications, so they are probably not a good fit here.
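To make the time-window heuristic concrete, here is a rough standalone
sketch (invented names, not a patch against the QEMU NVMe device) of how
completions could be batched per CQ:

/*
 * Standalone sketch of time-window interrupt coalescing. Completions
 * arriving within COALESCE_WINDOW_NS of the last interrupt are batched
 * and signalled with a single interrupt per CQ.
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define COALESCE_WINDOW_NS 100000  /* 100us window, purely an example */

typedef struct {
    int cqid;
    unsigned pending;        /* completions queued since last IRQ */
    uint64_t last_irq_ns;    /* when we last raised an interrupt */
} CoalescedCQ;

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void raise_irq(CoalescedCQ *cq)
{
    printf("CQ %d: one interrupt for %u completions\n", cq->cqid, cq->pending);
    cq->pending = 0;
    cq->last_irq_ns = now_ns();
}

/* Called for every completed I/O instead of signalling immediately. */
static void post_completion(CoalescedCQ *cq)
{
    cq->pending++;
    if (now_ns() - cq->last_irq_ns >= COALESCE_WINDOW_NS) {
        raise_irq(cq);
    }
    /* Otherwise a timer (not shown) flushes cq->pending when the
     * window expires, so completions are never delayed indefinitely. */
}

int main(void)
{
    CoalescedCQ cq = { .cqid = 1 };
    for (int i = 0; i < 32; i++) {
        post_completion(&cq);   /* burst of completions in one window */
    }
    raise_irq(&cq);             /* simulate the window-expiry flush */
    return 0;
}

The window length would need tuning, and it could probably honor the
aggregation time/threshold the guest configures via the NVMe Interrupt
Coalescing feature (Set Features), but the point is that a burst of
completions ends up in a single interrupt.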
> In any case, I would very much like to mentor this project. Let me know
> if you have any more ideas on how to extend it!

Great to know that you'd like to mentor the project! If so, can we make it
an official project idea and put it on the QEMU GSoC page?

Thank you so much for the feedback and for agreeing to be a potential
mentor for this project. I'm happy to see that you also think this is
something worth putting effort into.

Best,
Huaicheng

On Mon, Feb 26, 2018 at 2:45 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 25/02/2018 23:52, Huaicheng Li wrote:
> > I remember there were some discussions back in 2015 about this, but I
> > don't see it finally done. For this project, I think we can go in three
> > steps: (1). add the shadow doorbell buffer support into QEMU NVMe
> > emulation, this will reduce # of VM-exits. (2). replace current timers
> > used by QEMU NVMe with a separate polling thread, thus we can completely
> > eliminate VM-exits. (3). Even further, we can adapt the architecture to
> > use one polling thread for each NVMe queue pair, thus it's possible to
> > provide more performance. (step 3 can be left for next year if the
> > workload is too much for 3 months).
>
> Slightly rephrased:
>
> (1) add shadow doorbell buffer and ioeventfd support into QEMU NVMe
> emulation, which will reduce # of VM-exits and make them less expensive
> (reduce VCPU latency).
>
> (2) add iothread support to QEMU NVMe emulation. This can also be used
> to eliminate VM-exits because iothreads can do adaptive polling.
>
> (1) and (2) seem okay for at most 1.5 months, especially if you already
> have experience with QEMU.
>
> For (3), there is work in progress to add multiqueue support to QEMU's
> block device layer. We're hoping to get the infrastructure part in
> (removing the AioContext lock) during the first half of 2018. As you
> say, we can see what the workload will be.
>
> Including a RAM disk backend in QEMU would be nice too, and it may
> interest you as it would reduce the delta between upstream QEMU and
> FEMU. So this could be another idea.
>
> However, the main issue that I'd love to see tackled is interrupt
> mitigation. With higher rates of I/O ops and high queue depth (e.g.
> 32), it's common for the guest to become slower when you introduce
> optimizations in QEMU. The reason is that lower latency causes higher
> interrupt rates and that in turn slows down the guest. If you have any
> ideas on how to work around this, I would love to hear about it.
>
> In any case, I would very much like to mentor this project. Let me know
> if you have any more ideas on how to extend it!
>
> Paolo