Re: [PATCH v2 00/10] RFC: NVME MDEV

2019-05-09 Thread Keith Busch
On Thu, May 09, 2019 at 02:12:55AM -0700, Stefan Hajnoczi wrote:
> On Mon, May 06, 2019 at 12:04:06PM +0300, Maxim Levitsky wrote:
> > On top of that, it is expected that newer hardware will support PASID-based
> > device subdivision, which will allow us to _directly_ pass through the
> > submission queues of the device and _force_ us to use the NVMe protocol for
> > the frontend.
> 
> I don't understand the PASID argument.  The data path will be 100%
> passthrough and this driver won't be necessary.

We still need a non-passthrough component to handle the slow path: the
non-doorbell controller registers and the admin queue. That doesn't
necessarily need to be a kernel driver, though.


Re: [PATCH v2 00/10] RFC: NVME MDEV

2019-05-09 Thread Stefan Hajnoczi
On Mon, May 06, 2019 at 12:04:06PM +0300, Maxim Levitsky wrote:
> On top of that, it is expected that newer hardware will support PASID-based
> device subdivision, which will allow us to _directly_ pass through the
> submission queues of the device and _force_ us to use the NVMe protocol for
> the frontend.

I don't understand the PASID argument.  The data path will be 100%
passthrough and this driver won't be necessary.

In the meantime there is already SPDK for users who want polling.  This
driver's main feature is that the host can still access the device at
the same time as VMs, but I'm not sure that's useful in
performance-critical use cases, and for non-performance use cases this
driver isn't necessary.

Stefan


Re: [PATCH v2 00/10] RFC: NVME MDEV

2019-05-08 Thread Paolo Bonzini
On 06/05/19 07:57, Christoph Hellwig wrote:
> 
> Or to put it another way: unless your paravirt interface requires
> zero specific changes to the core nvme code it is not acceptable at all.

I'm not sure it's possible to attain that goal; however, I agree that
putting the control plane in the kernel is probably not a good idea, so
the vhost model is better than mdev for this use case.

In addition, unless it is possible for the driver to pass the queue
directly to the guests, there probably isn't much advantage in putting
the driver in the kernel at all.  Maxim, do you have numbers for 1) QEMU
with aio 2) QEMU with VFIO-based userspace nvme driver 3) nvme-mdev?

Paolo


Re: [PATCH v2 00/10] RFC: NVME MDEV

2019-05-06 Thread Keith Busch
On Mon, May 06, 2019 at 05:57:52AM -0700, Christoph Hellwig wrote:
> > However, similar to (1), when the driver supports devices with hardware-based
> > passthrough, it will have to dedicate a bunch of queues to the guest,
> > configure them with the appropriate PASID, and then let the guest use
> > these queues directly.
> 
> We will not let you abuse the nvme queues for anything else.  We had
> that discussion with the mellanox offload and it is not only unsafe but
> also adds way too much crap to the core nvme code for corner cases.
> 
> Or to put it another way: unless your paravirt interface requires
> zero specific changes to the core nvme code it is not acceptable at all.

I agree we shouldn't specialize generic queues for this, but I think
it is worth revisiting driver support for assignable hardware resources
iff the specification defines it.

Until then, you can always steer processes to different queues by
assigning them to different CPUs.
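
(As a concrete illustration of that, here is a minimal userspace sketch, not
part of the patch series: it pins the calling process to one CPU, so the block
I/O it issues goes through the nvme queue that the driver maps to that CPU.
The CPU number is an arbitrary example.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(2, &set);	/* CPU 2 is an arbitrary choice */

	/* Pin this process; with the default per-CPU queue mapping, I/O it
	 * submits from now on uses the nvme queue associated with CPU 2. */
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}

	/* ... issue I/O from here ... */
	return 0;
}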


Re: [PATCH v2 00/10] RFC: NVME MDEV

2019-05-06 Thread Christoph Hellwig
On Mon, May 06, 2019 at 12:04:06PM +0300, Maxim Levitsky wrote:
> 1. Frontend interface (the interface that faces the guest/userspace/etc):
> 
> VFIO/mdev is just a way to expose a (partially) software-defined PCIe device
> to a guest.
> 
> Vhost, on the other hand, is an interface that is hardcoded and optimized for
> virtio. It can be extended to be PCI-generic, but why do so if we already
> have VFIO.

I wouldn't say vhost is virtio-specific.  At least Hannes' vhost-nvme
doesn't get impacted by that a whole lot.

> 2. Backend interface (the connection to the real nvme device):
> 
> Currently the backend interface _doesn't have_ to allocate a dedicated queue
> and bypass the block layer. It can use the block layer's submit_bio/blk_poll
> as I demonstrate in the last patch in the series. It's 2x slower, though.
> 
> However, similar to (1), when the driver supports devices with hardware-based
> passthrough, it will have to dedicate a bunch of queues to the guest, configure
> them with the appropriate PASID, and then let the guest use these queues
> directly.

We will not let you abuse the nvme queues for anything else.  We had
that discussion with the mellanox offload and it is not only unsafe but
also adds way too much crap to the core nvme code for corner cases.

Or to put it another way: unless your paravirt interface requires
zero specific changes to the core nvme code it is not acceptable at all.


Re: [PATCH v2 00/10] RFC: NVME MDEV

2019-05-06 Thread Maxim Levitsky
On Fri, 2019-05-03 at 14:18 +0200, Christoph Hellwig wrote:
> I simply don't get the point of this series.
> 
> MDEV is an interface for exposing parts of a device to a userspace
> program / VM.  But what this series appears to do is to expose a
> purely software-defined nvme controller to userspace.  Which in
> principle is a good idea, but we have a much better framework for that,
> which is called vhost.

Let me explain the reasons for choosing the IO interfaces as I did:

1. Frontend interface (the interface that faces the guest/userspace/etc):

VFIO/mdev is just a way to expose a (partially) software-defined PCIe device
to a guest.

Vhost, on the other hand, is an interface that is hardcoded and optimized for
virtio. It can be extended to be PCI-generic, but why do so if we already
have VFIO.

So the biggest advantage of using VFIO _currently_ is that I don't add any new
API/ABI to the kernel, and userspace (qemu) doesn't need to learn to use a
new API.

It is also worth noting that VFIO supports nesting out of the box, so I don't
need to worry about it (vhost has to deal with that at the protocol level
using its IOTLB facility).

On top of that, it is expected that newer hardware will support PASID-based
device subdivision, which will allow us to _directly_ pass through the
submission queues of the device and _force_ us to use the NVMe protocol for
the frontend.

2. Backend interface (the connection to the real nvme device):

Currently the backend interface _doesn't have_ to allocate a dedicated queue
and bypass the block layer. It can use the block layer's submit_bio/blk_poll
as I demonstrate in the last patch in the series. It's 2x slower, though.

However, similar to (1), when the driver supports devices with hardware-based
passthrough, it will have to dedicate a bunch of queues to the guest, configure
them with the appropriate PASID, and then let the guest use these queues
directly.


Best regards,
Maxim Levitsky



Re: [PATCH v2 00/10] RFC: NVME MDEV

2019-05-03 Thread Christoph Hellwig
I simply don't get the point of this series.

MDEV is an interface for exposing parts of a device to a userspace
program / VM.  But what this series appears to do is to expose a
purely software-defined nvme controller to userspace.  Which in
principle is a good idea, but we have a much better framework for that,
which is called vhost.


[PATCH v2 00/10] RFC: NVME MDEV

2019-05-02 Thread Maxim Levitsky
Hi everyone!

In this patch series, I would like to introduce my take on the problem of
virtualizing storage as fast as possible, with an emphasis on low latency.

For more information on the inner workings, you can look at V1 of the
submission at
https://lkml.org/lkml/2019/3/19/458


*** Changes from V1 ***

* Some correctness fixes that slipped in at the last minute.
Nothing major though; all my recent attempts to crash this driver
were unsuccessful, and it pretty much copes with anything I throw at it.


* Experimental block layer mode: in this mode, the mdev driver sends
the IO through the block layer (using submit_bio), and then polls for
the completions using blk_poll. The last commit in the series adds this.
The performance overhead of this is about 2x compared to direct
dedicated-queue submission, though.
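
To make the flow concrete, here is a rough sketch of that submission path (not
the actual patch code; the helper names mdev_blk_end_io/mdev_blk_read_polled
are made up, and it assumes roughly the 5.x-era block API where submit_bio()
returns a blk_qc_t cookie and blk_poll() spins on the matching hardware queue):

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Completion callback: flag the bio as done so the polling loop can stop. */
static void mdev_blk_end_io(struct bio *bio)
{
	WRITE_ONCE(*(bool *)bio->bi_private, true);
}

static int mdev_blk_read_polled(struct block_device *bdev, struct page *page,
				sector_t sector)
{
	bool done = false;
	struct bio *bio;
	blk_qc_t cookie;
	int ret;

	bio = bio_alloc(GFP_KERNEL, 1);
	if (!bio)
		return -ENOMEM;

	bio_set_dev(bio, bdev);
	bio->bi_iter.bi_sector = sector;
	bio->bi_opf = REQ_OP_READ | REQ_HIPRI;	/* REQ_HIPRI selects polling */
	bio->bi_end_io = mdev_blk_end_io;
	bio->bi_private = &done;
	bio_add_page(bio, page, PAGE_SIZE, 0);

	cookie = submit_bio(bio);

	/* Spin on the completion queue instead of waiting for an interrupt. */
	while (!READ_ONCE(done))
		blk_poll(bdev_get_queue(bdev), cookie, true);

	ret = blk_status_to_errno(bio->bi_status);
	bio_put(bio);
	return ret;
}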


For this patch series, I would like to hear your opinion on the generic
block layer code, and on the code as a whole.

Please ignore the fact that the code doesn't use the nvme target code
(for instance, it already has generic block device code in io-cmd-bdev.c).
I will switch to it in the next version of these patches, although I think
I should keep the option of using the direct, reserved IO queue version too.


*** Performance results ***

Two tests were run, this time focusing only on latency and especially
on measuring the overhead of the translation.

I used the nvme-5.2 branch of http://git.infradead.org/nvme.git
which includes the small IO performance improvements that show a modest
perf increase in the generic block layer code path.

So the tests were:

** No interrupts test **

For that test, I used spdk's fio plugin as the test client. It binds to the
device using vfio and then reads/writes from it using polling.
For the tests that run in the guest, I used the virtual IOMMU and ran that
spdk test in the guest.

IOW, the tests were:

1. spdk in the host with fio
2. nvme-mdev in the host, spdk in the guest with fio
3. spdk in the host, spdk in the guest with fio

The fio (with the spdk plugin) was running at queue depth 1.
In addition to that, I added rdtsc-based instrumentation to my mdev driver
to read the number of cycles it takes to translate a command, the number
of cycles it takes to poll for the response, and the number of cycles
it takes to send the interrupt to the guest.
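
(In case it helps, here is a rough sketch of what such instrumentation can
look like; the names are hypothetical and this is not the actual driver code.
rdtsc() is the x86 helper from asm/msr.h, and cycles convert to usec by
dividing by tsc_khz / 1000.)

#include <linux/types.h>
#include <asm/msr.h>		/* rdtsc() */

struct mdev_perf_counter {
	u64 cycles;	/* total cycles spent in the instrumented step */
	u64 samples;	/* number of times the step was executed */
};

static struct mdev_perf_counter translate_stats;

/* Wrap one command translation with a TSC read before and after. */
static void mdev_translate_one_cmd_timed(void)
{
	u64 start = rdtsc();

	/* ... translate one guest NVMe command here ... */

	translate_stats.cycles += rdtsc() - start;
	translate_stats.samples++;
}

/* Average cycles per command; divide by (tsc_khz / 1000) for usec. */
static u64 mdev_translate_avg_cycles(void)
{
	return translate_stats.samples ?
		translate_stats.cycles / translate_stats.samples : 0;
}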

You can find the script at

'https://gitlab.com/maximlevitsky/vm_scripts/blob/master/tests/stest0.sh'


** Interrupts test **

For this test, the client was the kernel nvme driver, with fio running on top
of it, also at queue depth 1.
The test is at

'https://gitlab.com/maximlevitsky/vm_scripts/blob/master/tests/test0.sh'


The tests were done on an Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz,
with an Optane SSD 900P NVMe drive.

The system is dual socket, but it was booted with a configuration that
allowed fully using only NUMA node 0, where the device is attached.


** The results for the non-interrupt test **

host:
BW   :574.93 MiB/s  (stdev:0.73 MiB/s)
IOPS :147,182  (stdev:186 IOPS)
SLAT :0.113 us  (stdev:0.055 us)
CLAT :6.402 us  (stdev:1.146 us)
LAT  :6.516 us  (stdev:1.148 us)

mdev/direct:

BW   :535.99 MiB/s  (stdev:2.62 MiB/s)
IOPS :137,214  (stdev:670 IOPS)
SLAT :0.128 us  (stdev:3.074 us)
CLAT :6.909 us  (stdev:4.384 us)
LAT  :7.038 us  (stdev:6.892 us)

commands translated : avg cycles: 668.032  avg time (usec): 0.196  total: 8239732
commands completed  : avg cycles: 411.791  avg time (usec): 0.121  total: 8239732

mdev/block generic:

BW   :512.99 MiB/s  (stdev:2.5 MiB/s)
IOPS :131,324  (stdev:641 IOPS)
SLAT :0.13 us  (stdev:3.143 us)
CLAT :7.237 us  (stdev:4.516 us)
LAT  :7.367 us  (stdev:7.069 us)

commands translated : avg cycles: 1509.207  avg time (usec): 0.444  total: 7879519
commands completed  : avg cycles: 1005.299  avg time (usec): 0.296  total: 7879519

*Here you clearly see the overhead added by the block layer*


spdk:
BW   :535.77 MiB/s  (stdev:0.86 MiB/s)
IOPS :137,157  (stdev:220 IOPS)
SLAT :0.135 us  (stdev:0.073 us)
CLAT :6.905 us  (stdev:1.166 us)
LAT  :7.04 us  (stdev:1.168 us)


qemu userspace nvme driver:
BW   :151.56 MiB/s  (stdev:0.38 MiB/s)
IOPS :38,799  (stdev:97 IOPS)
SLAT :4.655 us  (stdev:0.714 us)
CLAT :20.856 us  (stdev:1.993 us)
LAT  :25.512 us  (stdev:2.086 us)

** The results from the interrupt test **

host:
BW   :428.37 MiB/s  (stdev:0.44 MiB/s)
IOPS :109,662  (stdev:112 IOPS)
SLAT :0.913 us  (stdev:0.077 us)
CLAT :7.894 us  (stdev:1.152 us)
LAT  :8.844 us  (stdev:1.155 us)

mdev/direct:
BW   :401.33 MiB/s  (stdev:1.99 MiB/s)
IOPS :102,739  (stdev:509 IOPS)
SLAT :0.916 u