Re: [PATCH v3 RFC 2/2] nvme: improve performance for virtual NVMe devices
On Tue, Aug 16, 2016 at 01:45:03PM -0700, J Freyensee wrote:
> On Mon, 2016-08-15 at 22:41 -0300, Helen Koike wrote:
> > 	struct nvmf_connect_command connect;
> > 	struct nvmf_property_set_command prop_set;
> > 	struct nvmf_property_get_command prop_get;
> > +	struct nvme_doorbell_memory doorbell_memory;
> > 	};
> > };
>
> This looks like a new NVMe command being introduced, not found in the
> latest NVMe specs (NVMe 1.2.1 spec or NVMe-over-Fabrics 1.0 spec)?
>
> This is a big NACK, the command needs to be part of the NVMe standard
> before adding it to the NVMe code base (this is exactly how the NVMe-over-
> Fabrics standard got implemented). I would bring your proposal to
> nvmexpress.org.

While this is an approved TPAR up for consideration soon, I don't think
we want to include this in mainline before it is ratified, lest the
current form conflict with the future spec. We can still review and
comment on the code, though that's not normally done publicly until the
spec is also public. The proposal originated from a public RFC, though,
so it's not a secret.
Re: [PATCH v3 RFC 2/2] nvme: improve performance for virtual NVMe devices
On Mon, 2016-08-15 at 22:41 -0300, Helen Koike wrote:
> 
> +struct nvme_doorbell_memory {
> +	__u8	opcode;
> +	__u8	flags;
> +	__u16	command_id;
> +	__u32	rsvd1[5];
> +	__le64	prp1;
> +	__le64	prp2;
> +	__u32	rsvd12[6];
> +};
> +
>  struct nvme_command {
>  	union {
>  		struct nvme_common_command common;
> @@ -845,6 +858,7 @@ struct nvme_command {
>  		struct nvmf_connect_command connect;
>  		struct nvmf_property_set_command prop_set;
>  		struct nvmf_property_get_command prop_get;
> +		struct nvme_doorbell_memory doorbell_memory;
>  	};
>  };

This looks like a new NVMe command being introduced, not found in the
latest NVMe specs (NVMe 1.2.1 spec or NVMe-over-Fabrics 1.0 spec)?

This is a big NACK, the command needs to be part of the NVMe standard
before adding it to the NVMe code base (this is exactly how the NVMe-over-
Fabrics standard got implemented). I would bring your proposal to
nvmexpress.org.

Jay

> 
> @@ -934,6 +948,9 @@ enum {
>  	/*
>  	 * Media and Data Integrity Errors:
>  	 */
> +#ifdef CONFIG_NVME_VDB
> +	NVME_SC_DOORBELL_MEMORY_INVALID	= 0x1C0,
> +#endif
>  	NVME_SC_WRITE_FAULT		= 0x280,
>  	NVME_SC_READ_ERROR		= 0x281,
>  	NVME_SC_GUARD_CHECK		= 0x282,
[PATCH v3 RFC 2/2] nvme: improve performance for virtual NVMe devices
From: Rob Nelson

This change provides a mechanism to reduce the number of MMIO doorbell
writes for the NVMe driver. When running in a virtualized environment
like QEMU, the cost of an MMIO is quite hefty.

The main idea of the patch is to provide the device with two memory
locations:
 1) to store the doorbell values so they can be looked up without the
    doorbell MMIO write
 2) to store an event index

I believe the doorbell value is obvious; the event index not so much.
Similar to the virtio specification, the virtual device can tell the
driver (guest OS) not to write MMIO unless it is writing past this
value. FYI: doorbell values are written by the nvme driver (guest OS)
and the event index is written by the virtual device (host OS).

The patch implements a new admin command that communicates where these
two memory locations reside. If the command fails, the nvme driver
works as before, without any optimizations.

Contributions:
  Eric Northup
  Frank Swiderski
  Ted Tso
  Keith Busch

Just to give an idea of the performance boost with the vendor
extension: running fio [1], a stock NVMe driver gets about 200K read
IOPS; with the vendor patch it gets about 1000K read IOPS. This was
running with a null device, i.e. the backing device simply returned
success on every read IO request.
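The commit message says the event index works like virtio's: the driver skips the MMIO doorbell write unless its new tail passes the device-published event index. A minimal sketch of that check, assuming it mirrors virtio's `vring_need_event` semantics (the function name and signature here are illustrative, not from the patch):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a doorbell MMIO write is needed. The device publishes
 * event_idx; the driver just advanced its tail from old_idx to new_idx.
 * An MMIO is required only if the driver wrote past event_idx. The
 * unsigned 16-bit subtraction makes the comparison robust across index
 * wraparound.
 */
static inline bool nvme_vdb_need_mmio(uint16_t event_idx, uint16_t new_idx,
				      uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) <
	       (uint16_t)(new_idx - old_idx);
}
```

For example, if the device left event_idx equal to the driver's old tail, a single-step advance still triggers the MMIO; if the device set event_idx well ahead, the driver skips the write entirely, which is where the IOPS gain comes from.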
[1] Running on a 4 core machine:
  fio --time_based --name=benchmark --runtime=30 \
    --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32 \
    --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 \
    --rw=randread --blocksize=4k --randrepeat=false

Signed-off-by: Rob Nelson
[mlin: port for upstream]
Signed-off-by: Ming Lin
[koike: updated for upstream]
Signed-off-by: Helen Koike
---
Changes since v2:
	- Add vdb.c and vdb.h; the idea is to keep the code in pci.c
	  clean and to make it easier to integrate with the official nvme
	  extension when the nvme consortium publishes it
	- Remove rmb (I couldn't see why they were necessary here, please
	  let me know if I am wrong)
	- Reposition wmb
	- Transform specific code into helper functions
	- Coding style (checkpatch, remove unnecessary goto, change if
	  statement logic to decrease indentation)
	- Rename feature to CONFIG_NVME_VDB
	- Remove some PCI_VENDOR_ID_GOOGLE checks

 drivers/nvme/host/Kconfig  |  11
 drivers/nvme/host/Makefile |   1 +
 drivers/nvme/host/pci.c    |  29 ++-
 drivers/nvme/host/vdb.c    | 125 +
 drivers/nvme/host/vdb.h    | 118 ++
 include/linux/nvme.h       |  17 ++
 6 files changed, 299 insertions(+), 2 deletions(-)
 create mode 100644 drivers/nvme/host/vdb.c
 create mode 100644 drivers/nvme/host/vdb.h

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index db39d53..d3f4da9 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -43,3 +43,14 @@ config NVME_RDMA
 	  from https://github.com/linux-nvme/nvme-cli.
 
 	  If unsure, say N.
+
+config NVME_VDB
+	bool "NVMe Virtual Doorbell Extension for Improved Virtualization"
+	depends on NVME_CORE
+	---help---
+	  This provides support for the Virtual Doorbell Extension which
+	  reduces the number of required MMIOs to ring doorbells, improving
+	  performance in virtualized environments where MMIO causes a high
+	  overhead.
+
+	  If unsure, say N.
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index 47abcec..d4d0e3d 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -8,6 +8,7 @@ nvme-core-$(CONFIG_BLK_DEV_NVME_SCSI)	+= scsi.o
 nvme-core-$(CONFIG_NVM)			+= lightnvm.o
 
 nvme-y					+= pci.o
+nvme-$(CONFIG_NVME_VDB)			+= vdb.o
 
 nvme-fabrics-y				+= fabrics.o

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cf8b3d7..20bbc33 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -44,6 +44,7 @@
 #include
 
 #include "nvme.h"
+#include "vdb.h"
 
 #define NVME_Q_DEPTH		1024
 #define NVME_AQ_DEPTH		256
@@ -99,6 +100,7 @@ struct nvme_dev {
 	dma_addr_t cmb_dma_addr;
 	u64 cmb_size;
 	u32 cmbsz;
+	struct nvme_vdb_dev vdb_d;
 	struct nvme_ctrl ctrl;
 	struct completion ioq_wait;
 };
@@ -131,6 +133,7 @@ struct nvme_queue {
 	u16 qid;
 	u8 cq_phase;
 	u8 cqe_seen;
+	struct nvme_vdb_queue vdb_q;
 };
 
 /*
@@ -171,6 +174,7 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_id_ns) != 4096);
 	BUILD_BUG_ON(sizeof(struct nvme_lba_range_type) != 64);
 	BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
+	BUILD_BUG_ON(sizeof(struct nvme_doorbell_memory) != 64);
 }
 
 /*
@@ -285,7 +289,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
 	if (++tail == nvmeq->q_depth)
 		tail = 0;
-