Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Tue, 8 May 2018 17:25:24 -0400 Don Dutile wrote: > On 05/08/2018 12:57 PM, Alex Williamson wrote: > > On Mon, 7 May 2018 18:23:46 -0500 > > Bjorn Helgaas wrote: > > > >> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote: > >>> Hi Everyone, > >>> > >>> Here's v4 of our series to introduce P2P based copy offload to NVMe > >>> fabrics. This version has been rebased onto v4.17-rc2. A git repo > >>> is here: > >>> > >>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4 > >>> ... > >> > >>> Logan Gunthorpe (14): > >>>PCI/P2PDMA: Support peer-to-peer memory > >>>PCI/P2PDMA: Add sysfs group to display p2pmem stats > >>>PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset > >>>PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches > >>>docs-rst: Add a new directory for PCI documentation > >>>PCI/P2PDMA: Add P2P DMA driver writer's documentation > >>>block: Introduce PCI P2P flags for request and request queue > >>>IB/core: Ensure we map P2P memory correctly in > >>> rdma_rw_ctx_[init|destroy]() > >>>nvme-pci: Use PCI p2pmem subsystem to manage the CMB > >>>nvme-pci: Add support for P2P memory in requests > >>>nvme-pci: Add a quirk for a pseudo CMB > >>>nvmet: Introduce helper functions to allocate and free request SGLs > >>>nvmet-rdma: Use new SGL alloc/free helper for requests > >>>nvmet: Optionally use PCI P2P memory > >>> > >>> Documentation/ABI/testing/sysfs-bus-pci| 25 + > >>> Documentation/PCI/index.rst| 14 + > >>> Documentation/driver-api/index.rst | 2 +- > >>> Documentation/driver-api/pci/index.rst | 20 + > >>> Documentation/driver-api/pci/p2pdma.rst| 166 ++ > >>> Documentation/driver-api/{ => pci}/pci.rst | 0 > >>> Documentation/index.rst| 3 +- > >>> block/blk-core.c | 3 + > >>> drivers/infiniband/core/rw.c | 13 +- > >>> drivers/nvme/host/core.c | 4 + > >>> drivers/nvme/host/nvme.h | 8 + > >>> drivers/nvme/host/pci.c| 118 +++-- > >>> drivers/nvme/target/configfs.c | 67 +++ > >>> drivers/nvme/target/core.c | 143 - > >>> drivers/nvme/target/io-cmd.c | 3 + > >>> drivers/nvme/target/nvmet.h| 15 + > >>> drivers/nvme/target/rdma.c | 22 +- > >>> drivers/pci/Kconfig| 26 + > >>> drivers/pci/Makefile | 1 + > >>> drivers/pci/p2pdma.c | 814 > >>> + > >>> drivers/pci/pci.c | 6 + > >>> include/linux/blk_types.h | 18 +- > >>> include/linux/blkdev.h | 3 + > >>> include/linux/memremap.h | 19 + > >>> include/linux/pci-p2pdma.h | 118 + > >>> include/linux/pci.h| 4 + > >>> 26 files changed, 1579 insertions(+), 56 deletions(-) > >>> create mode 100644 Documentation/PCI/index.rst > >>> create mode 100644 Documentation/driver-api/pci/index.rst > >>> create mode 100644 Documentation/driver-api/pci/p2pdma.rst > >>> rename Documentation/driver-api/{ => pci}/pci.rst (100%) > >>> create mode 100644 drivers/pci/p2pdma.c > >>> create mode 100644 include/linux/pci-p2pdma.h > >> > >> How do you envison merging this? There's a big chunk in drivers/pci, but > >> really no opportunity for conflicts there, and there's significant stuff in > >> block and nvme that I don't really want to merge. > >> > >> If Alex is OK with the ACS situation, I can ack the PCI parts and you could > >> merge it elsewhere? > > > > AIUI from previously questioning this, the change is hidden behind a > > build-time config option and only custom kernels or distros optimized > > for this sort of support would enable that build option. I'm more than > > a little dubious though that we're not going to have a wave of distros > > enabling this only to get user complaints that they can no longer make > > effective use of their devices for assignment due to the resulting span > > of the IOMMU groups, nor is there any sort of compromise, configure > > the kernel for p2p or device assignment, not both. Is this really such > > a unique feature that distro users aren't going to be asking for both > > features? Thanks, > > > > Alex > At least 1/2 the cases presented to me by existing customers want it in a > tunable kernel, > and tunable btwn two points, if the hw allows it to be 'contained' in that > manner, which > a (layer of) switch(ing) provides. > To me, that means a kernel cmdline parameter to _enable_, and another sysfs > (configfs? ... i'm not a configfs afficionato to say which is best), > method to make two points p2p dma capable. That's not what's done here AIUI. There are also some complicat
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 05/08/2018 12:57 PM, Alex Williamson wrote: On Mon, 7 May 2018 18:23:46 -0500 Bjorn Helgaas wrote: On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote: Hi Everyone, Here's v4 of our series to introduce P2P based copy offload to NVMe fabrics. This version has been rebased onto v4.17-rc2. A git repo is here: https://github.com/sbates130272/linux-p2pmem pci-p2p-v4 ... Logan Gunthorpe (14): PCI/P2PDMA: Support peer-to-peer memory PCI/P2PDMA: Add sysfs group to display p2pmem stats PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches docs-rst: Add a new directory for PCI documentation PCI/P2PDMA: Add P2P DMA driver writer's documentation block: Introduce PCI P2P flags for request and request queue IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]() nvme-pci: Use PCI p2pmem subsystem to manage the CMB nvme-pci: Add support for P2P memory in requests nvme-pci: Add a quirk for a pseudo CMB nvmet: Introduce helper functions to allocate and free request SGLs nvmet-rdma: Use new SGL alloc/free helper for requests nvmet: Optionally use PCI P2P memory Documentation/ABI/testing/sysfs-bus-pci| 25 + Documentation/PCI/index.rst| 14 + Documentation/driver-api/index.rst | 2 +- Documentation/driver-api/pci/index.rst | 20 + Documentation/driver-api/pci/p2pdma.rst| 166 ++ Documentation/driver-api/{ => pci}/pci.rst | 0 Documentation/index.rst| 3 +- block/blk-core.c | 3 + drivers/infiniband/core/rw.c | 13 +- drivers/nvme/host/core.c | 4 + drivers/nvme/host/nvme.h | 8 + drivers/nvme/host/pci.c| 118 +++-- drivers/nvme/target/configfs.c | 67 +++ drivers/nvme/target/core.c | 143 - drivers/nvme/target/io-cmd.c | 3 + drivers/nvme/target/nvmet.h| 15 + drivers/nvme/target/rdma.c | 22 +- drivers/pci/Kconfig| 26 + drivers/pci/Makefile | 1 + drivers/pci/p2pdma.c | 814 + drivers/pci/pci.c | 6 + include/linux/blk_types.h | 18 +- include/linux/blkdev.h | 3 + include/linux/memremap.h | 19 + include/linux/pci-p2pdma.h | 118 + include/linux/pci.h| 4 + 26 files changed, 1579 insertions(+), 56 deletions(-) create mode 100644 Documentation/PCI/index.rst create mode 100644 Documentation/driver-api/pci/index.rst create mode 100644 Documentation/driver-api/pci/p2pdma.rst rename Documentation/driver-api/{ => pci}/pci.rst (100%) create mode 100644 drivers/pci/p2pdma.c create mode 100644 include/linux/pci-p2pdma.h How do you envison merging this? There's a big chunk in drivers/pci, but really no opportunity for conflicts there, and there's significant stuff in block and nvme that I don't really want to merge. If Alex is OK with the ACS situation, I can ack the PCI parts and you could merge it elsewhere? AIUI from previously questioning this, the change is hidden behind a build-time config option and only custom kernels or distros optimized for this sort of support would enable that build option. I'm more than a little dubious though that we're not going to have a wave of distros enabling this only to get user complaints that they can no longer make effective use of their devices for assignment due to the resulting span of the IOMMU groups, nor is there any sort of compromise, configure the kernel for p2p or device assignment, not both. Is this really such a unique feature that distro users aren't going to be asking for both features? Thanks, Alex At least 1/2 the cases presented to me by existing customers want it in a tunable kernel, and tunable btwn two points, if the hw allows it to be 'contained' in that manner, which a (layer of) switch(ing) provides. To me, that means a kernel cmdline parameter to _enable_, and another sysfs (configfs? ... i'm not a configfs afficionato to say which is best), method to make two points p2p dma capable. Worse case, the whole system is one large IOMMU group (current mindset of this static or run-time config option), or best case (over time, more hw), a secure set of the primary system with p2p-enabled sections, that are deemed 'safe' or 'self-inflicting-unsecure', the latter the case of today's VM with an assigned device -- can scribble all over the VM, but no other VM and not the host/HV. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 08/05/18 10:57 AM, Alex Williamson wrote: > AIUI from previously questioning this, the change is hidden behind a > build-time config option and only custom kernels or distros optimized > for this sort of support would enable that build option. I'm more than > a little dubious though that we're not going to have a wave of distros > enabling this only to get user complaints that they can no longer make > effective use of their devices for assignment due to the resulting span > of the IOMMU groups, nor is there any sort of compromise, configure > the kernel for p2p or device assignment, not both. Is this really such > a unique feature that distro users aren't going to be asking for both > features? Thanks, I think it is. But it sounds like the majority want this to be a command line option. So we will look at doing that for v5. Logan
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Mon, 7 May 2018 18:23:46 -0500 Bjorn Helgaas wrote: > On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote: > > Hi Everyone, > > > > Here's v4 of our series to introduce P2P based copy offload to NVMe > > fabrics. This version has been rebased onto v4.17-rc2. A git repo > > is here: > > > > https://github.com/sbates130272/linux-p2pmem pci-p2p-v4 > > ... > > > Logan Gunthorpe (14): > > PCI/P2PDMA: Support peer-to-peer memory > > PCI/P2PDMA: Add sysfs group to display p2pmem stats > > PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset > > PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches > > docs-rst: Add a new directory for PCI documentation > > PCI/P2PDMA: Add P2P DMA driver writer's documentation > > block: Introduce PCI P2P flags for request and request queue > > IB/core: Ensure we map P2P memory correctly in > > rdma_rw_ctx_[init|destroy]() > > nvme-pci: Use PCI p2pmem subsystem to manage the CMB > > nvme-pci: Add support for P2P memory in requests > > nvme-pci: Add a quirk for a pseudo CMB > > nvmet: Introduce helper functions to allocate and free request SGLs > > nvmet-rdma: Use new SGL alloc/free helper for requests > > nvmet: Optionally use PCI P2P memory > > > > Documentation/ABI/testing/sysfs-bus-pci| 25 + > > Documentation/PCI/index.rst| 14 + > > Documentation/driver-api/index.rst | 2 +- > > Documentation/driver-api/pci/index.rst | 20 + > > Documentation/driver-api/pci/p2pdma.rst| 166 ++ > > Documentation/driver-api/{ => pci}/pci.rst | 0 > > Documentation/index.rst| 3 +- > > block/blk-core.c | 3 + > > drivers/infiniband/core/rw.c | 13 +- > > drivers/nvme/host/core.c | 4 + > > drivers/nvme/host/nvme.h | 8 + > > drivers/nvme/host/pci.c| 118 +++-- > > drivers/nvme/target/configfs.c | 67 +++ > > drivers/nvme/target/core.c | 143 - > > drivers/nvme/target/io-cmd.c | 3 + > > drivers/nvme/target/nvmet.h| 15 + > > drivers/nvme/target/rdma.c | 22 +- > > drivers/pci/Kconfig| 26 + > > drivers/pci/Makefile | 1 + > > drivers/pci/p2pdma.c | 814 > > + > > drivers/pci/pci.c | 6 + > > include/linux/blk_types.h | 18 +- > > include/linux/blkdev.h | 3 + > > include/linux/memremap.h | 19 + > > include/linux/pci-p2pdma.h | 118 + > > include/linux/pci.h| 4 + > > 26 files changed, 1579 insertions(+), 56 deletions(-) > > create mode 100644 Documentation/PCI/index.rst > > create mode 100644 Documentation/driver-api/pci/index.rst > > create mode 100644 Documentation/driver-api/pci/p2pdma.rst > > rename Documentation/driver-api/{ => pci}/pci.rst (100%) > > create mode 100644 drivers/pci/p2pdma.c > > create mode 100644 include/linux/pci-p2pdma.h > > How do you envison merging this? There's a big chunk in drivers/pci, but > really no opportunity for conflicts there, and there's significant stuff in > block and nvme that I don't really want to merge. > > If Alex is OK with the ACS situation, I can ack the PCI parts and you could > merge it elsewhere? AIUI from previously questioning this, the change is hidden behind a build-time config option and only custom kernels or distros optimized for this sort of support would enable that build option. I'm more than a little dubious though that we're not going to have a wave of distros enabling this only to get user complaints that they can no longer make effective use of their devices for assignment due to the resulting span of the IOMMU groups, nor is there any sort of compromise, configure the kernel for p2p or device assignment, not both. Is this really such a unique feature that distro users aren't going to be asking for both features? Thanks, Alex
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
> How do you envison merging this? There's a big chunk in drivers/pci, but > really no opportunity for conflicts there, and there's significant stuff in > block and nvme that I don't really want to merge. > > If Alex is OK with the ACS situation, I can ack the PCI parts and you could > merge it elsewhere? Honestly, I don't know. I guess with your ACK on the PCI parts, the vast balance is NVMe stuff so we could look at merging it through that tree. The block patch and IB patch are pretty small. Thanks, Logan
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote: > Hi Everyone, > > Here's v4 of our series to introduce P2P based copy offload to NVMe > fabrics. This version has been rebased onto v4.17-rc2. A git repo > is here: > > https://github.com/sbates130272/linux-p2pmem pci-p2p-v4 > ... > Logan Gunthorpe (14): > PCI/P2PDMA: Support peer-to-peer memory > PCI/P2PDMA: Add sysfs group to display p2pmem stats > PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset > PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches > docs-rst: Add a new directory for PCI documentation > PCI/P2PDMA: Add P2P DMA driver writer's documentation > block: Introduce PCI P2P flags for request and request queue > IB/core: Ensure we map P2P memory correctly in > rdma_rw_ctx_[init|destroy]() > nvme-pci: Use PCI p2pmem subsystem to manage the CMB > nvme-pci: Add support for P2P memory in requests > nvme-pci: Add a quirk for a pseudo CMB > nvmet: Introduce helper functions to allocate and free request SGLs > nvmet-rdma: Use new SGL alloc/free helper for requests > nvmet: Optionally use PCI P2P memory > > Documentation/ABI/testing/sysfs-bus-pci| 25 + > Documentation/PCI/index.rst| 14 + > Documentation/driver-api/index.rst | 2 +- > Documentation/driver-api/pci/index.rst | 20 + > Documentation/driver-api/pci/p2pdma.rst| 166 ++ > Documentation/driver-api/{ => pci}/pci.rst | 0 > Documentation/index.rst| 3 +- > block/blk-core.c | 3 + > drivers/infiniband/core/rw.c | 13 +- > drivers/nvme/host/core.c | 4 + > drivers/nvme/host/nvme.h | 8 + > drivers/nvme/host/pci.c| 118 +++-- > drivers/nvme/target/configfs.c | 67 +++ > drivers/nvme/target/core.c | 143 - > drivers/nvme/target/io-cmd.c | 3 + > drivers/nvme/target/nvmet.h| 15 + > drivers/nvme/target/rdma.c | 22 +- > drivers/pci/Kconfig| 26 + > drivers/pci/Makefile | 1 + > drivers/pci/p2pdma.c | 814 > + > drivers/pci/pci.c | 6 + > include/linux/blk_types.h | 18 +- > include/linux/blkdev.h | 3 + > include/linux/memremap.h | 19 + > include/linux/pci-p2pdma.h | 118 + > include/linux/pci.h| 4 + > 26 files changed, 1579 insertions(+), 56 deletions(-) > create mode 100644 Documentation/PCI/index.rst > create mode 100644 Documentation/driver-api/pci/index.rst > create mode 100644 Documentation/driver-api/pci/p2pdma.rst > rename Documentation/driver-api/{ => pci}/pci.rst (100%) > create mode 100644 drivers/pci/p2pdma.c > create mode 100644 include/linux/pci-p2pdma.h How do you envison merging this? There's a big chunk in drivers/pci, but really no opportunity for conflicts there, and there's significant stuff in block and nvme that I don't really want to merge. If Alex is OK with the ACS situation, I can ack the PCI parts and you could merge it elsewhere? Bjorn
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 04/05/18 08:27 AM, Christian König wrote: > Are you sure that this is more convenient? At least on first glance it > feels overly complicated. > > I mean what's the difference between the two approaches? > > sum = pci_p2pdma_distance(target, [A, B, C, target]); > > and > > sum = pci_p2pdma_distance(target, A); > sum += pci_p2pdma_distance(target, B); > sum += pci_p2pdma_distance(target, C); Well, it's more for consistency with the pci_p2pdma_find() which has to take a list of devices to find a resource which matches all of them. (You can't use multiple calls in that case because all the devices in the list might not have the same set of compatible providers.) That way we can use the same list to check the distance (when the user specifies a device) as we do to find a compatible device (when the user wants to automatically find one. Logan
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
Am 03.05.2018 um 20:43 schrieb Logan Gunthorpe: On 03/05/18 11:29 AM, Christian König wrote: Ok, that is the point where I'm stuck. Why do we need that in one function call in the PCIe subsystem? The problem at least with GPUs is that we seriously don't have that information here, cause the PCI subsystem might not be aware of all the interconnections. For example it isn't uncommon to put multiple GPUs on one board. To the PCI subsystem that looks like separate devices, but in reality all GPUs are interconnected and can access each others memory directly without going over the PCIe bus. I seriously don't want to model that in the PCI subsystem, but rather the driver. That's why it feels like a mistake to me to push all that into the PCI function. Huh? I'm lost. If you have a bunch of PCI devices you can send them as a list to this API, if you want. If the driver is _sure_ they are all the same, you only have to send one. In your terminology, you'd just have to call the interface with: pci_p2pdma_distance(target, [initiator, target]) Ok, I expected that something like that would do it. So just to confirm: When I have a bunch of GPUs which could be the initiator I only need to do "pci_p2pdma_distance(target, [first GPU, target]);" and not "pci_p2pdma_distance(target, [first GPU, second GPU, third GPU, forth, target])" ? Why can't we model that as two separate transactions? You could, but this is more convenient for users of the API that need to deal with multiple devices (and manage devices that may be added or removed at any time). Are you sure that this is more convenient? At least on first glance it feels overly complicated. I mean what's the difference between the two approaches? sum = pci_p2pdma_distance(target, [A, B, C, target]); and sum = pci_p2pdma_distance(target, A); sum += pci_p2pdma_distance(target, B); sum += pci_p2pdma_distance(target, C); Yeah, same for me. If Bjorn is ok with that specialized NVM functions that I'm fine with that as well. I think it would just be more convenient when we can come up with functions which can handle all use cases, cause there still seems to be a lot of similarities. The way it's implemented is more general and can handle all use cases. You are arguing for a function that can handle your case (albeit with a bit more fuss) but can't handle mine and is therefore less general. Calling my interface specialized is wrong. Well at the end of the day you only need to convince Bjorn of the interface, so I'm perfectly fine with it as long as it serves my use case as well :) But I still would like to understand your intention, cause that really helps not to accidentally break something in the long term. Now when I take a look at the pure PCI hardware level, what I have is a transaction between an initiator and a target, and not multiple devices in one operation. I mean you must have a very good reason that you now want to deal with multiple devices in the software layer, but neither from the code nor from your explanation that reason becomes obvious to me. Thanks, Christian. Logan
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 03/05/18 11:29 AM, Christian König wrote: > Ok, that is the point where I'm stuck. Why do we need that in one > function call in the PCIe subsystem? > > The problem at least with GPUs is that we seriously don't have that > information here, cause the PCI subsystem might not be aware of all the > interconnections. > > For example it isn't uncommon to put multiple GPUs on one board. To the > PCI subsystem that looks like separate devices, but in reality all GPUs > are interconnected and can access each others memory directly without > going over the PCIe bus. > > I seriously don't want to model that in the PCI subsystem, but rather > the driver. That's why it feels like a mistake to me to push all that > into the PCI function. Huh? I'm lost. If you have a bunch of PCI devices you can send them as a list to this API, if you want. If the driver is _sure_ they are all the same, you only have to send one. In your terminology, you'd just have to call the interface with: pci_p2pdma_distance(target, [initiator, target]) > Why can't we model that as two separate transactions? You could, but this is more convenient for users of the API that need to deal with multiple devices (and manage devices that may be added or removed at any time). > Yeah, same for me. If Bjorn is ok with that specialized NVM functions > that I'm fine with that as well. > > I think it would just be more convenient when we can come up with > functions which can handle all use cases, cause there still seems to be > a lot of similarities. The way it's implemented is more general and can handle all use cases. You are arguing for a function that can handle your case (albeit with a bit more fuss) but can't handle mine and is therefore less general. Calling my interface specialized is wrong. Logan
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
Am 03.05.2018 um 17:59 schrieb Logan Gunthorpe: On 03/05/18 03:05 AM, Christian König wrote: Second question is how to you want to handle things when device are not behind the same root port (which is perfectly possible in the cases I deal with)? I think we need to implement a whitelist. If both root ports are in the white list and are on the same bus then we return a larger distance instead of -1. Sounds good. Third question why multiple clients? That feels a bit like you are pushing something special to your use case into the common PCI subsystem. Something which usually isn't a good idea. No, I think this will be pretty standard. In the simple general case you are going to have one provider and at least two clients (one which writes the memory and one which reads it). However, one client is likely, but not necessarily, the same as the provider. Ok, that is the point where I'm stuck. Why do we need that in one function call in the PCIe subsystem? The problem at least with GPUs is that we seriously don't have that information here, cause the PCI subsystem might not be aware of all the interconnections. For example it isn't uncommon to put multiple GPUs on one board. To the PCI subsystem that looks like separate devices, but in reality all GPUs are interconnected and can access each others memory directly without going over the PCIe bus. I seriously don't want to model that in the PCI subsystem, but rather the driver. That's why it feels like a mistake to me to push all that into the PCI function. In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block devices. The code doesn't care which device provides the memory as it could be the RDMA device or one/all of the block devices (or, in theory, a completely separate device with P2P-able memory). However, it does require that all devices involved are accessible per pci_p2pdma_distance() or it won't use P2P transactions. I could also imagine other use cases: ie. an RDMA NIC sends data to a GPU for processing and then sends the data to an NVMe device for storage (or vice-versa). In this case we have 3 clients and one provider. Why can't we model that as two separate transactions? E.g. one from the RDMA NIC to the GPU memory. And another one from the GPU memory to the NVMe device. That would also match how I get this information from userspace. As far as I can see we need a function which return the distance between a initiator and target device. This function then returns -1 if the transaction can't be made and a positive value otherwise. If you need to make a simpler convenience function for your use case I'm not against it. Yeah, same for me. If Bjorn is ok with that specialized NVM functions that I'm fine with that as well. I think it would just be more convenient when we can come up with functions which can handle all use cases, cause there still seems to be a lot of similarities. We also need to give the direction of the transaction and have a whitelist root complex PCI-IDs which can handle P2P transactions from different ports for a certain DMA direction. Yes. In the NVMeof case we need all devices to be able to DMA in both directions so we did not need the DMA direction. But I can see this being useful once we add the whitelist. Ok, I agree that can be added later on. For simplicity let's assume for now we always to bidirectional transfers. Thanks for the explanation, Christian. Logan
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 03/05/18 03:05 AM, Christian König wrote: > Ok, I'm still missing the big picture here. First question is what is > the P2PDMA provider? Well there's some pretty good documentation in the patchset for this, but in short, a provider is a device that provides some kind of P2P resource (ie. BAR memory, or perhaps a doorbell register -- only memory is supported at this time). > Second question is how to you want to handle things when device are not > behind the same root port (which is perfectly possible in the cases I > deal with)? I think we need to implement a whitelist. If both root ports are in the white list and are on the same bus then we return a larger distance instead of -1. > Third question why multiple clients? That feels a bit like you are > pushing something special to your use case into the common PCI > subsystem. Something which usually isn't a good idea. No, I think this will be pretty standard. In the simple general case you are going to have one provider and at least two clients (one which writes the memory and one which reads it). However, one client is likely, but not necessarily, the same as the provider. In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block devices. The code doesn't care which device provides the memory as it could be the RDMA device or one/all of the block devices (or, in theory, a completely separate device with P2P-able memory). However, it does require that all devices involved are accessible per pci_p2pdma_distance() or it won't use P2P transactions. I could also imagine other use cases: ie. an RDMA NIC sends data to a GPU for processing and then sends the data to an NVMe device for storage (or vice-versa). In this case we have 3 clients and one provider. > As far as I can see we need a function which return the distance between > a initiator and target device. This function then returns -1 if the > transaction can't be made and a positive value otherwise. If you need to make a simpler convenience function for your use case I'm not against it. > We also need to give the direction of the transaction and have a > whitelist root complex PCI-IDs which can handle P2P transactions from > different ports for a certain DMA direction. Yes. In the NVMeof case we need all devices to be able to DMA in both directions so we did not need the DMA direction. But I can see this being useful once we add the whitelist. Logan
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
Am 02.05.2018 um 17:56 schrieb Logan Gunthorpe: Hi Christian, On 5/2/2018 5:51 AM, Christian König wrote: it would be rather nice to have if you could separate out the functions to detect if peer2peer is possible between two devices. This would essentially be pci_p2pdma_distance() in the existing patchset. It returns the sum of the distance between a list of clients and a P2PDMA provider. It returns -1 if peer2peer is not possible between the devices (presently this means they are not behind the same root port). Ok, I'm still missing the big picture here. First question is what is the P2PDMA provider? Second question is how to you want to handle things when device are not behind the same root port (which is perfectly possible in the cases I deal with)? Third question why multiple clients? That feels a bit like you are pushing something special to your use case into the common PCI subsystem. Something which usually isn't a good idea. As far as I can see we need a function which return the distance between a initiator and target device. This function then returns -1 if the transaction can't be made and a positive value otherwise. We also need to give the direction of the transaction and have a whitelist root complex PCI-IDs which can handle P2P transactions from different ports for a certain DMA direction. Christian. Logan
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
Hi Christian, On 5/2/2018 5:51 AM, Christian König wrote: it would be rather nice to have if you could separate out the functions to detect if peer2peer is possible between two devices. This would essentially be pci_p2pdma_distance() in the existing patchset. It returns the sum of the distance between a list of clients and a P2PDMA provider. It returns -1 if peer2peer is not possible between the devices (presently this means they are not behind the same root port). Logan
Re: [PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory
Hi Logan, it would be rather nice to have if you could separate out the functions to detect if peer2peer is possible between two devices. That would allow me to reuse the same logic for GPU peer2peer where I don't really have ZONE_DEVICE. Regards, Christian. Am 24.04.2018 um 01:30 schrieb Logan Gunthorpe: Hi Everyone, Here's v4 of our series to introduce P2P based copy offload to NVMe fabrics. This version has been rebased onto v4.17-rc2. A git repo is here: https://github.com/sbates130272/linux-p2pmem pci-p2p-v4 Thanks, Logan Changes in v4: * Change the original upstream_bridges_match() function to upstream_bridge_distance() which calculates the distance between two devices as long as they are behind the same root port. This should address Bjorn's concerns that the code was to focused on being behind a single switch. * The disable ACS function now disables ACS for all bridge ports instead of switch ports (ie. those that had two upstream_bridge ports). * Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl() API to be more like sgl_alloc() in that the alloc function returns the allocated scatterlist and nents is not required bythe free function. * Moved the new documentation into the driver-api tree as requested by Jonathan * Add SGL alloc and free helpers in the nvmet code so that the individual drivers can share the code that allocates P2P memory. As requested by Christoph. * Cleanup the nvmet_p2pmem_store() function as Christoph thought my first attempt was ugly. * Numerous commit message and comment fix-ups Changes in v3: * Many more fixes and minor cleanups that were spotted by Bjorn * Additional explanation of the ACS change in both the commit message and Kconfig doc. Also, the code that disables the ACS bits is surrounded explicitly by an #ifdef * Removed the flag we added to rdma_rw_ctx() in favour of using is_pci_p2pdma_page(), as suggested by Sagi. * Adjust pci_p2pmem_find() so that it prefers P2P providers that are closest to (or the same as) the clients using them. In cases of ties, the provider is randomly chosen. * Modify the NVMe Target code so that the PCI device name of the provider may be explicitly specified, bypassing the logic in pci_p2pmem_find(). (Note: it's still enforced that the provider must be behind the same switch as the clients). * As requested by Bjorn, added documentation for driver writers. Changes in v2: * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well as a bunch of cleanup and spelling fixes he pointed out in the last series. * To address Alex's ACS concerns, we change to a simpler method of just disabling ACS behind switches for any kernel that has CONFIG_PCI_P2PDMA. * We also reject using devices that employ 'dma_virt_ops' which should fairly simply handle Jason's concerns that this work might break with the HFI, QIB and rxe drivers that use the virtual ops to implement their own special DMA operations. -- This is a continuation of our work to enable using Peer-to-Peer PCI memory in the kernel with initial support for the NVMe fabrics target subsystem. Many thanks go to Christoph Hellwig who provided valuable feedback to get these patches to where they are today. The concept here is to use memory that's exposed on a PCI BAR as data buffers in the NVMe target code such that data can be transferred from an RDMA NIC to the special memory and then directly to an NVMe device avoiding system memory entirely. The upside of this is better QoS for applications running on the CPU utilizing memory and lower PCI bandwidth required to the CPU (such that systems could be designed with fewer lanes connected to the CPU). Due to these trade-offs we've designed the system to only enable using the PCI memory in cases where the NIC, NVMe devices and memory are all behind the same PCI switch hierarchy. This will mean many setups that could likely work well will not be supported so that we can be more confident it will work and not place any responsibility on the user to understand their topology. (We chose to go this route based on feedback we received at the last LSF). Future work may enable these transfers using a white list of known good root complexes. However, at this time, there is no reliable way to ensure that Peer-to-Peer transactions are permitted between PCI Root Ports. In order to enable this functionality, we introduce a few new PCI functions such that a driver can register P2P memory with the system. Struct pages are created for this memory using devm_memremap_pages() and the PCI bus offset is stored in the corresponding pagemap structure. When the PCI P2PDMA config option is selected the ACS bits in every bridge port in the system are turned off to allow traffic to pass freely behind the root port. At this time, the bit must be disabled at boot so the IOMMU subsystem can correctly create the groups, though this could be addresse