Re: [PATCH v5 01/13] PCI/P2PDMA: Support peer-to-peer memory

2018-09-01 Thread Christoph Hellwig
On Fri, Aug 31, 2018 at 09:48:40AM -0600, Logan Gunthorpe wrote:
> Pretty easy. P2P detection is pretty much just pci_p2pdma_distance() ,
> which has nothing to do with the ZONE_DEVICE support.
> 
> (And the distance function makes use of a number of static functions
> which could be combined into a simpler interface, should we need it.)

I'd ѕay lets get things merged as-is, so that we can review the
non-ZONE_DEVICE users.  I'm a little curious how that is going to work,
so having it as a full series would be useful.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 01/13] PCI/P2PDMA: Support peer-to-peer memory

2018-08-31 Thread Logan Gunthorpe



On 31/08/18 10:19 AM, Jonathan Cameron wrote:
> This feels like a somewhat simplistic starting point rather than a
> generally correct estimate to use.  Should we be taking the bandwidth of
> those links into account for example, or any discoverable latencies?
> Not all PCIe switches are alike - particularly when it comes to P2P.

I don't think this is necessary. There won't typically be a ton of
choice in terms of devices to use and if there is, the hardware will
probably be fairly homogenous. For example, it would be unusual to have
an NVMe drive on a x4 and another one on an x8. Or mixing say Gen3
switches with Gen4 would also be very strange. In weird unusual cases
like this where the user specifically wants to use a faster device they
can specify the specific device in the configfs interface.

I think the latency would probably be proportional to the distance which
is what we are already using.

> I guess that can be a topic for future development if it turns out people
> have horrible mixed systems.

Yup!

Logan
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 01/13] PCI/P2PDMA: Support peer-to-peer memory

2018-08-31 Thread Jonathan Cameron
On Thu, 30 Aug 2018 12:53:40 -0600
Logan Gunthorpe  wrote:

> Some PCI devices may have memory mapped in a BAR space that's
> intended for use in peer-to-peer transactions. In order to enable
> such transactions the memory must be registered with ZONE_DEVICE pages
> so it can be used by DMA interfaces in existing drivers.
> 
> Add an interface for other subsystems to find and allocate chunks of P2P
> memory as necessary to facilitate transfers between two PCI peers:
> 
> int pci_p2pdma_add_client();
> struct pci_dev *pci_p2pmem_find();
> void *pci_alloc_p2pmem();
> 
> The new interface requires a driver to collect a list of client devices
> involved in the transaction with the pci_p2pmem_add_client*() functions
> then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
> this is done the list is bound to the memory and the calling driver is
> free to add and remove clients as necessary (adding incompatible clients
> will fail). With a suitable p2pmem device, memory can then be
> allocated with pci_alloc_p2pmem() for use in DMA transactions.
> 
> Depending on hardware, using peer-to-peer memory may reduce the bandwidth
> of the transfer but can significantly reduce pressure on system memory.
> This may be desirable in many cases: for example a system could be designed
> with a small CPU connected to a PCIe switch by a small number of lanes
> which would maximize the number of lanes available to connect to NVMe
> devices.
> 
> The code is designed to only utilize the p2pmem device if all the devices
> involved in a transfer are behind the same PCI bridge. This is because we
> have no way of knowing whether peer-to-peer routing between PCIe Root Ports
> is supported (PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P
> transfers that go through the RC is limited to only reducing DRAM usage
> and, in some cases, coding convenience. The PCI-SIG may be exploring
> adding a new capability bit to advertise whether this is possible for
> future hardware.
> 
> This commit includes significant rework and feedback from Christoph
> Hellwig.
> 
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Logan Gunthorpe 

Apologies for being a late entrant to this conversation so I may be asking
about a topic that has been covered in detail in earlier patches!
> ---
...

> +/*
> + * Find the distance through the nearest common upstream bridge between
> + * two PCI devices.
> + *
> + * If the two devices are the same device then 0 will be returned.
> + *
> + * If there are two virtual functions of the same device behind the same
> + * bridge port then 2 will be returned (one step down to the PCIe switch,
> + * then one step back to the same device).
> + *
> + * In the case where two devices are connected to the same PCIe switch, the
> + * value 4 will be returned. This corresponds to the following PCI tree:
> + *
> + * -+  Root Port
> + *  \+ Switch Upstream Port
> + *   +-+ Switch Downstream Port
> + *   + \- Device A
> + *   \-+ Switch Downstream Port
> + * \- Device B
> + *
> + * The distance is 4 because we traverse from Device A through the downstream
> + * port of the switch, to the common upstream port, back up to the second
> + * downstream port and then to Device B.
> + *
> + * Any two devices that don't have a common upstream bridge will return -1.
> + * In this way devices on separate PCIe root ports will be rejected, which
> + * is what we want for peer-to-peer seeing each PCIe root port defines a
> + * separate hierarchy domain and there's no way to determine whether the root
> + * complex supports forwarding between them.
> + *
> + * In the case where two devices are connected to different PCIe switches,
> + * this function will still return a positive distance as long as both
> + * switches evenutally have a common upstream bridge. Note this covers
> + * the case of using multiple PCIe switches to achieve a desired level of
> + * fan-out from a root port. The exact distance will be a function of the
> + * number of switches between Device A and Device B.

This feels like a somewhat simplistic starting point rather than a
generally correct estimate to use.  Should we be taking the bandwidth of
those links into account for example, or any discoverable latencies?
Not all PCIe switches are alike - particularly when it comes to P2P.

I guess that can be a topic for future development if it turns out people
have horrible mixed systems.

> + *
> + * If a bridge which has any ACS redirection bits set is in the path
> + * then this functions will return -2. This is so we reject any
> + * cases where the TLPs are forwarded up into the root complex.
> + * In this case, a list of all infringing bridge addresses will be
> + * populated in acs_list (assuming it's non-null) for printk purposes.
> + */

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 01/13] PCI/P2PDMA: Support peer-to-peer memory

2018-08-31 Thread Logan Gunthorpe


On 31/08/18 02:04 AM, Christian König wrote:
> Am 30.08.2018 um 20:53 schrieb Logan Gunthorpe:
>> Some PCI devices may have memory mapped in a BAR space that's
>> intended for use in peer-to-peer transactions. In order to enable
>> such transactions the memory must be registered with ZONE_DEVICE pages
>> so it can be used by DMA interfaces in existing drivers.
> 
> We want to use that feature without ZONE_DEVICE pages for DMA-buf as well.
> 
> How hard would it be to separate enabling P2P detection (e.g. distance 
> between two devices) from this?

Pretty easy. P2P detection is pretty much just pci_p2pdma_distance() ,
which has nothing to do with the ZONE_DEVICE support.

(And the distance function makes use of a number of static functions
which could be combined into a simpler interface, should we need it.)

Logan
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 01/13] PCI/P2PDMA: Support peer-to-peer memory

2018-08-31 Thread Christian König

Am 30.08.2018 um 20:53 schrieb Logan Gunthorpe:

Some PCI devices may have memory mapped in a BAR space that's
intended for use in peer-to-peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.


We want to use that feature without ZONE_DEVICE pages for DMA-buf as well.

How hard would it be to separate enabling P2P detection (e.g. distance 
between two devices) from this?


Regards,
Christian.



Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:

int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();

The new interface requires a driver to collect a list of client devices
involved in the transaction with the pci_p2pmem_add_client*() functions
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary (adding incompatible clients
will fail). With a suitable p2pmem device, memory can then be
allocated with pci_alloc_p2pmem() for use in DMA transactions.

Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but can significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCIe switch by a small number of lanes
which would maximize the number of lanes available to connect to NVMe
devices.

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI bridge. This is because we
have no way of knowing whether peer-to-peer routing between PCIe Root Ports
is supported (PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P
transfers that go through the RC is limited to only reducing DRAM usage
and, in some cases, coding convenience. The PCI-SIG may be exploring
adding a new capability bit to advertise whether this is possible for
future hardware.

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Logan Gunthorpe 
---
  drivers/pci/Kconfig|  17 +
  drivers/pci/Makefile   |   1 +
  drivers/pci/p2pdma.c   | 761 +
  include/linux/memremap.h   |   5 +
  include/linux/mm.h |  18 ++
  include/linux/pci-p2pdma.h | 102 ++
  include/linux/pci.h|   4 +
  7 files changed, 908 insertions(+)
  create mode 100644 drivers/pci/p2pdma.c
  create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 56ff8f6d31fc..deb68be4fdac 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -132,6 +132,23 @@ config PCI_PASID
  
  	  If unsure, say N.
  
+config PCI_P2PDMA

+   bool "PCI peer-to-peer transfer support"
+   depends on PCI && ZONE_DEVICE
+   select GENERIC_ALLOCATOR
+   help
+ Enableѕ drivers to do PCI peer-to-peer transactions to and from
+ BARs that are exposed in other devices that are the part of
+ the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+ specification to work (ie. anything below a single PCI bridge).
+
+ Many PCIe root complexes do not support P2P transactions and
+ it's hard to tell which support it at all, so at this time,
+ P2P DMA transations must be between devices behind the same root
+ port.
+
+ If unsure, say N.
+
  config PCI_LABEL
def_bool y if (DMI || ACPI)
depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 1b2cfe51e8d7..85f4a703b2be 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o
  obj-$(CONFIG_PCI_STUB)+= pci-stub.o
  obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o
  obj-$(CONFIG_PCI_ECAM)+= ecam.o
+obj-$(CONFIG_PCI_P2PDMA)   += p2pdma.o
  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
  
  # Endpoint library must be initialized before its users

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index ..88aaec5351cd
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,761 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ */
+
+#define pr_fmt(fmt) "pci-p2pdma: " fmt
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct pci_p2pdma {
+   struct percpu_ref devmap_ref;
+   struct completion devmap_ref_done;
+   struct gen_pool *pool;
+   bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_r

[PATCH v5 01/13] PCI/P2PDMA: Support peer-to-peer memory

2018-08-30 Thread Logan Gunthorpe
Some PCI devices may have memory mapped in a BAR space that's
intended for use in peer-to-peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.

Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:

int pci_p2pdma_add_client();
struct pci_dev *pci_p2pmem_find();
void *pci_alloc_p2pmem();

The new interface requires a driver to collect a list of client devices
involved in the transaction with the pci_p2pmem_add_client*() functions
then call pci_p2pmem_find() to obtain any suitable P2P memory. Once
this is done the list is bound to the memory and the calling driver is
free to add and remove clients as necessary (adding incompatible clients
will fail). With a suitable p2pmem device, memory can then be
allocated with pci_alloc_p2pmem() for use in DMA transactions.

Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but can significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCIe switch by a small number of lanes
which would maximize the number of lanes available to connect to NVMe
devices.

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI bridge. This is because we
have no way of knowing whether peer-to-peer routing between PCIe Root Ports
is supported (PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P
transfers that go through the RC is limited to only reducing DRAM usage
and, in some cases, coding convenience. The PCI-SIG may be exploring
adding a new capability bit to advertise whether this is possible for
future hardware.

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/Kconfig|  17 +
 drivers/pci/Makefile   |   1 +
 drivers/pci/p2pdma.c   | 761 +
 include/linux/memremap.h   |   5 +
 include/linux/mm.h |  18 ++
 include/linux/pci-p2pdma.h | 102 ++
 include/linux/pci.h|   4 +
 7 files changed, 908 insertions(+)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 56ff8f6d31fc..deb68be4fdac 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -132,6 +132,23 @@ config PCI_PASID
 
  If unsure, say N.
 
+config PCI_P2PDMA
+   bool "PCI peer-to-peer transfer support"
+   depends on PCI && ZONE_DEVICE
+   select GENERIC_ALLOCATOR
+   help
+ Enableѕ drivers to do PCI peer-to-peer transactions to and from
+ BARs that are exposed in other devices that are the part of
+ the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+ specification to work (ie. anything below a single PCI bridge).
+
+ Many PCIe root complexes do not support P2P transactions and
+ it's hard to tell which support it at all, so at this time,
+ P2P DMA transations must be between devices behind the same root
+ port.
+
+ If unsure, say N.
+
 config PCI_LABEL
def_bool y if (DMI || ACPI)
depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 1b2cfe51e8d7..85f4a703b2be 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o
 obj-$(CONFIG_PCI_STUB) += pci-stub.o
 obj-$(CONFIG_PCI_PF_STUB)  += pci-pf-stub.o
 obj-$(CONFIG_PCI_ECAM) += ecam.o
+obj-$(CONFIG_PCI_P2PDMA)   += p2pdma.o
 obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 
 # Endpoint library must be initialized before its users
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index ..88aaec5351cd
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,761 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ */
+
+#define pr_fmt(fmt) "pci-p2pdma: " fmt
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct pci_p2pdma {
+   struct percpu_ref devmap_ref;
+   struct completion devmap_ref_done;
+   struct gen_pool *pool;
+   bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+   struct pci_p2pdma *p2p =
+   container_of(ref, struct pci_p2pdma, devmap_ref);
+
+   complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+   struct percpu_ref *ref = data;
+
+