Re: [PATCH v3 1/8] irqchip/mips-gic: Only register IPI domain when SMP is enabled

2022-07-07 Thread Serge Semin
On Thu, Jul 07, 2022 at 09:22:26AM +0100, Marc Zyngier wrote:
> On Tue, 05 Jul 2022 14:52:43 +0100,
> Serge Semin  wrote:
> > 
> > Hi Samuel
> > 
> > On Fri, Jul 01, 2022 at 03:00:49PM -0500, Samuel Holland wrote:
> > > The MIPS GIC irqchip driver may be selected in a uniprocessor
> > > configuration, but it unconditionally registers an IPI domain.
> > > 
> > > Limit the part of the driver dealing with IPIs to only be compiled when
> > > GENERIC_IRQ_IPI is enabled, which corresponds to an SMP configuration.
> > 
> > Thanks for the patch. A comment is below.
> > 
> > > 
> > > Reported-by: kernel test robot 
> > > Signed-off-by: Samuel Holland 
> > > ---
> > > 
> > > Changes in v3:
> > >  - New patch to fix build errors in uniprocessor MIPS configs
> > > 
> > >  drivers/irqchip/Kconfig|  3 +-
> > >  drivers/irqchip/irq-mips-gic.c | 80 +++---
> > >  2 files changed, 56 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
> > > index 1f23a6be7d88..d26a4ff7c99f 100644
> > > --- a/drivers/irqchip/Kconfig
> > > +++ b/drivers/irqchip/Kconfig
> > > @@ -322,7 +322,8 @@ config KEYSTONE_IRQ
> > >  
> > >  config MIPS_GIC
> > >   bool
> > > - select GENERIC_IRQ_IPI
> > > + select GENERIC_IRQ_IPI if SMP
> > 
> > > + select IRQ_DOMAIN_HIERARCHY
> > 
> > It seems to me that the IRQ domain hierarchy is supposed to be
> > created only if IPIs are required. At least that's what the MIPS GIC
> > driver implies. Thus we can go further and ifdef-out the
> > gic_irq_domain_alloc() and gic_irq_domain_free() method definitions,
> > together with their initialization, under CONFIG_IRQ_DOMAIN_HIERARCHY:
> > 
> >  static const struct irq_domain_ops gic_irq_domain_ops = {
> > .xlate = gic_irq_domain_xlate,
> > +#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
> > .alloc = gic_irq_domain_alloc,
> > .free = gic_irq_domain_free,
> > +#endif
> > .map = gic_irq_domain_map,
> > };
> > 
> > If the GENERIC_IRQ_IPI config is enabled, CONFIG_IRQ_DOMAIN_HIERARCHY
> > will be automatically selected (see the config definition in
> > kernel/irq/Kconfig). If the IRQ hierarchy is needed for some other
> > functionality, like GENERIC_MSI_IRQ_DOMAIN or GPIOs, then those users will
> > explicitly enable the IRQ_DOMAIN_HIERARCHY config, thus activating the
> > denoted .alloc and .free method definitions.
> > 
> > To sum up, you can get rid of the IRQ_DOMAIN_HIERARCHY config
> > force-select in this patch and make the MIPS GIC driver code a bit
> > more coherent.
> > 
> > @Marc, please correct me if I'm wrong.
> 

> Either way probably works correctly, but Samuel's approach is more
> readable IMO. It is far easier to reason about a high-level feature
> (GENERIC_IRQ_IPI) than an implementation detail (IRQ_DOMAIN_HIERARCHY).
> 

The main idea of my comment was to get rid of the forcible
IRQ_DOMAIN_HIERARCHY config selection, because the basic part of the
driver doesn't depend on the hierarchical IRQ-domain functionality.
It's needed only for IPIs and, implicitly, for the lower-level IRQ
device drivers like GPIO or PCIe controllers, which explicitly enable
the IRQ_DOMAIN_HIERARCHY config anyway. That's why, instead of the
forcible IRQ_DOMAIN_HIERARCHY config selection (see Samuel's patch), I
suggested defining the corresponding functionality under
IRQ_DOMAIN_HIERARCHY ifdefs, so that the driver would create the
hierarchical IRQ domains only when that is actually required.
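
For illustration, a rough structural sketch (not the actual patch; the
existing function bodies are elided here and a placeholder return is used,
assuming the usual linux/irqdomain.h declarations) of what that ifdef-ing
would look like in irq-mips-gic.c:

#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
static int gic_irq_domain_alloc(struct irq_domain *d, unsigned int virq,
				unsigned int nr_irqs, void *arg)
{
	/* ... existing implementation, unchanged ... */
	return 0;
}

static void gic_irq_domain_free(struct irq_domain *d, unsigned int virq,
				unsigned int nr_irqs)
{
	/* ... existing implementation, unchanged ... */
}
#endif /* CONFIG_IRQ_DOMAIN_HIERARCHY */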

> If you really want to save a handful of bytes, you can make the
> callbacks conditional on GENERIC_IRQ_IPI, and be done with it.

AFAIU I can't in this case. It must be either the IRQ_DOMAIN_HIERARCHY
ifdefs or an explicit IRQ_DOMAIN_HIERARCHY select. There can be non-SMP
(UP) systems with no need for IPIs but, for instance, with a GPIO or
PCIe controller which requires the hierarchical IRQ-domain support of
the parent IRQ controller. So making the callback definitions
dependent on the GENERIC_IRQ_IPI config state would break the driver
for these systems. That's why I suggested using
CONFIG_IRQ_DOMAIN_HIERARCHY, which activates the hierarchical IRQ-domain
support in the IRQ-chip subsystem (see the conditional field definitions
in the irq_domain_ops structure), and implementing the suggested
approach in the MIPS GIC driver.
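
For reference, this is roughly how those conditional fields look in
struct irq_domain_ops (abridged from include/linux/irqdomain.h; callbacks
unrelated to the point are omitted):

struct irq_domain_ops {
	int (*match)(struct irq_domain *d, struct device_node *node,
		     enum irq_domain_bus_token bus_token);
	int (*map)(struct irq_domain *d, unsigned int virq, irq_hw_number_t hw);
	void (*unmap)(struct irq_domain *d, unsigned int virq);
	int (*xlate)(struct irq_domain *d, struct device_node *node,
		     const u32 *intspec, unsigned int intsize,
		     unsigned long *out_hwirq, unsigned int *out_type);
#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
	/* extended V2 interfaces to support hierarchy irq_domains */
	int (*alloc)(struct irq_domain *d, unsigned int virq,
		     unsigned int nr_irqs, void *arg);
	void (*free)(struct irq_domain *d, unsigned int virq,
		     unsigned int nr_irqs);
	/* ... */
#endif
};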

-Sergey

> But this can come as an additional patch.
> 
> Thanks,
> 
>   M.
> 
> -- 
> Without deviation from the norm, progress is not possible.


Re: [PATCH v3 1/8] irqchip/mips-gic: Only register IPI domain when SMP is enabled

2022-07-05 Thread Serge Semin
Hi Samuel

On Fri, Jul 01, 2022 at 03:00:49PM -0500, Samuel Holland wrote:
> The MIPS GIC irqchip driver may be selected in a uniprocessor
> configuration, but it unconditionally registers an IPI domain.
> 
> Limit the part of the driver dealing with IPIs to only be compiled when
> GENERIC_IRQ_IPI is enabled, which corresponds to an SMP configuration.

Thanks for the patch. A comment is below.

> 
> Reported-by: kernel test robot 
> Signed-off-by: Samuel Holland 
> ---
> 
> Changes in v3:
>  - New patch to fix build errors in uniprocessor MIPS configs
> 
>  drivers/irqchip/Kconfig|  3 +-
>  drivers/irqchip/irq-mips-gic.c | 80 +++---
>  2 files changed, 56 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
> index 1f23a6be7d88..d26a4ff7c99f 100644
> --- a/drivers/irqchip/Kconfig
> +++ b/drivers/irqchip/Kconfig
> @@ -322,7 +322,8 @@ config KEYSTONE_IRQ
>  
>  config MIPS_GIC
>   bool
> - select GENERIC_IRQ_IPI
> + select GENERIC_IRQ_IPI if SMP

> + select IRQ_DOMAIN_HIERARCHY

It seems to me that the IRQ domain hierarchy is supposed to be
created only if IPIs are required. At least that's what the MIPS GIC
driver implies. Thus we can go further and ifdef-out the
gic_irq_domain_alloc() and gic_irq_domain_free() method definitions,
together with their initialization, under CONFIG_IRQ_DOMAIN_HIERARCHY:

 static const struct irq_domain_ops gic_irq_domain_ops = {
.xlate = gic_irq_domain_xlate,
+#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
.alloc = gic_irq_domain_alloc,
.free = gic_irq_domain_free,
+#endif
.map = gic_irq_domain_map,
};

If the GENERIC_IRQ_IPI config is enabled, CONFIG_IRQ_DOMAIN_HIERARCHY
will be automatically selected (see the config definition in
kernel/irq/Kconfig). If the IRQ hierarchy is needed for some other
functionality, like GENERIC_MSI_IRQ_DOMAIN or GPIOs, then those users will
explicitly enable the IRQ_DOMAIN_HIERARCHY config, thus activating the
denoted .alloc and .free method definitions.

To sum up, you can get rid of the IRQ_DOMAIN_HIERARCHY config
force-select in this patch and make the MIPS GIC driver code a bit
more coherent.

@Marc, please correct me if I'm wrong.

-Sergey

>   select MIPS_CM
>  
>  config INGENIC_IRQ
> diff --git a/drivers/irqchip/irq-mips-gic.c b/drivers/irqchip/irq-mips-gic.c
> index ff89b36267dd..8a9efb6ae587 100644
> --- a/drivers/irqchip/irq-mips-gic.c
> +++ b/drivers/irqchip/irq-mips-gic.c
> @@ -52,13 +52,15 @@ static DEFINE_PER_CPU_READ_MOSTLY(unsigned long[GIC_MAX_LONGS], pcpu_masks);
>  
>  static DEFINE_SPINLOCK(gic_lock);
>  static struct irq_domain *gic_irq_domain;
> -static struct irq_domain *gic_ipi_domain;
>  static int gic_shared_intrs;
>  static unsigned int gic_cpu_pin;
>  static unsigned int timer_cpu_pin;
>  static struct irq_chip gic_level_irq_controller, gic_edge_irq_controller;
> +
> +#ifdef CONFIG_GENERIC_IRQ_IPI
>  static DECLARE_BITMAP(ipi_resrv, GIC_MAX_INTRS);
>  static DECLARE_BITMAP(ipi_available, GIC_MAX_INTRS);
> +#endif /* CONFIG_GENERIC_IRQ_IPI */
>  
>  static struct gic_all_vpes_chip_data {
>   u32 map;
> @@ -472,9 +474,11 @@ static int gic_irq_domain_map(struct irq_domain *d, unsigned int virq,
>   u32 map;
>  
>   if (hwirq >= GIC_SHARED_HWIRQ_BASE) {
> +#ifdef CONFIG_GENERIC_IRQ_IPI
>   /* verify that shared irqs don't conflict with an IPI irq */
>   if (test_bit(GIC_HWIRQ_TO_SHARED(hwirq), ipi_resrv))
>   return -EBUSY;
> +#endif /* CONFIG_GENERIC_IRQ_IPI */
>  
>   err = irq_domain_set_hwirq_and_chip(d, virq, hwirq,
>   &gic_level_irq_controller,
> @@ -567,6 +571,8 @@ static const struct irq_domain_ops gic_irq_domain_ops = {
>   .map = gic_irq_domain_map,
>  };
>  
> +#ifdef CONFIG_GENERIC_IRQ_IPI
> +
>  static int gic_ipi_domain_xlate(struct irq_domain *d, struct device_node *ctrlr,
>   const u32 *intspec, unsigned int intsize,
>   irq_hw_number_t *out_hwirq,
> @@ -670,6 +676,48 @@ static const struct irq_domain_ops gic_ipi_domain_ops = {
>   .match = gic_ipi_domain_match,
>  };
>  
> +static int gic_register_ipi_domain(struct device_node *node)
> +{
> + struct irq_domain *gic_ipi_domain;
> + unsigned int v[2], num_ipis;
> +
> + gic_ipi_domain = irq_domain_add_hierarchy(gic_irq_domain,
> +   IRQ_DOMAIN_FLAG_IPI_PER_CPU,
> +   GIC_NUM_LOCAL_INTRS + gic_shared_intrs,
> +   node, &gic_ipi_domain_ops, NULL);
> + if (!gic_ipi_domain) {
> + pr_err("Failed to add IPI domain");
> + return -ENXIO;
> + }
> +
> + irq_domain_update_bus_token(gic_ipi_domain, DOMAIN_BUS_IPI);
> +
> + if (node &&
> + !of_property_read_u32_array(node, 

Re: [PATCH] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-06-20 Thread Serge Semin
Folks,

On Fri, Jun 10, 2022 at 11:08:02AM +0300, Serge Semin wrote:
> A basic device-specific linear memory mapping was introduced back in
> commit ("dma: Take into account dma_pfn_offset") as a single-valued offset
> preserved in the device.dma_pfn_offset field, which was initialized for
> instance by means of the "dma-ranges" DT property. Afterwards the
> functionality was extended to support more than one device-specific region
> defined in the device.dma_range_map list of maps. But all of these
> improvements concerned the single-pointer, page and sg DMA-mapping methods,
> while the system resource mapping function turned out to miss the
> corresponding modification. Thus the dma_direct_map_resource() method now
> just casts the CPU physical address to the device DMA address with no
> dma-ranges-based mapping taken into account, which is obviously wrong.
> Let's fix it by using the phys_to_dma_direct() method to get the
> device-specific bus address from the passed memory resource for the case
> of the directly mapped DMA.

So, any comments on the suggested modification? Any notes for or against?

-Sergey

> 
> Fixes: 25f1e1887088 ("dma: Take into account dma_pfn_offset")
> Signed-off-by: Serge Semin 
> 
> ---
> 
> After a long discussion with Christoph and Robin regarding this patch
> here:
> https://lore.kernel.org/lkml/20220324014836.19149-4-sergey.se...@baikalelectronics.ru
> and here
> https://lore.kernel.org/linux-pci/20220503225104.12108-2-sergey.se...@baikalelectronics.ru/
> It was decided to consult a wider maintainers audience on whether it's OK
> to accept the change as is, or whether a more sophisticated solution needs
> to be found for the non-linear direct MMIO mapping.
> 
> Cc: Christoph Hellwig 
> Cc: Robin Murphy 
> Cc: Manivannan Sadhasivam 
> 
> file: arch/arm/mach-orion5x/board-dt.c
> Cc: Andrew Lunn 
> Cc: Sebastian Hesselbarth 
> Cc: Gregory Clement 
> Cc: linux-arm-ker...@lists.infradead.org
> 
> file: drivers/crypto/marvell/cesa/cesa.c
> Cc: Srujana Challa 
> Cc: Arnaud Ebalard 
> Cc: Boris Brezillon 
> Cc: linux-cry...@vger.kernel.org
> 
> file: drivers/dma/{fsl-edma-common.c,pl330.c,sh/rcar-dmac.c}
> Cc: Vinod Koul 
> Cc: dmaeng...@vger.kernel.org
> 
> file: 
> arch/arm/boot/dts/{vfxxx.dtsi,ls1021a.dtsi,imx7ulp.dtsi,fsl-ls1043a.dtsi}
> Cc: Shawn Guo 
> Cc: Sascha Hauer 
> Cc: Li Yang 
> Cc: linux-arm-ker...@lists.infradead.org
> 
> file: arch/arm/boot/dts/r8a77*.dtsi, arch/arm64/boot/dts/renesas/r8a77*.dtsi
> Cc: Geert Uytterhoeven 
> Cc: Magnus Damm 
> Cc: linux-renesas-...@vger.kernel.org
> 
> file: drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
> Cc: Alex Deucher 
> Cc: "Christian König" 
> Cc: "Pan, Xinhui" 
> 
> file: drivers/gpu/drm/virtio/virtgpu_vram.c
> Cc: David Airlie 
> Cc: Gerd Hoffmann 
> 
> file: drivers/media/common/videobuf2/videobuf2-dma-contig.c
> Cc: Tomasz Figa 
> Cc: Marek Szyprowski 
> 
> file: drivers/misc/habanalabs/common/memory.c
> Cc: Oded Gabbay 
> Cc: Arnd Bergmann 
> Cc: Greg Kroah-Hartman 
> 
> file: drivers/mtd/nand/raw/qcom_nandc.c
> Cc: Manivannan Sadhasivam 
> 
> file: 
> arch/arm64/boot/dts/qcom/{ipq8074.dtsi,ipq6018.dtsi,qcom-sdx55.dtsi,qcom-ipq4019.dtsi,qcom-ipq8064.dtsi}
> Cc: Andy Gross 
> Cc: Bjorn Andersson 
> Cc: linux-arm-...@vger.kernel.org
> 
> file: drivers/net/ethernet/marvell/octeontx2/af/rvu.c
> Cc: Sunil Goutham 
> Cc: Linu Cherian 
> Cc: Geetha sowjanya 
> 
> file: drivers/ntb/ntb_transport.c
> Cc: Jon Mason 
> Cc: Dave Jiang 
> Cc: n...@lists.linux.dev
> ---
>  kernel/dma/direct.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 9743c6ccce1a..bc06db74dfdb 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -497,7 +497,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
>  dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
>   size_t size, enum dma_data_direction dir, unsigned long attrs)
>  {
> - dma_addr_t dma_addr = paddr;
> + dma_addr_t dma_addr = phys_to_dma_direct(dev, paddr);
>  
>   if (unlikely(!dma_capable(dev, dma_addr, size, false))) {
>   dev_err_once(dev,
> -- 
> 2.35.1
> 


[PATCH] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-06-10 Thread Serge Semin via iommu
A basic device-specific linear memory mapping was introduced back in
commit ("dma: Take into account dma_pfn_offset") as a single-valued offset
preserved in the device.dma_pfn_offset field, which was initialized for
instance by means of the "dma-ranges" DT property. Afterwards the
functionality was extended to support more than one device-specific region
defined in the device.dma_range_map list of maps. But all of these
improvements concerned the single-pointer, page and sg DMA-mapping methods,
while the system resource mapping function turned out to miss the
corresponding modification. Thus the dma_direct_map_resource() method now
just casts the CPU physical address to the device DMA address with no
dma-ranges-based mapping taken into account, which is obviously wrong.
Let's fix it by using the phys_to_dma_direct() method to get the
device-specific bus address from the passed memory resource for the case
of the directly mapped DMA.

Fixes: 25f1e1887088 ("dma: Take into account dma_pfn_offset")
Signed-off-by: Serge Semin 

---

After a long discussion with Christoph and Robin regarding this patch
here:
https://lore.kernel.org/lkml/20220324014836.19149-4-sergey.se...@baikalelectronics.ru
and here
https://lore.kernel.org/linux-pci/20220503225104.12108-2-sergey.se...@baikalelectronics.ru/
It was decided to consult a wider maintainers audience on whether it's OK
to accept the change as is, or whether a more sophisticated solution needs
to be found for the non-linear direct MMIO mapping.

Cc: Christoph Hellwig 
Cc: Robin Murphy 
Cc: Manivannan Sadhasivam 

file: arch/arm/mach-orion5x/board-dt.c
Cc: Andrew Lunn 
Cc: Sebastian Hesselbarth 
Cc: Gregory Clement 
Cc: linux-arm-ker...@lists.infradead.org

file: drivers/crypto/marvell/cesa/cesa.c
Cc: Srujana Challa 
Cc: Arnaud Ebalard 
Cc: Boris Brezillon 
Cc: linux-cry...@vger.kernel.org

file: drivers/dma/{fsl-edma-common.c,pl330.c,sh/rcar-dmac.c}
Cc: Vinod Koul 
Cc: dmaeng...@vger.kernel.org

file: arch/arm/boot/dts/{vfxxx.dtsi,ls1021a.dtsi,imx7ulp.dtsi,fsl-ls1043a.dtsi}
Cc: Shawn Guo 
Cc: Sascha Hauer 
Cc: Li Yang 
Cc: linux-arm-ker...@lists.infradead.org

file: arch/arm/boot/dts/r8a77*.dtsi, arch/arm64/boot/dts/renesas/r8a77*.dtsi
Cc: Geert Uytterhoeven 
Cc: Magnus Damm 
Cc: linux-renesas-...@vger.kernel.org

file: drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
Cc: Alex Deucher 
Cc: "Christian König" 
Cc: "Pan, Xinhui" 

file: drivers/gpu/drm/virtio/virtgpu_vram.c
Cc: David Airlie 
Cc: Gerd Hoffmann 

file: drivers/media/common/videobuf2/videobuf2-dma-contig.c
Cc: Tomasz Figa 
Cc: Marek Szyprowski 

file: drivers/misc/habanalabs/common/memory.c
Cc: Oded Gabbay 
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 

file: drivers/mtd/nand/raw/qcom_nandc.c
Cc: Manivannan Sadhasivam 

file: 
arch/arm64/boot/dts/qcom/{ipq8074.dtsi,ipq6018.dtsi,qcom-sdx55.dtsi,qcom-ipq4019.dtsi,qcom-ipq8064.dtsi}
Cc: Andy Gross 
Cc: Bjorn Andersson 
Cc: linux-arm-...@vger.kernel.org

file: drivers/net/ethernet/marvell/octeontx2/af/rvu.c
Cc: Sunil Goutham 
Cc: Linu Cherian 
Cc: Geetha sowjanya 

file: drivers/ntb/ntb_transport.c
Cc: Jon Mason 
Cc: Dave Jiang 
Cc: n...@lists.linux.dev
---
 kernel/dma/direct.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 9743c6ccce1a..bc06db74dfdb 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -497,7 +497,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
-   dma_addr_t dma_addr = paddr;
+   dma_addr_t dma_addr = phys_to_dma_direct(dev, paddr);
 
if (unlikely(!dma_capable(dev, dma_addr, size, false))) {
dev_err_once(dev,
-- 
2.35.1


Re: [PATCH v2 01/26] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-05-12 Thread Serge Semin
On Mon, May 09, 2022 at 08:15:52AM +0200, Christoph Hellwig wrote:
> So I think the big problem pointed out by Robin is that existing DTs
> might not take this into account. 

I'd say that the biggest problem isn't in the DT part but in the
drivers using the dma_map_resource() method, since they don't expect
that a non-uniform DMA-direct mapping can be in effect.

> So I think we need to do a little
> research and at least Cc all maintainers and lists for relevant in-tree
> DTs for drivers that use dma_map_resource to discuss this.

Right. I'll send the next patchset revision out with all potentially
interested maintainers and lists Cc'ed on this patch.

-Sergey

> 


[PATCH v2 01/26] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-05-03 Thread Serge Semin
A basic device-specific linear memory mapping was introduced back in
commit ("dma: Take into account dma_pfn_offset") as a single-valued offset
preserved in the device.dma_pfn_offset field, which was initialized for
instance by means of the "dma-ranges" DT property. Afterwards the
functionality was extended to support more than one device-specific region
defined in the device.dma_range_map list of maps. But all of these
improvements concerned the single-pointer, page and sg DMA-mapping methods,
while the system resource mapping function turned out to miss the
corresponding modification. Thus the dma_direct_map_resource() method now
just casts the CPU physical address to the device DMA address with no
dma-ranges-based mapping taken into account, which is obviously wrong.
Let's fix it by using the phys_to_dma_direct() method to get the
device-specific bus address from the passed memory resource for the case
of the directly mapped DMA.

Fixes: 25f1e1887088 ("dma: Take into account dma_pfn_offset")
Signed-off-by: Serge Semin 
---
 kernel/dma/direct.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 9743c6ccce1a..bc06db74dfdb 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -497,7 +497,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
-   dma_addr_t dma_addr = paddr;
+   dma_addr_t dma_addr = phys_to_dma_direct(dev, paddr);
 
if (unlikely(!dma_capable(dev, dma_addr, size, false))) {
dev_err_once(dev,
-- 
2.35.1



Re: [PATCH 03/25] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-04-24 Thread Serge Semin
On Thu, Apr 21, 2022 at 09:51:31PM +0100, Robin Murphy wrote:
> On 2022-04-21 18:35, Serge Semin wrote:
> > On Thu, Apr 21, 2022 at 04:45:36PM +0200, Christoph Hellwig wrote:
> > > On Wed, Apr 20, 2022 at 11:55:38AM +0300, Serge Semin wrote:
> > > > On Wed, Apr 20, 2022 at 10:47:46AM +0200, Christoph Hellwig wrote:
> > > > > I can't really comment on the dma-ranges exclusion for P2P mappings,
> > > > > as that predates my involvement, however:
> > > > 
> > > > My example wasn't specific to the PCIe P2P transfers, but about PCIe
> > > > devices reaching some platform devices over the system interconnect
> > > > bus.
> > > 
> > > So strike PCIe, but this is our definition of Peer to Peer accesses.
> > > 
> > > > What if I have a physical address of a platform device and want to have
> > > > that device accessed by a PCIe peripheral device? The
> > > > dma_map_resource() method seemed very much suitable for that. But considering
> > > > what you say, it isn't.
> > > 
> > 
> > > dma_map_resource is the right thing for that.  But the physical address
> > > of MMIO ranges in the platform device should not have struct pages
> > > allocated for it, and thus the other dma_map_* APIs should not work on
> > > it to start with.
> > 
> > The problem is that the dma_map_resource() won't work for that, but
> > presumably the dma_map_sg()-like methods will (after some hacking with
> > the phys address, but anyway). Consider the system diagram in my
> > previous email. Here is what I would do to initialize a DMA
> > transaction between a platform device and a PCIe peripheral device:
> > 
> > 1) struct resource *rsc = platform_get_resource(plat_dev, IORESOURCE_MEM, 
> > 0);
> > 
> > 2) dma_addr_t dar = dma_map_resource(_dev->dev, rsc->start, rsc->end - 
> > rsc->start + 1,
> >DMA_FROM_DEVICE, 0);
> > 
> > 3) dma_addr_t sar;
> > void *tmp = dma_alloc_coherent(_dev->dev, PAGE_SIZE, , 
> > GFP_KERNEL);
> > memset(tmp, 0xaa, PAGE_SIZE);
> > 
> > 4) PCIe device: DMA.DAR=dar, DMA.SAR=sar. RUN.
> > 
> > If there is no dma-ranges specified in the PCIe Host controller
> > DT-node, the PCIe peripheral devices will see the rest of the system
> > memory as is (no offsets and remappings). But if there is dma-ranges
> > with some specific system settings it may affect the PCIe MRd/MWr TLPs
> > address translation including the addresses targeted to the MMIO
> > space. In that case the mapping performed on step 2) will return a
> > wrong DMA-address since the corresponding dma_direct_map_resource()
> > just returns the passed physical address missing the
> > 'pci_dev->dma_range_map'-based mapping performed in
> > translate_phys_to_dma().
> > 
> > Note the mapping in step 3) works correctly because it goes through
> > translate_phys_to_dma() of the direct DMA interface, thus taking the
> > PCIe dma-ranges into account.
> > 
> > To sum up, as I see it, either restricting dma_map_resource() to mapping
> > just the intra-bus addresses was wrong, or there must be some
> > additional mapping infrastructure for the denoted systems. Though I
> > don't see a way dma_map_resource() could be fixed to be suitable
> > for every considered case.
> 

> FWIW the current semantics of dma_map_resource() are basically just to
> insert IOMMU awareness where dmaengine drivers were previously just casting
> phys_addr_t to dma_addr_t (or u32, or whatever else they put into their
> descriptor/register/etc.) IIRC there was a bit of a question whether it
> really belonged in the DMA API at all, since it's not really a "DMA"
> operation in the conventional sense, and convenience was the only real
> deciding argument. The relevant drivers at the time were not taking
> dma_pfn_offset into account when consuming physical addresses directly, so
> the new API didn't either.
> 
> That's just how things got to where they are today. 

I see. Thanks for the clarification. Right, IOMMU is the only reason
to have the current dma_map_resource() implementation.

> Once again, I'm not
> saying that what we have now is necessarily right, or that your change is
> necessarily wrong, I just really want to understand specifically *why* you
> need to make it, so we can evaluate the risk of possible breakage either
> way. Theoretical "if"s aren't really enough.

As I already said, our SoC has the following structure (obviously the
diagram is very simplified, but the gist is the same):
   

Re: [PATCH 03/25] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-04-21 Thread Serge Semin
On Thu, Apr 21, 2022 at 04:45:36PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 20, 2022 at 11:55:38AM +0300, Serge Semin wrote:
> > On Wed, Apr 20, 2022 at 10:47:46AM +0200, Christoph Hellwig wrote:
> > > I can't really comment on the dma-ranges exclusion for P2P mappings,
> > > as that predates my involvement, however:
> > 
> > My example wasn't specific to the PCIe P2P transfers, but about PCIe
> > devices reaching some platform devices over the system interconnect
> > bus.
> 
> So strike PCIe, but this is our definition of Peer to Peer accesses.
> 
> > What if I have a physical address of a platform device and want to have
> > that device accessed by a PCIe peripheral device? The
> > dma_map_resource() method seemed very much suitable for that. But considering
> > what you say, it isn't.
> 

> dma_map_resource is the right thing for that.  But the physical address
> of MMIO ranges in the platform device should not have struct pages
> allocated for it, and thus the other dma_map_* APIs should not work on
> it to start with.

The problem is that the dma_map_resource() won't work for that, but
presumably the dma_map_sg()-like methods will (after some hacking with
the phys address, but anyway). Consider the system diagram in my
previous email. Here is what I would do to initialize a DMA
transaction between a platform device and a PCIe peripheral device:

1) struct resource *rsc = platform_get_resource(plat_dev, IORESOURCE_MEM, 0);

2) dma_addr_t dar = dma_map_resource(_dev->dev, rsc->start, rsc->end - 
rsc->start + 1,
  DMA_FROM_DEVICE, 0);

3) dma_addr_t sar;
   void *tmp = dma_alloc_coherent(_dev->dev, PAGE_SIZE, , GFP_KERNEL);
   memset(tmp, 0xaa, PAGE_SIZE);

4) PCIe device: DMA.DAR=dar, DMA.SAR=sar. RUN.

If there is no dma-ranges specified in the PCIe Host controller
DT-node, the PCIe peripheral devices will see the rest of the system
memory as is (no offsets and remappings). But if there is dma-ranges
with some specific system settings it may affect the PCIe MRd/MWr TLPs
address translation including the addresses targeted to the MMIO
space. In that case the mapping performed on step 2) will return a
wrong DMA-address since the corresponding dma_direct_map_resource()
just returns the passed physical address missing the
'pci_dev->dma_range_map'-based mapping performed in
translate_phys_to_dma().

Note the mapping in step 3) works correctly because it goes through
translate_phys_to_dma() of the direct DMA interface, thus taking the
PCIe dma-ranges into account.
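
For reference, a simplified sketch (not the exact kernel code; the *_sketch
names are made up here) of the dma_range_map based translation which
dma_map_page()/dma_map_sg() already go through and which, with the proposed
patch, dma_direct_map_resource() would go through as well:

#include <linux/types.h>
#include <linux/dma-mapping.h>

struct bus_dma_region_sketch {
	phys_addr_t cpu_start;
	dma_addr_t dma_start;
	u64 size;
};

static dma_addr_t sketch_phys_to_dma(const struct bus_dma_region_sketch *map,
				     phys_addr_t paddr)
{
	const struct bus_dma_region_sketch *m;

	/* The map is a zero-size-terminated array built from "dma-ranges" */
	for (m = map; m->size; m++)
		if (paddr >= m->cpu_start && paddr - m->cpu_start < m->size)
			return (dma_addr_t)(paddr - m->cpu_start) + m->dma_start;

	/* The address isn't covered by any range */
	return DMA_MAPPING_ERROR;
}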

To sum up, as I see it, either restricting dma_map_resource() to mapping
just the intra-bus addresses was wrong, or there must be some
additional mapping infrastructure for the denoted systems. Though I
don't see a way dma_map_resource() could be fixed to be suitable
for every considered case.

-Sergey
   



Re: [PATCH 03/25] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-04-20 Thread Serge Semin
On Wed, Apr 20, 2022 at 10:47:46AM +0200, Christoph Hellwig wrote:
> I can't really comment on the dma-ranges exclusion for P2P mappings,
> as that predates my involvement, however:

My example wasn't specific to the PCIe P2P transfers, but about PCIe
devices reaching some platform devices over the system interconnect
bus.

> 
> On Wed, Apr 20, 2022 at 11:32:07AM +0300, Serge Semin wrote:
> > See, if I get to map a virtual memory address to be accessible by any
> > PCIe peripheral device, then the dma_map_sg/dma_map_page/etc
> > procedures will take the PCIe host controller dma-ranges into account.
> > It will work as expected and the PCIe devices will see the memory that
> > I specified. But if I get to pass the physical address of the same
> > page, or a physical address of some device of the DEVs space, to
> > dma_map_resource(), then the PCIe dma-ranges won't be taken into
> > account, and the resulting mapping will be incorrect. That's why the
> > current dma_map_resource() implementation seems very confusing to me.
> > As I see it, phys_addr_t is the type of the Interconnect address space,
> > while dma_addr_t describes the PCIe and DEVs address spaces.
> > 
> > Based on what I said here and in my previous email, could you explain
> > what I am getting wrong?
> 

> You simply must not use dma_map_resource for normal kernel memory.
> So while the exclusion might be somewhat confusing, that confusion
> really should not matter for any proper use of the API.

What if I have a physical address of a platform device and want to have
that device accessed by a PCIe peripheral device? The
dma_map_resource() method seemed very much suitable for that. But considering
what you say, it isn't.

-Sergey



Re: [PATCH 03/25] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-04-20 Thread Serge Semin
On Wed, Apr 20, 2022 at 09:12:17AM +0200, Christoph Hellwig wrote:
> On Mon, Apr 18, 2022 at 01:44:27AM +0300, Serge Semin wrote:
> > > but a DMA controller might also want to access something in the MMIO range
> > > 0x0-0x7fff, of which it still has an identical non-offset view. If a
> > > driver was previously using dma_map_resource() for that, it would now 
> > > start
> > > getting DMA_MAPPING_ERROR because the dma_range_map exists but doesn't
> > > describe the MMIO region. I agree that in hindsight it's not an ideal
> > > situation, but it's how things have ended up, so at this point I'm wary of
> > > making potentially-breaking changes.
> > 
> > Hmm, what if the driver was previously using for instance the
> > dma_direct_map_sg() method for it?
> 

> dma_map_resource is for mapping MMIO space, and must not be called on
> memory in the kernel map.  For dma_map_sg (or all the other dma_map_*
> interface except for dma_map_resource), the reverse is true.

I got that from Robin's comment. That exact part seems very confusing
to me, because what you say doesn't cohere with the passed address
type. If the passed address belongs to the MMIO space and is a part of
the CPU physical address space, then it is supposed to be visible to
the CPU as is (see the very first diagram in [1]). So the mappings
performed by the dma_map_resource() and dma_map_sg() methods are
supposed to match. Otherwise the spaces you are talking about are
different and as such need to be described by different types. Since
what you are talking about seems more like a DMA address space, the
dma_map_resource() address needs to have the dma_addr_t type instead
of phys_addr_t.

BTW, here is the clearest example of a system which contradicts the
MMIO-specific mapping semantics you are talking about (it actually
matches what we've got, except for some interconnect implementation
peculiarities):

              +-----+
              | DDR |
              +--+--+
                 |
 +-----+  +------+-------+  +---------+
 | CPU +--+ Interconnect +--+ DEVs... |
 +-----+  +---^------+---+  +---------+
  dma-ranges--|      |--ranges
           +--+------v--+
           |     PCI    |
           +------------+

See, if I get to map a virtual memory address to be accessible by any
PCIe peripheral device, then the dma_map_sg/dma_map_page/etc
procedures will take the PCIe host controller dma-ranges into account.
It will work as expected and the PCIe devices will see the memory that
I specified. But if I get to pass the physical address of the same
page, or a physical address of some device of the DEVs space, to
dma_map_resource(), then the PCIe dma-ranges won't be taken into
account, and the resulting mapping will be incorrect. That's why the
current dma_map_resource() implementation seems very confusing to me.
As I see it, phys_addr_t is the type of the Interconnect address
space, while dma_addr_t describes the PCIe and DEVs address spaces.

Based on what I said here and in my previous email, could you explain
what I am getting wrong?

[1] Documentation/core-api/dma-api-howto.rst

-Sergey



Re: [PATCH 03/25] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-04-17 Thread Serge Semin
Hello Robin.

Sorry for the delayed answer. My comments are below.

On Thu, Mar 24, 2022 at 11:30:38AM +, Robin Murphy wrote:
> On 2022-03-24 01:48, Serge Semin wrote:
> > A basic device-specific linear memory mapping was introduced back in
> > commit ("dma: Take into account dma_pfn_offset") as a single-valued offset
> > preserved in the device.dma_pfn_offset field, which was initialized for
> > instance by means of the "dma-ranges" DT property. Afterwards the
> > functionality was extended to support more than one device-specific region
> > defined in the device.dma_range_map list of maps. But all of these
> > improvements concerned the single-pointer, page and sg DMA-mapping methods,
> > while the system resource mapping function turned out to miss the
> > corresponding modification. Thus the dma_direct_map_resource() method now
> > just casts the CPU physical address to the device DMA address with no
> > dma-ranges-based mapping taken into account, which is obviously wrong.
> > Let's fix it by using the phys_to_dma_direct() method to get the
> > device-specific bus address from the passed memory resource for the case
> > of the directly mapped DMA.
> 

> It may not have been well-documented at the time, but this was largely
> intentional. The assumption based on known systems was that where
> dma_pfn_offset existed, it would *not* apply to peer MMIO addresses.

Well, I'd say it wasn't documented or even discussed at all. At least,
after a fairly comprehensive retrospective search I failed to find any
note about the reason for having all the dma_direct_map*() methods
converted to support the dma_pfn_offset/dma_range_map ranges while
leaving the dma_direct_map_resource() method out of that conversion.
Nor is it immediately inferable from the method's usage and prototype
that it is supposed to be used for DMA memory addresses, not CPU ones.

> 
> For instance, DTs for TI Keystone 2 platforms only describe an offset for
> RAM:
> 
>   dma-ranges = <0x8000 0x8 0x 0x8000>;
> 
> but a DMA controller might also want to access something in the MMIO range
> 0x0-0x7fff, of which it still has an identical non-offset view. If a
> driver was previously using dma_map_resource() for that, it would now start
> getting DMA_MAPPING_ERROR because the dma_range_map exists but doesn't
> describe the MMIO region. I agree that in hindsight it's not an ideal
> situation, but it's how things have ended up, so at this point I'm wary of
> making potentially-breaking changes.

Hmm, what if the driver was previously using, for instance, the
dma_direct_map_sg() method for it? Following this logic you would have
needed to reject the whole dma_pfn_offset/dma_range_map ranges support,
since the dma_direct_map_sg(), dma_direct_map_page() and
dma_direct_alloc*() methods do take the offsets into account. What we
can see now is that the same physical address will be mapped differently
by the dma_map_resource() and, for instance, dma_map_sg() methods. All
of these methods expect to be passed a "phys_addr_t" address, which is
the CPU address, not the DMA one. Doesn't that look erroneous? IIUC, in
accordance with the common kernel convention, the "resource" suffix
indicates a CPU-visible address (like a struct resource range), not a
DMA address, no matter whether it's used to describe a RAM or MMIO
range.

AFAICS the dma_range_map just defines the offset-based DMA-to-CPU
mapping for a particular bus/device. If the device driver already
knows the DMA address, why does it need to map it at all? I see some
contradiction here.

> 
> May I ask what exactly your setup looks like, if you have a DMA controller
> with an offset view of its "own" MMIO space?

I don't have such a setup. But what I see here is either a wrong
dma_direct_map_resource() implementation or a redundant mapping
performed in some platform/DMA-device drivers. Indeed, judging by the
dma_map_resource() method declaration, it expects to be passed the CPU
address, which will be mapped in accordance with the "dma-ranges"-based
DMA-to-CPU memory mapping in the same way as the rest of the
dma_direct-family methods. If the DMA address is already known, then it
is supposed to be used as-is without any additional remapping procedure.

Last but not least, regarding the DMA controllers and the
dma_map_resource() usage: the dma_slave_config structure was converted
to having the CPU-physical src/dst addresses specified in commit
9575632052ba ("dmaengine: make slave address physical"). So the DMA
client drivers now have to set the slave source and destination
addresses defined in the CPU address space, while the DMA engine driver
needs to map them in accordance with the platform/device-specific
configs.
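
A hypothetical DMA-engine-driver fragment (the demo_* name is made up, not
from any in-tree driver) illustrating that split: the client passes a CPU
physical address in dma_slave_config, and the engine driver maps it into its
own DMA address space, e.g. via dma_map_resource():

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>

/* Map the client-supplied slave FIFO address on behalf of the DMA device */
static dma_addr_t demo_map_slave_fifo(struct dma_chan *chan,
				      phys_addr_t fifo_phys, size_t fifo_size,
				      enum dma_data_direction dir)
{
	return dma_map_resource(chan->device->dev, fifo_phys, fifo_size,
				dir, 0);
}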

To sum u

[PATCH 03/25] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-03-24 Thread Serge Semin
A basic device-specific linear memory mapping was introduced back in
commit ("dma: Take into account dma_pfn_offset") as a single-valued offset
preserved in the device.dma_pfn_offset field, which was initialized for
instance by means of the "dma-ranges" DT property. Afterwards the
functionality was extended to support more than one device-specific region
defined in the device.dma_range_map list of maps. But all of these
improvements concerned the single-pointer, page and sg DMA-mapping methods,
while the system resource mapping function turned out to miss the
corresponding modification. Thus the dma_direct_map_resource() method now
just casts the CPU physical address to the device DMA address with no
dma-ranges-based mapping taken into account, which is obviously wrong.
Let's fix it by using the phys_to_dma_direct() method to get the
device-specific bus address from the passed memory resource for the case
of the directly mapped DMA.

Fixes: 25f1e1887088 ("dma: Take into account dma_pfn_offset")
Signed-off-by: Serge Semin 
---
 kernel/dma/direct.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 50f48e9e4598..9ce8192b29ab 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -497,7 +497,7 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
-   dma_addr_t dma_addr = paddr;
+   dma_addr_t dma_addr = phys_to_dma_direct(dev, paddr);
 
if (unlikely(!dma_capable(dev, dma_addr, size, false))) {
dev_err_once(dev,
-- 
2.35.1



Re: [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index

2019-03-12 Thread Serge Semin
On Wed, Mar 06, 2019 at 04:22:33PM -0700, Logan Gunthorpe wrote:
> 
> On 2019-03-06 3:45 p.m., Serge Semin wrote:
> [Snip]
> 
> Pretty sure everything above is just agreement...
> 
> > So your current approach is inbound MW-centralized, while mine is developed
> > around the outbound MWs.
> 
> I don't think this has anything to do with inbound vs outbound. The
> problem is the same no matter from which direction you assign things.
> 
> >> Physical Port Number: 0  2  4  6  8  12  16  20
> >> Logical Port Number:  0  1  2  3  4  5   6   7
> >> Peer Index (Port 0):  x  0  1  2  3  4   5   6
> >> Port Index (Port 8):  0  1  2  3  x  4   5   6
> >> (etc)
> > 
> > That's what I suggested in the two possible solutions:
> > 1st solution: replace current pidx with Logical Port Number,
> > 2nd solution: alter ntb_peer_resource_idx() so it would return the Logical 
> > Port Number.
> 
> Well my solution wasn't to change pidx and no, we don't want to change
> ntb_peer_resource_idx() to return the logical port number because that's
> not how it's being used and I have no use for a
> ntb_peer_port_global_idx() function. Functions are supposed to be added
> by code that needs to call them, not by some grand design. Part of the
> reason we have this confusing mess is because the API was reviewed and
> merged before any users of the API were presented. Usually this is not
> accepted in kernel development.
> 
> My suggestion is to simply say that the existing port numbers are the
> logical port number and have the drivers handle the physical port number
> mapping internally. That requires the fewest changes.
> 

Now I see what you were suggesting. The solution you propose doesn't require
any changes in your patches, but it requires updating the IDT driver. It also
changes the port numbering semantics, which is something you said you
didn't want to introduce.
(Read all further comments to get my conclusion regarding it.)

> > IMO In case of the 2nd solution I'd also suggest to rename the
> > ntb_peer_resource_idx() method into ntb_peer_port_global_idx(),
> > and then consider the current port indexes used in the NTB API
> > as local port indexes. The resource indexing can be abstracted
> > by a macro like this:
> > #define ntb_peer_resource_idx ntb_peer_port_global_idx
> 
> That define would not be useful.
> 
> > Finally in order to close the space up we'd also need to define
> > a method: ntb_port_global_idx(), which would return a Logical (global)
> > index of local port.
> 
> Again, I'd rather not add a bunch of large semantic and infrastructure
> changes at this point. It's confusing enough as it is and we don't need
> to introduce yet another indexing scheme API to the clients that really
> do not need it. What the clients need is a simple API to decide which
> resources to use for which peers, and to figure out which peers used
> which resources. ntb_peer_port_idx() and ntb_peer_resource_idx() suit
> these purposes. Nothing else really needs to exist.
> 

If you don't want to add a large semantic and infrastructure change at
this point, then it would be better to leave the NTB port API the way it is
now and use logical port indexes in the ntb_peer_resource_idx() method.
You'd also need to use this method to access both outbound and inbound
memory windows. In this case we wouldn't need to change anything in the
current API and drivers. Yes, this approach would waste one resource, but
it would lead to simpler semantics for the port numbers API.

Don't get me wrong, I am not completely against the approach you suggest.
It just has its pros and cons, the same way as mine does. I'll explain my
opinion later in this email.

> >> Where the Physical Port Number is whatever the hardware uses and the
> >> logical port number is a numbering scheme starting with zero with no
> >> gaps. Then the port indexes are still as we currently have them. If we
> >> say that the port numbers we have now are the Logical Port Number, then
> >> ntb_peer_resource_idx() is correct.
> >>
> > 
> > Current port numbers are the physical port numbers with gaps. 
> 
> I think that's up for interpretation as, based on the existing code, I
> naturally interpreted it the other way and therefore it's pretty simple
> to say that it's the logical port number and fix the one driver that
> needs to change.
> 
> > That's why we
> > introduced the port-index NTB API abstraction in the first place, to have 
> > these gaps
> > eliminated and to provide a simple way of bulk setup. Although that 
> > abstraction
> > turned out not that suitable to distribute the shared resources. So
> > the Logical (G

Re: [PATCH v2 09/12] NTB: Introduce MSI library

2019-03-06 Thread Serge Semin
On Wed, Mar 06, 2019 at 02:35:53PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-03-06 1:26 p.m., Serge Semin wrote:
> > First of all, it might be unsafe to have some resources consumed by NTB
> > MSI or some other library without a simple way to warn NTB client drivers
> > about their attempts to access those resources, since it might lead to
> > random errors. When I thought about implementing a transport library based
> > on the Message/Spad+Doorbell registers, I had in mind creating an internal
> > bit-field array with the resources' busy-flags. If, for instance, some
> > message or scratchpad register is occupied by a library (MSI, transport or
> > something else), then it would be impossible to access these resources
> > directly through the NTB API methods. So an NTB client driver would get an
> > error on an attempt to write/read data to/from a busy message or scratchpad
> > register, or on an attempt to set an occupied doorbell bit. The same thing
> > can be done for memory windows.
> 
> Yes, it would be nice to have a generic library to manage all the
> resources, but right now we don't and it's unfair to expect us to take
> on this work to get the features we care about merged. Right now, it's
> not at all unsafe as the client is quite capable of ensuring it has the
> resources for the MSI library. The changes for ntb_transport to ensure
> this are quite reasonable.
> 
> > My second, tiny concern is about documentation. Since there is a dedicated
> > file for all NTB-related docs, it would be good to have some description of
> > the NTB MSI library there as well:
> > Documentation/ntb.txt
> 
> Sure, I'll add a short blurb for v3. Though, I noticed it's quite out of
> date since your changes. Especially in the ntb_tool section...
> 

Ok. Thanks.
If you want you can add some info to the ntb_tool section as well. If you
don't have time, I'll update it next time I submit anything new to the
subsystem.

-Sergey

> >> +  u32 *peer_mws[];
> > 
> > Shouldn't we use the __iomem attribute here since later the devm_ioremap() 
> > is
> > used to map MWs at these pointers?
> 
> Yes, will change for v3.
> 
> 
> > Simpler and faster cleanup-code would be:
> 
> > + unroll:
> > +   for (--i; i >= 0; --i)
> > +   devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);
> 
> Faster, maybe, but I would not consider this simpler. It's much more
> complicated to reason about and ensure it's correct. I prefer my way
> because I don't care about speed, but I do care about readability.
> 
> 
> > Alas, calling the ntb_mw_set_trans() method isn't enough to fully initialize
> > NTB Memory Windows. Yes, the library will work for Intel/AMD/Switchtec
> > (two-port legacy configuration), but will fail for IDT due to being based on
> > the outbound MW xlat interface. So the library at this stage isn't portable
> > across all NTB hardware. In order to make it work, the translation address is
> > supposed to be transferred to the peer side, where the peer code should call
> > the ntb_peer_mw_set_trans() method with the retrieved xlat address.
> > See documentation for details:
> 
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/ntb.txt
> > 
> > ntb_perf driver can be also used as a reference of the portable NTB MWs 
> > setup.
> 
> Gross. Well, given that ntb_transport doesn't even support this and we
> don't really have a sensible library to transfer this information, I'm
> going to leave it as is for now. Someone can update ntb_msi when they
> update ntb_transport, preferably after we have a nice library to handle
> the transfers for us seeing I absolutely do not want to replicate the
> mess in ntb_perf.
> 
> Actually, if we had a generic spad/msg communication library, it would
> probably be better to have a common ntb_mw_set_trans() function that
> uses the communications library to send the data and automatically call
> ntb_peer_mw_set_trans() on the peer. That way we don't have to push this
> mess into the clients.
> 
> > The same cleanup pattern can be utilized here:
> > +error_out:
> > +   for (--peer; peer >= 0; --peer) {
> > +   peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> > +   ntb_mw_clear_trans(ntb, i, peer_widx);
> > +   }
> > 
> > So you won't need "i" variable here anymore. You also don't need to check 
> > the
> > return value of ntb_peer_highest_mw_idx() in the cleanup loop because it
> > was already checked in the main algo code.
> 
> See above.
> 

Re: [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index

2019-03-06 Thread Serge Semin
On Wed, Mar 06, 2019 at 12:11:11PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-03-05 6:24 p.m., Serge Semin wrote:
> >> + * In a 5 peer system, this function will return the following matrix
> >> + *
> >> + * pidx \ port    0    1    2    3    4
> >> + * 0              0    0    1    2    3
> >> + * 1              0    1    2    3    4
> >> + * 2              0    1    2    3    4
> >> + * 3              0    1    2    3    4
> >> + *
> 
> Oh, first, oops: looks like I copied this down wrong anyway; the code
> was what I had intended, but the documented example should have been:
> 
> pidx \ local_port    0    1    2    3    4
>  0                   0    0    1    2    3
>  1                   0    1    1    2    3
>  2                   0    1    2    2    3
>  3                   0    1    2    3    3
> 
> And this is definitely the correct table we are aiming for.
> ntb_peer_resource_idx() is supposed to return the result of
> ntb_peer_port_idx(ntb, local_port) when run on the peer specified by pidx.
> 
> Note: this table also makes sense because it only uses 4 resources for 5
> ports which is the best case scenario. (In other words, to communicate
> between N ports, N-1 resources are required on each peer).
> 

Yes, it does use the resources as tightly as possible, but only for the
case of contiguous port numbering. If there are gaps in the port number
space (which is the only case we have in the supported hardware at the
moment), it will lead to a failure if there are ports with higher numbers
than there are MWs available (MW availability depends on the IDT chip
firmware). Additionally, it creates gaps in the MW space if the physical
ports are numbered with gaps. Since the only multi-port device we've got
now is IDT, and it always has its ports numbered with gaps as I described,
the current implementation will definitely produce problems.
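
To illustrate the "logical port number" idea in isolation (the demo_* helper
is hypothetical, not part of the NTB API): physical IDT port numbers with
gaps are mapped onto a dense zero-based index space simply by their position
in the sorted list of NTB-enabled ports:

#include <linux/errno.h>

/* ports[] holds the NTB-enabled physical port numbers in ascending order,
 * e.g. { 0, 2, 4, 6, 8, 12, 16, 20 } on an IDT switch. */
static int demo_port_to_logical_idx(const int *ports, int nports, int port)
{
	int i;

	for (i = 0; i < nports; i++)
		if (ports[i] == port)
			return i;	/* dense, gap-free logical index */

	return -EINVAL;
}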

> > This table is too simplified to represent a generic case of port-index
> > mapping table. In particular the IDT PCIe switch got it ports numbered
> > with uneven integers like: 0 2 4 6 8 12 16 20 or 0 8 16, and so on.
> > Moreover some of the ports might be disabled or may have NTB functions
> > deactivated, in which case these ports shouldn't be considered by NTB 
> > subsystem
> > at all. Basically we may have any increasing subset of that port
> > numbers depending on the current IDT PCIe-switch ports setup.
> 
> Yes, I did not consider situations where there would be gaps in the
> "port number" space. It wasn't at all clear from the code that this was
> possible. Switchtec hardware could be configured for such an
> arrangement, but I don't know why anyone would do that as it just
> needlessly complicates everything.
> 
> As you point out, with a gap, we end up with something that is wrong:
> 
> pidx \ port    0    1    3    4    5
>  0             0    0    2    3    4
>  1             0    1    2    3    4
>  2             0    1    3    3    4
>  3             0    1    3    4    4
> 
> Here, the relationship between ntb_peer_resource_idx() and
> ntb_peer_port_idx() is not maintained and it seems to prescribe 5
> resources for 5 ports. If there were more gaps it would be even more wrong.
> 

Exactly. The table will look even worse for the port numbers: 0 2 4 6 8 12 16 20.

> >> +static inline int ntb_peer_resource_idx(struct ntb_dev *ntb, int pidx)
> >> +{
> >> +  int local_port, peer_port;
> >> +
> >> +  if (pidx >= ntb_peer_port_count(ntb))
> >> +  return -EINVAL;
> >> +
> >> +  local_port = ntb_port_number(ntb);
> >> +  peer_port = ntb_peer_port_number(ntb, pidx);
> >> +
> >> +  if (peer_port < local_port)
> >> +  return local_port - 1;
> >> +  else
> >> +  return local_port;
> >> +}
> >> +
> > 
> > Instead of redefining the port-index table we can just fix the
> > ntb_peer_resource_idx() method, so it would return a global port index
> > instead of some number based on the port number. It can be done just by
> > the next modification:
> > 
> > + if (peer_port <= local_port)
> > + return pidx;
> > + else
> > + return pidx + 1;
> > 
> 
> This creates a table that looks like:
> 
> pidx \ port    0    1    2    3    4
>  0             1    0    0    0    0
>  1             2    2    1    1    1
>  2             3    3    3    2    2
>  3             4    4    4    4    3
> 
> Which is not correct. In fact, it seems to require 5 resources for 5
> ports. 

Re: [PATCH v2 10/12] NTB: Introduce NTB MSI Test Client

2019-03-06 Thread Serge Semin
Hi

On Wed, Feb 13, 2019 at 10:54:52AM -0700, Logan Gunthorpe wrote:
> Introduce a tool to test NTB MSI interrupts similar to the other
> NTB test tools. This tool creates a debugfs directory for each
> NTB device with the following files:
> 
> port
> irqX_occurrences
> peerX/port
> peerX/count
> peerX/trigger
> 
> The 'port' file tells the user the local port number and the
> 'occurrences' files tell the number of local interrupts that
> have been received for each interrupt.
> 
> For each peer, the 'port' file and the 'count' file tell you the
> peer's port number and number of interrupts respectively. Writing
> the interrupt number to the 'trigger' file triggers the interrupt
> handler for the peer which should increment their corresponding
> 'occurrences' file. The 'ready' file indicates if a peer is ready,
> writing to this file blocks until it is ready.
> 
> The module parameter num_irqs can be used to set the number of
> local interrupts. By default this is 4. This is only limited by
> the number of unused MSI interrupts registered by the hardware
> (this will require support of the hardware driver) and there must
> be at least 2*num_irqs + 1 spads registers available.
> 

Alas, the test driver is not going to work with IDT NTB hardware,
since it uses Spads only, which aren't supported by IDT devices. IDT NTB
PCIe functions provide message registers to communicate with peers. See
the ntb_perf driver code for a reference portable driver design.

In order to make the test driver portable, the following alterations
are needed (a rough sketch of item 1 follows below):
1) Add NTB Message register API usage alongside Scratchpad+Doorbell.
2) When the NTB MSI library is updated with functions like
ntb_msi_peer_setup_mws()/ntb_msi_peer_clear_mws(), they need to be used
in the test driver as well to set a translation address on the peer side.
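
A rough sketch of item 1 (the demo_* helper is hypothetical; only the
ntb_msg_count()/ntb_spad_count() accessors from the NTB API are assumed),
in the spirit of how ntb_perf picks its side-band interface at init time:

#include <linux/errno.h>
#include <linux/ntb.h>

static int demo_pick_cmd_interface(struct ntb_dev *ntb, int num_irqs)
{
	/* IDT NTB functions expose Message registers instead of Scratchpads */
	if (ntb_msg_count(ntb) > 0)
		return 1;	/* use ntb_peer_msg_write()/ntb_msg_read() */

	/* The test tool's scheme needs 2 * num_irqs + 1 scratchpads */
	if (ntb_spad_count(ntb) >= 2 * num_irqs + 1)
		return 0;	/* use ntb_peer_spad_write()/ntb_spad_read() */

	return -EINVAL;
}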

> Signed-off-by: Logan Gunthorpe 
> Cc: Jon Mason 
> Cc: Dave Jiang 
> Cc: Allen Hubbe 
> ---
>  drivers/ntb/test/Kconfig|   9 +
>  drivers/ntb/test/Makefile   |   1 +
>  drivers/ntb/test/ntb_msi_test.c | 433 
>  3 files changed, 443 insertions(+)
>  create mode 100644 drivers/ntb/test/ntb_msi_test.c
> 
> diff --git a/drivers/ntb/test/Kconfig b/drivers/ntb/test/Kconfig
> index a5d0eda44438..a3f3e2638935 100644
> --- a/drivers/ntb/test/Kconfig
> +++ b/drivers/ntb/test/Kconfig
> @@ -25,3 +25,12 @@ config NTB_PERF
>to and from the window without additional software interaction.
>  
>If unsure, say N.
> +
> +config NTB_MSI_TEST
> + tristate "NTB MSI Test Client"
> + depends on NTB_MSI
> + help
> +   This tool demonstrates the use of the NTB MSI library to
> +   send MSI interrupts between peers.
> +
> +   If unsure, say N.
> diff --git a/drivers/ntb/test/Makefile b/drivers/ntb/test/Makefile
> index 9e77e0b761c2..d2895ca995e4 100644
> --- a/drivers/ntb/test/Makefile
> +++ b/drivers/ntb/test/Makefile
> @@ -1,3 +1,4 @@
>  obj-$(CONFIG_NTB_PINGPONG) += ntb_pingpong.o
>  obj-$(CONFIG_NTB_TOOL) += ntb_tool.o
>  obj-$(CONFIG_NTB_PERF) += ntb_perf.o
> +obj-$(CONFIG_NTB_MSI_TEST) += ntb_msi_test.o
> diff --git a/drivers/ntb/test/ntb_msi_test.c b/drivers/ntb/test/ntb_msi_test.c
> new file mode 100644
> index ..99d826ed9c34
> --- /dev/null
> +++ b/drivers/ntb/test/ntb_msi_test.c
> @@ -0,0 +1,433 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +MODULE_LICENSE("Dual BSD/GPL");
> +MODULE_VERSION("0.1");
> +MODULE_AUTHOR("Logan Gunthorpe ");
> +MODULE_DESCRIPTION("Test for sending MSI interrupts over an NTB memory 
> window");
> +
> +static int num_irqs = 4;
> +module_param(num_irqs, int, 0644);
> +MODULE_PARM_DESC(num_irqs, "number of irqs to use");
> +
> +struct ntb_msit_ctx {
> + struct ntb_dev *ntb;
> + struct dentry *dbgfs_dir;
> + struct work_struct setup_work;
> +
> + struct ntb_msit_isr_ctx {
> + int irq_idx;
> + int irq_num;
> + int occurrences;
> + struct ntb_msit_ctx *nm;
> + struct ntb_msi_desc desc;
> + } *isr_ctx;
> +
> + struct ntb_msit_peer {
> + struct ntb_msit_ctx *nm;
> + int pidx;
> + int num_irqs;
> + struct completion init_comp;
> + struct ntb_msi_desc *msi_desc;
> + } peers[];
> +};
> +
> +static struct dentry *ntb_msit_dbgfs_topdir;
> +
> +static irqreturn_t ntb_msit_isr(int irq, void *dev)
> +{
> + struct ntb_msit_isr_ctx *isr_ctx = dev;
> + struct ntb_msit_ctx *nm = isr_ctx->nm;
> +
> + dev_dbg(&nm->ntb->dev, "Interrupt Occurred: %d",
> + isr_ctx->irq_idx);
> +
> + isr_ctx->occurrences++;
> +
> + return IRQ_HANDLED;
> +}
> +
> +static void ntb_msit_setup_work(struct work_struct *work)
> +{
> + struct ntb_msit_ctx *nm = container_of(work, struct ntb_msit_ctx,
> +setup_work);
> + int 

Re: [PATCH v2 09/12] NTB: Introduce MSI library

2019-03-06 Thread Serge Semin
On Wed, Feb 13, 2019 at 10:54:51AM -0700, Logan Gunthorpe wrote:

Hi

> The NTB MSI library allows passing MSI interrupts across a memory
> window. This offers similar functionality to doorbells or messages
> except will often have much better latency and the client can
> potentially use significantly more remote interrupts than typical hardware
> provides for doorbells. (Which can be important in high-multiport
> setups.)
> 
> The library utilizes one memory window per peer and uses the highest
> index memory windows. Before any ntb_msi function may be used, the user
> must call ntb_msi_init(). It may then setup and tear down the memory
> windows when the link state changes using ntb_msi_setup_mws() and
> ntb_msi_clear_mws().
> 
> The peer which receives the interrupt must call ntb_msim_request_irq()
> to assign the interrupt handler (this function is functionally
> similar to devm_request_irq()) and the returned descriptor must be
> transferred to the peer which can use it to trigger the interrupt.
> The triggering peer, once having received the descriptor, can
> trigger the interrupt by calling ntb_msi_peer_trigger().
> 

The library is very useful, thanks for sharing it with us.

Here are my two general concerns regarding the implementation.
(More specific comments are further in the letter.)

First of all, it might be unsafe to have some resources consumed by the NTB
MSI library (or any other library) without a simple way to warn NTB client
drivers about attempts to access those resources, since such accesses may
lead to random errors. When I thought about implementing a transport library
based on the Message/Spad+Doorbell registers, I had in mind creating an
internal bit-field array with per-resource busy flags. If, for instance, some
message or scratchpad register is occupied by a library (MSI, transport or
anything else), then it would be impossible to access that resource directly
through the NTB API methods. An NTB client driver would then receive an error
on an attempt to write/read data to/from a busy message or scratchpad
register, or on an attempt to set an occupied doorbell bit. The same thing
can be done for memory windows.
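
Just to illustrate the idea, something like the snippet below could live in
the NTB core. Everything here is hypothetical: neither the spad_busy field
nor the reserve/release helpers exist in the current API.

struct ntb_dev {
	/* ... existing fields ... */

	/* one busy flag per scratchpad register, claimed by libraries */
	DECLARE_BITMAP(spad_busy, 64);
};

static int ntb_spad_reserve(struct ntb_dev *ntb, int sidx)
{
	if (sidx >= ntb_spad_count(ntb))
		return -EINVAL;

	/* atomically claim the register, e.g. on behalf of the MSI library */
	if (test_and_set_bit(sidx, ntb->spad_busy))
		return -EBUSY;

	return 0;
}

static void ntb_spad_release(struct ntb_dev *ntb, int sidx)
{
	clear_bit(sidx, ntb->spad_busy);
}

ntb_spad_write()/ntb_peer_spad_write() would then just return an error for a
reserved index unless the call comes from the owning library.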

The second, tiny concern is about documentation. Since there is a dedicated
file for all the NTB-related docs, it would be good to have some description
of the NTB MSI library there as well:
Documentation/ntb.txt

> Signed-off-by: Logan Gunthorpe 
> Cc: Jon Mason 
> Cc: Dave Jiang 
> Cc: Allen Hubbe 
> ---
>  drivers/ntb/Kconfig  |  11 ++
>  drivers/ntb/Makefile |   3 +-
>  drivers/ntb/msi.c| 415 +++
>  include/linux/ntb.h  |  73 
>  4 files changed, 501 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/ntb/msi.c
> 
> diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> index 95944e52fa36..5760764052be 100644
> --- a/drivers/ntb/Kconfig
> +++ b/drivers/ntb/Kconfig
> @@ -12,6 +12,17 @@ menuconfig NTB
>  
>  if NTB
>  
> +config NTB_MSI
> + bool "MSI Interrupt Support"
> + depends on PCI_MSI
> + help
> +  Support using MSI interrupt forwarding instead of (or in addition to)
> +  hardware doorbells. MSI interrupts typically offer lower latency
> +  than doorbells and more MSI interrupts can be made available to
> +  clients. However this requires an extra memory window and support
> +  in the hardware driver for creating the MSI interrupts.
> +
> +  If unsure, say N.
>  source "drivers/ntb/hw/Kconfig"
>  
>  source "drivers/ntb/test/Kconfig"
> diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
> index 537226f8e78d..cc27ad2ef150 100644
> --- a/drivers/ntb/Makefile
> +++ b/drivers/ntb/Makefile
> @@ -1,4 +1,5 @@
>  obj-$(CONFIG_NTB) += ntb.o hw/ test/
>  obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
>  
> -ntb-y := core.o
> +ntb-y:= core.o
> +ntb-$(CONFIG_NTB_MSI)+= msi.o
> diff --git a/drivers/ntb/msi.c b/drivers/ntb/msi.c
> new file mode 100644
> index ..5d4bd7a63924
> --- /dev/null
> +++ b/drivers/ntb/msi.c
> @@ -0,0 +1,415 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +MODULE_LICENSE("Dual BSD/GPL");
> +MODULE_VERSION("0.1");
> +MODULE_AUTHOR("Logan Gunthorpe ");
> +MODULE_DESCRIPTION("NTB MSI Interrupt Library");
> +
> +struct ntb_msi {
> + u64 base_addr;
> + u64 end_addr;
> +
> + void (*desc_changed)(void *ctx);
> +
> + u32 *peer_mws[];

Shouldn't we use the __iomem attribute here, since devm_ioremap() is later
used to map the MWs these pointers refer to?
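
I.e. something like this (just a sketch of the annotation, assuming the
trigger path dereferences these pointers to store the MSI data):

	u32 __iomem *peer_mws[];

with the actual stores going through the proper accessor, e.g.:

	iowrite32(val, &msi->peer_mws[peer][idx]);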

> +};
> +
> +/**
> + * ntb_msi_init() - Initialize the MSI context
> + * @ntb: NTB device context
> + *
> + * This function must be called before any other ntb_msi function.
> + * It initializes the context for MSI operations and maps
> + * the peer memory windows.
> + *
> + * This function reserves the last N outbound memory windows (where N
> + * is the number of peers).
> + *
> + * Return: Zero 

Re: [PATCH v2 07/12] NTB: Introduce functions to calculate multi-port resource index

2019-03-06 Thread Serge Semin
On Wed, Feb 13, 2019 at 10:54:49AM -0700, Logan Gunthorpe wrote:

Hi

> When using multi-ports each port uses resources (dbs, msgs, mws, etc)
> on every other port. Creating a mapping for these resources such that
> each port has a corresponding resource on every other port is a bit
> tricky.
> 
> Introduce the ntb_peer_resource_idx() function for this purpose.
> It returns the peer resource number that will correspond with the
> local peer index on the remote peer.
> 
> Also, introduce ntb_peer_highest_mw_idx() which will use
> ntb_peer_resource_idx() but return the MW index starting with the
> highest index and working down.
> 
> Signed-off-by: Logan Gunthorpe 
> Cc: Jon Mason 
> Cc: Dave Jiang 
> Cc: Allen Hubbe 
> ---
>  include/linux/ntb.h | 70 +
>  1 file changed, 70 insertions(+)
> 
> diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> index 181d16601dd9..f5c69d853489 100644
> --- a/include/linux/ntb.h
> +++ b/include/linux/ntb.h
> @@ -1502,4 +1502,74 @@ static inline int ntb_peer_msg_write(struct ntb_dev 
> *ntb, int pidx, int midx,
>   return ntb->ops->peer_msg_write(ntb, pidx, midx, msg);
>  }
>  
> +/**
> + * ntb_peer_resource_idx() - get a resource index for a given peer idx
> + * @ntb: NTB device context.
> + * @pidx: Peer port index.
> + *
> + * When constructing a graph of peers, each remote peer must use a different
> + * resource index (mw, doorbell, etc) to communicate with each other
> + * peer.
> + *
> + * In a two peer system, this function should always return 0 such that
> + * resource 0 points to the remote peer on both ports.
> + *
> + * In a 5 peer system, this function will return the following matrix
> + *
> + * pidx \ port    0    1    2    3    4
> + * 0              0    0    1    2    3
> + * 1              0    1    2    3    4
> + * 2              0    1    2    3    4
> + * 3              0    1    2    3    4
> + *

This table is too simplified to represent the generic case of a port-index
mapping table. In particular, the IDT PCIe switch has its ports numbered with
uneven integers like 0 2 4 6 8 12 16 20, or 0 8 16, and so on. Moreover, some
of the ports might be disabled or have their NTB functions deactivated, in
which case these ports shouldn't be considered by the NTB subsystem at all.
Basically we may get any increasing subset of those port numbers depending on
the current IDT PCIe-switch ports setup.

What I am trying to say is that if in real life the NTB MSI library or some
other library used that matrix to calculate the MW index, then most likely it
would either consume nearly all the MWs while leaving holes in the MW space,
or simply run out of them, since depending on the configuration there might
be from 0 to 24 MWs enabled in an IDT NTB function. In order to solve the
problem, we either need the ntb_peer_resource_idx() method fixed so that it
uses the pidx instead of the raw port number, or we should properly alter the
current NTB API port-index table implementation, so that ntb_pingpong, the
NTB MSI library and others wouldn't need to have some other table to
distribute the NTB resources.
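
Something along these lines is what I have in mind. Note that
ntb_logical_port_number()/ntb_peer_logical_port_number() don't exist; they
would first have to map the uneven physical port numbers onto a dense
0..(port count - 1) range, so this is only a sketch of the first option:

static inline int ntb_peer_resource_idx(struct ntb_dev *ntb, int pidx)
{
	int local_port, peer_port;

	if (pidx >= ntb_peer_port_count(ntb))
		return -EINVAL;

	/* dense "logical" numbers instead of the raw 0/2/4/6/8/12/16/20 */
	local_port = ntb_logical_port_number(ntb);
	peer_port = ntb_peer_logical_port_number(ntb, pidx);

	if (peer_port < local_port)
		return local_port - 1;
	else
		return local_port;
}

This way the resource indexes stay dense no matter how the switch ports are
actually numbered or which of them have the NTB functions enabled.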

A short preamble on why we introduced the port-index NTB API. Since each NTB
MW and NTB message register can be set up to be associated with any NTB
device port, we needed the NTB API functions to accept a port attribute. But
it turned out that the IDT PCIe switch ports are unevenly numbered, which
makes using the raw numbers for bulk configuration awkward. So in order to
simplify the client code utilizing the multi-port NTB API (for example, to
allow simple looping over the device ports, as sketched below), we decided to
create the port-index table in the NTB API internals. The table is exposed
via the set of ntb_port_*()/ntb_peer_port_*() functions. So now all the NTB
API methods accept a port index instead of an irregular port number.
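
For instance, a client only needs something like this to walk all the peers
(a minimal sketch):

	int pidx, port;

	for (pidx = 0; pidx < ntb_peer_port_count(ntb); pidx++) {
		port = ntb_peer_port_number(ntb, pidx);
		dev_dbg(&ntb->dev, "peer %d is on port %d\n", pidx, port);
		/* it is pidx, not port, that gets passed to the NTB API */
	}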

Here is the port-index table currently defined for IDT 89HPES32NT8 PCIe-switch
with all 8 ports having NTB functions enabled:

peer port \ local port   0  2  4  6  8  12  16  20
                    0    x  0  0  0  0   0   0   0
                    2    0  x  1  1  1   1   1   1
                    4    1  1  x  2  2   2   2   2
                    6    2  2  2  x  3   3   3   3
                    8    3  3  3  3  x   4   4   4
                   12    4  4  4  4  4   x   5   5
                   16    5  5  5  5  5   5   x   6
                   20    6  6  6  6  6   6   6   x

As you can see, currently each NTB device side has got its own port-index
mapping, so each port can access the other ports, but not itself.
I implemented it this way intentionally for two reasons:
1) I don't think anyone would ever need to have MW or Message registers set
up to point to the local port.
2) The IDT PCIe switch will report an "Unsupported TLP" error if there is an
attempt to touch a MW frame initialized with a port number matching the local
port number. (Internally the IDT PCIe-switch is working with port partitions.