RE: [PATCH] dma-mapping: make map_benchmark compile into module

2021-03-23 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: tiantao (H)
> Sent: Wednesday, March 24, 2021 3:18 PM
> To: a...@linux-foundation.org; pet...@infradead.org; paul...@kernel.org;
> a...@kernel.org; t...@linutronix.de; rost...@goodmis.org; h...@lst.de;
> m.szyprow...@samsung.com; Song Bao Hua (Barry Song)
> 
> Cc: iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org; tiantao
> (H) 
> Subject: [PATCH] dma-mapping: make map_benchmark compile into module
> 
> Under some scenarios it is necessary to compile map_benchmark as a
> module to test the iommu, so change the Kconfig entry to tristate and
> export the needed symbol so that map_benchmark can be built as a module.
> 
> On the other hand, map_benchmark is a driver, which is supposed
> to be able to run as a module.
> 
> Signed-off-by: Tian Tao 
> ---

Acked-by: Barry Song 

Looks sensible to me. I like the idea that map_benchmark is
a driver. It seems unreasonable to always require it to be built-in.
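
For readers wondering why the kthread_bind_mask() export below is needed:
once DMA_MAP_BENCHMARK is tristate, the benchmark's kthread setup runs from
module code. A minimal sketch of that pattern (hedged: illustrative names,
not the actual map_benchmark code):

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/topology.h>

/* Hypothetical benchmark thread body. */
static int bench_threadfn(void *data)
{
	/* ... time dma_map_single()/dma_unmap_single() in a loop ... */
	return 0;
}

static int start_bench_thread(int node)
{
	struct task_struct *tsk;

	tsk = kthread_create(bench_threadfn, NULL, "dma-map-bench");
	if (IS_ERR(tsk))
		return PTR_ERR(tsk);

	/* This call is what needs EXPORT_SYMBOL() once built as a module. */
	kthread_bind_mask(tsk, cpumask_of_node(node));
	wake_up_process(tsk);
	return 0;
}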


>  kernel/dma/Kconfig | 2 +-
>  kernel/kthread.c   | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> index 77b4055..0468293 100644
> --- a/kernel/dma/Kconfig
> +++ b/kernel/dma/Kconfig
> @@ -223,7 +223,7 @@ config DMA_API_DEBUG_SG
> If unsure, say N.
> 
>  config DMA_MAP_BENCHMARK
> - bool "Enable benchmarking of streaming DMA mapping"
> + tristate "Enable benchmarking of streaming DMA mapping"
>   depends on DEBUG_FS
>   help
> Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 1578973..fa4736f 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -455,6 +455,7 @@ void kthread_bind_mask(struct task_struct *p, const struct cpumask *mask)
>  {
>   __kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE);
>  }
> +EXPORT_SYMBOL(kthread_bind_mask);
> 
>  /**
>   * kthread_bind - bind a just-created kthread to a cpu.
> --
> 2.7.4

Thanks
Barry



[PATCH] dma-mapping: make map_benchmark compile into module

2021-03-23 Thread Tian Tao
Under some scenarios it is necessary to compile map_benchmark as a
module to test the iommu, so change the Kconfig entry to tristate and
export the needed symbol so that map_benchmark can be built as a module.

On the other hand, map_benchmark is a driver, which is supposed
to be able to run as a module.

Signed-off-by: Tian Tao 
---
 kernel/dma/Kconfig | 2 +-
 kernel/kthread.c   | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 77b4055..0468293 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -223,7 +223,7 @@ config DMA_API_DEBUG_SG
  If unsure, say N.
 
 config DMA_MAP_BENCHMARK
-   bool "Enable benchmarking of streaming DMA mapping"
+   tristate "Enable benchmarking of streaming DMA mapping"
depends on DEBUG_FS
help
  Provides /sys/kernel/debug/dma_map_benchmark that helps with testing
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 1578973..fa4736f 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -455,6 +455,7 @@ void kthread_bind_mask(struct task_struct *p, const struct cpumask *mask)
 {
__kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE);
 }
+EXPORT_SYMBOL(kthread_bind_mask);
 
 /**
  * kthread_bind - bind a just-created kthread to a cpu.
-- 
2.7.4



Re: [PATCH 0/3] Apple M1 DART IOMMU driver

2021-03-23 Thread Mark Kettenis
> Date: Tue, 23 Mar 2021 14:53:46 -0600
> From: Rob Herring 
> 
> On Sun, Mar 21, 2021 at 05:00:50PM +0100, Mark Kettenis wrote:
> > > Date: Sat, 20 Mar 2021 15:19:33 +
> > > From: Sven Peter 
> > > 
> > > Hi,
> > > 
> > > After Hector's initial work [1] to bring up Linux on Apple's M1 it's time 
> > > to
> > > bring up more devices. Most peripherals connected to the SoC are behind an
> > > iommu which Apple calls "Device Address Resolution Table", or DART for
> > > short [2].
> > > Unfortunately, it only shares the name with PowerPC's DART.
> > > Configuring this iommu is mandatory if these peripherals require DMA 
> > > access.
> > > 
> > > This patchset implements initial support for this iommu. The hardware
> > > itself uses a pagetable format that's very similar to the one already
> > > implemented in io-pgtable.c. There are some minor modifications, namely
> > > some details of the PTE format and that there are always three pagetable
> > > levels, which I've implemented as a new format variant.
> > > 
> > > I have mainly tested this with the USB controller in device mode which is
> > > compatible with Linux's dwc3 driver. Some custom PHY initialization 
> > > (which is
> > > not yet ready or fully understood) is required though to bring up the 
> > > ports,
> > > see e.g. my patches to our m1n1 bootloader [3,4]. If you want to test the 
> > > same
> > > setup you will probably need that branch for now and add the nodes from
> > > the DT binding specification example to your device tree.
> > > 
> > > Even though each DART instance could support up to 16 devices, usually
> > > only a single device is actually connected. Different devices generally
> > > just use an entirely separate DART instance with a separate MMIO range,
> > > IRQ, etc.
> > > 
> > > I have just noticed today though that at least the USB DWC3 controller in 
> > > host
> > > mode uses *two* darts at the same time. I'm not sure yet which parts seem 
> > > to
> > > require which DART instance.
> > > 
> > > This means that we might need to support devices attached to two iommus
> > > simultaneously and just create the same iova mappings. Currently this only
> > > seems to be required for USB according to Apple's Device Tree.
> > > 
> > > I see two options for this and would like to get feedback before
> > > I implement either one:
> > > 
> > > 1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the 
> > > first cell
> > >to identify the DART and the second one to identify the master.
> > >The DART DT node would then also take two register ranges that 
> > > would
> > >correspond to the two DARTs. Both instances use the same IRQ and 
> > > the
> > >same clocks according to Apple's device tree and my experiments.
> > >This would keep a single device node and the DART driver would then
> > >simply map iovas in both DARTs if required.
> > > 
> > > 2) Keep #iommu-cells as-is but support
> > > iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
> > >instead.
> > >This would then require two devices nodes for the two DART 
> > > instances and
> > >some housekeeping in the DART driver to support mapping iovas in 
> > > both
> > >DARTs.
> > >I believe omap-iommu.c supports this setup but I will have to read
> > >more code to understand the details there and figure out how to 
> > > implement
> > >this in a sane way.
> > > 
> > > I currently prefer the first option but I don't understand enough details 
> > > of
> > > the iommu system to actually make an informed decision.
> 
> Please don't mix what the h/w looks like with what's easy to
> implement in Linux's IOMMU subsystem. It's pretty clear (at least
> from the description here) that option 2 reflects the h/w.

Apple does represent these as a single node in their device tree.  The
two instances share an interrupt and share power/clock gating.  So
they seem to be part of a single hardware block.

> > > I'm obviously also open to more options :-)
> > 
> > Hi Sven,
> > 
> > I don't think the first option is going to work for PCIe.  PCIe
> > devices will have to use "iommu-map" properties to map PCI devices to
> > the right iommu, and the current implementation seems to assume that
> > #iommu-cells = <1>.  The devicetree binding[1] doesn't explicitly state
> > that it relies on #iommu-cells = <1>, but it isn't clear how the
> > rid-base to iommu-base mapping mechanism would work when that isn't
> > the case.
> > 
> > Now the PCIe DARTs are simpler and seem to have only one "instance"
> > per DART.  So if we keep #iommu-cells = <1> for those, you'd still be
> > fine using the first approach.
> > 
> > As I mentioned before, not all DARTs support the full 32-bit aperture.
> > In particular the PCIe DARTs support a smaller address-space.  It is
> > not clear whether this is a restriction of the PCIe host controller or
> > the DART, but the Apple Device Tree has "vm-base" and "vm-size"
> > properties that encode the base address and size of the aperture.

[PATCH] iommu/amd: page-specific invalidations for more than one page

2021-03-23 Thread Nadav Amit
From: Nadav Amit 

Currently, IOMMU invalidations and device-IOTLB invalidations using
AMD IOMMU fall back to full address-space invalidation if more than a
single page needs to be flushed.

Full flushes are especially inefficient when the IOMMU is virtualized by
a hypervisor, since it requires the hypervisor to synchronize the entire
address-space.

AMD IOMMUs allow software to provide a mask to perform page-specific
invalidations for multiple pages that match the address. The mask is
encoded as part of the address, and the first zero bit in the address
(in bits [51:12]) indicates the mask size.

Use this hardware feature to perform selective IOMMU and IOTLB flushes.
Combine the logic between both for better code reuse.

The IOMMU invalidations passed a smoke-test. The device IOTLB
invalidations are untested.
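
To make the encoding concrete: a hedged, standalone sketch of the
arithmetic (userspace C for illustration, not the patch code). Flushing
the 8 pages at IOVA 0x10000..0x17fff yields 0x13000; bits 13:12 are set
and bit 14 is clear, so the first zero bit encodes a 2^3-page (32 KiB)
range.

#include <stdint.h>
#include <stdio.h>

#define PAGE_MASK (~0xfffULL)

/* Mirrors the patch's logic for the common case (more than one page,
 * msb_diff <= 51). */
static uint64_t inv_address(uint64_t address, uint64_t size)
{
	uint64_t end = address + size - 1;
	int msb_diff = 63 - __builtin_clzll(end ^ address); /* fls64() - 1 */

	address |= (1ULL << msb_diff) - 1;  /* set all the lower bits */
	address &= PAGE_MASK;               /* clear bits 11:0 */
	return address;                     /* caller ORs in the S bit */
}

int main(void)
{
	/* prints 0x13000 */
	printf("0x%llx\n", (unsigned long long)inv_address(0x10000, 8 * 4096));
	return 0;
}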

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 76 +--
 1 file changed, 42 insertions(+), 34 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 9256f84f5ebf..5f2dc3d7f2dc 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -927,33 +927,58 @@ static void build_inv_dte(struct iommu_cmd *cmd, u16 devid)
CMD_SET_TYPE(cmd, CMD_INV_DEV_ENTRY);
 }
 
-static void build_inv_iommu_pages(struct iommu_cmd *cmd, u64 address,
- size_t size, u16 domid, int pde)
+/*
+ * Builds an invalidation address which is suitable for one page or multiple
+ * pages. Sets the size bit (S) as needed if more than one page is flushed.
+ */
+static inline u64 build_inv_address(u64 address, size_t size)
 {
-   u64 pages;
-   bool s;
+   u64 pages, end, msb_diff;
 
pages = iommu_num_pages(address, size, PAGE_SIZE);
-   s = false;
 
-   if (pages > 1) {
+   if (pages == 1)
+   return address & PAGE_MASK;
+
+   end = address + size - 1;
+
+   /*
+* msb_diff would hold the index of the most significant bit that
+* flipped between the start and end.
+*/
+   msb_diff = fls64(end ^ address) - 1;
+
+   /*
+* Bits 63:52 are sign extended. If for some reason bit 51 is different
+* between the start and the end, invalidate everything.
+*/
+   if (unlikely(msb_diff > 51)) {
+   address = CMD_INV_IOMMU_ALL_PAGES_ADDRESS;
+   } else {
/*
-* If we have to flush more than one page, flush all
-* TLB entries for this domain
+* The msb-bit must be clear on the address. Just set all the
+* lower bits.
 */
-   address = CMD_INV_IOMMU_ALL_PAGES_ADDRESS;
-   s = true;
+   address |= (1ull << msb_diff) - 1;
}
 
+   /* Clear bits 11:0 */
address &= PAGE_MASK;
 
+   /* Set the size bit - we flush more than one 4kb page */
+   return address | CMD_INV_IOMMU_PAGES_SIZE_MASK;
+}
+
+static void build_inv_iommu_pages(struct iommu_cmd *cmd, u64 address,
+ size_t size, u16 domid, int pde)
+{
+   u64 inv_address = build_inv_address(address, size);
+
memset(cmd, 0, sizeof(*cmd));
cmd->data[1] |= domid;
-   cmd->data[2]  = lower_32_bits(address);
-   cmd->data[3]  = upper_32_bits(address);
+   cmd->data[2]  = lower_32_bits(inv_address);
+   cmd->data[3]  = upper_32_bits(inv_address);
CMD_SET_TYPE(cmd, CMD_INV_IOMMU_PAGES);
-   if (s) /* size bit - we flush more than one 4kb page */
-   cmd->data[2] |= CMD_INV_IOMMU_PAGES_SIZE_MASK;
if (pde) /* PDE bit - we want to flush everything, not only the PTEs */
cmd->data[2] |= CMD_INV_IOMMU_PAGES_PDE_MASK;
 }
@@ -961,32 +986,15 @@ static void build_inv_iommu_pages(struct iommu_cmd *cmd, u64 address,
 static void build_inv_iotlb_pages(struct iommu_cmd *cmd, u16 devid, int qdep,
  u64 address, size_t size)
 {
-   u64 pages;
-   bool s;
-
-   pages = iommu_num_pages(address, size, PAGE_SIZE);
-   s = false;
-
-   if (pages > 1) {
-   /*
-* If we have to flush more than one page, flush all
-* TLB entries for this domain
-*/
-   address = CMD_INV_IOMMU_ALL_PAGES_ADDRESS;
-   s = true;
-   }
-
-   address &= PAGE_MASK;
+   u64 inv_address = build_inv_address(address, size);
 
memset(cmd, 0, sizeof(*cmd));
cmd->data[0]  = devid;
cmd->data[0] |= (qdep & 0xff) << 24;
cmd->data[1]  = devid;
-   cmd->data[2]  = lower_32_bits(address);
-   cmd->data[3]  = upper_32_bits(address);
+   cmd->data[2]  = lower_32_bits(inv_address);
+   cmd->data[3]  = upper_32_bits(inv_address);
CMD_SET_

Re: [PATCH 0/3] Apple M1 DART IOMMU driver

2021-03-23 Thread Sven Peter via iommu
Hi Mark,

On Tue, Mar 23, 2021, at 21:00, Mark Kettenis wrote:
> The problem with both #1 and #2 is that you end up with two references
> to (effectively) different iommu's in the dwc3 device node.  I don't
> see how that is compatible with the idea of using a single translation
> table for both sub-DARTs.

I don't have a strong opinion about this fwiw.
Options #1 and #2 seem to have precedent in the already existing
iommu bindings though [1,2,3].
I just want to make sure we've thought through all options and understand
their advantages/disadvantages before we make a decision.


I've been reading some more Linux iommu code and here's how I understand
it now / how I could implement at least #1 with a single shared pagetable.
This might also be possible for option #2, but I'll need to think that through
in more detail.

An iommu domain is a collection of devices that share a virtual address space.
During domain allocation I can just allocate a single DART pagetable and
not have anything point to it for now.

A stream identifies the smallest unit the iommu hardware can differentiate.
For the DART we have 16 of these, with a single TCR + four TTBR registers
for each stream.

A device is assigned to individual streams using the "iommus" property. When
a device is attached to a domain we now simply setup the TTBR registers to
point to the iommu domain pagetable. It doesn't matter here if it's a single
stream or multiple ones or even multiple devices sharing a single stream as
long as they're attached to the same domain.

All operations (map, unmap, etc.) now simply first modify the domain
pagetable and then issue the TLB maintenance operations for attached streams.
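
A hedged sketch of that attach/flush flow; every name below (dart_hw,
dart_set_ttbr, dart_flush_tlb_stream, ...) is hypothetical, not the actual
driver:

struct dart_hw;		/* hypothetical handle for one DART instance */
void dart_set_ttbr(struct dart_hw *dart, unsigned int sid, phys_addr_t pt);
void dart_flush_tlb_stream(struct dart_hw *dart, unsigned int sid);

struct dart_domain {
	u64 *pgtable;			/* single shared pagetable */
	unsigned long attached_sids;	/* bitmap of attached streams */
};

static int dart_attach_stream(struct dart_domain *dom,
			      struct dart_hw *dart, unsigned int sid)
{
	/* Point the stream's TTBRs at the domain's pagetable, so any
	 * stream attached later shares the same mappings. */
	dart_set_ttbr(dart, sid, virt_to_phys(dom->pgtable));
	set_bit(sid, &dom->attached_sids);
	return 0;
}

static void dart_flush_domain(struct dart_domain *dom, struct dart_hw *dart)
{
	unsigned int sid;

	/* map/unmap edit the shared pagetable first, then invalidate the
	 * TLB of every attached stream. */
	for_each_set_bit(sid, &dom->attached_sids, 16)
		dart_flush_tlb_stream(dart, sid);
}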


> 
> If you no longer think that is desirable, you'll still have the
> problem that you'll need to modify the dwc3 driver code such that it
> uses the right IOMMU to do its DMA address translation.  Given what
> you write above that sounds really ugly and confusing.  I would
> certainly want to avoid doing that in OpenBSD.

Yeah, I absolutely don't want to hack on the dwc3 code at all.
That will end up being *very* ugly.



Best,

Sven


[1] Documentation/devicetree/bindings/iommu/iommu.txt "Multiple-master IOMMU"
[2] Documentation/devicetree/bindings/qcom,iommu.txt last example
[3] Documentation/devicetree/bindings/arm,smmu.yaml first example


Re: [PATCH 0/3] Apple M1 DART IOMMU driver

2021-03-23 Thread Rob Herring
On Sun, Mar 21, 2021 at 05:00:50PM +0100, Mark Kettenis wrote:
> > Date: Sat, 20 Mar 2021 15:19:33 +
> > From: Sven Peter 
> > 
> > Hi,
> > 
> > After Hector's initial work [1] to bring up Linux on Apple's M1 it's time to
> > bring up more devices. Most peripherals connected to the SoC are behind an
> > iommu which Apple calls "Device Address Resolution Table", or DART for
> > short [2].
> > Unfortunately, it only shares the name with PowerPC's DART.
> > Configuring this iommu is mandatory if these peripherals require DMA access.
> > 
> > This patchset implements initial support for this iommu. The hardware
> > itself uses a pagetable format that's very similar to the one already
> > implemented in io-pgtable.c. There are some minor modifications, namely
> > some details of the PTE format and that there are always three pagetable
> > levels, which I've implemented as a new format variant.
> > 
> > I have mainly tested this with the USB controller in device mode which is
> > compatible with Linux's dwc3 driver. Some custom PHY initialization (which 
> > is
> > not yet ready or fully understood) is required though to bring up the ports,
> > see e.g. my patches to our m1n1 bootloader [3,4]. If you want to test the 
> > same
> > setup you will probably need that branch for now and add the nodes from
> > the DT binding specification example to your device tree.
> > 
> > Even though each DART instance could support up to 16 devices, usually
> > only a single device is actually connected. Different devices generally
> > just use an entirely separate DART instance with a separate MMIO range,
> > IRQ, etc.
> > 
> > I have just noticed today though that at least the USB DWC3 controller in 
> > host
> > mode uses *two* darts at the same time. I'm not sure yet which parts seem to
> > require which DART instance.
> > 
> > This means that we might need to support devices attached to two iommus
> > simultaneously and just create the same iova mappings. Currently this only
> > seems to be required for USB according to Apple's Device Tree.
> > 
> > I see two options for this and would like to get feedback before
> > I implement either one:
> > 
> > 1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the first 
> > cell
> >to identify the DART and the second one to identify the master.
> >The DART DT node would then also take two register ranges that would
> >correspond to the two DARTs. Both instances use the same IRQ and the
> >same clocks according to Apple's device tree and my experiments.
> >This would keep a single device node and the DART driver would then
> >simply map iovas in both DARTs if required.
> > 
> > 2) Keep #iommu-cells as-is but support
> > iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
> >instead.
> >This would then require two devices nodes for the two DART instances 
> > and
> >some housekeeping in the DART driver to support mapping iovas in both
> >DARTs.
> >I believe omap-iommu.c supports this setup but I will have to read
> >more code to understand the details there and figure out how to 
> > implement
> >this in a sane way.
> > 
> > I currently prefer the first option but I don't understand enough details of
> > the iommu system to actually make an informed decision.

Please don't mix what the h/w looks like with what's easy to
implement in Linux's IOMMU subsystem. It's pretty clear (at least
from the description here) that option 2 reflects the h/w.

> > I'm obviously also open to more options :-)
> 
> Hi Sven,
> 
> I don't think the first option is going to work for PCIe.  PCIe
> devices will have to use "iommu-map" properties to map PCI devices to
> the right iommu, and the current implementation seems to assume that
> #iommu-cells = <1>.  The devicetree binding[1] doesn't explicitly state
> that it relies on #iommu-cells = <1>, but it isn't clear how the
> rid-base to iommu-base mapping mechanism would work when that isn't
> the case.
> 
> Now the PCIe DARTs are simpler and seem to have only one "instance"
> per DART.  So if we keep #iommu-cells = <1> for those, you'd still be
> fine using the first approach.
> 
> As I mentioned before, not all DARTs support the full 32-bit aperture.
> In particular the PCIe DARTs support a smaller address-space.  It is
> not clear whether this is a restriction of the PCIe host controller or
> the DART, but the Apple Device Tree has "vm-base" and "vm-size"
> properties that encode the base address and size of the aperture.
> These are single-cell properties, which is probably why for the USB DARTs
> only "vm-base" is given; since "vm-base" is 0, a 32-bit number
> wouldn't be able to encode the full aperture size.  We could make them
> 64-bit numbers in the Linux device tree though and always be explicit
> about the size.  Older Sun SPARC machines used a single "virtual-dma"
> property to encode the aperture.  We could do something similar

Re: [PATCH 0/3] Apple M1 DART IOMMU driver

2021-03-23 Thread Mark Kettenis
> Date: Mon, 22 Mar 2021 23:17:31 +0100
> From: "Sven Peter" 
> 
> Hi Mark,
> 
> On Sun, Mar 21, 2021, at 19:35, Mark Kettenis wrote:
> >
> > Guess we do need to understand a little bit better how the USB DART
> > actually works.  My hypothesis (based on our discussion on #asahi) is
> > that the XHCI host controller and the peripheral controller of the
> > DWC3 block use different DMA "streams" that are handled by the
> > different sub-DARTs.
> 
> I've done some more experiments and the situation is unfortunately more
> complicated: Most DMA transfers are translated with the first DART.
> But sometimes (and I have not been able to figure out the exact conditions)
> transfers instead have to go through the second DART. 
> This happens e.g. with one of my USB keyboards after a stop EP command
> is issued: Suddenly the xhci_ep_ctx struct must be translated through the
> second DART.
> 
> What this likely means is that we'll need to point both DARTs
> to the same pagetables and just issue the TLB maintenance operations
> as a group.
> 
> > 
> > The Corellium folks use a DART + sub-DART model in their driver and a
> > single node in the device tree that represents both.  That might make sense
> > since the error registers and interrupts are shared.  Maybe it would
> > make sense to select the appropriate sub-DART based on the DMA stream
> > ID?
> 
> dwc3 specifically seems to require stream id #1 from the DART
> at <0x5 0x02f0> and stream id #0 from the DART at <0x5 0x02f8>.
> Both of these only share an IRQ line but are otherwise completely independent.
> Each has their own error registers, etc. and we need some way to
> specify these two DARTs + the appropriate stream ID.
> 
> Essentially we have three options to represent this now:
> 
> 1) Add both DARTs as separate regs, use #iommu-cells = <2> and have the
>first cell select the DART and the second one the stream ID.
>We could allow #iommu-cells = <1> in case only one reg is specified
>for the PCIe DART:
> 
>usb_dart1@502f0 {
>  compatible = "apple,t8103-dart";
>  reg = <0x5 0x02f0 0x0 0x4000>, <0x5 0x02f8 0x0 0x4000>;
>  #iommu-cells = <2>;
>  ...
>};
> 
>usb1 {
>  iommus = <&usb_dart1 0 1>, <&usb_dart1 1 0>;
>  ...
>};
> 
>I prefer this option because we fully describe the DART in a single
>device node here. It also feels natural to group them like this because
>they need to share some properties (like dma-window and the interrupt)
>anyway. 
> 
> 2) Create two DART nodes which share the same IRQ line and attach them
>both to the master node:
> 
>usb_dart1a@502f0 {
>  compatible = "apple,t8103-dart";
>  reg = <0x5 0x02f0 0x0 0x4000>;
>  #iommu-cells = <1>;
>  ...
>};
>usb_dart1b@502f8 {
>  compatible = "apple,t8103-dart";
>  reg = <0x5 0x02f8 0x0 0x4000>;
>  #iommu-cells = <1>;
>  ...
>};
> 
>usb1 {
>  iommus = <&usb_dart1a 1>, <&usb_dart1b 0>;
>  ...
>};
> 
>I dislike this one because attaching those two DARTs to a single device
>seems rather unusual. We'd also have to duplicate the dma-window setting,
>make sure it's the same for both DARTs and there are probably even more
>complications I can't think of right now. It seems like this would also
>make the device tree very verbose and the implementation itself more
>complicated.
> 
> 3) Introduce another property and let the DART driver take care of
>mirroring the pagetables. I believe this would be similar to
>the sid-remap property:
> 
>usb_dart1@502f0 {
>  compatible = "apple,t8103-dart";
>  reg = <0x5 0x02f0 0x0 0x4000>, <0x5 0x02f8 0x0 0x4000>;
>  #iommu-cells = <1>;
>  sid-remap = <0 1>;
>};
>usb1 {
>  iommus = <&usb_dart1 0>;
>};
> 
>I slightly dislike this one because we now specify which stream id 
>to use in two places: Once in the device node and another time in the
>new property in the DART node. I also don't think the binding is much
>simpler than the first one.

Hi Sven,

The problem with both #1 and #2 is that you end up with two references
to (effectively) different iommu's in the dwc3 device node.  I don't
see how that is compatible with the idea of using a single translation
table for both sub-DARTs.

If you no longer think that is desirable, you'll still have the
problem that you'll need to modify the dwc3 driver code such that it
uses the right IOMMU to do its DMA address translation.  Given what
you write above that sounds really ugly and confusing.  I would
certainly want to avoid doing that in OpenBSD.

So I think #3 is the only realistic option here.  In my opinion it is
perfectly fine for the devicetree to present a workable model of the
hardware instead of a 100% accurate model.
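
To make option #3 concrete, a hedged sketch of how a driver could consume
such a property; "sid-remap" is only a proposal in this thread and all
names below are hypothetical:

#include <linux/of.h>

struct dart_hw {
	u32 mirror_from;	/* stream on the first register range */
	u32 mirror_to;		/* stream on the second register range */
};

static int dart_parse_sid_remap(struct dart_hw *dart, struct device_node *np)
{
	u32 map[2];
	int ret;

	ret = of_property_read_u32_array(np, "sid-remap", map, 2);
	if (ret)
		return ret == -EINVAL ? 0 : ret;	/* property is optional */

	/* Mirror stream map[0] onto stream map[1]: both get the same
	 * TTBRs and their TLBs are invalidated together. */
	dart->mirror_from = map[0];
	dart->mirror_to = map[1];
	return 0;
}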

> > > where #dma-address-cells and #dma-size-cells default to
> > > #address-cells and #size-cells respectively if I understand
> > > the code cor

RE: [RFC PATCH v2 1/8] ACPICA: IORT: Update for revision E

2021-03-23 Thread Kaneda, Erik



> -Original Message-
> From: Lorenzo Pieralisi 
> Sent: Tuesday, March 23, 2021 8:54 AM
> To: Kaneda, Erik 
> Cc: Shameerali Kolothum Thodi ;
> linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org;
> iommu@lists.linux-foundation.org; de...@acpica.org; Moore, Robert
> ; Linuxarm ;
> steven.pr...@arm.com; sami.muja...@arm.com; robin.mur...@arm.com;
> wanghuiqiang 
> Subject: Re: [RFC PATCH v2 1/8] ACPICA: IORT: Update for revision E
> 
> On Mon, Mar 22, 2021 at 09:57:58PM +, Kaneda, Erik wrote:
> >
> >
> > > -Original Message-
> > > From: Shameerali Kolothum Thodi
> > > 
> > > Sent: Monday, March 22, 2021 3:36 AM
> > > To: Kaneda, Erik ; linux-arm-
> > > ker...@lists.infradead.org; linux-a...@vger.kernel.org;
> iommu@lists.linux-
> > > foundation.org; de...@acpica.org; Lorenzo Pieralisi
> > > ; Moore, Robert
> 
> > > Cc: Linuxarm ; steven.pr...@arm.com;
> > > sami.muja...@arm.com; robin.mur...@arm.com; wanghuiqiang
> > > 
> > > Subject: [Devel] Re: [RFC PATCH v2 1/8] ACPICA: IORT: Update for
> revision E
> > >
> > > [+]
> > >
> > > Hi Erik,
> > >
> > > As this is now just merged into acpica-master and based on the
> > > discussion we had here,
> > >
> > > https://github.com/acpica/acpica/pull/638
> > >
> > > I had a discussion with ARM folks (Lorenzo) in the
> > > linaro-open-discussions call and can confirm that the IORT Revision E
> > > is not the final specification and has some issues which are now
> > > corrected in the latest E.b revision [1]. Also there are no current
> > > users for Rev E and it may not be a good idea to push this version
> > > into the Linux kernel or elsewhere.
> > >
> > > So could you please revert the merge; I am planning to work on E.b
> > > soon.
> > Hi,
> >
> > > Please let me know if I need to explicitly send a revert pull request or 
> > > not.
> >
> Please send a revert pull request and I'll remember not to submit the
> Linux-ized IORT patches.
> >
> > From all of the activity that I've seen with the IORT specification,
> > it looks like this table is nontrivial to design and maintain. The
> > difficulty I have with the table is that I do not understand which
> > table revisions are in active use.
> 
Hi Lorenzo,

> Possibly all of them in firmware in the field - I am not sure what you
> are asking though; if you can elaborate I'd be grateful.

Yes, I'd be happy to explain.

The ACPICA project contains code that provides definitions for ACPI tables. 
After each release of this project, the codebase gets translated to a Linux 
style syntax and relevant patches are submitted to Linux so that it can use
these table definitions in its driver codebase. From ACPICA's perspective,
the code providing these definitions is used to implement a
compiler/disassembler called iASL. This tool disassembles table definitions so
that the user doesn't have to open a hex editor to decipher ACPI tables. Our 
goal with iASL is to be able to disassemble as many ACPI tables as possible to 
give users the ability to compile and debug ACPI tables.

Out of the 60+ tables that we support, IORT has been tricky to maintain. There 
are revisions A-E and we have received pull requests from engineers from ARM or 
companies that work with ARM. We are grateful of their contributions but some 
of these pull requests have broken support for earlier versions of IORT. In 
addition, we sometimes receive communication from people like Shameer saying 
that revision E does not currently have users. This leaves Bob and I very 
confused about which revisions we should be focused on supporting in iASL.

We need your help in understanding which versions of IORT should be supported 
by iASL as well as Linux.

I hope this helps. Let me know if you need me to clarify anything that I've
written.

Thanks,
Erik
> 
> > So my question is this: which IORT revisions are being actively used?
> 
> See above.
> 
> Thanks,
> Lorenzo
> 
> >
> > Thanks,
> > Erik
> > >
> > > Thanks,
> > > Shameer
> > >
> > > 1. https://developer.arm.com/documentation/den0049/latest/
> > >
> > > > -Original Message-
> > > > From: iommu [mailto:iommu-boun...@lists.linux-foundation.org] On
> > > Behalf Of
> > > > Shameer Kolothum
> > > > Sent: 19 November 2020 12:12
> > > > To: linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org;
> > > > iommu@lists.linux-foundation.org; de...@acpica.org
> > > > Cc: Linuxarm ; steven.pr...@arm.com;
> > > Guohanjun
> > > > (Hanjun Guo) ; sami.muja...@arm.com;
> > > > robin.mur...@arm.com; wanghuiqiang 
> > > > Subject: [RFC PATCH v2 1/8] ACPICA: IORT: Update for revision E
> > > >
> > > > IORT revision E contains a few additions like,
> > > >     -Added an identifier field in the node descriptors to aid table
> > > >      cross-referencing.
> > > >     -Introduced the Reserved Memory Range(RMR) node. This is used
> > > >      to describe memory ranges that are used by endpoints and require
> > > >      a unity mapping 

Re: [PATCH 5/5] iommu/vt-d: Make unnecessarily global functions static

2021-03-23 Thread Christoph Hellwig
On Tue, Mar 23, 2021 at 09:06:00AM +0800, Lu Baolu wrote:
> Make some functions static as they are only used inside pasid.c.
> 
> Signed-off-by: Lu Baolu 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 4/5] iommu/vt-d: Remove unused function declarations

2021-03-23 Thread Christoph Hellwig
On Tue, Mar 23, 2021 at 09:05:59AM +0800, Lu Baolu wrote:
> Some functions have been deprecated. Remove the remaining function
> declarations.

s/deprecated/removed/g

Otherwise looks good:

Reviewed-by: Christoph Hellwig 


Re: [PATCH 3/5] iommu/vt-d: Remove SVM_FLAG_PRIVATE_PASID

2021-03-23 Thread Christoph Hellwig
On Tue, Mar 23, 2021 at 09:05:58AM +0800, Lu Baolu wrote:
> The SVM_FLAG_PRIVATE_PASID has never been referenced in the tree, and
> there's no plan to have anything to use it. So cleanup it.
> 
> Signed-off-by: Lu Baolu 

Looks good,

Reviewed-by: Christoph Hellwig 

But can we take this a little further?  SVM_FLAG_GUEST_PASID is unused
as well.  SVM_FLAG_GUEST_MODE is only used in drivers/iommu/intel/svm.c,
and SVM_FLAG_SUPERVISOR_MODE is actually used as an API flag to
iommu_sva_bind_device.  So IMHO the latter should be elevated to an
IOMMU API level flag, and then include/linux/intel-svm.h can go away
entirely or at least be moved to drivers/iommu/intel/.


Re: [PATCH 2/5] iommu/vt-d: Remove svm_dev_ops

2021-03-23 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 1/5] iommu/vt-d: Remove unused dma map/unmap trace events

2021-03-23 Thread Christoph Hellwig
On Tue, Mar 23, 2021 at 09:05:56AM +0800, Lu Baolu wrote:
> With commit c588072bba6b5 ("iommu/vt-d: Convert intel iommu driver to
> the iommu ops"), the trace events for dma map/unmap have no users any
> more. Cleanup them to make the code neat.
> 
> Signed-off-by: Lu Baolu 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [RFC PATCH v2 1/8] ACPICA: IORT: Update for revision E

2021-03-23 Thread Lorenzo Pieralisi
On Mon, Mar 22, 2021 at 09:57:58PM +, Kaneda, Erik wrote:
> 
> 
> > -Original Message-
> > From: Shameerali Kolothum Thodi
> > 
> > Sent: Monday, March 22, 2021 3:36 AM
> > To: Kaneda, Erik ; linux-arm-
> > ker...@lists.infradead.org; linux-a...@vger.kernel.org; iommu@lists.linux-
> > foundation.org; de...@acpica.org; Lorenzo Pieralisi
> > ; Moore, Robert 
> > Cc: Linuxarm ; steven.pr...@arm.com;
> > sami.muja...@arm.com; robin.mur...@arm.com; wanghuiqiang
> > 
> > Subject: [Devel] Re: [RFC PATCH v2 1/8] ACPICA: IORT: Update for revision E
> > 
> > [+]
> > 
> > Hi Erik,
> > 
> > As this is now just merged into acpica-master and based on the discussion we
> > had here,
> > 
> > https://github.com/acpica/acpica/pull/638
> > 
> > I had a discussion with ARM folks (Lorenzo) in the linaro-open-discussions
> > call and can confirm that the IORT Revision E is not the final
> > specification and has some issues which are now corrected in the latest
> > E.b revision [1]. Also there are no current users for Rev E and it may
> > not be a good idea to push this version into the Linux kernel or
> > elsewhere.
> > 
> > So could you please revert the merge; I am planning to work on E.b
> > soon.
> Hi,
> 
> > Please let me know if I need to explicitly send a revert pull request or 
> > not.
> 
> Please send a revert pull request and I'll remember not to submit the
> Linux-ized IORT patches.
> 
> From all of the activity that I've seen with the IORT specification,
> it looks like this table is nontrivial to design and maintain. The
> difficulty I have with the table is that I do not understand which
> table revisions are in active use.

Possibly all of them in firmware in the field - I am not sure what you
are asking though; if you can elaborate I'd be grateful.

> So my question is this: which IORT revisions are being actively used?

See above.

Thanks,
Lorenzo

> 
> Thanks,
> Erik
> > 
> > Thanks,
> > Shameer
> > 
> > 1. https://developer.arm.com/documentation/den0049/latest/
> > 
> > > -Original Message-
> > > From: iommu [mailto:iommu-boun...@lists.linux-foundation.org] On
> > Behalf Of
> > > Shameer Kolothum
> > > Sent: 19 November 2020 12:12
> > > To: linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org;
> > > iommu@lists.linux-foundation.org; de...@acpica.org
> > > Cc: Linuxarm ; steven.pr...@arm.com;
> > Guohanjun
> > > (Hanjun Guo) ; sami.muja...@arm.com;
> > > robin.mur...@arm.com; wanghuiqiang 
> > > Subject: [RFC PATCH v2 1/8] ACPICA: IORT: Update for revision E
> > >
> > > IORT revision E contains a few additions like,
> > >     -Added an identifier field in the node descriptors to aid table
> > >      cross-referencing.
> > >     -Introduced the Reserved Memory Range(RMR) node. This is used
> > >      to describe memory ranges that are used by endpoints and require
> > >      a unity mapping in the SMMU.
> > > -Introduced a flag in the RC node to express support for PRI.
> > >
> > > Signed-off-by: Shameer Kolothum
> > 
> > > ---
> > >  include/acpi/actbl2.h | 25 +++--
> > >  1 file changed, 19 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h index
> > > ec66779cb193..274fce7b5c01 100644
> > > --- a/include/acpi/actbl2.h
> > > +++ b/include/acpi/actbl2.h
> > > @@ -68,7 +68,7 @@
> > >   * IORT - IO Remapping Table
> > >   *
> > >   * Conforms to "IO Remapping Table System Software on ARM Platforms",
> > > - * Document number: ARM DEN 0049D, March 2018
> > > + * Document number: ARM DEN 0049E, June 2020
> > >   *
> > >
> > > *****************************************************************************/
> > >
> > > @@ -86,7 +86,8 @@ struct acpi_iort_node {
> > >   u8 type;
> > >   u16 length;
> > >   u8 revision;
> > > - u32 reserved;
> > > + u16 reserved;
> > > + u16 identifier;
> > >   u32 mapping_count;
> > >   u32 mapping_offset;
> > >   char node_data[1];
> > > @@ -100,7 +101,8 @@ enum acpi_iort_node_type {
> > >   ACPI_IORT_NODE_PCI_ROOT_COMPLEX = 0x02,
> > >   ACPI_IORT_NODE_SMMU = 0x03,
> > >   ACPI_IORT_NODE_SMMU_V3 = 0x04,
> > > - ACPI_IORT_NODE_PMCG = 0x05
> > > + ACPI_IORT_NODE_PMCG = 0x05,
> > > + ACPI_IORT_NODE_RMR = 0x06,
> > >  };
> > >
> > >  struct acpi_iort_id_mapping {
> > > @@ -167,10 +169,10 @@ struct acpi_iort_root_complex {
> > >   u8 reserved[3]; /* Reserved, must be zero */
> > >  };
> > >
> > > -/* Values for ats_attribute field above */
> > > +/* Masks for ats_attribute field above */
> > >
> > > -#define ACPI_IORT_ATS_SUPPORTED 0x00000001   /* The root complex supports ATS */
> > > -#define ACPI_IORT_ATS_UNSUPPORTED   0x00000000   /* The root complex doesn't support ATS */
> > > +#define ACPI_IORT_ATS_SUPPORTED (1)  /* The root complex supports ATS */
> > > +#define ACPI_IORT_PRI_SUPPORTED (1<<1)   /* The root complex supports PRI */
> > >
> > >  struct acpi_io

Re: [PATCH 3/3] iova: Correct comment for free_cpu_cached_iovas()

2021-03-23 Thread John Garry

On 23/03/2021 13:05, Robin Murphy wrote:
> On 2021-03-01 12:12, John Garry wrote:
>> Function free_cpu_cached_iovas() is not only called when a CPU is
>> hotplugged, so remove that part of the code comment.
>
> FWIW I read it as clarifying why this is broken out into a separate
> function vs. a monolithic "free all cached IOVAs" routine that handles
> both the per-cpu and global caches - it never said "*only* used..."

It seems to be implying that.

It's only a code comment, so I don't care too much either way and can
drop this change.

> As such I'd hesitate to call it incorrect, but it's certainly arguable
> whether it needs to be stated or not, especially once the hotplug
> callsite is now obvious in the same file - on which note the function
> itself also shouldn't need to be public any more, no?

Right, I actually missed deleting iommu_dma_free_cpu_cached_iovas(), so
can fix that now.

Cheers,
John

> Robin.
>
>> Signed-off-by: John Garry 
>> ---
>>   drivers/iommu/iova.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index c78312560425..465b3b0eeeb0 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -996,7 +996,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
>>   }
>>
>>   /*
>> - * free all the IOVA ranges cached by a cpu (used when cpu is unplugged)
>> + * free all the IOVA ranges cached by a cpu
>>    */
>>   void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
>>   {



Re: [PATCH 3/3] iova: Correct comment for free_cpu_cached_iovas()

2021-03-23 Thread Robin Murphy

On 2021-03-01 12:12, John Garry wrote:
> Function free_cpu_cached_iovas() is not only called when a CPU is
> hotplugged, so remove that part of the code comment.

FWIW I read it as clarifying why this is broken out into a separate
function vs. a monolithic "free all cached IOVAs" routine that handles
both the per-cpu and global caches - it never said "*only* used..."

As such I'd hesitate to call it incorrect, but it's certainly arguable
whether it needs to be stated or not, especially once the hotplug
callsite is now obvious in the same file - on which note the function
itself also shouldn't need to be public any more, no?

Robin.

> Signed-off-by: John Garry 
> ---
>   drivers/iommu/iova.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index c78312560425..465b3b0eeeb0 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -996,7 +996,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
>   }
>
>   /*
> - * free all the IOVA ranges cached by a cpu (used when cpu is unplugged)
> + * free all the IOVA ranges cached by a cpu
>    */
>   void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
>   {



Re: [PATCH 1/3] iova: Add CPU hotplug handler to flush rcaches

2021-03-23 Thread Robin Murphy

On 2021-03-01 12:12, John Garry wrote:
> Like the intel IOMMU driver already does, flush the per-IOVA domain
> CPU rcache when a CPU goes offline - there's no point in keeping it.

Thanks John!

Reviewed-by: Robin Murphy 

> Signed-off-by: John Garry 
> ---
>   drivers/iommu/iova.c   | 30 +-
>   include/linux/cpuhotplug.h |  1 +
>   include/linux/iova.h   |  1 +
>   3 files changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index e6e2fa85271c..c78312560425 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -25,6 +25,17 @@ static void init_iova_rcaches(struct iova_domain *iovad);
>   static void free_iova_rcaches(struct iova_domain *iovad);
>   static void fq_destroy_all_entries(struct iova_domain *iovad);
>   static void fq_flush_timeout(struct timer_list *t);
> +
> +static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
> +{
> +   struct iova_domain *iovad;
> +
> +   iovad = hlist_entry_safe(node, struct iova_domain, cpuhp_dead);
> +
> +   free_cpu_cached_iovas(cpu, iovad);
> +   return 0;
> +}
> +
>   static void free_global_cached_iovas(struct iova_domain *iovad);
>
>   void
> @@ -51,6 +62,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
>   iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
>   rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
>   rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
> +   cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, &iovad->cpuhp_dead);
>   init_iova_rcaches(iovad);
>   }
>   EXPORT_SYMBOL_GPL(init_iova_domain);
> @@ -257,10 +269,21 @@ int iova_cache_get(void)
>   {
>   mutex_lock(&iova_cache_mutex);
>   if (!iova_cache_users) {
> +   int ret;
> +
> +   ret = cpuhp_setup_state_multi(CPUHP_IOMMU_IOVA_DEAD, "iommu/iova:dead",
> +   NULL, iova_cpuhp_dead);
> +   if (ret) {
> +   mutex_unlock(&iova_cache_mutex);
> +   pr_err("Couldn't register cpuhp handler\n");
> +   return ret;
> +   }
> +
>   iova_cache = kmem_cache_create(
>   "iommu_iova", sizeof(struct iova), 0,
>   SLAB_HWCACHE_ALIGN, NULL);
>   if (!iova_cache) {
> +   cpuhp_remove_multi_state(CPUHP_IOMMU_IOVA_DEAD);
>   mutex_unlock(&iova_cache_mutex);
>   pr_err("Couldn't create iova cache\n");
>   return -ENOMEM;
> @@ -282,8 +305,10 @@ void iova_cache_put(void)
>   return;
>   }
>   iova_cache_users--;
> -   if (!iova_cache_users)
> +   if (!iova_cache_users) {
> +   cpuhp_remove_multi_state(CPUHP_IOMMU_IOVA_DEAD);
>   kmem_cache_destroy(iova_cache);
> +   }
>   mutex_unlock(&iova_cache_mutex);
>   }
>   EXPORT_SYMBOL_GPL(iova_cache_put);
> @@ -606,6 +631,9 @@ void put_iova_domain(struct iova_domain *iovad)
>   {
>   struct iova *iova, *tmp;
>
> +   cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD,
> +   &iovad->cpuhp_dead);
> +
>   free_iova_flush_queue(iovad);
>   free_iova_rcaches(iovad);
>   rbtree_postorder_for_each_entry_safe(iova, tmp, &iovad->rbroot, node)
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index f14adb882338..cedac9986557 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -58,6 +58,7 @@ enum cpuhp_state {
>   CPUHP_NET_DEV_DEAD,
>   CPUHP_PCI_XGENE_DEAD,
>   CPUHP_IOMMU_INTEL_DEAD,
> +   CPUHP_IOMMU_IOVA_DEAD,
>   CPUHP_LUSTRE_CFS_DEAD,
>   CPUHP_AP_ARM_CACHE_B15_RAC_DEAD,
>   CPUHP_PADATA_DEAD,
> diff --git a/include/linux/iova.h b/include/linux/iova.h
> index c834c01c0a5b..4be6c0ab4997 100644
> --- a/include/linux/iova.h
> +++ b/include/linux/iova.h
> @@ -95,6 +95,7 @@ struct iova_domain {
>    flush-queues */
>   atomic_t fq_timer_on;   /* 1 when timer is active, 0
>    when not */
> +   struct hlist_node   cpuhp_dead;
>   };
>
>   static inline unsigned long iova_size(struct iova *iova)



Re: [PATCH v13 06/10] iommu: Add a page fault handler

2021-03-23 Thread Jean-Philippe Brucker
On Tue, Mar 02, 2021 at 09:57:27PM -0800, Raj, Ashok wrote:
> > +   ret = handle_mm_fault(vma, prm->addr, fault_flags, NULL);
> 
> Should we add a trace similar to trace_page_fault_user() or kernel in
> arch/x86/mm/fault.c

Yes that would definitely be useful for debugging hardware and developing
applications. I can send a separate patch to add tracepoints here and in
the lower-level device fault path.

> or maybe add a perf_sw_event() for device faults? 

It does seem like that would fit well alongside the existing
PERF_COUNT_SW_PAGE_FAULTS, but I don't think it would be useful in
practice, because we can't provide a context for the event. Since we're
handling these faults remotely, the only way for a user to get IOPF events
is to enable them on all CPUs and all tasks. Tracepoints can have 'pasid'
and 'device' fields to identify an event, but the perf_sw_event wouldn't
have any useful metadata apart from the faulting address.

We could also add tracepoints on bind(), so users can get the PASID
obtained with the PID they care about and filter fault events based on
that.
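
For illustration, a hedged sketch of what such a tracepoint could look
like (abbreviated TRACE_EVENT boilerplate; the event name and fields are
made up here, not an existing kernel API):

TRACE_EVENT(iommu_page_fault,
	TP_PROTO(struct device *dev, u32 pasid, u64 addr),
	TP_ARGS(dev, pasid, addr),
	TP_STRUCT__entry(
		__string(device, dev_name(dev))
		__field(u32, pasid)
		__field(u64, addr)
	),
	TP_fast_assign(
		__assign_str(device, dev_name(dev));
		__entry->pasid = pasid;
		__entry->addr = addr;
	),
	/* the 'pasid' and 'device' fields let users filter events */
	TP_printk("device=%s pasid=%u addr=%#llx",
		  __get_str(device), __entry->pasid, __entry->addr)
);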

I've been wondering about associating a PASID with a PID internally,
because we don't currently have anywhere to send SEGV signals for
unhandled page faults. But I think it would be best to notify the device
driver on unhandled fault and let them deal with it. They'll probably want
that information anyway.

Thanks,
Jean


Re: [PATCH v13 06/10] iommu: Add a page fault handler

2021-03-23 Thread Jean-Philippe Brucker
On Wed, Mar 03, 2021 at 01:27:34PM +0800, Lu Baolu wrote:
> I have tested this framework with the Intel VT-d implementation. It
> works as expected. Hence,
> 
> Reviewed-by: Lu Baolu 
> Tested-by: Lu Baolu 

Thanks!

> One possible future optimization is that we could allow the system
> administrators to choose between handling PRQs in a workqueue or handling
> them synchronously. One research effort discovered that most of the software
> latency of handling a single page fault exists in the scheduling part.
> Hence, synchronous processing will get shorter software latency if PRQs
> are rare and limited.

Yes, the risk is that processing a fault synchronously will take too much
time, leading to PRI queue overflow if the IOMMU keeps receiving page
faults. That's why I opted for the workqueue initially, but it's
definitely something we can tweak.
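
A hedged sketch of the kind of tweak discussed (the 'iopf_sync_handle'
knob is hypothetical, and the queue_work() call is assumed from the
series rather than quoted above):

/* e.g. a module parameter or sysctl chosen by the administrator */
static bool iopf_sync_handle;

	/* ...at the end of iommu_queue_iopf(), instead of always deferring: */
	if (iopf_sync_handle)
		/* shorter latency, but risks PRI queue overflow under load */
		iopf_handle_group(&group->work);
	else
		queue_work(iopf_param->queue->wq, &group->work);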

Thanks,
Jean


Re: [PATCH v13 06/10] iommu: Add a page fault handler

2021-03-23 Thread Jean-Philippe Brucker
On Tue, Mar 02, 2021 at 03:59:57PM -0800, Jacob Pan wrote:
> Hi Jean-Philippe,
> 
> A few comments from the p.o.v of converting VT-d to this framework. Mostly
> about potential optimization. I think VT-d SVA code will be able to use this
> work.
> +Ashok provided many insight.
> 
> FWIW,
> Reviewed-by: Jacob Pan 

Thanks!

> On Tue,  2 Mar 2021 10:26:42 +0100, Jean-Philippe Brucker
>  wrote:
> > +int iommu_queue_iopf(struct iommu_fault *fault, void *cookie)
> > +{
> > +   int ret;
> > +   struct iopf_group *group;
> > +   struct iopf_fault *iopf, *next;
> > +   struct iopf_device_param *iopf_param;
> > +
> > +   struct device *dev = cookie;
> > +   struct dev_iommu *param = dev->iommu;
> > +
> > +   lockdep_assert_held(¶m->lock);
> > +
> > +   if (fault->type != IOMMU_FAULT_PAGE_REQ)
> > +   /* Not a recoverable page fault */
> > +   return -EOPNOTSUPP;
> > +
> > +   /*
> > +* As long as we're holding param->lock, the queue can't be unlinked
> > +* from the device and therefore cannot disappear.
> > +*/
> > +   iopf_param = param->iopf_param;
> > +   if (!iopf_param)
> > +   return -ENODEV;
> > +
> > +   if (!(fault->prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) {
> > +   iopf = kzalloc(sizeof(*iopf), GFP_KERNEL);
> > +   if (!iopf)
> > +   return -ENOMEM;
> > +
> > +   iopf->fault = *fault;
> > +
> > +   /* Non-last request of a group. Postpone until the last one */
> Would be nice to have an option here to allow non-deferred handle_mm_fault.
> Some devices may have a large group.
> 
> FYI, VT-d also needs to send page response before the last one (LPIG).
> "A Page Group Response Descriptor is issued by software in response to a
> page request with data or a page request (with or without data) that
> indicated that it was the last request in a group."
> 
> But I think we deal with that when we convert. Perhaps just treat the
> request with data as LPIG.

Sure that seems fine. Do you mean that the vt-d driver will set the
IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE flag for PR-with-data?  Otherwise we
could introduce a new flag. I prefer making it explicit rather than having
IOPF consumers infer that a response is needed when seeing
IOMMU_FAULT_PAGE_REQUEST_PRIV_DATA set, because private_data wouldn't be
reusable by other architectures.

> Also adding a trace event would be nice, similar to CPU page fault.

Agreed, the tracepoints in my development tree (along with the lower-level
page_request/response tracepoints) have been indispensable for debugging
hardware and SVA applications

> > +   list_add(&iopf->list, &iopf_param->partial);
> > +
> > +   return 0;
> > +   }
> > +
> > +   group = kzalloc(sizeof(*group), GFP_KERNEL);
> > +   if (!group) {
> > +   /*
> > +* The caller will send a response to the hardware. But we do
> > +* need to clean up before leaving, otherwise partial faults
> > +* will be stuck.
> > +*/
> > +   ret = -ENOMEM;
> > +   goto cleanup_partial;
> > +   }
> > +
> > +   group->dev = dev;
> > +   group->last_fault.fault = *fault;
> > +   INIT_LIST_HEAD(&group->faults);
> > +   list_add(&group->last_fault.list, &group->faults);
> > +   INIT_WORK(&group->work, iopf_handle_group);
> > +
> > +   /* See if we have partial faults for this group */
> > +   list_for_each_entry_safe(iopf, next, &iopf_param->partial, list) {
> > +   if (iopf->fault.prm.grpid == fault->prm.grpid)
> Just curious, the iopf handler is registered per arm_smmu_master dev. Is
> the namespace of group ID also within an arm_smmu_master?

Yes for both PCIe PRI and SMMU stall, the namespace is within one device
(one RequesterID or StreamID)

> Can one arm_smmu_master support multiple devices?

No, it's a single device

> 
> For VT-d, group ID is per PCI device.
> 
> > +   /* Insert *before* the last fault */
> > +   list_move(&iopf->list, &group->faults);
> > +   }
> > +
> This is fine for devices that support a small number of outstanding PRQs.
> VT-d is setting the limit to 32. But the incoming DSA device will support
> 512.
> 
> So if we pre-sort IOPF by group ID and put them in a per group list, would
> it be faster? I mean once the LPIG comes in, we just need to search the
> list of groups instead of all partial faults. I am not against what is done
> here, just exploring optimization.

Yes I think that's worth exploring, keeping the groups in an rb_tree or
something like that, for easy access and update. Note that I don't have a
system with non-LPIG faults, so I can't test any of this at the moment.
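
A hedged sketch of that idea, bucketing partial faults by group ID so the
last fault (LPIG) only walks its own group (names and the xarray choice
are illustrative, error handling elided; not part of the posted series):

struct iopf_pending_group {
	struct list_head faults;
};

static int iopf_queue_partial(struct xarray *groups,
			      struct iopf_fault *iopf, u32 grpid)
{
	struct iopf_pending_group *grp;

	grp = xa_load(groups, grpid);
	if (!grp) {
		grp = kzalloc(sizeof(*grp), GFP_KERNEL);
		if (!grp)
			return -ENOMEM;
		INIT_LIST_HEAD(&grp->faults);
		xa_store(groups, grpid, grp, GFP_KERNEL);
	}
	list_add(&iopf->list, &grp->faults);
	return 0;
}

/*
 * On the last fault of a group, splice grp->faults directly instead of
 * scanning every partial fault on the device:
 *
 *	grp = xa_erase(groups, grpid);
 *	if (grp) {
 *		list_splice(&grp->faults, &group->faults);
 *		kfree(grp);
 *	}
 */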

Thanks,
Jean