Re: [RFC PATCH v2] iommu/xen: Add Xen PV-IOMMU driver

2024-06-26 Thread Robin Murphy

On 2024-06-24 3:36 pm, Teddy Astie wrote:

Hello Robin,
Thanks for the thorough review.


diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 0af39bbbe3a3..242cefac77c9 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -480,6 +480,15 @@ config VIRTIO_IOMMU
     Say Y here if you intend to run this kernel as a guest.
+config XEN_IOMMU
+    bool "Xen IOMMU driver"
+    depends on XEN_DOM0


Clearly this depends on X86 as well.


Well, I don't intend this driver to be X86-only, even though the current
Xen RFC doesn't support ARM (yet). Unless there is a contraindication
for it?


It's purely practical - even if you drop the asm/iommu.h stuff it would 
still break ARM DOM0 builds due to HYPERVISOR_iommu_op() only being 
defined for x86. And it's better to add a dependency here to make it 
clear what's *currently* supported, than to add dummy code to allow it 
to build for ARM if that's not actually tested or usable yet.
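Concretely, the Kconfig entry could just grow the dependency (sketch only; the rest of the entry as posted):

config XEN_IOMMU
	bool "Xen IOMMU driver"
	depends on XEN_DOM0 && X86
	select IOMMU_API

and the dependency can always be relaxed later once the hypercall plumbing exists and has been tested on other architectures.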



+bool xen_iommu_capable(struct device *dev, enum iommu_cap cap)
+{
+    switch (cap) {
+    case IOMMU_CAP_CACHE_COHERENCY:
+    return true;


Will the PV-IOMMU only ever be exposed on hardware where that really is
always true?



On the hypervisor side, the PV-IOMMU interface always implicitly flushes
the IOMMU hardware on map/unmap operations, so at the end of the
hypercall the cache should always be coherent, IMO.


As Jason already brought up, this is not about TLBs or anything cached 
by the IOMMU itself, it's about the memory type(s) it can create 
mappings with. Returning true here says Xen guarantees it can use a 
cacheable memory type which will let DMA snoop the CPU caches. 
Furthermore, not explicitly handling IOMMU_CACHE in the map_pages op 
then also implies that it will *always* do that, so you couldn't 
actually get an uncached mapping even if you wanted one.
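For illustration, honouring IOMMU_CACHE in the map path might look roughly like the below - note this assumes the hypercall interface grows some way to ask for a non-snooping mapping; the IOMMUOP_MAP_CACHEABLE flag is made up and not part of the posted interface:

static int xen_iommu_map_pages(struct iommu_domain *dom, unsigned long iova,
			       phys_addr_t paddr, size_t pgsize, size_t pgcount,
			       int prot, gfp_t gfp, size_t *mapped)
{
	struct pv_iommu_op op = {
		/* context/subop fields as in the posted driver */
	};

	/*
	 * Only ask Xen for a cacheable (snooping) mapping when the caller
	 * actually requested one via IOMMU_CACHE.
	 */
	if (prot & IOMMU_CACHE)
		op.flags |= IOMMUOP_MAP_CACHEABLE;	/* hypothetical flag */

	/*
	 * ... fill in dfn/gfn/nr_pages, issue HYPERVISOR_iommu_op(&op),
	 * and update *mapped as in the posted map_pages ...
	 */
	return 0;
}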



+    while (xen_pg_count) {
+    size_t to_unmap = min(xen_pg_count, max_nr_pages);
+
+    //pr_info("Unmapping %lx-%lx\n", dfn, dfn + to_unmap - 1);
+
+    op.unmap_pages.dfn = dfn;
+    op.unmap_pages.nr_pages = to_unmap;
+
+    ret = HYPERVISOR_iommu_op();
+
+    if (ret)
+    pr_warn("Unmap failure (%lx-%lx)\n", dfn, dfn + to_unmap
- 1);


But then how

would it ever happen anyway? Unmap is a domain op, so a domain which
doesn't allow unmapping shouldn't offer it in the first place...


Unmap failing should be exceptional, but it is possible e.g. with
transparent superpages (as the Xen IOMMU drivers use). The Xen drivers fold
appropriately contiguous mappings into superpage entries to optimize
memory usage and the IOTLB. However, if you unmap in the middle of a region
covered by a superpage entry, it is no longer a valid superpage entry,
and you need to allocate and fill the lower levels, which is fallible
when memory is lacking.


OK, so in the worst case you could potentially have a partial unmap 
failure if the range crosses a superpage boundary and the end part 
happens to have been folded, and Xen doesn't detect and prepare that 
allocation until it's already unmapped up to the boundary. If that is 
so, does the hypercall interface give any information about partial 
failure, or can any error only be taken to mean that some or all of the 
given range may or may not have been unmapped now?
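Either way, on the driver side I'd expect unmap_pages to return short on the first failure so the problem is at least visible to the caller - rough sketch, reusing the field names from the quoted hunk (the remaining op fields are assumptions):

static size_t xen_iommu_unmap_pages(struct iommu_domain *dom, unsigned long iova,
				    size_t pgsize, size_t pgcount,
				    struct iommu_iotlb_gather *gather)
{
	struct xen_iommu_domain *domain = to_xen_iommu_domain(dom);
	size_t xen_pg_count = pgcount;	/* only 4K pages are advertised */
	u64 dfn = addr_to_pfn(iova);
	size_t unmapped = 0;
	struct pv_iommu_op op = {
		.ctx_no = domain->ctx_no,	/* assumed field name */
	};

	while (xen_pg_count) {
		size_t to_unmap = min_t(size_t, xen_pg_count, max_nr_pages);

		op.unmap_pages.dfn = dfn;
		op.unmap_pages.nr_pages = to_unmap;

		if (HYPERVISOR_iommu_op(&op))
			break;	/* return short rather than pretend it all worked */

		dfn += to_unmap;
		xen_pg_count -= to_unmap;
		unmapped += to_unmap * pgsize;
	}

	return unmapped;
}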

In this case I'd argue that you really *do* want to return short, in the
hope of propagating the error back up and letting the caller know the
address space is now messed up before things start blowing up even more
if they keep going and subsequently try to map new pages into
not-actually-unmapped VAs.


While mapping on top of another mapping is OK for us (it's just going to
override the previous mapping), I definitely agree that having the
address space messed up is not good.


Oh, indeed, quietly replacing existing PTEs might help paper over errors 
in this particular instance, but it does then allow *other* cases to go 
wrong in fun and infuriating ways :)



+static struct iommu_domain default_domain = {
+    .ops = &(const struct iommu_domain_ops){
+    .attach_dev = default_domain_attach_dev
+    }
+};


Looks like you could make it a static xen_iommu_domain and just use the
normal attach callback? Either way please name it something less
confusing like xen_iommu_identity_domain - "default" is far too
overloaded round here already...



Yes, although in the future we may have this domain be either identity
or blocking/paging depending on some upper-level configuration.
Should we have both identity and blocking domains and only set the
relevant one in iommu_ops, or keep this naming?


That's something that can be considered if and when it does happen. For 
now, if it's going to be pre-mapped as an identity domain, then let's 
just treat it as such and keep things straightforward.
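i.e. something along these lines (sketch only; the attach body would wire the device back to the pre-mapped default context):

static int xen_iommu_identity_attach(struct iommu_domain *dom,
				     struct device *dev)
{
	/* re-attach the device to Xen's context 0 (the pre-mapped default) */
	return 0;
}

static struct iommu_domain xen_iommu_identity_domain = {
	.type = IOMMU_DOMAIN_IDENTITY,
	.ops = &(const struct iommu_domain_ops) {
		.attach_dev = xen_iommu_identity_attach,
	},
};

with ops->identity_domain pointing at it, which also lets the core advertise the right default domain type without special-casing.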



+void __exit xen_iommu_fini(void)
+{
+    pr_info("Unregistering Xen IOMMU driver\n");
+
+    iommu_device_unregister(&xen_iommu_device);
+    

Re: [RFC PATCH v2] iommu/xen: Add Xen PV-IOMMU driver

2024-06-24 Thread Robin Murphy

On 2024-06-24 6:36 pm, Easwar Hariharan wrote:

Hi Jason,

On 6/24/2024 9:32 AM, Jason Gunthorpe wrote:

On Mon, Jun 24, 2024 at 02:36:45PM +, Teddy Astie wrote:

+bool xen_iommu_capable(struct device *dev, enum iommu_cap cap)
+{
+    switch (cap) {
+    case IOMMU_CAP_CACHE_COHERENCY:
+    return true;


Will the PV-IOMMU only ever be exposed on hardware where that really is
always true?



On the hypervisor side, the PV-IOMMU interface always implicitly flushes
the IOMMU hardware on map/unmap operations, so at the end of the
hypercall the cache should always be coherent, IMO.


Cache coherency is a property of the underlying IOMMU HW and reflects
the ability to prevent generating transactions that would bypass the
cache.

On AMD and Intel IOMMU HW this maps to a bit in their PTEs that must
always be set to claim this capability.

No ARM SMMU supports it yet.



Unrelated to this patch: Both the arm-smmu and arm-smmu-v3 drivers claim
this capability if the device tree/IORT table have the corresponding flags.

I read through DEN0049 to determine what are the knock-on effects, or
equivalently the requirements to set those flags in the IORT, but came
up empty. Could you help with what I'm missing to resolve the apparent
contradiction between your statement and the code?


We did rejig things slightly a while back. The status quo now is that 
IOMMU_CAP_CACHE_COHERENCY mostly covers whether IOMMU mappings can make 
device accesses coherent at all, tied in with the IOMMU_CACHE prot value 
- this is effectively forced for Intel and AMD, while for SMMU we have 
to take a guess, but as commented it's a pretty reasonable assumption 
that if the SMMU's own output for table walks etc. is coherent then its 
translation outputs are likely to be too. The further property of being 
able to then enforce a coherent mapping regardless of what an endpoint 
might try to get around it (PCIe No Snoop etc.) is now under the 
enforce_cache_coherency op - that's what SMMU can't guarantee for now 
due to the IMP-DEF nature of whether S2FWB overrides No Snoop or not.
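To restate that split in code form (generic sketch, not the patch under review):

static bool my_iommu_capable(struct device *dev, enum iommu_cap cap)
{
	switch (cap) {
	case IOMMU_CAP_CACHE_COHERENCY:
		/* IOMMU_CACHE mappings can make device DMA snoop CPU caches */
		return true;
	default:
		return false;
	}
}

static int my_iommu_enforce_cache_coherency(struct iommu_domain *domain)
{
	/*
	 * Only succeed if the hardware can also override endpoint tricks
	 * like PCIe No Snoop (e.g. the Intel/AMD PTE bits, or S2FWB where
	 * its behaviour is architecturally guaranteed).
	 */
	return -EOPNOTSUPP;
}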


Thanks,
Robin.



Re: [RFC PATCH] iommu/xen: Add Xen PV-IOMMU driver

2024-06-24 Thread Robin Murphy

On 2024-06-23 4:21 am, Baolu Lu wrote:

On 6/21/24 11:09 PM, Teddy Astie wrote:

Le 19/06/2024 à 18:30, Jason Gunthorpe a écrit :

On Thu, Jun 13, 2024 at 01:50:22PM +, Teddy Astie wrote:


+struct iommu_domain *xen_iommu_domain_alloc(unsigned type)
+{
+    struct xen_iommu_domain *domain;
+    u16 ctx_no;
+    int ret;
+
+    if (type & IOMMU_DOMAIN_IDENTITY) {
+    /* use default domain */
+    ctx_no = 0;
Please use the new ops, domain_alloc_paging and the static identity 
domain.

Yes, in the v2, I will use this newer interface.

I have a question on this new interface: is it valid to not have an
identity domain (with the "default domain" being blocking)? In the
current implementation it doesn't really matter, but at some point we
may want to allow not having it (thus making this driver mandatory).


It's valid to not have an identity domain if "default domain being
blocking" means a paging domain with no mappings.

In the iommu driver's iommu_ops::def_domain_type callback, just always
return IOMMU_DOMAIN_DMA, which indicates that the iommu driver doesn't
support identity translation.


That's not necessary - if neither ops->identity_domain nor 
ops->domain_alloc(IOMMU_DOMAIN_IDENTITY) gives a valid domain then we 
fall back to IOMMU_DOMAIN_DMA anyway.
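In other words the driver-side upshot is simply not to offer one - rough sketch with made-up names:

static const struct iommu_ops example_iommu_ops = {
	/*
	 * No .identity_domain and no IOMMU_DOMAIN_IDENTITY handling in the
	 * alloc path: the core then falls back to IOMMU_DOMAIN_DMA for the
	 * group default domain, so no def_domain_type override is needed
	 * just for that.
	 */
	.capable		= example_capable,
	.domain_alloc_paging	= example_domain_alloc_paging,
	.probe_device		= example_probe_device,
	.release_device		= example_release_device,
	.device_group		= example_device_group,
};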


Thanks,
Robin.



Re: [RFC PATCH v2] iommu/xen: Add Xen PV-IOMMU driver

2024-06-21 Thread Robin Murphy

On 2024-06-21 5:08 pm, TSnake41 wrote:

From: Teddy Astie 

In the context of Xen, Linux runs as Dom0 and doesn't have access to the
machine IOMMU. However, an IOMMU is mandatory to use some kernel features
such as VFIO or DMA protection.

In Xen, we added a paravirtualized IOMMU with an iommu_op hypercall in order to
allow Dom0 to implement such features. This commit introduces a new IOMMU driver
that uses this new hypercall interface.

Signed-off-by: Teddy Astie 
---
Changes since v1:
* formatting changes
* applied Jan Beulich's proposed changes: removed vim notes at end of pv-iommu.h
* applied Jason Gunthorpe's proposed changes: use new ops and remove redundant
checks
---
  arch/x86/include/asm/xen/hypercall.h |   6 +
  drivers/iommu/Kconfig|   9 +
  drivers/iommu/Makefile   |   1 +
  drivers/iommu/xen-iommu.c| 489 +++
  include/xen/interface/memory.h   |  33 ++
  include/xen/interface/pv-iommu.h | 104 ++
  include/xen/interface/xen.h  |   1 +
  7 files changed, 643 insertions(+)
  create mode 100644 drivers/iommu/xen-iommu.c
  create mode 100644 include/xen/interface/pv-iommu.h

diff --git a/arch/x86/include/asm/xen/hypercall.h 
b/arch/x86/include/asm/xen/hypercall.h
index a2dd24947eb8..6b1857f27c14 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -490,6 +490,12 @@ HYPERVISOR_xenpmu_op(unsigned int op, void *arg)
return _hypercall2(int, xenpmu_op, op, arg);
  }
  
+static inline int

+HYPERVISOR_iommu_op(void *arg)
+{
+   return _hypercall1(int, iommu_op, arg);
+}
+
  static inline int
  HYPERVISOR_dm_op(
domid_t dom, unsigned int nr_bufs, struct xen_dm_op_buf *bufs)
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 0af39bbbe3a3..242cefac77c9 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -480,6 +480,15 @@ config VIRTIO_IOMMU
  
  	  Say Y here if you intend to run this kernel as a guest.
  
+config XEN_IOMMU

+   bool "Xen IOMMU driver"
+   depends on XEN_DOM0


Clearly this depends on X86 as well.


+   select IOMMU_API
+   help
+   Xen PV-IOMMU driver for Dom0.
+
+   Say Y here if you intend to run this guest as Xen Dom0.
+
  config SPRD_IOMMU
tristate "Unisoc IOMMU Support"
depends on ARCH_SPRD || COMPILE_TEST
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 542760d963ec..393afe22c901 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o
  obj-$(CONFIG_IOMMU_IOPF) += io-pgfault.o
  obj-$(CONFIG_SPRD_IOMMU) += sprd-iommu.o
  obj-$(CONFIG_APPLE_DART) += apple-dart.o
+obj-$(CONFIG_XEN_IOMMU) += xen-iommu.o
\ No newline at end of file
diff --git a/drivers/iommu/xen-iommu.c b/drivers/iommu/xen-iommu.c
new file mode 100644
index ..b765445d27cd
--- /dev/null
+++ b/drivers/iommu/xen-iommu.c
@@ -0,0 +1,489 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Xen PV-IOMMU driver.
+ *
+ * Copyright (C) 2024 Vates SAS
+ *
+ * Author: Teddy Astie 
+ *
+ */
+
+#define pr_fmt(fmt) "xen-iommu: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 


Please drop this; it's a driver, not a DMA ops implementation.


+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_DESCRIPTION("Xen IOMMU driver");
+MODULE_AUTHOR("Teddy Astie ");
+MODULE_LICENSE("GPL");
+
+#define MSI_RANGE_START		(0xfee00000)
+#define MSI_RANGE_END		(0xfeefffff)
+
+#define XEN_IOMMU_PGSIZES   (0x1000)
+
+struct xen_iommu_domain {
+   struct iommu_domain domain;
+
+   u16 ctx_no; /* Xen PV-IOMMU context number */
+};
+
+static struct iommu_device xen_iommu_device;
+
+static uint32_t max_nr_pages;
+static uint64_t max_iova_addr;
+
+static spinlock_t lock;


Not a great name - usually it's good to name a lock after what it 
protects. Although perhaps it is already, since AFAICS this isn't 
actually used anywhere anyway.



+static inline struct xen_iommu_domain *to_xen_iommu_domain(struct iommu_domain 
*dom)
+{
+   return container_of(dom, struct xen_iommu_domain, domain);
+}
+
+static inline u64 addr_to_pfn(u64 addr)
+{
+   return addr >> 12;
+}
+
+static inline u64 pfn_to_addr(u64 pfn)
+{
+   return pfn << 12;
+}
+
+bool xen_iommu_capable(struct device *dev, enum iommu_cap cap)
+{
+   switch (cap) {
+   case IOMMU_CAP_CACHE_COHERENCY:
+   return true;


Will the PV-IOMMU only ever be exposed on hardware where that really is 
always true?



+
+   default:
+   return false;
+   }
+}
+
+struct iommu_domain *xen_iommu_domain_alloc_paging(struct device *dev)
+{
+   struct xen_iommu_domain *domain;
+   int ret;
+
+   struct pv_iommu_op op = {
+   

Re: [PATCH] arm/mm: add option to prefer IOMMU ops for DMA on Xen

2023-11-14 Thread Robin Murphy

On 11/11/2023 6:45 pm, Chuck Zmudzinski wrote:

Enabling the new option, ARM_DMA_USE_IOMMU_XEN, fixes this error when
attaching the Exynos mixer in Linux dom0 on Xen on the Chromebook Snow
(and probably on other devices that use the Exynos mixer):

[drm] Exynos DRM: using 1440.fimd device for DMA mapping operations
exynos-drm exynos-drm: bound 1440.fimd (ops 0xc0d96354)
exynos-mixer 1445.mixer: [drm:exynos_drm_register_dma] *ERROR* Device
  1445.mixer lacks support for IOMMU
exynos-drm exynos-drm: failed to bind 1445.mixer (ops 0xc0d97554): -22
exynos-drm exynos-drm: adev bind failed: -22
exynos-dp: probe of 145b.dp-controller failed with error -22

Linux normally uses xen_swiotlb_dma_ops for DMA for all devices when
xen_swiotlb is detected even when Xen exposes an IOMMU to Linux. Enabling
the new config option allows devices such as the Exynos mixer to use the
IOMMU instead of xen_swiotlb_dma_ops for DMA and this fixes the error.

The new config option is not set by default because it is likely some
devices that use IOMMU for DMA on Xen will cause DMA errors and memory
corruption when Xen PV block and network drivers are in use on the system.

Link: 
https://lore.kernel.org/xen-devel/acfab1c5-eed1-4930-8c70-8681e256c...@netscape.net/

Signed-off-by: Chuck Zmudzinski 
---
The reported error with the Exynos mixer is not fixed by default by adding
a second patch to select the new option in the Kconfig definition for the
Exynos mixer if EXYNOS_IOMMU and SWIOTLB_XEN are enabled because it is
not certain setting the config option is suitable for all cases. So it is
necessary to explicitly select the new config option during the config
stage of the Linux kernel build to fix the reported error or similar
errors that have the same cause of lack of support for IOMMU on Xen. This
is necessary to avoid any regressions that might be caused by enabling the
new option by default for the Exynos mixer.
  arch/arm/mm/dma-mapping.c |  6 ++
  drivers/xen/Kconfig   | 16 
  2 files changed, 22 insertions(+)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 5409225b4abc..ca04fdf01be3 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1779,6 +1779,12 @@ void arch_setup_dma_ops(struct device *dev, u64 
dma_base, u64 size,
if (iommu)
arm_setup_iommu_dma_ops(dev, dma_base, size, iommu, coherent);
  
+#ifdef CONFIG_ARM_DMA_USE_IOMMU_XEN


FWIW I don't think this really needs a config option - if Xen *has* made 
an IOMMU available, then there isn't really much reason not to use it, 
and if for some reason someone really didn't want to then they could 
simply disable the IOMMU driver anyway.



+   if (dev->dma_ops == &iommu_ops) {
+   dev->archdata.dma_ops_setup = true;


The existing assignment is effectively unconditional by this point 
anyway, so could probably just be moved earlier to save duplicating it 
(or perhaps just make the xen_setup_dma_ops() call conditional instead 
to save the early return as well).
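i.e. roughly this shape (sketch of the suggestion, not the posted patch; "iommu_ops" being the dma_map_ops instance in this file):

	if (iommu)
		arm_setup_iommu_dma_ops(dev, dma_base, size, iommu, coherent);

	/* don't let xen-swiotlb override IOMMU ops once they're set up */
	if (dev->dma_ops != &iommu_ops)
		xen_setup_dma_ops(dev);

	dev->archdata.dma_ops_setup = true;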


However, are the IOMMU DMA ops really compatible with Xen? The comments 
about hypercalls and foreign memory in xen_arch_need_swiotlb() leave me 
concerned that assuming non-coherent DMA to any old Dom0 page is OK 
might not actually work in general :/


Thanks,
Robin.


+   return;
+   }
+#endif
xen_setup_dma_ops(dev);
dev->archdata.dma_ops_setup = true;
  }
diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index d5989871dd5d..44e1334b6acd 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -181,6 +181,22 @@ config SWIOTLB_XEN
select DMA_OPS
select SWIOTLB
  
+config ARM_DMA_USE_IOMMU_XEN

+   bool "Prefer IOMMU DMA ops on Xen"
+   depends on SWIOTLB_XEN
+   depends on ARM_DMA_USE_IOMMU
+   help
+ Normally on Xen, the IOMMU is used by Xen and not exposed to
+ Linux. Some Arm systems such as Exynos have an IOMMU that
+ Xen does not use so the IOMMU is exposed to Linux in those
+ cases. This option enables Linux to use the IOMMU instead of
+ using the Xen swiotlb_dma_ops for DMA on Xen.
+
+ Say N here unless support for one or more devices that use
+ IOMMU ops instead of Xen swiotlb ops for DMA is needed and the
+ devices that use the IOMMU do not cause any problems on the
+ Xen system in use.
+
  config XEN_PCI_STUB
bool
  




Re: [PATCH v3 1/7] swiotlb: make io_tlb_default_mem local to swiotlb.c

2023-06-27 Thread Robin Murphy

On 27/06/2023 11:24 am, Greg Kroah-Hartman wrote:

On Tue, Jun 27, 2023 at 11:54:23AM +0200, Petr Tesarik wrote:

+/**
+ * is_swiotlb_active() - check if the software IO TLB is initialized
+ * @dev:   Device to check, or %NULL for the default IO TLB.
+ */
  bool is_swiotlb_active(struct device *dev)
  {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev
+   ? dev->dma_io_tlb_mem
+   : &io_tlb_default_mem;


That's impossible to read and maintain over time, sorry.

Please use real "if () else" lines, so that it can be maintained over
time.


Moreover, it makes for a horrible interface anyway. If there's a need 
for a non-specific "is SWIOTLB present at all?" check unrelated to any 
particular device (which arguably still smells of poking into 
implementation details...), please encapsulate it in its own distinct 
helper like, say, is_swiotlb_present(void).
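e.g. something like the below, assuming io_tlb_default_mem stays as the statically-allocated struct whose validity is indicated by nslabs:

/**
 * is_swiotlb_present() - check whether the default software IO TLB exists
 */
bool is_swiotlb_present(void)
{
	return io_tlb_default_mem.nslabs != 0;
}

Then callers like octeon_pci_setup() which genuinely have no device to hand could use that, instead of overloading the per-device check with a NULL special case.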


However, the more I think about it, the more I doubt that logic like 
octeon_pci_setup() can continue to work properly at all if SWIOTLB 
allocation becomes dynamic... :/


Thanks,
Robin.



Re: [PATCH] arm/acpi: don't expose the ACPI IORT SMMUv3 entry to dom0

2022-05-09 Thread Robin Murphy

On 2022-04-27 17:12, Rahul Singh wrote:

Xen should control the SMMUv3 devices therefore, don't expose the
SMMUv3 devices to dom0. Deny iomem access to SMMUv3 address space for
dom0 and also make ACPI IORT SMMUv3 node type to 0xff.


...making the resulting IORT technically useless to consumers. ID 
mappings for all the Root Complex, Named Component and RMR nodes which 
were supposed to be translated through that SMMU node are now invalid, 
because ID mappings can only target an SMMU or ITS node. I can't guess 
at how other consumers may react, but Linux's IORT code is going to 
experience undefined behaviour when tries to translate any MSI DeviceID 
mappings through this invalid node, since it uses a bitmap of (1 << 
node->type) internally; beyond that we're not as strict as we could be 
and make some pretty loose assumptions about what we're parsing, so it 
might even appear to work, but that could easily change at any time in 
future and is absolutely not something Xen or any other software should 
try to rely on.


Thanks,
Robin.


Signed-off-by: Rahul Singh 
---
  xen/arch/arm/acpi/domain_build.c | 40 
  1 file changed, 40 insertions(+)

diff --git a/xen/arch/arm/acpi/domain_build.c b/xen/arch/arm/acpi/domain_build.c
index bbdc90f92c..ec0b5b261f 100644
--- a/xen/arch/arm/acpi/domain_build.c
+++ b/xen/arch/arm/acpi/domain_build.c
@@ -14,6 +14,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -30,6 +31,7 @@ static int __init acpi_iomem_deny_access(struct domain *d)
  {
  acpi_status status;
  struct acpi_table_spcr *spcr = NULL;
+struct acpi_table_iort *iort;
  unsigned long mfn;
  int rc;
  
@@ -55,6 +57,44 @@ static int __init acpi_iomem_deny_access(struct domain *d)

  printk("Failed to get SPCR table, Xen console may be unavailable\n");
  }
  
+status = acpi_get_table(ACPI_SIG_IORT, 0,

+(struct acpi_table_header **)&iort));
+
+if ( ACPI_SUCCESS(status) )
+{
+int i;
+struct acpi_iort_node *node, *end;
+node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
+end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
+
+for ( i = 0; i < iort->node_count; i++ )
+{
+if ( node >= end )
+break;
+
+switch ( node->type )
+{
+case ACPI_IORT_NODE_SMMU_V3:
+{
+struct acpi_iort_smmu_v3 *smmu;
+smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
+mfn = paddr_to_pfn(smmu->base_address);
+rc = iomem_deny_access(d, mfn, mfn + PFN_UP(SZ_128K));
+if ( rc )
+printk("iomem_deny_access failed for SMMUv3\n");
+node->type = 0xff;
+break;
+}
+}
+node = ACPI_ADD_PTR(struct acpi_iort_node, node, node->length);
+}
+}
+else
+{
+printk("Failed to get IORT table\n");
+return -EINVAL;
+}
+
  /* Deny MMIO access for GIC regions */
  return gic_iomem_deny_access(d);
  }




Re: [PATCH v3 1/2] dt-bindings: arm: xen: document Xen iommu device

2022-01-26 Thread Robin Murphy

On 2022-01-26 15:09, Sergiy Kibrik wrote:

Hi Robin,



This could break Linux guests, since depending on the deferred probe
timeout setting it could lead to drivers never probing because the "IOMMU"
never becomes available.



I've noticed no deferred probe timeouts when booting with this patch. Could you 
please explain more on how this would break guests?


Right now I think it would actually require command-line intervention, 
e.g. "fw_devlink=on" or "deferred_probe_timeout=3600" (with modules 
enabled for the latter to take full effect), but I'm wary of the 
potential for future config options to control those behaviours by default.


Robin.


Thank you!

  -- Sergiy




Re: [PATCH v3 1/2] dt-bindings: arm: xen: document Xen iommu device

2022-01-17 Thread Robin Murphy

On 2022-01-17 12:32, Sergiy Kibrik wrote:

In an IOMMU-capable system the hypervisor usually takes over IOMMU control.
Generally guest domains don't need to care about IOMMU management or take any
extra actions. Yet in some cases knowledge about which device is protected
may be useful for the privileged domain.

In compliance with the iommu bindings this can be achieved with a device-level
iommus property that references a dummy Xen iommu device.


This could break Linux guests, since depending on the deferred probe 
timeout setting it could lead to drivers never probing because the 
"IOMMU" never becomes available.


Unless you intend to expose actual paravirtualised IOMMU translation 
functionality to guests (in which case virtio-iommu would be highly 
preferable anyway), I don't think this is the right approach. If there's 
no better alternative to using DT to communicate Xen-specific policy, 
then at least it should logically be via a Xen-specific DT property.


Thanks,
Robin.


Signed-off-by: Sergiy Kibrik 
---
  Documentation/devicetree/bindings/arm/xen.txt | 26 +++
  1 file changed, 26 insertions(+)

diff --git a/Documentation/devicetree/bindings/arm/xen.txt 
b/Documentation/devicetree/bindings/arm/xen.txt
index db5c56db30ec..98efa95c0d1b 100644
--- a/Documentation/devicetree/bindings/arm/xen.txt
+++ b/Documentation/devicetree/bindings/arm/xen.txt
@@ -58,3 +58,29 @@ Documentation/arm/uefi.rst, which are provided by the 
regular UEFI stub. However
  they differ because they are provided by the Xen hypervisor, together with a 
set
  of UEFI runtime services implemented via hypercalls, see
  
http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,platform.h.html.
+
+* XEN IOMMU device
+
+In compliance with the iommu bindings, the Xen virtual IOMMU device node represents
+the hypervisor-managed IOMMU [1]. Platform devices specified as IOMMU masters of
+this xen-iommu device are protected by the hypervisor-managed platform IOMMU.
+
+Required properties:
+
+- compatible:  Should be "xen,iommu-el2-v1"
+- #iommu-cells: must be 0
+
+Example:
+
+xen-iommu {
+   compatible = "xen,iommu-el2-v1";
+   #iommu-cells = <0>;
+};
+
+video@fe001000 {
+   ...
+   /* this platform device is IOMMU-protected by hypervisor */
+   iommus = < 0x0>;
+};
+
+[1] Documentation/devicetree/bindings/iommu/iommu.txt




Re: [PATCH] swiotlb: set IO TLB segment size via cmdline

2021-09-17 Thread Robin Murphy

On 2021-09-17 10:36, Roman Skakun wrote:

Hi, Christoph

I use Xen PV display. In my case, the PV display backend (Dom0) allocates a
contiguous buffer via the DMA API
to implement zero-copy between Dom0 and DomU.


Well, something's gone badly wrong there - if you have to shadow the 
entire thing in a bounce buffer to import it then it's hardly zero-copy, 
is it? If you want to do buffer sharing the buffer really needs to be 
allocated appropriately to begin with, such that all relevant devices 
can access it directly. That might be something which needs fixing in Xen.


Robin.


When I start Weston under DomU, I got the next log in Dom0:
```
[ 112.554471] CPU: 0 PID: 367 Comm: weston Tainted: G O
5.10.0-yocto-standard+ #312
[ 112.575149] Call trace:
[ 112.577666] dump_backtrace+0x0/0x1b0
[ 112.581373] show_stack+0x18/0x70
[ 112.584746] dump_stack+0xd0/0x12c
[ 112.588200] swiotlb_tbl_map_single+0x234/0x360
[ 112.592781] xen_swiotlb_map_page+0xe4/0x4c0
[ 112.597095] xen_swiotlb_map_sg+0x84/0x12c
[ 112.601249] dma_map_sg_attrs+0x54/0x60
[ 112.605138] vsp1_du_map_sg+0x30/0x60
[ 112.608851] rcar_du_vsp_map_fb+0x134/0x170
[ 112.613082] rcar_du_vsp_plane_prepare_fb+0x44/0x64
[ 112.618007] drm_atomic_helper_prepare_planes+0xac/0x160
[ 112.623362] drm_atomic_helper_commit+0x88/0x390
[ 112.628029] drm_atomic_nonblocking_commit+0x4c/0x60
[ 112.633043] drm_mode_atomic_ioctl+0x9a8/0xb0c
[ 112.637532] drm_ioctl_kernel+0xc4/0x11c
[ 112.641506] drm_ioctl+0x21c/0x460
[ 112.644967] __arm64_sys_ioctl+0xa8/0xf0
[ 112.648939] el0_svc_common.constprop.0+0x78/0x1a0
[ 112.653775] do_el0_svc+0x24/0x90
[ 112.657148] el0_svc+0x14/0x20
[ 112.660254] el0_sync_handler+0x1a4/0x1b0
[ 112.664315] el0_sync+0x174/0x180
[ 112.668145] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
3686400 bytes), total 65536 (slots), used 112 (slots)
```
The problem happens here:
https://elixir.bootlin.com/linux/v5.14.4/source/drivers/gpu/drm/rcar-du/rcar_du_vsp.c#L202

Sgt was created in dma_get_sgtable() by dma_common_get_sgtable() and
includes a single page chunk
as shown here:
https://elixir.bootlin.com/linux/v5.14.5/source/kernel/dma/ops_helpers.c#L18

After creating a new sgt, we tried to map this sgt through vsp1_du_map_sg().
Internally, vsp1_du_map_sg() uses ops->map_sg (e.g.
xen_swiotlb_map_sg) to perform
the mapping.

I realized that the required segment is too big to fit in the default
swiotlb segment, and the condition
https://elixir.bootlin.com/linux/latest/source/kernel/dma/swiotlb.c#L474
is always false.

I know that I use a large buffer, but why can't I map this buffer in one chunk?

Thanks!

ср, 15 сент. 2021 г. в 16:53, Christoph Hellwig :


On Wed, Sep 15, 2021 at 03:49:52PM +0200, Jan Beulich wrote:

But the question remains: Why does the framebuffer need to be mapped
in a single giant chunk?


More importantly: if you use dynamic dma mappings for your framebuffer
you're doing something wrong.








Re: [PATCH v1 16/16] dma-mapping: Disallow .map_sg operations from returning zero on error

2021-07-16 Thread Robin Murphy

On 2021-07-16 07:33, Christoph Hellwig wrote:

On Thu, Jul 15, 2021 at 10:45:44AM -0600, Logan Gunthorpe wrote:

@@ -194,6 +194,8 @@ static int __dma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
  
+	WARN_ON_ONCE(ents == 0);


Turns this into a negative error code while we're at it, just to keep
the callers sane?


Right, by this point returning the 0 would pass through 
dma_map_sg_attrs() OK, but AFAICS dma_map_sgtable() would now get 
confused and return success but with sgt->nents = 0. Coercing it to an 
error code (which dma_map_sg_attrs() would then just change right back) 
seems sensible for the sake of easy robustness.
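i.e. something like (sketch):

	ents = ops->map_sg(dev, sg, nents, dir, attrs);

	if (WARN_ON_ONCE(ents == 0))
		ents = -EIO;	/* stop dma_map_sgtable() "succeeding" with 0 nents */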


Robin.



Re: [PATCH v1 14/16] x86/amd_gart: return error code from gart_map_sg()

2021-07-16 Thread Robin Murphy

On 2021-07-16 07:32, Christoph Hellwig wrote:

On Thu, Jul 15, 2021 at 10:45:42AM -0600, Logan Gunthorpe wrote:

@@ -458,7 +460,7 @@ static int gart_map_sg(struct device *dev, struct 
scatterlist *sg, int nents,
iommu_full(dev, pages << PAGE_SHIFT, dir);
for_each_sg(sg, s, nents, i)
s->dma_address = DMA_MAPPING_ERROR;
-   return 0;
+   return ret;


While we're at it - setting the ->dma_address to DMA_MAPPING_ERROR is
not part of the ->map_sg calling convention.  Might be worth to fix
up while we're at it.


Especially since it's not even zeroing dma_length, which at least is a 
documented indicator of validity (even if it's not strictly defined for 
failure cases either).


Robin.



Re: [PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-07-06 Thread Robin Murphy

On 2021-07-06 15:05, Christoph Hellwig wrote:

On Tue, Jul 06, 2021 at 03:01:04PM +0100, Robin Murphy wrote:

FWIW I was pondering the question of whether to do something along those
lines or just scrap the default assignment entirely, so since I hadn't got
round to saying that I've gone ahead and hacked up the alternative
(similarly untested) for comparison :)

TBH I'm still not sure which one I prefer...


Claire did implement something like your suggestion originally, but
I don't really like it as it doesn't scale for adding multiple global
pools, e.g. for the 64-bit addressable one for the various encrypted
secure guest schemes.


Ah yes, that had slipped my mind, and it's a fair point indeed. Since 
we're not concerned with a minimal fix for backports anyway I'm more 
than happy to focus on Will's approach. Another thing is that that looks 
to take us a quiet step closer to the possibility of dynamically 
resizing a SWIOTLB pool, which is something that some of the hypervisor 
protection schemes looking to build on top of this series may want to 
explore at some point.


Robin.



Re: [PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-07-06 Thread Robin Murphy

On 2021-07-06 14:24, Will Deacon wrote:

On Tue, Jul 06, 2021 at 06:48:48AM +0200, Christoph Hellwig wrote:

On Mon, Jul 05, 2021 at 08:03:52PM +0100, Will Deacon wrote:

So at this point, the AMD IOMMU driver does:

swiotlb= (iommu_default_passthrough() || sme_me_mask) ? 1 : 0;

where 'swiotlb' is a global variable indicating whether or not swiotlb
is in use. It's picked up a bit later on by pci_swiotlb_late_init(), which
will call swiotlb_exit() if 'swiotlb' is false.

Now, that used to work fine, because swiotlb_exit() clears
'io_tlb_default_mem' to NULL, but now with the restricted DMA changes, I
think that all the devices which have successfully probed beforehand will
have stale pointers to the freed structure in their 'dev->dma_io_tlb_mem'
field.


Yeah.  I don't think we can do that anymore, and I also think it is
a bad idea to start with.


I've had a crack at reworking things along the following lines:

   - io_tlb_default_mem now lives in the BSS, the flexible array member
 is now a pointer and that part is allocated dynamically (downside of
 this is an extra indirection to get at the slots).

   - io_tlb_default_mem.nslabs tells you whether the thing is valid

   - swiotlb_exit() frees the slots array and clears the rest of the
 structure to 0. I also extended it to free the actual slabs, but I'm
 not sure why it wasn't doing that before.

So a non-NULL dev->dma_io_tlb_mem should always be valid to follow.


FWIW I was pondering the question of whether to do something along those 
lines or just scrap the default assignment entirely, so since I hadn't 
got round to saying that I've gone ahead and hacked up the alternative 
(similarly untested) for comparison :)


TBH I'm still not sure which one I prefer...

Robin.

->8-
diff --git a/drivers/base/core.c b/drivers/base/core.c
index ea5b85354526..394abf184c1a 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2847,9 +2847,6 @@ void device_initialize(struct device *dev)
 defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
dev->dma_coherent = dma_default_coherent;
 #endif
-#ifdef CONFIG_SWIOTLB
-   dev->dma_io_tlb_mem = io_tlb_default_mem;
-#endif
 }
 EXPORT_SYMBOL_GPL(device_initialize);

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 39284ff2a6cd..620f16d89a98 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -107,16 +107,21 @@ struct io_tlb_mem {
 };
 extern struct io_tlb_mem *io_tlb_default_mem;

+static inline struct io_tlb_mem *dev_iotlb_mem(struct device *dev)
+{
+   return dev->dma_io_tlb_mem ?: io_tlb_default_mem;
+}
+
 static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t 
paddr)

 {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev_iotlb_mem(dev);

return mem && paddr >= mem->start && paddr < mem->end;
 }

 static inline bool is_swiotlb_force_bounce(struct device *dev)
 {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev_iotlb_mem(dev);

return mem && mem->force_bounce;
 }
@@ -167,7 +172,7 @@ bool swiotlb_free(struct device *dev, struct page 
*page, size_t size);


 static inline bool is_swiotlb_for_alloc(struct device *dev)
 {
-   return dev->dma_io_tlb_mem->for_alloc;
+   return dev_iotlb_mem(dev)->for_alloc;
 }
 #else
 static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index b7f76bca89bf..f4942149f87d 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -359,7 +359,7 @@ static unsigned int swiotlb_align_offset(struct 
device *dev, u64 addr)
 static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, 
size_t size,

   enum dma_data_direction dir)
 {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev_iotlb_mem(dev);
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
@@ -440,7 +440,7 @@ static unsigned int wrap_index(struct io_tlb_mem 
*mem, unsigned int index)

 static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
  size_t alloc_size)
 {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev_iotlb_mem(dev);
unsigned long boundary_mask = dma_get_seg_boundary(dev);
dma_addr_t tbl_dma_addr =
phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -522,7 +522,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device 
*dev, phys_addr_t orig_addr,

size_t mapping_size, size_t alloc_size,
enum dma_data_direction dir, unsigned long attrs)
 {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev_iotlb_mem(dev);
unsigned int offset = 

Re: [PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-07-02 Thread Robin Murphy

On 2021-07-02 14:58, Will Deacon wrote:

Hi Nathan,

On Thu, Jul 01, 2021 at 12:52:20AM -0700, Nathan Chancellor wrote:

On 7/1/2021 12:40 AM, Will Deacon wrote:

On Wed, Jun 30, 2021 at 08:56:51AM -0700, Nathan Chancellor wrote:

On Wed, Jun 30, 2021 at 12:43:48PM +0100, Will Deacon wrote:

On Wed, Jun 30, 2021 at 05:17:27PM +0800, Claire Chang wrote:

`BUG: unable to handle page fault for address: 003a8290` and
the fact it crashed at `_raw_spin_lock_irqsave` look like the memory
(maybe dev->dma_io_tlb_mem) was corrupted?
The dev->dma_io_tlb_mem should be set here
(https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/pci/probe.c#n2528)
through device_initialize.


I'm less sure about this. 'dma_io_tlb_mem' should be pointing at
'io_tlb_default_mem', which is a page-aligned allocation from memblock.
The spinlock is at offset 0x24 in that structure, and looking at the
register dump from the crash:

Jun 29 18:28:42 hp-4300G kernel: RSP: 0018:adb4013db9e8 EFLAGS: 00010006
Jun 29 18:28:42 hp-4300G kernel: RAX: 003a8290 RBX:  
RCX: 8900572ad580
Jun 29 18:28:42 hp-4300G kernel: RDX: 89005653f024 RSI: 000c 
RDI: 1d17
Jun 29 18:28:42 hp-4300G kernel: RBP: 0a20d000 R08: 000c 
R09: 
Jun 29 18:28:42 hp-4300G kernel: R10: 0a20d000 R11: 89005653f000 
R12: 0212
Jun 29 18:28:42 hp-4300G kernel: R13: 1000 R14: 0002 
R15: 0020
Jun 29 18:28:42 hp-4300G kernel: FS:  7f1f8898ea40() 
GS:89005728() knlGS:
Jun 29 18:28:42 hp-4300G kernel: CS:  0010 DS:  ES:  CR0: 
80050033
Jun 29 18:28:42 hp-4300G kernel: CR2: 003a8290 CR3: 0001020d 
CR4: 00350ee0
Jun 29 18:28:42 hp-4300G kernel: Call Trace:
Jun 29 18:28:42 hp-4300G kernel:  _raw_spin_lock_irqsave+0x39/0x50
Jun 29 18:28:42 hp-4300G kernel:  swiotlb_tbl_map_single+0x12b/0x4c0

Then that correlates with R11 holding the 'dma_io_tlb_mem' pointer and
RDX pointing at the spinlock. Yet RAX is holding junk :/

I agree that enabling KASAN would be a good idea, but I also think we
probably need to get some more information out of swiotlb_tbl_map_single()
to see see what exactly is going wrong in there.


I can certainly enable KASAN and if there is any debug print I can add
or dump anything, let me know!


I bit the bullet and took v5.13 with swiotlb/for-linus-5.14 merged in, built
x86 defconfig and ran it on my laptop. However, it seems to work fine!

Please can you share your .config?


Sure thing, it is attached. It is just Arch Linux's config run through
olddefconfig. The original is below in case you need to diff it.

https://raw.githubusercontent.com/archlinux/svntogit-packages/9045405dc835527164f3034b3ceb9a67c7a53cd4/trunk/config

If there is anything more that I can provide, please let me know.


I eventually got this booting (for some reason it was causing LD to SEGV
trying to link it for a while...) and sadly it works fine on my laptop. Hmm.

Did you manage to try again with KASAN?

It might also be worth taking the IOMMU out of the equation, since that
interfaces differently with SWIOTLB and I couldn't figure out the code path
from the log you provided. What happens if you boot with "amd_iommu=off
swiotlb=force"?


Oh, now there's a thing... the chat from the IOMMU API in the boot log 
implies that the IOMMU *should* be in the picture - we see that the default 
domains are IOMMU_DOMAIN_DMA and the GPU :0c:00.0 was added 
to a group. That means dev->dma_ops should be set and DMA API calls 
should be going through iommu-dma, yet the callstack in the crash says 
we've gone straight from dma_map_page_attrs() to swiotlb_map(), implying 
the inline dma_direct_map_page() path.


If dev->dma_ops didn't look right in the first place, it's perhaps less 
surprising that dev->dma_io_tlb_mem might be wild as well. It doesn't 
seem plausible that we should have a race between initialising the 
device and probing its driver, so maybe the whole dev pointer is getting 
trampled earlier in the callchain (or is fundamentally wrong to begin 
with, but from a quick skim of the amdgpu code it did look like 
adev->dev and adev->pdev are appropriately set early on by 
amdgpu_pci_probe()).



(although word of warning here: i915 dies horribly on my laptop if I pass
swiotlb=force, even with the distro 5.10 kernel)


FWIW I'd imagine you probably need to massively increase the SWIOTLB 
buffer size to have hope of that working.


Robin.



Re: [PATCH v15 12/12] of: Add plumbing for restricted DMA pool

2021-07-02 Thread Robin Murphy

On 2021-07-02 04:08, Guenter Roeck wrote:

Hi,

On Thu, Jun 24, 2021 at 11:55:26PM +0800, Claire Chang wrote:

If a device is not behind an IOMMU, we look up the device node and set
up the restricted DMA when the restricted-dma-pool is presented.

Signed-off-by: Claire Chang 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 


With this patch in place, all sparc and sparc64 qemu emulations
fail to boot. Symptom is that the root file system is not found.
Reverting this patch fixes the problem. Bisect log is attached.


Ah, OF_ADDRESS depends on !SPARC, so of_dma_configure_id() is presumably 
returning an unexpected -ENODEV from the of_dma_set_restricted_buffer() 
stub. That should probably be returning 0 instead, since either way it's 
not an error condition for it to simply do nothing.
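i.e. just this for the stub (sketch; the argument list should match whichever version of the plumbing is applied):

static inline int of_dma_set_restricted_buffer(struct device *dev,
					       struct device_node *np)
{
	/* quietly doing nothing is not an error condition here */
	return 0;
}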


Robin.



Guenter

---
# bad: [fb0ca446157a86b75502c1636b0d81e642fe6bf1] Add linux-next specific files 
for 20210701
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect start 'HEAD' 'v5.13'
# bad: [f63c4fda987a19b1194cc45cb72fd5bf968d9d90] Merge remote-tracking branch 
'rdma/for-next'
git bisect bad f63c4fda987a19b1194cc45cb72fd5bf968d9d90
# good: [46bb5dd1d2a63e906e374e97dfd4a5e33934b1c4] Merge remote-tracking branch 
'ipsec/master'
git bisect good 46bb5dd1d2a63e906e374e97dfd4a5e33934b1c4
# good: [43ba6969cfb8185353a7a6fc79070f13b9e3d6d3] Merge remote-tracking branch 
'clk/clk-next'
git bisect good 43ba6969cfb8185353a7a6fc79070f13b9e3d6d3
# good: [1ca5eddcf8dca1d6345471c6404e7364af0d7019] Merge remote-tracking branch 
'fuse/for-next'
git bisect good 1ca5eddcf8dca1d6345471c6404e7364af0d7019
# good: [8f6d7b3248705920187263a4e7147b0752ec7dcf] Merge remote-tracking branch 
'pci/next'
git bisect good 8f6d7b3248705920187263a4e7147b0752ec7dcf
# good: [df1885a755784da3ef285f36d9230c1d090ef186] RDMA/rtrs_clt: Alloc less 
memory with write path fast memory registration
git bisect good df1885a755784da3ef285f36d9230c1d090ef186
# good: [93d31efb58c8ad4a66bbedbc2d082df458c04e45] Merge remote-tracking branch 
'cpufreq-arm/cpufreq/arm/linux-next'
git bisect good 93d31efb58c8ad4a66bbedbc2d082df458c04e45
# good: [46308965ae6fdc7c25deb2e8c048510ae51bbe66] RDMA/irdma: Check contents 
of user-space irdma_mem_reg_req object
git bisect good 46308965ae6fdc7c25deb2e8c048510ae51bbe66
# good: [6de7a1d006ea9db235492b288312838d6878385f] 
thermal/drivers/int340x/processor_thermal: Split enumeration and processing part
git bisect good 6de7a1d006ea9db235492b288312838d6878385f
# good: [081bec2577cda3d04f6559c60b6f4e2242853520] dt-bindings: of: Add 
restricted DMA pool
git bisect good 081bec2577cda3d04f6559c60b6f4e2242853520
# good: [bf95ac0bcd69979af146852f6a617a60285ebbc1] Merge remote-tracking branch 
'thermal/thermal/linux-next'
git bisect good bf95ac0bcd69979af146852f6a617a60285ebbc1
# good: [3d8287544223a3d2f37981c1f9ffd94d0b5e9ffc] RDMA/core: Always release 
restrack object
git bisect good 3d8287544223a3d2f37981c1f9ffd94d0b5e9ffc
# bad: [cff1f23fad6e0bd7d671acce0d15285c709f259c] Merge remote-tracking branch 
'swiotlb/linux-next'
git bisect bad cff1f23fad6e0bd7d671acce0d15285c709f259c
# bad: [b655006619b7bccd0dc1e055bd72de5d613e7b5c] of: Add plumbing for 
restricted DMA pool
git bisect bad b655006619b7bccd0dc1e055bd72de5d613e7b5c
# first bad commit: [b655006619b7bccd0dc1e055bd72de5d613e7b5c] of: Add plumbing 
for restricted DMA pool





Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Robin Murphy

On 2021-06-24 12:18, Will Deacon wrote:

On Thu, Jun 24, 2021 at 12:14:39PM +0100, Robin Murphy wrote:

On 2021-06-24 07:05, Claire Chang wrote:

On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig  wrote:


On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:

is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119

is_swiotlb_force_bounce() was the new function introduced in this patch here.

+static inline bool is_swiotlb_force_bounce(struct device *dev)
+{
+ return dev->dma_io_tlb_mem->force_bounce;
+}


To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
turn this into :

  return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;

for a quick debug check?


I just realized that dma_io_tlb_mem might be NULL like Christoph
pointed out since swiotlb might not get initialized.
However,  `Unable to handle kernel paging request at virtual address
dfff800e` looks more like the address is garbage rather than
NULL?
I wonder if that's because dev->dma_io_tlb_mem is not assigned
properly (which means device_initialize is not called?).


What also looks odd is that the base "address" 0xdfff8000 is held in
a couple of registers, but the offset 0xe looks too small to match up to any
relevant structure member in that dereference chain :/


FWIW, I've managed to trigger a NULL dereference locally when swiotlb hasn't
been initialised but we dereference 'dev->dma_io_tlb_mem', so I think
Christoph's suggestion is needed regardless.


Ack to that - for SWIOTLB_NO_FORCE, io_tlb_default_mem will remain NULL. 
The massive jump in KernelCI baseline failures as of yesterday looks 
like every arm64 machine with less than 4GB of RAM blowing up...


Robin.


But I agree that it won't help
with the issue reported by Qian Cai.

Qian Cai: please can you share your .config and your command line?

Thanks,

Will





Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Robin Murphy

On 2021-06-24 07:05, Claire Chang wrote:

On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig  wrote:


On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:

is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119

is_swiotlb_force_bounce() was the new function introduced in this patch here.

+static inline bool is_swiotlb_force_bounce(struct device *dev)
+{
+ return dev->dma_io_tlb_mem->force_bounce;
+}


To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
turn this into :

 return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;

for a quick debug check?


I just realized that dma_io_tlb_mem might be NULL like Christoph
pointed out since swiotlb might not get initialized.
However,  `Unable to handle kernel paging request at virtual address
dfff800e` looks more like the address is garbage rather than
NULL?
I wonder if that's because dev->dma_io_tlb_mem is not assigned
properly (which means device_initialize is not called?).


What also looks odd is that the base "address" 0xdfff8000 is 
held in a couple of registers, but the offset 0xe looks too small to 
match up to any relevant structure member in that dereference chain :/


Robin.



Re: [RFC PATCH V3 08/11] swiotlb: Add bounce buffer remap address setting function

2021-06-14 Thread Robin Murphy

On 2021-06-07 07:43, Christoph Hellwig wrote:

On Sun, May 30, 2021 at 11:06:25AM -0400, Tianyu Lan wrote:

From: Tianyu Lan 

For Hyper-V isolation VM with AMD SEV SNP, the bounce buffer(shared memory)
needs to be accessed via extra address space(e.g address above bit39).
Hyper-V code may remap extra address space outside of swiotlb. swiotlb_
bounce() needs to use remap virtual address to copy data from/to bounce
buffer. Add new interface swiotlb_set_bounce_remap() to do that.


Why can't you use the bus_dma_region ranges to remap to your preferred
address?


FWIW, I think a better generalisation for this would be allowing 
set_memory_decrypted() to return an address rather than implicitly 
operating in-place, and hide all the various hypervisor hooks behind that.
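To sketch the idea (a purely hypothetical interface, not the existing set_memory_decrypted()):

/*
 * Hypothetical: mark pages as shared/decrypted and return the virtual
 * address through which they must now be accessed. On bare metal or SEV
 * this would simply be @addr; a Hyper-V isolation VM could instead return
 * an alias above the vTOM boundary, keeping that detail out of swiotlb.
 */
void *set_memory_decrypted_addr(void *addr, int numpages);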


Robin.



Re: Regression in at least 5.10.y and mainline: Firewire audio interface fails to work properly (when booted under Xen)

2021-06-14 Thread Robin Murphy

On 2021-06-13 07:29, Salvatore Bonaccorso wrote:

A user in Debian reported the above issue, which was reproducible with
5.13-rc5 and 5.10.y as packaged in Debian, and found that 85a5a6875ca9
("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single")
introduced the issue.


Sounds like it's probably the same thing as being discussed over here:

https://lore.kernel.org/linux-iommu/2e899de2-4b69-c4b6-33a6-09fb8949d...@nxp.com/

Robin.



Re: [PATCH v5 14/16] dma-direct: Allocate memory from restricted DMA pool if available

2021-04-23 Thread Robin Murphy

On 2021-04-22 09:15, Claire Chang wrote:

The restricted DMA pool is preferred if available.

The restricted DMA pools provide a basic level of protection against the
DMA overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system
needs to provide a way to lock down the memory access, e.g., MPU.

Signed-off-by: Claire Chang 
---
  kernel/dma/direct.c | 35 ++-
  1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 7a27f0510fcc..29523d2a9845 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -78,6 +78,10 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t 
phys, size_t size)
  static void __dma_direct_free_pages(struct device *dev, struct page *page,
size_t size)
  {
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (swiotlb_free(dev, page, size))
+   return;
+#endif
dma_free_contiguous(dev, page, size);
  }
  
@@ -92,7 +96,17 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
  
  	gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,

   _limit);
-   page = dma_alloc_contiguous(dev, size, gfp);
+
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   page = swiotlb_alloc(dev, size);
+   if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
+   __dma_direct_free_pages(dev, page, size);
+   page = NULL;
+   }
+#endif
+
+   if (!page)
+   page = dma_alloc_contiguous(dev, size, gfp);
if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
dma_free_contiguous(dev, page, size);
page = NULL;
@@ -148,7 +162,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
gfp |= __GFP_NOWARN;
  
  	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&

-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO);
if (!page)
return NULL;
@@ -161,8 +175,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
  
  	if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&

-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev))
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev))
return arch_dma_alloc(dev, size, dma_handle, gfp, attrs);
  
  	/*

@@ -172,7 +186,9 @@ void *dma_direct_alloc(struct device *dev, size_t size,
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
!gfpflags_allow_blocking(gfp) &&
(force_dma_unencrypted(dev) ||
-(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && 
!dev_is_dma_coherent(dev
+(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
+ !dev_is_dma_coherent(dev))) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
  
  	/* we always manually zero the memory once we are done */

@@ -253,15 +269,15 @@ void dma_direct_free(struct device *dev, size_t size,
unsigned int page_order = get_order(size);
  
  	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&

-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
/* cpu_addr is a struct page cookie, not a kernel address */
dma_free_contiguous(dev, cpu_addr, size);
return;
}
  
  	if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&

-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev)) {
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev)) {
arch_dma_free(dev, size, cpu_addr, dma_addr, attrs);
return;
}
@@ -289,7 +305,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, 
size_t size,
void *ret;
  
  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&

-   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp))
+   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);


Wait, this seems broken for non-coherent devices - in that case we need 
to return a non-cacheable address, but we can't simply fall through into 
the remapping path below in GFP_ATOMIC context. That's why we need the 
atomic pool concept in the first place :/


Unless I've overlooked something, we're still using the regular 
cacheable linear map address of the dma_io_tlb_mem buffer, no?


Robin.

  
  	page = __dma_direct_alloc_pages(dev, size, 

Re: [PATCH v5 16/16] of: Add plumbing for restricted DMA pool

2021-04-23 Thread Robin Murphy

On 2021-04-22 09:15, Claire Chang wrote:

If a device is not behind an IOMMU, we look up the device node and set
up the restricted DMA when the restricted-dma-pool is presented.

Signed-off-by: Claire Chang 
---
  drivers/of/address.c| 25 +
  drivers/of/device.c |  3 +++
  drivers/of/of_private.h |  5 +
  3 files changed, 33 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index 54f221dde267..fff3adfe4986 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -8,6 +8,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -1109,6 +1110,30 @@ bool of_dma_is_coherent(struct device_node *np)
  }
  EXPORT_SYMBOL_GPL(of_dma_is_coherent);
  
+int of_dma_set_restricted_buffer(struct device *dev)

+{
+   struct device_node *node;
+   int count, i;
+
+   if (!dev->of_node)
+   return 0;
+
+   count = of_property_count_elems_of_size(dev->of_node, "memory-region",
+   sizeof(phandle));
+   for (i = 0; i < count; i++) {
+   node = of_parse_phandle(dev->of_node, "memory-region", i);
+   /* There might be multiple memory regions, but only one
+* restricted-dma-pool region is allowed.
+*/


What's the use-case for having multiple regions if the restricted pool 
is by definition the only one accessible?


Robin.


+   if (of_device_is_compatible(node, "restricted-dma-pool") &&
+   of_device_is_available(node))
+   return of_reserved_mem_device_init_by_idx(
+   dev, dev->of_node, i);
+   }
+
+   return 0;
+}
+
  /**
   * of_mmio_is_nonposted - Check if device uses non-posted MMIO
   * @np:   device node
diff --git a/drivers/of/device.c b/drivers/of/device.c
index c5a9473a5fb1..d8d865223e51 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
  
  	arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);
  
+	if (!iommu)

+   return of_dma_set_restricted_buffer(dev);
+
return 0;
  }
  EXPORT_SYMBOL_GPL(of_dma_configure_id);
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index d717efbd637d..e9237f5eff48 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -163,12 +163,17 @@ struct bus_dma_region;
  #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA)
  int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map);
+int of_dma_set_restricted_buffer(struct device *dev);
  #else
  static inline int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map)
  {
return -ENODEV;
  }
+static inline int of_dma_get_restricted_buffer(struct device *dev)
+{
+   return -ENODEV;
+}
  #endif
  
  #endif /* _LINUX_OF_PRIVATE_H */






Re: [PATCH v5 08/16] swiotlb: Update is_swiotlb_active to add a struct device argument

2021-04-23 Thread Robin Murphy

On 2021-04-22 09:15, Claire Chang wrote:

Update is_swiotlb_active to add a struct device argument. This will be
useful later to allow for restricted DMA pool.

Signed-off-by: Claire Chang 
---
  drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +-
  drivers/gpu/drm/nouveau/nouveau_ttm.c| 2 +-
  drivers/pci/xen-pcifront.c   | 2 +-
  include/linux/swiotlb.h  | 4 ++--
  kernel/dma/direct.c  | 2 +-
  kernel/dma/swiotlb.c | 4 ++--
  6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c 
b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
index ce6b664b10aa..7d48c433446b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
@@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct 
drm_i915_gem_object *obj)
  
  	max_order = MAX_ORDER;

  #ifdef CONFIG_SWIOTLB
-   if (is_swiotlb_active()) {
+   if (is_swiotlb_active(NULL)) {
unsigned int max_segment;
  
  		max_segment = swiotlb_max_segment();

diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c 
b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index e8b506a6685b..2a2ae6d6cf6d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm)
}
  
  #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)

-   need_swiotlb = is_swiotlb_active();
+   need_swiotlb = is_swiotlb_active(NULL);
  #endif
  
  	ret = ttm_device_init(&drm->ttm.bdev, &nouveau_bo_driver, drm->dev->dev,

diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index b7a8f3a1921f..6d548ce53ce7 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct 
pcifront_device *pdev)
  
  	spin_unlock(&pcifront_dev_lock);
  
-	if (!err && !is_swiotlb_active()) {
+	if (!err && !is_swiotlb_active(NULL)) {
 		err = pci_xen_swiotlb_init_late();
 		if (err)
 			dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n");
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 2a6cca07540b..c530c976d18b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -123,7 +123,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
  void __init swiotlb_exit(void);
  unsigned int swiotlb_max_segment(void);
  size_t swiotlb_max_mapping_size(struct device *dev);
-bool is_swiotlb_active(void);
+bool is_swiotlb_active(struct device *dev);
  void __init swiotlb_adjust_size(unsigned long size);
  #else
  #define swiotlb_force SWIOTLB_NO_FORCE
@@ -143,7 +143,7 @@ static inline size_t swiotlb_max_mapping_size(struct device 
*dev)
return SIZE_MAX;
  }
  
-static inline bool is_swiotlb_active(void)
+static inline bool is_swiotlb_active(struct device *dev)
  {
return false;
  }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 84c9feb5474a..7a88c34d0867 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -495,7 +495,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
  size_t dma_direct_max_mapping_size(struct device *dev)
  {
/* If SWIOTLB is active, use its maximum mapping size */
-   if (is_swiotlb_active() &&
+   if (is_swiotlb_active(dev) &&
(dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))


I wonder if it's worth trying to fold these other conditions into 
is_swiotlb_active() itself? I'm not entirely sure what matters for Xen, 
but for the other cases it seems like they probably only care about 
whether bouncing may occur for their particular device or not (possibly 
they want to be using dma_max_mapping_size() now anyway - TBH I'm 
struggling to make sense of what the swiotlb_max_segment business is 
supposed to mean).


Otherwise, patch #9 will need to touch here as well to make sure that 
per-device forced bouncing is reflected correctly.


Robin.


return swiotlb_max_mapping_size(dev);
return SIZE_MAX;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ffbb8724e06c..1d221343f1c8 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -659,9 +659,9 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
  }
  
-bool is_swiotlb_active(void)
+bool is_swiotlb_active(struct device *dev)
  {
-   return io_tlb_default_mem != NULL;
+   return get_io_tlb_mem(dev) != NULL;
  }
  EXPORT_SYMBOL_GPL(is_swiotlb_active);
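
For illustration, the folding suggested above might look roughly like this - 
a sketch only, against the code quoted here where swiotlb_force is still a 
global flag, and with an invented helper name:

	/* Sketch: a single "might this device end up bouncing?" predicate,
	 * so callers such as dma_direct_max_mapping_size() need not repeat
	 * the addressing-limit and force-bounce conditions themselves. */
	static inline bool dev_may_use_swiotlb(struct device *dev)
	{
		return is_swiotlb_active(dev) &&
		       (dma_addressing_limited(dev) ||
			swiotlb_force == SWIOTLB_FORCE);
	}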
  





Re: [PATCH RFC v1 2/6] swiotlb: convert variables to arrays

2021-02-04 Thread Robin Murphy

On 2021-02-04 07:29, Christoph Hellwig wrote:

On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote:

This patch converts several swiotlb related variables to arrays, in
order to maintain stat/status for different swiotlb buffers. Here are
variables involved:

- io_tlb_start and io_tlb_end
- io_tlb_nslabs and io_tlb_used
- io_tlb_list
- io_tlb_index
- max_segment
- io_tlb_orig_addr
- no_iotlb_memory

There is no functional change and this is to prepare to enable 64-bit
swiotlb.


Claire Chang (on Cc) already posted a patch like this a month ago,
which looks much better because it actually uses a struct instead
of all the random variables.


Indeed, I skimmed the cover letter and immediately thought that this 
whole thing is just the restricted DMA pool concept[1] again, only from 
a slightly different angle.


Robin.

[1] 
https://lore.kernel.org/linux-iommu/20210106034124.30560-1-tien...@chromium.org/




Re: [RFC PATCH v3 5/6] dt-bindings: of: Add restricted DMA pool

2021-01-21 Thread Robin Murphy

On 2021-01-21 15:48, Rob Herring wrote:

On Wed, Jan 20, 2021 at 7:10 PM Robin Murphy 
wrote:


On 2021-01-20 21:31, Rob Herring wrote:

On Wed, Jan 20, 2021 at 11:30 AM Robin Murphy
 wrote:


On 2021-01-20 16:53, Rob Herring wrote:

On Wed, Jan 06, 2021 at 11:41:23AM +0800, Claire Chang
wrote:

Introduce the new compatible string, restricted-dma-pool,
for restricted DMA. One can specify the address and length
of the restricted DMA memory region by restricted-dma-pool
in the device tree.


If this goes into DT, I think we should be able to use
dma-ranges for this purpose instead. Normally, 'dma-ranges'
is for physical bus restrictions, but there's no reason it
can't be used for policy or to express restrictions the
firmware has enabled.


There would still need to be some way to tell SWIOTLB to pick
up the corresponding chunk of memory and to prevent the kernel
from using it for anything else, though.


Don't we already have that problem if dma-ranges had a very
small range? We just get lucky because the restriction is
generally much more RAM than needed.


Not really - if a device has a naturally tiny addressing capability
that doesn't even cover ZONE_DMA32 where the regular SWIOTLB buffer
will be allocated then it's unlikely to work well, but that's just
crap system design. Yes, memory pressure in ZONE_DMA{32} is
particularly problematic for such limited devices, but it's
irrelevant to the issue at hand here.


Yesterday's crap system design is today's security feature. Couldn't 
this feature make crap system design work better?


Indeed! Say you bring out your shiny new "Strawberry Flan 4" machine
with all the latest connectivity, but tragically its PCIe can only
address 25% of the RAM. So you decide to support deploying it in two
configurations: one where it runs normally for best performance, and
another "secure" one where it dedicates that quarter of RAM as a 
restricted DMA pool for any PCIe devices - that way, even if that hotel 
projector you plug in turns out to be a rogue Thunderbolt endpoint, it 
can never snarf your private keys off your eMMC out of the page cache.


(Yes, it is the thinnest of strawmen, but it sets the scene for the 
point you raised...)


...which is that in both cases the dma-ranges will still be identical. 
So how is the kernel going to know whether to steal that whole area from 
memblock before anything else can allocate from it, or not?


I don't disagree that even in Claire's original intended case it would 
be semantically correct to describe the hardware-firewalled region with 
dma-ranges. It just turns out not to be necessary, and you're already 
arguing for not adding anything in DT that doesn't need to be.



What we have here is a device that's not allowed to see *kernel*
memory at all. It's been artificially constrained to a particular
region by a TZASC or similar, and the only data which should ever
be placed in that


May have been constrained, but that's entirely optional.

In the optional case where the setup is entirely up to the OS, I
don't think this belongs in the DT at all. Perhaps that should be
solved first.


Yes! Let's definitely consider that case! Say you don't have any 
security or physical limitations but want to use a bounce pool for some 
device anyway because reasons (perhaps copying streaming DMA data to a 
better guaranteed alignment gives an overall performance win). Now the 
*only* relevant thing to communicate to the kernel is to, ahem, reserve 
a large chunk of memory, and use it for this special purpose. Isn't that 
literally what reserved-memory bindings are for?



region is data intended for that device to see. That way if it
tries to go rogue it physically can't start slurping data intended
for other devices or not mapped for DMA at all. The bouncing is an
important part of this - I forget the title off-hand but there was
an interesting paper a few years ago which demonstrated that even
with an IOMMU, streaming DMA of in-place buffers could reveal
enough adjacent data from the same page to mount an attack on the
system. Memory pressure should be immaterial since the size of each
bounce pool carveout will presumably be tuned for the needs of the
given device.

In any case, wouldn't finding all the dma-ranges do this? We're 
already walking the tree to find the max DMA address now.


If all you can see are two "dma-ranges" properties, how do you
propose to tell that one means "this is the extent of what I can
address, please set my masks and dma-range-map accordingly and try
to allocate things where I can reach them" while the other means
"take this output range away from the page allocator and hook it up
as my dedicated bounce pool, because it is Serious Security Time"?
Especially since getting that choice wrong either way would be a
Bad Thing.


Either we have some heuristic based on the size or we add some hint. 
The point is let's build on what we already have for defining DMA 

Re: [RFC PATCH v3 5/6] dt-bindings: of: Add restricted DMA pool

2021-01-20 Thread Robin Murphy

On 2021-01-20 21:31, Rob Herring wrote:

On Wed, Jan 20, 2021 at 11:30 AM Robin Murphy  wrote:


On 2021-01-20 16:53, Rob Herring wrote:

On Wed, Jan 06, 2021 at 11:41:23AM +0800, Claire Chang wrote:

Introduce the new compatible string, restricted-dma-pool, for restricted
DMA. One can specify the address and length of the restricted DMA memory
region by restricted-dma-pool in the device tree.


If this goes into DT, I think we should be able to use dma-ranges for
this purpose instead. Normally, 'dma-ranges' is for physical bus
restrictions, but there's no reason it can't be used for policy or to
express restrictions the firmware has enabled.


There would still need to be some way to tell SWIOTLB to pick up the
corresponding chunk of memory and to prevent the kernel from using it
for anything else, though.


Don't we already have that problem if dma-ranges had a very small
range? We just get lucky because the restriction is generally much
more RAM than needed.


Not really - if a device has a naturally tiny addressing capability that 
doesn't even cover ZONE_DMA32 where the regular SWIOTLB buffer will be 
allocated then it's unlikely to work well, but that's just crap system 
design. Yes, memory pressure in ZONE_DMA{32} is particularly problematic 
for such limited devices, but it's irrelevant to the issue at hand here.


What we have here is a device that's not allowed to see *kernel* memory 
at all. It's been artificially constrained to a particular region by a 
TZASC or similar, and the only data which should ever be placed in that 
region is data intended for that device to see. That way if it tries to 
go rogue it physically can't start slurping data intended for other 
devices or not mapped for DMA at all. The bouncing is an important part 
of this - I forget the title off-hand but there was an interesting paper 
a few years ago which demonstrated that even with an IOMMU, streaming 
DMA of in-place buffers could reveal enough adjacent data from the same 
page to mount an attack on the system. Memory pressure should be 
immaterial since the size of each bounce pool carveout will presumably 
be tuned for the needs of the given device.



In any case, wouldn't finding all the dma-ranges do this? We're
already walking the tree to find the max DMA address now.


If all you can see are two "dma-ranges" properties, how do you propose 
to tell that one means "this is the extent of what I can address, please 
set my masks and dma-range-map accordingly and try to allocate things 
where I can reach them" while the other means "take this output range 
away from the page allocator and hook it up as my dedicated bounce pool, 
because it is Serious Security Time"? Especially since getting that 
choice wrong either way would be a Bad Thing.


Robin.


Signed-off-by: Claire Chang 
---
   .../reserved-memory/reserved-memory.txt   | 24 +++
   1 file changed, 24 insertions(+)

diff --git 
a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt 
b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
index e8d3096d922c..44975e2a1fd2 100644
--- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
+++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
@@ -51,6 +51,20 @@ compatible (optional) - standard definition
 used as a shared pool of DMA buffers for a set of devices. It can
 be used by an operating system to instantiate the necessary pool
 management subsystem if necessary.
+- restricted-dma-pool: This indicates a region of memory meant to be
+  used as a pool of restricted DMA buffers for a set of devices. The
+  memory region would be the only region accessible to those devices.
+  When using this, the no-map and reusable properties must not be set,
+  so the operating system can create a virtual mapping that will be 
used
+  for synchronization. The main purpose for restricted DMA is to
+  mitigate the lack of DMA access control on systems without an IOMMU,
+  which could result in the DMA accessing the system memory at
+  unexpected times and/or unexpected addresses, possibly leading to 
data
+  leakage or corruption. The feature on its own provides a basic level
+  of protection against the DMA overwriting buffer contents at
+  unexpected times. However, to protect against general data leakage 
and
+  system memory corruption, the system needs to provide a way to restrict
+  the DMA to a predefined memory region.
   - vendor specific string in the form ,[-]
   no-map (optional) - empty property
   - Indicates the operating system must not create a virtual mapping
@@ -120,6 +134,11 @@ one for multimedia processing (named 
multimedia-memory@7700, 64MiB).
  compatible = "acme,multimedia-memory";
 

Re: [RFC PATCH v3 5/6] dt-bindings: of: Add restricted DMA pool

2021-01-20 Thread Robin Murphy

On 2021-01-20 16:53, Rob Herring wrote:

On Wed, Jan 06, 2021 at 11:41:23AM +0800, Claire Chang wrote:

Introduce the new compatible string, restricted-dma-pool, for restricted
DMA. One can specify the address and length of the restricted DMA memory
region by restricted-dma-pool in the device tree.


If this goes into DT, I think we should be able to use dma-ranges for
this purpose instead. Normally, 'dma-ranges' is for physical bus
restrictions, but there's no reason it can't be used for policy or to
express restrictions the firmware has enabled.


There would still need to be some way to tell SWIOTLB to pick up the 
corresponding chunk of memory and to prevent the kernel from using it 
for anything else, though.



Signed-off-by: Claire Chang 
---
  .../reserved-memory/reserved-memory.txt   | 24 +++
  1 file changed, 24 insertions(+)

diff --git 
a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt 
b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
index e8d3096d922c..44975e2a1fd2 100644
--- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
+++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
@@ -51,6 +51,20 @@ compatible (optional) - standard definition
used as a shared pool of DMA buffers for a set of devices. It can
be used by an operating system to instantiate the necessary pool
management subsystem if necessary.
+- restricted-dma-pool: This indicates a region of memory meant to be
+  used as a pool of restricted DMA buffers for a set of devices. The
+  memory region would be the only region accessible to those devices.
+  When using this, the no-map and reusable properties must not be set,
+  so the operating system can create a virtual mapping that will be 
used
+  for synchronization. The main purpose for restricted DMA is to
+  mitigate the lack of DMA access control on systems without an IOMMU,
+  which could result in the DMA accessing the system memory at
+  unexpected times and/or unexpected addresses, possibly leading to 
data
+  leakage or corruption. The feature on its own provides a basic level
+  of protection against the DMA overwriting buffer contents at
+  unexpected times. However, to protect against general data leakage 
and
+  system memory corruption, the system needs to provide a way to restrict
+  the DMA to a predefined memory region.
  - vendor specific string in the form ,[-]
  no-map (optional) - empty property
  - Indicates the operating system must not create a virtual mapping
@@ -120,6 +134,11 @@ one for multimedia processing (named 
multimedia-memory@7700, 64MiB).
compatible = "acme,multimedia-memory";
reg = <0x7700 0x400>;
};
+
+   restricted_dma_mem_reserved: restricted_dma_mem_reserved {
+   compatible = "restricted-dma-pool";
+   reg = <0x5000 0x40>;
+   };
};
  
  	/* ... */

@@ -138,4 +157,9 @@ one for multimedia processing (named 
multimedia-memory@7700, 64MiB).
memory-region = <_reserved>;
/* ... */
};
+
+   pcie_device: pcie_device@0,0 {
+   memory-region = <&restricted_dma_mem_reserved>;


PCI hosts often have inbound window configurations that limit the
address range and translate PCI to bus addresses. Those windows happen
to be configured by dma-ranges. In any case, wouldn't you want to put
the configuration in the PCI host node? Is there a usecase of
restricting one PCIe device and not another?


The general design seems to accommodate devices having their own pools 
such that they can't even snoop on each others' transient DMA data. If 
the interconnect had a way of wiring up, say, PCI RIDs to AMBA NSAIDs, 
then in principle you could certainly apply that to PCI endpoints too 
(presumably you'd also disallow them from peer-to-peer transactions at 
the PCI level too).


Robin.



Re: [RFC PATCH v3 4/6] swiotlb: Add restricted DMA alloc/free support.

2021-01-13 Thread Robin Murphy

On 2021-01-13 12:48, Christoph Hellwig wrote:

+#ifdef CONFIG_SWIOTLB
+   if (unlikely(dev->dma_io_tlb_mem))
+   return swiotlb_alloc(dev, size, dma_handle, attrs);
+#endif


Another place where the dma_io_tlb_mem is useful to avoid the ifdef.


-phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
-   size_t mapping_size, size_t alloc_size,
-   enum dma_data_direction dir, unsigned long attrs)
+static int swiotlb_tbl_find_free_region(struct device *hwdev,
+   dma_addr_t tbl_dma_addr,
+   size_t alloc_size,
+   unsigned long attrs)



+static void swiotlb_tbl_release_region(struct device *hwdev, int index,
+  size_t size)


This refactoring should be another prep patch.



+void *swiotlb_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
+   unsigned long attrs)


I'd rather have the names convey there are for the per-device bounce
buffer in some form.


+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;


While we're at it I wonder if the io_tlb is something we could change
while we're at it.  Maybe replace io_tlb_mem with struct swiotlb
and rename the field in struct device to dev_swiotlb?


+   int index;
+   void *vaddr;
+   phys_addr_t tlb_addr;
+
+   size = PAGE_ALIGN(size);
+   index = swiotlb_tbl_find_free_region(dev, mem->start, size, attrs);
+   if (index < 0)
+   return NULL;
+
+   tlb_addr = mem->start + (index << IO_TLB_SHIFT);
+   *dma_handle = phys_to_dma_unencrypted(dev, tlb_addr);
+
+   if (!dev_is_dma_coherent(dev)) {
+   unsigned long pfn = PFN_DOWN(tlb_addr);
+
+   /* remove any dirty cache lines on the kernel alias */
+   arch_dma_prep_coherent(pfn_to_page(pfn), size);


Can we hook in somewhat lower level in the dma-direct code so that all
the remapping in dma-direct can be reused instead of duplicated?  That
also becomes important if we want to use non-remapping uncached support,
e.g. on mips or x86, or the direct changing of the attributes that Will
planned to look into for arm64.


Indeed, AFAICS this ought to boil down to a direct equivalent of 
__dma_direct_alloc_pages() - other than the address there should be no 
conceptual difference between pages from the restricted pool and those 
from the regular page allocator, so this probably deserves to be plumbed 
in as an alternative to that.


Robin.
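
For illustration only, the plumbing described above might be dispatched at 
the point where dma-direct would otherwise hit the page allocator - roughly 
as below, with every helper name an invented stand-in rather than the real 
dma-direct internals:

	/* Sketch: treat the restricted pool as an alternative page source.
	 * dev_has_restricted_pool() stands in for the dev->dma_io_tlb_mem
	 * check; restricted_pool_alloc_pages() for a page-returning variant
	 * of swiotlb_alloc(). */
	static struct page *dma_alloc_backing_pages(struct device *dev,
						    size_t size, gfp_t gfp)
	{
		if (dev_has_restricted_pool(dev))
			return restricted_pool_alloc_pages(dev, size);

		/* Otherwise fall back to the normal buddy-allocator path. */
		return alloc_pages(gfp, get_order(size));
	}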



Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Robin Murphy

On 2021-01-13 17:43, Florian Fainelli wrote:

On 1/13/21 7:27 AM, Robin Murphy wrote:

On 2021-01-13 13:59, Nicolas Saenz Julienne wrote:

Hi All,

On Tue, 2021-01-12 at 16:03 -0800, Florian Fainelli wrote:

On 1/5/21 7:41 PM, Claire Chang wrote:

Add the initialization function to create restricted DMA pools from
matching reserved-memory nodes in the device tree.

Signed-off-by: Claire Chang 
---
   include/linux/device.h  |   4 ++
   include/linux/swiotlb.h |   7 +-
   kernel/dma/Kconfig  |   1 +
   kernel/dma/swiotlb.c    | 144
++--
   4 files changed, 131 insertions(+), 25 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 89bb8b84173e..ca6f71ec8871 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -413,6 +413,7 @@ struct dev_links_info {
    * @dma_pools:    Dma pools (if dma'ble device).
    * @dma_mem:    Internal for coherent mem override.
    * @cma_area:    Contiguous memory area for dma allocations
+ * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
    * @archdata:    For arch-specific additions.
    * @of_node:    Associated device tree node.
    * @fwnode:    Associated device node supplied by platform firmware.
@@ -515,6 +516,9 @@ struct device {
   #ifdef CONFIG_DMA_CMA
   struct cma *cma_area;    /* contiguous memory area for dma
  allocations */
+#endif
+#ifdef CONFIG_SWIOTLB
+    struct io_tlb_mem    *dma_io_tlb_mem;
   #endif
   /* arch specific additions */
   struct dev_archdata    archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index dd8eb57cbb8f..a1bbd775 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -76,12 +76,13 @@ extern enum swiotlb_force swiotlb_force;
    *
    * @start:    The start address of the swiotlb memory pool. Used
to do a quick
    *    range check to see if the memory was in fact allocated
by this
- *    API.
+ *    API. For restricted DMA pool, this is device tree
adjustable.


Maybe write it as this is "firmware adjustable" such that when/if ACPI
needs something like this, the description does not need updating.


TBH I really don't think this needs calling out at all. Even in the
regular case, the details of exactly how and where the pool is allocated
are beyond the scope of this code - architectures already have several
ways to control that and make their own decisions.



[snip]


+static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
+    struct device *dev)
+{
+    struct io_tlb_mem *mem = rmem->priv;
+    int ret;
+
+    if (dev->dma_io_tlb_mem)
+    return -EBUSY;
+
+    if (!mem) {
+    mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+    if (!mem)
+    return -ENOMEM;
+
+    if (!memremap(rmem->base, rmem->size, MEMREMAP_WB)) {


MEMREMAP_WB sounds appropriate as a default.


As per the binding 'no-map' has to be disabled here. So AFAIU, this
memory will
be part of the linear mapping. Is this really needed then?


More than that, I'd assume that we *have* to use the linear/direct map
address rather than anything that has any possibility of being a vmalloc
remap, otherwise we can no longer safely rely on
phys_to_dma/dma_to_phys, no?


I believe you are right, which means that if we want to make use of the
restricted DMA pool on a 32-bit architecture (and we do, at least, I do)
we should probably add some error checking/warning to ensure the
restricted DMA pool falls within the linear map.


Oh, good point - I'm so used to 64-bit that I instinctively just blanked 
out the !PageHighMem() condition in try_ram_remap(). So maybe the 
original intent here *was* to effectively just implement that check, but 
if so it could still do with being a lot more explicit.


Cheers,
Robin.
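
For reference, a more explicit form of that check might look roughly like 
the below (a sketch of the idea, not a tested patch): reject any pool that 
falls outside the kernel linear mapping, so that phys_to_dma()/dma_to_phys() 
remain valid for it.

	/* Sketch: restricted pools must sit in lowmem / the linear map. */
	if (PageHighMem(pfn_to_page(PHYS_PFN(rmem->base)))) {
		dev_err(dev, "restricted DMA pool must be within the linear mapping\n");
		return -EINVAL;
	}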



Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Robin Murphy

On 2021-01-13 13:59, Nicolas Saenz Julienne wrote:

Hi All,

On Tue, 2021-01-12 at 16:03 -0800, Florian Fainelli wrote:

On 1/5/21 7:41 PM, Claire Chang wrote:

Add the initialization function to create restricted DMA pools from
matching reserved-memory nodes in the device tree.

Signed-off-by: Claire Chang 
---
  include/linux/device.h  |   4 ++
  include/linux/swiotlb.h |   7 +-
  kernel/dma/Kconfig  |   1 +
  kernel/dma/swiotlb.c| 144 ++--
  4 files changed, 131 insertions(+), 25 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 89bb8b84173e..ca6f71ec8871 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -413,6 +413,7 @@ struct dev_links_info {
   * @dma_pools:Dma pools (if dma'ble device).
   * @dma_mem:  Internal for coherent mem override.
   * @cma_area: Contiguous memory area for dma allocations
+ * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
   * @archdata: For arch-specific additions.
   * @of_node:  Associated device tree node.
   * @fwnode:   Associated device node supplied by platform firmware.
@@ -515,6 +516,9 @@ struct device {
  #ifdef CONFIG_DMA_CMA
    struct cma *cma_area;   /* contiguous memory area for dma
   allocations */
+#endif
+#ifdef CONFIG_SWIOTLB
+   struct io_tlb_mem   *dma_io_tlb_mem;
  #endif
    /* arch specific additions */
    struct dev_archdata archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index dd8eb57cbb8f..a1bbd775 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -76,12 +76,13 @@ extern enum swiotlb_force swiotlb_force;
   *
   * @start:The start address of the swiotlb memory pool. Used to do a quick
   *range check to see if the memory was in fact allocated by this
- * API.
+ * API. For restricted DMA pool, this is device tree adjustable.


Maybe write it as this is "firmware adjustable" such that when/if ACPI
needs something like this, the description does not need updating.


TBH I really don't think this needs calling out at all. Even in the 
regular case, the details of exactly how and where the pool is allocated 
are beyond the scope of this code - architectures already have several 
ways to control that and make their own decisions.




[snip]


+static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   struct io_tlb_mem *mem = rmem->priv;
+   int ret;
+
+   if (dev->dma_io_tlb_mem)
+   return -EBUSY;
+
+   if (!mem) {
+   mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+   if (!mem)
+   return -ENOMEM;
+
+   if (!memremap(rmem->base, rmem->size, MEMREMAP_WB)) {


MEMREMAP_WB sounds appropriate as a default.


As per the binding 'no-map' has to be disabled here. So AFAIU, this memory will
be part of the linear mapping. Is this really needed then?


More than that, I'd assume that we *have* to use the linear/direct map 
address rather than anything that has any possibility of being a vmalloc 
remap, otherwise we can no longer safely rely on 
phys_to_dma/dma_to_phys, no?


That said, given that we're not actually using the returned address, I'm 
not entirely sure what the point of this call is anyway. If we can 
assume it's always going to return the linear map address via 
try_ram_remap() then we can equally just go ahead and use the linear map 
address straight away. I don't really see how we could ever hit the 
"is_ram == REGION_MIXED" case in memremap() that would return NULL, if 
we passed the memblock check earlier in __reserved_mem_alloc_size() such 
that this rmem node ever got to be initialised at all.


Robin.


Documentation/devicetree/bindings/reserved-memory/ramoops.txt does
define an "unbuffered" property which in premise could be applied to the
generic reserved memory binding as well and that we may have to be
honoring here, if we were to make it more generic. Oh well, this does
not need to be addressed right now I guess.








Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-12 Thread Robin Murphy

On 2021-01-07 17:57, Konrad Rzeszutek Wilk wrote:

On Fri, Jan 08, 2021 at 01:39:18AM +0800, Claire Chang wrote:

Hi Greg and Konrad,

This change is intended to be non-arch specific. Any arch that lacks DMA access
control and has devices not behind an IOMMU can make use of it. Could you share
why you think this should be arch specific?


The idea behind non-arch specific code is it to be generic. The devicetree
is specific to PowerPC, Sparc, and ARM, and not to x86 - hence it should
be in arch specific code.


Sorry, but that's an absurd argument. By the same token you'd equally 
have to claim that bits of, say, the Broadcom WiFi driver (not to 
mention dozens of others) should be split out into arch code, since not 
all platforms use the devicetree parts, nor the ACPI parts, nor the PCI 
parts...


There is nothing architecture-specific about using devicetree as a 
system description - AFAIK there *are* a handful of x86 platforms that 
use it, besides even more architectures than you've listed above. It has 
long been the policy that devicetree-related code for a particular 
subsystem should just live within that subsystem. Sometimes if there's 
enough of it it gets collected together into its own file - e.g. 
drivers/pci/of.c - otherwise it tends to just get #ifdef'ed - e.g. 
of_spi_parse_dt(), or the other DMA reserved-memory consumers that 
already exist as Florian points out.


Besides, there are far more platforms that enable CONFIG_OF than enable 
CONFIG_SWIOTLB, so by that metric the whole of the SWIOTLB code itself 
is even less "generic" than any DT parsing :P


Robin.



Re: [Xen-devel] [PATCH 00/11] of: Fix DMA configuration for non-DT masters

2019-09-26 Thread Robin Murphy

On 2019-09-26 11:44 am, Nicolas Saenz Julienne wrote:

Robin, have you looked into supporting multiple dma-ranges? It's the
next thing
we need for BCM STB's PCIe. I'll have a go at it myself if nothing is in
the
works already.


Multiple dma-ranges as far as configuring inbound windows should work
already other than the bug when there's any parent translation. But if
you mean supporting multiple DMA offsets and masks per device in the
DMA API, there's nothing in the works yet.


Sorry, I meant supporting multiple DMA offsets[1]. I think I could still make
it with a single DMA mask though.


The main problem for supporting that case in general is the disgusting 
carving up of the physical memory map you may have to do to guarantee 
that a single buffer allocation cannot ever span two windows with 
different offsets. I don't think we ever reached a conclusion on whether 
that was even achievable in practice.



There's also the in-between step of making of_dma_get_range() return a
size based on all the dma-ranges entries rather than only the first one
- otherwise, something like [1] can lead to pretty unworkable default
masks. We implemented that when doing acpi_dma_get_range(), it's just
that the OF counterpart never caught up.


Right. I suppose we assume any holes in the ranges are addressable by
the device but won't get used for other reasons (such as no memory
there). However, to be correct, the range of the dma offset plus mask
would need to be within the min start and max end addresses. IOW,
while we need to round up (0xa_8000_ - 0x2c1c_) to the next
power of 2, the 'correct' thing to do is round down.


IIUC I also have this issue on my list. The RPi4 PCIe block has an integration
bug that only allows DMA to the lower 3GB. With dma-ranges of size 0xc000_
you get a 32bit DMA mask which is not what you need. So far I faked it in the
device-tree but I guess it would be better to add an extra check in
of_dma_configure(), decrease the mask and print some kind of warning stating
that DMA addressing is suboptimal.


Yeah, there's just no way for masks to describe that the device can 
drive all the individual bits, just not in certain combinations :(


The plan I have sketched out there is to merge dma_pfn_offset and 
bus_dma_mask into a "DMA range" descriptor, so we can then hang one or 
more of those off a device to properly cope with all these weird 
interconnects. Conceptually it feels pretty straightforward; I think 
most of the challenge is in implementing it efficiently. Plus there's 
the question of whether it could also subsume the dma_mask as well.


Robin.
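
For illustration, the descriptor sketched above might take a shape along 
these lines (the struct name and exact field set here are only an example of 
the concept, not a committed interface):

	/* Sketch: one entry per inbound window, with an array of these hung
	 * off the device instead of a single global offset and mask. */
	struct dma_range {
		phys_addr_t	cpu_start;	/* CPU physical base of the window */
		dma_addr_t	dma_start;	/* corresponding device-visible base */
		u64		size;		/* window length; a zero size terminates */
	};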

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [RFC PATCH] xen/gntdev: Stop abusing DT of_dma_configure API

2019-09-26 Thread Robin Murphy

On 2019-09-26 11:17 am, Oleksandr Andrushchenko wrote:


On 9/26/19 12:49 PM, Julien Grall wrote:

Hi Rob,


On 9/25/19 10:50 PM, Rob Herring wrote:

As the comment says, this isn't a DT based device. of_dma_configure()
is going to stop allowing a NULL DT node, so this needs to be fixed.


And this can't work on arch not selecting CONFIG_OF and can select
CONFIG_XEN_GRANT_DMA_ALLOC.

We are lucky enough on x86 because, AFAICT, arch_setup_dma_ops is just
a nop.


No luck is needed as [1] does nothing for those platforms not using
CONFIG_OF


Not sure exactly what setup besides arch_setup_dma_ops is needed...


We probably want to update dma_mask, coherent_dma_mask and
dma_pfn_offset.

Also, while look at of_configure_dma, I noticed that we consider the
DMA will not be coherent for the grant-table. Oleksandr, do you know
why they can't be coherent?

The main and the only reason to use of_configure_dma is that if we don't
then we
are about to stay with dma_dummy_ops [2]. It effectively means that
operations on dma-bufs
will end up returning errors, like [3], [4], thus not making it possible
for Xen PV DRM and DMA
part of gntdev driver to do what we need (dma-bufs in our use-cases
allow zero-copying
while using graphics buffers and many more).

I didn't find any better way of achieving that, but of_configure_dma...
If there is any better solution which will not break the existing
functionality then
I will definitely change the drivers so we do not abuse DT )
Before that, please keep in mind that merging this RFC will break Xen PV
DRM +
DMA buf support in gntdev...
Hope we can work out some acceptable solution, so everyone is happy


As I mentioned elsewhere, the recent dma-direct rework means that 
dma_dummy_ops are now only explicitly installed for the ACPI error case, 
so - much as I may dislike it - you should get regular (direct/SWIOTLB) 
ops by default again.


Coherency is trickier - if the guest is allocating buffers for the PV 
device, which may be shared directly with hardware by the host driver, 
then the coherency of the PV device should really reflect that of the 
underlying hardware to avoid potential problems. There are some cases 
where the stage 2 attributes alone wouldn't be enough to correct a mismatch.


Robin.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 00/11] of: Fix DMA configuration for non-DT masters

2019-09-25 Thread Robin Murphy

On 25/09/2019 17:16, Rob Herring wrote:

On Wed, Sep 25, 2019 at 10:30 AM Nicolas Saenz Julienne
 wrote:


On Wed, 2019-09-25 at 16:09 +0100, Robin Murphy wrote:

On 25/09/2019 15:52, Nicolas Saenz Julienne wrote:

On Tue, 2019-09-24 at 16:59 -0500, Rob Herring wrote:

On Tue, Sep 24, 2019 at 1:12 PM Nicolas Saenz Julienne
 wrote:

Hi All,
this series tries to address one of the issues blocking us from
upstreaming Broadcom's STB PCIe controller[1]. Namely, the fact that
devices not represented in DT which sit behind a PCI bus fail to get the
bus' DMA addressing constraints.

This is due to the fact that of_dma_configure() assumes it's receiving a
DT node representing the device being configured, as opposed to the PCIe
bridge node we currently pass. This causes the code to directly jump
into PCI's parent node when checking for 'dma-ranges' and misses
whatever was set there.

To address this I create a new API in OF - inspired from Robin Murphys
original proposal[2] - which accepts a bus DT node as it's input in
order to configure a device's DMA constraints. The changes go deep into
of/address.c's implementation, as a device being having a DT node
assumption was pretty strong.

On top of this work, I also cleaned up of_dma_configure() removing its
redundant arguments and creating an alternative function for the special
cases
not applicable to either the above case or the default usage.

IMO the resulting functions are more explicit. They will probably
surface some hacky usages that can be properly fixed as I show with the
DT fixes on the Layerscape platform.

This was also tested on a Raspberry Pi 4 with a custom PCIe driver and
on a Seattle AMD board.


Humm, I've been working on this issue too. Looks similar though yours
has a lot more churn and there's some other bugs I've found.


That's good news, and yes now that I see it, some stuff on my series is
overly
complicated. Specially around of_translate_*().

On top of that, you removed in of_dma_get_range():

-   /*
-* At least empty ranges has to be defined for parent node if
-* DMA is supported
-*/
-   if (!ranges)
-   break;

Which I assumed was bound to the standard and makes things easier.


Can you test out this branch[1]. I don't have any h/w needing this,
but wrote a unittest and tested with modified QEMU.


I reviewed everything, I did find a minor issue, see the patch attached.


WRT that patch, the original intent of "force_dma" was purely to
consider a device DMA-capable regardless of the presence of
"dma-ranges". Expecting of_dma_configure() to do anything for a non-OF
device has always been bogus - magic paravirt devices which appear out
of nowhere and expect to be treated as genuine DMA masters are a
separate problem that we haven't really approached yet.


I agree it's clearly abusing the function. I have no problem with the behaviour
change if it's OK with you.


Thinking about it, you could probably just remove that call from the Xen 
DRM driver now anyway - since the dma-direct rework, we lost the ability 
to set dma_dummy_ops by default, and NULL ops now represent what it 
(presumably) wants.



Robin, have you looked into supporting multiple dma-ranges? It's the next thing
we need for BCM STB's PCIe. I'll have a go at it myself if nothing is in the
works already.


Multiple dma-ranges as far as configuring inbound windows should work
already other than the bug when there's any parent translation. But if
you mean supporting multiple DMA offsets and masks per device in the
DMA API, there's nothing in the works yet.


There's also the in-between step of making of_dma_get_range() return a 
size based on all the dma-ranges entries rather than only the first one 
- otherwise, something like [1] can lead to pretty unworkable default 
masks. We implemented that when doing acpi_dma_get_range(), it's just 
that the OF counterpart never caught up.


Robin.

[1] 
http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a2814af56b3486c2985a95540a88d8f9fa3a699f
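
To illustrate the point about deriving the default mask from all of the 
entries rather than just the first, the intended calculation is roughly the 
following (a fragment sketch with made-up variable names; the real 
of_dma_get_range() parsing is more involved):

	/* Sketch: accumulate the overall DMA-addressable span across every
	 * dma-ranges entry instead of stopping at the first one. */
	u64 dma_start = ~0ULL, dma_end = 0;
	int i;

	for (i = 0; i < nr_ranges; i++) {
		dma_start = min(dma_start, range[i].dma_start);
		dma_end   = max(dma_end, range[i].dma_start + range[i].size);
	}

	*dma_addr = dma_start;		/* lowest device-visible address */
	*size = dma_end - dma_start;	/* spans any holes in between */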


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 00/11] of: Fix DMA configuration for non-DT masters

2019-09-25 Thread Robin Murphy

On 25/09/2019 15:52, Nicolas Saenz Julienne wrote:

On Tue, 2019-09-24 at 16:59 -0500, Rob Herring wrote:

On Tue, Sep 24, 2019 at 1:12 PM Nicolas Saenz Julienne
 wrote:

Hi All,
this series tries to address one of the issues blocking us from
upstreaming Broadcom's STB PCIe controller[1]. Namely, the fact that
devices not represented in DT which sit behind a PCI bus fail to get the
bus' DMA addressing constraints.

This is due to the fact that of_dma_configure() assumes it's receiving a
DT node representing the device being configured, as opposed to the PCIe
bridge node we currently pass. This causes the code to directly jump
into PCI's parent node when checking for 'dma-ranges' and misses
whatever was set there.

To address this I create a new API in OF - inspired from Robin Murphys
original proposal[2] - which accepts a bus DT node as its input in
order to configure a device's DMA constraints. The changes go deep into
of/address.c's implementation, as the assumption that a device has a DT
node was pretty strong.

On top of this work, I also cleaned up of_dma_configure() removing its
redundant arguments and creating an alternative function for the special
cases
not applicable to either the above case or the default usage.

IMO the resulting functions are more explicit. They will probably
surface some hacky usages that can be properly fixed as I show with the
DT fixes on the Layerscape platform.

This was also tested on a Raspberry Pi 4 with a custom PCIe driver and
on a Seattle AMD board.


Humm, I've been working on this issue too. Looks similar though yours
has a lot more churn and there's some other bugs I've found.


That's good news, and yes now that I see it, some stuff on my series is overly
complicated. Specially around of_translate_*().

On top of that, you removed in of_dma_get_range():

-   /*
-* At least empty ranges has to be defined for parent node if
-* DMA is supported
-*/
-   if (!ranges)
-   break;

Which I assumed was bound to the standard and makes things easier.


Can you test out this branch[1]. I don't have any h/w needing this,
but wrote a unittest and tested with modified QEMU.


I reviewed everything, I did find a minor issue, see the patch attached.


WRT that patch, the original intent of "force_dma" was purely to 
consider a device DMA-capable regardless of the presence of 
"dma-ranges". Expecting of_dma_configure() to do anything for a non-OF 
device has always been bogus - magic paravirt devices which appear out 
of nowhere and expect to be treated as genuine DMA masters are a 
separate problem that we haven't really approached yet.


Robin.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] arm: xen: mm: use __GPF_DMA32 for arm64

2019-08-27 Thread Robin Murphy

On 09/07/2019 09:22, Peng Fan wrote:

arm64 shares some code under arch/arm/xen, including mm.c.
However ZONE_DMA is removed by commit
ad67f5a6545("arm64: replace ZONE_DMA with ZONE_DMA32").
So to ARM64, need use __GFP_DMA32.

Signed-off-by: Peng Fan 
---
  arch/arm/xen/mm.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
index e1d44b903dfc..a95e76d18bf9 100644
--- a/arch/arm/xen/mm.c
+++ b/arch/arm/xen/mm.c
@@ -27,7 +27,7 @@ unsigned long xen_get_swiotlb_free_pages(unsigned int order)
  
  	for_each_memblock(memory, reg) {
 		if (reg->base < (phys_addr_t)0xffffffff) {
-   flags |= __GFP_DMA;
+   flags |= __GFP_DMA | __GFP_DMA32;


Given the definition of GFP_ZONE_BAD, I'm not sure this combination of 
flags is strictly valid, but rather is implicitly reliant on only one of 
those zones ever actually existing. As such, it seems liable to blow up 
if the plans to add ZONE_DMA to arm64[1] go ahead.


Robin.

[1] 
https://lore.kernel.org/linux-arm-kernel/20190820145821.27214-1-nsaenzjulie...@suse.de/



break;
}
}
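
To illustrate the concern, a combination-safe alternative would be to 
request exactly one zone flag depending on which DMA zone the kernel was 
actually built with - a sketch only, not a tested fix:

	/* Sketch: pick a single zone flag rather than OR-ing both together,
	 * which GFP_ZONE_BAD may treat as an invalid combination. */
	if (IS_ENABLED(CONFIG_ZONE_DMA))
		flags |= __GFP_DMA;
	else if (IS_ENABLED(CONFIG_ZONE_DMA32))
		flags |= __GFP_DMA32;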



___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH 03/11] xen/arm: pass one less argument to dma_cache_maint

2019-08-16 Thread Robin Murphy

On 16/08/2019 14:00, Christoph Hellwig wrote:

Instead of taking apart the dma address in both callers do it inside
dma_cache_maint itself.

Signed-off-by: Christoph Hellwig 
---
  arch/arm/xen/mm.c | 10 ++
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
index 90574d89d0d4..d9da24fda2f7 100644
--- a/arch/arm/xen/mm.c
+++ b/arch/arm/xen/mm.c
@@ -43,13 +43,15 @@ static bool hypercall_cflush = false;
  
  /* functions called by SWIOTLB */
  
-static void dma_cache_maint(dma_addr_t handle, unsigned long offset,
-	size_t size, enum dma_data_direction dir, enum dma_cache_op op)
+static void dma_cache_maint(dma_addr_t handle, size_t size,
+   enum dma_data_direction dir, enum dma_cache_op op)
  {
struct gnttab_cache_flush cflush;
unsigned long xen_pfn;
+   unsigned long offset = handle & ~PAGE_MASK;
size_t left = size;
  
+	offset &= PAGE_MASK;


Ahem... presumably that should be handle, not offset.

Robin.


xen_pfn = (handle >> XEN_PAGE_SHIFT) + offset / XEN_PAGE_SIZE;
offset %= XEN_PAGE_SIZE;
  
@@ -86,13 +88,13 @@ static void dma_cache_maint(dma_addr_t handle, unsigned long offset,

  static void __xen_dma_page_dev_to_cpu(struct device *hwdev, dma_addr_t handle,
size_t size, enum dma_data_direction dir)
  {
-   dma_cache_maint(handle & PAGE_MASK, handle & ~PAGE_MASK, size, dir, 
DMA_UNMAP);
+   dma_cache_maint(handle, size, dir, DMA_UNMAP);
  }
  
  static void __xen_dma_page_cpu_to_dev(struct device *hwdev, dma_addr_t handle,

size_t size, enum dma_data_direction dir)
  {
-   dma_cache_maint(handle & PAGE_MASK, handle & ~PAGE_MASK, size, dir, 
DMA_MAP);
+   dma_cache_maint(handle, size, dir, DMA_MAP);
  }
  
  void __xen_dma_map_page(struct device *hwdev, struct page *page,




___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH RFC 3/4] Arm64: further speed-up to hweight{32, 64}()

2019-06-04 Thread Robin Murphy

On 04/06/2019 17:11, Julien Grall wrote:

Hi Jan,

On 5/31/19 10:53 AM, Jan Beulich wrote:

According to Linux commit e75bef2a4f ("arm64: Select
ARCH_HAS_FAST_MULTIPLIER") this is a further improvement over the
variant using only bitwise operations on at least some hardware, and no
worse on other.

Suggested-by: Andrew Cooper 
Signed-off-by: Jan Beulich 
---
RFC: To be honest I'm not fully convinced this is a win in particular in
  the hweight32() case, as there's no actual shift insn which gets
  replaced by the multiplication. Even for hweight64() the compiler
  could emit better code and avoid the explicit shift by 32 (which it
  emits at least for me).


I can see multiplication instruction used in both hweight32() and 
hweight64() with the compiler I am using.


I would expect the compiler could easily replace a multiply by a series 
of shift but it would be more difficult to do the invert.


Also, this has been in Linux for a year now, so I am assuming Linux 
folks are happy with changes (CCing Robin just in case I missed 
anything). Therefore I am happy to give it a go on Xen as well.


IIRC it did look like Linux's hweight() routines could possibly be 
further tuned for the A64 ISA to shave off one or two more instructions, 
but it's yet to be proven that performance is anywhere near critical 
enough to justify maintaining arch-specific versions. It costs us 
basically nothing to switch between the two generic implementations 
though, so since the smaller-and-no-slower code can be considered a net 
win (however minor), there seemed no reason *not* to apply the existing 
option appropriately.


Robin.
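
For anyone following along, the "fast multiplier" variant under discussion 
is the standard population-count reduction in which the final horizontal add 
is replaced by a single multiply - in plain C it looks roughly like this (a 
generic illustration of the algorithm, not the Xen or Linux source):

	#include <stdint.h>

	/* Each byte of w * 0x01010101 accumulates the byte sums below it,
	 * so the top byte ends up holding the total population count. */
	static inline unsigned int hweight32_mul(uint32_t w)
	{
		w -= (w >> 1) & 0x55555555;			/* 2-bit sums */
		w  = (w & 0x33333333) + ((w >> 2) & 0x33333333);/* 4-bit sums */
		w  = (w + (w >> 4)) & 0x0f0f0f0f;		/* 8-bit sums */
		return (w * 0x01010101) >> 24;			/* add bytes via multiply */
	}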



Cheers,



--- a/xen/arch/arm/Kconfig
+++ b/xen/arch/arm/Kconfig
@@ -12,6 +12,7 @@ config ARM_32
  config ARM_64
  def_bool y
  depends on 64BIT
+    select HAS_FAST_MULTIPLY
  config ARM
  def_bool y






___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH] Revert "swiotlb: remove SWIOTLB_MAP_ERROR"

2019-03-04 Thread Robin Murphy

Hi Arnd,

On 2019-03-04 7:59 pm, Arnd Bergmann wrote:

This reverts commit b907e20508d0 ("swiotlb: remove SWIOTLB_MAP_ERROR"), which
introduced an overflow warning in configurations that have a larger
dma_addr_t than phys_addr_t:

In file included from include/linux/dma-direct.h:5,
  from kernel/dma/swiotlb.c:23:
kernel/dma/swiotlb.c: In function 'swiotlb_tbl_map_single':
include/linux/dma-mapping.h:136:28: error: conversion from 'long long unsigned 
int' to 'phys_addr_t' {aka 'unsigned int'} changes value from 
'18446744073709551615' to '4294967295' [-Werror=overflow]
  #define DMA_MAPPING_ERROR  (~(dma_addr_t)0)
 ^
kernel/dma/swiotlb.c:544:9: note: in expansion of macro 'DMA_MAPPING_ERROR'
   return DMA_MAPPING_ERROR;

The configuration that caused this is on 32-bit ARM, where the DMA address
space depends on the enabled hardware platforms, while the physical
address space depends on the type of MMU chosen (classic vs LPAE).


Are these real platforms, or random configs? Realistically I don't see a 
great deal of need to support DMA_ADDR_T_64BIT for non-LPAE. 
Particularly in this case since AFAIK the only selector of SWIOTLB on 
Arm is Xen, and that by definition is never going to be useful on 
non-LPAE hardware.


Fair enough that we still don't want even randconfigs generating 
warnings, though. As long as this change doesn't let SWIOTLB_MAP_ERROR 
leak out to logic expecting DMA_MAPPING_ERROR - which does appear to be the 
case - and is also still OK for the opposite weirdness of 
PHYS_ADDR_T_64BIT && !DMA_ADDR_T_64BIT, then I think it's reasonable.


Robin.
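
The warning itself is easy to reproduce outside the kernel: with a 32-bit 
phys_addr_t and a 64-bit dma_addr_t, assigning the all-ones dma_addr_t 
sentinel to a phys_addr_t silently truncates it. A standalone illustration 
(the typedefs stand in for the Kconfig-dependent kernel types):

	#include <stdint.h>
	#include <stdio.h>

	typedef uint32_t phys_addr_t;	/* classic 32-bit Arm, no LPAE */
	typedef uint64_t dma_addr_t;	/* ARCH_DMA_ADDR_T_64BIT */

	#define DMA_MAPPING_ERROR	(~(dma_addr_t)0)

	int main(void)
	{
		/* This assignment is the conversion GCC warns about: the
		 * 64-bit sentinel cannot survive a trip through a 32-bit
		 * phys_addr_t return value. */
		phys_addr_t ret = DMA_MAPPING_ERROR;

		printf("%#llx truncated to %#x\n",
		       (unsigned long long)DMA_MAPPING_ERROR, ret);
		return 0;
	}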


I tried a couple of alternative approaches, but the previous version
seems as good as any other, so I went back to that.

Fixes: b907e20508d0 ("swiotlb: remove SWIOTLB_MAP_ERROR")
Signed-off-by: Arnd Bergmann 
---
  drivers/xen/swiotlb-xen.c | 4 ++--
  include/linux/swiotlb.h   | 3 +++
  kernel/dma/swiotlb.c  | 4 ++--
  3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 877baf2a94f4..57a98279bf4f 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -406,7 +406,7 @@ static dma_addr_t xen_swiotlb_map_page(struct device *dev, 
struct page *page,
  
  	map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir,

 attrs);
-   if (map == DMA_MAPPING_ERROR)
+   if (map == SWIOTLB_MAP_ERROR)
return DMA_MAPPING_ERROR;
  
  	dev_addr = xen_phys_to_bus(map);

@@ -557,7 +557,7 @@ xen_swiotlb_map_sg_attrs(struct device *hwdev, struct 
scatterlist *sgl,
 sg_phys(sg),
 sg->length,
 dir, attrs);
-   if (map == DMA_MAPPING_ERROR) {
+   if (map == SWIOTLB_MAP_ERROR) {
dev_warn(hwdev, "swiotlb buffer is full\n");
/* Don't panic here, we expect map_sg users
   to do proper error handling. */
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 361f62bb4a8e..a65a36551f58 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -44,6 +44,9 @@ enum dma_sync_target {
SYNC_FOR_DEVICE = 1,
  };
  
+/* define the last possible byte of physical address space as a mapping error */

+#define SWIOTLB_MAP_ERROR (~(phys_addr_t)0x0)
+
  extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
  dma_addr_t tbl_dma_addr,
  phys_addr_t phys, size_t size,
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 12059b78b631..922880b84387 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -541,7 +541,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
 	spin_unlock_irqrestore(&io_tlb_lock, flags);
if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit())
dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes)\n", 
size);
-   return DMA_MAPPING_ERROR;
+   return SWIOTLB_MAP_ERROR;
  found:
io_tlb_used += nslots;
 	spin_unlock_irqrestore(&io_tlb_lock, flags);
@@ -659,7 +659,7 @@ bool swiotlb_map(struct device *dev, phys_addr_t *phys, 
dma_addr_t *dma_addr,
/* Oh well, have to allocate and map a bounce buffer. */
*phys = swiotlb_tbl_map_single(dev, __phys_to_dma(dev, io_tlb_start),
*phys, size, dir, attrs);
-   if (*phys == DMA_MAPPING_ERROR)
+   if (*phys == SWIOTLB_MAP_ERROR)
return false;
  
  	/* Ensure that the address returned is DMA'ble */




___
Xen-devel mailing list
Xen-devel@lists.xenproject.org

Re: [Xen-devel] [PATCH v3 1/9] mm: Introduce new vm_insert_range API

2018-12-07 Thread Robin Murphy

On 2018-12-07 7:28 pm, Souptick Joarder wrote:

On Fri, Dec 7, 2018 at 10:41 PM Matthew Wilcox  wrote:


On Fri, Dec 07, 2018 at 03:34:56PM +, Robin Murphy wrote:

+int vm_insert_range(struct vm_area_struct *vma, unsigned long addr,
+   struct page **pages, unsigned long page_count)
+{
+   unsigned long uaddr = addr;
+   int ret = 0, i;


Some of the sites being replaced were effectively ensuring that vma and
pages were mutually compatible as an initial condition - would it be worth
adding something here for robustness, e.g.:

+ if (page_count != vma_pages(vma))
+ return -ENXIO;


I think we want to allow this to be used to populate part of a VMA.
So perhaps:

 if (page_count > vma_pages(vma))
 return -ENXIO;


Ok, This can be added.

I think Patch [2/9] is the only leftover place where this
check could be removed.


Right, 9/9 could also have relied on my stricter check here, but since 
it's really testing whether it actually managed to allocate vma_pages() 
worth of pages earlier, Matthew's more lenient version won't help for 
that one. (Why privcmd_buf_mmap() doesn't clean up and return an error 
as soon as that allocation loop fails, without taking the mutex under 
which it still does a bunch more pointless work to only undo it again, 
is a mind-boggling mystery, but that's not our problem here...)


Robin.

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 1/9] mm: Introduce new vm_insert_range API

2018-12-07 Thread Robin Murphy

On 06/12/2018 18:39, Souptick Joarder wrote:

Previouly drivers have their own way of mapping range of
kernel pages/memory into user vma and this was done by
invoking vm_insert_page() within a loop.

As this pattern is common across different drivers, it can
be generalized by creating a new function and use it across
the drivers.

vm_insert_range is the new API which will be used to map a
range of kernel memory/pages to user vma.

This API is tested by Heiko for Rockchip drm driver, on rk3188,
rk3288, rk3328 and rk3399 with graphics.

Signed-off-by: Souptick Joarder 
Reviewed-by: Matthew Wilcox 
Reviewed-by: Mike Rapoport 
Tested-by: Heiko Stuebner 
---
  include/linux/mm.h |  2 ++
  mm/memory.c| 38 ++
  mm/nommu.c |  7 +++
  3 files changed, 47 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fcf9cc9..2bc399f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2506,6 +2506,8 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
  int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page 
*);
+int vm_insert_range(struct vm_area_struct *vma, unsigned long addr,
+   struct page **pages, unsigned long page_count);
  vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
  vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 15c417e..84ea46c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1478,6 +1478,44 @@ static int insert_page(struct vm_area_struct *vma, 
unsigned long addr,
  }
  
  /**

+ * vm_insert_range - insert range of kernel pages into user vma
+ * @vma: user vma to map to
+ * @addr: target user address of this page
+ * @pages: pointer to array of source kernel pages
+ * @page_count: number of pages need to insert into user vma
+ *
+ * This allows drivers to insert range of kernel pages they've allocated
+ * into a user vma. This is a generic function which drivers can use
+ * rather than using their own way of mapping range of kernel pages into
+ * user vma.
+ *
+ * If we fail to insert any page into the vma, the function will return
+ * immediately leaving any previously-inserted pages present.  Callers
+ * from the mmap handler may immediately return the error as their caller
+ * will destroy the vma, removing any successfully-inserted pages. Other
+ * callers should make their own arrangements for calling unmap_region().
+ *
+ * Context: Process context. Called by mmap handlers.
+ * Return: 0 on success and error code otherwise
+ */
+int vm_insert_range(struct vm_area_struct *vma, unsigned long addr,
+   struct page **pages, unsigned long page_count)
+{
+   unsigned long uaddr = addr;
+   int ret = 0, i;


Some of the sites being replaced were effectively ensuring that vma and 
pages were mutually compatible as an initial condition - would it be 
worth adding something here for robustness, e.g.:


+   if (page_count != vma_pages(vma))
+   return -ENXIO;

?

(then you could also clean up a couple more places where you're not 
already removing such checks)


Robin.


+
+   for (i = 0; i < page_count; i++) {
+   ret = vm_insert_page(vma, uaddr, pages[i]);
+   if (ret < 0)
+   return ret;
+   uaddr += PAGE_SIZE;
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL(vm_insert_range);
+
+/**
   * vm_insert_page - insert single page into user vma
   * @vma: user vma to map to
   * @addr: target user address of this page
diff --git a/mm/nommu.c b/mm/nommu.c
index 749276b..d6ef5c7 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -473,6 +473,13 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
  }
  EXPORT_SYMBOL(vm_insert_page);
  
+int vm_insert_range(struct vm_area_struct *vma, unsigned long addr,
+   struct page **pages, unsigned long page_count)
+{
+   return -EINVAL;
+}
+EXPORT_SYMBOL(vm_insert_range);
+
  /*
   *  sys_brk() for the most part doesn't need the global kernel
   *  lock, except when an application is doing something nasty
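
For illustration, a rough sketch of how a driver's mmap handler could use the
proposed API together with the vma_pages() check suggested above; the foo_*
names and structure are made up for the example, not taken from the patch:

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical driver state: an array of pre-allocated kernel pages. */
struct foo_dev {
	struct page **pages;
	unsigned long nr_pages;
};

static int foo_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct foo_dev *fdev = file->private_data;

	/* Robustness check as suggested above: the vma must cover exactly
	 * the number of pages the driver has to offer. */
	if (vma_pages(vma) != fdev->nr_pages)
		return -ENXIO;

	/* One call replaces the old open-coded vm_insert_page() loop; on
	 * failure, mmap's caller tears down the vma and with it any pages
	 * that were already inserted. */
	return vm_insert_range(vma, vma->vm_start, fdev->pages,
			       fdev->nr_pages);
}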




Re: [Xen-devel] [PATCH 01/23] dma-mapping: provide a generic DMA_MAPPING_ERROR

2018-12-04 Thread Robin Murphy

On 30/11/2018 13:22, Christoph Hellwig wrote:

Error handling of the dma_map_single and dma_map_page APIs is a little
problematic at the moment, in that we use different encodings in the
returned dma_addr_t to indicate an error.  That means we require an
additional indirect call to figure out if a dma mapping call returned
an error, and a lot of boilerplate code to implement these semantics.

Instead return the maximum addressable value as the error.  As long
as we don't allow mapping single-byte ranges with single-byte alignment,
this value can never be a valid return.  Additionally, if drivers do
not check the return value from the dma_map* routines, this value means
they will generally not be pointed at actual memory.

Once the default value is added here we can start removing the
various mapping_error methods and just rely on this generic check.

Signed-off-by: Christoph Hellwig 
---
  include/linux/dma-mapping.h | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 0f81c713f6e9..46bd612d929e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -133,6 +133,8 @@ struct dma_map_ops {
u64 (*get_required_mask)(struct device *dev);
  };
  
+#define DMA_MAPPING_ERROR		(~(dma_addr_t)0)
+
  extern const struct dma_map_ops dma_direct_ops;
  extern const struct dma_map_ops dma_virt_ops;
  
@@ -576,6 +578,10 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
 	const struct dma_map_ops *ops = get_dma_ops(dev);
 
 	debug_dma_mapping_error(dev, dma_addr);
+
+	if (dma_addr == DMA_MAPPING_ERROR)
+		return 1;
+
 	if (ops->mapping_error)
 		return ops->mapping_error(dev, dma_addr);
 	return 0;


I'd have been inclined to put the default check here, i.e.

-   return 0
+   return dma_addr == DMA_MAPPING_ERROR

such that the callback retains full precedence and we don't have to deal 
with the non-trivial removals immediately if it comes to it. Not that it 
makes a vast difference though, so either way,


Reviewed-by: Robin Murphy 
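
Spelled out, the fallback-last variant suggested above would look roughly
like this (a sketch of the inline helper only, not the form the patch as
posted applies):

static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	debug_dma_mapping_error(dev, dma_addr);

	/* The per-ops callback keeps full precedence; the generic
	 * DMA_MAPPING_ERROR comparison only decides the default case. */
	if (ops->mapping_error)
		return ops->mapping_error(dev, dma_addr);

	return dma_addr == DMA_MAPPING_ERROR;
}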


Re: [Xen-devel] remove the ->mapping_error method from dma_map_ops V2

2018-11-22 Thread Robin Murphy

On 22/11/2018 17:09, Linus Torvalds wrote:

On Thu, Nov 22, 2018 at 9:07 AM Russell King - ARM Linux
 wrote:


I'm afraid that won't work very well - 32 bit platforms with 64-bit
addresses (LPAE) would have dma_addr_t as a 64-bit value, which
wouldn't fit into an unsigned long.


Good point. So we'd have to have a special IS_DMA_ERR() function that
takes a dma_addr_t and checks the same "is it the top 4095 values".

Because I'd still prefer to allow people to return the _actual_ error,
and to have "return -Exyz" as the error case. That part still DTRT
even with dma_addr_t.


Unfortunately, with things like the top-down IOVA allocator, and 32-bit 
systems in general, "the top 4095" values may well still be valid 
addresses - we're relying on a 1-byte mapping of the very top byte of 
memory/IOVA space being sufficiently ridiculous that no real code would 
ever do that, but even a 4-byte mapping of the top 4 bytes is within the 
realms of the plausible (I've definitely seen the USB layer make 8-byte 
mappings from any old offset within a page, for example).


Thus we'd have to go with the extra complication of detecting and 
carving out problematic memory maps in those corner cases as Russell 
alluded to, for the sake of better error reporting in places where, in 
the majority of cases, there's not really all that many ways to go 
wrong. Off the top of my head, I guess:


-EINVAL if the address is outside the device's DMA mask (and there is no 
IOMMU or bounce buffer).


-ENOSPC if there is an IOMMU or bounce buffer, but no free IOVA/buffer 
space currently.


-ESOMETHINGWENTWRONGWITHIOMMUPAGETABLES or similar, because it's not 
like the caller can treat that as anything other than an opaque failure 
anyway.


The only immediate benefit I can see is that we could distinguish cases 
like the first which can never possibly succeed, and thus callers could 
actually give up instead of doing what various subsystems currently do 
and retry the exact same mapping indefinitely on the apparent assumption 
that errors must be transient.


Robin.
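
As a purely hypothetical illustration of the encoding being debated, and of
how a caller could then tell a permanent failure from a possibly-transient
one, assuming helpers along the lines Linus describes existed (no such
IS_DMA_ERR()/DMA_ERR() helpers exist in the kernel):

#include <linux/dma-mapping.h>
#include <linux/errno.h>

/* Hypothetical: encode -Exyz in the top 4095 values of dma_addr_t, in the
 * spirit of IS_ERR()/PTR_ERR(). */
#define DMA_MAX_ERRNO	4095

static inline bool IS_DMA_ERR(dma_addr_t addr)
{
	return addr >= (dma_addr_t)-DMA_MAX_ERRNO;
}

static inline int DMA_ERR(dma_addr_t addr)
{
	return -(int)((dma_addr_t)0 - addr);	/* recover the original -Exyz */
}

/* Hypothetical caller: give up on errors that can never succeed, treat the
 * rest as potentially transient instead of retrying blindly. */
static int foo_map(struct device *dev, void *buf, size_t len, dma_addr_t *out)
{
	dma_addr_t addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

	if (IS_DMA_ERR(addr)) {
		int err = DMA_ERR(addr);

		if (err == -EINVAL)	/* outside the DMA mask: give up */
			return err;
		return -EAGAIN;		/* e.g. -ENOSPC: IOVA may free up later */
	}

	*out = addr;
	return 0;
}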
