Re: [mainline] [linux-next] [6.8-rc1] [FC] [DLPAR] OOps kernel crash after performing dlpar remove test

2024-02-02 Thread Robin Murphy
] [c000a878bcb0] [c0685d3c]
kernfs_fop_write_iter+0x1cc/0x280
[  981.124283] [c000a878bd00] [c05909c8] vfs_write+0x358/0x4b0
[  981.124288] [c000a878bdc0] [c0590cfc] ksys_write+0x7c/0x140
[  981.124293] [c000a878be10] [c0036554]
system_call_exception+0x134/0x330
[  981.124298] [c000a878be50] [c000d6a0]
system_call_common+0x160/0x2e4
[  981.124303] --- interrupt: c00 at 0x200013f21594
[  981.124306] NIP:  200013f21594 LR: 200013e97bf4 CTR:

[  981.124309] REGS: c000a878be80 TRAP: 0c00   Not tainted
(6.5.0-rc6-next-20230817-auto)
[  981.124312] MSR:  8280f033
  CR: 22000282  XER: 
[  981.124321] IRQMASK: 0
[  981.124321] GPR00: 0004 73a55c70 200014007300
0007
[  981.124321] GPR04: 00013aff5750 0008 fbad2c80
00013afd02a0
[  981.124321] GPR08: 0001  

[  981.124321] GPR12:  200013b7bc30 

[  981.124321] GPR16:   

[  981.124321] GPR20:   

[  981.124321] GPR24: 00010ef61668  0008
00013aff5750
[  981.124321] GPR28: 0008 00013afd02a0 00013aff5750
0008
[  981.124356] NIP [200013f21594] 0x200013f21594
[  981.124358] LR [200013e97bf4] 0x200013e97bf4
[  981.124361] --- interrupt: c00
[  981.124362] Code: 38427bd0 7c0802a6 6000 7c0802a6 fba1ffe8
fbc1fff0 fbe1fff8 7cbf2b78 38a0 7cdd3378 f8010010 f821ffc1
 4bff95d1 6000 7c7e1b79
[  981.124374] ---[ end trace  ]---


Thanks and Regards

On 1/31/24 16:18, Robin Murphy wrote:

On 2024-01-31 9:19 am, Tasmiya Nalatwad wrote:

Greetings,

[mainline] [linux-next] [6.8-rc1] [DLPAR] OOps kernel crash after 
performing dlpar remove test


--- Traces ---

[58563.146236] BUG: Unable to handle kernel data access at 
0x6b6b6b6b6b6b6b83

[58563.146242] Faulting instruction address: 0xc09c0e60
[58563.146248] Oops: Kernel access of bad area, sig: 11 [#1]
[58563.146252] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
[58563.146258] Modules linked in: isofs cdrom dm_snapshot dm_bufio 
dm_round_robin dm_queue_length exfat vfat fat btrfs blake2b_generic 
xor raid6_pq zstd_compress loop xfs libcrc32c raid0 nvram rpadlpar_io 
rpaphp nfnetlink xsk_diag bonding tls rfkill sunrpc dm_service_time 
dm_multipath dm_mod pseries_rng vmx_crypto binfmt_misc ext4 mbcache 
jbd2 sd_mod sg ibmvscsi scsi_transport_srp ibmveth lpfc nvmet_fc 
nvmet nvme_fc nvme_fabrics nvme_core t10_pi crc64_rocksoft crc64 
scsi_transport_fc fuse
[58563.146326] CPU: 0 PID: 1071247 Comm: drmgr Kdump: loaded Not 
tainted 6.8.0-rc1-auto-gecb1b8288dc7 #1
[58563.146332] Hardware name: IBM,9009-42A POWER9 (raw) 0x4e0202 
0xf05 of:IBM,FW950.A0 (VL950_141) hv:phyp pSeries
[58563.146337] NIP:  c09c0e60 LR: c09c0e28 CTR: 
c09c1584
[58563.146342] REGS: c0007960f260 TRAP: 0380   Not tainted 
(6.8.0-rc1-auto-gecb1b8288dc7)
[58563.146347] MSR:  80009033   CR: 
24822424  XER: 20040006

[58563.146360] CFAR: c09c0e74 IRQMASK: 0
[58563.146360] GPR00: c09c0e28 c0007960f500 
c1482600 c3050540
[58563.146360] GPR04:  c0089a6870c0 
0001 fffe
[58563.146360] GPR08: c2bac020 6b6b6b6b6b6b6b6b 
6b6b6b6b6b6b6b6b 0220
[58563.146360] GPR12: 2000 c308 
 
[58563.146360] GPR16:   
 0001
[58563.146360] GPR20: c1281478  
c1281490 c2bfed80
[58563.146360] GPR24: c0089a6870c0  
 c2b9ffb8
[58563.146360] GPR28:  c2bac0e8 
 

[58563.146421] NIP [c09c0e60] iommu_ops_from_fwnode+0x68/0x118
[58563.146430] LR [c09c0e28] iommu_ops_from_fwnode+0x30/0x118


This implies that iommu_device_list has become corrupted. Looks like 
spapr_tce_setup_phb_iommus_initcall() registers an iommu_device which 
pcibios_free_controller() could free if a PCI controller is removed, 
but there's no path anywhere to ever unregister any of those IOMMUs. 
Presumably this also means that if a PCI controller is dynamically
added after init, its IOMMU won't be set up properly either.
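
A minimal sketch of the missing teardown described above, assuming the registered iommu_device is reachable from the pci_controller being freed (the "iommu" member and its placement here are assumptions for illustration, not actual pseries code):

/* Hypothetical sketch: unregister before the controller memory is freed,
 * so the core's iommu_device_list no longer points into a freed object. */
void pcibios_free_controller(struct pci_controller *phb)
{
	if (phb->iommu.ops)			/* assumed member, see above */
		iommu_device_unregister(&phb->iommu);

	/* ... existing teardown and kfree(phb) ... */
}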


Thanks,
Robin.


[58563.146437] Call Trace:
[58563.146439] [c0007960f500] [c0007960f560] 
0xc0007960f560 (unreliable)
[58563.146446] [c0007960f530] [c09c0fd0] 
__iommu_probe_device+0xc0/0x5c0
[58563.146454] [c0007960f5a0] [c09c151c] 
iommu_probe_device+0x4c/0xb4
[58563.146462] [c0007960f5e0] [c09c15d0] 
iommu_bus_notifier+0x4c/0x8c
[58563.146469

Re: [mainline] [linux-next] [6.8-rc1] [FC] [DLPAR] OOps kernel crash after performing dlpar remove test

2024-01-31 Thread Robin Murphy

On 2024-01-31 9:19 am, Tasmiya Nalatwad wrote:

Greetings,

[mainline] [linux-next] [6.8-rc1] [DLPAR] OOps kernel crash after 
performing dlpar remove test


--- Traces ---

[58563.146236] BUG: Unable to handle kernel data access at 
0x6b6b6b6b6b6b6b83

[58563.146242] Faulting instruction address: 0xc09c0e60
[58563.146248] Oops: Kernel access of bad area, sig: 11 [#1]
[58563.146252] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
[58563.146258] Modules linked in: isofs cdrom dm_snapshot dm_bufio 
dm_round_robin dm_queue_length exfat vfat fat btrfs blake2b_generic xor 
raid6_pq zstd_compress loop xfs libcrc32c raid0 nvram rpadlpar_io rpaphp 
nfnetlink xsk_diag bonding tls rfkill sunrpc dm_service_time 
dm_multipath dm_mod pseries_rng vmx_crypto binfmt_misc ext4 mbcache jbd2 
sd_mod sg ibmvscsi scsi_transport_srp ibmveth lpfc nvmet_fc nvmet 
nvme_fc nvme_fabrics nvme_core t10_pi crc64_rocksoft crc64 
scsi_transport_fc fuse
[58563.146326] CPU: 0 PID: 1071247 Comm: drmgr Kdump: loaded Not tainted 
6.8.0-rc1-auto-gecb1b8288dc7 #1
[58563.146332] Hardware name: IBM,9009-42A POWER9 (raw) 0x4e0202 
0xf05 of:IBM,FW950.A0 (VL950_141) hv:phyp pSeries
[58563.146337] NIP:  c09c0e60 LR: c09c0e28 CTR: 
c09c1584
[58563.146342] REGS: c0007960f260 TRAP: 0380   Not tainted 
(6.8.0-rc1-auto-gecb1b8288dc7)
[58563.146347] MSR:  80009033   CR: 
24822424  XER: 20040006

[58563.146360] CFAR: c09c0e74 IRQMASK: 0
[58563.146360] GPR00: c09c0e28 c0007960f500 c1482600 
c3050540
[58563.146360] GPR04:  c0089a6870c0 0001 
fffe
[58563.146360] GPR08: c2bac020 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b 
0220
[58563.146360] GPR12: 2000 c308  

[58563.146360] GPR16:    
0001
[58563.146360] GPR20: c1281478  c1281490 
c2bfed80
[58563.146360] GPR24: c0089a6870c0   
c2b9ffb8
[58563.146360] GPR28:  c2bac0e8  


[58563.146421] NIP [c09c0e60] iommu_ops_from_fwnode+0x68/0x118
[58563.146430] LR [c09c0e28] iommu_ops_from_fwnode+0x30/0x118


This implies that iommu_device_list has become corrupted. Looks like 
spapr_tce_setup_phb_iommus_initcall() registers an iommu_device which 
pcibios_free_controller() could free if a PCI controller is removed, but 
there's no path anywhere to ever unregister any of those IOMMUs. 
Presumably this also means that if a PCI controller is dynamically added 
after init, its IOMMU won't be set up properly either.


Thanks,
Robin.


[58563.146437] Call Trace:
[58563.146439] [c0007960f500] [c0007960f560] 0xc0007960f560 
(unreliable)
[58563.146446] [c0007960f530] [c09c0fd0] 
__iommu_probe_device+0xc0/0x5c0
[58563.146454] [c0007960f5a0] [c09c151c] 
iommu_probe_device+0x4c/0xb4
[58563.146462] [c0007960f5e0] [c09c15d0] 
iommu_bus_notifier+0x4c/0x8c
[58563.146469] [c0007960f600] [c019e3d0] 
notifier_call_chain+0xb8/0x1a0
[58563.146476] [c0007960f660] [c019eea0] 
blocking_notifier_call_chain+0x64/0x94

[58563.146483] [c0007960f6a0] [c09d3c5c] bus_notify+0x50/0x7c
[58563.146491] [c0007960f6e0] [c09cfba4] device_add+0x774/0x9bc
[58563.146498] [c0007960f7a0] [c08abe9c] 
pci_device_add+0x2f4/0x864
[58563.146506] [c0007960f850] [c007d5a0] 
of_create_pci_dev+0x390/0xa08
[58563.146514] [c0007960f930] [c007de68] 
__of_scan_bus+0x250/0x328
[58563.146520] [c0007960fa10] [c007a680] 
pcibios_scan_phb+0x274/0x3c0
[58563.146527] [c0007960fae0] [c0105d58] 
init_phb_dynamic+0xb8/0x110
[58563.146535] [c0007960fb50] [c008217b0380] 
dlpar_add_slot+0x170/0x3b4 [rpadlpar_io]
[58563.146544] [c0007960fbf0] [c008217b0ca0] 
add_slot_store+0xa4/0x140 [rpadlpar_io]
[58563.146551] [c0007960fc80] [c0f3dbec] 
kobj_attr_store+0x30/0x4c
[58563.146559] [c0007960fca0] [c06931fc] 
sysfs_kf_write+0x68/0x7c
[58563.146566] [c0007960fcc0] [c0691b2c] 
kernfs_fop_write_iter+0x1c8/0x278

[58563.146573] [c0007960fd10] [c0599f54] vfs_write+0x340/0x4cc
[58563.146580] [c0007960fdc0] [c059a2bc] ksys_write+0x7c/0x140
[58563.146587] [c0007960fe10] [c0035d74] 
system_call_exception+0x134/0x330
[58563.146595] [c0007960fe50] [c000d6a0] 
system_call_common+0x160/0x2e4

[58563.146602] --- interrupt: c00 at 0x24470cb4
[58563.146606] NIP:  24470cb4 LR: 243e7d04 CTR: 

[58563.146611] REGS: c0007960fe80 TRAP: 0c00   Not tainted 
(6.8.0-rc1-auto-gecb1b8288dc7)
[58563.146616] MSR:  8280f033 
  CR: 24000282  XER: 

[58563.146632] IRQMASK: 0
[58563.146632] GPR00: 

Re: [PATCH 6/8] iommu/dart: Move the blocked domain support to a global static

2023-09-26 Thread Robin Murphy

On 2023-09-26 20:05, Janne Grunau wrote:

Hej,

On Fri, Sep 22, 2023 at 02:07:57PM -0300, Jason Gunthorpe wrote:

Move to the new static global for blocked domains. Move the blocked
specific code to apple_dart_attach_dev_blocked().

Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/apple-dart.c | 36 ++--
  1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 424f779ccc34df..376f4c5461e8f7 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -675,10 +675,6 @@ static int apple_dart_attach_dev(struct iommu_domain 
*domain,
for_each_stream_map(i, cfg, stream_map)
apple_dart_setup_translation(dart_domain, stream_map);
break;
-   case IOMMU_DOMAIN_BLOCKED:
-   for_each_stream_map(i, cfg, stream_map)
-   apple_dart_hw_disable_dma(stream_map);
-   break;
default:
return -EINVAL;
}
@@ -710,6 +706,30 @@ static struct iommu_domain apple_dart_identity_domain = {
	.ops = &apple_dart_identity_ops,
  };
  
+static int apple_dart_attach_dev_blocked(struct iommu_domain *domain,

+struct device *dev)
+{
+   struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
+   struct apple_dart_stream_map *stream_map;
+   int i;
+
+   if (cfg->stream_maps[0].dart->force_bypass)
+   return -EINVAL;


unrelated to this change as this keeps the current behavior but I think
force_bypass should not override IOMMU_DOMAIN_BLOCKED. It is set if the
CPU page size is smaller than dart's page size. Obviously dart can't
translate in that situation but it should be still possible to block it
completely.

How do we manage this? I can write a patch either to the current state
or based on this series.


The series is queued already, so best to send a patch based on 
iommu/core (I guess just removing these lines?). It won't be 
super-useful in practice since the blocking domain is normally only used 
to transition to an unmanaged domain which in the force_bypass situation 
can't be used anyway, but it's still nice on principle not to have 
unnecessary reasons for attach to fail.
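
For reference, the follow-up being discussed would presumably amount to no more than dropping the check (untested sketch against the code quoted above, not a patch from this thread):

--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ static int apple_dart_attach_dev_blocked(struct iommu_domain *domain,
 	struct apple_dart_stream_map *stream_map;
 	int i;
 
-	if (cfg->stream_maps[0].dart->force_bypass)
-		return -EINVAL;
-
 	for_each_stream_map(i, cfg, stream_map)
 		apple_dart_hw_disable_dma(stream_map);
 	return 0;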


Thanks,
Robin.




+
+   for_each_stream_map(i, cfg, stream_map)
+   apple_dart_hw_disable_dma(stream_map);
+   return 0;
+}
+
+static const struct iommu_domain_ops apple_dart_blocked_ops = {
+   .attach_dev = apple_dart_attach_dev_blocked,
+};
+
+static struct iommu_domain apple_dart_blocked_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
+   .ops = &apple_dart_blocked_ops,
+};
+
  static struct iommu_device *apple_dart_probe_device(struct device *dev)
  {
struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
@@ -739,8 +759,7 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)
  {
struct apple_dart_domain *dart_domain;
  
-	if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED &&

-   type != IOMMU_DOMAIN_BLOCKED)
+   if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
return NULL;
  
  	dart_domain = kzalloc(sizeof(*dart_domain), GFP_KERNEL);

@@ -749,10 +768,6 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)
  
  	mutex_init(&dart_domain->init_lock);
  
-	/* no need to allocate pgtbl_ops or do any other finalization steps */

-   if (type == IOMMU_DOMAIN_BLOCKED)
-   dart_domain->finalized = true;
-
	return &dart_domain->domain;
  }
  
@@ -966,6 +981,7 @@ static void apple_dart_get_resv_regions(struct device *dev,
  
  static const struct iommu_ops apple_dart_iommu_ops = {

	.identity_domain = &apple_dart_identity_domain,
+   .blocked_domain = &apple_dart_blocked_domain,
.domain_alloc = apple_dart_domain_alloc,
.probe_device = apple_dart_probe_device,
.release_device = apple_dart_release_device,
--
2.42.0


Reviewed-by: Janne Grunau 

best regards
Janne

ps: I sent the reply to [Patch 4/8] accidentally with an incorrect from
address but the correct Reviewed-by:. I can resend if necessary.


Re: [PATCH 6/8] iommu/dart: Move the blocked domain support to a global static

2023-09-26 Thread Robin Murphy

On 2023-09-26 20:34, Robin Murphy wrote:

On 2023-09-26 20:05, Janne Grunau wrote:

Hej,

On Fri, Sep 22, 2023 at 02:07:57PM -0300, Jason Gunthorpe wrote:

Move to the new static global for blocked domains. Move the blocked
specific code to apple_dart_attach_dev_blocked().

Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/apple-dart.c | 36 ++--
  1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 424f779ccc34df..376f4c5461e8f7 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -675,10 +675,6 @@ static int apple_dart_attach_dev(struct 
iommu_domain *domain,

  for_each_stream_map(i, cfg, stream_map)
  apple_dart_setup_translation(dart_domain, stream_map);
  break;
-    case IOMMU_DOMAIN_BLOCKED:
-    for_each_stream_map(i, cfg, stream_map)
-    apple_dart_hw_disable_dma(stream_map);
-    break;
  default:
  return -EINVAL;
  }
@@ -710,6 +706,30 @@ static struct iommu_domain 
apple_dart_identity_domain = {

  .ops = &apple_dart_identity_ops,
  };
+static int apple_dart_attach_dev_blocked(struct iommu_domain *domain,
+ struct device *dev)
+{
+    struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
+    struct apple_dart_stream_map *stream_map;
+    int i;
+
+    if (cfg->stream_maps[0].dart->force_bypass)
+    return -EINVAL;


unrelated to this change as this keeps the current behavior but I think
force_bypass should not override IOMMU_DOMAIN_BLOCKED. It is set if the
CPU page size is smaller than dart's page size. Obviously dart can't
translate in that situation but it should be still possible to block it
completely.

How do we manage this? I can write a patch either to the current state
or based on this series.


The series is queued already, so best to send a patch based on 
iommu/core (I guess just removing these lines?).


Um, what? This isn't the domain_alloc_paging series itself, Robin you 
fool. Clearly it's time to close the computer and try again tomorrow...


Cheers,
Robin.

It won't be 
super-useful in practice since the blocking domain is normally only used 
to transition to an unmanaged domain which in the force_bypass situation 
can't be used anyway, but it's still nice on principle not to have 
unnecessary reasons for attach to fail.


Thanks,
Robin.




+
+    for_each_stream_map(i, cfg, stream_map)
+    apple_dart_hw_disable_dma(stream_map);
+    return 0;
+}
+
+static const struct iommu_domain_ops apple_dart_blocked_ops = {
+    .attach_dev = apple_dart_attach_dev_blocked,
+};
+
+static struct iommu_domain apple_dart_blocked_domain = {
+    .type = IOMMU_DOMAIN_BLOCKED,
+    .ops = &apple_dart_blocked_ops,
+};
+
  static struct iommu_device *apple_dart_probe_device(struct device 
*dev)

  {
  struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
@@ -739,8 +759,7 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)

  {
  struct apple_dart_domain *dart_domain;
-    if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED &&
-    type != IOMMU_DOMAIN_BLOCKED)
+    if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
  return NULL;
  dart_domain = kzalloc(sizeof(*dart_domain), GFP_KERNEL);
@@ -749,10 +768,6 @@ static struct iommu_domain 
*apple_dart_domain_alloc(unsigned int type)

  mutex_init(&dart_domain->init_lock);
-    /* no need to allocate pgtbl_ops or do any other finalization 
steps */

-    if (type == IOMMU_DOMAIN_BLOCKED)
-    dart_domain->finalized = true;
-
  return &dart_domain->domain;
  }
@@ -966,6 +981,7 @@ static void apple_dart_get_resv_regions(struct 
device *dev,

  static const struct iommu_ops apple_dart_iommu_ops = {
  .identity_domain = &apple_dart_identity_domain,
+    .blocked_domain = &apple_dart_blocked_domain,
  .domain_alloc = apple_dart_domain_alloc,
  .probe_device = apple_dart_probe_device,
  .release_device = apple_dart_release_device,
--
2.42.0


Reviewed-by: Janne Grunau 

best regards
Janne

ps: I sent the reply to [Patch 4/8] accidentally with an incorrect from
address but the correct Reviewed-by:. I can resend if necessary.




Re: [PATCH v2 09/25] iommu/fsl_pamu: Implement an IDENTITY domain

2023-06-01 Thread Robin Murphy

On 2023-06-01 20:46, Jason Gunthorpe wrote:

On Thu, Jun 01, 2023 at 08:37:45PM +0100, Robin Murphy wrote:

On 2023-05-16 01:00, Jason Gunthorpe wrote:

Robin was able to check the documentation and what fsl_pamu has
historically called detach_dev() is really putting the IOMMU into an
IDENTITY mode.


Unfortunately it was the other way around - it's the call to
fsl_setup_liodns() from fsl_pamu_probe() which leaves everything in bypass
by default (the PAACE_ATM_NO_XLATE part, IIRC), whereas the detach_device()
call here ends up disabling the given device's LIODN altogether


Er, I see.. Let me think about it, you convinced me to change it from
PLATFORM, so maybe we should go back to that if it is all wonky.


FWIW I was thinking more along the lines of a token nominal identity 
domain where attach does nothing at all...
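
Concretely, that token identity domain could be little more than the sketch below, where attach leaves the PAMU in the bypass state that fsl_setup_liodns() established at probe time (names here are illustrative only, not code from this series):

static int fsl_pamu_noop_identity_attach(struct iommu_domain *domain,
					 struct device *dev)
{
	/* Hardware is already in bypass from fsl_setup_liodns(); do nothing. */
	return 0;
}

static const struct iommu_domain_ops fsl_pamu_noop_identity_ops = {
	.attach_dev = fsl_pamu_noop_identity_attach,
};

static struct iommu_domain fsl_pamu_noop_identity_domain = {
	.type = IOMMU_DOMAIN_IDENTITY,
	.ops = &fsl_pamu_noop_identity_ops,
};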



There doesn't appear to have ever been any code anywhere for putting
things *back* into bypass after using a VFIO domain, so as-is these
default domains would probably break all DMA :(


Sounds like it just never worked right.

ie going to VFIO mode was always a one way trip and you can't go back
to a kernel driver.


...on the assumption that doing so wouldn't really be any less broken 
than it always has been :)


Thanks,
Robin.


I don't think this patch makes it worse because we call the identity
attach_dev in all the same places we called detach_dev in the first
place.

We add an extra call at the start of time, but that call is NOP'd
by this:


if (domain == platform_domain || !domain)
+   return 0;
+


(bah, and the variable name needs updating too)

Honestly, I don't really want to fix FSL since it seems abandoned, so
either this patch or going back to PLATFORM seems like the best option.

Jason


Re: [PATCH v2 25/25] iommu: Convert remaining simple drivers to domain_alloc_paging()

2023-06-01 Thread Robin Murphy

On 2023-05-16 01:00, Jason Gunthorpe wrote:

These drivers don't support IOMMU_DOMAIN_DMA, so this commit effectively
allows them to support that mode.

The prior work to require default_domains makes this safe because every
one of these drivers is either compilation incompatible with dma-iommu.c,
or already establishing a default_domain. In both cases alloc_domain()
will never be called with IOMMU_DOMAIN_DMA for these drivers so it is safe
to drop the test.

Removing these tests clarifies that the domain allocation path is only
about the functionality of a paging domain and has nothing to do with
policy of how the paging domain is used for UNMANAGED/DMA/DMA_FQ.

Tested-by: Niklas Schnelle 
Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/fsl_pamu_domain.c | 7 ++-
  drivers/iommu/msm_iommu.c   | 7 ++-
  drivers/iommu/mtk_iommu_v1.c| 7 ++-
  drivers/iommu/omap-iommu.c  | 7 ++-
  drivers/iommu/s390-iommu.c  | 7 ++-
  5 files changed, 10 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index ca4f5ebf028783..8d5d6a3acf9dfd 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -192,13 +192,10 @@ static void fsl_pamu_domain_free(struct iommu_domain 
*domain)
kmem_cache_free(fsl_pamu_domain_cache, dma_domain);
  }
  
-static struct iommu_domain *fsl_pamu_domain_alloc(unsigned type)

+static struct iommu_domain *fsl_pamu_domain_alloc_paging(struct device *dev)


This isn't a paging domain - it doesn't support map/unmap, and AFAICT 
all it has ever been intended to do is "isolate" accesses to within an 
aperture which is never set to anything less than the entire physical 
address space :/


I hate to imagine what the VFIO userspace applications looked like...

Thanks,
Robin.


  {
struct fsl_dma_domain *dma_domain;
  
-	if (type != IOMMU_DOMAIN_UNMANAGED)

-   return NULL;
-
dma_domain = kmem_cache_zalloc(fsl_pamu_domain_cache, GFP_KERNEL);
if (!dma_domain)
return NULL;
@@ -476,7 +473,7 @@ static const struct iommu_ops fsl_pamu_ops = {
	.identity_domain = &fsl_pamu_identity_domain,
	.def_domain_type = &fsl_pamu_def_domain_type,
.capable= fsl_pamu_capable,
-   .domain_alloc   = fsl_pamu_domain_alloc,
+   .domain_alloc_paging = fsl_pamu_domain_alloc_paging,
.probe_device   = fsl_pamu_probe_device,
.device_group   = fsl_pamu_device_group,
.default_domain_ops = &(const struct iommu_domain_ops) {
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 26ed81cfeee897..a163cee0b7242d 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -302,13 +302,10 @@ static void __program_context(void __iomem *base, int ctx,
SET_M(base, ctx, 1);
  }
  
-static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)

+static struct iommu_domain *msm_iommu_domain_alloc_paging(struct device *dev)
  {
struct msm_priv *priv;
  
-	if (type != IOMMU_DOMAIN_UNMANAGED)

-   return NULL;
-
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv)
goto fail_nomem;
@@ -691,7 +688,7 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
  
  static struct iommu_ops msm_iommu_ops = {

	.identity_domain = &msm_iommu_identity_domain,
-   .domain_alloc = msm_iommu_domain_alloc,
+   .domain_alloc_paging = msm_iommu_domain_alloc_paging,
.probe_device = msm_iommu_probe_device,
.device_group = generic_device_group,
.pgsize_bitmap = MSM_IOMMU_PGSIZES,
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 7c0c1d50df5f75..67e044c1a7d93b 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -270,13 +270,10 @@ static int mtk_iommu_v1_domain_finalise(struct 
mtk_iommu_v1_data *data)
return 0;
  }
  
-static struct iommu_domain *mtk_iommu_v1_domain_alloc(unsigned type)

+static struct iommu_domain *mtk_iommu_v1_domain_alloc_paging(struct device 
*dev)
  {
struct mtk_iommu_v1_domain *dom;
  
-	if (type != IOMMU_DOMAIN_UNMANAGED)

-   return NULL;
-
dom = kzalloc(sizeof(*dom), GFP_KERNEL);
if (!dom)
return NULL;
@@ -585,7 +582,7 @@ static int mtk_iommu_v1_hw_init(const struct 
mtk_iommu_v1_data *data)
  
  static const struct iommu_ops mtk_iommu_v1_ops = {

	.identity_domain = &mtk_iommu_v1_identity_domain,
-   .domain_alloc   = mtk_iommu_v1_domain_alloc,
+   .domain_alloc_paging = mtk_iommu_v1_domain_alloc_paging,
.probe_device   = mtk_iommu_v1_probe_device,
.probe_finalize = mtk_iommu_v1_probe_finalize,
.release_device = mtk_iommu_v1_release_device,
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 34340ef15241bc..fcf99bd195b32e 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1580,13 

Re: [PATCH v2 09/25] iommu/fsl_pamu: Implement an IDENTITY domain

2023-06-01 Thread Robin Murphy

On 2023-05-16 01:00, Jason Gunthorpe wrote:

Robin was able to check the documentation and what fsl_pamu has
historically called detach_dev() is really putting the IOMMU into an
IDENTITY mode.


Unfortunately it was the other way around - it's the call to 
fsl_setup_liodns() from fsl_pamu_probe() which leaves everything in 
bypass by default (the PAACE_ATM_NO_XLATE part, IIRC), whereas the 
detach_device() call here ends up disabling the given device's LIODN 
altogether. There doesn't appear to have ever been any code anywhere for 
putting things *back* into bypass after using a VFIO domain, so as-is 
these default domains would probably break all DMA :(


Thanks,
Robin.


Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain. This is a ppc driver without any dma_ops, so ensure
the identity translation is the default domain.

Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/fsl_pamu_domain.c | 32 +---
  1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index bce37229709965..ca4f5ebf028783 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -283,15 +283,21 @@ static int fsl_pamu_attach_device(struct iommu_domain 
*domain,
return ret;
  }
  
-static void fsl_pamu_set_platform_dma(struct device *dev)

+static int fsl_pamu_identity_attach(struct iommu_domain *platform_domain,
+   struct device *dev)
  {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct fsl_dma_domain *dma_domain = to_fsl_dma_domain(domain);
+   struct fsl_dma_domain *dma_domain;
const u32 *prop;
int len;
struct pci_dev *pdev = NULL;
struct pci_controller *pci_ctl;
  
+	if (domain == platform_domain || !domain)

+   return 0;
+
+   dma_domain = to_fsl_dma_domain(domain);
+
/*
 * Use LIODN of the PCI controller while detaching a
 * PCI device.
@@ -312,8 +318,18 @@ static void fsl_pamu_set_platform_dma(struct device *dev)
detach_device(dev, dma_domain);
else
pr_debug("missing fsl,liodn property at %pOF\n", dev->of_node);
+   return 0;
  }
  
+static struct iommu_domain_ops fsl_pamu_identity_ops = {

+   .attach_dev = fsl_pamu_identity_attach,
+};
+
+static struct iommu_domain fsl_pamu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &fsl_pamu_identity_ops,
+};
+
  /* Set the domain stash attribute */
  int fsl_pamu_configure_l1_stash(struct iommu_domain *domain, u32 cpu)
  {
@@ -447,12 +463,22 @@ static struct iommu_device *fsl_pamu_probe_device(struct 
device *dev)
	return &pamu_iommu;
  }
  
+static int fsl_pamu_def_domain_type(struct device *dev)

+{
+   /*
+* This platform does not use dma_ops at all so normally the iommu
+* must be in identity mode
+*/
+   return IOMMU_DOMAIN_IDENTITY;
+}
+
  static const struct iommu_ops fsl_pamu_ops = {
+   .identity_domain = &fsl_pamu_identity_domain,
+   .def_domain_type = &fsl_pamu_def_domain_type,
.capable= fsl_pamu_capable,
.domain_alloc   = fsl_pamu_domain_alloc,
.probe_device   = fsl_pamu_probe_device,
.device_group   = fsl_pamu_device_group,
-   .set_platform_dma_ops = fsl_pamu_set_platform_dma,
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = fsl_pamu_attach_device,
.iova_to_phys   = fsl_pamu_iova_to_phys,


Re: [PATCH v2 23/25] iommu: Add ops->domain_alloc_paging()

2023-06-01 Thread Robin Murphy

On 2023-05-16 01:00, Jason Gunthorpe wrote:

This callback requests the driver to create only a __IOMMU_DOMAIN_PAGING
domain, so it saves a few lines in a lot of drivers needlessly checking
the type.

More critically, this allows us to sweep out all the
IOMMU_DOMAIN_UNMANAGED and IOMMU_DOMAIN_DMA checks from a lot of the
drivers, simplifying what is going on in the code and ultimately removing
the now-unused special cases in drivers where they did not support
IOMMU_DOMAIN_DMA.

domain_alloc_paging() should return a struct iommu_domain that is
functionally compatible with ARM_DMA_USE_IOMMU, dma-iommu.c and iommufd.

Be forwards looking and pass in a 'struct device *' argument. We can
provide this when allocating the default_domain. No drivers will look at
this.


As mentioned before, we already know we're going to need additional 
flags (and possibly data) to cover the existing set_pgtable_quirks 
use-case plus new stuff like the proposed dirty-tracking enable, so I'd 
be inclined to either add an extensible structure argument now to avoid 
future churn, or just not bother adding the device argument either until 
drivers can actually use it.
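
A sketch of what such an extensible argument could look like; the structure and field names below are invented for illustration and are not part of the posted series:

/* Hypothetical: bundle the device with any future allocation parameters so
 * the op's signature never needs to change again. */
struct iommu_paging_domain_params {
	struct device *dev;
	unsigned long pgtable_quirks;	/* cf. the set_pgtable_quirks use-case */
	bool dirty_tracking;		/* cf. the proposed dirty-tracking enable */
};

struct iommu_domain *(*domain_alloc_paging)(const struct iommu_paging_domain_params *p);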



Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/iommu.c | 18 +++---
  include/linux/iommu.h |  3 +++
  2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c4cac1dcf80610..15aa51c356bd74 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1995,14 +1995,25 @@ void iommu_set_fault_handler(struct iommu_domain 
*domain,
  EXPORT_SYMBOL_GPL(iommu_set_fault_handler);
  
  static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,

+struct device *dev,
 unsigned int type)
  {
struct iommu_domain *domain;
  
  	if (type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)

return ops->identity_domain;
+   else if ((type == IOMMU_DOMAIN_UNMANAGED || type == IOMMU_DOMAIN_DMA) &&
+ops->domain_alloc_paging) {
+   /*
+* For now exclude DMA_FQ since it is still a driver policy
+* decision through domain_alloc() if we can use FQ mode.
+*/


That's sorted now, so the type test can neatly collapse down to "type & 
__IOMMU_DOMAIN_PAGING".
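
i.e. the allocation path can then read roughly as follows (a sketch of the simplification, not the exact queued code):

	if (type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
		return ops->identity_domain;
	else if ((type & __IOMMU_DOMAIN_PAGING) && ops->domain_alloc_paging)
		domain = ops->domain_alloc_paging(dev);
	else if (ops->domain_alloc)
		domain = ops->domain_alloc(type);
	else
		return NULL;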


Thanks,
Robin.


+   domain = ops->domain_alloc_paging(dev);
+   } else if (ops->domain_alloc)
+   domain = ops->domain_alloc(type);
+   else
+   return NULL;
  
-	domain = ops->domain_alloc(type);

if (!domain)
return NULL;
  
@@ -2033,14 +2044,15 @@ __iommu_group_domain_alloc(struct iommu_group *group, unsigned int type)
  
  	lockdep_assert_held(&group->mutex);
  
-	return __iommu_domain_alloc(dev_iommu_ops(dev), type);

+   return __iommu_domain_alloc(dev_iommu_ops(dev), dev, type);
  }
  
  struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)

  {
if (bus == NULL || bus->iommu_ops == NULL)
return NULL;
-   return __iommu_domain_alloc(bus->iommu_ops, IOMMU_DOMAIN_UNMANAGED);
+   return __iommu_domain_alloc(bus->iommu_ops, NULL,
+   IOMMU_DOMAIN_UNMANAGED);
  }
  EXPORT_SYMBOL_GPL(iommu_domain_alloc);
  
diff --git a/include/linux/iommu.h b/include/linux/iommu.h

index 387746f8273c99..18b0df42cc80d1 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -227,6 +227,8 @@ struct iommu_iotlb_gather {
   * struct iommu_ops - iommu ops and capabilities
   * @capable: check capability
   * @domain_alloc: allocate iommu domain
+ * @domain_alloc_paging: Allocate an iommu_domain that can be used for
+ *   UNMANAGED, DMA, and DMA_FQ domain types.
   * @probe_device: Add device to iommu driver handling
   * @release_device: Remove device from iommu driver handling
   * @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -258,6 +260,7 @@ struct iommu_ops {
  
  	/* Domain allocation and freeing by the iommu driver */

struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
+   struct iommu_domain *(*domain_alloc_paging)(struct device *dev);
  
  	struct iommu_device *(*probe_device)(struct device *dev);

void (*release_device)(struct device *dev);


Re: [PATCH v2 08/25] iommu: Allow an IDENTITY domain as the default_domain in ARM32

2023-06-01 Thread Robin Murphy

On 2023-05-16 01:00, Jason Gunthorpe wrote:

Even though dma-iommu.c and CONFIG_ARM_DMA_USE_IOMMU do approximately the
same stuff, the way they relate to the IOMMU core is quite different.

dma-iommu.c expects the core code to setup an UNMANAGED domain (of type
IOMMU_DOMAIN_DMA) and then configures itself to use that domain. This
becomes the default_domain for the group.

ARM_DMA_USE_IOMMU does not use the default_domain, instead it directly
allocates an UNMANAGED domain and operates it just like an external
driver. In this case group->default_domain is NULL.

If the driver provides a global static identity_domain then automatically
use it as the default_domain when in ARM_DMA_USE_IOMMU mode.

This allows drivers that implemented default_domain == NULL as an IDENTITY
translation to trivially get a properly labeled non-NULL default_domain on
ARM32 configs.

With this arrangement when ARM_DMA_USE_IOMMU wants to disconnect from the
device the normal detach_domain flow will restore the IDENTITY domain as
the default domain. Overall this makes attach_dev() of the IDENTITY domain
called in the same places as detach_dev().

This effectively migrates these drivers to default_domain mode. For
drivers that support ARM64 they will gain support for the IDENTITY
translation mode for the dma_api and behave in a uniform way.

Drivers use this by setting ops->identity_domain to a static singleton
iommu_domain that implements the identity attach. If the core detects
ARM_DMA_USE_IOMMU mode then it automatically attaches the IDENTITY domain
during probe.

Drivers can continue to prevent the use of DMA translation by returning
IOMMU_DOMAIN_IDENTITY from def_domain_type, this will completely prevent
IOMMU_DMA from running but will not impact ARM_DMA_USE_IOMMU.

This allows removing the set_platform_dma_ops() from every remaining
driver.

Remove the set_platform_dma_ops from rockchip and mkt_v1 as all it does
is set an existing global static identity domain. mkt_v1 does not support
IOMMU_DOMAIN_DMA and it does not compile on ARM64 so this transformation
is safe.

Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/iommu.c  | 40 +-
  drivers/iommu/mtk_iommu_v1.c   | 12 --
  drivers/iommu/rockchip-iommu.c | 10 -
  3 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8ba90571449cec..bed7cb6e5ee65b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1757,18 +1757,48 @@ static int iommu_get_default_domain_type(struct 
iommu_group *group,
int type;
  
	lockdep_assert_held(&group->mutex);

+
+   /*
+* ARM32 drivers supporting CONFIG_ARM_DMA_USE_IOMMU can declare an
+* identity_domain and it will automatically become their default
+* domain. Later on ARM_DMA_USE_IOMMU will install its UNMANAGED domain.
+* Override the selection to IDENTITY if we are sure the driver supports
+* it.
+*/
+   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) && ops->identity_domain) {


If I cared about arm-smmu on 32-bit, I'd bring that up again, but 
honestly I'm not sure that I do... I think it might end up working after 
patch #21, and it's currently still broken for lack of .set_platform_dma 
anyway, so meh.



+   type = IOMMU_DOMAIN_IDENTITY;
+   if (best_type && type && best_type != type)
+   goto err;
+   best_type = target_type = IOMMU_DOMAIN_IDENTITY;
+   }
+
for_each_group_device(group, gdev) {
type = best_type;
if (ops->def_domain_type) {
type = ops->def_domain_type(gdev->dev);
-   if (best_type && type && best_type != type)
+   if (best_type && type && best_type != type) {
+   /* Stick with the last driver override we saw */
+   best_type = type;
goto err;
+   }
}
  
  		if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted) {

-   type = IOMMU_DOMAIN_DMA;
-   if (best_type && type && best_type != type)
-   goto err;
+   /*
+* We don't have any way for the iommu core code to
+* force arm_iommu to activate so we can't enforce
+* trusted. Log it and keep going with the IDENTITY
+* default domain.
+*/
+   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)) {
+   dev_warn(
+   gdev->dev,
+   "PCI device is untrusted but ARM32 does not 
support secure IOMMU operation, continuing anyway.\n");


To within experimental error, this is dead code. The ARM DMA ops don't 

Re: [PATCH v2 04/25] iommu: Add IOMMU_DOMAIN_PLATFORM for S390

2023-06-01 Thread Robin Murphy

On 2023-05-16 01:00, Jason Gunthorpe wrote:

The PLATFORM domain will be set as the default domain and attached as
normal during probe. The driver will ignore the initial attach from a NULL
domain to the PLATFORM domain.

After this, the PLATFORM domain's attach_dev will be called whenever we
detach from an UNMANAGED domain (eg for VFIO). This is the same time the
original design would have called op->detach_dev().

This is temporary until the S390 dma-iommu.c conversion is merged.


If we do need a stopgap here, can we please just call the current 
situation an identity domain? It's true enough in the sense that the 
IOMMU API is not offering any translation or guarantee of isolation, so 
the semantics of an identity domain - from the point of view of anything 
inside the IOMMU API that would be looking - are no weaker or less 
useful than a "platform" domain whose semantics are intentionally unknown.
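
In other words, the same stopgap could simply reuse the existing identity type; a sketch derived from the patch below by relabelling only, still relying on the driver's attach handler to hand control back to the s390 dma_api (illustrative, untested):

static struct iommu_domain_ops s390_iommu_identity_ops = {
	.attach_dev = s390_iommu_platform_attach,	/* same handler as below */
};

static struct iommu_domain s390_iommu_identity_domain = {
	.type = IOMMU_DOMAIN_IDENTITY,
	.ops = &s390_iommu_identity_ops,
};

/* ...and in s390_iommu_ops: */
	.identity_domain = &s390_iommu_identity_domain,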


Then similarly for patch #3 - since we already know s390 is temporary, 
it seems an anathema to introduce a whole domain type with its own weird 
ops->default_domain mechanism solely for POWER to not actually use 
domains with.


In terms of reasoning, I don't see that IOMMU_DOMAIN_PLATFORM is any 
more useful than a NULL default domain, it just renames the problem, and 
gives us more code to maintain for the privilege. As I say, though, we 
don't actually need to juggle the semantic of a "we don't know what's 
happening here" domain around any further, since it works out that a 
"we're not influencing anything here" domain actually suffices for what 
we want to reason about, and those are already well-defined. Sure, the 
platform DMA ops *might* be doing more, but that's beyond the scope of 
the IOMMU API either way. At that point, lo and behold, s390 and POWER 
now look just like ARM and the core code only needs a single special 
case for arch-specific default identity domains, lovely!


Thanks,
Robin.


Tested-by: Heiko Stuebner 
Tested-by: Niklas Schnelle 
Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/s390-iommu.c | 21 +++--
  1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index fbf59a8db29b11..f0c867c57a5b9b 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -142,14 +142,31 @@ static int s390_iommu_attach_device(struct iommu_domain 
*domain,
return 0;
  }
  
-static void s390_iommu_set_platform_dma(struct device *dev)

+/*
+ * Switch control over the IOMMU to S390's internal dma_api ops
+ */
+static int s390_iommu_platform_attach(struct iommu_domain *platform_domain,
+ struct device *dev)
  {
struct zpci_dev *zdev = to_zpci_dev(dev);
  
+	if (!zdev->s390_domain)

+   return 0;
+
__s390_iommu_detach_device(zdev);
zpci_dma_init_device(zdev);
+   return 0;
  }
  
+static struct iommu_domain_ops s390_iommu_platform_ops = {

+   .attach_dev = s390_iommu_platform_attach,
+};
+
+static struct iommu_domain s390_iommu_platform_domain = {
+   .type = IOMMU_DOMAIN_PLATFORM,
+   .ops = &s390_iommu_platform_ops,
+};
+
  static void s390_iommu_get_resv_regions(struct device *dev,
struct list_head *list)
  {
@@ -428,12 +445,12 @@ void zpci_destroy_iommu(struct zpci_dev *zdev)
  }
  
  static const struct iommu_ops s390_iommu_ops = {

+   .default_domain = &s390_iommu_platform_domain,
.capable = s390_iommu_capable,
.domain_alloc = s390_domain_alloc,
.probe_device = s390_iommu_probe_device,
.release_device = s390_iommu_release_device,
.device_group = generic_device_group,
-   .set_platform_dma_ops = s390_iommu_set_platform_dma,
.pgsize_bitmap = SZ_4K,
.get_resv_regions = s390_iommu_get_resv_regions,
.default_domain_ops = &(const struct iommu_domain_ops) {


Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper

2023-03-31 Thread Robin Murphy

On 31/03/2023 3:00 pm, Arnd Bergmann wrote:

On Mon, Mar 27, 2023, at 14:48, Robin Murphy wrote:

On 2023-03-27 13:13, Arnd Bergmann wrote:


[ HELP NEEDED: can anyone confirm that it is a correct assumption
on arm that a cache-coherent device writing to a page always results
in it being in a PG_dcache_clean state like on ia64, or can a device
write directly into the dcache?]


In AMBA at least, if a snooping write hits in a cache then the data is
most likely going to get routed directly into that cache. If it has
write-back write-allocate attributes it could also land in any cache
along its normal path to RAM; it wouldn't have to go all the way.

Hence all the fun we have where treating a coherent device as
non-coherent can still be almost as broken as the other way round :)


Ok, thanks for the information. I'm still not sure whether this can
result in the situation where PG_dcache_clean is wrong though.

Specifically, the question is whether a DMA to a coherent buffer
can end up in a dirty L1 dcache of one core and require to write
back the dcache before invalidating the icache for that page.

On ia64, this is not the case, the optimization here is to
only flush the icache after a coherent DMA into an executable
user page, while Arm only does this for noncoherent DMA but not
coherent DMA.

 From your explanation it sounds like this might happen,
even though that would mean that "coherent" DMA is slightly
less coherent than it is elsewhere.

To be on the safe side, I'd have to pass a flag into
arch_dma_mark_clean() about coherency, to let the arm
implementation still require the extra dcache flush
for coherent DMA, while ia64 can ignore that flag.


Coherent DMA on Arm is assumed to be inner-shareable, so a coherent DMA 
write should be pretty much equivalent to a coherent write by another 
CPU (or indeed the local CPU itself) - nothing says that it *couldn't* 
dirty a line in a data cache above the level of unification, so in 
general the assumption must be that, yes, if coherent DMA is writing 
data intended to be executable, then it's going to want a Dcache clean 
to PoU and an Icache invalidate to PoU before trying to execute it. By 
comparison, a non-coherent DMA transfer will inherently have to 
invalidate the Dcache all the way to PoC in its dma_unmap, thus cannot 
leave dirty data above the PoU, so only the Icache maintenance is 
required in the executable case.


(FWIW I believe the Armv8 IDC/DIC features can safely be considered 
irrelevant to 32-bit kernels)


I don't know a great deal about IA-64, but it appears to be using its 
PG_arch_1 flag in a subtly different manner to Arm, namely to optimise 
out the *Icache* maintenance. So if anything, it seems IA-64 is the 
weirdo here (who'd have guessed?) where DMA manages to be *more* 
coherent than the CPUs themselves :)


This is all now making me think we need some careful consideration of 
whether the benefits of consolidating code outweigh the confusion of 
conflating multiple different meanings of "clean" together...


Thanks,
Robin.


Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper

2023-03-27 Thread Robin Murphy

On 2023-03-27 13:13, Arnd Bergmann wrote:

From: Arnd Bergmann 

The arm version of the arch_sync_dma_for_cpu() function annotates pages as
PG_dcache_clean after a DMA, but no other architecture does this here. On
ia64, the same thing is done in arch_sync_dma_for_cpu(), so it makes sense
to use the same hook in order to have identical arch_sync_dma_for_cpu()
semantics as all other architectures.

Splitting this out has multiple effects:

  - for dma-direct, this now gets called after arch_sync_dma_for_cpu()
for DMA_FROM_DEVICE mappings, but not for DMA_BIDIRECTIONAL. While
it would not be harmful to keep doing it for bidirectional mappings,
those are apparently not used in any callers that care about the flag.

  - Since arm has its own dma-iommu abstraction, this now also needs to
call the same function, so the calls are added there to mirror the
dma-direct version.

  - Like dma-direct, the dma-iommu version now marks the dcache clean
for both coherent and noncoherent devices after a DMA, but it only
does this for DMA_FROM_DEVICE, not DMA_BIDIRECTIONAL.

[ HELP NEEDED: can anyone confirm that it is a correct assumption
   on arm that a cache-coherent device writing to a page always results
   in it being in a PG_dcache_clean state like on ia64, or can a device
   write directly into the dcache?]


In AMBA at least, if a snooping write hits in a cache then the data is 
most likely going to get routed directly into that cache. If it has 
write-back write-allocate attributes it could also land in any cache 
along its normal path to RAM; it wouldn't have to go all the way.


Hence all the fun we have where treating a coherent device as 
non-coherent can still be almost as broken as the other way round :)


Cheers,
Robin.


Signed-off-by: Arnd Bergmann 
---
  arch/arm/Kconfig  |  1 +
  arch/arm/mm/dma-mapping.c | 71 +++
  2 files changed, 43 insertions(+), 29 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e24a9820e12f..125d58c54ab1 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -7,6 +7,7 @@ config ARM
select ARCH_HAS_BINFMT_FLAT
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL if MMU
+   select ARCH_HAS_DMA_MARK_CLEAN if MMU
select ARCH_HAS_DMA_WRITE_COMBINE if !ARM_DMA_MEM_BUFFERABLE
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index cc702cb27ae7..b703cb83d27e 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -665,6 +665,28 @@ static void dma_cache_maint(phys_addr_t paddr,
} while (left);
  }
  
+/*

+ * Mark the D-cache clean for these pages to avoid extra flushing.
+ */
+void arch_dma_mark_clean(phys_addr_t paddr, size_t size)
+{
+   unsigned long pfn = PFN_UP(paddr);
+   unsigned long off = paddr & (PAGE_SIZE - 1);
+   size_t left = size;
+
+   if (size < PAGE_SIZE)
+   return;
+
+   if (off)
+   left -= PAGE_SIZE - off;
+
+   while (left >= PAGE_SIZE) {
+   struct page *page = pfn_to_page(pfn++);
+   set_bit(PG_dcache_clean, &page->flags);
+   left -= PAGE_SIZE;
+   }
+}
+
  static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
  {
if (IS_ENABLED(CONFIG_CPU_V6) ||
@@ -715,24 +737,6 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
outer_inv_range(paddr, paddr + size);
dma_cache_maint(paddr, size, dmac_inv_range);
}
-
-   /*
-* Mark the D-cache clean for these pages to avoid extra flushing.
-*/
-   if (dir != DMA_TO_DEVICE && size >= PAGE_SIZE) {
-   unsigned long pfn = PFN_UP(paddr);
-   unsigned long off = paddr & (PAGE_SIZE - 1);
-   size_t left = size;
-
-   if (off)
-   left -= PAGE_SIZE - off;
-
-   while (left >= PAGE_SIZE) {
-   struct page *page = pfn_to_page(pfn++);
-   set_bit(PG_dcache_clean, &page->flags);
-   left -= PAGE_SIZE;
-   }
-   }
  }
  
  #ifdef CONFIG_ARM_DMA_USE_IOMMU

@@ -1294,6 +1298,17 @@ static int arm_iommu_map_sg(struct device *dev, struct 
scatterlist *sg,
return -EINVAL;
  }
  
+static void arm_iommu_sync_dma_for_cpu(phys_addr_t phys, size_t len,

+  enum dma_data_direction dir,
+  bool dma_coherent)
+{
+   if (!dma_coherent)
+   arch_sync_dma_for_cpu(phys, s->length, dir);
+
+   if (dir == DMA_FROM_DEVICE)
+   arch_dma_mark_clean(phys, s->length);
+}
+
  /**
   * arm_iommu_unmap_sg - unmap a set of SG buffers mapped by dma_map_sg
   * @dev: valid struct device pointer
@@ -1316,8 +1331,9 @@ static void arm_iommu_unmap_sg(struct device *dev,
if 

Re: [PATCH 3/3] of: address: Use dma_default_coherent to determine default coherency

2023-02-22 Thread Robin Murphy

On 2023-02-22 13:37, Jiaxun Yang wrote:

As for now all arches have dma_default_coherent matched with default
DMA coherency for of devices, so there is no need to have a standalone
config option.

This also fixes a case that for some MIPS platforms, coherency information
is not carried in devicetree and kernel will override dma_default_coherent
at early boot.

Note for PowerPC: CONFIG_OF_DMA_DEFAULT_COHERENT was only selected when
CONFIG_NOT_COHERENT_CACHE is false, in this case dma_default_coherent will
be true, so it still matches present behavior.

Note for RISC-V: dma_default_coherent is set to true at init code in this
series.


OK, so the fundamental problem here is that we have two slightly 
different conflicting mechanisms, the ex-PowerPC config option, and the 
ex-MIPS dma_default_coherent for which of_dma_is_coherent() has 
apparently been broken forever.


I'd agree that it's worth consolidating the two, but please separate out 
the fix as below, so it's feasible to backport without having to muck 
about in arch code.



Signed-off-by: Jiaxun Yang 
---
  arch/powerpc/Kconfig | 1 -
  arch/riscv/Kconfig   | 1 -
  drivers/of/Kconfig   | 4 
  drivers/of/address.c | 2 +-
  4 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2c9cdf1d8761..c67e5da714f7 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -272,7 +272,6 @@ config PPC
select NEED_PER_CPU_PAGE_FIRST_CHUNKif PPC64
select NEED_SG_DMA_LENGTH
select OF
-   select OF_DMA_DEFAULT_COHERENT  if !NOT_COHERENT_CACHE
select OF_EARLY_FLATTREE
select OLD_SIGACTIONif PPC32
select OLD_SIGSUSPEND
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 1d46a268ce16..406c6816d289 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -119,7 +119,6 @@ config RISCV
select MODULES_USE_ELF_RELA if MODULES
select MODULE_SECTIONS if MODULES
select OF
-   select OF_DMA_DEFAULT_COHERENT
select OF_EARLY_FLATTREE
select OF_IRQ
select PCI_DOMAINS_GENERIC if PCI
diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index 644386833a7b..e40f10bf2ba4 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -102,8 +102,4 @@ config OF_OVERLAY
  config OF_NUMA
bool
  
-config OF_DMA_DEFAULT_COHERENT

-   # arches should select this if DMA is coherent by default for OF devices
-   bool
-
  endif # OF
diff --git a/drivers/of/address.c b/drivers/of/address.c
index 4c0b169ef9bf..23ade4919853 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -1103,7 +1103,7 @@ phys_addr_t __init of_dma_get_max_cpu_address(struct 
device_node *np)
  bool of_dma_is_coherent(struct device_node *np)
  {
struct device_node *node;
-   bool is_coherent = IS_ENABLED(CONFIG_OF_DMA_DEFAULT_COHERENT);
+   bool is_coherent = dma_default_coherent;


AFAICS, all you should actually need is a single self-contained addition 
here, something like:


+   /*
+* DT-based MIPS doesn't use OF_DMA_DEFAULT_COHERENT, but
+* might override the system-wide default at runtime.
+*/
+#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
+   defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
+   defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
+   is_coherent = dma_default_coherent;
+#endif

  
  	node = of_node_get(np);
  


Then *after* that's fixed, we can do a more comprehensive refactoring to 
merge the two mechanisms properly. FWIW I think I'd prefer an approach 
closer to the first one, where config options control the initial value 
of dma_default_coherent rather than architectures having to override it 
unconditionally (and TBH I'd also like to have a generic config symbol 
for whether an arch supports per-device coherency or not).
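
For illustration, the "config controls the initial value" approach might look roughly like this; the Kconfig symbol name and the placement are assumptions, not anything posted in this thread:

/* kernel/dma/mapping.c (hypothetical placement) */
#ifdef CONFIG_ARCH_DMA_DEFAULT_COHERENT		/* assumed generic symbol */
bool dma_default_coherent = true;
#else
bool dma_default_coherent;
#endif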


Thanks,
Robin.


Re: [PATCH 0/7] MIPS DMA coherence fixes

2023-02-22 Thread Robin Murphy

On 2023-02-22 13:04, Jiaxun Yang wrote:




On 2023-02-22 12:55, Robin Murphy wrote:

On 2023-02-21 19:55, Jiaxun Yang wrote:

On 2023-02-21 19:46, Robin Murphy wrote:

On 2023-02-21 18:15, Jiaxun Yang wrote:

On 2023-02-21 17:54, Christoph Hellwig wrote:

Can you explain the motivation here?  Also why riscv patches are at
the end of a mips fixes series?

Ah sorry for any confusion.
So the main purpose of this patch is to fix MIPS’s broken per-device coherency.
To be more precise, we want to be able to control the default coherency for all 
devices probed from
devicetree in early boot code.


Including the patch which actually does that would be helpful. As it is, 
patches 4-7 here just appear to be moving an option around for no practical 
effect.

Well, the effect is that the default coherency of devicetree-probed devices now 
follows dma_default_coherent
instead of a static Kconfig option. For the MIPS platform, dma_default_coherent 
will be determined by boot code.


"Will be" is the issue I'm getting at. We can't review some future promise of a 
patch, we can only review actual patches. And it's hard to meaningfully review 
preparatory patches for some change without the full context of that change.


Actually this is already present in current MIPS platform code.

arch/mips/mti-malta is setting dma_default_coherent on boot, and its 
devicetree does not explicitly specify coherency.


OK, this really needs to be explained much more clearly. I read this 
series as 3 actual fix patches, then 3 patches adding a new option to 
replace an existing one on the grounds that it "can be useful" for 
unspecified purposes, then a final cleanup patch removing the old option 
that has now been superseded.


Going back and looking closely I see there is actually a brief mention 
in the cleanup patch that it also happens to fix some issue, but even 
then it doesn't clearly explain what the issue really is or how and why 
the fix works and is appropriate.


Ideally, functional fixes and cleanup should be in distinct patches 
whenever that is reasonable. Sometimes the best fix is inherently a 
cleanup, but in such cases the patch should always be presented as the 
fix being its primary purpose. Please also use the cover letter to give 
reviewers an overview of the whole series if it's not merely a set of 
loosely-related patches that just happened to be convenient so send all 
together.


I think I do at least now understand the underlying problem well enough 
to have a think about whether this is the best way to address it.


Thanks,
Robin.


Re: [PATCH 0/7] MIPS DMA coherence fixes

2023-02-22 Thread Robin Murphy

On 2023-02-21 19:55, Jiaxun Yang wrote:




On 2023-02-21 19:46, Robin Murphy wrote:

On 2023-02-21 18:15, Jiaxun Yang wrote:

On 2023-02-21 17:54, Christoph Hellwig wrote:

Can you explain the motivation here?  Also why riscv patches are at
the end of a mips fixes series?

Ah sorry for any confusion.
So the main purpose of this patch is to fix MIPS’s broken per-device coherency.
To be more precise, we want to be able to control the default coherency for all 
devices probed from
devicetree in early boot code.


Including the patch which actually does that would be helpful. As it is, 
patches 4-7 here just appear to be moving an option around for no practical 
effect.


Well, the effect is that the default coherency of devicetree-probed devices now 
follows dma_default_coherent
instead of a static Kconfig option. For the MIPS platform, dma_default_coherent 
will be determined by boot code.


"Will be" is the issue I'm getting at. We can't review some future 
promise of a patch, we can only review actual patches. And it's hard to 
meaningfully review preparatory patches for some change without the full 
context of that change.


Thanks,
Robin.


Re: [PATCH 0/7] MIPS DMA coherence fixes

2023-02-21 Thread Robin Murphy

On 2023-02-21 18:15, Jiaxun Yang wrote:




On 2023-02-21 17:54, Christoph Hellwig wrote:

Can you explain the motivation here?  Also why riscv patches are at
the end of a mips fixes series?


Ah sorry for any confusion.

So the main purpose of this patch is to fix MIPS’s broken per-device coherency.
To be more precise, we want to be able to control the default coherency for all 
devices probed from
devicetree in early boot code.


Including the patch which actually does that would be helpful. As it is, 
patches 4-7 here just appear to be moving an option around for no 
practical effect.


Robin.


To achieve that I decided to reuse dma_default_coherent to set default 
coherency for devicetree.
And all later patches are serving this purpose.

Thanks
- Jiaxun


Re: [PATCH 4/7] dma-mapping: Always provide dma_default_coherent

2023-02-21 Thread Robin Murphy

On 2023-02-21 17:58, Christoph Hellwig wrote:

On Tue, Feb 21, 2023 at 12:46:10PM +, Jiaxun Yang wrote:

dma_default_coherent can be useful for determining default coherency
even on arches without noncoherent support.


How?


Indeed, "default" is conceptually meaningless when there is no possible 
alternative :/


Robin.


Re: [PATCH kernel] powerpc/iommu: Add simple iommu_ops to report capabilities

2022-07-05 Thread Robin Murphy

On 2022-07-05 07:22, Alexey Kardashevskiy wrote:

Historically PPC64 managed to avoid using iommu_ops. The VFIO driver
uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in
the Type1 VFIO driver. Recent development though has added a coherency
capability check to the generic part of VFIO and essentially disabled
VFIO on PPC64.

This adds a simple iommu_ops stub which reports support for cache
coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices,
this provides minimum code for the probing to not crash.


No more bus_set_iommu() please - I'll be sending a new version of this 
series very soon:


https://lore.kernel.org/linux-iommu/cover.1650890638.git.robin.mur...@arm.com/
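
With bus_set_iommu() gone, the same hookup would instead go through iommu_device_register(); a rough sketch only, where the NULL hwdev argument and the init-time placement are assumptions for illustration:

static struct iommu_device spapr_tce_iommu;

static int __init spapr_tce_iommu_init(void)
{
	/* Registers spapr_tce_iommu_ops with the core; in the reworked core
	 * this is what triggers probing of devices on the registered buses. */
	return iommu_device_register(&spapr_tce_iommu, &spapr_tce_iommu_ops, NULL);
}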

Cheers,
Robin.


The previous discussion is here:
https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/

Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence")
Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices")
Signed-off-by: Alexey Kardashevskiy 
---

I have not looked into the domains for ages, what is missing here? With this
on top of 5.19-rc1 VFIO works again on my POWER9 box. Thanks,

---
  arch/powerpc/include/asm/iommu.h |  2 +
  arch/powerpc/kernel/iommu.c  | 70 
  arch/powerpc/kernel/pci_64.c |  3 ++
  3 files changed, 75 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7e29c73e3dd4..4bdae0ee29d0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm,
enum dma_data_direction *direction);
  extern void iommu_tce_kill(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
+
+extern const struct iommu_ops spapr_tce_iommu_ops;
  #else
  static inline void iommu_register_group(struct iommu_table_group *table_group,
int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 7e56ddb3e0b9..2205b448f7d5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1176,4 +1176,74 @@ void iommu_del_device(struct device *dev)
iommu_group_remove_device(dev);
  }
  EXPORT_SYMBOL_GPL(iommu_del_device);
+
+/*
+ * A simple iommu_ops to allow less cruft in generic VFIO code.
+ */
+static bool spapr_tce_iommu_capable(enum iommu_cap cap)
+{
+   switch (cap) {
+   case IOMMU_CAP_CACHE_COHERENCY:
+   return true;
+   default:
+   break;
+   }
+
+   return false;
+}
+
+static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type)
+{
+   struct iommu_domain *domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+
+   if (!domain)
+   return NULL;
+
+   domain->geometry.aperture_start = 0;
+   domain->geometry.aperture_end = ~0ULL;
+   domain->geometry.force_aperture = true;
+
+   return domain;
+}
+
+static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev)
+{
+   struct iommu_device *iommu_dev = kzalloc(sizeof(struct iommu_device), 
GFP_KERNEL);
+
+   iommu_dev->dev = dev;
+   iommu_dev->ops = &spapr_tce_iommu_ops;
+
+   return iommu_dev;
+}
+
+static void spapr_tce_iommu_release_device(struct device *dev)
+{
+}
+
+static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
+ struct device *dev)
+{
+   return 0;
+}
+
+static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev)
+{
+   struct iommu_group *grp = dev->iommu_group;
+
+   if (!grp)
+   grp = ERR_PTR(-ENODEV);
+   return grp;
+}
+
+const struct iommu_ops spapr_tce_iommu_ops = {
+   .capable = spapr_tce_iommu_capable,
+   .domain_alloc = spapr_tce_iommu_domain_alloc,
+   .probe_device = spapr_tce_iommu_probe_device,
+   .release_device = spapr_tce_iommu_release_device,
+   .device_group = spapr_tce_iommu_device_group,
+   .default_domain_ops = &(const struct iommu_domain_ops) {
+   .attach_dev = spapr_tce_iommu_attach_dev,
+   }
+};
+
  #endif /* CONFIG_IOMMU_API */
diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index 19b03ddf5631..04bc0c52e45c 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -20,6 +20,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
@@ -27,6 +28,7 @@
  #include 
  #include 
  #include 
+#include 
  
  /* pci_io_base -- the base address from which io bars are offsets.

   * This is the lowest I/O base address (so bar values are always positive),
@@ -69,6 +71,7 @@ static int __init pcibios_init(void)
ppc_md.pcibios_fixup();
  
  	printk(KERN_DEBUG "PCI: Probing PCI hardware done\n");

+   bus_set_iommu(&pci_bus_type, &spapr_tce_iommu_ops);
  
  	return 0;

  }


Re: [PATCH v2 2/3] scsi: BusLogic remove bus_to_virt

2022-06-17 Thread Robin Murphy

On 2022-06-17 13:57, Arnd Bergmann wrote:

From: Arnd Bergmann 

The BusLogic driver is the last remaining driver that relies on the
deprecated bus_to_virt() function, which in turn only works on a few
architectures, and is incompatible with both swiotlb and iommu support.

Before commit 391e2f25601e ("[SCSI] BusLogic: Port driver to 64-bit."),
the driver had a dependency on x86-32, presumably because of this
problem. However, the change introduced another bug that made it still
impossible to use the driver on any 64-bit machine.

This was in turn fixed in commit 56f396146af2 ("scsi: BusLogic: Fix
64-bit system enumeration error for Buslogic"), 8 years later, which
shows that there are not a lot of users.

Maciej is still using the driver on 32-bit hardware, and Khalid mentioned
that the driver works with the device emulation used in VirtualBox
and VMware. Both of those only emulate it for Windows 2000 and older
operating systems that did not ship with the better LSI logic driver.

Do a minimum fix that searches through the list of descriptors to find
one that matches the bus address. This is clearly as inefficient as
was indicated in the code comment about the lack of a bus_to_virt()
replacement. A better fix would likely involve changing out the entire
descriptor allocation for a simpler one, but that would be much
more invasive.


FWIW, if efficiency *is* a practical concern, even under the current 
allocation scheme it looks like there are only 4 actual DMA allocations 
to search through. From a quick scan (since it's too hot here not to get 
distracted...), if I wanted to optimise this in future I'd probably 
remove the semi-redundant allocgrp_* fields from struct blogic_ccb and 
hang a dedicated list of the block allocations off the adapter - at that 
point the lookup could likely already be more efficient than a 
theoretical dma_to_virt() interface would be if it had to go off and 
walk an IOMMU pagetable. Then the next question would be whether it's 
viable to make a single 32KB allocation rather than 4*8KB, so it's no longer
even a list.
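
For illustration, such a per-adapter block list might look something like this
- all names here are invented and this is only a sketch of the idea, not a
proposed patch:

struct blogic_ccb_blk {
	struct blogic_ccb_blk *next;
	void *cpu_addr;		/* kernel address of one ~8KB CCB block */
	dma_addr_t dma_handle;	/* bus address of the same block */
	size_t size;
};

static struct blogic_ccb *
blogic_busaddr_to_ccb(struct blogic_adapter *adapter, dma_addr_t addr)
{
	struct blogic_ccb_blk *blk;

	/* only a handful of block allocations to walk, not every CCB */
	for (blk = adapter->ccb_blks; blk; blk = blk->next)
		if (addr >= blk->dma_handle &&
		    addr < blk->dma_handle + blk->size)
			return blk->cpu_addr + (addr - blk->dma_handle);

	return NULL;
}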


For now, though, I agree with the simple change that's clear and easy to 
reason about:


Reviewed-by: Robin Murphy 

Thanks,
Robin.


Cc: Maciej W. Rozycki 
Cc: Matt Wang 
Cc: Khalid Aziz 
Signed-off-by: Arnd Bergmann 
---
  drivers/scsi/BusLogic.c | 27 ---
  drivers/scsi/Kconfig|  2 +-
  2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/BusLogic.c b/drivers/scsi/BusLogic.c
index a897c8f914cf..d057abfcdd5c 100644
--- a/drivers/scsi/BusLogic.c
+++ b/drivers/scsi/BusLogic.c
@@ -2515,12 +2515,26 @@ static int blogic_resultcode(struct blogic_adapter 
*adapter,
return (hoststatus << 16) | tgt_status;
  }
  
+/*

+ * turn the dma address from an inbox into a ccb pointer
+ * This is rather inefficient.
+ */
+static struct blogic_ccb *
+blogic_inbox_to_ccb(struct blogic_adapter *adapter, struct blogic_inbox *inbox)
+{
+   struct blogic_ccb *ccb;
+
+   for (ccb = adapter->all_ccbs; ccb; ccb = ccb->next_all)
+   if (inbox->ccb == ccb->dma_handle)
+   break;
+
+   return ccb;
+}
  
  /*

blogic_scan_inbox scans the Incoming Mailboxes saving any
Incoming Mailbox entries for completion processing.
  */
-
  static void blogic_scan_inbox(struct blogic_adapter *adapter)
  {
/*
@@ -2540,16 +2554,7 @@ static void blogic_scan_inbox(struct blogic_adapter 
*adapter)
enum blogic_cmplt_code comp_code;
  
  	while ((comp_code = next_inbox->comp_code) != BLOGIC_INBOX_FREE) {

-   /*
-  We are only allowed to do this because we limit our
-  architectures we run on to machines where bus_to_virt(
-  actually works.  There *needs* to be a dma_addr_to_virt()
-  in the new PCI DMA mapping interface to replace
-  bus_to_virt() or else this code is going to become very
-  innefficient.
-*/
-   struct blogic_ccb *ccb =
-   (struct blogic_ccb *) bus_to_virt(next_inbox->ccb);
+   struct blogic_ccb *ccb = blogic_inbox_to_ccb(adapter, 
adapter->next_inbox);
if (comp_code != BLOGIC_CMD_NOTFOUND) {
if (ccb->status == BLOGIC_CCB_ACTIVE ||
ccb->status == BLOGIC_CCB_RESET) {
diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index cf75588a2587..56bdc08d0b77 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -513,7 +513,7 @@ config SCSI_HPTIOP
  
  config SCSI_BUSLOGIC

tristate "BusLogic SCSI support"
-   depends on PCI && SCSI && VIRT_TO_BUS
+   depends on PCI && SCSI
help
  This is support for BusLogic MultiMaster and FlashPoint SCSI Host
  Adapters. Consult the SCSI-HOWTO, available from


Re: [PATCH 08/14] arm64: simplify access_ok()

2022-02-14 Thread Robin Murphy

On 2022-02-14 16:34, Arnd Bergmann wrote:

From: Arnd Bergmann 

arm64 has an inline asm implementation of access_ok() that is derived from
the 32-bit arm version and optimized for the case that both the limit and
the size are variable. With set_fs() gone, the limit is always constant,
and the size usually is as well, so just using the default implementation
reduces the check into a comparison against a constant that can be
scheduled by the compiler.


Aww, I still vividly remember the birth of this madness, sat with my 
phone on a Saturday morning waiting for my bike to be MOT'd, staring at 
the 7-instruction sequence that Mark and I had come up with and certain 
that it could be shortened still. Kinda sad to see it go, but at the 
same time, glad that it can.


Acked-by: Robin Murphy 


On a defconfig build, this saves over 28KB of .text.


Not to mention saving those "WTF is going on there... oh yeah, 
access_ok()" moments when looking through disassembly :)


Cheers,
Robin.


Signed-off-by: Arnd Bergmann 
---
  arch/arm64/include/asm/uaccess.h | 28 +---
  1 file changed, 5 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 357f7bd9c981..e8dce0cc5eaa 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -26,6 +26,8 @@
  #include 
  #include 
  
+static inline int __access_ok(const void __user *ptr, unsigned long size);

+
  /*
   * Test whether a block of memory is a valid user space address.
   * Returns 1 if the range is valid, 0 otherwise.
@@ -33,10 +35,8 @@
   * This is equivalent to the following test:
   * (u65)addr + (u65)size <= (u65)TASK_SIZE_MAX
   */
-static inline unsigned long __access_ok(const void __user *addr, unsigned long 
size)
+static inline int access_ok(const void __user *addr, unsigned long size)
  {
-   unsigned long ret, limit = TASK_SIZE_MAX - 1;
-
/*
 * Asynchronous I/O running in a kernel thread does not have the
 * TIF_TAGGED_ADDR flag of the process owning the mm, so always untag
@@ -46,27 +46,9 @@ static inline unsigned long __access_ok(const void __user 
*addr, unsigned long s
(current->flags & PF_KTHREAD || test_thread_flag(TIF_TAGGED_ADDR)))
addr = untagged_addr(addr);
  
-	__chk_user_ptr(addr);

-   asm volatile(
-   // A + B <= C + 1 for all A,B,C, in four easy steps:
-   // 1: X = A + B; X' = X % 2^64
-   "  adds%0, %3, %2\n"
-   // 2: Set C = 0 if X > 2^64, to guarantee X' > C in step 4
-   "  csel%1, xzr, %1, hi\n"
-   // 3: Set X' = ~0 if X >= 2^64. For X == 2^64, this decrements X'
-   //to compensate for the carry flag being set in step 4. For
-   //X > 2^64, X' merely has to remain nonzero, which it does.
-   "  csinv   %0, %0, xzr, cc\n"
-   // 4: For X < 2^64, this gives us X' - C - 1 <= 0, where the -1
-   //comes from the carry in being clear. Otherwise, we are
-   //testing X' - C == 0, subject to the previous adjustments.
-   "  sbcsxzr, %0, %1\n"
-   "  cset%0, ls\n"
-   : "=" (ret), "+r" (limit) : "Ir" (size), "0" (addr) : "cc");
-
-   return ret;
+   return likely(__access_ok(addr, size));
  }
-#define __access_ok __access_ok
+#define access_ok access_ok
  
  #include 
  


Re: [PATCH] soc: fsl: dpio: protect smp_processor_id when get processor id

2021-10-15 Thread Robin Murphy

On 2021-10-15 07:36, meng...@windriver.com wrote:

From: Meng Li 

When debug kernel configs are enabled, there will be a call trace as below:

BUG: using smp_processor_id() in preemptible [] code: swapper/0/1
caller is debug_smp_processor_id+0x20/0x30
CPU: 6 PID: 1 Comm: swapper/0 Not tainted 5.10.63-yocto-standard #1
Hardware name: NXP Layerscape LX2160ARDB (DT)
Call trace:
  dump_backtrace+0x0/0x1a0
  show_stack+0x24/0x30
  dump_stack+0xf0/0x13c
  check_preemption_disabled+0x100/0x110
  debug_smp_processor_id+0x20/0x30
  dpaa2_io_query_fq_count+0xdc/0x154
  dpaa2_eth_stop+0x144/0x314
  __dev_close_many+0xdc/0x160
  __dev_change_flags+0xe8/0x220
  dev_change_flags+0x30/0x70
  ic_close_devs+0x50/0x78
  ip_auto_config+0xed0/0xf10
  do_one_initcall+0xac/0x460
  kernel_init_freeable+0x30c/0x378
  kernel_init+0x20/0x128
  ret_from_fork+0x10/0x38

Because smp_processor_id() should only be invoked with preemption disabled,
add preempt_disable()/preempt_enable() to protect it.


If preemption doesn't matter anyway, as the comment in the context 
implies, then it probably makes more sense just to use 
raw_smp_processor_id() instead.
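
A minimal sketch of that alternative (untested, just to illustrate the
suggestion) in service_select_by_cpu():

	if (cpu < 0)
		cpu = raw_smp_processor_id();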


Robin.


Fixes: c89105c9b390 ("staging: fsl-mc: Move DPIO from staging to 
drivers/soc/fsl")
Cc: sta...@vger.kernel.org
Signed-off-by: Meng Li 
---
  drivers/soc/fsl/dpio/dpio-service.c | 7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/soc/fsl/dpio/dpio-service.c 
b/drivers/soc/fsl/dpio/dpio-service.c
index 19f47ea9dab0..afc3b89b0fc5 100644
--- a/drivers/soc/fsl/dpio/dpio-service.c
+++ b/drivers/soc/fsl/dpio/dpio-service.c
@@ -58,8 +58,11 @@ static inline struct dpaa2_io *service_select_by_cpu(struct 
dpaa2_io *d,
 * If cpu == -1, choose the current cpu, with no guarantees about
 * potentially being migrated away.
 */
-   if (cpu < 0)
-   cpu = smp_processor_id();
+if (cpu < 0) {
+preempt_disable();
+cpu = smp_processor_id();
+preempt_enable();
+}
  
  	/* If a specific cpu was requested, pick it up immediately */

return dpio_by_cpu[cpu];



Re: [PATCH] swiotlb: set IO TLB segment size via cmdline

2021-09-17 Thread Robin Murphy

On 2021-09-17 10:36, Roman Skakun wrote:

Hi, Christoph

I use Xen PV display. In my case, the PV display backend (Dom0) allocates a
contiguous buffer via the DMA API to implement zero-copy between Dom0 and DomU.


Well, something's gone badly wrong there - if you have to shadow the 
entire thing in a bounce buffer to import it then it's hardly zero-copy, 
is it? If you want to do buffer sharing the buffer really needs to be 
allocated appropriately to begin with, such that all relevant devices 
can access it directly. That might be something which needs fixing in Xen.


Robin.


When I start Weston under DomU, I got the next log in Dom0:
```
[ 112.554471] CPU: 0 PID: 367 Comm: weston Tainted: G O
5.10.0-yocto-standard+ #312
[ 112.575149] Call trace:
[ 112.577666] dump_backtrace+0x0/0x1b0
[ 112.581373] show_stack+0x18/0x70
[ 112.584746] dump_stack+0xd0/0x12c
[ 112.588200] swiotlb_tbl_map_single+0x234/0x360
[ 112.592781] xen_swiotlb_map_page+0xe4/0x4c0
[ 112.597095] xen_swiotlb_map_sg+0x84/0x12c
[ 112.601249] dma_map_sg_attrs+0x54/0x60
[ 112.605138] vsp1_du_map_sg+0x30/0x60
[ 112.608851] rcar_du_vsp_map_fb+0x134/0x170
[ 112.613082] rcar_du_vsp_plane_prepare_fb+0x44/0x64
[ 112.618007] drm_atomic_helper_prepare_planes+0xac/0x160
[ 112.623362] drm_atomic_helper_commit+0x88/0x390
[ 112.628029] drm_atomic_nonblocking_commit+0x4c/0x60
[ 112.633043] drm_mode_atomic_ioctl+0x9a8/0xb0c
[ 112.637532] drm_ioctl_kernel+0xc4/0x11c
[ 112.641506] drm_ioctl+0x21c/0x460
[ 112.644967] __arm64_sys_ioctl+0xa8/0xf0
[ 112.648939] el0_svc_common.constprop.0+0x78/0x1a0
[ 112.653775] do_el0_svc+0x24/0x90
[ 112.657148] el0_svc+0x14/0x20
[ 112.660254] el0_sync_handler+0x1a4/0x1b0
[ 112.664315] el0_sync+0x174/0x180
[ 112.668145] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
3686400 bytes), total 65536 (slots), used 112 (slots)
```
The problem is happened here:
https://elixir.bootlin.com/linux/v5.14.4/source/drivers/gpu/drm/rcar-du/rcar_du_vsp.c#L202

Sgt was created in dma_get_sgtable() by dma_common_get_sgtable() and
includes a single page chunk
as shown here:
https://elixir.bootlin.com/linux/v5.14.5/source/kernel/dma/ops_helpers.c#L18

After creating a new sgt, we tried to map this sgt through vsp1_du_map_sg().
Internally, vsp1_du_map_sg() using ops->map_sg (e.g
xen_swiotlb_map_sg) to perform
mapping.

I realized that the required segment is too big to fit in the default
swiotlb segment, so the condition
https://elixir.bootlin.com/linux/latest/source/kernel/dma/swiotlb.c#L474
is always false.

I know that I use a large buffer, but why can't I map this buffer in one chunk?

Thanks!

On Wed, Sep 15, 2021 at 16:53, Christoph Hellwig wrote:


On Wed, Sep 15, 2021 at 03:49:52PM +0200, Jan Beulich wrote:

But the question remains: Why does the framebuffer need to be mapped
in a single giant chunk?


More importantly: if you use dynamic dma mappings for your framebuffer
you're doing something wrong.






Re: [PATCH v1 16/16] dma-mapping: Disallow .map_sg operations from returning zero on error

2021-07-16 Thread Robin Murphy

On 2021-07-16 07:33, Christoph Hellwig wrote:

On Thu, Jul 15, 2021 at 10:45:44AM -0600, Logan Gunthorpe wrote:

@@ -194,6 +194,8 @@ static int __dma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
  
+	WARN_ON_ONCE(ents == 0);


Turns this into a negative error code while we're at it, just to keep
the callers sane?


Right, by this point returning the 0 would pass through 
dma_map_sg_attrs() OK, but AFAICS dma_map_sgtable() would now get 
confused and return success but with sgt->nents = 0. Coercing it to an 
error code (which dma_map_sg_attrs() would then just change right back) 
seems sensible for the sake of easy robustness.
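
Something along these lines, perhaps - an illustrative sketch only, not the
eventual patch:

	/* after the ops->map_sg() / dma_direct_map_sg() call: */
	if (WARN_ON_ONCE(ents == 0))
		ents = -EIO;	/* don't let a bogus 0 leak out as "success" */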


Robin.


Re: [PATCH v1 14/16] x86/amd_gart: return error code from gart_map_sg()

2021-07-16 Thread Robin Murphy

On 2021-07-16 07:32, Christoph Hellwig wrote:

On Thu, Jul 15, 2021 at 10:45:42AM -0600, Logan Gunthorpe wrote:

@@ -458,7 +460,7 @@ static int gart_map_sg(struct device *dev, struct 
scatterlist *sg, int nents,
iommu_full(dev, pages << PAGE_SHIFT, dir);
for_each_sg(sg, s, nents, i)
s->dma_address = DMA_MAPPING_ERROR;
-   return 0;
+   return ret;


While we're at it - setting the ->dma_address to DMA_MAPPING_ERROR is
not part of the ->map_sg calling convention.  Might be worth to fix
up while we're at it.


Especially since it's not even zeroing dma_length, which at least is a 
documented indicator of validity (even if it's not strictly defined for 
failure cases either).
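
For illustration, a sketch of what tidying that error path might look like
(not a proposed patch):

	for_each_sg(sg, s, nents, i) {
		s->dma_address = DMA_MAPPING_ERROR;
		sg_dma_len(s) = 0;	/* the documented "not valid" indicator */
	}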


Robin.


Re: [PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-07-06 Thread Robin Murphy

On 2021-07-06 15:05, Christoph Hellwig wrote:

On Tue, Jul 06, 2021 at 03:01:04PM +0100, Robin Murphy wrote:

FWIW I was pondering the question of whether to do something along those
lines or just scrap the default assignment entirely, so since I hadn't got
round to saying that I've gone ahead and hacked up the alternative
(similarly untested) for comparison :)

TBH I'm still not sure which one I prefer...


Claire did implement something like your suggestion originally, but
I don't really like it as it doesn't scale for adding multiple global
pools, e.g. for the 64-bit addressable one for the various encrypted
secure guest schemes.


Ah yes, that had slipped my mind, and it's a fair point indeed. Since 
we're not concerned with a minimal fix for backports anyway I'm more 
than happy to focus on Will's approach. Another thing is that that looks 
to take us a quiet step closer to the possibility of dynamically 
resizing a SWIOTLB pool, which is something that some of the hypervisor 
protection schemes looking to build on top of this series may want to 
explore at some point.


Robin.


Re: [PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-07-06 Thread Robin Murphy

On 2021-07-06 14:24, Will Deacon wrote:

On Tue, Jul 06, 2021 at 06:48:48AM +0200, Christoph Hellwig wrote:

On Mon, Jul 05, 2021 at 08:03:52PM +0100, Will Deacon wrote:

So at this point, the AMD IOMMU driver does:

swiotlb= (iommu_default_passthrough() || sme_me_mask) ? 1 : 0;

where 'swiotlb' is a global variable indicating whether or not swiotlb
is in use. It's picked up a bit later on by pci_swiotlb_late_init(), which
will call swiotlb_exit() if 'swiotlb' is false.

Now, that used to work fine, because swiotlb_exit() clears
'io_tlb_default_mem' to NULL, but now with the restricted DMA changes, I
think that all the devices which have successfully probed beforehand will
have stale pointers to the freed structure in their 'dev->dma_io_tlb_mem'
field.


Yeah.  I don't think we can do that anymore, and I also think it is
a bad idea to start with.


I've had a crack at reworking things along the following lines:

   - io_tlb_default_mem now lives in the BSS, the flexible array member
 is now a pointer and that part is allocated dynamically (downside of
 this is an extra indirection to get at the slots).

   - io_tlb_default_mem.nslabs tells you whether the thing is valid

   - swiotlb_exit() frees the slots array and clears the rest of the
 structure to 0. I also extended it to free the actual slabs, but I'm
 not sure why it wasn't doing that before.

So a non-NULL dev->dma_io_tlb_mem should always be valid to follow.


FWIW I was pondering the question of whether to do something along those 
lines or just scrap the default assignment entirely, so since I hadn't 
got round to saying that I've gone ahead and hacked up the alternative 
(similarly untested) for comparison :)


TBH I'm still not sure which one I prefer...

Robin.

->8-
diff --git a/drivers/base/core.c b/drivers/base/core.c
index ea5b85354526..394abf184c1a 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2847,9 +2847,6 @@ void device_initialize(struct device *dev)
 defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
dev->dma_coherent = dma_default_coherent;
 #endif
-#ifdef CONFIG_SWIOTLB
-   dev->dma_io_tlb_mem = io_tlb_default_mem;
-#endif
 }
 EXPORT_SYMBOL_GPL(device_initialize);

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 39284ff2a6cd..620f16d89a98 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -107,16 +107,21 @@ struct io_tlb_mem {
 };
 extern struct io_tlb_mem *io_tlb_default_mem;

+static inline struct io_tlb_mem *dev_iotlb_mem(struct device *dev)
+{
+   return dev->dma_io_tlb_mem ?: io_tlb_default_mem;
+}
+
 static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t 
paddr)

 {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev_iotlb_mem(dev);

return mem && paddr >= mem->start && paddr < mem->end;
 }

 static inline bool is_swiotlb_force_bounce(struct device *dev)
 {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev_iotlb_mem(dev);

return mem && mem->force_bounce;
 }
@@ -167,7 +172,7 @@ bool swiotlb_free(struct device *dev, struct page 
*page, size_t size);


 static inline bool is_swiotlb_for_alloc(struct device *dev)
 {
-   return dev->dma_io_tlb_mem->for_alloc;
+   return dev_iotlb_mem(dev)->for_alloc;
 }
 #else
 static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index b7f76bca89bf..f4942149f87d 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -359,7 +359,7 @@ static unsigned int swiotlb_align_offset(struct 
device *dev, u64 addr)
 static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, 
size_t size,

   enum dma_data_direction dir)
 {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev_iotlb_mem(dev);
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
@@ -440,7 +440,7 @@ static unsigned int wrap_index(struct io_tlb_mem 
*mem, unsigned int index)

 static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
  size_t alloc_size)
 {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev_iotlb_mem(dev);
unsigned long boundary_mask = dma_get_seg_boundary(dev);
dma_addr_t tbl_dma_addr =
phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -522,7 +522,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device 
*dev, phys_addr_t orig_addr,

size_t mapping_size, size_t alloc_size,
enum dma_data_direction dir, unsigned long attrs)
 {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev_iotlb_mem(dev);
unsigned int offset = 

Re: [PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-07-02 Thread Robin Murphy

On 2021-07-02 14:58, Will Deacon wrote:

Hi Nathan,

On Thu, Jul 01, 2021 at 12:52:20AM -0700, Nathan Chancellor wrote:

On 7/1/2021 12:40 AM, Will Deacon wrote:

On Wed, Jun 30, 2021 at 08:56:51AM -0700, Nathan Chancellor wrote:

On Wed, Jun 30, 2021 at 12:43:48PM +0100, Will Deacon wrote:

On Wed, Jun 30, 2021 at 05:17:27PM +0800, Claire Chang wrote:

`BUG: unable to handle page fault for address: 003a8290` and
the fact it crashed at `_raw_spin_lock_irqsave` look like the memory
(maybe dev->dma_io_tlb_mem) was corrupted?
The dev->dma_io_tlb_mem should be set here
(https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/pci/probe.c#n2528)
through device_initialize.


I'm less sure about this. 'dma_io_tlb_mem' should be pointing at
'io_tlb_default_mem', which is a page-aligned allocation from memblock.
The spinlock is at offset 0x24 in that structure, and looking at the
register dump from the crash:

Jun 29 18:28:42 hp-4300G kernel: RSP: 0018:adb4013db9e8 EFLAGS: 00010006
Jun 29 18:28:42 hp-4300G kernel: RAX: 003a8290 RBX:  
RCX: 8900572ad580
Jun 29 18:28:42 hp-4300G kernel: RDX: 89005653f024 RSI: 000c 
RDI: 1d17
Jun 29 18:28:42 hp-4300G kernel: RBP: 0a20d000 R08: 000c 
R09: 
Jun 29 18:28:42 hp-4300G kernel: R10: 0a20d000 R11: 89005653f000 
R12: 0212
Jun 29 18:28:42 hp-4300G kernel: R13: 1000 R14: 0002 
R15: 0020
Jun 29 18:28:42 hp-4300G kernel: FS:  7f1f8898ea40() 
GS:89005728() knlGS:
Jun 29 18:28:42 hp-4300G kernel: CS:  0010 DS:  ES:  CR0: 
80050033
Jun 29 18:28:42 hp-4300G kernel: CR2: 003a8290 CR3: 0001020d 
CR4: 00350ee0
Jun 29 18:28:42 hp-4300G kernel: Call Trace:
Jun 29 18:28:42 hp-4300G kernel:  _raw_spin_lock_irqsave+0x39/0x50
Jun 29 18:28:42 hp-4300G kernel:  swiotlb_tbl_map_single+0x12b/0x4c0

Then that correlates with R11 holding the 'dma_io_tlb_mem' pointer and
RDX pointing at the spinlock. Yet RAX is holding junk :/

I agree that enabling KASAN would be a good idea, but I also think we
probably need to get some more information out of swiotlb_tbl_map_single()
to see see what exactly is going wrong in there.


I can certainly enable KASAN and if there is any debug print I can add
or dump anything, let me know!


I bit the bullet and took v5.13 with swiotlb/for-linus-5.14 merged in, built
x86 defconfig and ran it on my laptop. However, it seems to work fine!

Please can you share your .config?


Sure thing, it is attached. It is just Arch Linux's config run through
olddefconfig. The original is below in case you need to diff it.

https://raw.githubusercontent.com/archlinux/svntogit-packages/9045405dc835527164f3034b3ceb9a67c7a53cd4/trunk/config

If there is anything more that I can provide, please let me know.


I eventually got this booting (for some reason it was causing LD to SEGV
trying to link it for a while...) and sadly it works fine on my laptop. Hmm.

Did you manage to try again with KASAN?

It might also be worth taking the IOMMU out of the equation, since that
interfaces differently with SWIOTLB and I couldn't figure out the code path
from the log you provided. What happens if you boot with "amd_iommu=off
swiotlb=force"?


Oh, now there's a thing... the chat from the IOMMU API in the boot log 
implies that the IOMMU *should* be in the picture - we see that default 
domains are IOMMU_DOMAIN_DMA default and the GPU :0c:00.0 was added 
to a group. That means dev->dma_ops should be set and DMA API calls 
should be going through iommu-dma, yet the callstack in the crash says 
we've gone straight from dma_map_page_attrs() to swiotlb_map(), implying 
the inline dma_direct_map_page() path.


If dev->dma_ops didn't look right in the first place, it's perhaps less 
surprising that dev->dma_io_tlb_mem might be wild as well. It doesn't 
seem plausible that we should have a race between initialising the 
device and probing its driver, so maybe the whole dev pointer is getting 
trampled earlier in the callchain (or is fundamentally wrong to begin 
with, but from a quick skim of the amdgpu code it did look like 
adev->dev and adev->pdev are appropriately set early on by 
amdgpu_pci_probe()).



(although word of warning here: i915 dies horribly on my laptop if I pass
swiotlb=force, even with the distro 5.10 kernel)


FWIW I'd imagine you probably need to massively increase the SWIOTLB 
buffer size to have hope of that working.


Robin.


Re: [PATCH v15 12/12] of: Add plumbing for restricted DMA pool

2021-07-02 Thread Robin Murphy

On 2021-07-02 04:08, Guenter Roeck wrote:

Hi,

On Thu, Jun 24, 2021 at 11:55:26PM +0800, Claire Chang wrote:

If a device is not behind an IOMMU, we look up the device node and set
up the restricted DMA when the restricted-dma-pool is presented.

Signed-off-by: Claire Chang 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 


With this patch in place, all sparc and sparc64 qemu emulations
fail to boot. Symptom is that the root file system is not found.
Reverting this patch fixes the problem. Bisect log is attached.


Ah, OF_ADDRESS depends on !SPARC, so of_dma_configure_id() is presumably 
returning an unexpected -ENODEV from the of_dma_set_restricted_buffer() 
stub. That should probably be returning 0 instead, since either way it's 
not an error condition for it to simply do nothing.
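
i.e. a minimal sketch of the suggested fix to the !CONFIG_OF_ADDRESS stub in
drivers/of/of_private.h:

static inline int of_dma_set_restricted_buffer(struct device *dev)
{
	/* nothing to do without OF address support - not an error */
	return 0;
}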


Robin.



Guenter

---
# bad: [fb0ca446157a86b75502c1636b0d81e642fe6bf1] Add linux-next specific files 
for 20210701
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect start 'HEAD' 'v5.13'
# bad: [f63c4fda987a19b1194cc45cb72fd5bf968d9d90] Merge remote-tracking branch 
'rdma/for-next'
git bisect bad f63c4fda987a19b1194cc45cb72fd5bf968d9d90
# good: [46bb5dd1d2a63e906e374e97dfd4a5e33934b1c4] Merge remote-tracking branch 
'ipsec/master'
git bisect good 46bb5dd1d2a63e906e374e97dfd4a5e33934b1c4
# good: [43ba6969cfb8185353a7a6fc79070f13b9e3d6d3] Merge remote-tracking branch 
'clk/clk-next'
git bisect good 43ba6969cfb8185353a7a6fc79070f13b9e3d6d3
# good: [1ca5eddcf8dca1d6345471c6404e7364af0d7019] Merge remote-tracking branch 
'fuse/for-next'
git bisect good 1ca5eddcf8dca1d6345471c6404e7364af0d7019
# good: [8f6d7b3248705920187263a4e7147b0752ec7dcf] Merge remote-tracking branch 
'pci/next'
git bisect good 8f6d7b3248705920187263a4e7147b0752ec7dcf
# good: [df1885a755784da3ef285f36d9230c1d090ef186] RDMA/rtrs_clt: Alloc less 
memory with write path fast memory registration
git bisect good df1885a755784da3ef285f36d9230c1d090ef186
# good: [93d31efb58c8ad4a66bbedbc2d082df458c04e45] Merge remote-tracking branch 
'cpufreq-arm/cpufreq/arm/linux-next'
git bisect good 93d31efb58c8ad4a66bbedbc2d082df458c04e45
# good: [46308965ae6fdc7c25deb2e8c048510ae51bbe66] RDMA/irdma: Check contents 
of user-space irdma_mem_reg_req object
git bisect good 46308965ae6fdc7c25deb2e8c048510ae51bbe66
# good: [6de7a1d006ea9db235492b288312838d6878385f] 
thermal/drivers/int340x/processor_thermal: Split enumeration and processing part
git bisect good 6de7a1d006ea9db235492b288312838d6878385f
# good: [081bec2577cda3d04f6559c60b6f4e2242853520] dt-bindings: of: Add 
restricted DMA pool
git bisect good 081bec2577cda3d04f6559c60b6f4e2242853520
# good: [bf95ac0bcd69979af146852f6a617a60285ebbc1] Merge remote-tracking branch 
'thermal/thermal/linux-next'
git bisect good bf95ac0bcd69979af146852f6a617a60285ebbc1
# good: [3d8287544223a3d2f37981c1f9ffd94d0b5e9ffc] RDMA/core: Always release 
restrack object
git bisect good 3d8287544223a3d2f37981c1f9ffd94d0b5e9ffc
# bad: [cff1f23fad6e0bd7d671acce0d15285c709f259c] Merge remote-tracking branch 
'swiotlb/linux-next'
git bisect bad cff1f23fad6e0bd7d671acce0d15285c709f259c
# bad: [b655006619b7bccd0dc1e055bd72de5d613e7b5c] of: Add plumbing for 
restricted DMA pool
git bisect bad b655006619b7bccd0dc1e055bd72de5d613e7b5c
# first bad commit: [b655006619b7bccd0dc1e055bd72de5d613e7b5c] of: Add plumbing 
for restricted DMA pool



Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Robin Murphy

On 2021-06-24 12:18, Will Deacon wrote:

On Thu, Jun 24, 2021 at 12:14:39PM +0100, Robin Murphy wrote:

On 2021-06-24 07:05, Claire Chang wrote:

On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig  wrote:


On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:

is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119

is_swiotlb_force_bounce() was the new function introduced in this patch here.

+static inline bool is_swiotlb_force_bounce(struct device *dev)
+{
+ return dev->dma_io_tlb_mem->force_bounce;
+}


To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
turn this into :

  return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;

for a quick debug check?


I just realized that dma_io_tlb_mem might be NULL like Christoph
pointed out since swiotlb might not get initialized.
However,  `Unable to handle kernel paging request at virtual address
dfff800e` looks more like the address is garbage rather than
NULL?
I wonder if that's because dev->dma_io_tlb_mem is not assigned
properly (which means device_initialize is not called?).


What also looks odd is that the base "address" 0xdfff8000 is held in
a couple of registers, but the offset 0xe looks too small to match up to any
relevant structure member in that dereference chain :/


FWIW, I've managed to trigger a NULL dereference locally when swiotlb hasn't
been initialised but we dereference 'dev->dma_io_tlb_mem', so I think
Christoph's suggestion is needed regardless.


Ack to that - for SWIOTLB_NO_FORCE, io_tlb_default_mem will remain NULL. 
The massive jump in KernelCI baseline failures as of yesterday looks 
like every arm64 machine with less than 4GB of RAM blowing up...


Robin.


But I agree that it won't help
with the issue reported by Qian Cai.

Qian Cai: please can you share your .config and your command line?

Thanks,

Will



Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Robin Murphy

On 2021-06-24 07:05, Claire Chang wrote:

On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig  wrote:


On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:

is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119

is_swiotlb_force_bounce() was the new function introduced in this patch here.

+static inline bool is_swiotlb_force_bounce(struct device *dev)
+{
+ return dev->dma_io_tlb_mem->force_bounce;
+}


To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
turn this into :

 return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;

for a quick debug check?


I just realized that dma_io_tlb_mem might be NULL like Christoph
pointed out since swiotlb might not get initialized.
However,  `Unable to handle kernel paging request at virtual address
dfff800e` looks more like the address is garbage rather than
NULL?
I wonder if that's because dev->dma_io_tlb_mem is not assigned
properly (which means device_initialize is not called?).


What also looks odd is that the base "address" 0xdfff8000 is 
held in a couple of registers, but the offset 0xe looks too small to 
match up to any relevant structure member in that dereference chain :/


Robin.


Re: [PATCH v5 14/16] dma-direct: Allocate memory from restricted DMA pool if available

2021-04-23 Thread Robin Murphy

On 2021-04-22 09:15, Claire Chang wrote:

The restricted DMA pool is preferred if available.

The restricted DMA pools provide a basic level of protection against the
DMA overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system
needs to provide a way to lock down the memory access, e.g., MPU.

Signed-off-by: Claire Chang 
---
  kernel/dma/direct.c | 35 ++-
  1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 7a27f0510fcc..29523d2a9845 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -78,6 +78,10 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t 
phys, size_t size)
  static void __dma_direct_free_pages(struct device *dev, struct page *page,
size_t size)
  {
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (swiotlb_free(dev, page, size))
+   return;
+#endif
dma_free_contiguous(dev, page, size);
  }
  
@@ -92,7 +96,17 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
  
  	gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,

   &phys_limit);
-   page = dma_alloc_contiguous(dev, size, gfp);
+
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   page = swiotlb_alloc(dev, size);
+   if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
+   __dma_direct_free_pages(dev, page, size);
+   page = NULL;
+   }
+#endif
+
+   if (!page)
+   page = dma_alloc_contiguous(dev, size, gfp);
if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
dma_free_contiguous(dev, page, size);
page = NULL;
@@ -148,7 +162,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
gfp |= __GFP_NOWARN;
  
  	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&

-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO);
if (!page)
return NULL;
@@ -161,8 +175,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
  
  	if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&

-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev))
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev))
return arch_dma_alloc(dev, size, dma_handle, gfp, attrs);
  
  	/*

@@ -172,7 +186,9 @@ void *dma_direct_alloc(struct device *dev, size_t size,
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
!gfpflags_allow_blocking(gfp) &&
(force_dma_unencrypted(dev) ||
-(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && 
!dev_is_dma_coherent(dev
+(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
+ !dev_is_dma_coherent(dev))) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
  
  	/* we always manually zero the memory once we are done */

@@ -253,15 +269,15 @@ void dma_direct_free(struct device *dev, size_t size,
unsigned int page_order = get_order(size);
  
  	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&

-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
/* cpu_addr is a struct page cookie, not a kernel address */
dma_free_contiguous(dev, cpu_addr, size);
return;
}
  
  	if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&

-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev)) {
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev)) {
arch_dma_free(dev, size, cpu_addr, dma_addr, attrs);
return;
}
@@ -289,7 +305,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, 
size_t size,
void *ret;
  
  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&

-   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp))
+   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);


Wait, this seems broken for non-coherent devices - in that case we need 
to return a non-cacheable address, but we can't simply fall through into 
the remapping path below in GFP_ATOMIC context. That's why we need the 
atomic pool concept in the first place :/


Unless I've overlooked something, we're still using the regular 
cacheable linear map address of the dma_io_tlb_mem buffer, no?


Robin.

  
  	page = __dma_direct_alloc_pages(dev, size, 

Re: [PATCH v5 16/16] of: Add plumbing for restricted DMA pool

2021-04-23 Thread Robin Murphy

On 2021-04-22 09:15, Claire Chang wrote:

If a device is not behind an IOMMU, we look up the device node and set
up the restricted DMA when the restricted-dma-pool is presented.

Signed-off-by: Claire Chang 
---
  drivers/of/address.c| 25 +
  drivers/of/device.c |  3 +++
  drivers/of/of_private.h |  5 +
  3 files changed, 33 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index 54f221dde267..fff3adfe4986 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -8,6 +8,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -1109,6 +1110,30 @@ bool of_dma_is_coherent(struct device_node *np)
  }
  EXPORT_SYMBOL_GPL(of_dma_is_coherent);
  
+int of_dma_set_restricted_buffer(struct device *dev)

+{
+   struct device_node *node;
+   int count, i;
+
+   if (!dev->of_node)
+   return 0;
+
+   count = of_property_count_elems_of_size(dev->of_node, "memory-region",
+   sizeof(phandle));
+   for (i = 0; i < count; i++) {
+   node = of_parse_phandle(dev->of_node, "memory-region", i);
+   /* There might be multiple memory regions, but only one
+* restricted-dma-pool region is allowed.
+*/


What's the use-case for having multiple regions if the restricted pool 
is by definition the only one accessible?


Robin.


+   if (of_device_is_compatible(node, "restricted-dma-pool") &&
+   of_device_is_available(node))
+   return of_reserved_mem_device_init_by_idx(
+   dev, dev->of_node, i);
+   }
+
+   return 0;
+}
+
  /**
   * of_mmio_is_nonposted - Check if device uses non-posted MMIO
   * @np:   device node
diff --git a/drivers/of/device.c b/drivers/of/device.c
index c5a9473a5fb1..d8d865223e51 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
  
  	arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);
  
+	if (!iommu)

+   return of_dma_set_restricted_buffer(dev);
+
return 0;
  }
  EXPORT_SYMBOL_GPL(of_dma_configure_id);
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index d717efbd637d..e9237f5eff48 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -163,12 +163,17 @@ struct bus_dma_region;
  #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA)
  int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map);
+int of_dma_set_restricted_buffer(struct device *dev);
  #else
  static inline int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map)
  {
return -ENODEV;
  }
+static inline int of_dma_get_restricted_buffer(struct device *dev)
+{
+   return -ENODEV;
+}
  #endif
  
  #endif /* _LINUX_OF_PRIVATE_H */




Re: [PATCH v5 08/16] swiotlb: Update is_swiotlb_active to add a struct device argument

2021-04-23 Thread Robin Murphy

On 2021-04-22 09:15, Claire Chang wrote:

Update is_swiotlb_active to add a struct device argument. This will be
useful later to allow for restricted DMA pool.

Signed-off-by: Claire Chang 
---
  drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +-
  drivers/gpu/drm/nouveau/nouveau_ttm.c| 2 +-
  drivers/pci/xen-pcifront.c   | 2 +-
  include/linux/swiotlb.h  | 4 ++--
  kernel/dma/direct.c  | 2 +-
  kernel/dma/swiotlb.c | 4 ++--
  6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c 
b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
index ce6b664b10aa..7d48c433446b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
@@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct 
drm_i915_gem_object *obj)
  
  	max_order = MAX_ORDER;

  #ifdef CONFIG_SWIOTLB
-   if (is_swiotlb_active()) {
+   if (is_swiotlb_active(NULL)) {
unsigned int max_segment;
  
  		max_segment = swiotlb_max_segment();

diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c 
b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index e8b506a6685b..2a2ae6d6cf6d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm)
}
  
  #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)

-   need_swiotlb = is_swiotlb_active();
+   need_swiotlb = is_swiotlb_active(NULL);
  #endif
  
  	ret = ttm_device_init(&drm->ttm.bdev, &nouveau_bo_driver, drm->dev->dev,

diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index b7a8f3a1921f..6d548ce53ce7 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct 
pcifront_device *pdev)
  
  	spin_unlock(&pcifront_dev_lock);
  
-	if (!err && !is_swiotlb_active()) {

+   if (!err && !is_swiotlb_active(NULL)) {
err = pci_xen_swiotlb_init_late();
if (err)
dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n");
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 2a6cca07540b..c530c976d18b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -123,7 +123,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
  void __init swiotlb_exit(void);
  unsigned int swiotlb_max_segment(void);
  size_t swiotlb_max_mapping_size(struct device *dev);
-bool is_swiotlb_active(void);
+bool is_swiotlb_active(struct device *dev);
  void __init swiotlb_adjust_size(unsigned long size);
  #else
  #define swiotlb_force SWIOTLB_NO_FORCE
@@ -143,7 +143,7 @@ static inline size_t swiotlb_max_mapping_size(struct device 
*dev)
return SIZE_MAX;
  }
  
-static inline bool is_swiotlb_active(void)

+static inline bool is_swiotlb_active(struct device *dev)
  {
return false;
  }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 84c9feb5474a..7a88c34d0867 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -495,7 +495,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
  size_t dma_direct_max_mapping_size(struct device *dev)
  {
/* If SWIOTLB is active, use its maximum mapping size */
-   if (is_swiotlb_active() &&
+   if (is_swiotlb_active(dev) &&
(dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))


I wonder if it's worth trying to fold these other conditions into 
is_swiotlb_active() itself? I'm not entirely sure what matters for Xen, 
but for the other cases it seems like they probably only care about 
whether bouncing may occur for their particular device or not (possibly 
they want to be using dma_max_mapping_size() now anyway - TBH I'm 
struggling to make sense of what the swiotlb_max_segment business is 
supposed to mean).


Otherwise, patch #9 will need to touch here as well to make sure that 
per-device forced bouncing is reflected correctly.
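
Very roughly, the folded version might look something like this (illustrative
only; the exact conditions here are an assumption rather than a worked-out
patch):

static inline bool is_swiotlb_active(struct device *dev)
{
	struct io_tlb_mem *mem = get_io_tlb_mem(dev);

	/* answer "might this device bounce?" rather than "does a pool exist?" */
	return mem && (dma_addressing_limited(dev) ||
		       swiotlb_force == SWIOTLB_FORCE);
}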


Robin.


return swiotlb_max_mapping_size(dev);
return SIZE_MAX;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ffbb8724e06c..1d221343f1c8 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -659,9 +659,9 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
  }
  
-bool is_swiotlb_active(void)

+bool is_swiotlb_active(struct device *dev)
  {
-   return io_tlb_default_mem != NULL;
+   return get_io_tlb_mem(dev) != NULL;
  }
  EXPORT_SYMBOL_GPL(is_swiotlb_active);
  



Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Robin Murphy

On 2021-03-16 15:38, Christoph Hellwig wrote:
[...]

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f1e38526d5bd40..996dfdf9d375dd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2017,7 +2017,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
.iommu_dev  = smmu->dev,
};
  
-	if (smmu_domain->non_strict)

+   if (!iommu_get_dma_strict())


As Will raised, this also needs to be checking "domain->type == 
IOMMU_DOMAIN_DMA" to maintain equivalent behaviour to the attribute code 
below.
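
i.e. roughly something like this (a sketch to match the old attribute-path
behaviour, not the final patch):

	if (domain->type == IOMMU_DOMAIN_DMA && !iommu_get_dma_strict())
		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;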



pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
  
  	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);

@@ -2449,52 +2449,6 @@ static struct iommu_group *arm_smmu_device_group(struct 
device *dev)
return group;
  }
  
-static int arm_smmu_domain_get_attr(struct iommu_domain *domain,

-   enum iommu_attr attr, void *data)
-{
-   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-
-   switch (domain->type) {
-   case IOMMU_DOMAIN_DMA:
-   switch (attr) {
-   case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-   *(int *)data = smmu_domain->non_strict;
-   return 0;
-   default:
-   return -ENODEV;
-   }
-   break;
-   default:
-   return -EINVAL;
-   }
-}

[...]

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index f985817c967a25..edb1de479dd1a7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -668,7 +668,6 @@ struct arm_smmu_domain {
struct mutexinit_mutex; /* Protects smmu pointer */
  
  	struct io_pgtable_ops		*pgtbl_ops;

-   boolnon_strict;
atomic_tnr_ats_masters;
  
  	enum arm_smmu_domain_stage	stage;

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 0aa6d667274970..3dde22b1f8ffb0 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -761,6 +761,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain 
*domain,
.iommu_dev  = smmu->dev,
};
  
+	if (!iommu_get_dma_strict())


Ditto here.

Sorry for not spotting that sooner :(

Robin.


+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
+
if (smmu->impl && smmu->impl->init_context) {
ret = smmu->impl->init_context(smmu_domain, &pgtbl_cfg, dev);
if (ret)


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Robin Murphy

On 2021-03-31 16:32, Will Deacon wrote:

On Wed, Mar 31, 2021 at 02:09:37PM +0100, Robin Murphy wrote:

On 2021-03-31 12:49, Will Deacon wrote:

On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:

On 2021-03-30 14:58, Will Deacon wrote:

On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:

On 2021-03-30 14:11, Will Deacon wrote:

On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:

From: Robin Murphy 

Instead make the global iommu_dma_strict parameter in iommu.c canonical by
exporting helpers to get and set it and use those directly in the drivers.

This makes sure that the iommu.strict parameter also works for the AMD and
Intel IOMMU drivers on x86.  As those default to lazy flushing, a new
IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
represent the default if not overridden by an explicit parameter.

Signed-off-by: Robin Murphy .
[ported on top of the other iommu_attr changes and added a few small
 missing bits]
Signed-off-by: Christoph Hellwig 
---
 drivers/iommu/amd/iommu.c   | 23 +---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
 drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
 drivers/iommu/dma-iommu.c   |  9 +--
 drivers/iommu/intel/iommu.c | 64 -
 drivers/iommu/iommu.c   | 27 ++---
 include/linux/iommu.h   |  4 +-
 8 files changed, 40 insertions(+), 165 deletions(-)


I really like this cleanup, but I can't help wonder if it's going in the
wrong direction. With SoCs often having multiple IOMMU instances and a
distinction between "trusted" and "untrusted" devices, then having the
flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
unreasonable to me, but this change makes it a global property.


The intent here was just to streamline the existing behaviour of stuffing a
global property into a domain attribute then pulling it out again in the
illusion that it was in any way per-domain. We're still checking
dev_is_untrusted() before making an actual decision, and it's not like we
can't add more factors at that point if we want to.


Like I say, the cleanup is great. I'm just wondering whether there's a
better way to express the complicated logic to decide whether or not to use
the flush queue than what we end up with:

if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
domain->ops->flush_iotlb_all && !iommu_get_dma_strict())

which is mixing up globals, device properties and domain properties. The
result is that the driver code ends up just using the global to determine
whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code,
which is a departure from the current way of doing things.


But previously, SMMU only ever saw the global policy piped through the
domain attribute by iommu_group_alloc_default_domain(), so there's no
functional change there.


For DMA domains sure, but I don't think that's the case for unmanaged
domains such as those used by VFIO.


Eh? This is only relevant to DMA domains anyway. Flush queues are part of
the IOVA allocator that VFIO doesn't even use. It's always been the case
that unmanaged domains only use strict invalidation.


Maybe I'm going mad. With this patch, the SMMU driver unconditionally sets
IO_PGTABLE_QUIRK_NON_STRICT for page-tables if iommu_get_dma_strict() is
true, no? In which case, that will get set for page-tables corresponding
to unmanaged domains as well as DMA domains when it is enabled. That didn't
happen before because you couldn't set the attribute for unmanaged domains.

What am I missing?


Oh cock... sorry, all this time I've been saying what I *expect* it to 
do, while overlooking the fact that the IO_PGTABLE_QUIRK_NON_STRICT 
hunks were the bits I forgot to write and Christoph had to fix up. 
Indeed, those should be checking the domain type too to preserve the 
existing behaviour. Apologies for the confusion.


Robin.


Obviously some of the above checks could be factored out into some kind of
iommu_use_flush_queue() helper that IOMMU drivers can also call if they need
to keep in sync. Or maybe we just allow iommu-dma to set
IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're
treating that as a generic thing now.


I think a helper that takes a domain would be a good starting point.


You mean device, right? The one condition we currently have is at the device
level, and there's really nothing inherent to the domain itself that matters
(since the type is implicitly IOMMU_DOMAIN_DMA to even care about this).


Device would probably work too; you'd pass the first device to attach to the
domain when querying this from the SMMU driver, I suppose.

Will



Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Robin Murphy

On 2021-03-31 12:49, Will Deacon wrote:

On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:

On 2021-03-30 14:58, Will Deacon wrote:

On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:

On 2021-03-30 14:11, Will Deacon wrote:

On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:

From: Robin Murphy 

Instead make the global iommu_dma_strict parameter in iommu.c canonical by
exporting helpers to get and set it and use those directly in the drivers.

This makes sure that the iommu.strict parameter also works for the AMD and
Intel IOMMU drivers on x86.  As those default to lazy flushing, a new
IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
represent the default if not overridden by an explicit parameter.

Signed-off-by: Robin Murphy .
[ported on top of the other iommu_attr changes and added a few small
missing bits]
Signed-off-by: Christoph Hellwig 
---
drivers/iommu/amd/iommu.c   | 23 +---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
drivers/iommu/dma-iommu.c   |  9 +--
drivers/iommu/intel/iommu.c | 64 -
drivers/iommu/iommu.c   | 27 ++---
include/linux/iommu.h   |  4 +-
8 files changed, 40 insertions(+), 165 deletions(-)


I really like this cleanup, but I can't help wonder if it's going in the
wrong direction. With SoCs often having multiple IOMMU instances and a
distinction between "trusted" and "untrusted" devices, then having the
flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
unreasonable to me, but this change makes it a global property.


The intent here was just to streamline the existing behaviour of stuffing a
global property into a domain attribute then pulling it out again in the
illusion that it was in any way per-domain. We're still checking
dev_is_untrusted() before making an actual decision, and it's not like we
can't add more factors at that point if we want to.


Like I say, the cleanup is great. I'm just wondering whether there's a
better way to express the complicated logic to decide whether or not to use
the flush queue than what we end up with:

if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
domain->ops->flush_iotlb_all && !iommu_get_dma_strict())

which is mixing up globals, device properties and domain properties. The
result is that the driver code ends up just using the global to determine
whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code,
which is a departure from the current way of doing things.


But previously, SMMU only ever saw the global policy piped through the
domain attribute by iommu_group_alloc_default_domain(), so there's no
functional change there.


For DMA domains sure, but I don't think that's the case for unmanaged
domains such as those used by VFIO.


Eh? This is only relevant to DMA domains anyway. Flush queues are part 
of the IOVA allocator that VFIO doesn't even use. It's always been the 
case that unmanaged domains only use strict invalidation.



Obviously some of the above checks could be factored out into some kind of
iommu_use_flush_queue() helper that IOMMU drivers can also call if they need
to keep in sync. Or maybe we just allow iommu-dma to set
IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're
treating that as a generic thing now.


I think a helper that takes a domain would be a good starting point.


You mean device, right? The one condition we currently have is at the 
device level, and there's really nothing inherent to the domain itself 
that matters (since the type is implicitly IOMMU_DOMAIN_DMA to even care 
about this).
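
For illustration, a minimal sketch of the kind of helper being discussed -
not what eventually landed, untested, and assuming dev_is_untrusted()
(currently a static helper in dma-iommu.c) were visible to it:

static bool iommu_use_flush_queue(struct device *dev, struct iommu_domain *domain)
{
	/* Global policy from the command line / build-time default */
	if (iommu_get_dma_strict())
		return false;

	/* Device-level condition: untrusted devices always unmap strictly */
	if (dev && dev_is_untrusted(dev))
		return false;

	/* The driver must be able to flush everything when the queue drains */
	return domain->ops->flush_iotlb_all != NULL;
}

Then iommu-dma and any driver which needs to stay in sync would call the
same thing rather than each open-coding its own combination of the checks.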


Another idea that's just come to mind is now that IOMMU_DOMAIN_DMA has a 
standard meaning, maybe we could split out a separate 
IOMMU_DOMAIN_DMA_STRICT type such that it can all propagate from 
iommu_get_def_domain_type()? That feels like it might be quite 
promising, but I'd still do it as an improvement on top of this patch, 
since it's beyond just cleaning up the abuse of domain attributes to 
pass a command-line option around.
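
As a purely hypothetical sketch of that shape (the naming, and exactly what
the plain DMA type would then mean, are the details to work out on top):

static unsigned int iommu_dma_domain_type(struct device *dev)
{
	/* Fold the strict/lazy choice into the default domain type itself,
	 * so it propagates from iommu_get_def_domain_type() like any other
	 * type decision instead of being fished out of an attribute later. */
	if (iommu_get_dma_strict() || dev_is_untrusted(dev))
		return IOMMU_DOMAIN_DMA_STRICT;

	return IOMMU_DOMAIN_DMA;
}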


Robin.


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-30 Thread Robin Murphy

On 2021-03-30 14:58, Will Deacon wrote:

On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:

On 2021-03-30 14:11, Will Deacon wrote:

On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:

From: Robin Murphy 

Instead make the global iommu_dma_strict parameter in iommu.c canonical by
exporting helpers to get and set it and use those directly in the drivers.

This makes sure that the iommu.strict parameter also works for the AMD and
Intel IOMMU drivers on x86.  As those default to lazy flushing a new
IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
represent the default if not overridden by an explicit parameter.

Signed-off-by: Robin Murphy .
[ported on top of the other iommu_attr changes and added a few small
   missing bits]
Signed-off-by: Christoph Hellwig 
---
   drivers/iommu/amd/iommu.c   | 23 +---
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
   drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
   drivers/iommu/dma-iommu.c   |  9 +--
   drivers/iommu/intel/iommu.c | 64 -
   drivers/iommu/iommu.c   | 27 ++---
   include/linux/iommu.h   |  4 +-
   8 files changed, 40 insertions(+), 165 deletions(-)


I really like this cleanup, but I can't help wonder if it's going in the
wrong direction. With SoCs often having multiple IOMMU instances and a
distinction between "trusted" and "untrusted" devices, then having the
flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
unreasonable to me, but this change makes it a global property.


The intent here was just to streamline the existing behaviour of stuffing a
global property into a domain attribute then pulling it out again in the
illusion that it was in any way per-domain. We're still checking
dev_is_untrusted() before making an actual decision, and it's not like we
can't add more factors at that point if we want to.


Like I say, the cleanup is great. I'm just wondering whether there's a
better way to express the complicated logic to decide whether or not to use
the flush queue than what we end up with:

if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
domain->ops->flush_iotlb_all && !iommu_get_dma_strict())

which is mixing up globals, device properties and domain properties. The
result is that the driver code ends up just using the global to determine
whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code,
which is a departure from the current way of doing things.


But previously, SMMU only ever saw the global policy piped through the 
domain attribute by iommu_group_alloc_default_domain(), so there's no 
functional change there.


Obviously some of the above checks could be factored out into some kind 
of iommu_use_flush_queue() helper that IOMMU drivers can also call if 
they need to keep in sync. Or maybe we just allow iommu-dma to set 
IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if 
we're treating that as a generic thing now.



For example, see the recent patch from Lu Baolu:

https://lore.kernel.org/r/20210225061454.2864009-1-baolu...@linux.intel.com


Erm, this patch is based on that one, it's right there in the context :/


Ah, sorry, I didn't spot that! I was just trying to illustrate that this
is per-device.


Sure, I understand - and I'm just trying to bang home that despite 
appearances it's never actually been treated as such for SMMU, so 
anything that's wrong after this change was already wrong before.


Robin.


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-30 Thread Robin Murphy

On 2021-03-30 14:11, Will Deacon wrote:

On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:

From: Robin Murphy 

Instead make the global iommu_dma_strict parameter in iommu.c canonical by
exporting helpers to get and set it and use those directly in the drivers.

This makes sure that the iommu.strict parameter also works for the AMD and
Intel IOMMU drivers on x86.  As those default to lazy flushing a new
IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
represent the default if not overridden by an explicit parameter.

Signed-off-by: Robin Murphy .
[ported on top of the other iommu_attr changes and added a few small
  missing bits]
Signed-off-by: Christoph Hellwig 
---
  drivers/iommu/amd/iommu.c   | 23 +---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
  drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
  drivers/iommu/dma-iommu.c   |  9 +--
  drivers/iommu/intel/iommu.c | 64 -
  drivers/iommu/iommu.c   | 27 ++---
  include/linux/iommu.h   |  4 +-
  8 files changed, 40 insertions(+), 165 deletions(-)


I really like this cleanup, but I can't help wonder if it's going in the
wrong direction. With SoCs often having multiple IOMMU instances and a
distinction between "trusted" and "untrusted" devices, then having the
flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
unreasonable to me, but this change makes it a global property.


The intent here was just to streamline the existing behaviour of 
stuffing a global property into a domain attribute then pulling it out 
again in the illusion that it was in any way per-domain. We're still 
checking dev_is_untrusted() before making an actual decision, and it's 
not like we can't add more factors at that point if we want to.



For example, see the recent patch from Lu Baolu:

https://lore.kernel.org/r/20210225061454.2864009-1-baolu...@linux.intel.com


Erm, this patch is based on that one, it's right there in the context :/

Thanks,
Robin.


Re: [PATCH 24/30] Kconfig: Change Synopsys to Synopsis

2021-03-30 Thread Robin Murphy

On 2021-03-29 00:53, Bhaskar Chowdhury wrote:

s/Synopsys/Synopsis/ ... two different places.


Erm, that is definitely not a typo... :/


...and for some unknown reason it introduces an empty line deleted and added
back.


Presumably your editor is configured to trim trailing whitespace on save.

Furthermore, there are several instances in the other patches where your 
"corrections" are grammatically incorrect, I'm not sure what the deal is 
with patch #14, and you've also used the wrong subsystem name (it should 
be "dmaengine"). It's great to want to clean things up, but please pay a 
bit of care and attention to what you're actually doing.


Robin.


Signed-off-by: Bhaskar Chowdhury 
---
  drivers/dma/Kconfig | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 0c2827fd8c19..30e8cc26f43b 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -170,15 +170,15 @@ config DMA_SUN6I
  Support for the DMA engine first found in Allwinner A31 SoCs.

  config DW_AXI_DMAC
-   tristate "Synopsys DesignWare AXI DMA support"
+   tristate "Synopsis DesignWare AXI DMA support"
depends on OF || COMPILE_TEST
depends on HAS_IOMEM
select DMA_ENGINE
select DMA_VIRTUAL_CHANNELS
help
- Enable support for Synopsys DesignWare AXI DMA controller.
+ Enable support for Synopsis DesignWare AXI DMA controller.
  NOTE: This driver wasn't tested on 64 bit platform because
- of lack 64 bit platform with Synopsys DW AXI DMAC.
+ of lack 64 bit platform with Synopsis DW AXI DMAC.

  config EP93XX_DMA
bool "Cirrus Logic EP93xx DMA support"
@@ -394,7 +394,7 @@ config MOXART_DMA
select DMA_VIRTUAL_CHANNELS
help
  Enable support for the MOXA ART SoC DMA controller.
-
+
  Say Y here if you enabled MMP ADMA, otherwise say N.

  config MPC512X_DMA
--
2.26.3

___
iommu mailing list
io...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu



Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-16 Thread Robin Murphy

On 2021-03-15 08:33, Christoph Hellwig wrote:

On Fri, Mar 12, 2021 at 04:18:24PM +, Robin Murphy wrote:

Let me know what you think of the version here:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iommu-cleanup

I'll happily switch the patch to you as the author if you're fine with
that as well.


I still have reservations about removing the attribute API entirely and
pretending that io_pgtable_cfg is anything other than a SoC-specific
private interface,


I think a private interface would make more sense.  For now I've just
condensed it down to a generic set of quirk bits and dropped the
attrs structure, which seems like an ok middle ground for now.  That
being said I wonder why that quirk isn't simply set in the device
tree?


Because it's a software policy decision rather than any inherent 
property of the platform, and the DT certainly doesn't know *when* any 
particular device might prefer its IOMMU to use cacheable pagetables to 
minimise TLB miss latency vs. saving the cache capacity for larger data 
buffers. It really is most logical to decide this at the driver level.


In truth the overall concept *is* relatively generic (a trend towards 
larger system caches and cleverer usage is about both raw performance 
and saving power on off-SoC DRAM traffic), it's just the particular 
implementation of using io-pgtable to set an outer-cacheable walk 
attribute in an SMMU TCR that's pretty much specific to Qualcomm SoCs. 
Hence why having a common abstraction at the iommu_domain level, but 
where the exact details are free to vary across different IOMMUs and 
their respective client drivers, is in many ways an ideal fit.
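
To make that concrete, the interaction is essentially the shape below
(modelled on the msm/adreno snippet quoted elsewhere in this series, shown
here only as an illustration): the client driver states the policy against
the iommu_domain, and the IOMMU driver decides how to honour it.

/* GPU driver: "please use outer-cacheable pagetable walks for this domain" */
struct io_pgtable_domain_attr pgtbl_cfg = {
	.quirks = IO_PGTABLE_QUIRK_ARM_OUTER_WBWA,
};

iommu_domain_set_attr(domain, DOMAIN_ATTR_IO_PGTABLE_CFG, &pgtbl_cfg);

/* SMMU driver: stashes the quirks and applies them to its TCR when it sets
 * up io-pgtable for that domain; another IOMMU would be free to implement
 * the same request in whatever way makes sense for it. */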



but the reworked patch on its own looks reasonable to
me, thanks! (I wasn't too convinced about the iommu_cmd_line wrappers
either...) Just iommu_get_dma_strict() needs an export since the SMMU
drivers can be modular - I consciously didn't add that myself since I was
mistakenly thinking only iommu-dma would call it.


Fixed.  Can I get your signoff for the patch?  Then I'll switch it to
over to being attributed to you.


Sure - I would have thought that the one I originally posted still 
stands, but for the avoidance of doubt, for the parts of commit 
8b6d45c495bd in your tree that remain from what I wrote:


Signed-off-by: Robin Murphy 

Cheers,
Robin.


Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-12 Thread Robin Murphy

On 2021-03-11 08:26, Christoph Hellwig wrote:

On Wed, Mar 10, 2021 at 06:39:57PM +, Robin Murphy wrote:

Actually... Just mirroring the iommu_dma_strict value into
struct iommu_domain should solve all of that with very little
boilerplate code.


Yes, my initial thought was to directly replace the attribute with a
common flag at iommu_domain level, but since in all cases the behaviour
is effectively global rather than actually per-domain, it seemed
reasonable to take it a step further. This passes compile-testing for
arm64 and x86, what do you think?


It seems to miss a few bits, and also generally seems to be not actually
apply to recent mainline or something like it due to different empty
lines in a few places.


Yeah, that was sketched out on top of some other development patches, 
and in being so focused on not breaking any of the x86 behaviours I did 
indeed overlook fully converting the SMMU drivers... oops!


(my thought was to do the conversion for its own sake, then clean up the 
redundant attribute separately, but I guess it's fine either way)



Let me know what you think of the version here:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iommu-cleanup

I'll happily switch the patch to you as the author if you're fine with
that as well.


I still have reservations about removing the attribute API entirely and 
pretending that io_pgtable_cfg is anything other than a SoC-specific 
private interface, but the reworked patch on its own looks reasonable to 
me, thanks! (I wasn't too convinced about the iommu_cmd_line wrappers 
either...) Just iommu_get_dma_strict() needs an export since the SMMU 
drivers can be modular - I consciously didn't add that myself since I 
was mistakenly thinking only iommu-dma would call it.


Robin.


Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-10 Thread Robin Murphy

On 2021-03-10 09:25, Christoph Hellwig wrote:

On Wed, Mar 10, 2021 at 10:15:01AM +0100, Christoph Hellwig wrote:

On Thu, Mar 04, 2021 at 03:25:27PM +, Robin Murphy wrote:

On 2021-03-01 08:42, Christoph Hellwig wrote:

Use explicit methods for setting and querying the information instead.


Now that everyone's using iommu-dma, is there any point in bouncing this
through the drivers at all? Seems like it would make more sense for the x86
drivers to reflect their private options back to iommu_dma_strict (and
allow Intel's caching mode to override it as well), then have
iommu_dma_init_domain just test !iommu_dma_strict &&
domain->ops->flush_iotlb_all.


Hmm.  I looked at this, and killing off ->dma_enable_flush_queue for
the ARM drivers and just looking at iommu_dma_strict seems like a
very clear win.

OTOH x86 is a little more complicated.  AMD and Intel default to lazy
mode, so we'd have to change the global iommu_dma_strict if they are
initialized.  Also Intel has not only a "static" option to disable
lazy mode, but also a "dynamic" one where it iterates structure.  So
I think on the get side we're stuck with the method, but it still
simplifies the whole thing.


Actually... Just mirroring the iommu_dma_strict value into
struct iommu_domain should solve all of that with very little
boilerplate code.


Yes, my initial thought was to directly replace the attribute with a
common flag at iommu_domain level, but since in all cases the behaviour
is effectively global rather than actually per-domain, it seemed
reasonable to take it a step further. This passes compile-testing for
arm64 and x86, what do you think?

Robin.

->8-
Subject: [PATCH] iommu: Consolidate strict invalidation handling

Now that everyone is using iommu-dma, the global invalidation policy
really doesn't need to be woven through several parts of the core API
and individual drivers, we can just look it up directly at the one point
that we now make the flush queue decision. If the x86 drivers reflect
their internal options and overrides back to iommu_dma_strict, that can
become the canonical source.

Signed-off-by: Robin Murphy 
---
 drivers/iommu/amd/iommu.c   |  2 ++
 drivers/iommu/dma-iommu.c   |  8 +---
 drivers/iommu/intel/iommu.c | 12 
 drivers/iommu/iommu.c   | 35 +++
 include/linux/iommu.h   |  2 ++
 5 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index a69a8b573e40..1db29e59d468 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1856,6 +1856,8 @@ int __init amd_iommu_init_dma_ops(void)
else
pr_info("Lazy IO/TLB flushing enabled\n");
 
+	iommu_set_dma_strict(amd_iommu_unmap_flush);

+
return 0;
 
 }

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..789a950cc125 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -304,10 +304,6 @@ static void iommu_dma_flush_iotlb_all(struct iova_domain 
*iovad)
 
 	cookie = container_of(iovad, struct iommu_dma_cookie, iovad);

domain = cookie->fq_domain;
-   /*
-* The IOMMU driver supporting DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
-* implies that ops->flush_iotlb_all must be non-NULL.
-*/
domain->ops->flush_iotlb_all(domain);
 }
 
@@ -334,7 +330,6 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,

struct iommu_dma_cookie *cookie = domain->iova_cookie;
unsigned long order, base_pfn;
struct iova_domain *iovad;
-   int attr;
 
 	if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE)

return -EINVAL;
@@ -371,8 +366,7 @@ static int iommu_dma_init_domain(struct iommu_domain 
*domain, dma_addr_t base,
init_iova_domain(iovad, 1UL << order, base_pfn);
 
 	if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&

-   !iommu_domain_get_attr(domain, DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, &attr) &&
-   attr) {
+   domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) {
if (init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all,
  iommu_dma_entry_dtor))
pr_warn("iova flush queue initialization failed\n");
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index b5c746f0f63b..f5b452cd1266 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4377,6 +4377,17 @@ int __init intel_iommu_init(void)
 
 	down_read(&dmar_global_lock);

for_each_active_iommu(iommu, drhd) {
+   if (!intel_iommu_strict && cap_caching_mode(iommu->cap)) {
+   /*
+* The flush queue implementation does not perform 
page-selective
+

Re: [PATCH 16/17] iommu: remove DOMAIN_ATTR_IO_PGTABLE_CFG

2021-03-04 Thread Robin Murphy

On 2021-03-01 08:42, Christoph Hellwig wrote:

Signed-off-by: Christoph Hellwig 


Moreso than the previous patch, where the feature is at least relatively 
generic (note that there's a bunch of in-flight development around 
DOMAIN_ATTR_NESTING), I'm really not convinced that it's beneficial to 
bloat the generic iommu_ops structure with private driver-specific 
interfaces. The attribute interface is a great compromise for these 
kinds of things, and you can easily add type-checked wrappers around it 
for external callers (maybe even make the actual attributes internal 
between the IOMMU core and drivers) if that's your concern.
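
For instance, a type-checked wrapper over the attribute interface could be
as small as this sketch (hypothetical naming), keeping the raw attribute
plumbing internal between the core and the drivers:

static inline int iommu_domain_set_pgtable_cfg(struct iommu_domain *domain,
					       struct io_pgtable_domain_attr *cfg)
{
	/* External callers get a proper C prototype; the DOMAIN_ATTR_* value
	 * stays an implementation detail of the core<->driver interface. */
	return iommu_domain_set_attr(domain, DOMAIN_ATTR_IO_PGTABLE_CFG, cfg);
}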


Robin.


---
  drivers/gpu/drm/msm/adreno/adreno_gpu.c |  2 +-
  drivers/iommu/arm/arm-smmu/arm-smmu.c   | 40 +++--
  drivers/iommu/iommu.c   |  9 ++
  include/linux/iommu.h   |  9 +-
  4 files changed, 29 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c 
b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index 0f184c3dd9d9ec..78d98ab2ee3a68 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -191,7 +191,7 @@ void adreno_set_llc_attributes(struct iommu_domain *iommu)
struct io_pgtable_domain_attr pgtbl_cfg;
  
  	pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_ARM_OUTER_WBWA;

-   iommu_domain_set_attr(iommu, DOMAIN_ATTR_IO_PGTABLE_CFG, &pgtbl_cfg);
+   iommu_domain_set_pgtable_attr(iommu, &pgtbl_cfg);
  }
  
  struct msm_gem_address_space *

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 2e17d990d04481..2858999c86dfd1 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1515,40 +1515,22 @@ static int arm_smmu_domain_enable_nesting(struct 
iommu_domain *domain)
return ret;
  }
  
-static int arm_smmu_domain_set_attr(struct iommu_domain *domain,

-   enum iommu_attr attr, void *data)
+static int arm_smmu_domain_set_pgtable_attr(struct iommu_domain *domain,
+   struct io_pgtable_domain_attr *pgtbl_cfg)
  {
-   int ret = 0;
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+   int ret = -EPERM;
  
-	mutex_lock(&smmu_domain->init_mutex);

-
-   switch(domain->type) {
-   case IOMMU_DOMAIN_UNMANAGED:
-   switch (attr) {
-   case DOMAIN_ATTR_IO_PGTABLE_CFG: {
-   struct io_pgtable_domain_attr *pgtbl_cfg = data;
-
-   if (smmu_domain->smmu) {
-   ret = -EPERM;
-   goto out_unlock;
-   }
+   if (domain->type != IOMMU_DOMAIN_UNMANAGED)
+   return -EINVAL;
  
-			smmu_domain->pgtbl_cfg = *pgtbl_cfg;

-   break;
-   }
-   default:
-   ret = -ENODEV;
-   }
-   break;
-   case IOMMU_DOMAIN_DMA:
-   ret = -ENODEV;
-   break;
-   default:
-   ret = -EINVAL;
+   mutex_lock(&smmu_domain->init_mutex);
+   if (!smmu_domain->smmu) {
+   smmu_domain->pgtbl_cfg = *pgtbl_cfg;
+   ret = 0;
}
-out_unlock:
 	mutex_unlock(&smmu_domain->init_mutex);
+
return ret;
  }
  
@@ -1609,7 +1591,7 @@ static struct iommu_ops arm_smmu_ops = {

.device_group   = arm_smmu_device_group,
.dma_use_flush_queue= arm_smmu_dma_use_flush_queue,
.dma_enable_flush_queue = arm_smmu_dma_enable_flush_queue,
-   .domain_set_attr= arm_smmu_domain_set_attr,
+   .domain_set_pgtable_attr = arm_smmu_domain_set_pgtable_attr,
.domain_enable_nesting  = arm_smmu_domain_enable_nesting,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2e9e058501a953..8490aefd4b41f8 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2693,6 +2693,15 @@ int iommu_domain_enable_nesting(struct iommu_domain 
*domain)
  }
  EXPORT_SYMBOL_GPL(iommu_domain_enable_nesting);
  
+int iommu_domain_set_pgtable_attr(struct iommu_domain *domain,

+   struct io_pgtable_domain_attr *pgtbl_cfg)
+{
+   if (!domain->ops->domain_set_pgtable_attr)
+   return -EINVAL;
+   return domain->ops->domain_set_pgtable_attr(domain, pgtbl_cfg);
+}
+EXPORT_SYMBOL_GPL(iommu_domain_set_pgtable_attr);
+
  void iommu_get_resv_regions(struct device *dev, struct list_head *list)
  {
const struct iommu_ops *ops = dev->bus->iommu_ops;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index aed88aa3bd3edf..39d3ed4d2700ac 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -40,6 +40,7 @@ struct iommu_domain;
  struct notifier_block;
  struct iommu_sva;
  struct iommu_fault_event;
+struct io_pgtable_domain_attr;

Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-04 Thread Robin Murphy

On 2021-03-01 08:42, Christoph Hellwig wrote:

Use explicit methods for setting and querying the information instead.


Now that everyone's using iommu-dma, is there any point in bouncing this 
through the drivers at all? Seems like it would make more sense for the 
x86 drivers to reflect their private options back to iommu_dma_strict 
(and allow Intel's caching mode to override it as well), then have 
iommu_dma_init_domain just test !iommu_dma_strict && 
domain->ops->flush_iotlb_all.


Robin.


Also remove the now unused iommu_domain_get_attr functionality.

Signed-off-by: Christoph Hellwig 
---
  drivers/iommu/amd/iommu.c   | 23 ++---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 47 ++---
  drivers/iommu/arm/arm-smmu/arm-smmu.c   | 56 +
  drivers/iommu/dma-iommu.c   |  8 ++-
  drivers/iommu/intel/iommu.c | 27 ++
  drivers/iommu/iommu.c   | 19 +++
  include/linux/iommu.h   | 17 ++-
  7 files changed, 51 insertions(+), 146 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index a69a8b573e40d0..37a8e51db17656 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1771,24 +1771,11 @@ static struct iommu_group 
*amd_iommu_device_group(struct device *dev)
return acpihid_device_group(dev);
  }
  
-static int amd_iommu_domain_get_attr(struct iommu_domain *domain,

-   enum iommu_attr attr, void *data)
+static bool amd_iommu_dma_use_flush_queue(struct iommu_domain *domain)
  {
-   switch (domain->type) {
-   case IOMMU_DOMAIN_UNMANAGED:
-   return -ENODEV;
-   case IOMMU_DOMAIN_DMA:
-   switch (attr) {
-   case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-   *(int *)data = !amd_iommu_unmap_flush;
-   return 0;
-   default:
-   return -ENODEV;
-   }
-   break;
-   default:
-   return -EINVAL;
-   }
+   if (domain->type != IOMMU_DOMAIN_DMA)
+   return false;
+   return !amd_iommu_unmap_flush;
  }
  
  /*

@@ -2257,7 +2244,7 @@ const struct iommu_ops amd_iommu_ops = {
.release_device = amd_iommu_release_device,
.probe_finalize = amd_iommu_probe_finalize,
.device_group = amd_iommu_device_group,
-   .domain_get_attr = amd_iommu_domain_get_attr,
+   .dma_use_flush_queue = amd_iommu_dma_use_flush_queue,
.get_resv_regions = amd_iommu_get_resv_regions,
.put_resv_regions = generic_iommu_put_resv_regions,
.is_attach_deferred = amd_iommu_is_attach_deferred,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8594b4a8304375..bf96172e8c1f71 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2449,33 +2449,21 @@ static struct iommu_group *arm_smmu_device_group(struct 
device *dev)
return group;
  }
  
-static int arm_smmu_domain_get_attr(struct iommu_domain *domain,

-   enum iommu_attr attr, void *data)
+static bool arm_smmu_dma_use_flush_queue(struct iommu_domain *domain)
  {
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
  
-	switch (domain->type) {

-   case IOMMU_DOMAIN_UNMANAGED:
-   switch (attr) {
-   case DOMAIN_ATTR_NESTING:
-   *(int *)data = (smmu_domain->stage == 
ARM_SMMU_DOMAIN_NESTED);
-   return 0;
-   default:
-   return -ENODEV;
-   }
-   break;
-   case IOMMU_DOMAIN_DMA:
-   switch (attr) {
-   case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-   *(int *)data = smmu_domain->non_strict;
-   return 0;
-   default:
-   return -ENODEV;
-   }
-   break;
-   default:
-   return -EINVAL;
-   }
+   if (domain->type != IOMMU_DOMAIN_DMA)
+   return false;
+   return smmu_domain->non_strict;
+}
+
+
+static void arm_smmu_dma_enable_flush_queue(struct iommu_domain *domain)
+{
+   if (domain->type != IOMMU_DOMAIN_DMA)
+   return;
+   to_smmu_domain(domain)->non_strict = true;
  }
  
  static int arm_smmu_domain_set_attr(struct iommu_domain *domain,

@@ -2505,13 +2493,7 @@ static int arm_smmu_domain_set_attr(struct iommu_domain 
*domain,
}
break;
case IOMMU_DOMAIN_DMA:
-   switch(attr) {
-   case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-   smmu_domain->non_strict = *(int *)data;
-   break;
-   default:
-   ret = 

Re: [PATCH RFC v1 2/6] swiotlb: convert variables to arrays

2021-02-04 Thread Robin Murphy

On 2021-02-04 07:29, Christoph Hellwig wrote:

On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote:

This patch converts several swiotlb related variables to arrays, in
order to maintain stat/status for different swiotlb buffers. Here are
variables involved:

- io_tlb_start and io_tlb_end
- io_tlb_nslabs and io_tlb_used
- io_tlb_list
- io_tlb_index
- max_segment
- io_tlb_orig_addr
- no_iotlb_memory

There is no functional change and this is to prepare to enable 64-bit
swiotlb.


Claire Chang (on Cc) already posted a patch like this a month ago,
which looks much better because it actually uses a struct instead
of all the random variables.


Indeed, I skimmed the cover letter and immediately thought that this 
whole thing is just the restricted DMA pool concept[1] again, only from 
a slightly different angle.


Robin.

[1] 
https://lore.kernel.org/linux-iommu/20210106034124.30560-1-tien...@chromium.org/


Re: [RFC PATCH v3 5/6] dt-bindings: of: Add restricted DMA pool

2021-01-21 Thread Robin Murphy

On 2021-01-21 15:48, Rob Herring wrote:

On Wed, Jan 20, 2021 at 7:10 PM Robin Murphy 
wrote:


On 2021-01-20 21:31, Rob Herring wrote:

On Wed, Jan 20, 2021 at 11:30 AM Robin Murphy
 wrote:


On 2021-01-20 16:53, Rob Herring wrote:

On Wed, Jan 06, 2021 at 11:41:23AM +0800, Claire Chang
wrote:

Introduce the new compatible string, restricted-dma-pool,
for restricted DMA. One can specify the address and length
of the restricted DMA memory region by restricted-dma-pool
in the device tree.


If this goes into DT, I think we should be able to use
dma-ranges for this purpose instead. Normally, 'dma-ranges'
is for physical bus restrictions, but there's no reason it
can't be used for policy or to express restrictions the
firmware has enabled.


There would still need to be some way to tell SWIOTLB to pick
up the corresponding chunk of memory and to prevent the kernel
from using it for anything else, though.


Don't we already have that problem if dma-ranges had a very
small range? We just get lucky because the restriction is
generally much more RAM than needed.


Not really - if a device has a naturally tiny addressing capability
that doesn't even cover ZONE_DMA32 where the regular SWIOTLB buffer
will be allocated then it's unlikely to work well, but that's just
crap system design. Yes, memory pressure in ZONE_DMA{32} is
particularly problematic for such limited devices, but it's
irrelevant to the issue at hand here.


Yesterday's crap system design is today's security feature. Couldn't 
this feature make crap system design work better?


Indeed! Say you bring out your shiny new "Strawberry Flan 4" machine
with all the latest connectivity, but tragically its PCIe can only
address 25% of the RAM. So you decide to support deploying it in two
configurations: one where it runs normally for best performance, and
another "secure" one where it dedicates that quarter of RAM as a 
restricted DMA pool for any PCIe devices - that way, even if that hotel 
projector you plug in turns out to be a rogue Thunderbolt endpoint, it 
can never snarf your private keys off your eMMC out of the page cache.


(Yes, it is the thinnest of strawmen, but it sets the scene for the 
point you raised...)


...which is that in both cases the dma-ranges will still be identical. 
So how is the kernel going to know whether to steal that whole area from 
memblock before anything else can allocate from it, or not?


I don't disagree that even in Claire's original intended case it would 
be semantically correct to describe the hardware-firewalled region with 
dma-ranges. It just turns out not to be necessary, and you're already 
arguing for not adding anything in DT that doesn't need to be.



What we have here is a device that's not allowed to see *kernel*
memory at all. It's been artificially constrained to a particular
region by a TZASC or similar, and the only data which should ever
be placed in that


May have been constrained, but that's entirely optional.

In the optional case where the setup is entirely up to the OS, I
don't think this belongs in the DT at all. Perhaps that should be
solved first.


Yes! Let's definitely consider that case! Say you don't have any 
security or physical limitations but want to use a bounce pool for some 
device anyway because reasons (perhaps copying streaming DMA data to a 
better guaranteed alignment gives an overall performance win). Now the 
*only* relevant thing to communicate to the kernel is to, ahem, reserve 
a large chunk of memory, and use it for this special purpose. Isn't that 
literally what reserved-memory bindings are for?



region is data intended for that device to see. That way if it
tries to go rogue it physically can't start slurping data intended
for other devices or not mapped for DMA at all. The bouncing is an
important part of this - I forget the title off-hand but there was
an interesting paper a few years ago which demonstrated that even
with an IOMMU, streaming DMA of in-place buffers could reveal
enough adjacent data from the same page to mount an attack on the
system. Memory pressure should be immaterial since the size of each
bounce pool carveout will presumably be tuned for the needs of the
given device.

In any case, wouldn't finding all the dma-ranges do this? We're 
already walking the tree to find the max DMA address now.


If all you can see are two "dma-ranges" properties, how do you
propose to tell that one means "this is the extent of what I can
address, please set my masks and dma-range-map accordingly and try
to allocate things where I can reach them" while the other means
"take this output range away from the page allocator and hook it up
as my dedicated bounce pool, because it is Serious Security Time"?
Especially since getting that choice wrong either way would be a
Bad Thing.


Either we have some heuristic based on the size or we add some hint. 
The point is let's build on what we already have for defining DMA 

Re: [RFC PATCH v3 5/6] dt-bindings: of: Add restricted DMA pool

2021-01-20 Thread Robin Murphy

On 2021-01-20 21:31, Rob Herring wrote:

On Wed, Jan 20, 2021 at 11:30 AM Robin Murphy  wrote:


On 2021-01-20 16:53, Rob Herring wrote:

On Wed, Jan 06, 2021 at 11:41:23AM +0800, Claire Chang wrote:

Introduce the new compatible string, restricted-dma-pool, for restricted
DMA. One can specify the address and length of the restricted DMA memory
region by restricted-dma-pool in the device tree.


If this goes into DT, I think we should be able to use dma-ranges for
this purpose instead. Normally, 'dma-ranges' is for physical bus
restrictions, but there's no reason it can't be used for policy or to
express restrictions the firmware has enabled.


There would still need to be some way to tell SWIOTLB to pick up the
corresponding chunk of memory and to prevent the kernel from using it
for anything else, though.


Don't we already have that problem if dma-ranges had a very small
range? We just get lucky because the restriction is generally much
more RAM than needed.


Not really - if a device has a naturally tiny addressing capability that 
doesn't even cover ZONE_DMA32 where the regular SWIOTLB buffer will be 
allocated then it's unlikely to work well, but that's just crap system 
design. Yes, memory pressure in ZONE_DMA{32} is particularly problematic 
for such limited devices, but it's irrelevant to the issue at hand here.


What we have here is a device that's not allowed to see *kernel* memory 
at all. It's been artificially constrained to a particular region by a 
TZASC or similar, and the only data which should ever be placed in that 
region is data intended for that device to see. That way if it tries to 
go rogue it physically can't start slurping data intended for other 
devices or not mapped for DMA at all. The bouncing is an important part 
of this - I forget the title off-hand but there was an interesting paper 
a few years ago which demonstrated that even with an IOMMU, streaming 
DMA of in-place buffers could reveal enough adjacent data from the same 
page to mount an attack on the system. Memory pressure should be 
immaterial since the size of each bounce pool carveout will presumably 
be tuned for the needs of the given device.



In any case, wouldn't finding all the dma-ranges do this? We're
already walking the tree to find the max DMA address now.


If all you can see are two "dma-ranges" properties, how do you propose 
to tell that one means "this is the extent of what I can address, please 
set my masks and dma-range-map accordingly and try to allocate things 
where I can reach them" while the other means "take this output range 
away from the page allocator and hook it up as my dedicated bounce pool, 
because it is Serious Security Time"? Especially since getting that 
choice wrong either way would be a Bad Thing.


Robin.


Signed-off-by: Claire Chang 
---
   .../reserved-memory/reserved-memory.txt   | 24 +++
   1 file changed, 24 insertions(+)

diff --git 
a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt 
b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
index e8d3096d922c..44975e2a1fd2 100644
--- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
+++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
@@ -51,6 +51,20 @@ compatible (optional) - standard definition
 used as a shared pool of DMA buffers for a set of devices. It can
 be used by an operating system to instantiate the necessary pool
 management subsystem if necessary.
+- restricted-dma-pool: This indicates a region of memory meant to be
+  used as a pool of restricted DMA buffers for a set of devices. The
+  memory region would be the only region accessible to those devices.
+  When using this, the no-map and reusable properties must not be set,
+  so the operating system can create a virtual mapping that will be 
used
+  for synchronization. The main purpose for restricted DMA is to
+  mitigate the lack of DMA access control on systems without an IOMMU,
+  which could result in the DMA accessing the system memory at
+  unexpected times and/or unexpected addresses, possibly leading to 
data
+  leakage or corruption. The feature on its own provides a basic level
+  of protection against the DMA overwriting buffer contents at
+  unexpected times. However, to protect against general data leakage 
and
+  system memory corruption, the system needs to provide way to restrict
+  the DMA to a predefined memory region.
   - vendor specific string in the form ,[-]
   no-map (optional) - empty property
   - Indicates the operating system must not create a virtual mapping
@@ -120,6 +134,11 @@ one for multimedia processing (named 
multimedia-memory@7700, 64MiB).
  compatible = "acme,multimedia-memory";
 

Re: [RFC PATCH v3 5/6] dt-bindings: of: Add restricted DMA pool

2021-01-20 Thread Robin Murphy

On 2021-01-20 16:53, Rob Herring wrote:

On Wed, Jan 06, 2021 at 11:41:23AM +0800, Claire Chang wrote:

Introduce the new compatible string, restricted-dma-pool, for restricted
DMA. One can specify the address and length of the restricted DMA memory
region by restricted-dma-pool in the device tree.


If this goes into DT, I think we should be able to use dma-ranges for
this purpose instead. Normally, 'dma-ranges' is for physical bus
restrictions, but there's no reason it can't be used for policy or to
express restrictions the firmware has enabled.


There would still need to be some way to tell SWIOTLB to pick up the 
corresponding chunk of memory and to prevent the kernel from using it 
for anything else, though.



Signed-off-by: Claire Chang 
---
  .../reserved-memory/reserved-memory.txt   | 24 +++
  1 file changed, 24 insertions(+)

diff --git 
a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt 
b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
index e8d3096d922c..44975e2a1fd2 100644
--- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
+++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
@@ -51,6 +51,20 @@ compatible (optional) - standard definition
used as a shared pool of DMA buffers for a set of devices. It can
be used by an operating system to instantiate the necessary pool
management subsystem if necessary.
+- restricted-dma-pool: This indicates a region of memory meant to be
+  used as a pool of restricted DMA buffers for a set of devices. The
+  memory region would be the only region accessible to those devices.
+  When using this, the no-map and reusable properties must not be set,
+  so the operating system can create a virtual mapping that will be 
used
+  for synchronization. The main purpose for restricted DMA is to
+  mitigate the lack of DMA access control on systems without an IOMMU,
+  which could result in the DMA accessing the system memory at
+  unexpected times and/or unexpected addresses, possibly leading to 
data
+  leakage or corruption. The feature on its own provides a basic level
+  of protection against the DMA overwriting buffer contents at
+  unexpected times. However, to protect against general data leakage 
and
+  system memory corruption, the system needs to provide way to restrict
+  the DMA to a predefined memory region.
  - vendor specific string in the form ,[-]
  no-map (optional) - empty property
  - Indicates the operating system must not create a virtual mapping
@@ -120,6 +134,11 @@ one for multimedia processing (named 
multimedia-memory@7700, 64MiB).
compatible = "acme,multimedia-memory";
reg = <0x7700 0x400>;
};
+
+   restricted_dma_mem_reserved: restricted_dma_mem_reserved {
+   compatible = "restricted-dma-pool";
+   reg = <0x5000 0x40>;
+   };
};
  
  	/* ... */

@@ -138,4 +157,9 @@ one for multimedia processing (named 
multimedia-memory@7700, 64MiB).
memory-region = <_reserved>;
/* ... */
};
+
+   pcie_device: pcie_device@0,0 {
+   memory-region = <&restricted_dma_mem_reserved>;


PCI hosts often have inbound window configurations that limit the
address range and translate PCI to bus addresses. Those windows happen
to be configured by dma-ranges. In any case, wouldn't you want to put
the configuration in the PCI host node? Is there a usecase of
restricting one PCIe device and not another?


The general design seems to accommodate devices having their own pools 
such that they can't even snoop on each others' transient DMA data. If 
the interconnect had a way of wiring up, say, PCI RIDs to AMBA NSAIDs, 
then in principle you could certainly apply that to PCI endpoints too 
(presumably you'd also disallow them from peer-to-peer transactions at 
the PCI level too).


Robin.


Re: [RFC PATCH v3 4/6] swiotlb: Add restricted DMA alloc/free support.

2021-01-13 Thread Robin Murphy

On 2021-01-13 12:48, Christoph Hellwig wrote:

+#ifdef CONFIG_SWIOTLB
+   if (unlikely(dev->dma_io_tlb_mem))
+   return swiotlb_alloc(dev, size, dma_handle, attrs);
+#endif


Another place where the dma_io_tlb_mem is useful to avoid the ifdef.


-phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
-   size_t mapping_size, size_t alloc_size,
-   enum dma_data_direction dir, unsigned long attrs)
+static int swiotlb_tbl_find_free_region(struct device *hwdev,
+   dma_addr_t tbl_dma_addr,
+   size_t alloc_size,
+   unsigned long attrs)



+static void swiotlb_tbl_release_region(struct device *hwdev, int index,
+  size_t size)


This refactoring should be another prep patch.



+void *swiotlb_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
+   unsigned long attrs)


I'd rather have the names convey they are for the per-device bounce
buffer in some form.


+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;


While we're at it I wonder if the io_tlb is something we could change
while we're at it.  Maybe replace io_tlb_mem with struct swiotlb
and rename the field in struct device to dev_swiotlb?


+   int index;
+   void *vaddr;
+   phys_addr_t tlb_addr;
+
+   size = PAGE_ALIGN(size);
+   index = swiotlb_tbl_find_free_region(dev, mem->start, size, attrs);
+   if (index < 0)
+   return NULL;
+
+   tlb_addr = mem->start + (index << IO_TLB_SHIFT);
+   *dma_handle = phys_to_dma_unencrypted(dev, tlb_addr);
+
+   if (!dev_is_dma_coherent(dev)) {
+   unsigned long pfn = PFN_DOWN(tlb_addr);
+
+   /* remove any dirty cache lines on the kernel alias */
+   arch_dma_prep_coherent(pfn_to_page(pfn), size);


Can we hook in somewhat lower level in the dma-direct code so that all
the remapping in dma-direct can be reused instead of duplicated?  That
also becomes important if we want to use non-remapping uncached support,
e.g. on mips or x86, or the direct changing of the attributes that Will
planned to look into for arm64.


Indeed, AFAICS this ought to boil down to a direct equivalent of 
__dma_direct_alloc_pages() - other than the address there should be no 
conceptual difference between pages from the restricted pool and those 
from the regular page allocator, so this probably deserves to be plumbed 
in as an alternative to that.
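
In other words, something along these lines - names and signatures assumed
purely for illustration, not the actual series:

static struct page *dma_direct_alloc_pool_pages(struct device *dev,
						size_t size, gfp_t gfp)
{
	/* A restricted pool, if present, is just another source of pages... */
	if (IS_ENABLED(CONFIG_SWIOTLB) && dev->dma_io_tlb_mem)
		return swiotlb_alloc_pages(dev, size);	/* hypothetical helper */

	/* ...otherwise take the normal path, so all the existing remapping
	 * and uncached handling applies unchanged either way. */
	return __dma_direct_alloc_pages(dev, size, gfp);	/* signature assumed */
}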


Robin.


Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Robin Murphy

On 2021-01-13 17:43, Florian Fainelli wrote:

On 1/13/21 7:27 AM, Robin Murphy wrote:

On 2021-01-13 13:59, Nicolas Saenz Julienne wrote:

Hi All,

On Tue, 2021-01-12 at 16:03 -0800, Florian Fainelli wrote:

On 1/5/21 7:41 PM, Claire Chang wrote:

Add the initialization function to create restricted DMA pools from
matching reserved-memory nodes in the device tree.

Signed-off-by: Claire Chang 
---
   include/linux/device.h  |   4 ++
   include/linux/swiotlb.h |   7 +-
   kernel/dma/Kconfig  |   1 +
   kernel/dma/swiotlb.c    | 144
++--
   4 files changed, 131 insertions(+), 25 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 89bb8b84173e..ca6f71ec8871 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -413,6 +413,7 @@ struct dev_links_info {
    * @dma_pools:    Dma pools (if dma'ble device).
    * @dma_mem:    Internal for coherent mem override.
    * @cma_area:    Contiguous memory area for dma allocations
+ * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
    * @archdata:    For arch-specific additions.
    * @of_node:    Associated device tree node.
    * @fwnode:    Associated device node supplied by platform firmware.
@@ -515,6 +516,9 @@ struct device {
   #ifdef CONFIG_DMA_CMA
   struct cma *cma_area;    /* contiguous memory area for dma
  allocations */
+#endif
+#ifdef CONFIG_SWIOTLB
+    struct io_tlb_mem    *dma_io_tlb_mem;
   #endif
   /* arch specific additions */
   struct dev_archdata    archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index dd8eb57cbb8f..a1bbd775 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -76,12 +76,13 @@ extern enum swiotlb_force swiotlb_force;
    *
    * @start:    The start address of the swiotlb memory pool. Used
to do a quick
    *    range check to see if the memory was in fact allocated
by this
- *    API.
+ *    API. For restricted DMA pool, this is device tree
adjustable.


Maybe write it as this is "firmware adjustable" such that when/if ACPI
needs something like this, the description does not need updating.


TBH I really don't think this needs calling out at all. Even in the
regular case, the details of exactly how and where the pool is allocated
are beyond the scope of this code - architectures already have several
ways to control that and make their own decisions.



[snip]


+static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
+    struct device *dev)
+{
+    struct io_tlb_mem *mem = rmem->priv;
+    int ret;
+
+    if (dev->dma_io_tlb_mem)
+    return -EBUSY;
+
+    if (!mem) {
+    mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+    if (!mem)
+    return -ENOMEM;
+
+    if (!memremap(rmem->base, rmem->size, MEMREMAP_WB)) {


MEMREMAP_WB sounds appropriate as a default.


As per the binding 'no-map' has to be disabled here. So AFAIU, this
memory will
be part of the linear mapping. Is this really needed then?


More than that, I'd assume that we *have* to use the linear/direct map
address rather than anything that has any possibility of being a vmalloc
remap, otherwise we can no longer safely rely on
phys_to_dma/dma_to_phys, no?


I believe you are right, which means that if we want to make use of the
restricted DMA pool on a 32-bit architecture (and we do, at least, I do)
we should probably add some error checking/warning to ensure the
restricted DMA pool falls within the linear map.


Oh, good point - I'm so used to 64-bit that I instinctively just blanked 
out the !PageHighMem() condition in try_ram_remap(). So maybe the 
original intent here *was* to effectively just implement that check, but 
if so it could still do with being a lot more explicit.
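
A minimal sketch of the check being suggested, assuming the reserved_mem
layout from Claire's patch (untested, with 32-bit highmem being the case of
interest):

static int rmem_swiotlb_check_linear_map(struct reserved_mem *rmem)
{
	unsigned long first = PHYS_PFN(rmem->base);
	unsigned long last = PHYS_PFN(rmem->base + rmem->size - 1);

	/* The pool must be covered by the linear map, otherwise
	 * phys_to_dma()/dma_to_phys() cannot be relied upon for it. */
	if (!pfn_valid(first) || !pfn_valid(last) ||
	    PageHighMem(pfn_to_page(first)) || PageHighMem(pfn_to_page(last)))
		return -EINVAL;

	return 0;
}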


Cheers,
Robin.


Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-13 Thread Robin Murphy

On 2021-01-13 13:59, Nicolas Saenz Julienne wrote:

Hi All,

On Tue, 2021-01-12 at 16:03 -0800, Florian Fainelli wrote:

On 1/5/21 7:41 PM, Claire Chang wrote:

Add the initialization function to create restricted DMA pools from
matching reserved-memory nodes in the device tree.

Signed-off-by: Claire Chang 
---
  include/linux/device.h  |   4 ++
  include/linux/swiotlb.h |   7 +-
  kernel/dma/Kconfig  |   1 +
  kernel/dma/swiotlb.c| 144 ++--
  4 files changed, 131 insertions(+), 25 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 89bb8b84173e..ca6f71ec8871 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -413,6 +413,7 @@ struct dev_links_info {
   * @dma_pools:Dma pools (if dma'ble device).
   * @dma_mem:  Internal for coherent mem override.
   * @cma_area: Contiguous memory area for dma allocations
+ * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
   * @archdata: For arch-specific additions.
   * @of_node:  Associated device tree node.
   * @fwnode:   Associated device node supplied by platform firmware.
@@ -515,6 +516,9 @@ struct device {
  #ifdef CONFIG_DMA_CMA
    struct cma *cma_area;   /* contiguous memory area for dma
   allocations */
+#endif
+#ifdef CONFIG_SWIOTLB
+   struct io_tlb_mem   *dma_io_tlb_mem;
  #endif
    /* arch specific additions */
    struct dev_archdata archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index dd8eb57cbb8f..a1bbd775 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -76,12 +76,13 @@ extern enum swiotlb_force swiotlb_force;
   *
   * @start:The start address of the swiotlb memory pool. Used to do a quick
   *range check to see if the memory was in fact allocated by this
- * API.
+ * API. For restricted DMA pool, this is device tree adjustable.


Maybe write it as this is "firmware adjustable" such that when/if ACPI
needs something like this, the description does not need updating.


TBH I really don't think this needs calling out at all. Even in the 
regular case, the details of exactly how and where the pool is allocated 
are beyond the scope of this code - architectures already have several 
ways to control that and make their own decisions.




[snip]


+static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   struct io_tlb_mem *mem = rmem->priv;
+   int ret;
+
+   if (dev->dma_io_tlb_mem)
+   return -EBUSY;
+
+   if (!mem) {
+   mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+   if (!mem)
+   return -ENOMEM;
+
+   if (!memremap(rmem->base, rmem->size, MEMREMAP_WB)) {


MEMREMAP_WB sounds appropriate as a default.


As per the binding 'no-map' has to be disabled here. So AFAIU, this memory will
be part of the linear mapping. Is this really needed then?


More than that, I'd assume that we *have* to use the linear/direct map 
address rather than anything that has any possibility of being a vmalloc 
remap, otherwise we can no longer safely rely on 
phys_to_dma/dma_to_phys, no?


That said, given that we're not actually using the returned address, I'm 
not entirely sure what the point of this call is anyway. If we can 
assume it's always going to return the linear map address via 
try_ram_remap() then we can equally just go ahead and use the linear map 
address straight away. I don't really see how we could ever hit the 
"is_ram == REGION_MIXED" case in memremap() that would return NULL, if 
we passed the memblock check earlier in __reserved_mem_alloc_size() such 
that this rmem node ever got to be initialised at all.


Robin.


Documentation/devicetree/bindings/reserved-memory/ramoops.txt does
define an "unbuffered" property which in premise could be applied to the
generic reserved memory binding as well and that we may have to be
honoring here, if we were to make it more generic. Oh well, this does
not need to be addressed right now I guess.






Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool

2021-01-12 Thread Robin Murphy

On 2021-01-07 17:57, Konrad Rzeszutek Wilk wrote:

On Fri, Jan 08, 2021 at 01:39:18AM +0800, Claire Chang wrote:

Hi Greg and Konrad,

This change is intended to be non-arch specific. Any arch that lacks DMA access
control and has devices not behind an IOMMU can make use of it. Could you share
why you think this should be arch specific?


The idea behind non-arch specific code is it to be generic. The devicetree
is specific to PowerPC, Sparc, and ARM, and not to x86 - hence it should
be in arch specific code.


Sorry, but that's an absurd argument. By the same token you'd equally 
have to claim that bits of, say, the Broadcom WiFi driver (not to 
mention dozens of others) should be split out into arch code, since not 
all platforms use the devicetree parts, nor the ACPI parts, nor the PCI 
parts...


There is nothing architecture-specific about using devicetree as a 
system description - AFAIK there *are* a handful of x86 platforms that 
use it, besides even more architectures than you've listed above. It has 
long been the policy that devicetree-related code for a particular 
subsystem should just live within that subsystem. Sometimes if there's 
enough of it it gets collected together into its own file - e.g. 
drivers/pci/of.c - otherwise it tends to just get #ifdef'ed - e.g. 
of_spi_parse_dt(), or the other DMA reserved-memory consumers that 
already exist as Florian points out.


Besides, there are far more platforms that enable CONFIG_OF than enable 
CONFIG_SWIOTLB, so by that metric the whole of the SWIOTLB code itself 
is even less "generic" than any DT parsing :P


Robin.


Re: [trivial PATCH] treewide: Convert switch/case fallthrough; to break;

2020-09-10 Thread Robin Murphy

On 2020-09-09 21:06, Joe Perches wrote:

fallthrough to a separate case/default label break; isn't very readable.

Convert pseudo-keyword fallthrough; statements to a simple break; when
the next label is case or default and the only statement in the next
label block is break;

Found using:

$ grep-2.5.4 -rP --include=*.[ch] -n 
"fallthrough;(\s*(case\s+\w+|default)\s*:\s*){1,7}break;" *

Miscellanea:

o Move or coalesce a couple label blocks above a default: block.

Signed-off-by: Joe Perches 
---

Compiled allyesconfig x86-64 only.
A few files for other arches were not compiled.



[...]

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index c192544e874b..743db1abec40 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3777,7 +3777,7 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
switch (FIELD_GET(IDR0_TTF, reg)) {
case IDR0_TTF_AARCH32_64:
smmu->ias = 40;
-   fallthrough;
+   break;
case IDR0_TTF_AARCH64:
break;
default:


I have to say I don't really agree with the readability argument for 
this one - a fallthrough is semantically correct here, since the first 
case is a superset of the second. It just happens that anything we would 
do for the common subset is implicitly assumed (there are other 
potential cases we simply haven't added support for at the moment), thus 
the second case is currently empty.


This change actively obfuscates that distinction.

Robin.


Re: [PATCH 1/2] dma-mapping: add a dma_ops_bypass flag to struct device

2020-03-23 Thread Robin Murphy

On 2020-03-20 2:16 pm, Christoph Hellwig wrote:

Several IOMMU drivers have a bypass mode where they can use a direct
mapping if the devices DMA mask is large enough.  Add generic support
to the core dma-mapping code to do that to switch those drivers to
a common solution.


Hmm, this is _almost_, but not quite the same as the case where drivers 
are managing their own IOMMU mappings, but still need to use streaming 
DMA for cache maintenance on the underlying pages. For that we need the 
ops bypass to be a "true" bypass and also avoid SWIOTLB regardless of 
the device's DMA mask. That behaviour should in fact be fine for the 
opportunistic bypass case here as well, since the mask being "big 
enough" implies by definition that this should never need to bounce either.


For the (hopefully less common) third case where, due to groups or user 
overrides, we end up giving an identity DMA domain to a device with 
limited DMA masks which _does_ need SWIOTLB, I'd like to think we can 
solve that by not giving the device IOMMU DMA ops in the first place, 
such that it never needs to engage the bypass mechanism at all.


Thoughts?

Robin.


Signed-off-by: Christoph Hellwig 
---
  include/linux/device.h  |  6 ++
  include/linux/dma-mapping.h | 30 ++
  kernel/dma/mapping.c| 36 +++-
  3 files changed, 51 insertions(+), 21 deletions(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 0cd7c647c16c..09be8bb2c4a6 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -525,6 +525,11 @@ struct dev_links_info {
   *  sync_state() callback.
   * @dma_coherent: this particular device is dma coherent, even if the
   *architecture supports non-coherent devices.
+ * @dma_ops_bypass: If set to %true then the dma_ops are bypassed for the
+ * streaming DMA operations (->map_* / ->unmap_* / ->sync_*),
+ * and optionally (if the coherent mask is large enough) also
+ * for dma allocations.  This flag is managed by the dma ops
+ * instance from ->dma_supported.
   *
   * At the lowest level, every device in a Linux system is represented by an
   * instance of struct device. The device structure contains the information
@@ -625,6 +630,7 @@ struct device {
  defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
booldma_coherent:1;
  #endif
+   booldma_ops_bypass : 1;
  };
  
  static inline struct device *kobj_to_dev(struct kobject *kobj)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 330ad58fbf4d..c3af0cf5e435 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -188,9 +188,15 @@ static inline int dma_mmap_from_global_coherent(struct 
vm_area_struct *vma,
  }
  #endif /* CONFIG_DMA_DECLARE_COHERENT */
  
-static inline bool dma_is_direct(const struct dma_map_ops *ops)

+/*
+ * Check if the devices uses a direct mapping for streaming DMA operations.
+ * This allows IOMMU drivers to set a bypass mode if the DMA mask is large
+ * enough.
+ */
+static inline bool dma_map_direct(struct device *dev,
+   const struct dma_map_ops *ops)
  {
-   return likely(!ops);
+   return likely(!ops) || dev->dma_ops_bypass;
  }
  
  /*

@@ -279,7 +285,7 @@ static inline dma_addr_t dma_map_page_attrs(struct device 
*dev,
dma_addr_t addr;
  
  	BUG_ON(!valid_dma_direction(dir));

-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
else
addr = ops->map_page(dev, page, offset, size, dir, attrs);
@@ -294,7 +300,7 @@ static inline void dma_unmap_page_attrs(struct device *dev, 
dma_addr_t addr,
const struct dma_map_ops *ops = get_dma_ops(dev);
  
  	BUG_ON(!valid_dma_direction(dir));

-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
dma_direct_unmap_page(dev, addr, size, dir, attrs);
else if (ops->unmap_page)
ops->unmap_page(dev, addr, size, dir, attrs);
@@ -313,7 +319,7 @@ static inline int dma_map_sg_attrs(struct device *dev, 
struct scatterlist *sg,
int ents;
  
  	BUG_ON(!valid_dma_direction(dir));

-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
@@ -331,7 +337,7 @@ static inline void dma_unmap_sg_attrs(struct device *dev, 
struct scatterlist *sg
  
  	BUG_ON(!valid_dma_direction(dir));

debug_dma_unmap_sg(dev, sg, nents, dir);
-   if (dma_is_direct(ops))
+   if (dma_map_direct(dev, ops))
dma_direct_unmap_sg(dev, sg, nents, dir, attrs);
else if (ops->unmap_sg)
ops->unmap_sg(dev, sg, nents, dir, attrs);
@@ -352,7 +358,7 @@ static inline 

Re: [PATCH -v2] treewide: Rename "unencrypted" to "decrypted"

2020-03-19 Thread Robin Murphy

[since this is in my inbox...]

On 2020-03-19 10:28 am, Borislav Petkov wrote:

On Thu, Mar 19, 2020 at 11:20:11AM +0100, Christoph Hellwig wrote:

I thought we agreed that decrypted is absolutely the wrong term.


I don't think we did. At least I don't know where we did that.


So NAK - if you want to change things it needs to go the other way.


We are already using "decrypted" everywhere in arch/x86/. Changing that
would be a *lot* more churn.

And it is just a term, for chrissakes, to denote memory which is not
encrypted. And it would make our lifes easier if we had only *two* terms
instead of three or more. Especially if the concept we denote with this
is a binary one: encrypted memory and *not* encrypted memory.


Let me add another vote from a native English speaker that "unencrypted" 
is the appropriate term to imply the *absence* of encryption, whereas 
"decrypted" implies the *reversal* of applied encryption.


Naming things is famously hard, for good reason - names are *important* 
for understanding. Just because a decision was already made one way 
doesn't mean that that decision was necessarily right. Churning one area 
to be consistently inaccurate just because it's less work than churning 
another area to be consistently accurate isn't really the best excuse.


Robin.


Re: [PATCH v2] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-26 Thread Robin Murphy

On 2019-11-26 6:51 pm, Nicolas Saenz Julienne wrote:

On Mon, 2019-11-25 at 16:33 +, Robin Murphy wrote:

On 25/11/2019 7:44 am, Christoph Hellwig wrote:

On Sat, Nov 23, 2019 at 09:51:08AM -0700, Nathan Chancellor wrote:

Just as an FYI, this introduces a warning on arm32 allyesconfig for me:


I think the dma_limit argument to iommu_dma_alloc_iova should be a u64
and/or we need to use min_t and open code the zero exception.

Robin, Nicolas - any opinions?


Yeah, given that it's always held a mask I'm not entirely sure why it
was ever a dma_addr_t rather than a u64. Unless anyone else is desperate
to do it I'll get a cleanup patch ready for rc1.


Sounds good to me too

Robin, since I started the mess, I'll be happy to do it if it helps offloading
some work from you.


No worries - your change only exposed my original weird decision ;)  On 
second look the patch was literally a trivial one-liner, so I've written 
it up already.
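
(For reference, the change in question is essentially just flipping the
parameter type - sketched from memory here, so not necessarily the exact
final diff:)

 static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
-		size_t size, dma_addr_t dma_limit, struct device *dev)
+		size_t size, u64 dma_limit, struct device *dev)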


Cheers,
Robin.


Re: [PATCH v2] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-25 Thread Robin Murphy

On 25/11/2019 7:44 am, Christoph Hellwig wrote:

On Sat, Nov 23, 2019 at 09:51:08AM -0700, Nathan Chancellor wrote:

Just as an FYI, this introduces a warning on arm32 allyesconfig for me:


I think the dma_limit argument to iommu_dma_alloc_iova should be a u64
and/or we need to use min_t and open code the zero exception.

Robin, Nicolas - any opinions?


Yeah, given that it's always held a mask I'm not entirely sure why it 
was ever a dma_addr_t rather than a u64. Unless anyone else is desperate 
to do it I'll get a cleanup patch ready for rc1.



Also I wonder how this file gets compiled on arm32 given that arm32
has its own set of iommu dma ops..


As long as the dependencies for CONFIG_IOMMU_DMA are met it can be built 
even when it's not actually used. That said, I might have expected that 
arm allyesconfig ends up with CONFIG_ARCH_DMA_ADDR_T_64BIT=y anyway; I 
guess it must pick some of CONFIG_ARM_LPAE's negative dependencies.


(/me doesn't feel like jumping down the all*config rabbit hole today)

Robin.


Re: [PATCH v2] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-21 Thread Robin Murphy

On 21/11/2019 9:26 am, Nicolas Saenz Julienne wrote:

Using a mask to represent bus DMA constraints has a set of limitations.
The biggest one being it can only hold a power of two (minus one). The
DMA mapping code is already aware of this and treats dev->bus_dma_mask
as a limit. This quirk is already used by some architectures although
still rare.

With the introduction of the Raspberry Pi 4 we've found a new contender
for the use of bus DMA limits, as its PCIe bus can only address the
lower 3GB of memory (of a total of 4GB). This is impossible to represent
with a mask. To make things worse the device-tree code rounds non power
of two bus DMA limits to the next power of two, which is unacceptable in
this case.

In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
over the tree and treat it as such. Note that dev->bus_dma_limit should
contain the higher accesible DMA address.


^^ super-nit only because I can't not see my editor currently 
highlighting the typo: "accessible"


Regardless of that though,

Reviewed-by: Robin Murphy 


Signed-off-by: Nicolas Saenz Julienne 

---

Changes since v1:
   - rework ACPI code to avoid divergence with OF's

  arch/mips/pci/fixup-sb1250.c  | 16 
  arch/powerpc/sysdev/fsl_pci.c |  6 +++---
  arch/x86/kernel/pci-dma.c |  2 +-
  arch/x86/mm/mem_encrypt.c |  2 +-
  arch/x86/pci/sta2x11-fixup.c  |  2 +-
  drivers/acpi/arm64/iort.c | 20 +++-
  drivers/ata/ahci.c|  2 +-
  drivers/iommu/dma-iommu.c |  3 +--
  drivers/of/device.c   |  9 +
  include/linux/device.h|  6 +++---
  include/linux/dma-direct.h|  2 +-
  include/linux/dma-mapping.h   |  2 +-
  kernel/dma/direct.c   | 27 +--
  13 files changed, 46 insertions(+), 53 deletions(-)

diff --git a/arch/mips/pci/fixup-sb1250.c b/arch/mips/pci/fixup-sb1250.c
index 8a41b359cf90..40efc990cdce 100644
--- a/arch/mips/pci/fixup-sb1250.c
+++ b/arch/mips/pci/fixup-sb1250.c
@@ -21,22 +21,22 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SIBYTE, 
PCI_DEVICE_ID_BCM1250_PCI,
  
  /*

   * The BCM1250, etc. PCI host bridge does not support DAC on its 32-bit
- * bus, so we set the bus's DMA mask accordingly.  However the HT link
+ * bus, so we set the bus's DMA limit accordingly.  However the HT link
   * down the artificial PCI-HT bridge supports 40-bit addressing and the
   * SP1011 HT-PCI bridge downstream supports both DAC and a 64-bit bus
   * width, so we record the PCI-HT bridge's secondary and subordinate bus
- * numbers and do not set the mask for devices present in the inclusive
+ * numbers and do not set the limit for devices present in the inclusive
   * range of those.
   */
-struct sb1250_bus_dma_mask_exclude {
+struct sb1250_bus_dma_limit_exclude {
bool set;
unsigned char start;
unsigned char end;
  };
  
-static int sb1250_bus_dma_mask(struct pci_dev *dev, void *data)

+static int sb1250_bus_dma_limit(struct pci_dev *dev, void *data)
  {
-   struct sb1250_bus_dma_mask_exclude *exclude = data;
+   struct sb1250_bus_dma_limit_exclude *exclude = data;
bool exclude_this;
bool ht_bridge;
  
@@ -55,7 +55,7 @@ static int sb1250_bus_dma_mask(struct pci_dev *dev, void *data)

exclude->start, exclude->end);
} else {
dev_dbg(&dev->dev, "disabling DAC for device");
-   dev->dev.bus_dma_mask = DMA_BIT_MASK(32);
+   dev->dev.bus_dma_limit = DMA_BIT_MASK(32);
}
  
  	return 0;

@@ -63,9 +63,9 @@ static int sb1250_bus_dma_mask(struct pci_dev *dev, void 
*data)
  
  static void quirk_sb1250_pci_dac(struct pci_dev *dev)

  {
-   struct sb1250_bus_dma_mask_exclude exclude = { .set = false };
+   struct sb1250_bus_dma_limit_exclude exclude = { .set = false };
  
-	pci_walk_bus(dev->bus, sb1250_bus_dma_mask, &exclude);

+   pci_walk_bus(dev->bus, sb1250_bus_dma_limit, &exclude);
  }
  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SIBYTE, PCI_DEVICE_ID_BCM1250_PCI,
quirk_sb1250_pci_dac);
diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index ff0e2b156cb5..617a443d673d 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -115,8 +115,8 @@ static void pci_dma_dev_setup_swiotlb(struct pci_dev *pdev)
  {
struct pci_controller *hose = pci_bus_to_host(pdev->bus);
  
-	pdev->dev.bus_dma_mask =

-   hose->dma_window_base_cur + hose->dma_window_size;
+   pdev->dev.bus_dma_limit =
+   hose->dma_window_base_cur + hose->dma_window_size - 1;
  }
  
  static void setup_swiotlb_ops(struct pci_controller *hose)

@@ -135,7 +135,7 @@ static void fsl_pci_dma_set_mask(struct device *dev, u64 
dma_mask)
 * mapping that allows addressing any RAM address from across PCI.
 */
if (dev_is_pci(dev) &&

Re: generic DMA bypass flag

2019-11-21 Thread Robin Murphy

On 21/11/2019 7:34 am, Christoph Hellwig wrote:

Robin, does this mean you ACK this series for the powerpc use case?


Yeah, I think we've nailed down sufficient justification now for having 
a generalised flag, so at that point it makes every bit of sense to 
convert PPC's private equivalent.


Robin.


Alexey and other ppc folks: can you take a look?
___
iommu mailing list
io...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu



Re: [PATCH v2] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-21 Thread Robin Murphy

On 21/11/2019 3:26 pm, Christoph Hellwig wrote:

On Thu, Nov 21, 2019 at 04:24:57PM +0100, Christoph Hellwig wrote:

On Thu, Nov 21, 2019 at 10:26:44AM +0100, Nicolas Saenz Julienne wrote:

Using a mask to represent bus DMA constraints has a set of limitations.
The biggest one being it can only hold a power of two (minus one). The
DMA mapping code is already aware of this and treats dev->bus_dma_mask
as a limit. This quirk is already used by some architectures although
still rare.

With the introduction of the Raspberry Pi 4 we've found a new contender
for the use of bus DMA limits, as its PCIe bus can only address the
lower 3GB of memory (of a total of 4GB). This is impossible to represent
with a mask. To make things worse the device-tree code rounds non power
of two bus DMA limits to the next power of two, which is unacceptable in
this case.

In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
over the tree and treat it as such. Note that dev->bus_dma_limit should
contain the higher accesible DMA address.

Signed-off-by: Nicolas Saenz Julienne 


I've tentatively added this patch to the dma-mapping tree based on
Robin's principal approval of the last version.  That way tomorrow's
linux-next run should still pick it up.


Actually.  This doesn't apply because the dma-mapping tree doesn't
have you zone_dma_bits change.  I guess we'll need to wait for the
next merge window, or maybe post rc1 if this happens to fix the
powerpc problem that Christian reported.


Hmm, there's no functional dependency though, is there? AFAICS it's 
essentially just a context conflict. Is it worth simply dropping (or 
postponing) the local renaming in __dma_direct_optimal_gfp_mask(), or 
perhaps even cross-merging arm64/for-next/zone-dma into dma/for-next?


Robin.


Re: Bug 205201 - Booting halts if Dawicontrol DC-2976 UW SCSI board installed, unless RAM size limited to 3500M

2019-11-21 Thread Robin Murphy

On 21/11/2019 12:21 pm, Christian Zigotzky wrote:

On 21 November 2019 at 01:16 pm, Christian Zigotzky wrote:

On 21 November 2019 at 08:29 am, Christoph Hellwig wrote:

On Sat, Nov 16, 2019 at 08:06:05AM +0100, Christian Zigotzky wrote:

/*
  *  DMA addressing mode.
  *
  *  0 : 32 bit addressing for all chips.
  *  1 : 40 bit addressing when supported by chip.
  *  2 : 64 bit addressing when supported by chip,
  *  limited to 16 segments of 4 GB -> 64 GB max.
  */
#define   SYM_CONF_DMA_ADDRESSING_MODE 
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE


Cyrus config:

CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1

I will configure “0 : 32 bit addressing for all chips” for the RC8. 
Maybe this is the solution.

0 means you are going to do bounce buffering a lot, which seems
generally like a bad idea.

But why are we talking about the sym53c8xx driver now?  The last issue
you reported was about video4linux allocations.

Both drivers have the same problem. They don't work if we have more 
than 3.5GB RAM. I try to find a solution until you have a good 
solution. I have already a solution for V4L but I still need one for 
the sym53c8xx driver.
OK, you mean that "0" is a bad idea but maybe it works until you have a 
solution. ;-)


Is this on the same machine with the funny non-power-of-two bus_dma_mask 
as your other report? If so, does Nicolas' latest patch[1] help at all?


Robin.

[1] 
https://lore.kernel.org/linux-iommu/20191121092646.8449-1-nsaenzjulie...@suse.de/T/#u


Re: generic DMA bypass flag

2019-11-19 Thread Robin Murphy

On 16/11/2019 6:22 am, Christoph Hellwig wrote:

On Fri, Nov 15, 2019 at 06:12:48PM +, Robin Murphy wrote:

And is that any different from where you would choose to "just" set a
generic bypass flag?


Same spots, as intel-iommu moves from the identify to a dma domain when
setting a 32-bit mask.  But that means once a 32-bit mask is set we can't
ever go back to the 64-bit one.


Is that a problem though? It's not safe in general to rewrite the 
default domain willy-nilly, so if it's a concern that drivers get stuck 
having to use a translation domain if they do something dumb like:


if (!dma_set_mask(DMA_BIT_MASK(32)))
	dma_set_mask(DMA_BIT_MASK(64));

then the simple solution is "don't do that" - note that this doesn't 
affect overriding of the default 32-bit mask, because we don't use the 
driver API to initialise those.



 And we had a couple drivers playing
interesting games there.


If the games you're worried about are stuff like:

dma_set_mask(dev, DMA_BIT_MASK(64));
high_buf = dma_alloc_coherent(dev, ...);
dma_set_mask(dev, DMA_BIT_MASK(32));
low_buf = dma_alloc_coherent(dev, ...);

then iommu_need_mapping() already ensures that will end spectacularly 
badly. Unless we can somehow log when a mask has been "committed" by a 
mapping operation, I don't think any kind of opportunistic bypass 
mechanism is ever not going to blow up that case.
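
(For illustration only - nothing like this exists, and the field name is
made up - "logging" a committed mask might look something like:)

	/* in dma_map_page_attrs() and friends, once a mapping is created: */
	dev->dma_mask_committed = true;

	/* ...so that a later mask change can at least complain loudly: */
	int dma_set_mask(struct device *dev, u64 mask)
	{
		if (dev_WARN_ONCE(dev, dev->dma_mask_committed,
				  "changing DMA mask after mappings were created\n"))
			return -EBUSY;
		...
	}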



 FYI, this is the current intel-iommu
WIP conversion to the dma bypass flag:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-bypass


Having thought a bit more, I guess my idea does end up with one slightly 
ugly corner wherein dma_direct_supported() has to learn to look for an 
IOMMU default domain and try iommu_dma_supported() before saying no, 
even if it's clean everywhere else. The bypass flag is more 'balanced' 
in terms of being equally invasive everywhere and preserving abstraction 
a bit better. Plus I think it might let us bring back the default 
assignment of dma_dummy_ops, which I do like the thought of :D


Either way, making sure that the fundamental bypass decision is correct 
and robust is still far more important than the implementation details.


Robin.


Re: [PATCH] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-19 Thread Robin Murphy

On 19/11/2019 5:00 pm, Christoph Hellwig wrote:

On Tue, Nov 19, 2019 at 01:57:43PM +0100, Nicolas Saenz Julienne wrote:

Hi Rob & Christoph,
do you mind if I append v2 of this into my upcoming v3 RPi4 PCIe support
series? I didn't do it initially as I thought this was going to be a
contentious patch.  But as it turned out better than expected, I think it
should go into the PCIe series. In the end it's the first explicit user of the
bus DMA limit.

Here's v2 in case you don't know what I'm talking about:
https://www.spinics.net/lists/arm-kernel/msg768459.html


In principle I wouldn't mind, but I think this is going to conflict
quite badly with other changes in the dma-mapping tree (including
yours).  So I think we'll need a shared tree or I'll need to pull
in the whole series through the dma-mapping tree if there are not
other conflicts and the other maintainers are fine with it.


TBH I can't see it being a massive problem even if the DMA patch, driver 
and DTS patch went entirely separately via the respective DMA, PCI, and 
arm-soc trees in the same cycle. Bisecting over a merge window is a big 
enough pain in the bum as it is, and if the worst case is that someone 
trying to do that on a Pi4 has a wonky PCI controller appear for a 
couple of commits, they may as well just disable that driver for their 
bisection, because it wasn't there at the start so can't possibly be the 
thing they're looking for regressions in ;)


Robin.


Re: [PATCH 1/3] dma-direct: unify the dma_capable definitions

2019-11-19 Thread Robin Murphy

On 19/11/2019 10:26 am, Marek Szyprowski wrote:

Hi Krzysztof,

On 19.11.2019 10:44, Krzysztof Kozlowski wrote:

On Tue, 19 Nov 2019 at 17:27, Marek Szyprowski  wrote:

Hi Christoph,

On 13.11.2019 08:35, Christoph Hellwig wrote:

Currently each architectures that wants to override dma_to_phys and
phys_to_dma also has to provide dma_capable.  But there isn't really
any good reason for that.  powerpc and mips just have copies of the
generic one minus the latests fix, and the arm one was the inspiration
for said fix, but misses the bus_dma_mask handling.
Make all architectures use the generic version instead.

Signed-off-by: Christoph Hellwig 

This patch breaks DMAengine PL330 driver on Samsung Exynos SoCs:


Thanks Marek for bisecting it. I wonder whether it is also the cause
for boot failures I see on NXP Vybrid VF50 SoC (NXP/Freescale
fsl-edma) since few days:

[ 2.853428] fsl-edma 40018000.dma-controller: overflow 0x40027007+1 of
DMA mask  bus mask 0
[ 2.862566] [ cut here ]
[ 2.867273] WARNING: CPU: 0 PID: 1 at
/home/buildbot/worker/builddir_yocto/build/build/tmp/work-shared/col-vf50-proceq-mainline-next/kernel-source/kernel/dma/direct.c:35
report_addr+0xc0/0xfc
[ 2.884380] Modules linked in:
[ 2.887486] CPU: 0 PID: 1 Comm: swapper Tainted: G W
5.4.0-rc7-next-20191118-g519ead8f6a32 #1
[ 2.897364] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
[ 2.903892] [<8010ddfc>] (unwind_backtrace) from [<8010b4b8>]
(show_stack+0x10/0x14)
[ 2.911712] [<8010b4b8>] (show_stack) from [<8011b08c>] (__warn+0xd4/0xec)
[ 2.918653] [<8011b08c>] (__warn) from [<8011b154>]
(warn_slowpath_fmt+0xb0/0xb8)
[ 2.926218] [<8011b154>] (warn_slowpath_fmt) from [<80155f7c>]
(report_addr+0xc0/0xfc)
[ 2.934221] [<80155f7c>] (report_addr) from [<801561f0>]
(dma_direct_map_resource+0x98/0xa4)
[ 2.942744] [<801561f0>] (dma_direct_map_resource) from [<8041d5d4>]
(fsl_edma_prep_slave_dma+0x12c/0x150)
[ 2.952475] [<8041d5d4>] (fsl_edma_prep_slave_dma) from [<8041d8cc>]
(fsl_edma_prep_dma_cyclic+0x30/0x21c)
[ 2.962213] [<8041d8cc>] (fsl_edma_prep_dma_cyclic) from [<80452e10>]
(lpuart_rx_dma_startup+0x188/0x36c)
[ 2.971871] [<80452e10>] (lpuart_rx_dma_startup) from [<80453058>]
(lpuart_startup+0x64/0x78)
[ 2.980477] [<80453058>] (lpuart_startup) from [<8044e924>]
(uart_startup.part.7+0x110/0x23c)
[ 2.989080] [<8044e924>] (uart_startup.part.7) from [<8044eaa0>]
(uart_port_activate+0x50/0x7c)
[ 2.997857] [<8044eaa0>] (uart_port_activate) from [<80438dc0>]
(tty_port_open+0x74/0xc0)
[ 3.006111] [<80438dc0>] (tty_port_open) from [<8044be30>] (uart_open+0x18/0x20)
[ 3.013588] [<8044be30>] (uart_open) from [<80431b4c>] (tty_open+0x108/0x428)
[ 3.020794] [<80431b4c>] (tty_open) from [<801edb48>] (chrdev_open+0xac/0x164)
[ 3.028098] [<801edb48>] (chrdev_open) from [<801e55c8>]
(do_dentry_open+0x22c/0x3e4)
[ 3.036010] [<801e55c8>] (do_dentry_open) from [<801f72a8>]
(path_openat+0x4a4/0xf78)
[ 3.043912] [<801f72a8>] (path_openat) from [<801f8d34>]
(do_filp_open+0x70/0xdc)
[ 3.051472] [<801f8d34>] (do_filp_open) from [<801e6998>]
(do_sys_open+0x128/0x1f4)
[ 3.059217] [<801e6998>] (do_sys_open) from [<80a00ee0>]
(kernel_init_freeable+0x150/0x1c4)
[ 3.067658] [<80a00ee0>] (kernel_init_freeable) from [<8068e208>]
(kernel_init+0x8/0x110)
[ 3.075917] [<8068e208>] (kernel_init) from [<801010e8>]
(ret_from_fork+0x14/0x2c)
[ 3.083539] Exception stack(0x86843fb0 to 0x86843ff8)
[ 3.088631] 3fa0:    
[ 3.096866] 3fc0:     
  
[ 3.105103] 3fe0:     0013 
[ 3.111752] ---[ end trace 6fb699802256a309 ]---
[3.116423] fsl-lpuart 40027000.serial: Cannot prepare cyclic DMA
[3.192968] VFS: Mounted root (nfs4 filesystem) on device 0:13.
[3.201432] devtmpfs: mounted
[3.210985] Freeing unused kernel memory: 1024K
[3.217854] Run /sbin/init as init process
[4.643355] systemd[1]: System time before build time, advancing clock.
[4.774106] random: systemd: uninitialized urandom read (16 bytes read)
[4.838361] systemd[1]: systemd 232 running in system mode. (-PAM
-AUDIT -SELINUX -IMA -APPARMOR -SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP
-GCRYPT -GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID -ELFUTILS +KMOD -IDN)
[4.858997] systemd[1]: Detected architecture arm.
[4.873438] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA!
[4.880138] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA!
[4.886585] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA!
[4.893124] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA!
[4.899679] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA!
[4.906110] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA!
[4.912616] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA!

Although maybe that's just the fsl-edma problem?


This is the same issue.

A quick look at the dma-direct code revealed that 

Re: generic DMA bypass flag

2019-11-15 Thread Robin Murphy

On 14/11/2019 7:41 am, Christoph Hellwig wrote:

On Wed, Nov 13, 2019 at 02:45:15PM +, Robin Murphy wrote:

In all honesty, this seems silly. If we can set a per-device flag to say
"oh, bypass these ops and use some other ops instead", then we can just as
easily simply give the device the appropriate ops in the first place.
Because, y'know, the name of the game is "per-device ops".


Except that we can't do it _that_ easily.  The problem is that for both
the powerpc and intel case the selection is dynamic.  If a device is in
the identity domain with intel-iommu (or the equivalent on powerpc which
doesn't use the normal iommu framework), we still want to use the iommu
to be able to map memory for devices with a too small dma mask using
the iommu instead of using swiotlb bouncing.  So to "just" use the
per-device dma ops we'd need:

   a) a hook in dma_direct_supported to pick another set of ops for
  small dma masks
   b) a hook in the IOMMU ops to propagate to the direct ops for full
  64-bit masks


And is that any different from where you would choose to "just" set a 
generic bypass flag?



diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d658c7c6a2ab..40e096d3dbc5 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2242,12 +2242,14 @@ request_default_domain_for_dev(struct device 
*dev, unsigned long type)

 /* Request that a device is direct mapped by the IOMMU */
 int iommu_request_dm_for_dev(struct device *dev)
 {
+   set_dma_ops(dev, NULL);
return request_default_domain_for_dev(dev, IOMMU_DOMAIN_IDENTITY);
 }

 /* Request that a device can't be direct mapped by the IOMMU */
 int iommu_request_dma_domain_for_dev(struct device *dev)
 {
+   set_dma_ops(dev, &iommu_dma_ops);
return request_default_domain_for_dev(dev, IOMMU_DOMAIN_DMA);
 }


If intel-iommu gets fully converted such that everyone using default 
domains is also using iommu-dma, that should be it as far as the generic 
DMA ops are concerned (ultimately we might even be able to do it in 
__iommu_attach_device() based on the domain type). Like I said, the hard 
part is deciding when to make *these* calls, but apparently intel-iommu 
can already do that.



I looked into that for powerpc a while ago and it wasn't pretty at all.
Compared to that just checking another flag for the DMA direct calls
is relatively clean and trivial as seens in the diffstat for this series
alone.


I don't see a great benefit to pulling legacy cruft out into common code
instead of just working to get rid of it in-place, when said cruft pulls in
the opposite direction to where we're taking the common code (i.e. it's
inherently based on the premise of global ops).


I'm not sure what legacy cruft it pull in.  I think it actually fits very
much into a mental model of "direct mapping is the default, to be overriden
if needed" which is pretty close to what we have at the moment.  Just
with a slightly more complicated processing of the override.


Because the specific "slightly more complicated" currently used by the 
existing powerpc code will AFAICS continue to be needed only by the 
existing powerpc code, thus I don't see any benefit to cluttering up the 
common code with it today. You may as well just refactor powerpc to 
swizzle its own ops to obviate archdata.iommu_bypass, and delete a fair 
chunk of old code.
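
(i.e. something along these lines - a hand-wavy sketch, the helper name is
invented - rather than testing archdata.iommu_bypass in every dma_iommu_*
callback:)

	static void dma_iommu_set_bypass(struct device *dev, bool bypass)
	{
		if (bypass)
			set_dma_ops(dev, NULL);		/* NULL ops == dma-direct */
		else
			set_dma_ops(dev, &dma_iommu_ops);
	}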


Robin.


Re: [PATCH] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-13 Thread Robin Murphy

On 2019-11-13 8:41 pm, Florian Fainelli wrote:

On 11/13/19 12:34 PM, Robin Murphy wrote:

On 13/11/2019 4:13 pm, Nicolas Saenz Julienne wrote:

Using a mask to represent bus DMA constraints has a set of limitations.
The biggest one being it can only hold a power of two (minus one). The
DMA mapping code is already aware of this and treats dev->bus_dma_mask
as a limit. This quirk is already used by some architectures although
still rare.

With the introduction of the Raspberry Pi 4 we've found a new contender
for the use of bus DMA limits, as its PCIe bus can only address the
lower 3GB of memory (of a total of 4GB). This is impossible to represent
with a mask. To make things worse the device-tree code rounds non power
of two bus DMA limits to the next power of two, which is unacceptable in
this case.

In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
over the tree and treat it as such. Note that dev->bus_dma_limit is
meant to contain the higher accesible DMA address.


Neat, you win a "why didn't I do it that way in the first place?" :)

Looking at it without all the history of previous attempts, this looks
entirely reasonable, and definitely a step in the right direction.


And while you are changing those, would it make sense to not only rename
the structure member but introduce a getter and setter in order to ease
future work where this would no longer be a scalar?


I doubt it - once we get as a far as supporting multiple DMA ranges, 
there will be a whole load of infrastructure churn anyway if only to 
replace dma_pfn_offset, and I'm not sure a simple get/set paradigm would 
even be viable, so it's probably better to save that until clearly 
necessary.


Robin.


Re: [PATCH] dma-mapping: treat dev->bus_dma_mask as a DMA limit

2019-11-13 Thread Robin Murphy

On 13/11/2019 4:13 pm, Nicolas Saenz Julienne wrote:

Using a mask to represent bus DMA constraints has a set of limitations.
The biggest one being it can only hold a power of two (minus one). The
DMA mapping code is already aware of this and treats dev->bus_dma_mask
as a limit. This quirk is already used by some architectures although
still rare.

With the introduction of the Raspberry Pi 4 we've found a new contender
for the use of bus DMA limits, as its PCIe bus can only address the
lower 3GB of memory (of a total of 4GB). This is impossible to represent
with a mask. To make things worse the device-tree code rounds non power
of two bus DMA limits to the next power of two, which is unacceptable in
this case.

In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
over the tree and treat it as such. Note that dev->bus_dma_limit is
meant to contain the higher accesible DMA address.


Neat, you win a "why didn't I do it that way in the first place?" :)

Looking at it without all the history of previous attempts, this looks 
entirely reasonable, and definitely a step in the right direction.


[...]

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 5a7551d060f2..f18827cf96df 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1097,7 +1097,7 @@ void iort_dma_setup(struct device *dev, u64 *dma_addr, 
u64 *dma_size)
 * Limit coherent and dma mask based on size
 * retrieved from firmware.
 */
-   dev->bus_dma_mask = mask;
+   dev->bus_dma_limit = mask;


Although this preserves the existing behaviour, as in of_dma_configure() 
we can do better here since we have the original address range to hand. 
I think it's worth keeping the ACPI and OF paths in sync for minor 
tweaks like this, rather than letting them diverge unnecessarily.


Otherwise, the rest looks OK to me - in principle we could store it as 
an exclusive limit such that we could then streamline the min_not_zero() 
tests to just min(mask, limit - 1), but that's probably too clever for 
its own good.
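
(i.e. something like the below, where an unset exclusive limit of 0
conveniently wraps to ~0ULL so the special case falls out for free - neat,
but arguably too subtle to live with:)

	/* inclusive limit as in this patch (0 means "no limit"): */
	end = min_not_zero(dma_get_mask(dev), dev->bus_dma_limit);

	/* hypothetical exclusive-limit alternative: */
	end = min(dma_get_mask(dev), dev->bus_dma_limit - 1);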


Robin.


dev->coherent_dma_mask = mask;
*dev->dma_mask = mask;
}


Re: generic DMA bypass flag

2019-11-13 Thread Robin Murphy

On 13/11/2019 1:37 pm, Christoph Hellwig wrote:

Hi all,

I've recently beeing chatting with Lu about using dma-iommu and
per-device DMA ops in the intel IOMMU driver, and one missing feature
in dma-iommu is a bypass mode where the direct mapping is used even
when an iommu is attached to improve performance.  The powerpc
code already has a similar mode, so I'd like to move it to the core
DMA mapping code.  As part of that I noticed that the current
powerpc code has a little bug in that it used the wrong check in the
dma_sync_* routines to see if the direct mapping code is used.


In all honesty, this seems silly. If we can set a per-device flag to say 
"oh, bypass these ops and use some other ops instead", then we can just 
as easily simply give the device the appropriate ops in the first place. 
Because, y'know, the name of the game is "per-device ops".



These two patches just add the generic code and move powerpc over,
the intel IOMMU bits will require a separate discussion.


The ops need to match the default domain type, so the hard part of the 
problem is choosing when and if to switch that (particularly fun if 
there are multiple devices in the IOMMU group). Flipping between 
iommu-dma and dma-direct at that point will be trivial.


I don't see a great benefit to pulling legacy cruft out into common code 
instead of just working to get rid of it in-place, when said cruft pulls 
in the opposite direction to where we're taking the common code (i.e. 
it's inherently based on the premise of global ops).


Robin.


The x86 AMD Gart code also has a bypass mode, but it is a lot
strange, so I'm not going to touch it for now.
___
iommu mailing list
io...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu



Re: [PATCH 1/2] dma-mapping: Add dma_addr_is_phys_addr()

2019-10-14 Thread Robin Murphy

On 14/10/2019 05:51, David Gibson wrote:

On Fri, Oct 11, 2019 at 06:25:18PM -0700, Ram Pai wrote:

From: Thiago Jung Bauermann 

In order to safely use the DMA API, virtio needs to know whether DMA
addresses are in fact physical addresses and for that purpose,
dma_addr_is_phys_addr() is introduced.

cc: Benjamin Herrenschmidt 
cc: David Gibson 
cc: Michael Ellerman 
cc: Paul Mackerras 
cc: Michael Roth 
cc: Alexey Kardashevskiy 
cc: Paul Burton 
cc: Robin Murphy 
cc: Bartlomiej Zolnierkiewicz 
cc: Marek Szyprowski 
cc: Christoph Hellwig 
Suggested-by: Michael S. Tsirkin 
Signed-off-by: Ram Pai 
Signed-off-by: Thiago Jung Bauermann 


The change itself looks ok, so

Reviewed-by: David Gibson 

However, I would like to see the commit message (and maybe the inline
comments) expanded a bit on what the distinction here is about.  Some
of the text from the next patch would be suitable, about DMA addresses
usually being in a different address space but not in the case of
bounce buffering.


Right, this needs a much tighter definition. "DMA address happens to be 
a valid physical address" is true of various IOMMU setups too, but I 
can't believe it's meaningful in such cases.


If what you actually want is "DMA is direct or SWIOTLB" - i.e. "DMA 
address is physical address of DMA data (not necessarily the original 
buffer)" - wouldn't dma_is_direct() suffice?


Robin.


---
  arch/powerpc/include/asm/dma-mapping.h | 21 +
  arch/powerpc/platforms/pseries/Kconfig |  1 +
  include/linux/dma-mapping.h| 20 
  kernel/dma/Kconfig |  3 +++
  4 files changed, 45 insertions(+)

diff --git a/arch/powerpc/include/asm/dma-mapping.h 
b/arch/powerpc/include/asm/dma-mapping.h
index 565d6f7..f92c0a4b 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -5,6 +5,8 @@
  #ifndef _ASM_DMA_MAPPING_H
  #define _ASM_DMA_MAPPING_H
  
+#include 

+
  static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
  {
/* We don't handle the NULL dev case for ISA for now. We could
@@ -15,4 +17,23 @@ static inline const struct dma_map_ops 
*get_arch_dma_ops(struct bus_type *bus)
return NULL;
  }
  
+#ifdef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR

+/**
+ * dma_addr_is_phys_addr - check whether a device DMA address is a physical
+ * address
+ * @dev:   device to check
+ *
+ * Returns %true if any DMA address for this device happens to also be a valid
+ * physical address (not necessarily of the same page).
+ */
+static inline bool dma_addr_is_phys_addr(struct device *dev)
+{
+   /*
+* Secure guests always use the SWIOTLB, therefore DMA addresses are
+* actually the physical address of the bounce buffer.
+*/
+   return is_secure_guest();
+}
+#endif
+
  #endif/* _ASM_DMA_MAPPING_H */
diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 9e35cdd..0108150 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -152,6 +152,7 @@ config PPC_SVM
select SWIOTLB
select ARCH_HAS_MEM_ENCRYPT
select ARCH_HAS_FORCE_DMA_UNENCRYPTED
+   select ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR
help
 There are certain POWER platforms which support secure guests using
 the Protected Execution Facility, with the help of an Ultravisor
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f7d1eea..6df5664 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -693,6 +693,26 @@ static inline bool dma_addressing_limited(struct device 
*dev)
dma_get_required_mask(dev);
  }
  
+#ifndef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR

+/**
+ * dma_addr_is_phys_addr - check whether a device DMA address is a physical
+ * address
+ * @dev:   device to check
+ *
+ * Returns %true if any DMA address for this device happens to also be a valid
+ * physical address (not necessarily of the same page).
+ */
+static inline bool dma_addr_is_phys_addr(struct device *dev)
+{
+   /*
+* Except in very specific setups, DMA addresses exist in a different
+* address space from CPU physical addresses and cannot be directly used
+* to reference system memory.
+*/
+   return false;
+}
+#endif
+
  #ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
  void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
const struct iommu_ops *iommu, bool coherent);
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 9decbba..6209b46 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -51,6 +51,9 @@ config ARCH_HAS_DMA_MMAP_PGPROT
  config ARCH_HAS_FORCE_DMA_UNENCRYPTED
bool
  
+config ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR

+   bool
+
  config DMA_NONCOHERENT_CACHE_SYNC
bool
  




Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-09 Thread Robin Murphy

On 2019-10-08 9:38 am, Yunsheng Lin wrote:

On 2019/9/25 18:41, Peter Zijlstra wrote:

On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:

 From the discussion above, It seems making the node_to_cpumask_map()
NUMA_NO_NODE aware is the most feasible way to move forwad.


That's still wrong.


Hi, Peter

It seems this has trapped in the dead circle.

 From my understanding, NUMA_NO_NODE, meaning no NUMA node preference, is
the state describing the node of a virtual device, or of a physical device
that has equal distance to all CPUs.

We can be stricter if the device does have a nearer node, but we cannot deny
that a device may have no NUMA node preference or affinity, which also means
the control or data buffer can be allocated at the node where the process is
running.

As you have proposed, making it -2 and having dev_to_node() warn if the device
does have a nearer node that was not set by the fw is a way to be stricter.

But I think being stricter is perhaps not really relevant to NUMA_NO_NODE,
because we do need a state to describe a device that has equal distance to all
nodes, even if it is not physically scalable.

Any better suggestion to move this forward?


FWIW (since this is in my inbox), it sounds like the fundamental issue 
is that NUMA_NO_NODE is conflated for at least two different purposes, 
so trying to sort that out would be a good first step. AFAICS we have 
genuine "don't care" cases like alloc_pages_node(), where if the 
producer says it doesn't matter then the consumer is free to make its 
own judgement on what to do, and fundamentally different "we expect this 
thing to have an affinity but it doesn't, so we can't say what's 
appropriate" cases which could really do with some separate indicator 
like "NUMA_INVALID_NODE".


The tricky part is then bestowed on the producers to decide whether they 
can downgrade "invalid" to "don't care". You can technically build 'a 
device' whose internal logic is distributed between nodes and thus 
appears to have equal affinity - interrupt controllers, for example, may 
have per-CPU or per-node interfaces that end up looking like that - so 
although it's unlikely it's not outright nonsensical. Similarly a 
'device' that's actually emulated behind a firmware call interface may 
well effectively have no real affinity.


Robin.


Re: [PATCH v2 6/6] mm/memory_hotplug: Pass nid instead of zone to __remove_pages()

2019-08-27 Thread Robin Murphy

On 26/08/2019 11:10, David Hildenbrand wrote:

The zone parameter is no longer in use. Replace it with the nid, which
we can now use to limit the number of zones we have to process
(via for_each_zone_nid()). The function signature of __remove_pages() now
looks much more similar to that of __add_pages().


FWIW I recall this being trivially easy to hit when first playing with 
hotremove development for arm64 - since we only have 3 zones, the page 
flags poison would cause page_zone() to dereference past the end of 
node_zones[] and go all kinds of wrong. This looks like a definite 
improvement in API terms.
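
(For anyone unfamiliar, page_zone() is roughly the below - the zone and node
indices are packed into the top of page->flags, so poisoned flags decode to
garbage indices:)

	static inline struct zone *page_zone(const struct page *page)
	{
		return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
	}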


For arm64,

Acked-by: Robin Murphy 

Cheers,
Robin.


Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: Andrew Morton 
Cc: Mark Rutland 
Cc: Steve Capper 
Cc: Mike Rapoport 
Cc: Anshuman Khandual 
Cc: Yu Zhao 
Cc: Jun Yao 
Cc: Robin Murphy 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: Pavel Tatashin 
Cc: Gerald Schaefer 
Cc: Halil Pasic 
Cc: Tom Lendacky 
Cc: Greg Kroah-Hartman 
Cc: Masahiro Yamada 
Cc: Dan Williams 
Cc: Wei Yang 
Cc: Qian Cai 
Cc: Jason Gunthorpe 
Cc: Logan Gunthorpe 
Cc: Ira Weiny 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
  arch/arm64/mm/mmu.c|  4 +---
  arch/ia64/mm/init.c|  4 +---
  arch/powerpc/mm/mem.c  |  3 +--
  arch/s390/mm/init.c|  4 +---
  arch/sh/mm/init.c  |  4 +---
  arch/x86/mm/init_32.c  |  4 +---
  arch/x86/mm/init_64.c  |  4 +---
  include/linux/memory_hotplug.h |  2 +-
  mm/memory_hotplug.c| 17 +
  mm/memremap.c  |  3 +--
  10 files changed, 18 insertions(+), 31 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e67bab4d613e..9a2d388314f3 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1080,7 +1080,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
  {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
-   struct zone *zone;
  
  	/*

 * FIXME: Cleanup page tables (also in arch_add_memory() in case
@@ -1089,7 +1088,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
 * unlocked yet.
 */
-   zone = page_zone(pfn_to_page(start_pfn));
-   __remove_pages(zone, start_pfn, nr_pages, altmap);
+   __remove_pages(nid, start_pfn, nr_pages, altmap);
  }
  #endif
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index bf9df2625bc8..ae6a3e718aa0 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -689,9 +689,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
  {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
-   struct zone *zone;
  
-	zone = page_zone(pfn_to_page(start_pfn));

-   __remove_pages(zone, start_pfn, nr_pages, altmap);
+   __remove_pages(nid, start_pfn, nr_pages, altmap);
  }
  #endif
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 9191a66b3bc5..af21e13529ce 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -130,10 +130,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
  {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
-   struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
int ret;
  
-	__remove_pages(page_zone(page), start_pfn, nr_pages, altmap);

+   __remove_pages(nid, start_pfn, nr_pages, altmap);
  
  	/* Remove htab bolted mappings for this section of memory */

start = (unsigned long)__va(start);
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 20340a03ad90..2a7373ed6ded 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -296,10 +296,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
  {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
-   struct zone *zone;
  
-	zone = page_zone(pfn_to_page(start_pfn));

-   __remove_pages(zone, start_pfn, nr_pages, altmap);
+   __remove_pages(nid, start_pfn, nr_pages, altmap);
vmem_remove_mapping(start, size);
  }
  #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.

Re: [PATCH 6/6] driver core: initialize a default DMA mask for platform device

2019-08-14 Thread Robin Murphy

On 11/08/2019 09:05, Christoph Hellwig wrote:

We still treat devices without a DMA mask as defaulting to 32-bits for
both masks, but a few releases ago we started warning about such
cases, as they require special casing to work around this sloppiness.
Add a dma_mask field to struct platform_object so that we can initialize


s/object/device/


the dma_mask pointer in struct device and initialize both masks to
32-bits by default.  Architectures can still override this in
arch_setup_pdev_archdata if needed.

Note that the code looks a little odd with the various conditionals
because we have to support platform_device structures that are
statically allocated.


This would be a good point to also get rid of the long-standing bodge in 
platform_device_register_full().



Signed-off-by: Christoph Hellwig 
---
  drivers/base/platform.c | 15 +--
  include/linux/platform_device.h |  1 +
  2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index ec974ba9c0c4..b216fcb0a8af 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -264,6 +264,17 @@ struct platform_object {
char name[];
  };
  
+static void setup_pdev_archdata(struct platform_device *pdev)


Bikeshed: painting the generic DMA API properties as "archdata" feels a 
bit off-target :/



+{
+   if (!pdev->dev.coherent_dma_mask)
+   pdev->dev.coherent_dma_mask = DMA_BIT_MASK(32);
+   if (!pdev->dma_mask)
+   pdev->dma_mask = DMA_BIT_MASK(32);
+   if (!pdev->dev.dma_mask)
+   pdev->dev.dma_mask = &pdev->dma_mask;
+   arch_setup_pdev_archdata(pdev);


AFAICS m68k's implementation of that arch hook becomes entirely 
redundant after this change, so may as well go. That would just leave 
powerpc's actual archdata, which at a glance looks like it could 
probably be cleaned up with not *too* much trouble.


Robin.


+};
+
  /**
   * platform_device_put - destroy a platform device
   * @pdev: platform device to free
@@ -310,7 +321,7 @@ struct platform_device *platform_device_alloc(const char 
*name, int id)
pa->pdev.id = id;
	device_initialize(&pa->pdev.dev);
	pa->pdev.dev.release = platform_device_release;
-	arch_setup_pdev_archdata(&pa->pdev);
+	setup_pdev_archdata(&pa->pdev);
}
  
  	return pa ? &pa->pdev : NULL;

@@ -512,7 +523,7 @@ EXPORT_SYMBOL_GPL(platform_device_del);
  int platform_device_register(struct platform_device *pdev)
  {
	device_initialize(&pdev->dev);
-   arch_setup_pdev_archdata(pdev);
+   setup_pdev_archdata(pdev);
return platform_device_add(pdev);
  }
  EXPORT_SYMBOL_GPL(platform_device_register);
diff --git a/include/linux/platform_device.h b/include/linux/platform_device.h
index 9bc36b589827..a2abde2aef25 100644
--- a/include/linux/platform_device.h
+++ b/include/linux/platform_device.h
@@ -24,6 +24,7 @@ struct platform_device {
int id;
boolid_auto;
struct device   dev;
+   u64 dma_mask;
u32 num_resources;
struct resource *resource;
  



Re: [PATCH v3 04/11] arm64/mm: Add temporary arch_remove_memory() implementation

2019-06-04 Thread Robin Murphy

On 04/06/2019 07:56, David Hildenbrand wrote:

On 03.06.19 23:41, Wei Yang wrote:

On Mon, May 27, 2019 at 01:11:45PM +0200, David Hildenbrand wrote:

A proper arch_remove_memory() implementation is on its way, which also
cleanly removes page tables in arch_add_memory() in case something goes
wrong.


Would this be better to understand?

 removes page tables created in arch_add_memory


That's not what this sentence expresses. Have a look at
arch_add_memory(), in case  __add_pages() fails, the page tables are not
removed. This will also be fixed by Anshuman in the same shot.





As we want to use arch_remove_memory() in case something goes wrong
during memory hotplug after arch_add_memory() finished, let's add
a temporary hack that is sufficient enough until we get a proper
implementation that cleans up page table entries.

We will remove CONFIG_MEMORY_HOTREMOVE around this code in follow up
patches.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Rutland 
Cc: Andrew Morton 
Cc: Ard Biesheuvel 
Cc: Chintan Pandya 
Cc: Mike Rapoport 
Cc: Jun Yao 
Cc: Yu Zhao 
Cc: Robin Murphy 
Cc: Anshuman Khandual 
Signed-off-by: David Hildenbrand 
---
arch/arm64/mm/mmu.c | 19 +++
1 file changed, 19 insertions(+)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a1bfc4413982..e569a543c384 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1084,4 +1084,23 @@ int arch_add_memory(int nid, u64 start, u64 size,
return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
   restrictions);
}
+#ifdef CONFIG_MEMORY_HOTREMOVE
+void arch_remove_memory(int nid, u64 start, u64 size,
+   struct vmem_altmap *altmap)
+{
+   unsigned long start_pfn = start >> PAGE_SHIFT;
+   unsigned long nr_pages = size >> PAGE_SHIFT;
+   struct zone *zone;
+
+   /*
+* FIXME: Cleanup page tables (also in arch_add_memory() in case
+* adding fails). Until then, this function should only be used
+* during memory hotplug (adding memory), not for memory
+* unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
+* unlocked yet.
+*/
+   zone = page_zone(pfn_to_page(start_pfn));


Compared with arch_remove_memory in x86. If altmap is not NULL, zone will be
retrieved from page related to altmap. Not sure why this is not the same?


This is a minimal implementation, sufficient for this use case here. A
full implementation is in the works. For now, this function will not be
used with an altmap (ZONE_DEVICE is not supported for arm64 yet).


FWIW the other pieces of ZONE_DEVICE are now due to land in parallel, 
but as long as we don't throw the ARCH_ENABLE_MEMORY_HOTREMOVE switch 
then there should still be no issue. Besides, given that we should 
consistently ignore the altmap everywhere at the moment, it may even 
work out regardless.


One thing stands out about the failure path thing, though - if 
__add_pages() did fail, can it still be guaranteed to have initialised 
the memmap such that page_zone() won't return nonsense? Last time I 
looked that was still a problem when removing memory which had been 
successfully added, but never onlined (although I do know that 
particular case was already being discussed at the time, and I've not 
been paying the greatest attention since).


Robin.


Re: [PATCH v3 0/6] Prerequisites for NXP LS104xA SMMU enablement

2019-05-31 Thread Robin Murphy

On 31/05/2019 18:08, Christoph Hellwig wrote:

On Fri, May 31, 2019 at 06:03:30PM +0100, Robin Murphy wrote:

The thing needs to be completely redone as it abuses parts of the
iommu API in a completely unacceptable way.


`git grep iommu_iova_to_phys drivers/{crypto,gpu,net}`

:(

I guess one alternative is for the offending drivers to maintain their own
lookup tables of mapped DMA addresses - I think at least some of these
things allow storing some kind of token in a descriptor, which even if it's
not big enough for a virtual address might be sufficient for an index.


Well, we'll at least need DMA API wrappers that work on the dma addr
only and hide this madness underneath.  And then tell if an given device
supports this and fail the probe otherwise.


Bleh, I'm certainly not keen on formalising any kind of 
dma_to_phys()/dma_to_virt() interface for this. Or are you just 
proposing something like dma_unmap_sorry_sir_the_dog_ate_my_homework() 
for drivers which have 'lost' the original VA they mapped?


Robin.


Re: [PATCH v3 0/6] Prerequisites for NXP LS104xA SMMU enablement

2019-05-31 Thread Robin Murphy

On 31/05/2019 17:33, Christoph Hellwig wrote:

On Thu, May 30, 2019 at 03:08:44PM -0700, David Miller wrote:

From: laurentiu.tu...@nxp.com
Date: Thu, 30 May 2019 17:19:45 +0300


Depends on this pull request:

  http://lists.infradead.org/pipermail/linux-arm-kernel/2019-May/653554.html


I'm not sure how you want me to handle this.


The thing needs to be completely redone as it abuses parts of the
iommu API in a completely unacceptable way.


`git grep iommu_iova_to_phys drivers/{crypto,gpu,net}`

:(

I guess one alternative is for the offending drivers to maintain their 
own lookup tables of mapped DMA addresses - I think at least some of 
these things allow storing some kind of token in a descriptor, which 
even if it's not big enough for a virtual address might be sufficient 
for an index.
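
(Hand-waving a little, since it depends on what each descriptor format can
actually hold, but the shape would be something like this - all names
invented:)

	struct dma_slot {
		void		*vaddr;
		dma_addr_t	dma;
	};

	static struct dma_slot slots[MAX_INFLIGHT];

	/* on map: remember the CPU address, hand the small index to hw */
	static u16 slot_store(void *vaddr, dma_addr_t dma)
	{
		u16 idx = slot_alloc();	/* invented allocator */

		slots[idx].vaddr = vaddr;
		slots[idx].dma = dma;
		return idx;
	}

	/* on completion: recover the buffer without iommu_iova_to_phys() */
	static void *slot_lookup(u16 idx)
	{
		return slots[idx].vaddr;
	}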


Robin.


[PATCH v3 3/4] mm: introduce ARCH_HAS_PTE_DEVMAP

2019-05-23 Thread Robin Murphy
ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
with the long-out-of-date comment can lead to the impression that an
architecture may just enable it (since __add_pages() now "comprehends
device memory" for itself) and expect things to work.

In practice, however, ZONE_DEVICE users have little chance of
functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
dependency so the real situation is clearer.

Cc: x...@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Michael Ellerman 
Acked-by: Dan Williams 
Reviewed-by: Ira Weiny 
Acked-by: Oliver O'Halloran 
Reviewed-by: Anshuman Khandual 
Signed-off-by: Robin Murphy 
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h | 1 -
 arch/x86/Kconfig | 2 +-
 arch/x86/include/asm/pgtable.h   | 4 ++--
 arch/x86/include/asm/pgtable_types.h | 1 -
 include/linux/mm.h   | 4 ++--
 include/linux/pfn_t.h| 4 ++--
 mm/Kconfig   | 5 ++---
 mm/gup.c | 2 +-
 9 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8c1c636308c8..1120ff8ac715 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -128,6 +128,7 @@ config PPC
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_APIif PPC64
+   select ARCH_HAS_PTE_DEVMAP  if PPC_BOOK3S_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC64
@@ -135,7 +136,6 @@ config PPC
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE  if PPC64
select ARCH_HAS_UBSAN_SANITIZE_ALL
-   select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_KEEP_MEMBLOCK
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 7dede2e34b70..c6c2bdfb369b 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -90,7 +90,6 @@
 #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty tracking 
*/
 #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
 #define _PAGE_DEVMAP   _RPAGE_SW1 /* software: ZONE_DEVICE page */
-#define __HAVE_ARCH_PTE_DEVMAP
 
 /*
  * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bbbd4d1ba31..57c4e80bd368 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -69,6 +69,7 @@ config X86
select ARCH_HAS_KCOVif X86_64
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_PMEM_APIif X86_64
+   select ARCH_HAS_PTE_DEVMAP  if X86_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_REFCOUNT
select ARCH_HAS_UACCESS_FLUSHCACHE  if X86_64
@@ -79,7 +80,6 @@ config X86
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_UBSAN_SANITIZE_ALL
-   select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5e0509b41986..0bc530c4eb13 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -271,7 +271,7 @@ static inline int has_transparent_hugepage(void)
return boot_cpu_has(X86_FEATURE_PSE);
 }
 
-#ifdef __HAVE_ARCH_PTE_DEVMAP
+#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pmd_devmap(pmd_t pmd)
 {
return !!(pmd_val(pmd) & _PAGE_DEVMAP);
@@ -732,7 +732,7 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
-#ifdef __HAVE_ARCH_PTE_DEVMAP
+#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pte_devmap(pte_t a)
 {
return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP;
diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index d6ff0bbdb394..b5e49e6bac63 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -103,7 +103,6 @@
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX   (_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP   (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
-#define __HAVE_ARCH_PTE_DEVMAP
 #else
 #define _PAGE_NX   (_AT(

[PATCH v2 3/3] mm: introduce ARCH_HAS_PTE_DEVMAP

2019-04-29 Thread Robin Murphy
ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
with the long-out-of-date comment can lead to the impression that an
architecture may just enable it (since __add_pages() now "comprehends
device memory" for itself) and expect things to work.

In practice, however, ZONE_DEVICE users have little chance of
functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
dependency so the real situation is clearer.

Cc: x...@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Michael Ellerman 
Acked-by: Dan Williams 
Reviewed-by: Ira Weiny 
Acked-by: Oliver O'Halloran 
Reviewed-by: Anshuman Khandual 
Signed-off-by: Robin Murphy 
---

v2: Add review tags.

 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h | 1 -
 arch/x86/Kconfig | 2 +-
 arch/x86/include/asm/pgtable.h   | 4 ++--
 arch/x86/include/asm/pgtable_types.h | 1 -
 include/linux/mm.h   | 4 ++--
 include/linux/pfn_t.h| 4 ++--
 mm/Kconfig   | 5 ++---
 mm/gup.c | 2 +-
 9 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 5e3d0853c31d..77e1993bba80 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -135,6 +135,7 @@ config PPC
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_APIif PPC64
+   select ARCH_HAS_PTE_DEVMAP  if PPC_BOOK3S_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC64
@@ -142,7 +143,6 @@ config PPC
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE  if PPC64
select ARCH_HAS_UBSAN_SANITIZE_ALL
-   select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 581f91be9dd4..02c22ac8f387 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -90,7 +90,6 @@
 #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty tracking 
*/
 #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
 #define _PAGE_DEVMAP   _RPAGE_SW1 /* software: ZONE_DEVICE page */
-#define __HAVE_ARCH_PTE_DEVMAP
 
 /*
  * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5ad92419be19..ffd50f27f395 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -60,6 +60,7 @@ config X86
select ARCH_HAS_KCOVif X86_64
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_PMEM_APIif X86_64
+   select ARCH_HAS_PTE_DEVMAP  if X86_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_REFCOUNT
select ARCH_HAS_UACCESS_FLUSHCACHE  if X86_64
@@ -69,7 +70,6 @@ config X86
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_UBSAN_SANITIZE_ALL
-   select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2779ace16d23..89a1f6fd48bf 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -254,7 +254,7 @@ static inline int has_transparent_hugepage(void)
return boot_cpu_has(X86_FEATURE_PSE);
 }
 
-#ifdef __HAVE_ARCH_PTE_DEVMAP
+#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pmd_devmap(pmd_t pmd)
 {
return !!(pmd_val(pmd) & _PAGE_DEVMAP);
@@ -715,7 +715,7 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
-#ifdef __HAVE_ARCH_PTE_DEVMAP
+#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pte_devmap(pte_t a)
 {
return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP;
diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index d6ff0bbdb394..b5e49e6bac63 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -103,7 +103,6 @@
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX   (_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP   (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
-#define __HAVE_ARCH_PTE_DEVMAP
 #els

[PATCH 3/3] mm: introduce ARCH_HAS_PTE_DEVMAP

2019-04-12 Thread Robin Murphy
ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
with the long-out-of-date comment can lead to the impression that an
architecture may just enable it (since __add_pages() now "comprehends
device memory" for itself) and expect things to work.

In practice, however, ZONE_DEVICE users have little chance of
functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
dependency so the real situation is clearer.

Signed-off-by: Robin Murphy 
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h | 1 -
 arch/x86/Kconfig | 2 +-
 arch/x86/include/asm/pgtable.h   | 4 ++--
 arch/x86/include/asm/pgtable_types.h | 1 -
 include/linux/mm.h   | 4 ++--
 include/linux/pfn_t.h| 4 ++--
 mm/Kconfig   | 5 ++---
 mm/gup.c | 2 +-
 9 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 5e3d0853c31d..77e1993bba80 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -135,6 +135,7 @@ config PPC
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_APIif PPC64
+   select ARCH_HAS_PTE_DEVMAP  if PPC_BOOK3S_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC64
@@ -142,7 +143,6 @@ config PPC
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE  if PPC64
select ARCH_HAS_UBSAN_SANITIZE_ALL
-   select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 581f91be9dd4..02c22ac8f387 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -90,7 +90,6 @@
 #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty tracking 
*/
 #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
 #define _PAGE_DEVMAP   _RPAGE_SW1 /* software: ZONE_DEVICE page */
-#define __HAVE_ARCH_PTE_DEVMAP
 
 /*
  * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5ad92419be19..ffd50f27f395 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -60,6 +60,7 @@ config X86
select ARCH_HAS_KCOVif X86_64
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_PMEM_APIif X86_64
+   select ARCH_HAS_PTE_DEVMAP  if X86_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_REFCOUNT
select ARCH_HAS_UACCESS_FLUSHCACHE  if X86_64
@@ -69,7 +70,6 @@ config X86
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_UBSAN_SANITIZE_ALL
-   select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2779ace16d23..89a1f6fd48bf 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -254,7 +254,7 @@ static inline int has_transparent_hugepage(void)
return boot_cpu_has(X86_FEATURE_PSE);
 }
 
-#ifdef __HAVE_ARCH_PTE_DEVMAP
+#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pmd_devmap(pmd_t pmd)
 {
return !!(pmd_val(pmd) & _PAGE_DEVMAP);
@@ -715,7 +715,7 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
 }
 
-#ifdef __HAVE_ARCH_PTE_DEVMAP
+#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pte_devmap(pte_t a)
 {
return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP;
diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index d6ff0bbdb394..b5e49e6bac63 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -103,7 +103,6 @@
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX   (_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP   (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
-#define __HAVE_ARCH_PTE_DEVMAP
 #else
 #define _PAGE_NX   (_AT(pteval_t, 0))
 #define _PAGE_DEVMAP   (_AT(pteval_t, 0))
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d76dfb7ac617..fe05c94f23e9 100644
--- a/include/linux/mm.h
+++ b/inc

[PATCH 2/3] mm: clean up is_device_*_page() definitions

2019-04-12 Thread Robin Murphy
Refactor is_device_{public,private}_page() with is_pci_p2pdma_page()
to make them all consistent in depending on their respective config
options even when CONFIG_DEV_PAGEMAP_OPS is enabled for other reasons.
This allows a little more compile-time optimisation as well as the
conceptual and cosmetic cleanup.
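
To illustrate the compile-time angle (a sketch assuming CONFIG_PCI_P2PDMA=n;
the caller and the handle_p2pdma_mapping() helper below are made up for the
example):

	/* IS_ENABLED(CONFIG_PCI_P2PDMA) is a constant 0 in this config, so
	 * is_pci_p2pdma_page() folds to "return false"... */
	if (is_pci_p2pdma_page(sg_page(sg)))
		/* ...and the compiler can discard this branch entirely,
		 * while the code is still type-checked in every config. */
		return handle_p2pdma_mapping(dev, sg, nents, dir);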

Suggested-by: Jerome Glisse 
Signed-off-by: Robin Murphy 
---
 include/linux/mm.h | 43 +--
 1 file changed, 13 insertions(+), 30 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 76769749b5a5..d76dfb7ac617 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -910,32 +910,6 @@ static inline bool put_devmap_managed_page(struct page 
*page)
}
return false;
 }
-
-static inline bool is_device_private_page(const struct page *page)
-{
-   return is_zone_device_page(page) &&
-   page->pgmap->type == MEMORY_DEVICE_PRIVATE;
-}
-
-static inline bool is_device_public_page(const struct page *page)
-{
-   return is_zone_device_page(page) &&
-   page->pgmap->type == MEMORY_DEVICE_PUBLIC;
-}
-
-#ifdef CONFIG_PCI_P2PDMA
-static inline bool is_pci_p2pdma_page(const struct page *page)
-{
-   return is_zone_device_page(page) &&
-   page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
-}
-#else /* CONFIG_PCI_P2PDMA */
-static inline bool is_pci_p2pdma_page(const struct page *page)
-{
-   return false;
-}
-#endif /* CONFIG_PCI_P2PDMA */
-
 #else /* CONFIG_DEV_PAGEMAP_OPS */
 static inline void dev_pagemap_get_ops(void)
 {
@@ -949,22 +923,31 @@ static inline bool put_devmap_managed_page(struct page 
*page)
 {
return false;
 }
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
 {
-   return false;
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+   is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
 static inline bool is_device_public_page(const struct page *page)
 {
-   return false;
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   IS_ENABLED(CONFIG_DEVICE_PUBLIC) &&
+   is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_PUBLIC;
 }
 
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
-   return false;
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   IS_ENABLED(CONFIG_PCI_P2PDMA) &&
+   is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline void get_page(struct page *page)
 {
-- 
2.21.0.dirty



[PATCH 1/3] mm/memremap: Rename and consolidate SECTION_SIZE

2019-04-12 Thread Robin Murphy
Trying to activate ZONE_DEVICE for arm64 reveals that memremap's
internal helpers for sparsemem sections conflict with arm64's
definitions for hugepages, which inherit the name of "sections" from
earlier versions of the ARM architecture.

Disambiguate memremap (and now HMM too) by propagating sparsemem's PA_
prefix, to clarify that these values are in terms of addresses rather
than PFNs (and because it's a heck of a lot easier than changing all the
arch code). SECTION_MASK is unused, so it can just go.

[anshuman: Consolidated mm/hmm.c instance and updated the commit message]

Acked-by: Michal Hocko 
Reviewed-by: David Hildenbrand 
Signed-off-by: Robin Murphy 
Signed-off-by: Anshuman Khandual 
---
 include/linux/mmzone.h |  1 +
 kernel/memremap.c  | 10 --
 mm/hmm.c   |  2 --
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fba7741533be..ed7dd27ee94a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1081,6 +1081,7 @@ static inline unsigned long early_pfn_to_nid(unsigned 
long pfn)
  * PFN_SECTION_SHIFT   pfn to/from section number
  */
 #define PA_SECTION_SHIFT   (SECTION_SIZE_BITS)
+#define PA_SECTION_SIZE(1UL << PA_SECTION_SHIFT)
 #define PFN_SECTION_SHIFT  (SECTION_SIZE_BITS - PAGE_SHIFT)
 
 #define NR_MEM_SECTIONS(1UL << SECTIONS_SHIFT)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index a856cb5ff192..dda1367b385d 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -14,8 +14,6 @@
 #include 
 
 static DEFINE_XARRAY(pgmap_array);
-#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1)
-#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
 
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
 vm_fault_t device_private_entry_fault(struct vm_area_struct *vma,
@@ -98,8 +96,8 @@ static void devm_memremap_pages_release(void *data)
put_page(pfn_to_page(pfn));
 
/* pages are dead and unused, undo the arch mapping */
-   align_start = res->start & ~(SECTION_SIZE - 1);
-   align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
+   align_start = res->start & ~(PA_SECTION_SIZE - 1);
+   align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
- align_start;
 
nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT));
@@ -154,8 +152,8 @@ void *devm_memremap_pages(struct device *dev, struct 
dev_pagemap *pgmap)
if (!pgmap->ref || !pgmap->kill)
return ERR_PTR(-EINVAL);
 
-   align_start = res->start & ~(SECTION_SIZE - 1);
-   align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
+   align_start = res->start & ~(PA_SECTION_SIZE - 1);
+   align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE)
- align_start;
align_end = align_start + align_size - 1;
 
diff --git a/mm/hmm.c b/mm/hmm.c
index fe1cd87e49ac..ef9e4e6c9f92 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -33,8 +33,6 @@
 #include 
 #include 
 
-#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
-
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
-- 
2.21.0.dirty



[PATCH 0/3] Device-memory-related cleanups

2019-04-12 Thread Robin Murphy
Hi,

As promised, these are my preparatory cleanup patches that have so far
fallen out of pmem DAX work for arm64. Patch #1 has already been out for
a ride in Anshuman's hot-remove series, so I've collected the acks
already given.

Since we have various things in flight at the moment touching arm64
pagetable code, I'm wary of conflicts and cross-tree dependencies for
our actual ARCH_HAS_PTE_DEVMAP implementation. Thus it would be nice if
these could be picked up for 5.2 via mm or nvdimm as appropriate, such
that we can then handle the devmap patch itself via arm64 next cycle.

Robin.


Robin Murphy (3):
  mm/memremap: Rename and consolidate SECTION_SIZE
  mm: clean up is_device_*_page() definitions
  mm: introduce ARCH_HAS_PTE_DEVMAP

 arch/powerpc/Kconfig |  2 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h |  1 -
 arch/x86/Kconfig |  2 +-
 arch/x86/include/asm/pgtable.h   |  4 +-
 arch/x86/include/asm/pgtable_types.h |  1 -
 include/linux/mm.h   | 47 +++-
 include/linux/mmzone.h   |  1 +
 include/linux/pfn_t.h|  4 +-
 kernel/memremap.c| 10 ++---
 mm/Kconfig   |  5 +--
 mm/gup.c |  2 +-
 mm/hmm.c |  2 -
 12 files changed, 29 insertions(+), 52 deletions(-)

-- 
2.21.0.dirty



Re: [PATCH v5 1/6] iommu: add generic boot option iommu.dma_mode

2019-04-12 Thread Robin Murphy

On 12/04/2019 11:26, John Garry wrote:

On 09/04/2019 13:53, Zhen Lei wrote:

Currently IOMMU DMA has 3 modes: passthrough, lazy and strict. Passthrough
mode bypasses the IOMMU, lazy mode defers the invalidation of hardware TLBs,
and strict mode invalidates IOMMU hardware TLBs synchronously. The three
modes are mutually exclusive, but the current boot options (iommu.passthrough
and iommu.strict) are confusing, since they do not combine well. So add
iommu.dma_mode.

Signed-off-by: Zhen Lei 
---
 Documentation/admin-guide/kernel-parameters.txt | 19 
 drivers/iommu/iommu.c   | 59 
-

 include/linux/iommu.h   |  5 +++
 3 files changed, 71 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt

index 2b8ee90bb64470d..f7766f8ac8b9084 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1811,6 +1811,25 @@
 1 - Bypass the IOMMU for DMA.
 unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.

+    iommu.dma_mode= Configure default dma mode. if unset, use the value
+    of CONFIG_IOMMU_DEFAULT_PASSTHROUGH to determine
+    passthrough or not.


To me, for unset it's unclear what we default to. So if unset and also 
CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set, do we get lazy or strict 
mode? (note: I'm ignoring backwards compatibility and interaction of 
iommu.strict and .passthrough also, more below).


Could we consider introducing config DEFAULT_IOMMU_DMA_MODE, similar 
to DEFAULT_IOSCHED?


Yes, what I was suggesting was specifically refactoring the Kconfig 
options into a single choice that controls the default (i.e. no command 
line option provided) behaviour. AFAICS it should be fairly 
straightforward to maintain the existing "strict" and "passthrough" 
options (and legacy arch-specific versions thereof) to override that 
default without introducing yet another command-line option, which I 
think we should avoid if possible.
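
A minimal sketch of that kind of single Kconfig choice (option names are
illustrative, not what this series implements):

	choice
		prompt "IOMMU default DMA mode"
		default IOMMU_DEFAULT_DMA_STRICT

	config IOMMU_DEFAULT_PASSTHROUGH
		bool "Passthrough (bypass the IOMMU for DMA)"

	config IOMMU_DEFAULT_DMA_LAZY
		bool "Lazy (deferred TLB invalidation)"

	config IOMMU_DEFAULT_DMA_STRICT
		bool "Strict (synchronous TLB invalidation)"

	endchoice

The command-line options then only need to override whichever default the
choice selects.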

+    Note: For historical reasons, ARM64/S390/PPC/X86 have
+    their own specific options. Currently only ARM64 supports
+    this boot option; other architectures are expected to
+    adopt it as the generic boot option.
+    passthrough
+    Configure DMA to bypass the IOMMU by default.
+    lazy
+    Request that DMA unmap operations use deferred
+    invalidation of hardware TLBs, for increased
+    throughput at the cost of reduced device isolation.
+    Will fall back to strict mode if not supported by
+    the relevant IOMMU driver.
+    strict
+    DMA unmap operations invalidate IOMMU hardware TLBs
+    synchronously.
+
 io7=    [HW] IO7 for Marvel based alpha systems
 See comment before marvel_specify_io7 in
 arch/alpha/kernel/core_marvel.c.
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 109de67d5d727c2..df1ce8e22385b48 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -38,12 +38,13 @@

 static struct kset *iommu_group_kset;
 static DEFINE_IDA(iommu_group_ida);
+
 #ifdef CONFIG_IOMMU_DEFAULT_PASSTHROUGH
-static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_IDENTITY;
+#define IOMMU_DEFAULT_DMA_MODE    IOMMU_DMA_MODE_PASSTHROUGH
 #else
-static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA;
+#define IOMMU_DEFAULT_DMA_MODE    IOMMU_DMA_MODE_STRICT
 #endif
-static bool iommu_dma_strict __read_mostly = true;
+static int iommu_default_dma_mode __read_mostly = 
IOMMU_DEFAULT_DMA_MODE;


 struct iommu_callback_data {
 const struct iommu_ops *ops;
@@ -147,20 +148,51 @@ static int __init iommu_set_def_domain_type(char 
*str)

 int ret;

  ret = kstrtobool(str, &pt);
-    if (ret)
-    return ret;
+    if (!ret && pt)
+    iommu_default_dma_mode = IOMMU_DMA_MODE_PASSTHROUGH;

-    iommu_def_domain_type = pt ? IOMMU_DOMAIN_IDENTITY : 
IOMMU_DOMAIN_DMA;

-    return 0;
+    return ret;
 }
 early_param("iommu.passthrough", iommu_set_def_domain_type);

 static int __init iommu_dma_setup(char *str)
 {
-    return kstrtobool(str, &iommu_dma_strict);
+    bool strict;
+    int ret;
+
+    ret = kstrtobool(str, &strict);
+    if (!ret)
+    iommu_default_dma_mode = strict ?
+    IOMMU_DMA_MODE_STRICT : IOMMU_DMA_MODE_LAZY;
+
+    return ret;
 }
 early_param("iommu.strict", iommu_dma_setup);

+static int __init iommu_dma_mode_setup(char *str)
+{
+    if (!str)
+    goto fail;
+
+    if (!strncmp(str, "passthrough", 11))
+    iommu_default_dma_mode = IOMMU_DMA_MODE_PASSTHROUGH;
+    else if (!strncmp(str, "lazy", 4))
+    iommu_default_dma_mode = IOMMU_DMA_MODE_LAZY;
+    else if (!strncmp(str, "strict", 6))
+    iommu_default_dma_mode = IOMMU_DMA_MODE_STRICT;
+    else
+    goto fail;

Re: [PATCH 02/13] soc/fsl/bman: map FBPR area in the iommu

2019-03-29 Thread Robin Murphy

On 29/03/2019 14:00, laurentiu.tu...@nxp.com wrote:

From: Laurentiu Tudor 

Add a one-to-one iommu mapping for bman private data memory (FBPR).
This is required for BMAN to work without faults behind an iommu.

Signed-off-by: Laurentiu Tudor 
---
  drivers/soc/fsl/qbman/bman_ccsr.c | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/drivers/soc/fsl/qbman/bman_ccsr.c 
b/drivers/soc/fsl/qbman/bman_ccsr.c
index 7c3cc968053c..b209c79511bb 100644
--- a/drivers/soc/fsl/qbman/bman_ccsr.c
+++ b/drivers/soc/fsl/qbman/bman_ccsr.c
@@ -29,6 +29,7 @@
   */
  
  #include "bman_priv.h"

+#include 
  
  u16 bman_ip_rev;

  EXPORT_SYMBOL(bman_ip_rev);
@@ -178,6 +179,7 @@ static int fsl_bman_probe(struct platform_device *pdev)
int ret, err_irq;
	struct device *dev = &pdev->dev;
struct device_node *node = dev->of_node;
+   struct iommu_domain *domain;
struct resource *res;
u16 id, bm_pool_cnt;
u8 major, minor;
@@ -225,6 +227,15 @@ static int fsl_bman_probe(struct platform_device *pdev)
  
  	dev_dbg(dev, "Allocated FBPR 0x%llx 0x%zx\n", fbpr_a, fbpr_sz);
  
+	/* Create an 1-to-1 iommu mapping for FBPR area */

+   domain = iommu_get_domain_for_dev(dev);


If that's expected to be the default domain that you're grabbing, then 
this is *incredibly* fragile. There's nothing to stop the IOVA that you 
forcibly map from being automatically allocated later and causing some 
other DMA mapping to fail noisily and unexpectedly. Furthermore, have 
you tried this with "iommu.passthrough=1"?


That said, I really don't understand what's going on here anyway :/

As far as I can tell from qbman_init_private_mem(), fbpr_a comes from 
dma_alloc_coherent() and thus would already be a mapped IOVA - isn't 
this the stuff that Roy converted to nicely use shared-dma-pool regions 
a while ago?


Robin.


+   if (domain) {
+   ret = iommu_map(domain, fbpr_a, fbpr_a, fbpr_sz,
+   IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE);
+   if (ret)
+   dev_warn(dev, "failed to iommu_map() %d\n", ret);
+   }
+
bm_set_memory(fbpr_a, fbpr_sz);
  
  	err_irq = platform_get_irq(pdev, 0);




Re: [PATCH 2/3] dma-mapping: don't BUG when calling dma_map_resource on RAM

2019-01-14 Thread Robin Murphy

On 11/01/2019 18:17, Christoph Hellwig wrote:

Use WARN_ON_ONCE to print a stack trace and return a proper error
code instead.


I was racking my brain to remember the reasoning behind BUG_ON() being 
the only viable way to prevent errors getting through unhandled, but of 
course that was before we had a standardised DMA_MAPPING_ERROR that 
would work across all implementations.



Signed-off-by: Christoph Hellwig 
---
  include/linux/dma-mapping.h | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index d3087829a6df..91add0751aa5 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -353,7 +353,8 @@ static inline dma_addr_t dma_map_resource(struct device 
*dev,
BUG_ON(!valid_dma_direction(dir));
  
  	/* Don't allow RAM to be mapped */


Ugh, I'm pretty sure that that "pfn_valid means RAM" misunderstanding 
originally came from me - it might be nice to have a less-misleading 
comment here, but off-hand I can't think of a succinct way to say "only 
for 'DMA' access to MMIO registers/SRAMs/etc. and not for anything the 
kernel knows as actual system/device memory" to better explain the WARN...
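
Perhaps something along these lines (wording purely illustrative):

	/*
	 * dma_map_resource() is only for "DMA" to MMIO regions such as
	 * device registers or on-chip SRAMs; pfn_valid() pages are memory
	 * the kernel manages and must go through dma_map_page()/dma_map_sg()
	 * instead, hence the WARN below.
	 */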


Either way, though,

Reviewed-by: Robin Murphy 


-   BUG_ON(pfn_valid(PHYS_PFN(phys_addr)));
+   if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr))))
+   return DMA_MAPPING_ERROR;
  
  	if (dma_is_direct(ops))

addr = dma_direct_map_resource(dev, phys_addr, size, dir, 
attrs);



Re: [PATCH 1/3] dma-mapping: remove the default map_resource implementation

2019-01-14 Thread Robin Murphy

On 11/01/2019 18:17, Christoph Hellwig wrote:

Just returning the physical address when not map_resource method is
present is highly dangerous as it doesn't take any offset in the
direct mapping into account and does the completely wrong thing for
IOMMUs.  Instead provide a proper implementation in the direct mapping
code, and also wire it up for arm and powerpc.


Ignoring the offset was kind of intentional there, because at the time I 
was primarily thinking about it in terms of the Keystone 2 platform 
where the peripherals are all in the same place (0-2GB) in both the bus 
and CPU physical address maps, and only the view of RAM differs between 
the two (2-4GB vs. 32-34GB). However, on something like BCM283x, the 
peripherals region is also offset from its bus address in the CPU view, 
but at a *different* offset relative to that of RAM.


Fortunately, I'm not aware of any platform which has a DMA engine behind 
an IOMMU (and thus *needs* to use dma_map_resource() to avoid said IOMMU 
blocking the slave device register reads/writes) and also has any 
nonzero offsets, and AFAIK the IOMMU-less platforms above aren't using 
dma_map_resource() at all, so this change shouldn't actually break 
anything, but I guess we have a bit of a problem making it truly generic 
and robust :(


Is this perhaps another shove in the direction of overhauling 
dma_pfn_offset into an arbitrary "DMA ranges" lookup table?
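
For reference, the direct-mapping implementation added by this series boils
down to something like the sketch below (paraphrased, since the
kernel/dma/direct.c hunk isn't quoted above; error reporting omitted):

	dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
			size_t size, enum dma_data_direction dir, unsigned long attrs)
	{
		/* The bus address is taken to be the CPU physical address,
		 * with no phys_to_dma() translation applied - exactly the
		 * offset caveat discussed above. */
		dma_addr_t dma_addr = paddr;

		if (!dma_capable(dev, dma_addr, size))
			return DMA_MAPPING_ERROR;
		return dma_addr;
	}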



Signed-off-by: Christoph Hellwig 
---
  arch/arm/mm/dma-mapping.c |  2 ++
  arch/powerpc/kernel/dma-swiotlb.c |  1 +
  arch/powerpc/kernel/dma.c |  1 +
  include/linux/dma-mapping.h   | 12 +++-
  kernel/dma/direct.c   | 14 ++
  5 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index f1e2922e447c..3c8534904209 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -188,6 +188,7 @@ const struct dma_map_ops arm_dma_ops = {
.unmap_page = arm_dma_unmap_page,
.map_sg = arm_dma_map_sg,
.unmap_sg   = arm_dma_unmap_sg,
+   .map_resource   = dma_direct_map_resource,
.sync_single_for_cpu= arm_dma_sync_single_for_cpu,
.sync_single_for_device = arm_dma_sync_single_for_device,
.sync_sg_for_cpu= arm_dma_sync_sg_for_cpu,
@@ -211,6 +212,7 @@ const struct dma_map_ops arm_coherent_dma_ops = {
.get_sgtable= arm_dma_get_sgtable,
.map_page   = arm_coherent_dma_map_page,
.map_sg = arm_dma_map_sg,
+   .map_resource   = dma_direct_map_resource,
.dma_supported  = arm_dma_supported,
  };
  EXPORT_SYMBOL(arm_coherent_dma_ops);
diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
b/arch/powerpc/kernel/dma-swiotlb.c
index 7d5fc9751622..fbb2506a414e 100644
--- a/arch/powerpc/kernel/dma-swiotlb.c
+++ b/arch/powerpc/kernel/dma-swiotlb.c
@@ -55,6 +55,7 @@ const struct dma_map_ops powerpc_swiotlb_dma_ops = {
.dma_supported = swiotlb_dma_supported,
.map_page = dma_direct_map_page,
.unmap_page = dma_direct_unmap_page,
+   .map_resource = dma_direct_map_resource,
.sync_single_for_cpu = dma_direct_sync_single_for_cpu,
.sync_single_for_device = dma_direct_sync_single_for_device,
.sync_sg_for_cpu = dma_direct_sync_sg_for_cpu,
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index b1903ebb2e9c..258b9e8ebb99 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -273,6 +273,7 @@ const struct dma_map_ops dma_nommu_ops = {
.dma_supported  = dma_nommu_dma_supported,
.map_page   = dma_nommu_map_page,
.unmap_page = dma_nommu_unmap_page,
+   .map_resource   = dma_direct_map_resource,
.get_required_mask  = dma_nommu_get_required_mask,
  #ifdef CONFIG_NOT_COHERENT_CACHE
.sync_single_for_cpu= dma_nommu_sync_single,
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index cef2127e1d70..d3087829a6df 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -208,6 +208,8 @@ dma_addr_t dma_direct_map_page(struct device *dev, struct 
page *page,
unsigned long attrs);
  int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs);
+dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
+   size_t size, enum dma_data_direction dir, unsigned long attrs);
  
  #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \

  defined(CONFIG_SWIOTLB)
@@ -346,19 +348,19 @@ static inline dma_addr_t dma_map_resource(struct device 
*dev,
  unsigned long attrs)
  {
const struct dma_map_ops *ops = 

Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support

2018-12-10 Thread Robin Murphy

On 10/12/2018 14:21, Rafael David Tinoco wrote:

On 32-bit systems, zsmalloc uses HIGHMEM and, when PAE is enabled, the
physical frame number might be so big that zsmalloc obj encoding (to
location) will break, causing:

BUG: KASAN: null-ptr-deref in zs_map_object+0xa4/0x2bc
Read of size 4 at addr  by task mkfs.ext4/623
CPU: 2 PID: 623 Comm: mkfs.ext4 Not tainted 
4.19.0-rc8-00017-g8239bc6d3307-dirty #15
Hardware name: Generic DT based system
[] (unwind_backtrace) from [] (show_stack+0x20/0x24)
[] (show_stack) from [] (dump_stack+0xbc/0xe8)
[] (dump_stack) from [] (kasan_report+0x248/0x390)
[] (kasan_report) from [] (__asan_load4+0x78/0xb4)
[] (__asan_load4) from [] (zs_map_object+0xa4/0x2bc)
[] (zs_map_object) from [] 
(zram_bvec_rw.constprop.2+0x324/0x8e4 [zram])
[] (zram_bvec_rw.constprop.2 [zram]) from [] 
(zram_make_request+0x234/0x46c [zram])
[] (zram_make_request [zram]) from [] 
(generic_make_request+0x304/0x63c)
[] (generic_make_request) from [] (submit_bio+0x4c/0x1c8)
[] (submit_bio) from [] 
(submit_bh_wbc.constprop.15+0x238/0x26c)
[] (submit_bh_wbc.constprop.15) from [] 
(__block_write_full_page+0x524/0x76c)
[] (__block_write_full_page) from [] 
(block_write_full_page+0x1bc/0x1d4)
[] (block_write_full_page) from [] 
(blkdev_writepage+0x24/0x28)
[] (blkdev_writepage) from [] (__writepage+0x44/0x78)
[] (__writepage) from [] (write_cache_pages+0x3b8/0x800)
[] (write_cache_pages) from [] 
(generic_writepages+0x74/0xa0)
[] (generic_writepages) from [] 
(blkdev_writepages+0x18/0x1c)
[] (blkdev_writepages) from [] (do_writepages+0x68/0x134)
[] (do_writepages) from [] 
(__filemap_fdatawrite_range+0xb0/0xf4)
[] (__filemap_fdatawrite_range) from [] 
(file_write_and_wait_range+0x64/0xd0)
[] (file_write_and_wait_range) from [] 
(blkdev_fsync+0x54/0x84)
[] (blkdev_fsync) from [] (vfs_fsync_range+0x70/0xd4)
[] (vfs_fsync_range) from [] (do_fsync+0x4c/0x80)
[] (do_fsync) from [] (sys_fsync+0x1c/0x20)
[] (sys_fsync) from [] (ret_fast_syscall+0x0/0x2c)

when trying to decode (the pfn) and map the object.

That happens because one architecture might not re-define
MAX_PHYSMEM_BITS, like in this ARM 32-bit w/ LPAE enabled example. For
32-bit systems, if not re-defined, MAX_POSSIBLE_PHYSMEM_BITS will
default to BITS_PER_LONG (32) in most cases, and, with PAE enabled,
_PFN_BITS might be wrong, which may cause the obj variable to overflow if
the frame number is in HIGHMEM and references a page above the 4GB
watermark.

commit 6e00ec00b1a7 ("staging: zsmalloc: calculate MAX_PHYSMEM_BITS if
not defined") realized MAX_PHYSMEM_BITS depended on SPARSEMEM headers
and "fixed" it by calculating it using BITS_PER_LONG if SPARSEMEM wasn't
used, like in the example given above.

Systems with potential for PAE have existed for a long time, and assuming
BITS_PER_LONG seems inadequate. Defining MAX_PHYSMEM_BITS looks better;
however, it is no longer a constant for x86.

SO, instead, MAX_POSSIBLE_PHYSMEM_BITS should be defined by every
architecture using zsmalloc, together with a sanity check for
MAX_POSSIBLE_PHYSMEM_BITS being too big on 32-bit systems.
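
Roughly why the encoding breaks (illustrative; exact macro values depend on
the configuration):

	/*
	 * zsmalloc packs (pfn, obj_idx) into a single unsigned long:
	 *   _PFN_BITS      = MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT
	 *   OBJ_INDEX_BITS = BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS
	 *   obj            = (pfn << OBJ_INDEX_BITS | obj_idx) << OBJ_TAG_BITS
	 *
	 * With MAX_POSSIBLE_PHYSMEM_BITS wrongly taken as 32 on a 32-bit
	 * LPAE system (PAGE_SHIFT = 12), only 20 bits are left for the pfn,
	 * so a HIGHMEM page above 4GB (pfn >= 1 << 20) no longer fits and
	 * the later decode hands back a bogus struct page - hence the
	 * NULL-pointer dereference in zs_map_object() seen above.
	 */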

Link: https://bugs.linaro.org/show_bug.cgi?id=3765#c17
Signed-off-by: Rafael David Tinoco 
---
  arch/arm/include/asm/pgtable-2level-types.h |  2 ++
  arch/arm/include/asm/pgtable-3level-types.h |  2 ++
  arch/arm64/include/asm/pgtable-types.h  |  2 ++
  arch/ia64/include/asm/page.h|  2 ++
  arch/mips/include/asm/page.h|  2 ++
  arch/powerpc/include/asm/mmu.h  |  2 ++
  arch/s390/include/asm/page.h|  2 ++
  arch/sh/include/asm/page.h  |  2 ++
  arch/sparc/include/asm/page_32.h|  2 ++
  arch/sparc/include/asm/page_64.h|  2 ++
  arch/x86/include/asm/pgtable-2level_types.h |  2 ++
  arch/x86/include/asm/pgtable-3level_types.h |  3 +-
  arch/x86/include/asm/pgtable_64_types.h |  4 +--
  mm/zsmalloc.c   | 35 +++--
  14 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level-types.h 
b/arch/arm/include/asm/pgtable-2level-types.h
index 66cb5b0e89c5..552dba411324 100644
--- a/arch/arm/include/asm/pgtable-2level-types.h
+++ b/arch/arm/include/asm/pgtable-2level-types.h
@@ -64,4 +64,6 @@ typedef pteval_t pgprot_t;
  
  #endif /* STRICT_MM_TYPECHECKS */
  
+#define MAX_POSSIBLE_PHYSMEM_BITS 32

+
  #endif/* _ASM_PGTABLE_2LEVEL_TYPES_H */
diff --git a/arch/arm/include/asm/pgtable-3level-types.h 
b/arch/arm/include/asm/pgtable-3level-types.h
index 921aa30259c4..664c39e6517c 100644
--- a/arch/arm/include/asm/pgtable-3level-types.h
+++ b/arch/arm/include/asm/pgtable-3level-types.h
@@ -67,4 +67,6 @@ typedef pteval_t pgprot_t;
  
  #endif	/* STRICT_MM_TYPECHECKS */
  
+#define MAX_POSSIBLE_PHYSMEM_BITS 36


Nit: with LPAE, physical addresses go up to 40 bits, not just 36.

Robin.


+
  #endif/* _ASM_PGTABLE_3LEVEL_TYPES_H */
diff --git a/arch/arm64/include/asm/pgtable-types.h 

Re: [PATCH v2 03/20] perf/core: add PERF_PMU_CAP_EXCLUDE for exclusion capable PMUs

2018-11-26 Thread Robin Murphy

Hi Andrew,

On 26/11/2018 11:12, Andrew Murray wrote:

Many PMU drivers do not have the capability to exclude counting events
that occur in specific contexts such as idle, kernel, guest, etc. These
drivers indicate this by returning an error in their event_init upon
testing the events attribute flags. This approach is error prone and
often inconsistent.

Let's instead allow PMU drivers to advertise their ability to exclude
based on context via a new capability: PERF_PMU_CAP_EXCLUDE. This
allows the perf core to reject requests for exclusion events where
there is no support in the PMU.

Signed-off-by: Andrew Murray 
---
  include/linux/perf_event.h | 1 +
  kernel/events/core.c   | 9 +
  2 files changed, 10 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b2e806f..69b3d65 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -244,6 +244,7 @@ struct perf_event;
  #define PERF_PMU_CAP_EXCLUSIVE0x10
  #define PERF_PMU_CAP_ITRACE   0x20
  #define PERF_PMU_CAP_HETEROGENEOUS_CPUS   0x40
+#define PERF_PMU_CAP_EXCLUDE   0x80
  
  /**

   * struct pmu - generic performance monitoring unit
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5a97f34..9afb33c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9743,6 +9743,15 @@ static int perf_try_init_event(struct pmu *pmu, struct 
perf_event *event)
if (ctx)
perf_event_ctx_unlock(event->group_leader, ctx);
  
+	if (!ret) {

+   if (!(pmu->capabilities & PERF_PMU_CAP_EXCLUDE) &&
+   event_has_any_exclude_flag(event)) {


Technically this is a bisection-breaker, since no driver has this 
capability yet - ideally, this patch should come after all the ones 
introducing it to the relevant drivers (with the removal of the 
now-redundant code from the other drivers at the end).


Alternatively, since we already have several other negative 
capabilities, unless there's a strong feeling against adding any more 
then it might work out simpler to flip it to PERF_PMU_CAP_NO_EXCLUDE, 
such that we only need to introduce the core check then directly replace 
the open-coded event checks with the capability in the appropriate 
drivers, and need not touch the exclusion-supporting ones at all.
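
i.e. something like this in perf_try_init_event() (sketch only; the
capability name here is the suggestion, not what the posted patch uses):

	if (!ret && (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE) &&
	    event_has_any_exclude_flag(event)) {
		if (event->destroy)
			event->destroy(event);
		ret = -EINVAL;
	}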


Robin.


+   if (event->destroy)
+   event->destroy(event);
+   ret = -EINVAL;
+   }
+   }
+
if (ret)
module_put(pmu->module);
  



Re: [PATCH 5/5] dma-direct: always allow dma mask <= physiscal memory size

2018-11-19 Thread Robin Murphy

On 19/11/2018 14:18, Ramon Fried wrote:

On Tue, Oct 9, 2018 at 8:02 AM Benjamin Herrenschmidt
 wrote:


On Wed, 2018-10-03 at 16:10 -0700, Alexander Duyck wrote:

-* Because 32-bit DMA masks are so common we expect every architecture
-* to be able to satisfy them - either by not supporting more physical
-* memory, or by providing a ZONE_DMA32.  If neither is the case, the
-* architecture needs to use an IOMMU instead of the direct mapping.
-*/
-   if (mask < phys_to_dma(dev, DMA_BIT_MASK(32)))
+   u64 min_mask;
+
+   if (IS_ENABLED(CONFIG_ZONE_DMA))
+   min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS);
+   else
+   min_mask = DMA_BIT_MASK(32);
+
+   min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);
+
+   if (mask >= phys_to_dma(dev, min_mask))
  return 0;
-#endif
  return 1;
   }


So I believe I have run into the same issue that Guenter reported. On
an x86_64 system w/ Intel IOMMU. I wasn't able to complete boot and
all probe attempts for various devices were failing with -EIO errors.

I believe the last mask check should be "if (mask < phys_to_dma(dev,
min_mask))" not a ">=" check.


Right, that test is backwards. I needed to change it here too (powermac
with the rest of the powerpc series).

Cheers,
Ben.




Hi, I'm working on a MIPS64 soc with PCIe root complex on it, and it
appears that this series of patches is causing all PCI drivers that
request a 64-bit mask to fail with -5.
It's broken in 4.19. However, I just checked, and it's working on master.
We may need to backport a couple of patches to 4.19. I'm not sure
though which patches should be backported as there were at least 10
patches resolving this dma_direct area recently.
Christoph, Robin.
Can we ask Greg to backport all these changes ? What do you think ?


As far as I'm aware, the only real issue in 4.19 was my subtle breakage 
around setting bus_dma_mask - that's fixed by 6778be4e5209, which 
according to my inbox got picked up by autosel for 4.19 stable last week.


Robin.


Re: [PATCH 13/36] dt-bindings: arm: Convert PMU binding to json-schema

2018-11-08 Thread Robin Murphy

On 08/11/2018 15:59, Thomas Petazzoni wrote:

Hello,

I'm jumping into the discussion, but I clearly don't have all the
context of the discussion.

On Thu, 8 Nov 2018 15:54:31 +, Robin Murphy wrote:


This seems like a semantic difference between the two representations, or am
I missing something here? Specifically, both the introduction of
interrupts-extended and also dropping any mention of using a single per-cpu
interrupt (the single combined case is no longer supported by Linux; not sure
if you want to keep it in the binding).


In regards to no support for the single combined interrupt, it looks
like Marvell Armada SoCs at least (armada-375 is what I'm looking at)
have only a single interrupt. Though the interrupt gets routed to MPIC
which then has a GIC PPI. So it isn't supported or happens to work
still since it is a PPI?


Well, the description of the MPIC in the Armada XP functional spec says:

"Interrupt sources ID0–ID28 are private events per CPU. Thus, each
processor has a different set of events map interrupts ID0–ID28."

Odd grammar aside, that would seem to imply that < 3> is a per-cpu
interrupt itself, thus AFAICS so long as it's cascaded to a GIC PPI and
not an SPI then there's no issue there.


The Armada XP does not have a GIC at all, but only a MPIC as the
primary interrupt controller.

However the Armada 38x has both a GIC and a MPIC, and indeed the parent
interrupts of the MPIC towards the GIC is:

interrupts = ;


Yeah, perhaps I should have clarified that the XP manual was the only 
publicly-available one I could find, but I'm inferring from the binding 
and driver that the MPIC in 375/38x still behaves the same.


Robin.


Re: [PATCH 13/36] dt-bindings: arm: Convert PMU binding to json-schema

2018-11-08 Thread Robin Murphy

On 01/11/2018 19:32, Rob Herring wrote:

On Tue, Oct 9, 2018 at 6:57 AM Will Deacon  wrote:


Hi Rob,

On Fri, Oct 05, 2018 at 11:58:25AM -0500, Rob Herring wrote:

Convert ARM PMU binding to DT schema format using json-schema.

Cc: Will Deacon 
Cc: Mark Rutland 
Cc: linux-arm-ker...@lists.infradead.org
Cc: devicet...@vger.kernel.org
Signed-off-by: Rob Herring 
---
  Documentation/devicetree/bindings/arm/pmu.txt | 70 --
  .../devicetree/bindings/arm/pmu.yaml  | 96 +++
  2 files changed, 96 insertions(+), 70 deletions(-)
  delete mode 100644 Documentation/devicetree/bindings/arm/pmu.txt
  create mode 100644 Documentation/devicetree/bindings/arm/pmu.yaml


[...]


-- interrupts : 1 combined interrupt or 1 per core. If the interrupt is a 
per-cpu
-   interrupt (PPI) then 1 interrupt should be specified.


[...]


+  interrupts:
+oneOf:
+  - maxItems: 1
+  - minItems: 2
+maxItems: 8
+description: 1 interrupt per core.
+
+  interrupts-extended:
+$ref: '#/properties/interrupts'


This seems like a semantic difference between the two representations, or am
I missing something here? Specifically, both the introduction of
interrupts-extended and also dropping any mention of using a single per-cpu
interrupt (the single combined case is no longer supported by Linux; not sure
if you want to keep it in the binding).


In regards to no support for the single combined interrupt, it looks
like Marvell Armada SoCs at least (armada-375 is what I'm looking at)
have only a single interrupt. Though the interrupt gets routed to MPIC
which then has a GIC PPI. So it isn't supported or happens to work
still since it is a PPI?


Well, the description of the MPIC in the Armada XP functional spec says:

"Interrupt sources ID0–ID28 are private events per CPU. Thus, each 
processor has a different set of events map interrupts ID0–ID28."


Odd grammar aside, that would seem to imply that < 3> is a per-cpu 
interrupt itself, thus AFAICS so long as it's cascaded to a GIC PPI and 
not an SPI then there's no issue there.


Robin.


Re: [PATCH] dma-direct: Fix return value of dma_direct_supported

2018-10-04 Thread Robin Murphy

On 04/10/18 00:48, Alexander Duyck wrote:

It appears that in commit 9d7a224b463e ("dma-direct: always allow dma mask
<= physiscal memory size") the logic of the test was changed from a "<" to
a ">=" however I don't see any reason for that change. I am assuming that
there was some additional change planned, specifically I suspect the logic
was intended to be reversed and possibly used for a return. Since that is
the case I have gone ahead and done that.


Bah, seems I got hung up on the min_mask code above it and totally 
overlooked that the condition itself got flipped. It probably also can't 
help that it's an int return type, but treated as a bool by callers 
rather than "0 for success" as int tends to imply in isolation.


Anyway, paying a bit more attention this time, I think this looks like 
the right fix - cheers Alex.


Robin.


This addresses issues I had on my system that prevented me from booting
with the above mentioned commit applied on an x86_64 system w/ Intel IOMMU.

Fixes: 9d7a224b463e ("dma-direct: always allow dma mask <= physiscal memory 
size")
Signed-off-by: Alexander Duyck 
---
  kernel/dma/direct.c |4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 5a0806b5351b..65872f6c2e93 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -301,9 +301,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
  
  	min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);
  
-	if (mask >= phys_to_dma(dev, min_mask))

-   return 0;
-   return 1;
+   return mask >= phys_to_dma(dev, min_mask);
  }
  
  int dma_direct_mapping_error(struct device *dev, dma_addr_t dma_addr)





Re: [PATCH 4/5] dma-direct: implement complete bus_dma_mask handling

2018-09-27 Thread Robin Murphy

On 27/09/18 17:27, Christoph Hellwig wrote:

On Thu, Sep 27, 2018 at 05:14:56PM +0100, Robin Murphy wrote:

This just seemed more readable to me than min_not_zero, but if others
prefer min_not_zero I can switch.


Nah, just checking whether there were any intentionally different
assumptions compared to the couple of other places in the patch where
min_not_zero() *is* used. If it's purely a style thing then no worries
(personally I'd have written it yet another way anyway).


I'm curious: how would you have written it?


Come to think of it, I actually already have, in iommu-dma:

if (dev->bus_dma_mask)
mask &= dev->bus_dma_mask;

but of course it's not so pretty for those cases where you don't already 
have the local variable ready to go.


Robin.


Re: [PATCH 4/5] dma-direct: implement complete bus_dma_mask handling

2018-09-27 Thread Robin Murphy

On 27/09/18 16:32, Christoph Hellwig wrote:

On Thu, Sep 27, 2018 at 03:58:04PM +0100, Robin Murphy wrote:

   }
   #endif /* !CONFIG_ARCH_HAS_PHYS_TO_DMA */
   diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 3c404e33d946..64466b7ef67b 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -43,10 +43,11 @@ check_addr(struct device *dev, dma_addr_t dma_addr, size_t 
size,
return false;
}
   -if (*dev->dma_mask >= DMA_BIT_MASK(32)) {
+   if (*dev->dma_mask >= DMA_BIT_MASK(32) || dev->bus_dma_mask) {


Hmm... say *dev->dma_mask is 31 bits and dev->bus_dma_mask is 40 bits due
to a global DT property, we'll now scream where we didn't before even
though the bus mask is almost certainly irrelevant - is that desirable?


This is just the reporting in the error case - we'll only hit this
IFF dma_capable already returned false.  But if you don't want a message
here we can probably drop this hunk.


It was only a question of consistency - whatever the reason was to avoid 
warning for small device masks before, it's not obvious why it wouldn't 
still apply in the presence of a nonzero but larger bus mask. The fact 
that the current check is there at all implied to me that we're 
expecting less-capable devices to hit this often and thus wanted to avoid 
being noisy. If that's not the case then it's fine as-is AFAIC.



@@ -65,12 +66,18 @@ u64 dma_direct_get_required_mask(struct device *dev)
   {
u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT);
   +if (dev->bus_dma_mask && dev->bus_dma_mask < max_dma)
+   max_dma = dev->bus_dma_mask;


Again, I think we could just do another min_not_zero() here. A device wired
to address only one single page of RAM isn't a realistic prospect (and we
could just flip the -1 and the shift in the max_dma calculation if we
*really* wanted to support such things).


This just seemed more readable to me than min_not_zero, but if others
prefer min_not_zero I can switch.


Nah, just checking whether there were any intentionally different 
assumptions compared to the couple of other places in the patch where 
min_not_zero() *is* used. If it's purely a style thing then no worries 
(personally I'd have written it yet another way anyway).


Robin.


Re: [PATCH 3/5] dma-direct: refine dma_direct_alloc zone selection

2018-09-27 Thread Robin Murphy

On 27/09/18 16:30, Christoph Hellwig wrote:

On Thu, Sep 27, 2018 at 03:30:20PM +0100, Robin Murphy wrote:

   +static gfp_t __dma_direct_optimal_gfp_mask(struct device *dev, u64
dma_mask,
+   u64 *phys_mask)
+{
+   if (force_dma_unencrypted())
+   *phys_mask = __dma_to_phys(dev, dma_mask);
+   else
+   *phys_mask = dma_to_phys(dev, dma_mask);


Maybe make phys_to_dma_direct() take u64 instead of phys_addr_t so we can
reuse it here?


This is a dma_to_phys and not a phys_to_dma.


Ugh, clearly it's time to stop reviewing patches for today... sorry :(

Robin.


Re: [PATCH 2/5] dma-direct: add an explicit dma_direct_get_required_mask

2018-09-27 Thread Robin Murphy

On 27/09/18 16:28, Christoph Hellwig wrote:

On Thu, Sep 27, 2018 at 03:12:25PM +0100, Robin Murphy wrote:

+u64 dma_direct_get_required_mask(struct device *dev)
+{
+   u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT);
+
+   return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;


I think that may as well just use __fls64() - it seems reasonable to assume
max_dma > 0. Otherwise,


Is there any good reason to micro-optimize given that this isn't
a fast path?


Not at all, I wasn't even thinking in terms of optimisation other than 
in terms of number of source characters and levels of parentheses.


But more importantly I was also being a big idiot because no matter how 
much I have the fls()/__fls() thing in mind, __fls64() doesn't actually 
exist. Nitpick rescinded!
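
(For completeness, a worked example of what that expression computes, with a
made-up value: max_dma = 0x140000000, i.e. the top of 5GB of RAM with no
offset, gives fls64() = 33, so the reported mask is
(1ULL << 32) * 2 - 1 = 0x1ffffffff, a 33-bit mask.)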


Robin.


Re: [PATCH 5/5] dma-direct: always allow dma mask <= physiscal memory size

2018-09-27 Thread Robin Murphy
[ oops, I should have looked at the replies first, now I see Ben already 
had the same thing to say about #3... ]


On 27/09/18 14:49, Christoph Hellwig wrote:

On Thu, Sep 27, 2018 at 11:50:14AM +1000, Benjamin Herrenschmidt wrote:

-* to be able to satisfy them - either by not supporting more physical
-* memory, or by providing a ZONE_DMA32.  If neither is the case, the
-* architecture needs to use an IOMMU instead of the direct mapping.
-*/
-   if (mask < phys_to_dma(dev, DMA_BIT_MASK(32)))
+   u64 min_mask;
+
+   if (IS_ENABLED(CONFIG_ZONE_DMA))
+   min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS);
+   else
+   min_mask = min_t(u64, DMA_BIT_MASK(32),
+(max_pfn - 1) << PAGE_SHIFT);
+
+   if (mask >= phys_to_dma(dev, min_mask))
return 0;


nitpick ... to be completely "correct", I would have written

if (IS_ENABLED(CONFIG_ZONE_DMA))
min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS);
else
min_mask = DMA_BIT_MASK(32);

min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);

In "theory" it's also ok to have a mask < ZONE_DMA_BITS as long as it's
big enough to fit all memory :-)


Yeah, we could do that.


FWIW I like it even if just for looking slightly more readable. With 
that fixup,


Reviewed-by: Robin Murphy 


Re: [PATCH 4/5] dma-direct: implement complete bus_dma_mask handling

2018-09-27 Thread Robin Murphy

On 20/09/18 19:52, Christoph Hellwig wrote:

Instead of rejecting devices with a too-small bus_dma_mask, we can handle them
by taking the bus dma_mask into account for allocations and bounce
buffering decisions.

Signed-off-by: Christoph Hellwig 
---
  include/linux/dma-direct.h |  3 ++-
  kernel/dma/direct.c| 21 +++--
  2 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index b79496d8c75b..fbca184ff5a0 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -27,7 +27,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t 
addr, size_t size)
if (!dev->dma_mask)
return false;
  
-	return addr + size - 1 <= *dev->dma_mask;

+   return addr + size - 1 <=
+   min_not_zero(*dev->dma_mask, dev->bus_dma_mask);
  }
  #endif /* !CONFIG_ARCH_HAS_PHYS_TO_DMA */
  
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c

index 3c404e33d946..64466b7ef67b 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -43,10 +43,11 @@ check_addr(struct device *dev, dma_addr_t dma_addr, size_t 
size,
return false;
}
  
-		if (*dev->dma_mask >= DMA_BIT_MASK(32)) {

+   if (*dev->dma_mask >= DMA_BIT_MASK(32) || dev->bus_dma_mask) {


Hmm... say *dev->dma_mask is 31 bits and dev->bus_dma_mask is 40 bits 
due to a global DT property, we'll now scream where we didn't before 
even though the bus mask is almost certainly irrelevant - is that desirable?



dev_err(dev,
-   "%s: overflow %pad+%zu of device mask %llx\n",
-   caller, &dma_addr, size, *dev->dma_mask);
+   "%s: overflow %pad+%zu of device mask %llx bus mask 
%llx\n",
+   caller, &dma_addr, size,
+   *dev->dma_mask, dev->bus_dma_mask);
}
return false;
}
@@ -65,12 +66,18 @@ u64 dma_direct_get_required_mask(struct device *dev)
  {
u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT);
  
+	if (dev->bus_dma_mask && dev->bus_dma_mask < max_dma)

+   max_dma = dev->bus_dma_mask;


Again, I think we could just do another min_not_zero() here. A device 
wired to address only one single page of RAM isn't a realistic prospect 
(and we could just flip the -1 and the shift in the max_dma calculation 
if we *really* wanted to support such things).
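
(The "flip" being: compute the limit as (max_pfn << PAGE_SHIFT) - 1, the last
byte of RAM, rather than (max_pfn - 1) << PAGE_SHIFT, the first byte of the
last page; e.g. with PAGE_SHIFT = 12 and max_pfn = 0x100000 that is
0xffffffff versus 0xfffff000.)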



+
return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;
  }
  
  static gfp_t __dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask,

u64 *phys_mask)
  {
+   if (dev->bus_dma_mask && dev->bus_dma_mask < dma_mask)
+   dma_mask = dev->bus_dma_mask;
+


Similarly, can't we assume dma_mask to be nonzero here too? It feels 
like we really shouldn't have managed to get this far without one.


Robin.


if (force_dma_unencrypted())
*phys_mask = __dma_to_phys(dev, dma_mask);
else
@@ -87,7 +94,7 @@ static gfp_t __dma_direct_optimal_gfp_mask(struct device 
*dev, u64 dma_mask,
  static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
  {
return phys_to_dma_direct(dev, phys) + size - 1 <=
-   dev->coherent_dma_mask;
+   min_not_zero(dev->coherent_dma_mask, dev->bus_dma_mask);
  }
  
  void *dma_direct_alloc_pages(struct device *dev, size_t size,

@@ -291,12 +298,6 @@ int dma_direct_supported(struct device *dev, u64 mask)
if (mask < phys_to_dma(dev, DMA_BIT_MASK(32)))
return 0;
  #endif
-   /*
-* Upstream PCI/PCIe bridges or SoC interconnects may not carry
-* as many DMA address bits as the device itself supports.
-*/
-   if (dev->bus_dma_mask && mask > dev->bus_dma_mask)
-   return 0;
return 1;
  }
  



Re: [PATCH 3/5] dma-direct: refine dma_direct_alloc zone selection

2018-09-27 Thread Robin Murphy

On 20/09/18 19:52, Christoph Hellwig wrote:

We need to take the DMA offset and encryption bit into account when
selecting a zone.  Use the opportunity to factor out the zone
selection into a helper for reuse.

Signed-off-by: Christoph Hellwig 
---
  kernel/dma/direct.c | 31 +--
  1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 81b73a5bba54..3c404e33d946 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -68,6 +68,22 @@ u64 dma_direct_get_required_mask(struct device *dev)
return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;
  }
  
+static gfp_t __dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask,

+   u64 *phys_mask)
+{
+   if (force_dma_unencrypted())
+   *phys_mask = __dma_to_phys(dev, dma_mask);
+   else
+   *phys_mask = dma_to_phys(dev, dma_mask);


Maybe make phys_to_dma_direct() take u64 instead of phys_addr_t so we 
can reuse it here?



+   /* GFP_DMA32 and GFP_DMA are no ops without the corresponding zones: */
+   if (*phys_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
+   return GFP_DMA;


If we don't have ZONE_DMA it would in theory be a tiny bit better to 
fall back to GFP_DMA32 instead of returning 0 here, but I'm not sure 
it's worth the bother.
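
Something like this, if we did want it (sketch only, not a request to change
the patch):

	/* Without a ZONE_DMA, a sub-32-bit mask can still at least prefer
	 * ZONE_DMA32 over plain lowmem. */
	if (*phys_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
		return IS_ENABLED(CONFIG_ZONE_DMA) ? GFP_DMA : GFP_DMA32;
	if (*phys_mask <= DMA_BIT_MASK(32))
		return GFP_DMA32;
	return 0;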



+   if (*phys_mask <= DMA_BIT_MASK(32))
+   return GFP_DMA32;
+   return 0;
+}
+
  static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
  {
return phys_to_dma_direct(dev, phys) + size - 1 <=
@@ -80,17 +96,13 @@ void *dma_direct_alloc_pages(struct device *dev, size_t 
size,
unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
int page_order = get_order(size);
struct page *page = NULL;
+   u64 phys_mask;
void *ret;
  
  	/* we always manually zero the memory once we are done: */

gfp &= ~__GFP_ZERO;
-
-   /* GFP_DMA32 and GFP_DMA are no ops without the corresponding zones: */
-   if (dev->coherent_dma_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
-   gfp |= GFP_DMA;
-   if (dev->coherent_dma_mask <= DMA_BIT_MASK(32) && !(gfp & GFP_DMA))
-   gfp |= GFP_DMA32;
-
+   gfp |= __dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
+   &phys_mask);
  again:
/* CMA can be used only in the context which permits sleeping */
if (gfpflags_allow_blocking(gfp)) {
@@ -109,15 +121,14 @@ void *dma_direct_alloc_pages(struct device *dev, size_t 
size,
page = NULL;
  
  		if (IS_ENABLED(CONFIG_ZONE_DMA32) &&

-   dev->coherent_dma_mask < DMA_BIT_MASK(64) &&
+   phys_mask < DMA_BIT_MASK(64) &&
!(gfp & (GFP_DMA32 | GFP_DMA))) {
gfp |= GFP_DMA32;
goto again;


Ah right, in that situation we should probably end up here anyway, so 
that's good enough - definitely not worth any more #ifdeffery above. 
Nits aside,


Reviewed-by: Robin Murphy 


}
  
  		if (IS_ENABLED(CONFIG_ZONE_DMA) &&

-   dev->coherent_dma_mask < DMA_BIT_MASK(32) &&
-   !(gfp & GFP_DMA)) {
+   phys_mask < DMA_BIT_MASK(32) && !(gfp & GFP_DMA)) {
gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
goto again;
}



Re: [PATCH 2/5] dma-direct: add an explicit dma_direct_get_required_mask

2018-09-27 Thread Robin Murphy

On 20/09/18 19:52, Christoph Hellwig wrote:

This is somewhat modelled after the powerpc version, and differs from
the legacy fallback in using fls64 instead of pointlessly splitting up the
address into low and high dwords, and in that it takes (__)phys_to_dma
into account.

Signed-off-by: Christoph Hellwig 
---
  include/linux/dma-direct.h |  1 +
  kernel/dma/direct.c| 21 ++---
  2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 86a59ba5a7f3..b79496d8c75b 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -55,6 +55,7 @@ static inline void dma_mark_clean(void *addr, size_t size)
  }
  #endif /* CONFIG_ARCH_HAS_DMA_MARK_CLEAN */
  
+u64 dma_direct_get_required_mask(struct device *dev);

  void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t 
*dma_handle,
gfp_t gfp, unsigned long attrs);
  void dma_direct_free(struct device *dev, size_t size, void *cpu_addr,
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index c954f0a6dc62..81b73a5bba54 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -53,11 +53,25 @@ check_addr(struct device *dev, dma_addr_t dma_addr, size_t 
size,
return true;
  }
  
+static inline dma_addr_t phys_to_dma_direct(struct device *dev,

+   phys_addr_t phys)
+{
+   if (force_dma_unencrypted())
+   return __phys_to_dma(dev, phys);
+   return phys_to_dma(dev, phys);
+}
+
+u64 dma_direct_get_required_mask(struct device *dev)
+{
+   u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT);
+
+   return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;


I think that may as well just use __fls64() - it seems reasonable to 
assume max_dma > 0. Otherwise,
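
(As a worked example of that expression, added purely for illustration: if 
max_dma = 0x23fffffff then the top set bit is bit 33, fls64() returns 34, 
and the result is (1ULL << 33) * 2 - 1 = 0x3ffffffff, i.e. the limit is 
rounded up to a full 34-bit mask.)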


Reviewed-by: Robin Murphy 


+}
+
  static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
  {
-   dma_addr_t addr = force_dma_unencrypted() ?
-   __phys_to_dma(dev, phys) : phys_to_dma(dev, phys);
-   return addr + size - 1 <= dev->coherent_dma_mask;
+   return phys_to_dma_direct(dev, phys) + size - 1 <=
+   dev->coherent_dma_mask;
  }
  
  void *dma_direct_alloc_pages(struct device *dev, size_t size,

@@ -296,6 +310,7 @@ const struct dma_map_ops dma_direct_ops = {
.unmap_page = dma_direct_unmap_page,
.unmap_sg   = dma_direct_unmap_sg,
  #endif
+   .get_required_mask  = dma_direct_get_required_mask,
.dma_supported  = dma_direct_supported,
.mapping_error  = dma_direct_mapping_error,
.cache_sync = arch_dma_cache_sync,



Re: [PATCH 4/4] dma-mapping: clear dev->dma_ops in arch_teardown_dma_ops

2018-09-26 Thread Robin Murphy

On 25/09/18 21:16, Christoph Hellwig wrote:

Looking at the code I think this commit is simply broken for
architectures not using arch_setup_dma_ops, but instead setting up
the dma ops through arch specific magic.

I'll revert the patch.


Ugh, sorry about missing that too. Ack to a revert - thinking about 
those PPC symptoms, it might actually be that other architectures could 
also get into the same pickle just by unbinding and rebinding a driver 
(e.g. switching to VFIO then back again).


Robin.

