Re: [mainline] [linux-next] [6.8-rc1] [FC] [DLPAR] Oops kernel crash after performing dlpar remove test
] [c000a878bcb0] [c0685d3c] kernfs_fop_write_iter+0x1cc/0x280
[ 981.124283] [c000a878bd00] [c05909c8] vfs_write+0x358/0x4b0
[ 981.124288] [c000a878bdc0] [c0590cfc] ksys_write+0x7c/0x140
[ 981.124293] [c000a878be10] [c0036554] system_call_exception+0x134/0x330
[ 981.124298] [c000a878be50] [c000d6a0] system_call_common+0x160/0x2e4
[ 981.124303] --- interrupt: c00 at 0x200013f21594
[ 981.124306] NIP: 200013f21594 LR: 200013e97bf4 CTR:
[ 981.124309] REGS: c000a878be80 TRAP: 0c00 Not tainted (6.5.0-rc6-next-20230817-auto)
[ 981.124312] MSR: 8280f033 CR: 22000282 XER:
[ 981.124321] IRQMASK: 0
[ 981.124321] GPR00: 0004 73a55c70 200014007300 0007
[ 981.124321] GPR04: 00013aff5750 0008 fbad2c80 00013afd02a0
[ 981.124321] GPR08: 0001
[ 981.124321] GPR12: 200013b7bc30
[ 981.124321] GPR16:
[ 981.124321] GPR20:
[ 981.124321] GPR24: 00010ef61668 0008 00013aff5750
[ 981.124321] GPR28: 0008 00013afd02a0 00013aff5750 0008
[ 981.124356] NIP [200013f21594] 0x200013f21594
[ 981.124358] LR [200013e97bf4] 0x200013e97bf4
[ 981.124361] --- interrupt: c00
[ 981.124362] Code: 38427bd0 7c0802a6 6000 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7cbf2b78 38a0 7cdd3378 f8010010 f821ffc1 4bff95d1 6000 7c7e1b79
[ 981.124374] ---[ end trace ]---

Thanks and Regards

On 1/31/24 16:18, Robin Murphy wrote:
On 2024-01-31 9:19 am, Tasmiya Nalatwad wrote:
Greetings,

[mainline] [linux-next] [6.8-rc1] [DLPAR] Oops kernel crash after performing dlpar remove test

--- Traces ---
[58563.146236] BUG: Unable to handle kernel data access at 0x6b6b6b6b6b6b6b83
[58563.146242] Faulting instruction address: 0xc09c0e60
[58563.146248] Oops: Kernel access of bad area, sig: 11 [#1]
[58563.146252] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
[58563.146258] Modules linked in: isofs cdrom dm_snapshot dm_bufio dm_round_robin dm_queue_length exfat vfat fat btrfs blake2b_generic xor raid6_pq zstd_compress loop xfs libcrc32c raid0 nvram rpadlpar_io rpaphp nfnetlink xsk_diag bonding tls rfkill sunrpc dm_service_time dm_multipath dm_mod pseries_rng vmx_crypto binfmt_misc ext4 mbcache jbd2 sd_mod sg ibmvscsi scsi_transport_srp ibmveth lpfc nvmet_fc nvmet nvme_fc nvme_fabrics nvme_core t10_pi crc64_rocksoft crc64 scsi_transport_fc fuse
[58563.146326] CPU: 0 PID: 1071247 Comm: drmgr Kdump: loaded Not tainted 6.8.0-rc1-auto-gecb1b8288dc7 #1
[58563.146332] Hardware name: IBM,9009-42A POWER9 (raw) 0x4e0202 0xf05 of:IBM,FW950.A0 (VL950_141) hv:phyp pSeries
[58563.146337] NIP: c09c0e60 LR: c09c0e28 CTR: c09c1584
[58563.146342] REGS: c0007960f260 TRAP: 0380 Not tainted (6.8.0-rc1-auto-gecb1b8288dc7)
[58563.146347] MSR: 80009033 CR: 24822424 XER: 20040006
[58563.146360] CFAR: c09c0e74 IRQMASK: 0
[58563.146360] GPR00: c09c0e28 c0007960f500 c1482600 c3050540
[58563.146360] GPR04: c0089a6870c0 0001 fffe
[58563.146360] GPR08: c2bac020 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b 0220
[58563.146360] GPR12: 2000 c308
[58563.146360] GPR16: 0001
[58563.146360] GPR20: c1281478 c1281490 c2bfed80
[58563.146360] GPR24: c0089a6870c0 c2b9ffb8
[58563.146360] GPR28: c2bac0e8
[58563.146421] NIP [c09c0e60] iommu_ops_from_fwnode+0x68/0x118
[58563.146430] LR [c09c0e28] iommu_ops_from_fwnode+0x30/0x118

This implies that iommu_device_list has become corrupted. Looks like spapr_tce_setup_phb_iommus_initcall() registers an iommu_device which pcibios_free_controller() could free if a PCI controller is removed, but there's no path anywhere to ever unregister any of those IOMMUs.

Presumably this also means that if a PCI controller is dynamically added after init, its IOMMU won't be set up properly either.

Thanks,
Robin.

[58563.146437] Call Trace:
[58563.146439] [c0007960f500] [c0007960f560] 0xc0007960f560 (unreliable)
[58563.146446] [c0007960f530] [c09c0fd0] __iommu_probe_device+0xc0/0x5c0
[58563.146454] [c0007960f5a0] [c09c151c] iommu_probe_device+0x4c/0xb4
[58563.146462] [c0007960f5e0] [c09c15d0] iommu_bus_notifier+0x4c/0x8c
[58563.146469
Re: [mainline] [linux-next] [6.8-rc1] [FC] [DLPAR] Oops kernel crash after performing dlpar remove test
On 2024-01-31 9:19 am, Tasmiya Nalatwad wrote:
Greetings,

[mainline] [linux-next] [6.8-rc1] [DLPAR] Oops kernel crash after performing dlpar remove test

--- Traces ---
[58563.146236] BUG: Unable to handle kernel data access at 0x6b6b6b6b6b6b6b83
[58563.146242] Faulting instruction address: 0xc09c0e60
[58563.146248] Oops: Kernel access of bad area, sig: 11 [#1]
[58563.146252] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
[58563.146258] Modules linked in: isofs cdrom dm_snapshot dm_bufio dm_round_robin dm_queue_length exfat vfat fat btrfs blake2b_generic xor raid6_pq zstd_compress loop xfs libcrc32c raid0 nvram rpadlpar_io rpaphp nfnetlink xsk_diag bonding tls rfkill sunrpc dm_service_time dm_multipath dm_mod pseries_rng vmx_crypto binfmt_misc ext4 mbcache jbd2 sd_mod sg ibmvscsi scsi_transport_srp ibmveth lpfc nvmet_fc nvmet nvme_fc nvme_fabrics nvme_core t10_pi crc64_rocksoft crc64 scsi_transport_fc fuse
[58563.146326] CPU: 0 PID: 1071247 Comm: drmgr Kdump: loaded Not tainted 6.8.0-rc1-auto-gecb1b8288dc7 #1
[58563.146332] Hardware name: IBM,9009-42A POWER9 (raw) 0x4e0202 0xf05 of:IBM,FW950.A0 (VL950_141) hv:phyp pSeries
[58563.146337] NIP: c09c0e60 LR: c09c0e28 CTR: c09c1584
[58563.146342] REGS: c0007960f260 TRAP: 0380 Not tainted (6.8.0-rc1-auto-gecb1b8288dc7)
[58563.146347] MSR: 80009033 CR: 24822424 XER: 20040006
[58563.146360] CFAR: c09c0e74 IRQMASK: 0
[58563.146360] GPR00: c09c0e28 c0007960f500 c1482600 c3050540
[58563.146360] GPR04: c0089a6870c0 0001 fffe
[58563.146360] GPR08: c2bac020 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b 0220
[58563.146360] GPR12: 2000 c308
[58563.146360] GPR16: 0001
[58563.146360] GPR20: c1281478 c1281490 c2bfed80
[58563.146360] GPR24: c0089a6870c0 c2b9ffb8
[58563.146360] GPR28: c2bac0e8
[58563.146421] NIP [c09c0e60] iommu_ops_from_fwnode+0x68/0x118
[58563.146430] LR [c09c0e28] iommu_ops_from_fwnode+0x30/0x118

This implies that iommu_device_list has become corrupted. Looks like spapr_tce_setup_phb_iommus_initcall() registers an iommu_device which pcibios_free_controller() could free if a PCI controller is removed, but there's no path anywhere to ever unregister any of those IOMMUs. Presumably this also means that if a PCI controller is dynamically added after init, its IOMMU won't be set up properly either.

Thanks,
Robin.

[58563.146437] Call Trace:
[58563.146439] [c0007960f500] [c0007960f560] 0xc0007960f560 (unreliable)
[58563.146446] [c0007960f530] [c09c0fd0] __iommu_probe_device+0xc0/0x5c0
[58563.146454] [c0007960f5a0] [c09c151c] iommu_probe_device+0x4c/0xb4
[58563.146462] [c0007960f5e0] [c09c15d0] iommu_bus_notifier+0x4c/0x8c
[58563.146469] [c0007960f600] [c019e3d0] notifier_call_chain+0xb8/0x1a0
[58563.146476] [c0007960f660] [c019eea0] blocking_notifier_call_chain+0x64/0x94
[58563.146483] [c0007960f6a0] [c09d3c5c] bus_notify+0x50/0x7c
[58563.146491] [c0007960f6e0] [c09cfba4] device_add+0x774/0x9bc
[58563.146498] [c0007960f7a0] [c08abe9c] pci_device_add+0x2f4/0x864
[58563.146506] [c0007960f850] [c007d5a0] of_create_pci_dev+0x390/0xa08
[58563.146514] [c0007960f930] [c007de68] __of_scan_bus+0x250/0x328
[58563.146520] [c0007960fa10] [c007a680] pcibios_scan_phb+0x274/0x3c0
[58563.146527] [c0007960fae0] [c0105d58] init_phb_dynamic+0xb8/0x110
[58563.146535] [c0007960fb50] [c008217b0380] dlpar_add_slot+0x170/0x3b4 [rpadlpar_io]
[58563.146544] [c0007960fbf0] [c008217b0ca0] add_slot_store+0xa4/0x140 [rpadlpar_io]
[58563.146551] [c0007960fc80] [c0f3dbec] kobj_attr_store+0x30/0x4c
[58563.146559] [c0007960fca0] [c06931fc] sysfs_kf_write+0x68/0x7c
[58563.146566] [c0007960fcc0] [c0691b2c] kernfs_fop_write_iter+0x1c8/0x278
[58563.146573] [c0007960fd10] [c0599f54] vfs_write+0x340/0x4cc
[58563.146580] [c0007960fdc0] [c059a2bc] ksys_write+0x7c/0x140
[58563.146587] [c0007960fe10] [c0035d74] system_call_exception+0x134/0x330
[58563.146595] [c0007960fe50] [c000d6a0] system_call_common+0x160/0x2e4
[58563.146602] --- interrupt: c00 at 0x24470cb4
[58563.146606] NIP: 24470cb4 LR: 243e7d04 CTR:
[58563.146611] REGS: c0007960fe80 TRAP: 0c00 Not tainted (6.8.0-rc1-auto-gecb1b8288dc7)
[58563.146616] MSR: 8280f033 CR: 24000282 XER:
[58563.146632] IRQMASK: 0
[58563.146632] GPR00:
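For reference, a minimal sketch (illustrative only, not a tested fix) of the kind of teardown Robin describes as missing: whatever pseries structure embeds the per-PHB struct iommu_device would need to be unregistered on the controller-removal path before pcibios_free_controller() frees it. The container type and function below are hypothetical; only the two core API calls are real:

struct example_phb_iommu {                      /* hypothetical container */
        struct iommu_device iommu_dev;
        /* ... */
};

static void example_phb_iommu_teardown(struct example_phb_iommu *phb)
{
        /*
         * Unlink from the core's iommu_device_list so that a later
         * iommu_ops_from_fwnode() walk cannot touch freed memory.
         */
        iommu_device_unregister(&phb->iommu_dev);
        /* ...and drop the sysfs entry, if one was added at registration */
        iommu_device_sysfs_remove(&phb->iommu_dev);
}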
Re: [PATCH 6/8] iommu/dart: Move the blocked domain support to a global static
On 2023-09-26 20:05, Janne Grunau wrote:
Hej,

On Fri, Sep 22, 2023 at 02:07:57PM -0300, Jason Gunthorpe wrote:
Move to the new static global for blocked domains. Move the blocked specific code to apple_dart_attach_dev_blocked().

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/apple-dart.c | 36 ++-
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 424f779ccc34df..376f4c5461e8f7 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -675,10 +675,6 @@ static int apple_dart_attach_dev(struct iommu_domain *domain,
 		for_each_stream_map(i, cfg, stream_map)
 			apple_dart_setup_translation(dart_domain, stream_map);
 		break;
-	case IOMMU_DOMAIN_BLOCKED:
-		for_each_stream_map(i, cfg, stream_map)
-			apple_dart_hw_disable_dma(stream_map);
-		break;
 	default:
 		return -EINVAL;
 	}
@@ -710,6 +706,30 @@ static struct iommu_domain apple_dart_identity_domain = {
 	.ops = &apple_dart_identity_ops,
 };

+static int apple_dart_attach_dev_blocked(struct iommu_domain *domain,
+					 struct device *dev)
+{
+	struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
+	struct apple_dart_stream_map *stream_map;
+	int i;
+
+	if (cfg->stream_maps[0].dart->force_bypass)
+		return -EINVAL;

unrelated to this change as this keeps the current behavior but I think force_bypass should not override IOMMU_DOMAIN_BLOCKED. It is set if the CPU page size is smaller than dart's page size. Obviously dart can't translate in that situation but it should be still possible to block it completely. How do we manage this? I can write a patch either to the current state or based on this series.

The series is queued already, so best to send a patch based on iommu/core (I guess just removing these lines?). It won't be super-useful in practice since the blocking domain is normally only used to transition to an unmanaged domain which in the force_bypass situation can't be used anyway, but it's still nice on principle not to have unnecessary reasons for attach to fail.

Thanks,
Robin.

+
+	for_each_stream_map(i, cfg, stream_map)
+		apple_dart_hw_disable_dma(stream_map);
+	return 0;
+}
+
+static const struct iommu_domain_ops apple_dart_blocked_ops = {
+	.attach_dev = apple_dart_attach_dev_blocked,
+};
+
+static struct iommu_domain apple_dart_blocked_domain = {
+	.type = IOMMU_DOMAIN_BLOCKED,
+	.ops = &apple_dart_blocked_ops,
+};
+
 static struct iommu_device *apple_dart_probe_device(struct device *dev)
 {
 	struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
@@ -739,8 +759,7 @@ static struct iommu_domain *apple_dart_domain_alloc(unsigned int type)
 {
 	struct apple_dart_domain *dart_domain;

-	if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED &&
-	    type != IOMMU_DOMAIN_BLOCKED)
+	if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
 		return NULL;

 	dart_domain = kzalloc(sizeof(*dart_domain), GFP_KERNEL);
@@ -749,10 +768,6 @@ static struct iommu_domain *apple_dart_domain_alloc(unsigned int type)

 	mutex_init(&dart_domain->init_lock);

-	/* no need to allocate pgtbl_ops or do any other finalization steps */
-	if (type == IOMMU_DOMAIN_BLOCKED)
-		dart_domain->finalized = true;
-
 	return &dart_domain->domain;
 }

@@ -966,6 +981,7 @@ static void apple_dart_get_resv_regions(struct device *dev,
 static const struct iommu_ops apple_dart_iommu_ops = {
 	.identity_domain = &apple_dart_identity_domain,
+	.blocked_domain = &apple_dart_blocked_domain,
 	.domain_alloc = apple_dart_domain_alloc,
 	.probe_device = apple_dart_probe_device,
 	.release_device = apple_dart_release_device,
--
2.42.0

Reviewed-by: Janne Grunau

best regards
Janne

ps: I sent the reply to [Patch 4/8] accidentally with an incorrect from address but the correct Reviewed-by:. I can resend if necessary.
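For reference, the follow-up Robin suggests against iommu/core would roughly amount to letting the blocked attach succeed regardless of force_bypass, since disabling DMA on the stream maps needs no translation. A sketch, not the actual submitted patch:

static int apple_dart_attach_dev_blocked(struct iommu_domain *domain,
					 struct device *dev)
{
	struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
	struct apple_dart_stream_map *stream_map;
	int i;

	/* Blocking works even when translation can't: just disable DMA */
	for_each_stream_map(i, cfg, stream_map)
		apple_dart_hw_disable_dma(stream_map);
	return 0;
}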
Re: [PATCH 6/8] iommu/dart: Move the blocked domain support to a global static
On 2023-09-26 20:34, Robin Murphy wrote:
On 2023-09-26 20:05, Janne Grunau wrote:
Hej,

On Fri, Sep 22, 2023 at 02:07:57PM -0300, Jason Gunthorpe wrote:
Move to the new static global for blocked domains. Move the blocked specific code to apple_dart_attach_dev_blocked().

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/apple-dart.c | 36 ++-
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 424f779ccc34df..376f4c5461e8f7 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -675,10 +675,6 @@ static int apple_dart_attach_dev(struct iommu_domain *domain,
 		for_each_stream_map(i, cfg, stream_map)
 			apple_dart_setup_translation(dart_domain, stream_map);
 		break;
-	case IOMMU_DOMAIN_BLOCKED:
-		for_each_stream_map(i, cfg, stream_map)
-			apple_dart_hw_disable_dma(stream_map);
-		break;
 	default:
 		return -EINVAL;
 	}
@@ -710,6 +706,30 @@ static struct iommu_domain apple_dart_identity_domain = {
 	.ops = &apple_dart_identity_ops,
 };

+static int apple_dart_attach_dev_blocked(struct iommu_domain *domain,
+					 struct device *dev)
+{
+	struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
+	struct apple_dart_stream_map *stream_map;
+	int i;
+
+	if (cfg->stream_maps[0].dart->force_bypass)
+		return -EINVAL;

unrelated to this change as this keeps the current behavior but I think force_bypass should not override IOMMU_DOMAIN_BLOCKED. It is set if the CPU page size is smaller than dart's page size. Obviously dart can't translate in that situation but it should be still possible to block it completely. How do we manage this? I can write a patch either to the current state or based on this series.

The series is queued already, so best to send a patch based on iommu/core (I guess just removing these lines?).

Um, what? This isn't the domain_alloc_paging series itself, Robin you fool. Clearly it's time to close the computer and try again tomorrow...

Cheers,
Robin.

It won't be super-useful in practice since the blocking domain is normally only used to transition to an unmanaged domain which in the force_bypass situation can't be used anyway, but it's still nice on principle not to have unnecessary reasons for attach to fail.

Thanks,
Robin.

+
+	for_each_stream_map(i, cfg, stream_map)
+		apple_dart_hw_disable_dma(stream_map);
+	return 0;
+}
+
+static const struct iommu_domain_ops apple_dart_blocked_ops = {
+	.attach_dev = apple_dart_attach_dev_blocked,
+};
+
+static struct iommu_domain apple_dart_blocked_domain = {
+	.type = IOMMU_DOMAIN_BLOCKED,
+	.ops = &apple_dart_blocked_ops,
+};
+
 static struct iommu_device *apple_dart_probe_device(struct device *dev)
 {
 	struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
@@ -739,8 +759,7 @@ static struct iommu_domain *apple_dart_domain_alloc(unsigned int type)
 {
 	struct apple_dart_domain *dart_domain;

-	if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED &&
-	    type != IOMMU_DOMAIN_BLOCKED)
+	if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
 		return NULL;

 	dart_domain = kzalloc(sizeof(*dart_domain), GFP_KERNEL);
@@ -749,10 +768,6 @@ static struct iommu_domain *apple_dart_domain_alloc(unsigned int type)

 	mutex_init(&dart_domain->init_lock);

-	/* no need to allocate pgtbl_ops or do any other finalization steps */
-	if (type == IOMMU_DOMAIN_BLOCKED)
-		dart_domain->finalized = true;
-
 	return &dart_domain->domain;
 }

@@ -966,6 +981,7 @@ static void apple_dart_get_resv_regions(struct device *dev,
 static const struct iommu_ops apple_dart_iommu_ops = {
 	.identity_domain = &apple_dart_identity_domain,
+	.blocked_domain = &apple_dart_blocked_domain,
 	.domain_alloc = apple_dart_domain_alloc,
 	.probe_device = apple_dart_probe_device,
 	.release_device = apple_dart_release_device,
--
2.42.0

Reviewed-by: Janne Grunau

best regards
Janne

ps: I sent the reply to [Patch 4/8] accidentally with an incorrect from address but the correct Reviewed-by:. I can resend if necessary.
Re: [PATCH v2 09/25] iommu/fsl_pamu: Implement an IDENTITY domain
On 2023-06-01 20:46, Jason Gunthorpe wrote: On Thu, Jun 01, 2023 at 08:37:45PM +0100, Robin Murphy wrote: On 2023-05-16 01:00, Jason Gunthorpe wrote: Robin was able to check the documentation and what fsl_pamu has historically called detach_dev() is really putting the IOMMU into an IDENTITY mode. Unfortunately it was the other way around - it's the call to fsl_setup_liodns() from fsl_pamu_probe() which leaves everything in bypass by default (the PAACE_ATM_NO_XLATE part, IIRC), whereas the detach_device() call here ends up disabling the given device's LIODN altogether Er, I see.. Let me think about it, you convinced me to change it from PLATFORM, so maybe we should go back to that if it is all wonky. FWIW I was thinking more along the lines of a token nominal identity domain where attach does nothing at all... There doesn't appear to have ever been any code anywhere for putting things *back* into bypass after using a VFIO domain, so as-is these default domains would probably break all DMA :( Sounds like it just never worked right. ie going to VFIO mode was always a one way trip and you can't go back to a kernel driver. ...on the assumption that doing so wouldn't really be any less broken than it always has been :) Thanks, Robin. I don't think this patch makes it worse because we call the identity attach_dev in all the same places we called detach_dev in the first place. We add an extra call at the start of time, but that call is NOP'd by this: if (domain == platform_domain || !domain) + return 0; + (bah, and the variable name needs updating too) Honestly, I don't really want to fix FSL since it seems abandoned, so either this patch or going back to PLATFORM seems like the best option. Jason
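A sketch of the "token nominal identity domain where attach does nothing at all" that Robin floats above, under the assumption that fsl_setup_liodns() has already left the hardware in bypass at probe time, so attach genuinely has nothing to switch (illustrative only, not a submitted patch):

static int fsl_pamu_identity_attach(struct iommu_domain *identity_domain,
				    struct device *dev)
{
	/* Hardware is left in bypass by fsl_setup_liodns(); nothing to do */
	return 0;
}

static const struct iommu_domain_ops fsl_pamu_identity_ops = {
	.attach_dev = fsl_pamu_identity_attach,
};

static struct iommu_domain fsl_pamu_identity_domain = {
	.type = IOMMU_DOMAIN_IDENTITY,
	.ops = &fsl_pamu_identity_ops,
};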
Re: [PATCH v2 25/25] iommu: Convert remaining simple drivers to domain_alloc_paging()
On 2023-05-16 01:00, Jason Gunthorpe wrote:
These drivers don't support IOMMU_DOMAIN_DMA, so this commit effectively allows them to support that mode.

The prior work to require default_domains makes this safe because every one of these drivers is either compilation incompatible with dma-iommu.c, or already establishing a default_domain. In both cases alloc_domain() will never be called with IOMMU_DOMAIN_DMA for these drivers so it is safe to drop the test.

Removing these tests clarifies that the domain allocation path is only about the functionality of a paging domain and has nothing to do with policy of how the paging domain is used for UNMANAGED/DMA/DMA_FQ.

Tested-by: Niklas Schnelle
Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/fsl_pamu_domain.c | 7 ++-
 drivers/iommu/msm_iommu.c       | 7 ++-
 drivers/iommu/mtk_iommu_v1.c    | 7 ++-
 drivers/iommu/omap-iommu.c      | 7 ++-
 drivers/iommu/s390-iommu.c      | 7 ++-
 5 files changed, 10 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index ca4f5ebf028783..8d5d6a3acf9dfd 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -192,13 +192,10 @@ static void fsl_pamu_domain_free(struct iommu_domain *domain)
 	kmem_cache_free(fsl_pamu_domain_cache, dma_domain);
 }

-static struct iommu_domain *fsl_pamu_domain_alloc(unsigned type)
+static struct iommu_domain *fsl_pamu_domain_alloc_paging(struct device *dev)

This isn't a paging domain - it doesn't support map/unmap, and AFAICT all it has ever been intended to do is "isolate" accesses to within an aperture which is never set to anything less than the entire physical address space :/ I hate to imagine what the VFIO userspace applications looked like...

Thanks,
Robin.

 {
 	struct fsl_dma_domain *dma_domain;

-	if (type != IOMMU_DOMAIN_UNMANAGED)
-		return NULL;
-
 	dma_domain = kmem_cache_zalloc(fsl_pamu_domain_cache, GFP_KERNEL);
 	if (!dma_domain)
 		return NULL;
@@ -476,7 +473,7 @@ static const struct iommu_ops fsl_pamu_ops = {
 	.identity_domain = &fsl_pamu_identity_domain,
 	.def_domain_type = &fsl_pamu_def_domain_type,
 	.capable = fsl_pamu_capable,
-	.domain_alloc = fsl_pamu_domain_alloc,
+	.domain_alloc_paging = fsl_pamu_domain_alloc_paging,
 	.probe_device = fsl_pamu_probe_device,
 	.device_group = fsl_pamu_device_group,
 	.default_domain_ops = &(const struct iommu_domain_ops) {
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 26ed81cfeee897..a163cee0b7242d 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -302,13 +302,10 @@ static void __program_context(void __iomem *base, int ctx,
 	SET_M(base, ctx, 1);
 }

-static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *msm_iommu_domain_alloc_paging(struct device *dev)
 {
 	struct msm_priv *priv;

-	if (type != IOMMU_DOMAIN_UNMANAGED)
-		return NULL;
-
 	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
 	if (!priv)
 		goto fail_nomem;
@@ -691,7 +688,7 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)

 static struct iommu_ops msm_iommu_ops = {
 	.identity_domain = &msm_iommu_identity_domain,
-	.domain_alloc = msm_iommu_domain_alloc,
+	.domain_alloc_paging = msm_iommu_domain_alloc_paging,
 	.probe_device = msm_iommu_probe_device,
 	.device_group = generic_device_group,
 	.pgsize_bitmap = MSM_IOMMU_PGSIZES,
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 7c0c1d50df5f75..67e044c1a7d93b 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -270,13 +270,10 @@ static int mtk_iommu_v1_domain_finalise(struct mtk_iommu_v1_data *data)
 	return 0;
 }

-static struct iommu_domain *mtk_iommu_v1_domain_alloc(unsigned type)
+static struct iommu_domain *mtk_iommu_v1_domain_alloc_paging(struct device *dev)
 {
 	struct mtk_iommu_v1_domain *dom;

-	if (type != IOMMU_DOMAIN_UNMANAGED)
-		return NULL;
-
 	dom = kzalloc(sizeof(*dom), GFP_KERNEL);
 	if (!dom)
 		return NULL;
@@ -585,7 +582,7 @@ static int mtk_iommu_v1_hw_init(const struct mtk_iommu_v1_data *data)

 static const struct iommu_ops mtk_iommu_v1_ops = {
 	.identity_domain = &mtk_iommu_v1_identity_domain,
-	.domain_alloc = mtk_iommu_v1_domain_alloc,
+	.domain_alloc_paging = mtk_iommu_v1_domain_alloc_paging,
 	.probe_device = mtk_iommu_v1_probe_device,
 	.probe_finalize = mtk_iommu_v1_probe_finalize,
 	.release_device = mtk_iommu_v1_release_device,
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 34340ef15241bc..fcf99bd195b32e 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1580,13
Re: [PATCH v2 09/25] iommu/fsl_pamu: Implement an IDENTITY domain
On 2023-05-16 01:00, Jason Gunthorpe wrote:
Robin was able to check the documentation and what fsl_pamu has historically called detach_dev() is really putting the IOMMU into an IDENTITY mode.

Unfortunately it was the other way around - it's the call to fsl_setup_liodns() from fsl_pamu_probe() which leaves everything in bypass by default (the PAACE_ATM_NO_XLATE part, IIRC), whereas the detach_device() call here ends up disabling the given device's LIODN altogether. There doesn't appear to have ever been any code anywhere for putting things *back* into bypass after using a VFIO domain, so as-is these default domains would probably break all DMA :(

Thanks,
Robin.

Move to the new core support for ARM_DMA_USE_IOMMU by defining ops->identity_domain. This is a ppc driver without any dma_ops, so ensure the identity translation is the default domain.

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/fsl_pamu_domain.c | 32 +---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index bce37229709965..ca4f5ebf028783 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -283,15 +283,21 @@ static int fsl_pamu_attach_device(struct iommu_domain *domain,
 	return ret;
 }

-static void fsl_pamu_set_platform_dma(struct device *dev)
+static int fsl_pamu_identity_attach(struct iommu_domain *platform_domain,
+				    struct device *dev)
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-	struct fsl_dma_domain *dma_domain = to_fsl_dma_domain(domain);
+	struct fsl_dma_domain *dma_domain;
 	const u32 *prop;
 	int len;
 	struct pci_dev *pdev = NULL;
 	struct pci_controller *pci_ctl;

+	if (domain == platform_domain || !domain)
+		return 0;
+
+	dma_domain = to_fsl_dma_domain(domain);
+
 	/*
 	 * Use LIODN of the PCI controller while detaching a
 	 * PCI device.
@@ -312,8 +318,18 @@ static void fsl_pamu_set_platform_dma(struct device *dev)
 		detach_device(dev, dma_domain);
 	else
 		pr_debug("missing fsl,liodn property at %pOF\n", dev->of_node);
+	return 0;
 }

+static struct iommu_domain_ops fsl_pamu_identity_ops = {
+	.attach_dev = fsl_pamu_identity_attach,
+};
+
+static struct iommu_domain fsl_pamu_identity_domain = {
+	.type = IOMMU_DOMAIN_IDENTITY,
+	.ops = &fsl_pamu_identity_ops,
+};
+
 /* Set the domain stash attribute */
 int fsl_pamu_configure_l1_stash(struct iommu_domain *domain, u32 cpu)
 {
@@ -447,12 +463,22 @@ static struct iommu_device *fsl_pamu_probe_device(struct device *dev)
 	return &pamu_iommu;
 }

+static int fsl_pamu_def_domain_type(struct device *dev)
+{
+	/*
+	 * This platform does not use dma_ops at all so normally the iommu
+	 * must be in identity mode
+	 */
+	return IOMMU_DOMAIN_IDENTITY;
+}
+
 static const struct iommu_ops fsl_pamu_ops = {
+	.identity_domain = &fsl_pamu_identity_domain,
+	.def_domain_type = &fsl_pamu_def_domain_type,
 	.capable = fsl_pamu_capable,
 	.domain_alloc = fsl_pamu_domain_alloc,
 	.probe_device = fsl_pamu_probe_device,
 	.device_group = fsl_pamu_device_group,
-	.set_platform_dma_ops = fsl_pamu_set_platform_dma,
 	.default_domain_ops = &(const struct iommu_domain_ops) {
 		.attach_dev = fsl_pamu_attach_device,
 		.iova_to_phys = fsl_pamu_iova_to_phys,
Re: [PATCH v2 23/25] iommu: Add ops->domain_alloc_paging()
On 2023-05-16 01:00, Jason Gunthorpe wrote:
This callback requests the driver to create only a __IOMMU_DOMAIN_PAGING domain, so it saves a few lines in a lot of drivers needlessly checking the type.

More critically, this allows us to sweep out all the IOMMU_DOMAIN_UNMANAGED and IOMMU_DOMAIN_DMA checks from a lot of the drivers, simplifying what is going on in the code and ultimately removing the now-unused special cases in drivers where they did not support IOMMU_DOMAIN_DMA.

domain_alloc_paging() should return a struct iommu_domain that is functionally compatible with ARM_DMA_USE_IOMMU, dma-iommu.c and iommufd.

Be forwards looking and pass in a 'struct device *' argument. We can provide this when allocating the default_domain. No drivers will look at this.

As mentioned before, we already know we're going to need additional flags (and possibly data) to cover the existing set_pgtable_quirks use-case plus new stuff like the proposed dirty-tracking enable, so I'd be inclined to either add an extensible structure argument now to avoid future churn, or just not bother adding the device argument either until drivers can actually use it.

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/iommu.c | 18 +++---
 include/linux/iommu.h |  3 +++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c4cac1dcf80610..15aa51c356bd74 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1995,14 +1995,25 @@ void iommu_set_fault_handler(struct iommu_domain *domain,
 EXPORT_SYMBOL_GPL(iommu_set_fault_handler);

 static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
+						 struct device *dev,
 						 unsigned int type)
 {
 	struct iommu_domain *domain;

 	if (type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
 		return ops->identity_domain;
+	else if ((type == IOMMU_DOMAIN_UNMANAGED || type == IOMMU_DOMAIN_DMA) &&
+		 ops->domain_alloc_paging) {
+		/*
+		 * For now exclude DMA_FQ since it is still a driver policy
+		 * decision through domain_alloc() if we can use FQ mode.
+		 */

That's sorted now, so the type test can neatly collapse down to "type & __IOMMU_DOMAIN_PAGING".

Thanks,
Robin.

+		domain = ops->domain_alloc_paging(dev);
+	} else if (ops->domain_alloc)
+		domain = ops->domain_alloc(type);
+	else
+		return NULL;

-	domain = ops->domain_alloc(type);
 	if (!domain)
 		return NULL;
@@ -2033,14 +2044,15 @@ __iommu_group_domain_alloc(struct iommu_group *group, unsigned int type)

 	lockdep_assert_held(&group->mutex);

-	return __iommu_domain_alloc(dev_iommu_ops(dev), type);
+	return __iommu_domain_alloc(dev_iommu_ops(dev), dev, type);
 }

 struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
 {
 	if (bus == NULL || bus->iommu_ops == NULL)
 		return NULL;
-	return __iommu_domain_alloc(bus->iommu_ops, IOMMU_DOMAIN_UNMANAGED);
+	return __iommu_domain_alloc(bus->iommu_ops, NULL,
+				    IOMMU_DOMAIN_UNMANAGED);
 }
 EXPORT_SYMBOL_GPL(iommu_domain_alloc);

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 387746f8273c99..18b0df42cc80d1 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -227,6 +227,8 @@ struct iommu_iotlb_gather {
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
  * @domain_alloc: allocate iommu domain
+ * @domain_alloc_paging: Allocate an iommu_domain that can be used for
+ *                       UNMANAGED, DMA, and DMA_FQ domain types.
  * @probe_device: Add device to iommu driver handling
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -258,6 +260,7 @@ struct iommu_ops {

 	/* Domain allocation and freeing by the iommu driver */
 	struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
+	struct iommu_domain *(*domain_alloc_paging)(struct device *dev);

 	struct iommu_device *(*probe_device)(struct device *dev);
 	void (*release_device)(struct device *dev);
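For reference, with the DMA_FQ decision moved out of the allocation path, the collapsed test Robin describes would look roughly like this (a sketch only; UNMANAGED, DMA and DMA_FQ all carry the __IOMMU_DOMAIN_PAGING flag, while IDENTITY and BLOCKED do not):

static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
						 struct device *dev,
						 unsigned int type)
{
	struct iommu_domain *domain;

	if (type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
		return ops->identity_domain;

	/* Any paging domain type can come from domain_alloc_paging() */
	if ((type & __IOMMU_DOMAIN_PAGING) && ops->domain_alloc_paging)
		domain = ops->domain_alloc_paging(dev);
	else if (ops->domain_alloc)
		domain = ops->domain_alloc(type);
	else
		return NULL;

	if (!domain)
		return NULL;
	/* ... remainder of the initialization as in the patch above ... */
	return domain;
}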
Re: [PATCH v2 08/25] iommu: Allow an IDENTITY domain as the default_domain in ARM32
On 2023-05-16 01:00, Jason Gunthorpe wrote:
Even though dma-iommu.c and CONFIG_ARM_DMA_USE_IOMMU do approximately the same stuff, the way they relate to the IOMMU core is quite different.

dma-iommu.c expects the core code to setup an UNMANAGED domain (of type IOMMU_DOMAIN_DMA) and then configures itself to use that domain. This becomes the default_domain for the group.

ARM_DMA_USE_IOMMU does not use the default_domain, instead it directly allocates an UNMANAGED domain and operates it just like an external driver. In this case group->default_domain is NULL.

If the driver provides a global static identity_domain then automatically use it as the default_domain when in ARM_DMA_USE_IOMMU mode. This allows drivers that implemented default_domain == NULL as an IDENTITY translation to trivially get a properly labeled non-NULL default_domain on ARM32 configs.

With this arrangement when ARM_DMA_USE_IOMMU wants to disconnect from the device the normal detach_domain flow will restore the IDENTITY domain as the default domain. Overall this makes attach_dev() of the IDENTITY domain called in the same places as detach_dev().

This effectively migrates these drivers to default_domain mode. For drivers that support ARM64 they will gain support for the IDENTITY translation mode for the dma_api and behave in a uniform way.

Drivers use this by setting ops->identity_domain to a static singleton iommu_domain that implements the identity attach. If the core detects ARM_DMA_USE_IOMMU mode then it automatically attaches the IDENTITY domain during probe.

Drivers can continue to prevent the use of DMA translation by returning IOMMU_DOMAIN_IDENTITY from def_domain_type, this will completely prevent IOMMU_DMA from running but will not impact ARM_DMA_USE_IOMMU.

This allows removing the set_platform_dma_ops() from every remaining driver.

Remove the set_platform_dma_ops from rockchip and mkt_v1 as all it does is set an existing global static identity domain. mkt_v1 does not support IOMMU_DOMAIN_DMA and it does not compile on ARM64 so this transformation is safe.

Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/iommu.c          | 40 +-
 drivers/iommu/mtk_iommu_v1.c   | 12 --
 drivers/iommu/rockchip-iommu.c | 10 -
 3 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8ba90571449cec..bed7cb6e5ee65b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1757,18 +1757,48 @@ static int iommu_get_default_domain_type(struct iommu_group *group,
 	int type;

 	lockdep_assert_held(&group->mutex);
+
+	/*
+	 * ARM32 drivers supporting CONFIG_ARM_DMA_USE_IOMMU can declare an
+	 * identity_domain and it will automatically become their default
+	 * domain. Later on ARM_DMA_USE_IOMMU will install its UNMANAGED domain.
+	 * Override the selection to IDENTITY if we are sure the driver supports
+	 * it.
+	 */
+	if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) && ops->identity_domain) {

If I cared about arm-smmu on 32-bit, I'd bring that up again, but honestly I'm not sure that I do... I think it might end up working after patch #21, and it's currently still broken for lack of .set_platform_dma anyway, so meh.

+		type = IOMMU_DOMAIN_IDENTITY;
+		if (best_type && type && best_type != type)
+			goto err;
+		best_type = target_type = IOMMU_DOMAIN_IDENTITY;
+	}
+
 	for_each_group_device(group, gdev) {
 		type = best_type;
 		if (ops->def_domain_type) {
 			type = ops->def_domain_type(gdev->dev);
-			if (best_type && type && best_type != type)
+			if (best_type && type && best_type != type) {
+				/* Stick with the last driver override we saw */
+				best_type = type;
 				goto err;
+			}
 		}

 		if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted) {
-			type = IOMMU_DOMAIN_DMA;
-			if (best_type && type && best_type != type)
-				goto err;
+			/*
+			 * We don't have any way for the iommu core code to
+			 * force arm_iommu to activate so we can't enforce
+			 * trusted. Log it and keep going with the IDENTITY
+			 * default domain.
+			 */
+			if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)) {
+				dev_warn(
+					gdev->dev,
+					"PCI device is untrusted but ARM32 does not support secure IOMMU operation, continuing anyway.\n");

To within experimental error, this is dead code. The ARM DMA ops don't
Re: [PATCH v2 04/25] iommu: Add IOMMU_DOMAIN_PLATFORM for S390
On 2023-05-16 01:00, Jason Gunthorpe wrote: The PLATFORM domain will be set as the default domain and attached as normal during probe. The driver will ignore the initial attach from a NULL domain to the PLATFORM domain. After this, the PLATFORM domain's attach_dev will be called whenever we detach from an UNMANAGED domain (eg for VFIO). This is the same time the original design would have called op->detach_dev(). This is temporary until the S390 dma-iommu.c conversion is merged. If we do need a stopgap here, can we please just call the current situation an identity domain? It's true enough in the sense that the IOMMU API is not offering any translation or guarantee of isolation, so the semantics of an identity domain - from the point of view of anything inside the IOMMU API that would be looking - are no weaker or less useful than a "platform" domain whose semantics are intentionally unknown. Then similarly for patch #3 - since we already know s390 is temporary, it seems an anathema to introduce a whole domain type with its own weird ops->default_domain mechanism solely for POWER to not actually use domains with. In terms of reasoning, I don't see that IOMMU_DOMAIN_PLATFORM is any more useful than a NULL default domain, it just renames the problem, and gives us more code to maintain for the privilege. As I say, though, we don't actually need to juggle the semantic of a "we don't know what's happening here" domain around any further, since it works out that a "we're not influencing anything here" domain actually suffices for what we want to reason about, and those are already well-defined. Sure, the platform DMA ops *might* be doing more, but that's beyond the scope of the IOMMU API either way. At that point, lo and behold, s390 and POWER now look just like ARM and the core code only needs a single special case for arch-specific default identity domains, lovely! Thanks, Robin. 
Tested-by: Heiko Stuebner
Tested-by: Niklas Schnelle
Signed-off-by: Jason Gunthorpe
---
 drivers/iommu/s390-iommu.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index fbf59a8db29b11..f0c867c57a5b9b 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -142,14 +142,31 @@ static int s390_iommu_attach_device(struct iommu_domain *domain,
 	return 0;
 }

-static void s390_iommu_set_platform_dma(struct device *dev)
+/*
+ * Switch control over the IOMMU to S390's internal dma_api ops
+ */
+static int s390_iommu_platform_attach(struct iommu_domain *platform_domain,
+				      struct device *dev)
 {
 	struct zpci_dev *zdev = to_zpci_dev(dev);

+	if (!zdev->s390_domain)
+		return 0;
+
 	__s390_iommu_detach_device(zdev);
 	zpci_dma_init_device(zdev);
+	return 0;
 }

+static struct iommu_domain_ops s390_iommu_platform_ops = {
+	.attach_dev = s390_iommu_platform_attach,
+};
+
+static struct iommu_domain s390_iommu_platform_domain = {
+	.type = IOMMU_DOMAIN_PLATFORM,
+	.ops = &s390_iommu_platform_ops,
+};
+
 static void s390_iommu_get_resv_regions(struct device *dev,
 					struct list_head *list)
 {
@@ -428,12 +445,12 @@ void zpci_destroy_iommu(struct zpci_dev *zdev)
 }

 static const struct iommu_ops s390_iommu_ops = {
+	.default_domain = &s390_iommu_platform_domain,
 	.capable = s390_iommu_capable,
 	.domain_alloc = s390_domain_alloc,
 	.probe_device = s390_iommu_probe_device,
 	.release_device = s390_iommu_release_device,
 	.device_group = generic_device_group,
-	.set_platform_dma_ops = s390_iommu_set_platform_dma,
 	.pgsize_bitmap = SZ_4K,
 	.get_resv_regions = s390_iommu_get_resv_regions,
 	.default_domain_ops = &(const struct iommu_domain_ops) {
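A sketch of the shape Robin is suggesting instead: the exact same stopgap attach handler, but presented as an identity domain so no new PLATFORM type or ops->default_domain mechanism is needed (illustrative only, not a submitted patch):

static int s390_iommu_identity_attach(struct iommu_domain *identity_domain,
				      struct device *dev)
{
	struct zpci_dev *zdev = to_zpci_dev(dev);

	/* Hand control back to s390's internal dma_api ops, as before */
	if (zdev->s390_domain) {
		__s390_iommu_detach_device(zdev);
		zpci_dma_init_device(zdev);
	}
	return 0;
}

static struct iommu_domain_ops s390_iommu_identity_ops = {
	.attach_dev = s390_iommu_identity_attach,
};

static struct iommu_domain s390_iommu_identity_domain = {
	.type = IOMMU_DOMAIN_IDENTITY,
	.ops = &s390_iommu_identity_ops,
};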
Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper
On 31/03/2023 3:00 pm, Arnd Bergmann wrote: On Mon, Mar 27, 2023, at 14:48, Robin Murphy wrote: On 2023-03-27 13:13, Arnd Bergmann wrote: [ HELP NEEDED: can anyone confirm that it is a correct assumption on arm that a cache-coherent device writing to a page always results in it being in a PG_dcache_clean state like on ia64, or can a device write directly into the dcache?] In AMBA at least, if a snooping write hits in a cache then the data is most likely going to get routed directly into that cache. If it has write-back write-allocate attributes it could also land in any cache along its normal path to RAM; it wouldn't have to go all the way. Hence all the fun we have where treating a coherent device as non-coherent can still be almost as broken as the other way round :) Ok, thanks for the information. I'm still not sure whether this can result in the situation where PG_dcache_clean is wrong though. Specifically, the question is whether a DMA to a coherent buffer can end up in a dirty L1 dcache of one core and require to write back the dcache before invalidating the icache for that page. On ia64, this is not the case, the optimization here is to only flush the icache after a coherent DMA into an executable user page, while Arm only does this for noncoherent DMA but not coherent DMA. From your explanation it sounds like this might happen, even though that would mean that "coherent" DMA is slightly less coherent than it is elsewhere. To be on the safe side, I'd have to pass a flag into arch_dma_mark_clean() about coherency, to let the arm implementation still require the extra dcache flush for coherent DMA, while ia64 can ignore that flag. Coherent DMA on Arm is assumed to be inner-shareable, so a coherent DMA write should be pretty much equivalent to a coherent write by another CPU (or indeed the local CPU itself) - nothing says that it *couldn't* dirty a line in a data cache above the level of unification, so in general the assumption must be that, yes, if coherent DMA is writing data intended to be executable, then it's going to want a Dcache clean to PoU and an Icache invalidate to PoU before trying to execute it. By comparison, a non-coherent DMA transfer will inherently have to invalidate the Dcache all the way to PoC in its dma_unmap, thus cannot leave dirty data above the PoU, so only the Icache maintenance is required in the executable case. (FWIW I believe the Armv8 IDC/DIC features can safely be considered irrelevant to 32-bit kernels) I don't know a great deal about IA-64, but it appears to be using its PG_arch_1 flag in a subtly different manner to Arm, namely to optimise out the *Icache* maintenance. So if anything, it seems IA-64 is the weirdo here (who'd have guessed?) where DMA manages to be *more* coherent than the CPUs themselves :) This is all now making me think we need some careful consideration of whether the benefits of consolidating code outweigh the confusion of conflating multiple different meanings of "clean" together... Thanks, Robin.
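A sketch of the flag Arnd proposes passing into arch_dma_mark_clean(): Arm would then decline to mark pages clean for coherent DMA (so the extra Dcache flush to the PoU still happens before any Icache invalidate), while ia64 could simply ignore the argument. The extended signature is purely illustrative, not an agreed interface:

/* Arm-side sketch: only noncoherent DMA is known clean past the PoU */
void arch_dma_mark_clean(phys_addr_t paddr, size_t size, bool dma_coherent)
{
	/*
	 * A coherent write may still have dirtied lines above the PoU,
	 * so it cannot safely be recorded as PG_dcache_clean here.
	 */
	if (dma_coherent)
		return;

	/* ... set PG_dcache_clean on each fully-covered page as before ... */
}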
Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper
On 2023-03-27 13:13, Arnd Bergmann wrote:
From: Arnd Bergmann

The arm version of the arch_sync_dma_for_cpu() function annotates pages as PG_dcache_clean after a DMA, but no other architecture does this here. On ia64, the same thing is done in arch_sync_dma_for_cpu(), so it makes sense to use the same hook in order to have identical arch_sync_dma_for_cpu() semantics as all other architectures.

Splitting this out has multiple effects:

 - for dma-direct, this now gets called after arch_sync_dma_for_cpu() for DMA_FROM_DEVICE mappings, but not for DMA_BIDIRECTIONAL. While it would not be harmful to keep doing it for bidirectional mappings, those are apparently not used in any callers that care about the flag.

 - Since arm has its own dma-iommu abstraction, this now also needs to call the same function, so the calls are added there to mirror the dma-direct version.

 - Like dma-direct, the dma-iommu version now marks the dcache clean for both coherent and noncoherent devices after a DMA, but it only does this for DMA_FROM_DEVICE, not DMA_BIDIRECTIONAL.

[ HELP NEEDED: can anyone confirm that it is a correct assumption on arm that a cache-coherent device writing to a page always results in it being in a PG_dcache_clean state like on ia64, or can a device write directly into the dcache?]

In AMBA at least, if a snooping write hits in a cache then the data is most likely going to get routed directly into that cache. If it has write-back write-allocate attributes it could also land in any cache along its normal path to RAM; it wouldn't have to go all the way. Hence all the fun we have where treating a coherent device as non-coherent can still be almost as broken as the other way round :)

Cheers,
Robin.

Signed-off-by: Arnd Bergmann
---
 arch/arm/Kconfig          |  1 +
 arch/arm/mm/dma-mapping.c | 71 +++
 2 files changed, 43 insertions(+), 29 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e24a9820e12f..125d58c54ab1 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -7,6 +7,7 @@ config ARM
 	select ARCH_HAS_BINFMT_FLAT
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VIRTUAL if MMU
+	select ARCH_HAS_DMA_MARK_CLEAN if MMU
 	select ARCH_HAS_DMA_WRITE_COMBINE if !ARM_DMA_MEM_BUFFERABLE
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_FORTIFY_SOURCE
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index cc702cb27ae7..b703cb83d27e 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -665,6 +665,28 @@ static void dma_cache_maint(phys_addr_t paddr,
 	} while (left);
 }

+/*
+ * Mark the D-cache clean for these pages to avoid extra flushing.
+ */
+void arch_dma_mark_clean(phys_addr_t paddr, size_t size)
+{
+	unsigned long pfn = PFN_UP(paddr);
+	unsigned long off = paddr & (PAGE_SIZE - 1);
+	size_t left = size;
+
+	if (size < PAGE_SIZE)
+		return;
+
+	if (off)
+		left -= PAGE_SIZE - off;
+
+	while (left >= PAGE_SIZE) {
+		struct page *page = pfn_to_page(pfn++);
+		set_bit(PG_dcache_clean, &page->flags);
+		left -= PAGE_SIZE;
+	}
+}
+
 static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
 {
 	if (IS_ENABLED(CONFIG_CPU_V6) ||
@@ -715,24 +737,6 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
 		outer_inv_range(paddr, paddr + size);
 		dma_cache_maint(paddr, size, dmac_inv_range);
 	}
-
-	/*
-	 * Mark the D-cache clean for these pages to avoid extra flushing.
-	 */
-	if (dir != DMA_TO_DEVICE && size >= PAGE_SIZE) {
-		unsigned long pfn = PFN_UP(paddr);
-		unsigned long off = paddr & (PAGE_SIZE - 1);
-		size_t left = size;
-
-		if (off)
-			left -= PAGE_SIZE - off;
-
-		while (left >= PAGE_SIZE) {
-			struct page *page = pfn_to_page(pfn++);
-			set_bit(PG_dcache_clean, &page->flags);
-			left -= PAGE_SIZE;
-		}
-	}
 }

 #ifdef CONFIG_ARM_DMA_USE_IOMMU
@@ -1294,6 +1298,17 @@ static int arm_iommu_map_sg(struct device *dev, struct scatterlist *sg,
 	return -EINVAL;
 }

+static void arm_iommu_sync_dma_for_cpu(phys_addr_t phys, size_t len,
+				       enum dma_data_direction dir,
+				       bool dma_coherent)
+{
+	if (!dma_coherent)
+		arch_sync_dma_for_cpu(phys, len, dir);
+
+	if (dir == DMA_FROM_DEVICE)
+		arch_dma_mark_clean(phys, len);
+}
+
 /**
  * arm_iommu_unmap_sg - unmap a set of SG buffers mapped by dma_map_sg
  * @dev: valid struct device pointer
@@ -1316,8 +1331,9 @@ static void arm_iommu_unmap_sg(struct device *dev,
 	if
Re: [PATCH 3/3] of: address: Use dma_default_coherent to determine default coherency
On 2023-02-22 13:37, Jiaxun Yang wrote:
As for now all arches have dma_default_coherent matched with default DMA coherency for of devices, so there is no need to have a standalone config option.

This also fixes a case that for some MIPS platforms, coherency information is not carried in devicetree and kernel will override dma_default_coherent at early boot.

Note for PowerPC: CONFIG_OF_DMA_DEFAULT_COHERENT was only selected when CONFIG_NOT_COHERENT_CACHE is false, in this case dma_default_coherent will be true, so it still matches present behavior.

Note for RISC-V: dma_default_coherent is set to true at init code in this series.

OK, so the fundamental problem here is that we have two slightly different conflicting mechanisms, the ex-PowerPC config option, and the ex-MIPS dma_default_coherent for which of_dma_is_coherent() has apparently been broken forever. I'd agree that it's worth consolidating the two, but please separate out the fix as below, so it's feasible to backport without having to muck about in arch code.

Signed-off-by: Jiaxun Yang
---
 arch/powerpc/Kconfig | 1 -
 arch/riscv/Kconfig   | 1 -
 drivers/of/Kconfig   | 4 ----
 drivers/of/address.c | 2 +-
 4 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2c9cdf1d8761..c67e5da714f7 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -272,7 +272,6 @@ config PPC
 	select NEED_PER_CPU_PAGE_FIRST_CHUNK if PPC64
 	select NEED_SG_DMA_LENGTH
 	select OF
-	select OF_DMA_DEFAULT_COHERENT if !NOT_COHERENT_CACHE
 	select OF_EARLY_FLATTREE
 	select OLD_SIGACTION if PPC32
 	select OLD_SIGSUSPEND
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 1d46a268ce16..406c6816d289 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -119,7 +119,6 @@ config RISCV
 	select MODULES_USE_ELF_RELA if MODULES
 	select MODULE_SECTIONS if MODULES
 	select OF
-	select OF_DMA_DEFAULT_COHERENT
 	select OF_EARLY_FLATTREE
 	select OF_IRQ
 	select PCI_DOMAINS_GENERIC if PCI
diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index 644386833a7b..e40f10bf2ba4 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -102,8 +102,4 @@ config OF_OVERLAY
 config OF_NUMA
 	bool

-config OF_DMA_DEFAULT_COHERENT
-	# arches should select this if DMA is coherent by default for OF devices
-	bool
-
 endif # OF
diff --git a/drivers/of/address.c b/drivers/of/address.c
index 4c0b169ef9bf..23ade4919853 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -1103,7 +1103,7 @@ phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
 bool of_dma_is_coherent(struct device_node *np)
 {
 	struct device_node *node;
-	bool is_coherent = IS_ENABLED(CONFIG_OF_DMA_DEFAULT_COHERENT);
+	bool is_coherent = dma_default_coherent;

AFAICS, all you should actually need is a single self-contained addition here, something like:

+	/*
+	 * DT-based MIPS doesn't use OF_DMA_DEFAULT_COHERENT, but
+	 * might override the system-wide default at runtime.
+	 */
+#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
+    defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
+    defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
+	is_coherent = dma_default_coherent;
+#endif

 	node = of_node_get(np);

Then *after* that's fixed, we can do a more comprehensive refactoring to merge the two mechanisms properly.

FWIW I think I'd prefer an approach closer to the first one, where config options control the initial value of dma_default_coherent rather than architectures having to override it unconditionally (and TBH I'd also like to have a generic config symbol for whether an arch supports per-device coherency or not).

Thanks,
Robin.
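A sketch of the consolidation Robin describes, where an arch-selected config symbol (name assumed here) only seeds the initial value of the variable, and runtime code such as the MIPS boot path may still override it afterwards:

/* kernel/dma/mapping.c - sketch; symbol name is an assumption */
bool dma_default_coherent = IS_ENABLED(CONFIG_ARCH_DMA_DEFAULT_COHERENT);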
Re: [PATCH 0/7] MIPS DMA coherence fixes
On 2023-02-22 13:04, Jiaxun Yang wrote:

On 22 Feb 2023, at 12:55, Robin Murphy wrote:

On 2023-02-21 19:55, Jiaxun Yang wrote:

On 21 Feb 2023, at 19:46, Robin Murphy wrote:

On 2023-02-21 18:15, Jiaxun Yang wrote:

On 21 Feb 2023, at 17:54, Christoph Hellwig wrote:

Can you explain the motivation here? Also why riscv patches are at the end of a mips fixes series?

Ah sorry for any confusion. So the main purpose of this patch is to fix MIPS's broken per-device coherency. To be more precise, we want to be able to control the default coherency for all devices probed from devicetree in early boot code.

Including the patch which actually does that would be helpful. As it is, patches 4-7 here just appear to be moving an option around for no practical effect.

Well, the effect is that the default coherency of devicetree-probed devices now follows dma_default_coherent instead of a static Kconfig option. For MIPS platforms, dma_default_coherent will be determined by boot code.

"Will be" is the issue I'm getting at. We can't review some future promise of a patch, we can only review actual patches. And it's hard to meaningfully review preparatory patches for some change without the full context of that change.

Actually this is already present in current MIPS platform code. arch/mips/mti-malta is setting dma_default_coherent on boot, and its devicetree does not explicitly specify coherency.

OK, this really needs to be explained much more clearly. I read this series as 3 actual fix patches, then 3 patches adding a new option to replace an existing one on the grounds that it "can be useful" for unspecified purposes, then a final cleanup patch removing the old option that has now been superseded. Going back and looking closely I see there is actually a brief mention in the cleanup patch that it also happens to fix some issue, but even then it doesn't clearly explain what the issue really is or how and why the fix works and is appropriate.

Ideally, functional fixes and cleanup should be in distinct patches whenever that is reasonable. Sometimes the best fix is inherently a cleanup, but in such cases the patch should always be presented as the fix being its primary purpose. Please also use the cover letter to give reviewers an overview of the whole series if it's not merely a set of loosely-related patches that just happened to be convenient to send all together.

I think I do at least now understand the underlying problem well enough to have a think about whether this is the best way to address it.

Thanks,
Robin.
Re: [PATCH 0/7] MIPS DMA coherence fixes
On 2023-02-21 19:55, Jiaxun Yang wrote:

On 21 Feb 2023, at 19:46, Robin Murphy wrote:

On 2023-02-21 18:15, Jiaxun Yang wrote:

On 21 Feb 2023, at 17:54, Christoph Hellwig wrote:

Can you explain the motivation here? Also why riscv patches are at the end of a mips fixes series?

Ah sorry for any confusion. So the main purpose of this patch is to fix MIPS's broken per-device coherency. To be more precise, we want to be able to control the default coherency for all devices probed from devicetree in early boot code.

Including the patch which actually does that would be helpful. As it is, patches 4-7 here just appear to be moving an option around for no practical effect.

Well, the effect is that the default coherency of devicetree-probed devices now follows dma_default_coherent instead of a static Kconfig option. For MIPS platforms, dma_default_coherent will be determined by boot code.

"Will be" is the issue I'm getting at. We can't review some future promise of a patch, we can only review actual patches. And it's hard to meaningfully review preparatory patches for some change without the full context of that change.

Thanks,
Robin.
Re: [PATCH 0/7] MIPS DMA coherence fixes
On 2023-02-21 18:15, Jiaxun Yang wrote:

On 21 Feb 2023, at 17:54, Christoph Hellwig wrote:

Can you explain the motivation here? Also why riscv patches are at the end of a mips fixes series?

Ah sorry for any confusion. So the main purpose of this patch is to fix MIPS's broken per-device coherency. To be more precise, we want to be able to control the default coherency for all devices probed from devicetree in early boot code.

Including the patch which actually does that would be helpful. As it is, patches 4-7 here just appear to be moving an option around for no practical effect.

Robin.

To achieve that I decided to reuse dma_default_coherent to set the default coherency for devicetree. And all later patches are serving this purpose.

Thanks
- Jiaxun
Re: [PATCH 4/7] dma-mapping: Always provide dma_default_coherent
On 2023-02-21 17:58, Christoph Hellwig wrote:

On Tue, Feb 21, 2023 at 12:46:10PM +0000, Jiaxun Yang wrote:

dma_default_coherent can be useful for determining default coherency even on arches without noncoherent support.

How?

Indeed, "default" is conceptually meaningless when there is no possible alternative :/

Robin.
Re: [PATCH kernel] powerpc/iommu: Add simple iommu_ops to report capabilities
On 2022-07-05 07:22, Alexey Kardashevskiy wrote:
Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64.

This adds a simple iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides minimum code for the probing to not crash.

No more bus_set_iommu() please - I'll be sending a new version of this series very soon:

https://lore.kernel.org/linux-iommu/cover.1650890638.git.robin.mur...@arm.com/

Cheers,
Robin.

The previous discussion is here:
https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-...@ozlabs.ru/

Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence")
Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices")
Signed-off-by: Alexey Kardashevskiy
---
I have not looked into the domains for ages, what is missing here? With this on top of 5.19-rc1 VFIO works again on my POWER9 box.

Thanks,
---
 arch/powerpc/include/asm/iommu.h |  2 +
 arch/powerpc/kernel/iommu.c      | 70
 arch/powerpc/kernel/pci_64.c     |  3 ++
 3 files changed, 75 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7e29c73e3dd4..4bdae0ee29d0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -215,6 +215,8 @@ extern long iommu_tce_xchg_no_kill(struct mm_struct *mm,
 		enum dma_data_direction *direction);
 extern void iommu_tce_kill(struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages);
+
+extern const struct iommu_ops spapr_tce_iommu_ops;
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 7e56ddb3e0b9..2205b448f7d5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1176,4 +1176,74 @@ void iommu_del_device(struct device *dev)
 	iommu_group_remove_device(dev);
 }
 EXPORT_SYMBOL_GPL(iommu_del_device);
+
+/*
+ * A simple iommu_ops to allow less cruft in generic VFIO code.
+ */
+static bool spapr_tce_iommu_capable(enum iommu_cap cap)
+{
+	switch (cap) {
+	case IOMMU_CAP_CACHE_COHERENCY:
+		return true;
+	default:
+		break;
+	}
+
+	return false;
+}
+
+static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type)
+{
+	struct iommu_domain *domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+
+	if (!domain)
+		return NULL;
+
+	domain->geometry.aperture_start = 0;
+	domain->geometry.aperture_end = ~0ULL;
+	domain->geometry.force_aperture = true;
+
+	return domain;
+}
+
+static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev)
+{
+	struct iommu_device *iommu_dev = kzalloc(sizeof(struct iommu_device), GFP_KERNEL);
+
+	iommu_dev->dev = dev;
+	iommu_dev->ops = &spapr_tce_iommu_ops;
+
+	return iommu_dev;
+}
+
+static void spapr_tce_iommu_release_device(struct device *dev)
+{
+}
+
+static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
+				      struct device *dev)
+{
+	return 0;
+}
+
+static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev)
+{
+	struct iommu_group *grp = dev->iommu_group;
+
+	if (!grp)
+		grp = ERR_PTR(-ENODEV);
+	return grp;
+}
+
+const struct iommu_ops spapr_tce_iommu_ops = {
+	.capable = spapr_tce_iommu_capable,
+	.domain_alloc = spapr_tce_iommu_domain_alloc,
+	.probe_device = spapr_tce_iommu_probe_device,
+	.release_device = spapr_tce_iommu_release_device,
+	.device_group = spapr_tce_iommu_device_group,
+	.default_domain_ops = &(const struct iommu_domain_ops) {
+		.attach_dev = spapr_tce_iommu_attach_dev,
+	}
+};
+
 #endif /* CONFIG_IOMMU_API */
diff --git a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
index 19b03ddf5631..04bc0c52e45c 100644
--- a/arch/powerpc/kernel/pci_64.c
+++ b/arch/powerpc/kernel/pci_64.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -27,6 +28,7 @@
 #include
 #include
 #include
+#include

 /* pci_io_base -- the base address from which io bars are offsets.
  * This is the lowest I/O base address (so bar values are always positive),
@@ -69,6 +71,7 @@ static int __init pcibios_init(void)
 		ppc_md.pcibios_fixup();

 	printk(KERN_DEBUG "PCI: Probing PCI hardware done\n");
+	bus_set_iommu(&pci_bus_type, &spapr_tce_iommu_ops);

 	return 0;
 }
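For reference, the direction Robin's linked series takes is per-instance registration rather than per-bus: each IOMMU instance is published with iommu_device_register(), and the core handles the bus wiring itself. A hedged sketch of what the pseries side might call instead of bus_set_iommu() (the wrapper and its arguments are hypothetical; only the core API call is real):

static int example_spapr_tce_register(struct iommu_device *iommu_dev,
				      struct device *hwdev)
{
	/* hwdev would be the device representing the IOMMU instance itself */
	return iommu_device_register(iommu_dev, &spapr_tce_iommu_ops, hwdev);
}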
Re: [PATCH v2 2/3] scsi: BusLogic remove bus_to_virt
On 2022-06-17 13:57, Arnd Bergmann wrote:

From: Arnd Bergmann

The BusLogic driver is the last remaining driver that relies on the deprecated bus_to_virt() function, which in turn only works on a few architectures, and is incompatible with both swiotlb and iommu support.

Before commit 391e2f25601e ("[SCSI] BusLogic: Port driver to 64-bit."), the driver had a dependency on x86-32, presumably because of this problem. However, the change introduced another bug that made it still impossible to use the driver on any 64-bit machine. This was in turn fixed in commit 56f396146af2 ("scsi: BusLogic: Fix 64-bit system enumeration error for Buslogic"), 8 years later, which shows that there are not a lot of users.

Maciej is still using the driver on 32-bit hardware, and Khalid mentioned that the driver works with the device emulation used in VirtualBox and VMware. Both of those only emulate it for Windows 2000 and older operating systems that did not ship with the better LSI logic driver.

Do a minimum fix that searches through the list of descriptors to find one that matches the bus address. This is clearly as inefficient as was indicated in the code comment about the lack of a bus_to_virt() replacement. A better fix would likely involve changing out the entire descriptor allocation for a simpler one, but that would be much more invasive.

FWIW, if efficiency *is* a practical concern, even under the current allocation scheme it looks like there are only 4 actual DMA allocations to search through. From a quick scan (since it's too hot here not to get distracted...), if I wanted to optimise this in future I'd probably remove the semi-redundant allocgrp_* fields from struct blogic_ccb and hang a dedicated list of the block allocations off the adapter - at that point the lookup could likely already be more efficient than a theoretical dma_to_virt() interface would be if it had to go off and walk an IOMMU pagetable. Then the next question would be whether it's viable to make a single 32KB allocation rather than 4*8KB, so it's no longer even a list.

For now, though, I agree with the simple change that's clear and easy to reason about:

Reviewed-by: Robin Murphy

Thanks,
Robin.

Cc: Maciej W. Rozycki
Cc: Matt Wang
Cc: Khalid Aziz
Signed-off-by: Arnd Bergmann
---
 drivers/scsi/BusLogic.c | 27
 drivers/scsi/Kconfig    |  2 +-
 2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/BusLogic.c b/drivers/scsi/BusLogic.c
index a897c8f914cf..d057abfcdd5c 100644
--- a/drivers/scsi/BusLogic.c
+++ b/drivers/scsi/BusLogic.c
@@ -2515,12 +2515,26 @@ static int blogic_resultcode(struct blogic_adapter *adapter,
 	return (hoststatus << 16) | tgt_status;
 }

+/*
+ * turn the dma address from an inbox into a ccb pointer
+ * This is rather inefficient.
+ */
+static struct blogic_ccb *
+blogic_inbox_to_ccb(struct blogic_adapter *adapter, struct blogic_inbox *inbox)
+{
+	struct blogic_ccb *ccb;
+
+	for (ccb = adapter->all_ccbs; ccb; ccb = ccb->next_all)
+		if (inbox->ccb == ccb->dma_handle)
+			break;
+
+	return ccb;
+}

 /*
   blogic_scan_inbox scans the Incoming Mailboxes saving any
   Incoming Mailbox entries for completion processing.
 */
-
 static void blogic_scan_inbox(struct blogic_adapter *adapter)
 {
 	/*
@@ -2540,16 +2554,7 @@ static void blogic_scan_inbox(struct blogic_adapter *adapter)
 	enum blogic_cmplt_code comp_code;

 	while ((comp_code = next_inbox->comp_code) != BLOGIC_INBOX_FREE) {
-		/*
-		   We are only allowed to do this because we limit our
-		   architectures we run on to machines where bus_to_virt(
-		   actually works.  There *needs* to be a dma_addr_to_virt()
-		   in the new PCI DMA mapping interface to replace
-		   bus_to_virt() or else this code is going to become very
-		   innefficient.
-		 */
-		struct blogic_ccb *ccb =
-			(struct blogic_ccb *) bus_to_virt(next_inbox->ccb);
+		struct blogic_ccb *ccb = blogic_inbox_to_ccb(adapter, adapter->next_inbox);
 		if (comp_code != BLOGIC_CMD_NOTFOUND) {
 			if (ccb->status == BLOGIC_CCB_ACTIVE ||
 			    ccb->status == BLOGIC_CCB_RESET) {
diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index cf75588a2587..56bdc08d0b77 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -513,7 +513,7 @@ config SCSI_HPTIOP

 config SCSI_BUSLOGIC
 	tristate "BusLogic SCSI support"
-	depends on PCI && SCSI && VIRT_TO_BUS
+	depends on PCI && SCSI
 	help
 	  This is support for BusLogic MultiMaster and FlashPoint SCSI Host
 	  Adapters. Consult the SCSI-HOWTO, available from
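Note that blogic_inbox_to_ccb() falls off the end of the list and returns NULL if no CCB matches the mailbox's DMA address, and the caller goes on to dereference ccb->status. A caller-side guard along these lines (hypothetical, not part of the posted patch) would keep a stale or corrupted mailbox entry from becoming a NULL dereference:

		struct blogic_ccb *ccb =
			blogic_inbox_to_ccb(adapter, adapter->next_inbox);

		if (!ccb) {
			/*
			 * Hypothetical hardening: ignore mailbox entries
			 * whose DMA address matches no known CCB instead
			 * of crashing on them.
			 */
			blogic_warn("Unknown CCB in Incoming Mailbox\n",
				    adapter);
		} else if (comp_code != BLOGIC_CMD_NOTFOUND) {
			/* existing completion handling continues here */
		}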
Re: [PATCH 08/14] arm64: simplify access_ok()
On 2022-02-14 16:34, Arnd Bergmann wrote:

From: Arnd Bergmann

arm64 has an inline asm implementation of access_ok() that is derived from the 32-bit arm version and optimized for the case that both the limit and the size are variable. With set_fs() gone, the limit is always constant, and the size usually is as well, so just using the default implementation reduces the check into a comparison against a constant that can be scheduled by the compiler.

Aww, I still vividly remember the birth of this madness, sat with my phone on a Saturday morning waiting for my bike to be MOT'd, staring at the 7-instruction sequence that Mark and I had come up with and certain that it could be shortened still. Kinda sad to see it go, but at the same time, glad that it can.

Acked-by: Robin Murphy

On a defconfig build, this saves over 28KB of .text.

Not to mention saving those "WTF is going on there... oh yeah, access_ok()" moments when looking through disassembly :)

Cheers,
Robin.

Signed-off-by: Arnd Bergmann
---
 arch/arm64/include/asm/uaccess.h | 28 +---
 1 file changed, 5 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 357f7bd9c981..e8dce0cc5eaa 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -26,6 +26,8 @@
 #include
 #include

+static inline int __access_ok(const void __user *ptr, unsigned long size);
+
 /*
  * Test whether a block of memory is a valid user space address.
  * Returns 1 if the range is valid, 0 otherwise.
@@ -33,10 +35,8 @@
  * This is equivalent to the following test:
  * (u65)addr + (u65)size <= (u65)TASK_SIZE_MAX
  */
-static inline unsigned long __access_ok(const void __user *addr, unsigned long size)
+static inline int access_ok(const void __user *addr, unsigned long size)
 {
-	unsigned long ret, limit = TASK_SIZE_MAX - 1;
-
 	/*
 	 * Asynchronous I/O running in a kernel thread does not have the
 	 * TIF_TAGGED_ADDR flag of the process owning the mm, so always untag
@@ -46,27 +46,9 @@ static inline unsigned long __access_ok(const void __user *addr, unsigned long s
 	    (current->flags & PF_KTHREAD || test_thread_flag(TIF_TAGGED_ADDR)))
 		addr = untagged_addr(addr);

-	__chk_user_ptr(addr);
-	asm volatile(
-	// A + B <= C + 1 for all A,B,C, in four easy steps:
-	// 1: X = A + B; X' = X % 2^64
-	"	adds	%0, %3, %2\n"
-	// 2: Set C = 0 if X > 2^64, to guarantee X' > C in step 4
-	"	csel	%1, xzr, %1, hi\n"
-	// 3: Set X' = ~0 if X >= 2^64. For X == 2^64, this decrements X'
-	//    to compensate for the carry flag being set in step 4. For
-	//    X > 2^64, X' merely has to remain nonzero, which it does.
-	"	csinv	%0, %0, xzr, cc\n"
-	// 4: For X < 2^64, this gives us X' - C - 1 <= 0, where the -1
-	//    comes from the carry in being clear. Otherwise, we are
-	//    testing X' - C == 0, subject to the previous adjustments.
-	"	sbcs	xzr, %0, %1\n"
-	"	cset	%0, ls\n"
-	: "=&r" (ret), "+r" (limit) : "Ir" (size), "0" (addr) : "cc");
-
-	return ret;
+	return likely(__access_ok(addr, size));
 }
-#define __access_ok __access_ok
+#define access_ok access_ok

 #include
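For context, the generic implementation the arm64 version now falls back to is essentially a two-comparison range check against a constant limit. A sketch of the asm-generic logic being relied on (paraphrased, not a verbatim copy of the header):

static inline int __access_ok(const void __user *ptr, unsigned long size)
{
	unsigned long limit = TASK_SIZE_MAX;
	unsigned long addr = (unsigned long)ptr;

	/* Both comparisons fold to constants when size is known at compile time. */
	return (size <= limit) && (addr <= (limit - size));
}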
Re: [PATCH] soc: fsl: dpio: protect smp_processor_id when get processor id
On 2021-10-15 07:36, meng...@windriver.com wrote:

From: Meng Li

When debug kernel configs are enabled, there will be a calltrace as below:

BUG: using smp_processor_id() in preemptible [] code: swapper/0/1
caller is debug_smp_processor_id+0x20/0x30
CPU: 6 PID: 1 Comm: swapper/0 Not tainted 5.10.63-yocto-standard #1
Hardware name: NXP Layerscape LX2160ARDB (DT)
Call trace:
 dump_backtrace+0x0/0x1a0
 show_stack+0x24/0x30
 dump_stack+0xf0/0x13c
 check_preemption_disabled+0x100/0x110
 debug_smp_processor_id+0x20/0x30
 dpaa2_io_query_fq_count+0xdc/0x154
 dpaa2_eth_stop+0x144/0x314
 __dev_close_many+0xdc/0x160
 __dev_change_flags+0xe8/0x220
 dev_change_flags+0x30/0x70
 ic_close_devs+0x50/0x78
 ip_auto_config+0xed0/0xf10
 do_one_initcall+0xac/0x460
 kernel_init_freeable+0x30c/0x378
 kernel_init+0x20/0x128
 ret_from_fork+0x10/0x38

Because smp_processor_id() should be invoked with preemption disabled, add preempt_disable()/enable() to protect smp_processor_id().

If preemption doesn't matter anyway, as the comment in the context implies, then it probably makes more sense just to use raw_smp_processor_id() instead.

Robin.

Fixes: c89105c9b390 ("staging: fsl-mc: Move DPIO from staging to drivers/soc/fsl")
Cc: sta...@vger.kernel.org
Signed-off-by: Meng Li
---
 drivers/soc/fsl/dpio/dpio-service.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/soc/fsl/dpio/dpio-service.c b/drivers/soc/fsl/dpio/dpio-service.c
index 19f47ea9dab0..afc3b89b0fc5 100644
--- a/drivers/soc/fsl/dpio/dpio-service.c
+++ b/drivers/soc/fsl/dpio/dpio-service.c
@@ -58,8 +58,11 @@ static inline struct dpaa2_io *service_select_by_cpu(struct dpaa2_io *d,
 	 * If cpu == -1, choose the current cpu, with no guarantees about
 	 * potentially being migrated away.
 	 */
-	if (cpu < 0)
-		cpu = smp_processor_id();
+	if (cpu < 0) {
+		preempt_disable();
+		cpu = smp_processor_id();
+		preempt_enable();
+	}

 	/* If a specific cpu was requested, pick it up immediately */
 	return dpio_by_cpu[cpu];
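Following Robin's suggestion, the hunk could skip the preempt dance entirely, since the chosen DPIO remains valid even if the thread migrates immediately afterwards. A sketch of that alternative:

	/*
	 * Preemption is irrelevant here: if the thread migrates right
	 * after this, the DPIO it picked is still perfectly usable, it
	 * just may no longer be the local one.
	 */
	if (cpu < 0)
		cpu = raw_smp_processor_id();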
Re: [PATCH] swiotlb: set IO TLB segment size via cmdline
On 2021-09-17 10:36, Roman Skakun wrote:

Hi, Christoph

I use Xen PV display. In my case, the PV display backend (Dom0) allocates a contiguous buffer via the DMA API to implement zero-copy between Dom0 and DomU.

Well, something's gone badly wrong there - if you have to shadow the entire thing in a bounce buffer to import it then it's hardly zero-copy, is it? If you want to do buffer sharing the buffer really needs to be allocated appropriately to begin with, such that all relevant devices can access it directly. That might be something which needs fixing in Xen.

Robin.

When I start Weston under DomU, I got the next log in Dom0:
```
[ 112.554471] CPU: 0 PID: 367 Comm: weston Tainted: G O 5.10.0-yocto-standard+ #312
[ 112.575149] Call trace:
[ 112.577666]  dump_backtrace+0x0/0x1b0
[ 112.581373]  show_stack+0x18/0x70
[ 112.584746]  dump_stack+0xd0/0x12c
[ 112.588200]  swiotlb_tbl_map_single+0x234/0x360
[ 112.592781]  xen_swiotlb_map_page+0xe4/0x4c0
[ 112.597095]  xen_swiotlb_map_sg+0x84/0x12c
[ 112.601249]  dma_map_sg_attrs+0x54/0x60
[ 112.605138]  vsp1_du_map_sg+0x30/0x60
[ 112.608851]  rcar_du_vsp_map_fb+0x134/0x170
[ 112.613082]  rcar_du_vsp_plane_prepare_fb+0x44/0x64
[ 112.618007]  drm_atomic_helper_prepare_planes+0xac/0x160
[ 112.623362]  drm_atomic_helper_commit+0x88/0x390
[ 112.628029]  drm_atomic_nonblocking_commit+0x4c/0x60
[ 112.633043]  drm_mode_atomic_ioctl+0x9a8/0xb0c
[ 112.637532]  drm_ioctl_kernel+0xc4/0x11c
[ 112.641506]  drm_ioctl+0x21c/0x460
[ 112.644967]  __arm64_sys_ioctl+0xa8/0xf0
[ 112.648939]  el0_svc_common.constprop.0+0x78/0x1a0
[ 112.653775]  do_el0_svc+0x24/0x90
[ 112.657148]  el0_svc+0x14/0x20
[ 112.660254]  el0_sync_handler+0x1a4/0x1b0
[ 112.664315]  el0_sync+0x174/0x180
[ 112.668145] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz: 3686400 bytes), total 65536 (slots), used 112 (slots)
```

The problem happens here:
https://elixir.bootlin.com/linux/v5.14.4/source/drivers/gpu/drm/rcar-du/rcar_du_vsp.c#L202

The sgt was created in dma_get_sgtable() by dma_common_get_sgtable() and includes a single page chunk, as shown here:
https://elixir.bootlin.com/linux/v5.14.5/source/kernel/dma/ops_helpers.c#L18

After creating the new sgt, we tried to map it through vsp1_du_map_sg(). Internally, vsp1_du_map_sg() uses ops->map_sg (e.g. xen_swiotlb_map_sg) to perform the mapping.

I realized that the required segment is too big to fit into the default swiotlb segment, so the condition
https://elixir.bootlin.com/linux/latest/source/kernel/dma/swiotlb.c#L474
is always false.

I know that I use a large buffer, but why can't I map this buffer in one chunk?

Thanks!

On Wed, 15 Sep 2021 at 16:53, Christoph Hellwig wrote:

On Wed, Sep 15, 2021 at 03:49:52PM +0200, Jan Beulich wrote:
But the question remains: Why does the framebuffer need to be mapped in a single giant chunk?

More importantly: if you use dynamic dma mappings for your framebuffer you're doing something wrong.
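As a practical aside, a driver can discover up front whether a mapping of a given size risks being split or bounced, rather than finding out from a "swiotlb buffer is full" splat. A hedged sketch using the existing DMA API helper (fb_size is a stand-in for the framebuffer size here):

	/*
	 * swiotlb caps a single bounce at IO_TLB_SIZE * IO_TLB_SEGSIZE
	 * bytes (256KB with the defaults), which is what
	 * dma_max_mapping_size() reports when bouncing is in play.
	 */
	if (dma_max_mapping_size(dev) < fb_size)
		dev_warn(dev, "framebuffer too large to map in one chunk\n");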
Re: [PATCH v1 16/16] dma-mapping: Disallow .map_sg operations from returning zero on error
On 2021-07-16 07:33, Christoph Hellwig wrote:

On Thu, Jul 15, 2021 at 10:45:44AM -0600, Logan Gunthorpe wrote:

@@ -194,6 +194,8 @@ static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 	else
 		ents = ops->map_sg(dev, sg, nents, dir, attrs);

+	WARN_ON_ONCE(ents == 0);

Turns this into a negative error code while we're at it, just to keep the callers sane?

Right, by this point returning the 0 would pass through dma_map_sg_attrs() OK, but AFAICS dma_map_sgtable() would now get confused and return success but with sgt->nents = 0. Coercing it to an error code (which dma_map_sg_attrs() would then just change right back) seems sensible for the sake of easy robustness.

Robin.
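A sketch of the coercion being discussed, layered on top of Logan's hunk (hypothetical placement in __dma_map_sg_attrs(), error value chosen for illustration):

	ents = ops->map_sg(dev, sg, nents, dir, attrs);
	if (WARN_ON_ONCE(ents == 0))
		return -EIO;	/* returning 0 is no longer a valid way for ->map_sg to fail */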
Re: [PATCH v1 14/16] x86/amd_gart: return error code from gart_map_sg()
On 2021-07-16 07:32, Christoph Hellwig wrote:

On Thu, Jul 15, 2021 at 10:45:42AM -0600, Logan Gunthorpe wrote:

@@ -458,7 +460,7 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 	iommu_full(dev, pages << PAGE_SHIFT, dir);
 	for_each_sg(sg, s, nents, i)
 		s->dma_address = DMA_MAPPING_ERROR;
-	return 0;
+	return ret;

While we're at it - setting the ->dma_address to DMA_MAPPING_ERROR is not part of the ->map_sg calling convention. Might be worth to fix up while we're at it.

Especially since it's not even zeroing dma_length, which at least is a documented indicator of validity (even if it's not strictly defined for failure cases either).

Robin.
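A sketch of the extra cleanup Robin is hinting at (hypothetical, not part of the posted patch): drop the undocumented DMA_MAPPING_ERROR poisoning in favour of the documented invalidity marker, or at least set both consistently:

	iommu_full(dev, pages << PAGE_SHIFT, dir);
	for_each_sg(sg, s, nents, i) {
		s->dma_address = DMA_MAPPING_ERROR;
		s->dma_length = 0;	/* the documented "not mapped" indicator */
	}
	return ret;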
Re: [PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
On 2021-07-06 15:05, Christoph Hellwig wrote: On Tue, Jul 06, 2021 at 03:01:04PM +0100, Robin Murphy wrote: FWIW I was pondering the question of whether to do something along those lines or just scrap the default assignment entirely, so since I hadn't got round to saying that I've gone ahead and hacked up the alternative (similarly untested) for comparison :) TBH I'm still not sure which one I prefer... Claire did implement something like your suggestion originally, but I don't really like it as it doesn't scale for adding multiple global pools, e.g. for the 64-bit addressable one for the various encrypted secure guest schemes. Ah yes, that had slipped my mind, and it's a fair point indeed. Since we're not concerned with a minimal fix for backports anyway I'm more than happy to focus on Will's approach. Another thing is that that looks to take us a quiet step closer to the possibility of dynamically resizing a SWIOTLB pool, which is something that some of the hypervisor protection schemes looking to build on top of this series may want to explore at some point. Robin.
Re: [PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
On 2021-07-06 14:24, Will Deacon wrote:

On Tue, Jul 06, 2021 at 06:48:48AM +0200, Christoph Hellwig wrote:

On Mon, Jul 05, 2021 at 08:03:52PM +0100, Will Deacon wrote:

So at this point, the AMD IOMMU driver does:

	swiotlb = (iommu_default_passthrough() || sme_me_mask) ? 1 : 0;

where 'swiotlb' is a global variable indicating whether or not swiotlb is in use. It's picked up a bit later on by pci_swiotlb_late_init(), which will call swiotlb_exit() if 'swiotlb' is false.

Now, that used to work fine, because swiotlb_exit() clears 'io_tlb_default_mem' to NULL, but now with the restricted DMA changes, I think that all the devices which have successfully probed beforehand will have stale pointers to the freed structure in their 'dev->dma_io_tlb_mem' field.

Yeah. I don't think we can do that anymore, and I also think it is a bad idea to start with.

I've had a crack at reworking things along the following lines:

  - io_tlb_default_mem now lives in the BSS, the flexible array member is now a pointer and that part is allocated dynamically (downside of this is an extra indirection to get at the slots).

  - io_tlb_default_mem.nslabs tells you whether the thing is valid

  - swiotlb_exit() frees the slots array and clears the rest of the structure to 0. I also extended it to free the actual slabs, but I'm not sure why it wasn't doing that before.

So a non-NULL dev->dma_io_tlb_mem should always be valid to follow.

FWIW I was pondering the question of whether to do something along those lines or just scrap the default assignment entirely, so since I hadn't got round to saying that I've gone ahead and hacked up the alternative (similarly untested) for comparison :)

TBH I'm still not sure which one I prefer...

Robin.

->8-
diff --git a/drivers/base/core.c b/drivers/base/core.c
index ea5b85354526..394abf184c1a 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2847,9 +2847,6 @@ void device_initialize(struct device *dev)
     defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
 	dev->dma_coherent = dma_default_coherent;
 #endif
-#ifdef CONFIG_SWIOTLB
-	dev->dma_io_tlb_mem = io_tlb_default_mem;
-#endif
 }
 EXPORT_SYMBOL_GPL(device_initialize);
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 39284ff2a6cd..620f16d89a98 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -107,16 +107,21 @@ struct io_tlb_mem {
 };

 extern struct io_tlb_mem *io_tlb_default_mem;

+static inline struct io_tlb_mem *dev_iotlb_mem(struct device *dev)
+{
+	return dev->dma_io_tlb_mem ?: io_tlb_default_mem;
+}
+
 static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_mem *mem = dev_iotlb_mem(dev);

 	return mem && paddr >= mem->start && paddr < mem->end;
 }

 static inline bool is_swiotlb_force_bounce(struct device *dev)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_mem *mem = dev_iotlb_mem(dev);

 	return mem && mem->force_bounce;
 }
@@ -167,7 +172,7 @@ bool swiotlb_free(struct device *dev, struct page *page, size_t size);

 static inline bool is_swiotlb_for_alloc(struct device *dev)
 {
-	return dev->dma_io_tlb_mem->for_alloc;
+	return dev_iotlb_mem(dev)->for_alloc;
 }
 #else
 static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index b7f76bca89bf..f4942149f87d 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -359,7 +359,7 @@ static unsigned int swiotlb_align_offset(struct device *dev, u64 addr)
 static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
 		enum dma_data_direction dir)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_mem *mem = dev_iotlb_mem(dev);
 	int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
 	phys_addr_t orig_addr = mem->slots[index].orig_addr;
 	size_t alloc_size = mem->slots[index].alloc_size;
@@ -440,7 +440,7 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
 static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 		size_t alloc_size)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_mem *mem = dev_iotlb_mem(dev);
 	unsigned long boundary_mask = dma_get_seg_boundary(dev);
 	dma_addr_t tbl_dma_addr =
 		phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -522,7 +522,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 		size_t mapping_size, size_t alloc_size,
 		enum dma_data_direction dir, unsigned long attrs)
 {
-	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+	struct io_tlb_mem *mem = dev_iotlb_mem(dev);
 	unsigned int offset =
Re: [PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
On 2021-07-02 14:58, Will Deacon wrote: Hi Nathan, On Thu, Jul 01, 2021 at 12:52:20AM -0700, Nathan Chancellor wrote: On 7/1/2021 12:40 AM, Will Deacon wrote: On Wed, Jun 30, 2021 at 08:56:51AM -0700, Nathan Chancellor wrote: On Wed, Jun 30, 2021 at 12:43:48PM +0100, Will Deacon wrote: On Wed, Jun 30, 2021 at 05:17:27PM +0800, Claire Chang wrote: `BUG: unable to handle page fault for address: 003a8290` and the fact it crashed at `_raw_spin_lock_irqsave` look like the memory (maybe dev->dma_io_tlb_mem) was corrupted? The dev->dma_io_tlb_mem should be set here (https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/pci/probe.c#n2528) through device_initialize. I'm less sure about this. 'dma_io_tlb_mem' should be pointing at 'io_tlb_default_mem', which is a page-aligned allocation from memblock. The spinlock is at offset 0x24 in that structure, and looking at the register dump from the crash: Jun 29 18:28:42 hp-4300G kernel: RSP: 0018:adb4013db9e8 EFLAGS: 00010006 Jun 29 18:28:42 hp-4300G kernel: RAX: 003a8290 RBX: RCX: 8900572ad580 Jun 29 18:28:42 hp-4300G kernel: RDX: 89005653f024 RSI: 000c RDI: 1d17 Jun 29 18:28:42 hp-4300G kernel: RBP: 0a20d000 R08: 000c R09: Jun 29 18:28:42 hp-4300G kernel: R10: 0a20d000 R11: 89005653f000 R12: 0212 Jun 29 18:28:42 hp-4300G kernel: R13: 1000 R14: 0002 R15: 0020 Jun 29 18:28:42 hp-4300G kernel: FS: 7f1f8898ea40() GS:89005728() knlGS: Jun 29 18:28:42 hp-4300G kernel: CS: 0010 DS: ES: CR0: 80050033 Jun 29 18:28:42 hp-4300G kernel: CR2: 003a8290 CR3: 0001020d CR4: 00350ee0 Jun 29 18:28:42 hp-4300G kernel: Call Trace: Jun 29 18:28:42 hp-4300G kernel: _raw_spin_lock_irqsave+0x39/0x50 Jun 29 18:28:42 hp-4300G kernel: swiotlb_tbl_map_single+0x12b/0x4c0 Then that correlates with R11 holding the 'dma_io_tlb_mem' pointer and RDX pointing at the spinlock. Yet RAX is holding junk :/ I agree that enabling KASAN would be a good idea, but I also think we probably need to get some more information out of swiotlb_tbl_map_single() to see see what exactly is going wrong in there. I can certainly enable KASAN and if there is any debug print I can add or dump anything, let me know! I bit the bullet and took v5.13 with swiotlb/for-linus-5.14 merged in, built x86 defconfig and ran it on my laptop. However, it seems to work fine! Please can you share your .config? Sure thing, it is attached. It is just Arch Linux's config run through olddefconfig. The original is below in case you need to diff it. https://raw.githubusercontent.com/archlinux/svntogit-packages/9045405dc835527164f3034b3ceb9a67c7a53cd4/trunk/config If there is anything more that I can provide, please let me know. I eventually got this booting (for some reason it was causing LD to SEGV trying to link it for a while...) and sadly it works fine on my laptop. Hmm. Did you manage to try again with KASAN? It might also be worth taking the IOMMU out of the equation, since that interfaces differently with SWIOTLB and I couldn't figure out the code path from the log you provided. What happens if you boot with "amd_iommu=off swiotlb=force"? Oh, now there's a thing... the chat from the IOMMU API in the boot log implies that the IOMMU *should* be in the picture - we see that default domains are IOMMU_DOMAIN_DMA default and the GPU :0c:00.0 was added to a group. That means dev->dma_ops should be set and DMA API calls should be going through iommu-dma, yet the callstack in the crash says we've gone straight from dma_map_page_attrs() to swiotlb_map(), implying the inline dma_direct_map_page() path. 
If dev->dma_ops didn't look right in the first place, it's perhaps less surprising that dev->dma_io_tlb_mem might be wild as well. It doesn't seem plausible that we should have a race between initialising the device and probing its driver, so maybe the whole dev pointer is getting trampled earlier in the callchain (or is fundamentally wrong to begin with, but from a quick skim of the amdgpu code it did look like adev->dev and adev->pdev are appropriately set early on by amdgpu_pci_probe()). (although word of warning here: i915 dies horribly on my laptop if I pass swiotlb=force, even with the distro 5.10 kernel) FWIW I'd imagine you probably need to massively increase the SWIOTLB buffer size to have hope of that working. Robin.
Re: [PATCH v15 12/12] of: Add plumbing for restricted DMA pool
On 2021-07-02 04:08, Guenter Roeck wrote:

Hi,

On Thu, Jun 24, 2021 at 11:55:26PM +0800, Claire Chang wrote:

If a device is not behind an IOMMU, we look up the device node and set up the restricted DMA when the restricted-dma-pool is presented.

Signed-off-by: Claire Chang
Tested-by: Stefano Stabellini
Tested-by: Will Deacon

With this patch in place, all sparc and sparc64 qemu emulations fail to boot. Symptom is that the root file system is not found. Reverting this patch fixes the problem. Bisect log is attached.

Ah, OF_ADDRESS depends on !SPARC, so of_dma_configure_id() is presumably returning an unexpected -ENODEV from the of_dma_set_restricted_buffer() stub. That should probably be returning 0 instead, since either way it's not an error condition for it to simply do nothing.

Robin.

Guenter

---
# bad: [fb0ca446157a86b75502c1636b0d81e642fe6bf1] Add linux-next specific files for 20210701
# good: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
git bisect start 'HEAD' 'v5.13'
# bad: [f63c4fda987a19b1194cc45cb72fd5bf968d9d90] Merge remote-tracking branch 'rdma/for-next'
git bisect bad f63c4fda987a19b1194cc45cb72fd5bf968d9d90
# good: [46bb5dd1d2a63e906e374e97dfd4a5e33934b1c4] Merge remote-tracking branch 'ipsec/master'
git bisect good 46bb5dd1d2a63e906e374e97dfd4a5e33934b1c4
# good: [43ba6969cfb8185353a7a6fc79070f13b9e3d6d3] Merge remote-tracking branch 'clk/clk-next'
git bisect good 43ba6969cfb8185353a7a6fc79070f13b9e3d6d3
# good: [1ca5eddcf8dca1d6345471c6404e7364af0d7019] Merge remote-tracking branch 'fuse/for-next'
git bisect good 1ca5eddcf8dca1d6345471c6404e7364af0d7019
# good: [8f6d7b3248705920187263a4e7147b0752ec7dcf] Merge remote-tracking branch 'pci/next'
git bisect good 8f6d7b3248705920187263a4e7147b0752ec7dcf
# good: [df1885a755784da3ef285f36d9230c1d090ef186] RDMA/rtrs_clt: Alloc less memory with write path fast memory registration
git bisect good df1885a755784da3ef285f36d9230c1d090ef186
# good: [93d31efb58c8ad4a66bbedbc2d082df458c04e45] Merge remote-tracking branch 'cpufreq-arm/cpufreq/arm/linux-next'
git bisect good 93d31efb58c8ad4a66bbedbc2d082df458c04e45
# good: [46308965ae6fdc7c25deb2e8c048510ae51bbe66] RDMA/irdma: Check contents of user-space irdma_mem_reg_req object
git bisect good 46308965ae6fdc7c25deb2e8c048510ae51bbe66
# good: [6de7a1d006ea9db235492b288312838d6878385f] thermal/drivers/int340x/processor_thermal: Split enumeration and processing part
git bisect good 6de7a1d006ea9db235492b288312838d6878385f
# good: [081bec2577cda3d04f6559c60b6f4e2242853520] dt-bindings: of: Add restricted DMA pool
git bisect good 081bec2577cda3d04f6559c60b6f4e2242853520
# good: [bf95ac0bcd69979af146852f6a617a60285ebbc1] Merge remote-tracking branch 'thermal/thermal/linux-next'
git bisect good bf95ac0bcd69979af146852f6a617a60285ebbc1
# good: [3d8287544223a3d2f37981c1f9ffd94d0b5e9ffc] RDMA/core: Always release restrack object
git bisect good 3d8287544223a3d2f37981c1f9ffd94d0b5e9ffc
# bad: [cff1f23fad6e0bd7d671acce0d15285c709f259c] Merge remote-tracking branch 'swiotlb/linux-next'
git bisect bad cff1f23fad6e0bd7d671acce0d15285c709f259c
# bad: [b655006619b7bccd0dc1e055bd72de5d613e7b5c] of: Add plumbing for restricted DMA pool
git bisect bad b655006619b7bccd0dc1e055bd72de5d613e7b5c
# first bad commit: [b655006619b7bccd0dc1e055bd72de5d613e7b5c] of: Add plumbing for restricted DMA pool
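Concretely, the fix Robin suggests is a one-liner in the !OF_ADDRESS stub in drivers/of/of_private.h - a sketch:

#if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA)
int of_dma_set_restricted_buffer(struct device *dev);
#else
static inline int of_dma_set_restricted_buffer(struct device *dev)
{
	/* Nothing to do without OF address support - not an error. */
	return 0;
}
#endif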
Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
On 2021-06-24 12:18, Will Deacon wrote: On Thu, Jun 24, 2021 at 12:14:39PM +0100, Robin Murphy wrote: On 2021-06-24 07:05, Claire Chang wrote: On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig wrote: On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote: is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119 is_swiotlb_force_bounce() was the new function introduced in this patch here. +static inline bool is_swiotlb_force_bounce(struct device *dev) +{ + return dev->dma_io_tlb_mem->force_bounce; +} To me the crash looks like dev->dma_io_tlb_mem is NULL. Can you turn this into : return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce; for a quick debug check? I just realized that dma_io_tlb_mem might be NULL like Christoph pointed out since swiotlb might not get initialized. However, `Unable to handle kernel paging request at virtual address dfff800e` looks more like the address is garbage rather than NULL? I wonder if that's because dev->dma_io_tlb_mem is not assigned properly (which means device_initialize is not called?). What also looks odd is that the base "address" 0xdfff8000 is held in a couple of registers, but the offset 0xe looks too small to match up to any relevant structure member in that dereference chain :/ FWIW, I've managed to trigger a NULL dereference locally when swiotlb hasn't been initialised but we dereference 'dev->dma_io_tlb_mem', so I think Christoph's suggestion is needed regardless. Ack to that - for SWIOTLB_NO_FORCE, io_tlb_default_mem will remain NULL. The massive jump in KernelCI baseline failures as of yesterday looks like every arm64 machine with less than 4GB of RAM blowing up... Robin. But I agree that it won't help with the issue reported by Qian Cai. Qian Cai: please can you share your .config and your command line? Thanks, Will
Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
On 2021-06-24 07:05, Claire Chang wrote: On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig wrote: On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote: is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119 is_swiotlb_force_bounce() was the new function introduced in this patch here. +static inline bool is_swiotlb_force_bounce(struct device *dev) +{ + return dev->dma_io_tlb_mem->force_bounce; +} To me the crash looks like dev->dma_io_tlb_mem is NULL. Can you turn this into : return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce; for a quick debug check? I just realized that dma_io_tlb_mem might be NULL like Christoph pointed out since swiotlb might not get initialized. However, `Unable to handle kernel paging request at virtual address dfff800e` looks more like the address is garbage rather than NULL? I wonder if that's because dev->dma_io_tlb_mem is not assigned properly (which means device_initialize is not called?). What also looks odd is that the base "address" 0xdfff8000 is held in a couple of registers, but the offset 0xe looks too small to match up to any relevant structure member in that dereference chain :/ Robin.
Re: [PATCH v5 14/16] dma-direct: Allocate memory from restricted DMA pool if available
On 2021-04-22 09:15, Claire Chang wrote:

The restricted DMA pool is preferred if available.

The restricted DMA pools provide a basic level of protection against the DMA overwriting buffer contents at unexpected times. However, to protect against general data leakage and system memory corruption, the system needs to provide a way to lock down the memory access, e.g., MPU.

Signed-off-by: Claire Chang
---
 kernel/dma/direct.c | 35 ++-
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 7a27f0510fcc..29523d2a9845 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -78,6 +78,10 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
 static void __dma_direct_free_pages(struct device *dev, struct page *page,
 		size_t size)
 {
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+	if (swiotlb_free(dev, page, size))
+		return;
+#endif
 	dma_free_contiguous(dev, page, size);
 }
@@ -92,7 +96,17 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
 	gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
 					   &phys_limit);
-	page = dma_alloc_contiguous(dev, size, gfp);
+
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+	page = swiotlb_alloc(dev, size);
+	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
+		__dma_direct_free_pages(dev, page, size);
+		page = NULL;
+	}
+#endif
+
+	if (!page)
+		page = dma_alloc_contiguous(dev, size, gfp);
 	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
 		dma_free_contiguous(dev, page, size);
 		page = NULL;
@@ -148,7 +162,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
 		gfp |= __GFP_NOWARN;

 	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
-	    !force_dma_unencrypted(dev)) {
+	    !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
 		page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO);
 		if (!page)
 			return NULL;
@@ -161,8 +175,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
 	}

 	if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
-	    !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-	    !dev_is_dma_coherent(dev))
+	    !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+	    !is_dev_swiotlb_force(dev))
 		return arch_dma_alloc(dev, size, dma_handle, gfp, attrs);

 	/*
@@ -172,7 +186,9 @@ void *dma_direct_alloc(struct device *dev, size_t size,
 	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
 	    !gfpflags_allow_blocking(gfp) &&
 	    (force_dma_unencrypted(dev) ||
-	     (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev))))
+	     (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
+	      !dev_is_dma_coherent(dev))) &&
+	    !is_dev_swiotlb_force(dev))
 		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);

 	/* we always manually zero the memory once we are done */
@@ -253,15 +269,15 @@ void dma_direct_free(struct device *dev, size_t size,
 	unsigned int page_order = get_order(size);

 	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
-	    !force_dma_unencrypted(dev)) {
+	    !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
 		/* cpu_addr is a struct page cookie, not a kernel address */
 		dma_free_contiguous(dev, cpu_addr, size);
 		return;
 	}

 	if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
-	    !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-	    !dev_is_dma_coherent(dev)) {
+	    !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+	    !is_dev_swiotlb_force(dev)) {
 		arch_dma_free(dev, size, cpu_addr, dma_addr, attrs);
 		return;
 	}
@@ -289,7 +305,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
 	void *ret;

 	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
-	    force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp))
+	    force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp) &&
+	    !is_dev_swiotlb_force(dev))
 		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);

Wait, this seems broken for non-coherent devices - in that case we need to return a non-cacheable address, but we can't simply fall through into the remapping path below in GFP_ATOMIC context. That's why we need the atomic pool concept in the first place :/

Unless I've overlooked something, we're still using the regular cacheable linear map address of the dma_io_tlb_mem buffer, no?

Robin.

 	page = __dma_direct_alloc_pages(dev, size,
Re: [PATCH v5 16/16] of: Add plumbing for restricted DMA pool
On 2021-04-22 09:15, Claire Chang wrote:

If a device is not behind an IOMMU, we look up the device node and set up the restricted DMA when the restricted-dma-pool is presented.

Signed-off-by: Claire Chang
---
 drivers/of/address.c    | 25 +
 drivers/of/device.c     |  3 +++
 drivers/of/of_private.h |  5 +
 3 files changed, 33 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index 54f221dde267..fff3adfe4986 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -8,6 +8,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1109,6 +1110,30 @@ bool of_dma_is_coherent(struct device_node *np)
 }
 EXPORT_SYMBOL_GPL(of_dma_is_coherent);

+int of_dma_set_restricted_buffer(struct device *dev)
+{
+	struct device_node *node;
+	int count, i;
+
+	if (!dev->of_node)
+		return 0;
+
+	count = of_property_count_elems_of_size(dev->of_node, "memory-region",
+						sizeof(phandle));
+	for (i = 0; i < count; i++) {
+		node = of_parse_phandle(dev->of_node, "memory-region", i);
+		/* There might be multiple memory regions, but only one
+		 * restriced-dma-pool region is allowed.
+		 */

What's the use-case for having multiple regions if the restricted pool is by definition the only one accessible?

Robin.

+		if (of_device_is_compatible(node, "restricted-dma-pool") &&
+		    of_device_is_available(node))
+			return of_reserved_mem_device_init_by_idx(
+				dev, dev->of_node, i);
+	}
+
+	return 0;
+}
+
 /**
  * of_mmio_is_nonposted - Check if device uses non-posted MMIO
  * @np: device node
diff --git a/drivers/of/device.c b/drivers/of/device.c
index c5a9473a5fb1..d8d865223e51 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct device_node *np,
 	arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);

+	if (!iommu)
+		return of_dma_set_restricted_buffer(dev);
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(of_dma_configure_id);
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index d717efbd637d..e9237f5eff48 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -163,12 +163,17 @@ struct bus_dma_region;
 #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA)
 int of_dma_get_range(struct device_node *np,
 		const struct bus_dma_region **map);
+int of_dma_set_restricted_buffer(struct device *dev);
 #else
 static inline int of_dma_get_range(struct device_node *np,
 		const struct bus_dma_region **map)
 {
 	return -ENODEV;
 }
+static inline int of_dma_get_restricted_buffer(struct device *dev)
+{
+	return -ENODEV;
+}
 #endif

 #endif /* _LINUX_OF_PRIVATE_H */
Re: [PATCH v5 08/16] swiotlb: Update is_swiotlb_active to add a struct device argument
On 2021-04-22 09:15, Claire Chang wrote:

Update is_swiotlb_active to add a struct device argument. This will be useful later to allow for restricted DMA pool.

Signed-off-by: Claire Chang
---
 drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c        | 2 +-
 drivers/pci/xen-pcifront.c                   | 2 +-
 include/linux/swiotlb.h                      | 4 ++--
 kernel/dma/direct.c                          | 2 +-
 kernel/dma/swiotlb.c                         | 4 ++--
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
index ce6b664b10aa..7d48c433446b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
@@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct drm_i915_gem_object *obj)
 	max_order = MAX_ORDER;
 #ifdef CONFIG_SWIOTLB
-	if (is_swiotlb_active()) {
+	if (is_swiotlb_active(NULL)) {
 		unsigned int max_segment;

 		max_segment = swiotlb_max_segment();
diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index e8b506a6685b..2a2ae6d6cf6d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm)
 	}

 #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)
-	need_swiotlb = is_swiotlb_active();
+	need_swiotlb = is_swiotlb_active(NULL);
 #endif

 	ret = ttm_device_init(&drm->ttm.bdev, &nouveau_bo_driver, drm->dev->dev,
diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index b7a8f3a1921f..6d548ce53ce7 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct pcifront_device *pdev)
 	spin_unlock(&pcifront_dev_lock);

-	if (!err && !is_swiotlb_active()) {
+	if (!err && !is_swiotlb_active(NULL)) {
 		err = pci_xen_swiotlb_init_late();
 		if (err)
 			dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n");
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 2a6cca07540b..c530c976d18b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -123,7 +123,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
-bool is_swiotlb_active(void);
+bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
@@ -143,7 +143,7 @@ static inline size_t swiotlb_max_mapping_size(struct device *dev)
 	return SIZE_MAX;
 }

-static inline bool is_swiotlb_active(void)
+static inline bool is_swiotlb_active(struct device *dev)
 {
 	return false;
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 84c9feb5474a..7a88c34d0867 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -495,7 +495,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
 size_t dma_direct_max_mapping_size(struct device *dev)
 {
 	/* If SWIOTLB is active, use its maximum mapping size */
-	if (is_swiotlb_active() &&
+	if (is_swiotlb_active(dev) &&
 	    (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))

I wonder if it's worth trying to fold these other conditions into is_swiotlb_active() itself? I'm not entirely sure what matters for Xen, but for the other cases it seems like they probably only care about whether bouncing may occur for their particular device or not (possibly they want to be using dma_max_mapping_size() now anyway - TBH I'm struggling to make sense of what the swiotlb_max_segment business is supposed to mean).

Otherwise, patch #9 will need to touch here as well to make sure that per-device forced bouncing is reflected correctly.

Robin.

 		return swiotlb_max_mapping_size(dev);
 	return SIZE_MAX;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ffbb8724e06c..1d221343f1c8 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -659,9 +659,9 @@ size_t swiotlb_max_mapping_size(struct device *dev)
 	return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
 }

-bool is_swiotlb_active(void)
+bool is_swiotlb_active(struct device *dev)
 {
-	return io_tlb_default_mem != NULL;
+	return get_io_tlb_mem(dev) != NULL;
 }
 EXPORT_SYMBOL_GPL(is_swiotlb_active);
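A hypothetical consolidation along the lines Robin is wondering about might have the helper answer the real question - "could this device's DMA end up bounced?" - directly, e.g. (sketch only, reusing get_io_tlb_mem() from the quoted patch):

static inline bool is_swiotlb_active(struct device *dev)
{
	struct io_tlb_mem *mem = get_io_tlb_mem(dev);

	/*
	 * Fold the callers' conditions in; per-device forced bouncing
	 * (patch #9) would slot in as another || clause here.
	 */
	return mem && (dma_addressing_limited(dev) ||
		       swiotlb_force == SWIOTLB_FORCE);
}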
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-16 15:38, Christoph Hellwig wrote:
[...]

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f1e38526d5bd40..996dfdf9d375dd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2017,7 +2017,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 		.iommu_dev	= smmu->dev,
 	};

-	if (smmu_domain->non_strict)
+	if (!iommu_get_dma_strict())

As Will raised, this also needs to be checking "domain->type == IOMMU_DOMAIN_DMA" to maintain equivalent behaviour to the attribute code below.

 		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;

 	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
@@ -2449,52 +2449,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
 	return group;
 }

-static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
-				    enum iommu_attr attr, void *data)
-{
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-
-	switch (domain->type) {
-	case IOMMU_DOMAIN_DMA:
-		switch (attr) {
-		case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-			*(int *)data = smmu_domain->non_strict;
-			return 0;
-		default:
-			return -ENODEV;
-		}
-		break;
-	default:
-		return -EINVAL;
-	}
-}

[...]

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index f985817c967a25..edb1de479dd1a7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -668,7 +668,6 @@ struct arm_smmu_domain {
 	struct mutex		init_mutex; /* Protects smmu pointer */

 	struct io_pgtable_ops	*pgtbl_ops;
-	bool			non_strict;
 	atomic_t		nr_ats_masters;

 	enum arm_smmu_domain_stage	stage;
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 0aa6d667274970..3dde22b1f8ffb0 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -761,6 +761,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 		.iommu_dev	= smmu->dev,
 	};

+	if (!iommu_get_dma_strict())

Ditto here. Sorry for not spotting that sooner :(

Robin.

+		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
+
 	if (smmu->impl && smmu->impl->init_context) {
 		ret = smmu->impl->init_context(smmu_domain, &pgtbl_cfg, dev);
 		if (ret)
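Concretely, the check Robin is asking for would look something like this in arm_smmu_domain_finalise() (a sketch, not the final patch):

	if (domain->type == IOMMU_DOMAIN_DMA && !iommu_get_dma_strict())
		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;

with the same guard applied in arm_smmu_init_domain_context() in the arm-smmu driver, so that unmanaged (e.g. VFIO) domains keep strict invalidation.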
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-31 16:32, Will Deacon wrote: On Wed, Mar 31, 2021 at 02:09:37PM +0100, Robin Murphy wrote: On 2021-03-31 12:49, Will Deacon wrote: On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote: On 2021-03-30 14:58, Will Deacon wrote: On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote: On 2021-03-30 14:11, Will Deacon wrote: On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: From: Robin Murphy Instead make the global iommu_dma_strict paramete in iommu.c canonical by exporting helpers to get and set it and use those directly in the drivers. This make sure that the iommu.strict parameter also works for the AMD and Intel IOMMU drivers on x86. As those default to lazy flushing a new IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to represent the default if not overriden by an explicit parameter. Signed-off-by: Robin Murphy . [ported on top of the other iommu_attr changes and added a few small missing bits] Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + drivers/iommu/dma-iommu.c | 9 +-- drivers/iommu/intel/iommu.c | 64 - drivers/iommu/iommu.c | 27 ++--- include/linux/iommu.h | 4 +- 8 files changed, 40 insertions(+), 165 deletions(-) I really like this cleanup, but I can't help wonder if it's going in the wrong direction. With SoCs often having multiple IOMMU instances and a distinction between "trusted" and "untrusted" devices, then having the flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound unreasonable to me, but this change makes it a global property. The intent here was just to streamline the existing behaviour of stuffing a global property into a domain attribute then pulling it out again in the illusion that it was in any way per-domain. We're still checking dev_is_untrusted() before making an actual decision, and it's not like we can't add more factors at that point if we want to. Like I say, the cleanup is great. I'm just wondering whether there's a better way to express the complicated logic to decide whether or not to use the flush queue than what we end up with: if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) which is mixing up globals, device properties and domain properties. The result is that the driver code ends up just using the global to determine whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code, which is a departure from the current way of doing things. But previously, SMMU only ever saw the global policy piped through the domain attribute by iommu_group_alloc_default_domain(), so there's no functional change there. For DMA domains sure, but I don't think that's the case for unmanaged domains such as those used by VFIO. Eh? This is only relevant to DMA domains anyway. Flush queues are part of the IOVA allocator that VFIO doesn't even use. It's always been the case that unmanaged domains only use strict invalidation. Maybe I'm going mad. With this patch, the SMMU driver unconditionally sets IO_PGTABLE_QUIRK_NON_STRICT for page-tables if iommu_get_dma_strict() is true, no? In which case, that will get set for page-tables corresponding to unmanaged domains as well as DMA domains when it is enabled. That didn't happen before because you couldn't set the attribute for unmanaged domains. What am I missing? Oh cock... 
sorry, all this time I've been saying what I *expect* it to do, while overlooking the fact that the IO_PGTABLE_QUIRK_NON_STRICT hunks were the bits I forgot to write and Christoph had to fix up. Indeed, those should be checking the domain type too to preserve the existing behaviour. Apologies for the confusion. Robin. Obviously some of the above checks could be factored out into some kind of iommu_use_flush_queue() helper that IOMMU drivers can also call if they need to keep in sync. Or maybe we just allow iommu-dma to set IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're treating that as a generic thing now. I think a helper that takes a domain would be a good starting point. You mean device, right? The one condition we currently have is at the device level, and there's really nothing inherent to the domain itself that matters (since the type is implicitly IOMMU_DOMAIN_DMA to even care about this). Device would probably work too; you'd pass the first device to attach to the domain when querying this from the SMMU driver, I suppose. Will
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-31 12:49, Will Deacon wrote: On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote: On 2021-03-30 14:58, Will Deacon wrote: On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote: On 2021-03-30 14:11, Will Deacon wrote: On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: From: Robin Murphy Instead make the global iommu_dma_strict paramete in iommu.c canonical by exporting helpers to get and set it and use those directly in the drivers. This make sure that the iommu.strict parameter also works for the AMD and Intel IOMMU drivers on x86. As those default to lazy flushing a new IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to represent the default if not overriden by an explicit parameter. Signed-off-by: Robin Murphy . [ported on top of the other iommu_attr changes and added a few small missing bits] Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + drivers/iommu/dma-iommu.c | 9 +-- drivers/iommu/intel/iommu.c | 64 - drivers/iommu/iommu.c | 27 ++--- include/linux/iommu.h | 4 +- 8 files changed, 40 insertions(+), 165 deletions(-) I really like this cleanup, but I can't help wonder if it's going in the wrong direction. With SoCs often having multiple IOMMU instances and a distinction between "trusted" and "untrusted" devices, then having the flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound unreasonable to me, but this change makes it a global property. The intent here was just to streamline the existing behaviour of stuffing a global property into a domain attribute then pulling it out again in the illusion that it was in any way per-domain. We're still checking dev_is_untrusted() before making an actual decision, and it's not like we can't add more factors at that point if we want to. Like I say, the cleanup is great. I'm just wondering whether there's a better way to express the complicated logic to decide whether or not to use the flush queue than what we end up with: if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) which is mixing up globals, device properties and domain properties. The result is that the driver code ends up just using the global to determine whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code, which is a departure from the current way of doing things. But previously, SMMU only ever saw the global policy piped through the domain attribute by iommu_group_alloc_default_domain(), so there's no functional change there. For DMA domains sure, but I don't think that's the case for unmanaged domains such as those used by VFIO. Eh? This is only relevant to DMA domains anyway. Flush queues are part of the IOVA allocator that VFIO doesn't even use. It's always been the case that unmanaged domains only use strict invalidation. Obviously some of the above checks could be factored out into some kind of iommu_use_flush_queue() helper that IOMMU drivers can also call if they need to keep in sync. Or maybe we just allow iommu-dma to set IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're treating that as a generic thing now. I think a helper that takes a domain would be a good starting point. You mean device, right? 
The one condition we currently have is at the device level, and there's really nothing inherent to the domain itself that matters (since the type is implicitly IOMMU_DOMAIN_DMA to even care about this). Another idea that's just come to mind is now that IOMMU_DOMAIN_DMA has a standard meaning, maybe we could split out a separate IOMMU_DOMAIN_DMA_STRICT type such that it can all propagate from iommu_get_def_domain_type()? That feels like it might be quite promising, but I'd still do it as an improvement on top of this patch, since it's beyond just cleaning up the abuse of domain attributes to pass a command-line option around. Robin.
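Pulling the quoted condition out into the kind of helper discussed here might look like this (hypothetical name and placement, sketched from the check in iommu-dma):

/* Decide whether a DMA domain serving @dev should use a flush queue. */
static bool iommu_use_flush_queue(struct iommu_domain *domain,
				  struct device *dev)
{
	return domain->type == IOMMU_DOMAIN_DMA &&
	       (!dev || !dev_is_untrusted(dev)) &&
	       domain->ops->flush_iotlb_all &&
	       !iommu_get_dma_strict();
}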
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-30 14:58, Will Deacon wrote: On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote: On 2021-03-30 14:11, Will Deacon wrote: On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: From: Robin Murphy Instead make the global iommu_dma_strict paramete in iommu.c canonical by exporting helpers to get and set it and use those directly in the drivers. This make sure that the iommu.strict parameter also works for the AMD and Intel IOMMU drivers on x86. As those default to lazy flushing a new IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to represent the default if not overriden by an explicit parameter. Signed-off-by: Robin Murphy . [ported on top of the other iommu_attr changes and added a few small missing bits] Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + drivers/iommu/dma-iommu.c | 9 +-- drivers/iommu/intel/iommu.c | 64 - drivers/iommu/iommu.c | 27 ++--- include/linux/iommu.h | 4 +- 8 files changed, 40 insertions(+), 165 deletions(-) I really like this cleanup, but I can't help wonder if it's going in the wrong direction. With SoCs often having multiple IOMMU instances and a distinction between "trusted" and "untrusted" devices, then having the flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound unreasonable to me, but this change makes it a global property. The intent here was just to streamline the existing behaviour of stuffing a global property into a domain attribute then pulling it out again in the illusion that it was in any way per-domain. We're still checking dev_is_untrusted() before making an actual decision, and it's not like we can't add more factors at that point if we want to. Like I say, the cleanup is great. I'm just wondering whether there's a better way to express the complicated logic to decide whether or not to use the flush queue than what we end up with: if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) which is mixing up globals, device properties and domain properties. The result is that the driver code ends up just using the global to determine whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code, which is a departure from the current way of doing things. But previously, SMMU only ever saw the global policy piped through the domain attribute by iommu_group_alloc_default_domain(), so there's no functional change there. Obviously some of the above checks could be factored out into some kind of iommu_use_flush_queue() helper that IOMMU drivers can also call if they need to keep in sync. Or maybe we just allow iommu-dma to set IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're treating that as a generic thing now. For example, see the recent patch from Lu Baolu: https://lore.kernel.org/r/20210225061454.2864009-1-baolu...@linux.intel.com Erm, this patch is based on that one, it's right there in the context :/ Ah, sorry, I didn't spot that! I was just trying to illustrate that this is per-device. Sure, I understand - and I'm just trying to bang home that despite appearances it's never actually been treated as such for SMMU, so anything that's wrong after this change was already wrong before. Robin.
Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-30 14:11, Will Deacon wrote: On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote: From: Robin Murphy Instead make the global iommu_dma_strict paramete in iommu.c canonical by exporting helpers to get and set it and use those directly in the drivers. This make sure that the iommu.strict parameter also works for the AMD and Intel IOMMU drivers on x86. As those default to lazy flushing a new IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to represent the default if not overriden by an explicit parameter. Signed-off-by: Robin Murphy . [ported on top of the other iommu_attr changes and added a few small missing bits] Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 - drivers/iommu/arm/arm-smmu/arm-smmu.c | 27 + drivers/iommu/dma-iommu.c | 9 +-- drivers/iommu/intel/iommu.c | 64 - drivers/iommu/iommu.c | 27 ++--- include/linux/iommu.h | 4 +- 8 files changed, 40 insertions(+), 165 deletions(-) I really like this cleanup, but I can't help wonder if it's going in the wrong direction. With SoCs often having multiple IOMMU instances and a distinction between "trusted" and "untrusted" devices, then having the flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound unreasonable to me, but this change makes it a global property. The intent here was just to streamline the existing behaviour of stuffing a global property into a domain attribute then pulling it out again in the illusion that it was in any way per-domain. We're still checking dev_is_untrusted() before making an actual decision, and it's not like we can't add more factors at that point if we want to. For example, see the recent patch from Lu Baolu: https://lore.kernel.org/r/20210225061454.2864009-1-baolu...@linux.intel.com Erm, this patch is based on that one, it's right there in the context :/ Thanks, Robin.
Re: [PATCH 24/30] Kconfig: Change Synopsys to Synopsis
On 2021-03-29 00:53, Bhaskar Chowdhury wrote: s/Synopsys/Synopsis/ .two different places. Erm, that is definitely not a typo... :/ ..and for some unknown reason it introduce a empty line deleted and added back. Presumably your editor is configured to trim trailing whitespace on save. Furthermore, there are several instances in the other patches where your "corrections" are grammatically incorrect, I'm not sure what the deal is with patch #14, and you've also used the wrong subsystem name (it should be "dmaengine"). It's great to want to clean things up, but please pay a bit of care and attention to what you're actually doing. Robin. Signed-off-by: Bhaskar Chowdhury --- drivers/dma/Kconfig | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig index 0c2827fd8c19..30e8cc26f43b 100644 --- a/drivers/dma/Kconfig +++ b/drivers/dma/Kconfig @@ -170,15 +170,15 @@ config DMA_SUN6I Support for the DMA engine first found in Allwinner A31 SoCs. config DW_AXI_DMAC - tristate "Synopsys DesignWare AXI DMA support" + tristate "Synopsis DesignWare AXI DMA support" depends on OF || COMPILE_TEST depends on HAS_IOMEM select DMA_ENGINE select DMA_VIRTUAL_CHANNELS help - Enable support for Synopsys DesignWare AXI DMA controller. + Enable support for Synopsis DesignWare AXI DMA controller. NOTE: This driver wasn't tested on 64 bit platform because - of lack 64 bit platform with Synopsys DW AXI DMAC. + of lack 64 bit platform with Synopsis DW AXI DMAC. config EP93XX_DMA bool "Cirrus Logic EP93xx DMA support" @@ -394,7 +394,7 @@ config MOXART_DMA select DMA_VIRTUAL_CHANNELS help Enable support for the MOXA ART SoC DMA controller. - + Say Y here if you enabled MMP ADMA, otherwise say N. config MPC512X_DMA -- 2.26.3
Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-15 08:33, Christoph Hellwig wrote: On Fri, Mar 12, 2021 at 04:18:24PM +0000, Robin Murphy wrote: Let me know what you think of the version here: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iommu-cleanup I'll happily switch the patch to you as the author if you're fine with that as well. I still have reservations about removing the attribute API entirely and pretending that io_pgtable_cfg is anything other than a SoC-specific private interface, I think a private interface would make more sense. For now I've just condensed it down to a generic set of quirk bits and dropped the attrs structure, which seems like an ok middle ground for now. That being said I wonder why that quirk isn't simply set in the device tree? Because it's a software policy decision rather than any inherent property of the platform, and the DT certainly doesn't know *when* any particular device might prefer its IOMMU to use cacheable pagetables to minimise TLB miss latency vs. saving the cache capacity for larger data buffers. It really is most logical to decide this at the driver level. In truth the overall concept *is* relatively generic (a trend towards larger system caches and cleverer usage is about both raw performance and saving power on off-SoC DRAM traffic), it's just the particular implementation of using io-pgtable to set an outer-cacheable walk attribute in an SMMU TCR that's pretty much specific to Qualcomm SoCs. Hence why having a common abstraction at the iommu_domain level, but where the exact details are free to vary across different IOMMUs and their respective client drivers, is in many ways an ideal fit. but the reworked patch on its own looks reasonable to me, thanks! (I wasn't too convinced about the iommu_cmd_line wrappers either...) Just iommu_get_dma_strict() needs an export since the SMMU drivers can be modular - I consciously didn't add that myself since I was mistakenly thinking only iommu-dma would call it. Fixed. Can I get your signoff for the patch? Then I'll switch it over to being attributed to you. Sure - I would have thought that the one I originally posted still stands, but for the avoidance of doubt, for the parts of commit 8b6d45c495bd in your tree that remain from what I wrote: Signed-off-by: Robin Murphy Cheers, Robin.
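For reference, the helpers at the centre of this exchange are tiny; a simplified sketch of their shape (the final commit may differ in detail, and the tristate default handling via IOMMU_CMD_LINE_STRICT is omitted here):

	static bool iommu_dma_strict __read_mostly = true;

	void iommu_set_dma_strict(bool strict)
	{
		iommu_dma_strict = strict;
	}

	bool iommu_get_dma_strict(void)
	{
		return iommu_dma_strict;
	}
	/* The export Robin asks for, since arm-smmu(-v3) can be modular: */
	EXPORT_SYMBOL_GPL(iommu_get_dma_strict);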
Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-11 08:26, Christoph Hellwig wrote: On Wed, Mar 10, 2021 at 06:39:57PM +0000, Robin Murphy wrote: Actually... Just mirroring the iommu_dma_strict value into struct iommu_domain should solve all of that with very little boilerplate code. Yes, my initial thought was to directly replace the attribute with a common flag at iommu_domain level, but since in all cases the behaviour is effectively global rather than actually per-domain, it seemed reasonable to take it a step further. This passes compile-testing for arm64 and x86, what do you think? It seems to miss a few bits, and also generally seems to not actually apply to recent mainline or something like it due to different empty lines in a few places. Yeah, that was sketched out on top of some other development patches, and in being so focused on not breaking any of the x86 behaviours I did indeed overlook fully converting the SMMU drivers... oops! (my thought was to do the conversion for its own sake, then clean up the redundant attribute separately, but I guess it's fine either way) Let me know what you think of the version here: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/iommu-cleanup I'll happily switch the patch to you as the author if you're fine with that as well. I still have reservations about removing the attribute API entirely and pretending that io_pgtable_cfg is anything other than a SoC-specific private interface, but the reworked patch on its own looks reasonable to me, thanks! (I wasn't too convinced about the iommu_cmd_line wrappers either...) Just iommu_get_dma_strict() needs an export since the SMMU drivers can be modular - I consciously didn't add that myself since I was mistakenly thinking only iommu-dma would call it. Robin.
Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-10 09:25, Christoph Hellwig wrote: On Wed, Mar 10, 2021 at 10:15:01AM +0100, Christoph Hellwig wrote: On Thu, Mar 04, 2021 at 03:25:27PM +0000, Robin Murphy wrote: On 2021-03-01 08:42, Christoph Hellwig wrote: Use explicit methods for setting and querying the information instead. Now that everyone's using iommu-dma, is there any point in bouncing this through the drivers at all? Seems like it would make more sense for the x86 drivers to reflect their private options back to iommu_dma_strict (and allow Intel's caching mode to override it as well), then have iommu_dma_init_domain just test !iommu_dma_strict && domain->ops->flush_iotlb_all. Hmm. I looked at this, and kill off ->dma_enable_flush_queue for the ARM drivers and just looking at iommu_dma_strict seems like a very clear win. OTOH x86 is a little more complicated. AMD and Intel default to lazy mode, so we'd have to change the global iommu_dma_strict if they are initialized. Also Intel has not only a "static" option to disable lazy mode, but also a "dynamic" one where it iterates structure. So I think on the get side we're stuck with the method, but it still simplifies the whole thing. Actually... Just mirroring the iommu_dma_strict value into struct iommu_domain should solve all of that with very little boilerplate code. Yes, my initial thought was to directly replace the attribute with a common flag at iommu_domain level, but since in all cases the behaviour is effectively global rather than actually per-domain, it seemed reasonable to take it a step further. This passes compile-testing for arm64 and x86, what do you think? Robin. ->8- Subject: [PATCH] iommu: Consolidate strict invalidation handling Now that everyone is using iommu-dma, the global invalidation policy really doesn't need to be woven through several parts of the core API and individual drivers, we can just look it up directly at the one point that we now make the flush queue decision. If the x86 drivers reflect their internal options and overrides back to iommu_dma_strict, that can become the canonical source. Signed-off-by: Robin Murphy --- drivers/iommu/amd/iommu.c | 2 ++ drivers/iommu/dma-iommu.c | 8 +--- drivers/iommu/intel/iommu.c | 12 drivers/iommu/iommu.c | 35 +++ include/linux/iommu.h | 2 ++ 5 files changed, 44 insertions(+), 15 deletions(-) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index a69a8b573e40..1db29e59d468 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -1856,6 +1856,8 @@ int __init amd_iommu_init_dma_ops(void) else pr_info("Lazy IO/TLB flushing enabled\n"); + iommu_set_dma_strict(amd_iommu_unmap_flush); + return 0; } diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index af765c813cc8..789a950cc125 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -304,10 +304,6 @@ static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad) cookie = container_of(iovad, struct iommu_dma_cookie, iovad); domain = cookie->fq_domain; - /* -* The IOMMU driver supporting DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE -* implies that ops->flush_iotlb_all must be non-NULL.
-*/ domain->ops->flush_iotlb_all(domain); } @@ -334,7 +330,6 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; - int attr; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) return -EINVAL; @@ -371,8 +366,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, init_iova_domain(iovad, 1UL << order, base_pfn); if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) && - !iommu_domain_get_attr(domain, DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, ) && - attr) { + domain->ops->flush_iotlb_all && !iommu_get_dma_strict()) { if (init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all, iommu_dma_entry_dtor)) pr_warn("iova flush queue initialization failed\n"); diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index b5c746f0f63b..f5b452cd1266 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -4377,6 +4377,17 @@ int __init intel_iommu_init(void) down_read(_global_lock); for_each_active_iommu(iommu, drhd) { + if (!intel_iommu_strict && cap_caching_mode(iommu->cap)) { + /* +* The flush queue implementation does not perform page-selective +
Re: [PATCH 16/17] iommu: remove DOMAIN_ATTR_IO_PGTABLE_CFG
On 2021-03-01 08:42, Christoph Hellwig wrote: Signed-off-by: Christoph Hellwig Moreso than the previous patch, where the feature is at least relatively generic (note that there's a bunch of in-flight development around DOMAIN_ATTR_NESTING), I'm really not convinced that it's beneficial to bloat the generic iommu_ops structure with private driver-specific interfaces. The attribute interface is a great compromise for these kinds of things, and you can easily add type-checked wrappers around it for external callers (maybe even make the actual attributes internal between the IOMMU core and drivers) if that's your concern. Robin. --- drivers/gpu/drm/msm/adreno/adreno_gpu.c | 2 +- drivers/iommu/arm/arm-smmu/arm-smmu.c | 40 +++-- drivers/iommu/iommu.c | 9 ++ include/linux/iommu.h | 9 +- 4 files changed, 29 insertions(+), 31 deletions(-) diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c index 0f184c3dd9d9ec..78d98ab2ee3a68 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c @@ -191,7 +191,7 @@ void adreno_set_llc_attributes(struct iommu_domain *iommu) struct io_pgtable_domain_attr pgtbl_cfg; pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_ARM_OUTER_WBWA; - iommu_domain_set_attr(iommu, DOMAIN_ATTR_IO_PGTABLE_CFG, _cfg); + iommu_domain_set_pgtable_attr(iommu, _cfg); } struct msm_gem_address_space * diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c index 2e17d990d04481..2858999c86dfd1 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c @@ -1515,40 +1515,22 @@ static int arm_smmu_domain_enable_nesting(struct iommu_domain *domain) return ret; } -static int arm_smmu_domain_set_attr(struct iommu_domain *domain, - enum iommu_attr attr, void *data) +static int arm_smmu_domain_set_pgtable_attr(struct iommu_domain *domain, + struct io_pgtable_domain_attr *pgtbl_cfg) { - int ret = 0; struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); + int ret = -EPERM; - mutex_lock(_domain->init_mutex); - - switch(domain->type) { - case IOMMU_DOMAIN_UNMANAGED: - switch (attr) { - case DOMAIN_ATTR_IO_PGTABLE_CFG: { - struct io_pgtable_domain_attr *pgtbl_cfg = data; - - if (smmu_domain->smmu) { - ret = -EPERM; - goto out_unlock; - } + if (domain->type != IOMMU_DOMAIN_UNMANAGED) + return -EINVAL; - smmu_domain->pgtbl_cfg = *pgtbl_cfg; - break; - } - default: - ret = -ENODEV; - } - break; - case IOMMU_DOMAIN_DMA: - ret = -ENODEV; - break; - default: - ret = -EINVAL; + mutex_lock(_domain->init_mutex); + if (!smmu_domain->smmu) { + smmu_domain->pgtbl_cfg = *pgtbl_cfg; + ret = 0; } -out_unlock: mutex_unlock(_domain->init_mutex); + return ret; } @@ -1609,7 +1591,7 @@ static struct iommu_ops arm_smmu_ops = { .device_group = arm_smmu_device_group, .dma_use_flush_queue= arm_smmu_dma_use_flush_queue, .dma_enable_flush_queue = arm_smmu_dma_enable_flush_queue, - .domain_set_attr= arm_smmu_domain_set_attr, + .domain_set_pgtable_attr = arm_smmu_domain_set_pgtable_attr, .domain_enable_nesting = arm_smmu_domain_enable_nesting, .of_xlate = arm_smmu_of_xlate, .get_resv_regions = arm_smmu_get_resv_regions, diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 2e9e058501a953..8490aefd4b41f8 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2693,6 +2693,15 @@ int iommu_domain_enable_nesting(struct iommu_domain *domain) } EXPORT_SYMBOL_GPL(iommu_domain_enable_nesting); +int iommu_domain_set_pgtable_attr(struct iommu_domain *domain, + struct 
io_pgtable_domain_attr *pgtbl_cfg) +{ + if (!domain->ops->domain_set_pgtable_attr) + return -EINVAL; + return domain->ops->domain_set_pgtable_attr(domain, pgtbl_cfg); +} +EXPORT_SYMBOL_GPL(iommu_domain_set_pgtable_attr); + void iommu_get_resv_regions(struct device *dev, struct list_head *list) { const struct iommu_ops *ops = dev->bus->iommu_ops; diff --git a/include/linux/iommu.h b/include/linux/iommu.h index aed88aa3bd3edf..39d3ed4d2700ac 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -40,6 +40,7 @@ struct iommu_domain; struct notifier_block; struct iommu_sva; struct iommu_fault_event; +struct io_pgtable_domain_attr;
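To make the counter-proposal concrete, the type-checked wrapper Robin has in mind could be as small as the following sketch (a hypothetical helper built on the existing attribute interface, not code from the thread):

	/*
	 * Hypothetical type-safe wrapper that keeps DOMAIN_ATTR_IO_PGTABLE_CFG
	 * as the transport between the core and drivers, so iommu_ops needs no
	 * new driver-specific method.
	 */
	static inline int iommu_domain_set_pgtable_quirks(struct iommu_domain *domain,
							  unsigned long quirks)
	{
		struct io_pgtable_domain_attr cfg = { .quirks = quirks };

		return iommu_domain_set_attr(domain, DOMAIN_ATTR_IO_PGTABLE_CFG, &cfg);
	}

External callers like adreno_set_llc_attributes() could then pass IO_PGTABLE_QUIRK_ARM_OUTER_WBWA through the wrapper without ever touching the void * attribute plumbing.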
Re: [PATCH 14/17] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE
On 2021-03-01 08:42, Christoph Hellwig wrote: Use explicit methods for setting and querying the information instead. Now that everyone's using iommu-dma, is there any point in bouncing this through the drivers at all? Seems like it would make more sense for the x86 drivers to reflect their private options back to iommu_dma_strict (and allow Intel's caching mode to override it as well), then have iommu_dma_init_domain just test !iommu_dma_strict && domain->ops->flush_iotlb_all. Robin. Also remove the now unused iommu_domain_get_attr functionality. Signed-off-by: Christoph Hellwig --- drivers/iommu/amd/iommu.c | 23 ++--- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 47 ++--- drivers/iommu/arm/arm-smmu/arm-smmu.c | 56 + drivers/iommu/dma-iommu.c | 8 ++- drivers/iommu/intel/iommu.c | 27 ++ drivers/iommu/iommu.c | 19 +++ include/linux/iommu.h | 17 ++- 7 files changed, 51 insertions(+), 146 deletions(-) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index a69a8b573e40d0..37a8e51db17656 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -1771,24 +1771,11 @@ static struct iommu_group *amd_iommu_device_group(struct device *dev) return acpihid_device_group(dev); } -static int amd_iommu_domain_get_attr(struct iommu_domain *domain, - enum iommu_attr attr, void *data) +static bool amd_iommu_dma_use_flush_queue(struct iommu_domain *domain) { - switch (domain->type) { - case IOMMU_DOMAIN_UNMANAGED: - return -ENODEV; - case IOMMU_DOMAIN_DMA: - switch (attr) { - case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: - *(int *)data = !amd_iommu_unmap_flush; - return 0; - default: - return -ENODEV; - } - break; - default: - return -EINVAL; - } + if (domain->type != IOMMU_DOMAIN_DMA) + return false; + return !amd_iommu_unmap_flush; } /* @@ -2257,7 +2244,7 @@ const struct iommu_ops amd_iommu_ops = { .release_device = amd_iommu_release_device, .probe_finalize = amd_iommu_probe_finalize, .device_group = amd_iommu_device_group, - .domain_get_attr = amd_iommu_domain_get_attr, + .dma_use_flush_queue = amd_iommu_dma_use_flush_queue, .get_resv_regions = amd_iommu_get_resv_regions, .put_resv_regions = generic_iommu_put_resv_regions, .is_attach_deferred = amd_iommu_is_attach_deferred, diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 8594b4a8304375..bf96172e8c1f71 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2449,33 +2449,21 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) return group; } -static int arm_smmu_domain_get_attr(struct iommu_domain *domain, - enum iommu_attr attr, void *data) +static bool arm_smmu_dma_use_flush_queue(struct iommu_domain *domain) { struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - switch (domain->type) { - case IOMMU_DOMAIN_UNMANAGED: - switch (attr) { - case DOMAIN_ATTR_NESTING: - *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED); - return 0; - default: - return -ENODEV; - } - break; - case IOMMU_DOMAIN_DMA: - switch (attr) { - case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: - *(int *)data = smmu_domain->non_strict; - return 0; - default: - return -ENODEV; - } - break; - default: - return -EINVAL; - } + if (domain->type != IOMMU_DOMAIN_DMA) + return false; + return smmu_domain->non_strict; +} + + +static void arm_smmu_dma_enable_flush_queue(struct iommu_domain *domain) +{ + if (domain->type != IOMMU_DOMAIN_DMA) + return; + to_smmu_domain(domain)->non_strict = true; } static int 
arm_smmu_domain_set_attr(struct iommu_domain *domain, @@ -2505,13 +2493,7 @@ static int arm_smmu_domain_set_attr(struct iommu_domain *domain, } break; case IOMMU_DOMAIN_DMA: - switch(attr) { - case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE: - smmu_domain->non_strict = *(int *)data; - break; - default: - ret =
Re: [PATCH RFC v1 2/6] swiotlb: convert variables to arrays
On 2021-02-04 07:29, Christoph Hellwig wrote: On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote: This patch converts several swiotlb related variables to arrays, in order to maintain stat/status for different swiotlb buffers. Here are variables involved: - io_tlb_start and io_tlb_end - io_tlb_nslabs and io_tlb_used - io_tlb_list - io_tlb_index - max_segment - io_tlb_orig_addr - no_iotlb_memory There is no functional change and this is to prepare to enable 64-bit swiotlb. Claire Chang (on Cc) already posted a patch like this a month ago, which looks much better because it actually uses a struct instead of all the random variables. Indeed, I skimmed the cover letter and immediately thought that this whole thing is just the restricted DMA pool concept[1] again, only from a slightly different angle. Robin. [1] https://lore.kernel.org/linux-iommu/20210106034124.30560-1-tien...@chromium.org/
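The structural difference is easy to see in miniature: instead of duplicating every global, the pool state can be grouped so that extra pools are just extra instances. A rough sketch of that shape, following the direction of Claire's series (the field set is illustrative, not the exact layout):

	/* One instance per swiotlb pool: the default pool, or a per-device
	 * restricted pool. Sketch only. */
	struct io_tlb_mem {
		phys_addr_t start;	/* was io_tlb_start */
		phys_addr_t end;	/* was io_tlb_end */
		unsigned long nslabs;	/* was io_tlb_nslabs */
		unsigned long used;	/* was io_tlb_used */
		unsigned int index;	/* was io_tlb_index */
		spinlock_t lock;
		unsigned int *list;	/* was io_tlb_list */
		phys_addr_t *orig_addr;	/* was io_tlb_orig_addr */
	};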
Re: [RFC PATCH v3 5/6] dt-bindings: of: Add restricted DMA pool
On 2021-01-21 15:48, Rob Herring wrote: On Wed, Jan 20, 2021 at 7:10 PM Robin Murphy wrote: On 2021-01-20 21:31, Rob Herring wrote: On Wed, Jan 20, 2021 at 11:30 AM Robin Murphy wrote: On 2021-01-20 16:53, Rob Herring wrote: On Wed, Jan 06, 2021 at 11:41:23AM +0800, Claire Chang wrote: Introduce the new compatible string, restricted-dma-pool, for restricted DMA. One can specify the address and length of the restricted DMA memory region by restricted-dma-pool in the device tree. If this goes into DT, I think we should be able to use dma-ranges for this purpose instead. Normally, 'dma-ranges' is for physical bus restrictions, but there's no reason it can't be used for policy or to express restrictions the firmware has enabled. There would still need to be some way to tell SWIOTLB to pick up the corresponding chunk of memory and to prevent the kernel from using it for anything else, though. Don't we already have that problem if dma-ranges had a very small range? We just get lucky because the restriction is generally much more RAM than needed. Not really - if a device has a naturally tiny addressing capability that doesn't even cover ZONE_DMA32 where the regular SWIOTLB buffer will be allocated then it's unlikely to work well, but that's just crap system design. Yes, memory pressure in ZONE_DMA{32} is particularly problematic for such limited devices, but it's irrelevant to the issue at hand here. Yesterday's crap system design is today's security feature. Couldn't this feature make crap system design work better? Indeed! Say you bring out your shiny new "Strawberry Flan 4" machine with all the latest connectivity, but tragically its PCIe can only address 25% of the RAM. So you decide to support deploying it in two configurations: one where it runs normally for best performance, and another "secure" one where it dedicates that quarter of RAM as a restricted DMA pool for any PCIe devices - that way, even if that hotel projector you plug in turns out to be a rogue Thunderbolt endpoint, it can never snarf your private keys off your eMMC out of the page cache. (Yes, this is the thinnest of strawmen, but it sets the scene for the point you raised...) ...which is that in both cases the dma-ranges will still be identical. So how is the kernel going to know whether to steal that whole area from memblock before anything else can allocate from it, or not? I don't disagree that even in Claire's original intended case it would be semantically correct to describe the hardware-firewalled region with dma-ranges. It just turns out not to be necessary, and you're already arguing for not adding anything in DT that doesn't need to be. What we have here is a device that's not allowed to see *kernel* memory at all. It's been artificially constrained to a particular region by a TZASC or similar, and the only data which should ever be placed in that May have been constrained, but that's entirely optional. In the optional case where the setup is entirely up to the OS, I don't think this belongs in the DT at all. Perhaps that should be solved first. Yes! Let's definitely consider that case! Say you don't have any security or physical limitations but want to use a bounce pool for some device anyway because reasons (perhaps copying streaming DMA data to a better guaranteed alignment gives an overall performance win). Now the *only* relevant thing to communicate to the kernel is to, ahem, reserve a large chunk of memory, and use it for this special purpose. Isn't that literally what reserved-memory bindings are for?
region is data intended for that device to see. That way if it tries to go rogue it physically can't start slurping data intended for other devices or not mapped for DMA at all. The bouncing is an important part of this - I forget the title off-hand but there was an interesting paper a few years ago which demonstrated that even with an IOMMU, streaming DMA of in-place buffers could reveal enough adjacent data from the same page to mount an attack on the system. Memory pressure should be immaterial since the size of each bounce pool carveout will presumably be tuned for the needs of the given device. In any case, wouldn't finding all the dma-ranges do this? We're already walking the tree to find the max DMA address now. If all you can see are two "dma-ranges" properties, how do you propose to tell that one means "this is the extent of what I can address, please set my masks and dma-range-map accordingly and try to allocate things where I can reach them" while the other means "take this output range away from the page allocator and hook it up as my dedicated bounce pool, because it is Serious Security Time"? Especially since getting that choice wrong either way would be a Bad Thing. Either we have some heuristic based on the size or we add some hint. The point is let's build on what we already have for defining DMA
Re: [RFC PATCH v3 5/6] dt-bindings: of: Add restricted DMA pool
On 2021-01-20 21:31, Rob Herring wrote: On Wed, Jan 20, 2021 at 11:30 AM Robin Murphy wrote: On 2021-01-20 16:53, Rob Herring wrote: On Wed, Jan 06, 2021 at 11:41:23AM +0800, Claire Chang wrote: Introduce the new compatible string, restricted-dma-pool, for restricted DMA. One can specify the address and length of the restricted DMA memory region by restricted-dma-pool in the device tree. If this goes into DT, I think we should be able to use dma-ranges for this purpose instead. Normally, 'dma-ranges' is for physical bus restrictions, but there's no reason it can't be used for policy or to express restrictions the firmware has enabled. There would still need to be some way to tell SWIOTLB to pick up the corresponding chunk of memory and to prevent the kernel from using it for anything else, though. Don't we already have that problem if dma-ranges had a very small range? We just get lucky because the restriction is generally much more RAM than needed. Not really - if a device has a naturally tiny addressing capability that doesn't even cover ZONE_DMA32 where the regular SWIOTLB buffer will be allocated then it's unlikely to work well, but that's just crap system design. Yes, memory pressure in ZONE_DMA{32} is particularly problematic for such limited devices, but it's irrelevant to the issue at hand here. What we have here is a device that's not allowed to see *kernel* memory at all. It's been artificially constrained to a particular region by a TZASC or similar, and the only data which should ever be placed in that region is data intended for that device to see. That way if it tries to go rogue it physically can't start slurping data intended for other devices or not mapped for DMA at all. The bouncing is an important part of this - I forget the title off-hand but there was an interesting paper a few years ago which demonstrated that even with an IOMMU, streaming DMA of in-place buffers could reveal enough adjacent data from the same page to mount an attack on the system. Memory pressure should be immaterial since the size of each bounce pool carveout will presumably be tuned for the needs of the given device. In any case, wouldn't finding all the dma-ranges do this? We're already walking the tree to find the max DMA address now. If all you can see are two "dma-ranges" properties, how do you propose to tell that one means "this is the extent of what I can address, please set my masks and dma-range-map accordingly and try to allocate things where I can reach them" while the other means "take this output range away from the page allocator and hook it up as my dedicated bounce pool, because it is Serious Security Time"? Especially since getting that choice wrong either way would be a Bad Thing. Robin. Signed-off-by: Claire Chang --- .../reserved-memory/reserved-memory.txt | 24 +++ 1 file changed, 24 insertions(+) diff --git a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt index e8d3096d922c..44975e2a1fd2 100644 --- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt +++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt @@ -51,6 +51,20 @@ compatible (optional) - standard definition used as a shared pool of DMA buffers for a set of devices. It can be used by an operating system to instantiate the necessary pool management subsystem if necessary. 
+- restricted-dma-pool: This indicates a region of memory meant to be + used as a pool of restricted DMA buffers for a set of devices. The + memory region would be the only region accessible to those devices. + When using this, the no-map and reusable properties must not be set, + so the operating system can create a virtual mapping that will be used + for synchronization. The main purpose for restricted DMA is to + mitigate the lack of DMA access control on systems without an IOMMU, + which could result in the DMA accessing the system memory at + unexpected times and/or unexpected addresses, possibly leading to data + leakage or corruption. The feature on its own provides a basic level + of protection against the DMA overwriting buffer contents at + unexpected times. However, to protect against general data leakage and + system memory corruption, the system needs to provide a way to restrict + the DMA to a predefined memory region. - vendor specific string in the form <vendor>,[<device>-]<usage> no-map (optional) - empty property - Indicates the operating system must not create a virtual mapping @@ -120,6 +134,11 @@ one for multimedia processing (named multimedia-memory@77000000, 64MiB). compatible = "acme,multimedia-memory";
Re: [RFC PATCH v3 5/6] dt-bindings: of: Add restricted DMA pool
On 2021-01-20 16:53, Rob Herring wrote: On Wed, Jan 06, 2021 at 11:41:23AM +0800, Claire Chang wrote: Introduce the new compatible string, restricted-dma-pool, for restricted DMA. One can specify the address and length of the restricted DMA memory region by restricted-dma-pool in the device tree. If this goes into DT, I think we should be able to use dma-ranges for this purpose instead. Normally, 'dma-ranges' is for physical bus restrictions, but there's no reason it can't be used for policy or to express restrictions the firmware has enabled. There would still need to be some way to tell SWIOTLB to pick up the corresponding chunk of memory and to prevent the kernel from using it for anything else, though. Signed-off-by: Claire Chang --- .../reserved-memory/reserved-memory.txt | 24 +++ 1 file changed, 24 insertions(+) diff --git a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt index e8d3096d922c..44975e2a1fd2 100644 --- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt +++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt @@ -51,6 +51,20 @@ compatible (optional) - standard definition used as a shared pool of DMA buffers for a set of devices. It can be used by an operating system to instantiate the necessary pool management subsystem if necessary. +- restricted-dma-pool: This indicates a region of memory meant to be + used as a pool of restricted DMA buffers for a set of devices. The + memory region would be the only region accessible to those devices. + When using this, the no-map and reusable properties must not be set, + so the operating system can create a virtual mapping that will be used + for synchronization. The main purpose for restricted DMA is to + mitigate the lack of DMA access control on systems without an IOMMU, + which could result in the DMA accessing the system memory at + unexpected times and/or unexpected addresses, possibly leading to data + leakage or corruption. The feature on its own provides a basic level + of protection against the DMA overwriting buffer contents at + unexpected times. However, to protect against general data leakage and + system memory corruption, the system needs to provide a way to restrict + the DMA to a predefined memory region. - vendor specific string in the form <vendor>,[<device>-]<usage> no-map (optional) - empty property - Indicates the operating system must not create a virtual mapping @@ -120,6 +134,11 @@ one for multimedia processing (named multimedia-memory@77000000, 64MiB). compatible = "acme,multimedia-memory"; reg = <0x77000000 0x4000000>; }; + + restricted_dma_mem_reserved: restricted_dma_mem_reserved { + compatible = "restricted-dma-pool"; + reg = <0x50000000 0x400000>; + }; }; /* ... */ @@ -138,4 +157,9 @@ one for multimedia processing (named multimedia-memory@77000000, 64MiB). memory-region = <&multimedia_reserved>; /* ... */ }; + + pcie_device: pcie_device@0,0 { + memory-region = <&restricted_dma_mem_reserved>; PCI hosts often have inbound window configurations that limit the address range and translate PCI to bus addresses. Those windows happen to be configured by dma-ranges. In any case, wouldn't you want to put the configuration in the PCI host node? Is there a usecase of restricting one PCIe device and not another? The general design seems to accommodate devices having their own pools such that they can't even snoop on each others' transient DMA data.
If the interconnect had a way of wiring up, say, PCI RIDs to AMBA NSAIDs, then in principle you could certainly apply that to PCI endpoints too (presumably you'd also disallow them from peer-to-peer transactions at the PCI level too). Robin.
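For readers following the binding debate, the consumer side being proposed plumbs through the existing reserved-memory machinery; a condensed sketch of that hookup (setup of the io_tlb_mem itself and all error handling omitted):

	#include <linux/of_reserved_mem.h>

	/* Bind a "restricted-dma-pool" reserved-memory node to each device
	 * that references it via memory-region. Condensed from the RFC. */
	static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
					    struct device *dev)
	{
		/* initialise an io_tlb_mem over rmem->base/rmem->size once,
		 * cache it in rmem->priv, then point the device at it */
		dev->dma_io_tlb_mem = rmem->priv;
		return 0;
	}

	static const struct reserved_mem_ops rmem_swiotlb_ops = {
		.device_init = rmem_swiotlb_device_init,
	};

	static int __init rmem_swiotlb_setup(struct reserved_mem *rmem)
	{
		rmem->ops = &rmem_swiotlb_ops;
		return 0;
	}
	RESERVEDMEM_OF_DECLARE(dma, "restricted-dma-pool", rmem_swiotlb_setup);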
Re: [RFC PATCH v3 4/6] swiotlb: Add restricted DMA alloc/free support.
On 2021-01-13 12:48, Christoph Hellwig wrote: +#ifdef CONFIG_SWIOTLB + if (unlikely(dev->dma_io_tlb_mem)) + return swiotlb_alloc(dev, size, dma_handle, attrs); +#endif Another place where the dma_io_tlb_mem is useful to avoid the ifdef. -phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr, - size_t mapping_size, size_t alloc_size, - enum dma_data_direction dir, unsigned long attrs) +static int swiotlb_tbl_find_free_region(struct device *hwdev, + dma_addr_t tbl_dma_addr, + size_t alloc_size, + unsigned long attrs) +static void swiotlb_tbl_release_region(struct device *hwdev, int index, + size_t size) This refactoring should be another prep patch. +void *swiotlb_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, + unsigned long attrs) I'd rather have the names convey they are for the per-device bounce buffer in some form. + struct io_tlb_mem *mem = dev->dma_io_tlb_mem; While we're at it I wonder if the io_tlb is something we could change while we're at it. Maybe replace io_tlb_mem with struct swiotlb and rename the field in struct device to dev_swiotlb? + int index; + void *vaddr; + phys_addr_t tlb_addr; + + size = PAGE_ALIGN(size); + index = swiotlb_tbl_find_free_region(dev, mem->start, size, attrs); + if (index < 0) + return NULL; + + tlb_addr = mem->start + (index << IO_TLB_SHIFT); + *dma_handle = phys_to_dma_unencrypted(dev, tlb_addr); + + if (!dev_is_dma_coherent(dev)) { + unsigned long pfn = PFN_DOWN(tlb_addr); + + /* remove any dirty cache lines on the kernel alias */ + arch_dma_prep_coherent(pfn_to_page(pfn), size); Can we hook in somewhat lower level in the dma-direct code so that all the remapping in dma-direct can be reused instead of duplicated? That also becomes important if we want to use non-remapping uncached support, e.g. on mips or x86, or the direct changing of the attributes that Will planned to look into for arm64. Indeed, AFAICS this ought to boil down to a direct equivalent of __dma_direct_alloc_pages() - other than the address there should be no conceptual difference between pages from the restricted pool and those from the regular page allocator, so this probably deserves to be plumbed in as an alternative to that. Robin.
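Robin's point about plumbing in as an alternative to __dma_direct_alloc_pages() amounts to dispatching at the existing page-source decision; a minimal sketch of that shape (both helper names are made up purely for illustration):

	/* Sketch only: hook the restricted pool in where dma-direct picks its
	 * page source, so the existing remapping and uncached handling is
	 * reused rather than duplicated in swiotlb. dev_has_restricted_pool()
	 * and restricted_pool_alloc_pages() are hypothetical names. */
	static struct page *dma_direct_get_pages(struct device *dev, size_t size,
						 gfp_t gfp)
	{
		if (dev_has_restricted_pool(dev))
			return restricted_pool_alloc_pages(dev, size);

		return alloc_pages(gfp, get_order(size));
	}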
Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool
On 2021-01-13 17:43, Florian Fainelli wrote: On 1/13/21 7:27 AM, Robin Murphy wrote: On 2021-01-13 13:59, Nicolas Saenz Julienne wrote: Hi All, On Tue, 2021-01-12 at 16:03 -0800, Florian Fainelli wrote: On 1/5/21 7:41 PM, Claire Chang wrote: Add the initialization function to create restricted DMA pools from matching reserved-memory nodes in the device tree. Signed-off-by: Claire Chang --- include/linux/device.h | 4 ++ include/linux/swiotlb.h | 7 +- kernel/dma/Kconfig | 1 + kernel/dma/swiotlb.c | 144 ++-- 4 files changed, 131 insertions(+), 25 deletions(-) diff --git a/include/linux/device.h b/include/linux/device.h index 89bb8b84173e..ca6f71ec8871 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -413,6 +413,7 @@ struct dev_links_info { * @dma_pools: Dma pools (if dma'ble device). * @dma_mem: Internal for coherent mem override. * @cma_area: Contiguous memory area for dma allocations + * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override. * @archdata: For arch-specific additions. * @of_node: Associated device tree node. * @fwnode: Associated device node supplied by platform firmware. @@ -515,6 +516,9 @@ struct device { #ifdef CONFIG_DMA_CMA struct cma *cma_area; /* contiguous memory area for dma allocations */ +#endif +#ifdef CONFIG_SWIOTLB + struct io_tlb_mem *dma_io_tlb_mem; #endif /* arch specific additions */ struct dev_archdata archdata; diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index dd8eb57cbb8f..a1bbd775 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -76,12 +76,13 @@ extern enum swiotlb_force swiotlb_force; * * @start: The start address of the swiotlb memory pool. Used to do a quick * range check to see if the memory was in fact allocated by this - * API. + * API. For restricted DMA pool, this is device tree adjustable. Maybe write it as this is "firmware adjustable" such that when/if ACPI needs something like this, the description does not need updating. TBH I really don't think this needs calling out at all. Even in the regular case, the details of exactly how and where the pool is allocated are beyond the scope of this code - architectures already have several ways to control that and make their own decisions. [snip] +static int rmem_swiotlb_device_init(struct reserved_mem *rmem, + struct device *dev) +{ + struct io_tlb_mem *mem = rmem->priv; + int ret; + + if (dev->dma_io_tlb_mem) + return -EBUSY; + + if (!mem) { + mem = kzalloc(sizeof(*mem), GFP_KERNEL); + if (!mem) + return -ENOMEM; + + if (!memremap(rmem->base, rmem->size, MEMREMAP_WB)) { MEMREMAP_WB sounds appropriate as a default. As per the binding 'no-map' has to be disabled here. So AFAIU, this memory will be part of the linear mapping. Is this really needed then? More than that, I'd assume that we *have* to use the linear/direct map address rather than anything that has any possibility of being a vmalloc remap, otherwise we can no longer safely rely on phys_to_dma/dma_to_phys, no? I believe you are right, which means that if we want to make use of the restricted DMA pool on a 32-bit architecture (and we do, at least, I do) we should probably add some error checking/warning to ensure the restricted DMA pool falls within the linear map. Oh, good point - I'm so used to 64-bit that I instinctively just blanked out the !PageHighMem() condition in try_ram_remap(). So maybe the original intent here *was* to effectively just implement that check, but if so it could still do with being a lot more explicit. Cheers, Robin.
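A sketch of the 32-bit sanity check being suggested, e.g. early in rmem_swiotlb_device_init() where rmem and dev are in scope, assuming the pool has already passed the usual reserved-memory validation:

	/* Refuse a restricted DMA pool that isn't covered by the linear map,
	 * since the code relies on phys_to_dma()/dma_to_phys() round-trips.
	 * This can't trigger on 64-bit; on 32-bit it catches highmem
	 * carveouts. */
	if (PageHighMem(pfn_to_page(PHYS_PFN(rmem->base)))) {
		dev_err(dev, "restricted DMA pool must be within the linear mapping\n");
		return -EINVAL;
	}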
Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool
On 2021-01-13 13:59, Nicolas Saenz Julienne wrote: Hi All, On Tue, 2021-01-12 at 16:03 -0800, Florian Fainelli wrote: On 1/5/21 7:41 PM, Claire Chang wrote: Add the initialization function to create restricted DMA pools from matching reserved-memory nodes in the device tree. Signed-off-by: Claire Chang --- include/linux/device.h | 4 ++ include/linux/swiotlb.h | 7 +- kernel/dma/Kconfig | 1 + kernel/dma/swiotlb.c| 144 ++-- 4 files changed, 131 insertions(+), 25 deletions(-) diff --git a/include/linux/device.h b/include/linux/device.h index 89bb8b84173e..ca6f71ec8871 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -413,6 +413,7 @@ struct dev_links_info { * @dma_pools:Dma pools (if dma'ble device). * @dma_mem: Internal for coherent mem override. * @cma_area: Contiguous memory area for dma allocations + * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override. * @archdata: For arch-specific additions. * @of_node: Associated device tree node. * @fwnode: Associated device node supplied by platform firmware. @@ -515,6 +516,9 @@ struct device { #ifdef CONFIG_DMA_CMA struct cma *cma_area; /* contiguous memory area for dma allocations */ +#endif +#ifdef CONFIG_SWIOTLB + struct io_tlb_mem *dma_io_tlb_mem; #endif /* arch specific additions */ struct dev_archdata archdata; diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index dd8eb57cbb8f..a1bbd775 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -76,12 +76,13 @@ extern enum swiotlb_force swiotlb_force; * * @start:The start address of the swiotlb memory pool. Used to do a quick *range check to see if the memory was in fact allocated by this - * API. + * API. For restricted DMA pool, this is device tree adjustable. Maybe write it as this is "firmware adjustable" such that when/if ACPI needs something like this, the description does not need updating. TBH I really don't think this needs calling out at all. Even in the regular case, the details of exactly how and where the pool is allocated are beyond the scope of this code - architectures already have several ways to control that and make their own decisions. [snip] +static int rmem_swiotlb_device_init(struct reserved_mem *rmem, + struct device *dev) +{ + struct io_tlb_mem *mem = rmem->priv; + int ret; + + if (dev->dma_io_tlb_mem) + return -EBUSY; + + if (!mem) { + mem = kzalloc(sizeof(*mem), GFP_KERNEL); + if (!mem) + return -ENOMEM; + + if (!memremap(rmem->base, rmem->size, MEMREMAP_WB)) { MEMREMAP_WB sounds appropriate as a default. As per the binding 'no-map' has to be disabled here. So AFAIU, this memory will be part of the linear mapping. Is this really needed then? More than that, I'd assume that we *have* to use the linear/direct map address rather than anything that has any possibility of being a vmalloc remap, otherwise we can no longer safely rely on phys_to_dma/dma_to_phys, no? That said, given that we're not actually using the returned address, I'm not entirely sure what the point of this call is anyway. If we can assume it's always going to return the linear map address via try_ram_remap() then we can equally just go ahead and use the linear map address straight away. I don't really see how we could ever hit the "is_ram == REGION_MIXED" case in memremap() that would return NULL, if we passed the memblock check earlier in __reserved_mem_alloc_size() such that this rmem node ever got to be initialised at all. Robin. 
Documentation/devicetree/bindings/reserved-memory/ramoops.txt does define an "unbuffered" property which in principle could be applied to the generic reserved memory binding as well and that we may have to be honoring here, if we were to make it more generic. Oh well, this does not need to be addressed right now I guess.
Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool
On 2021-01-07 17:57, Konrad Rzeszutek Wilk wrote: On Fri, Jan 08, 2021 at 01:39:18AM +0800, Claire Chang wrote: Hi Greg and Konrad, This change is intended to be non-arch specific. Any arch that lacks DMA access control and has devices not behind an IOMMU can make use of it. Could you share why you think this should be arch specific? The idea behind non-arch specific code is for it to be generic. The devicetree is specific to PowerPC, Sparc, and ARM, and not to x86 - hence it should be in arch specific code. Sorry, but that's an absurd argument. By the same token you'd equally have to claim that bits of, say, the Broadcom WiFi driver (not to mention dozens of others) should be split out into arch code, since not all platforms use the devicetree parts, nor the ACPI parts, nor the PCI parts... There is nothing architecture-specific about using devicetree as a system description - AFAIK there *are* a handful of x86 platforms that use it, besides even more architectures than you've listed above. It has long been the policy that devicetree-related code for a particular subsystem should just live within that subsystem. Sometimes if there's enough of it it gets collected together into its own file - e.g. drivers/pci/of.c - otherwise it tends to just get #ifdef'ed - e.g. of_spi_parse_dt(), or the other DMA reserved-memory consumers that already exist as Florian points out. Besides, there are far more platforms that enable CONFIG_OF than enable CONFIG_SWIOTLB, so by that metric the whole of the SWIOTLB code itself is even less "generic" than any DT parsing :P Robin.
Re: [trivial PATCH] treewide: Convert switch/case fallthrough; to break;
On 2020-09-09 21:06, Joe Perches wrote: fallthrough to a separate case/default label break; isn't very readable. Convert pseudo-keyword fallthrough; statements to a simple break; when the next label is case or default and the only statement in the next label block is break; Found using: $ grep-2.5.4 -rP --include=*.[ch] -n "fallthrough;(\s*(case\s+\w+|default)\s*:\s*){1,7}break;" * Miscellanea: o Move or coalesce a couple label blocks above a default: block. Signed-off-by: Joe Perches --- Compiled allyesconfig x86-64 only. A few files for other arches were not compiled. [...] diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index c192544e874b..743db1abec40 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -3777,7 +3777,7 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) switch (FIELD_GET(IDR0_TTF, reg)) { case IDR0_TTF_AARCH32_64: smmu->ias = 40; - fallthrough; + break; case IDR0_TTF_AARCH64: break; default: I have to say I don't really agree with the readability argument for this one - a fallthrough is semantically correct here, since the first case is a superset of the second. It just happens that anything we would do for the common subset is implicitly assumed (there are other potential cases we simply haven't added support for at the moment), thus the second case is currently empty. This change actively obfuscates that distinction. Robin.
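If readability is the concern, the better fix is arguably to keep the semantically-correct fallthrough and document the superset relationship instead, e.g. (an illustrative rewrite, not a posted patch; the default case is reproduced from the existing driver):

	switch (FIELD_GET(IDR0_TTF, reg)) {
	case IDR0_TTF_AARCH32_64:
		smmu->ias = 40;
		/* AArch32+AArch64 support is a superset of AArch64-only
		 * support, so deliberately share the common AArch64 path. */
		fallthrough;
	case IDR0_TTF_AARCH64:
		break;
	default:
		dev_err(smmu->dev, "AArch64 table format not supported!\n");
		return -ENXIO;
	}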
Re: [PATCH 1/2] dma-mapping: add a dma_ops_bypass flag to struct device
On 2020-03-20 2:16 pm, Christoph Hellwig wrote: Several IOMMU drivers have a bypass mode where they can use a direct mapping if the device's DMA mask is large enough. Add generic support to the core dma-mapping code to do that to switch those drivers to a common solution. Hmm, this is _almost_, but not quite the same as the case where drivers are managing their own IOMMU mappings, but still need to use streaming DMA for cache maintenance on the underlying pages. For that we need the ops bypass to be a "true" bypass and also avoid SWIOTLB regardless of the device's DMA mask. That behaviour should in fact be fine for the opportunistic bypass case here as well, since the mask being "big enough" implies by definition that this should never need to bounce either. For the (hopefully less common) third case where, due to groups or user overrides, we end up giving an identity DMA domain to a device with limited DMA masks which _does_ need SWIOTLB, I'd like to think we can solve that by not giving the device IOMMU DMA ops in the first place, such that it never needs to engage the bypass mechanism at all. Thoughts? Robin. Signed-off-by: Christoph Hellwig --- include/linux/device.h | 6 ++ include/linux/dma-mapping.h | 30 ++ kernel/dma/mapping.c | 36 +++- 3 files changed, 51 insertions(+), 21 deletions(-) diff --git a/include/linux/device.h b/include/linux/device.h index 0cd7c647c16c..09be8bb2c4a6 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -525,6 +525,11 @@ struct dev_links_info { * sync_state() callback. * @dma_coherent: this particular device is dma coherent, even if the * architecture supports non-coherent devices. + * @dma_ops_bypass: If set to %true then the dma_ops are bypassed for the + * streaming DMA operations (->map_* / ->unmap_* / ->sync_*), + * and optionally (if the coherent mask is large enough) also + * for dma allocations. This flag is managed by the dma ops + * instance from ->dma_supported. * * At the lowest level, every device in a Linux system is represented by an * instance of struct device. The device structure contains the information @@ -625,6 +630,7 @@ struct device { defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) bool dma_coherent:1; #endif + bool dma_ops_bypass : 1; }; static inline struct device *kobj_to_dev(struct kobject *kobj) diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 330ad58fbf4d..c3af0cf5e435 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -188,9 +188,15 @@ static inline int dma_mmap_from_global_coherent(struct vm_area_struct *vma, } #endif /* CONFIG_DMA_DECLARE_COHERENT */ -static inline bool dma_is_direct(const struct dma_map_ops *ops) +/* + * Check if the device uses a direct mapping for streaming DMA operations. + * This allows IOMMU drivers to set a bypass mode if the DMA mask is large + * enough.
+ */ +static inline bool dma_map_direct(struct device *dev, + const struct dma_map_ops *ops) { - return likely(!ops); + return likely(!ops) || dev->dma_ops_bypass; } /* @@ -279,7 +285,7 @@ static inline dma_addr_t dma_map_page_attrs(struct device *dev, dma_addr_t addr; BUG_ON(!valid_dma_direction(dir)); - if (dma_is_direct(ops)) + if (dma_map_direct(dev, ops)) addr = dma_direct_map_page(dev, page, offset, size, dir, attrs); else addr = ops->map_page(dev, page, offset, size, dir, attrs); @@ -294,7 +300,7 @@ static inline void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, const struct dma_map_ops *ops = get_dma_ops(dev); BUG_ON(!valid_dma_direction(dir)); - if (dma_is_direct(ops)) + if (dma_map_direct(dev, ops)) dma_direct_unmap_page(dev, addr, size, dir, attrs); else if (ops->unmap_page) ops->unmap_page(dev, addr, size, dir, attrs); @@ -313,7 +319,7 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int ents; BUG_ON(!valid_dma_direction(dir)); - if (dma_is_direct(ops)) + if (dma_map_direct(dev, ops)) ents = dma_direct_map_sg(dev, sg, nents, dir, attrs); else ents = ops->map_sg(dev, sg, nents, dir, attrs); @@ -331,7 +337,7 @@ static inline void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg BUG_ON(!valid_dma_direction(dir)); debug_dma_unmap_sg(dev, sg, nents, dir); - if (dma_is_direct(ops)) + if (dma_map_direct(dev, ops)) dma_direct_unmap_sg(dev, sg, nents, dir, attrs); else if (ops->unmap_sg) ops->unmap_sg(dev, sg, nents, dir, attrs); @@ -352,7 +358,7 @@ static inline
Re: [PATCH -v2] treewide: Rename "unencrypted" to "decrypted"
[since this is in my inbox...] On 2020-03-19 10:28 am, Borislav Petkov wrote: On Thu, Mar 19, 2020 at 11:20:11AM +0100, Christoph Hellwig wrote: I thought we agreed that decrypted is absolutely the wrong term. I don't think we did. At least I don't know where we did that. So NAK - if you want to change things it needs to go the other way. We are already using "decrypted" everywhere in arch/x86/. Changing that would be a *lot* more churn. And it is just a term, for chrissakes, to denote memory which is not encrypted. And it would make our lives easier if we had only *two* terms instead of three or more. Especially if the concept we denote with this is a binary one: encrypted memory and *not* encrypted memory. Let me add another vote from a native English speaker that "unencrypted" is the appropriate term to imply the *absence* of encryption, whereas "decrypted" implies the *reversal* of applied encryption. Naming things is famously hard, for good reason - names are *important* for understanding. Just because a decision was already made one way doesn't mean that that decision was necessarily right. Churning one area to be consistently inaccurate just because it's less work than churning another area to be consistently accurate isn't really the best excuse. Robin.
Re: [PATCH v2] dma-mapping: treat dev->bus_dma_mask as a DMA limit
On 2019-11-26 6:51 pm, Nicolas Saenz Julienne wrote: On Mon, 2019-11-25 at 16:33 +, Robin Murphy wrote: On 25/11/2019 7:44 am, Christoph Hellwig wrote: On Sat, Nov 23, 2019 at 09:51:08AM -0700, Nathan Chancellor wrote: Just as an FYI, this introduces a warning on arm32 allyesconfig for me: I think the dma_limit argument to iommu_dma_alloc_iova should be a u64 and/or we need to use min_t and open code the zero exception. Robin, Nicolas - any opinions? Yeah, given that it's always held a mask I'm not entirely sure why it was ever a dma_addr_t rather than a u64. Unless anyone else is desperate to do it I'll get a cleanup patch ready for rc1. Sounds good to me too Robin, since I started the mess, I'll be happy to do it if it helps offloading some work from you. No worries - your change only exposed my original weird decision ;) On second look the patch was literally a trivial one-liner, so I've written it up already. Cheers, Robin.
Re: [PATCH v2] dma-mapping: treat dev->bus_dma_mask as a DMA limit
On 25/11/2019 7:44 am, Christoph Hellwig wrote: On Sat, Nov 23, 2019 at 09:51:08AM -0700, Nathan Chancellor wrote: Just as an FYI, this introduces a warning on arm32 allyesconfig for me: I think the dma_limit argument to iommu_dma_alloc_iova should be a u64 and/or we need to use min_t and open code the zero exception. Robin, Nicolas - any opinions? Yeah, given that it's always held a mask I'm not entirely sure why it was ever a dma_addr_t rather than a u64. Unless anyone else is desperate to do it I'll get a cleanup patch ready for rc1. Also I wonder how this file gets compiled on arm32 given that arm32 has its own set of iommu dma ops.. As long as the dependencies for CONFIG_IOMMU_DMA are met it can be built even when it's not actually used. That said, I might have expected that arm allyesconfig ends up with CONFIG_ARCH_DMA_ADDR_T_64BIT=y anyway; I guess it must pick some of CONFIG_ARM_LPAE's negative dependencies. (/me doesn't feel like jumping down the all*config rabbit hole today) Robin.
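The eventual cleanup is small: carry the limit as a fixed-width u64 rather than dma_addr_t, so 32-bit builds neither truncate it nor trip min_t() warnings. A sketch of the idea (the helper name is hypothetical; dev->bus_dma_limit is the field introduced by the patch below):

	/* Clamp a device's masks into one u64 limit; doing the comparison in
	 * u64 avoids truncation when dma_addr_t is only 32-bit. */
	static u64 device_dma_limit(struct device *dev)
	{
		u64 limit = dma_get_mask(dev);

		if (dev->bus_dma_limit)
			limit = min(limit, dev->bus_dma_limit);
		return limit;
	}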
Re: [PATCH v2] dma-mapping: treat dev->bus_dma_mask as a DMA limit
On 21/11/2019 9:26 am, Nicolas Saenz Julienne wrote: Using a mask to represent bus DMA constraints has a set of limitations. The biggest one being it can only hold a power of two (minus one). The DMA mapping code is already aware of this and treats dev->bus_dma_mask as a limit. This quirk is already used by some architectures although still rare. With the introduction of the Raspberry Pi 4 we've found a new contender for the use of bus DMA limits, as its PCIe bus can only address the lower 3GB of memory (of a total of 4GB). This is impossible to represent with a mask. To make things worse the device-tree code rounds non power of two bus DMA limits to the next power of two, which is unacceptable in this case. In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all over the tree and treat it as such. Note that dev->bus_dma_limit should contain the higher accesible DMA address. ^^ super-nit only because I can't not see my editor currently highlighting the typo: "accessible" Regardless of that though, Reviewed-by: Robin Murphy Signed-off-by: Nicolas Saenz Julienne --- Changes since v1: - rework ACPI code to avoid divergence with OF's arch/mips/pci/fixup-sb1250.c | 16 arch/powerpc/sysdev/fsl_pci.c | 6 +++--- arch/x86/kernel/pci-dma.c | 2 +- arch/x86/mm/mem_encrypt.c | 2 +- arch/x86/pci/sta2x11-fixup.c | 2 +- drivers/acpi/arm64/iort.c | 20 +++- drivers/ata/ahci.c| 2 +- drivers/iommu/dma-iommu.c | 3 +-- drivers/of/device.c | 9 + include/linux/device.h| 6 +++--- include/linux/dma-direct.h| 2 +- include/linux/dma-mapping.h | 2 +- kernel/dma/direct.c | 27 +-- 13 files changed, 46 insertions(+), 53 deletions(-) diff --git a/arch/mips/pci/fixup-sb1250.c b/arch/mips/pci/fixup-sb1250.c index 8a41b359cf90..40efc990cdce 100644 --- a/arch/mips/pci/fixup-sb1250.c +++ b/arch/mips/pci/fixup-sb1250.c @@ -21,22 +21,22 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SIBYTE, PCI_DEVICE_ID_BCM1250_PCI, /* * The BCM1250, etc. PCI host bridge does not support DAC on its 32-bit - * bus, so we set the bus's DMA mask accordingly. However the HT link + * bus, so we set the bus's DMA limit accordingly. However the HT link * down the artificial PCI-HT bridge supports 40-bit addressing and the * SP1011 HT-PCI bridge downstream supports both DAC and a 64-bit bus * width, so we record the PCI-HT bridge's secondary and subordinate bus - * numbers and do not set the mask for devices present in the inclusive + * numbers and do not set the limit for devices present in the inclusive * range of those. 
*/ -struct sb1250_bus_dma_mask_exclude { +struct sb1250_bus_dma_limit_exclude { bool set; unsigned char start; unsigned char end; }; -static int sb1250_bus_dma_mask(struct pci_dev *dev, void *data) +static int sb1250_bus_dma_limit(struct pci_dev *dev, void *data) { - struct sb1250_bus_dma_mask_exclude *exclude = data; + struct sb1250_bus_dma_limit_exclude *exclude = data; bool exclude_this; bool ht_bridge; @@ -55,7 +55,7 @@ static int sb1250_bus_dma_mask(struct pci_dev *dev, void *data) exclude->start, exclude->end); } else { dev_dbg(>dev, "disabling DAC for device"); - dev->dev.bus_dma_mask = DMA_BIT_MASK(32); + dev->dev.bus_dma_limit = DMA_BIT_MASK(32); } return 0; @@ -63,9 +63,9 @@ static int sb1250_bus_dma_mask(struct pci_dev *dev, void *data) static void quirk_sb1250_pci_dac(struct pci_dev *dev) { - struct sb1250_bus_dma_mask_exclude exclude = { .set = false }; + struct sb1250_bus_dma_limit_exclude exclude = { .set = false }; - pci_walk_bus(dev->bus, sb1250_bus_dma_mask, ); + pci_walk_bus(dev->bus, sb1250_bus_dma_limit, ); } DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SIBYTE, PCI_DEVICE_ID_BCM1250_PCI, quirk_sb1250_pci_dac); diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c index ff0e2b156cb5..617a443d673d 100644 --- a/arch/powerpc/sysdev/fsl_pci.c +++ b/arch/powerpc/sysdev/fsl_pci.c @@ -115,8 +115,8 @@ static void pci_dma_dev_setup_swiotlb(struct pci_dev *pdev) { struct pci_controller *hose = pci_bus_to_host(pdev->bus); - pdev->dev.bus_dma_mask = - hose->dma_window_base_cur + hose->dma_window_size; + pdev->dev.bus_dma_limit = + hose->dma_window_base_cur + hose->dma_window_size - 1; } static void setup_swiotlb_ops(struct pci_controller *hose) @@ -135,7 +135,7 @@ static void fsl_pci_dma_set_mask(struct device *dev, u64 dma_mask) * mapping that allows addressing any RAM address from across PCI. */ if (dev_is_pci(dev) &&
Re: generic DMA bypass flag
On 21/11/2019 7:34 am, Christoph Hellwig wrote: Robin, does this mean you ACK this series for the powerpc use case? Yeah, I think we've nailed down sufficient justification now for having a generalised flag, so at that point it makes every bit of sense to convert PPC's private equivalent. Robin. Alexey and other ppc folks: can you take a look?
Re: [PATCH v2] dma-mapping: treat dev->bus_dma_mask as a DMA limit
On 21/11/2019 3:26 pm, Christoph Hellwig wrote: On Thu, Nov 21, 2019 at 04:24:57PM +0100, Christoph Hellwig wrote: On Thu, Nov 21, 2019 at 10:26:44AM +0100, Nicolas Saenz Julienne wrote: Using a mask to represent bus DMA constraints has a set of limitations. The biggest one being it can only hold a power of two (minus one). The DMA mapping code is already aware of this and treats dev->bus_dma_mask as a limit. This quirk is already used by some architectures although still rare. With the introduction of the Raspberry Pi 4 we've found a new contender for the use of bus DMA limits, as its PCIe bus can only address the lower 3GB of memory (of a total of 4GB). This is impossible to represent with a mask. To make things worse the device-tree code rounds non power of two bus DMA limits to the next power of two, which is unacceptable in this case. In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all over the tree and treat it as such. Note that dev->bus_dma_limit should contain the higher accesible DMA address. Signed-off-by: Nicolas Saenz Julienne I've tentatively added this patch to the dma-mapping tree based on Robin's in-principle approval of the last version. That way tomorrow's linux-next run should still pick it up. Actually. This doesn't apply because the dma-mapping tree doesn't have your zone_dma_bits change. I guess we'll need to wait for the next merge window, or maybe post rc1 if this happens to fix the powerpc problem that Christian reported. Hmm, there's no functional dependency though, is there? AFAICS it's essentially just a context conflict. Is it worth simply dropping (or postponing) the local renaming in __dma_direct_optimal_gfp_mask(), or perhaps even cross-merging arm64/for-next/zone-dma into dma/for-next? Robin.
Re: Bug 205201 - Booting halts if Dawicontrol DC-2976 UW SCSI board installed, unless RAM size limited to 3500M
On 21/11/2019 12:21 pm, Christian Zigotzky wrote: On 21 November 2019 at 01:16 pm, Christian Zigotzky wrote: On 21 November 2019 at 08:29 am, Christoph Hellwig wrote: On Sat, Nov 16, 2019 at 08:06:05AM +0100, Christian Zigotzky wrote: /* * DMA addressing mode. * * 0 : 32 bit addressing for all chips. * 1 : 40 bit addressing when supported by chip. * 2 : 64 bit addressing when supported by chip, * limited to 16 segments of 4 GB -> 64 GB max. */ #define SYM_CONF_DMA_ADDRESSING_MODE CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE Cyrus config: CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1 I will configure “0 : 32 bit addressing for all chips” for the RC8. Maybe this is the solution. 0 means you are going to do bounce buffering a lot, which seems generally like a bad idea. But why are we talking about the sym53c8xx driver now? The last issue you reported was about video4linux allocations. Both drivers have the same problem. They don't work if we have more than 3.5GB RAM. I try to find a solution until you have a good solution. I have already a solution for V4L but I still need one for the sym53c8xx driver. OK, you mean that "0" is a bad idea but maybe it works until you have a solution. ;-) Is this on the same machine with the funny non-power-of-two bus_dma_mask as your other report? If so, does Nicolas' latest patch[1] help at all? Robin. [1] https://lore.kernel.org/linux-iommu/20191121092646.8449-1-nsaenzjulie...@suse.de/T/#u
Re: generic DMA bypass flag
On 16/11/2019 6:22 am, Christoph Hellwig wrote: On Fri, Nov 15, 2019 at 06:12:48PM +0000, Robin Murphy wrote: And is that any different from where you would choose to "just" set a generic bypass flag? Same spots, as intel-iommu moves from the identity to a dma domain when setting a 32-bit mask. But that means once a 32-bit mask is set we can't ever go back to the 64-bit one. Is that a problem though? It's not safe in general to rewrite the default domain willy-nilly, so if it's a concern that drivers get stuck having to use a translation domain if they do something dumb like: if (!dma_set_mask(DMA_BIT_MASK(32))) dma_set_mask(DMA_BIT_MASK(64)); then the simple solution is "don't do that" - note that this doesn't affect overriding of the default 32-bit mask, because we don't use the driver API to initialise those. And we had a couple drivers playing interesting games there. If the games you're worried about are stuff like: dma_set_mask(dev, DMA_BIT_MASK(64)); high_buf = dma_alloc_coherent(dev, ...); dma_set_mask(dev, DMA_BIT_MASK(32)); low_buf = dma_alloc_coherent(dev, ...); then iommu_need_mapping() already ensures that will end spectacularly badly. Unless we can somehow log when a mask has been "committed" by a mapping operation, I don't think any kind of opportunistic bypass mechanism is ever not going to blow up that case. FYI, this is the current intel-iommu WIP conversion to the dma bypass flag: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-bypass Having thought a bit more, I guess my idea does end up with one slightly ugly corner wherein dma_direct_supported() has to learn to look for an IOMMU default domain and try iommu_dma_supported() before saying no, even if it's clean everywhere else. The bypass flag is more 'balanced' in terms of being equally invasive everywhere and preserving abstraction a bit better. Plus I think it might let us bring back the default assignment of dma_dummy_ops, which I do like the thought of :D Either way, making sure that the fundamental bypass decision is correct and robust is still far more important than the implementation details. Robin.
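For illustration, a minimal sketch of what "logging a committed mask" might look like; the dma_mask_committed field is hypothetical and does not exist in struct device:

/* Hypothetical: record the first mapping operation, then refuse mask
 * changes, so a bypass decision taken at map time cannot be invalidated. */
static int sketch_dma_set_mask(struct device *dev, u64 mask)
{
	if (dev->dma_mask_committed && mask != *dev->dma_mask)	/* hypothetical field */
		return -EBUSY;	/* mappings already exist under the old mask */

	*dev->dma_mask = mask;
	return 0;
}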
Re: [PATCH] dma-mapping: treat dev->bus_dma_mask as a DMA limit
On 19/11/2019 5:00 pm, Christoph Hellwig wrote: On Tue, Nov 19, 2019 at 01:57:43PM +0100, Nicolas Saenz Julienne wrote: Hi Rob & Christoph, do you mind if I append v2 of this into my upcoming v3 RPi4 PCIe support series, I didn't do it initially as I thought this was going to be a contentious patch. But as it turned out better than expected, I think it should go into the PCIe series. In the end it's the first explicit user of the bus DMA limit. Here's v2 in case you don't know what I'm talking about: https://www.spinics.net/lists/arm-kernel/msg768459.html In principle I wouldn't mind, but I think this is going to conflict quite badly with other changes in the dma-mapping tree (including yours). So I think we'll need a shared tree or I'll need to pull in the whole series through the dma-mapping tree if there are not other conflicts and the other maintainers are fine with it. TBH I can't see it being a massive problem even if the DMA patch, driver and DTS patch went entirely separately via the respective DMA, PCI, and arm-soc trees in the same cycle. Bisecting over a merge window is a big enough pain in the bum as it is, and if the worst case is that someone trying to do that on a Pi4 has a wonky PCI controller appear for a couple of commits, they may as well just disable that driver for their bisection, because it wasn't there at the start so can't possibly be the thing they're looking for regressions in ;) Robin.
Re: [PATCH 1/3] dma-direct: unify the dma_capable definitions
On 19/11/2019 10:26 am, Marek Szyprowski wrote: Hi Krzysztof, On 19.11.2019 10:44, Krzysztof Kozlowski wrote: On Tue, 19 Nov 2019 at 17:27, Marek Szyprowski wrote: Hi Christoph, On 13.11.2019 08:35, Christoph Hellwig wrote: Currently each architectures that wants to override dma_to_phys and phys_to_dma also has to provide dma_capable. But there isn't really any good reason for that. powerpc and mips just have copies of the generic one minus the latests fix, and the arm one was the inspiration for said fix, but misses the bus_dma_mask handling. Make all architectures use the generic version instead. Signed-off-by: Christoph Hellwig This patch breaks DMAengine PL330 driver on Samsung Exynos SoCs: Thanks Marek for bisecting it. I wonder whether it is also the cause for boot failures I see on NXP Vybrid VF50 SoC (NXP/Freescale fsl-edma) since few days: [ 2.853428] fsl-edma 40018000.dma-controller: overflow 0x40027007+1 of DMA mask bus mask 0 [ 2.862566] [ cut here ] [ 2.867273] WARNING: CPU: 0 PID: 1 at /home/buildbot/worker/builddir_yocto/build/build/tmp/work-shared/col-vf50-proceq-mainline-next/kernel-source/kernel/dma/direct.c:35 report_addr+0xc0/0xfc [ 2.884380] Modules linked in: [ 2.887486] CPU: 0 PID: 1 Comm: swapper Tainted: G W 5.4.0-rc7-next-20191118-g519ead8f6a32 #1 [ 2.897364] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree) [ 2.903892] [<8010ddfc>] (unwind_backtrace) from [<8010b4b8>] (show_stack+0x10/0x14) [ 2.911712] [<8010b4b8>] (show_stack) from [<8011b08c>] (__warn+0xd4/0xec) [ 2.918653] [<8011b08c>] (__warn) from [<8011b154>] (warn_slowpath_fmt+0xb0/0xb8) [ 2.926218] [<8011b154>] (warn_slowpath_fmt) from [<80155f7c>] (report_addr+0xc0/0xfc) [ 2.934221] [<80155f7c>] (report_addr) from [<801561f0>] (dma_direct_map_resource+0x98/0xa4) [ 2.942744] [<801561f0>] (dma_direct_map_resource) from [<8041d5d4>] (fsl_edma_prep_slave_dma+0x12c/0x150) [ 2.952475] [<8041d5d4>] (fsl_edma_prep_slave_dma) from [<8041d8cc>] (fsl_edma_prep_dma_cyclic+0x30/0x21c) [ 2.962213] [<8041d8cc>] (fsl_edma_prep_dma_cyclic) from [<80452e10>] (lpuart_rx_dma_startup+0x188/0x36c) [ 2.971871] [<80452e10>] (lpuart_rx_dma_startup) from [<80453058>] (lpuart_startup+0x64/0x78) [ 2.980477] [<80453058>] (lpuart_startup) from [<8044e924>] (uart_startup.part.7+0x110/0x23c) [ 2.989080] [<8044e924>] (uart_startup.part.7) from [<8044eaa0>] (uart_port_activate+0x50/0x7c) [ 2.997857] [<8044eaa0>] (uart_port_activate) from [<80438dc0>] (tty_port_open+0x74/0xc0) [ 3.006111] [<80438dc0>] (tty_port_open) from [<8044be30>] (uart_open+0x18/0x20) [ 3.013588] [<8044be30>] (uart_open) from [<80431b4c>] (tty_open+0x108/0x428) [ 3.020794] [<80431b4c>] (tty_open) from [<801edb48>] (chrdev_open+0xac/0x164) [ 3.028098] [<801edb48>] (chrdev_open) from [<801e55c8>] (do_dentry_open+0x22c/0x3e4) [ 3.036010] [<801e55c8>] (do_dentry_open) from [<801f72a8>] (path_openat+0x4a4/0xf78) [ 3.043912] [<801f72a8>] (path_openat) from [<801f8d34>] (do_filp_open+0x70/0xdc) [ 3.051472] [<801f8d34>] (do_filp_open) from [<801e6998>] (do_sys_open+0x128/0x1f4) [ 3.059217] [<801e6998>] (do_sys_open) from [<80a00ee0>] (kernel_init_freeable+0x150/0x1c4) [ 3.067658] [<80a00ee0>] (kernel_init_freeable) from [<8068e208>] (kernel_init+0x8/0x110) [ 3.075917] [<8068e208>] (kernel_init) from [<801010e8>] (ret_from_fork+0x14/0x2c) [ 3.083539] Exception stack(0x86843fb0 to 0x86843ff8) [ 3.088631] 3fa0: [ 3.096866] 3fc0: [ 3.105103] 3fe0: 0013 [ 3.111752] ---[ end trace 6fb699802256a309 ]--- [3.116423] fsl-lpuart 40027000.serial: Cannot prepare cyclic DMA 
[3.192968] VFS: Mounted root (nfs4 filesystem) on device 0:13. [3.201432] devtmpfs: mounted [3.210985] Freeing unused kernel memory: 1024K [3.217854] Run /sbin/init as init process [4.643355] systemd[1]: System time before build time, advancing clock. [4.774106] random: systemd: uninitialized urandom read (16 bytes read) [4.838361] systemd[1]: systemd 232 running in system mode. (-PAM -AUDIT -SELINUX -IMA -APPARMOR -SMACK +SYSVINIT +UTMP -LIBCRYPTSETUP -GCRYPT -GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID -ELFUTILS +KMOD -IDN) [4.858997] systemd[1]: Detected architecture arm. [4.873438] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA! [4.880138] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA! [4.886585] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA! [4.893124] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA! [4.899679] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA! [4.906110] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA! [4.912616] fsl-lpuart 40027000.serial: Cannot prepare TX slave DMA! Although maybe that's just the fsl-edma problem? This is the same issue. A quick looks at the dma-direct code revealed that
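For context, a simplified reconstruction (an illustration, not the exact upstream code) of the unified dma_capable() that this report bisects to; the low-address sanity check, meant to catch bogus RAM addresses, is what rejects MMIO resources such as the LPUART FIFO at 0x40027000, which sits below the start of RAM on these SoCs:

static bool sketch_unified_dma_capable(struct device *dev, dma_addr_t addr,
				       size_t size)
{
	dma_addr_t end = addr + size - 1;

	if (!dev->dma_mask)
		return false;

	/* Sane for RAM pages; wrong for dma_map_resource() of a device
	 * FIFO below RAM (e.g. 0x40027000 with RAM starting at 0x80000000),
	 * hence the report_addr() splat in the log above. */
	if (min(addr, end) < (dma_addr_t)(min_low_pfn << PAGE_SHIFT))
		return false;

	return end <= min_not_zero(*dev->dma_mask, dev->bus_dma_mask);
}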
Re: generic DMA bypass flag
On 14/11/2019 7:41 am, Christoph Hellwig wrote: On Wed, Nov 13, 2019 at 02:45:15PM +0000, Robin Murphy wrote: In all honesty, this seems silly. If we can set a per-device flag to say "oh, bypass these ops and use some other ops instead", then we can just as easily simply give the device the appropriate ops in the first place. Because, y'know, the name of the game is "per-device ops". Except that we can't do it _that_ easily. The problem is that for both the powerpc and intel case the selection is dynamic. If a device is in the identity domain with intel-iommu (or the equivalent on powerpc which doesn't use the normal iommu framework), we still want to use the iommu to be able to map memory for devices with a too small dma mask using the iommu instead of using swiotlb bouncing. So to "just" use the per-device dma ops we'd need: a) a hook in dma_direct_supported to pick another set of ops for small dma masks b) a hook in the IOMMU ops to propagate to the direct ops for full 64-bit masks And is that any different from where you would choose to "just" set a generic bypass flag? diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index d658c7c6a2ab..40e096d3dbc5 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2242,12 +2242,14 @@ request_default_domain_for_dev(struct device *dev, unsigned long type) /* Request that a device is direct mapped by the IOMMU */ int iommu_request_dm_for_dev(struct device *dev) { + set_dma_ops(dev, NULL); return request_default_domain_for_dev(dev, IOMMU_DOMAIN_IDENTITY); } /* Request that a device can't be direct mapped by the IOMMU */ int iommu_request_dma_domain_for_dev(struct device *dev) { + set_dma_ops(dev, &iommu_dma_ops); return request_default_domain_for_dev(dev, IOMMU_DOMAIN_DMA); } If intel-iommu gets fully converted such that everyone using default domains is also using iommu-dma, that should be it as far as the generic DMA ops are concerned (ultimately we might even be able to do it in __iommu_attach_device() based on the domain type). Like I said, the hard part is deciding when to make *these* calls, but apparently intel-iommu can already do that. I looked into that for powerpc a while ago and it wasn't pretty at all. Compared to that just checking another flag for the DMA direct calls is relatively clean and trivial as seen in the diffstat for this series alone. I don't see a great benefit to pulling legacy cruft out into common code instead of just working to get rid of it in-place, when said cruft pulls in the opposite direction to where we're taking the common code (i.e. it's inherently based on the premise of global ops). I'm not sure what legacy cruft it pulls in. I think it actually fits very much into a mental model of "direct mapping is the default, to be overridden if needed" which is pretty close to what we have at the moment. Just with a slightly more complicated processing of the override. Because the specific "slightly more complicated" currently used by the existing powerpc code will AFAICS continue to be needed only by the existing powerpc code, thus I don't see any benefit to cluttering up the common code with it today. You may as well just refactor powerpc to swizzle its own ops to obviate archdata.iommu_bypass, and delete a fair chunk of old code. Robin.
Re: [PATCH] dma-mapping: treat dev->bus_dma_mask as a DMA limit
On 2019-11-13 8:41 pm, Florian Fainelli wrote: On 11/13/19 12:34 PM, Robin Murphy wrote: On 13/11/2019 4:13 pm, Nicolas Saenz Julienne wrote: Using a mask to represent bus DMA constraints has a set of limitations. The biggest one being it can only hold a power of two (minus one). The DMA mapping code is already aware of this and treats dev->bus_dma_mask as a limit. This quirk is already used by some architectures although still rare. With the introduction of the Raspberry Pi 4 we've found a new contender for the use of bus DMA limits, as its PCIe bus can only address the lower 3GB of memory (of a total of 4GB). This is impossible to represent with a mask. To make things worse the device-tree code rounds non power of two bus DMA limits to the next power of two, which is unacceptable in this case. In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all over the tree and treat it as such. Note that dev->bus_dma_limit is meant to contain the higher accesible DMA address. Neat, you win a "why didn't I do it that way in the first place?" :) Looking at it without all the history of previous attempts, this looks entirely reasonable, and definitely a step in the right direction. And while you are changing those, would it make sense to not only rename the structure member but introduce a getter and setter in order to ease future work where this would no longer be a scalar? I doubt it - once we get as a far as supporting multiple DMA ranges, there will be a whole load of infrastructure churn anyway if only to replace dma_pfn_offset, and I'm not sure a simple get/set paradigm would even be viable, so it's probably better to save that until clearly necessary. Robin.
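For reference, a minimal sketch of the accessor pair being floated; these helpers are hypothetical and were never merged:

/* Hypothetical accessors: trivial today, but a single point to change
 * if the limit ever became more than a scalar. */
static inline u64 dev_get_bus_dma_limit(struct device *dev)
{
	return dev->bus_dma_limit;
}

static inline void dev_set_bus_dma_limit(struct device *dev, u64 limit)
{
	dev->bus_dma_limit = limit;
}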
Re: [PATCH] dma-mapping: treat dev->bus_dma_mask as a DMA limit
On 13/11/2019 4:13 pm, Nicolas Saenz Julienne wrote: Using a mask to represent bus DMA constraints has a set of limitations. The biggest one being it can only hold a power of two (minus one). The DMA mapping code is already aware of this and treats dev->bus_dma_mask as a limit. This quirk is already used by some architectures although still rare. With the introduction of the Raspberry Pi 4 we've found a new contender for the use of bus DMA limits, as its PCIe bus can only address the lower 3GB of memory (of a total of 4GB). This is impossible to represent with a mask. To make things worse the device-tree code rounds non power of two bus DMA limits to the next power of two, which is unacceptable in this case. In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all over the tree and treat it as such. Note that dev->bus_dma_limit is meant to contain the higher accesible DMA address. Neat, you win a "why didn't I do it that way in the first place?" :) Looking at it without all the history of previous attempts, this looks entirely reasonable, and definitely a step in the right direction. [...] diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c index 5a7551d060f2..f18827cf96df 100644 --- a/drivers/acpi/arm64/iort.c +++ b/drivers/acpi/arm64/iort.c @@ -1097,7 +1097,7 @@ void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *dma_size) * Limit coherent and dma mask based on size * retrieved from firmware. */ - dev->bus_dma_mask = mask; + dev->bus_dma_limit = mask; Although this preserves the existing behaviour, as in of_dma_configure() we can do better here since we have the original address range to hand. I think it's worth keeping the ACPI and OF paths in sync for minor tweaks like this, rather than letting them diverge unnecessarily. Otherwise, the rest looks OK to me - in principle we could store it as an exclusive limit such that we could then streamline the min_not_zero() tests to just min(mask, limit - 1), but that's probably too clever for its own good. Robin. dev->coherent_dma_mask = mask; *dev->dma_mask = mask; }
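Spelling out the "too clever" variant as a sketch, under the hypothetical encoding where the field holds an exclusive limit and 0 still means "no constraint":

static u64 sketch_effective_limit(struct device *dev)
{
	/* 0 - 1 wraps to ~0ULL, so an unset limit never wins the min()
	 * and the min_not_zero() special case disappears. */
	return min(*dev->dma_mask, dev->bus_dma_limit - 1);
}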
Re: generic DMA bypass flag
On 13/11/2019 1:37 pm, Christoph Hellwig wrote: Hi all, I've recently been chatting with Lu about using dma-iommu and per-device DMA ops in the intel IOMMU driver, and one missing feature in dma-iommu is a bypass mode where the direct mapping is used even when an iommu is attached to improve performance. The powerpc code already has a similar mode, so I'd like to move it to the core DMA mapping code. As part of that I noticed that the current powerpc code has a little bug in that it used the wrong check in the dma_sync_* routines to see if the direct mapping code is used. In all honesty, this seems silly. If we can set a per-device flag to say "oh, bypass these ops and use some other ops instead", then we can just as easily simply give the device the appropriate ops in the first place. Because, y'know, the name of the game is "per-device ops". These two patches just add the generic code and move powerpc over, the intel IOMMU bits will require a separate discussion. The ops need to match the default domain type, so the hard part of the problem is choosing when and if to switch that (particularly fun if there are multiple devices in the IOMMU group). Flipping between iommu-dma and dma-direct at that point will be trivial. I don't see a great benefit to pulling legacy cruft out into common code instead of just working to get rid of it in-place, when said cruft pulls in the opposite direction to where we're taking the common code (i.e. it's inherently based on the premise of global ops). Robin. The x86 AMD Gart code also has a bypass mode, but it is a lot stranger, so I'm not going to touch it for now.
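A sketch of what the fast-path test for such a generic flag boils down to, assuming a dma_ops_bypass field along the lines of the posted series (simplified):

#include <linux/dma-mapping.h>

static inline bool sketch_dma_map_direct(struct device *dev)
{
	return dev->dma_ops_bypass ||	/* assumed per-device flag */
	       !get_dma_ops(dev);	/* no IOMMU ops attached at all */
}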
Re: [PATCH 1/2] dma-mapping: Add dma_addr_is_phys_addr()
On 14/10/2019 05:51, David Gibson wrote: On Fri, Oct 11, 2019 at 06:25:18PM -0700, Ram Pai wrote: From: Thiago Jung Bauermann In order to safely use the DMA API, virtio needs to know whether DMA addresses are in fact physical addresses and for that purpose, dma_addr_is_phys_addr() is introduced. cc: Benjamin Herrenschmidt cc: David Gibson cc: Michael Ellerman cc: Paul Mackerras cc: Michael Roth cc: Alexey Kardashevskiy cc: Paul Burton cc: Robin Murphy cc: Bartlomiej Zolnierkiewicz cc: Marek Szyprowski cc: Christoph Hellwig Suggested-by: Michael S. Tsirkin Signed-off-by: Ram Pai Signed-off-by: Thiago Jung Bauermann The change itself looks ok, so Reviewed-by: David Gibson However, I would like to see the commit message (and maybe the inline comments) expanded a bit on what the distinction here is about. Some of the text from the next patch would be suitable, about DMA addresses usually being in a different address space but not in the case of bounce buffering. Right, this needs a much tighter definition. "DMA address happens to be a valid physical address" is true of various IOMMU setups too, but I can't believe it's meaningful in such cases. If what you actually want is "DMA is direct or SWIOTLB" - i.e. "DMA address is physical address of DMA data (not necessarily the original buffer)" - wouldn't dma_is_direct() suffice? Robin. --- arch/powerpc/include/asm/dma-mapping.h | 21 + arch/powerpc/platforms/pseries/Kconfig | 1 + include/linux/dma-mapping.h| 20 kernel/dma/Kconfig | 3 +++ 4 files changed, 45 insertions(+) diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h index 565d6f7..f92c0a4b 100644 --- a/arch/powerpc/include/asm/dma-mapping.h +++ b/arch/powerpc/include/asm/dma-mapping.h @@ -5,6 +5,8 @@ #ifndef _ASM_DMA_MAPPING_H #define _ASM_DMA_MAPPING_H +#include + static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) { /* We don't handle the NULL dev case for ISA for now. We could @@ -15,4 +17,23 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus) return NULL; } +#ifdef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR +/** + * dma_addr_is_phys_addr - check whether a device DMA address is a physical + * address + * @dev: device to check + * + * Returns %true if any DMA address for this device happens to also be a valid + * physical address (not necessarily of the same page). + */ +static inline bool dma_addr_is_phys_addr(struct device *dev) +{ + /* +* Secure guests always use the SWIOTLB, therefore DMA addresses are +* actually the physical address of the bounce buffer. 
+*/ + return is_secure_guest(); +} +#endif + #endif/* _ASM_DMA_MAPPING_H */ diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index 9e35cdd..0108150 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -152,6 +152,7 @@ config PPC_SVM select SWIOTLB select ARCH_HAS_MEM_ENCRYPT select ARCH_HAS_FORCE_DMA_UNENCRYPTED + select ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR help There are certain POWER platforms which support secure guests using the Protected Execution Facility, with the help of an Ultravisor diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index f7d1eea..6df5664 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -693,6 +693,26 @@ static inline bool dma_addressing_limited(struct device *dev) dma_get_required_mask(dev); } +#ifndef CONFIG_ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR +/** + * dma_addr_is_phys_addr - check whether a device DMA address is a physical + * address + * @dev: device to check + * + * Returns %true if any DMA address for this device happens to also be a valid + * physical address (not necessarily of the same page). + */ +static inline bool dma_addr_is_phys_addr(struct device *dev) +{ + /* +* Except in very specific setups, DMA addresses exist in a different +* address space from CPU physical addresses and cannot be directly used +* to reference system memory. +*/ + return false; +} +#endif + #ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, const struct iommu_ops *iommu, bool coherent); diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 9decbba..6209b46 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -51,6 +51,9 @@ config ARCH_HAS_DMA_MMAP_PGPROT config ARCH_HAS_FORCE_DMA_UNENCRYPTED bool +config ARCH_HAS_DMA_ADDR_IS_PHYS_ADDR + bool + config DMA_NONCOHERENT_CACHE_SYNC bool
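A sketch of the simpler test Robin suggests above, assuming the behaviour of the dma_is_direct() helper of that era: with no dma_map_ops attached, DMA is either direct or bounced through SWIOTLB, and in both cases the handle is the physical address of the DMA data:

static inline bool sketch_dma_is_direct(struct device *dev)
{
	/* NULL ops => the dma-direct path (possibly via SWIOTLB) */
	return !get_dma_ops(dev);
}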
Re: [PATCH v6] numa: make node_to_cpumask_map() NUMA_NO_NODE aware
On 2019-10-08 9:38 am, Yunsheng Lin wrote: On 2019/9/25 18:41, Peter Zijlstra wrote: On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote: From the discussion above, It seems making the node_to_cpumask_map() NUMA_NO_NODE aware is the most feasible way to move forwad. That's still wrong. Hi, Peter It seems this has trapped in the dead circle. From my understanding, NUMA_NO_NODE which means not node numa preference is the state to describe the node of virtual device or the physical device that has equal distance to all cpu. We can be stricter if the device does have a nearer node, but we can not deny that a device does not have a node numa preference or node affinity, which also means the control or data buffer can be allocated at the node where the process is running. As you has proposed, making it -2 and have dev_to_node() warn if the device does have a nearer node and not set by the fw is a way to be stricter. But I think maybe being stricter is not really relevant to NUMA_NO_NODE, because we does need a state to describe the device that have equal distance to all node, even if it is not physically scalable. Any better suggestion to move this forward? FWIW (since this is in my inbox), it sounds like the fundamental issue is that NUMA_NO_NODE is conflated for at least two different purposes, so trying to sort that out would be a good first step. AFAICS we have genuine "don't care" cases like alloc_pages_node(), where if the producer says it doesn't matter then the consumer is free to make its own judgement on what to do, and fundamentally different "we expect this thing to have an affinity but it doesn't, so we can't say what's appropriate" cases which could really do with some separate indicator like "NUMA_INVALID_NODE". The tricky part is then bestowed on the producers to decide whether they can downgrade "invalid" to "don't care". You can technically build 'a device' whose internal logic is distributed between nodes and thus appears to have equal affinity - interrupt controllers, for example, may have per-CPU or per-node interfaces that end up looking like that - so although it's unlikely it's not outright nonsensical. Similarly a 'device' that's actually emulated behind a firmware call interface may well effectively have no real affinity. Robin.
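A sketch of the separation being suggested; NUMA_NO_NODE (-1) is real, while an invalid-node marker of -2 is the hypothetical addition under discussion:

#define SKETCH_NUMA_NO_NODE	(-1)	/* genuine "don't care" */
#define SKETCH_NUMA_INVALID_NODE (-2)	/* affinity expected but unknown */

/* Hypothetical producer-side policy: warn, then downgrade to "don't care" */
static int sketch_dev_to_node(struct device *dev)
{
	int nid = dev_to_node(dev);

	if (nid == SKETCH_NUMA_INVALID_NODE) {
		WARN_ON_ONCE(1);	/* firmware should have set a node */
		nid = SKETCH_NUMA_NO_NODE;
	}
	return nid;
}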
Re: [PATCH v2 6/6] mm/memory_hotplug: Pass nid instead of zone to __remove_pages()
On 26/08/2019 11:10, David Hildenbrand wrote: The zone parameter is no longer in use. Replace it with the nid, which we can now use the nid to limit the number of zones we have to process (vie for_each_zone_nid()). The function signature of __remove_pages() now looks much more similar to the one of __add_pages(). FWIW I recall this being trivially easy to hit when first playing with hotremove development for arm64 - since we only have 3 zones, the page flags poison would cause page_zone() to dereference past the end of node_zones[] and go all kinds of wrong. This looks like a definite improvement in API terms. For arm64, Acked-by: Robin Murphy Cheers, Robin. Cc: Catalin Marinas Cc: Will Deacon Cc: Tony Luck Cc: Fenghua Yu Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Heiko Carstens Cc: Vasily Gorbik Cc: Christian Borntraeger Cc: Yoshinori Sato Cc: Rich Felker Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: Andrew Morton Cc: Mark Rutland Cc: Steve Capper Cc: Mike Rapoport Cc: Anshuman Khandual Cc: Yu Zhao Cc: Jun Yao Cc: Robin Murphy Cc: Michal Hocko Cc: Oscar Salvador Cc: "Matthew Wilcox (Oracle)" Cc: Christophe Leroy Cc: "Aneesh Kumar K.V" Cc: Pavel Tatashin Cc: Gerald Schaefer Cc: Halil Pasic Cc: Tom Lendacky Cc: Greg Kroah-Hartman Cc: Masahiro Yamada Cc: Dan Williams Cc: Wei Yang Cc: Qian Cai Cc: Jason Gunthorpe Cc: Logan Gunthorpe Cc: Ira Weiny Cc: linux-arm-ker...@lists.infradead.org Cc: linux-i...@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s...@vger.kernel.org Cc: linux...@vger.kernel.org Signed-off-by: David Hildenbrand --- arch/arm64/mm/mmu.c| 4 +--- arch/ia64/mm/init.c| 4 +--- arch/powerpc/mm/mem.c | 3 +-- arch/s390/mm/init.c| 4 +--- arch/sh/mm/init.c | 4 +--- arch/x86/mm/init_32.c | 4 +--- arch/x86/mm/init_64.c | 4 +--- include/linux/memory_hotplug.h | 2 +- mm/memory_hotplug.c| 17 + mm/memremap.c | 3 +-- 10 files changed, 18 insertions(+), 31 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index e67bab4d613e..9a2d388314f3 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1080,7 +1080,6 @@ void arch_remove_memory(int nid, u64 start, u64 size, { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; - struct zone *zone; /* * FIXME: Cleanup page tables (also in arch_add_memory() in case @@ -1089,7 +1088,6 @@ void arch_remove_memory(int nid, u64 start, u64 size, * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be * unlocked yet. 
*/ - zone = page_zone(pfn_to_page(start_pfn)); - __remove_pages(zone, start_pfn, nr_pages, altmap); + __remove_pages(nid, start_pfn, nr_pages, altmap); } #endif diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c index bf9df2625bc8..ae6a3e718aa0 100644 --- a/arch/ia64/mm/init.c +++ b/arch/ia64/mm/init.c @@ -689,9 +689,7 @@ void arch_remove_memory(int nid, u64 start, u64 size, { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; - struct zone *zone; - zone = page_zone(pfn_to_page(start_pfn)); - __remove_pages(zone, start_pfn, nr_pages, altmap); + __remove_pages(nid, start_pfn, nr_pages, altmap); } #endif diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 9191a66b3bc5..af21e13529ce 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -130,10 +130,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size, { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; - struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap); int ret; - __remove_pages(page_zone(page), start_pfn, nr_pages, altmap); + __remove_pages(nid, start_pfn, nr_pages, altmap); /* Remove htab bolted mappings for this section of memory */ start = (unsigned long)__va(start); diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c index 20340a03ad90..2a7373ed6ded 100644 --- a/arch/s390/mm/init.c +++ b/arch/s390/mm/init.c @@ -296,10 +296,8 @@ void arch_remove_memory(int nid, u64 start, u64 size, { unsigned long start_pfn = start >> PAGE_SHIFT; unsigned long nr_pages = size >> PAGE_SHIFT; - struct zone *zone; - zone = page_zone(pfn_to_page(start_pfn)); - __remove_pages(zone, start_pfn, nr_pages, altmap); + __remove_pages(nid, start_pfn, nr_pages, altmap); vmem_remove_mapping(start, size); } #endif /* CONFIG_MEMORY_HOTPLUG */ diff --git a/arch/sh/mm/init.
Re: [PATCH 6/6] driver core: initialize a default DMA mask for platform device
On 11/08/2019 09:05, Christoph Hellwig wrote: We still treat devices without a DMA mask as defaulting to 32-bits for both mask, but a few releases ago we've started warning about such cases, as they require special cases to work around this sloppyness. Add a dma_mask field to struct platform_object so that we can initialize s/object/device/ the dma_mask pointer in struct device and initialize both masks to 32-bits by default. Architectures can still override this in arch_setup_pdev_archdata if needed. Note that the code looks a little odd with the various conditionals because we have to support platform_device structures that are statically allocated. This would be a good point to also get rid of the long-standing bodge in platform_device_register_full(). Signed-off-by: Christoph Hellwig --- drivers/base/platform.c | 15 +-- include/linux/platform_device.h | 1 + 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/base/platform.c b/drivers/base/platform.c index ec974ba9c0c4..b216fcb0a8af 100644 --- a/drivers/base/platform.c +++ b/drivers/base/platform.c @@ -264,6 +264,17 @@ struct platform_object { char name[]; }; +static void setup_pdev_archdata(struct platform_device *pdev) Bikeshed: painting the generic DMA API properties as "archdata" feels a bit off-target :/ +{ + if (!pdev->dev.coherent_dma_mask) + pdev->dev.coherent_dma_mask = DMA_BIT_MASK(32); + if (!pdev->dma_mask) + pdev->dma_mask = DMA_BIT_MASK(32); + if (!pdev->dev.dma_mask) + pdev->dev.dma_mask = >dma_mask; + arch_setup_pdev_archdata(pdev); AFAICS m68k's implementation of that arch hook becomes entirely redundant after this change, so may as well go. That would just leave powerpc's actual archdata, which at a glance looks like it could probably be cleaned up with not *too* much trouble. Robin. +}; + /** * platform_device_put - destroy a platform device * @pdev: platform device to free @@ -310,7 +321,7 @@ struct platform_device *platform_device_alloc(const char *name, int id) pa->pdev.id = id; device_initialize(>pdev.dev); pa->pdev.dev.release = platform_device_release; - arch_setup_pdev_archdata(>pdev); + setup_pdev_archdata(>pdev); } return pa ? >pdev : NULL; @@ -512,7 +523,7 @@ EXPORT_SYMBOL_GPL(platform_device_del); int platform_device_register(struct platform_device *pdev) { device_initialize(>dev); - arch_setup_pdev_archdata(pdev); + setup_pdev_archdata(pdev); return platform_device_add(pdev); } EXPORT_SYMBOL_GPL(platform_device_register); diff --git a/include/linux/platform_device.h b/include/linux/platform_device.h index 9bc36b589827..a2abde2aef25 100644 --- a/include/linux/platform_device.h +++ b/include/linux/platform_device.h @@ -24,6 +24,7 @@ struct platform_device { int id; boolid_auto; struct device dev; + u64 dma_mask; u32 num_resources; struct resource *resource;
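For context, nothing changes for well-behaved drivers under the new default: a device that can address more than 32 bits still raises its masks explicitly with ordinary DMA API usage (standard practice, not part of the patch):

static int sketch_probe(struct platform_device *pdev)
{
	/* lift both streaming and coherent masks above the 32-bit default */
	return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
}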
Re: [PATCH v3 04/11] arm64/mm: Add temporary arch_remove_memory() implementation
On 04/06/2019 07:56, David Hildenbrand wrote: On 03.06.19 23:41, Wei Yang wrote: On Mon, May 27, 2019 at 01:11:45PM +0200, David Hildenbrand wrote: A proper arch_remove_memory() implementation is on its way, which also cleanly removes page tables in arch_add_memory() in case something goes wrong. Would this be better to understand? removes page tables created in arch_add_memory That's not what this sentence expresses. Have a look at arch_add_memory(), in case __add_pages() fails, the page tables are not removed. This will also be fixed by Anshuman in the same shot. As we want to use arch_remove_memory() in case something goes wrong during memory hotplug after arch_add_memory() finished, let's add a temporary hack that is sufficient enough until we get a proper implementation that cleans up page table entries. We will remove CONFIG_MEMORY_HOTREMOVE around this code in follow up patches. Cc: Catalin Marinas Cc: Will Deacon Cc: Mark Rutland Cc: Andrew Morton Cc: Ard Biesheuvel Cc: Chintan Pandya Cc: Mike Rapoport Cc: Jun Yao Cc: Yu Zhao Cc: Robin Murphy Cc: Anshuman Khandual Signed-off-by: David Hildenbrand --- arch/arm64/mm/mmu.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index a1bfc4413982..e569a543c384 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1084,4 +1084,23 @@ int arch_add_memory(int nid, u64 start, u64 size, return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, restrictions); } +#ifdef CONFIG_MEMORY_HOTREMOVE +void arch_remove_memory(int nid, u64 start, u64 size, + struct vmem_altmap *altmap) +{ + unsigned long start_pfn = start >> PAGE_SHIFT; + unsigned long nr_pages = size >> PAGE_SHIFT; + struct zone *zone; + + /* +* FIXME: Cleanup page tables (also in arch_add_memory() in case +* adding fails). Until then, this function should only be used +* during memory hotplug (adding memory), not for memory +* unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be +* unlocked yet. +*/ + zone = page_zone(pfn_to_page(start_pfn)); Compared with arch_remove_memory in x86. If altmap is not NULL, zone will be retrieved from page related to altmap. Not sure why this is not the same? This is a minimal implementation, sufficient for this use case here. A full implementation is in the works. For now, this function will not be used with an altmap (ZONE_DEVICE is not esupported for arm64 yet). FWIW the other pieces of ZONE_DEVICE are now due to land in parallel, but as long as we don't throw the ARCH_ENABLE_MEMORY_HOTREMOVE switch then there should still be no issue. Besides, given that we should consistently ignore the altmap everywhere at the moment, it may even work out regardless. One thing stands out about the failure path thing, though - if __add_pages() did fail, can it still be guaranteed to have initialised the memmap such that page_zone() won't return nonsense? Last time I looked that was still a problem when removing memory which had been successfully added, but never onlined (although I do know that particular case was already being discussed at the time, and I've not been paying the greatest attention since). Robin.
Re: [PATCH v3 0/6] Prerequisites for NXP LS104xA SMMU enablement
On 31/05/2019 18:08, Christoph Hellwig wrote: On Fri, May 31, 2019 at 06:03:30PM +0100, Robin Murphy wrote: The thing needs to be completely redone as it abuses parts of the iommu API in a completely unacceptable way. `git grep iommu_iova_to_phys drivers/{crypto,gpu,net}` :( I guess one alternative is for the offending drivers to maintain their own lookup tables of mapped DMA addresses - I think at least some of these things allow storing some kind of token in a descriptor, which even if it's not big enough for a virtual address might be sufficient for an index. Well, we'll at least need DMA API wrappers that work on the dma addr only and hide this madness underneath. And then tell if an given device supports this and fail the probe otherwise. Bleh, I'm certainly not keen on formalising any kind of dma_to_phys()/dma_to_virt() interface for this. Or are you just proposing something like dma_unmap_sorry_sir_the_dog_ate_my_homework() for drivers which have 'lost' the original VA they mapped? Robin.
Re: [PATCH v3 0/6] Prerequisites for NXP LS104xA SMMU enablement
On 31/05/2019 17:33, Christoph Hellwig wrote: On Thu, May 30, 2019 at 03:08:44PM -0700, David Miller wrote: From: laurentiu.tu...@nxp.com Date: Thu, 30 May 2019 17:19:45 +0300 Depends on this pull request: http://lists.infradead.org/pipermail/linux-arm-kernel/2019-May/653554.html I'm not sure how you want me to handle this. The thing needs to be completely redone as it abuses parts of the iommu API in a completely unacceptable way. `git grep iommu_iova_to_phys drivers/{crypto,gpu,net}` :( I guess one alternative is for the offending drivers to maintain their own lookup tables of mapped DMA addresses - I think at least some of these things allow storing some kind of token in a descriptor, which even if it's not big enough for a virtual address might be sufficient for an index. Robin.
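A sketch of the lookup-table alternative suggested above; every name here is hypothetical, the point being that a small index stashed in the hardware descriptor replaces the iommu_iova_to_phys() abuse:

struct sketch_buf_slot {
	void *va;
	dma_addr_t dma;
};

static struct sketch_buf_slot sketch_slots[256];	/* hypothetical table */

/* At map time: remember the CPU address alongside the DMA address */
static void sketch_remember(u8 token, void *va, dma_addr_t dma)
{
	sketch_slots[token] = (struct sketch_buf_slot){ .va = va, .dma = dma };
}

/* At completion time: recover the buffer from the descriptor token */
static void *sketch_lookup(u8 token)
{
	return sketch_slots[token].va;
}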
[PATCH v3 3/4] mm: introduce ARCH_HAS_PTE_DEVMAP
ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined with the long-out-of-date comment can lead to the impression than an architecture may just enable it (since __add_pages() now "comprehends device memory" for itself) and expect things to work. In practice, however, ZONE_DEVICE users have little chance of functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper dependency so the real situation is clearer. Cc: x...@kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Michael Ellerman Acked-by: Dan Williams Reviewed-by: Ira Weiny Acked-by: Oliver O'Halloran Reviewed-by: Anshuman Khandual Signed-off-by: Robin Murphy --- arch/powerpc/Kconfig | 2 +- arch/powerpc/include/asm/book3s/64/pgtable.h | 1 - arch/x86/Kconfig | 2 +- arch/x86/include/asm/pgtable.h | 4 ++-- arch/x86/include/asm/pgtable_types.h | 1 - include/linux/mm.h | 4 ++-- include/linux/pfn_t.h| 4 ++-- mm/Kconfig | 5 ++--- mm/gup.c | 2 +- 9 files changed, 11 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 8c1c636308c8..1120ff8ac715 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -128,6 +128,7 @@ config PPC select ARCH_HAS_MMIOWB if PPC64 select ARCH_HAS_PHYS_TO_DMA select ARCH_HAS_PMEM_APIif PPC64 + select ARCH_HAS_PTE_DEVMAP if PPC_BOOK3S_64 select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_MEMBARRIER_CALLBACKS select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC64 @@ -135,7 +136,6 @@ config PPC select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UACCESS_FLUSHCACHE if PPC64 select ARCH_HAS_UBSAN_SANITIZE_ALL - select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64 select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK select ARCH_MIGHT_HAVE_PC_PARPORT diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 7dede2e34b70..c6c2bdfb369b 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -90,7 +90,6 @@ #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */ #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ #define _PAGE_DEVMAP _RPAGE_SW1 /* software: ZONE_DEVICE page */ -#define __HAVE_ARCH_PTE_DEVMAP /* * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 2bbbd4d1ba31..57c4e80bd368 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -69,6 +69,7 @@ config X86 select ARCH_HAS_KCOVif X86_64 select ARCH_HAS_MEMBARRIER_SYNC_CORE select ARCH_HAS_PMEM_APIif X86_64 + select ARCH_HAS_PTE_DEVMAP if X86_64 select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_REFCOUNT select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 @@ -79,7 +80,6 @@ config X86 select ARCH_HAS_STRICT_MODULE_RWX select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE select ARCH_HAS_UBSAN_SANITIZE_ALL - select ARCH_HAS_ZONE_DEVICE if X86_64 select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 5e0509b41986..0bc530c4eb13 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -271,7 +271,7 @@ static inline int has_transparent_hugepage(void) return boot_cpu_has(X86_FEATURE_PSE); } -#ifdef __HAVE_ARCH_PTE_DEVMAP +#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP static inline int pmd_devmap(pmd_t pmd) { return !!(pmd_val(pmd) & _PAGE_DEVMAP); @@ -732,7 +732,7 @@ 
static inline int pte_present(pte_t a) return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE); } -#ifdef __HAVE_ARCH_PTE_DEVMAP +#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP static inline int pte_devmap(pte_t a) { return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP; diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index d6ff0bbdb394..b5e49e6bac63 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -103,7 +103,6 @@ #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) #define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP) -#define __HAVE_ARCH_PTE_DEVMAP #else #define _PAGE_NX (_AT(
[PATCH v2 3/3] mm: introduce ARCH_HAS_PTE_DEVMAP
ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined with the long-out-of-date comment can lead to the impression than an architecture may just enable it (since __add_pages() now "comprehends device memory" for itself) and expect things to work. In practice, however, ZONE_DEVICE users have little chance of functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper dependency so the real situation is clearer. Cc: x...@kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: Michael Ellerman Acked-by: Dan Williams Reviewed-by: Ira Weiny Acked-by: Oliver O'Halloran Reviewed-by: Anshuman Khandual Signed-off-by: Robin Murphy --- v2: Add review tags. arch/powerpc/Kconfig | 2 +- arch/powerpc/include/asm/book3s/64/pgtable.h | 1 - arch/x86/Kconfig | 2 +- arch/x86/include/asm/pgtable.h | 4 ++-- arch/x86/include/asm/pgtable_types.h | 1 - include/linux/mm.h | 4 ++-- include/linux/pfn_t.h| 4 ++-- mm/Kconfig | 5 ++--- mm/gup.c | 2 +- 9 files changed, 11 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 5e3d0853c31d..77e1993bba80 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -135,6 +135,7 @@ config PPC select ARCH_HAS_MMIOWB if PPC64 select ARCH_HAS_PHYS_TO_DMA select ARCH_HAS_PMEM_APIif PPC64 + select ARCH_HAS_PTE_DEVMAP if PPC_BOOK3S_64 select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_MEMBARRIER_CALLBACKS select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC64 @@ -142,7 +143,6 @@ config PPC select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UACCESS_FLUSHCACHE if PPC64 select ARCH_HAS_UBSAN_SANITIZE_ALL - select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64 select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 581f91be9dd4..02c22ac8f387 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -90,7 +90,6 @@ #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */ #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ #define _PAGE_DEVMAP _RPAGE_SW1 /* software: ZONE_DEVICE page */ -#define __HAVE_ARCH_PTE_DEVMAP /* * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 5ad92419be19..ffd50f27f395 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -60,6 +60,7 @@ config X86 select ARCH_HAS_KCOVif X86_64 select ARCH_HAS_MEMBARRIER_SYNC_CORE select ARCH_HAS_PMEM_APIif X86_64 + select ARCH_HAS_PTE_DEVMAP if X86_64 select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_REFCOUNT select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 @@ -69,7 +70,6 @@ config X86 select ARCH_HAS_STRICT_MODULE_RWX select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE select ARCH_HAS_UBSAN_SANITIZE_ALL - select ARCH_HAS_ZONE_DEVICE if X86_64 select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 2779ace16d23..89a1f6fd48bf 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -254,7 +254,7 @@ static inline int has_transparent_hugepage(void) return boot_cpu_has(X86_FEATURE_PSE); } -#ifdef __HAVE_ARCH_PTE_DEVMAP +#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP static inline int pmd_devmap(pmd_t pmd) { return !!(pmd_val(pmd) & 
_PAGE_DEVMAP); @@ -715,7 +715,7 @@ static inline int pte_present(pte_t a) return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE); } -#ifdef __HAVE_ARCH_PTE_DEVMAP +#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP static inline int pte_devmap(pte_t a) { return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP; diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index d6ff0bbdb394..b5e49e6bac63 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -103,7 +103,6 @@ #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) #define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP) -#define __HAVE_ARCH_PTE_DEVMAP #els
[PATCH 3/3] mm: introduce ARCH_HAS_PTE_DEVMAP
ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined with the long-out-of-date comment can lead to the impression than an architecture may just enable it (since __add_pages() now "comprehends device memory" for itself) and expect things to work. In practice, however, ZONE_DEVICE users have little chance of functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper dependency so the real situation is clearer. Signed-off-by: Robin Murphy --- arch/powerpc/Kconfig | 2 +- arch/powerpc/include/asm/book3s/64/pgtable.h | 1 - arch/x86/Kconfig | 2 +- arch/x86/include/asm/pgtable.h | 4 ++-- arch/x86/include/asm/pgtable_types.h | 1 - include/linux/mm.h | 4 ++-- include/linux/pfn_t.h| 4 ++-- mm/Kconfig | 5 ++--- mm/gup.c | 2 +- 9 files changed, 11 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 5e3d0853c31d..77e1993bba80 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -135,6 +135,7 @@ config PPC select ARCH_HAS_MMIOWB if PPC64 select ARCH_HAS_PHYS_TO_DMA select ARCH_HAS_PMEM_APIif PPC64 + select ARCH_HAS_PTE_DEVMAP if PPC_BOOK3S_64 select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_MEMBARRIER_CALLBACKS select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC64 @@ -142,7 +143,6 @@ config PPC select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UACCESS_FLUSHCACHE if PPC64 select ARCH_HAS_UBSAN_SANITIZE_ALL - select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64 select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 581f91be9dd4..02c22ac8f387 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -90,7 +90,6 @@ #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */ #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ #define _PAGE_DEVMAP _RPAGE_SW1 /* software: ZONE_DEVICE page */ -#define __HAVE_ARCH_PTE_DEVMAP /* * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 5ad92419be19..ffd50f27f395 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -60,6 +60,7 @@ config X86 select ARCH_HAS_KCOVif X86_64 select ARCH_HAS_MEMBARRIER_SYNC_CORE select ARCH_HAS_PMEM_APIif X86_64 + select ARCH_HAS_PTE_DEVMAP if X86_64 select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_REFCOUNT select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 @@ -69,7 +70,6 @@ config X86 select ARCH_HAS_STRICT_MODULE_RWX select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE select ARCH_HAS_UBSAN_SANITIZE_ALL - select ARCH_HAS_ZONE_DEVICE if X86_64 select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 2779ace16d23..89a1f6fd48bf 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -254,7 +254,7 @@ static inline int has_transparent_hugepage(void) return boot_cpu_has(X86_FEATURE_PSE); } -#ifdef __HAVE_ARCH_PTE_DEVMAP +#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP static inline int pmd_devmap(pmd_t pmd) { return !!(pmd_val(pmd) & _PAGE_DEVMAP); @@ -715,7 +715,7 @@ static inline int pte_present(pte_t a) return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE); } -#ifdef __HAVE_ARCH_PTE_DEVMAP +#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP static 
inline int pte_devmap(pte_t a) { return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP; diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index d6ff0bbdb394..b5e49e6bac63 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -103,7 +103,6 @@ #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) #define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP) -#define __HAVE_ARCH_PTE_DEVMAP #else #define _PAGE_NX (_AT(pteval_t, 0)) #define _PAGE_DEVMAP (_AT(pteval_t, 0)) diff --git a/include/linux/mm.h b/include/linux/mm.h index d76dfb7ac617..fe05c94f23e9 100644 --- a/include/linux/mm.h +++ b/inc
[PATCH 2/3] mm: clean up is_device_*_page() definitions
Refactor is_device_{public,private}_page() with is_pci_p2pdma_page() to make them all consistent in depending on their respective config options even when CONFIG_DEV_PAGEMAP_OPS is enabled for other reasons. This allows a little more compile-time optimisation as well as the conceptual and cosmetic cleanup. Suggested-by: Jerome Glisse Signed-off-by: Robin Murphy --- include/linux/mm.h | 43 +-- 1 file changed, 13 insertions(+), 30 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 76769749b5a5..d76dfb7ac617 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -910,32 +910,6 @@ static inline bool put_devmap_managed_page(struct page *page) } return false; } - -static inline bool is_device_private_page(const struct page *page) -{ - return is_zone_device_page(page) && - page->pgmap->type == MEMORY_DEVICE_PRIVATE; -} - -static inline bool is_device_public_page(const struct page *page) -{ - return is_zone_device_page(page) && - page->pgmap->type == MEMORY_DEVICE_PUBLIC; -} - -#ifdef CONFIG_PCI_P2PDMA -static inline bool is_pci_p2pdma_page(const struct page *page) -{ - return is_zone_device_page(page) && - page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; -} -#else /* CONFIG_PCI_P2PDMA */ -static inline bool is_pci_p2pdma_page(const struct page *page) -{ - return false; -} -#endif /* CONFIG_PCI_P2PDMA */ - #else /* CONFIG_DEV_PAGEMAP_OPS */ static inline void dev_pagemap_get_ops(void) { @@ -949,22 +923,31 @@ static inline bool put_devmap_managed_page(struct page *page) { return false; } +#endif /* CONFIG_DEV_PAGEMAP_OPS */ static inline bool is_device_private_page(const struct page *page) { - return false; + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && + IS_ENABLED(CONFIG_DEVICE_PRIVATE) && + is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_PRIVATE; } static inline bool is_device_public_page(const struct page *page) { - return false; + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && + IS_ENABLED(CONFIG_DEVICE_PUBLIC) && + is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_PUBLIC; } static inline bool is_pci_p2pdma_page(const struct page *page) { - return false; + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && + IS_ENABLED(CONFIG_PCI_P2PDMA) && + is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; } -#endif /* CONFIG_DEV_PAGEMAP_OPS */ static inline void get_page(struct page *page) { -- 2.21.0.dirty
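For illustration, a hypothetical caller showing why the IS_ENABLED() form above costs nothing at runtime; with the relevant config option disabled the whole test is constant false and the branch is eliminated at compile time, exactly as with the old stubs:

static void sketch_handle_page(struct page *page)
{
	/* dead-code-eliminated entirely when CONFIG_DEVICE_PRIVATE=n */
	if (is_device_private_page(page))
		put_page(page);	/* stand-in for real handling */
}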
[PATCH 1/3] mm/memremap: Rename and consolidate SECTION_SIZE
Trying to activate ZONE_DEVICE for arm64 reveals that memremap's internal helpers for sparsemem sections conflict with arm64's definitions for hugepages, which inherit the name of "sections" from earlier versions of the ARM architecture. Disambiguate memremap (and now HMM too) by propagating sparsemem's PA_ prefix, to clarify that these values are in terms of addresses rather than PFNs (and because it's a heck of a lot easier than changing all the arch code). SECTION_MASK is unused, so it can just go. [anshuman: Consolidated mm/hmm.c instance and updated the commit message] Acked-by: Michal Hocko Reviewed-by: David Hildenbrand Signed-off-by: Robin Murphy Signed-off-by: Anshuman Khandual --- include/linux/mmzone.h | 1 + kernel/memremap.c | 10 -- mm/hmm.c | 2 -- 3 files changed, 5 insertions(+), 8 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fba7741533be..ed7dd27ee94a 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1081,6 +1081,7 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn) * PFN_SECTION_SHIFT pfn to/from section number */ #define PA_SECTION_SHIFT (SECTION_SIZE_BITS) +#define PA_SECTION_SIZE(1UL << PA_SECTION_SHIFT) #define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT) #define NR_MEM_SECTIONS(1UL << SECTIONS_SHIFT) diff --git a/kernel/memremap.c b/kernel/memremap.c index a856cb5ff192..dda1367b385d 100644 --- a/kernel/memremap.c +++ b/kernel/memremap.c @@ -14,8 +14,6 @@ #include static DEFINE_XARRAY(pgmap_array); -#define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1) -#define SECTION_SIZE (1UL << PA_SECTION_SHIFT) #if IS_ENABLED(CONFIG_DEVICE_PRIVATE) vm_fault_t device_private_entry_fault(struct vm_area_struct *vma, @@ -98,8 +96,8 @@ static void devm_memremap_pages_release(void *data) put_page(pfn_to_page(pfn)); /* pages are dead and unused, undo the arch mapping */ - align_start = res->start & ~(SECTION_SIZE - 1); - align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE) + align_start = res->start & ~(PA_SECTION_SIZE - 1); + align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE) - align_start; nid = page_to_nid(pfn_to_page(align_start >> PAGE_SHIFT)); @@ -154,8 +152,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap) if (!pgmap->ref || !pgmap->kill) return ERR_PTR(-EINVAL); - align_start = res->start & ~(SECTION_SIZE - 1); - align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE) + align_start = res->start & ~(PA_SECTION_SIZE - 1); + align_size = ALIGN(res->start + resource_size(res), PA_SECTION_SIZE) - align_start; align_end = align_start + align_size - 1; diff --git a/mm/hmm.c b/mm/hmm.c index fe1cd87e49ac..ef9e4e6c9f92 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -33,8 +33,6 @@ #include #include -#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT) #if IS_ENABLED(CONFIG_HMM_MIRROR) static const struct mmu_notifier_ops hmm_mmu_notifier_ops; -- 2.21.0.dirty
[PATCH 0/3] Device-memory-related cleanups
Hi, As promised, these are my preparatory cleanup patches that have so far fallen out of pmem DAX work for arm64. Patch #1 has already been out for a ride in Anshuman's hot-remove series, so I've collected the acks already given. Since we have various things in flight at the moment touching arm64 pagetable code, I'm wary of conflicts and cross-tree dependencies for our actual ARCH_HAS_PTE_DEVMAP implementation. Thus it would be nice if these could be picked up for 5.2 via mm or nvdimm as appropriate, such that we can then handle the devmap patch itself via arm64 next cycle. Robin. Robin Murphy (3): mm/memremap: Rename and consolidate SECTION_SIZE mm: clean up is_device_*_page() definitions mm: introduce ARCH_HAS_PTE_DEVMAP arch/powerpc/Kconfig | 2 +- arch/powerpc/include/asm/book3s/64/pgtable.h | 1 - arch/x86/Kconfig | 2 +- arch/x86/include/asm/pgtable.h | 4 +- arch/x86/include/asm/pgtable_types.h | 1 - include/linux/mm.h | 47 +++- include/linux/mmzone.h | 1 + include/linux/pfn_t.h| 4 +- kernel/memremap.c| 10 ++--- mm/Kconfig | 5 +-- mm/gup.c | 2 +- mm/hmm.c | 2 - 12 files changed, 29 insertions(+), 52 deletions(-) -- 2.21.0.dirty
Re: [PATCH v5 1/6] iommu: add generic boot option iommu.dma_mode
On 12/04/2019 11:26, John Garry wrote: On 09/04/2019 13:53, Zhen Lei wrote: Currently the IOMMU DMA code supports 3 modes: passthrough, lazy, strict. The passthrough mode bypasses the IOMMU, the lazy mode defers the invalidation of hardware TLBs, and the strict mode invalidates IOMMU hardware TLBs synchronously. The three modes are mutually exclusive. But the current boot options are confusing - for example, iommu.passthrough and iommu.strict don't coexist well. So add iommu.dma_mode. Signed-off-by: Zhen Lei --- Documentation/admin-guide/kernel-parameters.txt | 19 drivers/iommu/iommu.c | 59 - include/linux/iommu.h | 5 +++ 3 files changed, 71 insertions(+), 12 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 2b8ee90bb64470d..f7766f8ac8b9084 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1811,6 +1811,25 @@ 1 - Bypass the IOMMU for DMA. unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH. + iommu.dma_mode= Configure the default DMA mode. If unset, use the value + of CONFIG_IOMMU_DEFAULT_PASSTHROUGH to determine + passthrough or not. To me, for unset it's unclear what we default to. So if unset and also CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set, do we get lazy or strict mode? (note: I'm ignoring backwards compatibility and interaction of iommu.strict and .passthrough also, more below). Could we consider introducing config DEFAULT_IOMMU_DMA_MODE, similar to DEFAULT_IOSCHED? Yes, what I was suggesting was specifically refactoring the Kconfig options into a single choice that controls the default (i.e. no command line option provided) behaviour. AFAICS it should be fairly straightforward to maintain the existing "strict" and "passthrough" options (and legacy arch-specific versions thereof) to override that default without introducing yet another command-line option, which I think we should avoid if possible. + Note: For historical reasons, ARM64/S390/PPC/X86 have + their own specific options. Currently only ARM64 supports + this boot option; other ARCHs will hopefully adopt it as + the generic one. + passthrough + Configure DMA to bypass the IOMMU by default. + lazy + Request that DMA unmap operations use deferred + invalidation of hardware TLBs, for increased + throughput at the cost of reduced device isolation. + Will fall back to strict mode if not supported by + the relevant IOMMU driver. + strict + DMA unmap operations invalidate IOMMU hardware TLBs + synchronously. + io7= [HW] IO7 for Marvel based alpha systems See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c.
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 109de67d5d727c2..df1ce8e22385b48 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -38,12 +38,13 @@ static struct kset *iommu_group_kset; static DEFINE_IDA(iommu_group_ida); + #ifdef CONFIG_IOMMU_DEFAULT_PASSTHROUGH -static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_IDENTITY; +#define IOMMU_DEFAULT_DMA_MODE IOMMU_DMA_MODE_PASSTHROUGH #else -static unsigned int iommu_def_domain_type = IOMMU_DOMAIN_DMA; +#define IOMMU_DEFAULT_DMA_MODE IOMMU_DMA_MODE_STRICT #endif -static bool iommu_dma_strict __read_mostly = true; +static int iommu_default_dma_mode __read_mostly = IOMMU_DEFAULT_DMA_MODE; struct iommu_callback_data { const struct iommu_ops *ops; @@ -147,20 +148,51 @@ static int __init iommu_set_def_domain_type(char *str) int ret; ret = kstrtobool(str, &pt); - if (ret) - return ret; + if (!ret && pt) + iommu_default_dma_mode = IOMMU_DMA_MODE_PASSTHROUGH; - iommu_def_domain_type = pt ? IOMMU_DOMAIN_IDENTITY : IOMMU_DOMAIN_DMA; - return 0; + return ret; } early_param("iommu.passthrough", iommu_set_def_domain_type); static int __init iommu_dma_setup(char *str) { - return kstrtobool(str, &iommu_dma_strict); + bool strict; + int ret; + + ret = kstrtobool(str, &strict); + if (!ret) + iommu_default_dma_mode = strict ? + IOMMU_DMA_MODE_STRICT : IOMMU_DMA_MODE_LAZY; + + return ret; } early_param("iommu.strict", iommu_dma_setup); +static int __init iommu_dma_mode_setup(char *str) +{ + if (!str) + goto fail; + + if (!strncmp(str, "passthrough", 11)) + iommu_default_dma_mode = IOMMU_DMA_MODE_PASSTHROUGH; + else if (!strncmp(str, "lazy", 4)) + iommu_default_dma_mode = IOMMU_DMA_MODE_LAZY; + else if (!strncmp(str, "strict", 6)) + iommu_default_dma_mode = IOMMU_DMA_MODE_STRICT; + else + goto fail;
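For reference, the single-choice Kconfig refactoring suggested above might reduce the C side to something like this (a sketch only - the CONFIG_IOMMU_DEFAULT_* symbols are hypothetical names modelled on DEFAULT_IOSCHED, with the existing command-line options merely overriding whichever default the choice selects):

	/* Hypothetical consumers of a Kconfig "choice" block: */
	#if defined(CONFIG_IOMMU_DEFAULT_PASSTHROUGH)
	#define IOMMU_DEFAULT_DMA_MODE	IOMMU_DMA_MODE_PASSTHROUGH
	#elif defined(CONFIG_IOMMU_DEFAULT_LAZY)
	#define IOMMU_DEFAULT_DMA_MODE	IOMMU_DMA_MODE_LAZY
	#else /* CONFIG_IOMMU_DEFAULT_STRICT */
	#define IOMMU_DEFAULT_DMA_MODE	IOMMU_DMA_MODE_STRICT
	#endif

	static int iommu_default_dma_mode __read_mostly = IOMMU_DEFAULT_DMA_MODE;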
Re: [PATCH 02/13] soc/fsl/bman: map FBPR area in the iommu
On 29/03/2019 14:00, laurentiu.tu...@nxp.com wrote: From: Laurentiu Tudor Add a one-to-one iommu mapping for bman private data memory (FBPR). This is required for BMAN to work without faults behind an iommu. Signed-off-by: Laurentiu Tudor --- drivers/soc/fsl/qbman/bman_ccsr.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/soc/fsl/qbman/bman_ccsr.c b/drivers/soc/fsl/qbman/bman_ccsr.c index 7c3cc968053c..b209c79511bb 100644 --- a/drivers/soc/fsl/qbman/bman_ccsr.c +++ b/drivers/soc/fsl/qbman/bman_ccsr.c @@ -29,6 +29,7 @@ */ #include "bman_priv.h" +#include u16 bman_ip_rev; EXPORT_SYMBOL(bman_ip_rev); @@ -178,6 +179,7 @@ static int fsl_bman_probe(struct platform_device *pdev) int ret, err_irq; struct device *dev = &pdev->dev; struct device_node *node = dev->of_node; + struct iommu_domain *domain; struct resource *res; u16 id, bm_pool_cnt; u8 major, minor; @@ -225,6 +227,15 @@ static int fsl_bman_probe(struct platform_device *pdev) dev_dbg(dev, "Allocated FBPR 0x%llx 0x%zx\n", fbpr_a, fbpr_sz); + /* Create a 1-to-1 iommu mapping for FBPR area */ + domain = iommu_get_domain_for_dev(dev); If that's expected to be the default domain that you're grabbing, then this is *incredibly* fragile. There's nothing to stop the IOVA that you forcibly map from being automatically allocated later and causing some other DMA mapping to fail noisily and unexpectedly. Furthermore, have you tried this with "iommu.passthrough=1"? That said, I really don't understand what's going on here anyway :/ As far as I can tell from qbman_init_private_mem(), fbpr_a comes from dma_alloc_coherent() and thus would already be a mapped IOVA - isn't this the stuff that Roy converted to nicely use shared-dma-pool regions a while ago? Robin. + if (domain) { + ret = iommu_map(domain, fbpr_a, fbpr_a, fbpr_sz, + IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE); + if (ret) + dev_warn(dev, "failed to iommu_map() %d\n", ret); + } + bm_set_memory(fbpr_a, fbpr_sz); err_irq = platform_get_irq(pdev, 0);
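A sketch of the alternative Robin alludes to - since dma_alloc_coherent() already returns a device-visible address (an IOVA when the device sits behind an IOMMU), the block could simply be programmed with that handle rather than forcing a second, 1:1 mapping (fbpr_virt/fbpr_dma are made-up names for illustration):

	void *fbpr_virt;
	dma_addr_t fbpr_dma;

	fbpr_virt = dma_alloc_coherent(dev, fbpr_sz, &fbpr_dma, GFP_KERNEL);
	if (!fbpr_virt)
		return -ENOMEM;
	/* hand the *device* address to the hardware, not the CPU phys */
	bm_set_memory(fbpr_dma, fbpr_sz);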
Re: [PATCH 2/3] dma-mapping: don't BUG when calling dma_map_resource on RAM
On 11/01/2019 18:17, Christoph Hellwig wrote: Use WARN_ON_ONCE to print a stack trace and return a proper error code instead. I was racking my brain to remember the reasoning behind BUG_ON() being the only viable way to prevent errors getting through unhandled, but of course that was before we had a standardised DMA_MAPPING_ERROR that would work across all implementations. Signed-off-by: Christoph Hellwig --- include/linux/dma-mapping.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index d3087829a6df..91add0751aa5 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -353,7 +353,8 @@ static inline dma_addr_t dma_map_resource(struct device *dev, BUG_ON(!valid_dma_direction(dir)); /* Don't allow RAM to be mapped */ Ugh, I'm pretty sure that that "pfn_valid means RAM" misunderstanding originally came from me - it might be nice to have a less-misleading comment here, but off-hand I can't think of a succinct way to say "only for 'DMA' access to MMIO registers/SRAMs/etc. and not for anything the kernel knows as actual system/device memory" to better explain the WARN... Either way, though, Reviewed-by: Robin Murphy - BUG_ON(pfn_valid(PHYS_PFN(phys_addr))); + if (WARN_ON_ONCE(pfn_valid(PHYS_PFN(phys_addr)))) + return DMA_MAPPING_ERROR; if (dma_is_direct(ops)) addr = dma_direct_map_resource(dev, phys_addr, size, dir, attrs);
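For callers, the change means a bad address is now reportable rather than fatal; usage follows the usual pattern (sketch - fifo_phys/fifo_len are placeholder names):

	dma_addr_t dma = dma_map_resource(dev, fifo_phys, fifo_len,
					  DMA_TO_DEVICE, 0);
	if (dma_mapping_error(dev, dma))
		return -EIO;	/* a RAM address now fails cleanly here
				 * instead of hitting BUG_ON() */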
Re: [PATCH 1/3] dma-mapping: remove the default map_resource implementation
On 11/01/2019 18:17, Christoph Hellwig wrote: Just returning the physical address when no map_resource method is present is highly dangerous as it doesn't take any offset in the direct mapping into account and does the completely wrong thing for IOMMUs. Instead provide a proper implementation in the direct mapping code, and also wire it up for arm and powerpc. Ignoring the offset was kind of intentional there, because at the time I was primarily thinking about it in terms of the Keystone 2 platform where the peripherals are all in the same place (0-2GB) in both the bus and CPU physical address maps, and only the view of RAM differs between the two (2-4GB vs. 32-34GB). However, on something like BCM283x, the peripherals region is also offset from its bus address in the CPU view, but at a *different* offset relative to that of RAM. Fortunately, I'm not aware of any platform which has a DMA engine behind an IOMMU (and thus *needs* to use dma_map_resource() to avoid said IOMMU blocking the slave device register reads/writes) and also has any nonzero offsets, and AFAIK the IOMMU-less platforms above aren't using dma_map_resource() at all, so this change shouldn't actually break anything, but I guess we have a bit of a problem making it truly generic and robust :( Is this perhaps another shove in the direction of overhauling dma_pfn_offset into an arbitrary "DMA ranges" lookup table? Signed-off-by: Christoph Hellwig --- arch/arm/mm/dma-mapping.c | 2 ++ arch/powerpc/kernel/dma-swiotlb.c | 1 + arch/powerpc/kernel/dma.c | 1 + include/linux/dma-mapping.h | 12 +++- kernel/dma/direct.c | 14 ++ 5 files changed, 25 insertions(+), 5 deletions(-) diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index f1e2922e447c..3c8534904209 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -188,6 +188,7 @@ const struct dma_map_ops arm_dma_ops = { .unmap_page = arm_dma_unmap_page, .map_sg = arm_dma_map_sg, .unmap_sg = arm_dma_unmap_sg, + .map_resource = dma_direct_map_resource, .sync_single_for_cpu = arm_dma_sync_single_for_cpu, .sync_single_for_device = arm_dma_sync_single_for_device, .sync_sg_for_cpu = arm_dma_sync_sg_for_cpu, @@ -211,6 +212,7 @@ const struct dma_map_ops arm_coherent_dma_ops = { .get_sgtable = arm_dma_get_sgtable, .map_page = arm_coherent_dma_map_page, .map_sg = arm_dma_map_sg, + .map_resource = dma_direct_map_resource, .dma_supported = arm_dma_supported, }; EXPORT_SYMBOL(arm_coherent_dma_ops); diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c index 7d5fc9751622..fbb2506a414e 100644 --- a/arch/powerpc/kernel/dma-swiotlb.c +++ b/arch/powerpc/kernel/dma-swiotlb.c @@ -55,6 +55,7 @@ const struct dma_map_ops powerpc_swiotlb_dma_ops = { .dma_supported = swiotlb_dma_supported, .map_page = dma_direct_map_page, .unmap_page = dma_direct_unmap_page, + .map_resource = dma_direct_map_resource, .sync_single_for_cpu = dma_direct_sync_single_for_cpu, .sync_single_for_device = dma_direct_sync_single_for_device, .sync_sg_for_cpu = dma_direct_sync_sg_for_cpu, diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c index b1903ebb2e9c..258b9e8ebb99 100644 --- a/arch/powerpc/kernel/dma.c +++ b/arch/powerpc/kernel/dma.c @@ -273,6 +273,7 @@ const struct dma_map_ops dma_nommu_ops = { .dma_supported = dma_nommu_dma_supported, .map_page = dma_nommu_map_page, .unmap_page = dma_nommu_unmap_page, + .map_resource = dma_direct_map_resource, .get_required_mask = dma_nommu_get_required_mask, #ifdef CONFIG_NOT_COHERENT_CACHE .sync_single_for_cpu=
dma_nommu_sync_single, diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index cef2127e1d70..d3087829a6df 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -208,6 +208,8 @@ dma_addr_t dma_direct_map_page(struct device *dev, struct page *page, unsigned long attrs); int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, enum dma_data_direction dir, unsigned long attrs); +dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr, + size_t size, enum dma_data_direction dir, unsigned long attrs); #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \ defined(CONFIG_SWIOTLB) @@ -346,19 +348,19 @@ static inline dma_addr_t dma_map_resource(struct device *dev, unsigned long attrs) { const struct dma_map_ops *ops =
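The quoted diff is truncated above; for context, the new dma_direct_map_resource() amounts to roughly the following (reconstructed from the discussion, so treat it as a sketch rather than the patch itself):

	dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
			size_t size, enum dma_data_direction dir, unsigned long attrs)
	{
		/* MMIO is not RAM, so no RAM offset is applied - which is
		 * exactly the per-platform subtlety discussed above */
		dma_addr_t dma_addr = paddr;

		if (unlikely(!dma_capable(dev, dma_addr, size)))
			return DMA_MAPPING_ERROR;
		return dma_addr;
	}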
Re: [PATCH] mm/zsmalloc.c: Fix zsmalloc 32-bit PAE support
On 10/12/2018 14:21, Rafael David Tinoco wrote: On 32-bit systems, zsmalloc uses HIGHMEM and, when PAE is enabled, the physical frame number might be so big that zsmalloc obj encoding (to location) will break, causing: BUG: KASAN: null-ptr-deref in zs_map_object+0xa4/0x2bc Read of size 4 at addr by task mkfs.ext4/623 CPU: 2 PID: 623 Comm: mkfs.ext4 Not tainted 4.19.0-rc8-00017-g8239bc6d3307-dirty #15 Hardware name: Generic DT based system [] (unwind_backtrace) from [] (show_stack+0x20/0x24) [] (show_stack) from [] (dump_stack+0xbc/0xe8) [] (dump_stack) from [] (kasan_report+0x248/0x390) [] (kasan_report) from [] (__asan_load4+0x78/0xb4) [] (__asan_load4) from [] (zs_map_object+0xa4/0x2bc) [] (zs_map_object) from [] (zram_bvec_rw.constprop.2+0x324/0x8e4 [zram]) [] (zram_bvec_rw.constprop.2 [zram]) from [] (zram_make_request+0x234/0x46c [zram]) [] (zram_make_request [zram]) from [] (generic_make_request+0x304/0x63c) [] (generic_make_request) from [] (submit_bio+0x4c/0x1c8) [] (submit_bio) from [] (submit_bh_wbc.constprop.15+0x238/0x26c) [] (submit_bh_wbc.constprop.15) from [] (__block_write_full_page+0x524/0x76c) [] (__block_write_full_page) from [] (block_write_full_page+0x1bc/0x1d4) [] (block_write_full_page) from [] (blkdev_writepage+0x24/0x28) [] (blkdev_writepage) from [] (__writepage+0x44/0x78) [] (__writepage) from [] (write_cache_pages+0x3b8/0x800) [] (write_cache_pages) from [] (generic_writepages+0x74/0xa0) [] (generic_writepages) from [] (blkdev_writepages+0x18/0x1c) [] (blkdev_writepages) from [] (do_writepages+0x68/0x134) [] (do_writepages) from [] (__filemap_fdatawrite_range+0xb0/0xf4) [] (__filemap_fdatawrite_range) from [] (file_write_and_wait_range+0x64/0xd0) [] (file_write_and_wait_range) from [] (blkdev_fsync+0x54/0x84) [] (blkdev_fsync) from [] (vfs_fsync_range+0x70/0xd4) [] (vfs_fsync_range) from [] (do_fsync+0x4c/0x80) [] (do_fsync) from [] (sys_fsync+0x1c/0x20) [] (sys_fsync) from [] (ret_fast_syscall+0x0/0x2c) when trying to decode (the pfn) and map the object. That happens because one architecture might not re-define MAX_PHYSMEM_BITS, like in this ARM 32-bit w/ LPAE enabled example. For 32-bit systems, if not re-defined, MAX_POSSIBLE_PHYSMEM_BITS will default to BITS_PER_LONG (32) in most cases, and, with PAE enabled, _PFN_BITS might be wrong, which may cause the obj variable to overflow if the frame number is in HIGHMEM and referencing a page above the 4GB watermark. commit 6e00ec00b1a7 ("staging: zsmalloc: calculate MAX_PHYSMEM_BITS if not defined") realized MAX_PHYSMEM_BITS depended on SPARSEMEM headers and "fixed" it by calculating it using BITS_PER_LONG if SPARSEMEM wasn't used, like in the example given above. Systems with potential for PAE have existed for a long time, and assuming BITS_PER_LONG seems inadequate. Defining MAX_PHYSMEM_BITS looks better, however it is NOT a constant anymore for x86. So, instead, MAX_POSSIBLE_PHYSMEM_BITS should be defined by every architecture using zsmalloc, together with a sanity check for MAX_POSSIBLE_PHYSMEM_BITS being too big on 32-bit systems.
Link: https://bugs.linaro.org/show_bug.cgi?id=3765#c17 Signed-off-by: Rafael David Tinoco --- arch/arm/include/asm/pgtable-2level-types.h | 2 ++ arch/arm/include/asm/pgtable-3level-types.h | 2 ++ arch/arm64/include/asm/pgtable-types.h | 2 ++ arch/ia64/include/asm/page.h| 2 ++ arch/mips/include/asm/page.h| 2 ++ arch/powerpc/include/asm/mmu.h | 2 ++ arch/s390/include/asm/page.h| 2 ++ arch/sh/include/asm/page.h | 2 ++ arch/sparc/include/asm/page_32.h| 2 ++ arch/sparc/include/asm/page_64.h| 2 ++ arch/x86/include/asm/pgtable-2level_types.h | 2 ++ arch/x86/include/asm/pgtable-3level_types.h | 3 +- arch/x86/include/asm/pgtable_64_types.h | 4 +-- mm/zsmalloc.c | 35 +++-- 14 files changed, 45 insertions(+), 19 deletions(-) diff --git a/arch/arm/include/asm/pgtable-2level-types.h b/arch/arm/include/asm/pgtable-2level-types.h index 66cb5b0e89c5..552dba411324 100644 --- a/arch/arm/include/asm/pgtable-2level-types.h +++ b/arch/arm/include/asm/pgtable-2level-types.h @@ -64,4 +64,6 @@ typedef pteval_t pgprot_t; #endif /* STRICT_MM_TYPECHECKS */ +#define MAX_POSSIBLE_PHYSMEM_BITS 32 + #endif/* _ASM_PGTABLE_2LEVEL_TYPES_H */ diff --git a/arch/arm/include/asm/pgtable-3level-types.h b/arch/arm/include/asm/pgtable-3level-types.h index 921aa30259c4..664c39e6517c 100644 --- a/arch/arm/include/asm/pgtable-3level-types.h +++ b/arch/arm/include/asm/pgtable-3level-types.h @@ -67,4 +67,6 @@ typedef pteval_t pgprot_t; #endif /* STRICT_MM_TYPECHECKS */ +#define MAX_POSSIBLE_PHYSMEM_BITS 36 Nit: with LPAE, physical addresses go up to 40 bits, not just 36. Robin. + #endif/* _ASM_PGTABLE_3LEVEL_TYPES_H */ diff --git a/arch/arm64/include/asm/pgtable-types.h
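To see why the old fallback broke, the encoding arithmetic in rough numbers (illustrative figures: 4K pages, macro shapes as in mm/zsmalloc.c of that era):

	/* zsmalloc packs <pfn, obj index, tag> into one unsigned long:
	 *
	 *   _PFN_BITS      = MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT
	 *   OBJ_INDEX_BITS = BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS
	 *
	 * Old fallback on 32-bit: MAX_PHYSMEM_BITS = BITS_PER_LONG = 32,
	 * so _PFN_BITS = 32 - 12 = 20 - any PFN >= 1 << 20 (i.e. any page
	 * above the 4 GiB boundary under LPAE/PAE) silently overflows the
	 * PFN field and corrupts the encoded handle, hence the oops. */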
Re: [PATCH v2 03/20] perf/core: add PERF_PMU_CAP_EXCLUDE for exclusion capable PMUs
Hi Andrew, On 26/11/2018 11:12, Andrew Murray wrote: Many PMU drivers do not have the capability to exclude counting events that occur in specific contexts such as idle, kernel, guest, etc. These drivers indicate this by returning an error in their event_init upon testing the events attribute flags. This approach is error prone and often inconsistent. Let's instead allow PMU drivers to advertise their ability to exclude based on context via a new capability: PERF_PMU_CAP_EXCLUDE. This allows the perf core to reject requests for exclusion events where there is no support in the PMU. Signed-off-by: Andrew Murray --- include/linux/perf_event.h | 1 + kernel/events/core.c | 9 + 2 files changed, 10 insertions(+) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index b2e806f..69b3d65 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -244,6 +244,7 @@ struct perf_event; #define PERF_PMU_CAP_EXCLUSIVE0x10 #define PERF_PMU_CAP_ITRACE 0x20 #define PERF_PMU_CAP_HETEROGENEOUS_CPUS 0x40 +#define PERF_PMU_CAP_EXCLUDE 0x80 /** * struct pmu - generic performance monitoring unit diff --git a/kernel/events/core.c b/kernel/events/core.c index 5a97f34..9afb33c 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9743,6 +9743,15 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event) if (ctx) perf_event_ctx_unlock(event->group_leader, ctx); + if (!ret) { + if (!(pmu->capabilities & PERF_PMU_CAP_EXCLUDE) && + event_has_any_exclude_flag(event)) { Technically this is a bisection-breaker, since no driver has this capability yet - ideally, this patch should come after all the ones introducing it to the relevant drivers (with the removal of the now-redundant code from the other drivers at the end). Alternatively, since we already have several other negative capabilities, unless there's a strong feeling against adding any more then it might work out simpler to flip it to PERF_PMU_CAP_NO_EXCLUDE, such that we only need to introduce the core check then directly replace the open-coded event checks with the capability in the appropriate drivers, and need not touch the exclusion-supporting ones at all. Robin. + if (event->destroy) + event->destroy(event); + ret = -EINVAL; + } + } + if (ret) module_put(pmu->module);
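Concretely, the inverted capability Robin suggests would make the core check look something like this (sketch; the flag value is assumed, and the eventual name or placement could differ):

	#define PERF_PMU_CAP_NO_EXCLUDE	0x80	/* hypothetical value */

	if (!ret && (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE) &&
	    event_has_any_exclude_flag(event)) {
		if (event->destroy)
			event->destroy(event);
		ret = -EINVAL;
	}

With that, only the PMU drivers that genuinely cannot filter by context need touching, and the exclusion-capable ones keep working unmodified.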
Re: [PATCH 5/5] dma-direct: always allow dma mask <= physiscal memory size
On 19/11/2018 14:18, Ramon Fried wrote: On Tue, Oct 9, 2018 at 8:02 AM Benjamin Herrenschmidt wrote: On Wed, 2018-10-03 at 16:10 -0700, Alexander Duyck wrote: -* Because 32-bit DMA masks are so common we expect every architecture -* to be able to satisfy them - either by not supporting more physical -* memory, or by providing a ZONE_DMA32. If neither is the case, the -* architecture needs to use an IOMMU instead of the direct mapping. -*/ - if (mask < phys_to_dma(dev, DMA_BIT_MASK(32))) + u64 min_mask; + + if (IS_ENABLED(CONFIG_ZONE_DMA)) + min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS); + else + min_mask = DMA_BIT_MASK(32); + + min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT); + + if (mask >= phys_to_dma(dev, min_mask)) return 0; -#endif return 1; } So I believe I have run into the same issue that Guenter reported. On an x86_64 system w/ Intel IOMMU. I wasn't able to complete boot and all probe attempts for various devices were failing with -EIO errors. I believe the last mask check should be "if (mask < phys_to_dma(dev, min_mask))" not a ">=" check. Right, that test is backwards. I needed to change it here too (powermac with the rest of the powerpc series). Cheers, Ben. Hi, I'm working on a MIPS64 SoC with a PCIe root complex on it, and it appears that this series of patches is causing all PCI drivers that request a 64-bit mask to fail with -5. It's broken in 4.19. However, I just checked, and it works on master. We may need to backport a couple of patches to 4.19. I'm not sure, though, which patches should be backported, as there were at least 10 patches resolving issues in this dma-direct area recently. Christoph, Robin. Can we ask Greg to backport all these changes? What do you think? As far as I'm aware, the only real issue in 4.19 was my subtle breakage around setting bus_dma_mask - that's fixed by 6778be4e5209, which according to my inbox got picked up by autosel for 4.19 stable last week. Robin.
Re: [PATCH 13/36] dt-bindings: arm: Convert PMU binding to json-schema
On 08/11/2018 15:59, Thomas Petazzoni wrote: Hello, I'm jumping into the discussion, but I clearly don't have all the context of the discussion. On Thu, 8 Nov 2018 15:54:31 +, Robin Murphy wrote: This seems like a semantic difference between the two representations, or am I missing something here? Specifically, both the introduction of interrupts-extended and also dropping any mention of using a single per-cpu interrupt (the single combined case is no longer supported by Linux; not sure if you want to keep it in the binding). In regards to no support for the single combined interrupt, it looks like Marvell Armada SoCs at least (armada-375 is what I'm looking at) have only a single interrupt. Though the interrupt gets routed to MPIC which then has a GIC PPI. So it isn't supported or happens to work still since it is a PPI? Well, the description of the MPIC in the Armada XP functional spec says: "Interrupt sources ID0–ID28 are private events per CPU. Thus, each processor has a different set of events map interrupts ID0–ID28." Odd grammar aside, that would seem to imply that < 3> is a per-cpu interrupt itself, thus AFAICS so long as it's cascaded to a GIC PPI and not an SPI then there's no issue there. The Armada XP does not have a GIC at all, but only a MPIC as the primary interrupt controller. However the Armada 38x has both a GIC and a MPIC, and indeed the parent interrupt of the MPIC towards the GIC is: interrupts = ; Yeah, perhaps I should have clarified that the XP manual was the only publicly-available one I could find, but I'm inferring from the binding and driver that the MPIC in 375/38x still behaves the same. Robin.
Re: [PATCH 13/36] dt-bindings: arm: Convert PMU binding to json-schema
On 01/11/2018 19:32, Rob Herring wrote: On Tue, Oct 9, 2018 at 6:57 AM Will Deacon wrote: Hi Rob, On Fri, Oct 05, 2018 at 11:58:25AM -0500, Rob Herring wrote: Convert ARM PMU binding to DT schema format using json-schema. Cc: Will Deacon Cc: Mark Rutland Cc: linux-arm-ker...@lists.infradead.org Cc: devicet...@vger.kernel.org Signed-off-by: Rob Herring --- Documentation/devicetree/bindings/arm/pmu.txt | 70 -- .../devicetree/bindings/arm/pmu.yaml | 96 +++ 2 files changed, 96 insertions(+), 70 deletions(-) delete mode 100644 Documentation/devicetree/bindings/arm/pmu.txt create mode 100644 Documentation/devicetree/bindings/arm/pmu.yaml [...] -- interrupts : 1 combined interrupt or 1 per core. If the interrupt is a per-cpu - interrupt (PPI) then 1 interrupt should be specified. [...] + interrupts: +oneOf: + - maxItems: 1 + - minItems: 2 +maxItems: 8 +description: 1 interrupt per core. + + interrupts-extended: +$ref: '#/properties/interrupts' This seems like a semantic difference between the two representations, or am I missing something here? Specifically, both the introduction of interrupts-extended and also dropping any mention of using a single per-cpu interrupt (the single combined case is no longer supported by Linux; not sure if you want to keep it in the binding). In regards to no support for the single combined interrupt, it looks like Marvell Armada SoCs at least (armada-375 is what I'm looking at) have only a single interrupt. Though the interrupt gets routed to MPIC which then has a GIC PPI. So it isn't supported or happens to work still since it is a PPI? Well, the description of the MPIC in the Armada XP functional spec says: "Interrupt sources ID0–ID28 are private events per CPU. Thus, each processor has a different set of events map interrupts ID0–ID28." Odd grammar aside, that would seem to imply that < 3> is a per-cpu interrupt itself, thus AFAICS so long as it's cascaded to a GIC PPI and not an SPI then there's no issue there. Robin.
Re: [PATCH] dma-direct: Fix return value of dma_direct_supported
On 04/10/18 00:48, Alexander Duyck wrote: It appears that in commit 9d7a224b463e ("dma-direct: always allow dma mask <= physiscal memory size") the logic of the test was changed from a "<" to a ">="; however, I don't see any reason for that change. I am assuming that there was some additional change planned, specifically I suspect the logic was intended to be reversed and possibly used for a return. Since that is the case I have gone ahead and done that. Bah, seems I got hung up on the min_mask code above it and totally overlooked that the condition itself got flipped. It probably also can't help that it's an int return type, but treated as a bool by callers rather than "0 for success" as int tends to imply in isolation. Anyway, paying a bit more attention this time, I think this looks like the right fix - cheers Alex. Robin. This addresses issues I had on my system that prevented me from booting with the above mentioned commit applied on an x86_64 system w/ Intel IOMMU. Fixes: 9d7a224b463e ("dma-direct: always allow dma mask <= physiscal memory size") Signed-off-by: Alexander Duyck --- kernel/dma/direct.c |4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 5a0806b5351b..65872f6c2e93 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -301,9 +301,7 @@ int dma_direct_supported(struct device *dev, u64 mask) min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT); - if (mask >= phys_to_dma(dev, min_mask)) - return 0; - return 1; + return mask >= phys_to_dma(dev, min_mask); } int dma_direct_mapping_error(struct device *dev, dma_addr_t dma_addr)
Re: [PATCH 4/5] dma-direct: implement complete bus_dma_mask handling
On 27/09/18 17:27, Christoph Hellwig wrote: On Thu, Sep 27, 2018 at 05:14:56PM +0100, Robin Murphy wrote: This just seemed more readable to me than min_not_zero, but if others prefer min_not_zero I can switch. Nah, just checking whether there were any intentionally different assumptions compared to the couple of other places in the patch where min_not_zero() *is* used. If it's purely a style thing then no worries (personally I'd have written it yet another way anyway). I'm curious: how would you have written it? Come to think of it, I actually already have, in iommu-dma: if (dev->bus_dma_mask) mask &= dev->bus_dma_mask; but of course it's not so pretty for those cases where you don't already have the local variable ready to go. Robin.
Re: [PATCH 4/5] dma-direct: implement complete bus_dma_mask handling
On 27/09/18 16:32, Christoph Hellwig wrote: On Thu, Sep 27, 2018 at 03:58:04PM +0100, Robin Murphy wrote: } #endif /* !CONFIG_ARCH_HAS_PHYS_TO_DMA */ diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 3c404e33d946..64466b7ef67b 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -43,10 +43,11 @@ check_addr(struct device *dev, dma_addr_t dma_addr, size_t size, return false; } -if (*dev->dma_mask >= DMA_BIT_MASK(32)) { + if (*dev->dma_mask >= DMA_BIT_MASK(32) || dev->bus_dma_mask) { Hmm... say *dev->dma_mask is 31 bits and dev->bus_dma_mask is 40 bits due to a global DT property, we'll now scream where we didn't before even though the bus mask is almost certainly irrelevant - is that desirable? This is just the reporting in the error case - we'll only hit this IFF dma_capable already returned false. But if you don't want a message here we can probably drop this hunk. It was only a question of consistency - whatever the reason was to avoid warning for small device masks before, it's not obvious why it wouldn't still apply in the presence of nonzero but larger bus mask. The fact that the current check is there at all implied to me that we're expecting less-capable device to hit this often and thus wanted to avoid being noisy. If that's not the case then it's fine as-is AFAIC. @@ -65,12 +66,18 @@ u64 dma_direct_get_required_mask(struct device *dev) { u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT); +if (dev->bus_dma_mask && dev->bus_dma_mask < max_dma) + max_dma = dev->bus_dma_mask; Again, I think we could just do another min_not_zero() here. A device wired to address only one single page of RAM isn't a realistic prospect (and we could just flip the -1 and the shift in the max_dma calculation if we *really* wanted to support such things). This just seemed more readable to me than min_not_zero, but if others prefer min_not_zero I can switch. Nah, just checking whether there were any intentionally different assumptions compared to the couple of other places in the patch where min_not_zero() *is* used. If it's purely a style thing then no worries (personally I'd have written it yet another way anyway). Robin.
Re: [PATCH 3/5] dma-direct: refine dma_direct_alloc zone selection
On 27/09/18 16:30, Christoph Hellwig wrote: On Thu, Sep 27, 2018 at 03:30:20PM +0100, Robin Murphy wrote: +static gfp_t __dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask, + u64 *phys_mask) +{ + if (force_dma_unencrypted()) + *phys_mask = __dma_to_phys(dev, dma_mask); + else + *phys_mask = dma_to_phys(dev, dma_mask); Maybe make phys_to_dma_direct() take u64 instead of phys_addr_t so we can reuse it here? This is a dma_to_phys and not a phys_to_dma. Ugh, clearly it's time to stop reviewing patches for today... sorry :( Robin.
Re: [PATCH 2/5] dma-direct: add an explicit dma_direct_get_required_mask
On 27/09/18 16:28, Christoph Hellwig wrote: On Thu, Sep 27, 2018 at 03:12:25PM +0100, Robin Murphy wrote: +u64 dma_direct_get_required_mask(struct device *dev) +{ + u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT); + + return (1ULL << (fls64(max_dma) - 1)) * 2 - 1; I think that may as well just use __fls64() - it seems reasonable to assume max_dma > 0. Otherwise, Is there any good reason to micro-optimize given that this isn't a fast path? Not at all, I wasn't even thinking in terms of optimisation other than in terms of number of source characters and levels of parentheses. But more importantly I was also being a big idiot because no matter how much I have the fls()/__fls() thing in mind, __fls64() doesn't actually exist. Nitpick rescinded! Robin.
Re: [PATCH 5/5] dma-direct: always allow dma mask <= physiscal memory size
[ oops, I should have looked at the replies first, now I see Ben already had the same thing to say about #3... ] On 27/09/18 14:49, Christoph Hellwig wrote: On Thu, Sep 27, 2018 at 11:50:14AM +1000, Benjamin Herrenschmidt wrote: -* to be able to satisfy them - either by not supporting more physical -* memory, or by providing a ZONE_DMA32. If neither is the case, the -* architecture needs to use an IOMMU instead of the direct mapping. -*/ - if (mask < phys_to_dma(dev, DMA_BIT_MASK(32))) + u64 min_mask; + + if (IS_ENABLED(CONFIG_ZONE_DMA)) + min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS); + else + min_mask = min_t(u64, DMA_BIT_MASK(32), +(max_pfn - 1) << PAGE_SHIFT); + + if (mask >= phys_to_dma(dev, min_mask)) return 0; nitpick ... to be completely "correct", I would have written if (IS_ENABLED(CONFIG_ZONE_DMA)) min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS); else min_mask = DMA_BIT_MASK(32); min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT); In "theory" it's also ok to have a mask < ZONE_DMA_BITS as long as it's big enough to fit all memory :-) Yeah, we could do that. FWIW I like it even if just for looking slightly more readable. With that fixup, Reviewed-by: Robin Murphy
Re: [PATCH 4/5] dma-direct: implement complete bus_dma_mask handling
On 20/09/18 19:52, Christoph Hellwig wrote: Instead of rejecting devices with a too small bus_dma_mask we can handle them by taking the bus dma_mask into account for allocations and bounce buffering decisions. Signed-off-by: Christoph Hellwig --- include/linux/dma-direct.h | 3 ++- kernel/dma/direct.c| 21 +++-- 2 files changed, 13 insertions(+), 11 deletions(-) diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index b79496d8c75b..fbca184ff5a0 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -27,7 +27,8 @@ static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size) if (!dev->dma_mask) return false; - return addr + size - 1 <= *dev->dma_mask; + return addr + size - 1 <= + min_not_zero(*dev->dma_mask, dev->bus_dma_mask); } #endif /* !CONFIG_ARCH_HAS_PHYS_TO_DMA */ diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 3c404e33d946..64466b7ef67b 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -43,10 +43,11 @@ check_addr(struct device *dev, dma_addr_t dma_addr, size_t size, return false; } - if (*dev->dma_mask >= DMA_BIT_MASK(32)) { + if (*dev->dma_mask >= DMA_BIT_MASK(32) || dev->bus_dma_mask) { Hmm... say *dev->dma_mask is 31 bits and dev->bus_dma_mask is 40 bits due to a global DT property, we'll now scream where we didn't before even though the bus mask is almost certainly irrelevant - is that desirable? dev_err(dev, - "%s: overflow %pad+%zu of device mask %llx\n", - caller, &dma_addr, size, *dev->dma_mask); + "%s: overflow %pad+%zu of device mask %llx bus mask %llx\n", + caller, &dma_addr, size, + *dev->dma_mask, dev->bus_dma_mask); } return false; } @@ -65,12 +66,18 @@ u64 dma_direct_get_required_mask(struct device *dev) { u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT); + if (dev->bus_dma_mask && dev->bus_dma_mask < max_dma) + max_dma = dev->bus_dma_mask; Again, I think we could just do another min_not_zero() here. A device wired to address only one single page of RAM isn't a realistic prospect (and we could just flip the -1 and the shift in the max_dma calculation if we *really* wanted to support such things). + return (1ULL << (fls64(max_dma) - 1)) * 2 - 1; } static gfp_t __dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask, u64 *phys_mask) { + if (dev->bus_dma_mask && dev->bus_dma_mask < dma_mask) + dma_mask = dev->bus_dma_mask; + Similarly, can't we assume dma_mask to be nonzero here too? It feels like we really shouldn't have managed to get this far without one. Robin. if (force_dma_unencrypted()) *phys_mask = __dma_to_phys(dev, dma_mask); else @@ -87,7 +94,7 @@ static gfp_t __dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask, static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size) { return phys_to_dma_direct(dev, phys) + size - 1 <= - dev->coherent_dma_mask; + min_not_zero(dev->coherent_dma_mask, dev->bus_dma_mask); } void *dma_direct_alloc_pages(struct device *dev, size_t size, @@ -291,12 +298,6 @@ int dma_direct_supported(struct device *dev, u64 mask) if (mask < phys_to_dma(dev, DMA_BIT_MASK(32))) return 0; #endif - /* -* Upstream PCI/PCIe bridges or SoC interconnects may not carry -* as many DMA address bits as the device itself supports. -*/ - if (dev->bus_dma_mask && mask > dev->bus_dma_mask) - return 0; return 1; }
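For orientation (an assumed example, not part of the patch): bus_dma_mask would typically be populated from firmware describing a bridge, after which the min_not_zero() treatment clamps even a fully-capable device while leaving the common unset (zero) case unrestricted:

	/* e.g. a DT "dma-ranges" limiting an upstream bridge to 32 bits: */
	dev->bus_dma_mask = DMA_BIT_MASK(32);

	/* dma_capable() then honours the tighter of the two limits;
	 * bus_dma_mask == 0 means "no bus constraint known", which
	 * min_not_zero() quietly ignores: */
	ok = addr + size - 1 <=
	     min_not_zero(*dev->dma_mask, dev->bus_dma_mask);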
Re: [PATCH 3/5] dma-direct: refine dma_direct_alloc zone selection
On 20/09/18 19:52, Christoph Hellwig wrote: We need to take the DMA offset and encryption bit into account when selecting a zone. Use the opportunity to factor out the zone selection into a helper for reuse. Signed-off-by: Christoph Hellwig --- kernel/dma/direct.c | 31 +-- 1 file changed, 21 insertions(+), 10 deletions(-) diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 81b73a5bba54..3c404e33d946 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -68,6 +68,22 @@ u64 dma_direct_get_required_mask(struct device *dev) return (1ULL << (fls64(max_dma) - 1)) * 2 - 1; } +static gfp_t __dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask, + u64 *phys_mask) +{ + if (force_dma_unencrypted()) + *phys_mask = __dma_to_phys(dev, dma_mask); + else + *phys_mask = dma_to_phys(dev, dma_mask); Maybe make phys_to_dma_direct() take u64 instead of phys_addr_t so we can reuse it here? + /* GFP_DMA32 and GFP_DMA are no ops without the corresponding zones: */ + if (*phys_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS)) + return GFP_DMA; If we don't have ZONE_DMA it would in theory be a tiny bit better to fall back to GFP_DMA32 instead of returning 0 here, but I'm not sure it's worth the bother. + if (*phys_mask <= DMA_BIT_MASK(32)) + return GFP_DMA32; + return 0; +} + static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size) { return phys_to_dma_direct(dev, phys) + size - 1 <= @@ -80,17 +96,13 @@ void *dma_direct_alloc_pages(struct device *dev, size_t size, unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT; int page_order = get_order(size); struct page *page = NULL; + u64 phys_mask; void *ret; /* we always manually zero the memory once we are done: */ gfp &= ~__GFP_ZERO; - - /* GFP_DMA32 and GFP_DMA are no ops without the corresponding zones: */ - if (dev->coherent_dma_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS)) - gfp |= GFP_DMA; - if (dev->coherent_dma_mask <= DMA_BIT_MASK(32) && !(gfp & GFP_DMA)) - gfp |= GFP_DMA32; - + gfp |= __dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask, + &phys_mask); again: /* CMA can be used only in the context which permits sleeping */ if (gfpflags_allow_blocking(gfp)) { @@ -109,15 +121,14 @@ void *dma_direct_alloc_pages(struct device *dev, size_t size, page = NULL; if (IS_ENABLED(CONFIG_ZONE_DMA32) && - dev->coherent_dma_mask < DMA_BIT_MASK(64) && + phys_mask < DMA_BIT_MASK(64) && !(gfp & (GFP_DMA32 | GFP_DMA))) { gfp |= GFP_DMA32; goto again; Ah right, in that situation we should probably end up here anyway, so that's good enough - definitely not worth any more #ifdeffery above. Nits aside, Reviewed-by: Robin Murphy } if (IS_ENABLED(CONFIG_ZONE_DMA) && - dev->coherent_dma_mask < DMA_BIT_MASK(32) && - !(gfp & GFP_DMA)) { + phys_mask < DMA_BIT_MASK(32) && !(gfp & GFP_DMA)) { gfp = (gfp & ~GFP_DMA32) | GFP_DMA; goto again; }
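A quick worked example of the new helper's outcomes (assuming ARCH_ZONE_DMA_BITS == 24 as on x86, and no DMA offset or encryption in play):

	/*   coherent_dma_mask = DMA_BIT_MASK(24)  ->  GFP_DMA
	 *   coherent_dma_mask = DMA_BIT_MASK(30)  ->  GFP_DMA32
	 *   coherent_dma_mask = DMA_BIT_MASK(64)  ->  0 (no zone restriction)
	 *
	 * phys_mask is returned alongside so that the retry loop can compare
	 * against the *physical* limit rather than the raw DMA mask. */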
Re: [PATCH 2/5] dma-direct: add an explicit dma_direct_get_required_mask
On 20/09/18 19:52, Christoph Hellwig wrote: This is somewhat modelled after the powerpc version, and differs from the legacy fallback in using fls64 instead of pointlessly splitting up the address into low and high dwords, and in that it takes (__)phys_to_dma into account. Signed-off-by: Christoph Hellwig --- include/linux/dma-direct.h | 1 + kernel/dma/direct.c| 21 ++--- 2 files changed, 19 insertions(+), 3 deletions(-) diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h index 86a59ba5a7f3..b79496d8c75b 100644 --- a/include/linux/dma-direct.h +++ b/include/linux/dma-direct.h @@ -55,6 +55,7 @@ static inline void dma_mark_clean(void *addr, size_t size) } #endif /* CONFIG_ARCH_HAS_DMA_MARK_CLEAN */ +u64 dma_direct_get_required_mask(struct device *dev); void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs); void dma_direct_free(struct device *dev, size_t size, void *cpu_addr, diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index c954f0a6dc62..81b73a5bba54 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -53,11 +53,25 @@ check_addr(struct device *dev, dma_addr_t dma_addr, size_t size, return true; } +static inline dma_addr_t phys_to_dma_direct(struct device *dev, + phys_addr_t phys) +{ + if (force_dma_unencrypted()) + return __phys_to_dma(dev, phys); + return phys_to_dma(dev, phys); +} + +u64 dma_direct_get_required_mask(struct device *dev) +{ + u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT); + + return (1ULL << (fls64(max_dma) - 1)) * 2 - 1; I think that may as well just use __fls64() - it seems reasonable to assume max_dma > 0. Otherwise, Reviewed-by: Robin Murphy +} + static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size) { - dma_addr_t addr = force_dma_unencrypted() ? - __phys_to_dma(dev, phys) : phys_to_dma(dev, phys); - return addr + size - 1 <= dev->coherent_dma_mask; + return phys_to_dma_direct(dev, phys) + size - 1 <= + dev->coherent_dma_mask; } void *dma_direct_alloc_pages(struct device *dev, size_t size, @@ -296,6 +310,7 @@ const struct dma_map_ops dma_direct_ops = { .unmap_page = dma_direct_unmap_page, .unmap_sg = dma_direct_unmap_sg, #endif + .get_required_mask = dma_direct_get_required_mask, .dma_supported = dma_direct_supported, .mapping_error = dma_direct_mapping_error, .cache_sync = arch_dma_cache_sync,
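The mask formula is easiest to see with numbers (illustrative arithmetic only):

	/* Say the highest RAM address maps to max_dma = 0x9fffffff:
	 *
	 *   fls64(0x9fffffff)            = 32
	 *   (1ULL << (32 - 1)) * 2 - 1   = 0xffffffff
	 *
	 * i.e. the smallest all-ones mask that covers the top of memory. */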
Re: [PATCH 4/4] dma-mapping: clear dev->dma_ops in arch_teardown_dma_ops
On 25/09/18 21:16, Christoph Hellwig wrote: Looking at the code I think this commit is simply broken for architectures not using arch_setup_dma_ops, but instead setting up the dma ops through arch specific magic. I'll revert the patch. Ugh, sorry about missing that too. Ack to a revert - thinking about those PPC symptoms, it might actually be that other architectures could also get into the same pickle just by unbinding and rebinding a driver (e.g. switching to VFIO then back again). Robin.