Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA
On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann wrote: > From: Arnd Bergmann > > Most ARM CPUs can have write-back caches and that require > cache management to be done in the dma_sync_*_for_device() > operation. This is typically done in both writeback and > writethrough mode. > > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S > (arm920t, arm940t) implementations are the exception here, > and only do the cache management after the DMA is complete, > in the dma_sync_*_for_cpu() operation. > > Change this for consistency with the other platforms. This > should have no user visible effect. > > Signed-off-by: Arnd Bergmann Looks good to me. Reviewed-by: Linus Walleij Yours, Linus Walleij ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA
On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > Most ARM CPUs can have write-back caches and that require > cache management to be done in the dma_sync_*_for_device() > operation. This is typically done in both writeback and > writethrough mode. > > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S > (arm920t, arm940t) implementations are the exception here, > and only do the cache management after the DMA is complete, > in the dma_sync_*_for_cpu() operation. > > Change this for consistency with the other platforms. This > should have no user visible effect. NAK... The reason we do cache management _after_ is to ensure that there is no stale data. The kernel _has_ (at the very least in the past) performed DMA to data structures that are embedded within other data structures, resulting in cache lines being shared. If one of those cache lines is touched while DMA is progressing, then we must to cache management _after_ the DMA operation has completed. Doing it before is no good. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last! ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 17/21] ARM: dma-mapping: use arch_sync_dma_for_{device,cpu}() internally
On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann wrote: > From: Arnd Bergmann > > The arm specific iommu code in dma-mapping.c uses the page+offset based > __dma_page_cpu_to_dev()/__dma_page_dev_to_cpu() helpers in place of the > phys_addr_t based arch_sync_dma_for_device()/arch_sync_dma_for_cpu() > wrappers around the. Broken sentence? > In order to be able to move the latter part set of functions into > common code, change the iommu implementation to use them directly > and remove the internal ones as a separate interface. > > As page+offset and phys_address are equivalent, but are used in > different parts of the code here, this allows removing some of > the conversion but adds them elsewhere. > > Signed-off-by: Arnd Bergmann Looks good to me, took me some time to verify and understand the open-coded version of PFN_UP() and this refactoring alone makes the patch highly valuable. Reviewed-by: Linus Walleij Yours, Linus Walleij ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA
On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote: > On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote: > > From: Arnd Bergmann > > > > Most ARM CPUs can have write-back caches and that require > > cache management to be done in the dma_sync_*_for_device() > > operation. This is typically done in both writeback and > > writethrough mode. > > > > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S > > (arm920t, arm940t) implementations are the exception here, > > and only do the cache management after the DMA is complete, > > in the dma_sync_*_for_cpu() operation. > > > > Change this for consistency with the other platforms. This > > should have no user visible effect. > > NAK... > > The reason we do cache management _after_ is to ensure that there > is no stale data. The kernel _has_ (at the very least in the past) > performed DMA to data structures that are embedded within other > data structures, resulting in cache lines being shared. If one of > those cache lines is touched while DMA is progressing, then we > must to cache management _after_ the DMA operation has completed. > Doing it before is no good. It looks like the main offender of "touching cache lines shared with DMA" has now been resolved - that was the SCSI sense buffer, and was fixed some time ago: commit de25deb18016f66dcdede165d07654559bb332bc Author: FUJITA Tomonori Date: Wed Jan 16 13:32:17 2008 +0900 /if/ that is the one and only case, then we're probably fine, but having been through an era where this kind of thing was the norm and requests to fix it did not get great responses from subsystem maintainers, I just don't trust the kernel not to want to DMA to overlapping cache lines. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last! ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA
On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote: > On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote: >> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote: >> > From: Arnd Bergmann >> > >> > Most ARM CPUs can have write-back caches and that require >> > cache management to be done in the dma_sync_*_for_device() >> > operation. This is typically done in both writeback and >> > writethrough mode. >> > >> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S >> > (arm920t, arm940t) implementations are the exception here, >> > and only do the cache management after the DMA is complete, >> > in the dma_sync_*_for_cpu() operation. >> > >> > Change this for consistency with the other platforms. This >> > should have no user visible effect. >> >> NAK... >> >> The reason we do cache management _after_ is to ensure that there >> is no stale data. The kernel _has_ (at the very least in the past) >> performed DMA to data structures that are embedded within other >> data structures, resulting in cache lines being shared. If one of >> those cache lines is touched while DMA is progressing, then we >> must to cache management _after_ the DMA operation has completed. >> Doing it before is no good. What I'm trying to address here is the inconsistency between implementations. If we decide that we always want to invalidate after FROM_DEVICE, I can do that as part of the series, but then I have to change most of the other arm implementations. Right now, the only WT cache implementations that do the the invalidation after the DMA are cache-v4.S (arm720 integrator and clps711x), cache-v4wt.S (arm920/arm922 at91rm9200, clps711x, ep93xx, omap15xx, imx1 and integrator), some sparc32 leon3 and early xtensa. Most architectures that have write-through caches (m68k, microblaze) or write-back caches but no speculation (all other armv4/armv5, hexagon, openrisc, sh, most mips, later xtensa) only invalidate before DMA but not after. OTOH, most machines that are actually in use today (armv6+, powerpc, later mips, microblaze, riscv, nios2) also have to deal with speculative accesses, so they end up having to invalidate or flush both before and after a DMA_FROM_DEVICE and DMA_BIDIRECTIONAL. > It looks like the main offender of "touching cache lines shared > with DMA" has now been resolved - that was the SCSI sense buffer, > and was fixed some time ago: > > commit de25deb18016f66dcdede165d07654559bb332bc > Author: FUJITA Tomonori > Date: Wed Jan 16 13:32:17 2008 +0900 > > /if/ that is the one and only case, then we're probably fine, but > having been through an era where this kind of thing was the norm > and requests to fix it did not get great responses from subsystem > maintainers, I just don't trust the kernel not to want to DMA to > overlapping cache lines. Thanks for digging that out, that is very useful. It looks like this was around the same time as 03d70617b8a7 ("powerpc: Prevent memory corruption due to cache invalidation of unaligned DMA buffer"), so it may well have been related. I know we also had more recent problems with USB drivers trying to DMA to stack, which would also cause problems on non-coherent machines, but some of these were only found after we introduced VMAP_STACK. It would be nice to use KASAN prevent reads on cache lines that have in-flight DMA. Arnd ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
RE: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA
From: Arnd Bergmann > Sent: 31 March 2023 11:39 ... > Most architectures that have write-through caches (m68k, > microblaze) or write-back caches but no speculation (all other > armv4/armv5, hexagon, openrisc, sh, most mips, later xtensa) > only invalidate before DMA but not after. > > OTOH, most machines that are actually in use today (armv6+, > powerpc, later mips, microblaze, riscv, nios2) also have to > deal with speculative accesses, so they end up having to > invalidate or flush both before and after a DMA_FROM_DEVICE > and DMA_BIDIRECTIONAL. nios2 is a simple in-order cpu with a short pipeline (it is a soft-cpu made from normal fpga logic elements). Definitely doesn't do speculative accesses. OTOH any one trying to run Linux on it needs their head examined. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA
On Fri, Mar 31, 2023 at 12:38:45PM +0200, Arnd Bergmann wrote: > On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote: > > On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote: > >> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote: > >> > From: Arnd Bergmann > >> > > >> > Most ARM CPUs can have write-back caches and that require > >> > cache management to be done in the dma_sync_*_for_device() > >> > operation. This is typically done in both writeback and > >> > writethrough mode. > >> > > >> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S > >> > (arm920t, arm940t) implementations are the exception here, > >> > and only do the cache management after the DMA is complete, > >> > in the dma_sync_*_for_cpu() operation. > >> > > >> > Change this for consistency with the other platforms. This > >> > should have no user visible effect. > >> > >> NAK... > >> > >> The reason we do cache management _after_ is to ensure that there > >> is no stale data. The kernel _has_ (at the very least in the past) > >> performed DMA to data structures that are embedded within other > >> data structures, resulting in cache lines being shared. If one of > >> those cache lines is touched while DMA is progressing, then we > >> must to cache management _after_ the DMA operation has completed. > >> Doing it before is no good. > > What I'm trying to address here is the inconsistency between > implementations. If we decide that we always want to invalidate > after FROM_DEVICE, I can do that as part of the series, but then > I have to change most of the other arm implementations. Why? First thing to say is that DMA to buffers where the cache lines are shared with data the CPU may be accessing need to be outlawed - they are a recipe for data corruption - always have been. Sadly, some folk don't see it that way because of a passed "x86 just works and we demand that all architectures behave like x86!" attitude. The SCSI sense buffer has historically been a big culpret for that. For WT, FROM_DEVICE, invalidating after DMA is the right thing to do, because we want to ensure that the DMA'd data is properly readable upon completion of the DMA. If overlapping cache lines have been touched while DMA is progressing, and we invalidate before DMA, then the cache will contain stale data that will remain in the cache after DMA has completed. Invalidating a WT cache does not destroy any data, so is safe to do. So the safest approach is to invalidate after DMA has completed in this instance. For WB, FROM_DEVICE, we have the problem of dirty cache lines which we have to get rid of. For the overlapping cache lines, we have to clean those before DMA begins to ensure that data written to the non-DMA-buffer part is preserved. All other cache lines need to be invalidated before DMA begins to ensure that writebacks do not corrupt data from the device. Hence why it's different. And hence why the ARM implementation is based around buffer ownership. And hence why they're called dma_map_area()/dma_unmap_area() rather than the cache operations themselves. This is an intentional change, one that was done when ARMv6 came along. > OTOH, most machines that are actually in use today (armv6+, > powerpc, later mips, microblaze, riscv, nios2) also have to > deal with speculative accesses, so they end up having to > invalidate or flush both before and after a DMA_FROM_DEVICE > and DMA_BIDIRECTIONAL. Again, these are implementation details of the cache, and this is precisely why having the map/unmap interface is so much better than having generic code explicitly call "clean" and "invalidate" interfaces into arch code. If we treat everything as a speculative cache, then we're doing needless extra work for those caches that aren't speculative. So, ARM would have to step through every cache line for every DMA buffer at 32-byte intervals performing cache maintenance whether the cache is speculative or not. That is expensive, and hurts performance. I put a lot of thought into this when I updated the ARM DMA implementation when we started seeing these different cache types particularly when ARMv6 came along. I really don't want that work wrecked. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last! ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA
On Fri, Mar 31, 2023, at 13:08, Russell King (Oracle) wrote: > On Fri, Mar 31, 2023 at 12:38:45PM +0200, Arnd Bergmann wrote: >> On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote: >> > On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote: >> >> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote: >> >> > From: Arnd Bergmann >> >> > >> >> > Most ARM CPUs can have write-back caches and that require >> >> > cache management to be done in the dma_sync_*_for_device() >> >> > operation. This is typically done in both writeback and >> >> > writethrough mode. >> >> > >> >> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S >> >> > (arm920t, arm940t) implementations are the exception here, >> >> > and only do the cache management after the DMA is complete, >> >> > in the dma_sync_*_for_cpu() operation. >> >> > >> >> > Change this for consistency with the other platforms. This >> >> > should have no user visible effect. >> >> >> >> NAK...So t >> >> >> >> The reason we do cache management _after_ is to ensure that there >> >> is no stale data. The kernel _has_ (at the very least in the past) >> >> performed DMA to data structures that are embedded within other >> >> data structures, resulting in cache lines being shared. If one of >> >> those cache lines is touched while DMA is progressing, then we >> >> must to cache management _after_ the DMA operation has completed. >> >> Doing it before is no good. >> >> What I'm trying to address here is the inconsistency between >> implementations. If we decide that we always want to invalidate >> after FROM_DEVICE, I can do that as part of the series, but then >> I have to change most of the other arm implementations. > > Why? > > First thing to say is that DMA to buffers where the cache lines are > shared with data the CPU may be accessing need to be outlawed - they > are a recipe for data corruption - always have been. Sadly, some folk > don't see it that way because of a passed "x86 just works and we demand > that all architectures behave like x86!" attitude. The SCSI sense > buffer has historically been a big culpret for that. I think that part is pretty much agree by everyone, the difference between architectures is to what extend they try to work around drivers that get it wrong. > For WT, FROM_DEVICE, invalidating after DMA is the right thing to do, > because we want to ensure that the DMA'd data is properly readable upon > completion of the DMA. If overlapping cache lines haveDoes that mean you take > back you NAK on this patch tehn? been touched > while DMA is proSo tgressing, and we invalidate before DMA, then the cache > will contain stale data that will remain in the cache after DMA has > completed. Invalidating a WT cache does not destroy any data, so is > safe to do. So the safest approach is to invalidate after DMA has > completed in this instance. > For WB, FROM_DEVICE, we have the problem of dirty cache lines which > we have to get rid of. For the overlapping cache lines, we have to > clean those before DMA begins to ensure that data written to the > non-DMA-buffer part is preserved. All other cache lines need to be > invalidated before DMA begins to ensure that writebacks do not > corrupt data from the device. Hence why it's different. I don't see how WB and Wt caches being different implies that we should give extra guarantees to (broken) drivers when WT caches on other architectures. Always doing it first in the absence of prefetching avoids a special case in the generic implementation and makes the driver interface on Arm/sparc32/xtensa WT caches no different from what everything provides. The writeback before DMA_FROM_DEVICE is another issue that we have to address at some point, as there are clearly incompatible expectations here. It makes no sense that a device driver can rely on the entire to be written back on a 64-bit arm kernel but not on a 32-bit kernel. > And hence why the ARM implementation is based around buffer ownership. > And hence why they're called dma_map_area()/dma_unmap_area() rather > than the cache operations themselves. This is an intentional change, > one that was done when ARMv6 came along. The bit that has changed in the meantime though is that the buffer ownership interfaces has moved up in the stack and is now handled mostly in the common kernel/dma/*.c that multiplexes between the direct/iommu/swiotlb dma_map_ops, except for the bit about noncoherent devices. Right now, we have 37 implementations that are mostly identical, and all the differences are either bugs or disagreements about the API guarantees but not related to architecture specific requirements. >> OTOH, most machines that are actually in use today (armv6+, >> powerpc, later mips, microblaze, riscv, nios2) also have to >> deal with speculative accesses, so they end up having to >> invalidate or flush both before and after a DMA_FROM_DEVICE >> and DMA_BIDIRECTIONAL. > > Again, these are implementation detail
Re: [PATCH 17/21] ARM: dma-mapping: use arch_sync_dma_for_{device,cpu}() internally
On Fri, Mar 31, 2023, at 11:10, Linus Walleij wrote: > On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann wrote: > >> From: Arnd Bergmann >> >> The arm specific iommu code in dma-mapping.c uses the page+offset based >> __dma_page_cpu_to_dev()/__dma_page_dev_to_cpu() helpers in place of the >> phys_addr_t based arch_sync_dma_for_device()/arch_sync_dma_for_cpu() >> wrappers around the. > > Broken sentence? I've changed s/the/them/ now, at least I think that's what I meant to write in the first place. >> In order to be able to move the latter part set of functions into >> common code, change the iommu implementation to use them directly >> and remove the internal ones as a separate interface. >> >> As page+offset and phys_address are equivalent, but are used in >> different parts of the code here, this allows removing some of >> the conversion but adds them elsewhere. >> >> Signed-off-by: Arnd Bergmann > > Looks good to me, took me some time to verify and understand > the open-coded version of PFN_UP() and this refactoring alone > makes the patch highly valuable. > Reviewed-by: Linus Walleij Thanks! ARnd ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 21/21] dma-mapping: replace custom code with generic implementation
On Tue, Mar 28, 2023, at 00:25, Christoph Hellwig wrote: >> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size) >> { >> +dma_cache_wback(paddr, size); >> +} >> >> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size) >> +{ >> +dma_cache_inv(paddr, size); >> } > >> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size) >> { >> +dma_cache_wback_inv(paddr, size); >> +} > > There are the only calls for the three functions for each of the > involved functions. So I'd rather rename the low-level symbols > (and drop the pointless exports for two of them) rather than adding > these wrapppers. > > The same is probably true for many other architectures. Ok, done that now. >> +static inline bool arch_sync_dma_clean_before_fromdevice(void) >> +{ >> +return false; >> +} >> >> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void) >> +{ >> +return true; >> } > > Is there a way to cut down on this boilerplate code by just having > sane default, and Kconfig options to override them if they are not > runtime decisions? I've changed arch_sync_dma_clean_before_fromdevice() to a Kconfig symbol now, as this is never a runtime decision. For arch_sync_dma_cpu_needs_post_dma_flush(), I have this version now in common code, which lets mips and arm have their own logic and has the same effect elsewhere: +#ifndef arch_sync_dma_cpu_needs_post_dma_flush +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void) +{ + return IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU); +} +#endif >> +#include > > I can't really say I like the #include version here despite your > rationale in the commit log. I can probably live with it if you > think it is absolutely worth it, but I'm really not in favor of it. > >> +config ARCH_DMA_MARK_DCACHE_CLEAN >> +def_bool y > > What do we need this symbol for? Unless I'm missing something it is > always enable for arm32, and only used in arm32 code. This was left over from an earlier draft and accidentally duplicates the thing that I have in the Arm version for the existing ARCH_HAS_DMA_MARK_CLEAN. I dropped this one and the generic copy of the arch_dma_mark_dcache_clean() function now, but still need to revisit the arm version, as it sounds like it has slightly different semantics from the ia64 version. Arnd ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper
On Mon, Mar 27, 2023, at 14:48, Robin Murphy wrote: > On 2023-03-27 13:13, Arnd Bergmann wrote: >> >> [ HELP NEEDED: can anyone confirm that it is a correct assumption >>on arm that a cache-coherent device writing to a page always results >>in it being in a PG_dcache_clean state like on ia64, or can a device >>write directly into the dcache?] > > In AMBA at least, if a snooping write hits in a cache then the data is > most likely going to get routed directly into that cache. If it has > write-back write-allocate attributes it could also land in any cache > along its normal path to RAM; it wouldn't have to go all the way. > > Hence all the fun we have where treating a coherent device as > non-coherent can still be almost as broken as the other way round :) Ok, thanks for the information. I'm still not sure whether this can result in the situation where PG_dcache_clean is wrong though. Specifically, the question is whether a DMA to a coherent buffer can end up in a dirty L1 dcache of one core and require to write back the dcache before invalidating the icache for that page. On ia64, this is not the case, the optimization here is to only flush the icache after a coherent DMA into an executable user page, while Arm only does this for noncoherent DMA but not coherent DMA. >From your explanation it sounds like this might happen, even though that would mean that "coherent" DMA is slightly less coherent than it is elsewhere. To be on the safe side, I'd have to pass a flag into arch_dma_mark_clean() about coherency, to let the arm implementation still require the extra dcache flush for coherent DMA, while ia64 can ignore that flag. Arnd ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper
On Mon, Mar 27, 2023, at 17:01, Russell King (Oracle) wrote: > On Mon, Mar 27, 2023 at 02:13:16PM +0200, Arnd Bergmann wrote: >> From: Arnd Bergmann >> >> The arm version of the arch_sync_dma_for_cpu() function annotates pages as >> PG_dcache_clean after a DMA, but no other architecture does this here. > > ... because this is an arm32 specific feature. Generically, it's > PG_arch_1, which is a page flag free for architecture use. On arm32 > we decided to use this to mark whether we can skip dcache writebacks > when establishing a PTE - and thus it was decided to call it > PG_dcache_clean to reflect how arm32 decided to use that bit. > > This isn't just a DMA thing, there are other places that we update > the bit, such as flush_dcache_page() and copy_user_highpage(). > > So thinking that the arm32 PG_dcache_clean is something for DMA is > actually wrong. > > Other architectures are free to do their own other optimisations > using that bit, and their implementations may be DMA-centric. The flag is used the same way on most architectures, though some use the opposite polarity and call it PG_dcache_dirty. The only other architecture that uses it for DMA is ia64, with the difference being that this also marks the page as clean even for coherent DMA, not just when doing a flush as part of noncoherent DMA. Based on Robin's reply it sounds that this is not a valid assumption on Arm, if a coherent DMA can target a dirty dcache line without cleaning it. Arnd ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper
On 31/03/2023 3:00 pm, Arnd Bergmann wrote: On Mon, Mar 27, 2023, at 14:48, Robin Murphy wrote: On 2023-03-27 13:13, Arnd Bergmann wrote: [ HELP NEEDED: can anyone confirm that it is a correct assumption on arm that a cache-coherent device writing to a page always results in it being in a PG_dcache_clean state like on ia64, or can a device write directly into the dcache?] In AMBA at least, if a snooping write hits in a cache then the data is most likely going to get routed directly into that cache. If it has write-back write-allocate attributes it could also land in any cache along its normal path to RAM; it wouldn't have to go all the way. Hence all the fun we have where treating a coherent device as non-coherent can still be almost as broken as the other way round :) Ok, thanks for the information. I'm still not sure whether this can result in the situation where PG_dcache_clean is wrong though. Specifically, the question is whether a DMA to a coherent buffer can end up in a dirty L1 dcache of one core and require to write back the dcache before invalidating the icache for that page. On ia64, this is not the case, the optimization here is to only flush the icache after a coherent DMA into an executable user page, while Arm only does this for noncoherent DMA but not coherent DMA. From your explanation it sounds like this might happen, even though that would mean that "coherent" DMA is slightly less coherent than it is elsewhere. To be on the safe side, I'd have to pass a flag into arch_dma_mark_clean() about coherency, to let the arm implementation still require the extra dcache flush for coherent DMA, while ia64 can ignore that flag. Coherent DMA on Arm is assumed to be inner-shareable, so a coherent DMA write should be pretty much equivalent to a coherent write by another CPU (or indeed the local CPU itself) - nothing says that it *couldn't* dirty a line in a data cache above the level of unification, so in general the assumption must be that, yes, if coherent DMA is writing data intended to be executable, then it's going to want a Dcache clean to PoU and an Icache invalidate to PoU before trying to execute it. By comparison, a non-coherent DMA transfer will inherently have to invalidate the Dcache all the way to PoC in its dma_unmap, thus cannot leave dirty data above the PoU, so only the Icache maintenance is required in the executable case. (FWIW I believe the Armv8 IDC/DIC features can safely be considered irrelevant to 32-bit kernels) I don't know a great deal about IA-64, but it appears to be using its PG_arch_1 flag in a subtly different manner to Arm, namely to optimise out the *Icache* maintenance. So if anything, it seems IA-64 is the weirdo here (who'd have guessed?) where DMA manages to be *more* coherent than the CPUs themselves :) This is all now making me think we need some careful consideration of whether the benefits of consolidating code outweigh the confusion of conflating multiple different meanings of "clean" together... Thanks, Robin. ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper
On Fri, Mar 31, 2023 at 04:06:37PM +0200, Arnd Bergmann wrote: > On Mon, Mar 27, 2023, at 17:01, Russell King (Oracle) wrote: > > On Mon, Mar 27, 2023 at 02:13:16PM +0200, Arnd Bergmann wrote: > >> From: Arnd Bergmann > >> > >> The arm version of the arch_sync_dma_for_cpu() function annotates pages as > >> PG_dcache_clean after a DMA, but no other architecture does this here. > > > > ... because this is an arm32 specific feature. Generically, it's > > PG_arch_1, which is a page flag free for architecture use. On arm32 > > we decided to use this to mark whether we can skip dcache writebacks > > when establishing a PTE - and thus it was decided to call it > > PG_dcache_clean to reflect how arm32 decided to use that bit. > > > > This isn't just a DMA thing, there are other places that we update > > the bit, such as flush_dcache_page() and copy_user_highpage(). > > > > So thinking that the arm32 PG_dcache_clean is something for DMA is > > actually wrong. > > > > Other architectures are free to do their own other optimisations > > using that bit, and their implementations may be DMA-centric. > > The flag is used the same way on most architectures, though some > use the opposite polarity and call it PG_dcache_dirty. The only > other architecture that uses it for DMA is ia64, with the difference > being that this also marks the page as clean even for coherent > DMA, not just when doing a flush as part of noncoherent DMA. > > Based on Robin's reply it sounds that this is not a valid assumption > on Arm, if a coherent DMA can target a dirty dcache line without > cleaning it. The other thing to note here is that PG_dcache_clean doesn't have much meaning on modern CPUs with PIPT caches. For these, cache_is_vipt_nonaliasing() will be true, and cache_ops_need_broadcast() will be false. Firstly, if we're using coherent DMA, then PG_dcache_clean is intentionally not touched, because the data cache isn't cleaned in any way by DMA operations. flush_dcache_page() turns into a no-op apart from clearing PG_dcache_clean if it was set. __sync_icache_dcache() will do nothing for non-executable pages, but will write-back a page that isn't marked PG_dcache_clean to ensure that it is visible to the instruction stream. This is only used to ensure that a the instructions are visible to a newly established executable mapping when e.g. the page has been DMA'd in. The default state of PG_dcache_clean is zero on any new allocation, so this has the effect of causing any executable page to be flushed such that the instruction stream can see the instructions, but only for the first establishment of the mapping. That means that e.g. libc text pages don't keep getting flushed on the start of every program. update_mmu_cache() isn't compiled, so it's use of PG_dcache_clean is irrelevant. v6_copy_user_highpage_aliasing() won't be called because we're not using an aliasing cache. So, for modern ARM systems with DMA-coherent PG_dcache_clean only serves for the __sync_icache_dcache() optimisation. ARMs use of this remains valid in this circumstance. -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last! ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 00/21] dma-mapping: unify support for cache flushes
On Mon, Mar 27, 2023 at 02:12:56PM +0200, Arnd Bergmann wrote: > Another difference that I do not address here is what cache invalidation > does for partical cache lines. On arm32, arm64 and powerpc, a partial > cache line always gets written back before invalidation in order to > ensure that data before or after the buffer is not discarded. On all > other architectures, the assumption is cache lines are never shared > between DMA buffer and data that is accessed by the CPU. I don't think sharing the DMA buffer with other data is safe even with this clean+invalidate on the unaligned cache. Mapping the DMA buffer as FROM_DEVICE or BIDIRECTIONAL can cause the shared cache line to be evicted and override the device written data. This sharing only works if the CPU guarantees not to dirty the corresponding cache line. I'm fine with removing this partial cache line hack from arm64 as it's not safe anyway. We'll see if any driver stops working. If there's some benign sharing (I wouldn't trust it), the cache cleaning prior to mapping and invalidate on unmap would not lose any data. -- Catalin ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 18/21] ARM: drop SMP support for ARM11MPCore
On Mon, Mar 27, 2023 at 02:13:14PM +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > The cache management operations for noncoherent DMA on ARMv6 work > in two different ways: > > * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight >DMA buffers lead to data corruption when the prefetched data is written >back on top of data from the device. > > * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU >is not seen by the other core(s), leading to inconsistent contents >accross the system. > > As a consequence, neither configuration is actually safe to use in a > general-purpose kernel that is used on both MPCore systems and ARM1176 > with prefetching enabled. As the author of this terrible hack (created under duress ;)) Acked-by: Catalin Marinas IIRC, RWFO is working in combination with the cache operations. Because the cache maintenance broadcast did not happen, we forced the cache lines to migrate to a CPU via a write (for ownership) and doing the cache maintenance on that CPU (that was the FROM_DEVICE case). For the TO_DEVICE case, reading on a CPU would cause dirty lines on another CPU to be evicted (or migrated as dirty to the current CPU IIRC) then the cache maintenance to clean them to PoC on the local CPU. But there's always a small window between read/write for ownership and the actual cache maintenance which can cause a cache line to migrate to other CPUs if they do speculative prefetches. At the time ARM11MPCore was deemed safe-ish but I haven't followed what later implementations actually did (luckily we fixed the architecture in ARMv7). -- Catalin ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper
On Fri, Mar 31, 2023, at 17:12, Robin Murphy wrote: > On 31/03/2023 3:00 pm, Arnd Bergmann wrote: >> On Mon, Mar 27, 2023, at 14:48, Robin Murphy wrote: >> >> To be on the safe side, I'd have to pass a flag into >> arch_dma_mark_clean() about coherency, to let the arm >> implementation still require the extra dcache flush >> for coherent DMA, while ia64 can ignore that flag. > > Coherent DMA on Arm is assumed to be inner-shareable, so a coherent DMA > write should be pretty much equivalent to a coherent write by another > CPU (or indeed the local CPU itself) - nothing says that it *couldn't* > dirty a line in a data cache above the level of unification, so in > general the assumption must be that, yes, if coherent DMA is writing > data intended to be executable, then it's going to want a Dcache clean > to PoU and an Icache invalidate to PoU before trying to execute it. By > comparison, a non-coherent DMA transfer will inherently have to > invalidate the Dcache all the way to PoC in its dma_unmap, thus cannot > leave dirty data above the PoU, so only the Icache maintenance is > required in the executable case. Ok, makes sense. I've already started reworking my patch for it. > (FWIW I believe the Armv8 IDC/DIC features can safely be considered > irrelevant to 32-bit kernels) > > I don't know a great deal about IA-64, but it appears to be using its > PG_arch_1 flag in a subtly different manner to Arm, namely to optimise > out the *Icache* maintenance. So if anything, it seems IA-64 is the > weirdo here (who'd have guessed?) where DMA manages to be *more* > coherent than the CPUs themselves :) I checked this in the ia64 manual, and as far as I can tell, it originally only had one cacheflush instruction that flushes the dcache and invalidates the icache at the same time. So flush_icache_range() actually does both and flush_dcache_page() instead just marks the page as dirty to ensure flush_icache_range() does not get skipped after a writing a page from the kernel. On later Itaniums, there is apparently a separate icache flush instruction that gets used in flush_icache_range(), but that still works for the DMA case that is allowed to skip the flush. > This is all now making me think we need some careful consideration of > whether the benefits of consolidating code outweigh the confusion of > conflating multiple different meanings of "clean" together... The difference in usage of PG_dcache_clean/PG_dcache_dirty/PG_arch_1 across architectures is certainly big enough that we can't just define a a common arch_dma_mark_clean() across architectures, but I think the idea of having a common entry point for arch_dma_mark_clean() to be called from the dma-mapping code to do something architecture specific after a DMA is clean still makes sense, Arnd ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: [PATCH 00/21] dma-mapping: unify support for cache flushes
On Fri, Mar 31, 2023, at 18:53, Catalin Marinas wrote: > On Mon, Mar 27, 2023 at 02:12:56PM +0200, Arnd Bergmann wrote: >> Another difference that I do not address here is what cache invalidation >> does for partical cache lines. On arm32, arm64 and powerpc, a partial >> cache line always gets written back before invalidation in order to >> ensure that data before or after the buffer is not discarded. On all >> other architectures, the assumption is cache lines are never shared >> between DMA buffer and data that is accessed by the CPU. > > I don't think sharing the DMA buffer with other data is safe even with > this clean+invalidate on the unaligned cache. Mapping the DMA buffer as > FROM_DEVICE or BIDIRECTIONAL can cause the shared cache line to be > evicted and override the device written data. This sharing only works if > the CPU guarantees not to dirty the corresponding cache line. > > I'm fine with removing this partial cache line hack from arm64 as it's > not safe anyway. We'll see if any driver stops working. If there's some > benign sharing (I wouldn't trust it), the cache cleaning prior to > mapping and invalidate on unmap would not lose any data. Ok, I'll add a patch to remove that bit from dcache_inval_poc then. Do you know if any of the the other callers of this function rely on on the writeback behavior, or is it safe to remove it for all of them? Note that before c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers at start of DMA transfer"), it made some sense to write back partial cache lines before a DMA_FROM_DEVICE, in order to allow sharing read-only data in them the same way as on arm32 and powerpc. Doing the writeback in the sync_for_cpu bit is of course always pointless. Arnd ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc