Re: [RFC PATCH v3 2/6] swiotlb: Add restricted DMA pool
On 1/7/21 1:09 PM, Florian Fainelli wrote: On 1/7/21 9:57 AM, Konrad Rzeszutek Wilk wrote: On Fri, Jan 08, 2021 at 01:39:18AM +0800, Claire Chang wrote: Hi Greg and Konrad, This change is intended to be non-arch specific. Any arch that lacks DMA access control and has devices not behind an IOMMU can make use of it. Could you share why you think this should be arch specific? The idea behind non-arch specific code is for it to be generic. The devicetree is specific to PowerPC, Sparc, and ARM, and not to x86 - hence it should be in arch specific code. In principle the same code could be used with an ACPI enabled system with an appropriate service to identify the restricted DMA regions and unlock them. More than 1 architecture requiring this function (ARM and ARM64 are the two I can think of needing this immediately) sort of calls for making the code architecture agnostic since past 2, you need something that scales. There is already code today under kernel/dma/contiguous.c that is only activated on a CONFIG_OF=y && CONFIG_OF_RESERVED_MEM=y system, this is no different. Just a note for history/archives that this approach would not be appropriate on general purpose Arm systems, such as SystemReady-ES edge/non-server platforms seeking to run general purpose distros. I want to have that in the record before someone at Arm (or NVidia, or a bunch of others that come to mind who have memory firewalls) gets an idea. If you're working at an Arm vendor and come looking at this later thinking "wow, what a great idea!", please fix your hardware to have a real IOMMU/SMMU and real PCIe. You'll be pointed at this reply. Jon. -- Computer Architect ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v9 00/13] Add support for eXclusive Page Frame Ownership
Khalid, Thanks for these patches. We will test them on x86 and investigate the Arm pieces highlighted. Jon. -- Computer Architect > On Apr 4, 2019, at 00:37, Khalid Aziz wrote: > > This is another update to the work Juerg, Tycho and Julian have > done on XPFO. After the last round of updates, we were seeing very > significant performance penalties when stale TLB entries were > flushed actively after an XPFO TLB update. The benchmark for measuring > performance is a kernel build using parallel make. To get full > protection from ret2dir attacks, we must flush stale TLB entries. > The performance penalty from flushing stale TLB entries goes up as the > number of cores goes up. On a desktop class machine with only 4 > cores, enabling TLB flush for stale entries causes system time for > "make -j4" to go up by a factor of 2.61x but on a larger machine > with 96 cores, system time with "make -j60" goes up by a factor of > 26.37x! I have been working on reducing this performance penalty. > > I implemented two solutions to reduce the performance penalty, and they > have had a large impact. XPFO code flushes the TLB every time a page is > allocated to userspace. It does so by sending IPIs to all processors > to flush the TLB. Back to back allocations of pages to userspace on > multiple processors result in a storm of IPIs. Each one of these > incoming IPIs is handled by a processor by flushing its TLB. To > reduce this IPI storm, I have added a per CPU flag that can be set > to tell a processor to flush its TLB. A processor checks this flag > on every context switch. If the flag is set, it flushes its TLB and > clears the flag. This allows multiple TLB flush requests to a > single CPU to be combined into a single request. A kernel TLB entry > for a page that has been allocated to userspace is flushed on all > processors, unlike the previous version of this patch. A processor > could hold a stale kernel TLB entry that was removed on another > processor until the next context switch. 
A local userspace page > allocation by the currently running process could force the TLB > flush earlier for such entries. > > The other solution reduces the number of TLB flushes required by > performing a TLB flush for multiple pages at one time when pages are > refilled on the per-cpu freelist. If the pages being added to the > per-cpu freelist are marked for userspace allocation, TLB entries > for these pages can be flushed upfront and the pages tagged as currently > unmapped. When any such page is allocated to userspace, there is no > need to perform a TLB flush at that time any more. This batching of > TLB flushes reduces the performance impact further. Similarly, when > these user pages are freed by userspace and added back to the per-cpu > free list, they are left unmapped and tagged so. This further > optimization reduced the performance impact from 1.32x to 1.28x for the > 96-core server and from 1.31x to 1.27x for the 4-core desktop. > > I measured system time for parallel make with the unmodified 4.20 > kernel, 4.20 with the XPFO patches before these patches, and then again > after applying each of these patches. Here are the results: > > Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM > make -j60 all > > 5.0: 913.862s > 5.0 + this patch series: 1165.259s (1.28x) > > Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM > make -j4 all > > 5.0: 610.642s > 5.0 + this patch series: 773.075s (1.27x) > > Performance with this patch set is good enough to use these as a > starting point for further refinement before we merge it into the main > kernel, hence RFC. > > I have restructured the patches in this version to separate out the > architecture independent code. I folded much of the code > improvement by Julian to not use page extensions into patch 3. > > What remains to be done beyond this patch series: > > 1. Performance improvements: Ideas to explore - (1) kernel mappings > private to an mm, (2) Any others?? > 2. 
Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb" > from Juerg. I dropped it for now since swiotlb code for ARM has > changed a lot since this patch was written. I could use help > from ARM experts on this. > 3. Extend the patch "xpfo, mm: Defer TLB flushes for non-current > CPUs" to other architectures besides x86. > 4. Change kmap to not map the page back to physmap, instead map it > to a new va similar to what kmap_high does. Mapping page back > into physmap re-opens the ret2dir security for the duration of > kmap. All of the kmap_high and related code can be reused for this > but that will require restructuring that code so it can be built for > 64-bits as well. Any objections to that? > > - > > Juerg Haefliger (6): > mm: Add support for eXclusive Page Frame Ownership (XPFO) > xpfo, x86: Add support for XPFO for x86-64 > lkdtm: Add test for XPFO > arm64/mm: Add support for XPFO
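As a reader's aid for the archive: the two optimizations Khalid describes (a per-CPU "flush pending" flag checked at context switch instead of an IPI per allocation, plus batched up-front flushing when the per-cpu freelist is refilled) can be sketched as a small simulation. This is purely illustrative, not the kernel implementation; all names here are hypothetical.

```python
class CPU:
    def __init__(self):
        self.tlb = set()            # simulated kernel TLB entries (page numbers)
        self.flush_pending = False  # set remotely instead of sending an IPI

def request_deferred_flush(cpus):
    """Instead of one IPI per page allocation, just mark every CPU.
    Multiple requests between context switches coalesce into one flush."""
    for cpu in cpus:
        cpu.flush_pending = True

def context_switch(cpu):
    """Each CPU checks its flag on context switch and flushes at most once."""
    if cpu.flush_pending:
        cpu.tlb.clear()
        cpu.flush_pending = False

def refill_freelist(cpus, pages):
    """Batching: flush TLB entries for all refilled pages up front and tag
    the pages unmapped, so no flush is needed at allocation time."""
    request_deferred_flush(cpus)
    return [(page, "unmapped") for page in pages]

cpus = [CPU() for _ in range(4)]
cpus[0].tlb = {10, 11}
freelist = refill_freelist(cpus, [10, 11, 12])
refill_freelist(cpus, [13, 14])     # back-to-back refills coalesce

context_switch(cpus[0])
assert cpus[0].tlb == set() and not cpus[0].flush_pending
# Freelist pages were pre-flushed; allocating them needs no further flush.
assert all(state == "unmapped" for _, state in freelist)
```

The key property shown is that N flush requests between two context switches cost a single TLB flush per CPU, which is where the reported drop from 26.37x overhead comes from.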
Re: [PATCH 0/2] Fix incorrect warning from dma-debug
On 05/09/2016 06:00 AM, Robin Murphy wrote: > On 09/05/16 10:37, Robin Murphy wrote: >> Hi Niklas, >> >> On 08/05/16 11:59, Niklas Söderlund wrote: >>> Hi, >>> >>> While using CONFIG_DMA_API_DEBUG I came across this warning, which I >>> think is a false positive. As shown, dma_sync_single_for_device() is >>> called from the dma_map_single() call path. This triggers the warning >>> since the dma-debug code has not yet been made aware of the mapping. >> >> Almost right ;) The thing being mapped (the SPI device's buffer) and the >> thing being synced (the IOMMU's PTE) are entirely unrelated. Due to the >> current of_iommu_init() setup, the IOMMU is probed long before >> dma_debug_init() gets called, therefore DMA debug is missing entries for >> some of the initial page table mappings and gets confused when we update >> them later. What was the eventual followup/fix for this? We're seeing similar traces to this during internal testing of 4.11 kernels. Jon. -- Computer Architect
Re: [PATCH v3 3/7] ACPICA: IORT: Add Cavium ThunderX2 SMMUv3 model definition.
On 05/05/2017 10:58 AM, Will Deacon wrote: > On Fri, May 05, 2017 at 07:56:17AM -0700, David Daney wrote: >> On 05/05/2017 06:53 AM, Hanjun Guo wrote: >>> On 2017/5/5 20:08, Geetha sowjanya wrote: From: Linu Cherian +#define ACPI_IORT_SMMU_V3_CAVIUM_CN99XX 0x0002 /* Cavium ThunderX2 SMMUv3 */ >>> >>> There are some other model numbers in the unreleased spec, >>> I think we need to wait for the updated IORT spec to >>> be released. Indeed. I've synced with the author on this and he's got it in hand. >> ... or if we are fairly confident that the identifier will not need to >> change, we can merge this as is and establish a de facto specification that >> the Real IORT specification will then be forced to follow. Can't do that - this always causes trouble ;) But if there's any delay I'll ask that the IDs at least be listed somewhere public or something. >> Is there anything other than bureaucratic inertia holding up the real >> specification? > > My understanding is that IORT is going to be published imminently (i.e. > before the next kernel release), so it makes sense to wait rather than fork > the spec. Let's track this and get the updated patches posted next week once the new ID drops. Meanwhile, I suggest reviewing them as-is for other issues. I'm tracking this for internal purposes and require this to be upstream asap so I'll be sitting on this thread for updates ;) Jon. -- Computer Architect | Sent from my Fedora powered laptop
Re: [PATCH 2/3] iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #74
On 05/03/2017 05:47 AM, Will Deacon wrote: > Hi Geetha, > > On Tue, May 02, 2017 at 12:01:15PM +0530, Geetha Akula wrote: >> The SMMU_IIDR register is broken on T99, that's the reason we are using MIDR. > > Urgh, that's unfortunate. In what way is it broken? > >> If using MIDR is not accepted, can we enable the errata based on SMMU resource >> size? >> Something like below. > > No, you need to get your model number added to IORT after all if the IIDR > can't uniquely identify the part. > > Sorry [I've pinged the IORT author directly with a copy of the above message] Can folks please take action urgently if the IORT spec needs updating to accommodate additional vendor IDs. Jon.
Re: [PATCH v5 2/2] PCI: quirks: Fix ThunderX2 dma alias handling
One additional footnote. I spent a bunch of time recently on my personal Xeon systems walking through the PCIe topology and aligning on how to advise the ARM server community to proceed going forward. If you look at how Intel vs AMD handle their host bridges, for example, you'll see two very different approaches to the case of cross-socket PCIe. But my operating assumption is that anything longer term which looks boring and x86 enough is probably fine from an ARM server point of view. On 04/19/2017 07:38 PM, Jon Masters wrote: > Hi Bjorn, JC, > > On 04/13/2017 08:19 PM, Bjorn Helgaas wrote: > >> I tentatively applied both patches to pci/host-thunder for v4.12. > > Thanks for that :) > >> However, I am concerned about the topology here: > > Various feedback has been provided on this one over the past $time. In > addition, I have requested that this serve as an example of why we need > a more complete PCIe integration guide for ARMv8. It's on the list of > things for my intended opus magnum on the topic ;) > >> On Thu, Apr 13, 2017 at 08:30:45PM +, Jayachandran C wrote: >>> On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the >>> PCI topology is slightly unusual. For a multi-node system, it looks >>> like: >>> >>> 00:00.0 [PCI] bridge to [bus 01-1e] >>> 01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04] >>> 02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] (XLATE_ROOT) >>> 03:00.0 PCIe Endpoint >> >> A root port normally has a single PCIe link leading downstream. > > In integration terms, there's always a bus on the other side of the RC. > It's just usually a processor local bus of some kind on a proprietary > interconnect, but there's always something there. The issue is when you > can see it as PCIe and it's not through a transparent glue bridge. 
> > I had originally hoped that the ECAM could be hacked up so that we would > first walk the topology at the 02:00.0 as a root and not see what's > above it BUT there are other device attachments up there for the on-SoC > devices and I think we really intend to use those. > > Bottom line in my opinion is document this case, use it as a learning > example, and move forward. This has been useful in justifying why we > need better integration documentation from the server community. And in > particular from the OS vendors, of which I guess we can allude to their > being more than Linux interested in ARM server these days ;) > > Jon. > -- Computer Architect | Sent from my Fedora powered Ryzen
Re: [PATCH v5 2/2] PCI: quirks: Fix ThunderX2 dma alias handling
Hi Bjorn, JC, On 04/13/2017 08:19 PM, Bjorn Helgaas wrote: > I tentatively applied both patches to pci/host-thunder for v4.12. Thanks for that :) > However, I am concerned about the topology here: Various feedback has been provided on this one over the past $time. In addition, I have requested that this serve as an example of why we need a more complete PCIe integration guide for ARMv8. It's on the list of things for my intended opus magnum on the topic ;) > On Thu, Apr 13, 2017 at 08:30:45PM +, Jayachandran C wrote: >> On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the >> PCI topology is slightly unusual. For a multi-node system, it looks >> like: >> >> 00:00.0 [PCI] bridge to [bus 01-1e] >> 01:0a.0 [PCI-PCIe bridge, type 8] bridge to [bus 02-04] >> 02:00.0 [PCIe root port, type 4] bridge to [bus 03-04] (XLATE_ROOT) >> 03:00.0 PCIe Endpoint > > A root port normally has a single PCIe link leading downstream. In integration terms, there's always a bus on the other side of the RC. It's just usually a processor local bus of some kind on a proprietary interconnect, but there's always something there. The issue is when you can see it as PCIe and it's not through a transparent glue bridge. I had originally hoped that the ECAM could be hacked up so that we would first walk the topology at the 02:00.0 as a root and not see what's above it BUT there are other device attachments up there for the on-SoC devices and I think we really intend to use those. Bottom line in my opinion is document this case, use it as a learning example, and move forward. This has been useful in justifying why we need better integration documentation from the server community. And in particular from the OS vendors, of which I guess we can allude to their being more than Linux interested in ARM server these days ;) Jon. 
-- Computer Architect | Sent from my Fedora powered Ryzen
Re: [PATCH v4 2/2] PCI: quirks: Fix ThunderX2 dma alias handling
On 04/04/2017 10:28 AM, Robin Murphy wrote: > So (at the risk of Jon mooing at me) m -- Computer Architect | Sent from my Fedora powered laptop
Re: [PATCH v4 2/2] PCI: quirks: Fix ThunderX2 dma alias handling
On 04/11/2017 11:27 AM, Jayachandran C wrote: > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote: >> I suspect the reason this patch makes a difference is because the >> current pci_for_each_dma_alias() believes one of those top-level >> bridges is an alias, and the iterator produces it last, so that's the >> one you map. The IOMMU is attached lower down, so that top-level >> bridge is not in fact an alias, but since you only look at the *last* >> one, you don't map the correct aliases from lower down in the tree. > > Exactly. The IORT spec allows a range of RIDs to map to an SMMU, which > means that a PCI RC can have multiple SMMUs, each handling a subset of RIDs. > > In the case of Cavium ThunderX2, the RID which we should see on the RC > - if we follow the standard and factor in the aliasing introduced by the > PCI bridge and the PCI/PCIe bridge - is not the RID seen by the SMMU (or > ITS). > > But, if we stop the traversal at the point where the SMMU (or ITS) is > attached, we will get the correct RID as seen by these. Side note that I am trying to get various specifications clarified to promote more of a familiar alternative architecture (x86) approach in the future in which these aren't at different levels in the topology. But to do that requires integrated Root Complex IP with bells/whistles. Jon. -- Computer Architect | Sent from my Fedora powered laptop
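For readers following the thread, the alias walk JC describes can be sketched on the ThunderX2-style topology quoted elsewhere in the series (00:00.0 PCI bridge, 01:0a.0 PCI-to-PCIe bridge, 02:00.0 root port flagged as translation root, 03:00.0 endpoint). This is a simplified illustration of the idea behind pci_for_each_dma_alias() plus the quirk flag, not the kernel code; the class names and the `xlate_root` attribute are hypothetical.

```python
class Dev:
    def __init__(self, rid, kind, parent=None, xlate_root=False):
        self.rid = rid          # (bus << 8) | devfn
        self.kind = kind        # "endpoint", "pcie-bridge", "pci-bridge"
        self.parent = parent
        self.xlate_root = xlate_root  # stands in for PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT

def dma_aliases(dev):
    """Walk towards the root, collecting the requester IDs an SMMU/ITS may
    observe. A conventional PCI bridge takes ownership of the transaction,
    so its own RID becomes an alias. Stop at the bridge flagged as the
    translation root, since the SMMU/ITS sits there."""
    aliases = [dev.rid]
    cur = dev.parent
    while cur is not None:
        if cur.xlate_root:
            break                        # don't walk past the SMMU/ITS attach point
        if cur.kind == "pci-bridge":
            aliases.append(cur.rid)      # legacy bridge aliases the RID
        cur = cur.parent
    return aliases

host   = Dev(0x0000, "pci-bridge")                                  # 00:00.0
bridge = Dev(0x0150, "pci-bridge", parent=host)                     # 01:0a.0
root   = Dev(0x0200, "pcie-bridge", parent=bridge, xlate_root=True) # 02:00.0
ep     = Dev(0x0300, "endpoint", parent=root)                       # 03:00.0

assert dma_aliases(ep) == [0x0300]       # walk stops at the flagged root port

# Without the quirk, the walk continues through the legacy bridges and
# reports their RIDs too, giving the wrong answer for the SMMU/ITS:
root.xlate_root = False
assert dma_aliases(ep) == [0x0300, 0x0150, 0x0000]
```

This mirrors the point above: stopping the traversal where the SMMU/ITS is attached yields the RID those blocks actually see, while the full walk yields the aliases a bridge higher up would generate.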
Re: [PATCH v3 0/2] Handle Cavium ThunderX2 PCI topology quirk
On 03/22/2017 04:51 AM, Jayachandran C wrote: > Hi Bjorn, Alex, > > Here is v3 of the patchset to handle the PCIe topology quirk of > Cavium ThunderX2 (previously called Broadcom Vulcan). > > The earlier discussions on this can be seen at: > http://www.spinics.net/lists/linux-pci/msg51001.html > https://patchwork.ozlabs.org/patch/582633/ and > https://lists.linuxfoundation.org/pipermail/iommu/2016-June/017681.html > > The earlier discussion on this patchset had stalled with a suggestion > that it may be possible to fix up this quirk by handling the issue in > the function argument of pci_for_each_dma_alias(). But at that point > all the ACPI and OF code for SMMU and GIC was yet to be merged, and we did > not have a reasonable codebase to make the changes. > > For 4.11, I tried to fix it in both the SMMU and the GIC ITS code based > on this suggestion, but after going thru the effort, that does not look > like the right approach. I have the code changes at: > https://github.com/jchandra-cavm/linux/commits/rid-xlate-fixup > if anyone wants to look over the code. > > The problems with that approach are: > - of the 14 uses of pci_for_each_dma_alias in the kernel >tree, I have to fix up 6 callers (which is all but one of the callers >outside x86) > - 4 of these can be reasonably handled (please see the github repo above), >but the calls in drivers/irqchip/irq-gic-v3-its-pci-msi.c and >drivers/iommu/iommu.c cannot be reasonably fixed up. > - Even without the above two changes I can get it to work for now. >But pci_for_each_dma_alias does not work as expected on this platform >and we have to be aware of that for all future uses of the function. > > For now, I have ruled out that approach, and I have rebased the earlier > patch on to 4.11-rc and am submitting it again for review. The changes are: > > v2->v3: > - changed device flag name from PCI_DEV_FLAGS_DMA_ALIAS_ROOT to >PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT > - updated commit message to make the quirk clearer. 
> > Let me know your comments and suggestions. My opinion FWIW is that the quirk you have is one of the least intrusive ways of handling this. Generally in the case of ARM servers, we have a difference vs. x86 in that the latter usually has a magic RC at the top level that everything sits beneath (and then, presumably, Intel do some magic for multi-socket to fudge things over Q/UPI so that things look nice and boring to software). On ARM, we're typically dealing with third party RC IP that's disjoint from other parts of the SoC. We're certainly in the process of bolstering the specs to set some expectations and greater guidance around topologies that we would like to avoid, so I don't see this getting out of hand. That's my $0.02. Jon. -- Computer Architect | Sent from my Fedora powered laptop
Re: Summary of LPC guest MSI discussion in Santa Fe
On 11/07/2016 07:45 PM, Will Deacon wrote: > I figured this was a reasonable post to piggy-back on for the LPC minutes > relating to guest MSIs on arm64. Thanks for this Will. I'm still digging out post-LPC and SC16, but the summary was much appreciated, and I'm glad the conversation is helping. > 1. The physical memory map is not standardised (Jon pointed out that > this is something that was realised late on) Just to note, we discussed this one about 3-4 years ago. I recall making a vigorous slideshow at a committee meeting in defense of having a single memory map for ARMv8 servers and requiring everyone to follow it. I was weak. I listened to the comments that this was "unreasonable". Instead, I consider it was unreasonable of me to not get with the other OS vendors and force things to be done one way. The lack of a "map at zero" RAM location on ARMv8 has been annoying enough for 32-bit DMA only devices on 64-bit (behind an SMMU but in passthrough mode it doesn't help) and other issues beyond fixing the MSI doorbell regions. If I ever have a time machine, I tried harder. > Jon pointed out that most people are pretty conservative about hardware > choices when migrating between them -- that is, they may only migrate > between different revisions of the same SoC, or they know ahead of time > all of the memory maps they want to support and this could be communicated > by way of configuration to libvirt. I think it's certainly reasonable to assume this in an initial implementation and fix it later. Currently, we're very conservative about host CPU passthrough anyway and can't migrate from one microarch to another revision of the same microarch even. And on x86, nobody really supports e.g. Intel to AMD and back again. I've always been of the mind that we should ensure the architecture can handle this, but then cautiously approach this with a default to not doing it. 
> Alex asked if there was a security > issue with DMA bypassing the SMMU, but there aren't currently any systems > where that is known to happen. Such a system would surely not be safe for > passthrough. There are other potential security issues that came up but don't need to be noted here (yet). I have wanted to clarify the SBSA for a long time when it comes to how IOMMUs should be implemented. It's past time that we went back and had a few conversations about that. I've poked. > Ben mused that a way to handle conflicts dynamically might be to hotplug > on the entire host bridge in the guest, passing firmware tables describing > the new reserved regions as a property of the host bridge. Whilst this > may well solve the issue, it was largely considered future work due to > its invasive nature and dependency on firmware tables (and guest support) > that do not currently exist. Indeed. It's an elegant solution (thanks Ben) that I gather POWER already does (good for them). We've obviously got a few things to clean up after we get the basics in place. Again, I think we can consider it reasonable that the MSI doorbell regions are predetermined on system A well ahead of any potential migration (that may or may not then work) for the moment. Vendors will want to loosen this later, and they can drive the work to do that, for example by hotplugging a host bridge. Jon.
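The "predetermined MSI doorbell regions" position above implies a simple compatibility check at migration time: the destination host's reserved doorbell regions must not collide with the guest's RAM layout. A minimal sketch of that check, with entirely made-up region values for illustration:

```python
def regions_overlap(a, b):
    """Each region is a (start, end) pair with the end address exclusive."""
    return a[0] < b[1] and b[0] < a[1]

def can_migrate(guest_ram, host_reserved):
    """Migration is only safe if no host-reserved region (e.g. an MSI
    doorbell the SMMU cannot remap) falls inside guest RAM."""
    return not any(regions_overlap(ram, resv)
                   for ram in guest_ram for resv in host_reserved)

guest_ram = [(0x4000_0000, 0x8000_0000)]   # 1 GiB of guest RAM at 1 GiB
host_a    = [(0x0800_0000, 0x0801_0000)]   # doorbell below guest RAM: fine
host_b    = [(0x4800_0000, 0x4801_0000)]   # doorbell inside guest RAM: conflict

assert can_migrate(guest_ram, host_a)
assert not can_migrate(guest_ram, host_b)
```

This is exactly the kind of configuration knowledge the minutes suggest communicating to libvirt ahead of time: if every candidate host passes the check against the chosen guest memory map, the non-standardised physical memory map stops being a blocker for that deployment.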
Re: [PATCH v2 1/2] PCI: Add PCI device flag PCI_DEV_FLAGS_DMA_ALIAS_ROOT
Quick reply (sorry about top post): I am just getting online so I might discover that this callback is shared by both - in which case, cool - but if not, we need an interim ACPI solution. Aside: Current RHEL Server for ARM (RHELSA) explicitly disables SMMUv3 support because (I warned the folks involved more than a year ago), we will only support ACPI/IORT, and as you can imagine, many people would like us to enable support for device passthrough and robustness, not to mention 32-bit devices (32-bit DMA mask). Nonetheless, that will only happen when the upstream kernel has IORT and quirks in place. -- Computer Architect | Sent from my 64-bit #ARM Powered phone > On Jun 23, 2016, at 08:04, Robin Murphy wrote: > >> On 23/06/16 06:01, Jon Masters wrote: >>> On 05/11/2016 10:26 AM, Robin Murphy wrote: >>> (I have no actual objection to this patch, though, and at this point >>> I'm just chucking ideas about). >> >> Can I ask what the next steps are here? We're looking for upstream >> direction to guide some internal activities and could really do with >> understanding how you'd like to solve this one longer term as well as >> what interim solution could be acceptable until we get there. > > Well, for now I'm planning to leave the explicit "terminate the alias walk > from the callback function" behaviour in the DT-parsing code[1], since there > doesn't seem any good reason not to. As Bjorn says, though, it probably is > generally useful for the PCI code to have its own knowledge of exactly where > DMA can escape the PCI hierarchy - I now wonder if we could actually just do > that from the DT/IORT code; if firmware says a particular bridge/etc. has a > relationship with an ITS or SMMU, then presumably it's reasonable to infer > that DMA can come out of it, thus we could inform the PCI code there and then > without it having to quirk things on its own? > > Robin. > > [1]:http://article.gmane.org/gmane.linux.kernel.iommu/13932 > >> >> Jon. 
Re: [PATCH v2 1/2] PCI: Add PCI device flag PCI_DEV_FLAGS_DMA_ALIAS_ROOT
On 05/11/2016 10:26 AM, Robin Murphy wrote: > (I have no actual objection to this patch, though, and at this point > I'm just chucking ideas about). Can I ask what the next steps are here? We're looking for upstream direction to guide some internal activities and could really do with understanding how you'd like to solve this one longer term as well as what interim solution could be acceptable until we get there. Jon. -- Computer Architect | Sent from my Fedora powered laptop