Re: [PATCH 09/27] mm: generalize the pgmap based page_free infrastructure
On 2022-02-10 12:28 a.m., Christoph Hellwig wrote:
> Key off on the existence of ->page_free to prepare for adding support for
> more pgmap types that are device managed and thus need the free callback.
>
> Signed-off-by: Christoph Hellwig

Great! This makes my patch simpler.

Reviewed-by: Logan Gunthorpe

> ---
>  mm/memremap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index fef5734d5e4933..e00ffcdba7b632 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -452,7 +452,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>
>  void free_zone_device_page(struct page *page)
>  {
> -	if (WARN_ON_ONCE(!is_device_private_page(page)))
> +	if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
>  		return;
>
>  	__ClearPageWaiters(page);
> @@ -460,7 +460,7 @@ void free_zone_device_page(struct page *page)
>  	mem_cgroup_uncharge(page_folio(page));
>
>  	/*
> -	 * When a device_private page is freed, the page->mapping field
> +	 * When a device managed page is freed, the page->mapping field
>  	 * may still contain a (stale) mapping value. For example, the
>  	 * lower bits of page->mapping may still identify the page as an
>  	 * anonymous page. Ultimately, this entire field is just stale
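To make the new gate concrete, here is a small userspace model of the condition the patched free_zone_device_page() now tests. The struct layouts and the names page_model/can_free_zone_device_page are invented for illustration and do not match the real kernel types:

```c
#include <stddef.h>

/* Userspace stand-ins for the kernel types involved; the layouts here
 * are illustrative only, not the real struct page/dev_pagemap. */
struct page_model;

struct dev_pagemap_ops {
	void (*page_free)(struct page_model *page);
};

struct dev_pagemap {
	const struct dev_pagemap_ops *ops;
};

struct page_model {
	struct dev_pagemap *pgmap;
};

/* Example of a driver-provided free callback. */
void example_page_free(struct page_model *page)
{
	(void)page; /* hand the page back to the driver's allocator */
}

/* After the patch, the gate is "does a free callback exist?" rather
 * than "is this specifically a device-private page?", so any device
 * managed pgmap type that registers ->page_free is covered. */
int can_free_zone_device_page(const struct page_model *page)
{
	return page->pgmap->ops && page->pgmap->ops->page_free != NULL;
}
```

This is exactly why the patch generalizes the check: device coherent and p2p pages get the free callback behavior without adding per-type tests.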
Re: start sorting out the ZONE_DEVICE refcount mess
On 2022-02-06 11:32 p.m., Christoph Hellwig wrote:
> Hi all,
>
> this series removes the offset by one refcount for ZONE_DEVICE pages
> that are freed back to the driver owning them, which is just device
> private ones for now, but also the planned device coherent pages
> and the enhanced p2p ones pending.
>
> It does not address the fsdax pages yet, which will be attacked in a
> follow on series.
>
> Diffstat:
>  arch/arm64/mm/mmu.c                       |    1
>  arch/powerpc/kvm/book3s_hv_uvmem.c        |    1
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c  |    2
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h     |    1
>  drivers/gpu/drm/drm_cache.c               |    2
>  drivers/gpu/drm/nouveau/nouveau_dmem.c    |    3
>  drivers/gpu/drm/nouveau/nouveau_svm.c     |    1
>  drivers/infiniband/core/rw.c              |    1
>  drivers/nvdimm/pmem.h                     |    1
>  drivers/nvme/host/pci.c                   |    1
>  drivers/nvme/target/io-cmd-bdev.c         |    1
>  fs/Kconfig                                |    2
>  fs/fuse/virtio_fs.c                       |    1
>  include/linux/hmm.h                       |    9
>  include/linux/memremap.h                  |   22 +-
>  include/linux/mm.h                        |   59 -
>  lib/test_hmm.c                            |    4 +
>  mm/Kconfig                                |    4 -
>  mm/internal.h                             |    2
>  mm/memcontrol.c                           |   11 +
>  mm/memremap.c                             |   63 ---
>  mm/migrate.c                              |    6 --
>  mm/swap.c                                 |   49 ++--
>  23 files changed, 90 insertions(+), 157 deletions(-)

Looks good to me. I was wondering about the location of some of this code, so it's nice to see it cleaned up. Except for the minor issue I noted on patch 6, it all looks good. I've reviewed all the patches and tested the series under my p2pdma series.

Reviewed-by: Logan Gunthorpe

Logan
Re: [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>
On 2022-02-06 11:32 p.m., Christoph Hellwig wrote:
> Move the check for the actual pgmap types that need the free at refcount
> one behavior into the out of line helper, and thus avoid the need to
> pull memremap.h into mm.h.
>
> Signed-off-by: Christoph Hellwig

I've noticed mm/memcontrol.c uses is_device_private_page() and also needs a memremap.h include added to compile with my configuration.

Logan
Re: [PATCH v2 02/15] PCI: Add shorthand define for message signalled interrupt types
On 2020-06-03 5:45 a.m., Piotr Stankiewicz wrote:
> There are several places in the kernel which check/ask for MSI or MSI-X
> interrupts. It would make sense to have a shorthand constant, similar to
> PCI_IRQ_ALL_TYPES, to use in these situations. So add PCI_IRQ_MSI_TYPES,
> for this purpose.
>
> Signed-off-by: Piotr Stankiewicz
> Suggested-by: Andy Shevchenko
> Reviewed-by: Andy Shevchenko

Looks good to me,

Reviewed-by: Logan Gunthorpe

Thanks,

Logan

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
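For context, the shorthand presumably amounts to a one-line OR of the existing flags. The PCI_IRQ_* bit values below mirror include/linux/pci.h but are restated here as an illustration, not quoted from the patch itself:

```c
/* PCI IRQ type flags as defined in include/linux/pci.h (restated
 * here for illustration). */
#define PCI_IRQ_LEGACY    (1 << 0)  /* allow legacy INTx interrupts */
#define PCI_IRQ_MSI       (1 << 1)  /* allow MSI interrupts */
#define PCI_IRQ_MSIX      (1 << 2)  /* allow MSI-X interrupts */

/* The proposed shorthand: either message-signalled flavour. */
#define PCI_IRQ_MSI_TYPES (PCI_IRQ_MSI | PCI_IRQ_MSIX)

/* The existing shorthand it is modelled on. */
#define PCI_IRQ_ALL_TYPES (PCI_IRQ_LEGACY | PCI_IRQ_MSI_TYPES)
```

A caller such as pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI_TYPES) would then accept MSI or MSI-X vectors but never fall back to legacy INTx.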
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 02/04/18 01:16 PM, Jerome Glisse wrote:
> There isn't good API at the moment AFAIK, closest thing would either be
> lookup_resource() or region_intersects(), but a more appropriate one can
> easily be added, code to walk down the tree is readily available. Moreover
> this can be optimize like vma lookup are, even more as resource are
> seldomly added so read side (finding a resource) can be heavily favor
> over write side (adding|registering a new resource).

So someone needs to create a highly optimized tree that registers all physical addresses on the system and maps them to devices? That seems a long way from being realized. I'd hardly characterize that as "easily". If we can pass both devices to the API I'd suspect it would be preferred over the complicated tree. This, of course, depends on what users of the API need.

> cache coherency protocol (bit further than PCIE snoop). But also the
> other direction the CPU access to device memory can also be cache coherent,
> which is not the case in PCIE.

I was not aware that CAPI allows PCI device memory to be cache coherent. That sounds like it would be very tricky...

> Note that with mmu_notifier there isn't any need to pin stuff (even
> without any special hardware capabilities), as long as you can preempt
> what is happening on your hardware to update its page table.

I've been told there's a lot of dislike of the mmu_notifier interface. And being able to preempt what's happening on hardware, generally, is not trivial. But, yes, this is essentially how ODP works.

Logan
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 02/04/18 11:20 AM, Jerome Glisse wrote:
> The point i have been trying to get accross is that you do have this
> information with dma_map_resource() you know the device to which you
> are trying to map (dev argument to dma_map_resource()) and you can
> easily get the device to which the memory belongs because you have the
> CPU physical address of the memory hence you can lookup the resource
> and get the device from that.

How do you go from a physical address to a struct device generally and in a performant manner?

> IIRC CAPI make P2P mandatory but maybe this is with NVLink. We can ask
> the PowerPC folks to confirm. Note CAPI is Power8 and newer AFAICT.

PowerPC folks recently told us specifically that Power9 does not support P2P between PCI root ports. I've said this many times. CAPI has nothing to do with it.

> Mapping to userspace have nothing to do here. I am talking at hardware
> level. How thing are expose to userspace is a completely different
> problems that do not have one solution fit all. For GPU you want this
> to be under total control of GPU drivers. For storage like persistent
> memory, you might want to expose it userspace more directly ...

My understanding (and I worked on this a while ago) is that CAPI hardware manages memory maps typically for userspace memory. When a userspace program changes its mapping, the CAPI hardware is updated so that hardware is coherent with the user address space and it is safe to DMA to any address without having to pin memory. (This is very similar to ODP in RNICs.) This is *really* nice but doesn't solve *any* of the problems we've been discussing. Moreover, many developers want to keep P2P in-kernel, for the time being, where the problem of pinning memory does not exist.

Logan
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 30/03/18 01:45 PM, Jerome Glisse wrote:
> Looking at upstream code it seems that the x86 bits never made it upstream
> and thus what is now upstream is only for ARM. See [1] for x86 code. Dunno
> what happen, i was convince it got merge. So yes current code is broken on
> x86. ccing Joerg maybe he remembers what happened there.
>
> [1] https://lwn.net/Articles/646605/

That looks like a significant improvement over what's upstream. But it's three years old and looks to have been abandoned. I think I agree with Bjorn's comments in that if it's going to be a general P2P API it will need the device the resource belongs to in addition to the device that's initiating the DMA request.

> Given it is currently only used by ARM folks it appear to at least work
> for them (tm) :) Note that Christian is doing this in PCIE only context
> and again dma_map_resource can easily figure that out if the address is
> a PCIE or something else. Note that the exporter export the CPU bus
> address. So again dma_map_resource has all the informations it will ever
> need, if the peer to peer is fundamentaly un-doable it can return dma
> error and it is up to the caller to handle this, just like GPU code do.
>
> Do you claim that dma_map_resource is missing any information ?

Yes, that's what I said. All the existing ARM code appears to use it for platform devices only. I suspect platform P2P is relatively trivial to support on ARM. I think it's a big difference from using it for PCI P2P or general P2P on any bus.

> I agree and understand that but for platform where such feature make sense
> this will work. For me it is PowerPC and x86 and given PowerPC has CAPI
> which has far more advance feature when it comes to peer to peer, i don't
> see something more basic not working. On x86, Intel is a bit of lone wolf,
> dunno if they gonna support this usecase pro-actively. AMD definitly will.

Well PowerPC doesn't even support P2P between root ports. And I fail to see how CAPI applies unless/until we get this memory mapped into userspace and the mappings need to be dynamically managed. Seems like that's a long way away.

Logan
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 29/03/18 07:58 PM, Jerome Glisse wrote:
> On Thu, Mar 29, 2018 at 10:25:52AM -0600, Logan Gunthorpe wrote:
>> On 29/03/18 10:10 AM, Christian König wrote:
>>> Why not? I mean the dma_map_resource() function is for P2P while other
>>> dma_map_* functions are only for system memory.
>>
>> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
>> P2P. Though it's a bit odd seeing we've been working under the
>> assumption that PCI P2P is different as it has to translate the PCI bus
>> address. Where as P2P for devices on other buses is a big unknown.
>
> dma_map_resource() is the right API (thought its current implementation
> is fill with x86 assumptions). So i would argue that arch can decide to
> implement it or simply return dma error address which trigger fallback
> path into the caller (at least for GPU drivers). SG variant can be added
> on top.
>
> dma_map_resource() is the right API because it has all the necessary
> informations. It use the CPU physical address as the common address
> "language" with CPU physical address of PCIE bar to map to another
> device you can find the corresponding bus address from the IOMMU code
> (NOP on x86). So dma_map_resource() knows both the source device which
> export its PCIE bar and the destination devices.

Well, as it is today, it doesn't look very sane to me. The default is to just return the physical address if the architecture doesn't support it. So if someone tries this on an arch that hasn't added itself to return an error they're very likely going to just end up DMAing to an invalid address and losing the data or causing a machine check.

Furthermore, the API does not have all the information it needs to do sane things. A phys_addr_t doesn't really tell you anything about the memory behind it or what needs to be done with it. For example, on some arm64 implementations if the physical address points to a PCI BAR and that BAR is behind a switch with the DMA device then the address must be converted to the PCI BUS address. On the other hand, if it's a physical address of a device in an SOC it might need to be handled in a completely different way. And right now all the implementations I can find seem to just assume that phys_addr_t points to regular memory and can be treated as such.

This is one of the reasons that, based on feedback, our work went from being general P2P with any device to being restricted to only P2P with PCI devices. The dream that we can just grab the physical address of any device and use it in a DMA request is simply not realistic.

Logan
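The objection about the silent fallback can be shown with a tiny userspace model. The names arch_dma_ops/map_resource_* are invented; this is not the kernel's dma_map_ops, just a sketch of the two possible defaults:

```c
#include <stdint.h>

typedef uint64_t dma_addr_t;
#define DMA_MAPPING_ERROR ((dma_addr_t)~0ULL)

/* Invented stand-in for an arch's DMA ops; only the hook we need. */
struct arch_dma_ops {
	dma_addr_t (*map_resource)(uint64_t phys);
};

/* The behaviour criticized above: if the arch never implemented the
 * hook, the CPU physical address passes through unchanged, which on
 * many architectures is not a valid bus address. */
dma_addr_t map_resource_passthrough(const struct arch_dma_ops *ops,
				    uint64_t phys)
{
	return ops->map_resource ? ops->map_resource(phys) : phys;
}

/* The safer default: "unimplemented" means an error the caller can
 * detect, not a DMA to a possibly bogus address. */
dma_addr_t map_resource_strict(const struct arch_dma_ops *ops,
			       uint64_t phys)
{
	return ops->map_resource ? ops->map_resource(phys)
				 : DMA_MAPPING_ERROR;
}

/* Example arch hook applying a fixed CPU-to-bus offset (the offset
 * here is made up for illustration). */
dma_addr_t example_map(uint64_t phys)
{
	return phys - 0x600000000ULL + 0x80000000ULL;
}
```

With the strict variant, an arch that has not opted in causes a detectable failure instead of a silent DMA to the wrong address.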
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 29/03/18 10:10 AM, Christian König wrote:
> Why not? I mean the dma_map_resource() function is for P2P while other
> dma_map_* functions are only for system memory.

Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping P2P. Though it's a bit odd seeing we've been working under the assumption that PCI P2P is different as it has to translate the PCI bus address. Where as P2P for devices on other buses is a big unknown.

>> And this is necessary to
>> check if the DMA ops in use support it or not. We can't have the
>> dma_map_X() functions do the wrong thing because they don't support it yet.
>
> Well that sounds like we should just return an error from
> dma_map_resources() when an architecture doesn't support P2P yet as Alex
> suggested.

Yes, well except in our patch-set we can't easily use dma_map_resources() as we either have SGLs to deal with or we need to create whole new interfaces to a number of subsystems.

> You don't seem to understand the implications: The devices do have a
> common upstream bridge! In other words your code would currently claim
> that P2P is supported, but in practice it doesn't work.

Do they? They don't on any of the Intel machines I'm looking at. The previous version of the patchset not only required a common upstream bridge but two layers of upstream bridges on both devices which would effectively limit transfers to PCIe switches only. But Bjorn did not like this.

> You need to include both drivers which participate in the P2P
> transaction to make sure that both supports this and give them
> opportunity to chicken out and in the case of AMD APUs even redirect the
> request to another location (e.g. participate in the DMA translation).

I don't think it's the driver's responsibility to reject P2P. The topology is what governs support or not. The discussions we had with Bjorn settled on if the devices are all behind the same bridge they can communicate with each other. This is essentially guaranteed by the PCI spec.

> DMA-buf fortunately seems to handle all this already, that's why we
> choose it as base for our implementation.

Well, unfortunately DMA-buf doesn't help for the drivers we are working with as neither the block layer nor the RDMA subsystem have any interfaces for it.

Logan
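The "behind the same bridge" rule being debated can be sketched as a short ancestor walk over a toy topology. The types here are invented; the real kernel check works on struct pci_dev and is considerably more involved:

```c
#include <stddef.h>

/* Toy device node: upstream points toward the root complex. */
struct node {
	const struct node *upstream; /* NULL at the root complex */
};

/* Return the first node upstream of both a and b, or NULL if they
 * share no ancestor. When the result is a switch port rather than
 * the root complex, this is the case the thread says the PCI spec
 * guarantees to work for P2P. */
const struct node *common_upstream(const struct node *a,
				   const struct node *b)
{
	for (const struct node *pa = a->upstream; pa; pa = pa->upstream)
		for (const struct node *pb = b->upstream; pb; pb = pb->upstream)
			if (pa == pb)
				return pa;
	return NULL;
}
```

Two devices under the same switch resolve to the switch; devices on different sides of the root complex resolve to the root, which is exactly the case where support becomes unknown.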
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 29/03/18 05:44 AM, Christian König wrote:
> Am 28.03.2018 um 21:53 schrieb Logan Gunthorpe:
>> On 28/03/18 01:44 PM, Christian König wrote:
>>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>>> I can see it makes sure IOMMU is aware of the access route and
>>> translates a CPU address into a PCI Bus address.
>>> I'm using that with the AMD IOMMU driver and at least there it works
>>> perfectly fine.
>>
>> Yes, it would be nice, but no arch has implemented this yet. We are just
>> lucky in the x86 case because that arch is simple and doesn't need to do
>> anything for P2P (partially due to the Bus and CPU addresses being the
>> same). But in the general case, you can't rely on it.
>
> Well, that an arch hasn't implemented it doesn't mean that we don't have
> the right interface to do it.

Yes, but right now we don't have a performant way to check if we are doing P2P or not in the dma_map_X() wrappers. And this is necessary to check if the DMA ops in use support it or not. We can't have the dma_map_X() functions do the wrong thing because they don't support it yet.

> Devices integrated in the CPU usually only "claim" to be PCIe devices.
> In reality their memory request path go directly through the integrated
> north bridge. The reason for this is simple better throughput/latency.

These are just more reasons why our patchset restricts to devices behind a switch. And more mess for someone to deal with if they need to relax that restriction.

Logan
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 28/03/18 01:44 PM, Christian König wrote:
> Well, isn't that exactly what dma_map_resource() is good for? As far as
> I can see it makes sure IOMMU is aware of the access route and
> translates a CPU address into a PCI Bus address.
> I'm using that with the AMD IOMMU driver and at least there it works
> perfectly fine.

Yes, it would be nice, but no arch has implemented this yet. We are just lucky in the x86 case because that arch is simple and doesn't need to do anything for P2P (partially due to the Bus and CPU addresses being the same). But in the general case, you can't rely on it.

>>> Yeah, but not for ours. See if you want to do real peer 2 peer you need
>>> to keep both the operation as well as the direction into account.
>>
>> Not sure what you are saying here... I'm pretty sure we are doing "real"
>> peer 2 peer...
>>
>>> For example when you can do writes between A and B that doesn't mean
>>> that writes between B and A work. And reads are generally less likely to
>>> work than writes. etc...
>>
>> If both devices are behind a switch then the PCI spec guarantees that A
>> can both read and write B and vice versa.
>
> Sorry to say that, but I know a whole bunch of PCI devices which
> horrible ignores that.

Can you elaborate? As far as the device is concerned it shouldn't know whether a request comes from a peer or from the host. If it does do crazy stuff like that it's well out of spec. It's up to the switch (or root complex if good support exists) to route the request to the device and it's the root complex that tends to be what drops the load requests which causes the asymmetries.

Logan
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 28/03/18 12:28 PM, Christian König wrote:
> I'm just using amdgpu as blueprint because I'm the co-maintainer of it
> and know it mostly inside out.

Ah, I see.

> The resource addresses are translated using dma_map_resource(). As far
> as I know that should be sufficient to offload all the architecture
> specific stuff to the DMA subsystem.

It's not. The dma_map infrastructure currently has no concept of peer-to-peer mappings and is designed for system memory only. No architecture I'm aware of will translate PCI CPU addresses into PCI Bus addresses which is necessary for any transfer that doesn't go through the root complex (though on arches like x86 the CPU and Bus address happen to be the same). There's a lot of people that would like to see this change but it's likely going to be a long road before it does.

Furthermore, one of the reasons our patch-set avoids going through the root complex at all is that IOMMU drivers will need to be made aware that it is operating on P2P memory and do arch-specific things accordingly. There will also need to be flags that indicate whether a given IOMMU driver supports this. None of this work is done or easy.

> Yeah, but not for ours. See if you want to do real peer 2 peer you need
> to keep both the operation as well as the direction into account.

Not sure what you are saying here... I'm pretty sure we are doing "real" peer 2 peer...

> For example when you can do writes between A and B that doesn't mean
> that writes between B and A work. And reads are generally less likely to
> work than writes. etc...

If both devices are behind a switch then the PCI spec guarantees that A can both read and write B and vice versa. Only once you involve root complexes do you have this problem. Ie. you have unknown support which may be no support, or partial support (stores but not loads); or sometimes bad performance; or a combination of both... and you need some way to figure out all this mess and that is hard. Whoever tries to implement a whitelist will have to sort all this out.

Logan
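The CPU-address-versus-bus-address point can be illustrated with a toy host-bridge window translation. The names and offsets are made up for illustration; real architectures express this through their DMA ops and resource trees:

```c
#include <stdint.h>

/* Toy model: a host bridge window maps a range of CPU physical
 * addresses to PCI bus addresses at a fixed offset. On x86 the
 * offset is zero (CPU and bus addresses coincide); on many other
 * architectures it is not. */
struct bridge_window {
	uint64_t cpu_base;
	uint64_t bus_base;
	uint64_t size;
};

/* Translate a CPU physical address inside a BAR to the address a
 * peer device must emit on the bus; returns 0 on success, -1 if the
 * address is not inside the window (e.g. an SOC device that needs
 * entirely different handling). */
int phys_to_bus(const struct bridge_window *w, uint64_t phys,
		uint64_t *bus)
{
	if (phys < w->cpu_base || phys - w->cpu_base >= w->size)
		return -1;
	*bus = phys - w->cpu_base + w->bus_base;
	return 0;
}
```

This is the translation Logan says no architecture's dma_map code performs for P2P today: a phys_addr_t alone does not say which window (if any) it belongs to.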
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 28/03/18 10:02 AM, Christian König wrote:
> Yeah, that looks very similar to what I picked up from the older
> patches, going to read up on that after my vacation.

Yeah, I was just reading through your patchset and there are a lot of similarities. Though, I'm not sure what you're trying to accomplish as I could not find a cover letter and it seems to only enable one driver. Is it meant to enable DMA transactions only between two AMD GPUs?

I also don't see where you've taken into account the PCI bus address. On some architectures this is not the same as the CPU physical address.

> Just in general why are you interested in the "distance" of the devices?

We've taken a general approach where some drivers may provide p2p memory (ie. an NVMe card or an RDMA NIC) and other drivers make use of it (ie. the NVMe-of driver). The orchestrator driver needs to find the most applicable provider device for a transaction in a situation that may have multiple providers and multiple clients. So the most applicable provider is the one that's closest ("distance"-wise) to all the clients for the P2P transaction.

> And BTW: At least for writes that Peer 2 Peer transactions between
> different root complexes work is actually more common than the other way
> around.

Maybe on x86 with hardware made in the last few years. But on PowerPC, ARM64, and likely a lot more the chance of support is *much* less. Also, hardware that only supports P2P stores is hardly full support and is insufficient for our needs.

> So I'm a bit torn between using a blacklist or a whitelist. A whitelist
> is certainly more conservative approach, but that could get a bit long.

I think a whitelist approach is correct. Given old hardware and other architectures, a blacklist is going to be too long and too difficult to comprehensively populate.

Logan
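The "distance" metric Logan describes can be modelled as hops from each device up to their closest common upstream node. This is an illustrative toy, not the implementation in drivers/pci/p2pdma.c, which works on struct pci_dev:

```c
#include <stddef.h>

/* Toy topology node, up pointing toward the root complex. */
struct node {
	const struct node *up; /* NULL at the root complex */
};

/* Hops from d up to ancestor a, or -1 if a is not upstream of d. */
int hops(const struct node *d, const struct node *a)
{
	int n = 0;
	for (; d; d = d->up, n++)
		if (d == a)
			return n;
	return -1;
}

/* Distance between two devices through their closest common upstream
 * node: the sum of hops from each side, or -1 if they share no
 * ancestor (different root complexes in this model). An orchestrator
 * would pick the provider minimizing this over all clients. */
int p2p_distance(const struct node *x, const struct node *y)
{
	for (const struct node *a = x; a; a = a->up) {
		int hy = hops(y, a);
		if (hy >= 0)
			return hops(x, a) + hy;
	}
	return -1;
}
```

Two devices under the same switch are closer (distance 2) than devices meeting only at the root complex, which matches the preference for transfers that never cross the root complex.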
Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()
On 28/03/18 09:07 AM, Christian König wrote:
> Am 28.03.2018 um 14:38 schrieb Christoph Hellwig:
>> On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:
>>> From: "wda...@nvidia.com"
>>>
>>> Add an interface to find the first device which is upstream of both
>>> devices.
>>
>> Please work with Logan and base this on top of the outstanding peer
>> to peer patchset.
>
> Can you point me to that? The last code I could find about that was from
> 2015.

The latest posted series is here:

https://lkml.org/lkml/2018/3/12/830

However, we've made some significant changes to that area. You can find the latest unposted code here:

https://github.com/sbates130272/linux-p2pmem/tree/pci-p2p-v4-pre2

Specifically this function would be of interest to you:

https://github.com/sbates130272/linux-p2pmem/blob/0e9468ae2a5a5198513dd12990151e09105f0351/drivers/pci/p2pdma.c#L239

However, the difference is that we are interested in the distance through the common upstream device, while you appear to be finding the actual common device itself.

Thanks,

Logan