Re: [PATCH 09/27] mm: generalize the pgmap based page_free infrastructure

2022-02-14 Thread Logan Gunthorpe



On 2022-02-10 12:28 a.m., Christoph Hellwig wrote:
> Key off on the existence of ->page_free to prepare for adding support for
> more pgmap types that are device managed and thus need the free callback.
> 
> Signed-off-by: Christoph Hellwig 

Great! This makes my patch simpler.

Reviewed-by: Logan Gunthorpe 


> ---
>  mm/memremap.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memremap.c b/mm/memremap.c
> index fef5734d5e4933..e00ffcdba7b632 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -452,7 +452,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  
>  void free_zone_device_page(struct page *page)
>  {
> - if (WARN_ON_ONCE(!is_device_private_page(page)))
> + if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
>   return;
>  
>   __ClearPageWaiters(page);
> @@ -460,7 +460,7 @@ void free_zone_device_page(struct page *page)
>   mem_cgroup_uncharge(page_folio(page));
>  
>   /*
> -  * When a device_private page is freed, the page->mapping field
> +  * When a device managed page is freed, the page->mapping field
>   * may still contain a (stale) mapping value. For example, the
>   * lower bits of page->mapping may still identify the page as an
>   * anonymous page. Ultimately, this entire field is just stale
> 
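
As a side note for future pgmap users: with this change, a device-managed
type just has to supply the callback when remapping its memory. A minimal
sketch, assuming a hypothetical driver that keeps freed pages on its own
list (all the mydrv_* names are made up):

    static void mydrv_page_free(struct page *page)
    {
            /* ->owner is whatever the driver stored in its dev_pagemap
             * before calling memremap_pages() */
            struct mydrv_free_list *fl = page->pgmap->owner;

            spin_lock(&fl->lock);
            list_add(&page->lru, &fl->pages);
            spin_unlock(&fl->lock);
    }

    static const struct dev_pagemap_ops mydrv_pagemap_ops = {
            .page_free = mydrv_page_free,
    };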



Re: start sorting out the ZONE_DEVICE refcount mess

2022-02-07 Thread Logan Gunthorpe



On 2022-02-06 11:32 p.m., Christoph Hellwig wrote:
> Hi all,
> 
> this series removes the offset-by-one refcount for ZONE_DEVICE pages
> that are freed back to the driver owning them, which is just the device
> private ones for now, but also the planned device coherent pages
> and the enhanced p2p ones pending.
> 
> It does not address the fsdax pages yet, which will be attacked in a
> follow-on series.
> 
> Diffstat:
>  arch/arm64/mm/mmu.c  |1 
>  arch/powerpc/kvm/book3s_hv_uvmem.c   |1 
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |2 
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h|1 
>  drivers/gpu/drm/drm_cache.c  |2 
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   |3 -
>  drivers/gpu/drm/nouveau/nouveau_svm.c|1 
>  drivers/infiniband/core/rw.c |1 
>  drivers/nvdimm/pmem.h|1 
>  drivers/nvme/host/pci.c  |1 
>  drivers/nvme/target/io-cmd-bdev.c|1 
>  fs/Kconfig   |2 
>  fs/fuse/virtio_fs.c  |1 
>  include/linux/hmm.h  |9 
>  include/linux/memremap.h |   22 +-
>  include/linux/mm.h   |   59 -
>  lib/test_hmm.c   |4 +
>  mm/Kconfig   |4 -
>  mm/internal.h|2 
>  mm/memcontrol.c  |   11 +
 >  mm/memremap.c|   63 ---
>  mm/migrate.c |6 --
>  mm/swap.c|   49 ++--
>  23 files changed, 90 insertions(+), 157 deletions(-)

Looks good to me. I was wondering about the location of some of this
code, so it's nice to see it cleaned up. Except for the one minor issue
I noted on patch 6, I've reviewed all the patches and tested the series
under my p2pdma series.
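
To summarize the behavioural change as I understand it: a ZONE_DEVICE
page now goes back to its driver when its refcount drops to zero instead
of being parked at one. Conceptually (a paraphrase of the idea, not the
exact hunks):

    /* before: device-private pages were special-cased at refcount 1 */
    if (page_ref_dec_return(page) == 1)
            free_devmap_managed_page(page);

    /* after: they are freed like any other page, at refcount 0 */
    if (put_page_testzero(page))
            free_zone_device_page(page);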

Reviewed-by: Logan Gunthorpe 

Logan



Re: [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>

2022-02-07 Thread Logan Gunthorpe



On 2022-02-06 11:32 p.m., Christoph Hellwig wrote:
> Move the check for the actual pgmap types that need the free-at-refcount-one
> behavior into the out-of-line helper, and thus avoid the need to
> pull memremap.h into mm.h.
> 
> Signed-off-by: Christoph Hellwig 

I've noticed mm/memcontrol.c uses is_device_private_page() and also
needs a memremap.h include added to compile with my configuration.
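
That is, something along these lines next to its other includes
(placement approximate):

    /* mm/memcontrol.c */
    #include <linux/memremap.h>     /* for is_device_private_page() */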

Logan



Re: [PATCH v2 02/15] PCI: Add shorthand define for message signalled interrupt types

2020-06-03 Thread Logan Gunthorpe



On 2020-06-03 5:45 a.m., Piotr Stankiewicz wrote:
> There are several places in the kernel which check/ask for MSI or MSI-X
> interrupts. It would make sense to have a shorthand constant, similar to
> PCI_IRQ_ALL_TYPES, to use in these situations. So add PCI_IRQ_MSI_TYPES,
> for this purpose.
> 
> Signed-off-by: Piotr Stankiewicz 
> Suggested-by: Andy Shevchenko 
> Reviewed-by: Andy Shevchenko 

Looks good to me,

Reviewed-by: Logan Gunthorpe 
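
For reference, going by the description above, the shorthand boils down
to roughly this (a sketch; the exact definition is in the patch itself,
which isn't quoted here):

    #define PCI_IRQ_MSI_TYPES       (PCI_IRQ_MSI | PCI_IRQ_MSIX)

so call sites can do, e.g.:

    nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI_TYPES);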

Thanks,

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-04-02 Thread Logan Gunthorpe


On 02/04/18 01:16 PM, Jerome Glisse wrote:
> There isn't a good API at the moment AFAIK; the closest thing would either be
> lookup_resource() or region_intersects(), but a more appropriate one can
> easily be added, as code to walk down the tree is readily available.
> Moreover, this can be optimized the way vma lookups are, even more so as
> resources are seldom added, so the read side (finding a resource) can be
> heavily favored over the write side (adding/registering a new resource).

So someone needs to create a highly optimized tree that registers all
physical addresses on the system and maps them to devices? That seems a
long way from being realized. I'd hardly characterize that as "easily".
If we can pass both devices to the API, I suspect that would be preferable
to the complicated tree. This, of course, depends on what users of the
API need.
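
To be explicit about the alternative: an API where the caller hands over
both sides, something like the purely hypothetical

    dma_addr_t dma_map_peer_resource(struct device *dma_dev,
                                     struct device *owner,
                                     phys_addr_t phys, size_t size,
                                     enum dma_data_direction dir);

where owner is the device the BAR belongs to, would avoid the
address-to-device lookup entirely.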

> cache coherency protocol (a bit further than PCIe snooping). But also in the
> other direction: CPU access to device memory can also be cache coherent,
> which is not the case in PCIe.

I was not aware that CAPI allows PCI device memory to be cache coherent.
That sounds like it would be very tricky...

> Note that with mmu_notifier there isn't any need to pin stuff (even
> without any special hardware capabilities), as long as you can preempt
> what is happening on your hardware to update its page table.

I've been told there's a lot of dislike of the mmu_notifier interface.
And being able to preempt what's happening on hardware, generally, is
not trivial. But, yes, this is essentially how ODP works.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-04-02 Thread Logan Gunthorpe


On 02/04/18 11:20 AM, Jerome Glisse wrote:
> The point I have been trying to get across is that you do have this
> information with dma_map_resource(): you know the device to which you
> are trying to map (the dev argument to dma_map_resource()), and you can
> easily get the device to which the memory belongs because you have the
> CPU physical address of the memory, hence you can look up the resource
> and get the device from that.

How do you go from a physical address to a struct device generally and
in a performant manner?

> IIRC CAPI makes P2P mandatory, but maybe this is with NVLink. We can ask
> the PowerPC folks to confirm. Note CAPI is Power8 and newer AFAICT.

PowerPC folks recently told us specifically that Power9 does not support
P2P between PCI root ports. I've said this many times. CAPI has nothing
to do with it.

> Mapping to userspace has nothing to do with this. I am talking at the
> hardware level. How things are exposed to userspace is a completely different
> problem that does not have a one-solution-fits-all answer. For a GPU you want
> this to be under the total control of the GPU driver. For storage like
> persistent memory, you might want to expose it to userspace more directly ...

My understanding (and I worked on this a while ago) is that CAPI
hardware manages memory maps typically for userspace memory. When a
userspace program changes its mapping, the CAPI hardware is updated so
that hardware is coherent with the user address space and it is safe to
DMA to any address without having to pin memory. (This is very similar
to ODP in RNICs.) This is *really* nice but doesn't solve *any* of the
problems we've been discussing. Moreover, many developers want to keep
P2P in-kernel, for the time being, where the problem of pinning memory
does not exist.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-04-02 Thread Logan Gunthorpe


On 30/03/18 01:45 PM, Jerome Glisse wrote:
> Looking at upstream code it seems that the x86 bits never made it upstream
> and thus what is now upstream is only for ARM. See [1] for the x86 code.
> Dunno what happened, I was convinced it got merged. So yes, the current code
> is broken on x86. CCing Joerg, maybe he remembers what happened there.
> 
> [1] https://lwn.net/Articles/646605/

That looks like a significant improvement over what's upstream. But it's
three years old and looks to have been abandoned. I think I agree with
Bjorn's comments in that if it's going to be a general P2P API it will
need the device the resource belongs to in addition to the device that's
initiating the DMA request.

> Given it is currently only used by ARM folks it appears to at least work
> for them (tm) :) Note that Christian is doing this in a PCIe-only context,
> and again dma_map_resource() can easily figure out whether the address is
> PCIe or something else. Note that the exporter exports the CPU bus
> address. So again dma_map_resource() has all the information it will ever
> need; if the peer-to-peer is fundamentally un-doable it can return a DMA
> error and it is up to the caller to handle this, just like the GPU code does.
> 
> Do you claim that dma_map_resource() is missing any information?

Yes, that's what I said. All the existing ARM code appears to use it for
platform devices only. I suspect platform P2P is relatively trivial to
support on ARM. I think it's a big difference from using it for PCI P2P
or general P2P on any bus.

> I agree and understand that, but for platforms where such a feature makes
> sense this will work. For me that is PowerPC and x86, and given PowerPC has
> CAPI, which has far more advanced features when it comes to peer-to-peer, I
> don't see something more basic not working. On x86, Intel is a bit of a lone
> wolf; dunno if they are going to support this use case proactively. AMD
> definitely will.

Well PowerPC doesn't even support P2P between root ports. And I fail to
see how CAPI applies unless/until we get this memory mapped into
userspace and the mappings need to be dynamically managed. Seems like
that's a long way away.

Logan



Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-30 Thread Logan Gunthorpe


On 29/03/18 07:58 PM, Jerome Glisse wrote:
> On Thu, Mar 29, 2018 at 10:25:52AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 29/03/18 10:10 AM, Christian König wrote:
>>> Why not? I mean the dma_map_resource() function is for P2P while other 
>>> dma_map_* functions are only for system memory.
>>
>> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
>> P2P. Though it's a bit odd seeing as we've been working under the
>> assumption that PCI P2P is different as it has to translate the PCI bus
>> address. Whereas P2P for devices on other buses is a big unknown.
> 
> dma_map_resource() is the right API (though its current implementation
> is filled with x86 assumptions). So I would argue that an arch can decide to
> implement it or simply return a DMA error address, which triggers the
> fallback path in the caller (at least for GPU drivers). An SG variant can be
> added on top.
> 
> dma_map_resource() is the right API because it has all the necessary
> information. It uses the CPU physical address as the common address
> "language": with the CPU physical address of a PCIe BAR to map to another
> device, you can find the corresponding bus address from the IOMMU code
> (a NOP on x86). So dma_map_resource() knows both the source device which
> exports its PCIe BAR and the destination device.

Well, as it is today, it doesn't look very sane to me. The default is to
just return the physical address if the architecture doesn't support it.
So if someone tries this on an arch that hasn't opted in to returning
an error, they're very likely going to just end up DMAing to an invalid
address and losing the data or causing a machine check.
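
To spell out the pattern I'm objecting to, it's roughly this (a
paraphrase of the current fallback, not the exact upstream code):

    /* sketch: with no ->map_resource hook the physical address is
     * handed back untranslated -- the caller can't tell this happened */
    if (!ops->map_resource)
            return (dma_addr_t)phys_addr;
    return ops->map_resource(dev, phys_addr, size, dir, attrs);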

Furthermore, the API does not have all the information it needs to do
sane things. A phys_addr_t doesn't really tell you anything about the
memory behind it or what needs to be done with it. For example, on some
arm64 implementations, if the physical address points to a PCI BAR and
that BAR is behind a switch with the DMA device, then the address must be
converted to the PCI bus address. On the other hand, if it's the physical
address of a device in an SoC, it might need to be handled in a
completely different way. And right now all the implementations I can
find seem to just assume that phys_addr_t points to regular memory and
can be treated as such.

This is one of the reasons that, based on feedback, our work went from
being general P2P with any device to being restricted to only P2P with
PCI devices. The dream that we can just grab the physical address of any
device and use it in a DMA request is simply not realistic.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-29 Thread Logan Gunthorpe


On 29/03/18 10:10 AM, Christian König wrote:
> Why not? I mean the dma_map_resource() function is for P2P while other 
> dma_map_* functions are only for system memory.

Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
P2P. Though it's a bit odd seeing as we've been working under the
assumption that PCI P2P is different as it has to translate the PCI bus
address. Whereas P2P for devices on other buses is a big unknown.

>> And this is necessary to
>> check if the DMA ops in use support it or not. We can't have the
>> dma_map_X() functions do the wrong thing because they don't support it yet.
> 
> Well that sounds like we should just return an error from 
> dma_map_resource() when an architecture doesn't support P2P yet, as Alex 
> suggested.

Yes, well, except in our patch-set we can't easily use
dma_map_resource() as we either have SGLs to deal with or we need to
create whole new interfaces for a number of subsystems.

> You don't seem to understand the implications: The devices do have a 
> common upstream bridge! In other words your code would currently claim 
> that P2P is supported, but in practice it doesn't work.

Do they? They don't on any of the Intel machines I'm looking at. The
previous version of the patchset not only required a common upstream
bridge but two layers of upstream bridges on both devices which would
effectively limit transfers to PCIe switches only. But Bjorn did not
like this.

> You need to include both drivers which participate in the P2P 
> transaction to make sure that both support this and give them the 
> opportunity to chicken out, and in the case of AMD APUs even redirect the 
> request to another location (e.g. participate in the DMA translation).

I don't think it's the drivers' responsibility to reject P2P. The
topology is what governs support or not. The discussions we had with
Bjorn settled on this: if the devices are all behind the same bridge, they
can communicate with each other. This is essentially guaranteed by the PCI
spec.

> DMA-buf fortunately seems to handle all this already, that's why we 
> choose it as base for our implementation.

Well, unfortunately DMA-buf doesn't help for the drivers we are working
with, as neither the block layer nor the RDMA subsystem has any
interfaces for it.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-29 Thread Logan Gunthorpe


On 29/03/18 05:44 AM, Christian König wrote:
> Am 28.03.2018 um 21:53 schrieb Logan Gunthorpe:
>>
>> On 28/03/18 01:44 PM, Christian König wrote:
>>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>>> I can see it makes sure IOMMU is aware of the access route and
>>> translates a CPU address into a PCI Bus address.
>>> I'm using that with the AMD IOMMU driver and at least there it works
>>> perfectly fine.
>> Yes, it would be nice, but no arch has implemented this yet. We are just
>> lucky in the x86 case because that arch is simple and doesn't need to do
>> anything for P2P (partially due to the Bus and CPU addresses being the
>> same). But in the general case, you can't rely on it.
> 
> Well, that an arch hasn't implemented it doesn't mean that we don't have 
> the right interface to do it.

Yes, but right now we don't have a performant way to check if we are
doing P2P or not in the dma_map_X() wrappers. And this is necessary to
check if the DMA ops in use support it or not. We can't have the
dma_map_X() functions do the wrong thing because they don't support it yet.
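
(What our patchset adds for this is, roughly, a cheap per-page test --
a sketch, modulo config guards:

    static inline bool is_pci_p2pdma_page(const struct page *page)
    {
            return is_zone_device_page(page) &&
                   page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
    }

so the hot dma_map path only pays a couple of loads.)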

> Devices integrated in the CPU usually only "claim" to be PCIe devices. 
> In reality their memory request path goes directly through the integrated 
> north bridge. The reason for this is simple: better throughput/latency.

These are just more reasons why our patchset restricts P2P to devices
behind a switch. And more mess for someone to deal with if they need to
relax that restriction.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-28 Thread Logan Gunthorpe


On 28/03/18 01:44 PM, Christian König wrote:
> Well, isn't that exactly what dma_map_resource() is good for? As far as 
> I can see it makes sure IOMMU is aware of the access route and 
> translates a CPU address into a PCI Bus address.

> I'm using that with the AMD IOMMU driver and at least there it works 
> perfectly fine.

Yes, it would be nice, but no arch has implemented this yet. We are just
lucky in the x86 case because that arch is simple and doesn't need to do
anything for P2P (partially due to the Bus and CPU addresses being the
same). But in the general case, you can't rely on it.

>>> Yeah, but not for ours. See, if you want to do real peer 2 peer you need
>>> to take both the operation as well as the direction into account.
>> Not sure what you are saying here... I'm pretty sure we are doing "real"
>> peer 2 peer...
>>
>>> For example when you can do writes between A and B that doesn't mean
>>> that writes between B and A work. And reads are generally less likely to
>>> work than writes. etc...
>> If both devices are behind a switch then the PCI spec guarantees that A
>> can both read and write B and vice versa.
> 
> Sorry to say that, but I know a whole bunch of PCI devices which 
> horribly ignore that.

Can you elaborate? As far as the device is concerned, it shouldn't know
whether a request comes from a peer or from the host. If it does do
crazy stuff like that, it's well out of spec. It's up to the switch (or
the root complex, if good support exists) to route the request to the
device, and it's the root complex that tends to be what drops the load
requests, which causes the asymmetries.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-28 Thread Logan Gunthorpe


On 28/03/18 12:28 PM, Christian König wrote:
> I'm just using amdgpu as a blueprint because I'm the co-maintainer of it 
> and know it mostly inside out.

Ah, I see.

> The resource addresses are translated using dma_map_resource(). As far 
> as I know that should be sufficient to offload all the architecture 
> specific stuff to the DMA subsystem.

It's not. The dma_map infrastructure currently has no concept of
peer-to-peer mappings and is designed for system memory only. No
architecture I'm aware of will translate PCI CPU addresses into PCI Bus
addresses, which is necessary for any transfer that doesn't go through
the root complex (though on arches like x86 the CPU and Bus address
happen to be the same). There are a lot of people who would like to see
this change, but it's likely going to be a long road before it does.
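
(For the PCIe-behind-a-switch case, the translation itself is cheap once
the metadata exists. Conceptually, in our patchset, it is something like:

    /* CPU physical address within a BAR -> PCI bus address; the
     * offset is recorded per device when the p2p memory is published */
    pci_bus_addr_t bus_addr = phys_addr - p2pdma->bus_offset;

The hard part is getting every dma_map implementation to know when it
must be applied.)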

Furthermore, one of the reasons our patch-set avoids going through the
root complex at all is that IOMMU drivers will need to be made aware
that they are operating on P2P memory and do arch-specific things
accordingly. There will also need to be flags that indicate whether a
given IOMMU driver supports this. None of this work is done or easy.

> Yeah, but not for ours. See, if you want to do real peer 2 peer you need 
> to take both the operation as well as the direction into account.

Not sure what you are saying here... I'm pretty sure we are doing "real"
peer 2 peer...

> For example when you can do writes between A and B that doesn't mean 
> that writes between B and A work. And reads are generally less likely to 
> work than writes. etc...

If both devices are behind a switch, then the PCI spec guarantees that A
can both read and write B and vice versa. Only once you involve root
complexes do you have this problem. I.e., you have unknown support, which
may be no support, or partial support (stores but not loads); or
sometimes bad performance; or a combination of both... and you need some
way to figure out all this mess, and that is hard. Whoever tries to
implement a whitelist will have to sort all this out.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-28 Thread Logan Gunthorpe


On 28/03/18 10:02 AM, Christian König wrote:
> Yeah, that looks very similar to what I picked up from the older 
> patches, going to read up on that after my vacation.

Yeah, I was just reading through your patchset and there are a lot of
similarities. Though, I'm not sure what you're trying to accomplish as I
could not find a cover letter and it seems to only enable one driver. Is
it meant to enable DMA transactions only between two AMD GPUs?

I also don't see where you've taken into account the PCI bus address. On
some architectures this is not the same as the CPU physical address.

> Just in general why are you interested in the "distance" of the devices?

We've taken a general approach where some drivers may provide p2p memory
(e.g. an NVMe card or an RDMA NIC) and other drivers make use of it (e.g.
the NVMe-of driver). The orchestrator driver needs to find the most
applicable provider device for a transaction in a situation that may
have multiple providers and multiple clients. So the most applicable
provider is the one that's closest ("distance"-wise) to all the clients
for the P2P transaction.
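
In pseudocode, the selection is roughly this (a sketch of the idea, not
the actual code):

    best = NULL;
    for each registered p2p provider P:
            total = 0;
            for each client C in the transaction:
                    d = upstream_distance(P, C);  /* -1: no common upstream */
                    if (d < 0)
                            goto next_provider;
                    total += d;
            if (!best || total < best_total)
                    best = P, best_total = total;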

> And BTW: at least for writes, Peer 2 Peer transactions between 
> different root complexes working is actually more common than the other way 
> around.

Maybe on x86 with hardware made in the last few years. But on PowerPC,
ARM64, and likely many other platforms, the chance of support is *much*
lower. Also, hardware that only supports P2P stores is hardly full
support and is insufficient for our needs.

> So I'm a bit torn between using a blacklist or a whitelist. A whitelist 
> is certainly the more conservative approach, but it could get a bit long.

I think a whitelist approach is correct. Given old hardware and other
architectures, a blacklist is going to be too long and too difficult to
comprehensively populate.

Logan


Re: [PATCH 2/8] PCI: Add pci_find_common_upstream_dev()

2018-03-28 Thread Logan Gunthorpe


On 28/03/18 09:07 AM, Christian König wrote:
> Am 28.03.2018 um 14:38 schrieb Christoph Hellwig:
>> On Sun, Mar 25, 2018 at 12:59:54PM +0200, Christian König wrote:
>>> From: "wda...@nvidia.com" 
>>>
>>> Add an interface to find the first device which is upstream of both
>>> devices.
>> Please work with Logan and base this on top of the outstanding peer
>> to peer patchset.
> 
> Can you point me to that? The last code I could find about that was from 
> 2015.

The latest posted series is here:

https://lkml.org/lkml/2018/3/12/830

However, we've made some significant changes to the area that's similar
to what you are doing. You can find the latest un-posted code here:

https://github.com/sbates130272/linux-p2pmem/tree/pci-p2p-v4-pre2

Specifically this function would be of interest to you:

https://github.com/sbates130272/linux-p2pmem/blob/0e9468ae2a5a5198513dd12990151e09105f0351/drivers/pci/p2pdma.c#L239

However, the difference is that we are interested in the distance
through the common upstream device, while you appear to be finding the
actual common device.
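
The rough shape of the distance calculation, for comparison (a
simplified sketch; the real function in the tree linked above does
more):

    static int upstream_distance(struct pci_dev *a, struct pci_dev *b)
    {
            struct pci_dev *pa, *pb;
            int da, db;

            for (pa = a, da = 0; pa; pa = pci_upstream_bridge(pa), da++)
                    for (pb = b, db = 0; pb; pb = pci_upstream_bridge(pb), db++)
                            if (pa == pb)
                                    return da + db;

            return -1;      /* no common upstream port */
    }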

Thanks,

Logan