On 12.12.22 04:00, Tomasz Figa wrote:
[SNIP]
What we could do is to force all exporters to use begin/end_cpu_access()
even on its own buffers and look at all the importers when the access
is completed. But that would increase the complexity of the handling in
the exporter.
I feel like they should be doing so anyway, because it often depends
on the SoC integration whether the DMA can do cache snooping or not.
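
To illustrate, the exporter would bracket even its own CPU writes, something like this sketch (the function is hypothetical; only the dma_buf_*() calls are the real API):

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/iosys-map.h>

/* Sketch only: the exporter treats itself like any other CPU-side
 * client of its buffer, so the cache maintenance path is exercised
 * regardless of how the SoC integrates its DMA masters. */
static int exporter_fill(struct dma_buf *dmabuf, const void *src, size_t len)
{
	struct iosys_map map;
	int ret;

	ret = dma_buf_begin_cpu_access(dmabuf, DMA_TO_DEVICE);
	if (ret)
		return ret;

	ret = dma_buf_vmap(dmabuf, &map);
	if (!ret) {
		iosys_map_memcpy_to(&map, 0, src, len);
		dma_buf_vunmap(dmabuf, &map);
	}

	dma_buf_end_cpu_access(dmabuf, DMA_TO_DEVICE);
	return ret;
}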

Yeah, but that wouldn't help without the exporter taking care of the importers' cache flushing needs. And that's exactly what we try hard to avoid, because it creates code in every exporter which is never tested except when that exporter is combined with a non-coherent importer.

That's a testing nightmare, because then you have code everywhere which is actually used in only very few combinations of exporter and importer.

Although arguably, there is a corner case today where if one uses
dma_alloc_coherent() to get a buffer with a coherent CPU mapping for
device X that is declared as cache-coherent, one also expects not to
need to call begin/end_cpu_access(), but those would be needed if the
buffer was to be imported by device Y that is not cache-coherent...

Sounds like after all it's a mess. I guess your patches make it one
step closer to something sensible, import would fail in such cases.
Although arguably we should be able to still export from driver Y and
import to driver X just fine if Y allocated the buffer as coherent -
otherwise we would break existing users for whom things worked fine.
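
In code, the corner case is roughly this (device roles hypothetical):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* dev_x is assumed to be declared cache-coherent, so this CPU
 * mapping needs no begin/end_cpu_access() bracketing for dev_x. */
static void *alloc_for_x(struct device *dev_x, size_t size, dma_addr_t *dma)
{
	void *vaddr = dma_alloc_coherent(dev_x, size, dma, GFP_KERNEL);

	/* Nothing here stops the buffer from later being imported by
	 * a non-coherent dev_y, which would need exactly the cache
	 * maintenance this path was allowed to skip. */
	return vaddr;
}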

Allocating the buffer as coherent won't help in this case because we actually do want CPU caching for performant access. It's just that some device needs a cache flush before it sees all the changes.

As far as I can see, without adding additional complexity to the exporter this can only be achieved in two ways:

1. Switch the role of the exporter and importer. This way the device with the need for the cache flush is the exporter and in control of the operations on its buffers.

2. We use DMA-buf as a neutral mediator. Since DMA-buf keeps track of who has mapped the buffers, it inserts the appropriate dma_sync_*_for_device() calls.
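
For option 2 this could look roughly like the following. It's purely a sketch of the idea, not the current dma_buf_end_cpu_access(); locking and the dynamic importer case are ignored:

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>

int dma_buf_end_cpu_access(struct dma_buf *dmabuf,
			   enum dma_data_direction dir)
{
	struct dma_buf_attachment *attach;
	int ret = 0;

	if (dmabuf->ops->end_cpu_access)
		ret = dmabuf->ops->end_cpu_access(dmabuf, dir);

	/* DMA-buf as the mediator: flush towards every importer which
	 * currently has the buffer mapped, so neither exporter nor
	 * importer needs code specific to the combination. */
	list_for_each_entry(attach, &dmabuf->attachments, node)
		if (attach->sgt)
			dma_sync_sgtable_for_device(attach->dev,
						    attach->sgt, dir);

	return ret;
}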

[SNIP]
The best we can do is to reject combinations which won't work in the
kernel and then userspace could react accordingly.

The question is whether userspace is able to deal with it, given the
limited amount of information it gets from the kernel. Sure, there is
always the ultimate fallback of memcpy(), but in many cases that would
be totally unusable due to severe performance impact. If we were to
remove the existing extent of implicit handling from the kernel, I
think we need to first give the userspace the information necessary to
explicitly handle the fallback to the same extent.

Good point.

We also need to think about backwards compatibility. Simply removing
the implicit fallback cases would probably break a lot of userspace,
so an opt-in behavior is likely needed initially...

Yes, I'm completely aware of that as well.

We can't hard-break userspace even if the previous approach didn't work 100% correctly.

That's essentially the reason why we have DMA-buf heaps. Those heaps
expose system memory buffers with certain properties (size, CMA, DMA bit
restrictions etc...) and also implement the callback functions for CPU
cache maintenance.

How do DMA-buf heaps change anything here? We already have CPU cache
maintenance callbacks in the DMA-buf API - the begin/end_cpu_access()
for CPU accesses and dma_buf_map/unmap_attachment() for device accesses
(which arguably shouldn't actually do CPU cache maintenance, unless
that's implied by how either of the involved DMA engines work).
DMA-buf heaps are the neutral man in the middle.

The implementation keeps track of all the attached importers and should
make sure that the allocated memory fits the needs of everyone.
In addition to that, calls to the DMA API cache management functions
are inserted whenever CPU access begins/ends.
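
Loosely modeled on what the system heap (drivers/dma-buf/heaps/system_heap.c) already does, the begin side looks roughly like this:

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/list.h>
#include <linux/mutex.h>

/* Per-buffer bookkeeping in the spirit of system_heap.c. */
struct heap_attachment {
	struct device *dev;
	struct sg_table *table;
	struct list_head list;
	bool mapped;
};

struct heap_buffer {
	struct list_head attachments;
	struct mutex lock;
};

static int heap_begin_cpu_access(struct dma_buf *dmabuf,
				 enum dma_data_direction dir)
{
	struct heap_buffer *buffer = dmabuf->priv;
	struct heap_attachment *a;

	mutex_lock(&buffer->lock);
	/* Sync for the CPU on behalf of every importer that has the
	 * buffer mapped; the importers need to know nothing about
	 * each other. */
	list_for_each_entry(a, &buffer->attachments, list)
		if (a->mapped)
			dma_sync_sgtable_for_cpu(a->dev, a->table, dir);
	mutex_unlock(&buffer->lock);

	return 0;
}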

I think in the current design, it only knows all the importers after the
buffer is already allocated, so it doesn't necessarily have a way to
handle the allocation constraints. Something would have to be done to
get all the importers attached before the allocation actually takes
place.

That's already in place. See the attach and map callbacks.

I'm just not sure if the heaps fully implement it like this.
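
The idea would be something along these lines (a hypothetical exporter; deferred_alloc() and deferred_map_attachment() are invented for illustration):

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>

struct deferred_buffer {
	struct page **pages;	/* allocated on first map */
	u64 dma_mask;		/* intersection over all importers */
	size_t size;
};

/* Hypothetical helpers. */
static int deferred_alloc(struct deferred_buffer *buf);
static struct sg_table *deferred_map_attachment(struct deferred_buffer *buf,
						struct dma_buf_attachment *attach,
						enum dma_data_direction dir);

static int deferred_attach(struct dma_buf *dmabuf,
			   struct dma_buf_attachment *attach)
{
	struct deferred_buffer *buf = dmabuf->priv;

	/* Collect constraints while nothing is allocated yet. */
	buf->dma_mask &= dma_get_mask(attach->dev);
	return 0;
}

static struct sg_table *deferred_map(struct dma_buf_attachment *attach,
				     enum dma_data_direction dir)
{
	struct deferred_buffer *buf = attach->dmabuf->priv;

	/* First mapping: all attached devices are known by now, so
	 * the allocation can satisfy everyone. */
	if (!buf->pages && deferred_alloc(buf))
		return ERR_PTR(-ENOMEM);

	return deferred_map_attachment(buf, attach, dir);
}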

The problem is that in this particular case the exporter provides the
buffer as is, e.g. with dirty CPU caches. And the importer can't deal
with that.
Why does the exporter leave the buffer with dirty CPU caches?
Because exporters always export the buffers as they would use them. And in
this case that means writing to them with the CPU.

Sorry for the question not being very clear. I meant: How do the CPU
caches get dirty in that case?
The exporter wrote to it. As far as I understand, the exporter just
copies things from A to B with memcpy to construct the buffer content.

Okay, so it's just due to CPU access and basically what we touched on a
few paragraphs above.

Yes, I've never seen a device which actually dirties the CPU cache. But I would never rule out that such a device exists.

Regards,
Christian.

[SNIP]
Yes, totally agree. The problem is really that we moved a bunch of MM and
DMA functions into one API.

The bounce buffers are something we should really have in a separate
component.

Then the functionality of allocating system memory for a specific device
or devices should be something provided by the MM.

And finally making this memory or any other CPU address accessible to a
device (IOMMU programming etc.) should then be part of a DMA API.

Remember that actually making the memory accessible to a device often
needs to be handled already as a part of the allocation (e.g. dma_mask
in the non-IOMMU case). So I tend to think that the current division
of responsibilities is mostly fine - the dma_alloc family should be
seen as a part of MM already, especially with all the recent
improvements from Christoph, like dma_alloc_pages().
Yes, that's indeed a very interesting development which, as far as I can
see, goes in the right direction.

That said, it may indeed make sense to move things like IOMMU mapping
management out of the dma_alloc() and just reduce those functions to
simply returning a set of pages that satisfy the allocation
constraints. One would need to call dma_map() after the allocation,
but that's actually a fair expectation. Sometimes it might be
preferred to pre-allocate the memory, but only map it into the device
address space when it's really necessary.
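
Something like this, as a sketch; plain alloc_pages() stands in for the reduced allocator, so the addressing constraints it would actually have to honor are glossed over here:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Step 1: just get memory, no device involved yet. */
static struct page *prealloc_buffer(size_t size)
{
	return alloc_pages(GFP_KERNEL, get_order(size));
}

/* Step 2, potentially much later: make the memory accessible to the
 * device. This is where the IOMMU programming happens and where a
 * failure can be reported per device. */
static dma_addr_t map_buffer(struct device *dev, struct page *page,
			     size_t size)
{
	return dma_map_page(dev, page, 0, size, DMA_BIDIRECTIONAL);
}
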
What I'm still missing is the functionality to allocate pages for
multiple devices and proper error codes when dma_map() can't make the
page accessible to a device.
Agreed. Although again, I think the more challenging part would be to
get the complete list of devices involved before the allocation
happens.
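
And the multi-device part, sketched on top of what exists today, with dma_mapping_error() as the only per-device error report we get back:

#include <linux/dma-mapping.h>

/* Map one allocation for several devices, unwinding on the first
 * device that can't reach the pages. Sketch only. */
static int map_for_devices(struct page *page, size_t size,
			   struct device **devs, dma_addr_t *dma, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		dma[i] = dma_map_page(devs[i], page, 0, size,
				      DMA_BIDIRECTIONAL);
		if (dma_mapping_error(devs[i], dma[i]))
			goto unwind;
	}
	return 0;

unwind:
	while (--i >= 0)
		dma_unmap_page(devs[i], dma[i], size, DMA_BIDIRECTIONAL);
	/* Today there is no way to say *why* the mapping failed. */
	return -ENOMEM;
}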

Best regards,
Tomasz

Regards,
Christian.

It's a use-case that is working fine today with many devices (e.g. network
adapters) in the ARM world, exactly because the architecture specific
implementation of the DMA API inserts the cache maintenance operations
on buffer ownership transfer.
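
i.e. the usual streaming DMA pattern, where the cache maintenance is hidden in the ownership transfers (process_packet() is a hypothetical consumer):

#include <linux/dma-mapping.h>

static void process_packet(void *buf, size_t len); /* hypothetical */

static void rx_complete(struct device *dev, dma_addr_t dma,
			void *cpu_buf, size_t len)
{
	/* Device -> CPU ownership transfer: on a non-coherent ARM
	 * SoC the arch code invalidates stale cache lines here. */
	dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);

	process_packet(cpu_buf, len);

	/* CPU -> device transfer: clean the cache and hand the
	 * buffer back for the next packet. */
	dma_sync_single_for_device(dev, dma, len, DMA_FROM_DEVICE);
}
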
Yeah, I'm perfectly aware of that. The problem is that exactly this
design totally breaks GPUs on Xen DOM0, for example.

And Xen is just one example. I can certainly say from experience that
this design was a really, really bad idea, because it favors just one use
case while making other use cases really hard, if not practically
impossible, to implement.
Sorry, I haven't worked with Xen. Could you elaborate on what problem
this introduces for it?
That's a bit longer topic. The AMD XEN devs are already working on this
as far as I know. I can ping internally to see how far they've got with
sending out the patches to avoid this problem.
Hmm, I see. It might be a good data point to understand in which
direction we should be going, but I guess we can wait until they send
some patches.
There was just recently a longer thread on the amd-gfx mailing list
about that. I think looking in there as well might be beneficial.
Okay, let me check. Thanks,

Best regards,
Tomasz
