Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-05-04 Thread Lucas Stach
On Wednesday, 25.04.2018 at 13:44 -0400, Alex Deucher wrote:
> On Wed, Apr 25, 2018 at 2:41 AM, Christoph Hellwig wrote:
> > On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote:
> > > > It has a non-coherent transaction mode (which the chipset can opt to
> > > > not implement and still flush), to make sure the AGP horror show
> > > > doesn't happen again and GPU folks are happy with PCIe. That's at
> > > > least my understanding from digging around in amd the last time we had
> > > > coherency issues between intel and amd gpus. GPUs have some bits
> > > > somewhere (in the pagetables, or in the buffer object description
> > > > table created by userspace) to control that stuff.
> > > 
> > > Right.  We have a bit in the GPU page table entries that determines
> > > whether we snoop the CPU's cache or not.
> > 
> > I can see how that works with the GPU on the same SOC or SOC set as the
> > CPU.  But how is that going to work for a GPU that is a plain old PCIe
> > card?  The cache snooping in that case is happening in the PCIe root
> > complex.
> 
> I'm not a pci expert, but as far as I know, the device sends either a
> snooped or non-snooped transaction on the bus.  I think the
> transaction descriptor supports a no snoop attribute.  Our GPUs have
> supported this feature for probably 20 years if not more, going back
> to PCI.  Using non-snooped transactions has lower latency and higher
> throughput than snooped transactions.

PCI-X (as in the thing which was all the rage before PCIe) added an
attribute phase to each transaction where the requester can enable
relaxed ordering and/or no-snoop on a per-transaction basis. As those are
strictly performance optimizations, the root complex isn't required to
honor those attributes, but implementations that care about performance
usually will.
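
For illustration, a minimal sketch (a hypothetical helper in a Linux driver
context; whether the fabric and root complex honor the resulting TLP
attributes remains implementation-specific, as noted above) of toggling the
corresponding "Enable No Snoop" and "Enable Relaxed Ordering" bits in the
PCIe Device Control register:

#include <linux/pci.h>

/* Hypothetical helper: allow or forbid no-snoop/relaxed-ordering TLPs. */
static int example_set_tlp_attrs(struct pci_dev *pdev, bool allow)
{
	u16 bits = PCI_EXP_DEVCTL_NOSNOOP_EN | PCI_EXP_DEVCTL_RELAX_EN;

	if (allow)
		return pcie_capability_set_word(pdev, PCI_EXP_DEVCTL, bits);

	return pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL, bits);
}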

Regards,
Lucas


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Dan Williams
On Wed, Apr 25, 2018 at 10:44 AM, Alex Deucher  wrote:
> On Wed, Apr 25, 2018 at 2:41 AM, Christoph Hellwig  wrote:
>> On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote:
>>> > It has a non-coherent transaction mode (which the chipset can opt to
>>> > not implement and still flush), to make sure the AGP horror show
>>> > doesn't happen again and GPU folks are happy with PCIe. That's at
>>> > least my understanding from digging around in amd the last time we had
>>> > coherency issues between intel and amd gpus. GPUs have some bits
>>> > somewhere (in the pagetables, or in the buffer object description
>>> > table created by userspace) to control that stuff.
>>>
>>> Right.  We have a bit in the GPU page table entries that determines
>>> whether we snoop the CPU's cache or not.
>>
>> I can see how that works with the GPU on the same SOC or SOC set as the
>> CPU.  But how is that going to work for a GPU that is a plain old PCIe
>> card?  The cache snooping in that case is happening in the PCIe root
>> complex.
>
> I'm not a pci expert, but as far as I know, the device sends either a
> snooped or non-snooped transaction on the bus.  I think the
> transaction descriptor supports a no snoop attribute.  Our GPUs have
> supported this feature for probably 20 years if not more, going back
> to PCI.  Using non-snooped transactions has lower latency and higher
> throughput than snooped transactions.

Right, 'no snoop' and 'relaxed ordering' have been part of the PCI
spec since forever. With a plain old PCIe card the root complex
indeed arranges for caches to be snooped. It's not always
faster, though, depending on the platform: 'no snoop' traffic may be
relegated to fewer bus resources relative to the much more common snooped
traffic.


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Alex Deucher
On Wed, Apr 25, 2018 at 2:41 AM, Christoph Hellwig  wrote:
> On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote:
>> > It has a non-coherent transaction mode (which the chipset can opt to
>> > not implement and still flush), to make sure the AGP horror show
>> > doesn't happen again and GPU folks are happy with PCIe. That's at
>> > least my understanding from digging around in amd the last time we had
>> > coherency issues between intel and amd gpus. GPUs have some bits
>> > somewhere (in the pagetables, or in the buffer object description
>> > table created by userspace) to control that stuff.
>>
>> Right.  We have a bit in the GPU page table entries that determines
>> whether we snoop the CPU's cache or not.
>
> I can see how that works with the GPU on the same SOC or SOC set as the
> CPU.  But how is that going to work for a GPU that is a plain old PCIe
> card?  The cache snooping in that case is happening in the PCIe root
> complex.

I'm not a pci expert, but as far as I know, the device sends either a
snooped or non-snooped transaction on the bus.  I think the
transaction descriptor supports a no snoop attribute.  Our GPUs have
supported this feature for probably 20 years if not more, going back
to PCI.  Using non-snooped transactions has lower latency and higher
throughput than snooped transactions.
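
As a purely illustrative sketch of such a per-mapping snoop bit (the flag
values and names below are invented for this example and do not correspond
to any particular GPU's real page table format):

#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_PTE_VALID   (1ULL << 0)
#define EXAMPLE_PTE_SNOOPED (1ULL << 2)	/* snoop CPU caches for this mapping */

/* Build a GPU page table entry, optionally marking it cache-coherent. */
static uint64_t example_build_pte(uint64_t addr, bool coherent)
{
	uint64_t pte = (addr & ~0xfffULL) | EXAMPLE_PTE_VALID;

	if (coherent)
		pte |= EXAMPLE_PTE_SNOOPED;

	return pte;
}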

Alex


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Christoph Hellwig
On Wed, Apr 25, 2018 at 09:56:43AM +0200, Thierry Reding wrote:
> And to add to the confusion, none of this seems to be an issue on 64-bit
> ARM where the generic DMA/IOMMU code from drivers/iommu/dma-iommu.c is
> used.

In the long term I want everyone to use that code.  Help welcome!


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Thierry Reding
On Wed, Apr 25, 2018 at 09:30:39AM +0200, Daniel Vetter wrote:
> On Wed, Apr 25, 2018 at 12:09:05AM -0700, Christoph Hellwig wrote:
> > On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote:
> > > Can we please not nack everything right away? Doesn't really motivate
> > > me to show you all the various things we're doing in gpu to make the
> > > dma layer work for us. That kind of noodling around in lower levels to
> > > get them to do what we want is absolutely par-for-course for gpu
> > > drivers. If you just nack everything I point you at for illustrative
> > > purposes, then I can't show you stuff anymore.
> > 
> > No, it's not.  No driver (and that includes the magic GPUs) has
> > any business messing with dma ops directly.
> > 
> > A GPU driver might have a very valid reason to disable the IOMMU,
> > but the code to do so needs to be at least in the arch code, maybe
> > in the dma-mapping/iommu code, not in the driver.
> > 
> > As a first step to get the discussion started we'll simply need
> > to move the code Thierry wrote into a helper in arch/arm and that
> > alone would be a massive improvement.  I'm not even talking about
> > minor details like actually using arm_get_dma_map_ops instead
> > of duplicating it.
> > 
> > And doing this basic trivial work really helps to get this whole
> > mess under control.
> 
> Ah ok. It did sound a bit like a much more categorical NAK than an "ack
> in principle, but we need to shuffle the implementation into the right
> place first". In the past we generally got a principled NAK on anything
> funny we've been doing with the dma api, and the dma api maintainer
> steaming off telling us we're incompetent idiots. I guess I've been
> branded a bit on this topic :-/
> 
> Really great that this is changing now.
> 
> On the patch itself: It might not be the right thing in all cases, since
> for certain compression formats the nv gpu wants larger pages (easy to
> allocate from vram, not so easy from main memory), so might need the iommu
> still. But currently that's not implemented:
> 
> https://www.spinics.net/lists/dri-devel/msg173932.html

To clarify: we do want to use the IOMMU, but we want to use it
explicitly via the IOMMU API rather than hiding it behind the DMA API.
We do the same thing in Tegra DRM where we don't want to use the DMA API
because it doesn't allow us to share the same mapping between multiple
display controllers in the same way the IOMMU API does. We've also been
thinking about using the IOMMU API directly in order to support process
isolation for devices that accept command streams from userspace.
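
As a minimal sketch of that explicit usage (IOMMU API signatures as they
were around the time of this thread; the device pointers, addresses and
error handling are placeholders):

#include <linux/iommu.h>

/* Hypothetical example: two display controllers sharing one mapping. */
static int example_shared_iommu_mapping(struct device *dc_a,
					struct device *dc_b,
					phys_addr_t paddr,
					unsigned long iova, size_t size)
{
	struct iommu_domain *domain = iommu_domain_alloc(dc_a->bus);

	if (!domain)
		return -ENOMEM;

	iommu_attach_device(domain, dc_a);
	iommu_attach_device(domain, dc_b);

	/* One mapping, visible to every device attached to the domain. */
	return iommu_map(domain, iova, paddr, size, IOMMU_READ | IOMMU_WRITE);
}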

Fortunately the issue I'm seeing with Nouveau doesn't happen with Tegra
DRM, which seems to be because we have an IOMMU group with multiple
devices and that prevents the DMA API from "hijacking" the IOMMU domain
for the group.

And to add to the confusion, none of this seems to be an issue on 64-bit
ARM where the generic DMA/IOMMU code from drivers/iommu/dma-iommu.c is
used.

Thierry




Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Thierry Reding
On Wed, Apr 25, 2018 at 12:09:05AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote:
> > Can we please not nack everything right away? Doesn't really motivate
> > me to show you all the various things we're doing in gpu to make the
> > dma layer work for us. That kind of noodling around in lower levels to
> > get them to do what we want is absolutely par-for-course for gpu
> > drivers. If you just nack everything I point you at for illustrative
> > purposes, then I can't show you stuff anymore.
> 
> No, it's not.  No driver (and that includes the magic GPUs) has
> any business messing with dma ops directly.
> 
> A GPU driver might have a very valid reason to disable the IOMMU,
> but the code to do so needs to be at least in the arch code, maybe
> in the dma-mapping/iommu code, not in the driver.
> 
> As a first step to get the discussion started we'll simply need
> to move the code Thierry wrote into a helper in arch/arm and that
> alone would be a massive improvement.  I'm not even talking about
> minor details like actually using arm_get_dma_map_ops instead
> of duplicating it.
> 
> And doing this basic trivial work really helps to get this whole
> mess under control.

That's a good idea and I can prepare patches for this, but I'd like to
make those changes on top of the fix to keep the option of getting this
into v4.16 if at all possible.

Thierry




Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Thierry Reding
On Tue, Apr 24, 2018 at 11:43:35PM -0700, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 08:23:15AM +0200, Daniel Vetter wrote:
> > For more fun:
> > 
> > https://www.spinics.net/lists/dri-devel/msg173630.html
> > 
> > Yeah, sometimes we want to disable the iommu because the on-gpu
> > pagetables are faster ...
> 
> I am not on this list, but remote NAK from here.  This needs an
> API from the iommu/dma-mapping code.  Drivers have no business poking
> into these details.

The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
which is rather misleading if they are not meant to be used by drivers
directly.

> Thierry, please resend this with at least the iommu list and
> linux-arm-kernel in Cc to have a proper discussion on the right API.

I'm certainly open to help with finding a correct solution, but the
patch above was purposefully terse because this is something that I
hope we can get backported to v4.16 to unbreak Nouveau. Coordinating
such a backport between ARM and DRM trees does not sound like something
that would help getting this fixed in v4.16.

The fundamental issue here is that the DMA/IOMMU integration is
something that has caused a number of surprising regressions in the past
because it tends to sneak in unexpectedly. For example the current
regression shows up only if CONFIG_ARM_DMA_USE_IOMMU=y because the DMA
API will then transparently create a second mapping and mess things up.
Everything works fine if that option is disabled. This is ultimately why
we didn't notice, since we don't enable that option by default. I do
have a patch that I plan to apply to the Tegra tree that will always
enable CONFIG_ARM_DMA_USE_IOMMU=y on Tegra to avoid any such surprises
in the future, but I can obviously only apply that once the above patch
is applied to Nouveau, otherwise we'll break Nouveau unconditionally.

Granted, this issue could've been caught with a little more testing, but
in retrospect I think it would've been a lot better if ARM_DMA_USE_IOMMU
were just enabled unconditionally, given that it has side effects that
platforms don't opt in to but have to explicitly opt out of.

Thierry




Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 12:09:05AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote:
> > Can we please not nack everything right away? Doesn't really motivate
> > me to show you all the various things we're doing in gpu to make the
> > dma layer work for us. That kind of noodling around in lower levels to
> > get them to do what we want is absolutely par-for-course for gpu
> > drivers. If you just nack everything I point you at for illustrative
> > purposes, then I can't show you stuff anymore.
> 
> No, it's not.  No driver (and that includes the magic GPUs) has
> any business messing with dma ops directly.
> 
> A GPU driver might have a very valid reason to disable the IOMMU,
> but the code to do so needs to be at least in the arch code, maybe
> in the dma-mapping/iommu code, not in the driver.
> 
> As a first step to get the discussion started we'll simply need
> to move the code Thierry wrote into a helper in arch/arm and that
> alone would be a massive improvement.  I'm not even talking about
> minor details like actually using arm_get_dma_map_ops instead
> of duplicating it.
> 
> And doing this basic trivial work really helps to get this whole
> mess under control.

Ah ok. It did sound a bit like a much more categorical NAK than an "ack
in principle, but we need to shuffle the implementation into the right
place first". In the past we generally got a principled NAK on anything
funny we've been doing with the dma api, and the dma api maintainer
steaming off telling us we're incompetent idiots. I guess I've been
branded a bit on this topic :-/

Really great that this is changing now.

On the patch itself: It might not be the right thing in all cases, since
for certain compression formats the nv gpu wants larger pages (easy to
allocate from vram, not so easy from main memory), so might need the iommu
still. But currently that's not implemented:

https://www.spinics.net/lists/dri-devel/msg173932.html

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Christoph Hellwig
On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote:
> Can we please not nack everything right away? Doesn't really motivate
> me to show you all the various things we're doing in gpu to make the
> dma layer work for us. That kind of noodling around in lower levels to
> get them to do what we want is absolutely par-for-course for gpu
> drivers. If you just nack everything I point you at for illustrative
> purposes, then I can't show you stuff anymore.

No, it's not.  No driver (and that includes the magic GPUs) has
any business messing with dma ops directly.

A GPU driver might have a very valid reason to disable the IOMMU,
but the code to do so needs to be at least in the arch code, maybe
in the dma-mapping/iommu code, not in the driver.

As a first step to get the discussion started we'll simply need
to move the code Thierry wrote into a helper in arch/arm and that
alone would be a massive improvement.  I'm not even talking about
minor details like actually using arm_get_dma_map_ops instead
of duplicating it.

And doing this basic trivial work really helps to get this whole
mess under control.


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 8:43 AM, Christoph Hellwig  wrote:
> On Wed, Apr 25, 2018 at 08:23:15AM +0200, Daniel Vetter wrote:
>> For more fun:
>>
>> https://www.spinics.net/lists/dri-devel/msg173630.html
>>
>> Yeah, sometimes we want to disable the iommu because the on-gpu
>> pagetables are faster ...
>
> I am not on this list, but remote NAK from here.  This needs an
> API from the iommu/dma-mapping code.  Drivers have no business poking
> into these details.

Can we please not nack everything right away? Doesn't really motivate
me to show you all the various things we're doing in gpu to make the
dma layer work for us. That kind of noodling around in lower levels to
get them to do what we want is absolutely par-for-course for gpu
drivers. If you just nack everything I point you at for illustrative
purposes, then I can't show you stuff anymore.

Just to make it clear: I do want to get this stuff sorted, and it's
awesome that someone from core finally takes a serious look at what
gpu folks have been doing for decades (instead of just telling us
we're incompetent and doing it all wrong and then steaming off), and
how to make this work without layering violations to no end. But
stopping the world until this is fixed isn't really a good option.

Thanks, Daniel

> Thierry, please resend this with at least the iommu list and
> linux-arm-kernel in Cc to have a proper discussion on the right API.



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-24 Thread Christoph Hellwig
On Wed, Apr 25, 2018 at 08:23:15AM +0200, Daniel Vetter wrote:
> For more fun:
> 
> https://www.spinics.net/lists/dri-devel/msg173630.html
> 
> Yeah, sometimes we want to disable the iommu because the on-gpu
> pagetables are faster ...

I am not on this list, but remote NAK from here.  This needs an
API from the iommu/dma-mapping code.  Drivers have no business poking
into these details.

Thierry, please resend this with at least the iommu list and
linux-arm-kernel in Cc to have a proper discussion on the right API.


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-24 Thread Christoph Hellwig
On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote:
> > It has a non-coherent transaction mode (which the chipset can opt to
> > not implement and still flush), to make sure the AGP horror show
> > doesn't happen again and GPU folks are happy with PCIe. That's at
> > least my understanding from digging around in amd the last time we had
> > coherency issues between intel and amd gpus. GPUs have some bits
> > somewhere (in the pagetables, or in the buffer object description
> > table created by userspace) to control that stuff.
> 
> Right.  We have a bit in the GPU page table entries that determines
> whether we snoop the CPU's cache or not.

I can see how that works with the GPU on the same SOC or SOC set as the
CPU.  But how is that going to work for a GPU that is a plain old PCIe
card?  The cache snooping in that case is happening in the PCIe root
complex.


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-24 Thread Alex Deucher
On Wed, Apr 25, 2018 at 2:13 AM, Daniel Vetter  wrote:
> On Wed, Apr 25, 2018 at 7:48 AM, Christoph Hellwig  wrote:
>> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
>>> Out of curiosity, how much virtual flushing stuff is there still out
>>> there? At least in drm we've pretty much ignored this, and seem to be
>>> getting away without a huge uproar (at least from driver developers
>>> and users, core folks are less amused about that).
>>
>> As I've just been wading through the code, the following architectures
>> have non-coherent dma that flushes by virtual address for at least some
>> platforms:
>>
>>  - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips,
>>powerpc
>>
>> These have non-coherent dma ops that flush by physical address:
>>
>>  - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc
>>
>> And these do not have non-coherent dma ops at all:
>>
>>  - alpha, h8300, riscv, unicore32, x86
>>
>> [1] arm seems to do both virtually and physically based ops, further
>> audit needed.
>>
>> Note that using virtual addresses in the cache flushing interface
>> doesn't mean that the cache actually is virtually indexed, but it at
>> least allows for the possibility.
>>
>>> > I think the most important thing about such a buffer object is that
>>> > it can distinguish the underlying mapping types.  While
>>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
>>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
>>> > back a dma_addr_t they are in no way interchangeable.  And trying to
>>> > stuff them all into a structure like struct scatterlist that has
>>> > no indication what kind of mapping you are dealing with is just
>>> > asking for trouble.
>>>
>>> Well the idea was to have 1 interface to allow all drivers to share
>>> buffers with anything else, no matter how exactly they're allocated.
>>
>> Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
>> a scatterlist through the sg_table in dma_buf_map_attachment /
>> ->map_dma_buf, but looking at a few of the callers it seems like they
>> really do not even want a scatterlist to start with, but check that
>>> it contains a physically contiguous range first.  So kicking the
>>> scatterlist out of there will probably improve the interface in general.
>
> I think by number most drm drivers require contiguous memory (or an
> iommu that makes it look contiguous). But there's plenty others who
> have another set of pagetables on the gpu itself and can
> scatter-gather. Usually it's the former for display/video blocks, and
> the latter for rendering.
>
>>> dma-buf has all the functions for flushing, so you can have coherent
>>> mappings, non-coherent mappings and pretty much anything else. Or well
>>> could, because in practice people hack up layering violations until it
>>> works for the 2-3 drivers they care about. On top of that there's the
>>> small issue that x86 insists that dma is coherent (and that's true for
>>> most devices, including v4l drivers you might want to share stuff
>>> with), and gpus really, really really do want to make almost
>>> everything incoherent.
>>
>> How do discrete GPUs manage to be incoherent when attached over PCIe?
>
> It has a non-coherent transaction mode (which the chipset can opt to
> not implement and still flush), to make sure the AGP horror show
> doesn't happen again and GPU folks are happy with PCIe. That's at
> least my understanding from digging around in amd the last time we had
> coherency issues between intel and amd gpus. GPUs have some bits
> somewhere (in the pagetables, or in the buffer object description
> table created by userspace) to control that stuff.

Right.  We have a bit in the GPU page table entries that determines
whether we snoop the CPU's cache or not.

Alex

>
> For anything on the SoC it's presented as pci device, but that's
> extremely fake, and we can definitely do non-snooped transactions on
> drm/i915. Again, controlled by a mix of pagetables and
> userspace-provided buffer object description tables.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-24 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 8:13 AM, Daniel Vetter  wrote:
> On Wed, Apr 25, 2018 at 7:48 AM, Christoph Hellwig  wrote:
>> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
>>> Out of curiosity, how much virtual flushing stuff is there still out
>>> there? At least in drm we've pretty much ignored this, and seem to be
>>> getting away without a huge uproar (at least from driver developers
>>> and users, core folks are less amused about that).
>>
>> As I've just been wading through the code, the following architectures
>> have non-coherent dma that flushes by virtual address for at least some
>> platforms:
>>
>>  - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips,
>>powerpc
>>
>> These have non-coherent dma ops that flush by physical address:
>>
>>  - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc
>>
>> And these do not have non-coherent dma ops at all:
>>
>>  - alpha, h8300, riscv, unicore32, x86
>>
>> [1] arm seems to do both virtually and physically based ops, further
>> audit needed.
>>
>> Note that using virtual addresses in the cache flushing interface
>> doesn't mean that the cache actually is virtually indexed, but it at
>> least allows for the possibility.
>>
>>> > I think the most important thing about such a buffer object is that
>>> > it can distinguish the underlying mapping types.  While
>>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
>>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
>>> > back a dma_addr_t they are in no way interchangeable.  And trying to
>>> > stuff them all into a structure like struct scatterlist that has
>>> > no indication what kind of mapping you are dealing with is just
>>> > asking for trouble.
>>>
>>> Well the idea was to have 1 interface to allow all drivers to share
>>> buffers with anything else, no matter how exactly they're allocated.
>>
>> Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
>> a scatterlist through the sg_table in dma_buf_map_attachment /
>> ->map_dma_buf, but looking at a few of the callers it seems like they
>> really do not even want a scatterlist to start with, but check that
>> it contains a physically contiguous range first.  So kicking the
>> scatterlist out of there will probably improve the interface in general.
>
> I think by number most drm drivers require contiguous memory (or an
> iommu that makes it look contiguous). But there's plenty others who
> have another set of pagetables on the gpu itself and can
> scatter-gather. Usually it's the former for display/video blocks, and
> the latter for rendering.

For more fun:

https://www.spinics.net/lists/dri-devel/msg173630.html

Yeah, sometimes we want to disable the iommu because the on-gpu
pagetables are faster ...
-Daniel

>>> dma-buf has all the functions for flushing, so you can have coherent
>>> mappings, non-coherent mappings and pretty much anything else. Or well
>>> could, because in practice people hack up layering violations until it
>>> works for the 2-3 drivers they care about. On top of that there's the
>>> small issue that x86 insists that dma is coherent (and that's true for
>>> most devices, including v4l drivers you might want to share stuff
>>> with), and gpus really, really really do want to make almost
>>> everything incoherent.
>>
>> How do discrete GPUs manage to be incoherent when attached over PCIe?
>
> It has a non-coherent transaction mode (which the chipset can opt to
> not implement and still flush), to make sure the AGP horror show
> doesn't happen again and GPU folks are happy with PCIe. That's at
> least my understanding from digging around in amd the last time we had
> coherency issues between intel and amd gpus. GPUs have some bits
> somewhere (in the pagetables, or in the buffer object description
> table created by userspace) to control that stuff.
>
> For anything on the SoC it's presented as pci device, but that's
> extremely fake, and we can definitely do non-snooped transactions on
> drm/i915. Again, controlled by a mix of pagetables and
> userspace-provided buffer object description tables.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-24 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 7:48 AM, Christoph Hellwig  wrote:
> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
>> Out of curiosity, how much virtual flushing stuff is there still out
>> there? At least in drm we've pretty much ignored this, and seem to be
>> getting away without a huge uproar (at least from driver developers
>> and users, core folks are less amused about that).
>
> As I've just been wading through the code, the following architectures
> have non-coherent dma that flushes by virtual address for at least some
> platforms:
>
>  - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips,
>powerpc
>
> These have non-coherent dma ops that flush by physical address:
>
>  - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc
>
> And these do not have non-coherent dma ops at all:
>
>  - alpha, h8300, riscv, unicore32, x86
>
> [1] arm seems to do both virtually and physically based ops, further
> audit needed.
>
> Note that using virtual addresses in the cache flushing interface
> doesn't mean that the cache actually is virtually indexed, but it at
> least allows for the possibility.
>
>> > I think the most important thing about such a buffer object is that
>> > it can distinguish the underlying mapping types.  While
>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
>> > back a dma_addr_t they are in no way interchangeable.  And trying to
>> > stuff them all into a structure like struct scatterlist that has
>> > no indication what kind of mapping you are dealing with is just
>> > asking for trouble.
>>
>> Well the idea was to have 1 interface to allow all drivers to share
>> buffers with anything else, no matter how exactly they're allocated.
>
> Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
> a scatterlist through the sg_table in dma_buf_map_attachment /
> ->map_dma_buf, but looking at a few of the callers it seems like they
> really do not even want a scatterlist to start with, but check that
> it contains a physically contiguous range first.  So kicking the
> scatterlist out of there will probably improve the interface in general.

I think by number most drm drivers require contiguous memory (or an
iommu that makes it look contiguous). But there's plenty others who
have another set of pagetables on the gpu itself and can
scatter-gather. Usually it's the former for display/video blocks, and
the latter for rendering.

>> dma-buf has all the functions for flushing, so you can have coherent
>> mappings, non-coherent mappings and pretty much anything else. Or well
>> could, because in practice people hack up layering violations until it
>> works for the 2-3 drivers they care about. On top of that there's the
>> small issue that x86 insists that dma is coherent (and that's true for
>> most devices, including v4l drivers you might want to share stuff
>> with), and gpus really, really really do want to make almost
>> everything incoherent.
>
> How do discrete GPUs manage to be incoherent when attached over PCIe?

It has a non-coherent transaction mode (which the chipset can opt to
not implement and still flush), to make sure the AGP horror show
doesn't happen again and GPU folks are happy with PCIe. That's at
least my understanding from digging around in amd the last time we had
coherency issues between intel and amd gpus. GPUs have some bits
somewhere (in the pagetables, or in the buffer object description
table created by userspace) to control that stuff.

For anything on the SoC it's presented as pci device, but that's
extremely fake, and we can definitely do non-snooped transactions on
drm/i915. Again, controlled by a mix of pagetables and
userspace-provided buffer object description tables.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-24 Thread Alex Deucher
On Wed, Apr 25, 2018 at 1:48 AM, Christoph Hellwig  wrote:
> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
>> Out of curiosity, how much virtual flushing stuff is there still out
>> there? At least in drm we've pretty much ignored this, and seem to be
>> getting away without a huge uproar (at least from driver developers
>> and users, core folks are less amused about that).
>
> As I've just been wading through the code, the following architectures
> have non-coherent dma that flushes by virtual address for at least some
> platforms:
>
>  - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips,
>powerpc
>
> These have non-coherent dma ops that flush by physical address:
>
>  - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc
>
> And these do not have non-coherent dma ops at all:
>
>  - alpha, h8300, riscv, unicore32, x86
>
> [1] arm seems to do both virtually and physically based ops, further
> audit needed.
>
> Note that using virtual addresses in the cache flushing interface
> doesn't mean that the cache actually is virtually indexed, but it at
> least allows for the possibility.
>
>> > I think the most important thing about such a buffer object is that
>> > it can distinguish the underlying mapping types.  While
>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
>> > back a dma_addr_t they are in no way interchangeable.  And trying to
>> > stuff them all into a structure like struct scatterlist that has
>> > no indication what kind of mapping you are dealing with is just
>> > asking for trouble.
>>
>> Well the idea was to have 1 interface to allow all drivers to share
>> buffers with anything else, no matter how exactly they're allocated.
>
> Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
> a scatterlist through the sg_table in dma_buf_map_attachment /
> ->map_dma_buf, but looking at a few of the callers it seems like they
> really do not even want a scatterlist to start with, but check that
> it contains a physically contiguous range first.  So kicking the
> scatterlist out of there will probably improve the interface in general.
>
>> dma-buf has all the functions for flushing, so you can have coherent
>> mappings, non-coherent mappings and pretty much anything else. Or well
>> could, because in practice people hack up layering violations until it
>> works for the 2-3 drivers they care about. On top of that there's the
>> small issue that x86 insists that dma is coherent (and that's true for
>> most devices, including v4l drivers you might want to share stuff
>> with), and gpus really, really really do want to make almost
>> everything incoherent.
>
> How do discrete GPUs manage to be incoherent when attached over PCIe?

They can do CPU cache snooped (coherent) or non-snooped (incoherent)
DMA.  Also for things like APUs, they show up as a PCIe device, but
the actual GPU core is part of the same die as the CPU and they have
their own special paths to memory, etc.  The fact that they show up as
PCIe devices is mostly for enumeration purposes.

Alex



Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-24 Thread Christoph Hellwig
On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote:
> Out of curiosity, how much virtual flushing stuff is there still out
> there? At least in drm we've pretty much ignored this, and seem to be
> getting away without a huge uproar (at least from driver developers
> and users, core folks are less amused about that).

As I've just been wading through the code, the following architectures
have non-coherent dma that flushes by virtual address for at least some
platforms:

 - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips,
   powerpc

These have non-coherent dma ops that flush by physical address:

 - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc

And these do not have non-coherent dma ops at all:

 - alpha, h8300, riscv, unicore32, x86

[1] arm seems to do both virtually and physically based ops, further
audit needed.

Note that using virtual addresses in the cache flushing interface
doesn't mean that the cache actually is virtually indexed, but it at
least allows for the possibility.

> > I think the most important thing about such a buffer object is that
> > it can distinguish the underlying mapping types.  While
> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
> > back a dma_addr_t they are in no way interchangeable.  And trying to
> > stuff them all into a structure like struct scatterlist that has
> > no indication what kind of mapping you are dealing with is just
> > asking for trouble.
> 
> Well the idea was to have 1 interface to allow all drivers to share
> buffers with anything else, no matter how exactly they're allocated.

Isn't that interface supposed to be dmabuf?  Currently dma_map leaks
a scatterlist through the sg_table in dma_buf_map_attachment /
->map_dma_buf, but looking at a few of the callers it seems like they
really do not even want a scatterlist to start with, but check that
it contains a physically contiguous range first.  So kicking the
scatterlist out of there will probably improve the interface in general.
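
For reference, a minimal sketch of the importer-side flow referred to here
(dma-buf API as it existed at the time; error handling and detach omitted):

#include <linux/dma-buf.h>
#include <linux/err.h>

/* The importer is handed an sg_table that is already mapped for 'dev'. */
static struct sg_table *example_import(struct dma_buf *buf, struct device *dev)
{
	struct dma_buf_attachment *attach = dma_buf_attach(buf, dev);

	if (IS_ERR(attach))
		return ERR_CAST(attach);

	return dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
}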

> dma-buf has all the functions for flushing, so you can have coherent
> mappings, non-coherent mappings and pretty much anything else. Or well
> could, because in practice people hack up layering violations until it
> works for the 2-3 drivers they care about. On top of that there's the
> small issue that x86 insists that dma is coherent (and that's true for
> most devices, including v4l drivers you might want to share stuff
> with), and gpus really, really really do want to make almost
> everything incoherent.

How do discrete GPUs manage to be incoherent when attached over PCIe?


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-24 Thread Daniel Vetter
On Tue, Apr 24, 2018 at 8:48 PM, Christoph Hellwig  wrote:
> On Fri, Apr 20, 2018 at 05:21:11PM +0200, Daniel Vetter wrote:
>> > At the very lowest level they will need to be handled differently for
>> > many architectures, the question is at what point we'll do the
>> > branching out.
>>
>> Having at least struct page also in that list with (dma_addr_t, length)
>> pairs has a bunch of benefits for drivers in unifying buffer handling
>> code. You just pass that one single list around, use the dma_addr_t side
>> for gpu access (generally bashing it into gpu ptes). And the struct page
>> (if present) for cpu access, using kmap or vm_insert_*. We generally
>> ignore virt, if we do need a full mapping then we construct a vmap for
>> that buffer of our own.
>
> Well, for mapping a resource (which gets back to the start of the
> discussion) you will need an explicit virt pointer.  You also need
> an explicit virt pointer and not just page_address/kmap for users of
> dma_get_sgtable, because for many architectures you will need to flush
> the virtual address used to access the data, which might be a
> vmap/ioremap style mapping returned from dma_alloc_address, and not
> the directly mapped kernel address.

Out of curiosity, how much virtual flushing stuff is there still out
there? At least in drm we've pretty much ignored this, and seem to be
getting away without a huge uproar (at least from driver developers
and users, core folks are less amused about that).

And at least for gpus that seems to have been the case since forever,
or at least since AGP was a thing 20 years ago: AGP isn't coherent, so
needs explicit cache flushing, and we have our own implementations of
that in drivers/char/agp. Luckily AGP died 10 years ago, so no one yet
proposed to port it all over to the iommu framework and hide it behind
the dma api (which really would be the "clean" way to do this, AGP is
simply an IOMMU + special socket dedicated for the add-on gpu).

> Here is another idea at the low-level dma API level:
>
>  - dma_get_sgtable goes away.  The replacement is a new
>dma_alloc_remap helper that takes the virtual address returned
>from dma_alloc_attrs/coherent and creates a dma_addr_t for the
>given new device.  If the original allocation was a coherent
>one no cache flushing is required either (because the arch
>made sure it is coherent), if the original allocation used
>DMA_ATTR_NON_CONSISTENT the new allocation will need
>dma_cache_sync calls as well.

Yeah I think that should work. dma_get_sgtable is a pretty nasty
layering violation.

>  - you never even try to share a mapping returned from
>dma_map_resource - instead each device using it creates a new
>mapping, which makes sense as no virtual addresses are involved
>at all.

Yeah the dma-buf exporter always knows what kind of backing storage it
is dealing with, and for which struct device it should set up a new
view. Hence it can make sure that it calls the right functions to
establish a new mapping, whether that's dma_map_sg, dma_map_resource
or the new dma_alloc_remap (instead of the dma_get_sgtable layering
mixup). The importer doesn't know.

>> So maybe a list of (struct page *, dma_addr_t, num_pages) would suit best,
>> with struct page * being optional (if it's a resource, or something else
>> that the kernel core mm isn't aware of). But that only has benefits if we
>> really roll it out everywhere, in all the subsystems and drivers, since if
>> we don't we've made the struct pages ** <-> sgt conversion fun only worse
>> by adding a third representation of gpu buffer object backing storage.
>
> I think the most important thing about such a buffer object is that
> it can distinguish the underlying mapping types.  While
> dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
> dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
> back a dma_addr_t they are in no way interchangeable.  And trying to
> stuff them all into a structure like struct scatterlist that has
> no indication what kind of mapping you are dealing with is just
> asking for trouble.

Well the idea was to have 1 interface to allow all drivers to share
buffers with anything else, no matter how exactly they're allocated.
dma-buf has all the functions for flushing, so you can have coherent
mappings, non-coherent mappings and pretty much anything else. Or well
could, because in practice people hack up layering violations until it
works for the 2-3 drivers they care about. On top of that there's the
small issue that x86 insists that dma is coherent (and that's true for
most devices, including v4l drivers you might want to share stuff
with), and gpus really, really really do want to make almost
everything incoherent.

The end result is pretty epic :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-24 Thread Christoph Hellwig
On Fri, Apr 20, 2018 at 05:21:11PM +0200, Daniel Vetter wrote:
> > At the very lowest level they will need to be handled differently for
> > many architectures, the question is at what point we'll do the
> > branching out.
> 
> Having at least struct page also in that list with (dma_addr_t, length)
> pairs has a bunch of benefits for drivers in unifying buffer handling
> code. You just pass that one single list around, use the dma_addr_t side
> for gpu access (generally bashing it into gpu ptes). And the struct page
> (if present) for cpu access, using kmap or vm_insert_*. We generally
> ignore virt, if we do need a full mapping then we construct a vmap for
> that buffer of our own.

Well, for mapping a resource (which gets back to the start of the
discussion) you will need an explicit virt pointer.  You also need
an explicit virt pointer and not just page_address/kmap for users of
dma_get_sgtable, because for many architectures you will need to flush
the virtual address used to access the data, which might be a
vmap/ioremap style mapping returned from dma_alloc_address, and not
the directly mapped kernel address.

Here is another idea at the low-level dma API level:

 - dma_get_sgtable goes away.  The replacement is a new
   dma_alloc_remap helper that takes the virtual address returned
   from dma_alloc_attrs/coherent and creates a dma_addr_t for the
   given new device.  If the original allocation was a coherent
   one no cache flushing is required either (because the arch
   made sure it is coherent), if the original allocation used
   DMA_ATTR_NON_CONSISTENT the new allocation will need
   dma_cache_sync calls as well.
 - you never even try to share a mapping returned from
   dma_map_resource - instead each device using it creates a new
   mapping, which makes sense as no virtual addresses are involved
   at all.
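
Read literally, the first proposal might look roughly like the prototype
below; this is purely hypothetical, no such helper exists at the time of
this thread:

/* Hypothetical prototype for the proposed dma_alloc_remap helper. */
dma_addr_t dma_alloc_remap(struct device *new_dev, struct device *orig_dev,
			   void *cpu_addr, size_t size, unsigned long attrs);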

> So maybe a list of (struct page *, dma_addr_t, num_pages) would suit best,
> with struct page * being optional (if it's a resource, or something else
> that the kernel core mm isn't aware of). But that only has benefits if we
> really roll it out everywhere, in all the subsystems and drivers, since if
> we don't we've made the struct pages ** <-> sgt conversion fun only worse
> by adding a third representation of gpu buffer object backing storage.

I think the most important thing about such a buffer object is that
it can distinguish the underlying mapping types.  While
dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT,
dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give
back a dma_addr_t they are in no way interchangeable.  And trying to
stuff them all into a structure like struct scatterlist that has
no indication what kind of mapping you are dealing with is just
asking for trouble.
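
As a sketch of the kind of descriptor being argued for (all names are made
up; the point is only that the origin of each dma_addr_t is recorded
explicitly instead of being flattened into a bare scatterlist):

enum example_map_type {
	EXAMPLE_MAP_COHERENT,		/* dma_alloc_coherent() */
	EXAMPLE_MAP_NON_CONSISTENT,	/* dma_alloc_attrs(..., DMA_ATTR_NON_CONSISTENT) */
	EXAMPLE_MAP_STREAMING,		/* dma_map_page()/dma_map_single()/dma_map_sg() */
	EXAMPLE_MAP_RESOURCE,		/* dma_map_resource(), no struct page */
};

struct example_dma_chunk {
	enum example_map_type type;
	dma_addr_t dma_addr;
	size_t len;
	struct page *page;	/* NULL for resource mappings */
	void *virt;		/* NULL when there is no kernel mapping */
};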


Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-20 Thread Daniel Vetter
On Fri, Apr 20, 2018 at 05:46:25AM -0700, Christoph Hellwig wrote:
> On Fri, Apr 20, 2018 at 12:44:01PM +0200, Christian König wrote:
> > > > What we need is an sg_alloc_table_from_resources(dev, resources,
> > > > num_resources) which does the handling common to all drivers.
> > > A structure that contains
> > > 
> > > {page,offset,len} + {dma_addr+dma_len}
> > > 
> > > is not a good container for storing
> > > 
> > > {virt addr, dma_addr, len}
> > > 
> > > no matter what interface you build around it.
> > 
> > Why not? I mean at least for my use case we actually don't need the virtual
> > address.
> 
> If you don't need the virtual address you need the scatterlist even less.
> 
> > What we need is {dma_addr+dma_len} in a consistent interface which can come
> > from both {page,offset,len} as well as {resource, len}.
> 
> Ok.
> 
> > What I actually don't need is separate handling for system memory and
> > resources, but that is exactly what we would get when we don't use sg_table.
> 
> At the very lowest level they will need to be handled differently for
> many architectures, the question is at what point we'll do the
> branching out.

Having at least struct page also in that list with (dma_addr_t, length)
pairs has a bunch of benefits for drivers in unifying buffer handling
code. You just pass that one single list around, use the dma_addr_t side
for gpu access (generally bashing it into gpu ptes). And the struct page
(if present) for cpu access, using kmap or vm_insert_*. We generally
ignore virt, if we do need a full mapping then we construct a vmap for
that buffer of our own.

If (and that would be serious amounts of work all over the tree, with lots
of drivers) we come up with a new struct for gpu buffers, then I'd also
add "force page alignement for everything" to the requirements list.
That's another mismatch we have, since gpu buffer objects (and dma-buf)
are always full pages. That mismatch motived the addition of the
page-oriented sg iterators.

So maybe a list of (struct page *, dma_addr_t, num_pages) would suit best,
with struct page * being optional (if it's a resource, or something else
that the kernel core mm isn't aware of). But that only has benefits if we
really roll it out everywhere, in all the subsystems and drivers, since if
we don't we've made the struct pages ** <-> sgt conversion fun only worse
by adding a third representation of gpu buffer object backing storage.
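
Spelled out as a struct, such a list entry might look like this (names
invented purely for illustration):

struct example_gpu_chunk {
	struct page *page;	/* NULL for resources the core mm doesn't know about */
	dma_addr_t dma_addr;	/* what the GPU PTEs get built from */
	unsigned int num_pages;	/* length, always in full pages */
};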
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-20 Thread Christoph Hellwig
On Fri, Apr 20, 2018 at 12:44:01PM +0200, Christian König wrote:
> > > What we need is an sg_alloc_table_from_resources(dev, resources,
> > > num_resources) which does the handling common to all drivers.
> > A structure that contains
> > 
> > {page,offset,len} + {dma_addr+dma_len}
> > 
> > is not a good container for storing
> > 
> > {virt addr, dma_addr, len}
> > 
> > no matter what interface you build around it.
> 
> Why not? I mean at least for my use case we actually don't need the virtual
> address.

If you don't need the virtual address you need the scatterlist even less.

> What we need is {dma_addr+dma_len} in a consistent interface which can come
> from both {page,offset,len} as well as {resource, len}.

Ok.

> What I actually don't need is separate handling for system memory and
> resources, but that is exactly what we would get when we don't use sg_table.

At the very lowest level they will need to be handled differently for
many architectures, the question is at what point we'll do the
branching out.



Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-20 Thread Christian König

On 20.04.2018 at 12:17, Christoph Hellwig wrote:

On Fri, Apr 20, 2018 at 10:58:50AM +0200, Christian König wrote:

Yes there's a bit of a layering violation insofar as drivers really
shouldn't each have their own copy of "how do I convert a piece of dma
memory into  dma-buf", but that doesn't render the interface a bad idea.

Completely agree on that.

What we need is an sg_alloc_table_from_resources(dev, resources,
num_resources) which does the handling common to all drivers.

A structure that contains

{page,offset,len} + {dma_addr+dma_len}

is not a good container for storing

{virt addr, dma_addr, len}

no matter what interface you build around it.


Why not? I mean at least for my use case we actually don't need the 
virtual address.


What we need is {dma_addr+dma_len} in a consistent interface which can 
come from both {page,offset,len} as well as {resource, len}.


What I actually don't need is separate handling for system memory and 
resources, but that is exactly what we would get when we don't use sg_table.


Christian.


And that is discounting
all the problems around mapping coherent allocations for other devices,
or the iommu merging problem we are having another thread on.

So let's come up with a better high level interface first, and then
worry about how to implement it in the low-level dma-mapping interface
second.  Especially given that my consolidation of the dma_map_ops
implementation is in full swing and there shouldn't be all that many
to bother with.

So first question:  Do you actually care about having multiple
pairs of the above, or do all chunks instead just deal with a single one
of the above?  In that case we really should not need that many
new interfaces as dma_map_resource will be all you need anyway.


Christian.






Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-20 Thread Christoph Hellwig
On Fri, Apr 20, 2018 at 10:58:50AM +0200, Christian König wrote:
> > Yes there's a bit of a layering violation insofar as drivers really
> > shouldn't each have their own copy of "how do I convert a piece of dma
> > memory into  dma-buf", but that doesn't render the interface a bad idea.
> 
> Completely agree on that.
> 
> What we need is an sg_alloc_table_from_resources(dev, resources,
> num_resources) which does the handling common to all drivers.

A structure that contains

{page,offset,len} + {dma_addr+dma_len}

is not a good container for storing

{virt addr, dma_addr, len}

no matter what interface you build around it.  And that is discounting
all the problems around mapping coherent allocations for other devices,
or the iommu merging problem we are having another thread on.

So let's come up with a better high level interface first, and then
worry about how to implement it in the low-level dma-mapping interface
second.  Especially given that my consolidation of the dma_map_ops
implementation is in full swing and there shouldn't be all that many
to bother with.

So first question:  Do you actually care about having multiple
pairs of the above, or do all chunks instead just deal with a single one
of the above?  In that case we really should not need that many
new interfaces as dma_map_resource will be all you need anyway.

> 
> Christian.
> 
> > -Daniel
> 


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-20 Thread Christian König

On 20.04.2018 at 09:13, Daniel Vetter wrote:

On Thu, Apr 19, 2018 at 01:16:57AM -0700, Christoph Hellwig wrote:

On Mon, Apr 16, 2018 at 03:38:56PM +0200, Daniel Vetter wrote:

We've broken that assumption in i915 years ago. Not struct page backed
gpu memory is very real.

Of course we'll never feed such a strange sg table to a driver which
doesn't understand it, but allowing sg_page == NULL works perfectly
fine. At least for gpu drivers.

For GPU drivers on x86 with no dma coherency problems, sure.  But not
all the world is x86.  We already have problems due to dmabuf's use
of the awkward get_sgtable interface (see the comment on
arm_dma_get_sgtable that I fully agree with), and doing this for memory
that doesn't have a struct page at all will make things even worse.

x86 dma isn't coherent either, if you're a GPU :-) Flushing gpu caches
tends to be too expensive, so there's pci-e support and chipset support to
forgo it. Plus drivers flushing caches themselves.

The dma_get_sgtable thing is indeed fun, right solution would probably be
to push the dma-buf export down into the dma layer. The comment for
arm_dma_get_sgtable is also not really a concern, because dma-buf also
abstracts away the flushing (or well is supposed to), so there really
shouldn't be anyone calling the streaming apis on the returned sg table.
That's why dma-buf gives you an sg table that's mapped already.


If that's not acceptable then I guess we could go over the entire tree
and frob all the gpu related code to switch over to a new struct
sg_table_might_not_be_struct_page_backed, including all the other
functions we added over the past few years to iterate over sg tables.
But seems slightly silly, given that sg tables seem to do exactly what
we need.

It isn't silly.  We will have to do some surgery like that anyway
because the current APIs don't work.  So relax, sit back and come up
with an API that solves the existing issues and serves us well in
the future.

So we should just implement a copy of sg table for dma-buf, since I still
think it does exactly what we need for gpus?

Yes there's a bit of a layering violation insofar as drivers really
shouldn't each have their own copy of "how do I convert a piece of dma
memory into a dma-buf", but that doesn't render the interface a bad idea.


Completely agree on that.

What we need is an sg_alloc_table_from_resources(dev, resources, 
num_resources) which does the handling common to all drivers.
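
One possible shape for such a helper, purely as a sketch: it does not exist
in the kernel, the exact signature is guesswork, and error unwinding is
omitted.  It assumes the resources are MMIO ranges that each become one
page-less, already mapped sg entry.

#include <linux/dma-mapping.h>
#include <linux/ioport.h>
#include <linux/scatterlist.h>

static int sg_alloc_table_from_resources(struct device *dev,
                                         struct sg_table *sgt,
                                         const struct resource *resources,
                                         unsigned int num_resources)
{
        struct scatterlist *sg;
        unsigned int i;
        int ret;

        ret = sg_alloc_table(sgt, num_resources, GFP_KERNEL);
        if (ret)
                return ret;

        for_each_sg(sgt->sgl, sg, num_resources, i) {
                resource_size_t size = resource_size(&resources[i]);
                dma_addr_t addr;

                addr = dma_map_resource(dev, resources[i].start, size,
                                        DMA_BIDIRECTIONAL, 0);
                if (dma_mapping_error(dev, addr)) {
                        /* real code would unmap what was already mapped */
                        sg_free_table(sgt);
                        return -EIO;
                }

                sg_set_page(sg, NULL, size, 0); /* no struct page backing */
                sg_dma_address(sg) = addr;
                sg_dma_len(sg) = size;
        }

        return 0;
}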


Christian.


-Daniel




Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-20 Thread Daniel Vetter
On Thu, Apr 19, 2018 at 01:16:57AM -0700, Christoph Hellwig wrote:
> On Mon, Apr 16, 2018 at 03:38:56PM +0200, Daniel Vetter wrote:
> > We've broken that assumption in i915 years ago. Not struct page backed
> > gpu memory is very real.
> > 
> > Of course we'll never feed such a strange sg table to a driver which
> > doesn't understand it, but allowing sg_page == NULL works perfectly
> > fine. At least for gpu drivers.
> 
> For GPU drivers on x86 with no dma coherency problems, sure.  But not
> all the world is x86.  We already have problems due to dma-buf's use
> of the awkward get_sgtable interface (see the comment on
> arm_dma_get_sgtable that I fully agree with), and doing this for memory
> that doesn't have a struct page at all will make things even worse.

x86 dma isn't coherent either, if you're a GPU :-) Flushing gpu caches
tends to be too expensive, so there's pci-e support and chipset support to
forgo it. Plus drivers flushing caches themselves.

The dma_get_sgtable thing is indeed fun, right solution would probably be
to push the dma-buf export down into the dma layer. The comment for
arm_dma_get_sgtable is also not really a concern, because dma-buf also
abstracts away the flushing (or well is supposed to), so there really
shouldn't be anyone calling the streaming apis on the returned sg table.
That's why dma-buf gives you an sg table that's mapped already.
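
In other words, a well-behaved importer only ever does something like the
sketch below with the table it gets back from dma_buf_map_attachment(): it
reads the dma side and programs its own (GPU) translation structures, and it
never touches sg_page() or re-maps the list.  The helper name is made up.

#include <linux/scatterlist.h>

static void importer_program_sgt(struct sg_table *sgt)
{
        struct scatterlist *sg;
        unsigned int i;

        for_each_sg(sgt->sgl, sg, sgt->nents, i) {
                dma_addr_t addr = sg_dma_address(sg);
                unsigned int len = sg_dma_len(sg);

                /* write addr/len into the device's own page tables or
                 * descriptors; no sg_page()/sg_virt(), no dma_map_sg() */
                (void)addr;
                (void)len;
        }
}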

> > If that's not acceptable then I guess we could go over the entire tree
> > and frob all the gpu related code to switch over to a new struct
> > sg_table_might_not_be_struct_page_backed, including all the other
> > functions we added over the past few years to iterate over sg tables.
> > But seems slightly silly, given that sg tables seem to do exactly what
> > we need.
> 
> It isn't silly.  We will have to do some surgery like that anyway
> because the current APIs don't work.  So relax, sit back and come up
> with an API that solves the existing issues and serves us well in
> the future.

So we should just implement a copy of sg table for dma-buf, since I still
think it does exactly what we need for gpus?

Yes there's a bit of a layering violation insofar as drivers really
shouldn't each have their own copy of "how do I convert a piece of dma
memory into a dma-buf", but that doesn't render the interface a bad idea.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-19 Thread Christoph Hellwig
On Mon, Apr 16, 2018 at 03:38:56PM +0200, Daniel Vetter wrote:
> We've broken that assumption in i915 years ago. Not struct page backed
> gpu memory is very real.
> 
> Of course we'll never feed such a strange sg table to a driver which
> doesn't understand it, but allowing sg_page == NULL works perfectly
> fine. At least for gpu drivers.

For GPU drivers on x86 with no dma coherency problems, sure.  But not
all the world is x86.  We already have problems due to dma-buf's use
of the awkward get_sgtable interface (see the comment on
arm_dma_get_sgtable that I fully agree with), and doing this for memory
that doesn't have a struct page at all will make things even worse.

> If that's not acceptable then I guess we could go over the entire tree
> and frob all the gpu related code to switch over to a new struct
> sg_table_might_not_be_struct_page_backed, including all the other
> functions we added over the past few years to iterate over sg tables.
> But seems slightly silly, given that sg tables seem to do exactly what
> we need.

It isn't silly.  We will have to do some surgery like that anyway
because the current APIs don't work.  So relax, sit back and come up
with an API that solves the existing issues and serves us well in
the future.


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-16 Thread Daniel Vetter
On Mon, Apr 16, 2018 at 2:39 PM, Christoph Hellwig  wrote:
> On Tue, Apr 03, 2018 at 08:08:32PM +0200, Daniel Vetter wrote:
>> I did not mean you should dma_map_sg/page. I just meant that using
>> dma_map_resource to fill only the dma address part of the sg table seems
>> perfectly sufficient.
>
> But that is not how the interface works, especially when it comes to sg_dma_len.
>
>> Assuming you get an sg table that's been mapped by calling dma_map_sg was
>> always a bit of a case of bending the abstraction to avoid typing code. The
>> only thing an importer ever should have done is look at the dma addresses
>> in that sg table, nothing else.
>
> The scatterlist is not a very good abstraction unfortunately, but it
> is spread all over the kernel.  And we do expect that anyone who
> gets passed a scatterlist can use sg_page() or sg_virt() (which calls
> sg_page()) on it.  Your changes would break that, and will cause major
> trouble because of that.
>
> If you want to expose p2p memory returned from dma_map_resource in
> dmabuf do not use scatterlists for this please, but with a new interface
> that explicitly passes a virtual address, a dma address and a length
> and make it very clear that virt_to_page will not work on the virtual
> address.

We've broken that assumption in i915 years ago. Not struct page backed
gpu memory is very real.

Of course we'll never feed such a strange sg table to a driver which
doesn't understand it, but allowing sg_page == NULL works perfectly
fine. At least for gpu drivers.
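
For a driver that genuinely cannot cope with such a table, the up-front check
is cheap; a sketch (the helper name is made up):

#include <linux/scatterlist.h>

static bool sgt_is_struct_page_backed(struct sg_table *sgt)
{
        struct scatterlist *sg;
        unsigned int i;

        for_each_sg(sgt->sgl, sg, sgt->orig_nents, i)
                if (!sg_page(sg))
                        return false;   /* page-less entry, bail out early */

        return true;
}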

If that's not acceptable then I guess we could go over the entire tree
and frob all the gpu related code to switch over to a new struct
sg_table_might_not_be_struct_page_backed, including all the other
functions we added over the past few years to iterate over sg tables.
But seems slightly silly, given that sg tables seem to do exactly what
we need.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-16 Thread Christoph Hellwig
On Tue, Apr 03, 2018 at 08:08:32PM +0200, Daniel Vetter wrote:
> I did not mean you should dma_map_sg/page. I just meant that using
> dma_map_resource to fill only the dma address part of the sg table seems
> perfectly sufficient.

But that is not how the interface works, especially when it comes to sg_dma_len.

> Assuming you get an sg table that's been mapped by calling dma_map_sg was
> always a bit of a case of bending the abstraction to avoid typing code. The
> only thing an importer ever should have done is look at the dma addresses
> in that sg table, nothing else.

The scatterlist is not a very good abstraction unfortunately, but it
is spread all over the kernel.  And we do expect that anyone who
gets passed a scatterlist can use sg_page() or sg_virt() (which calls
sg_page()) on it.  Your changes would break that, and will cause major
trouble because of that.
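
The kind of generic code this refers to looks roughly like the sketch below;
it is a perfectly normal pattern today, and it oopses (or reads garbage) the
moment an sg entry has no struct page behind it.  The helper name is made up.

#include <linux/scatterlist.h>
#include <linux/string.h>

static void copy_one_sg_entry(void *dst, struct scatterlist *sg)
{
        void *src = sg_virt(sg);        /* page_address(sg_page(sg)) + offset */

        memcpy(dst, src, sg->length);
}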

If you want to expose p2p memory returned from dma_map_resource in
dmabuf do not use scatterlists for this please, but with a new interface
that explicitly passes a virtual address, a dma address and a length
and make it very clear that virt_to_page will not work on the virtual
address.


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-03 Thread Daniel Vetter
On Tue, Apr 03, 2018 at 01:06:45PM -0400, Jerome Glisse wrote:
> On Tue, Apr 03, 2018 at 11:09:09AM +0200, Daniel Vetter wrote:
> > On Thu, Mar 29, 2018 at 01:34:24PM +0200, Christian König wrote:
> > > Am 29.03.2018 um 08:57 schrieb Daniel Vetter:
> > > > On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> > > > > Add a peer2peer flag noting that the importer can deal with device
> > > > > resources which are not backed by pages.
> > > > > 
> > > > > Signed-off-by: Christian König 
> > > > Um strictly speaking they all should, but ttm never bothered to use the
> > > > real interfaces but just hacked around the provided sg list, grabbing 
> > > > the
> > > > underlying struct pages, then rebuilding&remapping the sg list again.
> > > 
> > > Actually that isn't correct. TTM converts them to a dma address array
> > > because drivers need it like this (at least nouveau, radeon and amdgpu).
> > > 
> > > I've fixed radeon and amdgpu to be able to deal without it and mailed with
> > > Ben about nouveau, but the outcome is they don't really know.
> > > 
> > > TTM itself doesn't have any need for the pages on imported BOs (you can't
> > > mmap them anyway), the real underlying problem is that sg tables don't
> > > provide what drivers need.
> > > 
> > > I think we could rather easily fix sg tables, but that is a totally 
> > > separate
> > > task.
> > 
> > Looking at patch 8, the sg table seems perfectly sufficient to convey the
> > right dma addresses to the importer. Of course the exporter has to set up
> > the right kind of iommu mappings to make this work.
> > 
> > > > The entire point of using sg lists was exactly to allow this use case of
> > > > peer2peer dma (or well in general have special exporters which managed
> > > > memory/IO ranges not backed by struct page). So essentially you're having
> > > > an "I'm totally not broken" flag here.
> > > 
> > > No, independent of needed struct page pointers we need to note if the
> > > exporter can handle peer2peer stuff from the hardware side in general.
> > > 
> > > So what I did was just to set peer2peer allowed on the importer because of
> > > the driver needs and clear it in the exporter if the hardware can't handle
> > > that.
> > 
> > The only thing the importer seems to do is call the
> > pci_peer_traffic_supported, which the exporter could call too. What am I
> > missing (since the struct_page stuff sounds like it's fixed already by
> > you)?
> > -Daniel
> 
> AFAIK Logan patchset require to register and initialize struct page
> for the device memory you want to map (export from exporter point of
> view).
> 
> With GPU this isn't something we want, struct page is >~= 2^6 bytes, so for
> 4GB GPU = 2^6*2^32/2^12 = 2^26 = 64MB of RAM
> 8GB GPU = 2^6*2^33/2^12 = 2^27 = 128MB of RAM
> 16GB GPU = 2^6*2^34/2^12 = 2^28 = 256MB of RAM
> 32GB GPU = 2^6*2^35/2^12 = 2^29 = 512MB of RAM
> 
> All this is mostly wasted as only a small sub-set (that cannot be
> constrained to a specific range) will ever be exported at any point in
> time. For GPU workloads this is hardly justifiable, even for HMM I
> do not plan to register all those pages.
> 
> Hence why I argue that dma_map_resource(), as used by Christian, is
> good enough for us. People that care about SG can fix that, but I'd
> rather not have to depend on that and waste system memory.

I did not mean you should dma_map_sg/page. I just meant that using
dma_map_resource to fill only the dma address part of the sg table seems
perfectly sufficient. And that's exactly why the importer gets an already
mapped sg table, so that it doesn't have to call dma_map_sg on something
that dma_map_sg can't handle.

Assuming you get an sg table that's been mapped by calling dma_map_sg was
always a bit of a case of bending the abstraction to avoid typing code. The
only thing an importer ever should have done is look at the dma addresses
in that sg table, nothing else.

And p2p seems to perfectly fit into this (surprise, it was meant to).
That's why I suggested we annotate the broken importers who assume the sg
table is mapped using dma_map_sg or has a struct_page backing the memory
(but there don't seem to be any left) instead of annotating the
ones that aren't broken with a flag that's confusing - you could also have
a dma-buf sgt that points at some other memory that doesn't have struct
pages backing it.

Aside: At least internally in i915 we've been using this forever for our
own private/stolen memory. Unfortunately no other device can access that
range of memory, which is why we don't allow it to be imported to anything
but i915 itself. But if that hw restriction doesn't exist, it would
work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-03 Thread Jerome Glisse
On Tue, Apr 03, 2018 at 11:09:09AM +0200, Daniel Vetter wrote:
> On Thu, Mar 29, 2018 at 01:34:24PM +0200, Christian König wrote:
> > Am 29.03.2018 um 08:57 schrieb Daniel Vetter:
> > > On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> > > > Add a peer2peer flag noting that the importer can deal with device
> > > > resources which are not backed by pages.
> > > > 
> > > > Signed-off-by: Christian König 
> > > Um strictly speaking they all should, but ttm never bothered to use the
> > > real interfaces but just hacked around the provided sg list, grabbing the
> > > underlying struct pages, then rebuilding&remapping the sg list again.
> > 
> > Actually that isn't correct. TTM converts them to a dma address array
> > because drivers need it like this (at least nouveau, radeon and amdgpu).
> > 
> > I've fixed radeon and amdgpu to be able to deal without it and mailed with
> > Ben about nouveau, but the outcome is they don't really know.
> > 
> > TTM itself doesn't have any need for the pages on imported BOs (you can't
> > mmap them anyway), the real underlying problem is that sg tables don't
> > provide what drivers need.
> > 
> > I think we could rather easily fix sg tables, but that is a totally separate
> > task.
> 
> Looking at patch 8, the sg table seems perfectly sufficient to convey the
> right dma addresses to the importer. Of course the exporter has to set up
> the right kind of iommu mappings to make this work.
> 
> > > The entire point of using sg lists was exactly to allow this use case of
> > > peer2peer dma (or well in general have special exporters which managed
> > > memory/IO ranges not backed by struct page). So essentially you're having
> > > an "I'm totally not broken" flag here.
> > 
> > No, independent of needed struct page pointers we need to note if the
> > exporter can handle peer2peer stuff from the hardware side in general.
> > 
> > So what I did was just to set peer2peer allowed on the importer because of
> > the driver needs and clear it in the exporter if the hardware can't handle
> > that.
> 
> The only thing the importer seems to do is call the
> pci_peer_traffic_supported, which the exporter could call too. What am I
> missing (since the struct_page stuff sounds like it's fixed already by
> you)?
> -Daniel

AFAIK Logan patchset require to register and initialize struct page
for the device memory you want to map (export from exporter point of
view).

With GPU this isn't something we want, struct page is >~= 2^6 bytes, so for
4GB GPU = 2^6*2^32/2^12 = 2^26 = 64MB of RAM
8GB GPU = 2^6*2^33/2^12 = 2^27 = 128MB of RAM
16GB GPU = 2^6*2^34/2^12 = 2^28 = 256MB of RAM
32GB GPU = 2^6*2^35/2^12 = 2^29 = 512MB of RAM

All this is mostly wasted as only a small sub-set (that cannot be
constrained to a specific range) will ever be exported at any point in
time. For GPU workloads this is hardly justifiable, even for HMM I
do not plan to register all those pages.
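
The numbers above follow directly from the page count; for reference, the
same back-of-the-envelope calculation in code form, assuming a 64-byte
struct page and 4KiB pages (the helper name is made up):

/* bytes of struct page needed to back vram_bytes of device memory */
static inline unsigned long long page_backing_overhead(unsigned long long vram_bytes)
{
        const unsigned long long struct_page_size = 64;         /* ~2^6 */
        const unsigned long long page_size = 4096;              /*  2^12 */

        return (vram_bytes / page_size) * struct_page_size;
}

/* page_backing_overhead(32ULL << 30) == 512 MiB, matching the list above */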

Hence why I argue that dma_map_resource(), as used by Christian, is
good enough for us. People that care about SG can fix that, but I'd
rather not have to depend on that and waste system memory.

Cheers,
Jérôme


Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-04-03 Thread Daniel Vetter
On Thu, Mar 29, 2018 at 01:34:24PM +0200, Christian König wrote:
> Am 29.03.2018 um 08:57 schrieb Daniel Vetter:
> > On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> > > Add a peer2peer flag noting that the importer can deal with device
> > > resources which are not backed by pages.
> > > 
> > > Signed-off-by: Christian König 
> > Um strictly speaking they all should, but ttm never bothered to use the
> > real interfaces but just hacked around the provided sg list, grabbing the
> > underlying struct pages, then rebuilding&remapping the sg list again.
> 
> Actually that isn't correct. TTM converts them to a dma address array
> because drivers need it like this (at least nouveau, radeon and amdgpu).
> 
> I've fixed radeon and amdgpu to be able to deal without it and mailed with
> Ben about nouveau, but the outcome is they don't really know.
> 
> TTM itself doesn't have any need for the pages on imported BOs (you can't
> mmap them anyway), the real underlying problem is that sg tables don't
> provide what drivers need.
> 
> I think we could rather easily fix sg tables, but that is a totally separate
> task.

Looking at patch 8, the sg table seems perfectly sufficient to convey the
right dma addresses to the importer. Of course the exporter has to set up
the right kind of iommu mappings to make this work.

> > The entire point of using sg lists was exactly to allow this use case of
> > peer2peer dma (or well in general have special exporters which managed
> > memory/IO ranges not backed by struct page). So essentially you're having
> > an "I'm totally not broken" flag here.
> 
> No, independent of needed struct page pointers we need to note if the
> exporter can handle peer2peer stuff from the hardware side in general.
> 
> So what I did was just to set peer2peer allowed on the importer because of
> the driver needs and clear it in the exporter if the hardware can't handle
> that.

The only thing the importer seems to do is call the
pci_peer_traffic_supported, which the exporter could call too. What am I
missing (since the struct_page stuff sounds like it's fixed already by
you)?
-Daniel

> > I think a better approach would be if we add a requires_struct_page or so,
> > and annotate the current importers accordingly. Or we just fix them up (it
> > is all in shared ttm code after all, I think everyone else got this
> > right).
> 
> I would rather not bet on that.
> 
> Christian.
> 
> > -Daniel
> > 
> > > ---
> > >   drivers/dma-buf/dma-buf.c | 1 +
> > >   include/linux/dma-buf.h   | 4 
> > >   2 files changed, 5 insertions(+)
> > > 
> > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > > index ffaa2f9a9c2c..f420225f93c6 100644
> > > --- a/drivers/dma-buf/dma-buf.c
> > > +++ b/drivers/dma-buf/dma-buf.c
> > > @@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const 
> > > struct dma_buf_attach_info *info
> > >   attach->dev = info->dev;
> > >   attach->dmabuf = dmabuf;
> > > + attach->peer2peer = info->peer2peer;
> > >   attach->priv = info->priv;
> > >   attach->invalidate = info->invalidate;
> > > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> > > index 15dd8598bff1..1ef50bd9bc5b 100644
> > > --- a/include/linux/dma-buf.h
> > > +++ b/include/linux/dma-buf.h
> > > @@ -313,6 +313,7 @@ struct dma_buf {
> > >* @dmabuf: buffer for this attachment.
> > >* @dev: device attached to the buffer.
> > >* @node: list of dma_buf_attachment.
> > > + * @peer2peer: true if the importer can handle peer resources without 
> > > pages.
> > >* @priv: exporter specific attachment data.
> > >*
> > >* This structure holds the attachment information between the dma_buf 
> > > buffer
> > > @@ -328,6 +329,7 @@ struct dma_buf_attachment {
> > >   struct dma_buf *dmabuf;
> > >   struct device *dev;
> > >   struct list_head node;
> > > + bool peer2peer;
> > >   void *priv;
> > >   /**
> > > @@ -392,6 +394,7 @@ struct dma_buf_export_info {
> > >* @dmabuf: the exported dma_buf
> > >* @dev:the device which wants to import the attachment
> > >* @priv:   private data of importer to this attachment
> > > + * @peer2peer:   true if the importer can handle peer resources without 
> > > pages
> > >* @invalidate: callback to use for invalidating mappings
> > >*
> > >* This structure holds the information required to attach to a buffer. 
> > > Used
> > > @@ -401,6 +404,7 @@ struct dma_buf_attach_info {
> > >   struct dma_buf *dmabuf;
> > >   struct device *dev;
> > >   void *priv;
> > > + bool peer2peer;
> > >   void (*invalidate)(struct dma_buf_attachment *attach);
> > >   };
> > > -- 
> > > 2.14.1
> > > 

Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-03-29 Thread Christian König

Am 29.03.2018 um 08:57 schrieb Daniel Vetter:

On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:

Add a peer2peer flag noting that the importer can deal with device
resources which are not backed by pages.

Signed-off-by: Christian König 

Um strictly speaking they all should, but ttm never bothered to use the
real interfaces but just hacked around the provided sg list, grabbing the
underlying struct pages, then rebuilding&remapping the sg list again.


Actually that isn't correct. TTM converts them to a dma address array 
because drivers need it like this (at least nouveau, radeon and amdgpu).


I've fixed radeon and amdgpu to be able to deal without it and mailed 
with Ben about nouveau, but the outcome is they don't really know.


TTM itself doesn't have any need for the pages on imported BOs (you 
can't mmap them anyway), the real underlying problem is that sg tables
don't provide what drivers need.


I think we could rather easily fix sg tables, but that is a totally 
separate task.



The entire point of using sg lists was exactly to allow this use case of
peer2peer dma (or well in general have special exporters which managed
memory/IO ranges not backed by struct page). So essentially you're having
an "I'm totally not broken" flag here.


No, independent of needed struct page pointers we need to note if the 
exporter can handle peer2peer stuff from the hardware side in general.


So what I did was just to set peer2peer allowed on the importer
because of the driver needs and clear it in the exporter if the hardware 
can't handle that.



I think a better approach would be if we add a requires_struct_page or so,
and annotate the current importers accordingly. Or we just fix them up (it
is all in shared ttm code after all, I think everyone else got this
right).


I would rather not bet on that.

Christian.


-Daniel


---
  drivers/dma-buf/dma-buf.c | 1 +
  include/linux/dma-buf.h   | 4 
  2 files changed, 5 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index ffaa2f9a9c2c..f420225f93c6 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct 
dma_buf_attach_info *info
  
  	attach->dev = info->dev;

attach->dmabuf = dmabuf;
+   attach->peer2peer = info->peer2peer;
attach->priv = info->priv;
attach->invalidate = info->invalidate;
  
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h

index 15dd8598bff1..1ef50bd9bc5b 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -313,6 +313,7 @@ struct dma_buf {
   * @dmabuf: buffer for this attachment.
   * @dev: device attached to the buffer.
   * @node: list of dma_buf_attachment.
+ * @peer2peer: true if the importer can handle peer resources without pages.
   * @priv: exporter specific attachment data.
   *
   * This structure holds the attachment information between the dma_buf buffer
@@ -328,6 +329,7 @@ struct dma_buf_attachment {
struct dma_buf *dmabuf;
struct device *dev;
struct list_head node;
+   bool peer2peer;
void *priv;
  
  	/**

@@ -392,6 +394,7 @@ struct dma_buf_export_info {
   * @dmabuf:   the exported dma_buf
   * @dev:  the device which wants to import the attachment
   * @priv: private data of importer to this attachment
+ * @peer2peer: true if the importer can handle peer resources without pages
   * @invalidate:   callback to use for invalidating mappings
   *
   * This structure holds the information required to attach to a buffer. Used
@@ -401,6 +404,7 @@ struct dma_buf_attach_info {
struct dma_buf *dmabuf;
struct device *dev;
void *priv;
+   bool peer2peer;
void (*invalidate)(struct dma_buf_attachment *attach);
  };
  
--

2.14.1



Re: [PATCH 4/8] dma-buf: add peer2peer flag

2018-03-28 Thread Daniel Vetter
On Sun, Mar 25, 2018 at 12:59:56PM +0200, Christian König wrote:
> Add a peer2peer flag noting that the importer can deal with device
> resources which are not backed by pages.
> 
> Signed-off-by: Christian König 

Um strictly speaking they all should, but ttm never bothered to use the
real interfaces but just hacked around the provided sg list, grabbing the
underlying struct pages, then rebuilding&remapping the sg list again.

The entire point of using sg lists was exactly to allow this use case of
peer2peer dma (or well in general have special exporters which managed
memory/IO ranges not backed by struct page). So essentially you're having
an "I'm totally not broken" flag here.

I think a better approach would be if we add a requires_struct_page or so,
and annotate the current importers accordingly. Or we just fix them up (it
is all in shared ttm code after all, I think everyone else got this
right).
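
Spelled out, the inverted annotation being suggested here would look roughly
like the hypothetical variant below; it is not part of the posted patch and
the struct and field names are made up for illustration.

#include <linux/dma-buf.h>

/* hypothetical alternative: broken importers say they still need pages,
 * and the exporter then refuses to hand them page-less sg entries */
struct dma_buf_attach_info_alt {
        struct dma_buf *dmabuf;
        struct device *dev;
        void *priv;
        bool requires_struct_page;      /* importer still uses sg_page() */
        void (*invalidate)(struct dma_buf_attachment *attach);
};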
-Daniel

> ---
>  drivers/dma-buf/dma-buf.c | 1 +
>  include/linux/dma-buf.h   | 4 
>  2 files changed, 5 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index ffaa2f9a9c2c..f420225f93c6 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct 
> dma_buf_attach_info *info
>  
>   attach->dev = info->dev;
>   attach->dmabuf = dmabuf;
> + attach->peer2peer = info->peer2peer;
>   attach->priv = info->priv;
>   attach->invalidate = info->invalidate;
>  
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 15dd8598bff1..1ef50bd9bc5b 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -313,6 +313,7 @@ struct dma_buf {
>   * @dmabuf: buffer for this attachment.
>   * @dev: device attached to the buffer.
>   * @node: list of dma_buf_attachment.
> + * @peer2peer: true if the importer can handle peer resources without pages.
>   * @priv: exporter specific attachment data.
>   *
>   * This structure holds the attachment information between the dma_buf buffer
> @@ -328,6 +329,7 @@ struct dma_buf_attachment {
>   struct dma_buf *dmabuf;
>   struct device *dev;
>   struct list_head node;
> + bool peer2peer;
>   void *priv;
>  
>   /**
> @@ -392,6 +394,7 @@ struct dma_buf_export_info {
>   * @dmabuf:  the exported dma_buf
>   * @dev: the device which wants to import the attachment
>   * @priv:private data of importer to this attachment
> + * @peer2peer:   true if the importer can handle peer resources without 
> pages
>   * @invalidate:  callback to use for invalidating mappings
>   *
>   * This structure holds the information required to attach to a buffer. Used
> @@ -401,6 +404,7 @@ struct dma_buf_attach_info {
>   struct dma_buf *dmabuf;
>   struct device *dev;
>   void *priv;
> + bool peer2peer;
>   void (*invalidate)(struct dma_buf_attachment *attach);
>  };
>  
> -- 
> 2.14.1
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


[PATCH 4/8] dma-buf: add peer2peer flag

2018-03-25 Thread Christian König
Add a peer2peer flag noting that the importer can deal with device
resources which are not backed by pages.

Signed-off-by: Christian König 
---
 drivers/dma-buf/dma-buf.c | 1 +
 include/linux/dma-buf.h   | 4 
 2 files changed, 5 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index ffaa2f9a9c2c..f420225f93c6 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -565,6 +565,7 @@ struct dma_buf_attachment *dma_buf_attach(const struct dma_buf_attach_info *info
 
attach->dev = info->dev;
attach->dmabuf = dmabuf;
+   attach->peer2peer = info->peer2peer;
attach->priv = info->priv;
attach->invalidate = info->invalidate;
 
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 15dd8598bff1..1ef50bd9bc5b 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -313,6 +313,7 @@ struct dma_buf {
  * @dmabuf: buffer for this attachment.
  * @dev: device attached to the buffer.
  * @node: list of dma_buf_attachment.
+ * @peer2peer: true if the importer can handle peer resources without pages.
  * @priv: exporter specific attachment data.
  *
  * This structure holds the attachment information between the dma_buf buffer
@@ -328,6 +329,7 @@ struct dma_buf_attachment {
struct dma_buf *dmabuf;
struct device *dev;
struct list_head node;
+   bool peer2peer;
void *priv;
 
/**
@@ -392,6 +394,7 @@ struct dma_buf_export_info {
  * @dmabuf:the exported dma_buf
  * @dev:   the device which wants to import the attachment
  * @priv:  private data of importer to this attachment
+ * @peer2peer: true if the importer can handle peer resources without pages
  * @invalidate:callback to use for invalidating mappings
  *
  * This structure holds the information required to attach to a buffer. Used
@@ -401,6 +404,7 @@ struct dma_buf_attach_info {
struct dma_buf *dmabuf;
struct device *dev;
void *priv;
+   bool peer2peer;
void (*invalidate)(struct dma_buf_attachment *attach);
 };
 
-- 
2.14.1
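
For context, a sketch of how the flag would be consumed on both ends,
assuming the dma_buf_attach_info based dma_buf_attach() from this series;
everything except the peer2peer field and the dma-buf API itself is a
made-up placeholder.

#include <linux/dma-buf.h>
#include <linux/err.h>

/* importer side: opt in when attaching */
static struct dma_buf_attachment *example_attach(struct dma_buf *dmabuf,
                                                 struct device *dev)
{
        struct dma_buf_attach_info info = {
                .dmabuf    = dmabuf,
                .dev       = dev,
                .peer2peer = true,      /* can consume page-less resources */
        };

        return dma_buf_attach(&info);
}

/* exporter side: only hand out page-less (e.g. VRAM) mappings to
 * importers that opted in; otherwise fall back or fail the mapping. */
static struct sg_table *example_map_dma_buf(struct dma_buf_attachment *attach,
                                            enum dma_data_direction dir)
{
        if (!attach->peer2peer)
                return ERR_PTR(-EOPNOTSUPP);    /* or migrate to system RAM */

        /* ... build an sg table of dma addresses for the BAR resource ... */
        return NULL;    /* placeholder */
}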
