Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wednesday, 25.04.2018, 13:44 -0400, Alex Deucher wrote: > On Wed, Apr 25, 2018 at 2:41 AM, Christoph Hellwig > wrote: > > On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote: > > > > It has a non-coherent transaction mode (which the chipset can opt to > > > > not implement and still flush), to make sure the AGP horror show > > > > doesn't happen again and GPU folks are happy with PCIe. That's at > > > > least my understanding from digging around in amd the last time we had > > > > coherency issues between intel and amd gpus. GPUs have some bits > > > > somewhere (in the pagetables, or in the buffer object description > > > > table created by userspace) to control that stuff. > > > > > > Right. We have a bit in the GPU page table entries that determines > > > whether we snoop the CPU's cache or not. > > > > I can see how that works with the GPU on the same SOC or SOC set as the > > CPU. But how is that going to work for a GPU that is a plain old PCIe > > card? The cache snooping in that case is happening in the PCIe root > > complex. > > I'm not a pci expert, but as far as I know, the device sends either a > snooped or non-snooped transaction on the bus. I think the > transaction descriptor supports a no snoop attribute. Our GPUs have > supported this feature for probably 20 years if not more, going back > to PCI. Using non-snooped transactions have lower latency and faster > throughput compared to snooped transactions. PCI-X (as in the thing which was all the rage before PCIe) added an attribute phase to each transaction where the requestor can enable relaxed ordering and/or no-snoop on a per-transaction basis. As those are strictly performance optimizations the root-complex isn't required to honor those attributes, but implementations that care about performance usually will. Regards, Lucas
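For illustration, the PCIe descendants of those PCI-X attributes are the "Enable No Snoop" and "Enable Relaxed Ordering" bits in the Device Control register, which a Linux driver can set roughly as sketched below (hypothetical function name; whether the endpoint then actually emits no-snoop TLPs is device-specific, and as noted above the root complex remains free to snoop anyway):

    #include <linux/pci.h>

    /* Sketch only: permit the endpoint to use the no-snoop and relaxed
     * ordering attributes in its requests. These bits are a permission,
     * not a guarantee -- the device may ignore them and the root complex
     * may still enforce snooping. */
    static int enable_nosnoop_and_ro(struct pci_dev *pdev)
    {
            return pcie_capability_set_word(pdev, PCI_EXP_DEVCTL,
                                            PCI_EXP_DEVCTL_NOSNOOP_EN |
                                            PCI_EXP_DEVCTL_RELAX_EN);
    }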
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 10:44 AM, Alex Deucher wrote: > On Wed, Apr 25, 2018 at 2:41 AM, Christoph Hellwig wrote: >> On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote: >>> > It has a non-coherent transaction mode (which the chipset can opt to >>> > not implement and still flush), to make sure the AGP horror show >>> > doesn't happen again and GPU folks are happy with PCIe. That's at >>> > least my understanding from digging around in amd the last time we had >>> > coherency issues between intel and amd gpus. GPUs have some bits >>> > somewhere (in the pagetables, or in the buffer object description >>> > table created by userspace) to control that stuff. >>> >>> Right. We have a bit in the GPU page table entries that determines >>> whether we snoop the CPU's cache or not. >> >> I can see how that works with the GPU on the same SOC or SOC set as the >> CPU. But how is that going to work for a GPU that is a plain old PCIe >> card? The cache snooping in that case is happening in the PCIe root >> complex. > > I'm not a pci expert, but as far as I know, the device sends either a > snooped or non-snooped transaction on the bus. I think the > transaction descriptor supports a no snoop attribute. Our GPUs have > supported this feature for probably 20 years if not more, going back > to PCI. Using non-snooped transactions have lower latency and faster > throughput compared to snooped transactions. Right, 'no snoop' and 'relaxed ordering' have been part of the PCI spec since forever. With a plain old PCI-E card the root complex indeed arranges for caches to be snooped. Although it's not always faster depending on the platform. 'No snoop' traffic may be relegated to less bus resources relative to the much more common snooped traffic.
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 2:41 AM, Christoph Hellwig wrote: > On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote: >> > It has a non-coherent transaction mode (which the chipset can opt to >> > not implement and still flush), to make sure the AGP horror show >> > doesn't happen again and GPU folks are happy with PCIe. That's at >> > least my understanding from digging around in amd the last time we had >> > coherency issues between intel and amd gpus. GPUs have some bits >> > somewhere (in the pagetables, or in the buffer object description >> > table created by userspace) to control that stuff. >> >> Right. We have a bit in the GPU page table entries that determines >> whether we snoop the CPU's cache or not. > > I can see how that works with the GPU on the same SOC or SOC set as the > CPU. But how is that going to work for a GPU that is a plain old PCIe > card? The cache snooping in that case is happening in the PCIe root > complex. I'm not a pci expert, but as far as I know, the device sends either a snooped or non-snooped transaction on the bus. I think the transaction descriptor supports a no snoop attribute. Our GPUs have supported this feature for probably 20 years if not more, going back to PCI. Using non-snooped transactions have lower latency and faster throughput compared to snooped transactions. Alex
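Illustratively, the per-page choice described above ends up as a flag in the PTEs the driver writes into the GPU's page tables; a sketch with made-up flag names (the real amdgpu encoding differs in detail):

    /* Sketch: hypothetical GPU PTE layout with a per-page snoop bit. */
    #define GPU_PTE_VALID   (1ULL << 0)
    #define GPU_PTE_SNOOPED (1ULL << 2)   /* snoop CPU caches for this page */

    static u64 gpu_build_pte(u64 dma_addr, bool cpu_coherent)
    {
            u64 pte = (dma_addr & ~0xfffULL) | GPU_PTE_VALID;

            /* Pages the CPU also touches get the snoop bit; VRAM or
             * streaming buffers can skip it for lower latency. */
            if (cpu_coherent)
                    pte |= GPU_PTE_SNOOPED;

            return pte;
    }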
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 09:56:43AM +0200, Thierry Reding wrote: > And to add to the confusion, none of this seems to be an issue on 64-bit > ARM where the generic DMA/IOMMU code from drivers/iommu/dma-iommu.c is > used. In the long term I want everyone to use that code. Help welcome!
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 09:30:39AM +0200, Daniel Vetter wrote: > On Wed, Apr 25, 2018 at 12:09:05AM -0700, Christoph Hellwig wrote: > > On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote: > > > Can we please not nack everything right away? Doesn't really motivate > > > me to show you all the various things we're doing in gpu to make the > > > dma layer work for us. That kind of noodling around in lower levels to > > > get them to do what we want is absolutely par-for-course for gpu > > > drivers. If you just nack everything I point you at for illustrative > > > purposes, then I can't show you stuff anymore. > > > > No, it's not. No driver (and that includes the magic GPUs) has > > any business messing with dma ops directly. > > > > A GPU driver imght have a very valid reason to disable the IOMMU, > > but the code to do so needs to be at least in the arch code, maybe > > in the dma-mapping/iommu code, not in the driver. > > > > As a first step to get the discussion started we'll simply need > > to move the code Thierry wrote into a helper in arch/arm and that > > alone would be a massive improvement. I'm not even talking about > > minor details like actually using arm_get_dma_map_ops instead > > of duplicating it. > > > > And doing this basic trivial work really helps to get this whole > > mess under control. > > Ah ok. It did sound a bit like a much more cathegorical NAK than an "ack > in principle, but we need to shuffle the implementation into the right > place first". In the past we generally got a principled NAK on anything > funny we've been doing with the dma api, and the dma api maintainer > steaming off telling us we're incompetent idiots. I guess I've been > branded a bit on this topic :-/ > > Really great that this is changing now. > > On the patch itself: It might not be the right thing in all cases, since > for certain compression formats the nv gpu wants larger pages (easy to > allocate from vram, not so easy from main memory), so might need the iommu > still. But currently that's not implemented: > > https://www.spinics.net/lists/dri-devel/msg173932.html To clarify: we do want to use the IOMMU, but we want to use it explicitly via the IOMMU API rather than hiding it behind the DMA API. We do the same thing in Tegra DRM where we don't want to use the DMA API because it doesn't allow us to share the same mapping between multiple display controllers in the same way the IOMMU API does. We've also been thinking about using the IOMMU API directly in order to support process isolation for devices that accept command streams from userspace. Fortunately the issue I'm seeing with Nouveau doesn't happen with Tegra DRM, which seems to be because we have an IOMMU group with multiple devices and that prevents the DMA API from "hijacking" the IOMMU domain for the group. And to add to the confusion, none of this seems to be an issue on 64-bit ARM where the generic DMA/IOMMU code from drivers/iommu/dma-iommu.c is used. Thierry
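As a rough sketch of what "using the IOMMU API explicitly" means here (error handling trimmed, names invented; the real Tegra DRM code is more involved): one domain is allocated and shared by several devices, and a buffer is mapped once at a driver-chosen IOVA, which the per-device DMA API model cannot express:

    #include <linux/iommu.h>
    #include <linux/platform_device.h>

    static int share_domain(struct device *dev_a, struct device *dev_b,
                            unsigned long iova, phys_addr_t phys, size_t size)
    {
            /* One explicitly managed domain for both display controllers. */
            struct iommu_domain *domain = iommu_domain_alloc(&platform_bus_type);

            if (!domain)
                    return -ENOMEM;

            iommu_attach_device(domain, dev_a);
            iommu_attach_device(domain, dev_b);

            /* Map the framebuffer once; both devices see the same IOVA. */
            return iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE);
    }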
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 12:09:05AM -0700, Christoph Hellwig wrote: > On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote: > > Can we please not nack everything right away? Doesn't really motivate > > me to show you all the various things we're doing in gpu to make the > > dma layer work for us. That kind of noodling around in lower levels to > > get them to do what we want is absolutely par-for-course for gpu > > drivers. If you just nack everything I point you at for illustrative > > purposes, then I can't show you stuff anymore. > > No, it's not. No driver (and that includes the magic GPUs) has > any business messing with dma ops directly. > > A GPU driver imght have a very valid reason to disable the IOMMU, > but the code to do so needs to be at least in the arch code, maybe > in the dma-mapping/iommu code, not in the driver. > > As a first step to get the discussion started we'll simply need > to move the code Thierry wrote into a helper in arch/arm and that > alone would be a massive improvement. I'm not even talking about > minor details like actually using arm_get_dma_map_ops instead > of duplicating it. > > And doing this basic trivial work really helps to get this whole > mess under control. That's a good idea and I can prepare patches for this, but I'd like to make those changes on top of the fix to keep the option of getting this into v4.16 if at all possible. Thierry
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Tue, Apr 24, 2018 at 11:43:35PM -0700, Christoph Hellwig wrote: > On Wed, Apr 25, 2018 at 08:23:15AM +0200, Daniel Vetter wrote: > > For more fun: > > > > https://www.spinics.net/lists/dri-devel/msg173630.html > > > > Yeah, sometimes we want to disable the iommu because the on-gpu > > pagetables are faster ... > > I am not on this list, but remote NAK from here. This needs an > API from the iommu/dma-mapping code. Drivers have no business poking > into these details. The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL, which is rather misleading if they are not meant to be used by drivers directly. > Thierry, please resend this with at least the iommu list and > linux-arm-kernel in Cc to have a proper discussion on the right API. I'm certainly open to help with finding a correct solution, but the patch above was purposefully terse because this is something that I hope we can get backported to v4.16 to unbreak Nouveau. Coordinating such a backport between ARM and DRM trees does not sound like something that would help getting this fixed in v4.16. The fundamental issue here is that the DMA/IOMMU integration is something that has caused a number of surprising regressions in the past because it tends to sneak in unexpectedly. For example the current regression shows up only if CONFIG_ARM_DMA_USE_IOMMU=y because the DMA API will then transparently create a second mapping and mess things up. Everything works fine if that option is disabled. This is ultimately why we didn't notice, since we don't enable that option by default. I do have a patch that I plan to apply to the Tegra tree that will always enable CONFIG_ARM_DMA_USE_IOMMU=y on Tegra to avoid any such surprises in the future, but I can obviously only apply that once the above patch is applied to Nouveau, otherwise we'll break Nouveau unconditionally. Granted, this issue could've been caught with a little more testing, but in retrospect I think it would've been a lot better if ARM_DMA_USE_IOMMU was just enabled unconditionally if it has side-effects that platforms don't opt in to but have to explicitly opt out of. Thierry
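For context, the Nouveau fix under discussion is roughly of this shape (a sketch, not the actual patch): detect the mapping that CONFIG_ARM_DMA_USE_IOMMU transparently attached to the device and tear it down so the driver can manage the IOMMU through the IOMMU API itself:

    #include <linux/device.h>
    #include <asm/dma-iommu.h>

    static void detach_implicit_arm_iommu_mapping(struct device *dev)
    {
    #if IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)
            struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);

            /* The ARM DMA code attached its own IOMMU mapping behind the
             * driver's back; detach and release it before taking over. */
            if (mapping) {
                    arm_iommu_detach_device(dev);
                    arm_iommu_release_mapping(mapping);
            }
    #endif
    }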
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 12:09:05AM -0700, Christoph Hellwig wrote: > On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote: > > Can we please not nack everything right away? Doesn't really motivate > > me to show you all the various things we're doing in gpu to make the > > dma layer work for us. That kind of noodling around in lower levels to > > get them to do what we want is absolutely par-for-course for gpu > > drivers. If you just nack everything I point you at for illustrative > > purposes, then I can't show you stuff anymore. > > No, it's not. No driver (and that includes the magic GPUs) has > any business messing with dma ops directly. > > A GPU driver imght have a very valid reason to disable the IOMMU, > but the code to do so needs to be at least in the arch code, maybe > in the dma-mapping/iommu code, not in the driver. > > As a first step to get the discussion started we'll simply need > to move the code Thierry wrote into a helper in arch/arm and that > alone would be a massive improvement. I'm not even talking about > minor details like actually using arm_get_dma_map_ops instead > of duplicating it. > > And doing this basic trivial work really helps to get this whole > mess under control. Ah ok. It did sound a bit like a much more categorical NAK than an "ack in principle, but we need to shuffle the implementation into the right place first". In the past we generally got a principled NAK on anything funny we've been doing with the dma api, and the dma api maintainer steaming off telling us we're incompetent idiots. I guess I've been branded a bit on this topic :-/ Really great that this is changing now. On the patch itself: It might not be the right thing in all cases, since for certain compression formats the nv gpu wants larger pages (easy to allocate from vram, not so easy from main memory), so might need the iommu still. But currently that's not implemented: https://www.spinics.net/lists/dri-devel/msg173932.html Cheers, Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 09:02:17AM +0200, Daniel Vetter wrote: > Can we please not nack everything right away? Doesn't really motivate > me to show you all the various things we're doing in gpu to make the > dma layer work for us. That kind of noodling around in lower levels to > get them to do what we want is absolutely par-for-course for gpu > drivers. If you just nack everything I point you at for illustrative > purposes, then I can't show you stuff anymore. No, it's not. No driver (and that includes the magic GPUs) has any business messing with dma ops directly. A GPU driver might have a very valid reason to disable the IOMMU, but the code to do so needs to be at least in the arch code, maybe in the dma-mapping/iommu code, not in the driver. As a first step to get the discussion started we'll simply need to move the code Thierry wrote into a helper in arch/arm and that alone would be a massive improvement. I'm not even talking about minor details like actually using arm_get_dma_map_ops instead of duplicating it. And doing this basic trivial work really helps to get this whole mess under control.
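A sketch of what such an arch/arm helper could look like (hypothetical name and details, placed next to the existing code in arch/arm/mm/dma-mapping.c so it can reuse arm_get_dma_map_ops instead of duplicating it):

    /* arch/arm/mm/dma-mapping.c -- sketch only */
    void arm_iommu_bypass_and_use_direct_ops(struct device *dev)
    {
            struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);

            if (mapping) {
                    arm_iommu_detach_device(dev);
                    arm_iommu_release_mapping(mapping);
            }

            /* Fall back to the non-IOMMU streaming DMA ops for this device. */
            set_dma_ops(dev, arm_get_dma_map_ops(dev->archdata.dma_coherent));
    }
    EXPORT_SYMBOL_GPL(arm_iommu_bypass_and_use_direct_ops);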
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 8:43 AM, Christoph Hellwig wrote: > On Wed, Apr 25, 2018 at 08:23:15AM +0200, Daniel Vetter wrote: >> For more fun: >> >> https://www.spinics.net/lists/dri-devel/msg173630.html >> >> Yeah, sometimes we want to disable the iommu because the on-gpu >> pagetables are faster ... > > I am not on this list, but remote NAK from here. This needs an > API from the iommu/dma-mapping code. Drivers have no business poking > into these details. Can we please not nack everything right away? Doesn't really motivate me to show you all the various things we're doing in gpu to make the dma layer work for us. That kind of noodling around in lower levels to get them to do what we want is absolutely par-for-course for gpu drivers. If you just nack everything I point you at for illustrative purposes, then I can't show you stuff anymore. Just to make it clear: I do want to get this stuff sorted, and it's awesome that someone from core finally takes a serious look at what gpu folks have been doing for decades (instead of just telling us we're incompetent and doing it all wrong and then steaming off), and how to make this work without layering violations to no end. But stopping the world until this is fixed isn't really a good option. Thanks, Daniel > Thierry, please resend this with at least the iommu list and > linux-arm-kernel in Cc to have a proper discussion on the right API. -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 08:23:15AM +0200, Daniel Vetter wrote: > For more fun: > > https://www.spinics.net/lists/dri-devel/msg173630.html > > Yeah, sometimes we want to disable the iommu because the on-gpu > pagetables are faster ... I am not on this list, but remote NAK from here. This needs an API from the iommu/dma-mapping code. Drivers have no business poking into these details. Thierry, please resend this with at least the iommu list and linux-arm-kernel in Cc to have a proper discussion on the right API.
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 02:24:36AM -0400, Alex Deucher wrote: > > It has a non-coherent transaction mode (which the chipset can opt to > > not implement and still flush), to make sure the AGP horror show > > doesn't happen again and GPU folks are happy with PCIe. That's at > > least my understanding from digging around in amd the last time we had > > coherency issues between intel and amd gpus. GPUs have some bits > > somewhere (in the pagetables, or in the buffer object description > > table created by userspace) to control that stuff. > > Right. We have a bit in the GPU page table entries that determines > whether we snoop the CPU's cache or not. I can see how that works with the GPU on the same SOC or SOC set as the CPU. But how is that going to work for a GPU that is a plain old PCIe card? The cache snooping in that case is happening in the PCIe root complex.
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 2:13 AM, Daniel Vetter wrote: > On Wed, Apr 25, 2018 at 7:48 AM, Christoph Hellwig wrote: >> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote: >>> Out of curiosity, how much virtual flushing stuff is there still out >>> there? At least in drm we've pretty much ignore this, and seem to be >>> getting away without a huge uproar (at least from driver developers >>> and users, core folks are less amused about that). >> >> As I've just been wading through the code, the following architectures >> have non-coherent dma that flushes by virtual address for at least some >> platforms: >> >> - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips, >>powerpc >> >> These have non-coherent dma ops that flush by physical address: >> >> - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc >> >> And these do not have non-coherent dma ops at all: >> >> - alpha, h8300, riscv, unicore32, x86 >> >> [1] arm ѕeems to do both virtually and physically based ops, further >> audit needed. >> >> Note that using virtual addresses in the cache flushing interface >> doesn't mean that the cache actually is virtually indexed, but it at >> least allows for the possibility. >> >>> > I think the most important thing about such a buffer object is that >>> > it can distinguish the underlying mapping types. While >>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT, >>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give >>> > back a dma_addr_t they are in now way interchangable. And trying to >>> > stuff them all into a structure like struct scatterlist that has >>> > no indication what kind of mapping you are dealing with is just >>> > asking for trouble. >>> >>> Well the idea was to have 1 interface to allow all drivers to share >>> buffers with anything else, no matter how exactly they're allocated. >> >> Isn't that interface supposed to be dmabuf? Currently dma_map leaks >> a scatterlist through the sg_table in dma_buf_map_attachment / >> ->map_dma_buf, but looking at a few of the callers it seems like they >> really do not even want a scatterlist to start with, but check that >> is contains a physically contiguous range first. So kicking the >> scatterlist our there will probably improve the interface in general. > > I think by number most drm drivers require contiguous memory (or an > iommu that makes it look contiguous). But there's plenty others who > have another set of pagetables on the gpu itself and can > scatter-gather. Usually it's the former for display/video blocks, and > the latter for rendering. > >>> dma-buf has all the functions for flushing, so you can have coherent >>> mappings, non-coherent mappings and pretty much anything else. Or well >>> could, because in practice people hack up layering violations until it >>> works for the 2-3 drivers they care about. On top of that there's the >>> small issue that x86 insists that dma is coherent (and that's true for >>> most devices, including v4l drivers you might want to share stuff >>> with), and gpus really, really really do want to make almost >>> everything incoherent. >> >> How do discrete GPUs manage to be incoherent when attached over PCIe? > > It has a non-coherent transaction mode (which the chipset can opt to > not implement and still flush), to make sure the AGP horror show > doesn't happen again and GPU folks are happy with PCIe. That's at > least my understanding from digging around in amd the last time we had > coherency issues between intel and amd gpus. 
GPUs have some bits > somewhere (in the pagetables, or in the buffer object description > table created by userspace) to control that stuff. Right. We have a bit in the GPU page table entries that determines whether we snoop the CPU's cache or not. Alex > > For anything on the SoC it's presented as pci device, but that's > extremely fake, and we can definitely do non-snooped transactions on > drm/i915. Again, controlled by a mix of pagetables and > userspace-provided buffer object description tables. > -Daniel > -- > Daniel Vetter > Software Engineer, Intel Corporation > +41 (0) 79 365 57 48 - http://blog.ffwll.ch > ___ > amd-gfx mailing list > amd-...@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 8:13 AM, Daniel Vetter wrote: > On Wed, Apr 25, 2018 at 7:48 AM, Christoph Hellwig wrote: >> On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote: >>> Out of curiosity, how much virtual flushing stuff is there still out >>> there? At least in drm we've pretty much ignore this, and seem to be >>> getting away without a huge uproar (at least from driver developers >>> and users, core folks are less amused about that). >> >> As I've just been wading through the code, the following architectures >> have non-coherent dma that flushes by virtual address for at least some >> platforms: >> >> - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips, >>powerpc >> >> These have non-coherent dma ops that flush by physical address: >> >> - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc >> >> And these do not have non-coherent dma ops at all: >> >> - alpha, h8300, riscv, unicore32, x86 >> >> [1] arm ѕeems to do both virtually and physically based ops, further >> audit needed. >> >> Note that using virtual addresses in the cache flushing interface >> doesn't mean that the cache actually is virtually indexed, but it at >> least allows for the possibility. >> >>> > I think the most important thing about such a buffer object is that >>> > it can distinguish the underlying mapping types. While >>> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT, >>> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give >>> > back a dma_addr_t they are in now way interchangable. And trying to >>> > stuff them all into a structure like struct scatterlist that has >>> > no indication what kind of mapping you are dealing with is just >>> > asking for trouble. >>> >>> Well the idea was to have 1 interface to allow all drivers to share >>> buffers with anything else, no matter how exactly they're allocated. >> >> Isn't that interface supposed to be dmabuf? Currently dma_map leaks >> a scatterlist through the sg_table in dma_buf_map_attachment / >> ->map_dma_buf, but looking at a few of the callers it seems like they >> really do not even want a scatterlist to start with, but check that >> is contains a physically contiguous range first. So kicking the >> scatterlist our there will probably improve the interface in general. > > I think by number most drm drivers require contiguous memory (or an > iommu that makes it look contiguous). But there's plenty others who > have another set of pagetables on the gpu itself and can > scatter-gather. Usually it's the former for display/video blocks, and > the latter for rendering. For more fun: https://www.spinics.net/lists/dri-devel/msg173630.html Yeah, sometimes we want to disable the iommu because the on-gpu pagetables are faster ... -Daniel >>> dma-buf has all the functions for flushing, so you can have coherent >>> mappings, non-coherent mappings and pretty much anything else. Or well >>> could, because in practice people hack up layering violations until it >>> works for the 2-3 drivers they care about. On top of that there's the >>> small issue that x86 insists that dma is coherent (and that's true for >>> most devices, including v4l drivers you might want to share stuff >>> with), and gpus really, really really do want to make almost >>> everything incoherent. >> >> How do discrete GPUs manage to be incoherent when attached over PCIe? 
> > It has a non-coherent transaction mode (which the chipset can opt to > not implement and still flush), to make sure the AGP horror show > doesn't happen again and GPU folks are happy with PCIe. That's at > least my understanding from digging around in amd the last time we had > coherency issues between intel and amd gpus. GPUs have some bits > somewhere (in the pagetables, or in the buffer object description > table created by userspace) to control that stuff. > > For anything on the SoC it's presented as pci device, but that's > extremely fake, and we can definitely do non-snooped transactions on > drm/i915. Again, controlled by a mix of pagetables and > userspace-provided buffer object description tables. > -Daniel > -- > Daniel Vetter > Software Engineer, Intel Corporation > +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 7:48 AM, Christoph Hellwig wrote: > On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote: >> Out of curiosity, how much virtual flushing stuff is there still out >> there? At least in drm we've pretty much ignore this, and seem to be >> getting away without a huge uproar (at least from driver developers >> and users, core folks are less amused about that). > > As I've just been wading through the code, the following architectures > have non-coherent dma that flushes by virtual address for at least some > platforms: > > - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips, >powerpc > > These have non-coherent dma ops that flush by physical address: > > - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc > > And these do not have non-coherent dma ops at all: > > - alpha, h8300, riscv, unicore32, x86 > > [1] arm ѕeems to do both virtually and physically based ops, further > audit needed. > > Note that using virtual addresses in the cache flushing interface > doesn't mean that the cache actually is virtually indexed, but it at > least allows for the possibility. > >> > I think the most important thing about such a buffer object is that >> > it can distinguish the underlying mapping types. While >> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT, >> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give >> > back a dma_addr_t they are in now way interchangable. And trying to >> > stuff them all into a structure like struct scatterlist that has >> > no indication what kind of mapping you are dealing with is just >> > asking for trouble. >> >> Well the idea was to have 1 interface to allow all drivers to share >> buffers with anything else, no matter how exactly they're allocated. > > Isn't that interface supposed to be dmabuf? Currently dma_map leaks > a scatterlist through the sg_table in dma_buf_map_attachment / > ->map_dma_buf, but looking at a few of the callers it seems like they > really do not even want a scatterlist to start with, but check that > is contains a physically contiguous range first. So kicking the > scatterlist our there will probably improve the interface in general. I think by number most drm drivers require contiguous memory (or an iommu that makes it look contiguous). But there's plenty others who have another set of pagetables on the gpu itself and can scatter-gather. Usually it's the former for display/video blocks, and the latter for rendering. >> dma-buf has all the functions for flushing, so you can have coherent >> mappings, non-coherent mappings and pretty much anything else. Or well >> could, because in practice people hack up layering violations until it >> works for the 2-3 drivers they care about. On top of that there's the >> small issue that x86 insists that dma is coherent (and that's true for >> most devices, including v4l drivers you might want to share stuff >> with), and gpus really, really really do want to make almost >> everything incoherent. > > How do discrete GPUs manage to be incoherent when attached over PCIe? It has a non-coherent transaction mode (which the chipset can opt to not implement and still flush), to make sure the AGP horror show doesn't happen again and GPU folks are happy with PCIe. That's at least my understanding from digging around in amd the last time we had coherency issues between intel and amd gpus. GPUs have some bits somewhere (in the pagetables, or in the buffer object description table created by userspace) to control that stuff. 
For anything on the SoC it's presented as pci device, but that's extremely fake, and we can definitely do non-snooped transactions on drm/i915. Again, controlled by a mix of pagetables and userspace-provided buffer object description tables. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
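On i915 the userspace-visible knob for this is, as far as I know, the per-BO caching ioctl; a rough sketch of flipping a buffer object to uncached/non-snooped (error handling omitted):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Sketch: ask the kernel to treat this BO as uncached/non-snooped. */
    static int bo_make_incoherent(int drm_fd, uint32_t handle)
    {
            struct drm_i915_gem_caching arg = {
                    .handle  = handle,
                    .caching = I915_CACHING_NONE,
            };

            return ioctl(drm_fd, DRM_IOCTL_I915_GEM_SET_CACHING, &arg);
    }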
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Wed, Apr 25, 2018 at 1:48 AM, Christoph Hellwig wrote: > On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote: >> Out of curiosity, how much virtual flushing stuff is there still out >> there? At least in drm we've pretty much ignore this, and seem to be >> getting away without a huge uproar (at least from driver developers >> and users, core folks are less amused about that). > > As I've just been wading through the code, the following architectures > have non-coherent dma that flushes by virtual address for at least some > platforms: > > - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips, >powerpc > > These have non-coherent dma ops that flush by physical address: > > - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc > > And these do not have non-coherent dma ops at all: > > - alpha, h8300, riscv, unicore32, x86 > > [1] arm ѕeems to do both virtually and physically based ops, further > audit needed. > > Note that using virtual addresses in the cache flushing interface > doesn't mean that the cache actually is virtually indexed, but it at > least allows for the possibility. > >> > I think the most important thing about such a buffer object is that >> > it can distinguish the underlying mapping types. While >> > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT, >> > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give >> > back a dma_addr_t they are in now way interchangable. And trying to >> > stuff them all into a structure like struct scatterlist that has >> > no indication what kind of mapping you are dealing with is just >> > asking for trouble. >> >> Well the idea was to have 1 interface to allow all drivers to share >> buffers with anything else, no matter how exactly they're allocated. > > Isn't that interface supposed to be dmabuf? Currently dma_map leaks > a scatterlist through the sg_table in dma_buf_map_attachment / > ->map_dma_buf, but looking at a few of the callers it seems like they > really do not even want a scatterlist to start with, but check that > is contains a physically contiguous range first. So kicking the > scatterlist our there will probably improve the interface in general. > >> dma-buf has all the functions for flushing, so you can have coherent >> mappings, non-coherent mappings and pretty much anything else. Or well >> could, because in practice people hack up layering violations until it >> works for the 2-3 drivers they care about. On top of that there's the >> small issue that x86 insists that dma is coherent (and that's true for >> most devices, including v4l drivers you might want to share stuff >> with), and gpus really, really really do want to make almost >> everything incoherent. > > How do discrete GPUs manage to be incoherent when attached over PCIe? They can do CPU cache snooped (coherent) or non-snooped (incoherent) DMA. Also for things like APUs, they show up as a PCIe device, but that actual GPU core is part of the same die as the CPU and they have their own special paths to memory, etc. The fact that they show up as PCIe devices is mostly for enumeration purposes. Alex > ___ > Linaro-mm-sig mailing list > linaro-mm-...@lists.linaro.org > https://lists.linaro.org/mailman/listinfo/linaro-mm-sig
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Tue, Apr 24, 2018 at 09:32:20PM +0200, Daniel Vetter wrote: > Out of curiosity, how much virtual flushing stuff is there still out > there? At least in drm we've pretty much ignore this, and seem to be > getting away without a huge uproar (at least from driver developers > and users, core folks are less amused about that). As I've just been wading through the code, the following architectures have non-coherent dma that flushes by virtual address for at least some platforms: - arm [1], arm64, hexagon, nds32, nios2, parisc, sh, xtensa, mips, powerpc These have non-coherent dma ops that flush by physical address: - arc, arm [1], c6x, m68k, microblaze, openrisc, sparc And these do not have non-coherent dma ops at all: - alpha, h8300, riscv, unicore32, x86 [1] arm seems to do both virtually and physically based ops, further audit needed. Note that using virtual addresses in the cache flushing interface doesn't mean that the cache actually is virtually indexed, but it at least allows for the possibility. > > I think the most important thing about such a buffer object is that > > it can distinguish the underlying mapping types. While > > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT, > > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give > > back a dma_addr_t they are in now way interchangable. And trying to > > stuff them all into a structure like struct scatterlist that has > > no indication what kind of mapping you are dealing with is just > > asking for trouble. > > Well the idea was to have 1 interface to allow all drivers to share > buffers with anything else, no matter how exactly they're allocated. Isn't that interface supposed to be dmabuf? Currently dma_map leaks a scatterlist through the sg_table in dma_buf_map_attachment / ->map_dma_buf, but looking at a few of the callers it seems like they really do not even want a scatterlist to start with, but check that it contains a physically contiguous range first. So kicking the scatterlist out there will probably improve the interface in general. > dma-buf has all the functions for flushing, so you can have coherent > mappings, non-coherent mappings and pretty much anything else. Or well > could, because in practice people hack up layering violations until it > works for the 2-3 drivers they care about. On top of that there's the > small issue that x86 insists that dma is coherent (and that's true for > most devices, including v4l drivers you might want to share stuff > with), and gpus really, really really do want to make almost > everything incoherent. How do discrete GPUs manage to be incoherent when attached over PCIe?
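The "check that it contains a physically contiguous range" pattern on the importer side typically looks something like this (sketch; in real drivers the attachment is kept in the buffer object rather than left dangling):

    #include <linux/dma-buf.h>
    #include <linux/scatterlist.h>

    /* Import a dma-buf, but only accept it if the mapping ends up as a
     * single contiguous DMA range, since this device cannot scatter-gather. */
    static int import_contiguous(struct device *dev, struct dma_buf *buf,
                                 dma_addr_t *addr)
    {
            struct dma_buf_attachment *attach;
            struct sg_table *sgt;

            attach = dma_buf_attach(buf, dev);
            if (IS_ERR(attach))
                    return PTR_ERR(attach);

            sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
            if (IS_ERR(sgt)) {
                    dma_buf_detach(buf, attach);
                    return PTR_ERR(sgt);
            }

            if (sgt->nents != 1) {
                    /* Not contiguous after mapping: reject the import. */
                    dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
                    dma_buf_detach(buf, attach);
                    return -EINVAL;
            }

            *addr = sg_dma_address(sgt->sgl);
            return 0;
    }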
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Tue, Apr 24, 2018 at 8:48 PM, Christoph Hellwig wrote: > On Fri, Apr 20, 2018 at 05:21:11PM +0200, Daniel Vetter wrote: >> > At the very lowest level they will need to be handled differently for >> > many architectures, the questions is at what point we'll do the >> > branching out. >> >> Having at least struct page also in that list with (dma_addr_t, lenght) >> pairs has a bunch of benefits for drivers in unifying buffer handling >> code. You just pass that one single list around, use the dma_addr_t side >> for gpu access (generally bashing it into gpu ptes). And the struct page >> (if present) for cpu access, using kmap or vm_insert_*. We generally >> ignore virt, if we do need a full mapping then we construct a vmap for >> that buffer of our own. > > Well, for mapping a resource (which gets back to the start of the > discussion) you will need an explicit virt pointer. You also need > an explicit virt pointer and not just page_address/kmap for users of > dma_get_sgtable, because for many architectures you will need to flush > the virtual address used to access the data, which might be a > vmap/ioremap style mapping retourned from dma_alloc_address, and not > the directly mapped kernel address. Out of curiosity, how much virtual flushing stuff is there still out there? At least in drm we've pretty much ignore this, and seem to be getting away without a huge uproar (at least from driver developers and users, core folks are less amused about that). And at least for gpus that seems to have been the case since forever, or at least since AGP was a thing 20 years ago: AGP isn't coherent, so needs explicit cache flushing, and we have our own implementations of that in drivers/char/agp. Luckily AGP died 10 years ago, so no one yet proposed to port it all over to the iommu framework and hide it behind the dma api (which really would be the "clean" way to do this, AGP is simply an IOMMU + special socket dedicated for the add-on gpu). > Here is another idea at the low-level dma API level: > > - dma_get_sgtable goes away. The replacement is a new >dma_alloc_remap helper that takes the virtual address returned >from dma_alloc_attrs/coherent and creates a dma_addr_t for the >given new device. If the original allocation was a coherent >one no cache flushing is required either (because the arch >made sure it is coherent), if the original allocation used >DMA_ATTR_NON_CONSISTENT the new allocation will need >dma_cache_sync calls as well. Yeah I think that should work. dma_get_sgtable is a pretty nasty layering violation. > - you never even try to share a mapping retourned from >dma_map_resource - instead each device using it creates a new >mapping, which makes sense as no virtual addresses are involved >at all. Yeah the dma-buf exporter always knows what kind of backing storage it is dealing with, and for which struct device it should set up a new view. Hence can make sure that it calls the right functions to establish a new mapping, whether that's dma_map_sg, dma_map_resource or the new dma_alloc_remap (instead of the dma_get_sgtable layering mixup). The importer doesn't know. >> So maybe a list of (struct page *, dma_addr_t, num_pages) would suit best, >> with struct page * being optional (if it's a resource, or something else >> that the kernel core mm isn't aware of). 
But that only has benefits if we >> really roll it out everywhere, in all the subsystems and drivers, since if >> we don't we've made the struct pages ** <-> sgt conversion fun only worse >> by adding a 3 representation of gpu buffer object backing storage. > > I think the most important thing about such a buffer object is that > it can distinguish the underlying mapping types. While > dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT, > dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give > back a dma_addr_t they are in now way interchangable. And trying to > stuff them all into a structure like struct scatterlist that has > no indication what kind of mapping you are dealing with is just > asking for trouble. Well the idea was to have 1 interface to allow all drivers to share buffers with anything else, no matter how exactly they're allocated. dma-buf has all the functions for flushing, so you can have coherent mappings, non-coherent mappings and pretty much anything else. Or well could, because in practice people hack up layering violations until it works for the 2-3 drivers they care about. On top of that there's the small issue that x86 insists that dma is coherent (and that's true for most devices, including v4l drivers you might want to share stuff with), and gpus really, really really do want to make almost everything incoherent. The end result is pretty epic :-) -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
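So the branching ends up on the exporter side, which is the only place that knows whether the backing storage is system memory pages or an MMIO/VRAM resource. A simplified sketch (hypothetical buffer struct; the real ->map_dma_buf callback returns an sg_table, and real code would not assume contiguous pages):

    #include <linux/dma-buf.h>
    #include <linux/dma-mapping.h>

    struct my_buffer {               /* hypothetical exporter-private BO */
            struct page **pages;     /* NULL when backed by a BAR/VRAM */
            phys_addr_t phys;
            size_t size;
    };

    static dma_addr_t my_exporter_map_one(struct dma_buf_attachment *attach,
                                          struct my_buffer *bo,
                                          enum dma_data_direction dir)
    {
            struct device *dev = attach->dev;

            if (bo->pages)   /* system memory, assumed contiguous for brevity */
                    return dma_map_page(dev, bo->pages[0], 0, bo->size, dir);

            /* PCI BAR / VRAM: no struct page, no kernel virtual address. */
            return dma_map_resource(dev, bo->phys, bo->size, dir, 0);
    }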
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Fri, Apr 20, 2018 at 05:21:11PM +0200, Daniel Vetter wrote: > > At the very lowest level they will need to be handled differently for > > many architectures, the questions is at what point we'll do the > > branching out. > > Having at least struct page also in that list with (dma_addr_t, lenght) > pairs has a bunch of benefits for drivers in unifying buffer handling > code. You just pass that one single list around, use the dma_addr_t side > for gpu access (generally bashing it into gpu ptes). And the struct page > (if present) for cpu access, using kmap or vm_insert_*. We generally > ignore virt, if we do need a full mapping then we construct a vmap for > that buffer of our own. Well, for mapping a resource (which gets back to the start of the discussion) you will need an explicit virt pointer. You also need an explicit virt pointer and not just page_address/kmap for users of dma_get_sgtable, because for many architectures you will need to flush the virtual address used to access the data, which might be a vmap/ioremap style mapping returned from dma_alloc_address, and not the directly mapped kernel address. Here is another idea at the low-level dma API level: - dma_get_sgtable goes away. The replacement is a new dma_alloc_remap helper that takes the virtual address returned from dma_alloc_attrs/coherent and creates a dma_addr_t for the given new device. If the original allocation was a coherent one no cache flushing is required either (because the arch made sure it is coherent), if the original allocation used DMA_ATTR_NON_CONSISTENT the new allocation will need dma_cache_sync calls as well. - you never even try to share a mapping returned from dma_map_resource - instead each device using it creates a new mapping, which makes sense as no virtual addresses are involved at all. > So maybe a list of (struct page *, dma_addr_t, num_pages) would suit best, > with struct page * being optional (if it's a resource, or something else > that the kernel core mm isn't aware of). But that only has benefits if we > really roll it out everywhere, in all the subsystems and drivers, since if > we don't we've made the struct pages ** <-> sgt conversion fun only worse > by adding a 3 representation of gpu buffer object backing storage. I think the most important thing about such a buffer object is that it can distinguish the underlying mapping types. While dma_alloc_coherent, dma_alloc_attrs with DMA_ATTR_NON_CONSISTENT, dma_map_page/dma_map_single/dma_map_sg and dma_map_resource all give back a dma_addr_t they are in no way interchangeable. And trying to stuff them all into a structure like struct scatterlist that has no indication what kind of mapping you are dealing with is just asking for trouble.
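To make the proposal concrete: no such helper exists today, but the interface being described would presumably look something like the following sketch (names and exact semantics are just an illustration of the idea above):

    #include <linux/dma-mapping.h>

    /* Proposed, not existing: create a mapping of an existing allocation
     * for another device without going through a scatterlist. */
    dma_addr_t dma_alloc_remap(struct device *new_dev, void *cpu_addr,
                               size_t size, enum dma_data_direction dir,
                               unsigned long attrs);

    /* Usage sketch: one allocation made for dev_a, then remapped for dev_b. */
    static int share_alloc(struct device *dev_a, struct device *dev_b,
                           size_t size)
    {
            dma_addr_t dma_a, dma_b;
            void *cpu_addr;

            cpu_addr = dma_alloc_attrs(dev_a, size, &dma_a, GFP_KERNEL,
                                       DMA_ATTR_NON_CONSISTENT);
            if (!cpu_addr)
                    return -ENOMEM;

            dma_b = dma_alloc_remap(dev_b, cpu_addr, size,
                                    DMA_BIDIRECTIONAL, 0);

            /* Non-consistent allocation: CPU accesses still need to be
             * bracketed with dma_cache_sync(), now against either device. */
            return dma_mapping_error(dev_b, dma_b) ? -ENOMEM : 0;
    }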
Re: [Linaro-mm-sig] [PATCH 4/8] dma-buf: add peer2peer flag
On Fri, Apr 20, 2018 at 05:46:25AM -0700, Christoph Hellwig wrote: > On Fri, Apr 20, 2018 at 12:44:01PM +0200, Christian König wrote: > > > > What we need is an sg_alloc_table_from_resources(dev, resources, > > > > num_resources) which does the handling common to all drivers. > > > A structure that contains > > > > > > {page,offset,len} + {dma_addr+dma_len} > > > > > > is not a good container for storing > > > > > > {virt addr, dma_addr, len} > > > > > > no matter what interface you build arond it. > > > > Why not? I mean at least for my use case we actually don't need the virtual > > address. > > If you don't need the virtual address you need scatterlist even list. > > > What we need is {dma_addr+dma_len} in a consistent interface which can come > > from both {page,offset,len} as well as {resource, len}. > > Ok. > > > What I actually don't need is separate handling for system memory and > > resources, but that would we get exactly when we don't use sg_table. > > At the very lowest level they will need to be handled differently for > many architectures, the questions is at what point we'll do the > branching out. Having at least struct page also in that list with (dma_addr_t, length) pairs has a bunch of benefits for drivers in unifying buffer handling code. You just pass that one single list around, use the dma_addr_t side for gpu access (generally bashing it into gpu ptes). And the struct page (if present) for cpu access, using kmap or vm_insert_*. We generally ignore virt, if we do need a full mapping then we construct a vmap for that buffer of our own. If (and that would be serious amounts of work all over the tree, with lots of drivers) we come up with a new struct for gpu buffers, then I'd also add "force page alignment for everything" to the requirements list. That's another mismatch we have, since gpu buffer objects (and dma-buf) are always full pages. That mismatch motivated the addition of the page-oriented sg iterators. So maybe a list of (struct page *, dma_addr_t, num_pages) would suit best, with struct page * being optional (if it's a resource, or something else that the kernel core mm isn't aware of). But that only has benefits if we really roll it out everywhere, in all the subsystems and drivers, since if we don't we've made the struct pages ** <-> sgt conversion fun only worse by adding a third representation of gpu buffer object backing storage. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
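A minimal sketch of the container being suggested here, with names invented for illustration:

    /* One entry of the proposed backing-storage list: always page-aligned,
     * with struct page optional for resources the core mm doesn't manage. */
    struct gpu_buffer_chunk {
            struct page *page;       /* NULL for MMIO/VRAM resources */
            dma_addr_t dma_addr;     /* what gets written into GPU PTEs */
            unsigned int num_pages;  /* length, always in whole pages */
    };

    struct gpu_buffer {
            unsigned int nr_chunks;
            struct gpu_buffer_chunk chunks[];
    };

    /* Device access always uses dma_addr; CPU access is only possible for
     * entries with a struct page (kmap / vm_insert_page), otherwise the
     * driver builds its own vmap/ioremap when it really needs one. */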