On 14.01.19 at 18:32, Ard Biesheuvel wrote:
[SNIP]

- The existence of a linear region mapping with mismatched attributes
is likely not the culprit here. (We do something similar with
non-cache coherent DMA in other places).



This is still rather problematic.

The issue is that we often don't create a vmap for a page, but rather
access the page directly using the linear mapping.

So we would use the wrong access type here.




Yes. But how are these accesses done? Are they wrapped in a kmap()?

I just double-checked the source for this, and it should actually be handled
gracefully:

        if (num_pages == 1 && (mem->placement & TTM_PL_FLAG_CACHED)) {
                /*
                 * We're mapping a single page, and the desired
                 * page protection is consistent with the bo.
                 */

                map->bo_kmap_type = ttm_bo_map_kmap;
                map->page = ttm->pages[start_page];
                map->virtual = kmap(map->page);
        } else {
                /*
                 * We need to use vmap to get the desired page protection
                 * or to make the buffer object look contiguous.
                 */
                prot = ttm_io_prot(mem->placement, PAGE_KERNEL);
                map->bo_kmap_type = ttm_bo_map_vmap;
                map->virtual = vmap(ttm->pages + start_page, num_pages,
                                    0, prot);
        }

We check the placement flags and only use kmap() directly when the placement is
cached. Otherwise we fall back to using vmap(). So that should indeed work correctly.
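
Just to illustrate, a caller going through this interface would look roughly like
the following (a minimal sketch, not lifted from any driver; the helper name
example_write_dword is made up):

#include <drm/ttm/ttm_bo_api.h>

/*
 * Minimal sketch of a single-page CPU access through ttm_bo_kmap(). With a
 * cached placement this hits the kmap() path quoted above; otherwise it goes
 * through vmap() with the page protection matching the placement.
 */
static int example_write_dword(struct ttm_buffer_object *bo,
                               unsigned long page, u32 value)
{
        struct ttm_bo_kmap_obj map;
        bool is_iomem;
        int ret;

        ret = ttm_bo_kmap(bo, page, 1, &map);
        if (ret)
                return ret;

        *(u32 *)ttm_kmap_obj_virtual(&map, &is_iomem) = value;

        ttm_bo_kunmap(&map);
        return 0;
}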


- The reason remapping the CPU side as cacheable does work (which I
did test) is because the GPU's uncacheable accesses (which I assume
are made using the NoSnoop PCIe transaction attribute) are actually
emitted as cacheable in some cases.
   . On my AMD Seattle, with or without SMMU (which is stage 2 only), I
must use cacheable accesses from the CPU side or things are broken.
This might be a h/w flaw, though.
   . On systems with stage 1+2 SMMUs, the driver uses stage 1
translations which always override the memory attributes to cacheable
for DMA coherent devices. This is what is affecting the Cavium
ThunderX2 (although it appears the attributes emitted by the RC may be
incorrect as well.)

The latter issue is a shortcoming in the SMMU driver that we have to
fix, i.e., it should take care not to modify the incoming attributes
of DMA coherent PCIe devices, so that NoSnoop is able to work.
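
To make that attribute override concrete, here is a tiny self-contained model of
the behaviour described above (the function and names are made up for
illustration; this is not the SMMU driver code):

#include <stdbool.h>
#include <stdio.h>

enum mem_attr { ATTR_CACHEABLE, ATTR_NONCACHEABLE };

/*
 * Toy model: with a stage 1 SMMU translation, the memory attribute programmed
 * into the PTE for a DMA coherent device wins over the NoSnoop hint carried by
 * the transaction, so the access is emitted as cacheable no matter what the
 * device asked for.
 */
static enum mem_attr smmu_stage1_attr(bool dev_dma_coherent, bool nosnoop)
{
        if (dev_dma_coherent)
                return ATTR_CACHEABLE;          /* NoSnoop hint is overridden */
        return nosnoop ? ATTR_NONCACHEABLE : ATTR_CACHEABLE;
}

int main(void)
{
        /* GPU is marked DMA coherent, but issues a NoSnoop write: */
        printf("%s\n", smmu_stage1_attr(true, true) == ATTR_CACHEABLE ?
               "cacheable - mismatches an uncached CPU mapping" :
               "non-cacheable");
        return 0;
}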

So in summary, the mismatch appears to be between the CPU accessing
the vmap region with non-cacheable attributes and the GPU accessing
the same memory with cacheable attributes, resulting in a loss of
coherency and lots of visible corruption.



Actually it is the other way around. The CPU thinks some data is in the
cache and the GPU only updates the system memory version because the
snoop flag is not set.




That doesn't seem to be what is happening. As far as we can tell from
our experiments, all inbound transactions are always cacheable, and so
the only way to make things work is to ensure that the CPU uses the
same attributes.

Ok, that doesn't make any sense. Whether inbound transactions are cacheable or not
is irrelevant when the CPU always uses uncached accesses.

See, on the PCIe side you have the snoop bit in the read/write transactions,
which tells the root hub whether the device wants to snoop caches or not.

When the CPU accesses some memory as cached, devices need to snoop the
cache for coherent accesses.

When the CPU accesses some memory as uncached, devices can disable snooping
to improve performance; but even when they don't do this, the spec mandates
that it still works.
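
For reference, the CPU-side half of that choice looks roughly like this (a
condensed sketch in the spirit of ttm_io_prot(), not a copy of it; the helper
name is made up):

#include <linux/mm.h>                   /* pgprot_writecombine(), pgprot_noncached() */
#include <drm/ttm/ttm_placement.h>      /* TTM_PL_FLAG_* */

/*
 * Pick the CPU page protection to match the placement's caching flags:
 * cached placements keep the normal cacheable protection and require the
 * device to snoop, while WC/uncached placements get a matching non-cacheable
 * protection so a device that clears the snoop bit stays coherent with us.
 */
static pgprot_t example_placement_prot(u32 caching_flags, pgprot_t tmp)
{
        if (caching_flags & TTM_PL_FLAG_CACHED)
                return tmp;                             /* device must snoop */
        if (caching_flags & TTM_PL_FLAG_WC)
                return pgprot_writecombine(tmp);        /* device may skip snooping */
        return pgprot_noncached(tmp);                   /* TTM_PL_FLAG_UNCACHED */
}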


To be able to debug this further, could you elaborate a bit on
- How does the hardware emit those uncached/wc inbound accesses? Do
they rely on NoSnoop?



The GPU has a separate page walker in the MC, and the page tables there
have bits saying whether the access should go to the PCIe bus and, if so,
whether the snoop bit should be set.
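
For illustration, the per-PTE decision could be pictured like this (the flag
values are placeholders, not the real hardware encoding; the amdgpu driver has
similarly named AMDGPU_PTE_SYSTEM / AMDGPU_PTE_SNOOPED flags):

#include <stdbool.h>
#include <stdint.h>

#define EX_PTE_VALID    (1ULL << 0)
#define EX_PTE_SYSTEM   (1ULL << 1)     /* access goes out over PCIe to system RAM */
#define EX_PTE_SNOOPED  (1ULL << 2)     /* set the snoop attribute on those transactions */

/* Build a GPU PTE for a system memory page, optionally snooping CPU caches. */
static uint64_t example_gpu_pte(uint64_t dma_addr, bool snooped)
{
        uint64_t pte = dma_addr & ~0xfffULL;

        pte |= EX_PTE_VALID | EX_PTE_SYSTEM;
        if (snooped)
                pte |= EX_PTE_SNOOPED;  /* coherent with the CPU caches */

        return pte;
}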



- Christian pointed out that some accesses must be uncached even when
not using WC. What kind of accesses are those? And do they access
system RAM?



On some hardware generations we have a buggy engine which fails to
forward the snoop bit, and because of this the system memory pages used by
that engine must be uncached. But this only applies if you use ROCm in a
specific configuration.




OK. I don't know what that means tbh. Does this apply to both radeon and amdgpu?


That applies to both, but only in rare corner cases.

Christian.
