On 20.06.2017 12:34, Marek Olšák wrote:
BTW, I noticed the flush sequence in the kernel is wrong. The correct
flush sequence should be:

1) EVENT_WRITE_EOP - CACHE_FLUSH_AND_INV_TS - write a dword to memory,
but no fence/interrupt.
2) WAIT_REG_MEM on the dword to wait for idle before SURFACE_SYNC.
3) SURFACE_SYNC (TC, K$, I$)
4) Write CP_COHER_CNTL2.
5) EVENT_WRITE_EOP - BOTTOM_OF_PIPE_TS - write the fence with the interrupt.

WAIT_REG_MEM wouldn't be needed if we were able to merge
CACHE_FLUSH_AND_INV, SURFACE_SYNC, and CP_COHER_CNTL2 into one EOP
event.

The main issue with the current flush sequence in radeon and amdgpu is
that it doesn't wait for idle before writing CP_COHER_CNTL2 and
SURFACE_SYNC. So far we've been able to avoid the bug by waiting for
idle in userspace IBs.

This is gfx9-only though, right?

Cheers,
Nicolai



Marek


On Fri, May 26, 2017 at 5:47 PM, Marek Olšák <mar...@gmail.com> wrote:
On Tue, May 9, 2017 at 2:13 PM, Nicolai Hähnle <nhaeh...@gmail.com> wrote:
Hi all,

I'm seeing some very strange errors on Verde with CPU readback from GART,
and am pretty much out of ideas. Some help would be very much appreciated.

The error manifests with the
GL45-CTS.gtf32.GL3Tests.packed_pixels.packed_pixels_pbo test on amdgpu,
but
*not* on radeon. Here's what the test does:

1. Upload a texture.
2. Read the texture back via a shader that uses shader buffer writes to
write data to a buffer that is allocated in GART.
3. The CPU then reads from the buffer -- and sometimes gets stale data.

This sequence is repeated for many sub-tests. There are some sub-tests
where
the CPU reads stale data from the buffer, i.e. the shader writes simply
don't make it to the CPU. The tests vary superficially, e.g. the first
failing test is (almost?) always one where data is written in 16-bit words
(but there are succeeding sub-tests with 16-bit writes as well).

The bug is *not* a timing issue. Adding even a 1sec delay (sleep(1);)
between the fence wait and the return of glMapBuffer does not fix the
problem. The data must be stuck in a cache somewhere.

Since the test runs okay with the radeon module, I tried some changes
based
on comparing the IB submit between radeon and amdgpu, and based on
comparing
register settings via scans obtained from umr. Some of the things I've
tried:

- Set HDP_MISC_CNTL.FLUSH_INVALIDATE_CACHE to 1 (both radeon and
amdgpu/gfx9
set this)
- Add SURFACE_SYNC packets preceded by setting CP_COHER_CNTL2 to the vmid
(radeon does this)
- Change gfx_v6_0_ring_emit_hdp_invalidate: select ME engine instead of
PFP
(which seems more logical, and is done by gfx7+), or remove the
corresponding WRITE_DATA entirely

None of these changes helped.

What *does* help is adding an artificial wait. Specifically, I'm adding a
sequence of

- WRITE_DATA
- CACHE_FLUSH_AND_INV_TS_EVENT (BOTTOM_OF_PIPE_TS has same behavior)
- WAIT_REG_MEM

as can be seen in the attached patch. This works around the problem, but
it
makes no sense:

Adding the wait sequence *before* the SURFACE_SYNC in ring_emit_fence
works
around the problem. However(!) it does not actually cause the UMD to wait
any longer than before. Without this change, the UMD immediately sees a
signaled user fence (and never uses an ioctl to wait), and with this
change,
it *still* sees a signaled user fence.

Also, note that the way I've hacked the change, the wait sequence is only
added for the user fence emit (and I'm using a modified UMD to ensure that
there is enough memory to be used by the added wait sequence).

Adding the wait sequence *after* the SURFACE_SYNC *doesn't* work around
the
problem.

So for whatever reason, the added wait sequence *before* the SURFACE_SYNC
encourages some part of the GPU to flush the data from wherever it's
stuck,
and that's just really bizarre. There must be something really simple I'm
missing, and any pointers would be appreciated.

Have you tried this?

diff --git a/src/gallium/drivers/radeonsi/si_hw_context.c
b/src/gallium/drivers/radeonsi/si_hw_context.c
index 92c09cb..e6ac0ba 100644
--- a/src/gallium/drivers/radeonsi/si_hw_context.c
+++ b/src/gallium/drivers/radeonsi/si_hw_context.c
@@ -133,7 +133,8 @@ void si_context_gfx_flush(void *context, unsigned flags,
                         SI_CONTEXT_PS_PARTIAL_FLUSH;

         /* DRM 3.1.0 doesn't flush TC for VI correctly. */
-       if (ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
+       if ((ctx->b.chip_class == VI && ctx->b.screen->info.drm_minor <= 1)
||
+           (ctx->b.chip_class == SI && ctx->b.screen->info.drm_major == 3))
                 ctx->b.flags |= SI_CONTEXT_INV_GLOBAL_L2 |
                                 SI_CONTEXT_INV_VMEM_L1;

One more cache flush there shouldn't hurt.

Also, Mesa uses PFP_SYNC_ME. It shouldn't be necessary, but it's worth a
try.

Marek


--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to