Nicolai Hähnle <nhaeh...@gmail.com> writes:

> On 15.04.2016 17:12, Francisco Jerez wrote:
>>>>>> For a test doing almost the same thing but not relying on
>>>>>> unspecified invocation ordering, see
>>>>>> "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" -- It
>>>>>> would be interesting to see whether you can get it to reproduce the
>>>>>> GCN coherency bug using different framebuffer size and modulus
>>>>>> parameters.
>>>>>
>>>>> I tried that, but couldn't reproduce. Whether I just wasn't thorough
>>>>> enough/"unlucky" or whether the in-order nature of the hardware and
>>>>> L1 cache behavior just makes it impossible to fail the
>>>>> shader-mem-barrier test, I'm not sure.
>>>>>
>>>> Now I'm curious about the exact nature of the bug ;), some sort of
>>>> missing L1 cache-flushing which could potentially affect dependent
>>>> invocations?
>>>
>>> I'm not sure I remember everything, to be honest.
>>>
>>> One issue that I do remember is that loads/stores by default go
>>> through L1, but atomics _never_ go through L1, no matter how you
>>> compile them. This means that if you're working on two different
>>> images, one with atomics and the other without, the atomic one will
>>> always behave coherently but the other one won't unless you
>>> explicitly tell it to.
>>>
>>> Now that I think about this again, there should probably be a
>>> shader-mem-barrier-style way to test for that particular issue in a
>>> way that doesn't depend on the specifics of the parallelization.
>>> Something like, in a loop:
>>>
>>> Thread 1: increasing imageStore into image 1 at location 1, imageLoad
>>> from image 1 at location 2
>>>
>>> Thread 2: same, but with locations 1 and 2 exchanged
>>>
>>> Both threads: imageAtomicAdd on the same location in image 2
>>>
>>> Then each thread can check that _if_ the imageAtomicAdd detects the
>>> buddy thread operating in parallel, _then_ it must also observe
>>> incrementing values in the location that the buddy thread stores to.
>>> Does that sound reasonable?
>>>
>> Yeah, that sounds reasonable, but keep in mind that even if both image
>> variables are marked coherent, you cannot make assumptions about the
>> ordering of the image stores performed on image 1 relative to the
>> atomics performed on image 2 unless there is an explicit barrier in
>> between, which means that some level of L1 caching is legitimate even
>> in that scenario (and might have some performance benefit over
>> skipping L1 caching of coherent images altogether) -- that's in fact
>> how the i965 driver implements coherent image stores: we just write to
>> L1 and flush later on to the globally coherent L3 on the next
>> memoryBarrier().
>
> Okay, adding the barrier makes sense.
>
>> What about a test along the lines of the current coherency test? Any
>> idea why you couldn't get it to reproduce the issue? Is it because
>> threads with dependent inputs are guaranteed to be spawned in the same
>> L1 cache domain as the threads that generated their inputs, or
>> something like that?
>
> From what I understand (though admittedly the documentation I have on
> this is not the clearest...), the hardware flushes the L1 cache
> automatically at the end of each shader invocation, so that dependent
> invocations are guaranteed to pick it up.
>
> Cheers,
> Nicolai

Ah, interesting. What about memoryBarrier()? Does that cause the
back-end compiler to emit an L1 cache flush of some sort?
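For reference, a rough GLSL sketch of the pairwise check Nicolai
outlines above could look like the following. This is only an
illustration under assumptions of my own, not the actual piglit test:
the buddy mapping, image bindings, per-pair counter location, loop
bound and result reporting are all hypothetical, and the C side is
assumed to clear both r32ui images to zero before the draw.

#version 420

/* Hypothetical bindings: "data" is accessed with plain stores/loads,
 * "counter" is only ever accessed atomically.  Both assumed cleared
 * to zero by the host before the draw call. */
layout(r32ui) coherent uniform uimage2D data;
layout(r32ui) coherent uniform uimage2D counter;

uniform uint iterations;        /* hypothetical loop bound */

out vec4 color;

void main()
{
        ivec2 self = ivec2(gl_FragCoord.xy);
        /* Hypothetical buddy mapping: pair horizontally adjacent
         * pixels, and give each pair its own counter location
         * (the even x of the pair) so the totals stay per-pair. */
        ivec2 buddy = ivec2(self.x ^ 1, self.y);
        ivec2 pair = ivec2(min(self.x, buddy.x), self.y);
        bool ok = true;

        for (uint i = 1u; i <= iterations; i++) {
                /* Publish our progress through the non-atomic image,
                 * with a barrier so the store is ordered before the
                 * atomic, per Francisco's point above. */
                imageStore(data, self, uvec4(i));
                memoryBarrier();
                imageAtomicAdd(counter, pair, 1u);

                /* The pair total is exactly (our i adds) + (buddy's
                 * adds serialized so far), so "total - i" is a count
                 * of iterations the buddy has provably completed. */
                uint total = imageAtomicAdd(counter, pair, 0u);
                uint buddy_done = total - i;

                /* The buddy stored its iteration count before its
                 * atomic add, so a store of at least buddy_done must
                 * be visible here; a stale L1 hit on the load while
                 * the atomic bypasses L1 is exactly the suspected
                 * failure mode. */
                if (imageLoad(data, buddy).x < buddy_done)
                        ok = false;
        }

        color = ok ? vec4(0.0, 1.0, 0.0, 1.0)
                   : vec4(1.0, 0.0, 0.0, 1.0);
}

Because the check is conditional on what the atomics actually observe,
it should hold under any invocation schedule: if the buddy never runs
in parallel, buddy_done stays zero and the test passes vacuously.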