Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test
Nicolai Hähnle writes: > On 16.04.2016 16:01, Francisco Jerez wrote: >> Jan Vesely writes: >> >>> On Sat, 2016-04-16 at 10:19 -0500, Nicolai Hähnle wrote: On 15.04.2016 17:12, Francisco Jerez wrote: > >> >>> > > For a test doing almost the same thing but not relying on > unspecified > invocation ordering, see > "tests/spec/arb_shader_image_load_store/shader-mem- > barrier.c" -- It > would be interesting to see whether you can get it to > reproduce the GCN > coherency bug using different framebuffer size and modulus > parameters. I tried that, but couldn't reproduce. Whether I just wasn't thorough enough/"unlucky" or whether the in-order nature of the hardware and L1 cache behavior just makes it impossible to fail the shader- mem-barrier test, I'm not sure. >>> Now I'm curious about the exact nature of the bug ;), some sort >>> of >>> missing L1 cache-flushing which could potentially affect >>> dependent >>> invocations? >> I'm not sure I remember everything, to be honest. >> >> One issue that I do remember is that load/store by default go >> through >> L1, but atomics _never_ go through L1, no matter how you compile >> them. >> This means that if you're working on two different images, one >> with >> atomics and the other without, then the atomic one will always >> behave >> coherently but the other one won't unless you explicitly tell it >> to. >> >> Now that I think about this again, there should probably be a >> shader-mem-barrier-style way to test for that particular issue in >> a way >> that doesn't depend on the specifics of the parallelization. >> Something >> like, in a loop: >> >> Thread 1: increasing imageStore into image 1 at location 1, >> imageLoad >> from image 1 location 2 >> >> Thread 2: same, but exchange locations 1 and 2 >> >> Both threads: imageAtomicAdd on the same location in image 2 >> >> Then each thread can check that _if_ the imageAtomicAdd detects >> the >> buddy thread operating in parallel, _then_ they must also observe >> incrementing values in the location that the buddy thread stores >> to. >> Does that sound reasonable? >> > Yeah, that sounds reasonable, but keep in mind that even if both > image > variables are marked coherent you cannot make assumptions about the > ordering of the image stores performed on image 1 relative to the > atomics performed on image 2 unless there is an explicit barrier in > between, which means that some level of L1 caching is legitimate > even in > that scenario (and might have some performance benefit over > skipping L1 > caching of coherent images altogether) -- That's in fact the way > that > the i965 driver implements coherent image stores: We just write to > L1 > and flush later on to the globally coherent L3 on the next > memoryBarrier(). Okay, adding the barrier makes sense. > > What about a test along the lines of the current coherency > test? Any > idea what's the reason you couldn't get it to reproduce the > issue? Is > it because threads with dependent inputs are guaranteed to be > spawned in > the same L1 cache domain as the threads that generated their inputs > or > something like that? From what I understand (though admittedly the documentation I have on this is not the clearest...), the hardware flushes the L1 cache automatically at the end of each shader invocation, so that dependent invocations are guaranteed to pick it up. >>> >>> The GCN whitepaper mentions both that the L1D cache is write-through >>> and sends data to L2 at the end all 64 WF stores (page 9), and that the >>> L1 cache writes back data at the end of wavefront or when barrier is >>> invoked (page 10). >>> >> >> Interesting... In that case I wonder if there actually was a bug to >> test for or you could just call the non-coherent behavior of L1 writes a >> feature? ;) >> >> Nicolai, do I have your Acked-by for the revert? > > Yes, I should have made that clearer in my first reply :) > Thanks, pushed. > Anyway, there _is_ still something that needs to be tested. Whereas > coherency only worries about stores being propagated to dependent > invocations, and shader-mem-barrier only worries about stores being > propagated _in the correct order_, there needs to be a test that worries > about stores being propagated _at all_ (because of the mentioned > difference in GCN between atomics and load/stores). But I'll follow up > with such a test soon. > The problem is how do you define "at all" in a way independent from the unspecified shader program and memory transaction execution o
Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test
Jan Vesely writes: > On Sat, 2016-04-16 at 10:19 -0500, Nicolai Hähnle wrote: >> On 15.04.2016 17:12, Francisco Jerez wrote: >> > >> > > >> > > > >> > > > > >> > > > > > >> > > > > > For a test doing almost the same thing but not relying on >> > > > > > unspecified >> > > > > > invocation ordering, see >> > > > > > "tests/spec/arb_shader_image_load_store/shader-mem- >> > > > > > barrier.c" -- It >> > > > > > would be interesting to see whether you can get it to >> > > > > > reproduce the GCN >> > > > > > coherency bug using different framebuffer size and modulus >> > > > > > parameters. >> > > > > I tried that, but couldn't reproduce. Whether I just wasn't >> > > > > thorough >> > > > > enough/"unlucky" or whether the in-order nature of the >> > > > > hardware and L1 >> > > > > cache behavior just makes it impossible to fail the shader- >> > > > > mem-barrier >> > > > > test, I'm not sure. >> > > > > >> > > > Now I'm curious about the exact nature of the bug ;), some sort >> > > > of >> > > > missing L1 cache-flushing which could potentially affect >> > > > dependent >> > > > invocations? >> > > I'm not sure I remember everything, to be honest. >> > > >> > > One issue that I do remember is that load/store by default go >> > > through >> > > L1, but atomics _never_ go through L1, no matter how you compile >> > > them. >> > > This means that if you're working on two different images, one >> > > with >> > > atomics and the other without, then the atomic one will always >> > > behave >> > > coherently but the other one won't unless you explicitly tell it >> > > to. >> > > >> > > Now that I think about this again, there should probably be a >> > > shader-mem-barrier-style way to test for that particular issue in >> > > a way >> > > that doesn't depend on the specifics of the parallelization. >> > > Something >> > > like, in a loop: >> > > >> > > Thread 1: increasing imageStore into image 1 at location 1, >> > > imageLoad >> > > from image 1 location 2 >> > > >> > > Thread 2: same, but exchange locations 1 and 2 >> > > >> > > Both threads: imageAtomicAdd on the same location in image 2 >> > > >> > > Then each thread can check that _if_ the imageAtomicAdd detects >> > > the >> > > buddy thread operating in parallel, _then_ they must also observe >> > > incrementing values in the location that the buddy thread stores >> > > to. >> > > Does that sound reasonable? >> > > >> > Yeah, that sounds reasonable, but keep in mind that even if both >> > image >> > variables are marked coherent you cannot make assumptions about the >> > ordering of the image stores performed on image 1 relative to the >> > atomics performed on image 2 unless there is an explicit barrier in >> > between, which means that some level of L1 caching is legitimate >> > even in >> > that scenario (and might have some performance benefit over >> > skipping L1 >> > caching of coherent images altogether) -- That's in fact the way >> > that >> > the i965 driver implements coherent image stores: We just write to >> > L1 >> > and flush later on to the globally coherent L3 on the next >> > memoryBarrier(). >> Okay, adding the barrier makes sense. >> >> >> > >> > What about a test along the lines of the current coherency >> > test? Any >> > idea what's the reason you couldn't get it to reproduce the >> > issue? Is >> > it because threads with dependent inputs are guaranteed to be >> > spawned in >> > the same L1 cache domain as the threads that generated their inputs >> > or >> > something like that? >> From what I understand (though admittedly the documentation I have >> on >> this is not the clearest...), the hardware flushes the L1 cache >> automatically at the end of each shader invocation, so that >> dependent >> invocations are guaranteed to pick it up. > > The GCN whitepaper mentions both that the L1D cache is write-through > and sends data to L2 at the end all 64 WF stores (page 9), and that the > L1 cache writes back data at the end of wavefront or when barrier is > invoked (page 10). > Interesting... In that case I wonder if there actually was a bug to test for or you could just call the non-coherent behavior of L1 writes a feature? ;) Nicolai, do I have your Acked-by for the revert? > Jan > >> >> Cheers, >> Nicolai >> ___ >> Piglit mailing list >> Piglit@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/piglit signature.asc Description: PGP signature ___ Piglit mailing list Piglit@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/piglit
Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test
On Sat, 2016-04-16 at 10:19 -0500, Nicolai Hähnle wrote: > On 15.04.2016 17:12, Francisco Jerez wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > For a test doing almost the same thing but not relying on > > > > > > unspecified > > > > > > invocation ordering, see > > > > > > "tests/spec/arb_shader_image_load_store/shader-mem- > > > > > > barrier.c" -- It > > > > > > would be interesting to see whether you can get it to > > > > > > reproduce the GCN > > > > > > coherency bug using different framebuffer size and modulus > > > > > > parameters. > > > > > I tried that, but couldn't reproduce. Whether I just wasn't > > > > > thorough > > > > > enough/"unlucky" or whether the in-order nature of the > > > > > hardware and L1 > > > > > cache behavior just makes it impossible to fail the shader- > > > > > mem-barrier > > > > > test, I'm not sure. > > > > > > > > > Now I'm curious about the exact nature of the bug ;), some sort > > > > of > > > > missing L1 cache-flushing which could potentially affect > > > > dependent > > > > invocations? > > > I'm not sure I remember everything, to be honest. > > > > > > One issue that I do remember is that load/store by default go > > > through > > > L1, but atomics _never_ go through L1, no matter how you compile > > > them. > > > This means that if you're working on two different images, one > > > with > > > atomics and the other without, then the atomic one will always > > > behave > > > coherently but the other one won't unless you explicitly tell it > > > to. > > > > > > Now that I think about this again, there should probably be a > > > shader-mem-barrier-style way to test for that particular issue in > > > a way > > > that doesn't depend on the specifics of the parallelization. > > > Something > > > like, in a loop: > > > > > > Thread 1: increasing imageStore into image 1 at location 1, > > > imageLoad > > > from image 1 location 2 > > > > > > Thread 2: same, but exchange locations 1 and 2 > > > > > > Both threads: imageAtomicAdd on the same location in image 2 > > > > > > Then each thread can check that _if_ the imageAtomicAdd detects > > > the > > > buddy thread operating in parallel, _then_ they must also observe > > > incrementing values in the location that the buddy thread stores > > > to. > > > Does that sound reasonable? > > > > > Yeah, that sounds reasonable, but keep in mind that even if both > > image > > variables are marked coherent you cannot make assumptions about the > > ordering of the image stores performed on image 1 relative to the > > atomics performed on image 2 unless there is an explicit barrier in > > between, which means that some level of L1 caching is legitimate > > even in > > that scenario (and might have some performance benefit over > > skipping L1 > > caching of coherent images altogether) -- That's in fact the way > > that > > the i965 driver implements coherent image stores: We just write to > > L1 > > and flush later on to the globally coherent L3 on the next > > memoryBarrier(). > Okay, adding the barrier makes sense. > > > > > > What about a test along the lines of the current coherency > > test? Any > > idea what's the reason you couldn't get it to reproduce the > > issue? Is > > it because threads with dependent inputs are guaranteed to be > > spawned in > > the same L1 cache domain as the threads that generated their inputs > > or > > something like that? > From what I understand (though admittedly the documentation I have > on > this is not the clearest...), the hardware flushes the L1 cache > automatically at the end of each shader invocation, so that > dependent > invocations are guaranteed to pick it up. The GCN whitepaper mentions both that the L1D cache is write-through and sends data to L2 at the end all 64 WF stores (page 9), and that the L1 cache writes back data at the end of wavefront or when barrier is invoked (page 10). Jan > > Cheers, > Nicolai > ___ > Piglit mailing list > Piglit@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/piglit signature.asc Description: This is a digitally signed message part ___ Piglit mailing list Piglit@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/piglit
Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test
Nicolai Hähnle writes: > On 15.04.2016 17:12, Francisco Jerez wrote: >> For a test doing almost the same thing but not relying on unspecified >> invocation ordering, see >> "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" -- It >> would be interesting to see whether you can get it to reproduce the GCN >> coherency bug using different framebuffer size and modulus parameters. > > I tried that, but couldn't reproduce. Whether I just wasn't thorough > enough/"unlucky" or whether the in-order nature of the hardware and L1 > cache behavior just makes it impossible to fail the shader-mem-barrier > test, I'm not sure. > Now I'm curious about the exact nature of the bug ;), some sort of missing L1 cache-flushing which could potentially affect dependent invocations? >>> >>> I'm not sure I remember everything, to be honest. >>> >>> One issue that I do remember is that load/store by default go through >>> L1, but atomics _never_ go through L1, no matter how you compile them. >>> This means that if you're working on two different images, one with >>> atomics and the other without, then the atomic one will always behave >>> coherently but the other one won't unless you explicitly tell it to. >>> >>> Now that I think about this again, there should probably be a >>> shader-mem-barrier-style way to test for that particular issue in a way >>> that doesn't depend on the specifics of the parallelization. Something >>> like, in a loop: >>> >>> Thread 1: increasing imageStore into image 1 at location 1, imageLoad >>> from image 1 location 2 >>> >>> Thread 2: same, but exchange locations 1 and 2 >>> >>> Both threads: imageAtomicAdd on the same location in image 2 >>> >>> Then each thread can check that _if_ the imageAtomicAdd detects the >>> buddy thread operating in parallel, _then_ they must also observe >>> incrementing values in the location that the buddy thread stores to. >>> Does that sound reasonable? >>> >> Yeah, that sounds reasonable, but keep in mind that even if both image >> variables are marked coherent you cannot make assumptions about the >> ordering of the image stores performed on image 1 relative to the >> atomics performed on image 2 unless there is an explicit barrier in >> between, which means that some level of L1 caching is legitimate even in >> that scenario (and might have some performance benefit over skipping L1 >> caching of coherent images altogether) -- That's in fact the way that >> the i965 driver implements coherent image stores: We just write to L1 >> and flush later on to the globally coherent L3 on the next >> memoryBarrier(). > > Okay, adding the barrier makes sense. > > >> What about a test along the lines of the current coherency test? Any >> idea what's the reason you couldn't get it to reproduce the issue? Is >> it because threads with dependent inputs are guaranteed to be spawned in >> the same L1 cache domain as the threads that generated their inputs or >> something like that? > > From what I understand (though admittedly the documentation I have on > this is not the clearest...), the hardware flushes the L1 cache > automatically at the end of each shader invocation, so that dependent > invocations are guaranteed to pick it up. > Ah, interesting. What about memoryBarrier()? Does that cause the back-end compiler to emit an L1 cache flush of some sort? > Cheers, > Nicolai signature.asc Description: PGP signature ___ Piglit mailing list Piglit@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/piglit
Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test
On 15.04.2016 17:12, Francisco Jerez wrote: For a test doing almost the same thing but not relying on unspecified invocation ordering, see "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" -- It would be interesting to see whether you can get it to reproduce the GCN coherency bug using different framebuffer size and modulus parameters. I tried that, but couldn't reproduce. Whether I just wasn't thorough enough/"unlucky" or whether the in-order nature of the hardware and L1 cache behavior just makes it impossible to fail the shader-mem-barrier test, I'm not sure. Now I'm curious about the exact nature of the bug ;), some sort of missing L1 cache-flushing which could potentially affect dependent invocations? I'm not sure I remember everything, to be honest. One issue that I do remember is that load/store by default go through L1, but atomics _never_ go through L1, no matter how you compile them. This means that if you're working on two different images, one with atomics and the other without, then the atomic one will always behave coherently but the other one won't unless you explicitly tell it to. Now that I think about this again, there should probably be a shader-mem-barrier-style way to test for that particular issue in a way that doesn't depend on the specifics of the parallelization. Something like, in a loop: Thread 1: increasing imageStore into image 1 at location 1, imageLoad from image 1 location 2 Thread 2: same, but exchange locations 1 and 2 Both threads: imageAtomicAdd on the same location in image 2 Then each thread can check that _if_ the imageAtomicAdd detects the buddy thread operating in parallel, _then_ they must also observe incrementing values in the location that the buddy thread stores to. Does that sound reasonable? Yeah, that sounds reasonable, but keep in mind that even if both image variables are marked coherent you cannot make assumptions about the ordering of the image stores performed on image 1 relative to the atomics performed on image 2 unless there is an explicit barrier in between, which means that some level of L1 caching is legitimate even in that scenario (and might have some performance benefit over skipping L1 caching of coherent images altogether) -- That's in fact the way that the i965 driver implements coherent image stores: We just write to L1 and flush later on to the globally coherent L3 on the next memoryBarrier(). Okay, adding the barrier makes sense. What about a test along the lines of the current coherency test? Any idea what's the reason you couldn't get it to reproduce the issue? Is it because threads with dependent inputs are guaranteed to be spawned in the same L1 cache domain as the threads that generated their inputs or something like that? From what I understand (though admittedly the documentation I have on this is not the clearest...), the hardware flushes the L1 cache automatically at the end of each shader invocation, so that dependent invocations are guaranteed to pick it up. Cheers, Nicolai ___ Piglit mailing list Piglit@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/piglit
Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test
Nicolai Hähnle writes: > On 14.04.2016 23:43, Francisco Jerez wrote: >> Nicolai Hähnle writes: >> >>> On 14.04.2016 22:37, Francisco Jerez wrote: This test seems bogus, ARB_shader_image_load_store doesn't give you any ordering or concurrency guarantees for independent shader invocations (i.e. invocations so that there is no data dependency between the inputs of one and the outputs of another). The writer thread may seem to make no progress or may seem to have already run to completion from the point of view of the reader threads in a perfectly valid implementation of ARB_shader_image_load_store, because the ordering in which your fragment shader invocations will be run is fully unspecified -- In fact this seems to fail on most Intel hardware as it started showing up in our CI system today, can we please revert it? >>> >>> You're right, the test is too tight. I still think something like this >>> test could be useful, but it shouldn't use a hard-coded limit for the >>> writer thread, and instead have a mechanism to communicate how many >>> reader threads were successfully started in parallel. In any case, feel >>> free to revert this in the meantime. >>> >> Right, the thing is that even if you have some mechanism to find out >> From thread A that some other threads have started executing, you have >> no guarantee that those other threads will continue making progress in a >> finite amount of time before thread A runs to completion, because you >> have no guarantees about the way that concurrency is implemented between >> those threads (whether they actually run in parallel, whether the >> implementation does time slicing in a round-robin, priority-based or >> first-come first-served fashion or some sort of SIMD-like parallelism >> with multiple threads in a bundle executing in lockstep with a single >> instruction counter, etc...). What you *can* do is assume that shader >> invocations sharing some sort of data dependency will be executed in >> sequence and that therefore coherent memory writes from the first to >> execute will be visible from the second (which is what the existing >> coherency image load/store test attempts to do). You can also assume >> that *if* you find out on thread B that the (not necessarily ordered) >> thread A has had side effects 1 and 2 on coherent memory locations, and >> 1 was guaranteed not to happen after 2 (e.g. because of a shader memory >> barrier), thread B will see them happen consistently in the right order >> (which is roughly what the existing shader-mem-barrier test checks). >> For a test doing almost the same thing but not relying on unspecified invocation ordering, see "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" -- It would be interesting to see whether you can get it to reproduce the GCN coherency bug using different framebuffer size and modulus parameters. >>> >>> I tried that, but couldn't reproduce. Whether I just wasn't thorough >>> enough/"unlucky" or whether the in-order nature of the hardware and L1 >>> cache behavior just makes it impossible to fail the shader-mem-barrier >>> test, I'm not sure. >>> >> Now I'm curious about the exact nature of the bug ;), some sort of >> missing L1 cache-flushing which could potentially affect dependent >> invocations? > > I'm not sure I remember everything, to be honest. > > One issue that I do remember is that load/store by default go through > L1, but atomics _never_ go through L1, no matter how you compile them. > This means that if you're working on two different images, one with > atomics and the other without, then the atomic one will always behave > coherently but the other one won't unless you explicitly tell it to. > > Now that I think about this again, there should probably be a > shader-mem-barrier-style way to test for that particular issue in a way > that doesn't depend on the specifics of the parallelization. Something > like, in a loop: > > Thread 1: increasing imageStore into image 1 at location 1, imageLoad > from image 1 location 2 > > Thread 2: same, but exchange locations 1 and 2 > > Both threads: imageAtomicAdd on the same location in image 2 > > Then each thread can check that _if_ the imageAtomicAdd detects the > buddy thread operating in parallel, _then_ they must also observe > incrementing values in the location that the buddy thread stores to. > Does that sound reasonable? > Yeah, that sounds reasonable, but keep in mind that even if both image variables are marked coherent you cannot make assumptions about the ordering of the image stores performed on image 1 relative to the atomics performed on image 2 unless there is an explicit barrier in between, which means that some level of L1 caching is legitimate even in that scenario (and might have some performance benefit over skipping L1 caching of coherent images altogether) -- That's in fact the way that the i965 driver implements cohere
Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test
On 14.04.2016 23:43, Francisco Jerez wrote: Nicolai Hähnle writes: On 14.04.2016 22:37, Francisco Jerez wrote: This test seems bogus, ARB_shader_image_load_store doesn't give you any ordering or concurrency guarantees for independent shader invocations (i.e. invocations so that there is no data dependency between the inputs of one and the outputs of another). The writer thread may seem to make no progress or may seem to have already run to completion from the point of view of the reader threads in a perfectly valid implementation of ARB_shader_image_load_store, because the ordering in which your fragment shader invocations will be run is fully unspecified -- In fact this seems to fail on most Intel hardware as it started showing up in our CI system today, can we please revert it? You're right, the test is too tight. I still think something like this test could be useful, but it shouldn't use a hard-coded limit for the writer thread, and instead have a mechanism to communicate how many reader threads were successfully started in parallel. In any case, feel free to revert this in the meantime. Right, the thing is that even if you have some mechanism to find out From thread A that some other threads have started executing, you have no guarantee that those other threads will continue making progress in a finite amount of time before thread A runs to completion, because you have no guarantees about the way that concurrency is implemented between those threads (whether they actually run in parallel, whether the implementation does time slicing in a round-robin, priority-based or first-come first-served fashion or some sort of SIMD-like parallelism with multiple threads in a bundle executing in lockstep with a single instruction counter, etc...). What you *can* do is assume that shader invocations sharing some sort of data dependency will be executed in sequence and that therefore coherent memory writes from the first to execute will be visible from the second (which is what the existing coherency image load/store test attempts to do). You can also assume that *if* you find out on thread B that the (not necessarily ordered) thread A has had side effects 1 and 2 on coherent memory locations, and 1 was guaranteed not to happen after 2 (e.g. because of a shader memory barrier), thread B will see them happen consistently in the right order (which is roughly what the existing shader-mem-barrier test checks). For a test doing almost the same thing but not relying on unspecified invocation ordering, see "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" -- It would be interesting to see whether you can get it to reproduce the GCN coherency bug using different framebuffer size and modulus parameters. I tried that, but couldn't reproduce. Whether I just wasn't thorough enough/"unlucky" or whether the in-order nature of the hardware and L1 cache behavior just makes it impossible to fail the shader-mem-barrier test, I'm not sure. Now I'm curious about the exact nature of the bug ;), some sort of missing L1 cache-flushing which could potentially affect dependent invocations? I'm not sure I remember everything, to be honest. One issue that I do remember is that load/store by default go through L1, but atomics _never_ go through L1, no matter how you compile them. This means that if you're working on two different images, one with atomics and the other without, then the atomic one will always behave coherently but the other one won't unless you explicitly tell it to. Now that I think about this again, there should probably be a shader-mem-barrier-style way to test for that particular issue in a way that doesn't depend on the specifics of the parallelization. Something like, in a loop: Thread 1: increasing imageStore into image 1 at location 1, imageLoad from image 1 location 2 Thread 2: same, but exchange locations 1 and 2 Both threads: imageAtomicAdd on the same location in image 2 Then each thread can check that _if_ the imageAtomicAdd detects the buddy thread operating in parallel, _then_ they must also observe incrementing values in the location that the buddy thread stores to. Does that sound reasonable? Cheers, Nicolai Cheers, Nicolai Nicolai Hähnle writes: From: Nicolai Hähnle The existing coherency test isn't a good match for the AMD GCN execution model. --- .../execution/coherency-extra.shader_test | 90 ++ 1 file changed, 90 insertions(+) create mode 100644 tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test diff --git a/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test b/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test new file mode 100644 index 000..f718cd2 --- /dev/null +++ b/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test @@ -0,0 +1,90 @@ +# Additional coherency test that can demonstrate failure
Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test
Nicolai Hähnle writes: > On 14.04.2016 22:37, Francisco Jerez wrote: >> This test seems bogus, ARB_shader_image_load_store doesn't give you any >> ordering or concurrency guarantees for independent shader invocations >> (i.e. invocations so that there is no data dependency between the inputs >> of one and the outputs of another). The writer thread may seem to make >> no progress or may seem to have already run to completion from the point >> of view of the reader threads in a perfectly valid implementation of >> ARB_shader_image_load_store, because the ordering in which your fragment >> shader invocations will be run is fully unspecified -- In fact this >> seems to fail on most Intel hardware as it started showing up in our CI >> system today, can we please revert it? > > You're right, the test is too tight. I still think something like this > test could be useful, but it shouldn't use a hard-coded limit for the > writer thread, and instead have a mechanism to communicate how many > reader threads were successfully started in parallel. In any case, feel > free to revert this in the meantime. > Right, the thing is that even if you have some mechanism to find out From thread A that some other threads have started executing, you have no guarantee that those other threads will continue making progress in a finite amount of time before thread A runs to completion, because you have no guarantees about the way that concurrency is implemented between those threads (whether they actually run in parallel, whether the implementation does time slicing in a round-robin, priority-based or first-come first-served fashion or some sort of SIMD-like parallelism with multiple threads in a bundle executing in lockstep with a single instruction counter, etc...). What you *can* do is assume that shader invocations sharing some sort of data dependency will be executed in sequence and that therefore coherent memory writes from the first to execute will be visible from the second (which is what the existing coherency image load/store test attempts to do). You can also assume that *if* you find out on thread B that the (not necessarily ordered) thread A has had side effects 1 and 2 on coherent memory locations, and 1 was guaranteed not to happen after 2 (e.g. because of a shader memory barrier), thread B will see them happen consistently in the right order (which is roughly what the existing shader-mem-barrier test checks). >> For a test doing almost the same thing but not relying on unspecified >> invocation ordering, see >> "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" -- It >> would be interesting to see whether you can get it to reproduce the GCN >> coherency bug using different framebuffer size and modulus parameters. > > I tried that, but couldn't reproduce. Whether I just wasn't thorough > enough/"unlucky" or whether the in-order nature of the hardware and L1 > cache behavior just makes it impossible to fail the shader-mem-barrier > test, I'm not sure. > Now I'm curious about the exact nature of the bug ;), some sort of missing L1 cache-flushing which could potentially affect dependent invocations? > Cheers, > Nicolai > >> Nicolai Hähnle writes: >> >>> From: Nicolai Hähnle >>> >>> The existing coherency test isn't a good match for the AMD GCN execution >>> model. >>> --- >>> .../execution/coherency-extra.shader_test | 90 >>> ++ >>> 1 file changed, 90 insertions(+) >>> create mode 100644 >>> tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test >>> >>> diff --git >>> a/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test >>> >>> b/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test >>> new file mode 100644 >>> index 000..f718cd2 >>> --- /dev/null >>> +++ >>> b/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test >>> @@ -0,0 +1,90 @@ >>> +# Additional coherency test that can demonstrate failures in an incorrect >>> +# coherency implementation for AMD GCN, unlike >>> arb_shader_image_load_store-coherency. >>> +# >>> +# The real problem with coherency in AMD GCN is separate, non-coherent L1 >>> +# caches, i.e. when a shader execution writes to an image in a CU that uses >>> +# one L1 cache, and a different shader execution reads from the image >>> +# in a CU with a different L1 cache. >>> +# >>> +# This test uses atomic accesses to a control texture to select the very >>> first >>> +# fragment shader thread as a writer thread which keeps changing a data >>> +# texture in a tight loop. All other threads become reader threads which >>> +# report success if they see two different values of the same texture. >>> +# >>> +# This test can produce a false negative (false failure) in two cases: >>> +# 1) The timeout value ITERS is too low, >>> +# 2) There is no (or insufficient) parallelism in the implementation, and >>> +# therefore the writer thread must fin
Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test
This test seems bogus, ARB_shader_image_load_store doesn't give you any ordering or concurrency guarantees for independent shader invocations (i.e. invocations so that there is no data dependency between the inputs of one and the outputs of another). The writer thread may seem to make no progress or may seem to have already run to completion from the point of view of the reader threads in a perfectly valid implementation of ARB_shader_image_load_store, because the ordering in which your fragment shader invocations will be run is fully unspecified -- In fact this seems to fail on most Intel hardware as it started showing up in our CI system today, can we please revert it? For a test doing almost the same thing but not relying on unspecified invocation ordering, see "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" -- It would be interesting to see whether you can get it to reproduce the GCN coherency bug using different framebuffer size and modulus parameters. Nicolai Hähnle writes: > From: Nicolai Hähnle > > The existing coherency test isn't a good match for the AMD GCN execution > model. > --- > .../execution/coherency-extra.shader_test | 90 > ++ > 1 file changed, 90 insertions(+) > create mode 100644 > tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test > > diff --git > a/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test > > b/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test > new file mode 100644 > index 000..f718cd2 > --- /dev/null > +++ > b/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test > @@ -0,0 +1,90 @@ > +# Additional coherency test that can demonstrate failures in an incorrect > +# coherency implementation for AMD GCN, unlike > arb_shader_image_load_store-coherency. > +# > +# The real problem with coherency in AMD GCN is separate, non-coherent L1 > +# caches, i.e. when a shader execution writes to an image in a CU that uses > +# one L1 cache, and a different shader execution reads from the image > +# in a CU with a different L1 cache. > +# > +# This test uses atomic accesses to a control texture to select the very > first > +# fragment shader thread as a writer thread which keeps changing a data > +# texture in a tight loop. All other threads become reader threads which > +# report success if they see two different values of the same texture. > +# > +# This test can produce a false negative (false failure) in two cases: > +# 1) The timeout value ITERS is too low, > +# 2) There is no (or insufficient) parallelism in the implementation, and > +# therefore the writer thread must finish before most of the reader > threads > +# get a chance to run. > +# > + > +[require] > +GL >= 3.3 > +GLSL >= 3.30 > +GL_ARB_shader_image_load_store > +SIZE 256 256 > + > +[vertex shader passthrough] > + > +[fragment shader] > +#version 330 > +#extension GL_ARB_shader_image_load_store: enable > + > +// Change this to 0 to get a control test that should fail on hardware > +// without coherent L1 caches. > +// > +// Need volatile instead of just coherent to prevent overly smart compilers > +// from moving the imageLoad/imageStore out of the loop. > +#if 1 > +volatile > +#endif > +layout(r32i) uniform iimage2D tex; > +volatile layout(r32i) uniform iimage2D ctrl; > +out vec4 outcolor; > + > +// Add a timeout so that an incorrect coherency implementation doesn't hang > +// the GPU. If this timeout is too low, you can get false negative results > +// because the writer thread quits before all reader threads have > +// executed. > +#define ITERS 10 > + > +void main() > +{ > + int id = imageAtomicAdd(ctrl, ivec2(0, 0), 1); > + int orig = imageLoad(tex, ivec2(0, 0)).x; > + bool done = false; > + > + outcolor = vec4(0.0, 0.0, 0.0, 1.0); > + > + for (int iter = 0; iter < ITERS && !done; ++iter) { > + if (id == 0) { > + imageStore(tex, ivec2(0, 0), ivec4(iter)); > + if (imageLoad(ctrl, ivec2(0, 1)).x >= 256 * 256) > + done = true; > + } else { > + int current = imageLoad(tex, ivec2(0, 0)).x; > + if (current != orig) > + done = true; > + } > + > + if (done || (id == 0 && iter == 0)) > + imageAtomicAdd(ctrl, ivec2(0, 1), 1); > + } > + > + if (done) > + outcolor.y = 1.0; > + else > + outcolor.x = 1.0; > +} > + > +[test] > +texture integer 0 (1, 2) (0, 0) GL_R32I > +image texture 0 GL_R32I > +texture integer 1 (1, 1) (0, 0) GL_R32I > +image texture 1 GL_R32I > + > +uniform int ctrl 0 > +uniform int tex 1 > +draw rect -1 -1 2 2 > + > +probe all rgba 0.0 1.0 0.0 1.0 > -- > 2.5.0 > > ___ > Piglit mailing list > Piglit@lists.freedesktop.org > https://lists.f
[Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test
From: Nicolai Hähnle The existing coherency test isn't a good match for the AMD GCN execution model. --- .../execution/coherency-extra.shader_test | 90 ++ 1 file changed, 90 insertions(+) create mode 100644 tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test diff --git a/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test b/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test new file mode 100644 index 000..f718cd2 --- /dev/null +++ b/tests/spec/arb_shader_image_load_store/execution/coherency-extra.shader_test @@ -0,0 +1,90 @@ +# Additional coherency test that can demonstrate failures in an incorrect +# coherency implementation for AMD GCN, unlike arb_shader_image_load_store-coherency. +# +# The real problem with coherency in AMD GCN is separate, non-coherent L1 +# caches, i.e. when a shader execution writes to an image in a CU that uses +# one L1 cache, and a different shader execution reads from the image +# in a CU with a different L1 cache. +# +# This test uses atomic accesses to a control texture to select the very first +# fragment shader thread as a writer thread which keeps changing a data +# texture in a tight loop. All other threads become reader threads which +# report success if they see two different values of the same texture. +# +# This test can produce a false negative (false failure) in two cases: +# 1) The timeout value ITERS is too low, +# 2) There is no (or insufficient) parallelism in the implementation, and +# therefore the writer thread must finish before most of the reader threads +# get a chance to run. +# + +[require] +GL >= 3.3 +GLSL >= 3.30 +GL_ARB_shader_image_load_store +SIZE 256 256 + +[vertex shader passthrough] + +[fragment shader] +#version 330 +#extension GL_ARB_shader_image_load_store: enable + +// Change this to 0 to get a control test that should fail on hardware +// without coherent L1 caches. +// +// Need volatile instead of just coherent to prevent overly smart compilers +// from moving the imageLoad/imageStore out of the loop. +#if 1 +volatile +#endif +layout(r32i) uniform iimage2D tex; +volatile layout(r32i) uniform iimage2D ctrl; +out vec4 outcolor; + +// Add a timeout so that an incorrect coherency implementation doesn't hang +// the GPU. If this timeout is too low, you can get false negative results +// because the writer thread quits before all reader threads have +// executed. +#define ITERS 10 + +void main() +{ + int id = imageAtomicAdd(ctrl, ivec2(0, 0), 1); + int orig = imageLoad(tex, ivec2(0, 0)).x; + bool done = false; + + outcolor = vec4(0.0, 0.0, 0.0, 1.0); + + for (int iter = 0; iter < ITERS && !done; ++iter) { + if (id == 0) { + imageStore(tex, ivec2(0, 0), ivec4(iter)); + if (imageLoad(ctrl, ivec2(0, 1)).x >= 256 * 256) + done = true; + } else { + int current = imageLoad(tex, ivec2(0, 0)).x; + if (current != orig) + done = true; + } + + if (done || (id == 0 && iter == 0)) + imageAtomicAdd(ctrl, ivec2(0, 1), 1); + } + + if (done) + outcolor.y = 1.0; + else + outcolor.x = 1.0; +} + +[test] +texture integer 0 (1, 2) (0, 0) GL_R32I +image texture 0 GL_R32I +texture integer 1 (1, 1) (0, 0) GL_R32I +image texture 1 GL_R32I + +uniform int ctrl 0 +uniform int tex 1 +draw rect -1 -1 2 2 + +probe all rgba 0.0 1.0 0.0 1.0 -- 2.5.0 ___ Piglit mailing list Piglit@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/piglit