Re: [Piglit] [PATCH 1/8] README: Fix spelling mistake

2016-04-16 Thread Vinson Lee
On Sat, Apr 16, 2016 at 10:28 AM, Eric Engestrom  wrote:
> On Mon, Apr 04, 2016 at 11:16:54AM -0700, Dylan Baker wrote:
>> Do you have commit access or do you need someone to push these for you?
>>
>> For the series:
>> Reviewed-by: Dylan Baker 
>
> Somehow I completely missed your reply, sorry about that.
> I don't have access, someone else will have to push it :]
>
> Cheers,
> Eric

I pushed patches 1-6 and 8.

Patch 7 no longer applies after "8dca1ba generators: Use python
generator instead of bash generated tests". Can you please provide
another version?

Vinson
___
Piglit mailing list
Piglit@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/piglit


Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test

2016-04-16 Thread Francisco Jerez
Nicolai Hähnle  writes:

> On 16.04.2016 16:01, Francisco Jerez wrote:
>> Jan Vesely  writes:
>>
>>> On Sat, 2016-04-16 at 10:19 -0500, Nicolai Hähnle wrote:
>>>> On 15.04.2016 17:12, Francisco Jerez wrote:
>>>>>>>>> For a test doing almost the same thing but not relying on
>>>>>>>>> unspecified invocation ordering, see
>>>>>>>>> "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" --
>>>>>>>>> It would be interesting to see whether you can get it to
>>>>>>>>> reproduce the GCN coherency bug using different framebuffer size
>>>>>>>>> and modulus parameters.
>>>>>>>> I tried that, but couldn't reproduce. Whether I just wasn't
>>>>>>>> thorough enough/"unlucky" or whether the in-order nature of the
>>>>>>>> hardware and L1 cache behavior just makes it impossible to fail
>>>>>>>> the shader-mem-barrier test, I'm not sure.
>>>>>>>>
>>>>>>> Now I'm curious about the exact nature of the bug ;), some sort of
>>>>>>> missing L1 cache-flushing which could potentially affect dependent
>>>>>>> invocations?
>>>>>> I'm not sure I remember everything, to be honest.
>>>>>>
>>>>>> One issue that I do remember is that load/store by default go
>>>>>> through L1, but atomics _never_ go through L1, no matter how you
>>>>>> compile them. This means that if you're working on two different
>>>>>> images, one with atomics and the other without, then the atomic one
>>>>>> will always behave coherently but the other one won't unless you
>>>>>> explicitly tell it to.
>>>>>>
>>>>>> Now that I think about this again, there should probably be a
>>>>>> shader-mem-barrier-style way to test for that particular issue in a
>>>>>> way that doesn't depend on the specifics of the parallelization.
>>>>>> Something like, in a loop:
>>>>>>
>>>>>> Thread 1: increasing imageStore into image 1 at location 1,
>>>>>> imageLoad from image 1 location 2
>>>>>>
>>>>>> Thread 2: same, but exchange locations 1 and 2
>>>>>>
>>>>>> Both threads: imageAtomicAdd on the same location in image 2
>>>>>>
>>>>>> Then each thread can check that _if_ the imageAtomicAdd detects the
>>>>>> buddy thread operating in parallel, _then_ they must also observe
>>>>>> incrementing values in the location that the buddy thread stores
>>>>>> to. Does that sound reasonable?
>>>>>>
>>>>> Yeah, that sounds reasonable, but keep in mind that even if both
>>>>> image variables are marked coherent you cannot make assumptions
>>>>> about the ordering of the image stores performed on image 1 relative
>>>>> to the atomics performed on image 2 unless there is an explicit
>>>>> barrier in between, which means that some level of L1 caching is
>>>>> legitimate even in that scenario (and might have some performance
>>>>> benefit over skipping L1 caching of coherent images altogether) --
>>>>> That's in fact the way that the i965 driver implements coherent
>>>>> image stores: We just write to L1 and flush later on to the globally
>>>>> coherent L3 on the next memoryBarrier().
>>>> Okay, adding the barrier makes sense.
>>>>
>>>>> What about a test along the lines of the current coherency test?
>>>>> Any idea what's the reason you couldn't get it to reproduce the
>>>>> issue?  Is it because threads with dependent inputs are guaranteed
>>>>> to be spawned in the same L1 cache domain as the threads that
>>>>> generated their inputs or something like that?
>>>> From what I understand (though admittedly the documentation I have on
>>>> this is not the clearest...), the hardware flushes the L1 cache
>>>> automatically at the end of each shader invocation, so that dependent
>>>> invocations are guaranteed to pick it up.
>>>
>>> The GCN whitepaper mentions both that the L1D cache is write-through
>>> and sends data to L2 at the end of all 64 WF stores (page 9), and
>>> that the L1 cache writes back data at the end of a wavefront or when
>>> a barrier is invoked (page 10).
>>>
>>
>> Interesting...  In that case I wonder if there actually was a bug to
>> test for or you could just call the non-coherent behavior of L1 writes a
>> feature? ;)
>>
>> Nicolai, do I have your Acked-by for the revert?
>
> Yes, I should have made that clearer in my first reply :)
>
Thanks, pushed.

> Anyway, there _is_ still something that needs to be tested. Whereas 
> coherency only worries about stores being propagated to dependent 
> invocations, and shader-mem-barrier only worries about stores being 
> propagated _in the correct order_, there needs to be a test that worries 
> about stores being propagated _at all_ (because of the mentioned 
> difference in GCN between atomics and load/stores). But I'll follow up 
> with such a test soon.
>

The problem is how do you define "at all" in a way independent from the
unspecified 

[Piglit] Nearly finished: shader_runner running THOUSANDS of tests per process

2016-04-16 Thread Marek Olšák
Hi,

This makes shader_runner very fast. The expected result is a 40%
decrease in quick.py running time, or a 12x faster piglit run if you
run shader tests alone.

Branch:
https://cgit.freedesktop.org/~mareko/piglit/log/?h=shader-runner

Changes:

1) Any number of test files can be specified as command-line
parameters. Those command lines can be insanely long.

2) shader_runner can re-create the window & GL context if test
requirements demand different settings when going from one test to
another.

3) all.py generates one shader_runner instance per group of tests
(usually one or two directories - tests and generated_tests).
Individual tests are reported as subtests.

The shader_runner part is done. The python part needs more work.


What's missing:

Handling of crashes. If shader_runner crashes:
- The crash is not shown in piglit results (other tests with subtests
already have the same behavior)
- The remaining tests will not be run.

The ShaderTest python class has the list of all files and should be
able to catch a crash, check how many test results have been written,
and restart shader_runner with the remaining tests.

shader_runner prints "TEST %i:" followed by the subtest result, where %i
is the index of the file in the list. Python can parse that and re-run
shader_runner with the first %i tests removed (0..%i-1 -> parse subtest
results; %i -> crash; %i+1.. -> run again).


I'm by no means a python expert, so here's an alternative solution that
would be easier for me (a rough sketch follows below):
- Catch crash signals in shader_runner.
- In the signal handler, re-run shader_runner with the remaining tests.
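
A rough sketch of what that could look like (entirely hypothetical; the
names are invented, none of this is the actual shader_runner code, and
/proc/self/exe assumes Linux):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static char **all_argv;  /* original argv, kept around for re-exec */
static int current_test; /* argv index of the test currently running */

static void
crash_handler(int sig)
{
	/* NOTE: fprintf is not async-signal-safe; a real handler would
	 * format into a static buffer and use write().  Report the crash
	 * in the "TEST %i:" protocol (argv index 1 is test 0). */
	fprintf(stderr, "TEST %d: crash (signal %d)\n", current_test - 1, sig);

	/* Skip the crashed test and re-exec with the remaining ones,
	 * reusing the crashed test's argv slot for the program name. */
	all_argv[current_test] = all_argv[0];
	execv("/proc/self/exe", &all_argv[current_test]);
	_exit(1);
}

int
main(int argc, char **argv)
{
	all_argv = argv;
	signal(SIGSEGV, crash_handler);
	signal(SIGABRT, crash_handler);

	for (current_test = 1; current_test < argc; current_test++) {
		/* run_one_test() stands in for the existing code that
		 * loads and runs a single .shader_test file. */
		/* run_one_test(argv[current_test]); */
	}
	return 0;
}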

Opinions welcome,

Marek
___
Piglit mailing list
Piglit@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/piglit


Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test

2016-04-16 Thread Jan Vesely
On Sat, 2016-04-16 at 10:19 -0500, Nicolai Hähnle wrote:
> On 15.04.2016 17:12, Francisco Jerez wrote:
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > For a test doing almost the same thing but not relying on
> > > > > > unspecified invocation ordering, see
> > > > > > "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" --
> > > > > > It would be interesting to see whether you can get it to
> > > > > > reproduce the GCN coherency bug using different framebuffer
> > > > > > size and modulus parameters.
> > > > > I tried that, but couldn't reproduce. Whether I just wasn't
> > > > > thorough enough/"unlucky" or whether the in-order nature of the
> > > > > hardware and L1 cache behavior just makes it impossible to fail
> > > > > the shader-mem-barrier test, I'm not sure.
> > > > > 
> > > > Now I'm curious about the exact nature of the bug ;), some sort of
> > > > missing L1 cache-flushing which could potentially affect dependent
> > > > invocations?
> > > I'm not sure I remember everything, to be honest.
> > > 
> > > One issue that I do remember is that load/store by default go
> > > through L1, but atomics _never_ go through L1, no matter how you
> > > compile them. This means that if you're working on two different
> > > images, one with atomics and the other without, then the atomic one
> > > will always behave coherently but the other one won't unless you
> > > explicitly tell it to.
> > > 
> > > Now that I think about this again, there should probably be a
> > > shader-mem-barrier-style way to test for that particular issue in a
> > > way that doesn't depend on the specifics of the parallelization.
> > > Something like, in a loop:
> > > 
> > > Thread 1: increasing imageStore into image 1 at location 1,
> > > imageLoad from image 1 location 2
> > > 
> > > Thread 2: same, but exchange locations 1 and 2
> > > 
> > > Both threads: imageAtomicAdd on the same location in image 2
> > > 
> > > Then each thread can check that _if_ the imageAtomicAdd detects the
> > > buddy thread operating in parallel, _then_ they must also observe
> > > incrementing values in the location that the buddy thread stores to.
> > > Does that sound reasonable?
> > > 
> > Yeah, that sounds reasonable, but keep in mind that even if both image
> > variables are marked coherent you cannot make assumptions about the
> > ordering of the image stores performed on image 1 relative to the
> > atomics performed on image 2 unless there is an explicit barrier in
> > between, which means that some level of L1 caching is legitimate even
> > in that scenario (and might have some performance benefit over
> > skipping L1 caching of coherent images altogether) -- That's in fact
> > the way that the i965 driver implements coherent image stores: We just
> > write to L1 and flush later on to the globally coherent L3 on the next
> > memoryBarrier().
> Okay, adding the barrier makes sense.
> 
> 
> > What about a test along the lines of the current coherency test?  Any
> > idea what's the reason you couldn't get it to reproduce the issue?  Is
> > it because threads with dependent inputs are guaranteed to be spawned
> > in the same L1 cache domain as the threads that generated their inputs
> > or something like that?
> From what I understand (though admittedly the documentation I have on
> this is not the clearest...), the hardware flushes the L1 cache
> automatically at the end of each shader invocation, so that dependent
> invocations are guaranteed to pick it up.

The GCN whitepaper mentions both that the L1D cache is write-through
and sends data to L2 at the end of all 64 WF stores (page 9), and that
the L1 cache writes back data at the end of a wavefront or when a
barrier is invoked (page 10).

Jan

> 
> Cheers,
> Nicolai
> ___
> Piglit mailing list
> Piglit@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/piglit


___
Piglit mailing list
Piglit@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/piglit


Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test

2016-04-16 Thread Francisco Jerez
Nicolai Hähnle  writes:

> On 15.04.2016 17:12, Francisco Jerez wrote:
>> For a test doing almost the same thing but not relying on unspecified
>> invocation ordering, see
>> "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" -- It
>> would be interesting to see whether you can get it to reproduce the GCN
>> coherency bug using different framebuffer size and modulus parameters.
>
> I tried that, but couldn't reproduce. Whether I just wasn't thorough
> enough/"unlucky" or whether the in-order nature of the hardware and L1
> cache behavior just makes it impossible to fail the shader-mem-barrier
> test, I'm not sure.
>
>>>> Now I'm curious about the exact nature of the bug ;), some sort of
>>>> missing L1 cache-flushing which could potentially affect dependent
>>>> invocations?
>>>
>>> I'm not sure I remember everything, to be honest.
>>>
>>> One issue that I do remember is that load/store by default go through
>>> L1, but atomics _never_ go through L1, no matter how you compile them.
>>> This means that if you're working on two different images, one with
>>> atomics and the other without, then the atomic one will always behave
>>> coherently but the other one won't unless you explicitly tell it to.
>>>
>>> Now that I think about this again, there should probably be a
>>> shader-mem-barrier-style way to test for that particular issue in a way
>>> that doesn't depend on the specifics of the parallelization. Something
>>> like, in a loop:
>>>
>>> Thread 1: increasing imageStore into image 1 at location 1, imageLoad
>>> from image 1 location 2
>>>
>>> Thread 2: same, but exchange locations 1 and 2
>>>
>>> Both threads: imageAtomicAdd on the same location in image 2
>>>
>>> Then each thread can check that _if_ the imageAtomicAdd detects the
>>> buddy thread operating in parallel, _then_ they must also observe
>>> incrementing values in the location that the buddy thread stores to.
>>> Does that sound reasonable?
>>>
>> Yeah, that sounds reasonable, but keep in mind that even if both image
>> variables are marked coherent you cannot make assumptions about the
>> ordering of the image stores performed on image 1 relative to the
>> atomics performed on image 2 unless there is an explicit barrier in
>> between, which means that some level of L1 caching is legitimate even in
>> that scenario (and might have some performance benefit over skipping L1
>> caching of coherent images altogether) -- That's in fact the way that
>> the i965 driver implements coherent image stores: We just write to L1
>> and flush later on to the globally coherent L3 on the next
>> memoryBarrier().
>
> Okay, adding the barrier makes sense.
>
>
>> What about a test along the lines of the current coherency test?  Any
>> idea what's the reason you couldn't get it to reproduce the issue?  Is
>> it because threads with dependent inputs are guaranteed to be spawned in
>> the same L1 cache domain as the threads that generated their inputs or
>> something like that?
>
>  From what I understand (though admittedly the documentation I have on 
> this is not the clearest...), the hardware flushes the L1 cache 
> automatically at the end of each shader invocation, so that dependent 
> invocations are guaranteed to pick it up.
>
Ah, interesting.  What about memoryBarrier()?  Does that cause the
back-end compiler to emit an L1 cache flush of some sort?

> Cheers,
> Nicolai


___
Piglit mailing list
Piglit@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/piglit


Re: [Piglit] [PATCH 1/8] README: Fix spelling mistake

2016-04-16 Thread Eric Engestrom
On Mon, Apr 04, 2016 at 11:16:54AM -0700, Dylan Baker wrote:
> Do you have commit access or do you need someone to push these for you?
> 
> For the series:
> Reviewed-by: Dylan Baker 

Somehow I completely missed your reply, sorry about that.
I don't have access, someone else will have to push it :]

Cheers,
Eric
___
Piglit mailing list
Piglit@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/piglit


Re: [Piglit] [PATCH 5/8] arb_shader_image_load_store: add additional coherency test

2016-04-16 Thread Nicolai Hähnle

On 15.04.2016 17:12, Francisco Jerez wrote:
>>>>> For a test doing almost the same thing but not relying on
>>>>> unspecified invocation ordering, see
>>>>> "tests/spec/arb_shader_image_load_store/shader-mem-barrier.c" -- It
>>>>> would be interesting to see whether you can get it to reproduce the
>>>>> GCN coherency bug using different framebuffer size and modulus
>>>>> parameters.
>>>>
>>>> I tried that, but couldn't reproduce. Whether I just wasn't thorough
>>>> enough/"unlucky" or whether the in-order nature of the hardware and
>>>> L1 cache behavior just makes it impossible to fail the
>>>> shader-mem-barrier test, I'm not sure.
>>>
>>> Now I'm curious about the exact nature of the bug ;), some sort of
>>> missing L1 cache-flushing which could potentially affect dependent
>>> invocations?
>>
>> I'm not sure I remember everything, to be honest.
>>
>> One issue that I do remember is that load/store by default go through
>> L1, but atomics _never_ go through L1, no matter how you compile them.
>> This means that if you're working on two different images, one with
>> atomics and the other without, then the atomic one will always behave
>> coherently but the other one won't unless you explicitly tell it to.
>>
>> Now that I think about this again, there should probably be a
>> shader-mem-barrier-style way to test for that particular issue in a
>> way that doesn't depend on the specifics of the parallelization.
>> Something like, in a loop:
>>
>> Thread 1: increasing imageStore into image 1 at location 1, imageLoad
>> from image 1 location 2
>>
>> Thread 2: same, but exchange locations 1 and 2
>>
>> Both threads: imageAtomicAdd on the same location in image 2
>>
>> Then each thread can check that _if_ the imageAtomicAdd detects the
>> buddy thread operating in parallel, _then_ they must also observe
>> incrementing values in the location that the buddy thread stores to.
>> Does that sound reasonable?
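
For concreteness, a minimal sketch of what such a shader could look
like, written piglit-style as a GLSL string in C. It is purely
illustrative: the image names, the per-pair counter location and the
xor-based buddy pairing are invented, and it already includes the
explicit memoryBarrier() between store and atomic that the reply just
below asks for.

/* Hypothetical sketch of the proposed checker.  Each invocation stores
 * an increasing counter into its own texel of image 1, bumps a per-pair
 * atomic counter in image 2, and verifies that whenever the shared
 * counter proves its buddy completed k iterations, the buddy's store of
 * a value >= k is visible too. */
static const char *proposed_fs_source =
	"#version 150\n"
	"#extension GL_ARB_shader_image_load_store : require\n"
	"layout(r32ui) coherent uniform uimage2D img1;\n"
	"layout(r32ui) coherent uniform uimage2D img2;\n"
	"uniform int iterations;\n"
	"out vec4 color;\n"
	"void main()\n"
	"{\n"
	"    ivec2 self = ivec2(gl_FragCoord.xy);\n"
	"    ivec2 buddy = ivec2(self.x ^ 1, self.y); /* horizontal pairs */\n"
	"    ivec2 ctr = ivec2(self.x >> 1, self.y);  /* shared per pair */\n"
	"    bool ok = true;\n"
	"    for (int i = 1; i <= iterations; i++) {\n"
	"        imageStore(img1, self, uvec4(uint(i)));\n"
	"        memoryBarrier(); /* order the store before the atomic */\n"
	"        uint before = imageAtomicAdd(img2, ctr, 1u);\n"
	"        /* 'before' counts both invocations' earlier adds; minus\n"
	"         * our own i-1 adds it bounds the buddy's progress. */\n"
	"        uint buddy_adds = before - uint(i - 1);\n"
	"        if (imageLoad(img1, buddy).x < buddy_adds)\n"
	"            ok = false; /* buddy's store was not visible */\n"
	"    }\n"
	"    color = ok ? vec4(0.0, 1.0, 0.0, 1.0) : vec4(1.0, 0.0, 0.0, 1.0);\n"
	"}\n";

Note that a pair whose buddy never runs concurrently passes trivially
(buddy_adds stays 0), so the check only fires when the atomic actually
observes concurrency.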


> Yeah, that sounds reasonable, but keep in mind that even if both image
> variables are marked coherent you cannot make assumptions about the
> ordering of the image stores performed on image 1 relative to the
> atomics performed on image 2 unless there is an explicit barrier in
> between, which means that some level of L1 caching is legitimate even
> in that scenario (and might have some performance benefit over skipping
> L1 caching of coherent images altogether) -- That's in fact the way
> that the i965 driver implements coherent image stores: We just write to
> L1 and flush later on to the globally coherent L3 on the next
> memoryBarrier().

Okay, adding the barrier makes sense.

> What about a test along the lines of the current coherency test?  Any
> idea what's the reason you couldn't get it to reproduce the issue?  Is
> it because threads with dependent inputs are guaranteed to be spawned
> in the same L1 cache domain as the threads that generated their inputs
> or something like that?

From what I understand (though admittedly the documentation I have on
this is not the clearest...), the hardware flushes the L1 cache
automatically at the end of each shader invocation, so that dependent
invocations are guaranteed to pick it up.


Cheers,
Nicolai
___
Piglit mailing list
Piglit@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/piglit


[Piglit] [PATCH] util: probe pixel rectangles as ubyte if possible (v2)

2016-04-16 Thread Marek Olšák
From: Marek Olšák 

v2: int -> bool, unify all new functions into one
---
 tests/util/piglit-util-gl.c | 94 +
 1 file changed, 94 insertions(+)
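
One non-obvious detail in the helpers below: expected colors are
truncated when converted to bytes, but the tolerance is converted with
ceil(). A standalone sketch (not part of the patch) of why the tolerance
has to round up:

/* Standalone illustration: truncating a float tolerance when converting
 * it to bytes would silently tighten the test; rounding up keeps the
 * ubyte fast path at least as permissive as the float path. */
#include <math.h>
#include <stdio.h>

int
main(void)
{
	float tol = 2.5f / 255.0f; /* a hypothetical per-channel tolerance */
	unsigned truncated = (unsigned)(tol * 255.0f);       /* == 2, too strict */
	unsigned rounded_up = (unsigned)ceilf(tol * 255.0f); /* == 3, safe */

	printf("truncated: %u, rounded up: %u\n", truncated, rounded_up);
	return 0;
}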

diff --git a/tests/util/piglit-util-gl.c b/tests/util/piglit-util-gl.c
index e8a47c5..09eb99d 100644
--- a/tests/util/piglit-util-gl.c
+++ b/tests/util/piglit-util-gl.c
@@ -1060,6 +1060,23 @@ piglit_read_pixels_float(GLint x, GLint y, GLsizei width, GLsizei height,
 	return pixels;
 }
 
+static bool
+piglit_can_probe_ubyte()
+{
+	int r, g, b, a;
+
+	glGetIntegerv(GL_RED_BITS, &r);
+	glGetIntegerv(GL_GREEN_BITS, &g);
+	glGetIntegerv(GL_BLUE_BITS, &b);
+	glGetIntegerv(GL_ALPHA_BITS, &a);
+
+	/* it could be LUMINANCE32F, etc. */
+	if (!r && !g && !b && !a)
+		return false;
+
+	return r <= 8 && g <= 8 && b <= 8 && a <= 8;
+}
+
 int
 piglit_probe_pixel_rgb_silent(int x, int y, const float* expected, float *out_probe)
 {
@@ -1158,6 +1175,74 @@ piglit_probe_pixel_rgba(int x, int y, const float* expected)
 	return 0;
 }
 
+static void
+piglit_array_float_to_ubyte(int n, const float *f, GLubyte *b)
+{
+	int i;
+
+	for (i = 0; i < n; i++)
+		b[i] = f[i] * 255;
+}
+
+static void
+piglit_array_float_to_ubyte_roundup(int n, const float *f, GLubyte *b)
+{
+	int i;
+
+	for (i = 0; i < n; i++)
+		b[i] = ceil(f[i] * 255);
+}
+
+static bool
+piglit_probe_rect_ubyte(int x, int y, int w, int h, int num_components,
+			const float *fexpected, bool silent)
+{
+	int i, j, p;
+	GLubyte *probe;
+	GLubyte *pixels;
+	GLubyte tolerance[4];
+	GLubyte expected[4];
+
+	piglit_array_float_to_ubyte_roundup(num_components, piglit_tolerance, tolerance);
+	piglit_array_float_to_ubyte(num_components, fexpected, expected);
+
+	/* RGBA readbacks are likely to be faster */
+	pixels = malloc(w*h*4);
+	glReadPixels(x, y, w, h, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
+
+	for (j = 0; j < h; j++) {
+		for (i = 0; i < w; i++) {
+			probe = &pixels[(j*w+i)*4];
+
+			for (p = 0; p < num_components; ++p) {
+				if (abs((int)probe[p] - (int)expected[p]) >= tolerance[p]) {
+					if (!silent) {
+						printf("Probe color at (%i,%i)\n", x+i, y+j);
+						if (num_components == 4) {
+							printf("  Expected: %u %u %u %u\n",
+							       expected[0], expected[1],
+							       expected[2], expected[3]);
+							printf("  Observed: %u %u %u %u\n",
+							       probe[0], probe[1], probe[2], probe[3]);
+						} else {
+							printf("  Expected: %u %u %u\n",
+							       expected[0], expected[1],
+							       expected[2]);
+							printf("  Observed: %u %u %u\n",
+							       probe[0], probe[1], probe[2]);
+						}
+					}
+					free(pixels);
+					return false;
+				}
+			}
+		}
+	}
+
+	free(pixels);
+	return true;
+}
+
 int
 piglit_probe_rect_rgb_silent(int x, int y, int w, int h, const float *expected)
 {
@@ -1165,6 +1250,9 @@ piglit_probe_rect_rgb_silent(int x, int y, int w, int h, const float *expected)
 	GLfloat *probe;
 	GLfloat *pixels;
 
+	if (piglit_can_probe_ubyte())
+		return piglit_probe_rect_ubyte(x, y, w, h, 3, expected, true);
+
 	pixels = piglit_read_pixels_float(x, y, w, h, GL_RGB, NULL);
 
 	for (j = 0; j < h; j++) {
@@ -1223,6 +1311,9 @@ piglit_probe_rect_rgb(int x, int y, int w, int h, const float *expected)
 	GLfloat *probe;
 	GLfloat *pixels;
 
+	if (piglit_can_probe_ubyte())
+		return piglit_probe_rect_ubyte(x, y, w, h, 3, expected, false);
+
 	pixels = piglit_read_pixels_float(x, y, w, h, GL_RGBA, NULL);
 
 	for (j = 0; j < h; j++) {
@@ -1290,6 +1381,9 @@ piglit_probe_rect_rgba(int x, int y, int w, int h, const float *expected)
 	GLfloat *probe;
 	GLfloat *pixels;
 
+	if (piglit_can_probe_ubyte())
+		return piglit_probe_rect_ubyte(x, y, w, h, 4, expected, false);
+
 	pixels =