Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation

2014-12-02 Thread Neil Roberts
Ok, I've written a somewhat contrived test case here:

https://github.com/bpeel/glthing/tree/time-attribs

(Make sure to use the time-attribs branch)

The example draws a 1000 single-pixel points each with a separate draw
call. Each call uses a separate but identical VAO so that the driver
will be be forced to emit the vertices state for each point. Each vertex
uses the maximum number of vertex attributes as returned by
GL_MAX_VERTEX_ATTRIBS. All of the attributes are used to determine a
color value in the vertex shader. Normally it will order the attributes
in memory so that the first one is in generic attribute 0, the second in
1 and so on. However if you pass the option ‘backwards’ on the command
line it will put them in reverse order. With git master, if the
attributes are given in order then it will generate a
3DSTATE_VERTEX_BUFFERS command with a single buffer and a single
relocation otherwise it will generate one for each attribute.

I ran the test with each of these three versions of Mesa and noted the
FPS. This is based on top of commit 29c7cf2b2 with -O3 and
--disable-debug. libdrm is 00847fa4 with -O3.

1) Mesa master

2) Master with my patch applied

3) The original optimization removed completely so that it will always
   generate a buffer relocation for every attribute.

The test was run with LIBGL_SHOW_FPS=1 vblank_mode=0 on my Haswell
laptop. The results are:

attributes are  │  master  with patch  optimization removed
┼──
in order│   820   560  325
out of order│   325   540  325

The FPS fluctuated by around 20 FPS either side so I've just noted down
what looked like an approximate representation.

So I guess the results are that yes, in this extreme case having more
relocations makes a big difference but also doing the qsort is quite
expensive. The original optimization does seem worth doing.

It might be worth making a simpler hard-coded implementation of
quicksort because calling qsort is probably not very sensible for such a
small array and the function call overhead for each comparison is
probably quite a bit.

It would also probably be good to see if this difference is noticeable
in a real use case.

- Neil
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation

2014-12-02 Thread Ben Widawsky
On Tue, Dec 02, 2014 at 04:17:35PM +, Neil Roberts wrote:
 Ok, I've written a somewhat contrived test case here:
 
 https://github.com/bpeel/glthing/tree/time-attribs
 
 (Make sure to use the time-attribs branch)
 
 The example draws a 1000 single-pixel points each with a separate draw
 call. Each call uses a separate but identical VAO so that the driver
 will be be forced to emit the vertices state for each point. Each vertex
 uses the maximum number of vertex attributes as returned by
 GL_MAX_VERTEX_ATTRIBS. All of the attributes are used to determine a
 color value in the vertex shader. Normally it will order the attributes
 in memory so that the first one is in generic attribute 0, the second in
 1 and so on. However if you pass the option ‘backwards’ on the command
 line it will put them in reverse order. With git master, if the
 attributes are given in order then it will generate a
 3DSTATE_VERTEX_BUFFERS command with a single buffer and a single
 relocation otherwise it will generate one for each attribute.
 
 I ran the test with each of these three versions of Mesa and noted the
 FPS. This is based on top of commit 29c7cf2b2 with -O3 and
 --disable-debug. libdrm is 00847fa4 with -O3.
 
 1) Mesa master
 
 2) Master with my patch applied
 
 3) The original optimization removed completely so that it will always
generate a buffer relocation for every attribute.
 
 The test was run with LIBGL_SHOW_FPS=1 vblank_mode=0 on my Haswell
 laptop. The results are:
 
 attributes are  │  master  with patch  optimization removed
 ┼──
 in order│   820   560  325
 out of order│   325   540  325
 
 The FPS fluctuated by around 20 FPS either side so I've just noted down
 what looked like an approximate representation.
 
 So I guess the results are that yes, in this extreme case having more
 relocations makes a big difference but also doing the qsort is quite
 expensive. The original optimization does seem worth doing.
 
 It might be worth making a simpler hard-coded implementation of
 quicksort because calling qsort is probably not very sensible for such a
 small array and the function call overhead for each comparison is
 probably quite a bit.
 
 It would also probably be good to see if this difference is noticeable
 in a real use case.
 
 - Neil

Cool. My statement was really getting at there normally won't be many duplicated
relocations. What kind of numbers do you see as you scale down the number of
points?
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation

2014-12-02 Thread Ian Romanick
On 12/02/2014 08:17 AM, Neil Roberts wrote:
 Ok, I've written a somewhat contrived test case here:
 
 https://github.com/bpeel/glthing/tree/time-attribs
 
 (Make sure to use the time-attribs branch)
 
 The example draws a 1000 single-pixel points each with a separate draw
 call. Each call uses a separate but identical VAO so that the driver
 will be be forced to emit the vertices state for each point. Each vertex
 uses the maximum number of vertex attributes as returned by
 GL_MAX_VERTEX_ATTRIBS. All of the attributes are used to determine a
 color value in the vertex shader. Normally it will order the attributes
 in memory so that the first one is in generic attribute 0, the second in
 1 and so on. However if you pass the option ‘backwards’ on the command
 line it will put them in reverse order. With git master, if the
 attributes are given in order then it will generate a
 3DSTATE_VERTEX_BUFFERS command with a single buffer and a single
 relocation otherwise it will generate one for each attribute.
 
 I ran the test with each of these three versions of Mesa and noted the
 FPS. This is based on top of commit 29c7cf2b2 with -O3 and
 --disable-debug. libdrm is 00847fa4 with -O3.
 
 1) Mesa master
 
 2) Master with my patch applied
 
 3) The original optimization removed completely so that it will always
generate a buffer relocation for every attribute.
 
 The test was run with LIBGL_SHOW_FPS=1 vblank_mode=0 on my Haswell
 laptop. The results are:
 
 attributes are  │  master  with patch  optimization removed
 ┼──
 in order│   820   560  325
 out of order│   325   540  325
 
 The FPS fluctuated by around 20 FPS either side so I've just noted down
 what looked like an approximate representation.

Try ministat.  http://anholt.net/compare-perf/

 So I guess the results are that yes, in this extreme case having more
 relocations makes a big difference but also doing the qsort is quite
 expensive. The original optimization does seem worth doing.
 
 It might be worth making a simpler hard-coded implementation of
 quicksort because calling qsort is probably not very sensible for such a
 small array and the function call overhead for each comparison is
 probably quite a bit.

The number of elements is generally small, and the maximum, currently,
is 16.  Quite a bit of work has been done on generating (provably)
optimal comparison sorts using sorting networks (Google sorting network
generator).  You should be able to hack something up for just the
number of vertex attributes used by your test.

 It would also probably be good to see if this difference is noticeable
 in a real use case.

Also that. :)

 - Neil
 ___
 mesa-dev mailing list
 mesa-dev@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/mesa-dev

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation

2014-12-02 Thread Ian Romanick
On 12/02/2014 08:17 AM, Neil Roberts wrote:
 Ok, I've written a somewhat contrived test case here:
 
 https://github.com/bpeel/glthing/tree/time-attribs
 
 (Make sure to use the time-attribs branch)
 
 The example draws a 1000 single-pixel points each with a separate draw
 call. Each call uses a separate but identical VAO so that the driver
 will be be forced to emit the vertices state for each point. Each vertex
 uses the maximum number of vertex attributes as returned by
 GL_MAX_VERTEX_ATTRIBS. All of the attributes are used to determine a
 color value in the vertex shader. Normally it will order the attributes
 in memory so that the first one is in generic attribute 0, the second in
 1 and so on. However if you pass the option ‘backwards’ on the command
 line it will put them in reverse order. With git master, if the
 attributes are given in order then it will generate a
 3DSTATE_VERTEX_BUFFERS command with a single buffer and a single
 relocation otherwise it will generate one for each attribute.
 
 I ran the test with each of these three versions of Mesa and noted the
 FPS. This is based on top of commit 29c7cf2b2 with -O3 and
 --disable-debug. libdrm is 00847fa4 with -O3.
 
 1) Mesa master
 
 2) Master with my patch applied
 
 3) The original optimization removed completely so that it will always
generate a buffer relocation for every attribute.
 
 The test was run with LIBGL_SHOW_FPS=1 vblank_mode=0 on my Haswell
 laptop. The results are:
 
 attributes are  │  master  with patch  optimization removed
 ┼──
 in order│   820   560  325
 out of order│   325   540  325

Also... what is the affect of the optimization when the relocations
cannot be merged?  It should be easy enough to modify the test to get
that data as well.

 The FPS fluctuated by around 20 FPS either side so I've just noted down
 what looked like an approximate representation.
 
 So I guess the results are that yes, in this extreme case having more
 relocations makes a big difference but also doing the qsort is quite
 expensive. The original optimization does seem worth doing.
 
 It might be worth making a simpler hard-coded implementation of
 quicksort because calling qsort is probably not very sensible for such a
 small array and the function call overhead for each comparison is
 probably quite a bit.
 
 It would also probably be good to see if this difference is noticeable
 in a real use case.
 
 - Neil
 ___
 mesa-dev mailing list
 mesa-dev@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/mesa-dev
 

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


[Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation

2014-12-01 Thread Neil Roberts
When submitting the vertex buffers the i965 driver will try to recognise when
multiple attributes are using the same buffer so that it can submit a single
relocation for it and set a different offset in the attribute. Previously
however if the application happens to have the attributes in a struct with an
order that doesn't match the order they are listed in the gl_vert_attrib enum
then the loop would end up processing the attributes with a greater offset
first and the optimisation wouldn't be used.

To make the optmisation more likely to be used this patch makes it always
process the elements in increasing order of offset. This is done copying the
element pointers into a separate array and sorting it with qsort. This only
affects the order that the elements are processed and doesn't change the order
that they are submitted to the hardware.
---

I noticed this problem by inspection but I don't have a good feel for
how important avoiding the buffer relocations is so I don't know
whether the patch makes much sense. However I think the added overhead
to sort the attributes is minimal so I don't think it can do much
harm.

I've run the patch through Piglit on IvyBridge and there are no
regressions. I've also verified that it does end up generating two
buffer relocations if you put the attributes in the wrong order with
gdb and also that the patch fixes it.

 src/mesa/drivers/dri/i965/brw_draw_upload.c | 38 ++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_draw_upload.c 
b/src/mesa/drivers/dri/i965/brw_draw_upload.c
index 7bf9163..8b60563 100644
--- a/src/mesa/drivers/dri/i965/brw_draw_upload.c
+++ b/src/mesa/drivers/dri/i965/brw_draw_upload.c
@@ -396,6 +396,24 @@ copy_array_to_vbo_array(struct brw_context *brw,
buffer-stride = dst_stride;
 }
 
+static int
+compare_array_ptr(const void *a, const void *b)
+{
+   struct brw_vertex_element *input_a =
+  *(struct brw_vertex_element * const *) a;
+   struct brw_vertex_element *input_b =
+  *(struct brw_vertex_element * const *) b;
+   const GLubyte *ptr_a = input_a-glarray-Ptr;
+   const GLubyte *ptr_b = input_b-glarray-Ptr;
+
+   if (ptr_a  ptr_b)
+  return -1;
+   else if (ptr_a  ptr_b)
+  return 1;
+   else
+  return 0;
+}
+
 void
 brw_prepare_vertices(struct brw_context *brw)
 {
@@ -409,6 +427,7 @@ brw_prepare_vertices(struct brw_context *brw)
int delta, i, j;
 
struct brw_vertex_element *upload[VERT_ATTRIB_MAX];
+   struct brw_vertex_element *sorted[VERT_ATTRIB_MAX];
GLuint nr_uploads = 0;
 
/* _NEW_POLYGON
@@ -442,8 +461,21 @@ brw_prepare_vertices(struct brw_context *brw)
if (brw-vb.nr_buffers)
   return;
 
+   /* In the loop below if it finds an element that is using the same buffer
+* as a previous element then it will reuse the same buffer relocation.
+* However it will only work if the offset for the previous element is less
+* than the offset for the new element and the difference is less than the
+* stride. In order to increase the chances of hitting this optimisation
+* the elements will be processed in increasing order of offset by first
+* sorting the pointers.
+*/
+   memcpy(sorted, brw-vb.enabled,
+  sizeof(sorted[0]) * brw-vb.nr_enabled);
+   qsort(sorted, brw-vb.nr_enabled, sizeof(sorted[0]),
+ compare_array_ptr);
+
for (i = j = 0; i  brw-vb.nr_enabled; i++) {
-  struct brw_vertex_element *input = brw-vb.enabled[i];
+  struct brw_vertex_element *input = sorted[i];
   const struct gl_client_array *glarray = input-glarray;
 
   if (_mesa_is_bufferobj(glarray-BufferObj)) {
@@ -456,13 +488,13 @@ brw_prepare_vertices(struct brw_context *brw)
  * relocations.
  */
 for (k = 0; k  i; k++) {
-   const struct gl_client_array *other = brw-vb.enabled[k]-glarray;
+   const struct gl_client_array *other = sorted[k]-glarray;
if (glarray-BufferObj == other-BufferObj 
glarray-StrideB == other-StrideB 
glarray-InstanceDivisor == other-InstanceDivisor 
(uintptr_t)(glarray-Ptr - other-Ptr)  glarray-StrideB)
{
-  input-buffer = brw-vb.enabled[k]-buffer;
+  input-buffer = sorted[k]-buffer;
   input-offset = glarray-Ptr - other-Ptr;
   break;
}
-- 
1.9.3

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation

2014-12-01 Thread Eric Anholt
Neil Roberts n...@linux.intel.com writes:

 When submitting the vertex buffers the i965 driver will try to recognise when
 multiple attributes are using the same buffer so that it can submit a single
 relocation for it and set a different offset in the attribute. Previously
 however if the application happens to have the attributes in a struct with an
 order that doesn't match the order they are listed in the gl_vert_attrib enum
 then the loop would end up processing the attributes with a greater offset
 first and the optimisation wouldn't be used.

 To make the optmisation more likely to be used this patch makes it always
 process the elements in increasing order of offset. This is done copying the
 element pointers into a separate array and sorting it with qsort. This only
 affects the order that the elements are processed and doesn't change the order
 that they are submitted to the hardware.

I would expect the sorting to cost more than the potential win you're
looking for.  Performance patches should come with performance data.


signature.asc
Description: PGP signature
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation

2014-12-01 Thread Ben Widawsky
On Mon, Dec 01, 2014 at 05:37:27PM -0800, Eric Anholt wrote:
 Neil Roberts n...@linux.intel.com writes:
 
  When submitting the vertex buffers the i965 driver will try to recognise 
  when
  multiple attributes are using the same buffer so that it can submit a single
  relocation for it and set a different offset in the attribute. Previously
  however if the application happens to have the attributes in a struct with 
  an
  order that doesn't match the order they are listed in the gl_vert_attrib 
  enum
  then the loop would end up processing the attributes with a greater offset
  first and the optimisation wouldn't be used.
 
  To make the optmisation more likely to be used this patch makes it always
  process the elements in increasing order of offset. This is done copying the
  element pointers into a separate array and sorting it with qsort. This only
  affects the order that the elements are processed and doesn't change the 
  order
  that they are submitted to the hardware.
 
 I would expect the sorting to cost more than the potential win you're
 looking for.  Performance patches should come with performance data.

In fact, I'd be interested to see numbers with the existing logic was removed.

 ___
 mesa-dev mailing list
 mesa-dev@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/mesa-dev

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation

2014-12-01 Thread Kenneth Graunke
On Monday, December 01, 2014 08:23:32 PM Neil Roberts wrote:
 When submitting the vertex buffers the i965 driver will try to recognise when
 multiple attributes are using the same buffer so that it can submit a single
 relocation for it and set a different offset in the attribute. Previously
 however if the application happens to have the attributes in a struct with an
 order that doesn't match the order they are listed in the gl_vert_attrib enum
 then the loop would end up processing the attributes with a greater offset
 first and the optimisation wouldn't be used.
 
 To make the optmisation more likely to be used this patch makes it always
 process the elements in increasing order of offset. This is done copying the
 element pointers into a separate array and sorting it with qsort. This only
 affects the order that the elements are processed and doesn't change the order
 that they are submitted to the hardware.
 ---
 
 I noticed this problem by inspection but I don't have a good feel for
 how important avoiding the buffer relocations is so I don't know
 whether the patch makes much sense. However I think the added overhead
 to sort the attributes is minimal so I don't think it can do much
 harm.

My suspicion is that avoiding the couple of extra relocations shouldn't
really matter that much, and it may even be worth just deleting the existing
optimization.  I'm definitely skeptical of adding /more/ code, since this
is pretty much the hottest path in the driver.

But I'd definitely be interested in any performance data you find.

--Ken

signature.asc
Description: This is a digitally signed message part.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev