Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation
Ok, I've written a somewhat contrived test case here: https://github.com/bpeel/glthing/tree/time-attribs (Make sure to use the time-attribs branch) The example draws a 1000 single-pixel points each with a separate draw call. Each call uses a separate but identical VAO so that the driver will be be forced to emit the vertices state for each point. Each vertex uses the maximum number of vertex attributes as returned by GL_MAX_VERTEX_ATTRIBS. All of the attributes are used to determine a color value in the vertex shader. Normally it will order the attributes in memory so that the first one is in generic attribute 0, the second in 1 and so on. However if you pass the option ‘backwards’ on the command line it will put them in reverse order. With git master, if the attributes are given in order then it will generate a 3DSTATE_VERTEX_BUFFERS command with a single buffer and a single relocation otherwise it will generate one for each attribute. I ran the test with each of these three versions of Mesa and noted the FPS. This is based on top of commit 29c7cf2b2 with -O3 and --disable-debug. libdrm is 00847fa4 with -O3. 1) Mesa master 2) Master with my patch applied 3) The original optimization removed completely so that it will always generate a buffer relocation for every attribute. The test was run with LIBGL_SHOW_FPS=1 vblank_mode=0 on my Haswell laptop. The results are: attributes are │ master with patch optimization removed ┼── in order│ 820 560 325 out of order│ 325 540 325 The FPS fluctuated by around 20 FPS either side so I've just noted down what looked like an approximate representation. So I guess the results are that yes, in this extreme case having more relocations makes a big difference but also doing the qsort is quite expensive. The original optimization does seem worth doing. It might be worth making a simpler hard-coded implementation of quicksort because calling qsort is probably not very sensible for such a small array and the function call overhead for each comparison is probably quite a bit. It would also probably be good to see if this difference is noticeable in a real use case. - Neil ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation
On Tue, Dec 02, 2014 at 04:17:35PM +, Neil Roberts wrote: Ok, I've written a somewhat contrived test case here: https://github.com/bpeel/glthing/tree/time-attribs (Make sure to use the time-attribs branch) The example draws a 1000 single-pixel points each with a separate draw call. Each call uses a separate but identical VAO so that the driver will be be forced to emit the vertices state for each point. Each vertex uses the maximum number of vertex attributes as returned by GL_MAX_VERTEX_ATTRIBS. All of the attributes are used to determine a color value in the vertex shader. Normally it will order the attributes in memory so that the first one is in generic attribute 0, the second in 1 and so on. However if you pass the option ‘backwards’ on the command line it will put them in reverse order. With git master, if the attributes are given in order then it will generate a 3DSTATE_VERTEX_BUFFERS command with a single buffer and a single relocation otherwise it will generate one for each attribute. I ran the test with each of these three versions of Mesa and noted the FPS. This is based on top of commit 29c7cf2b2 with -O3 and --disable-debug. libdrm is 00847fa4 with -O3. 1) Mesa master 2) Master with my patch applied 3) The original optimization removed completely so that it will always generate a buffer relocation for every attribute. The test was run with LIBGL_SHOW_FPS=1 vblank_mode=0 on my Haswell laptop. The results are: attributes are │ master with patch optimization removed ┼── in order│ 820 560 325 out of order│ 325 540 325 The FPS fluctuated by around 20 FPS either side so I've just noted down what looked like an approximate representation. So I guess the results are that yes, in this extreme case having more relocations makes a big difference but also doing the qsort is quite expensive. The original optimization does seem worth doing. It might be worth making a simpler hard-coded implementation of quicksort because calling qsort is probably not very sensible for such a small array and the function call overhead for each comparison is probably quite a bit. It would also probably be good to see if this difference is noticeable in a real use case. - Neil Cool. My statement was really getting at there normally won't be many duplicated relocations. What kind of numbers do you see as you scale down the number of points? ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation
On 12/02/2014 08:17 AM, Neil Roberts wrote: Ok, I've written a somewhat contrived test case here: https://github.com/bpeel/glthing/tree/time-attribs (Make sure to use the time-attribs branch) The example draws a 1000 single-pixel points each with a separate draw call. Each call uses a separate but identical VAO so that the driver will be be forced to emit the vertices state for each point. Each vertex uses the maximum number of vertex attributes as returned by GL_MAX_VERTEX_ATTRIBS. All of the attributes are used to determine a color value in the vertex shader. Normally it will order the attributes in memory so that the first one is in generic attribute 0, the second in 1 and so on. However if you pass the option ‘backwards’ on the command line it will put them in reverse order. With git master, if the attributes are given in order then it will generate a 3DSTATE_VERTEX_BUFFERS command with a single buffer and a single relocation otherwise it will generate one for each attribute. I ran the test with each of these three versions of Mesa and noted the FPS. This is based on top of commit 29c7cf2b2 with -O3 and --disable-debug. libdrm is 00847fa4 with -O3. 1) Mesa master 2) Master with my patch applied 3) The original optimization removed completely so that it will always generate a buffer relocation for every attribute. The test was run with LIBGL_SHOW_FPS=1 vblank_mode=0 on my Haswell laptop. The results are: attributes are │ master with patch optimization removed ┼── in order│ 820 560 325 out of order│ 325 540 325 The FPS fluctuated by around 20 FPS either side so I've just noted down what looked like an approximate representation. Try ministat. http://anholt.net/compare-perf/ So I guess the results are that yes, in this extreme case having more relocations makes a big difference but also doing the qsort is quite expensive. The original optimization does seem worth doing. It might be worth making a simpler hard-coded implementation of quicksort because calling qsort is probably not very sensible for such a small array and the function call overhead for each comparison is probably quite a bit. The number of elements is generally small, and the maximum, currently, is 16. Quite a bit of work has been done on generating (provably) optimal comparison sorts using sorting networks (Google sorting network generator). You should be able to hack something up for just the number of vertex attributes used by your test. It would also probably be good to see if this difference is noticeable in a real use case. Also that. :) - Neil ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation
On 12/02/2014 08:17 AM, Neil Roberts wrote: Ok, I've written a somewhat contrived test case here: https://github.com/bpeel/glthing/tree/time-attribs (Make sure to use the time-attribs branch) The example draws a 1000 single-pixel points each with a separate draw call. Each call uses a separate but identical VAO so that the driver will be be forced to emit the vertices state for each point. Each vertex uses the maximum number of vertex attributes as returned by GL_MAX_VERTEX_ATTRIBS. All of the attributes are used to determine a color value in the vertex shader. Normally it will order the attributes in memory so that the first one is in generic attribute 0, the second in 1 and so on. However if you pass the option ‘backwards’ on the command line it will put them in reverse order. With git master, if the attributes are given in order then it will generate a 3DSTATE_VERTEX_BUFFERS command with a single buffer and a single relocation otherwise it will generate one for each attribute. I ran the test with each of these three versions of Mesa and noted the FPS. This is based on top of commit 29c7cf2b2 with -O3 and --disable-debug. libdrm is 00847fa4 with -O3. 1) Mesa master 2) Master with my patch applied 3) The original optimization removed completely so that it will always generate a buffer relocation for every attribute. The test was run with LIBGL_SHOW_FPS=1 vblank_mode=0 on my Haswell laptop. The results are: attributes are │ master with patch optimization removed ┼── in order│ 820 560 325 out of order│ 325 540 325 Also... what is the affect of the optimization when the relocations cannot be merged? It should be easy enough to modify the test to get that data as well. The FPS fluctuated by around 20 FPS either side so I've just noted down what looked like an approximate representation. So I guess the results are that yes, in this extreme case having more relocations makes a big difference but also doing the qsort is quite expensive. The original optimization does seem worth doing. It might be worth making a simpler hard-coded implementation of quicksort because calling qsort is probably not very sensible for such a small array and the function call overhead for each comparison is probably quite a bit. It would also probably be good to see if this difference is noticeable in a real use case. - Neil ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
[Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation
When submitting the vertex buffers the i965 driver will try to recognise when multiple attributes are using the same buffer so that it can submit a single relocation for it and set a different offset in the attribute. Previously however if the application happens to have the attributes in a struct with an order that doesn't match the order they are listed in the gl_vert_attrib enum then the loop would end up processing the attributes with a greater offset first and the optimisation wouldn't be used. To make the optmisation more likely to be used this patch makes it always process the elements in increasing order of offset. This is done copying the element pointers into a separate array and sorting it with qsort. This only affects the order that the elements are processed and doesn't change the order that they are submitted to the hardware. --- I noticed this problem by inspection but I don't have a good feel for how important avoiding the buffer relocations is so I don't know whether the patch makes much sense. However I think the added overhead to sort the attributes is minimal so I don't think it can do much harm. I've run the patch through Piglit on IvyBridge and there are no regressions. I've also verified that it does end up generating two buffer relocations if you put the attributes in the wrong order with gdb and also that the patch fixes it. src/mesa/drivers/dri/i965/brw_draw_upload.c | 38 ++--- 1 file changed, 35 insertions(+), 3 deletions(-) diff --git a/src/mesa/drivers/dri/i965/brw_draw_upload.c b/src/mesa/drivers/dri/i965/brw_draw_upload.c index 7bf9163..8b60563 100644 --- a/src/mesa/drivers/dri/i965/brw_draw_upload.c +++ b/src/mesa/drivers/dri/i965/brw_draw_upload.c @@ -396,6 +396,24 @@ copy_array_to_vbo_array(struct brw_context *brw, buffer-stride = dst_stride; } +static int +compare_array_ptr(const void *a, const void *b) +{ + struct brw_vertex_element *input_a = + *(struct brw_vertex_element * const *) a; + struct brw_vertex_element *input_b = + *(struct brw_vertex_element * const *) b; + const GLubyte *ptr_a = input_a-glarray-Ptr; + const GLubyte *ptr_b = input_b-glarray-Ptr; + + if (ptr_a ptr_b) + return -1; + else if (ptr_a ptr_b) + return 1; + else + return 0; +} + void brw_prepare_vertices(struct brw_context *brw) { @@ -409,6 +427,7 @@ brw_prepare_vertices(struct brw_context *brw) int delta, i, j; struct brw_vertex_element *upload[VERT_ATTRIB_MAX]; + struct brw_vertex_element *sorted[VERT_ATTRIB_MAX]; GLuint nr_uploads = 0; /* _NEW_POLYGON @@ -442,8 +461,21 @@ brw_prepare_vertices(struct brw_context *brw) if (brw-vb.nr_buffers) return; + /* In the loop below if it finds an element that is using the same buffer +* as a previous element then it will reuse the same buffer relocation. +* However it will only work if the offset for the previous element is less +* than the offset for the new element and the difference is less than the +* stride. In order to increase the chances of hitting this optimisation +* the elements will be processed in increasing order of offset by first +* sorting the pointers. +*/ + memcpy(sorted, brw-vb.enabled, + sizeof(sorted[0]) * brw-vb.nr_enabled); + qsort(sorted, brw-vb.nr_enabled, sizeof(sorted[0]), + compare_array_ptr); + for (i = j = 0; i brw-vb.nr_enabled; i++) { - struct brw_vertex_element *input = brw-vb.enabled[i]; + struct brw_vertex_element *input = sorted[i]; const struct gl_client_array *glarray = input-glarray; if (_mesa_is_bufferobj(glarray-BufferObj)) { @@ -456,13 +488,13 @@ brw_prepare_vertices(struct brw_context *brw) * relocations. */ for (k = 0; k i; k++) { - const struct gl_client_array *other = brw-vb.enabled[k]-glarray; + const struct gl_client_array *other = sorted[k]-glarray; if (glarray-BufferObj == other-BufferObj glarray-StrideB == other-StrideB glarray-InstanceDivisor == other-InstanceDivisor (uintptr_t)(glarray-Ptr - other-Ptr) glarray-StrideB) { - input-buffer = brw-vb.enabled[k]-buffer; + input-buffer = sorted[k]-buffer; input-offset = glarray-Ptr - other-Ptr; break; } -- 1.9.3 ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation
Neil Roberts n...@linux.intel.com writes: When submitting the vertex buffers the i965 driver will try to recognise when multiple attributes are using the same buffer so that it can submit a single relocation for it and set a different offset in the attribute. Previously however if the application happens to have the attributes in a struct with an order that doesn't match the order they are listed in the gl_vert_attrib enum then the loop would end up processing the attributes with a greater offset first and the optimisation wouldn't be used. To make the optmisation more likely to be used this patch makes it always process the elements in increasing order of offset. This is done copying the element pointers into a separate array and sorting it with qsort. This only affects the order that the elements are processed and doesn't change the order that they are submitted to the hardware. I would expect the sorting to cost more than the potential win you're looking for. Performance patches should come with performance data. signature.asc Description: PGP signature ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation
On Mon, Dec 01, 2014 at 05:37:27PM -0800, Eric Anholt wrote: Neil Roberts n...@linux.intel.com writes: When submitting the vertex buffers the i965 driver will try to recognise when multiple attributes are using the same buffer so that it can submit a single relocation for it and set a different offset in the attribute. Previously however if the application happens to have the attributes in a struct with an order that doesn't match the order they are listed in the gl_vert_attrib enum then the loop would end up processing the attributes with a greater offset first and the optimisation wouldn't be used. To make the optmisation more likely to be used this patch makes it always process the elements in increasing order of offset. This is done copying the element pointers into a separate array and sorting it with qsort. This only affects the order that the elements are processed and doesn't change the order that they are submitted to the hardware. I would expect the sorting to cost more than the potential win you're looking for. Performance patches should come with performance data. In fact, I'd be interested to see numbers with the existing logic was removed. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] i965: Sort array elements to increase chances of reusing buffer relocation
On Monday, December 01, 2014 08:23:32 PM Neil Roberts wrote: When submitting the vertex buffers the i965 driver will try to recognise when multiple attributes are using the same buffer so that it can submit a single relocation for it and set a different offset in the attribute. Previously however if the application happens to have the attributes in a struct with an order that doesn't match the order they are listed in the gl_vert_attrib enum then the loop would end up processing the attributes with a greater offset first and the optimisation wouldn't be used. To make the optmisation more likely to be used this patch makes it always process the elements in increasing order of offset. This is done copying the element pointers into a separate array and sorting it with qsort. This only affects the order that the elements are processed and doesn't change the order that they are submitted to the hardware. --- I noticed this problem by inspection but I don't have a good feel for how important avoiding the buffer relocations is so I don't know whether the patch makes much sense. However I think the added overhead to sort the attributes is minimal so I don't think it can do much harm. My suspicion is that avoiding the couple of extra relocations shouldn't really matter that much, and it may even be worth just deleting the existing optimization. I'm definitely skeptical of adding /more/ code, since this is pretty much the hottest path in the driver. But I'd definitely be interested in any performance data you find. --Ken signature.asc Description: This is a digitally signed message part. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev