I previously sent patches enabling hardware-generated binding tables. By themselves, hw-binding tables give no performance improvement; they are just a means to an end. The real meat of the RS hardware is its optimized ability to map constants to the GRF.
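As a rough mental model for the gather mechanism described in the rest of this letter, here is a small software emulation of a gather in plain C. It is only a sketch and not driver code: the struct layouts, the field names, the 16-byte slot granularity, and the function `gather_constants` are all hypothetical, and the real gather table encoding is defined by the hardware. The assumption illustrated is that each entry names a bound buffer, a slot within it, and a channel mask, and that the enabled channels end up packed back to back in the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical gather table entry. Field names and widths are
 * illustrative only and do not reflect the hardware encoding. */
struct gather_entry {
    uint32_t bt_index;     /* index of the bound buffer to read from */
    uint32_t vec4_offset;  /* vec4 slot within that buffer (16-byte units) */
    uint8_t  channel_mask; /* bit 0 = x, bit 1 = y, bit 2 = z, bit 3 = w */
};

/* Hypothetical stand-in for a buffer reachable through the binding table. */
struct const_buffer {
    const float *data;
    size_t num_floats;
};

/* Walk the gather table and copy every enabled channel into dst,
 * packing the values contiguously. Returns the number of floats written. */
size_t gather_constants(const struct gather_entry *table, size_t num_entries,
                        const struct const_buffer *buffers, size_t num_buffers,
                        float *dst, size_t dst_capacity)
{
    size_t written = 0;

    for (size_t i = 0; i < num_entries; i++) {
        const struct gather_entry *e = &table[i];
        assert(e->bt_index < num_buffers);
        const struct const_buffer *buf = &buffers[e->bt_index];

        for (unsigned ch = 0; ch < 4; ch++) {
            if (!(e->channel_mask & (1u << ch)))
                continue; /* channel not requested, leave no hole for it */

            size_t src = (size_t)e->vec4_offset * 4 + ch;
            assert(src < buf->num_floats);
            assert(written < dst_capacity);
            dst[written++] = buf->data[src];
        }
    }
    return written;
}
```

With a UBO holding the floats 0..7 as buffer 0 and a uniform buffer holding 10..13 as buffer 1, the two entries {0, 1, 0xF} and {1, 0, 0x3} yield the six packed values 4, 5, 6, 7, 10, 11. Doing this copy on the CPU every frame is exactly the overhead the series tries to avoid.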

Gather push constants are essentially an optimized way of programming push constants. They give us the ability to gather and pack constant data that may reside in non-contiguous locations of arbitrary buffer objects without incurring additional overhead. The goal of this series is to allow the registers holding combined UBO blocks and uniforms to be allocated sequentially and packed tightly without holes, which (1) reduces register pressure and (2) minimizes the use of pull constant loads.

To achieve the same result without the resource streamer, the driver may have to manually rearrange, reformat, and repack the entries of the already uploaded UBO block and of any uniform buffer that may be present, so that the entries match the layout of the allocated GRFs. All of that would happen every frame. It gets even worse if a shader fetches its constants from two or more different constant buffer blocks.

The resource streamer achieves this hardware packing of GRF entries by parsing a gather table whose entries contain a hardware binding table index, an offset, and a channel mask, and uses them to gather the sparsely located constant data.

I promised some folks that I would send this out in a coherent state before the holidays. Unfortunately, I didn't make it in time, but I hope the current state is enough to demonstrate my approach and make review possible.

I still lack real-world benchmarks, but consider this simple piglit test case:

tests/spec/glsl-1.40/uniform_buffer/fs-struct-copy.shader_test

With the existing method of fetching the UBO entries:

SIMD16 shader: 15 instructions. 0 loops.
Compacted 240 to 176 bytes (27%)

mov(1) g16<1>UD 0x0000000cUD
mov(1) g18<1>UD 0x00000000UD
mov(1) g20<1>UD 0x00000004UD
send(4) g2<1>F g16<0,1,0>F sampler (1, 0, 7, 0) mlen 1 rlen 1
send(4) g4<1>F g18<0,1,0>F sampler (1, 0, 7, 0) mlen 1 rlen 1
send(4) g6<1>F g20<0,1,0>F sampler (1, 0, 7, 0) mlen 1 rlen 1
add(16) g8<1>F g4<0,1,0>F g6<0,1,0>F
add(16) g10<1>F g4.1<0,1,0>F g6.1<0,1,0>F
add(16) g12<1>F g4.2<0,1,0>F g6.2<0,1,0>F
add(16) g14<1>F g4.3<0,1,0>F g6.3<0,1,0>F
add(16) g120<1>F g8<8,8,1>F g2<0,1,0>F
add(16) g122<1>F g10<8,8,1>F g2.1<0,1,0>F
add(16) g124<1>F g12<8,8,1>F g2.2<0,1,0>F
add(16) g126<1>F g14<8,8,1>F g2.3<0,1,0>F
sendc(16) null g120<8,8,1>F

Compare with gather constants enabled:

SIMD16 shader: 9 instructions. 0 loops.
Compacted 144 to 112 bytes (22%)

add(16) g4<1>F g2.4<0,1,0>F g3<0,1,0>F
add(16) g6<1>F g2.5<0,1,0>F g3.1<0,1,0>F
add(16) g8<1>F g2.6<0,1,0>F g3.2<0,1,0>F
add(16) g10<1>F g2.7<0,1,0>F g3.3<0,1,0>F
add(16) g120<1>F g4<8,8,1>F g2<0,1,0>F
add(16) g122<1>F g6<8,8,1>F g2.1<0,1,0>F
add(16) g124<1>F g8<8,8,1>F g2.2<0,1,0>F
add(16) g126<1>F g10<8,8,1>F g2.3<0,1,0>F
nop ;
sendc(16) null g120<8,8,1>F

Current Status
--------------

What works:
- FS and VS uniform piglit tests pass
- Fragment shader UBOs without mixed uniforms pass
- Fragment shader UBOs mixed with uniform entries sized vec4 or less pass

What doesn't work yet:
- Fragment shader UBOs with bools
- VS and GS UBOs

Vec4 backend support is not yet done. Once I complete it, I hope to publish comprehensive benchmark scores.

Patch Summary
-------------

Series lives here:
http://cgit.freedesktop.org/~abj/mesa/log/?h=rs_gather_constants0

Patches 1 - 11: Enables hardware-generated binding tables, which is a requirement for gather push constants.
Patches 12 - 18: Enables gather push constant support for ordinary uniforms.
Patches 19 - 24: Implements fine-grained uniform uploads.
Patches 26 - 40: Adds FS-backend compiler support to use UBOs as push constants.

I'm not particularly happy about having to do patch 19. My goal was to make the driver able to tell which stage actually modified its uniforms. With that information, uniform uploads happen only when there is a change, which makes gather table generation more efficient for ordinary uniforms. Ideally, if there is a way to let the driver accept additional state flags without making the type of the state flag variable bigger, I would be more than happy to implement it.

I think the more interesting pieces of this series are patches 17, 27, 30, and 34, which change how constants are programmed into the GRF using a gather table and specify which channels in a register get packed and loaded. The rest is just support for the hardware-enabling bits.

Additional Notes
----------------

This series also needs kernel support to switch on the resource streamer when the ringbuffer jumps to the userspace batchbuffer. I have preliminary support here:
http://cgit.freedesktop.org/~abj/linux/log/?h=intel_resource_streamer

Unfortunately, I also ran out of time to rebase the kernel changes onto the latest drm-nightly. I also made additional changes that toggle the hw-binding table feature within the ring, which is actually required when multiple GL clients are running.

To switch on hw-generated binding tables, set the environment variable INTEL_RESOURCE_STREAMER=1. To enable gather push constants, set INTEL_GATHER=1 in addition to the resource streamer variable.
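
To illustrate how the two environment variables relate, here is a minimal C sketch of the dependency: gather push constants only take effect when the resource streamer is switched on as well. This is not the code from the series. The struct `rs_options` and the functions `rs_options_from` and `rs_options_from_env` are hypothetical names, and treating only the exact string "1" as enabled is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the two feature toggles. */
struct rs_options {
    bool resource_streamer; /* hw-generated binding tables */
    bool gather;            /* gather push constants */
};

/* Assumption: a variable counts as enabled only when set to exactly "1". */
static bool flag_enabled(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Derive the options from the raw variable values. Gather push constants
 * depend on hw-binding tables, so gather is ignored without the
 * resource streamer. */
struct rs_options rs_options_from(const char *rs_value,
                                  const char *gather_value)
{
    struct rs_options opts;

    opts.resource_streamer = flag_enabled(rs_value);
    opts.gather = opts.resource_streamer && flag_enabled(gather_value);
    return opts;
}

/* Read both toggles from the process environment. */
struct rs_options rs_options_from_env(void)
{
    return rs_options_from(getenv("INTEL_RESOURCE_STREAMER"),
                           getenv("INTEL_GATHER"));
}
```

Keeping the parsing separate from the getenv() calls makes the dependency between the two variables easy to check without touching the environment.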

 src/mesa/drivers/dri/i965/brw_binding_tables.c   | 199 ++++++++++++++++++++++++++++++++++++++++++-
 src/mesa/drivers/dri/i965/brw_context.c          |  19 ++++-
 src/mesa/drivers/dri/i965/brw_context.h          |  41 ++++++++-
 src/mesa/drivers/dri/i965/brw_defines.h          |  16 ++++
 src/mesa/drivers/dri/i965/brw_draw.c             |  14 +++
 src/mesa/drivers/dri/i965/brw_fs.cpp             |  46 +++++++---
 src/mesa/drivers/dri/i965/brw_fs.h               |   3 +
 src/mesa/drivers/dri/i965/brw_fs_visitor.cpp     |  42 ++++++++-
 src/mesa/drivers/dri/i965/brw_program.c          |  12 +++
 src/mesa/drivers/dri/i965/brw_state.h            |  25 ++++++
 src/mesa/drivers/dri/i965/brw_state_upload.c     |   9 +-
 src/mesa/drivers/dri/i965/brw_vec4_visitor.cpp   |   3 +
 src/mesa/drivers/dri/i965/brw_wm.c               |   3 +
 src/mesa/drivers/dri/i965/brw_wm_surface_state.c |   6 ++
 src/mesa/drivers/dri/i965/gen6_blorp.cpp         |  35 ++++++--
 src/mesa/drivers/dri/i965/gen6_gs_state.c        |   2 +-
 src/mesa/drivers/dri/i965/gen6_vs_state.c        |  51 +++++++----
 src/mesa/drivers/dri/i965/gen6_wm_state.c        |   2 +-
 src/mesa/drivers/dri/i965/gen7_blorp.cpp         |   7 +-
 src/mesa/drivers/dri/i965/gen7_disable.c         |   4 +
 src/mesa/drivers/dri/i965/gen7_vs_state.c        | 136 ++++++++++++++++++++++++++++-
 src/mesa/drivers/dri/i965/gen7_wm_state.c        |   2 +-
 src/mesa/drivers/dri/i965/intel_batchbuffer.c    |  11 ++-
 src/mesa/drivers/dri/i965/intel_reg.h            |   3 +
 src/mesa/main/dd.h                               |   4 +-
 src/mesa/main/mtypes.h                           |   6 +-
 src/mesa/main/state.c                            |  16 ++--
 src/mesa/main/uniform_query.cpp                  |   6 ++
 28 files changed, 660 insertions(+), 63 deletions(-)

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev