Re: [Mesa-dev] Link failure when copying big arrays stored in SSBOs

2015-11-20 Thread Iago Toral
On Fri, 2015-11-20 at 13:07 +0100, Iago Toral wrote:
> Hi,
> 
> Jordan sent a piglit test that produces a link failure with the ssbo
> code [1]. Doing something like this is sufficient to reproduce the
> problem:
> 
> [fragment shader]
> #version 330
> #extension GL_ARB_shader_storage_buffer_object: require
> 
> #define SIZE 6
> 
> layout (std430) buffer SSBO {
>     mat4 m1[SIZE];
>     mat4 m2[SIZE];
> };
>
> void main() {
>     m2 = m1;
> }
> 
> The thing here is that the lower_ubo_reference pass will first find
> that we read all of m1 and emit ssbo loads for each offset, and then it
> will find the write to m2 and emit all the stores, one for each offset.
> That produces NIR code that looks like this:
> 
> vec4 ssa_1 = intrinsic load_ssbo (ssa_0) () (0)
> vec4 ssa_2 = intrinsic load_ssbo (ssa_0) () (16)
> vec4 ssa_3 = intrinsic load_ssbo (ssa_0) () (32)
> (...)
> vec4 ssa_24 = intrinsic load_ssbo (ssa_0) () (368)
> intrinsic store_ssbo (ssa_24, ssa_0) () (752, 15)
> intrinsic store_ssbo (ssa_23, ssa_0) () (736, 15)
> intrinsic store_ssbo (ssa_22, ssa_0) () (720, 15)
> (...)
> intrinsic store_ssbo (ssa_1, ssa_0) () (384, 15)
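
(For reference, and assuming I have the std430 layout right: each mat4 is
four vec4 columns of 16 bytes, i.e. 64 bytes per matrix, so:

   mat4  = 4 * 16 = 64 bytes
   m1[6] = 6 * 64 = 384 bytes  ->  load offsets    0 .. 368
   m2[6] = 6 * 64 = 384 bytes  ->  store offsets 384 .. 752

which is where the offsets in the dump above come from.)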
> 
> Down at the i965 level, the registers used to configure the loads are
> also used to configure the writes (since they specify the address),
> which means that they stay live for the whole time between the read and
> the write to the same offset. For example:
> 
> {  7}1: untyped_surface_read(8) (mlen: 1) vgrf95+2.0:UD, vgrf25:UD
> ...  ...
> ...  ...
> {  6}  140: mov(8) vgrf95+0.0:UD, 0d NoMask
> {  6}  141: mov(8) vgrf95+0.28:UD, g1:UD NoMask
> {  6}  142: mov(8) vgrf95+1.0:UD, 384u
> {  6}  143: untyped_surface_write(8) (mlen: 6) null:UD, vgrf95:UD
> 
> In that code, vgrf95 is live in ip=[1, 143]. The same goes for all the
> other offsets, so we just end up with too many live registers. In
> general, register pressure increases with each load and won't decrease
> until we start emitting the writes, so the larger the arrays get, the
> worse the situation becomes.
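
(Back-of-the-envelope, and assuming SIMD8 with only the loaded data kept
live, not the message headers or the address registers:

   6 mat4 * 4 columns = 24 vec4 values live at once
   1 vec4 at SIMD8    = 4 GRFs
   24 * 4             = 96 GRFs, out of the 128 in the register file

so even this small test burns most of the register file before anything
else in the shader is counted.)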
> 
> I don't think we can do much about this other than maybe handling array
> copies specially (so that instead of emitting all the loads first and
> all the stores second, we emit the load and store for each element at
> once), reducing the live ranges of the registers involved. I am assuming
> that nobody would write structs big enough to generate the same problem
> there, but hey... :)

Actually, we'd need the same for struct copies, since we would run into
the same problem there as soon as they include large arrays, of course.
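
Roughly, what I have in mind is that for an array (or struct) copy the
lowering pass would emit each load/store pair together, something like
this (just a sketch, same intrinsics as above, only reordered):

vec4 ssa_1 = intrinsic load_ssbo (ssa_0) () (0)
intrinsic store_ssbo (ssa_1, ssa_0) () (384, 15)
vec4 ssa_2 = intrinsic load_ssbo (ssa_0) () (16)
intrinsic store_ssbo (ssa_2, ssa_0) () (400, 15)
(...)
vec4 ssa_24 = intrinsic load_ssbo (ssa_0) () (368)
intrinsic store_ssbo (ssa_24, ssa_0) () (752, 15)

That way each address/data register only has to stay live across one
load/store pair, and register pressure stays flat no matter how large
SIZE gets.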

> Any better ideas?
> 
> Iago
> 
> [1] http://lists.freedesktop.org/archives/piglit/2015-November/018055.html



