Re: [Mesa-dev] [PATCH v3 38/43] i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg
On Thu, Oct 12, 2017 at 11:38 AM, Jose Maria Casanova Crespo < jmcasan...@igalia.com> wrote: > From: Eduardo Lima Mitev> > Currently, we use byte-scattered write messages for storing 16-bit > into an SSBO. This is because untyped surface messages have a fixed > 32-bit size. > > This patch optimizes these 16-bit writes by combining 2 values (e.g, > two consecutive components) into a 32-bit register, packing the two > 16-bit words. > > 16-bit single component values will continue to use byte-scattered > write messages. > > This optimization reduces the number of SEND messages used for storing > 16-bit values potentially by 2 or 4, which cuts down execution time > significantly because byte-scattered writes are an expensive > operation. > > v2: Removed use of stride = 2 on sources (Jason Ekstrand) > Rework optimization using shuffle 16 write and enable writes > of 16bit vec4 with only one message of 32-bits. (Chema Casanova) > > Signed-off-by: Jose Maria Casanova Crespo > Signed-off-by: Eduardo Lima > --- > src/intel/compiler/brw_fs_nir.cpp | 64 ++ > + > 1 file changed, 52 insertions(+), 12 deletions(-) > > diff --git a/src/intel/compiler/brw_fs_nir.cpp > b/src/intel/compiler/brw_fs_nir.cpp > index 2d0b3e139e..c07b3e4d8d 100644 > --- a/src/intel/compiler/brw_fs_nir.cpp > +++ b/src/intel/compiler/brw_fs_nir.cpp > @@ -4218,6 +4218,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder > , nir_intrinsic_instr *instr > instr->num_components); > val_reg = tmp; >} > + if (bit_size == 16) { > + val_reg=retype(val_reg, BRW_REGISTER_TYPE_HF); > + } > >/* 16-bit types would use a minimum of 1 slot */ >unsigned type_slots = MAX2(type_size / 4, 1); > @@ -4231,6 +4234,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder > , nir_intrinsic_instr *instr > unsigned first_component = ffs(writemask) - 1; > unsigned length = ffs(~(writemask >> first_component)) - 1; > > + fs_reg current_val_reg = > +offset(val_reg, bld, first_component * type_slots); > + > /* We can't write more than 2 64-bit components at once. Limit > the >* length of the write to what we can do and let the next > iteration >* handle the rest > @@ -4238,11 +4244,40 @@ fs_visitor::nir_emit_intrinsic(const fs_builder > , nir_intrinsic_instr *instr > if (type_size > 4) { > length = MIN2(2, length); > } else if (type_size == 2) { > -/* For 16-bit types we are using byte scattered writes, that > can > - * only write one component per call. So we limit the length, > and > - * let the write happening in several iterations. > +/* For 16-bit types we pack two consecutive values into a > + * 32-bit word and use an untyped write message. For single > values > + * we need to use byte-scattered writes because untyped > writes work > + * on multiples of 32 bits. > + * > + * For example, if there is a 3-component vector we submit one > + * untyped-write message of 32-bit (first two components), > and one > + * byte-scattered write message (the last component). > */ > -length = 1; > +if (length >= 2) { > + /* pack two consecutive 16-bit words into a 32-bit > register, > +* using the same original source register. > +*/ > This doesn't work if you have a writemask of .xz > + length -= length % 2; > I'm very confused by this bit of math. > + fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F, length / 2); > + shuffle_16bit_data_for_32bit_write(bld, > + tmp, > + current_val_reg, > + length); > + current_val_reg = tmp; > + > +} else { > + /* For single 16-bit values, we just limit the length to 1 > and > +* use a byte-scattered write message below. > +*/ > + length = 1; > I think this can be an assert. Also, why do we need the shuffle? I thought this case would work if we just set length == 1. I answered my own question about the shuffle. It lets us delete some code below. > + fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F); > + shuffle_16bit_data_for_32bit_write(bld, > + tmp, > + current_val_reg, > + length); > + current_val_reg = tmp; > + > +} > } > > fs_reg offset_reg; > @@ -4257,24 +4292,29 @@
Re: [Mesa-dev] [PATCH v3 38/43] i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg
El 22/10/17 a las 12:31, Eduardo Lima Mitev escribió: > On 10/12/2017 08:38 PM, Jose Maria Casanova Crespo wrote: >> From: Eduardo Lima Mitev>> >> Currently, we use byte-scattered write messages for storing 16-bit >> into an SSBO. This is because untyped surface messages have a fixed >> 32-bit size. >> >> This patch optimizes these 16-bit writes by combining 2 values (e.g, >> two consecutive components) into a 32-bit register, packing the two >> 16-bit words. >> >> 16-bit single component values will continue to use byte-scattered >> write messages. >> >> This optimization reduces the number of SEND messages used for storing >> 16-bit values potentially by 2 or 4, which cuts down execution time >> significantly because byte-scattered writes are an expensive >> operation. >> >> v2: Removed use of stride = 2 on sources (Jason Ekstrand) >> Rework optimization using shuffle 16 write and enable writes >> of 16bit vec4 with only one message of 32-bits. (Chema Casanova) >> >> Signed-off-by: Jose Maria Casanova Crespo >> Signed-off-by: Eduardo Lima >> --- >> src/intel/compiler/brw_fs_nir.cpp | 64 >> +++ >> 1 file changed, 52 insertions(+), 12 deletions(-) >> >> diff --git a/src/intel/compiler/brw_fs_nir.cpp >> b/src/intel/compiler/brw_fs_nir.cpp >> index 2d0b3e139e..c07b3e4d8d 100644 >> --- a/src/intel/compiler/brw_fs_nir.cpp >> +++ b/src/intel/compiler/brw_fs_nir.cpp >> @@ -4218,6 +4218,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , >> nir_intrinsic_instr *instr >> instr->num_components); >> val_reg = tmp; >>} >> + if (bit_size == 16) { >> + val_reg=retype(val_reg, BRW_REGISTER_TYPE_HF); > Spaces around the '='. Probably also remove the block since there is > only one instruction in it. Fixed locally. > >> + } >> >>/* 16-bit types would use a minimum of 1 slot */ >>unsigned type_slots = MAX2(type_size / 4, 1); >> @@ -4231,6 +4234,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , >> nir_intrinsic_instr *instr >> unsigned first_component = ffs(writemask) - 1; >> unsigned length = ffs(~(writemask >> first_component)) - 1; >> >> + fs_reg current_val_reg = >> +offset(val_reg, bld, first_component * type_slots); >> + >> /* We can't write more than 2 64-bit components at once. Limit the >>* length of the write to what we can do and let the next iteration >>* handle the rest >> @@ -4238,11 +4244,40 @@ fs_visitor::nir_emit_intrinsic(const fs_builder >> , nir_intrinsic_instr *instr >> if (type_size > 4) { >> length = MIN2(2, length); >> } else if (type_size == 2) { >> -/* For 16-bit types we are using byte scattered writes, that can >> - * only write one component per call. So we limit the length, >> and >> - * let the write happening in several iterations. >> +/* For 16-bit types we pack two consecutive values into a >> + * 32-bit word and use an untyped write message. For single >> values >> + * we need to use byte-scattered writes because untyped writes >> work >> + * on multiples of 32 bits. >> + * >> + * For example, if there is a 3-component vector we submit one >> + * untyped-write message of 32-bit (first two components), and >> one >> + * byte-scattered write message (the last component). >> */ >> -length = 1; >> +if (length >= 2) { >> + /* pack two consecutive 16-bit words into a 32-bit register, >> +* using the same original source register. >> +*/ >> + length -= length % 2; >> + fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F, length / 2); >> + shuffle_16bit_data_for_32bit_write(bld, >> + tmp, >> + current_val_reg, >> + length); >> + current_val_reg = tmp; >> + >> +} else { >> + /* For single 16-bit values, we just limit the length to 1 >> and >> +* use a byte-scattered write message below. >> +*/ >> + length = 1; >> + fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F); >> + shuffle_16bit_data_for_32bit_write(bld, >> + tmp, >> + current_val_reg, >> + length); >> + current_val_reg = tmp; >> + >> +} >> } >> >> fs_reg offset_reg; >> @@ -4257,24 +4292,29 @@ fs_visitor::nir_emit_intrinsic(const fs_builder >> ,
Re: [Mesa-dev] [PATCH v3 38/43] i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg
On 10/12/2017 08:38 PM, Jose Maria Casanova Crespo wrote: > From: Eduardo Lima Mitev> > Currently, we use byte-scattered write messages for storing 16-bit > into an SSBO. This is because untyped surface messages have a fixed > 32-bit size. > > This patch optimizes these 16-bit writes by combining 2 values (e.g, > two consecutive components) into a 32-bit register, packing the two > 16-bit words. > > 16-bit single component values will continue to use byte-scattered > write messages. > > This optimization reduces the number of SEND messages used for storing > 16-bit values potentially by 2 or 4, which cuts down execution time > significantly because byte-scattered writes are an expensive > operation. > > v2: Removed use of stride = 2 on sources (Jason Ekstrand) > Rework optimization using shuffle 16 write and enable writes > of 16bit vec4 with only one message of 32-bits. (Chema Casanova) > > Signed-off-by: Jose Maria Casanova Crespo > Signed-off-by: Eduardo Lima > --- > src/intel/compiler/brw_fs_nir.cpp | 64 > +++ > 1 file changed, 52 insertions(+), 12 deletions(-) > > diff --git a/src/intel/compiler/brw_fs_nir.cpp > b/src/intel/compiler/brw_fs_nir.cpp > index 2d0b3e139e..c07b3e4d8d 100644 > --- a/src/intel/compiler/brw_fs_nir.cpp > +++ b/src/intel/compiler/brw_fs_nir.cpp > @@ -4218,6 +4218,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , > nir_intrinsic_instr *instr > instr->num_components); > val_reg = tmp; >} > + if (bit_size == 16) { > + val_reg=retype(val_reg, BRW_REGISTER_TYPE_HF); Spaces around the '='. Probably also remove the block since there is only one instruction in it. > + } > >/* 16-bit types would use a minimum of 1 slot */ >unsigned type_slots = MAX2(type_size / 4, 1); > @@ -4231,6 +4234,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , > nir_intrinsic_instr *instr > unsigned first_component = ffs(writemask) - 1; > unsigned length = ffs(~(writemask >> first_component)) - 1; > > + fs_reg current_val_reg = > +offset(val_reg, bld, first_component * type_slots); > + > /* We can't write more than 2 64-bit components at once. Limit the >* length of the write to what we can do and let the next iteration >* handle the rest > @@ -4238,11 +4244,40 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , > nir_intrinsic_instr *instr > if (type_size > 4) { > length = MIN2(2, length); > } else if (type_size == 2) { > -/* For 16-bit types we are using byte scattered writes, that can > - * only write one component per call. So we limit the length, and > - * let the write happening in several iterations. > +/* For 16-bit types we pack two consecutive values into a > + * 32-bit word and use an untyped write message. For single > values > + * we need to use byte-scattered writes because untyped writes > work > + * on multiples of 32 bits. > + * > + * For example, if there is a 3-component vector we submit one > + * untyped-write message of 32-bit (first two components), and > one > + * byte-scattered write message (the last component). > */ > -length = 1; > +if (length >= 2) { > + /* pack two consecutive 16-bit words into a 32-bit register, > +* using the same original source register. > +*/ > + length -= length % 2; > + fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F, length / 2); > + shuffle_16bit_data_for_32bit_write(bld, > + tmp, > + current_val_reg, > + length); > + current_val_reg = tmp; > + > +} else { > + /* For single 16-bit values, we just limit the length to 1 and > +* use a byte-scattered write message below. > +*/ > + length = 1; > + fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F); > + shuffle_16bit_data_for_32bit_write(bld, > + tmp, > + current_val_reg, > + length); > + current_val_reg = tmp; > + > +} > } > > fs_reg offset_reg; > @@ -4257,24 +4292,29 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , > nir_intrinsic_instr *instr > brw_imm_ud(type_size * first_component)); > } > > - if (type_size == 2) { > + if (type_size == 2 && length == 1) { >
[Mesa-dev] [PATCH v3 38/43] i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg
From: Eduardo Lima MitevCurrently, we use byte-scattered write messages for storing 16-bit into an SSBO. This is because untyped surface messages have a fixed 32-bit size. This patch optimizes these 16-bit writes by combining 2 values (e.g, two consecutive components) into a 32-bit register, packing the two 16-bit words. 16-bit single component values will continue to use byte-scattered write messages. This optimization reduces the number of SEND messages used for storing 16-bit values potentially by 2 or 4, which cuts down execution time significantly because byte-scattered writes are an expensive operation. v2: Removed use of stride = 2 on sources (Jason Ekstrand) Rework optimization using shuffle 16 write and enable writes of 16bit vec4 with only one message of 32-bits. (Chema Casanova) Signed-off-by: Jose Maria Casanova Crespo Signed-off-by: Eduardo Lima --- src/intel/compiler/brw_fs_nir.cpp | 64 +++ 1 file changed, 52 insertions(+), 12 deletions(-) diff --git a/src/intel/compiler/brw_fs_nir.cpp b/src/intel/compiler/brw_fs_nir.cpp index 2d0b3e139e..c07b3e4d8d 100644 --- a/src/intel/compiler/brw_fs_nir.cpp +++ b/src/intel/compiler/brw_fs_nir.cpp @@ -4218,6 +4218,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , nir_intrinsic_instr *instr instr->num_components); val_reg = tmp; } + if (bit_size == 16) { + val_reg=retype(val_reg, BRW_REGISTER_TYPE_HF); + } /* 16-bit types would use a minimum of 1 slot */ unsigned type_slots = MAX2(type_size / 4, 1); @@ -4231,6 +4234,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , nir_intrinsic_instr *instr unsigned first_component = ffs(writemask) - 1; unsigned length = ffs(~(writemask >> first_component)) - 1; + fs_reg current_val_reg = +offset(val_reg, bld, first_component * type_slots); + /* We can't write more than 2 64-bit components at once. Limit the * length of the write to what we can do and let the next iteration * handle the rest @@ -4238,11 +4244,40 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , nir_intrinsic_instr *instr if (type_size > 4) { length = MIN2(2, length); } else if (type_size == 2) { -/* For 16-bit types we are using byte scattered writes, that can - * only write one component per call. So we limit the length, and - * let the write happening in several iterations. +/* For 16-bit types we pack two consecutive values into a + * 32-bit word and use an untyped write message. For single values + * we need to use byte-scattered writes because untyped writes work + * on multiples of 32 bits. + * + * For example, if there is a 3-component vector we submit one + * untyped-write message of 32-bit (first two components), and one + * byte-scattered write message (the last component). */ -length = 1; +if (length >= 2) { + /* pack two consecutive 16-bit words into a 32-bit register, +* using the same original source register. +*/ + length -= length % 2; + fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F, length / 2); + shuffle_16bit_data_for_32bit_write(bld, + tmp, + current_val_reg, + length); + current_val_reg = tmp; + +} else { + /* For single 16-bit values, we just limit the length to 1 and +* use a byte-scattered write message below. +*/ + length = 1; + fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F); + shuffle_16bit_data_for_32bit_write(bld, + tmp, + current_val_reg, + length); + current_val_reg = tmp; + +} } fs_reg offset_reg; @@ -4257,24 +4292,29 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , nir_intrinsic_instr *instr brw_imm_ud(type_size * first_component)); } - if (type_size == 2) { + if (type_size == 2 && length == 1) { /* Untyped Surface messages have a fixed 32-bit size, so we need * to rely on byte scattered in order to write 16-bit elements. * The byte_scattered_write message needs that every written 16-bit * type to be aligned 32-bits (stride=2). */ -fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F); -val_reg.type