Re: [Mesa-dev] [PATCH v3 38/43] i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg

2017-10-30 Thread Jason Ekstrand
On Thu, Oct 12, 2017 at 11:38 AM, Jose Maria Casanova Crespo <
jmcasan...@igalia.com> wrote:

> From: Eduardo Lima Mitev 
>
> Currently, we use byte-scattered write messages for storing 16-bit
> into an SSBO. This is because untyped surface messages have a fixed
> 32-bit size.
>
> This patch optimizes these 16-bit writes by combining 2 values (e.g,
> two consecutive components) into a 32-bit register, packing the two
> 16-bit words.
>
> 16-bit single component values will continue to use byte-scattered
> write messages.
>
> This optimization reduces the number of SEND messages used for storing
> 16-bit values potentially by 2 or 4, which cuts down execution time
> significantly because byte-scattered writes are an expensive
> operation.
>
> v2: Removed use of stride = 2 on sources (Jason Ekstrand)
> Rework optimization using shuffle 16 write and enable writes
> of 16bit vec4 with only one message of 32-bits. (Chema Casanova)
>
> Signed-off-by: Jose Maria Casanova Crespo 
> Signed-off-by: Eduardo Lima 
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 64 ++
> +
>  1 file changed, 52 insertions(+), 12 deletions(-)
>
> diff --git a/src/intel/compiler/brw_fs_nir.cpp
> b/src/intel/compiler/brw_fs_nir.cpp
> index 2d0b3e139e..c07b3e4d8d 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -4218,6 +4218,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder
> , nir_intrinsic_instr *instr
>  instr->num_components);
>   val_reg = tmp;
>}
> +  if (bit_size == 16) {
> + val_reg=retype(val_reg, BRW_REGISTER_TYPE_HF);
> +  }
>
>/* 16-bit types would use a minimum of 1 slot */
>unsigned type_slots = MAX2(type_size / 4, 1);
> @@ -4231,6 +4234,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder
> , nir_intrinsic_instr *instr
>   unsigned first_component = ffs(writemask) - 1;
>   unsigned length = ffs(~(writemask >> first_component)) - 1;
>
> + fs_reg current_val_reg =
> +offset(val_reg, bld, first_component * type_slots);
> +
>   /* We can't write more than 2 64-bit components at once. Limit
> the
>* length of the write to what we can do and let the next
> iteration
>* handle the rest
> @@ -4238,11 +4244,40 @@ fs_visitor::nir_emit_intrinsic(const fs_builder
> , nir_intrinsic_instr *instr
>   if (type_size > 4) {
>  length = MIN2(2, length);
>   } else if (type_size == 2) {
> -/* For 16-bit types we are using byte scattered writes, that
> can
> - * only write one component per call. So we limit the length,
> and
> - * let the write happening in several iterations.
> +/* For 16-bit types we pack two consecutive values into a
> + * 32-bit word and use an untyped write message. For single
> values
> + * we need to use byte-scattered writes because untyped
> writes work
> + * on multiples of 32 bits.
> + *
> + * For example, if there is a 3-component vector we submit one
> + * untyped-write message of 32-bit (first two components),
> and one
> + * byte-scattered write message (the last component).
>   */
> -length = 1;
> +if (length >= 2) {
> +   /* pack two consecutive 16-bit words into a 32-bit
> register,
> +* using the same original source register.
> +*/
>

This doesn't work if you have a writemask of .xz


> +   length -= length % 2;
>

I'm very confused by this bit of math.


> +   fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F, length / 2);
> +   shuffle_16bit_data_for_32bit_write(bld,
> +  tmp,
> +  current_val_reg,
> +  length);
> +   current_val_reg = tmp;
> +
> +} else {
> +   /* For single 16-bit values, we just limit the length to 1
> and
> +* use a byte-scattered write message below.
> +*/
> +   length = 1;
>

I think this can be an assert.  Also, why do we need the shuffle?  I
thought this case would work if we just set length == 1.

I answered my own question about the shuffle.  It lets us delete some code
below.


> +   fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F);
> +   shuffle_16bit_data_for_32bit_write(bld,
> +  tmp,
> +  current_val_reg,
> +  length);
>
+   current_val_reg = tmp;
> +
> +}
>   }
>
>   fs_reg offset_reg;
> @@ -4257,24 +4292,29 @@ 

Re: [Mesa-dev] [PATCH v3 38/43] i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg

2017-10-24 Thread Chema Casanova
El 22/10/17 a las 12:31, Eduardo Lima Mitev escribió:
> On 10/12/2017 08:38 PM, Jose Maria Casanova Crespo wrote:
>> From: Eduardo Lima Mitev 
>>
>> Currently, we use byte-scattered write messages for storing 16-bit
>> into an SSBO. This is because untyped surface messages have a fixed
>> 32-bit size.
>>
>> This patch optimizes these 16-bit writes by combining 2 values (e.g,
>> two consecutive components) into a 32-bit register, packing the two
>> 16-bit words.
>>
>> 16-bit single component values will continue to use byte-scattered
>> write messages.
>>
>> This optimization reduces the number of SEND messages used for storing
>> 16-bit values potentially by 2 or 4, which cuts down execution time
>> significantly because byte-scattered writes are an expensive
>> operation.
>>
>> v2: Removed use of stride = 2 on sources (Jason Ekstrand)
>> Rework optimization using shuffle 16 write and enable writes
>> of 16bit vec4 with only one message of 32-bits. (Chema Casanova)
>>
>> Signed-off-by: Jose Maria Casanova Crespo 
>> Signed-off-by: Eduardo Lima 
>> ---
>>  src/intel/compiler/brw_fs_nir.cpp | 64 
>> +++
>>  1 file changed, 52 insertions(+), 12 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_nir.cpp 
>> b/src/intel/compiler/brw_fs_nir.cpp
>> index 2d0b3e139e..c07b3e4d8d 100644
>> --- a/src/intel/compiler/brw_fs_nir.cpp
>> +++ b/src/intel/compiler/brw_fs_nir.cpp
>> @@ -4218,6 +4218,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
>> nir_intrinsic_instr *instr
>>  instr->num_components);
>>   val_reg = tmp;
>>}
>> +  if (bit_size == 16) {
>> + val_reg=retype(val_reg, BRW_REGISTER_TYPE_HF);
> Spaces around the '='. Probably also remove the block since there  is
> only one instruction in it.

Fixed locally.

>
>> +  }
>>  
>>/* 16-bit types would use a minimum of 1 slot */
>>unsigned type_slots = MAX2(type_size / 4, 1);
>> @@ -4231,6 +4234,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
>> nir_intrinsic_instr *instr
>>   unsigned first_component = ffs(writemask) - 1;
>>   unsigned length = ffs(~(writemask >> first_component)) - 1;
>>  
>> + fs_reg current_val_reg =
>> +offset(val_reg, bld, first_component * type_slots);
>> +
>>   /* We can't write more than 2 64-bit components at once. Limit the
>>* length of the write to what we can do and let the next iteration
>>* handle the rest
>> @@ -4238,11 +4244,40 @@ fs_visitor::nir_emit_intrinsic(const fs_builder 
>> , nir_intrinsic_instr *instr
>>   if (type_size > 4) {
>>  length = MIN2(2, length);
>>   } else if (type_size == 2) {
>> -/* For 16-bit types we are using byte scattered writes, that can
>> - * only write one component per call. So we limit the length, 
>> and
>> - * let the write happening in several iterations.
>> +/* For 16-bit types we pack two consecutive values into a
>> + * 32-bit word and use an untyped write message. For single 
>> values
>> + * we need to use byte-scattered writes because untyped writes 
>> work
>> + * on multiples of 32 bits.
>> + *
>> + * For example, if there is a 3-component vector we submit one
>> + * untyped-write message of 32-bit (first two components), and 
>> one
>> + * byte-scattered write message (the last component).
>>   */
>> -length = 1;
>> +if (length >= 2) {
>> +   /* pack two consecutive 16-bit words into a 32-bit register,
>> +* using the same original source register.
>> +*/
>> +   length -= length % 2;
>> +   fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F, length / 2);
>> +   shuffle_16bit_data_for_32bit_write(bld,
>> +  tmp,
>> +  current_val_reg,
>> +  length);
>> +   current_val_reg = tmp;
>> +
>> +} else {
>> +   /* For single 16-bit values, we just limit the length to 1 
>> and
>> +* use a byte-scattered write message below.
>> +*/
>> +   length = 1;
>> +   fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F);
>> +   shuffle_16bit_data_for_32bit_write(bld,
>> +  tmp,
>> +  current_val_reg,
>> +  length);
>> +   current_val_reg = tmp;
>> +
>> +}
>>   }
>>  
>>   fs_reg offset_reg;
>> @@ -4257,24 +4292,29 @@ fs_visitor::nir_emit_intrinsic(const fs_builder 
>> , 

Re: [Mesa-dev] [PATCH v3 38/43] i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg

2017-10-22 Thread Eduardo Lima Mitev
On 10/12/2017 08:38 PM, Jose Maria Casanova Crespo wrote:
> From: Eduardo Lima Mitev 
> 
> Currently, we use byte-scattered write messages for storing 16-bit
> into an SSBO. This is because untyped surface messages have a fixed
> 32-bit size.
> 
> This patch optimizes these 16-bit writes by combining 2 values (e.g,
> two consecutive components) into a 32-bit register, packing the two
> 16-bit words.
> 
> 16-bit single component values will continue to use byte-scattered
> write messages.
> 
> This optimization reduces the number of SEND messages used for storing
> 16-bit values potentially by 2 or 4, which cuts down execution time
> significantly because byte-scattered writes are an expensive
> operation.
> 
> v2: Removed use of stride = 2 on sources (Jason Ekstrand)
> Rework optimization using shuffle 16 write and enable writes
> of 16bit vec4 with only one message of 32-bits. (Chema Casanova)
> 
> Signed-off-by: Jose Maria Casanova Crespo 
> Signed-off-by: Eduardo Lima 
> ---
>  src/intel/compiler/brw_fs_nir.cpp | 64 
> +++
>  1 file changed, 52 insertions(+), 12 deletions(-)
> 
> diff --git a/src/intel/compiler/brw_fs_nir.cpp 
> b/src/intel/compiler/brw_fs_nir.cpp
> index 2d0b3e139e..c07b3e4d8d 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -4218,6 +4218,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
> nir_intrinsic_instr *instr
>  instr->num_components);
>   val_reg = tmp;
>}
> +  if (bit_size == 16) {
> + val_reg=retype(val_reg, BRW_REGISTER_TYPE_HF);

Spaces around the '='. Probably also remove the block since there  is
only one instruction in it.

> +  }
>  
>/* 16-bit types would use a minimum of 1 slot */
>unsigned type_slots = MAX2(type_size / 4, 1);
> @@ -4231,6 +4234,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
> nir_intrinsic_instr *instr
>   unsigned first_component = ffs(writemask) - 1;
>   unsigned length = ffs(~(writemask >> first_component)) - 1;
>  
> + fs_reg current_val_reg =
> +offset(val_reg, bld, first_component * type_slots);
> +
>   /* We can't write more than 2 64-bit components at once. Limit the
>* length of the write to what we can do and let the next iteration
>* handle the rest
> @@ -4238,11 +4244,40 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
> nir_intrinsic_instr *instr
>   if (type_size > 4) {
>  length = MIN2(2, length);
>   } else if (type_size == 2) {
> -/* For 16-bit types we are using byte scattered writes, that can
> - * only write one component per call. So we limit the length, and
> - * let the write happening in several iterations.
> +/* For 16-bit types we pack two consecutive values into a
> + * 32-bit word and use an untyped write message. For single 
> values
> + * we need to use byte-scattered writes because untyped writes 
> work
> + * on multiples of 32 bits.
> + *
> + * For example, if there is a 3-component vector we submit one
> + * untyped-write message of 32-bit (first two components), and 
> one
> + * byte-scattered write message (the last component).
>   */
> -length = 1;
> +if (length >= 2) {
> +   /* pack two consecutive 16-bit words into a 32-bit register,
> +* using the same original source register.
> +*/
> +   length -= length % 2;
> +   fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F, length / 2);
> +   shuffle_16bit_data_for_32bit_write(bld,
> +  tmp,
> +  current_val_reg,
> +  length);
> +   current_val_reg = tmp;
> +
> +} else {
> +   /* For single 16-bit values, we just limit the length to 1 and
> +* use a byte-scattered write message below.
> +*/
> +   length = 1;
> +   fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F);
> +   shuffle_16bit_data_for_32bit_write(bld,
> +  tmp,
> +  current_val_reg,
> +  length);
> +   current_val_reg = tmp;
> +
> +}
>   }
>  
>   fs_reg offset_reg;
> @@ -4257,24 +4292,29 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
> nir_intrinsic_instr *instr
>  brw_imm_ud(type_size * first_component));
>   }
>  
> - if (type_size == 2) {
> + if (type_size == 2 && length == 1) {
>

[Mesa-dev] [PATCH v3 38/43] i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg

2017-10-12 Thread Jose Maria Casanova Crespo
From: Eduardo Lima Mitev 

Currently, we use byte-scattered write messages for storing 16-bit
into an SSBO. This is because untyped surface messages have a fixed
32-bit size.

This patch optimizes these 16-bit writes by combining 2 values (e.g,
two consecutive components) into a 32-bit register, packing the two
16-bit words.

16-bit single component values will continue to use byte-scattered
write messages.

This optimization reduces the number of SEND messages used for storing
16-bit values potentially by 2 or 4, which cuts down execution time
significantly because byte-scattered writes are an expensive
operation.

v2: Removed use of stride = 2 on sources (Jason Ekstrand)
Rework optimization using shuffle 16 write and enable writes
of 16bit vec4 with only one message of 32-bits. (Chema Casanova)

Signed-off-by: Jose Maria Casanova Crespo 
Signed-off-by: Eduardo Lima 
---
 src/intel/compiler/brw_fs_nir.cpp | 64 +++
 1 file changed, 52 insertions(+), 12 deletions(-)

diff --git a/src/intel/compiler/brw_fs_nir.cpp 
b/src/intel/compiler/brw_fs_nir.cpp
index 2d0b3e139e..c07b3e4d8d 100644
--- a/src/intel/compiler/brw_fs_nir.cpp
+++ b/src/intel/compiler/brw_fs_nir.cpp
@@ -4218,6 +4218,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
nir_intrinsic_instr *instr
 instr->num_components);
  val_reg = tmp;
   }
+  if (bit_size == 16) {
+ val_reg=retype(val_reg, BRW_REGISTER_TYPE_HF);
+  }
 
   /* 16-bit types would use a minimum of 1 slot */
   unsigned type_slots = MAX2(type_size / 4, 1);
@@ -4231,6 +4234,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
nir_intrinsic_instr *instr
  unsigned first_component = ffs(writemask) - 1;
  unsigned length = ffs(~(writemask >> first_component)) - 1;
 
+ fs_reg current_val_reg =
+offset(val_reg, bld, first_component * type_slots);
+
  /* We can't write more than 2 64-bit components at once. Limit the
   * length of the write to what we can do and let the next iteration
   * handle the rest
@@ -4238,11 +4244,40 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
nir_intrinsic_instr *instr
  if (type_size > 4) {
 length = MIN2(2, length);
  } else if (type_size == 2) {
-/* For 16-bit types we are using byte scattered writes, that can
- * only write one component per call. So we limit the length, and
- * let the write happening in several iterations.
+/* For 16-bit types we pack two consecutive values into a
+ * 32-bit word and use an untyped write message. For single values
+ * we need to use byte-scattered writes because untyped writes work
+ * on multiples of 32 bits.
+ *
+ * For example, if there is a 3-component vector we submit one
+ * untyped-write message of 32-bit (first two components), and one
+ * byte-scattered write message (the last component).
  */
-length = 1;
+if (length >= 2) {
+   /* pack two consecutive 16-bit words into a 32-bit register,
+* using the same original source register.
+*/
+   length -= length % 2;
+   fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F, length / 2);
+   shuffle_16bit_data_for_32bit_write(bld,
+  tmp,
+  current_val_reg,
+  length);
+   current_val_reg = tmp;
+
+} else {
+   /* For single 16-bit values, we just limit the length to 1 and
+* use a byte-scattered write message below.
+*/
+   length = 1;
+   fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F);
+   shuffle_16bit_data_for_32bit_write(bld,
+  tmp,
+  current_val_reg,
+  length);
+   current_val_reg = tmp;
+
+}
  }
 
  fs_reg offset_reg;
@@ -4257,24 +4292,29 @@ fs_visitor::nir_emit_intrinsic(const fs_builder , 
nir_intrinsic_instr *instr
 brw_imm_ud(type_size * first_component));
  }
 
- if (type_size == 2) {
+ if (type_size == 2 && length == 1) {
 /* Untyped Surface messages have a fixed 32-bit size, so we need
  * to rely on byte scattered in order to write 16-bit elements.
  * The byte_scattered_write message needs that every written 16-bit
  * type to be aligned 32-bits (stride=2).
  */
-fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_F);
-val_reg.type