Hi Richard,

On 11/09/2018 11:48 AM, Richard Biener wrote:
On Thu, Nov 8, 2018 at 5:55 PM Renlin Li <renlin...@foss.arm.com> wrote:

Hi Richard,


*However*, after I rebased my patch on the latest trunk, I got the
following dump from ifcvt:
    <bb 3> [local count: 1006632961]:
    # i_20 = PHI <i_13(9), 1(21)>
    # ivtmp_18 = PHI <ivtmp_5(9), 15(21)>
    a_10 = array[i_20];
    _1 = a_10 & 1;
    _2 = a_10 + 1;
    _ifc__34 = _1 != 0 ? _2 : a_10;
    array[i_20] = _ifc__34;
    _4 = a_10 + 2;
    _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
    array[i_20] = _ifc__37;
    i_13 = i_20 + 1;
    ivtmp_5 = ivtmp_18 - 1;
    if (ivtmp_5 != 0)
      goto <bb 9>; [93.33%]
    else
      goto <bb 8>; [6.67%]

the redundant load is not generated, but you can still see the
unconditional store.
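
For reference, a source loop of roughly the following shape could produce
a dump like the one above (a hypothetical reconstruction on my side, not
necessarily the exact testcase):

  static int array[100];

  void foo (void)
  {
    for (int i = 1; i < 16; i++)
      {
        int a = array[i];
        if (a & 1)
          array[i] = a + 1;   /* first conditional store */
        if (array[i] > 10)    /* reload of array[i] folds into _ifc__34 */
          array[i] = a + 2;   /* second conditional store */
      }
  }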

Yes, I fixed the redundant loads recently and indeed dead stores remain
(for the particular testcase they would be easy to remove).

Right.


After loop vectorization, the following is generated (without my change):

Huh.  But that's not because of if-conversion but because SVE needs to
mask _all_ loop operations that are not safe to execute with the loop_mask!

    vect_a_10.6_6 = .MASK_LOAD (vectp_array.4_35, 4B, loop_mask_7);
    a_10 = array[i_20];
    vect__1.7_39 = vect_a_10.6_6 & vect_cst__38;
    _1 = a_10 & 1;
    vect__2.8_41 = vect_a_10.6_6 + vect_cst__40;
    _2 = a_10 + 1;
    vect__ifc__34.9_43 = VEC_COND_EXPR <vect__1.7_39 != vect_cst__42, vect__2.8_41, vect_a_10.6_6>;
    _ifc__34 = _1 != 0 ? _2 : a_10;
    .MASK_STORE (vectp_array.10_45, 4B, loop_mask_7, vect__ifc__34.9_43);
    vect__4.12_49 = vect_a_10.6_6 + vect_cst__48;
    _4 = a_10 + 2;
    vect__ifc__37.13_51 = VEC_COND_EXPR <vect__ifc__34.9_43 > vect_cst__50, vect__4.12_49, vect__ifc__34.9_43>;
    _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
    .MASK_STORE (vectp_array.14_53, 4B, loop_mask_7, vect__ifc__37.13_51);

With the old ifcvt code, my change could improve it a little bit by
eliminating some redundant loads. With the new code, it cannot improve it
further. I'll adjust the patch based on the latest trunk.

So what does the patch change the above to?  The code has little to no
comments apart from a small picture with code _before_ the transform...

It is like this:
  vect_a_10.6_6 = .MASK_LOAD (vectp_array.4_35, 4B, loop_mask_7);
  a_10 = array[i_20];
  vect__1.7_39 = vect_a_10.6_6 & vect_cst__38;
  _1 = a_10 & 1;
  vect__2.8_41 = vect_a_10.6_6 + vect_cst__40;
  _2 = a_10 + 1;
  _60 = vect__1.7_39 != vect_cst__42;
  vect__ifc__34.9_43 = VEC_COND_EXPR <_60, vect__2.8_41, vect_a_10.6_6>;
  _ifc__34 = _1 != 0 ? _2 : a_10;
  vec_mask_and_61 = _60 & loop_mask_7;
  .MASK_STORE (vectp_array.10_45, 4B, vec_mask_and_61, vect__2.8_41);
  vect__4.12_49 = vect_a_10.6_6 + vect_cst__48;
  _4 = a_10 + 2;
  vect__ifc__37.13_51 = VEC_COND_EXPR <vect__ifc__34.9_43 > vect_cst__50, vect__4.12_49, vect__ifc__34.9_43>;
  _ifc__37 = _ifc__34 > 10 ? _4 : _ifc__34;
  .MASK_STORE (vectp_array.14_53, 4B, loop_mask_7, vect__ifc__37.13_51);

As the loaded value is used later, the load cannot be removed.

With the change, ideally, fewer lanes are stored.
However, it might generate more instructions:

1. The load is not eliminable here.  Apparently, your change eliminates
   most of the redundant loads; the rest are necessary or not easy to remove.
2. There is an additional AND instruction to combine the masks.

With a simpler test case like this:

static int array[100];
int test (int a, int i)
{
  for (unsigned i = 0; i < 16; i++)
    {
      if (a & 1)
        array[i] = a + 1;
    }
  return array[i];
}

The new code-gen will be:
  vect__2.4_29 = vect_cst__27 + vect_cst__28;
  _44 = vect_cst__34 != vect_cst__35;
  vec_mask_and_45 = _44 & loop_mask_32;
  .MASK_STORE (vectp_array.9_37, 4B, vec_mask_and_45, vect__2.4_29);

While the old one is:

  vect__2.4_29 = vect_cst__27 + vect_cst__28;
  vect__ifc__24.7_33 = .MASK_LOAD (vectp_array.5_30, 4B, loop_mask_32);
  vect__ifc__26.8_36 = VEC_COND_EXPR <vect_cst__34 != vect_cst__35, vect__2.4_29, vect__ifc__24.7_33>;
  .MASK_STORE (vectp_array.9_37, 4B, loop_mask_32, vect__ifc__26.8_36);
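
To make the difference concrete, here is a scalar model of what the two
sequences do per lane (just a sketch; it assumes the load and the store
use the same address and the same loop mask, and the names are made up):

  /* Old form: masked load, select, masked store under loop_mask.  */
  void old_form (int *dst, const int *src,
                 const _Bool *loop_mask, const _Bool *cond, int n)
  {
    for (int i = 0; i < n; i++)
      if (loop_mask[i])
        {
          int l = dst[i];                  /* .MASK_LOAD             */
          dst[i] = cond[i] ? src[i] : l;   /* VEC_COND + .MASK_STORE */
        }
  }

  /* New form: store under (loop_mask & cond); no load of the old
     value is needed, and masked-off lanes are left untouched.  */
  void new_form (int *dst, const int *src,
                 const _Bool *loop_mask, const _Bool *cond, int n)
  {
    for (int i = 0; i < n; i++)
      if (loop_mask[i] && cond[i])         /* vec_mask_and_45 */
        dst[i] = src[i];
  }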



I was wondering whether we can implement

   l = [masked]load;
   tem = cond ? x : l;
   masked-store = tem;

pattern matching in a regular pass - forwprop for example.  Note the load
doesn't need to be masked, correct?  In fact if it is masked you need to
make sure the conditional never accesses parts that are masked in the
load, no?  Or require the mask to be the same as that used by the store.
But then you still cannot simply replace the store mask with a new mask
generated from the conditional?

Yes, this would require the mask for the load and the store to be the same.
This matches the pattern before loop vectorization.
The mask here is the loop mask, which ensures we stay within the bounds
of the iteration count.

The new mask is (original mask & condition mask) (example shown above).
In this case, fewer lanes will be stored.

It is possible to do that in forwprop.
I could try to integrate the change there if it is the correct place to go.
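
Something along the following lines, perhaps (just a sketch using the
existing GIMPLE accessors; the function name is made up and the checks
that nothing clobbers the memory between the load and the store are
omitted):

  /* Recognise
       l = .MASK_LOAD (addr, align, mask);
       tem = VEC_COND_EXPR <cond, x, l>;
       .MASK_STORE (addr, align, mask, tem);
     so that the store can be rewritten to use (mask & cond) and
     store x directly.  */
  static bool
  match_masked_store_of_cond (gimple *store)
  {
    if (!gimple_call_internal_p (store, IFN_MASK_STORE))
      return false;

    tree stored = gimple_call_arg (store, 3);
    if (TREE_CODE (stored) != SSA_NAME)
      return false;

    gimple *sel = SSA_NAME_DEF_STMT (stored);
    if (!is_gimple_assign (sel)
        || gimple_assign_rhs_code (sel) != VEC_COND_EXPR)
      return false;

    tree l = gimple_assign_rhs3 (sel);
    if (TREE_CODE (l) != SSA_NAME)
      return false;

    gimple *load = SSA_NAME_DEF_STMT (l);
    if (!gimple_call_internal_p (load, IFN_MASK_LOAD))
      return false;

    /* Require the same address and the same mask on both sides.  */
    return (operand_equal_p (gimple_call_arg (load, 0),
                             gimple_call_arg (store, 0), 0)
            && operand_equal_p (gimple_call_arg (load, 2),
                                gimple_call_arg (store, 2), 0));
  }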

As the pattern is initially generated by the loop vectorizer, I made the
change right after it, before the pattern gets converted into other forms.
For example, forwprop will transform the original code into:

  vect__2.4_29 = vect_cst__27 + { 1, ... };
  _16 = (void *) ivtmp.13_25;
  _2 = &MEM[base: _16, offset: 0B];
  vect__ifc__24.7_33 = .MASK_LOAD (_2, 4B, loop_mask_32);
  _28 = vect_cst__34 != { 0, ... };
  _35 = .COND_ADD (_28, vect_cst__27, { 1, ... }, vect__ifc__24.7_33);
  vect__ifc__26.8_36 = _35;
  .MASK_STORE (_2, 4B, loop_mask_32, vect__ifc__26.8_36);
  ivtmp_41 = ivtmp_40 + POLY_INT_CST [4, 4];
  next_mask_43 = .WHILE_ULT (ivtmp_41, 16, { 0, ... });
  ivtmp.13_15 = ivtmp.13_25 + POLY_INT_CST [16, 16];

This makes the pattern matching less straightforward.

Thanks,
Renlin



