On Wed, Oct 7, 2015 at 10:21 AM, Andre Vieira
<andre.simoesdiasvie...@arm.com> wrote:
> On 25/09/15 12:42, Richard Biener wrote:
>>
>> On Fri, Sep 25, 2015 at 1:30 PM, Andre Vieira
>> <andre.simoesdiasvie...@arm.com> wrote:
>>>
>>> On 17/09/15 10:46, Richard Biener wrote:
>>>>
>>>>
>>>> On Thu, Sep 3, 2015 at 1:11 PM, Andre Vieira
>>>> <andre.simoesdiasvie...@arm.com> wrote:
>>>>>
>>>>>
>>>>> On 01/09/15 15:01, Richard Biener wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 1, 2015 at 3:40 PM, Andre Vieira
>>>>>> <andre.simoesdiasvie...@arm.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Marc,
>>>>>>>
>>>>>>> On 28/08/15 19:07, Marc Glisse wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> (not a review, I haven't even read the whole patch)
>>>>>>>>
>>>>>>>> On Fri, 28 Aug 2015, Andre Vieira wrote:
>>>>>>>>
>>>>>>>>> 2015-08-03  Andre Vieira  <andre.simoesdiasvie...@arm.com>
>>>>>>>>>
>>>>>>>>>      * match.pd: Added new patterns:
>>>>>>>>>        (((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
>>>>>>>>>        (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0
>>>>>>>>> {<<,>>}
>>>>>>>>> C1)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> +(for op0 (rshift rshift lshift lshift bit_and bit_and)
>>>>>>>> + op1 (bit_ior bit_xor bit_ior bit_xor bit_ior bit_xor)
>>>>>>>> + op2 (bit_xor bit_ior bit_xor bit_ior bit_xor bit_ior)
>>>>>>>>
>>>>>>>> You can nest for-loops, it seems clearer as:
>>>>>>>> (for op0 (rshift lshift bit_and)
>>>>>>>>       (for op1 (bit_ior bit_xor)
>>>>>>>>            op2 (bit_xor bit_ior)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Will do, thank you for pointing it out.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> +(simplify
>>>>>>>> + (op2:c
>>>>>>>> +  (op1:c
>>>>>>>> +   (op0 @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)
>>>>>>>>
>>>>>>>> I suspect you will want more :s (single_use) and less :c
>>>>>>>> (canonicalization
>>>>>>>> should put constants in second position).
>>>>>>>>
>>>>>>> I can't find the definition of :s (single_use).
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sorry for that - didn't get around to updating it yet :/  It restricts
>>>>>> matching to sub-expressions that have a single use.  So
>>>>>>
>>>>>> +  a &= 0xd123;
>>>>>> +  unsigned short tem = a ^ 0x6040;
>>>>>> +  a = tem | 0xc031; /* Simplify _not_ to ((a & 0xd123) | 0xe071).  */
>>>>>> ... use of tem ...
>>>>>>
>>>>>> we shouldn't do the simplification here because the expression
>>>>>> (a & 0xd123) ^ 0x6040 is kept live by 'tem'.
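>>>>>>
>>>>>> As a (hypothetical) counterpart, the single-use case where the fold is
>>>>>> wanted would look like:
>>>>>>
>>>>>> unsigned short f (unsigned short a)
>>>>>> {
>>>>>>   a &= 0xd123;
>>>>>>   unsigned short tem = a ^ 0x6040;
>>>>>>   return tem | 0xc031;  /* tem has no other use, so simplifying to
>>>>>>                            (a & 0xd123) | 0xe071 saves an operation.  */
>>>>>> }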
>>>>>>
>>>>>>> The GCC internals documentation does point out
>>>>>>> that canonicalization puts constants in the second position;
>>>>>>> I didn't see that at first. Thank you for pointing it out.
>>>>>>>
>>>>>>>> +       C1 = wi::bit_and_not (C1,C2);
>>>>>>>>
>>>>>>>> Space after ','.
>>>>>>>>
>>>>>>> Will do.
>>>>>>>
>>>>>>>> Having wide_int_storage in many places is surprising; I can't find
>>>>>>>> similar code anywhere else in gcc.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> I tried looking for examples of something similar. I think I ended up
>>>>>>> using wide_int because I was able to convert easily to and from it and
>>>>>>> it has the "mask" and "wide_int_to_tree" functions. I welcome any
>>>>>>> suggestions on what I should be using here for integer constant
>>>>>>> transformations and comparisons.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Using wide-ints is fine, but you shouldn't need 'wide_int_storage'
>>>>>> constructors - those
>>>>>> are indeed odd.  Is it just for the initializers of wide-ints?
>>>>>>
>>>>>> +    wide_int zero_mask = wi::zero (prec);
>>>>>> +    wide_int C0 = wide_int_storage (@1);
>>>>>> +    wide_int C1 = wide_int_storage (@2);
>>>>>> +    wide_int C2 = wide_int_storage (@3);
>>>>>> ...
>>>>>> +       zero_mask = wide_int_storage (wi::mask (C0.to_uhwi (), false,
>>>>>> prec));
>>>>>>
>>>>>> tree_to_uhwi (@1) should do the trick as well
>>>>>>
>>>>>> +       C1 = wi::bit_and_not (C1,C2);
>>>>>> +       cst_emit = wi::bit_or (C1, C2);
>>>>>>
>>>>>> the ops should be replaceable with @2 and @3, the store to C1 obviously
>>>>>> not
>>>>>> (but you can use a tree temporary and use wide_int_to_tree here to
>>>>>> avoid
>>>>>> the back-and-forth for the case where C1 is not assigned to).
>>>>>>
>>>>>> Note that transforms only doing association are prone to endless
>>>>>> recursion
>>>>>> in case some other pattern does the reverse op...
>>>>>>
>>>>>> Richard.
>>>>>>
>>>>>>
>>>>>>> BR,
>>>>>>> Andre
>>>>>>>
>>>>>>
>>>>> Thank you for all the comments, see reworked version:
>>>>>
>>>>> Two new algorithmic optimisations:
>>>>>     1. (((X op0 C0) op1 C1) op2 C2)
>>>>>       with op0 = {&, >>, <<}, op1 = {|,^}, op2 = {|,^} and op1 != op2
>>>>>       zero_mask has 1's for all bits that are sure to be 0 in (X op0 C0)
>>>>>       and 0's otherwise.
>>>>>       if (op1 == '^') C1 &= ~C2 (only changed if actually emitted)
>>>>>       if ((C1 & ~zero_mask) == 0) then emit (X op0 C0) op2 (C1 op2 C2)
>>>>>       if ((C2 & ~zero_mask) == 0) then emit (X op0 C0) op1 (C1 op2 C2)
>>>>>     2. (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)
>>>>>
>>>>>
>>>>> This patch does two algorithmic optimisations that target patterns like:
>>>>> ((x << 24) | 0x00FFFFFF) ^ 0xFF000000 and ((x ^ 0x40000002) >> 1) | 0x80000000.
>>>>>
>>>>> The transformation uses the knowledge of which bits are zero after
>>>>> operations like (X {&,<<,(unsigned)>>} C0) to combine constants, reducing
>>>>> the number of run-time operations.
>>>>> The two examples above would be transformed into (x << 24) ^ 0xFFFFFFFF
>>>>> and (x >> 1) ^ 0xa0000001 respectively.
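>>>>>
>>>>> As a sketch of the intended effect (hypothetical test functions, assuming
>>>>> a 32-bit unsigned int), the two cases look like:
>>>>>
>>>>> unsigned int f1 (unsigned int x)
>>>>> {
>>>>>   /* The low 24 bits of (x << 24) are known to be zero, so the constants
>>>>>      combine and this should fold to (x << 24) ^ 0xFFFFFFFF.  */
>>>>>   return ((x << 24) | 0x00FFFFFF) ^ 0xFF000000;
>>>>> }
>>>>>
>>>>> unsigned int f2 (unsigned int x)
>>>>> {
>>>>>   /* Transformation 2 first gives ((x >> 1) ^ 0x20000001) | 0x80000000;
>>>>>      bit 31 of (x >> 1) is known to be zero, so transformation 1 then
>>>>>      folds this to (x >> 1) ^ 0xa0000001.  */
>>>>>   return ((x ^ 0x40000002) >> 1) | 0x80000000;
>>>>> }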
>>>>>
>>>>> The second transformation enables more applications of the first. Also,
>>>>> some targets may benefit from delaying shift operations. I am aware that
>>>>> such an optimization, in combination with one or more optimizations that
>>>>> cause the reverse transformation, may lead to an infinite loop. No such
>>>>> behavior has been detected during regression testing and bootstrapping on
>>>>> aarch64, however.
>>>>
>>>>
>>>>
>>>> +/* (X bit_op C0) rshift C1 -> (X rshift C1) bit_op (C0 rshift C1) */
>>>> +(for bit_op (bit_ior bit_xor bit_and)
>>>> +(simplify
>>>> + (rshift (bit_op:s @0 INTEGER_CST@1) INTEGER_CST@2)
>>>> + (bit_op
>>>> +  (rshift @0 @2)
>>>> +  { wide_int_to_tree (type, wi::rshift (@1, @2, TYPE_SIGN (type)));
>>>> })))
>>>> +
>>>> +/* (X bit_op C0) lshift C1 -> (X lshift C1) bit_op (C0 lshift C1) */
>>>> +(for bit_op (bit_ior bit_xor bit_and)
>>>> +(simplify
>>>> + (lshift (bit_op:s @0 INTEGER_CST@1) INTEGER_CST@2)
>>>> + (bit_op
>>>> +  (lshift @0 @2)
>>>> +  { wide_int_to_tree (type, wi::lshift (@1, @2)); })))
>>>
>>>
>>>
>>> Will do. Good to see that my second transformation still picks up the
>>> (shift @1 @2) as a constant. I'm assuming there is some constant folding
>>> going on between simplify transformations?
>>
>>
>> Yes.
>>
>>>>
>>>> this may be one case where not using wide-ints makes sense, so that the
>>>> patterns can be combined.  Thus,
>>>>
>>>> (for shift (lshift rshift)
>>>>    (simplify
>>>>     (shift ...)
>>>>     (bit_op
>>>>      (shift @0 @2)
>>>>      (shift @1 @2))))
>>>>
>>>> note that I'm worried we'd trigger "invalid" ubsan diagnostics here when
>>>> the above applies to
>>>>
>>>> int foo (int i)
>>>> {
>>>>     return (i & 0x7fffffff) >> 3;
>>>> }
>>>> int main () { return foo (0x80000007); }
>>>>
>>>> and i is negative.  That's because right-shift of negative values
>>>> invokes undefined
>>>> behavior.  IIRC in the middle-end we will not be taking advantage of
>>>> that but the
>>>> simplifications apply to GENERIC as well and thus may hit before ubsan
>>>> instrumentation triggers(?)  It would be nice if you could double-check
>>>> that.
>>>
>>>
>>>
>>> I was looking into this and I understand your worries. However, I found
>>> out that for the particular case of shifts and bit_and there already is a
>>> simplify transformation that does the same, regardless of the sign.
>>>
>>> /* Fold (X & C2) << C1 into (X << C1) & (C2 << C1)
>>>     (X & C2) >> C1 into (X >> C1) & (C2 >> C1).  */
>>> (for shift (lshift rshift)
>>>   (simplify
>>>    (shift (convert?:s (bit_and:s @0 INTEGER_CST@2)) INTEGER_CST@1)
>>>    (if (tree_nop_conversion_p (type, TREE_TYPE (@0)))
>>>     (with { tree mask = int_const_binop (shift, fold_convert (type, @2),
>>> @1);
>>> }
>>>      (bit_and (shift (convert @0) @1) { mask; })))))
>>
>>
>> I see ... also an opportunity to merge this pattern with yours.
>>
>>> Now, I don't quite understand what you mean by ubsan instrumentation:
>>> will GCC introduce guards into code where it detects potential undefined
>>> behavior?
>>
>>
>> Yes.
>>
>>> Also, it might be worth noting that right shift of negative
>>> values is denoted as "implementation defined" by the C standard. GCC,
>>> however, doesn't include it in its list of implementation-defined
>>> behavior, so I guess that is why you refer to it as undefined?
>>
>>
>> Not sure, I thought it was undefined.  If it's implementation defined
>> then GCC needs to document its behavior.
>>
>>> Should we perhaps disable transformations where we cannot guarantee that
>>> the right shift produced is not a shift of a negative value? I.e. prohibit
>>> this transformation for:
>>> 1) signed types and op1 == bit_xor
>>> 2) signed types and op1 == bit_and and C1 has the sign bit set.
>>>
>>> Also, would it be useful, in cases where you have a signed shift and
>>> bit_and and C1 is not negative, to do the transformation but replace the
>>> signed shift by an unsigned shift? This would make sure we don't introduce
>>> undefined/implementation-defined behavior where there was none.
>>>
>>> This does, however, change the current behavior!
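>>>
>>> Taking your foo example from above (just a sketch, assuming 32-bit int
>>> and unsigned int):
>>>
>>> int foo (int i)
>>> {
>>>   /* (i & 0x7fffffff) is never negative, so there is no questionable
>>>      shift in the source.  Emitting the transformed form with an
>>>      unsigned shift, i.e. (int) (((unsigned int) i >> 3) & (0x7fffffff >> 3)),
>>>      keeps it that way, whereas (i >> 3) & (0x7fffffff >> 3) would
>>>      right-shift a negative value whenever i is negative.  */
>>>   return (i & 0x7fffffff) >> 3;
>>> }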
>>
>>
>> Yeah, so unless somebody else chimes in let's consider this as followup
>> only.
>>
>>>>
>>>> + (if (!(op0 == RSHIFT_EXPR && !TYPE_UNSIGNED (type)) && wi::fits_uhwi_p
>>>> (@1))
>>>>
>>>> you only need fits_uhwi_p (@1) in the op0 != BIT_AND_EXPR case it
>>>> seems, so better
>>>> move it down to those cases.
>>>
>>>
>>>
>>> So I used to, but I had the problem that I didn't know how to make the
>>> matching "fail" if this was not the case. For instance, if op0 is an
>>> lshift for which the constant doesn't fit uhwi, then it would fall through
>>> and never set the zero mask, potentially leading to a wrong transformation.
>>> Now I'm not sure this can happen, since that would mean that either
>>> constant @2 or @3 would need to be all 1's, and that might already be
>>> caught by some other transformation, but it's wrong to rely on such
>>> behavior IMO. So for now I have changed it to:
>>>
>>> (simplify
>>>   (op2
>>>    (op1:s
>>>     (op0@4 @0 INTEGER_CST@1) INTEGER_CST@2) INTEGER_CST@3)
>>>   (if (!(op0 == RSHIFT_EXPR && !TYPE_UNSIGNED (type)) &&
>>>        (op0 == BIT_AND_EXPR || wi::fits_uhwi_p (@1)))
>>>
>>>
>>> It would be cool to have a FAIL expression, usable in the with clauses,
>>> to
>>> make the pattern match fail a bit like the one in the machine description
>>> language.
>>
>>
>> I'll think about it.  Currently you'd need to add a 'bool fail' in the
>> with, initialize it, and add an (if (!fail) ...) after it.
>>
>>>>
>>>> +   (if (wi::eq_p (wi::bit_and (C1, zero_mask_not), wi::zero (prec)))
>>>>
>>>> I think you can write
>>>>
>>>>      (if (wi::bit_and (...) == 0)
>>>>
>>>> or at least wi:eq_p (wi::bit_and (...), 0).
>>>>
>>>
>>> wi::bit_and (...) == 0 seems to be doing the trick.
>>>
>>>> I wonder if we shouldn't improve the pattern by handling (X op0 C0)
>>>> transparently via get_nonzero_bits (yeah, that's not exactly zero_mask
>>>> but its inverse AFAIK).
>>>> We'd wrap get_nonzero_bits in a helper that can handle GENERIC and your
>>>> &, >>, << cases (hmm, such a function must already exist somewhere...).
>>>> So you'd reduce the pattern to
>>>>
>>>> + (for op1 (bit_ior bit_xor)
>>>> +      op2 (bit_xor bit_ior)
>>>> +(simplify
>>>> + (op2
>>>> +  (op1:s @0 INTEGER_CST@2) INTEGER_CST@3))
>>>>      (with
>>>>       {
>>>>         wide_int zero_mask_not = get_nonzero_bits (@0);
>>>> ...
>>>>       }
>>>>
>>>> This would make use of value-range information determined by VRP for
>>>> example.
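>>>>
>>>> E.g. (a hypothetical case, assuming the nonzero-bits information for t
>>>> has been computed):
>>>>
>>>> unsigned int baz (unsigned int x)
>>>> {
>>>>   unsigned int t = x % 16;   /* nonzero bits of t are 0xf, i.e. the
>>>>                                 zero_mask is ~0xf  */
>>>>   return (t | 0x10) ^ 0x20;  /* no explicit (X op0 C0) in the expression,
>>>>                                 yet it could still fold to t ^ 0x30  */
>>>> }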
>>>
>>>
>>>
>>> I'll go look for such a function.
>>>
>>>>
>>>> note that with your pattern you'd want to capture (op0:s @0 INTEGER_CST@1),
>>>> e.g. via (op0@4 @0 INTEGER_CST@1), so you can re-use it in the replacement
>>>> like so:
>>>>
>>>> +   (if (wi::eq_p (wi::bit_and (C1, zero_mask_not), wi::zero (prec)))
>>>> +    (op2 @4 { wide_int_to_tree (type, cst_emit); })
>>>> +    (if (wi::eq_p (wi::bit_and (@3, zero_mask_not), wi::zero (prec)))
>>>> +     (op1 @4 { wide_int_to_tree (type, cst_emit); }))))))))
>>>>
>>>> the expression doesn't need a :s then obviously.
>>>
>>>
>>>
>>> Yeah makes sense.
>>>>
>>>>
>>>>
>>>> Thanks and sorry for the delay in reviewing this.
>>>> Richard.
>>>>
>>>
>>> Thank you for all the comments!
>>
>>
>> No problem!
>>
>>>>
>>>>> gcc/ChangeLog:
>>>>>
>>>>> 2015-08-03  Andre Vieira  <andre.simoesdiasvie...@arm.com>
>>>>>
>>>>>     * match.pd: Added new patterns:
>>>>>       (((X {&,<<,>>} C0) {|,^} C1) {^,|} C2)
>>>>>       (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>}
>>>>> C1)
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>
>>>>> 2015-08-03  Andre Vieira  <andre.simoesdiasvie...@arm.com>
>>>>>               Hale Wang  <hale.w...@arm.com>
>>>>>
>>>>>     * gcc.dg/tree-ssa/forwprop-33.c: New test.
>>>>
>>>>
>>>>
>>>
>>
> Thanks again for the comments Richard!
>
> A new algorithmic optimisation:
>
> ((X inner_op C0) outer_op C1)
> With X being a tree for which value_range has reasoned that certain bits are
> always zero throughout its computed value range (we will call this set of
> bits the zero_mask), and with inner_op = {|,^}, outer_op = {|,^} and
> inner_op != outer_op:
> if (inner_op == '^') C0 &= ~C1;
> if ((C0 & ~zero_mask) == 0) then emit (X outer_op (C0 outer_op C1))
> if ((C1 & ~zero_mask) == 0) then emit (X inner_op (C0 outer_op C1))
>
> And extended '(X & C2) << C1 into (X << C1) & (C2 << C1)' and
> '(X & C2) >> C1 into (X >> C1) & (C2 >> C1)' to also accept the bitwise or
> and xor operators:
> '(X {&,^,|} C2) << C1 into (X << C1) {&,^,|} (C2 << C1)' and
> '(X {&,^,|} C2) >> C1 into (X >> C1) {&,^,|} (C2 >> C1)'.
>
> The second transformation enables more applications of the first (see the
> sketch below). Also, some targets may benefit from delaying shift operations.
> I am aware that such an optimization, in combination with one or more
> optimizations that cause the reverse transformation, may lead to an infinite
> loop. No such behavior has been detected during regression testing and
> bootstrapping on aarch64, however.
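>
> A sketch of how the two interact (a hypothetical test function, assuming a
> 32-bit unsigned int):
>
> unsigned int f (unsigned int x)
> {
>   /* The extended shift pattern first rewrites this to
>      ((x << 8) ^ 0xFF00) | 0xFF; the low 8 bits of (x << 8) are known to be
>      zero, so the new pattern then combines the constants into
>      (x << 8) ^ 0xFFFF.  */
>   return ((x ^ 0xFF) << 8) | 0xFF;
> }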
>
> gcc/ChangeLog:
>
> 2015-10-05  Andre Vieira  <andre.simoesdiasvie...@arm.com>
>
>     * match.pd: Added a new pattern
>       ((X inner_op C0) outer_op C1)
>       and expanded existing one
>       (X {|,^,&} C0) {<<,>>} C1 -> (X {<<,>>} C1) {|,^,&} (C0 {<<,>>} C1)
>
> gcc/testsuite/ChangeLog:
>
> 2015-10-05  Andre Vieira  <andre.simoesdiasvie...@arm.com>
>             Hale Wang  <hale.w...@arm.com>
>
>     * gcc.dg/tree-ssa/forwprop-33.c: New test.

Ok.

Thanks,
Richard.
