RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]

Tamar Christina via Gcc-patches Fri, 10 Feb 2023 09:15:26 -0800

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Friday, February 10, 2023 4:57 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: Friday, February 10, 2023 4:25 PM
> >> To: Tamar Christina <tamar.christ...@arm.com>
> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> div-bitmask by using new optabs [PR108583]
> >>
> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> >> -----Original Message-----
> >> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> >> Sent: Friday, February 10, 2023 3:57 PM
> >> >> To: Tamar Christina <tamar.christ...@arm.com>
> >> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> >> >> <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
> >> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
> >> >> div-bitmask by using new optabs [PR108583]
> >> >>
> >> >> Tamar Christina <tamar.christ...@arm.com> writes:
> >> >> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index
> >> >> >> >
> >> >> >>
> >> >>
> >>
> 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f33
> >> >> >> 69
> >> >> >> > de2afea139d6 100644
> >> >> >> > --- a/gcc/tree-vect-patterns.cc
> >> >> >> > +++ b/gcc/tree-vect-patterns.cc
> >> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern
> (vec_info
> >> >> *vinfo,
> >> >> >> >        return pattern_stmt;
> >> >> >> >      }
> >> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> >> >> > -     && targetm.vectorize.can_special_div_by_const (rhs_code,
> >> >> >> vectype,
> >> >> >> > -                                                    wi::to_wide
> >> (cst),
> >> >> >> > -                                                    NULL,
> >> NULL_RTX,
> >> >> >> > -                                                    NULL_RTX))
> >> >> >> > +     && TYPE_UNSIGNED (itype)
> >> >> >> > +     && rhs_code == TRUNC_DIV_EXPR
> >> >> >> > +     && vectype
> >> >> >> > +     && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> >> >> > +                                        OPTIMIZE_FOR_SPEED))
> >> >> >> >      {
> >> >> >> > -      return NULL;
> >> >> >> > +      /* div optimizations using narrowings
> >> >> >> > +       we can do the division e.g. shorts by 255 faster by
> >> >> >> > + calculating it
> >> as
> >> >> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done 
> >> >> >> > in
> >> >> >> > +       double the precision of x.
> >> >> >> > +
> >> >> >> > +       If we imagine a short as being composed of two blocks
> >> >> >> > + of bytes
> >> >> then
> >> >> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is
> >> >> >> > + equivalent
> >> >> to
> >> >> >> > +       adding 1 to each sub component:
> >> >> >> > +
> >> >> >> > +      short value of 16-bits
> >> >> >> > +       ┌──────────────┬────────────────┐
> >> >> >> > +       │              │                │
> >> >> >> > +       └──────────────┴────────────────┘
> >> >> >> > +   8-bit part1 ▲  8-bit part2   ▲
> >> >> >> > +               │                │
> >> >> >> > +               │                │
> >> >> >> > +              +1               +1
> >> >> >> > +
> >> >> >> > +       after the first addition, we have to shift right by
> >> >> >> > + 8, and narrow
> >> the
> >> >> >> > +       results back to a byte.  Remember that the addition
> >> >> >> > + must be done
> >> >> in
> >> >> >> > +       double the precision of the input.  However if we
> >> >> >> > + know that the
> >> >> >> addition
> >> >> >> > +       `x + 257` does not overflow then we can do the
> >> >> >> > + operation in the
> >> >> >> current
> >> >> >> > +       precision.  In which case we don't need the pack and
> unpacks.
> >> */
> >> >> >> > +      auto wcst = wi::to_wide (cst);
> >> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> >> >> > +  {
> >> >> >> > +    wide_int min,max;
> >> >> >> > +    /* If we're in a pattern we need to find the orginal
> >> definition.  */
> >> >> >> > +    tree op0 = oprnd0;
> >> >> >> > +    gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> >> >> > +    stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> >> >> > +    if (is_pattern_stmt_p (stmt_info))
> >> >> >> > +      {
> >> >> >> > +        auto orig_stmt = STMT_VINFO_RELATED_STMT
> >> (stmt_info);
> >> >> >> > +        if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> >> >> > +          op0 = gimple_assign_lhs (STMT_VINFO_STMT
> >> (orig_stmt));
> >> >> >> > +      }
> >> >> >>
> >> >> >> If this is generally safe (I'm skipping thinking about it in
> >> >> >> the interests of a quick review :-)), then I think it should be
> >> >> >> done in vect_get_range_info instead.  Using gimple_get_lhs
> >> >> >> would be more general than handling just assignments.
> >> >> >>
> >> >> >> > +
> >> >> >> > +    /* Check that no overflow will occur.  If we don't have range
> >> >> >> > +       information we can't perform the optimization.  */
> >> >> >> > +    if (vect_get_range_info (op0, &min, &max))
> >> >> >> > +      {
> >> >> >> > +        wide_int one = wi::to_wide (build_one_cst (itype));
> >> >> >> > +        wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> >> >> > +        wi::overflow_type ovf;
> >> >> >> > +        /* We need adder and max in the same precision.  */
> >> >> >> > +        wide_int zadder
> >> >> >> > +          = wide_int_storage::from (adder, wi::get_precision
> >> (max),
> >> >> >> > +                                    UNSIGNED);
> >> >> >> > +        wi::add (max, zadder, UNSIGNED, &ovf);
> >> >> >>
> >> >> >> Could you explain this a bit more?  When do we have mismatched
> >> >> >> precisions?
> >> >> >
> >> >> > C promotion rules will promote e.g.
> >> >> >
> >> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >> >> >   for (int i = 0; i < n; i+=1)
> >> >> >     pixel[i] = (pixel[i] + level) / 0xff; }
> >> >> >
> >> >> > And have the addition be done as a 32 bit integer.  The
> >> >> > vectorizer will demote this down to a short, but range
> >> >> > information is not stored for patterns.  So In the above the
> >> >> > range will correctly be 0x1fe but the precision will be that of
> >> >> > the original expression, so 32.  This will be a mismatch with
> >> >> > itype which is derived from the size the vectorizer
> >> >> will perform the operation in.
> >> >>
> >> >> Gah, missed this first time round, sorry.
> >> >>
> >> >> Richi would know better than me, but I think it's dangerous to
> >> >> rely on the orig/pattern link for range information.  The end
> >> >> result of a pattern
> >> >> (vect_stmt_to_vectorize) has to have the same type as the lhs of
> >> >> the original statement.  But the other statements in the pattern
> >> >> sequence can do arbitrary things.  Their range isn't predictable
> >> >> from the range of the original statement result.
> >> >>
> >> >> IIRC, the addition above is converted to:
> >> >>
> >> >>   a' = (uint16_t) pixel[i]
> >> >>   b' = (uint16_t) level
> >> >>   sum' = a' + b'
> >> >>   sum = (int) sum'
> >> >>
> >> >> where sum is the direct replacement of "pixel[i] + level", with
> >> >> the same type and range.  The division then uses sum' instead of sum.
> >> >>
> >> >> But the fact that sum' is part of the same pattern as sum doesn't
> >> >> guarantee that sum' has the same range as sum.  E.g. the pattern
> >> >> statements added by the division optimisation wouldn't have this
> >> property.
> >> >
> >> > So my assumption is that no pattern would replace a statement with
> >> > something That has higher precision than the C statement. The
> >> > pattern above is demoted By the vectorizer based on range information
> already.
> >> > My assumption was that the precision can only ever be smaller,
> >> > because otherwise the pattern has violated the semantics of the C
> >> > code, which
> >> would be dangerous if e.g. the expression escapes?
> >>
> >> IMO the difference in precisions was a symptom of the problem rather
> >> than the direct cause.
> >>
> >> The point is more that "B = vect_orig_stmt(A)" just says "A is used
> >> somehow in a new calculation of B".  A might equal B (if A replaces
> >> B), or A might be an arbitrary temporary result.  The code above is
> >> instead using it to mean "A equals B, expressed in a different type".
> >> That happens to be true for sum' in the sequence above, but it isn't
> >> true of non-final pattern statements in general.
> >>
> >
> > Sorry for being dense, but I though that's exactly what the code does
> > and what I tried explain before. If B isn't a final statement than it won't
> have an original statement.
> > AFAIK, the only places we set original statement is the root of the pattern
> expression.
> 
> Final pattern statements (those not in DEF_SEQ) always have the same type
> and value as the original statements.  We wouldn't see mismatched
> precisions if we were only looking at final pattern statements.


We would because the entire problem is that pattern statement have no ranges.
Ranger does not track them after they have been created.  This could of course
Trivially be solved if we tell ranger about the demotion we did, but we don't 
do so
at the moment. It will just return varying here.  This is the root cause of the 
issue.

> 
> Like you say, the 16-bit addition didn't exist before vectorisation (it was a 
> 32-
> bit addition instead).  So to make things type-correct, the 32-bit addition:
> 
>    A: sum = a + b           (STMT_VINFO_RELATED_STMT == A2)
> 
> is replaced with:
> 
>    DEF_SEQ:
>      A1: tmp = a' + b'      (STMT_VINFO_RELATED_STMT == A)
>    A2: sum' = (int) tmp     (STMT_VINFO_RELATED_STMT == A)
> 
> (using different notation from before, just to confuse things).
> Here, A2 is the final pattern statement for A and A1 is just a temporary 
> result.
> sum == sum'.
> 
> Later, we do a similar thing for the division itself.  We have:
> 
>    B: quotient = sum / 0xff            (STMT_VINFO_RELATED_STMT == B2)
> 
> We realise that this can be a 16-bit division, so (IIRC) we use
> vect_look_through_possible_promotion on sum to find the best starting
> point.  This should give:
> 
>    DEF_SEQ:
>      B1: tmp2 = tmp / (uint16_t) 0xff  (STMT_VINFO_RELATED_STMT == B)
>    B2: quotient' = (int) tmp2          (STMT_VINFO_RELATED_STMT == B)
> 
> Both changes are done by vect_widened_op_tree.
> 
> We then apply the division pattern to B1.  B1 is a nonfinal pattern statement
> that uses the result (tmp) of another nonfinal pattern statement (A1).
> 
> The code does:
> 
>         if (is_pattern_stmt_p (stmt_info))
>           {
>             auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>             if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>               op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>           }
> 
> is_pattern_stmt_p is true for both A1 and A2, and
> STMT_VINFO_RELATED_STMT is A for both A1 and A2.  I would expect:
> 
>   gcc_assert (stmt_info == vect_stmt_to_vectorize (orig_stmt));
> 
> (testing for a final pattern) to fail for the motivating example.
> 

I think we're actually saying the same thing. I believe all I'm saying is that 
looking
at the original statement is a safe alternative as it conservatively will 
overestimate
to VARYING or give a wider range than the pattern would have.

I'm saying it's conservatively safe, while not overly accurate.  The 
alternative would be
to tell ranger about the demotions in vect_recog_over_widening_pattern using 
range::set.

But for this to work the general widening pattern also have to update the range 
information.

I think where we're disagreeing is that I think looking at the original scalar 
statement is a safe
conservative estimate.  It will fail in some cases, but that's a missed 
optimization, not a miss-optimization.

In any case, if you disagree I don’t' really see a way forward aside from 
making this its own pattern
running it before the overwidening pattern.

Alternatively I'd love to know how to proceed.

Tamar

> Thanks,
> Richard

RE: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask by using new optabs [PR108583]

Reply via email to