> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Friday, February 10, 2023 3:57 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
> <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
> by using new optabs [PR108583]
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> >> > index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
> >> > --- a/gcc/tree-vect-patterns.cc
> >> > +++ b/gcc/tree-vect-patterns.cc
> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
> >> >        return pattern_stmt;
> >> >      }
> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
> >> > -           && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
> >> > -                                                          wi::to_wide (cst),
> >> > -                                                          NULL, NULL_RTX,
> >> > -                                                          NULL_RTX))
> >> > +           && TYPE_UNSIGNED (itype)
> >> > +           && rhs_code == TRUNC_DIV_EXPR
> >> > +           && vectype
> >> > +           && direct_internal_fn_supported_p (IFN_ADDH, vectype,
> >> > +                                              OPTIMIZE_FOR_SPEED))
> >> >      {
> >> > -      return NULL;
> >> > +      /* div optimizations using narrowings
> >> > +       we can do the division e.g. shorts by 255 faster by calculating it as
> >> > +       (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> >> > +       double the precision of x.
> >> > +
> >> > +       If we imagine a short as being composed of two blocks of bytes then
> >> > +       adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> >> > +       adding 1 to each sub component:
> >> > +
> >> > +            short value of 16-bits
> >> > +       ┌──────────────┬────────────────┐
> >> > +       │              │                │
> >> > +       └──────────────┴────────────────┘
> >> > +         8-bit part1 ▲  8-bit part2   ▲
> >> > +                     │                │
> >> > +                     │                │
> >> > +                    +1               +1
> >> > +
> >> > +       after the first addition, we have to shift right by 8, and narrow the
> >> > +       results back to a byte.  Remember that the addition must be done in
> >> > +       double the precision of the input.  However if we know that the addition
> >> > +       `x + 257` does not overflow then we can do the operation in the current
> >> > +       precision.  In which case we don't need the pack and unpacks.  */
> >> > +      auto wcst = wi::to_wide (cst);
> >> > +      int pow = wi::exact_log2 (wcst + 1);
> >> > +      if (pow == (int) (element_precision (vectype) / 2))
> >> > +        {
> >> > +          wide_int min,max;
> >> > +          /* If we're in a pattern we need to find the original definition.  */
> >> > +          tree op0 = oprnd0;
> >> > +          gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
> >> > +          stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
> >> > +          if (is_pattern_stmt_p (stmt_info))
> >> > +            {
> >> > +              auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
> >> > +              if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
> >> > +                op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
> >> > +            }
> >>
> >> If this is generally safe (I'm skipping thinking about it in the
> >> interests of a quick review :-)), then I think it should be done in
> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
> >> general than handling just assignments.
> >>
> >> > +
> >> > +          /* Check that no overflow will occur.  If we don't have range
> >> > +             information we can't perform the optimization.  */
> >> > +          if (vect_get_range_info (op0, &min, &max))
> >> > +            {
> >> > +              wide_int one = wi::to_wide (build_one_cst (itype));
> >> > +              wide_int adder = wi::add (one, wi::lshift (one, pow));
> >> > +              wi::overflow_type ovf;
> >> > +              /* We need adder and max in the same precision.  */
> >> > +              wide_int zadder
> >> > +                = wide_int_storage::from (adder, wi::get_precision (max),
> >> > +                                          UNSIGNED);
> >> > +              wi::add (max, zadder, UNSIGNED, &ovf);
> >>
> >> Could you explain this a bit more?  When do we have mismatched
> >> precisions?
> >
> > C promotion rules will promote e.g.
> >
> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
> >   for (int i = 0; i < n; i+=1)
> >     pixel[i] = (pixel[i] + level) / 0xff;
> > }
> >
> > And have the addition be done as a 32-bit integer.  The vectorizer
> > will demote this down to a short, but range information is not stored
> > for patterns.  So in the above the range will correctly be 0x1fe but
> > the precision will be that of the original expression, so 32.  This
> > will be a mismatch with itype, which is derived from the size the
> > vectorizer will perform the operation in.
> 
> Gah, missed this first time round, sorry.
> 
> Richi would know better than me, but I think it's dangerous to rely on the
> orig/pattern link for range information.  The end result of a pattern
> (vect_stmt_to_vectorize) has to have the same type as the lhs of the original
> statement.  But the other statements in the pattern sequence can do
> arbitrary things.  Their range isn't predictable from the range of the 
> original
> statement result.
> 
> IIRC, the addition above is converted to:
> 
>   a' = (uint16_t) pixel[i]
>   b' = (uint16_t) level
>   sum' = a' + b'
>   sum = (int) sum'
> 
> where sum is the direct replacement of "pixel[i] + level", with the same type
> and range.  The division then uses sum' instead of sum.
> 
> But the fact that sum' is part of the same pattern as sum doesn't guarantee
> that sum' has the same range as sum.  E.g. the pattern statements added by
> the division optimisation wouldn't have this property.

So my assumption is that no pattern would replace a statement with something
that has a higher precision than the C statement.  The pattern above is
demoted by the vectorizer based on range information already.  My assumption
was that the precision can only ever be smaller, because otherwise the
pattern would have violated the semantics of the C code, which would be
dangerous if e.g. the expression escapes?
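
For concreteness, this is roughly how I read one element of the fun2
example after demotion, written as plain C with the ranges annotated
(fun2_elem is just an illustrative helper, not anything the vectorizer
actually creates):

#include <stdint.h>

/* Illustrative only: one element of fun2 above, spelled the way I
   understand the demoted pattern.  Ranges assume uint8_t inputs.  */
static uint8_t fun2_elem (uint8_t pixel, uint8_t level)
{
  uint16_t a   = (uint16_t) pixel;    /* range [0, 0xff] */
  uint16_t b   = (uint16_t) level;    /* range [0, 0xff] */
  uint16_t sum = (uint16_t) (a + b);  /* range [0, 0x1fe]: fits in 16 bits,
                                         even though the C expression is
                                         evaluated in 32-bit int */
  return (uint8_t) (sum / 0xff);      /* the division the pattern targets */
}

The value range never gets wider than 0x1fe; only the precision of the
original C expression (32 bits) differs from itype (16 bits), which is
where the mismatch comes from.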

> 
> Is it possible to tell ranger to compute the range of expressions that haven't
> been added to the IL?  (Genuine question, haven't looked.
> It seems pretty powerful though.)

I don't know either.  I guess for things it has explicit knowledge about
it's ok, so +w or *w would be fine, but with a random IFN_ it'll likely
have to punt as varying.

I guess, while theoretically possible, I don't see a case where the
vectorizer would introduce a higher precision, as this would reduce your VF.

The only place I can think of where this is unsafe is if the division is
introduced as part of another pattern, but in that case the pattern won't
have a related statement so we'll punt.
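
For reference, a throwaway standalone check of the identity the pattern
relies on, plus the no-overflow condition that lets us stay in the
original precision (plain C, nothing vectorizer-specific; the constants
match the 8-bit/16-bit case discussed above):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main (void)
{
  /* x / 0xff == (x + ((x + 257) >> 8)) >> 8 for every 16-bit x, provided
     the intermediate additions are done in wider (here 32-bit) precision.  */
  for (uint32_t x = 0; x <= 0xffff; x++)
    assert (x / 0xff == ((x + ((x + 257) >> 8)) >> 8));

  /* If range info tells us x <= 0x1fe (the fun2 case), then
     max + 257 = 0x2ff still fits in 16 bits, so the same computation can be
     done entirely in the original precision, without the packs/unpacks.  */
  for (uint32_t x = 0; x <= 0x1fe; x++)
    {
      uint16_t t = (uint16_t) x;
      uint16_t r = (uint16_t) ((t + (uint16_t) ((t + 257) >> 8)) >> 8);
      assert (r == x / 0xff);
    }

  printf ("all ok\n");
  return 0;
}

That passes for every 16-bit value in the wide form and for the
[0, 0x1fe] range in the narrow form, which is what the wi::add overflow
check above is meant to guarantee.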

Regards,
Tamar

> 
> Thanks,
> Richard
