Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Friday, February 10, 2023 4:25 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>;
>> nd <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> div-bitmask by using new optabs [PR108583]
>>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandif...@arm.com>
>> >> Sent: Friday, February 10, 2023 3:57 PM
>> >> To: Tamar Christina <tamar.christ...@arm.com>
>> >> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>;
>> >> nd <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
>> >> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of
>> >> div-bitmask by using new optabs [PR108583]
>> >>
>> >> Tamar Christina <tamar.christ...@arm.com> writes:
>> >> >> > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
>> >> >> > index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
>> >> >> > --- a/gcc/tree-vect-patterns.cc
>> >> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>> >> >> >        return pattern_stmt;
>> >> >> >      }
>> >> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> >> >> > -	   && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
>> >> >> > -							  wi::to_wide (cst),
>> >> >> > -							  NULL, NULL_RTX,
>> >> >> > -							  NULL_RTX))
>> >> >> > +	   && TYPE_UNSIGNED (itype)
>> >> >> > +	   && rhs_code == TRUNC_DIV_EXPR
>> >> >> > +	   && vectype
>> >> >> > +	   && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> >> > +					      OPTIMIZE_FOR_SPEED))
>> >> >> >      {
>> >> >> > -      return NULL;
>> >> >> > +      /* div optimizations using narrowings
>> >> >> > +	 we can do the division e.g. shorts by 255 faster by
>> >> >> > +	 calculating it as (x + ((x + 257) >> 8)) >> 8 assuming the
>> >> >> > +	 operation is done in double the precision of x.
>> >> >> > +
>> >> >> > +	 If we imagine a short as being composed of two blocks of bytes
>> >> >> > +	 then adding 257 or 0b0000_0001_0000_0001 to the number is
>> >> >> > +	 equivalent to adding 1 to each sub component:
>> >> >> > +
>> >> >> > +	       short value of 16-bits
>> >> >> > +	 ┌──────────────┬────────────────┐
>> >> >> > +	 │              │                │
>> >> >> > +	 └──────────────┴────────────────┘
>> >> >> > +	   8-bit part1 ▲   8-bit part2  ▲
>> >> >> > +	               │                │
>> >> >> > +	               │                │
>> >> >> > +	              +1               +1
>> >> >> > +
>> >> >> > +	 after the first addition, we have to shift right by 8, and
>> >> >> > +	 narrow the results back to a byte.  Remember that the addition
>> >> >> > +	 must be done in double the precision of the input.  However if
>> >> >> > +	 we know that the addition `x + 257` does not overflow then we
>> >> >> > +	 can do the operation in the current precision.  In which case
>> >> >> > +	 we don't need the pack and unpacks.  */
>> >> >> > +      auto wcst = wi::to_wide (cst);
>> >> >> > +      int pow = wi::exact_log2 (wcst + 1);
>> >> >> > +      if (pow == (int) (element_precision (vectype) / 2))
>> >> >> > +	{
>> >> >> > +	  wide_int min, max;
>> >> >> > +	  /* If we're in a pattern we need to find the original
>> >> >> > +	     definition.  */
>> >> >> > +	  tree op0 = oprnd0;
>> >> >> > +	  gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> >> >> > +	  stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> >> >> > +	  if (is_pattern_stmt_p (stmt_info))
>> >> >> > +	    {
>> >> >> > +	      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> >> >> > +	      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> >> > +		op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> >> >> > +	    }
>> >> >>
>> >> >> If this is generally safe (I'm skipping thinking about it in the
>> >> >> interests of a quick review :-)), then I think it should be done in
>> >> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
>> >> >> general than handling just assignments.
>> >> >>
>> >> >> > +
>> >> >> > +	  /* Check that no overflow will occur.  If we don't have range
>> >> >> > +	     information we can't perform the optimization.  */
>> >> >> > +	  if (vect_get_range_info (op0, &min, &max))
>> >> >> > +	    {
>> >> >> > +	      wide_int one = wi::to_wide (build_one_cst (itype));
>> >> >> > +	      wide_int adder = wi::add (one, wi::lshift (one, pow));
>> >> >> > +	      wi::overflow_type ovf;
>> >> >> > +	      /* We need adder and max in the same precision.  */
>> >> >> > +	      wide_int zadder
>> >> >> > +		= wide_int_storage::from (adder, wi::get_precision (max),
>> >> >> > +					  UNSIGNED);
>> >> >> > +	      wi::add (max, zadder, UNSIGNED, &ovf);
>> >> >>
>> >> >> Could you explain this a bit more?  When do we have mismatched
>> >> >> precisions?
>> >> >
>> >> > C promotion rules will promote e.g.
>> >> >
>> >> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >> >   for (int i = 0; i < n; i+=1)
>> >> >     pixel[i] = (pixel[i] + level) / 0xff;
>> >> > }
>> >> >
>> >> > and have the addition be done as a 32-bit integer.  The vectorizer
>> >> > will demote this down to a short, but range information is not
>> >> > stored for patterns.  So in the above the range will correctly be
>> >> > 0x1fe but the precision will be that of the original expression,
>> >> > so 32.
>> >> > This will be a mismatch with itype, which is derived from the size
>> >> > the vectorizer will perform the operation in.
>> >>
>> >> Gah, missed this first time round, sorry.
>> >>
>> >> Richi would know better than me, but I think it's dangerous to rely
>> >> on the orig/pattern link for range information.  The end result of a
>> >> pattern (vect_stmt_to_vectorize) has to have the same type as the lhs
>> >> of the original statement.  But the other statements in the pattern
>> >> sequence can do arbitrary things.  Their range isn't predictable from
>> >> the range of the original statement result.
>> >>
>> >> IIRC, the addition above is converted to:
>> >>
>> >>   a' = (uint16_t) pixel[i]
>> >>   b' = (uint16_t) level
>> >>   sum' = a' + b'
>> >>   sum = (int) sum'
>> >>
>> >> where sum is the direct replacement of "pixel[i] + level", with the
>> >> same type and range.  The division then uses sum' instead of sum.
>> >>
>> >> But the fact that sum' is part of the same pattern as sum doesn't
>> >> guarantee that sum' has the same range as sum.  E.g. the pattern
>> >> statements added by the division optimisation wouldn't have this
>> >> property.
>> >
>> > So my assumption is that no pattern would replace a statement with
>> > something that has higher precision than the C statement.  The pattern
>> > above is demoted by the vectorizer based on range information already.
>> > My assumption was that the precision can only ever be smaller, because
>> > otherwise the pattern has violated the semantics of the C code, which
>> > would be dangerous if e.g. the expression escapes?
>>
>> IMO the difference in precisions was a symptom of the problem rather
>> than the direct cause.
>>
>> The point is more that "B = vect_orig_stmt(A)" just says "A is used
>> somehow in a new calculation of B".  A might equal B (if A replaces B),
>> or A might be an arbitrary temporary result.  The code above is instead
>> using it to mean "A equals B, expressed in a different type".
>> That happens to be true for sum' in the sequence above, but it isn't
>> true of non-final pattern statements in general.
>
> Sorry for being dense, but I thought that's exactly what the code does
> and what I tried to explain before.  If B isn't a final statement then
> it won't have an original statement.  AFAIK, the only place we set the
> original statement is the root of the pattern expression.
Final pattern statements (those not in DEF_SEQ) always have the same
type and value as the original statements.  We wouldn't see mismatched
precisions if we were only looking at final pattern statements.

Like you say, the 16-bit addition didn't exist before vectorisation
(it was a 32-bit addition instead).  So to make things type-correct,
the 32-bit addition:

  A: sum = a + b
    (STMT_VINFO_RELATED_STMT == A2)

is replaced with:

  DEF_SEQ:
    A1: tmp = a' + b'
      (STMT_VINFO_RELATED_STMT == A)
  A2: sum' = (int) tmp
    (STMT_VINFO_RELATED_STMT == A)

(using different notation from before, just to confuse things).
Here, A2 is the final pattern statement for A and A1 is just a
temporary result.  sum == sum'.

Later, we do a similar thing for the division itself.  We have:

  B: quotient = sum / 0xff
    (STMT_VINFO_RELATED_STMT == B2)

We realise that this can be a 16-bit division, so (IIRC) we use
vect_look_through_possible_promotion on sum to find the best starting
point.  This should give:

  DEF_SEQ:
    B1: tmp2 = tmp / (uint16_t) 0xff
      (STMT_VINFO_RELATED_STMT == B)
  B2: quotient' = (int) tmp2
    (STMT_VINFO_RELATED_STMT == B)

Both changes are done by vect_widened_op_tree.

We then apply the division pattern to B1.  B1 is a non-final pattern
statement that uses the result (tmp) of another non-final pattern
statement (A1).  The code does:

  if (is_pattern_stmt_p (stmt_info))
    {
      auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
      if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
	op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
    }

is_pattern_stmt_p is true for both A1 and A2, and
STMT_VINFO_RELATED_STMT is A for both A1 and A2.  I would expect:

  gcc_assert (stmt_info == vect_stmt_to_vectorize (orig_stmt));

(testing for a final pattern) to fail for the motivating example.

Thanks,
Richard