Re: [PATCH] Add MULT_HIGHPART_EXPR
On Fri, Jun 29, 2012 at 11:00:14AM +0200, Richard Guenther wrote:
> Indeed - the lack of cross-sub-128bit-word operations makes it very
> expensive for some vectorizations.  Initially we added the patterns for
> vectorization of the hi/lo and interleave stuff because we didn't want
> regressions for vectorizing with 256bit vectors vs. 128bit vectors in
> the vectorizer testsuite.  But now that we have support for vectorizing
> with both sizes we could consider not advertising the really
> non-existing instructions for 256bit vectors.  Or at least properly
> model their cost.

The pr51581-3.c (f2) generated code is only shorter with -O3 -mavx when
using hi/lo over even/odd; with -O3 -mavx2 the even/odd sequence is
shorter than hi/lo.

$ ~/timing ./pr51581-3-evenodd
Strip out best and worst realtime result
minimum: 0.110145575 sec real / 0.71177 sec CPU
maximum: 0.134790162 sec real / 0.000140234 sec CPU
average: 0.113982306 sec real / 0.000113236 sec CPU
stdev  : 0.002545680 sec real / 0.09365 sec CPU
$ ~/timing ./pr51581-3-hilo
Strip out best and worst realtime result
minimum: 0.098651474 sec real / 0.69318 sec CPU
maximum: 0.102126514 sec real / 0.000129507 sec CPU
average: 0.100120802 sec real / 0.000104589 sec CPU
stdev  : 0.001008010 sec real / 0.13241 sec CPU

Can't benchmark -mavx2 though...

	Jakub
Re: [PATCH] Add MULT_HIGHPART_EXPR
On Wed, Jun 27, 2012 at 02:37:08PM -0700, Richard Henderson wrote:
> I was sitting on this patch until I got around to fixing up Jakub's
> existing vector divmod code to use it.  But seeing as how he's adding
> more uses, I think it's better to get it in earlier.
>
> Tested via a patch sent under separate cover that changes
> __builtin_alpha_umulh to immediately fold to MULT_HIGHPART_EXPR.

Thanks.

Here is an incremental patch on top of my patch from yesterday, which
expands some of the vector divisions/modulos using MULT_HIGHPART_EXPR
instead of VEC_WIDEN_MULT_*_EXPR + VEC_PERM_EXPR if the backend supports
that.  It improves the code generated for ushort or short / or % on i?86
(slightly complicated by the fact that unfortunately even -mavx2 doesn't
support vector-by-vector shifts for V{8,16}HImode (nor V{16,32}QImode);
XOP does, though).

Ok for trunk?

I'll look at using MULT_HIGHPART_EXPR in the pattern recognizer and
vectorizing it as either of the sequences next.

2012-06-28  Jakub Jelinek  <ja...@redhat.com>

	PR tree-optimization/53645
	* tree-vect-generic.c (expand_vector_divmod): Use MULT_HIGHPART_EXPR
	instead of VEC_WIDEN_MULT_{HI,LO}_EXPR followed by VEC_PERM_EXPR
	if possible.

	* gcc.c-torture/execute/pr53645-2.c: New test.
--- gcc/tree-vect-generic.c.jj	2012-06-28 08:32:50.0 +0200
+++ gcc/tree-vect-generic.c	2012-06-28 09:10:51.436748834 +0200
@@ -455,7 +455,7 @@ expand_vector_divmod (gimple_stmt_iterat
   unsigned HOST_WIDE_INT mask = GET_MODE_MASK (TYPE_MODE (TREE_TYPE (type)));
   optab op;
   tree *vec;
-  unsigned char *sel;
+  unsigned char *sel = NULL;
   tree cur_op, mhi, mlo, mulcst, perm_mask, wider_type, tem;
 
   if (prec > HOST_BITS_PER_WIDE_INT)
@@ -744,26 +744,34 @@ expand_vector_divmod (gimple_stmt_iterat
   if (mode == -2 || BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN)
     return NULL_TREE;
 
-  op = optab_for_tree_code (VEC_WIDEN_MULT_LO_EXPR, type, optab_default);
-  if (op == NULL
-      || optab_handler (op, TYPE_MODE (type)) == CODE_FOR_nothing)
-    return NULL_TREE;
-  op = optab_for_tree_code (VEC_WIDEN_MULT_HI_EXPR, type, optab_default);
-  if (op == NULL
-      || optab_handler (op, TYPE_MODE (type)) == CODE_FOR_nothing)
-    return NULL_TREE;
-  sel = XALLOCAVEC (unsigned char, nunits);
-  for (i = 0; i < nunits; i++)
-    sel[i] = 2 * i + (BYTES_BIG_ENDIAN ? 0 : 1);
-  if (!can_vec_perm_p (TYPE_MODE (type), false, sel))
-    return NULL_TREE;
-  wider_type
-    = build_vector_type (build_nonstandard_integer_type (prec * 2, unsignedp),
-			 nunits / 2);
-  if (GET_MODE_CLASS (TYPE_MODE (wider_type)) != MODE_VECTOR_INT
-      || GET_MODE_BITSIZE (TYPE_MODE (wider_type))
-	 != GET_MODE_BITSIZE (TYPE_MODE (type)))
-    return NULL_TREE;
+  op = optab_for_tree_code (MULT_HIGHPART_EXPR, type, optab_default);
+  if (op != NULL
+      && optab_handler (op, TYPE_MODE (type)) != CODE_FOR_nothing)
+    wider_type = NULL_TREE;
+  else
+    {
+      op = optab_for_tree_code (VEC_WIDEN_MULT_LO_EXPR, type, optab_default);
+      if (op == NULL
+	  || optab_handler (op, TYPE_MODE (type)) == CODE_FOR_nothing)
+	return NULL_TREE;
+      op = optab_for_tree_code (VEC_WIDEN_MULT_HI_EXPR, type, optab_default);
+      if (op == NULL
+	  || optab_handler (op, TYPE_MODE (type)) == CODE_FOR_nothing)
+	return NULL_TREE;
+      sel = XALLOCAVEC (unsigned char, nunits);
+      for (i = 0; i < nunits; i++)
+	sel[i] = 2 * i + (BYTES_BIG_ENDIAN ? 0 : 1);
+      if (!can_vec_perm_p (TYPE_MODE (type), false, sel))
+	return NULL_TREE;
+      wider_type
+	= build_vector_type (build_nonstandard_integer_type (prec * 2,
+							     unsignedp),
+			     nunits / 2);
+      if (GET_MODE_CLASS (TYPE_MODE (wider_type)) != MODE_VECTOR_INT
+	  || GET_MODE_BITSIZE (TYPE_MODE (wider_type))
+	     != GET_MODE_BITSIZE (TYPE_MODE (type)))
+	return NULL_TREE;
+    }
 
   cur_op = op0;
 
@@ -772,7 +780,7 @@ expand_vector_divmod (gimple_stmt_iterat
     case 0:
       gcc_assert (unsignedp);
       /* t1 = oprnd0 >> pre_shift;
-	 t2 = (type) (t1 w* ml >> prec);
+	 t2 = t1 h* ml;
	 q = t2 >> post_shift;  */
       cur_op = add_rshift (gsi, type, cur_op, pre_shifts);
       if (cur_op == NULL_TREE)
@@ -801,30 +809,37 @@ expand_vector_divmod (gimple_stmt_iterat
   for (i = 0; i < nunits; i++)
     vec[i] = build_int_cst (TREE_TYPE (type), mulc[i]);
   mulcst = build_vector (type, vec);
-  for (i = 0; i < nunits; i++)
-    vec[i] = build_int_cst (TREE_TYPE (type), sel[i]);
-  perm_mask = build_vector (type, vec);
-  mhi = gimplify_build2 (gsi, VEC_WIDEN_MULT_HI_EXPR, wider_type,
-			 cur_op, mulcst);
-  mhi = gimplify_build1 (gsi, VIEW_CONVERT_EXPR, type, mhi);
-  mlo = gimplify_build2 (gsi, VEC_WIDEN_MULT_LO_EXPR, wider_type,
-			 cur_op, mulcst);
-  mlo = gimplify_build1 (gsi,
Re: [PATCH] Add MULT_HIGHPART_EXPR
On Thu, Jun 28, 2012 at 09:17:55AM +0200, Jakub Jelinek wrote:
> I'll look at using MULT_HIGHPART_EXPR in the pattern recognizer and
> vectorizing it as either of the sequences next.

And here is the corresponding pattern recognizer and vectorizer patch.
Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Unfortunately the addition of the builtin_mul_widen_* hooks on i?86 seems
to pessimize the generated code for the gcc.dg/vect/pr51581-3.c testcase
(at least with -O3 -mavx) compared to when the hooks aren't present,
because i?86 has more natural support for widen mult lo/hi compared to
widen mult even/odd, but I assume that on powerpc it is the other way
around.  So, how should I find out, if both VEC_WIDEN_MULT_*_EXPR and
builtin_mul_widen_* are possible for the particular vectype, which one
will be cheaper?

2012-06-28  Jakub Jelinek  <ja...@redhat.com>

	PR tree-optimization/51581
	* tree-vect-stmts.c (permute_vec_elements): Add forward decl.
	(vectorizable_operation): Handle vectorization of
	MULT_HIGHPART_EXPR also using VEC_WIDEN_MULT_*_EXPR or
	builtin_mul_widen_* plus VEC_PERM_EXPR if vector
	MULT_HIGHPART_EXPR isn't supported.
	* tree-vect-patterns.c (vect_recog_divmod_pattern): Use
	MULT_HIGHPART_EXPR instead of VEC_WIDEN_MULT_*_EXPR and shifts.

	* gcc.dg/vect/pr51581-4.c: New test.

--- gcc/tree-vect-stmts.c.jj	2012-06-26 11:38:28.0 +0200
+++ gcc/tree-vect-stmts.c	2012-06-28 13:27:50.475158271 +0200
@@ -3288,6 +3288,10 @@ vectorizable_shift (gimple stmt, gimple_
 }
 
 
+static tree permute_vec_elements (tree, tree, tree, gimple,
+				  gimple_stmt_iterator *);
+
+
 /* Function vectorizable_operation.
 
    Check if STMT performs a binary, unary or ternary operation that can
@@ -3300,17 +3304,18 @@ static bool
 vectorizable_operation (gimple stmt, gimple_stmt_iterator *gsi,
			gimple *vec_stmt, slp_tree slp_node)
 {
-  tree vec_dest;
+  tree vec_dest, vec_dest2 = NULL_TREE;
+  tree vec_dest3 = NULL_TREE, vec_dest4 = NULL_TREE;
   tree scalar_dest;
   tree op0, op1 = NULL_TREE, op2 = NULL_TREE;
   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
-  tree vectype;
+  tree vectype, wide_vectype = NULL_TREE;
   loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
   enum tree_code code;
   enum machine_mode vec_mode;
   tree new_temp;
   int op_type;
-  optab optab;
+  optab optab, optab2 = NULL;
   int icode;
   tree def;
   gimple def_stmt;
@@ -3327,6 +3332,8 @@ vectorizable_operation (gimple stmt, gim
   tree vop0, vop1, vop2;
   bb_vec_info bb_vinfo = STMT_VINFO_BB_VINFO (stmt_info);
   int vf;
+  unsigned char *sel = NULL;
+  tree decl1 = NULL_TREE, decl2 = NULL_TREE, perm_mask = NULL_TREE;
 
   if (!STMT_VINFO_RELEVANT_P (stmt_info) && !bb_vinfo)
     return false;
@@ -3451,31 +3458,97 @@ vectorizable_operation (gimple stmt, gim
   optab = optab_for_tree_code (code, vectype, optab_default);
 
   /* Supportable by target?  */
-  if (!optab)
+  if (!optab && code != MULT_HIGHPART_EXPR)
     {
       if (vect_print_dump_info (REPORT_DETAILS))
	fprintf (vect_dump, "no optab.");
       return false;
     }
   vec_mode = TYPE_MODE (vectype);
-  icode = (int) optab_handler (optab, vec_mode);
+  icode = optab ? (int) optab_handler (optab, vec_mode) : CODE_FOR_nothing;
+
+  if (icode == CODE_FOR_nothing
+      && code == MULT_HIGHPART_EXPR
+      && VECTOR_MODE_P (vec_mode)
+      && BYTES_BIG_ENDIAN == WORDS_BIG_ENDIAN)
+    {
+      /* If MULT_HIGHPART_EXPR isn't supported by the backend, see
+	 if we can emit VEC_WIDEN_MULT_{LO,HI}_EXPR followed by VEC_PERM_EXPR
+	 or builtin_mul_widen_{even,odd} followed by VEC_PERM_EXPR.  */
+      unsigned int prec = TYPE_PRECISION (TREE_TYPE (scalar_dest));
+      unsigned int unsignedp = TYPE_UNSIGNED (TREE_TYPE (scalar_dest));
+      tree wide_type
+	= build_nonstandard_integer_type (prec * 2, unsignedp);
+      wide_vectype
+	= get_same_sized_vectype (wide_type, vectype);
+
+      sel = XALLOCAVEC (unsigned char, nunits_in);
+      if (VECTOR_MODE_P (TYPE_MODE (wide_vectype))
+	  && GET_MODE_SIZE (TYPE_MODE (wide_vectype))
+	     == GET_MODE_SIZE (vec_mode))
+	{
+	  if (targetm.vectorize.builtin_mul_widen_even
+	      && (decl1 = targetm.vectorize.builtin_mul_widen_even (vectype))
+	      && targetm.vectorize.builtin_mul_widen_odd
+	      && (decl2 = targetm.vectorize.builtin_mul_widen_odd (vectype))
+	      && TYPE_MODE (TREE_TYPE (TREE_TYPE (decl1)))
+		 == TYPE_MODE (wide_vectype))
+	    {
+	      for (i = 0; i < nunits_in; i++)
+		sel[i] = !BYTES_BIG_ENDIAN + (i & ~1)
+			 + ((i & 1) ? nunits_in : 0);
+	      if (0 && can_vec_perm_p (vec_mode, false, sel))
+		icode = 0;
+	    }
+	  if (icode == CODE_FOR_nothing)
+	    {
+	      decl1 = NULL_TREE;
+	      decl2 =
Re: [PATCH] Add MULT_HIGHPART_EXPR
On 2012-06-28 07:05, Jakub Jelinek wrote:
> Unfortunately the addition of the builtin_mul_widen_* hooks on i?86 seems
> to pessimize the generated code for the gcc.dg/vect/pr51581-3.c testcase
> (at least with -O3 -mavx) compared to when the hooks aren't present,
> because i?86 has more natural support for widen mult lo/hi compared to
> widen mult even/odd, but I assume that on powerpc it is the other way
> around.  So, how should I find out, if both VEC_WIDEN_MULT_*_EXPR and
> builtin_mul_widen_* are possible for the particular vectype, which one
> will be cheaper?

I would assume that if the builtin exists, then it is cheaper.

I disagree that x86 has more natural support for hi/lo.  The basic sse2
multiplication is even.  One shift per input is needed to generate odd.
On the other hand, one interleave per input is required for both hi and
lo.  So 4 setup insns for hi/lo, and 2 setup insns for even/odd.

And on top of all that, XOP includes multiply odd, at least for signed
V4SI.

I'll have a look at the test case you mention while I re-look at the
patches...

r~
Re: [PATCH] Add MULT_HIGHPART_EXPR
On Thu, Jun 28, 2012 at 08:57:23AM -0700, Richard Henderson wrote:
> On 2012-06-28 07:05, Jakub Jelinek wrote:
> > Unfortunately the addition of the builtin_mul_widen_* hooks on i?86
> > seems to pessimize the generated code for the gcc.dg/vect/pr51581-3.c
> > testcase (at least with -O3 -mavx) compared to when the hooks aren't
> > present, because i?86 has more natural support for widen mult lo/hi
> > compared to widen mult even/odd, but I assume that on powerpc it is
> > the other way around.  So, how should I find out, if both
> > VEC_WIDEN_MULT_*_EXPR and builtin_mul_widen_* are possible for the
> > particular vectype, which one will be cheaper?
>
> I would assume that if the builtin exists, then it is cheaper.
>
> I disagree that x86 has more natural support for hi/lo.  The basic sse2
> multiplication is even.  One shift per input is needed to generate odd.
> On the other hand, one interleave per input is required for both hi and
> lo.  So 4 setup insns for hi/lo, and 2 setup insns for even/odd.
>
> And on top of all that, XOP includes multiply odd, at least for signed
> V4SI.

Perhaps the problem is then that the permutation is much more expensive
for even/odd.

With even/odd the f2 routine is:
	vmovdqa	d(%rip), %xmm2
	vmovdqa	.LC1(%rip), %xmm0
	vpsrlq	$32, %xmm2, %xmm4
	vmovdqa	d+16(%rip), %xmm1
	vpmuludq	%xmm0, %xmm2, %xmm5
	vpsrlq	$32, %xmm0, %xmm3
	vpmuludq	%xmm3, %xmm4, %xmm4
	vpmuludq	%xmm0, %xmm1, %xmm0
	vmovdqa	.LC2(%rip), %xmm2
	vpsrlq	$32, %xmm1, %xmm1
	vpmuludq	%xmm3, %xmm1, %xmm3
	vmovdqa	.LC3(%rip), %xmm1
	vpshufb	%xmm2, %xmm5, %xmm5
	vpshufb	%xmm1, %xmm4, %xmm4
	vpshufb	%xmm2, %xmm0, %xmm2
	vpshufb	%xmm1, %xmm3, %xmm1
	vpor	%xmm4, %xmm5, %xmm4
	vpor	%xmm1, %xmm2, %xmm1
	vpsrld	$1, %xmm4, %xmm4
	vmovdqa	%xmm4, c(%rip)
	vpsrld	$1, %xmm1, %xmm1
	vmovdqa	%xmm1, c+16(%rip)
	ret

and with lo/hi it is:
	vmovdqa	d(%rip), %xmm2
	vpunpckhdq	%xmm2, %xmm2, %xmm3
	vpunpckldq	%xmm2, %xmm2, %xmm2
	vmovdqa	.LC1(%rip), %xmm0
	vpmuludq	%xmm0, %xmm3, %xmm3
	vmovdqa	d+16(%rip), %xmm1
	vpmuludq	%xmm0, %xmm2, %xmm2
	vshufps	$221, %xmm2, %xmm3, %xmm2
	vpsrld	$1, %xmm2, %xmm2
	vmovdqa	%xmm2, c(%rip)
	vpunpckhdq	%xmm1, %xmm1, %xmm2
	vpunpckldq	%xmm1, %xmm1, %xmm1
	vpmuludq	%xmm0, %xmm2, %xmm2
	vpmuludq	%xmm0, %xmm1, %xmm0
	vshufps	$221, %xmm0, %xmm2, %xmm0
	vpsrld	$1, %xmm0, %xmm0
	vmovdqa	%xmm0, c+16(%rip)
	ret

	Jakub
Re: [PATCH] Add MULT_HIGHPART_EXPR
On Thu, Jun 28, 2012 at 8:57 AM, Richard Henderson r...@redhat.com wrote: On 2012-06-28 07:05, Jakub Jelinek wrote: Unfortunately the addition of the builtin_mul_widen_* hooks on i?86 seems to pessimize the generated code for gcc.dg/vect/pr51581-3.c testcase (at least with -O3 -mavx) compared to when the hooks aren't present, because i?86 has more natural support for widen mult lo/hi compoared to widen mult even/odd, but I assume that on powerpc it is the other way around. So, how should I find out if both VEC_WIDEN_MULT_*_EXPR and builtin_mul_widen_* are possible for the particular vectype which one will be cheaper? I would assume that if the builtin exists, then it is cheaper. I disagree about x86 has more natural support for hi/lo. The basic sse2 multiplication is even. One shift per input is needed to generate odd. On the other hand, one interleave per input is required for both hi/lo. So 4 setup insns for hi/lo, and 2 setup insns for even/odd. And on top of all that, XOP includes multiply odd at least for signed V4SI. I'll have a look at the test case you mention while I re-look at the patches... The upper 128-bit of 256-bit AVX instructions aren't a good fit with the current vectorizer infrastructure. -- H.J.
Re: [PATCH] Add MULT_HIGHPART_EXPR
On 2012-06-28 09:20, Jakub Jelinek wrote:
> Perhaps the problem is then that the permutation is much more expensive
> for even/odd.
>
> With even/odd the f2 routine is:
...
> 	vpshufb	%xmm2, %xmm5, %xmm5
> 	vpshufb	%xmm1, %xmm4, %xmm4
> 	vpor	%xmm4, %xmm5, %xmm4
...
> and with lo/hi it is:
> 	vshufps	$221, %xmm2, %xmm3, %xmm2

Hmm.  That second has a reformatting delay.

Last week when I pulled the mulv4si3 routine out to i386.c, I
experimented with a few different options, including that
interleave+shufps sequence seen here for lo/hi.  See the comment there
discussing options and timing.

This also shows a deficiency in our vec_perm logic:

	0L 0H 2L 2H	1L 1H 3L 3H
	0H 2H 0H 2H	1H 3H 1H 3H	2*pshufd
	0H 1H 2H 3H			punpckldq

without the permutation constants in memory.

r~
Re: [PATCH] Add MULT_HIGHPART_EXPR
On 2012-06-28 07:05, Jakub Jelinek wrote:
> 	PR tree-optimization/51581
> 	* tree-vect-stmts.c (permute_vec_elements): Add forward decl.
> 	(vectorizable_operation): Handle vectorization of
> 	MULT_HIGHPART_EXPR also using VEC_WIDEN_MULT_*_EXPR or
> 	builtin_mul_widen_* plus VEC_PERM_EXPR if vector
> 	MULT_HIGHPART_EXPR isn't supported.
> 	* tree-vect-patterns.c (vect_recog_divmod_pattern): Use
> 	MULT_HIGHPART_EXPR instead of VEC_WIDEN_MULT_*_EXPR and shifts.
>
> 	* gcc.dg/vect/pr51581-4.c: New test.

Ok, except,

+	      if (0 && can_vec_perm_p (vec_mode, false, sel))
+		icode = 0;

Testing hack left in.

r~
Re: [PATCH] Add MULT_HIGHPART_EXPR
On Fri, Jun 29, 2012 at 12:00:10AM +0200, Bernhard Reutner-Fischer wrote:
> Really both HI? If so optab2 could be removed from that fn altogether..

Of course, thanks for pointing that out.  I've additionally added a
result mode check (similar to what supportable_widening_operation does).
The reason for not using supportable_widening_operation is that it only
tests even/odd calls for reductions, while we can use them everywhere.

Committed as obvious.

2012-06-29  Jakub Jelinek  <ja...@redhat.com>

	* tree-vect-stmts.c (vectorizable_operation): Check both
	VEC_WIDEN_MULT_LO_EXPR and VEC_WIDEN_MULT_HI_EXPR optabs.
	Verify that operand[0]'s mode is TYPE_MODE (wide_vectype).

--- gcc/tree-vect-stmts.c	(revision 189053)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -3504,14 +3504,19 @@ vectorizable_operation (gimple stmt, gim
	    {
	      decl1 = NULL_TREE;
	      decl2 = NULL_TREE;
-	      optab = optab_for_tree_code (VEC_WIDEN_MULT_HI_EXPR,
+	      optab = optab_for_tree_code (VEC_WIDEN_MULT_LO_EXPR,
					   vectype, optab_default);
	      optab2 = optab_for_tree_code (VEC_WIDEN_MULT_HI_EXPR,
					    vectype, optab_default);
	      if (optab != NULL
		  && optab2 != NULL
		  && optab_handler (optab, vec_mode) != CODE_FOR_nothing
-		  && optab_handler (optab2, vec_mode) != CODE_FOR_nothing)
+		  && optab_handler (optab2, vec_mode) != CODE_FOR_nothing
+		  && insn_data[optab_handler (optab, vec_mode)].operand[0].mode
+		     == TYPE_MODE (wide_vectype)
+		  && insn_data[optab_handler (optab2,
+					      vec_mode)].operand[0].mode
+		     == TYPE_MODE (wide_vectype))
		{
		  for (i = 0; i < nunits_in; i++)
		    sel[i] = !BYTES_BIG_ENDIAN + 2 * i;

	Jakub
Re: [PATCH] Add MULT_HIGHPART_EXPR
On Wed, Jun 27, 2012 at 11:37 PM, Richard Henderson <r...@redhat.com> wrote:
> 	* tree.def (MULT_HIGHPART_EXPR): New.
> 	* cfgexpand.c (expand_debug_expr): Ignore it.
> 	* expr.c (expand_expr_real_2): Handle it.
> 	* fold-const.c (int_const_binop_1): Likewise.
> 	* optabs.c (optab_for_tree_code): Likewise.
> 	* tree-cfg.c (verify_gimple_assign_binary): Likewise.
> 	* tree-inline.c (estimate_operator_cost): Likewise.
> 	* tree-pretty-print.c (dump_generic_node): Likewise.
> 	(op_code_prio, op_symbol_code): Likewise.
> 	* tree.c (commutative_tree_code): Likewise.  Also handle
> 	WIDEN_MULT_EXPR, VEC_WIDEN_MULT_HI_EXPR, VEC_WIDEN_MULT_LO_EXPR.

Maybe also a bit in doc/generic.texi? Or is this not supposed to be
exposed to the front ends?

Ciao!
Steven
Re: [PATCH] Add MULT_HIGHPART_EXPR
On 06/27/2012 02:42 PM, Steven Bosscher wrote:
> Maybe also a bit in doc/generic.texi? Or is this not supposed to be
> exposed to the front ends?

I can't imagine it being terribly useful to front ends, but there's
certainly nothing that ought to prevent it.  How's this?

r~

diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
index e99366f..c48b663 100644
--- a/gcc/doc/generic.texi
+++ b/gcc/doc/generic.texi
@@ -1235,6 +1235,7 @@ the byte offset of the field, but should not be used directly; call
 @tindex PLUS_EXPR
 @tindex MINUS_EXPR
 @tindex MULT_EXPR
+@tindex MULT_HIGHPART_EXPR
 @tindex RDIV_EXPR
 @tindex TRUNC_DIV_EXPR
 @tindex FLOOR_DIV_EXPR
@@ -1433,6 +1434,11 @@ one operand is of floating type and the other is of integral type.
 The behavior of these operations on signed arithmetic overflow is
 controlled by the @code{flag_wrapv} and @code{flag_trapv} variables.
 
+@item MULT_HIGHPART_EXPR
+This node represents the ``high-part'' of a widening multiplication.
+For an integral type with @var{b} bits of precision, the result is
+the most significant @var{b} bits of the full @math{2@var{b}} product.
+
 @item RDIV_EXPR
 This node represents a floating point division operation.