[Bug tree-optimization/125303] vector operation before shuffle produces unvectorized shuffle

rguenther at suse dot de via Gcc-bugs Mon, 18 May 2026 23:40:40 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125303


--- Comment #6 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 18 May 2026, tnfchris at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125303
>
> --- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #2)
>> (In reply to Drea Pinski from comment #1)
>>> Confirmed.
>>>
>>> The difference comes from who does the expansion/combining.
>>>
>>> So vec_shuf is handled by vectorizer SLP pass.
>>>
>>> While vec_xor_shuf is handled by forwprop1 and veclowering pass.
>>>
>>> The veclowering pass does not handle:
>>>   _15 = VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2, 7, 3 }>;
>>>
>>> in a partial wise and only handles scalar wise at this stage. Nobody has
>>> improved it yet to try to use partial vector sizes.
>>
>> The idea is that the re-vectorization patches from Tamar should address this
>> (we'll have to trigger extra BB SLP seeds of course)
>
> I've been looking into this today to see what's required here, as my current
> patches
> allow growing of the VF but not shrinking. The extra seed is easy enough to 
> do:
>
> @@ -10000,6 +10003,24 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo)
>                }
>            }
>        }
> +#if 1
> +       else if (gimple_vdef (assign)
> +              && VECTOR_TYPE_P (TREE_TYPE (rhs))
> +              && ((op = optab_for_tree_code (code, TREE_TYPE (rhs),
> optab_default)) == unknown_optab
> +                  || !can_implement_p (op, TYPE_MODE (TREE_TYPE (rhs)))))
> +       {
> +         vec<stmt_vec_info> roots = vNULL;
> +         auto stmt_vinfo = bb_vinfo->lookup_stmt (assign);
> +         roots.safe_push (stmt_vinfo);
> +         vec<stmt_vec_info> stmts;
> +         auto num_entries = TYPE_VECTOR_SUBPARTS (TREE_TYPE 
> (rhs)).to_constant
> ();
> +         stmts.create (num_entries);
> +         for (int i = 0; i < num_entries; i++)
> +           stmts.quick_push (stmt_vinfo);
> +         bb_vinfo->roots.safe_push (slp_root (slp_inst_kind_store,
> +                                              stmts, roots));
> +       }
> +#endif
>     }
> }
>
> Which triggers a change in data_ref analysis since we don't allow loads groups
> with zero step during BB vectorization.
> Updating that hits the largest road block and as expected that's build_slp
> analysis.
>
> Because the group size is larger than any supported vector type it fails.
> I changed it to pick the largest possible vectype and then if analysis 
> succeeds
> would set unroll fact.

I'm a bit confused, it should be fine to pick a smaller vector type if the
group size is a multiple of it.  The unroll factor (VF) is always 1 in
BB vectorization, but yes, "ncopies" can be > 1 iff the vector type is
smaller than the group size.

>
> However there are way too many things doing analysis as we expected so it 
> still
> fails:
>
> note:   === vect_analyze_data_ref_accesses ===
> note:   === vect_determine_precisions ===
> note:   === vect_pattern_recog ===
> note:   === vect_analyze_slp ===
> note:   Starting SLP discovery for
> note:     *x_5(D) = _15;
> note:     *x_5(D) = _15;
> note:     *x_5(D) = _15;
> note:     *x_5(D) = _15;
> note:     *x_5(D) = _15;
> note:     *x_5(D) = _15;
> note:     *x_5(D) = _15;
> note:     *x_5(D) = _15;
> note:   starting SLP discovery for node 0x5eba780
> note:   get vectype for scalar type (group size 8): unsigned int
> note:   vectype: vector(4) unsigned int
> note:   get vectype for smallest scalar type: vec256
> note:   nunits vectype: vector(4) unsigned int
> note:   nunits = 4
> note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
> note:   SLP discovery for node 0x5eba780 failed
> note:   SLP discovery failed
>
> So I'm not sure what the cleanest solution is he.. hmm maybe I'm looking at
> this wrong...
>
> Since the tree is supposed to already be vectorized, perhaps I should instead
> build the initial
> groups using the largest supported group size and then set the unroll based on
> that..

I think the group size has to match the vector type used in the IL
(or be larger if we combine two vector typed operations).  But as I didn't
yet see your re-vectorization patches I can only guess what you did to
representation in the SLP graph ...

> note:   === vect_analyze_slp ===
>
>
>                  xor3.c:6:7: note:   Starting SLP discovery for
>
>
>                                               xor3.c:6:7: note:     *x_5(D) =
> _15;
>
>
> xor3.c:6:7: note:     *x_5(D) = _15;
>
>
>                             xor3.c:6:7: note:     *x_5(D) = _15;
> note:     *x_5(D) = _15;
> note:   starting SLP discovery for node 0x5f13780
> note:   get vectype for scalar type (group size 4): unsigned int
> note:   vectype: vector(4) unsigned int
>
>
>                  xor3.c:6:7: note:   get vectype for smallest scalar type:
> vec256
>
>                                                   xor3.c:6:7: note:   nunits
> vectype: vector(4) unsigned int
>
>
>  xor3.c:6:7: note:   nunits = 4
>
>
>                               xor3.c:6:7: note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
>
>
>                  xor3.c:6:7: note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
> note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
> 7, 3 }>, type of def: internal
> note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
> 7, 3 }>, type of def: internal
> note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
> 7, 3 }>, type of def: internal
> note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
> 7, 3 }>, type of def: internal
> note:   Using a splat of the uniform operand _15 = VEC_PERM_EXPR <_2, _2, { 4,
> 0, 5, 1, 6, 2, 7, 3 }>;
> note:   Building parent vector operands from scalars instead
> note:   SLP discovery for node 0x5f13780 failed
> note:   SLP discovery failed
> note:  recording new base alignment for x_5(D)
>
> OK that looks better, it now fails dealing with the VEC_PERM_EXPR, which for
> this should just be accepted
> as is...
>
> note:   === vect_analyze_slp ===
> note:   Starting SLP discovery for
> note:     *x_5(D) = _15;
> note:     *x_5(D) = _15;
> note:     *x_5(D) = _15;
> note:     *x_5(D) = _15;
> note:   starting SLP discovery for node 0x5f13780
> note:   get vectype for scalar type (group size 4): unsigned int
> note:   vectype: vector(4) unsigned int
> note:   get vectype for smallest scalar type: vec256
> note:   nunits vectype: vector(4) unsigned int
> note:   nunits = 4
> note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
> note:   Build SLP for *x_5(D) = _15;
> note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
> 7, 3 }>, type of def: internal
> note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
> 7, 3 }>, type of def: internal
> note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
> 7, 3 }>, type of def: internal
> note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
> 7, 3 }>, type of def: internal
> note:   Using a splat of the uniform operand _15 = VEC_PERM_EXPR <_2, _2, { 4,
> 0, 5, 1, 6, 2, 7, 3 }>;
> note:   Building parent vector operands from scalars instead
> note:   SLP discovery for node 0x5f13780 failed
> note:   SLP discovery failed
> note:  recording new base alignment for x_5(D)
>
> doing that gives
>
> note:   Using a splat of the uniform operand _15 = VEC_PERM_EXPR <_2, _2, { 4,
> 0, 5, 1, 6, 2, 7, 3 }>;
> note:   SLP discovery for node 0x779a780 succeeded
> note:   SLP size 1 vs. limit 5.
> note:   Final SLP tree for instance 0x7754ae0:
> note:   node 0x779a780 (max_nunits=4, refcnt=2) vector(4) unsigned int
> note:   op template: *x_5(D) = _15;
> note:       stmt 0 *x_5(D) = _15;
> note:       stmt 1 *x_5(D) = _15;
> note:       stmt 2 *x_5(D) = _15;
> note:       stmt 3 *x_5(D) = _15;
> note:       children 0x779a838
> note:   node (external) 0x779a838 (max_nunits=1, refcnt=1)
> note:       { _15, _15, _15, _15 }
>
> Which approach has better promise for shrinking the VF Richi? I'm leaning
> towards the second one,
> that is creating only lanes up to max supported vector size, storing the
> original nunit and using it during
> codegen.

create lanes according to the vector type in the IL.  Rely on
"ncopies" for code generation.  It should work the same as you'd
have scalar code with 16 lanes but vector type with 4 lanes.

> Note that trying to change unrolling so it matches the old VF seems to trigger
> node splitting:
>
> note:   SLP discovery succeeded but node needs splitting
>
> which then ICEs.  But haven't looked into that yet in case this is a bad
> approach.

That said - I think we can followup with the details but should make
sure the design of re-vectorization is sound to handle this case since
IMO it's trivially the "same".

Richard.

[Bug tree-optimization/125303] vector operation before shuffle produces unvectorized shuffle

Reply via email to