[Bug tree-optimization/125303] vector operation before shuffle produces unvectorized shuffle

tnfchris at gcc dot gnu.org via Gcc-bugs Mon, 18 May 2026 07:00:07 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125303


--- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> (In reply to Drea Pinski from comment #1)
> > Confirmed.
> > 
> > The difference comes from who does the expansion/combining.
> > 
> > So vec_shuf is handled by vectorizer SLP pass.
> > 
> > While vec_xor_shuf is handled by forwprop1 and veclowering pass.
> > 
> > The veclowering pass does not handle:
> >   _15 = VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2, 7, 3 }>;
> > 
> > in a partial wise and only handles scalar wise at this stage. Nobody has
> > improved it yet to try to use partial vector sizes.
> 
> The idea is that the re-vectorization patches from Tamar should address this
> (we'll have to trigger extra BB SLP seeds of course)

I've been looking into this today to see what's required here, as my current
patches
allow growing of the VF but not shrinking. The extra seed is easy enough to do:

@@ -10000,6 +10003,24 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo)
                }
            }
        }
+#if 1
+       else if (gimple_vdef (assign)
+              && VECTOR_TYPE_P (TREE_TYPE (rhs))
+              && ((op = optab_for_tree_code (code, TREE_TYPE (rhs),
optab_default)) == unknown_optab
+                  || !can_implement_p (op, TYPE_MODE (TREE_TYPE (rhs)))))
+       {
+         vec<stmt_vec_info> roots = vNULL;
+         auto stmt_vinfo = bb_vinfo->lookup_stmt (assign);
+         roots.safe_push (stmt_vinfo);
+         vec<stmt_vec_info> stmts;
+         auto num_entries = TYPE_VECTOR_SUBPARTS (TREE_TYPE (rhs)).to_constant
();
+         stmts.create (num_entries);
+         for (int i = 0; i < num_entries; i++)
+           stmts.quick_push (stmt_vinfo);
+         bb_vinfo->roots.safe_push (slp_root (slp_inst_kind_store,
+                                              stmts, roots));
+       }
+#endif
     }
 }

Which triggers a change in data_ref analysis since we don't allow loads groups
with zero step during BB vectorization.
Updating that hits the largest road block and as expected that's build_slp
analysis.

Because the group size is larger than any supported vector type it fails.
I changed it to pick the largest possible vectype and then if analysis succeeds
would set unroll fact.

However there are way too many things doing analysis as we expected so it still
fails:

 note:   === vect_analyze_data_ref_accesses ===
 note:   === vect_determine_precisions ===
 note:   === vect_pattern_recog ===
 note:   === vect_analyze_slp ===
 note:   Starting SLP discovery for
 note:     *x_5(D) = _15;
 note:     *x_5(D) = _15;
 note:     *x_5(D) = _15;
 note:     *x_5(D) = _15;
 note:     *x_5(D) = _15;
 note:     *x_5(D) = _15;
 note:     *x_5(D) = _15;
 note:     *x_5(D) = _15;
 note:   starting SLP discovery for node 0x5eba780
 note:   get vectype for scalar type (group size 8): unsigned int
 note:   vectype: vector(4) unsigned int
 note:   get vectype for smallest scalar type: vec256
 note:   nunits vectype: vector(4) unsigned int
 note:   nunits = 4
 note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;
 note:   SLP discovery for node 0x5eba780 failed
 note:   SLP discovery failed

So I'm not sure what the cleanest solution is he.. hmm maybe I'm looking at
this wrong...

Since the tree is supposed to already be vectorized, perhaps I should instead
build the initial
groups using the largest supported group size and then set the unroll based on
that..

 note:   === vect_analyze_slp ===                                              
                                                                               
                                                                               
                  xor3.c:6:7: note:   Starting SLP discovery for               
                                                                               
                                                                               
                                               xor3.c:6:7: note:     *x_5(D) =
_15;                                                                           
                                                                               
                                                                            
xor3.c:6:7: note:     *x_5(D) = _15;                                           
                                                                               
                                                                               
                             xor3.c:6:7: note:     *x_5(D) = _15;
 note:     *x_5(D) = _15;
 note:   starting SLP discovery for node 0x5f13780
 note:   get vectype for scalar type (group size 4): unsigned int
 note:   vectype: vector(4) unsigned int                                       
                                                                               
                                                                               
                  xor3.c:6:7: note:   get vectype for smallest scalar type:
vec256                                                                         
                                                                               
                                                   xor3.c:6:7: note:   nunits
vectype: vector(4) unsigned int                                                
                                                                               
                                                                               
  xor3.c:6:7: note:   nunits = 4                                               
                                                                               
                                                                               
                               xor3.c:6:7: note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;                                          
                                                                               
                                                                               
                  xor3.c:6:7: note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;
 note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
7, 3 }>, type of def: internal
 note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
7, 3 }>, type of def: internal
 note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
7, 3 }>, type of def: internal
 note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
7, 3 }>, type of def: internal
 note:   Using a splat of the uniform operand _15 = VEC_PERM_EXPR <_2, _2, { 4,
0, 5, 1, 6, 2, 7, 3 }>;
 note:   Building parent vector operands from scalars instead
 note:   SLP discovery for node 0x5f13780 failed
 note:   SLP discovery failed
 note:  recording new base alignment for x_5(D)

OK that looks better, it now fails dealing with the VEC_PERM_EXPR, which for
this should just be accepted
as is...

 note:   === vect_analyze_slp ===
 note:   Starting SLP discovery for
 note:     *x_5(D) = _15;
 note:     *x_5(D) = _15;
 note:     *x_5(D) = _15;
 note:     *x_5(D) = _15;
 note:   starting SLP discovery for node 0x5f13780
 note:   get vectype for scalar type (group size 4): unsigned int
 note:   vectype: vector(4) unsigned int
 note:   get vectype for smallest scalar type: vec256
 note:   nunits vectype: vector(4) unsigned int
 note:   nunits = 4
 note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;
 note:   Build SLP for *x_5(D) = _15;
 note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
7, 3 }>, type of def: internal
 note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
7, 3 }>, type of def: internal
 note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
7, 3 }>, type of def: internal
 note:   vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2,
7, 3 }>, type of def: internal
 note:   Using a splat of the uniform operand _15 = VEC_PERM_EXPR <_2, _2, { 4,
0, 5, 1, 6, 2, 7, 3 }>;
 note:   Building parent vector operands from scalars instead
 note:   SLP discovery for node 0x5f13780 failed
 note:   SLP discovery failed
 note:  recording new base alignment for x_5(D)

doing that gives

note:   Using a splat of the uniform operand _15 = VEC_PERM_EXPR <_2, _2, { 4,
0, 5, 1, 6, 2, 7, 3 }>;
note:   SLP discovery for node 0x779a780 succeeded
note:   SLP size 1 vs. limit 5.
note:   Final SLP tree for instance 0x7754ae0:
note:   node 0x779a780 (max_nunits=4, refcnt=2) vector(4) unsigned int
note:   op template: *x_5(D) = _15;
note:       stmt 0 *x_5(D) = _15;
note:       stmt 1 *x_5(D) = _15;
note:       stmt 2 *x_5(D) = _15;
note:       stmt 3 *x_5(D) = _15;
note:       children 0x779a838
note:   node (external) 0x779a838 (max_nunits=1, refcnt=1)
note:       { _15, _15, _15, _15 }

Which approach has better promise for shrinking the VF Richi? I'm leaning
towards the second one,
that is creating only lanes up to max supported vector size, storing the
original nunit and using it during
codegen.

Note that trying to change unrolling so it matches the old VF seems to trigger
node splitting:

note:   SLP discovery succeeded but node needs splitting

which then ICEs.  But haven't looked into that yet in case this is a bad
approach.

[Bug tree-optimization/125303] vector operation before shuffle produces unvectorized shuffle

Reply via email to