https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125303
--- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #2)
> (In reply to Drea Pinski from comment #1)
> > Confirmed.
> >
> > The difference comes from who does the expansion/combining.
> >
> > So vec_shuf is handled by vectorizer SLP pass.
> >
> > While vec_xor_shuf is handled by forwprop1 and veclowering pass.
> >
> > The veclowering pass does not handle:
> > _15 = VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2, 7, 3 }>;
> >
> > in a partial wise and only handles scalar wise at this stage. Nobody has
> > improved it yet to try to use partial vector sizes.
>
> The idea is that the re-vectorization patches from Tamar should address this
> (we'll have to trigger extra BB SLP seeds of course)
I've been looking into this today to see what's required here, as my current
patches allow growing of the VF but not shrinking. The extra seed is easy
enough to do:
@@ -10000,6 +10003,24 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo)
}
}
}
+#if 1
+ else if (gimple_vdef (assign)
+ && VECTOR_TYPE_P (TREE_TYPE (rhs))
+ && ((op = optab_for_tree_code (code, TREE_TYPE (rhs), optab_default)) == unknown_optab
+ || !can_implement_p (op, TYPE_MODE (TREE_TYPE (rhs)))))
+ {
+ vec<stmt_vec_info> roots = vNULL;
+ auto stmt_vinfo = bb_vinfo->lookup_stmt (assign);
+ roots.safe_push (stmt_vinfo);
+ vec<stmt_vec_info> stmts;
+ auto num_entries = TYPE_VECTOR_SUBPARTS (TREE_TYPE (rhs)).to_constant ();
+ stmts.create (num_entries);
+ for (int i = 0; i < num_entries; i++)
+ stmts.quick_push (stmt_vinfo);
+ bb_vinfo->roots.safe_push (slp_root (slp_inst_kind_store,
+ stmts, roots));
+ }
+#endif
}
}
This triggers a change in data_ref analysis, since we don't allow load groups
with zero step during BB vectorization.
Updating that hits the largest roadblock, which as expected is the build_slp
analysis: because the group size is larger than any supported vector type, it
fails.
I changed it to pick the largest possible vectype and then, if analysis
succeeds, set the unroll factor.
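To make the sizes concrete: in this report the group has 8 lanes (a 256-bit vector of unsigned int) while the widest vectype the dumps show is vector(4) unsigned int. The following is a minimal standalone sketch of that arithmetic only. It is a toy model, not vectorizer code, and the helper names `pick_nunits` and `vector_stmts_needed` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Toy model, not GCC code: pick the largest supported lane count that
// evenly divides the group size.  Returns 0 if nothing fits.
static unsigned
pick_nunits (unsigned group_size, const std::vector<unsigned> &supported)
{
  unsigned best = 0;
  for (unsigned n : supported)
    if (n <= group_size && group_size % n == 0 && n > best)
      best = n;
  return best;
}

// Number of vector statements needed to cover the whole group once a
// lane count has been chosen.
static unsigned
vector_stmts_needed (unsigned group_size, unsigned nunits)
{
  assert (nunits != 0 && group_size % nunits == 0);
  return group_size / nunits;
}
```

With a group size of 8 and supported lane counts {2, 4}, `pick_nunits` selects 4 and `vector_stmts_needed` reports that two vector statements are required to cover the original store.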
However, as we expected, there are far too many things doing analysis, so it
still fails:
note: === vect_analyze_data_ref_accesses ===
note: === vect_determine_precisions ===
note: === vect_pattern_recog ===
note: === vect_analyze_slp ===
note: Starting SLP discovery for
note: *x_5(D) = _15;
note: *x_5(D) = _15;
note: *x_5(D) = _15;
note: *x_5(D) = _15;
note: *x_5(D) = _15;
note: *x_5(D) = _15;
note: *x_5(D) = _15;
note: *x_5(D) = _15;
note: starting SLP discovery for node 0x5eba780
note: get vectype for scalar type (group size 8): unsigned int
note: vectype: vector(4) unsigned int
note: get vectype for smallest scalar type: vec256
note: nunits vectype: vector(4) unsigned int
note: nunits = 4
note: Build SLP for *x_5(D) = _15;
note: Build SLP for *x_5(D) = _15;
note: Build SLP for *x_5(D) = _15;
note: Build SLP for *x_5(D) = _15;
note: Build SLP for *x_5(D) = _15;
note: Build SLP for *x_5(D) = _15;
note: Build SLP for *x_5(D) = _15;
note: Build SLP for *x_5(D) = _15;
note: SLP discovery for node 0x5eba780 failed
note: SLP discovery failed
So I'm not sure what the cleanest solution is here... hmm, maybe I'm looking at
this wrong.
Since the tree is supposed to already be vectorized, perhaps I should instead
build the initial groups using the largest supported group size and then set
the unroll factor based on that:
note: === vect_analyze_slp ===
note: Starting SLP discovery for
note: *x_5(D) = _15;
note: *x_5(D) = _15;
note: *x_5(D) = _15;
note: *x_5(D) = _15;
note: starting SLP discovery for node 0x5f13780
note: get vectype for scalar type (group size 4): unsigned int
note: vectype: vector(4) unsigned int
note: get vectype for smallest scalar type: vec256
note: nunits vectype: vector(4) unsigned int
note: nunits = 4
note: Build SLP for *x_5(D) = _15;
note: Build SLP for *x_5(D) = _15;
note: Build SLP for *x_5(D) = _15;
note: Build SLP for *x_5(D) = _15;
note: vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2, 7, 3 }>, type of def: internal
note: vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2, 7, 3 }>, type of def: internal
note: vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2, 7, 3 }>, type of def: internal
note: vect_is_simple_use: operand VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2, 7, 3 }>, type of def: internal
note: Using a splat of the uniform operand _15 = VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2, 7, 3 }>;
note: Building parent vector operands from scalars instead
note: SLP discovery for node 0x5f13780 failed
note: SLP discovery failed
note: recording new base alignment for x_5(D)
OK, that looks better. It now fails when dealing with the VEC_PERM_EXPR, which
in this case should just be accepted as is.
Doing that gives:
note: Using a splat of the uniform operand _15 = VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2, 7, 3 }>;
note: SLP discovery for node 0x779a780 succeeded
note: SLP size 1 vs. limit 5.
note: Final SLP tree for instance 0x7754ae0:
note: node 0x779a780 (max_nunits=4, refcnt=2) vector(4) unsigned int
note: op template: *x_5(D) = _15;
note: stmt 0 *x_5(D) = _15;
note: stmt 1 *x_5(D) = _15;
note: stmt 2 *x_5(D) = _15;
note: stmt 3 *x_5(D) = _15;
note: children 0x779a838
note: node (external) 0x779a838 (max_nunits=1, refcnt=1)
note: { _15, _15, _15, _15 }
Which approach has more promise for shrinking the VF, Richi? I'm leaning
towards the second one, that is, creating lanes only up to the max supported
vector size, storing the original nunits, and using it during codegen.
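As an illustration of the lane arithmetic that such a codegen step would have to reproduce, here is a standalone sketch showing that the 8-lane permute from the dump is equivalent to two 4-lane two-input permutes over the low and high halves. This is a toy model and not the proposed patch. The helpers `perm`, `perm_full` and `perm_split` are hypothetical names, and the only thing borrowed from GCC is the VEC_PERM_EXPR convention that selector values index into the concatenation of the two inputs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy two-input permute: selector values < N pick from a, values >= N
// pick from b (the same indexing convention VEC_PERM_EXPR uses).
template <std::size_t N>
static std::array<unsigned, N>
perm (const std::array<unsigned, N> &a, const std::array<unsigned, N> &b,
      const std::array<unsigned, N> &sel)
{
  std::array<unsigned, N> r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = sel[i] < N ? a[sel[i]] : b[sel[i] - N];
  return r;
}

// The 8-lane permute from the dump, applied to the full vector.
static std::array<unsigned, 8>
perm_full (const std::array<unsigned, 8> &v)
{
  return perm<8> (v, v, {4, 0, 5, 1, 6, 2, 7, 3});
}

// The same permute expressed as two 4-lane permutes of the two halves.
static std::array<unsigned, 8>
perm_split (const std::array<unsigned, 8> &v)
{
  std::array<unsigned, 4> lo{v[0], v[1], v[2], v[3]};
  std::array<unsigned, 4> hi{v[4], v[5], v[6], v[7]};
  auto r_lo = perm<4> (lo, hi, {4, 0, 5, 1});  // result lanes 0..3
  auto r_hi = perm<4> (lo, hi, {6, 2, 7, 3});  // result lanes 4..7
  return {r_lo[0], r_lo[1], r_lo[2], r_lo[3],
          r_hi[0], r_hi[1], r_hi[2], r_hi[3]};
}
```

Because the low and high halves concatenate back to the original vector, the two half-width selectors are simply the two halves of the original selector, so for any input `perm_full (v)` and `perm_split (v)` produce the same lanes.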
Note that trying to change the unrolling so it matches the old VF seems to
trigger node splitting:
note: SLP discovery succeeded but node needs splitting
which then ICEs. I haven't looked into that yet, in case this is a bad
approach.