RE: [PATCH 1/2][GCC][RFC][middle-end]: Add SLP pattern matcher.

2019-10-12 Thread Tamar Christina
Hi Richi,

> -Original Message-
> From: Richard Biener 
> Sent: Friday, October 11, 2019 8:02 AM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; o...@ucw.cz
> Subject: Re: [PATCH 1/2][GCC][RFC][middle-end]: Add SLP pattern matcher.
> 
> On Tue, 8 Oct 2019, Tamar Christina wrote:
> 
> > Hi Richi,
> >
> > Thanks for the review, I've added some comments inline.
> >
> > The 10/07/2019 12:15, Richard Biener wrote:
> > > On Tue, 1 Oct 2019, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This adds a framework to allow pattern matchers to be written at
> > > > based on the SLP tree.  The difference between this one and the
> > > > one in tree-vect-patterns is that this matcher allows the matching
> > > > of an arbitrary number of parallel statements and replacing of an
> arbitrary number of children or statements.
> > > >
> > > > Any relationship created by the SLP pattern matcher will be undone if
> SLP fails.
> > > >
> > > > The pattern matcher can also cancel all permutes depending on what
> > > > the pattern requested it to do.  As soon as one pattern requests
> > > > the permutes to be cancelled all permutes are cancelled.
> > > >
> > > > Compared to the previous pattern matcher this one will work for an
> > > > arbitrary group size and will match at any arbitrary node in the
> > > > SLP tree.  The only requirement is that the entire node is matched or
> rejected.
> > > >
> > > > vect_build_slp_tree_1 is a bit more lenient in what it accepts as
> > > > "compatible operations" but the matcher cannot be because in cases
> > > > where you match the order of the operands may be changed.  So all
> operands must be changed or none.
> > > >
> > > > Furthermore the matcher relies on canonization of the operations
> > > > inside the SLP tree and on the fact that floating math operations are 
> > > > not
> commutative.
> > > > This means that matching a pattern does not need to try all
> > > > alternatives or combinations and that arguments will always be in
> > > > the same order if it's the same operation.
> > > >
> > > > The pattern matcher also ignored uninteresting nodes such as type
> > > > casts, loads and stores.  Doing so is essential to keep the runtime 
> > > > down.
> > > >
> > > > Each matcher is allowed a post condition that can be run to
> > > > perform any changes to the SLP tree as needed before the patterns
> > > > are created and may also abort the creation of the patterns.
> > > >
> > > > When a pattern is matched it is not immediately created but
> > > > instead it is deferred until all statements in the node have been
> > > > analyzed.  Only if all post conditions are true, and all
> > > > statements will be replaced will the patterns be created in batch.
> > > > This allows us to not have to undo any work if the pattern fails but 
> > > > also
> makes it so we traverse the tree only once.
> > > >
> > > > When a new pattern is created it is a marked as a pattern to the
> > > > statement it is replacing and be marked as used in the current SLP
> > > > scope.  If SLP fails then relationship is undone and the relevancy
> restored.
> > > >
> > > > Each pattern matcher can detect any number of pattern it wants.
> > > > The only constraint is that the optabs they produce must all have the
> same arity.
> > > >
> > > > The pattern matcher supports instructions that have no scalar form
> > > > as they are added as pattern statements to the stmt.  The BB is
> > > > left untouched and so the scalar loop is untouched.
> > > >
> > > > Bootstrapped on aarch64-none-linux-gnu and no issues.
> > > > No regression testing done yet.
> > >
> > > If you split out the introduction of SLP_TREE_REF_COUNT you can
> > > commit that right now (sorry for being too lazy there...).
> > >
> >
> > I'll split those off :)
> >
> > > One overall comment - you do pattern matching after SLP tree
> > > creation (good) but still do it before the whole SLP graph is
> > > created (bad).  Would it be possible to instead do it as a separate
> > > phase in vect_analyze_slp, looping over all instances (the instances
> > >

Re: [PATCH 1/2][GCC][RFC][middle-end]: Add SLP pattern matcher.

2019-10-11 Thread Richard Biener
On Tue, 8 Oct 2019, Tamar Christina wrote:

> Hi Richi,
> 
> Thanks for the review, I've added some comments inline.
> 
> The 10/07/2019 12:15, Richard Biener wrote:
> > On Tue, 1 Oct 2019, Tamar Christina wrote:
> > 
> > > Hi All,
> > > 
> > > This adds a framework to allow pattern matchers to be written at based on 
> > > the
> > > SLP tree.  The difference between this one and the one in 
> > > tree-vect-patterns is
> > > that this matcher allows the matching of an arbitrary number of parallel
> > > statements and replacing of an arbitrary number of children or statements.
> > > 
> > > Any relationship created by the SLP pattern matcher will be undone if SLP 
> > > fails.
> > > 
> > > The pattern matcher can also cancel all permutes depending on what the 
> > > pattern
> > > requested it to do.  As soon as one pattern requests the permutes to be
> > > cancelled all permutes are cancelled.
> > > 
> > > Compared to the previous pattern matcher this one will work for an 
> > > arbitrary
> > > group size and will match at any arbitrary node in the SLP tree.  The only
> > > requirement is that the entire node is matched or rejected.
> > > 
> > > vect_build_slp_tree_1 is a bit more lenient in what it accepts as 
> > > "compatible
> > > operations" but the matcher cannot be because in cases where you match 
> > > the order
> > > of the operands may be changed.  So all operands must be changed or none.
> > > 
> > > Furthermore the matcher relies on canonization of the operations inside 
> > > the
> > > SLP tree and on the fact that floating math operations are not 
> > > commutative.
> > > This means that matching a pattern does not need to try all alternatives 
> > > or
> > > combinations and that arguments will always be in the same order if it's 
> > > the
> > > same operation.
> > > 
> > > The pattern matcher also ignored uninteresting nodes such as type casts, 
> > > loads
> > > and stores.  Doing so is essential to keep the runtime down.
> > > 
> > > Each matcher is allowed a post condition that can be run to perform any 
> > > changes
> > > to the SLP tree as needed before the patterns are created and may also 
> > > abort
> > > the creation of the patterns.
> > > 
> > > When a pattern is matched it is not immediately created but instead it is
> > > deferred until all statements in the node have been analyzed.  Only if 
> > > all post
> > > conditions are true, and all statements will be replaced will the 
> > > patterns be
> > > created in batch.  This allows us to not have to undo any work if the 
> > > pattern
> > > fails but also makes it so we traverse the tree only once.
> > > 
> > > When a new pattern is created it is a marked as a pattern to the 
> > > statement it is
> > > replacing and be marked as used in the current SLP scope.  If SLP fails 
> > > then
> > > relationship is undone and the relevancy restored.
> > > 
> > > Each pattern matcher can detect any number of pattern it wants.  The only
> > > constraint is that the optabs they produce must all have the same arity.
> > > 
> > > The pattern matcher supports instructions that have no scalar form as they
> > > are added as pattern statements to the stmt.  The BB is left untouched and
> > > so the scalar loop is untouched.
> > > 
> > > Bootstrapped on aarch64-none-linux-gnu and no issues.
> > > No regression testing done yet.
> > 
> > If you split out the introduction of SLP_TREE_REF_COUNT you can commit
> > that right now (sorry for being too lazy there...).
> > 
> 
> I'll split those off :)
> 
> > One overall comment - you do pattern matching after SLP tree
> > creation (good) but still do it before the whole SLP graph is
> > created (bad).  Would it be possible to instead do it as a separate
> > phase in vect_analyze_slp, looping over all instances (the instances
> > present entries into the single unified SLP graph now), avoiding
> > to visit "duplicates"?
> > 
> 
> It should be, the only issue I can see is that build SLP may fail because of
> an unsupported permute, or because it can use load lanes.  If I'm 
> understanding
> it correctly you wouldn't get SLP vectorization in those cases so then the 
> matching
> can't work? So it would limit it a it more.

True.  As said later ideally the pattern matching woudl happen before
or rather as part of SLP building so that the operations then actually
match (and we then don't need that plus/minus SLP_TREE_TWO_OPERATORS
handling).  At the moment it sits in a quite awkward place which
may contribute to its complexity.

> > In patch 1/2 I see references (mostly in variable names) to
> > "complex" which is IMHO too specific.
> > 
> 
> Sorry, missed those during my cleanup.
> 
> > I'd also think that a useful first pattern to support would be
> > what we do with SLP_TREE_TWO_OPERATORS, where code generation
> > inserts extra blends.  I have yet to dive into the complex pattern
> > details to see if that's feasible or if you maybe re-use that...
> 
> Oh, I hadn't thought of that. I'll take a look.

Re: [PATCH 1/2][GCC][RFC][middle-end]: Add SLP pattern matcher.

2019-10-08 Thread Tamar Christina
Hi Richi,

Thanks for the review, I've added some comments inline.

The 10/07/2019 12:15, Richard Biener wrote:
> On Tue, 1 Oct 2019, Tamar Christina wrote:
> 
> > Hi All,
> > 
> > This adds a framework to allow pattern matchers to be written at based on 
> > the
> > SLP tree.  The difference between this one and the one in 
> > tree-vect-patterns is
> > that this matcher allows the matching of an arbitrary number of parallel
> > statements and replacing of an arbitrary number of children or statements.
> > 
> > Any relationship created by the SLP pattern matcher will be undone if SLP 
> > fails.
> > 
> > The pattern matcher can also cancel all permutes depending on what the 
> > pattern
> > requested it to do.  As soon as one pattern requests the permutes to be
> > cancelled all permutes are cancelled.
> > 
> > Compared to the previous pattern matcher this one will work for an arbitrary
> > group size and will match at any arbitrary node in the SLP tree.  The only
> > requirement is that the entire node is matched or rejected.
> > 
> > vect_build_slp_tree_1 is a bit more lenient in what it accepts as 
> > "compatible
> > operations" but the matcher cannot be because in cases where you match the 
> > order
> > of the operands may be changed.  So all operands must be changed or none.
> > 
> > Furthermore the matcher relies on canonization of the operations inside the
> > SLP tree and on the fact that floating math operations are not commutative.
> > This means that matching a pattern does not need to try all alternatives or
> > combinations and that arguments will always be in the same order if it's the
> > same operation.
> > 
> > The pattern matcher also ignored uninteresting nodes such as type casts, 
> > loads
> > and stores.  Doing so is essential to keep the runtime down.
> > 
> > Each matcher is allowed a post condition that can be run to perform any 
> > changes
> > to the SLP tree as needed before the patterns are created and may also abort
> > the creation of the patterns.
> > 
> > When a pattern is matched it is not immediately created but instead it is
> > deferred until all statements in the node have been analyzed.  Only if all 
> > post
> > conditions are true, and all statements will be replaced will the patterns 
> > be
> > created in batch.  This allows us to not have to undo any work if the 
> > pattern
> > fails but also makes it so we traverse the tree only once.
> > 
> > When a new pattern is created it is a marked as a pattern to the statement 
> > it is
> > replacing and be marked as used in the current SLP scope.  If SLP fails then
> > relationship is undone and the relevancy restored.
> > 
> > Each pattern matcher can detect any number of pattern it wants.  The only
> > constraint is that the optabs they produce must all have the same arity.
> > 
> > The pattern matcher supports instructions that have no scalar form as they
> > are added as pattern statements to the stmt.  The BB is left untouched and
> > so the scalar loop is untouched.
> > 
> > Bootstrapped on aarch64-none-linux-gnu and no issues.
> > No regression testing done yet.
> 
> If you split out the introduction of SLP_TREE_REF_COUNT you can commit
> that right now (sorry for being too lazy there...).
> 

I'll split those off :)

> One overall comment - you do pattern matching after SLP tree
> creation (good) but still do it before the whole SLP graph is
> created (bad).  Would it be possible to instead do it as a separate
> phase in vect_analyze_slp, looping over all instances (the instances
> present entries into the single unified SLP graph now), avoiding
> to visit "duplicates"?
> 

It should be, the only issue I can see is that build SLP may fail because of
an unsupported permute, or because it can use load lanes.  If I'm understanding
it correctly you wouldn't get SLP vectorization in those cases so then the 
matching
can't work? So it would limit it a it more.

> In patch 1/2 I see references (mostly in variable names) to
> "complex" which is IMHO too specific.
> 

Sorry, missed those during my cleanup.

> I'd also think that a useful first pattern to support would be
> what we do with SLP_TREE_TWO_OPERATORS, where code generation
> inserts extra blends.  I have yet to dive into the complex pattern
> details to see if that's feasible or if you maybe re-use that...

Oh, I hadn't thought of that. I'll take a look.

> You seem to at least rely on the SLP build succeeding with
> the mixed plus/minus cases?  Which also restricts the kind of
> patterns we can recognize I guess.  Plus it shows the chicken-and-egg
> issue here - we'd like to detect the patterns _first_ and then
> build the SLP trees (or rather combine lanes into larger groups).
> Doing it after the fact makes it no more powerful than doing
> it as folding post vectorization?

It's true that I do rely on build SLP succeeding, and there is one case I know
of where SLP build fails when I didn't expect it to.  But I was operating under
the impression that tha

Re: [PATCH 1/2][GCC][RFC][middle-end]: Add SLP pattern matcher.

2019-10-07 Thread Richard Biener
On Tue, 1 Oct 2019, Tamar Christina wrote:

> Hi All,
> 
> This adds a framework to allow pattern matchers to be written at based on the
> SLP tree.  The difference between this one and the one in tree-vect-patterns 
> is
> that this matcher allows the matching of an arbitrary number of parallel
> statements and replacing of an arbitrary number of children or statements.
> 
> Any relationship created by the SLP pattern matcher will be undone if SLP 
> fails.
> 
> The pattern matcher can also cancel all permutes depending on what the pattern
> requested it to do.  As soon as one pattern requests the permutes to be
> cancelled all permutes are cancelled.
> 
> Compared to the previous pattern matcher this one will work for an arbitrary
> group size and will match at any arbitrary node in the SLP tree.  The only
> requirement is that the entire node is matched or rejected.
> 
> vect_build_slp_tree_1 is a bit more lenient in what it accepts as "compatible
> operations" but the matcher cannot be because in cases where you match the 
> order
> of the operands may be changed.  So all operands must be changed or none.
> 
> Furthermore the matcher relies on canonization of the operations inside the
> SLP tree and on the fact that floating math operations are not commutative.
> This means that matching a pattern does not need to try all alternatives or
> combinations and that arguments will always be in the same order if it's the
> same operation.
> 
> The pattern matcher also ignored uninteresting nodes such as type casts, loads
> and stores.  Doing so is essential to keep the runtime down.
> 
> Each matcher is allowed a post condition that can be run to perform any 
> changes
> to the SLP tree as needed before the patterns are created and may also abort
> the creation of the patterns.
> 
> When a pattern is matched it is not immediately created but instead it is
> deferred until all statements in the node have been analyzed.  Only if all 
> post
> conditions are true, and all statements will be replaced will the patterns be
> created in batch.  This allows us to not have to undo any work if the pattern
> fails but also makes it so we traverse the tree only once.
> 
> When a new pattern is created it is a marked as a pattern to the statement it 
> is
> replacing and be marked as used in the current SLP scope.  If SLP fails then
> relationship is undone and the relevancy restored.
> 
> Each pattern matcher can detect any number of pattern it wants.  The only
> constraint is that the optabs they produce must all have the same arity.
> 
> The pattern matcher supports instructions that have no scalar form as they
> are added as pattern statements to the stmt.  The BB is left untouched and
> so the scalar loop is untouched.
> 
> Bootstrapped on aarch64-none-linux-gnu and no issues.
> No regression testing done yet.

If you split out the introduction of SLP_TREE_REF_COUNT you can commit
that right now (sorry for being too lazy there...).

One overall comment - you do pattern matching after SLP tree
creation (good) but still do it before the whole SLP graph is
created (bad).  Would it be possible to instead do it as a separate
phase in vect_analyze_slp, looping over all instances (the instances
present entries into the single unified SLP graph now), avoiding
to visit "duplicates"?

In patch 1/2 I see references (mostly in variable names) to
"complex" which is IMHO too specific.

I'd also think that a useful first pattern to support would be
what we do with SLP_TREE_TWO_OPERATORS, where code generation
inserts extra blends.  I have yet to dive into the complex pattern
details to see if that's feasible or if you maybe re-use that...
You seem to at least rely on the SLP build succeeding with
the mixed plus/minus cases?  Which also restricts the kind of
patterns we can recognize I guess.  Plus it shows the chicken-and-egg
issue here - we'd like to detect the patterns _first_ and then
build the SLP trees (or rather combine lanes into larger groups).
Doing it after the fact makes it no more powerful than doing
it as folding post vectorization?

+typedef hash_map 
+  ssa_name_def_to_slp_tree_map_t;
+

this seems to be unused.

+  FOR_EACH_VEC_ELT_FROM (scalar_stmts, i, stmt_info, n - 1)
+{
...
+  work.stmt_infos = scalar_stmts.begin () + (i - (n - 1));
+  work.idx = i;

so this tries to match patterns on ARITY arbitrary aligned
but adjacent scalar stmts [in a brute force way].  But then
you immediately fail when one matching fails.  So I think
this loop should be written like

 for (unsigned i = n - 1; i < scalar_stmts.length (); i += n)
   {
...

to make this fact clearer.  A missed optimization here is to consider
pre/post shuffling of the group to make more cases match.

In this loop you also do the target support check which could
possibly be done upfront in the pattern matcher itself to save
compile-time?  It also seems to be required that patterns match
a single IFN call?

Looking at the complex patterns I am wo