[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-11-13 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

--- Comment #16 from Richard Biener  ---
Author: rguenth
Date: Thu Nov 13 08:45:29 2014
New Revision: 217464

URL: https://gcc.gnu.org/viewcvs?rev=217464&root=gcc&view=rev
Log:
2014-12-13  Richard Biener  

PR middle-end/61559
* match.pd: Implement bswap patterns for transforms checked by
gcc.dg/builtin-bswap-8.c.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/match.pd


[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-11-13 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #17 from Richard Biener  ---
Fixed.  Note I've only implemented what is necessary for the testcase not what
was discussed additionally.


[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-09-04 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

--- Comment #15 from Jakub Jelinek  ---
(In reply to rguent...@suse.de from comment #14)
> On Thu, 4 Sep 2014, ubizjak at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559
> > 
> > --- Comment #12 from Uroš Bizjak  ---
> > (In reply to Uroš Bizjak from comment #11)
> > 
> > > This one should be:
> > > 
> > > (simplify
> > >   (bswap (bitop (bswap @0) (bswap @1)))
> > >   (bitop @0 @1))
> > 
> > Oh, we already have this. Please disregard this message.
> 
> Not sure - we don't exactly have it.  OTOH I think
> that
> 
> (simplify
>   (bitop (bswap @0) (bswap @1))
>   (bswap (bitop @0 @1)))
> 
> is profitable in most cases (not only when wrapped inside another
> bswap).  Jakubs concern applies though.

My concern has not been backed up by data from any target, and generally I'd
say it should be better if we are able to canonicalize in the middle-end (e.g.
for VN etc.).
The question is if for selected ISAs and if the arguments are all memories e.g.
right before expansion or during expansion or RTL optimizations it might not be
worth to undo that if it proves to be beneficial.

[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-09-04 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

--- Comment #14 from rguenther at suse dot de  ---
On Thu, 4 Sep 2014, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559
> 
> --- Comment #12 from Uroš Bizjak  ---
> (In reply to Uroš Bizjak from comment #11)
> 
> > This one should be:
> > 
> > (simplify
> >   (bswap (bitop (bswap @0) (bswap @1)))
> >   (bitop @0 @1))
> 
> Oh, we already have this. Please disregard this message.

Not sure - we don't exactly have it.  OTOH I think
that

(simplify
  (bitop (bswap @0) (bswap @1))
  (bswap (bitop @0 @1)))

is profitable in most cases (not only when wrapped inside another
bswap).  Jakubs concern applies though.

OTOH

(simplify
  (bswap (bitop (bswap @0) INTEGER_CST@1))
  (bitop @0 (bswap @1)

may generally apply (not only to INTEGER_CST 2nd operand to bitop).
The argument would be that one bswap goes away.  But again if
@0 was a memory and @1 is not then this isn't always profitable(?)
(well, the outer bswap goes away).

So merge the restricted first one and this to

   (simplify
 (bswap (bitop:c (bswap @0) @1))
 (bitop @0 (bswap @1)))

?  Thus only one operand of the bitop needs to be a bswap if
there is an outer bswap.  For constant @1 the inner bswap
will be folded away as well as if @1 is a bswap itself.


/* PR61559.  Transforms for gcc.dg/builtin-bswap-8.c  */
(for bswap in BUILT_IN_BSWAP16 BUILT_IN_BSWAP32 BUILT_IN_BSWAP64
  (simplify
(bswap (bswap @0))
@0)
  (simplify
(bswap (bit_not (bswap @0)))
(bit_not @0))
  (for bitop in bit_xor bit_ior bit_and
/* This might not be profitable if the inner bswaps are
   free because @0 and @1 are memory operands and the
   target has an instruction for load+bswap.  */
(simplify
  (bitop (bswap @0) (bswap @1))
  (bswap (bitop @0 @1)))
(simplify
  (bswap (bitop:c (bswap @0) @1))
  (bitop @0 (bswap @1)

[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-09-04 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

--- Comment #13 from Marc Glisse  ---
(In reply to Richard Biener from comment #6)
> (for bitop in bit_xor bit_ior bit_and
>   (for bswap in BUILT_IN_BSWAP16 BUILT_IN_BSWAP32 BUILT_IN_BSWAP64
> (simplify
>   (bitop (bswap @0) (bswap @1))
>   (bswap (bitop @0 @1
>   (simplify
> (bitop (vec_perm @1 @2 @0) (vec_perm @3 @4 @0))
> (vec_perm (bitop @1 @3) (bitop @2 @4) @0)))
> 
> not sure if the vector permute one is profitable (but I guess a
> permute is always more expensive than a bit operation).

For the vector version, there is no reason to restrict to bit operations. It
works just as well for float addition, etc. And at least in the case where
@1==@3&&@2==@4 (I hope there is always a CSE to clean up the duplicated bitop)
and everything is single-use, it is profitable.

> The requested transform of course relies on somebody transforming
> bswap (bswap (x)) to x and for vec_perm detecting a cancelling
> operation (tree-ssa-forwprop.c can do that already I think).

Yes for vec_perm, in simplify_permutation.


[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-09-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

--- Comment #12 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #11)

> This one should be:
> 
> (simplify
>   (bswap (bitop (bswap @0) (bswap @1)))
>   (bitop @0 @1))

Oh, we already have this. Please disregard this message.

[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-09-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

--- Comment #11 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #9)
> Aren't these optimizations actually a pessimization for -mmovbe if the inner
> bswap is on a read from memory?  Assuming the load and bswap instruction is
> cheap, then e.g. loading two values with bswap on them and doing say xor on
> them afterwards might be cheaper than load the two values, xor them and then
> bswap them (because for that bswap you don't have a load+bswap instruction).

(simplify
  (bitop (bswap @0) (bswap @1))
  (bswap (bitop @0 @1)))

This one should be:

(simplify
  (bswap (bitop (bswap @0) (bswap @1)))
  (bitop @0 @1))

This is what builtin-bswap-8.c tests, and I believe it will address Jakub's
concerns.

[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-09-04 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

--- Comment #10 from rguenther at suse dot de  ---
On Thu, 4 Sep 2014, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559
> 
> Jakub Jelinek  changed:
> 
>What|Removed |Added
> 
>  CC||jakub at gcc dot gnu.org
> 
> --- Comment #9 from Jakub Jelinek  ---
> Aren't these optimizations actually a pessimization for -mmovbe if the inner
> bswap is on a read from memory?  Assuming the load and bswap instruction is
> cheap, then e.g. loading two values with bswap on them and doing say xor on
> them afterwards might be cheaper than load the two values, xor them and then
> bswap them (because for that bswap you don't have a load+bswap instruction).

Depends on how fast that load+bswap instruction is I suppose (if it
plays nicely with things like store-forwarding on the pipeline
and pipelines as well as regular loads, etc.).

That said - what does the optimization guides say on consecutive
movbe instructions vs. non-movbe and a bswap instruction?


[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-09-04 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #9 from Jakub Jelinek  ---
Aren't these optimizations actually a pessimization for -mmovbe if the inner
bswap is on a read from memory?  Assuming the load and bswap instruction is
cheap, then e.g. loading two values with bswap on them and doing say xor on
them afterwards might be cheaper than load the two values, xor them and then
bswap them (because for that bswap you don't have a load+bswap instruction).


[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-09-04 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

--- Comment #8 from rguenther at suse dot de  ---
On Thu, 4 Sep 2014, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559
> 
> --- Comment #7 from Uroš Bizjak  ---
> (In reply to Richard Biener from comment #6)
> 
> > Seems to me we want
> > 
> >   (bit_xor (bswap32 @0) (bswap32 @1)) -> (bswap32 (bit_xor @0 @1))
> > 
> > in match-and-simplify speak.
> 
> Please note that there are some more interesting transformations involving
> immediates and bit_not, as shown in builtin-bswap-8.c.

Added.

/* PR61559.  Transforms for gcc.dg/builtin-bswap-8.c  */
(for bswap in BUILT_IN_BSWAP16 BUILT_IN_BSWAP32 BUILT_IN_BSWAP64
  (simplify
(bswap (bswap @0))
@0)
  (simplify
(bswap (bit_not (bswap @0)))
(bit_not @0))
  (for bitop in bit_xor bit_ior bit_and
(simplify
  (bitop (bswap @0) (bswap @1))
  (bswap (bitop @0 @1)))
(simplify
  (bswap (bitop (bswap @0) INTEGER_CST@1))
  (bitop @0 (bswap @1)

[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-09-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

--- Comment #7 from Uroš Bizjak  ---
(In reply to Richard Biener from comment #6)

> Seems to me we want
> 
>   (bit_xor (bswap32 @0) (bswap32 @1)) -> (bswap32 (bit_xor @0 @1))
> 
> in match-and-simplify speak.

Please note that there are some more interesting transformations involving
immediates and bit_not, as shown in builtin-bswap-8.c.

[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-09-04 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot 
gnu.org

--- Comment #6 from Richard Biener  ---
(In reply to Uroš Bizjak from comment #5)
> (In reply to Eric Botcazou from comment #4)
> > I guess the transformations should accept MEMs instead of just REGs but, no,
> > I'm not particularly interested in quirks of CISC architectures, I have
> > enough to do with those of RISC architectures.
> 
> The problem is that with both function arguments in memory, combine
> simplifies sequence of bswaps with memory argument ( == movbe) in foo7 to:
> 
> Failed to match this instruction:
> (set (reg:SI 84 [ D.2318 ])
> (xor:SI (mem/c:SI (plus:SI (reg/f:SI 16 argp)
> (const_int 4 [0x4])) [2 b+0 S4 A32])
> (mem/c:SI (reg/f:SI 16 argp) [2 a+0 S4 A32])))
> 
> This is invalid RTX, where both input arguments are in memory.
> 
> The optimized tree dump for foo7 is:
> 
>   :
>   _2 = __builtin_bswap32 (a_1(D));
>   _4 = __builtin_bswap32 (b_3(D));
>   _5 = _4 ^ _2;
>   _6 = __builtin_bswap32 (_5); [tail call]
>   return _6;

Seems to me we want

  (bit_xor (bswap32 @0) (bswap32 @1)) -> (bswap32 (bit_xor @0 @1))

in match-and-simplify speak.

On trunk this transform would go to tree-ssa-forwprop.c as pattern.
It would apply to all bitwise binary ops and all bswap builtins
(all bit/byte-shuffling operations applying the same shuffle to
both operands).

(for bitop in bit_xor bit_ior bit_and
  (for bswap in BUILT_IN_BSWAP16 BUILT_IN_BSWAP32 BUILT_IN_BSWAP64
(simplify
  (bitop (bswap @0) (bswap @1))
  (bswap (bitop @0 @1
  (simplify
(bitop (vec_perm @1 @2 @0) (vec_perm @3 @4 @0))
(vec_perm (bitop @1 @3) (bitop @2 @4) @0)))

not sure if the vector permute one is profitable (but I guess a
permute is always more expensive than a bit operation).

The requested transform of course relies on somebody transforming
bswap (bswap (x)) to x and for vec_perm detecting a cancelling
operation (tree-ssa-forwprop.c can do that already I think).

Mine.  Fixed by the above on match-and-simplify.

> It looks to me that the optimization has to be re-implemented as tree
> optimization (probably by extending fold_builtin_bswap in builtins.c). This
> generic optimization will also benefit targets without bswap RTX pattern,
> e.g. plain i386, as observed in Comment #2.
> 
> I'm recategorizing the PR as a tree-optimization.

[Bug tree-optimization/61559] FAIL: gcc.dg/builtin-bswap-8.c on i686 with -mmovbe

2014-09-03 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

Uroš Bizjak  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2014-09-03
 CC||rguenth at gcc dot gnu.org
  Component|rtl-optimization|tree-optimization
 Ever confirmed|0   |1

--- Comment #5 from Uroš Bizjak  ---
(In reply to Eric Botcazou from comment #4)
> I guess the transformations should accept MEMs instead of just REGs but, no,
> I'm not particularly interested in quirks of CISC architectures, I have
> enough to do with those of RISC architectures.

The problem is that with both function arguments in memory, combine simplifies
sequence of bswaps with memory argument ( == movbe) in foo7 to:

Failed to match this instruction:
(set (reg:SI 84 [ D.2318 ])
(xor:SI (mem/c:SI (plus:SI (reg/f:SI 16 argp)
(const_int 4 [0x4])) [2 b+0 S4 A32])
(mem/c:SI (reg/f:SI 16 argp) [2 a+0 S4 A32])))

This is invalid RTX, where both input arguments are in memory.

The optimized tree dump for foo7 is:

  :
  _2 = __builtin_bswap32 (a_1(D));
  _4 = __builtin_bswap32 (b_3(D));
  _5 = _4 ^ _2;
  _6 = __builtin_bswap32 (_5); [tail call]
  return _6;

It looks to me that the optimization has to be re-implemented as tree
optimization (probably by extending fold_builtin_bswap in builtins.c). This
generic optimization will also benefit targets without bswap RTX pattern, e.g.
plain i386, as observed in Comment #2.

I'm recategorizing the PR as a tree-optimization.