[Bug target/70873] [7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.

2016-05-08 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873

Uroš Bizjak  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #31 from Uroš Bizjak  ---
Fixed.

[Bug target/70873] [7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.

2016-05-05 Thread uros at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873

--- Comment #30 from uros at gcc dot gnu.org ---
Author: uros
Date: Thu May  5 22:48:29 2016
New Revision: 235936

URL: https://gcc.gnu.org/viewcvs?rev=235936=gcc=rev
Log:
PR target/70873
* config/i386/i386-protos.h (ix86_standard_x87sse_constant_load_p):
New prototype.
* config/i386/i386.c (ix86_standard_x87sse_constant_load_p): New.
* config/i386/i386.md (push mem splitter): Use find_constant_src in
the splitter condition.
(FP load splitter): Use ix86_standard_x87sse_constant_load_p in
the splitter condition.
(FP float_extend load splitter): Ditto.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386-protos.h
trunk/gcc/config/i386/i386.c
trunk/gcc/config/i386/i386.md

[Bug target/70873] [7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.

2016-05-05 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873

--- Comment #29 from Uroš Bizjak  ---
Created attachment 38419
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38419=edit
Patch to fix problematic splitters

I'm testing the attached patch that introduces
ix86_standard_x87sse_constant_load_p function, and uses it to limit problematic
splitters only to cases, where we can actually transform the insn to a constant
load.

[Bug target/70873] [7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.

2016-05-04 Thread uros at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873

--- Comment #28 from uros at gcc dot gnu.org ---
Author: uros
Date: Wed May  4 21:13:13 2016
New Revision: 235906

URL: https://gcc.gnu.org/viewcvs?rev=235906=gcc=rev
Log:
PR target/70873
* config/i386/i386.md
(TARGET_SSE_PARTIAL_REG_DEPENDENCY float_extend sf->df peephole2):
Change to post-epilogue_completed late splitter.  Use sse_reg_operand
as operand 0 predicate.
(TARGET_SSE_PARTIAL_REG_DEPENDENCY float_truncate df->sf peephole2):
Ditto.
(TARGET_SSE_PARTIAL_REG_DEPENDENCY float {si,di}->{sf,df} peephole2):
Ditto.  Emit the pattern using RTX.

(TARGET_USE_VECTOR_FP_CONVERTS float_extend sf->df splitter):
Use sse_reg_opreand as operand 0 predicate.  Do not use true_regnum in
the post-reload splitter.  Use lowpart_subreg instead of gen_rtx_REG.
(TARGET_USE_VECTOR_FP_CONVERTS float_truncate df->sf splitter):
Ditto.
(TARGET_USE_VECTOR_CONVERTS float si->{sf,df} splitter): Use
sse_reg_operand as operand 0 predicate.

(TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS float_extend sf->df peephole2):
Use sse_reg_opreand as operand 0 predicate.  Use lowpart_subreg
instead of gen_rtx_REG.
(TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS float_truncate sf->df
peephole2):
Ditto.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.md

[Bug target/70873] [7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.

2016-05-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873

--- Comment #27 from Uroš Bizjak  ---
(In reply to H.J. Lu from comment #26)

> But when this splitter fails, no other splitters will be tried.

Bah. This is clearly an implementation bug in the split pass. I don't think we
have to work around it, the infrastructure bug should be fixed instead. Can you
please open a new bug about it?

I propose to go ahead and commit my patch, which is already accumulating
several semi-related fixes and improvements. The patch will probably expose an
optimization problem in the splitting pass, so we will have a testcase to
illustrate the problem.

[Bug target/70873] [7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.

2016-05-04 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873

--- Comment #26 from H.J. Lu  ---
(In reply to Uroš Bizjak from comment #25)
> (In reply to H.J. Lu from comment #23)
> 
> > We need to move those special SSE SF->DF splitters before
> 
> No, this splitter will fail if the transformation doesn't result in a
> constant. So, we actually want this splitter first, to try to transform a
> memory load to a constant load, and moving others before this one would be
> harmful.
> 
> > (define_split
> >   [(set (match_operand 0 "any_fp_register_operand")
> > (float_extend (match_operand 1 "memory_operand")))]
> >   "reload_completed
> >&& (GET_MODE (operands[0]) == TFmode
> >|| GET_MODE (operands[0]) == XFmode
> >|| GET_MODE (operands[0]) == DFmode)"
> >   [(set (match_dup 0) (match_dup 2))]
> > {
> >   operands[2] = find_constant_src (curr_insn);
> > 
> >   if (operands[2] == NULL_RTX
> >   || (SSE_REGNO_P (REGNO (operands[0]))
> >   && standard_sse_constant_p (operands[2],
> >   GET_MODE (operands[0])) != 1)
> >   || (STACK_REGNO_P (REGNO (operands[0]))
> >&& standard_80387_constant_p (operands[2]) < 1))
> > FAIL;
> > })

But when this splitter fails, no other splitters will be tried.

[Bug target/70873] [7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.

2016-05-04 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873

--- Comment #25 from Uroš Bizjak  ---
(In reply to H.J. Lu from comment #23)

> We need to move those special SSE SF->DF splitters before

No, this splitter will fail if the transformation doesn't result in a constant.
So, we actually want this splitter first, to try to transform a memory load to
a constant load, and moving others before this one would be harmful.

> (define_split
>   [(set (match_operand 0 "any_fp_register_operand")
> (float_extend (match_operand 1 "memory_operand")))]
>   "reload_completed
>&& (GET_MODE (operands[0]) == TFmode
>|| GET_MODE (operands[0]) == XFmode
>|| GET_MODE (operands[0]) == DFmode)"
>   [(set (match_dup 0) (match_dup 2))]
> {
>   operands[2] = find_constant_src (curr_insn);
> 
>   if (operands[2] == NULL_RTX
>   || (SSE_REGNO_P (REGNO (operands[0]))
>   && standard_sse_constant_p (operands[2],
>   GET_MODE (operands[0])) != 1)
>   || (STACK_REGNO_P (REGNO (operands[0]))
>&& standard_80387_constant_p (operands[2]) < 1))
> FAIL;
> })

[Bug target/70873] [7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.

2016-05-04 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873

--- Comment #24 from H.J. Lu  ---
Created attachment 38414
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38414=edit
A patch to move special SSE splitters before general SSE float_extend splitter

[Bug target/70873] [7 Regressio] 20% performance regression at 482.sphinx3 after r235442 with -O2 -m32 on Haswell.

2016-05-04 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70873

--- Comment #23 from H.J. Lu  ---
(In reply to Uroš Bizjak from comment #22)
> Created attachment 38412 [details]
> Proposed patch
> 
> This patch moves all TARGET_SSE_PARTIAL_REG_DEPENDENCY FP conversion
> splitters to a later split pass. Plus, the patch substantially cleans these
> and related patterns.
> 
> The functionality of post-reload conversion splitters goes this way:
> 
> - process FP conversions for TARGET_USE_VECTOR_FP_CONVERTS in an early
> post-reload splitter. This pass will rewrite FP conversions to vector insns
> and is thus incompatible with the next two passes. AMDFAM10 processors
> depend on this transformation.
> 
> - process FP conversions for TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS in a
> peephole2 pass. This will transform mem->reg insns to reg->reg insns, and
> these insn could be processed by the next pass. Some Intel processors depend
> on this transformation.
> 
> - process FP conversions for TARGET_SSE_PARTIAL_REG_DEPENDENCY in a late
> post-reload splitter, when allocated registers are stable. AMD and Intel
> processors depend on this pass, so it is part of generic tuning.

We need to move those special SSE SF->DF splitters before

(define_split
  [(set (match_operand 0 "any_fp_register_operand")
(float_extend (match_operand 1 "memory_operand")))]
  "reload_completed
   && (GET_MODE (operands[0]) == TFmode
   || GET_MODE (operands[0]) == XFmode
   || GET_MODE (operands[0]) == DFmode)"
  [(set (match_dup 0) (match_dup 2))]
{
  operands[2] = find_constant_src (curr_insn);

  if (operands[2] == NULL_RTX
  || (SSE_REGNO_P (REGNO (operands[0]))
  && standard_sse_constant_p (operands[2],
  GET_MODE (operands[0])) != 1)
  || (STACK_REGNO_P (REGNO (operands[0]))
   && standard_80387_constant_p (operands[2]) < 1))
FAIL;
})

Otherwise, they may not be used on memory operand since the general
SSE (In reply to Uroš Bizjak from comment #22)
> Created attachment 38412 [details]
> Proposed patch
> 
> This patch moves all TARGET_SSE_PARTIAL_REG_DEPENDENCY FP conversion
> splitters to a later split pass. Plus, the patch substantially cleans these
> and related patterns.
> 
> The functionality of post-reload conversion splitters goes this way:
> 
> - process FP conversions for TARGET_USE_VECTOR_FP_CONVERTS in an early
> post-reload splitter. This pass will rewrite FP conversions to vector insns
> and is thus incompatible with the next two passes. AMDFAM10 processors
> depend on this transformation.
> 
> - process FP conversions for TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS in a
> peephole2 pass. This will transform mem->reg insns to reg->reg insns, and
> these insn could be processed by the next pass. Some Intel processors depend
> on this transformation.
> 
> - process FP conversions for TARGET_SSE_PARTIAL_REG_DEPENDENCY in a late
> post-reload splitter, when allocated registers are stable. AMD and Intel
> processors depend on this pass, so it is part of generic tuning.

We need to move those special SSE SF->DF splitters before

(define_split
  [(set (match_operand 0 "any_fp_register_operand")
(float_extend (match_operand 1 "memory_operand")))]
  "reload_completed
   && (GET_MODE (operands[0]) == TFmode
   || GET_MODE (operands[0]) == XFmode
   || GET_MODE (operands[0]) == DFmode)"
  [(set (match_dup 0) (match_dup 2))]
{
  operands[2] = find_constant_src (curr_insn);

  if (operands[2] == NULL_RTX
  || (SSE_REGNO_P (REGNO (operands[0]))
  && standard_sse_constant_p (operands[2],
  GET_MODE (operands[0])) != 1)
  || (STACK_REGNO_P (REGNO (operands[0]))
   && standard_80387_constant_p (operands[2]) < 1))
FAIL;
})

Otherwise, they may not be used on memory operand since the general
SSE float_extend splitter on memory operand will be used.