Re: [PATCH AArch64]Handle REG+REG+CONST and REG+NON_REG+CONST in legitimize address

2015-11-23 Thread Bin.Cheng
On Sat, Nov 21, 2015 at 1:39 AM, Richard Earnshaw
<richard.earns...@foss.arm.com> wrote:
> On 20/11/15 08:31, Bin.Cheng wrote:
>> On Thu, Nov 19, 2015 at 10:32 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Tue, Nov 17, 2015 at 6:08 PM, James Greenhalgh
>>> <james.greenha...@arm.com> wrote:
>>>> On Tue, Nov 17, 2015 at 05:21:01PM +0800, Bin Cheng wrote:
>>>>> Hi,
>>>>> GIMPLE IVO needs to call backend interface to calculate costs for addr
>>>>> expressions like below:
>>>>>FORM1: "r73 + r74 + 16380"
>>>>>FORM2: "r73 << 2 + r74 + 16380"
>>>>>
>>>>> They are invalid address expressions on AArch64, so they will be legitimized by
>>>>> aarch64_legitimize_address.  Below are what we got from that function:
>>>>>
>>>>> For FORM1, the address expression is legitimized into below insn sequence
>>>>> and rtx:
>>>>>r84:DI=r73:DI+r74:DI
>>>>>r85:DI=r84:DI+0x3000
>>>>>r83:DI=r85:DI
>>>>>"r83 + 4092"
>>>>>
>>>>> For FORM2, the address expression is legitimized into below insn sequence
>>>>> and rtx:
>>>>>r108:DI=r73:DI<<0x2
>>>>>r109:DI=r108:DI+r74:DI
>>>>>r110:DI=r109:DI+0x3000
>>>>>r107:DI=r110:DI
>>>>>"r107 + 4092"
>>>>>
>>>>> So the costs computed are 12/16 respectively.  The high cost prevents IVO
>>>>> from choosing the right candidates.  Besides cost computation, I also
>>>>> think the legitimization is bad in terms of code generation.
>>>>> The root cause in aarch64_legitimize_address can be described by its
>>>>> comment:
>>>>>/* Try to split X+CONST into Y=X+(CONST & ~mask), Y+(CONST),
>>>>>   where mask is selected by alignment and size of the offset.
>>>>>   We try to pick as large a range for the offset as possible to
>>>>>   maximize the chance of a CSE.  However, for aligned addresses
>>>>>   we limit the range to 4k so that structures with different sized
>>>>>   elements are likely to use the same base.  */
>>>>> I think the split of CONST is intended for REG+CONST where the const
>>>>> offset is not in the range of AArch64's addressing modes.  Unfortunately,
>>>>> it doesn't explicitly handle/reject "REG+REG+CONST" and
>>>>> "REG+REG<<SCALE+CONST" when the CONST is in the range of addressing
>>>>> modes.  As a result, these two cases fall through this logic, resulting
>>>>> in sub-optimal results.
>>>>>
>>>>> It's obvious we can do the legitimization below:
>>>>> FORM1:
>>>>>r83:DI=r73:DI+r74:DI
>>>>>"r83 + 16380"
>>>>> FORM2:
>>>>>r107:DI=0x3ffc
>>>>>r106:DI=r74:DI+r107:DI
>>>>>   REG_EQUAL r74:DI+0x3ffc
>>>>>"r106 + r73 << 2"
>>>>>
>>>>> This patch handles these two cases as described.
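(For illustration only: a made-up C loop of the following shape tends to
produce exactly these REG+REG+CONST and REG+REG<<SCALE+CONST addresses;
the names are hypothetical, and 4095*4 gives the 16380 offset above.)

  int
  sum (char *c, int *a, long n)
  {
    int s = 0;
    for (long i = 0; i < n; i++)
      {
        s += c[i + 16380];   /* FORM1-style: c + i + 16380 */
        s += a[i + 4095];    /* FORM2-style: a + i*4 + 16380 */
      }
    return s;
  }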
>>>>
>>>> Thanks for the description, it made the patch very easy to review. I only
>>>> have a style comment.
>>>>
>>>>> Bootstrap & test on AArch64 along with other patch.  Is it OK?
>>>>>
>>>>> 2015-11-04  Bin Cheng  <bin.ch...@arm.com>
>>>>>   Jiong Wang  <jiong.w...@arm.com>
>>>>>
>>>>>   * config/aarch64/aarch64.c (aarch64_legitimize_address): Handle
>>>>>   address expressions like REG+REG+CONST and REG+NON_REG+CONST.
>>>>
>>>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>>>> index 5c8604f..47875ac 100644
>>>>> --- a/gcc/config/aarch64/aarch64.c
>>>>> +++ b/gcc/config/aarch64/aarch64.c
>>>>> @@ -4710,6 +4710,51 @@ aarch64_legitimize_address (rtx x, rtx /* orig_x  
>>>>> */, machine_mode mode)
>>>>>  {
>>>>>HOST_WIDE_INT offset = INTVAL (XEXP (x, 1));
>>>>>HOST_WIDE_INT base_offset;
>>>>> +  rtx op0 = XEXP (x,0);
>>>>> +
>>>>> +  if (GET_CODE (op0) == PLUS)
>>>>> + {

Re: [PATCH PR52272]Be smart when adding iv candidates

2015-11-20 Thread Bin.Cheng
On Wed, Nov 18, 2015 at 11:50 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
> On 11/10/2015 11:19 AM, Bin.Cheng wrote:
>>
>> On Tue, Nov 10, 2015 at 6:06 PM, Bernd Schmidt <bschm...@redhat.com>
>> wrote:
>>>
>>>
>>> Multi-line expressions should be wrapped in parentheses so that
>>> emacs/indent
>>> can format them automatically. Two sets of parens are needed for this.
>>> Operators should then line up appropriately.
>>
>> Ah, thanks for explaining.  Here is the updated patch, hoping it's correct.
>
> It looks like you're waiting to check it in - Richard's earlier approval
> still holds.
Thanks, applied as r230647.
>
>
> Bernd
>


Re: [PATCH AArch64]Handle REG+REG+CONST and REG+NON_REG+CONST in legitimize address

2015-11-20 Thread Bin.Cheng
On Thu, Nov 19, 2015 at 10:32 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Tue, Nov 17, 2015 at 6:08 PM, James Greenhalgh
> <james.greenha...@arm.com> wrote:
>> On Tue, Nov 17, 2015 at 05:21:01PM +0800, Bin Cheng wrote:
>>> Hi,
>>> GIMPLE IVO needs to call backend interface to calculate costs for addr
>>> expressions like below:
>>>FORM1: "r73 + r74 + 16380"
>>>FORM2: "r73 << 2 + r74 + 16380"
>>>
>>> They are invalid address expressions on AArch64, so they will be legitimized by
>>> aarch64_legitimize_address.  Below are what we got from that function:
>>>
>>> For FORM1, the address expression is legitimized into below insn sequence
>>> and rtx:
>>>r84:DI=r73:DI+r74:DI
>>>r85:DI=r84:DI+0x3000
>>>r83:DI=r85:DI
>>>"r83 + 4092"
>>>
>>> For FORM2, the address expression is legitimized into below insn sequence
>>> and rtx:
>>>r108:DI=r73:DI<<0x2
>>>r109:DI=r108:DI+r74:DI
>>>r110:DI=r109:DI+0x3000
>>>r107:DI=r110:DI
>>>"r107 + 4092"
>>>
>>> So the costs computed are 12/16 respectively.  The high cost prevents IVO
>>> from choosing the right candidates.  Besides cost computation, I also think
>>> the legitimization is bad in terms of code generation.
>>> The root cause in aarch64_legitimize_address can be described by its
>>> comment:
>>>/* Try to split X+CONST into Y=X+(CONST & ~mask), Y+(CONST),
>>>   where mask is selected by alignment and size of the offset.
>>>   We try to pick as large a range for the offset as possible to
>>>   maximize the chance of a CSE.  However, for aligned addresses
>>>   we limit the range to 4k so that structures with different sized
>>>   elements are likely to use the same base.  */
>>> I think the split of CONST is intended for REG+CONST where the const offset
>>> is not in the range of AArch64's addressing modes.  Unfortunately, it
>>> doesn't explicitly handle/reject "REG+REG+CONST" and "REG+REG<<SCALE+CONST"
>>> when the CONST is in the range of addressing modes.  As a result, these two
>>> cases fall through this logic, resulting in sub-optimal results.
>>>
>>> It's obvious we can do the legitimization below:
>>> FORM1:
>>>r83:DI=r73:DI+r74:DI
>>>"r83 + 16380"
>>> FORM2:
>>>r107:DI=0x3ffc
>>>r106:DI=r74:DI+r107:DI
>>>   REG_EQUAL r74:DI+0x3ffc
>>>"r106 + r73 << 2"
>>>
>>> This patch handles these two cases as described.
>>
>> Thanks for the description, it made the patch very easy to review. I only
>> have a style comment.
>>
>>> Bootstrap & test on AArch64 along with other patch.  Is it OK?
>>>
>>> 2015-11-04  Bin Cheng  <bin.ch...@arm.com>
>>>   Jiong Wang  <jiong.w...@arm.com>
>>>
>>>   * config/aarch64/aarch64.c (aarch64_legitimize_address): Handle
>>>   address expressions like REG+REG+CONST and REG+NON_REG+CONST.
>>
>>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>>> index 5c8604f..47875ac 100644
>>> --- a/gcc/config/aarch64/aarch64.c
>>> +++ b/gcc/config/aarch64/aarch64.c
>>> @@ -4710,6 +4710,51 @@ aarch64_legitimize_address (rtx x, rtx /* orig_x  
>>> */, machine_mode mode)
>>>  {
>>>HOST_WIDE_INT offset = INTVAL (XEXP (x, 1));
>>>HOST_WIDE_INT base_offset;
>>> +  rtx op0 = XEXP (x,0);
>>> +
>>> +  if (GET_CODE (op0) == PLUS)
>>> + {
>>> +   rtx op0_ = XEXP (op0, 0);
>>> +   rtx op1_ = XEXP (op0, 1);
>>
>> I don't see this trailing _ on a variable name in many places in the source
>> tree (mostly in the Go frontend), and certainly not in the aarch64 backend.
>> Can we pick a different name for op0_ and op1_?
>>
>>> +
>>> +   /* RTX pattern in the form of (PLUS (PLUS REG, REG), CONST) will
>>> +  reach here, the 'CONST' may be valid in which case we should
>>> +  not split.  */
>>> +   if (REG_P (op0_) && REG_P (op1_))
>>> + {
>>> +   machine_mode addr_mode = GET_MODE (op0);
>>> +   rtx addr = gen_reg_rtx (addr_mode);
>>> +
>>> +   rtx ret = plus_constant (addr_mode, addr, offset);
>

Re: [PATCH AArch64]Handle REG+REG+CONST and REG+NON_REG+CONST in legitimize address

2015-11-18 Thread Bin.Cheng
On Tue, Nov 17, 2015 at 6:08 PM, James Greenhalgh
<james.greenha...@arm.com> wrote:
> On Tue, Nov 17, 2015 at 05:21:01PM +0800, Bin Cheng wrote:
>> Hi,
>> GIMPLE IVO needs to call backend interface to calculate costs for addr
>> expressions like below:
>>FORM1: "r73 + r74 + 16380"
>>FORM2: "r73 << 2 + r74 + 16380"
>>
>> They are invalid address expressions on AArch64, so they will be legitimized by
>> aarch64_legitimize_address.  Below are what we got from that function:
>>
>> For FORM1, the address expression is legitimized into below insn sequence
>> and rtx:
>>r84:DI=r73:DI+r74:DI
>>r85:DI=r84:DI+0x3000
>>r83:DI=r85:DI
>>"r83 + 4092"
>>
>> For FORM2, the address expression is legitimized into below insn sequence
>> and rtx:
>>r108:DI=r73:DI<<0x2
>>r109:DI=r108:DI+r74:DI
>>r110:DI=r109:DI+0x3000
>>r107:DI=r110:DI
>>"r107 + 4092"
>>
>> So the costs computed are 12/16 respectively.  The high cost prevents IVO
>> from choosing the right candidates.  Besides cost computation, I also think
>> the legitimization is bad in terms of code generation.
>> The root cause in aarch64_legitimize_address can be described by its
>> comment:
>>/* Try to split X+CONST into Y=X+(CONST & ~mask), Y+(CONST),
>>   where mask is selected by alignment and size of the offset.
>>   We try to pick as large a range for the offset as possible to
>>   maximize the chance of a CSE.  However, for aligned addresses
>>   we limit the range to 4k so that structures with different sized
>>   elements are likely to use the same base.  */
>> I think the split of CONST is intended for REG+CONST where the const offset
>> is not in the range of AArch64's addressing modes.  Unfortunately, it
>> doesn't explicitly handle/reject "REG+REG+CONST" and "REG+REG<<SCALE+CONST"
>> when the CONST is in the range of addressing modes.  As a result, these two
>> cases fall through this logic, resulting in sub-optimal results.
>>
>> It's obvious we can do the legitimization below:
>> FORM1:
>>r83:DI=r73:DI+r74:DI
>>"r83 + 16380"
>> FORM2:
>>r107:DI=0x3ffc
>>r106:DI=r74:DI+r107:DI
>>   REG_EQUAL r74:DI+0x3ffc
>>"r106 + r73 << 2"
>>
>> This patch handles these two cases as described.
>
> Thanks for the description, it made the patch very easy to review. I only
> have a style comment.
>
>> Bootstrap & test on AArch64 along with other patch.  Is it OK?
>>
>> 2015-11-04  Bin Cheng  <bin.ch...@arm.com>
>>   Jiong Wang  <jiong.w...@arm.com>
>>
>>   * config/aarch64/aarch64.c (aarch64_legitimize_address): Handle
>>   address expressions like REG+REG+CONST and REG+NON_REG+CONST.
>
>> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
>> index 5c8604f..47875ac 100644
>> --- a/gcc/config/aarch64/aarch64.c
>> +++ b/gcc/config/aarch64/aarch64.c
>> @@ -4710,6 +4710,51 @@ aarch64_legitimize_address (rtx x, rtx /* orig_x  */, 
>> machine_mode mode)
>>  {
>>HOST_WIDE_INT offset = INTVAL (XEXP (x, 1));
>>HOST_WIDE_INT base_offset;
>> +  rtx op0 = XEXP (x,0);
>> +
>> +  if (GET_CODE (op0) == PLUS)
>> + {
>> +   rtx op0_ = XEXP (op0, 0);
>> +   rtx op1_ = XEXP (op0, 1);
>
> I don't see this trailing _ on a variable name in many places in the source
> tree (mostly in the Go frontend), and certainly not in the aarch64 backend.
> Can we pick a different name for op0_ and op1_?
>
>> +
>> +   /* RTX pattern in the form of (PLUS (PLUS REG, REG), CONST) will
>> +  reach here, the 'CONST' may be valid in which case we should
>> +  not split.  */
>> +   if (REG_P (op0_) && REG_P (op1_))
>> + {
>> +   machine_mode addr_mode = GET_MODE (op0);
>> +   rtx addr = gen_reg_rtx (addr_mode);
>> +
>> +   rtx ret = plus_constant (addr_mode, addr, offset);
>> +   if (aarch64_legitimate_address_hook_p (mode, ret, false))
>> + {
>> +   emit_insn (gen_adddi3 (addr, op0_, op1_));
>> +   return ret;
>> + }
>> + }
>> +   /* RTX pattern in the form of (PLUS (PLUS REG, NON_REG), CONST)
>> +  will reach here.  If (PLUS REG, NON_REG) is valid addr expr,
>> +  we split it into Y=REG+CONST, Y+NON_REG.  */
>> +   else if (REG_P (op0_) || REG_P (op1_))
>> + {
>> +   machine_mode addr_mode = GET_MODE (op0);
>> +   rtx addr = gen_reg_rtx (addr_mode);
>> +
>> +   /* Switch to make sure that register is in op0_.  */
>> +   if (REG_P (op1_))
>> + std::swap (op0_, op1_);
>> +
>> +   rtx ret = gen_rtx_fmt_ee (PLUS, addr_mode, addr, op1_);
>> +   if (aarch64_legitimate_address_hook_p (mode, ret, false))
>> + {
>> +   addr = force_operand (plus_constant (addr_mode,
>> +op0_, offset),
>> + NULL_RTX);
>> +   ret = gen_rtx_fmt_ee (PLUS, 

Re: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.

2015-11-16 Thread Bin.Cheng
On Tue, Nov 17, 2015 at 1:56 AM, Ajit Kumar Agarwal
<ajit.kumar.agar...@xilinx.com> wrote:
>
> Sorry, I missed out some of the points in the earlier mail; they are given below.
>
> -Original Message-
> From: Ajit Kumar Agarwal
> Sent: Monday, November 16, 2015 11:07 PM
> To: 'Jeff Law'; GCC Patches
> Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: RE: [RFC, Patch]: Optimized changes in the register used inside loop 
> for LICM and IVOPTS.
>
>
>
> -Original Message-
> From: Jeff Law [mailto:l...@redhat.com]
> Sent: Friday, November 13, 2015 11:44 AM
> To: Ajit Kumar Agarwal; GCC Patches
> Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: [RFC, Patch]: Optimized changes in the register used inside loop 
> for LICM and IVOPTS.
>
> On 10/07/2015 10:32 PM, Ajit Kumar Agarwal wrote:
>
>>
>> 0001-RFC-Patch-Optimized-changes-in-the-register-used-ins.patch
>>
>>
>>  From f164fd80953f3cffd96a492c8424c83290cd43cc Mon Sep 17 00:00:00
>> 2001
>> From: Ajit Kumar Agarwal
>> Date: Wed, 7 Oct 2015 20:50:40 +0200
>> Subject: [PATCH] [RFC, Patch]: Optimized changes in the register used inside
>>   loop for LICM and IVOPTS.
>>
>> Changes are made in the Loop Invariant pass (LICM) at RTL level and also
>> in the Induction variable optimization based on the SSA representation.
>> The current logic used in LICM for registers used inside the loops is
>> changed.  The Live Out of the loop latch node and the Live In of the
>> destination of the exit edges are used to set the loop's liveness at the
>> exit of the loop.  The registers-used count is the number of live
>> variables at the exit of the loop, calculated as above.
>>
>> For Induction variable optimization on the tree SSA representation, the
>> registers-used logic is based on the number of phi nodes at the loop
>> header to represent the liveness of the loop.  The current logic used only
>> the number of phi nodes at the loop header.  Changes are made to treat
>> the phi operands as also live in the loop, so the number of phi operands
>> is also added to the number of registers used.
>>
>> ChangeLog:
>> 2015-10-09  Ajit Agarwal
>>
>>   * loop-invariant.c (compute_loop_liveness): New.
>>   (determine_regs_used): New.
>>   (find_invariants_to_move): Use of determine_regs_used.
>>   * tree-ssa-loop-ivopts.c (determine_set_costs): Consider the phi
>>   arguments for register used.
>>>I think Bin rejected the tree-ssa-loop-ivopts change.  However, the 
>>>loop-invariant change is still pending, right?
>
>
>>
>> Signed-off-by: Ajit Agarwal <ajit...@xilinx.com>
>> ---
>>   gcc/loop-invariant.c   | 72 
>> +-
>>   gcc/tree-ssa-loop-ivopts.c |  4 +--
>>   2 files changed, 60 insertions(+), 16 deletions(-)
>>
>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c index
>> 52c8ae8..e4291c9 100644
>> --- a/gcc/loop-invariant.c
>> +++ b/gcc/loop-invariant.c
>> @@ -1413,6 +1413,19 @@ set_move_mark (unsigned invno, int gain)
>>   }
>>   }
>>
>> +static int
>> +determine_regs_used()
>> +{
>> +  unsigned int j;
>> +  unsigned int reg_used = 2;
>> +  bitmap_iterator bi;
>> +
>> +  EXECUTE_IF_SET_IN_BITMAP (&LOOP_DATA (curr_loop)->regs_live, 0, j, bi)
>> +(reg_used) ++;
>> +
>> +  return reg_used;
>> +}
>>>Isn't this just bitmap_count_bits (regs_live) + 2?
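(For reference, Jeff's suggestion amounts to roughly the following sketch,
assuming the patch's LOOP_DATA macro and curr_loop; bitmap_count_bits is
the existing helper from bitmap.h:)

  static int
  determine_regs_used (void)
  {
    /* Count the registers recorded live for the loop; the + 2 keeps
       the patch's starting value of reg_used.  */
    return bitmap_count_bits (&LOOP_DATA (curr_loop)->regs_live) + 2;
  }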
>
>
>> @@ -2055,9 +2057,43 @@ calculate_loop_reg_pressure (void)
>>   }
>>   }
>>
>> -
>> +static void
>> +calculate_loop_liveness (void)
>>>Needs a function comment.
>
> I will incorporate the above comments.
>> +{
>> +  basic_block bb;
>> +  struct loop *loop;
>>
>> -/* Move the invariants out of the loops.  */
>> +  FOR_EACH_LOOP (loop, 0)
>> +if (loop->aux == NULL)
>> +  {
>> +loop->aux = xcalloc (1, sizeof (struct loop_data));
>> +bitmap_initialize (&LOOP_DATA (loop)->regs_live, _obstack);
>> + }
>> +
>> +  FOR_EACH_BB_FN (bb, cfun)
>>>Why loop over blocks here?  Why not just iterate through all the loops
>>>in the loop structure.  Order isn't particularly important AFAICT for
>>>this code.
>
> Iterating over the loop structure is enough.  We don't need to iterate over
> the basic blocks.
>
>> +   {
>> + int  i;
>> + edge e;
>> + vec<edge> edges;
>> + edges = get_loop_exit_edges (loop);
>> + FOR_EACH_VEC_ELT (edges, i, e)
>> + {
>> +   bitmap_ior_into (&LOOP_DATA (loop)->regs_live, 
>> DF_LR_OUT(e->src));
>> +   bitmap_ior_into (&LOOP_DATA (loop)->regs_live,
>> + DF_LR_IN(e->dest));
>>>Space before the open-paren in the previous two lines DF_LR_OUT
>>>(e->src) and DF_LR_IN (e->dest).
>
> I will incorporate this.
>
>> + }
>> +  }
>> +  }
>> +}
>> +
>> +/* Move the invariants  ut of the loops.  */
>>>Looks like you introduced a typo.
>
>>>I'd like to see testcases which show the change in # regs used
>>>computation helping generate better code.
>
> We need to measure the 

Re: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.

2015-11-12 Thread Bin.Cheng
On Fri, Nov 13, 2015 at 2:13 PM, Jeff Law <l...@redhat.com> wrote:
> On 10/07/2015 10:32 PM, Ajit Kumar Agarwal wrote:
>
>>
>> 0001-RFC-Patch-Optimized-changes-in-the-register-used-ins.patch
>>
>>
>>  From f164fd80953f3cffd96a492c8424c83290cd43cc Mon Sep 17 00:00:00 2001
>> From: Ajit Kumar Agarwal
>> Date: Wed, 7 Oct 2015 20:50:40 +0200
>> Subject: [PATCH] [RFC, Patch]: Optimized changes in the register used
>> inside
>>   loop for LICM and IVOPTS.
>>
>> Changes are made in the Loop Invariant pass (LICM) at RTL level and also
>> in the Induction variable optimization based on the SSA representation.
>> The current logic used in LICM for registers used inside the loops is
>> changed.  The Live Out of the loop latch node and the Live In of the
>> destination of the exit edges are used to set the loop's liveness at the
>> exit of the loop.  The registers-used count is the number of live
>> variables at the exit of the loop, calculated as above.
>>
>> For Induction variable optimization on the tree SSA representation, the
>> registers-used logic is based on the number of phi nodes at the loop
>> header to represent the liveness of the loop.  The current logic used
>> only the number of phi nodes at the loop header.  Changes are made to
>> treat the phi operands as also live in the loop, so the number of phi
>> operands is also added to the number of registers used.
>>
>> ChangeLog:
>> 2015-10-09  Ajit Agarwal
>>
>> * loop-invariant.c (compute_loop_liveness): New.
>> (determine_regs_used): New.
>> (find_invariants_to_move): Use of determine_regs_used.
>> * tree-ssa-loop-ivopts.c (determine_set_costs): Consider the phi
>> arguments for register used.
>
> I think Bin rejected the tree-ssa-loop-ivopts change.  However, the
> loop-invariant change is still pending, right?
Ah, reject is a strong word, I am just being dumb and don't understand
why it's a general better estimation yet.
Maybe Richard have some inputs here?

Thanks,
bin
>
>
>>
>> Signed-off-by: Ajit Agarwal <ajit...@xilinx.com>
>> ---
>>   gcc/loop-invariant.c   | 72
>> +-
>>   gcc/tree-ssa-loop-ivopts.c |  4 +--
>>   2 files changed, 60 insertions(+), 16 deletions(-)
>>
>> diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
>> index 52c8ae8..e4291c9 100644
>> --- a/gcc/loop-invariant.c
>> +++ b/gcc/loop-invariant.c
>> @@ -1413,6 +1413,19 @@ set_move_mark (unsigned invno, int gain)
>>   }
>>   }
>>
>> +static int
>> +determine_regs_used()
>> +{
>> +  unsigned int j;
>> +  unsigned int reg_used = 2;
>> +  bitmap_iterator bi;
>> +
>> +  EXECUTE_IF_SET_IN_BITMAP (&LOOP_DATA (curr_loop)->regs_live, 0, j, bi)
>> +(reg_used) ++;
>> +
>> +  return reg_used;
>> +}
>
> Isn't this just bitmap_count_bits (regs_live) + 2?
>
>
>> @@ -2055,9 +2057,43 @@ calculate_loop_reg_pressure (void)
>>   }
>>   }
>>
>> -
>> +static void
>> +calculate_loop_liveness (void)
>
> Needs a function comment.
>
>
>> +{
>> +  basic_block bb;
>> +  struct loop *loop;
>>
>> -/* Move the invariants out of the loops.  */
>> +  FOR_EACH_LOOP (loop, 0)
>> +if (loop->aux == NULL)
>> +  {
>> +loop->aux = xcalloc (1, sizeof (struct loop_data));
>> +bitmap_initialize (&LOOP_DATA (loop)->regs_live, _obstack);
>> + }
>> +
>> +  FOR_EACH_BB_FN (bb, cfun)
>
> Why loop over blocks here?  Why not just iterate through all the loops in
> the loop structure.  Order isn't particularly important AFAICT for this
> code.
>
>
>
>> +   {
>> + int  i;
>> + edge e;
>> + vec<edge> edges;
>> + edges = get_loop_exit_edges (loop);
>> + FOR_EACH_VEC_ELT (edges, i, e)
>> + {
>> +   bitmap_ior_into (&LOOP_DATA (loop)->regs_live,
>> DF_LR_OUT(e->src));
>> +   bitmap_ior_into (&LOOP_DATA (loop)->regs_live,
>> DF_LR_IN(e->dest));
>
> Space before the open-paren in the previous two lines
> DF_LR_OUT (e->src) and DF_LR_IN (e->dest).
>
>
>> + }
>> +  }
>> +  }
>> +}
>> +
>> +/* Move the invariants  ut of the loops.  */
>
> Looks like you introduced a typo.
>
> I'd like to see testcases which show the change in # regs used computation
> helping generate better code.
>
> And I'd also like to see some background information on why you think this
> is a more accurate measure for the number of registers used in the loop.
> regs_used AFAICT is supposed to be an estimate of the registers live around
> the loop.  So ISTM that you get that value from the live-out set on the
> backedge of the loop.  I guess you get something similar by looking at the
> exit edge's source block's live-out set.  But I don't see any value in
> including stuff live at the block outside the loop.
>
> It also seems fairly non-intuitive.  Get the block's latch and use its
> live-out set.  That seems more intuitive.
>
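(A minimal sketch of the latch-based variant Jeff describes, assuming the
patch's LOOP_DATA setup is already in place; loop->latch and DF_LR_OUT are
existing interfaces:)

  static void
  calculate_loop_liveness (void)
  {
    struct loop *loop;

    FOR_EACH_LOOP (loop, 0)
      /* Registers live out of the latch are live around the loop's
         backedge, which is the estimate suggested above.  */
      bitmap_ior_into (&LOOP_DATA (loop)->regs_live,
                       DF_LR_OUT (loop->latch));
  }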


Re: [PATCH PR52272]Be smart when adding iv candidates

2015-11-10 Thread Bin.Cheng
On Tue, Nov 10, 2015 at 9:26 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Mon, Nov 9, 2015 at 11:24 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
>> On 11/08/2015 10:11 AM, Richard Biener wrote:
>>>
>>> On November 8, 2015 3:58:57 AM GMT+01:00, "Bin.Cheng"
>>> <amker.ch...@gmail.com> wrote:
>>>>>
>>>>> +inline bool
>>>>> +iv_common_cand_hasher::equal (const iv_common_cand *ccand1,
>>>>> +  const iv_common_cand *ccand2)
>>>>> +{
>>>>> +  return ccand1->hash == ccand2->hash
>>>>> +&& operand_equal_p (ccand1->base, ccand2->base, 0)
>>>>> +&& operand_equal_p (ccand1->step, ccand2->step, 0)
>>>>> +&& TYPE_PRECISION (TREE_TYPE (ccand1->base))
>>>>> + == TYPE_PRECISION (TREE_TYPE (ccand2->base));
>>>>>
>>> Yes.  Patch is OK then.
>>
>>
>> Doesn't follow the formatting rules though in the quoted piece.
>
> Hi Bernd,
> Thanks for reviewing.  I haven't committed it yet, could you please
> point out which quoted piece it is so that I can update the patch?
Ah, the part quoted in the review message; I was stupid and tried to
find the quoted part in my patch...  I can see the problem now; here is
the updated patch.

Thanks,
bin
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1f952a7..aecba12 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -247,6 +247,45 @@ struct iv_cand
   smaller type.  */
 };
 
+/* Hashtable entry for common candidate derived from iv uses.  */
+struct iv_common_cand
+{
+  tree base;
+  tree step;
+  /* IV uses from which this common candidate is derived.  */
+  vec<iv_use *> uses;
+  hashval_t hash;
+};
+
+/* Hashtable helpers.  */
+
+struct iv_common_cand_hasher : free_ptr_hash <iv_common_cand>
+{
+  static inline hashval_t hash (const iv_common_cand *);
+  static inline bool equal (const iv_common_cand *, const iv_common_cand *);
+};
+
+/* Hash function for possible common candidates.  */
+
+inline hashval_t
+iv_common_cand_hasher::hash (const iv_common_cand *ccand)
+{
+  return ccand->hash;
+}
+
+/* Hash table equality function for common candidates.  */
+
+inline bool
+iv_common_cand_hasher::equal (const iv_common_cand *ccand1,
+ const iv_common_cand *ccand2)
+{
+  return ccand1->hash == ccand2->hash
+&& operand_equal_p (ccand1->base, ccand2->base, 0)
+&& operand_equal_p (ccand1->step, ccand2->step, 0)
+&& TYPE_PRECISION (TREE_TYPE (ccand1->base))
+ == TYPE_PRECISION (TREE_TYPE (ccand2->base));
+}
+
 /* Loop invariant expression hashtable entry.  */
 struct iv_inv_expr_ent
 {
@@ -255,8 +294,6 @@ struct iv_inv_expr_ent
   hashval_t hash;
 };
 
-/* The data used by the induction variable optimizations.  */
-
 /* Hashtable helpers.  */
 
 struct iv_inv_expr_hasher : free_ptr_hash <iv_inv_expr_ent>
@@ -323,6 +360,12 @@ struct ivopts_data
   /* Cache used by tree_to_aff_combination_expand.  */
   hash_map<tree, name_expansion *> *name_expansion_cache;
 
+  /* The hashtable of common candidates derived from iv uses.  */
+  hash_table<iv_common_cand_hasher> *iv_common_cand_tab;
+
+  /* The common candidates.  */
+  vec<iv_common_cand *> iv_common_cands;
+
   /* The maximum invariant id.  */
   unsigned max_inv_id;
 
@@ -894,6 +937,8 @@ tree_ssa_iv_optimize_init (struct ivopts_data *data)
   data->inv_expr_tab = new hash_table<iv_inv_expr_hasher> (10);
   data->inv_expr_id = 0;
   data->name_expansion_cache = NULL;
+  data->iv_common_cand_tab = new hash_table<iv_common_cand_hasher> (10);
+  data->iv_common_cands.create (20);
   decl_rtl_to_reset.create (20);
   gcc_obstack_init (&data->iv_obstack);
 }
@@ -3051,6 +3096,96 @@ add_iv_candidate_for_bivs (struct ivopts_data *data)
 }
 }
 
+/* Record common candidate {BASE, STEP} derived from USE in hashtable.  */
+
+static void
+record_common_cand (struct ivopts_data *data, tree base,
+   tree step, struct iv_use *use)
+{
+  struct iv_common_cand ent;
+  struct iv_common_cand **slot;
+
+  gcc_assert (use != NULL);
+
+  ent.base = base;
+  ent.step = step;
+  ent.hash = iterative_hash_expr (base, 0);
+  ent.hash = iterative_hash_expr (step, ent.hash);
+
+  slot = data->iv_common_cand_tab->find_slot (&ent, INSERT);
+  if (*slot == NULL)
+{
+  *slot = XNEW (struct iv_common_cand);
+  (*slot)->base = base;
+  (*slot)->step = step;
+  (*slot)->uses.create (8);
+  (*slot)->hash = ent.hash;
+  data->iv_common_cands.safe_push ((*slot));
+}
+  (*slot)->uses.safe_push (use);
+  return;
+}
+
+/* Comparison function used to sort common candidates.  */
+
+static int
+common_cand_cmp (const void *p1, const void *p2)
+{
+  unsigned n1, n2;
+  const struct i

Re: [PATCH PR52272]Be smart when adding iv candidates

2015-11-10 Thread Bin.Cheng
On Tue, Nov 10, 2015 at 6:06 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
> On 11/10/2015 09:25 AM, Bin.Cheng wrote:
>>>
>>> Thanks for reviewing.  I haven't committed it yet, could you please
>>> point out which quoted piece it is so that I can update the patch?
>
>
> Sorry, I thought it was pretty obvious...
>
>> +{
>> +  return ccand1->hash == ccand2->hash
>> +&& operand_equal_p (ccand1->base, ccand2->base, 0)
>> +&& operand_equal_p (ccand1->step, ccand2->step, 0)
>> +&& TYPE_PRECISION (TREE_TYPE (ccand1->base))
>> + == TYPE_PRECISION (TREE_TYPE (ccand2->base));
>> +}
>> +
>
>
> Multi-line expressions should be wrapped in parentheses so that emacs/indent
> can format them automatically. Two sets of parens are needed for this.
> Operators should then line up appropriately.
Ah, thanks for explaining.  Here is the updated patch, hoping it's correct.

Thanks,
bin
>
>
> Bernd
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 1f952a7..a00e33c 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -247,6 +247,45 @@ struct iv_cand
   smaller type.  */
 };
 
+/* Hashtable entry for common candidate derived from iv uses.  */
+struct iv_common_cand
+{
+  tree base;
+  tree step;
+  /* IV uses from which this common candidate is derived.  */
+  vec<iv_use *> uses;
+  hashval_t hash;
+};
+
+/* Hashtable helpers.  */
+
+struct iv_common_cand_hasher : free_ptr_hash <iv_common_cand>
+{
+  static inline hashval_t hash (const iv_common_cand *);
+  static inline bool equal (const iv_common_cand *, const iv_common_cand *);
+};
+
+/* Hash function for possible common candidates.  */
+
+inline hashval_t
+iv_common_cand_hasher::hash (const iv_common_cand *ccand)
+{
+  return ccand->hash;
+}
+
+/* Hash table equality function for common candidates.  */
+
+inline bool
+iv_common_cand_hasher::equal (const iv_common_cand *ccand1,
+ const iv_common_cand *ccand2)
+{
+  return (ccand1->hash == ccand2->hash
+ && operand_equal_p (ccand1->base, ccand2->base, 0)
+ && operand_equal_p (ccand1->step, ccand2->step, 0)
+ && (TYPE_PRECISION (TREE_TYPE (ccand1->base))
+ == TYPE_PRECISION (TREE_TYPE (ccand2->base))));
+}
+
 /* Loop invariant expression hashtable entry.  */
 struct iv_inv_expr_ent
 {
@@ -255,8 +294,6 @@ struct iv_inv_expr_ent
   hashval_t hash;
 };
 
-/* The data used by the induction variable optimizations.  */
-
 /* Hashtable helpers.  */
 
 struct iv_inv_expr_hasher : free_ptr_hash <iv_inv_expr_ent>
@@ -323,6 +360,12 @@ struct ivopts_data
   /* Cache used by tree_to_aff_combination_expand.  */
   hash_map<tree, name_expansion *> *name_expansion_cache;
 
+  /* The hashtable of common candidates derived from iv uses.  */
+  hash_table<iv_common_cand_hasher> *iv_common_cand_tab;
+
+  /* The common candidates.  */
+  vec<iv_common_cand *> iv_common_cands;
+
   /* The maximum invariant id.  */
   unsigned max_inv_id;
 
@@ -894,6 +937,8 @@ tree_ssa_iv_optimize_init (struct ivopts_data *data)
   data->inv_expr_tab = new hash_table<iv_inv_expr_hasher> (10);
   data->inv_expr_id = 0;
   data->name_expansion_cache = NULL;
+  data->iv_common_cand_tab = new hash_table<iv_common_cand_hasher> (10);
+  data->iv_common_cands.create (20);
   decl_rtl_to_reset.create (20);
   gcc_obstack_init (&data->iv_obstack);
 }
@@ -3051,6 +3096,96 @@ add_iv_candidate_for_bivs (struct ivopts_data *data)
 }
 }
 
+/* Record common candidate {BASE, STEP} derived from USE in hashtable.  */
+
+static void
+record_common_cand (struct ivopts_data *data, tree base,
+   tree step, struct iv_use *use)
+{
+  struct iv_common_cand ent;
+  struct iv_common_cand **slot;
+
+  gcc_assert (use != NULL);
+
+  ent.base = base;
+  ent.step = step;
+  ent.hash = iterative_hash_expr (base, 0);
+  ent.hash = iterative_hash_expr (step, ent.hash);
+
+  slot = data->iv_common_cand_tab->find_slot (&ent, INSERT);
+  if (*slot == NULL)
+{
+  *slot = XNEW (struct iv_common_cand);
+  (*slot)->base = base;
+  (*slot)->step = step;
+  (*slot)->uses.create (8);
+  (*slot)->hash = ent.hash;
+  data->iv_common_cands.safe_push ((*slot));
+}
+  (*slot)->uses.safe_push (use);
+  return;
+}
+
+/* Comparison function used to sort common candidates.  */
+
+static int
+common_cand_cmp (const void *p1, const void *p2)
+{
+  unsigned n1, n2;
+  const struct iv_common_cand *const *const ccand1
+= (const struct iv_common_cand *const *)p1;
+  const struct iv_common_cand *const *const ccand2
+= (const struct iv_common_cand *const *)p2;
+
+  n1 = (*ccand1)->uses.length ();
+  n2 = (*ccand2)->uses.length ();
+  return n2 - n1;
+}
+
+/* Adds IV candidates based on the common candidates recorded.  */
+
+static void
+add_iv_candidate_derived_from_uses (stru

Re: [PATCH PR52272]Be smart when adding iv candidates

2015-11-09 Thread Bin.Cheng
On Mon, Nov 9, 2015 at 11:24 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
> On 11/08/2015 10:11 AM, Richard Biener wrote:
>>
>> On November 8, 2015 3:58:57 AM GMT+01:00, "Bin.Cheng"
>> <amker.ch...@gmail.com> wrote:
>>>>
>>>> +inline bool
>>>> +iv_common_cand_hasher::equal (const iv_common_cand *ccand1,
>>>> +  const iv_common_cand *ccand2)
>>>> +{
>>>> +  return ccand1->hash == ccand2->hash
>>>> +&& operand_equal_p (ccand1->base, ccand2->base, 0)
>>>> +&& operand_equal_p (ccand1->step, ccand2->step, 0)
>>>> +&& TYPE_PRECISION (TREE_TYPE (ccand1->base))
>>>> + == TYPE_PRECISION (TREE_TYPE (ccand2->base));
>>>>
>> Yes.  Patch is OK then.
>
>
> Doesn't follow the formatting rules though in the quoted piece.

Hi Bernd,
Thanks for reviewing.  I haven't committed it yet, could you please
point out which quoted piece it is so that I can update the patch?

Thanks,
bin
>
>
> Bernd
>


Re: [PATCH PR52272]Be smart when adding iv candidates

2015-11-07 Thread Bin.Cheng
On Fri, Nov 6, 2015 at 9:24 PM, Richard Biener
 wrote:
> On Wed, Nov 4, 2015 at 11:18 AM, Bin Cheng  wrote:
>> Hi,
>> PR52272 reported a performance regression in spec2006/410.bwaves once GCC is
>> prevented from representing address of one memory object using address of
>> another memory object.  Also as I commented in that PR, we have two possible
>> fixes for this:
>> 1) Improve how TMR.base is deduced, so that we can represent addr of mem obj
>> using another one, while not breaking PR50955.
>> 2) Add iv candidates with base object stripped.  In this way, we use the
>> common base-stripped part to represent all address expressions, in the form
>> of [base_1 + common], [base_2 + common], ..., [base_n + common].
>>
>> In terms of code generation, method 2) is at least as good as 1), actually
>> better in my opinion.  The problem of 2) is we need to tell when iv
>> candidates should be added for the common part and when shouldn't.  This
>> issue can be generalized and described as: We know IVO tries to add
>> candidates by deriving from iv uses.  One disadvantage is that candidates
>> are derived from iv use independently.  It doesn't take common sub
>> expression among different iv uses into consideration.  As a result,
>> candidate for common sub expression is not added, while many useless
>> candidates are added.
>>
>> As a matter of fact, candidate derived from iv use is useful only if it's
>> common enough and could be shared among different uses.  A candidate is most
>> likely useless if it's derived from a single use and could not be shared by
>> others.  This patch works in this way by first recording all kinds of
>> candidates derived from iv uses, then adding candidates for the common ones.
>>
>> The patch improves 410.bwaves by 3-4% on x86_64.  I also saw regression for
>> 400.perlbench and small regression for 401.bzip on x86_64, but I can confirm
>> they are false alarms caused by align issues.
>> For aarch64, fp cases are obviously improved for both spec2000 and spec2006.
>> Also the patch causes 2-3% regression for 459.GemsFDTD, which I think is
>> another irrelevant issue caused by heuristic candidate selecting algorithm.
>> Unfortunately, I don't have fix to it currently.
>>
>> This patch may add more candidates in some cases, but generally the number
>> of candidates is smaller because we don't need to add useless candidates now.
>> Statistic data shows there are quite fewer loops with more than 30
>> candidates when building spec2k6 on x86_64 using this patch.
>>
>> Bootstrap and test on x86_64.  I will re-test it against latest trunk on
>> AArch64.  Is it OK?
>
> +inline bool
> +iv_common_cand_hasher::equal (const iv_common_cand *ccand1,
> +  const iv_common_cand *ccand2)
> +{
> +  return ccand1->hash == ccand2->hash
> +&& operand_equal_p (ccand1->base, ccand2->base, 0)
> +&& operand_equal_p (ccand1->step, ccand2->step, 0)
> +&& TYPE_PRECISION (TREE_TYPE (ccand1->base))
> + == TYPE_PRECISION (TREE_TYPE (ccand2->base));
>
Hi Richard,
Thanks for reviewing.

> I'm wondering on the TYPE_PRECISION check.  a) why is that needed?
Because operand_equal_p doesn't check type precision for constant int
nodes, and IVO needs to take precision into consideration.
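(A hypothetical illustration of the point; the constants are made up:)

  /* Two INTEGER_CSTs with equal value but different precision.  Per the
     explanation above, operand_equal_p would not distinguish them, so
     iv_common_cand_hasher::equal compares TYPE_PRECISION explicitly.  */
  tree a = build_int_cst (short_integer_type_node, 1);  /* 16-bit 1 */
  tree b = build_int_cst (integer_type_node, 1);        /* 32-bit 1 */
  /* Here TYPE_PRECISION (TREE_TYPE (a)) != TYPE_PRECISION (TREE_TYPE (b)),
     which keeps the two candidates apart in the hash table.  */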

> and b) what kind of tree is base so that it is safe to inspect TYPE_PRECISION
> unconditionally?
Both SCEV and IVO work on expressions with type satisfying
POINTER_TYPE_P or INTEGRAL_TYPE_P, so it's safe to access precision
unconditionally?

>
> +  slot = data->iv_common_cand_tab->find_slot (&ent, INSERT);
> +  if (*slot == NULL)
> +{
> +  *slot = XNEW (struct iv_common_cand);
>
> allocate from the IV obstack instead?  I see we do a lot of heap allocations
> in IVOPTs, so we can improve that as followup as well.
>
Yes, small structures in IVO like iv, iv_use, iv_cand, and iv_common_cand
are better allocated on an obstack.  Actually I have already made that
change to struct iv; the others will follow in followup patches.
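(Concretely, Richard's suggestion would replace the XNEW in
record_common_cand with something like the following sketch;
data->iv_obstack is the obstack initialized in tree_ssa_iv_optimize_init:)

  *slot = (struct iv_common_cand *)
    obstack_alloc (&data->iv_obstack, sizeof (struct iv_common_cand));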

Thanks,
bin
> We probably should empty the obstack after each processed loop.
>
> Thanks,
> Richard.
>
>
>> Thanks,
>> bin
>>
>> 2015-11-03  Bin Cheng  <bin.ch...@arm.com>
>>
>> PR tree-optimization/52272
>> * tree-ssa-loop-ivopts.c (struct iv_common_cand): New struct.
>> (struct iv_common_cand_hasher): New struct.
>> (iv_common_cand_hasher::hash): New function.
>> (iv_common_cand_hasher::equal): New function.
>> (struct ivopts_data): New fields, iv_common_cand_tab and
>> iv_common_cands.
>> (tree_ssa_iv_optimize_init): Initialize above fields.
>> (record_common_cand, common_cand_cmp): New functions.
>> (add_iv_candidate_derived_from_uses): New function.
>> (add_iv_candidate_for_use): Record iv_common_cands derived from
>> iv use in hash table, instead of adding candidates directly.
>> (add_iv_candidate_for_uses): Call
>> 

Re: [PATCH GCC]Improve rtl loop inv cost by checking if the inv can be propagated to address uses

2015-10-25 Thread Bin.Cheng
On Wed, Oct 21, 2015 at 11:55 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Fri, Oct 9, 2015 at 8:04 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
>> On 10/09/2015 02:00 PM, Bin.Cheng wrote:
>>>
>>> I further bootstrap and test attached patch on aarch64.  Also three
>>> cases in spec2k6/fp are improved by 3~6%, two cases in spec2k6/fp are
>>> regressed by ~2%.  Overall score is improved by ~0.8% for spec2k6/fp
>>> on aarch64 of my run.  I may later analyze the regression.
>>>
>>> So is this patch OK?
>>
> Hi Bernd,
> Thanks for reviewing this patch.  I further collected perf data for
> spec2k on AArch64.  Three fp cases are improved by 3-5%, no obvious
> regression.  As for int cases, perlbmk is improved by 8%, but crafty
> is regressed by 3.8%.  Together with spec2k6 data, I think this patch
> is generally good.  I scanned hot functions in crafty but didn't find
> obvious regression because lim hoist decision is very different
> because of this change.  The regression could be caused by register
> pressure..
>
>>
>> I'll approve this with one change, but please keep an eye out for
>> performance regressions on other targets.
> Sure.
>
>>
>>>  * loop-invariant.c (struct def): New field cant_prop_to_addr_uses.
>>>  (inv_cant_prop_to_addr_use): New function.
>>
>>
>> I would like these to have switched truthvalues, i.e. can_prop_to_addr_uses,
>> inv_can_prop_to_addr_use. Otherwise we end up with double negations like
>> !def->cant_prop_to_addr_uses which can be slightly confusing.
>>
>> You'll probably need to slightly tweak the initialization when n_addr_uses
>> goes from zero to one.
> Here is the new version patch with your comments incorporated.
>
Given the patch was pre-approved and there is no other comments, I
will apply it later.

Thanks,
bin


Re: [PATCH PR67921]Use sizetype for CHREC_RIGHT when building pointer type CHREC

2015-10-21 Thread Bin.Cheng
On Wed, Oct 21, 2015 at 5:15 PM, Richard Biener
 wrote:
> On Wed, Oct 21, 2015 at 6:46 AM, Bin Cheng  wrote:
>> Hi,
>> As analyzed in PR67921, I think the issue is caused by fold_binary_loc which
>> folds:
>>   4 - (sizetype)  - (sizetype) ((int *) p1_8(D) + ((sizetype) a_23 * 24 +
>> 4))
>> into below form:
>>   ((sizetype) -((int *) p1_8(D) + ((sizetype) a_23 * 24 + 4)) - (sizetype)
>> ) + 4
>>
>> Look, the minus sizetype expression is folded into a negated pointer
>> expression, which seems incorrect.  Apart from this, the direct reason for
>> this ICE is in CHREC, because of an oversight.  In general CHREC supports
>> NEGATE_EXPR; the only problem is it uses pointer type for CHREC_RIGHT,
>> rather than sizetype, when building a pointer type CHREC.
>>
>> This simple patch fixes the ICE issue.  Bootstrap and test on x86 & x86_64.
>>
>> Is it OK?
>
> Hmm, I think not - we shouldn't ever get pointer typed
> multiplications.  Did you track
> down which is the bogus fold transform (I agree the result above is
> bogus)?  It's
> probably related to STRIP_NOPS stripping sizetype conversions from pointers
> so we might get split_tree to build such negate.  Note that split_tree strips
> (sign!) nops itself and thus should probably simply receive op0 and op1 
> instead
> of arg0 and arg1.
Yes, I was going to send a similar patch for the fold code.  I just
thought it might be useful to support POINTER chrec in *_multiply.  I
will drop this and let you test yours.

Thanks,
bin
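(For the record, the dropped chrec change would have been along these
lines; this is only a sketch reconstructed from the ChangeLog entry below,
not the actual patch:)

  /* In chrec_fold_multiply: pointer offsets are sizetype values, so fold
     CHREC_RIGHT in sizetype when the chrec itself has pointer type.  */
  tree rtype = POINTER_TYPE_P (type) ? sizetype : type;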
>
> I'm testing
>
> @@ -9505,8 +9523,8 @@ fold_binary_loc (location_t loc,
>  then the result with variables.  This increases the chances of
>  literals being recombined later and of generating relocatable
>  expressions for the sum of a constant and literal.  */
> - var0 = split_tree (arg0, code, , , _lit0, 0);
> - var1 = split_tree (arg1, code, , , _lit1,
> + var0 = split_tree (op0, code, , , _lit0, 0);
> + var1 = split_tree (op1, code, , , _lit1,
>  code == MINUS_EXPR);
>
>   /* Recombine MINUS_EXPR operands by using PLUS_EXPR.  */
>
> which fixes the testcase for me.
>
> Richard.
>
>> Note, I do think the associate logic in fold_binary_loc needs a fix, but
>> that should be another patch.
>>
>>
>> 2015-10-20  Bin Cheng  <bin.ch...@arm.com>
>>
>> PR tree-optimization/67921
>> * tree-chrec.c (chrec_fold_multiply): Use sizetype for CHREC_RIGHT
>> if type is pointer type.
>>
>> 2015-10-20  Bin Cheng  <bin.ch...@arm.com>
>>
>> PR tree-optimization/67921
>> * gcc.dg/ubsan/pr67921.c: New test.


Re: [PATCH GCC]Improve rtl loop inv cost by checking if the inv can be propagated to address uses

2015-10-20 Thread Bin.Cheng
On Fri, Oct 9, 2015 at 8:04 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
> On 10/09/2015 02:00 PM, Bin.Cheng wrote:
>>
>> I further bootstrap and test attached patch on aarch64.  Also three
>> cases in spec2k6/fp are improved by 3~6%, two cases in spec2k6/fp are
>> regressed by ~2%.  Overall score is improved by ~0.8% for spec2k6/fp
>> on aarch64 of my run.  I may later analyze the regression.
>>
>> So is this patch OK?
>
Hi Bernd,
Thanks for reviewing this patch.  I further collected perf data for
spec2k on AArch64.  Three fp cases are improved by 3-5%, no obvious
regression.  As for int cases, perlbmk is improved by 8%, but crafty
is regressed by 3.8%.  Together with spec2k6 data, I think this patch
is generally good.  I scanned hot functions in crafty but couldn't pin
down the regression, because the lim hoisting decisions are very
different after this change.  The regression could be caused by register
pressure.

>
> I'll approve this with one change, but please keep an eye out for
> performance regressions on other targets.
Sure.

>
>>  * loop-invariant.c (struct def): New field cant_prop_to_addr_uses.
>>  (inv_cant_prop_to_addr_use): New function.
>
>
> I would like these to have switched truthvalues, i.e. can_prop_to_addr_uses,
> inv_can_prop_to_addr_use. Otherwise we end up with double negations like
> !def->cant_prop_to_addr_uses which can be slightly confusing.
>
> You'll probably need to slightly tweak the initialization when n_addr_uses
> goes from zero to one.
Here is the new version patch with your comments incorporated.

Thanks,
bin

2015-10-19  Bin Cheng  <bin.ch...@arm.com>

* loop-invariant.c (struct def): New field can_prop_to_addr_uses.
(inv_can_prop_to_addr_use): New function.
(record_use): Call can_prop_to_addr_uses, set the new field.
(get_inv_cost): Count cost if inv can't be propagated into its
address uses.
diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index 52c8ae8..7ac38c6 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -99,6 +99,8 @@ struct def
   unsigned n_uses; /* Number of such uses.  */
   unsigned n_addr_uses;/* Number of uses in addresses.  */
   unsigned invno;  /* The corresponding invariant.  */
+  bool can_prop_to_addr_uses;  /* True if the corresponding inv can be
+  propagated into its address uses.  */
 };
 
 /* The data stored for each invariant.  */
@@ -762,6 +764,34 @@ create_new_invariant (struct def *def, rtx_insn *insn, 
bitmap depends_on,
   return inv;
 }
 
+/* Given invariant DEF and its address USE, check if the corresponding
+   invariant expr can be propagated into the use or not.  */
+
+static bool
+inv_can_prop_to_addr_use (struct def *def, df_ref use)
+{
+  struct invariant *inv;
+  rtx *pos = DF_REF_REAL_LOC (use), def_set;
+  rtx_insn *use_insn = DF_REF_INSN (use);
+  rtx_insn *def_insn;
+  bool ok;
+
+  inv = invariants[def->invno];
+  /* No need to check if address expression is expensive.  */
+  if (!inv->cheap_address)
+return false;
+
+  def_insn = inv->insn;
+  def_set = single_set (def_insn);
+  if (!def_set)
+return false;
+
+  validate_unshare_change (use_insn, pos, SET_SRC (def_set), true);
+  ok = verify_changes (0);
+  cancel_changes (0);
+  return ok;
+}
+
 /* Record USE at DEF.  */
 
 static void
@@ -777,7 +807,16 @@ record_use (struct def *def, df_ref use)
   def->uses = u;
   def->n_uses++;
   if (u->addr_use_p)
-def->n_addr_uses++;
+{
+  /* Initialize propagation information if this is the first addr
+use of the inv def.  */
+  if (def->n_addr_uses == 0)
+   def->can_prop_to_addr_uses = true;
+
+  def->n_addr_uses++;
+  if (def->can_prop_to_addr_uses && !inv_can_prop_to_addr_use (def, use))
+   def->can_prop_to_addr_uses = false;
+}
 }
 
 /* Finds the invariants USE depends on and store them to the DEPENDS_ON
@@ -1158,7 +1197,9 @@ get_inv_cost (struct invariant *inv, int *comp_cost, 
unsigned *regs_needed,
 
   if (!inv->cheap_address
   || inv->def->n_uses == 0
-  || inv->def->n_addr_uses < inv->def->n_uses)
+  || inv->def->n_addr_uses < inv->def->n_uses
+  /* Count cost if the inv can't be propagated into address uses.  */
+  || !inv->def->can_prop_to_addr_uses)
 (*comp_cost) += inv->cost * inv->eqno;
 
 #ifdef STACK_REGS
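
To make the intent of can_prop_to_addr_uses concrete, consider a
hypothetical C loop (illustrative only, not from the thread):

  int
  f (int *p, int n)
  {
    int s = 0;
    for (int i = 0; i < n; i++)
      /* The invariant p + 8192 feeds only address uses; if the target can
         fold it back into each memory operand, hoisting it buys nothing,
         so get_inv_cost no longer charges its computation cost.  */
      s += p[2048 + i];
    return s;
  }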


Re: [Patch,optimization]: Optimized changes in the estimate register pressure cost.

2015-10-16 Thread Bin.Cheng
On Wed, Sep 30, 2015 at 12:00 AM, Pat Haugen
 wrote:
> On 09/25/2015 11:51 PM, Ajit Kumar Agarwal wrote:
>>
>> I have made the following changes in the estimate_reg_pressure_cost
>> function used
>> by the loop invariant and IVOPTS.
>>
>> Earlier, estimate_reg_pressure_cost used the cost of only the n_new
>> variables generated by Loop Invariant and IVOPTS.  That is not sufficient
>> for the register pressure calculation.  The register pressure cost
>> calculation should use n_new + n_old to compute the cost.  n_old is the
>> registers used inside the loops, and the effect on register pressure of
>> the n_new variables generated by loop invariant and IVOPTS depends on how
>> the new variables interact with the registers used inside the loops.  The
>> increase or decrease in register pressure is due to the impact of the new
>> variables on the registers used inside the loops.  The register-register
>> move cost or the spill cost should therefore consider the cost associated
>> with the registers already used together with the new variables generated.
>>
>> The increase or decrease in register pressure is based on the overall
>> cost of n_new + n_old, as the change in register pressure caused by the
>> new variables depends on how they behave with respect to the registers
>> used in the loops.
>>
>> Thus the register pressure caused by the new variables is based on the
>> new variables and their impact on the registers used inside the loops,
>> and so the overall cost of n_new + n_old should be considered.
>>
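(Hedged sketch of what this amounts to in cfgloopanal.c's
estimate_reg_pressure_cost; the patch itself was not quoted in this
thread, and regs_needed = n_new + n_old already exists there, so the
change is assumed to be in the cost terms:)

  unsigned regs_needed = n_new + n_old;

  if (regs_needed + target_res_regs <= available_regs)
    return 0;

  if (regs_needed <= available_regs)
    /* Proposed: charge the move cost for all live registers, not just
       the new ones.  */
    cost = target_reg_cost [speed] * regs_needed;   /* was: * n_new */
  else
    cost = target_spill_cost [speed] * regs_needed; /* was: * n_new */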
>> Bootstrap for i386 and reg tested on i386 with the change is fine.
>>
>> SPEC CPU 2000 benchmarks are run and there is following impact on the
>> performance
>> and code size.
>>
>> ratio with the optimization vs ratio without optimization for INT
>> benchmarks
>> (3807.632 vs 3804.661)
>>
>> ratio with the optimization vs ratio without optimization for FP
>> benchmarks
>> ( 4668.743 vs 4778.741)
>>
>> Code size reduction with respect to FP SPEC CPU 2000 benchmarks
>>
>> Number of instructions with optimization = 1094117
>> Number of instructions without optimization = 1094659
>>
>> Reduction in number of instructions with the optimization = 542
>> instructions.
>
> I tried your patch on powerpc64le using CPU2006. There was a small
> degradation in mcf (-1.5%) and small improvement in bwaves (+1.3%), the
> remaining benchmarks (and overall results) were neutral.
We collected performance data on AArch64 for spec2k6.  Most cases vary
within [-1%, 1%]; three cases vary by +/-[1%, 2%].  INT cases are
generally slightly regressed, while FP cases are generally slightly
improved.  The INT overall score is regressed by 0.25%; the FP overall
score is improved by 0.35%.

Thanks,
bin
>
> -Pat
>


Re: [PATCH GCC]Improve rtl loop inv cost by checking if the inv can be propagated to address uses

2015-10-09 Thread Bin.Cheng
On Wed, Sep 30, 2015 at 11:33 AM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Tue, Sep 29, 2015 at 1:21 AM, Jeff Law <l...@redhat.com> wrote:
>> On 09/28/2015 05:28 AM, Bernd Schmidt wrote:
>>>
>>> On 09/28/2015 11:43 AM, Bin Cheng wrote:
>>>>
>>>> Bootstrap and test on x86_64 and x86_32.  Will test it on aarch64.  So
>>>> any
>>>> comments?
>>>>
>>>> Thanks,
>>>> bin
>>>>
>>>> 2015-09-28  Bin Cheng  <bin.ch...@arm.com>
>>>>
>>>> * loop-invariant.c (struct def): New field cant_fwprop_to_addr_uses.
>>>> (inv_cant_fwprop_to_addr_use): New function.
>>>> (record_use): Call inv_cant_fwprop_to_addr_use, set the new field.
>>>> (get_inv_cost): Count cost if inv can't be propagated into its
>>>> address uses.
>>>
>>>
>>> It looks at least plausible.
>>
>> Definitely plausible.  Many targets have restrictions on the immediate
>> offsets, so this potentially affects many targets (in a good way).
> Here is some more information.
> For spec2k6 on x86, no regression or obvious improvement.
> For spec2k6 on aarch64, several cases are improved.
> Given the results and your positive feedback, I will continue to work
> on this patch when I get back from holiday.
I further bootstrap and test attached patch on aarch64.  Also three
cases in spec2k6/fp are improved by 3~6%, two cases in spec2k6/fp are
regressed by ~2%.  Overall score is improved by ~0.8% for spec2k6/fp
on aarch64 of my run.  I may later analyze the regression.

So is this patch OK?

Thanks,
bin

2015-10-09  Bin Cheng  <bin.ch...@arm.com>

* loop-invariant.c (struct def): New field cant_prop_to_addr_uses.
(inv_cant_prop_to_addr_use): New function.
(record_use): Call inv_cant_prop_to_addr_use, set the new field.
(get_inv_cost): Count cost if inv can't be propagated into its
address uses.

>
>>
>>
>>  Another option which I think has had some
>>>
>>> discussion recently would be to just move everything, and leave it to
>>> cprop to put things back together if the costs allow it.
>>
>> I go back and forth on this kind of approach.
> Hmm, either way we need to model rtx/register pressure costs, in loop
> invariant or cprop.
>
> Thanks,
> bin
>
>>
>> jeff
diff --git a/gcc/loop-invariant.c b/gcc/loop-invariant.c
index 52c8ae8..3c2395c 100644
--- a/gcc/loop-invariant.c
+++ b/gcc/loop-invariant.c
@@ -99,6 +99,8 @@ struct def
   unsigned n_uses; /* Number of such uses.  */
   unsigned n_addr_uses;/* Number of uses in addresses.  */
   unsigned invno;  /* The corresponding invariant.  */
+  bool cant_prop_to_addr_uses; /* True if the corresponding inv can't be
+  propagated into its address uses.  */
 };
 
 /* The data stored for each invariant.  */
@@ -762,6 +764,34 @@ create_new_invariant (struct def *def, rtx_insn *insn, 
bitmap depends_on,
   return inv;
 }
 
+/* Given invariant DEF and its address USE, check if the corresponding
+   invariant expr can be propagated into the use or not.  */
+
+static bool
+inv_cant_prop_to_addr_use (struct def *def, df_ref use)
+{
+  struct invariant *inv;
+  rtx *pos = DF_REF_REAL_LOC (use), def_set;
+  rtx_insn *use_insn = DF_REF_INSN (use);
+  rtx_insn *def_insn;
+  bool ok;
+
+  inv = invariants[def->invno];
+  /* No need to check if address expression is expensive.  */
+  if (!inv->cheap_address)
+return true;
+
+  def_insn = inv->insn;
+  def_set = single_set (def_insn);
+  if (!def_set)
+return true;
+
+  validate_unshare_change (use_insn, pos, SET_SRC (def_set), true);
+  ok = verify_changes (0);
+  cancel_changes (0);
+  return !ok;
+}
+
 /* Record USE at DEF.  */
 
 static void
@@ -777,7 +807,11 @@ record_use (struct def *def, df_ref use)
   def->uses = u;
   def->n_uses++;
   if (u->addr_use_p)
-def->n_addr_uses++;
+{
+  def->n_addr_uses++;
+  if (!def->cant_prop_to_addr_uses && inv_cant_prop_to_addr_use (def, use))
+   def->cant_prop_to_addr_uses = true;
+}
 }
 
 /* Finds the invariants USE depends on and store them to the DEPENDS_ON
@@ -1158,7 +1192,9 @@ get_inv_cost (struct invariant *inv, int *comp_cost, 
unsigned *regs_needed,
 
   if (!inv->cheap_address
   || inv->def->n_uses == 0
-  || inv->def->n_addr_uses < inv->def->n_uses)
+  || inv->def->n_addr_uses < inv->def->n_uses
+  /* Count cost if the inv can't be propagated into address uses.  */
+  || inv->def->cant_prop_to_addr_uses)
 (*comp_cost) += inv->cost * inv->eqno;
 
 #ifdef STACK_REGS


Re: [PATCH 9/9] Fix PR 66768

2015-10-08 Thread Bin.Cheng
On Thu, Oct 8, 2015 at 5:55 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
> On 10/08/2015 07:17 AM, Bin.Cheng wrote:
>>
>> On Thu, Oct 8, 2015 at 12:59 PM, Richard Henderson <r...@redhat.com> wrote:
>>>
>>> This is the patch that richi includes in the PR.  There will need to
>>> be an additional patch to solve an ICE for the AVR backend, as noted
>>> in the PR, but this is good enough to solve the bad-code generation
>>> problem for the i386 backend.
>>
>> Hi Richard,
>> For the record, the root cause is in IVO because it fails to preserve
>> the base object.  This patch can only paper over the issue for address
>> spaces where the PTR type and sizetype have the same length; otherwise IVO
>> generates wrong code which can't be worked around by this patch.  I
>> will take PR66768.
>
>
> Hmm. In 2012 I submitted a patch "Preserve pointer types in ivopts", which
> got lost in review. It was for a different problem than address spaces, but
> it might be worth taking a look whether that approach could help solve this
> issue.
Hi Bernd,
Thanks for your suggestion, I will search for that patch.

Thanks,
bin
>
>
> Bernd


Re: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.

2015-10-08 Thread Bin.Cheng
On Thu, Oct 8, 2015 at 1:53 PM, Ajit Kumar Agarwal
<ajit.kumar.agar...@xilinx.com> wrote:
>
>
> -Original Message-
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: Thursday, October 08, 2015 10:29 AM
> To: Ajit Kumar Agarwal
> Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; 
> Nagaraju Mekala
> Subject: Re: [RFC, Patch]: Optimized changes in the register used inside loop 
> for LICM and IVOPTS.
>
> On Thu, Oct 8, 2015 at 12:32 PM, Ajit Kumar Agarwal 
> <ajit.kumar.agar...@xilinx.com> wrote:
>> Following Proposed:
>>
>> Changes are done in the Loop Invariant(LICM) at RTL level and also the 
>> Induction variable optimization based on SSA representation.
>> The current logic used in LICM for register used inside the loops is
>> changed. The Live Out of the loop latch node and the Live in of the
>> destination of the exit nodes is used to set the Loops Liveness at the exit 
>> of the Loop. The register used is the number of live variables at the exit 
>> of the Loop calculated above.
>>
>> For Induction variable optimization on tree SSA representation, the
>> register used logic is based on the number of phi nodes at the loop
>> header to represent the liveness at the loop. Current Logic used only the 
>> number of phi nodes at the loop header. I have made changes  to represent 
>> the phi operands also live at the loop. Thus number of phi operands also 
>> gets incremented in the number of registers used.
> Hi,
>>>For the GIMPLE IVO part, I don't think the change is reasonable enough.
>>>IMHO, IVO fails to restrict iv number in some complex cases; your change
>>>tries to rectify that by increasing register pressure irrespective of
>>>out-of-ssa and coalescing.  I think the original code models reg-pressure
>>>better; what needs to be changed is how we compute cost from register
>>>pressure and use that to restrict iv number.
>
> Considering the liveness with respect to all the phi arguments will not
> increase the register pressure.  It improves the heuristics for restricting
> the IVs that increase the register pressure.  The cost model uses regs_used,
> and modelling the liveness with respect to the phi arguments measures
> better register pressure.
I think register pressure is increased along with regs_needed; it doesn't
matter that it will be canceled in estimate_reg_pressure_cost on both
ends of the cost comparison.
I agree IV number should be controlled for some cases, but not by
increasing `n' with the phi argument number unconditionally.  Considering
summary reduction as an example, most likely the ssa names will be
coalesced and held in a single register.  Furthermore, there is no
reason to count phi node/arg numbers for floating point phi nodes.
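
To make the coalescing point concrete, here is a summary reduction
(my own minimal illustration, not taken from the patch):

/* 'sum' gets a loop-header PHI, PHI <0.0 (preheader), sum_next (latch)>.
   Out-of-ssa almost always coalesces the PHI result and its arguments
   into one register, so counting every PHI argument as an extra live
   value overstates pressure; and being a float PHI, it does not
   compete for the integer IV registers at all.  */
float
sum_array (const float *a, int n)
{
  float sum = 0.0f;
  for (int i = 0; i < n; i++)
    sum += a[i];
  return sum;
}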

>
> Number of phi nodes in the loop header is not the only criterion for
> regs_used; the liveness with respect to the loop should also be a
> criterion to measure appropriate register pressure.
IMHO, it's hard to accurately track liveness info on SSA(PHI), because
of coalescing etc.  So could you give some examples/proof for this?

Thanks,
bin
>
> Thanks & Regards
> Ajit


Re: [RFC, Patch]: Optimized changes in the register used inside loop for LICM and IVOPTS.

2015-10-07 Thread Bin.Cheng
On Thu, Oct 8, 2015 at 12:32 PM, Ajit Kumar Agarwal
 wrote:
> Following Proposed:
>
> Changes are done in the Loop Invariant (LICM) pass at RTL level and also in
> the induction variable optimization based on the SSA representation.
> The current logic used in LICM for registers used inside the loops is
> changed: the live-out of the loop latch node and the live-in of the
> destinations of the exit edges are used to compute the loop's liveness at
> the exit of the loop. The registers-used count is the number of live
> variables at the exit of the loop, calculated as above.
>
> For induction variable optimization on the tree SSA representation, the
> registers-used logic is based on the number of phi nodes at the loop header
> to represent the liveness in the loop. The current logic used only the
> number of phi nodes at the loop header. I have made changes to treat the
> phi operands as also live in the loop; thus the number of phi operands also
> gets added to the number of registers used.
Hi,
For the GIMPLE IVO part, I don't think the change is reasonable
enough.  IMHO, IVO fails to restrict iv number in some complex cases;
your change tries to rectify that by increasing register pressure
irrespective of out-of-ssa and coalescing.  I think the original code
models reg-pressure better; what needs to be changed is how we compute
cost from register pressure and use that to restrict iv number.
As for the specific function determine_set_costs, I think one change
is necessary: rule out all floating point phi nodes, because they have
no impact on IVO register pressure.  Actually this change will
further reduce register pressure for fp related cases.

Thanks,
bin
>
> Performance runs:
>
> Bootstrapping with i386 goes through fine. The SPEC CPU 2000 benchmarks were
> run, and the following performance and code size results for the i386
> target were seen.
>
> Ratio with the above optimization changes vs ratio without above 
> optimizations for INT benchmarks (3785.261 vs 3783.064).
> Ratio with the above optimization changes vs ratio without above optimization 
> for FP benchmarks ( 4676.763189 vs 4676.072428 ).
>
> Code size reduction for INT benchmarks : 2324 instructions.
> Code size reduction for FP benchmarks : 1283 instructions.
>
> For the MicroBlaze target the MiBench and EEMBC benchmarks were run and the
> following improvements were seen.
>
> (qos_lite(5.3%), consumer_jpeg_c(1.34%), security_rijndael_d(1.8%), 
> security_rijndael_e(1.4%))
>
> Code Size reduction for Mibench  = 16164 instructions.
> Code Size reduction for EEMBC = 98 instructions.
>
> Patch ChangeLog:
>
> PATCH] [RFC, Patch]: Optimized changes in the register used inside  loop for 
> LICM and IVOPTS.
>
> Changes are done in the Loop Invariant (LICM) pass at RTL level and also in
> the induction variable optimization based on the SSA representation. The
> current logic used in LICM for registers used inside the loops is changed:
> the live-out of the loop latch node and the live-in of the destinations of
> the exit edges are used to compute the loop's liveness at the exit of the
> loop. The registers-used count is the number of live variables at the exit
> of the loop, calculated as above.
>
> For induction variable optimization on the tree SSA representation, the
> registers-used logic is based on the number of phi nodes at the loop header
> to represent the liveness in the loop.  The current logic used only the
> number of phi nodes at the loop header.  Changes are made to treat the phi
> operands as also live in the loop; thus the number of phi operands also
> gets added to the number of registers used.
>
> ChangeLog:
> 2015-10-09  Ajit Agarwal  
>
> * loop-invariant.c (compute_loop_liveness): New.
> (determine_regs_used): New.
> (find_invariants_to_move): Use of determine_regs_used.
> * tree-ssa-loop-ivopts.c (determine_set_costs): Consider the phi
> arguments for register used.
>
> Signed-off-by: Ajit Agarwal ajit...@xilinx.com
>
> Thanks & Regards
> Ajit


Re: [PATCH 9/9] Fix PR 66768

2015-10-07 Thread Bin.Cheng
On Thu, Oct 8, 2015 at 12:59 PM, Richard Henderson  wrote:
> This is the patch that richi includes in the PR.  There will need to
> be an additional patch to solve an ICE for the AVR backend, as noted
> in the PR, but this is good enough to solve the bad-code generation
> problem for the i386 backend.
Hi Richard,
For the record, the root cause is in IVO because it fails to preserve
base object.  This patch can only paper over the issue for address
spaces where PTR type and sizetype have the same length, otherwise IVO
generates wrong code which can't be worked around by this patch.  I
will take PR66768.
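
To illustrate the address-space case, here is a sketch assuming avr-gcc
and its named address spaces (my own example, not a testcase from the
PR):

/* The base object lives in a non-default address space whose pointer
   width differs from sizetype.  If IVO rewrites table[i] with a base
   whose pointer type has been dropped (a plain ptr_type_node constant),
   the __flash qualification is lost and the load targets RAM instead
   of program memory.  */
const __flash char table[] = "hello";

char
get_elt (unsigned char i)
{
  return table[i];
}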

Thanks,
bin
>
>
> * tree-ssa-address.c (create_mem_ref_raw): Retain the correct
> type for the address base.
> ---
>  gcc/tree-ssa-address.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/tree-ssa-address.c b/gcc/tree-ssa-address.c
> index 042f9c9..bd10ae7 100644
> --- a/gcc/tree-ssa-address.c
> +++ b/gcc/tree-ssa-address.c
> @@ -388,7 +388,7 @@ create_mem_ref_raw (tree type, tree alias_ptr_type, 
> struct mem_address *addr,
>  }
>else
>  {
> -  base = build_int_cst (ptr_type_node, 0);
> +  base = build_int_cst (build_pointer_type (type), 0);
>index2 = addr->base;
>  }
>
> --
> 2.4.3
>


Re: [PATCH GCC]Improve rtl loop inv cost by checking if the inv can be propagated to address uses

2015-09-29 Thread Bin.Cheng
On Tue, Sep 29, 2015 at 1:21 AM, Jeff Law  wrote:
> On 09/28/2015 05:28 AM, Bernd Schmidt wrote:
>>
>> On 09/28/2015 11:43 AM, Bin Cheng wrote:
>>>
>>> Bootstrap and test on x86_64 and x86_32.  Will test it on aarch64.  So
>>> any
>>> comments?
>>>
>>> Thanks,
>>> bin
>>>
>>> 2015-09-28  Bin Cheng  
>>>
>>> * loop-invariant.c (struct def): New field cant_fwprop_to_addr_uses.
>>> (inv_cant_fwprop_to_addr_use): New function.
>>> (record_use): Call inv_cant_fwprop_to_addr_use, set the new field.
>>> (get_inv_cost): Count cost if inv can't be propagated into its
>>> address uses.
>>
>>
>> It looks at least plausible.
>
> Definitely plausible.  Many targets have restrictions on the immediate
> offsets, so this potentially affects many targets (in a good way).
Here is some more information.
For spec2k6 on x86, no regression or obvious improvement.
For spec2k6 on aarch64, several cases are improved.
Given the results and your positive feedback, I will continue to work
on this patch when I get back from holiday.

>
>
>  Another option which I think has had some
>>
>> discussion recently would be to just move everything, and leave it to
>> cprop to put things back together if the costs allow it.
>
> I go back and forth on this kind of approach.
Hmm, either way we need to model rtx/register pressure costs, in loop
invariant or cprop.
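
For reference, the kind of loop where the invariant genuinely can't be
folded back into the address (illustrative only, assuming a target with
a limited immediate offset range):

/* LICM hoists the invariant address base + 160000B; that offset is
   outside the immediate range of most targets' load/store
   instructions, so the invariant can't be forward-propagated back
   into the memory access and really does occupy a register.  */
void
clear_high (int *base, int n)
{
  for (int i = 0; i < n; i++)
    base[i + 40000] = 0;
}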

Thanks,
bin

>
> jeff


Re: [Patch,optimization]: Optimized changes in the estimate register pressure cost.

2015-09-28 Thread Bin.Cheng
On Tue, Sep 29, 2015 at 2:25 AM, Aaron Sawdey
 wrote:
> On Sat, 2015-09-26 at 04:51 +, Ajit Kumar Agarwal wrote:
>> I have made the following changes in the estimate_reg_pressure_cost function 
>> used
>> by the loop invariant and IVOPTS.
>>
>> Earlier the estimate_reg_pressure cost uses the cost of n_new variables that 
>> are generated by the Loop Invariant
>>  and IVOPTS. These are not sufficient for register pressure calculation. The 
>> register pressure cost calculation should
>> use the n_new + n_old (numbers) to consider the cost. n_old is the register  
>> used inside the loops and the effect of
>>  n_new new variables generated by loop invariant and IVOPTS on register 
>> pressure is based on how the new
>> variables impact on register used inside the loops. The increase or decrease 
>> in register pressure is due to the impact
>> of new variables on the register used  inside the loops. The 
>> register-register move cost or the spill cost should consider
>> the cost associated with register used and the new variables generated. The 
>> movement  of new variables increases or
>> decreases the register pressure, which is based on  overall cost of n_new + 
>> n_old variables.
>>
>> The increase and decrease in register pressure is based on the overall cost 
>> of n_new + n_old as the changes in the
>> register pressure caused due to new variables is based on how the changes 
>> behave with respect to the register used
>> in the loops.
>>
>> Thus the register pressure caused to new variables is based on the new 
>> variables and its impact on register used inside
>>  the loops and thus consider the overall  cost of n_new + n_old.
>>
>> Bootstrap for i386 and reg tested on i386 with the change is fine.
>>
>> SPEC CPU 2000 benchmarks are run and there is following impact on the 
>> performance
>> and code size.
>>
>> ratio with the optimization vs ratio without optimization for INT benchmarks
>> (3807.632 vs 3804.661)
>>
>> ratio with the optimization vs ratio without optimization for FP benchmarks
>> ( 4668.743 vs 4778.741)
>>
>> Code size reduction with respect to FP SPEC CPU 2000 benchmarks
>>
>> Number of instruction with optimization = 1094117
>> Number of instruction without optimization = 1094659
>>
>> Reduction in number of instruction with the optimization = 542 instruction.
>>
>> [Patch,optimization]: Optimized changes in the estimate
>>  register pressure cost.
>>
>> Earlier the estimate_reg_pressure cost uses the cost of n_new variables that
>> are generated by the Loop Invariant and IVOPTS. These are not sufficient for
>> register pressure calculation. The register pressure cost calculation should
>> use the n_new + n_old (numbers) to consider the cost. n_old is the register
>> used inside the loops and the effect of n_new new variables generated by
>> loop invariant and IVOPTS on register pressure is based on how the new
>> variables impact on register used inside the loops.
>>
>> ChangeLog:
>> 2015-09-26  Ajit Agarwal  
>>
>>   * cfgloopanal.c (estimate_reg_pressure_cost): Add changes
>>   to consider the n_new plus n_old in the register pressure
>>   cost.
>>
>> Signed-off-by: Ajit Agarwal ajit...@xilinx.com
>
> Ajit,
>   It looks to me like your change doesn't do anything at all inside the
> loop-invariant.c code. There it's doing a difference between two
> estimate_reg_pressure_cost calls so adding n_old (regs_used) to both is
> canceled out.
>
>   size_cost = (estimate_reg_pressure_cost (new_regs[0] + regs_needed[0],
>regs_used, speed, call_p)
>- estimate_reg_pressure_cost (new_regs[0],
>  regs_used, speed, call_p));
>
> I'm not quite sure I understand the "why" of the heuristic you've added
> here -- can you explain your reasoning further?

With this, I think the only change would be in GIMPLE IVOPT?  The
patch increases the register pressure cost if it exceeds the available
register number when choosing iv candidates.  As I mentioned, it may
only have an impact on scenarios that are on the verge of the available
register number; otherwise reg_old is added (thus cancelled) for both
ends of the comparison.
The result isn't clear enough even for boundary cases because IVO now
has issues in computing the "starting register pressure".  I also
plan to revisit the pressure model in IVO later.
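
A toy model of the cancellation (my own sketch with made-up
linear/spill numbers, not the real estimate_reg_pressure_cost):

#include <stdio.h>

/* Pretend cost: linear while values fit in 'avail' registers, much
   steeper once they spill.  */
static int
cost (int regs, int avail)
{
  return regs <= avail ? 2 * regs : 2 * avail + 8 * (regs - avail);
}

int
main (void)
{
  int avail = 16, n_new = 3, regs_needed = 2;
  /* The difference is a constant 4 well below the limit and a constant
     16 well above it; adding n_old to both operands only changes the
     outcome while the totals are crossing the 16-register limit.  */
  for (int n_old = 0; n_old <= 24; n_old += 4)
    printf ("n_old=%2d: size_cost=%d\n", n_old,
            cost (n_old + n_new + regs_needed, avail)
            - cost (n_old + n_new, avail));
  return 0;
}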

Thanks,
bin
>
>>
>> Thanks & Regards
>> Ajit
>>
>
> Thanks,
> Aaron
>
> --
> Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
> 050-2/C113  (507) 253-7520 home: 507/263-0782
> IBM Linux Technology Center - PPC Toolchain
>


Re: [Patch,optimization]: Optimized changes in the estimate register pressure cost.

2015-09-27 Thread Bin.Cheng
On Sun, Sep 27, 2015 at 11:13 PM, Ajit Kumar Agarwal
 wrote:
>
>
> -Original Message-
> From: Segher Boessenkool [mailto:seg...@kernel.crashing.org]
> Sent: Sunday, September 27, 2015 7:49 PM
> To: Ajit Kumar Agarwal
> Cc: GCC Patches; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; 
> Nagaraju Mekala
> Subject: Re: [Patch,optimization]: Optimized changes in the estimate register 
> pressure cost.
>
> On Sat, Sep 26, 2015 at 04:51:20AM +, Ajit Kumar Agarwal wrote:
>> SPEC CPU 2000 benchmarks are run and there is following impact on the
>> performance and code size.
>>
>> ratio with the optimization vs ratio without optimization for INT
>> benchmarks
>> (3807.632 vs 3804.661)
>>
>> ratio with the optimization vs ratio without optimization for FP
>> benchmarks ( 4668.743 vs 4778.741)
>
>>>Did you swap these?  You're saying FP got significantly worse?
>
> Sorry for the typo error.  Please find the corrected one.
>
> Ratio  with the optimization vs ratio without optimization for FP  benchmarks 
> ( 4668.743 vs 4668.741). With the optimization,
> FP performance is slightly better.
Did you mis-type the number again?  Or this must be noise.  Now I
remember why I didn't get a perf improvement from this.  Changing
reg_new to reg_new + reg_old doesn't have a big impact because it just
increases the starting number for each scenario.  Maybe it still
makes sense for cases on the verge of exceeding the target's available
register number.  I will try to collect benchmark data on ARM, but it
may take some time.

Thanks,
bin
>
> Thanks & Regards
> Ajit
>
> Segher


Re: [Patch,optimization]: Optimized changes in the estimate register pressure cost.

2015-09-26 Thread Bin.Cheng
On Sat, Sep 26, 2015 at 12:51 PM, Ajit Kumar Agarwal
 wrote:
> I have made the following changes in the estimate_reg_pressure_cost function 
> used
> by the loop invariant and IVOPTS.
>
> Earlier the estimate_reg_pressure cost uses the cost of n_new variables that 
> are generated by the Loop Invariant
>  and IVOPTS. These are not sufficient for register pressure calculation. The 
> register pressure cost calculation should
> use the n_new + n_old (numbers) to consider the cost. n_old is the register  
> used inside the loops and the effect of
>  n_new new variables generated by loop invariant and IVOPTS on register 
> pressure is based on how the new
> variables impact on register used inside the loops. The increase or decrease 
> in register pressure is due to the impact
> of new variables on the register used  inside the loops. The 
> register-register move cost or the spill cost should consider
> the cost associated with register used and the new variables generated. The 
> movement  of new variables increases or
> decreases the register pressure, which is based on  overall cost of n_new + 
> n_old variables.
>
> The increase and decrease in register pressure is based on the overall cost 
> of n_new + n_old as the changes in the
> register pressure caused due to new variables is based on how the changes 
> behave with respect to the register used
> in the loops.
>
> Thus the register pressure caused to new variables is based on the new 
> variables and its impact on register used inside
>  the loops and thus consider the overall  cost of n_new + n_old.
>
> Bootstrap for i386 and reg tested on i386 with the change is fine.
>
> SPEC CPU 2000 benchmarks are run and there is following impact on the 
> performance
> and code size.
>
> ratio with the optimization vs ratio without optimization for INT benchmarks
> (3807.632 vs 3804.661)
>
> ratio with the optimization vs ratio without optimization for FP benchmarks
> ( 4668.743 vs 4778.741)
>
> Code size reduction with respect to FP SPEC CPU 2000 benchmarks
>
> Number of instruction with optimization = 1094117
> Number of instruction without optimization = 1094659
>
> Reduction in number of instruction with the optimization = 542 instruction.
>
> [Patch,optimization]: Optimized changes in the estimate
>  register pressure cost.
>
> Earlier the estimate_reg_pressure cost uses the cost of n_new variables that
> are generated by the Loop Invariant and IVOPTS. These are not sufficient for
> register pressure calculation. The register pressure cost calculation should
> use the n_new + n_old (numbers) to consider the cost. n_old is the register
> used inside the loops and the effect of n_new new variables generated by
> loop invariant and IVOPTS on register pressure is based on how the new
> variables impact on register used inside the loops.

Hi,
I remember I did this experiment before when I was tuning register
pressure for embedded processors.  I don't remember why I didn't follow
it up.  I will collect data on arm processors for the patch.

Thanks,
bin
>
> ChangeLog:
> 2015-09-26  Ajit Agarwal  
>
> * cfgloopanal.c (estimate_reg_pressure_cost): Add changes
> to consider the n_new plus n_old in the register pressure
> cost.
>
> Signed-off-by: Ajit Agarwal ajit...@xilinx.com
>
> Thanks & Regards
> Ajit
>


Re: [PATCH PR66388]Add sizetype cand for BIV of smaller type if it's used as index of memory ref

2015-09-14 Thread Bin.Cheng
Just realized that I missed the updated patch before.  Here it is...
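
For completeness, the loop shape the patch is about looks like this
(a minimal illustration of mine, assuming a 64-bit target; not one of
the benchmark loops):

/* 'i' is a 32-bit BIV used as the index of a memory reference on a
   64-bit target.  Without a sizetype candidate, IVO creates a separate
   64-bit IV for each address use; with one, the BIV (extended once to
   sizetype) can serve every a[i].  */
long
sum (int *a, int n)
{
  long s = 0;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}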

Thanks,
bin

On Tue, Sep 8, 2015 at 6:07 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Tue, Sep 8, 2015 at 6:06 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Wed, Sep 2, 2015 at 10:12 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Wed, Sep 2, 2015 at 5:26 AM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> This patch is a new approach to fix PR66388.  IVO today computes iv_use 
>>>> with
>>>> iv_cand which has at least same type precision as the use.  On 64bit
>>>> platforms like AArch64, this results in different iv_cand created for each
>>>> address type iv_use, and register pressure increased.  As a matter of fact,
>>>> the BIV should be used for all iv_uses in some of these cases.  It is a
>>>> latent bug but recently getting worse because of overflow changes.
>>>>
>>>> The original approach at
>>>> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html can fix the issue
>>>> except it conflicts with IV elimination.  Seems to me it is impossible to
>>>> mitigate the contradiction.
>>>>
>>>> This new approach fixes the issue by adding sizetype iv_cand for BIVs
>>>> directly.  In cases if the original BIV is preferred, the sizetype iv_cand
>>>> will be chosen.  As for code generation, the sizetype iv_cand has the same
>>>> effect as the original BIV.  Actually, it's better because BIV needs to be
>>>> explicitly extended to sizetype to be used in address expression on most
>>>> targets.
>>>>
>>>> One shortage of this approach is it may introduce more iv candidates.  To
>>>> minimize the impact, this patch does sophisticated code analysis and adds
>>>> sizetype candidate for BIV only if it is used as index.  Moreover, it 
>>>> avoids
>>>> to add candidate of the original type if the BIV is only used as index.
>>>> Statistics for compiling spec2k6 shows increase of candidate number is
>>>> modest and can be ignored.
>>>>
>>>> There are two more patches following to fix corner cases revealed by this
>>>> one.  Together they bring an obvious perf improvement for spec2k6/int on
>>>> aarch64.
>>>> Spec2k6/int
>>>> 400.perlbench   3.44%
>>>> 445.gobmk   -0.86%
>>>> 456.hmmer   14.83%
>>>> 458.sjeng   2.49%
>>>> 462.libquantum  -0.79%
>>>> GEOMEAN 1.68%
>>>>
>>>> There is also about 0.36% improvement for spec2k6/fp, mostly because of 
>>>> case
>>>> 436.cactusADM.  I believe it can be further improved, but that should be
>>>> another patch.
>>>>
>>>> I also collected benchmark data for x86_64.  Spec2k6/fp is not affected.  
>>>> As
>>>> for spec2k6/int, though the geomean is improved slightly, 400.perlbench is
>>>> regressed by ~3%.  I can see BIVs are chosen for some loops instead of
>>>> address candidates.  Generally, the loop header will be simplified because
>>>> iv elimination with BIV is simpler; the number of instructions in loop body
>>>> isn't changed.  I suspect the regression comes from different addressing
>>>> modes.  With BIV, complex addressing mode like [base + index << scale +
>>>> disp] is used, rather than [base + disp].  I guess the former has more
>>>> micro-ops, thus more expensive.  This guess can be confirmed by manually
>>>> suppressing the complex addressing mode with higher address cost.
>>>> Now the problem becomes why overall cost of BIV is computed lower while the
>>>> actual cost is higher.  I noticed for most affected loops, loop header is
>>>> bloated because of iv elimination using the old address candidate.  The
>>>> bloated loop header results in much higher cost than BIV.  As a result, BIV
>>>> is preferred.  I also noticed the bloated loop header generally can be
>>>> simplified (I have a following patch for this).  After applying the local
>>>> patch, the old address candidate is chosen, and most of regression is
>>>> recovered.
>>>> Conclusion is I think loop header bloated issue should be blamed for the
>>>> regression, and it can be resolved.
>>>>
>>>> Bootstrap and test on x86_64 and aarch64.  It fixes the failure of
>>>> gcc.target/i386/pr49781-1.c, without new breakage.

Re: [PATCH GCC][rework]Improve loop bound info by simplifying conversions in iv base

2015-09-14 Thread Bin.Cheng
Ping.
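
For reference, the shape of loop the simplification helps with (my own
minimal illustration, assuming a small signed IV; not one of the added
testcases):

/* The IV has a type smaller than int, so its increment is done in the
   corresponding unsigned type and SCEV sees values of the form
   (signed char) ((unsigned char) i + 1).  The loop entry guard s < e
   supplies the range information needed to strip the conversions and
   compute an accurate bound.  */
int
count (signed char s, signed char e)
{
  int n = 0;
  for (signed char i = s; i < e; i++)
    n++;
  return n;
}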

On Thu, Aug 27, 2015 at 5:41 PM, Bin Cheng  wrote:
> Hi,
> This is a rework for
> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02335.html, with review
> comments addressed.  For now, SCEV may compute iv base in the form of
> "(signed T)((unsigned T)base + step))".  This complicates other
> optimizations/analysis depending on SCEV because it's hard to dive into type
> conversions.  This kind of type conversions can be simplified with
> additional range information implied by loop initial conditions.  This patch
> does such simplification.
> With simplified iv base, loop niter analysis can compute more accurate bound
> information since sensible value range can be derived for "base+step".  For
> example, accurate loop bound and may_be_zero information is computed for cases
> added by this patch.
>
> The code is actually moved from loop_exits_before_overflow.  After this
> patch, the corresponding code in loop_exits_before_overflow will be never
> executed, so I removed that part code.  The patch also includes some code
> format changes.
>
> Bootstrap and test on x86_64.  Is it OK?
>
> Thanks,
> bin
>
> 2015-08-27  Bin Cheng  
>
> * tree-ssa-loop-niter.c (tree_simplify_using_condition_1): Support
> new parameter.
> (tree_simplify_using_condition): Ditto.
> (simplify_using_initial_conditions): Ditto.
> (loop_exits_before_overflow): Pass new argument to function
> simplify_using_initial_conditions.  Remove case for type conversions
> simplification.
> * tree-ssa-loop-niter.h (simplify_using_initial_conditions): New
> parameter.
> * tree-scalar-evolution.c (simple_iv): Simplify type conversions
> in iv base using loop initial conditions.
>
> gcc/testsuite/ChangeLog
> 2015-08-27  Bin Cheng  
>
> * gcc.dg/tree-ssa/loop-bound-2.c: New test.
> * gcc.dg/tree-ssa/loop-bound-4.c: New test.
> * gcc.dg/tree-ssa/loop-bound-6.c: New test.


Re: [PATCH GCC]Look into unnecessary conversion when checking mult_op in get_shiftadd_cost

2015-09-14 Thread Bin.Cheng
On Wed, Sep 2, 2015 at 8:32 PM, Richard Biener
 wrote:
> On Wed, Sep 2, 2015 at 5:50 AM, Bin Cheng  wrote:
>> Hi,
>> When calling get_shiftadd_cost, the mult_op is stripped at caller places.
>> We should look into unnecessary conversion in op1 before checking equality,
>> otherwise it computes wrong shiftadd cost.  This patch picks this small
>> issue up.
>>
>> Bootstrap and test on x86_64 and aarch64 along with other patches.  Is it
>> OK?
>
> Just do STRIP_NOPS (op1) unconditionally?  Thus
>
>   STRIP_NOPS (op1);
>   mult_in_op1 = operand_equal_p (op1, mult, 0);
>
> ok with that change.
Patch committed as suggested.
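
For the record, the kind of expression whose shiftadd cost was
mis-computed (an illustrative example of mine, not a testcase from the
patch):

/* The multiply operand reaches get_shiftadd_cost wrapped in a
   sign-changing nop conversion; without STRIP_NOPS the operand
   comparison against 'mult' fails and the shift-and-add pattern is
   never costed.  */
unsigned long
f (unsigned long x, long i)
{
  return x + (unsigned long) (i * 8);
}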

Thanks,
bin


Re: [PATCH PR66388]Add sizetype cand for BIV of smaller type if it's used as index of memory ref

2015-09-08 Thread Bin.Cheng
On Wed, Sep 2, 2015 at 10:12 PM, Richard Biener
 wrote:
> On Wed, Sep 2, 2015 at 5:26 AM, Bin Cheng  wrote:
>> Hi,
>> This patch is a new approach to fix PR66388.  IVO today computes iv_use with
>> iv_cand which has at least same type precision as the use.  On 64bit
>> platforms like AArch64, this results in different iv_cand created for each
>> address type iv_use, and register pressure increased.  As a matter of fact,
>> the BIV should be used for all iv_uses in some of these cases.  It is a
>> latent bug but recently getting worse because of overflow changes.
>>
>> The original approach at
>> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html can fix the issue
>> except it conflicts with IV elimination.  Seems to me it is impossible to
>> mitigate the contradiction.
>>
>> This new approach fixes the issue by adding sizetype iv_cand for BIVs
>> directly.  In cases if the original BIV is preferred, the sizetype iv_cand
>> will be chosen.  As for code generation, the sizetype iv_cand has the same
>> effect as the original BIV.  Actually, it's better because BIV needs to be
>> explicitly extended to sizetype to be used in address expression on most
>> targets.
>>
>> One shortage of this approach is it may introduce more iv candidates.  To
>> minimize the impact, this patch does sophisticated code analysis and adds
>> sizetype candidate for BIV only if it is used as index.  Moreover, it avoids
>> to add candidate of the original type if the BIV is only used as index.
>> Statistics for compiling spec2k6 shows increase of candidate number is
>> modest and can be ignored.
>>
>> There are two more patches following to fix corner cases revealed by this
>> one.  Together they bring an obvious perf improvement for spec2k6/int on
>> aarch64.
>> Spec2k6/int
>> 400.perlbench   3.44%
>> 445.gobmk   -0.86%
>> 456.hmmer   14.83%
>> 458.sjeng   2.49%
>> 462.libquantum  -0.79%
>> GEOMEAN 1.68%
>>
>> There is also about 0.36% improvement for spec2k6/fp, mostly because of case
>> 436.cactusADM.  I believe it can be further improved, but that should be
>> another patch.
>>
>> I also collected benchmark data for x86_64.  Spec2k6/fp is not affected.  As
>> for spec2k6/int, though the geomean is improved slightly, 400.perlbench is
>> regressed by ~3%.  I can see BIVs are chosen for some loops instead of
>> address candidates.  Generally, the loop header will be simplified because
>> iv elimination with BIV is simpler; the number of instructions in loop body
>> isn't changed.  I suspect the regression comes from different addressing
>> modes.  With BIV, complex addressing mode like [base + index << scale +
>> disp] is used, rather than [base + disp].  I guess the former has more
>> micro-ops, thus more expensive.  This guess can be confirmed by manually
>> suppressing the complex addressing mode with higher address cost.
>> Now the problem becomes why overall cost of BIV is computed lower while the
>> actual cost is higher.  I noticed for most affected loops, loop header is
>> bloated because of iv elimination using the old address candidate.  The
>> bloated loop header results in much higher cost than BIV.  As a result, BIV
>> is preferred.  I also noticed the bloated loop header generally can be
>> simplified (I have a following patch for this).  After applying the local
>> patch, the old address candidate is chosen, and most of regression is
>> recovered.
>> Conclusion is I think loop header bloated issue should be blamed for the
>> regression, and it can be resolved.
>>
>> Bootstrap and test on x86_64 and aarch64.  It fixes the failure of
>> gcc.target/i386/pr49781-1.c, without new breakage.
>>
>> So what do you think?
>
> The data above looks ok to me.
>
> +static struct iv *
> +find_deriving_biv_for_iv (struct ivopts_data *data, struct iv *iv)
> +{
> +  aff_tree aff;
> +  struct expand_data exp_data;
> +
> +  if (!iv->ssa_name || TREE_CODE (iv->ssa_name) != SSA_NAME)
> +return iv;
> +
> +  /* Expand IV's ssa_name till the deriving biv is found.  */
> +  exp_data.data = data;
> +  exp_data.biv = NULL;
> +  tree_to_aff_combination_expand (iv->ssa_name, TREE_TYPE (iv->ssa_name),
> + &aff, &data->name_expansion_cache,
> + stop_expand, &exp_data);
> +  return exp_data.biv;
>
> that's actually "abusing" tree_to_aff_combination_expand for simply walking
> SSA uses and their defs uses recursively until you hit "stop".  ISTR past
> discussion to add a generic walk_ssa_use interface for that.  Not sure if it
> materialized with a name I can't remember or whether it didn't.
Thanks for reviewing.  I didn't find an existing interface to walk up
definition chains of ssa vars.  In this updated patch, I implemented a
simple function which meets the minimal requirement of walking up
definition chains of BIV variables.  I also counted the number of
no_overflow BIVs that are not used in address type uses.  Since
generally there 

Re: [PATCH PR66388]Add sizetype cand for BIV of smaller type if it's used as index of memory ref

2015-09-08 Thread Bin.Cheng
On Tue, Sep 8, 2015 at 6:06 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Wed, Sep 2, 2015 at 10:12 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Wed, Sep 2, 2015 at 5:26 AM, Bin Cheng <bin.ch...@arm.com> wrote:
>>> Hi,
>>> This patch is a new approach to fix PR66388.  IVO today computes iv_use with
>>> iv_cand which has at least same type precision as the use.  On 64bit
>>> platforms like AArch64, this results in different iv_cand created for each
>>> address type iv_use, and register pressure increased.  As a matter of fact,
>>> the BIV should be used for all iv_uses in some of these cases.  It is a
>>> latent bug but recently getting worse because of overflow changes.
>>>
>>> The original approach at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-07/msg01484.html can fix the issue
>>> except it conflicts with IV elimination.  Seems to me it is impossible to
>>> mitigate the contradiction.
>>>
>>> This new approach fixes the issue by adding sizetype iv_cand for BIVs
>>> directly.  In cases if the original BIV is preferred, the sizetype iv_cand
>>> will be chosen.  As for code generation, the sizetype iv_cand has the same
>>> effect as the original BIV.  Actually, it's better because BIV needs to be
>>> explicitly extended to sizetype to be used in address expression on most
>>> targets.
>>>
>>> One shortage of this approach is it may introduce more iv candidates.  To
>>> minimize the impact, this patch does sophisticated code analysis and adds
>>> sizetype candidate for BIV only if it is used as index.  Moreover, it avoids
>>> to add candidate of the original type if the BIV is only used as index.
>>> Statistics for compiling spec2k6 shows increase of candidate number is
>>> modest and can be ignored.
>>>
>>> There are two more patches following to fix corner cases revealed by this
>>> one.  Together they bring an obvious perf improvement for spec2k6/int on
>>> aarch64.
>>> Spec2k6/int
>>> 400.perlbench   3.44%
>>> 445.gobmk   -0.86%
>>> 456.hmmer   14.83%
>>> 458.sjeng   2.49%
>>> 462.libquantum  -0.79%
>>> GEOMEAN 1.68%
>>>
>>> There is also about 0.36% improvement for spec2k6/fp, mostly because of case
>>> 436.cactusADM.  I believe it can be further improved, but that should be
>>> another patch.
>>>
>>> I also collected benchmark data for x86_64.  Spec2k6/fp is not affected.  As
>>> for spec2k6/int, though the geomean is improved slightly, 400.perlbench is
>>> regressed by ~3%.  I can see BIVs are chosen for some loops instead of
>>> address candidates.  Generally, the loop header will be simplified because
>>> iv elimination with BIV is simpler; the number of instructions in loop body
>>> isn't changed.  I suspect the regression comes from different addressing
>>> modes.  With BIV, complex addressing mode like [base + index << scale +
>>> disp] is used, rather than [base + disp].  I guess the former has more
>>> micro-ops, thus more expensive.  This guess can be confirmed by manually
>>> suppressing the complex addressing mode with higher address cost.
>>> Now the problem becomes why overall cost of BIV is computed lower while the
>>> actual cost is higher.  I noticed for most affected loops, loop header is
>>> bloated because of iv elimination using the old address candidate.  The
>>> bloated loop header results in much higher cost than BIV.  As a result, BIV
>>> is preferred.  I also noticed the bloated loop header generally can be
>>> simplified (I have a following patch for this).  After applying the local
>>> patch, the old address candidate is chosen, and most of regression is
>>> recovered.
>>> Conclusion is I think loop header bloated issue should be blamed for the
>>> regression, and it can be resolved.
>>>
>>> Bootstrap and test on x86_64 and aarch64.  It fixes the failure of
>>> gcc.target/i386/pr49781-1.c, without new breakage.
>>>
>>> So what do you think?
>>
>> The data above looks ok to me.
>>
>> +static struct iv *
>> +find_deriving_biv_for_iv (struct ivopts_data *data, struct iv *iv)
>> +{
>> +  aff_tree aff;
>> +  struct expand_data exp_data;
>> +
>> +  if (!iv->ssa_name || TREE_CODE (iv->ssa_name) != SSA_NAME)
>> +return iv;
>> +
>> +  /* Expand IV's ssa_name till the deriving biv is found.  */
>> +  exp_data.data = data;

Re: [PATCH GCC][rework]Improve loop bound info by simplifying conversions in iv base

2015-08-27 Thread Bin.Cheng
On Thu, Aug 27, 2015 at 6:54 PM, Ajit Kumar Agarwal
ajit.kumar.agar...@xilinx.com wrote:


 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
 Behalf Of Bin Cheng
 Sent: Thursday, August 27, 2015 3:12 PM
 To: gcc-patches@gcc.gnu.org
 Subject: [PATCH GCC][rework]Improve loop bound info by simplifying 
 conversions in iv base

 Hi,
This is a rework for
https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02335.html, with review 
comments addressed.  For now, SCEV may compute iv base in the form of 
(signed T)((unsigned T)base + step)).  This complicates other 
optimizations/analysis depending on SCEV because it's hard to dive into type 
conversions.  This kind of type conversions can be simplified with 
additional range information implied by loop initial conditions.  This patch 
does such simplification.
With simplified iv base, loop niter analysis can compute more accurate bound 
information since sensible value range can be derived for base+step.  For 
example, accurate loop bound and may_be_zero information is computed for cases 
added by this patch.

The code is actually moved from loop_exits_before_overflow.  After this 
patch, the corresponding code in loop_exits_before_overflow will be never 
executed, so I removed that part code.  The patch also includes some code 
format changes.

Bootstrap and test on x86_64.  Is it OK?

 The scalar evolution calculates the chrec (base, +, step) based on the
 chain of recurrence through induction variable expressions,
 propagating the values in SSA representation to arrive at the above chrec.
 If the base value assigned is unsigned and the declaration of
I don't quite get the meaning of "If the base value assigned is
unsigned".  The conversion comes from the C standard's type
promotion for induction variables which have signed types smaller than
int.

Thanks,
bin
 the base is signed, then the above chrec is derived based on a conversion
 from unsigned to signed? Such type conversions can be ignored for the
 calculation of the iteration bound, as this cannot overflow in any case.
 Is that what the below patch aims at?

 Thanks & Regards
 Ajit

 Thanks,
 bin

 2015-08-27  Bin Cheng  bin.ch...@arm.com

 * tree-ssa-loop-niter.c (tree_simplify_using_condition_1): Support
 new parameter.
 (tree_simplify_using_condition): Ditto.
 (simplify_using_initial_conditions): Ditto.
 (loop_exits_before_overflow): Pass new argument to function
 simplify_using_initial_conditions.  Remove case for type conversions
 simplification.
 * tree-ssa-loop-niter.h (simplify_using_initial_conditions): New
 parameter.
 * tree-scalar-evolution.c (simple_iv): Simplify type conversions
 in iv base using loop initial conditions.

 gcc/testsuite/ChangeLog
 2015-08-27  Bin Cheng  bin.ch...@arm.com

 * gcc.dg/tree-ssa/loop-bound-2.c: New test.
 * gcc.dg/tree-ssa/loop-bound-4.c: New test.
 * gcc.dg/tree-ssa/loop-bound-6.c: New test.


Re: [PATCH 3/5] Build ARRAY_REFs when the base is of ARRAY_TYPE.

2015-08-26 Thread Bin.Cheng
On Wed, Aug 26, 2015 at 3:29 PM, Richard Biener rguent...@suse.de wrote:
 On Wed, 26 Aug 2015, Bin.Cheng wrote:

 On Wed, Aug 26, 2015 at 3:50 AM, Jeff Law l...@redhat.com wrote:
  On 08/25/2015 05:06 AM, Alan Lawrence wrote:
 
  When SRA completely scalarizes an array, this patch changes the
  generated accesses from e.g.
 
  MEM[(int[8] *)&a + 4B] = 1;
 
  to
 
  a[1] = 1;
 
  This overcomes a limitation in dom2, that accesses to equivalent
  chunks of e.g. MEM[(int[8] *)&a] are not hashable_expr_equal_p with
  accesses to e.g. MEM[(int[8] *)&a]. This is necessary for constant
  propagation in the ssa-dom-cse-2.c testcase (after the next patch
  that makes SRA handle constant-pool loads).
 
  I tried to work around this by making dom2's hashable_expr_equal_p
  less conservative, but found that on platforms without AArch64's
  vectorized reductions (specifically Alpha, hppa, PowerPC, and SPARC,
  mentioned in ssa-dom-cse-2.c), I also needed to make MEM[(int[8]
  *)&a] equivalent to a[0], etc.; a complete overhaul of
  hashable_expr_equal_p seems like a larger task than this patch
  series.
 
  I can't see how to write a testcase for this in C though as direct
  assignment to an array is not possible; such assignments occur only
  with constant pool data, which is dealt with in the next patch.
 
  It's a general issue that if there's > 1 common way to represent an
  expression, then DOM will often miss discovery of the CSE opportunity
  because of the way it hashes expressions.
 
  Ideally we'd be moving to a canonical form, but I also realize that in
  the case of memory references like this, that may not be feasible.
IIRC, there were talks about lowering all memory references on GIMPLE?
Which is the reverse approach.  Since SRA is in a quite early
compilation stage, I don't know if lowered memory references have an
impact on other optimizers.

 Yeah, I'd only do the lowering after loop opts.  Which also may make
 the DOM issue moot as the array refs would be lowered as well and thus
 DOM would see a consistent set of references again.  The lowering should
 also simplify SLSR and expose address computation redundancies to DOM.

 I'd place such lowering before the late reassoc (any takers?  I suppose
 you can pick up one of the bitfield lowering passes posted in the
 previous years as this should also handle bitfield accesses correctly).
I ran into several issues related to lowered memory references (some
of them are about slsr), and want to have a look at this.  But only
after finishing major issues in IVO...

As for slsr, I think the problem is more about needing to prove
equality of expressions by diving into the definition chains of ssa
vars, just like tree_to_aff_combination_expand does.  I think this has
already been discussed too.  Anyway, lowering memory references provides
a canonical form and should benefit other optimizers.
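
To make the two-spellings problem concrete, here is a hand-written
illustration of the dom2 limitation discussed above (mine, not from the
patch series):

/* After full scalarization the same element may be written as
   MEM[(int[8] *)&a + 4B] in one statement and read as a[1] in another;
   DOM hashes the two forms differently, so the redundant load below is
   not eliminated unless both statements use the same spelling.  */
int
f (void)
{
  int a[8];
  a[1] = 1;
  return a[1];
}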

Thanks,
bin

 Thanks,
 Richard.

 Thanks,
 bin
 
  It does make me wonder how many CSEs we're really missing due to the two
  ways to represent array accesses.
 
 
  Bootstrap + check-gcc on x86-none-linux-gnu,
  arm-none-linux-gnueabihf, aarch64-none-linux-gnu.
 
  gcc/ChangeLog:
 
  * tree-sra.c (completely_scalarize): Move some code into:
  (get_elem_size): New. (build_ref_for_offset): Build ARRAY_REF if base
  is aligned array. --- gcc/tree-sra.c | 110
  - 1 file
  changed, 69 insertions(+), 41 deletions(-)
 
  diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
  index 08fa8dc..af35fcc 100644
  --- a/gcc/tree-sra.c
  +++ b/gcc/tree-sra.c
  @@ -957,6 +957,20 @@ scalarizable_type_p (tree type)
    }
  }
 
  +static bool
  +get_elem_size (const_tree type, unsigned HOST_WIDE_INT *sz_out)
 
  Function comment needed.
 
  I may have missed it in the earlier patches, but can you please make
  sure any new functions you created have comments in those as well.  Such
  patches are pre-approved.
 
  With the added function comment, this patch is fine.
 
  jeff
 
 



 --
 Richard Biener rguent...@suse.de
 SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
 21284 (AG Nuernberg)


Re: [PATCH 3/5] Build ARRAY_REFs when the base is of ARRAY_TYPE.

2015-08-25 Thread Bin.Cheng
On Wed, Aug 26, 2015 at 3:50 AM, Jeff Law l...@redhat.com wrote:
 On 08/25/2015 05:06 AM, Alan Lawrence wrote:

 When SRA completely scalarizes an array, this patch changes the
 generated accesses from e.g.

 MEM[(int[8] *)&a + 4B] = 1;

 to

 a[1] = 1;

 This overcomes a limitation in dom2, that accesses to equivalent
 chunks of e.g. MEM[(int[8] *)&a] are not hashable_expr_equal_p with
 accesses to e.g. MEM[(int[8] *)&a]. This is necessary for constant
 propagation in the ssa-dom-cse-2.c testcase (after the next patch
 that makes SRA handle constant-pool loads).

 I tried to work around this by making dom2's hashable_expr_equal_p
 less conservative, but found that on platforms without AArch64's
 vectorized reductions (specifically Alpha, hppa, PowerPC, and SPARC,
 mentioned in ssa-dom-cse-2.c), I also needed to make MEM[(int[8]
 *)&a] equivalent to a[0], etc.; a complete overhaul of
 hashable_expr_equal_p seems like a larger task than this patch
 series.

 I can't see how to write a testcase for this in C though as direct
 assignment to an array is not possible; such assignments occur only
 with constant pool data, which is dealt with in the next patch.

 It's a general issue that if there's > 1 common way to represent an
 expression, then DOM will often miss discovery of the CSE opportunity
 because of the way it hashes expressions.

 Ideally we'd be moving to a canonical form, but I also realize that in
 the case of memory references like this, that may not be feasible.
IIRC, there were talks about lowering all memory references on GIMPLE?
Which is the reverse approach.  Since SRA is in a quite early
compilation stage, I don't know if lowered memory references have an
impact on other optimizers.

Thanks,
bin

 It does make me wonder how many CSEs we're really missing due to the two
 ways to represent array accesses.


 Bootstrap + check-gcc on x86-none-linux-gnu,
 arm-none-linux-gnueabihf, aarch64-none-linux-gnu.

 gcc/ChangeLog:

 * tree-sra.c (completely_scalarize): Move some code into:
 (get_elem_size): New.
 (build_ref_for_offset): Build ARRAY_REF if base is aligned array.
---
 gcc/tree-sra.c | 110 +++++++++++++++++++++++++++++---------------
 1 file changed, 69 insertions(+), 41 deletions(-)

 diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
 index 08fa8dc..af35fcc 100644
 --- a/gcc/tree-sra.c
 +++ b/gcc/tree-sra.c
 @@ -957,6 +957,20 @@ scalarizable_type_p (tree type)
   }
 }

 +static bool
 +get_elem_size (const_tree type, unsigned HOST_WIDE_INT *sz_out)

 Function comment needed.

 I may have missed it in the earlier patches, but can you please make
 sure any new functions you created have comments in those as well.  Such
 patches are pre-approved.

 With the added function comment, this patch is fine.

 jeff




Re: [PATCH GCC]Improve loop bound info by simplifying conversions in iv base

2015-08-20 Thread Bin.Cheng
On Fri, Aug 14, 2015 at 4:28 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Tue, Jul 28, 2015 at 11:38 AM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 For now, SCEV may compute iv base in the form of (signed T)((unsigned
 T)base + step).  This complicates other optimizations/analysis depending
 on SCEV because it's hard to dive into type conversions.  For many cases,
 such type conversions can be simplified with additional range information
 implied by loop initial conditions.  This patch does such simplification.
 With simplified iv base, loop niter analysis can compute more accurate bound
 information since sensible value range can be derived for base+step.  For
 example, accurate loop boundmay_be_zero information is computed for cases
 added by this patch.
 The code is actually borrowed from loop_exits_before_overflow.  Moreover,
 with simplified iv base, the second case handled in that function now
 becomes the first case.  I didn't remove that part of code because it may(?)
 still be visited in scev analysis itself and simple_iv isn't an interface
 for that.

 Is it OK?

 It looks quite special given it only handles a very specific pattern.  Did you
 do any larger collecting of statistics on how many times this triggers,
 esp. how many times simplify_using_initial_conditions succeeds and
 how many times not?  This function is somewhat expensive.
Yes, this is corner case targeting induction variables of small signed
types, just like added test cases.  We need to convert it to unsigned,
do the stepping, and convert back.  I collected statistics for gcc
bootstrap and spec2k6.  The function is called about 400-500 times in
both case.  About 45% of calls succeeded in bootstrap, while only ~3%
succeeded in spec2k6.

I will prepare a new version patch if you think it's worthwhile in
terms of compilation cost and benefit.

Thanks,
bin

 +  || !operand_equal_p (iv->step,
 +  fold_convert (type,
 +TREE_OPERAND (e, 1)), 0))

 operand_equal_p can handle sign-differences in integer constants,
 no need to fold_convert here.  Also if you know that you are comparing
 integer constants please use tree_int_cst_equal_p.

 +  extreme = lower_bound_in_type (type, type);

 that's a strange function to call here (with two same types).  Looks like
 just wide_int_to_tree (type, wi::max/min_value (type)).

 +  extreme = fold_build2 (MINUS_EXPR, type, extreme, iv->step);

 so as iv->step is an INTEGER_CST please do this whole thing using
 wide_ints and only build trees here:

 +  e = fold_build2 (code, boolean_type_node, base, extreme);

 Thanks,
 Richard.

 Thanks,
 bin

 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * tree-ssa-loop-niter.c (tree_simplify_using_condition): Export
 the interface.
 * tree-ssa-loop-niter.h (tree_simplify_using_condition): Declare.
 * tree-scalar-evolution.c (simple_iv): Simplify type conversions
 in iv base using loop initial conditions.

 gcc/testsuite/ChangeLog
 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * gcc.dg/tree-ssa/loop-bound-2.c: New test.
 * gcc.dg/tree-ssa/loop-bound-4.c: New test.
 * gcc.dg/tree-ssa/loop-bound-6.c: New test.


Re: [PATCH GCC]Improve bound information in loop niter analysis

2015-08-18 Thread Bin.Cheng
On Mon, Aug 17, 2015 at 6:49 PM, Ajit Kumar Agarwal
ajit.kumar.agar...@xilinx.com wrote:
 All:

 Does the logic to calculate the loop bound information through value range
 analysis use the post-dominator and dominator info?  The iteration
 branches, instead of the loop exit condition, can be calculated through
 post-dominator info.
 If a node in the loop has two successors and post-dominates the two
 successors, then the iteration branch can be the same node.

 For all the nodes L in the loop B
 If (L1, L2 belong to successors of (L) && L1, L2 belong to PosDom(Header of
 Loop))
 {
   I = I union L1
 }

 Thus I will have the full set of iteration branches. This will handle more
 cases of loop bound information that will be accurate through the exact
 iteration count for the known cases, along with value range information,
 where the condition is not at the loop exit but at other nodes in the loop.

I don't quite follow your words here.  Could you please give a simple
example about it?  Especially I don't know how post-dom helps the loop
bound analysis.  Seems your pseudo code is collecting some comparison
basic blocks of the loop?

Thanks,
bin

 Thanks & Regards
 Ajit


 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
 Behalf Of Bin.Cheng
 Sent: Monday, August 17, 2015 3:32 PM
 To: Richard Biener
 Cc: Bin Cheng; GCC Patches
 Subject: Re: [PATCH GCC]Improve bound information in loop niter analysis

 Thanks for all your reviews.

 On Fri, Aug 14, 2015 at 4:17 PM, Richard Biener richard.guent...@gmail.com 
 wrote:
 On Tue, Jul 28, 2015 at 11:36 AM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 Loop niter computes inaccurate bound information for different loops.
 This patch is to improve it by using loop initial condition in
 determine_value_range.  Generally, loop niter is computed by
 subtracting start var from end var in loop exit condition.  Moreover,
 loop bound is computed using value range information of both start and end 
 variables.
 Basic idea of this patch is to check if loop initial condition
 implies more range information for both start/end variables.  If yes,
 we refine range information and use that to compute loop bound.
 With this improvement, more accurate loop bound information is
 computed for test cases added by this patch.

 +  c0 = fold_convert (type, c0);
 +  c1 = fold_convert (type, c1);
 +
 +  if (operand_equal_p (var, c0, 0))

 I believe if c0 is not already of type "type", operand_equal_p will never
 succeed.
 It's a quite specific case targeting comparisons between var and its range
 bounds.  Given c0 is in the form of var + offc0, the comparison var +
 offc0 != range bounds doesn't carry any useful information.  Maybe useless
 type conversions could be handled here too, though that might be an even
 more extreme corner case.


 (side-note: we should get rid of the GMP use, that's expensive and now
 we have wide-int available which should do the trick as well)

 + /* Case of comparing with the bounds of the type.  */
 + if (TYPE_MIN_VALUE (type)
 +  operand_equal_p (c1, TYPE_MIN_VALUE (type), 0))
 +   cmp = GT_EXPR;
 + if (TYPE_MAX_VALUE (type)
 +  operand_equal_p (c1, TYPE_MAX_VALUE (type), 0))
 +   cmp = LT_EXPR;

 don't use TYPE_MIN/MAX_VALUE.  Instead use the type's precision and all
 wide_int operations (see match.pd wi::max_value use).
 Done.


 +  else if (!operand_equal_p (var, varc0, 0))
 +goto end_2;

 ick - goto.  We need sth like an auto_mpz class with a destructor.
 Label end_2 removed.


 struct auto_mpz
 {
   auto_mpz () { mpz_init (m_val); }
   ~auto_mpz () { mpz_clear (m_val); }
   operator mpz_t & () { return m_val; }
   mpz_t m_val;
 };

 Is it OK?

 I see the code follows existing practice in niter analysis even though
 my overall plan was to transition its copying of value-range related
 optimizations to use VRP infrastructure.
 Yes, I think it's easy to push it to the VRP infrastructure.  Actually,
 from the name of the function, it's more vrp related.  For now, the
 function is called only by bound_difference, not nearly as often as vrp
 queries.  We would need a cache facility in vrp, otherwise it would be
 expensive.
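
 To be concrete, the flow-sensitive refinement is of this shape (my own
 sketch; the guard dominating the loop supplies the extra range):

 /* The entry guard n < 100 implies a tighter range for 'n' than any
    global VRP range, so the computed bound for the loop below is at
    most 99 iterations instead of INT_MAX.  */
 int
 f (int *a, int n)
 {
   int s = 0;
   if (n < 100)
     for (int i = 0; i < n; i++)
       s += a[i];
   return s;
 }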


 I'm still ok with improving the existing code on the basis that I
 won't get to that for GCC 6.

 So - ok with the TYPE_MIN/MAX_VALUE change suggested above.

 Refactoring with auto_mpz welcome.
 That will be an independent patch, so I skipped it in this one.

 New version attached.  Bootstrap and test on x86_64.

 Thanks,
 bin

 Thanks,
 Richard.

 Thanks,
 bin

 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * tree-ssa-loop-niter.c (refine_value_range_using_guard): New.
 (determine_value_range): Call refine_value_range_using_guard for
 each loop initial condition to improve value range.

 gcc/testsuite/ChangeLog
 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * gcc.dg/tree-ssa/loop-bound-1.c: New test.
 * gcc.dg/tree-ssa/loop-bound-3

Re: [PATCH GCC]Improve bound information in loop niter analysis

2015-08-18 Thread Bin.Cheng
On Tue, Aug 18, 2015 at 4:02 PM, Ajit Kumar Agarwal
ajit.kumar.agar...@xilinx.com wrote:


 -Original Message-
 From: Bin.Cheng [mailto:amker.ch...@gmail.com]
 Sent: Tuesday, August 18, 2015 1:08 PM
 To: Ajit Kumar Agarwal
 Cc: Richard Biener; Bin Cheng; GCC Patches; Vinod Kathail; Shail Aditya 
 Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
 Subject: Re: [PATCH GCC]Improve bound information in loop niter analysis

 On Mon, Aug 17, 2015 at 6:49 PM, Ajit Kumar Agarwal 
 ajit.kumar.agar...@xilinx.com wrote:
 All:

 Does the logic to calculate the loop bound information through value
 range analysis use the post-dominator and dominator info?  The iteration
 branches, instead of the loop exit condition, can be calculated through
 post-dominator info.
 If a node in the loop has two successors and post-dominates the two
 successors, then the iteration branch can be the same node.

 For all the nodes L in the loop B
 If (L1, L2 belong to successors of (L) && L1, L2 belong to
 PosDom(Header of Loop)) {
   I = I union L1
 }

 Thus I will have the full set of iteration branches. This will handle
 more cases of loop bound information that will be accurate through the
 exact iteration count for the known cases, along with value range
 information, where the condition is not at the loop exit but at other
 nodes in the loop.

I don't quite follow your words here.  Could you please give a simple 
example about it?  Especially I don't know how post-dom helps the loop bound 
analysis.  Seems your pseudo code is collecting some comparison basic 
blocks of the loop?

 The algorithm I have given above is based on post-dominator info. This
 helps to calculate the iteration branches. The iteration branches are the
 branches that determine the loop exit condition. Based on the condition it
 either branches to the header of the loop, or it may branch to a block
 dominated by the header, or exit from the loop. The above algorithm
 finds out such iteration branches and thus decides on the loop bound
 or iteration count. Such iteration branches need not be at the back edge
 node; they may be nodes inside the loop, depending on conditions.
 Finding out such iteration branches can be done through the post-dominator
 info using the above algorithm.  Based on the iteration branches the
 conditions can be analyzed, and that helps in finding out the iteration
 bound for known cases. Known cases are the cases where the loop bound can
 be determined at compile time.

 One example would be multi-exit loops, where a loop exit condition can be
 at the back edge, or in a block inside the loop that breaks out based on
 an IF condition, thus giving multiple exits. Such iteration branches can
 be found using the above algorithm.
As far as I understand, GCC already does loop niter analysis on the
basis of exit edges (and the corresponding blocks and conditions).  And
yes, it uses scev/vrp information for the analysis.  The original
patch is to improve the analysis with flow-sensitive range
information.  It's not that GCC can't work on exit edges.
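
For concreteness, the multi-exit shape being described (an illustrative
example of mine):

/* Two iteration branches: the latch test i < n and the early return in
   the body.  Niter analysis already considers each exit edge; the
   overall bound is taken over the exits it can analyze.  */
int
find (const int *a, int n, int key)
{
  for (int i = 0; i < n; i++)
    if (a[i] == key)
      return i;
  return -1;
}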

Thanks,
bin

 Thanks & Regards
 Ajit

 Thanks,
 bin

 Thanks & Regards
 Ajit


 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org
 [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Bin.Cheng
 Sent: Monday, August 17, 2015 3:32 PM
 To: Richard Biener
 Cc: Bin Cheng; GCC Patches
 Subject: Re: [PATCH GCC]Improve bound information in loop niter
 analysis

 Thanks for all your reviews.

 On Fri, Aug 14, 2015 at 4:17 PM, Richard Biener richard.guent...@gmail.com 
 wrote:
 On Tue, Jul 28, 2015 at 11:36 AM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 Loop niter computes inaccurate bound information for different loops.
 This patch is to improve it by using loop initial condition in
 determine_value_range.  Generally, loop niter is computed by
 subtracting start var from end var in loop exit condition.
 Moreover, loop bound is computed using value range information of both 
 start and end variables.
 Basic idea of this patch is to check if loop initial condition
 implies more range information for both start/end variables.  If
 yes, we refine range information and use that to compute loop bound.
 With this improvement, more accurate loop bound information is
 computed for test cases added by this patch.

 +  c0 = fold_convert (type, c0);
 +  c1 = fold_convert (type, c1);
 +
 +  if (operand_equal_p (var, c0, 0))

 I believe if c0 is not already of type "type", operand_equal_p will never
 succeed.
 It's a quite specific case targeting comparisons between var and its range
 bounds.  Given c0 is in the form of var + offc0, the comparison var +
 offc0 != range bounds doesn't carry any useful information.  Maybe useless
 type conversions could be handled here too, though that might be an even
 more extreme corner case.


 (side-note: we should get rid of the GMP use, that's expensive and
 now we have wide-int available which should do the trick as well)

 + /* Case of comparing

Re: [PATCH] ivopts costs debug

2015-08-18 Thread Bin.Cheng
On Wed, Aug 19, 2015 at 5:19 AM, Segher Boessenkool
seg...@kernel.crashing.org wrote:
 Hi,

 I've used this patch in the past for another port, and now again for
 rs6000, and I think it is generally useful.  It prints very verbose
 information to the dump file about how ivopts comes up with its costs
 for various forms of memory accesses, which tends to show problems in
 the target's address cost functions and the legitimize functions.

 In this patch it is disabled by default -- it is very chatty.

Hi,
I ran into a back-end address cost issue before, and this should be
useful in such cases.  Though there is a lot of dump output, it would be
better to classify it under an existing dump option (TDF_DETAILS?) and
discard the use of the macro.  Also, since the address cost will be
tuned/dumped later, we should differentiate between them by emphasizing
that this part of the dump is the original cost from the back-end.
Maybe there will be some other comments.
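A sketch of that suggestion (assuming the usual dump_flags gating; this
is not part of the posted patch):

  /* Gate the verbose cost dump on -fdump-tree-ivopts-details instead
     of a compile-time macro.  */
  if (dump_file && (dump_flags & TDF_DETAILS))
    fprintf (dump_file, "original back-end cost of seq is %u\n", acost1);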


 It also shows that the LAST_VIRTUAL_REGISTER trickery ivopts does
 does not work (legitimize_address can create new registers, so now
 a) we have new registers anyway, and b) we use some for multiple
 purposes.  Oops).
Yes, that makes seq dump a little weird.

Thanks,
bin

 Is this okay for trunk?  Bootstrapped and tested on powerpc64-linux.


 Segher


 2015-08-18  Segher Boessenkool  seg...@kernel.crashing.org

 * tree-ssa-loop-ivopts.c (IVOPTS_DEBUG_COSTS): New define.
 (get_address_cost): Add address cost debug code.

 ---
  gcc/tree-ssa-loop-ivopts.c | 50 ++++++++++++++++++++++++++++++++++++++++++--
  1 file changed, 48 insertions(+), 2 deletions(-)

 diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
 index 6bce3a1..ae29a6f 100644
 --- a/gcc/tree-ssa-loop-ivopts.c
 +++ b/gcc/tree-ssa-loop-ivopts.c
 @@ -17,6 +17,9 @@ You should have received a copy of the GNU General Public 
 License
  along with GCC; see the file COPYING3.  If not see
  <http://www.gnu.org/licenses/>.  */

 +/* Define to 1 to debug how this pass comes up with its cost estimates.  */
 +#define IVOPTS_DEBUG_COSTS 0
 +
  /* This pass tries to find the optimal set of induction variables for the 
 loop.
 It optimizes just the basic linear induction variables (although adding
 support for other types should not be too hard).  It includes the
 @@ -3743,8 +3746,51 @@ get_address_cost (bool symbol_present, bool 
 var_present,
   seq = get_insns ();
   end_sequence ();

 - acost = seq_cost (seq, speed);
 - acost += address_cost (addr, mem_mode, as, speed);
 + unsigned acost1 = seq_cost (seq, speed);
 + unsigned acost2 = address_cost (addr, mem_mode, as, speed);
 +
 + if (dump_file && IVOPTS_DEBUG_COSTS)
 +   {
 + fprintf (dump_file, "=== sequence generated for ");
 + if (sym_p)
 +   fprintf (dump_file, "sym + ");
 + if (var_p)
 +   fprintf (dump_file, "var + ");
 + if (off_p)
 +   fprintf (dump_file, "cst + ");
 + if (rat_p)
 +   fprintf (dump_file, "rat * ");
 + fprintf (dump_file, "index:\n");
 +
 + print_rtl (dump_file, seq);
 +
 + fprintf (dump_file, "\n  cost of seq is %u", acost1);
 +
 + if (seq && NEXT_INSN (seq))
 +   {
 + fprintf (dump_file, " (namely,");
 +
 + for (rtx_insn *insn = seq; insn; insn = NEXT_INSN (insn))
 +   {
 + unsigned cost;
 + rtx set = single_set (insn);
 + if (set)
 +   cost = set_rtx_cost (set, speed);
 + else
 +   cost = 1;
 + fprintf (dump_file, " %u", cost);
 +   }
 +
 + fprintf (dump_file, ")");
 +   }
 +
 + fprintf (dump_file, "\n\nremaining address is:\n");
 + print_rtl_single (dump_file, addr);
 +
 + fprintf (dump_file, "\n  cost of that address is %u\n\n", acost2);
 +   }
 +
 + acost = acost1 + acost2;

   if (!acost)
 acost = 1;
 --
 1.8.1.4



Re: [PATCH GCC]Improve bound information in loop niter analysis

2015-08-18 Thread Bin.Cheng
On Wed, Aug 19, 2015 at 2:14 AM, Jeff Law l...@redhat.com wrote:
 On 08/17/2015 04:01 AM, Bin.Cheng wrote:


 +  c0 = fold_convert (type, c0);
 +  c1 = fold_convert (type, c1);
 +
 +  if (operand_equal_p (var, c0, 0))

 I believe if c0 is not already of type "type" operand_equal_p will never
 succeed.

 It's quite a specific case targeting comparisons between var and its
 range bounds.  Given c0 is in the form of var + offc0, the
 comparison var + offc0 != range bound doesn't carry any useful
 information.  Maybe useless type conversions could be handled here,
 though that might be a corner case.

 My comment about useless type conversions was more about a deficiency in
 operand_equal_p's implementation.  It wasn't something I felt needed to be
 addressed in your patch.

 I think using operand_equal_p is fine here.
Hi Jeff,
I misunderstood the point.  Thanks for explanation.  Given the
approval, new version patch is applied.

Thanks,
bin


 Jeff


Re: [PATCH GCC]Improve bound information in loop niter analysis

2015-08-17 Thread Bin.Cheng
Thanks for all your reviews.

On Fri, Aug 14, 2015 at 4:17 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Tue, Jul 28, 2015 at 11:36 AM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 Loop niter computes inaccurate bound information for different loops.  This
 patch is to improve it by using loop initial condition in
 determine_value_range.  Generally, loop niter is computed by subtracting
 start var from end var in loop exit condition.  Moreover, loop bound is
 computed using value range information of both start and end variables.
 Basic idea of this patch is to check if loop initial condition implies more
 range information for both start/end variables.  If yes, we refine range
 information and use that to compute loop bound.
 With this improvement, more accurate loop bound information is computed for
 test cases added by this patch.

 +  c0 = fold_convert (type, c0);
 +  c1 = fold_convert (type, c1);
 +
 +  if (operand_equal_p (var, c0, 0))

 I believe if c0 is not already of type "type" operand_equal_p will never 
 succeed.
It's quite a specific case targeting comparisons between var and its
range bounds.  Given c0 is in the form of var + offc0, the
comparison var + offc0 != range bound doesn't carry any useful
information.  Maybe useless type conversions could be handled here,
though that might be a corner case.


 (side-note: we should get rid of the GMP use, that's expensive and now we
 have wide-int available which should do the trick as well)

 + /* Case of comparing with the bounds of the type.  */
 + if (TYPE_MIN_VALUE (type)
 +     && operand_equal_p (c1, TYPE_MIN_VALUE (type), 0))
 +   cmp = GT_EXPR;
 + if (TYPE_MAX_VALUE (type)
 +     && operand_equal_p (c1, TYPE_MAX_VALUE (type), 0))
 +   cmp = LT_EXPR;

 don't use TYPE_MIN/MAX_VALUE.  Instead use the types precision
 and all wide_int operations (see match.pd wi::max_value use).
Done.


 +  else if (!operand_equal_p (var, varc0, 0))
 +goto end_2;

 ick - goto.  We need sth like a auto_mpz class with a destructor.
Label end_2 removed.


 struct auto_mpz
 {
   auto_mpz () { mpz_init (m_val); }
   ~auto_mpz () { mpz_clear (m_val); }
   mpz operator() { return m_val; }
   mpz m_val;
 };
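As a side note, mpz_t is a one-element array type, so a compilable
version of this sketch would likely expose the value via a conversion
operator (an assumption for illustration, not from the mail):

  #include <gmp.h>

  struct auto_mpz
  {
    auto_mpz () { mpz_init (m_val); }
    ~auto_mpz () { mpz_clear (m_val); }
    /* Decays to mpz_ptr when passed to GMP functions.  */
    operator mpz_t& () { return m_val; }
    mpz_t m_val;
  };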

 Is it OK?

 I see the code follows existing practice in niter analysis even though
 my overall plan was to transition its copying of value-range related
 optimizations to use VRP infrastructure.
Yes, I think it's easy to push it into the VRP infrastructure.  Actually,
from the name of the function, it's more vrp related.  For now, the
function is called only by bound_difference, so not nearly as often as
vrp queries would be.  We would need a cache facility in vrp, otherwise
it would be expensive.


 I'm still ok with improving the existing code on the basis that I won't
 get to that for GCC 6.

 So - ok with the TYPE_MIN/MAX_VALUE change suggested above.

 Refactoring with auto_mpz welcome.
That will be an independent patch, so I skipped it in this one.

New version attached.  Bootstrap and test on x86_64.

Thanks,
bin

 Thanks,
 RIchard.

 Thanks,
 bin

 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * tree-ssa-loop-niter.c (refine_value_range_using_guard): New.
 (determine_value_range): Call refine_value_range_using_guard for
 each loop initial condition to improve value range.

 gcc/testsuite/ChangeLog
 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * gcc.dg/tree-ssa/loop-bound-1.c: New test.
 * gcc.dg/tree-ssa/loop-bound-3.c: New test.
 * gcc.dg/tree-ssa/loop-bound-5.c: New test.
Index: gcc/testsuite/gcc.dg/tree-ssa/loop-bound-3.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/loop-bound-3.c(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/loop-bound-3.c(revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int *a;
+
+int
+foo (unsigned char s, unsigned char l)
+{
+  unsigned char i;
+  int sum = 0;
+
+  for (i = s; i > l; i -= 1)
+{
+  sum += a[i];
+}
+
+  return sum;
+}
+
+/* Check loop niter bound information.  */
+/* { dg-final { scan-tree-dump "bounded by 254" "ivopts" } } */
+/* { dg-final { scan-tree-dump-not "bounded by 255" "ivopts" } } */
Index: gcc/testsuite/gcc.dg/tree-ssa/loop-bound-5.c
===
--- gcc/testsuite/gcc.dg/tree-ssa/loop-bound-5.c(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/loop-bound-5.c(revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+int *a;
+
+int
+foo (unsigned char s)
+{
+  unsigned char i;
+  int sum = 0;
+
+  for (i = s; i > 0; i -= 1)
+{
+  sum += a[i];
+}
+
+  return sum;
+}
+
+/* Check loop niter bound information.  */
+/* { dg-final { scan-tree-dump "bounded by 254" "ivopts" } } */
+/* { dg-final { scan-tree-dump-not "bounded by 255" "ivopts" } } */
Index: 

Re: [PATCH GCC]Improve bound information in loop niter analysis

2015-08-13 Thread Bin.Cheng
Ping.

Thanks,
bin

On Tue, Jul 28, 2015 at 5:36 PM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 Loop niter computes inaccurate bound information for different loops.  This
 patch is to improve it by using loop initial condition in
 determine_value_range.  Generally, loop niter is computed by subtracting
 start var from end var in loop exit condition.  Moreover, loop bound is
 computed using value range information of both start and end variables.
 Basic idea of this patch is to check if loop initial condition implies more
 range information for both start/end variables.  If yes, we refine range
 information and use that to compute loop bound.
 With this improvement, more accurate loop bound information is computed for
 test cases added by this patch.

 Is it OK?

 Thanks,
 bin

 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * tree-ssa-loop-niter.c (refine_value_range_using_guard): New.
 (determine_value_range): Call refine_value_range_using_guard for
 each loop initial condition to improve value range.

 gcc/testsuite/ChangeLog
 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * gcc.dg/tree-ssa/loop-bound-1.c: New test.
 * gcc.dg/tree-ssa/loop-bound-3.c: New test.
 * gcc.dg/tree-ssa/loop-bound-5.c: New test.


Re: [PATCH GCC]Improve loop bound info by simplifying conversions in iv base

2015-08-13 Thread Bin.Cheng
Ping.

Thanks,
bin

On Tue, Jul 28, 2015 at 5:38 PM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 For now, SCEV may compute iv base in the form of (signed T)((unsigned
 T)base + step)).  This complicates other optimizations/analysis depending
 on SCEV because it's hard to dive into type conversions.  For many cases,
 such type conversions can be simplified with additional range information
 implied by loop initial conditions.  This patch does such simplification.
 With simplified iv base, loop niter analysis can compute more accurate bound
 information since sensible value range can be derived for base+step.  For
 example, accurate loop bound & may_be_zero information is computed for cases
 added by this patch.
 The code is actually borrowed from loop_exits_before_overflow.  Moreover,
 with simplified iv base, the second case handled in that function now
 becomes the first case.  I didn't remove that part of code because it may(?)
 still be visited in scev analysis itself and simple_iv isn't an interface
 for that.

 Is it OK?

 Thanks,
 bin

 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * tree-ssa-loop-niter.c (tree_simplify_using_condition): Export
 the interface.
 * tree-ssa-loop-niter.h (tree_simplify_using_condition): Declare.
 * tree-scalar-evolution.c (simple_iv): Simplify type conversions
 in iv base using loop initial conditions.

 gcc/testsuite/ChangeLog
 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * gcc.dg/tree-ssa/loop-bound-2.c: New test.
 * gcc.dg/tree-ssa/loop-bound-4.c: New test.
 * gcc.dg/tree-ssa/loop-bound-6.c: New test.


Re: [PATCH GCC]Improve loop bound info by simplifying conversions in iv base

2015-08-13 Thread Bin.Cheng
On Fri, Aug 14, 2015 at 6:10 AM, Jeff Law l...@redhat.com wrote:
 On 07/28/2015 03:38 AM, Bin Cheng wrote:

 Hi,
 For now, SCEV may compute iv base in the form of (signed T)((unsigned
 T)base + step)).  This complicates other optimizations/analysis depending
 on SCEV because it's hard to dive into type conversions.  For many cases,
 such type conversions can be simplified with additional range information
 implied by loop initial conditions.  This patch does such simplification.
 With simplified iv base, loop niter analysis can compute more accurate
 bound
 information since sensible value range can be derived for base+step.
 For
 example, accurate loop bound & may_be_zero information is computed for cases 
 added by this patch.
 The code is actually borrowed from loop_exits_before_overflow.  Moreover,
 with simplified iv base, the second case handled in that function now
 becomes the first case.  I didn't remove that part of code because it
 may(?)
 still be visited in scev analysis itself and simple_iv isn't an interface
 for that.

 Is it OK?

 Thanks,
 bin

 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * tree-ssa-loop-niter.c (tree_simplify_using_condition): Export
 the interface.
 * tree-ssa-loop-niter.h (tree_simplify_using_condition): Declare.
 * tree-scalar-evolution.c (simple_iv): Simplify type conversions
 in iv base using loop initial conditions.

 gcc/testsuite/ChangeLog
 2015-07-28  Bin Cheng  bin.ch...@arm.com

 * gcc.dg/tree-ssa/loop-bound-2.c: New test.
 * gcc.dg/tree-ssa/loop-bound-4.c: New test.
 * gcc.dg/tree-ssa/loop-bound-6.c: New test.

 I have the same concerns about these tests...  Which makes me really think I
 must be mis-understanding something in the debugging output.

This patch tries to simplify the SCEV base with the help of the loop's initial
conditions.  The previous patch can only handle a simplified SCEV base
when analyzing the loop's bound information.  It's independent of the
previous one.  Actually it can be viewed as an independent patch even if
the previous one is proven wrong.  Of course, I would need to change the test
case in that way.

Thanks,
bin

 jeff



Re: [PATCH GCC]Improve bound information in loop niter analysis

2015-08-13 Thread Bin.Cheng
On Fri, Aug 14, 2015 at 6:08 AM, Jeff Law l...@redhat.com wrote:
 On 07/28/2015 03:36 AM, Bin Cheng wrote:

 Hi,
 Loop niter computes inaccurate bound information for different loops.
 This
 patch is to improve it by using loop initial condition in
 determine_value_range.  Generally, loop niter is computed by subtracting
 start var from end var in loop exit condition.  Moreover, loop bound is
 computed using value range information of both start and end variables.
 Basic idea of this patch is to check if loop initial condition implies
 more
 range information for both start/end variables.  If yes, we refine range
 information and use that to compute loop bound.
 With this improvement, more accurate loop bound information is computed
 for
 test cases added by this patch.

 Is it OK?

 Thanks,
 bin

 2015-07-28  Bin Chengbin.ch...@arm.com

 * tree-ssa-loop-niter.c (refine_value_range_using_guard): New.
 (determine_value_range): Call refine_value_range_using_guard for
 each loop initial condition to improve value range.

 gcc/testsuite/ChangeLog
 2015-07-28  Bin Chengbin.ch...@arm.com

 * gcc.dg/tree-ssa/loop-bound-1.c: New test.
 * gcc.dg/tree-ssa/loop-bound-3.c: New test.
 * gcc.dg/tree-ssa/loop-bound-5.c: New test.


 improve-loop-bound-analysis-20150728.txt


 Index: gcc/testsuite/gcc.dg/tree-ssa/loop-bound-3.c
 ===
 --- gcc/testsuite/gcc.dg/tree-ssa/loop-bound-3.c(revision 0)
 +++ gcc/testsuite/gcc.dg/tree-ssa/loop-bound-3.c(revision 0)
 @@ -0,0 +1,22 @@
 +/* { dg-do compile } */
 +/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
 +
 +int *a;
 +
 +int
 +foo (unsigned char s, unsigned char l)
 +{
 +  unsigned char i;
 +  int sum = 0;
 +
 +  for (i = s; i > l; i -= 1)

 So is this really bounded by 254 iterations?  ISTM it's bounded by 255
 iterations when called with s = 255, l = 0.   What am I missing here? Am I
 mis-interpreting the dump output in some way?

Thanks for the comment.
IIUC, the niter information in struct tree_niter_desc means the number
of executions of the latch of the loop, so it's 254 rather than 255:
with s = 255 and l = 0 the body runs 255 times, but the last iteration
exits without branching back, so the latch executes only 254 times.
The same goes for the bound information.  Of course, statements in the
loop body could execute bound+1 times depending on their position in the
loop.  For example, struct nb_iter_bound describes the number of
iterations of statements (exit conditions).
Moreover, if we modify the test as in below:

 +int *a;
 +
 +int
 +foo (void)
 +{
 +  unsigned char i;
 +  int sum = 0;
 +
 +  for (i = 255; i > 0; i -= 1)

GCC now successfully analyzes the loop's bound as 254.

I might be wrong about the code, so please correct me.

Thanks,
bin

 Similarly for the other tests.

 Jeff



Re: [PATCH] Optimize certain end of loop conditions into min/max operation

2015-07-26 Thread Bin.Cheng
On Mon, Jul 27, 2015 at 11:41 AM, Michael Collison
michael.colli...@linaro.org wrote:
 This patch is designed to optimize end of loop conditions of the
 form
 i < x && i < y into i < min (x, y).  Loop conditions involving '>' are
 handled similarly using max (x, y).
 As an example:

 #define N 1024

 int  a[N], b[N], c[N];

 void add (unsigned int m, unsigned int n)
 {
   unsigned int i, bound = (m < n) ? m : n;
   for (i = 0; i < m && i < n; ++i)
 a[i] = b[i] + c[i];
 }


 Performed bootstrap and make check on: x86_64_unknown-linux-gnu,
 arm-linux-gnueabihf, and aarch64-linux-gnu.
 Okay for trunk?

 2015-07-24  Michael Collison  michael.colli...@linaro.org
 Andrew Pinski andrew.pin...@caviumnetworks.com

 * match.pd ((x < y) && (x < z) -> x < min (y,z),
 (x > y) and (x > z) -> x > max (y,z))

 diff --git a/gcc/match.pd b/gcc/match.pd
 index 5e8fd32..8691710 100644
 --- a/gcc/match.pd
 +++ b/gcc/match.pd
 @@ -1793,3 +1793,17 @@ along with GCC; see the file COPYING3.  If not see
  (convert (bit_and (op (convert:utype @0) (convert:utype @1))
(convert:utype @4)))

 +
 +/* Transform (@0 < @1 and @0 < @2) to use min */
 +(for op (lt le)
 +(simplify
 +(bit_and:c (op @0 @1) (op @0 @2))
 +(if (INTEGRAL_TYPE_P (TREE_TYPE (@0)))
 +(op @0 (min @1 @2)))))
 +
 +/* Transform (@0 > @1 and @0 > @2) to use max */
 +(for op (gt ge)
 +(simplify
 +(bit_and:c (op @0 @1) (op @0 @2))
 +(if (INTEGRAL_TYPE_P (TREE_TYPE (@0)))
 +(op @0 (max @1 @2)))))

Could you please give a test case for it?  Also IIUC, this is not only a
simplification, but also a loop invariant hoist, so how does it check
invariantness?

Thanks,
bin
 --

 --
 Michael Collison
 Linaro Toolchain Working Group
 michael.colli...@linaro.org



Re: [PATCH] Optimize certain end of loop conditions into min/max operation

2015-07-26 Thread Bin.Cheng
On Mon, Jul 27, 2015 at 12:23 PM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Mon, Jul 27, 2015 at 11:41 AM, Michael Collison
 michael.colli...@linaro.org wrote:
 This patch is designed to optimize end of loop conditions of the
 form
 i < x && i < y into i < min (x, y).  Loop conditions involving '>' are
 handled similarly using max (x, y).
 As an example:

 #define N 1024

 int  a[N], b[N], c[N];

 void add (unsigned int m, unsigned int n)
 {
   unsigned int i, bound = (m < n) ? m : n;
   for (i = 0; i < m && i < n; ++i)
 a[i] = b[i] + c[i];
 }


 Performed bootstrap and make check on: x86_64_unknown-linux-gnu,
 arm-linux-gnueabihf, and aarch64-linux-gnu.
 Okay for trunk?

 2015-07-24  Michael Collison  michael.colli...@linaro.org
 Andrew Pinski andrew.pin...@caviumnetworks.com

 * match.pd ((x < y) && (x < z) -> x < min (y,z),
 (x > y) and (x > z) -> x > max (y,z))

 diff --git a/gcc/match.pd b/gcc/match.pd
 index 5e8fd32..8691710 100644
 --- a/gcc/match.pd
 +++ b/gcc/match.pd
 @@ -1793,3 +1793,17 @@ along with GCC; see the file COPYING3.  If not see
  (convert (bit_and (op (convert:utype @0) (convert:utype @1))
(convert:utype @4)))

 +
 +/* Transform (@0 < @1 and @0 < @2) to use min */
 +(for op (lt le)
 +(simplify
 +(bit_and:c (op @0 @1) (op @0 @2))
 +(if (INTEGRAL_TYPE_P (TREE_TYPE (@0)))
 +(op @0 (min @1 @2)))))
 +
 +/* Transform (@0 > @1 and @0 > @2) to use max */
 +(for op (gt ge)
 +(simplify
 +(bit_and:c (op @0 @1) (op @0 @2))
 +(if (INTEGRAL_TYPE_P (TREE_TYPE (@0)))
 +(op @0 (max @1 @2)))))

  Could you please give a test case for it?  Also IIUC, this is not only a
  simplification, but also a loop invariant hoist, so how does it check
  invariantness?

Sorry, I realized this patch only does the simplification and then lets the
lim pass decide if it can be moved.  In that way, there is no invariantness
problem; please ignore the previous message.

Thanks,
bin


Re: [PATCH PR66388]Compute use with cand of smaller precision by further exercising scev overflow info.

2015-07-24 Thread Bin.Cheng
On Thu, Jul 23, 2015 at 10:06 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Fri, Jul 17, 2015 at 8:27 AM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 This patch is to fix PR66388.  It's an old issue but recently became worse
 after my scev overflow change.  IVOPT now can only compute iv use with
 candidate which has at least same type precision.  See below code:

   if (TYPE_PRECISION (utype) < TYPE_PRECISION (ctype))
 {
   /* We do not have a precision to express the values of use.  */
   return infinite_cost;
 }

 This is not always true.  It's possible to compute with a candidate of
 smaller precision if it has enough stepping periods to express the iv use.
 Just as code in iv_elimination.  Well, since now we have iv no_overflow
 information, we can use that to prove it's safe.  Actually I am thinking
 about improving iv elimination with overflow information too.  So this patch
 relaxes the constraint to allow computation of uses with smaller precision
 candidates.

 Benchmark data shows several cases in spec2k6 are obviously improved on
 aarch64:
 400.perlbench   2.32%
 445.gobmk       0.86%
 456.hmmer      11.72%
 464.h264ref     1.93%
 473.astar       0.75%
 433.milc       -1.49%
 436.cactusADM   6.61%
 444.namd       -0.76%

 I looked into assembly code of 456.hmmer & 436.cactusADM, and can confirm hot
 loops are reduced.  Also perf data could confirm the improvement in
 456.hmmer.
 I looked into 433.milc and found most hot functions are not affected by this
 patch.  But I do observe two kinds of regressions described as below:
 A)  For some loops, auto-increment addressing mode is generated before this
 patch, but base + index<<scale is generated after.  I don't worry about
 this too much because auto-increment support in IVO hasn't been enabled on
 AArch64 yet. On the contrary, we should worry that auto-increment support is
 too aggressive in IVO, resulting in auto-increment addressing mode generated
 where it shouldn't. I suspect the regression we monitored before is caused
 by such kind of reason.
 B) This patch enables computation of 64 bits address iv use with 32 bits biv
 candidate.  So there will be a sign extension before the candidate can be
 used in memory reference as an index. I already increased the cost by 2 for
 such biv candidates, but there are still some peculiar cases... Decreasing
 cost in determine_iv_cost for biv candidates makes this worse.  It does that
 to make debugging simpler, nothing to do with performance.
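 To illustrate point B (an example added for clarity, not from the mail):

   /* With a 32-bit biv i addressing a 64-bit space, each access is
      p + ((long) i) * sizeof (long), i.e. i must be sign extended
      before it can serve as the index of the memory reference.  */
   long
   f (long *p, int n)
   {
     long s = 0;
     for (int i = 0; i < n; i++)
       s += p[i];
     return s;
   }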

 Bootstrap and test on x86_64.  It fixes failure of pr49781-1.c.
 Unfortunately, it introduces new failure of
 g++.dg/debug/dwarf2/deallocator.C.  I looked into the test and found with
 this patch, the loop is transformed into a shape that can be later
 eliminated(because it can be proved never loop back?).  We can further
 discuss if it's this patch's problem or the case should be tuned.
 Also bootstrap and test on aarch64.

 So what's your opinion?

 Looks sensible, but the deallocator.C fail looks odd.  I presume that
 i + j is simplified in a way that either the first or the second iteration
 must exit the loop via the return and thus the scan for deallocator.C:34
 fails?  How does this happen - I can only see this happen if we unroll
 the loop and then run into VRP.  So does IVOPTs now affect non-loop
 code as well?  Ah, at the moment we use an IV that runs backward.

 Still curious if this isn't a wrong-code issue...


Tree dump just before ivopts is as below:
{
  struct t test;
  struct t test;
  int j;
  struct t test_outside;
  unsigned int ivtmp_1;
  int _11;
  unsigned int ivtmp_26;

  bb 2:
  t::t (test_outside);
  # DEBUG j = 0
  # DEBUG j = 0

  bb 3:
  # j_27 = PHI <j_14(6), 0(2)>
  # ivtmp_1 = PHI <ivtmp_26(6), 10(2)>
  # DEBUG j = j_27
  t::t (test);
  t::foo (test);
  _11 = i_10(D) + j_27;
  if (_11 != 0)
goto bb 4;
  else
goto bb 5;

  bb 4:
  t::bar (test);
  t::~t (test);
  test ={v} {CLOBBER};
  goto bb 12;

  bb 5:
  t::~t (test);
  test ={v} {CLOBBER};
  j_14 = j_27 + 1;
  # DEBUG j = j_14
  # DEBUG j = j_14
  ivtmp_26 = ivtmp_1 - 1;
  if (ivtmp_26 == 0)
goto bb 7;
  else
goto bb 6;

  bb 6:
  goto bb 3;

  bb 7:
  if (i_10(D) != 0)
goto bb 8;
  else
goto bb 11;

  bb 8:
  t::t (test);
  if (i_10(D) == 10)
goto bb 9;
  else
goto bb 10;

  bb 9:
  t::bar (test);

  bb 10:
  t::~t (test);
  test ={v} {CLOBBER};

  bb 11:
  t::foo (test_outside);

  bb 12:
  t::~t (test_outside);
  test_outside ={v} {CLOBBER};
  return;

}

The only difference the patch made is that candidate 8 is no longer
converted into an unsigned type, because _11 is now considered not to
overflow.

Before patch:
candidate 8
  var_before ivtmp.9
  var_after ivtmp.9
  incremented before exit test
  type unsigned int
  base (unsigned int) i_10(D)
  step 1
After patch:
candidate 8
  var_before ivtmp.9
  var_after ivtmp.9
  incremented 

Re: [PATCH PR66388]Compute use with cand of smaller precision by further exercising scev overflow info.

2015-07-24 Thread Bin.Cheng
On Fri, Jul 24, 2015 at 7:23 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Fri, Jul 24, 2015 at 1:09 PM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Thu, Jul 23, 2015 at 10:06 PM, Richard Biener
 richard.guent...@gmail.com wrote:
 On Fri, Jul 17, 2015 at 8:27 AM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 This patch is to fix PR66388.  It's an old issue but recently became worse
 after my scev overflow change.  IVOPT now can only compute iv use with
 candidate which has at least same type precision.  See below code:

   if (TYPE_PRECISION (utype) < TYPE_PRECISION (ctype))
 {
   /* We do not have a precision to express the values of use.  */
   return infinite_cost;
 }

 This is not always true.  It's possible to compute with a candidate of
 smaller precision if it has enough stepping periods to express the iv use.
 Just as code in iv_elimination.  Well, since now we have iv no_overflow
 information, we can use that to prove it's safe.  Actually I am thinking
 about improving iv elimination with overflow information too.  So this 
 patch
 relaxes the constraint to allow computation of uses with smaller precision
 candidates.

 Benchmark data shows several cases in spec2k6 are obviously improved on
 aarch64:
 400.perlbench   2.32%
 445.gobmk       0.86%
 456.hmmer      11.72%
 464.h264ref     1.93%
 473.astar       0.75%
 433.milc       -1.49%
 436.cactusADM   6.61%
 444.namd       -0.76%

 I looked into assembly code of 456.hmmer & 436.cactusADM, and can confirm hot
 loops are reduced.  Also perf data could confirm the improvement in
 456.hmmer.
 I looked into 433.milc and found most hot functions are not affected by 
 this
 patch.  But I do observe two kinds of regressions described as below:
 A)  For some loops, auto-increment addressing mode is generated before this
 patch, but base + index<<scale is generated after.  I don't worry about
 this too much because auto-increment support in IVO hasn't been enabled on
 AArch64 yet. On the contrary, we should worry that auto-increment support 
 is
 too aggressive in IVO, resulting in auto-increment addressing mode 
 generated
 where it shouldn't. I suspect the regression we monitored before is caused
 by such kind of reason.
 B) This patch enables computation of 64 bits address iv use with 32 bits 
 biv
 candidate.  So there will be a sign extension before the candidate can be
 used in memory reference as an index. I already increased the cost by 2 for
 such biv candidates, but there are still some peculiar cases... Decreasing
 cost in determine_iv_cost for biv candidates makes this worse.  It does 
 that
 to make debugging simpler, nothing to do with performance.

 Bootstrap and test on x86_64.  It fixes failure of pr49781-1.c.
 Unfortunately, it introduces new failure of
 g++.dg/debug/dwarf2/deallocator.C.  I looked into the test and found with
 this patch, the loop is transformed into a shape that can be later
 eliminated(because it can be proved never loop back?).  We can further
 discuss if it's this patch's problem or the case should be tuned.
 Also bootstrap and test on aarch64.

 So what's your opinion?

 Looks sensible, but the deallocator.C fail looks odd.  I presume that
 i + j is simplified in a way that either the first or the second iteration
 must exit the loop via the return and thus the scan for deallocator.C:34
 fails?  How does this happen - I can only see this happen if we unroll
 the loop and then run into VRP.  So does IVOPTs now affect non-loop
 code as well?  Ah, at the moment we use an IV that runs backward.

 Still curious if this isn't a wrong-code issue...


 Tree dump just before ivopts is as below:
 {
   struct t test;
   struct t test;
   int j;
   struct t test_outside;
   unsigned int ivtmp_1;
   int _11;
   unsigned int ivtmp_26;

   bb 2:
   t::t (test_outside);
   # DEBUG j = 0
   # DEBUG j = 0

   bb 3:
   # j_27 = PHI <j_14(6), 0(2)>
   # ivtmp_1 = PHI <ivtmp_26(6), 10(2)>
   # DEBUG j = j_27
   t::t (test);
   t::foo (test);
   _11 = i_10(D) + j_27;
   if (_11 != 0)
 goto bb 4;
   else
 goto bb 5;

   bb 4:
   t::bar (test);
   t::~t (test);
   test ={v} {CLOBBER};
   goto bb 12;

   bb 5:
   t::~t (test);
   test ={v} {CLOBBER};
   j_14 = j_27 + 1;
   # DEBUG j = j_14
   # DEBUG j = j_14
   ivtmp_26 = ivtmp_1 - 1;
   if (ivtmp_26 == 0)
 goto bb 7;
   else
 goto bb 6;

   bb 6:
   goto bb 3;

   bb 7:
   if (i_10(D) != 0)
 goto bb 8;
   else
 goto bb 11;

   bb 8:
   t::t (test);
   if (i_10(D) == 10)
 goto bb 9;
   else
 goto bb 10;

   bb 9:
   t::bar (test);

   bb 10:
   t::~t (test);
   test ={v} {CLOBBER};

   bb 11:
   t::foo (test_outside);

   bb 12:
   t::~t (test_outside);
   test_outside ={v} {CLOBBER};
   return;

 }

 The only difference the patch made is that candidate 8 is no longer
 converted into an unsigned type, because _11 is now considered not to
 overflow

Re: [PATCH AArch64]Handle wrong cost for addition of minus immediate in aarch64_rtx_costs.

2015-07-15 Thread Bin.Cheng
Ping^2

On Thu, Jul 9, 2015 at 5:42 PM, Bin.Cheng amker.ch...@gmail.com wrote:
 Ping.

 On Fri, Jun 26, 2015 at 4:47 PM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 The canonical form of subtract of immediate is (add op0 minus_imm), which is
 supported with addsi3_aarch64 pattern on aarch64.  Unfortunately wrong cost
 (8 rather than 4) is computed by aarch64_rtx_costs because it doesn't honor
 the fact that it actually is a sub instruction.  This patch fixes it, is
 this OK?

 Thanks,
 bin

 2015-06-25  Bin Cheng  bin.ch...@arm.com

 * config/aarch64/aarch64.c (aarch64_rtx_costs): Handle addition of
 minus immediate.


Re: [PATCH AArch64]Handle wrong cost for addition of minus immediate in aarch64_rtx_costs.

2015-07-09 Thread Bin.Cheng
Ping.

On Fri, Jun 26, 2015 at 4:47 PM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 The canonical form of subtract of immediate is (add op0 minus_imm), which is
 supported with addsi3_aarch64 pattern on aarch64.  Unfortunately wrong cost
 (8 rather than 4) is computed by aarch64_rtx_costs because it doesn't honor
 the fact that it actually is a sub instruction.  This patch fixes it, is
 this OK?

 Thanks,
 bin

 2015-06-25  Bin Cheng  bin.ch...@arm.com

 * config/aarch64/aarch64.c (aarch64_rtx_costs): Handle addition of
 minus immediate.


Re: [PATCH GCC]Udate best_cost for start cand if it has lower overall cost in iv set narrowing

2015-07-09 Thread Bin.Cheng
On Thu, Jul 9, 2015 at 5:49 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Thu, Jul 9, 2015 at 11:37 AM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 When I was going through the code, I spotted this minor issue.  When
 start_cand/orig_cand/third_cand have overall costs ordered like start_cand
 < third_cand < orig_cand, GCC chooses third_cand instead of start_cand
 because we haven't set best_cost for start_cand.  This is an obvious fix to
 it.

 So is it OK?

 Ok.  I wonder if you have a testcase which this improves?
Thanks for reviewing.
No, unfortunately.  I ran into a case with another patch.  Even for
that case, start_cand has the same overall cost as third_cand.  But
I believe the idea is true: with hypothetical costs start_cand = 10,
third_cand = 12, orig_cand = 15, the old code would pick third_cand
because best_cost was never initialized from start_cand.  And if we
didn't handle start_cand specially (well, we need to handle it
specially) in narrowing, we would need to check it for a possibly
lower cost in the following loop anyway.

Thanks,
bin

 Richard.


 2015-07-08  Bin Cheng  bin.ch...@arm.com

 * tree-ssa-loop-ivopts.c (iv_ca_narrow): Update best_cost
 if start candidate has lower cost.


Re: [PATCH GCC][refacor]Manage allocation of struct iv in obstack.

2015-06-29 Thread Bin.Cheng
On Sat, Jun 27, 2015 at 5:13 AM, Jeff Law l...@redhat.com wrote:
 On 06/26/2015 03:02 AM, Bin Cheng wrote:

 Hi,
 GCC avoids multi-pointers/dangling-pointers of struct iv by allocating
 multiple copies of the structure.  This patch is an obvious fix to the
 issue
 by managing iv structures in obstack.

 Bootstrap on x86_64, will apply to trunk if no objection.

 Thanks,
 bin

 2015-06-26  Bin Cheng  bin.ch...@arm.com

 * tree-ssa-loop-ivopts.c (struct ivopts_data): New field
 iv_obstack.
 (tree_ssa_iv_optimize_init): Initialize iv_obstack.
 (alloc_iv): New parameter.  Allocate struct iv using
 obstack_alloc.
 (set_iv, find_interesting_uses_address, add_candidate_1): New
 argument.
 (find_interesting_uses_op): Don't duplicate struct iv.
 (free_loop_data): Don't free iv structure explicitly.
 (tree_ssa_iv_optimize_finalize): Free iv_obstack.

 Presumably you're trying to simplify the memory management  here so that you
 don't have to track lifetimes of the IV structures so carefully, which in
 turn simplifies some upcoming patch?
Yes, that's exactly the reason.  I am still working on fixing
missed optimizations in IVO, and plan to do some
refactoring/simplification afterwards.
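A sketch of the obstack pattern (names per the ChangeLog; simplified):

  /* All struct iv objects live in data->iv_obstack and are released in
     bulk at pass finalization, so no per-iv free or defensive copy is
     needed.  */
  gcc_obstack_init (&data->iv_obstack);
  struct iv *iv
    = (struct iv *) obstack_alloc (&data->iv_obstack, sizeof (struct iv));
  /* ... use iv; no explicit free ... */
  obstack_free (&data->iv_obstack, NULL);  /* tree_ssa_iv_optimize_finalize */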

 Note we don't have a no objection policy for this kind of patch. However,
 I think it may make sense to look into having you as a maintainer for the IV
 optimizations if you're interested.
Oh, that would be my great honor.

Thanks,
bin

 Jeff



Re: [PATCH testsuite]Refine scanning string in pr65447.c to support small address offset target

2015-06-03 Thread Bin.Cheng
On Tue, Jun 2, 2015 at 11:40 PM, Bernhard Reutner-Fischer
rep.dot@gmail.com wrote:
 On June 2, 2015 5:56:13 AM GMT+02:00, Bin Cheng bin.ch...@arm.com wrote:
Hi,
On some arm processors, the offset supported in addressing modes is
very
small.  As a result, the dozens of address induction variables will be
grouped into several groups, rather than only one as on armv7/8.  This
patch
refines scanning string to avoid test failure on such processors.

It's an obvious change, and test acts as expected.  So is it OK?  I
will
commit it in next 24 hours if there is no objection.

Thanks,
bin

2015-06-02  Bin Cheng  bin.ch...@arm.com

   PR tree-optimization/65447
   * gcc.dg/tree-ssa/pr65447.c: Increase searching number.

 There should be no cleanup-tree-dump left on trunk.
 Please refresh your patch before pushing when somebody OKs it.
Thanks for reminding.  I am aware of that.  Patch committed as revision 224055.

Thanks,
bin

 Thanks,






[PATCH GCC]Preserve ssa name (thus vrp info) for IV structure. Committed.

2015-06-03 Thread Bin.Cheng
Hi,
I applied this obvious patch to trunk.  It was approved in last Stage 4.

Thanks,
bin

2015-06-03  Bin Cheng  bin.ch...@arm.com

* tree-ssa-loop-ivopts.c (dump_iv): New parameter.
(dump_use, dump_cand, find_induction_variables): Pass new argument
to dump_iv.
(record_use): Preserve the ssa name information in IV.
Index: gcc/tree-ssa-loop-ivopts.c
===
--- gcc/tree-ssa-loop-ivopts.c  (revision 224056)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -517,9 +517,9 @@ single_dom_exit (struct loop *loop)
 /* Dumps information about the induction variable IV to FILE.  */
 
 void
-dump_iv (FILE *file, struct iv *iv)
+dump_iv (FILE *file, struct iv *iv, bool dump_name)
 {
-  if (iv->ssa_name)
+  if (iv->ssa_name && dump_name)
 {
   fprintf (file, "ssa name ");
   print_generic_expr (file, iv->ssa_name, TDF_SLIM);
@@ -596,7 +596,7 @@ dump_use (FILE *file, struct iv_use *use)
 print_generic_expr (file, *use->op_p, TDF_SLIM);
   fprintf (file, "\n");
 
-  dump_iv (file, use->iv);
+  dump_iv (file, use->iv, false);
 
   if (use->related_cands)
 {
@@ -684,7 +684,7 @@ dump_cand (FILE *file, struct iv_cand *cand)
   break;
 }
 
-  dump_iv (file, iv);
+  dump_iv (file, iv, false);
 }
 
 /* Returns the info for ssa version VER.  */
@@ -1326,7 +1326,7 @@ find_induction_variables (struct ivopts_data *data
   EXECUTE_IF_SET_IN_BITMAP (data->relevant, 0, i, bi)
{
  if (ver_info (data, i)->iv)
-   dump_iv (dump_file, ver_info (data, i)->iv);
+   dump_iv (dump_file, ver_info (data, i)->iv, true);
}
 }
 
@@ -1356,10 +1356,6 @@ record_use (struct ivopts_data *data, tree *use_p,
   use->addr_base = addr_base;
   use->addr_offset = addr_offset;
 
-  /* To avoid showing ssa name in the dumps, if it was not reset by the
- caller.  */
-  iv->ssa_name = NULL_TREE;
-
   data->iv_uses.safe_push (use);
 
   return use;


Re: [PATCH GCC]Improve how we handle overflow in scev by using overflow information computed for control iv in loop niter, part II

2015-06-02 Thread Bin.Cheng
On Tue, Jun 2, 2015 at 4:40 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Tue, Jun 2, 2015 at 4:55 AM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Mon, Jun 1, 2015 at 6:45 PM, Richard Biener
 richard.guent...@gmail.com wrote:
 On Tue, May 26, 2015 at 1:04 PM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 My first part patch improving how we handle overflow in scev is posted at
 https://gcc.gnu.org/ml/gcc-patches/2015-05/msg01795.html .  Here comes the
 second part patch.

 This patch does below improvements:
   1) Computes and records control iv for each loop's exit edge.  This
 provides a way to compute overflow information in loop niter and use it in
 different customers.  It think it's useful, especially with option
 -funsafe-loop-optimizers.
   2) Improve chrec_convert by adding new interface
 loop_exits_before_overflow.  It checks if a converted IV overflows wrto its
 type and loop using overflow information of loop's control iv.  This
 basically propagates no-overflow information from control iv to ivs
 converted from control iv.  Moreover, we can further improve the logic by
 using possible VRP information in the future.

 But 2) you already posted (and I have approved it but you didn't commit 
 yet?).

 Can you commit that approved patch and only send the parts I didn't approve
 yet?

 Thanks,
 Richard.

 With this patch, cases like scev-9.c and scev-10.c in patch can be handled
 now.  Cases reported in PR48052 can be vectorized too.
 Opinions?

 Thanks,
 bin


 2015-05-26  Bin Cheng  bin.ch...@arm.com

 * cfgloop.h (struct control_iv): New.
 (struct loop): New field control_ivs.
 * tree-ssa-loop-niter.c : Include stor-layout.h.
 (number_of_iterations_lt): Set no_overflow information.
 (number_of_iterations_exit): Init control iv in niter struct.
 (record_control_iv): New.
 (estimate_numbers_of_iterations_loop): Call record_control_iv.
 (loop_exits_before_overflow): New.  Interface factored out of
 scev_probably_wraps_p.
 (scev_probably_wraps_p): Factor loop niter related code into
 loop_exits_before_overflow.
 (free_numbers_of_iterations_estimates_loop): Free control ivs.
 * tree-ssa-loop-niter.h (free_loop_control_ivs): New.

 gcc/testsuite/ChangeLog
 2015-05-26  Bin Cheng  bin.ch...@arm.com

 PR tree-optimization/48052
 * gcc.dg/tree-ssa/scev-8.c: New.
 * gcc.dg/tree-ssa/scev-9.c: New.
 * gcc.dg/tree-ssa/scev-10.c: New.
 * gcc.dg/vect/pr48052.c: New.


 Hi Richard,
 I think you replied the review message of this patch to another
 thread.  Sorry for being mis-leading.  S I copied and answered your
 review comments in this thread thus we can continue here.

 +   /* Done proving if this is a no-overflow control IV.  */
 +   if (operand_equal_p (base, civ->base, 0))
 + return true;

 so all control IVs are no-overflow?

 This patch only records known no-overflow control ivs in loop
 structure, so it depends on loop niter analyzer.  For now, this patch
 (and the existing code) sets no-overflow flag only for two cases.  One
 is the step-1 case, the other one is in assert_no_overflow_lt.
 As a matter of fact, we may want to set no_overflow flag for all cases
 with -funsafe-loop-optimizations in the future.  In that case, we will
 assume all control IVs are no-overflow.


 +base <= UPPER_BOUND (type) - step  ;; step > 0
 +base >= LOWER_BOUND (type) - step  ;; step < 0
 +
 +  by using loop's initial condition.  */
 +   stepped = fold_build2 (PLUS_EXPR, TREE_TYPE (base), base, step);
 +   if (operand_equal_p (stepped, civ->base, 0))
 + {
 +   if (tree_int_cst_sign_bit (step))
 + {
 +   code = LT_EXPR;
 +   extreme = lower_bound_in_type (type, type);
 + }
 +   else
 + {
 +   code = GT_EXPR;
 +   extreme = upper_bound_in_type (type, type);
 + }
 +   extreme = fold_build2 (MINUS_EXPR, type, extreme, step);
 +   e = fold_build2 (code, boolean_type_node, base, extreme);

 looks like you are actually computing base + step <= UPPER_BOUND (type)
 so maybe adjust the comment.  But as both step and UPPER_BOUND (type)
 are constants why not compute it the way the comment specifies it?  
 Comparison
 codes also don't match the comment and we try to prove the condition is 
 false.
 I tried to prove the conditions are satisfied by proving the reverse
 condition (base > UPPER_BOUND (type) - step) is false here.  In the
 updated patch, I revised comments to reflect that logic.  Is it ok?


 This also reminds me of eventually pushing forward my idea of strengthening
 simplify_using_initial_
 conditions by using the VRP machinery (I have a small
 prototype patch for that).
 Interesting.  If I understand correctly, VRP info is held for ssa vars
 on a global scope basis?  The loop's initial condition

Re: [PATCH GCC]Improve how we handle overflow for type conversion in scev/ivopts, part I

2015-06-02 Thread Bin.Cheng
On Tue, Jun 2, 2015 at 11:37 AM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Tue, May 26, 2015 at 5:04 PM, Richard Biener
 richard.guent...@gmail.com wrote:
 On Sun, May 24, 2015 at 8:47 AM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Fri, May 22, 2015 at 7:45 PM, Richard Biener
 richard.guent...@gmail.com wrote:
 On Wed, May 20, 2015 at 11:41 AM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 As we know, GCC is too conservative when checking overflow behavior in 
 SCEV
 and loop related optimizers.  Result is some variable can't be recognized 
 as
 scalar evolution and thus optimizations are missed.  To be specific,
 optimizers like ivopts and vectorizer are affected.
 This issue is more severe on 64 bit platforms, for example, PR62173 is
 failed on aarch64; scev-3.c and scev-4.c were marked as XFAIL on lp64
 platforms.

 As the first part to improve overflow checking in GCC, this patch does 
 below
 improvements:
   1) Ideally, chrec_convert should be responsible to convert scev like
 (type){base, step} to scev like {(type)base, (type)step} when the 
 result
 scev doesn't overflow; chrec_convert_aggressive should do the conversion 
 if
 the result scev could overflow/wrap.  Unfortunately, current 
 implementation
 may use chrec_convert_aggressive to return a scev that won't overflow.  
 This
 is because of a) the static parameter fold_conversions for
 instantiate_scev_convert can only tracks whether chrec_convert_aggressive
 may be called, rather than if it does some overflow conversion or not;  b)
 the implementation of instantiate_scev_convert sometimes shortcuts the 
 call
 to chrec_convert and misses conversion opportunities.  This patch improves
 this.
   2) iv-no_overflow computed in simple_iv is too conservative.  With 1)
 fixed, iv-no_overflow should reflects whether chrec_convert_aggressive 
 does
 return an overflow scev.  This patch improves this.
   3) chrec_convert should be able to prove the resulting scev won't 
 overflow
 with loop niter information.  This patch doesn't finish this, but it
 factored a new interface out of scev_probably_wraps_p for future
 improvement.  And that will be the part II patch.

 With the improvements in SCEV, this patch also improves optimizer(IVOPT)
 that uses scev information like below:
   For array reference in the form of arr[IV], GCC tries to derive new
 address iv {arr+iv.base, iv.step*elem_size} from IV.  If IV overflow wrto 
 a
 type that is narrower than address space, this derivation is not true
 because arr[IV] isn't a scev.  Root cause why scev-*.c are failed now is
 the overflow information of IV is too conservative.  IVOPT has to be
 conservative to reject arr[IV] as a scev.  With more accurate overflow
 information, IVOPT can be improved too.  So this patch fixes the mentioned
 long standing issues.
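 A sketch of the problem case being described (added for illustration,
 not from the mail):

   void
   g (int *arr, unsigned int start, unsigned int end)
   {
     /* If start > end, iv i wraps through 0xffffffff; arr[i] is then
        not the affine address {arr + 4*start, +, 4} on an LP64 target,
        because a wrap of i does not wrap the 64-bit address.  */
     for (unsigned int i = start; i != end; i++)
       arr[i] = 0;
   }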

 Bootstrap and test on x86_64, x86 and aarch64.
 BTW, test gcc.target/i386/pr49781-1.c failed on x86_64, but I have 
 confirmed
 it's not this patch's fault.

 So what's your opinion on this?.

 I maybe mixing things up but does

 +chrec_convert_aggressive (tree type, tree chrec, bool *fold_conversions)
  {
 ...
 +  if (evolution_function_is_affine_p (chrec))
 +{
 +  tree base, step;
 +  struct loop *loop;
 +
 +  loop = get_chrec_loop (chrec);
 +  base = CHREC_LEFT (chrec);
 +  step = CHREC_RIGHT (chrec);
 +  if (convert_affine_scev (loop, type, base, step, NULL, true))
 +   return build_polynomial_chrec (loop-num, base, step);

 ^^^ not forget to set *fold_conversions to true?  Or we need to use
 convert_affine_scev (..., false)?

 Nice catch.  It's supposed to be called only if source scev has no
 overflow behavior introduced by previous call to
 chrec_convert_aggressive.  In other words, it should be guarded by
 !*fold_conversions like below:

 +
 +  if (!*fold_conversions && evolution_function_is_affine_p (chrec))
 +{
 +  tree base, step;
 +  struct loop *loop;
 +
 +  loop = get_chrec_loop (chrec);
 +  base = CHREC_LEFT (chrec);
 +  step = CHREC_RIGHT (chrec);
 +  if (convert_affine_scev (loop, type, base, step, NULL, true))
 +return build_polynomial_chrec (loop-num, base, step);
 +}

 The scenario is rare that didn't exposed in either bootstrap or reg-test.

 Here is the updated patch without any other difference.  Bootstrap and
 test on x86_64  AArch64.

 Ok.

 Thanks,
 Richard.

 Thanks,
 bin

 +}

 (bah, and the diff somehow messes up -p context :/  which is why I like
 context diffs more)

 Other from the above the patch looks good to me.

 Thanks,
 Richard.

 Thanks,
 bin

 2015-05-20  Bin Cheng  bin.ch...@arm.com

 PR tree-optimization/62173
 * tree-ssa-loop-ivopts.c (struct iv): New field.  Reorder fields.
 (alloc_iv, set_iv): New parameter.
 (determine_biv_step): Delete.
 (find_bivs): Inline original determine_biv_step.  Pass new
 argument to set_iv.
 (idx_find_step): Use no_overflow information for conversion

Re: Fix PR48052: loop not vectorized if index is unsigned int

2015-06-01 Thread Bin.Cheng
On Mon, Jun 1, 2015 at 4:00 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Sat, May 30, 2015 at 7:47 AM, Jeff Law l...@redhat.com wrote:
 On 05/19/2015 10:12 AM, Aditya K wrote:

 w.r.t. the PR48052, here is the patch which finds out if scev would wrap
 or not.
 The patch symbolically evaluates if valid_niter= loop-nb_iterations is
 true. In that case the scev would not wrap (??).
 Currently, we only look for two special 'patterns', which are sufficient
 to analyze the simple test cases.

 valid_niter = ~s (= UINT_MAX - s)
 We have to prove that valid_niter >= loop->nb_iterations

 Pattern1 loop->nb_iterations: s >= e ? s - e : 0
 Pattern2 loop->nb_iterations: (e - s) - 1

 In the first case we prove that valid_niter >= loop->nb_iterations in both
 the cases, i.e., when s >= e and when not.
 In the second case we prove valid_niter >= loop->nb_iterations by the simple
 analysis that UINT_MAX >= e is true in all cases.

 I haven't tested this patch completely. I'm looking for feedback and any
 scope for improvement.


 hth,
 -Aditya



 Vectorize loops which has typecast.

 2015-05-19  hiraditya  hiradi...@msn.com

  * gcc.dg/vect/pr48052.c: New test.

 gcc/ChangeLog:

 2015-05-19  hiraditya  hiradi...@msn.com

  * tree-ssa-loop-niter.c (fold_binary_cond_p): Fold a conditional
 operation when additional constraints are
  available.
  (fold_binary_minus_p): Fold a subtraction operations of the form
 (A - B -1) when additional constraints are
  available.
  (scev_probably_wraps_p): Use the above two functions to find
 whether valid_niter >= loop->nb_iterations.

 Is any of this work still useful if Bin Cheng's work on improving overflow
 detection for scev goes forward?  I certainly got the impression that Bin's
 work would solve 48052 and others.

 Bin is probably the one to answer this.  His patches are still on my list
 of patches to review...
Richard, you have already approved the first one.  The second one is
at https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02317.html . You may
need to decide if it is as expected.

Yes, PRs like 62173, 52563 and 48052 will be fixed with these two patches.

Thanks,
bin

 Richard.

 Jeff



Re: [PATCH GCC]Improve overflow in scev by using information computed in loop niter, part II

2015-06-01 Thread Bin.Cheng
On Mon, Jun 1, 2015 at 6:41 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Tue, May 26, 2015 at 1:13 PM, Bin.Cheng amker.ch...@gmail.com wrote:
 Hi,
 The first part patch improving how we handle overflow in scev is
 posted at https://gcc.gnu.org/ml/gcc-patches/2015-05/msg01795.html .
 Here comes the second part patch.

 This patch does below improvements:
   1) Computes and records control iv for each loop's exit edge.  This
 provides a way to compute overflow information in loop niter and use
 it in different customers.  I think it's useful, especially with
 option -funsafe-loop-optimizers.
   2) Improve chrec_convert by adding new interface
 loop_exits_before_overflow.  It checks if a converted IV overflows
 wrto its type and loop using overflow information of loop's control
 iv.  This basically propagates no-overflow information from control iv
 to ivs converted from control iv.  Moreover, we can further improve
 the logic by using possible VRP information in the future.

 With this patch, cases like scev-9.c and scev-10.c in patch can be
 handled now.  Cases reported in PR48052 can be vectorized too.
 Opinions?

 -ENOPATCH
Here comes the patch.  Sorry for the inconvenience.

Thanks,
bin

 Thanks,
 bin


 2015-05-26  Bin Cheng  bin.ch...@arm.com

 * cfgloop.h (struct control_iv): New.
 (struct loop): New field control_ivs.
 * tree-ssa-loop-niter.c : Include stor-layout.h.
 (number_of_iterations_lt): Set no_overflow information.
 (number_of_iterations_exit): Init control iv in niter struct.
 (record_control_iv): New.
 (estimate_numbers_of_iterations_loop): Call record_control_iv.
 (loop_exits_before_overflow): New.  Interface factored out of
 scev_probably_wraps_p.
 (scev_probably_wraps_p): Factor loop niter related code into
 loop_exits_before_overflow.
 (free_numbers_of_iterations_estimates_loop): Free control ivs.
 * tree-ssa-loop-niter.h (free_loop_control_ivs): New.

 gcc/testsuite/ChangeLog
 2015-05-26  Bin Cheng  bin.ch...@arm.com

 PR tree-optimization/48052
 * gcc.dg/tree-ssa/scev-8.c: New.
 * gcc.dg/tree-ssa/scev-9.c: New.
 * gcc.dg/tree-ssa/scev-10.c: New.
 * gcc.dg/vect/pr48052.c: New.
Index: gcc/tree-ssa-loop-niter.c
===
--- gcc/tree-ssa-loop-niter.c   (revision 222758)
+++ gcc/tree-ssa-loop-niter.c   (working copy)
@@ -31,6 +31,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "wide-int.h"
 #include "inchash.h"
 #include "tree.h"
+#include "stor-layout.h"
 #include "fold-const.h"
 #include "calls.h"
 #include "hashtab.h"
@@ -1184,6 +1185,7 @@ number_of_iterations_lt (tree type, affine_iv *iv0
   niter->niter = delta;
   niter->max = widest_int::from (wi::from_mpz (niter_type, bnds->up, false),
 TYPE_SIGN (niter_type));
+  niter->control.no_overflow = true;
   return true;
 }
 
@@ -1965,6 +1967,9 @@ number_of_iterations_exit (struct loop *loop, edge
 return false;
 
   niter->assumptions = boolean_false_node;
+  niter->control.base = NULL_TREE;
+  niter->control.step = NULL_TREE;
+  niter->control.no_overflow = false;
   last = last_stmt (exit->src);
   if (!last)
 return false;
@@ -2744,6 +2749,29 @@ record_estimate (struct loop *loop, tree bound, co
   record_niter_bound (loop, new_i_bound, realistic, upper);
 }
 
+/* Records the control iv analyzed in NITER for LOOP if the iv is valid
+   and doesn't overflow.  */
+
+static void
+record_control_iv (struct loop *loop, struct tree_niter_desc *niter)
+{
+  struct control_iv *iv;
+
+  if (!niter->control.base || !niter->control.step)
+    return;
+
+  if (!integer_onep (niter->assumptions) || !niter->control.no_overflow)
+    return;
+
+  iv = ggc_alloc<control_iv> ();
+  iv->base = niter->control.base;
+  iv->step = niter->control.step;
+  iv->next = loop->control_ivs;
+  loop->control_ivs = iv;
+
+  return;
+}
+
 /* Record the estimate on number of iterations of LOOP based on the fact that
the induction variable BASE + STEP * i evaluated in STMT does not wrap and
its values belong to the range LOW, HIGH.  REALISTIC is true if the
@@ -3467,6 +3495,7 @@ estimate_numbers_of_iterations_loop (struct loop *
   record_estimate (loop, niter, niter_desc.max,
   last_stmt (ex-src),
   true, ex == likely_exit, true);
+  record_control_iv (loop, niter_desc);
 }
   exits.release ();
 
@@ -3773,6 +3802,188 @@ nowrap_type_p (tree type)
   return false;
 }
 
+/* Return true if we can prove LOOP is exited before evolution of induction
+   variable {BASE, STEP} overflows with respect to its type bound.  */
+
+static bool
+loop_exits_before_overflow (tree base, tree step,
+   gimple at_stmt, struct loop *loop)
+{
+  widest_int niter;
+  struct control_iv *civ;
+  struct nb_iter_bound *bound;
+  tree e, delta, step_abs, unsigned_base;
+  tree type = TREE_TYPE (step);
+  tree

Re: [PATCH GCC]Improve how we handle overflow for type conversion in scev/ivopts, part I

2015-06-01 Thread Bin.Cheng
On Tue, May 26, 2015 at 5:04 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Sun, May 24, 2015 at 8:47 AM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Fri, May 22, 2015 at 7:45 PM, Richard Biener
 richard.guent...@gmail.com wrote:
 On Wed, May 20, 2015 at 11:41 AM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 As we know, GCC is too conservative when checking overflow behavior in SCEV
 and loop related optimizers.  Result is some variable can't be recognized 
 as
 scalar evolution and thus optimizations are missed.  To be specific,
 optimizers like ivopts and vectorizer are affected.
 This issue is more severe on 64 bit platforms, for example, PR62173 is
 failed on aarch64; scev-3.c and scev-4.c were marked as XFAIL on lp64
 platforms.

 As the first part to improve overflow checking in GCC, this patch does 
 below
 improvements:
   1) Ideally, chrec_convert should be responsible to convert scev like
 (type){base, step} to scev like {(type)base, (type)step} when the result
 scev doesn't overflow; chrec_convert_aggressive should do the conversion if
 the result scev could overflow/wrap.  Unfortunately, current implementation
 may use chrec_convert_aggressive to return a scev that won't overflow.  
 This
 is because of a) the static parameter fold_conversions for
 instantiate_scev_convert can only tracks whether chrec_convert_aggressive
 may be called, rather than if it does some overflow conversion or not;  b)
 the implementation of instantiate_scev_convert sometimes shortcuts the call
 to chrec_convert and misses conversion opportunities.  This patch improves
 this.
   2) iv-no_overflow computed in simple_iv is too conservative.  With 1)
 fixed, iv-no_overflow should reflects whether chrec_convert_aggressive 
 does
 return an overflow scev.  This patch improves this.
   3) chrec_convert should be able to prove the resulting scev won't 
 overflow
 with loop niter information.  This patch doesn't finish this, but it
 factored a new interface out of scev_probably_wraps_p for future
 improvement.  And that will be the part II patch.

 With the improvements in SCEV, this patch also improves optimizer(IVOPT)
 that uses scev information like below:
   For array reference in the form of arr[IV], GCC tries to derive new
 address iv {arr+iv.base, iv.step*elem_size} from IV.  If IV overflows wrt a
 type that is narrower than the address space, this derivation is not true
 because arr[IV] isn't a scev.  The root cause why the scev-*.c tests fail now
 is that the overflow information of IV is too conservative.  IVOPT has to be
 conservative to reject arr[IV] as a scev.  With more accurate overflow
 information, IVOPT can be improved too.  So this patch fixes the mentioned
 long standing issues.
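
 To make the failure mode concrete, here is a hypothetical example (not
 taken from the patch or the scev-*.c tests): on an lp64 target, arr[i]
 below is an affine scev only if the 32-bit iv 'i' can be proven not to
 wrap, because a wrap from 0xffffffff back to 0 would make &arr[i] jump
 backwards in the address space.

   extern int arr[];

   void
   f (unsigned int start, unsigned int n)
   {
     unsigned int i;
     for (i = start; i != n; i++)
       arr[i] = 0;   /* address is arr + (unsigned long) i * 4  */
   }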

 Bootstrap and test on x86_64, x86 and aarch64.
 BTW, test gcc.target/i386/pr49781-1.c failed on x86_64, but I can confirm
 it's not this patch's fault.

 So what's your opinion on this?

 I maybe mixing things up but does

 +chrec_convert_aggressive (tree type, tree chrec, bool *fold_conversions)
  {
 ...
 +  if (evolution_function_is_affine_p (chrec))
 +{
 +  tree base, step;
 +  struct loop *loop;
 +
 +  loop = get_chrec_loop (chrec);
 +  base = CHREC_LEFT (chrec);
 +  step = CHREC_RIGHT (chrec);
 +  if (convert_affine_scev (loop, type, base, step, NULL, true))
 +   return build_polynomial_chrec (loop->num, base, step);

 ^^^ not forget to set *fold_conversions to true?  Or we need to use
 convert_affine_scev (..., false)?

 Nice catch.  It's supposed to be called only if source scev has no
 overflow behavior introduced by previous call to
 chrec_convert_aggressive.  In other words, it should be guarded by
 !*fold_conversions like below:

 +
 +  if (!*fold_conversions && evolution_function_is_affine_p (chrec))
 +{
 +  tree base, step;
 +  struct loop *loop;
 +
 +  loop = get_chrec_loop (chrec);
 +  base = CHREC_LEFT (chrec);
 +  step = CHREC_RIGHT (chrec);
 +  if (convert_affine_scev (loop, type, base, step, NULL, true))
 +    return build_polynomial_chrec (loop->num, base, step);
 +}

 The scenario is rare enough that it wasn't exposed in either bootstrap or reg-test.

 Here is the updated patch without any other difference.  Bootstrap and
 test on x86_64 and AArch64.

 Ok.

 Thanks,
 Richard.

 Thanks,
 bin

 +}

 (bah, and the diff somehow messes up -p context :/  which is why I like
 context diffs more)

 Other from the above the patch looks good to me.

 Thanks,
 Richard.

 Thanks,
 bin

 2015-05-20  Bin Cheng  bin.ch...@arm.com

 PR tree-optimization/62173
 * tree-ssa-loop-ivopts.c (struct iv): New field.  Reorder fields.
 (alloc_iv, set_iv): New parameter.
 (determine_biv_step): Delete.
 (find_bivs): Inline original determine_biv_step.  Pass new
 argument to set_iv.
 (idx_find_step): Use no_overflow information for conversion.
 * tree-scalar-evolution.c (analyze_scalar_evolution_in_loop): Let

Re: [PATCH GCC]Improve how we handle overflow in scev by using overflow information computed for control iv in loop niter, part II

2015-06-01 Thread Bin.Cheng
On Mon, Jun 1, 2015 at 6:45 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Tue, May 26, 2015 at 1:04 PM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 My first part patch improving how we handle overflow in scev is posted at
 https://gcc.gnu.org/ml/gcc-patches/2015-05/msg01795.html .  Here comes the
 second part patch.

 This patch does below improvements:
   1) Computes and records control iv for each loop's exit edge.  This
 provides a way to compute overflow information in loop niter and use it in
 different customers.  I think it's useful, especially with option
 -funsafe-loop-optimizations.
   2) Improve chrec_convert by adding new interface
 loop_exits_before_overflow.  It checks if a converted IV overflows wrt its
 type and loop using overflow information of loop's control iv.  This
 basically propagates no-overflow information from control iv to ivs
 converted from control iv.  Moreover, we can further improve the logic by
 using possible VRP information in the future.

 But 2) you already posted (and I have approved it but you didn't commit yet?).

 Can you commit that approved patch and only send the parts I didn't approve
 yet?

 Thanks,
 Richard.

 With this patch, cases like scev-9.c and scev-10.c in patch can be handled
 now.  Cases reported in PR48052 can be vectorized too.
 Opinions?

 Thanks,
 bin


 2015-05-26  Bin Cheng  bin.ch...@arm.com

 * cfgloop.h (struct control_iv): New.
 (struct loop): New field control_ivs.
 * tree-ssa-loop-niter.c: Include stor-layout.h.
 (number_of_iterations_lt): Set no_overflow information.
 (number_of_iterations_exit): Init control iv in niter struct.
 (record_control_iv): New.
 (estimate_numbers_of_iterations_loop): Call record_control_iv.
 (loop_exits_before_overflow): New.  Interface factored out of
 scev_probably_wraps_p.
 (scev_probably_wraps_p): Factor loop niter related code into
 loop_exits_before_overflow.
 (free_numbers_of_iterations_estimates_loop): Free control ivs.
 * tree-ssa-loop-niter.h (free_loop_control_ivs): New.

 gcc/testsuite/ChangeLog
 2015-05-26  Bin Cheng  bin.ch...@arm.com

 PR tree-optimization/48052
 * gcc.dg/tree-ssa/scev-8.c: New.
 * gcc.dg/tree-ssa/scev-9.c: New.
 * gcc.dg/tree-ssa/scev-10.c: New.
 * gcc.dg/vect/pr48052.c: New.


Hi Richard,
I think you replied the review message of this patch to another
thread.  Sorry for being misleading.  So I copied and answered your
review comments in this thread thus we can continue here.

 +   /* Done proving if this is a no-overflow control IV.  */
 +   if (operand_equal_p (base, civ->base, 0))
 +     return true;

 so all control IVs are no-overflow?

This patch only records known no-overflow control ivs in loop
structure, so it depends on the loop niter analyzer.  For now, this patch
(and the existing code) sets the no-overflow flag only for two cases.  One
is the step-1 case; the other is in assert_no_overflow_lt.
As a matter of fact, we may want to set no_overflow flag for all cases
with -funsafe-loop-optimizations in the future.  In that case, we will
assume all control IVs are no-overflow.
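
For a concrete picture, a step-1 control iv of the kind niter analysis
already marks no_overflow looks like this (illustrative only, not one of
the new testcases):

  void
  f (unsigned int n, int *a)
  {
    unsigned int i;
    for (i = 0; i != n; i++)   /* control iv {0, +1}  */
      a[i] = 0;
  }

The loop exits exactly at i == n, before the iv has a chance to wrap.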


 +base <= UPPER_BOUND (type) - step  ;;step > 0
 +base >= LOWER_BOUND (type) - step  ;;step < 0
 +
 +  by using loop's initial condition.  */
 +   stepped = fold_build2 (PLUS_EXPR, TREE_TYPE (base), base, step);
 +   if (operand_equal_p (stepped, civ->base, 0))
 + {
 +   if (tree_int_cst_sign_bit (step))
 + {
 +   code = LT_EXPR;
 +   extreme = lower_bound_in_type (type, type);
 + }
 +   else
 + {
 +   code = GT_EXPR;
 +   extreme = upper_bound_in_type (type, type);
 + }
 +   extreme = fold_build2 (MINUS_EXPR, type, extreme, step);
 +   e = fold_build2 (code, boolean_type_node, base, extreme);

 looks like you are actually computing base + step <= UPPER_BOUND (type)
 so maybe adjust the comment.  But as both step and UPPER_BOUND (type)
 are constants, why not compute it the way the comment specifies it?  Comparison
 codes also don't match the comment and we try to prove the condition is false.
I tried to prove the condition is satisfied by proving that the reverse
condition (base > UPPER_BOUND (type) - step) is false here.  In the
updated patch, I revised comments to reflect that logic.  Is it ok?


 This also reminds me of eventually pushing forward my idea of strengthening
 simplify_using_initial_
 conditions by using the VRP machinery (I have a small
 prototype patch for that).
Interesting.  If I understand correctly, VRP info is held for an ssa var
on a global scope basis?  The loop's initial condition may strengthen
var's range information, rather than the contrary.  Actually I tried
to refine vrp info in scope of loop and used the refined vrp
information in loop optimizer 

Re: [PATCH][ARM] Add debug dumping of cost table fields

2015-05-27 Thread Bin.Cheng
On Wed, May 27, 2015 at 4:39 PM, Andrew Pinski pins...@gmail.com wrote:
 On Wed, May 27, 2015 at 4:38 PM, Kyrill Tkachov kyrylo.tkac...@arm.com 
 wrote:
 Ping.
 https://gcc.gnu.org/ml/gcc-patches/2015-05/msg00054.html

 This and the one in AARCH64 are too noisy.  Can we have an option to
 turn this on and default to turning them off?

Agreed.  Actually I once filed a PR about this enormous dump
information in gimple dumps.

Thanks,
bin

 Thanks,
 Andrew


 Thanks,
 Kyrill

 On 01/05/15 15:31, Kyrill Tkachov wrote:

 Hi all,

 This patch adds a macro to wrap cost field accesses into a helpful debug
 dump,
 saying which field is being accessed at what line and with what values.
 This helped me track down cases where the costs were doing the wrong thing
 by allowing me to see which path in arm_new_rtx_costs was taken.
 For example, the combine log might now contain:

 Trying 2 - 6:
 Successfully matched this instruction:
 (set (reg:SI 115 [ D.5348 ])
   (neg:SI (reg:SI 0 r0 [ a ])))
 using extra_cost->alu.arith with cost 0 from line 10506

 which can be useful in debugging the rtx costs.
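
 For reference, a minimal sketch of the idea -- hypothetical, the macro in
 the patch may differ in detail -- is to wrap each cost-table field access
 so the access site reports itself:

   #define DBG_COST(F) \
     (dump_file \
      ? (fprintf (dump_file, "using " #F " with cost %d from line %d\n", \
                  (int) (F), __LINE__), (F)) \
      : (F))

 Accessing extra_cost->alu.arith through such a macro produces exactly the
 kind of dump line quoted above.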

 Bootstrapped and tested on arm.

 Ok for trunk?

 Thanks,
 Kyrill


 2015-05-01  Kyrylo Tkachov  kyrylo.tkac...@arm.com

   * config/arm/arm.c (DBG_COST): New macro.
   (arm_new_rtx_costs): Use above.




Re: [PATCH PR65447]Improve IV handling by grouping address type uses with same base and step

2015-05-27 Thread Bin.Cheng
On Wed, May 27, 2015 at 11:44 PM, Kyrill Tkachov kyrylo.tkac...@arm.com wrote:
 Hi Bin,


 On 08/05/15 11:47, Bin Cheng wrote:

 Hi,
 GCC's IVO currently handles every IV use independently, which is not
 right, as we learned from cases reported in PR65447.

 The rationale is:
 1) Lots of address type IVs refer to the same memory object, share similar
 base and have same step.  We should handle these IVs as a group in order
 to
 maximize CSE opportunities, prefer reg+offset addressing mode.
 2) GCC's IVO algorithm is expensive and is only run when the candidate set is
 small enough.  By grouping same family uses, we can decrease the number of
 both uses and candidates.  Before this patch, number of candidates for
 PR65447 is too big to run expensive IVO algorithm, resulting in bad
 assembly
 code on targets like AArch64 and Mips.
 3) Even for cases the assembly code isn't improved, we can still get
 compilation time benefit with this patch.
 4) This is a prerequisite for enabling auto-increment support in IVO on
 AArch64.

 For now, this is only done to address type IVs, in the future I may extend
 it to general IVs too.
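
 As a hypothetical illustration (not the PR65447 testcase itself): in a
 loop like

   void
   f (float *p, float *q, int n)
   {
     int i;
     for (i = 0; i < n; i++, p += 4, q += 2)
       {
         q[0] = p[0] + p[1];
         q[1] = p[2] + p[3];
       }
   }

 the four loads share base 'p' and step 16 while the two stores share base
 'q' and step 8, so IVO sees two groups instead of six independent uses,
 and each group member can be rewritten as reg+offset off one candidate.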

 For AArch64:
 Benchmarks 470.lbm/spec2k6 and 173.applu/spec2k are improved obviously by
 this patch.  A couple of cases from spec2k/fp appear regressed.  I looked
 into generated assembly code and can confirm the regression is false alarm
 except one case (189.lucas).  For that case, I think it's another issue
 exposed by this patch (GCC failed to CSE candidate setup code, resulting
 in
 bloated loop header).  Anyway, I also fine-tuned the patch to minimize
 the
 impact.

 For AArch32, this patch seems to be able to improve spec2kfp too, but I
 didn't look deep into it.  I guess the reason is it can make life for
 auto-increment support in IVO better.

 One of the defects of this patch is that the computation of max offset in
 compute_max_addr_offset is basically borrowed from get_address_cost.  The
 comment says we should find a better way to compute all information.
 People
 also complained we need to refactor that part of code.  I don't have good
 solution to that yet, though I did try best to keep
 compute_max_addr_offset
 simple.

 I believe this is a generally wanted change, bootstrap and test on x86_64
 and AArch64, so is it ok?


 2015-05-08  Bin Cheng  bin.ch...@arm.com

 PR tree-optimization/65447
 * tree-ssa-loop-ivopts.c (struct iv_use): New fields.
 (dump_use, dump_uses): Support to dump sub use.
 (record_use): New parameters to support sub use.  Remove call to
 dump_use.
 (record_sub_use, record_group_use): New functions.
 (compute_max_addr_offset, split_all_small_groups): New functions.
 (group_address_uses, rewrite_use_address): New functions.
 (strip_offset): New declaration.
 (find_interesting_uses_address): Call record_group_use.
 (add_candidate): New assertion.
 (infinite_cost_p): Move definition forward.
 (add_costs): Check INFTY cost and return immediately.
 (get_computation_cost_at): Clear setup cost and dependent bitmap
 for sub uses.
 (determine_use_iv_cost_address): Compute cost for sub uses.
 (rewrite_use_address_1): Rename from old rewrite_use_address.
 (free_loop_data): Free sub uses.
 (tree_ssa_iv_optimize_loop): Call group_address_uses.

 gcc/testsuite/ChangeLog
 2015-05-08  Bin Cheng  bin.ch...@arm.com

 PR tree-optimization/65447
 * gcc.dg/tree-ssa/pr65447.c: New test.


 I see this test failing on arm-none-eabi with a compiler at r223737.
 My configure options are: --enable-checking=yes --with-newlib
 --with-fpu=neon-fp-armv8 --with-arch=armv8-a --without-isl

Hi Kyrill,
Thank you for reporting this.  I will have a look.

Thanks,
bin

 Kyrill




[PATCH GCC]Improve overflow in scev by using information computed in loop niter, part II

2015-05-26 Thread Bin.Cheng
Hi,
The first part patch improving how we handle overflow in scev is
posted at https://gcc.gnu.org/ml/gcc-patches/2015-05/msg01795.html .
Here comes the second part patch.

This patch does below improvements:
  1) Computes and records control iv for each loop's exit edge.  This
provides a way to compute overflow information in loop niter and use
it in different customers.  I think it's useful, especially with
option -funsafe-loop-optimizations.
  2) Improve chrec_convert by adding new interface
loop_exits_before_overflow.  It checks if a converted IV overflows
wrt its type and loop using overflow information of loop's control
iv.  This basically propagates no-overflow information from control iv
to ivs converted from control iv.  Moreover, we can further improve
the logic by using possible VRP information in the future.

With this patch, cases like scev-9.c and scev-10.c in patch can be
handled now.  Cases reported in PR48052 can be vectorized too.
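
The shape of loop this enables is roughly as below (illustrative, written
from the description above rather than copied from the new tests):

  int a[128];

  void
  foo (unsigned char len)
  {
    unsigned char i;
    for (i = 0; i < len; i++)
      a[i] = i;
  }

Here 'i' is the loop's control iv and is known not to wrap, so the
implicit widening of 'i' in the array index can be converted to an affine
scev and the loop vectorized on lp64 targets too.
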
Opinions?

Thanks,
bin


2015-05-26  Bin Cheng  bin.ch...@arm.com

* cfgloop.h (struct control_iv): New.
(struct loop): New field control_ivs.
* tree-ssa-loop-niter.c: Include stor-layout.h.
(number_of_iterations_lt): Set no_overflow information.
(number_of_iterations_exit): Init control iv in niter struct.
(record_control_iv): New.
(estimate_numbers_of_iterations_loop): Call record_control_iv.
(loop_exits_before_overflow): New.  Interface factored out of
scev_probably_wraps_p.
(scev_probably_wraps_p): Factor loop niter related code into
loop_exits_before_overflow.
(free_numbers_of_iterations_estimates_loop): Free control ivs.
* tree-ssa-loop-niter.h (free_loop_control_ivs): New.

gcc/testsuite/ChangeLog
2015-05-26  Bin Cheng  bin.ch...@arm.com

PR tree-optimization/48052
* gcc.dg/tree-ssa/scev-8.c: New.
* gcc.dg/tree-ssa/scev-9.c: New.
* gcc.dg/tree-ssa/scev-10.c: New.
* gcc.dg/vect/pr48052.c: New.


Re: [PATCH GCC]Improve how we handle overflow for type conversion in scev/ivopts, part I

2015-05-24 Thread Bin.Cheng
On Fri, May 22, 2015 at 7:45 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Wed, May 20, 2015 at 11:41 AM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 As we know, GCC is too conservative when checking overflow behavior in SCEV
 and loop related optimizers.  The result is that some variables can't be
 recognized as scalar evolutions and thus optimizations are missed.  To be specific,
 optimizers like ivopts and vectorizer are affected.
 This issue is more severe on 64 bit platforms; for example, PR62173
 fails on aarch64, and scev-3.c and scev-4.c were marked as XFAIL on lp64
 platforms.

 As the first part to improve overflow checking in GCC, this patch does below
 improvements:
   1) Ideally, chrec_convert should be responsible to convert scev like
 (type){base, step} to scev like {(type)base, (type)step} when the result
 scev doesn't overflow; chrec_convert_aggressive should do the conversion if
 the result scev could overflow/wrap.  Unfortunately, current implementation
 may use chrec_convert_aggressive to return a scev that won't overflow.  This
 is because of a) the static parameter fold_conversions for
 instantiate_scev_convert can only track whether chrec_convert_aggressive
 may be called, rather than if it does some overflow conversion or not;  b)
 the implementation of instantiate_scev_convert sometimes shortcuts the call
 to chrec_convert and misses conversion opportunities.  This patch improves
 this.
   2) iv->no_overflow computed in simple_iv is too conservative.  With 1)
 fixed, iv->no_overflow should reflect whether chrec_convert_aggressive does
 return an overflow scev.  This patch improves this.
   3) chrec_convert should be able to prove the resulting scev won't overflow
 with loop niter information.  This patch doesn't finish this, but it
 factored a new interface out of scev_probably_wraps_p for future
 improvement.  And that will be the part II patch.

 With the improvements in SCEV, this patch also improves optimizer(IVOPT)
 that uses scev information like below:
   For array reference in the form of arr[IV], GCC tries to derive new
 address iv {arr+iv.base, iv.step*elem_size} from IV.  If IV overflows wrt a
 type that is narrower than the address space, this derivation is not true
 because arr[IV] isn't a scev.  The root cause why the scev-*.c tests fail now
 is that the overflow information of IV is too conservative.  IVOPT has to be
 conservative to reject arr[IV] as a scev.  With more accurate overflow
 information, IVOPT can be improved too.  So this patch fixes the mentioned
 long standing issues.

 Bootstrap and test on x86_64, x86 and aarch64.
 BTW, test gcc.target/i386/pr49781-1.c failed on x86_64, but I can confirm
 it's not this patch's fault.

 So what's your opinion on this?

 I maybe mixing things up but does

 +chrec_convert_aggressive (tree type, tree chrec, bool *fold_conversions)
  {
 ...
 +  if (evolution_function_is_affine_p (chrec))
 +{
 +  tree base, step;
 +  struct loop *loop;
 +
 +  loop = get_chrec_loop (chrec);
 +  base = CHREC_LEFT (chrec);
 +  step = CHREC_RIGHT (chrec);
 +  if (convert_affine_scev (loop, type, base, step, NULL, true))
 +   return build_polynomial_chrec (loop->num, base, step);

 ^^^ not forget to set *fold_conversions to true?  Or we need to use
 convert_affine_scev (..., false)?

Nice catch.  It's supposed to be called only if source scev has no
overflow behavior introduced by previous call to
chrec_convert_aggressive.  In other words, it should be guarded by
!*fold_conversions like below:

+
+  if (!*fold_conversions && evolution_function_is_affine_p (chrec))
+{
+  tree base, step;
+  struct loop *loop;
+
+  loop = get_chrec_loop (chrec);
+  base = CHREC_LEFT (chrec);
+  step = CHREC_RIGHT (chrec);
+  if (convert_affine_scev (loop, type, base, step, NULL, true))
+    return build_polynomial_chrec (loop->num, base, step);
+}

The scenario is rare enough that it wasn't exposed in either bootstrap or reg-test.

Here is the updated patch without any other difference.  Bootstrap and
test on x86_64 and AArch64.

Thanks,
bin

 +}

 (bah, and the diff somehow messes up -p context :/  which is why I like
 context diffs more)

 Other from the above the patch looks good to me.

 Thanks,
 Richard.

 Thanks,
 bin

 2015-05-20  Bin Cheng  bin.ch...@arm.com

 PR tree-optimization/62173
 * tree-ssa-loop-ivopts.c (struct iv): New field.  Reorder fields.
 (alloc_iv, set_iv): New parameter.
 (determine_biv_step): Delete.
 (find_bivs): Inline original determine_biv_step.  Pass new
 argument to set_iv.
 (idx_find_step): Use no_overflow information for conversion.
 * tree-scalar-evolution.c (analyze_scalar_evolution_in_loop): Let
 resolve_mixers handle folded_casts.
 (instantiate_scev_name): Change bool parameter to bool pointer.
 (instantiate_scev_poly, instantiate_scev_binary): Ditto.
 (instantiate_array_ref, 

Re: [Patch] [AArch64] PR target 66049: fix add/extend gcc test suite failures

2015-05-22 Thread Bin.Cheng
On Fri, May 22, 2015 at 4:58 PM, Kyrill Tkachov
kyrylo.tkac...@foss.arm.com wrote:
 Hi Venkat,


 On 22/05/15 09:50, Kumar, Venkataramanan wrote:

 Hi Kyrill,

 Sorry for little delay in responding.

 -Original Message-
 From: Kyrill Tkachov [mailto:kyrylo.tkac...@foss.arm.com]
 Sent: Tuesday, May 19, 2015 9:13 PM
 To: Kumar, Venkataramanan; James Greenhalgh; gcc-patches@gcc.gnu.org
 Cc: Ramana Radhakrishnan; seg...@kernel.crashing.org; Marcus Shawcroft
 Subject: Re: [Patch] [AArch64] PR target 66049: fix add/extend gcc test
 suite
 failures

 Hi Venkat,

 On 19/05/15 16:37, Kumar, Venkataramanan wrote:

 Hi Maintainers,

 Please find the attached patch, that fixes add/extend gcc test suite
 failures

 in Aarch64 target.

 Ref: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66049

 These tests started to fail after we prevented combiner from converting

 shift RTX to mult RTX, when the RTX is not inside a memory operation
 (r222874) .

 Now I have added new add/extend patterns which are based on shift

 operations,  to fix these cases.

 Testing status with the patch.

 (1) GCC bootstrap on AArch64 successful.
 (2)  SPEC2006 INT runs did not show any degradation.

 Does that mean there was no performance regression? Or no codegen
 difference?

 Yes there was no performance regression.


 What I'd expect from this patch is that the codegen would be the same as
 before the combine patch
 (r222874). A performance difference can sometimes be hard to measure
 even at worse code quality.
 Can you please confirm that on SPEC2006 INT the adds and shifts are now
 back to being combined
 into a single instruction?

 I used -dp --save-temps to dump pattern names in assembly files.
 I used the revision before the combiner patch (r222873) and the patch on top of
 trunk (a little older, r223194) for comparison.

 Quickly compared the counts generated for the below patterns from the SPEC
 INT binaries.

 *adds_<optab><mode>_multp2 vs *adds_<optab><ALLX:mode>_shft_<GPI:mode>
 *subs_<optab><mode>_multp2 vs *subs_<optab><ALLX:mode>_shft_<GPI:mode>
 *add_uxt<mode>_multp2  vs  *add_uxt<mode>_shift2
 *add_uxtsi_multp2_uxtw vs  *add_uxtsi_shift2_uxtw
 *sub_uxt<mode>_multp2  vs  *sub_uxt<mode>_shift2
 *sub_uxtsi_multp2_uxtw vs  *sub_uxtsi_shift2_uxtw
 *adds_mul_imm_<mode> vs *adds_shift_imm_<mode>
 *subs_mul_imm_<mode> vs *subs_shift_imm_<mode>

 Patterns are found in a few benchmarks.  The generated counts match
 in the binaries compiled with the two revisions.

   Also looked at the generated assembly, and the adds and shifts are combined
 properly using the new patterns.  Please let me know if this is OK.


 Thanks for investigating this. I've run your patch on aarch64.exp and I see
 the add+shift/extend failures
 that we were seeing go away (I'm sure you saw that as well ;).
Exactly what's expected.  This patch and the previous combine one are
designed to fix those failures.

Thanks,
bin
 Up to the maintainers to review the patch.

 Thanks,
 Kyrill



 Thanks,
 Kyrill

 (3) gcc regression testing passed.

 (-Snip-)
 # Comparing 3 common sum files
 ## /bin/sh ./gcc-fsf-trunk/contrib/compare_tests /tmp/gxx-sum1.24998 /tmp/gxx-sum2.24998

 Tests that now work, but didn't before:

 gcc.target/aarch64/adds1.c scan-assembler adds\tw[0-9]+, w[0-9]+, w[0-9]+, lsl 3
 gcc.target/aarch64/adds1.c scan-assembler adds\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 3
 gcc.target/aarch64/adds3.c scan-assembler-times adds\tx[0-9]+, x[0-9]+, x[0-9]+, sxtw 2
 gcc.target/aarch64/extend.c scan-assembler add\tw[0-9]+,.*uxth #?1
 gcc.target/aarch64/extend.c scan-assembler add\tx[0-9]+,.*uxtw #?3
 gcc.target/aarch64/extend.c scan-assembler sub\tw[0-9]+,.*uxth #?1
 gcc.target/aarch64/extend.c scan-assembler sub\tx[0-9]+,.*uxth #?1
 gcc.target/aarch64/extend.c scan-assembler sub\tx[0-9]+,.*uxtw #?3
 gcc.target/aarch64/subs1.c scan-assembler subs\tw[0-9]+, w[0-9]+, w[0-9]+, lsl 3
 gcc.target/aarch64/subs1.c scan-assembler subs\tx[0-9]+, x[0-9]+, x[0-9]+, lsl 3
 gcc.target/aarch64/subs3.c scan-assembler-times subs\tx[0-9]+, x[0-9]+, x[0-9]+, sxtw 2

 # No differences found in 3 common sum files
 (-Snip-)

 The patterns are fixing the regressing tests, so I have not added any
 new

 tests.

 Regarding removal of the old patterns based on mults, I am planning
 to

 do it as a separate work.

 Is this OK for trunk ?

 gcc/ChangeLog

 2015-05-19  Venkataramanan Kumar  venkataramanan.ku...@amd.com

   * config/aarch64/aarch64.md
   (*adds_shift_imm_<mode>):  New pattern.
   (*subs_shift_imm_<mode>):  Likewise.
   (*adds_<optab><ALLX:mode>_shift_<GPI:mode>):  Likewise.
   (*subs_<optab><ALLX:mode>_shift_<GPI:mode>): Likewise.
   (*add_uxt<mode>_shift2): Likewise.
   (*add_uxtsi_shift2_uxtw): Likewise.
   (*sub_uxt<mode>_shift2): Likewise.
   (*sub_uxtsi_shift2_uxtw): Likewise.


 Regards,
 Venkat.




 Regards,
 Venkat,




Re: Fix PR48052: loop not vectorized if index is unsigned int

2015-05-19 Thread Bin.Cheng
On Wed, May 6, 2015 at 7:02 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Mon, May 4, 2015 at 9:47 PM, Abderrazek Zaafrani
 az.zaafr...@gmail.com wrote:
 This is an old thread and we are still running into similar issues:
 Code is not being vectorized on 64-bit target due to scev not being
 able to optimally analyze overflow condition.

 While the original test case shown here seems to work now, it does not
 work if the start value is not a constant and the loop index variable
 is of unsigned type: Ex

 void loop2( double const * __restrict__ x_in, double * __restrict__
 x_out, double const * __restrict__ c, unsigned int N, unsigned int
 start) {
  for(unsigned int i=start; i!=N; ++i)
x_out[i] = c[i]*x_in[i];
 }

 Here is our unit test:

 int foo(int* A, int* B, unsigned start, unsigned B)
 {
   int s;
   for (unsigned k = start; k < start+B; k++)
 s += A[k] * B[k];
   return s;
 }

 Our unit test case is extracted from a matrix multiply of a
 two-dimensional array and all loops are blocked by hand by a factor of
 B. Even though a bit modified, above loop corresponds to the innermost
 loop of the blocked matrix multiply.

 We worked on patch to solve the problem (see attachment.)
 The attached patch passed bootstrap and make check on x86_64-linux.
 Ok for trunk?

 Apart from coding style / API issues the case you handle is very special
 (IVs with step 1 only?!) I believe it is also wrong - the assumption that
 if there is a symbolic or constant expression for the number of iterations
 a BIV will not wrap is not true.  niter analysis can very well compute
 the number of iterations for a loop with wrapping IVs.  For your unit test
 this only works because of the special-casing of step 1 IVs.
I happen to be looking into a similar issue right now.  scev_probably_wraps_p
and thus chrec_convert_1 should be improved using niter information.
Actually all information (and the wrap behavior) has already been
computed in tree-ssa-loop-niter.c.  We just need to find a way to use
it.


 Technically it might be more interesting to compute wrapping of IVs
  during niter analysis in some more generic way (we have iv->no_overflow
 computed by simple_iv, but that is rather not useful here).

For it, iv->no_overflow is computed in simple_iv as below:
  tmp = analyze_scalar_evolution (use_loop, ev);
  ev = resolve_mixers (use_loop, tmp);

  if (folded_casts && tmp != ev)
*folded_casts = true;

It's inaccurate because calling resolve_mixers doesn't mean the result
scev will wrap.  resolve_mixers could have just done exactly the same
transformation as instantiate_parameters.  Also
chrec_convert_aggressive is incomplete and needs to be revised too.

Thanks,
bin

 Richard.

 Thanks,
 Abderrazek Zaafrani


Re: [PATCH PR65447]Improve IV handling by grouping address type uses with same base and step

2015-05-14 Thread Bin.Cheng
On Wed, May 13, 2015 at 7:38 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Fri, May 8, 2015 at 12:47 PM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 GCC's IVO currently handles every IV use independently, which is not
 right, as we learned from cases reported in PR65447.

 The rationale is:
 1) Lots of address type IVs refer to the same memory object, share similar
 base and have same step.  We should handle these IVs as a group in order to
 maximize CSE opportunities, prefer reg+offset addressing mode.
 2) GCC's IVO algorithm is expensive and is only run when the candidate set is
 small enough.  By grouping same family uses, we can decrease the number of
 both uses and candidates.  Before this patch, number of candidates for
 PR65447 is too big to run expensive IVO algorithm, resulting in bad assembly
 code on targets like AArch64 and Mips.
 3) Even for cases the assembly code isn't improved, we can still get
 compilation time benefit with this patch.
 4) This is a prerequisite for enabling auto-increment support in IVO on
 AArch64.

 For now, this is only done to address type IVs, in the future I may extend
 it to general IVs too.

 For AArch64:
 Benchmarks 470.lbm/spec2k6 and 173.applu/spec2k are improved obviously by
 this patch.  A couple of cases from spec2k/fp appear regressed.  I looked
 into generated assembly code and can confirm the regression is false alarm
 except one case (189.lucas).  For that case, I think it's another issue
 exposed by this patch (GCC failed to CSE candidate setup code, resulting in
 bloated loop header).  Anyway, I also fine-tuned the patch to minimize the
 impact.

 For AArch32, this patch seems to be able to improve spec2kfp too, but I
 didn't look deep into it.  I guess the reason is it can make life for
 auto-increment support in IVO better.

 One of the defects of this patch is that the computation of max offset in
 compute_max_addr_offset is basically borrowed from get_address_cost.  The
 comment says we should find a better way to compute all information.  People
 also complained we need to refactor that part of code.  I don't have good
 solution to that yet, though I did try best to keep compute_max_addr_offset
 simple.

 I believe this is a generally wanted change, bootstrap and test on x86_64
 and AArch64, so is it ok?

 I'm a little bit worried about the linked list of sub-uses and the sorting
 (that's quadratic).  A little.  I don't have any good idea but to use a 
 tree...
 We don't seem to limit the number of sub-uses (if we'd do that it would
 become O(1)).

 Similar is searching in the list of uses for a group with same base/step
 (but ISTR IVOPTs has multiple similar loops?)

Hi Richard,
Thanks for reviewing.  Instead of tree, I can also keep the linked
list, then quick sort it by using a vector as temporary storage.  This
can avoid the complexity of BST operations without non-trivial
overhead.

For the searching routine, a local hash table could help.
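
Something along these lines -- a standalone sketch with made-up types,
only to show the vector-assisted sort, not the actual ivopts code:

  #include <stdlib.h>

  struct use { long offset; struct use *next; };

  static int
  cmp_offset (const void *a, const void *b)
  {
    const struct use *x = *(const struct use *const *) a;
    const struct use *y = *(const struct use *const *) b;
    return (x->offset > y->offset) - (x->offset < y->offset);
  }

  /* Sort a singly-linked list by copying node pointers into a scratch
     array, qsort-ing that, and relinking the nodes.  */
  static struct use *
  sort_uses (struct use *head)
  {
    size_t n = 0, i = 0;
    struct use *u, **tmp;

    for (u = head; u; u = u->next)
      n++;
    if (n < 2)
      return head;
    tmp = (struct use **) malloc (n * sizeof *tmp);
    for (u = head; u; u = u->next)
      tmp[i++] = u;
    qsort (tmp, n, sizeof *tmp, cmp_offset);
    for (i = 0; i + 1 < n; i++)
      tmp[i]->next = tmp[i + 1];
    tmp[n - 1]->next = NULL;
    head = tmp[0];
    free (tmp);
    return head;
  }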

 Overall the patch looks like a good improvement to how we do IVO, so I think
 it is ok as-is.
Do you want me to do this change now, or can I pick it up later when
dealing with the compilation time issue?  Also it would be nice if any
compilation time issue is reported after applying this version of the
patch, because we will have an IVO compilation time benchmark then.

Thanks,
bin


 Thanks,
 Richard.



 2015-05-08  Bin Cheng  bin.ch...@arm.com

 PR tree-optimization/65447
 * tree-ssa-loop-ivopts.c (struct iv_use): New fields.
 (dump_use, dump_uses): Support to dump sub use.
 (record_use): New parameters to support sub use.  Remove call to
 dump_use.
 (record_sub_use, record_group_use): New functions.
 (compute_max_addr_offset, split_all_small_groups): New functions.
 (group_address_uses, rewrite_use_address): New functions.
 (strip_offset): New declaration.
 (find_interesting_uses_address): Call record_group_use.
 (add_candidate): New assertion.
 (infinite_cost_p): Move definition forward.
 (add_costs): Check INFTY cost and return immediately.
 (get_computation_cost_at): Clear setup cost and dependent bitmap
 for sub uses.
 (determine_use_iv_cost_address): Compute cost for sub uses.
 (rewrite_use_address_1): Rename from old rewrite_use_address.
 (free_loop_data): Free sub uses.
 (tree_ssa_iv_optimize_loop): Call group_address_uses.

 gcc/testsuite/ChangeLog
 2015-05-08  Bin Cheng  bin.ch...@arm.com

 PR tree-optimization/65447
 * gcc.dg/tree-ssa/pr65447.c: New test.


Re: [PATCH, GCC, stage1] Fallback to copy-prop if constant-prop not possible

2015-04-30 Thread Bin.Cheng
On Fri, Apr 24, 2015 at 12:52 PM, Thomas Preud'homme
thomas.preudho...@arm.com wrote:
 From: Jeff Law [mailto:l...@redhat.com]
 Sent: Friday, April 24, 2015 11:15 AM

 So revised review is ok for the trunk :-)

 Committed.
Hi Thomas,
The newly introduced test failed on
arm-none-linux-gnueabi and arm-none-linux-gnueabihf.  Could you please
have a look at it?
FAIL: gcc.target/arm/pr64616.c scan-assembler-times ldr 2

GCC was configured with
gcc/configure --target=arm-none-linux-gnueabi --prefix=
--with-sysroot=... --enable-shared --disable-libsanitizer
--disable-libssp --disable-libmudflap
--with-plugin-ld=arm-none-linux-gnueabi-ld --enable-checking=yes
--enable-languages=c,c++,fortran --with-gmp=... --with-mpfr=...
--with-mpc=... --with-isl=... --with-cloog=... --with-arch=armv7-a
--with-fpu=vfpv3-d16 --with-float=softfp --with-arch=armv7-a

Thanks,
bin


 Best regards,

 Thomas





Re: Mostly rewrite genrecog

2015-04-30 Thread Bin.Cheng
On Mon, Apr 27, 2015 at 6:20 PM, Richard Sandiford
richard.sandif...@arm.com wrote:
 I think it's been the case for a while that parallel builds of GCC tend
 to serialise around the compilation of insn-recog.c, especially with
 higher --enable-checking settings.  This patch tries to speed that
 up by replacing most of genrecog with a new algorithm.

 I think the main problems with the current code are:

 1. Vector architectures have added lots of new instructions that have
a similar shape and differ only in mode, code or unspec number.
The current algorithm doesn't have any way of factoring out those
similarities.

 2. When matching a particular instruction, the current code examines
everything about a SET_DEST before moving on to the SET_SRC.  This has
two subproblems:

2a. The destination of a SET isn't very distinctive.  It's usually
just a register_operand, a memory_operand, a nonimmediate_operand
or a flags register.  We therefore tend to backtrack to the
SET_DEST a lot, oscillating between groups of instructions with
the same kind of destination.

2b. Backtracking through predicate checks is relatively expensive.
It would be good to narrow down the shape of the instruction
first and only then check the predicates.  (The backtracking is
expensive in terms of insn-recog.o compile time too, both because
we need to copy into argument registers and out of the result
register, and because it adds more sites where spills are needed.)

 3. The code keeps one local variable per rtx depth, so it ends up
loading the same rtx many times over (mostly when backtracking).
This is very expensive in rtl-checking builds because each XEXP
includes a code check and a line-specific failure call.

In principle the idea of having one local variable per depth
is good.  But it was originally written that way when all optimisations
were done at the rtl level and I imagine each local variable mapped
to one pseudo register.  These days the statements that reload the
value needed on backtracking lead to many more SSA names and phi
statements than you'd get with just a single variable per position
(loaded once, so naturally SSA).  There is still the potential benefit
of avoiding having sibling rtxes live at once, but fixing (2) above
reduces that problem.

 Also, the code is all goto-based, which makes it rather hard to step through.

 The patch deals with these as follows:

 1. Detect subpatterns that differ only by mode, code and/or integer
(e.g. unspec number) and split them out into a common routine.

 2. Match the shape of the instruction first, in terms of codes,
integers and vector lengths, and only then check the modes, predicates
and dups.  When checking the shape, handle SET_SRCs before SET_DESTs.
In practice this seems to greatly reduce the amount of backtracking.

 3. Have one local variable per rtx position.  I tested the patch with
and without the change and it helped a lot with rtl-checking builds
without seeming to affect release builds much either way.

 As far as debuggability goes, the new code avoids gotos and just
 uses natural control flow.

 The headline stat is that a stage 3 --enable-checking=yes,rtl,df
 build of insn-recog.c on my box goes from 7m43s to 2m2s (using the
 same stage 2 compiler).  The corresponding --enable-checking=release
 change is from 49s to 24s (less impressive, as expected).

 The patch seems to speed up recog.  E.g. the time taken to build
 fold-const.ii goes from 6.74s before the patch to 6.69s after it;
 not a big speed-up, but reproducible.

 Here's a comparison of the number of lines of code in insn-recog.c
 before and after the patch on one target per config/ CPU:

 aarch64-linux-gnueabi   115526    38169 :  33.04%
 alpha-linux-gnu          24479    10740 :  43.87%
 arm-linux-gnueabi       169208    67759 :  40.04%
 avr-rtems                55647    22127 :  39.76%
 bfin-elf                 13928     6498 :  46.65%
 c6x-elf                  29928    13324 :  44.52%
 cr16-elf                  2650     1419 :  53.55%
 cris-elf                 18669     7257 :  38.87%
 epiphany-elf             19308     6131 :  31.75%
 fr30-elf                  2204     1112 :  50.45%
 frv-linux-gnu            13541     5950 :  43.94%
 h8300-elf                19584     9327 :  47.63%
 hppa64-hp-hpux11.23      18299     8549 :  46.72%
 ia64-linux-gnu           37629    17101 :  45.45%
 iq2000-elf                2752     1609 :  58.47%
 lm32-elf 

Re: [PATCH] Fix for PR26702: Emit .size for BSS variables on arm-eabi

2015-04-30 Thread Bin.Cheng
On Thu, Apr 23, 2015 at 10:51 PM, Ramana Radhakrishnan
ramana@googlemail.com wrote:
 On Mon, Mar 30, 2015 at 9:25 PM, Kwok Cheung Yeung k...@codesourcery.com 
 wrote:
 This is a simple patch that ensures that a .size directive is emitted when
 space is allocated for a static variable in the BSS on bare-metal ARM
 targets. This allows other tools such as GDB to look up the size of the
 object correctly.

 Before:

 $ readelf -s pr26702.o

 Symbol table '.symtab' contains 10 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 ...
  6:  0 NOTYPE  LOCAL  DEFAULT  3 static_foo
 ...

 After:

 $ readelf -s pr26702.o

 Symbol table '.symtab' contains 10 entries:
   Num:    Value  Size Type    Bind   Vis      Ndx Name
 ...
  6:  4 NOTYPE  LOCAL  DEFAULT  3 static_foo
 ...

 The testsuite has been run with a i686-pc-linux-gnu hosted cross-compiler
 targeted at arm-none-eabi with no regressions.

 Kwok


 2015-03-30  Kwok Cheung Yeung  k...@codesourcery.com

 gcc/
 PR target/26702
 * config/arm/unknown-elf.h (ASM_OUTPUT_ALIGNED_DECL_LOCAL): Emit
 size of local.

 gcc/testsuite/
 PR target/26702
 * gcc.target/arm/pr26702.c: New test.

 Index: gcc/testsuite/gcc.target/arm/pr26702.c
 ===
 --- gcc/testsuite/gcc.target/arm/pr26702.c  (revision 0)
 +++ gcc/testsuite/gcc.target/arm/pr26702.c  (revision 0)
 @@ -0,0 +1,4 @@
 +/* { dg-do compile { target arm*-*-eabi* } } */
 +/* { dg-final { scan-assembler "\\.size\[\\t \]+static_foo, 4" } } */
 +int foo;
 +static int static_foo;
 Index: gcc/config/arm/unknown-elf.h
 ===
 --- gcc/config/arm/unknown-elf.h(revision 447549)
 +++ gcc/config/arm/unknown-elf.h(working copy)
 @@ -81,6 +81,8 @@
ASM_OUTPUT_ALIGN (FILE, floor_log2 (ALIGN / BITS_PER_UNIT)); \
ASM_OUTPUT_LABEL (FILE, NAME);   \
   fprintf (FILE, "\t.space\t%d\n", SIZE ? (int)(SIZE) : 1);
 \
 +  fprintf (FILE, "\t.size\t%s, %d\n",  \
 +  NAME, SIZE ? (int)(SIZE) : 1);   \
  }  \
while (0)



 Now applied as attached with the following modifications.

 Sorry about the delay - I've been away for a bit and couldn't attend
 to committing this.

Hi Kwok,
The newly introduced test case failed on
arm-none-linux-gnueabi and arm-none-linux-gnueabihf.  Could you please
have a look at it?

FAIL: gcc.target/arm/pr26702.c scan-assembler \\.size[\\t ]+static_foo, 4

PR65937 is filed for tracking this.

Thanks,
bin



 Thanks
 Ramana


Re: [PATCH, x86] Add TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE hook

2015-04-30 Thread Bin.Cheng
On Mon, Apr 27, 2015 at 8:01 PM, Uros Bizjak ubiz...@gmail.com wrote:
 On Wed, Feb 4, 2015 at 2:21 PM, Christian Bruel christian.br...@st.com 
 wrote:
 While trying to reduce the PR64835 case for ARM and x86, I noticed that the
 alignment flags are cleared for x86 when attribute optimize is used.

 With the attached testcases, the visible effects are twofold:

 1) Functions compiled in with attribute optimize (-O2) are not aligned as if
 they were with the -O2 flag.

 2) can_inline_edge_p fails because opts_for_fn (caller->decl) != opts_for_fn
 (callee->decl), even though they are compiled with the same optimization
 level.

 2015-02-06  Christian Bruel  christian.br...@st.com

 PR target/64835
 * config/i386/i386.c (ix86_default_align): New function.
 (ix86_override_options_after_change): Call ix86_default_align.
 (TARGET_OVERRIDE_OPTIONS_AFTER_CHANGE): New hook.
 (ix86_override_options_after_change): New function.

 2015-02-06  Christian Bruel  christian.br...@st.com

 PR target/64835
 * gcc.dg/ipa/iinline-attr.c: New test.
 * gcc.target/i386/iinline-attr-2.c: New test.

 OK for mainline.

Hi Christian,
I noticed case gcc.dg/ipa/iinline-attr.c failed on aarch64.  The
original patch is x86 specific, while the case is added as a general
one.  Could you please have a look at this?

FAIL: gcc.dg/ipa/iinline-attr.c scan-ipa-dump inline "hooray[^\\n]*inline copy in test"

Thanks,
bin

 Thanks,
 Uros


Re: [PATCH, rs6000, testsuite, PR65456] Changes for unaligned vector load/store support on POWER8

2015-04-30 Thread Bin.Cheng
On Mon, Apr 27, 2015 at 9:26 PM, Bill Schmidt
wschm...@linux.vnet.ibm.com wrote:
 On Mon, 2015-04-27 at 14:23 +0800, Bin.Cheng wrote:
 On Mon, Mar 30, 2015 at 1:42 AM, Bill Schmidt
 wschm...@linux.vnet.ibm.com wrote:


  Index: gcc/testsuite/gcc.dg/vect/vect-33.c
  ===
  --- gcc/testsuite/gcc.dg/vect/vect-33.c (revision 221118)
  +++ gcc/testsuite/gcc.dg/vect/vect-33.c (working copy)
  @@ -36,9 +36,10 @@ int main (void)
 return main1 ();
   }
 
   +/* vect_hw_misalign && { ! vect64 } */
 
    /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
   -/* { dg-final { scan-tree-dump "Vectorizing an unaligned access" "vect" { target { vect_hw_misalign && { { ! vect64 } || vect_multiple_sizes } } } } } */
   +/* { dg-final { scan-tree-dump "Vectorizing an unaligned access" "vect" { target { { { ! powerpc*-*-* } && vect_hw_misalign } && { { ! vect64 } || vect_multiple_sizes } } } } } */
    /* { dg-final { scan-tree-dump "Alignment of access forced using peeling" "vect" { target { vector_alignment_reachable && { vect64 && { ! vect_multiple_sizes } } } } } } */
    /* { dg-final { scan-tree-dump-times "Alignment of access forced using versioning" 1 "vect" { target { { { ! vector_alignment_reachable } || { ! vect64 } } && { ! vect_hw_misalign } } } } } */
    /* { dg-final { cleanup-tree-dump "vect" } } */

 Hi Bill,
 With this change, the test case is skipped on aarch64 now.  Since it
 passed before, is it expected to act like this on 64-bit platforms?

 Hi Bin,

 No, that's a mistake on my part -- thanks for the report!  That first
 added line was not intended to be part of the patch:

  +/* vect_hw_misalign && { ! vect64 } */

 Please try removing that line and verify that the patch succeeds again
 for ARM.  Assuming so, I'll prepare a patch to fix this.

 It looks like this mistake was introduced only in this particular test,
 but please let me know if you see any other anomalies.
Hi Bill,
I chased the wrong branch.  The test disappeared on fsf-48 branch in
our build, rather than trunk.  I guess it's not your patch's fault.
Will follow up and get back to you later.
Sorry for the inconvenience.

Thanks,
bin

 Thanks very much!

 Bill

 PASS-NA: gcc.dg/vect/vect-33.c -flto -ffat-lto-objects
 scan-tree-dump-times vect Vectorizing an unaligned access 0
 PASS-NA: gcc.dg/vect/vect-33.c scan-tree-dump-times vect Vectorizing
 an unaligned access 0

 Thanks,
 bin





Re: [PATCH, rs6000, testsuite, PR65456] Changes for unaligned vector load/store support on POWER8

2015-04-27 Thread Bin.Cheng
On Mon, Mar 30, 2015 at 1:42 AM, Bill Schmidt
wschm...@linux.vnet.ibm.com wrote:
 Hi,

 This is a follow-up to
 https://gcc.gnu.org/ml/gcc-patches/2015-03/msg00103.html, which adds
 support for faster unaligned vector memory accesses on POWER8.  As
 pointed out in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65456, there
 was a piece missing here.  The target macro SLOW_UNALIGNED_ACCESS is
 still evaluating to 1 for unaligned vector accesses on POWER8, which
 causes some scalarization to occur during expand.  This version of the
 patch fixes this as well.

 The only changes from before are the update to config/rs6000/rs6000.h,
 and the new test case gcc.target/powerpc/pr65456.c.  Is this ok for
 trunk after 5 branches, and backports to 4.8, 4.9, 5 thereafter?

 Thanks,
 Bill


 [gcc]

 2015-03-29  Bill Schmidt  wschm...@linux.vnet.ibm.com

 * config/rs6000/rs6000.c (rs6000_option_override_internal):  For
 VSX + POWER8, enable TARGET_ALLOW_MOVMISALIGN and
 TARGET_EFFICIENT_UNALIGNED_VSX if not selected by command line
 option.  However, for -mno-allow-movmisalign, be sure to disable
 TARGET_EFFICIENT_UNALIGNED_VSX to avoid an ICE.
 (rs6000_builtin_mask_for_load): Return 0 for targets with
 efficient unaligned VSX accesses so that the vectorizer will use
 direct unaligned loads.
 (rs6000_builtin_support_vector_misalignment): Always return true
 for targets with efficient unaligned VSX accesses.
 (rs6000_builtin_vectorization_cost): Cost of unaligned loads and
 stores on targets with efficient unaligned VSX accesses is almost
 always the same as the cost of an aligned load or store, so model
 it that way.
 * config/rs6000/rs6000.h (SLOW_UNALIGNED_ACCESS): Evaluate to
 zero for unaligned vector accesses on POWER8.
 * config/rs6000/rs6000.opt (mefficient-unaligned-vector): New
 undocumented option.

 [gcc/testsuite]

 2015-03-29  Bill Schmidt  wschm...@linux.vnet.ibm.com

 * gcc.dg/vect/bb-slp-24.c: Exclude test for POWER8.
 * gcc.dg/vect/bb-slp-25.c: Likewise.
 * gcc.dg/vect/bb-slp-29.c: Likewise.
 * gcc.dg/vect/bb-slp-32.c: Replace vect_no_align with
 vect_no_align && { ! vect_hw_misalign }.
 * gcc.dg/vect/bb-slp-9.c: Likewise.
 * gcc.dg/vect/costmodel/ppc/costmodel-slp-33.c: Exclude test for
 vect_hw_misalign.
 * gcc.dg/vect/costmodel/ppc/costmodel-vect-31a.c: Likewise.
 * gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c: Adjust tests to
 account for POWER8, where peeling for alignment is not needed.
 * gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c: Replace
 vect_no_align with vect_no_align && { ! vect_hw_misalign }.
 * gcc.dg/vect/if-cvt-stores-vect-ifcvt-18.c: Likewise.
 * gcc.dg/vect/no-scevccp-outer-6-global.c: Likewise.
 * gcc.dg/vect/no-scevccp-outer-6.c: Likewise.
 * gcc.dg/vect/no-vfa-vect-43.c: Likewise.
 * gcc.dg/vect/no-vfa-vect-57.c: Likewise.
 * gcc.dg/vect/no-vfa-vect-61.c: Likewise.
 * gcc.dg/vect/no-vfa-vect-depend-1.c: Likewise.
 * gcc.dg/vect/no-vfa-vect-depend-2.c: Likewise.
 * gcc.dg/vect/no-vfa-vect-depend-3.c: Likewise.
 * gcc.dg/vect/pr16105.c: Likewise.
 * gcc.dg/vect/pr20122.c: Likewise.
 * gcc.dg/vect/pr33804.c: Likewise.
 * gcc.dg/vect/pr33953.c: Likewise.
 * gcc.dg/vect/pr56787.c: Likewise.
 * gcc.dg/vect/pr58508.c: Likewise.
 * gcc.dg/vect/slp-25.c: Likewise.
 * gcc.dg/vect/vect-105-bit-array.c: Likewise.
 * gcc.dg/vect/vect-105.c: Likewise.
 * gcc.dg/vect/vect-27.c: Likewise.
 * gcc.dg/vect/vect-29.c: Likewise.
 * gcc.dg/vect/vect-33.c: Exclude unaligned access test for
 POWER8.
 * gcc.dg/vect/vect-42.c: Replace vect_no_align with vect_no_align
 && { ! vect_hw_misalign }.
 * gcc.dg/vect/vect-44.c: Likewise.
 * gcc.dg/vect/vect-48.c: Likewise.
 * gcc.dg/vect/vect-50.c: Likewise.
 * gcc.dg/vect/vect-52.c: Likewise.
 * gcc.dg/vect/vect-56.c: Likewise.
 * gcc.dg/vect/vect-60.c: Likewise.
 * gcc.dg/vect/vect-72.c: Likewise.
 * gcc.dg/vect/vect-75-big-array.c: Likewise.
 * gcc.dg/vect/vect-75.c: Likewise.
 * gcc.dg/vect/vect-77-alignchecks.c: Likewise.
 * gcc.dg/vect/vect-77-global.c: Likewise.
 * gcc.dg/vect/vect-78-alignchecks.c: Likewise.
 * gcc.dg/vect/vect-78-global.c: Likewise.
 * gcc.dg/vect/vect-93.c: Likewise.
 * gcc.dg/vect/vect-95.c: Likewise.
 * gcc.dg/vect/vect-96.c: Likewise.
 * gcc.dg/vect/vect-cond-1.c: Likewise.
 * gcc.dg/vect/vect-cond-3.c: Likewise.
 * gcc.dg/vect/vect-cond-4.c: Likewise.
 * gcc.dg/vect/vect-cselim-1.c: Likewise.
 * gcc.dg/vect/vect-multitypes-1.c: 

Re: [PATCH ARM]Fix pr42172-1.c failure on pre armv7 processors

2015-04-23 Thread Bin.Cheng
On Thu, Apr 23, 2015 at 4:19 PM, Kyrill Tkachov kyrylo.tkac...@arm.com wrote:

 On 22/04/15 09:42, Bin Cheng wrote:

 Hi,
 Case pr42172-1.c failed on pre-armv7 processors because GCC actually
 generates better code without ldr instruction.  This patch just refines
 test
 case by checking str instead of ldr, makes sure the case passes on all arm
 processors.  In the end, we need to fix GCC combiner to generate optimal
 code on armv7 processors too.  PR42172 is kept open for that purpose.

 This is obvious change, is it OK for branches too?


 For the record, for -mcpu=arm7tdmi we now generate:
 init_A:
 mov r3, #8
  strb    r3, [r0]
  bx      lr
That's the point, it generates optimal code for pre-armv7 processors,
which is treated as failure now.



 Ok for trunk. Is this test failing on the branches too?
Well, at least for gcc-5-branch.


 Kyrill



 gcc/testsuite/ChangeLog
 2015-04-22  Bin Cheng  bin.ch...@arm.com

 * gcc.target/arm/pr42172-1.c: Check str instead of ldr.




Re: [PATCH][PR65802] Mark ifn_va_arg with ECF_NOTHROW

2015-04-23 Thread Bin.Cheng
On Tue, Apr 21, 2015 at 3:10 PM, Tom de Vries tom_devr...@mentor.com wrote:
 Hi,

 this patch fixes PR65802.

 diff --git a/gcc/testsuite/g++.dg/
pr65802.C b/gcc/testsuite/g++.dg/pr65802.C
 new file mode 100644
 index 000..26e5317
 --- /dev/null
 +++ b/gcc/testsuite/g++.dg/pr65802.C
 @@ -0,0 +1,29 @@
 +// { dg-do compile }
 +// { dg-options -O0 }
 +
 +typedef int tf ();
 +
 +struct S
 +{
 +  tf m_fn1;
 +} a;
 +
 +void
 +fn1 ()
 +{
 +  try
 +{
 +  __builtin_va_list c;
 +  {
 + int *d = __builtin_va_arg (c, int *);
 + int **e = d;
  + __asm__("" : "=d"(e));
Hi, thanks for fixing the issue.
But 'd' is a machine specific constraint?  This case failed on all arm
processors.

Thanks,
bin
 + a.m_fn1 ();
 +  }
 +  a.m_fn1 ();
 +}
 +  catch (...)
 +{
 +
 +}
 +}

 OK for trunk?

 Thanks,
 - Tom


Re: [PING^2] [PATCH][5a/5] Postpone expanding va_arg until pass_stdarg

2015-04-20 Thread Bin.Cheng
On Thu, Apr 16, 2015 at 4:55 PM, Richard Biener rguent...@suse.de wrote:
 On Thu, 16 Apr 2015, Tom de Vries wrote:

 [stage1 ping^2]
 On 10-03-15 16:30, Tom de Vries wrote:
  [stage1 ping]
  On 22-02-15 14:13, Tom de Vries wrote:
   On 19-02-15 14:03, Richard Biener wrote:
On Thu, 19 Feb 2015, Tom de Vries wrote:
   
 On 19-02-15 11:29, Tom de Vries wrote:
  Hi,
 
  I'm posting this patch series for stage1:
  - 0001-Disable-lang_hooks.gimplify_expr-in-free_lang_data.patch
  - 0002-Add-gimple_find_sub_bbs.patch
  - 0003-Factor-optimize_va_list_gpr_fpr_size-out-of-pass_std.patch
  - 0004-Handle-internal_fn-in-operand_equal_p.patch
  - 0005-Postpone-expanding-va_arg-until-pass_stdarg.patch
 
  The patch series - based on Michael's initial patch - postpones
  expanding
  va_arg
  until pass_stdarg, which makes pass_stdarg more robust.
 
  Bootstrapped and reg-tested on x86_64 using all languages, with
  unix/ and
  unix/-m32 testing.
 
  I'll post the patches in reply to this email.
 

 This patch postpones expanding va_arg until pass_stdarg.

 We add a new internal function IFN_VA_ARG. During gimplification, we
 map
 VA_ARG_EXPR onto a CALL_EXPR with IFN_VA_ARG, which is then 
 gimplified
 in to a
 gimple_call. At pass_stdarg, we expand the IFN_VA_ARG gimple_call 
 into
 actual
 code.

 There are a few implementation details worth mentioning:
 - passing the type beyond gimplification is done by adding a NULL
 pointer-
to-type to IFN_VA_ARG.
 - there is special handling for IFN_VA_ARG that would be most suited
 to be
placed in gimplify_va_arg_expr. However, that function lacks the
 scope for
the special handling, so it's placed awkwardly in
 gimplify_modify_expr.
 - there's special handling in case the va_arg type is variable-sized.
gimplify_modify_expr adds a WITH_SIZE_EXPR to the CALL_EXPR
 IFN_VA_ARG for
variable-sized types. However, this is gimplified into a
 gimple_call which
does not have the possibility to wrap it's result in a
 WITH_SIZE_EXPR. So
we're adding the size argument of the WITH_SIZE_EXPR as argument 
 to
IFN_VA_ARG, and at expansion in pass_stdarg, wrap the result of 
 the
gimplification of IFN_VA_ARG in a WITH_SIZE_EXPR, such that the
 subsequent
gimplify_assign will generate a memcpy if necessary.
 - when gimplifying the va_arg argument ap, it may not be addressable.
 So
gimplification will generate a copy ap.1 = ap, and use ap.1 as
 argument.
This means that we have to copy back the ap.1 value to ap after
 IFN_VA_ARG.
The copy is classified by the va_list_gpr/fpr_size optimization as
 an
escape,  so it inhibits optimization. The tree-ssa/stdarg-2.c f15
 update is
because of that.

 OK for stage1?
   
Looks mostly good, though it looks like with -O0 this doesn't delay
lowering of va-arg and thus won't fix offloading.  Can you instead
introduce a PROP_gimple_lva, provide it by the stdarg pass and add
a pass_lower_vaarg somewhere where pass_lower_complex_O0 is run
that runs of !PROP_gimple_lva (and also provides it), and require
PROP_gimple_lva by pass_expand?  (just look for PROP_gimple_lcx for
the complex stuff to get an idea what needs to be touched)
   
  
   Updated according to comments.
  
   Furthermore (having updated the patch series to recent trunk), I'm
   dropping the
   ACCEL_COMPILER bit in pass_stdarg::gate. AFAIU the comment there relates
   to this
   patch.
  
   Retested as before.
  
   OK for stage1?
  
 
  Ping.

 Ping again.

 Patch originally posted at:
 https://gcc.gnu.org/ml/gcc-patches/2015-02/msg01332.html .

 Ok.

 Thanks,
 Richard.

 Thanks,
 - Tom

   Btw, I'm wondering if as run-time optimization we can tentatively set
   PROP_gimple_lva at the start of the gimple pass, and unset it in
   gimplify_va_arg_expr. That way we would avoid the loop in
   expand_ifn_va_arg_1
   (over all bbs and gimples) in functions without va_arg.
  
 
  Taken care of in follow-up patch 5b.

Hi,
This patch causes stdarg test failure on aarch64 for both linux and
elf variants.  It seems wrong code is generated.

FAIL: gcc.c-torture/execute/stdarg-1.c   -O0  execution test
FAIL: gcc.c-torture/execute/stdarg-1.c   -O1  execution test
FAIL: gcc.c-torture/execute/stdarg-1.c   -O2  execution test
FAIL: gcc.c-torture/execute/stdarg-1.c   -O2 -flto
-fno-use-linker-plugin -flto-partition=none  execution test
FAIL: gcc.c-torture/execute/stdarg-1.c   -O2 -flto -fuse-linker-plugin
-fno-fat-lto-objects  execution test
FAIL: gcc.c-torture/execute/stdarg-1.c   -O3 -fomit-frame-pointer
execution test
FAIL: gcc.c-torture/execute/stdarg-1.c   -O3 -g  execution test
FAIL: gcc.c-torture/execute/stdarg-1.c   -Os  execution test
FAIL: 

Re: [PATCH][expmed] Calculate mult-by-const cost properly in mult_by_coeff_cost

2015-03-18 Thread Bin.Cheng
On Wed, Mar 18, 2015 at 5:06 PM, Kyrill Tkachov kyrylo.tkac...@arm.com wrote:

 On 17/03/15 19:11, Jeff Law wrote:

 On 03/16/2015 04:12 AM, Kyrill Tkachov wrote:

 Hi all,

 Eyeballing the mult_by_coeff_cost function I think it has a typo/bug.
 It's supposed to return the cost of multiplying by a constant 'coeff'.
 It calculates that by taking the cost of a MULT rtx by that constant
 and comparing it to the cost of synthesizing that multiplication, and
 returning
 the cheapest. However, in the MULT rtx cost calculations it creates
 a MULT rtx of two REGs rather than a REG and the GEN_INT of coeff, as
 I would
 expect. This patch fixes that in the obvious way.

 Tested aarch64-none-elf and bootstrapped on x86_64-linux-gnu.
 I'm guessing this is stage 1 material at this point?

 Thanks,
 Kyrill

 2015-03-13  Kyrylo Tkachov  kyrylo.tkac...@arm.com

   * expmed.c (mult_by_coeff_cost): Pass CONST_INT rtx to MULT cost
   calculation rather than fake_reg.
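
  For concreteness, the change amounts to something like the following
  (illustrative, not the exact hunk; fake_reg, coeff, max_cost and speed
  are the function's existing locals):

    /* Cost the multiply against the actual constant rather than a
       second fake register.  */
    max_cost = set_src_cost (gen_rtx_MULT (mode, fake_reg,
                                           gen_int_mode (coeff, mode)),
                             speed);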

 I'd think stage1, unless you can point to a bug, particularly a
 regression.


 No regression that I know of. I'll queue it up for stage 1 if it's ok
 code-wise.

This function just estimates the max_cost roughly, since it is only
used by tree passes.  It shouldn't have much impact on generated code.
Maybe some targets don't have a proper cost function for a reg * const
rtl expression; also, most of the calls are in IVOPT, so it would be
better if you run some benchmarks to make sure there is no surprise.

Thanks,
bin

 Thanks,
 Kyrill

 Jeff





Re: [PATCH ARM]Fix memset-inline-* failures on cortex-a9 tune by checking tune information.

2015-03-16 Thread Bin.Cheng
On Fri, Mar 13, 2015 at 7:56 PM, Ramana Radhakrishnan
ramana@googlemail.com wrote:
 On Fri, Mar 6, 2015 at 7:46 AM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 This patch is the second part fixing memset-inline-{4,5,6,8,9}.c failures on
 cortex-a9.  It adds a function checking CPU tuning information in dejagnu,
 and also uses that function to skip the related testcases when we are
 compiling for the cortex-a9 tune.

 Skips the related testcase for all tests where the tuning information
 doesn't use neon. I think this technique can be used to clean up a
 number of multilib related failures in the gcc.target/arm testsuite.
Actually these are all related cases.  Cases {1,2,3} are intended to
test inlining on non-neon targets, and case 7 is an executable test
which should be run whatever the target supports.



 Build and test on arm-none-eabi.  Is it OK?

 gcc/testsuite/ChangeLog
 2015-03-06  Bin Cheng  bin.ch...@arm.com

 * lib/target-supports.exp (arm_tune_string_ops_prefer_neon): New.
 * gcc.target/arm/memset-inline-4.c: Skip for
 arm_tune_string_ops_prefer_neon.
 * gcc.target/arm/memset-inline-5.c: Ditto.
 * gcc.target/arm/memset-inline-6.c: Ditto.
 * gcc.target/arm/memset-inline-8.c: Ditto.
 * gcc.target/arm/memset-inline-9.c: Ditto.

 Ok, please document the new dejagnu helper routine in sourcebuild.texi
Done.  Patch updated, I will push both patches in if you are ok with it.

Thanks,
bin

2015-03-17  Bin Cheng  bin.ch...@arm.com
* doc/sourcebuild.texi (arm_tune_string_ops_prefer_neon): New.

gcc/testsuite/ChangeLog
2015-03-17  Bin Cheng  bin.ch...@arm.com

* lib/target-supports.exp (arm_tune_string_ops_prefer_neon): New.
* gcc.target/arm/memset-inline-4.c: Skip for
arm_tune_string_ops_prefer_neon.
* gcc.target/arm/memset-inline-5.c: Ditto.
* gcc.target/arm/memset-inline-6.c: Ditto.
* gcc.target/arm/memset-inline-8.c: Ditto.
* gcc.target/arm/memset-inline-9.c: Ditto.
Index: gcc/testsuite/gcc.target/arm/memset-inline-4.c
===
--- gcc/testsuite/gcc.target/arm/memset-inline-4.c  (revision 221097)
+++ gcc/testsuite/gcc.target/arm/memset-inline-4.c  (working copy)
@@ -1,6 +1,5 @@
 /* { dg-do run } */
-/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
-/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions" { ! arm_tune_string_ops_prefer_neon } } */
 /* { dg-options "-save-temps -O2 -fno-inline" } */
 /* { dg-add-options arm_neon } */
 
Index: gcc/testsuite/gcc.target/arm/memset-inline-5.c
===
--- gcc/testsuite/gcc.target/arm/memset-inline-5.c  (revision 221097)
+++ gcc/testsuite/gcc.target/arm/memset-inline-5.c  (working copy)
@@ -1,6 +1,5 @@
 /* { dg-do run } */
-/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
-/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions" { ! arm_tune_string_ops_prefer_neon } } */
 /* { dg-options "-save-temps -O2 -fno-inline" } */
 /* { dg-add-options arm_neon } */
 
Index: gcc/testsuite/gcc.target/arm/memset-inline-6.c
===
--- gcc/testsuite/gcc.target/arm/memset-inline-6.c  (revision 221097)
+++ gcc/testsuite/gcc.target/arm/memset-inline-6.c  (working copy)
@@ -1,6 +1,5 @@
 /* { dg-do run } */
-/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
-/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions" { ! arm_tune_string_ops_prefer_neon } } */
 /* { dg-options "-save-temps -O2 -fno-inline" } */
 /* { dg-add-options arm_neon } */
 
Index: gcc/testsuite/gcc.target/arm/memset-inline-8.c
===
--- gcc/testsuite/gcc.target/arm/memset-inline-8.c  (revision 221097)
+++ gcc/testsuite/gcc.target/arm/memset-inline-8.c  (working copy)
@@ -1,6 +1,5 @@
 /* { dg-do run } */
-/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mcpu=cortex-a9" } { "" } } */
-/* { dg-skip-if "Don't inline memset using neon instructions on cortex-a9" { *-*-* } { "-mtune=cortex-a9" } { "" } } */
+/* { dg-skip-if "Don't inline memset using neon instructions" { ! arm_tune_string_ops_prefer_neon } } */
 /* { dg-options "-save-temps -O2 -fno-inline" } */
 /* { dg-add-options arm_neon } */
 
Index: gcc/testsuite/gcc.target/arm/memset-inline-9.c

Re: [PATCH ARM]Print CPU tuning information as comment in assembler file.

2015-03-13 Thread Bin.Cheng
Ping.
This is for test case failures and it doesn't affect normal compilation, so
I suppose it's fine for this stage?

Thanks,
bin

On Fri, Mar 6, 2015 at 3:42 PM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 This patch is the first part fixing memset-inline-{4,5,6,8,9}.c failures on
 cortex-a9.  GCC/arm doesn't emit any tuning information in assembly, so
 DejaGnu can't tell whether we are compiling for the cortex-a9 tune when
 the compiler is configured that way by default.
 This patch introduces a new (target dependent) option -mprint-tune-info.
 It prints CPU tuning information as a comment in the assembler file, so
 DejaGnu can check it and make decisions.  By default the option is
 disabled, so it won't change current behavior.  For now, pointers in
 the tune structure are not printed; we should improve that and output
 more useful information in the long run.

 Another patch is followed adding DEJAGNU test function and adapting test
 strings.

 Build and test on arm-none-eabi, is it OK?

 2015-03-06  Bin Cheng  bin.ch...@arm.com

 * config/arm/arm.opt (print_tune_info): New option.
 * config/arm/arm.c (arm_print_tune_info): New function.
 (arm_file_start): Call arm_print_tune_info.
 * config/arm/arm-protos.h (struct tune_params): Add comment.
 * doc/invoke.texi (@item -mprint-tune-info): New item.
  (-mtune): Mention it in ARM Option Summary.
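
As a usage sketch (hypothetical invocation; the exact layout of the
output is whatever arm_print_tune_info emits):

  arm-none-eabi-gcc -S -O2 -mtune=cortex-a9 -mprint-tune-info test.c
  grep '@' test.s   # tuning parameters appear as assembler comments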


Re: [PATCH ARM]Fix memset-inline-* failures on cortex-a9 tune by checking tune information.

2015-03-13 Thread Bin.Cheng
Ping.

Thanks,
bin

On Fri, Mar 6, 2015 at 3:46 PM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 This patch is the second part fixing memset-inline-{4,5,6,8,9}.c failures on
 cortex-a9.  It adds a function checking CPU tuning information in dejagnu,
 and also uses that function to skip the related testcases when we are
 compiling for the cortex-a9 tune.

 Build and test on arm-none-eabi.  Is it OK?

 gcc/testsuite/ChangeLog
 2015-03-06  Bin Cheng  bin.ch...@arm.com

 * lib/target-supports.exp (arm_tune_string_ops_prefer_neon): New.
 * gcc.target/arm/memset-inline-4.c: Skip for
 arm_tune_string_ops_prefer_neon.
 * gcc.target/arm/memset-inline-5.c: Ditto.
 * gcc.target/arm/memset-inline-6.c: Ditto.
 * gcc.target/arm/memset-inline-8.c: Ditto.
 * gcc.target/arm/memset-inline-9.c: Ditto.


Re: [PATCH PR64705]Don't aggressively expand induction variable's base

2015-02-11 Thread Bin.Cheng
On Wed, Feb 11, 2015 at 7:24 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Wed, Feb 11, 2015 at 9:23 AM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Tue, Feb 10, 2015 at 12:24 AM, Richard Biener
 richard.guent...@gmail.com wrote:

 Previously, the computation of _1174 can be replaced by _629 in bb8 in
 DOM2 pass, while it can't after patching.  This is the only possible
 regression that I can see on TREE level.  There is another difference
 but not regression on TREE.  Seems real change happens on RTL pre with
 different register uses in the input.  I failed to go further or
 extract a test case, it's just very volatile.

 Well, I can see what is happening and indeed we shouldn't blame the
 patch for this.

 With all of this, I guess this patch shouldn't be blamed for this.  I
 also wonder if the PR should be fixed in this way since the patch
 definitely is a corner case.

 It might not fix the real issue (whatever that is), but not making
 IVOPTs (or tree-affines) life harder is very good (I believe I have
 seen this kind of issue as well).
I guess the IV's base is expanded because we want to explore more CSE
opportunities?  I suspect this doesn't work very well for two reasons:
1) it overloads the tree-affine facility; 2) IV rewriting capacity in
GCC is weak (for example, it messes up loop variant/invariant parts of
expressions).  I will do experiments on this.

As for the patch itself, I collected instrumental data from GCC
bootstrap and Spec2k6 compilation.  Can confirm in most cases
(bootstrap 99.9%, spec2k6 99.1%), there is only one ssa name in IV's
step.


 So I do think that the patch is fine.  Just seen the known-to-work GCC 3.4
 version so it's even a regression 

Here is the refined patch according to your comments.  It passes
bootstrap and test on x86_64.

Thanks,
bin

2015-02-12  Bin Cheng  bin.ch...@arm.com

PR tree-optimization/64705
* tree-ssa-loop-niter.h (expand_simple_operations): New parameter.
* tree-ssa-loop-niter.c (expand_simple_operations): New parameter.
* tree-ssa-loop-ivopts.c (extract_single_var_from_expr): New.
(find_bivs, find_givs_in_stmt_scev): Pass new argument to
expand_simple_operations.

gcc/testsuite/ChangeLog
2015-02-12  Bin Cheng  bin.ch...@arm.com

PR tree-optimization/64705
* gcc.dg/tree-ssa/pr64705.c: New test.
Index: gcc/tree-ssa-loop-niter.c
===
--- gcc/tree-ssa-loop-niter.c   (revision 219574)
+++ gcc/tree-ssa-loop-niter.c   (working copy)
@@ -1552,10 +1552,11 @@ simplify_replace_tree (tree expr, tree old, tree n
 }
 
 /* Expand definitions of ssa names in EXPR as long as they are simple
-   enough, and return the new expression.  */
+   enough, and return the new expression.  If STOP is specified, stop
+   expanding if EXPR equals it.  */
 
 tree
-expand_simple_operations (tree expr)
+expand_simple_operations (tree expr, tree stop)
 {
   unsigned i, n;
   tree ret = NULL_TREE, e, ee, e1;
@@ -1575,7 +1576,7 @@ tree
   for (i = 0; i < n; i++)
{
  e = TREE_OPERAND (expr, i);
- ee = expand_simple_operations (e);
+ ee = expand_simple_operations (e, stop);
  if (e == ee)
continue;
 
@@ -1594,7 +1595,8 @@ tree
   return ret;
 }
 
-  if (TREE_CODE (expr) != SSA_NAME)
+  /* Stop if it's not ssa name or the one we don't want to expand.  */
+  if (TREE_CODE (expr) != SSA_NAME || expr == stop)
 return expr;
 
   stmt = SSA_NAME_DEF_STMT (expr);
@@ -1614,7 +1616,7 @@ tree
   src->loop_father != dest->loop_father)
return expr;
 
-  return expand_simple_operations (e);
+  return expand_simple_operations (e, stop);
 }
   if (gimple_code (stmt) != GIMPLE_ASSIGN)
 return expr;
@@ -1634,7 +1636,7 @@ tree
return e;
 
   if (code == SSA_NAME)
-   return expand_simple_operations (e);
+   return expand_simple_operations (e, stop);
 
   return expr;
 }
@@ -1643,7 +1645,7 @@ tree
 {
 CASE_CONVERT:
   /* Casts are simple.  */
-  ee = expand_simple_operations (e);
+  ee = expand_simple_operations (e, stop);
   return fold_build1 (code, TREE_TYPE (expr), ee);
 
 case PLUS_EXPR:
@@ -1658,7 +1660,7 @@ tree
   if (!is_gimple_min_invariant (e1))
return expr;
 
-  ee = expand_simple_operations (e);
+  ee = expand_simple_operations (e, stop);
   return fold_build2 (code, TREE_TYPE (expr), ee, e1);
 
 default:
Index: gcc/tree-ssa-loop-niter.h
===
--- gcc/tree-ssa-loop-niter.h   (revision 219574)
+++ gcc/tree-ssa-loop-niter.h   (working copy)
@@ -20,7 +20,7 @@ along with GCC; see the file COPYING3.  If not see
 #ifndef GCC_TREE_SSA_LOOP_NITER_H
 #define GCC_TREE_SSA_LOOP_NITER_H
 
-extern tree expand_simple_operations (tree);
+extern tree expand_simple_operations (tree, tree = NULL);
 extern bool loop_only_exit_p (const struct loop *, const_edge);
 extern

Re: [RFA][PATCH][PR rtl-optimization/47477] Type narrowing in match.pd

2015-02-11 Thread Bin.Cheng
On Wed, Feb 11, 2015 at 4:55 AM, Jeff Law l...@redhat.com wrote:

 This PR was originally minor issue where we regressed on this kind of
 sequence:

 typedef struct toto_s *toto_t;
 toto_t add (toto_t a, toto_t b) {
   int64_t tmp = (int64_t)(intptr_t)a + ((int64_t)(intptr_t)b&~1L);
   return (toto_t)(intptr_t) tmp;
 }


 There was talk of trying to peephole this in the x86 backend.  But later
 Jakub speculated that if we had good type narrowing this could be done in
 the tree optimizers...

 So, here we go.  I didn't do anything with logicals as those are already
 handled elsewhere in match.pd.  I didn't try to handle MULT as in the early
 experiments I did, it was a loss because of the existing mechanisms for
 widening multiplications.

 Interestingly enough, this patch seems to help out libjava more than
 anything else in a GCC build and it really only helps a few routines. There
 weren't any routines I could see where the code regressed after this patch.
 This is probably an indicator that these things aren't *that* common, or the
 existing shortening code is better than we thought, or some important
 shortening case is missing.

Cool that we are trying to simplify type conversions using the generic
match facility.  I have thought about type promotion in match.pd too.
For example, for (unsigned long long)(unsigned long)(int_expr), if we
can prove int_expr is always positive (in my case, this is from vrp
information), then the first conversion can be saved.  I suppose this
is another way of doing (and related to? I didn't look at the code) the
sign/zero extension elimination work using VRP?
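
A tiny example of the shape I mean (hand-written illustration, not
taken from any PR):

  unsigned long long
  widen (int x)
  {
    if (x < 0)
      __builtin_unreachable ();  /* VRP now knows x >= 0.  */
    /* With x known non-negative, both conversions are effectively
       zero extensions, so the intermediate cast is redundant.  */
    return (unsigned long long)(unsigned long) x;
  }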

Thanks,
bin


 I think we should pull the other tests from 47477 which are not regressions
 out into their own bug for future work.  Or alternately, when this fix is
 checked in remove the regression marker in 47477.


 Bootstrapped and regression tested on x86_64-unknown-linux-gnu.  OK for the
 trunk?








 diff --git a/gcc/ChangeLog b/gcc/ChangeLog
 index 7f3816c..7a95029 100644
 --- a/gcc/ChangeLog
 +++ b/gcc/ChangeLog
 @@ -1,3 +1,8 @@
 +2015-02-10  Jeff Law  l...@redhat.com
 +
 +   * match.pd (convert (plus/minus (convert @0) (convert @1): New
 +   simplifier to narrow arithmetic.
 +
  2015-02-10  Richard Biener  rguent...@suse.de

 PR tree-optimization/64909
 diff --git a/gcc/match.pd b/gcc/match.pd
 index 81c4ee6..abc703e 100644
 --- a/gcc/match.pd
 +++ b/gcc/match.pd
 @@ -1018,3 +1018,21 @@ along with GCC; see the file COPYING3.  If not see
 (logs (pows @0 @1))
 (mult @1 (logs @0)

 +/* If we have a narrowing conversion of an arithmetic operation where
 +   both operands are widening conversions from the same type as the outer
 +   narrowing conversion.  Then convert the innermost operands to a suitable
 +   unsigned type (to avoid introducing undefined behaviour), perform the
 +   operation and convert the result to the desired type.
 +
 +   This narrows the arithmetic operation.  */
 +(for op (plus minus)
 +  (simplify
 +(convert (op (convert@2 @0) (convert @1)))
 +(if (TREE_TYPE (@0) == TREE_TYPE (@1)
 +     && TREE_TYPE (@0) == type
 +     && INTEGRAL_TYPE_P (type)
 +     && TYPE_PRECISION (TREE_TYPE (@2)) > TYPE_PRECISION (TREE_TYPE (@0))
 +     /* This prevents infinite recursion.  */
 +     && unsigned_type_for (TREE_TYPE (@0)) != TREE_TYPE (@2))
 +  (with { tree utype = unsigned_type_for (TREE_TYPE (@0)); }
 +    (convert (op (convert:utype @0) (convert:utype @1)))
 diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
 index 15d5e2d..76e5254 100644
 --- a/gcc/testsuite/ChangeLog
 +++ b/gcc/testsuite/ChangeLog
 @@ -1,3 +1,8 @@
 +2015-02-10  Jeff Law  l...@redhat.com
 +
 +   PR rtl-optimization/47477
 +   * gcc.dg/tree-ssa/narrow-arith-1.c: New test.
 +
  2015-02-10  Richard Biener  rguent...@suse.de

 PR tree-optimization/64909
 diff --git a/gcc/testsuite/gcc.dg/tree-ssa/narrow-arith-1.c
 b/gcc/testsuite/gcc.dg/tree-ssa/narrow-arith-1.c
 new file mode 100644
 index 000..104cb6f5
 --- /dev/null
 +++ b/gcc/testsuite/gcc.dg/tree-ssa/narrow-arith-1.c
 @@ -0,0 +1,22 @@
 +/* PR tree-optimization/47477 */
 +/* { dg-do compile } */
 +/* { dg-options "-O2 -fdump-tree-optimized -w" } */
 +/* { dg-require-effective-target ilp32 } */
 +
 +typedef int int64_t __attribute__ ((__mode__ (__DI__)));
 +typedef int * intptr_t;
 +
 +typedef struct toto_s *toto_t;
 +toto_t add (toto_t a, toto_t b) {
 +  int64_t tmp = (int64_t)(intptr_t)a + ((int64_t)(intptr_t)b&~1L);
 +  return (toto_t)(intptr_t) tmp;
 +}
 +
 +/* For an ILP32 target there'll be 6 casts when we start, but just 4
 +   if the match.pd pattern is successfully matched.  */
 +/* { dg-final { scan-tree-dump-times " = \\(int\\) " 1 "optimized" } } */
 +/* { dg-final { scan-tree-dump-times " = \\(unsigned int\\) " 2 "optimized" } } */
 +/* { dg-final { scan-tree-dump-times " = \\(struct toto_s \\*\\) " 1 "optimized" } } */
 +/* { dg-final { cleanup-tree-dump "optimized" } } */
 +
 +



Re: [PATCH PR64705]Don't aggressively expand induction variable's base

2015-02-11 Thread Bin.Cheng
On Tue, Feb 10, 2015 at 12:24 AM, Richard Biener
richard.guent...@gmail.com wrote:
 On February 9, 2015 11:09:49 AM CET, Bin Cheng bin.ch...@arm.com wrote:

 Did you extract a testcase for it?  Note that the IV step itself may be
 expanded too much.

   I
looked into the regression and thought it was because of passes after
IVOPT.
The IVOPT dump is at least not worse than the original version.

 But different I presume.  So how did you figure it regressed?
The tree level dump is like,

---Without patch
+++ With patch
   bb 8:
   _618 = MAX_EXPR <n_105, 0>;
   _619 = (integer(kind=8)) _618;
   _629 = (unsigned long) _618;

   ..

   bb 37:
   _257 = _619 + 1;
   _255 = (sizetype) _257;
   _261 = _255 * 8;
   _256 = (sizetype) _140;
-  _1174 = (sizetype) _618;
+  _1174 = (sizetype) _619;
   _1175 = _256 + _1174;
   _1176 = _1175 * 8;
   _1177 = _148 + _1176;
   ivtmp.3849_258 = (unsigned long) _1177;
   _1179 = (unsigned int) _104;

Previously, the computation of _1174 can be replaced by _629 in bb8 in
DOM2 pass, while it can't after patching.  This is the only possible
regression that I can see on TREE level.  There is another difference
but not regression on TREE.  Seems real change happens on RTL pre with
different register uses in the input.  I failed to go further or
extract a test case, it's just very volatile.

With all of this, I guess this patch shouldn't be blamed for this.  I
also wonder if the PR should be fixed in this way since the patch
definitely is a corner case.

Thanks,
bin

 Thanks,
 Richard.

Bootstrap and test on x86_64 and AArch64, so is it OK?

2015-02-09  Bin Cheng  bin.ch...@arm.com

   PR tree-optimization/64705
   * tree-ssa-loop-niter.h (expand_simple_operations): New parameter.
   * tree-ssa-loop-niter.c (expand_simple_operations): New parameter.
   (tree_simplify_using_condition_1, refine_bounds_using_guard)
   (number_of_iterations_exit): Pass new argument to
   expand_simple_operations.
   * tree-ssa-loop-ivopts.c (extract_single_var_from_expr): New.
   (find_bivs, find_givs_in_stmt_scev): Pass new argument to
   expand_simple_operations.  Call extract_single_var_from_expr.
   (difference_cannot_overflow_p): Pass new argument to
   expand_simple_operations.

gcc/testsuite/ChangeLog
2015-02-09  Bin Cheng  bin.ch...@arm.com

   PR tree-optimization/64705
   * gcc.dg/tree-ssa/pr64705.c: New test.




Re: [PATCH PR64705]Don't aggressively expand induction variable's base

2015-02-10 Thread Bin.Cheng
On Tue, Feb 10, 2015 at 6:02 PM, Richard Biener
richard.guent...@gmail.com wrote:
 On Mon, Feb 9, 2015 at 11:33 AM, Bin Cheng bin.ch...@arm.com wrote:
 The second time I missed patch in one day, I hate myself.
 Here it is.

 I think the patch is reasonable but I would have used a default = NULL
 arg for 'stop' to make the patch smaller.  You don't constrain 'stop'
 to being an SSA name - any particular reason for that?  It would
The check is from the first version patch, in which I just passed the
whole IV's step to expand_simple_operations.  Yes, it should be
changed accordingly.

 make the comparison in expand_simple_operations simpler
 and it could be extended to be a bitmap of SSA name versions.
Yes, that's exactly what I want to do.  BTW, per the previous comment,
I don't think GCC expands IV's step in either IVOPT or SCEV, right?
As a result, it's unlikely to have an IV's step referring to multiple
ssa names.  And that's why I didn't extend it to a ssa name versions
bitmap.


 So - I'd like you to constrain 'stop' and check it like

   if (TREE_CODE (expr) != SSA_NAME
Hmm, won't this effectively disable the expansion?

  || expr == stop)
 return expr;

 and declare

 -extern tree expand_simple_operations (tree);
 +extern tree expand_simple_operations (tree, tree = NULL_TREE);
I am still living in the C world...


 Ok with that change.

 Thanks,
 Richard.

 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of Bin Cheng
 Sent: Monday, February 09, 2015 6:10 PM
 To: gcc-patches@gcc.gnu.org
 Subject: [PATCH PR64705]Don't aggressively expand induction variable's
 base

 Hi,
 As commented in the PR, the root cause is that GCC aggressively expands
 the induction variable's base.  This patch avoids that by adding a new
 parameter to expand_simple_operations so we can stop expansion whenever
 the ssa var referred to by the IV's step is encountered.  As commented
 in the patch, we could stop expanding at each ssa var referred to by
 the IV's step, but that's expensive and not likely to be needed; this
 patch only extracts the first ssa var and skips expanding accordingly.
 For the new test case, currently the loop body is bloated as below after
 IVOPT:

 bb 5:
   # ci_28 = PHI <ci_12(D)(4), ci_17(6)>
   # ivtmp.13_31 = PHI <ivtmp.13_25(4), ivtmp.13_27(6)>
   ci_17 = ci_28 + 1;
   _1 = (void *) ivtmp.13_31;
   MEM[base: _1, offset: 0B] = 0;
   ivtmp.13_27 = ivtmp.13_31 + _26;
   _34 = (unsigned long) _13;
   _35 = (unsigned long) flags_8(D);
   _36 = _34 - _35;
   _37 = (unsigned long) step_14;
   _38 = _36 - _37;
   _39 = ivtmp.13_27 + _38;
   _40 = _39 + 3;
   iter_33 = (long int) _40;
   if (len_16(D) >= iter_33)
 goto bb 6;
   else
 goto bb 7;

 bb 6:
   goto bb 5;

 And it can be improved by this patch as below:

 bb 5:
   # steps_28 = PHI <steps_12(D)(4), steps_17(6)>
   # iter_29 = PHI <iter_15(4), iter_21(6)>
   steps_17 = steps_28 + 1;
   _31 = (sizetype) iter_29;
   MEM[base: flags_8(D), index: _31, offset: 0B] = 0;
   iter_21 = step_14 + iter_29;
   if (len_16(D) >= iter_21)
 goto bb 6;
   else
 goto bb 7;

 bb 6:
   goto bb 5;


 I think this is a corner case; it only changes several files' assembly
 code slightly in spec2k6.  Among these files, only one of them is a
 regression.  I looked into the regression and thought it was because of
 passes after IVOPT.  The IVOPT dump is at least not worse than the
 original version.

 Bootstrap and test on x86_64 and AArch64, so is it OK?

 2015-02-09  Bin Cheng  bin.ch...@arm.com

   PR tree-optimization/64705
   * tree-ssa-loop-niter.h (expand_simple_operations): New
 parameter.
   * tree-ssa-loop-niter.c (expand_simple_operations): New parameter.
   (tree_simplify_using_condition_1, refine_bounds_using_guard)
   (number_of_iterations_exit): Pass new argument to
   expand_simple_operations.
   * tree-ssa-loop-ivopts.c (extract_single_var_from_expr): New.
   (find_bivs, find_givs_in_stmt_scev): Pass new argument to
   expand_simple_operations.  Call extract_single_var_from_expr.
   (difference_cannot_overflow_p): Pass new argument to
   expand_simple_operations.

 gcc/testsuite/ChangeLog
 2015-02-09  Bin Cheng  bin.ch...@arm.com

   PR tree-optimization/64705
   * gcc.dg/tree-ssa/pr64705.c: New test.






Re: [PATCH PR64705]Don't aggressively expand induction variable's base

2015-02-10 Thread Bin.Cheng
On Tue, Feb 10, 2015 at 6:18 PM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Tue, Feb 10, 2015 at 6:02 PM, Richard Biener
 richard.guent...@gmail.com wrote:
 On Mon, Feb 9, 2015 at 11:33 AM, Bin Cheng bin.ch...@arm.com wrote:
 The second time I missed patch in one day, I hate myself.
 Here it is.

 I think the patch is reasonable but I would have used a default = NULL
 arg for 'stop' to make the patch smaller.  You don't constrain 'stop'
 to being an SSA name - any particular reason for that?  It would
 The check is from the first version patch, in which I just passed the
 whole IV's step to expand_simple_operations.  Yes, it should be
 changed accordingly.

 make the comparison in expand_simple_operations simpler
 and it could be extended to be a bitmap of SSA name versions.
  Yes, that's exactly what I want to do.  BTW, per the previous comment,
 I don't think GCC expands IV's step in either IVOPT or SCEV, right?
 As a result, it's unlikely to have an IV's step referring to multiple
 ssa names.  And that's why I didn't extend it to a ssa name versions
 bitmap.


 So - I'd like you to constrain 'stop' and check it like

   if (TREE_CODE (expr) != SSA_NAME
 Hmm, won't this effectively disable the expansion?

Understood, the check sits below the ssa-name expansion code.

I will go through the regression file before applying this.

Thanks,
bin

  || expr == stop)
 return expr;

 and declare

 -extern tree expand_simple_operations (tree);
 +extern tree expand_simple_operations (tree, tree = NULL_TREE);
 I am still living in the C world...


 Ok with that change.

 Thanks,
 Richard.

 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
 ow...@gcc.gnu.org] On Behalf Of Bin Cheng
 Sent: Monday, February 09, 2015 6:10 PM
 To: gcc-patches@gcc.gnu.org
 Subject: [PATCH PR64705]Don't aggressively expand induction variable's
 base

 Hi,
 As commented in the PR, the root cause is that GCC aggressively expands
 the induction variable's base.  This patch avoids that by adding a new
 parameter to expand_simple_operations so we can stop expansion whenever
 the ssa var referred to by the IV's step is encountered.  As commented
 in the patch, we could stop expanding at each ssa var referred to by
 the IV's step, but that's expensive and not likely to be needed; this
 patch only extracts the first ssa var and skips expanding accordingly.
 For the new test case, currently the loop body is bloated as below after
 IVOPT:

 bb 5:
   # ci_28 = PHI <ci_12(D)(4), ci_17(6)>
   # ivtmp.13_31 = PHI <ivtmp.13_25(4), ivtmp.13_27(6)>
   ci_17 = ci_28 + 1;
   _1 = (void *) ivtmp.13_31;
   MEM[base: _1, offset: 0B] = 0;
   ivtmp.13_27 = ivtmp.13_31 + _26;
   _34 = (unsigned long) _13;
   _35 = (unsigned long) flags_8(D);
   _36 = _34 - _35;
   _37 = (unsigned long) step_14;
   _38 = _36 - _37;
   _39 = ivtmp.13_27 + _38;
   _40 = _39 + 3;
   iter_33 = (long int) _40;
   if (len_16(D) >= iter_33)
 goto bb 6;
   else
 goto bb 7;

 bb 6:
   goto bb 5;

 And it can be improved by this patch as below:

 bb 5:
   # steps_28 = PHI <steps_12(D)(4), steps_17(6)>
   # iter_29 = PHI <iter_15(4), iter_21(6)>
   steps_17 = steps_28 + 1;
   _31 = (sizetype) iter_29;
   MEM[base: flags_8(D), index: _31, offset: 0B] = 0;
   iter_21 = step_14 + iter_29;
   if (len_16(D) >= iter_21)
 goto bb 6;
   else
 goto bb 7;

 bb 6:
   goto bb 5;


 I think this is a corner case; it only changes several files' assembly
 code slightly in spec2k6.  Among these files, only one of them is a
 regression.  I looked into the regression and thought it was because of
 passes after IVOPT.  The IVOPT dump is at least not worse than the
 original version.

 Bootstrap and test on x86_64 and AArch64, so is it OK?

 2015-02-09  Bin Cheng  bin.ch...@arm.com

   PR tree-optimization/64705
   * tree-ssa-loop-niter.h (expand_simple_operations): New
 parameter.
   * tree-ssa-loop-niter.c (expand_simple_operations): New parameter.
   (tree_simplify_using_condition_1, refine_bounds_using_guard)
   (number_of_iterations_exit): Pass new argument to
   expand_simple_operations.
   * tree-ssa-loop-ivopts.c (extract_single_var_from_expr): New.
   (find_bivs, find_givs_in_stmt_scev): Pass new argument to
   expand_simple_operations.  Call extract_single_var_from_expr.
   (difference_cannot_overflow_p): Pass new argument to
   expand_simple_operations.

 gcc/testsuite/ChangeLog
 2015-02-09  Bin Cheng  bin.ch...@arm.com

   PR tree-optimization/64705
   * gcc.dg/tree-ssa/pr64705.c: New test.






Re: [PATCH IRA] update_equiv_regs fails to set EQUIV reg-note for pseudo with more than one definition

2015-02-03 Thread Bin.Cheng
On Wed, Feb 4, 2015 at 12:28 AM, Jeff Law l...@redhat.com wrote:
 On 02/03/15 01:29, Bin.Cheng wrote:


 Hmm, if I understand correctly, it's a code size regression, so I
 don't think it's appropriate to adapt the test case.  Either the patch
 or something else in GCC is doing wrong, right?

 Hi Alex, could you please file a PR with full dump information for
 tracking?

 But if the code size regression is due to the older compiler incorrectly
 handling the promotion of REG_EQUAL to REG_EQUIV notes, then the test
 absolutely does need updating as the codesize was dependent on incorrect
 behaviour in the compiler.

Hi Jeff,

I looked into the test and can confirm the previous compilation is correct.
The cover letter of this patch said IRA mis-handled REG_EQUIV before,
but in this case it is REG_EQUAL that is lost.  The full dump (without
this patch) after IRA is like:

   10: NOTE_INSN_BASIC_BLOCK 2
2: r116:SI=r0:SI
3: r117:SI=r1:SI
  REG_DEAD r1:SI
4: r118:SI=r2:SI
  REG_DEAD r2:SI
5: NOTE_INSN_FUNCTION_BEG
   12: r2:SI=0x1
   13: r1:SI=0
   15: r0:SI=call [`lseek'] argc:0
  REG_DEAD r2:SI
  REG_DEAD r1:SI
  REG_CALL_DECL `lseek'
   16: r111:SI=r0:SI
  REG_DEAD r0:SI
   17: r2:SI=0x2
   18: r1:SI=0
   19: r0:SI=r116:SI
  REG_DEAD r116:SI
   20: r0:SI=call [`lseek'] argc:0
  REG_DEAD r2:SI
  REG_DEAD r1:SI
  REG_CALL_DECL `lseek'
   21: r112:SI=r0:SI
  REG_DEAD r0:SI
   22: cc:CC=cmp(r111:SI,0x)
   23: pc={(cc:CC==0)?L46:pc}
  REG_DEAD cc:CC
  REG_BR_PROB 159
   24: NOTE_INSN_BASIC_BLOCK 3
   25: cc:CC=cmp(r112:SI,0x)
   26: pc={(cc:CC==0)?L50:pc}
  REG_DEAD cc:CC
  REG_BR_PROB 159
   27: NOTE_INSN_BASIC_BLOCK 4
   28: NOTE_INSN_DELETED
   29: {cc:CC_NOOV=cmp(r112:SI-r111:SI,0);r114:SI=r112:SI-r111:SI;}
  REG_DEAD r112:SI
   30: pc={(cc:CC_NOOV==0)?L54:pc}
  REG_DEAD cc:CC_NOOV
  REG_BR_PROB 400
   31: NOTE_INSN_BASIC_BLOCK 5
   32: [r117:SI]=r111:SI
  REG_DEAD r117:SI
  REG_DEAD r111:SI
   33: [r118:SI]=r114:SI
  REG_DEAD r118:SI
  REG_DEAD r114:SI
7: r110:SI=0
  REG_EQUAL 0
   76: pc=L34
   77: barrier
   46: L46:
   45: NOTE_INSN_BASIC_BLOCK 6
8: r110:SI=r111:SI
  REG_DEAD r111:SI
  REG_EQUAL 0x
   78: pc=L34
   79: barrier
   50: L50:
   49: NOTE_INSN_BASIC_BLOCK 7
6: r110:SI=r112:SI
  REG_DEAD r112:SI
  REG_EQUAL 0x
   80: pc=L34
   81: barrier
   54: L54:
   53: NOTE_INSN_BASIC_BLOCK 8
9: r110:SI=0x
  REG_EQUAL 0x
   34: L34:
   35: NOTE_INSN_BASIC_BLOCK 9
   40: r0:SI=r110:SI
  REG_DEAD r110:SI
   41: use r0:SI

Before r216169 (with REG_EQUAL in insn9), jumps from basic blocks 6/7/8
-> 9 can be merged because r110 equals -1 afterwards.  But with the
patch, the information that r110 == -1 in basic block 8 is lost.  As
a result, the jump from 8 -> 9 can't be merged and two additional
instructions are generated.

I suppose the REG_EQUAL note is correct in insn9?  According to
GCCint, it only means r110 set by insn9 will be equal to the value at
run time at the end of this insn but not necessarily elsewhere in the
function.

I also found another problem (or misleading statement?) in the
documentation: "Thus, compiler passes prior to register allocation
need only check for REG_EQUAL notes and passes subsequent to register
allocation need only check for REG_EQUIV notes."  This seems not to be
true now, as in this example passes after register allocation do take
advantage of REG_EQUAL in optimization and we can't achieve that by
using REG_EQUIV.

Thanks,
bin


 jeff


Re: [PATCH IRA] update_equiv_regs fails to set EQUIV reg-note for pseudo with more than one definition

2015-02-03 Thread Bin.Cheng
On Tue, Feb 3, 2015 at 3:24 PM, Jeff Law l...@redhat.com wrote:
 On 02/02/15 08:59, Alex Velenko wrote:

 On 11/10/14 13:44, Felix Yang wrote:

 Hello Jeff,

  I see that you have improved the RTL typesafety issue for ira.c,
  so I rebased this patch on the latest trunk and changed it to use the
  new list-walking interface.
  Bootstrapped on x86_64-SUSE-Linux and make check regression tested.
  OK for trunk?

 Hi Felix,
 I believe your patch causes a regression for arm-none-eabi.
 FAIL: gcc.target/arm/pr43920-2.c object-size text = 54
 FAIL: gcc.target/arm/pr43920-2.c scan-assembler-times pop 2

 This happens because your patch stops reuse of code for
 "return -1;" statements in pr43920-2.c.

 As far as I investigated, your patch prevents adding (expr_list (-1)
 (nil)) in the ira pass, which prevents the jump2 optimization from
 happening.

 So before, in ira pass I could see:
 (insn 9 53 34 8 (set (reg:SI 110 [ D.4934 ])
  (const_int -1 [0x]))
 /work/fsf-trunk-ref-2/src/gcc/gcc/testsuite/gcc.target/arm/pr43920-2.c:20
 613
 {*thumb2_movsi_vfp}
   (expr_list:REG_EQUAL (const_int -1 [0x])
  (nil)))
 But with your patch I get
 (insn 9 53 34 8 (set (reg:SI 110 [ D.5322 ])
  (const_int -1 [0x]))
 /work/fsf-trunk-2/src/gcc/gcc/testsuite/gcc.target/arm/pr43920-2.c:20
 615 {*thumb2_movsi_vfp}
   (nil))

 This causes a code generation regression and needs to be fixed.
 Kind regards,

 We'd need to see the full dumps.  In particular is reg110 set anywhere else?
 If so then the change is doing precisely what it should be doing and the
 test needs to be updated to handle the different code we generate.

Hmm, if I understand correctly, it's a code size regression, so I
don't think it's appropriate to adapt the test case.  Either the patch
or something else in GCC is doing wrong, right?

Hi Alex, could you please file a PR with full dump information for tracking?

Thanks,
bin

 Jeff


Re: Stage3 closing soon, call for patch pings

2015-01-15 Thread Bin.Cheng
On Fri, Jan 16, 2015 at 5:04 AM, Jeff Law l...@redhat.com wrote:

 Stage3 is closing rapidly.  I've drained my queue of patches I was tracking
 for gcc-5.However, note that I don't track everything.  If it's a patch
 for a backend, language other than C or seemingly has another maintainer
 that's engaged in review, then I haven't been tracking the patch.

 So this is my final call for patch pings.  I've got some bandwidth and may
 be able to look at a few patches that have otherwise stalled.

Here is an ARM backend patch, CCing ARM maintainers.
https://gcc.gnu.org/ml/gcc-patches/2014-11/msg01383.html

Thanks,
bin

 Jeff





Re: [PATCH] LRA: Fix caller-save store/restore instruction for large mode

2015-01-08 Thread Bin.Cheng
On Fri, Jan 9, 2015 at 6:03 AM, Jeff Law l...@redhat.com wrote:
 On 01/08/15 08:58, Kito Cheng wrote:

 Hi Jeff:

 After discussion with Bin, he prefer just use
 gcc.c-torture/execute/scal-to-vec1.c
 instead of introduce new one, do you have any further comment on this
 patch?

 Ah, if there's an existing test, then we certainly don't need a new one.


Hi, according to the review comments, I applied the new version patch
on behalf of Kito (because of no write access) as revision 219375.

Thanks,
bin


Re: [PATCH] LRA: Fix caller-save store/restore instruction for large mode

2015-01-07 Thread Bin.Cheng
On Wed, Jan 7, 2015 at 4:03 PM, Kito Cheng kito.ch...@gmail.com wrote:
 Hi Jeff:

 It's the updated patch, bootstrapped and regression tested on arm-eabi,
 arm-none-linux-uclibcgnueabi, x86_64-unknown-linux-gnu and nds32le-elf
 without introducing regression.

 Thanks for your review :)

 2015-01-07  Kito Cheng  k...@0xlab.org

 PR target/64348
 * lra-constraints.c (split_reg): Fix caller-save store/restore
 instruction generation.

Thanks for fixing the issue.
The PR is against existing testcase failure
gcc.c-torture/execute/scal-to-vec1.c.  Unless we can create a new
case, there is no need to include the same case twice, I think?  Or we can
mention the PR number in the original test case?

Thanks,
bin


Re: [PATCH] LRA: Fix caller-save store/restore instruction for large mode

2015-01-07 Thread Bin.Cheng
On Wed, Jan 7, 2015 at 8:28 PM, Kito Cheng kito.ch...@gmail.com wrote:
 Hi Bin:

 It's 2 more lines than gcc.c-torture/execute/scal-to-vec1.c since it
 needs a specific compilation flag and a specific target to reproduce
 this issue, and it can't be reproduced by the normal testing flow with
 arm-*-linux-gnueabi (due to the lack of the -fPIC flag), so I prefer
 to duplicate this case into gcc.target/arm/ :)

 /* { dg-do compile } */
 /* { dg-options "-O3 -fPIC -marm -mcpu=cortex-a8" } */
Not really; we generally want to avoid cpu-related options in testcases
since they introduce conflicting-option failures when testing against a
specific processor, e.g. testing against Cortex-M profile processors.
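
Something along these lines would avoid pinning the cpu while still
getting neon (a sketch using the usual effective-target idiom; other
needed flags omitted):

  /* { dg-do compile } */
  /* { dg-require-effective-target arm_neon_ok } */
  /* { dg-options "-O3 -fPIC" } */
  /* { dg-add-options arm_neon } */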

Thanks,
bin


 On Wed, Jan 7, 2015 at 4:50 PM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Wed, Jan 7, 2015 at 4:03 PM, Kito Cheng kito.ch...@gmail.com wrote:
 Hi Jeff:

 It's the updated patch, bootstrapped and regression tested on arm-eabi,
 arm-none-linux-uclibcgnueabi, x86_64-unknown-linux-gnu and nds32le-elf
 without introducing regression.

 Thanks for your review :)

 2015-01-07  Kito Cheng  k...@0xlab.org

 PR target/64348
 * lra-constraints.c (split_reg): Fix caller-save store/restore
 instruction generation.

 Thanks for fixing the issue.
 The PR is against existing testcase failure
 gcc.c-torture/execute/scal-to-vec1.c.  Unless we can create a new
 case, there is no need to include the same case twice, I think?  Or we can
 mention the PR number in the original test case?

 Thanks,
 bin


Re: [PATCH] LRA: Fix caller-save store/restore instruction for large mode

2015-01-05 Thread Bin.Cheng
On Mon, Jan 5, 2015 at 3:44 PM, Kito Cheng kito.ch...@gmail.com wrote:
 Hi Vladimir:
   This patch was discussed with you in May 2014; it is about the
 caller-save register store and restore instruction generation.  The
 current LRA implementation will miss a caller-save store/restore
 instruction when more than one instruction is needed.

 You said you would investigate this on IRC, so I didn't send the
 patch last year; however, the ARM guys seem to have hit this problem
 too, so I think it's time to send this patch :)

 ChangeLog

 2015-01-05  Kito Cheng  k...@0xlab.org

 * lra-constraints.c (split_reg): Fix caller-save store/restore
 instruction generation.
Hi,
Thanks for saving my work, I was going to send this exact patch.
Note that we have PR64348 tracking the issue now.

Thanks,
bin


Re: [PATCH] LRA: Fix caller-save store/restore instruction for large mode

2015-01-05 Thread Bin.Cheng
On Tue, Jan 6, 2015 at 7:36 AM, Vladimir Makarov vmaka...@redhat.com wrote:

 On 2015-01-05 12:31 PM, Jeff Law wrote:

 On 01/05/15 00:44, Kito Cheng wrote:

 Hi Vladimir:
This patch was discussed with you in May 2014; it is about the
 caller-save register store and restore instruction generation.  The
 current LRA implementation will miss a caller-save store/restore
 instruction when more than one instruction is needed.

 You said you would investigate this on IRC, so I didn't send the
 patch last year; however, the ARM guys seem to have hit this problem
 too, so I think it's time to send this patch :)

 ChangeLog

 2015-01-05  Kito Cheng  k...@0xlab.org

  * lra-constraints.c (split_reg): Fix caller-save store/restore
 instruction generation.

 Please reference PR64348 in the ChangeLog entry.

 Please include a testcase if there isn't one in the regression suite
 already.

 Please indicate what platform this patch was bootstrapped and regression
 tested on.

 The dumping code immediately after the assert you removed has code like
 this in both cases:



  fprintf (lra_dump_file,
           "    Rejecting split %d->%d "
           "resulting in > 2 %s restore insns:\n",
           original_regno, REGNO (new_reg), call_save_p ? "call" : "");

 Testing call_save_p here won't make any sense after your patch.

 I'll let Vlad chime in on the correctness of allowing multi register
 saves/restores in this code.

 The solution itself is ok.  Prohibiting generation of more than one
 insn was intended for inheritance only, as an inheritance
 transformation can be undone when the inheritance pseudo does not get
 a hard register.  Undoing multi-register splitting is difficult, and
 such splitting is also of doubtful profitability.

 Splitting for save/restore is never undone.  So it is ok for this case to
 generate multi-register saves/restores.

 Kito, Jeff wrote reasonable changes for the patch.  Please, do them and you
 can commit the patch.

Hi Vlad,
As for this specific case in PR64348, dump IR crossing function call
is as below:

  430: [sfp:SI-0x30]=r989:TI#0
  432: [r1706:SI+0x4]=r989:TI#4
  434: [r1706:SI+0x8]=r989:TI#8
  436: [r1706:SI+0xc]=r989:TI#12
  441: r0:DI=call [`__aeabi_idivmod'] argc:0
  REG_UNUSED r0:SI
  REG_CALL_DECL `__aeabi_idivmod'
  REG_EH_REGION 0x8000
  437: r1007:SI=sign_extend(r989:TI#0)
  REG_DEAD r989:TI

Save/restore insns are introduced because of the use of r989:TI#0 in
insn 437.  Register r989 is a TImode register assigned to r2/r3/r4/r5.  I can
think about two possible improvements to LRA splitter.  1) According
to ARM EABI, only r2/r3 need to be saved/restored; 2) In this case, we
only need to save/restore r989:TI#0, which is r2.  In fact, it has
already been saved to the stack in insn 430; all we need is to restore
r989:TI#0 after the function call.

What do you think about these?

Thanks,
bin


 Thanks.



Re: [PATCH/TopLevel] Fix compiling libgo with a combined sources

2015-01-04 Thread Bin.Cheng
On Sun, Jan 4, 2015 at 6:55 AM, Andrew Pinski pins...@gmail.com wrote:
 On Mon, Nov 24, 2014 at 1:32 PM, Jeff Law l...@redhat.com wrote:
 On 11/22/14 21:20, Andrew Pinski wrote:

 Hi,
   The problem here is that OBJCOPY is not being set to the
 newly built objcopy when compiling libgo.  This patch adds
 OBJCOPY_FOR_TARGET to the toplevel configure/Makefile so that when
 libgo is compiled OBJCOPY is set to OBJCOPY_FOR_TARGET.

 I noticed this issue when building an aarch64 cross compile on an
 older system where objcopy did not understand aarch64.

 OK?  Bootstrapped and tested on x86_64 with no regressions.  Also
 tested with a combined build for a cross compiler to
 aarch64-linux-gnu.

 Thanks,
 Andrew Pinski


  * Makefile.def (flags_to_pass): Pass OBJCOPY_FOR_TARGET also.
  * Makefile.tpl (HOST_EXPORTS): Add OBJCOPY_FOR_TARGET.
  (BASE_TARGET_EXPORTS): Add OBJCOPY.
  (OBJCOPY_FOR_TARGET): New variable.
  (EXTRA_TARGET_FLAGS): Add OBJCOPY.
  * Makefile.in: Regenerate.
  * configure.ac: Check for already installed target objcopy.
  Also GCC_TARGET_TOOL on objcopy.
  * configure: Regenerate.

 OK


 Committed to GCC and gdb/binutils repos now.

 Thanks,
 Andrew

Hi Andrew,

 +  elif test x$target = x$host; then
 +# We can use an host tool
 +OBJCOPY_FOR_TARGET='$(OBJDUMP)'
Is it a typo for '$(OBJCOPY)' ?

Thanks,
bin


Re: [PATCH ARM]Prefer neon for stringops on a53/a57 in AArch32 mode

2015-01-04 Thread Bin.Cheng
On Tue, Dec 16, 2014 at 6:37 PM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Thu, Nov 13, 2014 at 1:54 PM, Bin Cheng bin.ch...@arm.com wrote:
 Hi,
 As commented at https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00684.html,
 this is a simple patch enabling neon memset inlining on
 cortex-a53/cortex-a57 in AArch32 mode.

 Test on
 arm-none-linux-gnueabihf/--with-cpu=cortex-a57/--with-fpu=crypto-neon-fp-armv8/--with-float=hard.
 I will further collect benchmark data, see if there is regression.

 Is it ok if benchmark results are good?

 2014-11-13  Bin Cheng  bin.ch...@arm.com

 * config/arm/arm.c (arm_cortex_a53_tune, arm_cortex_a57_tune):
 Prefer neon for stringops on cortex-a53/a57 in AArch32 mode.

 I collected perf data for this patch, there is no obvious change on
 cortex-a57/aarch32, so is it OK?



PING.

Thanks,
bin


Re: [4.8] Request to backport patch to the 4.8 branch

2014-12-24 Thread Bin.Cheng
On Wed, Dec 24, 2014 at 4:35 PM, zhangjian bamvor.zhangj...@huawei.com wrote:
 Hi, guys

 I encountered a gcc failure when building mysql on openSUSE [1]:
 5.6.17/storage/perfschema/pfs_account.cc:320:1: error: could not split insn
 [ 1245s]  }
 [ 1245s]  ^
 [ 1245s] (insn 482 1770 1461 (parallel [
 [ 1245s] (set (reg:SI 1 x1 [orig:167 D.16835 ] [167])
 [ 1245s] (mem/v:SI (reg/f:DI 0 x0 [orig:166 D.16844 ] [166]) 
 [-1  S4 A32]))
 [ 1245s] (set (mem/v:SI (reg/f:DI 0 x0 [orig:166 D.16844 ] [166]) 
 [-1  S4 A32])
 [ 1245s] (unspec_volatile:SI [
 [ 1245s] (ior:SI (mem/v:SI (reg/f:DI 0 x0 [orig:166 
 D.16844 ] [166]) [-1  S4 A32])
 [ 1245s] (const_int 0 [0]))
 [ 1245s] (const_int 5 [0x5])
 [ 1245s] ] UNSPECV_ATOMIC_OP))
 [ 1245s] (clobber (reg:CC 66 cc))
 [ 1245s] (clobber (reg:SI 4 x4))
 [ 1245s] (clobber (reg:SI 3 x3))
 [ 1245s] ]) 
 /home/abuild/rpmbuild/BUILD/mysql-5.6.17/include/my_atomic.h:217 1814 
 {atomic_fetch_orsi}
 [ 1245s]  (expr_list:REG_UNUSED (reg:CC 66 cc)
 [ 1245s] (expr_list:REG_UNUSED (reg:SI 4 x4)
 [ 1245s] (expr_list:REG_UNUSED (reg:SI 3 x3)
 [ 1245s] (nil)
 [ 1245s] 
 /home/abuild/rpmbuild/BUILD/mysql-5.6.17/storage/perfschema/pfs_account.cc:320:1:
  internal compiler error: in final_scan_insn, at final.c:2897

 This bug could be fixed by Michael's patch (r217076):
 2014-11-04  Michael Collison michael.colli...@linaro.org

 * config/aarch64/iterators.md (lconst_atomic): New mode attribute
 to support constraints for CONST_INT in atomic operations.
 * config/aarch64/atomics.md
 (atomic_<atomic_optab><mode>): Use <lconst_atomic> constraint.
 (atomic_nand<mode>): Likewise.
 (atomic_fetch_<atomic_optab><mode>): Likewise.
 (atomic_fetch_nand<mode>): Likewise.
 (atomic_<atomic_optab>_fetch<mode>): Likewise.
 (atomic_nand_fetch<mode>): Likewise.
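
 For reference, the mode attribute added by that patch looks like this
 (quoted from memory of r217076, so treat it as approximate):

   ;; Immediate constraint for logical atomic operations: "K" is the
   ;; 32-bit logical immediate, "L" the 64-bit one.
   (define_mode_attr lconst_atomic [(QI "K") (HI "K") (SI "K") (DI "L")])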

 Michael's patch can be applied on top of the gcc 4.8 branch except for
 the gcc/ChangeLog.  Is it possible to backport this patch to the gcc
 4.8 branch?  I am new here, so I am not sure whether I need to send the
 patch with a modified ChangeLog.  Sorry if I break the rules.
Hi,
Since the patch applies to 4.8 smoothly, and you already provided the
revision number, I don't think an additional patch is needed.  But is
the original patch for an existing bug?  And what about the gcc 4.9
branch?  Maybe you can create a PR against 4.8 (or 4.9) for tracking.
Another problem is you may need to wait for a while since it's holiday
time.

Thanks,
bin

 regards

 bamvor

 [1] https://bugzilla.opensuse.org/show_bug.cgi?id=896667



Re: [PATCH PR62151]Fix REG_DEAD note distribution issue by using right ELIM_I0/ELIM_I1

2014-12-22 Thread Bin.Cheng
On Mon, Dec 22, 2014 at 3:54 PM, Bin.Cheng amker.ch...@gmail.com wrote:
 On Sat, Dec 20, 2014 at 8:18 PM, Eric Botcazou ebotca...@adacore.com wrote:
 As described both in the PR and patch comments, this patch fixes PR62151 by
 setting right value to ELIM_I0/ELIM_I1 when distributing REG_DEAD notes from
 i0/i1.  It is said that distribute_notes had caused many bugs in the past.
 I think it still has bug in it, as noted in the PR.  This patch doesn't
 touch distribute_notes because we are in stage3 and I want to have more
 discussion on it.
 Bootstrap and test on x86_64.  aarch64 is ongoing.  So is it ok?

 2014-12-11  Bin Cheng  bin.ch...@arm.com

   PR rtl-optimization/62151
   * combine.c (try_combine): Reset elim_i0 and elim_i1 when
   distributing notes from i0notes or i1notes, this time don't
   check whether newi2pat sets i1dest or i0dest.

 The reasoning looks correct to me and the patch is certainly safe so it's OK
 on principle, but I think that we should avoid the duplication of predicates.

 Can you move the computation of the alternative elim_i1 and elim_i0 up to where
 the original ones are computed along with the explanation of why we care 
 about
 newi2pat only for notes that were on I3 and I2?  Something like:

 /* Compute which registers we expect to eliminate.  newi2pat may be
    setting either i3dest or i2dest, so we must check it.  */
 rtx elim_i2 = ((newi2pat && reg_set_p (i2dest, newi2pat))
                || i2dest_in_i2src || i2dest_in_i1src || i2dest_in_i0src
                || !i2dest_killed
                ? 0 : i2dest);
 /* For I1 we need to compute both local elimination and global elimination
    because i1dest may be the same as i3dest, in which case newi2pat may be
    setting i1dest.  <big explanation of why this is needed>  */
 rtx local_elim_i1 = (i1 == 0 || i1dest_in_i1src || i1dest_in_i0src
                      || !i1dest_killed
                      ? 0 : i1dest);
 rtx elim_i1 = (local_elim_i1 == 0
                || (newi2pat && reg_set_p (i1dest, newi2pat))
                ? 0 : i1dest);
 /* Likewise for I0.  */
 rtx local_elim_i0 = (i0 == 0 || i0dest_in_i0src
                      || !i0dest_killed
                      ? 0 : i0dest);
 rtx elim_i0 = (local_elim_i0 == 0
                || (newi2pat && reg_set_p (i0dest, newi2pat))
                ? 0 : i0dest);

 --
 Eric Botcazou

 Hi Eric,
 Thanks for reviewing.  Here comes the revised patch.  Bootstrap and
 test on x86_64, is it OK?

 Thanks,
 bin


 2014-12-22  Bin Cheng  bin.ch...@arm.com

 PR rtl-optimization/62151
 * combine.c (try_combine): New local variables local_elim_i1
 and local_elim_i0.  Set elim_i1 and elim_i0 using the local
 version variables.  Distribute notes from i0notes or i1notes
 using the local variables.

 gcc/testsuite/ChangeLog
 2014-12-22  Bin Cheng  bin.ch...@arm.com

 PR rtl-optimization/62151
 * gcc.c-torture/execute/pr62151.c: New test.

Hmm, I further revised the comment in the patch since in try_combine, i2
is always after i0/i1.  The original comment was inaccurate about that.

Thanks,
bin
Index: gcc/testsuite/gcc.c-torture/execute/pr62151.c
===
--- gcc/testsuite/gcc.c-torture/execute/pr62151.c   (revision 0)
+++ gcc/testsuite/gcc.c-torture/execute/pr62151.c   (revision 0)
@@ -0,0 +1,41 @@
+/* PR rtl-optimization/62151 */
+
+int a, c, d, e, f, g, h, i;
+short b;
+
+int
+fn1 ()
+{
+  b = 0;
+  for (;;)
+{
+  int j[2];
+  j[f] = 0;
+  if (h)
+   d = 0;
+  else
+   {
+ for (; f; f++)
+   ;
+ for (a = 0; a < 1; a++)
+   for (;;)
+ {
+   i = b < ((b ^ 1) & 83647) ? b : b - 1;
+   g = 1 ? i : 0;
+   e = j[0];
+   if (c)
+ break;
+   return 0;
+ }
+   }
+}
+}
+
+int
+main ()
+{
+  fn1 ();
+  if (g != -1)
+__builtin_abort ();
+  return 0;
+}
Index: gcc/combine.c
===
--- gcc/combine.c   (revision 218855)
+++ gcc/combine.c   (working copy)
@@ -4119,19 +4119,46 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn
 rtx midnotes = 0;
 int from_luid;
 /* Compute which registers we expect to eliminate.  newi2pat may be setting
-   either i3dest or i2dest, so we must check it.  Also, i1dest may be the
-   same as i3dest, in which case newi2pat may be setting i1dest.  */
+   either i3dest or i2dest, so we must check it.  */
 rtx elim_i2 = ((newi2pat && reg_set_p (i2dest, newi2pat))
   || i2dest_in_i2src || i2dest_in_i1src || i2dest_in_i0src
   || !i2dest_killed
   ? 0 : i2dest);
-rtx elim_i1 = (i1 == 0 || i1dest_in_i1src || i1dest_in_i0src
+/* For i1, we need to compute both local elimination and global

Re: [PATCH PR62151]Fix REG_DEAD note distribution issue by using right ELIM_I0/ELIM_I1

2014-12-21 Thread Bin.Cheng
On Sat, Dec 20, 2014 at 8:18 PM, Eric Botcazou ebotca...@adacore.com wrote:
 As described both in the PR and patch comments, this patch fixes PR62151 by
 setting right value to ELIM_I0/ELIM_I1 when distributing REG_DEAD notes from
 i0/i1.  It is said that distribute_notes had caused many bugs in the past.
 I think it still has bug in it, as noted in the PR.  This patch doesn't
 touch distribute_notes because we are in stage3 and I want to have more
 discussion on it.
 Bootstrap and test on x86_64.  aarch64 is ongoing.  So is it ok?

 2014-12-11  Bin Cheng  bin.ch...@arm.com

   PR rtl-optimization/62151
   * combine.c (try_combine): Reset elim_i0 and elim_i1 when
   distributing notes from i0notes or i1notes, this time don't
   check whether newi2pat sets i1dest or i0dest.

 The reasoning looks correct to me and the patch is certainly safe so it's OK
 on principle, but I think that we should avoid the duplication of predicates.

 Can you move the computation of the alternative elim_i1 and elim_i0 up to where
 the original ones are computed along with the explanation of why we care about
 newi2pat only for notes that were on I3 and I2?  Something like:

 /* Compute which registers we expect to eliminate.  newi2pat may be setting
    either i3dest or i2dest, so we must check it.  */
 rtx elim_i2 = ((newi2pat && reg_set_p (i2dest, newi2pat))
                || i2dest_in_i2src || i2dest_in_i1src || i2dest_in_i0src
                || !i2dest_killed
                ? 0 : i2dest);
 /* For I1 we need to compute both local elimination and global elimination
    because i1dest may be the same as i3dest, in which case newi2pat may be
    setting i1dest.  <big explanation of why this is needed>  */
 rtx local_elim_i1 = (i1 == 0 || i1dest_in_i1src || i1dest_in_i0src
                      || !i1dest_killed
                      ? 0 : i1dest);
 rtx elim_i1 = (local_elim_i1 == 0
                || (newi2pat && reg_set_p (i1dest, newi2pat))
                ? 0 : i1dest);
 /* Likewise for I0.  */
 rtx local_elim_i0 = (i0 == 0 || i0dest_in_i0src
                      || !i0dest_killed
                      ? 0 : i0dest);
 rtx elim_i0 = (local_elim_i0 == 0
                || (newi2pat && reg_set_p (i0dest, newi2pat))
                ? 0 : i0dest);

 --
 Eric Botcazou

Hi Eric,
Thanks for reviewing.  Here comes the revised patch.  Bootstrap and
test on x86_64, is it OK?

Thanks,
bin


2014-12-22  Bin Cheng  bin.ch...@arm.com

PR rtl-optimization/62151
* combine.c (try_combine): New local variables local_elim_i1
and local_elim_i0.  Set elim_i1 and elim_i0 using the local
version variables.  Distribute notes from i0notes or i1notes
 using the local variables.

gcc/testsuite/ChangeLog
2014-12-22  Bin Cheng  bin.ch...@arm.com

PR rtl-optimization/62151
* gcc.c-torture/execute/pr62151.c: New test.
Index: gcc/testsuite/gcc.c-torture/execute/pr62151.c
===
--- gcc/testsuite/gcc.c-torture/execute/pr62151.c   (revision 0)
+++ gcc/testsuite/gcc.c-torture/execute/pr62151.c   (revision 0)
@@ -0,0 +1,41 @@
+/* PR rtl-optimization/62151 */
+
+int a, c, d, e, f, g, h, i;
+short b;
+
+int
+fn1 ()
+{
+  b = 0;
+  for (;;)
+{
+  int j[2];
+  j[f] = 0;
+  if (h)
+   d = 0;
+  else
+   {
+ for (; f; f++)
+   ;
+ for (a = 0; a < 1; a++)
+   for (;;)
+ {
+   i = b < ((b ^ 1) & 83647) ? b : b - 1;
+   g = 1 ? i : 0;
+   e = j[0];
+   if (c)
+ break;
+   return 0;
+ }
+   }
+}
+}
+
+int
+main ()
+{
+  fn1 ();
+  if (g != -1)
+__builtin_abort ();
+  return 0;
+}
Index: gcc/combine.c
===
--- gcc/combine.c   (revision 218855)
+++ gcc/combine.c   (working copy)
@@ -4119,19 +4119,46 @@ try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn
 rtx midnotes = 0;
 int from_luid;
 /* Compute which registers we expect to eliminate.  newi2pat may be setting
-   either i3dest or i2dest, so we must check it.  Also, i1dest may be the
-   same as i3dest, in which case newi2pat may be setting i1dest.  */
+   either i3dest or i2dest, so we must check it.  */
 rtx elim_i2 = ((newi2pat && reg_set_p (i2dest, newi2pat))
   || i2dest_in_i2src || i2dest_in_i1src || i2dest_in_i0src
   || !i2dest_killed
   ? 0 : i2dest);
-rtx elim_i1 = (i1 == 0 || i1dest_in_i1src || i1dest_in_i0src
+/* For i1, we need to compute both local elimination and global
+   elimination information with respect to newi2pat because i1dest
+   may be the same as i3dest, in which case newi2pat may be setting
+   i1dest.  Global information is used when distributing REG_DEAD
+   note for i2 and i3, in which case it 

Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting

2014-12-18 Thread Bin.Cheng
On Fri, Dec 19, 2014 at 6:09 AM, Segher Boessenkool
seg...@kernel.crashing.org wrote:
 On Thu, Dec 18, 2014 at 05:00:01PM +, Jiong Wang wrote:
 On 17/12/14 15:54, Richard Biener wrote:
 ick.  I realize we don't have SSA form on RTL but doesn't DF provide
 at least some help in looking up definition statements for pseudos?
 In fact we want to restrict the transform to single-use pseudos, thus
 hopefully it can at least tell us that... (maybe not and this is what
 LOG_LINKS are for in combine...?)  At least loop-invariant already
 computes df_chain with DF_UD_CHAIN which seems exactly what
 is needed (apart from maybe getting at single-use info).

 thanks very much for these inspiring questions.

 yes, we want to restrict the transformation to single-use pseudos only,
 and it's better if the transformation can re-use existing info and
 helper functions to avoid increasing compile time, but I haven't found
 anything I can reuse at the stage where the transformation happens.

 info similar to LOG_LINKS is what I want, but maybe simpler.  I'll
 study the code that builds LOG_LINKS and see if we can factor some of
 it out.

 LOG_LINKs in combine are just historical.  combine should be converted
 to use DF fully.

 LOG_LINKs have nothing to do with single use; they point from the _first_
 use to its corresponding def.

 You might want to look at what fwprop does instead.
The rtl fwprop pass uses df information in a single-definition way; it
doesn't really take into consideration whether a register has a single
use.  This often corrupts other optimizations like post-increment and
load/store pair generation.  For example:

  add r2, r1, r0
  ldr rx, [r2]
  add r2, r2, #4
is transformed into below form:
  add r2, r1, r0
  ldr rx, [r1, r0]
  add r2, r2, #4

As a result, the post-increment opportunity is lost; also, the
definition of r2 can't be deleted because r2 is still used.
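
For comparison, keeping the address in r2 allows the post-increment
form (sketch, ARM syntax):

  add r2, r1, r0
  ldr rx, [r2], #4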

Thanks,
bin


Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.

2014-12-17 Thread Bin.Cheng
On Tue, Dec 16, 2014 at 4:42 PM, Bin.Cheng amker.ch...@gmail.com wrote:
> On Thu, Dec 11, 2014 at 8:08 PM, Richard Biener
> richard.guent...@gmail.com wrote:
>> On Thu, Dec 11, 2014 at 10:56 AM, Bin.Cheng amker.ch...@gmail.com wrote:
>>> On Wed, Dec 10, 2014 at 9:47 PM, Richard Biener
>>> richard.guent...@gmail.com wrote:
>>>> On Fri, Dec 5, 2014 at 1:15 PM, Bin Cheng bin.ch...@arm.com wrote:
>>>>> Hi,
>>>>> Though PR62178 is hidden by the recent cost change in the aarch64
>>>>> backend, the ivopt issue still exists.
>>>>>
>>>>> The current candidate-selecting algorithm tends to select fewer
>>>>> candidates, for the reasons below:
>>>>>   1) to better handle loops with many induction uses whose best
>>>>> choice is one generic basic induction variable;
>>>>>   2) to keep compilation time low.
>>>>>
>>>>> One fundamental weakness of the strategy is that the opposite
>>>>> situation can't always be handled properly.  For those cases the best
>>>>> choice is for each induction variable to have its own candidate.
>>>>> This patch fixes the problem by shuffling the candidate set after the
>>>>> fix-point is reached by the current implementation.  The reason this
>>>>> strategy works is that it replaces the candidate set by selecting the
>>>>> locally optimal candidate for some induction uses, and the new
>>>>> candidate set (with lower cost) is exactly what we want in the
>>>>> mentioned case.  Instrumentation data shows this finds better
>>>>> candidate sets for ~6% of loops in spec2006 on x86_64, and ~4% on
>>>>> aarch64.
>>>>>
>>>>> This patch is actually an extension of the first version posted at
>>>>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, which only
>>>>> adds another selecting pass with a special seed set (more or less
>>>>> like the shuffled set in this patch).  Data also confirms this patch
>>>>> finds optimal sets for most loops found by the first one, as well as
>>>>> for many new loops.
>>>>>
>>>>> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap
>>>>> and test on aarch64.
>>>>> Since this patch only selects candidate sets with lower cost, any
>>>>> regressions revealed are latent bugs in other components of GCC.
>>>>> I also collected GCC bootstrap time on x86_64, no regression either.
>>>>> Is this OK?
>>>>
>>>> The algorithm seems to be quadratic in the number of IV candidates
>>>> (at least):
>>> Yes, I worried about that too; that's why I measured the bootstrap
>>> time.  One way is to restrict this procedure to run once per loop.  I
>>> already tried that and it captures +90% of the loops.  Does this sound
>>> reasonable?
>>
>> Yes.  That's my suggestion: handle it in the caller of try_improve_iv_set.
>>
>>> BTW, do we have some compilation time benchmarks for GCC?
>>
>> There are various testcases linked from PR47344; I don't remember
>> any particular one putting load on IVOPTs (but I do remember seeing
>> IVOPTs in the ~25% area in -ftime-report for some testcases).


Hi,
I further refined the patch.  Specifically, I factored out common code,
improved comments, and restricted the new code in several ways.  For
example, iv_ca_replace now runs exactly once for each
find_optimal_iv_set; iv_ca_replace only tries to replace one candidate
in IVS at a time and returns quickly once a lower-cost set is found;
most importantly, iv_ca_replace now checks ALWAYS_PRUNE_CAND_SET_BOUND.
These changes simplify the patch.  As for compilation time, IVOPT shows
no obvious regression on the overloaded case I created, and the
regression in the llvm compilation-time benchmarks is gone.
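As a rough, self-contained sketch of the replacement idea (a toy model:
the costs are made up, and plain arrays stand in for the iv_ca/iv_cand
structures and cost routines of tree-ssa-loop-ivopts.c), the code below
drops one member of the candidate set, adds another, and keeps the
first strictly cheaper set it finds, mirroring the quick-return
behavior described above:

  #include <stdio.h>

  #define N_USES  4
  #define N_CANDS 5

  /* Toy cost tables: use_cost[u][c] is the cost of expressing use u
     with candidate c; cand_cost[c] is the setup cost of candidate c.
     The numbers are invented purely for illustration.  */
  static const unsigned use_cost[N_USES][N_CANDS] = {
    { 1, 4, 4, 4, 9 },
    { 4, 1, 4, 4, 9 },
    { 4, 4, 1, 4, 9 },
    { 4, 4, 4, 1, 9 },
  };
  static const unsigned cand_cost[N_CANDS] = { 2, 2, 2, 2, 3 };

  /* Cost of a candidate set: every selected candidate pays its setup
     cost, and each use picks the cheapest selected candidate.  */
  static unsigned
  set_cost (const int in_set[N_CANDS])
  {
    unsigned cost = 0;
    for (int c = 0; c < N_CANDS; c++)
      if (in_set[c])
        cost += cand_cost[c];
    for (int u = 0; u < N_USES; u++)
      {
        unsigned best = ~0u;
        for (int c = 0; c < N_CANDS; c++)
          if (in_set[c] && use_cost[u][c] < best)
            best = use_cost[u][c];
        cost += best;
      }
    return cost;
  }

  /* One replacement round: swap one selected candidate for an
     unselected one and return immediately on the first strictly
     cheaper set, undoing unsuccessful trials.  */
  static int
  try_replace (int in_set[N_CANDS])
  {
    unsigned best = set_cost (in_set);
    for (int old = 0; old < N_CANDS; old++)
      {
        if (!in_set[old])
          continue;
        for (int new_c = 0; new_c < N_CANDS; new_c++)
          {
            if (in_set[new_c])
              continue;
            in_set[old] = 0;
            in_set[new_c] = 1;
            if (set_cost (in_set) < best)
              return 1;               /* quick return: keep the swap */
            in_set[old] = 1;          /* undo the trial */
            in_set[new_c] = 0;
          }
      }
    return 0;
  }

  int
  main (void)
  {
    /* Start from a deliberately poor set: only the generic candidate.  */
    int set[N_CANDS] = { 0, 0, 0, 0, 1 };
    while (try_replace (set))
      ;  /* cost strictly decreases each round, so this terminates */
    printf ("final candidate-set cost: %u\n", set_cost (set));
    return 0;
  }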

I think we could adapt the data structures in IVOPT to make it faster,
for example by recording in each candidate which uses it represents,
and by sorting candidates by cost for each iv use.  I may do some
refactoring in next stage1.

Bootstrapped on x86_64, testing ongoing.  OK if there are no regressions?

Thanks,
bin

2014-12-17  Bin Cheng  bin.ch...@arm.com

PR tree-optimization/62178
* tree-ssa-loop-ivopts.c (cheaper_cost_with_cand): New function.
(iv_ca_replace): New function.
(try_improve_iv_set): New parameter try_replace_p.
Break local optimal fixed-point by calling iv_ca_replace.
(find_optimal_iv_set_1): Pass new argument to try_improve_iv_set.

gcc/testsuite/ChangeLog
2014-12-17  Bin Cheng  bin.ch...@arm.com

PR tree-optimization/62178
* gcc.target/aarch64/pr62178.c: New test.
Index: gcc/testsuite/gcc.target/aarch64/pr62178.c
===
--- gcc/testsuite/gcc.target/aarch64/pr62178.c  (revision 0)
+++ gcc/testsuite/gcc.target/aarch64/pr62178.c  (revision 0)
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+int a[30 +1][30 +1], b[30 +1][30 +1], r[30 +1][30 +1];
+
+void foo (void) {
+  int i, j, k;
+
+  for ( i = 1; i <= 30; i++ )
+    for ( j = 1; j <= 30; j++ ) {
+      r[i][j] = 0;
+      for ( k = 1; k <= 30; k++ )
+        r[i][j] += a[i][k]*b[k][j];
+    }
+}
+
+/* { dg-final { scan-assembler "ld1r\\t\{v\[0-9\]+\." } } */
Index: gcc/tree-ssa-loop-ivopts.c
===
--- gcc/tree-ssa-loop-ivopts.c  (revision 218200)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -5862,6

Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.

2014-12-16 Thread Bin.Cheng
On Thu, Dec 11, 2014 at 8:08 PM, Richard Biener
richard.guent...@gmail.com wrote:
> On Thu, Dec 11, 2014 at 10:56 AM, Bin.Cheng amker.ch...@gmail.com wrote:
>> On Wed, Dec 10, 2014 at 9:47 PM, Richard Biener
>> richard.guent...@gmail.com wrote:
>>> On Fri, Dec 5, 2014 at 1:15 PM, Bin Cheng bin.ch...@arm.com wrote:
>>>> Hi,
>>>> Though PR62178 is hidden by the recent cost change in the aarch64
>>>> backend, the ivopt issue still exists.
>>>>
>>>> The current candidate-selecting algorithm tends to select fewer
>>>> candidates, for the reasons below:
>>>>   1) to better handle loops with many induction uses whose best choice
>>>> is one generic basic induction variable;
>>>>   2) to keep compilation time low.
>>>>
>>>> One fundamental weakness of the strategy is that the opposite
>>>> situation can't always be handled properly.  For those cases the best
>>>> choice is for each induction variable to have its own candidate.
>>>> This patch fixes the problem by shuffling the candidate set after the
>>>> fix-point is reached by the current implementation.  The reason this
>>>> strategy works is that it replaces the candidate set by selecting the
>>>> locally optimal candidate for some induction uses, and the new
>>>> candidate set (with lower cost) is exactly what we want in the
>>>> mentioned case.  Instrumentation data shows this finds better
>>>> candidate sets for ~6% of loops in spec2006 on x86_64, and ~4% on
>>>> aarch64.
>>>>
>>>> This patch is actually an extension of the first version posted at
>>>> https://gcc.gnu.org/ml/gcc-patches/2014-09/msg02620.html, which only
>>>> adds another selecting pass with a special seed set (more or less like
>>>> the shuffled set in this patch).  Data also confirms this patch finds
>>>> optimal sets for most loops found by the first one, as well as for
>>>> many new loops.
>>>>
>>>> Bootstrap and test on x86_64, no regression on benchmarks.  Bootstrap
>>>> and test on aarch64.
>>>> Since this patch only selects candidate sets with lower cost, any
>>>> regressions revealed are latent bugs in other components of GCC.
>>>> I also collected GCC bootstrap time on x86_64, no regression either.
>>>> Is this OK?
>>>
>>> The algorithm seems to be quadratic in the number of IV candidates
>>> (at least):
>> Yes, I worried about that too; that's why I measured the bootstrap
>> time.  One way is to restrict this procedure to run once per loop.  I
>> already tried that and it captures +90% of the loops.  Does this sound
>> reasonable?
>
> Yes.  That's my suggestion: handle it in the caller of try_improve_iv_set.
>
>> BTW, do we have some compilation time benchmarks for GCC?
>
> There are various testcases linked from PR47344; I don't remember
> any particular one putting load on IVOPTs (but I do remember seeing
> IVOPTs in the ~25% area in -ftime-report for some testcases).


Hi Jeff & Richard,
I updated the patch according to your review comments.  Does this
version look good?
I didn't find any case in PR47344 that exercises IVOPT, but I produced
one case from PR53852 which runs ivopt for ~17% of total compile time
(28s).  This patch increases the IVOPT share to 18%.  Unfortunately,
the other restriction I tried doesn't work as well as this one on
spec2k6, if I understood the method correctly.

Hi Sebastian,
Thanks for the help!  I managed to run the llvm compilation time tests
successfully as you suggested.  The case
Multisource/Benchmarks/mafft/pairlocalalign regressed, but I can't
reproduce it from the command line.  The compilation time of
pairlocalalign.c is too small compared to the reported results.  I also
tried invoking it via RunSafely.sh but had no luck either.  Is there
any documentation on this?  Thanks very much!

Thanks,
bin

2014-12-16  Bin Cheng  bin.ch...@arm.com

PR tree-optimization/62178
* tree-ssa-loop-ivopts.c (cheaper_cost_with_cand): New function.
(iv_ca_replace): New function.
(try_improve_iv_set): New parameter try_replace_p.
Replace candidates in IVS by calling iv_ca_replace.
(find_optimal_iv_set_1): Pass new argument to try_improve_iv_set.

gcc/testsuite/ChangeLog
2014-12-16  Bin Cheng  bin.ch...@arm.com

PR tree-optimization/62178
* gcc.target/aarch64/pr62178.c: New test.
Index: gcc/tree-ssa-loop-ivopts.c
===
--- gcc/tree-ssa-loop-ivopts.c  (revision 218200)
+++ gcc/tree-ssa-loop-ivopts.c  (working copy)
@@ -5862,6 +5862,127 @@ iv_ca_prune (struct ivopts_data *data, struct iv_c
   return best_cost;
 }
 
+/* Check if CAND_IDX is a candidate other than OLD_CAND and has
+   cheaper local cost for USE than BEST_CP.  Return pointer to
+   the corresponding cost_pair, otherwise just return BEST_CP.  */
+
+static struct cost_pair *
+cheaper_cost_with_cand (struct ivopts_data *data, struct iv_use *use,
+                        unsigned int cand_idx, struct iv_cand *old_cand,
+                        struct cost_pair *best_cp)
+{
+  struct iv_cand *cand;
+  struct cost_pair *cp;
+
+  gcc_assert (old_cand != NULL);
+  if (cand_idx == old_cand->id)
+    return best_cp;
+
+  cand = iv_cand (data, cand_idx);
+  cp = get_use_iv_cost (data, use, cand);
+  if (cp != NULL
+      && (best_cp == NULL

Fwd: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.

2014-12-16 Thread Bin.Cheng
CCing Sebastian.

Thanks,
bin

-- Forwarded message --
From: Bin.Cheng amker.ch...@gmail.com
Date: Tue, Dec 16, 2014 at 4:42 PM
Subject: Re: [PATCH PR62178]Improve candidate selecting in IVOPT, 2nd try.
To: Richard Biener richard.guent...@gmail.com
Cc: Bin Cheng bin.ch...@arm.com, GCC Patches
gcc-patches@gcc.gnu.org, Zdenek Dvorak o...@ucw.cz



Re: [PATCH ARM]Prefer neon for stringops on a53/a57 in AArch32 mode

2014-12-16 Thread Bin.Cheng
On Thu, Nov 13, 2014 at 1:54 PM, Bin Cheng bin.ch...@arm.com wrote:
> Hi,
> As commented at https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00684.html,
> this is a simple patch enabling neon memset inlining on
> cortex-a53/cortex-a57 in AArch32 mode.
>
> Tested on arm-none-linux-gnueabihf with
> --with-cpu=cortex-a57/--with-fpu=crypto-neon-fp-armv8/--with-float=hard.
> I will further collect benchmark data and see if there is any
> regression.
>
> Is it ok if benchmark results are good?
>
> 2014-11-13  Bin Cheng  bin.ch...@arm.com
>
>         * config/arm/arm.c (arm_cortex_a53_tune, arm_cortex_a57_tune):
>         Prefer neon for stringops on cortex-a53/a57 in AArch32 mode.

I collected perf data for this patch; there is no obvious change on
cortex-a57/aarch32, so is it OK?

Thanks,
bin
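
For readers unfamiliar with the tuning knob: "neon for stringops" lets
the inline expansion of small memset calls use the NEON q-registers
instead of a libcall or scalar stores.  The intrinsics function below
is only a hedged, user-level model of the kind of code such an
expansion corresponds to (the backend emits the expansion directly as
RTL; clear64 and the 64-byte size are assumptions for illustration):

  #include <arm_neon.h>

  /* Roughly what an inlined memset (p, 0, 64) can look like when the
     backend may use NEON: one duplicated byte value stored through
     128-bit q-registers.  Illustrative only; not the backend's actual
     output.  */
  void
  clear64 (unsigned char *p)
  {
    uint8x16_t v = vdupq_n_u8 (0);
    vst1q_u8 (p +  0, v);
    vst1q_u8 (p + 16, v);
    vst1q_u8 (p + 32, v);
    vst1q_u8 (p + 48, v);
  }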

