Re: [PATCH, PR43864] Gimple level duplicate block cleanup.

2011-06-08 Thread Richard Guenther
On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries  wrote:
> Hi Richard,
>
> I have a patch for PR43864. The patch adds a gimple level duplicate block
> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the 
> following
> table (%, lower is better).
>
>                     none            pic
>                thumb1  thumb2  thumb1 thumb2
> spec2000          99.9    99.9    99.8   99.8
>
> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that 
> the
> optimizations proposed in PR20070 would fix this PR.
>
> The problem in this PR is that when compiling with -O2, the example below 
> should
> only have one call to free. The original problem is formulated in terms of 
> -Os,
> but currently we generate one call to free with -Os, although still not the
> smallest code possible. I'll show here the -O2 case, since that's similar to 
> the
> original PR.
>
> #include 
> void foo (char*, FILE*);
> char* hprofStartupp(char *outputFileName, char *ctx)
> {
>    char fileName[1000];
>    FILE *fp;
>    sprintf(fileName, outputFileName);
>    if (access(fileName, 1) == 0) {
>        free(ctx);
>        return 0;
>    }
>
>    fp = fopen(fileName, 0);
>    if (fp == 0) {
>        free(ctx);
>        return 0;
>    }
>
>    foo(outputFileName, fp);
>
>    return ctx;
> }
>
> AFAIU, there are 2 complementary methods of rtl optimizations proposed in 
> PR20070.
> - Merging 2 blocks which are identical expect for input registers, by using a
>  conditional move to choose between the different input registers.
> - Merging 2 blocks which have different local registers, by ignoring those
>  differences
>
> Blocks .L6 and.L7 have no difference in local registers, but they have a
> difference in input registers: r3 and r1. Replacing the move to r5 by a
> conditional move would probably be benificial in terms of size, but it's not
> clear what condition the conditional move should be using. Calculating such a
> condition would add in size and increase the execution path.
>
> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
> ...
>        push    {r4, r5, lr}
>        mov     r4, r0
>        sub     sp, sp, #1004
>        mov     r5, r1
>        mov     r0, sp
>        mov     r1, r4
>        bl      sprintf
>        mov     r0, sp
>        movs    r1, #1
>        bl      access
>        mov     r3, r0
>        cbz     r0, .L6
>        movs    r1, #0
>        mov     r0, sp
>        bl      fopen
>        mov     r1, r0
>        cbz     r0, .L7
>        mov     r0, r4
>        bl      foo
> .L3:
>        mov     r0, r5
>        add     sp, sp, #1004
>        pop     {r4, r5, pc}
> .L6:
>        mov     r0, r5
>        mov     r5, r3
>        bl      free
>        b       .L3
> .L7:
>        mov     r0, r5
>        mov     r5, r1
>        bl      free
>        b       .L3
> ...
>
> The proposed patch solved the problem by dealing with the 2 blocks at a level
> when they are still identical: at gimple level. It detect that the 2 blocks 
> are
> identical, and removes one of them.
>
> The following table shows the impact of the patch on the example in terms of
> size for -march=armv7-a:
>
>          without     with    delta
> Os      :     108      104       -4
> O2      :     120      104      -16
> Os thumb:      68       64       -4
> O2 thumb:      76       64      -12
>
> The gain in size for -O2 is that of removing the entire block, plus the
> replacement of 2 moves by a constant set, which also decreases the execution
> path. The patch ensures optimal code for both -O2 and -Os.
>
>
> By keeping track of equivalent definitions in the 2 blocks, we can ignore 
> those
> differences in comparison. Without this feature, we would only match blocks 
> with
> resultless operations, due the the ssa-nature of gimples.
> For example, with this feature, we reduce the following function to its 
> minimum
> at gimple level, rather than at rtl level.
>
> int f(int c, int b, int d)
> {
>  int r, e;
>
>  if (c)
>    r = b + d;
>  else
>    {
>      e = b + d;
>      r = e;
>    }
>
>  return r;
> }
>
> ;; Function f (f)
>
> f (int c, int b, int d)
> {
>  int e;
>
> :
>  e_6 = b_3(D) + d_4(D);
>  return e_6;
>
> }
>
> I'll send the patch with the testcases in a separate email.
>
> OK for trunk?

I don't like that you hook this into cleanup_tree_cfg - that is called
_way_ too often.

This also duplicates the literal matching done on the RTL level - instead
I think this optimization would be more related to value-numbering
(either that of SCCVN/FRE/PRE or that of DOM which also does
jump-threading).

Richard.

> Thanks,
> - Tom
>
> 2011-06-08  Tom de Vries  
>
>        PR middle-end/43864
>        * tree-cfgcleanup.c (int_int_splay_lookup, int_int_splay_insert)
>        (int_int_splay_node_contained_in, int_int_splay_contained_in)
>        (equiv_lookup, equiv_insert, equiv_contained_in, equiv_init)
>        (equiv_delete, gimple_base_equal_p,

Re: [PATCH, PR43864] Gimple level duplicate block cleanup.

2011-06-08 Thread Steven Bosscher
On Wed, Jun 8, 2011 at 11:55 AM, Richard Guenther
 wrote:
> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries  wrote:
>> Hi Richard,
>>
>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the 
>> following
>> table (%, lower is better).
>>
>>                     none            pic
>>                thumb1  thumb2  thumb1 thumb2
>> spec2000          99.9    99.9    99.8   99.8
>>
>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that 
>> the
>> optimizations proposed in PR20070 would fix this PR.
>>
>> The problem in this PR is that when compiling with -O2, the example below 
>> should
>> only have one call to free. The original problem is formulated in terms of 
>> -Os,
>> but currently we generate one call to free with -Os, although still not the
>> smallest code possible. I'll show here the -O2 case, since that's similar to 
>> the
>> original PR.
>>
>> #include 
>> void foo (char*, FILE*);
>> char* hprofStartupp(char *outputFileName, char *ctx)
>> {
>>    char fileName[1000];
>>    FILE *fp;
>>    sprintf(fileName, outputFileName);
>>    if (access(fileName, 1) == 0) {
>>        free(ctx);
>>        return 0;
>>    }
>>
>>    fp = fopen(fileName, 0);
>>    if (fp == 0) {
>>        free(ctx);
>>        return 0;
>>    }
>>
>>    foo(outputFileName, fp);
>>
>>    return ctx;
>> }
>>
>> AFAIU, there are 2 complementary methods of rtl optimizations proposed in 
>> PR20070.
>> - Merging 2 blocks which are identical expect for input registers, by using a
>>  conditional move to choose between the different input registers.
>> - Merging 2 blocks which have different local registers, by ignoring those
>>  differences
>>
>> Blocks .L6 and.L7 have no difference in local registers, but they have a
>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>> conditional move would probably be benificial in terms of size, but it's not
>> clear what condition the conditional move should be using. Calculating such a
>> condition would add in size and increase the execution path.
>>
>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>> ...
>>        push    {r4, r5, lr}
>>        mov     r4, r0
>>        sub     sp, sp, #1004
>>        mov     r5, r1
>>        mov     r0, sp
>>        mov     r1, r4
>>        bl      sprintf
>>        mov     r0, sp
>>        movs    r1, #1
>>        bl      access
>>        mov     r3, r0
>>        cbz     r0, .L6
>>        movs    r1, #0
>>        mov     r0, sp
>>        bl      fopen
>>        mov     r1, r0
>>        cbz     r0, .L7
>>        mov     r0, r4
>>        bl      foo
>> .L3:
>>        mov     r0, r5
>>        add     sp, sp, #1004
>>        pop     {r4, r5, pc}
>> .L6:
>>        mov     r0, r5
>>        mov     r5, r3
>>        bl      free
>>        b       .L3
>> .L7:
>>        mov     r0, r5
>>        mov     r5, r1
>>        bl      free
>>        b       .L3
>> ...
>>
>> The proposed patch solved the problem by dealing with the 2 blocks at a level
>> when they are still identical: at gimple level. It detect that the 2 blocks 
>> are
>> identical, and removes one of them.
>>
>> The following table shows the impact of the patch on the example in terms of
>> size for -march=armv7-a:
>>
>>          without     with    delta
>> Os      :     108      104       -4
>> O2      :     120      104      -16
>> Os thumb:      68       64       -4
>> O2 thumb:      76       64      -12
>>
>> The gain in size for -O2 is that of removing the entire block, plus the
>> replacement of 2 moves by a constant set, which also decreases the execution
>> path. The patch ensures optimal code for both -O2 and -Os.
>>
>>
>> By keeping track of equivalent definitions in the 2 blocks, we can ignore 
>> those
>> differences in comparison. Without this feature, we would only match blocks 
>> with
>> resultless operations, due the the ssa-nature of gimples.
>> For example, with this feature, we reduce the following function to its 
>> minimum
>> at gimple level, rather than at rtl level.
>>
>> int f(int c, int b, int d)
>> {
>>  int r, e;
>>
>>  if (c)
>>    r = b + d;
>>  else
>>    {
>>      e = b + d;
>>      r = e;
>>    }
>>
>>  return r;
>> }
>>
>> ;; Function f (f)
>>
>> f (int c, int b, int d)
>> {
>>  int e;
>>
>> :
>>  e_6 = b_3(D) + d_4(D);
>>  return e_6;
>>
>> }
>>
>> I'll send the patch with the testcases in a separate email.
>>
>> OK for trunk?
>
> I don't like that you hook this into cleanup_tree_cfg - that is called
> _way_ too often.
>
> This also duplicates the literal matching done on the RTL level - instead
> I think this optimization would be more related to value-numbering
> (either that of SCCVN/FRE/PRE or that of DOM which also does
> jump-threading).

And while at it: Put it in a separate file ;-)

Ciao!
Steven


Re: [PATCH, PR43864] Gimple level duplicate block cleanup.

2011-06-10 Thread Tom de Vries
Hi Richard,

thanks for the review.

On 06/08/2011 11:55 AM, Richard Guenther wrote:
> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries  wrote:
>> Hi Richard,
>>
>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the 
>> following
>> table (%, lower is better).
>>
>> nonepic
>>thumb1  thumb2  thumb1 thumb2
>> spec2000 99.999.999.8   99.8
>>
>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that 
>> the
>> optimizations proposed in PR20070 would fix this PR.
>>
>> The problem in this PR is that when compiling with -O2, the example below 
>> should
>> only have one call to free. The original problem is formulated in terms of 
>> -Os,
>> but currently we generate one call to free with -Os, although still not the
>> smallest code possible. I'll show here the -O2 case, since that's similar to 
>> the
>> original PR.
>>

Example A. (naming it for reference below)

>> #include 
>> void foo (char*, FILE*);
>> char* hprofStartupp(char *outputFileName, char *ctx)
>> {
>>char fileName[1000];
>>FILE *fp;
>>sprintf(fileName, outputFileName);
>>if (access(fileName, 1) == 0) {
>>free(ctx);
>>return 0;
>>}
>>
>>fp = fopen(fileName, 0);
>>if (fp == 0) {
>>free(ctx);
>>return 0;
>>}
>>
>>foo(outputFileName, fp);
>>
>>return ctx;
>> }
>>
>> AFAIU, there are 2 complementary methods of rtl optimizations proposed in 
>> PR20070.
>> - Merging 2 blocks which are identical expect for input registers, by using a
>>  conditional move to choose between the different input registers.
>> - Merging 2 blocks which have different local registers, by ignoring those
>>  differences
>>
>> Blocks .L6 and.L7 have no difference in local registers, but they have a
>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>> conditional move would probably be benificial in terms of size, but it's not
>> clear what condition the conditional move should be using. Calculating such a
>> condition would add in size and increase the execution path.
>>
>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>> ...
>>push{r4, r5, lr}
>>mov r4, r0
>>sub sp, sp, #1004
>>mov r5, r1
>>mov r0, sp
>>mov r1, r4
>>bl  sprintf
>>mov r0, sp
>>movsr1, #1
>>bl  access
>>mov r3, r0
>>cbz r0, .L6
>>movsr1, #0
>>mov r0, sp
>>bl  fopen
>>mov r1, r0
>>cbz r0, .L7
>>mov r0, r4
>>bl  foo
>> .L3:
>>mov r0, r5
>>add sp, sp, #1004
>>pop {r4, r5, pc}
>> .L6:
>>mov r0, r5
>>mov r5, r3
>>bl  free
>>b   .L3
>> .L7:
>>mov r0, r5
>>mov r5, r1
>>bl  free
>>b   .L3
>> ...
>>
>> The proposed patch solved the problem by dealing with the 2 blocks at a level
>> when they are still identical: at gimple level. It detect that the 2 blocks 
>> are
>> identical, and removes one of them.
>>
>> The following table shows the impact of the patch on the example in terms of
>> size for -march=armv7-a:
>>
>>  without withdelta
>> Os  : 108  104   -4
>> O2  : 120  104  -16
>> Os thumb:  68   64   -4
>> O2 thumb:  76   64  -12
>>
>> The gain in size for -O2 is that of removing the entire block, plus the
>> replacement of 2 moves by a constant set, which also decreases the execution
>> path. The patch ensures optimal code for both -O2 and -Os.
>>
>>
>> By keeping track of equivalent definitions in the 2 blocks, we can ignore 
>> those
>> differences in comparison. Without this feature, we would only match blocks 
>> with
>> resultless operations, due the the ssa-nature of gimples.
>> For example, with this feature, we reduce the following function to its 
>> minimum
>> at gimple level, rather than at rtl level.
>>

Example B. (naming it for reference below)

>> int f(int c, int b, int d)
>> {
>>  int r, e;
>>
>>  if (c)
>>r = b + d;
>>  else
>>{
>>  e = b + d;
>>  r = e;
>>}
>>
>>  return r;
>> }
>>
>> ;; Function f (f)
>>
>> f (int c, int b, int d)
>> {
>>  int e;
>>
>> :
>>  e_6 = b_3(D) + d_4(D);
>>  return e_6;
>>
>> }
>>
>> I'll send the patch with the testcases in a separate email.
>>
>> OK for trunk?
> 
> I don't like that you hook this into cleanup_tree_cfg - that is called
> _way_ too often.
> 

Here is a reworked patch that addresses several concerns, particularly the
compile time overhead.

Changes:
- The optimization is now in a separate file.
- The optimization is now a pass rather than a cleanup. That allowed me to

Re: [PATCH, PR43864] Gimple level duplicate block cleanup.

2011-06-10 Thread Jeff Law
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 06/08/11 03:42, Tom de Vries wrote:
> Hi Richard,
> 
> I have a patch for PR43864. The patch adds a gimple level duplicate block
> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the 
> following
> table (%, lower is better).
> 
>  nonepic
> thumb1  thumb2  thumb1 thumb2
> spec2000  99.999.999.8   99.8
> 
> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that 
> the
> optimizations proposed in PR20070 would fix this PR.
> 
> The problem in this PR is that when compiling with -O2, the example below 
> should
> only have one call to free. The original problem is formulated in terms of 
> -Os,
> but currently we generate one call to free with -Os, although still not the
> smallest code possible. I'll show here the -O2 case, since that's similar to 
> the
> original PR.
[ ... ]

FWIW, I've seen at least one paper which claims that extending value
numbering redundancy elimination to handle blocks is effective at
eliminating duplicates.Redundancy elimination for blocks turns into
CFG manipulations.

We want to do this early in the pipeline so that register numbering
doesn't get in the way.

I'm going to let you and Richi iterate, but wanted to chime in with my
general support for detecting and eliminating blocks early in the
pipeline (gimple).

jeff
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJN8mTsAAoJEBRtltQi2kC7UxMIAKqEprg9LUJS4QnUeXRKDx7h
+tMBS1hPEZcmfxV9jo95oXb+1yiCIrI2CUou9PDO8E4i/Dv3x6kCF5vyfLBGx0ZI
dS80HjTdvr2vUzxFQnChESvDYepyg2JE6pnlnOe4cwnuyFHWhPQG7L3e3lDaZkTa
224afwDNNUMmCGDE4emUpgV/evQ4dpiiY/dqAU1fu6ev8wQxfDpyGeZOTD+qC1XM
DfpZumzHJpJfo2w/LMbSiBNci0HYxENjUheNixzHDLMSUO8AdStje6NBIJ96I3s7
p94WuwuP08B1wU5J0F4B1TiyAJxDOBbtApRwhpAOs9NImEDydmOXDLjIaPivXBY=
=2J0r
-END PGP SIGNATURE-


Re: [PATCH, PR43864] Gimple level duplicate block cleanup.

2011-06-14 Thread Richard Guenther
On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries  wrote:
> Hi Richard,
>
> thanks for the review.
>
> On 06/08/2011 11:55 AM, Richard Guenther wrote:
>> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries  wrote:
>>> Hi Richard,
>>>
>>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the 
>>> following
>>> table (%, lower is better).
>>>
>>>                     none            pic
>>>                thumb1  thumb2  thumb1 thumb2
>>> spec2000         99.9    99.9    99.8   99.8
>>>
>>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure 
>>> that the
>>> optimizations proposed in PR20070 would fix this PR.
>>>
>>> The problem in this PR is that when compiling with -O2, the example below 
>>> should
>>> only have one call to free. The original problem is formulated in terms of 
>>> -Os,
>>> but currently we generate one call to free with -Os, although still not the
>>> smallest code possible. I'll show here the -O2 case, since that's similar 
>>> to the
>>> original PR.
>>>
>
> Example A. (naming it for reference below)
>
>>> #include 
>>> void foo (char*, FILE*);
>>> char* hprofStartupp(char *outputFileName, char *ctx)
>>> {
>>>    char fileName[1000];
>>>    FILE *fp;
>>>    sprintf(fileName, outputFileName);
>>>    if (access(fileName, 1) == 0) {
>>>        free(ctx);
>>>        return 0;
>>>    }
>>>
>>>    fp = fopen(fileName, 0);
>>>    if (fp == 0) {
>>>        free(ctx);
>>>        return 0;
>>>    }
>>>
>>>    foo(outputFileName, fp);
>>>
>>>    return ctx;
>>> }
>>>
>>> AFAIU, there are 2 complementary methods of rtl optimizations proposed in 
>>> PR20070.
>>> - Merging 2 blocks which are identical expect for input registers, by using 
>>> a
>>>  conditional move to choose between the different input registers.
>>> - Merging 2 blocks which have different local registers, by ignoring those
>>>  differences
>>>
>>> Blocks .L6 and.L7 have no difference in local registers, but they have a
>>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>>> conditional move would probably be benificial in terms of size, but it's not
>>> clear what condition the conditional move should be using. Calculating such 
>>> a
>>> condition would add in size and increase the execution path.
>>>
>>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>>> ...
>>>        push    {r4, r5, lr}
>>>        mov     r4, r0
>>>        sub     sp, sp, #1004
>>>        mov     r5, r1
>>>        mov     r0, sp
>>>        mov     r1, r4
>>>        bl      sprintf
>>>        mov     r0, sp
>>>        movs    r1, #1
>>>        bl      access
>>>        mov     r3, r0
>>>        cbz     r0, .L6
>>>        movs    r1, #0
>>>        mov     r0, sp
>>>        bl      fopen
>>>        mov     r1, r0
>>>        cbz     r0, .L7
>>>        mov     r0, r4
>>>        bl      foo
>>> .L3:
>>>        mov     r0, r5
>>>        add     sp, sp, #1004
>>>        pop     {r4, r5, pc}
>>> .L6:
>>>        mov     r0, r5
>>>        mov     r5, r3
>>>        bl      free
>>>        b       .L3
>>> .L7:
>>>        mov     r0, r5
>>>        mov     r5, r1
>>>        bl      free
>>>        b       .L3
>>> ...
>>>
>>> The proposed patch solved the problem by dealing with the 2 blocks at a 
>>> level
>>> when they are still identical: at gimple level. It detect that the 2 blocks 
>>> are
>>> identical, and removes one of them.
>>>
>>> The following table shows the impact of the patch on the example in terms of
>>> size for -march=armv7-a:
>>>
>>>          without     with    delta
>>> Os      :     108      104       -4
>>> O2      :     120      104      -16
>>> Os thumb:      68       64       -4
>>> O2 thumb:      76       64      -12
>>>
>>> The gain in size for -O2 is that of removing the entire block, plus the
>>> replacement of 2 moves by a constant set, which also decreases the execution
>>> path. The patch ensures optimal code for both -O2 and -Os.
>>>
>>>
>>> By keeping track of equivalent definitions in the 2 blocks, we can ignore 
>>> those
>>> differences in comparison. Without this feature, we would only match blocks 
>>> with
>>> resultless operations, due the the ssa-nature of gimples.
>>> For example, with this feature, we reduce the following function to its 
>>> minimum
>>> at gimple level, rather than at rtl level.
>>>
>
> Example B. (naming it for reference below)
>
>>> int f(int c, int b, int d)
>>> {
>>>  int r, e;
>>>
>>>  if (c)
>>>    r = b + d;
>>>  else
>>>    {
>>>      e = b + d;
>>>      r = e;
>>>    }
>>>
>>>  return r;
>>> }
>>>
>>> ;; Function f (f)
>>>
>>> f (int c, int b, int d)
>>> {
>>>  int e;
>>>
>>> :
>>>  e_6 = b_3(D) + d_4(D);
>>>  return e_6;
>>>
>>> }
>>>
>>> I'll send the patch with the testcases in a separate email.
>>>
>>> OK for trunk?
>>
>> I don't like that you hook this into cleanup_tree_cfg - that is called
>> _way_

Re: [PATCH, PR43864] Gimple level duplicate block cleanup.

2011-07-12 Thread Richard Guenther
On Tue, Jul 12, 2011 at 2:12 PM, Tom de Vries  wrote:
> Hi Richard,
>
> here's a new version of the pass. I attempted to address as much as possible
> your comments. The pass was bootstrapped and reg-tested on x86_64.
>
> On 06/14/2011 05:01 PM, Richard Guenther wrote:
>> On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries  wrote:
>>> Hi Richard,
>>>
>>> thanks for the review.
>>>
>>> On 06/08/2011 11:55 AM, Richard Guenther wrote:
 On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries  
 wrote:
> Hi Richard,
>
> I have a patch for PR43864. The patch adds a gimple level duplicate block
> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the 
> following
> table (%, lower is better).
>
>                     none            pic
>                thumb1  thumb2  thumb1 thumb2
> spec2000         99.9    99.9    99.8   99.8
>
> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure 
> that the
> optimizations proposed in PR20070 would fix this PR.
>
> The problem in this PR is that when compiling with -O2, the example below 
> should
> only have one call to free. The original problem is formulated in terms 
> of -Os,
> but currently we generate one call to free with -Os, although still not 
> the
> smallest code possible. I'll show here the -O2 case, since that's similar 
> to the
> original PR.
>
>>>
>>> Example A. (naming it for reference below)
>>>
> #include 
> void foo (char*, FILE*);
> char* hprofStartupp(char *outputFileName, char *ctx)
> {
>    char fileName[1000];
>    FILE *fp;
>    sprintf(fileName, outputFileName);
>    if (access(fileName, 1) == 0) {
>        free(ctx);
>        return 0;
>    }
>
>    fp = fopen(fileName, 0);
>    if (fp == 0) {
>        free(ctx);
>        return 0;
>    }
>
>    foo(outputFileName, fp);
>
>    return ctx;
> }
>
> AFAIU, there are 2 complementary methods of rtl optimizations proposed in 
> PR20070.
> - Merging 2 blocks which are identical expect for input registers, by 
> using a
>  conditional move to choose between the different input registers.
> - Merging 2 blocks which have different local registers, by ignoring those
>  differences
>
> Blocks .L6 and.L7 have no difference in local registers, but they have a
> difference in input registers: r3 and r1. Replacing the move to r5 by a
> conditional move would probably be benificial in terms of size, but it's 
> not
> clear what condition the conditional move should be using. Calculating 
> such a
> condition would add in size and increase the execution path.
>
> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
> ...
>        push    {r4, r5, lr}
>        mov     r4, r0
>        sub     sp, sp, #1004
>        mov     r5, r1
>        mov     r0, sp
>        mov     r1, r4
>        bl      sprintf
>        mov     r0, sp
>        movs    r1, #1
>        bl      access
>        mov     r3, r0
>        cbz     r0, .L6
>        movs    r1, #0
>        mov     r0, sp
>        bl      fopen
>        mov     r1, r0
>        cbz     r0, .L7
>        mov     r0, r4
>        bl      foo
> .L3:
>        mov     r0, r5
>        add     sp, sp, #1004
>        pop     {r4, r5, pc}
> .L6:
>        mov     r0, r5
>        mov     r5, r3
>        bl      free
>        b       .L3
> .L7:
>        mov     r0, r5
>        mov     r5, r1
>        bl      free
>        b       .L3
> ...
>
> The proposed patch solved the problem by dealing with the 2 blocks at a 
> level
> when they are still identical: at gimple level. It detect that the 2 
> blocks are
> identical, and removes one of them.
>
> The following table shows the impact of the patch on the example in terms 
> of
> size for -march=armv7-a:
>
>          without     with    delta
> Os      :     108      104       -4
> O2      :     120      104      -16
> Os thumb:      68       64       -4
> O2 thumb:      76       64      -12
>
> The gain in size for -O2 is that of removing the entire block, plus the
> replacement of 2 moves by a constant set, which also decreases the 
> execution
> path. The patch ensures optimal code for both -O2 and -Os.
>
>
> By keeping track of equivalent definitions in the 2 blocks, we can ignore 
> those
> differences in comparison. Without this feature, we would only match 
> blocks with
> resultless operations, due the the ssa-nature of gimples.
> For example, with this feature, we reduce the following function to its 
> minimum
> at gim

Re: [PATCH, PR43864] Gimple level duplicate block cleanup.

2011-07-22 Thread Richard Guenther
On Sun, Jul 17, 2011 at 8:33 PM, Tom de Vries  wrote:

> Bootstrapped and reg-tested on x86_64.  Ok for trunk (after ARM testing)?

+static int
+same_succ_equal (const void *ve1, const void *ve2)
+{
...
+  if (bitmap_bit_p (e1->bbs, ENTRY_BLOCK)
+  || bitmap_bit_p (e1->bbs, EXIT_BLOCK)
+  || bitmap_bit_p (e2->bbs, ENTRY_BLOCK)
+  || bitmap_bit_p (e2->bbs, EXIT_BLOCK))
+return 0;

that's odd - what are these checks for?

+  if (dump_file)
+{
+  fprintf (dump_file, "initial worklist:\n");

with dump_flags & TDF_DETAILS

I'm now at merge_calls and wondering about alias info again.  We are
probably safe for the per-pointer information because we are not
operating flow-sensitive for memory and for merging require value-equivalence
for SSA names.  For calls the same should be true - we are not
flow- or context-sensitive, and even if we were context-sentitive we
require equivalent arguments (for memory arguments we should be safe
because of the non-flow-sensitivity).

So, did you actually run into problems?  If not then I suggest to remove
merge_calls completely (and the related changes that it requires).

+/* Create or update a vop phi in BB2.  Use VUSE1 arguments for all the
+   REDIRECTED_EDGES, or if VUSE1 is NULL_TREE, use BB_VOP_AT_EXIT.  If a new
+   phis is created, use the phi instead of VUSE2 in BB2.  */
+
+static void
+update_vuses (tree vuse1, tree vuse2, basic_block bb2,
+  VEC (edge,heap) *redirected_edges)

...

+  if (vuse2 == NULL_TREE)
+return;

hm, that's the case when there is no VUSE that is dominated by BB2
(or is in BB2).  Ok, might happen.

+ locus1 = gimple_location (SSA_NAME_DEF_STMT (arg));
+ add_phi_arg (phi, arg, e, locus1);

I don't think virtual operand PHIs should have locations, just use
UNKNOWN_LOCATION here.

+  locus2 = gimple_location (def_stmt2);

Likewise.

+  /* Create a phi, first with default argument vuse2 for all preds.  */
+  lhs = make_ssa_name (SSA_NAME_VAR (vuse2), NULL);
+  VN_INFO_GET (lhs);
+  phi = create_phi_node (lhs, bb2);
+  SSA_NAME_DEF_STMT (lhs) = phi;
+  FOR_EACH_EDGE (e, ei, bb2->preds)
+add_phi_arg (phi, vuse2, e, locus2);
+
+  /* Now overwrite the arguments associated with the redirected edges with
+ vuse1.  */
+  for (i = 0; i < EDGE_COUNT (redirected_edges); ++i)
+{
+  e = VEC_index (edge, redirected_edges, i);
+  gcc_assert (PHI_ARG_DEF_FROM_EDGE (phi, e));

No need for this assert.

+  if (vuse1)
+   arg = vuse1;
+  else
+   arg = BB_VOP_AT_EXIT (e->src);
+  SET_PHI_ARG_DEF (phi, e->dest_idx, arg);
+  locus1 = gimple_location (SSA_NAME_DEF_STMT (arg));

See above.

+  gimple_phi_arg_set_location (phi, e->dest_idx, locus1);
+}


Can you maybe merge this with the update-existing-phi-case?  They
look all too similar.

+  /* Replace uses of vuse2 in bb2 with phi.  */
+  FOR_EACH_IMM_USE_STMT (stmt, iter, vuse2)
+{
+  if (gimple_code (stmt) == GIMPLE_PHI)

Does FOR_EACH_IMM_USE_ON_STMT really not work for PHIs?
Other code doesn't seem to care.

+   {
+ edge e;
+ if (stmt == phi)
+   continue;
+ e = find_edge (bb2, gimple_bb (stmt));
+ if (e == NULL)
+   continue;
+ use_p = PHI_ARG_DEF_PTR_FROM_EDGE (stmt, e);
+ SET_USE (use_p, lhs);
+ update_stmt (stmt);
+   }
+  else if (gimple_bb (stmt) == bb2)

That check looks odd.  A use can very well appear in a forwarder block.

+   {
+ FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+   SET_USE (use_p, lhs);
+ update_stmt (stmt);
+   }
+}

+/* Scans the vdefs and vuses of the insn of BB, and returns the vop at entry in
+   VOP_AT_ENTRY, and the vop at exit in VOP_AT_EXIT.  */
+
+static void
+insn_vops (basic_block bb, tree *vop_at_entry, tree *vop_at_exit)

it's easier to start from the bb end and walk until you see the
first vdef or vuse.  Then you have *vop_at_exit.  From there
just walk the SSA_NAME_DEF_STMTs of the vuse until you
hit one whose definition is not in BB - and you have *vop_at_entry.
That way you can avoid scanning most of the stmts.

The function also has an odd name ;)  It should be something like
vops_at_bb_entry_and_exit.

+static void
+vop_at_entry (basic_block bb1, basic_block bb2, tree *vop_at_entry1,
+ tree *vop_at_entry2)

so you don't need the vop at exit at all?  The function is a bit unclear
to me given it does so much stuff other than just computing the BBs
entry VOPs ...

+static void
+replace_block_by (basic_block bb1, basic_block bb2, bool update_vops)
+{

can you add some comments before the different phases of update?
I _think_ I understand what it does, but ...

+/* Runs tail merge optimization.  */
+
+unsigned int
+tail_merge_optimize (unsigned int todo)
+{
+  int nr_bbs_removed_total = 0;
+  int nr_bbs_removed;
+  bool loop_entered = false;
+  int iteration_nr = 0;
+  bool update_vops = ((todo & TODO_update_ssa_only_virtuals) == 0

Re: [PATCH, PR43864] Gimple level duplicate block cleanup.

2011-08-24 Thread Tom de Vries
On 07/22/2011 05:36 PM, Richard Guenther wrote:
> That said - I'm reasonably happy with the pass now, but it's rather large
> (this review took 40min again ...) so I appreciate a second look from
> somebody else.
> 

Ian,

Do you have a moment to give a second look to a gimple CFG optimization?  The
optimization removes duplicate basic blocks and reduces code size by 1-2%.

The latest patch is posted at
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01602.html.

Thanks,
- Tom


Re: [PATCH, PR43864] Gimple level duplicate block cleanup.

2011-08-24 Thread Ian Lance Taylor
Tom de Vries  writes:

> Do you have a moment to give a second look to a gimple CFG optimization?  The
> optimization removes duplicate basic blocks and reduces code size by 1-2%.
>
> The latest patch is posted at
> http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01602.html.


I'm not really the best person to look at this patch, since it applies
to areas of the compiler with which I am less familiar..  However, since
you ask, I did read through the patch, and it looks OK to me.  Since
Richi OK'ed it, this patch is OK with the following changes.


> +typedef struct same_succ *same_succ_t;
> +typedef const struct same_succ *const_same_succ_t;

Don't name new types ending with "_t".  POSIX reserves names ending with
"_t" when  is #included.  Name these something else.

> +typedef struct bb_cluster *bb_cluster_t;
> +typedef const struct bb_cluster *const_bb_cluster_t;

Same here.


> +@item -ftree-tail-merge
> +Merges identical blocks with same successors.  This flag is enabled by 
> default
> +at @option{-O2} and higher.  The run time of this pass can be limited using
> +@option{max-tail-merge-comparisons} parameter.

I think this text can be improved to be more meaningful to compiler
users.  I suggest something like:

  Look for identical code sequences.  When found, replace one with a
  jump to the other.  This optimization is known as tail merging or
  cross jumping.  This flag is enabled [now same as above]


Thanks.

Ian


Re: [PATCH, PR43864] Gimple level duplicate block cleanup.

2011-08-25 Thread Richard Guenther
On Wed, Aug 24, 2011 at 9:00 PM, Ian Lance Taylor  wrote:
> Tom de Vries  writes:
>
>> Do you have a moment to give a second look to a gimple CFG optimization?  The
>> optimization removes duplicate basic blocks and reduces code size by 1-2%.
>>
>> The latest patch is posted at
>> http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01602.html.
>
>
> I'm not really the best person to look at this patch, since it applies
> to areas of the compiler with which I am less familiar..  However, since
> you ask, I did read through the patch, and it looks OK to me.  Since
> Richi OK'ed it, this patch is OK with the following changes.
>
>
>> +typedef struct same_succ *same_succ_t;
>> +typedef const struct same_succ *const_same_succ_t;
>
> Don't name new types ending with "_t".  POSIX reserves names ending with
> "_t" when  is #included.  Name these something else.
>
>> +typedef struct bb_cluster *bb_cluster_t;
>> +typedef const struct bb_cluster *const_bb_cluster_t;
>
> Same here.
>
>
>> +@item -ftree-tail-merge
>> +Merges identical blocks with same successors.  This flag is enabled by 
>> default
>> +at @option{-O2} and higher.  The run time of this pass can be limited using
>> +@option{max-tail-merge-comparisons} parameter.
>
> I think this text can be improved to be more meaningful to compiler
> users.  I suggest something like:
>
>  Look for identical code sequences.  When found, replace one with a
>  jump to the other.  This optimization is known as tail merging or
>  cross jumping.  This flag is enabled [now same as above]

Can you also add a --param for the maximum number of iterations
you perform (16 sounds quite high for GCC bootstrap), I'd default it
to 2 which seems to catch 99% of all cases.

If you already committed the patch just do it as a followup please.

Thanks,
Richard.

>
> Thanks.
>
> Ian
>


Re: [PATCH, PR43864] Gimple level duplicate block cleanup - test cases.

2011-07-17 Thread Tom de Vries
Updated version.

On 06/08/2011 11:45 AM, Tom de Vries wrote:
> On 06/08/2011 11:42 AM, Tom de Vries wrote:
> 
>> I'll send the patch with the testcases in a separate email.
> 

OK for trunk?

Thanks,
- Tom

2011-07-17  Tom de Vries  

PR middle-end/43864
* gcc.dg/fold-compare-2.c (dg-options): Add -fno-tree-tail-merge.
* gcc/testsuite/gcc.dg/uninit-pred-2_c.c: Same.
* gcc.dg/pr43864.c: New test.
* gcc.dg/pr43864-2.c: Same.
Index: gcc/testsuite/gcc.dg/fold-compare-2.c
===
--- gcc/testsuite/gcc.dg/fold-compare-2.c	(revision 175801)
+++ gcc/testsuite/gcc.dg/fold-compare-2.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-vrp" } */
+/* { dg-options "-O2 -fno-tree-tail-merge -fdump-tree-vrp" } */
 
 extern void abort (void);
 
Index: gcc/testsuite/gcc.dg/uninit-pred-2_c.c
===
--- gcc/testsuite/gcc.dg/uninit-pred-2_c.c	(revision 175801)
+++ gcc/testsuite/gcc.dg/uninit-pred-2_c.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-Wuninitialized -O2" } */
+/* { dg-options "-Wuninitialized -O2 -fno-tree-tail-merge" } */
 
 int g;
 void bar (void);
Index: gcc/testsuite/gcc.dg/pr43864.c
===
--- gcc/testsuite/gcc.dg/pr43864.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864.c	(revision 0)
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+extern void foo (char*, int);
+extern void mysprintf (char *, char *);
+extern void myfree (void *);
+extern int access (char *, int);
+extern int fopen (char *, int);
+
+char *
+hprofStartupp (char *outputFileName, char *ctx)
+{
+  char fileName[1000];
+  int fp;
+  mysprintf (fileName, outputFileName);
+  if (access (fileName, 1) == 0)
+{
+  myfree (ctx);
+  return 0;
+}
+
+  fp = fopen (fileName, 0);
+  if (fp == 0)
+{
+  myfree (ctx);
+  return 0;
+}
+
+  foo (outputFileName, fp);
+
+  return ctx;
+}
+
+/* { dg-final { scan-tree-dump-times "myfree" 1 "optimized"} } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: gcc/testsuite/gcc.dg/pr43864-2.c
===
--- gcc/testsuite/gcc.dg/pr43864-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864-2.c	(revision 0)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int
+f (int c, int b, int d)
+{
+  int r, e;
+
+  if (c)
+r = b + d;
+  else
+{
+  e = b + d;
+  r = e;
+}
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "if " 0 "optimized"} } */
+/* { dg-final { scan-tree-dump-times "\\\+" 1 "optimized"} } */
+/* { dg-final { scan-tree-dump-times "PHI" 0 "optimized"} } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */


Re: [PATCH, PR43864] Gimple level duplicate block cleanup - test cases.

2011-08-19 Thread Tom de Vries
On 07/17/2011 08:33 PM, Tom de Vries wrote:
> Updated version.
> 
> On 06/08/2011 11:45 AM, Tom de Vries wrote:
>> On 06/08/2011 11:42 AM, Tom de Vries wrote:
>>
>>> I'll send the patch with the testcases in a separate email.
>>
> 

2 extra testcases added.

OK for trunk?

Thanks,
- Tom

2011-08-19  Tom de Vries  

PR middle-end/43864
* gcc.dg/fold-compare-2.c (dg-options): Add -fno-tree-tail-merge.
* gcc/testsuite/gcc.dg/uninit-pred-2_c.c: Same.
* gcc.dg/pr43864.c: New test.
* gcc.dg/pr43864-2.c: Same.
* gcc.dg/pr43864-3.c: Same.
* gcc.dg/pr43864-4.c: Same.
Index: gcc/testsuite/gcc.dg/pr43864-4.c
===
--- gcc/testsuite/gcc.dg/pr43864-4.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864-4.c	(revision 0)
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-pre" } */
+
+/* Different stmt order.  */
+
+int f(int c, int b, int d)
+{
+  int r, r2, e;
+
+  if (c)
+{
+  r = b + d;
+  r2 = d - b;
+}
+  else
+{
+  r2 = d - b;
+  e = d + b;
+  r = e;
+}
+
+  return r - r2;
+}
+
+/* { dg-final { scan-tree-dump-times "if " 0 "pre"} } */
+/* { dg-final { scan-tree-dump-times "_.*\\\+.*_" 1 "pre"} } */
+/* { dg-final { scan-tree-dump-times " - " 2 "pre"} } */
+/* { dg-final { cleanup-tree-dump "pre" } } */
Index: gcc/testsuite/gcc.dg/fold-compare-2.c
===
--- gcc/testsuite/gcc.dg/fold-compare-2.c	(revision 176554)
+++ gcc/testsuite/gcc.dg/fold-compare-2.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-vrp" } */
+/* { dg-options "-O2 -fno-tree-tail-merge -fdump-tree-vrp" } */
 
 extern void abort (void);
 
Index: gcc/testsuite/gcc.dg/uninit-pred-2_c.c
===
--- gcc/testsuite/gcc.dg/uninit-pred-2_c.c	(revision 176554)
+++ gcc/testsuite/gcc.dg/uninit-pred-2_c.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-Wuninitialized -O2" } */
+/* { dg-options "-Wuninitialized -O2 -fno-tree-tail-merge" } */
 
 int g;
 void bar (void);
Index: gcc/testsuite/gcc.dg/pr43864.c
===
--- gcc/testsuite/gcc.dg/pr43864.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864.c	(revision 0)
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-pre" } */
+
+extern void foo (char*, int);
+extern void mysprintf (char *, char *);
+extern void myfree (void *);
+extern int access (char *, int);
+extern int fopen (char *, int);
+
+char *
+hprofStartupp (char *outputFileName, char *ctx)
+{
+  char fileName[1000];
+  int fp;
+  mysprintf (fileName, outputFileName);
+  if (access (fileName, 1) == 0)
+{
+  myfree (ctx);
+  return 0;
+}
+
+  fp = fopen (fileName, 0);
+  if (fp == 0)
+{
+  myfree (ctx);
+  return 0;
+}
+
+  foo (outputFileName, fp);
+
+  return ctx;
+}
+
+/* { dg-final { scan-tree-dump-times "myfree \\(" 1 "pre"} } */
+/* { dg-final { cleanup-tree-dump "pre" } } */
Index: gcc/testsuite/gcc.dg/pr43864-2.c
===
--- gcc/testsuite/gcc.dg/pr43864-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864-2.c	(revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-pre" } */
+
+int
+f (int c, int b, int d)
+{
+  int r, e;
+
+  if (c)
+r = b + d;
+  else
+{
+  e = b + d;
+  r = e;
+}
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "if " 0 "pre"} } */
+/* { dg-final { scan-tree-dump-times "_.*\\\+.*_" 1 "pre"} } */
+/* { dg-final { cleanup-tree-dump "pre" } } */
Index: gcc/testsuite/gcc.dg/pr43864-3.c
===
--- gcc/testsuite/gcc.dg/pr43864-3.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864-3.c	(revision 0)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-pre" } */
+
+/* Commutative case.  */
+
+int f(int c, int b, int d)
+{
+  int r, e;
+
+  if (c)
+r = b + d;
+  else
+{
+  e = d + b;
+  r = e;
+}
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "if " 0 "pre"} } */
+/* { dg-final { scan-tree-dump-times "_.*\\\+.*_" 1 "pre"} } */
+/* { dg-final { cleanup-tree-dump "pre" } } */


Re: [PATCH, PR43864] Gimple level duplicate block cleanup - test cases.

2011-08-25 Thread Richard Guenther
On Fri, Aug 19, 2011 at 6:28 PM, Tom de Vries  wrote:
> On 07/17/2011 08:33 PM, Tom de Vries wrote:
>> Updated version.
>>
>> On 06/08/2011 11:45 AM, Tom de Vries wrote:
>>> On 06/08/2011 11:42 AM, Tom de Vries wrote:
>>>
 I'll send the patch with the testcases in a separate email.
>>>
>>
>
> 2 extra testcases added.
>
> OK for trunk?

Ok.

THanks,
Richard.

> Thanks,
> - Tom
>
> 2011-08-19  Tom de Vries  
>
>        PR middle-end/43864
>        * gcc.dg/fold-compare-2.c (dg-options): Add -fno-tree-tail-merge.
>        * gcc/testsuite/gcc.dg/uninit-pred-2_c.c: Same.
>        * gcc.dg/pr43864.c: New test.
>        * gcc.dg/pr43864-2.c: Same.
>        * gcc.dg/pr43864-3.c: Same.
>        * gcc.dg/pr43864-4.c: Same.
>