Re: [PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
On 03/18/2015 12:19 AM, Andrew Pinski wrote:
On Tue, Mar 17, 2015 at 11:27 AM, Jeff Law l...@redhat.com wrote:
On 03/17/2015 04:35 AM, Richard Biener wrote:

I'll test both. In the common case, the cost is going to be the basic bookkeeping so that we can compute the transparent property. The actual computation of transparency and everything else is guarded on having something in the hash tables -- and the overwhelming majority of the time there's nothing in the hash tables. Regardless, I'll pin down boxes and do some testing. I'm slightly leaning towards trying it even in stage4, but if e.g. richi disagrees, we could defer it to stage1 too. I'd be OK either way. I just want us to make a decision one way or the

If it fixes a regression then OK for trunk, otherwise please wait for stage 1 to open.

It fixes 64317 which is a P2 regression.

I had a pass which I inherited from my predecessor which was basically doing the same as your patch: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/postreload-load.c;h=2652e51f8eca310d51db3a30e5f6c8847be436ce;hb=refs/heads/apinski/thunderx-cost But I like your patch much better than this one. I saw many loads being removed for AARCH64 also at -Ofast -funroll-loops on SPEC 2006 with this pass. But it seemed to be due to subregs, which I had mentioned at https://gcc.gnu.org/ml/gcc/2015-01/msg00125.html . When I get a chance I can test your patch out on AARCH64 and disable my pass to see if my pass catches more than your patch.

ISTM this pass from your predecessor is tackling a different issue. Namely it arranges to copy the result of the load (or source of a store) from a register which gets clobbered into a non-clobbered hard register so that it can be re-used later. My patch detected cases where the load/store register is not clobbered across intermediate blocks and arranges to re-use it. Unless I'm missing something, they ought to be able to work together. I'd recommend officially submitting that code.
Bonus points for integrating it into the existing pass rather than running it as a separate pass. jeff
Re: [PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
On 03/16/2015 01:27 PM, Jakub Jelinek wrote:
On Wed, Mar 11, 2015 at 03:30:36PM -0600, Jeff Law wrote:

+#ifndef GCC_GCSE__COMMONH
+#define GCC_GCSE__COMMONH

GCC_GCSE_COMMON_H instead?

@@ -1308,8 +1396,19 @@ gcse_after_reload_main (rtx f ATTRIBUTE_UNUSED)
   if (expr_table->elements () > 0)
     {
+      /* Knowing which MEMs are transparent through a block can signifiantly
+         increase the number of reundant loads found.  So compute transparency
+         information for for each memory expression in the hash table.  */

s/for for/for/ ?

+      df_analyze ();
+      /* This can not be part of the normal allocation routine because
+         we have to know the number of elements in the hash table.  */
+      transp = sbitmap_vector_alloc (last_basic_block_for_fn (cfun),
+                                     expr_table->elements ());
+      bitmap_vector_ones (transp, last_basic_block_for_fn (cfun));
+      expr_table->traverse <FILE *, compute_expr_transp> (dump_file);
       eliminate_partially_redundant_loads ();
       delete_redundant_insns ();
+      sbitmap_vector_free (transp);
       if (dump_file)
         {

What effect does the patch have on compile time on say x86_64 or ppc64? I'm slightly leaning towards trying it even in stage4, but if e.g. richi disagrees, we could defer it to stage1 too.

I made the requested fixes as well as fixed another comment typo and fixed one real bug that showed up during some additional testing. Basically the bitmap indices need to start counting from 0, not 1. Starting at 1 can result in an out-of-range write for the last element. That's usually not a problem as there's space in the word, but I happened to have a case where we had 128 entries in the table and thus the out-of-bounds write hit a new word in memory, clobbering heap metadata. I'm glad this was caught prior to installation.

I've re-bootstrapped and regression tested on x86_64-linux-gnu and powerpc64-linux-gnu. Installed on the trunk. Actual patch committed is attached for archival purposes.
Jeff

commit f82a93981ea0f2ffbda93511a51c71910e10e4dd
Author: law law@138bc75d-0d04-0410-961f-82ee72b054a4
Date: Mon Mar 23 05:21:04 2015 +

    PR rtl-optimization/64317
    * Makefile.in (OBJS): Add gcse-common.c
    * gcse.c: Include gcse-common.h
    (struct modify_pair_s): Move structure definition to gcse-common.h
    (compute_transp): Move function to gcse-common.c.
    (canon_list_insert): Similarly.
    (record_last_mem_set_info): Break out some code and put it into
    gcse-common.c.  Call into the new common code.
    (compute_local_properties): Pass additional arguments to compute_transp.
    * postreload-gcse.c: Include gcse-common.h and df.h
    (modify_mem_list_set, blocks_with_calls): New variables.
    (modify_mem_list, canon_modify_mem_list, transp): Likewise.
    (get_bb_avail_insn): Pass in the expression index too.
    (alloc_mem): Allocate memory for the new bitmaps and lists.
    (free_mem): Free memory for the new bitmaps and lists.
    (insert_expr_in_table): Record a bitmap index for each entry we
    add to the table.
    (record_last_mem_set_info): Call into common code in gcse-common.c.
    (get_bb_avail_insn): If no available insn was found in the requested
    BB.  If BB has a single predecessor, see if the expression is
    transparent in BB and available in that single predecessor.
    (compute_expr_transp): New wrapper for compute_transp.
    (eliminate_partially_redundant_load): Pass expression's bitmap_index
    to get_bb_avail_insn.  Compute next_pred_bb_end a bit later.
    (gcse_after_reload_main): If there are elements in the hash table,
    then compute transparency for all the elements in the hash table.
    * gcse-common.h: New file.
    * gcse-common.c: New file.
git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@221585 138bc75d-0d04-0410-961f-82ee72b054a4

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index a42d22a..b615c1f 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,34 @@
+2015-03-22  Jeff Law  l...@redhat.com
+
+    PR rtl-optimization/64317
+    * Makefile.in (OBJS): Add gcse-common.c
+    * gcse.c: Include gcse-common.h
+    (struct modify_pair_s): Move structure definition to gcse-common.h
+    (compute_transp): Move function to gcse-common.c.
+    (canon_list_insert): Similarly.
+    (record_last_mem_set_info): Break out some code and put it into
+    gcse-common.c.  Call into the new common code.
+    (compute_local_properties): Pass additional arguments to compute_transp.
+    * postreload-gcse.c: Include gcse-common.h and df.h
+    (modify_mem_list_set, blocks_with_calls): New variables.
+    (modify_mem_list, canon_modify_mem_list, transp): Likewise.
+
Re: [PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
On 03/16/2015 01:27 PM, Jakub Jelinek wrote:
On Wed, Mar 11, 2015 at 03:30:36PM -0600, Jeff Law wrote:

+#ifndef GCC_GCSE__COMMONH
+#define GCC_GCSE__COMMONH

GCC_GCSE_COMMON_H instead?

@@ -1308,8 +1396,19 @@ gcse_after_reload_main (rtx f ATTRIBUTE_UNUSED)
   if (expr_table->elements () > 0)
     {
+      /* Knowing which MEMs are transparent through a block can signifiantly
+         increase the number of reundant loads found.  So compute transparency
+         information for for each memory expression in the hash table.  */

s/for for/for/ ?

+      df_analyze ();
+      /* This can not be part of the normal allocation routine because
+         we have to know the number of elements in the hash table.  */
+      transp = sbitmap_vector_alloc (last_basic_block_for_fn (cfun),
+                                     expr_table->elements ());
+      bitmap_vector_ones (transp, last_basic_block_for_fn (cfun));
+      expr_table->traverse <FILE *, compute_expr_transp> (dump_file);
       eliminate_partially_redundant_loads ();
       delete_redundant_insns ();
+      sbitmap_vector_free (transp);
       if (dump_file)
         {

What effect does the patch have on compile time on say x86_64 or ppc64? I'm slightly leaning towards trying it even in stage4, but if e.g. richi disagrees, we could defer it to stage1 too.

Oh yea, forgot to mention, for PPC the compile-time testing results were essentially the same -- there's significantly more variation in the timings, but the before/after comparisons were in the noise.

Jeff
Re: [PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
On 03/18/2015 12:19 AM, Andrew Pinski wrote:
On Tue, Mar 17, 2015 at 11:27 AM, Jeff Law l...@redhat.com wrote:
On 03/17/2015 04:35 AM, Richard Biener wrote:

I'll test both. In the common case, the cost is going to be the basic bookkeeping so that we can compute the transparent property. The actual computation of transparency and everything else is guarded on having something in the hash tables -- and the overwhelming majority of the time there's nothing in the hash tables. Regardless, I'll pin down boxes and do some testing. I'm slightly leaning towards trying it even in stage4, but if e.g. richi disagrees, we could defer it to stage1 too. I'd be OK either way. I just want us to make a decision one way or the

If it fixes a regression then OK for trunk, otherwise please wait for stage 1 to open.

It fixes 64317 which is a P2 regression.

I had a pass which I inherited from my predecessor which was basically doing the same as your patch: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/postreload-load.c;h=2652e51f8eca310d51db3a30e5f6c8847be436ce;hb=refs/heads/apinski/thunderx-cost But I like your patch much better than this one. I saw many loads being removed for AARCH64 also at -Ofast -funroll-loops on SPEC 2006 with this pass. But it seemed to be due to subregs, which I had mentioned at https://gcc.gnu.org/ml/gcc/2015-01/msg00125.html . When I get a chance I can test your patch out on AARCH64 and disable my pass to see if my pass catches more than your patch.

Well, they're doing different things, so they ought to be complementary to some degree.

Jeff
Re: [PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
On 03/16/2015 01:27 PM, Jakub Jelinek wrote:

What effect does the patch have on compile time on say x86_64 or ppc64?

I bootstrapped x86_64 trunk then timed compiling 640 .i/.ii files with -O2. That was repeated 10 times (to get a sense of variability). Each run took a little over 40 minutes with variability of less than a second (slowest run compared to fastest run).

I then also gathered data for a smaller set of runs for -O2 -funroll-loops (given the consistency between runs, 10 iterations seemed like major overkill). Again variability was less than a second.

I then applied the patch and repeated the -O2 and -O2 -funroll-loops test. The patched compiler was within a second of the unpatched compiler for both the -O2 and -O2 -funroll-loops tests. Given the difference was within the noise level of those tests, my conclusion is the new code makes no measurable difference in compile time performance for x86_64.

Similar tests are in progress on powerpc64-linux-gnu.

Jeff
Re: [PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
On Sat, Mar 21, 2015 at 01:47:10AM -0600, Jeff Law wrote:
On 03/16/2015 01:27 PM, Jakub Jelinek wrote:

What effect does the patch have on compile time on say x86_64 or ppc64?

I bootstrapped x86_64 trunk then timed compiling 640 .i/.ii files with -O2. That was repeated 10 times (to get a sense of variability). Each run took a little over 40 minutes with variability of less than a second (slowest run compared to fastest run). I then also gathered data for a smaller set of runs for -O2 -funroll-loops (given the consistency between runs, 10 iterations seemed like major overkill). Again variability was less than a second. I then applied the patch and repeated the -O2 and -O2 -funroll-loops test. The patched compiler was within a second of the unpatched compiler for both the -O2 and -O2 -funroll-loops tests. Given the difference was within the noise level of those tests, my conclusion is the new code makes no measurable difference in compile time performance for x86_64. Similar tests are in progress on powerpc64-linux-gnu.

Thanks for the testing. I think we want the patch in now.

Jakub
Re: [PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
On Tue, Mar 17, 2015 at 11:27 AM, Jeff Law l...@redhat.com wrote:
On 03/17/2015 04:35 AM, Richard Biener wrote:

I'll test both. In the common case, the cost is going to be the basic bookkeeping so that we can compute the transparent property. The actual computation of transparency and everything else is guarded on having something in the hash tables -- and the overwhelming majority of the time there's nothing in the hash tables. Regardless, I'll pin down boxes and do some testing. I'm slightly leaning towards trying it even in stage4, but if e.g. richi disagrees, we could defer it to stage1 too. I'd be OK either way. I just want us to make a decision one way or the

If it fixes a regression then OK for trunk, otherwise please wait for stage 1 to open.

It fixes 64317 which is a P2 regression.

I had a pass which I inherited from my predecessor which was basically doing the same as your patch: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/postreload-load.c;h=2652e51f8eca310d51db3a30e5f6c8847be436ce;hb=refs/heads/apinski/thunderx-cost But I like your patch much better than this one. I saw many loads being removed for AARCH64 also at -Ofast -funroll-loops on SPEC 2006 with this pass. But it seemed to be due to subregs, which I had mentioned at https://gcc.gnu.org/ml/gcc/2015-01/msg00125.html . When I get a chance I can test your patch out on AARCH64 and disable my pass to see if my pass catches more than your patch.

Thanks, Andrew

jeff
Re: [PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
On March 16, 2015 8:45:18 PM GMT+01:00, Jeff Law l...@redhat.com wrote:
On 03/16/15 13:27, Jakub Jelinek wrote:
On Wed, Mar 11, 2015 at 03:30:36PM -0600, Jeff Law wrote:

+#ifndef GCC_GCSE__COMMONH
+#define GCC_GCSE__COMMONH

GCC_GCSE_COMMON_H instead?

:-) Will fix.

@@ -1308,8 +1396,19 @@ gcse_after_reload_main (rtx f ATTRIBUTE_UNUSED)
   if (expr_table->elements () > 0)
     {
+      /* Knowing which MEMs are transparent through a block can signifiantly
+         increase the number of reundant loads found.  So compute transparency
+         information for for each memory expression in the hash table.  */

s/for for/for/ ?

Similarly.

+      df_analyze ();
+      /* This can not be part of the normal allocation routine because
+         we have to know the number of elements in the hash table.  */
+      transp = sbitmap_vector_alloc (last_basic_block_for_fn (cfun),
+                                     expr_table->elements ());
+      bitmap_vector_ones (transp, last_basic_block_for_fn (cfun));
+      expr_table->traverse <FILE *, compute_expr_transp> (dump_file);
       eliminate_partially_redundant_loads ();
       delete_redundant_insns ();
+      sbitmap_vector_free (transp);
       if (dump_file)
         {

What effect does the patch have on compile time on say x86_64 or ppc64?

I'll test both. In the common case, the cost is going to be the basic bookkeeping so that we can compute the transparent property. The actual computation of transparency and everything else is guarded on having something in the hash tables -- and the overwhelming majority of the time there's nothing in the hash tables. Regardless, I'll pin down boxes and do some testing.

I'm slightly leaning towards trying it even in stage4, but if e.g. richi disagrees, we could defer it to stage1 too.

I'd be OK either way. I just want us to make a decision one way or the

If it fixes a regression then OK for trunk, otherwise please wait for stage 1 to open.

Richard.

other :-)

Jeff
Re: [PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
On 03/17/2015 04:35 AM, Richard Biener wrote:

I'll test both. In the common case, the cost is going to be the basic bookkeeping so that we can compute the transparent property. The actual computation of transparency and everything else is guarded on having something in the hash tables -- and the overwhelming majority of the time there's nothing in the hash tables. Regardless, I'll pin down boxes and do some testing. I'm slightly leaning towards trying it even in stage4, but if e.g. richi disagrees, we could defer it to stage1 too. I'd be OK either way. I just want us to make a decision one way or the

If it fixes a regression then OK for trunk, otherwise please wait for stage 1 to open.

It fixes 64317 which is a P2 regression.

jeff
Re: [PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
On Wed, Mar 11, 2015 at 03:30:36PM -0600, Jeff Law wrote:

+#ifndef GCC_GCSE__COMMONH
+#define GCC_GCSE__COMMONH

GCC_GCSE_COMMON_H instead?

@@ -1308,8 +1396,19 @@ gcse_after_reload_main (rtx f ATTRIBUTE_UNUSED)
   if (expr_table->elements () > 0)
     {
+      /* Knowing which MEMs are transparent through a block can signifiantly
+         increase the number of reundant loads found.  So compute transparency
+         information for for each memory expression in the hash table.  */

s/for for/for/ ?

+      df_analyze ();
+      /* This can not be part of the normal allocation routine because
+         we have to know the number of elements in the hash table.  */
+      transp = sbitmap_vector_alloc (last_basic_block_for_fn (cfun),
+                                     expr_table->elements ());
+      bitmap_vector_ones (transp, last_basic_block_for_fn (cfun));
+      expr_table->traverse <FILE *, compute_expr_transp> (dump_file);
       eliminate_partially_redundant_loads ();
       delete_redundant_insns ();
+      sbitmap_vector_free (transp);
       if (dump_file)
         {

What effect does the patch have on compile time on say x86_64 or ppc64?

I'm slightly leaning towards trying it even in stage4, but if e.g. richi disagrees, we could defer it to stage1 too.

Jakub
Re: [PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
On 03/16/15 13:27, Jakub Jelinek wrote:
On Wed, Mar 11, 2015 at 03:30:36PM -0600, Jeff Law wrote:

+#ifndef GCC_GCSE__COMMONH
+#define GCC_GCSE__COMMONH

GCC_GCSE_COMMON_H instead?

:-) Will fix.

@@ -1308,8 +1396,19 @@ gcse_after_reload_main (rtx f ATTRIBUTE_UNUSED)
   if (expr_table->elements () > 0)
     {
+      /* Knowing which MEMs are transparent through a block can signifiantly
+         increase the number of reundant loads found.  So compute transparency
+         information for for each memory expression in the hash table.  */

s/for for/for/ ?

Similarly.

+      df_analyze ();
+      /* This can not be part of the normal allocation routine because
+         we have to know the number of elements in the hash table.  */
+      transp = sbitmap_vector_alloc (last_basic_block_for_fn (cfun),
+                                     expr_table->elements ());
+      bitmap_vector_ones (transp, last_basic_block_for_fn (cfun));
+      expr_table->traverse <FILE *, compute_expr_transp> (dump_file);
       eliminate_partially_redundant_loads ();
       delete_redundant_insns ();
+      sbitmap_vector_free (transp);
       if (dump_file)
         {

What effect does the patch have on compile time on say x86_64 or ppc64?

I'll test both. In the common case, the cost is going to be the basic bookkeeping so that we can compute the transparent property. The actual computation of transparency and everything else is guarded on having something in the hash tables -- and the overwhelming majority of the time there's nothing in the hash tables. Regardless, I'll pin down boxes and do some testing.

I'm slightly leaning towards trying it even in stage4, but if e.g. richi disagrees, we could defer it to stage1 too.

I'd be OK either way. I just want us to make a decision one way or the other :-)

Jeff
[PATCH][RFA] [PR rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads
64317 is a P2 regression related to making the PIC register a pseudo register on the x86 port. Basically post-LRA we still have some redundant loads of the PIC register from memory (it's spilled to memory as other objects are more profitable to keep in registers). But of the set of ~10 actual loads, we really only need ~5.

The basic problem is that LRA and the basic postreload optimizers effectively forget everything they know when they encounter a block that starts with a label -- even if that block is only reachable from one predecessor. ie, these routines work on a subset of EBBs, but not maximal EBBs.

The first thought was to have LRA clean this up since this is largely an issue with cross-block inheritance of values. Unfortunately Vlad has indicated that handling maximal EBBs is nontrivial in LRA.

I also investigated extending the simple EBB oriented optimizer in postreload.c to handle this case. But it turns out that cselib really isn't suited to a scoped hash table, which would be the natural way to handle this. So I killed that idea.

That leaves postreload-gcse, and if you look at the incoming RTL you'd think that postreload-gcse would handle this just fine. Unfortunately, postreload-gcse is badly mis-named. Specifically it does no dataflow analysis of available expressions!!

There may be some benefit in full dataflow analysis in postreload-gcse.c. But we can get most of the benefits with a relatively simple change. Specifically if we're unable to find the expression we want in the given block, see if the block has a single predecessor. If the expression is available in that predecessor and transparent through the current block, then the expression is still available at the end of the current block.

That was enough to fix the PR. But after instrumentation I was discouraged by the fact that postreload-gcse does virtually nothing on x86. It's unable to find a single available load to put in its hash tables on an x86, x86_64 or ppc64 bootstrap.
Though many are found during testsuite runs. Digging further into the history of postreload-gcse indicated the postreload-gcse.c work was primarily tested on ppc using spec. My first attempts compiling spec with the instrumented compiler didn't produce anything useful either. On a whim, I tried again with -O3 -funroll-loops, then suddenly postreload-gcse.c got reasonably active.

Across a subset of spec2k, compiling with -O3 -funroll-loops, this patch enables about a 2X improvement in loads eliminated (ppc64). On a ppc64 -O3 -funroll-loops bootstrap of GCC I see 4142 loads eliminated with this patch and 2739 loads eliminated without. A notable improvement. Even with -O2 -funroll-loops, there's nothing interesting happening on x86_64. Probably because it's not a load-store architecture.

Anyway, the question in my mind is whether or not we want to put this in during stage4. The patch itself is medium sized, but mostly mechanical in nature. It's been bootstrapped on x86. It's been bootstrapped and regression tested on x86_64. It's been bootstrapped and regression tested with -O3 -funroll-loops on ppc64 (configured with --disable-werror).

OK for the trunk or wait for stage1? I'm happy to go with the release manager's opinions on this.

Jeff

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 28979d5..91863be 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,34 @@
+2015-03-11  Jeff Law  l...@redhat.com
+
+    PR rtl-optimization/64317
+    * Makefile.in (OBJS): Add gcse-common.c
+    * gcse.c: Include gcse-common.h
+    (struct modify_pair_s): Move structure definition to gcse-common.h
+    (compute_transp): Move function to gcse-common.c.
+    (canon_list_insert): Similarly.
+    (record_last_mem_set_info): Break out some code and put it into
+    gcse-common.c.  Call into the new common code.
+    (compute_local_properties): Pass additional arguments to compute_transp.
+    * postreload-gcse.c: Include gcse-common.h and df.h
+    (modify_mem_list_set, blocks_with_calls): New variables.
+    (modify_mem_list, canon_modify_mem_list, transp): Likewise.
+    (get_bb_avail_insn): Pass in the expression index too.
+    (alloc_mem): Allocate memory for the new bitmaps and lists.
+    (free_mem): Free memory for the new bitmaps and lists.
+    (insert_expr_in_table): Record a bitmap index for each entry we
+    add to the table.
+    (record_last_mem_set_info): Call into common code in gcse-common.c.
+    (get_bb_avail_insn): If no available insn was found in the requested
+    BB.  If BB has a single predecessor, see if the expression is
+    transparent in BB and available in that single predecessor.
+    (compute_expr_transp): New wrapper for compute_transp.
+    (eliminate_partially_redundant_load): Pass expression's bitmap_index
+    to get_bb_avail_insn.  Compute next_pred_bb_end a bit later.
+