Re: [PATCH PR65447]Improve IV handling by grouping address type uses with same base and step
Hi Bin, On 08/05/15 11:47, Bin Cheng wrote: Hi, GCC's IVO currently handles every IV use independently, which is not right by learning from cases reported in PR65447. The rationale is: 1) Lots of address type IVs refer to the same memory object, share similar base and have same step. We should handle these IVs as a group in order to maximize CSE opportunities, prefer reg+offset addressing mode. 2) GCC's IVO algorithm is expensive and only is run when candidate set is small enough. By grouping same family uses, we can decrease the number of both uses and candidates. Before this patch, number of candidates for PR65447 is too big to run expensive IVO algorithm, resulting in bad assembly code on targets like AArch64 and Mips. 3) Even for cases the assembly code isn't improved, we can still get compilation time benefit with this patch. 4) This is a prerequisite for enabling auto-increment support in IVO on AArch64. For now, this is only done to address type IVs, in the future I may extend it to general IVs too. For AArch64: Benchmarks 470.lbm/spec2k6 and 173.applu/spec2k are improved obviously by this patch. A couple of cases from spec2k/fp appear regressed. I looked into generated assembly code and can confirm the regression is false alarm except one case (189.lucas). For that case, I think it's another issue exposed by this patch (GCC failed to CSE candidate setup code, resulting in bloated loop header). Anyway, I also fined tuned the patch to minimize the impact. For AArch32, this patch seems to be able to improve spec2kfp too, but I didn't look deep into it. I guess the reason is it can make life for auto-increment support in IVO better. One of defects of this patch is computation of max offset in compute_max_addr_offset is basically borrowed from get_address_cost. The comment says we should find a better way to compute all information. People also complained we need to refactor that part of code. I don't have good solution to that yet, though I did try best to keep compute_max_addr_offset simple. I believe this is a generally wanted change, bootstrap and test on x86_64 and AArch64, so is it ok? 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * tree-ssa-loop-ivopts.c (struct iv_use): New fields. (dump_use, dump_uses): Support to dump sub use. (record_use): New parameters to support sub use. Remove call to dump_use. (record_sub_use, record_group_use): New functions. (compute_max_addr_offset, split_all_small_groups): New functions. (group_address_uses, rewrite_use_address): New functions. (strip_offset): New declaration. (find_interesting_uses_address): Call record_group_use. (add_candidate): New assertion. (infinite_cost_p): Move definition forward. (add_costs): Check INFTY cost and return immediately. (get_computation_cost_at): Clear setup cost and dependent bitmap for sub uses. (determine_use_iv_cost_address): Compute cost for sub uses. (rewrite_use_address_1): Rename from old rewrite_use_address. (free_loop_data): Free sub uses. (tree_ssa_iv_optimize_loop): Call group_address_uses. gcc/testsuite/ChangeLog 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * gcc.dg/tree-ssa/pr65447.c: New test. I see this test failing on arm-none-eabi with a compiler at r223737. My configure options are: --enable-checking=yes --with-newlib --with-fpu=neon-fp-armv8 --with-arch=armv8-a --without-isl Kyrill
Re: [PATCH PR65447]Improve IV handling by grouping address type uses with same base and step
On Wed, May 27, 2015 at 11:44 PM, Kyrill Tkachov kyrylo.tkac...@arm.com wrote: Hi Bin, On 08/05/15 11:47, Bin Cheng wrote: Hi, GCC's IVO currently handles every IV use independently, which is not right by learning from cases reported in PR65447. The rationale is: 1) Lots of address type IVs refer to the same memory object, share similar base and have same step. We should handle these IVs as a group in order to maximize CSE opportunities, prefer reg+offset addressing mode. 2) GCC's IVO algorithm is expensive and only is run when candidate set is small enough. By grouping same family uses, we can decrease the number of both uses and candidates. Before this patch, number of candidates for PR65447 is too big to run expensive IVO algorithm, resulting in bad assembly code on targets like AArch64 and Mips. 3) Even for cases the assembly code isn't improved, we can still get compilation time benefit with this patch. 4) This is a prerequisite for enabling auto-increment support in IVO on AArch64. For now, this is only done to address type IVs, in the future I may extend it to general IVs too. For AArch64: Benchmarks 470.lbm/spec2k6 and 173.applu/spec2k are improved obviously by this patch. A couple of cases from spec2k/fp appear regressed. I looked into generated assembly code and can confirm the regression is false alarm except one case (189.lucas). For that case, I think it's another issue exposed by this patch (GCC failed to CSE candidate setup code, resulting in bloated loop header). Anyway, I also fined tuned the patch to minimize the impact. For AArch32, this patch seems to be able to improve spec2kfp too, but I didn't look deep into it. I guess the reason is it can make life for auto-increment support in IVO better. One of defects of this patch is computation of max offset in compute_max_addr_offset is basically borrowed from get_address_cost. The comment says we should find a better way to compute all information. People also complained we need to refactor that part of code. I don't have good solution to that yet, though I did try best to keep compute_max_addr_offset simple. I believe this is a generally wanted change, bootstrap and test on x86_64 and AArch64, so is it ok? 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * tree-ssa-loop-ivopts.c (struct iv_use): New fields. (dump_use, dump_uses): Support to dump sub use. (record_use): New parameters to support sub use. Remove call to dump_use. (record_sub_use, record_group_use): New functions. (compute_max_addr_offset, split_all_small_groups): New functions. (group_address_uses, rewrite_use_address): New functions. (strip_offset): New declaration. (find_interesting_uses_address): Call record_group_use. (add_candidate): New assertion. (infinite_cost_p): Move definition forward. (add_costs): Check INFTY cost and return immediately. (get_computation_cost_at): Clear setup cost and dependent bitmap for sub uses. (determine_use_iv_cost_address): Compute cost for sub uses. (rewrite_use_address_1): Rename from old rewrite_use_address. (free_loop_data): Free sub uses. (tree_ssa_iv_optimize_loop): Call group_address_uses. gcc/testsuite/ChangeLog 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * gcc.dg/tree-ssa/pr65447.c: New test. I see this test failing on arm-none-eabi with a compiler at r223737. My configure options are: --enable-checking=yes --with-newlib --with-fpu=neon-fp-armv8 --with-arch=armv8-a --without-isl Hi Kyrill, Thank you for reporting this. I will have a look. Thanks, bin Kyrill
Re: [PATCH PR65447]Improve IV handling by grouping address type uses with same base and step
On Thu, May 14, 2015 at 1:11 PM, Bin.Cheng amker.ch...@gmail.com wrote: On Wed, May 13, 2015 at 7:38 PM, Richard Biener richard.guent...@gmail.com wrote: On Fri, May 8, 2015 at 12:47 PM, Bin Cheng bin.ch...@arm.com wrote: Hi, GCC's IVO currently handles every IV use independently, which is not right by learning from cases reported in PR65447. The rationale is: 1) Lots of address type IVs refer to the same memory object, share similar base and have same step. We should handle these IVs as a group in order to maximize CSE opportunities, prefer reg+offset addressing mode. 2) GCC's IVO algorithm is expensive and only is run when candidate set is small enough. By grouping same family uses, we can decrease the number of both uses and candidates. Before this patch, number of candidates for PR65447 is too big to run expensive IVO algorithm, resulting in bad assembly code on targets like AArch64 and Mips. 3) Even for cases the assembly code isn't improved, we can still get compilation time benefit with this patch. 4) This is a prerequisite for enabling auto-increment support in IVO on AArch64. For now, this is only done to address type IVs, in the future I may extend it to general IVs too. For AArch64: Benchmarks 470.lbm/spec2k6 and 173.applu/spec2k are improved obviously by this patch. A couple of cases from spec2k/fp appear regressed. I looked into generated assembly code and can confirm the regression is false alarm except one case (189.lucas). For that case, I think it's another issue exposed by this patch (GCC failed to CSE candidate setup code, resulting in bloated loop header). Anyway, I also fined tuned the patch to minimize the impact. For AArch32, this patch seems to be able to improve spec2kfp too, but I didn't look deep into it. I guess the reason is it can make life for auto-increment support in IVO better. One of defects of this patch is computation of max offset in compute_max_addr_offset is basically borrowed from get_address_cost. The comment says we should find a better way to compute all information. People also complained we need to refactor that part of code. I don't have good solution to that yet, though I did try best to keep compute_max_addr_offset simple. I believe this is a generally wanted change, bootstrap and test on x86_64 and AArch64, so is it ok? I'm a little bit worried about the linked list of sub-uses and the sorting (that's quadratic). A little. I don't have any good idea but to use a tree... We don't seem to limit the number of sub-uses (if we'd do that it would become O(1)). Similar is searching in the list of uses for a group with same base/step (but ISTR IVOPTs has multiple similar loops?) Hi Richard, Thanks for reviewing. Instead of tree, I can also keep the linked list, then quick sort it by using a vector as temporary storage. This can avoid the complexity of BST operations without non-trivial overload. Sounds good if constructing querying are not done at the same time. For the searching routine, a local hash table could help. Overall the patch looks like a good improvement to how we do IVO, so I think it is ok as-is. Do you want me to do this change now, or I can pick it up later when dealing with compilation time issue? Also it would be nice if any compilation time issue will be reported after applying this version patch. Because we will have a IVO compilation time benchmark then. You can pick it up later. Richard. Thanks, bin Thanks, Richard. 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * tree-ssa-loop-ivopts.c (struct iv_use): New fields. (dump_use, dump_uses): Support to dump sub use. (record_use): New parameters to support sub use. Remove call to dump_use. (record_sub_use, record_group_use): New functions. (compute_max_addr_offset, split_all_small_groups): New functions. (group_address_uses, rewrite_use_address): New functions. (strip_offset): New declaration. (find_interesting_uses_address): Call record_group_use. (add_candidate): New assertion. (infinite_cost_p): Move definition forward. (add_costs): Check INFTY cost and return immediately. (get_computation_cost_at): Clear setup cost and dependent bitmap for sub uses. (determine_use_iv_cost_address): Compute cost for sub uses. (rewrite_use_address_1): Rename from old rewrite_use_address. (free_loop_data): Free sub uses. (tree_ssa_iv_optimize_loop): Call group_address_uses. gcc/testsuite/ChangeLog 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * gcc.dg/tree-ssa/pr65447.c: New test.
Re: [PATCH PR65447]Improve IV handling by grouping address type uses with same base and step
On Wed, May 13, 2015 at 7:38 PM, Richard Biener richard.guent...@gmail.com wrote: On Fri, May 8, 2015 at 12:47 PM, Bin Cheng bin.ch...@arm.com wrote: Hi, GCC's IVO currently handles every IV use independently, which is not right by learning from cases reported in PR65447. The rationale is: 1) Lots of address type IVs refer to the same memory object, share similar base and have same step. We should handle these IVs as a group in order to maximize CSE opportunities, prefer reg+offset addressing mode. 2) GCC's IVO algorithm is expensive and only is run when candidate set is small enough. By grouping same family uses, we can decrease the number of both uses and candidates. Before this patch, number of candidates for PR65447 is too big to run expensive IVO algorithm, resulting in bad assembly code on targets like AArch64 and Mips. 3) Even for cases the assembly code isn't improved, we can still get compilation time benefit with this patch. 4) This is a prerequisite for enabling auto-increment support in IVO on AArch64. For now, this is only done to address type IVs, in the future I may extend it to general IVs too. For AArch64: Benchmarks 470.lbm/spec2k6 and 173.applu/spec2k are improved obviously by this patch. A couple of cases from spec2k/fp appear regressed. I looked into generated assembly code and can confirm the regression is false alarm except one case (189.lucas). For that case, I think it's another issue exposed by this patch (GCC failed to CSE candidate setup code, resulting in bloated loop header). Anyway, I also fined tuned the patch to minimize the impact. For AArch32, this patch seems to be able to improve spec2kfp too, but I didn't look deep into it. I guess the reason is it can make life for auto-increment support in IVO better. One of defects of this patch is computation of max offset in compute_max_addr_offset is basically borrowed from get_address_cost. The comment says we should find a better way to compute all information. People also complained we need to refactor that part of code. I don't have good solution to that yet, though I did try best to keep compute_max_addr_offset simple. I believe this is a generally wanted change, bootstrap and test on x86_64 and AArch64, so is it ok? I'm a little bit worried about the linked list of sub-uses and the sorting (that's quadratic). A little. I don't have any good idea but to use a tree... We don't seem to limit the number of sub-uses (if we'd do that it would become O(1)). Similar is searching in the list of uses for a group with same base/step (but ISTR IVOPTs has multiple similar loops?) Hi Richard, Thanks for reviewing. Instead of tree, I can also keep the linked list, then quick sort it by using a vector as temporary storage. This can avoid the complexity of BST operations without non-trivial overload. For the searching routine, a local hash table could help. Overall the patch looks like a good improvement to how we do IVO, so I think it is ok as-is. Do you want me to do this change now, or I can pick it up later when dealing with compilation time issue? Also it would be nice if any compilation time issue will be reported after applying this version patch. Because we will have a IVO compilation time benchmark then. Thanks, bin Thanks, Richard. 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * tree-ssa-loop-ivopts.c (struct iv_use): New fields. (dump_use, dump_uses): Support to dump sub use. (record_use): New parameters to support sub use. Remove call to dump_use. (record_sub_use, record_group_use): New functions. (compute_max_addr_offset, split_all_small_groups): New functions. (group_address_uses, rewrite_use_address): New functions. (strip_offset): New declaration. (find_interesting_uses_address): Call record_group_use. (add_candidate): New assertion. (infinite_cost_p): Move definition forward. (add_costs): Check INFTY cost and return immediately. (get_computation_cost_at): Clear setup cost and dependent bitmap for sub uses. (determine_use_iv_cost_address): Compute cost for sub uses. (rewrite_use_address_1): Rename from old rewrite_use_address. (free_loop_data): Free sub uses. (tree_ssa_iv_optimize_loop): Call group_address_uses. gcc/testsuite/ChangeLog 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * gcc.dg/tree-ssa/pr65447.c: New test.
Re: [PATCH PR65447]Improve IV handling by grouping address type uses with same base and step
On Fri, May 8, 2015 at 12:47 PM, Bin Cheng bin.ch...@arm.com wrote: Hi, GCC's IVO currently handles every IV use independently, which is not right by learning from cases reported in PR65447. The rationale is: 1) Lots of address type IVs refer to the same memory object, share similar base and have same step. We should handle these IVs as a group in order to maximize CSE opportunities, prefer reg+offset addressing mode. 2) GCC's IVO algorithm is expensive and only is run when candidate set is small enough. By grouping same family uses, we can decrease the number of both uses and candidates. Before this patch, number of candidates for PR65447 is too big to run expensive IVO algorithm, resulting in bad assembly code on targets like AArch64 and Mips. 3) Even for cases the assembly code isn't improved, we can still get compilation time benefit with this patch. 4) This is a prerequisite for enabling auto-increment support in IVO on AArch64. For now, this is only done to address type IVs, in the future I may extend it to general IVs too. For AArch64: Benchmarks 470.lbm/spec2k6 and 173.applu/spec2k are improved obviously by this patch. A couple of cases from spec2k/fp appear regressed. I looked into generated assembly code and can confirm the regression is false alarm except one case (189.lucas). For that case, I think it's another issue exposed by this patch (GCC failed to CSE candidate setup code, resulting in bloated loop header). Anyway, I also fined tuned the patch to minimize the impact. For AArch32, this patch seems to be able to improve spec2kfp too, but I didn't look deep into it. I guess the reason is it can make life for auto-increment support in IVO better. One of defects of this patch is computation of max offset in compute_max_addr_offset is basically borrowed from get_address_cost. The comment says we should find a better way to compute all information. People also complained we need to refactor that part of code. I don't have good solution to that yet, though I did try best to keep compute_max_addr_offset simple. I believe this is a generally wanted change, bootstrap and test on x86_64 and AArch64, so is it ok? I'm a little bit worried about the linked list of sub-uses and the sorting (that's quadratic). A little. I don't have any good idea but to use a tree... We don't seem to limit the number of sub-uses (if we'd do that it would become O(1)). Similar is searching in the list of uses for a group with same base/step (but ISTR IVOPTs has multiple similar loops?) Overall the patch looks like a good improvement to how we do IVO, so I think it is ok as-is. Thanks, Richard. 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * tree-ssa-loop-ivopts.c (struct iv_use): New fields. (dump_use, dump_uses): Support to dump sub use. (record_use): New parameters to support sub use. Remove call to dump_use. (record_sub_use, record_group_use): New functions. (compute_max_addr_offset, split_all_small_groups): New functions. (group_address_uses, rewrite_use_address): New functions. (strip_offset): New declaration. (find_interesting_uses_address): Call record_group_use. (add_candidate): New assertion. (infinite_cost_p): Move definition forward. (add_costs): Check INFTY cost and return immediately. (get_computation_cost_at): Clear setup cost and dependent bitmap for sub uses. (determine_use_iv_cost_address): Compute cost for sub uses. (rewrite_use_address_1): Rename from old rewrite_use_address. (free_loop_data): Free sub uses. (tree_ssa_iv_optimize_loop): Call group_address_uses. gcc/testsuite/ChangeLog 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * gcc.dg/tree-ssa/pr65447.c: New test.
[PATCH PR65447]Improve IV handling by grouping address type uses with same base and step
Hi, GCC's IVO currently handles every IV use independently, which is not right by learning from cases reported in PR65447. The rationale is: 1) Lots of address type IVs refer to the same memory object, share similar base and have same step. We should handle these IVs as a group in order to maximize CSE opportunities, prefer reg+offset addressing mode. 2) GCC's IVO algorithm is expensive and only is run when candidate set is small enough. By grouping same family uses, we can decrease the number of both uses and candidates. Before this patch, number of candidates for PR65447 is too big to run expensive IVO algorithm, resulting in bad assembly code on targets like AArch64 and Mips. 3) Even for cases the assembly code isn't improved, we can still get compilation time benefit with this patch. 4) This is a prerequisite for enabling auto-increment support in IVO on AArch64. For now, this is only done to address type IVs, in the future I may extend it to general IVs too. For AArch64: Benchmarks 470.lbm/spec2k6 and 173.applu/spec2k are improved obviously by this patch. A couple of cases from spec2k/fp appear regressed. I looked into generated assembly code and can confirm the regression is false alarm except one case (189.lucas). For that case, I think it's another issue exposed by this patch (GCC failed to CSE candidate setup code, resulting in bloated loop header). Anyway, I also fined tuned the patch to minimize the impact. For AArch32, this patch seems to be able to improve spec2kfp too, but I didn't look deep into it. I guess the reason is it can make life for auto-increment support in IVO better. One of defects of this patch is computation of max offset in compute_max_addr_offset is basically borrowed from get_address_cost. The comment says we should find a better way to compute all information. People also complained we need to refactor that part of code. I don't have good solution to that yet, though I did try best to keep compute_max_addr_offset simple. I believe this is a generally wanted change, bootstrap and test on x86_64 and AArch64, so is it ok? 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * tree-ssa-loop-ivopts.c (struct iv_use): New fields. (dump_use, dump_uses): Support to dump sub use. (record_use): New parameters to support sub use. Remove call to dump_use. (record_sub_use, record_group_use): New functions. (compute_max_addr_offset, split_all_small_groups): New functions. (group_address_uses, rewrite_use_address): New functions. (strip_offset): New declaration. (find_interesting_uses_address): Call record_group_use. (add_candidate): New assertion. (infinite_cost_p): Move definition forward. (add_costs): Check INFTY cost and return immediately. (get_computation_cost_at): Clear setup cost and dependent bitmap for sub uses. (determine_use_iv_cost_address): Compute cost for sub uses. (rewrite_use_address_1): Rename from old rewrite_use_address. (free_loop_data): Free sub uses. (tree_ssa_iv_optimize_loop): Call group_address_uses. gcc/testsuite/ChangeLog 2015-05-08 Bin Cheng bin.ch...@arm.com PR tree-optimization/65447 * gcc.dg/tree-ssa/pr65447.c: New test. Index: gcc/tree-ssa-loop-ivopts.c === --- gcc/tree-ssa-loop-ivopts.c (revision 222758) +++ gcc/tree-ssa-loop-ivopts.c (working copy) @@ -226,6 +226,7 @@ struct cost_pair struct iv_use { unsigned id; /* The id of the use. */ + unsigned sub_id; /* The id of the sub use. */ enum use_type type; /* Type of the use. */ struct iv *iv; /* The induction variable it is based on. */ gimple stmt; /* Statement in that it occurs. */ @@ -239,6 +240,11 @@ struct iv_use struct iv_cand *selected; /* The selected candidate. */ + + struct iv_use *next; /* The next sub use. */ + tree addr_base; /* Base address with const offset stripped. */ + unsigned HOST_WIDE_INT addr_offset; + /* Const offset stripped from base address. */ }; /* The position where the iv is computed. */ @@ -556,8 +562,12 @@ dump_iv (FILE *file, struct iv *iv) void dump_use (FILE *file, struct iv_use *use) { - fprintf (file, use %d\n, use-id); + fprintf (file, use %d, use-id); + if (use-sub_id) +fprintf (file, .%d, use-sub_id); + fprintf (file, \n); + switch (use-type) { case USE_NONLINEAR_EXPR: @@ -605,8 +615,12 @@ dump_uses (FILE *file, struct ivopts_data *data) for (i = 0; i n_iv_uses (data); i++) { use = iv_use (data, i); - - dump_use (file, use); + do + { + dump_use (file, use); + use = use-next; + } + while (use); fprintf (file, \n); } } @@ -1327,33 +1341,88 @@ find_induction_variables (struct ivopts_data