On November 8, 2015 3:58:57 AM GMT+01:00, "Bin.Cheng" <amker.ch...@gmail.com> wrote: >On Fri, Nov 6, 2015 at 9:24 PM, Richard Biener ><richard.guent...@gmail.com> wrote: >> On Wed, Nov 4, 2015 at 11:18 AM, Bin Cheng <bin.ch...@arm.com> wrote: >>> Hi, >>> PR52272 reported a performance regression in spec2006/410.bwaves >once GCC is >>> prevented from representing address of one memory object using >address of >>> another memory object. Also as I commented in that PR, we have two >possible >>> fixes for this: >>> 1) Improve how TMR.base is deduced, so that we can represent addr of >mem obj >>> using another one, while not breaking PR50955. >>> 2) Add iv candidates with base object stripped. In this way, we use >the >>> common base-stripped part to represent all address expressions, in >the form >>> of [base_1 + common], [base_2 + common], ..., [base_n + common]. >>> >>> In terms of code generation, method 2) is at least as good as 1), >actually >>> better in my opinion. The problem of 2) is we need to tell when iv >>> candidates should be added for the common part and when shouldn't. >This >>> issue can be generalized and described as: We know IVO tries to add >>> candidates by deriving from iv uses. One disadvantage is that >candidates >>> are derived from iv use independently. It doesn't take common sub >>> expression among different iv uses into consideration. As a result, >>> candidate for common sub expression is not added, while many useless >>> candidates are added. >>> >>> As a matter of fact, candidate derived from iv use is useful only if >it's >>> common enough and could be shared among different uses. A candidate >is most >>> likely useless if it's derived from a single use and could not be >shared by >>> others. This patch works in this way by firstly recording all kinds >>> candidates derived from iv uses, then adding candidates for common >ones. >>> >>> The patch improves 410.bwaves by 3-4% on x86_64. I also saw >regression for >>> 400.perlbench and small regression for 401.bzip on x86_64, but I can >confirm >>> they are false alarms caused by align issues. >>> For aarch64, fp cases are obviously improved for both spec2000 and >spec2006. >>> Also the patch causes 2-3% regression for 459.GemsFDTD, which I >think is >>> another irrelevant issue caused by heuristic candidate selecting >algorithm. >>> Unfortunately, I don't have fix to it currently. >>> >>> This patch may add more candidates in some cases, but generally >candidates >>> number is smaller because we don't need to add useless candidates >now. >>> Statistic data shows there are quite fewer loops with more than 30 >>> candidates when building spec2k6 on x86_64 using this patch. >>> >>> Bootstrap and test on x86_64. I will re-test it against latest >trunk on >>> AArch64. Is it OK? >> >> +inline bool >> +iv_common_cand_hasher::equal (const iv_common_cand *ccand1, >> + const iv_common_cand *ccand2) >> +{ >> + return ccand1->hash == ccand2->hash >> + && operand_equal_p (ccand1->base, ccand2->base, 0) >> + && operand_equal_p (ccand1->step, ccand2->step, 0) >> + && TYPE_PRECISION (TREE_TYPE (ccand1->base)) >> + == TYPE_PRECISION (TREE_TYPE (ccand2->base)); >> >Hi Richard, >Thanks for reviewing. > >> I'm wondering on the TYPE_PRECISION check. a) why is that needed? >Because operand_equal_p doesn't check type precision for constant int >nodes, and IVO needs to take precision into consideration.
Ok >> and b) what kind of tree is base so that it is safe to inspect >TYPE_PRECISION >> unconditionally? >Both SCEV and IVO work on expressions with type satisfying >POINTER_TYPE_P or INTEGRAL_TYPE_P, so it's safe to access precision >unconditionally? Yes. Patch is OK then. Richard. > >> >> + slot = data->iv_common_cand_tab->find_slot (&ent, INSERT); >> + if (*slot == NULL) >> + { >> + *slot = XNEW (struct iv_common_cand); >> >> allocate from the IV obstack instead? I see we do a lot of heap >allocations >> in IVOPTs, so we can improve that as followup as well. >> >Yes, small structures in IVO like iv, iv_use, iv_cand, iv_common_cand >are better to be allocated in obstack. Actually I have already make >that change to struct iv. others will be followup too. > >Thanks, >bin >> We probably should empty the obstack after each processed loop. >> >> Thanks, >> Richard. >> >> >>> Thanks, >>> bin >>> >>> 2015-11-03 Bin Cheng <bin.ch...@arm.com> >>> >>> PR tree-optimization/52272 >>> * tree-ssa-loop-ivopts.c (struct iv_common_cand): New >struct. >>> (struct iv_common_cand_hasher): New struct. >>> (iv_common_cand_hasher::hash): New function. >>> (iv_common_cand_hasher::equal): New function. >>> (struct ivopts_data): New fields, iv_common_cand_tab and >>> iv_common_cands. >>> (tree_ssa_iv_optimize_init): Initialize above fields. >>> (record_common_cand, common_cand_cmp): New functions. >>> (add_iv_candidate_derived_from_uses): New function. >>> (add_iv_candidate_for_use): Record iv_common_cands derived >from >>> iv use in hash table, instead of adding candidates directly. >>> (add_iv_candidate_for_uses): Call >>> add_iv_candidate_derived_from_uses. >>> (record_important_candidates): Add important candidates to >iv uses' >>> related_cands. Always keep related_cands for future use. >>> (try_add_cand_for): Use iv uses' related_cands. >>> (free_loop_data, tree_ssa_iv_optimize_finalize): Release new >fields >>> in struct ivopts_data, iv_common_cand_tab and >iv_common_cands.