On Fri, Jun 21, 2019 at 1:13 AM Jeff Law <l...@redhat.com> wrote:
> On 6/20/19 3:53 AM, JiangNing OS wrote:
> > Hi Jeff,
> >
> > Appreciate your effort to review my patch! I've updated my patch as
> > attached. See my answers below.
> >
> >> in current function, so the store speculation can be avoided.
> >> So at a high level should we be doing this in gimple rather than RTL?
> >> We're going to have a lot more information about types, better
> >> infrastructure for looking at uses/defs, access to the alias oracle,
> >> we should be able to accurately distinguish between potentially
> >> shared objects vs those which are local to the thread, etc. We lose
> >> the low level costing information though.
> >>
> >> I'm still going to go through the patch and do some level of review,
> >> but I do think we need to answer the higher level question though.
> >>
> > I have the following reasons,
> >
> > 1) Following the clue Richard B gave me before about the parameter
> > --param allow-store-data-races, I did check the middle-end pass
> > tree-if-conv, but I think this pass at the moment doesn't work for the
> > issue I'm trying to solve. Tree-if-conv does if-conversion for loops,
> > and its final goal is to help loop vectorization, while my case
> > doesn't have a loop at all.
> I think the fact that it's focused so much on loops is a historical
> accident. We certainly have a variety of places in the gimple
> optimizers that do if-conversion, and they're not all in tree-if-conv :(
> For example, some are done in tree-ssa-phiopt.
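[Editor's note: for readers without the PR in front of them, the pattern at stake is a conditional store. This is a hedged reconstruction for illustration — not necessarily the exact PR 89430 testcase, and `cond_store` is an invented name. If-converting the branch into an unconditional store writes `a[k]` even when the condition is false, i.e. it speculates a store, which is only safe when the location cannot be observed by another thread.]

```c
/* Illustrative reduction of the conditional-store pattern discussed in
   PR 89430 (hypothetical example, not the PR's verbatim source).
   If-conversion would like to turn the branch into a branchless,
   unconditional store -- but that store is speculative.  */
int a[1024];

void cond_store (unsigned k, int b)
{
  if (a[k] > b)
    a[k] = b;
}
```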
>
> In the gimple optimizers the testcase from 89430 is going to look
> something like this:
>
> ;; basic block 2, loop depth 0, count 1073741824 (estimated locally), maybe hot
> ;;  prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
> ;;  pred:       ENTRY [always]  count:1073741824 (estimated locally) (FALLTHRU,EXECUTABLE)
>   a.0_1 = a;
>   _2 = (long unsigned int) k_8(D);
>   _3 = _2 * 4;
>   _4 = a.0_1 + _3;
>   _5 = *_4;
>   if (_5 > b_9(D))
>     goto <bb 3>; [50.00%]
>   else
>     goto <bb 4>; [50.00%]
> ;;  succ:       3 [50.0% (guessed)]  count:536870912 (estimated locally) (TRUE_VALUE,EXECUTABLE)
> ;;              4 [50.0% (guessed)]  count:536870912 (estimated locally) (FALSE_VALUE,EXECUTABLE)
>
> ;; basic block 3, loop depth 0, count 536870913 (estimated locally), maybe hot
> ;;  prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
> ;;  pred:       2 [50.0% (guessed)]  count:536870912 (estimated locally) (TRUE_VALUE,EXECUTABLE)
>   *_4 = b_9(D);
> ;;  succ:       4 [always]  count:536870913 (estimated locally) (FALLTHRU,EXECUTABLE)
>
> ;; basic block 4, loop depth 0, count 1073741824 (estimated locally), maybe hot
> ;;  prev block 3, next block 1, flags: (NEW, REACHABLE, VISITED)
> ;;  pred:       3 [always]  count:536870913 (estimated locally) (FALLTHRU,EXECUTABLE)
> ;;              2 [50.0% (guessed)]  count:536870912 (estimated locally) (FALSE_VALUE,EXECUTABLE)
>   return;
>
> That looks like a pretty easy form to analyze. I'd suggest looking
> through tree-ssa-phiopt.c closely. There's several transformations in
> there that share similarities with yours.
>
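[Editor's note: at the source level, the cselim/phiopt-style transformation being suggested replaces the conditional store with an unconditional store of a selected value. The sketch below is a hypothetical illustration in plain C, not GCC internals; the function names are invented. The second function shows why the rewrite is only legal for thread-private memory: on an interleaving where the original program never writes, the converted code can clobber a concurrent update.]

```c
/* What a cselim/phiopt-style if-conversion of the blocks above amounts
   to at the source level (hypothetical illustration): the dominating
   load always executes, and the store becomes unconditional.  */
int a2[1024];

void cond_store_converted (unsigned k, int b)
{
  int old = a2[k];             /* dominating load, always executed */
  a2[k] = old > b ? b : old;   /* unconditional, branchless store  */
}

/* Deterministic simulation of the interleaving that makes the rewrite
   wrong for shared memory.  Thread A runs the converted code with the
   condition false (so the original program would not write); thread B
   stores between A's load and A's write-back, and B's update is lost.  */
int shared;

int simulate_lost_update (void)
{
  shared = 5;
  int b = 10;                   /* shared > b is false: A shouldn't write */
  int tmp = shared;             /* A: speculative load                    */
  shared = 42;                  /* B: concurrent store                    */
  shared = tmp > b ? b : tmp;   /* A: speculative store clobbers the 42   */
  return shared;                /* B's update was lost                    */
}
```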
More specifically, there is the conditional store elimination (cselim)
pass inside this file.

> >
> > 2) My current solution fits into the current back-end if-conversion
> > pass very well. I don't want to invent a new framework to solve this
> > relatively small issue. Besides, this back-end patch doesn't only
> > enhance store speculation detection, but also fixes a bug in the
> > original code.
> Understood, but I still wonder if we're better off addressing this in
> gimple.
>
Traditionally, if-conversion's profitability heavily depends on the
target, and costing on GIMPLE is hard, especially when memory is
involved. That's also one reason why we turned off loop if-conversion
on GIMPLE for non-vectorized code. phiopt is really about followup
optimization opportunities in passes that do not handle conditionals
well. cselim is on the border...

> >> Just from a design standpoint, what are the consequences if this
> >> function returns true for something that isn't actually on the stack,
> >> or false for something that is on the stack?
> >>
> > If noce_mem_is_on_stack returns true for something that isn't actually
> > on the stack, it could potentially introduce store speculation, and
> > then the if-conversion optimization will be incorrect. If this
> > function returns false for something that is on the stack, it doesn't
> > matter, because the optimization will simply not be triggered.
> OK. That's what I expected.
>
> >>> +
> >>> +/* Always return true if there is a dominating write.
> >>> +
> >>> +   When there is a dominating read from memory on the stack,
> >>> +   1) if x = a is a memory read, return true.
> >>> +   2) if x = a is a memory write, return true if the memory is on
> >>> +      the stack.  This guarantees the memory is *not* readonly.
> >>> +   */
> >>> +
> >>> +static bool
> >>> +noce_valid_for_dominating (basic_block bb, rtx_insn *a_insn,
> >>> +			   const_rtx x, bool is_store)
> >>> +{
> >>> +  rtx_insn *insn;
> >>> +  rtx set;
> >>> +
> >>> +  gcc_assert (MEM_P (x));
> >>> +
> >>> +  FOR_BB_INSNS (bb, insn)
> >>> +    {
> >>> +      set = single_set (insn);
> >>> +      if (!set)
> >>> +	continue;
> >>> +
> >>> +      /* Dominating store.  */
> >>> +      if (rtx_equal_p (x, SET_DEST (set)))
> >>> +	return true;
> >>> +
> >>> +      /* Dominating load.  */
> >>> +      if (rtx_equal_p (x, SET_SRC (set)))
> >>> +	if (is_store && noce_mem_is_on_stack (a_insn, x))
> >>> +	  return true;
> >>> +    }
> >>> +
> >>> +  return false;
> >>> +}
> >> So what would be the consequences here of returning false when in
> >> fact there was a dominating read or write? That could easily happen
> >> if the dominating read or write was not a single_set.
> > If it returns false when in fact there is a dominating read or write,
> > the optimization will not be triggered, because
> > noce_mem_maybe_invalid_p will be true, which plays the same role as
> > may_trap_or_fault_p.
> >
> >> I'm guessing that from a design standpoint you're trying to find
> >> cases where you know an object was written to within the block and
> >> does not escape. So returning false when there was a dominating
> >> write is safe. Returning true when there was not would be
> >> disastrous. Right?
> >>
> > YES! You've completely understood my code! 😊
> So in this context, only handling single_sets is safe. Perfect.
>
> >>> +
> >>> +  /* Iterate over all insns until finding all stack pointers, and
> >>> +     store them into bba_sets_must_be_sfp.
> >>> +   */
> >>> +  bba_sets_must_be_sfp = BITMAP_ALLOC (&reg_obstack);
> >>> +  do
> >>> +    {
> >>> +      sfp_found = false;
> >>> +      FOR_ALL_BB_FN (bb, cfun)
> >>> +	FOR_BB_INSNS (bb, a_insn)
> >>> +	  {
> >>> +	    rtx sset_a = single_set (a_insn);
> >>> +	    if (!sset_a)
> >>> +	      continue;
> >>> +	    rtx src = SET_SRC (sset_a);
> >>> +	    rtx dest = SET_DEST (sset_a);
> >> So again, we can get things that aren't a single_set here and
> >> they'll be ignored. But I believe that's safe as you're trying to
> >> identify insns which store stack addresses into a REG. Missing a
> >> more complicated case where a stack address is stored into a REG is
> >> safe. Right?
> >>
> > Yes. Missing some pointers to the stack is safe here, because we use
> > this information to check that a memory access in an if-conversion
> > candidate has no store speculation. If we miss one, the optimization
> > will simply not be triggered, so there is no risk.
> Thanks for the confirmation.
>
> >>> +
> >>> +	    /* Handle cases like "x2 = x1 + offset", in which x1 is
> >>> +	       already identified as a stack pointer.  */
> >>> +	    if ((GET_CODE (src) == PLUS || GET_CODE (src) == MINUS)
> >>> +		&& CONST_INT_P (XEXP (src, 1)))
> >>> +	      {
> >>> +		rtx x1 = XEXP (src, 0);
> >>> +		if (!REG_P (x1))
> >>> +		  continue;
> >>> +
> >>> +		FOR_EACH_INSN_USE (use, a_insn)
> >>> +		  {
> >>> +		    if (!rtx_equal_p (x1, DF_REF_REG (use)))
> >>> +		      continue;
> >>> +
> >>> +		    if (bitmap_bit_p (bba_sets_must_be_sfp, DF_REF_REGNO (use)))
> >>> +		      {
> >>> +			bitmap_set_bit (bba_sets_must_be_sfp, dest_regno);
> >>> +			sfp_found = true;
> >>> +			break;
> >>> +		      }
> >>> +		  }
> >> So how much is there to be gained from going from this kind of
> >> localized computation to global? It shouldn't be hard to turn this
> >> into a truly global algorithm, but I'm not sure it's worth the
> >> effort.
> >>
> >> I _do_ think it's worth visiting blocks in dominator order. It's
> >> better than what you're doing, but nowhere near as expensive as a
> >> global propagation engine.
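[Editor's note: the do/while loop with `sfp_found` above is an iterate-to-fixpoint propagation: seed the set with the known stack/frame pointer registers, then keep sweeping over `dest = src + const` insns until no new register is added. The toy model below (invented names, not GCC code) makes that structure explicit, and its round count shows why Jeff's suggestion of visiting in dominator order helps: a bad visit order forces extra sweeps.]

```c
#include <stdbool.h>

/* Toy fixed-point propagation of the "must be stack frame pointer"
   property over a flat list of "dest = src + const" copies, analogous
   to the do/while sfp_found loop in the patch.  Returns the number of
   sweeps performed, including the final sweep that finds no change.  */
enum { NREGS = 8 };

struct add_insn { int dest, src; };   /* models dest = src + offset */

int propagate_sfp (bool sfp[NREGS], const struct add_insn *insns, int n)
{
  bool changed;
  int rounds = 0;
  do
    {
      changed = false;
      for (int i = 0; i < n; i++)
        if (sfp[insns[i].src] && !sfp[insns[i].dest])
          {
            sfp[insns[i].dest] = true;   /* dest also points into the stack */
            changed = true;
          }
      rounds++;
    }
  while (changed);
  return rounds;
}
```

With the insns listed in reverse dependence order, the chain reg0 → reg1 → reg2 needs one extra sweep per link; visiting definitions in dominator order would converge in a single sweep plus the final check.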
> >>
> > Fixed in the attached new patch using the dom_walk class. Please
> > review again.
> >
> >> Is it worth handling simple copies in the code above?
> > I thought it less likely to have a copy of a pointer to the stack
> > before register allocation. Right?
> We certainly try to eliminate as many copies as possible, but
> experience would tell us that they often sneak through. I guess some
> quick instrumentation would tell us if it's worth the effort.
>
> >>> +/* Return true if the current function doesn't pass a stack
> >>> +   address out of the frame.  */
> >>> +
> >>> +static bool
> >>> +no_stack_address_taken (void)
> >>> +{
> >>> +  basic_block bb;
> >>> +  rtx_insn *a_insn;
> >>> +  df_ref use;
> >>> +
> >>> +  FOR_ALL_BB_FN (bb, cfun)
> >>> +    FOR_BB_INSNS (bb, a_insn)
> >>> +      {
> >>> +	if (!INSN_P (a_insn))
> >>> +	  continue;
> >>> +
> >>> +	rtx sset_a = single_set (a_insn);
> >>> +	if (!sset_a)
> >>> +	  continue;
> >>> +	rtx src = SET_SRC (sset_a);
> >>> +	rtx dest = SET_DEST (sset_a);
> >>> +
> >>> +	/* Skip insns whose dest is already identified as a frame
> >>> +	   pointer.  */
> >>> +	if (bitmap_bit_p (bba_sets_may_be_sfp, REGNO (dest)))
> >>> +	  continue;
> >>> +
> >>> +	FOR_EACH_INSN_USE (use, a_insn)
> >>> +	  {
> >>> +	    /* Check if the current insn uses any stack pointer.
> >>> +	       Ignore it if it doesn't use frame pointers at all.  */
> >>> +	    if (!bitmap_bit_p (bba_sets_may_be_sfp, DF_REF_REGNO (use)))
> >>> +	      continue;
> >>> +
> >>> +	    /* A load cannot introduce an address taken.  */
> >>> +	    if (MEM_P (src))
> >>> +	      continue;
> >>> +
> >>> +	    /* A store is safe as long as the src doesn't contain a
> >>> +	       stack pointer.  */
> >>> +	    if (MEM_P (dest)
> >>> +		&& !reg_mentioned_p (DF_REF_REG (use), src))
> >>> +	      continue;
> >>> +
> >>> +	    /* All other cases potentially introduce an address taken.  */
> >>> +	    return false;
> >>> +	  }
> >> So what if you have a PARALLEL where one entry causes an escape of a
> >> pointer into the stack and another entry does some other operation?
> >> If so, don't you miss the fact that you've got an escaping pointer
> >> into the stack? I don't think you can restrict yourself to
> >> single_sets here.
> > Yes. You are right. I have added code to handle the PARALLEL case. I
> > handle it in a rather conservative way though. Review again, please!
> I'll take another look at the patch. I'll also do a little
> experimentation to see if we're seeing enough copies to make handling
> them worth the effort.
>
> While I'm doing that, if you could walk through tree-ssa-phiopt.c and
> ponder if your transformation could fit into that framework it'd be
> appreciated (note the comment at the start of the file only covers one
> class of transformations, there are others later).
>
> jeff
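[Editor's note: to ground the escape property that no_stack_address_taken checks, here is a hedged source-level illustration (hypothetical functions, not from the patch or the PR). Once a frame address is stored outside the frame, that slot may be reachable from other threads, so speculative stores into it are no longer provably race-free; an address that never escapes stays thread-private.]

```c
/* Hypothetical illustration of stack-address escape.  Storing &local
   into a global is exactly the kind of insn (a stack pointer appearing
   in the source of a store) that makes no_stack_address_taken return
   false.  */
int *escaped;

int leak_frame_address (void)
{
  int local = 1;
  escaped = &local;   /* stack address stored out of the frame: escapes */
  return *escaped;    /* still valid here, inside the frame's lifetime  */
}

int keep_frame_private (void)
{
  int local = 2;      /* address never taken: provably thread-private   */
  return local;
}
```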