https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
--- Comment #38 from rguenther at suse dot de <rguenther at suse dot de> --- On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359 > > --- Comment #37 from Aldy Hernandez <aldyh at gcc dot gnu.org> --- > Hi Richi. > > (In reply to rguent...@suse.de from comment #31) > > > I'd have not restricted the out-of-loop IV use to IV +- CST but > > instead did the transform > > > > + LOOP: > > + # p_8 = PHI <p_16(2), p_INC(3)> > > + ... > > + p_INC = p_8 - 1; > > + goto LOOP; > > + ... p_8 uses ... > > > > to > > > > + LOOP: > > + # p_8 = PHI <p_16(2), p_INC(3)> > > + ... > > + p_INC = p_8 - 1; > > + goto LOOP; > > newtem_12 = p_INC + 1; // undo IV increment > > ... p_8 out-of-loop p_8 uses replaced with newtem_12 ... > > > > so it would always work if we can undo the IV increment. > > > > The disadvantage might be that we then rely on RTL optimizations > > to combine the original out-of-loop constant add with the > > newtem computation but I guess that's not too much to ask ;) > > k > > It looks like RTL optimizations have a harder time optimizing things when I > take the above approach. > > Doing what you suggest, we end up optimizing this (simplified for brevity): > > <bb 3> > # p_8 = PHI <p_16(2), p_19(3)> > p_19 = p_8 + 4294967295; > if (ui_7 > 9) > goto <bb 3>; [89.00%] > ... > > <bb 5> > p_22 = p_8 + 4294967294; > MEM[(char *)p_19 + 4294967295B] = 45; > > into this: > > <bb 3>: > # p_8 = PHI <p_16(2), p_19(3)> > p_19 = p_8 + 4294967295; > if (ui_7 > 9) > ... > > <bb 4>: > _25 = p_19 + 1; ;; undo the increment > ... > > <bb 5>: > p_22 = _25 + 4294967294; > MEM[(char *)_25 + 4294967294B] = 45; shouldn't the p_19 in the MEM remain? Why a new BB for the adjustment? (just asking...) > I haven't dug into the RTL optimizations, but the end result is that we only > get one auto-dec inside the loop, and some -2 indexing outside of it: > > strb r1, [r4, #-1]! > lsr r3, r3, #3 > bhi .L4 > cmp r6, #0 > movlt r2, #45 > add r3, r4, #1 > strblt r2, [r3, #-2] > sublt r4, r4, #1 > > as opposed to mine: > > <bb 3>: > # p_8 = PHI <p_16(2), p_19(3)> > p_19 = p_8 + 4294967295; > if (ui_7 > 9) > ... > > <bb 5>: > p_22 = p_19 + 4294967295; > *p_22 = 45; > > which gives us two auto-dec, and much tighter code: > > strb r1, [r4, #-1]! > lsr r3, r3, #3 > bhi .L4 > cmp r6, #0 > movlt r3, #45 > strblt r3, [r4, #-1]! > > Would it be OK to go with my approach, or is worth looking into the rtl > optimizers and seeing what can be done (boo! :)). Would be interesting to see what is causing things to break. If it's only combine doing the trick then it will obviously be multi-uses of the adjustment pseudo. But eventually I thought fwprop would come to the rescue... I think a good approach would be to instead of just replacing all out-of-loop uses with the new SSA name doing the adjustment check if the use is in a "simple" (thus SSA name +- cst) stmt and there forward the adjustment. So have the best of both worlds. I would have hoped this extra complication isn't needed but as you figured...