On Fri, Aug 28, 2015 at 5:48 PM, Jeff Law <l...@redhat.com> wrote:
> On 08/28/2015 09:43 AM, Simon Dardis wrote:
>
>> Following Jeff's advice[1] to extract more information from GCC, I've
>> narrowed the cause down to the predictive commoning pass inserting
>> the load in a loop header style basic block. However, the next pass
>> in GCC, tree-cunroll promptly removes the loop and joins the loop
>> header to the body of the (non)loop. More oddly, disabling
>> conditional store elimination pass or the dominator optimizations
>> pass or disabling of jump-threading with --param
>> max-jump-thread-duplication-stmts=0 nets the above assembly code. Any
>> ideas on an approach for this issue?
>
> I'd probably start by looking at the .optimized tree dump in both cases to
> understand the difference, then (most likely) tracing that through the RTL
> optimizers into the register allocator.

It's the known issue of LIM (here the one after pcom and complete unrolling
of the inner loop) being too aggressive with store-motion. Here the complete
array is replaced with registers for the outer loop. Were 'poly' a local
variable we'd have optimized it away completely.

  <bb 6>:
  _8 = 1.0e+0 / pretmp_42;
  _12 = _8 * _8;
  poly[1] = _12;

  <bb 7>:
  # prephitmp_30 = PHI <_12(6), _36(9)>
  # T_lsm.8_22 = PHI <_8(6), pretmp_42(9)>
  poly_I_lsm0.10_38 = MEM[(double *)&poly + 8B];
  _2 = prephitmp_30 * poly_I_lsm0.10_38;
  _54 = _2 * poly_I_lsm0.10_38;
  _67 = poly_I_lsm0.10_38 * _54;
  _80 = poly_I_lsm0.10_38 * _67;
  _93 = poly_I_lsm0.10_38 * _80;
  _106 = poly_I_lsm0.10_38 * _93;
  _19 = poly_I_lsm0.10_38 * _106;
  count_23 = count_28 + 1;
  if (count_23 != iterations_6(D))
    goto <bb 5>;
  else
    goto <bb 8>;

  <bb 8>:
  poly[2] = _2;
  poly[3] = _54;
  poly[4] = _67;
  poly[5] = _80;
  poly[6] = _93;
  poly[7] = _106;
  poly[8] = _19;
  i1 = 9;
  T = T_lsm.8_22;

Note that DOM fails to CSE poly[1] (a known defect), but heh, doing that
would only increase register pressure even more.

Note the above is on x86_64.

Richard.

> jeff
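
For anyone following along, here is a rough source-level sketch of what store motion does to a global array inside a loop. It is not the test case from this thread (that source is not shown here); the array size, the function names and the loop body are all made up for illustration, and the second function is a hand-written approximation of the transformation, not actual compiler output.

```c
#include <stdio.h>

#define N 4

/* Hypothetical global array, standing in for 'poly' in the dump above.
   Being global is what keeps the final stores alive. */
double poly[N];

/* As written: each iteration stores every element back to memory. */
void powers_in_memory(double x, int iterations)
{
    for (int count = 0; count < iterations; count++) {
        poly[1] = x;
        poly[2] = poly[1] * x;
        poly[3] = poly[2] * x;
    }
}

/* Hand-written approximation of the loop after store motion: each array
   element is held in a scalar for the whole loop and stored once on exit.
   Every element now needs its own register across the loop. */
void powers_in_registers(double x, int iterations)
{
    double p1 = poly[1], p2 = poly[2], p3 = poly[3];

    for (int count = 0; count < iterations; count++) {
        p1 = x;
        p2 = p1 * x;
        p3 = p2 * x;
    }

    /* Stores sunk out of the loop, as in <bb 8> of the dump. */
    poly[1] = p1;
    poly[2] = p2;
    poly[3] = p3;
}
```

Both functions leave the same values in the array; the second just trades memory traffic in the loop for live scalars. With nine elements instead of three, that trade is where the register pressure comes from.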