https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-01-16
                 CC|                            |rsandifo at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #10 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
FWIW, I agree that pure unrolling doesn't feel like a gimple-level
optimisation.  Whether it's a win or not depends on whether the unrolled
loop will make better use of the microarchitecture.  The problem isn't
just that that's hard to decide at the gimple level, but that the result
can't be represented directly in gimple.  AIUI there's no real
significance to the schedule of gimple statements (beyond ensuring valid
SSA and functional correctness).  This is different from vectorisation
and ivopts, which can represent the benefit of the transformation
directly in gimple (using vector ops and TARGET_MEM_REFs respectively).

As Kyrill pointed out off-list, LLVM does the unrolling in the
vectoriser rather than a separate unrolling pass.  (Use -mllvm
-print-after-all to see this.)

I think for AArch64 we can view LDP and STP as 2-element vector loads
and stores that have zero-cost insertion and extraction.  So converting:

    ldr x0, [...]
    add x0, x0, 1
    str x0, [...]

into:

    ldp x0, x1, [...]
    add x0, x0, 1
    add x1, x1, 1
    stp x0, x1, [...]

is IMO genuine vectorisation.  The LDPs and STPs are effectively scalar
IFN_LOAD_LANES and IFN_STORE_LANES, although we could also represent
them as single-element (V1) vector ops instead if that seems more
consistent.  Vectorising operations other than loads and stores would
simply involve duplicating the statements VF times.