https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64731

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-01-22
                 CC|                            |rguenth at gcc dot gnu.org
            Summary|poor code when using        |vector lowering should
                   |vector_size((32)) for sse2  |split loads and stores
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Ok, the issue is "simple" - veclower doesn't split the loads/stores itself but
the registers:

  <bb 3>:
  # ivtmp.11_24 = PHI <ivtmp.11_23(3), 0(2)>
  _8 = MEM[base: a_6(D), index: ivtmp.11_24, offset: 0B];
  _11 = MEM[base: b_9(D), index: ivtmp.11_24, offset: 0B];
  _17 = BIT_FIELD_REF <_8, 128, 0>;
  _4 = BIT_FIELD_REF <_11, 128, 0>;
  _5 = _4 + _17;
  _29 = BIT_FIELD_REF <_8, 128, 128>;
  _28 = BIT_FIELD_REF <_11, 128, 128>;
  _14 = _28 + _29;
  _12 = {_5, _14};
  MEM[base: a_6(D), index: ivtmp.11_24, offset: 0B] = _12;
  ivtmp.11_23 = ivtmp.11_24 + 32;
  if (ivtmp.11_23 != 8192)
    goto <bb 3>;
  else
    goto <bb 4>;

in this case it would also have a moderately hard time to split the loads/store
as it is faced with TARGET_MEM_REFs already.

Nothing combines this back into a sane form.  I've recently added code that
handles exactly the same situation but only for complex arithmetic
(in tree-ssa-forwprop.c for PR64568).

I wonder why with only -msse2 IVOPTs produces TARGET_MEM_REFs for the loads.
For sure x86_64 cannot load V4DF in one instruction...
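
For readers without the original testcase at hand, here is a minimal sketch of the kind of source that could produce a loop like the one in the dump. The testcase is not quoted in this comment, so everything here is an assumption: the element type (double, guessed from the V4DF mention), the type names v4df/v2df and the function names add_whole/add_split are all hypothetical. add_whole is the whole-vector form, where each iteration loads and stores 32 bytes. add_split is a hand-written version of "split the loads and stores", where every memory access is already 16 bytes wide. It only illustrates the idea at the source level; it is not what veclower emits.

```c
#include <string.h>

/* GCC/Clang vector extension types; the names are hypothetical. */
typedef double v4df __attribute__((vector_size(32)));
typedef double v2df __attribute__((vector_size(16)));

/* Whole-vector form: one 32-byte load from a and b and one 32-byte
   store to a per iteration.  With n == 256 the loop covers 8192 bytes
   in steps of 32, matching the bounds seen in the dump. */
void add_whole(v4df *a, const v4df *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

/* Hand-split form: each 32-byte element is processed as two 16-byte
   halves, so every load, add and store is 128 bits wide.  memcpy is
   used for the accesses to avoid alignment and aliasing assumptions. */
void add_split(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++) {
        for (int h = 0; h < 2; h++) {
            v2df x, y;
            memcpy(&x, a + 4 * i + 2 * h, sizeof x);
            memcpy(&y, b + 4 * i + 2 * h, sizeof y);
            x = x + y;
            memcpy(a + 4 * i + 2 * h, &x, sizeof x);
        }
    }
}
```

Both functions compute the same element-wise sums; comparing their assembly under -O2 -msse2 is one way to see whether the wide loads and stores matter on a given compiler version.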