https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81504
Bug ID: 81504
Summary: gcc-7 regression: vec_st in loop misoptimized
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: zoltan at hidvegi dot com
CC: wschmidt at gcc dot gnu.org
Target Milestone: ---
Target: powerpc64le-unknown-linux-gnu

Created attachment 41802
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41802&action=edit
gcc-7 -O2, vec_st in loop misoptimized

The attached code is miscompiled with gcc-7 -O2; gcc-6 produces correct code, and gcc-7 -O1 also generates good code. With gcc-7 -O2, idx is always incremented before p[idx] is written using vec_st. This is a minimal testcase I created to reproduce the problem; it is not real code.

The background: I needed a pointer wrapper class for ppcle vectors because gcc by default never uses the lvx / stvx instructions, even when it knows an address is aligned; it always wants to use lxvd2x / xxswapd and generates tons of unnecessary xxswapd instructions. I'm aware of the attempts to optimize away swaps, but those don't apply to my application, so I just want to use lvx without the swaps.

The bug is somehow related to vec_st: if I use inline asm to generate stvx instead, the code is correct. It also works if there is no builtin_constant_p check in the rotate_left macro.

It would be really nice if there were a way to disable the lane-swap optimizations and allow gcc to use aligned loads/stores when the address is known to be aligned. On x86 gcc already knows when to use aligned vs. unaligned loads, so it must be possible on ppc as well. My code can execute over 100 million vector load and store instructions per second, so removing the swaps has a real performance impact.
Here is the gcc-7 assembly. Note that the addi 3,3,16 after the unconditional branch to .L3 is executed before the first stvx 0,0,3, so the vector pointer is incremented before it is ever written:

	sldi 9,5,4
	srdi 10,4,5
	add 3,3,9
	addi 9,10,1
	mtctr 9
	li 8,1
	b .L3
	.p2align 4,,15
.L2:
	stvx 0,0,3
	addi 5,5,1
.L3:
#APP
 # 20 "msary_bug.C" 1
	rotld 9,8,5
 # 0 "" 2
#NO_APP
	mtvsrd 32,9
	addi 3,3,16
	xxpermdi 32,32,32,0
	bdnz .L2
	sldi 10,10,5
	subf 4,10,4
	mtvsrd 32,4
	xxpermdi 32,32,32,0
	stvx 0,0,3
	blr