[Bug target/68793] Bad optimization by split-wide-type on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

Richard Earnshaw changed:

           What    |Removed                     |Added
   Target Milestone|---                         |6.0
[Bug target/68793] Bad optimization by split-wide-type on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

Allan Jensen changed:

           What    |Removed                     |Added
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #8 from Allan Jensen ---
I can confirm the issue is solved in GCC 6.
[Bug target/68793] Bad optimization by split-wide-type on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
                 CC|                            |ktkachov at gcc dot gnu.org

--- Comment #4 from ktkachov at gcc dot gnu.org ---
The testcase doesn't compile for me. Did you mean the below?

#include <arm_neon.h>

typedef unsigned int uint;

void RGBA2BGRA_neon64(const uint* src, uint* dst, uint count)
{
    uint i = 0;
    for (; i < count - 7; i += 8) {
        uint8x8x4_t tmp = vld4_u8((const uint8_t*)(src + i));
        uint8x8x4_t tmp2 = { tmp.val[2], tmp.val[1], tmp.val[0], tmp.val[3] };
        vst4_u8((uint8_t*)(dst + i), tmp2);
    }
    for (; i < count; ++i) {
        dst[i] = src[i] & 0x00ff00ff;
        uint tmp = src[i] & 0xff00ff00;
        dst[i] |= (tmp << 16) | (tmp >> 16);
    }
}

void RGBA2BGRA_neon128(const uint* src, uint* dst, uint count)
{
    uint i = 0;
    for (; i < count - 15; i += 16) {
        uint8x16x4_t tmp = vld4q_u8((const uint8_t*)(src + i));
        uint8x16x4_t tmp2 = {tmp.val[2], tmp.val[1], tmp.val[0], tmp.val[3]};
        vst4q_u8((uint8_t*)(dst + i), tmp2);
    }
    for (; i < count; ++i) {
        dst[i] = src[i] & 0x00ff00ff;
        uint tmp = src[i] & 0xff00ff00;
        dst[i] |= (tmp << 16) | (tmp >> 16);
    }
}

Can you please try a trunk compiler?
I do get the extra umovs with a GCC 5 compiler, but latest trunk at
-O2 -mcpu=generic generates good code for that loop for me:

        ld4     {v4.16b - v7.16b}, [x6]
        orr     v0.16b, v6.16b, v6.16b
        orr     v1.16b, v5.16b, v5.16b
        orr     v2.16b, v4.16b, v4.16b
        orr     v3.16b, v7.16b, v7.16b
        st4     {v0.16b - v3.16b}, [x3]
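For readers without an AArch64 toolchain, the vector loop in the testcase above can be modelled in plain C++. The following is only a sketch and is not part of the bug report: the function names swap_channels_0_and_2 and check_model are hypothetical, and the model handles one pixel per iteration rather than 8 or 16. It assumes the usual behaviour of the vld4/vst4 intrinsics, where val[k] receives byte k of every 4-byte element on load and the four registers are re-interleaved on store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of one vld4 / permute / vst4 round trip.
// For every 4-byte pixel it swaps bytes 0 and 2 and keeps bytes 1 and 3.
void swap_channels_0_and_2(const std::uint8_t* src, std::uint8_t* dst,
                           std::size_t pixels)
{
    for (std::size_t p = 0; p < pixels; ++p) {
        const std::uint8_t* in = src + 4 * p;
        std::uint8_t* out = dst + 4 * p;
        // De-interleave: val[k] stands for lane p of tmp.val[k].
        const std::uint8_t val[4] = { in[0], in[1], in[2], in[3] };
        // Permute as in the testcase: { val[2], val[1], val[0], val[3] }.
        out[0] = val[2];
        out[1] = val[1];
        out[2] = val[0];
        out[3] = val[3];
    }
}

// Small self-check on two pixels.
void check_model()
{
    const std::uint8_t in[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    std::uint8_t out[8] = { 0 };
    swap_channels_0_and_2(in, out, 2);
    assert(out[0] == 3 && out[1] == 2 && out[2] == 1 && out[3] == 4);
    assert(out[4] == 7 && out[5] == 6 && out[6] == 5 && out[7] == 8);
}
```

Seen this way, the permutation is only a renaming of three registers between the load and the store, which is why the four register-to-register moves in the trunk output are all the loop body should need.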
[Bug target/68793] Bad optimization by split-wide-type on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793 --- Comment #3 from Allan Jensen --- Created attachment 36959 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36959=edit neon-test-no-split-wide-types.s
[Bug target/68793] Bad optimization by split-wide-type on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793 --- Comment #6 from Allan Jensen --- I mean the neon64 case, not 32-bit.
[Bug target/68793] Bad optimization by split-wide-type on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

--- Comment #7 from ktkachov at gcc dot gnu.org ---
(In reply to Allan Jensen from comment #6)
> I mean the neon64 case, not 32-bit.

Seems so. I get:

_Z16RGBA2BGRA_neon64PKjPjj:
.LFB3215:
        .cfi_startproc
        subs    w7, w2, #7
        mov     w5, 0
        beq     .L4
        .p2align 2
.L8:
        ubfiz   x3, x5, 2, 32
        add     w5, w5, 8
        add     x4, x0, x3
        add     x3, x1, x3
        cmp     w5, w7
        ld4     {v4.8b - v7.8b}, [x4]
        mov     v0.8b, v6.8b
        mov     v1.8b, v5.8b
        mov     v2.8b, v4.8b
        mov     v3.8b, v7.8b
        st4     {v0.8b - v3.8b}, [x3]
        bcc     .L8
.L4:
        cmp     w5, w2
        bcs     .L10
        uxtw    x3, w5
        sub     w2, w2, #1
        sub     w2, w2, w5
        add     x5, x3, 1
        add     x5, x2, x5
        lsl     x2, x3, 2
        lsl     x5, x5, 2
        .p2align 2
.L7:
        ldr     w3, [x0, x2]
        and     w4, w3, 16711935
        str     w4, [x1, x2]
        ldr     w3, [x0, x2]
        and     w3, w3, -16711936
        orr     w3, w4, w3, ror (32 - 16)
        str     w3, [x1, x2]
        add     x2, x2, 4
        cmp     x2, x5
        bne     .L7
        ret
.L10:
        ret
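The decimal immediates in the scalar tail loop (.L7) are easier to read next to the source. The sketch below is not from the bug report and the function names are hypothetical. It writes the tail-loop expression once as it appears in the testcase and once in the shape of the generated code, on the assumption that 16711935 and -16711936 are the 32-bit masks 0x00ff00ff and 0xff00ff00 and that `ror (32 - 16)` is a rotate by 16 bits.

```cpp
#include <cassert>
#include <cstdint>

// Tail-loop body from the testcase, for a single 32-bit pixel.
std::uint32_t tail_loop_pixel(std::uint32_t px)
{
    std::uint32_t out = px & 0x00ff00ffu;
    std::uint32_t tmp = px & 0xff00ff00u;
    out |= (tmp << 16) | (tmp >> 16);
    return out;
}

// The same computation in the shape of the assembly: and, and, orr with ror.
std::uint32_t tail_loop_pixel_ror(std::uint32_t px)
{
    std::uint32_t lo = px & 16711935u;                              // 0x00ff00ff
    std::uint32_t hi = px & static_cast<std::uint32_t>(-16711936);  // 0xff00ff00
    std::uint32_t rot = (hi >> 16) | (hi << 16);                    // rotate by 16
    return lo | rot;
}

// Self-check: both forms agree on a sample value.
void check_tail_loop()
{
    assert(tail_loop_pixel(0x11223344u) == 0x33221144u);
    assert(tail_loop_pixel_ror(0x11223344u) == 0x33221144u);
}
```

The shift pair `(tmp << 16) | (tmp >> 16)` on a 32-bit value is a rotate, so the compiler can fold it into the shifted operand of the final orr instead of emitting two shifts.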
[Bug target/68793] Bad optimization by split-wide-type on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793 --- Comment #1 from Allan Jensen --- Created attachment 36957 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36957=edit neon-test.cpp
[Bug target/68793] Bad optimization by split-wide-type on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793 --- Comment #2 from Allan Jensen --- Created attachment 36958 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36958=edit neon-test-split-wide-types.s
[Bug target/68793] Bad optimization by split-wide-type on NEON
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

--- Comment #5 from Allan Jensen ---
The test case uses C++11 initialization. I haven't tested GCC 6, so if you say
it is solved, I trust you.

Note that the 32-bit case is also suboptimal, both with and without
split-wide-types (that option does not affect it). Is that also fixed in GCC 6?