[Bug target/68793] Bad optimization by split-wide-type on NEON

2015-12-23 Thread rearnsha at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

Richard Earnshaw changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |6.0

[Bug target/68793] Bad optimization by split-wide-type on NEON

2015-12-09 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

Allan Jensen changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #8 from Allan Jensen ---
I can confirm the issue is solved in gcc 6.

[Bug target/68793] Bad optimization by split-wide-type on NEON

2015-12-08 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ktkachov at gcc dot gnu.org

--- Comment #4 from ktkachov at gcc dot gnu.org ---
The testcase doesn't compile for me.
Did you mean the below?
#include <arm_neon.h>

typedef unsigned int uint;

void RGBA2BGRA_neon64(const uint* src, uint* dst, uint count)
{
    uint i = 0;
    for (; i < count - 7; i += 8) {
        uint8x8x4_t tmp = vld4_u8((const uint8_t*)(src + i));
        uint8x8x4_t tmp2 = { tmp.val[2], tmp.val[1], tmp.val[0], tmp.val[3] };
        vst4_u8((uint8_t*)(dst + i), tmp2);
    }
    for (; i < count; ++i) {
        dst[i] = src[i] & 0x00ff00ff;
        uint tmp = src[i] & 0xff00ff00;
        dst[i] |= (tmp << 16) | (tmp >> 16);
    }
}

void RGBA2BGRA_neon128(const uint* src, uint* dst, uint count)
{
    uint i = 0;
    for (; i < count - 15; i += 16) {
        uint8x16x4_t tmp = vld4q_u8((const uint8_t*)(src + i));
        uint8x16x4_t tmp2 = { tmp.val[2], tmp.val[1], tmp.val[0], tmp.val[3] };
        vst4q_u8((uint8_t*)(dst + i), tmp2);
    }
    for (; i < count; ++i) {
        dst[i] = src[i] & 0x00ff00ff;
        uint tmp = src[i] & 0xff00ff00;
        dst[i] |= (tmp << 16) | (tmp >> 16);
    }
}

Can you please try a trunk compiler?
I indeed get the extra umovs with a GCC 5 compiler but latest trunk at -O2
-mcpu=generic for me generates the good code for that loop:
ld4 {v4.16b - v7.16b}, [x6]
orr v0.16b, v6.16b, v6.16b
orr v1.16b, v5.16b, v5.16b
orr v2.16b, v4.16b, v4.16b
orr v3.16b, v7.16b, v7.16b
st4 {v0.16b - v3.16b}, [x3]
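
For readers following the thread, the vector loop body can be modelled in portable C. This is only a sketch of what one vld4_u8 / swap / vst4_u8 iteration computes, not code from the bug report: the function name swap02_block and the LANES constant are made up for illustration, and the de-interleaving behaviour of vld4/vst4 is written out by hand. It shows why the ideal code needs nothing between the ld4 and the st4 except register-to-register moves: the whole operation is a reordering of the four channel registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of one iteration of the 64-bit NEON loop. */
enum { LANES = 8 };   /* vld4_u8 handles 8 pixels (32 bytes) at a time */

static void swap02_block(const uint8_t *src, uint8_t *dst)
{
    uint8_t val[4][LANES];   /* stands in for tmp.val[0..3] */

    /* Like vld4_u8: byte c of pixel i lands in lane i of val[c]. */
    for (size_t i = 0; i < LANES; ++i)
        for (size_t c = 0; c < 4; ++c)
            val[c][i] = src[4 * i + c];

    /* Like tmp2 = { val[2], val[1], val[0], val[3] } followed by vst4_u8:
       re-interleave with channels 0 and 2 exchanged. */
    static const size_t order[4] = { 2, 1, 0, 3 };
    for (size_t i = 0; i < LANES; ++i)
        for (size_t c = 0; c < 4; ++c)
            dst[4 * i + c] = val[order[c]][i];
}
```

In this model the four orr (or mov) instructions in the listing correspond to the order[] permutation; the extra umov instructions reported for GCC 5 have no counterpart here, which is what makes them redundant.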

[Bug target/68793] Bad optimization by split-wide-type on NEON

2015-12-08 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

--- Comment #3 from Allan Jensen ---
Created attachment 36959
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36959&action=edit
neon-test-no-split-wide-types.s

[Bug target/68793] Bad optimization by split-wide-type on NEON

2015-12-08 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

--- Comment #6 from Allan Jensen ---
I mean the neon64 case, not 32-bit.

[Bug target/68793] Bad optimization by split-wide-type on NEON

2015-12-08 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

--- Comment #7 from ktkachov at gcc dot gnu.org ---
(In reply to Allan Jensen from comment #6)
> I mean the neon64 case, not 32-bit.

Seems so. I get:
_Z16RGBA2BGRA_neon64PKjPjj:
.LFB3215:
.cfi_startproc
subs w7, w2, #7
mov w5, 0
beq .L4
.p2align 2
.L8:
ubfiz   x3, x5, 2, 32
add w5, w5, 8
add x4, x0, x3
add x3, x1, x3
cmp w5, w7
ld4 {v4.8b - v7.8b}, [x4]
mov v0.8b, v6.8b
mov v1.8b, v5.8b
mov v2.8b, v4.8b
mov v3.8b, v7.8b
st4 {v0.8b - v3.8b}, [x3]
bcc .L8
.L4:
cmp w5, w2
bcs .L10
uxtw x3, w5
sub w2, w2, #1
sub w2, w2, w5
add x5, x3, 1
add x5, x2, x5
lsl x2, x3, 2
lsl x5, x5, 2
.p2align 2
.L7:
ldr w3, [x0, x2]
and w4, w3, 16711935
str w4, [x1, x2]
ldr w3, [x0, x2]
and w3, w3, -16711936
orr w3, w4, w3, ror (32 - 16)
str w3, [x1, x2]
add x2, x2, 4
cmp x2, x5
bne .L7
ret
.L10:
ret
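
As a reading aid for the scalar tail loop at .L7: the decimal immediates are the testcase's masks (16711935 is 0x00ff00ff, -16711936 is 0xff00ff00 as a 32-bit value), and the shift pair (tmp << 16) | (tmp >> 16) on a 32-bit unsigned value is a rotate by 16, which is why it shows up as a single orr with a ror operand. The sketch below checks that equivalence in plain C; the function names are invented for illustration and are not part of the testcase.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits, for 0 < n < 32. */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* Tail loop body as written in the testcase source. */
static uint32_t tail_as_written(uint32_t px)
{
    uint32_t keep = px & 0x00ff00ffu;
    uint32_t tmp  = px & 0xff00ff00u;
    return keep | (tmp << 16) | (tmp >> 16);
}

/* The same computation in the shape of the and / and / orr-with-ror
   sequence from the listing, using the decimal immediates shown there. */
static uint32_t tail_as_compiled(uint32_t px)
{
    uint32_t w4 = px & 16711935u;             /* 0x00ff00ff */
    uint32_t w3 = px & (uint32_t)-16711936;   /* 0xff00ff00 */
    return w4 | ror32(w3, 16);
}
```

For example, both functions map 0x11223344 to 0x33221144.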

[Bug target/68793] Bad optimization by split-wide-type on NEON

2015-12-08 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

--- Comment #1 from Allan Jensen ---
Created attachment 36957
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36957&action=edit
neon-test.cpp

[Bug target/68793] Bad optimization by split-wide-type on NEON

2015-12-08 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

--- Comment #2 from Allan Jensen ---
Created attachment 36958
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36958&action=edit
neon-test-split-wide-types.s

[Bug target/68793] Bad optimization by split-wide-type on NEON

2015-12-08 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68793

--- Comment #5 from Allan Jensen ---
The test case uses C++11 initialization. I haven't tested gcc 6, so if you say
it is solved, I will trust you.

Note that the 32-bit case is also suboptimal in both cases (not affected by
split-wide-types). Is that also fixed in gcc 6?