[Bug target/88013] can't vectorize rgb to grayscale conversion code

2021-12-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

[Bug target/88013] can't vectorize rgb to grayscale conversion code

2019-05-26 Thread hoganmeier at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

--- Comment #9 from krux ---
(In reply to ktkachov from comment #7)
> I tried current trunk (future GCC 9)
> GCC 9 learned to avoid excessive widening during vectorisation, which is
> what accounts for the large number of instructions you see.

Confirmed, the loop is now as described in comment #5 with trunk gcc.
Still with vshr+vmovn as mentioned by Ramana.

By the way, the tail is completely unrolled: the following is repeated 15
times, which seems quite excessive to me:

ldrb    ip, [r1, #1]    @ zero_extendqisi2
movs    r6, #151
ldrb    lr, [r1]        @ zero_extendqisi2
movs    r5, #77
ldrb    r7, [r1, #2]    @ zero_extendqisi2
movs    r4, #28
smulbb  ip, ip, r6
smlabb  lr, r5, lr, ip
add     ip, r3, #1
smlabb  r7, r4, r7, lr
cmp     ip, r2
asr     r7, r7, #8
strb    r7, [r0]
bge     .L1

assert(n >= 16) helps a bit, but n % 16 == 0 doesn't.
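
The source under discussion is only reachable through the godbolt links, so the following is a minimal sketch reconstructed from the scalar tail quoted above: byte 0 is weighted by 77, byte 1 by 151, byte 2 by 28, and the sum is shifted right by 8. The function name `rgb_to_gray`, its signature, and the placement of the assert are assumptions, not the reporter's exact code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reconstruction: the weights 77/151/28 and the shift by 8
   come from the scalar tail quoted above; the name and signature are
   assumptions made for illustration. */
void rgb_to_gray(uint8_t *restrict dst, const uint8_t *restrict src, int n)
{
    /* The hint mentioned in the comment. It aborts when n < 16, so the
       compiler may assume n >= 16 afterwards. It disappears entirely
       when building with -DNDEBUG. */
    assert(n >= 16);

    for (int i = 0; i < n; i++) {
        /* The uint8_t operands are promoted to int, so the sum cannot
           overflow: its maximum is 255 * 256 = 65280. */
        dst[i] = (uint8_t)((77 * src[3 * i]
                          + 151 * src[3 * i + 1]
                          + 28 * src[3 * i + 2]) >> 8);
    }
}
```

With this shape, any scalar tail handles the `n % 16` leftover pixels after the vector loop, which is the part reported as unrolled 15 times.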

[Bug target/88013] can't vectorize rgb to grayscale conversion code

2018-12-14 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

Ramana Radhakrishnan changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-12-14
                 CC|                            |ramana at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #8 from Ramana Radhakrishnan ---

> vshr.u16    q9, q9, #8
> vshr.u16    q8, q8, #8
> vmovn.i16   d20, q9
> vmovn.i16   d21, q8

Isn't that "just" a missing combine pattern to get us vshrn in both backends?

Ramana

[Bug target/88013] can't vectorize rgb to grayscale conversion code

2018-11-14 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

--- Comment #7 from ktkachov at gcc dot gnu.org ---
I tried current trunk (future GCC 9)
GCC 9 learned to avoid excessive widening during vectorisation, which is what
accounts for the large number of instructions you see.
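
A minimal sketch of why the narrower arithmetic is safe, assuming the weights 77, 151 and 28 seen in the scalar code quoted in comment #9: they sum to 256, so the weighted sum never exceeds 255 * 256 = 65280 and fits in an unsigned 16-bit lane. The helper names below are hypothetical. `gray_wide` models the 32-bit sequence (vmull.s16 / vadd.i32) and `gray_narrow` models the 16-bit one (vmull.u8 / vadd.i16).

```c
#include <stdint.h>

/* Weighted sum with 32-bit intermediates. */
static uint8_t gray_wide(uint8_t r, uint8_t g, uint8_t b)
{
    uint32_t sum = 77u * r + 151u * g + 28u * b;
    return (uint8_t)(sum >> 8);
}

/* Same sum, truncated to 16 bits after every step, as a 16-bit lane would be. */
static uint8_t gray_narrow(uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t sum = (uint16_t)(77 * r);
    sum = (uint16_t)(sum + 151 * g);
    sum = (uint16_t)(sum + 28 * b);
    return (uint8_t)(sum >> 8);
}

/* Exhaustively compares both variants over all 2^24 inputs.
   Returns 1 if they always agree, 0 otherwise. */
int widths_agree(void)
{
    for (int r = 0; r < 256; r++)
        for (int g = 0; g < 256; g++)
            for (int b = 0; b < 256; b++)
                if (gray_wide((uint8_t)r, (uint8_t)g, (uint8_t)b) !=
                    gray_narrow((uint8_t)r, (uint8_t)g, (uint8_t)b))
                    return 0;
    return 1;
}
```

Note that the products must be treated as unsigned: 151 * 255 = 38505 fits in 16 unsigned bits but not in a signed 16-bit value.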

[Bug target/88013] can't vectorize rgb to grayscale conversion code

2018-11-14 Thread hoganmeier at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

--- Comment #6 from krux ---
-mfloat-abi=hard was missing indeed. It's a pity there's no warning like when
trying to use the intrinsics.

Still I see a lot more instructions, maybe that got fixed after v7.2?
https://godbolt.org/z/OWzgXi

  vld3.8 {d16, d18, d20}, [r3]
  add ip, r3, #24
  add lr, lr, #1
  add r3, r3, #48
  cmp lr, r5
  vld3.8 {d17, d19, d21}, [ip]
  vmovl.u8 q5, d16
  vmovl.u8 q15, d18
  vmovl.u8 q11, d17
  vmovl.u8 q4, d19
  vmovl.u8 q0, d20
  vmovl.u8 q1, d21
  vmull.s16 q6, d10, d28
  vmull.s16 q3, d22, d28
  vmull.s16 q2, d30, d26
  vmull.s16 q11, d23, d29
  vmull.s16 q15, d31, d27
  vmull.s16 q5, d11, d29
  vmull.s16 q9, d8, d26
  vmull.s16 q8, d9, d27
  vadd.i32 q2, q6, q2
  vadd.i32 q10, q5, q15
  vadd.i32 q9, q3, q9
  vmull.s16 q15, d0, d24
  vadd.i32 q8, q11, q8
  vmull.s16 q3, d2, d24
  vmull.s16 q0, d1, d25
  vmull.s16 q1, d3, d25
  vadd.i32 q11, q2, q15
  vadd.i32 q9, q9, q3
  vadd.i32 q10, q10, q0
  vadd.i32 q8, q8, q1
  vshr.s32 q11, q11, #8
  vshr.s32 q9, q9, #8
  vshr.s32 q10, q10, #8
  vshr.s32 q8, q8, #8
  vmovn.i32 d30, q11
  vmovn.i32 d31, q10
  vmovn.i32 d20, q9
  vmovn.i32 d21, q8
  vmovn.i16 d16, q15
  vmovn.i16 d17, q10
  vst1.8 {q8}, [r4]

[Bug target/88013] can't vectorize rgb to grayscale conversion code

2018-11-14 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ktkachov at gcc dot gnu.org

--- Comment #5 from ktkachov at gcc dot gnu.org ---
I see vectorisation for arm (and aarch64 FWIW):
-O3 -march=armv8-a -mfpu=neon-fp-armv8 -mfloat-abi=hard

gives the loop:
.L4:
mov         r3, lr
add         lr, lr, #48
vld3.8      {d16, d18, d20}, [r3]!
vld3.8      {d17, d19, d21}, [r3]
vmull.u8    q12, d16, d30
vmull.u8    q1, d18, d28
vmull.u8    q2, d19, d29
vmull.u8    q11, d17, d31
vmull.u8    q3, d20, d26
vadd.i16    q12, q12, q1
vmull.u8    q10, d21, d27
vadd.i16    q8, q11, q2
vadd.i16    q9, q12, q3
vadd.i16    q8, q8, q10
vshr.u16    q9, q9, #8
vshr.u16    q8, q8, #8
vmovn.i16   d20, q9
vmovn.i16   d21, q8
vst1.8      {q10}, [ip]!
cmp         ip, r4
bne         .L4

Though of course it's not as tight as the assembly given in the link.

[Bug target/88013] can't vectorize rgb to grayscale conversion code

2018-11-14 Thread hoganmeier at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

--- Comment #4 from krux ---
On x64 indeed both compilers generate a huge amount of code.
https://godbolt.org/z/TH7mqn

[Bug target/88013] can't vectorize rgb to grayscale conversion code

2018-11-14 Thread hoganmeier at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

--- Comment #3 from krux ---
A few NEON instructions are sufficient:
https://web.archive.org/web/20170227190422/http://hilbert-space.de/?p=22

clang seems to generate similar code, see the godbolt links.
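
As a rough illustration of the "few NEON instructions" idea, here is a sketch using standard `arm_neon.h` intrinsics. The weights 77/151/28 and the shift by 8 are assumed from the code discussed in this thread, and the function name `rgb_to_gray_neon` is hypothetical. The NEON path is compiled only when `__ARM_NEON` is defined. On other targets only the scalar loop runs, so the sketch still builds everywhere.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: converts n interleaved RGB pixels to 8-bit gray. */
void rgb_to_gray_neon(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON)
    const uint8x8_t wr = vdup_n_u8(77);
    const uint8x8_t wg = vdup_n_u8(151);
    const uint8x8_t wb = vdup_n_u8(28);

    for (; i + 8 <= n; i += 8) {
        /* vld3 loads 24 bytes and de-interleaves them into R, G, B lanes. */
        uint8x8x3_t rgb = vld3_u8(src + 3 * i);

        /* Widening multiply, then two widening multiply-accumulates. */
        uint16x8_t acc = vmull_u8(rgb.val[0], wr);
        acc = vmlal_u8(acc, rgb.val[1], wg);
        acc = vmlal_u8(acc, rgb.val[2], wb);

        /* vshrn shifts right by 8 and narrows to 8 bits in one step. */
        vst1_u8(dst + i, vshrn_n_u16(acc, 8));
    }
#endif

    /* Scalar loop: handles the remainder, or everything without NEON. */
    for (; i < n; i++) {
        dst[i] = (uint8_t)((77 * src[3 * i]
                          + 151 * src[3 * i + 1]
                          + 28 * src[3 * i + 2]) >> 8);
    }
}
```

The inner loop body is five NEON operations per 8 pixels: one structured load, three multiplies, and one narrowing shift, followed by the store.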

[Bug target/88013] can't vectorize rgb to grayscale conversion code

2018-11-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |arm
             Blocks|                            |53947

--- Comment #2 from Richard Biener ---
On x86_64 we manage to vectorize this with quite abysmal code (for core-avx2)
with a vectorization factor of 32:

.L4:
vmovdqu       (%rax), %ymm1
vmovdqu       64(%rax), %ymm4
addq          $32, %rcx
addq          $96, %rax
vmovdqu       -64(%rax), %ymm5
vpshufb       %ymm14, %ymm1, %ymm0
vpermq        $78, %ymm0, %ymm2
vpshufb       %ymm13, %ymm1, %ymm0
vpshufb       %ymm12, %ymm5, %ymm3
vpor          %ymm2, %ymm0, %ymm0
vpshufb       %ymm11, %ymm4, %ymm2
vpor          %ymm3, %ymm0, %ymm0
vpermq        $78, %ymm2, %ymm3
vpshufb       .LC5(%rip), %ymm4, %ymm2
vpshufb       .LC4(%rip), %ymm0, %ymm0
vpor          %ymm3, %ymm2, %ymm2
vpshufb       .LC6(%rip), %ymm1, %ymm3
vpermq        $78, %ymm3, %ymm15
vpor          %ymm2, %ymm0, %ymm0
vpshufb       .LC7(%rip), %ymm1, %ymm3
vpshufb       .LC8(%rip), %ymm5, %ymm2
vpor          %ymm15, %ymm3, %ymm3
vpshufb       .LC11(%rip), %ymm4, %ymm15
vpshufb       .LC14(%rip), %ymm5, %ymm5
vpor          %ymm2, %ymm3, %ymm3
vpshufb       .LC9(%rip), %ymm4, %ymm2
vpermq        $78, %ymm2, %ymm2
vpshufb       %ymm10, %ymm3, %ymm3
vpor          %ymm2, %ymm15, %ymm2
vpor          %ymm2, %ymm3, %ymm3
vpshufb       .LC12(%rip), %ymm1, %ymm2
vpshufb       .LC13(%rip), %ymm1, %ymm1
vpermq        $78, %ymm2, %ymm2
vpor          %ymm2, %ymm1, %ymm2
vpshufb       .LC15(%rip), %ymm4, %ymm1
vpshufb       .LC16(%rip), %ymm4, %ymm4
vpermq        $78, %ymm1, %ymm1
vpor          %ymm5, %ymm2, %ymm2
vpor          %ymm1, %ymm4, %ymm4
vpshufb       %ymm10, %ymm2, %ymm2
vpmovzxbw     %xmm0, %ymm1
vpor          %ymm4, %ymm2, %ymm2
vpmovzxbw     %xmm3, %ymm4
vextracti128  $0x1, %ymm0, %xmm0
vpmullw       %ymm7, %ymm4, %ymm4
vpmullw       %ymm8, %ymm1, %ymm1
vextracti128  $0x1, %ymm3, %xmm3
vpmovzxbw     %xmm0, %ymm0
vpmovzxbw     %xmm3, %ymm3
vpmullw       %ymm8, %ymm0, %ymm0
vpmullw       %ymm7, %ymm3, %ymm3
vpaddw        %ymm4, %ymm1, %ymm1
vpmovzxbw     %xmm2, %ymm4
vextracti128  $0x1, %ymm2, %xmm2
vpmovzxbw     %xmm2, %ymm2
vpmullw       %ymm6, %ymm4, %ymm4
vpmullw       %ymm6, %ymm2, %ymm2
vpaddw        %ymm3, %ymm0, %ymm0
vpaddw        %ymm4, %ymm1, %ymm1
vpaddw        %ymm2, %ymm0, %ymm0
vpsrlw        $8, %ymm1, %ymm1
vpsrlw        $8, %ymm0, %ymm0
vpand         %ymm1, %ymm9, %ymm1
vpand         %ymm0, %ymm9, %ymm0
vpackuswb     %ymm0, %ymm1, %ymm0
vpermq        $216, %ymm0, %ymm0
vmovdqu       %ymm0, -32(%rcx)
cmpq          %r8, %rcx
jne           .L4


Maybe you can post what you think arm can do better here?


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug target/88013] can't vectorize rgb to grayscale conversion code

2018-11-13 Thread hoganmeier at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88013

--- Comment #1 from krux ---
Something like -march=armv8-a -mfpu=neon-fp-armv8 does not work either.
https://godbolt.org/z/MpBQ0I