[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-11-06 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #13 from CVS Commits  ---
The master branch has been updated by Segher Boessenkool :

https://gcc.gnu.org/g:e5502ae72f784470019de5850017ad0c87ffacef

commit r11-4805-ge5502ae72f784470019de5850017ad0c87ffacef
Author: Segher Boessenkool 
Date:   Fri Nov 6 12:50:35 2020 +

rs6000: Fix TARGET_POWERPC64 vs. TARGET_64BIT confusion

I gave Ke Wen bad advice, luckily David corrected me: it is true that we
cannot use TARGET_POWERPC64 on many 32-bit OSes, since either the kernel
or userland does not save the top half of the 64-bit integer registers,
but we do not have to care about that in separate patterns or related
code.  The flag is automatically not enabled by default on targets that
do not handle this correctly.

This patch fixes it.

Segher

2020-11-06  Segher Boessenkool  

PR target/96933
* config/rs6000/rs6000.c (rs6000_expand_vector_init): Use
TARGET_POWERPC64 instead of TARGET_64BIT.

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-11-05 Thread linkw at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

Kewen Lin  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Kewen Lin  ---
Should be done with latest trunk.

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-11-05 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #11 from CVS Commits  ---
The master branch has been updated by Kewen Lin :

https://gcc.gnu.org/g:025f434a87336e38bf5140fba2005081876aa911

commit r11-4731-g025f434a87336e38bf5140fba2005081876aa911
Author: Kewen Lin 
Date:   Thu Nov 5 00:04:10 2020 -0600

rs6000: Use direct move for char/short vector CTOR [PR96933]

This patch is to make vector CTOR with char/short leverage direct
move instructions when they are available.  With one constructed
test case, it can speed up 145% for char and 190% for short on P9.

Tested SPEC2017 x264_r at -Ofast on P9, it gets 1.61% speedup
(but based on unexpected SLP see PR96789).

Bootstrapped/regtested on powerpc64{,le}-linux-gnu P8 and
powerpc64le-linux-gnu P9.

gcc/ChangeLog:

PR target/96933
* config/rs6000/rs6000.c (rs6000_expand_vector_init): Use direct
move
instructions for vector construction with char/short types.
* config/rs6000/rs6000.md (p8_mtvsrwz_v16qisi2): New define_insn.
(p8_mtvsrd_v16qidi2): Likewise.

gcc/testsuite/ChangeLog:

PR target/96933
* gcc.target/powerpc/pr96933-1.c: New test.
* gcc.target/powerpc/pr96933-2.c: New test.
* gcc.target/powerpc/pr96933-3.c: New test.
* gcc.target/powerpc/pr96933-4.c: New test.
* gcc.target/powerpc/pr96933.h: New test.
* gcc.target/powerpc/pr96933-run.h: New test.

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-08 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #10 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #9)
> I'm not sure what you mean.
> 
> vmrglb merges the vectors
>   abcdefghijklmnop
> and
>   ABCDEFGHIJKLMNOP
> to
>   iIjJkKlLmMnNoOpP
> 
> ... ah, I see what you mean I guess.
> 
> So, use something else instead?  How about vpku*um?
> 
> First vpkudum, xforming
>   xxxAxxxB
> and
>   xxxCxxxD
> into
>   xxxAxxxBxxxCxxxD
> 
> and then vpkuwum:
>   xxxAxxxBxxxCxxxD
> and
>   xxxExxxFxxxGxxxH
> into
>   xAxBxCxDxExFxGxH
> 
> and finally vpkuhum:
>   xAxBxCxDxExFxGxH
> and
>   xIxJxKxLxMxNxOxP
> into
>   ABCDEFGHIJKLMNOP
> 
> ?

Great, it works! Thanks for the advice. By testing, for type char, it's on par
with the artificial control vector version, 7.30s vs. 7.28s, while for type
short, it's better, 28.66s vs. 31.52s. Will update the sent patch to V2.

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-08 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #9 from Segher Boessenkool  ---
I'm not sure what you mean.

vmrglb merges the vectors
  abcdefghijklmnop
and
  ABCDEFGHIJKLMNOP
to
  iIjJkKlLmMnNoOpP

... ah, I see what you mean I guess.

So, use something else instead?  How about vpku*um?

First vpkudum, xforming
  xxxAxxxB
and
  xxxCxxxD
into
  xxxAxxxBxxxCxxxD

and then vpkuwum:
  xxxAxxxBxxxCxxxD
and
  xxxExxxFxxxGxxxH
into
  xAxBxCxDxExFxGxH

and finally vpkuhum:
  xAxBxCxDxExFxGxH
and
  xIxJxKxLxMxNxOxP
into
  ABCDEFGHIJKLMNOP

?

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-07 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #8 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #7)
> There are vmrglb and vrghb etc.?

But these are only for low/high part separately, with mtvsrdd both low/high
parts (doubleword) have the values, we don't have Vector Merge Even/Odd for
char or short to merge them. Now I used one artificial control vector for the
merging, correct me if I miss something.

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-07 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #7 from Segher Boessenkool  ---
There are vmrglb and vrghb etc.?

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-07 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #6 from Kewen Lin  ---
(In reply to Kewen Lin from comment #5)
> (In reply to Segher Boessenkool from comment #4)
> > Yes, timing suggests there is some SHL/LHS flush.
> > 
> > On p9 and later we can use mtvsrdd instead of mtvsrd (moving two
> > bytes into place at one), which reduces the number of moves from
> > 16 to 8, and the number of merges from 15 to 7 (and reduces path
> > length by 1).  This sounds like a no-brainer win with that :-)
> 
> Good idea, it looks better on P9. One thing to double confirm, currently
> there are no instructions like vmrgob and vmrgoh, so of the mergings you
> mentioned from vector bytes to vector short and vector short to vector word
> needs artificial control vector?

Improve the patch to support mtvsrdd, the asm for char looks like:

 :
   0:   00 00 4c 3c addis   r2,r12,0
0: R_PPC64_REL16_HA .TOC.
   4:   00 00 42 38 addir2,r2,0
4: R_PPC64_REL16_LO .TOC.+0x4
   8:   e8 ff a1 fb std r29,-24(r1)
   c:   00 00 a2 3f addis   r29,r2,0
c: R_PPC64_TOC16_HA .rodata.cst16
  10:   f0 ff c1 fb std r30,-16(r1)
  14:   f8 ff e1 fb std r31,-8(r1)
  18:   67 1b 24 7c mtvsrdd vs33,r4,r3
  1c:   67 3b 28 7d mtvsrdd vs41,r8,r7
  20:   68 00 c1 8b lbz r30,104(r1)
  24:   78 00 e1 8b lbz r31,120(r1)
  28:   00 00 bd 3b addir29,r29,0
28: R_PPC64_TOC16_LO.rodata.cst16
  2c:   60 00 81 89 lbz r12,96(r1)
  30:   70 00 61 89 lbz r11,112(r1)
  34:   80 00 81 88 lbz r4,128(r1)
  38:   88 00 61 88 lbz r3,136(r1)
  3c:   90 00 01 89 lbz r8,144(r1)
  40:   98 00 e1 88 lbz r7,152(r1)
  44:   67 2b 46 7c mtvsrdd vs34,r6,r5
  48:   67 4b aa 7d mtvsrdd vs45,r10,r9
  4c:   09 00 9d f5 lxv vs44,0(r29)
  50:   67 63 5e 7d mtvsrdd vs42,r30,r12
  54:   67 5b 1f 7c mtvsrdd vs32,r31,r11
  58:   e8 ff a1 eb ld  r29,-24(r1)
  5c:   f0 ff c1 eb ld  r30,-16(r1)
  60:   67 23 63 7d mtvsrdd vs43,r3,r4
  64:   f8 ff e1 eb ld  r31,-8(r1)
  68:   3b 0b 42 10 vpermr  v2,v2,v1,v12
  6c:   67 43 27 7c mtvsrdd vs33,r7,r8
  70:   3b 4b ad 11 vpermr  v13,v13,v9,v12
  74:   3b 53 00 10 vpermr  v0,v0,v10,v12
  78:   3b 5b 21 10 vpermr  v1,v1,v11,v12
  7c:   97 11 4d f0 xxmrglw vs34,vs45,vs34
  80:   97 01 01 f0 xxmrglw vs32,vs33,vs32
  84:   57 13 40 f0 xxmrgld vs34,vs32,vs34
  88:   20 00 80 4e blr

For:
  1) mtvsrdd under TARGET_DIRECT_MOVE_128
  2) mtvsrd under  TARGET_DIRECT_MOVE
  3) original

The time evaluation on Power9 looks like
  1) 7.28s
  2) 7.41s
  3) 18.19s

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-06 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #5 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #4)
> Yes, timing suggests there is some SHL/LHS flush.
> 
> On p9 and later we can use mtvsrdd instead of mtvsrd (moving two
> bytes into place at one), which reduces the number of moves from
> 16 to 8, and the number of merges from 15 to 7 (and reduces path
> length by 1).  This sounds like a no-brainer win with that :-)

Good idea, it looks better on P9. One thing to double confirm, currently there
are no instructions like vmrgob and vmrgoh, so of the mergings you mentioned
from vector bytes to vector short and vector short to vector word needs
artificial control vector?

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-04 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #4 from Segher Boessenkool  ---
Yes, timing suggests there is some SHL/LHS flush.

On p9 and later we can use mtvsrdd instead of mtvsrd (moving two
bytes into place at one), which reduces the number of moves from
16 to 8, and the number of merges from 15 to 7 (and reduces path
length by 1).  This sounds like a no-brainer win with that :-)

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-04 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #3 from Richard Biener  ---
very likely the byte stores and then the following vector load will also
trigger
STLF issues.

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-04 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #2 from Kewen Lin  ---
(In reply to Segher Boessenkool from comment #1)
> Is that actually faster though?  The original has shorter dependency
> chains.  Or is this to avoid some LHS/SHL?

Yes, I tested it with one constructed case, the original version takes 18.20s
while the optimized version takes 8.40s. And yes, I guess it's due to LHS/SHL
similar to the vec_insert issue xionghu is working on.

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-04 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

--- Comment #1 from Segher Boessenkool  ---
Is that actually faster though?  The original has shorter dependency
chains.  Or is this to avoid some LHS/SHL?

[Bug target/96933] rs6000: inefficient code for char/short vec CTOR

2020-09-04 Thread linkw at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933

Kewen Lin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
 CC||bergner at gcc dot gnu.org,
   ||linkw at gcc dot gnu.org,
   ||segher at gcc dot gnu.org,
   ||wschmidt at gcc dot gnu.org
Summary|inefficient code for|rs6000: inefficient code
   |char/short vec CTOR |for char/short vec CTOR
   Last reconfirmed||2020-09-04
 Target||powerpc
 Ever confirmed|0   |1
   Assignee|unassigned at gcc dot gnu.org  |linkw at gcc dot gnu.org