[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #13 from CVS Commits --- The master branch has been updated by Segher Boessenkool : https://gcc.gnu.org/g:e5502ae72f784470019de5850017ad0c87ffacef commit r11-4805-ge5502ae72f784470019de5850017ad0c87ffacef Author: Segher Boessenkool Date: Fri Nov 6 12:50:35 2020 + rs6000: Fix TARGET_POWERPC64 vs. TARGET_64BIT confusion I gave Ke Wen bad advice, luckily David corrected me: it is true that we cannot use TARGET_POWERPC64 on many 32-bit OSes, since either the kernel or userland does not save the top half of the 64-bit integer registers, but we do not have to care about that in separate patterns or related code. The flag is automatically not enabled by default on targets that do not handle this correctly. This patch fixes it. Segher 2020-11-06 Segher Boessenkool PR target/96933 * config/rs6000/rs6000.c (rs6000_expand_vector_init): Use TARGET_POWERPC64 instead of TARGET_64BIT.
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 Kewen Lin changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #12 from Kewen Lin --- Should be done with latest trunk.
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #11 from CVS Commits --- The master branch has been updated by Kewen Lin : https://gcc.gnu.org/g:025f434a87336e38bf5140fba2005081876aa911 commit r11-4731-g025f434a87336e38bf5140fba2005081876aa911 Author: Kewen Lin Date: Thu Nov 5 00:04:10 2020 -0600 rs6000: Use direct move for char/short vector CTOR [PR96933] This patch is to make vector CTOR with char/short leverage direct move instructions when they are available. With one constructed test case, it can speed up 145% for char and 190% for short on P9. Tested SPEC2017 x264_r at -Ofast on P9, it gets 1.61% speedup (but based on unexpected SLP see PR96789). Bootstrapped/regtested on powerpc64{,le}-linux-gnu P8 and powerpc64le-linux-gnu P9. gcc/ChangeLog: PR target/96933 * config/rs6000/rs6000.c (rs6000_expand_vector_init): Use direct move instructions for vector construction with char/short types. * config/rs6000/rs6000.md (p8_mtvsrwz_v16qisi2): New define_insn. (p8_mtvsrd_v16qidi2): Likewise. gcc/testsuite/ChangeLog: PR target/96933 * gcc.target/powerpc/pr96933-1.c: New test. * gcc.target/powerpc/pr96933-2.c: New test. * gcc.target/powerpc/pr96933-3.c: New test. * gcc.target/powerpc/pr96933-4.c: New test. * gcc.target/powerpc/pr96933.h: New test. * gcc.target/powerpc/pr96933-run.h: New test.
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #10 from Kewen Lin --- (In reply to Segher Boessenkool from comment #9) > I'm not sure what you mean. > > vmrglb merges the vectors > abcdefghijklmnop > and > ABCDEFGHIJKLMNOP > to > iIjJkKlLmMnNoOpP > > ... ah, I see what you mean I guess. > > So, use something else instead? How about vpku*um? > > First vpkudum, xforming > xxxAxxxB > and > xxxCxxxD > into > xxxAxxxBxxxCxxxD > > and then vpkuwum: > xxxAxxxBxxxCxxxD > and > xxxExxxFxxxGxxxH > into > xAxBxCxDxExFxGxH > > and finally vpkuhum: > xAxBxCxDxExFxGxH > and > xIxJxKxLxMxNxOxP > into > ABCDEFGHIJKLMNOP > > ? Great, it works! Thanks for the advice. By testing, for type char, it's on par with the artificial control vector version, 7.30s vs. 7.28s, while for type short, it's better, 28.66s vs. 31.52s. Will update the sent patch to V2.
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #9 from Segher Boessenkool --- I'm not sure what you mean. vmrglb merges the vectors abcdefghijklmnop and ABCDEFGHIJKLMNOP to iIjJkKlLmMnNoOpP ... ah, I see what you mean I guess. So, use something else instead? How about vpku*um? First vpkudum, xforming xxxAxxxB and xxxCxxxD into xxxAxxxBxxxCxxxD and then vpkuwum: xxxAxxxBxxxCxxxD and xxxExxxFxxxGxxxH into xAxBxCxDxExFxGxH and finally vpkuhum: xAxBxCxDxExFxGxH and xIxJxKxLxMxNxOxP into ABCDEFGHIJKLMNOP ?
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #8 from Kewen Lin --- (In reply to Segher Boessenkool from comment #7) > There are vmrglb and vrghb etc.? But these are only for low/high part separately, with mtvsrdd both low/high parts (doubleword) have the values, we don't have Vector Merge Even/Odd for char or short to merge them. Now I used one artificial control vector for the merging, correct me if I miss something.
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #7 from Segher Boessenkool --- There are vmrglb and vrghb etc.?
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #6 from Kewen Lin --- (In reply to Kewen Lin from comment #5) > (In reply to Segher Boessenkool from comment #4) > > Yes, timing suggests there is some SHL/LHS flush. > > > > On p9 and later we can use mtvsrdd instead of mtvsrd (moving two > > bytes into place at one), which reduces the number of moves from > > 16 to 8, and the number of merges from 15 to 7 (and reduces path > > length by 1). This sounds like a no-brainer win with that :-) > > Good idea, it looks better on P9. One thing to double confirm, currently > there are no instructions like vmrgob and vmrgoh, so of the mergings you > mentioned from vector bytes to vector short and vector short to vector word > needs artificial control vector? Improve the patch to support mtvsrdd, the asm for char looks like: : 0: 00 00 4c 3c addis r2,r12,0 0: R_PPC64_REL16_HA .TOC. 4: 00 00 42 38 addir2,r2,0 4: R_PPC64_REL16_LO .TOC.+0x4 8: e8 ff a1 fb std r29,-24(r1) c: 00 00 a2 3f addis r29,r2,0 c: R_PPC64_TOC16_HA .rodata.cst16 10: f0 ff c1 fb std r30,-16(r1) 14: f8 ff e1 fb std r31,-8(r1) 18: 67 1b 24 7c mtvsrdd vs33,r4,r3 1c: 67 3b 28 7d mtvsrdd vs41,r8,r7 20: 68 00 c1 8b lbz r30,104(r1) 24: 78 00 e1 8b lbz r31,120(r1) 28: 00 00 bd 3b addir29,r29,0 28: R_PPC64_TOC16_LO.rodata.cst16 2c: 60 00 81 89 lbz r12,96(r1) 30: 70 00 61 89 lbz r11,112(r1) 34: 80 00 81 88 lbz r4,128(r1) 38: 88 00 61 88 lbz r3,136(r1) 3c: 90 00 01 89 lbz r8,144(r1) 40: 98 00 e1 88 lbz r7,152(r1) 44: 67 2b 46 7c mtvsrdd vs34,r6,r5 48: 67 4b aa 7d mtvsrdd vs45,r10,r9 4c: 09 00 9d f5 lxv vs44,0(r29) 50: 67 63 5e 7d mtvsrdd vs42,r30,r12 54: 67 5b 1f 7c mtvsrdd vs32,r31,r11 58: e8 ff a1 eb ld r29,-24(r1) 5c: f0 ff c1 eb ld r30,-16(r1) 60: 67 23 63 7d mtvsrdd vs43,r3,r4 64: f8 ff e1 eb ld r31,-8(r1) 68: 3b 0b 42 10 vpermr v2,v2,v1,v12 6c: 67 43 27 7c mtvsrdd vs33,r7,r8 70: 3b 4b ad 11 vpermr v13,v13,v9,v12 74: 3b 53 00 10 vpermr v0,v0,v10,v12 78: 3b 5b 21 10 vpermr v1,v1,v11,v12 7c: 97 11 4d f0 xxmrglw vs34,vs45,vs34 80: 97 01 01 f0 xxmrglw vs32,vs33,vs32 84: 57 13 40 f0 xxmrgld vs34,vs32,vs34 88: 20 00 80 4e blr For: 1) mtvsrdd under TARGET_DIRECT_MOVE_128 2) mtvsrd under TARGET_DIRECT_MOVE 3) original The time evaluation on Power9 looks like 1) 7.28s 2) 7.41s 3) 18.19s
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #5 from Kewen Lin --- (In reply to Segher Boessenkool from comment #4) > Yes, timing suggests there is some SHL/LHS flush. > > On p9 and later we can use mtvsrdd instead of mtvsrd (moving two > bytes into place at one), which reduces the number of moves from > 16 to 8, and the number of merges from 15 to 7 (and reduces path > length by 1). This sounds like a no-brainer win with that :-) Good idea, it looks better on P9. One thing to double confirm, currently there are no instructions like vmrgob and vmrgoh, so of the mergings you mentioned from vector bytes to vector short and vector short to vector word needs artificial control vector?
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #4 from Segher Boessenkool --- Yes, timing suggests there is some SHL/LHS flush. On p9 and later we can use mtvsrdd instead of mtvsrd (moving two bytes into place at one), which reduces the number of moves from 16 to 8, and the number of merges from 15 to 7 (and reduces path length by 1). This sounds like a no-brainer win with that :-)
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #3 from Richard Biener --- very likely the byte stores and then the following vector load will also trigger STLF issues.
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #2 from Kewen Lin --- (In reply to Segher Boessenkool from comment #1) > Is that actually faster though? The original has shorter dependency > chains. Or is this to avoid some LHS/SHL? Yes, I tested it with one constructed case, the original version takes 18.20s while the optimized version takes 8.40s. And yes, I guess it's due to LHS/SHL similar to the vec_insert issue xionghu is working on.
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 --- Comment #1 from Segher Boessenkool --- Is that actually faster though? The original has shorter dependency chains. Or is this to avoid some LHS/SHL?
[Bug target/96933] rs6000: inefficient code for char/short vec CTOR
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933 Kewen Lin changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED CC||bergner at gcc dot gnu.org, ||linkw at gcc dot gnu.org, ||segher at gcc dot gnu.org, ||wschmidt at gcc dot gnu.org Summary|inefficient code for|rs6000: inefficient code |char/short vec CTOR |for char/short vec CTOR Last reconfirmed||2020-09-04 Target||powerpc Ever confirmed|0 |1 Assignee|unassigned at gcc dot gnu.org |linkw at gcc dot gnu.org