Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
Committed. Regards, Dave On 2020-08-22 7:24 p.m., Roger Sayle wrote: > Hi Dave, > I actually think using plus_xor_ior operator is useful. It means that if > combine, > inlining or some other RTL simplification generates these variants, these > forms > will still be recognized by the backend. It's more typing, but the compiler > produces > better code. > > Here's what I have so far, but please feel free to modify anything. I'll > leave the > rest to you. > > With this patch: > > unsigned long long rotl4(unsigned long long x) > { > return (x<<4) | (x>>60); > } > > unsigned long long rotr4(unsigned long long x) > { > return (x<<60) | (x>>4); > } > > which previously generated: > > rotl4:depd,z %r26,59,60,%r28 > extrd,u %r26,3,4,%r26 > bve (%r2) > or %r26,%r28,%r28 > > rotr4:extrd,u %r26,59,60,%r28 > depd,z %r26,3,4,%r26 > bve (%r2) > or %r26,%r28,%r28 > > now produces: > > rotl4:bve (%r2) > shrpd %r26,%r26,60,%r28 > > rotr4:bve (%r2) > shrpd %r26,%r26,4,%r28 > > > I'm guessing this is very similar to what you were thinking (or what I > described previously). > > Many thanks again for trying out these patches/suggestions for me. > > Best regards, > Roger > -- > > -Original Message- > From: John David Anglin > Sent: 22 August 2020 23:09 > To: Roger Sayle ; 'GCC Patches' > > Cc: 'Jeff Law' > Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT > > On 2020-08-22 12:01 p.m., Roger Sayle wrote: >> I suspect that the issue with the 64-bit patterns is that the second >> variant of pa.md's define_insn "shrpdi4" is unlikely ever to match as >> (minus:DI (const_int 64) x) is never "canonical" when x is itself a >> CONST_INT. Splitting this define_insn into two (or three see below) >> separate forms; the first as it currently is and the second (as you >> suggest) with >> "TARGET_64BIT >>&& INTVAL (operands[3]) + INTVAL (operands[4]) == 64" >> should do the trick. > I will go ahead and add the basic patterns. It seems it would be best if I > avoid using the "plus_xor_ior_operator". It also seems the 32-bit patterns > should avoid it. >> My first impression was that the DImode shrpd instructions would be >> most useful for implementing TI mode shifts, but that TI mode isn't >> supported by hppa64. But then I noticed that the more immediate >> benefit would be in supporting rotrdi3 and rotldi3 on TARGET_64BIT >> that currently don't have expanders nor insns defined. Here GCC >> currently generates three instructions where a single shrpd would be >> optimal. > It turns out we now need to support TI mode and __int128 for libgomp. The > hppa64-hpux target won't boot without it. I had just added a change to > support TI mode but it's untested. > > Regards, > Dave > > -- > John David Anglin dave.ang...@bell.net > -- John David Anglin dave.ang...@bell.net
Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
On Thu, Aug 27, 2020 at 9:17 AM Roger Sayle wrote: > > > >On 2020-08-26 5:23 p.m., Roger Sayle wrote: > >> These more accurate target rtx_costs are used by the > >> gimple-ssa-strength-reduction.c (via a call to mult_by_coeff_cost) to > >> decide whether applying strength reduction would be profitable. This test > >> case, slsr-13.c, assumes that two multiplications by four are > >> cheaper than two multiplications by five. (I believe) This is not the > >> case on hppa which > >> has a sh2add instruction, that performs a multiplication by five in > >> one cycle, or exactly the same cost as performing a left shift by two > >> (i.e. a multiplication by four). Oddly, I also believe this isn't the > >> case on x86_64, where the similar lea instruction is (sometimes) as > >> efficient as left shift by two bits. > >This looks like a regression. > > > >gcc-10 (prepatch): > > > >addl %r25,%r26,%r28 > >sh2addl %r25,%r28,%r25 > >sh2addl %r26,%r28,%r26 > >addl %r26,%r28,%r28 > >bv %r0(%r2) > >addl %r28,%r25,%r28 > > > > [local count: 1073741824]: > > x1_4 = c_2(D) + s_3(D); > > slsr_11 = s_3(D) * 4; > > x2_6 = x1_4 + slsr_11; > > slsr_12 = c_2(D) * 4; > > x3_8 = x1_4 + slsr_12; > > _1 = x1_4 + x2_6; > > x_9 = _1 + x3_8; > > return x_9; > > > >gcc-11 (with patch): > > > >addl %r25,%r26,%r19 > >sh2addl %r26,%r26,%r28 > >addl %r28,%r25,%r28 > >sh2addl %r25,%r25,%r25 > >addl %r28,%r19,%r28 > >addl %r25,%r26,%r26 > >bv %r0(%r2) > >addl %r28,%r26,%r28 > > > > [local count: 1073741824]: > > x1_4 = c_2(D) + s_3(D); > > a2_5 = s_3(D) * 5; > > x2_6 = c_2(D) + a2_5; > > a3_7 = c_2(D) * 5; > > x3_8 = s_3(D) + a3_7; > > _1 = x1_4 + x2_6; > > x_9 = _1 + x3_8; > > return x_9; > > > > Regards, > > Dave > > There are two interesting (tree-optimization) observations here. The first > is that at the tree-ssa > level both of these gimple sequences look to have exactly the same cost, > seven assignments on > a target where *4 is the same cost as *5. The gimple doesn't attempt to > model the sh?add/lea > instructions that combine may find, so at RTL expansion both sequences look > equivalent. One > fix may be to have gimple-ssa-strength-reduction.c just prefer > multiplications by 2, 4 and 8, > even on targets that have a single cycle "mul" instruction. > > The second observation is why isn't tree-ssa-reassoc.c doing something here. > The test case > is evaluating (s+c)+(s+5*c)+(5*s+c), and this strength reduction test is > expecting this to turn > into "tmp=s+c; return tmp+(tmp+4*c)+(4*s+tmp" which is clever and an > improvement, but > overlooks the obvious reassociation 7*(s+c). Indeed LLVM does this in three > instructions: reassoc doesn't work on signed types > > tmp1 = s+c; > tmp2 = tmp1<<3; > return tmp2-tmp1; > > Although the PA backend is (mostly) innocent in this, the lowest impact > fix/work around is > to have multiplications by 2, 4 and 8 return COSTS_N_INSNS(1)-1, to indicate > a preference > when splitting ties. I'll prepare a patch. > > Roger > -- > >
RE: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
>On 2020-08-26 5:23 p.m., Roger Sayle wrote: >> These more accurate target rtx_costs are used by the >> gimple-ssa-strength-reduction.c (via a call to mult_by_coeff_cost) to >> decide whether applying strength reduction would be profitable. This test >> case, slsr-13.c, assumes that two multiplications by four are >> cheaper than two multiplications by five. (I believe) This is not the case >> on hppa which >> has a sh2add instruction, that performs a multiplication by five in >> one cycle, or exactly the same cost as performing a left shift by two >> (i.e. a multiplication by four). Oddly, I also believe this isn't the >> case on x86_64, where the similar lea instruction is (sometimes) as >> efficient as left shift by two bits. >This looks like a regression. > >gcc-10 (prepatch): > >addl %r25,%r26,%r28 >sh2addl %r25,%r28,%r25 >sh2addl %r26,%r28,%r26 >addl %r26,%r28,%r28 >bv %r0(%r2) >addl %r28,%r25,%r28 > > [local count: 1073741824]: > x1_4 = c_2(D) + s_3(D); > slsr_11 = s_3(D) * 4; > x2_6 = x1_4 + slsr_11; > slsr_12 = c_2(D) * 4; > x3_8 = x1_4 + slsr_12; > _1 = x1_4 + x2_6; > x_9 = _1 + x3_8; > return x_9; > >gcc-11 (with patch): > >addl %r25,%r26,%r19 >sh2addl %r26,%r26,%r28 >addl %r28,%r25,%r28 >sh2addl %r25,%r25,%r25 >addl %r28,%r19,%r28 >addl %r25,%r26,%r26 >bv %r0(%r2) >addl %r28,%r26,%r28 > > [local count: 1073741824]: > x1_4 = c_2(D) + s_3(D); > a2_5 = s_3(D) * 5; > x2_6 = c_2(D) + a2_5; > a3_7 = c_2(D) * 5; > x3_8 = s_3(D) + a3_7; > _1 = x1_4 + x2_6; > x_9 = _1 + x3_8; > return x_9; > > Regards, > Dave There are two interesting (tree-optimization) observations here. The first is that at the tree-ssa level both of these gimple sequences look to have exactly the same cost, seven assignments on a target where *4 is the same cost as *5. The gimple doesn't attempt to model the sh?add/lea instructions that combine may find, so at RTL expansion both sequences look equivalent. One fix may be to have gimple-ssa-strength-reduction.c just prefer multiplications by 2, 4 and 8, even on targets that have a single cycle "mul" instruction. The second observation is why isn't tree-ssa-reassoc.c doing something here. The test case is evaluating (s+c)+(s+5*c)+(5*s+c), and this strength reduction test is expecting this to turn into "tmp=s+c; return tmp+(tmp+4*c)+(4*s+tmp" which is clever and an improvement, but overlooks the obvious reassociation 7*(s+c). Indeed LLVM does this in three instructions: tmp1 = s+c; tmp2 = tmp1<<3; return tmp2-tmp1; Although the PA backend is (mostly) innocent in this, the lowest impact fix/work around is to have multiplications by 2, 4 and 8 return COSTS_N_INSNS(1)-1, to indicate a preference when splitting ties. I'll prepare a patch. Roger --
Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
On 2020-08-26 5:23 p.m., Roger Sayle wrote: > These more accurate target rtx_costs are used by the > gimple-ssa-strength-reduction.c > (via a call to mult_by_coeff_cost) to decide whether applying strength > reduction would > be profitable. This test case, slsr-13.c, assumes that two multiplications > by four are > cheaper than two multiplications by five. (I believe) This is not the case > on hppa which > has a sh2add instruction, that performs a multiplication by five in one > cycle, or exactly > the same cost as performing a left shift by two (i.e. a multiplication by > four). Oddly, I > also believe this isn't the case on x86_64, where the similar lea instruction > is (sometimes) > as efficient as left shift by two bits. This looks like a regression. gcc-10 (prepatch): addl %r25,%r26,%r28 sh2addl %r25,%r28,%r25 sh2addl %r26,%r28,%r26 addl %r26,%r28,%r28 bv %r0(%r2) addl %r28,%r25,%r28 [local count: 1073741824]: x1_4 = c_2(D) + s_3(D); slsr_11 = s_3(D) * 4; x2_6 = x1_4 + slsr_11; slsr_12 = c_2(D) * 4; x3_8 = x1_4 + slsr_12; _1 = x1_4 + x2_6; x_9 = _1 + x3_8; return x_9; gcc-11 (with patch): addl %r25,%r26,%r19 sh2addl %r26,%r26,%r28 addl %r28,%r25,%r28 sh2addl %r25,%r25,%r25 addl %r28,%r19,%r28 addl %r25,%r26,%r26 bv %r0(%r2) addl %r28,%r26,%r28 [local count: 1073741824]: x1_4 = c_2(D) + s_3(D); a2_5 = s_3(D) * 5; x2_6 = c_2(D) + a2_5; a3_7 = c_2(D) * 5; x3_8 = s_3(D) + a3_7; _1 = x1_4 + x2_6; x_9 = _1 + x3_8; return x_9; Regards, Dave -- John David Anglin dave.ang...@bell.net
Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
On Wed, 2020-08-26 at 22:23 +0100, Roger Sayle wrote: > The failure of slsr-13.c is not caused by the patchh3.txt, but the previous > patchh2.txt > that's now on mainline and the gcc-10 branch. That change provided more > accurate > rtx_costs for hppa, and solved the performance problems with synth_mult. > > These more accurate target rtx_costs are used by the > gimple-ssa-strength-reduction.c > (via a call to mult_by_coeff_cost) to decide whether applying strength > reduction would > be profitable. This test case, slsr-13.c, assumes that two multiplications > by four are > cheaper than two multiplications by five. (I believe) This is not the case > on hppa which > has a sh2add instruction, that performs a multiplication by five in one > cycle, or exactly > the same cost as performing a left shift by two (i.e. a multiplication by > four). Oddly, I > also believe this isn't the case on x86_64, where the similar lea instruction > is (sometimes) > as efficient as left shift by two bits. Yea, you can do a multiplication by 5 cheap on the PA. While the x86 can too, I don't think it's as clear cut a win as the PA, so they may not cost the same as a multiply by 4 or left shift by 2. > > I suspect that slsr-13.c should be expected to fail on some platforms > depending upon > a targets instruction set/timings. Sounds like you're right since it depends on mult_by_coeff_cost under the hood :( I presume you or John will xfail it for the PA. > > Unfortunately, to complicate things in our case, it appears that after RTL > optimizations, > performing this strength reduction actually does results in fewer > instructions on the PA, > so it's the right thing to do. I'll need to study the logic in > gimple-ssa-strength to see > how mult_by_coeff cost is being used; cost(x*4) == cost(x*5), but cost(x*4+y) > < cost(x*5+y). Yea, x*4+y is cheaper than x*5+y on the PA. The first is a single sh2add, the second requires an additional add instruction. Jeff
RE: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
The failure of slsr-13.c is not caused by the patchh3.txt, but the previous patchh2.txt that's now on mainline and the gcc-10 branch. That change provided more accurate rtx_costs for hppa, and solved the performance problems with synth_mult. These more accurate target rtx_costs are used by the gimple-ssa-strength-reduction.c (via a call to mult_by_coeff_cost) to decide whether applying strength reduction would be profitable. This test case, slsr-13.c, assumes that two multiplications by four are cheaper than two multiplications by five. (I believe) This is not the case on hppa which has a sh2add instruction, that performs a multiplication by five in one cycle, or exactly the same cost as performing a left shift by two (i.e. a multiplication by four). Oddly, I also believe this isn't the case on x86_64, where the similar lea instruction is (sometimes) as efficient as left shift by two bits. I suspect that slsr-13.c should be expected to fail on some platforms depending upon a targets instruction set/timings. Unfortunately, to complicate things in our case, it appears that after RTL optimizations, performing this strength reduction actually does results in fewer instructions on the PA, so it's the right thing to do. I'll need to study the logic in gimple-ssa-strength to see how mult_by_coeff cost is being used; cost(x*4) == cost(x*5), but cost(x*4+y) < cost(x*5+y). My apologies for the inconvenience. Roger -- -Original Message- From: John David Anglin Sent: 26 August 2020 21:34 To: l...@redhat.com; Roger Sayle ; 'GCC Patches' Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT On 2020-08-26 4:08 p.m., Jeff Law wrote: > It 3-stages, but trips: > Tests that now fail, but worked before (5 tests): > > gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2 > gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2 > gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0 > gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0 In my last Linux build, these failed: FAIL: gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " * 4" 2 FAIL: gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " * 5" 0 The above didn't include patchh3.txt. Seems to be a difference in quoting. I'll check. Linux build is close to running this test. These tests didn't fail on 32-bit hpux without patchh2.txt or in previous Linux build. Dave -- John David Anglin dave.ang...@bell.net
Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
On 2020-08-26 4:08 p.m., Jeff Law wrote: > It 3-stages, but trips: > Tests that now fail, but worked before (5 tests): > > gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2 > gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2 > gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0 > gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0 In my last Linux build, these failed: FAIL: gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " * 4" 2 FAIL: gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " * 5" 0 The above didn't include patchh3.txt. Seems to be a difference in quoting. I'll check. Linux build is close to running this test. These tests didn't fail on 32-bit hpux without patchh2.txt or in previous Linux build. Dave -- John David Anglin dave.ang...@bell.net
Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
On Sun, 2020-08-23 at 00:24 +0100, Roger Sayle wrote: > Hi Dave, > I actually think using plus_xor_ior operator is useful. It means that if > combine, > inlining or some other RTL simplification generates these variants, these > forms > will still be recognized by the backend. It's more typing, but the compiler > produces > better code. > > Here's what I have so far, but please feel free to modify anything. I'll > leave the > rest to you. > > With this patch: > > unsigned long long rotl4(unsigned long long x) > { > return (x<<4) | (x>>60); > } > > unsigned long long rotr4(unsigned long long x) > { > return (x<<60) | (x>>4); > } > > which previously generated: > > rotl4:depd,z %r26,59,60,%r28 > extrd,u %r26,3,4,%r26 > bve (%r2) > or %r26,%r28,%r28 > > rotr4:extrd,u %r26,59,60,%r28 > depd,z %r26,3,4,%r26 > bve (%r2) > or %r26,%r28,%r28 > > now produces: > > rotl4:bve (%r2) > shrpd %r26,%r26,60,%r28 > > rotr4:bve (%r2) > shrpd %r26,%r26,4,%r28 > > > I'm guessing this is very similar to what you were thinking (or what I > described previously). > > Many thanks again for trying out these patches/suggestions for me. So I put this one into the tester overnight. It 3-stages, but trips: Tests that now fail, but worked before (5 tests): gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2 gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2 gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0 gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0 gcc.target/hppa/shadd-2.c scan-assembler-times sh.add 2 I think we've already discussed shadd-2. It's not immediately clear if the slsr failure is due to this change or something different -- the latter seems like a definite possibility as we're changing things as we leave gimple, but the test is checking the result of the gimple optimizers. Your call on how to proceed. I can put additional patches into the tester easily, so if you've got something you want investigated, don't hesitate to reach out. jeff
Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
On 2020-08-25 10:05 a.m., Jeff Law wrote: > On Sat, 2020-08-22 at 09:52 +0100, Roger Sayle wrote: >> Hi Dave, >> >> It's great to hear from you. It's been a long while. >> >> Sorry, doh! yes, there's a mistake in my patch (that I introduced when I >> renumbered >> the operands in the shd's define_expand to be the more logical 0, 1, 2, 3, >> then 4). >> Sorry for the inconvenience [due to my lack of familiarity with PA-RISC >> assembly]. >> Hopefully you should get much better mileage out of the attached revision. > FWIW this version 3-staged in my tester and is currently running the > testsuite. > It should finish testing later today. > > If David doesn't have objections, then I suggest going forward with the patch > independent of the others you've got queued up. The patch is fine. Testing completed this morning on hpux and there are no regressions. Please install on trunk and gcc-10 branch. Thanks very much, Dave -- John David Anglin dave.ang...@bell.net
Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
On Sat, 2020-08-22 at 09:52 +0100, Roger Sayle wrote: > Hi Dave, > > It's great to hear from you. It's been a long while. > > Sorry, doh! yes, there's a mistake in my patch (that I introduced when I > renumbered > the operands in the shd's define_expand to be the more logical 0, 1, 2, 3, > then 4). > Sorry for the inconvenience [due to my lack of familiarity with PA-RISC > assembly]. > Hopefully you should get much better mileage out of the attached revision. FWIW this version 3-staged in my tester and is currently running the testsuite. It should finish testing later today. If David doesn't have objections, then I suggest going forward with the patch independent of the others you've got queued up. jeff >
Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
On Fri, 2020-08-21 at 13:53 +0100, Roger Sayle wrote: > This patch improves the code generated on PA-RISC for DImode > (double word) left shifts by small constants (1-31). This target > has a very cool shd instruction that can be recognized by combine > for simple shifts, but relying on combine is fragile for more > complicated functions. This patch tweaks pa.md's ashldi3 expander, > to form the optimal two instruction shd/zdep sequence at RTL > expansion time. > > As an example of the benefits of this approach, the simple function > > unsigned long long u9(unsigned long long x) { return x*9; } > > currently generates 9 instructions > > u9: copy %r25,%r28 > copy %r26,%r29 > extru %r26,2,3,%r21 > zdep %r25,28,29,%r19 > zdep %r26,28,29,%r20 > or %r21,%r19,%r19 > add %r29,%r20,%r29 > addc %r28,%r19,%r28 > bv,n %r0(%r2) > > and with this patch now requires only 7: > > u9: copy %r25,%r28 > copy %r26,%r29 > shd %r26,%r25,29,%r19 > zdep %r26,28,29,%r20 > add %r29,%r20,%r29 > addc %r28,%r19,%r28 > bv,n %r0(%r2) > > > This improvement is a first step towards getting synth_mult to > behave sanely on hppa (PR middle-end/87256). > > Unfortunately, it's been a long while since I've had access to a > hppa system, so apart from building a cross-compiler and looking at > the assembler it generates, this patch is completely untested. > I was wondering whether Dave or Jeff (or someone else with access > to real hardware) might "spin" this patch for me? So while I have an R390 down in the basement, it overheats and I don't think I've turned it on in a few years. What I do for testing is exploit QEMU user mode emulation. I have a little root filesystem with HPPA binaries -- there's enough bits in that filesystem to build binutils, gcc, glibc & the linux kernel as well as run the testsuite. The root filesystems are here: https://github.com/JeffreyALaw/chroots The Jenkins tester spins the PA once week. It failed its last run due to regressions in the analyzer. I didn't realize it was running the static analyzer for the qemu emulated targets, which I turned off yesterday and re-started the PA build. http://3.14.90.209:8080/job/hppa-linux-gnu/ Jeff >
RE: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
Hi Dave, I actually think using plus_xor_ior operator is useful. It means that if combine, inlining or some other RTL simplification generates these variants, these forms will still be recognized by the backend. It's more typing, but the compiler produces better code. Here's what I have so far, but please feel free to modify anything. I'll leave the rest to you. With this patch: unsigned long long rotl4(unsigned long long x) { return (x<<4) | (x>>60); } unsigned long long rotr4(unsigned long long x) { return (x<<60) | (x>>4); } which previously generated: rotl4: depd,z %r26,59,60,%r28 extrd,u %r26,3,4,%r26 bve (%r2) or %r26,%r28,%r28 rotr4: extrd,u %r26,59,60,%r28 depd,z %r26,3,4,%r26 bve (%r2) or %r26,%r28,%r28 now produces: rotl4: bve (%r2) shrpd %r26,%r26,60,%r28 rotr4: bve (%r2) shrpd %r26,%r26,4,%r28 I'm guessing this is very similar to what you were thinking (or what I described previously). Many thanks again for trying out these patches/suggestions for me. Best regards, Roger -- -Original Message- From: John David Anglin Sent: 22 August 2020 23:09 To: Roger Sayle ; 'GCC Patches' Cc: 'Jeff Law' Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT On 2020-08-22 12:01 p.m., Roger Sayle wrote: > I suspect that the issue with the 64-bit patterns is that the second > variant of pa.md's define_insn "shrpdi4" is unlikely ever to match as > (minus:DI (const_int 64) x) is never "canonical" when x is itself a > CONST_INT. Splitting this define_insn into two (or three see below) > separate forms; the first as it currently is and the second (as you > suggest) with > "TARGET_64BIT > && INTVAL (operands[3]) + INTVAL (operands[4]) == 64" > should do the trick. I will go ahead and add the basic patterns. It seems it would be best if I avoid using the "plus_xor_ior_operator". It also seems the 32-bit patterns should avoid it. > > My first impression was that the DImode shrpd instructions would be > most useful for implementing TI mode shifts, but that TI mode isn't > supported by hppa64. But then I noticed that the more immediate > benefit would be in supporting rotrdi3 and rotldi3 on TARGET_64BIT > that currently don't have expanders nor insns defined. Here GCC > currently generates three instructions where a single shrpd would be > optimal. It turns out we now need to support TI mode and __int128 for libgomp. The hppa64-hpux target won't boot without it. I had just added a change to support TI mode but it's untested. Regards, Dave -- John David Anglin dave.ang...@bell.net patchh3.log Description: Binary data diff --git a/gcc/config/pa/pa.md b/gcc/config/pa/pa.md index 6350c68..5f04c02 100644 --- a/gcc/config/pa/pa.md +++ b/gcc/config/pa/pa.md @@ -6604,32 +6604,82 @@ (set_attr "length" "4")]) ; Shift right pair word 0 to 31 bits. -(define_insn "shrpsi4" - [(set (match_operand:SI 0 "register_operand" "=r,r") - (ior:SI (ashift:SI (match_operand:SI 1 "register_operand" "r,r") - (minus:SI (const_int 32) -(match_operand:SI 3 "shift5_operand" "q,n"))) - (lshiftrt:SI (match_operand:SI 2 "register_operand" "r,r") -(match_dup 3] +(define_insn "*shrpsi4_1" + [(set (match_operand:SI 0 "register_operand" "=r") + (match_operator:SI 4 "plus_xor_ior_operator" + [(ashift:SI (match_operand:SI 1 "register_operand" "r") + (minus:SI (const_int 32) + (match_operand:SI 3 "register_operand" "q"))) + (lshiftrt:SI (match_operand:SI 2 "register_operand" "r") + (match_dup 3))]))] "" - "@ - {vshd %1,%2,%0|shrpw %1,%2,%%sar,%0} - {shd|shrpw} %1,%2,%3,%0" + "{vshd %1,%2,%0|shrpw %1,%2,%%sar,%0}" + [(set_attr "type" "shift") + (set_attr "length" "4")]) + +(define_insn "*shrpsi4_2" + [(set (match_operand:SI 0 "register_operand" "=r") + (match_operator:SI 4 "plus_xor_ior_operator" + [(lshiftrt:SI (match_operand:SI 2 "register_operand" "r") + (match_operand:SI 3 "register_operand" "q")) + (ashift:SI (match_operand:SI 1 "register_operand" "r") + (minus:SI (const_int 32) + (match_dup 3)))]))] + ""
Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
On 2020-08-22 12:01 p.m., Roger Sayle wrote: > I suspect that the issue with the 64-bit patterns is that the second variant > of > pa.md's define_insn "shrpdi4" is unlikely ever to match as (minus:DI > (const_int 64) x) > is never "canonical" when x is itself a CONST_INT. Splitting this > define_insn > into two (or three see below) separate forms; the first as it currently is > and the > second (as you suggest) with > "TARGET_64BIT > && INTVAL (operands[3]) + INTVAL (operands[4]) == 64" > should do the trick. I will go ahead and add the basic patterns. It seems it would be best if I avoid using the "plus_xor_ior_operator". It also seems the 32-bit patterns should avoid it. > > My first impression was that the DImode shrpd instructions would be most > useful for implementing TI mode shifts, but that TI mode isn't supported by > hppa64. But then I noticed that the more immediate benefit would be in > supporting rotrdi3 and rotldi3 on TARGET_64BIT that currently don't have > expanders nor insns defined. Here GCC currently generates three > instructions > where a single shrpd would be optimal. It turns out we now need to support TI mode and __int128 for libgomp. The hppa64-hpux target won't boot without it. I had just added a change to support TI mode but it's untested. Regards, Dave -- John David Anglin dave.ang...@bell.net
RE: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
Hi Dave, Many thanks for your help. I suspect that the issue with the 64-bit patterns is that the second variant of pa.md's define_insn "shrpdi4" is unlikely ever to match as (minus:DI (const_int 64) x) is never "canonical" when x is itself a CONST_INT. Splitting this define_insn into two (or three see below) separate forms; the first as it currently is and the second (as you suggest) with "TARGET_64BIT && INTVAL (operands[3]) + INTVAL (operands[4]) == 64" should do the trick. My first impression was that the DImode shrpd instructions would be most useful for implementing TI mode shifts, but that TI mode isn't supported by hppa64. But then I noticed that the more immediate benefit would be in supporting rotrdi3 and rotldi3 on TARGET_64BIT that currently don't have expanders nor insns defined. Here GCC currently generates three instructions where a single shrpd would be optimal. I'm sure that folks are aware of this but something I learned/found strange was the implications of using (match_operator:SI 5 "plus_xor_ior_operator" x y). Clearly it's nice to have these patterns match IOR, PLUS and XOR (as recently pointed out by Jakub in a recent review), but using match_operator has two side-effects. The first is that recog's machinery doesn't know these operators are commutative, so two variants of a define_insn using match_operator need to be specified, i.e. (... y x). The second side-effect is that for generation/expansion it's impossible (inefficient?) to use the corresponding gen_foo functions. It would be nice if match_operator:5 resulted in the gen_function taking an enum rtx_code as the corresponding argument, to specify IOR or PLUS or XOR. Alas no. So instead I added the extremely light weight define_expand for SImode shd_internal to select IOR arbitrarily as the preferred RTL. You're right, that a similar strategy using a DImode shrpd_internal would be required for the 64-bit form; but the two are different; it's not that the current name would be affected (or should be changed). Hence the define_expand for rotrdi3 and rotldi3 would/should (for constant shifts) call gen_shrpd_internal to create an insn that matches one of the define_insns. I'd be happy to write that patch, but I personally have no way of testing it. It's a shame there isn't a hppa64 machine on the GCC compile farm. [p.s. I've ordered Gerry Kane's "PA-RISC 2.0 architecture" book from amazon, so I'll hopefully understand more of what I'm talking about when it arrives]. Thanks again. Roger -- -Original Message- From: John David Anglin Sent: 22 August 2020 13:58 To: Roger Sayle ; 'GCC Patches' Cc: 'Jeff Law' Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT Hi Roger, Started a test of your latest version. It appears we miss 64-bit patterns similar to these: (define_insn "" [(set (match_operand:SI 0 "register_operand" "=r") (match_operator:SI 5 "plus_xor_ior_operator" [(ashift:SI (match_operand:SI 1 "register_operand" "r") (match_operand:SI 3 "const_int_operand" "n")) (lshiftrt:SI (match_operand:SI 2 "register_operand" "r") (match_operand:SI 4 "const_int_operand" "n"))]))] "INTVAL (operands[3]) + INTVAL (operands[4]) == 32" "{shd|shrpw} %1,%2,%4,%0" [(set_attr "type" "shift") (set_attr "length" "4")]) (define_insn "" [(set (match_operand:SI 0 "register_operand" "=r") (match_operator:SI 5 "plus_xor_ior_operator" [(lshiftrt:SI (match_operand:SI 2 "register_operand" "r") (match_operand:SI 4 "const_int_operand" "n")) (ashift:SI (match_operand:SI 1 "register_operand" "r") (match_operand:SI 3 "const_int_operand" "n"))]))] "INTVAL (operands[3]) + INTVAL (operands[4]) == 32" "{shd|shrpw} %1,%2,%4,%0" [(set_attr "type" "shift") (set_attr "length" "4")]) I'm wondering if it would be useful to define the 64-bit equivalents using the "shrpd" instruction. If that's the case, then I think it would be good to rename "shd_internal" to "shrpw_internal". Regards, Dave On 2020-08-22 4:52 a.m., Roger Sayle wrote: > Hi Dave, > > It's great to hear from you. It's been a long while. > > Sorry, doh! yes, there's a mistake in my patch (that I introduced when > I renumbered the operands in the shd's define_expand to be t
Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
Hi Roger, Started a test of your latest version. It appears we miss 64-bit patterns similar to these: (define_insn "" [(set (match_operand:SI 0 "register_operand" "=r") (match_operator:SI 5 "plus_xor_ior_operator" [(ashift:SI (match_operand:SI 1 "register_operand" "r") (match_operand:SI 3 "const_int_operand" "n")) (lshiftrt:SI (match_operand:SI 2 "register_operand" "r") (match_operand:SI 4 "const_int_operand" "n"))]))] "INTVAL (operands[3]) + INTVAL (operands[4]) == 32" "{shd|shrpw} %1,%2,%4,%0" [(set_attr "type" "shift") (set_attr "length" "4")]) (define_insn "" [(set (match_operand:SI 0 "register_operand" "=r") (match_operator:SI 5 "plus_xor_ior_operator" [(lshiftrt:SI (match_operand:SI 2 "register_operand" "r") (match_operand:SI 4 "const_int_operand" "n")) (ashift:SI (match_operand:SI 1 "register_operand" "r") (match_operand:SI 3 "const_int_operand" "n"))]))] "INTVAL (operands[3]) + INTVAL (operands[4]) == 32" "{shd|shrpw} %1,%2,%4,%0" [(set_attr "type" "shift") (set_attr "length" "4")]) I'm wondering if it would be useful to define the 64-bit equivalents using the "shrpd" instruction. If that's the case, then I think it would be good to rename "shd_internal" to "shrpw_internal". Regards, Dave On 2020-08-22 4:52 a.m., Roger Sayle wrote: > Hi Dave, > > It's great to hear from you. It's been a long while. > > Sorry, doh! yes, there's a mistake in my patch (that I introduced when I > renumbered > the operands in the shd's define_expand to be the more logical 0, 1, 2, 3, > then 4). > Sorry for the inconvenience [due to my lack of familiarity with PA-RISC > assembly]. > Hopefully you should get much better mileage out of the attached revision. > > Thanks again (and my sincere apologies), > Roger > -- > > -Original Message- > From: John David Anglin > Sent: 21 August 2020 20:00 > To: Roger Sayle ; 'GCC Patches' > > Cc: 'Jeff Law' > Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT > > Hi Roger, > > On 2020-08-21 8:53 a.m., Roger Sayle wrote: >> I was wondering whether Dave or Jeff (or someone else with access to >> real hardware) might "spin" this patch for me? > This may be totally unrelated to this patch but I hit this error in stage2 > testing your change: > build/genattrtab ../../gcc/gcc/common.md ../../gcc/gcc/config/pa/pa.md > insn-conditions.md \ > -Atmp-attrtab.c -Dtmp-dfatab.c -Ltmp-latencytab.c > genattrtab: Internal error: abort in attr_alt_union, at genattrtab.c:2383 > > It's great that you are back helpting with the middle-end. > > Regards, > Dave > > -- > John David Anglin dave.ang...@bell.net > -- John David Anglin dave.ang...@bell.net
RE: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
Hi Dave, It's great to hear from you. It's been a long while. Sorry, doh! yes, there's a mistake in my patch (that I introduced when I renumbered the operands in the shd's define_expand to be the more logical 0, 1, 2, 3, then 4). Sorry for the inconvenience [due to my lack of familiarity with PA-RISC assembly]. Hopefully you should get much better mileage out of the attached revision. Thanks again (and my sincere apologies), Roger -- -Original Message- From: John David Anglin Sent: 21 August 2020 20:00 To: Roger Sayle ; 'GCC Patches' Cc: 'Jeff Law' Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT Hi Roger, On 2020-08-21 8:53 a.m., Roger Sayle wrote: > I was wondering whether Dave or Jeff (or someone else with access to > real hardware) might "spin" this patch for me? This may be totally unrelated to this patch but I hit this error in stage2 testing your change: build/genattrtab ../../gcc/gcc/common.md ../../gcc/gcc/config/pa/pa.md insn-conditions.md \ -Atmp-attrtab.c -Dtmp-dfatab.c -Ltmp-latencytab.c genattrtab: Internal error: abort in attr_alt_union, at genattrtab.c:2383 It's great that you are back helpting with the middle-end. Regards, Dave -- John David Anglin dave.ang...@bell.net diff --git a/gcc/config/pa/pa.md b/gcc/config/pa/pa.md index 6350c68..713ff17 100644 --- a/gcc/config/pa/pa.md +++ b/gcc/config/pa/pa.md @@ -6416,9 +6416,32 @@ [(set (match_operand:DI 0 "register_operand" "") (ashift:DI (match_operand:DI 1 "lhs_lshift_operand" "") (match_operand:DI 2 "arith32_operand" "")))] - "TARGET_64BIT" + "" " { + if (!TARGET_64BIT) +{ + if (REG_P (operands[0]) && GET_CODE (operands[2]) == CONST_INT) + { + unsigned HOST_WIDE_INT shift = UINTVAL (operands[2]); + if (shift >= 1 && shift <= 31) + { + rtx dst = operands[0]; + rtx src = force_reg (DImode, operands[1]); + emit_insn (gen_shd_internal (gen_highpart (SImode, dst), + gen_lowpart (SImode, src), + GEN_INT (32-shift), + gen_highpart (SImode, src), + GEN_INT (shift))); + emit_insn (gen_ashlsi3 (gen_lowpart (SImode, dst), + gen_lowpart (SImode, src), + GEN_INT (shift))); + DONE; + } + } + /* Fallback to using optabs.c's expand_doubleword_shift. */ + FAIL; +} if (GET_CODE (operands[2]) != CONST_INT) { rtx temp = gen_reg_rtx (DImode); @@ -6705,6 +6728,15 @@ [(set_attr "type" "shift") (set_attr "length" "4")]) +(define_expand "shd_internal" + [(set (match_operand:SI 0 "register_operand") + (ior:SI + (lshiftrt:SI (match_operand:SI 1 "register_operand") + (match_operand:SI 2 "const_int_operand")) + (ashift:SI (match_operand:SI 3 "register_operand") +(match_operand:SI 4 "const_int_operand"] + "") + (define_insn "" [(set (match_operand:SI 0 "register_operand" "=r") (and:SI (ashift:SI (match_operand:SI 1 "register_operand" "r")
Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
Hi Roger, On 2020-08-21 8:53 a.m., Roger Sayle wrote: > I was wondering whether Dave or Jeff (or someone else with access > to real hardware) might "spin" this patch for me? This may be totally unrelated to this patch but I hit this error in stage2 testing your change: build/genattrtab ../../gcc/gcc/common.md ../../gcc/gcc/config/pa/pa.md insn-conditions.md \ -Atmp-attrtab.c -Dtmp-dfatab.c -Ltmp-latencytab.c genattrtab: Internal error: abort in attr_alt_union, at genattrtab.c:2383 It's great that you are back helpting with the middle-end. Regards, Dave -- John David Anglin dave.ang...@bell.net