[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #22 from Jiu Fu Guo ---
On power10, loading a constant needs only 1 instruction, for example:

    pld 9,.LC0@pcrel

In tests, this seems nearly as fast as building the constant with 1 instruction.
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #21 from Jiu Fu Guo ---
I also ran a test on powerpc with -m32. In that test, there seems to be no significant benefit to loading from 'rodata' versus building constants with instructions.

Building with instructions:

    lis %r7,0x410
    ori %r7,%r7,0x103c
    lis %r6,0x710
    ori %r6,%r6,0xe005

Loading from rodata:

    lis %r12,.LC3@ha
    la %r12,.LC3@l(%r12)
    lwz %r3,0(%r12)
    lwz %r4,4(%r12)
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #20 from Jiu Fu Guo ---
Created attachment 52114
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52114&action=edit
testcases

With these test cases, 'foo' is invoked 1,000,000,000 times in each case to measure the runtime. Building the constant with 1 insn is fastest. Next fastest are building the constant with 2 instructions, loading it from rodata, or loading it from the TOC. Building the constant with 3 instructions is slower than loading it from rodata, and building it with 5 insns is slowest.
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #19 from Jiu Fu Guo ---
(In reply to Segher Boessenkool from comment #18)
Thanks for the clarification!

> Yes, it is slow.  Five sequential dependent integer instructions instead of
> one load instruction.  Depending on how you benchmark this you possibly won't

Yes, it depends on how the cases are benchmarked; several factors affect the runtime. This is really the point! In the above cases, the several std instructions and the spill of r31 all affect the runtime and would hide the cost of the constant-building instructions. Focusing only on the sequence that builds a constant, the 5-insn sequence is much slower than the 1-insn sequence.
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #18 from Segher Boessenkool ---
Yes, it is slow.  Five sequential dependent integer instructions instead of one load instruction.  Depending on how you benchmark this you possibly won't see the slowness: the values are stored to memory, and that can happen very many cycles later. The stores are totally off the critical path and will not clog up any pipelines.
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #17 from Jiu Fu Guo ---
One thing I'm wondering is whether it is really 'slow' to build the constant with instructions (even with 5 insns). For example, there seems to be no big difference in runtime between the two pieces of code below on a real machine.

1.
foo:
.LFB0:
        .cfi_startproc
        std %r31,-8(%r1)
        .cfi_offset 31, -8
        li %r12,2
        li %r31,1
        li %r0,3
        li %r11,4
        std %r31,0(%r3)
        std %r12,0(%r4)
        std %r0,0(%r5)
        std %r11,0(%r6)
        std %r31,0(%r7)
        std %r12,0(%r8)
        ld %r31,-8(%r1)
        std %r0,0(%r9)
        std %r11,0(%r10)
        .cfi_restore 31
        blr

2.
foo:
.LFB0:
        .cfi_startproc
        std 31,-8(1)
        .cfi_offset 31, -8
        li 11,0
        li 31,0
        li 12,0
        ori 11,11,0x8000
        ori 31,31,0x8000
        ori 12,12,0x8000
        sldi 11,11,32
        sldi 31,31,32
        sldi 12,12,32
        oris 11,11,0x410
        oris 31,31,0x410
        oris 12,12,0x410
        ori 11,11,0x1
        ori 31,31,0x3
        ori 12,12,0x5
        li 0,0
        std 11,0(3)
        std 31,0(4)
        li 3,0
        li 4,0
        std 12,0(5)
        li 5,0
        ori 0,0,0x8000
        ld 31,-8(1)
        ori 3,3,0x8000
        ori 4,4,0x8000
        ori 5,5,0x8000
        sldi 0,0,32
        sldi 3,3,32
        sldi 4,4,32
        sldi 5,5,32
        oris 0,0,0x410
        oris 3,3,0x410
        oris 4,4,0x410
        oris 5,5,0x410
        ori 0,0,0x7
        addi 11,11,5
        ori 3,3,0xa
        ori 4,4,0xe
        ori 5,5,0xc
        std 0,0(6)
        std 11,0(7)
        std 3,0(8)
        std 4,0(9)
        std 5,0(10)
        .cfi_restore 31
        blr
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #16 from Jiu Fu Guo ---
Thanks, Alan! I saw your patches in this PR. They would help us get the sequences we have in mind. And, as you said in the comments, fixing insn and rtl costs is a big problem.
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #15 from Alan Modra ---
(In reply to Jiu Fu Guo from comment #14)
> It would be a way to keep the data in memory(.rodata) through adjusting the
> cost of constant.

Yes, I posted a series of patches that fix this problem and other rtx costs. Look for patches with "rs6000_rtx_costs" in the subject. Some of the patches were even approved, but not all in the series. I am disillusioned enough with gcc that I won't be pushing those patches or attempting any future gcc work. You or anyone else are welcome to pick up the pieces.
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #14 from Jiu Fu Guo ---
For a constant like 0x0000800004100001, which takes 5 insns, the 'expand' pass prefers to keep it in memory, but the cse1 pass replaces the load with the constant again.

expand:
    7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
      REG_EQUAL 0x800004100001
    8: [r117:DI]=r119:DI
cse1:
    7: r119:DI=0x800004100001
      REG_EQUAL 0x800004100001
    8: [r117:DI]=r119:DI

This is because expand_assignment invokes force_const_mem/gen_const_mem under the condition:

    (num_insns_constant (operands[1], mode) > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2))

At cse1, when comparing the cost of 'fold_const' against 'src', 'fold_const' is selected because:

    preferable (src_folded_cost, src_folded_regcost, src_cost, src_regcost) <= 0

src:
    (mem/u/c:DI (unspec:DI [
                (symbol_ref/u:DI ("*.LC0") [flags 0x82])
                (reg:DI 2 2)
            ] UNSPEC_TOCREL) [2 S8 A8])
fold_const:
    (const_int 140737556512769 [0x800004100001])

Adjusting the cost of the constant would be one way to keep the data in memory (.rodata).
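
To make the threshold in the quoted condition concrete, here is a rough Python sketch. It is not GCC's num_insns_constant: naive_insn_count and prefer_memory are hypothetical helpers that only model the plain li/lis, ori, sldi, oris, ori scheme seen in this PR, and they ignore tricks such as rldimi. Only the thresholds (3, or 2 for the small code model) are taken from the condition above.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def naive_insn_count(value):
    """Hypothetical model: insns needed by the li/lis + ori + sldi + oris + ori scheme."""
    v = to_signed(value, 64)
    if -0x8000 <= v <= 0x7FFF:
        return 1                                  # li
    if -0x80000000 <= v <= 0x7FFFFFFF:
        return 1 if (v & 0xFFFF) == 0 else 2      # lis, optionally followed by ori
    high = v >> 32                                # upper word, sign preserved
    low = v & 0xFFFFFFFF
    count = naive_insn_count(high) + 1            # build the upper word, then sldi 32
    if low >> 16:
        count += 1                                # oris for bits 16..31
    if low & 0xFFFF:
        count += 1                                # ori for bits 0..15
    return count


def prefer_memory(value, small_cmodel=False):
    """Mirror of the quoted test: load from memory above 3 insns (2 for small cmodel)."""
    return naive_insn_count(value) > (2 if small_cmodel else 3)


print(naive_insn_count(0x0000800004100001), prefer_memory(0x0000800004100001))
```

Under this model the perlbench constant costs 5 insns, so the expand-time test chooses the memory load; the problem described above is that cse1 later undoes that choice because its own cost comparison favours the folded constant.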
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281 --- Comment #13 from Segher Boessenkool --- If we need more than three insns to create a constant we are better off loading it from memory, in all cases. Maybe three is too much already, at least on some processors?
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281 --- Comment #12 from Segher Boessenkool --- This is my g:72b2f3317b44, two years and a day old :-)
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #11 from Jiu Fu Guo ---
For the constant Bill mentioned in comment #9, 0x0000800004100001, the code sequence still contains several instructions, e.g.:

    li %r11,0
    ori %r11,%r11,0x8000
    sldi %r11,%r11,32
    oris %r11,%r11,0x410
    ori %r11,%r11,0x1
    std %r11,0(%r3)
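
As a sanity check on what the five register instructions above compute, here is a small Python emulation. The helper functions are illustrative stand-ins for the instruction semantics (64-bit registers, 16-bit immediates), not part of any real tool.

```python
MASK64 = (1 << 64) - 1


def li(imm16):
    """li rD,imm: load a sign-extended 16-bit immediate."""
    imm16 &= 0xFFFF
    return (imm16 - 0x10000 if imm16 & 0x8000 else imm16) & MASK64


def ori(reg, imm16):
    """ori rA,rS,imm: OR a zero-extended 16-bit immediate into bits 0..15."""
    return (reg | (imm16 & 0xFFFF)) & MASK64


def oris(reg, imm16):
    """oris rA,rS,imm: OR a 16-bit immediate into bits 16..31."""
    return (reg | ((imm16 & 0xFFFF) << 16)) & MASK64


def sldi(reg, shift):
    """sldi rA,rS,n: shift left by n bits within 64 bits."""
    return (reg << shift) & MASK64


r11 = li(0)
r11 = ori(r11, 0x8000)
r11 = sldi(r11, 32)
r11 = oris(r11, 0x410)
r11 = ori(r11, 0x1)
print(hex(r11))
```

Each instruction depends on the result of the previous one, so the chain cannot be overlapped with itself; the final value is 0x800004100001, which is decimal 140737556512769.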
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

Jiu Fu Guo changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |guojiufu at gcc dot gnu.org

--- Comment #10 from Jiu Fu Guo ---
With the latest trunk (AT14 is similar), the generated code looks like this:

-O
    lis %r9,0x8123
    ori %r9,%r9,0x4567
    rldimi %r9,%r9,32,0
    std %r9,0(%r10)

Or -O3
    lis %r11,0x1234
    lis %r31,0x2345
    lis %r12,0x3456
    ori %r11,%r11,0x5678
    ori %r31,%r31,0x6781
    ori %r12,%r12,0x7812
    rldimi %r11,%r11,32,0
    rldimi %r31,%r31,32,0
    rldimi %r12,%r12,32,0
    ...

This code seems better than the previous one.
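
The three-instruction form works because rldimi copies the low word into the high word. A hedged Python emulation of the lis/ori/rldimi sequence follows; the helpers (lis, ori, ibm_mask, rldimi, build_repeated) are illustrative names, and rldimi is modelled from the ISA description as rotate-left then insert under a mask from bit MB to bit 63-SH in IBM numbering (bit 0 is the most significant).

```python
MASK64 = (1 << 64) - 1


def lis(imm16):
    """lis rD,imm: sign-extended 16-bit immediate shifted left by 16."""
    imm16 &= 0xFFFF
    signed = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    return (signed << 16) & MASK64


def ori(reg, imm16):
    """ori rA,rS,imm: OR a zero-extended 16-bit immediate into the low bits."""
    return (reg | (imm16 & 0xFFFF)) & MASK64


def ibm_mask(mb, me):
    """Ones from IBM bit mb through me (bit 0 = MSB), wrapping if mb > me."""
    from_mb = (1 << (64 - mb)) - 1                 # bits mb..63
    up_to_me = ~((1 << (63 - me)) - 1) & MASK64    # bits 0..me
    return from_mb & up_to_me if mb <= me else from_mb | up_to_me


def rldimi(ra, rs, sh, mb):
    """rldimi rA,rS,SH,MB: rotate rS left by SH, insert into rA under the mask."""
    rotated = ((rs << sh) | (rs >> (64 - sh))) & MASK64 if sh else rs
    mask = ibm_mask(mb, 63 - sh)
    return (rotated & mask) | (ra & ~mask & MASK64)


def build_repeated(hi16, lo16):
    """Emulate: lis r,hi16 ; ori r,r,lo16 ; rldimi r,r,32,0."""
    r = ori(lis(hi16), lo16)
    return rldimi(r, r, 32, 0)


print(hex(build_repeated(0x8123, 0x4567)))
```

Note that lis 0x8123 sign-extends, so the register briefly holds 0xffffffff81234567; the rldimi then overwrites the upper 32 bits with the lower 32, which is why the sign extension does not matter here.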
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #9 from Bill Schmidt ---
Also reported by Donald Stence this week:

The compiler produces excessive sequences to synthesize some literal constants. This contributes excess path length and potentially latency. Constants requiring only 2 or 3 instructions are acceptable. More than 3 should be optimized via a load from the GOT (i.e., data in GOT).

Compile the test case with either -O or -O3, default processor (AT-11.0-0).

Example constant from perlbench: 0x0000800004100001. Resulting sequence:

    li 3,0
    ori 3,3,0x8000
    sldi 3,3,32
    oris 3,3,0x410
    ori 3,3,0x1

It was ~20% faster on the block of some 30 instructions prior to the switch in the top function of perlbench (S_regmatch). That section of code contained two longer sequences (one of 4 instructions, the other of 5, with the 5-instruction one capable of being done in 4, as [Segher] pointed out), with the rest being 1- or 2-instruction constant synthesis or addi, plus an out-of-bounds check for the switch. I replaced the two longer ones with ld off r2 to get the ~20%. Of course this is in isolation, but I believe this to be sound.
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #8 from Alan Modra ---
Created attachment 42187
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42187&action=edit
[RS6000] Address cost

Somewhat related: costing constants properly also needs a proper cost for loading from memory.
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

Alan Modra changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #33503|0                           |1
        is obsolete|                            |

--- Comment #7 from Alan Modra ---
Created attachment 42186
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42186&action=edit
[RS6000] Cost multi-insn constants

I think the patches aren't worth pursuing *until* insn and rtl costing is fixed. That's the really big problem, and I think fixing it will require someone willing to regress things for a while on all targets. I looked at doing that some years ago and came to the conclusion that I didn't have the reputation in the gcc community for anything I did to be accepted.

Without fixed costing, even if we emit what we think is better code in rs6000_emit_set_const, optimization passes may transform that code to non-optimal sequences.

So, within the current broken rtx costing, the attached patch teaches gcc to cost multi-insn constants.
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

kelvin at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kelvin at gcc dot gnu.org

--- Comment #6 from kelvin at gcc dot gnu.org ---
It appears that Alan's proposed patches will "improve" the generated code, but those patches were never proposed for incorporation into the trunk.

Can Alan clarify his thoughts on this? I can incorporate those patches and do the regression testing if we think that's desirable. But it seems that we might not have consensus on what should be done here.
[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2016-10-27
          Component|target                      |rtl-optimization
     Ever confirmed|0                           |1

--- Comment #5 from Andrew Pinski ---
Actually this should just create one constant in registers and then rotate them. We are able to handle the +1 case just fine.