[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2022-01-09 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #22 from Jiu Fu Guo  ---
On Power10, loading a constant needs only one instruction, for example:
pld 9,.LC0@pcrel

In my tests, this seems nearly as fast as building the constant with one
instruction.

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2022-01-06 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #21 from Jiu Fu Guo  ---
I also ran a test on powerpc with -m32.  In that test there seems to be no
significant benefit to loading from 'rodata' over building constants with
instructions.

lis %r7,0x410
ori %r7,%r7,0x103c
lis %r6,0x710
ori %r6,%r6,0xe005

lis %r12,.LC3@ha
la %r12,.LC3@l(%r12)
lwz %r3,0(%r12)
lwz %r4,4(%r12)
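
The second sequence above forms the address of .LC3 from its @ha and @l halves. A minimal sketch of that arithmetic follows, assuming 32-bit addresses; the helper names and the sample addresses are hypothetical and do not come from the test case. Because `la` adds a sign-extended 16-bit displacement, the high half has to be adjusted by 0x8000 before it is taken.

```python
def sext16(value):
    """Sign-extend a 16-bit value, as the displacement of la/addi is."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def split_ha_l(addr):
    """Split a 32-bit address into its @ha (high adjusted) and @l halves."""
    lo = addr & 0xFFFF
    # Adding 0x8000 compensates for the sign extension of the low half.
    ha = ((addr + 0x8000) >> 16) & 0xFFFF
    return ha, lo

def rebuild(ha, lo):
    """What 'lis rX,sym@ha ; la rX,sym@l(rX)' computes in 32-bit mode."""
    return ((ha << 16) + sext16(lo)) & 0xFFFFFFFF

# Hypothetical addresses for .LC3; the second has bit 15 set in its low half.
for addr in (0x10012345, 0x10018000):
    ha, lo = split_ha_l(addr)
    print(hex(addr), hex(ha), hex(lo), hex(rebuild(ha, lo)))
```

Either way the -m32 case needs four instructions: two lis/ori pairs to build the two words, or two address-forming instructions followed by two loads.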

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2022-01-04 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #20 from Jiu Fu Guo  ---
Created attachment 52114
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52114&action=edit
testcases

With these test cases, I invoked 'foo' 1,000,000,000 times in each case to
measure the runtime:
Building the constant with 1 insn is fastest.
Next fastest are building the constant with 2 instructions, loading it from
rodata, and loading it from the TOC.
Building the constant with 3 instructions is slower than loading it from
rodata, and building it with 5 insns is slowest.

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2022-01-03 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #19 from Jiu Fu Guo  ---
(In reply to Segher Boessenkool from comment #18)
Thanks for the clarification!
> Yes, it is slow.  Five sequential dependent integer instructions instead of
> one load instruction.  Depending on how you benchmark this you possibly won't
Yes, it depends on how the cases are benchmarked.  Several factors affect the
runtime.  This is really the point!
In the above cases, the several std instructions and the spill of r31 all
affect the runtime and would hide the cost of the const-building instructions.
Focusing only on the sequence that builds a const, the 5-insn sequence is much
slower than the 1-insn sequence.

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2021-12-30 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #18 from Segher Boessenkool  ---
Yes, it is slow.  Five sequential dependent integer instructions instead of
one load instruction.  Depending on how you benchmark this you possibly won't
see the slowness, the values are stored to memory and that can happen very
many cycles later even, this is totally out of the critical path, will not
clog up any pipelines.

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2021-12-30 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #17 from Jiu Fu Guo  ---
One thing I'm wondering is whether it is really 'slow' to build the const with
instructions (even with 5 insns).

For example, there seems to be no big difference in runtime between the two
pieces of code below on a real machine.
1.

foo:
.LFB0:
.cfi_startproc
std %r31,-8(%r1)
.cfi_offset 31, -8
li %r12,2
li %r31,1
li %r0,3
li %r11,4
std %r31,0(%r3)
std %r12,0(%r4)
std %r0,0(%r5)
std %r11,0(%r6)
std %r31,0(%r7)
std %r12,0(%r8)
ld %r31,-8(%r1)
std %r0,0(%r9)
std %r11,0(%r10)
.cfi_restore 31
blr


2.
foo:
.LFB0:
.cfi_startproc
std 31,-8(1)
.cfi_offset 31, -8
li 11,0
li 31,0
li 12,0
ori 11,11,0x8000
ori 31,31,0x8000
ori 12,12,0x8000
sldi 11,11,32
sldi 31,31,32
sldi 12,12,32
oris 11,11,0x410
oris 31,31,0x410
oris 12,12,0x410
ori 11,11,0x1
ori 31,31,0x3
ori 12,12,0x5
li 0,0
std 11,0(3)
std 31,0(4)
li 3,0
li 4,0
std 12,0(5)
li 5,0
ori 0,0,0x8000
ld 31,-8(1)
ori 3,3,0x8000
ori 4,4,0x8000
ori 5,5,0x8000
sldi 0,0,32
sldi 3,3,32
sldi 4,4,32
sldi 5,5,32
oris 0,0,0x410
oris 3,3,0x410
oris 4,4,0x410
oris 5,5,0x410
ori 0,0,0x7
addi 11,11,5
ori 3,3,0xa
ori 4,4,0xe
ori 5,5,0xc
std 0,0(6)
std 11,0(7)
std 3,0(8)
std 4,0(9)
std 5,0(10)
.cfi_restore 31
blr

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2021-12-30 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #16 from Jiu Fu Guo  ---
Thanks, Alan!
I saw your patches in this PR.  They would help us get the sequence we have in
mind.  And as you said in the comments, fixing insn and rtl costing is the big
problem.

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2021-12-29 Thread amodra at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #15 from Alan Modra  ---
(In reply to Jiu Fu Guo from comment #14)
> It would be a way to keep the data in memory(.rodata) through adjusting the
> cost of constant.

Yes, I posted a series of patches that fix this problem and other rtx costs. 
Look for patches with "rs6000_rtx_costs" in the subject.  Some of the patches
were even approved, but not all in the series.  I am disillusioned enough with
gcc that I won't be pushing those patches or attempting any future gcc work. 
You or anyone else are welcome to pick up the pieces.

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2021-12-29 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #14 from Jiu Fu Guo  ---
For a constant like 0x800004100001, which takes 5 insns, the 'expand' pass
prefers to keep it in memory, while the cse1 pass replaces the load with the
constant again.

expand:
7: r119:DI=[unspec[`*.LC0',%r2:DI] 47]
  REG_EQUAL 0x800004100001
8: [r117:DI]=r119:DI

cse1:
7: r119:DI=0x800004100001
  REG_EQUAL 0x800004100001
8: [r117:DI]=r119:DI

This is because:
expand_assignment invokes force_const_mem/gen_const_mem under the condition:
(num_insns_constant (operands[1], mode) > (TARGET_CMODEL != CMODEL_SMALL ? 3 : 2))

At cse1, when comparing the cost of 'fold_const' and 'src', 'fold_const' is
selected because
'preferable (src_folded_cost, src_folded_regcost, src_cost, src_regcost) <= 0'

src:
(mem/u/c:DI (unspec:DI [
(symbol_ref/u:DI ("*.LC0") [flags 0x82])
(reg:DI 2 2)
] UNSPEC_TOCREL) [2  S8 A8])
fold_const:
(const_int 140737556512769 [0x800004100001])

It would be a way to keep the data in memory(.rodata) through adjusting the
cost of constant.
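
The expand-time condition quoted above can be sketched as follows. This is only an illustration: the function and parameter names are hypothetical, the insn count is passed in rather than computed the way num_insns_constant does, and the later cse1 cost comparison is not modelled.

```python
def prefer_memory(num_insns, cmodel_small):
    """Mirror the quoted condition: keep the constant in memory when
    building it needs more insns than the threshold (2 for the small
    code model, 3 otherwise)."""
    threshold = 2 if cmodel_small else 3
    return num_insns > threshold

# A 5-insn constant is placed in memory at expand under either code model;
# a 3-insn constant only under the small code model.
for num_insns in (1, 2, 3, 5):
    print(num_insns,
          prefer_memory(num_insns, cmodel_small=True),
          prefer_memory(num_insns, cmodel_small=False))
```

Since cse1 later compares rtx costs and folds the load back into the constant, the threshold on its own does not decide the final code.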

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2021-12-21 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #13 from Segher Boessenkool  ---
If we need more than three insns to create a constant we are better off loading
it from memory, in all cases.  Maybe three is too much already, at least on
some processors?

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2021-12-21 Thread segher at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #12 from Segher Boessenkool  ---
This is my g:72b2f3317b44, two years and a day old :-)

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2021-12-21 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #11 from Jiu Fu Guo  ---
However, for the constant Bill mentioned in comment 9, 0x800004100001, the
code sequence still contains several instructions, e.g.:
li %r11,0
ori %r11,%r11,0x8000
sldi %r11,%r11,32
oris %r11,%r11,0x410
ori %r11,%r11,0x1
std %r11,0(%r3)
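
To check what the five-instruction sequence above computes, here is a small register-level sketch. The helper functions are simplified models written for this note, not an official simulator; each one models only the operand forms used in the listing.

```python
MASK64 = (1 << 64) - 1

def li(imm16):
    """li rD,imm: load a sign-extended 16-bit immediate."""
    imm16 &= 0xFFFF
    return (imm16 - 0x10000 if imm16 & 0x8000 else imm16) & MASK64

def ori(reg, imm16):
    """ori rA,rS,imm: OR with a zero-extended 16-bit immediate."""
    return reg | (imm16 & 0xFFFF)

def oris(reg, imm16):
    """oris rA,rS,imm: OR with the immediate shifted left by 16 bits."""
    return reg | ((imm16 & 0xFFFF) << 16)

def sldi(reg, sh):
    """sldi rA,rS,sh: shift left doubleword, dropping bits above bit 63."""
    return (reg << sh) & MASK64

def build_constant():
    """Replay the li/ori/sldi/oris/ori chain from the listing."""
    r11 = li(0)
    r11 = ori(r11, 0x8000)
    r11 = sldi(r11, 32)
    r11 = oris(r11, 0x410)
    r11 = ori(r11, 0x1)
    return r11

print(hex(build_constant()))
```

The result is 0x800004100001 (140737556512769 in decimal). Every step reads the register written by the previous one, so the five instructions form a single dependency chain.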

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2021-12-21 Thread guojiufu at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

Jiu Fu Guo  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |guojiufu at gcc dot gnu.org

--- Comment #10 from Jiu Fu Guo  ---
With the latest trunk (AT14 is similar), the generated code looks like this:

-O
lis %r9,0x8123
ori %r9,%r9,0x4567
rldimi %r9,%r9,32,0
std %r9,0(%r10)

Or 
-O3
lis %r11,0x1234
lis %r31,0x2345
lis %r12,0x3456
ori %r11,%r11,0x5678
ori %r31,%r31,0x6781
ori %r12,%r12,0x7812
rldimi %r11,%r11,32,0
rldimi %r31,%r31,32,0
rldimi %r12,%r12,32,0
...

This code seems better than the previous one.
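
As a sanity check of the lis/ori/rldimi pattern above, here is a sketch that models the three instructions and shows how the low 32 bits are copied into the high half. The helpers are simplified models written for this note, and rldimi only handles a non-wrapping mask, which is all the listing needs.

```python
MASK64 = (1 << 64) - 1

def lis(imm16):
    """lis rD,imm: load imm << 16, sign-extended from 32 to 64 bits."""
    value = (imm16 & 0xFFFF) << 16
    if value & 0x80000000:
        value |= 0xFFFFFFFF00000000
    return value

def ori(reg, imm16):
    """ori rA,rS,imm: OR with a zero-extended 16-bit immediate."""
    return reg | (imm16 & 0xFFFF)

def rotl64(value, sh):
    """Rotate a 64-bit value left by sh bits."""
    sh %= 64
    return ((value << sh) | (value >> (64 - sh))) & MASK64

def rldimi(ra, rs, sh, mb):
    """rldimi rA,rS,sh,mb: rotate rS left by sh and insert it into rA
    under a mask covering IBM bit positions mb..63-sh (bit 0 is the MSB).
    Only the non-wrapping case mb <= 63-sh is modelled here."""
    mask = 0
    for bit in range(mb, 63 - sh + 1):
        mask |= 1 << (63 - bit)
    return (rotl64(rs, sh) & mask) | (ra & ~mask & MASK64)

def dup32(hi16, lo16):
    """lis/ori to form a 32-bit value, then rldimi r,r,32,0 to copy it up."""
    r = ori(lis(hi16), lo16)
    return rldimi(r, r, 32, 0)

print(hex(dup32(0x8123, 0x4567)))
print(hex(dup32(0x1234, 0x5678)))
```

For the -O listing, lis sign-extends 0x81230000 and leaves ones in the high half, but rldimi overwrites exactly those bits, so the result is 0x8123456781234567. This three-instruction form applies only when the two 32-bit halves of the constant are equal.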

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2018-06-14 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #9 from Bill Schmidt  ---
Also reported by Donald Stence this week:

The compiler produces excessive sequences to synthesize some literal constants.
This contributes excess path length and potentially latency.
Constants requiring only 2 or 3 instructions are acceptable. More than 3 should
be optimized via a load from the GOT (i.e., data in GOT).

Compile the test case with either -O or -O3, default processor (AT-11.0-0).

Example constant from perlbench: 0x800004100001.
Resulting sequence:
li   3,0
ori  3,3,0x8000
sldi 3,3,32
oris 3,3,0x410
ori  3,3,0x1

Loading the constants was ~20% faster on the block of some 30 instructions
before the switch in the top function of perlbench (S_regmatch). That section
of code contained two longer sequences (one of 4 instructions, the other of 5,
though the 5-instruction one could be done in 4, as [Segher] pointed out); the
rest was 1- or 2-instruction constant synthesis or addi, plus an out-of-bounds
check for the switch. I replaced the two longer sequences with ld off r2 to get
the ~20%. Of course, this is in isolation, but I believe it to be sound.

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2017-09-16 Thread amodra at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

--- Comment #8 from Alan Modra  ---
Created attachment 42187
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42187&action=edit
[RS6000] Address cost

Somewhat related, costing constants properly also needs a proper cost to
loading from memory.

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2017-09-16 Thread amodra at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

Alan Modra  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #33503|0                           |1
        is obsolete|                            |

--- Comment #7 from Alan Modra  ---
Created attachment 42186
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42186&action=edit
[RS6000] Cost multi-insn constants

I think the patches aren't worth pursuing *until* insn and rtl costing is
fixed.  That's the really big problem, and I think fixing it will require
someone willing to regress things for a while on all targets.  I looked at
doing that some years ago and came to the conclusion that I didn't have the
reputation in the gcc community for anything I did to be accepted.

Without fixed costing, even if we emit what we think is better code in
rs6000_emit_set_const, optimization passes may transform that code to
non-optimal sequences.

So, within the current broken rtx costing, the attached patch teaches gcc to
cost multi-insn constants.

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2017-09-15 Thread kelvin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

kelvin at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kelvin at gcc dot gnu.org

--- Comment #6 from kelvin at gcc dot gnu.org ---
It appears that Alan's proposed patches will "improve" the generated code, but
those patches were never proposed for incorporation into the trunk.  Can Alan
clarify his thoughts on this?

I can incorporate those patches and do the regression testing if we think
that's desirable.  But it seems that we might not have consensus on what should
be done here.

[Bug rtl-optimization/63281] powerpc64le creates 64 bit constants from scratch instead of loading them

2016-10-27 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63281

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2016-10-27
          Component|target                      |rtl-optimization
     Ever confirmed|0                           |1

--- Comment #5 from Andrew Pinski  ---
Actually this should just create one constant in registers and then rotate
them.  We are able to handle the +1 case just fine.
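
The rotate idea can be illustrated with the constants that the -O3 listing in comment #10 appears to build. The values below are inferred from that listing's lis/ori/rldimi operands, not taken from the original test case, and the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def rotl64(value, sh):
    """Rotate a 64-bit value left by sh bits."""
    sh %= 64
    return ((value << sh) | (value >> (64 - sh))) & MASK64

def find_rotation(base, target):
    """Return the smallest left-rotate count turning base into target,
    or None when target is not a rotation of base."""
    for sh in range(64):
        if rotl64(base, sh) == target:
            return sh
    return None

# Constants inferred from the lis/ori/rldimi operands in comment #10.
base = 0x1234567812345678
for target in (0x2345678123456781, 0x3456781234567812):
    print(hex(target), "is base rotated left by", find_rotation(base, target))
```

If the constants are related this way, one of them could be built (or loaded) once and the others derived with a single rotate instruction each.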