On Mon, Mar 3, 2014 at 9:40 AM, Jakub Jelinek <[email protected]> wrote:
> On Mon, Mar 03, 2014 at 11:02:14AM +0800, lin zuojian wrote:
>> I wrote some test code like this:
>> void foo(int * a)
>> {
>>     a[0] = 0xfafafafb;
>>     a[1] = 0xfafafafc;
>>     a[2] = 0xfafafafe;
>>     a[3] = 0xfafafaff;
>>     a[4] = 0xfafafaf0;
>>     a[5] = 0xfafafaf1;
>>     a[6] = 0xfafafaf2;
>>     a[7] = 0xfafafaf3;
>>     a[8] = 0xfafafaf4;
>>     a[9] = 0xfafafaf5;
>>     a[10] = 0xfafafaf6;
>>     a[11] = 0xfafafaf7;
>>     a[12] = 0xfafafaf8;
>>     a[13] = 0xfafafaf9;
>>     a[14] = 0xfafafafa;
>>     a[15] = 0xfafaf0fa;
>> }
>> This is what gcc generated:
>> movl $-84215045, (%rdi)
>> movl $-84215044, 4(%rdi)
>> movl $-84215042, 8(%rdi)
>> movl $-84215041, 12(%rdi)
>> movl $-84215056, 16(%rdi)
>> ...
>> This is what LLVM/clang generated:
>> movabsq $-361700855600448773, %rax # imm = 0xFAFAFAFCFAFAFAFB
>> movq %rax, (%rdi)
>> movabsq $-361700842715546882, %rax # imm = 0xFAFAFAFFFAFAFAFE
>> movq %rax, 8(%rdi)
>> movabsq $-361700902845089040, %rax # imm = 0xFAFAFAF1FAFAFAF0
>> movq %rax, 16(%rdi)
>> movabsq $-361700894255154446, %rax # imm = 0xFAFAFAF3FAFAFAF2
>> ...
>> I ran the code on my i7 machine 10000000000 times. Here is the result:
>> gcc:
>> real 0m50.613s
>> user 0m50.559s
>> sys 0m0.000s
>>
>> LLVM/clang:
>> real 0m32.036s
>> user 0m32.001s
>> sys 0m0.000s
>>
>> That means movabsq did a better job!
>> Should gcc's peephole pass add such a combine?
>
> This sounds like PR22141, but a microbenchmark isn't the right thing
> to decide this. From what I remember when playing with the patches,
> movabsq was mostly bad for performance, at least on the CPUs I tried
> it on back then. Besides the question of whether movabsq + movq is
> more beneficial than two movl, alignment also plays a role here: say,
> if this is in an inner loop and not aligned to 64 bits, whether it
> won't slow things down too much.
Also, the micro-benchmark may be best optimized by memset (, 0xfa, )
followed by a set of byte stores? That would also be interesting when
optimizing this artificial testcase for -Os ...
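
To make the memset idea concrete, here is a minimal hand-written sketch of
what such an expansion would compute. It is not taken from any patch. The
names foo_words and foo_memset are made up for illustration, uint32_t is
used instead of the int of the original testcase to keep the constants
well defined, and the byte offsets assume a little-endian target such as
x86-64.

```c
#include <stdint.h>
#include <string.h>

/* Reference version: the sixteen 32-bit stores from the testcase. */
void foo_words(uint32_t *a)
{
    a[0] = 0xfafafafb;  a[1] = 0xfafafafc;
    a[2] = 0xfafafafe;  a[3] = 0xfafafaff;
    a[4] = 0xfafafaf0;  a[5] = 0xfafafaf1;
    a[6] = 0xfafafaf2;  a[7] = 0xfafafaf3;
    a[8] = 0xfafafaf4;  a[9] = 0xfafafaf5;
    a[10] = 0xfafafaf6; a[11] = 0xfafafaf7;
    a[12] = 0xfafafaf8; a[13] = 0xfafafaf9;
    a[14] = 0xfafafafa; a[15] = 0xfafaf0fa;
}

/* Hypothetical alternative: fill with 0xfa, then patch the bytes that
   differ.  ASSUMPTION: little-endian layout, so the low byte of a[i]
   lives at byte offset 4 * i.  */
void foo_memset(uint32_t *a)
{
    /* Low bytes of a[0] .. a[13]; a[14] is all 0xfa and needs no patch. */
    static const unsigned char low[14] = {
        0xfb, 0xfc, 0xfe, 0xff, 0xf0, 0xf1, 0xf2,
        0xf3, 0xf4, 0xf5, 0xf6, 0xf7, 0xf8, 0xf9
    };
    unsigned char *p = (unsigned char *)a;

    memset(p, 0xfa, 16 * sizeof(uint32_t));
    for (int i = 0; i < 14; i++)
        p[4 * i] = low[i];
    /* a[15] == 0xfafaf0fa differs in its second-lowest byte only. */
    p[4 * 15 + 1] = 0xf0;
}
```

Both functions leave the same 64 bytes behind on a little-endian host.
Whether one memset plus fifteen byte stores is actually smaller or faster
than sixteen movl instructions is exactly what would need measuring.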
This looks like a candidate for collecting something like
static const init[] = { 0xfa.... };
*ptr = init;
and deciding on an optimal expansion strategy with the help of
target-specific code. With a good implementation of that we could avoid
most constructor lowering in gimplification (at least most of the
middle-end passes happily look up the constructor values from
inits above and accesses to sub-parts of *ptr).
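
At the source level, the collected form could look roughly like the
sketch below. This only illustrates the shape of the idea, not how GCC
represents it internally. struct block, foo_init and foo_from_init are
hypothetical names, and the array is wrapped in a struct only so that
plain assignment is valid C.

```c
#include <stdint.h>

/* Hypothetical wrapper so the whole 64-byte block can be assigned at once. */
struct block {
    uint32_t v[16];
};

/* All sixteen stores collected into one constant initializer. */
static const struct block foo_init = {{
    0xfafafafb, 0xfafafafc, 0xfafafafe, 0xfafafaff,
    0xfafafaf0, 0xfafafaf1, 0xfafafaf2, 0xfafafaf3,
    0xfafafaf4, 0xfafafaf5, 0xfafafaf6, 0xfafafaf7,
    0xfafafaf8, 0xfafafaf9, 0xfafafafa, 0xfafaf0fa
}};

/* A single aggregate copy; how it is expanded (word stores, wider
   stores, memset plus patches, a block move) is left to the compiler. */
void foo_from_init(struct block *ptr)
{
    *ptr = foo_init;
}
```

The point is that the expansion decision is then made once, for the whole
block, where target-specific costs are available.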
Richard.
> Jakub