On Mon, Mar 3, 2014 at 9:40 AM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Mon, Mar 03, 2014 at 11:02:14AM +0800, lin zuojian wrote:
>>    I wrote a test code like this:
>> void foo(int * a)
>> {
>>     a[0] = 0xfafafafb;
>>     a[1] = 0xfafafafc;
>>     a[2] = 0xfafafafe;
>>     a[3] = 0xfafafaff;
>>     a[4] = 0xfafafaf0;
>>     a[5] = 0xfafafaf1;
>>     a[6] = 0xfafafaf2;
>>     a[7] = 0xfafafaf3;
>>     a[8] = 0xfafafaf4;
>>     a[9] = 0xfafafaf5;
>>     a[10] = 0xfafafaf6;
>>     a[11] = 0xfafafaf7;
>>     a[12] = 0xfafafaf8;
>>     a[13] = 0xfafafaf9;
>>     a[14] = 0xfafafafa;
>>     a[15] = 0xfafaf0fa;
>> }
>> that was what gcc generated:
>>       movl    $-84215045, (%rdi)
>>       movl    $-84215044, 4(%rdi)
>>       movl    $-84215042, 8(%rdi)
>>       movl    $-84215041, 12(%rdi)
>>       movl    $-84215056, 16(%rdi)
>>     ...
>> that was what LLVM/clang generated:
>>       movabsq $-361700855600448773, %rax # imm = 0xFAFAFAFCFAFAFAFB
>>       movq    %rax, (%rdi)
>>       movabsq $-361700842715546882, %rax # imm = 0xFAFAFAFFFAFAFAFE
>>       movq    %rax, 8(%rdi)
>>       movabsq $-361700902845089040, %rax # imm = 0xFAFAFAF1FAFAFAF0
>>       movq    %rax, 16(%rdi)
>>       movabsq $-361700894255154446, %rax # imm = 0xFAFAFAF3FAFAFAF2
>>     ...
>> I ran the code on my i7 machine 10000000000 times. Here is the result:
>> gcc:
>> real  0m50.613s
>> user  0m50.559s
>> sys   0m0.000s
>>
>> LLVM/clang:
>> real  0m32.036s
>> user  0m32.001s
>> sys   0m0.000s
>>
>> That means movabsq did a better job!
>> Should the gcc peephole pass add such a combine?
>
> This sounds like PR22141, but a microbenchmark isn't the right thing
> to decide this.  From what I remember when playing with the patches,
> movabsq has been mostly bad for performance, at least on the CPUs I
> tried it on back then.  Besides the question of whether movabsq + movq
> is more beneficial than two movl, alignment also plays a role here:
> if this is in an inner loop and not aligned to 64 bits, it may slow
> things down too much.
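
For reference, the movabsq + movq form amounts to merging two adjacent
32-bit stores into one 64-bit store.  A minimal C sketch of that, assuming
a little-endian target such as x86-64; the names combine_pair and
store_pair are made up for illustration and are not taken from gcc or
clang:

```c
#include <stdint.h>
#include <string.h>

/* Build the 64-bit immediate that covers two adjacent 32-bit stores.
   On a little-endian target the lower-addressed word is the low half. */
uint64_t combine_pair(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Store both words with one 8-byte copy.  memcpy makes no assumption
   that dst is 8-byte aligned; whether the compiler turns it into a
   single movq depends on the compiler and the flags. */
void store_pair(uint32_t *dst, uint32_t lo, uint32_t hi)
{
    uint64_t v = combine_pair(lo, hi);
    memcpy(dst, &v, sizeof v);
}
```

For a[0] and a[1] of the testcase, combine_pair yields 0xFAFAFAFCFAFAFAFB,
the immediate shown in the clang output.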

Also, the micro-benchmark might be best optimized by memset (, 0xfa, )
followed by a set of byte stores?  That would also be interesting when
optimizing this artificial testcase for -Os ...
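
Spelled out by hand, the memset variant could look roughly like the
sketch below.  It assumes a little-endian target and uses uint32_t
instead of int to sidestep the signed conversion of the constants; the
function names are made up for illustration.  Whether it is actually
smaller or faster than the sixteen word stores would have to be measured.

```c
#include <stdint.h>
#include <string.h>

/* Reference: the sixteen word stores from the testcase. */
void foo_words(uint32_t *a)
{
    a[0] = 0xfafafafb;  a[1] = 0xfafafafc;  a[2] = 0xfafafafe;
    a[3] = 0xfafafaff;  a[4] = 0xfafafaf0;  a[5] = 0xfafafaf1;
    a[6] = 0xfafafaf2;  a[7] = 0xfafafaf3;  a[8] = 0xfafafaf4;
    a[9] = 0xfafafaf5;  a[10] = 0xfafafaf6; a[11] = 0xfafafaf7;
    a[12] = 0xfafafaf8; a[13] = 0xfafafaf9; a[14] = 0xfafafafa;
    a[15] = 0xfafaf0fa;
}

/* Fill all 64 bytes with 0xfa, then patch the bytes that differ.
   Little-endian assumption: the low byte of a[i] sits at offset 4*i. */
void foo_memset(uint32_t *a)
{
    static const unsigned char low[14] = {
        0xfb, 0xfc, 0xfe, 0xff, 0xf0, 0xf1, 0xf2,
        0xf3, 0xf4, 0xf5, 0xf6, 0xf7, 0xf8, 0xf9
    };
    unsigned char *p = (unsigned char *)a;
    size_t i;

    memset(p, 0xfa, 16 * sizeof *a);
    for (i = 0; i < 14; i++)
        p[4 * i] = low[i];      /* a[0] .. a[13]: only the low byte differs */
    /* a[14] is 0xfafafafa already; a[15] differs in its second byte. */
    p[4 * 15 + 1] = 0xf0;
}

/* Returns 1 when both variants leave identical bytes in memory. */
int foo_variants_agree(void)
{
    uint32_t x[16], y[16];

    foo_words(x);
    foo_memset(y);
    return memcmp(x, y, sizeof x) == 0;
}
```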

Looks like a candidate for collecting the stores into something like

 static const init[] = { 0xfa.... };
 *ptr = init;

and deciding on an optimal expansion strategy with the help of
target-specific code.  With a good implementation of that we could avoid
most constructor lowering in gimplification (at least most of the
middle-end passes happily look up the constructor values from
inits like the one above and accesses to sub-parts of *ptr).
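
At the source level, the collected form would be roughly equivalent to
the following sketch.  It uses uint32_t and an explicit memcpy in place
of the aggregate assignment above, and foo_copy is a made-up name; how
the copy gets expanded is exactly the part that would be left to the
target.

```c
#include <stdint.h>
#include <string.h>

/* The sixteen constants from the testcase, collected in one initializer. */
static const uint32_t init[16] = {
    0xfafafafb, 0xfafafafc, 0xfafafafe, 0xfafafaff,
    0xfafafaf0, 0xfafafaf1, 0xfafafaf2, 0xfafafaf3,
    0xfafafaf4, 0xfafafaf5, 0xfafafaf6, 0xfafafaf7,
    0xfafafaf8, 0xfafafaf9, 0xfafafafa, 0xfafaf0fa
};

/* One block copy replaces the sixteen individual stores. */
void foo_copy(uint32_t *a)
{
    memcpy(a, init, sizeof init);
}
```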

Richard.

>         Jakub
