On Mon, Mar 3, 2014 at 9:40 AM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Mon, Mar 03, 2014 at 11:02:14AM +0800, lin zuojian wrote:
>> I wrote a test code like this:
>> void foo(int * a)
>> {
>>     a[0] = 0xfafafafb;
>>     a[1] = 0xfafafafc;
>>     a[2] = 0xfafafafe;
>>     a[3] = 0xfafafaff;
>>     a[4] = 0xfafafaf0;
>>     a[5] = 0xfafafaf1;
>>     a[6] = 0xfafafaf2;
>>     a[7] = 0xfafafaf3;
>>     a[8] = 0xfafafaf4;
>>     a[9] = 0xfafafaf5;
>>     a[10] = 0xfafafaf6;
>>     a[11] = 0xfafafaf7;
>>     a[12] = 0xfafafaf8;
>>     a[13] = 0xfafafaf9;
>>     a[14] = 0xfafafafa;
>>     a[15] = 0xfafaf0fa;
>> }
>> that was what gcc generated:
>>     movl $-84215045, (%rdi)
>>     movl $-84215044, 4(%rdi)
>>     movl $-84215042, 8(%rdi)
>>     movl $-84215041, 12(%rdi)
>>     movl $-84215056, 16(%rdi)
>>     ...
>> that was what LLVM/clang generated:
>>     movabsq $-361700855600448773, %rax # imm = 0xFAFAFAFCFAFAFAFB
>>     movq %rax, (%rdi)
>>     movabsq $-361700842715546882, %rax # imm = 0xFAFAFAFFFAFAFAFE
>>     movq %rax, 8(%rdi)
>>     movabsq $-361700902845089040, %rax # imm = 0xFAFAFAF1FAFAFAF0
>>     movq %rax, 16(%rdi)
>>     movabsq $-361700894255154446, %rax # imm = 0xFAFAFAF3FAFAFAF2
>>     ...
>> I ran the code on my i7 machine for 10000000000 times. Here was the result:
>> gcc:
>> real 0m50.613s
>> user 0m50.559s
>> sys 0m0.000s
>>
>> LLVM/clang:
>> real 0m32.036s
>> user 0m32.001s
>> sys 0m0.000s
>>
>> That mean movabsq did do a better job!
>> Should gcc peephole pass add such a combine?
>
> This sounds like PR22141, but a microbenchmark isn't the right thing
> to decide this. From what I remember when playing with the patches,
> movabsq has been mostly bad for performance, at least on the CPUs I've tried
> it back then. In addition to whether movabsq + movq compared to two movl
> is more beneficial, also alignment plays role here, say if this is in an
> inner loop and not aligned to 64-bits whether it won't slow things down too
> much.

Also, the micro-benchmark may be best optimized by memset (, 0xfa, ) and a
set of byte stores?  It would also be interesting to optimize this
artificial testcase for -Os ...

Looks like a candidate for collecting something like

  static const init[] = { 0xfa.... };
  *ptr = init;

and deciding on an optimal expansion strategy with the help of
target-specific code.  With a good implementation of that we could avoid
most constructor lowering in gimplification (at least most of the
middle-end passes happily look up the constructor values from inits like
the one above and accesses to sub-parts of *ptr).

Richard.

> Jakub
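
For illustration only, here is a minimal C sketch of the two expansion strategies mentioned in the reply: a block copy from a constant table, and memset followed by byte stores. It is not GCC code. The function names (fill_from_table, fill_memset_bytes, pair64, strategies_agree) are hypothetical, uint32_t is used instead of the test case's int to keep the constants well defined, and the byte offsets assume a little-endian target such as x86-64.

```c
#include <stdint.h>
#include <string.h>

/* The 16 constants from the test case in the quoted message. */
static const uint32_t init[16] = {
    0xfafafafbu, 0xfafafafcu, 0xfafafafeu, 0xfafafaffu,
    0xfafafaf0u, 0xfafafaf1u, 0xfafafaf2u, 0xfafafaf3u,
    0xfafafaf4u, 0xfafafaf5u, 0xfafafaf6u, 0xfafafaf7u,
    0xfafafaf8u, 0xfafafaf9u, 0xfafafafau, 0xfafaf0fau,
};

/* Strategy 1 (hypothetical name): one block copy from the constant
   table, leaving the choice of store width to the memcpy expansion. */
void fill_from_table(uint32_t *a)
{
    memcpy(a, init, sizeof init);
}

/* Strategy 2 (hypothetical name): fill everything with 0xfa, then
   patch only the bytes that differ.  ASSUMES little-endian layout. */
void fill_memset_bytes(uint32_t *a)
{
    /* Low bytes of a[0]..a[13]; a[14] is all 0xfa and needs no patch. */
    static const unsigned char low[14] = {
        0xfb, 0xfc, 0xfe, 0xff, 0xf0, 0xf1, 0xf2,
        0xf3, 0xf4, 0xf5, 0xf6, 0xf7, 0xf8, 0xf9,
    };
    unsigned char *p = (unsigned char *)a;

    memset(a, 0xfa, 16 * sizeof *a);
    for (int i = 0; i < 14; i++)
        p[4 * i] = low[i];      /* byte 0 of element i */
    p[4 * 15 + 1] = 0xf0;       /* a[15] = 0xfafaf0fa differs in byte 1 */
}

/* The 64-bit immediate that covers elements 2*i and 2*i+1 when two
   32-bit stores are merged into one on a little-endian target. */
uint64_t pair64(int i)
{
    return ((uint64_t)init[2 * i + 1] << 32) | init[2 * i];
}

/* Returns 1 if both strategies produce the same 64 bytes. */
int strategies_agree(void)
{
    uint32_t x[16], y[16];

    fill_from_table(x);
    fill_memset_bytes(y);
    return memcmp(x, y, sizeof x) == 0;
}
```

On a little-endian machine strategies_agree() should return 1, and pair64(0) gives 0xFAFAFAFCFAFAFAFB, the first immediate in the clang output quoted above. The sketch says nothing about which strategy is faster; as noted in the thread, that depends on the CPU and on alignment.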