Try the following test, comparing 1) -minline-all-stringops -mstringop-strategy=rep_8byte -O2 vs 2) -mstringop-strategy=libcall -O2.
David

#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef LEN
#define LEN 16
#endif

void copy(char* s1, char* s2, int len) __attribute__((noinline));

void copy(char* s1, char* s2, int len)
{
  memcpy(s2, s1, len);
}

int main()
{
  char* s1 = (char*) malloc(LEN + 10);
  char* s2 = (char*) malloc(LEN + 10);
  int i = 0;
  for (i = 0; i < 1000000000; i++)
    {
      copy(s1 + 1, s2 + 3, LEN);
    }
}

On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> >> > libcall is not faster up to 8KB to rep sequence that is better for
>> >> > regalloc/code cache than fully blowin function call.
>> >>
>> >> Be careful with this. My recollection is that REP sequence is good for
>> >> any size -- for smaller size, the REP initial set up cost is too high
>> >> (10s of cycles), while for large size copy, it is less efficient
>> >> compared with library version.
>> >
>> > Well this is based on the data from the memtest script.
>> > Core has good REP implementation - it is a win from rather small blocks
>> > (16 bytes if I recall) and it does not need alignment.
>> > Library version starts to be interesting with caching hints, but I think
>> > till 80KB it is still not a win for my setup (glibc-2.15)
>>
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for size
>> smaller than 8 where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest ?
>
> I can't believe that say 16 byte or 32 byte memcpy can be ever faster using a
> libcall. The PLT call overhead is simply too high.
>
> Jakub
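The test program in this message produces no output of its own, so the two builds have to be compared with an external timer. A minimal sketch of an in-process alternative follows. It keeps the same non-inlined copy() wrapper and the same unaligned offsets (+1 and +3); the helpers now_sec() and ns_per_copy() are hypothetical names added for illustration and were not part of the original test, and the use of clock_gettime(CLOCK_MONOTONIC) is an assumption about the target system.

```c
#define _POSIX_C_SOURCE 199309L  /* assumption: POSIX system, needed for clock_gettime */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Same wrapper as in the test: noinline keeps the memcpy behind a real call,
   so the chosen -mstringop-strategy is what gets measured. */
void copy(char *s1, char *s2, int len) __attribute__((noinline));
void copy(char *s1, char *s2, int len)
{
    memcpy(s2, s1, len);
}

/* Hypothetical helper: monotonic clock reading in seconds. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: average nanoseconds per copy() call for a given
   length, using the same unaligned source/destination offsets as the test. */
double ns_per_copy(int len, long iters)
{
    assert(len > 0 && iters > 0);

    char *s1 = malloc((size_t)len + 10);
    char *s2 = malloc((size_t)len + 10);
    assert(s1 != NULL && s2 != NULL);

    /* Initialize both buffers so the copy reads defined data. */
    memset(s1, 'x', (size_t)len + 10);
    memset(s2, 0, (size_t)len + 10);

    double start = now_sec();
    for (long i = 0; i < iters; i++)
        copy(s1 + 1, s2 + 3, len);
    double elapsed = now_sec() - start;

    /* Reading the destination checks the copy and keeps it observable. */
    assert(memcmp(s1 + 1, s2 + 3, (size_t)len) == 0);

    free(s1);
    free(s2);
    return elapsed * 1e9 / (double)iters;
}
```

Calling ns_per_copy(LEN, 1000000000L) from main and printing the result, once per build, gives a per-call figure for each set of flags. Defining LEN with -DLEN=... at compile time, as the original test allows, covers the different sizes under discussion.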