"Ondřej Bílka" <nel...@seznam.cz> wrote: >On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote: >> Also keep in mind that usually costs go up significantly if >> misalignment causes cache line splits (processor will fetch 2 lines). >> There are non-linear costs of filling up the store queue in modern >> out-of-order processors (x86). Bottom line is that it's much better >to >> peel e.g. for AVX2/AVX3 if the loop would cause loads that cross >cache >> line boundaries otherwise. The solution is to either actually always >> peel for alignment, or insert an additional check for cache line >> boundaries (for high trip count loops). > >That is quite bold claim do you have a benchmark to support that? > >Since nehalem there is no overhead of unaligned sse loads except of >fetching >cache lines. As haswell avx2 loads behave in similar way. > >You are forgetting that loop needs both cache lines when it issues >unaligned load. This will generaly take maximum of times needed to >access these lines. Now with peeling you accesss first cache line, and >after that in loop access the second, effectively doubling running time >when both lines were in main memory. > >You also need to compute all factors not just that one factor is >expensive. There are several factor in plays, cost of branch >misprediction is main argument againist doing peeling, so you need to >show that cost of unaligned loads is bigger than cost of branch >misprediction of a peeled implementation. > >As a quick example why peeling is generaly bad idea I did a simple >benchmark. Could somebody with haswell also test attached code >generated >by gcc -O3 -march=core-avx2 (files set[13]_avx2.s)? > >For the test we repeately call a function set with a pointer randomly >picked from 262144 bytes to stress a L2 cache, relevant tester >is following (file test.c) > >for (i=0;i<100000000;i++){ >set (ptr + 64 * (p % (SIZE /64) + 60), ptr2 + 64 * (q % (SIZE /64) + >60)); > >First vectorize by following function. A vectorizer here does >peeling (assembly is bit long, see file set1.s) > >void set(int *p, int *q){ > int i; > for (i=0; i<128; i++) > p[i] = 42 * p[i]; >} > >When ran it I got > >$ gcc -O3 -DSIZE= test.c >$ gcc test.o set1.s >$ time ./a.out > >real 0m3.724s >user 0m3.724s >sys 0m0.000s > >Now what happens if we use separate input and output arrays? A gcc >vectorizer fortunately does not peel in this case (file set2.s) which >gives better performance > >void set(int *p, int *q){ > int i; > for (i=0; i<128; i++) > p[i] = 42 * q[i]; >} > >$ gcc test.o set2.s >$ time ./a.out > >real 0m3.169s >user 0m3.170s >sys 0m0.000s > > >A speedup here is can be partialy explained by fact that inplace >modifications run slower. To eliminate this possibility we change >assembly to make input same as output (file set3.s) > > jb .L15 > .L7: > xorl %eax, %eax >+ movq %rdi, %rsi > .p2align 4,,10 > .p2align 3 > .L5: > >$ gcc test.o set3.s >$ time ./a.out > >real 0m3.169s >user 0m3.170s >sys 0m0.000s > >Which is still faster than what peeling vectorizer generated. > >And in this test I did not alignment is constant so branch >misprediction >is not a issue.
IIRC what can still be seen are store-buffer related slowdowns when you
have a big unaligned store feeding a load in your loop (store-to-load
forwarding can fail when the store is misaligned). Thus aligning stores
still paid back the last time I measured this.

Richard.
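For completeness, a minimal sketch of that store-aligned strategy with
AVX2 intrinsics: peel only until the destination is aligned, then use
aligned stores with possibly unaligned loads. The function name and the
scalar prologue/epilogue are illustrative assumptions (compile with
-mavx2); this is not the vectorizer's actual output:

#include <stdint.h>
#include <immintrin.h>

/* Align only the stores: peel until p is 32-byte aligned, then use
   aligned stores while the loads from q may still be unaligned. */
void set_store_aligned(int *p, const int *q, int n)
{
    const __m256i forty_two = _mm256_set1_epi32(42);
    int i = 0;
    /* Scalar prologue: peel until the store pointer is aligned. */
    while ((((uintptr_t)(p + i)) & 31) != 0 && i < n) {
        p[i] = 42 * q[i];
        i++;
    }
    /* Main loop: aligned stores, unaligned loads. */
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(q + i));
        _mm256_store_si256((__m256i *)(p + i),
                           _mm256_mullo_epi32(v, forty_two));
    }
    /* Scalar epilogue for the remainder. */
    for (; i < n; i++)
        p[i] = 42 * q[i];
}

Whether this wins still comes down to the trade-off discussed above:
the prologue's extra branch versus the store-buffer stalls from
cache-line-splitting stores.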