"Ondřej Bílka" <nel...@seznam.cz> wrote:
>On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
>> Also keep in mind that usually costs go up significantly if
>> misalignment causes cache line splits (the processor will fetch 2
>> lines).  There are non-linear costs of filling up the store queue in
>> modern out-of-order processors (x86).  The bottom line is that it's
>> much better to peel, e.g. for AVX2/AVX3, if the loop would otherwise
>> issue loads that cross cache line boundaries.  The solution is to
>> either actually always peel for alignment, or insert an additional
>> check for cache line boundaries (for high trip count loops).
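For context, a minimal C sketch of what "peeling for alignment" means is given below.  The 64-byte line size, the function name and the simple loop body are illustrative assumptions, not code from either poster.

#include <stdint.h>

/* Sketch only: run scalar iterations until the pointer reaches a 64-byte
   boundary, then run the main loop on aligned data so vector accesses no
   longer split cache lines.  Real vectorizers peel to the vector width,
   version the loop and add an epilogue; this only shows the idea. */
void
scale_peeled (int *p, int n)
{
  int i = 0;
  /* Scalar prologue: peel until p + i sits on a 64-byte boundary. */
  while (i < n && ((uintptr_t) (p + i) & 63) != 0)
    {
      p[i] = 42 * p[i];
      i++;
    }
  /* Main loop: vector accesses starting at p + i no longer cross a
     cache line (for vector widths up to 64 bytes). */
  for (; i < n; i++)
    p[i] = 42 * p[i];
}

A real peeled loop also has to cope with unknown trip counts and runtime alignment, which is where the branch-misprediction cost discussed below comes in.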
>
>That is quite a bold claim; do you have a benchmark to support it?
>
>Since Nehalem there has been no overhead for unaligned SSE loads beyond
>fetching the extra cache line.  As of Haswell, AVX2 loads behave in a
>similar way.
>
>You are forgetting that the loop needs both cache lines when it issues
>an unaligned load.  That will generally take the maximum of the times
>needed to access those lines.  With peeling you access the first cache
>line, and only afterwards, in the loop, the second one, effectively
>doubling the running time when both lines were in main memory.
>
>You also need to account for all factors, not just show that one factor
>is expensive.  There are several factors in play; the cost of branch
>misprediction is the main argument against doing peeling, so you need to
>show that the cost of unaligned loads is bigger than the cost of branch
>mispredictions in a peeled implementation.
>
>As a quick example of why peeling is generally a bad idea I did a simple
>benchmark.  Could somebody with a Haswell also test the attached code
>generated by gcc -O3 -march=core-avx2 (files set[13]_avx2.s)?
>
>For the test we repeatedly call a function set() with a pointer randomly
>picked from a 262144-byte region to stress the L2 cache; the relevant
>tester is the following (file test.c):
>
>for (i=0; i<100000000; i++){
>  set (ptr + 64 * (p % (SIZE / 64) + 60),
>       ptr2 + 64 * (q % (SIZE / 64) + 60));
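The quoted fragment above omits most of test.c; a self-contained harness in the same spirit might look like the following sketch.  The malloc scaffolding, buffer sizes and pointer-update step are assumptions for illustration, since only the inner set() call is quoted.

#include <stdlib.h>

#ifndef SIZE
# define SIZE 262144
#endif

void set (int *p, int *q);

int
main (void)
{
  /* Buffers sized with slack so the +60 line offset and the 128-int
     accesses in set() stay in bounds. */
  char *ptr = malloc (2 * SIZE), *ptr2 = malloc (2 * SIZE);
  unsigned long i, p = 0, q = 0;

  if (!ptr || !ptr2)
    return 1;

  for (i = 0; i < 100000000; i++)
    {
      set ((int *) (ptr + 64 * (p % (SIZE / 64) + 60)),
           (int *) (ptr2 + 64 * (q % (SIZE / 64) + 60)));
      /* Advance p and q by a cheap pseudo-random step (details assumed). */
      p = p * 6364136223846793005UL + 1;
      q = q * 6364136223846793005UL + 3;
    }
  return 0;
}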
>
>First we vectorize the following function.  The vectorizer does peeling
>here (the assembly is a bit long, see file set1.s):
>
>void set(int *p, int *q){
>  int i;
>  for (i=0; i<128; i++)
>     p[i] = 42 * p[i];
>}
>
>When I ran it I got
>
>$ gcc -O3 -DSIZE= -c test.c
>$ gcc test.o set1.s
>$ time ./a.out
>
>real   0m3.724s
>user   0m3.724s
>sys    0m0.000s
>
>Now what happens if we use separate input and output arrays?  The gcc
>vectorizer fortunately does not peel in this case (file set2.s), which
>gives better performance:
>
>void set(int *p, int *q){
>  int i;
>  for (i=0; i<128; i++)
>     p[i] = 42 * q[i];
>}
>
>$ gcc test.o set2.s
>$ time ./a.out
>
>real   0m3.169s
>user   0m3.170s
>sys    0m0.000s
>
>
>The speedup here can be partially explained by the fact that in-place
>modifications run slower.  To eliminate this possibility we change the
>assembly to make the input the same as the output (file set3.s):
>
>       jb      .L15
> .L7:
>       xorl    %eax, %eax
>+      movq    %rdi, %rsi
>       .p2align 4,,10
>       .p2align 3
> .L5:
>
>$ gcc test.o set3.s
>$ time ./a.out
>
>real   0m3.169s
>user   0m3.170s
>sys    0m0.000s
>
>This is still faster than what the peeling vectorizer generated.
>
>And in this test I did not vary the alignment; it is constant, so branch
>misprediction is not an issue.

IIRC what can still be seen are store-buffer related slowdowns when you have a 
big unaligned store and load in your loop.  Thus aligning stores still paid off 
the last time I measured this.

Richard.
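
Richard's remark about aligning stores can be illustrated with a small C sketch, not taken from either poster: when the load and store streams cannot both be aligned, peel so that the stores are aligned, since unaligned stores are what interact badly with the store buffer.

#include <stdint.h>

/* Sketch under the assumption above: peel based on the alignment of the
   store pointer p rather than the load pointer q, so the vectorized body
   issues aligned stores even when the loads remain unaligned. */
void
copy_scaled (int *p, const int *q, int n)
{
  int i = 0;
  /* Prologue: peel until the store target p + i is 32-byte aligned
     (one AVX2 vector). */
  while (i < n && ((uintptr_t) (p + i) & 31) != 0)
    {
      p[i] = 42 * q[i];
      i++;
    }
  /* Main loop: stores through p + i are 32-byte aligned; loads from
     q + i may still be unaligned, which is comparatively cheap on
     Nehalem and later. */
  for (; i < n; i++)
    p[i] = 42 * q[i];
}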

