On Sun, Nov 17, 2013 at 04:42:18PM +0100, Richard Biener wrote:
> "Ondřej Bílka" <[email protected]> wrote:
> >On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
> >> "Ondřej Bílka" <[email protected]> wrote:
> >> >On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> >>
> >> IIRC what can still be seen is store-buffer related slowdowns when
> >> you have a big unaligned store load in your loop. Thus aligning stores
> >> still pays back last time I measured this.
> >
> >Then send your benchmark. What I did is a loop that stores 512 bytes.
> >Unaligned stores there are faster than aligned ones, so tell me when
> >aligning stores pays for itself. Note that when filling the store buffer
> >you must take into account the extra stores needed to make the loop aligned.
>
> The issue is that the effective write bandwidth can be limited by the store
> buffer if you have multiple write streams. IIRC at least some AMD CPUs have
> to use two entries for stores crossing a cache line boundary.
>
Performance can also be limited by branch misprediction. You need to show
that the likely bottleneck is too many writes and not some other factor.
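For reference, the 512-byte store loop I mentioned was essentially of the
following shape (just a sketch from memory; the real harness also varies the
starting alignment of buf and times many calls):

void store512(char *buf)
{
  int i;
  /* 512 bytes of stores; gcc vectorizes this, using aligned or unaligned
     vector stores depending on what it can prove about the alignment of
     buf, possibly after peeling.  */
  for (i = 0; i < 512; i++)
    buf[i] = 42;
}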
> Anyway, a look into the optimization manuals will tell you what to do and the
> cost model should follow these recommendations.
>
These tend to be quite out of date; you typically need to recheck
everything. Take the
Intel® 64 and IA-32 Architectures Optimization Reference Manual
from April 2012.
One suggestion there on store-to-load forwarding is to align loads and
stores to make forwarding work (with Pentium 4 and Core 2 specific advice).
However, this no longer holds since Nehalem: when I test a variant of memcpy
that is unaligned by one byte, the code looks as follows (full benchmark
attached):
set:
.LFB0:
.cfi_startproc
xor %rdx, %rdx
addq $1, %rsi
lea 144(%rsi), %rdi
.L:
movdqu 0(%rsi,%rdx), %xmm0
movdqu 16(%rsi,%rdx), %xmm1
...
movdqu 112(%rsi,%rdx), %xmm7
movdqu %xmm0, 0(%rdi,%rdx)
...
movdqu %xmm7, 112(%rdi,%rdx)
addq $128, %rdx
cmp $5120, %rdx
jle .L
ret
Then there is only around a 10% slowdown versus a non-forwarding variant:
real 0m2.098s
user 0m2.083s
sys 0m0.003s
However, when I replace the 144 in 'lea 144(%rsi), %rdi' with 143 or another
non-multiple of 16, performance degrades:
real 0m3.495s
user 0m3.480s
sys 0m0.000s
And the other suggestions are similarly flimsy unless your target is a Pentium 4.
> >Also, what do you do with loops that contain no store? If I modify the
> >test to
> >
> >int set(int *p, int *q){
> > int i;
> > int sum = 0;
> > for (i=0; i < 128; i++)
> > sum += 42 * p[i];
> > return sum;
> >}
> >
> >then it still does aligning.
>
> Because the cost model simply does not exist for the decision whether to peel
> or not. Patches welcome.
>
> >There may be a threshold after which aligning the buffer makes sense;
> >then you need to show that the loop spends most of its time on sizes
> >above that threshold.
> >
> >Also, do you have data on how common store-buffer slowdowns are? Without
> >knowing that, you risk making a few loops faster at the expense of the
> >majority, which could well slow the whole application down. It would not
> >surprise me, as these loops may be run mostly on L1-cache data (which is
> >about the same level of assumption as assuming that the increased code
> >size fits into the instruction cache).
> >
> >
> >Actually these questions could be answered by a test: first compile
> >SPEC2006 with vanilla gcc -O3, and then with a gcc that contains a patch
> >to use unaligned loads. Then the results will tell whether peeling is
> >also good in practice or not.
>
> It should not be an on or off decision but rather a decision based on a cost
> model.
>
You cannot decide that on a cost model alone, as performance depends on the
runtime usage pattern. If you do profiling then you could do that.
Alternatively you can add a branch that enables peeling only above a preset
threshold.
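Something along these lines (just a sketch; PEEL_THRESHOLD is a made-up
placeholder and would have to be tuned per target):

#include <stdint.h>

#define PEEL_THRESHOLD 256 /* placeholder value, needs tuning */

void set_with_threshold(int *p, int n)
{
  int i = 0;
  /* Only pay for the alignment prologue when the loop is long enough
     for aligned stores to win back the peeled iterations.  */
  if (n >= PEEL_THRESHOLD)
    while (i < n && ((uintptr_t) (p + i) & 15) != 0)
      {
        p[i] = 42;
        i++;
      }
  /* Main loop: vectorizable; stores are 16-byte aligned if we peeled,
     unaligned otherwise.  */
  for (; i < n; i++)
    p[i] = 42;
}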
/* SIZE is expected to be defined at compile time, e.g. via -DSIZE=...  */
#define _GNU_SOURCE
#include <stdlib.h>
#include <malloc.h>

/* Copy routine implemented in the attached assembly.  */
void set(char *p, char *q);

int main(){
  char *ptr = pvalloc(2 * SIZE + 10000);
  char *ptr2 = pvalloc(2 * SIZE + 10000);
  unsigned long p = 31;
  unsigned long q = 17;
  int i;
  for (i=0; i < 8000000; i++) {
    set(ptr + 64 * (p % (SIZE / 64)), ptr2 + 64 * (q % (SIZE / 64)));
    p = 11 * p + 3;
    q = 13 * p + 5;
  }
  return 0;
}
.file "set1.c"
.text
.p2align 4,,15
.globl set
.type set, @function
set:
.LFB0:
.cfi_startproc
xor %rdx, %rdx
addq $1, %rsi
lea 144(%rsi), %rdi
.L:
movdqu 0(%rsi,%rdx), %xmm0
movdqu 16(%rsi,%rdx), %xmm1
movdqu 32(%rsi,%rdx), %xmm2
movdqu 48(%rsi,%rdx), %xmm3
movdqu 64(%rsi,%rdx), %xmm4
movdqu 80(%rsi,%rdx), %xmm5
movdqu 96(%rsi,%rdx), %xmm6
movdqu 112(%rsi,%rdx), %xmm7
movdqu %xmm0, 0(%rdi,%rdx)
movdqu %xmm1, 16(%rdi,%rdx)
movdqu %xmm2, 32(%rdi,%rdx)
movdqu %xmm3, 48(%rdi,%rdx)
movdqu %xmm4, 64(%rdi,%rdx)
movdqu %xmm5, 80(%rdi,%rdx)
movdqu %xmm6, 96(%rdi,%rdx)
movdqu %xmm7, 112(%rdi,%rdx)
addq $128, %rdx
cmp $5120, %rdx
jle .L
ret
.cfi_endproc
.LFE0:
.size set, .-set
.ident "GCC: (Debian 4.8.1-10) 4.8.1"
.section .note.GNU-stack,"",@progbits
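For reference, I build and run the benchmark roughly like this (the file
names and the SIZE value here are placeholders, not necessarily what I
actually used):

gcc -O3 -DSIZE=8192 bench.c set1.s -o bench
time ./bench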