Re: Vectorization: Loop peeling with misaligned support.

2013-11-17 Thread Toon Moene

On 11/16/2013 04:25 AM, Tim Prince wrote:


Many decisions on compiler defaults still are based on an unscientific
choice of benchmarks, with gcc evidently more responsive to input from
the community.


I'm also quite convinced that we are hampered by the fact that there is 
no IPA on alignment in GCC.


I bet that in the average Fortran program, most arrays are suitably 
aligned (after all, they are either - by definition - a SAVEd array in a 
module, or an ALLOCATEd array), and code that does this:


  CALL AAP(..., A(2), ...)

is relatively sparse.
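
For readers who think in C rather than Fortran, a rough analog of the situation
(names and the attribute are purely illustrative, not taken from any real code):

/* Without inter-procedural alignment information, the callee only sees
   'int *' and must assume the pointer may be misaligned with respect to
   the vector width, even though almost all call sites pass aligned data.  */
void aap (int *p, int n)
{
  int i;
  for (i = 0; i < n; i++)
    p[i] = 42 * p[i];   /* vectorizer may peel or version for alignment */
}

void caller (void)
{
  static int a[128] __attribute__ ((aligned (16)));
  aap (a, 128);        /* common case: actual argument is well aligned */
  aap (a + 1, 127);    /* the rare CALL AAP(..., A(2), ...) style of call */
}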

--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news


Re: Vectorization: Loop peeling with misaligned support.

2013-11-17 Thread Richard Biener
Ondřej Bílka nel...@seznam.cz wrote:
On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
 Ondřej Bílka nel...@seznam.cz wrote:
 On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
 
 IIRC what can still be seen is store-buffer related slowdowns when
you have a big unaligned store load in your loop.  Thus aligning stores
still pays back last time I measured this.

Then send your benchmark. What I did is a loop that stores 512 bytes.
Unaligned stores there are faster than aligned ones, so tell me when
aligning stores pays for itself. Note that when filling the store buffer you
must take into account the extra stores needed to make the loop aligned.

The issue is that the effective write bandwidth can be limited by the store 
buffer if you have multiple write streams.  IIRC at least some AMD CPUs have to 
use two entries for stores crossing a cache line boundary.

Anyway, a look into the optimization manuals will tell you what to do and the 
cost model should follow these recommendations.

Also what do you do with loops that contain no store? If I modify test
to

int set(int *p, int *q){
  int i;
  int sum = 0;
  for (i=0; i < 128; i++)
 sum += 42 * p[i];
  return sum;
}

then it still does aligning.

Because the cost model simply does not exist for the decision whether to peel 
or not. Patches welcome.

There may be a threshold after which aligning the buffer makes sense; then you
need to show that the loop spends most of its time on sizes above that
threshold.

Also, do you have data on how common store-buffer slowdowns are? Without
knowing that, you risk making a few loops faster at the expense of the
majority, which could well slow the whole application down. It would not
surprise me, as these loops mostly run on L1-cache data (which is about
the same level of assumption as assuming that the increased code size fits
into the instruction cache).


Actually these questions could be answered by a test: first compile
SPEC2006 with vanilla gcc -O3 and then with a gcc that contains a patch to
use unaligned loads. The results will tell whether peeling is also good in
practice or not.

It should not be an on or off decision but rather a decision based on a cost 
model.

Richard.



Re: Vectorization: Loop peeling with misaligned support.

2013-11-17 Thread Ondřej Bílka
On Sun, Nov 17, 2013 at 04:42:18PM +0100, Richard Biener wrote:
 Ondřej Bílka nel...@seznam.cz wrote:
 On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
  Ondřej Bílka nel...@seznam.cz wrote:
  On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
  
  IIRC what can still be seen is store-buffer related slowdowns when
 you have a big unaligned store load in your loop.  Thus aligning stores
 still pays back last time I measured this.
 
 Then send your benchmark. What I did is a loop that stores 512 bytes.
 Unaligned stores there are faster than aligned ones, so tell me when
 aligning stores pays for itself. Note that when filling the store buffer you
 must take into account the extra stores needed to make the loop aligned.
 
 The issue is that the effective write bandwidth can be limited by the store 
 buffer if you have multiple write streams.  IIRC at least some AMD CPUs have 
 to use two entries for stores crossing a cache line boundary.

Performance can likewise be limited by branch misprediction. You need to show
that the likely bottleneck is too many writes and not some other factor.
 
 Anyway, a look into the optimization manuals will tell you what to do and the 
 cost model should follow these recommendations.
 
These tend to be quite out of date; you typically need to recheck
everything.

Take the
Intel® 64 and IA-32 Architectures Optimization Reference Manual
from April 2012.

A suggestion on store-to-load forwarding there is to align loads and stores
to make it work (with P4 and Core 2 suggestions).

However, this has been false since Nehalem. When I test a variant of memcpy
that is unaligned by one byte, the code is the following (full benchmark attached):

set:
.LFB0:
.cfi_startproc
xor %rdx, %rdx
addq$1, %rsi
lea 144(%rsi), %rdi  
.L:
movdqu  0(%rsi,%rdx), %xmm0
movdqu  16(%rsi,%rdx), %xmm1
...
movdqu  112(%rsi,%rdx), %xmm7
movdqu  %xmm0, 0(%rdi,%rdx)
...
movdqu  %xmm7, 112(%rdi,%rdx)
addq$128, %rdx
cmp $5120, %rdx
jle .L
ret

Then there is around a 10% slowdown versus the non-forwarding one.

real    0m2.098s
user    0m2.083s
sys     0m0.003s

However, when I set the 144 in 'lea 144(%rsi), %rdi' to 143 or another
non-multiple of 16, then performance degrades.

real    0m3.495s
user    0m3.480s
sys     0m0.000s

And the other suggestions are similarly flimsy unless your target is a Pentium 4.

 Also what do you do with loops that contain no store? If I modify test
 to
 
 int set(int *p, int *q){
   int i;
   int sum = 0;
   for (i=0; i < 128; i++)
  sum += 42 * p[i];
   return sum;
 }
 
 then it still does aligning.
 
 Because the cost model simply does not exist for the decision whether to peel 
 or not. Patches welcome.
 
 There may be a threshold after which aligning the buffer makes sense; then
 you need to show that the loop spends most of its time on sizes above that
 threshold.
 
 Also, do you have data on how common store-buffer slowdowns are? Without
 knowing that, you risk making a few loops faster at the expense of the
 majority, which could well slow the whole application down. It would not
 surprise me, as these loops mostly run on L1-cache data (which is about
 the same level of assumption as assuming that the increased code size fits
 into the instruction cache).
 
 
 Actually these questions could be answered by a test: first compile
 SPEC2006 with vanilla gcc -O3 and then with a gcc that contains a patch to
 use unaligned loads. The results will tell whether peeling is also good in
 practice or not.
 
 It should not be an on or off decision but rather a decision based on a cost 
 model.
 
You cannot decide that on the cost model alone, as performance is decided by
the runtime usage pattern. If you do profiling then you could do that.
Alternatively, you can add a branch that enables the peeled path only above a
preset threshold, for instance along the lines of the sketch below.
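
To make that last suggestion concrete, here is an illustrative sketch only;
the threshold value and helper names are hypothetical, chosen for the example:

#include <stdint.h>

#define PEEL_THRESHOLD 512   /* assumed cut-over point, in elements */

/* Unpeeled variant: let the compiler use unaligned vector accesses.  */
static void scale_unaligned (int *p, long n)
{
  long i;
  for (i = 0; i < n; i++)
    p[i] = 42 * p[i];
}

/* Peeled variant: scalar-peel until p + i is 16-byte aligned, then run
   the main loop on the aligned remainder.  */
static void scale_peeled (int *p, long n)
{
  long i = 0;
  while (i < n && (((uintptr_t) (p + i)) & 15) != 0)
    {
      p[i] = 42 * p[i];
      i++;
    }
  for (; i < n; i++)
    p[i] = 42 * p[i];
}

void scale (int *p, long n)
{
  if (n < PEEL_THRESHOLD)
    scale_unaligned (p, n);   /* small trip count: branch cost dominates */
  else
    scale_peeled (p, n);      /* large trip count: alignment may pay off */
}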

#define _GNU_SOURCE
#include <stdlib.h>
#include <malloc.h>
int main(){
   char *ptr = pvalloc(2 * SIZE + 1);
   char *ptr2 = pvalloc(2 * SIZE + 1);

   unsigned long p = 31;
   unsigned long q = 17;

   int i;
   for (i=0; i < 800; i++) {
 set (ptr + 64 * (p % (SIZE / 64)), ptr2 + 64 * (q % (SIZE /64)));
 p = 11 * p + 3;
 q = 13 * p + 5;
   }
}
.file   set1.c
.text
.p2align 4,,15
.globl  set
.type   set, @function
set:
.LFB0:
.cfi_startproc
xor %rdx, %rdx
addq$1, %rsi
lea 144(%rsi), %rdi  
.L:
movdqu  0(%rsi,%rdx), %xmm0
movdqu  16(%rsi,%rdx), %xmm1
movdqu  32(%rsi,%rdx), %xmm2
movdqu  48(%rsi,%rdx), %xmm3
movdqu  64(%rsi,%rdx), %xmm4
movdqu  80(%rsi,%rdx), %xmm5
movdqu  96(%rsi,%rdx), %xmm6
movdqu  112(%rsi,%rdx), %xmm7
movdqu  %xmm0, 0(%rdi,%rdx)
movdqu  %xmm1, 16(%rdi,%rdx)
movdqu  %xmm2, 32(%rdi,%rdx)
movdqu  %xmm3, 48(%rdi,%rdx)
movdqu  %xmm4, 64(%rdi,%rdx)
movdqu  %xmm5, 80(%rdi,%rdx)
movdqu  %xmm6, 96(%rdi,%rdx)
 

Re: Vectorization: Loop peeling with misaligned support.

2013-11-16 Thread Richard Biener
Ondřej Bílka nel...@seznam.cz wrote:
On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
 Also keep in mind that usually costs go up significantly if
 misalignment causes cache line splits (processor will fetch 2 lines).
 There are non-linear costs of filling up the store queue in modern
 out-of-order processors (x86). Bottom line is that it's much better to
 peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache
 line boundaries otherwise. The solution is to either actually always
 peel for alignment, or insert an additional check for cache line
 boundaries (for high trip count loops).

That is quite a bold claim; do you have a benchmark to support it?

Since Nehalem there is no overhead for unaligned SSE loads except for fetching
the cache lines. On Haswell, AVX2 loads behave in a similar way.

You are forgetting that the loop needs both cache lines when it issues an
unaligned load. This will generally take the maximum of the times needed to
access these lines. With peeling you access the first cache line, and only
afterwards, in the loop, the second, effectively doubling the running time
when both lines were in main memory.

You also need to account for all factors, not just show that one factor is
expensive. There are several factors in play; the cost of branch
misprediction is the main argument against doing peeling, so you need to
show that the cost of unaligned loads is bigger than the cost of branch
misprediction in a peeled implementation.

As a quick example of why peeling is generally a bad idea I did a simple
benchmark. Could somebody with Haswell also test the attached code generated
by gcc -O3 -march=core-avx2 (files set[13]_avx2.s)?

For the test we repeatedly call a function set with a pointer randomly
picked from 262144 bytes to stress the L2 cache; the relevant tester
is the following (file test.c):

for (i=0;i<1;i++){
set (ptr + 64 * (p % (SIZE /64) + 60), ptr2 + 64 * (q % (SIZE /64) + 60));

First we vectorize the following function. The vectorizer here does
peeling (the assembly is a bit long, see file set1.s):

void set(int *p, int *q){
  int i;
  for (i=0; i < 128; i++)
 p[i] = 42 * p[i];
}

When I ran it I got

$ gcc -O3 -DSIZE= test.c
$ gcc test.o set1.s
$ time ./a.out

real   0m3.724s
user   0m3.724s
sys0m0.000s

Now what happens if we use separate input and output arrays? The gcc
vectorizer fortunately does not peel in this case (file set2.s), which
gives better performance:

void set(int *p, int *q){
  int i;
  for (i=0; i < 128; i++)
 p[i] = 42 * q[i];
}

$ gcc test.o set2.s
$ time ./a.out

real   0m3.169s
user   0m3.170s
sys0m0.000s


The speedup here can be partially explained by the fact that in-place
modification runs slower. To eliminate this possibility we change the
assembly to make the input the same as the output (file set3.s):

   jb  .L15
 .L7:
   xorl%eax, %eax
+  movq%rdi, %rsi
   .p2align 4,,10
   .p2align 3
 .L5:

$ gcc test.o set3.s
$ time ./a.out

real   0m3.169s
user   0m3.170s
sys0m0.000s

This is still faster than what the peeling vectorizer generated.

And note that in this test the alignment is constant, so branch misprediction
is not an issue.

IIRC what can still be seen is store-buffer related slowdowns when you have a 
big unaligned store load in your loop.  Thus aligning stores still pays back 
last time I measured this.

Richard.




Re: Vectorization: Loop peeling with misaligned support.

2013-11-16 Thread Ondřej Bílka
On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
 Ondřej Bílka nel...@seznam.cz wrote:
 On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
 
 IIRC what can still be seen is store-buffer related slowdowns when you have a 
 big unaligned store load in your loop.  Thus aligning stores still pays back 
 last time I measured this.

Then send your benchmark. What I did is a loop that stores 512 bytes. Unaligned 
stores there are faster than aligned ones, so tell me when aligning stores pays 
for itself. Note that when filling the store buffer you must take into account 
the extra stores needed to make the loop aligned.

Also what do you do with loops that contain no store? If I modify test to

int set(int *p, int *q){
  int i;
  int sum = 0;
  for (i=0; i < 128; i++)
 sum += 42 * p[i];
  return sum;
}

then it still does aligning.

There may be a threshold after which aligning the buffer makes sense; then you
need to show that the loop spends most of its time on sizes above that threshold.

Also, do you have data on how common store-buffer slowdowns are? Without
knowing that, you risk making a few loops faster at the expense of the
majority, which could well slow the whole application down. It would not
surprise me, as these loops mostly run on L1-cache data (which is about the
same level of assumption as assuming that the increased code size fits into
the instruction cache).


Actually these questions could be answered by a test: first compile
SPEC2006 with vanilla gcc -O3 and then with a gcc that contains a patch to
use unaligned loads. The results will tell whether peeling is also good in
practice or not.


Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Richard Biener
On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei b...@broadcom.com wrote:
 Hi,
 In loop vectorization, I found that the vectorizer insists on loop peeling even 
 though our target supports misaligned memory access. This results in much bigger 
 code size for a very simple loop. I defined 
 TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
 TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
 almost as cheap as aligned ones. But the vectorizer still does peeling 
 anyway.

 In vect_enhance_data_refs_alignment function, it seems that result of 
 vect_supportable_dr_alignment is not used in decision of whether to do 
 peeling.

   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
   do_peeling = vector_alignment_reachable_p (dr);

 Later on, there is code to compare load/store costs. But it only decides 
 whether to do peeling for load or store, not whether to do peeling.

 Currently I have a workaround. For the following simple loop, the size is 
 80 bytes, vs. 352 bytes without the workaround (-O2 -ftree-vectorize, gcc 4.8.3 20131114)

What's the speed difference?

 int A[100];
 int B[100];
 void foo2() {
   int i;
   for (i = 0; i < 100; ++i)
 A[i] = B[i] + 100;
 }

 What is the best way to tell vectorizer not to do peeling in such situation?

Well, the vectorizer should compute the cost without peeling and then,
when the cost with peeling is not better then do not peel.  That's
very easy to check with the vectorization_cost hook by comparing
vector_load / unaligned_load and vector_store / unaligned_store cost.

Richard.
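
For illustration, a minimal sketch of such a cost hook for a hypothetical
target where misaligned vector accesses are as cheap as aligned ones; the
cost numbers are made up and only their relative values matter (this would
live in the target's backend .c file, which already pulls in the usual GCC
headers such as target.h):

static int
mytarget_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                     tree vectype ATTRIBUTE_UNUSED,
                                     int misalign ATTRIBUTE_UNUSED)
{
  switch (type_of_cost)
    {
    case vector_load:
    case vector_store:
    case unaligned_load:
    case unaligned_store:
      return 1;   /* misaligned costs the same as aligned, so peeling
                     cannot improve the per-statement cost */
    case cond_branch_taken:
      return 3;   /* make the extra prologue/epilogue branches visible */
    default:
      return 1;
    }
}

#undef TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST
#define TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST \
  mytarget_builtin_vectorization_cost

With a hook like this, comparing vector_load against unaligned_load (and the
store counterparts) as suggested above yields equal costs, so a cost-based
decision would skip peeling.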


 Thanks,
 Bingfeng Mei
 Broadcom UK



RE: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Bingfeng Mei
Hi, Richard,
Speed difference is 154 cycles (with workaround) vs. 198 cycles. So loop 
peeling is also slower for our processors.

By vectorization_cost, do you mean TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST 
hook? 

In our case, it is easy to make the decision. But generally, if the peeled loop 
is faster but bigger, what is the right balance? And what should be done with 
cases that are a bit faster and a lot bigger?

Thanks,
Bingfeng
-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: 15 November 2013 14:02
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Vectorization: Loop peeling with misaligned support.

On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei b...@broadcom.com wrote:
 Hi,
 In loop vectorization, I found that the vectorizer insists on loop peeling even 
 though our target supports misaligned memory access. This results in much bigger 
 code size for a very simple loop. I defined 
 TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
 TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
 almost as cheap as aligned ones. But the vectorizer still does peeling 
 anyway.

 In vect_enhance_data_refs_alignment function, it seems that result of 
 vect_supportable_dr_alignment is not used in decision of whether to do 
 peeling.

   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
   do_peeling = vector_alignment_reachable_p (dr);

 Later on, there is code to compare load/store costs. But it only decides 
 whether to do peeling for load or store, not whether to do peeling.

 Currently I have a workaround. For the following simple loop, the size is 
 80 bytes, vs. 352 bytes without the workaround (-O2 -ftree-vectorize, gcc 4.8.3 20131114)

What's the speed difference?

 int A[100];
 int B[100];
 void foo2() {
   int i;
   for (i = 0; i < 100; ++i)
 A[i] = B[i] + 100;
 }

 What is the best way to tell vectorizer not to do peeling in such situation?

Well, the vectorizer should compute the cost without peeling and then,
when the cost with peeling is not better then do not peel.  That's
very easy to check with the vectorization_cost hook by comparing
vector_load / unaligned_load and vector_store / unaligned_store cost.

Richard.


 Thanks,
 Bingfeng Mei
 Broadcom UK





Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Hendrik Greving
Also keep in mind that usually costs go up significantly if
misalignment causes cache line splits (processor will fetch 2 lines).
There are non-linear costs of filling up the store queue in modern
out-of-order processors (x86). Bottom line is that it's much better to
peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache
line boundaries otherwise. The solution is to either actually always
peel for alignment, or insert an additional check for cache line
boundaries (for high trip count loops).
- Hendrik

On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei b...@broadcom.com wrote:
 Hi, Richard,
 Speed difference is 154 cycles (with workaround) vs. 198 cycles. So loop 
 peeling is also slower for our processors.

 By vectorization_cost, do you mean 
 TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook?

 In our case, it is easy to make the decision. But generally, if the peeled loop 
 is faster but bigger, what is the right balance? And what should be done with 
 cases that are a bit faster and a lot bigger?

 Thanks,
 Bingfeng
 -Original Message-
 From: Richard Biener [mailto:richard.guent...@gmail.com]
 Sent: 15 November 2013 14:02
 To: Bingfeng Mei
 Cc: gcc@gcc.gnu.org
 Subject: Re: Vectorization: Loop peeling with misaligned support.

 On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei b...@broadcom.com wrote:
 Hi,
 In loop vectorization, I found that the vectorizer insists on loop peeling even 
 though our target supports misaligned memory access. This results in much bigger 
 code size for a very simple loop. I defined 
 TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
 TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
 almost as cheap as aligned ones. But the vectorizer still does peeling 
 anyway.

 In vect_enhance_data_refs_alignment function, it seems that result of 
 vect_supportable_dr_alignment is not used in decision of whether to do 
 peeling.

   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
   do_peeling = vector_alignment_reachable_p (dr);

 Later on, there is code to compare load/store costs. But it only decides 
 whether to do peeling for load or store, not whether to do peeling.

 Currently I have a workaround. For the following simple loop, the size is 
 80 bytes, vs. 352 bytes without the workaround (-O2 -ftree-vectorize, gcc 4.8.3 20131114)

 What's the speed difference?

 int A[100];
 int B[100];
 void foo2() {
   int i;
   for (i = 0; i < 100; ++i)
 A[i] = B[i] + 100;
 }

 What is the best way to tell vectorizer not to do peeling in such situation?

 Well, the vectorizer should compute the cost without peeling and then,
 when the cost with peeling is not better then do not peel.  That's
 very easy to check with the vectorization_cost hook by comparing
 vector_load / unaligned_load and vector_store / unaligned_store cost.

 Richard.


 Thanks,
 Bingfeng Mei
 Broadcom UK





Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Xinliang David Li
The right longer term fix is suggested by Richard. For now you can
probably override the peel parameter for your target (in the target
option_override function).

 maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT,
0, opts->x_param_values, opts_set->x_param_values);
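
For context, a minimal sketch of where such a call could live, using a
hypothetical target name and the global_options arrays rather than the
opts/opts_set pointers of the snippet above; the idiom follows the usual
GCC target hook conventions of that era:

/* Hypothetical target's option-override hook: cap the peeling-for-alignment
   parameter at 0 so the vectorizer never peels for alignment.  */
static void
mytarget_option_override (void)
{
  /* ... other target-specific option handling ... */
  maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT, 0,
                         global_options.x_param_values,
                         global_options_set.x_param_values);
}

#undef TARGET_OPTION_OVERRIDE
#define TARGET_OPTION_OVERRIDE mytarget_option_override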

David

On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei b...@broadcom.com wrote:
 Hi, Richard,
 Speed difference is 154 cycles (with workaround) vs. 198 cycles. So loop 
 peeling is also slower for our processors.

 By vectorization_cost, do you mean 
 TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook?

 In our case, it is easy to make the decision. But generally, if the peeled loop 
 is faster but bigger, what is the right balance? And what should be done with 
 cases that are a bit faster and a lot bigger?

 Thanks,
 Bingfeng
 -Original Message-
 From: Richard Biener [mailto:richard.guent...@gmail.com]
 Sent: 15 November 2013 14:02
 To: Bingfeng Mei
 Cc: gcc@gcc.gnu.org
 Subject: Re: Vectorization: Loop peeling with misaligned support.

 On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei b...@broadcom.com wrote:
 Hi,
 In loop vectorization, I found that the vectorizer insists on loop peeling even 
 though our target supports misaligned memory access. This results in much bigger 
 code size for a very simple loop. I defined 
 TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
 TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
 almost as cheap as aligned ones. But the vectorizer still does peeling 
 anyway.

 In vect_enhance_data_refs_alignment function, it seems that result of 
 vect_supportable_dr_alignment is not used in decision of whether to do 
 peeling.

   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
   do_peeling = vector_alignment_reachable_p (dr);

 Later on, there is code to compare load/store costs. But it only decides 
 whether to do peeling for load or store, not whether to do peeling.

 Currently I have a workaround. For the following simple loop, the size is 
 80 bytes, vs. 352 bytes without the workaround (-O2 -ftree-vectorize, gcc 4.8.3 20131114)

 What's the speed difference?

 int A[100];
 int B[100];
 void foo2() {
   int i;
   for (i = 0; i < 100; ++i)
 A[i] = B[i] + 100;
 }

 What is the best way to tell vectorizer not to do peeling in such situation?

 Well, the vectorizer should compute the cost without peeling and then,
 when the cost with peeling is not better then do not peel.  That's
 very easy to check with the vectorization_cost hook by comparing
 vector_load / unaligned_load and vector_store / unaligned_store cost.

 Richard.


 Thanks,
 Bingfeng Mei
 Broadcom UK





RE: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Bingfeng Mei
Thanks for the suggestion. It seems that parameter is only available in HEAD, 
not in 4.8. I will backport to 4.8.

However, implementing a good cost model seems quite tricky to me. There are 
conflicting requirements for different processors. For us, as for many embedded 
processors, a 4x size increase is unacceptable. But for many desktop 
processors/applications, I guess it is worth trading a significant size increase 
for some performance improvement. I am not sure the existing 
TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST is up to the task. Maybe an extra 
target hook or parameter should be provided to make such a tradeoff.

Additionally, it seems hard to estimate the costs accurately. As Hendrik 
pointed out, misaligned access will affect cache performance for some 
processors. But for our processor it is OK. Maybe just passing a high cost for 
misaligned access on such processors is sufficient to guarantee that loop 
peeling is generated.

Bingfeng


-Original Message-
From: Xinliang David Li [mailto:davi...@google.com] 
Sent: 15 November 2013 17:30
To: Bingfeng Mei
Cc: Richard Biener; gcc@gcc.gnu.org
Subject: Re: Vectorization: Loop peeling with misaligned support.

The right longer term fix is suggested by Richard. For now you can
probably override the peel parameter for your target (in the target
option_override function).

 maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT,
0, opts->x_param_values, opts_set->x_param_values);

David

On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei b...@broadcom.com wrote:
 Hi, Richard,
 Speed difference is 154 cycles (with workaround) vs. 198 cycles. So loop 
 peeling is also slower for our processors.

 By vectorization_cost, do you mean 
 TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook?

 In our case, it is easy to make the decision. But generally, if the peeled loop 
 is faster but bigger, what is the right balance? And what should be done with 
 cases that are a bit faster and a lot bigger?

 Thanks,
 Bingfeng
 -Original Message-
 From: Richard Biener [mailto:richard.guent...@gmail.com]
 Sent: 15 November 2013 14:02
 To: Bingfeng Mei
 Cc: gcc@gcc.gnu.org
 Subject: Re: Vectorization: Loop peeling with misaligned support.

 On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei b...@broadcom.com wrote:
 Hi,
 In loop vectorization, I found that the vectorizer insists on loop peeling even 
 though our target supports misaligned memory access. This results in much bigger 
 code size for a very simple loop. I defined 
 TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
 TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
 almost as cheap as aligned ones. But the vectorizer still does peeling 
 anyway.

 In vect_enhance_data_refs_alignment function, it seems that result of 
 vect_supportable_dr_alignment is not used in decision of whether to do 
 peeling.

   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
   do_peeling = vector_alignment_reachable_p (dr);

 Later on, there is code to compare load/store costs. But it only decides 
 whether to do peeling for load or store, not whether to do peeling.

 Currently I have a workaround. For the following simple loop, the size is 
 80 bytes, vs. 352 bytes without the workaround (-O2 -ftree-vectorize, gcc 4.8.3 20131114)

 What's the speed difference?

 int A[100];
 int B[100];
 void foo2() {
   int i;
   for (i = 0; i < 100; ++i)
 A[i] = B[i] + 100;
 }

 What is the best way to tell vectorizer not to do peeling in such situation?

 Well, the vectorizer should compute the cost without peeling and then,
 when the cost with peeling is not better then do not peel.  That's
 very easy to check with the vectorization_cost hook by comparing
 vector_load / unaligned_load and vector_store / unaligned_store cost.

 Richard.


 Thanks,
 Bingfeng Mei
 Broadcom UK







Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Xinliang David Li
I agree it is hard to tune cost model to make it precise.

Trunk compiler now supports better command line control for cost model
selection. It seems to me that you can backport that change (as well
as changes to control loop and slp vectorizer with different options)
to your branch. With those, you can do the following:
1) turn on vectorization with -O2 : -O2 -ftree-loop-vectorize -- it
will use the 'cheap' model which disables peeling
or
2) -O3 -fvect-cost-model=cheap  -- it will also disable peeling
3) Playing with different parameters to control peeling, alias check
versioning etc.

Better yet -- improve the vectorizer to reduce the cost in general
(e.g, better alias analysis, better alignment propagation, more
efficient runtime alias check etc).

thanks,

David

On Fri, Nov 15, 2013 at 10:01 AM, Bingfeng Mei b...@broadcom.com wrote:
 Thanks for the suggestion. It seems that parameter is only available in HEAD, 
 not in 4.8. I will backport to 4.8.

 However, implementing a good cost model seems quite tricky to me. There are 
 conflicting requirements for different processors. For us, as for many embedded 
 processors, a 4x size increase is unacceptable. But for many desktop 
 processors/applications, I guess it is worth trading a significant size increase 
 for some performance improvement. I am not sure the existing 
 TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST is up to the task. Maybe an extra 
 target hook or parameter should be provided to make such a tradeoff.

 Additionally, it seems hard to estimate the costs accurately. As Hendrik 
 pointed out, misaligned access will affect cache performance for some 
 processors. But for our processor it is OK. Maybe just passing a high cost 
 for misaligned access on such processors is sufficient to guarantee that 
 loop peeling is generated.

 Bingfeng


 -Original Message-
 From: Xinliang David Li [mailto:davi...@google.com]
 Sent: 15 November 2013 17:30
 To: Bingfeng Mei
 Cc: Richard Biener; gcc@gcc.gnu.org
 Subject: Re: Vectorization: Loop peeling with misaligned support.

 The right longer term fix is suggested by Richard. For now you can
 probably override the peel parameter for your target (in the target
 option_override function).

  maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT,
 0, opts->x_param_values, opts_set->x_param_values);

 David

 On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei b...@broadcom.com wrote:
 Hi, Richard,
 Speed difference is 154 cycles (with workaround) vs. 198 cycles. So loop 
 peeling is also slower for our processors.

 By vectorization_cost, do you mean 
 TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook?

 In our case, it is easy to make the decision. But generally, if the peeled loop 
 is faster but bigger, what is the right balance? And what should be done with 
 cases that are a bit faster and a lot bigger?

 Thanks,
 Bingfeng
 -Original Message-
 From: Richard Biener [mailto:richard.guent...@gmail.com]
 Sent: 15 November 2013 14:02
 To: Bingfeng Mei
 Cc: gcc@gcc.gnu.org
 Subject: Re: Vectorization: Loop peeling with misaligned support.

 On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei b...@broadcom.com wrote:
 Hi,
 In loop vectorization, I found that the vectorizer insists on loop peeling even 
 though our target supports misaligned memory access. This results in much bigger 
 code size for a very simple loop. I defined 
 TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also 
 TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses 
 almost as cheap as aligned ones. But the vectorizer still does peeling 
 anyway.

 In vect_enhance_data_refs_alignment function, it seems that result of 
 vect_supportable_dr_alignment is not used in decision of whether to do 
 peeling.

   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
   do_peeling = vector_alignment_reachable_p (dr);

 Later on, there is code to compare load/store costs. But it only decides 
 whether to do peeling for load or store, not whether to do peeling.

 Currently I have a workaround. For the following simple loop, the size is 
 80 bytes, vs. 352 bytes without the workaround (-O2 -ftree-vectorize, 
 gcc 4.8.3 20131114)

 What's the speed difference?

 int A[100];
 int B[100];
 void foo2() {
   int i;
   for (i = 0; i < 100; ++i)
 A[i] = B[i] + 100;
 }

 What is the best way to tell vectorizer not to do peeling in such situation?

 Well, the vectorizer should compute the cost without peeling and then,
 when the cost with peeling is not better then do not peel.  That's
 very easy to check with the vectorization_cost hook by comparing
 vector_load / unaligned_load and vector_store / unaligned_store cost.

 Richard.


 Thanks,
 Bingfeng Mei
 Broadcom UK







Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Ondřej Bílka
On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
 Also keep in mind that usually costs go up significantly if
 misalignment causes cache line splits (processor will fetch 2 lines).
 There are non-linear costs of filling up the store queue in modern
 out-of-order processors (x86). Bottom line is that it's much better to
 peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache
 line boundaries otherwise. The solution is to either actually always
 peel for alignment, or insert an additional check for cache line
 boundaries (for high trip count loops).

That is quite a bold claim; do you have a benchmark to support it?

Since Nehalem there is no overhead for unaligned SSE loads except for fetching
the cache lines. On Haswell, AVX2 loads behave in a similar way.

You are forgetting that the loop needs both cache lines when it issues an
unaligned load. This will generally take the maximum of the times needed to
access these lines. With peeling you access the first cache line, and only
afterwards, in the loop, the second, effectively doubling the running time
when both lines were in main memory.

You also need to account for all factors, not just show that one factor is
expensive. There are several factors in play; the cost of branch
misprediction is the main argument against doing peeling, so you need to
show that the cost of unaligned loads is bigger than the cost of branch
misprediction in a peeled implementation.

As a quick example of why peeling is generally a bad idea I did a simple
benchmark. Could somebody with Haswell also test the attached code generated
by gcc -O3 -march=core-avx2 (files set[13]_avx2.s)?

For the test we repeatedly call a function set with a pointer randomly
picked from 262144 bytes to stress the L2 cache; the relevant tester
is the following (file test.c):

for (i=0;i<1;i++){
 set (ptr + 64 * (p % (SIZE /64) + 60), ptr2 + 64 * (q % (SIZE /64) + 60));

First we vectorize the following function. The vectorizer here does
peeling (the assembly is a bit long, see file set1.s):

void set(int *p, int *q){
  int i;
  for (i=0; i < 128; i++)
 p[i] = 42 * p[i];
}

When I ran it I got

$ gcc -O3 -DSIZE= test.c
$ gcc test.o set1.s
$ time ./a.out

real    0m3.724s
user    0m3.724s
sys     0m0.000s

Now what happens if we use separate input and output arrays? The gcc
vectorizer fortunately does not peel in this case (file set2.s), which
gives better performance:

void set(int *p, int *q){
  int i;
  for (i=0; i < 128; i++)
 p[i] = 42 * q[i];
}

$ gcc test.o set2.s
$ time ./a.out

real    0m3.169s
user    0m3.170s
sys     0m0.000s


The speedup here can be partially explained by the fact that in-place
modification runs slower. To eliminate this possibility we change the
assembly to make the input the same as the output (file set3.s):

jb  .L15
 .L7:
xorl%eax, %eax
+   movq%rdi, %rsi
.p2align 4,,10
.p2align 3
 .L5:

$ gcc test.o set3.s
$ time ./a.out

real    0m3.169s
user    0m3.170s
sys     0m0.000s

This is still faster than what the peeling vectorizer generated.

And note that in this test the alignment is constant, so branch misprediction
is not an issue.
#define _GNU_SOURCE
#include <stdlib.h>
#include <malloc.h>
int main(){
   char *ptr = pvalloc(2 * SIZE + 128);
   char *ptr2 = pvalloc(2 * SIZE + 128);

   unsigned long p = 31;
   unsigned long q = 17;

   int i;
   for (i=0; i < 1; i++) {
 set (ptr + 64 * (p % (SIZE / 64) + 60), ptr2 + 64 * (q % (SIZE /64) + 60));
 p = 11 * p + 3;
 q = 13 * p + 5;
   }
}
.file   set1.c
.text
.p2align 4,,15
.globl  set
.type   set, @function
set:
.LFB0:
.cfi_startproc
leaq32(%rdi), %rax
cmpq%rax, %rsi
jb  .L12
movq  %rdi, %rsi
.L6:
vmovdqu (%rsi), %ymm1
vmovdqa .LC0(%rip), %ymm0
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, (%rdi)
vmovdqu 32(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 32(%rdi)
vmovdqu 64(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 64(%rdi)
vmovdqu 96(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 96(%rdi)
vmovdqu 128(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 128(%rdi)
vmovdqu 160(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 160(%rdi)
vmovdqu 192(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 192(%rdi)
vmovdqu 224(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 224(%rdi)
vmovdqu 256(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 256(%rdi)
vmovdqu 288(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 288(%rdi)
vmovdqu 320(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 320(%rdi)
vmovdqu 352(%rsi), %ymm1
vpmulld %ymm0, %ymm1, %ymm1
vmovdqu %ymm1, 352(%rdi)
vmovdqu 384(%rsi), %ymm1
vpmulld 

Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Ondřej Bílka
On Fri, Nov 15, 2013 at 11:26:06PM +0100, Ondřej Bílka wrote:
Minor correction: mutt replaced the set1.s file with one that I later
used for the avx2 variant. The correct file is the following:
.file   set1.c
.text
.p2align 4,,15
.globl  set
.type   set, @function
set:
.LFB0:
.cfi_startproc
movq%rdi, %rax
andl$15, %eax
shrq$2, %rax
negq%rax
andl$3, %eax
je  .L9
movl(%rdi), %edx
movl$42, %esi
imull   %esi, %edx
cmpl$1, %eax
movl%edx, (%rdi)
jbe .L10
movl4(%rdi), %edx
movl$42, %ecx
imull   %ecx, %edx
cmpl$2, %eax
movl%edx, 4(%rdi)
jbe .L11
movl8(%rdi), %edx
movl$42, %r11d
movl$125, %r10d
imull   %r11d, %edx
movl$3, %r11d
movl%edx, 8(%rdi)
.L2:
movl$128, %r8d
xorl%edx, %edx
subl%eax, %r8d
movl%eax, %eax
movl%r8d, %esi
leaq(%rdi,%rax,4), %rcx
xorl%eax, %eax
shrl$2, %esi
leal0(,%rsi,4), %r9d
.p2align 4,,10
.p2align 3
.L8:
movdqa  (%rcx,%rax), %xmm1
addl$1, %edx
pslld   $1, %xmm1
movdqa  %xmm1, %xmm0
pslld   $2, %xmm0
psubd   %xmm1, %xmm0
movdqa  %xmm0, %xmm1
pslld   $3, %xmm1
psubd   %xmm0, %xmm1
movdqa  %xmm1, (%rcx,%rax)
addq$16, %rax
cmpl%edx, %esi
ja  .L8
movl%r10d, %ecx
leal(%r11,%r9), %eax
subl%r9d, %ecx
cmpl%r9d, %r8d
je  .L1
movslq  %eax, %rdx
movl$42, %r9d
leaq(%rdi,%rdx,4), %rdx
movl(%rdx), %esi
imull   %r9d, %esi
cmpl$1, %ecx
movl%esi, (%rdx)
leal1(%rax), %edx
je  .L1
movslq  %edx, %rdx
movl$42, %r8d
addl$2, %eax
leaq(%rdi,%rdx,4), %rdx
movl(%rdx), %esi
imull   %r8d, %esi
cmpl$2, %ecx
movl%esi, (%rdx)
je  .L1
cltq
movl$42, %r10d
leaq(%rdi,%rax,4), %rax
movl(%rax), %edx
imull   %r10d, %edx
movl%edx, (%rax)
ret
.p2align 4,,10
.p2align 3
.L1:
rep ret
.p2align 4,,10
.p2align 3
.L9:
movl$128, %r10d
xorl%r11d, %r11d
jmp .L2
.p2align 4,,10
.p2align 3
.L11:
movl$126, %r10d
movl$2, %r11d
jmp .L2
.p2align 4,,10
.p2align 3
.L10:
movl$127, %r10d
movl$1, %r11d
jmp .L2
.cfi_endproc
.LFE0:
.size   set, .-set
.ident  GCC: (Debian 4.8.1-10) 4.8.1
.section.note.GNU-stack,,@progbits


Re: Vectorization: Loop peeling with misaligned support.

2013-11-15 Thread Tim Prince

On 11/15/2013 2:26 PM, Ondřej Bílka wrote:

On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:

Also keep in mind that usually costs go up significantly if
misalignment causes cache line splits (processor will fetch 2 lines).
There are non-linear costs of filling up the store queue in modern
out-of-order processors (x86). Bottom line is that it's much better to
peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache
line boundaries otherwise. The solution is to either actually always
peel for alignment, or insert an additional check for cache line
boundaries (for high trip count loops).

That is quite a bold claim; do you have a benchmark to support it?

Since Nehalem there is no overhead for unaligned SSE loads except for fetching
the cache lines. On Haswell, AVX2 loads behave in a similar way.
Where gcc or gfortran choose to split sse2 or sse4 loads, I found a 
marked advantage in that choice on my Westmere (which I seldom power on 
nowadays). You are correct that this finding is in disagreement with 
Intel documentation, and it has the effect that Intel option -xHost is 
not the optimum one. I suspect the Westmere was less well performing 
than Nehalem on unaligned loads. Another poorly documented feature of 
Nehalem and Westmere was a preference for 32-byte aligned data, more so 
than Sandy Bridge.
Intel documentation encourages use of unaligned AVX-256 loads on Ivy 
Bridge and Haswell, but Intel compilers don't implement them (except for 
intrinsics) until AVX2. Still, on my own Haswell tests, the splitting of 
unaligned loads by use of AVX compile option comes out ahead. 
Supposedly, the preference of Windows intrinsics programmers for the 
relative simplicity of unaligned moves was taken into account in the 
more recent hardware designs, as it was disastrous for Sandy Bridge.
I have only remote access to Haswell although I plan to buy a laptop 
soon. I'm skeptical about whether useful findings on these points may be 
obtained on a Windows laptop.
In case you didn't notice it, Intel compilers introduced #pragma vector 
unaligned as a means to specify handling of unaligned access without 
peeling. I guess it is expected to be useful on Ivy Bridge or Haswell 
for cases where the loop count is moderate but expected to match 
unrolled AVX-256, or if the case where peeling can improve alignment is 
rare.
In addition, Intel compilers learned from gcc the trick of using AVX-128 
for situations where frequent unaligned accesses are expected and 
peeling is clearly undesirable. The new facility for vectorizing OpenMP 
parallel loops (e.g. #pragma omp parallel for simd) uses AVX-128, 
consistent with the fact that OpenMP chunks are more frequently 
unaligned. In fact, parallel for simd seems to perform nearly the same 
with gcc-4.9 as with icc.
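
For reference, a small illustrative example of the OpenMP construct mentioned
above; the loop body, names, and types are made up, and it would be compiled
with something like gcc -O3 -fopenmp:

#include <stddef.h>

void saxpy (float *y, const float *x, float a, size_t n)
{
  size_t i;
  /* Each thread receives a chunk of iterations; chunk boundaries are often
     not vector-aligned, which is why unaligned (or 128-bit) vector accesses
     tend to be preferred here over peeling inside every chunk.  */
  #pragma omp parallel for simd
  for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}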
Many decisions on compiler defaults still are based on an unscientific 
choice of benchmarks, with gcc evidently more responsive to input from 
the community.


--
Tim Prince