Re: Vectorization: Loop peeling with misaligned support.
On 11/16/2013 04:25 AM, Tim Prince wrote:
> Many decisions on compiler defaults still are based on an unscientific choice of benchmarks, with gcc evidently more responsive to input from the community.

I'm also quite convinced that we are hampered by the fact that there is no IPA on alignment in GCC. I bet that in the average Fortran program, most arrays are suitably aligned (after all, they're either a - by definition - SAVEd array in a module, or an ALLOCATEd array), and code that does this:

      CALL AAP(..., A(2), ...)

is relatively sparse.

-- 
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news
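The IPA-on-alignment point can be illustrated in C. In the sketch below (hypothetical names, not from the thread), the module-level array is aligned and almost every call site passes its base address, yet without interprocedural alignment propagation the callee still has to assume an arbitrary pointer:

    /* A C analogue of the Fortran case above (hypothetical names).  The
       array is 32-byte aligned and nearly every caller passes its base
       address, but without IPA on alignment the compiler must still treat
       "a" inside aap() as potentially misaligned - the rare
       CALL AAP(..., A(2), ...) style call is what it has to guard against.  */

    #define N 1024

    double module_array[N] __attribute__ ((aligned (32)));

    void aap (double *a, int n)
    {
      int i;
      /* The vectorizer sees an unknown-alignment pointer here, so it peels
         or uses unaligned accesses, even though nearly all callers pass an
         aligned base address.  */
      for (i = 0; i < n; i++)
        a[i] *= 42.0;
    }

    void common_caller (void)
    {
      aap (module_array, N);          /* aligned: the overwhelmingly common case */
    }

    void rare_caller (void)
    {
      aap (&module_array[1], N - 1);  /* misaligned: the A(2)-style call */
    }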
Re: Vectorization: Loop peeling with misaligned support.
Ondřej Bílka <nel...@seznam.cz> wrote:
> On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
>> Ondřej Bílka <nel...@seznam.cz> wrote:
>>> On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
>>
>> IIRC what can still be seen is store-buffer related slowdowns when you have a big unaligned store load in your loop. Thus aligning stores still pays back, last time I measured this.
>
> Then send your benchmark. What I did is a loop that stores 512 bytes. Unaligned stores there are faster than aligned ones, so tell me when aligning stores pays for itself. Note that when filling the store buffer you must take into account the extra stores needed to make the loop aligned.

The issue is that the effective write bandwidth can be limited by the store buffer if you have multiple write streams. IIRC at least some AMD CPUs have to use two entries for stores crossing a cache line boundary.

Anyway, a look into the optimization manuals will tell you what to do, and the cost model should follow these recommendations.

> Also, what do you do with loops that contain no store? If I modify the test to
>
>   int set(int *p, int *q)
>   {
>     int i;
>     int sum = 0;
>     for (i = 0; i < 128; i++)
>       sum += 42 * p[i];
>     return sum;
>   }
>
> then it still does aligning.

Because the cost model simply does not exist for the decision whether to peel or not. Patches welcome.

> There may be a threshold after which aligning the buffer makes sense; then you need to show that the loop spends most of its time on sizes above that threshold.
>
> Also, do you have data on how common store-buffer slowdowns are? Without knowing that, you risk making a few loops faster at the expense of a majority, which could well slow the whole application down. It would not surprise me, as these loops can mostly be run on L1 cache data (which is about on the same level as assuming that the increased code size fits into the instruction cache).
>
> Actually these questions could be answered by a test: first compile SPEC2006 with vanilla gcc -O3 and then with a gcc that contains a patch to use unaligned loads. The results will tell whether peeling is also good in practice or not.

It should not be an on or off decision but rather a decision based on a cost model.

Richard.
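Ondřej's 512-byte store loop is not shown in the thread. The following is a minimal sketch, not his actual benchmark, of the comparison being argued about: the same 512-byte store kernel once with aligned and once with unaligned SSE2 stores. Buffer size, iteration count, and all names are arbitrary; time each variant separately with the shell's "time":

    /* Sketch of an aligned-vs-unaligned 512-byte store comparison
       (assumption: not the benchmark Ondřej ran).  Build with e.g.
       "gcc -O2 -msse2 storebench.c".  */
    #include <emmintrin.h>

    char buf[4096] __attribute__ ((aligned (64)));

    static void store512_aligned (char *p)
    {
      __m128i v = _mm_set1_epi8 (42);
      int off;
      for (off = 0; off < 512; off += 16)
        _mm_store_si128 ((__m128i *) (p + off), v);   /* requires 16-byte alignment */
    }

    static void store512_unaligned (char *p)
    {
      __m128i v = _mm_set1_epi8 (42);
      int off;
      for (off = 0; off < 512; off += 16)
        _mm_storeu_si128 ((__m128i *) (p + off), v);  /* no alignment requirement */
    }

    int main (int argc, char **argv)
    {
      long i;
      /* Run "./a.out" for the unaligned variant and "./a.out aligned" for
         the aligned one; the +1 makes the unaligned run actually cross
         16-byte boundaries.  */
      for (i = 0; i < 10000000L; i++)
        {
          if (argc > 1)
            store512_aligned (buf);
          else
            store512_unaligned (buf + 1);
        }
      return 0;
    }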
Re: Vectorization: Loop peeling with misaligned support.
On Sun, Nov 17, 2013 at 04:42:18PM +0100, Richard Biener wrote:
> The issue is that the effective write bandwidth can be limited by the store buffer if you have multiple write streams. IIRC at least some AMD CPUs have to use two entries for stores crossing a cache line boundary.

So can performance be limited by branch misprediction. You need to show that the likely bottleneck is too many writes and not some other factor.

> Anyway, a look into the optimization manuals will tell you what to do and the cost model should follow these recommendations.

These tend to be quite out of date; you typically need to recheck everything. Take the Intel 64 and IA-32 Architectures Optimization Reference Manual from April 2012. A suggestion on store-to-load forwarding there is to align loads and stores to make it work (with P4 and Core 2 suggestions). However, this has been false since Nehalem. When I test a variant of memcpy that is unaligned by one byte, the code is the following (full benchmark attached):

set:
.LFB0:
	.cfi_startproc
	xor	%rdx, %rdx
	addq	$1, %rsi
	lea	144(%rsi), %rdi
.L:
	movdqu	0(%rsi,%rdx), %xmm0
	movdqu	16(%rsi,%rdx), %xmm1
	...
	movdqu	112(%rsi,%rdx), %xmm7
	movdqu	%xmm0, 0(%rdi,%rdx)
	...
	movdqu	%xmm7, 112(%rdi,%rdx)
	addq	$128, %rdx
	cmp	$5120, %rdx
	jle	.L
	ret

Then there is around a 10% slowdown versus the nonforwarding one:

real	0m2.098s
user	0m2.083s
sys	0m0.003s

However, when in 'lea 144(%rsi), %rdi' I set 143 or some other nonmultiple of 16, performance degrades:

real	0m3.495s
user	0m3.480s
sys	0m0.000s

And other suggestions are similarly flimsy unless your target is a Pentium 4.

> It should not be an on or off decision but rather a decision based on a cost model.

You cannot decide that on the cost model alone, as performance is decided by the runtime usage pattern. If you do profiling then you could do that. Alternatively, you can add a branch to enable peeling only after a preset threshold.
#define _GNU_SOURCE
#include <stdlib.h>
#include <malloc.h>

int main()
{
  char *ptr = pvalloc(2 * SIZE + 1);
  char *ptr2 = pvalloc(2 * SIZE + 1);
  unsigned long p = 31;
  unsigned long q = 17;
  int i;
  for (i = 0; i < 800; i++)
    {
      set (ptr + 64 * (p % (SIZE / 64)), ptr2 + 64 * (q % (SIZE / 64)));
      p = 11 * p + 3;
      q = 13 * p + 5;
    }
}

	.file	"set1.c"
	.text
	.p2align 4,,15
	.globl	set
	.type	set, @function
set:
.LFB0:
	.cfi_startproc
	xor	%rdx, %rdx
	addq	$1, %rsi
	lea	144(%rsi), %rdi
.L:
	movdqu	0(%rsi,%rdx), %xmm0
	movdqu	16(%rsi,%rdx), %xmm1
	movdqu	32(%rsi,%rdx), %xmm2
	movdqu	48(%rsi,%rdx), %xmm3
	movdqu	64(%rsi,%rdx), %xmm4
	movdqu	80(%rsi,%rdx), %xmm5
	movdqu	96(%rsi,%rdx), %xmm6
	movdqu	112(%rsi,%rdx), %xmm7
	movdqu	%xmm0, 0(%rdi,%rdx)
	movdqu	%xmm1, 16(%rdi,%rdx)
	movdqu	%xmm2, 32(%rdi,%rdx)
	movdqu	%xmm3, 48(%rdi,%rdx)
	movdqu	%xmm4, 64(%rdi,%rdx)
	movdqu	%xmm5, 80(%rdi,%rdx)
	movdqu	%xmm6, 96(%rdi,%rdx)
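For reference, a hedged C-with-intrinsics rendering of the copy kernel in the attached set1.s above, so the 144- versus 143-byte store-to-load forwarding distance can be varied without editing assembly. The DIST macro and compile flags are assumptions, and the kernel, like the original, ignores its first argument:

    /* Sketch (assumed equivalent to the hand-written set1.s) of the
       one-byte-misaligned copy kernel; build with e.g.
       "gcc -O2 -msse2 -DDIST=144" or -DDIST=143 for the degraded case.  */
    #include <emmintrin.h>

    #ifndef DIST
    #define DIST 144   /* distance between source and destination */
    #endif

    void set (char *p, char *q)
    {
      char *src = q + 1;        /* misalign the source by one byte, as in the asm */
      char *dst = src + DIST;   /* destination overlaps the source at distance DIST */
      long off;

      (void) p;                 /* the original kernel ignores its first argument */

      for (off = 0; off <= 5120; off += 128)
        {
          __m128i v0 = _mm_loadu_si128 ((const __m128i *) (src + off));
          __m128i v1 = _mm_loadu_si128 ((const __m128i *) (src + off + 16));
          __m128i v2 = _mm_loadu_si128 ((const __m128i *) (src + off + 32));
          __m128i v3 = _mm_loadu_si128 ((const __m128i *) (src + off + 48));
          __m128i v4 = _mm_loadu_si128 ((const __m128i *) (src + off + 64));
          __m128i v5 = _mm_loadu_si128 ((const __m128i *) (src + off + 80));
          __m128i v6 = _mm_loadu_si128 ((const __m128i *) (src + off + 96));
          __m128i v7 = _mm_loadu_si128 ((const __m128i *) (src + off + 112));
          _mm_storeu_si128 ((__m128i *) (dst + off), v0);
          _mm_storeu_si128 ((__m128i *) (dst + off + 16), v1);
          _mm_storeu_si128 ((__m128i *) (dst + off + 32), v2);
          _mm_storeu_si128 ((__m128i *) (dst + off + 48), v3);
          _mm_storeu_si128 ((__m128i *) (dst + off + 64), v4);
          _mm_storeu_si128 ((__m128i *) (dst + off + 80), v5);
          _mm_storeu_si128 ((__m128i *) (dst + off + 96), v6);
          _mm_storeu_si128 ((__m128i *) (dst + off + 112), v7);
        }
    }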
Re: Vectorization: Loop peeling with misaligned support.
Ondřej Bílka <nel...@seznam.cz> wrote:
> On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> > Also keep in mind that usually costs go up significantly if misalignment causes cache line splits (processor will fetch 2 lines). [...]
>
> That is quite a bold claim; do you have a benchmark to support it? Since Nehalem there is no overhead for unaligned SSE loads except for fetching the cache lines, and Haswell AVX2 loads behave in a similar way.
> [...]
> Which is still faster than what the peeling vectorizer generated. And in this test the alignment is constant, so branch misprediction is not an issue.

IIRC what can still be seen is store-buffer related slowdowns when you have a big unaligned store load in your loop. Thus aligning stores still pays back, last time I measured this.

Richard.
Re: Vectorization: Loop peeling with misaligned support.
On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
> Ondřej Bílka <nel...@seznam.cz> wrote:
> > On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
>
> IIRC what can still be seen is store-buffer related slowdowns when you have a big unaligned store load in your loop. Thus aligning stores still pays back, last time I measured this.

Then send your benchmark. What I did is a loop that stores 512 bytes. Unaligned stores there are faster than aligned ones, so tell me when aligning stores pays for itself. Note that when filling the store buffer you must take into account the extra stores needed to make the loop aligned.

Also, what do you do with loops that contain no store? If I modify the test to

int set(int *p, int *q)
{
  int i;
  int sum = 0;
  for (i = 0; i < 128; i++)
    sum += 42 * p[i];
  return sum;
}

then it still does aligning.

There may be a threshold after which aligning the buffer makes sense; then you need to show that the loop spends most of its time on sizes above that threshold.

Also, do you have data on how common store-buffer slowdowns are? Without knowing that, you risk making a few loops faster at the expense of a majority, which could well slow the whole application down. It would not surprise me, as these loops can mostly be run on L1 cache data (which is about on the same level as assuming that the increased code size fits into the instruction cache).

Actually these questions could be answered by a test: first compile SPEC2006 with vanilla gcc -O3 and then with a gcc that contains a patch to use unaligned loads. The results will tell whether peeling is also good in practice or not.
Re: Vectorization: Loop peeling with misaligned support.
On Fri, Nov 15, 2013 at 2:16 PM, Bingfeng Mei <b...@broadcom.com> wrote:
> Hi,
> In loop vectorization, I found that the vectorizer insists on loop peeling even though our target supports misaligned memory access. This results in much bigger code size for a very simple loop. I defined TARGET_VECTORIZE_SUPPORT_VECTOR_MISALIGNMENT and also TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST to make misaligned accesses almost as cheap as aligned ones. But the vectorizer still does peeling anyway.
>
> In the vect_enhance_data_refs_alignment function, it seems that the result of vect_supportable_dr_alignment is not used in the decision of whether to do peeling:
>
>   supportable_dr_alignment = vect_supportable_dr_alignment (dr, true);
>   do_peeling = vector_alignment_reachable_p (dr);
>
> Later on, there is code to compare load/store costs. But it only decides whether to do peeling for the load or for the store, not whether to do peeling at all. Currently I have a workaround. For the following simple loop, the size is 80 bytes with the workaround vs. 352 bytes without it (-O2 -ftree-vectorize, gcc 4.8.3 20131114).

What's the speed difference?

> int A[100];
> int B[100];
> void foo2()
> {
>   int i;
>   for (i = 0; i < 100; ++i)
>     A[i] = B[i] + 100;
> }
>
> What is the best way to tell the vectorizer not to do peeling in such a situation?

Well, the vectorizer should compute the cost without peeling and then, when the cost with peeling is not better, it should not peel. That's very easy to check with the vectorization_cost hook by comparing vector_load / unaligned_load and vector_store / unaligned_store costs.

Richard.

> Thanks,
> Bingfeng Mei
> Broadcom UK
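A sketch of the hook Richard refers to, modeled on the TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST interface as it looked around GCC 4.8. The function name and the concrete cost values are illustrative only, and the enum cases should be checked against the target headers in use; the point is simply that a cost-driven decision needs unaligned accesses reported as no more expensive than aligned ones:

    /* Sketch of a target cost hook for a machine where misaligned vector
       accesses are as cheap as aligned ones (names and values illustrative).  */
    static int
    mytarget_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
                                         tree vectype ATTRIBUTE_UNUSED,
                                         int misalign ATTRIBUTE_UNUSED)
    {
      switch (type_of_cost)
        {
        case vector_load:
        case vector_store:
        case unaligned_load:
        case unaligned_store:
          /* Misaligned access costs the same as aligned access, so a
             peeling-aware cost model gains nothing from peeling.  */
          return 1;

        case cond_branch_taken:
          /* Make the extra branches of a peeled/versioned loop look
             comparatively expensive.  */
          return 3;

        default:
          return 1;
        }
    }

    #undef TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST
    #define TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST \
      mytarget_builtin_vectorization_cost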
RE: Vectorization: Loop peeling with misaligned support.
Hi, Richard,
The speed difference is 154 cycles (with the workaround) vs. 198 cycles, so loop peeling is also slower for our processors.

By vectorization_cost, do you mean the TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST hook? In our case it is easy to make the decision. But in general, if the peeled loop is faster but bigger, what should the right balance be? What to do with cases that are a bit faster and a lot bigger?

Thanks,
Bingfeng

-----Original Message-----
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: 15 November 2013 14:02
To: Bingfeng Mei
Cc: gcc@gcc.gnu.org
Subject: Re: Vectorization: Loop peeling with misaligned support.

What's the speed difference?
[...]
Well, the vectorizer should compute the cost without peeling and then, when the cost with peeling is not better, it should not peel. That's very easy to check with the vectorization_cost hook by comparing vector_load / unaligned_load and vector_store / unaligned_store costs.

Richard.
Re: Vectorization: Loop peeling with misaligned support.
Also keep in mind that costs usually go up significantly if misalignment causes cache line splits (the processor will fetch two lines). There are non-linear costs to filling up the store queue in modern out-of-order (x86) processors. The bottom line is that it is much better to peel, e.g. for AVX2/AVX3, if the loop would otherwise cause loads that cross cache line boundaries. The solution is to either actually always peel for alignment, or to insert an additional check for cache line boundaries (for high-trip-count loops).

- Hendrik

On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei <b...@broadcom.com> wrote:
> Hi, Richard,
> The speed difference is 154 cycles (with the workaround) vs. 198 cycles, so loop peeling is also slower for our processors.
> [...]
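A minimal sketch (not GCC code) of the second option Hendrik mentions: a runtime check that peels only when the vector accesses would actually split cache lines. The 64-byte line size, the 32-byte vector width, and the function itself are assumptions for illustration:

    #include <stdint.h>

    #define VF 8   /* assumed vector factor: 8 ints = one 32-byte vector */

    void scale (int *p, int n)
    {
      int i = 0;

      /* A 32-byte access that is 32-byte aligned never splits a 64-byte
         cache line, so peel only when p is not vector-aligned; for data
         that is already suitably placed the peel loop is skipped.  */
      if (((uintptr_t) p % (VF * sizeof (int))) != 0)
        for (; i < n && ((uintptr_t) (p + i) % (VF * sizeof (int))) != 0; i++)
          p[i] = 42 * p[i];

      /* Main loop: vectorizable with aligned accesses if we peeled above,
         or with unaligned accesses if we did not.  */
      for (; i < n; i++)
        p[i] = 42 * p[i];
    }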
Re: Vectorization: Loop peeling with misaligned support.
The right longer-term fix is the one suggested by Richard. For now you can probably override the peeling parameter for your target (in the target's option_override function):

  maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT, 0,
                         opts->x_param_values, opts_set->x_param_values);

David

On Fri, Nov 15, 2013 at 7:21 AM, Bingfeng Mei <b...@broadcom.com> wrote:
> Hi, Richard,
> The speed difference is 154 cycles (with the workaround) vs. 198 cycles, so loop peeling is also slower for our processors.
> [...]
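For context, a sketch of where such a call would typically sit, inside the target's TARGET_OPTION_OVERRIDE hook. The hook and the parameter are real interfaces of that era, but the target name and the surrounding code here are illustrative only:

    /* Sketch of a target option-override hook that disables peeling for
       alignment (illustrative function name; real targets pass their own
       option structures).  */
    static void
    mytarget_option_override (void)
    {
      /* ... other target option handling ... */

      /* Never peel loops for alignment; rely on cheap misaligned accesses.  */
      maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT, 0,
                             global_options.x_param_values,
                             global_options_set.x_param_values);
    }

    #undef TARGET_OPTION_OVERRIDE
    #define TARGET_OPTION_OVERRIDE mytarget_option_override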
RE: Vectorization: Loop peeling with misaligned support.
Thanks for the suggestion. It seems that parameter is only available in HEAD, not in 4.8, so I will backport it to 4.8.

However, implementing a good cost model seems quite tricky to me. There are conflicting requirements for different processors. For us and many embedded processors, a 4x size increase is unacceptable; but for many desktop processors/applications, I guess it is worth trading a significant size increase for some performance improvement. I am not sure the existing TARGET_VECTORIZE_BUILTIN_VECTORIZATION_COST is up to the task; maybe an extra target hook or parameter should be provided to make such a tradeoff.

Additionally, it seems hard to estimate the costs accurately. As Hendrik pointed out, misaligned access will affect cache performance on some processors, but for our processor it is OK. Maybe just passing a high cost for misaligned access on such processors is sufficient to guarantee that loop peeling is generated.

Bingfeng

-----Original Message-----
From: Xinliang David Li [mailto:davi...@google.com]
Sent: 15 November 2013 17:30
To: Bingfeng Mei
Cc: Richard Biener; gcc@gcc.gnu.org
Subject: Re: Vectorization: Loop peeling with misaligned support.

The right longer-term fix is the one suggested by Richard. For now you can probably override the peeling parameter for your target (in the target's option_override function):

  maybe_set_param_value (PARAM_VECT_MAX_PEELING_FOR_ALIGNMENT, 0,
                         opts->x_param_values, opts_set->x_param_values);

David
[...]
Re: Vectorization: Loop peeling with misaligned support.
I agree it is hard to tune the cost model to make it precise. The trunk compiler now supports better command-line control for cost model selection. It seems to me that you can backport that change (as well as the changes that control the loop and SLP vectorizers with different options) to your branch. With those, you can do the following:

1) Turn on vectorization at -O2: -O2 -ftree-loop-vectorize -- it will use the 'cheap' cost model, which disables peeling; or
2) -O3 -fvect-cost-model=cheap -- it will also disable peeling;
3) Play with different parameters to control peeling, alias-check versioning, etc.

Better yet -- improve the vectorizer to reduce the cost in general (e.g. better alias analysis, better alignment propagation, more efficient runtime alias checks, etc.).

thanks,

David

On Fri, Nov 15, 2013 at 10:01 AM, Bingfeng Mei <b...@broadcom.com> wrote:
> Thanks for the suggestion. It seems that parameter is only available in HEAD, not in 4.8, so I will backport it to 4.8.
>
> However, implementing a good cost model seems quite tricky to me. There are conflicting requirements for different processors.
> [...]
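For concreteness, this is roughly what options 1) and 2) look like on the command line, using a hypothetical source file; the flags are as David gives them and assume a compiler with the trunk changes (or a backport):

$ gcc -O2 -ftree-loop-vectorize -S foo2.c     # 1) vectorize at -O2 with the cheap model, no peeling
$ gcc -O3 -fvect-cost-model=cheap -S foo2.c   # 2) also disables peeling
$ gcc -O3 -S foo2.c                           # default -O3 model, may peel for alignment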
Re: Vectorization: Loop peeling with misaligned support.
On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> Also keep in mind that usually costs go up significantly if misalignment causes cache line splits (processor will fetch 2 lines). There are non-linear costs of filling up the store queue in modern out-of-order processors (x86). Bottom line is that it's much better to peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache line boundaries otherwise. The solution is to either actually always peel for alignment, or insert an additional check for cache line boundaries (for high trip count loops).

That is quite a bold claim; do you have a benchmark to support it? Since Nehalem there is no overhead for unaligned SSE loads except for fetching the cache lines, and Haswell AVX2 loads behave in a similar way.

You are forgetting that the loop needs both cache lines when it issues an unaligned load, which will generally take the maximum of the times needed to access those lines. With peeling you access the first cache line, and after that, in the loop, the second, effectively doubling the running time when both lines were in main memory.

You also need to compute all factors, not just show that one factor is expensive. There are several factors in play; the cost of branch misprediction is the main argument against doing peeling, so you need to show that the cost of unaligned loads is bigger than the cost of branch misprediction in a peeled implementation.

As a quick example of why peeling is generally a bad idea I did a simple benchmark. Could somebody with Haswell also test the attached code generated by gcc -O3 -march=core-avx2 (files set[13]_avx2.s)?

For the test we repeatedly call a function set with a pointer randomly picked from 262144 bytes to stress the L2 cache; the relevant part of the tester is the following (file test.c):

  for (i = 0; i < 1; i++)
    {
      set (ptr + 64 * (p % (SIZE / 64) + 60), ptr2 + 64 * (q % (SIZE / 64) + 60));

First we vectorize the following function. The vectorizer here does peeling (the assembly is a bit long, see file set1.s):

void set(int *p, int *q)
{
  int i;
  for (i = 0; i < 128; i++)
    p[i] = 42 * p[i];
}

When I ran it I got

$ gcc -O3 -DSIZE= test.c
$ gcc test.o set1.s
$ time ./a.out

real	0m3.724s
user	0m3.724s
sys	0m0.000s

Now what happens if we use separate input and output arrays? The gcc vectorizer fortunately does not peel in this case (file set2.s), which gives better performance:

void set(int *p, int *q)
{
  int i;
  for (i = 0; i < 128; i++)
    p[i] = 42 * q[i];
}

$ gcc test.o set2.s
$ time ./a.out

real	0m3.169s
user	0m3.170s
sys	0m0.000s

The speedup here can be partially explained by the fact that in-place modification runs slower. To eliminate this possibility we change the assembly to make the input the same as the output (file set3.s):

 	jb	.L15
 .L7:
 	xorl	%eax, %eax
+	movq	%rdi, %rsi
 	.p2align 4,,10
 	.p2align 3
 .L5:

$ gcc test.o set3.s
$ time ./a.out

real	0m3.169s
user	0m3.170s
sys	0m0.000s

Which is still faster than what the peeling vectorizer generated. And in this test the alignment is constant, so branch misprediction is not an issue.
#define _GNU_SOURCE
#include <stdlib.h>

int main()
{
  char *ptr = pvalloc(2 * SIZE + 128);
  char *ptr2 = pvalloc(2 * SIZE + 128);
  unsigned long p = 31;
  unsigned long q = 17;
  int i;
  for (i = 0; i < 1; i++)
    {
      set (ptr + 64 * (p % (SIZE / 64) + 60), ptr2 + 64 * (q % (SIZE / 64) + 60));
      p = 11 * p + 3;
      q = 13 * p + 5;
    }
}

	.file	"set1.c"
	.text
	.p2align 4,,15
	.globl	set
	.type	set, @function
set:
.LFB0:
	.cfi_startproc
	leaq	32(%rdi), %rax
	cmpq	%rax, %rsi
	jb	.L12
	movq	%rdi, %rsi
.L6:
	vmovdqu	(%rsi), %ymm1
	vmovdqa	.LC0(%rip), %ymm0
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, (%rdi)
	vmovdqu	32(%rsi), %ymm1
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, 32(%rdi)
	vmovdqu	64(%rsi), %ymm1
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, 64(%rdi)
	vmovdqu	96(%rsi), %ymm1
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, 96(%rdi)
	vmovdqu	128(%rsi), %ymm1
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, 128(%rdi)
	vmovdqu	160(%rsi), %ymm1
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, 160(%rdi)
	vmovdqu	192(%rsi), %ymm1
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, 192(%rdi)
	vmovdqu	224(%rsi), %ymm1
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, 224(%rdi)
	vmovdqu	256(%rsi), %ymm1
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, 256(%rdi)
	vmovdqu	288(%rsi), %ymm1
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, 288(%rdi)
	vmovdqu	320(%rsi), %ymm1
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, 320(%rdi)
	vmovdqu	352(%rsi), %ymm1
	vpmulld	%ymm0, %ymm1, %ymm1
	vmovdqu	%ymm1, 352(%rdi)
	vmovdqu	384(%rsi), %ymm1
	vpmulld
Re: Vectorization: Loop peeling with misaligned support.
On Fri, Nov 15, 2013 at 11:26:06PM +0100, Ondřej Bílka wrote:

Minor correction: a mutt read replaced the set1.s file with one that I later used for the avx2 variant. The correct file follows.

	.file	"set1.c"
	.text
	.p2align 4,,15
	.globl	set
	.type	set, @function
set:
.LFB0:
	.cfi_startproc
	movq	%rdi, %rax
	andl	$15, %eax
	shrq	$2, %rax
	negq	%rax
	andl	$3, %eax
	je	.L9
	movl	(%rdi), %edx
	movl	$42, %esi
	imull	%esi, %edx
	cmpl	$1, %eax
	movl	%edx, (%rdi)
	jbe	.L10
	movl	4(%rdi), %edx
	movl	$42, %ecx
	imull	%ecx, %edx
	cmpl	$2, %eax
	movl	%edx, 4(%rdi)
	jbe	.L11
	movl	8(%rdi), %edx
	movl	$42, %r11d
	movl	$125, %r10d
	imull	%r11d, %edx
	movl	$3, %r11d
	movl	%edx, 8(%rdi)
.L2:
	movl	$128, %r8d
	xorl	%edx, %edx
	subl	%eax, %r8d
	movl	%eax, %eax
	movl	%r8d, %esi
	leaq	(%rdi,%rax,4), %rcx
	xorl	%eax, %eax
	shrl	$2, %esi
	leal	0(,%rsi,4), %r9d
	.p2align 4,,10
	.p2align 3
.L8:
	movdqa	(%rcx,%rax), %xmm1
	addl	$1, %edx
	pslld	$1, %xmm1
	movdqa	%xmm1, %xmm0
	pslld	$2, %xmm0
	psubd	%xmm1, %xmm0
	movdqa	%xmm0, %xmm1
	pslld	$3, %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, (%rcx,%rax)
	addq	$16, %rax
	cmpl	%edx, %esi
	ja	.L8
	movl	%r10d, %ecx
	leal	(%r11,%r9), %eax
	subl	%r9d, %ecx
	cmpl	%r9d, %r8d
	je	.L1
	movslq	%eax, %rdx
	movl	$42, %r9d
	leaq	(%rdi,%rdx,4), %rdx
	movl	(%rdx), %esi
	imull	%r9d, %esi
	cmpl	$1, %ecx
	movl	%esi, (%rdx)
	leal	1(%rax), %edx
	je	.L1
	movslq	%edx, %rdx
	movl	$42, %r8d
	addl	$2, %eax
	leaq	(%rdi,%rdx,4), %rdx
	movl	(%rdx), %esi
	imull	%r8d, %esi
	cmpl	$2, %ecx
	movl	%esi, (%rdx)
	je	.L1
	cltq
	movl	$42, %r10d
	leaq	(%rdi,%rax,4), %rax
	movl	(%rax), %edx
	imull	%r10d, %edx
	movl	%edx, (%rax)
	ret
	.p2align 4,,10
	.p2align 3
.L1:
	rep ret
	.p2align 4,,10
	.p2align 3
.L9:
	movl	$128, %r10d
	xorl	%r11d, %r11d
	jmp	.L2
	.p2align 4,,10
	.p2align 3
.L11:
	movl	$126, %r10d
	movl	$2, %r11d
	jmp	.L2
	.p2align 4,,10
	.p2align 3
.L10:
	movl	$127, %r10d
	movl	$1, %r11d
	jmp	.L2
	.cfi_endproc
.LFE0:
	.size	set, .-set
	.ident	"GCC: (Debian 4.8.1-10) 4.8.1"
	.section	.note.GNU-stack,"",@progbits
Re: Vectorization: Loop peeling with misaligned support.
On 11/15/2013 2:26 PM, Ondřej Bílka wrote:
> On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> > Also keep in mind that usually costs go up significantly if misalignment causes cache line splits (processor will fetch 2 lines). There are non-linear costs of filling up the store queue in modern out-of-order processors (x86). Bottom line is that it's much better to peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache line boundaries otherwise. The solution is to either actually always peel for alignment, or insert an additional check for cache line boundaries (for high trip count loops).
>
> That is quite a bold claim; do you have a benchmark to support it? Since Nehalem there is no overhead for unaligned SSE loads except for fetching the cache lines, and Haswell AVX2 loads behave in a similar way.

Where gcc or gfortran chooses to split SSE2 or SSE4 loads, I found a marked advantage in that choice on my Westmere (which I seldom power on nowadays). You are correct that this finding is in disagreement with Intel documentation, and it has the effect that the Intel option -xHost is not the optimum one. I suspect the Westmere performed worse than Nehalem on unaligned loads. Another poorly documented feature of Nehalem and Westmere was a preference for 32-byte aligned data, more so than Sandy Bridge.

Intel documentation encourages the use of unaligned AVX-256 loads on Ivy Bridge and Haswell, but Intel compilers don't implement them (except for intrinsics) until AVX2. Still, in my own Haswell tests, splitting unaligned loads by using the AVX compile option comes out ahead. Supposedly, the preference of Windows intrinsics programmers for the relative simplicity of unaligned moves was taken into account in the more recent hardware designs, as it was disastrous for Sandy Bridge. I have only remote access to Haswell, although I plan to buy a laptop soon. I'm skeptical about whether useful findings on these points can be obtained on a Windows laptop.

In case you didn't notice it, Intel compilers introduced #pragma vector unaligned as a means to specify handling of unaligned access without peeling. I guess it is expected to be useful on Ivy Bridge or Haswell for cases where the loop count is moderate but expected to match unrolled AVX-256, or where the case in which peeling can improve alignment is rare. In addition, Intel compilers learned from gcc the trick of using AVX-128 for situations where frequent unaligned accesses are expected and peeling is clearly undesirable. The new facility for vectorizing OpenMP parallel loops (e.g. #pragma omp parallel for simd) uses AVX-128, consistent with the fact that OpenMP chunks are more frequently unaligned. In fact, parallel for simd seems to perform nearly the same with gcc-4.9 as with icc.

Many decisions on compiler defaults still are based on an unscientific choice of benchmarks, with gcc evidently more responsive to input from the community.

-- 
Tim Prince
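For readers unfamiliar with the construct Tim mentions, a small hedged example of an OpenMP parallel-for-simd loop; the function and loop body are made up, and with gcc 4.9 it needs -fopenmp (or -fopenmp-simd for the simd part only) plus optimization:

    /* Illustrative OpenMP 4.0 loop (hypothetical function): each thread's
       chunk may start at an arbitrary element, so its vectorized body
       generally cannot assume aligned data - which is why unaligned (or
       AVX-128) accesses suit this construct.  */
    void saxpy (float *y, const float *x, float a, int n)
    {
      int i;
    #pragma omp parallel for simd
      for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
    }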