> On Sun, Apr 20, 2008 at 3:58 AM, Julian Seward <[EMAIL PROTECTED]> wrote: > > > for (j = N; j > 0 && j--;)
> On Monday 21 April 2008 03:22, Andrew W. Steiner wrote: > In case it helps, see > http://www.gnu.org/software/gsl/design/gsl-design.html#SEC31 > for why the loop is constructed that way. Ok. So it's clear that is what the original authors intended. And I fully sympathise with wanting to avoid unsigned integer wraparound problems in such loops. However, regarding http://www.gnu.org/software/gsl/design/gsl-design.html#SEC31 not only is the clever version for (i = N; i > 0 && i--;) { ... } confusing, it also generates worse code than the simple version: for (i = 0; i < N; i++) { j = N - i; ... } For the following test program: void clever ( double* a, unsigned long N ) { unsigned long i; for (i = N; i > 0 && i--;) { a[i] = 0.0; } } void simple ( double* a, unsigned long N ) { unsigned long i, j; for (i = 0; i < N; i++) { j = N - i; a[j] = 0.0; } } when compiled on x86_64 with gcc-4.1.2 -O2, the inner loop of "clever" is: .L10: subq $1, %rax subq $8, %rdx cmpq $-1, %rax je .L7 .L5: testq %rax, %rax movq $0, -8(%rdx) jne .L10 (7 instructions, 2 conditional branches) as opposed to this for "straightforward": .L14: addq $1, %rdx movq $0, (%rax) subq $8, %rax cmpq %rdx, %rsi jne .L14 (5 instructions, 1 conditional branch). Although to be fair it does require 3 integer registers whereas "simple" version only requires 2, which might have been relevant in the heyday of x86, but less so now. If you kick gcc enough to get it to unroll the loops (gcc -O3 -funroll-loops) the differences are larger. Then "straightforward" produces this: .L55: addq $8, %rcx movq $0, (%rax) movq $0, -8(%rax) movq $0, -16(%rax) movq $0, -24(%rax) movq $0, -32(%rax) movq $0, -40(%rax) movq $0, -48(%rax) movq $0, -56(%rax) subq $64, %rax cmpq %rcx, %rsi jne .L55 (nice! one conditional branch per 8 array elements processed) whereas "clever" still requires one conditional branch per element, even after unrolling: .L5: testq %rax, %rax movq $0, -8(%rdx) je .L7 testq %rax, %rax leaq -8(%rdx), %rcx je .L7 cmpq $1, %rax movq $0, -8(%rcx) leaq -16(%rdx), %rcx je .L7 cmpq $2, %rax movq $0, -8(%rcx) leaq -24(%rdx), %rcx je .L7 cmpq $3, %rax movq $0, -8(%rcx) leaq -32(%rdx), %rcx je .L7 cmpq $4, %rax movq $0, -8(%rcx) leaq -40(%rdx), %rcx je .L7 cmpq $5, %rax movq $0, -8(%rcx) leaq -48(%rdx), %rcx je .L7 cmpq $6, %rax movq $0, -8(%rcx) leaq -56(%rdx), %rcx je .L7 subq $8, %rax subq $64, %rdx movq $0, -8(%rcx) cmpq $-1, %rax jne .L5 which speaks for itself. At a guess I'd also say that the extra test in the clever version likely makes it unvectorizable. To vectorize a loop effectively, compilers need to be able to compute a trip count for the loop before the loop starts. Which is trivial for the simple version (trip count = N) but less than obvious for the clever version. J _______________________________________________ Bug-gsl mailing list [email protected] http://lists.gnu.org/mailman/listinfo/bug-gsl
