Re: [Bug-gsl] array, vector indices out-of-bounds in "linalg/bidiag.c"

Julian Seward Tue, 22 Apr 2008 03:01:41 -0700

> On Sun, Apr 20, 2008 at 3:58 AM, Julian Seward <[EMAIL PROTECTED]> wrote:
> >  >       for (j = N; j > 0 && j--;)


> On Monday 21 April 2008 03:22, Andrew W. Steiner wrote:
> In case it helps, see
> http://www.gnu.org/software/gsl/design/gsl-design.html#SEC31
> for why the loop is constructed that way.

Ok.  So it's clear that is what the original authors intended.
And I fully sympathise with wanting to avoid unsigned integer 
wraparound problems in such loops.

However, regarding 
http://www.gnu.org/software/gsl/design/gsl-design.html#SEC31

not only is the clever version

  for (i = N; i > 0 && i--;) { ... }

confusing, it also generates worse code than the simple
version:

  for (i = 0; i < N; i++) { j = N - i; ... }

For the following test program:

  void clever ( double* a, unsigned long N ) {
     unsigned long i;
     for (i = N; i > 0 && i--;) {
        a[i] = 0.0;
     }
  }

  void simple ( double* a, unsigned long N ) {
     unsigned long i, j;
     for (i = 0; i < N; i++) {
        j = N - i;
        a[j] = 0.0;
     }
  }

when compiled on x86_64 with gcc-4.1.2 -O2, the inner loop 
of "clever" is:

  .L10:
        subq    $1, %rax
        subq    $8, %rdx
        cmpq    $-1, %rax
        je      .L7
  .L5:
        testq   %rax, %rax
        movq    $0, -8(%rdx)
        jne     .L10

(7 instructions, 2 conditional branches) as opposed to this for
"straightforward":

  .L14:
        addq    $1, %rdx
        movq    $0, (%rax)
        subq    $8, %rax
        cmpq    %rdx, %rsi
        jne     .L14

(5 instructions, 1 conditional branch).  Although to be fair it
does require 3 integer registers whereas "simple" version only
requires 2, which might have been relevant in the heyday of x86,
but less so now.

If you kick gcc enough to get it to unroll the loops 
(gcc -O3 -funroll-loops) the differences are larger.
Then "straightforward" produces this:

   .L55:
        addq    $8, %rcx
        movq    $0, (%rax)
        movq    $0, -8(%rax)
        movq    $0, -16(%rax)
        movq    $0, -24(%rax)
        movq    $0, -32(%rax)
        movq    $0, -40(%rax)
        movq    $0, -48(%rax)
        movq    $0, -56(%rax)
        subq    $64, %rax
        cmpq    %rcx, %rsi
        jne     .L55

(nice! one conditional branch per 8 array elements processed)
whereas "clever" still requires one conditional branch per element,
even after unrolling:

  .L5:
        testq   %rax, %rax
        movq    $0, -8(%rdx)
        je      .L7
        testq   %rax, %rax
        leaq    -8(%rdx), %rcx
        je      .L7
        cmpq    $1, %rax
        movq    $0, -8(%rcx)
        leaq    -16(%rdx), %rcx
        je      .L7
        cmpq    $2, %rax
        movq    $0, -8(%rcx)
        leaq    -24(%rdx), %rcx
        je      .L7
        cmpq    $3, %rax
        movq    $0, -8(%rcx)
        leaq    -32(%rdx), %rcx
        je      .L7
        cmpq    $4, %rax
        movq    $0, -8(%rcx)
        leaq    -40(%rdx), %rcx
        je      .L7
        cmpq    $5, %rax
        movq    $0, -8(%rcx)
        leaq    -48(%rdx), %rcx
        je      .L7
        cmpq    $6, %rax
        movq    $0, -8(%rcx)
        leaq    -56(%rdx), %rcx
        je      .L7
        subq    $8, %rax
        subq    $64, %rdx
        movq    $0, -8(%rcx)
        cmpq    $-1, %rax
        jne     .L5

which speaks for itself.

At a guess I'd also say that the extra test in the clever version 
likely makes it unvectorizable.  To vectorize a loop effectively,
compilers need to be able to compute a trip count for the loop
before the loop starts.  Which is trivial for the simple version
(trip count = N) but less than obvious for the clever version.

J


_______________________________________________
Bug-gsl mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/bug-gsl

Re: [Bug-gsl] array, vector indices out-of-bounds in "linalg/bidiag.c"

Reply via email to