Re: [LAD] vectorization

2008-05-05 Thread Jens M Andreasen
[...]

Another strategy for Complex Multiply Add would be to organize the data
in vertical arrays of the same type. This is to say that there should be
one array of Real followed by another array of Imaginary, perhaps
something like this:

typedef struct {
   float r[N] __attribute__ ((aligned(16))); 
   float i[N] __attribute__ ((aligned(16)));
} cvec_t;

cvec_t cA,cB,cD;

It can be argued that the data is now scattered all over the place, and
twice as many moves are needed to perform a single cmadd. This may be
true in some cases as we shall later see, but when you are operating on
vectors you'd usually want to load and store (at least) 4 variables at a
time.

The routine to calculate them all becomes:

   for (i = 0; i < N; ++i)
   {
      cD.r[i] += cA.r[i] * cB.r[i] - cA.i[i] * cB.i[i];
      cD.i[i] += cA.r[i] * cB.i[i] + cA.i[i] * cB.r[i];
   }
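
To make the "4 variables at a time" point concrete, here is a minimal hand-written sketch of what a vectorizer could do with the loop above. It assumes an SSE-capable x86 target and N being a multiple of 4. The name cmadd_split is made up for illustration, and the scalar branch exists only so the sketch still builds where SSE is unavailable. It is not the code that icc or gcc actually generate.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_load_ps, ... */
#endif

#define N 1024           /* assumed to be a multiple of 4 */

typedef struct {
   float r[N] __attribute__ ((aligned(16)));
   float i[N] __attribute__ ((aligned(16)));
} cvec_t;

/* d += a * b, element-wise, on split real/imaginary arrays.
   (Hypothetical helper, not part of the original test program.) */
void cmadd_split(cvec_t *d, const cvec_t *a, const cvec_t *b)
{
   size_t k;
#if defined(__SSE__)
   for (k = 0; k < N; k += 4) {
      /* Four reals and four imaginaries per aligned load, no shuffling. */
      __m128 ar = _mm_load_ps(&a->r[k]);
      __m128 ai = _mm_load_ps(&a->i[k]);
      __m128 br = _mm_load_ps(&b->r[k]);
      __m128 bi = _mm_load_ps(&b->i[k]);
      __m128 re = _mm_sub_ps(_mm_mul_ps(ar, br), _mm_mul_ps(ai, bi));
      __m128 im = _mm_add_ps(_mm_mul_ps(ar, bi), _mm_mul_ps(ai, br));
      _mm_store_ps(&d->r[k], _mm_add_ps(_mm_load_ps(&d->r[k]), re));
      _mm_store_ps(&d->i[k], _mm_add_ps(_mm_load_ps(&d->i[k]), im));
   }
#else
   /* Scalar fallback, same arithmetic as the loop in the text. */
   for (k = 0; k < N; ++k) {
      d->r[k] += a->r[k] * b->r[k] - a->i[k] * b->i[k];
      d->i[k] += a->r[k] * b->i[k] + a->i[k] * b->r[k];
   }
#endif
}
```

With the interleaved layout, the same four complex values would have to be shuffled apart into real and imaginary lanes before the multiplies, which is the extra work the split layout avoids.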
  

This auto-vectorizes really well with icc -O3 -msse, here looped a
billion times and compared to the original routine:

 clock:  1340 ms (cvec_t)  -- This is not a typo!
 clock: 12820 ms (original array of complex)

It is not too bad with  gcc -O3 -msse -ftree-vectorize either:

 clock:  6490 ms (cvec_t)
 clock: 14190 ms (original array of complex)

But not everything is rosy in the vector department if we also plan to
distribute this code fragment as part of a generic i386 binary package,
say conservatively compiled with gcc -O2:

 clock: 18880 ms (cvec_t) -- ouch!
 clock: 14230 ms (original array of complex)

That was pretty bad! Your trusted 100MHz pentium suddenly got downgraded
to 66MHz :-/ 

So to conclude: a more than tenfold speedup of cmadd() on large arrays
is possible by a) rearranging the data into a format that fits modern
machines and b) switching compilers (which perhaps not everybody is
willing to do).
 

/j

___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev


Re: [LAD] vectorization

2008-05-05 Thread Christian Schoenebeck
On Monday, 5 May 2008 15:35:42, Jens M Andreasen wrote:
> [...]
>
> Another strategy for Complex Multiply Add would be to organize the data
> in vertical arrays of the same type. This is to say that there should be
> one array of Real followed by another array of Imaginary, perhaps
> something like this:

Uhm, stupid question: have you already tried whether GCC's special complex
type leads to a better result with auto-vectorization? At least that could
give the optimizer a better chance.

CU
Christian


Re: [LAD] vectorization

2008-05-05 Thread Jens M Andreasen
On Mon, 2008-05-05 at 16:07 +0200, Christian Schoenebeck wrote:

> Uhm, stupid question: already tried if GCC's special complex attribute type
> leads to a better result with auto vectorization? At least that could give
> the optimizer a better chance.
 

No, I did not, but that's an idea. This certainly looks nice and clean:

  _Complex float cxA[N], cxB[N], cxD[N];

  for (i = 0; i < N; ++i)
     cxD[i] += cxA[i] * cxB[i];
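
As a sanity check that the one-line _Complex loop does the same arithmetic as the hand-expanded versions, here is a small sketch that puts the two side by side. The function names cmadd_complex and cmadd_manual are made up for illustration, and only finite inputs are considered.

```c
#include <complex.h>

/* d[k] += a[k] * b[k], written with the compiler's complex type. */
void cmadd_complex(float _Complex *d, const float _Complex *a,
                   const float _Complex *b, int n)
{
   int k;
   for (k = 0; k < n; ++k)
      d[k] += a[k] * b[k];
}

/* The same update with real and imaginary parts spelled out by hand. */
void cmadd_manual(float _Complex *d, const float _Complex *a,
                  const float _Complex *b, int n)
{
   int k;
   for (k = 0; k < n; ++k) {
      float ar = crealf(a[k]), ai = cimagf(a[k]);
      float br = crealf(b[k]), bi = cimagf(b[k]);
      float re = ar * br - ai * bi;   /* real part of a*b */
      float im = ar * bi + ai * br;   /* imaginary part of a*b */
      d[k] += re + im * I;
   }
}
```

For a = 1+2i and b = 3+4i both functions add -5+10i to each element. One difference to keep in mind: for the * operator on complex operands, GCC by default also checks for NaN results and may call a runtime helper to recover infinities, unless -fcx-limited-range (implied by -ffast-math) is given. Whether that contributes to the timings in this thread is not established here.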

Comparison to the other two versions with gcc -O3 -msse
-ftree-vectorize suggests a slight advantage over the original
(non-vectorized) two-dimensional array:

 clock: 13920 ms (_Complex)
 clock:  7040 ms (cvec_t)
 clock: 14470 ms (original array of complex)


With icc -O3 -msse the difference is even more pronounced:

 clock:  3850 ms (_Complex)
 clock:  1410 ms (cvec_t)
 clock: 13290 ms (original array of complex)

Moving from 'gcc 4.2.2' to '4.3 20070713 (experimental)' is very
disappointing:

 clock: 46180 ms (_Complex) -- we have found a loser!
 clock:  7030 ms (cvec_t)
 clock: 14340 ms (original array of complex)

/j



Re: [LAD] vectorization

2008-05-05 Thread Jussi Laako
Jussi Laako wrote:
> I would propose something like -march=prescott -O3 -ftree-vectorize or
> -O3 -sse3 -ftree-vectorize.

Sorry, typo, -O3 -msse3 -ftree-vectorize of course...


- Jussi


Re: [LAD] vectorization

2008-05-05 Thread Jens M Andreasen
Jussi!

Could you try this out with your proposed compiler options on your own
hardware?
Admittedly, the recycled PIII here is very unrepresentative, outdated
and old-skool (although it seems to shine when paired up with icc :)

--8<--

// include everything just in case we need it ...

#include <unistd.h>
#include <stdio.h>
#include <sched.h>
#include <time.h>
#include <stdlib.h>

#define N 1024

#include <complex.h>


float // complex, interleaved as [re, im] pairs
   ffta[N][2]  __attribute__ ((aligned(16))),
   fftb[N][2]  __attribute__ ((aligned(16))),
   data[N][2]  __attribute__ ((aligned(16)));

_Complex float
   cxA[N] __attribute__ ((aligned(16))),
   cxB[N] __attribute__ ((aligned(16))),
   cxD[N] __attribute__ ((aligned(16)));

typedef struct
{
   float r[N] __attribute__ ((aligned(16)));
   float i[N] __attribute__ ((aligned(16)));
} cvec_t;

cvec_t cA,cB,cD;

int main(void)
{
   int n = 100;
   int i,j;
   const char* s;

   clock_t clk = clock();
   s = "(_Complex)";

   for (j = 0; j < n; ++j)
      for (i = 0; i < N; ++i)
         cxD[i] += cxA[i] * cxB[i];

   // clock()/1000 is milliseconds only where CLOCKS_PER_SEC is 1000000
   fprintf (stderr, "clock: %ld ms %s\n", (long)(clock()-clk)/1000, s);

   s = "(cvec_t)";
   clk = clock();

   for (j = 0; j < n; ++j)
      for (i = 0; i < N; ++i)
      {
         cD.r[i] += cA.r[i] * cB.r[i] - cA.i[i] * cB.i[i];
         cD.i[i] += cA.r[i] * cB.i[i] + cA.i[i] * cB.r[i];
      }

   fprintf (stderr, "clock: %ld ms %s\n", (long)(clock()-clk)/1000, s);

   s = "(original float array[N][2])";
   clk = clock();
   for (j = 0; j < n; ++j)
      for (i = 0; i < N; ++i)
      {
         data [i][0] += ffta [i][0] * fftb [i][0] - ffta [i][1] * fftb [i][1];
         data [i][1] += ffta [i][0] * fftb [i][1] + ffta [i][1] * fftb [i][0];
      }
   fprintf (stderr, "clock: %ld ms %s\n", (long)(clock()-clk)/1000, s);

   return 0;
}

On Mon, 2008-05-05 at 19:15 +0300, Jussi Laako wrote:
> Jussi Laako wrote:
> > I would propose something like -march=prescott -O3 -ftree-vectorize or
> > -O3 -sse3 -ftree-vectorize.
>
> Sorry, typo, -O3 -msse3 -ftree-vectorize of course...
>
>
>    - Jussi


Re: [LAD] ppc audio distro?

2008-05-05 Thread oc2pus
On Monday, 5 May 2008, Justin Smith wrote:
> I just got an ibook g4. Is anyone out there using a ppc distro for
> audio? Is there an audio-specific ppc distro?

Most of the audio packages for openSuSE are also available for the ppc
architecture on our packman site:
http://packman.links2linux.de/

Give it a try :)



Re: [LAD] vectorization

2008-05-05 Thread Fons Adriaensen
On Mon, May 05, 2008 at 07:18:39PM +0200, Jens M Andreasen wrote:

> Could you try this out with your proposed compiler options on your own
> hardware?
>
> ...
> #define N 1024
> ...
> int n = 100;
> ...

Looping a million times over the same small data vector
is _not_ very realistic. 

In a real app, the data size would be much longer (there's
no need to optimise otherwise), that data would be rewritten
for each iteration (no need to redo the calculation otherwise),
and the work would not be done in a single long run but be
divided over a number of e.g. jack process callbacks.

I've again performed some tests on zita-convolver used by
jconv to do the York Minster config. That means around 240
different blocks of 8192 complex values each. The differences
between plain C++, hand vectorized, and optimised assembly
code are absolutely marginal in that case.

Ciao,

-- 
FA

Laboratorio di Acustica ed Elettroacustica
Parma, Italia

Lascia la spina, cogli la rosa.



Re: [LAD] vectorization

2008-05-05 Thread Jens M Andreasen
On Mon, 2008-05-05 at 23:32 +0200, Jens M Andreasen wrote:

>  clock:  4020 ms (_Complex)
>  clock:  1550 ms (cvec_t)
>  clock: 7 ms (original float array[N][2])

Wait a second ... It's a trick! :-D

The compiler splits the iterations up into smaller parts that fit in the
cache and runs them in succession.
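
To illustrate how a compiler can legally reorder such a benchmark, here is a sketch of one possible transformation, loop interchange, taken to the extreme where each "part" is a single element. Which transformation icc actually applied is not known from this thread, and the function names cmadd_repeat and cmadd_interchanged are made up for illustration.

```c
#define N 1024

/* Benchmark order: repeat the whole pass over the arrays n times.
   Every pass streams all three arrays through the cache again. */
void cmadd_repeat(float d[][2], float a[][2], float b[][2], int n)
{
   int i, j;
   for (j = 0; j < n; ++j)
      for (i = 0; i < N; ++i) {
         d[i][0] += a[i][0] * b[i][0] - a[i][1] * b[i][1];
         d[i][1] += a[i][0] * b[i][1] + a[i][1] * b[i][0];
      }
}

/* Interchanged order: finish all n updates of one element before
   moving on. The inputs never change between passes, so the product
   is computed once and the accumulator stays in registers. */
void cmadd_interchanged(float d[][2], float a[][2], float b[][2], int n)
{
   int i, j;
   for (i = 0; i < N; ++i) {
      float pr = a[i][0] * b[i][0] - a[i][1] * b[i][1];
      float pi = a[i][0] * b[i][1] + a[i][1] * b[i][0];
      float dr = d[i][0], di = d[i][1];
      for (j = 0; j < n; ++j) {
         dr += pr;
         di += pi;
      }
      d[i][0] = dr;
      d[i][1] = di;
   }
}
```

Each element sees the same sequence of additions in both versions, so the results match, but the second one touches memory only once per element. A benchmark that loops over unchanging inputs invites exactly this kind of shortcut.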





Re: [LAD] vectorization

2008-05-05 Thread Fons Adriaensen
On Tue, May 06, 2008 at 12:10:38AM +0200, Jens M Andreasen wrote:

> On Mon, 2008-05-05 at 23:32 +0200, Jens M Andreasen wrote:
>
> >  clock:  4020 ms (_Complex)
> >  clock:  1550 ms (cvec_t)
> >  clock: 7 ms (original float array[N][2])
>
> Wait a second ... It's a trick! :-D
>
> The compiler splits the iterations up in smaller parts that fit in the
> cache and runs them in succession.

I've known one that was even more clever. It just noticed I didn't use
the result, and skipped the whole loop.

After each iteration, call an empty function, separately compiled,
that takes all three vectors as arguments (and _not_ as const *
of course). No more tricks. The overhead is peanuts compared
to the calculation.
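
A minimal single-file sketch of that idea follows. Because everything sits in one file here, the separately compiled function is replaced by a call through a volatile function pointer, which is a different technique with a similar effect: the compiler cannot tell what the callee does with the vectors, so it has to finish each pass for real. All names (sink, sink_ptr, run_passes, bench_a, bench_b, bench_d) are made up for illustration.

```c
#define N 1024

/* Interleaved [re, im] pairs, as in the original test arrays. */
static float bench_a[N][2], bench_b[N][2], bench_d[N][2];

/* Empty on purpose. In the setup described above this function would
   live in its own, separately compiled source file. */
static void sink(float (*d)[2], float (*a)[2], float (*b)[2])
{
   (void) d; (void) a; (void) b;
}

/* Assumption of this sketch: calling through a volatile pointer keeps
   the compiler from seeing that the callee is empty. */
static void (*volatile sink_ptr)(float (*)[2], float (*)[2], float (*)[2]) = sink;

/* Runs the complex multiply-add 'passes' times and hands all three
   vectors (non-const) to the sink after each pass.
   Returns the number of sink calls made. */
int run_passes(int passes)
{
   int i, j, calls = 0;
   for (j = 0; j < passes; ++j) {
      for (i = 0; i < N; ++i) {
         bench_d[i][0] += bench_a[i][0] * bench_b[i][0]
                        - bench_a[i][1] * bench_b[i][1];
         bench_d[i][1] += bench_a[i][0] * bench_b[i][1]
                        + bench_a[i][1] * bench_b[i][0];
      }
      sink_ptr(bench_d, bench_a, bench_b);
      ++calls;
   }
   return calls;
}
```

Since the inputs are passed as non-const, the compiler must also assume they may have changed after every call, which rules out hoisting the products out of the outer loop.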

Ciao,

-- 
FA

Laboratorio di Acustica ed Elettroacustica
Parma, Italia

Lascia la spina, cogli la rosa.
