Re: [LAD] vectorization
[...]

Another strategy for Complex Multiply Add would be to organize the data in vertical arrays of the same type. That is, there should be one array of reals followed by another array of imaginaries, perhaps something like this:

    typedef struct {
        float r[N] __attribute__ ((aligned(16)));
        float i[N] __attribute__ ((aligned(16)));
    } cvec_t;

    cvec_t cA, cB, cD;

It can be argued that the data is now scattered all over the place, and that twice as many moves are needed to perform a single cmadd. This may be true in some cases, as we shall see later, but when you are operating on vectors you usually want to load and store (at least) 4 variables at a time.

The routine to calculate them all becomes:

    for (i = 0; i < N; ++i) {
        cD.r[i] += cA.r[i] * cB.r[i] - cA.i[i] * cB.i[i];
        cD.i[i] += cA.r[i] * cB.i[i] + cA.i[i] * cB.r[i];
    }

This auto-vectorizes really well with icc -O3 -msse, here looped a billion times and compared to the original routine:

    clock:  1340 ms (cvec_t)  -- This is not a typo!
    clock: 12820 ms (original array of complex)

It is not too bad with gcc -O3 -msse -ftree-vectorize either:

    clock:  6490 ms (cvec_t)
    clock: 14190 ms (original array of complex)

But things are not so rosy in the vector department if we also plan to distribute this code fragment as part of a generic binary i386 package, say conservatively compiled with gcc -O2:

    clock: 18880 ms (cvec_t)  -- ouch!
    clock: 14230 ms (original array of complex)

That was pretty bad! Your trusted 100MHz Pentium suddenly got downgraded to 66MHz :-/

So to conclude: a more than 10 times speedup of cmadd() on large arrays is possible by a) rearranging the data into a format that fits modern machines and b) switching compilers (which perhaps not everybody is willing to do.)

/j

___
Linux-audio-dev mailing list
Linux-audio-dev@lists.linuxaudio.org
http://lists.linuxaudio.org/mailman/listinfo/linux-audio-dev
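For readers who want to try the split layout against existing interleaved data, the following is a minimal sketch, not code from the thread. It copies interleaved {re, im} pairs into a cvec_t, runs the multiply-add in both layouts, and lets you check that the two agree. The helper names (cvec_from_interleaved, cvec_to_interleaved, cmadd_split, cmadd_interleaved) are hypothetical, and N is shrunk to 8 purely for illustration; the aligned attribute is the GCC-style one used in the message.

```c
#include <assert.h>
#include <stddef.h>

#define N 8  /* illustrative size only; the thread uses 1024 */

/* Split layout from the message: all reals first, then all imaginaries. */
typedef struct {
    float r[N] __attribute__ ((aligned(16)));
    float i[N] __attribute__ ((aligned(16)));
} cvec_t;

/* Hypothetical helper: copy interleaved {re, im} pairs into the split layout. */
static void cvec_from_interleaved(cvec_t *dst, float src[N][2])
{
    for (size_t k = 0; k < N; ++k) {
        dst->r[k] = src[k][0];
        dst->i[k] = src[k][1];
    }
}

/* Hypothetical helper: copy the split layout back into interleaved pairs. */
static void cvec_to_interleaved(float dst[N][2], const cvec_t *src)
{
    for (size_t k = 0; k < N; ++k) {
        dst[k][0] = src->r[k];
        dst[k][1] = src->i[k];
    }
}

/* d += a * b, element-wise, on the split layout (the loop from the message). */
static void cmadd_split(cvec_t *d, const cvec_t *a, const cvec_t *b)
{
    assert(d != a && d != b);  /* the accumulator must not alias an input */
    for (size_t k = 0; k < N; ++k) {
        d->r[k] += a->r[k] * b->r[k] - a->i[k] * b->i[k];
        d->i[k] += a->r[k] * b->i[k] + a->i[k] * b->r[k];
    }
}

/* The same arithmetic on the interleaved float[N][2] layout. */
static void cmadd_interleaved(float d[N][2], float a[N][2], float b[N][2])
{
    for (size_t k = 0; k < N; ++k) {
        d[k][0] += a[k][0] * b[k][0] - a[k][1] * b[k][1];
        d[k][1] += a[k][0] * b[k][1] + a[k][1] * b[k][0];
    }
}
```

The conversion costs one extra pass over the data, so it only pays off if the data can stay in the split layout across many operations.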
Re: [LAD] vectorization
On Monday, 5 May 2008 15:35:42, Jens M Andreasen wrote:
> [...] Another strategy for Complex Multiply Add would be to organize the data in vertical arrays of the same type. This is to say that there should be one array of Real followed by another array of Imaginary, perhaps something like this:

Uhm, stupid question: have you already tried whether GCC's special complex attribute type leads to a better result with auto-vectorization? At least that could give the optimizer a better chance.

CU
Christian
Re: [LAD] vectorization
On Mon, 2008-05-05 at 16:07 +0200, Christian Schoenebeck wrote:
> Uhm, stupid question: already tried if GCC's special complex attribute type leads to a better result with auto vectorization? At least that could give the optimizer a better chance.

No, I did not, but that's an idea. This certainly looks nice and clean:

    _Complex float cxA[N], cxB[N], cxD[N];

    for (i = 0; i < N; ++i)
        cxD[i] += cxA[i] * cxB[i];

Comparison with the other two versions, using gcc -O3 -msse -ftree-vectorize, suggests a slight advantage over the original (non-vectorized) two-dimensional array:

    clock: 13920 ms (_Complex)
    clock:  7040 ms (cvec_t)
    clock: 14470 ms (original array of complex)

With icc -O3 -msse the difference is even more pronounced:

    clock:  3850 ms (_Complex)
    clock:  1410 ms (cvec_t)
    clock: 13290 ms (original array of complex)

Moving from 'gcc 4.2.2' to '4.3 20070713 (experimental)' is very disappointing:

    clock: 46180 (_Complex)  -- we have found a loser!
    clock:  7030 (cvec_t)
    clock: 14340 (original array of complex)

/j

> CU Christian
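As a hedged illustration of what the _Complex version computes, the sketch below puts the one-line complex multiply-add next to the same arithmetic written out by hand. It is not code from the thread, and the function names cmadd_complex and cmadd_manual are hypothetical. One possible factor in the timings, not verified against the compilers used above: by default a C complex multiply also has to handle NaN and infinity corner cases, which the hand-written formula skips. GCC documents a -fcx-limited-range option that relaxes this; whether it would change the numbers reported here is untested.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* d[k] += a[k] * b[k] using the built-in complex type. */
static void cmadd_complex(size_t n, float _Complex *d,
                          const float _Complex *a, const float _Complex *b)
{
    assert(d != a && d != b);  /* the accumulator must not alias an input */
    for (size_t k = 0; k < n; ++k)
        d[k] += a[k] * b[k];
}

/* The same multiply-add with real and imaginary parts spelled out. */
static void cmadd_manual(size_t n, float _Complex *d,
                         const float _Complex *a, const float _Complex *b)
{
    for (size_t k = 0; k < n; ++k) {
        float re = crealf(a[k]) * crealf(b[k]) - cimagf(a[k]) * cimagf(b[k]);
        float im = crealf(a[k]) * cimagf(b[k]) + cimagf(a[k]) * crealf(b[k]);
        d[k] += re + im * I;
    }
}
```

For finite inputs both functions should produce the same values, so either can serve as a reference when checking a vectorized variant.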
Re: [LAD] vectorization
Jussi Laako wrote:
> I would propose something like -march=prescott -O3 -ftree-vectorize or -O3 -sse3 -ftree-vectorize.

Sorry, typo, -O3 -msse3 -ftree-vectorize of course...

 - Jussi
Re: [LAD] vectorization
Jussi!

Could you try this out with your proposed compiler options on your own hardware? Admittedly, the recycled PIII here is very unrepresentative, outdated and old-skool (although it seems to shine when paired up with icc :)

--8<--

    // include everything just in case we need it ...
    #include <unistd.h>
    #include <stdio.h>
    #include <sched.h>
    #include <time.h>
    #include <stdlib.h>

    #define N 1024

    #include <complex.h>

    // original layout: interleaved {re, im} pairs
    float // complex
      ffta[N][2] __attribute__ ((aligned(16))),
      fftb[N][2] __attribute__ ((aligned(16))),
      data[N][2] __attribute__ ((aligned(16)));

    // built-in complex type
    _Complex float
      cxA[N] __attribute__ ((aligned(16))),
      cxB[N] __attribute__ ((aligned(16))),
      cxD[N] __attribute__ ((aligned(16)));

    // split layout: all reals, then all imaginaries
    typedef struct {
        float r[N] __attribute__ ((aligned(16)));
        float i[N] __attribute__ ((aligned(16)));
    } cvec_t;

    cvec_t cA, cB, cD;

    int main()
    {
        int n = 100;
        int i, j;
        const char *s;
        clock_t clk = clock();

        // Dividing by 1000 yields ms only where CLOCKS_PER_SEC is 1000000.
        s = "(_Complex)";
        for (j = 0; j < n; ++j)
            for (i = 0; i < N; ++i)
                cxD[i] += cxA[i] * cxB[i];
        fprintf (stderr, "clock: %ld ms %s\n", (long)(clock() - clk) / 1000, s);

        s = "(cvec_t)";
        clk = clock();
        for (j = 0; j < n; ++j)
            for (i = 0; i < N; ++i) {
                cD.r[i] += cA.r[i] * cB.r[i] - cA.i[i] * cB.i[i];
                cD.i[i] += cA.r[i] * cB.i[i] + cA.i[i] * cB.r[i];
            }
        fprintf (stderr, "clock: %ld ms %s\n", (long)(clock() - clk) / 1000, s);

        s = "(original float array[N][2])";
        clk = clock();
        for (j = 0; j < n; ++j)
            for (i = 0; i < N; ++i) {
                data[i][0] += ffta[i][0] * fftb[i][0] - ffta[i][1] * fftb[i][1];
                data[i][1] += ffta[i][0] * fftb[i][1] + ffta[i][1] * fftb[i][0];
            }
        fprintf (stderr, "clock: %ld ms %s\n", (long)(clock() - clk) / 1000, s);

        return 0;
    }

On Mon, 2008-05-05 at 19:15 +0300, Jussi Laako wrote:
> Jussi Laako wrote:
> > I would propose something like -march=prescott -O3 -ftree-vectorize or -O3 -sse3 -ftree-vectorize.
>
> Sorry, typo, -O3 -msse3 -ftree-vectorize of course...
>
>  - Jussi

--
Re: [LAD] ppc audio distro?
On Monday, 5 May 2008, Justin Smith wrote:
> I just got an ibook g4. Is anyone out there using a ppc distro for audio? Is there an audio-specific ppc distro?

Most of the audio packages for openSuSE are also available for the ppc architecture on our packman site:

http://packman.links2linux.de/

Give it a try :)
Re: [LAD] vectorization
On Mon, May 05, 2008 at 07:18:39PM +0200, Jens M Andreasen wrote:
> Could you try this out with your proposed compiler options on your own hardware?
> ...
> #define N 1024
> ...
> int n = 100;
> ...

Looping a million times over the same small data vector is _not_ very realistic. In a real app, the data would be much larger (there's no need to optimise otherwise), that data would be rewritten for each iteration (no need to redo the calculation otherwise), and the work would not be done in a single long run but be divided over a number of e.g. jack process callbacks.

I've again performed some tests on zita-convolver, used by jconv to do the York Minster config. That means around 240 different blocks of 8192 complex values each. The differences between plain C++, hand-vectorized, and optimised assembly code are absolutely marginal in that case.

Ciao,

--
FA

Laboratorio di Acustica ed Elettroacustica
Parma, Italia

Lascia la spina, cogli la rosa.
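To put rough numbers on the size argument, the sketch below computes the memory touched by each workload and shows a multiply-add that walks over many blocks rather than one small vector. It assumes single-precision complex values (two 4-byte floats each). The function names working_set_bytes and cmadd_blocks are hypothetical, and the block loop is only an illustration of the access pattern, not code taken from zita-convolver.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched by `arrays` arrays of `len` single-precision complex values.
   Assumes a complex value is two floats. */
static size_t working_set_bytes(size_t arrays, size_t len)
{
    return arrays * len * 2 * sizeof(float);
}

/* Accumulate d[0..len) += a_blk[0..len) * b_blk[0..len) over all blocks.
   a and b each hold blocks * len interleaved {re, im} pairs, so every
   block brings in fresh data instead of reusing one small vector. */
static void cmadd_blocks(size_t blocks, size_t len,
                         float (*d)[2], float (*a)[2], float (*b)[2])
{
    assert(d != a && d != b);  /* the accumulator must not alias an input */
    for (size_t blk = 0; blk < blocks; ++blk) {
        float (*ab)[2] = a + blk * len;
        float (*bb)[2] = b + blk * len;
        for (size_t k = 0; k < len; ++k) {
            d[k][0] += ab[k][0] * bb[k][0] - ab[k][1] * bb[k][1];
            d[k][1] += ab[k][0] * bb[k][1] + ab[k][1] * bb[k][0];
        }
    }
}
```

Under these assumptions, working_set_bytes(3, 1024) is 24576 bytes (24 KiB) for the three arrays in the posted benchmark, while working_set_bytes(240, 8192) is 15728640 bytes (15 MiB) for a single set of 240 blocks. The first fits in most caches; the second typically does not, which is consistent with the observation that the arithmetic variant matters little there.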
Re: [LAD] vectorization
On Mon, 2008-05-05 at 23:32 +0200, Jens M Andreasen wrote:
> clock: 4020 ms (_Complex)
> clock: 1550 ms (cvec_t)
> clock: 7 ms (original float array[N][2])

Wait a second ... It's a trick! :-D

The compiler splits the iterations up into smaller parts that fit in the cache and runs them in succession.
Re: [LAD] vectorization
On Tue, May 06, 2008 at 12:10:38AM +0200, Jens M Andreasen wrote:
> On Mon, 2008-05-05 at 23:32 +0200, Jens M Andreasen wrote:
> > clock: 4020 ms (_Complex)
> > clock: 1550 ms (cvec_t)
> > clock: 7 ms (original float array[N][2])
>
> Wait a second ... It's a trick! :-D
>
> The compiler splits the iterations up in smaller parts that fits in the cache and runs them in succesion.

I've known one that was even more clever. It just noticed I didn't use the result, and skipped the whole loop.

After each iteration, call an empty function, separately compiled, that takes all three vectors as arguments (and _not_ as const * of course). No more tricks. The overhead is peanuts compared to the calculation.

Ciao,

--
FA

Laboratorio di Acustica ed Elettroacustica
Parma, Italia

Lascia la spina, cogli la rosa.
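A minimal sketch of that idea follows. To keep it in a single file it swaps in a different mechanism from the one described above: instead of a separately compiled function, it calls an empty function through a volatile function pointer, so the compiler has to reload the pointer on every call and cannot assume what the callee does with the three arrays. The names cpx_t, sink_impl, opaque_sink and bench_cmadd are hypothetical, and a separately compiled sink remains the more robust choice if whole-program optimization is in play.

```c
#include <assert.h>
#include <stddef.h>

typedef float cpx_t[2];  /* interleaved {re, im} pair, as in the original arrays */

/* Empty sink: does nothing with the data. */
static void sink_impl(cpx_t *d, cpx_t *a, cpx_t *b)
{
    (void)d; (void)a; (void)b;
}

/* The pointer is volatile, so each call must load it afresh and the
   compiler cannot prove which function runs or what memory it touches. */
static void (*volatile opaque_sink)(cpx_t *, cpx_t *, cpx_t *) = sink_impl;

/* Run d += a * b over n elements, reps times, handing all three arrays
   (non-const) to the sink after every pass. */
static void bench_cmadd(size_t n, unsigned reps, cpx_t *d, cpx_t *a, cpx_t *b)
{
    assert(d != a && d != b);  /* the accumulator must not alias an input */
    for (unsigned j = 0; j < reps; ++j) {
        for (size_t k = 0; k < n; ++k) {
            d[k][0] += a[k][0] * b[k][0] - a[k][1] * b[k][1];
            d[k][1] += a[k][0] * b[k][1] + a[k][1] * b[k][0];
        }
        opaque_sink(d, a, b);  /* results escape here, once per pass */
    }
}
```

Because the sink may read or modify any of the arrays as far as the compiler can tell, each pass has to be completed before the call, which should rule out both skipping the loop and merging passes.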