Re: [Discuss-gnuradio] VOLK division between complexes

Marcus Müller Sun, 15 May 2016 02:08:05 -0700

Hi Federico

On 15.05.2016 02:40, Federico Larroca wrote:
> That was fast!
Only ten times as fast as the generic, pure C implementation, but thank
you :)
> Thank you very much!
You're welcome :)
> I don't have access to my computer for the weekend, but I'll check it
> as soon as I get back to the University on Tuesday (Monday's a holiday
> here).
> In any case, I got halfway through implementing the AVX kernel, which I
> copy below just for the record... I didn't even get to compile it, let
> alone test it, but I surely learned a lot.
Yeah, it was my first kernel, too :) Learned a lot!
> static inline void
> volk_32fc_x2_divide_32fc_u_avx(lv_32fc_t* cVector, const lv_32fc_t* aVector,
>                                const lv_32fc_t* bVector, unsigned int num_points)
> {
>   unsigned int number = 0;
>   const unsigned int quarterPoints = num_points / 4;
>
>   __m256 x, y, z, sq, mag_sq, mag_sq_un, div;
>   lv_32fc_t* c = cVector;
>   const lv_32fc_t* a = aVector;
>   const lv_32fc_t* b = bVector;
>
>   for(; number < quarterPoints; number++){
>     x = _mm256_loadu_ps((float*) a); // Load the ar + ai, br + bi ... as ar,ai,br,bi ...
>     y = _mm256_loadu_ps((float*) b); // Load the cr + ci, dr + di ... as cr,ci,dr,di ...
>     z = _mm256_complexconjugatemul_ps(x, y);
>     sq = _mm256_mul_ps(y, y); // Square the values
>     mag_sq_un = _mm256_hadd_ps(w,w); // obtain the actual squared magnitude, although out of order
you mean ... _hadd_ps(sq,sq), right?
>     mag_sq = _mm256_permute_ps(mag_sq_un, 0xd8); // I order it
ah, clever move! Very clever indeed!
What you do is load four complex values at once, calculate a·b*,
then calculate (hadd works within each 128-bit half of the register)
|b0|² |b1|² |b0|² |b1|² |b2|² |b3|² |b2|² |b3|²
and then reorder it in the register to be
|b0|² |b0|² |b1|² |b1|² |b2|² |b2|² |b3|² |b3|²
right? (I still haven't gotten around to learning to read the
shuffle/permute masks, and I'm a bit too lazy to do so now.)
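
For reading the mask: 0xd8 is binary 11 01 10 00, and each 2-bit field, starting from the lowest, names the source element (within the same 128-bit lane) that lands in output slots 0, 1, 2 and 3. So 0xd8 means "take elements 0, 2, 1, 3". The following is only a scalar sketch of how `_mm256_hadd_ps` and `_mm256_permute_ps` behave, not real intrinsics code; the function names are made up for illustration.

```c
#include <stddef.h>

/* Scalar model of _mm256_hadd_ps(a, b): pairwise sums, computed
 * separately inside each 128-bit lane (4 floats per lane). */
static void hadd_ps_model(const float a[8], const float b[8], float dst[8])
{
    for (size_t lane = 0; lane < 8; lane += 4) {
        dst[lane + 0] = a[lane + 0] + a[lane + 1];
        dst[lane + 1] = a[lane + 2] + a[lane + 3];
        dst[lane + 2] = b[lane + 0] + b[lane + 1];
        dst[lane + 3] = b[lane + 2] + b[lane + 3];
    }
}

/* Scalar model of _mm256_permute_ps(src, imm): each 2-bit field of imm
 * picks the source element, within the same lane, for one output slot. */
static void permute_ps_model(const float src[8], unsigned imm, float dst[8])
{
    for (size_t lane = 0; lane < 8; lane += 4)
        for (size_t i = 0; i < 4; i++)
            dst[lane + i] = src[lane + ((imm >> (2 * i)) & 3u)];
}

/* Squared magnitudes of four interleaved complex values (re, im, ...),
 * each one duplicated so it lines up with both the re and the im slot. */
static void mag_sq_dup_model(const float b[8], float out[8])
{
    float sq[8], summed[8];
    for (size_t i = 0; i < 8; i++)
        sq[i] = b[i] * b[i];             /* like _mm256_mul_ps(y, y) */
    hadd_ps_model(sq, sq, summed);       /* |b0|² |b1|² |b0|² |b1|² per lane */
    permute_ps_model(summed, 0xd8, out); /* |b0|² |b0|² |b1|² |b1|² per lane */
}
```

Feeding `mag_sq_dup_model` the interleaved values 1,2,3,4,5,6,7,8 (that is 1+2i, 3+4i, 5+6i, 7+8i) should give each squared magnitude twice, in the same order as the real and imaginary parts of a·b*.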



>     div = _mm256_div_ps(z,mag_sq);
>
>     _mm256_storeu_ps((float*) c, div); // Store the results back into the C container
>
>     a += 4;
>     b += 4;
>     c += 4;
>   }
>
> (I got this far.)
Looks pretty solid to me!
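
The quoted loop covers only the first `quarterPoints * 4` points, so the remaining `num_points % 4` points still need a scalar pass. Here is a minimal sketch of what that could look like, using the same identity a / b = a·b* / |b|² as the vector loop. It assumes `lv_32fc_t` is C99 `float complex`, and the function name `divide_scalar_sketch` is hypothetical, not something that exists in VOLK.

```c
#include <complex.h>

/* Hypothetical helper, not part of VOLK: computes c[i] = a[i] / b[i]
 * for i in [first, num_points) via a * conj(b) / |b|^2. */
static void divide_scalar_sketch(float complex *c, const float complex *a,
                                 const float complex *b,
                                 unsigned int first, unsigned int num_points)
{
    for (unsigned int i = first; i < num_points; i++) {
        const float br = crealf(b[i]);
        const float bi = cimagf(b[i]);
        const float mag_sq = br * br + bi * bi;      /* |b|^2 */
        const float complex z = a[i] * conjf(b[i]);  /* a * conj(b) */
        /* Divide real and imaginary parts by the same real number. */
        c[i] = (crealf(z) / mag_sq) + (cimagf(z) / mag_sq) * I;
    }
}
```

After the vector loop it would be called with `first = quarterPoints * 4`. Called with `first = 0`, it also works as a plain reference to check the AVX output against.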

So the difference between my AVX kernel and yours is that mine loads
eight complex values each of a and b at once, basically because the
_mm256_mul/_mm256_hadd step can produce eight |b|² at once. I then
really struggled (but managed) to get each of these |b|² twice, so I
can do the two _mm256_div. Your approach is so much cleverer, because it
uses fewer registers and less obscure shuffling.
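
To illustrate why the eight-at-once variant needs extra shuffling: the sketch below models, in scalar C, what `_mm256_hadd_ps` gives when it is fed the squares from two different registers. It is not the actual kernel code, only a model of the hadd step, and the function names are invented for this example.

```c
#include <stddef.h>

/* Scalar model of _mm256_hadd_ps(a, b): pairwise sums, computed
 * separately inside each 128-bit lane (4 floats per lane). */
static void hadd_ps_lanes(const float a[8], const float b[8], float dst[8])
{
    for (size_t lane = 0; lane < 8; lane += 4) {
        dst[lane + 0] = a[lane + 0] + a[lane + 1];
        dst[lane + 1] = a[lane + 2] + a[lane + 3];
        dst[lane + 2] = b[lane + 0] + b[lane + 1];
        dst[lane + 3] = b[lane + 2] + b[lane + 3];
    }
}

/* b_lo holds b0..b3 and b_hi holds b4..b7, both interleaved (re, im).
 * One hadd yields all eight |b|², but in the lane-mixed order
 * |b0|² |b1|² |b4|² |b5|² |b2|² |b3|² |b6|² |b7|². */
static void mag_sq_eight_model(const float b_lo[8], const float b_hi[8],
                               float out[8])
{
    float sq_lo[8], sq_hi[8];
    for (size_t i = 0; i < 8; i++) {
        sq_lo[i] = b_lo[i] * b_lo[i];
        sq_hi[i] = b_hi[i] * b_hi[i];
    }
    hadd_ps_lanes(sq_lo, sq_hi, out);
}
```

Each of those eight values then has to be duplicated and moved next to the matching real and imaginary parts across two registers, which is the part that takes the extra shuffles. The four-at-once version gets the duplicates for free from hadd(sq, sq).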

My AVX kernel, on my machine, is about as fast as my SSE3 kernel. So I'd
really like to ask you to try mine, and then just replace my AVX code
with yours, and compare the results. I think yours might be
significantly faster!

Best regards,
Marcus

_______________________________________________
Discuss-gnuradio mailing list
Discuss-gnuradio@gnu.org
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
