Hello again,

Thank you Marcus por looking through my code (and the positive comments).
I have several things to report:
 - Pulling from your repo and using volk_32fc_x2_divide_32fc worked
perfectly, and gr-isdbt kept operating as usual.
 - Substituting yours with my AVX proto-kernel (plus the aligned version)
works too.
 - Regarding performance, they are almost indistinguishable, at least in my
PC: running volk_profile they are roughly 10x faster than the generic
implementation, and Performance Monitor does not show any real difference
between both implementations (here I used the complete receiving chain).
 - The performance gain (again, measured by using Performance Monitor with
the complete receiving chain) between using the four lines of code I sent
before, and the divide kernel is mostly null.

So, to conclude, maybe the performance gain of using a kernel instead of
four lines of code is not significant, but I believe it's both simpler to
use and easier to read (and performance should be further tested in other
setups to actually conclude the former). Regarding the implementations,
both implementations of the AVX kernel are, from my unexperienced
perspective, mostly identical (maybe mine is a little bit simpler to read,
but it was me who wrote it, so I'm no judge). I've seen you already made a
pull request to add it to gnuradio, so my opinion is to use yours (but feel
free to use/test mine if you prefer).

In any case, this was a very interesting and formative experience.
best
Federico

2016-05-15 6:06 GMT-03:00 Marcus Müller <marcus.muel...@ettus.com>:

> Hi Federico
>
> On 15.05.2016 02:40, Federico Larroca wrote:
> > That was fast!
> Only ten times as fast as the generic, pure C implementation, but thank
> you :)
> > Thank you very much!
> You're welcome :)
> > I don't have access to my computer for the weekend, but I'll check it
> > as soon as I get back to the University on tuesday (monday's holiday
> > here).
> > In any case, I got to halfway implementing the AVX kernel, which I
> > copy below just for the record... I didn't even got to compile it, let
> > alone test it, but I surely learned a lot.
> Yeah, it was my first kernel, too :) Learned a lot!
> > static inline void
> > volk_32fc_x2_divide_32fc_u_avx(lv_32fc_t* cVector, const lv_32fc_t*
> > aVector,
> >                                            const lv_32fc_t* bVector,
> > unsigned int num_points)
> > {
> >   unsigned int number = 0;
> >   const unsigned int quarterPoints = num_points / 4;
> >
> >   __m256 x, y, z, sq, mag_sq, mag_sq_un, div;
> >   lv_32fc_t* c = cVector;
> >   const lv_32fc_t* a = aVector;
> >   const lv_32fc_t* b = bVector;
> >
> >   for(; number < quarterPoints; number++){
> >     x = _mm256_loadu_ps((float*) a); // Load the ar + ai, br + bi ...
> > as ar,ai,br,bi ...
> >     y = _mm256_loadu_ps((float*) b); // Load the cr + ci, dr + di ...
> > as cr,ci,dr,di ...
> >     z = _mm256_complexconjugatemul_ps(x, y);
> >     sq = _mm256_mul_ps(y, y); // Square the values
> >     mag_sq_un = _mm256_hadd_ps(w,w); // obtain the actual squared
> > magnitude, although out of order
> you mean ... _hadd_ps(sq,sq), right?
> >     mag_sq = _mm256_permute_ps(mag_sq_un, 0xd8) // I order it
> ah, clever move! Very clever indeed!
> What you do is get four complex values at once, then calculate a b*,
> then calculate
> |b0|² |b1|² |b2|² |b3|² |b0|² |b1|² |b2|² |b3|²
> and then reorder it in memory to be
> |b0|² |b0|² |b1|² |b1|² |b2|² |b2|² |b3|² |b3|²
> right? (still haven't gotten around being able to read the
> shuffle/permute masks, and a bit too lazy to do so, now).
>
>
> >     div = _mm256_div_ps(z,mag_sq);
> >
> >     _mm256_storeu_ps((float*) c, div); // Store the results back into
> > the C container
> >
> >     a += 4;
> >     b += 4;
> >     c += 4;
> >   }
> >
> > (I got this far ).
> Looks pretty solid to me!
>
> So the difference between my and your AVX kernel is that my kernel loads
> a total of eight a,b complexes at once, basically because the
> _mm256_mul/_mm256_hadd step can produce eight |b|² at once – and then I
> really struggled (but managed) to have each of these |b|² twice, so I
> can do the two _mm256_div. Your approach is so much cleverer, because it
> uses less registers, and less obscure shuffling.
>
> My AVX kernel, on my machine, is about as fast as my SSE3 kernel. So I'd
> really like to ask you to try mine, and then just replace my AVX code
> with yours, and compare the results. I think yours might be
> significantly faster!
>
> Best regards,
> Marcus
>
_______________________________________________
Discuss-gnuradio mailing list
Discuss-gnuradio@gnu.org
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio

Reply via email to