Hi Federico

On 15.05.2016 02:40, Federico Larroca wrote:
> That was fast!

Only ten times as fast as the generic, pure C implementation, but thank
you :)

> Thank you very much!

You're welcome :)

> I don't have access to my computer for the weekend, but I'll check it
> as soon as I get back to the University on tuesday (monday's holiday
> here).
> In any case, I got to halfway implementing the AVX kernel, which I
> copy below just for the record... I didn't even got to compile it, let
> alone test it, but I surely learned a lot.

Yeah, it was my first kernel, too :) Learned a lot!

> static inline void
> volk_32fc_x2_divide_32fc_u_avx(lv_32fc_t* cVector, const lv_32fc_t* aVector,
>                                const lv_32fc_t* bVector,
>                                unsigned int num_points)
> {
>     unsigned int number = 0;
>     const unsigned int quarterPoints = num_points / 4;
>
>     __m256 x, y, z, sq, mag_sq, mag_sq_un, div;
>     lv_32fc_t* c = cVector;
>     const lv_32fc_t* a = aVector;
>     const lv_32fc_t* b = bVector;
>
>     for(; number < quarterPoints; number++){
>         x = _mm256_loadu_ps((float*) a); // Load the ar + ai, br + bi ... as ar,ai,br,bi ...
>         y = _mm256_loadu_ps((float*) b); // Load the cr + ci, dr + di ... as cr,ci,dr,di ...
>         z = _mm256_complexconjugatemul_ps(x, y);
>         sq = _mm256_mul_ps(y, y); // Square the values
>         mag_sq_un = _mm256_hadd_ps(w,w); // obtain the actual squared magnitude, although out of order

You mean ... _hadd_ps(sq,sq), right?

>         mag_sq = _mm256_permute_ps(mag_sq_un, 0xd8) // I order it

Ah, clever move! Very clever indeed! What you do is get four complex
values at once, then calculate a·b*, then calculate

    |b0|² |b1|² |b2|² |b3|² |b0|² |b1|² |b2|² |b3|²

and then reorder it to be

    |b0|² |b0|² |b1|² |b1|² |b2|² |b2|² |b3|² |b3|²

right? (I still haven't gotten around to learning how to read the
shuffle/permute masks, and I'm a bit too lazy to do so now.)
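To check the ordering without an AVX machine at hand, here is a rough scalar sketch of the two steps. It models each __m256 as a plain float[8]; the model_* names are hypothetical helpers made up for this illustration and are not part of VOLK or of the intrinsics API. It assumes the documented behaviour that both _mm256_hadd_ps and _mm256_permute_ps operate within each 128-bit lane. Under that assumption the intermediate order comes out as |b0|² |b1|² |b0|² |b1|² |b2|² |b3|² |b2|² |b3|², and the 0xd8 mask then yields the pairwise-duplicated order needed for the division.

```c
/* Scalar model of _mm256_hadd_ps(a, b): adjacent pairs are summed
 * separately inside each 128-bit lane (4 floats per lane). */
static void model_hadd_ps(const float a[8], const float b[8], float out[8])
{
    for (int lane = 0; lane < 2; lane++) {
        const int o = 4 * lane;
        out[o + 0] = a[o + 0] + a[o + 1];
        out[o + 1] = a[o + 2] + a[o + 3];
        out[o + 2] = b[o + 0] + b[o + 1];
        out[o + 3] = b[o + 2] + b[o + 3];
    }
}

/* Scalar model of _mm256_permute_ps(a, imm8): bits [2i+1:2i] of imm8
 * select the source element for position i, identically in both lanes. */
static void model_permute_ps(const float a[8], unsigned imm8, float out[8])
{
    for (int lane = 0; lane < 2; lane++) {
        for (int i = 0; i < 4; i++) {
            const unsigned sel = (imm8 >> (2 * i)) & 3u;
            out[4 * lane + i] = a[4 * lane + sel];
        }
    }
}

/* Hypothetical helper: y holds four interleaved complex values
 * (re, im, re, im, ...); mag_sq receives each |b|^2 twice, in order. */
static void model_mag_sq(const float y[8], float mag_sq[8])
{
    float sq[8], mag_sq_un[8];
    for (int i = 0; i < 8; i++)
        sq[i] = y[i] * y[i];                    /* like _mm256_mul_ps(y, y) */
    model_hadd_ps(sq, sq, mag_sq_un);           /* sums, but out of order   */
    model_permute_ps(mag_sq_un, 0xd8, mag_sq);  /* 0xd8 = selectors 0,2,1,3 */
}
```

Reading the mask: 0xd8 is 11 01 10 00 in binary, so, taken from the low bits upward, positions 0..3 of each lane pick source elements 0, 2, 1, 3. That swaps the two middle elements, which is exactly what turns (s0, s1, s0, s1) into (s0, s0, s1, s1).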
>         div = _mm256_div_ps(z,mag_sq);
>
>         _mm256_storeu_ps((float*) c, div); // Store the results back into the C container
>
>         a += 4;
>         b += 4;
>         c += 4;
>     }
>
> (I got this far ).

Looks pretty solid to me!

So the difference between my AVX kernel and yours is that mine loads a
total of eight a,b complexes at once, basically because the
_mm256_mul/_mm256_hadd step can produce eight |b|² at once. I then
really struggled (but managed) to get each of these |b|² twice, so that
I can do the two _mm256_div.

Your approach is so much cleverer, because it uses fewer registers and
less obscure shuffling.

My AVX kernel, on my machine, is about as fast as my SSE3 kernel. So I'd
really like to ask you to try mine, then replace my AVX code with
yours, and compare the results. I think yours might be significantly
faster!

Best regards,
Marcus

_______________________________________________
Discuss-gnuradio mailing list
Discuss-gnuradio@gnu.org
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio