Hi,

> […]
>            col[0] /= 9.0, col[1] /= 9.0, col[2] /= 9.0, col[3] /= 9.0;
> 0x5a39 pxor   %xmm0,%xmm0
> […]
> 
> Notice, how line, containing chain of divisions, is compiled to single
> sse operation.

That isn’t a vectorised SSE operation, though. The pxor is just zeroing the xmm0 register.

It’s a bit difficult to know what you are doing here, not having the context or the 
datatypes, but it does indeed look like this code could benefit from vectorisation, 
since you are doing the calculation in blocks of 4. E.g. you can multiply 4 floats 
in a single SSE instruction, add 4 floats in a single SSE instruction, etc.

e.g.

__m128 factor = _mm_set_ps1 (1.0f/9.0f);     /* broadcast 1/9 into all four lanes */
__m128 result = _mm_mul_ps (packed, factor); /* multiply four floats at once */

would divide the 4 floats in packed by 9. (We could use _mm_div_ps, but 
multiplication is faster than division.)

(See https://software.intel.com/sites/landingpage/IntrinsicsGuide/ )
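To make that a bit more concrete, here is a rough (untested) sketch of a whole loop; 
the function name, the array, and the assumption that its length is a multiple of 4 
are all mine:

#include <stddef.h>
#include <xmmintrin.h>   /* SSE */

void scale_by_ninth (float *data, size_t n)   /* n assumed to be a multiple of 4 */
{
    const __m128 factor = _mm_set_ps1 (1.0f / 9.0f);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps (data + i);   /* load 4 floats (unaligned load) */
        v = _mm_mul_ps (v, factor);           /* multiply all 4 by 1/9 at once */
        _mm_storeu_ps (data + i, v);          /* store them back */
    }
}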

Depending on your data, it might be faster to stay in the floating point domain 
as long as possible to use SSE floating point operations, and convert to 
integer at the last moment.
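For example (again just a sketch, reusing the placeholder "packed" from above and 
assuming SSE2 is available):

#include <emmintrin.h>   /* SSE2 */

__m128  scaled = _mm_mul_ps (packed, _mm_set_ps1 (1.0f / 9.0f)); /* stay in float */
__m128i ints   = _mm_cvtps_epi32 (scaled);  /* 4 floats -> 4 32-bit ints, rounded */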

If you do want/need to stay in the integer domain, note that there is no SIMD 
instruction for integer division, but you could use a multiplication here as 
well:

__m128i _mm_mulhi_epi16 (__m128i a, __m128i b)

multiplies the packed 16-bit integers in a and b (so 8 at the same time), 
producing intermediate 32-bit integers, and stores the high 16 bits of the 
intermediate integers in the result.

Taking the high 16 bits of the 32-bit intermediate result is effectively a division 
by 65536. Since x/9 can be expressed (with some error) as x*7282/65536, with 7282 
being 65536/9 rounded to the nearest integer:

__m128i factor = _mm_set1_epi16 (7282);            /* 7282 = round(65536 / 9) */
__m128i result = _mm_mulhi_epi16 (packed, factor); /* keep the high 16 bits of each product */

Of course you would have to get your 8-bit integers (I assume) into and out of the 
packed 16-bit lanes.
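Something along these lines would do it (untested sketch; "pixels" is just a 
placeholder for wherever your 16 bytes live, and I assume they are unsigned):

#include <emmintrin.h>   /* SSE2 */

__m128i zero   = _mm_setzero_si128 ();
__m128i bytes  = _mm_loadu_si128 ((const __m128i *) pixels); /* 16 x 8-bit */
__m128i lo     = _mm_unpacklo_epi8 (bytes, zero);  /* low 8 bytes  -> 8 x 16-bit */
__m128i hi     = _mm_unpackhi_epi8 (bytes, zero);  /* high 8 bytes -> 8 x 16-bit */

__m128i factor = _mm_set1_epi16 (7282);
lo = _mm_mulhi_epi16 (lo, factor);                 /* ~ divide by 9 */
hi = _mm_mulhi_epi16 (hi, factor);

__m128i result = _mm_packus_epi16 (lo, hi);        /* back to 16 x 8-bit, with saturation */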

That said, whether you want to do this kind of vectorisation by hand is a 
different matter. The compiler is pretty good at doing these kinds of 
optimisations. Make sure you pass the right flags to turn on SSE and AVX at the 
levels you want to support. But it certainly is possible to improve on what the 
compiler does: I have obtained significant speed boosts through rewriting inner 
loops with SSE intrinsics. But even if you choose to stay in plain C, having some 
knowledge of the SSE instruction set certainly might help.
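(For GCC or Clang that is typically something like -O3 together with -march=native, 
or explicit -msse2 / -mavx flags if you need to target specific machines; check your 
compiler’s documentation for the exact options and what it actually auto-vectorises.)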

Maarten