Hi,
> […]
> col[0] /= 9.0, col[1] /= 9.0, col[2] /= 9.0, col[3] /= 9.0;
> 0x5a39 pxor %xmm0,%xmm0
> […]
>
> Notice, how line, containing chain of divisions, is compiled to single
> sse operation.
I don’t see any vectorised arithmetic here. The pxor only zeroes the xmm0
register.
It’s a bit difficult to know what you are doing here without the context or
the datatypes, but it does indeed look like this code could benefit
from vectorisation, since you are doing the calculation in blocks of 4. E.g. you
can multiply 4 floats in a single SSE instruction, add 4 floats in a single
SSE instruction, etc.
For example:
__m128 factor = _mm_set_ps1 (1.0f/9.0f);
__m128 result = _mm_mul_ps (packed, factor);
would divide the 4 floats in packed by 9. (We could use _mm_div_ps, but
multiplication is faster than division.)
(See https://software.intel.com/sites/landingpage/IntrinsicsGuide/ )
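To make the snippet above a bit more concrete, here is a minimal sketch of a
complete helper. It assumes col points at four contiguous floats (your actual
datatype may well differ), and the name div9_ps is made up for illustration.
The unaligned load/store variants are used because nothing is known about the
alignment of your buffer.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_mul_ps, ... */

/* Hypothetical helper: divide four contiguous floats by 9, in place.
 * Assumes col points at (at least) 4 floats; alignment is not required. */
static void div9_ps(float *col)
{
    __m128 packed = _mm_loadu_ps(col);            /* load 4 floats        */
    __m128 factor = _mm_set1_ps(1.0f / 9.0f);     /* broadcast 1/9        */
    __m128 result = _mm_mul_ps(packed, factor);   /* 4 multiplies at once */
    _mm_storeu_ps(col, result);                   /* write 4 floats back  */
}
```

Calling div9_ps(col) on { 9.0f, 18.0f, 27.0f, 90.0f } should leave values very
close to 1, 2, 3 and 10. The results can differ from a true division in the
last bit, since 1/9 is not exactly representable as a float.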
Depending on your data, it might be faster to stay in the floating point domain
as long as possible to use SSE floating point operations, and convert to
integer at the last moment.
If you do want/need to stay in the integer domain, note that there is no SIMD
instruction for integer division, but you could use a multiplication here as
well:
__m128i _mm_mulhi_epi16 (__m128i a, __m128i b)
multiplies the packed 16-bit integers in a and b (so 8 at the same time),
producing intermediate 32-bit integers, and stores the high 16 bits of the
intermediate integers in the result.
Taking the high 16 bits of the 32-bit intermediate result is effectively
dividing by 65536. Since x/9 can be expressed (with some error) as x*7282/65536:
__m128i factor = _mm_set1_epi16 (7282);
__m128i result = _mm_mulhi_epi16 (packed, factor);
Of course you would have to get your 8 bit integers (I assume) into/out of the
packed 16 bit registers.
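As a rough sketch of that into/out-of step, assuming your data really is
unsigned 8-bit and you process 16 values at a time (both are guesses on my
part), the bytes can be widened by interleaving them with zero and narrowed
again with a saturating pack. The helper name div9_u8x16 is made up for
illustration.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128i integer intrinsics */

/* Hypothetical helper: divide 16 unsigned bytes by 9, in place.
 * Assumes px points at (at least) 16 bytes; alignment is not required. */
static void div9_u8x16(uint8_t *px)
{
    const __m128i zero   = _mm_setzero_si128();
    const __m128i factor = _mm_set1_epi16(7282);  /* round(65536 / 9) */

    __m128i bytes = _mm_loadu_si128((const __m128i *)px);

    /* Widen: interleaving with zero turns each byte into a 16-bit lane. */
    __m128i lo = _mm_unpacklo_epi8(bytes, zero);  /* bytes 0..7  */
    __m128i hi = _mm_unpackhi_epi8(bytes, zero);  /* bytes 8..15 */

    /* High 16 bits of the 32-bit product, i.e. (x * 7282) >> 16. */
    lo = _mm_mulhi_epi16(lo, factor);
    hi = _mm_mulhi_epi16(hi, factor);

    /* Narrow back to 16 bytes (unsigned saturation) and store. */
    _mm_storeu_si128((__m128i *)px, _mm_packus_epi16(lo, hi));
}
```

Note that _mm_mulhi_epi16 treats its lanes as signed, which is fine here
because both 7282 and the widened bytes are well below 32768. Since
7282*9 = 65538, the factor is slightly too large rather than too small, so for
inputs in the 0..255 range the result should match the truncating x/9.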
That said, whether you want to do this kind of vectorisation by hand is a
different matter. The compiler is pretty good at doing this kind of
optimisation. Make sure you pass the right flags to turn on SSE and AVX at the
levels you want to support. But it certainly is possible to improve on what the
compiler does. I have obtained significant speed boosts through rewriting inner
loops with SSE intrinsics. But even if you choose to stay in C, having some
knowledge of the SSE instruction set certainly might help.
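For comparison, a plain C loop like the sketch below is the kind of code a
compiler can usually vectorise on its own. The function name div9_loop and the
float datatype are assumptions for illustration. With GCC, something like
-O3 together with -msse2 or -mavx2 (or -march=native) should be enough to
enable it, and -fopt-info-vec reports which loops were vectorised. Whether it
actually happens for your code is worth checking in the disassembly.

```c
#include <stddef.h>

/* Hypothetical helper: divide n floats by 9, in place.
 * Written as a simple counted loop so the compiler is free to
 * turn it into packed multiplies when vectorisation is enabled. */
static void div9_loop(float *data, size_t n)
{
    const float factor = 1.0f / 9.0f;  /* multiply instead of divide */
    for (size_t i = 0; i < n; i++)
        data[i] *= factor;
}
```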
Maarten