Re: Efficient way to convert 32 bit float to 32 bit int (SSE)

Stefano D'Angelo Wed, 26 Apr 2023 01:12:08 -0700

Hello,

I'm no SSE expert either, but I would exploit the IEEE 754r single-precision floating-point representation.

Essentially you have that 0x4f000000 represents 2147483648.f while 0x4effffff represents 2147483520.f. OTOH, in 32-bit two's complement, 0x7fffffff is 2147483647 and 0x80000000 is -2147483648.
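These bit patterns can be checked with a small scalar sketch. The helper names float_from_bits and bits_from_float are made up for illustration, and the code assumes that float is a 32-bit IEEE 754 single (true on x86).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a 32-bit pattern as a float.
   Assumes float is a 32-bit IEEE 754 single. */
static float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: the inverse, a float to its bit pattern. */
static uint32_t bits_from_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

float_from_bits(0x4f000000u) compares equal to 2147483648.0f and float_from_bits(0x4effffffu) to 2147483520.0f. The two patterns are adjacent, so there is no float in between.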

The idea is then to convert using _mm_cvtps_epi32 as you did, and subtract 1 if the input is represented as a number bigger than 0x4effffff. Such inputs convert to 0x80000000, and subtracting 1 from that wraps around to 0x7fffffff.
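To see why a signed integer comparison on the raw bits is enough, here is a scalar sketch of the per-lane test. The name needs_fixup is hypothetical, and the code again assumes 32-bit IEEE 754 floats.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of the per-lane test: returns 1 when the
   float's bit pattern, read as a signed 32-bit integer, is greater than
   0x4effffff (the value is >= 2^31, +Inf, or a positive NaN). */
static int needs_fixup(float f)
{
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    /* Negative floats have the sign bit set, so they read as negative
       integers and never exceed the threshold. */
    return bits > 0x4effffff;
}
```

Under these assumptions needs_fixup returns 1 for 3000000000.f and 2147483648.f, and 0 for 2147483520.f, 1000.f and -3000000000.f.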


Here's the code:

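#include <emmintrin.h>  /* SSE2 */
#include <tmmintrin.h>  /* SSSE3: _mm_sign_epi32 */
#include <smmintrin.h>  /* SSE4.1: _mm_max_epi32 */
#include <stdio.h>

int main(void)
{
    const __m128 sseFloatInput =
        _mm_set_ps(1000.f, -1000.f, 3000000000.f, -3000000000.f);

    const __m128i ones = _mm_set1_epi32(1);
    /* Bit pattern of 2147483520.f, the largest float below 2^31. */
    const __m128i h = _mm_set1_epi32(0x4effffff);

    /* Out-of-range lanes convert to 0x80000000. */
    __m128i x = _mm_cvtps_epi32(sseFloatInput);
    /* Reinterpret the float bits as signed integers. */
    __m128i i = _mm_castps_si128(sseFloatInput);
    /* s > 0 only in lanes whose bit pattern is bigger than h. */
    __m128i m = _mm_max_epi32(i, h);
    __m128i s = _mm_sub_epi32(m, h);
    /* y = 1 where s > 0, 0 where s == 0. */
    __m128i y = _mm_sign_epi32(ones, s);
    /* 0x80000000 - 1 wraps around to 0x7fffffff. */
    __m128i r = _mm_sub_epi32(x, y);

    printf("%d %d %d %d\n",
        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
        _mm_cvtsi128_si32(r)
        );
    return 0;
}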

I get the correct result: 1000 -1000 2147483647 -2147483648.
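As a point of comparison, here is a sketch of a different trick that stays within SSE2. It is not the code above: it uses a float compare mask and an XOR. The function name convert4_sat_sse2 is hypothetical, and mapping NaN to -2147483648 is simply what falls out of the instructions, not a requirement stated in this thread.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Sketch: convert 4 floats to 4 clipped int32 values using SSE2 only.
   convert4_sat_sse2 is a hypothetical name, not from the thread. */
static void convert4_sat_sse2(const float *in, int32_t *out)
{
    const __m128 limit = _mm_set1_ps(2147483648.0f); /* 2^31 */
    __m128 v = _mm_loadu_ps(in);

    /* Out-of-range lanes (and NaN) become 0x80000000. */
    __m128i x = _mm_cvtps_epi32(v);
    /* All-ones mask in lanes where v >= 2^31, zero elsewhere (and for NaN). */
    __m128i too_big = _mm_castps_si128(_mm_cmpge_ps(v, limit));
    /* 0x80000000 ^ 0xffffffff == 0x7fffffff; other lanes are unchanged. */
    __m128i r = _mm_xor_si128(x, too_big);

    _mm_storeu_si128((__m128i *)out, r);
}
```

Negative overflow needs no extra handling here, because the conversion already yields 0x80000000 for those lanes.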

HTH.

Best,

Stefano D'Angelo

On 26/04/23 09:09, Holger Strauss wrote:

Hi,

thank you all for the interesting discussion posts on denorms and
fixed-point/floating-point processing.

I have a problem that is very much related to the arguments posted by B.J.,
mentioning the lack of saturating arithmetic on x86/x64 processors.

I need to convert a batch of 32 bit float samples to 32 bit int samples with
appropriate clipping. I.e. samples which are outside the range of a 32 bit
int (-2147483648..2147483647) shall be clipped to -2147483648 or
2147483647.
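The required behaviour can be pinned down with a plain scalar reference, which is handy for checking any SIMD version against. This is only a sketch: the name float_to_int32_clipped is hypothetical, sending NaN to -2147483648 is an arbitrary choice, and the cast truncates toward zero whereas _mm_cvtps_epi32 rounds according to MXCSR (to nearest even by default).

```c
#include <stdint.h>

/* Hypothetical scalar reference for the desired clipping behaviour.
   Note: the cast truncates toward zero; _mm_cvtps_epi32 rounds instead. */
static int32_t float_to_int32_clipped(float f)
{
    if (f >= 2147483648.0f)      /* 2^31 and above, including +Inf */
        return INT32_MAX;
    if (!(f >= -2147483648.0f))  /* below -2^31, -Inf, and NaN (assumed) */
        return INT32_MIN;
    return (int32_t)f;           /* now guaranteed to be in range */
}
```

The range checks come first because casting an out-of-range float to an integer type is undefined behaviour in C.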

Because the conversion shall be fast and efficient, I would prefer a
solution using SSE (2/3).

This sounds like an easy problem, but unfortunately it turned out it's not
so simple after all.
So I would like to challenge any SSE experts on this list.

Here is what I have found out already:

Starting with the following sample input:

     const __m128 sseFloatInput =
         _mm_set_ps(1000.0, -1000, 3000000000.0, -3000000000.0);

My first approach was to convert this directly:

    const __m128i sseClippedInt = _mm_cvtps_epi32(sseFloatInput);

This results in 1000, -1000, -2147483648, -2147483648, which is correct for
all input samples but 3000000000.0. It turns out that all values which
cannot be represented by an int32 are converted to -2147483648.

To fix this, my next idea was to clip the maximum value before converting:

     const __m128 sseMax =
         _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
     const __m128i sseClippedInt =
         _mm_cvtps_epi32(_mm_min_ps(sseFloatInput, sseMax));

Well, the output is the same: 1000, -1000, -2147483648, -2147483648. What is
happening here? The maximum possible int32 (2147483647) cannot be
represented exactly as a single-precision floating-point number. So sseMax is
rounded up to 2147483648 (printed as 2.14748365e+09), and therefore the
clipped value is still (slightly) out of range, resulting in the same int32
values.
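The rounding can be checked in isolation with a few lines of scalar code. This is a minimal sketch assuming 32-bit IEEE 754 floats and the default round-to-nearest mode; the function name int32_max_rounds_up is made up.

```c
#include <stdint.h>

/* Returns 1 if converting INT32_MAX to float lands exactly on 2^31,
   i.e. one above the largest int32. The name is hypothetical. */
static int int32_max_rounds_up(void)
{
    /* volatile: force the value to be stored as a real 32-bit float. */
    volatile float f = (float)INT32_MAX;
    return f == 2147483648.0f;
}
```

The neighbouring floats are 2147483520 and 2147483648, and 2147483647 is much closer to the latter, so the conversion rounds up.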

My final approach was to make sseMax minimally smaller:

     const __m128 sseMax = _mm_set1_ps(
         std::nextafterf(float(std::numeric_limits<int32_t>::max()), 0.0f));

This results in 1000, -1000, 2147483520, -2147483648. This is the 'best'
solution so far, but still not what I want, because 3000000000.0 does not
clip to the maximum possible int32 (2147483647). It is obviously the same
problem as before: The clipping limit cannot be represented exactly as a
float. (sseMax is 2.14748352e+09 here)

Does anyone have an _efficient_ solution for this problem? Does it really
need a (probably very inefficient) detour using double or int64?
