Hello,

I'm no SSE expert either, but I would exploit the IEEE 754 single-precision floating-point representation.

Essentially, the bit pattern 0x4f000000 represents 2147483648.f, while 0x4effffff represents 2147483520.f, the largest float below it. OTOH, in 32-bit two's complement, 0x7fffffff is 2147483647 and 0x80000000 is -2147483648.
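To check these bit patterns yourself, here is a minimal sketch. The helper names float_from_bits and bits_from_float are made up for illustration, and the code assumes float is a 32-bit IEEE 754 single:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a 32-bit pattern as a float.
   memcpy avoids the strict-aliasing problems of pointer casts. */
float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: return the raw bit pattern of a float. */
uint32_t bits_from_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

For non-negative floats the bit patterns, read as integers, sort in the same order as the values they encode, which is why an integer comparison against 0x4effffff can stand in for a floating-point one.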

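The idea is then to convert using _mm_cvtps_epi32 as you did, and subtract 1 if the input's bit pattern, read as a signed integer, is greater than 0x4effffff (i.e. the input is at least 2147483648.f). In that case the conversion yields 0x80000000, and subtracting 1 wraps around to 0x7fffffff.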

Here's the code:

#include <smmintrin.h>
#include <emmintrin.h>
#include <stdio.h>

int main()
{
    const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f, 3000000000.f, -3000000000.f);

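    const __m128i ones = _mm_set1_epi32(1);
    /* Bit pattern of 2147483520.f, the largest float below 2^31. */
    const __m128i h = _mm_set1_epi32(0x4effffff);

    __m128i x = _mm_cvtps_epi32(sseFloatInput);  /* out-of-range lanes become 0x80000000 */
    __m128i i = _mm_castps_si128(sseFloatInput); /* raw float bits as signed ints */
    __m128i m = _mm_max_epi32(i, h);             /* SSE4.1 */
    __m128i s = _mm_sub_epi32(m, h);             /* > 0 only where the input is >= 2^31 */
    __m128i y = _mm_sign_epi32(ones, s);         /* SSSE3: 1 where s > 0, else 0 */
    __m128i r = _mm_sub_epi32(x, y);             /* 0x80000000 - 1 wraps to 0x7fffffff */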

    printf("%d %d %d %d\n",
        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
        _mm_cvtsi128_si32(r)
        );
}

I get the correct result: 1000 -1000 2147483647 -2147483648.
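Note that _mm_max_epi32 needs SSE4.1 and _mm_sign_epi32 needs SSSE3. If the target is limited to SSE2, one possible variant (a different trick from the code above, sketched here under that assumption) is to compare against 2147483648.f as floats and XOR the resulting mask into the converted value. The function names are hypothetical, and NaN inputs come out as -2147483648 in this sketch:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper, not from the code above: saturating float -> int32
   conversion of four lanes using SSE2 only. */
__m128i cvtps_epi32_saturate(__m128 v)
{
    const __m128 limit = _mm_set1_ps(2147483648.0f);          /* 2^31, exact in float */
    __m128i x = _mm_cvtps_epi32(v);                           /* out of range -> 0x80000000 */
    __m128i over = _mm_castps_si128(_mm_cmpge_ps(v, limit));  /* all ones where v >= 2^31 */
    return _mm_xor_si128(x, over);                            /* 0x80000000 ^ 0xffffffff = 0x7fffffff */
}

/* Convenience wrapper for unaligned arrays of four samples. */
void float4_to_int32_saturate(const float in[4], int32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, cvtps_epi32_saturate(_mm_loadu_ps(in)));
}
```

Lanes below 2^31 get an all-zero mask, so the XOR leaves them unchanged. Too-negative lanes already convert to 0x80000000, which is the desired minimum.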

HTH.

Best,

Stefano D'Angelo

On 26/04/23 09:09, Holger Strauss wrote:
Hi,

thank you all for the interesting discussion posts on denorms and
fixed-point/floating-point processing.

I have a problem that is very much related to the arguments posted by B.J.,
mentioning the lack of saturating arithmetic on x86/x64 processors.

I need to convert a batch of 32 bit float samples to 32 bit int samples with
appropriate clipping. I.e. samples which are outside the range of a 32 bit
int (-2147483648..2147483647) shall be clipped to -2147483648 or
2147483647.

Because the conversion shall be fast and efficient, I would prefer a
solution using SSE (2/3).

This sounds like an easy problem, but unfortunately it turned out it's not
so simple after all.
So I would like to challenge any SSE experts on this list.

Here is what I have found out already:

Starting with the following sample input:

     const __m128 sseFloatInput = _mm_set_ps(1000.0, -1000, 3000000000.0, -3000000000.0);

My first approach was to convert this directly:

    const __m128i sseClippedInt = _mm_cvtps_epi32(sseFloatInput);

This results in 1000, -1000, -2147483648, -2147483648, which is correct for
all input samples but 3000000000.0. It turns out that all values which
cannot be represented by an int32 are converted to -2147483648.

To fix this, my next idea was to clip the maximum value before converting:

     const __m128 sseMax = _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
     const __m128i sseClippedInt = _mm_cvtps_epi32(_mm_min_ps(sseFloatInput, sseMax));

Well, the output is the same: 1000, -1000, -2147483648, -2147483648. What is
happening here? The maximum possible int32 (2147483647) cannot be
represented exactly as a single-precision float. So sseMax is rounded up to
2147483648 (2.14748365e+09), and therefore the clipped value is still
(slightly) out of range, resulting in the same int32 values.
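The rounding can be checked in isolation. This is a minimal sketch with a made-up function name, assuming float is a 32-bit IEEE 754 single and the default round-to-nearest mode:

```c
#include <stdint.h>

/* Hypothetical helper: what the largest int32 becomes once stored in a float.
   A float has a 24-bit significand, so just below 2^31 only multiples of 128
   are representable; 2147483647 lies between 2147483520 and 2147483648 and
   rounds to the nearer one, 2147483648. */
float int32_max_as_float(void)
{
    return (float)INT32_MAX;
}
```

The result compares equal to 2147483648.0f, which is one more than any int32 can hold.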

My final approach was to make sseMax minimally smaller:

     const __m128 sseMax = _mm_set1_ps(std::nextafterf(float(std::numeric_limits<int32_t>::max()), 0.0f));

This results in 1000, -1000, 2147483520, -2147483648. This is the 'best'
solution so far, but still not what I want, because 3000000000.0 does not
clip to the maximum possible int32 (2147483647). It is obviously the same
problem as before: The clipping limit cannot be represented exactly as a
float. (sseMax is 2.14748352e+09 here)

Does anyone have an _efficient_ solution for this problem? Does it really
need a (probably very inefficient) detour using double or int64?
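For reference, the double detour is easy to write in scalar code, because a double represents both int32 limits exactly. The sketch below is only meant as a baseline for testing faster versions. The function name is made up, and it makes two assumptions of its own: NaN maps to 0, and in-range values are truncated toward zero by the cast, whereas _mm_cvtps_epi32 rounds according to the current MXCSR rounding mode.

```c
#include <stdint.h>

/* Hypothetical scalar reference: clamp in double, then convert. */
int32_t float_to_int32_saturate_ref(float f)
{
    double d = (double)f;                        /* float -> double is exact */
    if (d != d) return 0;                        /* NaN: assumption, map to 0 */
    if (d >= 2147483647.0) return INT32_MAX;     /* clip at the upper limit */
    if (d <= -2147483648.0) return INT32_MIN;    /* clip at the lower limit */
    return (int32_t)d;                           /* in range: truncates toward zero */
}
```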
