On Wed, 20 May 2026 at 19:24, Richard Henderson
<[email protected]> wrote:
>
> Signed-off-by: Richard Henderson <[email protected]>

> +static uint8_t fcvt_fp8_e5m2_output(FloatParts64 *p, int scale,
> +                                    bool saturate, float_status *s)
> +{
> +    /*
> +     * Because e5m2 has an infinity encoding, we need to handle
> +     * saturation conversion of Inf -> Max manually.
> +     */
> +    if (unlikely(p->cls == float_class_inf)) {
> +        if (saturate) {
> +            p->cls = float_class_normal;
> +            p->exp = float8_e5m2_params.exp_max;
> +            p->frac = -1ull << float8_e5m2_params.frac_shift;

This value is larger than the maximum representable normal,
so although round_pack_canonical will correctly saturate it
down to the maximum normal, it will also set Inexact and
Overflow in the process. In the pseudocode FPConvertFP8(),
input Infinity is special-cased and returns either Infinity
or the maximum normal without setting any exception flags.

To get the exact maximum normal you want:

            p->exp = float8_e5m2_params.exp_max - float8_e5m2_params.exp_bias - 1;
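
As a sanity check on that arithmetic, here is a minimal standalone sketch
(not QEMU code). The E5M2_* constants and the helper name are hypothetical,
chosen to mirror what I'd expect float8_e5m2_params to contain (5 exponent
bits, 2 fraction bits, bias 15, exp_max 31), and it assumes p->exp holds an
unbiased exponent at this point:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed E5M2 layout; these names are illustrative, not QEMU's. */
enum { E5M2_FRAC_SIZE = 2, E5M2_EXP_BIAS = 15, E5M2_EXP_MAX = 31 };

/* Unbiased exponent of the largest normal: biased 30 minus the bias. */
enum { E5M2_MAX_NORMAL_EXP = E5M2_EXP_MAX - E5M2_EXP_BIAS - 1 };

/*
 * Integer value of a number with an all-ones fraction (binary 1.11)
 * and the given unbiased exponent. Only valid for exponents large
 * enough that the result has no fractional part.
 */
static uint64_t e5m2_all_ones_value(int unbiased_exp)
{
    uint64_t significand = (1u << (E5M2_FRAC_SIZE + 1)) - 1; /* 0b111 */

    assert(unbiased_exp >= E5M2_FRAC_SIZE && unbiased_exp < 60);
    return significand << (unbiased_exp - E5M2_FRAC_SIZE);
}
```

With these assumptions, e5m2_all_ones_value(E5M2_MAX_NORMAL_EXP) is
1.75 * 2^15 = 57344, the largest finite E5M2 value, whereas passing
E5M2_EXP_MAX as in the patch gives 1.75 * 2^31, which is far out of range
and so has to go through the overflow path.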

Or we could shortcut the packing process and just return
the right value:

            /* maximum or minimum normal value for E5M2 */
            return 0x7b | (p->sign << 7);
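
For reference, a standalone sketch of where 0x7b comes from, assuming the
usual E5M2 layout of 1 sign bit, 5 exponent bits and 2 fraction bits. The
helper names are hypothetical and nothing here is QEMU API:

```c
#include <assert.h>
#include <stdint.h>

/* Pack sign (1 bit), biased exponent (5 bits), fraction (2 bits). */
static uint8_t e5m2_pack(unsigned sign, unsigned biased_exp, unsigned frac)
{
    assert(sign <= 1 && biased_exp <= 0x1f && frac <= 0x3);
    return (uint8_t)((sign << 7) | (biased_exp << 2) | frac);
}

/*
 * Largest-magnitude normal: biased exponent 30 (one below the
 * all-ones exponent used by Inf/NaN) with an all-ones fraction.
 */
static uint8_t e5m2_max_normal(unsigned sign)
{
    return e5m2_pack(sign, 30, 3);
}
```

Here e5m2_max_normal(0) is 0x7b and e5m2_max_normal(1) is 0xfb, which
matches the expression above; the infinity encoding, e5m2_pack(0, 31, 0),
is 0x7c.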

> +        }
> +    } else {
> +        *p = parts64_scalbn(p, scale, s);
> +    }
> +    return float8_e5m2_round_pack_canonical(p, s, saturate);
> +}

-- PMM
