On Wed, 20 May 2026 at 19:24, Richard Henderson
<[email protected]> wrote:
>
> Signed-off-by: Richard Henderson <[email protected]>
> +static uint8_t fcvt_fp8_e5m2_output(FloatParts64 *p, int scale,
> +                                    bool saturate, float_status *s)
> +{
> +    /*
> +     * Because e5m2 has an infinity encoding, we need to handle
> +     * saturation conversion of Inf -> Max manually.
> +     */
> +    if (unlikely(p->cls == float_class_inf)) {
> +        if (saturate) {
> +            p->cls = float_class_normal;
> +            p->exp = float8_e5m2_params.exp_max;
> +            p->frac = -1ull << float8_e5m2_params.frac_shift;

This value is larger than the maximum representable normal, so
although round_pack_canonical will correctly saturate it down to
the maximum normal, it will also set Inexact and Overflow in the
process. In the pseudocode FPConvertFP8(), an input Infinity is
special-cased and returns either Infinity or the maximum normal
without setting any exception flags.

To get the exact maximum normal you want

    p->exp = float8_e5m2_params.exp_max -
             float8_e5m2_params.exp_bias - 1;

Or we could shortcut the packing process and just return
the right value:

    /* maximum or minimum normal value for E5M2 */
    return 0x7b | (p->sign << 7);
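
As a sanity check on the 0x7b constant, here is a minimal standalone
sketch. The helper names are hypothetical (they are not QEMU
functions), and it assumes the usual E5M2 layout of 1 sign bit,
5 exponent bits with bias 15, and 2 fraction bits, with biased
exponent 0x1f reserved for Inf/NaN:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed E5M2 layout: sign[7] exponent[6:2] fraction[1:0], bias 15. */
enum {
    E5M2_EXP_ALL_ONES = 0x1f,   /* reserved for Inf/NaN */
    E5M2_FRAC_ALL_ONES = 0x3,
    E5M2_BIAS = 15,
};

/* Hypothetical helper: pack raw fields into an E5M2 byte. */
static uint8_t e5m2_pack(bool sign, unsigned biased_exp, unsigned frac)
{
    return (uint8_t)(((unsigned)sign << 7) |
                     ((biased_exp & 0x1f) << 2) |
                     (frac & 0x3));
}

/* Largest-magnitude finite value: exponent one below all-ones. */
static uint8_t e5m2_max_normal(bool sign)
{
    return e5m2_pack(sign, E5M2_EXP_ALL_ONES - 1, E5M2_FRAC_ALL_ONES);
}

/*
 * Unbiased exponent of the maximum normal. This equals
 * exp_max - exp_bias - 1 if exp_max is 31 and exp_bias is 15,
 * which are assumed values for float8_e5m2_params.
 */
static int e5m2_max_unbiased_exp(void)
{
    return (E5M2_EXP_ALL_ONES - 1) - E5M2_BIAS;
}

/* Infinity has an all-ones exponent and a zero fraction. */
static bool e5m2_is_inf(uint8_t v)
{
    return (v & 0x7f) == 0x7c;
}
```

Under those assumptions e5m2_max_normal() yields 0x7b for positive
and 0xfb for negative inputs, one encoding below the 0x7c/0xfc
infinities, so the shortcut never goes through rounding and cannot
raise Inexact or Overflow.
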
> +        }
> +    } else {
> +        *p = parts64_scalbn(p, scale, s);
> +    }
> +    return float8_e5m2_round_pack_canonical(p, s, saturate);
> +}
-- PMM