On Sun, 17 May 2026 at 01:32, Richard Henderson
<[email protected]> wrote:
>
> Signed-off-by: Richard Henderson <[email protected]>

> +static uint8_t fcvt_fp8_e5m2_output(FloatParts64 *p, int scale,
> +                                    bool saturate, float_status *s)
> +{
> +    /*
> +     * Because e5m2 has an infinity encoding, we need to handle
> +     * conversion of Inf -> Max manually.  This will be converted
> +     * to the actual maximum value during rounding.
> +     */
> +    if (unlikely(p->cls == float_class_inf)) {
> +        if (saturate) {
> +            p->cls = float_class_normal;
> +            p->exp = INT_MAX;
> +            p->frac = -1;
> +        }

This "saturate to the maximum value" codepath doesn't seem
to work -- we end up returning 0. But I don't know if this is
a bug in the softfloat code, or if this FloatParts64 is not
a valid thing to pass to it.

> +    } else {
> +        *p = parts64_scalbn(p, scale, s);
> +    }
> +    return float8_e5m2_round_pack_canonical(p, s, saturate);

What happens is that float8_e5m2_round_pack_canonical()
calls parts64_uncanon() on this. In the uncanon_normal code
we end up in this path:

    if (likely(exp > 0)) {
        if (p->frac_lo & round_mask) {
            flags |= float_flag_inexact;
            if (fracN(addi)(p, p, inc)) {
                fracN(shr)(p, 1);
                p->frac_hi |= DECOMPOSED_IMPLICIT_BIT;
                exp++;
            }
            p->frac_lo &= ~round_mask;
        }

Since round_mask is 0x1fffffffffffffff and p->frac_lo is
all-ones, we take the "do rounding" branch. The "round to
nearest even" logic decides it needs to round up, so
inc is 0x1000000000000000. The add carries out of the
fraction, so we shift the fraction right and increment the
exponent. Since exp was already INT_MAX, it is now
0x80000000 (i.e. it has overflowed to negative, because it
is an int type), and we don't trigger the "exp >= exp_max"
check for overflow. So instead we return this bogus struct:
  {cls = float_class_normal, sign = 0x0, exp = 0x80000000,
   {frac = 0x4, frac_hi = 0x4, frac_lo = 0x4}}

where the exponent and frac values are too big for the
format, and then the packing code produces a zero byte
because the bits it extracts from those struct members
are all zero.
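
To make the failure mode concrete, here is a minimal standalone
sketch of just the exponent arithmetic. It is not the softfloat
code: wrapping_inc() and overflow_check_fires() are made-up names,
and the exp_max of 31 is only an illustrative value for a format
with 5 exponent bits. The increment is done in unsigned arithmetic
so the sketch itself avoids signed-overflow undefined behaviour.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the "exp++" in the rounding path,
 * with explicit wrapping semantics. The conversion back to
 * int32_t yields INT32_MIN for 0x80000000 on two's complement
 * targets.
 */
static int32_t wrapping_inc(int32_t exp)
{
    return (int32_t)((uint32_t)exp + 1u);
}

/* Hypothetical stand-in for the "exp >= exp_max" overflow check. */
static bool overflow_check_fires(int32_t exp, int32_t exp_max)
{
    return exp >= exp_max;
}
```

With exp starting at INT32_MAX, wrapping_inc() gives INT32_MIN, so
overflow_check_fires() returns false. With any exponent that has
headroom (e.g. 30 with exp_max 31) the check fires as intended.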

So is the problem that we shouldn't be feeding in INT_MAX
as a p->exp, or that the "round up" code should be
checking that it isn't going to overflow with the exp++?
Does the same thing apply to the "exp = INT32_MAX" in the
exp_bias handling in parts_uncanon itself?
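
For the second option, the guard could be as simple as clamping
the increment. This is only a sketch under the assumption that
leaving exp at INT_MAX is acceptable to the later
"exp >= exp_max" check; exp_inc_saturating() is a hypothetical
helper, not something that exists in softfloat today.

```c
#include <limits.h>

/*
 * Hypothetical helper: increment exp, but stay at INT_MAX
 * rather than wrapping to a negative value.
 */
static int exp_inc_saturating(int exp)
{
    return exp == INT_MAX ? INT_MAX : exp + 1;
}
```

With this, an exponent that starts at INT_MAX stays at INT_MAX
after rounding up, so a subsequent "exp >= exp_max" comparison
still sees a huge positive value.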

thanks
-- PMM
