arm: Implement FMMLA (FP8 to FP16) for AdvSIMD

Peter Maydell Thu, 21 May 2026 02:53:48 -0700

On Wed, 20 May 2026 at 19:29, Richard Henderson
<[email protected]> wrote:
>
> Signed-off-by: Richard Henderson <[email protected]>


> +void HELPER(gvec_fmmla_hb)(void *vd, void *vn, void *vm,
> +                           CPUARMState *env, uint32_t desc)

This still has some lurking copy-and-paste issues from the _sb
version:

> +{
> +    FP8MulContext ctx = fp8_mul_start(env, 0xf);
> +    size_t oprsz = simd_oprsz(desc);
> +    size_t nseg = oprsz / 16;

Each loop iteration here handles four 16-bit half-precision
outputs, which is 8 bytes, so we want oprsz / 8.
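
To spell out the arithmetic, here is a minimal standalone sketch. It is not the patch code: sk_float16 and sk_fmmla_hb_nseg are made-up names, and the 16-bit typedef only stands in for QEMU's float16 storage type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU's float16 (a 16-bit storage type). */
typedef uint16_t sk_float16;

/* Each loop iteration writes the four results d0..d3. */
enum { SK_OUTPUTS_PER_SEG = 4 };

/* Number of loop iterations needed to cover oprsz bytes of destination. */
static size_t sk_fmmla_hb_nseg(size_t oprsz)
{
    size_t seg_bytes = SK_OUTPUTS_PER_SEG * sizeof(sk_float16); /* 8 bytes */

    /* Assumption: the operation size is a whole number of segments. */
    assert(oprsz % seg_bytes == 0);
    return oprsz / seg_bytes;
}
```

For a 16-byte vector this gives two iterations, whereas oprsz / 16 gives one and would leave the upper four float16 elements of d unprocessed.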

> +    uint32_t *n = vn;
> +    uint32_t *m = vm;
> +    float16 *d = vd;
> +
> +    for (size_t seg = 0; seg < nseg; seg++, d += 4, n += 2, m += 2) {
> +        float16 d0 = f8dotadd_h(n[0], m[0], 4, d[H4(0)], &ctx);
> +        float16 d1 = f8dotadd_h(n[0], m[1], 4, d[H4(1)], &ctx);
> +        float16 d2 = f8dotadd_h(n[1], m[0], 4, d[H4(2)], &ctx);
> +        float16 d3 = f8dotadd_h(n[1], m[1], 4, d[H4(3)], &ctx);
> +
> +        d[H4(0)] = d0;
> +        d[H4(1)] = d1;
> +        d[H4(2)] = d2;
> +        d[H4(3)] = d3;

I think the H macros here are wrong -- d is a float16 array, so
we want H2() there, and it is the n and m arrays that need H4().
(I think in fact if you work it through then all the H macros
cancel out and we could drop the lot, but since they're all
acting on constant indexes there's no runtime cost and having
them present is clearer for the reader.)
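
As a rough check of the "cancel out" remark, the sketch below models one loop iteration with H2() on d and H4() on n and m. It is a standalone illustration under stated assumptions, not the patch code: the sk_* names are hypothetical, the host-endianness is a runtime flag rather than a compile-time macro, sk_h2/sk_h4 assume the usual big-endian-host adjustments (index ^ 3 for 16-bit elements, index ^ 1 for 32-bit elements), and sk_dotadd is a placeholder that does no FP8 arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for H2()/H4(): identity on LE hosts, XOR on BE hosts. */
static unsigned sk_h2(unsigned i, int host_be) { return host_be ? i ^ 3 : i; }
static unsigned sk_h4(unsigned i, int host_be) { return host_be ? i ^ 1 : i; }

/* Placeholder for f8dotadd_h(): NOT FP8 math, just some pure function. */
static uint16_t sk_dotadd(uint32_t n, uint32_t m, uint16_t acc)
{
    return (uint16_t)(acc + (n & 0xff) * (m & 0xff));
}

/* One iteration: H2() on the 16-bit d, H4() on the 32-bit n and m. */
static void sk_fmmla_hb_segment(uint16_t *d, const uint32_t *n,
                                const uint32_t *m, int be)
{
    assert(d && n && m);

    uint16_t d0 = sk_dotadd(n[sk_h4(0, be)], m[sk_h4(0, be)], d[sk_h2(0, be)]);
    uint16_t d1 = sk_dotadd(n[sk_h4(0, be)], m[sk_h4(1, be)], d[sk_h2(1, be)]);
    uint16_t d2 = sk_dotadd(n[sk_h4(1, be)], m[sk_h4(0, be)], d[sk_h2(2, be)]);
    uint16_t d3 = sk_dotadd(n[sk_h4(1, be)], m[sk_h4(1, be)], d[sk_h2(3, be)]);

    d[sk_h2(0, be)] = d0;
    d[sk_h2(1, be)] = d1;
    d[sk_h2(2, be)] = d2;
    d[sk_h2(3, be)] = d3;
}

/* Returns 1 if both host layouts leave identical bytes in d. */
static int sk_layouts_agree(void)
{
    const uint32_t n[2] = {1, 2}, m[2] = {3, 4};
    uint16_t d_le[4] = {10, 20, 30, 40};
    uint16_t d_be[4] = {10, 20, 30, 40};

    sk_fmmla_hb_segment(d_le, n, m, 0);
    sk_fmmla_hb_segment(d_be, n, m, 1);
    return memcmp(d_le, d_be, sizeof(d_le)) == 0;
}
```

In this model the stored result is the same for both layouts: host slot j of d always ends up combining n[j >> 1] with m[j & 1], because the ^ 3 on the d index and the ^ 1 on the n and m indexes undo each other.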

thanks
-- PMM

Re: [PATCH v6 60/64] target/arm: Implement FMMLA (FP8 to FP16) for AdvSIMD
