On Wed, 20 May 2026 at 19:29, Richard Henderson
<[email protected]> wrote:
>
> Signed-off-by: Richard Henderson <[email protected]>
> +void HELPER(gvec_fmmla_hb)(void *vd, void *vn, void *vm,
> + CPUARMState *env, uint32_t desc)
This still has some lurking copy-and-paste issues from the _sb
version:
> +{
> + FP8MulContext ctx = fp8_mul_start(env, 0xf);
> + size_t oprsz = simd_oprsz(desc);
> + size_t nseg = oprsz / 16;
Each loop iteration here produces four 16-bit half-precision
outputs (d += 4, i.e. 8 bytes), so nseg should be oprsz / 8.
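As a sanity check on the arithmetic, here is a minimal standalone
sketch. SEG_BYTES and hb_nseg() are names made up for illustration
and are not part of the patch; uint16_t stands in for float16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes of d consumed per loop iteration: d += 4 with 16-bit elements. */
enum { SEG_BYTES = 4 * sizeof(uint16_t) };

/* n and m advance by two uint32_t per iteration, which is the same 8 bytes. */
_Static_assert(2 * sizeof(uint32_t) == SEG_BYTES,
               "source and destination strides should match");

/* Hypothetical helper: iterations needed to cover oprsz bytes of output. */
static size_t hb_nseg(size_t oprsz)
{
    assert(oprsz % SEG_BYTES == 0);
    return oprsz / SEG_BYTES;
}
```

With a 16-byte operation this gives two iterations, where oprsz / 16
would give one and leave the upper 8 bytes of the result unwritten.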
> + uint32_t *n = vn;
> + uint32_t *m = vm;
> + float16 *d = vd;
> +
> + for (size_t seg = 0; seg < nseg; seg++, d += 4, n += 2, m += 2) {
> + float16 d0 = f8dotadd_h(n[0], m[0], 4, d[H4(0)], &ctx);
> + float16 d1 = f8dotadd_h(n[0], m[1], 4, d[H4(1)], &ctx);
> + float16 d2 = f8dotadd_h(n[1], m[0], 4, d[H4(2)], &ctx);
> + float16 d3 = f8dotadd_h(n[1], m[1], 4, d[H4(3)], &ctx);
> +
> + d[H4(0)] = d0;
> + d[H4(1)] = d1;
> + d[H4(2)] = d2;
> + d[H4(3)] = d3;
I think the H macros here are wrong: d is a float16 array, so it
wants H2(), and the uint32_t n and m arrays are the ones that need H4().
(I think in fact if you work it through then all the H macros
cancel out and we could drop the lot, but since they're all
acting on constant indexes there's no runtime cost and having
them present is clearer for the reader.)
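To back up the "cancel out" claim, here is a small standalone check.
It is only a sketch: BE_H2() and BE_H4() are local stand-ins that
assume the big-endian-host forms of the macros are (x) ^ 3 and
(x) ^ 1 (on a little-endian host both are the identity, so there is
nothing to check), and h_macros_cancel() is a made-up name.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed big-endian-host forms of H2() and H4(); local stand-ins only. */
#define BE_H2(x) ((x) ^ 3)
#define BE_H4(x) ((x) ^ 1)

/*
 * With the macros, logical output i is stored at d[BE_H2(i)] and is
 * computed from n[BE_H4(i / 2)] and m[BE_H4(i % 2)].
 * Without the macros, d[k] is computed from n[k / 2] and m[k % 2].
 * Return true if both schemes pair every d slot with the same n and m slots.
 */
static bool h_macros_cancel(void)
{
    for (int i = 0; i < 4; i++) {
        int d_idx = BE_H2(i);
        int n_idx = BE_H4(i / 2);
        int m_idx = BE_H4(i % 2);

        if (n_idx != d_idx / 2 || m_idx != d_idx % 2) {
            return false;
        }
    }
    return true;
}
```

This only covers which array slots get paired up. It says nothing
about the order of the FP8 bytes inside each uint32_t, which does not
depend on the index macros.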
thanks
-- PMM