On Tue, Sep 13, 2022 at 10:17 AM Richard Henderson
<richard.hender...@linaro.org> wrote:
>
> On 9/12/22 00:04, Paolo Bonzini wrote:
> > +        while (vec_len > 8) {
> > +            vec_len -= 8;
> > +            tcg_gen_shli_tl(s->T0, s->T0, 8);
> > +            tcg_gen_ld8u_tl(t, cpu_env, offsetof(CPUX86State,
> > xmm_t0.ZMM_B(vec_len - 1)));
> > +            tcg_gen_or_tl(s->T0, s->T0, t);
> >          }
>
> The shl + or is deposit, for those hosts that have it,
> and will be re-expanded to shl + or for those that don't:
>
>     tcg_gen_ld8u_tl(t, ...);
>     tcg_gen_deposit_tl(s->T0, t, s->T0, 8, TARGET_LONG_BITS - 8);

What you get from that is an shl(t, 56) followed by extract2 (i.e. SHRD).
Yeah, there are targets with a native deposit (x86 itself could add
PDEP/PEXT support, I guess), but I find it hard to believe that it
outperforms a simple shl + or.

If we want to get clever, I should instead load ZMM_B(vec_len - 1)
directly into the *high* byte of t, using ZMM_L or ZMM_Q, and then issue
the extract2 myself.

Paolo
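
P.S. For anyone following along, here is a rough standalone C sketch of why the three forms discussed above compute the same value. It assumes a 64-bit target (TARGET_LONG_BITS == 64) and a zero-extended byte in t, as ld8u produces. The helpers deposit64_model, extract2_model and the via_* functions are made-up names for illustration only; they model the semantics of the TCG ops and are not QEMU APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit deposit: insert the low `len` bits of
 * `val` into `base` at bit offset `ofs`, leaving other bits of `base`. */
static uint64_t deposit64_model(uint64_t base, uint64_t val,
                                unsigned ofs, unsigned len)
{
    uint64_t field = (len == 64) ? ~0ULL : ((1ULL << len) - 1);
    uint64_t mask = field << ofs;
    return (base & ~mask) | ((val << ofs) & mask);
}

/* Hypothetical model of extract2: 64 bits taken from the 128-bit value
 * hi:lo starting at bit `ofs` (0 < ofs < 64), i.e. what SHRD computes. */
static uint64_t extract2_model(uint64_t lo, uint64_t hi, unsigned ofs)
{
    return (lo >> ofs) | (hi << (64 - ofs));
}

/* Form 1: the shl + or sequence from the patch. */
static uint64_t via_shl_or(uint64_t t0, uint8_t byte)
{
    return (t0 << 8) | byte;
}

/* Form 2: deposit the low 56 bits of t0 into the byte at offset 8. */
static uint64_t via_deposit(uint64_t t0, uint8_t byte)
{
    return deposit64_model(byte, t0, 8, 64 - 8);
}

/* Form 3: shl(t, 56) followed by extract2 at offset 56. */
static uint64_t via_extract2(uint64_t t0, uint8_t byte)
{
    uint64_t t1 = (uint64_t)byte << 56;
    return extract2_model(t1, t0, 56);
}

/* Assert that all three forms agree for one input pair. */
static void check_equivalence(uint64_t t0, uint8_t byte)
{
    uint64_t expect = via_shl_or(t0, byte);
    assert(via_deposit(t0, byte) == expect);
    assert(via_extract2(t0, byte) == expect);
}
```

Calling check_equivalence(0x1122334455667788, 0x99) passes, with all three forms giving 0x2233445566778899. The sketch says nothing about which form is faster on a given host; it only shows that loading the byte straight into the high position would make the explicit shl in form 3 unnecessary.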