On 9/14/22 23:59, Paolo Bonzini wrote:
On Tue, Sep 13, 2022 at 10:17 AM Richard Henderson
<richard.hender...@linaro.org> wrote:

On 9/12/22 00:04, Paolo Bonzini wrote:
+    while (vec_len > 8) {
+        vec_len -= 8;
+        tcg_gen_shli_tl(s->T0, s->T0, 8);
+        tcg_gen_ld8u_tl(t, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
+        tcg_gen_or_tl(s->T0, s->T0, t);
       }

The shl + or is deposit, for those hosts that have it,
and will be re-expanded to shl + or for those that don't:

      tcg_gen_ld8u_tl(t, ...);
      tcg_gen_deposit_tl(s->T0, t, s->T0, 8, TARGET_LONG_BITS - 8);
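
To make the equivalence concrete, here is a minimal plain-C sketch of the two forms of the loop step, assuming a 64-bit target (TARGET_LONG_BITS == 64). The helper names (step_shl_or, deposit_u64, step_deposit) are hypothetical and only model the semantics; they are not TCG or QEMU APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Original loop body: shift the accumulator left by 8, then OR in the byte. */
static inline uint64_t step_shl_or(uint64_t acc, uint8_t byte)
{
    return (acc << 8) | byte;
}

/*
 * Hypothetical model of a deposit operation: replace the len bits of base
 * starting at bit ofs with the low len bits of field.
 * Assumes 1 <= len <= 64 and ofs + len <= 64.
 */
static inline uint64_t deposit_u64(uint64_t base, uint64_t field,
                                   unsigned ofs, unsigned len)
{
    uint64_t mask = (len == 64 ? ~0ull : ((1ull << len) - 1)) << ofs;
    return (base & ~mask) | ((field << ofs) & mask);
}

/* Suggested form: deposit the old accumulator into bits [8, 64) of the byte. */
static inline uint64_t step_deposit(uint64_t acc, uint8_t byte)
{
    return deposit_u64(byte, acc, 8, 64 - 8);
}
```

Under these assumptions both steps yield the same value for any accumulator and byte: with acc = 0x1122 and byte = 0x33, each returns 0x112233.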

What you get from that is an shl(t, 56) followed by extract2 (i.e.
SHRD). Yeah there are targets with a native deposit (x86 itself could
add PDEP/PEXT support I guess) but I find it hard to believe that it
outperforms a simple shl + or.
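
The shl + extract2 expansion described here can be modelled in plain C as follows. This is a sketch only, assuming 64-bit values and that extract2 returns the low 64 bits of the 128-bit pair hi:lo shifted right by ofs (the SHRD behaviour); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of extract2 / SHRD: the low 64 bits of the 128-bit
 * value hi:lo shifted right by ofs. Assumes 0 < ofs < 64.
 */
static inline uint64_t extract2_u64(uint64_t lo, uint64_t hi, unsigned ofs)
{
    return (lo >> ofs) | (hi << (64 - ofs));
}

/* Expansion as described: shl(t, 56), then extract2 with the accumulator. */
static inline uint64_t step_shl_extract2(uint64_t acc, uint8_t byte)
{
    uint64_t t1 = (uint64_t)byte << 56;   /* move the byte to the top 8 bits */
    return extract2_u64(t1, acc, 56);     /* (t1 >> 56) | (acc << 8) */
}

/* Reference: the simple shl + or from the patch. */
static inline uint64_t shl_or_ref(uint64_t acc, uint8_t byte)
{
    return (acc << 8) | byte;
}
```

Both functions compute the same result; the difference is that the expansion needs a shift plus a double-width shift, where the patch uses a shift plus an OR.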

Perhaps the shl+shrd (or shrd+rol if the deposit is slightly different) is over-cleverness on my part in the expansion, and pdep requires a constant mask.

But for other hosts, deposit is the same cost as shift.


r~
