On 9/14/22 23:59, Paolo Bonzini wrote:
On Tue, Sep 13, 2022 at 10:17 AM Richard Henderson wrote:
On 9/12/22 00:04, Paolo Bonzini wrote:
+    while (vec_len > 8) {
+        vec_len -= 8;
+        tcg_gen_shli_tl(s->T0, s->T0, 8);
+        tcg_gen_ld8u_tl(t, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
+        tcg_gen_or_tl(s->T0, s->T0, t);
+    }
The shl + or is deposit,
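The following plain-C model (not TCG code) illustrates the remark that shl + or is a deposit. Each byte is loaded zero-extended, so "shift the accumulator left by 8, then OR in the byte" produces the same value as "deposit the old accumulator into bits [8, 64) of the new byte". The model assumes the accumulator starts out holding byte `vec_len - 1`, which the excerpt does not show. `deposit_model`, `gather_shl_or` and `gather_deposit` are hypothetical names. `deposit_model` implements a generic replace-a-bit-field operation and does not reproduce any specific QEMU helper.

```c
#include <assert.h>
#include <stdint.h>

/* Plain-C model of a bit-field deposit: returns `base` with bits
 * [ofs, ofs + len) replaced by the low `len` bits of `field`. */
static uint64_t deposit_model(uint64_t base, unsigned ofs, unsigned len,
                              uint64_t field)
{
    assert(len > 0 && ofs + len <= 64);
    uint64_t mask = (len == 64 ? ~UINT64_C(0) : (UINT64_C(1) << len) - 1) << ofs;
    return (base & ~mask) | ((field << ofs) & mask);
}

/* The quoted loop as written: shift the accumulator left by 8, then OR in
 * the next zero-extended byte. Assumes acc starts as byte vec_len - 1. */
static uint64_t gather_shl_or(const uint8_t *buf, int vec_len)
{
    uint64_t acc = buf[vec_len - 1];
    while (vec_len > 8) {
        vec_len -= 8;
        acc <<= 8;
        acc |= buf[vec_len - 1];
    }
    return acc;
}

/* Same loop with shl + or replaced by one deposit: the old accumulator
 * fills bits [8, 64) above the freshly loaded byte. */
static uint64_t gather_deposit(const uint8_t *buf, int vec_len)
{
    uint64_t acc = buf[vec_len - 1];
    while (vec_len > 8) {
        vec_len -= 8;
        acc = deposit_model(buf[vec_len - 1], 8, 56, acc);
    }
    return acc;
}
```

With `buf[i] = i` and `vec_len = 32`, both functions return `0x1F170F07`, which is bytes 31, 23, 15 and 7 (the top byte of each 8-byte lane), packed from high to low.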
From: Richard Henderson
As pmovmskb is used by strlen et al., this is the third-highest
overhead SSE operation, at 0.8%.
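For context, PMOVMSKB gathers the most significant bit of each byte of a vector into a general-purpose register. The following scalar C model follows that definition and pairs it with the common SSE2 strlen idiom: compare each byte with zero (PCMPEQB), take the byte mask, and find its lowest set bit. That idiom explains why the instruction appears in string-function profiles. The function names are hypothetical. A real strlen also handles alignment and loops over successive blocks, and this sketch omits both.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of PMOVMSKB: bit i of the result is the most significant
 * bit of byte i of the source vector (16 bytes for SSE2, 32 for AVX2). */
static uint32_t pmovmskb_model(const uint8_t *vec, size_t nbytes)
{
    assert(nbytes <= 32);
    uint32_t mask = 0;
    for (size_t i = 0; i < nbytes; i++)
        mask |= (uint32_t)(vec[i] >> 7) << i;
    return mask;
}

/* Strlen-style use on one 16-byte block: emulate PCMPEQB against a zero
 * vector (0xFF where the byte is NUL), take the PMOVMSKB mask, and return
 * the index of the lowest set bit, or 16 if the block has no NUL. */
static size_t first_nul_in_block(const uint8_t block[16])
{
    uint8_t eq[16];
    for (size_t i = 0; i < 16; i++)
        eq[i] = (block[i] == 0) ? 0xFF : 0x00;

    uint32_t mask = pmovmskb_model(eq, 16);
    if (mask == 0)
        return 16;

    size_t idx = 0;
    while (!(mask & 1u)) {   /* count trailing zeros, as BSF/TZCNT would */
        mask >>= 1;
        idx++;
    }
    return idx;
}
```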
Signed-off-by: Richard Henderson
[Reorganize to generate code for any vector size. - Paolo]
Signed-off-by: Paolo Bonzini
---
target/i386/tcg/emit.c.inc | 65