On Fri, Apr 28, 2023 at 6:58 AM Daniel Henrique Barboza
<dbarb...@ventanamicro.com> wrote:
>
> The function is a no-op if 'vta' is zero but we're still doing a lot of
> stuff in this function regardless. vext_set_elems_1s() will do nothing
> on every single call (since vta is zero) and we just waste time.
>
> Skip it altogether in this case. Aside from the code simplification
> there's a noticeable emulation performance gain by doing it. For a
> regular C binary that does a vector operation like this:
>
> =======
> #include <stdlib.h>
>
> #define SZ 10000000
>
> int main ()
> {
>   int *a = malloc (SZ * sizeof (int));
>   int *b = malloc (SZ * sizeof (int));
>   int *c = malloc (SZ * sizeof (int));
>
>   for (int i = 0; i < SZ; i++)
>     c[i] = a[i] + b[i];
>   return c[SZ - 1];
> }
> =======
>
> Emulating it with qemu-riscv64 and RVV takes ~0.3 sec:
>
> $ time ~/work/qemu/build/qemu-riscv64 \
>     -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo.out
>
> real    0m0.303s
> user    0m0.281s
> sys     0m0.023s
>
> With this skip we take ~0.275 sec:
>
> $ time ~/work/qemu/build/qemu-riscv64 \
>     -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo.out
>
> real    0m0.274s
> user    0m0.252s
> sys     0m0.019s
>
> This performance gain adds up fast when executing heavy benchmarks like
> SPEC.
>
> Signed-off-by: Daniel Henrique Barboza <dbarb...@ventanamicro.com>

Thanks!

Applied to riscv-to-apply.next

Alistair

> ---
>  target/riscv/vector_helper.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index f4d0438988..8e6c99e573 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -268,12 +268,17 @@ static void vext_set_tail_elems_1s(CPURISCVState *env, target_ulong vl,
>                                     void *vd, uint32_t desc, uint32_t nf,
>                                     uint32_t esz, uint32_t max_elems)
>  {
> -    uint32_t total_elems = vext_get_total_elems(env, desc, esz);
> -    uint32_t vlenb = riscv_cpu_cfg(env)->vlen >> 3;
> +    uint32_t total_elems, vlenb, registers_used;
>      uint32_t vta = vext_vta(desc);
> -    uint32_t registers_used;
>      int k;
>
> +    if (vta == 0) {
> +        return;
> +    }
> +
> +    total_elems = vext_get_total_elems(env, desc, esz);
> +    vlenb = riscv_cpu_cfg(env)->vlen >> 3;
> +
>      for (k = 0; k < nf; ++k) {
>          vext_set_elems_1s(vd, vta, (k * max_elems + vl) * esz,
>                            (k * max_elems + max_elems) * esz);
> --
> 2.40.0
>
>
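
For readers following along without the QEMU tree, the early-return idea in the hunk above can be sketched in isolation. This is a simplified model, not the QEMU code: set_elems_1s() and set_tail_elems_1s() are hypothetical stand-ins that take 'vta' as a plain argument instead of decoding it from 'desc', and the total_elems/vlenb handling after the loop is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for vext_set_elems_1s(): when the agnostic policy
 * is selected, fill bytes [cnt, tot) of 'base' with all 1s. Otherwise the
 * bytes are left undisturbed.
 */
static void set_elems_1s(void *base, uint32_t is_agnostic,
                         uint32_t cnt, uint32_t tot)
{
    assert(base != NULL);
    if (is_agnostic == 0 || tot <= cnt) {
        return;
    }
    memset((uint8_t *)base + cnt, 0xff, tot - cnt);
}

/*
 * Hypothetical, reduced model of the tail helper: 'vta' is passed in
 * directly (an assumption made to keep the sketch self-contained).
 */
static void set_tail_elems_1s(uint32_t vta, uint32_t vl, void *vd,
                              uint32_t nf, uint32_t esz, uint32_t max_elems)
{
    /* Nothing below can have an effect when vta is zero, so bail out first. */
    if (vta == 0) {
        return;
    }

    /* For each of the nf fields, set the tail elements [vl, max_elems). */
    for (uint32_t k = 0; k < nf; ++k) {
        set_elems_1s(vd, vta, (k * max_elems + vl) * esz,
                     (k * max_elems + max_elems) * esz);
    }
}
```

With a zeroed 16-byte buffer, calling set_tail_elems_1s(0, 2, buf, 1, 4, 4) leaves the buffer untouched, while set_tail_elems_1s(1, 2, buf, 1, 4, 4) sets bytes 8 through 15 to 0xff and leaves bytes 0 through 7 alone.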