On Wed, 13 Oct 2021 at 15:44, Vincent Palatin <vpala...@rivosinc.com> wrote:
>
> On Wed, Oct 13, 2021 at 3:13 PM Philipp Tomsich
> <philipp.toms...@vrull.eu> wrote:
> >
> > I had a much simpler version initially (using 3 x mask/shift/or, for
> > 12 instructions after setup of constants), but took up the suggestion
> > to optimize based on haszero(v)...
> > Indeed this appears to not do what we expect, when there's only 0x01
> > set in a byte.
> >
> > The less optimized form, with a single constant, that would still do
> > what we want is:
> >    /* set high-bit for non-zero bytes */
> >    constant = dup_const_tl(MO_8, 0x7f);
> >    tmp = v & constant;   // AND
> >    tmp += constant;      // ADD
> >    tmp |= v;             // OR
> >    /* extract high-bit to low-bit, for each word */
> >    tmp &= ~constant;     // ANDC
> >    tmp >>= 7;            // SHR
> >    /* multiply with 0xff to populate entire byte where the low-bit is set */
> >    tmp *= 0xff;          // MUL
> >
> > I'll submit a patch with this one later today, once I had a chance to
> > pass this through a full test.
>
>
> Thanks for the insight.
>
> I have tried it, implemented as:
> ```
> static void gen_orc_b(TCGv ret, TCGv source1)
> {
>     TCGv tmp = tcg_temp_new();
>     TCGv constant = tcg_constant_tl(dup_const_tl(MO_8, 0x7f));
>
>     /* set high-bit for non-zero bytes */
>     tcg_gen_and_tl(tmp, source1, constant);
>     tcg_gen_add_tl(tmp, tmp, constant);
>     tcg_gen_or_tl(tmp, tmp, source1);
>     /* extract high-bit to low-bit, for each word */
>     tcg_gen_andc_tl(tmp, tmp, constant);
>     tcg_gen_shri_tl(tmp, tmp, 7);
>
>     /* Replicate the lsb of each byte across the byte. */
>     tcg_gen_muli_tl(ret, tmp, 0xff);
>
>     tcg_temp_free(tmp);
> }
> ```
>
> It does pass my own test sequences.

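For anyone who wants to check the quoted sequence outside of QEMU, here is a minimal host-side sketch in plain C, assuming a 64-bit target_long. The names orc_b_ref, orc_b_swar and orc_b_selfcheck are made up for illustration and are not part of the patch. The constant is spelled out by hand rather than built with dup_const_tl.

```c
#include <assert.h>
#include <stdint.h>

/* Reference behaviour: every byte with any bit set becomes 0xff,
 * every zero byte stays 0x00. */
static uint64_t orc_b_ref(uint64_t v)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        if ((v >> (8 * i)) & 0xff) {
            r |= (uint64_t)0xff << (8 * i);
        }
    }
    return r;
}

/* The single-constant sequence from the quoted mail, 64-bit only. */
static uint64_t orc_b_swar(uint64_t v)
{
    const uint64_t c = UINT64_C(0x7f7f7f7f7f7f7f7f);
    uint64_t tmp = v & c;   /* AND:  low 7 bits of each byte */
    tmp += c;               /* ADD:  bit 7 set iff low 7 bits != 0;
                             *       max per byte is 0xfe, so no carry
                             *       crosses into the next byte */
    tmp |= v;               /* OR:   also catch bytes with bit 7 set */
    tmp &= ~c;              /* ANDC: keep only bit 7 of each byte */
    tmp >>= 7;              /* SHR:  0x01 per non-zero byte */
    return tmp * 0xff;      /* MUL:  0x01 -> 0xff, 0x00 -> 0x00 */
}

/* Hypothetical self-check: all 256 zero/non-zero byte patterns,
 * each with four fill values, including the 0x01 case from the
 * thread. Returns the number of inputs compared. */
static int orc_b_selfcheck(void)
{
    static const uint8_t fill[] = { 0x01, 0x7f, 0x80, 0xff };
    int checked = 0;
    for (unsigned f = 0; f < 4; f++) {
        for (unsigned mask = 0; mask < 256; mask++) {
            uint64_t v = 0;
            for (int i = 0; i < 8; i++) {
                if (mask & (1u << i)) {
                    v |= (uint64_t)fill[f] << (8 * i);
                }
            }
            assert(orc_b_swar(v) == orc_b_ref(v));
            checked++;
        }
    }
    return checked;
}
```

Calling orc_b_selfcheck() from a small main() is enough to run the comparison. It is only a sanity check of the arithmetic, not a substitute for testing the TCG translation itself.
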
I am running it against SPEC at the moment, with optimized
strlen/strcpy/strcmp functions built on orc.b. The verdict on that
should be available later today...

Philipp.