On 8/27/19 2:37 AM, Stefan Brankovic wrote:
> +    for (i = 0; i < 2; i++) {
> +        if (i == 0) {
> +            /* Get high doubleword of vB in avr. */
> +            get_avr64(avr, VB, true);
> +        } else {
> +            /* Get low doubleword of vB in avr. */
> +            get_avr64(avr, VB, false);
> +        }
> +        /*
> +         * Perform the count for every byte element using tcg_gen_clzi_i64.
> +         * Since it counts leading zeros on a 64-bit length, we have to move
> +         * the ith byte element to the highest 8 bits of tmp, OR it with mask
> +         * (so we get all ones in the lowest 56 bits), then perform
> +         * tcg_gen_clzi_i64 and move its result into the appropriate byte
> +         * element of result.
> +         */
> +        tcg_gen_shli_i64(tmp, avr, 56);
> +        tcg_gen_or_i64(tmp, tmp, mask);
> +        tcg_gen_clzi_i64(result, tmp, 64);
> +        for (j = 1; j < 7; j++) {
> +            tcg_gen_shli_i64(tmp, avr, (7 - j) * 8);
> +            tcg_gen_or_i64(tmp, tmp, mask);
> +            tcg_gen_clzi_i64(tmp, tmp, 64);
> +            tcg_gen_deposit_i64(result, result, tmp, j * 8, 8);
> +        }
> +        tcg_gen_or_i64(tmp, avr, mask);
> +        tcg_gen_clzi_i64(tmp, tmp, 64);
> +        tcg_gen_deposit_i64(result, result, tmp, 56, 8);
> +        if (i == 0) {
> +            /* Place result in high doubleword element of vD. */
> +            tcg_gen_mov_i64(result1, result);
> +        } else {
> +            /* Place result in low doubleword element of vD. */
> +            tcg_gen_mov_i64(result2, result);
> +        }
> +    }
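(Not part of the patch: a host-side C sketch of the trick the generated ops above implement, assuming GCC/Clang's __builtin_clzll. The OR with the low-56-bit mask keeps the argument nonzero, so the builtin's undefined-at-zero case cannot be hit and each per-byte count saturates at 8 without a branch.)

#include <stdint.h>

/*
 * Per-byte leading-zero count of one 64-bit doubleword, one byte per
 * result lane, mirroring the tcg_gen_shli/or/clzi sequence above.
 */
static uint64_t clzb_dword(uint64_t avr)
{
    const uint64_t mask = 0x00ffffffffffffffULL;  /* ones in bits 55:0 */
    uint64_t result = 0;
    int j;

    for (j = 0; j < 8; j++) {
        /*
         * Move byte element j into bits 63:56 and force everything
         * below it to ones, so the count can never exceed 8.
         */
        uint64_t tmp = (avr << ((7 - j) * 8)) | mask;
        result |= (uint64_t)__builtin_clzll(tmp) << (j * 8);
    }
    return result;
}

E.g. clzb_dword(0x0000000000000001ULL) == 0x0808080808080807: seven leading zeros in the low byte, eight for each all-zero byte.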
By my count, 60 non-move operations. This is too many to inline.

Moreover, unlike vpkpx, which I can see being used for graphics format conversion in old operating systems (who else uses 16-bit graphics formats now?), I would be very surprised to see vclzb or vclzh being used frequently. How did you determine that these instructions needed optimization?

I can see wanting to apply

--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1817,8 +1817,8 @@ VUPK(lsw, s64, s32, UPKLO)
     }                                                                   \
 }
 
-#define clzb(v) ((v) ? clz32((uint32_t)(v) << 24) : 8)
-#define clzh(v) ((v) ? clz32((uint32_t)(v) << 16) : 16)
+#define clzb(v) clz32(((uint32_t)(v) << 24) | 0x00ffffffu)
+#define clzh(v) clz32(((uint32_t)(v) << 16) | 0x0000ffffu)
 
 VGENERIC_DO(clzb, u8)
 VGENERIC_DO(clzh, u16)

as the cmov instruction required by the current implementation is going to be quite a bit slower than the OR instruction. And similarly for ctzb() and ctzh().

r~
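(A hypothetical sketch of the matching ctz change, not from this mail: OR in a sentinel bit just above the element so QEMU's ctz32() saturates at the element width, again replacing the conditional with a plain OR.)

#define ctzb(v) ctz32((uint32_t)(v) | 0x00000100u)  /* yields 8 for v == 0  */
#define ctzh(v) ctz32((uint32_t)(v) | 0x00010000u)  /* yields 16 for v == 0 */

Here ctzb(0) evaluates to ctz32(0x100) == 8 with no cmov, just as clzb(0) becomes clz32(0x00ffffff) == 8 in the diff above; for nonzero v the sentinel bit sits above every bit of the element and never affects the count.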