Re: [PATCH v4 24/47] target/ppc: move vrl[bhwd]nm/vrl[bhwd]mi to decodetree

2022-02-24 Thread Richard Henderson

On 2/24/22 10:23, Matheus K. Ferst wrote:

You don't want to use tcg_gen_rotlv_vec directly, but tcg_gen_rotlv_vec.



I guess there is a typo here. Did you mean tcg_gen_gvec_rotlv? Or 
tcg_gen_rotlv_mod_vec?


Dangit.  Paste-paste error.  The first: tcg_gen_gvec_rotlv.
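
For reference, a minimal sketch of the gvec-level entry point for a plain
vrl[bhwd]-style rotate. The trans_* wrapper name is hypothetical; REQUIRE_VECTOR
and avr_full_offset are the helpers already used in vmx-impl.c.inc, and
tcg_gen_gvec_rotlv applies the "count % element bits" masking itself:

/*
 * Sketch only, not part of this patch: let the gvec front end do the
 * modulo instead of masking the rotate count by hand.
 */
static bool trans_VRLW_sketch(DisasContext *ctx, arg_VX *a)
{
    REQUIRE_VECTOR(ctx);
    tcg_gen_gvec_rotlv(MO_32, avr_full_offset(a->vrt), avr_full_offset(a->vra),
                       avr_full_offset(a->vrb), 16, 16);
    return true;
}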


r~



Re: [PATCH v4 24/47] target/ppc: move vrl[bhwd]nm/vrl[bhwd]mi to decodetree

2022-02-24 Thread Matheus K. Ferst

On 23/02/2022 19:19, Richard Henderson wrote:

On 2/23/22 11:43, Matheus K. Ferst wrote:

Note that rotlv does the masking itself:

/*
  * Expand D = A << (B % element bits)
  *
  * Unlike scalar shifts, where it is easy for the target front end
  * to include the modulo as part of the expansion.  If the target
  * naturally includes the modulo as part of the operation, great!
  * If the target has some other behaviour from out-of-range shifts,
  * then it could not use this function anyway, and would need to
  * do it's own expansion with custom functions.
  */



Using tcg_gen_rotlv_vec(vece, vrt, vra, vrb) works on PPC but fails on 
x86. It looks like
a problem on the i386 backend. It's using VPS[RL]LV[DQ], but instead 
of this modulo
behavior, these instructions write zero to the element[1]. I'm not 
sure how to fix that.


You don't want to use tcg_gen_rotlv_vec directly, but tcg_gen_rotlv_vec.



I guess there is a typo here. Did you mean tcg_gen_gvec_rotlv? Or 
tcg_gen_rotlv_mod_vec?



The generic modulo is being applied here:

static void tcg_gen_rotlv_mod_vec(unsigned vece, TCGv_vec d,
                                  TCGv_vec a, TCGv_vec b)
{
    TCGv_vec t = tcg_temp_new_vec_matching(d);
    TCGv_vec m = tcg_constant_vec_matching(d, vece, (8 << vece) - 1);

    tcg_gen_and_vec(vece, t, b, m);
    tcg_gen_rotlv_vec(vece, d, a, t);
    tcg_temp_free_vec(t);
}


I can see that this method is called when we use tcg_gen_gvec_rotlv to 
implement vrl[bhwd], and they are working as expected. For vrl[wd]nm and 
vrl[wd]mi, however, we can't call tcg_gen_rotlv_mod_vec directly in the 
.fniv implementation because it is not exposed in tcg-op.h. Is there any 
other way to use this method? Should we add it to the header file?
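
For context, the masking can also stay in the .fniv with only what tcg-op.h
already exports. A rough sketch (hypothetical name; essentially the same
and_vec + rotlv_vec sequence as the v4 code quoted earlier in the thread):

static void gen_vrl_mod_sketch(unsigned vece, TCGv_vec vrt,
                               TCGv_vec vra, TCGv_vec vrb)
{
    TCGv_vec n = tcg_temp_new_vec_matching(vrt);

    /* n = count % element bits, so the backend never sees an out-of-range count */
    tcg_gen_dupi_vec(vece, n, (8 << vece) - 1);
    tcg_gen_and_vec(vece, n, vrb, n);
    tcg_gen_rotlv_vec(vece, vrt, vra, n);
    tcg_temp_free_vec(n);
}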


Thanks,
Matheus K. Ferst
Instituto de Pesquisas ELDORADO 
Software Analyst
Legal Notice - Disclaimer


Re: [PATCH v4 24/47] target/ppc: move vrl[bhwd]nm/vrl[bhwd]mi to decodetree

2022-02-23 Thread Richard Henderson

On 2/23/22 11:43, Matheus K. Ferst wrote:

Note that rotlv does the masking itself:

/*
  * Expand D = A << (B % element bits)
  *
  * Unlike scalar shifts, where it is easy for the target front end
  * to include the modulo as part of the expansion.  If the target
  * naturally includes the modulo as part of the operation, great!
  * If the target has some other behaviour from out-of-range shifts,
  * then it could not use this function anyway, and would need to
  * do it's own expansion with custom functions.
  */



Using tcg_gen_rotlv_vec(vece, vrt, vra, vrb) works on PPC but fails on x86. It looks like 
a problem on the i386 backend. It's using VPS[RL]LV[DQ], but instead of this modulo 
behavior, these instructions write zero to the element[1]. I'm not sure how to fix that. 


You don't want to use tcg_gen_rotlv_vec directly, but tcg_gen_rotlv_vec.

The generic modulo is being applied here:

static void tcg_gen_rotlv_mod_vec(unsigned vece, TCGv_vec d,
                                  TCGv_vec a, TCGv_vec b)
{
    TCGv_vec t = tcg_temp_new_vec_matching(d);
    TCGv_vec m = tcg_constant_vec_matching(d, vece, (8 << vece) - 1);

    tcg_gen_and_vec(vece, t, b, m);
    tcg_gen_rotlv_vec(vece, d, a, t);
    tcg_temp_free_vec(t);
}


r~



Re: [PATCH v4 24/47] target/ppc: move vrl[bhwd]nm/vrl[bhwd]mi to decodetree

2022-02-23 Thread Matheus K. Ferst

On 22/02/2022 19:30, Richard Henderson wrote:

On 2/22/22 04:36, matheus.fe...@eldorado.org.br wrote:

+static void gen_vrlnm_vec(unsigned vece, TCGv_vec vrt, TCGv_vec vra,
+                          TCGv_vec vrb)
+{
+    TCGv_vec mask, n = tcg_temp_new_vec_matching(vrt);
+
+    /* Create the mask */
+    mask = do_vrl_mask_vec(vece, vrb);
+
+    /* Extract n */
+    tcg_gen_dupi_vec(vece, n, (8 << vece) - 1);
+    tcg_gen_and_vec(vece, n, vrb, n);
+
+    /* Rotate and mask */
+    tcg_gen_rotlv_vec(vece, vrt, vra, n);


Note that rotlv does the masking itself:

/*
  * Expand D = A << (B % element bits)
  *
  * Unlike scalar shifts, where it is easy for the target front end
  * to include the modulo as part of the expansion.  If the target
  * naturally includes the modulo as part of the operation, great!
  * If the target has some other behaviour from out-of-range shifts,
  * then it could not use this function anyway, and would need to
  * do it's own expansion with custom functions.
  */



Using tcg_gen_rotlv_vec(vece, vrt, vra, vrb) works on PPC but fails on 
x86. It looks like a problem on the i386 backend. It's using 
VPS[RL]LV[DQ], but instead of this modulo behavior, these instructions 
write zero to the element[1]. I'm not sure how to fix that. Do we need 
an INDEX_op_shlv_vec case in i386 tcg_expand_vec_op?
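
As a standalone illustration of the count-handling difference (a sketch, not
QEMU code; the zeroing behaviour is per the Intel manual cited in [1] below):

#include <stdint.h>
#include <stdio.h>

/* Rotate-left with the "count % element bits" semantics that rotlv_vec documents. */
static uint32_t rol32_mod(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

int main(void)
{
    uint32_t x = 0x80000001;
    unsigned count = 33;

    printf("modulo semantics: 0x%08x\n", rol32_mod(x, count));          /* rotate by 1 */
    printf("VPSLLVD-style:    0x%08x\n", count > 31 ? 0u : x << count); /* lane zeroed */
    return 0;
}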



+static bool do_vrlnm(DisasContext *ctx, arg_VX *a, int vece)
+{
+    static const TCGOpcode vecop_list[] = {
+        INDEX_op_cmp_vec, INDEX_op_rotlv_vec, INDEX_op_sari_vec,
+        INDEX_op_shli_vec, INDEX_op_shri_vec, INDEX_op_shrv_vec, 0
+    };


Where is sari used?



I'll remove it in v5.

[1] Section 5.3 of 
https://www.intel.com/content/dam/develop/external/us/en/documents/36945


Thanks,
Matheus K. Ferst
Instituto de Pesquisas ELDORADO 
Software Analyst
Legal Notice - Disclaimer


Re: [PATCH v4 24/47] target/ppc: move vrl[bhwd]nm/vrl[bhwd]mi to decodetree

2022-02-22 Thread Richard Henderson

On 2/22/22 04:36, matheus.fe...@eldorado.org.br wrote:

+static void gen_vrlnm_vec(unsigned vece, TCGv_vec vrt, TCGv_vec vra,
+                          TCGv_vec vrb)
+{
+    TCGv_vec mask, n = tcg_temp_new_vec_matching(vrt);
+
+    /* Create the mask */
+    mask = do_vrl_mask_vec(vece, vrb);
+
+    /* Extract n */
+    tcg_gen_dupi_vec(vece, n, (8 << vece) - 1);
+    tcg_gen_and_vec(vece, n, vrb, n);
+
+    /* Rotate and mask */
+    tcg_gen_rotlv_vec(vece, vrt, vra, n);


Note that rotlv does the masking itself:

/*
 * Expand D = A << (B % element bits)
 *
 * Unlike scalar shifts, where it is easy for the target front end
 * to include the modulo as part of the expansion.  If the target
 * naturally includes the modulo as part of the operation, great!
 * If the target has some other behaviour from out-of-range shifts,
 * then it could not use this function anyway, and would need to
 * do it's own expansion with custom functions.
 */


+static bool do_vrlnm(DisasContext *ctx, arg_VX *a, int vece)
+{
+    static const TCGOpcode vecop_list[] = {
+        INDEX_op_cmp_vec, INDEX_op_rotlv_vec, INDEX_op_sari_vec,
+        INDEX_op_shli_vec, INDEX_op_shri_vec, INDEX_op_shrv_vec, 0
+    };


Where is sari used?



r~



[PATCH v4 24/47] target/ppc: move vrl[bhwd]nm/vrl[bhwd]mi to decodetree

2022-02-22 Thread matheus . ferst
From: Matheus Ferst 

Signed-off-by: Matheus Ferst 
---
v4:
 -  New in v4.
---
 target/ppc/helper.h                 |   8 +-
 target/ppc/insn32.decode            |   6 ++
 target/ppc/int_helper.c             |  50 -
 target/ppc/translate/vmx-impl.c.inc | 152 ++--
 target/ppc/translate/vmx-ops.c.inc  |   5 +-
 5 files changed, 182 insertions(+), 39 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 269150b197..a2a0d461dd 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -275,10 +275,10 @@ DEF_HELPER_4(vmaxfp, void, env, avr, avr, avr)
 DEF_HELPER_4(vminfp, void, env, avr, avr, avr)
 DEF_HELPER_3(vrefp, void, env, avr, avr)
 DEF_HELPER_3(vrsqrtefp, void, env, avr, avr)
-DEF_HELPER_3(vrlwmi, void, avr, avr, avr)
-DEF_HELPER_3(vrldmi, void, avr, avr, avr)
-DEF_HELPER_3(vrldnm, void, avr, avr, avr)
-DEF_HELPER_3(vrlwnm, void, avr, avr, avr)
+DEF_HELPER_4(VRLWMI, void, avr, avr, avr, i32)
+DEF_HELPER_4(VRLDMI, void, avr, avr, avr, i32)
+DEF_HELPER_4(VRLDNM, void, avr, avr, avr, i32)
+DEF_HELPER_4(VRLWNM, void, avr, avr, avr, i32)
 DEF_HELPER_5(vmaddfp, void, env, avr, avr, avr, avr)
 DEF_HELPER_5(vnmsubfp, void, env, avr, avr, avr, avr)
 DEF_HELPER_3(vexptefp, void, env, avr, avr)
diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index d918e2d0f2..e788dc5152 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -492,6 +492,12 @@ VRLH            000100 ..... ..... ..... 00001000100    @VX
 VRLW            000100 ..... ..... ..... 00010000100    @VX
 VRLD            000100 ..... ..... ..... 00011000100    @VX
 
+VRLWMI          000100 ..... ..... ..... 00010000101    @VX
+VRLDMI          000100 ..... ..... ..... 00011000101    @VX
+
+VRLWNM          000100 ..... ..... ..... 00110000101    @VX
+VRLDNM          000100 ..... ..... ..... 00111000101    @VX
+
 ## Vector Integer Arithmetic Instructions
 
 VEXTSB2W        000100 ..... 10000 ..... 11000000010    @VX_tb
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 0a094b535a..58e57b2563 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1291,33 +1291,33 @@ void helper_vrsqrtefp(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *b)
     }
 }
 
-#define VRLMI(name, size, element, insert)                            \
-void helper_##name(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)          \
-{                                                                      \
-    int i;                                                             \
-    for (i = 0; i < ARRAY_SIZE(r->element); i++) {                     \
-        uint##size##_t src1 = a->element[i];                           \
-        uint##size##_t src2 = b->element[i];                           \
-        uint##size##_t src3 = r->element[i];                           \
-        uint##size##_t begin, end, shift, mask, rot_val;               \
-                                                                       \
-        shift = extract##size(src2, 0, 6);                             \
-        end   = extract##size(src2, 8, 6);                             \
-        begin = extract##size(src2, 16, 6);                            \
-        rot_val = rol##size(src1, shift);                              \
-        mask = mask_u##size(begin, end);                               \
-        if (insert) {                                                  \
-            r->element[i] = (rot_val & mask) | (src3 & ~mask);         \
-        } else {                                                       \
-            r->element[i] = (rot_val & mask);                          \
-        }                                                              \
-    }                                                                  \
+#define VRLMI(name, size, element, insert)                                   \
+void helper_##name(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, uint32_t desc)  \
+{                                                                             \
+    int i;                                                                    \
+    for (i = 0; i < ARRAY_SIZE(r->element); i++) {                            \
+        uint##size##_t src1 = a->element[i];                                  \
+        uint##size##_t src2 = b->element[i];                                  \
+        uint##size##_t src3 = r->element[i];                                  \
+        uint##size##_t begin, end, shift, mask, rot_val;                      \
+                                                                              \
+        shift = extract##size(src2, 0, 6);                                    \
+        end   = extract##size(src2, 8, 6);                                    \
+        begin = extract##size(src2, 16, 6);                                   \
+        rot_val = rol##size(src1, shift);                                     \
+        mask = mask_u##size(begin, end);                                      \
+        if (insert) {