Re: [PATCH v3 14/14] RFC tcg/ppc: Disable TCG_REG_TB for Power9/Power10

2023-09-14 Thread Jordan Niethe
On Wed, Aug 16, 2023 at 5:57 AM Richard Henderson
 wrote:
>
> This may or may not improve performance.
> It appears to result in slightly larger code,
> but perhaps not enough to matter.

I have collected some power9 macro performance data for an smp compile workload:

Setup
-

- Power9 powernv host
- mttcg smp 8 guest

Method
--

- Warm up compile skiboot (https://github.com/open-power/skiboot)
- Average time taken for 5 trials compiling skiboot with -j `nproc`

Results
---


|Patch| Mean time (s) | stdev | Decrease (%) |
|-|---|---|--|
| tcg: Add tcg_out_tb_start...|161.77 |  2.39 |- |
| tcg/ppc: Enable direct branching... |145.81 |  1.71 |  9.9 |
| tcg/ppc: Use ADDPCIS... |146.44 |  1.28 |  9.5 |
| RFC tcg/ppc: Disable TCG_REG_TB...  |145.95 |  1.07 |  9.7 |


- Enabling direct branching is a performance gain, beyond that less conclusive.
- Using pcaddis for direct branching seems slightly better than bl +4
sequence for ISA v3.0.
- PC relative addressing seems slightly better than TOC relative addressing.

Any other suggestions for performance comparison?
I still have to try on a Power10.

>
> Signed-off-by: Richard Henderson 
> ---
>  tcg/ppc/tcg-target.c.inc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 20aaa90af2..c1e0efb498 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -83,7 +83,7 @@
>  #define TCG_VEC_TMP2TCG_REG_V1
>
>  #define TCG_REG_TB TCG_REG_R31
> -#define USE_REG_TB (TCG_TARGET_REG_BITS == 64)
> +#define USE_REG_TB (TCG_TARGET_REG_BITS == 64 && !have_isa_3_00)
>
>  /* Shorthand for size of a pointer.  Avoid promotion to unsigned.  */
>  #define SZP  ((int)sizeof(void *))
> --
> 2.34.1
>



Re: [PATCH v3 05/14] tcg/ppc: Use ADDPCIS in tcg_out_tb_start

2023-08-23 Thread Jordan Niethe
On Wed, 16 Aug 2023, 5:57 am Richard Henderson, <
richard.hender...@linaro.org> wrote:

> With ISA v3.0, we can use ADDPCIS instead of BCL+MFLR to load NIP.
>
> Signed-off-by: Richard Henderson 
> ---
>  tcg/ppc/tcg-target.c.inc | 25 ++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 19004fa568..36b4f61236 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -362,6 +362,7 @@ static bool tcg_target_const_match(int64_t val,
> TCGType type, int ct)
>  #define CRNAND XO19(225)
>  #define CROR   XO19(449)
>  #define CRNOR  XO19( 33)
> +#define ADDPCIS XO19( 2)
>
>  #define EXTSB  XO31(954)
>  #define EXTSH  XO31(922)
> @@ -854,6 +855,19 @@ static inline void tcg_out_sari64(TCGContext *s,
> TCGReg dst, TCGReg src, int c)
>  tcg_out32(s, SRADI | RA(dst) | RS(src) | SH(c & 0x1f) | ((c >> 4) &
> 2));
>  }
>
> +static void tcg_out_addpcis(TCGContext *s, TCGReg dst, intptr_t imm)
> +{
> +int d0, d1, d2;
> +
> +tcg_debug_assert((imm & 0x) == 0);
> +tcg_debug_assert(imm == (int32_t)imm);
> +
>

I think you need imm >>= 16 here.
Without that the next patch in the series crashes.

+d2 = imm & 1;
> +d1 = (imm >> 1) & 0x1f;
> +d0 = (imm >> 6) & 0x3ff;
> +tcg_out32(s, ADDPCIS | RT(dst) | (d1 << 16) | (d0 << 6) | d2);
> +}
> +
>  static void tcg_out_bswap16(TCGContext *s, TCGReg dst, TCGReg src, int
> flags)
>  {
>  TCGReg tmp = dst == src ? TCG_REG_R0 : dst;
> @@ -2489,9 +2503,14 @@ static void tcg_out_tb_start(TCGContext *s)
>  {
>  /* Load TCG_REG_TB. */
>  if (USE_REG_TB) {
> -/* bcl 20,31,$+4 (preferred form for getting nia) */
> -tcg_out32(s, BC | BO_ALWAYS | BI(7, CR_SO) | 0x4 | LK);
> -tcg_out32(s, MFSPR | RT(TCG_REG_TB) | LR);
> +if (have_isa_3_00) {
> +/* lnia REG_TB */
> +tcg_out_addpcis(s, TCG_REG_TB, 0);
> +} else {
> +/* bcl 20,31,$+4 (preferred form for getting nia) */
> +tcg_out32(s, BC | BO_ALWAYS | BI(7, CR_SO) | 0x4 | LK);
> +tcg_out32(s, MFSPR | RT(TCG_REG_TB) | LR);
> +}
>  }
>  }
>
> --
> 2.34.1
>
>


Re: [RFC PATCH] tcg/ppc: Enable direct branching tcg_out_goto_tb with TCG_REG_TB

2023-08-15 Thread Jordan Niethe




On 16/8/23 2:07 am, Richard Henderson wrote:

On 8/14/23 22:01, Jordan Niethe wrote:

Direct branch patching was disabled when using TCG_REG_TB in commit
736a1588c1 ("tcg/ppc: Fix race in goto_tb implementation"). Commit
7502f89c74 ("tcg/ppc: Use prefixed instructions for tcg_out_goto_tb")
used the support for pc relative ISAv3.1 instructions to re-enable
direct branch patching on POWER10.

The issue with direct branch patching with TCG_REG_TB is the lack of
synchronization between the new TCG_REG_TB being established and the
direct branch being patched in.

If each translation block is responsible for establishing its own
TCG_REG_TB then there can be no synchronization issue.


That's a good idea, and can be used for other things...

It also begs the question of whether power10 should continue to use 
TCG_REG_TB, loading the address with PADDI.  Or whether power9 should, 
like power10, disable USE_REG_TB and use ADDPCIS throughout.


I imagine it depends on usage frequency, whether use of TCG_REG_TB 
allows 1 insn, where addpcis requires 2 insns and prefixed insns require 
2 or 3 insn slots (depending on alignment).


Yes, I agree. Your v3 series looks good, I'll try and get some 
performance numbers with it so we can make a decision about which way to 
go on power10 and power9.






+    tcg_out32(s, MFSPR | RT(TCG_REG_TMP1) | LR);
+    /* bcl 20,31,$+4 (Preferred form for getting nia.) */
+    tcg_out32(s, BC | BO_ALWAYS | BI(7, CR_SO) | 0x4 | LK);
+    tcg_out32(s, MFSPR | RT(TCG_REG_TB) | LR);
+    tcg_out32(s, ADDI | TAI(TCG_REG_TB, TCG_REG_TB, -8));
+    tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | LR);


Don't need to save/restore LR.  It is saved in the prologue and may be 
clobbered within the tb itself (as we do for calls >
@@ -2678,6 +2693,12 @@ static void tcg_out_goto_tb(TCGContext *s, int 
which)

  tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | CTR);
  tcg_out32(s, BCCTR | BO_ALWAYS);
  set_jmp_reset_offset(s, which);
+
+    /* For the unlinked case, need to reset TCG_REG_TB.  */
+    if (USE_REG_TB) {
+    tcg_out_movi_int(s, TCG_TYPE_I64, TCG_REG_TB,
+ (tcg_target_long)s->code_buf, true);
+    }
  }


Actually, we don't.  The only time we arrive here is when an unlinked TB 
branches to itself.  TCG_REG_TB is still valid.


Ok, I was not sure how that was meant to work.




diff --git a/tcg/tcg.c b/tcg/tcg.c
index ddfe9a96cb..20698131c2 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -6010,6 +6010,9 @@ int tcg_gen_code(TCGContext *s, TranslationBlock 
*tb, uint64_t pc_start)

  tcg_malloc(sizeof(uint64_t) * s->gen_tb->icount * start_words);
  num_insns = -1;
+#ifdef TCG_TARGET_NEED_ENTER_TB
+    tcg_out_enter_tb(s);
+#endif


Better would be to not have the ifdef, and add this symbol as an empty 
function in all other tcg backends.


I might play around with this a bit.


Thank you for that, adding pcrel support on POWER9 too is really cool.




r~





[RFC PATCH] tcg/ppc: Enable direct branching tcg_out_goto_tb with TCG_REG_TB

2023-08-14 Thread Jordan Niethe
Direct branch patching was disabled when using TCG_REG_TB in commit
736a1588c1 ("tcg/ppc: Fix race in goto_tb implementation"). Commit
7502f89c74 ("tcg/ppc: Use prefixed instructions for tcg_out_goto_tb")
used the support for pc relative ISAv3.1 instructions to re-enable
direct branch patching on POWER10.

The issue with direct branch patching with TCG_REG_TB is the lack of
synchronization between the new TCG_REG_TB being established and the
direct branch being patched in.

If each translation block is responsible for establishing its own
TCG_REG_TB then there can be no synchronization issue.

Make each translation block begin by setting up its own TCG_REG_TB.
ISA v3.0 has addpcis so use that for getting the pc at the beginning of
a translation block (plus offset). For systems without addpcis, use
the preferred 'bcl 20,31,$+4' sequence.

When branching indirectly to a translation block the setup sequence can
be skipped if the caller sets up TCG_REG_TB as there is no possible race
in this case.

Signed-off-by: Jordan Niethe 
---
This is just a proof of concept, not sure that this is the correct way
to do this or even if it is something we'd like to do.

Applies on top of Richard's series [1].

  [1] 
https://lore.kernel.org/qemu-devel/20230808030250.50602-1-richard.hender...@linaro.org/
---
 include/tcg/tcg.h|  1 +
 tcg/ppc/tcg-target.c.inc | 59 ++--
 tcg/tcg.c|  3 ++
 3 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/include/tcg/tcg.h b/include/tcg/tcg.h
index 0875971719..337506fea0 100644
--- a/include/tcg/tcg.h
+++ b/include/tcg/tcg.h
@@ -518,6 +518,7 @@ struct TCGContext {
extension that allows arithmetic on void*.  */
 void *code_gen_buffer;
 size_t code_gen_buffer_size;
+size_t code_gen_entry_size;
 void *code_gen_ptr;
 void *data_gen_ptr;
 
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index b686a68247..4b55751051 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -382,6 +382,7 @@ static bool tcg_target_const_match(int64_t val, TCGType 
type, int ct)
 #define CRNAND XO19(225)
 #define CROR   XO19(449)
 #define CRNOR  XO19( 33)
+#define ADDPCIS  XO19( 2)
 
 #define EXTSB  XO31(954)
 #define EXTSH  XO31(922)
@@ -2635,6 +2636,30 @@ static void tcg_target_qemu_prologue(TCGContext *s)
 tcg_out32(s, BCLR | BO_ALWAYS);
 }
 
+
+#define TCG_TARGET_NEED_ENTER_TB
+static void tcg_out_enter_tb(TCGContext *s)
+{
+if (!USE_REG_TB) {
+return;
+}
+
+if (have_isa_3_00) {
+/* lnia REG_TB */
+tcg_out32(s, ADDPCIS | RT(TCG_REG_TB));
+tcg_out32(s, ADDI | TAI(TCG_REG_TB, TCG_REG_TB, -4));
+} else {
+tcg_out32(s, MFSPR | RT(TCG_REG_TMP1) | LR);
+/* bcl 20,31,$+4 (Preferred form for getting nia.) */
+tcg_out32(s, BC | BO_ALWAYS | BI(7, CR_SO) | 0x4 | LK);
+tcg_out32(s, MFSPR | RT(TCG_REG_TB) | LR);
+tcg_out32(s, ADDI | TAI(TCG_REG_TB, TCG_REG_TB, -8));
+tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | LR);
+}
+
+s->code_gen_entry_size = tcg_current_code_size(s);
+}
+
 static void tcg_out_exit_tb(TCGContext *s, uintptr_t arg)
 {
 tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R3, arg);
@@ -2645,23 +2670,6 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
 {
 uintptr_t ptr = get_jmp_target_addr(s, which);
 
-if (USE_REG_TB) {
-/*
- * With REG_TB, we must always use indirect branching,
- * so that the branch destination and TCG_REG_TB match.
- */
-ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
-tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);
-tcg_out32(s, MTSPR | RS(TCG_REG_TB) | CTR);
-tcg_out32(s, BCCTR | BO_ALWAYS);
-
-/* For the unlinked case, need to reset TCG_REG_TB.  */
-set_jmp_reset_offset(s, which);
-tcg_out_mem_long(s, ADDI, ADD, TCG_REG_TB, TCG_REG_TB,
- -tcg_current_code_size(s));
-return;
-}
-
 /* Direct branch will be patched by tb_target_set_jmp_target. */
 set_jmp_insn_offset(s, which);
 tcg_out32(s, NOP);
@@ -2670,6 +2678,13 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
 if (have_isa_3_10) {
 ptrdiff_t offset = tcg_pcrel_diff_for_prefix(s, (void *)ptr);
 tcg_out_8ls_d(s, PLD, TCG_REG_TMP1, 0, offset, 1);
+} else if (USE_REG_TB) {
+ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
+
+tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);
+/* Callee can skip establishing REG_TB for the indirect case. */
+tcg_out32(s, ADDI | TAI(TCG_REG_TMP1, TCG_REG_TB,
+s->code_gen_entry_size));
 } else {
 tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP1, ptr - (int16_t)ptr);
 tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1, (int16_t)ptr);
@@ -2678,6 +2693,12 @@ static void 

Re: [PATCH v2 7/7] tcg/ppc: Use prefixed instructions for tcg_out_goto_tb

2023-08-08 Thread Jordan Niethe
On Tue, Aug 8, 2023 at 1:02 PM Richard Henderson
 wrote:
>
> When a direct branch is out of range, we can load the destination for
> the indirect branch using PLA (for 16GB worth of buffer) and PLD from
> the TranslationBlock for everything larger.
>
> This means the patch affects exactly one instruction: B (plus filler),
> PLA or PLD.  Which means we can update and execute the patch atomically.

I think the commit message needs to be updated for Nick's changes.

>
> Signed-off-by: Richard Henderson 
> ---
>  tcg/ppc/tcg-target.c.inc | 31 +++
>  1 file changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 63fe4ef995..b686a68247 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -2646,31 +2646,38 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
>  uintptr_t ptr = get_jmp_target_addr(s, which);
>
>  if (USE_REG_TB) {
> +/*
> + * With REG_TB, we must always use indirect branching,
> + * so that the branch destination and TCG_REG_TB match.
> + */
>  ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
>  tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);
> -
> -/* TODO: Use direct branches when possible. */
> -set_jmp_insn_offset(s, which);
>  tcg_out32(s, MTSPR | RS(TCG_REG_TB) | CTR);
> -
>  tcg_out32(s, BCCTR | BO_ALWAYS);
>
>  /* For the unlinked case, need to reset TCG_REG_TB.  */
>  set_jmp_reset_offset(s, which);
>  tcg_out_mem_long(s, ADDI, ADD, TCG_REG_TB, TCG_REG_TB,
>   -tcg_current_code_size(s));
> -} else {
> -/* Direct branch will be patched by tb_target_set_jmp_target. */
> -set_jmp_insn_offset(s, which);
> -tcg_out32(s, NOP);
> +return;
> +}
>
> -/* When branch is out of range, fall through to indirect. */
> +/* Direct branch will be patched by tb_target_set_jmp_target. */
> +set_jmp_insn_offset(s, which);
> +tcg_out32(s, NOP);
> +
> +/* When branch is out of range, fall through to indirect. */
> +if (have_isa_3_10) {
> +ptrdiff_t offset = tcg_pcrel_diff_for_prefix(s, (void *)ptr);
> +tcg_out_8ls_d(s, PLD, TCG_REG_TMP1, 0, offset, 1);
> +} else {
>  tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP1, ptr - (int16_t)ptr);
>  tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1, 
> (int16_t)ptr);
> -tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | CTR);
> -tcg_out32(s, BCCTR | BO_ALWAYS);
> -set_jmp_reset_offset(s, which);
>  }
> +
> +tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | CTR);
> +tcg_out32(s, BCCTR | BO_ALWAYS);
> +set_jmp_reset_offset(s, which);
>  }
>
>  void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
> --
> 2.34.1
>

Thank you for implementing this Richard.

I was able to boot mttcg guests on P9 and P10 hosts.

Tested-by: Jordan Niethe 
Reviewed-by: Jordan Niethe 



Re: [PATCH 7/7] tcg/ppc: Use prefixed instructions for tcg_out_goto_tb

2023-08-06 Thread Jordan Niethe
On Sat, Aug 5, 2023 at 7:34 AM Richard Henderson
 wrote:
>
> When a direct branch is out of range, we can load the destination for
> the indirect branch using PLA (for 16GB worth of buffer) and PLD from
> the TranslationBlock for everything larger.
>
> This means the patch affects exactly one instruction: B (plus filler),
> PLA or PLD.  Which means we can update and execute the patch atomically.
>
> Signed-off-by: Richard Henderson 
> ---
>  tcg/ppc/tcg-target.c.inc | 76 ++--
>  1 file changed, 58 insertions(+), 18 deletions(-)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 5b243b2353..47c71bb5f2 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -2642,31 +2642,41 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
>  uintptr_t ptr = get_jmp_target_addr(s, which);
>
>  if (USE_REG_TB) {
> +/*
> + * With REG_TB, we must always use indirect branching,
> + * so that the branch destination and TCG_REG_TB match.
> + */
>  ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
>  tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);
> -
> -/* TODO: Use direct branches when possible. */
> -set_jmp_insn_offset(s, which);
>  tcg_out32(s, MTSPR | RS(TCG_REG_TB) | CTR);
> -
>  tcg_out32(s, BCCTR | BO_ALWAYS);
>
>  /* For the unlinked case, need to reset TCG_REG_TB.  */
>  set_jmp_reset_offset(s, which);
>  tcg_out_mem_long(s, ADDI, ADD, TCG_REG_TB, TCG_REG_TB,
>   -tcg_current_code_size(s));
> +return;
> +}
> +
> +if (have_isa_3_10) {
> +/* Align, so that we can patch 8 bytes atomically. */
> +if ((uintptr_t)s->code_ptr & 7) {
> +tcg_out32(s, NOP);
> +}
> +set_jmp_insn_offset(s, which);
> +/* Direct branch will be patched by tb_target_set_jmp_target. */
> +tcg_out_mls_d(s, ADDI, TCG_REG_TMP1, 0, 0, 1);
>  } else {
>  /* Direct branch will be patched by tb_target_set_jmp_target. */
> -set_jmp_insn_offset(s, which);

It looks like 32bit loses its set_jmp_insn_offset(), is that intended?

> -tcg_out32(s, NOP);
> -
> +tcg_out32(s, B);
>  /* When branch is out of range, fall through to indirect. */
>  tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP1, ptr - (int16_t)ptr);
>  tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1, 
> (int16_t)ptr);
> -tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | CTR);
> -tcg_out32(s, BCCTR | BO_ALWAYS);
> -set_jmp_reset_offset(s, which);
>  }
> +
> +tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | CTR);
> +tcg_out32(s, BCCTR | BO_ALWAYS);
> +set_jmp_reset_offset(s, which);
>  }
>
>  void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
> @@ -2674,20 +2684,50 @@ void tb_target_set_jmp_target(const TranslationBlock 
> *tb, int n,
>  {
>  uintptr_t addr = tb->jmp_target_addr[n];
>  intptr_t diff = addr - jmp_rx;
> -tcg_insn_unit insn;
>
>  if (USE_REG_TB) {
>  return;
>  }
>
> -if (in_range_b(diff)) {
> -insn = B | (diff & 0x3fc);
> -} else {
> -insn = NOP;
> -}
> +if (have_isa_3_10) {
> +tcg_insn_unit insn1, insn2;
> +uint64_t pair;
>
> -qatomic_set((uint32_t *)jmp_rw, insn);
> -flush_idcache_range(jmp_rx, jmp_rw, 4);
> +if (in_range_b(diff)) {
> +insn1 = B | (diff & 0x3fc);
> +insn2 = NOP;
> +} else if (diff == sextract64(diff, 0, 34)) {
> +/* PLA tmp1, diff */
> +insn1 = OPCD(1) | (2 << 24) | (1 << 20) | ((diff >> 16) & 
> 0x3);
> +insn2 = ADDI | TAI(TCG_REG_TMP1, 0, diff);
> +} else {
> +addr = (uintptr_t)>jmp_target_addr[n];
> +diff = addr - jmp_rx;
> +tcg_debug_assert(diff == sextract64(diff, 0, 34));
> +/* PLD tmp1, diff */
> +insn1 = OPCD(1) | (1 << 20) | ((diff >> 16) & 0x3);
> +insn2 = PLD | TAI(TCG_REG_TMP1, 0, diff);
> +}
> +
> +if (HOST_BIG_ENDIAN) {
> +pair = ((uint64_t)insn1) << 32 | insn2;
> +} else {
> +pair = ((uint64_t)insn2) << 32 | insn1;
> +}
> +
> +qatomic_set((uint64_t *)jmp_rw, pair);
> +flush_idcache_range(jmp_rx, jmp_rw, 8);
> +} else {
> +tcg_insn_unit insn;
> +
> +if (in_range_b(diff)) {
> +insn = B | (diff & 0x3fc);
> +} else {
> +insn = NOP;
> +}
> +qatomic_set((uint32_t *)jmp_rw, insn);
> +flush_idcache_range(jmp_rx, jmp_rw, 4);
> +}
>  }
>
>  static void tcg_out_op(TCGContext *s, TCGOpcode opc,
> --
> 2.34.1
>



Re: [PATCH 2/7] tcg/ppc: Use PADDI in tcg_out_movi

2023-08-06 Thread Jordan Niethe
On Sat, Aug 5, 2023 at 7:33 AM Richard Henderson
 wrote:
>
> PADDI can load 34-bit immediates and 34-bit pc-relative addresses.
>
> Signed-off-by: Richard Henderson 
> ---
>  tcg/ppc/tcg-target.c.inc | 47 
>  1 file changed, 47 insertions(+)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 642d0fd128..7fa2a2500b 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -707,6 +707,33 @@ static bool patch_reloc(tcg_insn_unit *code_ptr, int 
> type,
>  return true;
>  }
>
> +/* Ensure that the prefixed instruction does not cross a 64-byte boundary. */
> +static bool tcg_out_need_prefix_align(TCGContext *s)
> +{
> +return ((uintptr_t)s->code_ptr & 0x3f) == 0x3c;
> +}
> +
> +static void tcg_out_prefix_align(TCGContext *s)
> +{
> +if (tcg_out_need_prefix_align(s)) {
> +tcg_out32(s, NOP);
> +}
> +}
> +
> +/* Output Type 10 Prefix - Modified Load/Store Form (MLS:D) */
> +static void tcg_out_mls_d(TCGContext *s, tcg_insn_unit opc, unsigned rt,
> +  unsigned ra, tcg_target_long imm, bool r)
> +{
> +tcg_insn_unit p, i;
> +
> +p = OPCD(1) | (2 << 24) | (r << 20) | ((imm >> 16) & 0x3);
> +i = opc | TAI(rt, ra, imm);
> +
> +tcg_out_prefix_align(s);
> +tcg_out32(s, p);
> +tcg_out32(s, i);
> +}
> +
>  static void tcg_out_mem_long(TCGContext *s, int opi, int opx, TCGReg rt,
>   TCGReg base, tcg_target_long offset);
>
> @@ -992,6 +1019,26 @@ static void tcg_out_movi_int(TCGContext *s, TCGType 
> type, TCGReg ret,
>  return;
>  }
>
> +/*
> + * Load values up to 34 bits, and pc-relative addresses,
> + * with one prefixed insn.
> + */
> +if (have_isa_3_10) {
> +if (arg == sextract64(arg, 0, 34)) {
> +/* pli ret,value = paddi ret,0,value,0 */
> +tcg_out_mls_d(s, ADDI, ret, 0, arg, 0);
> +return;
> +}
> +
> +tmp = tcg_out_need_prefix_align(s) * 4;

tcg_out_need_prefix_align() returns a bool, optionally might prefer

tmp = tcg_out_need_prefix_align(s) ? 4 : 0;


> +tmp = tcg_pcrel_diff(s, (void *)arg) - tmp;
> +if (tmp == sextract64(tmp, 0, 34)) {
> +/* pla ret,value = paddi ret,0,value,1 */
> +tcg_out_mls_d(s, ADDI, ret, 0, tmp, 1);
> +return;
> +}
> +}
> +
>  /* Load 32-bit immediates with two insns.  Note that we've already
> eliminated bare ADDIS, so we know both insns are required.  */
>  if (TCG_TARGET_REG_BITS == 32 || arg == (int32_t)arg) {
> --
> 2.34.1
>

Reviewed-by: Jordan Niethe 



Re: [PATCH 3/7] tcg/ppc: Use prefixed instructions in tcg_out_mem_long

2023-08-06 Thread Jordan Niethe
On Sat, Aug 5, 2023 at 7:33 AM Richard Henderson
 wrote:
>
> When the offset is out of range of the non-prefixed insn, but
> fits the 34-bit immediate of the prefixed insn, use that.
>
> Signed-off-by: Richard Henderson 
> ---
>  tcg/ppc/tcg-target.c.inc | 66 
>  1 file changed, 66 insertions(+)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 7fa2a2500b..d41c499b7d 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -323,6 +323,15 @@ static bool tcg_target_const_match(int64_t val, TCGType 
> type, int ct)
>  #define STDX   XO31(149)
>  #define STQXO62(  2)
>
> +#define PLWA   OPCD( 41)
> +#define PLDOPCD( 57)
> +#define PLXSD  OPCD( 42)
> +#define PLXV   OPCD(25 * 2 + 1)  /* force tx=1 */
> +
> +#define PSTD   OPCD( 61)
> +#define PSTXSD OPCD( 46)
> +#define PSTXV  OPCD(27 * 2 + 1)  /* force tx=1 */

PSTXV calls it sx not tx


> +
>  #define ADDIC  OPCD( 12)
>  #define ADDI   OPCD( 14)
>  #define ADDIS  OPCD( 15)
> @@ -720,6 +729,20 @@ static void tcg_out_prefix_align(TCGContext *s)
>  }
>  }
>
> +/* Output Type 00 Prefix - 8-Byte Load/Store Form (8LS:D) */
> +static void tcg_out_8ls_d(TCGContext *s, tcg_insn_unit opc, unsigned rt,
> +  unsigned ra, tcg_target_long imm, bool r)
> +{
> +tcg_insn_unit p, i;
> +
> +p = OPCD(1) | (r << 20) | ((imm >> 16) & 0x3);
> +i = opc | TAI(rt, ra, imm);
> +
> +tcg_out_prefix_align(s);
> +tcg_out32(s, p);
> +tcg_out32(s, i);
> +}
> +
>  /* Output Type 10 Prefix - Modified Load/Store Form (MLS:D) */
>  static void tcg_out_mls_d(TCGContext *s, tcg_insn_unit opc, unsigned rt,
>unsigned ra, tcg_target_long imm, bool r)
> @@ -1364,6 +1387,49 @@ static void tcg_out_mem_long(TCGContext *s, int opi, 
> int opx, TCGReg rt,
>  break;
>  }
>
> +/* For unaligned or large offsets, use the prefixed form. */
> +if (have_isa_3_10
> +&& (offset != (int16_t)offset || (offset & align))
> +&& offset == sextract64(offset, 0, 34)) {
> +/*
> + * Note that the MLS:D insns retain their un-prefixed opcode,
> + * while the 8LS:D insns use a different opcode space.
> + */
> +switch (opi) {
> +case LBZ:
> +case LHZ:
> +case LHA:
> +case LWZ:
> +case STB:
> +case STH:
> +case STW:
> +case ADDI:
> +tcg_out_mls_d(s, opi, rt, base, offset, 0);
> +return;
> +case LWA:
> +tcg_out_8ls_d(s, PLWA, rt, base, offset, 0);
> +return;
> +case LD:
> +tcg_out_8ls_d(s, PLD, rt, base, offset, 0);
> +return;
> +case STD:
> +tcg_out_8ls_d(s, PSTD, rt, base, offset, 0);
> +return;
> +case LXSD:
> +tcg_out_8ls_d(s, PLXSD, rt & 31, base, offset, 0);
> +return;
> +case STXSD:
> +tcg_out_8ls_d(s, PSTXSD, rt & 31, base, offset, 0);
> +return;
> +case LXV:
> +tcg_out_8ls_d(s, PLXV, rt & 31, base, offset, 0);
> +return;
> +    case STXV:
> +tcg_out_8ls_d(s, PSTXV, rt & 31, base, offset, 0);
> +return;
> +}
> +}
> +
>  /* For unaligned, or very large offsets, use the indexed form.  */
>  if (offset & align || offset != (int32_t)offset || opi == 0) {
>  if (rs == base) {
> --
> 2.34.1
>

Reviewed-by: Jordan Niethe 



Re: [PATCH for-8.0] tcg/sparc64: Disable direct linking for goto_tb

2023-07-18 Thread Jordan Niethe




On 5/4/23 1:04 am, Richard Henderson wrote:

Something is wrong with this code, and also wrong with gdb on the
sparc systems to which I have access, so I cannot debug it either.
Disable for now, so the release is not broken.


I'm not sure if it is the entire problem but it looks like the broken 
code had the same race as on ppc [1] between loading TCG_REG_TB and 
patching and executing the direct branch.


[1] 
https://lore.kernel.org/qemu-devel/20230717093001.13167-1-jniet...@gmail.com/#t




Signed-off-by: Richard Henderson 
---
  tcg/sparc64/tcg-target.c.inc | 30 --
  1 file changed, 4 insertions(+), 26 deletions(-)

diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
index ccc4144f7c..694f2b9dd4 100644
--- a/tcg/sparc64/tcg-target.c.inc
+++ b/tcg/sparc64/tcg-target.c.inc
@@ -1445,12 +1445,12 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
  {
  ptrdiff_t off = tcg_tbrel_diff(s, (void *)get_jmp_target_addr(s, which));
  
-/* Direct branch will be patched by tb_target_set_jmp_target. */

+/* Load link and indirect branch. */
  set_jmp_insn_offset(s, which);
-tcg_out32(s, CALL);
-/* delay slot */
-tcg_debug_assert(check_fit_ptr(off, 13));
  tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TB, TCG_REG_TB, off);
+tcg_out_arithi(s, TCG_REG_G0, TCG_REG_TB, 0, JMPL);
+/* delay slot */
+tcg_out_nop(s);
  set_jmp_reset_offset(s, which);
  
  /*

@@ -1469,28 +1469,6 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
  void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
uintptr_t jmp_rx, uintptr_t jmp_rw)
  {
-uintptr_t addr = tb->jmp_target_addr[n];
-intptr_t br_disp = (intptr_t)(addr - jmp_rx) >> 2;
-tcg_insn_unit insn;
-
-br_disp >>= 2;
-if (check_fit_ptr(br_disp, 19)) {
-/* ba,pt %icc, addr */
-insn = deposit32(INSN_OP(0) | INSN_OP2(1) | INSN_COND(COND_A)
- | BPCC_ICC | BPCC_PT, 0, 19, br_disp);
-} else if (check_fit_ptr(br_disp, 22)) {
-/* ba addr */
-insn = deposit32(INSN_OP(0) | INSN_OP2(2) | INSN_COND(COND_A),
- 0, 22, br_disp);
-} else {
-/* The code_gen_buffer can't be larger than 2GB.  */
-tcg_debug_assert(check_fit_ptr(br_disp, 30));
-/* call addr */
-insn = deposit32(CALL, 0, 30, br_disp);
-}
-
-qatomic_set((uint32_t *)jmp_rw, insn);
-flush_idcache_range(jmp_rx, jmp_rw, 4);
  }
  
  static void tcg_out_op(TCGContext *s, TCGOpcode opc,






[PATCH] tcg/ppc: Fix race in goto_tb implementation

2023-07-17 Thread Jordan Niethe
Commit 20b6643324 ("tcg/ppc: Reorg goto_tb implementation") modified
goto_tb to ensure only a single instruction was patched to prevent
incorrect behaviour if a thread was in the middle of multiple
instructions when they were replaced. However this introduced a race
between loading the jmp target into TCG_REG_TB and patching and
executing the direct branch.

The relevent part of the goto_tb implementation:

ld TCG_REG_TB, TARGET_ADDR_LOCATION(TCG_REG_TB)
  patch_location:
mtctr TCG_REG_TB
bctr

tb_target_set_jmp_target() will replace 'patch_location' with a direct
branch if the target is in range. The direct branch now relies on
TCG_REG_TB being set up correctly by the ld. Prior to this commit
multiple instructions were patched in for the direct branch case; these
instructions would initalise TCG_REG_TB to the same value as the branch
target.

Imagine the following sequence:

1) Thread A is executing the goto_tb sequence and loads the jmp
   target into TCG_REG_TB.

2) Thread B updates the jmp target address and calls
   tb_target_set_jmp_target(). This patches a new direct branch into the
   goto_tb sequence.

3) Thread A executes the newly patched direct branch. The value in
   TCG_REG_TB still contains the old jmp target.

TCG_REG_TB MUST contain the translation block's tc.ptr. Execution will
eventually crash after performing memory accesses generated from a
faulty value in TCG_REG_TB.

This presents as segfaults or illegal instruction exceptions.

Do not revert commit 20b6643324 as it did fix a different race
condition. Instead remove the direct branch optimization and always use
indirect branches.

The direct branch optimization can be re-added later with a race free
sequence.

Gitlab issue: https://gitlab.com/qemu-project/qemu/-/issues/1726

Fixes: 20b6643324 ("tcg/ppc: Reorg goto_tb implementation")

Reported-by: Anushree Mathur 
Co-developed-by: Benjamin Gray 
Signed-off-by: Jordan Niethe 
---
 tcg/ppc/tcg-target.c.inc | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index 8d6899cf40..a7323f479b 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -2533,11 +2533,10 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
 ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
 tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);
 
-/* Direct branch will be patched by tb_target_set_jmp_target. */
+/* TODO: Use direct branches when possible. */
 set_jmp_insn_offset(s, which);
 tcg_out32(s, MTSPR | RS(TCG_REG_TB) | CTR);
 
-/* When branch is out of range, fall through to indirect. */
 tcg_out32(s, BCCTR | BO_ALWAYS);
 
 /* For the unlinked case, need to reset TCG_REG_TB.  */
@@ -2565,10 +2564,11 @@ void tb_target_set_jmp_target(const TranslationBlock 
*tb, int n,
 intptr_t diff = addr - jmp_rx;
 tcg_insn_unit insn;
 
+if (USE_REG_TB)
+return;
+
 if (in_range_b(diff)) {
 insn = B | (diff & 0x3fc);
-} else if (USE_REG_TB) {
-insn = MTSPR | RS(TCG_REG_TB) | CTR;
 } else {
 insn = NOP;
 }
-- 
2.39.3




Re: [PATCH] tcg/ppc: Fix race in goto_tb implementation

2023-07-17 Thread Jordan Niethe
On Mon, Jul 17, 2023 at 11:24 AM Jordan Niethe  wrote:
[snip]
>
> Reported-by: Anushree Mathur 
> Co-developed-by: Benjamin Gray 
> Signed-off-by: Jordan Niethe 

Sorry, should also have:
Signed-off-by: Benjamin Gray 

[snip]



[PATCH v2] tcg/ppc: Fix race in goto_tb implementation

2023-07-17 Thread Jordan Niethe
Commit 20b6643324 ("tcg/ppc: Reorg goto_tb implementation") modified
goto_tb to ensure only a single instruction was patched to prevent
incorrect behavior if a thread was in the middle of multiple
instructions when they were replaced. However this introduced a race
between loading the jmp target into TCG_REG_TB and patching and
executing the direct branch.

The relevant part of the goto_tb implementation:

ld TCG_REG_TB, TARGET_ADDR_LOCATION(TCG_REG_TB)
  patch_location:
mtctr TCG_REG_TB
bctr

tb_target_set_jmp_target() will replace 'patch_location' with a direct
branch if the target is in range. The direct branch now relies on
TCG_REG_TB being set up correctly by the ld. Prior to this commit
multiple instructions were patched in for the direct branch case; these
instructions would initialize TCG_REG_TB to the same value as the branch
target.

Imagine the following sequence:

1) Thread A is executing the goto_tb sequence and loads the jmp
   target into TCG_REG_TB.

2) Thread B updates the jmp target address and calls
   tb_target_set_jmp_target(). This patches a new direct branch into the
   goto_tb sequence.

3) Thread A executes the newly patched direct branch. The value in
   TCG_REG_TB still contains the old jmp target.

TCG_REG_TB MUST contain the translation block's tc.ptr. Execution will
eventually crash after performing memory accesses generated from a
faulty value in TCG_REG_TB.

This presents as segfaults or illegal instruction exceptions.

Do not revert commit 20b6643324 as it did fix a different race
condition. Instead remove the direct branch optimization and always use
indirect branches.

The direct branch optimization can be re-added later with a race free
sequence.

Fixes: 20b6643324 ("tcg/ppc: Reorg goto_tb implementation")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1726
Reported-by: Anushree Mathur 
Tested-by: Anushree Mathur 
Tested-by: Michael Tokarev 
Reviewed-by: Richard Henderson 
Co-developed-by: Benjamin Gray 
Signed-off-by: Jordan Niethe 
Signed-off-by: Benjamin Gray 

---
v2: - Use braces
- Collect tags
---
 tcg/ppc/tcg-target.c.inc | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index 8d6899cf40..329319ab8a 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -2533,11 +2533,10 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
 ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
 tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);
 
-/* Direct branch will be patched by tb_target_set_jmp_target. */
+/* TODO: Use direct branches when possible. */
 set_jmp_insn_offset(s, which);
 tcg_out32(s, MTSPR | RS(TCG_REG_TB) | CTR);
 
-/* When branch is out of range, fall through to indirect. */
 tcg_out32(s, BCCTR | BO_ALWAYS);
 
 /* For the unlinked case, need to reset TCG_REG_TB.  */
@@ -2565,10 +2564,12 @@ void tb_target_set_jmp_target(const TranslationBlock 
*tb, int n,
 intptr_t diff = addr - jmp_rx;
 tcg_insn_unit insn;
 
+if (USE_REG_TB) {
+return;
+}
+
 if (in_range_b(diff)) {
 insn = B | (diff & 0x3fc);
-} else if (USE_REG_TB) {
-insn = MTSPR | RS(TCG_REG_TB) | CTR;
 } else {
 insn = NOP;
 }
-- 
2.39.3