Re: [PATCH v3 14/14] RFC tcg/ppc: Disable TCG_REG_TB for Power9/Power10
On Wed, Aug 16, 2023 at 5:57 AM Richard Henderson wrote:
>
> This may or may not improve performance.  It appears to result in
> slightly larger code, but perhaps not enough to matter.

I have collected some Power9 macro-benchmark data for an smp compile
workload:

Setup
-----
- Power9 powernv host
- mttcg smp 8 guest

Method
------
- Warm up compile skiboot (https://github.com/open-power/skiboot)
- Average time taken for 5 trials compiling skiboot with -j `nproc`

Results
-------
| Patch                               | Mean time (s) | stdev | Decrease (%) |
|-------------------------------------|---------------|-------|--------------|
| tcg: Add tcg_out_tb_start...        | 161.77        | 2.39  | -            |
| tcg/ppc: Enable direct branching... | 145.81        | 1.71  | 9.9          |
| tcg/ppc: Use ADDPCIS...             | 146.44        | 1.28  | 9.5          |
| RFC tcg/ppc: Disable TCG_REG_TB...  | 145.95        | 1.07  | 9.7          |

- Enabling direct branching is a clear performance gain; beyond that the
  results are less conclusive.
- Using addpcis for direct branching seems slightly better than the
  bcl $+4 sequence for ISA v3.0.
- PC-relative addressing seems slightly better than TOC-relative
  addressing.

Any other suggestions for performance comparison? I still have to try on
a Power10.

> Signed-off-by: Richard Henderson
> ---
>  tcg/ppc/tcg-target.c.inc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 20aaa90af2..c1e0efb498 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -83,7 +83,7 @@
>  #define TCG_VEC_TMP2   TCG_REG_V1
>
>  #define TCG_REG_TB     TCG_REG_R31
> -#define USE_REG_TB     (TCG_TARGET_REG_BITS == 64)
> +#define USE_REG_TB     (TCG_TARGET_REG_BITS == 64 && !have_isa_3_00)
>
>  /* Shorthand for size of a pointer.  Avoid promotion to unsigned.  */
>  #define SZP  ((int)sizeof(void *))
> --
> 2.34.1
>
Re: [PATCH v3 05/14] tcg/ppc: Use ADDPCIS in tcg_out_tb_start
On Wed, 16 Aug 2023, 5:57 am Richard Henderson,
<richard.hender...@linaro.org> wrote:

> With ISA v3.0, we can use ADDPCIS instead of BCL+MFLR to load NIP.
>
> Signed-off-by: Richard Henderson
> ---
>  tcg/ppc/tcg-target.c.inc | 25 ++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 19004fa568..36b4f61236 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -362,6 +362,7 @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
>  #define CRNAND XO19(225)
>  #define CROR   XO19(449)
>  #define CRNOR  XO19( 33)
> +#define ADDPCIS XO19( 2)
>
>  #define EXTSB  XO31(954)
>  #define EXTSH  XO31(922)
> @@ -854,6 +855,19 @@ static inline void tcg_out_sari64(TCGContext *s, TCGReg dst, TCGReg src, int c)
>      tcg_out32(s, SRADI | RA(dst) | RS(src) | SH(c & 0x1f) | ((c >> 4) & 2));
>  }
>
> +static void tcg_out_addpcis(TCGContext *s, TCGReg dst, intptr_t imm)
> +{
> +    int d0, d1, d2;
> +
> +    tcg_debug_assert((imm & 0x) == 0);
> +    tcg_debug_assert(imm == (int32_t)imm);
> +

I think you need imm >>= 16 here. Without that the next patch in the
series crashes.

> +    d2 = imm & 1;
> +    d1 = (imm >> 1) & 0x1f;
> +    d0 = (imm >> 6) & 0x3ff;
> +    tcg_out32(s, ADDPCIS | RT(dst) | (d1 << 16) | (d0 << 6) | d2);
> +}
> +
>  static void tcg_out_bswap16(TCGContext *s, TCGReg dst, TCGReg src, int flags)
>  {
>      TCGReg tmp = dst == src ? TCG_REG_R0 : dst;
> @@ -2489,9 +2503,14 @@ static void tcg_out_tb_start(TCGContext *s)
>  {
>      /* Load TCG_REG_TB. */
>      if (USE_REG_TB) {
> -        /* bcl 20,31,$+4 (preferred form for getting nia) */
> -        tcg_out32(s, BC | BO_ALWAYS | BI(7, CR_SO) | 0x4 | LK);
> -        tcg_out32(s, MFSPR | RT(TCG_REG_TB) | LR);
> +        if (have_isa_3_00) {
> +            /* lnia REG_TB */
> +            tcg_out_addpcis(s, TCG_REG_TB, 0);
> +        } else {
> +            /* bcl 20,31,$+4 (preferred form for getting nia) */
> +            tcg_out32(s, BC | BO_ALWAYS | BI(7, CR_SO) | 0x4 | LK);
> +            tcg_out32(s, MFSPR | RT(TCG_REG_TB) | LR);
> +        }
>      }
>  }
>
> --
> 2.34.1
>
Re: [RFC PATCH] tcg/ppc: Enable direct branching tcg_out_goto_tb with TCG_REG_TB
On 16/8/23 2:07 am, Richard Henderson wrote:
> On 8/14/23 22:01, Jordan Niethe wrote:
>> Direct branch patching was disabled when using TCG_REG_TB in commit
>> 736a1588c1 ("tcg/ppc: Fix race in goto_tb implementation").
>>
>> Commit 7502f89c74 ("tcg/ppc: Use prefixed instructions for
>> tcg_out_goto_tb") used the support for pc relative ISAv3.1
>> instructions to re-enable direct branch patching on POWER10.
>>
>> The issue with direct branch patching with TCG_REG_TB is the lack of
>> synchronization between the new TCG_REG_TB being established and the
>> direct branch being patched in.
>>
>> If each translation block is responsible for establishing its own
>> TCG_REG_TB then there can be no synchronization issue.
>
> That's a good idea, and can be used for other things...
>
> It also begs the question of whether power10 should continue to use
> TCG_REG_TB, loading the address with PADDI. Or whether power9 should,
> like power10, disable USE_REG_TB and use ADDPCIS throughout.
>
> I imagine it depends on usage frequency, whether use of TCG_REG_TB
> allows 1 insn, where addpcis requires 2 insns and prefixed insns
> require 2 or 3 insn slots (depending on alignment).

Yes, I agree. Your v3 series looks good, I'll try and get some
performance numbers with it so we can make a decision about which way to
go on power10 and power9.

>> +    tcg_out32(s, MFSPR | RT(TCG_REG_TMP1) | LR);
>> +    /* bcl 20,31,$+4 (Preferred form for getting nia.) */
>> +    tcg_out32(s, BC | BO_ALWAYS | BI(7, CR_SO) | 0x4 | LK);
>> +    tcg_out32(s, MFSPR | RT(TCG_REG_TB) | LR);
>> +    tcg_out32(s, ADDI | TAI(TCG_REG_TB, TCG_REG_TB, -8));
>> +    tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | LR);
>
> Don't need to save/restore LR. It is saved in the prologue and may be
> clobbered within the tb itself (as we do for calls

>> @@ -2678,6 +2693,12 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
>>           tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | CTR);
>>           tcg_out32(s, BCCTR | BO_ALWAYS);
>>           set_jmp_reset_offset(s, which);
>> +
>> +    /* For the unlinked case, need to reset TCG_REG_TB. */
>> +    if (USE_REG_TB) {
>> +        tcg_out_movi_int(s, TCG_TYPE_I64, TCG_REG_TB,
>> +                         (tcg_target_long)s->code_buf, true);
>> +    }
>>   }
>
> Actually, we don't. The only time we arrive here is when an unlinked TB
> branches to itself. TCG_REG_TB is still valid.

Ok, I was not sure how that was meant to work.

>> diff --git a/tcg/tcg.c b/tcg/tcg.c
>> index ddfe9a96cb..20698131c2 100644
>> --- a/tcg/tcg.c
>> +++ b/tcg/tcg.c
>> @@ -6010,6 +6010,9 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb, uint64_t pc_start)
>>       tcg_malloc(sizeof(uint64_t) * s->gen_tb->icount * start_words);
>>       num_insns = -1;
>> +#ifdef TCG_TARGET_NEED_ENTER_TB
>> +    tcg_out_enter_tb(s);
>> +#endif
>
> Better would be to not have the ifdef, and add this symbol as an empty
> function in all other tcg backends.

I might play around with this a bit.

Thank you for that, adding pcrel support on POWER9 too is really cool.

> r~
[RFC PATCH] tcg/ppc: Enable direct branching tcg_out_goto_tb with TCG_REG_TB
Direct branch patching was disabled when using TCG_REG_TB in commit
736a1588c1 ("tcg/ppc: Fix race in goto_tb implementation").

Commit 7502f89c74 ("tcg/ppc: Use prefixed instructions for
tcg_out_goto_tb") used the support for pc relative ISAv3.1 instructions
to re-enable direct branch patching on POWER10.

The issue with direct branch patching with TCG_REG_TB is the lack of
synchronization between the new TCG_REG_TB being established and the
direct branch being patched in.

If each translation block is responsible for establishing its own
TCG_REG_TB then there can be no synchronization issue.

Make each translation block begin by setting up its own TCG_REG_TB.
ISA v3.0 has addpcis so use that for getting the pc at the beginning of
a translation block (plus offset). For systems without addpcis, use the
preferred 'bcl 20,31,$+4' sequence.

When branching indirectly to a translation block the setup sequence can
be skipped if the caller sets up TCG_REG_TB as there is no possible race
in this case.

Signed-off-by: Jordan Niethe
---
This is just a proof of concept, not sure that this is the correct way
to do this or even if it is something we'd like to do.

Applies on top of Richard's series [1].

[1] https://lore.kernel.org/qemu-devel/20230808030250.50602-1-richard.hender...@linaro.org/
---
 include/tcg/tcg.h        |  1 +
 tcg/ppc/tcg-target.c.inc | 59 ++--
 tcg/tcg.c                |  3 ++
 3 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/include/tcg/tcg.h b/include/tcg/tcg.h
index 0875971719..337506fea0 100644
--- a/include/tcg/tcg.h
+++ b/include/tcg/tcg.h
@@ -518,6 +518,7 @@ struct TCGContext {
        extension that allows arithmetic on void*.  */
     void *code_gen_buffer;
     size_t code_gen_buffer_size;
+    size_t code_gen_entry_size;
     void *code_gen_ptr;
     void *data_gen_ptr;

diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index b686a68247..4b55751051 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -382,6 +382,7 @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
 #define CRNAND XO19(225)
 #define CROR   XO19(449)
 #define CRNOR  XO19( 33)
+#define ADDPCIS XO19( 2)

 #define EXTSB  XO31(954)
 #define EXTSH  XO31(922)
@@ -2635,6 +2636,30 @@ static void tcg_target_qemu_prologue(TCGContext *s)
     tcg_out32(s, BCLR | BO_ALWAYS);
 }

+
+#define TCG_TARGET_NEED_ENTER_TB
+static void tcg_out_enter_tb(TCGContext *s)
+{
+    if (!USE_REG_TB) {
+        return;
+    }
+
+    if (have_isa_3_00) {
+        /* lnia REG_TB */
+        tcg_out32(s, ADDPCIS | RT(TCG_REG_TB));
+        tcg_out32(s, ADDI | TAI(TCG_REG_TB, TCG_REG_TB, -4));
+    } else {
+        tcg_out32(s, MFSPR | RT(TCG_REG_TMP1) | LR);
+        /* bcl 20,31,$+4 (Preferred form for getting nia.) */
+        tcg_out32(s, BC | BO_ALWAYS | BI(7, CR_SO) | 0x4 | LK);
+        tcg_out32(s, MFSPR | RT(TCG_REG_TB) | LR);
+        tcg_out32(s, ADDI | TAI(TCG_REG_TB, TCG_REG_TB, -8));
+        tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | LR);
+    }
+
+    s->code_gen_entry_size = tcg_current_code_size(s);
+}
+
 static void tcg_out_exit_tb(TCGContext *s, uintptr_t arg)
 {
     tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R3, arg);
@@ -2645,23 +2670,6 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
 {
     uintptr_t ptr = get_jmp_target_addr(s, which);

-    if (USE_REG_TB) {
-        /*
-         * With REG_TB, we must always use indirect branching,
-         * so that the branch destination and TCG_REG_TB match.
-         */
-        ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
-        tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);
-        tcg_out32(s, MTSPR | RS(TCG_REG_TB) | CTR);
-        tcg_out32(s, BCCTR | BO_ALWAYS);
-
-        /* For the unlinked case, need to reset TCG_REG_TB. */
-        set_jmp_reset_offset(s, which);
-        tcg_out_mem_long(s, ADDI, ADD, TCG_REG_TB, TCG_REG_TB,
-                         -tcg_current_code_size(s));
-        return;
-    }
-
     /* Direct branch will be patched by tb_target_set_jmp_target. */
     set_jmp_insn_offset(s, which);
     tcg_out32(s, NOP);
@@ -2670,6 +2678,13 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
     if (have_isa_3_10) {
         ptrdiff_t offset = tcg_pcrel_diff_for_prefix(s, (void *)ptr);
         tcg_out_8ls_d(s, PLD, TCG_REG_TMP1, 0, offset, 1);
+    } else if (USE_REG_TB) {
+        ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
+
+        tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);
+        /* Callee can skip establishing REG_TB for the indirect case. */
+        tcg_out32(s, ADDI | TAI(TCG_REG_TMP1, TCG_REG_TB,
+                                s->code_gen_entry_size));
     } else {
         tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP1, ptr - (int16_t)ptr);
         tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1, (int16_t)ptr);
@@ -2678,6 +2693,12 @@ static void
Re: [PATCH v2 7/7] tcg/ppc: Use prefixed instructions for tcg_out_goto_tb
On Tue, Aug 8, 2023 at 1:02 PM Richard Henderson wrote:
>
> When a direct branch is out of range, we can load the destination for
> the indirect branch using PLA (for 16GB worth of buffer) and PLD from
> the TranslationBlock for everything larger.
>
> This means the patch affects exactly one instruction: B (plus filler),
> PLA or PLD. Which means we can update and execute the patch atomically.

I think the commit message needs to be updated for Nick's changes.

> Signed-off-by: Richard Henderson
> ---
>  tcg/ppc/tcg-target.c.inc | 31 +++
>  1 file changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 63fe4ef995..b686a68247 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -2646,31 +2646,38 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
>      uintptr_t ptr = get_jmp_target_addr(s, which);
>
>      if (USE_REG_TB) {
> +        /*
> +         * With REG_TB, we must always use indirect branching,
> +         * so that the branch destination and TCG_REG_TB match.
> +         */
>          ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
>          tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);
> -
> -        /* TODO: Use direct branches when possible. */
> -        set_jmp_insn_offset(s, which);
>          tcg_out32(s, MTSPR | RS(TCG_REG_TB) | CTR);
> -
>          tcg_out32(s, BCCTR | BO_ALWAYS);
>
>          /* For the unlinked case, need to reset TCG_REG_TB. */
>          set_jmp_reset_offset(s, which);
>          tcg_out_mem_long(s, ADDI, ADD, TCG_REG_TB, TCG_REG_TB,
>                           -tcg_current_code_size(s));
> -    } else {
> -        /* Direct branch will be patched by tb_target_set_jmp_target. */
> -        set_jmp_insn_offset(s, which);
> -        tcg_out32(s, NOP);
> +        return;
> +    }
>
> -        /* When branch is out of range, fall through to indirect. */
> +    /* Direct branch will be patched by tb_target_set_jmp_target. */
> +    set_jmp_insn_offset(s, which);
> +    tcg_out32(s, NOP);
> +
> +    /* When branch is out of range, fall through to indirect. */
> +    if (have_isa_3_10) {
> +        ptrdiff_t offset = tcg_pcrel_diff_for_prefix(s, (void *)ptr);
> +        tcg_out_8ls_d(s, PLD, TCG_REG_TMP1, 0, offset, 1);
> +    } else {
>          tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP1, ptr - (int16_t)ptr);
>          tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1, (int16_t)ptr);
> -        tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | CTR);
> -        tcg_out32(s, BCCTR | BO_ALWAYS);
> -        set_jmp_reset_offset(s, which);
>      }
> +
> +    tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | CTR);
> +    tcg_out32(s, BCCTR | BO_ALWAYS);
> +    set_jmp_reset_offset(s, which);
>  }
>
>  void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
> --
> 2.34.1
>

Thank you for implementing this Richard. I was able to boot mttcg guests
on P9 and P10 hosts.

Tested-by: Jordan Niethe
Reviewed-by: Jordan Niethe
Re: [PATCH 7/7] tcg/ppc: Use prefixed instructions for tcg_out_goto_tb
On Sat, Aug 5, 2023 at 7:34 AM Richard Henderson wrote:
>
> When a direct branch is out of range, we can load the destination for
> the indirect branch using PLA (for 16GB worth of buffer) and PLD from
> the TranslationBlock for everything larger.
>
> This means the patch affects exactly one instruction: B (plus filler),
> PLA or PLD. Which means we can update and execute the patch atomically.
>
> Signed-off-by: Richard Henderson
> ---
>  tcg/ppc/tcg-target.c.inc | 76 ++--
>  1 file changed, 58 insertions(+), 18 deletions(-)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 5b243b2353..47c71bb5f2 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -2642,31 +2642,41 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
>      uintptr_t ptr = get_jmp_target_addr(s, which);
>
>      if (USE_REG_TB) {
> +        /*
> +         * With REG_TB, we must always use indirect branching,
> +         * so that the branch destination and TCG_REG_TB match.
> +         */
>          ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
>          tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);
> -
> -        /* TODO: Use direct branches when possible. */
> -        set_jmp_insn_offset(s, which);
>          tcg_out32(s, MTSPR | RS(TCG_REG_TB) | CTR);
> -
>          tcg_out32(s, BCCTR | BO_ALWAYS);
>
>          /* For the unlinked case, need to reset TCG_REG_TB. */
>          set_jmp_reset_offset(s, which);
>          tcg_out_mem_long(s, ADDI, ADD, TCG_REG_TB, TCG_REG_TB,
>                           -tcg_current_code_size(s));
> +        return;
> +    }
> +
> +    if (have_isa_3_10) {
> +        /* Align, so that we can patch 8 bytes atomically. */
> +        if ((uintptr_t)s->code_ptr & 7) {
> +            tcg_out32(s, NOP);
> +        }
> +        set_jmp_insn_offset(s, which);
> +        /* Direct branch will be patched by tb_target_set_jmp_target. */
> +        tcg_out_mls_d(s, ADDI, TCG_REG_TMP1, 0, 0, 1);
>      } else {
>          /* Direct branch will be patched by tb_target_set_jmp_target. */
> -        set_jmp_insn_offset(s, which);

It looks like 32bit loses its set_jmp_insn_offset(), is that intended?

> -        tcg_out32(s, NOP);
> -
> +        tcg_out32(s, B);
>          /* When branch is out of range, fall through to indirect. */
>          tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP1, ptr - (int16_t)ptr);
>          tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_REG_TMP1, (int16_t)ptr);
> -        tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | CTR);
> -        tcg_out32(s, BCCTR | BO_ALWAYS);
> -        set_jmp_reset_offset(s, which);
>      }
> +
> +    tcg_out32(s, MTSPR | RS(TCG_REG_TMP1) | CTR);
> +    tcg_out32(s, BCCTR | BO_ALWAYS);
> +    set_jmp_reset_offset(s, which);
>  }
>
>  void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
> @@ -2674,20 +2684,50 @@ void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
>  {
>      uintptr_t addr = tb->jmp_target_addr[n];
>      intptr_t diff = addr - jmp_rx;
> -    tcg_insn_unit insn;
>
>      if (USE_REG_TB) {
>          return;
>      }
>
> -    if (in_range_b(diff)) {
> -        insn = B | (diff & 0x3fc);
> -    } else {
> -        insn = NOP;
> -    }
> +    if (have_isa_3_10) {
> +        tcg_insn_unit insn1, insn2;
> +        uint64_t pair;
>
> -    qatomic_set((uint32_t *)jmp_rw, insn);
> -    flush_idcache_range(jmp_rx, jmp_rw, 4);
> +        if (in_range_b(diff)) {
> +            insn1 = B | (diff & 0x3fc);
> +            insn2 = NOP;
> +        } else if (diff == sextract64(diff, 0, 34)) {
> +            /* PLA tmp1, diff */
> +            insn1 = OPCD(1) | (2 << 24) | (1 << 20) | ((diff >> 16) & 0x3);
> +            insn2 = ADDI | TAI(TCG_REG_TMP1, 0, diff);
> +        } else {
> +            addr = (uintptr_t)&tb->jmp_target_addr[n];
> +            diff = addr - jmp_rx;
> +            tcg_debug_assert(diff == sextract64(diff, 0, 34));
> +            /* PLD tmp1, diff */
> +            insn1 = OPCD(1) | (1 << 20) | ((diff >> 16) & 0x3);
> +            insn2 = PLD | TAI(TCG_REG_TMP1, 0, diff);
> +        }
> +
> +        if (HOST_BIG_ENDIAN) {
> +            pair = ((uint64_t)insn1) << 32 | insn2;
> +        } else {
> +            pair = ((uint64_t)insn2) << 32 | insn1;
> +        }
> +
> +        qatomic_set((uint64_t *)jmp_rw, pair);
> +        flush_idcache_range(jmp_rx, jmp_rw, 8);
> +    } else {
> +        tcg_insn_unit insn;
> +
> +        if (in_range_b(diff)) {
> +            insn = B | (diff & 0x3fc);
> +        } else {
> +            insn = NOP;
> +        }
> +        qatomic_set((uint32_t *)jmp_rw, insn);
> +        flush_idcache_range(jmp_rx, jmp_rw, 4);
> +    }
>  }
>
>  static void tcg_out_op(TCGContext *s, TCGOpcode opc,
> --
> 2.34.1
>
Re: [PATCH 2/7] tcg/ppc: Use PADDI in tcg_out_movi
On Sat, Aug 5, 2023 at 7:33 AM Richard Henderson wrote:
>
> PADDI can load 34-bit immediates and 34-bit pc-relative addresses.
>
> Signed-off-by: Richard Henderson
> ---
>  tcg/ppc/tcg-target.c.inc | 47 
>  1 file changed, 47 insertions(+)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 642d0fd128..7fa2a2500b 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -707,6 +707,33 @@ static bool patch_reloc(tcg_insn_unit *code_ptr, int type,
>      return true;
>  }
>
> +/* Ensure that the prefixed instruction does not cross a 64-byte boundary. */
> +static bool tcg_out_need_prefix_align(TCGContext *s)
> +{
> +    return ((uintptr_t)s->code_ptr & 0x3f) == 0x3c;
> +}
> +
> +static void tcg_out_prefix_align(TCGContext *s)
> +{
> +    if (tcg_out_need_prefix_align(s)) {
> +        tcg_out32(s, NOP);
> +    }
> +}
> +
> +/* Output Type 10 Prefix - Modified Load/Store Form (MLS:D) */
> +static void tcg_out_mls_d(TCGContext *s, tcg_insn_unit opc, unsigned rt,
> +                          unsigned ra, tcg_target_long imm, bool r)
> +{
> +    tcg_insn_unit p, i;
> +
> +    p = OPCD(1) | (2 << 24) | (r << 20) | ((imm >> 16) & 0x3);
> +    i = opc | TAI(rt, ra, imm);
> +
> +    tcg_out_prefix_align(s);
> +    tcg_out32(s, p);
> +    tcg_out32(s, i);
> +}
> +
>  static void tcg_out_mem_long(TCGContext *s, int opi, int opx, TCGReg rt,
>                               TCGReg base, tcg_target_long offset);
>
> @@ -992,6 +1019,26 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
>          return;
>      }
>
> +    /*
> +     * Load values up to 34 bits, and pc-relative addresses,
> +     * with one prefixed insn.
> +     */
> +    if (have_isa_3_10) {
> +        if (arg == sextract64(arg, 0, 34)) {
> +            /* pli ret,value = paddi ret,0,value,0 */
> +            tcg_out_mls_d(s, ADDI, ret, 0, arg, 0);
> +            return;
> +        }
> +
> +        tmp = tcg_out_need_prefix_align(s) * 4;

tcg_out_need_prefix_align() returns a bool, optionally might prefer

    tmp = tcg_out_need_prefix_align(s) ? 4 : 0;

> +        tmp = tcg_pcrel_diff(s, (void *)arg) - tmp;
> +        if (tmp == sextract64(tmp, 0, 34)) {
> +            /* pla ret,value = paddi ret,0,value,1 */
> +            tcg_out_mls_d(s, ADDI, ret, 0, tmp, 1);
> +            return;
> +        }
> +    }
> +
>      /* Load 32-bit immediates with two insns.  Note that we've already
>         eliminated bare ADDIS, so we know both insns are required.  */
>      if (TCG_TARGET_REG_BITS == 32 || arg == (int32_t)arg) {
> --
> 2.34.1
>

Reviewed-by: Jordan Niethe
Re: [PATCH 3/7] tcg/ppc: Use prefixed instructions in tcg_out_mem_long
On Sat, Aug 5, 2023 at 7:33 AM Richard Henderson wrote:
>
> When the offset is out of range of the non-prefixed insn, but
> fits the 34-bit immediate of the prefixed insn, use that.
>
> Signed-off-by: Richard Henderson
> ---
>  tcg/ppc/tcg-target.c.inc | 66 
>  1 file changed, 66 insertions(+)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 7fa2a2500b..d41c499b7d 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -323,6 +323,15 @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
>  #define STDX   XO31(149)
>  #define STQ    XO62(  2)
>
> +#define PLWA   OPCD( 41)
> +#define PLD    OPCD( 57)
> +#define PLXSD  OPCD( 42)
> +#define PLXV   OPCD(25 * 2 + 1)  /* force tx=1 */
> +
> +#define PSTD   OPCD( 61)
> +#define PSTXSD OPCD( 46)
> +#define PSTXV  OPCD(27 * 2 + 1)  /* force tx=1 */

PSTXV calls it sx not tx

> +
>  #define ADDIC  OPCD( 12)
>  #define ADDI   OPCD( 14)
>  #define ADDIS  OPCD( 15)
> @@ -720,6 +729,20 @@ static void tcg_out_prefix_align(TCGContext *s)
>      }
>  }
>
> +/* Output Type 00 Prefix - 8-Byte Load/Store Form (8LS:D) */
> +static void tcg_out_8ls_d(TCGContext *s, tcg_insn_unit opc, unsigned rt,
> +                          unsigned ra, tcg_target_long imm, bool r)
> +{
> +    tcg_insn_unit p, i;
> +
> +    p = OPCD(1) | (r << 20) | ((imm >> 16) & 0x3);
> +    i = opc | TAI(rt, ra, imm);
> +
> +    tcg_out_prefix_align(s);
> +    tcg_out32(s, p);
> +    tcg_out32(s, i);
> +}
> +
>  /* Output Type 10 Prefix - Modified Load/Store Form (MLS:D) */
>  static void tcg_out_mls_d(TCGContext *s, tcg_insn_unit opc, unsigned rt,
>                            unsigned ra, tcg_target_long imm, bool r)
> @@ -1364,6 +1387,49 @@ static void tcg_out_mem_long(TCGContext *s, int opi, int opx, TCGReg rt,
>          break;
>      }
>
> +    /* For unaligned or large offsets, use the prefixed form. */
> +    if (have_isa_3_10
> +        && (offset != (int16_t)offset || (offset & align))
> +        && offset == sextract64(offset, 0, 34)) {
> +        /*
> +         * Note that the MLS:D insns retain their un-prefixed opcode,
> +         * while the 8LS:D insns use a different opcode space.
> +         */
> +        switch (opi) {
> +        case LBZ:
> +        case LHZ:
> +        case LHA:
> +        case LWZ:
> +        case STB:
> +        case STH:
> +        case STW:
> +        case ADDI:
> +            tcg_out_mls_d(s, opi, rt, base, offset, 0);
> +            return;
> +        case LWA:
> +            tcg_out_8ls_d(s, PLWA, rt, base, offset, 0);
> +            return;
> +        case LD:
> +            tcg_out_8ls_d(s, PLD, rt, base, offset, 0);
> +            return;
> +        case STD:
> +            tcg_out_8ls_d(s, PSTD, rt, base, offset, 0);
> +            return;
> +        case LXSD:
> +            tcg_out_8ls_d(s, PLXSD, rt & 31, base, offset, 0);
> +            return;
> +        case STXSD:
> +            tcg_out_8ls_d(s, PSTXSD, rt & 31, base, offset, 0);
> +            return;
> +        case LXV:
> +            tcg_out_8ls_d(s, PLXV, rt & 31, base, offset, 0);
> +            return;
> +        case STXV:
> +            tcg_out_8ls_d(s, PSTXV, rt & 31, base, offset, 0);
> +            return;
> +        }
> +    }
> +
>      /* For unaligned, or very large offsets, use the indexed form. */
>      if (offset & align || offset != (int32_t)offset || opi == 0) {
>          if (rs == base) {
> --
> 2.34.1
>

Reviewed-by: Jordan Niethe
Re: [PATCH for-8.0] tcg/sparc64: Disable direct linking for goto_tb
On 5/4/23 1:04 am, Richard Henderson wrote:
> Something is wrong with this code, and also wrong with gdb on the sparc
> systems to which I have access, so I cannot debug it either.
> Disable for now, so the release is not broken.

I'm not sure if it is the entire problem but it looks like the broken
code had the same race as on ppc [1] between loading TCG_REG_TB and
patching and executing the direct branch.

[1] https://lore.kernel.org/qemu-devel/20230717093001.13167-1-jniet...@gmail.com/#t

> Signed-off-by: Richard Henderson
> ---
>  tcg/sparc64/tcg-target.c.inc | 30 --
>  1 file changed, 4 insertions(+), 26 deletions(-)
>
> diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
> index ccc4144f7c..694f2b9dd4 100644
> --- a/tcg/sparc64/tcg-target.c.inc
> +++ b/tcg/sparc64/tcg-target.c.inc
> @@ -1445,12 +1445,12 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
>  {
>      ptrdiff_t off = tcg_tbrel_diff(s, (void *)get_jmp_target_addr(s, which));
>
> -    /* Direct branch will be patched by tb_target_set_jmp_target. */
> +    /* Load link and indirect branch. */
>      set_jmp_insn_offset(s, which);
> -    tcg_out32(s, CALL);
> -    /* delay slot */
> -    tcg_debug_assert(check_fit_ptr(off, 13));
>      tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TB, TCG_REG_TB, off);
> +    tcg_out_arithi(s, TCG_REG_G0, TCG_REG_TB, 0, JMPL);
> +    /* delay slot */
> +    tcg_out_nop(s);
>      set_jmp_reset_offset(s, which);
>
>      /*
> @@ -1469,28 +1469,6 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
>  void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
>                                uintptr_t jmp_rx, uintptr_t jmp_rw)
>  {
> -    uintptr_t addr = tb->jmp_target_addr[n];
> -    intptr_t br_disp = (intptr_t)(addr - jmp_rx) >> 2;
> -    tcg_insn_unit insn;
> -
> -    br_disp >>= 2;
> -    if (check_fit_ptr(br_disp, 19)) {
> -        /* ba,pt %icc, addr */
> -        insn = deposit32(INSN_OP(0) | INSN_OP2(1) | INSN_COND(COND_A)
> -                         | BPCC_ICC | BPCC_PT, 0, 19, br_disp);
> -    } else if (check_fit_ptr(br_disp, 22)) {
> -        /* ba addr */
> -        insn = deposit32(INSN_OP(0) | INSN_OP2(2) | INSN_COND(COND_A),
> -                         0, 22, br_disp);
> -    } else {
> -        /* The code_gen_buffer can't be larger than 2GB. */
> -        tcg_debug_assert(check_fit_ptr(br_disp, 30));
> -        /* call addr */
> -        insn = deposit32(CALL, 0, 30, br_disp);
> -    }
> -
> -    qatomic_set((uint32_t *)jmp_rw, insn);
> -    flush_idcache_range(jmp_rx, jmp_rw, 4);
>  }
>
>  static void tcg_out_op(TCGContext *s, TCGOpcode opc,
[PATCH] tcg/ppc: Fix race in goto_tb implementation
Commit 20b6643324 ("tcg/ppc: Reorg goto_tb implementation") modified
goto_tb to ensure only a single instruction was patched to prevent
incorrect behaviour if a thread was in the middle of multiple
instructions when they were replaced.

However this introduced a race between loading the jmp target into
TCG_REG_TB and patching and executing the direct branch.

The relevant part of the goto_tb implementation:

        ld TCG_REG_TB, TARGET_ADDR_LOCATION(TCG_REG_TB)
    patch_location:
        mtctr TCG_REG_TB
        bctr

tb_target_set_jmp_target() will replace 'patch_location' with a direct
branch if the target is in range. The direct branch now relies on
TCG_REG_TB being set up correctly by the ld. Prior to this commit
multiple instructions were patched in for the direct branch case; these
instructions would initialise TCG_REG_TB to the same value as the branch
target.

Imagine the following sequence:

1) Thread A is executing the goto_tb sequence and loads the jmp
   target into TCG_REG_TB.

2) Thread B updates the jmp target address and calls
   tb_target_set_jmp_target(). This patches a new direct branch into
   the goto_tb sequence.

3) Thread A executes the newly patched direct branch. The value in
   TCG_REG_TB still contains the old jmp target.

TCG_REG_TB MUST contain the translation block's tc.ptr. Execution will
eventually crash after performing memory accesses generated from a
faulty value in TCG_REG_TB. This presents as segfaults or illegal
instruction exceptions.

Do not revert commit 20b6643324 as it did fix a different race
condition. Instead remove the direct branch optimization and always use
indirect branches. The direct branch optimization can be re-added later
with a race free sequence.

Gitlab issue: https://gitlab.com/qemu-project/qemu/-/issues/1726

Fixes: 20b6643324 ("tcg/ppc: Reorg goto_tb implementation")
Reported-by: Anushree Mathur
Co-developed-by: Benjamin Gray
Signed-off-by: Jordan Niethe
---
 tcg/ppc/tcg-target.c.inc | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index 8d6899cf40..a7323f479b 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -2533,11 +2533,10 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
         ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
         tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);

-        /* Direct branch will be patched by tb_target_set_jmp_target. */
+        /* TODO: Use direct branches when possible. */
         set_jmp_insn_offset(s, which);
         tcg_out32(s, MTSPR | RS(TCG_REG_TB) | CTR);

-        /* When branch is out of range, fall through to indirect. */
         tcg_out32(s, BCCTR | BO_ALWAYS);

         /* For the unlinked case, need to reset TCG_REG_TB. */
@@ -2565,10 +2564,11 @@ void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
     intptr_t diff = addr - jmp_rx;
     tcg_insn_unit insn;

+    if (USE_REG_TB)
+        return;
+
     if (in_range_b(diff)) {
         insn = B | (diff & 0x3fc);
-    } else if (USE_REG_TB) {
-        insn = MTSPR | RS(TCG_REG_TB) | CTR;
     } else {
         insn = NOP;
     }
--
2.39.3
Re: [PATCH] tcg/ppc: Fix race in goto_tb implementation
On Mon, Jul 17, 2023 at 11:24 AM Jordan Niethe wrote:

[snip]

> Reported-by: Anushree Mathur
> Co-developed-by: Benjamin Gray
> Signed-off-by: Jordan Niethe

Sorry, should also have:

Signed-off-by: Benjamin Gray

[snip]
[PATCH v2] tcg/ppc: Fix race in goto_tb implementation
Commit 20b6643324 ("tcg/ppc: Reorg goto_tb implementation") modified
goto_tb to ensure only a single instruction was patched to prevent
incorrect behavior if a thread was in the middle of multiple
instructions when they were replaced.

However this introduced a race between loading the jmp target into
TCG_REG_TB and patching and executing the direct branch.

The relevant part of the goto_tb implementation:

        ld TCG_REG_TB, TARGET_ADDR_LOCATION(TCG_REG_TB)
    patch_location:
        mtctr TCG_REG_TB
        bctr

tb_target_set_jmp_target() will replace 'patch_location' with a direct
branch if the target is in range. The direct branch now relies on
TCG_REG_TB being set up correctly by the ld. Prior to this commit
multiple instructions were patched in for the direct branch case; these
instructions would initialize TCG_REG_TB to the same value as the branch
target.

Imagine the following sequence:

1) Thread A is executing the goto_tb sequence and loads the jmp
   target into TCG_REG_TB.

2) Thread B updates the jmp target address and calls
   tb_target_set_jmp_target(). This patches a new direct branch into
   the goto_tb sequence.

3) Thread A executes the newly patched direct branch. The value in
   TCG_REG_TB still contains the old jmp target.

TCG_REG_TB MUST contain the translation block's tc.ptr. Execution will
eventually crash after performing memory accesses generated from a
faulty value in TCG_REG_TB. This presents as segfaults or illegal
instruction exceptions.

Do not revert commit 20b6643324 as it did fix a different race
condition. Instead remove the direct branch optimization and always use
indirect branches. The direct branch optimization can be re-added later
with a race free sequence.

Fixes: 20b6643324 ("tcg/ppc: Reorg goto_tb implementation")
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1726
Reported-by: Anushree Mathur
Tested-by: Anushree Mathur
Tested-by: Michael Tokarev
Reviewed-by: Richard Henderson
Co-developed-by: Benjamin Gray
Signed-off-by: Jordan Niethe
Signed-off-by: Benjamin Gray
---
v2:
  - Use braces
  - Collect tags
---
 tcg/ppc/tcg-target.c.inc | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index 8d6899cf40..329319ab8a 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -2533,11 +2533,10 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
         ptrdiff_t offset = tcg_tbrel_diff(s, (void *)ptr);
         tcg_out_mem_long(s, LD, LDX, TCG_REG_TB, TCG_REG_TB, offset);

-        /* Direct branch will be patched by tb_target_set_jmp_target. */
+        /* TODO: Use direct branches when possible. */
         set_jmp_insn_offset(s, which);
         tcg_out32(s, MTSPR | RS(TCG_REG_TB) | CTR);

-        /* When branch is out of range, fall through to indirect. */
         tcg_out32(s, BCCTR | BO_ALWAYS);

         /* For the unlinked case, need to reset TCG_REG_TB. */
@@ -2565,10 +2564,12 @@ void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
     intptr_t diff = addr - jmp_rx;
     tcg_insn_unit insn;

+    if (USE_REG_TB) {
+        return;
+    }
+
     if (in_range_b(diff)) {
         insn = B | (diff & 0x3fc);
-    } else if (USE_REG_TB) {
-        insn = MTSPR | RS(TCG_REG_TB) | CTR;
     } else {
         insn = NOP;
     }
--
2.39.3