On 13.05.2013 21:49, Richard Henderson wrote:
> On 05/13/2013 06:33 AM, Claudio Fontana wrote:
>> +enum aarch64_cond_code {
>> +    COND_EQ = 0x0,
>> +    COND_NE = 0x1,
>> +    COND_CS = 0x2,     /* Unsigned greater or equal */
>> +    COND_HS = 0x2,     /* ALIAS greater or equal */
>
> Clearer to define aliases as COND_HS = COND_CS.

I agree, will change.

>> +static inline void tcg_out_movi64(TCGContext *s, int rd, uint64_t value)
>> +{
>> +    uint32_t half, base, movk = 0, shift = 0;
>> +    if (!value) {
>> +        tcg_out_movr(s, 1, rd, TCG_REG_XZR);
>> +        return;
>> +    }
>> +    /* construct halfwords of the immediate with MOVZ with LSL */
>> +    /* using MOVZ 0x52800000 | extended reg.. */
>> +    base = 0xd2800000;
>> +
>> +    while (value) {
>> +        half = value & 0xffff;
>> +        if (half) {
>> +            /* Op can be MOVZ or MOVK */
>> +            tcg_out32(s, base | movk | shift | half << 5 | rd);
>> +            if (!movk)
>> +                movk = 0x20000000; /* morph next MOVZs into MOVKs */
>> +        }
>> +        value >>= 16;
>> +        shift += 0x00200000;
>
> You'll almost certainly want to try ADP+ADD before decomposing into 3-4
> mov[zk] instructions.

I don't know the ADP instruction, but this movi can certainly be improved.
Right now it needs 1 to 4 mov[zk] instructions, depending on how many
all-zero (0000h) 16-bit halfwords there are in the 64-bit value.
I was thinking of experimenting with this in a follow-up patch.
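Just to sketch the direction (completely untested, and reusing the raw
move-wide encodings the same way the patch does: MOVZ 0xd2800000,
MOVN 0x92800000, MOVK 0xf2800000 for the 64-bit forms), the idea would be
to pick MOVN instead of MOVZ as the first instruction whenever most of the
halfwords are ffffh:

static inline void tcg_out_movi64(TCGContext *s, int rd, uint64_t value)
{
    int i, zeros = 0, ones = 0, first = 1;
    uint16_t bg;

    /* count the all-zero and all-one 16 bit halfwords */
    for (i = 0; i < 4; i++) {
        uint16_t h = value >> (16 * i);
        zeros += (h == 0x0000);
        ones  += (h == 0xffff);
    }
    /* the first insn fills the untouched halfwords with this background:
       MOVZ clears them, MOVN sets them to ffffh */
    bg = (ones > zeros) ? 0xffff : 0x0000;

    if (value == 0 || value == ~(uint64_t)0) {
        /* every halfword matches the background: a single MOVZ/MOVN #0 */
        tcg_out32(s, (bg ? 0x92800000 : 0xd2800000) | rd);
        return;
    }

    for (i = 0; i < 4; i++) {
        uint16_t h = value >> (16 * i);
        uint32_t hw = i << 21;              /* hw field: LSL #(16 * i) */
        if (h == bg) {
            continue;                       /* this halfword comes for free */
        }
        if (first) {
            if (bg) {
                /* MOVN writes ~(imm16 << shift), so invert the halfword */
                tcg_out32(s, 0x92800000 | hw | ((h ^ 0xffff) << 5) | rd);
            } else {
                /* MOVZ */
                tcg_out32(s, 0xd2800000 | hw | (h << 5) | rd);
            }
            first = 0;
        } else {
            /* MOVK patches in one halfword, keeping the others */
            tcg_out32(s, 0xf2800000 | hw | (h << 5) | rd);
        }
    }
}

With something like that a value such as -1 needs a single MOVN instead of
four mov[zk] instructions.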
>
>> +void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr)
>> +{
>> +    tcg_target_long target, offset;
>> +    target = (tcg_target_long)addr;
>> +    offset = (target - (tcg_target_long)jmp_addr) / 4;
>> +
>> +    if (offset <= -0x02000000 || offset >= 0x02000000) {
>> +        /* out of 26bit range */
>> +        tcg_abort();
>> +    }
>
> See MAX_CODE_GEN_BUFFER_SIZE in translate-all.c.  Set this value to 128MB and
> then all cross-TB branches will be in range, and the abort won't trigger.

That's great, I missed that.
I will add a change to that effect in translate-all.c.
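Unless I misread the #if ladder around MAX_CODE_GEN_BUFFER_SIZE there, it
should only need one more case (assuming the aarch64 host is identified by
__aarch64__), something along these lines:

#elif defined(__aarch64__)
  /* The 26-bit B/BL offset limits cross-TB branches to +-128MB.  */
# define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)

The 26-bit offset, counted in 4-byte instructions, gives a +-128MB range,
so with a 128MB buffer every cross-TB branch stays in range.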
>
>> +static inline void tcg_out_goto_label_cond(TCGContext *s, TCGCond c, int
>> label_index)
>> +{
>> +    tcg_target_long offset;
>> +    /* backward conditional jump never seems to happen in practice,
>> +       so just always use the branch trampoline */
>> +    c = tcg_invert_cond(c);
>> +    offset = 2; /* skip current instr and the next */
>> +    tcg_out_goto_cond(s, c, offset);
>> +    tcg_out_goto_label(s, label_index); /* emit 26bit jump */
>> +}
>
> Conditional branch range is +-1MB.  You'll never see a TB that large.  You
> don't need to emit a branch-across-branch.

Is there a way to get this right even in the corner case where a TB contains
hundreds of thousands of instructions without a jump, followed by a
conditional jump?
Are we _guaranteed_ never to see a TB that large by some define, similar to
MAX_CODE_GEN_BUFFER_SIZE?
I know it is a corner case that would only trigger in a very strange program,
but I would propose adding this topic to the TODO list for follow-up patches.

>
>> +    /* Should generate something like the following:
>> +     *  shr x8, addr_reg, #TARGET_PAGE_BITS
>> +     *  and x0, x8, #(CPU_TLB_SIZE - 1)   @ Assumption: CPU_TLB_BITS <= 8
>> +     *  add x0, env, x0 lsl #CPU_TLB_ENTRY_BITS
>> +     */
>> +# if CPU_TLB_BITS > 8
>> +# error "CPU_TLB_BITS too large"
>> +# endif
>
> I wonder if using UBFM to extract the TLB bits and BFM with XZR to clear the
> middle bits wouldn't be better, as you wouldn't be restricted on the size of
> CPU_TLB_BITS.  AFAICS it would be the same number of instructions.

Hmm..

>
>> +    case INDEX_op_mov_i64: ext = 1;
>> +    case INDEX_op_mov_i32:
>> +        tcg_out_movr(s, ext, args[0], args[1]);
>> +        break;
>
> See how the i386 backend uses macros to reduce the typing with these sorts of
> paired opcodes.

I saw the glue thing, and I try to stay away from that kind of preprocessor
use: it makes it more difficult for newcomers to dig in, since it breaks
gid / grepping for symbols, among other things.
Instead I used the editor and regexps to generate the list.
Maybe this could be patched up later if that turns out to be the consensus?

>
>> +    case INDEX_op_rotl_i64: ext = 1;
>> +    case INDEX_op_rotl_i32:     /* same as rotate right by (32 - m) */
>> +        if (const_args[2])      /* ROR / EXTR Wd, Wm, Wm, 32 - m */
>> +            tcg_out_rotl(s, ext, args[0], args[1], args[2]);
>> +        else { /* no RSB in aarch64 unfortunately. */
>> +            /* XXX UNTESTED */
>> +            tcg_out_movi32(s, ext, TCG_REG_X8, ext ? 64 : 32);
>
> But A64 does have shift counts that truncate to the width of the operation.
> Which means that the high bits may contain garbage, which means that you can
> compute this merely as ROR = -ROL, ignoring the 32/64.

I see.

>
>> +    case INDEX_op_setcond_i64: ext = 1;
>> +    case INDEX_op_setcond_i32:
>> +        tcg_out_movi32(s, ext, TCG_REG_X8, 0x01);
>> +        tcg_out_cmp(s, ext, args[1], args[2]);
>> +        tcg_out_csel(s, ext, args[0], TCG_REG_X8, TCG_REG_XZR,
>> +                     tcg_cond_to_aarch64_cond[args[3]]);
>
> See CSINC Wd,Wzr,Wzr,cond.  No need for the initial movi.

Yes, that was also mentioned by Peter; will change.

>
>> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xffff);
>> +    tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I64], 0, 0xffff);
>
> Only half of your registers are marked available.

Oops, will fix.

Thanks,

--
Claudio Fontana