We are working on support for x86_64 emulation on aarch64, mainly related to memory ordering issues. We first wanted to know what the community thinks about our proposal, and its chance of getting merged one day.
Note that we worked with qemu-user, so there may be issues in system mode that we missed.

# Problem

When generating the TCG instructions for memory accesses, fences are always inserted *before* the access, following this translation rule:

    x86   -->  TCG        -->  aarch64
    -------------------------------------
    RMOV  -->  Fm_ld; ld  -->  DMBLD; LDR
    WMOV  -->  Fm_st; st  -->  DMBFF; STR

Here, Fm_ld is a fence that orders any preceding memory access with the subsequent load. Fm_st is a fence that orders any preceding memory access with the subsequent store.

This means that, in TCG, all memory accesses are ordered by fences. Thus, no memory accesses can be re-ordered in TCG. This is a problem, because it is *stricter than x86*. Consider a program that contains:

    WMOV; RMOV

x86 allows re-ordering independent store-load pairs, so the above pair can safely re-order on an x86 host. However, with QEMU's current translation, it becomes:

    DMBFF; STR; DMBLD; LDR

In this target aarch64 code, no re-ordering is possible. Hence, QEMU enforces a stronger model than x86. While that is correct, it harms performance.

# Solution

We propose an alternative scheme, which we formally proved correct (paper under review):

    x86   -->  TCG         -->  aarch64
    -------------------------------------
    RMOV  -->  ld; Fld_m   -->  LDR; DMBLD
    WMOV  -->  Fst_st; st  -->  DMBST; STR

Following the same naming convention as above, Fld_m orders the preceding load with any subsequent memory access, and Fst_st orders any preceding store with the subsequent store.

This new scheme precisely captures the observable behaviors of the input program (in x86's memory model). The inserted fences preserve this behavior in the resulting TCG and aarch64 programs (formally verified). Note that this scheme enforces fewer orderings than the previous (unnecessarily strong) mapping scheme.

This new scheme benefits performance. We evaluated it on the PARSEC benchmarks and got up to 19.7% improvement, 6.7% on average.

# Implementation Considerations

Different source and host architectures may demand different mapping schemes. Some schemes may place fences before an instruction, while others place them after.
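
To make the before/after distinction concrete, here is a minimal Python sketch that replays the two mapping tables above on the WMOV; RMOV example. It is a toy model only, not QEMU code: the names CURRENT_SCHEME, PROPOSED_SCHEME, translate and fences_between are hypothetical, and the tables simply restate the aarch64 columns given above.

```python
# Toy model only: illustrative names, not QEMU or TCG APIs.
# Each scheme maps an x86 access to (fence_before, access, fence_after)
# on the aarch64 side; None means no fence is emitted at that position.
CURRENT_SCHEME = {
    "RMOV": ("DMBLD", "LDR", None),
    "WMOV": ("DMBFF", "STR", None),
}
PROPOSED_SCHEME = {
    "RMOV": (None, "LDR", "DMBLD"),
    "WMOV": ("DMBST", "STR", None),
}


def translate(program, scheme):
    """Return the aarch64 instruction list for a list of x86 accesses."""
    out = []
    for op in program:
        before, access, after = scheme[op]
        if before is not None:
            out.append(before)
        out.append(access)
        if after is not None:
            out.append(after)
    return out


def fences_between(seq, first, second):
    """Return the fences emitted between `first` and the next `second`."""
    i = seq.index(first)
    j = seq.index(second, i + 1)
    return [insn for insn in seq[i + 1:j] if insn.startswith("DMB")]


program = ["WMOV", "RMOV"]  # the store-load pair from the example above
old = translate(program, CURRENT_SCHEME)
new = translate(program, PROPOSED_SCHEME)

print("; ".join(old))
print("; ".join(new))
print(fences_between(old, "STR", "LDR"), fences_between(new, "STR", "LDR"))
```

In this model, the current scheme leaves a fence between the STR and the LDR, while the proposed scheme leaves none, which is the store-load relaxation that x86 permits. Representing each mapping as a (before, access, after) triple is just one convenient way to express both placements with the same code.
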
The implementation of fence placement should thus be flexible enough that either is possible. Note, though, that write-read pairs are unordered in almost all architectures.

We see two ways of doing this:

- extracting the placement of the fence from the tcg_gen_qemu_ld/st_i32/i64 functions, and having each architecture explicitly generate the fence at the correct place
- adding two parameters to these functions specifying the strength of the "before" and "after" fences. The function would then generate both fences in the IR (one of them may be a NOP fence), which in turn will be translated to host instructions

We are eager to see what you think about this change in TCG.

Cheers!

--
Redha Gouicem
Post doctoral researcher
Chair of Decentralized Systems Engineering
Department of Informatics, Technical University of Munich (TUM)