Hello,

With the MTTCG code now merged in mainline, I tried to see whether we can run x86 SMP guests on ARM64 hosts. For this I tried running a Windows XP guest on a DragonBoard 410c, which has 1GB of RAM. Since x86 has a strong memory model whereas ARM64 has a weak one, I added a patch to generate fence instructions for every guest memory access. After some minor fixes, I was able to boot a 4-core guest all the way to the desktop (albeit with 1GB of backing swap). However, the performance is severely limited and the guest is barely usable.

Based on my observations, I think there are some easily implementable additions we can make to improve the performance of TCG in general and on ARM64 in particular. I propose to do the following as part of Google Summer of Code 2017.

* Implement jump-to-register instruction on ARM64 to overcome the 128MB translation cache size limit.

  The translation cache size for an ARM64 host is currently limited to 128MB. The limit comes from the branch instruction we use, which encodes the jump offset in the instruction itself and so can only reach as far as the bits available for that offset allow. The performance impact of this limitation is severe and can be observed when you try to run large programs like a browser in the guest. The cache is flushed several times before the browser starts, and the performance is not satisfactory. This limitation can be overcome by generating a branch-to-register instruction and using it whenever the destination address is outside the range that the current branch instruction can encode.

* Implement an LRU translation block code cache.

  In the current TCG design, when the translation cache fills up, we flush all the translated blocks (TBs) to free up space. We can improve this situation by not flushing the TBs that were recently used, i.e., by implementing an LRU policy for freeing the blocks. This should avoid the re-translation overhead for frequently used blocks and improve performance.

* Avoid consistency overhead for strong memory model guests by generating load-acquire and store-release instructions.

  To run a strongly ordered guest on a weakly ordered host using MTTCG, for example x86 on ARM64, we have to generate fence instructions for all the guest memory accesses to ensure consistency. The overhead imposed by these fence instructions is significant (almost 3x when compared to a run without fence instructions). ARM64 provides load-acquire and store-release instructions which are sequentially consistent and can be used instead of generating fence instructions. I plan to add support for generating these instructions in the TCG run-time to reduce the consistency overhead in MTTCG.
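
To make the 128MB figure in the first item concrete, here is a small Python model of the range check (a sketch only, not QEMU code). It relies on the AArch64 encoding of the B instruction, which carries a signed 26-bit offset counted in 4-byte instructions. The names `fits_direct_branch` and `choose_branch` are hypothetical and exist only for this illustration.

```python
# Sketch only: models the reachability check, not QEMU's AArch64 backend.
B_IMM_BITS = 26   # AArch64 B/BL encode a signed 26-bit offset in words
INSN_SIZE = 4     # every AArch64 instruction is 4 bytes long

B_MIN_OFFSET = -(1 << (B_IMM_BITS - 1)) * INSN_SIZE       # -128 MiB
B_MAX_OFFSET = ((1 << (B_IMM_BITS - 1)) - 1) * INSN_SIZE  # +128 MiB - 4


def fits_direct_branch(src: int, dst: int) -> bool:
    """True if a B instruction at address src can reach address dst."""
    offset = dst - src
    return offset % INSN_SIZE == 0 and B_MIN_OFFSET <= offset <= B_MAX_OFFSET


def choose_branch(src: int, dst: int) -> str:
    """Hypothetical selector for the branch form used to jump from src to dst."""
    if fits_direct_branch(src, dst):
        return "b"   # single PC-relative branch
    return "br"      # materialize dst in a scratch register, then BR Xn
```

With a check like this, only the jumps that actually leave the reachable window would pay for the longer branch-to-register sequence.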
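
The LRU policy from the second item can be modeled in a few lines of Python. This is a toy sketch under simplifying assumptions: it only tracks block sizes per guest PC, and it ignores things a real implementation would have to handle, such as free-space management in the code buffer and unlinking jumps between chained blocks. The class `LruTbCache` and its methods are hypothetical names, not existing QEMU interfaces.

```python
from collections import OrderedDict


class LruTbCache:
    """Toy model: maps a guest PC to the size in bytes of its translated block."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = OrderedDict()  # least recently used first

    def lookup(self, pc: int) -> bool:
        """Return True on a hit and mark the block as most recently used."""
        if pc not in self.blocks:
            return False
        self.blocks.move_to_end(pc)
        return True

    def insert(self, pc: int, size: int) -> list:
        """Add a block, evicting LRU blocks as needed; return the evicted PCs."""
        if size > self.capacity:
            raise ValueError("block larger than the whole cache")
        if pc in self.blocks:
            self.used -= self.blocks.pop(pc)  # replace an existing entry
        evicted = []
        while self.used + size > self.capacity:
            old_pc, old_size = self.blocks.popitem(last=False)
            self.used -= old_size
            evicted.append(old_pc)
        self.blocks[pc] = size
        self.used += size
        return evicted
```

For example, with a 100-byte cache holding two 40-byte blocks, touching the first block and then inserting a third evicts only the second one, instead of flushing everything.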

Alex Bennée, who mentored me last year, has agreed to mentor me again this time if the proposal is accepted.

Please let me know if you have any comments or suggestions. Also, please let me know if there are other enhancements that are easily implementable to increase TCG performance, as part of this project or otherwise.

Thanks,
--
Pranith