Re: [Qemu-devel] save compiled qemu traces.
On Thu, Dec 12, 2013 at 5:07 AM, Xin Tong trent.t...@gmail.com wrote:

see questions below.

On Tue, Dec 10, 2013 at 12:25 AM, Alex Bennée alex.ben...@linaro.org wrote:

trent.t...@gmail.com writes:

Does anyone have profiles on how much time QEMU spends in translating instructions? QEMU does not have a baseline interpreter nor does it translate on trace-granularity, so I imagine QEMU must spend quite a bit of time translating instructions.

Not as much as you'd think. The translation stage isn't very complex and blocks only get translated once (modulo exceptions and self-modifying code). If you run perf on your task you should see most of the time is spent in the generated code - if not please send the test case to the list.

I took a profile running speccpu2006 403.gcc with test input on an Intel Xeon machine. We only spent 44.76% of the time in the code cache (i.e. 13M ticks in the code cache), while 40.97% of the time is spent in qemu-system-x86_64. Some of the hot functions in qemu-system-x86_64 are listed below. *You are right*, we do not spend much time in translation routines. Instead we spend a significant amount of time in address translation code.

CPU_CLK_UNHALTED  %       Symbol/Functions
1340512           100.00  anon (tgid:7106 range:0x7f97815ca000-0x7f979a692000)

CPU_CLK_UNHALTED  %       Symbol/Functions
314655            25.64   address_space_translate_internal
308942            25.18   cpu_x86_exec
128922            10.51   ldq_phys
92345             7.53    cpu_x86_handle_mmu_fault
62456             5.09    tlb_set_page
49332             4.02    memory_region_is_ram
31055             2.53    helper_le_ldq_mmu
22048             1.80    memory_region_get_ram_addr
19223             1.57    memory_region_section_get_iotlb
15873             1.29    tcg_optimize
14526             1.18    get_page_addr_code
12601             1.03    memory_region_get_ram_ptr

You could perhaps redo the same experiment using user mode QEMU. That'll give you another interesting point of measure. Another experiment is kernel booting, because it's likely to run code only once, which will make the code translation functions climb up the usage scale.

Laurent
Re: [Qemu-devel] save compiled qemu traces.
see questions below.

On Tue, Dec 10, 2013 at 12:25 AM, Alex Bennée alex.ben...@linaro.org wrote:

trent.t...@gmail.com writes:

Does anyone have profiles on how much time QEMU spends in translating instructions? QEMU does not have a baseline interpreter nor does it translate on trace-granularity, so I imagine QEMU must spend quite a bit of time translating instructions.

Not as much as you'd think. The translation stage isn't very complex and blocks only get translated once (modulo exceptions and self-modifying code). If you run perf on your task you should see most of the time is spent in the generated code - if not please send the test case to the list.

I took a profile running speccpu2006 403.gcc with test input on an Intel Xeon machine. We only spent 44.76% of the time in the code cache (i.e. 13M ticks in the code cache), while 40.97% of the time is spent in qemu-system-x86_64. Some of the hot functions in qemu-system-x86_64 are listed below. *You are right*, we do not spend much time in translation routines. Instead we spend a significant amount of time in address translation code.

CPU_CLK_UNHALTED  %       Symbol/Functions
1340512           100.00  anon (tgid:7106 range:0x7f97815ca000-0x7f979a692000)

CPU_CLK_UNHALTED  %       Symbol/Functions
314655            25.64   address_space_translate_internal
308942            25.18   cpu_x86_exec
128922            10.51   ldq_phys
92345             7.53    cpu_x86_handle_mmu_fault
62456             5.09    tlb_set_page
49332             4.02    memory_region_is_ram
31055             2.53    helper_le_ldq_mmu
22048             1.80    memory_region_get_ram_addr
19223             1.57    memory_region_section_get_iotlb
15873             1.29    tcg_optimize
14526             1.18    get_page_addr_code
12601             1.03    memory_region_get_ram_ptr

Xin

I suspect the more useful statistic would be getting a breakdown of the translation blocks and seeing which ones are the most heavily used and examining if QEMU has done as good a job as it can of translating them.

Is it possible for QEMU to obviate some of the translations by attaching a signature (e.g. a hash) to every translated basic block and trying to reuse translated basic blocks based on the signature as much as possible? Reuse can be a result of rerunning programs or of the same libraries being statically linked into programs.

You're right, a translation cache *could* save some translation time, especially if you end up translating the same program over and over again. Having said that, you might find the cost of computing the checksum obviates any speed-up from skipping the translation. After all, QEMU only needs to look at each subject instruction once normally.

Using QEMU linux-user for cross building would be the obvious pain point. However, as the usual use case is building for embedded platforms, most users are just happy to fully utilise their 80-core build machines in preference to having a farm of slow embedded processors.

This could end up saving some translation time.

I think you would need to do some performance analysis and come up with some numbers before you made that assumption.

Cheers,

--
Alex Bennée
QEMU/KVM Hacker for Linaro
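As a rough way to read the profile in this message, the listed samples can be grouped by what each function appears to do. This is only a sketch: the sample counts are copied from the table, but the grouping into `MEMORY_PATH` (address translation and memory access helpers) is an assumption made for illustration, not something stated in the thread, and `share_of_listed` is a hypothetical helper.

```python
# CPU_CLK_UNHALTED samples per symbol, copied from the qemu-system-x86_64 table.
PROFILE = {
    "address_space_translate_internal": 314655,
    "cpu_x86_exec": 308942,
    "ldq_phys": 128922,
    "cpu_x86_handle_mmu_fault": 92345,
    "tlb_set_page": 62456,
    "memory_region_is_ram": 49332,
    "helper_le_ldq_mmu": 31055,
    "memory_region_get_ram_addr": 22048,
    "memory_region_section_get_iotlb": 19223,
    "tcg_optimize": 15873,
    "get_page_addr_code": 14526,
    "memory_region_get_ram_ptr": 12601,
}

# Assumed grouping: everything except the main execution loop and the
# TCG optimizer is treated as part of the address translation / memory path.
MEMORY_PATH = set(PROFILE) - {"cpu_x86_exec", "tcg_optimize"}


def share_of_listed(names):
    """Fraction of the listed samples that fall in the given set of symbols."""
    total = sum(PROFILE.values())
    return sum(PROFILE[name] for name in names) / total


if __name__ == "__main__":
    print(f"memory path:  {share_of_listed(MEMORY_PATH):.1%}")
    print(f"cpu_x86_exec: {share_of_listed({'cpu_x86_exec'}):.1%}")
    print(f"tcg_optimize: {share_of_listed({'tcg_optimize'}):.1%}")
```

The shares are relative to the listed functions only, not to the whole run, so they will not match the percentage column of the table exactly.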
Re: [Qemu-devel] save compiled qemu traces.
On Thu, Dec 12, 2013 at 1:07 PM, Xin Tong trent.t...@gmail.com wrote:

see questions below.

On Tue, Dec 10, 2013 at 12:25 AM, Alex Bennée alex.ben...@linaro.org wrote:

trent.t...@gmail.com writes:

Does anyone have profiles on how much time QEMU spends in translating instructions? QEMU does not have a baseline interpreter nor does it translate on trace-granularity, so I imagine QEMU must spend quite a bit of time translating instructions.

Not as much as you'd think. The translation stage isn't very complex and blocks only get translated once (modulo exceptions and self-modifying code). If you run perf on your task you should see most of the time is spent in the generated code - if not please send the test case to the list.

I took a profile running speccpu2006 403.gcc with test input on an Intel Xeon machine. We only spent 44.76% of the time in the code cache (i.e. 13M ticks in the code cache), while 40.97% of the time is spent in qemu-system-x86_64. Some of the hot functions in qemu-system-x86_64 are listed below. *You are right*, we do not spend much time in translation routines. Instead we spend a significant amount of time in address translation code.

CPU_CLK_UNHALTED  %       Symbol/Functions
1340512           100.00  anon (tgid:7106 range:0x7f97815ca000-0x7f979a692000)

CPU_CLK_UNHALTED  %       Symbol/Functions
314655            25.64   address_space_translate_internal
308942            25.18   cpu_x86_exec
128922            10.51   ldq_phys
92345             7.53    cpu_x86_handle_mmu_fault
62456             5.09    tlb_set_page
49332             4.02    memory_region_is_ram
31055             2.53    helper_le_ldq_mmu
22048             1.80    memory_region_get_ram_addr
19223             1.57    memory_region_section_get_iotlb
15873             1.29    tcg_optimize
14526             1.18    get_page_addr_code
12601             1.03    memory_region_get_ram_ptr

However, being able to reuse cached blocks based on content in QEMU may be a step towards reusing translated blocks across multiple invocations of QEMU.

Xin

I suspect the more useful statistic would be getting a breakdown of the translation blocks and seeing which ones are the most heavily used and examining if QEMU has done as good a job as it can of translating them.

Is it possible for QEMU to obviate some of the translations by attaching a signature (e.g. a hash) to every translated basic block and trying to reuse translated basic blocks based on the signature as much as possible? Reuse can be a result of rerunning programs or of the same libraries being statically linked into programs.

You're right, a translation cache *could* save some translation time, especially if you end up translating the same program over and over again. Having said that, you might find the cost of computing the checksum obviates any speed-up from skipping the translation. After all, QEMU only needs to look at each subject instruction once normally.

Using QEMU linux-user for cross building would be the obvious pain point. However, as the usual use case is building for embedded platforms, most users are just happy to fully utilise their 80-core build machines in preference to having a farm of slow embedded processors.

This could end up saving some translation time.

I think you would need to do some performance analysis and come up with some numbers before you made that assumption.

Cheers,

--
Alex Bennée
QEMU/KVM Hacker for Linaro
Re: [Qemu-devel] save compiled qemu traces.
peter.mayd...@linaro.org writes:

On 9 December 2013 06:36, Xin Tong trent.t...@gmail.com wrote:

Is it possible for QEMU to obviate some of the translations by attaching a signature (e.g. a hash) to every translated basic block and trying to reuse translated basic blocks based on the signature as much as possible? Reuse can be a result of rerunning programs or of the same libraries being statically linked into programs.

We already cache translated results. See tb_find_fast() and tb_find_slow() which do the lookup into the cache.

These are for the current execution context though, aren't they? I thought Xin was talking about caching translations between invocations of QEMU. I suspect address space randomisation would be another wrinkle in the side of any such scheme though.

thanks
-- PMM

--
Alex Bennée
QEMU/KVM Hacker for Linaro
Re: [Qemu-devel] save compiled qemu traces.
trent.t...@gmail.com writes:

Does anyone have profiles on how much time QEMU spends in translating instructions? QEMU does not have a baseline interpreter nor does it translate on trace-granularity, so I imagine QEMU must spend quite a bit of time translating instructions.

Not as much as you'd think. The translation stage isn't very complex and blocks only get translated once (modulo exceptions and self-modifying code). If you run perf on your task you should see most of the time is spent in the generated code - if not please send the test case to the list.

I suspect the more useful statistic would be getting a breakdown of the translation blocks and seeing which ones are the most heavily used and examining if QEMU has done as good a job as it can of translating them.

Is it possible for QEMU to obviate some of the translations by attaching a signature (e.g. a hash) to every translated basic block and trying to reuse translated basic blocks based on the signature as much as possible? Reuse can be a result of rerunning programs or of the same libraries being statically linked into programs.

You're right, a translation cache *could* save some translation time, especially if you end up translating the same program over and over again. Having said that, you might find the cost of computing the checksum obviates any speed-up from skipping the translation. After all, QEMU only needs to look at each subject instruction once normally.

Using QEMU linux-user for cross building would be the obvious pain point. However, as the usual use case is building for embedded platforms, most users are just happy to fully utilise their 80-core build machines in preference to having a farm of slow embedded processors.

This could end up saving some translation time.

I think you would need to do some performance analysis and come up with some numbers before you made that assumption.

Cheers,

--
Alex Bennée
QEMU/KVM Hacker for Linaro
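The trade-off described in this message (the checksum is paid on every block, the translation is skipped only on a hit) can be written down as a small break-even model. This is a sketch under simplifying assumptions that are not from the thread: costs are in arbitrary units, the cache lookup itself is treated as free, and `cached_cost` and `cache_is_worthwhile` are hypothetical names.

```python
def cached_cost(hash_cost: float, translate_cost: float, hit_rate: float) -> float:
    """Expected per-block cost when every block is hashed before translating."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # The hash is always paid; translation is paid only on a cache miss.
    return hash_cost + (1.0 - hit_rate) * translate_cost


def cache_is_worthwhile(hash_cost: float, translate_cost: float, hit_rate: float) -> bool:
    """True when hashing plus occasional translation beats always translating."""
    return cached_cost(hash_cost, translate_cost, hit_rate) < translate_cost


if __name__ == "__main__":
    # Illustrative numbers only; real values would have to be measured.
    for hit_rate in (0.05, 0.5, 0.9):
        print(hit_rate, cache_is_worthwhile(1.0, 10.0, hit_rate))
```

In this model the cache pays off only when the hash cost is below the hit rate times the translation cost, which is why measured numbers are needed before assuming a saving.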
Re: [Qemu-devel] save compiled qemu traces.
On 9 December 2013 06:36, Xin Tong trent.t...@gmail.com wrote:

Is it possible for QEMU to obviate some of the translations by attaching a signature (e.g. a hash) with every translated basic block and try to reuse translated basic block based on the signature as much as possible ? Reuses can be a result of rerunning programs or same libraries statically linked to programs.

We already cache translated results. See tb_find_fast() and tb_find_slow() which do the lookup into the cache.

thanks
-- PMM
Re: [Qemu-devel] save compiled qemu traces.
tb_find_fast and tb_find_slow are finding the translated blocks based on guest physical address. I am thinking about finding TBs by content, e.g. using a hash signature. This could potentially be used to save translations.

Xin

On Mon, Dec 9, 2013 at 7:32 AM, Peter Maydell peter.mayd...@linaro.org wrote:

On 9 December 2013 06:36, Xin Tong trent.t...@gmail.com wrote:

Is it possible for QEMU to obviate some of the translations by attaching a signature (e.g. a hash) to every translated basic block and trying to reuse translated basic blocks based on the signature as much as possible? Reuse can be a result of rerunning programs or of the same libraries being statically linked into programs.

We already cache translated results. See tb_find_fast() and tb_find_slow() which do the lookup into the cache.

thanks
-- PMM
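The difference between looking blocks up by address and by content can be illustrated with a toy cache keyed on a hash of the guest code bytes. This is a minimal sketch, not QEMU code: `ContentKeyedCache` and `fake_translate` are hypothetical, SHA-256 is just a convenient choice of hash, and the sketch ignores that a real translator's output may depend on more than the raw bytes (for example the address the block runs at).

```python
import hashlib
from typing import Callable, Dict


class ContentKeyedCache:
    """Toy cache mapping a hash of guest code bytes to a 'translated' result."""

    def __init__(self, translate: Callable[[bytes], bytes]) -> None:
        self._translate = translate
        self._by_digest: Dict[bytes, bytes] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, guest_bytes: bytes) -> bytes:
        # The key depends only on the block's content, not on its address.
        digest = hashlib.sha256(guest_bytes).digest()
        cached = self._by_digest.get(digest)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        translated = self._translate(guest_bytes)
        self._by_digest[digest] = translated
        return translated


def fake_translate(guest_bytes: bytes) -> bytes:
    # Stand-in for a real translator: it simply reverses the bytes.
    return guest_bytes[::-1]


if __name__ == "__main__":
    cache = ContentKeyedCache(fake_translate)
    block = b"\x90\x90\xc3"
    cache.lookup(block)  # first sighting: translated and stored
    cache.lookup(block)  # same bytes, e.g. at another address: reused
    print(cache.hits, cache.misses)
```

Two identical byte sequences share one entry regardless of where they were loaded, which is the reuse being asked about; the price is one hash computation per lookup.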
[Qemu-devel] save compiled qemu traces.
Does anyone have profiles on how much time QEMU spends in translating instructions? QEMU does not have a baseline interpreter nor does it translate on trace-granularity, so I imagine QEMU must spend quite a bit of time translating instructions.

Is it possible for QEMU to obviate some of the translations by attaching a signature (e.g. a hash) to every translated basic block and trying to reuse translated basic blocks based on the signature as much as possible? Reuse can be a result of rerunning programs or of the same libraries being statically linked into programs. This could end up saving some translation time.

Thank you,
Xin