Re: [Qemu-devel] save compiled qemu traces.

2013-12-12 Thread Laurent Desnogues
On Thu, Dec 12, 2013 at 5:07 AM, Xin Tong trent.t...@gmail.com wrote:
 see questions below.

 On Tue, Dec 10, 2013 at 12:25 AM, Alex Bennée alex.ben...@linaro.org wrote:

 trent.t...@gmail.com writes:

 Does anyone have profiles of how much time QEMU spends translating
 instructions? QEMU does not have a baseline interpreter, nor does it
 translate at trace granularity, so I imagine QEMU must spend quite a bit
 of time translating instructions.

 Not as much as you'd think. The translation stage isn't very complex and
 blocks only get translated once (modulo exceptions and self modifying
 code). If you run perf on your task you should see most of the time is
 spent in the generated code - if not please send the test case to the
 list.

 I took a profile running SPEC CPU2006 403.gcc with the test input on an
 Intel Xeon machine. We spent only 44.76% of the time in the code cache
 (i.e. about 1.3M ticks in the code cache), while 40.97% of the time was
 spent in qemu-system-x86_64. Some of the hot functions in
 qemu-system-x86_64 are listed below.

 *You are right*, we do not spend much time in translation routines.
 Instead, we spend a significant amount of time in address translation
 code.

 CPU_CLK_UNHALTED       %  Symbol/Functions
          1340512  100.00  anon (tgid:7106 range:0x7f97815ca000-0x7f979a692000)

 CPU_CLK_UNHALTED       %  Symbol/Functions
           314655   25.64  address_space_translate_internal
           308942   25.18  cpu_x86_exec
           128922   10.51  ldq_phys
            92345    7.53  cpu_x86_handle_mmu_fault
            62456    5.09  tlb_set_page
            49332    4.02  memory_region_is_ram
            31055    2.53  helper_le_ldq_mmu
            22048    1.80  memory_region_get_ram_addr
            19223    1.57  memory_region_section_get_iotlb
            15873    1.29  tcg_optimize
            14526    1.18  get_page_addr_code
            12601    1.03  memory_region_get_ram_ptr

You could perhaps redo the same experiment using user-mode QEMU.
That would give you another interesting data point.

Another experiment is booting a kernel: much of that code is likely to
run only once, which should make the code translation functions climb
up the profile.
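
To put rough numbers on the "runs once" argument, here is a toy
amortisation model. It is only a sketch: the cost figures are made-up
units (not measurements), and translation_share is a hypothetical helper,
not anything in QEMU.

```python
def translation_share(translate_cost, exec_cost, executions):
    """Fraction of a block's total time that goes into translating it.

    A block is translated once and then executed `executions` times.
    """
    total = translate_cost + exec_cost * executions
    return translate_cost / total


# Hypothetical costs: translating a block is assumed to cost 100 units,
# executing the generated code once is assumed to cost 1 unit.
boot_like = translation_share(100.0, 1.0, executions=1)       # run-once code
hot_loop = translation_share(100.0, 1.0, executions=100_000)  # hot loop

print(f"run-once block: {boot_like:.1%} of time in translation")
print(f"hot block:      {hot_loop:.3%} of time in translation")
```

Under these assumptions a workload dominated by run-once code (such as a
kernel boot) is where translation cost should show up most clearly.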


Laurent



Re: [Qemu-devel] save compiled qemu traces.

2013-12-11 Thread Xin Tong
see questions below.

On Tue, Dec 10, 2013 at 12:25 AM, Alex Bennée alex.ben...@linaro.org wrote:

 trent.t...@gmail.com writes:

 Does anyone have profiles of how much time QEMU spends translating
 instructions? QEMU does not have a baseline interpreter, nor does it
 translate at trace granularity, so I imagine QEMU must spend quite a bit
 of time translating instructions.

 Not as much as you'd think. The translation stage isn't very complex and
 blocks only get translated once (modulo exceptions and self modifying
 code). If you run perf on your task you should see most of the time is
 spent in the generated code - if not please send the test case to the
 list.

I took a profile running SPEC CPU2006 403.gcc with the test input on an
Intel Xeon machine. We spent only 44.76% of the time in the code cache
(i.e. about 1.3M ticks in the code cache), while 40.97% of the time was
spent in qemu-system-x86_64. Some of the hot functions in
qemu-system-x86_64 are listed below.

*You are right*, we do not spend much time in translation routines.
Instead, we spend a significant amount of time in address translation
code.

CPU_CLK_UNHALTED       %  Symbol/Functions
         1340512  100.00  anon (tgid:7106 range:0x7f97815ca000-0x7f979a692000)

CPU_CLK_UNHALTED       %  Symbol/Functions
          314655   25.64  address_space_translate_internal
          308942   25.18  cpu_x86_exec
          128922   10.51  ldq_phys
           92345    7.53  cpu_x86_handle_mmu_fault
           62456    5.09  tlb_set_page
           49332    4.02  memory_region_is_ram
           31055    2.53  helper_le_ldq_mmu
           22048    1.80  memory_region_get_ram_addr
           19223    1.57  memory_region_section_get_iotlb
           15873    1.29  tcg_optimize
           14526    1.18  get_page_addr_code
           12601    1.03  memory_region_get_ram_ptr
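
As a quick check of the "address translation" claim, the rows above can
be summed by category. This is only a sketch: which symbols count as
guest memory / address translation (the MEMORY_PATH set) is an assumed
grouping, and the percentages are taken to be relative to the
qemu-system-x86_64 samples, which were 40.97% of the run.

```python
# Rows copied from the qemu-system-x86_64 profile above.
PROFILE = """\
314655 25.64 address_space_translate_internal
308942 25.18 cpu_x86_exec
128922 10.51 ldq_phys
92345 7.53 cpu_x86_handle_mmu_fault
62456 5.09 tlb_set_page
49332 4.02 memory_region_is_ram
31055 2.53 helper_le_ldq_mmu
22048 1.80 memory_region_get_ram_addr
19223 1.57 memory_region_section_get_iotlb
15873 1.29 tcg_optimize
14526 1.18 get_page_addr_code
12601 1.03 memory_region_get_ram_ptr
"""

# Assumption: these symbols belong to the guest memory access / address
# translation path. The grouping is a guess, not something stated in the
# profile itself.
MEMORY_PATH = {
    "address_space_translate_internal", "ldq_phys",
    "cpu_x86_handle_mmu_fault", "tlb_set_page", "memory_region_is_ram",
    "helper_le_ldq_mmu", "memory_region_get_ram_addr",
    "memory_region_section_get_iotlb", "get_page_addr_code",
    "memory_region_get_ram_ptr",
}


def parse(text):
    """Turn 'samples percent symbol' lines into (int, float, str) rows."""
    rows = []
    for line in text.splitlines():
        samples, pct, symbol = line.split()
        rows.append((int(samples), float(pct), symbol))
    return rows


rows = parse(PROFILE)
memory_pct = sum(pct for _, pct, sym in rows if sym in MEMORY_PATH)
share_of_run = memory_pct * 40.97 / 100  # scale by the binary's share

print(f"memory path: {memory_pct:.2f}% of qemu-system-x86_64 samples")
print(f"             {share_of_run:.2f}% of the whole run")
```

With that grouping, tcg_optimize is the only listed symbol that is
clearly part of translation.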

Xin



 I suspect the more useful statistic would be a breakdown of the
 translation blocks, seeing which ones are the most heavily used and
 examining whether QEMU has done as good a job as it can of translating
 them.

 Is it possible for QEMU to obviate some of the translations by attaching a
 signature (e.g. a hash) to every translated basic block and reusing
 translated basic blocks based on that signature as much as possible? Reuse
 could come from rerunning programs or from the same libraries being
 statically linked into several programs.

 You're right, a translation cache *could* save some translation time,
 especially if you end up translating the same program over and over
 again. Having said that, you might find the cost of computing the
 checksum negates any speed-up from skipping the translation. After all,
 QEMU normally only needs to look at each subject instruction once.

 Using QEMU linux-user for cross building would be the obvious pain
 point. However, as the usual use case is building for embedded platforms,
 most users are just happy to fully utilise their 80-core build machines
 in preference to having a farm of slow embedded processors.

 This could end up saving some translation time.

 I think you would need to do some performance analysis and come up with
 some numbers before you made that assumption.

 Cheers,

 --
 Alex Bennée
 QEMU/KVM Hacker for Linaro




Re: [Qemu-devel] save compiled qemu traces.

2013-12-11 Thread Xin Tong
On Thu, Dec 12, 2013 at 1:07 PM, Xin Tong trent.t...@gmail.com wrote:
 see questions below.

 On Tue, Dec 10, 2013 at 12:25 AM, Alex Bennée alex.ben...@linaro.org wrote:

 trent.t...@gmail.com writes:

 Does anyone have profiles of how much time QEMU spends translating
 instructions? QEMU does not have a baseline interpreter, nor does it
 translate at trace granularity, so I imagine QEMU must spend quite a bit
 of time translating instructions.

 Not as much as you'd think. The translation stage isn't very complex and
 blocks only get translated once (modulo exceptions and self modifying
 code). If you run perf on your task you should see most of the time is
 spent in the generated code - if not please send the test case to the
 list.

 I took a profile running SPEC CPU2006 403.gcc with the test input on an
 Intel Xeon machine. We spent only 44.76% of the time in the code cache
 (i.e. about 1.3M ticks in the code cache), while 40.97% of the time was
 spent in qemu-system-x86_64. Some of the hot functions in
 qemu-system-x86_64 are listed below.

 *You are right*, we do not spend much time in translation routines.
 Instead, we spend a significant amount of time in address translation
 code.

 CPU_CLK_UNHALTED       %  Symbol/Functions
          1340512  100.00  anon (tgid:7106 range:0x7f97815ca000-0x7f979a692000)

 CPU_CLK_UNHALTED       %  Symbol/Functions
           314655   25.64  address_space_translate_internal
           308942   25.18  cpu_x86_exec
           128922   10.51  ldq_phys
            92345    7.53  cpu_x86_handle_mmu_fault
            62456    5.09  tlb_set_page
            49332    4.02  memory_region_is_ram
            31055    2.53  helper_le_ldq_mmu
            22048    1.80  memory_region_get_ram_addr
            19223    1.57  memory_region_section_get_iotlb
            15873    1.29  tcg_optimize
            14526    1.18  get_page_addr_code
            12601    1.03  memory_region_get_ram_ptr

However, being able to reuse cached blocks based on content in QEMU
may be a step towards reusing translated blocks across multiple
invocations of QEMU.

 Xin



 I suspect the more useful statistic would be a breakdown of the
 translation blocks, seeing which ones are the most heavily used and
 examining whether QEMU has done as good a job as it can of translating
 them.

 Is it possible for QEMU to obviate some of the translations by attaching a
 signature (e.g. a hash) to every translated basic block and reusing
 translated basic blocks based on that signature as much as possible? Reuse
 could come from rerunning programs or from the same libraries being
 statically linked into several programs.

 You're right, a translation cache *could* save some translation time,
 especially if you end up translating the same program over and over
 again. Having said that, you might find the cost of computing the
 checksum negates any speed-up from skipping the translation. After all,
 QEMU normally only needs to look at each subject instruction once.

 Using QEMU linux-user for cross building would be the obvious pain
 point. However, as the usual use case is building for embedded platforms,
 most users are just happy to fully utilise their 80-core build machines
 in preference to having a farm of slow embedded processors.

 This could end up saving some translation time.

 I think you would need to do some performance analysis and come up with
 some numbers before you made that assumption.

 Cheers,

 --
 Alex Bennée
 QEMU/KVM Hacker for Linaro




Re: [Qemu-devel] save compiled qemu traces.

2013-12-10 Thread Alex Bennée

peter.mayd...@linaro.org writes:

 On 9 December 2013 06:36, Xin Tong trent.t...@gmail.com wrote:
 Is it possible for QEMU to obviate some of the translations by attaching a
 signature (e.g. a hash) to every translated basic block and reusing
 translated basic blocks based on that signature as much as possible? Reuse
 could come from rerunning programs or from the same libraries being
 statically linked into several programs.

 We already cache translated results. See tb_find_fast()
 and tb_find_slow() which do the lookup into the cache.

These are for the current execution context though aren't they? I
thought Xin was talking about caching translations between invocations
of QEMU.

I suspect address space randomisation would be another wrinkle in the
side of any such scheme though.


 thanks
 -- PMM

-- 
Alex Bennée
QEMU/KVM Hacker for Linaro




Re: [Qemu-devel] save compiled qemu traces.

2013-12-09 Thread Alex Bennée

trent.t...@gmail.com writes:

 Does anyone have profiles of how much time QEMU spends translating
 instructions? QEMU does not have a baseline interpreter, nor does it
 translate at trace granularity, so I imagine QEMU must spend quite a bit
 of time translating instructions.

Not as much as you'd think. The translation stage isn't very complex and
blocks only get translated once (modulo exceptions and self modifying
code). If you run perf on your task you should see most of the time is
spent in the generated code - if not please send the test case to the
list.

I suspect the more useful statistic would be a breakdown of the
translation blocks, seeing which ones are the most heavily used and
examining whether QEMU has done as good a job as it can of translating
them.

 Is it possible for QEMU to obviate some of the translations by attaching a
 signature (e.g. a hash) to every translated basic block and reusing
 translated basic blocks based on that signature as much as possible? Reuse
 could come from rerunning programs or from the same libraries being
 statically linked into several programs.

You're right, a translation cache *could* save some translation time,
especially if you end up translating the same program over and over
again. Having said that, you might find the cost of computing the
checksum negates any speed-up from skipping the translation. After all,
QEMU normally only needs to look at each subject instruction once.
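
The trade-off can be written down as a small break-even model. This is a
sketch under stated assumptions: the costs are arbitrary units rather
than measurements, the function names are hypothetical, and the cost of
looking up and loading a cached block is ignored.

```python
def cached_cost(hash_cost, translate_cost, hit_rate):
    """Expected per-block cost with a content-keyed cache.

    Every block is hashed; only cache misses pay for translation.
    """
    return hash_cost + (1.0 - hit_rate) * translate_cost


def break_even_hit_rate(hash_cost, translate_cost):
    """Hit rate above which the cache beats always translating.

    From hash + (1 - h) * translate < translate, i.e. h > hash / translate.
    """
    return hash_cost / translate_cost


# Hypothetical units: hashing a block costs 20, translating it costs 100.
threshold = break_even_hit_rate(20.0, 100.0)
print(f"cache pays off above a {threshold:.0%} hit rate")
print("cost at 50% hits:", cached_cost(20.0, 100.0, 0.5), "vs 100.0")
print("cost at 10% hits:", cached_cost(20.0, 100.0, 0.1), "vs 100.0")
```

If hashing a block costs nearly as much as translating it, the required
hit rate approaches 100% and the cache cannot win, which is why measured
numbers matter here.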

Using QEMU linux-user for cross building would be the obvious pain
point. However, as the usual use case is building for embedded platforms,
most users are just happy to fully utilise their 80-core build machines
in preference to having a farm of slow embedded processors.

 This could end up saving some translation time.

I think you would need to do some performance analysis and come up with
some numbers before you made that assumption.

Cheers,

-- 
Alex Bennée
QEMU/KVM Hacker for Linaro




Re: [Qemu-devel] save compiled qemu traces.

2013-12-09 Thread Peter Maydell
On 9 December 2013 06:36, Xin Tong trent.t...@gmail.com wrote:
 Is it possible for QEMU to obviate some of the translations by attaching a
 signature (e.g. a hash) to every translated basic block and reusing
 translated basic blocks based on that signature as much as possible? Reuse
 could come from rerunning programs or from the same libraries being
 statically linked into several programs.

We already cache translated results. See tb_find_fast()
and tb_find_slow() which do the lookup into the cache.
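
For readers unfamiliar with that code, the general shape of such a
two-level lookup can be modelled roughly as below. This is a simplified
illustration, not QEMU's actual implementation: the class, its fields,
the cache size and the virt_to_phys/translate callbacks are all
hypothetical stand-ins.

```python
class TBLookup:
    """Toy two-level translated-block lookup (illustrative only)."""

    JMP_CACHE_SIZE = 4096  # hypothetical size of the fast cache

    def __init__(self, virt_to_phys, translate):
        self.virt_to_phys = virt_to_phys  # stand-in for the MMU lookup
        self.translate = translate        # stand-in for code generation
        self.jmp_cache = [None] * self.JMP_CACHE_SIZE
        self.phys_table = {}
        self.translations = 0

    def find_fast(self, pc, cs_base, flags):
        """Fast path: small direct-mapped cache indexed by virtual PC."""
        slot = pc % self.JMP_CACHE_SIZE
        tb = self.jmp_cache[slot]
        if tb is not None and tb["state"] == (pc, cs_base, flags):
            return tb
        tb = self.find_slow(pc, cs_base, flags)
        self.jmp_cache[slot] = tb  # remember it for next time
        return tb

    def find_slow(self, pc, cs_base, flags):
        """Slow path: table keyed by physical address plus CPU state.

        Translates the block only if no matching entry exists yet.
        """
        key = (self.virt_to_phys(pc), pc, cs_base, flags)
        tb = self.phys_table.get(key)
        if tb is None:
            self.translations += 1
            tb = {"state": (pc, cs_base, flags), "code": self.translate(pc)}
            self.phys_table[key] = tb
        return tb


lookup = TBLookup(virt_to_phys=lambda pc: pc & 0xFFFFFF,
                  translate=lambda pc: f"host code for {pc:#x}")
first = lookup.find_fast(0x400000, 0, 0)
again = lookup.find_fast(0x400000, 0, 0)
print(first is again, lookup.translations)
```

Both levels are keyed by address and CPU state, and both live only for
the lifetime of one process.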

thanks
-- PMM



Re: [Qemu-devel] save compiled qemu traces.

2013-12-09 Thread Xin Tong
tb_find_fast and tb_find_slow find translated blocks based on the
guest physical address. I am thinking about finding TBs by content, e.g.
using a hash signature. This could potentially save translations.
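
A minimal sketch of what "finding TBs by content" could look like, keyed
on a hash of the guest instruction bytes instead of an address. The class
name, the choice of SHA-1 and the translate callback are assumptions made
for illustration; none of this is existing QEMU code.

```python
import hashlib


class ContentTBCache:
    """Toy cache of translated blocks keyed by guest code bytes."""

    def __init__(self, translate):
        self.translate = translate  # stand-in for the real translator
        self.by_digest = {}
        self.misses = 0

    def lookup(self, guest_bytes):
        """Return the translation for these bytes, translating on a miss."""
        digest = hashlib.sha1(guest_bytes).digest()
        host_code = self.by_digest.get(digest)
        if host_code is None:
            self.misses += 1
            host_code = self.translate(guest_bytes)
            self.by_digest[digest] = host_code
        return host_code


# The "translator" here just reverses the bytes so the example is runnable.
cache = ContentTBCache(translate=lambda code: code[::-1])
block = bytes.fromhex("4889e5c3")

# The same bytes seen at two different guest addresses share one entry.
at_first_address = cache.lookup(block)
at_second_address = cache.lookup(block)
print(at_first_address == at_second_address, cache.misses)
```

A real scheme would also have to account for anything the generated code
depends on besides the bytes themselves, such as the block's address or
CPU state flags.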

Xin


On Mon, Dec 9, 2013 at 7:32 AM, Peter Maydell peter.mayd...@linaro.org wrote:

 On 9 December 2013 06:36, Xin Tong trent.t...@gmail.com wrote:
  Is it possible for QEMU to obviate some of the translations by attaching a
  signature (e.g. a hash) with every translated basic block and try to reuse
  translated basic block based on the signature as much as possible ? Reuses
  can be a result of rerunning programs or same libraries statically linked to
  programs.

 We already cache translated results. See tb_find_fast()
 and tb_find_slow() which do the lookup into the cache.

 thanks
 -- PMM



[Qemu-devel] save compiled qemu traces.

2013-12-08 Thread Xin Tong
Does anyone have profiles of how much time QEMU spends translating
instructions? QEMU does not have a baseline interpreter, nor does it
translate at trace granularity, so I imagine QEMU must spend quite a bit
of time translating instructions.

Is it possible for QEMU to obviate some of the translations by attaching a
signature (e.g. a hash) to every translated basic block and reusing
translated basic blocks based on that signature as much as possible? Reuse
could come from rerunning programs or from the same libraries being
statically linked into several programs.

This could end up saving some translation time.

Thank you,
Xin