Re: [Qemu-devel] [Qemu-ppc] Profiling results

2018-07-17 Thread Peter Maydell
On 17 July 2018 at 21:46, BALATON Zoltan  wrote:
> On Tue, 17 Jul 2018, Mark Cave-Ayland wrote:
>> Good question. A quick grep for 'asidx_from_attrs' shows that
>> cc->asidx_from_attrs() isn't set for PPC targets, so as a quick test does
>> replacing the inline function cpu_asidx_from_attrs() in include/qom/cpu.h
>> with a simple "return 0" change the profile at all?
>
>
> It does seem to lessen its impact but it's still higher than I expected:

It may be worth special-casing the CPU method lookups (or at
least that one) if we can, then...

> %        cum. %    linenr info             symbol name
> 10.7949  10.7949   exec-all.h:410          helper_lookup_tb_ptr
>  7.8663  18.6612   cputlb.c:793            io_readx
>  6.0265  24.6878   cputlb.c:114            tlb_flush_nocheck
>  4.0671  28.7548   sm501_template.h:62     draw_line16_32
>  4.0559  32.8107   object.c:765            object_class_dynamic_cast_assert
>  3.3780  36.1887   memory.c:1350           memory_region_access_valid
>  2.8920  39.0808   qemu-thread-posix.c:61  qemu_mutex_lock_impl
>  2.7187  41.7995   memory.c:1415           memory_region_dispatch_read
>  2.6011  44.4006   qht.c:487               qht_lookup_custom
>  2.5356  46.9362   softmmu_template.h:112  helper_ret_ldub_mmu
>
> Maybe it's called from somewhere else too? I know draw_line16_32 but I
> wonder where helper_lookup_tb_ptr and the TLB flushes could come from.
> Those seem to be significant. And io_readx itself also seems too high on
> the list.

helper_lookup_tb_ptr is part of TCG -- it's where we look for
the next TB to go to. Any non-computed branch to a different page
will result in our calling this. So it's high on the profile
because we do it a lot, I think, but that's not necessarily a
problem as such.

io_readx is the slow path for guest memory accesses -- any
guest access to something that's not RAM will have to go through
here. My first guess (given the other things in the profile,
especially helper_ret_ldub_mmu, memory_region_dispatch_read
and memory_region_access_valid) is that the guest is in a tight
loop doing a read on a device register a lot of the time.

> I wonder if it may have something to do with the background task
> trying to read non-implemented i2c stuff frequently (as discussed in point
> 2. in http://zero.eik.bme.hu/~balaton/qemu/amiga/#morphos).

Could be, or some similar thing. If you suspect the i2c you
could try putting in an unimplemented-device stub in the
right place and see how often -d unimp yells about reads to it.
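
For example, a minimal sketch of such a stub in the machine init code might
look like the following (the device name, base address and size are made-up
placeholders, not the real mac99 values -- they would have to come from the
board's memory map):

    #include "qemu/osdep.h"
    #include "hw/misc/unimp.h"

    /* Cover the suspected i2c register window with an unimplemented-device
     * stub.  Guest reads and writes to it are then reported when QEMU is
     * run with "-d unimp", which shows how often the guest polls it.
     * The base address and size here are placeholders only. */
    static void add_i2c_unimp_stub(void)
    {
        create_unimplemented_device("suspect-i2c", 0xf8001000, 0x1000);
    }

Then running with something like "qemu-system-ppc -M mac99 -d unimp ..."
should log every guest access to the stubbed region.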

So overall I'd be a little wary of optimizing based on this
profile, because I suspect it's atypical -- the guest is sat
in a tight polling loop and the profile says "all the functions
in the code path for doing device access are really hot".
The fix is to improve our model so the guest doesn't get
stuck like that, not to try to slightly improve the speed
of device accesses (we call it the "slow path" for a reason :-))

(But places like asidx_from_attrs are likely to be on hot
paths in general, so having the QOM class lookup there be
overly heavyweight is maybe worth fixing anyhow.)

thanks
-- PMM



Re: [Qemu-devel] [Qemu-ppc] Profiling results

2018-07-17 Thread BALATON Zoltan

On Tue, 17 Jul 2018, Mark Cave-Ayland wrote:

On 17/07/18 20:35, BALATON Zoltan wrote:

On Tue, 17 Jul 2018, Mark Cave-Ayland wrote:
MorphOS on mac99 this seems to be significant. This is with default 
configure (--enable-qom-cast-debug):


%        cum. %    linenr info             symbol name
9.7057    9.7057   exec-all.h:410          helper_lookup_tb_ptr
8.0330   17.7387   object.c:711            object_class_dynamic_cast_assert
6.9411   24.6798   cputlb.c:793            io_readx
6.3219   31.0017   sm501_template.h:62     draw_line16_32
5.3601   36.3617   cputlb.c:114            tlb_flush_nocheck
3.6170   39.9787   translate-all.c:749     page_trylock_add
3.1188   43.0975   translate-all.c:803     page_collection_lock
3.0405   46.1380   exec.c:3025             iotlb_to_section
2.7044   48.8424   softmmu_template.h:112  helper_ret_ldub_mmu
2.4154   51.2578   memory.c:1350           memory_region_access_valid

[...]
My first thought is that there is a QOM cast somewhere in a hot path on -M 
mac99 - can you show us the call stack information from the profile?


Not really. Oprofile, which gave me this profile, could not display a call
graph for this call. I've tried again with the perf tool but I'm not quite
sure how to interpret its results. If I'm not mistaken, it points me in the
direction of cpu_asidx_from_attrs, called from iotlb_to_section, called
from io_readx. The object_class_dynamic_cast_assert call likely comes from
OBJECT_CLASS_CHECK, expanded from OBJECT_GET_CLASS, expanded from
CPU_GET_CLASS. Would that make any sense? Any idea what to do about it?


Good question. A quick grep for 'asidx_from_attrs' shows that 
cc->asidx_from_attrs() isn't set for PPC targets, so as a quick test does 
replacing the inline function cpu_asidx_from_attrs() in include/qom/cpu.h 
with a simple "return 0" change the profile at all?


It does seem to lessen its impact but it's still higher than I expected:

%        cum. %    linenr info             symbol name
10.7949  10.7949   exec-all.h:410          helper_lookup_tb_ptr
 7.8663  18.6612   cputlb.c:793            io_readx
 6.0265  24.6878   cputlb.c:114            tlb_flush_nocheck
 4.0671  28.7548   sm501_template.h:62     draw_line16_32
 4.0559  32.8107   object.c:765            object_class_dynamic_cast_assert
 3.3780  36.1887   memory.c:1350           memory_region_access_valid
 2.8920  39.0808   qemu-thread-posix.c:61  qemu_mutex_lock_impl
 2.7187  41.7995   memory.c:1415           memory_region_dispatch_read
 2.6011  44.4006   qht.c:487               qht_lookup_custom
 2.5356  46.9362   softmmu_template.h:112  helper_ret_ldub_mmu

Maybe it's called from somewhere else too? I know draw_line16_32 but I
wonder where helper_lookup_tb_ptr and the TLB flushes could come from.
Those seem to be significant. And io_readx itself also seems too high on
the list. I wonder if it may have something to do with the background task
trying to read non-implemented i2c stuff frequently (as discussed in point
2. in http://zero.eik.bme.hu/~balaton/qemu/amiga/#morphos).


Regards,
BALATON Zoltan



Re: [Qemu-devel] [Qemu-ppc] Profiling results

2018-07-17 Thread Mark Cave-Ayland

On 17/07/18 20:35, BALATON Zoltan wrote:


On Tue, 17 Jul 2018, Mark Cave-Ayland wrote:
MorphOS on mac99 this seems to be significant. This is with default 
configure (--enable-qom-cast-debug):


%        cum. %    linenr info             symbol name
9.7057    9.7057   exec-all.h:410          helper_lookup_tb_ptr
8.0330   17.7387   object.c:711            object_class_dynamic_cast_assert
6.9411   24.6798   cputlb.c:793            io_readx
6.3219   31.0017   sm501_template.h:62     draw_line16_32
5.3601   36.3617   cputlb.c:114            tlb_flush_nocheck
3.6170   39.9787   translate-all.c:749     page_trylock_add
3.1188   43.0975   translate-all.c:803     page_collection_lock
3.0405   46.1380   exec.c:3025             iotlb_to_section
2.7044   48.8424   softmmu_template.h:112  helper_ret_ldub_mmu
2.4154   51.2578   memory.c:1350           memory_region_access_valid

[...]
My first thought is that there is a QOM cast somewhere in a hot path 
on -M mac99 - can you show us the call stack information from the 
profile?


Not really. Oprofile, which gave me this profile, could not display a call
graph for this call. I've tried again with the perf tool but I'm not quite
sure how to interpret its results. If I'm not mistaken, it points me in the
direction of cpu_asidx_from_attrs, called from iotlb_to_section, called
from io_readx. The object_class_dynamic_cast_assert call likely comes from
OBJECT_CLASS_CHECK, expanded from OBJECT_GET_CLASS, expanded from
CPU_GET_CLASS. Would that make any sense? Any idea what to do about it?


Good question. A quick grep for 'asidx_from_attrs' shows that 
cc->asidx_from_attrs() isn't set for PPC targets, so as a quick test 
does replacing the inline function cpu_asidx_from_attrs() in 
include/qom/cpu.h with a simple "return 0" change the profile at all?
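
To make that concrete, the quick test would be roughly this in
include/qom/cpu.h (a sketch, not an exact diff against the tree):

    /* Quick test only: skip the CPU_GET_CLASS lookup entirely.  For
     * targets such as PPC that never set cc->asidx_from_attrs the result
     * is the same (address space index 0), but the per-access QOM class
     * lookup and dynamic cast disappear from the hot path. */
    static inline int cpu_asidx_from_attrs(CPUState *cpu, MemTxAttrs attrs)
    {
        return 0;
    }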



ATB,

Mark.



Re: [Qemu-devel] [Qemu-ppc] Profiling results

2018-07-17 Thread BALATON Zoltan

On Tue, 17 Jul 2018, Mark Cave-Ayland wrote:
MorphOS on mac99 this seems to be significant. This is with default 
configure (--enable-qom-cast-debug):


%        cum. %    linenr info             symbol name
9.7057    9.7057   exec-all.h:410          helper_lookup_tb_ptr
8.0330   17.7387   object.c:711            object_class_dynamic_cast_assert
6.9411   24.6798   cputlb.c:793            io_readx
6.3219   31.0017   sm501_template.h:62     draw_line16_32
5.3601   36.3617   cputlb.c:114            tlb_flush_nocheck
3.6170   39.9787   translate-all.c:749     page_trylock_add
3.1188   43.0975   translate-all.c:803     page_collection_lock
3.0405   46.1380   exec.c:3025             iotlb_to_section
2.7044   48.8424   softmmu_template.h:112  helper_ret_ldub_mmu
2.4154   51.2578   memory.c:1350           memory_region_access_valid

[...]
My first thought is that there is a QOM cast somewhere in a hot path on -M 
mac99 - can you show us the call stack information from the profile?


Not really. Oprofile, which gave me this profile, could not display a call
graph for this call. I've tried again with the perf tool but I'm not quite
sure how to interpret its results. If I'm not mistaken, it points me in the
direction of cpu_asidx_from_attrs, called from iotlb_to_section, called
from io_readx. The object_class_dynamic_cast_assert call likely comes from
OBJECT_CLASS_CHECK, expanded from OBJECT_GET_CLASS, expanded from
CPU_GET_CLASS. Would that make any sense? Any idea what to do about it?
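
For reference, the chain roughly expands like this (a paraphrase of
include/qom/cpu.h and include/qom/object.h, not a verbatim copy);
OBJECT_CLASS_CHECK is what ends up calling object_class_dynamic_cast_assert
on every use, which is what shows up in the profile:

    /* How CPU_GET_CLASS(cpu) reaches object_class_dynamic_cast_assert: */
    #define CPU_GET_CLASS(obj) \
        OBJECT_GET_CLASS(CPUClass, (obj), TYPE_CPU)

    #define OBJECT_GET_CLASS(class, obj, name) \
        OBJECT_CLASS_CHECK(class, object_get_class(OBJECT(obj)), name)

    #define OBJECT_CLASS_CHECK(class_type, class, name) \
        ((class_type *)object_class_dynamic_cast_assert( \
            OBJECT_CLASS(class), (name), __FILE__, __LINE__, __func__))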


Thank you,
BALATON Zoltan



Re: [Qemu-devel] [Qemu-ppc] Profiling results

2018-07-17 Thread Mark Cave-Ayland

On 17/07/18 16:09, BALATON Zoltan wrote:


On Mon, 16 Jul 2018, Peter Maydell wrote:

Is this coming up as significant in profiling? In the past we've


This seems to depend on the workload. Of the cases I'm interested in, AROS
and AmigaOS on qemu-system-ppc -M sam460ex do not seem to be affected much
(object_class_dynamic_cast_assert is not in the top 10, at <2%), but for
MorphOS on mac99 this seems to be significant. This is with the default
configure (--enable-qom-cast-debug):


%        cum. %    linenr info             symbol name
9.7057    9.7057   exec-all.h:410          helper_lookup_tb_ptr
8.0330   17.7387   object.c:711            object_class_dynamic_cast_assert
6.9411   24.6798   cputlb.c:793            io_readx
6.3219   31.0017   sm501_template.h:62     draw_line16_32
5.3601   36.3617   cputlb.c:114            tlb_flush_nocheck
3.6170   39.9787   translate-all.c:749     page_trylock_add
3.1188   43.0975   translate-all.c:803     page_collection_lock
3.0405   46.1380   exec.c:3025             iotlb_to_section
2.7044   48.8424   softmmu_template.h:112  helper_ret_ldub_mmu
2.4154   51.2578   memory.c:1350           memory_region_access_valid

and improves a bit (but not much) with --disable-qom-cast-debug

%        cum. %    linenr info             symbol name
10.2063  10.2063   exec-all.h:410          helper_lookup_tb_ptr
 7.1581  17.3644   object.c:711            object_class_dynamic_cast_assert
 5.9297  23.2941   sm501_template.h:62     draw_line16_32
 5.9227  29.2168   cputlb.c:793            io_readx
 5.3030  34.5198   cputlb.c:114            tlb_flush_nocheck
 3.6445  38.1643   memory.c:1350           memory_region_access_valid
 3.5499  41.7142   softmmu_template.h:112  helper_ret_ldub_mmu
 3.0383  44.7525   translate-all.c:803     page_collection_lock
 2.9735  47.7260   memory.c:1415           memory_region_dispatch_read
 2.9503  50.6763   translate-all.c:749     page_trylock_add

But the workloads may not have been 100% identical, so this is not
conclusive; maybe this debug code is not that expensive at the moment.


AROS on sam460ex has a different profile:

%        cum. %    linenr info             symbol name
8.9905    8.9905   translate-all.c:749     page_trylock_add
8.7658   17.7563   exec-all.h:410          helper_lookup_tb_ptr
7.7349   25.4911   translate-all.c:803     page_collection_lock
5.8246   31.3158   cputlb.c:924            victim_tlb_hit
3.1640   34.4797   cpus.c:347              cpu_get_clock
3.1538   37.6335   translate-all.c:788     tb_page_addr_cmp
2.7969   40.4304   exec.c:435              address_space_translate_internal
2.6647   43.0951   memory.c:571            access_with_adjusted_size
2.0615   45.1567   exec.c:569              flatview_do_translate
1.9586   47.1153   memory.c:1350           memory_region_access_valid

Would anyone be able to guess which places should be looked at, or what to
check to get more info on this?


My first thought is that there is a QOM cast somewhere in a hot path on 
-M mac99 - can you show us the call stack information from the profile?


I had a similar issue with SPARC32, whereby each DMA request needs to be
manually word-swapped: instead of adding a QOM cast to these routines, I
did a direct C cast from the opaque pointer to keep the overhead as low as
possible.
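
As an illustration of that pattern (a made-up device, not the actual SPARC32
code):

    #include "qemu/osdep.h"
    #include "qemu/bswap.h"
    #include "hw/sysbus.h"

    /* Hypothetical device state, for illustration only. */
    typedef struct MyDevState {
        SysBusDevice parent_obj;
        uint32_t regs[64];
    } MyDevState;

    /* In an MMIO/DMA callback the opaque pointer is already known to be
     * the device that registered the region, so a plain C cast avoids the
     * per-call type check a QOM cast macro such as MYDEV(opaque) would do. */
    static uint64_t mydev_dma_read(void *opaque, hwaddr addr, unsigned size)
    {
        MyDevState *s = (MyDevState *)opaque;

        return bswap32(s->regs[(addr >> 2) & 63]);  /* manual word swap */
    }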



ATB,

Mark.