Hi Honza,

> On 24 Jun 2025, at 4:37 pm, Jan Hubicka <hubi...@ucw.cz> wrote:
>
> External email: Use caution opening links or attachments
>
>
>>>
>>> With part suffixes we also may want to merge specially, since the
>>> entry_count of the split part does not correspond to entry_count of the
>>> original function.
>>>
>>> I wonder, does partitioned function work with the google tool?  I
>>> remember it had limitations in this respect.
>>>
>>
>> Yes, Here are some examples.
>>
>> _Z17expand_assignmentP9tree_nodeS0_b.part.0 total:7045 head:297
>>  0: 297
>>  20: 297
>>
>> _Z17expand_assignmentP9tree_nodeS0_b total:1488 head:277
>>  1: 277
>>  9: 277
>> Here, we should keep the head as it is as head is for offset 1.
>
> I actually had in ming the .cold partition
> (-freprder-blocks-and-partitoin)
> but this is interesting too.  We should track if we stripped .part
> suffix and in that case do not merge in head counts.
>
> However # of invocations of
> _Z17expand_assignmentP9tree_nodeS0_b.part.0
> should always be strictly lower than #of invocations of
> _Z17expand_assignmentP9tree_nodeS0_b
>
> this is not reflected by head count, since I suppose the second is
> inlined into some contexts which makes the execution to be accounted
> spearately into their inline instances.
>
> So merging the profiles will also lead to inconsistencies making the
> .part variant to seem more hot than it is...

I am looking into this and will post the patch as a follow up patch.
>
>>
>>
>> _Z19recompute_dominator13cdi_directionP15basic_block_def.part.0 total:1182 
>> head:13
>>  0: 13
>>  3: 13
>>  11: 13
>>
>> _Z19recompute_dominator13cdi_directionP15basic_block_def total:11 head:9
>>  1: 0
>>  3: 0
>>  9: 9
>>
>> Here also, we should keep the head as it is as head is for offset 9.
>>
>> _Z22init_attr_rdwr_indicesP8hash_mapI16rdwr_access_hash11attr_access21simple_hashmap_traitsI19default_hash_traitsIS0_ES1_EEP9tree_node.part.0
>>  total:85 head:5
>>  0: 8
>>  11: 0
>>  12: 0
>>  16: 0
>>  17: 0
>>  18: 0
>>  20: 0
>>  21: 0
>>  25: 0
>>  25.1: 2
>>  27: 2
>>  30: 0
>>  31: 0
>>  34: 0
>>  35: 2
>>  38: 2
>>  38.1: 2
>>  39: 2
>>  41: 2
>>  46: 2
>>  52.1: 0
>>  54: 0
>>  54.1: 0
>>  56: 8
>>  57: 0
>>  59: 0
>>  62: 0
>>  63: 3
>>  65: 0
>>  71.1: 0
>>  77: 0
>>  78: 0
>>  81: 3
>>  84: 2
>>  86: 0
>>  89: 0
>>  91: 0
>>  92: 0
>>  92.1: 0
>>  98: 0
>>  99: 0
>>  103: 0
>>  108: 0
>>  108.1: 0
>>  111: 0
>>  114: 0
>>  120: 1
>>  124: 0
>>  125: 0
>>  127: 0
>>  128: 0
>>  130: 0
>>  131: 0
>>  134: 0
>>  139: 0
>>  140: 0
>>  143: 1
>>  6: lookup_attribute total:40
>>    4: 5
>>
>>
>> _Z22init_attr_rdwr_indicesP8hash_mapI16rdwr_access_hash11attr_access21simple_hashmap_traitsI19default_hash_traitsIS0_ES1_EEP9tree_node
>>  total:212 head:71
>>  2: 71  
>> _Z22init_attr_rdwr_indicesP8hash_mapI16rdwr_access_hash11attr_access21simple_hashmap_traitsI19default_hash_traitsIS0_ES1_EEP9tree_node.part.0:5
>>  143: 141
>>
>> This looks odd. Looks like create_gcovt getting  mixed up with the offset of 
>> inlined functions
>
> I am not sure I follow what you mean here?

Head count here is 5 ( head:5). But the sample counts for the offset does not 
match this.  Except that:
>>  6: lookup_attribute total:40
>>    4: 5


This looks wrong?


Here is the revised patch for get_original_name. Also added a test case.
Is this OK?

Thanks,
Kugan



>
> This is my current benchmark run with -Ofast -mtune=native (without LTO)
> comparing no feedback (base) to autofdo (peak)
>
> 500.perlbench_r       1      167         9.51  *       1      155        10.3 
>   *
> 502.gcc_r             1      132        10.7   *       1      126        11.2 
>   *
> 505.mcf_r             1      226         7.16  *       1      225         
> 7.20  *
> 520.omnetpp_r         1      203         6.47  *       1      203         
> 6.47  *
> 523.xalancbmk_r                               NR                              
>  NR
> 525.x264_r            1       84.7      20.7   *       1       90.7      19.3 
>   *
> 531.deepsjeng_r       1      208         5.50  *       1      209         
> 5.47  *
> 541.leela_r           1      295         5.61  *       1      318         
> 5.21  *
> 548.exchange2_r       1       85.9      30.5   *       1       93.3      28.1 
>   *
> 557.xz_r              1      225         4.79  *       1      220         
> 4.90  *
> Est. SPECrate2017_int_base              9.13
> Est. SPECrate2017_int_peak                                               9.05
>
> So there are regressions in x264, deepsjeng, leela and exchange neighter
> of them very bad.  I think it would be interesting to understand
> 541.leela_r first.
>
> Honza
>>
>> Thanksm
>> Kugabn


Attachment: 0001-AutoFDO-v3-Fix-get_original_name-to-strip-only-names.patch
Description: 0001-AutoFDO-v3-Fix-get_original_name-to-strip-only-names.patch

Reply via email to