> On 01/19/2018 12:57 PM, Martin Liška wrote:
> > Yes, there's a huge difference between CPU 2006 and 2017.  The former has
> > 63% w/ dominant edges, and the latter only 11%.  It's caused by these 2
> > benchmarks with a high coverage:
> > 
> 
> Hi.
> 
> I'm sending details about the 2 edges that influence the statistics 
> significantly:
> 
> > 500.perlbench_r: regexec.c.065i.profile:
> >   negative return heuristics of edge 1368->1370: 2.0%  exec 2477714850 hit 
> > 2429863555 (98.1%)
> 
> 2477714850: 3484:S_regtry(pTHX_ regmatch_info *reginfo, char **startposp)
>         -: 3485:{
> 2477714850: 3486:    CHECKPOINT lastcp;
> 2477714850: 3487:    REGEXP *const rx = reginfo->prog;
> 2477714850: 3488:    regexp *const prog = ReANY(rx);
> 2477714850: 3489:    SSize_t result;
> [snip]
> 2477714850: 8046:    assert(!result ||  locinput - reginfo->strbeg >= 0);
> 2477714850: 8047:    return result ?  locinput - reginfo->strbeg : -1;
>         -:  8048:}
> 
> As seen, it returns -1 if a regex match is not found, which in the case of
> perlbench is a very likely branch.
> 
> > 
> > and 523.xalancbmk_r:
> > build/build_peak_gcc7-m64.0000/NameDatatypeValidator.cpp.065i.profile:  
> > negative return heuristics of edge 3->4: 2.0%  exec 1221735072 hit 
> > 1221522453 (100.0%)
> 
> 1221735072:   74:int NameDatatypeValidator::compare(const XMLCh* const lValue
>         -:   75:                                   , const XMLCh* const rValue
>         -:   76:                                   ,       MemoryManager*     const)
>         -:   77:{
> 1221735072:   78:    return ( XMLString::equals(lValue, rValue)? 0 : -1);
>         -:   79:}
>         -:   80:
> 
> IP profile dump file:
> 
> xercesc_2_7::NameDatatypeValidator::compare (struct NameDatatypeValidator * 
> const this, const XMLCh * const lValue, const XMLCh * const rValue, struct 
> MemoryManager * const D.17157)
> {
>   bool _1;
>   int iftmp.0_2;
> 
>   <bb 2> [count: 1221735072]:
>   # DEBUG BEGIN_STMT
>   _1 = xercesc_2_7::XMLString::equals (lValue_4(D), rValue_5(D));
>   if (_1 != 0)
>     goto <bb 4>; [0.02%]
>   else
>     goto <bb 3>; [99.98%]
> 
>   <bb 3> [count: 1221522453]:
> 
>   <bb 4> [count: 1221735072]:
>   # iftmp.0_2 = PHI <0(2), -1(3)>
>   return iftmp.0_2;
> 
> }
> 
> Likewise, the XML values are more commonly non-equal.
> OK, so may I also mark the negative return predictor with PROB_EVEN to track it?

Well, if we start disabling predictors just because we can find a benchmark
where they do not perform as expected, we will need to disable them all.  It is
the nature of heuristics to make mistakes; they just should do something useful
for common code.  The goal is not to make heuristics that make no mistakes.
A heuristic is useful either if it has high precision even while covering few
branches (say, noreturn), or if it has large coverage and still some hitrate
(say, the opcode, call or return based heuristics).
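
For concreteness, here is a minimal sketch (not the actual analysis script) of
how coverage and hitrate could be tallied per predictor, assuming the per-edge
records like the two quoted above have already been parsed into
(predictor, exec, hit) tuples; the "noreturn call" and "opcode values nonequal"
rows and all of their numbers are invented for illustration:

from collections import defaultdict

# Per-edge records as (predictor, exec count, hit count); the two
# "negative return" rows are the edges quoted above, the other two
# rows and their numbers are invented for illustration.
records = [
    ("negative return", 2477714850, 2429863555),
    ("negative return", 1221735072, 1221522453),
    ("noreturn call", 12000, 12000),
    ("opcode values nonequal", 9000000, 6300000),
]

totals = defaultdict(lambda: [0, 0])        # predictor -> [exec, hit]
for name, ex, hit in records:
    totals[name][0] += ex
    totals[name][1] += hit

all_exec = sum(ex for ex, _ in totals.values())
for name, (ex, hit) in totals.items():
    coverage = ex / all_exec                # share of predicted executions
    hitrate = hit / ex                      # how often the prediction held
    print("%-25s coverage %6.2f%%  hitrate %6.2f%%"
          % (name, 100.0 * coverage, 100.0 * hitrate))

A predictor like noreturn would show tiny coverage with a near-perfect hitrate,
while the return based predictors show large coverage with a lower one.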

To make things a bit more esoteric, in practice this is also weighted by how
much damage a bad prediction does.  For example, the call heuristic, which
predicts the non-call path to be taken, is likely to do little damage.  Even if
the call path is common, it is heavy and hard to optimize due to the presence
of the call, and thus we don't lose that much in common scenarios.

In the second case:
// -----------------------------------------------------------------------
// Compare methods
// -----------------------------------------------------------------------
int NameDatatypeValidator::compare(const XMLCh* const lValue
                                   , const XMLCh* const rValue
                                   ,       MemoryManager*     const)
{
    return ( XMLString::equals(lValue, rValue)? 0 : -1);
}

I would say it is odd coding style.  I do not see why it is not declared
bool and does not return true/false.

The first case seems a bit non-standard too, as it looks like a usual cache
lookup where the cache frequently fails.

I would thus be in favor of keeping the prediction with a non-50% hitrate,
unless we run into some performance problems.
If there are only two such benchmarks out of the roughly 20 we tried (in
SPEC 2000 and 2017), it seems fine to me.

There is an issue with the way we collect our data.  Using the arithmetic mean
across SPEC optimizes for the overall runtime of all SPEC benchmarks combined.
We would rather look for safer values that do well for the average benchmark,
independently of how many branches it executes.  We could consider making the
statistics script also collect geometric averages across the different
benchmarks.  If we see a big difference between the arithmetic mean and the
geomean, we will know that a small subset of benchmarks dominates and we can
decide what to do with the probability.
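
As a sketch of what such a comparison might look like (this is not the existing
statistics script; the per-benchmark totals below reuse the two edges quoted
earlier, plus an invented third benchmark):

from math import prod

# Per-benchmark (hit, exec) totals for one predictor; the first two rows
# reuse the edges quoted earlier in the thread, the third row is invented.
samples = {
    "500.perlbench_r": (2429863555, 2477714850),
    "523.xalancbmk_r": (1221522453, 1221735072),
    "some_other_benchmark": (700, 1000),
}

# Exec-weighted arithmetic mean: the combined view, dominated by whichever
# benchmark executes the predicted edges most often.
weighted = (sum(h for h, _ in samples.values())
            / sum(e for _, e in samples.values()))

# Per-benchmark hitrates, each benchmark counted once regardless of volume.
rates = [hit / ex for hit, ex in samples.values()]
arith = sum(rates) / len(rates)
geomean = prod(rates) ** (1.0 / len(rates))

print("exec-weighted mean: %.2f%%" % (100.0 * weighted))
print("arithmetic mean:    %.2f%%" % (100.0 * arith))
print("geometric mean:     %.2f%%" % (100.0 * geomean))

Because the geomean weights every benchmark equally, a single benchmark with a
huge branch count cannot hide a poor hitrate on the others, which is exactly
the signal the comparison is meant to expose.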

Honza

> 
> Thanks,
> Martin
> 
> > 
> > Any ideas on what to do with the predictor for the GCC 8 release?
> > Martin
