Re: [RFC PATCH 0/7] Introduce thermal pressure

Quentin Perret Wed, 10 Oct 2018 01:30:02 -0700

Hi Thara,

On Wednesday 10 Oct 2018 at 08:17:51 (+0200), Ingo Molnar wrote:
> 
> * Thara Gopinath <[email protected]> wrote:
> 
> > Thermal governors can respond to an overheat event for a cpu by
> > capping the cpu's maximum possible frequency. This in turn
> > means that the maximum available compute capacity of the
> > cpu is restricted. But today in linux kernel, in event of maximum
> > frequency capping of a cpu, the maximum available compute
> > capacity of the cpu is not adjusted at all. In other words, scheduler
> > is unaware of the maximum cpu capacity restrictions placed due to thermal
> > activity. This patch series attempts to address this issue.
> > The benefits identified are better task placement among available
> > cpus in event of overheating which in turn leads to better
> > performance numbers.
> > 
> > The delta between the maximum possible capacity of a cpu and
> > maximum available capacity of a cpu due to thermal event can
> > be considered as thermal pressure. Instantaneous thermal pressure
> > is hard to record and can sometimes be erroneous as there can be a mismatch
> > between the actual capping of capacity and the scheduler recording it.
> > Thus the solution is to have a weighted average per cpu value for thermal
> > pressure over time. The weight reflects the amount of time the cpu has
> > spent at a capped maximum frequency. To accumulate, average and
> > appropriately decay thermal pressure, this patch series uses pelt
> > signals and reuses the available framework that does a similar
> > bookkeeping of rt/dl task utilization.
> > 
> > Regarding testing, basic build, boot and sanity testing have been
> > performed on hikey960 mainline kernel with debian file system.
> > Further aobench (An occlusion renderer for benchmarking realworld
> > floating point performance) showed the following results on hikey960
> > with debian.
> > 
> >                                         Result       Standard  Standard
> >                                         (Time secs)  Error     Deviation
> > Hikey 960 - no thermal pressure applied 138.67       6.52      11.52%
> > Hikey 960 - thermal pressure applied    122.37       5.78      11.57%
> 
> Wow, +13% speedup, impressive! We definitely want this outcome.
> 
> I'm wondering what happens if we do not track and decay the thermal load at
> all at the PELT level, but instantaneously decrease/increase effective CPU
> capacity in reaction to thermal events we receive from the CPU.


+1, it's not that obvious (to me at least) that averaging the thermal
pressure over time is necessarily what we want. Say the thermal governor
caps a CPU and 'removes' 70% of its capacity, it will take forever for
the PELT signal to ramp up to that level before the scheduler can react.
And the other way around, if you release the cap, it'll take a while
before we actually start using the newly available capacity. I can also
imagine how reacting too fast can be counter-productive, but I guess
having numbers and/or use-cases to show that would be great :-)
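
To put a rough number on the ramp-up delay, here is a minimal sketch of a
PELT-style geometric average. It assumes one update per period of roughly
1 ms and a decay factor y chosen so that y^32 = 0.5. The helper names
(step, periods_to_reach) are made up for illustration, and this is a
floating-point simplification, not the kernel's fixed-point code:

```python
# Simplified PELT-style average: one update per ~1 ms period, with the
# decay factor Y chosen so that Y**32 == 0.5 (a 32-period half-life).
# Assumption: floating-point model only, not the kernel implementation.
HALF_LIFE_PERIODS = 32
Y = 0.5 ** (1.0 / HALF_LIFE_PERIODS)


def step(avg, sample):
    """Advance the average by one period given the instantaneous sample."""
    return avg * Y + sample * (1.0 - Y)


def periods_to_reach(target, sample, start=0.0, limit=10_000):
    """Count periods until the average rises from start to at least target."""
    avg = start
    for n in range(1, limit + 1):
        avg = step(avg, sample)
        if avg >= target:
            return n
    return None  # target not reached within `limit` periods


# A cap 'removes' 70% of capacity, i.e. instantaneous pressure of 0.7.
CAP = 0.7
half = periods_to_reach(0.5 * CAP, CAP)
most = periods_to_reach(0.9 * CAP, CAP)
print(f"50% of the cap seen after {half} periods, 90% after {most} periods")
```

Under these assumptions the average needs about one half-life to reflect
half of the cap and on the order of a hundred periods to reflect 90% of it.
Releasing the cap decays at the same rate in the other direction.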

Thara, have you tried to experiment with a simpler implementation as
suggested by Ingo ?

Also, assuming that we do want to average things, do we actually want to
tie the thermal ramp-up time to the PELT half life ? That provides
nice maths properties wrt the other signals, but it's not obvious to me
that this thermal 'constant' should be the same on all platforms. Or
maybe it should ?
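
For what it's worth, the effect of picking a different constant can be
estimated in closed form. For a geometric average starting at 0 with
half-life h, the time to reach a fraction f of a constant input is
h * log2(1 / (1 - f)). The sketch below just evaluates that; the function
name and the list of half-lives are hypothetical, with 32 ms standing in
for the usual PELT value:

```python
import math


def ramp_time_ms(half_life_ms, fraction):
    """Time for a geometric average starting at 0 to reach `fraction`
    of a constant input, given the half-life of the decay."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    return half_life_ms * math.log2(1.0 / (1.0 - fraction))


# Hypothetical half-lives; 32 ms stands in for the usual PELT setting.
for half_life_ms in (8, 16, 32, 64):
    t90 = ramp_time_ms(half_life_ms, 0.9)
    print(f"half-life {half_life_ms:>2} ms -> 90% of the cap after {t90:.0f} ms")
```

The ramp-up time scales linearly with the half-life, so a platform-specific
constant would directly trade reaction speed against smoothing.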

Thanks,
Quentin

> 
> You describe the averaging as:
> 
> > Instantaneous thermal pressure is hard to record and can sometimes be
> > erroneous as there can be mismatch between the actual capping of capacity
> > and scheduler recording it.
> 
> Not sure I follow the argument here: are there bogus thermal throttling
> events? If so then they are hopefully not frequent enough and should average
> out over time even if we follow it instantly.
> 
> I.e. what is 'can sometimes be erroneous', exactly?
> 
> Thanks,
> 
>       Ingo
