Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-27 Thread Viresh Kumar
On 27-10-20, 11:42, Qais Yousef wrote:
> On 10/27/20 11:26, Valentin Schneider wrote:
> > 
> > On 27/10/20 11:11, Qais Yousef wrote:
> > > On 10/22/20 14:02, Peter Zijlstra wrote:
> > >> However I do want to retire ondemand, conservative and also very much
> > >> intel_pstate/active mode. I also have very little sympathy for
> > >> userspace.
> > >
> > > Userspace is useful for testing and sanity checking. Not sure if people use it
> > > to measure voltage/current at each frequency to generate
> > > dynamic-power-coefficient for their platform. Lukasz, Dietmar?
> > >
> > 
> > It's valuable even just for cpufreq sanity checking - we have that test
> > that goes through increasing frequencies and asserts the work done is
> > monotonically increasing. This has been quite useful in the past to detect
> > broken bits.
> > 
> > That *should* still be totally doable with any other governor by using the
> > scaling_{min, max}_freq sysfs interface.
> 
> True. This effectively makes every governor a potential user space governor.
> 
> /me not sure to be happy or grumpy about it

The userspace governor should be kept as is; it is very effective at
getting unnecessary governor code out of the path when testing the
basic functioning of the hardware/driver. It is quite useful when
things don't work as expected.

-- 
viresh


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-27 Thread Qais Yousef
On 10/27/20 11:26, Valentin Schneider wrote:
> 
> On 27/10/20 11:11, Qais Yousef wrote:
> > On 10/22/20 14:02, Peter Zijlstra wrote:
> >> However I do want to retire ondemand, conservative and also very much
> >> intel_pstate/active mode. I also have very little sympathy for
> >> userspace.
> >
> > Userspace is useful for testing and sanity checking. Not sure if people use it
> > to measure voltage/current at each frequency to generate
> > dynamic-power-coefficient for their platform. Lukasz, Dietmar?
> >
> 
> It's valuable even just for cpufreq sanity checking - we have that test
> that goes through increasing frequencies and asserts the work done is
> monotonically increasing. This has been quite useful in the past to detect
> broken bits.
> 
> That *should* still be totally doable with any other governor by using the
> scaling_{min, max}_freq sysfs interface.

True. This effectively makes every governor a potential user space governor.

/me not sure to be happy or grumpy about it

Thanks

--
Qais Yousef


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-27 Thread Valentin Schneider


On 27/10/20 11:11, Qais Yousef wrote:
> On 10/22/20 14:02, Peter Zijlstra wrote:
>> However I do want to retire ondemand, conservative and also very much
>> intel_pstate/active mode. I also have very little sympathy for
>> userspace.
>
> Userspace is useful for testing and sanity checking. Not sure if people use it
> to measure voltage/current at each frequency to generate
> dynamic-power-coefficient for their platform. Lukasz, Dietmar?
>

It's valuable even just for cpufreq sanity checking - we have that test
that goes through increasing frequencies and asserts the work done is
monotonically increasing. This has been quite useful in the past to detect
broken bits.

That *should* still be totally doable with any other governor by using the
scaling_{min, max}_freq sysfs interface.
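
A minimal sketch of that idea, assuming the standard per-policy sysfs files
scaling_min_freq and scaling_max_freq. The helper names (pin_frequency,
is_strictly_increasing) are made up for illustration, the workload
measurement is left to the caller, and writing to the real sysfs tree needs
root:

```python
from pathlib import Path

# Default location of the per-policy cpufreq directories.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu/cpufreq")


def pin_frequency(policy, freq_khz, root=CPUFREQ_ROOT):
    """Clamp a policy to one frequency by writing both sysfs limits.

    The limit that has to move "outwards" is written first, so that
    min never exceeds max at any point.
    """
    base = root / policy
    cur_max = int((base / "scaling_max_freq").read_text())
    order = ["scaling_max_freq", "scaling_min_freq"]
    if freq_khz < cur_max:
        order.reverse()  # going down: lower min first, then max
    for name in order:
        (base / name).write_text(str(freq_khz))


def is_strictly_increasing(work_done):
    """True if each measurement is larger than the previous one."""
    return all(b > a for a, b in zip(work_done, work_done[1:]))
```

A test would call pin_frequency() for each available frequency in ascending
order, run a fixed-duration workload, and assert is_strictly_increasing() on
the collected results.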

> Thanks


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-27 Thread Qais Yousef
On 10/22/20 14:02, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote:
> > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > > On 22-10-20, 09:11, Peter Zijlstra wrote:
> > > > Well, but we need to do something to force people onto schedutil,
> > > > otherwise we'll get more crap like this thread.
> > > > 
> > > > Can we take the choice away? Only let Kconfig select which governors are
> > > > available and then set the default ourselves? I mean, the end goal being
> > > > to not have selectable governors at all, this seems like a good step
> > > > anyway.
> > > 
> > > Just to clarify and complete the point a bit here, the users can still
> > > pass the default governor from cmdline using
> > > cpufreq.default_governor=, which will take precedence over the one the
> > > below code is playing with. And later once the kernel is up, they can
> > > still choose a different governor from userspace.
> > 
> > Right.
> > 
> > Also some people simply set "performance" as the default governor and then
> > don't touch cpufreq otherwise (the idea is to get everything to the max
> > freq right away and stay in that mode forever).  This still needs to be
> > possible IMO.
> 
> Performance/powersave make sense to keep.
> 
> However I do want to retire ondemand, conservative and also very much
> intel_pstate/active mode. I also have very little sympathy for
> userspace.

Userspace is useful for testing and sanity checking. Not sure if people use it
to measure voltage/current at each frequency to generate
dynamic-power-coefficient for their platform. Lukasz, Dietmar?

Thanks

--
Qais Yousef

> 
> We should start by making it hard to use them and eventually just delete
> them outright.
> 


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-26 Thread Fontenot, Nathan
On 10/23/2020 12:46 PM, Tom Lendacky wrote:
> On 10/23/20 2:03 AM, Peter Zijlstra wrote:
>> On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote:
>>> * for the AMD EPYC machines we haven't yet implemented frequency invariant
>>>    accounting, which might explain why schedutil loses to ondemand on all
>>>    the benchmarks.
>>
>> Right, I poked the AMD people on that a few times, but nothing seems to
>> be forthcoming :/ Tom, any way you could perhaps expedite the matter?
> 
> Adding Nathan to the thread to help out here.
> 
> Thanks,
> Tom

Thanks Tom, diving in...

> 
>>
>> In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in
>>
>>    arch/x86/kernel/smpboot.c:init_freq_invariance()
>>
>> The main issue is finding a 'max' frequency that is not the absolute max
>> turbo boost (this could result in not reaching it very often) but also
>> not too low such that we're always clipping.

I've started looking into this and have a lead but need to confirm that the
frequency value I'm getting is not an absolute max.

>>
>> And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK
>> the chips have a CPPC interface which could be used instead. Is there
>> any progress on that?
>>

Correct, AMD uses acpi_cpufreq. The newer AMD chips do have a CPPC interface
(not sure how far back 'newer' covers). I'll take a look at schedutil and
cppc_cpufreq and the possibility of transitioning to them for AMD.

-Nathan


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-23 Thread Tom Lendacky

On 10/23/20 2:03 AM, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote:
>> * for the AMD EPYC machines we haven't yet implemented frequency invariant
>>    accounting, which might explain why schedutil loses to ondemand on all
>>    the benchmarks.
>
> Right, I poked the AMD people on that a few times, but nothing seems to
> be forthcoming :/ Tom, any way you could perhaps expedite the matter?

Adding Nathan to the thread to help out here.

Thanks,
Tom

>
> In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in
>
>    arch/x86/kernel/smpboot.c:init_freq_invariance()
>
> The main issue is finding a 'max' frequency that is not the absolute max
> turbo boost (this could result in not reaching it very often) but also
> not too low such that we're always clipping.
>
> And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK
> the chips have a CPPC interface which could be used instead. Is there
> any progress on that?



Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-23 Thread Peter Zijlstra
On Thu, Oct 22, 2020 at 10:10:35PM +0200, Giovanni Gherdovich wrote:
> * for the AMD EPYC machines we haven't yet implemented frequency invariant
>   accounting, which might explain why schedutil loses to ondemand on all
>   the benchmarks.

Right, I poked the AMD people on that a few times, but nothing seems to
be forthcoming :/ Tom, any way you could perhaps expedite the matter?

In particular we're looking for some X86_VENDOR_AMD/HYGON code to run in

  arch/x86/kernel/smpboot.c:init_freq_invariance()

The main issue is finding a 'max' frequency that is not the absolute max
turbo boost (this could result in not reaching it very often) but also
not too low such that we're always clipping.

And while we're here, IIUC AMD is still using acpi_cpufreq, but AFAIK
the chips have a CPPC interface which could be used instead. Is there
any progress on that?
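
The trade-off in picking that 'max' can be illustrated with a small sketch.
This is not the kernel's init_freq_invariance() code; the function name
freq_scale, the 1024 capacity scale, and the example frequencies are
assumptions chosen only to show the clipping effect:

```python
def freq_scale(cur_khz, ref_max_khz, capacity=1024):
    """Scale factor for utilization accounting, clipped at `capacity`.

    cur_khz     -- frequency the CPU actually ran at
    ref_max_khz -- the reference 'max' chosen for the platform
    """
    return min(capacity, cur_khz * capacity // ref_max_khz)


# Hypothetical part running at 3.0 GHz:
# reference = 3.6 GHz single-core turbo -> full scale is rarely reached
rarely_full = freq_scale(3_000_000, 3_600_000)
# reference = 2.0 GHz base clock -> anything above base is clipped
always_clipped = freq_scale(3_000_000, 2_000_000)
```

With the turbo reference the CPU never looks fully utilized at 3.0 GHz, while
with the base-clock reference every frequency above 2.0 GHz collapses to the
same value.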


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Viresh Kumar
On 22-10-20, 17:55, Vincent Guittot wrote:
> On Thu, 22 Oct 2020 at 17:45, A L  wrote:
> >
> >
> >
> >  From: Peter Zijlstra  -- Sent: 2020-10-22 - 14:29
> >
> > > On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> > >> > However I do want to retire ondemand, conservative and also very much
> > >> > intel_pstate/active mode.
> > >>
> > >> I agree in general, but IMO it would not be prudent to do that without making
> > >> schedutil provide the same level of performance in all of the relevant use
> > >> cases.
> > >
> > > Agreed; I thought I had understood we were there already.
> >
> > Hi,
> >
> >
> > Currently schedutil does not populate all stats like ondemand does, which can
> > be a problem for some monitoring software.
> >
> > On my AMD 3000G CPU with kernel-5.9.1:
> >
> >
> > grep . /sys/devices/system/cpu/cpufreq/policy0/stats/*
> >
> > With ondemand:
> > time_in_state:390 145179
> > time_in_state:160 9588482
> > total_trans:177565
> > trans_table:   From  :To
> > trans_table: :   390   160
> > trans_table:  390: 0 88783
> > trans_table:  160: 88782 0
> >
> > With schedutil only two files exist:
> > reset:
> > total_trans:216609
> >
> >
> > I'd really like to have these stats populated with schedutil, if that's
> > possible.
> 
> Your problem might have been fixed with
> commit 96f60cddf7a1 ("cpufreq: stats: Enable stats for fast-switch as well")

Thanks Vincent. Right, I have already fixed that for everyone.

-- 
viresh


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Phil Auld
On Thu, Oct 22, 2020 at 09:32:55PM +0100 Mel Gorman wrote:
> On Thu, Oct 22, 2020 at 07:59:43PM +0200, Rafael J. Wysocki wrote:
> > > > Agreed. I'd like the option to switch back if we make the default change.
> > > > It's on the table and I'd like to be able to go that way.
> > > >
> > >
> > > Yep. It sounds chicken, but it's a useful safety net and a reasonable
> > > way to deprecate a feature. It's also useful for bug creation -- User X
> > > running whatever found that schedutil is worse than the old governor and
> > > had to temporarily switch back. Repeat until complaining stops and then
> > > tear out the old stuff.
> > >
> > > When/if there is a patch setting schedutil as the default, cc suitable
> > > distro people (Giovanni and myself for openSUSE).
> > 
> > So for the record, Giovanni was on the CC list of the "cpufreq:
> > intel_pstate: Use passive mode by default without HWP" patch that this
> > discussion resulted from (and which kind of belongs to the above
> > category).
> > 
> 
> Oh I know, I did not mean to suggest that you did not. He made people
> aware that this was going to be coming down the line and has been looking
> into the "what if schedutil was the default" question.  AFAIK, it's still
> a work-in-progress and I don't know all the specifics but he knows more
> than I do on the topic. I only know enough that if we flipped the switch
> tomorrow that we could be plagued with google searches suggesting it be
> turned off again just like there is still broken advice out there about
> disabling intel_pstate for usually the wrong reasons.
> 
> The passive patch was a clear flag that the intent is that schedutil will
> be the default at some unknown point in the future. That point is now a
> bit closer and this thread could have encouraged a premature change of
> the default resulting in unfair finger pointing at one company's test
> team. If at least two distros check it out and it still goes wrong, at
> least there will be shared blame :/
> 
> > > Other distros assuming they're watching can nominate their own victim.
> > 
> > But no other victims had been nominated at that time.
> 
> We have one, possibly two if Phil agrees. That's better than zero or
> unfairly placing the full responsibility on the Intel guys that have been
> testing it out.
>

Yes. I agree and we (RHEL) are planning to test this soon. I'll try to get
to it.  You can certainly CC me, please, although I also try to watch for this
sort of thing on list.


Cheers,
Phil

> -- 
> Mel Gorman
> SUSE Labs
> 

-- 



Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Mel Gorman
On Thu, Oct 22, 2020 at 07:59:43PM +0200, Rafael J. Wysocki wrote:
> > > Agreed. I'd like the option to switch back if we make the default change.
> > > It's on the table and I'd like to be able to go that way.
> > >
> >
> > Yep. It sounds chicken, but it's a useful safety net and a reasonable
> > way to deprecate a feature. It's also useful for bug creation -- User X
> > running whatever found that schedutil is worse than the old governor and
> > had to temporarily switch back. Repeat until complaining stops and then
> > tear out the old stuff.
> >
> > When/if there is a patch setting schedutil as the default, cc suitable
> > distro people (Giovanni and myself for openSUSE).
> 
> So for the record, Giovanni was on the CC list of the "cpufreq:
> intel_pstate: Use passive mode by default without HWP" patch that this
> discussion resulted from (and which kind of belongs to the above
> category).
> 

Oh I know, I did not mean to suggest that you did not. He made people
aware that this was going to be coming down the line and has been looking
into the "what if schedutil was the default" question.  AFAIK, it's still
a work-in-progress and I don't know all the specifics but he knows more
than I do on the topic. I only know enough that if we flipped the switch
tomorrow that we could be plagued with google searches suggesting it be
turned off again just like there is still broken advice out there about
disabling intel_pstate for usually the wrong reasons.

The passive patch was a clear flag that the intent is that schedutil will
be the default at some unknown point in the future. That point is now a
bit closer and this thread could have encouraged a premature change of
the default resulting in unfair finger pointing at one company's test
team. If at least two distros check it out and it still goes wrong, at
least there will be shared blame :/

> > Other distros assuming they're watching can nominate their own victim.
> 
> But no other victims had been nominated at that time.

We have one, possibly two if Phil agrees. That's better than zero or
unfairly placing the full responsibility on the Intel guys that have been
testing it out.

-- 
Mel Gorman
SUSE Labs


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Giovanni Gherdovich
On Thu, 2020-10-22 at 22:10 +0200, Giovanni Gherdovich wrote:
> [...]
> To read the tables:
> 
> Tilde (~) means the result is the same as baseline (or, the ratio is close
> to 1). The double asterisk (**) is a visual aid and means the result is
> worse than baseline (higher or lower depending on the case).

Ouch, the opposite. Double asterisk (**) is where the result is better
than baseline, and schedutil needs improvement.
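
As a sketch of the corrected notation: the helper below renders one table
cell from a ratio against the baseline. The function name annotate and the 2%
"close to 1" tolerance are assumptions; the actual threshold used for the
tables is not stated in this thread.

```python
def annotate(ratio, higher_is_better, tol=0.02):
    """Render a ratio-to-baseline cell using the ~ and ** notation.

    "~"  : within `tol` of 1.0, i.e. same as baseline
    "**" : better than baseline (the spot where schedutil needs work)
    """
    if abs(ratio - 1.0) < tol:
        return "~"
    better = ratio > 1.0 if higher_is_better else ratio < 1.0
    return f"{ratio:.2f}" + ("**" if better else "")
```

For instance, a throughput ratio of 1.03 (higher is better) and an elapsed
time ratio of 0.82 (lower is better) both get the double asterisk.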


Giovanni



Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Giovanni Gherdovich
Hello Peter, Rafael,

back in August I tested a v5.8 kernel with Rafael's patches from v5.9 that
make schedutil and HWP work together, i.e. f6ebbcf08f37 ("cpufreq:
intel_pstate: Implement passive mode with HWP enabled").

The main point I took from the exercise is that tbench (network benchmark
in localhost) is problematic for schedutil and only with HWP (thanks to
Rafael's patch above) it reaches the throughput of the other governors.
When HWP isn't available, the penalty is 5-10% and I need to understand if
the cause is something that can affect other applications too (or just a
quirk of this test).

I ran this campaign this summer when Rafael CC'ed me on f6ebbcf08f37
("cpufreq: intel_pstate: Implement passive mode with HWP enabled").
I didn't reply as the patch was a win anyway (my bad, I should have posted
the positive results). The regression of tbench with schedutil w/o HWP,
which went unnoticed for a long time, got most of my attention.

Other remarks

* on gitsource (running the git unit test suite, measures elapsed time)
  schedutil is a lot better than Intel's powersave but not as good as the
  performance governor.

* for the AMD EPYC machines we haven't yet implemented frequency invariant
  accounting, which might explain why schedutil loses to ondemand on all
  the benchmarks.

* on dbench (filesystem, measures latency) and kernbench (kernel compilation),
  sugov is as good as the Intel performance governor. You can add or remove
  HWP (to either sugov or perfgov), it doesn't make a difference. Intel's
  powersave in general trails behind.

* generally my main concern is performance, not power efficiency, but I was
  a little disappointed to see schedutil being just as efficient as
  perfgov (the performance-per-watt ratios): there are even a few cases
  where (on tbench) the performance governor is both faster and more
  efficient. From previous conversations with Rafael I recall that
  switching frequency has an energy cost, so it could be that schedutil
  switches too often to amortize it. I haven't checked.

To read the tables:

Tilde (~) means the result is the same as baseline (or, the ratio is close
to 1). The double asterisk (**) is a visual aid and means the result is
worse than baseline (higher or lower depending on the case).

For an overview of the possible configurations (intel_pstate passive,
active, HWP on/off etc) I made the diagram at
https://beta.suse.com/private/ggherdovich/cpufreq/x86-cpufreq.png

1) INTEL, HWP-CAPABLE MACHINES
2) INTEL, NON-HWP-CAPABLE MACHINES
3) AMD EPYC

1) INTEL, HWP-CAPABLE MACHINES:

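64x_SKYLAKE_NUMA: Intel Skylake SP, 32 cores / 64 threads, NUMA, SATA SSD storage
--------------------------------------------------------------------------------
            sugov-HWP   sugov-no-HWP   powersave-HWP   perfgov-HWP   better if
--------------------------------------------------------------------------------
                              PERFORMANCE RATIOS
tbench      1.00        0.68           ~               1.03**        higher
dbench      1.00        ~              1.03            ~             lower
kernbench   1.00        ~              1.11            ~             lower
gitsource   1.00        1.03           2.26            0.82**        lower
--------------------------------------------------------------------------------
                         PERFORMANCE-PER-WATT RATIOS
tbench      1.00        0.74           ~               ~             higher
dbench      1.00        ~              ~               ~             higher
kernbench   1.00        ~              0.96            ~             higher
gitsource   1.00        0.96           0.45            1.15**        higher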


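8x_SKYLAKE_UMA: Intel Skylake (client), 4 cores / 8 threads, UMA, SATA SSD storage
--------------------------------------------------------------------------------
            sugov-HWP   sugov-no-HWP   powersave-HWP   perfgov-HWP   better if
--------------------------------------------------------------------------------
                              PERFORMANCE RATIOS
tbench      1.00        0.91           ~               ~             higher
dbench      1.00        ~              ~               ~             lower
kernbench   1.00        ~              ~               ~             lower
gitsource   1.00        1.04           1.77            ~             lower
--------------------------------------------------------------------------------
                         PERFORMANCE-PER-WATT RATIOS
tbench      1.00        0.95           ~               ~             higher
dbench      1.00        ~              ~               ~             higher
kernbench   1.00        ~              ~               ~             higher
gitsource   1.00        ~              0.74            ~             higher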


8x_COFFEELAKE_UMA: Intel Coffee Lake, 4 cores / 8 threads, UMA, NVMe SSD storage
---
   

Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Rafael J. Wysocki
On Thu, Oct 22, 2020 at 6:35 PM Mel Gorman  wrote:
>
> On Thu, Oct 22, 2020 at 11:12:00AM -0400, Phil Auld wrote:
> > > > AFAIK, not quite (added Giovanni as he has been paying more attention).
> > > > Schedutil has improved since it was merged but not to the extent where
> > > > it is a drop-in replacement. The standard it needs to meet is that
> > > > it is at least equivalent to powersave (in intel_pstate language)
> > > > or ondemand (acpi_cpufreq) and within a reasonable percentage of the
> > > > performance governor. Defaulting to performance is a) giving up and b)
> > > > the performance governor is not a universal win. There are some questions
> > > > currently on whether schedutil is good enough when HWP is not available.
> > > > There was some evidence (I don't have the data, Giovanni was looking into
> > > > it) that HWP was a requirement to make schedutil work well. That is a
> > > > hazard in itself because someone could test on the latest gen Intel CPU
> > > > and conclude everything is fine and miss that Intel-specific technology
> > > > is needed to make it work well while throwing everyone else under a bus.
> > > > Giovanni knows a lot more than I do about this, I could be wrong or
> > > > forgetting things.
> > > >
> > > > For distros, switching to schedutil by default would be nice because
> > > > frequency selection state would follow the task instead of being per-cpu
> > > > and we could stop worrying about different HWP implementations but it's
> > > > not at the point where the switch is advisable. I would expect hard data
> > > > before switching the default and still would strongly advise having a
> > > > period of time where we can fall back when someone inevitably finds a
> > > > new corner case or exception.
> > >
> > > ..and it would be really useful for distros to know when the hard data
> > > is available so that they can make an informed decision when to move to
> > > schedutil.
> > >
> >
> > I think distros are on the hook to generate that hard data themselves
> > with which to make such a decision.  I don't expect it to be done by
> > someone else.
> >
>
> Yep, distros are on the hook. When I said "I would expect hard data",
> it was in the knowledge that for openSUSE/SLE, we (as in SUSE) would be
> generating said data and making a call based on it. I'd be surprised if
> Phil was not thinking along the same lines.
>
> > > > For reference, SLUB had the same problem for years. It was switched
> > > > on by default in the kernel config but it was a long time before
> > > > SLUB was generally equivalent to SLAB in terms of performance. Block
> > > > multiqueue also had vaguely similar issues before the default changes
> > > > and a period of time before it was removed (example whinging mail
> > > > https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5sp...@techsingularity.net/)
> > > > It's schedutil's turn :P
> > > >
> > >
> >
> > Agreed. I'd like the option to switch back if we make the default change.
> > It's on the table and I'd like to be able to go that way.
> >
>
> Yep. It sounds chicken, but it's a useful safety net and a reasonable
> way to deprecate a feature. It's also useful for bug creation -- User X
> running whatever found that schedutil is worse than the old governor and
> had to temporarily switch back. Repeat until complaining stops and then
> tear out the old stuff.
>
> When/if there is a patch setting schedutil as the default, cc suitable
> distro people (Giovanni and myself for openSUSE).

So for the record, Giovanni was on the CC list of the "cpufreq:
intel_pstate: Use passive mode by default without HWP" patch that this
discussion resulted from (and which kind of belongs to the above
category).

> Other distros assuming they're watching can nominate their own victim.

But no other victims had been nominated at that time.


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Mel Gorman
On Thu, Oct 22, 2020 at 11:12:00AM -0400, Phil Auld wrote:
> > > AFAIK, not quite (added Giovanni as he has been paying more attention).
> > > Schedutil has improved since it was merged but not to the extent where
> > > it is a drop-in replacement. The standard it needs to meet is that
> > > it is at least equivalent to powersave (in intel_pstate language)
> > > or ondemand (acpi_cpufreq) and within a reasonable percentage of the
> > > performance governor. Defaulting to performance is a) giving up and b)
> > > the performance governor is not a universal win. There are some questions
> > > currently on whether schedutil is good enough when HWP is not available.
> > > There was some evidence (I don't have the data, Giovanni was looking into
> > > it) that HWP was a requirement to make schedutil work well. That is a
> > > hazard in itself because someone could test on the latest gen Intel CPU
> > > and conclude everything is fine and miss that Intel-specific technology
> > > is needed to make it work well while throwing everyone else under a bus.
> > > Giovanni knows a lot more than I do about this, I could be wrong or
> > > forgetting things.
> > > 
> > > For distros, switching to schedutil by default would be nice because
> > > frequency selection state would follow the task instead of being per-cpu
> > > and we could stop worrying about different HWP implementations but it's
> > > not at the point where the switch is advisable. I would expect hard data
> > > before switching the default and still would strongly advise having a
> > > period of time where we can fall back when someone inevitably finds a
> > > new corner case or exception.
> > 
> > ..and it would be really useful for distros to know when the hard data
> > is available so that they can make an informed decision when to move to
> > schedutil.
> >
> 
> I think distros are on the hook to generate that hard data themselves
> with which to make such a decision.  I don't expect it to be done by
> someone else. 
> 

Yep, distros are on the hook. When I said "I would expect hard data",
it was in the knowledge that for openSUSE/SLE, we (as in SUSE) would be
generating said data and making a call based on it. I'd be surprised if
Phil was not thinking along the same lines.

> > > For reference, SLUB had the same problem for years. It was switched
> > > on by default in the kernel config but it was a long time before
> > > SLUB was generally equivalent to SLAB in terms of performance. Block
> > > multiqueue also had vaguely similar issues before the default changes
> > > and a period of time before it was removed (example whinging mail
> > > https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5sp...@techsingularity.net/)
> > > It's schedutil's turn :P
> > > 
> > 
> 
> Agreed. I'd like the option to switch back if we make the default change.
> It's on the table and I'd like to be able to go that way. 
> 

Yep. It sounds chicken, but it's a useful safety net and a reasonable
way to deprecate a feature. It's also useful for bug creation -- User X
running whatever found that schedutil is worse than the old governor and
had to temporarily switch back. Repeat until complaining stops and then
tear out the old stuff.

When/if there is a patch setting schedutil as the default, cc suitable
distro people (Giovanni and myself for openSUSE). Other distros assuming
they're watching can nominate their own victim.

-- 
Mel Gorman
SUSE Labs


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Mel Gorman
On Thu, Oct 22, 2020 at 05:25:14PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 03:52:50PM +0100, Mel Gorman wrote:
> 
> > There are some questions
> > currently on whether schedutil is good enough when HWP is not available.
> 
> Srinivas and Rafael will know better, but Intel does run a lot of tests
> and IIRC it was found that schedutil was on-par for !HWP. That was the
> basis for commit:
> 
>   33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
> 
> But now it turns out that commit results in running intel_pstate-passive
> on ondemand, which is quite horrible.
> 

I know Intel ran a lot of tests, no question about it and no fingers are
being pointed. I know I've had enough bugs patches tested with a battery
of tests on various machines and still ended up with bug reports :)

> > There was some evidence (I don't have the data, Giovanni was looking into
> > it) that HWP was a requirement to make schedutil work well.
> 
> That seems to be the question; Rafael just said the opposite.
> 
> > For distros, switching to schedutil by default would be nice because
> > frequency selection state would follow the task instead of being per-cpu
> > and we could stop worrying about different HWP implementations but it's
> 
> s/HWP/cpufreq-governors/ ? But yes.
> 

I've seen cases where HWP had variable behaviour between CPU
generations. It was hard to quantify and/or figure out because HWP is a
black box.

> > not at the point where the switch is advisable. I would expect hard data
> > before switching the default and still would strongly advise having a
> > period of time where we can fall back when someone inevitably finds a
> > new corner case or exception.
> 
> Which is why I advocated to make it 'difficult' to use the old ones and
> only later remove them.
> 

That's fair.

-- 
Mel Gorman
SUSE Labs


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread A L



 From: Peter Zijlstra  -- Sent: 2020-10-22 - 14:29 


> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
>> > However I do want to retire ondemand, conservative and also very much
>> > intel_pstate/active mode.
>> 
>> I agree in general, but IMO it would not be prudent to do that without making
>> schedutil provide the same level of performance in all of the relevant use
>> cases.
> 
> Agreed; I thought I had understood we were there already.

Hi, 


Currently schedutil does not populate all stats like ondemand does, which can
be a problem for some monitoring software.

On my AMD 3000G CPU with kernel-5.9.1:


grep . /sys/devices/system/cpu/cpufreq/policy0/stats/*

With ondemand:
time_in_state:390 145179
time_in_state:160 9588482
total_trans:177565
trans_table:   From  :To
trans_table: :   390   160
trans_table:  390: 0 88783
trans_table:  160: 88782 0

With schedutil only two files exist:
reset:
total_trans:216609 


I'd really like to have these stats populated with schedutil, if that's
possible.

Thanks. 
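
For monitoring tools that consume these files, a rough sketch of turning the
raw time_in_state content ("<frequency> <time>" per line, without the
filename prefix that grep adds) into per-frequency residency fractions. The
function names are illustrative, and the sample values are the ones quoted in
the message above, copied as they appear:

```python
from typing import Dict


def parse_time_in_state(text: str) -> Dict[int, int]:
    """Map each frequency to the time spent there, as reported by sysfs."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        freq, spent = fields
        states[int(freq)] = int(spent)
    return states


def residency(states: Dict[int, int]) -> Dict[int, float]:
    """Fraction of total accounted time spent at each frequency."""
    total = sum(states.values())
    if total == 0:
        return {}
    return {freq: spent / total for freq, spent in states.items()}


sample = "390 145179\n160 9588482\n"
fractions = residency(parse_time_in_state(sample))
```

When the stats directory only contains reset and total_trans, as in the
schedutil case above, there is no time_in_state file to feed to the parser,
which is the gap being reported.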



Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Rafael J. Wysocki
On Thu, Oct 22, 2020 at 5:25 PM Peter Zijlstra  wrote:
>
> On Thu, Oct 22, 2020 at 03:52:50PM +0100, Mel Gorman wrote:
>
> > There are some questions
> > currently on whether schedutil is good enough when HWP is not available.
>
> Srinivas and Rafael will know better, but Intel does run a lot of tests
> and IIRC it was found that schedutil was on-par for !HWP. That was the
> basis for commit:
>
>   33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
>
> But now it turns out that commit results in running intel_pstate-passive
> on ondemand, which is quite horrible.

It doesn't in general.  AFAICS this happens only if "ondemand" was
selected as the default governor in the old kernel config, which
should not be the common case.

But I do agree that this needs to be avoided.

> > There was some evidence (I don't have the data, Giovanni was looking into
> > it) that HWP was a requirement to make schedutil work well.
>
> That seems to be the question; Rafael just said the opposite.

I'm not aware of any data like that.

HWP should not be required and it should always be possible to make an
HWP system run without HWP (except for those with exotic BIOS
configs).  However, schedutil should work without HWP as well as (or
better than) the "ondemand" and "conservative" governors on top of the
same driver (whatever it is) and it should work as well as (or better
than) "raw" HWP (so to speak) on top of intel_pstate in the passive
mode with HWP enabled (before 5.9 it couldn't work in that
configuration at all and now it can do that, which I guess may be
regarded as an improvement).

> > For distros, switching to schedutil by default would be nice because
> > frequency selection state would follow the task instead of being per-cpu
> > and we could stop worrying about different HWP implementations but it's
>
> s/HWP/cpufreq-governors/ ? But yes.

Well, different HWP implementations in different processor generations
may be a concern as well in general.

> > not at the point where the switch is advisable. I would expect hard data
> > before switching the default and still would strongly advise having a
> > period of time where we can fall back when someone inevitably finds a
> > new corner case or exception.
>
> Which is why I advocated to make it 'difficult' to use the old ones and
> only later remove them.

Slightly less convenient may be sufficient IMV.

> > For reference, SLUB had the same problem for years. It was switched
> > on by default in the kernel config but it was a long time before
> > SLUB was generally equivalent to SLAB in terms of performance.
>
> I remember :-)


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Vincent Guittot
On Thu, 22 Oct 2020 at 17:45, A L wrote:
>
>
>
>  From: Peter Zijlstra  -- Sent: 2020-10-22 - 14:29 
> 
>
> > On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> >> > However I do want to retire ondemand, conservative and also very much
> >> > intel_pstate/active mode.
> >>
> >> I agree in general, but IMO it would not be prudent to do that without making
> >> schedutil provide the same level of performance in all of the relevant use
> >> cases.
> >
> > Agreed; I thought to have understood we were there already.
>
> Hi,
>
>
> Currently schedutil does not populate all stats like ondemand does, which can 
> be a problem for some monitoring software.
>
> On my AMD 3000G CPU with kernel-5.9.1:
>
>
> grep . /sys/devices/system/cpu/cpufreq/policy0/stats/*
>
> With ondemand:
> time_in_state:390 145179
> time_in_state:160 9588482
> total_trans:177565
> trans_table:   From  :To
> trans_table: :   390   160
> trans_table:  390: 0 88783
> trans_table:  160: 88782 0
>
> With schedutil only two files exist:
> reset:
> total_trans:216609
>
>
> I'd really like to have these stats populated with schedutil, if that's 
> possible.

Your problem might have been fixed with
commit 96f60cddf7a1 ("cpufreq: stats: Enable stats for fast-switch as well")
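
To check whether a given kernel exposes the full set of stats files, a
small script can compare a policy's stats directory against the four
names that appear in the ondemand output quoted above. This is only a
sketch: missing_stats() and EXPECTED are hypothetical names, and the list
of expected files is taken from this thread, not from any guarantee about
what a particular kernel version provides.

```python
import os

# File names taken from the ondemand listing quoted above.
EXPECTED = ("reset", "time_in_state", "total_trans", "trans_table")


def missing_stats(stats_dir):
    """Return the expected stats file names that are absent from stats_dir.

    A missing directory is treated as having none of the files.
    """
    present = set(os.listdir(stats_dir)) if os.path.isdir(stats_dir) else set()
    return [name for name in EXPECTED if name not in present]
```

Calling missing_stats("/sys/devices/system/cpu/cpufreq/policy0/stats")
before and after a kernel update would show whether time_in_state and
trans_table have appeared for schedutil.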


>
> Thanks.
>


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Peter Zijlstra
On Thu, Oct 22, 2020 at 03:52:50PM +0100, Mel Gorman wrote:

> There are some questions
> currently on whether schedutil is good enough when HWP is not available.

Srinivas and Rafael will know better, but Intel does run a lot of tests
and IIRC it was found that schedutil was on-par for !HWP. That was the
basis for commit:

  33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")

But now it turns out that commit results in running intel_pstate-passive
on ondemand, which is quite horrible.

> There was some evidence (I don't have the data, Giovanni was looking into
> it) that HWP was a requirement to make schedutil work well.

That seems to be the question; Rafael just said the opposite.

> For distros, switching to schedutil by default would be nice because
> frequency selection state would follow the task instead of being per-cpu
> and we could stop worrying about different HWP implementations but it's

s/HWP/cpufreq-governors/ ? But yes.

> not at the point where the switch is advisable. I would expect hard data
> before switching the default and still would strongly advise having a
> period of time where we can fall back when someone inevitably finds a
> new corner case or exception.

Which is why I advocated to make it 'difficult' to use the old ones and
only later remove them.

> For reference, SLUB had the same problem for years. It was switched
> on by default in the kernel config but it was a long time before
> SLUB was generally equivalent to SLAB in terms of performance.

I remember :-)



Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Phil Auld
On Thu, Oct 22, 2020 at 03:58:13PM +0100 Colin Ian King wrote:
> On 22/10/2020 15:52, Mel Gorman wrote:
> > On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote:
> >> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> >>>> However I do want to retire ondemand, conservative and also very much
> >>>> intel_pstate/active mode.
> >>>
> >>> I agree in general, but IMO it would not be prudent to do that without making
> >>> schedutil provide the same level of performance in all of the relevant use
> >>> cases.
> >>
> >> Agreed; I thought to have understood we were there already.
> > 
> > AFAIK, not quite (added Giovanni as he has been paying more attention).
> > Schedutil has improved since it was merged but not to the extent where
> > it is a drop-in replacement. The standard it needs to meet is that
> > it is at least equivalent to powersave (in intel_pstate language)
> > or ondemand (acpi_cpufreq) and within a reasonable percentage of the
> > performance governor. Defaulting to performance is a) giving up and b)
> > the performance governor is not a universal win. There are some questions
> > currently on whether schedutil is good enough when HWP is not available.
> > There was some evidence (I don't have the data, Giovanni was looking into
> > it) that HWP was a requirement to make schedutil work well. That is a
> > hazard in itself because someone could test on the latest gen Intel CPU
> > and conclude everything is fine and miss that Intel-specific technology
> > is needed to make it work well while throwing everyone else under a bus.
> > Giovanni knows a lot more than I do about this, I could be wrong or
> > forgetting things.
> > 
> > For distros, switching to schedutil by default would be nice because
> > frequency selection state would follow the task instead of being per-cpu
> > and we could stop worrying about different HWP implementations but it's
> > not at the point where the switch is advisable. I would expect hard data
> > before switching the default and still would strongly advise having a
> > period of time where we can fall back when someone inevitably finds a
> > new corner case or exception.
> 
> ..and it would be really useful for distros to know when the hard data
> is available so that they can make an informed decision when to move to
> schedutil.
>

I think distros are on the hook to generate that hard data themselves
with which to make such a decision.  I don't expect it to be done by
someone else. 

> > 
> > For reference, SLUB had the same problem for years. It was switched
> > on by default in the kernel config but it was a long time before
> > SLUB was generally equivalent to SLAB in terms of performance. Block
> > multiqueue also had vaguely similar issues before the default changes
> > and a period of time before it was removed (example whinging mail
> > https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5sp...@techsingularity.net/)
> > It's schedutil's turn :P
> > 
> 

Agreed. I'd like the option to switch back if we make the default change.
It's on the table and I'd like to be able to go that way. 

Cheers,
Phil

-- 



Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Colin Ian King
On 22/10/2020 15:52, Mel Gorman wrote:
> On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote:
>> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
>>>> However I do want to retire ondemand, conservative and also very much
>>>> intel_pstate/active mode.
>>>
>>> I agree in general, but IMO it would not be prudent to do that without making
>>> schedutil provide the same level of performance in all of the relevant use
>>> cases.
>>
>> Agreed; I thought to have understood we were there already.
> 
> AFAIK, not quite (added Giovanni as he has been paying more attention).
> Schedutil has improved since it was merged but not to the extent where
> it is a drop-in replacement. The standard it needs to meet is that
> it is at least equivalent to powersave (in intel_pstate language)
> or ondemand (acpi_cpufreq) and within a reasonable percentage of the
> performance governor. Defaulting to performance is a) giving up and b)
> the performance governor is not a universal win. There are some questions
> currently on whether schedutil is good enough when HWP is not available.
> There was some evidence (I don't have the data, Giovanni was looking into
> it) that HWP was a requirement to make schedutil work well. That is a
> hazard in itself because someone could test on the latest gen Intel CPU
> and conclude everything is fine and miss that Intel-specific technology
> is needed to make it work well while throwing everyone else under a bus.
> Giovanni knows a lot more than I do about this, I could be wrong or
> forgetting things.
> 
> For distros, switching to schedutil by default would be nice because
> frequency selection state would follow the task instead of being per-cpu
> and we could stop worrying about different HWP implementations but it's
> not at the point where the switch is advisable. I would expect hard data
> before switching the default and still would strongly advise having a
> period of time where we can fall back when someone inevitably finds a
> new corner case or exception.

..and it would be really useful for distros to know when the hard data
is available so that they can make an informed decision when to move to
schedutil.

> 
> For reference, SLUB had the same problem for years. It was switched
> on by default in the kernel config but it was a long time before
> SLUB was generally equivalent to SLAB in terms of performance. Block
> multiqueue also had vaguely similar issues before the default changes
> and a period of time before it was removed (example whinging mail
> https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5sp...@techsingularity.net/)
> It's schedutil's turn :P
> 



Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Mel Gorman
On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> > > However I do want to retire ondemand, conservative and also very much
> > > intel_pstate/active mode.
> > 
> > I agree in general, but IMO it would not be prudent to do that without making
> > schedutil provide the same level of performance in all of the relevant use
> > cases.
> 
> Agreed; I thought to have understood we were there already.

AFAIK, not quite (added Giovanni as he has been paying more attention).
Schedutil has improved since it was merged but not to the extent where
it is a drop-in replacement. The standard it needs to meet is that
it is at least equivalent to powersave (in intel_pstate language)
or ondemand (acpi_cpufreq) and within a reasonable percentage of the
performance governor. Defaulting to performance is a) giving up and b)
the performance governor is not a universal win. There are some questions
currently on whether schedutil is good enough when HWP is not available.
There was some evidence (I don't have the data, Giovanni was looking into
it) that HWP was a requirement to make schedutil work well. That is a
hazard in itself because someone could test on the latest gen Intel CPU
and conclude everything is fine and miss that Intel-specific technology
is needed to make it work well while throwing everyone else under a bus.
Giovanni knows a lot more than I do about this, I could be wrong or
forgetting things.

For distros, switching to schedutil by default would be nice because
frequency selection state would follow the task instead of being per-cpu
and we could stop worrying about different HWP implementations but it's
not at the point where the switch is advisable. I would expect hard data
before switching the default and still would strongly advise having a
period of time where we can fall back when someone inevitably finds a
new corner case or exception.

For reference, SLUB had the same problem for years. It was switched
on by default in the kernel config but it was a long time before
SLUB was generally equivalent to SLAB in terms of performance. Block
multiqueue also had vaguely similar issues before the default changes
and a period of time before it was removed (example whinging mail
https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5sp...@techsingularity.net/)
It's schedutil's turn :P

-- 
Mel Gorman
SUSE Labs


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Peter Zijlstra
On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
> > However I do want to retire ondemand, conservative and also very much
> > intel_pstate/active mode.
> 
> I agree in general, but IMO it would not be prudent to do that without making
> schedutil provide the same level of performance in all of the relevant use
> cases.

Agreed; I thought to have understood we were there already.


Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Rafael J. Wysocki
[CC linux-pm and Len]

On Thursday, October 22, 2020 2:02:13 PM CEST Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote:
> > On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > > On 22-10-20, 09:11, Peter Zijlstra wrote:
> > > > Well, but we need to do something to force people onto schedutil,
> > > > otherwise we'll get more crap like this thread.
> > > > 
> > > > Can we take the choice away? Only let Kconfig select which governors are
> > > > available and then set the default ourselves? I mean, the end goal being
> > > > to not have selectable governors at all, this seems like a good step
> > > > anyway.
> > > 
> > > Just to clarify and complete the point a bit here, the users can still
> > > pass the default governor from cmdline using
> > > cpufreq.default_governor=, which will take precedence over the one the
> > > below code is playing with. And later once the kernel is up, they can
> > > still choose a different governor from userspace.
> > 
> > Right.
> > 
> > Also some people simply set "performance" as the default governor and then
> > don't touch cpufreq otherwise (the idea is to get everything to the max
> > freq right away and stay in that mode forever).  This still needs to be
> > possible IMO.
> 
> Performance/powersave make sense to keep.
> 
> However I do want to retire ondemand, conservative and also very much
> intel_pstate/active mode.

I agree in general, but IMO it would not be prudent to do that without making
schedutil provide the same level of performance in all of the relevant use
cases.

> I also have very little sympathy for userspace.

That I completely agree with.

> We should start by making it hard to use them and eventually just delete
> them outright.

Right, but see above: IMO step 0 should be to ensure that schedutil is a viable
replacement for all users.





default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Peter Zijlstra
On Thu, Oct 22, 2020 at 01:45:25PM +0200, Rafael J. Wysocki wrote:
> On Thursday, October 22, 2020 12:47:03 PM CEST Viresh Kumar wrote:
> > On 22-10-20, 09:11, Peter Zijlstra wrote:
> > > Well, but we need to do something to force people onto schedutil,
> > > otherwise we'll get more crap like this thread.
> > > 
> > > Can we take the choice away? Only let Kconfig select which governors are
> > > available and then set the default ourselves? I mean, the end goal being
> > > to not have selectable governors at all, this seems like a good step
> > > anyway.
> > 
> > Just to clarify and complete the point a bit here, the users can still
> > pass the default governor from cmdline using
> > cpufreq.default_governor=, which will take precedence over the one the
> > below code is playing with. And later once the kernel is up, they can
> > still choose a different governor from userspace.
> 
> Right.
> 
> Also some people simply set "performance" as the default governor and then
> don't touch cpufreq otherwise (the idea is to get everything to the max
> freq right away and stay in that mode forever).  This still needs to be
> possible IMO.

Performance/powersave make sense to keep.

However I do want to retire ondemand, conservative and also very much
intel_pstate/active mode. I also have very little sympathy for
userspace.

We should start by making it hard to use them and eventually just delete
them outright.
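
On Viresh's point quoted above that a different governor can still be
chosen from userspace once the kernel is up: the per-policy sysfs files
scaling_governor and scaling_available_governors show what is in effect
and what could be selected. The sketch below only reads them. The helper
names read_governor_info() and can_switch_to() are hypothetical, and the
policy directory is passed in, not assumed.

```python
import os


def read_governor_info(policy_dir):
    """Read the current and available governors from a cpufreq policy dir."""

    def read(name):
        with open(os.path.join(policy_dir, name)) as f:
            return f.read().strip()

    return {
        "current": read("scaling_governor"),
        # The file holds a space-separated list of governor names.
        "available": read("scaling_available_governors").split(),
    }


def can_switch_to(policy_dir, governor):
    """Return True if the named governor is listed as available."""
    return governor in read_governor_info(policy_dir)["available"]
```

On a running system policy_dir would typically be something like
/sys/devices/system/cpu/cpufreq/policy0. Actually switching means writing
the governor name to scaling_governor, which normally requires root and
is deliberately left out of this sketch.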