Re: [RFC] sched: CPU topology try

2014-01-08 Thread Morten Rasmussen
On Wed, Jan 08, 2014 at 01:32:57PM +, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 01:27:39PM +, Morten Rasmussen wrote:
> > On Wed, Jan 08, 2014 at 12:45:34PM +, Peter Zijlstra wrote:
> > > On Wed, Jan 08, 2014 at 12:35:34PM +, Morten Rasmussen wrote:
> > > > > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > > > > be very misleading because while a cpu might have a task running 'now'
> > > > > it might be 99% idle.
> > > > > 
> > > > > At which point I argued we should change the capacity thing anyhow. Ever
> > > > > since the runnable_avg patch set I've been arguing to change that into
> > > > > an actual utilization test.
> > > > > 
> > > > > So I think that if we measure overload by something like >95% utilization
> > > > > on the entire group the load scaling again makes perfect sense.
> > > > 
> > > > I agree that it makes more sense to change the overload test to be based
> > > > on some tracked load. How about the non-overloaded case? Load balancing
> > > > would have to be based on unweighted task loads in that case?
> > > 
> > > Yeah, until we're overloaded our goal is to minimize idle time.
> > 
> > I would say, make the most of the available cpu cycles. Minimizing idle
> > time is not always the right thing to do when considering power
> > awareness.
> > 
> > If we know the actual load of the tasks, we may be able to consolidate
> 
> I think we must start to be careful with the word load, I think you
> meant to say utilization.

Indeed, I meant utilization.

> 
> > them on fewer cpus and save power by idling cpus. In that case the idle
> > time (total) is unchanged (unless the P-state is changed). Somewhat
> > similar to the video use-case running on 1, 2, and 4 cpu that I reposted
> > yesterday.
> 
> But fair enough.. It's idle time when you consider CPUs to always run at
> max frequency, but clearly I must stop thinking about CPUs like that :-)

Yes, it opens a whole new world of problems to be solved :-)


Re: [RFC] sched: CPU topology try

2014-01-08 Thread Morten Rasmussen
On Wed, Jan 08, 2014 at 01:04:07PM +, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 12:52:28PM +, Morten Rasmussen wrote:
> 
> > If I remember correctly, Alex used the rq runnable_avg_sum (in rq->avg)
> > for this. It is the most obvious choice, but it takes ages to reach
> > 100%.
> > 
> > #define LOAD_AVG_MAX_N 345
> > 
> > Worst case it takes 345 ms from when the system becomes fully utilized
> > after a long period of idle until the rq runnable_avg_sum reaches 100%.
> > 
> > An unweighted version of cfs_rq->runnable_load_avg and blocked_load_avg
> > wouldn't have that delay.
> 
> Right.. not sure we want to involve blocked load on the utilization
> metric, but who knows maybe that does make sense.
> 
> But yes, we need unweighted runnable_avg.

I'm not sure about the blocked load either.

> 
> > Also, if we are changing the load balance behavior when all cpus are
> > fully utilized
> 
> We already have this tipping point. See all the has_capacity bits. But
> yes, it'd get more involved I suppose.

I'll have a look.


Re: [RFC] sched: CPU topology try

2014-01-08 Thread Peter Zijlstra
On Wed, Jan 08, 2014 at 01:27:39PM +, Morten Rasmussen wrote:
> On Wed, Jan 08, 2014 at 12:45:34PM +, Peter Zijlstra wrote:
> > On Wed, Jan 08, 2014 at 12:35:34PM +, Morten Rasmussen wrote:
> > > > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > > > be very misleading because while a cpu might have a task running 'now'
> > > > it might be 99% idle.
> > > > 
> > > > At which point I argued we should change the capacity thing anyhow. Ever
> > > > since the runnable_avg patch set I've been arguing to change that into
> > > > an actual utilization test.
> > > > 
> > > > So I think that if we measure overload by something like >95% utilization
> > > > on the entire group the load scaling again makes perfect sense.
> > > 
> > > I agree that it makes more sense to change the overload test to be based
> > > on some tracked load. How about the non-overloaded case? Load balancing
> > > would have to be based on unweighted task loads in that case?
> > 
> > Yeah, until we're overloaded our goal is to minimize idle time.
> 
> I would say, make the most of the available cpu cycles. Minimizing idle
> time is not always the right thing to do when considering power
> awareness.
> 
> If we know the actual load of the tasks, we may be able to consolidate

I think we must start to be careful with the word load, I think you
meant to say utilization. 

> them on fewer cpus and save power by idling cpus. In that case the idle
> time (total) is unchanged (unless the P-state is changed). Somewhat
> similar to the video use-case running on 1, 2, and 4 cpu that I reposted
> yesterday.

But fair enough.. It's idle time when you consider CPUs to always run at
max frequency, but clearly I must stop thinking about CPUs like that :-)


Re: [RFC] sched: CPU topology try

2014-01-08 Thread Morten Rasmussen
On Wed, Jan 08, 2014 at 12:45:34PM +, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 12:35:34PM +, Morten Rasmussen wrote:
> > > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > > be very misleading because while a cpu might have a task running 'now'
> > > it might be 99% idle.
> > > 
> > > At which point I argued we should change the capacity thing anyhow. Ever
> > > since the runnable_avg patch set I've been arguing to change that into
> > > an actual utilization test.
> > > 
> > > So I think that if we measure overload by something like >95% utilization
> > > on the entire group the load scaling again makes perfect sense.
> > 
> > I agree that it makes more sense to change the overload test to be based
> > on some tracked load. How about the non-overloaded case? Load balancing
> > would have to be based on unweighted task loads in that case?
> 
> Yeah, until we're overloaded our goal is to minimize idle time.

I would say, make the most of the available cpu cycles. Minimizing idle
time is not always the right thing to do when considering power
awareness.

If we know the actual load of the tasks, we may be able to consolidate
them on fewer cpus and save power by idling cpus. In that case the idle
time (total) is unchanged (unless the P-state is changed). Somewhat
similar to the video use-case running on 1, 2, and 4 cpu that I reposted
yesterday.


Re: [RFC] sched: CPU topology try

2014-01-08 Thread Peter Zijlstra
On Wed, Jan 08, 2014 at 12:52:28PM +, Morten Rasmussen wrote:

> If I remember correctly, Alex used the rq runnable_avg_sum (in rq->avg)
> for this. It is the most obvious choice, but it takes ages to reach
> 100%.
> 
> #define LOAD_AVG_MAX_N 345
> 
> Worst case it takes 345 ms from when the system becomes fully utilized
> after a long period of idle until the rq runnable_avg_sum reaches 100%.
> 
> An unweighted version of cfs_rq->runnable_load_avg and blocked_load_avg
> wouldn't have that delay.

Right.. not sure we want to involve blocked load on the utilization
metric, but who knows maybe that does make sense.

But yes, we need unweighted runnable_avg.

> Also, if we are changing the load balance behavior when all cpus are
> fully utilized

We already have this tipping point. See all the has_capacity bits. But
yes, it'd get more involved I suppose.

> we may need to think about cases where the load is
> hovering around the saturation threshold. But I don't think that is
> important yet.

Yah.. I'm going to wait until we have a fail case that can give us
some guidance before really pondering this though :-)


Re: [RFC] sched: CPU topology try

2014-01-08 Thread Morten Rasmussen
On Wed, Jan 08, 2014 at 08:37:16AM +, Peter Zijlstra wrote:
> On Wed, Jan 08, 2014 at 04:32:18PM +0800, Alex Shi wrote:
> > In my old power-aware scheduling patchset, I had tried values from 95 to 99.
> > But all those values lead to imbalance when we test while(1)-like cases:
> > in a 24-LCPU group, 24 * 5% > 1, i.e. the allowed headroom adds up to more
> > than one full CPU. So I finally used 100% as the overload indicator. And in
> > testing, 100% works well to find overload since few system services are
> > involved. :)
> 
> Ah indeed, so 100% it is ;-)

If I remember correctly, Alex used the rq runnable_avg_sum (in rq->avg)
for this. It is the most obvious choice, but it takes ages to reach
100%.

#define LOAD_AVG_MAX_N 345

Worst case it takes 345 ms from when the system becomes fully utilized
after a long period of idle until the rq runnable_avg_sum reaches 100%.

An unweighted version of cfs_rq->runnable_load_avg and blocked_load_avg
wouldn't have that delay.
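
A quick back-of-the-envelope sketch of the geometric series behind
runnable_avg_sum (plain user-space C with floating point, ignoring the
kernel's fixed-point rounding, so only illustrative) showing why the
ramp-up takes on the order of 345 ms of continuous activity:

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
          /* PELT-style decay: contributions halve every 32 periods (~32 ms) */
          double y = pow(0.5, 1.0 / 32.0);
          double limit = 1024.0 / (1.0 - y);      /* asymptotic maximum */
          double sum = 0.0;

          for (int n = 1; n <= 345; n++) {
                  sum = sum * y + 1024.0;         /* one fully busy ~1 ms period */
                  if (n % 50 == 0 || n == 345)
                          printf("%3d ms: %5.1f%% of max\n",
                                 n, 100.0 * sum / limit);
          }
          return 0;
  }

With these numbers the average is only about 66% of its maximum after
50 ms and about 99% after 200 ms; the kernel's fixed-point sum stops
growing entirely at period 345, hence LOAD_AVG_MAX_N.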

Also, if we are changing the load balance behavior when all cpus are
fully utilized we may need to think about cases where the load is
hovering around the saturation threshold. But I don't think that is
important yet.


Re: [RFC] sched: CPU topology try

2014-01-08 Thread Peter Zijlstra
On Wed, Jan 08, 2014 at 12:35:34PM +, Morten Rasmussen wrote:
> > Currently we detect overload by sg.nr_running >= sg.capacity, which can
> > be very misleading because while a cpu might have a task running 'now'
> > it might be 99% idle.
> > 
> > At which point I argued we should change the capacity thing anyhow. Ever
> > since the runnable_avg patch set I've been arguing to change that into
> > an actual utilization test.
> > 
> > So I think that if we measure overload by something like >95% utilization
> > on the entire group the load scaling again makes perfect sense.
> 
> I agree that it makes more sense to change the overload test to be based
> on some tracked load. How about the non-overloaded case? Load balancing
> would have to be based on unweighted task loads in that case?

Yeah, until we're overloaded our goal is to minimize idle time.


Re: [RFC] sched: CPU topology try

2014-01-08 Thread Peter Zijlstra
On Wed, Jan 08, 2014 at 12:35:34PM +, Morten Rasmussen wrote:
> > The harder case is where all 3 tasks are of equal weight; in which case
> > fairness would mandate we (slowly) rotate the tasks such that they all
> > get 2/3 time -- we also horribly fail at this :-)
> 
> I have encountered that one a number of times. All the middleware noise
> in Android sometimes give that effect.

You've got a typo there: s/middleware/muddleware/ :-)

> I'm not sure if the NUMA guy would like a rotating scheduler though :-)

Hurmph ;-) But yes, the N+1 tasks on an N cpu system case is rotten; any
static solution gets 2 tasks that run at 50%, any dynamic solution gets
the migration overhead issue.

So while the dynamic solution would indeed allow each task to (on
average) receive N/(N+1) time -- a vast improvement over the 50% thing, it
doesn't come without downsides.


Re: [RFC] sched: CPU topology try

2014-01-08 Thread Morten Rasmussen
On Tue, Jan 07, 2014 at 08:49:51PM +, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 03:41:54PM +, Morten Rasmussen wrote:
> > I think that could work if we sort out the priority scaling issue that I
> > mentioned before.
> 
> We talked a bit about this on IRC a month or so ago, right? My memories
> from that are that your main complaint is that we don't detect the
> overload scenario right.
> 
> That is; the point at which we should start caring about SMP-nice is
> when all our CPUs are fully occupied, because up to that point we're
> under utilized and work preservation mandates we utilize idle time.

Yes. I think I stated the problem differently, but I think we talk about
the same thing. Basically, priority-scaling in task load_contrib means
that runnable_load_avg and blocked_load_avg are poor indicators of cpu
load (available idle time). Priority scaling only makes sense when the
system is fully utilized. When that is not the case, it just gives us a
potentially very inaccurate picture of the load (available idle time).

Pretty much what you just said :-)
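
A toy illustration of that point (user-space C; contrib() is a crude
stand-in for load_avg_contrib, and the weights are taken from the
kernel's prio_to_weight[] table):

  #include <stdio.h>

  /* approximate contribution: task weight scaled by its runnable fraction */
  static long contrib(long weight, double runnable_fraction)
  {
          return (long)(weight * runnable_fraction);
  }

  int main(void)
  {
          /* nice  0 task, weight 1024, busy 100% of the time */
          printf("A: contrib = %ld\n", contrib(1024, 1.0));  /* 1024 */
          /* nice -10 task, weight 9548, busy only 20% of the time */
          printf("B: contrib = %ld\n", contrib(9548, 0.2));  /* 1909 */
          return 0;
  }

The niced-up task B contributes almost twice the load of A while leaving
80% of a cpu idle, which is why the weighted sums say so little about
available idle time unless the system is fully utilized.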

> Currently we detect overload by sg.nr_running >= sg.capacity, which can
> be very misleading because while a cpu might have a task running 'now'
> it might be 99% idle.
> 
> At which point I argued we should change the capacity thing anyhow. Ever
> since the runnable_avg patch set I've been arguing to change that into
> an actual utilization test.
> 
> So I think that if we measure overload by something like >95% utilization
> on the entire group the load scaling again makes perfect sense.

I agree that it makes more sense to change the overload test to be based
on some tracked load. How about the non-overloaded case? Load balancing
would have to be based on unweighted task loads in that case?

> 
> Given the 3 task {A,B,C} workload where A and B are niced, to land on a
> symmetric dual CPU system like: {A,B}+{C}, assuming they're all while(1)
> loops :-).
> 
> The harder case is where all 3 tasks are of equal weight; in which case
> fairness would mandate we (slowly) rotate the tasks such that they all
> get 2/3 time -- we also horribly fail at this :-)

I have encountered that one a number of times. All the middleware noise
in Android sometimes give that effect.

I'm not sure if the NUMA guy would like a rotating scheduler though :-)


Re: [RFC] sched: CPU topology try

2014-01-08 Thread Alex Shi
On 01/07/2014 11:37 PM, Morten Rasmussen wrote:
>>> > >
>>> > > So something like:
>>> > >
>>> > >   avg = runnable + p(i) * blocked; where p(i) \e [0,1]
>>> > >
>>> > > could maybe be used to replace the cpu_load array and still represent
>>> > > the concept of looking at a bigger picture for larger sets. Leaving open
>>> > > the details of the map p.
> Figuring out p is the difficult bit. AFAIK, with blocked load in its
> current form we don't have any clue when a task will reappear.

Yes, that's why we cannot find a suitable way to take the blocked
load into account in load balancing.


-- 
Thanks
Alex


Re: [RFC] sched: CPU topology try

2014-01-08 Thread Peter Zijlstra
On Wed, Jan 08, 2014 at 04:32:18PM +0800, Alex Shi wrote:
> In my old power-aware scheduling patchset, I had tried values from 95 to 99.
> But all those values lead to imbalance when we test while(1)-like cases:
> in a 24-LCPU group, 24 * 5% > 1, i.e. the allowed headroom adds up to more
> than one full CPU. So I finally used 100% as the overload indicator. And in
> testing, 100% works well to find overload since few system services are
> involved. :)

Ah indeed, so 100% it is ;-)


Re: [RFC] sched: CPU topology try

2014-01-08 Thread Alex Shi
On 01/08/2014 04:49 AM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 03:41:54PM +, Morten Rasmussen wrote:
>> I think that could work if we sort out the priority scaling issue that I
>> mentioned before.
> 
> We talked a bit about this on IRC a month or so ago, right? My memories
> from that are that your main complaint is that we don't detect the
> overload scenario right.
> 
> That is; the point at which we should start caring about SMP-nice is
> when all our CPUs are fully occupied, because up to that point we're
> under utilized and work preservation mandates we utilize idle time.
> 
> Currently we detect overload by sg.nr_running >= sg.capacity, which can
> be very misleading because while a cpu might have a task running 'now'
> it might be 99% idle.
> 
> At which point I argued we should change the capacity thing anyhow. Ever
> since the runnable_avg patch set I've been arguing to change that into
> an actual utilization test.
> 
> So I think that if we measure overload by something like >95% utilization
> on the entire group the load scaling again makes perfect sense.

In my old power-aware scheduling patchset, I had tried values from 95 to 99.
But all those values lead to imbalance when we test while(1)-like cases:
in a 24-LCPU group, 24 * 5% > 1, i.e. the allowed headroom adds up to more
than one full CPU. So I finally used 100% as the overload indicator. And in
testing, 100% works well to find overload since few system services are
involved. :)
> 
> Given the 3 task {A,B,C} workload where A and B are niced, to land on a
> symmetric dual CPU system like: {A,B}+{C}, assuming they're all while(1)
> loops :-).
> 
> The harder case is where all 3 tasks are of equal weight; in which case
> fairness would mandate we (slowly) rotate the tasks such that they all
> get 2/3 time -- we also horribly fail at this :-)
> 


-- 
Thanks
Alex


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Peter Zijlstra
On Tue, Jan 07, 2014 at 03:41:54PM +, Morten Rasmussen wrote:
> I think that could work if we sort out the priority scaling issue that I
> mentioned before.

We talked a bit about this on IRC a month or so ago, right? My memories
from that are that your main complaint is that we don't detect the
overload scenario right.

That is; the point at which we should start caring about SMP-nice is
when all our CPUs are fully occupied, because up to that point we're
under utilized and work preservation mandates we utilize idle time.

Currently we detect overload by sg.nr_running >= sg.capacity, which can
be very misleading because while a cpu might have a task running 'now'
it might be 99% idle.

At which point I argued we should change the capacity thing anyhow. Ever
since the runnable_avg patch set I've been arguing to change that into
an actual utilization test.

So I think that if we measure overload by something like >95% utilization
on the entire group the load scaling again makes perfect sense.
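
A minimal sketch of what such a test could look like (standalone C; the
struct and field names are made up for illustration, not the scheduler's
actual statistics):

  #include <stdio.h>

  struct group_stats {
          unsigned long group_util;      /* summed (unweighted) utilization */
          unsigned long group_capacity;  /* summed compute capacity         */
  };

  /* overloaded once utilization exceeds ~95% of the group's capacity */
  static int group_overloaded(const struct group_stats *sgs)
  {
          return sgs->group_util * 100 > sgs->group_capacity * 95;
  }

  int main(void)
  {
          /* a cpu with one task running 'now' but ~90% idle: not overloaded */
          struct group_stats sgs = { .group_util = 110, .group_capacity = 1024 };

          printf("overloaded: %d\n", group_overloaded(&sgs));   /* prints 0 */
          return 0;
  }

The nr_running >= capacity test would have flagged that same group as
overloaded as soon as each cpu had a runnable task, however idle they are.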

Given the 3 task {A,B,C} workload where A and B are niced, to land on a
symmetric dual CPU system like: {A,B}+{C}, assuming they're all while(1)
loops :-).

The harder case is where all 3 tasks are of equal weight; in which case
fairness would mandate we (slowly) rotate the tasks such that they all
get 2/3 time -- we also horribly fail at this :-)


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Preeti U Murthy
On 01/07/2014 06:01 PM, Vincent Guittot wrote:
> On 7 January 2014 11:39, Preeti U Murthy  wrote:
>> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
>>> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
 What if we want to add arch specific flags to the NUMA domain? Currently
 with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
 the arch can modify the sd flags of the topology levels till just before
 the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
 initialized. We need to perhaps call into arch here to probe for
 additional flags?
>>>
>>> What are you thinking of? I was hoping all NUMA details were captured in
>>> the distance table.
>>>
>>> Its far easier to talk of specifics in this case.
>>>
>> If the processor can be core gated, then there is very little power
>> savings that we could yield from consolidating all the load onto a
>> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
>> nodes, the power is drawn by 6 cores in all. So I was thinking under
>> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
>> the NUMA domain and spread the load if it favours the workload.
> 
> The policy of keeping the tasks running on cores that are close (same
> node) to the memory, is the more power efficient, isn't it ? so it's
> probably more about where to place the memory than about where to
> place the tasks ?

Yes, this is another point. One of the reasons that we try to consolidate
load to cores is that on Power8 systems most of the power management is
at the core level, and node-level cpuidle states are usually entered only
on fully idle systems, due to the overhead involved in exiting these
idle states, as I mentioned earlier in this thread.

Another point against node-level idle states, which could for
instance include flushing of a large shared cache, is that if we try to
consolidate the load to nodes, we must also consolidate memory pages
simultaneously. Otherwise performance will be severely hurt by
re-fetching the pages that were flushed, compared to core-level idle
management.
  Core-level idle power management could include flushing of the L2 cache,
which is still OK for performance because re-fetching the pages into
this cache has relatively low overhead, and depending on the arch, the
power savings obtained could be worth the overhead.

Thanks

Regards
Preeti U Murthy
> 
> Vincent
> 
>>
>> Regards
>> Preeti U Murthy
>>
> 



Re: [RFC] sched: CPU topology try

2014-01-07 Thread Preeti U Murthy
On 01/07/2014 04:43 PM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 04:09:39PM +0530, Preeti U Murthy wrote:
>> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
>>> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
 What if we want to add arch specific flags to the NUMA domain? Currently
 with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
 the arch can modify the sd flags of the topology levels till just before
 the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
 initialized. We need to perhaps call into arch here to probe for
 additional flags?
>>>
>>> What are you thinking of? I was hoping all NUMA details were captured in
>>> the distance table.
>>>
>>> Its far easier to talk of specifics in this case.
>>>
>> If the processor can be core gated, then there is very little power
>> savings that we could yield from consolidating all the load onto a
>> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
>> nodes, the power is drawn by 6 cores in all. So I was thinking under
>> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
>> the NUMA domain and spread the load if it favours the workload.
> 
> So Intel has so far not said a lot of sensible things about power
> management on their multi-socket platform.
> 
> And I've not heard anything at all from IBM on the POWER chips.
> 
> What I know from the Intel side is that package idle hardly saves
> anything when compared to the DRAM power and the cost of having to do
> remote memory accesses.
> 
> In other words, I'm not at all considering power aware scheduling for
> NUMA systems until someone starts talking sense :-)
> 

On Power8 systems, most of the cpuidle power management is done at the
core level. Doing so is expected to yield good power savings without
much loss of performance, with little exit latency from these idle
states and little overhead from re-initializing the cores.

However, doing idle power management at the node level could hit
performance, even though good power savings are obtained, due to the
overhead of re-initializing the node, which could be significant, and of
course the large exit latency from such idle states.

Therefore we would try to consolidate load to cores as much as possible,
rather than to nodes, so as to leave as many cores idle as possible. Again,
consolidation to cores needs to target 3-4 threads in a core. With 8
threads in a core, running just one thread would hardly do justice to
the core's resources. At the same time, running the core at full throttle
would hit performance. Hence a fine balance could be obtained by
consolidating load onto a minimum number of threads.
  *Consolidating load to cores and spreading the load across nodes* would
probably help memory-intensive workloads finish faster due to less
contention on local node memory, and can get the cores to idle faster.

Thanks

Regards
Preeti U Murthy



Re: [RFC] sched: CPU topology try

2014-01-07 Thread Morten Rasmussen
On Tue, Jan 07, 2014 at 02:10:59PM +, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 02:22:20PM +0100, Peter Zijlstra wrote:
> 
> I just realized there's two different p's in there.
> 
> > Ah, another way of looking at it is that the avg without blocked
> > component is a 'now' picture. It is the load we are concerned with right
> > now.
> > 
> > The more blocked we add the further out we look; with the obvious limit
> > of the entire averaging period.
> > 
> > So the avg that is runnable is right now, t_0; the avg that is runnable +
> > blocked is t_0 + p, where p is the avg period over which we expect the
> > blocked contribution to appear.
> 
> So the above p for period, is unrelated to the below p which is a
> probability function.
> 
> > So something like:
> > 
> >   avg = runnable + p(i) * blocked; where p(i) \e [0,1]
> > 
> > could maybe be used to replace the cpu_load array and still represent
> > the concept of looking at a bigger picture for larger sets. Leaving open
> > the details of the map p.
> 
> We probably want to assume task wakeup is constant over time, so p (our
> probability function) should probably be an exponential distribution.

Ah, makes more sense now.

You propose that we don't actually try to keep track of which tasks
might wake up when, but just factor in more and more of the blocked load
depending on how conservative a load estimate we want?

I think that could work if we sort out the priority scaling issue that I
mentioned before.


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Morten Rasmussen
On Tue, Jan 07, 2014 at 02:11:22PM +, Vincent Guittot wrote:
> On 7 January 2014 14:22, Peter Zijlstra  wrote:
> > On Tue, Jan 07, 2014 at 09:32:04AM +0100, Vincent Guittot wrote:
> >> On 6 January 2014 17:31, Peter Zijlstra  wrote:
> >> > On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
> >> >> IMHO, these settings will disappear sooner or later, as an example the
> >> >> idle/busy _idx are going to be removed by Alex's patch.
> >> >
> >> > Well I'm still entirely unconvinced by them..
> >> >
> >> > removing the cpu_load array makes sense, but I'm starting to doubt the
> >> > removal of the _idx things.. I think we want to retain them in some
> >> > form, it simply makes sense to look at longer term averages when looking
> >> > at larger CPU groups.
> >> >
> >> > So maybe we can express the things in log_2(group-span) or so, but we
> >> > need a working replacement for the cpu_load array. Ideally some
> >> > expression involving the blocked load.
> >>
> >> Using the blocked load can surely give a benefit in the load balance
> >> because it gives a view of the potential load on a core, but it still
> >> decays at the same speed as the runnable load average, so it doesn't
> >> solve the issue of a longer-term average. One way is to have a runnable
> >> average load with a longer time window

The blocked load discussion comes up again :)

I totally agree that blocked load would be useful, but only if we get
the priority problem sorted out. Blocked load is the sum of load_contrib
of blocked tasks, which means that a tiny high priority task can have a
massive contribution to the blocked load.

> >
> > Ah, another way of looking at it is that the avg without blocked
> > component is a 'now' picture. It is the load we are concerned with right
> > now.
> >
> > The more blocked we add the further out we look; with the obvious limit
> > of the entire averaging period.
> >
> > So the avg that is runnable is right now, t_0; the avg that is runnable +
> > blocked is t_0 + p, where p is the avg period over which we expect the
> > blocked contribution to appear.
> >
> > So something like:
> >
> >   avg = runnable + p(i) * blocked; where p(i) \e [0,1]
> >
> > could maybe be used to replace the cpu_load array and still represent
> > the concept of looking at a bigger picture for larger sets. Leaving open
> > the details of the map p.

Figuring out p is the difficult bit. AFAIK, with blocked load in its
current form we don't have any clue when a task will reappear.

> 
> That needs to be studied more deeply but that could be a way to have a
> larger picture

Agree.

> 
> Another point is that we are using the runnable and blocked load averages,
> which are the sums of the tasks' load_avg_contrib, but we are not using
> the runnable_avg_sum of the cpus, which is not the 'now' picture but an
> average of the past running time (without taking task weight into
> account)

Yes. The rq runnable_avg_sum is an excellent longer term load indicator.
It can't be compared with the runnable and blocked load though. The
other alternative that I can think of is to introduce an unweighted
alternative to blocked load. That is, sum of load_contrib/priority.
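
A sketch of what that unweighted alternative could look like (user-space
C borrowing the runnable_avg_sum/runnable_avg_period field names; the
helpers themselves are made up):

  #include <stdio.h>

  #define NICE_0_LOAD 1024

  struct task_sample {
          unsigned long weight;              /* prio_to_weight[] value */
          unsigned long runnable_avg_sum;    /* decayed running time   */
          unsigned long runnable_avg_period; /* decayed total time     */
  };

  /* today's contribution: scaled by the task's own weight */
  static unsigned long weighted_contrib(const struct task_sample *t)
  {
          return t->weight * t->runnable_avg_sum / t->runnable_avg_period;
  }

  /* unweighted alternative: every task scaled by the nice-0 weight,
   * i.e. load_contrib with the priority divided back out */
  static unsigned long unweighted_contrib(const struct task_sample *t)
  {
          return NICE_0_LOAD * t->runnable_avg_sum / t->runnable_avg_period;
  }

  int main(void)
  {
          /* nice -10 task (weight 9548) runnable 20% of the time */
          struct task_sample t = { 9548, 200, 1000 };

          printf("weighted:   %lu\n", weighted_contrib(&t));    /* 1909 */
          printf("unweighted: %lu\n", unweighted_contrib(&t));  /*  204 */
          return 0;
  }

Summing the unweighted version over blocked tasks would give a blocked
'utilization' that can be compared against cpu capacity, independent of
task priorities.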

Morten


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Vincent Guittot
On 7 January 2014 14:22, Peter Zijlstra  wrote:
> On Tue, Jan 07, 2014 at 09:32:04AM +0100, Vincent Guittot wrote:
>> On 6 January 2014 17:31, Peter Zijlstra  wrote:
>> > On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
>> >> IMHO, these settings will disappear sooner or later, as an example the
>> >> idle/busy _idx are going to be removed by Alex's patch.
>> >
>> > Well I'm still entirely unconvinced by them..
>> >
>> > removing the cpu_load array makes sense, but I'm starting to doubt the
>> > removal of the _idx things.. I think we want to retain them in some
>> > form, it simply makes sense to look at longer term averages when looking
>> > at larger CPU groups.
>> >
>> > So maybe we can express the things in log_2(group-span) or so, but we
>> > need a working replacement for the cpu_load array. Ideally some
>> > expression involving the blocked load.
>>
>> Using the blocked load can surely give a benefit in the load balance
>> because it gives a view of the potential load on a core, but it still
>> decays at the same speed as the runnable load average, so it doesn't
>> solve the issue of a longer-term average. One way is to have a runnable
>> average load with a longer time window
>
> Ah, another way of looking at it is that the avg without blocked
> component is a 'now' picture. It is the load we are concerned with right
> now.
>
> The more blocked we add the further out we look; with the obvious limit
> of the entire averaging period.
>
> So the avg that is runnable is right now, t_0; the avg that is runnable +
> blocked is t_0 + p, where p is the avg period over which we expect the
> blocked contribution to appear.
>
> So something like:
>
>   avg = runnable + p(i) * blocked; where p(i) \e [0,1]
>
> could maybe be used to replace the cpu_load array and still represent
> the concept of looking at a bigger picture for larger sets. Leaving open
> the details of the map p.

That needs to be studied more deeply, but it could be a way to get a
larger picture.

Another point is that we are using the runnable and blocked load averages,
which are the sums of the tasks' load_avg_contrib, but we are not using
the runnable_avg_sum of the cpus, which is not the 'now' picture but an
average of the past running time (without taking task weight into
account)

Vincent


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Peter Zijlstra
On Tue, Jan 07, 2014 at 02:22:20PM +0100, Peter Zijlstra wrote:

I just realized there's two different p's in there.

> Ah, another way of looking at it is that the avg without blocked
> component is a 'now' picture. It is the load we are concerned with right
> now.
> 
> The more blocked we add the further out we look; with the obvious limit
> of the entire averaging period.
> 
> So the avg that is runnable is right now, t_0; the avg that is runnable +
> blocked is t_0 + p, where p is the avg period over which we expect the
> blocked contribution to appear.

So the above p for period, is unrelated to the below p which is a
probability function.

> So something like:
> 
>   avg = runnable + p(i) * blocked; where p(i) \e [0,1]
> 
> could maybe be used to replace the cpu_load array and still represent
> the concept of looking at a bigger picture for larger sets. Leaving open
> the details of the map p.

We probably want to assume task wakeup is constant over time, so p (our
probability function) should probably be an exponential distribution.
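
A purely illustrative reading of that, sketched in user-space C (the
function names, the per-level horizon and the 4 ms mean sleep time are
all invented numbers, not anything measured):

  #include <math.h>
  #include <stdio.h>

  /*
   * If sleep times are exponentially distributed with a given mean, the
   * probability that a blocked task is runnable again within the horizon
   * we look ahead at domain level i is 1 - exp(-horizon(i)/mean).
   */
  static double p_of_level(int level, double mean_sleep_ms)
  {
          double horizon_ms = 1.0 * (1 << level);  /* 1, 2, 4, 8, ... ms */

          return 1.0 - exp(-horizon_ms / mean_sleep_ms);
  }

  /* avg = runnable + p(i) * blocked */
  static double biased_load(double runnable, double blocked, int level)
  {
          return runnable + p_of_level(level, 4.0) * blocked;
  }

  int main(void)
  {
          for (int level = 0; level < 5; level++)
                  printf("level %d: avg = %.0f\n", level,
                         biased_load(1000.0, 500.0, level));
          return 0;
  }

Small groups then mostly see the 'now' picture while the larger levels
fold in nearly all of the blocked contribution, which is the 'bigger
picture for larger sets' idea from earlier in the thread.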


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Peter Zijlstra
On Tue, Jan 07, 2014 at 09:32:04AM +0100, Vincent Guittot wrote:
> On 6 January 2014 17:31, Peter Zijlstra  wrote:
> > On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
> >> IMHO, these settings will disappear sooner or later, as an example the
> >> idle/busy _idx are going to be removed by Alex's patch.
> >
> > Well I'm still entirely unconvinced by them..
> >
> > removing the cpu_load array makes sense, but I'm starting to doubt the
> > removal of the _idx things.. I think we want to retain them in some
> > form, it simply makes sense to look at longer term averages when looking
> > at larger CPU groups.
> >
> > So maybe we can express the things in log_2(group-span) or so, but we
> > need a working replacement for the cpu_load array. Ideally some
> > expression involving the blocked load.
> 
> Using the blocked load can surely give a benefit in the load balance
> because it gives a view of the potential load on a core, but it still
> decays at the same speed as the runnable load average, so it doesn't
> solve the issue of a longer-term average. One way is to have a runnable
> average load with a longer time window

Ah, another way of looking at it is that the avg without blocked
component is a 'now' picture. It is the load we are concerned with right
now.

The more blocked we add the further out we look; with the obvious limit
of the entire averaging period.

So the avg that is runnable is right now, t_0; the avg that is runnable +
blocked is t_0 + p, where p is the avg period over which we expect the
blocked contribution to appear.

So something like:

  avg = runnable + p(i) * blocked; where p(i) \e [0,1]

could maybe be used to replace the cpu_load array and still represent
the concept of looking at a bigger picture for larger sets. Leaving open
the details of the map p.


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Vincent Guittot
On 1 January 2014 06:00, Preeti U Murthy  wrote:
> Hi Vincent,
>
> On 12/18/2013 06:43 PM, Vincent Guittot wrote:
>> This patch applies on top of the two patches [1][2] that have been proposed by
>> Peter for creating a new way to initialize sched_domain. It includes some minor
>> compilation fixes and a trial of using this new method on ARM platform.
>> [1] https://lkml.org/lkml/2013/11/5/239
>> [2] https://lkml.org/lkml/2013/11/5/449
>>
>> Based on the results of these tests, my feeling about this new way to init the
>> sched_domain is a bit mixed.
>>
>> The good point is that I have been able to create the same sched_domain
>> topologies as before and even more complex ones (where a subset of the cores
>> in a cluster share their powergating capabilities). I have described various
>> topology results below.
>>
>> I use a system that is made of a dual cluster of quad cores with 
>> hyperthreading
>> for my examples.
>>
>> If one cluster (0-7) can powergate its cores independently but not the other
>> cluster (8-15) we have the following topology, which is equal to what I had
>> previously:
>>
>> CPU0:
>> domain 0: span 0-1 level: SMT
>> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>> groups: 0 1
>>   domain 1: span 0-7 level: MC
>>   flags: SD_SHARE_PKG_RESOURCES
>>   groups: 0-1 2-3 4-5 6-7
>> domain 2: span 0-15 level: CPU
>> flags:
>> groups: 0-7 8-15
>>
>> CPU8
>> domain 0: span 8-9 level: SMT
>> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>> groups: 8 9
>>   domain 1: span 8-15 level: MC
>>   flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>>   groups: 8-9 10-11 12-13 14-15
>> domain 2: span 0-15 level CPU
>> flags:
>> groups: 8-15 0-7
>>
>> We can even describe some more complex topologies if a subset (2-7) of the
>> cluster can't powergate independently:
>>
>> CPU0:
>> domain 0: span 0-1 level: SMT
>> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>> groups: 0 1
>>   domain 1: span 0-7 level: MC
>>   flags: SD_SHARE_PKG_RESOURCES
>>   groups: 0-1 2-7
>> domain 2: span 0-15 level: CPU
>> flags:
>> groups: 0-7 8-15
>>
>> CPU2:
>> domain 0: span 2-3 level: SMT
>> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>> groups: 2 3
>>   domain 1: span 2-7 level: MC
>>   flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>> groups: 2-3 4-5 6-7
>> domain 2: span 0-7 level: MC
>> flags: SD_SHARE_PKG_RESOURCES
>> groups: 2-7 0-1
>>   domain 3: span 0-15 level: CPU
>>   flags:
>>   groups: 0-7 8-15
>>
>> In this case, we have an additional sched_domain MC level for this subset (2-7)
>> of cores so we can trigger some load balance in this subset before doing that
>> on the complete cluster (which is the last level of cache in my example)
>>
>> We can add more levels that will describe other dependency/independency like
>> the frequency scaling dependency and as a result the final sched_domain
>> topology will have additional levels (if they have not been removed during
>> the degenerate sequence)
>
> The design looks good to me. In my opinion information like P-states and
> C-states dependency can be kept separate from the topology levels, it
> might get too complicated unless the information is tightly coupled to
> the topology.
>
>>
>> My concern is about the configuration of the table that is used to create the
>> sched_domain. Some levels are "duplicated" with different flags configuration
>
> I do not feel this is a problem since the levels are not duplicated,
> rather they have different properties within them which is best
> represented by flags like you have introduced in this patch.
>
>> which makes the table not easily readable, and we must also take care of the
>> order because parents have to gather all cpus of their children. So we must
>> choose which capabilities will be a subset of the other one. The order is
>
> The sched domain levels which have SD_SHARE_POWERDOMAIN set is expected
> to have cpus which are a subset of the cpus that this domain would have
> included had this flag not been set. In addition to this every higher
> domain, irrespective of SD_SHARE_POWERDOMAIN being set, will include all
> cpus of the lower domains. As far as I see, this patch does not change
> these assumptions. Hence I am unable to imagine a scenario when the
> parent might not include all cpus of its children domain. Do you have
> such a scenario in mind which can arise due to this patch ?

My patch doesn't have this issue because I have added only one layer, which is
always a subset of the current cache-level topology, but if we add
another feature with another layer, we have to decide which feature
will be a subset of the other one.

Vincent

>
> Thanks
>
> Regards
> Preeti U Murthy
>

Re: [RFC] sched: CPU topology try

2014-01-07 Thread Vincent Guittot
On 7 January 2014 11:39, Preeti U Murthy  wrote:
> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
>> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
>>> What if we want to add arch specific flags to the NUMA domain? Currently
>>> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
>>> the arch can modify the sd flags of the topology levels till just before
>>> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
>>> initialized. We need to perhaps call into arch here to probe for
>>> additional flags?
>>
>> What are you thinking of? I was hoping all NUMA details were captured in
>> the distance table.
>>
>> Its far easier to talk of specifics in this case.
>>
> If the processor can be core gated, then there is very little power
> savings that we could yield from consolidating all the load onto a
> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
> nodes, the power is drawn by 6 cores in all. So I was thinking under
> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
> the NUMA domain and spread the load if it favours the workload.

The policy of keeping the tasks running on cores that are close (same
node) to the memory is the more power-efficient one, isn't it? So it's
probably more about where to place the memory than about where to
place the tasks?

Vincent

>
> Regards
> Preeti U Murthy
>


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Morten Rasmussen
On Tue, Jan 07, 2014 at 10:39:39AM +, Preeti U Murthy wrote:
> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
> >> What if we want to add arch specific flags to the NUMA domain? Currently
> >> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
> >> the arch can modify the sd flags of the topology levels till just before
> >> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
> >> initialized. We need to perhaps call into arch here to probe for
> >> additional flags?
> > 
> > What are you thinking of? I was hoping all NUMA details were captured in
> > the distance table.
> > 
> > Its far easier to talk of specifics in this case.
> > 
> If the processor can be core gated, then there is very little power
> savings that we could yield from consolidating all the load onto a
> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
> nodes, the power is drawn by 6 cores in all.

Not being a NUMA expert, I would have thought that load consolidation at
node level would nearly always save power even when cpus can be power
gated individually. The number of cpus awake is the same, but you only
need to power the caches, memory, and other node peripherals for one
node instead of two in your example. Wouldn't that save power?

Memory/cache intensive workloads might benefit from spreading at node
level though. 

Am I missing something?

Morten


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Peter Zijlstra
On Tue, Jan 07, 2014 at 04:09:39PM +0530, Preeti U Murthy wrote:
> On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
> >> What if we want to add arch specific flags to the NUMA domain? Currently
> >> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
> >> the arch can modify the sd flags of the topology levels till just before
> >> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
> >> initialized. We need to perhaps call into arch here to probe for
> >> additional flags?
> > 
> > What are you thinking of? I was hoping all NUMA details were captured in
> > the distance table.
> > 
> > Its far easier to talk of specifics in this case.
> > 
> If the processor can be core gated, then there is very little power
> savings that we could yield from consolidating all the load onto a
> single node in a NUMA domain. 6 cores on one node or 3 cores each on two
> nodes, the power is drawn by 6 cores in all. So I was thinking under
> this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
> the NUMA domain and spread the load if it favours the workload.

So Intel has so far not said a lot of sensible things about power
management on their multi-socket platform.

And I've not heard anything at all from IBM on the POWER chips.

What I know from the Intel side is that package idle hardly saves
anything when compared to the DRAM power and the cost of having to do
remote memory accesses.

In other words, I'm not at all considering power aware scheduling for
NUMA systems until someone starts talking sense :-)


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Preeti U Murthy
On 01/07/2014 03:20 PM, Peter Zijlstra wrote:
> On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
>> What if we want to add arch specific flags to the NUMA domain? Currently
>> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
>> the arch can modify the sd flags of the topology levels till just before
>> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
>> initialized. We need to perhaps call into arch here to probe for
>> additional flags?
> 
> What are you thinking of? I was hoping all NUMA details were captured in
> the distance table.
> 
> Its far easier to talk of specifics in this case.
> 
If the processor can be core gated, then there is very little power
savings that we could yield from consolidating all the load onto a
single node in a NUMA domain. 6 cores on one node or 3 cores each on two
nodes, the power is drawn by 6 cores in all. So I was thinking under
this circumstance we might want to set the SD_SHARE_POWERDOMAIN flag at
the NUMA domain and spread the load if it favours the workload.

Regards
Preeti U Murthy



Re: [RFC] sched: CPU topology try

2014-01-07 Thread Peter Zijlstra
On Mon, Jan 06, 2014 at 05:15:30PM +, Morten Rasmussen wrote:
> Is there any examples of frequency domains not matching the span of a
> sched_domain?

nafaik, but I don't really know much about this anyway.

> I would have thought that we would have a matching sched_domain to hang
> the P and C state information from for most systems. If not, we could
> just add it.

This was my thought too.


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Peter Zijlstra
On Tue, Jan 07, 2014 at 03:10:21PM +0530, Preeti U Murthy wrote:
> What if we want to add arch specific flags to the NUMA domain? Currently
> with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
> the arch can modify the sd flags of the topology levels till just before
> the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
> initialized. We need to perhaps call into arch here to probe for
> additional flags?

What are you thinking of? I was hoping all NUMA details were captured in
the distance table.

Its far easier to talk of specifics in this case.


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Preeti U Murthy
Hi Vincent, Peter,

On 12/18/2013 06:43 PM, Vincent Guittot wrote:
> This patch applies on top of the two patches [1][2] that have been proposed by
> Peter for creating a new way to initialize sched_domain. It includes some minor
> compilation fixes and a trial of using this new method on ARM platform.
> [1] https://lkml.org/lkml/2013/11/5/239
> [2] https://lkml.org/lkml/2013/11/5/449
> 
> Based on the results of these tests, my feeling about this new way to init the
> sched_domain is a bit mixed.
> 
> The good point is that I have been able to create the same sched_domain
> topologies as before and even more complex ones (where a subset of the cores
> in a cluster share their powergating capabilities). I have described various
> topology results below.
> 
> I use a system that is made of a dual cluster of quad cores with 
> hyperthreading
> for my examples.
> 
> If one cluster (0-7) can powergate its cores independently but not the other
> cluster (8-15) we have the following topology, which is equal to what I had
> previously:
> 
> CPU0:
> domain 0: span 0-1 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 0 1
>   domain 1: span 0-7 level: MC
>   flags: SD_SHARE_PKG_RESOURCES
>   groups: 0-1 2-3 4-5 6-7
> domain 2: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
> 
> CPU8
> domain 0: span 8-9 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 8 9
>   domain 1: span 8-15 level: MC
>   flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>   groups: 8-9 10-11 12-13 14-15
> domain 2: span 0-15 level CPU
> flags:
> groups: 8-15 0-7
> 
> We can even describe some more complex topologies if a subset (2-7) of the
> cluster can't powergate independently:
> 
> CPU0:
> domain 0: span 0-1 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 0 1
>   domain 1: span 0-7 level: MC
>   flags: SD_SHARE_PKG_RESOURCES
>   groups: 0-1 2-7
> domain 2: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
> 
> CPU2:
> domain 0: span 2-3 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 2 3
>   domain 1: span 2-7 level: MC
>   flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 2-3 4-5 6-7
> domain 2: span 0-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES
> groups: 2-7 0-1
>   domain 3: span 0-15 level: CPU
>   flags:
>   groups: 0-7 8-15
> 
> In this case, we have an additional sched_domain MC level for this subset (2-7)
> of cores so we can trigger some load balance in this subset before doing that
> on the complete cluster (which is the last level of cache in my example)
> 
> We can add more levels that will describe other dependency/independency like
> the frequency scaling dependency and as a result the final sched_domain
> topology will have additional levels (if they have not been removed during
> the degenerate sequence)
> 
> My concern is about the configuration of the table that is used to create the
> sched_domain. Some levels are "duplicated" with different flags configuration
> which makes the table not easily readable, and we must also take care of the
> order because parents have to gather all cpus of their children. So we must
> choose which capabilities will be a subset of the other one. The order is
> almost straightforward when we describe 1 or 2 kinds of capabilities
> (package resource sharing and power sharing), but it can become complex if we
> want to add more.

What if we want to add arch specific flags to the NUMA domain? Currently
with Peter's patch:https://lkml.org/lkml/2013/11/5/239 and this patch,
the arch can modify the sd flags of the topology levels till just before
the NUMA domain. In sd_init_numa(), the flags for the NUMA domain get
initialized. We need to perhaps call into arch here to probe for
additional flags?

Thanks

Regards
Preeti U Murthy
> 
> Regards
> Vincent
> 



Re: [RFC] sched: CPU topology try

2014-01-07 Thread Vincent Guittot
On 6 January 2014 17:31, Peter Zijlstra  wrote:
> On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
>> IMHO, these settings will disappear sooner or later, as an example the
>> idle/busy _idx are going to be removed by Alex's patch.
>
> Well I'm still entirely unconvinced by them..
>
> removing the cpu_load array makes sense, but I'm starting to doubt the
> removal of the _idx things.. I think we want to retain them in some
> form, it simply makes sense to look at longer term averages when looking
> at larger CPU groups.
>
> So maybe we can express the things in log_2(group-span) or so, but we
> need a working replacement for the cpu_load array. Ideally some
> expression involving the blocked load.

Using the blocked load can surely give a benefit to load balancing
because it gives a view of the potential load on a core, but it still
decays at the same speed as the runnable load average, so it doesn't
solve the issue of a longer-term average. One way is to have a runnable
average load with a longer time window.
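
As an illustration of that last idea, a second, slower-decaying series could
be tracked next to the existing one. The snippet below is only a sketch with
made-up names and approximate per-period decay factors (the existing series
uses y^32 = 0.5; the slow variant here uses y^128 = 0.5, i.e. it remembers
activity roughly four times longer):

#include <stdint.h>

#define DECAY_SHIFT	10		/* fixed point: 1.0 == 1024 */
#define Y_FAST		1002		/* ~0.5^(1/32)  * 1024, like today */
#define Y_SLOW		1018		/* ~0.5^(1/128) * 1024, longer window */

struct twin_avg {
	uint64_t fast;			/* behaves like runnable_avg_sum */
	uint64_t slow;			/* longer-term variant */
};

/* called once per accounting period; 'running' says if the cpu was busy */
static void update_twin_avg(struct twin_avg *a, int running)
{
	a->fast = (a->fast * Y_FAST) >> DECAY_SHIFT;
	a->slow = (a->slow * Y_SLOW) >> DECAY_SHIFT;
	if (running) {
		a->fast += 1024;
		a->slow += 1024;
	}
}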

>
> Its another one of those things I need to ponder more :-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2014-01-07 Thread Vincent Guittot
On 6 January 2014 17:21, Peter Zijlstra  wrote:
> On Wed, Dec 18, 2013 at 02:13:51PM +0100, Vincent Guittot wrote:
>> This patch applies on top of the two patches [1][2] that have been proposed 
>> by
>> Peter for creating a new way to initialize sched_domain. It includes some 
>> minor
>> compilation fixes and a trial of using this new method on ARM platform.
>> [1] https://lkml.org/lkml/2013/11/5/239
>> [2] https://lkml.org/lkml/2013/11/5/449
>>
>> Based on the results of these tests, my feeling about this new way to init the
>> sched_domain is a bit mixed.
>
> Yay :-)
>
>> We can add more levels that will describe other dependencies/independencies,
>> such as the frequency scaling dependency, and as a result the final
>> sched_domain topology will have additional levels (if they have not been
>> removed during the degenerate sequence).
>
> Yeah, this 'creative' use of degenerate domains is pretty neat ;-)

thanks :-)

>
>> My concern is about the configuration of the table that is used to create the
>> sched_domain. Some levels are "duplicated" with different flag configurations,
>> which makes the table not easily readable, and we must also take care of the
>> order because a parent has to gather all the cpus of its children. So we must
>> choose which capabilities will be a subset of the other ones. The order is
>> almost straightforward when we describe one or two kinds of capabilities
>> (package resource sharing and power sharing) but it can become complex if we
>> want to add more.
>
> I think I see what you're saying, although I hope that won't actually
> happen in real hardware -- that said, people do tend to do crazy with
> these ARM chips :/

it should be OK for ARM chips because the cores in a cluster share the
same clock, but that doesn't mean it will not be possible in the near
future or on other archs.

>
> We should also try and be conservative in the topology flags we want to
> add, which should further reduce the amount of pain here.

yes, I see an interest in power-domain sharing and clock sharing flags,
so that should minimize the complexity.
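
For illustration, such a table could look like the sketch below; everything
with an "_example" suffix is a name invented here, cpu_corepower_mask stands
for whatever mask function returns the cores sharing a power domain, the
struct layout loosely follows the sched_domain_topology_level of Peter's
patches and may differ in detail, and a clock/frequency-sharing flag would
simply slot in as one more level built the same way:

static int cpu_corepower_flags_example(void)
{
	/* cores that can powergate together */
	return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
}

static int cpu_core_flags_example(void)
{
	/* all cores sharing the last level cache */
	return SD_SHARE_PKG_RESOURCES;
}

static struct sched_domain_topology_level arm_topology_example[] = {
	{ cpu_corepower_mask, cpu_corepower_flags_example, SD_INIT_NAME(GMC) },
	{ cpu_coregroup_mask, cpu_core_flags_example, SD_INIT_NAME(MC) },
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};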

>
> For now I do think this is a viable approach.. Yes its a bit cumbersome
> for these asymmetric systems but it does give us enough to start
> playing.

ok

Vincent
>
> I have yet to read Morten's emails on the P and C states, will try to
> have a look at those tomorrow with a hopefully fresher brain -- somehow
> it's the end of the day already..
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2014-01-06 Thread Morten Rasmussen
On Mon, Jan 06, 2014 at 04:28:13PM +, Peter Zijlstra wrote:
> On Mon, Dec 23, 2013 at 06:22:17PM +0100, Dietmar Eggemann wrote:
> > I'm not sure if the idea to create a dedicated sched_domain level for every
> > topology flag representing a specific functionality will scale. From the
> > perspective of energy-aware scheduling we need e.g. energy costs (P and C
> > state) which can only be populated towards the scheduler via an additional
> > sub-struct and additional function arch_sd_energy() like depicted in
> > Morten's email:
> > 
> > [2] lkml.org/lkml/2013/11/14/102
> 
> That lkml.org link is actually not working for me (blank page -- maybe
> lkml.org is on the blink again).
> 
> That said, I yet have to sit down and think about the P state stuff, but
> I was thinking we need some rudimentary domain support for that.
> 
> For instance, the big-little thingies seem to share their P state per
> cluster, so we need a domain at that level to hang some state off of --
> which we actually have in this case. But we need to ensure we do have
> it -- somehow.

Are there any examples of frequency domains not matching the span of a
sched_domain?

I would have thought that we would have a matching sched_domain to hang
the P and C state information from for most systems. If not, we could
just add it.

I don't think it is safe to assume that big-little always has cluster
P-states. It is implementation dependent. But the most obvious
alternative would be to have per-cpu P-states in which case we would
also have a matching sched_domain.
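
To make the "hang the P and C state information off a matching sched_domain"
idea concrete, a per-level sub-struct could carry the energy data. The field
names below are purely illustrative and not from any posted patch:

struct sd_energy_example {
	unsigned int nr_cap_states;	/* number of P-states in the domain */
	struct {
		unsigned int cap;	/* compute capacity at this P-state */
		unsigned int power;	/* power cost at this P-state */
	} cap_state[8];
	unsigned int idle_power;	/* cost of the shared (cluster) C-state */
};

/*
 * The topology level whose span matches the frequency domain (e.g. the MC
 * level for per-cluster P-states) would carry a pointer to this data,
 * filled in by something like the arch_sd_energy() hook mentioned
 * elsewhere in this thread.
 */
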
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2014-01-06 Thread Arjan van de Ven



AFAICT this is a chicken-egg problem, the OS never did anything useful
with it so the hardware guys are now trying to do something with it, but
this also means that if we cannot predict what the hardware will do
under certain circumstances the OS really cannot do anything smart
anymore.

So yes, for certain hardware we'll just have to give up and not do
anything.

That said, some hardware still does allow us to do something and for
those we do need some of this.

Maybe if the OS becomes smart enough the hardware guys will give us some
control again, who knows.

So yes, I'm entirely fine saying that some chips are fucked and we can't
do anything sane with them.. Fine they get to sort things themselves.


That is; you're entirely unhelpful and I'm tempted to stop listening
to whatever you have to say on the subject.

Most of your emails are about how stuff cannot possibly work; without
saying how things can work.

The entire point of adding P and C state information to the scheduler is
so that we CAN do cross cpu decisions, but if you're saying we shouldn't
attempt because you can't say how the hardware will react anyway; fine
we'll ignore Intel hardware from now on.


that's not what I'm trying to say.

if we as the OS want to help make such decisions, we also need to face the
reality of what that means, and see how we can get there.

let me give a simple but common example case of a 2-core system where the
cores share a P state.
one task (A) is high priority/high utilization/whatever
(e.g. it would cause the OS to ask for high performance from the CPU if it
were by itself)
the other task (B), on the 2nd core, is not that high priority/utilization/etc
(e.g. it would cause the OS to ask for max power savings from the CPU if it
were by itself)


time    core 0          core 1   what the combined probably should be
0       task A          idle     max performance
1       task A          task B   max performance
2       idle (disk IO)  task B   least power
3       task A          task B   max performance

e.g. a simple case of task A running, and task B coming in... but then task A
blocks briefly, on say disk IO or some mutex or whatever.

we as the OS will need to figure out how to get to the combined result, in a
way that's relatively race free, with two common races to take care of:
 * knowing if another core is idle at any time is inherently racy.. it may
   wake up or go idle the next cycle
 * in hardware modes where the OS controls all, the P state registers tend to
   work in a "the last one to write on any core controls them all" way; we
   need to make sure we don't fight ourselves here, and assign one core to do
   this decision/communication to hardware on behalf of the whole domain
   (even if that assignment may move around when the assigned core goes idle)
   rather than having the various cores do it themselves asynchronously.
   This tends to be harder than it seems if you also don't want to lose
   efficiency (e.g. no significant extra wakeups from idle and also not
   missing opportunities to go to "least power" in the "time 2" scenario
   above)


x86 and modern ARM (snapdragon at least) do this kind of coordination in
hardware/microcontroller (with an opt-in for the OS to do it itself on x86
and likely snapdragon), which means the race conditions are not really there.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2014-01-06 Thread Peter Zijlstra
On Mon, Jan 06, 2014 at 05:48:38PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 06, 2014 at 08:37:13AM -0800, Arjan van de Ven wrote:
> > On 1/6/2014 8:33 AM, Peter Zijlstra wrote:
> > >On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:
> > >>The design looks good to me. In my opinion information like P-states and
> > >>C-states dependency can be kept separate from the topology levels, it
> > >>might get too complicated unless the information is tightly coupled to
> > >>the topology.
> > >
> > >I'm not entirely convinced we can keep them separated, the moment we
> > >have multiple CPUs sharing a P or C state we need somewhere to manage
> > >the shared state and the domain tree seems like the most natural place
> > >for this.
> > >
> > >Now it might well be both P and C states operate at 'natural' domains
> > >which we already have so it might be 'easy'.
> > 
> > more than that though.. P and C state sharing is mostly hidden from the OS
> > (because the OS does not have the ability to do this; e.g. there are things
> > that do "if THIS cpu goes idle, the OTHER cpu P state changes automatic".
> > 
> > that's not just on x86, the ARM guys (iirc at least the latest snapdragon)  
> > are going in that
> > direction as well.
> > 
> > for those systems, the OS really should just make local decisions and let 
> > the hardware
> > cope with hardware grouping.
> 
> AFAICT this is a chicken-egg problem, the OS never did anything useful
> with it so the hardware guys are now trying to do something with it, but
> this also means that if we cannot predict what the hardware will do
> under certain circumstances the OS really cannot do anything smart
> anymore.
> 
> So yes, for certain hardware we'll just have to give up and not do
> anything.
> 
> That said, some hardware still does allow us to do something and for
> those we do need some of this.
> 
> Maybe if the OS becomes smart enough the hardware guys will give us some
> control again, who knows.
> 
> So yes, I'm entirely fine saying that some chips are fucked and we can't
> do anything sane with them.. Fine they get to sort things themselves.

That is; you're entirely unhelpful and I'm tempted to stop listening
to whatever you have to say on the subject.

Most of your emails are about how stuff cannot possibly work; without
saying how things can work.

The entire point of adding P and C state information to the scheduler is
so that we CAN do cross cpu decisions, but if you're saying we shouldn't
attempt because you can't say how the hardware will react anyway; fine
we'll ignore Intel hardware from now on.

So bloody stop saying what cannot work and start telling how we can make
useful cross cpu decisions.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2014-01-06 Thread Peter Zijlstra
On Mon, Jan 06, 2014 at 08:37:13AM -0800, Arjan van de Ven wrote:
> On 1/6/2014 8:33 AM, Peter Zijlstra wrote:
> >On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:
> >>The design looks good to me. In my opinion information like P-states and
> >>C-states dependency can be kept separate from the topology levels, it
> >>might get too complicated unless the information is tightly coupled to
> >>the topology.
> >
> >I'm not entirely convinced we can keep them separated, the moment we
> >have multiple CPUs sharing a P or C state we need somewhere to manage
> >the shared state and the domain tree seems like the most natural place
> >for this.
> >
> >Now it might well be both P and C states operate at 'natural' domains
> >which we already have so it might be 'easy'.
> 
> more than that though.. P and C state sharing is mostly hidden from the OS
> (because the OS does not have the ability to do this; e.g. there are things
> that do "if THIS cpu goes idle, the OTHER cpu P state changes automatic".
> 
> that's not just on x86, the ARM guys (iirc at least the latest snapdragon)  
> are going in that
> direction as well.
> 
> for those systems, the OS really should just make local decisions and let the 
> hardware
> cope with hardware grouping.

AFAICT this is a chicken-egg problem, the OS never did anything useful
with it so the hardware guys are now trying to do something with it, but
this also means that if we cannot predict what the hardware will do
under certain circumstances the OS really cannot do anything smart
anymore.

So yes, for certain hardware we'll just have to give up and not do
anything.

That said, some hardware still does allow us to do something and for
those we do need some of this.

Maybe if the OS becomes smart enough the hardware guys will give us some
control again, who knows.

So yes, I'm entirely fine saying that some chips are fucked and we can't
do anything sane with them.. Fine they get to sort things themselves.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2014-01-06 Thread Arjan van de Ven

On 1/6/2014 8:33 AM, Peter Zijlstra wrote:

On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:

The design looks good to me. In my opinion information like P-states and
C-states dependency can be kept separate from the topology levels, it
might get too complicated unless the information is tightly coupled to
the topology.


I'm not entirely convinced we can keep them separated, the moment we
have multiple CPUs sharing a P or C state we need somewhere to manage
the shared state and the domain tree seems like the most natural place
for this.

Now it might well be both P and C states operate at 'natural' domains
which we already have so it might be 'easy'.


more than that though.. P and C state sharing is mostly hidden from the OS
(because the OS does not have the ability to do this; e.g. there are things
that do "if THIS cpu goes idle, the OTHER cpu P state changes automatically").

that's not just on x86, the ARM guys (iirc at least the latest snapdragon) are
going in that direction as well.

for those systems, the OS really should just make local decisions and let the
hardware cope with hardware grouping.




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2014-01-06 Thread Peter Zijlstra
On Wed, Jan 01, 2014 at 10:30:33AM +0530, Preeti U Murthy wrote:
> The design looks good to me. In my opinion information like P-states and
> C-states dependency can be kept separate from the topology levels, it
> might get too complicated unless the information is tightly coupled to
> the topology.

I'm not entirely convinced we can keep them separated, the moment we
have multiple CPUs sharing a P or C state we need somewhere to manage
the shared state and the domain tree seems like the most natural place
for this.

Now it might well be both P and C states operate at 'natural' domains
which we already have so it might be 'easy'.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2014-01-06 Thread Peter Zijlstra
On Wed, Dec 18, 2013 at 02:13:51PM +0100, Vincent Guittot wrote:
> This patch applies on top of the two patches [1][2] that have been proposed by
> Peter for creating a new way to initialize sched_domain. It includes some 
> minor
> compilation fixes and a trial of using this new method on ARM platform.
> [1] https://lkml.org/lkml/2013/11/5/239
> [2] https://lkml.org/lkml/2013/11/5/449
> 
> Based on the results of these tests, my feeling about this new way to init the
> sched_domain is a bit mixed.

Yay :-)

> We can add more levels that will describe other dependencies/independencies,
> such as the frequency scaling dependency, and as a result the final
> sched_domain topology will have additional levels (if they have not been
> removed during the degenerate sequence).

Yeah, this 'creative' use of degenerate domains is pretty neat ;-)

> My concern is about the configuration of the table that is used to create the
> sched_domain. Some levels are "duplicated" with different flag configurations,
> which makes the table not easily readable, and we must also take care of the
> order because a parent has to gather all the cpus of its children. So we must
> choose which capabilities will be a subset of the other ones. The order is
> almost straightforward when we describe one or two kinds of capabilities
> (package resource sharing and power sharing) but it can become complex if we
> want to add more.

I think I see what you're saying, although I hope that won't actually
happen in real hardware -- that said, people do tend to do crazy with
these ARM chips :/

We should also try and be conservative in the topology flags we want to
add, which should further reduce the amount of pain here.

For now I do think this is a viable approach.. Yes its a bit cumbersome
for these asymmetric systems but it does give us enough to start
playing.

I have yet to read Morten's emails on the P and C states, will try to
have a look at those tomorrow with a hopefully fresher brain -- somehow
it's the end of the day already..
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2014-01-06 Thread Peter Zijlstra
On Mon, Dec 23, 2013 at 06:22:17PM +0100, Dietmar Eggemann wrote:
> I'm not sure if the idea to create a dedicated sched_domain level for every
> topology flag representing a specific functionality will scale. From the
> perspective of energy-aware scheduling we need e.g. energy costs (P and C
> state) which can only be populated towards the scheduler via an additional
> sub-struct and additional function arch_sd_energy() like depicted in
> Morten's email:
> 
> [2] lkml.org/lkml/2013/11/14/102

That lkml.org link is actually not working for me (blank page -- maybe
lkml.org is on the blink again).

That said, I yet have to sit down and think about the P state stuff, but
I was thinking we need some rudimentary domain support for that.

For instance, the big-little thingies seem to share their P state per
cluster, so we need a domain at that level to hang some state off of --
which we actually have in this case. But we need to ensure we do have
it -- somehow.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2014-01-06 Thread Peter Zijlstra
On Mon, Jan 06, 2014 at 02:41:31PM +0100, Vincent Guittot wrote:
> IMHO, these settings will disappear sooner or later, as an example the
> idle/busy _idx are going to be removed by Alex's patch.

Well I'm still entirely unconvinced by them..

removing the cpu_load array makes sense, but I'm starting to doubt the
removal of the _idx things.. I think we want to retain them in some
form, it simply makes sense to look at longer term averages when looking
at larger CPU groups.

So maybe we can express the things in log_2(group-span) or so, but we
need a working replacement for the cpu_load array. Ideally some
expression involving the blocked load.

Its another one of those things I need to ponder more :-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2014-01-06 Thread Vincent Guittot
On 23 December 2013 18:22, Dietmar Eggemann  wrote:
> Hi Vincent,
>
>
> On 18/12/13 14:13, Vincent Guittot wrote:
>>
>> This patch applies on top of the two patches [1][2] that have been
>> proposed by
>> Peter for creating a new way to initialize sched_domain. It includes some
>> minor
>> compilation fixes and a trial of using this new method on ARM platform.
>> [1] https://lkml.org/lkml/2013/11/5/239
>> [2] https://lkml.org/lkml/2013/11/5/449
>
>
> I came up w/ a similar implementation proposal for an arch specific
> interface for scheduler domain set-up a couple of days ago:
>
> [1] https://lkml.org/lkml/2013/12/13/182
>
> I had the following requirements in mind:
>
> 1) The arch should not be able to fine tune individual scheduler behaviour,
> i.e. get rid of the arch specific SD_FOO_INIT macros.
>
> 2) Unify the set-up code for conventional and NUMA scheduler domains.
>
> 3) The arch is able to specify additional scheduler domain level, other than
> SMT, MC, BOOK, and CPU.
>
> 4) Allow integrating the provision of additional topology-related data
> (e.g. energy information) to the scheduler.
>
> Moreover, I think now that:
>
> 5) Something like the existing default set-up via default_topology[] is
> needed to avoid code duplication for archs not interested in (3) or (4).

Hi Dietmar,

I agree. This default array is available in Peter's patch, and my
patch overwrites the default array only if it wants to add more/new
levels.

[snip]

>>
>> CPU2:
>> domain 0: span 2-3 level: SMT
>>  flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES |
>> SD_SHARE_POWERDOMAIN
>>  groups: 2 3
>>domain 1: span 2-7 level: MC
>>flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>>groups: 2-7 4-5 6-7
>>  domain 2: span 0-7 level: MC
>>  flags: SD_SHARE_PKG_RESOURCES
>>  groups: 2-7 0-1
>>domain 3: span 0-15 level: CPU
>>flags:
>>groups: 0-7 8-15
>>
>> In this case, we have an additional sched_domain MC level for this subset (2-7)
>> of cores so we can trigger some load balancing in this subset before doing that
>> on the complete cluster (which is the last level of cache in my example).
>
>
> I think the weakest point right now is the condition in sd_init() where we
> convert the topology flags into scheduler behaviour. We not only introduce a
> very tight coupling between topology flags and scheduler domain level but
> also we need to follow a certain order in the initialization. This bit needs
> more thinking.

IMHO, these settings will disappear sooner or later, as an example the
idle/busy _idx are going to be removed by Alex's patch.

>
>
>>
>> We can add more levels that will describe other dependencies/independencies,
>> such as the frequency scaling dependency, and as a result the final
>> sched_domain topology will have additional levels (if they have not been
>> removed during the degenerate sequence).
>>
>> My concern is about the configuration of the table that is used to create the
>> sched_domain. Some levels are "duplicated" with different flag configurations,
>> which makes the table not easily readable, and we must also take care of the
>> order because a parent has to gather all the cpus of its children. So we must
>> choose which capabilities will be a subset of the other ones. The order is
>> almost straightforward when we describe one or two kinds of capabilities
>> (package resource sharing and power sharing) but it can become complex if we
>> want to add more.
>
>
> I'm not sure if the idea to create a dedicated sched_domain level for every
> topology flag representing a specific functionality will scale. From the

It's up to the arch to decide how many levels it wants to add, whether a
dedicated level is needed, or whether a level can gather several
features/flags. IMHO, having sub-structs for energy information, like
what we have for the cpu/group capacity, will not prevent us from having
a first, quick topology tree description.

> perspective of energy-aware scheduling we need e.g. energy costs (P and C
> state) which can only be populated towards the scheduler via an additional
> sub-struct and additional function arch_sd_energy() like depicted in
> Morten's email:
>
> [2] lkml.org/lkml/2013/11/14/102
>

[snip]

>> +
>> +static int __init arm_sched_topology(void)
>> +{
>> +   sched_domain_topology = arm_topology;
>
>
> return missing

good catch
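
For reference, the fixed-up function would presumably read as below (the
early_initcall placement is an assumption about how it gets registered,
not something taken from the patch):

static int __init arm_sched_topology(void)
{
	sched_domain_topology = arm_topology;

	return 0;
}
early_initcall(arm_sched_topology);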

Thanks

Vincent
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2013-12-31 Thread Preeti U Murthy
Hi Vincent,

On 12/18/2013 06:43 PM, Vincent Guittot wrote:
> This patch applies on top of the two patches [1][2] that have been proposed by
> Peter for creating a new way to initialize sched_domain. It includes some 
> minor
> compilation fixes and a trial of using this new method on ARM platform.
> [1] https://lkml.org/lkml/2013/11/5/239
> [2] https://lkml.org/lkml/2013/11/5/449
> 
> Based on the results of these tests, my feeling about this new way to init the
> sched_domain is a bit mixed.
> 
> The good point is that I have been able to create the same sched_domain
> topologies as before and even more complex ones (where a subset of the cores
> in a cluster share their powergating capabilities). I have described various
> topology results below.
> 
> I use a system that is made of a dual cluster of quad cores with 
> hyperthreading
> for my examples.
> 
> If one cluster (0-7) can powergate its cores independently but not the other
> cluster (8-15) we have the following topology, which is equal to what I had
> previously:
> 
> CPU0:
> domain 0: span 0-1 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 0 1
>   domain 1: span 0-7 level: MC
>   flags: SD_SHARE_PKG_RESOURCES
>   groups: 0-1 2-3 4-5 6-7
> domain 2: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
> 
> CPU8
> domain 0: span 8-9 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 8 9
>   domain 1: span 8-15 level: MC
>   flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>   groups: 8-9 10-11 12-13 14-15
> domain 2: span 0-15 level CPU
> flags:
> groups: 8-15 0-7
> 
> We can even describe some more complex topologies if a subset (2-7) of the
> cluster can't powergate independently:
> 
> CPU0:
> domain 0: span 0-1 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 0 1
>   domain 1: span 0-7 level: MC
>   flags: SD_SHARE_PKG_RESOURCES
>   groups: 0-1 2-7
> domain 2: span 0-15 level: CPU
> flags:
> groups: 0-7 8-15
> 
> CPU2:
> domain 0: span 2-3 level: SMT
> flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
> groups: 2 3
>   domain 1: span 2-7 level: MC
>   flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>   groups: 2-7 4-5 6-7
> domain 2: span 0-7 level: MC
> flags: SD_SHARE_PKG_RESOURCES
> groups: 2-7 0-1
>   domain 3: span 0-15 level: CPU
>   flags:
>   groups: 0-7 8-15
> 
> In this case, we have an additional sched_domain MC level for this subset (2-7)
> of cores so we can trigger some load balancing in this subset before doing that
> on the complete cluster (which is the last level of cache in my example).
> 
> We can add more levels that will describe other dependencies/independencies,
> such as the frequency scaling dependency, and as a result the final
> sched_domain topology will have additional levels (if they have not been
> removed during the degenerate sequence).

The design looks good to me. In my opinion information like P-states and
C-states dependency can be kept separate from the topology levels, it
might get too complicated unless the information is tightly coupled to
the topology.

> 
> My concern is about the configuration of the table that is used to create the
> sched_domain. Some levels are "duplicated" with different flag configurations,

I do not feel this is a problem since the levels are not duplicated;
rather, they have different properties within them, which are best
represented by flags like the ones you have introduced in this patch.

> which makes the table not easily readable, and we must also take care of the
> order because a parent has to gather all the cpus of its children. So we must
> choose which capabilities will be a subset of the other ones. The order is

The sched domain levels which have SD_SHARE_POWERDOMAIN set are expected
to have cpus which are a subset of the cpus that the domain would have
included had this flag not been set. In addition to this, every higher
domain, irrespective of SD_SHARE_POWERDOMAIN being set, will include all
cpus of the lower domains. As far as I can see, this patch does not change
these assumptions. Hence I am unable to imagine a scenario in which the
parent might not include all cpus of its child domains. Do you have
such a scenario in mind which can arise due to this patch?
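
That nesting assumption can be written down as a simple check over the
domain hierarchy; a debug-style sketch (the helper name is invented here):

/* true if every parent domain spans at least the cpus of its child */
static bool sd_spans_are_nested(struct sched_domain *sd)
{
	for (; sd && sd->parent; sd = sd->parent)
		if (!cpumask_subset(sched_domain_span(sd),
				    sched_domain_span(sd->parent)))
			return false;

	return true;
}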

Thanks

Regards
Preeti U Murthy

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] sched: CPU topology try

2013-12-23 Thread Dietmar Eggemann

Hi Vincent,

On 18/12/13 14:13, Vincent Guittot wrote:

This patch applies on top of the two patches [1][2] that have been proposed by
Peter for creating a new way to initialize sched_domain. It includes some minor
compilation fixes and a trial of using this new method on ARM platform.
[1] https://lkml.org/lkml/2013/11/5/239
[2] https://lkml.org/lkml/2013/11/5/449


I came up w/ a similar implementation proposal for an arch specific 
interface for scheduler domain set-up a couple of days ago:


[1] https://lkml.org/lkml/2013/12/13/182

I had the following requirements in mind:

1) The arch should not be able to fine tune individual scheduler 
behaviour, i.e. get rid of the arch specific SD_FOO_INIT macros.


2) Unify the set-up code for conventional and NUMA scheduler domains.

3) The arch is able to specify additional scheduler domain level, other 
than SMT, MC, BOOK, and CPU.


4) Allow integrating the provision of additional topology-related data
(e.g. energy information) to the scheduler.


Moreover, I think now that:

5) Something like the existing default set-up via default_topology[] is 
needed to avoid code duplication for archs not interested in (3) or (4).


I can see the following similarities w/ your implementation:

1) Move the cpu_foo_mask functions from scheduler to topology. I even 
put cpu_smt_mask() and cpu_cpu_mask() into include/linux/topology.h.


2) Use the existing func ptr sched_domain_mask_f to pass per-cpu cpu 
mask from the topology shim-layer to the scheduler.




Based on the results of these tests, my feeling about this new way to init the
sched_domain is a bit mixed.

The good point is that I have been able to create the same sched_domain
topologies as before and even more complex ones (where a subset of the cores
in a cluster share their powergating capabilities). I have described various
topology results below.

I use a system that is made of a dual cluster of quad cores with hyperthreading
for my examples.

If one cluster (0-7) can powergate its cores independently but not the other
cluster (8-15) we have the following topology, which is equal to what I had
previously:

CPU0:
domain 0: span 0-1 level: SMT
 flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
 groups: 0 1
   domain 1: span 0-7 level: MC
   flags: SD_SHARE_PKG_RESOURCES
   groups: 0-1 2-3 4-5 6-7
 domain 2: span 0-15 level: CPU
 flags:
 groups: 0-7 8-15

CPU8
domain 0: span 8-9 level: SMT
 flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
 groups: 8 9
   domain 1: span 8-15 level: MC
   flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
   groups: 8-9 10-11 12-13 14-15
 domain 2: span 0-15 level CPU
 flags:
 groups: 8-15 0-7

We can even describe some more complex topologies if a subset (2-7) of the
cluster can't powergate independently:

CPU0:
domain 0: span 0-1 level: SMT
 flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
 groups: 0 1
   domain 1: span 0-7 level: MC
   flags: SD_SHARE_PKG_RESOURCES
   groups: 0-1 2-7
 domain 2: span 0-15 level: CPU
 flags:
 groups: 0-7 8-15

CPU2:
domain 0: span 2-3 level: SMT
 flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
 groups: 2 3
   domain 1: span 2-7 level: MC
   flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
   groups: 2-7 4-5 6-7
 domain 2: span 0-7 level: MC
 flags: SD_SHARE_PKG_RESOURCES
 groups: 2-7 0-1
   domain 3: span 0-15 level: CPU
   flags:
   groups: 0-7 8-15

In this case, we have an additional sched_domain MC level for this subset (2-7)
of cores so we can trigger some load balancing in this subset before doing that
on the complete cluster (which is the last level of cache in my example).


I think the weakest point right now is the condition in sd_init() where 
we convert the topology flags into scheduler behaviour. We not only 
introduce a very tight coupling between topology flags and scheduler 
domain level but also we need to follow a certain order in the 
initialization. This bit needs more thinking.




We can add more levels that will describe other dependencies/independencies,
such as the frequency scaling dependency, and as a result the final
sched_domain topology will have additional levels (if they have not been
removed during the degenerate sequence).

My concern is about the configuration of the table that is used to create the
sched_domain. Some levels are "duplicated" with different flag configurations,
which makes the table not easily readable, and we must also take care of the
order because a parent has to gather all the cpus of its children. So we must
choose which capabilities will be a subset of the other ones. The order is
almost straightforward when we describe one or two kinds of capabilities
(package resource sharing and power sharing) but it can become complex if we
want to add more.