Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-16 Thread Valentin Schneider
On 16/08/21 16:03, Srikar Dronamraju wrote:
>>
>> Your version is much much better than mine.
>> And I have verified that it works as expected.
>>
>>
>
> Hey Peter/Valentin
>
> Are we waiting for any more feedback/testing for this?
>

I'm not overly fond of that last one, but AFAICT the only alternative is
doing a full-fledged NUMA topology rebuild on new-node onlining (i.e. make
calling sched_init_numa() more than once work). It's a lot more work for a
very particular usecase.

>
> --
> Thanks and Regards
> Srikar Dronamraju


Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-16 Thread Srikar Dronamraju
> 
> Your version is much much better than mine.
> And I have verified that it works as expected.
> 
> 

Hey Peter/Valentin

Are we waiting for any more feedback/testing for this?


-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-10 Thread Srikar Dronamraju
* Valentin Schneider  [2021-08-09 13:52:38]:

> On 09/08/21 12:22, Srikar Dronamraju wrote:
> > * Valentin Schneider  [2021-08-08 16:56:47]:
> >> Wait, doesn't the distance matrix (without any offline node) say
> >>
> >>   distance(0, 3) == 40
> >>
> >> ? We should have at the very least:
> >>
> >>   node   0   1   2   3
> >> 0:  10  20  ??  40
> >> 1:  20  20  ??  40
> >> 2:  ??  ??  ??  ??
> >> 3:  40  40  ??  10
> >>
> >
> > Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
> > Note: Node 2-7 and CPU 2-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  10
> >   1:  20  10  40  10
> >   2:  40  40  10  10
> >   3:  10  10  10  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0
> >
> > Note: This is with updating Node 2's distance to 40 for figuring out
> > the number of NUMA levels. Since we then have all possible distances,
> > we don't update Node 3's distance, so it looks as if it were local to node 0.
> >
> > Now when Node 3 and CPU 3 are onlined
> > Note: Node 2, 4-7 and CPU 2, 4-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  40
> >   1:  20  10  40  40
> >   2:  40  40  10  40
> >   3:  40  40  40  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0,3
> >
> > CPU 0 continues to be part of Node->mask(3) because, when we online
> > node 3 and discover the right distance, there is no API to reset
> > node 3's numa mask and remove CPU 0 from it.
> >
> > If we had an API to clear/set sched_domains_numa_masks[node][] on node
> > state changes, we could plug into it and update the node masks whenever
> > a node comes up or goes down.
> >
> 
> Gotcha, this is now coming back to me...
> 
> [...]
> 
> >> Ok, so it looks like we really can't do without that part - even if we get
> >> "sensible" distance values for the online nodes, we can't divine values for
> >> the offline ones.
> >>
> >
> > Yes
> >
> 
> Argh, while your approach does take care of the masks, it leaves
> sched_numa_topology_type unchanged. You *can* force an update of it, but
> yuck :(
> 
> I got to the below...
> 

Yes, I completely missed that we should update sched_numa_topology_type.


> ---
> From: Srikar Dronamraju 
> Date: Thu, 1 Jul 2021 09:45:51 +0530
> Subject: [PATCH 1/1] sched/topology: Skip updating masks for non-online nodes
> 
> The scheduler currently expects NUMA node distances to be stable from init
> onwards, and as a consequence builds the related data structures
> once-and-for-all at init (see sched_init_numa()).
> 
> Unfortunately, on some architectures node distance is unreliable for
> offline nodes and may very well change upon onlining.
> 
> Skip over offline nodes during sched_init_numa(). Track nodes that have
> been onlined at least once, and trigger a build of a node's NUMA masks when
> it is first onlined post-init.
> 

Your version is much much better than mine.
And I have verified that it works as expected.


> Reported-by: Geetika Moolchandani 
> Signed-off-by: Srikar Dronamraju 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/topology.c | 65 +
>  1 file changed, 65 insertions(+)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..cba95793a9b7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1482,6 +1482,8 @@ int sched_max_numa_distance;
>  static int   *sched_domains_numa_distance;
>  static struct cpumask***sched_domains_numa_masks;
>  int __read_mostlynode_reclaim_distance = RECLAIM_DISTANCE;
> +
> +static unsigned long __read_mostly *sched_numa_onlined_nodes;
>  #endif
> 
>  /*
> @@ -1833,6 +1835,16 @@ void sched_init_numa(void)
>   sched_domains_numa_masks[i][j] = mask;
> 
>   for_each_node(k) {
> + /*
> +  * Distance information can be unreliable for
> +  * offline nodes, defer building the node
> +  * masks to its bringup.
> +  * This relies on all unique distance values
> +  * still being visible at init time.
> +  */
> + if (!node_online(j))
> + continue;
> +
>   if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
>   sched_numa_warn("Node-distance not symmetric");
> 
> @@ -1886,6 +1898,53 @@ void sched_init_numa(void)
>   sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
> 
>   init_numa_topology_type();
> +
> + sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
> + if (!sched_numa_onlined_nodes)
> + return;

Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-09 Thread Valentin Schneider
On 09/08/21 12:22, Srikar Dronamraju wrote:
> * Valentin Schneider  [2021-08-08 16:56:47]:
>> Wait, doesn't the distance matrix (without any offline node) say
>>
>>   distance(0, 3) == 40
>>
>> ? We should have at the very least:
>>
>>   node   0   1   2   3
>> 0:  10  20  ??  40
>> 1:  20  10  ??  40
>> 2:  ??  ??  ??  ??
>> 3:  40  40  ??  10
>>
>
> Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
> Note: Node 2-7 and CPU 2-7 are still offline.
>
> node   0   1   2   3
>   0:  10  20  40  10
>   1:  20  10  40  10
>   2:  40  40  10  10
>   3:  10  10  10  10
>
> NODE->mask(0) == 0
> NODE->mask(1) == 1
> NODE->mask(2) == 0
> NODE->mask(3) == 0
>
> Note: This is with updating Node 2's distance to 40 for figuring out
> the number of NUMA levels. Since we then have all possible distances,
> we don't update Node 3's distance, so it looks as if it were local to node 0.
>
> Now when Node 3 and CPU 3 are onlined
> Note: Node 2, 4-7 and CPU 2, 4-7 are still offline.
>
> node   0   1   2   3
>   0:  10  20  40  40
>   1:  20  10  40  40
>   2:  40  40  10  40
>   3:  40  40  40  10
>
> NODE->mask(0) == 0
> NODE->mask(1) == 1
> NODE->mask(2) == 0
> NODE->mask(3) == 0,3
>
> CPU 0 continues to be part of Node->mask(3) because, when we online
> node 3 and discover the right distance, there is no API to reset
> node 3's numa mask and remove CPU 0 from it.
>
> If we had an API to clear/set sched_domains_numa_masks[node][] on node
> state changes, we could plug into it and update the node masks whenever
> a node comes up or goes down.
>

Gotcha, this is now coming back to me...

[...]

>> Ok, so it looks like we really can't do without that part - even if we get
>> "sensible" distance values for the online nodes, we can't divine values for
>> the offline ones.
>>
>
> Yes
>

Argh, while your approach does take care of the masks, it leaves
sched_numa_topology_type unchanged. You *can* force an update of it, but
yuck :(

I got to the below...

---
From: Srikar Dronamraju 
Date: Thu, 1 Jul 2021 09:45:51 +0530
Subject: [PATCH 1/1] sched/topology: Skip updating masks for non-online nodes

The scheduler currently expects NUMA node distances to be stable from init
onwards, and as a consequence builds the related data structures
once-and-for-all at init (see sched_init_numa()).

Unfortunately, on some architectures node distance is unreliable for
offline nodes and may very well change upon onlining.

Skip over offline nodes during sched_init_numa(). Track nodes that have
been onlined at least once, and trigger a build of a node's NUMA masks when
it is first onlined post-init.

Reported-by: Geetika Moolchandani 
Signed-off-by: Srikar Dronamraju 
Signed-off-by: Valentin Schneider 
---
 kernel/sched/topology.c | 65 +
 1 file changed, 65 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..cba95793a9b7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1482,6 +1482,8 @@ int   sched_max_numa_distance;
 static int *sched_domains_numa_distance;
 static struct cpumask  ***sched_domains_numa_masks;
 int __read_mostly  node_reclaim_distance = RECLAIM_DISTANCE;
+
+static unsigned long __read_mostly *sched_numa_onlined_nodes;
 #endif
 
 /*
@@ -1833,6 +1835,16 @@ void sched_init_numa(void)
sched_domains_numa_masks[i][j] = mask;
 
for_each_node(k) {
+   /*
+* Distance information can be unreliable for
+* offline nodes, defer building the node
+* masks to its bringup.
+* This relies on all unique distance values
+* still being visible at init time.
+*/
+   if (!node_online(j))
+   continue;
+
			if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
				sched_numa_warn("Node-distance not symmetric");
 
@@ -1886,6 +1898,53 @@ void sched_init_numa(void)
sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
 
init_numa_topology_type();
+
+   sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
+   if (!sched_numa_onlined_nodes)
+   return;
+
+   bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
+   for_each_online_node(i)
+   bitmap_set(sched_numa_onlined_nodes, i, 1);
+}
+
+void __sched_domains_numa_masks_set(unsigned int node)
+{
+   int i, j;
+
+   /*
+* NUMA masks are not built for offline nodes in sched_init_numa().
+* Thus, when a CPU of a never-onlined-before node gets plugged in,
+* adding that new CPU to the right 

Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-09 Thread Srikar Dronamraju
* Valentin Schneider  [2021-08-08 16:56:47]:

>
> A bit late, but technically the week isn't over yet! :D
>
> On 23/07/21 20:09, Srikar Dronamraju wrote:
> > * Valentin Schneider  [2021-07-13 17:32:14]:
> >> Now, let's take examples from your cover letter:
> >>
> >>   node distances:
> >>   node   0   1   2   3   4   5   6   7
> >> 0:  10  20  40  40  40  40  40  40
> >> 1:  20  10  40  40  40  40  40  40
> >> 2:  40  40  10  20  40  40  40  40
> >> 3:  40  40  20  10  40  40  40  40
> >> 4:  40  40  40  40  10  20  40  40
> >> 5:  40  40  40  40  20  10  40  40
> >> 6:  40  40  40  40  40  40  10  20
> >> 7:  40  40  40  40  40  40  20  10
> >>
> >> But the system boots with just nodes 0 and 1, thus only this distance
> >> matrix is valid:
> >>
> >>   node   0   1
> >> 0:  10  20
> >> 1:  20  10
> >>
> >> topology_span_sane() is going to use tl->mask(cpu), and as you reported the
> >> NODE topology level should cause issues. Let's assume all offline nodes say
> >> they're 10 distance away from everyone else, and that we have one CPU per
> >> node. This would give us:
> >>
> >
> > No. All offline nodes would be at a distance of 10 from node 0 only,
> > so here the node distance of all offline nodes from node 1 would be 20.
> >
> >>   NODE->mask(0) == 0,2-7
> >>   NODE->mask(1) == 1-7
> >
> > so
> >
> > NODE->mask(0) == 0,2-7
> > NODE->mask(1) should be 1
> > and NODE->mask(2-7) == 0,2-7
> >
>
> Ok, so that shouldn't trigger the warning.

Yes, not at this point, but later on when we online a node.

>
> >>
> >> The intersection is 2-7, we'll trigger the WARN_ON().
> >> Now, with the above snippet, we'll check if that intersection covers any
> >> online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
> >> end up with an empty intersection and we shouldn't warn - that's the theory
> >> at least.
> >
> > Now let's say we onlined CPU 3 and node 3, which was at an actual
> > distance of 20 from node 0.
> >
> > (If we only consider online CPUs, and since scheduler masks like
> > sched_domains_numa_masks aren't updated with offline CPUs,)
> > then
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(3) == 0,3
> >
>
> Wait, doesn't the distance matrix (without any offline node) say
>
>   distance(0, 3) == 40
>
> ? We should have at the very least:
>
>   node   0   1   2   3
> 0:  10  20  ??  40
> 1:  20  20  ??  40
> 2:  ??  ??  ??  ??
> 3:  40  40  ??  10
>

Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
Note: Node 2-7 and CPU 2-7 are still offline.

node   0   1   2   3
  0:  10  20  40  10
  1:  20  10  40  10
  2:  40  40  10  10
  3:  10  10  10  10

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(2) == 0
NODE->mask(3) == 0

Note: This is with updating Node 2's distance to 40 for figuring out
the number of NUMA levels. Since we then have all possible distances,
we don't update Node 3's distance, so it looks as if it were local to node 0.

Now when Node 3 and CPU 3 are onlined
Note: Node 2, 4-7 and CPU 2, 4-7 are still offline.

node   0   1   2   3
  0:  10  20  40  40
  1:  20  10  40  40
  2:  40  40  10  40
  3:  40  40  40  10

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(2) == 0
NODE->mask(3) == 0,3

CPU 0 continues to be part of Node->mask(3) because, when we online
node 3 and discover the right distance, there is no API to reset
node 3's numa mask and remove CPU 0 from it.

If we had an API to clear/set sched_domains_numa_masks[node][] on node
state changes, we could plug into it and update the node masks whenever
a node comes up or goes down.
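
As a rough illustration, such an API might look like the sketch below.
This is purely hypothetical - no such interface exists in the kernel
today - but it only uses existing cpumask helpers:

	/*
	 * Hypothetical sketch: clear a node's NUMA masks when it goes
	 * offline, so that stale CPUs (e.g. CPU 0 in NODE->mask(3) above)
	 * don't linger in the masks.
	 */
	static void sched_domains_numa_masks_clear_node(unsigned int node)
	{
		int i, j;

		for (i = 0; i < sched_domains_numa_levels; i++) {
			/* Drop this node's view of every node... */
			cpumask_clear(sched_domains_numa_masks[i][node]);

			/* ...and drop its CPUs from every other node's view. */
			for (j = 0; j < nr_node_ids; j++)
				cpumask_andnot(sched_domains_numa_masks[i][j],
					       sched_domains_numa_masks[i][j],
					       cpumask_of_node(node));
		}
	}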


> Regardless, NODE->mask(x) is sched_domains_numa_masks[0][x], if
>
>   distance(0,3) > LOCAL_DISTANCE
>
> then
>
> node0 ∉ NODE->mask(3)
>
> > cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
> > if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) && cpumask_intersects(intersect, cpu_map))
> >
> > cpu_map is 0,1,3
> > intersect is 0
> >
> > From above NODE->mask(0) is !equal to NODE->mask(3) and
> > cpumask_intersects(intersect, cpu_map) is also true.
> >
> > I picked Node 3 since, with Node 1 online, we would have faked Node
> > 2's distance to be 40.
> >
> > With any node from 3 to 7, we would have faced the same problem.
> >
> >>
> >> Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
> >> it doesn't run in the right place wrt where sched_domains_curr_level is
> >> updated. Could you try the below on top of the previous snippet?
> >>
> >> If that doesn't help, could you share the node distances / topology masks
> >> that lead to the WARN_ON()? Thanks.
> >>
> >> ---
> >> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> >> index b77ad49dc14f..cda69dfa4065 100644
> >> --- a/kernel/sched/topology.c
> >> +++ b/kernel/sched/topology.c
> >> @@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
> >>  int sd_id, sd_weight, sd_flags = 0;
> >>  struct cpumask *sd_span;
> >>
> >> 

Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-08 Thread Valentin Schneider


A bit late, but technically the week isn't over yet! :D

On 23/07/21 20:09, Srikar Dronamraju wrote:
> * Valentin Schneider  [2021-07-13 17:32:14]:
>> Now, let's take examples from your cover letter:
>>
>>   node distances:
>>   node   0   1   2   3   4   5   6   7
>> 0:  10  20  40  40  40  40  40  40
>> 1:  20  10  40  40  40  40  40  40
>> 2:  40  40  10  20  40  40  40  40
>> 3:  40  40  20  10  40  40  40  40
>> 4:  40  40  40  40  10  20  40  40
>> 5:  40  40  40  40  20  10  40  40
>> 6:  40  40  40  40  40  40  10  20
>> 7:  40  40  40  40  40  40  20  10
>>
>> But the system boots with just nodes 0 and 1, thus only this distance
>> matrix is valid:
>>
>>   node   0   1
>> 0:  10  20
>> 1:  20  10
>>
>> topology_span_sane() is going to use tl->mask(cpu), and as you reported the
>> NODE topology level should cause issues. Let's assume all offline nodes say
>> they're 10 distance away from everyone else, and that we have one CPU per
>> node. This would give us:
>>
>
> No. All offline nodes would be at a distance of 10 from node 0 only,
> so here the node distance of all offline nodes from node 1 would be 20.
>
>>   NODE->mask(0) == 0,2-7
>>   NODE->mask(1) == 1-7
>
> so
>
> NODE->mask(0) == 0,2-7
> NODE->mask(1) should be 1
> and NODE->mask(2-7) == 0,2-7
>

Ok, so that shouldn't trigger the warning.

>>
>> The intersection is 2-7, we'll trigger the WARN_ON().
>> Now, with the above snippet, we'll check if that intersection covers any
>> online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
>> end up with an empty intersection and we shouldn't warn - that's the theory
>> at least.
>
> Now let's say we onlined CPU 3 and node 3, which was at an actual
> distance of 20 from node 0.
>
> (If we only consider online CPUs, and since scheduler masks like
> sched_domains_numa_masks aren't updated with offline CPUs,)
> then
>
> NODE->mask(0) == 0
> NODE->mask(1) == 1
> NODE->mask(3) == 0,3
>

Wait, doesn't the distance matrix (without any offline node) say

  distance(0, 3) == 40

? We should have at the very least:

  node   0   1   2   3
0:  10  20  ??  40
1:  20  10  ??  40
2:  ??  ??  ??  ??
3:  40  40  ??  10

Regardless, NODE->mask(x) is sched_domains_numa_masks[0][x], if

  distance(0,3) > LOCAL_DISTANCE

then

  node0 ∉ NODE->mask(3)
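
For reference, a condensed sketch of how those level-0 masks get built
in sched_init_numa() - simplified from kernel/sched/topology.c, assuming
the masks have already been zero-allocated:

	/*
	 * Condensed from sched_init_numa(): node k's CPUs land in
	 * sched_domains_numa_masks[i][j] iff node_distance(j, k) is within
	 * sched_domains_numa_distance[i]. For i == 0 that distance is
	 * LOCAL_DISTANCE, so NODE->mask(j) only covers nodes local to j.
	 */
	for_each_node(j) {
		for_each_node(k) {
			if (node_distance(j, k) > sched_domains_numa_distance[0])
				continue;

			cpumask_or(sched_domains_numa_masks[0][j],
				   sched_domains_numa_masks[0][j],
				   cpumask_of_node(k));
		}
	}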

> cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
> if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) && cpumask_intersects(intersect, cpu_map))
>
> cpu_map is 0,1,3
> intersect is 0
>
> From above NODE->mask(0) is !equal to NODE->mask(3) and
> cpumask_intersects(intersect, cpu_map) is also true.
>
> I picked Node 3 since, with Node 1 online, we would have faked Node
> 2's distance to be 40.
>
> With any node from 3 to 7, we would have faced the same problem.
>
>>
>> Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
>> it doesn't run in the right place wrt where sched_domains_curr_level is
>> updated. Could you try the below on top of the previous snippet?
>>
>> If that doesn't help, could you share the node distances / topology masks
>> that lead to the WARN_ON()? Thanks.
>>
>> ---
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index b77ad49dc14f..cda69dfa4065 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
>>  int sd_id, sd_weight, sd_flags = 0;
>>  struct cpumask *sd_span;
>>
>> -#ifdef CONFIG_NUMA
>> -/*
>> - * Ugly hack to pass state to sd_numa_mask()...
>> - */
>> -sched_domains_curr_level = tl->numa_level;
>> -#endif
>> -
>>  sd_weight = cpumask_weight(tl->mask(cpu));
>>
>>  if (tl->sd_flags)
>> @@ -2131,7 +2124,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>
>>  sd = NULL;
>>  for_each_sd_topology(tl) {
>> -
>> +#ifdef CONFIG_NUMA
>> +/*
>> + * Ugly hack to pass state to sd_numa_mask()...
>> + */
>> +sched_domains_curr_level = tl->numa_level;
>> +#endif
>>  if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
>>  goto error;
>>
>>
>
> I tested with the above patch too. However, it is still not helping.
>
> Here is the log from my testing.
>
> At Boot.
>
> (Do remember that, to arrive at sched_max_numa_levels, we faked node
> 1's numa_distance to be 20 from node 0. All other offline nodes are at
> a distance of 10 from node 0.)
>

[...]

> (First addition of a CPU to a non-online node, especially a node whose
> distance was not faked.)
>
> numactl -H
> available: 3 nodes (0,5,7)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 5 cpus: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 32 33 34 35 40 41 42 43 48 49 50 51 56 57 58 59 64 65 66 67 72 73 74 75 76 77 78 79 80

Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-04 Thread Valentin Schneider
On 04/08/21 15:31, Srikar Dronamraju wrote:
> * Srikar Dronamraju  [2021-07-23 20:09:14]:
>
>> * Valentin Schneider  [2021-07-13 17:32:14]:
>>
>> > On 12/07/21 18:18, Srikar Dronamraju wrote:
>> > > Hi Valentin,
>> > >
>> > >> On 01/07/21 09:45, Srikar Dronamraju wrote:
>> > >> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
>> > >> >  void sched_domains_numa_masks_set(unsigned int cpu)
>> > >> >  {
>> > >
>
> Hey Valentin / Peter
>
> Did you get a chance to look at this?
>

Barely, I wanted to set some time aside to stare at this and have been
failing miserably. Let me bump it up my todolist, I'll get to it before the
end of the week.

> --
> Thanks and Regards
> Srikar Dronamraju


Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-04 Thread Srikar Dronamraju
* Srikar Dronamraju  [2021-07-23 20:09:14]:

> * Valentin Schneider  [2021-07-13 17:32:14]:
> 
> > On 12/07/21 18:18, Srikar Dronamraju wrote:
> > > Hi Valentin,
> > >
> > >> On 01/07/21 09:45, Srikar Dronamraju wrote:
> > >> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> > >> >  void sched_domains_numa_masks_set(unsigned int cpu)
> > >> >  {
> > >

Hey Valentin / Peter

Did you get a chance to look at this?

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-07-23 Thread Srikar Dronamraju
* Valentin Schneider  [2021-07-13 17:32:14]:

> On 12/07/21 18:18, Srikar Dronamraju wrote:
> > Hi Valentin,
> >
> >> On 01/07/21 09:45, Srikar Dronamraju wrote:
> >> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> >> >  void sched_domains_numa_masks_set(unsigned int cpu)
> >> >  {
> >
> > Unfortunately this is not helping.
> > I tried this patch alone and also with the 2/2 patch of this series,
> > where we update/fill fake topology numbers. However, both cases are
> > still failing.
> >
> 
> Thanks for testing it.
> 
> 
> Now, let's take examples from your cover letter:
> 
>   node distances:
>   node   0   1   2   3   4   5   6   7
> 0:  10  20  40  40  40  40  40  40
> 1:  20  10  40  40  40  40  40  40
> 2:  40  40  10  20  40  40  40  40
> 3:  40  40  20  10  40  40  40  40
> 4:  40  40  40  40  10  20  40  40
> 5:  40  40  40  40  20  10  40  40
> 6:  40  40  40  40  40  40  10  20
> 7:  40  40  40  40  40  40  20  10
> 
> But the system boots with just nodes 0 and 1, thus only this distance
> matrix is valid:
> 
>   node   0   1
> 0:  10  20
> 1:  20  10
> 
> topology_span_sane() is going to use tl->mask(cpu), and as you reported the
> NODE topology level should cause issues. Let's assume all offline nodes say
> they're 10 distance away from everyone else, and that we have one CPU per
> node. This would give us:
> 

No. All offline nodes would be at a distance of 10 from node 0 only,
so here the node distance of all offline nodes from node 1 would be 20.

>   NODE->mask(0) == 0,2-7
>   NODE->mask(1) == 1-7

so 

NODE->mask(0) == 0,2-7
NODE->mask(1) should be 1
and NODE->mask(2-7) == 0,2-7

> 
> The intersection is 2-7, we'll trigger the WARN_ON().
> Now, with the above snippet, we'll check if that intersection covers any
> online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
> end up with an empty intersection and we shouldn't warn - that's the theory
> at least.

Now let's say we onlined CPU 3 and node 3, which was at an actual
distance of 20 from node 0.

(If we only consider online CPUs, and since scheduler masks like
sched_domains_numa_masks aren't updated with offline CPUs,)
then

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(3) == 0,3

cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) && cpumask_intersects(intersect, cpu_map))

cpu_map is 0,1,3
intersect is 0

From above NODE->mask(0) is !equal to NODE->mask(3) and
cpumask_intersects(intersect, cpu_map) is also true.

I picked Node 3 since, with Node 1 online, we would have faked Node
2's distance to be 40.

With any node from 3 to 7, we would have faced the same problem.
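
To make the failing check concrete, here is a small standalone C
simulation of the example above; plain bitmasks stand in for cpumasks,
so this is illustrative only:

	#include <stdio.h>

	int main(void)
	{
		/* NODE-level masks from the example; bit n stands for CPU n. */
		unsigned int mask_cpu0 = 0x1;	/* NODE->mask(0) == 0   */
		unsigned int mask_cpu3 = 0x9;	/* NODE->mask(3) == 0,3 */
		unsigned int cpu_map   = 0xb;	/* online CPUs: 0,1,3   */

		/*
		 * Mirrors the topology_span_sane() condition: unequal spans
		 * that still overlap on an online CPU are deemed insane.
		 */
		unsigned int intersect = mask_cpu0 & mask_cpu3;

		if (mask_cpu0 != mask_cpu3 && (intersect & cpu_map))
			printf("WARN_ON() would trigger (overlap on CPU 0)\n");

		return 0;
	}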

> 
> Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
> it doesn't run in the right place wrt where sched_domains_curr_level is
> updated. Could you try the below on top of the previous snippet?
> 
> If that doesn't help, could you share the node distances / topology masks
> that lead to the WARN_ON()? Thanks.
> 
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..cda69dfa4065 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
>   int sd_id, sd_weight, sd_flags = 0;
>   struct cpumask *sd_span;
> 
> -#ifdef CONFIG_NUMA
> - /*
> -  * Ugly hack to pass state to sd_numa_mask()...
> -  */
> - sched_domains_curr_level = tl->numa_level;
> -#endif
> -
>   sd_weight = cpumask_weight(tl->mask(cpu));
> 
>   if (tl->sd_flags)
> @@ -2131,7 +2124,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> 
>   sd = NULL;
>   for_each_sd_topology(tl) {
> -
> +#ifdef CONFIG_NUMA
> + /*
> +  * Ugly hack to pass state to sd_numa_mask()...
> +  */
> + sched_domains_curr_level = tl->numa_level;
> +#endif
>   if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
>   goto error;
> 
> 

I tested with the above patch too. However, it is still not helping.

Here is the log from my testing.

At Boot.

(Do remember that, to arrive at sched_max_numa_levels, we faked node
1's numa_distance to be 20 from node 0. All other offline nodes are at
a distance of 10 from node 0.)

numactl -H
available: 2 nodes (0,5)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 0 MB
node 0 free: 0 MB
node 5 cpus:
node 5 size: 32038 MB
node 5 free: 29367 MB
node distances:
node   0   5
  0:  10  40
  5:  40  10
--
grep -r . /sys/kernel/debug/sched/domains/cpu0/domain{0,1,2,3,4}/{name,flags}
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY

Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-07-13 Thread Valentin Schneider
On 12/07/21 18:18, Srikar Dronamraju wrote:
> Hi Valentin,
>
>> On 01/07/21 09:45, Srikar Dronamraju wrote:
>> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
>> >  void sched_domains_numa_masks_set(unsigned int cpu)
>> >  {
>>
>> Hmph, so we're playing games with masks of offline nodes - is that really
>> necessary? Your modification of sched_init_numa() still scans all of the
>> nodes (regardless of their online status) to build the distance map, and
>> that is never updated (sched_init_numa() is pretty much an __init
>> function).
>>
>> So AFAICT this is all to cope with topology_span_sane() not applying
>> 'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
>> light of having bogus distance values for offline nodes, not so much...
>>
>> What about the below instead?
>>
>> ---
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index b77ad49dc14f..c2d9caad4aa6 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>>  static bool topology_span_sane(struct sched_domain_topology_level *tl,
>>const struct cpumask *cpu_map, int cpu)
>>  {
>> +struct cpumask *intersect = sched_domains_tmpmask;
>>  int i;
>>
>>  /* NUMA levels are allowed to overlap */
>> @@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
>>  for_each_cpu(i, cpu_map) {
>>  if (i == cpu)
>>  continue;
>> +
>>  /*
>> - * We should 'and' all those masks with 'cpu_map' to exactly
>> - * match the topology we're about to build, but that can only
>> - * remove CPUs, which only lessens our ability to detect
>> - * overlaps
>> + * We shouldn't have to bother with cpu_map here, unfortunately
>> + * some architectures (powerpc says hello) have to deal with
>> + * offline NUMA nodes reporting bogus distance values. This can
>> + * lead to funky NODE domain spans, but since those are offline
>> + * we can mask them out.
>>   */
>> +cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
>>  if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
>> -cpumask_intersects(tl->mask(cpu), tl->mask(i)))
>> +cpumask_intersects(intersect, cpu_map))
>>  return false;
>>  }
>>
>
> Unfortunately this is not helping.
> I tried this patch alone and also with the 2/2 patch of this series,
> where we update/fill fake topology numbers. However, both cases are
> still failing.
>

Thanks for testing it.


Now, let's take examples from your cover letter:

  node distances:
  node   0   1   2   3   4   5   6   7
0:  10  20  40  40  40  40  40  40
1:  20  10  40  40  40  40  40  40
2:  40  40  10  20  40  40  40  40
3:  40  40  20  10  40  40  40  40
4:  40  40  40  40  10  20  40  40
5:  40  40  40  40  20  10  40  40
6:  40  40  40  40  40  40  10  20
7:  40  40  40  40  40  40  20  10

But the system boots with just nodes 0 and 1, thus only this distance
matrix is valid:

  node   0   1
0:  10  20
1:  20  10

topology_span_sane() is going to use tl->mask(cpu), and as you reported the
NODE topology level should cause issues. Let's assume all offline nodes say
they're 10 distance away from everyone else, and that we have one CPU per
node. This would give us:

  NODE->mask(0) == 0,2-7
  NODE->mask(1) == 1-7

The intersection is 2-7, we'll trigger the WARN_ON().
Now, with the above snippet, we'll check if that intersection covers any
online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
end up with an empty intersection and we shouldn't warn - that's the theory
at least.

Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
it doesn't run in the right place wrt where sched_domains_curr_level is
updated. Could you try the below on top of the previous snippet?

If that doesn't help, could you share the node distances / topology masks
that lead to the WARN_ON()? Thanks.

---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..cda69dfa4065 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
int sd_id, sd_weight, sd_flags = 0;
struct cpumask *sd_span;
 
-#ifdef CONFIG_NUMA
-   /*
-* Ugly hack to pass state to sd_numa_mask()...
-*/
-   sched_domains_curr_level = tl->numa_level;
-#endif
-
sd_weight = cpumask_weight(tl->mask(cpu));
 
if (tl->sd_flags)
@@ -2131,7 +2124,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 
sd = NULL;
for_each_sd_topology(tl) {
-
+#ifdef CONFIG_NUMA
+   

Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-07-12 Thread Srikar Dronamraju
Hi Valentin,

> On 01/07/21 09:45, Srikar Dronamraju wrote:
> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> >  void sched_domains_numa_masks_set(unsigned int cpu)
> >  {
> 
> Hmph, so we're playing games with masks of offline nodes - is that really
> necessary? Your modification of sched_init_numa() still scans all of the
> nodes (regardless of their online status) to build the distance map, and
> that is never updated (sched_init_numa() is pretty much an __init
> function).
> 
> So AFAICT this is all to cope with topology_span_sane() not applying
> 'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
> light of having bogus distance values for offline nodes, not so much...
> 
> What about the below instead?
> 
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..c2d9caad4aa6 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
>  static bool topology_span_sane(struct sched_domain_topology_level *tl,
> const struct cpumask *cpu_map, int cpu)
>  {
> + struct cpumask *intersect = sched_domains_tmpmask;
>   int i;
> 
>   /* NUMA levels are allowed to overlap */
> @@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
>   for_each_cpu(i, cpu_map) {
>   if (i == cpu)
>   continue;
> +
>   /*
> -  * We should 'and' all those masks with 'cpu_map' to exactly
> -  * match the topology we're about to build, but that can only
> -  * remove CPUs, which only lessens our ability to detect
> -  * overlaps
> +  * We shouldn't have to bother with cpu_map here, unfortunately
> +  * some architectures (powerpc says hello) have to deal with
> +  * offline NUMA nodes reporting bogus distance values. This can
> +  * lead to funky NODE domain spans, but since those are offline
> +  * we can mask them out.
>*/
> + cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
>   if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> - cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> + cpumask_intersects(intersect, cpu_map))
>   return false;
>   }
> 

Unfortunately this is not helping.
I tried this patch alone and also with the 2/2 patch of this series,
where we update/fill fake topology numbers. However, both cases are
still failing.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-07-01 Thread Valentin Schneider
On 01/07/21 09:45, Srikar Dronamraju wrote:
> @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
>  void sched_domains_numa_masks_set(unsigned int cpu)
>  {
>   int node = cpu_to_node(cpu);
> - int i, j;
> + int i, j, empty;
>
> + empty = cpumask_empty(sched_domains_numa_masks[0][node]);
>   for (i = 0; i < sched_domains_numa_levels; i++) {
>   for (j = 0; j < nr_node_ids; j++) {
> -		if (node_distance(j, node) <= sched_domains_numa_distance[i])
> +		if (!node_online(j))
> +			continue;
> +
> +		if (node_distance(j, node) <= sched_domains_numa_distance[i]) {
> 			cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> +
> +			/*
> +			 * We skip updating numa_masks for offline
> +			 * nodes. However, now that the node is
> +			 * finally online, CPUs that were added
> +			 * earlier should now be accommodated into
> +			 * the newly onlined node's numa mask.
> +			 */
> +			if (node != j && empty) {
> +				cpumask_or(sched_domains_numa_masks[i][node],
> +					   sched_domains_numa_masks[i][node],
> +					   sched_domains_numa_masks[0][j]);
> +			}
> +		}

Hmph, so we're playing games with masks of offline nodes - is that really
necessary? Your modification of sched_init_numa() still scans all of the
nodes (regardless of their online status) to build the distance map, and
that is never updated (sched_init_numa() is pretty much an __init
function).
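
For context, a condensed sketch of that init-time scan - adapted from
sched_init_numa(), with names as in kernel/sched/topology.c at the time
of this series:

	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
	for (i = 0; i < nr_node_ids; i++) {
		for (j = 0; j < nr_node_ids; j++) {
			/*
			 * Offline nodes are visited too, with whatever
			 * (possibly bogus) distance they report at boot.
			 */
			int distance = node_distance(i, j);

			bitmap_set(distance_map, distance, 1);
		}
	}
	nr_levels = bitmap_weight(distance_map, NR_DISTANCE_VALUES);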

So AFAICT this is all to cope with topology_span_sane() not applying
'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
light of having bogus distance values for offline nodes, not so much...

What about the below instead?

---
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..c2d9caad4aa6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 static bool topology_span_sane(struct sched_domain_topology_level *tl,
  const struct cpumask *cpu_map, int cpu)
 {
+   struct cpumask *intersect = sched_domains_tmpmask;
int i;
 
/* NUMA levels are allowed to overlap */
@@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
for_each_cpu(i, cpu_map) {
if (i == cpu)
continue;
+
/*
-* We should 'and' all those masks with 'cpu_map' to exactly
-* match the topology we're about to build, but that can only
-* remove CPUs, which only lessens our ability to detect
-* overlaps
+* We shouldn't have to bother with cpu_map here, unfortunately
+* some architectures (powerpc says hello) have to deal with
+* offline NUMA nodes reporting bogus distance values. This can
+* lead to funky NODE domain spans, but since those are offline
+* we can mask them out.
 */
+   cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
-   cpumask_intersects(tl->mask(cpu), tl->mask(i)))
+   cpumask_intersects(intersect, cpu_map))
return false;
}