Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-24 Thread Vlastimil Babka
On 07/21/2017 04:39 PM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> build_zonelists gradually builds zonelists from the nearest to the most
> distant node. As we do not know how many populated zones we will have in
> each node we rely on the _zoneref to terminate initialized part of the
> zonelist by a NULL zone. While this is functionally correct it is quite
> suboptimal because we cannot allow updaters to race with zonelists
> users because they could see an empty zonelist and fail the allocation
> or hit the OOM killer in the worst case.
> 
> We can do much better, though. We can store the node ordering into an
> already existing node_order array and then give this array to
> build_zonelists_in_node_order and do the whole initialization at once.
> zonelists consumers still might see halfway initialized state but that
> should be much more tolerateable because the list will not be empty and
> they would either see some zone twice or skip over some zone(s) in the
> worst case which shouldn't lead to immediate failures.
> 
> While at it let's simplify build_zonelists_node which is rather
> confusing now. It gets an index into the zoneref array and returns
> the updated index for the next iteration. Let's rename the function
> to build_zonerefs_node to better reflect its purpose and give it
> zoneref array to update. The function doesn't the index anymore. It
> just returns the number of added zones so that the caller can advance
> the zonered array start for the next update.
> 
> This patch alone doesn't introduce any functional change yet, though, it
> is merely a preparatory work for later changes.
> 
> Changes since v1
> - build_zonelists_node -> build_zonerefs_node and operate directly on
>   zonerefs array rather than play tricks with index into the array.
> - give build_zonelists_in_node_order nr_nodes to not iterate over all
>   MAX_NUMNODES as per Mel
> 
> Signed-off-by: Michal Hocko 

Acked-by: Vlastimil Babka 


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-24 Thread Vlastimil Babka
On 07/21/2017 04:39 PM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> build_zonelists gradually builds zonelists from the nearest to the most
> distant node. As we do not know how many populated zones we will have in
> each node we rely on the _zoneref to terminate initialized part of the
> zonelist by a NULL zone. While this is functionally correct it is quite
> suboptimal because we cannot allow updaters to race with zonelists
> users because they could see an empty zonelist and fail the allocation
> or hit the OOM killer in the worst case.
> 
> We can do much better, though. We can store the node ordering into an
> already existing node_order array and then give this array to
> build_zonelists_in_node_order and do the whole initialization at once.
> zonelists consumers still might see halfway initialized state but that
> should be much more tolerateable because the list will not be empty and
> they would either see some zone twice or skip over some zone(s) in the
> worst case which shouldn't lead to immediate failures.
> 
> While at it let's simplify build_zonelists_node which is rather
> confusing now. It gets an index into the zoneref array and returns
> the updated index for the next iteration. Let's rename the function
> to build_zonerefs_node to better reflect its purpose and give it
> zoneref array to update. The function doesn't the index anymore. It
> just returns the number of added zones so that the caller can advance
> the zonered array start for the next update.
> 
> This patch alone doesn't introduce any functional change yet, though, it
> is merely a preparatory work for later changes.
> 
> Changes since v1
> - build_zonelists_node -> build_zonerefs_node and operate directly on
>   zonerefs array rather than play tricks with index into the array.
> - give build_zonelists_in_node_order nr_nodes to not iterate over all
>   MAX_NUMNODES as per Mel
> 
> Signed-off-by: Michal Hocko 

Acked-by: Vlastimil Babka 


[PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-21 Thread Michal Hocko
From: Michal Hocko 

build_zonelists gradually builds zonelists from the nearest to the most
distant node. As we do not know how many populated zones we will have in
each node we rely on the _zoneref to terminate initialized part of the
zonelist by a NULL zone. While this is functionally correct it is quite
suboptimal because we cannot allow updaters to race with zonelists
users because they could see an empty zonelist and fail the allocation
or hit the OOM killer in the worst case.

We can do much better, though. We can store the node ordering into an
already existing node_order array and then give this array to
build_zonelists_in_node_order and do the whole initialization at once.
zonelists consumers still might see halfway initialized state but that
should be much more tolerateable because the list will not be empty and
they would either see some zone twice or skip over some zone(s) in the
worst case which shouldn't lead to immediate failures.

While at it let's simplify build_zonelists_node which is rather
confusing now. It gets an index into the zoneref array and returns
the updated index for the next iteration. Let's rename the function
to build_zonerefs_node to better reflect its purpose and give it
zoneref array to update. The function doesn't the index anymore. It
just returns the number of added zones so that the caller can advance
the zonered array start for the next update.

This patch alone doesn't introduce any functional change yet, though, it
is merely a preparatory work for later changes.

Changes since v1
- build_zonelists_node -> build_zonerefs_node and operate directly on
  zonerefs array rather than play tricks with index into the array.
- give build_zonelists_in_node_order nr_nodes to not iterate over all
  MAX_NUMNODES as per Mel

Signed-off-by: Michal Hocko 
---
 mm/page_alloc.c | 81 +
 1 file changed, 41 insertions(+), 40 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e9aca464f66..0d78dc5a708f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4772,18 +4772,17 @@ static void zoneref_set_zone(struct zone *zone, struct 
zoneref *zoneref)
  *
  * Add all populated zones of a node to the zonelist.
  */
-static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
-   int nr_zones)
+static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 {
struct zone *zone;
enum zone_type zone_type = MAX_NR_ZONES;
+   int nr_zones = 0;
 
do {
zone_type--;
zone = pgdat->node_zones + zone_type;
if (managed_zone(zone)) {
-   zoneref_set_zone(zone,
-   >_zonerefs[nr_zones++]);
+   zoneref_set_zone(zone, [nr_zones++]);
check_highest_zone(zone_type);
}
} while (zone_type);
@@ -4913,17 +4912,24 @@ static int find_next_best_node(int node, nodemask_t 
*used_node_mask)
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
-static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
+   unsigned nr_nodes)
 {
-   int j;
-   struct zonelist *zonelist;
+   struct zoneref *zonerefs;
+   int i;
+
+   zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+
+   for (i = 0; i < nr_nodes; i++) {
+   int nr_zones;
 
-   zonelist = >node_zonelists[ZONELIST_FALLBACK];
-   for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
-   ;
-   j = build_zonelists_node(NODE_DATA(node), zonelist, j);
-   zonelist->_zonerefs[j].zone = NULL;
-   zonelist->_zonerefs[j].zone_idx = 0;
+   pg_data_t *node = NODE_DATA(node_order[i]);
+
+   nr_zones = build_zonerefs_node(node, zonerefs);
+   zonerefs += nr_zones;
+   }
+   zonerefs->zone = NULL;
+   zonerefs->zone_idx = 0;
 }
 
 /*
@@ -4931,13 +4937,14 @@ static void build_zonelists_in_node_order(pg_data_t 
*pgdat, int node)
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-   int j;
-   struct zonelist *zonelist;
+   struct zoneref *zonerefs;
+   int nr_zones;
 
-   zonelist = >node_zonelists[ZONELIST_NOFALLBACK];
-   j = build_zonelists_node(pgdat, zonelist, 0);
-   zonelist->_zonerefs[j].zone = NULL;
-   zonelist->_zonerefs[j].zone_idx = 0;
+   zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs;
+   nr_zones = build_zonerefs_node(pgdat, zonerefs);
+   zonerefs += nr_zones;
+   zonerefs->zone = NULL;
+   zonerefs->zone_idx = 0;
 }
 
 /*
@@ -4946,21 +4953,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  * exhausted, but results in 

[PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-21 Thread Michal Hocko
From: Michal Hocko 

build_zonelists gradually builds zonelists from the nearest to the most
distant node. As we do not know how many populated zones we will have in
each node we rely on the _zoneref to terminate initialized part of the
zonelist by a NULL zone. While this is functionally correct it is quite
suboptimal because we cannot allow updaters to race with zonelists
users because they could see an empty zonelist and fail the allocation
or hit the OOM killer in the worst case.

We can do much better, though. We can store the node ordering into an
already existing node_order array and then give this array to
build_zonelists_in_node_order and do the whole initialization at once.
zonelists consumers still might see halfway initialized state but that
should be much more tolerateable because the list will not be empty and
they would either see some zone twice or skip over some zone(s) in the
worst case which shouldn't lead to immediate failures.

While at it let's simplify build_zonelists_node which is rather
confusing now. It gets an index into the zoneref array and returns
the updated index for the next iteration. Let's rename the function
to build_zonerefs_node to better reflect its purpose and give it
zoneref array to update. The function doesn't the index anymore. It
just returns the number of added zones so that the caller can advance
the zonered array start for the next update.

This patch alone doesn't introduce any functional change yet, though, it
is merely a preparatory work for later changes.

Changes since v1
- build_zonelists_node -> build_zonerefs_node and operate directly on
  zonerefs array rather than play tricks with index into the array.
- give build_zonelists_in_node_order nr_nodes to not iterate over all
  MAX_NUMNODES as per Mel

Signed-off-by: Michal Hocko 
---
 mm/page_alloc.c | 81 +
 1 file changed, 41 insertions(+), 40 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e9aca464f66..0d78dc5a708f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4772,18 +4772,17 @@ static void zoneref_set_zone(struct zone *zone, struct 
zoneref *zoneref)
  *
  * Add all populated zones of a node to the zonelist.
  */
-static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
-   int nr_zones)
+static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 {
struct zone *zone;
enum zone_type zone_type = MAX_NR_ZONES;
+   int nr_zones = 0;
 
do {
zone_type--;
zone = pgdat->node_zones + zone_type;
if (managed_zone(zone)) {
-   zoneref_set_zone(zone,
-   >_zonerefs[nr_zones++]);
+   zoneref_set_zone(zone, [nr_zones++]);
check_highest_zone(zone_type);
}
} while (zone_type);
@@ -4913,17 +4912,24 @@ static int find_next_best_node(int node, nodemask_t 
*used_node_mask)
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
-static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
+   unsigned nr_nodes)
 {
-   int j;
-   struct zonelist *zonelist;
+   struct zoneref *zonerefs;
+   int i;
+
+   zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+
+   for (i = 0; i < nr_nodes; i++) {
+   int nr_zones;
 
-   zonelist = >node_zonelists[ZONELIST_FALLBACK];
-   for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
-   ;
-   j = build_zonelists_node(NODE_DATA(node), zonelist, j);
-   zonelist->_zonerefs[j].zone = NULL;
-   zonelist->_zonerefs[j].zone_idx = 0;
+   pg_data_t *node = NODE_DATA(node_order[i]);
+
+   nr_zones = build_zonerefs_node(node, zonerefs);
+   zonerefs += nr_zones;
+   }
+   zonerefs->zone = NULL;
+   zonerefs->zone_idx = 0;
 }
 
 /*
@@ -4931,13 +4937,14 @@ static void build_zonelists_in_node_order(pg_data_t 
*pgdat, int node)
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-   int j;
-   struct zonelist *zonelist;
+   struct zoneref *zonerefs;
+   int nr_zones;
 
-   zonelist = >node_zonelists[ZONELIST_NOFALLBACK];
-   j = build_zonelists_node(pgdat, zonelist, 0);
-   zonelist->_zonerefs[j].zone = NULL;
-   zonelist->_zonerefs[j].zone_idx = 0;
+   zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs;
+   nr_zones = build_zonerefs_node(pgdat, zonerefs);
+   zonerefs += nr_zones;
+   zonerefs->zone = NULL;
+   zonerefs->zone_idx = 0;
 }
 
 /*
@@ -4946,21 +4953,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  * exhausted, but results in overflowing to remote node while memory
  * 

Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-20 Thread Michal Hocko
On Thu 20-07-17 08:55:42, Vlastimil Babka wrote:
> On 07/14/2017 10:00 AM, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > build_zonelists gradually builds zonelists from the nearest to the most
> > distant node. As we do not know how many populated zones we will have in
> > each node we rely on the _zoneref to terminate initialized part of the
> > zonelist by a NULL zone. While this is functionally correct it is quite
> > suboptimal because we cannot allow updaters to race with zonelists
> > users because they could see an empty zonelist and fail the allocation
> > or hit the OOM killer in the worst case.
> > 
> > We can do much better, though. We can store the node ordering into an
> > already existing node_order array and then give this array to
> > build_zonelists_in_node_order and do the whole initialization at once.
> > zonelists consumers still might see halfway initialized state but that
> > should be much more tolerateable because the list will not be empty and
> > they would either see some zone twice or skip over some zone(s) in the
> > worst case which shouldn't lead to immediate failures.
> > 
> > This patch alone doesn't introduce any functional change yet, though, it
> > is merely a preparatory work for later changes.
> > 
> > Signed-off-by: Michal Hocko 
> 
> I've collected the fold-ups from this thread and looked at the result as
> single patch. Sems OK, just two things:
> - please rename variable "i" in build_zonelists() to e.g. "nr_nodes"
> - the !CONFIG_NUMA variant of build_zonelists() won't build, because it
> doesn't declare nr_zones variable

Thanks! I will fold this in.
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0d3e8eeb150..6f192405e469 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4957,7 +4957,7 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
 static void build_zonelists(pg_data_t *pgdat)
 {
static int node_order[MAX_NUMNODES];
-   int node, load, i = 0;
+   int node, load, nr_nodes = 0;
nodemask_t used_mask;
int local_node, prev_node;
 
@@ -4978,12 +4978,12 @@ static void build_zonelists(pg_data_t *pgdat)
node_distance(local_node, prev_node))
node_load[node] = load;
 
-   node_order[i++] = node;
+   node_order[nr_nodes++] = node;
prev_node = node;
load--;
}
 
-   build_zonelists_in_node_order(pgdat, node_order, i);
+   build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
build_thisnode_zonelists(pgdat);
 }
 
@@ -5013,10 +5013,11 @@ static void build_zonelists(pg_data_t *pgdat)
 {
int node, local_node;
struct zoneref *zonerefs;
+   int nr_zones;
 
local_node = pgdat->node_id;
 
-   zonrefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+   zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
nr_zones = build_zonerefs_node(pgdat, zonerefs);
zonerefs += nr_zones;
 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-20 Thread Michal Hocko
On Thu 20-07-17 08:55:42, Vlastimil Babka wrote:
> On 07/14/2017 10:00 AM, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > build_zonelists gradually builds zonelists from the nearest to the most
> > distant node. As we do not know how many populated zones we will have in
> > each node we rely on the _zoneref to terminate initialized part of the
> > zonelist by a NULL zone. While this is functionally correct it is quite
> > suboptimal because we cannot allow updaters to race with zonelists
> > users because they could see an empty zonelist and fail the allocation
> > or hit the OOM killer in the worst case.
> > 
> > We can do much better, though. We can store the node ordering into an
> > already existing node_order array and then give this array to
> > build_zonelists_in_node_order and do the whole initialization at once.
> > zonelists consumers still might see halfway initialized state but that
> > should be much more tolerateable because the list will not be empty and
> > they would either see some zone twice or skip over some zone(s) in the
> > worst case which shouldn't lead to immediate failures.
> > 
> > This patch alone doesn't introduce any functional change yet, though, it
> > is merely a preparatory work for later changes.
> > 
> > Signed-off-by: Michal Hocko 
> 
> I've collected the fold-ups from this thread and looked at the result as
> single patch. Sems OK, just two things:
> - please rename variable "i" in build_zonelists() to e.g. "nr_nodes"
> - the !CONFIG_NUMA variant of build_zonelists() won't build, because it
> doesn't declare nr_zones variable

Thanks! I will fold this in.
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0d3e8eeb150..6f192405e469 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4957,7 +4957,7 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
 static void build_zonelists(pg_data_t *pgdat)
 {
static int node_order[MAX_NUMNODES];
-   int node, load, i = 0;
+   int node, load, nr_nodes = 0;
nodemask_t used_mask;
int local_node, prev_node;
 
@@ -4978,12 +4978,12 @@ static void build_zonelists(pg_data_t *pgdat)
node_distance(local_node, prev_node))
node_load[node] = load;
 
-   node_order[i++] = node;
+   node_order[nr_nodes++] = node;
prev_node = node;
load--;
}
 
-   build_zonelists_in_node_order(pgdat, node_order, i);
+   build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
build_thisnode_zonelists(pgdat);
 }
 
@@ -5013,10 +5013,11 @@ static void build_zonelists(pg_data_t *pgdat)
 {
int node, local_node;
struct zoneref *zonerefs;
+   int nr_zones;
 
local_node = pgdat->node_id;
 
-   zonrefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+   zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
nr_zones = build_zonerefs_node(pgdat, zonerefs);
zonerefs += nr_zones;
 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-20 Thread Vlastimil Babka
On 07/14/2017 10:00 AM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> build_zonelists gradually builds zonelists from the nearest to the most
> distant node. As we do not know how many populated zones we will have in
> each node we rely on the _zoneref to terminate initialized part of the
> zonelist by a NULL zone. While this is functionally correct it is quite
> suboptimal because we cannot allow updaters to race with zonelists
> users because they could see an empty zonelist and fail the allocation
> or hit the OOM killer in the worst case.
> 
> We can do much better, though. We can store the node ordering into an
> already existing node_order array and then give this array to
> build_zonelists_in_node_order and do the whole initialization at once.
> zonelists consumers still might see halfway initialized state but that
> should be much more tolerateable because the list will not be empty and
> they would either see some zone twice or skip over some zone(s) in the
> worst case which shouldn't lead to immediate failures.
> 
> This patch alone doesn't introduce any functional change yet, though, it
> is merely a preparatory work for later changes.
> 
> Signed-off-by: Michal Hocko 

I've collected the fold-ups from this thread and looked at the result as
single patch. Sems OK, just two things:
- please rename variable "i" in build_zonelists() to e.g. "nr_nodes"
- the !CONFIG_NUMA variant of build_zonelists() won't build, because it
doesn't declare nr_zones variable


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-20 Thread Vlastimil Babka
On 07/14/2017 10:00 AM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> build_zonelists gradually builds zonelists from the nearest to the most
> distant node. As we do not know how many populated zones we will have in
> each node we rely on the _zoneref to terminate initialized part of the
> zonelist by a NULL zone. While this is functionally correct it is quite
> suboptimal because we cannot allow updaters to race with zonelists
> users because they could see an empty zonelist and fail the allocation
> or hit the OOM killer in the worst case.
> 
> We can do much better, though. We can store the node ordering into an
> already existing node_order array and then give this array to
> build_zonelists_in_node_order and do the whole initialization at once.
> zonelists consumers still might see halfway initialized state but that
> should be much more tolerateable because the list will not be empty and
> they would either see some zone twice or skip over some zone(s) in the
> worst case which shouldn't lead to immediate failures.
> 
> This patch alone doesn't introduce any functional change yet, though, it
> is merely a preparatory work for later changes.
> 
> Signed-off-by: Michal Hocko 

I've collected the fold-ups from this thread and looked at the result as
single patch. Sems OK, just two things:
- please rename variable "i" in build_zonelists() to e.g. "nr_nodes"
- the !CONFIG_NUMA variant of build_zonelists() won't build, because it
doesn't declare nr_zones variable


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-17 Thread Michal Hocko
On Mon 17-07-17 09:58:04, Mel Gorman wrote:
> On Mon, Jul 17, 2017 at 10:19:42AM +0200, Michal Hocko wrote:
> > On Mon 17-07-17 09:07:23, Mel Gorman wrote:
> > > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> > > > On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > > > > Fairly sure that's not what you meant.
> > > > > 
> > > > > 
> > > > > > pg_data_t *node = NODE_DATA(node_order[i]);
> > > > > >  
> > > > > > -   zoneref_idx = build_zonelists_node(node, zonelist, 
> > > > > > zoneref_idx);
> > > > > > +   nr_zones = build_zonelists_node(node, zonelist, 
> > > > > > nr_zones);
> > > > > 
> > > > > I meant converting build_zonelists_node and passing in _zones and
> > > > > returning false when an empty node is encountered. In this context,
> > > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones 
> > > > > in
> > > > > build_zonelists_node as well.
> > > > 
> > > > hmm, why don't we rather make it zonerefs based then. Something
> > > > like the following?
> > > 
> > > Works for me.
> > 
> > Should I fold it to the patch or make it a patch on its own?
> 
> I have no strong feelings either way but if it was folded then the
> overall naming should be easier to follow (at least for me).

OK, I will fold it in then. Unless there are more issues/feedback to
address I will repost the full series in few days.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-17 Thread Michal Hocko
On Mon 17-07-17 09:58:04, Mel Gorman wrote:
> On Mon, Jul 17, 2017 at 10:19:42AM +0200, Michal Hocko wrote:
> > On Mon 17-07-17 09:07:23, Mel Gorman wrote:
> > > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> > > > On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > > > > Fairly sure that's not what you meant.
> > > > > 
> > > > > 
> > > > > > pg_data_t *node = NODE_DATA(node_order[i]);
> > > > > >  
> > > > > > -   zoneref_idx = build_zonelists_node(node, zonelist, 
> > > > > > zoneref_idx);
> > > > > > +   nr_zones = build_zonelists_node(node, zonelist, 
> > > > > > nr_zones);
> > > > > 
> > > > > I meant converting build_zonelists_node and passing in _zones and
> > > > > returning false when an empty node is encountered. In this context,
> > > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones 
> > > > > in
> > > > > build_zonelists_node as well.
> > > > 
> > > > hmm, why don't we rather make it zonerefs based then. Something
> > > > like the following?
> > > 
> > > Works for me.
> > 
> > Should I fold it to the patch or make it a patch on its own?
> 
> I have no strong feelings either way but if it was folded then the
> overall naming should be easier to follow (at least for me).

OK, I will fold it in then. Unless there are more issues/feedback to
address I will repost the full series in few days.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-17 Thread Mel Gorman
On Mon, Jul 17, 2017 at 10:19:42AM +0200, Michal Hocko wrote:
> On Mon 17-07-17 09:07:23, Mel Gorman wrote:
> > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> > > On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > > > Fairly sure that's not what you meant.
> > > > 
> > > > 
> > > > >   pg_data_t *node = NODE_DATA(node_order[i]);
> > > > >  
> > > > > - zoneref_idx = build_zonelists_node(node, zonelist, 
> > > > > zoneref_idx);
> > > > > + nr_zones = build_zonelists_node(node, zonelist, 
> > > > > nr_zones);
> > > > 
> > > > I meant converting build_zonelists_node and passing in _zones and
> > > > returning false when an empty node is encountered. In this context,
> > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> > > > build_zonelists_node as well.
> > > 
> > > hmm, why don't we rather make it zonerefs based then. Something
> > > like the following?
> > 
> > Works for me.
> 
> Should I fold it to the patch or make it a patch on its own?

I have no strong feelings either way but if it was folded then the
overall naming should be easier to follow (at least for me).

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-17 Thread Mel Gorman
On Mon, Jul 17, 2017 at 10:19:42AM +0200, Michal Hocko wrote:
> On Mon 17-07-17 09:07:23, Mel Gorman wrote:
> > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> > > On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > > > Fairly sure that's not what you meant.
> > > > 
> > > > 
> > > > >   pg_data_t *node = NODE_DATA(node_order[i]);
> > > > >  
> > > > > - zoneref_idx = build_zonelists_node(node, zonelist, 
> > > > > zoneref_idx);
> > > > > + nr_zones = build_zonelists_node(node, zonelist, 
> > > > > nr_zones);
> > > > 
> > > > I meant converting build_zonelists_node and passing in _zones and
> > > > returning false when an empty node is encountered. In this context,
> > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> > > > build_zonelists_node as well.
> > > 
> > > hmm, why don't we rather make it zonerefs based then. Something
> > > like the following?
> > 
> > Works for me.
> 
> Should I fold it to the patch or make it a patch on its own?

I have no strong feelings either way but if it was folded then the
overall naming should be easier to follow (at least for me).

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-17 Thread Michal Hocko
On Mon 17-07-17 09:07:23, Mel Gorman wrote:
> On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> > On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > > Fairly sure that's not what you meant.
> > > 
> > > 
> > > > pg_data_t *node = NODE_DATA(node_order[i]);
> > > >  
> > > > -   zoneref_idx = build_zonelists_node(node, zonelist, 
> > > > zoneref_idx);
> > > > +   nr_zones = build_zonelists_node(node, zonelist, 
> > > > nr_zones);
> > > 
> > > I meant converting build_zonelists_node and passing in _zones and
> > > returning false when an empty node is encountered. In this context,
> > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> > > build_zonelists_node as well.
> > 
> > hmm, why don't we rather make it zonerefs based then. Something
> > like the following?
> 
> Works for me.

Should I fold it to the patch or make it a patch on its own?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-17 Thread Michal Hocko
On Mon 17-07-17 09:07:23, Mel Gorman wrote:
> On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> > On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > > Fairly sure that's not what you meant.
> > > 
> > > 
> > > > pg_data_t *node = NODE_DATA(node_order[i]);
> > > >  
> > > > -   zoneref_idx = build_zonelists_node(node, zonelist, 
> > > > zoneref_idx);
> > > > +   nr_zones = build_zonelists_node(node, zonelist, 
> > > > nr_zones);
> > > 
> > > I meant converting build_zonelists_node and passing in _zones and
> > > returning false when an empty node is encountered. In this context,
> > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> > > build_zonelists_node as well.
> > 
> > hmm, why don't we rather make it zonerefs based then. Something
> > like the following?
> 
> Works for me.

Should I fold it to the patch or make it a patch on its own?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-17 Thread Mel Gorman
On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > Fairly sure that's not what you meant.
> > 
> > 
> > >   pg_data_t *node = NODE_DATA(node_order[i]);
> > >  
> > > - zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > > + nr_zones = build_zonelists_node(node, zonelist, nr_zones);
> > 
> > I meant converting build_zonelists_node and passing in _zones and
> > returning false when an empty node is encountered. In this context,
> > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> > build_zonelists_node as well.
> 
> hmm, why don't we rather make it zonerefs based then. Something
> like the following?

Works for me.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-17 Thread Mel Gorman
On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > Fairly sure that's not what you meant.
> > 
> > 
> > >   pg_data_t *node = NODE_DATA(node_order[i]);
> > >  
> > > - zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > > + nr_zones = build_zonelists_node(node, zonelist, nr_zones);
> > 
> > I meant converting build_zonelists_node and passing in _zones and
> > returning false when an empty node is encountered. In this context,
> > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> > build_zonelists_node as well.
> 
> hmm, why don't we rather make it zonerefs based then. Something
> like the following?

Works for me.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-17 Thread Michal Hocko
On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 03:02:42PM +0200, Michal Hocko wrote:
[...]
> > What do you think about this on top?
> > ---
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 49bade7ff049..3b98524c04ec 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t 
> > *used_node_mask)
> >   * This results in maximum locality--normal zone overflows into local
> >   * DMA zone, if any--but risks exhausting DMA zone.
> >   */
> > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int 
> > *node_order)
> > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int 
> > *node_order,
> > +   unsigned nr_nodes)
> >  {
> > struct zonelist *zonelist;
> > -   int i, zoneref_idx = 0;
> > +   int i, nr_zones = 0;
> >  
> > zonelist = >node_zonelists[ZONELIST_FALLBACK];
> >  
> > -   for (i = 0; i < MAX_NUMNODES; i++) {
> > +   for (i = 0; i < nr_nodes; i++) {
> 
> The first iteration is then -- for (i = 0; i < 0; i++)

find_next_best_node always returns at least one node (the local one) so
I believe that nr_nodes should never be 0.

> Fairly sure that's not what you meant.
> 
> 
> > pg_data_t *node = NODE_DATA(node_order[i]);
> >  
> > -   zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > +   nr_zones = build_zonelists_node(node, zonelist, nr_zones);
> 
> I meant converting build_zonelists_node and passing in _zones and
> returning false when an empty node is encountered. In this context,
> it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> build_zonelists_node as well.

hmm, why don't we rather make it zonerefs based then. Something
like the following?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b98524c04ec..01e67e629351 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4772,18 +4772,17 @@ static void zoneref_set_zone(struct zone *zone, struct 
zoneref *zoneref)
  *
  * Add all populated zones of a node to the zonelist.
  */
-static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
-   int nr_zones)
+static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 {
struct zone *zone;
enum zone_type zone_type = MAX_NR_ZONES;
+   int nr_zones = 0;
 
do {
zone_type--;
zone = pgdat->node_zones + zone_type;
if (managed_zone(zone)) {
-   zoneref_set_zone(zone,
-   >_zonerefs[nr_zones++]);
+   zoneref_set_zone(zone, [nr_zones++]);
check_highest_zone(zone_type);
}
} while (zone_type);
@@ -4916,18 +4915,21 @@ static int find_next_best_node(int node, nodemask_t 
*used_node_mask)
 static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
unsigned nr_nodes)
 {
-   struct zonelist *zonelist;
-   int i, nr_zones = 0;
+   struct zoneref *zonerefs;
+   int i;
 
-   zonelist = >node_zonelists[ZONELIST_FALLBACK];
+   zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
 
for (i = 0; i < nr_nodes; i++) {
+   int nr_zones;
+
pg_data_t *node = NODE_DATA(node_order[i]);
 
-   nr_zones = build_zonelists_node(node, zonelist, nr_zones);
+   nr_zones = build_zonerefs_node(node, zonerefs);
+   zonerefs += nr_zones;
}
-   zonelist->_zonerefs[nr_zones].zone = NULL;
-   zonelist->_zonerefs[nr_zones].zone_idx = 0;
+   zonerefs->zone = NULL;
+   zonerefs->zone_idx = 0;
 }
 
 /*
@@ -4935,13 +4937,14 @@ static void build_zonelists_in_node_order(pg_data_t 
*pgdat, int *node_order,
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-   struct zonelist *zonelist;
-   int nr_zones = 0;
+   struct zoneref *zonerefs;
+   int nr_zones;
 
-   zonelist = >node_zonelists[ZONELIST_NOFALLBACK];
-   nr_zones = build_zonelists_node(pgdat, zonelist, nr_zones);
-   zonelist->_zonerefs[nr_zones].zone = NULL;
-   zonelist->_zonerefs[nr_zones].zone_idx = 0;
+   zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs;
+   nr_zones = build_zonerefs_node(pgdat, zonerefs);
+   zonerefs += nr_zones;
+   zonerefs->zone = NULL;
+   zonerefs->zone_idx = 0;
 }
 
 /*
@@ -5009,13 +5012,13 @@ static void setup_min_slab_ratio(void);
 static void build_zonelists(pg_data_t *pgdat)
 {
int node, local_node;
-   enum zone_type j;
-   struct zonelist *zonelist;
+   struct zoneref *zonerefs;
 
local_node = pgdat->node_id;
 
-   zonelist = >node_zonelists[ZONELIST_FALLBACK];
-   j = build_zonelists_node(pgdat, zonelist, 0);
+   zonrefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+   nr_zones = 

Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-17 Thread Michal Hocko
On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 03:02:42PM +0200, Michal Hocko wrote:
[...]
> > What do you think about this on top?
> > ---
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 49bade7ff049..3b98524c04ec 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t 
> > *used_node_mask)
> >   * This results in maximum locality--normal zone overflows into local
> >   * DMA zone, if any--but risks exhausting DMA zone.
> >   */
> > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int 
> > *node_order)
> > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int 
> > *node_order,
> > +   unsigned nr_nodes)
> >  {
> > struct zonelist *zonelist;
> > -   int i, zoneref_idx = 0;
> > +   int i, nr_zones = 0;
> >  
> > zonelist = >node_zonelists[ZONELIST_FALLBACK];
> >  
> > -   for (i = 0; i < MAX_NUMNODES; i++) {
> > +   for (i = 0; i < nr_nodes; i++) {
> 
> The first iteration is then -- for (i = 0; i < 0; i++)

find_next_best_node always returns at least one node (the local one) so
I believe that nr_nodes should never be 0.

> Fairly sure that's not what you meant.
> 
> 
> > pg_data_t *node = NODE_DATA(node_order[i]);
> >  
> > -   zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > +   nr_zones = build_zonelists_node(node, zonelist, nr_zones);
> 
> I meant converting build_zonelists_node and passing in _zones and
> returning false when an empty node is encountered. In this context,
> it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> build_zonelists_node as well.

hmm, why don't we rather make it zonerefs based then. Something
like the following?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b98524c04ec..01e67e629351 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4772,18 +4772,17 @@ static void zoneref_set_zone(struct zone *zone, struct 
zoneref *zoneref)
  *
  * Add all populated zones of a node to the zonelist.
  */
-static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
-   int nr_zones)
+static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 {
struct zone *zone;
enum zone_type zone_type = MAX_NR_ZONES;
+   int nr_zones = 0;
 
do {
zone_type--;
zone = pgdat->node_zones + zone_type;
if (managed_zone(zone)) {
-   zoneref_set_zone(zone,
-   >_zonerefs[nr_zones++]);
+   zoneref_set_zone(zone, [nr_zones++]);
check_highest_zone(zone_type);
}
} while (zone_type);
@@ -4916,18 +4915,21 @@ static int find_next_best_node(int node, nodemask_t 
*used_node_mask)
 static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
unsigned nr_nodes)
 {
-   struct zonelist *zonelist;
-   int i, nr_zones = 0;
+   struct zoneref *zonerefs;
+   int i;
 
-   zonelist = >node_zonelists[ZONELIST_FALLBACK];
+   zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
 
for (i = 0; i < nr_nodes; i++) {
+   int nr_zones;
+
pg_data_t *node = NODE_DATA(node_order[i]);
 
-   nr_zones = build_zonelists_node(node, zonelist, nr_zones);
+   nr_zones = build_zonerefs_node(node, zonerefs);
+   zonerefs += nr_zones;
}
-   zonelist->_zonerefs[nr_zones].zone = NULL;
-   zonelist->_zonerefs[nr_zones].zone_idx = 0;
+   zonerefs->zone = NULL;
+   zonerefs->zone_idx = 0;
 }
 
 /*
@@ -4935,13 +4937,14 @@ static void build_zonelists_in_node_order(pg_data_t 
*pgdat, int *node_order,
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-   struct zonelist *zonelist;
-   int nr_zones = 0;
+   struct zoneref *zonerefs;
+   int nr_zones;
 
-   zonelist = >node_zonelists[ZONELIST_NOFALLBACK];
-   nr_zones = build_zonelists_node(pgdat, zonelist, nr_zones);
-   zonelist->_zonerefs[nr_zones].zone = NULL;
-   zonelist->_zonerefs[nr_zones].zone_idx = 0;
+   zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs;
+   nr_zones = build_zonerefs_node(pgdat, zonerefs);
+   zonerefs += nr_zones;
+   zonerefs->zone = NULL;
+   zonerefs->zone_idx = 0;
 }
 
 /*
@@ -5009,13 +5012,13 @@ static void setup_min_slab_ratio(void);
 static void build_zonelists(pg_data_t *pgdat)
 {
int node, local_node;
-   enum zone_type j;
-   struct zonelist *zonelist;
+   struct zoneref *zonerefs;
 
local_node = pgdat->node_id;
 
-   zonelist = >node_zonelists[ZONELIST_FALLBACK];
-   j = build_zonelists_node(pgdat, zonelist, 0);
+   zonrefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+   nr_zones = 

Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Mel Gorman
On Fri, Jul 14, 2017 at 03:02:42PM +0200, Michal Hocko wrote:
> > It *might* be safer given the next patch to zero out the remainder of
> > the _zonerefs to that there is no combination of node add/remove that has
> > an iterator working with a semi-valid _zoneref which is beyond the last
> > correct value. It *should* be safe as the very last entry will always
> > be null but if you don't zero it out, it is possible for iterators to be
> > working beyond the "end" of the zonelist for a short window.
> 
> yes that is true but there will always be terminating NULL zone and I
> found that acceptable. It is basically the same thing as accessing an
> empty zone or a zone twice. Or do you think this is absolutely necessary
> to handle?
> 

I don't think it's absolutely necessary. While you could construct some
odd behaviour for iterators currently past the end of the list, they would
eventually encounter a NULL.

> > Otherwise think it's ok including my stupid comment about node_order
> > stack usage.
> 
> What do you think about this on top?
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 49bade7ff049..3b98524c04ec 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t 
> *used_node_mask)
>   * This results in maximum locality--normal zone overflows into local
>   * DMA zone, if any--but risks exhausting DMA zone.
>   */
> -static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
> +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
> + unsigned nr_nodes)
>  {
>   struct zonelist *zonelist;
> - int i, zoneref_idx = 0;
> + int i, nr_zones = 0;
>  
>   zonelist = >node_zonelists[ZONELIST_FALLBACK];
>  
> - for (i = 0; i < MAX_NUMNODES; i++) {
> + for (i = 0; i < nr_nodes; i++) {

The first iteration is then -- for (i = 0; i < 0; i++)

Fairly sure that's not what you meant.


>   pg_data_t *node = NODE_DATA(node_order[i]);
>  
> - zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> + nr_zones = build_zonelists_node(node, zonelist, nr_zones);

I meant converting build_zonelists_node and passing in _zones and
returning false when an empty node is encountered. In this context,
it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
build_zonelists_node as well.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Mel Gorman
On Fri, Jul 14, 2017 at 03:02:42PM +0200, Michal Hocko wrote:
> > It *might* be safer given the next patch to zero out the remainder of
> > the _zonerefs to that there is no combination of node add/remove that has
> > an iterator working with a semi-valid _zoneref which is beyond the last
> > correct value. It *should* be safe as the very last entry will always
> > be null but if you don't zero it out, it is possible for iterators to be
> > working beyond the "end" of the zonelist for a short window.
> 
> yes that is true but there will always be terminating NULL zone and I
> found that acceptable. It is basically the same thing as accessing an
> empty zone or a zone twice. Or do you think this is absolutely necessary
> to handle?
> 

I don't think it's absolutely necessary. While you could construct some
odd behaviour for iterators currently past the end of the list, they would
eventually encounter a NULL.

> > Otherwise think it's ok including my stupid comment about node_order
> > stack usage.
> 
> What do you think about this on top?
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 49bade7ff049..3b98524c04ec 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t 
> *used_node_mask)
>   * This results in maximum locality--normal zone overflows into local
>   * DMA zone, if any--but risks exhausting DMA zone.
>   */
> -static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
> +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
> + unsigned nr_nodes)
>  {
>   struct zonelist *zonelist;
> - int i, zoneref_idx = 0;
> + int i, nr_zones = 0;
>  
>   zonelist = >node_zonelists[ZONELIST_FALLBACK];
>  
> - for (i = 0; i < MAX_NUMNODES; i++) {
> + for (i = 0; i < nr_nodes; i++) {

The first iteration is then -- for (i = 0; i < 0; i++)

Fairly sure that's not what you meant.


>   pg_data_t *node = NODE_DATA(node_order[i]);
>  
> - zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> + nr_zones = build_zonelists_node(node, zonelist, nr_zones);

I meant converting build_zonelists_node and passing in _zones and
returning false when an empty node is encountered. In this context,
it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
build_zonelists_node as well.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Michal Hocko
On Fri 14-07-17 13:46:46, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > build_zonelists gradually builds zonelists from the nearest to the most
> > distant node. As we do not know how many populated zones we will have in
> > each node we rely on the _zoneref to terminate initialized part of the
> > zonelist by a NULL zone. While this is functionally correct it is quite
> > suboptimal because we cannot allow updaters to race with zonelists
> > users because they could see an empty zonelist and fail the allocation
> > or hit the OOM killer in the worst case.
> > 
> > We can do much better, though. We can store the node ordering into an
> > already existing node_order array and then give this array to
> > build_zonelists_in_node_order and do the whole initialization at once.
> > zonelists consumers still might see halfway initialized state but that
> > should be much more tolerateable because the list will not be empty and
> > they would either see some zone twice or skip over some zone(s) in the
> > worst case which shouldn't lead to immediate failures.
> > 
> > This patch alone doesn't introduce any functional change yet, though, it
> > is merely a preparatory work for later changes.
> > 
> > Signed-off-by: Michal Hocko 
> > ---
> >  mm/page_alloc.c | 42 ++
> >  1 file changed, 18 insertions(+), 24 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 00e117922b3f..78bd62418380 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t 
> > *used_node_mask)
> >   * This results in maximum locality--normal zone overflows into local
> >   * DMA zone, if any--but risks exhausting DMA zone.
> >   */
> > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
> > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int 
> > *node_order)
> >  {
> > -   int j;
> > struct zonelist *zonelist;
> > +   int i, zoneref_idx = 0;
> >  
> > zonelist = >node_zonelists[ZONELIST_FALLBACK];
> > -   for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
> > -   ;
> > -   j = build_zonelists_node(NODE_DATA(node), zonelist, j);
> > -   zonelist->_zonerefs[j].zone = NULL;
> > -   zonelist->_zonerefs[j].zone_idx = 0;
> > +
> > +   for (i = 0; i < MAX_NUMNODES; i++) {
> > +   pg_data_t *node = NODE_DATA(node_order[i]);
> > +
> > +   zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > +   }
> 
> The naming here is weird to say the least and makes this a lot more
> confusing than it needs to be. Primarily, it's because the zoneref_idx
> parameter gets renamed to nr_zones in build_zonelists_node where it's
> nothing to do with the number of zones at all.

you are right. I just wanted to get rid of `j' and didn't realize
nr_zones would fit much better.

> It also iterates for longer than it needs to. MAX_NUMNODES can be a
> large value of mostly empty nodes but it happily goes through them
> anyway. Pass zoneref_idx in as a pointer that is updated by the function
> and use the return value to break the loop when an empty node is
> encountered?
> 
> > +   zonelist->_zonerefs[zoneref_idx].zone = NULL;
> > +   zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
> >  }
> >  
> 
> It *might* be safer given the next patch to zero out the remainder of
> the _zonerefs to that there is no combination of node add/remove that has
> an iterator working with a semi-valid _zoneref which is beyond the last
> correct value. It *should* be safe as the very last entry will always
> be null but if you don't zero it out, it is possible for iterators to be
> working beyond the "end" of the zonelist for a short window.

yes that is true but there will always be terminating NULL zone and I
found that acceptable. It is basically the same thing as accessing an
empty zone or a zone twice. Or do you think this is absolutely necessary
to handle?

> Otherwise think it's ok including my stupid comment about node_order
> stack usage.

What do you think about this on top?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 49bade7ff049..3b98524c04ec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t 
*used_node_mask)
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
-static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
+   unsigned nr_nodes)
 {
struct zonelist *zonelist;
-   int i, zoneref_idx = 0;
+   int i, nr_zones = 0;
 
zonelist = >node_zonelists[ZONELIST_FALLBACK];
 
-   for (i = 0; i < MAX_NUMNODES; i++) {
+   for (i = 0; i < nr_nodes; i++) {
  

Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Michal Hocko
On Fri 14-07-17 13:46:46, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > build_zonelists gradually builds zonelists from the nearest to the most
> > distant node. As we do not know how many populated zones we will have in
> > each node we rely on the _zoneref to terminate initialized part of the
> > zonelist by a NULL zone. While this is functionally correct it is quite
> > suboptimal because we cannot allow updaters to race with zonelists
> > users because they could see an empty zonelist and fail the allocation
> > or hit the OOM killer in the worst case.
> > 
> > We can do much better, though. We can store the node ordering into an
> > already existing node_order array and then give this array to
> > build_zonelists_in_node_order and do the whole initialization at once.
> > zonelists consumers still might see halfway initialized state but that
> > should be much more tolerateable because the list will not be empty and
> > they would either see some zone twice or skip over some zone(s) in the
> > worst case which shouldn't lead to immediate failures.
> > 
> > This patch alone doesn't introduce any functional change yet, though, it
> > is merely a preparatory work for later changes.
> > 
> > Signed-off-by: Michal Hocko 
> > ---
> >  mm/page_alloc.c | 42 ++
> >  1 file changed, 18 insertions(+), 24 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 00e117922b3f..78bd62418380 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t 
> > *used_node_mask)
> >   * This results in maximum locality--normal zone overflows into local
> >   * DMA zone, if any--but risks exhausting DMA zone.
> >   */
> > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
> > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int 
> > *node_order)
> >  {
> > -   int j;
> > struct zonelist *zonelist;
> > +   int i, zoneref_idx = 0;
> >  
> > zonelist = >node_zonelists[ZONELIST_FALLBACK];
> > -   for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
> > -   ;
> > -   j = build_zonelists_node(NODE_DATA(node), zonelist, j);
> > -   zonelist->_zonerefs[j].zone = NULL;
> > -   zonelist->_zonerefs[j].zone_idx = 0;
> > +
> > +   for (i = 0; i < MAX_NUMNODES; i++) {
> > +   pg_data_t *node = NODE_DATA(node_order[i]);
> > +
> > +   zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > +   }
> 
> The naming here is weird to say the least and makes this a lot more
> confusing than it needs to be. Primarily, it's because the zoneref_idx
> parameter gets renamed to nr_zones in build_zonelists_node where it's
> nothing to do with the number of zones at all.

you are right. I just wanted to get rid of `j' and didn't realize
nr_zones would fit much better.

> It also iterates for longer than it needs to. MAX_NUMNODES can be a
> large value of mostly empty nodes but it happily goes through them
> anyway. Pass zoneref_idx in as a pointer that is updated by the function
> and use the return value to break the loop when an empty node is
> encountered?
> 
> > +   zonelist->_zonerefs[zoneref_idx].zone = NULL;
> > +   zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
> >  }
> >  
> 
> It *might* be safer given the next patch to zero out the remainder of
> the _zonerefs to that there is no combination of node add/remove that has
> an iterator working with a semi-valid _zoneref which is beyond the last
> correct value. It *should* be safe as the very last entry will always
> be null but if you don't zero it out, it is possible for iterators to be
> working beyond the "end" of the zonelist for a short window.

yes that is true but there will always be terminating NULL zone and I
found that acceptable. It is basically the same thing as accessing an
empty zone or a zone twice. Or do you think this is absolutely necessary
to handle?

> Otherwise think it's ok including my stupid comment about node_order
> stack usage.

What do you think about this on top?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 49bade7ff049..3b98524c04ec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t 
*used_node_mask)
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
-static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
+   unsigned nr_nodes)
 {
struct zonelist *zonelist;
-   int i, zoneref_idx = 0;
+   int i, nr_zones = 0;
 
zonelist = >node_zonelists[ZONELIST_FALLBACK];
 
-   for (i = 0; i < MAX_NUMNODES; i++) {
+   for (i = 0; i < nr_nodes; i++) {
pg_data_t *node = 

Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Mel Gorman
On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
> From: Michal Hocko 
> 
> build_zonelists gradually builds zonelists from the nearest to the most
> distant node. As we do not know how many populated zones we will have in
> each node we rely on the _zoneref to terminate initialized part of the
> zonelist by a NULL zone. While this is functionally correct it is quite
> suboptimal because we cannot allow updaters to race with zonelists
> users because they could see an empty zonelist and fail the allocation
> or hit the OOM killer in the worst case.
> 
> We can do much better, though. We can store the node ordering into an
> already existing node_order array and then give this array to
> build_zonelists_in_node_order and do the whole initialization at once.
> zonelists consumers still might see halfway initialized state but that
> should be much more tolerateable because the list will not be empty and
> they would either see some zone twice or skip over some zone(s) in the
> worst case which shouldn't lead to immediate failures.
> 
> This patch alone doesn't introduce any functional change yet, though, it
> is merely a preparatory work for later changes.
> 
> Signed-off-by: Michal Hocko 
> ---
>  mm/page_alloc.c | 42 ++
>  1 file changed, 18 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 00e117922b3f..78bd62418380 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t 
> *used_node_mask)
>   * This results in maximum locality--normal zone overflows into local
>   * DMA zone, if any--but risks exhausting DMA zone.
>   */
> -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
> +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
>  {
> - int j;
>   struct zonelist *zonelist;
> + int i, zoneref_idx = 0;
>  
>   zonelist = >node_zonelists[ZONELIST_FALLBACK];
> - for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
> - ;
> - j = build_zonelists_node(NODE_DATA(node), zonelist, j);
> - zonelist->_zonerefs[j].zone = NULL;
> - zonelist->_zonerefs[j].zone_idx = 0;
> +
> + for (i = 0; i < MAX_NUMNODES; i++) {
> + pg_data_t *node = NODE_DATA(node_order[i]);
> +
> + zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> + }

The naming here is weird to say the least and makes this a lot more
confusing than it needs to be. Primarily, it's because the zoneref_idx
parameter gets renamed to nr_zones in build_zonelists_node where it's
nothing to do with the number of zones at all.

It also iterates for longer than it needs to. MAX_NUMNODES can be a
large value of mostly empty nodes but it happily goes through them
anyway. Pass zoneref_idx in as a pointer that is updated by the function
and use the return value to break the loop when an empty node is
encountered?

> + zonelist->_zonerefs[zoneref_idx].zone = NULL;
> + zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
>  }
>  

It *might* be safer given the next patch to zero out the remainder of
the _zonerefs to that there is no combination of node add/remove that has
an iterator working with a semi-valid _zoneref which is beyond the last
correct value. It *should* be safe as the very last entry will always
be null but if you don't zero it out, it is possible for iterators to be
working beyond the "end" of the zonelist for a short window.

Otherwise think it's ok including my stupid comment about node_order
stack usage.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Mel Gorman
On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
> From: Michal Hocko 
> 
> build_zonelists gradually builds zonelists from the nearest to the most
> distant node. As we do not know how many populated zones we will have in
> each node we rely on the _zoneref to terminate initialized part of the
> zonelist by a NULL zone. While this is functionally correct it is quite
> suboptimal because we cannot allow updaters to race with zonelists
> users because they could see an empty zonelist and fail the allocation
> or hit the OOM killer in the worst case.
> 
> We can do much better, though. We can store the node ordering into an
> already existing node_order array and then give this array to
> build_zonelists_in_node_order and do the whole initialization at once.
> zonelists consumers still might see halfway initialized state but that
> should be much more tolerateable because the list will not be empty and
> they would either see some zone twice or skip over some zone(s) in the
> worst case which shouldn't lead to immediate failures.
> 
> This patch alone doesn't introduce any functional change yet, though, it
> is merely a preparatory work for later changes.
> 
> Signed-off-by: Michal Hocko 
> ---
>  mm/page_alloc.c | 42 ++
>  1 file changed, 18 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 00e117922b3f..78bd62418380 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t 
> *used_node_mask)
>   * This results in maximum locality--normal zone overflows into local
>   * DMA zone, if any--but risks exhausting DMA zone.
>   */
> -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
> +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
>  {
> - int j;
>   struct zonelist *zonelist;
> + int i, zoneref_idx = 0;
>  
>   zonelist = >node_zonelists[ZONELIST_FALLBACK];
> - for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
> - ;
> - j = build_zonelists_node(NODE_DATA(node), zonelist, j);
> - zonelist->_zonerefs[j].zone = NULL;
> - zonelist->_zonerefs[j].zone_idx = 0;
> +
> + for (i = 0; i < MAX_NUMNODES; i++) {
> + pg_data_t *node = NODE_DATA(node_order[i]);
> +
> + zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> + }

The naming here is weird to say the least and makes this a lot more
confusing than it needs to be. Primarily, it's because the zoneref_idx
parameter gets renamed to nr_zones in build_zonelists_node where it's
nothing to do with the number of zones at all.

It also iterates for longer than it needs to. MAX_NUMNODES can be a
large value of mostly empty nodes but it happily goes through them
anyway. Pass zoneref_idx in as a pointer that is updated by the function
and use the return value to break the loop when an empty node is
encountered?

> + zonelist->_zonerefs[zoneref_idx].zone = NULL;
> + zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
>  }
>  

It *might* be safer given the next patch to zero out the remainder of
the _zonerefs to that there is no combination of node add/remove that has
an iterator working with a semi-valid _zoneref which is beyond the last
correct value. It *should* be safe as the very last entry will always
be null but if you don't zero it out, it is possible for iterators to be
working beyond the "end" of the zonelist for a short window.

Otherwise think it's ok including my stupid comment about node_order
stack usage.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Michal Hocko
On Fri 14-07-17 10:55:34, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
> >  
> > zonelist = >node_zonelists[ZONELIST_NOFALLBACK];
> > -   j = build_zonelists_node(pgdat, zonelist, 0);
> > -   zonelist->_zonerefs[j].zone = NULL;
> > -   zonelist->_zonerefs[j].zone_idx = 0;
> > +   zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx);
> > +   zonelist->_zonerefs[zoneref_idx].zone = NULL;
> > +   zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
> >  }
> >  
> >  /*
> > @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t 
> > *pgdat)
> >   * exhausted, but results in overflowing to remote node while memory
> >   * may still exist in local DMA zone.
> >   */
> > -static int node_order[MAX_NUMNODES];
> >  
> >  static void build_zonelists(pg_data_t *pgdat)
> >  {
> > -   int i, node, load;
> > +   static int node_order[MAX_NUMNODES];
> > +   int node, load, i = 0;
> 
> Emm, node_order can be large. The first distro config I checked
> indicated that this is 8K. I got hung up on that part and didn't look
> closely at the rest of the patch.

yes, that's why I kept it static. I just placed it into the function to
make it clear what the scope is.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Michal Hocko
On Fri 14-07-17 10:55:34, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
> >  
> > zonelist = >node_zonelists[ZONELIST_NOFALLBACK];
> > -   j = build_zonelists_node(pgdat, zonelist, 0);
> > -   zonelist->_zonerefs[j].zone = NULL;
> > -   zonelist->_zonerefs[j].zone_idx = 0;
> > +   zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx);
> > +   zonelist->_zonerefs[zoneref_idx].zone = NULL;
> > +   zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
> >  }
> >  
> >  /*
> > @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t 
> > *pgdat)
> >   * exhausted, but results in overflowing to remote node while memory
> >   * may still exist in local DMA zone.
> >   */
> > -static int node_order[MAX_NUMNODES];
> >  
> >  static void build_zonelists(pg_data_t *pgdat)
> >  {
> > -   int i, node, load;
> > +   static int node_order[MAX_NUMNODES];
> > +   int node, load, i = 0;
> 
> Emm, node_order can be large. The first distro config I checked
> indicated that this is 8K. I got hung up on that part and didn't look
> closely at the rest of the patch.

yes, that's why I kept it static. I just placed it into the function to
make it clear what the scope is.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Mel Gorman
On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
>  
>   zonelist = >node_zonelists[ZONELIST_NOFALLBACK];
> - j = build_zonelists_node(pgdat, zonelist, 0);
> - zonelist->_zonerefs[j].zone = NULL;
> - zonelist->_zonerefs[j].zone_idx = 0;
> + zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx);
> + zonelist->_zonerefs[zoneref_idx].zone = NULL;
> + zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
>  }
>  
>  /*
> @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
>   * exhausted, but results in overflowing to remote node while memory
>   * may still exist in local DMA zone.
>   */
> -static int node_order[MAX_NUMNODES];
>  
>  static void build_zonelists(pg_data_t *pgdat)
>  {
> - int i, node, load;
> + static int node_order[MAX_NUMNODES];
> + int node, load, i = 0;

Emm, node_order can be large. The first distro config I checked
indicated that this is 8K. I got hung up on that part and didn't look
closely at the rest of the patch.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Mel Gorman
On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
>  
>   zonelist = >node_zonelists[ZONELIST_NOFALLBACK];
> - j = build_zonelists_node(pgdat, zonelist, 0);
> - zonelist->_zonerefs[j].zone = NULL;
> - zonelist->_zonerefs[j].zone_idx = 0;
> + zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx);
> + zonelist->_zonerefs[zoneref_idx].zone = NULL;
> + zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
>  }
>  
>  /*
> @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
>   * exhausted, but results in overflowing to remote node while memory
>   * may still exist in local DMA zone.
>   */
> -static int node_order[MAX_NUMNODES];
>  
>  static void build_zonelists(pg_data_t *pgdat)
>  {
> - int i, node, load;
> + static int node_order[MAX_NUMNODES];
> + int node, load, i = 0;

Emm, node_order can be large. The first distro config I checked
indicated that this is 8K. I got hung up on that part and didn't look
closely at the rest of the patch.

-- 
Mel Gorman
SUSE Labs


[PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Michal Hocko
From: Michal Hocko 

build_zonelists gradually builds zonelists from the nearest to the most
distant node. As we do not know how many populated zones we will have in
each node we rely on the _zoneref to terminate initialized part of the
zonelist by a NULL zone. While this is functionally correct it is quite
suboptimal because we cannot allow updaters to race with zonelists
users because they could see an empty zonelist and fail the allocation
or hit the OOM killer in the worst case.

We can do much better, though. We can store the node ordering into an
already existing node_order array and then give this array to
build_zonelists_in_node_order and do the whole initialization at once.
zonelists consumers still might see halfway initialized state but that
should be much more tolerateable because the list will not be empty and
they would either see some zone twice or skip over some zone(s) in the
worst case which shouldn't lead to immediate failures.

This patch alone doesn't introduce any functional change yet, though, it
is merely a preparatory work for later changes.

Signed-off-by: Michal Hocko 
---
 mm/page_alloc.c | 42 ++
 1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 00e117922b3f..78bd62418380 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t 
*used_node_mask)
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
-static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
 {
-   int j;
struct zonelist *zonelist;
+   int i, zoneref_idx = 0;
 
zonelist = >node_zonelists[ZONELIST_FALLBACK];
-   for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
-   ;
-   j = build_zonelists_node(NODE_DATA(node), zonelist, j);
-   zonelist->_zonerefs[j].zone = NULL;
-   zonelist->_zonerefs[j].zone_idx = 0;
+
+   for (i = 0; i < MAX_NUMNODES; i++) {
+   pg_data_t *node = NODE_DATA(node_order[i]);
+
+   zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
+   }
+   zonelist->_zonerefs[zoneref_idx].zone = NULL;
+   zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
 }
 
 /*
@@ -4931,13 +4934,13 @@ static void build_zonelists_in_node_order(pg_data_t 
*pgdat, int node)
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-   int j;
struct zonelist *zonelist;
+   int zoneref_idx = 0;
 
zonelist = >node_zonelists[ZONELIST_NOFALLBACK];
-   j = build_zonelists_node(pgdat, zonelist, 0);
-   zonelist->_zonerefs[j].zone = NULL;
-   zonelist->_zonerefs[j].zone_idx = 0;
+   zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx);
+   zonelist->_zonerefs[zoneref_idx].zone = NULL;
+   zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
 }
 
 /*
@@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  * exhausted, but results in overflowing to remote node while memory
  * may still exist in local DMA zone.
  */
-static int node_order[MAX_NUMNODES];
 
 static void build_zonelists(pg_data_t *pgdat)
 {
-   int i, node, load;
+   static int node_order[MAX_NUMNODES];
+   int node, load, i = 0;
nodemask_t used_mask;
int local_node, prev_node;
-   struct zonelist *zonelist;
-
-   /* initialize zonelists */
-   for (i = 0; i < MAX_ZONELISTS; i++) {
-   zonelist = pgdat->node_zonelists + i;
-   zonelist->_zonerefs[0].zone = NULL;
-   zonelist->_zonerefs[0].zone_idx = 0;
-   }
 
/* NUMA-aware ordering of nodes */
local_node = pgdat->node_id;
@@ -4969,8 +4964,6 @@ static void build_zonelists(pg_data_t *pgdat)
nodes_clear(used_mask);
 
memset(node_order, 0, sizeof(node_order));
-   i = 0;
-
while ((node = find_next_best_node(local_node, _mask)) >= 0) {
/*
 * We don't want to pressure a particular node.
@@ -4981,11 +4974,12 @@ static void build_zonelists(pg_data_t *pgdat)
node_distance(local_node, prev_node))
node_load[node] = load;
 
+   node_order[i++] = node;
prev_node = node;
load--;
-   build_zonelists_in_node_order(pgdat, node);
}
 
+   build_zonelists_in_node_order(pgdat, node_order);
build_thisnode_zonelists(pgdat);
 }
 
-- 
2.11.0



[PATCH 6/9] mm, page_alloc: simplify zonelist initialization

2017-07-14 Thread Michal Hocko
From: Michal Hocko 

build_zonelists gradually builds zonelists from the nearest to the most
distant node. As we do not know how many populated zones we will have in
each node we rely on the _zoneref to terminate initialized part of the
zonelist by a NULL zone. While this is functionally correct it is quite
suboptimal because we cannot allow updaters to race with zonelists
users because they could see an empty zonelist and fail the allocation
or hit the OOM killer in the worst case.

We can do much better, though. We can store the node ordering into an
already existing node_order array and then give this array to
build_zonelists_in_node_order and do the whole initialization at once.
zonelists consumers still might see halfway initialized state but that
should be much more tolerateable because the list will not be empty and
they would either see some zone twice or skip over some zone(s) in the
worst case which shouldn't lead to immediate failures.

This patch alone doesn't introduce any functional change yet, though, it
is merely a preparatory work for later changes.

Signed-off-by: Michal Hocko 
---
 mm/page_alloc.c | 42 ++
 1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 00e117922b3f..78bd62418380 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t 
*used_node_mask)
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
-static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
 {
-   int j;
struct zonelist *zonelist;
+   int i, zoneref_idx = 0;
 
zonelist = >node_zonelists[ZONELIST_FALLBACK];
-   for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
-   ;
-   j = build_zonelists_node(NODE_DATA(node), zonelist, j);
-   zonelist->_zonerefs[j].zone = NULL;
-   zonelist->_zonerefs[j].zone_idx = 0;
+
+   for (i = 0; i < MAX_NUMNODES; i++) {
+   pg_data_t *node = NODE_DATA(node_order[i]);
+
+   zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
+   }
+   zonelist->_zonerefs[zoneref_idx].zone = NULL;
+   zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
 }
 
 /*
@@ -4931,13 +4934,13 @@ static void build_zonelists_in_node_order(pg_data_t 
*pgdat, int node)
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-   int j;
struct zonelist *zonelist;
+   int zoneref_idx = 0;
 
zonelist = >node_zonelists[ZONELIST_NOFALLBACK];
-   j = build_zonelists_node(pgdat, zonelist, 0);
-   zonelist->_zonerefs[j].zone = NULL;
-   zonelist->_zonerefs[j].zone_idx = 0;
+   zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx);
+   zonelist->_zonerefs[zoneref_idx].zone = NULL;
+   zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
 }
 
 /*
@@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  * exhausted, but results in overflowing to remote node while memory
  * may still exist in local DMA zone.
  */
-static int node_order[MAX_NUMNODES];
 
 static void build_zonelists(pg_data_t *pgdat)
 {
-   int i, node, load;
+   static int node_order[MAX_NUMNODES];
+   int node, load, i = 0;
nodemask_t used_mask;
int local_node, prev_node;
-   struct zonelist *zonelist;
-
-   /* initialize zonelists */
-   for (i = 0; i < MAX_ZONELISTS; i++) {
-   zonelist = pgdat->node_zonelists + i;
-   zonelist->_zonerefs[0].zone = NULL;
-   zonelist->_zonerefs[0].zone_idx = 0;
-   }
 
/* NUMA-aware ordering of nodes */
local_node = pgdat->node_id;
@@ -4969,8 +4964,6 @@ static void build_zonelists(pg_data_t *pgdat)
nodes_clear(used_mask);
 
memset(node_order, 0, sizeof(node_order));
-   i = 0;
-
while ((node = find_next_best_node(local_node, _mask)) >= 0) {
/*
 * We don't want to pressure a particular node.
@@ -4981,11 +4974,12 @@ static void build_zonelists(pg_data_t *pgdat)
node_distance(local_node, prev_node))
node_load[node] = load;
 
+   node_order[i++] = node;
prev_node = node;
load--;
-   build_zonelists_in_node_order(pgdat, node);
}
 
+   build_zonelists_in_node_order(pgdat, node_order);
build_thisnode_zonelists(pgdat);
 }
 
-- 
2.11.0