Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On 07/21/2017 04:39 PM, Michal Hocko wrote: > From: Michal Hocko> > build_zonelists gradually builds zonelists from the nearest to the most > distant node. As we do not know how many populated zones we will have in > each node we rely on the _zoneref to terminate initialized part of the > zonelist by a NULL zone. While this is functionally correct it is quite > suboptimal because we cannot allow updaters to race with zonelists > users because they could see an empty zonelist and fail the allocation > or hit the OOM killer in the worst case. > > We can do much better, though. We can store the node ordering into an > already existing node_order array and then give this array to > build_zonelists_in_node_order and do the whole initialization at once. > zonelists consumers still might see halfway initialized state but that > should be much more tolerateable because the list will not be empty and > they would either see some zone twice or skip over some zone(s) in the > worst case which shouldn't lead to immediate failures. > > While at it let's simplify build_zonelists_node which is rather > confusing now. It gets an index into the zoneref array and returns > the updated index for the next iteration. Let's rename the function > to build_zonerefs_node to better reflect its purpose and give it > zoneref array to update. The function doesn't the index anymore. It > just returns the number of added zones so that the caller can advance > the zonered array start for the next update. > > This patch alone doesn't introduce any functional change yet, though, it > is merely a preparatory work for later changes. > > Changes since v1 > - build_zonelists_node -> build_zonerefs_node and operate directly on > zonerefs array rather than play tricks with index into the array. > - give build_zonelists_in_node_order nr_nodes to not iterate over all > MAX_NUMNODES as per Mel > > Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On 07/21/2017 04:39 PM, Michal Hocko wrote: > From: Michal Hocko > > build_zonelists gradually builds zonelists from the nearest to the most > distant node. As we do not know how many populated zones we will have in > each node we rely on the _zoneref to terminate initialized part of the > zonelist by a NULL zone. While this is functionally correct it is quite > suboptimal because we cannot allow updaters to race with zonelists > users because they could see an empty zonelist and fail the allocation > or hit the OOM killer in the worst case. > > We can do much better, though. We can store the node ordering into an > already existing node_order array and then give this array to > build_zonelists_in_node_order and do the whole initialization at once. > zonelists consumers still might see halfway initialized state but that > should be much more tolerateable because the list will not be empty and > they would either see some zone twice or skip over some zone(s) in the > worst case which shouldn't lead to immediate failures. > > While at it let's simplify build_zonelists_node which is rather > confusing now. It gets an index into the zoneref array and returns > the updated index for the next iteration. Let's rename the function > to build_zonerefs_node to better reflect its purpose and give it > zoneref array to update. The function doesn't the index anymore. It > just returns the number of added zones so that the caller can advance > the zonered array start for the next update. > > This patch alone doesn't introduce any functional change yet, though, it > is merely a preparatory work for later changes. > > Changes since v1 > - build_zonelists_node -> build_zonerefs_node and operate directly on > zonerefs array rather than play tricks with index into the array. > - give build_zonelists_in_node_order nr_nodes to not iterate over all > MAX_NUMNODES as per Mel > > Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka
[PATCH 6/9] mm, page_alloc: simplify zonelist initialization
From: Michal Hockobuild_zonelists gradually builds zonelists from the nearest to the most distant node. As we do not know how many populated zones we will have in each node we rely on the _zoneref to terminate initialized part of the zonelist by a NULL zone. While this is functionally correct it is quite suboptimal because we cannot allow updaters to race with zonelists users because they could see an empty zonelist and fail the allocation or hit the OOM killer in the worst case. We can do much better, though. We can store the node ordering into an already existing node_order array and then give this array to build_zonelists_in_node_order and do the whole initialization at once. zonelists consumers still might see halfway initialized state but that should be much more tolerateable because the list will not be empty and they would either see some zone twice or skip over some zone(s) in the worst case which shouldn't lead to immediate failures. While at it let's simplify build_zonelists_node which is rather confusing now. It gets an index into the zoneref array and returns the updated index for the next iteration. Let's rename the function to build_zonerefs_node to better reflect its purpose and give it zoneref array to update. The function doesn't the index anymore. It just returns the number of added zones so that the caller can advance the zonered array start for the next update. This patch alone doesn't introduce any functional change yet, though, it is merely a preparatory work for later changes. Changes since v1 - build_zonelists_node -> build_zonerefs_node and operate directly on zonerefs array rather than play tricks with index into the array. - give build_zonelists_in_node_order nr_nodes to not iterate over all MAX_NUMNODES as per Mel Signed-off-by: Michal Hocko --- mm/page_alloc.c | 81 + 1 file changed, 41 insertions(+), 40 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6e9aca464f66..0d78dc5a708f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4772,18 +4772,17 @@ static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) * * Add all populated zones of a node to the zonelist. */ -static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, - int nr_zones) +static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs) { struct zone *zone; enum zone_type zone_type = MAX_NR_ZONES; + int nr_zones = 0; do { zone_type--; zone = pgdat->node_zones + zone_type; if (managed_zone(zone)) { - zoneref_set_zone(zone, - >_zonerefs[nr_zones++]); + zoneref_set_zone(zone, [nr_zones++]); check_highest_zone(zone_type); } } while (zone_type); @@ -4913,17 +4912,24 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) * This results in maximum locality--normal zone overflows into local * DMA zone, if any--but risks exhausting DMA zone. */ -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, + unsigned nr_nodes) { - int j; - struct zonelist *zonelist; + struct zoneref *zonerefs; + int i; + + zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; + + for (i = 0; i < nr_nodes; i++) { + int nr_zones; - zonelist = >node_zonelists[ZONELIST_FALLBACK]; - for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) - ; - j = build_zonelists_node(NODE_DATA(node), zonelist, j); - zonelist->_zonerefs[j].zone = NULL; - zonelist->_zonerefs[j].zone_idx = 0; + pg_data_t *node = NODE_DATA(node_order[i]); + + nr_zones = build_zonerefs_node(node, zonerefs); + zonerefs += nr_zones; + } + zonerefs->zone = NULL; + zonerefs->zone_idx = 0; } /* @@ -4931,13 +4937,14 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) */ static void build_thisnode_zonelists(pg_data_t *pgdat) { - int j; - struct zonelist *zonelist; + struct zoneref *zonerefs; + int nr_zones; - zonelist = >node_zonelists[ZONELIST_NOFALLBACK]; - j = build_zonelists_node(pgdat, zonelist, 0); - zonelist->_zonerefs[j].zone = NULL; - zonelist->_zonerefs[j].zone_idx = 0; + zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs; + nr_zones = build_zonerefs_node(pgdat, zonerefs); + zonerefs += nr_zones; + zonerefs->zone = NULL; + zonerefs->zone_idx = 0; } /* @@ -4946,21 +4953,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat) * exhausted, but results in
[PATCH 6/9] mm, page_alloc: simplify zonelist initialization
From: Michal Hocko build_zonelists gradually builds zonelists from the nearest to the most distant node. As we do not know how many populated zones we will have in each node we rely on the _zoneref to terminate initialized part of the zonelist by a NULL zone. While this is functionally correct it is quite suboptimal because we cannot allow updaters to race with zonelists users because they could see an empty zonelist and fail the allocation or hit the OOM killer in the worst case. We can do much better, though. We can store the node ordering into an already existing node_order array and then give this array to build_zonelists_in_node_order and do the whole initialization at once. zonelists consumers still might see halfway initialized state but that should be much more tolerateable because the list will not be empty and they would either see some zone twice or skip over some zone(s) in the worst case which shouldn't lead to immediate failures. While at it let's simplify build_zonelists_node which is rather confusing now. It gets an index into the zoneref array and returns the updated index for the next iteration. Let's rename the function to build_zonerefs_node to better reflect its purpose and give it zoneref array to update. The function doesn't the index anymore. It just returns the number of added zones so that the caller can advance the zonered array start for the next update. This patch alone doesn't introduce any functional change yet, though, it is merely a preparatory work for later changes. Changes since v1 - build_zonelists_node -> build_zonerefs_node and operate directly on zonerefs array rather than play tricks with index into the array. - give build_zonelists_in_node_order nr_nodes to not iterate over all MAX_NUMNODES as per Mel Signed-off-by: Michal Hocko --- mm/page_alloc.c | 81 + 1 file changed, 41 insertions(+), 40 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6e9aca464f66..0d78dc5a708f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4772,18 +4772,17 @@ static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) * * Add all populated zones of a node to the zonelist. */ -static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, - int nr_zones) +static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs) { struct zone *zone; enum zone_type zone_type = MAX_NR_ZONES; + int nr_zones = 0; do { zone_type--; zone = pgdat->node_zones + zone_type; if (managed_zone(zone)) { - zoneref_set_zone(zone, - >_zonerefs[nr_zones++]); + zoneref_set_zone(zone, [nr_zones++]); check_highest_zone(zone_type); } } while (zone_type); @@ -4913,17 +4912,24 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) * This results in maximum locality--normal zone overflows into local * DMA zone, if any--but risks exhausting DMA zone. */ -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, + unsigned nr_nodes) { - int j; - struct zonelist *zonelist; + struct zoneref *zonerefs; + int i; + + zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; + + for (i = 0; i < nr_nodes; i++) { + int nr_zones; - zonelist = >node_zonelists[ZONELIST_FALLBACK]; - for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) - ; - j = build_zonelists_node(NODE_DATA(node), zonelist, j); - zonelist->_zonerefs[j].zone = NULL; - zonelist->_zonerefs[j].zone_idx = 0; + pg_data_t *node = NODE_DATA(node_order[i]); + + nr_zones = build_zonerefs_node(node, zonerefs); + zonerefs += nr_zones; + } + zonerefs->zone = NULL; + zonerefs->zone_idx = 0; } /* @@ -4931,13 +4937,14 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) */ static void build_thisnode_zonelists(pg_data_t *pgdat) { - int j; - struct zonelist *zonelist; + struct zoneref *zonerefs; + int nr_zones; - zonelist = >node_zonelists[ZONELIST_NOFALLBACK]; - j = build_zonelists_node(pgdat, zonelist, 0); - zonelist->_zonerefs[j].zone = NULL; - zonelist->_zonerefs[j].zone_idx = 0; + zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs; + nr_zones = build_zonerefs_node(pgdat, zonerefs); + zonerefs += nr_zones; + zonerefs->zone = NULL; + zonerefs->zone_idx = 0; } /* @@ -4946,21 +4953,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat) * exhausted, but results in overflowing to remote node while memory *
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Thu 20-07-17 08:55:42, Vlastimil Babka wrote: > On 07/14/2017 10:00 AM, Michal Hocko wrote: > > From: Michal Hocko> > > > build_zonelists gradually builds zonelists from the nearest to the most > > distant node. As we do not know how many populated zones we will have in > > each node we rely on the _zoneref to terminate initialized part of the > > zonelist by a NULL zone. While this is functionally correct it is quite > > suboptimal because we cannot allow updaters to race with zonelists > > users because they could see an empty zonelist and fail the allocation > > or hit the OOM killer in the worst case. > > > > We can do much better, though. We can store the node ordering into an > > already existing node_order array and then give this array to > > build_zonelists_in_node_order and do the whole initialization at once. > > zonelists consumers still might see halfway initialized state but that > > should be much more tolerateable because the list will not be empty and > > they would either see some zone twice or skip over some zone(s) in the > > worst case which shouldn't lead to immediate failures. > > > > This patch alone doesn't introduce any functional change yet, though, it > > is merely a preparatory work for later changes. > > > > Signed-off-by: Michal Hocko > > I've collected the fold-ups from this thread and looked at the result as > single patch. Sems OK, just two things: > - please rename variable "i" in build_zonelists() to e.g. "nr_nodes" > - the !CONFIG_NUMA variant of build_zonelists() won't build, because it > doesn't declare nr_zones variable Thanks! I will fold this in. --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c0d3e8eeb150..6f192405e469 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4957,7 +4957,7 @@ static void build_thisnode_zonelists(pg_data_t *pgdat) static void build_zonelists(pg_data_t *pgdat) { static int node_order[MAX_NUMNODES]; - int node, load, i = 0; + int node, load, nr_nodes = 0; nodemask_t used_mask; int local_node, prev_node; @@ -4978,12 +4978,12 @@ static void build_zonelists(pg_data_t *pgdat) node_distance(local_node, prev_node)) node_load[node] = load; - node_order[i++] = node; + node_order[nr_nodes++] = node; prev_node = node; load--; } - build_zonelists_in_node_order(pgdat, node_order, i); + build_zonelists_in_node_order(pgdat, node_order, nr_nodes); build_thisnode_zonelists(pgdat); } @@ -5013,10 +5013,11 @@ static void build_zonelists(pg_data_t *pgdat) { int node, local_node; struct zoneref *zonerefs; + int nr_zones; local_node = pgdat->node_id; - zonrefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; + zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; nr_zones = build_zonerefs_node(pgdat, zonerefs); zonerefs += nr_zones; -- Michal Hocko SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Thu 20-07-17 08:55:42, Vlastimil Babka wrote: > On 07/14/2017 10:00 AM, Michal Hocko wrote: > > From: Michal Hocko > > > > build_zonelists gradually builds zonelists from the nearest to the most > > distant node. As we do not know how many populated zones we will have in > > each node we rely on the _zoneref to terminate initialized part of the > > zonelist by a NULL zone. While this is functionally correct it is quite > > suboptimal because we cannot allow updaters to race with zonelists > > users because they could see an empty zonelist and fail the allocation > > or hit the OOM killer in the worst case. > > > > We can do much better, though. We can store the node ordering into an > > already existing node_order array and then give this array to > > build_zonelists_in_node_order and do the whole initialization at once. > > zonelists consumers still might see halfway initialized state but that > > should be much more tolerateable because the list will not be empty and > > they would either see some zone twice or skip over some zone(s) in the > > worst case which shouldn't lead to immediate failures. > > > > This patch alone doesn't introduce any functional change yet, though, it > > is merely a preparatory work for later changes. > > > > Signed-off-by: Michal Hocko > > I've collected the fold-ups from this thread and looked at the result as > single patch. Sems OK, just two things: > - please rename variable "i" in build_zonelists() to e.g. "nr_nodes" > - the !CONFIG_NUMA variant of build_zonelists() won't build, because it > doesn't declare nr_zones variable Thanks! I will fold this in. --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c0d3e8eeb150..6f192405e469 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4957,7 +4957,7 @@ static void build_thisnode_zonelists(pg_data_t *pgdat) static void build_zonelists(pg_data_t *pgdat) { static int node_order[MAX_NUMNODES]; - int node, load, i = 0; + int node, load, nr_nodes = 0; nodemask_t used_mask; int local_node, prev_node; @@ -4978,12 +4978,12 @@ static void build_zonelists(pg_data_t *pgdat) node_distance(local_node, prev_node)) node_load[node] = load; - node_order[i++] = node; + node_order[nr_nodes++] = node; prev_node = node; load--; } - build_zonelists_in_node_order(pgdat, node_order, i); + build_zonelists_in_node_order(pgdat, node_order, nr_nodes); build_thisnode_zonelists(pgdat); } @@ -5013,10 +5013,11 @@ static void build_zonelists(pg_data_t *pgdat) { int node, local_node; struct zoneref *zonerefs; + int nr_zones; local_node = pgdat->node_id; - zonrefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; + zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; nr_zones = build_zonerefs_node(pgdat, zonerefs); zonerefs += nr_zones; -- Michal Hocko SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On 07/14/2017 10:00 AM, Michal Hocko wrote: > From: Michal Hocko> > build_zonelists gradually builds zonelists from the nearest to the most > distant node. As we do not know how many populated zones we will have in > each node we rely on the _zoneref to terminate initialized part of the > zonelist by a NULL zone. While this is functionally correct it is quite > suboptimal because we cannot allow updaters to race with zonelists > users because they could see an empty zonelist and fail the allocation > or hit the OOM killer in the worst case. > > We can do much better, though. We can store the node ordering into an > already existing node_order array and then give this array to > build_zonelists_in_node_order and do the whole initialization at once. > zonelists consumers still might see halfway initialized state but that > should be much more tolerateable because the list will not be empty and > they would either see some zone twice or skip over some zone(s) in the > worst case which shouldn't lead to immediate failures. > > This patch alone doesn't introduce any functional change yet, though, it > is merely a preparatory work for later changes. > > Signed-off-by: Michal Hocko I've collected the fold-ups from this thread and looked at the result as single patch. Sems OK, just two things: - please rename variable "i" in build_zonelists() to e.g. "nr_nodes" - the !CONFIG_NUMA variant of build_zonelists() won't build, because it doesn't declare nr_zones variable
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On 07/14/2017 10:00 AM, Michal Hocko wrote: > From: Michal Hocko > > build_zonelists gradually builds zonelists from the nearest to the most > distant node. As we do not know how many populated zones we will have in > each node we rely on the _zoneref to terminate initialized part of the > zonelist by a NULL zone. While this is functionally correct it is quite > suboptimal because we cannot allow updaters to race with zonelists > users because they could see an empty zonelist and fail the allocation > or hit the OOM killer in the worst case. > > We can do much better, though. We can store the node ordering into an > already existing node_order array and then give this array to > build_zonelists_in_node_order and do the whole initialization at once. > zonelists consumers still might see halfway initialized state but that > should be much more tolerateable because the list will not be empty and > they would either see some zone twice or skip over some zone(s) in the > worst case which shouldn't lead to immediate failures. > > This patch alone doesn't introduce any functional change yet, though, it > is merely a preparatory work for later changes. > > Signed-off-by: Michal Hocko I've collected the fold-ups from this thread and looked at the result as single patch. Sems OK, just two things: - please rename variable "i" in build_zonelists() to e.g. "nr_nodes" - the !CONFIG_NUMA variant of build_zonelists() won't build, because it doesn't declare nr_zones variable
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Mon 17-07-17 09:58:04, Mel Gorman wrote: > On Mon, Jul 17, 2017 at 10:19:42AM +0200, Michal Hocko wrote: > > On Mon 17-07-17 09:07:23, Mel Gorman wrote: > > > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote: > > > > On Fri 14-07-17 15:18:23, Mel Gorman wrote: > > > > > Fairly sure that's not what you meant. > > > > > > > > > > > > > > > > pg_data_t *node = NODE_DATA(node_order[i]); > > > > > > > > > > > > - zoneref_idx = build_zonelists_node(node, zonelist, > > > > > > zoneref_idx); > > > > > > + nr_zones = build_zonelists_node(node, zonelist, > > > > > > nr_zones); > > > > > > > > > > I meant converting build_zonelists_node and passing in _zones and > > > > > returning false when an empty node is encountered. In this context, > > > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones > > > > > in > > > > > build_zonelists_node as well. > > > > > > > > hmm, why don't we rather make it zonerefs based then. Something > > > > like the following? > > > > > > Works for me. > > > > Should I fold it to the patch or make it a patch on its own? > > I have no strong feelings either way but if it was folded then the > overall naming should be easier to follow (at least for me). OK, I will fold it in then. Unless there are more issues/feedback to address I will repost the full series in few days. Thanks! -- Michal Hocko SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Mon 17-07-17 09:58:04, Mel Gorman wrote: > On Mon, Jul 17, 2017 at 10:19:42AM +0200, Michal Hocko wrote: > > On Mon 17-07-17 09:07:23, Mel Gorman wrote: > > > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote: > > > > On Fri 14-07-17 15:18:23, Mel Gorman wrote: > > > > > Fairly sure that's not what you meant. > > > > > > > > > > > > > > > > pg_data_t *node = NODE_DATA(node_order[i]); > > > > > > > > > > > > - zoneref_idx = build_zonelists_node(node, zonelist, > > > > > > zoneref_idx); > > > > > > + nr_zones = build_zonelists_node(node, zonelist, > > > > > > nr_zones); > > > > > > > > > > I meant converting build_zonelists_node and passing in _zones and > > > > > returning false when an empty node is encountered. In this context, > > > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones > > > > > in > > > > > build_zonelists_node as well. > > > > > > > > hmm, why don't we rather make it zonerefs based then. Something > > > > like the following? > > > > > > Works for me. > > > > Should I fold it to the patch or make it a patch on its own? > > I have no strong feelings either way but if it was folded then the > overall naming should be easier to follow (at least for me). OK, I will fold it in then. Unless there are more issues/feedback to address I will repost the full series in few days. Thanks! -- Michal Hocko SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Mon, Jul 17, 2017 at 10:19:42AM +0200, Michal Hocko wrote: > On Mon 17-07-17 09:07:23, Mel Gorman wrote: > > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote: > > > On Fri 14-07-17 15:18:23, Mel Gorman wrote: > > > > Fairly sure that's not what you meant. > > > > > > > > > > > > > pg_data_t *node = NODE_DATA(node_order[i]); > > > > > > > > > > - zoneref_idx = build_zonelists_node(node, zonelist, > > > > > zoneref_idx); > > > > > + nr_zones = build_zonelists_node(node, zonelist, > > > > > nr_zones); > > > > > > > > I meant converting build_zonelists_node and passing in _zones and > > > > returning false when an empty node is encountered. In this context, > > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in > > > > build_zonelists_node as well. > > > > > > hmm, why don't we rather make it zonerefs based then. Something > > > like the following? > > > > Works for me. > > Should I fold it to the patch or make it a patch on its own? I have no strong feelings either way but if it was folded then the overall naming should be easier to follow (at least for me). -- Mel Gorman SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Mon, Jul 17, 2017 at 10:19:42AM +0200, Michal Hocko wrote: > On Mon 17-07-17 09:07:23, Mel Gorman wrote: > > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote: > > > On Fri 14-07-17 15:18:23, Mel Gorman wrote: > > > > Fairly sure that's not what you meant. > > > > > > > > > > > > > pg_data_t *node = NODE_DATA(node_order[i]); > > > > > > > > > > - zoneref_idx = build_zonelists_node(node, zonelist, > > > > > zoneref_idx); > > > > > + nr_zones = build_zonelists_node(node, zonelist, > > > > > nr_zones); > > > > > > > > I meant converting build_zonelists_node and passing in _zones and > > > > returning false when an empty node is encountered. In this context, > > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in > > > > build_zonelists_node as well. > > > > > > hmm, why don't we rather make it zonerefs based then. Something > > > like the following? > > > > Works for me. > > Should I fold it to the patch or make it a patch on its own? I have no strong feelings either way but if it was folded then the overall naming should be easier to follow (at least for me). -- Mel Gorman SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Mon 17-07-17 09:07:23, Mel Gorman wrote: > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote: > > On Fri 14-07-17 15:18:23, Mel Gorman wrote: > > > Fairly sure that's not what you meant. > > > > > > > > > > pg_data_t *node = NODE_DATA(node_order[i]); > > > > > > > > - zoneref_idx = build_zonelists_node(node, zonelist, > > > > zoneref_idx); > > > > + nr_zones = build_zonelists_node(node, zonelist, > > > > nr_zones); > > > > > > I meant converting build_zonelists_node and passing in _zones and > > > returning false when an empty node is encountered. In this context, > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in > > > build_zonelists_node as well. > > > > hmm, why don't we rather make it zonerefs based then. Something > > like the following? > > Works for me. Should I fold it to the patch or make it a patch on its own? -- Michal Hocko SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Mon 17-07-17 09:07:23, Mel Gorman wrote: > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote: > > On Fri 14-07-17 15:18:23, Mel Gorman wrote: > > > Fairly sure that's not what you meant. > > > > > > > > > > pg_data_t *node = NODE_DATA(node_order[i]); > > > > > > > > - zoneref_idx = build_zonelists_node(node, zonelist, > > > > zoneref_idx); > > > > + nr_zones = build_zonelists_node(node, zonelist, > > > > nr_zones); > > > > > > I meant converting build_zonelists_node and passing in _zones and > > > returning false when an empty node is encountered. In this context, > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in > > > build_zonelists_node as well. > > > > hmm, why don't we rather make it zonerefs based then. Something > > like the following? > > Works for me. Should I fold it to the patch or make it a patch on its own? -- Michal Hocko SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote: > On Fri 14-07-17 15:18:23, Mel Gorman wrote: > > Fairly sure that's not what you meant. > > > > > > > pg_data_t *node = NODE_DATA(node_order[i]); > > > > > > - zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); > > > + nr_zones = build_zonelists_node(node, zonelist, nr_zones); > > > > I meant converting build_zonelists_node and passing in _zones and > > returning false when an empty node is encountered. In this context, > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in > > build_zonelists_node as well. > > hmm, why don't we rather make it zonerefs based then. Something > like the following? Works for me. -- Mel Gorman SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote: > On Fri 14-07-17 15:18:23, Mel Gorman wrote: > > Fairly sure that's not what you meant. > > > > > > > pg_data_t *node = NODE_DATA(node_order[i]); > > > > > > - zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); > > > + nr_zones = build_zonelists_node(node, zonelist, nr_zones); > > > > I meant converting build_zonelists_node and passing in _zones and > > returning false when an empty node is encountered. In this context, > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in > > build_zonelists_node as well. > > hmm, why don't we rather make it zonerefs based then. Something > like the following? Works for me. -- Mel Gorman SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri 14-07-17 15:18:23, Mel Gorman wrote: > On Fri, Jul 14, 2017 at 03:02:42PM +0200, Michal Hocko wrote: [...] > > What do you think about this on top? > > --- > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 49bade7ff049..3b98524c04ec 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t > > *used_node_mask) > > * This results in maximum locality--normal zone overflows into local > > * DMA zone, if any--but risks exhausting DMA zone. > > */ > > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int > > *node_order) > > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int > > *node_order, > > + unsigned nr_nodes) > > { > > struct zonelist *zonelist; > > - int i, zoneref_idx = 0; > > + int i, nr_zones = 0; > > > > zonelist = >node_zonelists[ZONELIST_FALLBACK]; > > > > - for (i = 0; i < MAX_NUMNODES; i++) { > > + for (i = 0; i < nr_nodes; i++) { > > The first iteration is then -- for (i = 0; i < 0; i++) find_next_best_node always returns at least one node (the local one) so I believe that nr_nodes should never be 0. > Fairly sure that's not what you meant. > > > > pg_data_t *node = NODE_DATA(node_order[i]); > > > > - zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); > > + nr_zones = build_zonelists_node(node, zonelist, nr_zones); > > I meant converting build_zonelists_node and passing in _zones and > returning false when an empty node is encountered. In this context, > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in > build_zonelists_node as well. hmm, why don't we rather make it zonerefs based then. Something like the following? --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3b98524c04ec..01e67e629351 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4772,18 +4772,17 @@ static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) * * Add all populated zones of a node to the zonelist. */ -static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, - int nr_zones) +static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs) { struct zone *zone; enum zone_type zone_type = MAX_NR_ZONES; + int nr_zones = 0; do { zone_type--; zone = pgdat->node_zones + zone_type; if (managed_zone(zone)) { - zoneref_set_zone(zone, - >_zonerefs[nr_zones++]); + zoneref_set_zone(zone, [nr_zones++]); check_highest_zone(zone_type); } } while (zone_type); @@ -4916,18 +4915,21 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, unsigned nr_nodes) { - struct zonelist *zonelist; - int i, nr_zones = 0; + struct zoneref *zonerefs; + int i; - zonelist = >node_zonelists[ZONELIST_FALLBACK]; + zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; for (i = 0; i < nr_nodes; i++) { + int nr_zones; + pg_data_t *node = NODE_DATA(node_order[i]); - nr_zones = build_zonelists_node(node, zonelist, nr_zones); + nr_zones = build_zonerefs_node(node, zonerefs); + zonerefs += nr_zones; } - zonelist->_zonerefs[nr_zones].zone = NULL; - zonelist->_zonerefs[nr_zones].zone_idx = 0; + zonerefs->zone = NULL; + zonerefs->zone_idx = 0; } /* @@ -4935,13 +4937,14 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, */ static void build_thisnode_zonelists(pg_data_t *pgdat) { - struct zonelist *zonelist; - int nr_zones = 0; + struct zoneref *zonerefs; + int nr_zones; - zonelist = >node_zonelists[ZONELIST_NOFALLBACK]; - nr_zones = build_zonelists_node(pgdat, zonelist, nr_zones); - zonelist->_zonerefs[nr_zones].zone = NULL; - zonelist->_zonerefs[nr_zones].zone_idx = 0; + zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs; + nr_zones = build_zonerefs_node(pgdat, zonerefs); + zonerefs += nr_zones; + zonerefs->zone = NULL; + zonerefs->zone_idx = 0; } /* @@ -5009,13 +5012,13 @@ static void setup_min_slab_ratio(void); static void build_zonelists(pg_data_t *pgdat) { int node, local_node; - enum zone_type j; - struct zonelist *zonelist; + struct zoneref *zonerefs; local_node = pgdat->node_id; - zonelist = >node_zonelists[ZONELIST_FALLBACK]; - j = build_zonelists_node(pgdat, zonelist, 0); + zonrefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; + nr_zones =
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri 14-07-17 15:18:23, Mel Gorman wrote: > On Fri, Jul 14, 2017 at 03:02:42PM +0200, Michal Hocko wrote: [...] > > What do you think about this on top? > > --- > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 49bade7ff049..3b98524c04ec 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t > > *used_node_mask) > > * This results in maximum locality--normal zone overflows into local > > * DMA zone, if any--but risks exhausting DMA zone. > > */ > > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int > > *node_order) > > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int > > *node_order, > > + unsigned nr_nodes) > > { > > struct zonelist *zonelist; > > - int i, zoneref_idx = 0; > > + int i, nr_zones = 0; > > > > zonelist = >node_zonelists[ZONELIST_FALLBACK]; > > > > - for (i = 0; i < MAX_NUMNODES; i++) { > > + for (i = 0; i < nr_nodes; i++) { > > The first iteration is then -- for (i = 0; i < 0; i++) find_next_best_node always returns at least one node (the local one) so I believe that nr_nodes should never be 0. > Fairly sure that's not what you meant. > > > > pg_data_t *node = NODE_DATA(node_order[i]); > > > > - zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); > > + nr_zones = build_zonelists_node(node, zonelist, nr_zones); > > I meant converting build_zonelists_node and passing in _zones and > returning false when an empty node is encountered. In this context, > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in > build_zonelists_node as well. hmm, why don't we rather make it zonerefs based then. Something like the following? --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3b98524c04ec..01e67e629351 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4772,18 +4772,17 @@ static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) * * Add all populated zones of a node to the zonelist. */ -static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist, - int nr_zones) +static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs) { struct zone *zone; enum zone_type zone_type = MAX_NR_ZONES; + int nr_zones = 0; do { zone_type--; zone = pgdat->node_zones + zone_type; if (managed_zone(zone)) { - zoneref_set_zone(zone, - >_zonerefs[nr_zones++]); + zoneref_set_zone(zone, [nr_zones++]); check_highest_zone(zone_type); } } while (zone_type); @@ -4916,18 +4915,21 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, unsigned nr_nodes) { - struct zonelist *zonelist; - int i, nr_zones = 0; + struct zoneref *zonerefs; + int i; - zonelist = >node_zonelists[ZONELIST_FALLBACK]; + zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; for (i = 0; i < nr_nodes; i++) { + int nr_zones; + pg_data_t *node = NODE_DATA(node_order[i]); - nr_zones = build_zonelists_node(node, zonelist, nr_zones); + nr_zones = build_zonerefs_node(node, zonerefs); + zonerefs += nr_zones; } - zonelist->_zonerefs[nr_zones].zone = NULL; - zonelist->_zonerefs[nr_zones].zone_idx = 0; + zonerefs->zone = NULL; + zonerefs->zone_idx = 0; } /* @@ -4935,13 +4937,14 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, */ static void build_thisnode_zonelists(pg_data_t *pgdat) { - struct zonelist *zonelist; - int nr_zones = 0; + struct zoneref *zonerefs; + int nr_zones; - zonelist = >node_zonelists[ZONELIST_NOFALLBACK]; - nr_zones = build_zonelists_node(pgdat, zonelist, nr_zones); - zonelist->_zonerefs[nr_zones].zone = NULL; - zonelist->_zonerefs[nr_zones].zone_idx = 0; + zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs; + nr_zones = build_zonerefs_node(pgdat, zonerefs); + zonerefs += nr_zones; + zonerefs->zone = NULL; + zonerefs->zone_idx = 0; } /* @@ -5009,13 +5012,13 @@ static void setup_min_slab_ratio(void); static void build_zonelists(pg_data_t *pgdat) { int node, local_node; - enum zone_type j; - struct zonelist *zonelist; + struct zoneref *zonerefs; local_node = pgdat->node_id; - zonelist = >node_zonelists[ZONELIST_FALLBACK]; - j = build_zonelists_node(pgdat, zonelist, 0); + zonrefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs; + nr_zones =
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri, Jul 14, 2017 at 03:02:42PM +0200, Michal Hocko wrote: > > It *might* be safer given the next patch to zero out the remainder of > > the _zonerefs to that there is no combination of node add/remove that has > > an iterator working with a semi-valid _zoneref which is beyond the last > > correct value. It *should* be safe as the very last entry will always > > be null but if you don't zero it out, it is possible for iterators to be > > working beyond the "end" of the zonelist for a short window. > > yes that is true but there will always be terminating NULL zone and I > found that acceptable. It is basically the same thing as accessing an > empty zone or a zone twice. Or do you think this is absolutely necessary > to handle? > I don't think it's absolutely necessary. While you could construct some odd behaviour for iterators currently past the end of the list, they would eventually encounter a NULL. > > Otherwise think it's ok including my stupid comment about node_order > > stack usage. > > What do you think about this on top? > --- > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 49bade7ff049..3b98524c04ec 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t > *used_node_mask) > * This results in maximum locality--normal zone overflows into local > * DMA zone, if any--but risks exhausting DMA zone. > */ > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order) > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, > + unsigned nr_nodes) > { > struct zonelist *zonelist; > - int i, zoneref_idx = 0; > + int i, nr_zones = 0; > > zonelist = >node_zonelists[ZONELIST_FALLBACK]; > > - for (i = 0; i < MAX_NUMNODES; i++) { > + for (i = 0; i < nr_nodes; i++) { The first iteration is then -- for (i = 0; i < 0; i++) Fairly sure that's not what you meant. > pg_data_t *node = NODE_DATA(node_order[i]); > > - zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); > + nr_zones = build_zonelists_node(node, zonelist, nr_zones); I meant converting build_zonelists_node and passing in _zones and returning false when an empty node is encountered. In this context, it's also not about zones, it really is nr_zonerefs. Rename nr_zones in build_zonelists_node as well. -- Mel Gorman SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri, Jul 14, 2017 at 03:02:42PM +0200, Michal Hocko wrote: > > It *might* be safer given the next patch to zero out the remainder of > > the _zonerefs to that there is no combination of node add/remove that has > > an iterator working with a semi-valid _zoneref which is beyond the last > > correct value. It *should* be safe as the very last entry will always > > be null but if you don't zero it out, it is possible for iterators to be > > working beyond the "end" of the zonelist for a short window. > > yes that is true but there will always be terminating NULL zone and I > found that acceptable. It is basically the same thing as accessing an > empty zone or a zone twice. Or do you think this is absolutely necessary > to handle? > I don't think it's absolutely necessary. While you could construct some odd behaviour for iterators currently past the end of the list, they would eventually encounter a NULL. > > Otherwise think it's ok including my stupid comment about node_order > > stack usage. > > What do you think about this on top? > --- > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 49bade7ff049..3b98524c04ec 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t > *used_node_mask) > * This results in maximum locality--normal zone overflows into local > * DMA zone, if any--but risks exhausting DMA zone. > */ > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order) > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, > + unsigned nr_nodes) > { > struct zonelist *zonelist; > - int i, zoneref_idx = 0; > + int i, nr_zones = 0; > > zonelist = >node_zonelists[ZONELIST_FALLBACK]; > > - for (i = 0; i < MAX_NUMNODES; i++) { > + for (i = 0; i < nr_nodes; i++) { The first iteration is then -- for (i = 0; i < 0; i++) Fairly sure that's not what you meant. > pg_data_t *node = NODE_DATA(node_order[i]); > > - zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); > + nr_zones = build_zonelists_node(node, zonelist, nr_zones); I meant converting build_zonelists_node and passing in _zones and returning false when an empty node is encountered. In this context, it's also not about zones, it really is nr_zonerefs. Rename nr_zones in build_zonelists_node as well. -- Mel Gorman SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri 14-07-17 13:46:46, Mel Gorman wrote: > On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote: > > From: Michal Hocko> > > > build_zonelists gradually builds zonelists from the nearest to the most > > distant node. As we do not know how many populated zones we will have in > > each node we rely on the _zoneref to terminate initialized part of the > > zonelist by a NULL zone. While this is functionally correct it is quite > > suboptimal because we cannot allow updaters to race with zonelists > > users because they could see an empty zonelist and fail the allocation > > or hit the OOM killer in the worst case. > > > > We can do much better, though. We can store the node ordering into an > > already existing node_order array and then give this array to > > build_zonelists_in_node_order and do the whole initialization at once. > > zonelists consumers still might see halfway initialized state but that > > should be much more tolerateable because the list will not be empty and > > they would either see some zone twice or skip over some zone(s) in the > > worst case which shouldn't lead to immediate failures. > > > > This patch alone doesn't introduce any functional change yet, though, it > > is merely a preparatory work for later changes. > > > > Signed-off-by: Michal Hocko > > --- > > mm/page_alloc.c | 42 ++ > > 1 file changed, 18 insertions(+), 24 deletions(-) > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 00e117922b3f..78bd62418380 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t > > *used_node_mask) > > * This results in maximum locality--normal zone overflows into local > > * DMA zone, if any--but risks exhausting DMA zone. > > */ > > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) > > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int > > *node_order) > > { > > - int j; > > struct zonelist *zonelist; > > + int i, zoneref_idx = 0; > > > > zonelist = >node_zonelists[ZONELIST_FALLBACK]; > > - for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) > > - ; > > - j = build_zonelists_node(NODE_DATA(node), zonelist, j); > > - zonelist->_zonerefs[j].zone = NULL; > > - zonelist->_zonerefs[j].zone_idx = 0; > > + > > + for (i = 0; i < MAX_NUMNODES; i++) { > > + pg_data_t *node = NODE_DATA(node_order[i]); > > + > > + zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); > > + } > > The naming here is weird to say the least and makes this a lot more > confusing than it needs to be. Primarily, it's because the zoneref_idx > parameter gets renamed to nr_zones in build_zonelists_node where it's > nothing to do with the number of zones at all. you are right. I just wanted to get rid of `j' and didn't realize nr_zones would fit much better. > It also iterates for longer than it needs to. MAX_NUMNODES can be a > large value of mostly empty nodes but it happily goes through them > anyway. Pass zoneref_idx in as a pointer that is updated by the function > and use the return value to break the loop when an empty node is > encountered? > > > + zonelist->_zonerefs[zoneref_idx].zone = NULL; > > + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; > > } > > > > It *might* be safer given the next patch to zero out the remainder of > the _zonerefs to that there is no combination of node add/remove that has > an iterator working with a semi-valid _zoneref which is beyond the last > correct value. It *should* be safe as the very last entry will always > be null but if you don't zero it out, it is possible for iterators to be > working beyond the "end" of the zonelist for a short window. yes that is true but there will always be terminating NULL zone and I found that acceptable. It is basically the same thing as accessing an empty zone or a zone twice. Or do you think this is absolutely necessary to handle? > Otherwise think it's ok including my stupid comment about node_order > stack usage. What do you think about this on top? --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 49bade7ff049..3b98524c04ec 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) * This results in maximum locality--normal zone overflows into local * DMA zone, if any--but risks exhausting DMA zone. */ -static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order) +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, + unsigned nr_nodes) { struct zonelist *zonelist; - int i, zoneref_idx = 0; + int i, nr_zones = 0; zonelist = >node_zonelists[ZONELIST_FALLBACK]; - for (i = 0; i < MAX_NUMNODES; i++) { + for (i = 0; i < nr_nodes; i++) {
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri 14-07-17 13:46:46, Mel Gorman wrote: > On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote: > > From: Michal Hocko > > > > build_zonelists gradually builds zonelists from the nearest to the most > > distant node. As we do not know how many populated zones we will have in > > each node we rely on the _zoneref to terminate initialized part of the > > zonelist by a NULL zone. While this is functionally correct it is quite > > suboptimal because we cannot allow updaters to race with zonelists > > users because they could see an empty zonelist and fail the allocation > > or hit the OOM killer in the worst case. > > > > We can do much better, though. We can store the node ordering into an > > already existing node_order array and then give this array to > > build_zonelists_in_node_order and do the whole initialization at once. > > zonelists consumers still might see halfway initialized state but that > > should be much more tolerateable because the list will not be empty and > > they would either see some zone twice or skip over some zone(s) in the > > worst case which shouldn't lead to immediate failures. > > > > This patch alone doesn't introduce any functional change yet, though, it > > is merely a preparatory work for later changes. > > > > Signed-off-by: Michal Hocko > > --- > > mm/page_alloc.c | 42 ++ > > 1 file changed, 18 insertions(+), 24 deletions(-) > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 00e117922b3f..78bd62418380 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t > > *used_node_mask) > > * This results in maximum locality--normal zone overflows into local > > * DMA zone, if any--but risks exhausting DMA zone. > > */ > > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) > > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int > > *node_order) > > { > > - int j; > > struct zonelist *zonelist; > > + int i, zoneref_idx = 0; > > > > zonelist = >node_zonelists[ZONELIST_FALLBACK]; > > - for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) > > - ; > > - j = build_zonelists_node(NODE_DATA(node), zonelist, j); > > - zonelist->_zonerefs[j].zone = NULL; > > - zonelist->_zonerefs[j].zone_idx = 0; > > + > > + for (i = 0; i < MAX_NUMNODES; i++) { > > + pg_data_t *node = NODE_DATA(node_order[i]); > > + > > + zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); > > + } > > The naming here is weird to say the least and makes this a lot more > confusing than it needs to be. Primarily, it's because the zoneref_idx > parameter gets renamed to nr_zones in build_zonelists_node where it's > nothing to do with the number of zones at all. you are right. I just wanted to get rid of `j' and didn't realize nr_zones would fit much better. > It also iterates for longer than it needs to. MAX_NUMNODES can be a > large value of mostly empty nodes but it happily goes through them > anyway. Pass zoneref_idx in as a pointer that is updated by the function > and use the return value to break the loop when an empty node is > encountered? > > > + zonelist->_zonerefs[zoneref_idx].zone = NULL; > > + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; > > } > > > > It *might* be safer given the next patch to zero out the remainder of > the _zonerefs to that there is no combination of node add/remove that has > an iterator working with a semi-valid _zoneref which is beyond the last > correct value. It *should* be safe as the very last entry will always > be null but if you don't zero it out, it is possible for iterators to be > working beyond the "end" of the zonelist for a short window. yes that is true but there will always be terminating NULL zone and I found that acceptable. It is basically the same thing as accessing an empty zone or a zone twice. Or do you think this is absolutely necessary to handle? > Otherwise think it's ok including my stupid comment about node_order > stack usage. What do you think about this on top? --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 49bade7ff049..3b98524c04ec 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) * This results in maximum locality--normal zone overflows into local * DMA zone, if any--but risks exhausting DMA zone. */ -static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order) +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order, + unsigned nr_nodes) { struct zonelist *zonelist; - int i, zoneref_idx = 0; + int i, nr_zones = 0; zonelist = >node_zonelists[ZONELIST_FALLBACK]; - for (i = 0; i < MAX_NUMNODES; i++) { + for (i = 0; i < nr_nodes; i++) { pg_data_t *node =
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote: > From: Michal Hocko> > build_zonelists gradually builds zonelists from the nearest to the most > distant node. As we do not know how many populated zones we will have in > each node we rely on the _zoneref to terminate initialized part of the > zonelist by a NULL zone. While this is functionally correct it is quite > suboptimal because we cannot allow updaters to race with zonelists > users because they could see an empty zonelist and fail the allocation > or hit the OOM killer in the worst case. > > We can do much better, though. We can store the node ordering into an > already existing node_order array and then give this array to > build_zonelists_in_node_order and do the whole initialization at once. > zonelists consumers still might see halfway initialized state but that > should be much more tolerateable because the list will not be empty and > they would either see some zone twice or skip over some zone(s) in the > worst case which shouldn't lead to immediate failures. > > This patch alone doesn't introduce any functional change yet, though, it > is merely a preparatory work for later changes. > > Signed-off-by: Michal Hocko > --- > mm/page_alloc.c | 42 ++ > 1 file changed, 18 insertions(+), 24 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 00e117922b3f..78bd62418380 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t > *used_node_mask) > * This results in maximum locality--normal zone overflows into local > * DMA zone, if any--but risks exhausting DMA zone. > */ > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order) > { > - int j; > struct zonelist *zonelist; > + int i, zoneref_idx = 0; > > zonelist = >node_zonelists[ZONELIST_FALLBACK]; > - for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) > - ; > - j = build_zonelists_node(NODE_DATA(node), zonelist, j); > - zonelist->_zonerefs[j].zone = NULL; > - zonelist->_zonerefs[j].zone_idx = 0; > + > + for (i = 0; i < MAX_NUMNODES; i++) { > + pg_data_t *node = NODE_DATA(node_order[i]); > + > + zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); > + } The naming here is weird to say the least and makes this a lot more confusing than it needs to be. Primarily, it's because the zoneref_idx parameter gets renamed to nr_zones in build_zonelists_node where it's nothing to do with the number of zones at all. It also iterates for longer than it needs to. MAX_NUMNODES can be a large value of mostly empty nodes but it happily goes through them anyway. Pass zoneref_idx in as a pointer that is updated by the function and use the return value to break the loop when an empty node is encountered? > + zonelist->_zonerefs[zoneref_idx].zone = NULL; > + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; > } > It *might* be safer given the next patch to zero out the remainder of the _zonerefs to that there is no combination of node add/remove that has an iterator working with a semi-valid _zoneref which is beyond the last correct value. It *should* be safe as the very last entry will always be null but if you don't zero it out, it is possible for iterators to be working beyond the "end" of the zonelist for a short window. Otherwise think it's ok including my stupid comment about node_order stack usage. -- Mel Gorman SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote: > From: Michal Hocko > > build_zonelists gradually builds zonelists from the nearest to the most > distant node. As we do not know how many populated zones we will have in > each node we rely on the _zoneref to terminate initialized part of the > zonelist by a NULL zone. While this is functionally correct it is quite > suboptimal because we cannot allow updaters to race with zonelists > users because they could see an empty zonelist and fail the allocation > or hit the OOM killer in the worst case. > > We can do much better, though. We can store the node ordering into an > already existing node_order array and then give this array to > build_zonelists_in_node_order and do the whole initialization at once. > zonelists consumers still might see halfway initialized state but that > should be much more tolerateable because the list will not be empty and > they would either see some zone twice or skip over some zone(s) in the > worst case which shouldn't lead to immediate failures. > > This patch alone doesn't introduce any functional change yet, though, it > is merely a preparatory work for later changes. > > Signed-off-by: Michal Hocko > --- > mm/page_alloc.c | 42 ++ > 1 file changed, 18 insertions(+), 24 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 00e117922b3f..78bd62418380 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t > *used_node_mask) > * This results in maximum locality--normal zone overflows into local > * DMA zone, if any--but risks exhausting DMA zone. > */ > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order) > { > - int j; > struct zonelist *zonelist; > + int i, zoneref_idx = 0; > > zonelist = >node_zonelists[ZONELIST_FALLBACK]; > - for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) > - ; > - j = build_zonelists_node(NODE_DATA(node), zonelist, j); > - zonelist->_zonerefs[j].zone = NULL; > - zonelist->_zonerefs[j].zone_idx = 0; > + > + for (i = 0; i < MAX_NUMNODES; i++) { > + pg_data_t *node = NODE_DATA(node_order[i]); > + > + zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); > + } The naming here is weird to say the least and makes this a lot more confusing than it needs to be. Primarily, it's because the zoneref_idx parameter gets renamed to nr_zones in build_zonelists_node where it's nothing to do with the number of zones at all. It also iterates for longer than it needs to. MAX_NUMNODES can be a large value of mostly empty nodes but it happily goes through them anyway. Pass zoneref_idx in as a pointer that is updated by the function and use the return value to break the loop when an empty node is encountered? > + zonelist->_zonerefs[zoneref_idx].zone = NULL; > + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; > } > It *might* be safer given the next patch to zero out the remainder of the _zonerefs to that there is no combination of node add/remove that has an iterator working with a semi-valid _zoneref which is beyond the last correct value. It *should* be safe as the very last entry will always be null but if you don't zero it out, it is possible for iterators to be working beyond the "end" of the zonelist for a short window. Otherwise think it's ok including my stupid comment about node_order stack usage. -- Mel Gorman SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri 14-07-17 10:55:34, Mel Gorman wrote: > On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote: > > > > zonelist = >node_zonelists[ZONELIST_NOFALLBACK]; > > - j = build_zonelists_node(pgdat, zonelist, 0); > > - zonelist->_zonerefs[j].zone = NULL; > > - zonelist->_zonerefs[j].zone_idx = 0; > > + zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx); > > + zonelist->_zonerefs[zoneref_idx].zone = NULL; > > + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; > > } > > > > /* > > @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t > > *pgdat) > > * exhausted, but results in overflowing to remote node while memory > > * may still exist in local DMA zone. > > */ > > -static int node_order[MAX_NUMNODES]; > > > > static void build_zonelists(pg_data_t *pgdat) > > { > > - int i, node, load; > > + static int node_order[MAX_NUMNODES]; > > + int node, load, i = 0; > > Emm, node_order can be large. The first distro config I checked > indicated that this is 8K. I got hung up on that part and didn't look > closely at the rest of the patch. yes, that's why I kept it static. I just placed it into the function to make it clear what the scope is. -- Michal Hocko SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri 14-07-17 10:55:34, Mel Gorman wrote: > On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote: > > > > zonelist = >node_zonelists[ZONELIST_NOFALLBACK]; > > - j = build_zonelists_node(pgdat, zonelist, 0); > > - zonelist->_zonerefs[j].zone = NULL; > > - zonelist->_zonerefs[j].zone_idx = 0; > > + zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx); > > + zonelist->_zonerefs[zoneref_idx].zone = NULL; > > + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; > > } > > > > /* > > @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t > > *pgdat) > > * exhausted, but results in overflowing to remote node while memory > > * may still exist in local DMA zone. > > */ > > -static int node_order[MAX_NUMNODES]; > > > > static void build_zonelists(pg_data_t *pgdat) > > { > > - int i, node, load; > > + static int node_order[MAX_NUMNODES]; > > + int node, load, i = 0; > > Emm, node_order can be large. The first distro config I checked > indicated that this is 8K. I got hung up on that part and didn't look > closely at the rest of the patch. yes, that's why I kept it static. I just placed it into the function to make it clear what the scope is. -- Michal Hocko SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote: > > zonelist = >node_zonelists[ZONELIST_NOFALLBACK]; > - j = build_zonelists_node(pgdat, zonelist, 0); > - zonelist->_zonerefs[j].zone = NULL; > - zonelist->_zonerefs[j].zone_idx = 0; > + zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx); > + zonelist->_zonerefs[zoneref_idx].zone = NULL; > + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; > } > > /* > @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat) > * exhausted, but results in overflowing to remote node while memory > * may still exist in local DMA zone. > */ > -static int node_order[MAX_NUMNODES]; > > static void build_zonelists(pg_data_t *pgdat) > { > - int i, node, load; > + static int node_order[MAX_NUMNODES]; > + int node, load, i = 0; Emm, node_order can be large. The first distro config I checked indicated that this is 8K. I got hung up on that part and didn't look closely at the rest of the patch. -- Mel Gorman SUSE Labs
Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote: > > zonelist = >node_zonelists[ZONELIST_NOFALLBACK]; > - j = build_zonelists_node(pgdat, zonelist, 0); > - zonelist->_zonerefs[j].zone = NULL; > - zonelist->_zonerefs[j].zone_idx = 0; > + zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx); > + zonelist->_zonerefs[zoneref_idx].zone = NULL; > + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; > } > > /* > @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat) > * exhausted, but results in overflowing to remote node while memory > * may still exist in local DMA zone. > */ > -static int node_order[MAX_NUMNODES]; > > static void build_zonelists(pg_data_t *pgdat) > { > - int i, node, load; > + static int node_order[MAX_NUMNODES]; > + int node, load, i = 0; Emm, node_order can be large. The first distro config I checked indicated that this is 8K. I got hung up on that part and didn't look closely at the rest of the patch. -- Mel Gorman SUSE Labs
[PATCH 6/9] mm, page_alloc: simplify zonelist initialization
From: Michal Hockobuild_zonelists gradually builds zonelists from the nearest to the most distant node. As we do not know how many populated zones we will have in each node we rely on the _zoneref to terminate initialized part of the zonelist by a NULL zone. While this is functionally correct it is quite suboptimal because we cannot allow updaters to race with zonelists users because they could see an empty zonelist and fail the allocation or hit the OOM killer in the worst case. We can do much better, though. We can store the node ordering into an already existing node_order array and then give this array to build_zonelists_in_node_order and do the whole initialization at once. zonelists consumers still might see halfway initialized state but that should be much more tolerateable because the list will not be empty and they would either see some zone twice or skip over some zone(s) in the worst case which shouldn't lead to immediate failures. This patch alone doesn't introduce any functional change yet, though, it is merely a preparatory work for later changes. Signed-off-by: Michal Hocko --- mm/page_alloc.c | 42 ++ 1 file changed, 18 insertions(+), 24 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 00e117922b3f..78bd62418380 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) * This results in maximum locality--normal zone overflows into local * DMA zone, if any--but risks exhausting DMA zone. */ -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order) { - int j; struct zonelist *zonelist; + int i, zoneref_idx = 0; zonelist = >node_zonelists[ZONELIST_FALLBACK]; - for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) - ; - j = build_zonelists_node(NODE_DATA(node), zonelist, j); - zonelist->_zonerefs[j].zone = NULL; - zonelist->_zonerefs[j].zone_idx = 0; + + for (i = 0; i < MAX_NUMNODES; i++) { + pg_data_t *node = NODE_DATA(node_order[i]); + + zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); + } + zonelist->_zonerefs[zoneref_idx].zone = NULL; + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; } /* @@ -4931,13 +4934,13 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) */ static void build_thisnode_zonelists(pg_data_t *pgdat) { - int j; struct zonelist *zonelist; + int zoneref_idx = 0; zonelist = >node_zonelists[ZONELIST_NOFALLBACK]; - j = build_zonelists_node(pgdat, zonelist, 0); - zonelist->_zonerefs[j].zone = NULL; - zonelist->_zonerefs[j].zone_idx = 0; + zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx); + zonelist->_zonerefs[zoneref_idx].zone = NULL; + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; } /* @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat) * exhausted, but results in overflowing to remote node while memory * may still exist in local DMA zone. */ -static int node_order[MAX_NUMNODES]; static void build_zonelists(pg_data_t *pgdat) { - int i, node, load; + static int node_order[MAX_NUMNODES]; + int node, load, i = 0; nodemask_t used_mask; int local_node, prev_node; - struct zonelist *zonelist; - - /* initialize zonelists */ - for (i = 0; i < MAX_ZONELISTS; i++) { - zonelist = pgdat->node_zonelists + i; - zonelist->_zonerefs[0].zone = NULL; - zonelist->_zonerefs[0].zone_idx = 0; - } /* NUMA-aware ordering of nodes */ local_node = pgdat->node_id; @@ -4969,8 +4964,6 @@ static void build_zonelists(pg_data_t *pgdat) nodes_clear(used_mask); memset(node_order, 0, sizeof(node_order)); - i = 0; - while ((node = find_next_best_node(local_node, _mask)) >= 0) { /* * We don't want to pressure a particular node. @@ -4981,11 +4974,12 @@ static void build_zonelists(pg_data_t *pgdat) node_distance(local_node, prev_node)) node_load[node] = load; + node_order[i++] = node; prev_node = node; load--; - build_zonelists_in_node_order(pgdat, node); } + build_zonelists_in_node_order(pgdat, node_order); build_thisnode_zonelists(pgdat); } -- 2.11.0
[PATCH 6/9] mm, page_alloc: simplify zonelist initialization
From: Michal Hocko build_zonelists gradually builds zonelists from the nearest to the most distant node. As we do not know how many populated zones we will have in each node we rely on the _zoneref to terminate initialized part of the zonelist by a NULL zone. While this is functionally correct it is quite suboptimal because we cannot allow updaters to race with zonelists users because they could see an empty zonelist and fail the allocation or hit the OOM killer in the worst case. We can do much better, though. We can store the node ordering into an already existing node_order array and then give this array to build_zonelists_in_node_order and do the whole initialization at once. zonelists consumers still might see halfway initialized state but that should be much more tolerateable because the list will not be empty and they would either see some zone twice or skip over some zone(s) in the worst case which shouldn't lead to immediate failures. This patch alone doesn't introduce any functional change yet, though, it is merely a preparatory work for later changes. Signed-off-by: Michal Hocko --- mm/page_alloc.c | 42 ++ 1 file changed, 18 insertions(+), 24 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 00e117922b3f..78bd62418380 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask) * This results in maximum locality--normal zone overflows into local * DMA zone, if any--but risks exhausting DMA zone. */ -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order) { - int j; struct zonelist *zonelist; + int i, zoneref_idx = 0; zonelist = >node_zonelists[ZONELIST_FALLBACK]; - for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++) - ; - j = build_zonelists_node(NODE_DATA(node), zonelist, j); - zonelist->_zonerefs[j].zone = NULL; - zonelist->_zonerefs[j].zone_idx = 0; + + for (i = 0; i < MAX_NUMNODES; i++) { + pg_data_t *node = NODE_DATA(node_order[i]); + + zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx); + } + zonelist->_zonerefs[zoneref_idx].zone = NULL; + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; } /* @@ -4931,13 +4934,13 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int node) */ static void build_thisnode_zonelists(pg_data_t *pgdat) { - int j; struct zonelist *zonelist; + int zoneref_idx = 0; zonelist = >node_zonelists[ZONELIST_NOFALLBACK]; - j = build_zonelists_node(pgdat, zonelist, 0); - zonelist->_zonerefs[j].zone = NULL; - zonelist->_zonerefs[j].zone_idx = 0; + zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx); + zonelist->_zonerefs[zoneref_idx].zone = NULL; + zonelist->_zonerefs[zoneref_idx].zone_idx = 0; } /* @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat) * exhausted, but results in overflowing to remote node while memory * may still exist in local DMA zone. */ -static int node_order[MAX_NUMNODES]; static void build_zonelists(pg_data_t *pgdat) { - int i, node, load; + static int node_order[MAX_NUMNODES]; + int node, load, i = 0; nodemask_t used_mask; int local_node, prev_node; - struct zonelist *zonelist; - - /* initialize zonelists */ - for (i = 0; i < MAX_ZONELISTS; i++) { - zonelist = pgdat->node_zonelists + i; - zonelist->_zonerefs[0].zone = NULL; - zonelist->_zonerefs[0].zone_idx = 0; - } /* NUMA-aware ordering of nodes */ local_node = pgdat->node_id; @@ -4969,8 +4964,6 @@ static void build_zonelists(pg_data_t *pgdat) nodes_clear(used_mask); memset(node_order, 0, sizeof(node_order)); - i = 0; - while ((node = find_next_best_node(local_node, _mask)) >= 0) { /* * We don't want to pressure a particular node. @@ -4981,11 +4974,12 @@ static void build_zonelists(pg_data_t *pgdat) node_distance(local_node, prev_node)) node_load[node] = load; + node_order[i++] = node; prev_node = node; load--; - build_zonelists_in_node_order(pgdat, node); } + build_zonelists_in_node_order(pgdat, node_order); build_thisnode_zonelists(pgdat); } -- 2.11.0