On Fri, Jun 27, 2014 at 02:57:00PM -0400, Johannes Weiner wrote:
> On Fri, Jun 27, 2014 at 09:14:39AM +0100, Mel Gorman wrote:
> > The fair zone allocation policy round-robins allocations between zones
> > within a node to avoid age inversion problems during reclaim. If the
> > first allocation fails, the batch counts is reset and a second attempt
> > made before entering the slow path.
> > 
> > One assumption made with this scheme is that batches expire at roughly the
> > same time and the resets each time are justified. This assumption does not
> > hold when zones reach their low watermark as the batches will be consumed
> > at uneven rates.  Allocation failure due to watermark depletion result in
> > additional zonelist scans for the reset and another watermark check before
> > hitting the slowpath.
> > 
> > This patch makes a number of changes that should reduce the overall cost
> > 
> > o Do not apply the fair zone policy to small zones such as DMA
> > o Abort the fair zone allocation policy once remote or small zones are
> >   encountered
> > o Use a simplier scan when resetting NR_ALLOC_BATCH
> > o Use a simple flag to identify depleted zones instead of accessing a
> >   potentially write-intensive cache line for counters
> > o Track zones who met the watermark but failed the NR_ALLOC_BATCH check
> >   to avoid doing a rescan of the zonelist when the counters are reset
> > 
> > On UMA machines, the effect is marginal. Even judging from system CPU
> > usage it's small for the tiobench test
> > 
> >           3.16.0-rc2  3.16.0-rc2
> >             checklow    fairzone
> > User          396.24      396.23
> > System        395.23      391.50
> > Elapsed      5182.65     5165.49
> 
> The next patch reports fairzone at 5182.86 again, so I'm guessing this
> patch is not actually reliably reducing the runtime to 5165.49, that's
> just runtime variation.
> 

The change is going to be marginal that it'll depend on a host of issues
including how much merging/splitting the page allocator does.

> > And the number of pages allocated from each zone is comparable
> > 
> >                             3.16.0-rc2  3.16.0-rc2
> >                               checklow    fairzone
> > DMA allocs                           0           0
> > DMA32 allocs                   7374217     7920241
> > Normal allocs                999277551   996568115
> 
> Wow, the DMA32 zone gets less than 1% of the allocations.  What are
> the zone sizes in this machine?
> 

        managed  3976
        managed  755409
        managed  1281601

> > @@ -1908,6 +1912,20 @@ static bool zone_allows_reclaim(struct zone 
> > *local_zone, struct zone *zone)
> >  
> >  #endif     /* CONFIG_NUMA */
> >  
> > +static void reset_alloc_batches(struct zone *preferred_zone)
> > +{
> > +   struct zone *zone = preferred_zone->zone_pgdat->node_zones;
> > +
> > +   do {
> > +           if (!zone_is_fair_depleted(zone))
> > +                   continue;
> > +           mod_zone_page_state(zone, NR_ALLOC_BATCH,
> > +                   high_wmark_pages(zone) - low_wmark_pages(zone) -
> > +                   atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
> > +           zone_clear_flag(zone, ZONE_FAIR_DEPLETED);
> > +   } while (zone++ != preferred_zone);
> 
> get_page_from_freelist() looks at the batches in zonelist order, why
> reset them in node_zones order?  Sure they are the same for all the
> cases we care about now, but it's a non-obvious cross-depedency...
> 

There is no functional difference at the end of the day.

> Does this even make a measurable difference?  It's a slow path after
> you fixed the excessive resets below.
> 

Again, very borderline but doing it this way avoids functional calls to
restart the zonelist walk.

> > @@ -2073,8 +2093,25 @@ this_zone_full:
> >              * for !PFMEMALLOC purposes.
> >              */
> >             page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
> > +           return page;
> > +   }
> >  
> > -   return page;
> > +   if ((alloc_flags & ALLOC_FAIR) && nr_fair_skipped) {
> > +           alloc_flags &= ~ALLOC_FAIR;
> > +           zonelist_rescan = true;
> > +           reset_alloc_batches(preferred_zone);
> > +   }
> 
> Yes, it happens quite often that get_page_from_freelist() fails due to
> watermarks while all the batches are fine, so resetting the batches
> and rescanning the zonelist is kind of a waste of time.

Yep

> However, in
> this situation, we are waiting for kswapd to make progress on the
> watermarks, and it doesn't really matter where we are wasting time...
> 

Or we can enter the slowpath sooner and get something useful done.

> In this micro benchmark that doesn't really do much besides allocating
> and reclaiming IO-less cache pages, the performance difference is less
> than 1% with this patch applied:
> 
> old: 19.835353264 seconds time elapsed                                        
>   ( +-  0.39% )
> new: 19.587258161 seconds time elapsed                                        
>   ( +-  0.34% )
> 
> But overall I agree with this particular change.
> 
> > @@ -2748,33 +2763,18 @@ retry_cpuset:
> >             goto out;
> >     classzone_idx = zonelist_zone_idx(preferred_zoneref);
> >  
> > +   if (zonelist->fair_enabled)
> > +           alloc_flags |= ALLOC_FAIR;
> >  #ifdef CONFIG_CMA
> >     if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> >             alloc_flags |= ALLOC_CMA;
> >  #endif
> > -retry:
> >     /* First allocation attempt */
> >     page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
> >                     zonelist, high_zoneidx, alloc_flags,
> >                     preferred_zone, classzone_idx, migratetype);
> >     if (unlikely(!page)) {
> >             /*
> > -            * The first pass makes sure allocations are spread
> > -            * fairly within the local node.  However, the local
> > -            * node might have free pages left after the fairness
> > -            * batches are exhausted, and remote zones haven't
> > -            * even been considered yet.  Try once more without
> > -            * fairness, and include remote zones now, before
> > -            * entering the slowpath and waking kswapd: prefer
> > -            * spilling to a remote zone over swapping locally.
> > -            */
> 
> I wrote this comment, so I don't know how helpful it is to others, but
> the retry logic in get_page_from_freelist() seems a little naked
> without any explanation.
> 

I'll preserve the comment

> > -           if (alloc_flags & ALLOC_FAIR) {
> > -                   reset_alloc_batches(zonelist, high_zoneidx,
> > -                                       preferred_zone);
> > -                   alloc_flags &= ~ALLOC_FAIR;
> > -                   goto retry;
> > -           }
> > -           /*
> >              * Runtime PM, block IO and its error handling path
> >              * can deadlock because I/O on the device might not
> >              * complete.
> > @@ -3287,10 +3287,18 @@ void show_free_areas(unsigned int filter)
> >     show_swap_cache_info();
> >  }
> >  
> > -static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
> > +static int zoneref_set_zone(pg_data_t *pgdat, struct zone *zone,
> > +                   struct zoneref *zoneref, struct zone *preferred_zone)
> >  {
> > +   int zone_type = zone_idx(zone);
> > +   bool fair_enabled = zone_local(zone, preferred_zone);
> > +   if (zone_type == 0 &&
> > +                   zone->managed_pages < (pgdat->node_present_pages >> 4))
> > +           fair_enabled = false;
> 
> This needs a comment.
> 

        /*
         * Do not count the lowest zone as of relevance to the fair zone
         * allocation policy if it's a small percentage of the node
         */

However, as I write this I'll look at getting rid of this entirely. It
made some sense when fair_eligible was tracked on a per-zone basis but
it's more complex than necessary.

> >     zoneref->zone = zone;
> > -   zoneref->zone_idx = zone_idx(zone);
> > +   zoneref->zone_idx = zone_type;
> > +   return fair_enabled;
> >  }
> >  
> >  /*
> > @@ -3303,17 +3311,26 @@ static int build_zonelists_node(pg_data_t *pgdat, 
> > struct zonelist *zonelist,
> >  {
> >     struct zone *zone;
> >     enum zone_type zone_type = MAX_NR_ZONES;
> > +   struct zone *preferred_zone = NULL;
> > +   int nr_fair = 0;
> >  
> >     do {
> >             zone_type--;
> >             zone = pgdat->node_zones + zone_type;
> >             if (populated_zone(zone)) {
> > -                   zoneref_set_zone(zone,
> > -                           &zonelist->_zonerefs[nr_zones++]);
> > +                   if (!preferred_zone)
> > +                           preferred_zone = zone;
> > +
> > +                   nr_fair += zoneref_set_zone(pgdat, zone,
> > +                           &zonelist->_zonerefs[nr_zones++],
> > +                           preferred_zone);
> 
> Passing preferred_zone to determine locality seems pointless when you
> walk the zones of a single node.
> 

True.

> And the return value of zoneref_set_zone() is fairly unexpected.
> 

How so?

> It's probably better to determine fair_enabled in the callsite, that
> would fix both problems, and write a separate helper that tests if a
> zone is eligible for fair treatment (type && managed_pages test).
> 

Are you thinking of putting that into the page allocator fast path? I'm
trying to take stuff out of there :/.

An alternative would be to look one zone ahead in the zonelist and check
zone_local there. That would get most of the intent and be cheaper than
a check on the zone or pgdat size.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to