On 11/21/18 11:14 AM, Mel Gorman wrote:
> The page allocator zone lists are iterated based on the watermarks
> of each zone which does not take anti-fragmentation into account. On
> x86, node 0 may have multiple zones while other nodes have one zone. A
> consequence is that tasks running on node 0 may fragment ZONE_NORMAL even
> though ZONE_DMA32 has plenty of free memory. This patch special cases
> the allocator fast path such that it'll try an allocation from a lower
> local zone before fragmenting a higher zone. In this case, stealing of
> pageblocks or orders larger than a pageblock are still allowed in the
> fast path as they are uninteresting from a fragmentation point of view.
> 
> This was evaluated using a benchmark designed to fragment memory
> before attempting THPs.  It's implemented in mmtests as the following
> configurations
> 
> configs/config-global-dhp__workload_thpfioscale
> configs/config-global-dhp__workload_thpfioscale-defrag
> configs/config-global-dhp__workload_thpfioscale-madvhugepage
> 
> e.g. from mmtests
> ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1
> 
> The broad details of the workload are as follows;
> 
> 1. Create an XFS filesystem (not specified in the configuration but done
>    as part of the testing for this patch)
> 2. Start 4 fio threads that write a number of 64K files inefficiently.
>    Inefficiently means that files are created on first access and not
>    created in advance (fio parameter create_on_open=1) and fallocate
>    is not used (fallocate=none). With multiple IO issuers this creates
>    a mix of slab and page cache allocations over time. The total size
>    of the files is 150% physical memory so that the slabs and page cache
>    pages get mixed
> 3. Warm up a number of fio read-only threads accessing the same files
>    created in step 2. This part runs for the same length of time it
>    took to create the files. It'll fault back in old data and further
>    interleave slab and page cache allocations. As it's now low on
>    memory due to step 2, fragmentation occurs as pageblocks get
>    stolen.
> 4. While step 3 is still running, start a process that tries to allocate
>    75% of memory as huge pages with a number of threads. The number of
>    threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
>    threads contending with fio, any other threads or forcing cross-NUMA
>    scheduling. Note that the test has not been used on a machine with less
>    than 8 cores. The benchmark records whether huge pages were allocated
>    and what the fault latency was in microseconds
> 5. Measure the number of events potentially causing external fragmentation,
>    the fault latency and the huge page allocation success rate.
> 6. Cleanup
> 
> Note that due to the use of IO and page cache, this benchmark is not
> suitable for running on large machines where the time to fragment memory
> may be excessive. Also note that while this is one mix that generates
> fragmentation, it is not the only such mix. Differences in workload that
> are more slab-intensive, or whether SLUB is used with high-order pages,
> may yield different results.
> 
> When the page allocator fragments memory, it records the event using the
> mm_page_alloc_extfrag event. If the fallback_order is smaller than a
> pageblock order (order-9 on 64-bit x86) then it's considered an event
> that may cause external fragmentation issues in the future. Hence, the
> primary metric here is the number of external fragmentation events that
> occur with order < 9. The secondary metrics are allocation latency and the
> huge page allocation success rate, but note that differences in latency and
> success rate can themselves affect the number of external fragmentation
> events, which is why they are secondary metrics.
> 
> 1-socket Skylake machine
> config-global-dhp__workload_thpfioscale XFS (no special madvise)
> 4 fio threads, 1 THP allocating thread
> --------------------------------------
> 
> 4.20-rc1 extfrag events < order 9:  1023463
> 4.20-rc1+patch:                      358574 (65% reduction)

It would be nice to also have a breakdown of what kind of extfrag events
these are, mainly to distinguish the number of unmovable/reclaimable
allocations fragmenting movable pageblocks, as those are the most critical
ones.
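
For postprocessing the captured tracepoint data, something along these lines
could produce that breakdown (rough sketch only; the field names are what I
expect from the mm_page_alloc_extfrag trace format and the migratetype
numbering is an assumption, so double-check both against the running kernel):

/*
 * Classify mm_page_alloc_extfrag trace lines into "critical" events
 * (unmovable/reclaimable allocation stealing from a movable pageblock)
 * vs. everything else. Reads trace output on stdin.
 */
#include <stdio.h>
#include <string.h>

#define ASSUMED_MIGRATE_MOVABLE 1	/* assumption, verify against the kernel */

int main(void)
{
	char line[1024];
	long critical = 0, other = 0;

	while (fgets(line, sizeof(line), stdin)) {
		int fallback_order, alloc_mt, fallback_mt;
		char *p;

		p = strstr(line, "fallback_order=");
		if (!p || sscanf(p, "fallback_order=%d", &fallback_order) != 1)
			continue;
		p = strstr(line, "alloc_migratetype=");
		if (!p || sscanf(p, "alloc_migratetype=%d", &alloc_mt) != 1)
			continue;
		p = strstr(line, "fallback_migratetype=");
		if (!p || sscanf(p, "fallback_migratetype=%d", &fallback_mt) != 1)
			continue;

		/* only sub-pageblock fallbacks count as extfrag events */
		if (fallback_order >= 9)
			continue;

		if (alloc_mt != ASSUMED_MIGRATE_MOVABLE &&
		    fallback_mt == ASSUMED_MIGRATE_MOVABLE)
			critical++;
		else
			other++;
	}

	printf("critical (un/reclaimable stealing movable): %ld\n", critical);
	printf("other extfrag events:                       %ld\n", other);
	return 0;
}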

...

> @@ -3253,6 +3268,36 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>  }
>  #endif       /* CONFIG_NUMA */
>  
> +#ifdef CONFIG_ZONE_DMA32
> +/*
> + * The restriction on ZONE_DMA32 as being a suitable zone to use to avoid
> + * fragmentation is subtle. If the preferred zone was HIGHMEM then
> + * premature use of a lower zone may cause lowmem pressure problems that
> + * are worse than fragmentation. If the next zone is ZONE_DMA then it is
> + * probably too small. It only makes sense to spread allocations to avoid
> + * fragmentation between the Normal and DMA32 zones.
> + */
> +static inline unsigned int alloc_flags_nofragment(struct zone *zone)
> +{
> +     if (zone_idx(zone) != ZONE_NORMAL)
> +             return 0;
> +
> +     /*
> +      * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
> +      * the pointer is within zone->zone_pgdat->node_zones[].
> +      */
> +     if (!populated_zone(--zone))
> +             return 0;

How about something along the lines of:
BUILD_BUG_ON(ZONE_NORMAL - ZONE_DMA32 != 1);

Also, isn't this perhaps going against your earlier efforts to speed up
the fast path? Maybe it would be faster to just stick a bool into
struct zone, set true once during zonelist build, only for a
ZONE_NORMAL with ZONE_DMA32 in the same node.
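
To make that concrete, a rough sketch of what I have in mind (the field and
helper names below are made up for illustration, not from the patch; without
CONFIG_ZONE_DMA32 the flag simply stays false, matching your #else stub):

/* New field in struct zone, shown out of context: */
	/* ZONE_NORMAL with a populated ZONE_DMA32 on the same node */
	bool			spread_dma32;

/* Set once per node when the zonelists are (re)built: */
static void zone_set_spread_dma32(pg_data_t *pgdat)
{
#ifdef CONFIG_ZONE_DMA32
	BUILD_BUG_ON(ZONE_NORMAL - ZONE_DMA32 != 1);
	pgdat->node_zones[ZONE_NORMAL].spread_dma32 =
			populated_zone(&pgdat->node_zones[ZONE_DMA32]);
#endif
}

/* The fast path check then collapses to a flag test: */
static inline unsigned int alloc_flags_nofragment(struct zone *zone)
{
	return zone->spread_dma32 ? ALLOC_NOFRAGMENT : 0;
}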

> +
> +     return ALLOC_NOFRAGMENT;
> +}
> +#else
> +static inline unsigned int alloc_flags_nofragment(struct zone *zone)
> +{
> +     return 0;
> +}
> +#endif
> +
>  /*
>   * get_page_from_freelist goes through the zonelist trying to allocate
>   * a page.
> @@ -3264,11 +3309,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>       struct zoneref *z = ac->preferred_zoneref;
>       struct zone *zone;
>       struct pglist_data *last_pgdat_dirty_limit = NULL;
> +     bool no_fallback;
>  
> +retry:

Ugh, I think 'z = ac->preferred_zoneref' should be moved here under the
retry label. AFAICS without that, the preference for the local node over
fragmentation avoidance doesn't work?
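
Something like this (rough sketch of the suggested change, not tested):

	struct zoneref *z;
	struct zone *zone;
	struct pglist_data *last_pgdat_dirty_limit = NULL;
	bool no_fallback;

retry:
	/*
	 * Restart the zonelist walk from the preferred zone on every retry,
	 * so that after clearing ALLOC_NOFRAGMENT the local zones are
	 * reconsidered before any remote fallback.
	 */
	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
	z = ac->preferred_zoneref;
	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
							ac->nodemask) {
		/* ... loop body unchanged from the patch ... */
	}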

>       /*
>        * Scan zonelist, looking for a zone with enough free.
>        * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
>        */
> +     no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
>       for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
>                                                               ac->nodemask) {
>               struct page *page;
> @@ -3307,6 +3355,21 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>                       }
>               }
>  
> +             if (no_fallback) {
> +                     int local_nid;
> +
> +                     /*
> +                      * If moving to a remote node, retry but allow
> +                      * fragmenting fallbacks. Locality is more important
> +                      * than fragmentation avoidance.
> +                      */
> +                     local_nid = zone_to_nid(ac->preferred_zoneref->zone);
> +                     if (zone_to_nid(zone) != local_nid) {
> +                             alloc_flags &= ~ALLOC_NOFRAGMENT;
> +                             goto retry;
> +                     }
> +             }
> +
>               mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
>               if (!zone_watermark_fast(zone, order, mark,
>                                      ac_classzone_idx(ac), alloc_flags)) {
> @@ -3374,6 +3437,15 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>               }
>       }
>  
> +     /*
> +      * It's possible on a UMA machine to get through all zones that are
> +      * fragmented. If avoiding fragmentation, reset and try again
> +      */
> +     if (no_fallback) {
> +             alloc_flags &= ~ALLOC_NOFRAGMENT;
> +             goto retry;
> +     }
> +
>       return NULL;
>  }
>  
> @@ -4369,6 +4441,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>  
>       finalise_ac(gfp_mask, &ac);
>  
> +     /*
> +      * Forbid the first pass from falling back to types that fragment
> +      * memory until all local zones are considered.
> +      */
> +     alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone);
> +
>       /* First allocation attempt */
>       page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
>       if (likely(page))
> 
