On Thu, Jul 07, 2016 at 10:20:39AM +0900, Joonsoo Kim wrote:
> > @@ -3249,9 +3249,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
> >  
> >     prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> >  
> > +   /*
> > +    * If kswapd has not been woken recently, then kswapd goes fully
> > +    * to sleep. kcompactd may still need to wake if the original
> > +    * request was high-order.
> > +    */
> > +   if (classzone_idx == -1) {
> > +           wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> > +           classzone_idx = MAX_NR_ZONES - 1;
> > +           goto full_sleep;
> > +   }
> 
> Passing -1 to kcompactd would cause the problem?
> 

No, it ends up doing a wakeup and then going back to sleep which is not
what is required. I'll fix it.

> > @@ -3390,12 +3386,24 @@ static int kswapd(void *p)
> >              * We can speed up thawing tasks if we don't call balance_pgdat
> >              * after returning from the refrigerator
> >              */
> > -           if (!ret) {
> > -                   trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> > +           if (ret)
> > +                   continue;
> >  
> > -                   /* return value ignored until next patch */
> > -                   balance_pgdat(pgdat, order, classzone_idx);
> > -           }
> > +           /*
> > +            * Reclaim begins at the requested order but if a high-order
> > +            * reclaim fails then kswapd falls back to reclaiming for
> > +            * order-0. If that happens, kswapd will consider sleeping
> > +            * for the order it finished reclaiming at (reclaim_order)
> > +            * but kcompactd is woken to compact for the original
> > +            * request (alloc_order).
> > +            */
> > +           trace_mm_vmscan_kswapd_wake(pgdat->node_id, alloc_order);
> > +           reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> > +           if (reclaim_order < alloc_order)
> > +                   goto kswapd_try_sleep;
> 
> This 'goto' would cause kswapd to sleep prematurely. We need to check
> *new* pgdat->kswapd_order and classzone_idx even in this case.
> 

It only matters if the next incoming request is also high-order, but one
thing that needs to be avoided is kswapd staying awake for long periods of
time constantly reclaiming for high-order pages. This is why the check means
"If we reclaimed for high-order and failed, then consider sleeping now".
If allocations still require high-order pages, they direct reclaim instead.

"Fixing" this potentially causes reclaim storms from kswapd.
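
As a rough userspace sketch of that check (not the kernel code): the
names model_balance_pgdat() and model_after_reclaim() are hypothetical,
and the assumption that the non-sleeping path goes on to re-read the
pending request is mine, since the quoted hunk does not show it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for balance_pgdat(): returns the order that
 * reclaim finished at. If high-order reclaim fails, it falls back to
 * order-0, as the comment in the patch describes.
 */
static int model_balance_pgdat(int alloc_order, bool high_order_ok)
{
	return high_order_ok ? alloc_order : 0;
}

enum model_next {
	MODEL_TRY_SLEEP,		/* models "goto kswapd_try_sleep" */
	MODEL_CHECK_NEW_REQUEST,	/* assumed: re-read the pending request */
};

/* Models the decision taken right after reclaim returns. */
static enum model_next model_after_reclaim(int alloc_order, int reclaim_order)
{
	/* Reclaim never finishes at a higher order than it was asked for. */
	assert(reclaim_order <= alloc_order);

	/* High-order reclaim failed: consider sleeping instead of retrying. */
	if (reclaim_order < alloc_order)
		return MODEL_TRY_SLEEP;

	return MODEL_CHECK_NEW_REQUEST;
}
```

In this model a failed order-3 reclaim goes straight to the sleep check,
while an order-0 request can never take the early exit.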

> > @@ -3418,10 +3426,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
> >     if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
> >             return;
> >     pgdat = zone->zone_pgdat;
> > -   if (pgdat->kswapd_max_order < order) {
> > -           pgdat->kswapd_max_order = order;
> > -           pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
> > -   }
> > +   if (pgdat->kswapd_classzone_idx == -1)
> > +           pgdat->kswapd_classzone_idx = classzone_idx;
> > +   pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
> > +   pgdat->kswapd_order = max(pgdat->kswapd_order, order);
> 
> Now, updating pgdat->kswapd_max_order and classzone_idx happens
> unconditionally. Before your patch, it is only updated toward hard
> constraint (e.g. higher order).
> 

So? It's updating the request to suit the requirements of all pending
allocation requests that woke kswapd.

> And, I'd like to know why max() is used for classzone_idx rather than
> min()? I think that kswapd should balance the lowest zone requested.
> 

If there are two allocation requests -- one zone-constrained and the other
zone-unconstrained, it does not make sense to have kswapd skip the pages
usable by the zone-unconstrained request and waste a load of CPU. You could
argue that using min would satisfy the zone-constrained allocation faster
but that's at the cost of delaying the zone-unconstrained allocation and
wasting CPU. Bear in mind that using max may mean some lowmem pages get
freed anyway due to LRU order.
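
To illustrate the merge, here is a minimal userspace model of the
updated wakeup path. struct pgdat_model, the MODEL_ZONE_* values and
model_wakeup_kswapd() are hypothetical stand-ins for illustration; only
the -1 check and the two max() updates come from the hunk above.

```c
#include <assert.h>

/* Hypothetical stand-in for the two pg_data_t fields the patch updates. */
struct pgdat_model {
	int kswapd_order;
	int kswapd_classzone_idx;	/* -1 means no request is pending */
};

/* Illustrative zone indexes; the real values depend on the kernel config. */
enum { MODEL_ZONE_DMA, MODEL_ZONE_NORMAL, MODEL_ZONE_MOVABLE };

static int max_int(int a, int b)
{
	return a > b ? a : b;
}

/* Fold one wakeup request into the pending request for the node. */
static void model_wakeup_kswapd(struct pgdat_model *pgdat, int order,
				int classzone_idx)
{
	assert(order >= 0 && classzone_idx >= 0);

	if (pgdat->kswapd_classzone_idx == -1)
		pgdat->kswapd_classzone_idx = classzone_idx;

	/* Keep the highest zone index and the highest order seen so far. */
	pgdat->kswapd_classzone_idx = max_int(pgdat->kswapd_classzone_idx,
					      classzone_idx);
	pgdat->kswapd_order = max_int(pgdat->kswapd_order, order);
}
```

With a zone-constrained order-0 request followed by a zone-unconstrained
order-3 request, the model ends up with the higher classzone_idx and
order 3, so the pages usable by the unconstrained request are not skipped.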

-- 
Mel Gorman
SUSE Labs
