from:"Zhang Yanfei"

Re: [RFC PATCH v3 3/3] mm/compaction: enhance compaction finish condition

2015-02-02 Thread Zhang Yanfei

Hello,

At 2015/2/2 18:20, Vlastimil Babka wrote:
> On 02/02/2015 08:15 AM, Joonsoo Kim wrote:
>> Compaction has anti fragmentation algorithm. It is that freepage
>> should be more than pageblock order to finish the compaction if we don't
>> find any freepage in requested migratetype buddy list. This is for
>> mitigating fragmentation, but, there is a lack of migratetype
>> consideration and it is too excessive compared to page allocator's anti
>> fragmentation algorithm.
>>
>> Not considering migratetype would cause premature finish of compaction.
>> For example, if allocation request is for unmovable migratetype,
>> freepage with CMA migratetype doesn't help that allocation and
>> compaction should not be stopped. But, current logic regards this
>> situation as compaction is no longer needed, so finish the compaction.
> 
> This is only for order >= pageblock_order, right? Perhaps should be told 
> explicitly.

I might be wrong. If we applied patch1, so after the system runs for some time,
there must be no MIGRATE_CMA free pages in the system, right? If so, the
example above doesn't exist anymore.

> 
>> Secondly, condition is too excessive compared to page allocator's logic.
>> We can steal freepage from other migratetype and change pageblock
>> migratetype on more relaxed conditions in page allocator. This is designed
>> to prevent fragmentation and we can use it here. Imposing hard constraint
>> only to the compaction doesn't help much in this case since page allocator
>> would cause fragmentation again.
>>
>> To solve these problems, this patch borrows anti fragmentation logic from
>> page allocator. It will reduce premature compaction finish in some cases
>> and reduce excessive compaction work.
>>
>> stress-highalloc test in mmtests with non movable order 7 allocation shows
>> considerable increase of compaction success rate.
>>
>> Compaction success rate (Compaction success * 100 / Compaction stalls, %)
>> 31.82 : 42.20
>>
>> Signed-off-by: Joonsoo Kim 
>> ---
>>  mm/compaction.c | 14 --
>>  mm/internal.h   |  2 ++
>>  mm/page_alloc.c | 12 
>>  3 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 782772d..d40c426 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1170,13 +1170,23 @@ static int __compact_finished(struct zone *zone, 
>> struct compact_control *cc,
>>  /* Direct compactor: Is a suitable page free? */
>>  for (order = cc->order; order < MAX_ORDER; order++) {
>>  struct free_area *area = &zone->free_area[order];
>> +bool can_steal;
>>  
>>  /* Job done if page is free of the right migratetype */
>>  if (!list_empty(&area->free_list[migratetype]))
>>  return COMPACT_PARTIAL;
>>  
>> -/* Job done if allocation would set block type */
>> -if (order >= pageblock_order && area->nr_free)
>> +/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
>> +if (migratetype == MIGRATE_MOVABLE &&
>> +!list_empty(&area->free_list[MIGRATE_CMA]))
>> +return COMPACT_PARTIAL;
> 
> The above AFAICS needs #ifdef CMA otherwise won't compile without CMA.
> 
>> +
>> +/*
>> + * Job done if allocation would steal freepages from
>> + * other migratetype buddy lists.
>> + */
>> +if (find_suitable_fallback(area, order, migratetype,
>> +true, &can_steal) != -1)
>>  return COMPACT_PARTIAL;
>>  }
>>  
>> diff --git a/mm/internal.h b/mm/internal.h
>> index c4d6c9b..9640650 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -200,6 +200,8 @@ isolate_freepages_range(struct compact_control *cc,
>>  unsigned long
>>  isolate_migratepages_range(struct compact_control *cc,
>> unsigned long low_pfn, unsigned long end_pfn);
>> +int find_suitable_fallback(struct free_area *area, unsigned int order,
>> +int migratetype, bool only_stealable, bool *can_steal);
>>  
>>  #endif
>>  
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6cb18f8..0a150f1 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1177,8 +1177,8 @@ static void steal_suitable_fallback(struct zone *zone, 
>> struct page *page,
>>  set_pageblock_migratetype(page, start_type);
>>  }
>>  
>> -static int find_suitable_fallback(struct free_area *area, unsigned int 
>> order,
>> -int migratetype, bool *can_steal)
>> +int find_suitable_fallback(struct free_area *area, unsigned int order,
>> +int migratetype, bool only_stealable, bool *can_steal)
>>  {
>>  int i;
>>  int fallback_mt;
>> @@ -1198,7 +1198,11 @@ static int find_suitable_fallback(struct free_area 
>> *area, unsigned int order,
>>  if (can_steal_fallback(order, migratetype))
>>

Re: [RFC PATCH v3 2/3] mm/page_alloc: factor out fallback freepage checking

2015-02-02 Thread Zhang Yanfei

Hello Joonsoo,

At 2015/2/2 15:15, Joonsoo Kim wrote:
> This is preparation step to use page allocator's anti fragmentation logic
> in compaction. This patch just separates fallback freepage checking part
> from fallback freepage management part. Therefore, there is no functional
> change.
> 
> Signed-off-by: Joonsoo Kim 
> ---
>  mm/page_alloc.c | 128 
> +---
>  1 file changed, 76 insertions(+), 52 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e64b260..6cb18f8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1142,14 +1142,26 @@ static void change_pageblock_range(struct page 
> *pageblock_page,
>   * as fragmentation caused by those allocations polluting movable pageblocks
>   * is worse than movable allocations stealing from unmovable and reclaimable
>   * pageblocks.
> - *
> - * If we claim more than half of the pageblock, change pageblock's 
> migratetype
> - * as well.
>   */
> -static void try_to_steal_freepages(struct zone *zone, struct page *page,
> -   int start_type, int fallback_type)
> +static bool can_steal_fallback(unsigned int order, int start_mt)
> +{
> + if (order >= pageblock_order)
> + return true;

Is this test necessary? Since an order which is >= pageblock_order
will always pass the order >= pageblock_order / 2 test below.

Thanks.

> +
> + if (order >= pageblock_order / 2 ||
> + start_mt == MIGRATE_RECLAIMABLE ||
> + start_mt == MIGRATE_UNMOVABLE ||
> + page_group_by_mobility_disabled)
> + return true;
> +
> + return false;
> +}
> +
> +static void steal_suitable_fallback(struct zone *zone, struct page *page,
> +   int start_type)
>  {
>   int current_order = page_order(page);
> + int pages;
>  
>   /* Take ownership for orders >= pageblock_order */
>   if (current_order >= pageblock_order) {
> @@ -1157,19 +1169,39 @@ static void try_to_steal_freepages(struct zone *zone, 
> struct page *page,
>   return;
>   }
>  
> - if (current_order >= pageblock_order / 2 ||
> - start_type == MIGRATE_RECLAIMABLE ||
> - start_type == MIGRATE_UNMOVABLE ||
> - page_group_by_mobility_disabled) {
> - int pages;
> + pages = move_freepages_block(zone, page, start_type);
>  
> - pages = move_freepages_block(zone, page, start_type);
> + /* Claim the whole block if over half of it is free */
> + if (pages >= (1 << (pageblock_order-1)) ||
> + page_group_by_mobility_disabled)
> + set_pageblock_migratetype(page, start_type);
> +}
>  
> - /* Claim the whole block if over half of it is free */
> - if (pages >= (1 << (pageblock_order-1)) ||
> - page_group_by_mobility_disabled)
> - set_pageblock_migratetype(page, start_type);
> +static int find_suitable_fallback(struct free_area *area, unsigned int order,
> + int migratetype, bool *can_steal)
> +{
> + int i;
> + int fallback_mt;
> +
> + if (area->nr_free == 0)
> + return -1;
> +
> + *can_steal = false;
> + for (i = 0;; i++) {
> + fallback_mt = fallbacks[migratetype][i];
> + if (fallback_mt == MIGRATE_RESERVE)
> + break;
> +
> + if (list_empty(&area->free_list[fallback_mt]))
> + continue;
> +
> + if (can_steal_fallback(order, migratetype))
> + *can_steal = true;
> +
> + return i;
>   }
> +
> + return -1;
>  }
>  
>  /* Remove an element from the buddy allocator from the fallback list */
> @@ -1179,53 +1211,45 @@ __rmqueue_fallback(struct zone *zone, unsigned int 
> order, int start_migratetype)
>   struct free_area *area;
>   unsigned int current_order;
>   struct page *page;
> + int fallback_mt;
> + bool can_steal;
>  
>   /* Find the largest possible block of pages in the other list */
>   for (current_order = MAX_ORDER-1;
>   current_order >= order && current_order <= 
> MAX_ORDER-1;
>   --current_order) {
> - int i;
> - for (i = 0;; i++) {
> - int migratetype = fallbacks[start_migratetype][i];
> - int buddy_type = start_migratetype;
> -
> - /* MIGRATE_RESERVE handled later if necessary */
> - if (migratetype == MIGRATE_RESERVE)
> - break;
> -
> - area = &(zone->free_area[current_order]);
> - if (list_empty(&area->free_list[migratetype]))
> - continue;
> -
> - page = list_entry(area->free_list[migratetype].next,
> -

Re: [PATCH v2 4/4] mm/compaction: enhance compaction finish condition

2015-01-31 Thread Zhang Yanfei

At 2015/1/30 20:34, Joonsoo Kim wrote:
> From: Joonsoo 
> 
> Compaction has anti fragmentation algorithm. It is that freepage
> should be more than pageblock order to finish the compaction if we don't
> find any freepage in requested migratetype buddy list. This is for
> mitigating fragmentation, but, there is a lack of migratetype
> consideration and it is too excessive compared to page allocator's anti
> fragmentation algorithm.
> 
> Not considering migratetype would cause premature finish of compaction.
> For example, if allocation request is for unmovable migratetype,
> freepage with CMA migratetype doesn't help that allocation and
> compaction should not be stopped. But, current logic regards this
> situation as compaction is no longer needed, so finish the compaction.
> 
> Secondly, condition is too excessive compared to page allocator's logic.
> We can steal freepage from other migratetype and change pageblock
> migratetype on more relaxed conditions in page allocator. This is designed
> to prevent fragmentation and we can use it here. Imposing hard constraint
> only to the compaction doesn't help much in this case since page allocator
> would cause fragmentation again.

Changing both two behaviours in compaction may change the high order allocation
behaviours in the buddy allocator slowpath, so just as Vlastimil suggested,
some data from allocator should be necessary and helpful, IMHO.

Thanks. 

> 
> To solve these problems, this patch borrows anti fragmentation logic from
> page allocator. It will reduce premature compaction finish in some cases
> and reduce excessive compaction work.
> 
> stress-highalloc test in mmtests with non movable order 7 allocation shows
> considerable increase of compaction success rate.
> 
> Compaction success rate (Compaction success * 100 / Compaction stalls, %)
> 31.82 : 42.20
> 
> Signed-off-by: Joonsoo Kim 
> ---
>  include/linux/mmzone.h |  3 +++
>  mm/compaction.c| 30 --
>  mm/internal.h  |  1 +
>  mm/page_alloc.c|  5 ++---
>  4 files changed, 34 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f279d9c..a2906bc 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -63,6 +63,9 @@ enum {
>   MIGRATE_TYPES
>  };
>  
> +#define FALLBACK_MIGRATETYPES (4)
> +extern int fallbacks[MIGRATE_TYPES][FALLBACK_MIGRATETYPES];
> +
>  #ifdef CONFIG_CMA
>  #  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
>  #else
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 782772d..0460e4b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1125,6 +1125,29 @@ static isolate_migrate_t isolate_migratepages(struct 
> zone *zone,
>   return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
>  }
>  
> +static bool can_steal_fallbacks(struct free_area *area,
> + unsigned int order, int migratetype)
> +{
> + int i;
> + int fallback_mt;
> +
> + if (area->nr_free == 0)
> + return false;
> +
> + for (i = 0; i < FALLBACK_MIGRATETYPES; i++) {
> + fallback_mt = fallbacks[migratetype][i];
> + if (fallback_mt == MIGRATE_RESERVE)
> + break;
> +
> + if (list_empty(&area->free_list[fallback_mt]))
> + continue;
> +
> + if (can_steal_freepages(order, migratetype, fallback_mt))
> + return true;
> + }
> + return false;
> +}
> +
>  static int __compact_finished(struct zone *zone, struct compact_control *cc,
>   const int migratetype)
>  {
> @@ -1175,8 +1198,11 @@ static int __compact_finished(struct zone *zone, 
> struct compact_control *cc,
>   if (!list_empty(&area->free_list[migratetype]))
>   return COMPACT_PARTIAL;
>  
> - /* Job done if allocation would set block type */
> - if (order >= pageblock_order && area->nr_free)
> + /*
> +  * Job done if allocation would steal freepages from
> +  * other migratetype buddy lists.
> +  */
> + if (can_steal_fallbacks(area, order, migratetype))
>   return COMPACT_PARTIAL;
>   }
>  
> diff --git a/mm/internal.h b/mm/internal.h
> index c4d6c9b..0a89a14 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -201,6 +201,7 @@ unsigned long
>  isolate_migratepages_range(struct compact_control *cc,
>  unsigned long low_pfn, unsigned long end_pfn);
>  
> +bool can_steal_freepages(unsigned int order, int start_mt, int fallback_mt);
>  #endif
>  
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ef74750..4c3538b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1026,7 +1026,7 @@ struct page *__rmqueue_smallest(struct zone *zone, 
> unsigned int order,
>   * This array describes the order lists are fallen back to when
>   * the fr

Re: [PATCH v2 3/4] mm/page_alloc: separate steal decision from steal behaviour part

2015-01-31 Thread Zhang Yanfei

At 2015/1/30 20:34, Joonsoo Kim wrote:
> From: Joonsoo 
> 
> This is preparation step to use page allocator's anti fragmentation logic
> in compaction. This patch just separates steal decision part from actual
> steal behaviour part so there is no functional change.
> 
> Signed-off-by: Joonsoo Kim 
> ---
>  mm/page_alloc.c | 49 -
>  1 file changed, 32 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8d52ab1..ef74750 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1122,6 +1122,24 @@ static void change_pageblock_range(struct page 
> *pageblock_page,
>   }
>  }
>  
> +static bool can_steal_freepages(unsigned int order,
> + int start_mt, int fallback_mt)
> +{
> + if (is_migrate_cma(fallback_mt))
> + return false;
> +
> + if (order >= pageblock_order)
> + return true;
> +
> + if (order >= pageblock_order / 2 ||
> + start_mt == MIGRATE_RECLAIMABLE ||
> + start_mt == MIGRATE_UNMOVABLE ||
> + page_group_by_mobility_disabled)
> + return true;
> +
> + return false;
> +}

So some comments which can tell the cases can or cannot steal freepages
from other migratetype is necessary IMHO. Actually we can just
move some comments in try_to_steal_pages to here.

Thanks.

> +
>  /*
>   * When we are falling back to another migratetype during allocation, try to
>   * steal extra free pages from the same pageblocks to satisfy further
> @@ -1138,9 +1156,10 @@ static void change_pageblock_range(struct page 
> *pageblock_page,
>   * as well.
>   */
>  static void try_to_steal_freepages(struct zone *zone, struct page *page,
> -   int start_type, int fallback_type)
> +   int start_type)
>  {
>   int current_order = page_order(page);
> + int pages;
>  
>   /* Take ownership for orders >= pageblock_order */
>   if (current_order >= pageblock_order) {
> @@ -1148,19 +1167,12 @@ static void try_to_steal_freepages(struct zone *zone, 
> struct page *page,
>   return;
>   }
>  
> - if (current_order >= pageblock_order / 2 ||
> - start_type == MIGRATE_RECLAIMABLE ||
> - start_type == MIGRATE_UNMOVABLE ||
> - page_group_by_mobility_disabled) {
> - int pages;
> + pages = move_freepages_block(zone, page, start_type);
>  
> - pages = move_freepages_block(zone, page, start_type);
> -
> - /* Claim the whole block if over half of it is free */
> - if (pages >= (1 << (pageblock_order-1)) ||
> - page_group_by_mobility_disabled)
> - set_pageblock_migratetype(page, start_type);
> - }
> + /* Claim the whole block if over half of it is free */
> + if (pages >= (1 << (pageblock_order-1)) ||
> + page_group_by_mobility_disabled)
> + set_pageblock_migratetype(page, start_type);
>  }
>  
>  /* Remove an element from the buddy allocator from the fallback list */
> @@ -1170,6 +1182,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int 
> order, int start_migratetype)
>   struct free_area *area;
>   unsigned int current_order;
>   struct page *page;
> + bool can_steal;
>  
>   /* Find the largest possible block of pages in the other list */
>   for (current_order = MAX_ORDER-1;
> @@ -1192,10 +1205,11 @@ __rmqueue_fallback(struct zone *zone, unsigned int 
> order, int start_migratetype)
>   struct page, lru);
>   area->nr_free--;
>  
> - if (!is_migrate_cma(migratetype)) {
> + can_steal = can_steal_freepages(current_order,
> + start_migratetype, migratetype);
> + if (can_steal) {
>   try_to_steal_freepages(zone, page,
> - start_migratetype,
> - migratetype);
> + start_migratetype);
>   } else {
>   /*
>* When borrowing from MIGRATE_CMA, we need to
> @@ -1203,7 +1217,8 @@ __rmqueue_fallback(struct zone *zone, unsigned int 
> order, int start_migratetype)
>* itself, and we do not try to steal extra
>* free pages.
>*/
> - buddy_type = migratetype;
> + if (is_migrate_cma(migratetype))
> + buddy_type = migratetype;
>   }
>  
>   /* Remove the page from the freelists */
> 
--
To unsubscribe from this list: send the line "unsubscribe linux

Re: [PATCH v2 2/4] mm/compaction: stop the isolation when we isolate enough freepage

2015-01-31 Thread Zhang Yanfei

At 2015/1/31 16:31, Vlastimil Babka wrote:
> On 01/31/2015 08:49 AM, Zhang Yanfei wrote:
>> Hello,
>>
>> At 2015/1/30 20:34, Joonsoo Kim wrote:
>>
>> Reviewed-by: Zhang Yanfei 
>>
>> IMHO, the patch making the free scanner move slower makes both scanners
>> meet further. Before this patch, if we isolate too many free pages and even 
>> after we release the unneeded free pages later the free scanner still already
>> be there and will be moved forward again next time -- the free scanner just
>> cannot be moved back to grab the free pages we released before no matter 
>> where
>> the free pages in, pcp or buddy. 
> 
> It can be actually moved back. If we are releasing free pages, it means the
> current compaction is terminating, and it will set 
> zone->compact_cached_free_pfn
> back to the position of the released free page that was furthest back. The 
> next
> compaction will start from the cached free pfn.

Yeah, you are right. I missed the release_freepages(). Thanks!

> 
> It is however possible that another compaction runs in parallel and has
> progressed further and overwrites the cached free pfn.
> 

Hmm, maybe.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 2/4] mm/compaction: stop the isolation when we isolate enough freepage

2015-01-30 Thread Zhang Yanfei

Hello,

At 2015/1/30 20:34, Joonsoo Kim wrote:
> From: Joonsoo 
> 
> Currently, freepage isolation in one pageblock doesn't consider how many
> freepages we isolate. When I traced flow of compaction, compaction
> sometimes isolates more than 256 freepages to migrate just 32 pages.
> 
> In this patch, freepage isolation is stopped at the point that we
> have more isolated freepage than isolated page for migration. This
> results in slowing down free page scanner and make compaction success
> rate higher.
> 
> stress-highalloc test in mmtests with non movable order 7 allocation shows
> increase of compaction success rate.
> 
> Compaction success rate (Compaction success * 100 / Compaction stalls, %)
> 27.13 : 31.82
> 
> pfn where both scanners meets on compaction complete
> (separate test due to enormous tracepoint buffer)
> (zone_start=4096, zone_end=1048576)
> 586034 : 654378
> 
> In fact, I didn't fully understand why this patch results in such good
> result. There was a guess that not used freepages are released to pcp list
> and on next compaction trial we won't isolate them again so compaction
> success rate would decrease. To prevent this effect, I tested with adding
> pcp drain code on release_freepages(), but, it has no good effect.
> 
> Anyway, this patch reduces waste time to isolate unneeded freepages so
> seems reasonable.

Reviewed-by: Zhang Yanfei 

IMHO, the patch making the free scanner move slower makes both scanners
meet further. Before this patch, if we isolate too many free pages and even 
after we release the unneeded free pages later the free scanner still already
be there and will be moved forward again next time -- the free scanner just
cannot be moved back to grab the free pages we released before no matter where
the free pages in, pcp or buddy. 

> 
> Signed-off-by: Joonsoo Kim 
> ---
>  mm/compaction.c | 17 ++---
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4954e19..782772d 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -490,6 +490,13 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>  
>   /* If a page was split, advance to the end of it */
>   if (isolated) {
> + cc->nr_freepages += isolated;
> + if (!strict &&
> + cc->nr_migratepages <= cc->nr_freepages) {
> + blockpfn += isolated;
> + break;
> + }
> +
>   blockpfn += isolated - 1;
>   cursor += isolated - 1;
>   continue;
> @@ -899,7 +906,6 @@ static void isolate_freepages(struct compact_control *cc)
>   unsigned long isolate_start_pfn; /* exact pfn we start at */
>   unsigned long block_end_pfn;/* end of current pageblock */
>   unsigned long low_pfn;   /* lowest pfn scanner is able to scan */
> - int nr_freepages = cc->nr_freepages;
>   struct list_head *freelist = &cc->freepages;
>  
>   /*
> @@ -924,11 +930,11 @@ static void isolate_freepages(struct compact_control 
> *cc)
>* pages on cc->migratepages. We stop searching if the migrate
>* and free page scanners meet or enough free pages are isolated.
>*/
> - for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
> + for (; block_start_pfn >= low_pfn &&
> + cc->nr_migratepages > cc->nr_freepages;
>   block_end_pfn = block_start_pfn,
>   block_start_pfn -= pageblock_nr_pages,
>   isolate_start_pfn = block_start_pfn) {
> - unsigned long isolated;
>  
>   /*
>* This can iterate a massively long zone without finding any
> @@ -953,9 +959,8 @@ static void isolate_freepages(struct compact_control *cc)
>   continue;
>  
>   /* Found a block suitable for isolating free pages from. */
> - isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> + isolate_freepages_block(cc, &isolate_start_pfn,
>   block_end_pfn, freelist, false);
> - nr_freepages += isolated;
>  
>   /*
>* Remember where the free scanner should restart next time,
> @@ -987,8 +992,6 @@ static void isolate_freepages(struct compact_control *cc)
>*/
>   if (block_start_pfn < low_pfn)
>   cc->free_pfn = cc->migrate_pfn;
> -
> - cc->nr_freepages = nr_freepages;
>  }
>  
>  /*
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/4] mm/compaction: fix wrong order check in compact_finished()

2015-01-30 Thread Zhang Yanfei

Hello,

At 2015/1/30 20:34, Joonsoo Kim wrote:
> What we want to check here is whether there is highorder freepage
> in buddy list of other migratetype in order to steal it without
> fragmentation. But, current code just checks cc->order which means
> allocation request order. So, this is wrong.
> 
> Without this fix, non-movable synchronous compaction below pageblock order
> would not stopped until compaction is complete, because migratetype of most
> pageblocks are movable and high order freepage made by compaction is usually
> on movable type buddy list.
> 
> There is some report related to this bug. See below link.
> 
> http://www.spinics.net/lists/linux-mm/msg81666.html
> 
> Although the issued system still has load spike comes from compaction,
> this makes that system completely stable and responsive according to
> his report.
> 
> stress-highalloc test in mmtests with non movable order 7 allocation doesn't
> show any notable difference in allocation success rate, but, it shows more
> compaction success rate.
> 
> Compaction success rate (Compaction success * 100 / Compaction stalls, %)
> 18.47 : 28.94
> 
> Cc: 
> Acked-by: Vlastimil Babka 
> Signed-off-by: Joonsoo Kim 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b68736c..4954e19 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1173,7 +1173,7 @@ static int __compact_finished(struct zone *zone, struct 
> compact_control *cc,
>   return COMPACT_PARTIAL;
>  
>   /* Job done if allocation would set block type */
> - if (cc->order >= pageblock_order && area->nr_free)
> + if (order >= pageblock_order && area->nr_free)
>   return COMPACT_PARTIAL;
>   }
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3] mm: incorporate read-only pages into transparent huge pages

2015-01-28 Thread Zhang Yanfei

Hello

在 2015/1/28 1:39, Ebru Akagunduz 写道:
> This patch aims to improve THP collapse rates, by allowing
> THP collapse in the presence of read-only ptes, like those
> left in place by do_swap_page after a read fault.
>
> Currently THP can collapse 4kB pages into a THP when
> there are up to khugepaged_max_ptes_none pte_none ptes
> in a 2MB range. This patch applies the same limit for
> read-only ptes.
>
> The patch was tested with a test program that allocates
> 800MB of memory, writes to it, and then sleeps. I force
> the system to swap out all but 190MB of the program by
> touching other memory. Afterwards, the test program does
> a mix of reads and writes to its memory, and the memory
> gets swapped back in.
>
> Without the patch, only the memory that did not get
> swapped out remained in THPs, which corresponds to 24% of
> the memory of the program. The percentage did not increase
> over time.
>
> With this patch, after 5 minutes of waiting khugepaged had
> collapsed 50% of the program's memory back into THPs.
>
> Signed-off-by: Ebru Akagunduz 
> Reviewed-by: Rik van Riel 
> Acked-by: Vlastimil Babka 

Please feel free to add:

Acked-by: Zhang Yanfei 

> ---
> Changes in v2:
>  - Remove extra code indent (Vlastimil Babka)
>  - Add comment line for check condition of page_count() (Vlastimil Babka)
>  - Add fast path optimistic check to
>__collapse_huge_page_isolate() (Andrea Arcangeli)
>  - Move check condition of page_count() below to trylock_page() (Andrea 
> Arcangeli)
>
> Changes in v3:
>  - Add a at-least-one-writable-pte check (Zhang Yanfei)
>  - Debug page count (Vlastimil Babka, Andrea Arcangeli)
>  - Increase read-only pte counter if pte is none (Andrea Arcangeli)
>
> I've written down test results:
> With the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:  100464 kB
> AnonHugePages:  100352 kB
> Swap:   699540 kB
> Fraction:   99,88
>
> cat /proc/meminfo:
> AnonPages:  1754448 kB
> AnonHugePages:  1716224 kB
> Fraction:   97,82
>
> After swapped in:
> In a few seconds:
> cat /proc/pid/smaps:
> Anonymous:  84 kB
> AnonHugePages:  145408 kB
> Swap:   0 kB
> Fraction:   18,17
>
> cat /proc/meminfo:
> AnonPages:  2455016 kB
> AnonHugePages:  1761280 kB
> Fraction:   71,74
>
> In 5 minutes:
> cat /proc/pid/smaps
> Anonymous:  84 kB
> AnonHugePages:  407552 kB
> Swap:   0 kB
> Fraction:   50,94
>
> cat /proc/meminfo:
> AnonPages:  2456872 kB
> AnonHugePages:  2023424 kB
> Fraction:   82,35
>
> Without the patch:
> After swapped out:
> cat /proc/pid/smaps:
> Anonymous:  190660 kB
> AnonHugePages:  190464 kB
> Swap:   609344 kB
> Fraction:   99,89
>
> cat /proc/meminfo:
> AnonPages:  1740456 kB
> AnonHugePages:  1667072 kB
> Fraction:   95,78
>
> After swapped in:
> cat /proc/pid/smaps:
> Anonymous:  84 kB
> AnonHugePages:  190464 kB
> Swap:   0 kB
> Fraction:   23,80
>
> cat /proc/meminfo:
> AnonPages:  2350032 kB
> AnonHugePages:  1667072 kB
> Fraction:   70,93
>
> I waited 10 minutes the fractions
> did not change without the patch.
>
>  mm/huge_memory.c | 60 
> +---
>  1 file changed, 49 insertions(+), 11 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 817a875..17d6e59 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2148,17 +2148,18 @@ static int __collapse_huge_page_isolate(struct 
> vm_area_struct *vma,
>  {
>   struct page *page;
>   pte_t *_pte;
> - int referenced = 0, none = 0;
> + int referenced = 0, none = 0, ro = 0, writable = 0;
>   for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
>_pte++, address += PAGE_SIZE) {
>   pte_t pteval = *_pte;
>   if (pte_none(pteval)) {
> + ro++;
>   if (++none <= khugepaged_max_ptes_none)
>   continue;
>   else
>   goto out;
>   }
> - if (!pte_present(pteval) || !pte_write(pteval))
> + if (!pte_present(pteval))
>   goto out;
>   page = vm_normal_page(vma, address, pteval);
>   if (unlikely(!page))
> @@ -2168,9 +2169,6 @@ static int __collapse_huge_page_isolate(struct 
> vm_area_struct *vma,
>   VM_BUG_ON_PAGE(!PageAnon(page), page);
>   VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>  
> - /

Re: [PATCH v3] mm: incorporate read-only pages into transparent huge pages

2015-01-28 Thread Zhang Yanfei

Hello

在 2015/1/28 8:27, Andrea Arcangeli 写道:
> On Tue, Jan 27, 2015 at 07:39:13PM +0200, Ebru Akagunduz wrote:
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 817a875..17d6e59 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2148,17 +2148,18 @@ static int __collapse_huge_page_isolate(struct 
>> vm_area_struct *vma,
>>  {
>>  struct page *page;
>>  pte_t *_pte;
>> -int referenced = 0, none = 0;
>> +int referenced = 0, none = 0, ro = 0, writable = 0;
> So your "writable" addition is enough and simpler/better than "ro"
> counting. Once "ro" is removed "writable" can actually start to make a
> difference (at the moment it does not).
>
> I'd suggest to remove "ro".
>
> The sysctl was there only to reduce the memory footprint but
> collapsing readonly swapcache won't reduce the memory footprint. So it
> may have been handy before but this new "writable" looks better now
> and keeping both doesn't help (keeping "ro" around prevents "writable"
> to make a difference).

Agreed.

>
>> @@ -2179,6 +2177,34 @@ static int __collapse_huge_page_isolate(struct 
>> vm_area_struct *vma,
>>   */
>>  if (!trylock_page(page))
>>  goto out;
>> +
>> +/*
>> + * cannot use mapcount: can't collapse if there's a gup pin.
>> + * The page must only be referenced by the scanned process
>> + * and page swap cache.
>> + */
>> +if (page_count(page) != 1 + !!PageSwapCache(page)) {
>> +unlock_page(page);
>> +goto out;
>> +}
>> +if (!pte_write(pteval)) {
>> +if (++ro > khugepaged_max_ptes_none) {
>> +unlock_page(page);
>> +goto out;
>> +}
>> +if (PageSwapCache(page) && !reuse_swap_page(page)) {
>> +unlock_page(page);
>> +goto out;
>> +}
>> +/*
>> + * Page is not in the swap cache, and page count is
>> + * one (see above). It can be collapsed into a THP.
>> + */
>> +VM_BUG_ON(page_count(page) != 1);
> In an earlier email I commented on this suggestion you received during
> previous code review: the VM_BUG_ON is not ok because it can generate
> false positives.
>
> It's perfectly ok if page_count is not 1 if the page is isolated by
> another CPU (another cpu calling isolate_lru_page).
>
> The page_count check there is to ensure there are no gup-pins, and
> that is achieved during the check. The VM may still mangle the
> page_count and it's ok (the page count taken by the VM running in
> another CPU doesn't need to be transferred to the collapsed THP).
>
> In short, the check "page_count(page) != 1 + !!PageSwapCache(page)"
> doesn't imply that the page_count cannot change. It only means at any
> given time there was no gup-pin at the very time of the check. It also
> means there were no other VM pin, but what we care about is only the
> gup-pin. The VM LRU pin can still be taken after the check and it's
> ok. The GUP pin cannot be taken because we stopped all gup so we're
> safe if the check passes.
>
> So you can simply delete the VM_BUG_ON, the earlier code there, was fine.

So IMO, the comment should also be removed or changed as it may
mislead someone again later.

Thanks
Zhang

>
>> +} else {
>> +writable = 1;
>> +}
>> +
> I suggest to make writable a bool and use writable = false to init,
> and writable = true above.
>
> When a value can only be 0|1 bool is better (it can be casted and
> takes the same memory as an int, it just allows the compiler to be
> more strict and the fact it makes the code more self explanatory).
>
>> +if (++ro > khugepaged_max_ptes_none)
>> +goto out_unmap;
> As mentioned above the ro counting can go, and we can keep only
> your new writable addition, as mentioned above.
>
> Thanks,
> Andrea
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mm: incorporate read-only pages into transparent huge pages

2015-01-25 Thread Zhang Yanfei

Hello

在 2015/1/25 17:25, Vlastimil Babka 写道:
> On 23.1.2015 20:18, Andrea Arcangeli wrote:
>>> >+if (!pte_write(pteval)) {
>>> >+if (++ro > khugepaged_max_ptes_none)
>>> >+goto out_unmap;
>>> >+}
>> It's true this is maxed out at 511, so there must be at least one
>> writable and not none pte (as results of the two "ro" and "none"
>> counters checks).
> 
> Hm, but if we consider ro and pte_none separately, both can be lower
> than 512, but the sum of the two can be 512, so we can actually be in
> read-only VMA?

Yes, I also think so.

So is it necessary to add a at-least-one-writable-pte check just like the 
existing
at-least-one-page-referenced check?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] CMA: treat free cma pages as non-free if not ALLOC_CMA on watermark checking

2015-01-20 Thread Zhang Yanfei

Hello Minchan,

How are you?

在 2015/1/19 14:55, Minchan Kim 写道:
> Hello,
> 
> On Sun, Jan 18, 2015 at 04:32:59PM +0800, Hui Zhu wrote:
>> From: Hui Zhu 
>>
>> The original of this patch [1] is part of Joonsoo's CMA patch series.
>> I made a patch [2] to fix the issue of this patch.  Joonsoo reminded me
>> that this issue affect current kernel too.  So made a new one for upstream.
> 
> Recently, we found many problems of CMA and Joonsoo tried to add more
> hooks into MM like agressive allocation but I suggested adding new zone

Just out of curiosity, "new zone"? Something like movable zone?

Thanks.

> would be more desirable than more hooks in mm fast path in various aspect.
> (ie, remove lots of hooks in hot path of MM, don't need reclaim hooks
>  for special CMA pages, don't need custom fair allocation for CMA).
> 
> Joonsoo is investigating the direction so please wait.
> If it turns out we have lots of hurdle to go that way,
> this direction(ie, putting more hooks) should be second plan.
> 
> Thanks.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH 4/5] mm, compaction: allow scanners to start at any pfn within the zone

2015-01-20 Thread Zhang Yanfei

Hello Vlastimil

在 2015/1/19 18:05, Vlastimil Babka 写道:
> Compaction employs two page scanners - migration scanner isolates pages to be
> the source of migration, free page scanner isolates pages to be the target of
> migration. Currently, migration scanner starts at the zone's first pageblock
> and progresses towards the last one. Free scanner starts at the last pageblock
> and progresses towards the first one. Within a pageblock, each scanner scans
> pages from the first to the last one. When the scanners meet within the same
> pageblock, compaction terminates.
> 
> One consequence of the current scheme, that turns out to be unfortunate, is
> that the migration scanner does not encounter the pageblocks which were
> scanned by the free scanner. In a test with stress-highalloc from mmtests,
> the scanners were observed to meet around the middle of the zone in first two
> phases (with background memory pressure) of the test when executed after fresh
> reboot. On further executions without reboot, the meeting point shifts to
> roughly third of the zone, and compaction activity as well as allocation
> success rates deteriorates compared to the run after fresh reboot.
> 
> It turns out that the deterioration is indeed due to the migration scanner
> processing only a small part of the zone. Compaction also keeps making this
> bias worse by its activity - by moving all migratable pages towards end of the
> zone, the free scanner has to scan a lot of full pageblocks to find more free
> pages. The beginning of the zone contains pageblocks that have been compacted
> as much as possible, but the free pages there cannot be further merged into
> larger orders due to unmovable pages. The rest of the zone might contain more
> suitable pageblocks, but the migration scanner will not reach them. It also
> isn't be able to move movable pages out of unmovable pageblocks there, which
> affects fragmentation.
> 
> This patch is the first step to remove this bias. It allows the compaction
> scanners to start at arbitrary pfn (aligned to pageblock for practical
> purposes), called pivot, within the zone. The migration scanner starts at the
> exact pfn, the free scanner starts at the pageblock preceding the pivot. The
> direction of scanning is unaffected, but when the migration scanner reaches
> the last pageblock of the zone, or the free scanner reaches the first
> pageblock, they wrap and continue with the first or last pageblock,
> respectively. Compaction terminates when any of the scanners wrap and both
> meet within the same pageblock.
> 
> For easier bisection of potential regressions, this patch always uses the
> first zone's pfn as the pivot. That means the free scanner immediately wraps
> to the last pageblock and the operation of scanners is thus unchanged. The
> actual pivot changing is done by the next patch.
> 
> Signed-off-by: Vlastimil Babka 

I read through the whole patch, and you can feel free to add:

Acked-by: Zhang Yanfei 

I agree with you and the approach to improve the current scheme. One thing
I think should be carefully treated is how to avoid migrating back and forth
since the pivot pfn can be changed. I see patch 5 has introduced a policy to
change the pivot so we can have a careful observation on it.

(The changes in the patch make the code more difficult to understand now...
and I just find a tiny mistake, please see below)

> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
>  include/linux/mmzone.h |   2 +
>  mm/compaction.c| 204 
> +++--
>  mm/internal.h  |   1 +
>  3 files changed, 182 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2f0856d..47aa181 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -503,6 +503,8 @@ struct zone {
>   unsigned long percpu_drift_mark;
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> + /* pfn where compaction scanners have initially started last time */
> + unsigned long   compact_cached_pivot_pfn;
>   /* pfn where compaction free scanner should start */
>   unsigned long   compact_cached_free_pfn;
>   /* pfn where async and sync compaction migration scanner should start */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5626220..abae89a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -123,11 +123,16 @@ static inline bool isolation_suitable(struct 
> compact_control *cc,
>   return !get_pageblock_skip(page);
>  }
>  
> +/*
> + * Invalidate cached compaction scanner po

Re: [PATCH 3/5] mm, compaction: encapsulate resetting cached scanner positions

2015-01-20 Thread Zhang Yanfei

在 2015/1/19 18:05, Vlastimil Babka 写道:
> Reseting the cached compaction scanner positions is now done implicitly in
> __reset_isolation_suitable() and compact_finished(). Encapsulate the
> functionality in a new function reset_cached_positions() and call it
> explicitly where needed.
> 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Zhang Yanfei 

Should the new function be inline?

Thanks.

> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
>  mm/compaction.c | 22 ++
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 45799a4..5626220 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -123,6 +123,13 @@ static inline bool isolation_suitable(struct 
> compact_control *cc,
>   return !get_pageblock_skip(page);
>  }
>  
> +static void reset_cached_positions(struct zone *zone)
> +{
> + zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
> + zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
> + zone->compact_cached_free_pfn = zone_end_pfn(zone);
> +}
> +
>  /*
>   * This function is called to clear all cached information on pageblocks that
>   * should be skipped for page isolation when the migrate and free page 
> scanner
> @@ -134,9 +141,6 @@ static void __reset_isolation_suitable(struct zone *zone)
>   unsigned long end_pfn = zone_end_pfn(zone);
>   unsigned long pfn;
>  
> - zone->compact_cached_migrate_pfn[0] = start_pfn;
> - zone->compact_cached_migrate_pfn[1] = start_pfn;
> - zone->compact_cached_free_pfn = end_pfn;
>   zone->compact_blockskip_flush = false;
>  
>   /* Walk the zone and mark every pageblock as suitable for isolation */
> @@ -166,8 +170,10 @@ void reset_isolation_suitable(pg_data_t *pgdat)
>   continue;
>  
>   /* Only flush if a full compaction finished recently */
> - if (zone->compact_blockskip_flush)
> + if (zone->compact_blockskip_flush) {
>   __reset_isolation_suitable(zone);
> + reset_cached_positions(zone);
> + }
>   }
>  }
>  
> @@ -1059,9 +1065,7 @@ static int compact_finished(struct zone *zone, struct 
> compact_control *cc,
>   /* Compaction run completes if the migrate and free scanner meet */
>   if (compact_scanners_met(cc)) {
>   /* Let the next compaction start anew. */
> - zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
> - zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
> - zone->compact_cached_free_pfn = zone_end_pfn(zone);
> + reset_cached_positions(zone);
>  
>   /*
>* Mark that the PG_migrate_skip information should be cleared
> @@ -1187,8 +1191,10 @@ static int compact_zone(struct zone *zone, struct 
> compact_control *cc)
>* is about to be retried after being deferred. kswapd does not do
>* this reset as it'll reset the cached information when going to sleep.
>*/
> - if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
> + if (compaction_restarting(zone, cc->order) && !current_is_kswapd()) {
>   __reset_isolation_suitable(zone);
> + reset_cached_positions(zone);
> + }
>  
>   /*
>* Setup to move all movable pages to the end of the zone. Used cached
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/5] mm, compaction: simplify handling restart position in free pages scanner

2015-01-20 Thread Zhang Yanfei

Hello,

在 2015/1/19 18:05, Vlastimil Babka 写道:
> Handling the position where compaction free scanner should restart (stored in
> cc->free_pfn) got more complex with commit e14c720efdd7 ("mm, compaction:
> remember position within pageblock in free pages scanner"). Currently the
> position is updated in each loop iteration isolate_freepages(), although it's
> enough to update it only when exiting the loop when we have found enough free
> pages, or detected contention in async compaction. Then an extra check outside
> the loop updates the position in case we have met the migration scanner.
> 
> This can be simplified if we move the test for having isolated enough from
> for loop header next to the test for contention, and determining the restart
> position only in these cases. We can reuse the isolate_start_pfn variable for
> this instead of setting cc->free_pfn directly. Outside the loop, we can simply
> set cc->free_pfn to value of isolate_start_pfn without extra check.
> 
> We also add VM_BUG_ON to future-proof the code, in case somebody adds a new
> condition that terminates isolate_freepages_block() prematurely, which
> wouldn't be also considered in isolate_freepages().
> 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
>  mm/compaction.c | 34 +++---
>  1 file changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5fdbdb8..45799a4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -849,7 +849,7 @@ static void isolate_freepages(struct compact_control *cc)
>* pages on cc->migratepages. We stop searching if the migrate
>* and free page scanners meet or enough free pages are isolated.
>*/
> - for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
> + for (; block_start_pfn >= low_pfn;
>   block_end_pfn = block_start_pfn,
>   block_start_pfn -= pageblock_nr_pages,
>   isolate_start_pfn = block_start_pfn) {
> @@ -883,6 +883,8 @@ static void isolate_freepages(struct compact_control *cc)
>   nr_freepages += isolated;
>  
>   /*
> +  * If we isolated enough freepages, or aborted due to async
> +  * compaction being contended, terminate the loop.
>* Remember where the free scanner should restart next time,
>* which is where isolate_freepages_block() left off.
>* But if it scanned the whole pageblock, isolate_start_pfn
> @@ -891,28 +893,30 @@ static void isolate_freepages(struct compact_control 
> *cc)
>* In that case we will however want to restart at the start
>* of the previous pageblock.
>*/
> - cc->free_pfn = (isolate_start_pfn < block_end_pfn) ?
> - isolate_start_pfn :
> - block_start_pfn - pageblock_nr_pages;
> -
> - /*
> -  * isolate_freepages_block() might have aborted due to async
> -  * compaction being contended
> -  */
> - if (cc->contended)
> + if ((nr_freepages > cc->nr_migratepages) || cc->contended) {

Shouldn't this be nr_freepages >= cc->nr_migratepages?

Thanks

> + if (isolate_start_pfn >= block_end_pfn)
> + isolate_start_pfn =
> + block_start_pfn - pageblock_nr_pages;
>   break;
> + } else {
> + /*
> +  * isolate_freepages_block() should not terminate
> +  * prematurely unless contended, or isolated enough
> +  */
> + VM_BUG_ON(isolate_start_pfn < block_end_pfn);
> + }
>   }
>  
>   /* split_free_page does not map the pages */
>   map_pages(freelist);
>  
>   /*
> -  * If we crossed the migrate scanner, we want to keep it that way
> -  * so that compact_finished() may detect this
> +  * Record where the free scanner will restart next time. Either we
> +  * broke from the loop and set isolate_start_pfn based on the last
> +  * call to isolate_freepages_block(), or we met the migration scanner
> +  * and the loop terminated due to isolate_start_pfn < low_pfn
>*/
> - if (block_start_pfn < low_pfn)
> - cc->free_pfn = cc->migrate_pfn;
> -
> + cc->free_pfn = isolate_start_pfn;
>   cc->nr_freepages = nr_freepages;
>  }
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://ww

Re: [PATCH 1/5] mm, compaction: more robust check for scanners meeting

2015-01-20 Thread Zhang Yanfei

在 2015/1/19 18:05, Vlastimil Babka 写道:
> Compaction should finish when the migration and free scanner meet, i.e. they
> reach the same pageblock. Currently however, the test in compact_finished()
> simply just compares the exact pfns, which may yield a false negative when the
> free scanner position is in the middle of a pageblock and the migration
> scanner reaches the begining of the same pageblock.
> 
> This hasn't been a problem until commit e14c720efdd7 ("mm, compaction:
> remember position within pageblock in free pages scanner") allowed the free
> scanner position to be in the middle of a pageblock between invocations.
> The hot-fix 1d5bfe1ffb5b ("mm, compaction: prevent infinite loop in
> compact_zone") prevented the issue by adding a special check in the migration
> scanner to satisfy the current detection of scanners meeting.
> 
> However, the proper fix is to make the detection more robust. This patch
> introduces the compact_scanners_met() function that returns true when the free
> scanner position is in the same or lower pageblock than the migration scanner.
> The special case in isolate_migratepages() introduced by 1d5bfe1ffb5b is
> removed.
> 
> Suggested-by: Joonsoo Kim 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Zhang Yanfei 

> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
>  mm/compaction.c | 22 ++
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 546e571..5fdbdb8 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -803,6 +803,16 @@ isolate_migratepages_range(struct compact_control *cc, 
> unsigned long start_pfn,
>  #endif /* CONFIG_COMPACTION || CONFIG_CMA */
>  #ifdef CONFIG_COMPACTION
>  /*
> + * Test whether the free scanner has reached the same or lower pageblock than
> + * the migration scanner, and compaction should thus terminate.
> + */
> +static inline bool compact_scanners_met(struct compact_control *cc)
> +{
> + return (cc->free_pfn >> pageblock_order)
> + <= (cc->migrate_pfn >> pageblock_order);
> +}
> +
> +/*
>   * Based on information in the current compact_control, find blocks
>   * suitable for isolating free pages from and then isolate them.
>   */
> @@ -1027,12 +1037,8 @@ static isolate_migrate_t isolate_migratepages(struct 
> zone *zone,
>   }
>  
>   acct_isolated(zone, cc);
> - /*
> -  * Record where migration scanner will be restarted. If we end up in
> -  * the same pageblock as the free scanner, make the scanners fully
> -  * meet so that compact_finished() terminates compaction.
> -  */
> - cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn;
> + /* Record where migration scanner will be restarted. */
> + cc->migrate_pfn = low_pfn;
>  
>   return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
>  }
> @@ -1047,7 +1053,7 @@ static int compact_finished(struct zone *zone, struct 
> compact_control *cc,
>   return COMPACT_PARTIAL;
>  
>   /* Compaction run completes if the migrate and free scanner meet */
> - if (cc->free_pfn <= cc->migrate_pfn) {
> + if (compact_scanners_met(cc)) {
>   /* Let the next compaction start anew. */
>   zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
>   zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
> @@ -1238,7 +1244,7 @@ static int compact_zone(struct zone *zone, struct 
> compact_control *cc)
>* migrate_pages() may return -ENOMEM when scanners meet
>* and we want compact_finished() to detect it
>*/
> - if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) {
> + if (err == -ENOMEM && !compact_scanners_met(cc)) {
>   ret = COMPACT_PARTIAL;
>   goto out;
>   }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/6] mm/hugetlb: gigantic hugetlb page pools shrink supporting

2014-08-21 Thread Zhang Yanfei

Hello Wanpeng

On 08/22/2014 07:37 AM, Wanpeng Li wrote:
> Hi Andi,
> On Fri, Apr 12, 2013 at 05:22:37PM +0200, Andi Kleen wrote:
>> On Fri, Apr 12, 2013 at 07:29:07AM +0800, Wanpeng Li wrote:
>>> Ping Andi,
>>> On Thu, Apr 04, 2013 at 05:09:08PM +0800, Wanpeng Li wrote:
>>>> order >= MAX_ORDER pages are only allocated at boot stage using the 
>>>> bootmem allocator with the "hugepages=xxx" option. These pages are never 
>>>> free after boot by default since it would be a one-way street(>= MAX_ORDER
>>>> pages cannot be allocated later), but if administrator confirm not to 
>>>> use these gigantic pages any more, these pinned pages will waste memory
>>>> since other users can't grab free pages from gigantic hugetlb pool even
>>>> if OOM, it's not flexible.  The patchset add hugetlb gigantic page pools
>>>> shrink supporting. Administrator can enable knob exported in sysctl to
>>>> permit to shrink gigantic hugetlb pool.
>>
>>
>> I originally didn't allow this because it's only one way and it seemed
>> dubious.  I've been recently working on a new patchkit to allocate
>> GB pages from CMA. With that freeing actually makes sense, as 
>> the pages can be reallocated.
>>
> 
> More than one year past, If your allocate GB pages from CMA merged? 

commit 944d9fec8d7aee3f2e16573e9b6a16634b33f403
Author: Luiz Capitulino 
Date:   Wed Jun 4 16:07:13 2014 -0700

hugetlb: add support for gigantic page allocation at runtime


> 
> Regards,
> Wanpeng Li 
> 
>> -Andi
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majord...@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> .
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/5] mm/slab_common: move kmem_cache definition to internal header

2014-08-21 Thread Zhang Yanfei

aches on the system */
> +};
> +
> +#endif /* CONFIG_SLOB */
> +
> +#ifdef CONFIG_SLAB
> +#include 
> +#endif
> +
> +#ifdef CONFIG_SLUB
> +#include 
> +#endif
> +
>  /*
>   * State of the slab allocator.
>   *
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index d319502..2088904 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -30,6 +30,14 @@ LIST_HEAD(slab_caches);
>  DEFINE_MUTEX(slab_mutex);
>  struct kmem_cache *kmem_cache;
>  
> +/*
> + * Determine the size of a slab object
> + */
> +unsigned int kmem_cache_size(struct kmem_cache *s)
> +{
> + return s->object_size;
> +}
> +
>  #ifdef CONFIG_DEBUG_VM
>  static int kmem_cache_sanity_check(const char *name, size_t size)
>  {
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 3/8] mm/page_alloc: fix pcp high, batch management

2014-08-06 Thread Zhang Yanfei

t *output_batch)
> +{
> + *output_high = 6 * input_batch;
> + *output_batch = max(1, 1 * input_batch);
>  }
>  
> -/* a companion to pageset_set_high() */
> -static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
> +static void pageset_get_values(struct zone *zone, int *high, int *batch)
>  {
> - pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
> + if (percpu_pagelist_fraction) {
> + pageset_get_values_by_high(
> + (zone->managed_pages / percpu_pagelist_fraction),
> + high, batch);
> + } else
> + pageset_get_values_by_batch(zone_batchsize(zone), high, batch);
>  }
>  
>  static void pageset_init(struct per_cpu_pageset *p)
> @@ -4263,51 +4298,38 @@ static void pageset_init(struct per_cpu_pageset *p)
>   INIT_LIST_HEAD(&pcp->lists[migratetype]);
>  }
>  
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
> +/* Use this only in boot time, because it doesn't do any synchronization */
> +static void setup_pageset(struct per_cpu_pageset __percpu *pcp)
>  {
> - pageset_init(p);
> - pageset_set_batch(p, batch);
> -}
> -
> -/*
> - * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
> - * to the value high for the pageset p.
> - */
> -static void pageset_set_high(struct per_cpu_pageset *p,
> - unsigned long high)
> -{
> - unsigned long batch = max(1UL, high / 4);
> - if ((high / 4) > (PAGE_SHIFT * 8))
> - batch = PAGE_SHIFT * 8;
> -
> - pageset_update(&p->pcp, high, batch);
> -}
> -
> -static void pageset_set_high_and_batch(struct zone *zone,
> -struct per_cpu_pageset *pcp)
> -{
> - if (percpu_pagelist_fraction)
> - pageset_set_high(pcp,
> - (zone->managed_pages /
> - percpu_pagelist_fraction));
> - else
> - pageset_set_batch(pcp, zone_batchsize(zone));
> -}
> + int cpu;
> + int high, batch;
> + struct per_cpu_pageset *p;
>  
> -static void __meminit zone_pageset_init(struct zone *zone, int cpu)
> -{
> - struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
> + pageset_get_values_by_batch(0, &high, &batch);
>  
> - pageset_init(pcp);
> - pageset_set_high_and_batch(zone, pcp);
> + for_each_possible_cpu(cpu) {
> + p = per_cpu_ptr(pcp, cpu);
> + pageset_init(p);
> + p->pcp.high = high;
> + p->pcp.batch = batch;
> + }
>  }
>  
>  static void __meminit setup_zone_pageset(struct zone *zone)
>  {
>   int cpu;
> + int high, batch;
> + struct per_cpu_pageset *p;
> +
> + pageset_get_values(zone, &high, &batch);
> +
>   zone->pageset = alloc_percpu(struct per_cpu_pageset);
> - for_each_possible_cpu(cpu)
> - zone_pageset_init(zone, cpu);
> + for_each_possible_cpu(cpu) {
> + p = per_cpu_ptr(zone->pageset, cpu);
> + pageset_init(p);
> + p->pcp.high = high;
> + p->pcp.batch = batch;
> + }
>  }
>  
>  /*
> @@ -5928,11 +5950,10 @@ int percpu_pagelist_fraction_sysctl_handler(struct 
> ctl_table *table, int write,
>   goto out;
>  
>   for_each_populated_zone(zone) {
> - unsigned int cpu;
> + int high, batch;
>  
> - for_each_possible_cpu(cpu)
> - pageset_set_high_and_batch(zone,
> -     per_cpu_ptr(zone->pageset, cpu));
> + pageset_get_values(zone, &high, &batch);
> + pageset_update(zone, high, batch);
>   }
>  out:
>   mutex_unlock(&pcp_batch_high_lock);
> @@ -6455,11 +6476,11 @@ void free_contig_range(unsigned long pfn, unsigned 
> nr_pages)
>   */
>  void __meminit zone_pcp_update(struct zone *zone)
>  {
> - unsigned cpu;
> + int high, batch;
> +
>   mutex_lock(&pcp_batch_high_lock);
> - for_each_possible_cpu(cpu)
> - pageset_set_high_and_batch(zone,
> - per_cpu_ptr(zone->pageset, cpu));
> + pageset_get_values(zone, &high, &batch);
> + pageset_update(zone, high, batch);
>   mutex_unlock(&pcp_batch_high_lock);
>  }
>  #endif
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/8] mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC

2014-08-06 Thread Zhang Yanfei

On 08/06/2014 03:18 PM, Joonsoo Kim wrote:
> In __free_one_page(), we check the buddy page if it is guard page.
> And, if so, we should clear guard attribute on the buddy page. But,
> currently, we clear original page's order rather than buddy one's.
> This doesn't have any problem, because resetting buddy's order
> is useless and the original page's order is re-assigned soon.
> But, it is better to correct code.
> 
> Additionally, I change (set/clear)_page_guard_flag() to
> (set/clear)_page_guard() and makes these functions do all works
> needed for guard page. This may make code more understandable.
> 
> One more thing, I did in this patch, is that fixing freepage accounting.
> If we clear guard page and link it onto isolate buddy list, we should
> not increase freepage count.
> 
> Acked-by: Vlastimil Babka 
> Signed-off-by: Joonsoo Kim 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/page_alloc.c |   29 -
>  1 file changed, 16 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b99643d4..e6fee4b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -441,18 +441,28 @@ static int __init debug_guardpage_minorder_setup(char 
> *buf)
>  }
>  __setup("debug_guardpage_minorder=", debug_guardpage_minorder_setup);
>  
> -static inline void set_page_guard_flag(struct page *page)
> +static inline void set_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype)
>  {
>   __set_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
> + set_page_private(page, order);
> + /* Guard pages are not available for any usage */
> + __mod_zone_freepage_state(zone, -(1 << order), migratetype);
>  }
>  
> -static inline void clear_page_guard_flag(struct page *page)
> +static inline void clear_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype)
>  {
>   __clear_bit(PAGE_DEBUG_FLAG_GUARD, &page->debug_flags);
> + set_page_private(page, 0);
> + if (!is_migrate_isolate(migratetype))
> + __mod_zone_freepage_state(zone, (1 << order), migratetype);
>  }
>  #else
> -static inline void set_page_guard_flag(struct page *page) { }
> -static inline void clear_page_guard_flag(struct page *page) { }
> +static inline void set_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype) {}
> +static inline void clear_page_guard(struct zone *zone, struct page *page,
> + unsigned int order, int migratetype) {}
>  #endif
>  
>  static inline void set_page_order(struct page *page, unsigned int order)
> @@ -594,10 +604,7 @@ static inline void __free_one_page(struct page *page,
>* merge with it and move up one order.
>*/
>   if (page_is_guard(buddy)) {
> - clear_page_guard_flag(buddy);
> - set_page_private(page, 0);
> - __mod_zone_freepage_state(zone, 1 << order,
> -   migratetype);
> + clear_page_guard(zone, buddy, order, migratetype);
>   } else {
>   list_del(&buddy->lru);
>   zone->free_area[order].nr_free--;
> @@ -876,11 +883,7 @@ static inline void expand(struct zone *zone, struct page 
> *page,
>* pages will stay not present in virtual address space
>*/
>   INIT_LIST_HEAD(&page[size].lru);
> - set_page_guard_flag(&page[size]);
> - set_page_private(&page[size], high);
> - /* Guard pages are not available for any usage */
> - __mod_zone_freepage_state(zone, -(1 << high),
> -   migratetype);
> + set_page_guard(zone, &page[size], high, migratetype);
>   continue;
>   }
>  #endif
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 0/8] fix freepage count problems in memory isolation

2014-08-06 Thread Zhang Yanfei

Hi Joonsoo,

The first 3 patches in this patchset are in a bit of mess.

On 08/06/2014 03:18 PM, Joonsoo Kim wrote:
> Hello,
> 
> This patchset aims at fixing problems during memory isolation found by
> testing my patchset [1].
> 
> These are really subtle problems so I can be wrong. If you find what I am
> missing, please let me know.
> 
> Before describing bugs itself, I first explain definition of freepage.
> 
> 1. pages on buddy list are counted as freepage.
> 2. pages on isolate migratetype buddy list are *not* counted as freepage.
> 3. pages on cma buddy list are counted as CMA freepage, too.
> 4. pages for guard are *not* counted as freepage.
> 
> Now, I describe problems and related patch.
> 
> Patch 1: If guard page are cleared and merged into isolate buddy list,
> we should not add freepage count.
> 
> Patch 4: There is race conditions that results in misplacement of free
> pages on buddy list. Then, it results in incorrect freepage count and
> un-availability of freepage.
> 
> Patch 5: To count freepage correctly, we should prevent freepage from
> being added to buddy list in some period of isolation. Without it, we
> cannot be sure if the freepage is counted or not and miscount number
> of freepage.
> 
> Patch 7: In spite of above fixes, there is one more condition for
> incorrect freepage count. pageblock isolation could be done in pageblock
> unit  so we can't prevent freepage from merging with page on next
> pageblock. To fix it, start_isolate_page_range() and
> undo_isolate_page_range() is modified to process whole range at one go.
> With this change, if input parameter of start_isolate_page_range() and
> undo_isolate_page_range() is properly aligned, there is no condition for
> incorrect merging.
> 
> Without patchset [1], above problem doesn't happens on my CMA allocation
> test, because CMA reserved pages aren't used at all. So there is no
> chance for above race.
> 
> With patchset [1], I did simple CMA allocation test and get below result.
> 
> - Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
> - run kernel build (make -j16) on background
> - 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
> - Result: more than 5000 freepage count are missed
> 
> With patchset [1] and this patchset, I found that no freepage count are
> missed so that I conclude that problems are solved.
> 
> These problems can be possible on memory hot remove users, although
> I didn't check it further.
> 
> This patchset is based on linux-next-20140728.
> Please see individual patches for more information.
> 
> Thanks.
> 
> [1]: Aggressively allocate the pages on cma reserved memory
>  https://lkml.org/lkml/2014/5/30/291
> 
> Joonsoo Kim (8):
>   mm/page_alloc: correct to clear guard attribute in DEBUG_PAGEALLOC
>   mm/isolation: remove unstable check for isolated page
>   mm/page_alloc: fix pcp high, batch management
>   mm/isolation: close the two race problems related to pageblock
> isolation
>   mm/isolation: change pageblock isolation logic to fix freepage
> counting bugs
>   mm/isolation: factor out pre/post logic on
> set/unset_migratetype_isolate()
>   mm/isolation: fix freepage counting bug on
> start/undo_isolat_page_range()
>   mm/isolation: remove useless race handling related to pageblock
> isolation
> 
>  include/linux/page-isolation.h |2 +
>  mm/internal.h      |5 +
>  mm/page_alloc.c|  223 +-
>  mm/page_isolation.c|  292 
> +++-
>  4 files changed, 368 insertions(+), 154 deletions(-)
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] CMA/HOTPLUG: clear buffer-head lru before page migration

2014-07-18 Thread Zhang Yanfei

Hello,

On 07/18/2014 04:23 PM, Gioh Kim wrote:
> 
> 
> 2014-07-18 오후 4:50, Marek Szyprowski 쓴 글:
>> Hello,
>>
>> On 2014-07-18 08:45, Gioh Kim wrote:
>>> For page migration of CMA, buffer-heads of lru should be dropped.
>>> Please refer to https://lkml.org/lkml/2014/7/4/101 for the history.
>>>
>>> I have two solution to drop bhs.
>>> One is invalidating entire lru.
>>> Another is searching the lru and dropping only one bh that Laura proposed
>>> at https://lkml.org/lkml/2012/8/31/313.
>>>
>>> I'm not sure which has better performance.
>>> So I did performance test on my cortex-a7 platform with Lmbench
>>> that has "File & VM system latencies" test.
>>> I am attaching the results.
>>> The first line is of invalidating entire lru and the second is dropping 
>>> selected bh.
>>>
>>> File & VM system latencies in microseconds - smaller is better
>>> ---
>>> Host OS   0K File  10K File MmapProt   Page   
>>> 100fd
>>>  Create Delete Create Delete Latency Fault  Fault  
>>> selct
>>> - - -- -- -- -- --- - --- 
>>> -
>>> 10.178.33 Linux 3.10.19   25.1   19.6   32.6   19.7  5098.0 0.666 3.45880 
>>> 6.506
>>> 10.178.33 Linux 3.10.19   24.9   19.5   32.3   19.4  5059.0 0.563 3.46380 
>>> 6.521
>>>
>>>
>>> I tried several times but the result tells that they are the same under 1% 
>>> gap
>>> except Protection Fault.
>>> But the latency of Protection Fault is very small and I think it has little 
>>> effect.
>>>
>>> Therefore we can choose anything but I choose invalidating entire lru.
>>> The try_to_free_buffers() which is calling drop_buffers() is called by many 
>>> filesystem code.
>>> So I think inserting codes in drop_buffers() can affect the system.
>>> And also we cannot distinguish migration type in drop_buffers().
>>>
>>> In alloc_contig_range() we can distinguish migration type and invalidate 
>>> lru if it needs.
>>> I think alloc_contig_range() is proper to deal with bh like following patch.
>>>
>>> Laura, can I have you name on Acked-by line?
>>> Please let me represent my thanks.
>>>
>>> Thanks for any feedback.
>>>
>>> --- 8< --
>>>
>>> >From 33c894b1bab9bc26486716f0c62c452d3a04d35d Mon Sep 17 00:00:00 2001
>>> From: Gioh Kim 
>>> Date: Fri, 18 Jul 2014 13:40:01 +0900
>>> Subject: [PATCH] CMA/HOTPLUG: clear buffer-head lru before page migration
>>>
>>> The bh must be free to migrate a page at which bh is mapped.
>>> The reference count of bh is increased when it is installed
>>> into lru so that the bh of lru must be freed before migrating the page.
>>>
>>> This frees every bh of lru. We could free only bh of migrating page.
>>> But searching lru costs more than invalidating entire lru.
>>>
>>> Signed-off-by: Gioh Kim 
>>> Acked-by: Laura Abbott 
>>> ---
>>>   mm/page_alloc.c |3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index b99643d4..3b474e0 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -6369,6 +6369,9 @@ int alloc_contig_range(unsigned long start, unsigned 
>>> long end,
>>>  if (ret)
>>>  return ret;
>>>
>>> +   if (migratetype == MIGRATE_CMA || migratetype == MIGRATE_MOVABLE)
>>
>> I'm not sure if it really makes sense to check the migratetype here. This 
>> check
>> doesn't add any new information to the code and make false impression that 
>> this
>> function can be called for other migratetypes than CMA or MOVABLE. Even if 
>> so,
>> then invalidating bh_lrus unconditionally will make more sense, IMHO.
> 
> I agree. I cannot understand why alloc_contig_range has an argument of 
> migratetype.
> Can the alloc_contig_range is called for other migrate type than CMA/MOVABLE?
> 
> What do you think about removing the argument of migratetype and
> checking migratetype (if (migratetype == MIGRATE_CMA || migratetype == 
> MIGRATE_MOVABLE))?
> 

Remove the checking only. Because gigantic page allocation used for hugetlb is
using alloc_contig_range(.. MIGRATE_MOVABLE).

Thanks.

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/5] memory-hotplug: suitable memory should go to ZONE_MOVABLE

2014-07-18 Thread Zhang Yanfei

Hello,

On 07/18/2014 03:55 PM, Wang Nan wrote:
> This series of patches fix a problem when adding memory in bad manner.
> For example: for a x86_64 machine booted with "mem=400M" and with 2GiB
> memory installed, following commands cause problem:
> 
>  # echo 0x4000 > /sys/devices/system/memory/probe
> [   28.613895] init_memory_mapping: [mem 0x4000-0x47ff]
>  # echo 0x4800 > /sys/devices/system/memory/probe
> [   28.693675] init_memory_mapping: [mem 0x4800-0x4fff]
>  # echo online_movable > /sys/devices/system/memory/memory9/state
>  # echo 0x5000 > /sys/devices/system/memory/probe 
> [   29.084090] init_memory_mapping: [mem 0x5000-0x57ff]
>  # echo 0x5800 > /sys/devices/system/memory/probe 
> [   29.151880] init_memory_mapping: [mem 0x5800-0x5fff]
>  # echo online_movable > /sys/devices/system/memory/memory11/state
>  # echo online> /sys/devices/system/memory/memory8/state
>  # echo online> /sys/devices/system/memory/memory10/state
>  # echo offline> /sys/devices/system/memory/memory9/state
> [   30.558819] Offlined Pages 32768
>  # free
>  total   used   free sharedbuffers cached
> Mem:780588 18014398509432020 830552  0  0  
> 51180
> -/+ buffers/cache: 18014398509380840 881732
> Swap:0  0  0
> 
> This is because the above commands probe higher memory after online a
> section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
> for systems without ZONE_HIGHMEM) overlaps ZONE_MOVABLE.

Yeah, this is rare in reality but can happen. Could you please also
include the free result and zoneinfo after applying your patch?

Thanks.

> 
> After the second online_movable, the problem can be observed from
> zoneinfo:
> 
>  # cat /proc/zoneinfo
> ...
> Node 0, zone  Movable
>   pages free 65491
> min  250
> low  312
> high 375
> scanned  0
> spanned  18446744073709518848
> present  65536
> managed  65536
> ...
> 
> This series of patches solve the problem by checking ZONE_MOVABLE when
> choosing zone for new memory. If new memory is inside or higher than
> ZONE_MOVABLE, makes it go there instead.
> 
> 
> Wang Nan (5):
>   memory-hotplug: x86_64: suitable memory should go to ZONE_MOVABLE
>   memory-hotplug: x86_32: suitable memory should go to ZONE_MOVABLE
>   memory-hotplug: ia64: suitable memory should go to ZONE_MOVABLE
>   memory-hotplug: sh: suitable memory should go to ZONE_MOVABLE
>   memory-hotplug: powerpc: suitable memory should go to ZONE_MOVABLE
> 
>  arch/ia64/mm/init.c   |  7 +++
>  arch/powerpc/mm/mem.c |  6 ++
>  arch/sh/mm/init.c | 13 -
>  arch/x86/mm/init_32.c |  6 ++
>  arch/x86/mm/init_64.c | 10 --
>  5 files changed, 35 insertions(+), 7 deletions(-)
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v11 2/7] x86: add pmd_[dirty|mkclean] for THP

2014-07-07 Thread Zhang Yanfei

On 07/08/2014 02:03 PM, Minchan Kim wrote:
> MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent
> overwrite of the contents since MADV_FREE syscall is called for
> THP page.
> 
> This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
> support.
> 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Acked-by: Kirill A. Shutemov 
> Signed-off-by: Minchan Kim 

Acked-by: Zhang Yanfei 

> ---
>  arch/x86/include/asm/pgtable.h | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 0ec056012618..329865799653 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -104,6 +104,11 @@ static inline int pmd_young(pmd_t pmd)
>   return pmd_flags(pmd) & _PAGE_ACCESSED;
>  }
>  
> +static inline int pmd_dirty(pmd_t pmd)
> +{
> + return pmd_flags(pmd) & _PAGE_DIRTY;
> +}
> +
>  static inline int pte_write(pte_t pte)
>  {
>   return pte_flags(pte) & _PAGE_RW;
> @@ -267,6 +272,11 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
>   return pmd_clear_flags(pmd, _PAGE_ACCESSED);
>  }
>  
> +static inline pmd_t pmd_mkclean(pmd_t pmd)
> +{
> + return pmd_clear_flags(pmd, _PAGE_DIRTY);
> +}
> +
>  static inline pmd_t pmd_wrprotect(pmd_t pmd)
>  {
>   return pmd_clear_flags(pmd, _PAGE_RW);
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v11 1/7] mm: support madvise(MADV_FREE)

2014-07-07 Thread Zhang Yanfei

On 07/08/2014 02:03 PM, Minchan Kim wrote:
> Linux doesn't have an ability to free pages lazy while other OS
> already have been supported that named by madvise(MADV_FREE).
> 
> The gain is clear that kernel can discard freed pages rather than
> swapping out or OOM if memory pressure happens.
> 
> Without memory pressure, freed pages would be reused by userspace
> without another additional overhead(ex, page fault + allocation
> + zeroing).
> 
> How to work is following as.
> 
> When madvise syscall is called, VM clears dirty bit of ptes of
> the range. If memory pressure happens, VM checks dirty bit of
> page table and if it found still "clean", it means it's a
> "lazyfree pages" so VM could discard the page instead of swapping out.
> Once there was store operation for the page before VM peek a page
> to reclaim, dirty bit is set so VM can swap out the page instead of
> discarding.
> 
> Firstly, heavy users would be general allocators(ex, jemalloc,
> tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
> have supported the feature for other OS(ex, FreeBSD)
> 
> barrios@blaptop:~/benchmark/ebizzy$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):4
> On-line CPU(s) list:   0-3
> Thread(s) per core:2
> Core(s) per socket:2
> Socket(s): 1
> NUMA node(s):  1
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 42
> Stepping:  7
> CPU MHz:   2801.000
> BogoMIPS:  5581.64
> Virtualization:VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache:  256K
> L3 cache:  4096K
> NUMA node0 CPU(s): 0-3
> 
> ebizzy benchmark(./ebizzy -S 10 -n 512)
> 
>  vanilla-jemalloc MADV_free-jemalloc
> 
> 1 thread
> records:  10  records:  10
> avg:  7682.10 avg:  15306.10
> std:  62.35(0.81%)std:  347.99(2.27%)
> max:  7770.00 max:  15622.00
> min:  7598.00 min:  14772.00
> 
> 2 thread
> records:  10  records:  10
> avg:  12747.50avg:  24171.00
> std:  792.06(6.21%)   std:  895.18(3.70%)
> max:  13337.00max:  26023.00
> min:  10535.00min:  23152.00
> 
> 4 thread
> records:  10  records:  10
> avg:  16474.60avg:  33717.90
> std:  1496.45(9.08%)  std:  2008.97(5.96%)
> max:  17877.00max:  35958.00
> min:  12224.00min:  29565.00
> 
> 8 thread
> records:  10  records:  10
> avg:  16778.50avg:  33308.10
> std:  825.53(4.92%)   std:  1668.30(5.01%)
> max:  17543.00max:  36010.00
> min:  14576.00min:  29577.00
> 
> 16 thread
> records:  10  records:  10
> avg:  20614.40avg:  35516.30
> std:  602.95(2.92%)   std:  1283.65(3.61%)
> max:  21753.00max:  37178.00
> min:  19605.00min:  33217.00
> 
> 32 thread
> records:  10  records:  10
> avg:  22771.70avg:  36018.50
> std:  598.94(2.63%)   std:  1046.76(2.91%)
> max:  24035.00    max:  37266.00
> min:  22108.00min:  34149.00
> 
> In summary, MADV_FREE is about 2 time faster than MADV_DONTNEED.
> 
> Cc: Michael Kerrisk 
> Cc: Linux API 
> Cc: Hugh Dickins 
> Cc: Johannes Weiner 
> Cc: KOSAKI Motohiro 
> Cc: Mel Gorman 
> Cc: Jason Evans 
> Cc: Zhang Yanfei 
> Acked-by: Rik van Riel 
> Signed-off-by: Minchan Kim 

A quick respin, looks good to me now for this !THP part. And
looks neat with the Pagewalker.

Acked-by: Zhang Yanfei 

> ---
>  include/linux/rmap.h   |   9 ++-
>  include/linux/vm_event_item.h  |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/madvise.c   | 135 
> +
>  mm/rmap.c  |  42 +-
>  mm/vmscan.c|  40 --
>  mm/vmstat.c|   1 +
>  7 files changed, 217 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index be574506e6a9..0ba377b97a38 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -75,6 +75,7 @@ enum ttu_flags {
>   TTU_UNMAP = 1,  /* unmap mode */
>   TTU_MIGRATION = 2,  /* migration mode */
>   TTU_MUN

Re: [PATCH v10 1/7] mm: support madvise(MADV_FREE)

2014-07-07 Thread Zhang Yanfei

Hi Minchan,

On 07/07/2014 08:53 AM, Minchan Kim wrote:
> Linux doesn't have an ability to free pages lazy while other OS
> already have been supported that named by madvise(MADV_FREE).
> 
> The gain is clear that kernel can discard freed pages rather than
> swapping out or OOM if memory pressure happens.
> 
> Without memory pressure, freed pages would be reused by userspace
> without another additional overhead(ex, page fault + allocation
> + zeroing).
> 
> How to work is following as.
> 
> When madvise syscall is called, VM clears dirty bit of ptes of
> the range. 

This should be updated because the implementation has been changed.
It also remove the page from the swapcache if it is.

Thank you for your effort!

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

2014-06-23 Thread Zhang Yanfei

Hello Minchan

Thank you for your explain. Actually, I read the kernel with an old
version. The latest upstream kernel has the behaviour like you described
below. Oops, how long didn't I follow the buddy allocator change.

Thanks.

On 06/24/2014 07:35 AM, Minchan Kim wrote:
>>> Anyway, most big concern is that you are changing current behavior as
>>> > > I said earlier.
>>> > > 
>>> > > Old behavior in THP page fault when it consumes own timeslot was just
>>> > > abort and fallback 4K page but with your patch, new behavior is
>>> > > take a rest when it founds need_resched and goes to another round with
>>> > > async, not sync compaction. I'm not sure we need another round with
>>> > > async compaction at the cost of increasing latency rather than fallback
>>> > > 4 page.
>> > 
>> > I don't see the new behavior works like what you said. If need_resched
>> > is true, it calls cond_resched() and after a rest it just breaks the loop.
>> > Why there is another round with async compact?
> One example goes
> 
> Old:
> page fault
> huge page allocation
> __alloc_pages_slowpath
> __alloc_pages_direct_compact
> compact_zone_order
> isolate_migratepages
> compact_checklock_irqsave
> need_resched is true
> cc->contended = true;
> return ISOLATE_ABORT
> return COMPACT_PARTIAL with *contented = cc.contended;
> COMPACTFAIL
> if (contended_compaction && gfp_mask & __GFP_NO_KSWAPD)
> goto nopage;
> 
> New:
> 
> page fault
> huge page allocation
> __alloc_pages_slowpath
> __alloc_pages_direct_compact
> compact_zone_order
> isolate_migratepages
> compact_unlock_should_abort
> need_resched is true
> cc->contended = COMPACT_CONTENDED_SCHED;
> return true;
> return ISOLATE_ABORT
> return COMPACT_PARTIAL with *contended = cc.contended == 
> COMPACT_CONTENDED_LOCK (1)
> COMPACTFAIL
> if (contended_compaction && gfp_mask & __GFP_NO_KSWAPD)
> no goto nopage because contended_compaction was false by (1)
> 
> __alloc_pages_direct_reclaim
> if (should_alloc_retry)
> else
> __alloc_pages_direct_compact again with ASYNC_MODE
> 
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

2014-06-23 Thread Zhang Yanfei

On 06/23/2014 05:52 PM, Vlastimil Babka wrote:
> On 06/23/2014 07:39 AM, Zhang Yanfei wrote:
>> Hello
>>
>> On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote:
>>> On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
>>>> When allocating huge page for collapsing, khugepaged currently holds 
>>>> mmap_sem
>>>> for reading on the mm where collapsing occurs. Afterwards the read lock is
>>>> dropped before write lock is taken on the same mmap_sem.
>>>>
>>>> Holding mmap_sem during whole huge page allocation is therefore useless, 
>>>> the
>>>> vma needs to be rechecked after taking the write lock anyway. Furthemore, 
>>>> huge
>>>> page allocation might involve a rather long sync compaction, and thus block
>>>> any mmap_sem writers and i.e. affect workloads that perform frequent 
>>>> m(un)map
>>>> or mprotect oterations.
>>>>
>>>> This patch simply releases the read lock before allocating a huge page. It
>>>> also deletes an outdated comment that assumed vma must be stable, as it was
>>>> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
>>>> ("mm: thp: khugepaged: add policy for finding target node").
>>>
>>> There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
>>> all. Please, move up_read() outside khugepaged_alloc_page().
>>>
> 
> Well there's also currently no point in passing several parameters to 
> khugepaged_alloc_page(). So I could clean it up as well, but I imagine later 
> we would perhaps reintroduce them back, as I don't think the current 
> situation is ideal for at least two reasons.
> 
> 1. If you read commit 9f1b868a13 ("mm: thp: khugepaged: add policy for 
> finding target node"), it's based on a report where somebody found that 
> mempolicy is not observed properly when collapsing THP's. But the 'policy' 
> introduced by the commit isn't based on real mempolicy, it might just under 
> certain conditions results in an interleave, which happens to be what the 
> reporter was trying.
> 
> So ideally, it should be making node allocation decisions based on where the 
> original 4KB pages are located. For example, allocate a THP only if all the 
> 4KB pages are on the same node. That would also automatically obey any policy 
> that has lead to the allocation of those 4KB pages.
> 
> And for this, it will need again the parameters and mmap_sem in read mode. It 
> would be however still a good idea to drop mmap_sem before the allocation 
> itself, since compaction/reclaim might take some time...
> 
> 2. (less related) I'd expect khugepaged to first allocate a hugepage and then 
> scan for collapsing. Yes there's khugepaged_prealloc_page, but that only does 
> something on !NUMA systems and these are not the future.
> Although I don't have the data, I expect allocating a hugepage is a bigger 
> issue than finding something that could be collapsed. So why scan for 
> collapsing if in the end I cannot allocate a hugepage? And if I really cannot 
> find something to collapse, would e.g. caching a single hugepage per node be 
> a big hit? Also, if there's really nothing to collapse, then it means 
> khugepaged won't compact. And since khugepaged is becoming the only source of 
> sync compaction that doesn't give up easily and tries to e.g. migrate movable 
> pages out of unmovable pageblocks, this might have bad effects on 
> fragmentation.
> I believe this could be done smarter.
> 
>> I might be wrong. If we up_read in khugepaged_scan_pmd(), then if we round 
>> again
>> do the for loop to get the next vma and handle it. Does we do this without 
>> holding
>> the mmap_sem in any mode?
>>
>> And if the loop end, we have another up_read in breakouterloop. What if we 
>> have
>> released the mmap_sem in collapse_huge_page()?
> 
> collapse_huge_page() is only called from khugepaged_scan_pmd() in the if 
> (ret) condition. And khugepaged_scan_mm_slot() has similar if (ret) for the 
> return value of khugepaged_scan_pmd() to break out of the loop (and not doing 
> up_read() again). So I think this is correct and moving up_read from 
> khugepaged_alloc_page() to collapse_huge_page() wouldn't
> change this?

Ah, right.

> 
> 
> .
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 11/13] mm, compaction: pass gfp mask to compact_control

2014-06-23 Thread Zhang Yanfei

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> From: David Rientjes 
> 
> struct compact_control currently converts the gfp mask to a migratetype, but 
> we
> need the entire gfp mask in a follow-up patch.
> 
> Pass the entire gfp mask as part of struct compact_control.
> 
> Signed-off-by: David Rientjes 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 12 +++-
>  mm/internal.h   |  2 +-
>  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 32c768b..d4e0c13 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -975,8 +975,8 @@ static isolate_migrate_t isolate_migratepages(struct zone 
> *zone,
>   return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
>  }
>  
> -static int compact_finished(struct zone *zone,
> - struct compact_control *cc)
> +static int compact_finished(struct zone *zone, struct compact_control *cc,
> + const int migratetype)
>  {
>   unsigned int order;
>   unsigned long watermark;
> @@ -1022,7 +1022,7 @@ static int compact_finished(struct zone *zone,
>   struct free_area *area = &zone->free_area[order];
>  
>   /* Job done if page is free of the right migratetype */
> - if (!list_empty(&area->free_list[cc->migratetype]))
> + if (!list_empty(&area->free_list[migratetype]))
>   return COMPACT_PARTIAL;
>  
>   /* Job done if allocation would set block type */
> @@ -1088,6 +1088,7 @@ static int compact_zone(struct zone *zone, struct 
> compact_control *cc)
>   int ret;
>   unsigned long start_pfn = zone->zone_start_pfn;
>   unsigned long end_pfn = zone_end_pfn(zone);
> + const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
>   const bool sync = cc->mode != MIGRATE_ASYNC;
>  
>   ret = compaction_suitable(zone, cc->order);
> @@ -1130,7 +1131,8 @@ static int compact_zone(struct zone *zone, struct 
> compact_control *cc)
>  
>   migrate_prep_local();
>  
> - while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
> + while ((ret = compact_finished(zone, cc, migratetype)) ==
> + COMPACT_CONTINUE) {
>   int err;
>  
>   switch (isolate_migratepages(zone, cc)) {
> @@ -1185,7 +1187,7 @@ static unsigned long compact_zone_order(struct zone 
> *zone, int order,
>   .nr_freepages = 0,
>   .nr_migratepages = 0,
>   .order = order,
> - .migratetype = gfpflags_to_migratetype(gfp_mask),
> + .gfp_mask = gfp_mask,
>   .zone = zone,
>   .mode = mode,
>   };
> diff --git a/mm/internal.h b/mm/internal.h
> index 584cd69..dd17a40 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -149,7 +149,7 @@ struct compact_control {
>   bool finished_update_migrate;
>  
>   int order;  /* order a direct compactor needs */
> - int migratetype;/* MOVABLE, RECLAIMABLE etc */
> + const gfp_t gfp_mask;   /* gfp mask of a direct compactor */
>   struct zone *zone;
>   enum compact_contended contended; /* Signal need_sched() or lock
>  * contention detected during
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 09/13] mm, compaction: skip buddy pages by their order in the migrate scanner

2014-06-23 Thread Zhang Yanfei

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> The migration scanner skips PageBuddy pages, but does not consider their order
> as checking page_order() is generally unsafe without holding the zone->lock,
> and acquiring the lock just for the check wouldn't be a good tradeoff.
> 
> Still, this could avoid some iterations over the rest of the buddy page, and
> if we are careful, the race window between PageBuddy() check and page_order()
> is small, and the worst thing that can happen is that we skip too much and 
> miss
> some isolation candidates. This is not that bad, as compaction can already 
> fail
> for many other reasons like parallel allocations, and those have much larger
> race window.
> 
> This patch therefore makes the migration scanner obtain the buddy page order
> and use it to skip the whole buddy page, if the order appears to be in the
> valid range.
> 
> It's important that the page_order() is read only once, so that the value used
> in the checks and in the pfn calculation is the same. But in theory the
> compiler can replace the local variable by multiple inlines of page_order().
> Therefore, the patch introduces page_order_unsafe() that uses ACCESS_ONCE to
> prevent this.
> 
> Testing with stress-highalloc from mmtests shows a 15% reduction in number of
> pages scanned by migration scanner. This change is also a prerequisite for a
> later patch which is detecting when a cc->order block of pages contains
> non-buddy pages that cannot be isolated, and the scanner should thus skip to
> the next block immediately.
> 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 

Fair enough.

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 36 +++-
>  mm/internal.h   | 16 +++-
>  2 files changed, 46 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 41c7005..df0961b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -270,8 +270,15 @@ static inline bool compact_should_abort(struct 
> compact_control *cc)
>  static bool suitable_migration_target(struct page *page)
>  {
>   /* If the page is a large free page, then disallow migration */
> - if (PageBuddy(page) && page_order(page) >= pageblock_order)
> - return false;
> + if (PageBuddy(page)) {
> + /*
> +  * We are checking page_order without zone->lock taken. But
> +  * the only small danger is that we skip a potentially suitable
> +  * pageblock, so it's not worth to check order for valid range.
> +  */
> + if (page_order_unsafe(page) >= pageblock_order)
> + return false;
> + }
>  
>   /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
>   if (migrate_async_suitable(get_pageblock_migratetype(page)))
> @@ -591,11 +598,23 @@ isolate_migratepages_range(struct zone *zone, struct 
> compact_control *cc,
>   valid_page = page;
>  
>   /*
> -  * Skip if free. page_order cannot be used without zone->lock
> -  * as nothing prevents parallel allocations or buddy merging.
> +  * Skip if free. We read page order here without zone lock
> +  * which is generally unsafe, but the race window is small and
> +  * the worst thing that can happen is that we skip some
> +  * potential isolation targets.
>*/
> - if (PageBuddy(page))
> + if (PageBuddy(page)) {
> + unsigned long freepage_order = page_order_unsafe(page);
> +
> + /*
> +  * Without lock, we cannot be sure that what we got is
> +  * a valid page order. Consider only values in the
> +  * valid order range to prevent low_pfn overflow.
> +  */
> + if (freepage_order > 0 && freepage_order < MAX_ORDER)
> + low_pfn += (1UL << freepage_order) - 1;
>   continue;
> + }
>  
>   /*
>* Check may be lockless but that's ok as we recheck later.
> @@ -683,6 +702,13 @@ next_pageblock:
>   low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
>   }
>  
> + /*
> +  * The PageBuddy() check could have potentially brought us outside
> +  * the range to be scanned.
> +  */
> +

Re: [PATCH v3 08/13] mm, compaction: remember position within pageblock in free pages scanner

2014-06-23 Thread Zhang Yanfei

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> Unlike the migration scanner, the free scanner remembers the beginning of the
> last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
> uselessly when called several times during single compaction. This might have
> been useful when pages were returned to the buddy allocator after a failed
> migration, but this is no longer the case.
> 
> This patch changes the meaning of cc->free_pfn so that if it points to a
> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> end. isolate_freepages_block() will record the pfn of the last page it looked
> at, which is then used to update cc->free_pfn.
> 
> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> page, to 2.25 free pages per migrate page, without affecting success rates.
> 
> Signed-off-by: Vlastimil Babka 
> Acked-by: David Rientjes 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: Zhang Yanfei 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 40 +++-
>  1 file changed, 31 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9f6e857..41c7005 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -287,7 +287,7 @@ static bool suitable_migration_target(struct page *page)
>   * (even though it may still end up isolating some pages).
>   */
>  static unsigned long isolate_freepages_block(struct compact_control *cc,
> - unsigned long blockpfn,
> + unsigned long *start_pfn,
>   unsigned long end_pfn,
>   struct list_head *freelist,
>   bool strict)
> @@ -296,6 +296,7 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   struct page *cursor, *valid_page = NULL;
>   unsigned long flags;
>   bool locked = false;
> + unsigned long blockpfn = *start_pfn;
>  
>   cursor = pfn_to_page(blockpfn);
>  
> @@ -369,6 +370,9 @@ isolate_fail:
>   break;
>   }
>  
> + /* Record how far we have got within the block */
> + *start_pfn = blockpfn;
> +
>   trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
>  
>   /*
> @@ -413,6 +417,9 @@ isolate_freepages_range(struct compact_control *cc,
>   LIST_HEAD(freelist);
>  
>   for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
> + /* Protect pfn from changing by isolate_freepages_block */
> + unsigned long isolate_start_pfn = pfn;
> +
>   if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
>   break;
>  
> @@ -423,8 +430,8 @@ isolate_freepages_range(struct compact_control *cc,
>   block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
>   block_end_pfn = min(block_end_pfn, end_pfn);
>  
> - isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
> -&freelist, true);
> + isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> + block_end_pfn, &freelist, true);
>  
>   /*
>* In strict mode, isolate_freepages_block() returns 0 if
> @@ -708,6 +715,7 @@ static void isolate_freepages(struct zone *zone,
>  {
>   struct page *page;
>   unsigned long block_start_pfn;  /* start of current pageblock */
> + unsigned long isolate_start_pfn; /* exact pfn we start at */
>   unsigned long block_end_pfn;/* end of current pageblock */
>   unsigned long low_pfn;   /* lowest pfn scanner is able to scan */
>   int nr_freepages = cc->nr_freepages;
> @@ -716,14 +724,15 @@ static void isolate_freepages(struct zone *zone,
>   /*
>* Initialise the free scanner. The starting point is where we last
>* successfully isolated from, zone-cached value, or the end of the
> -  * zone when isolating for the first time. We need this aligned to
> -  * the pageblock boundary, because we do
> +  * zone when isolating for the first time. For looping we also need
> +  * this pfn aligned down to the pageblock boundary, because we do
>* block_start_pfn -= pageblock_nr_pages in the for loop.
>* For ending point, take care when isolating in last pageblock of a
>

Re: [PATCH v3 07/13] mm, compaction: skip rechecks when lock was already held

2014-06-23 Thread Zhang Yanfei

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> Compaction scanners try to lock zone locks as late as possible by checking
> many page or pageblock properties opportunistically without lock and skipping
> them if not unsuitable. For pages that pass the initial checks, some 
> properties
> have to be checked again safely under lock. However, if the lock was already
> held from a previous iteration in the initial checks, the rechecks are
> unnecessary.
> 
> This patch therefore skips the rechecks when the lock was already held. This 
> is
> now possible to do, since we don't (potentially) drop and reacquire the lock
> between the initial checks and the safe rechecks anymore.
> 
> Signed-off-by: Vlastimil Babka 
> Acked-by: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Acked-by: David Rientjes 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 53 +++--
>  1 file changed, 31 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 40da812..9f6e857 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -324,22 +324,30 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   goto isolate_fail;
>  
>   /*
> -  * The zone lock must be held to isolate freepages.
> -  * Unfortunately this is a very coarse lock and can be
> -  * heavily contended if there are parallel allocations
> -  * or parallel compactions. For async compaction do not
> -  * spin on the lock and we acquire the lock as late as
> -  * possible.
> +  * If we already hold the lock, we can skip some rechecking.
> +  * Note that if we hold the lock now, checked_pageblock was
> +  * already set in some previous iteration (or strict is true),
> +  * so it is correct to skip the suitable migration target
> +  * recheck as well.
>*/
> - if (!locked)
> + if (!locked) {
> + /*
> +  * The zone lock must be held to isolate freepages.
> +  * Unfortunately this is a very coarse lock and can be
> +  * heavily contended if there are parallel allocations
> +  * or parallel compactions. For async compaction do not
> +  * spin on the lock and we acquire the lock as late as
> +  * possible.
> +  */
>   locked = compact_trylock_irqsave(&cc->zone->lock,
>   &flags, cc);
> - if (!locked)
> - break;
> + if (!locked)
> + break;
>  
> - /* Recheck this is a buddy page under lock */
> - if (!PageBuddy(page))
> - goto isolate_fail;
> + /* Recheck this is a buddy page under lock */
> + if (!PageBuddy(page))
> + goto isolate_fail;
> + }
>  
>   /* Found a free page, break it into order-0 pages */
>   isolated = split_free_page(page);
> @@ -623,19 +631,20 @@ isolate_migratepages_range(struct zone *zone, struct 
> compact_control *cc,
>   page_count(page) > page_mapcount(page))
>   continue;
>  
> - /* If the lock is not held, try to take it */
> - if (!locked)
> + /* If we already hold the lock, we can skip some rechecking */
> + if (!locked) {
>   locked = compact_trylock_irqsave(&zone->lru_lock,
>   &flags, cc);
> - if (!locked)
> - break;
> + if (!locked)
> + break;
>  
> - /* Recheck PageLRU and PageTransHuge under lock */
> - if (!PageLRU(page))
> - continue;
> - if (PageTransHuge(page)) {
> - low_pfn += (1 << compound_order(page)) - 1;
> - continue;
> + /* Recheck PageLRU and PageTransHuge under lock */
> + if (!PageLRU(page))
> +     continue;
> + if (PageTransHuge(page)) {
> + low_pfn += (1 << compound_order(page)) - 1;
> +

Re: [PATCH v3 06/13] mm, compaction: periodically drop lock and restore IRQs in scanners

2014-06-23 Thread Zhang Yanfei

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> Compaction scanners regularly check for lock contention and need_resched()
> through the compact_checklock_irqsave() function. However, if there is no
> contention, the lock can be held and IRQ disabled for potentially long time.
> 
> This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
> time IRQs are disabled while isolating pages for migration") for the migration
> scanner. However, the refactoring done by commit 748446bb6b ("mm: compaction:
> acquire the zone->lru_lock as late as possible") has changed the conditions so
> that the lock is dropped only when there's contention on the lock or
> need_resched() is true. Also, need_resched() is checked only when the lock is
> already held. The comment "give a chance to irqs before checking need_resched"
> is therefore misleading, as IRQs remain disabled when the check is done.
> 
> This patch restores the behavior intended by commit b2eef8c0d0 and also tries
> to better balance and make more deterministic the time spent by checking for
> contention vs the time the scanners might run between the checks. It also
> avoids situations where checking has not been done often enough before. The
> result should be avoiding both too frequent and too infrequent contention
> checking, and especially the potentially long-running scans with IRQs disabled
> and no checking of need_resched() or for fatal signal pending, which can 
> happen
> when many consecutive pages or pageblocks fail the preliminary tests and do 
> not
> reach the later call site to compact_checklock_irqsave(), as explained below.
> 
> Before the patch:
> 
> In the migration scanner, compact_checklock_irqsave() was called each loop, if
> reached. If not reached, some lower-frequency checking could still be done if
> the lock was already held, but this would not result in aborting contended
> async compaction until reaching compact_checklock_irqsave() or end of
> pageblock. In the free scanner, it was similar but completely without the
> periodical checking, so lock can be potentially held until reaching the end of
> pageblock.
> 
> After the patch, in both scanners:
> 
> The periodical check is done as the first thing in the loop on each
> SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
> function, which always unlocks the lock (if locked) and aborts async 
> compaction
> if scheduling is needed. It also aborts any type of compaction when a fatal
> signal is pending.
> 
> The compact_checklock_irqsave() function is replaced with a slightly different
> compact_trylock_irqsave(). The biggest difference is that the function is not
> called at all if the lock is already held. The periodical need_resched()
> checking is left solely to compact_unlock_should_abort(). The lock contention
> avoidance for async compaction is achieved by the periodical unlock by
> compact_unlock_should_abort() and by using trylock in 
> compact_trylock_irqsave()
> and aborting when trylock fails. Sync compaction does not use trylock.
> 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 

Reviewed-by: Zhang Yanfei 

> ---
>  mm/compaction.c | 114 
> 
>  1 file changed, 73 insertions(+), 41 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e8cfac9..40da812 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -180,54 +180,72 @@ static void update_pageblock_skip(struct 
> compact_control *cc,
>  }
>  #endif /* CONFIG_COMPACTION */
>  
> -enum compact_contended should_release_lock(spinlock_t *lock)
> +/*
> + * Compaction requires the taking of some coarse locks that are potentially
> + * very heavily contended. For async compaction, back out if the lock cannot
> + * be taken immediately. For sync compaction, spin on the lock if needed.
> + *
> + * Returns true if the lock is held
> + * Returns false if the lock is not held and compaction should abort
> + */
> +static bool compact_trylock_irqsave(spinlock_t *lock,
> + unsigned long *flags, struct compact_control *cc)
>  {
> - if (spin_is_contended(lock))
> - return COMPACT_CONTENDED_LOCK;
> - else if (need_resched())
> - return COMPACT_CONTENDED_SCHED;
> - else
> - return COMPACT_CONTENDED_NONE;
> + if (cc->mode == MIGRATE_ASYNC) {
> + if (!spin_trylock_irqsave(lock, *flags)) {
> + cc->contended = COMPACT_CONTENDED_LOCK;
> +

Re: [PATCH v3 05/13] mm, compaction: report compaction as contended only due to lock contention

2014-06-23 Thread Zhang Yanfei

 we count a compaction stall
>> + * and we report if all zones that were tried were contended.
>> + */
>> +if (!*deferred) {
>>  count_compact_event(COMPACTSTALL);
>> +*contended = all_zones_contended;
> 
> Why don't you initialize contended as *false* in function's intro?
> 
>> +}
>>  
>>  return rc;
>>  }
>> diff --git a/mm/internal.h b/mm/internal.h
>> index a1b651b..2c187d2 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -117,6 +117,13 @@ extern int user_min_free_kbytes;
>>  
>>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>>  
>> +/* Used to signal whether compaction detected need_sched() or lock 
>> contention */
>> +enum compact_contended {
>> +COMPACT_CONTENDED_NONE = 0, /* no contention detected */
>> +COMPACT_CONTENDED_SCHED,/* need_sched() was true */
>> +COMPACT_CONTENDED_LOCK, /* zone lock or lru_lock was contended */
>> +};
>> +
>>  /*
>>   * in mm/compaction.c
>>   */
>> @@ -144,10 +151,10 @@ struct compact_control {
>>  int order;  /* order a direct compactor needs */
>>  int migratetype;/* MOVABLE, RECLAIMABLE etc */
>>  struct zone *zone;
>> -bool contended; /* True if a lock was contended, or
>> - * need_resched() true during async
>> -     * compaction
>> - */
>> +enum compact_contended contended; /* Signal need_sched() or lock
>> +   * contention detected during
>> +   * compaction
>> +   */
>>  };
>>  
>>  unsigned long
>> -- 
> 
> Anyway, most big concern is that you are changing current behavior as
> I said earlier.
> 
> Old behavior in THP page fault when it consumes own timeslot was just
> abort and fallback 4K page but with your patch, new behavior is
> take a rest when it founds need_resched and goes to another round with
> async, not sync compaction. I'm not sure we need another round with
> async compaction at the cost of increasing latency rather than fallback
> 4 page.

I don't see the new behavior works like what you said. If need_resched
is true, it calls cond_resched() and after a rest it just breaks the loop.
Why there is another round with async compact?

Thanks.

> 
> It might be okay if the VMA has MADV_HUGEPAGE which is good hint to
> indicate non-temporal VMA so latency would be trade-off but it's not
> for temporal big memory allocation in HUGEPAGE_ALWAYS system.
> 
> If you really want to go this, could you show us numbers?
> 
> 1. How many could we can be successful in direct compaction by this patch?
> 2. How long could we increase latency for temporal allocation
>for HUGEPAGE_ALWAYS system?
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 04/13] mm, compaction: move pageblock checks up from isolate_migratepages_range()

2014-06-22 Thread Zhang Yanfei

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> isolate_migratepages_range() is the main function of the compaction scanner,
> called either on a single pageblock by isolate_migratepages() during regular
> compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
> It currently perfoms two pageblock-wide compaction suitability checks, and
> because of the CMA callpath, it tracks if it crossed a pageblock boundary in
> order to repeat those checks.
> 
> However, closer inspection shows that those checks are always true for CMA:
> - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
> - migrate_async_suitable() check is skipped because CMA uses sync compaction
> 
> We can therefore move the checks to isolate_migratepages(), reducing variables
> and simplifying isolate_migratepages_range(). The update_pageblock_skip()
> function also no longer needs set_unsuitable parameter.
> 
> Furthermore, going back to compact_zone() and compact_finished() when 
> pageblock
> is unsuitable is wasteful - the checks are meant to skip pageblocks quickly.
> The patch therefore also introduces a simple loop into isolate_migratepages()
> so that it does not return immediately on pageblock checks, but keeps going
> until isolate_migratepages_range() gets called once. Similarily to
> isolate_freepages(), the function periodically checks if it needs to 
> reschedule
> or abort async compaction.
> 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 

I think this is a good clean-up to make code more clear.

Reviewed-by: Zhang Yanfei 

Only a tiny nit-pick below.

> ---
>  mm/compaction.c | 112 
> +---
>  1 file changed, 59 insertions(+), 53 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 3064a7f..ebe30c9 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
>   */
>  static void update_pageblock_skip(struct compact_control *cc,
>   struct page *page, unsigned long nr_isolated,
> - bool set_unsuitable, bool migrate_scanner)
> + bool migrate_scanner)
>  {
>   struct zone *zone = cc->zone;
>   unsigned long pfn;
> @@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control 
> *cc,
>   if (nr_isolated)
>   return;
>  
> - /*
> -  * Only skip pageblocks when all forms of compaction will be known to
> -  * fail in the near future.
> -  */
> - if (set_unsuitable)
> - set_pageblock_skip(page);
> + set_pageblock_skip(page);
>  
>   pfn = page_to_pfn(page);
>  
> @@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct 
> compact_control *cc,
>  
>  static void update_pageblock_skip(struct compact_control *cc,
>   struct page *page, unsigned long nr_isolated,
> - bool set_unsuitable, bool migrate_scanner)
> + bool migrate_scanner)
>  {
>  }
>  #endif /* CONFIG_COMPACTION */
> @@ -345,8 +340,7 @@ isolate_fail:
>  
>   /* Update the pageblock-skip if the whole pageblock was scanned */
>   if (blockpfn == end_pfn)
> - update_pageblock_skip(cc, valid_page, total_isolated, true,
> -   false);
> + update_pageblock_skip(cc, valid_page, total_isolated, false);
>  
>   count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
>   if (total_isolated)
> @@ -474,14 +468,12 @@ unsigned long
>  isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>   unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
>  {
> - unsigned long last_pageblock_nr = 0, pageblock_nr;
>   unsigned long nr_scanned = 0, nr_isolated = 0;
>   struct list_head *migratelist = &cc->migratepages;
>   struct lruvec *lruvec;
>   unsigned long flags;
>   bool locked = false;
>   struct page *page = NULL, *valid_page = NULL;
> - bool set_unsuitable = true;
>   const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
>   ISOLATE_ASYNC_MIGRATE : 0) |
>   (unevictable ? ISOLATE_UNEVICTABLE : 0);
> @@ -545,28 +537,6 @@ isolate_migratepages_range(struct zone *zone, struct 
> compact_control *cc,
>   if (!valid_page)
>   valid_page = pa

Re: [PATCH v3 02/13] mm, compaction: defer each zone individually instead of preferred zone

2014-06-22 Thread Zhang Yanfei

On 06/20/2014 11:49 PM, Vlastimil Babka wrote:
> When direct sync compaction is often unsuccessful, it may become deferred for
> some time to avoid further useless attempts, both sync and async. Successful
> high-order allocations un-defer compaction, while further unsuccessful
> compaction attempts prolong the copmaction deferred period.
> 
> Currently the checking and setting deferred status is performed only on the
> preferred zone of the allocation that invoked direct compaction. But 
> compaction
> itself is attempted on all eligible zones in the zonelist, so the behavior is
> suboptimal and may lead both to scenarios where 1) compaction is attempted
> uselessly, or 2) where it's not attempted despite good chances of succeeding,
> as shown on the examples below:
> 
> 1) A direct compaction with Normal preferred zone failed and set deferred
>compaction for the Normal zone. Another unrelated direct compaction with
>DMA32 as preferred zone will attempt to compact DMA32 zone even though
>the first compaction attempt also included DMA32 zone.
> 
>In another scenario, compaction with Normal preferred zone failed to 
> compact
>Normal zone, but succeeded in the DMA32 zone, so it will not defer
>compaction. In the next attempt, it will try Normal zone which will fail
>again, instead of skipping Normal zone and trying DMA32 directly.
> 
> 2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
>looking good. A direct compaction with preferred Normal zone will skip
>compaction of all zones including DMA32 because Normal was still deferred.
>The allocation might have succeeded in DMA32, but won't.
> 
> This patch makes compaction deferring work on individual zone basis instead of
> preferred zone. For each zone, it checks compaction_deferred() to decide if 
> the
> zone should be skipped. If watermarks fail after compacting the zone,
> defer_compaction() is called. The zone where watermarks passed can still be
> deferred when the allocation attempt is unsuccessful. When allocation is
> successful, compaction_defer_reset() is called for the zone containing the
> allocated page. This approach should approximate calling defer_compaction()
> only on zones where compaction was attempted and did not yield allocated page.
> There might be corner cases but that is inevitable as long as the decision
> to stop compacting dues not guarantee that a page will be allocated.
> 
> During testing on a two-node machine with a single very small Normal zone on
> node 1, this patch has improved success rates in stress-highalloc mmtests
> benchmark. The success here were previously made worse by commit 3a025760fc
> ("mm: page_alloc: spill to remote nodes before waking kswapd") as kswapd was
> no longer resetting often enough the deferred compaction for the Normal zone,
> and DMA32 zones on both nodes were thus not considered for compaction.
> 
> Signed-off-by: Vlastimil Babka 
> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 

Really good.

Reviewed-by: Zhang Yanfei 

> ---
>  include/linux/compaction.h |  6 --
>  mm/compaction.c| 29 -
>  mm/page_alloc.c| 33 ++---
>  3 files changed, 46 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 01e3132..76f9beb 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -22,7 +22,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, 
> int write,
>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>  extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>   int order, gfp_t gfp_mask, nodemask_t *mask,
> - enum migrate_mode mode, bool *contended);
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + struct zone **candidate_zone);
>  extern void compact_pgdat(pg_data_t *pgdat, int order);
>  extern void reset_isolation_suitable(pg_data_t *pgdat);
>  extern unsigned long compaction_suitable(struct zone *zone, int order);
> @@ -91,7 +92,8 @@ static inline bool compaction_restarting(struct zone *zone, 
> int order)
>  #else
>  static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>   int order, gfp_t gfp_mask, nodemask_t *nodemask,
> - enum migrate_mode mode, bool *contended)
> + enum migrate_mode mode, bool *contended, bool *deferred,
> + s

Re: [PATCH v3 01/13] mm, THP: don't hold mmap_sem in khugepaged when allocating THP

2014-06-22 Thread Zhang Yanfei

Hello

On 06/21/2014 01:45 AM, Kirill A. Shutemov wrote:
> On Fri, Jun 20, 2014 at 05:49:31PM +0200, Vlastimil Babka wrote:
>> When allocating huge page for collapsing, khugepaged currently holds mmap_sem
>> for reading on the mm where collapsing occurs. Afterwards the read lock is
>> dropped before write lock is taken on the same mmap_sem.
>>
>> Holding mmap_sem during whole huge page allocation is therefore useless, the
>> vma needs to be rechecked after taking the write lock anyway. Furthemore, 
>> huge
>> page allocation might involve a rather long sync compaction, and thus block
>> any mmap_sem writers and i.e. affect workloads that perform frequent m(un)map
>> or mprotect oterations.
>>
>> This patch simply releases the read lock before allocating a huge page. It
>> also deletes an outdated comment that assumed vma must be stable, as it was
>> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
>> ("mm: thp: khugepaged: add policy for finding target node").
> 
> There is no point in touching ->mmap_sem in khugepaged_alloc_page() at
> all. Please, move up_read() outside khugepaged_alloc_page().
> 

I might be wrong. If we up_read in khugepaged_scan_pmd(), then if we round again
do the for loop to get the next vma and handle it. Does we do this without 
holding
the mmap_sem in any mode?

And if the loop end, we have another up_read in breakouterloop. What if we have
released the mmap_sem in collapse_huge_page()?

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/8] mm: add page cache limit and reclaim feature

2014-06-16 Thread Zhang Yanfei

Hi,

On 06/16/2014 05:24 PM, Xishi Qiu wrote:
> When system(e.g. smart phone) running for a long time, the cache often takes
> a large memory, maybe the free memory is less than 50M, then OOM will happen
> if APP allocate a large order pages suddenly and memory reclaim too slowly. 

If there is really too many page caches, and the free memory is low. I think
the page allocator will enter the slowpath to free more memory for allocation.
And it the slowpath, there is indeed the direct reclaim operation, so is that
really not enough to reclaim pagecaches?

> 
> Use "echo 3 > /proc/sys/vm/drop_caches" will drop the whole cache, this will
> affect the performance, so it is used for debugging only. 
> 
> suse has this feature, I tested it before, but it can not limit the page cache
> actually. So I rewrite the feature and add some parameters.
> 
> Christoph Lameter has written a patch "Limit the size of the pagecache"
> http://marc.info/?l=linux-mm&m=116959990228182&w=2
> It changes in zone fallback, this is not a good way.
> 
> The patchset is based on v3.15, it introduces two features, page cache limit
> and page cache reclaim in circles.
> 
> Add four parameters in /proc/sys/vm
> 
> 1) cache_limit_mbytes
> This is used to limit page cache amount.
> The input unit is MB, value range is from 0 to totalram_pages.
> If this is set to 0, it will not limit page cache.
> When written to the file, cache_limit_ratio will be updated too.
> The default value is 0.
> 
> 2) cache_limit_ratio
> This is used to limit page cache amount.
> The input unit is percent, value range is from 0 to 100.
> If this is set to 0, it will not limit page cache.
> When written to the file, cache_limit_mbytes will be updated too.
> The default value is 0.
> 
> 3) cache_reclaim_s
> This is used to reclaim page cache in circles.
> The input unit is second, the minimum value is 0.
> If this is set to 0, it will disable the feature.
> The default value is 0.
> 
> 4) cache_reclaim_weight
> This is used to speed up page cache reclaim.
> It depend on enabling cache_limit_mbytes/cache_limit_ratio or cache_reclaim_s.
> Value range is from 1(slow) to 100(fast).
> The default value is 1.
> 
> I tested the two features on my system(x86_64), it seems to work right.
> However, as it changes the hot path "add_to_page_cache_lru()", I don't know
> how much it will the affect the performance,

Yeah, at a quick glance, for every invoke of add_to_page_cache_lru(), there is 
the 
newly added test:

if (vm_cache_limit_mbytes && page_cache_over_limit())

and if the test is passed, shrink_page_cache()->do_try_to_free_pages() is 
called.
And this is a sync operation. IMO, it is better to make such an operation async.
(You've implemented async operation but I doubt if it is suitable to put the 
sync operation
here.)

Thanks.

 maybe there are some errors
> in the patches too, RFC.
> 
> 
> *** BLURB HERE ***
> 
> Xishi Qiu (8):
>   mm: introduce cache_limit_ratio and cache_limit_mbytes
>   mm: add shrink page cache core
>   mm: implement page cache limit feature
>   mm: introduce cache_reclaim_s
>   mm: implement page cache reclaim in circles
>   mm: introduce cache_reclaim_weight
>   mm: implement page cache reclaim speed
>   doc: update Documentation/sysctl/vm.txt
> 
>  Documentation/sysctl/vm.txt |   43 +++
>  include/linux/swap.h|   17 
>  kernel/sysctl.c |   35 +++
>  mm/filemap.c|3 +
>  mm/hugetlb.c|3 +
>  mm/page_alloc.c |   51 ++
>  mm/vmscan.c |   97 
> ++-
>  7 files changed, 248 insertions(+), 1 deletions(-)
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> .
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 08/10] mm, cma: clean-up cma allocation error path

2014-06-12 Thread Zhang Yanfei

On 06/12/2014 11:21 AM, Joonsoo Kim wrote:
> We can remove one call sites for clear_cma_bitmap() if we first
> call it before checking error number.
> 
> Signed-off-by: Joonsoo Kim 

Reviewed-by: Zhang Yanfei 

> 
> diff --git a/mm/cma.c b/mm/cma.c
> index 1e1b017..01a0713 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -282,11 +282,12 @@ struct page *cma_alloc(struct cma *cma, int count, 
> unsigned int align)
>   if (ret == 0) {
>   page = pfn_to_page(pfn);
>   break;
> - } else if (ret != -EBUSY) {
> - clear_cma_bitmap(cma, pfn, count);
> - break;
>   }
> +
>   clear_cma_bitmap(cma, pfn, count);
> + if (ret != -EBUSY)
> + break;
> +
>   pr_debug("%s(): memory range at %p is busy, retrying\n",
>__func__, pfn_to_page(pfn));
>   /* try again with a bit different memory target */
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 06/10] CMA: generalize CMA reserved area management functionality

2014-06-12 Thread Zhang Yanfei

On 06/12/2014 11:21 AM, Joonsoo Kim wrote:
> Currently, there are two users on CMA functionality, one is the DMA
> subsystem and the other is the kvm on powerpc. They have their own code
> to manage CMA reserved area even if they looks really similar.
>>From my guess, it is caused by some needs on bitmap management. Kvm side
> wants to maintain bitmap not for 1 page, but for more size. Eventually it
> use bitmap where one bit represents 64 pages.
> 
> When I implement CMA related patches, I should change those two places
> to apply my change and it seem to be painful to me. I want to change
> this situation and reduce future code management overhead through
> this patch.
> 
> This change could also help developer who want to use CMA in their
> new feature development, since they can use CMA easily without
> copying & pasting this reserved area management code.
> 
> In previous patches, we have prepared some features to generalize
> CMA reserved area management and now it's time to do it. This patch
> moves core functions to mm/cma.c and change DMA APIs to use
> these functions.
> 
> There is no functional change in DMA APIs.
> 
> v2: There is no big change from v1 in mm/cma.c. Mostly renaming.
> 
> Acked-by: Michal Nazarewicz 
> Signed-off-by: Joonsoo Kim 

Acked-by: Zhang Yanfei 

> 
> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> index 00e13ce..4eac559 100644
> --- a/drivers/base/Kconfig
> +++ b/drivers/base/Kconfig
> @@ -283,16 +283,6 @@ config CMA_ALIGNMENT
>  
> If unsure, leave the default value "8".
>  
> -config CMA_AREAS
> - int "Maximum count of the CMA device-private areas"
> - default 7
> - help
> -   CMA allows to create CMA areas for particular devices. This parameter
> -   sets the maximum number of such device private CMA areas in the
> -   system.
> -
> -   If unsure, leave the default value "7".
> -
>  endif
>  
>  endmenu
> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
> index 9bc9340..f177f73 100644
> --- a/drivers/base/dma-contiguous.c
> +++ b/drivers/base/dma-contiguous.c
> @@ -24,25 +24,10 @@
>  
>  #include 
>  #include 
> -#include 
> -#include 
> -#include 
>  #include 
> -#include 
> -#include 
> -#include 
>  #include 
>  #include 
> -
> -struct cma {
> - unsigned long   base_pfn;
> - unsigned long   count;
> - unsigned long   *bitmap;
> - int order_per_bit; /* Order of pages represented by one bit */
> - struct mutexlock;
> -};
> -
> -struct cma *dma_contiguous_default_area;
> +#include 
>  
>  #ifdef CONFIG_CMA_SIZE_MBYTES
>  #define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
> @@ -50,6 +35,8 @@ struct cma *dma_contiguous_default_area;
>  #define CMA_SIZE_MBYTES 0
>  #endif
>  
> +struct cma *dma_contiguous_default_area;
> +
>  /*
>   * Default global CMA area size can be defined in kernel's .config.
>   * This is useful mainly for distro maintainers to create a kernel
> @@ -156,199 +143,13 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>   }
>  }
>  
> -static DEFINE_MUTEX(cma_mutex);
> -
> -static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int 
> align_order)
> -{
> - return (1 << (align_order >> cma->order_per_bit)) - 1;
> -}
> -
> -static unsigned long cma_bitmap_maxno(struct cma *cma)
> -{
> - return cma->count >> cma->order_per_bit;
> -}
> -
> -static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
> - unsigned long pages)
> -{
> - return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit;
> -}
> -
> -static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
> -{
> - unsigned long bitmapno, nr_bits;
> -
> - bitmapno = (pfn - cma->base_pfn) >> cma->order_per_bit;
> - nr_bits = cma_bitmap_pages_to_bits(cma, count);
> -
> - mutex_lock(&cma->lock);
> - bitmap_clear(cma->bitmap, bitmapno, nr_bits);
> - mutex_unlock(&cma->lock);
> -}
> -
> -static int __init cma_activate_area(struct cma *cma)
> -{
> - int bitmap_maxno = cma_bitmap_maxno(cma);
> - int bitmap_size = BITS_TO_LONGS(bitmap_maxno) * sizeof(long);
> - unsigned long base_pfn = cma->base_pfn, pfn = base_pfn;
> - unsigned i = cma->count >> pageblock_order;
> - struct zone *zone;
> -
> - pr_debug("%s()\n", __func__);
> -
> - cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
> - if (!cma->b

Re: [PATCH v2 05/10] DMA, CMA: support arbitrary bitmap granularity

2014-06-12 Thread Zhang Yanfei

On 06/12/2014 11:21 AM, Joonsoo Kim wrote:
> ppc kvm's cma region management requires arbitrary bitmap granularity,
> since they want to reserve very large memory and manage this region
> with bitmap that one bit for several pages to reduce management overheads.
> So support arbitrary bitmap granularity for following generalization.
> 
> Signed-off-by: Joonsoo Kim 

Acked-by: Zhang Yanfei 

> 
> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
> index bc4c171..9bc9340 100644
> --- a/drivers/base/dma-contiguous.c
> +++ b/drivers/base/dma-contiguous.c
> @@ -38,6 +38,7 @@ struct cma {
>   unsigned long   base_pfn;
>   unsigned long   count;
>   unsigned long   *bitmap;
> + int order_per_bit; /* Order of pages represented by one bit */
>   struct mutexlock;
>  };
>  
> @@ -157,9 +158,38 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>  
>  static DEFINE_MUTEX(cma_mutex);
>  
> +static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int 
> align_order)
> +{
> + return (1 << (align_order >> cma->order_per_bit)) - 1;
> +}
> +
> +static unsigned long cma_bitmap_maxno(struct cma *cma)
> +{
> + return cma->count >> cma->order_per_bit;
> +}
> +
> +static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
> + unsigned long pages)
> +{
> + return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit;
> +}
> +
> +static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
> +{
> + unsigned long bitmapno, nr_bits;
> +
> + bitmapno = (pfn - cma->base_pfn) >> cma->order_per_bit;
> + nr_bits = cma_bitmap_pages_to_bits(cma, count);
> +
> + mutex_lock(&cma->lock);
> + bitmap_clear(cma->bitmap, bitmapno, nr_bits);
> + mutex_unlock(&cma->lock);
> +}
> +
>  static int __init cma_activate_area(struct cma *cma)
>  {
> - int bitmap_size = BITS_TO_LONGS(cma->count) * sizeof(long);
> + int bitmap_maxno = cma_bitmap_maxno(cma);
> + int bitmap_size = BITS_TO_LONGS(bitmap_maxno) * sizeof(long);
>   unsigned long base_pfn = cma->base_pfn, pfn = base_pfn;
>   unsigned i = cma->count >> pageblock_order;
>   struct zone *zone;
> @@ -221,6 +251,7 @@ core_initcall(cma_init_reserved_areas);
>   * @base: Base address of the reserved area optional, use 0 for any
>   * @limit: End address of the reserved memory (optional, 0 for any).
>   * @alignment: Alignment for the contiguous memory area, should be power of 2
> + * @order_per_bit: Order of pages represented by one bit on bitmap.
>   * @res_cma: Pointer to store the created cma region.
>   * @fixed: hint about where to place the reserved area
>   *
> @@ -235,7 +266,7 @@ core_initcall(cma_init_reserved_areas);
>   */
>  static int __init __dma_contiguous_reserve_area(phys_addr_t size,
>   phys_addr_t base, phys_addr_t limit,
> - phys_addr_t alignment,
> + phys_addr_t alignment, int order_per_bit,
>   struct cma **res_cma, bool fixed)
>  {
>   struct cma *cma = &cma_areas[cma_area_count];
> @@ -269,6 +300,8 @@ static int __init 
> __dma_contiguous_reserve_area(phys_addr_t size,
>   base = ALIGN(base, alignment);
>   size = ALIGN(size, alignment);
>   limit &= ~(alignment - 1);
> + /* size should be aligned with order_per_bit */
> + BUG_ON(!IS_ALIGNED(size >> PAGE_SHIFT, 1 << order_per_bit));
>  
>   /* Reserve memory */
>   if (base && fixed) {
> @@ -294,6 +327,7 @@ static int __init 
> __dma_contiguous_reserve_area(phys_addr_t size,
>*/
>   cma->base_pfn = PFN_DOWN(base);
>   cma->count = size >> PAGE_SHIFT;
> + cma->order_per_bit = order_per_bit;
>   *res_cma = cma;
>   cma_area_count++;
>  
> @@ -313,7 +347,7 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
> phys_addr_t base,
>  {
>   int ret;
>  
> - ret = __dma_contiguous_reserve_area(size, base, limit, 0,
> + ret = __dma_contiguous_reserve_area(size, base, limit, 0, 0,
>   res_cma, fixed);
>   if (ret)
>   return ret;
> @@ -324,13 +358,6 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
> phys_addr_t base,
>   return 0;
>  }
>  
> -static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
> -{
> - mutex_lock(&cma->lock);
> - bitmap_clear(cma->bitmap, pfn - cma->

Re: [PATCH v2 02/10] DMA, CMA: fix possible memory leak

2014-06-12 Thread Zhang Yanfei

On 06/12/2014 02:02 PM, Joonsoo Kim wrote:
> On Thu, Jun 12, 2014 at 02:25:43PM +0900, Minchan Kim wrote:
>> On Thu, Jun 12, 2014 at 12:21:39PM +0900, Joonsoo Kim wrote:
>>> We should free memory for bitmap when we find zone mis-match,
>>> otherwise this memory will leak.
>>
>> Then, -stable stuff?
> 
> I don't think so. This is just possible leak candidate, so we don't
> need to push this to stable tree.
> 
>>
>>>
>>> Additionally, I copy code comment from ppc kvm's cma code to notify
>>> why we need to check zone mis-match.
>>>
>>> Signed-off-by: Joonsoo Kim 
>>>
>>> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
>>> index bd0bb81..fb0cdce 100644
>>> --- a/drivers/base/dma-contiguous.c
>>> +++ b/drivers/base/dma-contiguous.c
>>> @@ -177,14 +177,24 @@ static int __init cma_activate_area(struct cma *cma)
>>> base_pfn = pfn;
>>> for (j = pageblock_nr_pages; j; --j, pfn++) {
>>> WARN_ON_ONCE(!pfn_valid(pfn));
>>> +   /*
>>> +* alloc_contig_range requires the pfn range
>>> +* specified to be in the same zone. Make this
>>> +* simple by forcing the entire CMA resv range
>>> +* to be in the same zone.
>>> +*/
>>> if (page_zone(pfn_to_page(pfn)) != zone)
>>> -   return -EINVAL;
>>> +   goto err;
>>
>> At a first glance, I thought it would be better to handle such error
>> before activating.
>> So when I see the registration code(ie, dma_contiguous_revere_area),
>> I realized it is impossible because we didn't set up zone yet. :(
>>
>> If so, when we detect to fail here, it would be better to report more
>> meaningful error message like what was successful zone and what is
>> new zone and failed pfn number?
> 
> What I want to do in early phase of this patchset is to make cma code
> on DMA APIs similar to ppc kvm's cma code. ppc kvm's cma code already
> has this error handling logic, so I make this patch.
> 
> If we think that we need more things, we can do that on general cma code
> after merging this patchset.
> 

Yeah, I also like the idea. After all, this patchset aims to a general CMA
management, we could improve more after this patchset. So

Acked-by: Zhang Yanfei 

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 01/10] DMA, CMA: clean-up log message

2014-06-12 Thread Zhang Yanfei

On 06/12/2014 11:21 AM, Joonsoo Kim wrote:
> We don't need explicit 'CMA:' prefix, since we already define prefix
> 'cma:' in pr_fmt. So remove it.
> 
> And, some logs print function name and others doesn't. This looks
> bad to me, so I unify log format to print function name consistently.
> 
> Lastly, I add one more debug log on cma_activate_area().
> 
> Signed-off-by: Joonsoo Kim 

Reviewed-by: Zhang Yanfei 

> 
> diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
> index 83969f8..bd0bb81 100644
> --- a/drivers/base/dma-contiguous.c
> +++ b/drivers/base/dma-contiguous.c
> @@ -144,7 +144,7 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
>   }
>  
>   if (selected_size && !dma_contiguous_default_area) {
> - pr_debug("%s: reserving %ld MiB for global area\n", __func__,
> + pr_debug("%s(): reserving %ld MiB for global area\n", __func__,
>(unsigned long)selected_size / SZ_1M);
>  
>   dma_contiguous_reserve_area(selected_size, selected_base,
> @@ -163,8 +163,9 @@ static int __init cma_activate_area(struct cma *cma)
>   unsigned i = cma->count >> pageblock_order;
>   struct zone *zone;
>  
> - cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
> + pr_debug("%s()\n", __func__);
>  
> + cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
>   if (!cma->bitmap)
>   return -ENOMEM;
>  
> @@ -234,7 +235,8 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
> phys_addr_t base,
>  
>   /* Sanity checks */
>   if (cma_area_count == ARRAY_SIZE(cma_areas)) {
> - pr_err("Not enough slots for CMA reserved regions!\n");
> + pr_err("%s(): Not enough slots for CMA reserved regions!\n",
> + __func__);
>   return -ENOSPC;
>   }
>  
> @@ -274,14 +276,15 @@ int __init dma_contiguous_reserve_area(phys_addr_t 
> size, phys_addr_t base,
>   *res_cma = cma;
>   cma_area_count++;
>  
> - pr_info("CMA: reserved %ld MiB at %08lx\n", (unsigned long)size / SZ_1M,
> - (unsigned long)base);
> + pr_info("%s(): reserved %ld MiB at %08lx\n",
> + __func__, (unsigned long)size / SZ_1M, (unsigned long)base);
>  
>   /* Architecture specific contiguous memory fixup. */
>   dma_contiguous_early_fixup(base, size);
>   return 0;
>  err:
> - pr_err("CMA: failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
> + pr_err("%s(): failed to reserve %ld MiB\n",
> + __func__, (unsigned long)size / SZ_1M);
>   return ret;
>  }
>  
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 05/10] DMA, CMA: support arbitrary bitmap granularity

2014-06-12 Thread Zhang Yanfei

area(size, base, limit, 0, 0,
>>  res_cma, fixed);
>>  if (ret)
>>  return ret;
>> @@ -324,13 +358,6 @@ int __init dma_contiguous_reserve_area(phys_addr_t 
>> size, phys_addr_t base,
>>  return 0;
>>  }
>>  
>> -static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
>> -{
>> -mutex_lock(&cma->lock);
>> -bitmap_clear(cma->bitmap, pfn - cma->base_pfn, count);
>> -mutex_unlock(&cma->lock);
>> -}
>> -
>>  /**
>>   * dma_alloc_from_contiguous() - allocate pages from contiguous area
>>   * @dev:   Pointer to device for which the allocation is performed.
>> @@ -345,7 +372,8 @@ static void clear_cma_bitmap(struct cma *cma, unsigned 
>> long pfn, int count)
>>  static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
>> unsigned int align)
>>  {
>> -unsigned long mask, pfn, pageno, start = 0;
>> +unsigned long mask, pfn, start = 0;
>> +unsigned long bitmap_maxno, bitmapno, nr_bits;
> 
> Just Nit: bitmap_maxno, bitmap_no or something consistent.
> I know you love consistent when I read description in first patch
> in this patchset. ;-)
> 

Yeah, not only in this patchset, I saw Joonsoo trying to unify all
kinds of things in the MM. This is great for newbies, IMO.

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 07/10] mm: rename allocflags_to_migratetype for clarity

2014-06-10 Thread Zhang Yanfei

On 06/11/2014 10:41 AM, Minchan Kim wrote:
> On Mon, Jun 09, 2014 at 11:26:19AM +0200, Vlastimil Babka wrote:
>> From: David Rientjes 
>>
>> The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
>> ALLOC_CPUSET) that have separate semantics.
>>
>> The function allocflags_to_migratetype() actually takes gfp flags, not alloc
>> flags, and returns a migratetype.  Rename it to gfpflags_to_migratetype().
>>
>> Signed-off-by: David Rientjes 
>> Signed-off-by: Vlastimil Babka 
> 
> I was one of person who got confused sometime.

Some names in MM really make people confused. But sometimes thinking
an appropriate name is also a hard thing. Like I once wanted to change
the name of function nr_free_zone_pages() and also nr_free_buffer_pages().
But it is hard to name them, so at last Andrew suggested to add the
detailed function description to make it clear only.

Reviewed-by: Zhang Yanfei 

> 
> Acked-by: Minchan Kim 
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 05/10] mm, compaction: remember position within pageblock in free pages scanner

2014-06-10 Thread Zhang Yanfei

On 06/09/2014 05:26 PM, Vlastimil Babka wrote:
> Unlike the migration scanner, the free scanner remembers the beginning of the
> last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
> uselessly when called several times during single compaction. This might have
> been useful when pages were returned to the buddy allocator after a failed
> migration, but this is no longer the case.
> 
> This patch changes the meaning of cc->free_pfn so that if it points to a
> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> end. isolate_freepages_block() will record the pfn of the last page it looked
> at, which is then used to update cc->free_pfn.
> 
> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> page, to 2.25 free pages per migrate page, without affecting success rates.
> 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Zhang Yanfei 

> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
>  mm/compaction.c | 33 -
>  1 file changed, 28 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 83f72bd..58dfaaa 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -297,7 +297,7 @@ static bool suitable_migration_target(struct page *page)
>   * (even though it may still end up isolating some pages).
>   */
>  static unsigned long isolate_freepages_block(struct compact_control *cc,
> - unsigned long blockpfn,
> + unsigned long *start_pfn,
>   unsigned long end_pfn,
>   struct list_head *freelist,
>   bool strict)
> @@ -306,6 +306,7 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   struct page *cursor, *valid_page = NULL;
>   unsigned long flags;
>   bool locked = false;
> + unsigned long blockpfn = *start_pfn;
>  
>   cursor = pfn_to_page(blockpfn);
>  
> @@ -314,6 +315,9 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   int isolated, i;
>   struct page *page = cursor;
>  
> + /* Record how far we have got within the block */
> + *start_pfn = blockpfn;
> +
>   /*
>* Periodically drop the lock (if held) regardless of its
>* contention, to give chance to IRQs. Abort async compaction
> @@ -424,6 +428,9 @@ isolate_freepages_range(struct compact_control *cc,
>   LIST_HEAD(freelist);
>  
>   for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
> + /* Protect pfn from changing by isolate_freepages_block */
> + unsigned long isolate_start_pfn = pfn;
> +
>   if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
>   break;
>  
> @@ -434,8 +441,8 @@ isolate_freepages_range(struct compact_control *cc,
>   block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
>   block_end_pfn = min(block_end_pfn, end_pfn);
>  
> - isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
> -&freelist, true);
> + isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> + block_end_pfn, &freelist, true);
>  
>   /*
>* In strict mode, isolate_freepages_block() returns 0 if
> @@ -774,6 +781,7 @@ static void isolate_freepages(struct zone *zone,
>   block_end_pfn = block_start_pfn,
>   block_start_pfn -= pageblock_nr_pages) {
>   unsigned long isolated;
> + unsigned long isolate_start_pfn;
>  
>   /*
>* This can iterate a massively long zone without finding any
> @@ -807,12 +815,27 @@ static void isolate_freepages(struct zone *zone,
>   continue;
>  
>   /* Found a block suitable for isolating free pages from */
> - cc->free_pfn = block_start_pfn;
> - isolated = isolate_freepages_block(cc, block_start_pfn,
> + isolate_start_pfn = block_start_pfn;
> +
> + /*
> +  * If we are restarting the free scanner in this block, do not
> +  * rescan the beginning of the block
> +  */
> + if (

Re: [PATCH 01/10] mm, compaction: do not recheck suitable_migration_target under lock

2014-06-10 Thread Zhang Yanfei

On 06/09/2014 05:26 PM, Vlastimil Babka wrote:
> isolate_freepages_block() rechecks if the pageblock is suitable to be a target
> for migration after it has taken the zone->lock. However, the check has been
> optimized to occur only once per pageblock, and compact_checklock_irqsave()
> might be dropping and reacquiring lock, which means somebody else might have
> changed the pageblock's migratetype meanwhile.
> 
> Furthermore, nothing prevents the migratetype to change right after
> isolate_freepages_block() has finished isolating. Given how imperfect this is,
> it's simpler to just rely on the check done in isolate_freepages() without
> lock, and not pretend that the recheck under lock guarantees anything. It is
> just a heuristic after all.
> 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Zhang Yanfei 

> Cc: Minchan Kim 
> Cc: Mel Gorman 
> Cc: Joonsoo Kim 
> Cc: Michal Nazarewicz 
> Cc: Naoya Horiguchi 
> Cc: Christoph Lameter 
> Cc: Rik van Riel 
> Cc: David Rientjes 
> ---
> I suggest folding mm-compactionc-isolate_freepages_block-small-tuneup.patch 
> into this
> 
>  mm/compaction.c | 13 -
>  1 file changed, 13 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5175019..b73b182 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -276,7 +276,6 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   struct page *cursor, *valid_page = NULL;
>   unsigned long flags;
>   bool locked = false;
> - bool checked_pageblock = false;
>  
>   cursor = pfn_to_page(blockpfn);
>  
> @@ -307,18 +306,6 @@ static unsigned long isolate_freepages_block(struct 
> compact_control *cc,
>   if (!locked)
>   break;
>  
> - /* Recheck this is a suitable migration target under lock */
> - if (!strict && !checked_pageblock) {
> - /*
> -  * We need to check suitability of pageblock only once
> -  * and this isolate_freepages_block() is called with
> -  * pageblock range, so just check once is sufficient.
> -  */
> - checked_pageblock = true;
> - if (!suitable_migration_target(page))
> - break;
> - }
> -
>   /* Recheck this is a buddy page under lock */
>   if (!PageBuddy(page))
>   goto isolate_fail;
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mm/swap: cleanup lru_cache_add functions

2014-04-20 Thread Zhang Yanfei

On 04/21/2014 12:02 PM, Jianyu Zhan wrote:
> Hi,  Yanfei,
> 
> On Mon, Apr 21, 2014 at 9:00 AM, Zhang Yanfei
>  wrote:
>> What should be exported?
>>
>> lru_cache_add()
>> lru_cache_add_anon()
>> lru_cache_add_file()
>>
>> It seems you only export lru_cache_add_file() in the patch.
> 
> Right, lru_cache_add_anon() is only used by VM code, so it should not
> be exported.
> 
> lru_cache_add_file() and lru_cache_add() are supposed to be used by
> vfs ans fs code.
> 
> But  now only lru_cache_add_file() is  used by CIFS and FUSE, which
> both could be
> built as module, so it must be exported;  and lru_cache_add() has now
> no module users,
> so as Rik suggests, it is unexported too.
> 

OK. So The sentence in the patch log confused me:

[ However, lru_cache_add() is supposed to
be used by vfs, or whatever others, but it is not exported.]

otherwise, 
Reviewed-by: Zhang Yanfei 

Thanks.

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] mm/swap: cleanup lru_cache_add functions

2014-04-20 Thread Zhang Yanfei

Hi Jianyu

On 04/18/2014 11:39 PM, Jianyu Zhan wrote:
> Hi, Christoph Hellwig,
> 
>> There are no modular users of lru_cache_add, so please don't needlessly
>> export it.
> 
> yep, I re-checked and found there is no module user of neither 
> lru_cache_add() nor lru_cache_add_anon(), so don't export it.
> 
> Here is the renewed patch:
> ---
> 
> In mm/swap.c, __lru_cache_add() is exported, but actually there are
> no users outside this file. However, lru_cache_add() is supposed to
> be used by vfs, or whatever others, but it is not exported.
> 
> This patch unexports __lru_cache_add(), and makes it static.
> It also exports lru_cache_add_file(), as it is use by cifs, which
> be loaded as module.

What should be exported?

lru_cache_add()
lru_cache_add_anon()
lru_cache_add_file()

It seems you only export lru_cache_add_file() in the patch.

Thanks

> 
> Signed-off-by: Jianyu Zhan 
> ---
>  include/linux/swap.h | 19 ++-
>  mm/swap.c| 31 +++
>  2 files changed, 25 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3507115..5a14b92 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -308,8 +308,9 @@ extern unsigned long nr_free_pagecache_pages(void);
>  
>  
>  /* linux/mm/swap.c */
> -extern void __lru_cache_add(struct page *);
>  extern void lru_cache_add(struct page *);
> +extern void lru_cache_add_anon(struct page *page);
> +extern void lru_cache_add_file(struct page *page);
>  extern void lru_add_page_tail(struct page *page, struct page *page_tail,
>struct lruvec *lruvec, struct list_head *head);
>  extern void activate_page(struct page *);
> @@ -323,22 +324,6 @@ extern void swap_setup(void);
>  
>  extern void add_page_to_unevictable_list(struct page *page);
>  
> -/**
> - * lru_cache_add: add a page to the page lists
> - * @page: the page to add
> - */
> -static inline void lru_cache_add_anon(struct page *page)
> -{
> - ClearPageActive(page);
> - __lru_cache_add(page);
> -}
> -
> -static inline void lru_cache_add_file(struct page *page)
> -{
> - ClearPageActive(page);
> - __lru_cache_add(page);
> -}
> -
>  /* linux/mm/vmscan.c */
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>   gfp_t gfp_mask, nodemask_t *mask);
> diff --git a/mm/swap.c b/mm/swap.c
> index ab3f508..c0cd7d0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -582,13 +582,7 @@ void mark_page_accessed(struct page *page)
>  }
>  EXPORT_SYMBOL(mark_page_accessed);
>  
> -/*
> - * Queue the page for addition to the LRU via pagevec. The decision on 
> whether
> - * to add the page to the [in]active [file|anon] list is deferred until the
> - * pagevec is drained. This gives a chance for the caller of 
> __lru_cache_add()
> - * have the page added to the active list using mark_page_accessed().
> - */
> -void __lru_cache_add(struct page *page)
> +static void __lru_cache_add(struct page *page)
>  {
>   struct pagevec *pvec = &get_cpu_var(lru_add_pvec);
>  
> @@ -598,11 +592,32 @@ void __lru_cache_add(struct page *page)
>   pagevec_add(pvec, page);
>   put_cpu_var(lru_add_pvec);
>  }
> -EXPORT_SYMBOL(__lru_cache_add);
> +
> +/**
> + * lru_cache_add: add a page to the page lists
> + * @page: the page to add
> + */
> +void lru_cache_add_anon(struct page *page)
> +{
> + ClearPageActive(page);
> + __lru_cache_add(page);
> +}
> +
> +void lru_cache_add_file(struct page *page)
> +{
> + ClearPageActive(page);
> + __lru_cache_add(page);
> +}
> +EXPORT_SYMBOL(lru_cache_add_file);
>  
>  /**
>   * lru_cache_add - add a page to a page list
>   * @page: the page to be added to the LRU.
> + *
> + * Queue the page for addition to the LRU via pagevec. The decision on 
> whether
> + * to add the page to the [in]active [file|anon] list is deferred until the
> + * pagevec is drained. This gives a chance for the caller of lru_cache_add()
> + * have the page added to the active list using mark_page_accessed().
>   */
>  void lru_cache_add(struct page *page)
>  {
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v4] mm: support madvise(MADV_FREE)

2014-04-14 Thread Zhang Yanfei

On 04/15/2014 12:46 PM, Minchan Kim wrote:
> Linux doesn't have an ability to free pages lazy while other OS
> already have been supported that named by madvise(MADV_FREE).
> 
> The gain is clear that kernel can discard freed pages rather than
> swapping out or OOM if memory pressure happens.
> 
> Without memory pressure, freed pages would be reused by userspace
> without another additional overhead(ex, page fault + allocation
> + zeroing).
> 
> How to work is following as.
> 
> When madvise syscall is called, VM clears dirty bit of ptes of
> the range. If memory pressure happens, VM checks dirty bit of
> page table and if it found still "clean", it means it's a
> "lazyfree pages" so VM could discard the page instead of swapping out.
> Once there was store operation for the page before VM peek a page
> to reclaim, dirty bit is set so VM can swap out the page instead of
> discarding.
> 
> Firstly, heavy users would be general allocators(ex, jemalloc,
> tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
> have supported the feature for other OS(ex, FreeBSD)

Reviewed-by: Zhang Yanfei 

> 
> barrios@blaptop:~/benchmark/ebizzy$ lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):4
> On-line CPU(s) list:   0-3
> Thread(s) per core:2
> Core(s) per socket:2
> Socket(s): 1
> NUMA node(s):  1
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 42
> Stepping:  7
> CPU MHz:   2801.000
> BogoMIPS:  5581.64
> Virtualization:VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache:  256K
> L3 cache:  4096K
> NUMA node0 CPU(s): 0-3
> 
> ebizzy benchmark(./ebizzy -S 10 -n 512)
> 
>  vanilla-jemalloc MADV_free-jemalloc
> 
> 1 thread
> records:  10  records:  10
> avg:  7436.70 avg:  15292.70
> std:  48.01(0.65%)std:  496.40(3.25%)
> max:  7542.00 max:  15944.00
> min:  7366.00 min:  14478.00
> 
> 2 thread
> records:  10  records:  10
> avg:  12190.50avg:  24975.50
> std:  1011.51(8.30%)  std:  1127.22(4.51%)
> max:  13012.00max:  26382.00
> min:  10192.00min:  23265.00
> 
> 4 thread
> records:  10  records:  10
> avg:  16875.30avg:  36320.90
> std:  562.59(3.33%)   std:  1503.75(4.14%)
> max:  17465.00max:  38314.00
> min:  15552.00min:  33863.00
> 
> 8 thread
> records:  10  records:  10
> avg:  16966.80avg:  35915.20
> std:  229.35(1.35%)   std:  2153.89(6.00%)
> max:  17456.00max:  37943.00
> min:  16742.00min:  29891.00
> 
> 16 thread
> records:  10  records:  10
> avg:  20590.90avg:  37388.40
> std:  362.33(1.76%)   std:  1282.59(3.43%)
> max:  20954.00max:  38911.00
> min:  19985.00min:  34928.00
> 
> 32 thread
> records:  10  records:  10
> avg:  22633.40avg:  37118.00
> std:  413.73(1.83%)   std:  766.36(2.06%)
> max:  23120.00max:  38328.00
> min:  22071.00min:  35557.00
> 
> In summary, MADV_FREE is about 2 time faster than MADV_DONTNEED.
> Patchset is based on 3.14
> 
> * From v3
>  * Add "how to work part" in description - Zhang
>  * Add page_discardable utility function - Zhang
>  * Clean up
> 
> * From v2
>  * Remove forceful dirty marking of swap-readed page - Johannes
>  * Remove deactivation logic of lazyfreed page
>  * Rebased on 3.14
>  * Remove RFC tag
> 
> * From v1
>  * Use custom page table walker for madvise_free - Johannes
>  * Remove PG_lazypage flag - Johannes
>  * Do madvise_dontneed instead of madvise_freein swapless system
> 
> Cc: Hugh Dickins 
> Cc: Johannes Weiner 
> Cc: Rik van Riel 
> Cc: KOSAKI Motohiro 
> Cc: Mel Gorman 
> Cc: Jason Evans 
> Signed-off-by: Minchan Kim 
> ---
>  include/linux/mm.h |   2 +
>  include/linux/rmap.h   |  21 -
>  include/linux/vm_event_item.h  |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/madvise.c   |  25 ++
>  mm/memory.c| 140 
> +
>  mm/rmap.c  |  82 +--
>  mm/vmscan

Re: [RFC PATCH v2] memory-hotplug: Update documentation to hide information about SECTIONS and remove end_phys_index

2014-04-14 Thread Zhang Yanfei

On 04/14/2014 04:43 PM, Li Zhong wrote:
> Seems we all agree that information about SECTION, e.g. section size,
> sections per memory block should be kept as kernel internals, and not
> exposed to userspace.
> 
> This patch updates Documentation/memory-hotplug.txt to refer to memory
> blocks instead of memory sections where appropriate and added a
> paragraph to explain that memory blocks are made of memory sections.
> The documentation update is mostly provided by Nathan.
> 
> Also, as end_phys_index in code is actually not the end section id, but
> the end memory block id, which should always be the same as phys_index.
> So it is removed here.
> 
> Signed-off-by: Li Zhong 

Reviewed-by: Zhang Yanfei 

Still the nitpick there.

> ---
>  Documentation/memory-hotplug.txt |  125 
> +++---
>  drivers/base/memory.c|   12 
>  2 files changed, 61 insertions(+), 76 deletions(-)
> 
> diff --git a/Documentation/memory-hotplug.txt 
> b/Documentation/memory-hotplug.txt
> index 58340d5..1aa239f 100644
> --- a/Documentation/memory-hotplug.txt
> +++ b/Documentation/memory-hotplug.txt
> @@ -88,16 +88,21 @@ phase by hand.
>  
>  1.3. Unit of Memory online/offline operation
>  
> -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole 
> memory
> -into chunks of the same size. The chunk is called a "section". The size of
> -a section is architecture dependent. For example, power uses 16MiB, ia64 uses
> -1GiB. The unit of online/offline operation is "one section". (see Section 3.)
> +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
> +into chunks of the same size. These chunks are called "sections". The size of
> +a memory section is architecture dependent. For example, power uses 16MiB, 
> ia64
> +uses 1GiB.
>  
> -To determine the size of sections, please read this file:
> +Memory sections are combined into chunks referred to as "memory blocks". The
> +size of a memory block is architecture dependent and represents the logical
> +unit upon which memory online/offline operations are to be performed. The
> +default size of a memory block is the same as memory section size unless an
> +architecture specifies otherwise. (see Section 3.)
> +
> +To determine the size (in bytes) of a memory block please read this file:
>  
>  /sys/devices/system/memory/block_size_bytes
>  
> -This file shows the size of sections in byte.
>  
>  ---
>  2. Kernel Configuration
> @@ -123,42 +128,35 @@ config options.
>  (CONFIG_ACPI_CONTAINER).
>  This option can be kernel module too.
>  
> +
>  
> -4 sysfs files for memory hotplug
> +3 sysfs files for memory hotplug
>  
> -All sections have their device information in sysfs.  Each section is part of
> -a memory block under /sys/devices/system/memory as
> +All memory blocks have their device information in sysfs.  Each memory block
> +is described under /sys/devices/system/memory as
>  
>  /sys/devices/system/memory/memoryXXX
> -(XXX is the section id.)
> +(XXX is the memory block id.)
>  
> -Now, XXX is defined as (start_address_of_section / section_size) of the first
> -section contained in the memory block.  The files 'phys_index' and
> -'end_phys_index' under each directory report the beginning and end section 
> id's
> -for the memory block covered by the sysfs directory.  It is expected that all
> +For the memory block covered by the sysfs directory.  It is expected that all
>  memory sections in this range are present and no memory holes exist in the
>  range. Currently there is no way to determine if there is a memory hole, but
>  the existence of one should not affect the hotplug capabilities of the memory
>  block.
>  
> -For example, assume 1GiB section size. A device for a memory starting at
> +For example, assume 1GiB memory block size. A device for a memory starting at
>  0x1 is /sys/device/system/memory/memory4
>  (0x1 / 1Gib = 4)
>  This device covers address range [0x1 ... 0x14000)
>  
> -Under each section, you can see 4 or 5 files, the end_phys_index file being
> -a recent addition and not present on older kernels.
> +Under each memory block, you can see 4 files:
>  
> -/sys/devices/system/memory/memoryXXX/start_phys_index
> -/sys/devices/system/memory/memoryXXX/end_phys_index
> +/sys/devices/system/memory/memoryXXX/phys_index
>  /sys/devices/system/memory/memoryXXX/phys_device
>  /sys/devices/system/memory/memoryXXX/state
>  /sys/devices/system/memory/memoryXXX/removable
>  
> -'phys_

Re: [PATCH v3 0/5] hugetlb: add support gigantic page allocation at runtime

2014-04-14 Thread Zhang Yanfei

Clear explanation and implementation!

Reviewed-by: Zhang Yanfei 

On 04/11/2014 01:58 AM, Luiz Capitulino wrote:
> [Full introduction right after the changelog]
> 
> Changelog
> -
> 
> v3
> 
> - Dropped unnecessary WARN_ON() call [Kirill]
> - Always check if the pfn range lies within a zone [Yasuaki]
> - Renamed some function arguments for consistency
> 
> v2
> 
> - Rewrote allocation loop to avoid scanning unless PFNs [Yasuaki]
> - Dropped incomplete multi-arch support [Naoya]
> - Added patch to drop __init from prep_compound_gigantic_page()
> - Restricted the feature to x86_64 (more details in patch 5/5)
> - Added review-bys plus minor changelog changes
> 
> Introduction
> 
> 
> The HugeTLB subsystem uses the buddy allocator to allocate hugepages during
> runtime. This means that hugepages allocation during runtime is limited to
> MAX_ORDER order. For archs supporting gigantic pages (that is, page sizes
> greater than MAX_ORDER), this in turn means that those pages can't be
> allocated at runtime.
> 
> HugeTLB supports gigantic page allocation during boottime, via the boot
> allocator. To this end the kernel provides the command-line options
> hugepagesz= and hugepages=, which can be used to instruct the kernel to
> allocate N gigantic pages during boot.
> 
> For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages can
> be allocated and freed at runtime. If one wants to allocate 1G gigantic pages,
> this has to be done at boot via the hugepagesz= and hugepages= command-line
> options.
> 
> Now, gigantic page allocation at boottime has two serious problems:
> 
>  1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
> evenly distributes boottime allocated hugepages among nodes.
> 
> For example, suppose you have a four-node NUMA machine and want
> to allocate four 1G gigantic pages at boottime. The kernel will
> allocate one gigantic page per node.
> 
> On the other hand, we do have users who want to be able to specify
> which NUMA node gigantic pages should allocated from. So that they
> can place virtual machines on a specific NUMA node.
> 
>  2. Gigantic pages allocated at boottime can't be freed
> 
> At this point it's important to observe that regular hugepages allocated
> at runtime don't have those problems. This is so because HugeTLB interface
> for runtime allocation in sysfs supports NUMA and runtime allocated pages
> can be freed just fine via the buddy allocator.
> 
> This series adds support for allocating gigantic pages at runtime. It does
> so by allocating gigantic pages via CMA instead of the buddy allocator.
> Releasing gigantic pages is also supported via CMA. As this series builds
> on top of the existing HugeTLB interface, it makes gigantic page allocation
> and releasing just like regular sized hugepages. This also means that NUMA
> support just works.
> 
> For example, to allocate two 1G gigantic pages on node 1, one can do:
> 
>  # echo 2 > \
>/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> And, to release all gigantic pages on the same node:
> 
>  # echo 0 > \
>/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
> 
> Please, refer to patch 5/5 for full technical details.
> 
> Finally, please note that this series is a follow up for a previous series
> that tried to extend the command-line options set to be NUMA aware:
> 
>  http://marc.info/?l=linux-mm&m=139593335312191&w=2
> 
> During the discussion of that series it was agreed that having runtime
> allocation support for gigantic pages was a better solution.
> 
> Luiz Capitulino (5):
>   hugetlb: prep_compound_gigantic_page(): drop __init marker
>   hugetlb: add hstate_is_gigantic()
>   hugetlb: update_and_free_page(): don't clear PG_reserved bit
>   hugetlb: move helpers up in the file
>   hugetlb: add support for gigantic page allocation at runtime
> 
>  include/linux/hugetlb.h |   5 +
>  mm/hugetlb.c| 336 
> ++--
>  2 files changed, 245 insertions(+), 96 deletions(-)
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-09 Thread Zhang Yanfei

On 04/10/2014 11:14 AM, Li Zhong wrote:
> On Wed, 2014-04-09 at 08:49 -0700, Dave Hansen wrote:
>> On 04/09/2014 02:20 AM, Li Zhong wrote:
>>> Or do you mean we don't need to expose any information related to
>>> SECTION to userspace? 
>>
>> Right, we don't need to expose sections themselves to userspace.  Do we?
>>
> OK, I agree with that. 
> 
> Yanfei, I recall you once expressed your preference for section
> numbers? 

Hmmm Looking at the git log:

commit d33601644cd3b09afb2edd9474517edc441c8fad
Author: Nathan Fontenot 
Date:   Thu Jan 20 10:44:29 2011 -0600

memory hotplug: Update phys_index to [start|end]_section_nr

Update the 'phys_index' property of a the memory_block struct to be
called start_section_nr, and add a end_section_nr property.  The
data tracked here is the same but the updated naming is more in line
with what is stored here, namely the first and last section number
that the memory block spans.

The names presented to userspace remain the same, phys_index for
start_section_nr and end_phys_index for end_section_nr, to avoid breaking
anything in userspace.

This also updates the node sysfs code to be aware of the new capability for
a memory block to contain multiple memory sections and be aware of the 
memory
block structure name changes (start_section_nr).  This requires an 
additional
parameter to unregister_mem_sect_under_nodes so that we know which memory
section of the memory block to unregister.

Signed-off-by: Nathan Fontenot 
Reviewed-by: Robin Holt 
Reviewed-by: KAMEZAWA Hiroyuki 
Signed-off-by: Greg Kroah-Hartman 

So obviously, Nathan added the end_phys_index sysfile to present the last 
section
number of a memory block (for end_section_nr), but what he did in the patch
seems not matching the log.

So what is the motivation of adding this 'end_phys_index' file here?

Confused.

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-09 Thread Zhang Yanfei

On 04/10/2014 01:39 AM, Nathan Fontenot wrote:
> On 04/08/2014 02:47 PM, Dave Hansen wrote:
>>
>> That document really needs to be updated to stop referring to sections
>> (at least in the descriptions of the user interface).  We can not change
>> the units of phys_index/end_phys_index without also changing
>> block_size_bytes.
>>
> 
> Here is a first pass at updating the documentation.
> 
> I have tried to update the documentation to refer to memory blocks instead
> of memory sections where appropriate and added a paragraph to explain
> that memory blocks are mode of memory sections.
> 
> Thoughts?

I think the change is basically ok. So

Reviewed-by: Zhang Yanfei 

Only one nitpick below.

> 
> -Nathan
> ---
>  Documentation/memory-hotplug.txt |  113 
> ---
>  1 file changed, 59 insertions(+), 54 deletions(-)
> 
> Index: linux/Documentation/memory-hotplug.txt
> ===
> --- linux.orig/Documentation/memory-hotplug.txt
> +++ linux/Documentation/memory-hotplug.txt
> @@ -88,16 +88,21 @@ phase by hand.
>  
>  1.3. Unit of Memory online/offline operation
>  
> -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole 
> memory
> -into chunks of the same size. The chunk is called a "section". The size of
> -a section is architecture dependent. For example, power uses 16MiB, ia64 uses
> -1GiB. The unit of online/offline operation is "one section". (see Section 3.)
> +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
> +into chunks of the same size. These chunks are called "sections". The size of
> +a memory section is architecture dependent. For example, power uses 16MiB, 
> ia64
> +uses 1GiB.
> +
> +Memory sections are combined into chunks referred to as "memory blocks". The
> +size of a memory block is architecture dependent and represents the logical
> +unit upon which memory online/offline operations are to be performed. The
> +default size of a memory block is the same as memory section size unless an
> +architecture specifies otherwise. (see Section 3.)
>  
> -To determine the size of sections, please read this file:
> +To determine the size (in bytes) of a memory block please read this file:
>  
>  /sys/devices/system/memory/block_size_bytes
>  
> -This file shows the size of sections in byte.
>  
>  ---
>  2. Kernel Configuration
> @@ -123,14 +128,15 @@ config options.
>  (CONFIG_ACPI_CONTAINER).
>  This option can be kernel module too.
>  
> +
>  
> -4 sysfs files for memory hotplug
> +3 sysfs files for memory hotplug
>  
> -All sections have their device information in sysfs.  Each section is part of
> -a memory block under /sys/devices/system/memory as
> +All memory blocks have their device information in sysfs.  Each memory block
> +is described under /sys/devices/system/memory as
>  
>  /sys/devices/system/memory/memoryXXX
> -(XXX is the section id.)
> +(XXX is the memory block id.)
>  
>  Now, XXX is defined as (start_address_of_section / section_size) of the first
>  section contained in the memory block.  The files 'phys_index' and
> @@ -141,13 +147,13 @@ range. Currently there is no way to dete
>  the existence of one should not affect the hotplug capabilities of the memory
>  block.
>  
> -For example, assume 1GiB section size. A device for a memory starting at
> +For example, assume 1GiB memory block size. A device for a memory starting at
>  0x1 is /sys/device/system/memory/memory4
>  (0x1 / 1Gib = 4)
>  This device covers address range [0x1 ... 0x14000)
>  
> -Under each section, you can see 4 or 5 files, the end_phys_index file being
> -a recent addition and not present on older kernels.
> +Under each memory block, you can see 4 or 5 files, the end_phys_index file
> +being a recent addition and not present on older kernels.
>  
>  /sys/devices/system/memory/memoryXXX/start_phys_index
>  /sys/devices/system/memory/memoryXXX/end_phys_index
> @@ -185,6 +191,7 @@ For example:
>  A backlink will also be created:
>  /sys/devices/system/memory/memory9/node0 -> ../../node/node0
>  
> +
>  
>  4. Physical memory hot-add phase
>  
> @@ -227,11 +234,10 @@ You can tell the physical address of new
>  
>  % echo start_address_of_new_memory > /sys/devices/system/memory/probe
>  
> -Then, [start_address_of_new_memory, start_address_of_new_memory + 
> section_size)
> -memory range is

Re: [PATCH v3] support madvise(MADV_FREE)

2014-04-07 Thread Zhang Yanfei

+1566,13 @@ int try_to_unmap(struct page *page, enum ttu_flags 
> flags)
>  int try_to_munlock(struct page *page)
>  {
>   int ret;
> + struct rmap_private rp = {
> + .flags = TTU_MUNLOCK,
> + };
> +
>   struct rmap_walk_control rwc = {
>   .rmap_one = try_to_unmap_one,
> - .arg = (void *)TTU_MUNLOCK,
> + .arg = &rp,
>   .done = page_not_mapped,
>   /*
>* We don't bother to try to find the munlocked page in
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a9c74b409681..7f1c5a26bc41 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -684,6 +684,7 @@ enum page_references {
>   PAGEREF_RECLAIM_CLEAN,
>   PAGEREF_KEEP,
>   PAGEREF_ACTIVATE,
> + PAGEREF_DISCARD,
>  };
>  
>  static enum page_references page_check_references(struct page *page,
> @@ -691,9 +692,12 @@ static enum page_references page_check_references(struct 
> page *page,
>  {
>   int referenced_ptes, referenced_page;
>   unsigned long vm_flags;
> + int is_pte_dirty;
> +
> + VM_BUG_ON_PAGE(!PageLocked(page), page);
>  
>   referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
> -   &vm_flags);
> +   &vm_flags, &is_pte_dirty);
>   referenced_page = TestClearPageReferenced(page);
>  
>   /*
> @@ -734,6 +738,18 @@ static enum page_references page_check_references(struct 
> page *page,
>   return PAGEREF_KEEP;
>   }
>  
> + /*
> +  * We should check PageDirty because swap-in page by read fault
> +  * will be swapcache and pte point out the page doesn't have
> +  * dirty bit so only pte dirtiness check isn't enough. In this case,
> +  * it would be good to check PG_swapcache to filter it out.
> +  * If the page is removed from swapcache, it must have PG_dirty
> +  * so we should check it to prevent purging non-lazyfree page.
> +  */

Nice explanation. So this is the key point to know how to detect a lazyfree 
page,
I think something like this can be put in the patch log, too.

Thanks
Zhang

> + if (PageAnon(page) && !is_pte_dirty &&
> + !PageSwapCache(page) && !PageDirty(page))
> + return PAGEREF_DISCARD;
> +
>   /* Reclaim if clean, defer dirty pages to writeback */
>   if (referenced_page && !PageSwapBacked(page))
>   return PAGEREF_RECLAIM_CLEAN;
> @@ -932,6 +948,8 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>   goto activate_locked;
>   case PAGEREF_KEEP:
>   goto keep_locked;
> + case PAGEREF_DISCARD:
> + goto discard;
>   case PAGEREF_RECLAIM:
>   case PAGEREF_RECLAIM_CLEAN:
>   ; /* try to reclaim the page below */
> @@ -957,6 +975,7 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>* processes. Try to unmap it here.
>*/
>   if (page_mapped(page) && mapping) {
> +discard:
>   switch (try_to_unmap(page, ttu_flags)) {
>   case SWAP_FAIL:
>   goto activate_locked;
> @@ -964,6 +983,13 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>   goto keep_locked;
>   case SWAP_MLOCK:
>   goto cull_mlocked;
> + case SWAP_DISCARD:
> + VM_BUG_ON_PAGE(PageSwapCache(page), page);
> + if (!page_freeze_refs(page, 1))
> + goto keep_locked;
> + __clear_page_locked(page);
> + count_vm_event(PGLAZYFREED);
> + goto free_it;
>   case SWAP_SUCCESS:
>   ; /* try to free the page below */
>   }
> @@ -1688,7 +1714,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
>   }
>  
>   if (page_referenced(page, 0, sc->target_mem_cgroup,
> - &vm_flags)) {
> + &vm_flags, NULL)) {
>   nr_rotated += hpage_nr_pages(page);
>   /*
>* Identify referenced, file-backed active pages and
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index def5dd2fbe61..2d80f7ed495d 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -789,6 +789,7 @@ const char * const vmstat_text[] = {
>  
>   "pgfault",
>   "pgmajfault",
> + "pglazyfreed",
>  
>   TEXTS_FOR_ZONES("pgrefill")
>   TEXTS_FOR_ZONES("pgsteal_kswapd")
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] mm: page_alloc: Do not cache reclaim distances

2014-04-07 Thread Zhang Yanfei

On 04/08/2014 06:34 AM, Mel Gorman wrote:
> pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by
> zone_reclaim due to its distance. As it is expected that zone_reclaim_mode
> will be rarely enabled it is unreasonable for all machines to take a penalty.
> Fortunately, the zone_reclaim_mode() path is already slow and it is the path
> that takes the hit.
> 
> Signed-off-by: Mel Gorman 

Reviewed-by: Zhang Yanfei 

> ---
>  include/linux/mmzone.h |  1 -
>  mm/page_alloc.c| 15 +--
>  2 files changed, 1 insertion(+), 15 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9b61b9b..564b169 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -757,7 +757,6 @@ typedef struct pglist_data {
>   unsigned long node_spanned_pages; /* total size of physical page
>range, including holes */
>   int node_id;
> - nodemask_t reclaim_nodes;   /* Nodes allowed to reclaim from */
>   wait_queue_head_t kswapd_wait;
>   wait_queue_head_t pfmemalloc_wait;
>   struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a256f85..574928e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1863,16 +1863,7 @@ static bool zone_local(struct zone *local_zone, struct 
> zone *zone)
>  
>  static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>  {
> - return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
> -}
> -
> -static void __paginginit init_zone_allows_reclaim(int nid)
> -{
> - int i;
> -
> - for_each_online_node(i)
> - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> - node_set(i, NODE_DATA(nid)->reclaim_nodes);
> + return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < 
> RECLAIM_DISTANCE;
>  }
>  
>  #else/* CONFIG_NUMA */
> @@ -1906,9 +1897,6 @@ static bool zone_allows_reclaim(struct zone 
> *local_zone, struct zone *zone)
>   return true;
>  }
>  
> -static inline void init_zone_allows_reclaim(int nid)
> -{
> -}
>  #endif   /* CONFIG_NUMA */
>  
>  /*
> @@ -4917,7 +4905,6 @@ void __paginginit free_area_init_node(int nid, unsigned 
> long *zones_size,
>  
>   pgdat->node_id = nid;
>   pgdat->node_start_pfn = node_start_pfn;
> - init_zone_allows_reclaim(nid);
>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>   get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>  #endif
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default

2014-04-07 Thread Zhang Yanfei

On 04/08/2014 06:34 AM, Mel Gorman wrote:
> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.
> 
> Signed-off-by: Mel Gorman 

Reviewed-by: Zhang Yanfei 

> ---
>  Documentation/sysctl/vm.txt | 17 +
>  mm/page_alloc.c |  2 --
>  2 files changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index d614a9b..ff5da70 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -751,16 +751,17 @@ This is value ORed together of
>  2= Zone reclaim writes dirty pages out
>  4= Zone reclaim swaps pages
>  
> -zone_reclaim_mode is set during bootup to 1 if it is determined that pages
> -from remote zones will cause a measurable performance reduction. The
> -page allocator will then reclaim easily reusable pages (those page
> -cache pages that are currently not used) before allocating off node pages.
> -
> -It may be beneficial to switch off zone reclaim if the system is
> -used for a file server and all of memory should be used for caching files
> -from disk. In that case the caching effect is more important than
> +zone_reclaim_mode is disabled by default.  For file servers or workloads
> +that benefit from having their data cached, zone_reclaim_mode should be
> +left disabled as the caching effect is likely to be more important than
>  data locality.
>  
> +zone_reclaim may be enabled if it's known that the workload is partitioned
> +such that each partition fits within a NUMA node and that accessing remote
> +memory would cause a measurable performance reduction.  The page allocator
> +will then reclaim easily reusable pages (those page cache pages that are
> +currently not used) before allocating off node pages.
> +
>  Allowing zone reclaim to write out pages stops processes that are
>  writing large amounts of data from dirtying pages on other nodes. Zone
>  reclaim will write out dirty pages if a zone fills up and so effectively
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3bac76a..a256f85 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1873,8 +1873,6 @@ static void __paginginit init_zone_allows_reclaim(int 
> nid)
>   for_each_online_node(i)
>   if (node_distance(nid, i) <= RECLAIM_DISTANCE)
>   node_set(i, NODE_DATA(nid)->reclaim_nodes);
> - else
> - zone_reclaim_mode = 1;
>  }
>  
>  #else/* CONFIG_NUMA */
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-02 Thread Zhang Yanfei

On 04/03/2014 10:37 AM, Li Zhong wrote:
> On Thu, 2014-04-03 at 09:37 +0800, Zhang Yanfei wrote:
>> Add ccing
>>
>> On 04/02/2014 04:56 PM, Li Zhong wrote:
>>> I noticed the phys_index and end_phys_index under 
>>> /sys/devices/system/memory/memoryXXX/ have the same value, e.g.
>>> (for the test machine, one memory block has 8 sections, that is 
>>>  sections_per_block equals 8)
>>>
>>> # cd /sys/devices/system/memory/memory100/
>>> # cat phys_index end_phys_index 
>>> 0064
>>> 0064
>>>
>>> Seems they should reflect the start/end section number respectively, which 
>>> also matches what is said in Documentation/memory-hotplug.txt
>>
> Hi Yanfei, 
> 
> Thanks for the review. 
> 
>> Indeed. I've noticed this before. The value in 'end_phys_index' doesn't
>> match what it really means. But, the name itself is vague, it looks like
>> it is the index of some page frame. (we keep this name for compatibility?)
> 
> I guess so, Dave just reminded me that the RFC would also break
> userspace..
> 
> And now my plan is: 
>  leave the code unchanged
>  update the document, state the end_phys_index/phys_index are the same,
> and means the memory block index

Ah. I doubt whether there is userspace tool which is using the two sysfiles?
for example, the memory100 directory itself can tell us which block it is.
So why there is the two files under it give the same meaning.

If there is userspace tool using the two files, does it use 'end_phys_index'
in the correct way? That said, if a userspace tool knows what the 
'end_phys_index'
really mean, does it still need it since we have 'phys_index' with the same 
value?

>  [optional] create two new files start_sec_nr, end_sec_nr if needed

These two are the really meaningful sysfiles for userspace, IMO.

> 
> Do you have any other suggestions? 

No. I think we should wait for other guys to comment more.

Thanks.

> 
> Thanks, Zhong
> 
>>
>> The corresponding member in struct memory_block is:
>>
>> struct memory_block {
>> unsigned long start_section_nr;
>> unsigned long end_section_nr;
>> ...
>>
>> The two members seem to have the right name, and have the right value in 
>> kernel.
>>
>>
>>>
>>> This patch tries to modify that so the two files could show the start/end
>>> section number of the memory block.
>>>
>>> After this change, output of the above example looks like:
>>>
>>> # cat phys_index end_phys_index 
>>> 0320
>>> 0327
>>>
>>> Signed-off-by: Li Zhong 
>>> ---
>>>  drivers/base/memory.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>>> index bece691..b10f2fa 100644
>>> --- a/drivers/base/memory.c
>>> +++ b/drivers/base/memory.c
>>> @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device 
>>> *dev,
>>> struct memory_block *mem = to_memory_block(dev);
>>> unsigned long phys_index;
>>>  
>>> -   phys_index = mem->start_section_nr / sections_per_block;
>>> +   phys_index = mem->start_section_nr;
>>> return sprintf(buf, "%08lx\n", phys_index);
>>>  }
>>>  
>>> @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device 
>>> *dev,
>>> struct memory_block *mem = to_memory_block(dev);
>>> unsigned long phys_index;
>>>  
>>> -   phys_index = mem->end_section_nr / sections_per_block;
>>> +   phys_index = mem->end_section_nr;
>>> return sprintf(buf, "%08lx\n", phys_index);
>>>  }
>>>  
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>>>
>>
>>
> 
> 
> .
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-02 Thread Zhang Yanfei

Add ccing

On 04/02/2014 04:56 PM, Li Zhong wrote:
> I noticed the phys_index and end_phys_index under 
> /sys/devices/system/memory/memoryXXX/ have the same value, e.g.
> (for the test machine, one memory block has 8 sections, that is 
>  sections_per_block equals 8)
> 
> # cd /sys/devices/system/memory/memory100/
> # cat phys_index end_phys_index 
> 0064
> 0064
> 
> Seems they should reflect the start/end section number respectively, which 
> also matches what is said in Documentation/memory-hotplug.txt

Indeed. I've noticed this before. The value in 'end_phys_index' doesn't
match what it really means. But, the name itself is vague, it looks like
it is the index of some page frame. (we keep this name for compatibility?)

The corresponding member in struct memory_block is:

struct memory_block {
unsigned long start_section_nr;
unsigned long end_section_nr;
...

The two members seem to have the right name, and have the right value in kernel.


> 
> This patch tries to modify that so the two files could show the start/end
> section number of the memory block.
> 
> After this change, output of the above example looks like:
> 
> # cat phys_index end_phys_index 
> 0320
> 0327
> 
> Signed-off-by: Li Zhong 
> ---
>  drivers/base/memory.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index bece691..b10f2fa 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -114,7 +114,7 @@ static ssize_t show_mem_start_phys_index(struct device 
> *dev,
>   struct memory_block *mem = to_memory_block(dev);
>   unsigned long phys_index;
>  
> - phys_index = mem->start_section_nr / sections_per_block;
> + phys_index = mem->start_section_nr;
>   return sprintf(buf, "%08lx\n", phys_index);
>  }
>  
> @@ -124,7 +124,7 @@ static ssize_t show_mem_end_phys_index(struct device *dev,
>   struct memory_block *mem = to_memory_block(dev);
>   unsigned long phys_index;
>  
> - phys_index = mem->end_section_nr / sections_per_block;
> + phys_index = mem->end_section_nr;
>   return sprintf(buf, "%08lx\n", phys_index);
>  }
>  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] madvise: Correct the comment of MADV_DODUMP flag

2014-03-31 Thread Zhang Yanfei

s/MADV_NODUMP/MADV_DONTDUMP/

Signed-off-by: Zhang Yanfei 
---
 include/uapi/asm-generic/mman-common.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/uapi/asm-generic/mman-common.h 
b/include/uapi/asm-generic/mman-common.h
index 4164529..ddc3b36 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -50,7 +50,7 @@

 #define MADV_DONTDUMP   16 /* Explicity exclude from the core dump,
   overrides the coredump filter bits */
-#define MADV_DODUMP17  /* Clear the MADV_NODUMP flag */
+#define MADV_DODUMP17  /* Clear the MADV_DONTDUMP flag */

 /* compatibility flags */
 #define MAP_FILE   0
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC 0/6] mm: support madvise(MADV_FREE)

2014-03-14 Thread Zhang Yanfei

Hello Minchan

On 03/14/2014 02:37 PM, Minchan Kim wrote:
> This patch is an attempt to support MADV_FREE for Linux.
> 
> Rationale is following as.
> 
> Allocators call munmap(2) when user call free(3) if ptr is
> in mmaped area. But munmap isn't cheap because it have to clean up
> all pte entries, unlinking a vma and returns free pages to buddy
> so overhead would be increased linearly by mmaped area's size.
> So they like madvise_dontneed rather than munmap.
> 
> "dontneed" holds read-side lock of mmap_sem so other threads
> of the process could go with concurrent page faults so it is
> better than munmap if it's not lack of address space.
> But the problem is that most of allocator reuses that address
> space soonish so applications see page fault, page allocation,
> page zeroing if allocator already called madvise_dontneed
> on the address space.
> 
> For avoidng that overheads, other OS have supported MADV_FREE.
> The idea is just mark pages as lazyfree when madvise called
> and purge them if memory pressure happens. Otherwise, VM doesn't
> detach pages on the address space so application could use
> that memory space without above overheads.

I didn't look into the code. Does this mean we just keep the vma,
the pte entries, and page itself for later possible reuse? If so,
how can we reuse the vma? The kernel would mark the vma kinds of
special so that it can be reused other than unmapped? Do you have
an example about this reuse?

Another thing is when I search MADV_FREE in the internet, I see that
Rik posted the similar patch in 2007 but that patch didn't
go into the upstream kernel.  And some explanation from Andrew:

--
 lazy-freeing-of-memory-through-madv_free.patch

 
lazy-freeing-of-memory-through-madv_free-vs-mm-madvise-avoid-exclusive-mmap_sem.patch

 restore-madv_dontneed-to-its-original-linux-behaviour.patch



I think the MADV_FREE changes need more work:



We need crystal-clear statements regarding the present functionality, the new

functionality and how these relate to the spec and to implmentations in other

OS'es.  Once we have that info we are in a position to work out whether the

code can be merged as-is, or if additional changes are needed.



Because right now, I don't know where we are with respect to these things and

I doubt if many of our users know either.  How can Michael write a manpage for

this is we don't tell him what it all does?
--

Thanks
Zhang Yanfei

> 
> I tweaked jamalloc to use MADV_FREE for the testing.
> 
> diff --git a/src/chunk_mmap.c b/src/chunk_mmap.c
> index 8a42e75..20e31af 100644
> --- a/src/chunk_mmap.c
> +++ b/src/chunk_mmap.c
> @@ -131,7 +131,7 @@ pages_purge(void *addr, size_t length)
>  #  else
>  #error "No method defined for purging unused dirty pages."
>  #  endif
> -   int err = madvise(addr, length, JEMALLOC_MADV_PURGE);
> +   int err = madvise(addr, length, 5);
> unzeroed = (JEMALLOC_MADV_ZEROS == false || err != 0);
>  #  undef JEMALLOC_MADV_PURGE
>  #  undef JEMALLOC_MADV_ZEROS
> 
> 
> RAM 2G, CPU 4, ebizzy benchmark(./ebizzy -S 30 -n 512)
> 
> (1.1) stands for 1 process and 1 thread so for exmaple,
> (1.4) is 1 process and 4 thread.
> 
> vanilla jemalloc   patched jemalloc
> 
> 1.1   1.1
> records:  5  records:  5
> avg:  7404.60avg:  14059.80
> std:  116.67(1.58%)  std:  93.92(0.67%)
> max:  7564.00max:  14152.00
> min:  7288.00min:  13893.00
> 1.4   1.4
> records:  5  records:  5
> avg:  16160.80   avg:  30173.00
> std:  509.80(3.15%)  std:  3050.72(10.11%)
> max:  16728.00   max:  33989.00
> min:  15216.00   min:  25173.00
> 1.8   1.8
> records:  5  records:  5
> avg:  16003.00   avg:  30080.20
> std:  290.40(1.81%)  std:  2063.57(6.86%)
> max:  16537.00   max:  32735.00
> min:  15727.00   min:  27381.00
> 4.1   4.1
> records:  5  records:  5
> avg:  4003.60avg:  8064.80
> std:  65.33(1.63%)   std:  143.89(1.78%)
> max:  4118.00max:  8319.00
> min:  3921.00min:  7888.00
> 4.4   4.4
> records:  5  records:  5
> avg:  3907.40avg:  7199.80
> std:  48.68(1.25%)   std:  80.21(1.11%)
> max:  3997.00max:  7320.00
> min:  3863.00min:  7113.00
> 4.8   4.8
> records:  5  records:  5
> avg:  3893.00avg:  7195.20
> std:  19.11(0.49%)   std

Re: [PATCH][RFC] mm: warning message for vm_map_ram about vm size

2014-03-09 Thread Zhang Yanfei

On 03/10/2014 01:47 PM, Minchan Kim wrote:
> Hi Giho,
> 
> On Mon, Mar 10, 2014 at 01:57:07PM +0900, Gioh Kim wrote:
>> Hi,
>>
>> I have a failure of allocation of virtual memory on ARMv7 based platform.
>>
>> I called alloc_page()/vm_map_ram() for allocation/mapping pages.
>> Virtual memory space exhausting problem occurred.
>> I checked virtual memory space and found that there are too many 4MB chunks.
>>
>> I thought that if just one page in the 4MB chunk lives long, 
>> the entire chunk cannot be freed. Therefore new chunk is created again and 
>> again.
>>
>> In my opinion, the vm_map_ram() function should be used for temporary mapping
>> and/or short term memory mapping. Otherwise virtual memory is wasted.
>>
>> I am not sure if my opinion is correct. If it is, please add some warning 
>> message
>> about the vm_map_ram().
>>
>>
>>
>> ---8<---
>>
>> Subject: [PATCH] mm: warning comment for vm_map_ram
>>
>> vm_map_ram can occur locking of virtual memory space
>> because if only one page lives long in one vmap_block,
>> it takes 4MB (1024-times more than one page) space.
> 
> For clarification, vm_map_ram has fragment problem because it
> couldn't purge a chunk(ie, 4M address space) if there is a pinning
> object in that addresss space so it could consume all VMALLOC
> address space easily.
> 
> We can fix the fragementaion problem with using vmap instead of
> vm_map_ram but it wouldn't a good solution because vmap is much
> slower than vm_map_ram for VMAP_MAX_ALLOC below. In my x86 machine,
> vm_map_ram is 5 times faster than vmap.
> 
> AFAICR, some proprietary GPU driver uses that function heavily so
> performance would be really important so I want to stick to use
> vm_map_ram.
> 
> Another option is that caller should separate long-life and short-life
> object and use vmap for long-life but vm_map_ram for short-life.
> But it's not a good solution because it's hard for allocator layer
> to detect it that how customer lives with the object.

Indeed. So at least the note comment should be added.

> 
> So I thought to fix that problem with revert [1] and adding more
> logic to solve fragmentation problem and make bitmap search
> operation more efficient by caching the hole. It might handle
> fragmentation at the moment but it would make more IPI storm for
> TLB flushing as time goes by so that it would mitigate API itself
> so using for only temporal object is too limited but it's best at the
> moment. I am supporting your opinion.
> 
> Let's add some notice message to user.
> 
> [1] [3fcd76e8028, mm/vmalloc.c: remove dead code in vb_alloc]
> 
>>
>> Change-Id: I6f5919848cf03788b5846b7d850d66e4d93ac39a
>> Signed-off-by: Gioh Kim 
>> ---
>>  mm/vmalloc.c |4 
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 0fdf968..2de1d1b 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -1083,6 +1083,10 @@ EXPORT_SYMBOL(vm_unmap_ram);
>>   * @node: prefer to allocate data structures on this node
>>   * @prot: memory protection to use. PAGE_KERNEL for regular RAM
>>   *
>> + * This function should be used for TEMPORARY mapping. If just one page 
>> lives i
>> + * long, it would occupy 4MB vm size permamently. 100 pages (just 400KB) 
>> could
>> + * takes 400MB with bad luck.
>> + *
> 
> If you use this function for below VMAP_MAX_ALLOC pages, it could be 
> faster
> than vmap so it's good but if you mix long-life and short-life object
> with vm_map_ram, it could consume lots of address space by fragmentation(
> expecially, 32bit machine) so you could see failure in the end.
> So, please use this function for short-life object.

Minchan's is better. So I suggest Giho post another patch with this comment
and take what Minchan said above to the commit log. And you can feel free to
add:

Reviewed-by: Zhang Yanfei 

Thanks.

> 
>>   * Returns: a pointer to the address that has been mapped, or %NULL on 
>> failure
>>   */
>>  void *vm_map_ram(struct page **pages, unsigned int count, int node, 
>> pgprot_t prot)
>> --
>> 1.7.9.5
>>
>> Gioh Kim / 김 기 오
>> Research Engineer
>> Advanced OS Technology Team
>> Software Platform R&D Lab.
>> Mobile: 82-10-7322-5548  
>> E-mail: gioh@lge.com 
>> 19, Yangjae-daero 11gil
>> Seocho-gu, Seoul 137-130, Korea
>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majord...@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 7/9] mm: thrash detection-based file cache sizing

2014-01-14 Thread Zhang Yanfei

>>>> + *out of memory.
>>>> + *
>>>> + * 2. When a page is accessed for the second time, it is promoted to
>>>> + *the active list, shrinking the inactive list by one slot.  This
>>>> + *also slides all inactive pages that were faulted into the cache
>>>> + *more recently than the activated page towards the tail of the
>>>> + *inactive list.
>>>> + *
>>>
>>> Nitpick, how about the reference bit?
>>
>> What do you mean?
>>
> 
> Sorry, I mean the PG_referenced flag. I thought when a page is accessed
> for the second time only PG_referenced flag  will be set instead of be
> promoted to active list.
> 

No. I try to explain a bit. For mapped file pages, if the second access
occurs on a different page table entry, the page is surely promoted to active
list. But if the paged is always accessed from the same page table entry, it
was mistakenly evicted. This was fixed by Johannes already by reusing the
PG_referenced flag, for details, please refer to commit 64574746
("vmscan: detect mapped file pages used only once").

Correct me if I am wrong.

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/4] mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages

2013-12-06 Thread Zhang Yanfei

Hello

On 12/06/2013 04:41 PM, Joonsoo Kim wrote:
> Some part of putback_lru_pages() and putback_movable_pages() is
> duplicated, so it could confuse us what we should use.
> We can remove putback_lru_pages() since it is not really needed now.
> This makes us undestand and maintain the code more easily.
> 
> And comment on putback_movable_pages() is stale now, so fix it.
> 
> Signed-off-by: Joonsoo Kim 
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f5096b5..7782b74 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -35,7 +35,6 @@ enum migrate_reason {
>  
>  #ifdef CONFIG_MIGRATION
>  
> -extern void putback_lru_pages(struct list_head *l);
>  extern void putback_movable_pages(struct list_head *l);
>  extern int migrate_page(struct address_space *,
>   struct page *, struct page *, enum migrate_mode);
> @@ -59,7 +58,6 @@ extern int migrate_page_move_mapping(struct address_space 
> *mapping,
>  #else
>  
>  static inline void putback_lru_pages(struct list_head *l) {}

If you want to remove the function, this should be removed, right?

> -static inline void putback_movable_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t x,
>   unsigned long private, enum migrate_mode mode, int reason)
>   { return -ENOSYS; }
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index b7c1716..1debdea 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1569,7 +1569,13 @@ static int __soft_offline_page(struct page *page, int 
> flags)
>   ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
>   MIGRATE_SYNC, MR_MEMORY_FAILURE);
>   if (ret) {
> - putback_lru_pages(&pagelist);
> + if (!list_empty(&pagelist)) {
> + list_del(&page->lru);
> + dec_zone_page_state(page, NR_ISOLATED_ANON +
> + page_is_file_cache(page));
> + putback_lru_page(page);
> + }
> +
>   pr_info("soft offline: %#lx: migration failed %d, type 
> %lx\n",
>   pfn, ret, page->flags);
>   if (ret > 0)
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 1f59ccc..8392de4 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -71,28 +71,12 @@ int migrate_prep_local(void)
>  }
>  
>  /*
> - * Add isolated pages on the list back to the LRU under page lock
> - * to avoid leaking evictable pages back onto unevictable list.
> - */
> -void putback_lru_pages(struct list_head *l)
> -{
> - struct page *page;
> - struct page *page2;
> -
> - list_for_each_entry_safe(page, page2, l, lru) {
> - list_del(&page->lru);
> - dec_zone_page_state(page, NR_ISOLATED_ANON +
> - page_is_file_cache(page));
> - putback_lru_page(page);
> - }
> -}
> -
> -/*
>   * Put previously isolated pages back onto the appropriate lists
>   * from where they were once taken off for compaction/migration.
>   *
> - * This function shall be used instead of putback_lru_pages(),
> - * whenever the isolated pageset has been built by 
> isolate_migratepages_range()
> + * This function shall be used whenever the isolated pageset has been
> + * built from lru, balloon, hugetlbfs page. See isolate_migratepages_range()
> + * and isolate_huge_page().
>   */
>  void putback_movable_pages(struct list_head *l)
>  {
> @@ -1697,6 +1681,12 @@ int migrate_misplaced_page(struct page *page, struct 
> vm_area_struct *vma,
>   nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page,
>node, MIGRATE_ASYNC, MR_NUMA_MISPLACED);
>   if (nr_remaining) {
> + if (!list_empty(&migratepages)) {
> + list_del(&page->lru);
> + dec_zone_page_state(page, NR_ISOLATED_ANON +
> + page_is_file_cache(page));
> + putback_lru_page(page);
> + }
>   putback_lru_pages(&migratepages);
>   isolated = 0;
>   } else
> 


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE

2013-10-14 Thread Zhang Yanfei

Hello tejun, peter and yinghai

On 10/15/2013 04:55 AM, Tejun Heo wrote:
> Hello,
> 
> On Mon, Oct 14, 2013 at 01:37:20PM -0700, Yinghai Lu wrote:
>> The problem is how to define "amount necessary". If we can parse srat early,
>> then we could just map RAM for all boot nodes one time, instead of try some
>> small and then after SRAT table, expand it cover non-boot nodes.
> 
> Wouldn't that amount be fairly static and restricted?  If you wanna
> chunk memory init anyway, there's no reason to init more than
> necessary until smp stage is reached.  The more you do early, the more
> serialized you're, so wouldn't the goal naturally be initing the
> minimum possible?
> 
>> To keep non-boot numa node hot-removable. we need to page table (and other
>> that we allocate during boot stage) on ram of non boot nodes, or their
>> local node ram.  (share page table always should be on boot nodes).
> 
> The above assumes the followings,
> 
> * 4k page mappings.  It'd be nice to keep everything working for 4k
>   but just following SRAT isn't enough.  What if the non-hotpluggable
>   boot node doesn't stretch high enough and page table reaches down
>   too far?  This won't be an optional behavior, so it is actually
>   *likely* to happen on certain setups.
> 
> * Memory hotplug is at NUMA node granularity instead of device.
> 
>>> Optimizing NUMA boot just requires moving the heavy lifting to
>>> appropriate NUMA nodes.  It doesn't require that early boot phase
>>> should strictly follow NUMA node boundaries.
>>
>> At end of day, I like to see all numa system (ram/cpu/pci) could have
>> non boot nodes to be hot-removed logically. with any boot command
>> line.
> 
> I suppose you mean "without any boot command line"?  Sure, but, first
> of all, there is a clear performance trade-off, and, secondly, don't
> we want something finer grained?  Why would we want to that per-NUMA
> node, which is extremely coarse?
> 

Both ways seem ok enough *currently*. But what tejun always emphasizes
is the trade-off, or benefit / cost ratio. 

Yinghai and peter insist on the long-term plan. But it seems currently
no actual requirements and plans that *must* parse SRAT earlier comparing
to the current approach in this patchset, right?

Should we follow "Make it work first and optimize/beautify it later"?
I think if we have the scene that must parse SRAT earlier, I think tejun
will have no objection to it.

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE

2013-10-14 Thread Zhang Yanfei

Hello tejun,

On 10/14/2013 11:19 PM, Tejun Heo wrote:
> Hey,
> 
> On Mon, Oct 14, 2013 at 11:06:14PM +0800, Zhang Yanfei wrote:
>> a little difference here, consider a 16-GB node. If we parse SRAT earlier,
>> and still use the top-down allocation, and kernel image is loaded at 16MB,
>> we reserve other nodes but this 16GB node that kernel resides in is used
>> for boot-up allocation. So the page table is allocated from 16GB to 0.
>> The page table is allocated on top of the the memory as possible.
>>
>> But if we use this approach, no matter how large the page table is, we 
>> allocate the page table in low memory which is the case that hpa concerns
>> about the DMA.
> 
> Yeah, sure there will be cases where parsing SRAT would be better.
> 
>  4k mapping is in use, which is mostly for debugging && memory map is
>  composed such that the highest non-hotpluggable address is high
>  enough.
> 
> It's going in circles again but my point has always been that the
> above in itself don't seem to be substantial enough to justify
> putting, say, initrd loading before page table init.
> 
> Later some argued that bringing SRAT parsing earlier could help
> implementing finer grained hotplug, which would be an acceptable path
> to follow; however, that doesn't turn out to be true either.
> 
> * Again, it matter if and only if 4k mapping is in use.  Do we even
>   care?
> 
> * SRAT isn't enough.  The whole device tree needs to be parsed to put
>   page tables into local device.  It's a lot of churn, including major
>   updates to page table allocation, just to support debug 4k mapping
>   cases.  Doesn't make much sense to me.
> 
> So, SRAT's usefulness seems extremely limited - it helps if the user
> wants to use debug features along with memory hotplug on an extreme
> large machine with devices which have low DMA limit, and that's it.
> To me, it seems to be a poor argument.  Just declaring memory hotplug
> works iff large kernel mapping is in use feels like a pretty good
> trade-off to me, and I have no idea why I have to repeat all this,
> which I've written multiple times already, in a private thread again.
> 
> If the thread is to make progress, one has to provide counter
> arguments to the points raised.  It feels like I'm going in circle
> again.  The exact same content I wrote above has been repeated
> multiple times in the past discussions and I'm getting tired of doing
> it without getting any actual response.
> 
> When replying, please restore cc's and keep the whole body.
> 

Thanks for the whole explanation again. I was just raising some argument
that other guys raised before. I agree with what you said above and already
put some of them into the patch 4 description in v7 version.

Now could you please help reviewing the part2? As I said before, no matter
how we implement the part1, part2 is kind of independent.

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part2 v2 0/8] Arrange hotpluggable memory as ZONE_MOVABLE

2013-10-11 Thread Zhang Yanfei

Hello guys, this is the part2 of our memory hotplug work. This part
is based on the part1:
"x86, memblock: Allocate memory near kernel image before SRAT parsed"
which is base on 3.12-rc4.

You could refer part1 from: https://lkml.org/lkml/2013/10/10/644

Any comments are welcome! Thanks!

[Problem]

The current Linux cannot migrate pages used by the kerenl because
of the kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
When the pa is changed, we cannot simply update the pagetable and
keep the va unmodified. So the kernel pages are not migratable.

There are also some other issues will cause the kernel pages not migratable.
For example, the physical address may be cached somewhere and will be used.
It is not to update all the caches.

When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.

Modifying the kernel direct mapping mechanism is too difficult to do. And
it may cause the kernel performance down and unstable. So we use the following
way to do memory hotplug.


[What we are doing]

In Linux, memory in one numa node is divided into several zones. One of the
zones is ZONE_MOVABLE, which the kernel won't use.

In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use these memory.

To do this, we need ACPI's help.


[How we do this]

In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The memory
affinities in SRAT record every memory range in the system, and also, flags
specifying if the memory range is hotpluggable.
(Please refer to ACPI spec 5.0 5.2.16)

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow the users arranging hotpluggable as
   ZONE_MOVABLE.
   (This has been done by the MOVABLE_NODE functionality in Linux.)

2. when the system is booting, prevent bootmem allocator from allocating
   hotpluggable memory for the kernel before the memory initialization
   finishes.
   (This is what we are going to do. See below.)


[About this patch-set]

In previous part's patches, we have made the kernel allocate memory near
kernel image before SRAT parsed to avoid allocating hotpluggable memory
for kernel. So this patch-set does the following things:

1. Improve memblock to support flags, which are used to indicate different 
   memory type.

2. Mark all hotpluggable memory in memblock.memory[].

3. Make the default memblock allocator skip hotpluggable memory.

4. Improve "movable_node" boot option to have higher priority of movablecore
   and kernelcore boot option.

Change log v1 -> v2:
1. Rebase this part on the v7 version of part1
2. Fix bug: If movable_node boot option not specified, memblock still
   checks hotpluggable memory when allocating memory. 

Tang Chen (7):
  memblock, numa: Introduce flag into memblock
  memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark
hotpluggable regions
  memblock: Make memblock_set_node() support different memblock_type
  acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock
  acpi, numa, mem_hotplug: Mark all nodes the kernel resides
un-hotpluggable
  memblock, mem_hotplug: Make memblock skip hotpluggable regions if
needed
  x86, numa, acpi, memory-hotplug: Make movable_node have higher
priority

Yasuaki Ishimatsu (1):
  x86: get pg_data_t's memory from other node

 arch/metag/mm/init.c  |3 +-
 arch/metag/mm/numa.c  |3 +-
 arch/microblaze/mm/init.c |3 +-
 arch/powerpc/mm/mem.c |2 +-
 arch/powerpc/mm/numa.c|8 ++-
 arch/sh/kernel/setup.c|4 +-
 arch/sparc/mm/init_64.c   |5 +-
 arch/x86/mm/init_32.c |2 +-
 arch/x86/mm/init_64.c |2 +-
 arch/x86/mm/numa.c|   63 +--
 arch/x86/mm/srat.c|5 ++
 include/linux/memblock.h  |   39 ++-
 mm/memblock.c |  123 ++---
 mm/memory_hotplug.c   |1 +
 mm/page_alloc.c   |   28 ++-
 15 files changed, 252 insertions(+), 39 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part2 v2 1/8] x86: get pg_data_t's memory from other node

2013-10-11 Thread Zhang Yanfei

From: Yasuaki Ishimatsu 

If system can create movable node which all memory of the node is allocated
as ZONE_MOVABLE, setup_node_data() cannot allocate memory for the node's
pg_data_t. So, invoke memblock_alloc_nid(...MAX_NUMNODES) again to retry when
the first allocation fails. Otherwise, the system could failed to boot.
(We don't use memblock_alloc_try_nid() to retry because in this function,
if the allocation fails, it will panic the system.)

The node_data could be on hotpluggable node. And so could pagetable and
vmemmap. But for now, doing so will break memory hot-remove path.

A node could have several memory devices. And the device who holds node
data should be hot-removed in the last place. But in NUMA level, we don't
know which memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs
to which memory device. We only have node. So we can only do node hotplug.

But in virtualization, developers are now developing memory hotplug in qemu,
which support a single memory device hotplug. So a whole node hotplug will
not satisfy virtualization users.

So at last, we concluded that we'd better do memory hotplug and local node
things (local node node data, pagetable, vmemmap, ...) in two steps.
Please refer to https://lkml.org/lkml/2013/6/19/73

For now, we put node_data of movable node to another node, and then improve
it in the future.

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
Signed-off-by: Tang Chen 
Signed-off-by: Jiang Liu 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
Reviewed-by: Wanpeng Li 
Acked-by: Toshi Kani 
---
 arch/x86/mm/numa.c |   11 ---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 24aec58..e17db5d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -211,9 +211,14 @@ static void __init setup_node_data(int nid, u64 start, u64 
end)
 */
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
-   pr_err("Cannot find %zu bytes in node %d\n",
-  nd_size, nid);
-   return;
+   pr_warn("Cannot find %zu bytes in node %d, so try other nodes",
+   nd_size, nid);
+   nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES,
+  MAX_NUMNODES);
+   if (!nd_pa) {
+   pr_err("Cannot find %zu bytes in any node\n", nd_size);
+   return;
+   }
}
nd = __va(nd_pa);
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part2 v2 8/8] x86, numa, acpi, memory-hotplug: Make movable_node have higher priority

2013-10-11 Thread Zhang Yanfei

From: Tang Chen 

If users specify the original movablecore=nn@ss boot option, the kernel will
arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is similar
except it specifies ZONE_NORMAL ranges.

Now, if users specify "movable_node" in kernel commandline, the kernel will
arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users do this, all
the other movablecore=nn@ss and kernelcore=nn@ss options should be ignored.

For those who don't want this, just specify nothing. The kernel will act as
before.

Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
Reviewed-by: Wanpeng Li 
---
 mm/page_alloc.c |   28 ++--
 1 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..768ea0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5021,9 +5021,33 @@ static void __init find_zone_movable_pfns_for_nodes(void)
nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
int usable_nodes = nodes_weight(node_states[N_MEMORY]);
+   struct memblock_type *type = &memblock.memory;
+
+   /* Need to find movable_zone earlier when movable_node is specified. */
+   find_usable_zone_for_movable();
+
+   /*
+* If movable_node is specified, ignore kernelcore and movablecore
+* options.
+*/
+   if (movable_node_is_enabled()) {
+   for (i = 0; i < type->cnt; i++) {
+   if (!memblock_is_hotpluggable(&type->regions[i]))
+   continue;
+
+   nid = type->regions[i].nid;
+
+   usable_startpfn = PFN_DOWN(type->regions[i].base);
+   zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+   min(usable_startpfn, zone_movable_pfn[nid]) :
+   usable_startpfn;
+   }
+
+   goto out2;
+   }
 
/*
-* If movablecore was specified, calculate what size of
+* If movablecore=nn[KMG] was specified, calculate what size of
 * kernelcore that corresponds so that memory usable for
 * any allocation type is evenly spread. If both kernelcore
 * and movablecore are specified, then the value of kernelcore
@@ -5049,7 +5073,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
goto out;
 
/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
-   find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
 restart:
@@ -5140,6 +5163,7 @@ restart:
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;
 
+out2:
/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
for (nid = 0; nid < MAX_NUMNODES; nid++)
zone_movable_pfn[nid] =
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part2 v2 7/8] memblock, mem_hotplug: Make memblock skip hotpluggable regions if needed

2013-10-11 Thread Zhang Yanfei

From: Tang Chen 

Linux kernel cannot migrate pages used by the kernel. As a result, hotpluggable
memory used by the kernel won't be able to be hot-removed. To solve this
problem, the basic idea is to prevent memblock from allocating hotpluggable
memory for the kernel at early time, and arrange all hotpluggable memory in
ACPI SRAT(System Resource Affinity Table) as ZONE_MOVABLE when initializing
zones.

In the previous patches, we have marked hotpluggable memory regions with
MEMBLOCK_HOTPLUG flag in memblock.memory.

In this patch, we make memblock skip these hotpluggable memory regions in
the default top-down allocation function if movable_node boot option is
specified.

Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 include/linux/memblock.h |   18 ++
 mm/memblock.c|   12 
 mm/memory_hotplug.c  |1 +
 3 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 97480d3..bfc1dba 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -47,6 +47,10 @@ struct memblock {
 
 extern struct memblock memblock;
 extern int memblock_debug;
+#ifdef CONFIG_MOVABLE_NODE
+/* If movable_node boot option specified */
+extern bool movable_node_enabled;
+#endif /* CONFIG_MOVABLE_NODE */
 
 #define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
@@ -65,6 +69,20 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
+#ifdef CONFIG_MOVABLE_NODE
+static inline bool memblock_is_hotpluggable(struct memblock_region *m)
+{
+   return m->flags & MEMBLOCK_HOTPLUG;
+}
+
+static inline bool movable_node_is_enabled(void)
+{
+   return movable_node_enabled;
+}
+#else
+static inline bool memblock_is_hotpluggable(struct memblock_region *m){ return 
false; }
+static inline bool movable_node_is_enabled(void) { return false; }
+#endif
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index 7de9c76..7f69012 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -39,6 +39,9 @@ struct memblock memblock __initdata_memblock = {
 };
 
 int memblock_debug __initdata_memblock;
+#ifdef CONFIG_MOVABLE_NODE
+bool movable_node_enabled __initdata_memblock = false;
+#endif
 static int memblock_can_resize __initdata_memblock;
 static int memblock_memory_in_slab __initdata_memblock = 0;
 static int memblock_reserved_in_slab __initdata_memblock = 0;
@@ -819,6 +822,11 @@ void __init_memblock __next_free_mem_range(u64 *idx, int 
nid,
  * @out_nid: ptr to int for nid of the range, can be %NULL
  *
  * Reverse of __next_free_mem_range().
+ *
+ * Linux kernel cannot migrate pages used by itself. Memory hotplug users won't
+ * be able to hot-remove hotpluggable memory used by the kernel. So this
+ * function skip hotpluggable regions if needed when allocating memory for the
+ * kernel.
  */
 void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
   phys_addr_t *out_start,
@@ -843,6 +851,10 @@ void __init_memblock __next_free_mem_range_rev(u64 *idx, 
int nid,
if (nid != MAX_NUMNODES && nid != memblock_get_region_node(m))
continue;
 
+   /* skip hotpluggable memory regions if needed */
+   if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
+   continue;
+
/* scan areas before each reservation for intersection */
for ( ; ri >= 0; ri--) {
struct memblock_region *r = &rsv->regions[ri];
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8c91d0a..729a2d8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1436,6 +1436,7 @@ static int __init cmdline_parse_movable_node(char *p)
 * the kernel away from hotpluggable memory.
 */
memblock_set_bottom_up(true);
+   movable_node_enabled = true;
 #else
pr_warn("movable_node option not supported\n");
 #endif
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part2 v2 6/8] acpi, numa, mem_hotplug: Mark all nodes the kernel resides un-hotpluggable

2013-10-11 Thread Zhang Yanfei

From: Tang Chen 

At very early time, the kernel have to use some memory such as
loading the kernel image. We cannot prevent this anyway. So any
node the kernel resides in should be un-hotpluggable.

Signed-off-by: Zhang Yanfei 
Reviewed-by: Zhang Yanfei 
---
 arch/x86/mm/numa.c |   44 
 1 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 408c02d..f26b16f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -494,6 +494,14 @@ static int __init numa_register_memblks(struct 
numa_meminfo *mi)
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start,
  &memblock.memory, mb->nid);
+
+   /*
+* At this time, all memory regions reserved by memblock are
+* used by the kernel. Set the nid in memblock.reserved will
+* mark out all the nodes the kernel resides in.
+*/
+   memblock_set_node(mb->start, mb->end - mb->start,
+ &memblock.reserved, mb->nid);
}
 
/*
@@ -555,6 +563,30 @@ static void __init numa_init_array(void)
}
 }
 
+static void __init numa_clear_kernel_node_hotplug(void)
+{
+   int i, nid;
+   nodemask_t numa_kernel_nodes;
+   unsigned long start, end;
+   struct memblock_type *type = &memblock.reserved;
+
+   /* Mark all kernel nodes. */
+   for (i = 0; i < type->cnt; i++)
+   node_set(type->regions[i].nid, numa_kernel_nodes);
+
+   /* Clear MEMBLOCK_HOTPLUG flag for memory in kernel nodes. */
+   for (i = 0; i < numa_meminfo.nr_blks; i++) {
+   nid = numa_meminfo.blk[i].nid;
+   if (!node_isset(nid, numa_kernel_nodes))
+   continue;
+
+   start = numa_meminfo.blk[i].start;
+   end = numa_meminfo.blk[i].end;
+
+   memblock_clear_hotplug(start, end - start);
+   }
+}
+
 static int __init numa_init(int (*init_func)(void))
 {
int i;
@@ -569,6 +601,8 @@ static int __init numa_init(int (*init_func)(void))
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
  MAX_NUMNODES));
+   WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
+ MAX_NUMNODES));
/* In case that parsing SRAT failed. */
WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
numa_reset_distance();
@@ -606,6 +640,16 @@ static int __init numa_init(int (*init_func)(void))
numa_clear_node(i);
}
numa_init_array();
+
+   /*
+* At very early time, the kernel have to use some memory such as
+* loading the kernel image. We cannot prevent this anyway. So any
+* node the kernel resides in should be un-hotpluggable.
+*
+* And when we come here, numa_init() won't fail.
+*/
+   numa_clear_kernel_node_hotplug();
+
return 0;
 }
 
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part2 v2 5/8] acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock

2013-10-11 Thread Zhang Yanfei

From: Tang Chen 

When parsing SRAT, we know that which memory area is hotpluggable.
So we invoke function memblock_mark_hotplug() introduced by previous
patch to mark hotpluggable memory in memblock.

Signed-off-by: Tang Chen 
Reviewed-by: Zhang Yanfei 
---
 arch/x86/mm/numa.c |2 ++
 arch/x86/mm/srat.c |5 +
 2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ab69e1d..408c02d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -569,6 +569,8 @@ static int __init numa_init(int (*init_func)(void))
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
  MAX_NUMNODES));
+   /* In case that parsing SRAT failed. */
+   WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
numa_reset_distance();
 
ret = init_func();
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 266ca91..ca7c484 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -181,6 +181,11 @@ acpi_numa_memory_affinity_init(struct 
acpi_srat_mem_affinity *ma)
(unsigned long long) start, (unsigned long long) end - 1,
hotpluggable ? " hotplug" : "");
 
+   /* Mark hotplug range in memblock. */
+   if (hotpluggable && memblock_mark_hotplug(start, ma->length))
+   pr_warn("SRAT: Failed to mark hotplug range [mem 
%#010Lx-%#010Lx] in memblock\n",
+   (unsigned long long) start, (unsigned long long) end - 
1);
+
return 0;
 out_err_bad_srat:
bad_srat();
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part2 v2 4/8] memblock: Make memblock_set_node() support different memblock_type

2013-10-11 Thread Zhang Yanfei

From: Tang Chen 

Signed-off-by: Tang Chen 
Reviewed-by: Zhang Yanfei 
---
 arch/metag/mm/init.c  |3 ++-
 arch/metag/mm/numa.c  |3 ++-
 arch/microblaze/mm/init.c |3 ++-
 arch/powerpc/mm/mem.c |2 +-
 arch/powerpc/mm/numa.c|8 +---
 arch/sh/kernel/setup.c|4 ++--
 arch/sparc/mm/init_64.c   |5 +++--
 arch/x86/mm/init_32.c |2 +-
 arch/x86/mm/init_64.c |2 +-
 arch/x86/mm/numa.c|6 --
 include/linux/memblock.h  |3 ++-
 mm/memblock.c |6 +++---
 12 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/arch/metag/mm/init.c b/arch/metag/mm/init.c
index 1239195..d94a58f 100644
--- a/arch/metag/mm/init.c
+++ b/arch/metag/mm/init.c
@@ -205,7 +205,8 @@ static void __init do_init_bootmem(void)
start_pfn = memblock_region_memory_base_pfn(reg);
end_pfn = memblock_region_memory_end_pfn(reg);
memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), 0);
+ PFN_PHYS(end_pfn - start_pfn),
+ &memblock.memory, 0);
}
 
/* All of system RAM sits in node 0 for the non-NUMA case */
diff --git a/arch/metag/mm/numa.c b/arch/metag/mm/numa.c
index 9ae578c..229407f 100644
--- a/arch/metag/mm/numa.c
+++ b/arch/metag/mm/numa.c
@@ -42,7 +42,8 @@ void __init setup_bootmem_node(int nid, unsigned long start, 
unsigned long end)
memblock_add(start, end - start);
 
memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), nid);
+ PFN_PHYS(end_pfn - start_pfn),
+ &memblock.memory, nid);
 
/* Node-local pgdat */
pgdat_paddr = memblock_alloc_base(sizeof(struct pglist_data),
diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c
index 74c7bcc..89077d3 100644
--- a/arch/microblaze/mm/init.c
+++ b/arch/microblaze/mm/init.c
@@ -192,7 +192,8 @@ void __init setup_memory(void)
start_pfn = memblock_region_memory_base_pfn(reg);
end_pfn = memblock_region_memory_end_pfn(reg);
memblock_set_node(start_pfn << PAGE_SHIFT,
-   (end_pfn - start_pfn) << PAGE_SHIFT, 0);
+ (end_pfn - start_pfn) << PAGE_SHIFT,
+ &memblock.memory, 0);
}
 
/* free bootmem is whole main memory */
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 3fa93dc..231b785 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -209,7 +209,7 @@ void __init do_init_bootmem(void)
/* Place all memblock_regions in the same node and merge contiguous
 * memblock_regions
 */
-   memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+   memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock_memory, 0);
 
/* Add all physical memory to the bootmem map, mark each area
 * present.
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c916127..f82f2ea 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -670,7 +670,8 @@ static void __init parse_drconf_memory(struct device_node 
*memory)
node_set_online(nid);
sz = numa_enforce_memory_limit(base, size);
if (sz)
-   memblock_set_node(base, sz, nid);
+   memblock_set_node(base, sz,
+ &memblock.memory, nid);
} while (--ranges);
}
 }
@@ -760,7 +761,7 @@ new_range:
continue;
}
 
-   memblock_set_node(start, size, nid);
+   memblock_set_node(start, size, &memblock.memory, nid);
 
if (--ranges)
goto new_range;
@@ -797,7 +798,8 @@ static void __init setup_nonnuma(void)
 
fake_numa_create_new_node(end_pfn, &nid);
memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), nid);
+ PFN_PHYS(end_pfn - start_pfn),
+ &memblock.memory, nid);
node_set_online(nid);
}
 }
diff --git a/arch/sh/kernel/setup.c b/arch/sh/kernel/setup.c
index 1cf90e9..de19cfa 100644
--- a/arch/sh/kernel/setup.c
+++ b/arch/sh/kernel/setup.c
@@ -230,8 +230,8 @@ void __init __add_active_range(unsigned int nid, unsigned 
long start_pfn,
pmb_bolt_mapping((unsigned long)__va(start), start, end - start,
 PAGE_KERNEL);
 
-   memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), nid);
+   memblock_set_node(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - star

[PATCH part2 v2 3/8] memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions

2013-10-11 Thread Zhang Yanfei

From: Tang Chen 

In find_hotpluggable_memory, once we find out a memory region which is
hotpluggable, we want to mark them in memblock.memory. So that we could
control memblock allocator not to allocte hotpluggable memory for the kernel
later.

To achieve this goal, we introduce MEMBLOCK_HOTPLUG flag to indicate the
hotpluggable memory regions in memblock and a function memblock_mark_hotplug()
to mark hotpluggable memory if we find one.

Signed-off-by: Tang Chen 
Reviewed-by: Zhang Yanfei 
---
 include/linux/memblock.h |   17 +++
 mm/memblock.c|   52 ++
 2 files changed, 69 insertions(+), 0 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 9a805ec..b788faa 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,6 +19,9 @@
 
 #define INIT_MEMBLOCK_REGIONS  128
 
+/* Definition of memblock flags. */
+#define MEMBLOCK_HOTPLUG   0x1 /* hotpluggable region */
+
 struct memblock_region {
phys_addr_t base;
phys_addr_t size;
@@ -60,6 +63,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
+int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
+int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
@@ -122,6 +127,18 @@ void __next_free_mem_range_rev(u64 *idx, int nid, 
phys_addr_t *out_start,
 i != (u64)ULLONG_MAX;  \
 __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
 
+static inline void memblock_set_region_flags(struct memblock_region *r,
+unsigned long flags)
+{
+   r->flags |= flags;
+}
+
+static inline void memblock_clear_region_flags(struct memblock_region *r,
+  unsigned long flags)
+{
+   r->flags &= ~flags;
+}
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
 
diff --git a/mm/memblock.c b/mm/memblock.c
index 877973e..5bea331 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -683,6 +683,58 @@ int __init_memblock memblock_reserve(phys_addr_t base, 
phys_addr_t size)
 }
 
 /**
+ * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * This function isolates region [@base, @base + @size), and mark it with flag
+ * MEMBLOCK_HOTPLUG.
+ *
+ * Return 0 on succees, -errno on failure.
+ */
+int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
+{
+   struct memblock_type *type = &memblock.memory;
+   int i, ret, start_rgn, end_rgn;
+
+   ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+   if (ret)
+   return ret;
+
+   for (i = start_rgn; i < end_rgn; i++)
+   memblock_set_region_flags(&type->regions[i], MEMBLOCK_HOTPLUG);
+
+   memblock_merge_regions(type);
+   return 0;
+}
+
+/**
+ * memblock_clear_hotplug - Clear flag MEMBLOCK_HOTPLUG for a specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * This function isolates region [@base, @base + @size), and clear flag
+ * MEMBLOCK_HOTPLUG for the isolated regions.
+ *
+ * Return 0 on succees, -errno on failure.
+ */
+int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
+{
+   struct memblock_type *type = &memblock.memory;
+   int i, ret, start_rgn, end_rgn;
+
+   ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+   if (ret)
+   return ret;
+
+   for (i = start_rgn; i < end_rgn; i++)
+   memblock_clear_region_flags(&type->regions[i], 
MEMBLOCK_HOTPLUG);
+
+   memblock_merge_regions(type);
+   return 0;
+}
+
+/**
  * __next_free_mem_range - next function for for_each_free_mem_range()
  * @idx: pointer to u64 loop variable
  * @nid: node selector, %MAX_NUMNODES for all nodes
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part2 v2 2/8] memblock, numa: Introduce flag into memblock

2013-10-11 Thread Zhang Yanfei

From: Tang Chen 

There is no flag in memblock to describe what type the memory is.
Sometimes, we may use memblock to reserve some memory for special usage.
And we want to know what kind of memory it is. So we need a way to
differentiate memory for different usage.

In hotplug environment, we want to reserve hotpluggable memory so the
kernel won't be able to use it. And when the system is up, we have to
free these hotpluggable memory to buddy. So we need to mark these memory
first.

In order to do so, we need to mark out these special memory in memblock.
In this patch, we introduce a new "flags" member into memblock_region:
   struct memblock_region {
   phys_addr_t base;
   phys_addr_t size;
   unsigned long flags; /* This is new. */
   #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
   int nid;
   #endif
   };

This patch does the following things:
1) Add "flags" member to memblock_region.
2) Modify the following APIs' prototype:
memblock_add_region()
memblock_insert_region()
3) Add memblock_reserve_region() to support reserve memory with flags, and keep
   memblock_reserve()'s prototype unmodified.
4) Modify other APIs to support flags, but keep their prototype unmodified.

The idea is from Wen Congyang  and Liu Jiang 
.

v1 -> v2:
As tj suggested, a zero flag MEMBLK_DEFAULT will make users confused. If
we want to specify any other flag, such MEMBLK_HOTPLUG, users don't know
to use MEMBLK_DEFAULT | MEMBLK_HOTPLUG or just MEMBLK_HOTPLUG. So remove
MEMBLK_DEFAULT (which is 0), and just use 0 by default to avoid confusions
to users.

Suggested-by: Wen Congyang 
Suggested-by: Liu Jiang 
Signed-off-by: Tang Chen 
Reviewed-by: Zhang Yanfei 
---
 include/linux/memblock.h |1 +
 mm/memblock.c|   53 +-
 2 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 77c60e5..9a805ec 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -22,6 +22,7 @@
 struct memblock_region {
phys_addr_t base;
phys_addr_t size;
+   unsigned long flags;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
 #endif
diff --git a/mm/memblock.c b/mm/memblock.c
index 53e477b..877973e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -255,6 +255,7 @@ static void __init_memblock memblock_remove_region(struct 
memblock_type *type, u
type->cnt = 1;
type->regions[0].base = 0;
type->regions[0].size = 0;
+   type->regions[0].flags = 0;
memblock_set_region_node(&type->regions[0], MAX_NUMNODES);
}
 }
@@ -405,7 +406,8 @@ static void __init_memblock memblock_merge_regions(struct 
memblock_type *type)
 
if (this->base + this->size != next->base ||
memblock_get_region_node(this) !=
-   memblock_get_region_node(next)) {
+   memblock_get_region_node(next) ||
+   this->flags != next->flags) {
BUG_ON(this->base + this->size > next->base);
i++;
continue;
@@ -425,13 +427,15 @@ static void __init_memblock memblock_merge_regions(struct 
memblock_type *type)
  * @base:  base address of the new region
  * @size:  size of the new region
  * @nid:   node id of the new region
+ * @flags: flags of the new region
  *
  * Insert new memblock region [@base,@base+@size) into @type at @idx.
  * @type must already have extra room to accomodate the new region.
  */
 static void __init_memblock memblock_insert_region(struct memblock_type *type,
   int idx, phys_addr_t base,
-  phys_addr_t size, int nid)
+  phys_addr_t size,
+  int nid, unsigned long flags)
 {
struct memblock_region *rgn = &type->regions[idx];
 
@@ -439,6 +443,7 @@ static void __init_memblock memblock_insert_region(struct 
memblock_type *type,
memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
rgn->base = base;
rgn->size = size;
+   rgn->flags = flags;
memblock_set_region_node(rgn, nid);
type->cnt++;
type->total_size += size;
@@ -450,6 +455,7 @@ static void __init_memblock memblock_insert_region(struct 
memblock_type *type,
  * @base: base address of the new region
  * @size: size of the new region
  * @nid: nid of the new region
+ * @flags: flags of the new region
  *
  * Add new memblock region [@base,@base+@size) into @type.  The new region
  * is allowed to overlap with existing ones - overlaps don't affect already
@@ -460,7 +466,8 @@ static void __init_memblock me

Re: [PATCH part1 v7 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed

2013-10-10 Thread Zhang Yanfei

Hello Andrew,

Could you take this version now? Since the approach of
this patchset is suggested by tejun, and thanks him for
helping us explaining a lot to guys that have the concern
about the page table location. I've added some note in
the patch4 description to explain why we could be not worrisome
about the approach.

Thanks.

On 10/11/2013 04:13 AM, Zhang Yanfei wrote:
> Hello, here is the v7 version. Any comments are welcome!
> 
> The v7 version is based on linus's tree (3.12-rc4)
> HEAD is:
> commit d0e639c9e06d44e713170031fe05fb60ebe680af
> Author: Linus Torvalds 
> Date:   Sun Oct 6 14:00:20 2013 -0700
> 
> Linux 3.12-rc4
> 
> 
> [Problem]
> 
> The current Linux cannot migrate pages used by the kerenl because
> of the kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
> When the pa is changed, we cannot simply update the pagetable and
> keep the va unmodified. So the kernel pages are not migratable.
> 
> There are also some other issues will cause the kernel pages not migratable.
> For example, the physical address may be cached somewhere and will be used.
> It is not to update all the caches.
> 
> When doing memory hotplug in Linux, we first migrate all the pages in one
> memory device somewhere else, and then remove the device. But if pages are
> used by the kernel, they are not migratable. As a result, memory used by
> the kernel cannot be hot-removed.
> 
> Modifying the kernel direct mapping mechanism is too difficult to do. And
> it may cause the kernel performance down and unstable. So we use the following
> way to do memory hotplug.
> 
> 
> [What we are doing]
> 
> In Linux, memory in one numa node is divided into several zones. One of the
> zones is ZONE_MOVABLE, which the kernel won't use.
> 
> In order to implement memory hotplug in Linux, we are going to arrange all
> hotpluggable memory in ZONE_MOVABLE so that the kernel won't use these memory.
> To do this, we need ACPI's help.
> 
> In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The memory
> affinities in SRAT record every memory range in the system, and also, flags
> specifying if the memory range is hotpluggable.
> (Please refer to ACPI spec 5.0 5.2.16)
> 
> With the help of SRAT, we have to do the following two things to achieve our
> goal:
> 
> 1. When doing memory hot-add, allow the users arranging hotpluggable as
>ZONE_MOVABLE.
>(This has been done by the MOVABLE_NODE functionality in Linux.)
> 
> 2. when the system is booting, prevent bootmem allocator from allocating
>hotpluggable memory for the kernel before the memory initialization
>finishes.
> 
> The problem 2 is the key problem we are going to solve. But before solving it,
> we need some preparation. Please see below.
> 
> 
> [Preparation]
> 
> Bootloader has to load the kernel image into memory. And this memory must be 
> unhotpluggable. We cannot prevent this anyway. So in a memory hotplug system, 
> we can assume any node the kernel resides in is not hotpluggable.
> 
> Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But
> memblock has already started to work. In the current kernel, memblock 
> allocates 
> the following memory before SRAT is parsed:
> 
> setup_arch()
>  |->memblock_x86_fill()/* memblock is ready */
>  |..
>  |->early_reserve_e820_mpc_new()   /* allocate memory under 1MB */
>  |->reserve_real_mode()/* allocate memory under 1MB */
>  |->init_mem_mapping() /* allocate page tables, about 2MB to map 
> 1GB memory */
>  |->dma_contiguous_reserve()   /* specified by user, should be low */
>  |->setup_log_buf()/* specified by user, several mega bytes */
>  |->relocate_initrd()  /* could be large, but will be freed after 
> boot, should reorder */
>  |->acpi_initrd_override() /* several mega bytes */
>  |->reserve_crashkernel()  /* could be large, should reorder */
>  |..
>  |->initmem_init() /* Parse SRAT */
> 
> According to Tejun's advice, before SRAT is parsed, we should try our best to
> allocate memory near the kernel image. Since the whole node the kernel 
> resides 
> in won't be hotpluggable, and for a modern server, a node may have at least 
> 16GB
> memory, allocating several mega bytes memory around the kernel image won't 
> cross
> to hotpluggable memory.
> 
> 
> [About this patch-set]
> 
> So this patch-set is the preparation for the problem 2 that we want to solve. 
> It
> does the following:
> 
> 1. Make memblock be able to allocate memory bottom up.
>1) Keep all the

[PATCH part1 v7 6/6] mem-hotplug: Introduce movable_node boot option

2013-10-10 Thread Zhang Yanfei

From: Tang Chen 

The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and
later we can set it as ZONE_MOVABLE.

To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained
in the previous patches. So if movable_node boot option is set, the kernel
does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
   top down.

Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.

Suggested-by: Kamezawa Hiroyuki 
Suggested-by: Ingo Molnar 
Acked-by: Tejun Heo 
Acked-by: Toshi Kani 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 Documentation/kernel-parameters.txt |3 +++
 arch/x86/mm/numa.c  |   11 +++
 mm/Kconfig  |   17 -
 mm/memory_hotplug.c |   31 +++
 4 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index fcbb736..a75a70a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1773,6 +1773,9 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
that the amount of memory usable for all allocations
is not too small.
 
+   movable_node[KNL,X86] Boot-time switch to enable the effects
+   of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
+
MTD_Partition=  [MTD]
Format: ,,,
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..24aec58 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -567,6 +567,17 @@ static int __init numa_init(int (*init_func)(void))
ret = init_func();
if (ret < 0)
return ret;
+
+   /*
+* We reset memblock back to the top-down direction
+* here because if we configured ACPI_NUMA, we have
+* parsed SRAT in init_func(). It is ok to have the
+* reset here even if we did't configure ACPI_NUMA
+* or acpi numa init fails and fallbacks to dummy
+* numa init.
+*/
+   memblock_set_bottom_up(false);
+
ret = numa_cleanup_meminfo(&numa_meminfo);
if (ret < 0)
return ret;
diff --git a/mm/Kconfig b/mm/Kconfig
index 394838f..3f4ffda 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,11 +153,18 @@ config MOVABLE_NODE
help
  Allow a node to have only movable memory.  Pages used by the kernel,
  such as direct mapping pages cannot be migrated.  So the corresponding
- memory device cannot be hotplugged.  This option allows users to
- online all the memory of a node as movable memory so that the whole
- node can be hotplugged.  Users who don't use the memory hotplug
- feature are fine with this option on since they don't online memory
- as movable.
+ memory device cannot be hotplugged.  This option allows the following
+ two things:
+ - When the system is booting, node full of hotpluggable memory can
+ be arranged to have only movable memory so that the whole node can
+ be hot-removed. (need movable_node boot option specified).
+ - After the system is up, the option allows users to online all the
+ memory of a node as movable memory so that the whole node can be
+ hot-removed.
+
+ Users who don't use the memory hotplug feature are fine with this
+ option on since they don't specify movable_node boot option or they
+ don't online memory as movable.
 
  Say Y here if you want to hotplug a whole node.
  Say N here if you want kernel to use memory on all nodes evenly.
diff --git a/m

[PATCH part1 v7 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed.

2013-10-10 Thread Zhang Yanfei

From: Tang Chen 

Memory reserved for crashkernel could be large. So we should not allocate
this memory bottom up from the end of kernel image.

When SRAT is parsed, we will be able to know whihc memory is hotpluggable,
and we can avoid allocating this memory for the kernel. So reorder
reserve_crashkernel() after SRAT is parsed.

Acked-by: Tejun Heo 
Acked-by: Toshi Kani 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 arch/x86/kernel/setup.c |9 +++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f0de629..b5e350d 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1120,8 +1120,6 @@ void __init setup_arch(char **cmdline_p)
acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
 #endif
 
-   reserve_crashkernel();
-
vsmp_init();
 
io_delay_init();
@@ -1134,6 +1132,13 @@ void __init setup_arch(char **cmdline_p)
early_acpi_boot_init();
 
initmem_init();
+
+   /*
+* Reserve memory for crash kernel after SRAT is parsed so that it
+* won't consume hotpluggable memory.
+*/
+   reserve_crashkernel();
+
memblock_find_dma_reserve();
 
 #ifdef CONFIG_KVM_GUEST
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part1 v7 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up

2013-10-10 Thread Zhang Yanfei

From: Tang Chen 

The Linux kernel cannot migrate pages used by the kernel. As a
result, kernel pages cannot be hot-removed. So we cannot allocate
hotpluggable memory for the kernel.

In a memory hotplug system, any numa node the kernel resides in
should be unhotpluggable. And for a modern server, each node could
have at least 16GB memory. So memory around the kernel image is
highly likely unhotpluggable.

ACPI SRAT (System Resource Affinity Table) contains the memory
hotplug info. But before SRAT is parsed, memblock has already
started to allocate memory for the kernel. So we need to prevent
memblock from doing this.

So direct memory mapping page tables setup is the case. init_mem_mapping()
is called before SRAT is parsed. To prevent page tables being allocated
within hotpluggable memory, we will use bottom-up direction to allocate
page tables from the end of kernel image to the higher memory.

Note:
As for allocating page tables in lower memory, TJ said:
[This is an optional behavior which is triggered by a very specific
kernel boot param, which I suspect is gonna need to stick around to
support memory hotplug in the current setup unless we add another
layer of address translation to support memory hotplug.]

As for page tables may occupy too much lower memory if using 4K mapping
(CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK both disable using >4k
pages), TJ said:
[But as I said in the same paragraph, parsing SRAT earlier doesn't
solve the problem in itself either. Ignoring the option if 4k mapping
is required and memory consumption would be prohibitive should work, no?
Something like that would be necessary if we're gonna worry about cases
like this no matter how we implement it, but, frankly, I'm not sure this
is something worth worrying about.]

Acked-by: Tejun Heo 
Acked-by: Toshi Kani 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 arch/x86/mm/init.c |   66 ++-
 1 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ea2be79..b6892a7 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -458,6 +458,51 @@ static void __init memory_map_top_down(unsigned long 
map_start,
init_range_memory_mapping(real_end, map_end);
 }
 
+/**
+ * memory_map_bottom_up - Map [map_start, map_end) bottom up
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range
+ * [map_start, map_end) in bottom-up. Since we have limited the
+ * bottom-up allocation above the kernel, the page tables will
+ * be allocated just above the kernel and we map the memory
+ * in [map_start, map_end) in bottom-up.
+ */
+static void __init memory_map_bottom_up(unsigned long map_start,
+   unsigned long map_end)
+{
+   unsigned long next, new_mapped_ram_size, start;
+   unsigned long mapped_ram_size = 0;
+   /* step_size need to be small so pgt_buf from BRK could cover it */
+   unsigned long step_size = PMD_SIZE;
+
+   start = map_start;
+   min_pfn_mapped = start >> PAGE_SHIFT;
+
+   /*
+* We start from the bottom (@map_start) and go to the top (@map_end).
+* The memblock_find_in_range() gets us a block of RAM from the
+* end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
+* for page table.
+*/
+   while (start < map_end) {
+   if (map_end - start > step_size) {
+   next = round_up(start + 1, step_size);
+   if (next > map_end)
+   next = map_end;
+   } else
+   next = map_end;
+
+   new_mapped_ram_size = init_range_memory_mapping(start, next);
+   start = next;
+
+   if (new_mapped_ram_size > mapped_ram_size)
+   step_size <<= STEP_SIZE_SHIFT;
+   mapped_ram_size += new_mapped_ram_size;
+   }
+}
+
 void __init init_mem_mapping(void)
 {
unsigned long end;
@@ -473,8 +518,25 @@ void __init init_mem_mapping(void)
/* the ISA range is always mapped regardless of memory holes */
init_memory_mapping(0, ISA_END_ADDRESS);
 
-   /* setup direct mapping for range [ISA_END_ADDRESS, end) in top-down*/
-   memory_map_top_down(ISA_END_ADDRESS, end);
+   /*
+* If the allocation is in bottom-up direction, we setup direct mapping
+* in bottom-up, otherwise we setup direct mapping in top-down.
+*/
+   if (memblock_bottom_up()) {
+   unsigned long kernel_end = __pa_symbol(_end);
+
+   /*
+* we need two separate calls here. This is because we want to
+* allocate page tables above the kernel. So we first map
+* [kernel_end, end)

[PATCH part1 v7 3/6] x86/mm: Factor out of top-down direct mapping setup

2013-10-10 Thread Zhang Yanfei

From: Tang Chen 

This patch creates a new function memory_map_top_down to
factor out of the top-down direct memory mapping pagetable
setup. This is also a preparation for the following patch,
which will introduce the bottom-up memory mapping. That said,
we will put the two ways of pagetable setup into separate
functions, and choose to use which way in init_mem_mapping,
which makes the code more clear.

Acked-by: Tejun Heo 
Acked-by: Toshi Kani 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 arch/x86/mm/init.c |   60 ++-
 1 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 04664cd..ea2be79 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -401,27 +401,28 @@ static unsigned long __init init_range_memory_mapping(
 
 /* (PUD_SHIFT-PMD_SHIFT)/2 */
 #define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+
+/**
+ * memory_map_top_down - Map [map_start, map_end) top down
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range
+ * [map_start, map_end) in top-down. That said, the page tables
+ * will be allocated at the end of the memory, and we map the
+ * memory in top-down.
+ */
+static void __init memory_map_top_down(unsigned long map_start,
+  unsigned long map_end)
 {
-   unsigned long end, real_end, start, last_start;
+   unsigned long real_end, start, last_start;
unsigned long step_size;
unsigned long addr;
unsigned long mapped_ram_size = 0;
unsigned long new_mapped_ram_size;
 
-   probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
-   end = max_pfn << PAGE_SHIFT;
-#else
-   end = max_low_pfn << PAGE_SHIFT;
-#endif
-
-   /* the ISA range is always mapped regardless of memory holes */
-   init_memory_mapping(0, ISA_END_ADDRESS);
-
/* xen has big range in reserved near end of ram, skip it at first.*/
-   addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+   addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;
 
/* step_size need to be small so pgt_buf from BRK could cover it */
@@ -436,13 +437,13 @@ void __init init_mem_mapping(void)
 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
 * for page table.
 */
-   while (last_start > ISA_END_ADDRESS) {
+   while (last_start > map_start) {
if (last_start > step_size) {
start = round_down(last_start - 1, step_size);
-   if (start < ISA_END_ADDRESS)
-   start = ISA_END_ADDRESS;
+   if (start < map_start)
+   start = map_start;
} else
-   start = ISA_END_ADDRESS;
+   start = map_start;
new_mapped_ram_size = init_range_memory_mapping(start,
last_start);
last_start = start;
@@ -453,8 +454,27 @@ void __init init_mem_mapping(void)
mapped_ram_size += new_mapped_ram_size;
}
 
-   if (real_end < end)
-   init_range_memory_mapping(real_end, end);
+   if (real_end < map_end)
+   init_range_memory_mapping(real_end, map_end);
+}
+
+void __init init_mem_mapping(void)
+{
+   unsigned long end;
+
+   probe_page_size_mask();
+
+#ifdef CONFIG_X86_64
+   end = max_pfn << PAGE_SHIFT;
+#else
+   end = max_low_pfn << PAGE_SHIFT;
+#endif
+
+   /* the ISA range is always mapped regardless of memory holes */
+   init_memory_mapping(0, ISA_END_ADDRESS);
+
+   /* setup direct mapping for range [ISA_END_ADDRESS, end) in top-down*/
+   memory_map_top_down(ISA_END_ADDRESS, end);
 
 #ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part1 v7 2/6] memblock: Introduce bottom-up allocation mode

2013-10-10 Thread Zhang Yanfei

From: Tang Chen 

The Linux kernel cannot migrate pages used by the kernel. As a result, kernel
pages cannot be hot-removed. So we cannot allocate hotpluggable memory for
the kernel.

ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
But before SRAT is parsed, memblock has already started to allocate memory
for the kernel. So we need to prevent memblock from doing this.

In a memory hotplug system, any numa node the kernel resides in should
be unhotpluggable. And for a modern server, each node could have at least
16GB memory. So memory around the kernel image is highly likely unhotpluggable.

So the basic idea is: Allocate memory from the end of the kernel image and
to the higher memory. Since memory allocation before SRAT is parsed won't
be too much, it could highly likely be in the same node with kernel image.

The current memblock can only allocate memory top-down. So this patch introduces
a new bottom-up allocation mode to allocate memory bottom-up. And later
when we use this allocation direction to allocate memory, we will limit
the start address above the kernel.

Acked-by: Toshi Kani 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 include/linux/memblock.h |   24 +
 include/linux/mm.h   |4 ++
 mm/memblock.c|   83 --
 3 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 31e95ac..77c60e5 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -35,6 +35,7 @@ struct memblock_type {
 };
 
 struct memblock {
+   bool bottom_up;  /* is bottom up direction? */
phys_addr_t current_limit;
struct memblock_type memory;
struct memblock_type reserved;
@@ -148,6 +149,29 @@ phys_addr_t memblock_alloc_try_nid(phys_addr_t size, 
phys_addr_t align, int nid)
 
 phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);
 
+#ifdef CONFIG_MOVABLE_NODE
+/*
+ * Set the allocation direction to bottom-up or top-down.
+ */
+static inline void memblock_set_bottom_up(bool enable)
+{
+   memblock.bottom_up = enable;
+}
+
+/*
+ * Check if the allocation direction is bottom-up or not.
+ * if this is true, that said, memblock will allocate memory
+ * in bottom-up direction.
+ */
+static inline bool memblock_bottom_up(void)
+{
+   return memblock.bottom_up;
+}
+#else
+static inline void memblock_set_bottom_up(bool enable) {}
+static inline bool memblock_bottom_up(void) { return false; }
+#endif
+
 /* Flags for memblock_alloc_base() amd __memblock_alloc_base() */
 #define MEMBLOCK_ALLOC_ANYWHERE(~(phys_addr_t)0)
 #define MEMBLOCK_ALLOC_ACCESSIBLE  0
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55e..3d05c07 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -50,6 +50,10 @@ extern int sysctl_legacy_va_layout;
 #include 
 #include 
 
+#ifndef __pa_symbol
+#define __pa_symbol(x)  __pa(RELOC_HIDE((unsigned long)(x), 0))
+#endif
+
 extern unsigned long sysctl_user_reserve_kbytes;
 extern unsigned long sysctl_admin_reserve_kbytes;
 
diff --git a/mm/memblock.c b/mm/memblock.c
index accff10..53e477b 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -20,6 +20,8 @@
 #include 
 #include 
 
+#include 
+
 static struct memblock_region 
memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
 static struct memblock_region 
memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
 
@@ -32,6 +34,7 @@ struct memblock memblock __initdata_memblock = {
.reserved.cnt   = 1,/* empty dummy entry */
.reserved.max   = INIT_MEMBLOCK_REGIONS,
 
+   .bottom_up  = false,
.current_limit  = MEMBLOCK_ALLOC_ANYWHERE,
 };
 
@@ -82,6 +85,38 @@ static long __init_memblock memblock_overlaps_region(struct 
memblock_type *type,
return (i < type->cnt) ? i : -1;
 }
 
+/*
+ * __memblock_find_range_bottom_up - find free area utility in bottom-up
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Utility called from memblock_find_in_range_node(), find free area bottom-up.
+ *
+ * RETURNS:
+ * Found address on success, 0 on failure.
+ */
+static phys_addr_t __init_memblock
+__memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
+   phys_addr_t size, phys_addr_t align, int nid)
+{
+   phys_addr_t this_start, this_end, cand;
+   u64 i;
+
+   for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
+   this_start = clamp(this_start, start, end);
+   this_end = clamp(this_end, start, end);
+
+   cand = round_up(this_start, align);
+   if (cand < this_

[PATCH part1 v7 1/6] memblock: Factor out of top-down allocation

2013-10-10 Thread Zhang Yanfei

From: Tang Chen 

This patch creates a new function __memblock_find_range_top_down
to factor out of top-down allocation from memblock_find_in_range_node.
This is a preparation because we will introduce a new bottom-up
allocation mode in the following patch.

Acked-by: Tejun Heo 
Acked-by: Toshi Kani 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 mm/memblock.c |   47 ++-
 1 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 0ac412a..accff10 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -83,33 +83,25 @@ static long __init_memblock memblock_overlaps_region(struct 
memblock_type *type,
 }
 
 /**
- * memblock_find_in_range_node - find free area in given range and node
+ * __memblock_find_range_top_down - find free area utility, in top-down
  * @start: start of candidate range
  * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
  * @size: size of free area to find
  * @align: alignment of free area to find
  * @nid: nid of the free area to find, %MAX_NUMNODES for any node
  *
- * Find @size free area aligned to @align in the specified range and node.
+ * Utility called from memblock_find_in_range_node(), find free area top-down.
  *
  * RETURNS:
  * Found address on success, %0 on failure.
  */
-phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
-   phys_addr_t end, phys_addr_t size,
-   phys_addr_t align, int nid)
+static phys_addr_t __init_memblock
+__memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
+  phys_addr_t size, phys_addr_t align, int nid)
 {
phys_addr_t this_start, this_end, cand;
u64 i;
 
-   /* pump up @end */
-   if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
-   end = memblock.current_limit;
-
-   /* avoid allocating the first page */
-   start = max_t(phys_addr_t, start, PAGE_SIZE);
-   end = max(start, end);
-
for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
this_start = clamp(this_start, start, end);
this_end = clamp(this_end, start, end);
@@ -121,10 +113,39 @@ phys_addr_t __init_memblock 
memblock_find_in_range_node(phys_addr_t start,
if (cand >= this_start)
return cand;
}
+
return 0;
 }
 
 /**
+ * memblock_find_in_range_node - find free area in given range and node
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Find @size free area aligned to @align in the specified range and node.
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
+   phys_addr_t end, phys_addr_t size,
+   phys_addr_t align, int nid)
+{
+   /* pump up @end */
+   if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+   end = memblock.current_limit;
+
+   /* avoid allocating the first page */
+   start = max_t(phys_addr_t, start, PAGE_SIZE);
+   end = max(start, end);
+
+   return __memblock_find_range_top_down(start, end, size, align, nid);
+}
+
+/**
  * memblock_find_in_range - find free area in given range
  * @start: start of candidate range
  * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part1 v7 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed

2013-10-10 Thread Zhang Yanfei

Hello, here is the v7 version. Any comments are welcome!

The v7 version is based on linus's tree (3.12-rc4)
HEAD is:
commit d0e639c9e06d44e713170031fe05fb60ebe680af
Author: Linus Torvalds 
Date:   Sun Oct 6 14:00:20 2013 -0700

Linux 3.12-rc4


[Problem]

The current Linux cannot migrate pages used by the kerenl because
of the kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
When the pa is changed, we cannot simply update the pagetable and
keep the va unmodified. So the kernel pages are not migratable.

There are also some other issues will cause the kernel pages not migratable.
For example, the physical address may be cached somewhere and will be used.
It is not to update all the caches.

When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.

Modifying the kernel direct mapping mechanism is too difficult to do. And
it may cause the kernel performance down and unstable. So we use the following
way to do memory hotplug.


[What we are doing]

In Linux, memory in one numa node is divided into several zones. One of the
zones is ZONE_MOVABLE, which the kernel won't use.

In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use these memory.
To do this, we need ACPI's help.

In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The memory
affinities in SRAT record every memory range in the system, and also, flags
specifying if the memory range is hotpluggable.
(Please refer to ACPI spec 5.0 5.2.16)

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow the users arranging hotpluggable as
   ZONE_MOVABLE.
   (This has been done by the MOVABLE_NODE functionality in Linux.)

2. when the system is booting, prevent bootmem allocator from allocating
   hotpluggable memory for the kernel before the memory initialization
   finishes.

The problem 2 is the key problem we are going to solve. But before solving it,
we need some preparation. Please see below.


[Preparation]

Bootloader has to load the kernel image into memory. And this memory must be 
unhotpluggable. We cannot prevent this anyway. So in a memory hotplug system, 
we can assume any node the kernel resides in is not hotpluggable.

Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But
memblock has already started to work. In the current kernel, memblock allocates 
the following memory before SRAT is parsed:

setup_arch()
 |->memblock_x86_fill()/* memblock is ready */
 |..
 |->early_reserve_e820_mpc_new()   /* allocate memory under 1MB */
 |->reserve_real_mode()/* allocate memory under 1MB */
 |->init_mem_mapping() /* allocate page tables, about 2MB to map 
1GB memory */
 |->dma_contiguous_reserve()   /* specified by user, should be low */
 |->setup_log_buf()/* specified by user, several mega bytes */
 |->relocate_initrd()  /* could be large, but will be freed after 
boot, should reorder */
 |->acpi_initrd_override() /* several mega bytes */
 |->reserve_crashkernel()  /* could be large, should reorder */
 |..
 |->initmem_init() /* Parse SRAT */

According to Tejun's advice, before SRAT is parsed, we should try our best to
allocate memory near the kernel image. Since the whole node the kernel resides 
in won't be hotpluggable, and for a modern server, a node may have at least 16GB
memory, allocating several mega bytes memory around the kernel image won't cross
to hotpluggable memory.


[About this patch-set]

So this patch-set is the preparation for the problem 2 that we want to solve. It
does the following:

1. Make memblock be able to allocate memory bottom up.
   1) Keep all the memblock APIs' prototype unmodified.
   2) When the direction is bottom up, keep the start address greater than the 
  end of kernel image.

2. Improve init_mem_mapping() to support allocate page tables in bottom up 
direction.

3. Introduce "movable_node" boot option to enable and disable this 
functionality.

Change log v6 -> v7:
1. Add toshi's ack in several patches.
2. Make __pa_symbol() available everywhere by putting a pesudo __pa_symbol 
define
   in include/linux/mm.h. Thanks HPA.
3. Add notes about the page table allocation in bottom-up.

Change log v5 -> v6:
1. Add tejun and toshi's ack in several patches.
2. Change movablenode to movable_node boot option and update the description
   for movable_node and CONFIG_MOVABLE_NODE. Thanks Ingo!
3. Fix the __pa_symbol() issue pointed by Andrew Morton.
4. Update some functions' comments and names.

Change log v4 -> v5:
1. Change memblock.current_direction to a boolean memblock.bottom_up. And 
remove 
   the direction

Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up

2013-10-09 Thread Zhang Yanfei

Hello guys,

On 10/10/2013 07:26 AM, Zhang Yanfei wrote:
> Hello Peter,
> 
> On 10/10/2013 07:10 AM, H. Peter Anvin wrote:
>> On 10/09/2013 02:45 PM, Zhang Yanfei wrote:
>>>>
>>>> I would also argue that in the VM scenario -- and arguable even in the
>>>> hardware scenario -- the right thing is to not expose the flexible
>>>> memory in the e820/EFI tables, and instead have it hotadded (possibly
>>>> *immediately* so) on boot.  This avoids both the boot time funnies as
>>>> well as the scaling issues with metadata.
>>>>
>>>
>>> So in this kind of scenario, hotpluggable memory will not be detected
>>> at boot time, and admin should not use this movable_node boot option
>>> and the kernel will act as before, using top-down allocation always.
>>>
>>
>> Yes.  The idea is that the kernel will boot up without the hotplug
>> memory, but if desired, will immediately see a hotplug-add event for the
>> movable memory.
> 
> Yeah, this is good.
> 
> But in the scenario that boot with hotplug memory, we need the movable_node
> option. So as tejun has explained a lot about this patchset, do you still
> have objection to it or could I ask andrew to merge it into -mm tree for
> more tests?
> 

Since tejun has explained a lot about this approach, could we come to
an agreement on this one?

Peter? If you have no objection, I'll post a new v7 version which will fix
the __pa_symbol problem pointed by you.

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up

2013-10-08 Thread Zhang Yanfei

Hello tejun
CC: Peter

On 10/07/2013 08:00 AM, H. Peter Anvin wrote:
> On 10/03/2013 07:00 PM, Zhang Yanfei wrote:
>> From: Tang Chen 
>>
>> The Linux kernel cannot migrate pages used by the kernel. As a
>> result, kernel pages cannot be hot-removed. So we cannot allocate
>> hotpluggable memory for the kernel.
>>
>> In a memory hotplug system, any numa node the kernel resides in
>> should be unhotpluggable. And for a modern server, each node could
>> have at least 16GB memory. So memory around the kernel image is
>> highly likely unhotpluggable.
>>
>> ACPI SRAT (System Resource Affinity Table) contains the memory
>> hotplug info. But before SRAT is parsed, memblock has already
>> started to allocate memory for the kernel. So we need to prevent
>> memblock from doing this.
>>
>> So direct memory mapping page tables setup is the case. init_mem_mapping()
>> is called before SRAT is parsed. To prevent page tables being allocated
>> within hotpluggable memory, we will use bottom-up direction to allocate
>> page tables from the end of kernel image to the higher memory.
>>
>> Acked-by: Tejun Heo 
>> Signed-off-by: Tang Chen 
>> Signed-off-by: Zhang Yanfei 
> 
> I'm still seriously concerned about this.  This unconditionally
> introduces new behavior which may very well break some classes of
> systems -- the whole point of creating the page tables top down is
> because the kernel tends to be allocated in lower memory, which is also
> the memory that some devices need for DMA.
> 

After thinking for a while, this issue pointed by Peter seems to be really
existing. And looking back to what you suggested the allocation close to the
kernel, 

> so if we allocate memory close to the kernel image,
>   it's likely that we don't contaminate hotpluggable node.  We're
>   talking about few megs at most right after the kernel image.  I
>   can't see how that would make any noticeable difference.

You meant that the memory size is about few megs. But here, page tables
seems to be large enough in big memory machines, so that page tables will
consume the precious lower memory. So I think we may really reorder
the page table setup after we get the hotplug info in some way. Just like
we have done in patch 5, we reorder reserve_crashkernel() to be called
after initmem_init().

So do you still have any objection to the pagetable setup reorder?

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part1 v6 update 6/6] mem-hotplug: Introduce movable_node boot option

2013-10-06 Thread Zhang Yanfei

From: Tang Chen 

The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and
later we can set it as ZONE_MOVABLE.

To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained
in the previous patches. So if movable_node boot option is set, the kernel
does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
   top down.

Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.

Suggested-by: Kamezawa Hiroyuki 
Acked-by: Tejun Heo 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 Documentation/kernel-parameters.txt |3 +++
 arch/x86/mm/numa.c  |   11 +++
 mm/Kconfig  |   17 -
 mm/memory_hotplug.c |   31 +++
 4 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 539a236..13201d4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1769,6 +1769,9 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
that the amount of memory usable for all allocations
is not too small.
 
+   movable_node[KNL,X86] Boot-time switch to enable the effects
+   of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
+
MTD_Partition=  [MTD]
Format: ,,,
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..24aec58 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -567,6 +567,17 @@ static int __init numa_init(int (*init_func)(void))
ret = init_func();
if (ret < 0)
return ret;
+
+   /*
+* We reset memblock back to the top-down direction
+* here because if we configured ACPI_NUMA, we have
+* parsed SRAT in init_func(). It is ok to have the
+* reset here even if we did't configure ACPI_NUMA
+* or acpi numa init fails and fallbacks to dummy
+* numa init.
+*/
+   memblock_set_bottom_up(false);
+
ret = numa_cleanup_meminfo(&numa_meminfo);
if (ret < 0)
return ret;
diff --git a/mm/Kconfig b/mm/Kconfig
index 026771a..0db1cc6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,11 +153,18 @@ config MOVABLE_NODE
help
  Allow a node to have only movable memory.  Pages used by the kernel,
  such as direct mapping pages cannot be migrated.  So the corresponding
- memory device cannot be hotplugged.  This option allows users to
- online all the memory of a node as movable memory so that the whole
- node can be hotplugged.  Users who don't use the memory hotplug
- feature are fine with this option on since they don't online memory
- as movable.
+ memory device cannot be hotplugged.  This option allows the following
+ two things:
+ - When the system is booting, node full of hotpluggable memory can
+ be arranged to have only movable memory so that the whole node can
+ be hot-removed. (need movable_node boot option specified).
+ - After the system is up, the option allows users to online all the
+ memory of a node as movable memory so that the whole node can be
+ hot-removed.
+
+ Users who don't use the memory hotplug feature are fine with this
+ option on since they don't specify movable_node boot option or they
+ don't online memory as movable.
 
  Say Y here if you want to hotplug a whole node.
  Say N here if you want kernel to use memory on all nodes evenly.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ed85f

[PATCH 2/2] mm/page_alloc.c: Get rid of unused marco LONG_ALIGN

2013-10-05 Thread Zhang Yanfei

From: Zhang Yanfei 

The macro is nowhere used, so remove it.

Signed-off-by: Zhang Yanfei 
---
 mm/page_alloc.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1fb13b6..9d8508d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3881,8 +3881,6 @@ static inline unsigned long wait_table_bits(unsigned long 
size)
return ffz(~size);
 }
 
-#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
-
 /*
  * Check if a pageblock contains reserved pages
  */
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/2] mm/page_alloc.c: Implement an empty get_pfn_range_for_nid

2013-10-05 Thread Zhang Yanfei

From: Zhang Yanfei 

Implement an empty get_pfn_range_for_nid for !CONFIG_HAVE_MEMBLOCK_NODE_MAP,
so that we could remove the #ifdef in free_area_init_node.

Signed-off-by: Zhang Yanfei 
---
 mm/page_alloc.c |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..1fb13b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4566,6 +4566,11 @@ static unsigned long __meminit 
zone_absent_pages_in_node(int nid,
 }
 
 #else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+void __meminit get_pfn_range_for_nid(unsigned int nid,
+   unsigned long *ignored, unsigned long *ignored)
+{
+}
+
 static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
unsigned long node_start_pfn,
@@ -4871,9 +4876,7 @@ void __paginginit free_area_init_node(int nid, unsigned 
long *zones_size,
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
init_zone_allows_reclaim(nid);
-#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
-#endif
calculate_node_totalpages(pgdat, start_pfn, end_pfn,
  zones_size, zholes_size);
 
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2] mm/sparsemem: Fix a bug in free_map_bootmem when CONFIG_SPARSEMEM_VMEMMAP

2013-10-05 Thread Zhang Yanfei

From: Zhang Yanfei 

We pass the number of pages which hold page structs of a memory
section to function free_map_bootmem. This is right when
!CONFIG_SPARSEMEM_VMEMMAP but wrong when CONFIG_SPARSEMEM_VMEMMAP.
When CONFIG_SPARSEMEM_VMEMMAP, we should pass the number of pages
of a memory section to free_map_bootmem.

So the fix is removing the nr_pages parameter. When
CONFIG_SPARSEMEM_VMEMMAP, we directly use the prefined marco
PAGES_PER_SECTION in free_map_bootmem. When !CONFIG_SPARSEMEM_VMEMMAP,
we calculate page numbers needed to hold the page structs for a
memory section and use the value in free_map_bootmem.

Signed-off-by: Zhang Yanfei 
---
v2: Fix a bug introduced in v1 patch. Thanks wanpeng!
---
 mm/sparse.c |   20 +---
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index 4ac1d7e..fe32b48 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -604,10 +604,10 @@ static void __kfree_section_memmap(struct page *memmap, 
unsigned long nr_pages)
vmemmap_free(start, end);
 }
 #ifdef CONFIG_MEMORY_HOTREMOVE
-static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
+static void free_map_bootmem(struct page *memmap)
 {
unsigned long start = (unsigned long)memmap;
-   unsigned long end = (unsigned long)(memmap + nr_pages);
+   unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
 
vmemmap_free(start, end);
 }
@@ -650,12 +650,15 @@ static void __kfree_section_memmap(struct page *memmap, 
unsigned long nr_pages)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
+static void free_map_bootmem(struct page *memmap)
 {
unsigned long maps_section_nr, removing_section_nr, i;
-   unsigned long magic;
+   unsigned long magic, nr_pages;
struct page *page = virt_to_page(memmap);
 
+   nr_pages = PAGE_ALIGN(PAGES_PER_SECTION * sizeof(struct page))
+   >> PAGE_SHIFT;
+
for (i = 0; i < nr_pages; i++, page++) {
magic = (unsigned long) page->lru.next;
 
@@ -759,7 +762,6 @@ static inline void clear_hwpoisoned_pages(struct page 
*memmap, int nr_pages)
 static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 {
struct page *usemap_page;
-   unsigned long nr_pages;
 
if (!usemap)
return;
@@ -780,12 +782,8 @@ static void free_section_usemap(struct page *memmap, 
unsigned long *usemap)
 * on the section which has pgdat at boot time. Just keep it as is now.
 */
 
-   if (memmap) {
-   nr_pages = PAGE_ALIGN(PAGES_PER_SECTION * sizeof(struct page))
-   >> PAGE_SHIFT;
-
-   free_map_bootmem(memmap, nr_pages);
-   }
+   if (memmap)
+   free_map_bootmem(memmap);
 }
 
 void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] mm/sparsemem: Fix a bug in free_map_bootmem when CONFIG_SPARSEMEM_VMEMMAP

2013-10-05 Thread Zhang Yanfei

Hello andrew,

On 10/04/2013 04:42 AM, Andrew Morton wrote:
> On Thu, 03 Oct 2013 11:32:02 +0800 Zhang Yanfei  
> wrote:
> 
>> We pass the number of pages which hold page structs of a memory
>> section to function free_map_bootmem. This is right when
>> !CONFIG_SPARSEMEM_VMEMMAP but wrong when CONFIG_SPARSEMEM_VMEMMAP.
>> When CONFIG_SPARSEMEM_VMEMMAP, we should pass the number of pages
>> of a memory section to free_map_bootmem.
>>
>> So the fix is removing the nr_pages parameter. When
>> CONFIG_SPARSEMEM_VMEMMAP, we directly use the prefined marco
>> PAGES_PER_SECTION in free_map_bootmem. When !CONFIG_SPARSEMEM_VMEMMAP,
>> we calculate page numbers needed to hold the page structs for a
>> memory section and use the value in free_map_bootmem.
> 
> What were the runtime user-visible effects of that bug?
> 
> Please always include this information when fixing a bug.


SorryThis was found by reading the code. And I have no machine that
support memory hot-remove to test the bug now. But I believe it is a bug.

BTW, I've made a mistake in this patch which was found by wanpeng. I'll
send v2.

-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] mm/sparsemem: Fix a bug in free_map_bootmem when CONFIG_SPARSEMEM_VMEMMAP

2013-10-04 Thread Zhang Yanfei

Hello wanpeng,

On 10/05/2013 01:54 PM, Wanpeng Li wrote:
> Hi Yanfei,
> On Thu, Oct 03, 2013 at 11:32:02AM +0800, Zhang Yanfei wrote:
>> From: Zhang Yanfei 
>>
>> We pass the number of pages which hold page structs of a memory
>> section to function free_map_bootmem. This is right when
>> !CONFIG_SPARSEMEM_VMEMMAP but wrong when CONFIG_SPARSEMEM_VMEMMAP.
>> When CONFIG_SPARSEMEM_VMEMMAP, we should pass the number of pages
>> of a memory section to free_map_bootmem.
>>
>> So the fix is removing the nr_pages parameter. When
>> CONFIG_SPARSEMEM_VMEMMAP, we directly use the prefined marco
>> PAGES_PER_SECTION in free_map_bootmem. When !CONFIG_SPARSEMEM_VMEMMAP,
>> we calculate page numbers needed to hold the page structs for a
>> memory section and use the value in free_map_bootmem.
>>
>> Signed-off-by: Zhang Yanfei 
>> ---
>> mm/sparse.c |   17 +++--
>> 1 files changed, 7 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index fbb9dbc..908c134 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -603,10 +603,10 @@ static void __kfree_section_memmap(struct page *memmap)
>>  vmemmap_free(start, end);
>> }
>> #ifdef CONFIG_MEMORY_HOTREMOVE
>> -static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
>> +static void free_map_bootmem(struct page *memmap)
>> {
>>  unsigned long start = (unsigned long)memmap;
>> -unsigned long end = (unsigned long)(memmap + nr_pages);
>> +unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
>>
>>  vmemmap_free(start, end);
>> }
>> @@ -648,11 +648,13 @@ static void __kfree_section_memmap(struct page *memmap)
>> }
>>
>> #ifdef CONFIG_MEMORY_HOTREMOVE
>> -static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
>> +static void free_map_bootmem(struct page *memmap)
>> {
>>  unsigned long maps_section_nr, removing_section_nr, i;
>>  unsigned long magic;
>>  struct page *page = virt_to_page(memmap);
>> +unsigned long nr_pages = get_order(sizeof(struct page) *
>> +   PAGES_PER_SECTION);
> 
> Why replace PAGE_ALIGN(XXX) >> PAGE_SHIFT by get_order(XXX)? This will result 
> in memory leak.

oops... I will correct this by sending a new version.

Thanks.


-- 
Thanks.
Zhang Yanfei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part1 v6 6/6] mem-hotplug: Introduce movable_node boot option

2013-10-03 Thread Zhang Yanfei

From: Tang Chen 

The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and
later we can set it as ZONE_MOVABLE.

To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained
in the previous patches. So if movable_node boot option is set, the kernel
does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
   top down.

Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.

Suggested-by: Kamezawa Hiroyuki 
Suggested-by: Ingo Molnar 
Acked-by: Tejun Heo 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 Documentation/kernel-parameters.txt |3 +++
 arch/x86/mm/numa.c  |   11 +++
 mm/Kconfig  |   17 -
 mm/memory_hotplug.c |   31 +++
 4 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 539a236..13201d4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1769,6 +1769,9 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
that the amount of memory usable for all allocations
is not too small.
 
+   movable_node[KNL,X86] Boot-time switch to disable the effects
+   of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
+
MTD_Partition=  [MTD]
Format: ,,,
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..24aec58 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -567,6 +567,17 @@ static int __init numa_init(int (*init_func)(void))
ret = init_func();
if (ret < 0)
return ret;
+
+   /*
+* We reset memblock back to the top-down direction
+* here because if we configured ACPI_NUMA, we have
+* parsed SRAT in init_func(). It is ok to have the
+* reset here even if we did't configure ACPI_NUMA
+* or acpi numa init fails and fallbacks to dummy
+* numa init.
+*/
+   memblock_set_bottom_up(false);
+
ret = numa_cleanup_meminfo(&numa_meminfo);
if (ret < 0)
return ret;
diff --git a/mm/Kconfig b/mm/Kconfig
index 026771a..0db1cc6 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -153,11 +153,18 @@ config MOVABLE_NODE
help
  Allow a node to have only movable memory.  Pages used by the kernel,
  such as direct mapping pages cannot be migrated.  So the corresponding
- memory device cannot be hotplugged.  This option allows users to
- online all the memory of a node as movable memory so that the whole
- node can be hotplugged.  Users who don't use the memory hotplug
- feature are fine with this option on since they don't online memory
- as movable.
+ memory device cannot be hotplugged.  This option allows the following
+ two things:
+ - When the system is booting, node full of hotpluggable memory can
+ be arranged to have only movable memory so that the whole node can
+ be hotplugged. (need movable_node boot option specified).
+ - After the system is up, the option allows users to online all the
+ memory of a node as movable memory so that the whole node can be
+ hotplugged.
+
+ Users who don't use the memory hotplug feature are fine with this
+ option on since they don't specify movable_node boot option or they
+ don't online memory as movable.
 
  Say Y here if you want to hotplug a whole node.
  Say N here if you want kernel to use memory on all nodes evenly.
diff --git a/mm/memory_hotplug.c b/mm/me

[PATCH part1 v6 5/6] x86, acpi, crash, kdump: Do reserve_crashkernel() after SRAT is parsed.

2013-10-03 Thread Zhang Yanfei

From: Tang Chen 

Memory reserved for crashkernel could be large. So we should not allocate
this memory bottom up from the end of kernel image.

When SRAT is parsed, we will be able to know whihc memory is hotpluggable,
and we can avoid allocating this memory for the kernel. So reorder
reserve_crashkernel() after SRAT is parsed.

Acked-by: Tejun Heo 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 arch/x86/kernel/setup.c |9 +++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f0de629..b5e350d 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1120,8 +1120,6 @@ void __init setup_arch(char **cmdline_p)
acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
 #endif
 
-   reserve_crashkernel();
-
vsmp_init();
 
io_delay_init();
@@ -1134,6 +1132,13 @@ void __init setup_arch(char **cmdline_p)
early_acpi_boot_init();
 
initmem_init();
+
+   /*
+* Reserve memory for crash kernel after SRAT is parsed so that it
+* won't consume hotpluggable memory.
+*/
+   reserve_crashkernel();
+
memblock_find_dma_reserve();
 
 #ifdef CONFIG_KVM_GUEST
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part1 v6 4/6] x86/mem-hotplug: Support initialize page tables in bottom-up

2013-10-03 Thread Zhang Yanfei

From: Tang Chen 

The Linux kernel cannot migrate pages used by the kernel. As a
result, kernel pages cannot be hot-removed. So we cannot allocate
hotpluggable memory for the kernel.

In a memory hotplug system, any numa node the kernel resides in
should be unhotpluggable. And for a modern server, each node could
have at least 16GB memory. So memory around the kernel image is
highly likely unhotpluggable.

ACPI SRAT (System Resource Affinity Table) contains the memory
hotplug info. But before SRAT is parsed, memblock has already
started to allocate memory for the kernel. So we need to prevent
memblock from doing this.

So direct memory mapping page tables setup is the case. init_mem_mapping()
is called before SRAT is parsed. To prevent page tables being allocated
within hotpluggable memory, we will use bottom-up direction to allocate
page tables from the end of kernel image to the higher memory.

Acked-by: Tejun Heo 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 arch/x86/mm/init.c |   71 ++-
 1 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ea2be79..5cea9ed 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -458,6 +458,51 @@ static void __init memory_map_top_down(unsigned long 
map_start,
init_range_memory_mapping(real_end, map_end);
 }
 
+/**
+ * memory_map_bottom_up - Map [map_start, map_end) bottom up
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range
+ * [map_start, map_end) in bottom-up. Since we have limited the
+ * bottom-up allocation above the kernel, the page tables will
+ * be allocated just above the kernel and we map the memory
+ * in [map_start, map_end) in bottom-up.
+ */
+static void __init memory_map_bottom_up(unsigned long map_start,
+   unsigned long map_end)
+{
+   unsigned long next, new_mapped_ram_size, start;
+   unsigned long mapped_ram_size = 0;
+   /* step_size need to be small so pgt_buf from BRK could cover it */
+   unsigned long step_size = PMD_SIZE;
+
+   start = map_start;
+   min_pfn_mapped = start >> PAGE_SHIFT;
+
+   /*
+* We start from the bottom (@map_start) and go to the top (@map_end).
+* The memblock_find_in_range() gets us a block of RAM from the
+* end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
+* for page table.
+*/
+   while (start < map_end) {
+   if (map_end - start > step_size) {
+   next = round_up(start + 1, step_size);
+   if (next > map_end)
+   next = map_end;
+   } else
+   next = map_end;
+
+   new_mapped_ram_size = init_range_memory_mapping(start, next);
+   start = next;
+
+   if (new_mapped_ram_size > mapped_ram_size)
+   step_size <<= STEP_SIZE_SHIFT;
+   mapped_ram_size += new_mapped_ram_size;
+   }
+}
+
 void __init init_mem_mapping(void)
 {
unsigned long end;
@@ -473,8 +518,30 @@ void __init init_mem_mapping(void)
/* the ISA range is always mapped regardless of memory holes */
init_memory_mapping(0, ISA_END_ADDRESS);
 
-   /* setup direct mapping for range [ISA_END_ADDRESS, end) in top-down*/
-   memory_map_top_down(ISA_END_ADDRESS, end);
+   /*
+* If the allocation is in bottom-up direction, we setup direct mapping
+* in bottom-up, otherwise we setup direct mapping in top-down.
+*/
+   if (memblock_bottom_up()) {
+   unsigned long kernel_end;
+
+#ifdef CONFIG_X86
+   kernel_end = __pa_symbol(_end);
+#else
+   kernel_end = __pa(RELOC_HIDE((unsigned long)(_end), 0));
+#endif
+   /*
+* we need two separate calls here. This is because we want to
+* allocate page tables above the kernel. So we first map
+* [kernel_end, end) to make memory above the kernel be mapped
+* as soon as possible. And then use page tables allocated above
+* the kernel to map [ISA_END_ADDRESS, kernel_end).
+*/
+   memory_map_bottom_up(kernel_end, end);
+   memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
+   } else {
+   memory_map_top_down(ISA_END_ADDRESS, end);
+   }
 
 #ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part1 v6 3/6] x86/mm: Factor out of top-down direct mapping setup

2013-10-03 Thread Zhang Yanfei

From: Tang Chen 

This patch creates a new function memory_map_top_down to
factor out of the top-down direct memory mapping pagetable
setup. This is also a preparation for the following patch,
which will introduce the bottom-up memory mapping. That said,
we will put the two ways of pagetable setup into separate
functions, and choose to use which way in init_mem_mapping,
which makes the code more clear.

Acked-by: Tejun Heo 
Acked-by: Toshi Kani 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 arch/x86/mm/init.c |   60 ++-
 1 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 04664cd..ea2be79 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -401,27 +401,28 @@ static unsigned long __init init_range_memory_mapping(
 
 /* (PUD_SHIFT-PMD_SHIFT)/2 */
 #define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+
+/**
+ * memory_map_top_down - Map [map_start, map_end) top down
+ * @map_start: start address of the target memory range
+ * @map_end: end address of the target memory range
+ *
+ * This function will setup direct mapping for memory range
+ * [map_start, map_end) in top-down. That said, the page tables
+ * will be allocated at the end of the memory, and we map the
+ * memory in top-down.
+ */
+static void __init memory_map_top_down(unsigned long map_start,
+  unsigned long map_end)
 {
-   unsigned long end, real_end, start, last_start;
+   unsigned long real_end, start, last_start;
unsigned long step_size;
unsigned long addr;
unsigned long mapped_ram_size = 0;
unsigned long new_mapped_ram_size;
 
-   probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
-   end = max_pfn << PAGE_SHIFT;
-#else
-   end = max_low_pfn << PAGE_SHIFT;
-#endif
-
-   /* the ISA range is always mapped regardless of memory holes */
-   init_memory_mapping(0, ISA_END_ADDRESS);
-
/* xen has big range in reserved near end of ram, skip it at first.*/
-   addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+   addr = memblock_find_in_range(map_start, map_end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;
 
/* step_size need to be small so pgt_buf from BRK could cover it */
@@ -436,13 +437,13 @@ void __init init_mem_mapping(void)
 * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
 * for page table.
 */
-   while (last_start > ISA_END_ADDRESS) {
+   while (last_start > map_start) {
if (last_start > step_size) {
start = round_down(last_start - 1, step_size);
-   if (start < ISA_END_ADDRESS)
-   start = ISA_END_ADDRESS;
+   if (start < map_start)
+   start = map_start;
} else
-   start = ISA_END_ADDRESS;
+   start = map_start;
new_mapped_ram_size = init_range_memory_mapping(start,
last_start);
last_start = start;
@@ -453,8 +454,27 @@ void __init init_mem_mapping(void)
mapped_ram_size += new_mapped_ram_size;
}
 
-   if (real_end < end)
-   init_range_memory_mapping(real_end, end);
+   if (real_end < map_end)
+   init_range_memory_mapping(real_end, map_end);
+}
+
+void __init init_mem_mapping(void)
+{
+   unsigned long end;
+
+   probe_page_size_mask();
+
+#ifdef CONFIG_X86_64
+   end = max_pfn << PAGE_SHIFT;
+#else
+   end = max_low_pfn << PAGE_SHIFT;
+#endif
+
+   /* the ISA range is always mapped regardless of memory holes */
+   init_memory_mapping(0, ISA_END_ADDRESS);
+
+   /* setup direct mapping for range [ISA_END_ADDRESS, end) in top-down*/
+   memory_map_top_down(ISA_END_ADDRESS, end);
 
 #ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part1 v6 2/6] memblock: Introduce bottom-up allocation mode

2013-10-03 Thread Zhang Yanfei

From: Tang Chen 

The Linux kernel cannot migrate pages used by the kernel. As a result, kernel
pages cannot be hot-removed. So we cannot allocate hotpluggable memory for
the kernel.

ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info.
But before SRAT is parsed, memblock has already started to allocate memory
for the kernel. So we need to prevent memblock from doing this.

In a memory hotplug system, any numa node the kernel resides in should
be unhotpluggable. And for a modern server, each node could have at least
16GB memory. So memory around the kernel image is highly likely unhotpluggable.

So the basic idea is: Allocate memory from the end of the kernel image and
to the higher memory. Since memory allocation before SRAT is parsed won't
be too much, it could highly likely be in the same node with kernel image.

The current memblock can only allocate memory top-down. So this patch introduces
a new bottom-up allocation mode to allocate memory bottom-up. And later
when we use this allocation direction to allocate memory, we will limit
the start address above the kernel.

Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 include/linux/memblock.h |   24 +
 mm/memblock.c|   87 --
 2 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 31e95ac..77c60e5 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -35,6 +35,7 @@ struct memblock_type {
 };
 
 struct memblock {
+   bool bottom_up;  /* is bottom up direction? */
phys_addr_t current_limit;
struct memblock_type memory;
struct memblock_type reserved;
@@ -148,6 +149,29 @@ phys_addr_t memblock_alloc_try_nid(phys_addr_t size, 
phys_addr_t align, int nid)
 
 phys_addr_t memblock_alloc(phys_addr_t size, phys_addr_t align);
 
+#ifdef CONFIG_MOVABLE_NODE
+/*
+ * Set the allocation direction to bottom-up or top-down.
+ */
+static inline void memblock_set_bottom_up(bool enable)
+{
+   memblock.bottom_up = enable;
+}
+
+/*
+ * Check if the allocation direction is bottom-up or not.
+ * if this is true, that said, memblock will allocate memory
+ * in bottom-up direction.
+ */
+static inline bool memblock_bottom_up(void)
+{
+   return memblock.bottom_up;
+}
+#else
+static inline void memblock_set_bottom_up(bool enable) {}
+static inline bool memblock_bottom_up(void) { return false; }
+#endif
+
 /* Flags for memblock_alloc_base() amd __memblock_alloc_base() */
 #define MEMBLOCK_ALLOC_ANYWHERE(~(phys_addr_t)0)
 #define MEMBLOCK_ALLOC_ACCESSIBLE  0
diff --git a/mm/memblock.c b/mm/memblock.c
index accff10..04f20f4 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -20,6 +20,8 @@
 #include 
 #include 
 
+#include 
+
 static struct memblock_region 
memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
 static struct memblock_region 
memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
 
@@ -32,6 +34,7 @@ struct memblock memblock __initdata_memblock = {
.reserved.cnt   = 1,/* empty dummy entry */
.reserved.max   = INIT_MEMBLOCK_REGIONS,
 
+   .bottom_up  = false,
.current_limit  = MEMBLOCK_ALLOC_ANYWHERE,
 };
 
@@ -82,6 +85,38 @@ static long __init_memblock memblock_overlaps_region(struct 
memblock_type *type,
return (i < type->cnt) ? i : -1;
 }
 
+/*
+ * __memblock_find_range_bottom_up - find free area utility in bottom-up
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Utility called from memblock_find_in_range_node(), find free area bottom-up.
+ *
+ * RETURNS:
+ * Found address on success, 0 on failure.
+ */
+static phys_addr_t __init_memblock
+__memblock_find_range_bottom_up(phys_addr_t start, phys_addr_t end,
+   phys_addr_t size, phys_addr_t align, int nid)
+{
+   phys_addr_t this_start, this_end, cand;
+   u64 i;
+
+   for_each_free_mem_range(i, nid, &this_start, &this_end, NULL) {
+   this_start = clamp(this_start, start, end);
+   this_end = clamp(this_end, start, end);
+
+   cand = round_up(this_start, align);
+   if (cand < this_end && this_end - cand >= size)
+   return cand;
+   }
+
+   return 0;
+}
+
 /**
  * __memblock_find_range_top_down - find free area utility, in top-down
  * @start: start of candidate range
@@ -93,7 +128,7 @@ static long __init_memblock memblock_overlaps_region(struct 
memblock_type *type,
  * Utility called from memblock_find_in_range_node(), find free area top-down.
  *
  * RETURNS:
- * Found address on success, %0 on failure

[PATCH part1 v6 1/6] memblock: Factor out of top-down allocation

2013-10-03 Thread Zhang Yanfei

From: Tang Chen 

This patch creates a new function __memblock_find_range_top_down
to factor out of top-down allocation from memblock_find_in_range_node.
This is a preparation because we will introduce a new bottom-up
allocation mode in the following patch.

Acked-by: Tejun Heo 
Acked-by: Toshi Kani 
Signed-off-by: Tang Chen 
Signed-off-by: Zhang Yanfei 
---
 mm/memblock.c |   47 ++-
 1 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 0ac412a..accff10 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -83,33 +83,25 @@ static long __init_memblock memblock_overlaps_region(struct 
memblock_type *type,
 }
 
 /**
- * memblock_find_in_range_node - find free area in given range and node
+ * __memblock_find_range_top_down - find free area utility, in top-down
  * @start: start of candidate range
  * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
  * @size: size of free area to find
  * @align: alignment of free area to find
  * @nid: nid of the free area to find, %MAX_NUMNODES for any node
  *
- * Find @size free area aligned to @align in the specified range and node.
+ * Utility called from memblock_find_in_range_node(), find free area top-down.
  *
  * RETURNS:
  * Found address on success, %0 on failure.
  */
-phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
-   phys_addr_t end, phys_addr_t size,
-   phys_addr_t align, int nid)
+static phys_addr_t __init_memblock
+__memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
+  phys_addr_t size, phys_addr_t align, int nid)
 {
phys_addr_t this_start, this_end, cand;
u64 i;
 
-   /* pump up @end */
-   if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
-   end = memblock.current_limit;
-
-   /* avoid allocating the first page */
-   start = max_t(phys_addr_t, start, PAGE_SIZE);
-   end = max(start, end);
-
for_each_free_mem_range_reverse(i, nid, &this_start, &this_end, NULL) {
this_start = clamp(this_start, start, end);
this_end = clamp(this_end, start, end);
@@ -121,10 +113,39 @@ phys_addr_t __init_memblock 
memblock_find_in_range_node(phys_addr_t start,
if (cand >= this_start)
return cand;
}
+
return 0;
 }
 
 /**
+ * memblock_find_in_range_node - find free area in given range and node
+ * @start: start of candidate range
+ * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
+ * @size: size of free area to find
+ * @align: alignment of free area to find
+ * @nid: nid of the free area to find, %MAX_NUMNODES for any node
+ *
+ * Find @size free area aligned to @align in the specified range and node.
+ *
+ * RETURNS:
+ * Found address on success, %0 on failure.
+ */
+phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
+   phys_addr_t end, phys_addr_t size,
+   phys_addr_t align, int nid)
+{
+   /* pump up @end */
+   if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+   end = memblock.current_limit;
+
+   /* avoid allocating the first page */
+   start = max_t(phys_addr_t, start, PAGE_SIZE);
+   end = max(start, end);
+
+   return __memblock_find_range_top_down(start, end, size, align, nid);
+}
+
+/**
  * memblock_find_in_range - find free area in given range
  * @start: start of candidate range
  * @end: end of candidate range, can be %MEMBLOCK_ALLOC_{ANYWHERE|ACCESSIBLE}
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH part1 v6 0/6] x86, memblock: Allocate memory near kernel image before SRAT parsed

2013-10-03 Thread Zhang Yanfei

Hello, here is the v6 version. Any comments are welcome!

The v6 version is based on linus's tree (3.12-rc3)
HEAD is:
commit 15c03dd4859ab16f9212238f29dd315654aa94f6
Author: Linus Torvalds 
Date:   Sun Sep 29 15:02:38 2013 -0700

Linux 3.12-rc3


[Problem]

The current Linux cannot migrate pages used by the kerenl because
of the kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
When the pa is changed, we cannot simply update the pagetable and
keep the va unmodified. So the kernel pages are not migratable.

There are also some other issues will cause the kernel pages not migratable.
For example, the physical address may be cached somewhere and will be used.
It is not to update all the caches.

When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.

Modifying the kernel direct mapping mechanism is too difficult to do. And
it may cause the kernel performance down and unstable. So we use the following
way to do memory hotplug.


[What we are doing]

In Linux, memory in one numa node is divided into several zones. One of the
zones is ZONE_MOVABLE, which the kernel won't use.

In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use these memory.
To do this, we need ACPI's help.

In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The memory
affinities in SRAT record every memory range in the system, and also, flags
specifying if the memory range is hotpluggable.
(Please refer to ACPI spec 5.0 5.2.16)

With the help of SRAT, we have to do the following two things to achieve our
goal:

1. When doing memory hot-add, allow the users arranging hotpluggable as
   ZONE_MOVABLE.
   (This has been done by the MOVABLE_NODE functionality in Linux.)

2. when the system is booting, prevent bootmem allocator from allocating
   hotpluggable memory for the kernel before the memory initialization
   finishes.

The problem 2 is the key problem we are going to solve. But before solving it,
we need some preparation. Please see below.


[Preparation]

Bootloader has to load the kernel image into memory. And this memory must be 
unhotpluggable. We cannot prevent this anyway. So in a memory hotplug system, 
we can assume any node the kernel resides in is not hotpluggable.

Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But
memblock has already started to work. In the current kernel, memblock allocates 
the following memory before SRAT is parsed:

setup_arch()
 |->memblock_x86_fill()/* memblock is ready */
 |..
 |->early_reserve_e820_mpc_new()   /* allocate memory under 1MB */
 |->reserve_real_mode()/* allocate memory under 1MB */
 |->init_mem_mapping() /* allocate page tables, about 2MB to map 
1GB memory */
 |->dma_contiguous_reserve()   /* specified by user, should be low */
 |->setup_log_buf()/* specified by user, several mega bytes */
 |->relocate_initrd()  /* could be large, but will be freed after 
boot, should reorder */
 |->acpi_initrd_override() /* several mega bytes */
 |->reserve_crashkernel()  /* could be large, should reorder */
 |..
 |->initmem_init() /* Parse SRAT */

According to Tejun's advice, before SRAT is parsed, we should try our best to
allocate memory near the kernel image. Since the whole node the kernel resides 
in won't be hotpluggable, and for a modern server, a node may have at least 16GB
memory, allocating several mega bytes memory around the kernel image won't cross
to hotpluggable memory.


[About this patch-set]

So this patch-set is the preparation for the problem 2 that we want to solve. It
does the following:

1. Make memblock be able to allocate memory bottom up.
   1) Keep all the memblock APIs' prototype unmodified.
   2) When the direction is bottom up, keep the start address greater than the 
  end of kernel image.

2. Improve init_mem_mapping() to support allocate page tables in bottom up 
direction.

3. Introduce "movable_node" boot option to enable and disable this 
functionality.

Change log v5 -> v6:
1. Add tejun and toshi's ack in several patches.
2. Change movablenode to movable_node boot option and update the description
   for movable_node and CONFIG_MOVABLE_NODE. Thanks Ingo!
3. Fix the __pa_symbol() issue pointed by Andrew Morton.
4. Update some functions' comments and names.

Change log v4 -> v5:
1. Change memblock.current_direction to a boolean memblock.bottom_up. And 
remove 
   the direction enum.
2. Update and add some comments to explain things clearer.
3. Misc fixes, such as removing unnecessary #ifdef

Change log v3 -> v4:
1. Use bottom-up/top-down to unify things. Thanks tj.
2. Factor out of current top-down implementatio

[PATCH TRIVIAL] __page_to_pfn: Fix typo in comment

2013-10-03 Thread Zhang Yanfei

From: Zhang Yanfei 

Fix typo in __page_to_pfn comment: s/encorded/encoded.

Signed-off-by: Zhang Yanfei 
---
 include/asm-generic/memory_model.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/memory_model.h 
b/include/asm-generic/memory_model.h
index aea9e45..14909b0 100644
--- a/include/asm-generic/memory_model.h
+++ b/include/asm-generic/memory_model.h
@@ -53,7 +53,7 @@
 
 #elif defined(CONFIG_SPARSEMEM)
 /*
- * Note: section's mem_map is encorded to reflect its start_pfn.
+ * Note: section's mem_map is encoded to reflect its start_pfn.
  * section[i].section_mem_map == mem_map's address - start_pfn;
  */
 #define __page_to_pfn(pg)  \
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] x86/early_iounmap: Let the compiler enter the function name

2013-10-03 Thread Zhang Yanfei

From: Zhang Yanfei 

To be consistent with early_ioremap which had a change in
commit 4f4319a ("x86/ioremap: Correct function name output"),
let the complier enter the function name too.

Signed-off-by: Zhang Yanfei 
---
 arch/x86/mm/ioremap.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 799580c..577bd8e 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -585,21 +585,21 @@ void __init early_iounmap(void __iomem *addr, unsigned 
long size)
}
 
if (slot < 0) {
-   printk(KERN_INFO "early_iounmap(%p, %08lx) not found slot\n",
+   printk(KERN_INFO "%s(%p, %08lx) not found slot\n", __func__,
 addr, size);
WARN_ON(1);
return;
}
 
if (prev_size[slot] != size) {
-   printk(KERN_INFO "early_iounmap(%p, %08lx) [%d] size not 
consistent %08lx\n",
-addr, size, slot, prev_size[slot]);
+   printk(KERN_INFO "%s(%p, %08lx) [%d] size not consistent 
%08lx\n",
+__func__, addr, size, slot, prev_size[slot]);
WARN_ON(1);
return;
}
 
if (early_ioremap_debug) {
-   printk(KERN_INFO "early_iounmap(%p, %08lx) [%d]\n", addr,
+   printk(KERN_INFO "%s(%p, %08lx) [%d]\n", __func__, addr,
   size, slot);
dump_stack();
}
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/2] mm/sparsemem: Fix a bug in free_map_bootmem when CONFIG_SPARSEMEM_VMEMMAP

2013-10-02 Thread Zhang Yanfei

From: Zhang Yanfei 

We pass the number of pages which hold page structs of a memory
section to function free_map_bootmem. This is right when
!CONFIG_SPARSEMEM_VMEMMAP but wrong when CONFIG_SPARSEMEM_VMEMMAP.
When CONFIG_SPARSEMEM_VMEMMAP, we should pass the number of pages
of a memory section to free_map_bootmem.

So the fix is removing the nr_pages parameter. When
CONFIG_SPARSEMEM_VMEMMAP, we directly use the prefined marco
PAGES_PER_SECTION in free_map_bootmem. When !CONFIG_SPARSEMEM_VMEMMAP,
we calculate page numbers needed to hold the page structs for a
memory section and use the value in free_map_bootmem.

Signed-off-by: Zhang Yanfei 
---
 mm/sparse.c |   17 +++--
 1 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index fbb9dbc..908c134 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -603,10 +603,10 @@ static void __kfree_section_memmap(struct page *memmap)
vmemmap_free(start, end);
 }
 #ifdef CONFIG_MEMORY_HOTREMOVE
-static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
+static void free_map_bootmem(struct page *memmap)
 {
unsigned long start = (unsigned long)memmap;
-   unsigned long end = (unsigned long)(memmap + nr_pages);
+   unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
 
vmemmap_free(start, end);
 }
@@ -648,11 +648,13 @@ static void __kfree_section_memmap(struct page *memmap)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-static void free_map_bootmem(struct page *memmap, unsigned long nr_pages)
+static void free_map_bootmem(struct page *memmap)
 {
unsigned long maps_section_nr, removing_section_nr, i;
unsigned long magic;
struct page *page = virt_to_page(memmap);
+   unsigned long nr_pages = get_order(sizeof(struct page) *
+  PAGES_PER_SECTION);
 
for (i = 0; i < nr_pages; i++, page++) {
magic = (unsigned long) page->lru.next;
@@ -756,7 +758,6 @@ static inline void clear_hwpoisoned_pages(struct page 
*memmap, int nr_pages)
 static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 {
struct page *usemap_page;
-   unsigned long nr_pages;
 
if (!usemap)
return;
@@ -777,12 +778,8 @@ static void free_section_usemap(struct page *memmap, 
unsigned long *usemap)
 * on the section which has pgdat at boot time. Just keep it as is now.
 */
 
-   if (memmap) {
-   nr_pages = PAGE_ALIGN(PAGES_PER_SECTION * sizeof(struct page))
-   >> PAGE_SHIFT;
-
-   free_map_bootmem(memmap, nr_pages);
-   }
+   if (memmap)
+   free_map_bootmem(memmap);
 }
 
 void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 4 >

1 - 100 of 389 matches

Mail list logo