Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-20 Thread Lin Feng


On 02/20/2013 07:31 PM, Simon Jeons wrote:
> On 02/20/2013 06:23 PM, Lin Feng wrote:
>> Hi Simon,
>>
>> On 02/20/2013 05:58 PM, Simon Jeons wrote:
 The other is that this is almost certainly broken for transhuge page
 handling. gup returns the head and tail pages and ordinarily this is ok
>>> When does gup need to handle THP? In the kvm case?
>> gup just pins the wanted pages (4 KB each on x86) of the user address space
>> in memory. We can't expect the pages allocated for a user address space to
>> be THP or normal pages, so we need to handle both, and I think it has
>> nothing to do with kvm.
> 
> OK, I'm curious: which functions (that will call gup) does a userspace
> process use to pin pages, other than make_pages_present?
No, a userspace process doesn't pin any pages directly; it does so indirectly
through syscalls such as io_setup(), because in those paths the kernel can't
take a page fault and has to keep the pages alive. The kernel wants to
communicate with userspace, for example to notify it of events, so it needs
some sort of buffer that both kernel and user space can access, which leads to
the so-called pinned pages created by gup.

thanks,
linfeng
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-20 Thread Simon Jeons

On 02/20/2013 06:23 PM, Lin Feng wrote:

Hi Simon,

On 02/20/2013 05:58 PM, Simon Jeons wrote:

The other is that this is almost certainly broken for transhuge page
handling. gup returns the head and tail pages and ordinarily this is ok

When does gup need to handle THP? In the kvm case?

gup just pins the wanted pages (4 KB each on x86) of the user address space in
memory. We can't expect the pages allocated for a user address space to be THP
or normal pages, so we need to handle both, and I think it has nothing to do
with kvm.


OK, I'm curious: which functions (that will call gup) does a userspace process
use to pin pages, other than make_pages_present?




thanks,
linfeng




Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-20 Thread Lin Feng
Hi Simon,

On 02/20/2013 05:58 PM, Simon Jeons wrote:
> 
>>
>> The other is that this is almost certainly broken for transhuge page
>> handling. gup returns the head and tail pages and ordinarily this is ok
> 
> When does gup need to handle THP? In the kvm case?

gup just pins the wanted pages (4 KB each on x86) of the user address space in
memory. We can't expect the pages allocated for a user address space to be THP
or normal pages, so we need to handle both, and I think it has nothing to do
with kvm.

thanks,
linfeng


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-20 Thread Simon Jeons

On 02/05/2013 09:32 PM, Mel Gorman wrote:

On Tue, Feb 05, 2013 at 11:57:22AM +, Mel Gorman wrote:

+   migrate_pre_flag = 1;
+   }
+
+   if (!isolate_lru_page(pages[i])) {
+   inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
+page_is_file_cache(pages[i]));
+   list_add_tail(&pages[i]->lru, &pagelist);
+   } else {
+   isolate_err = 1;
+   goto put_page;
+   }

isolate_lru_page() takes the LRU lock every time.

Credit to Michal Hocko for bringing this up but with the number of
other issues I missed that this is also broken with respect to huge page
handling. hugetlbfs pages will not be on the LRU so the isolation will mess
up and the migration has to be handled differently.  Ordinarily hugetlbfs
pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
encounters a hugetlbfs page, it'll just blow up.


As you said, hugetlbfs pages are not on the LRU list, so how can this code
encounter a hugetlbfs page and blow up?




The other is that this is almost certainly broken for transhuge page
handling. gup returns the head and tail pages and ordinarily this is ok


When does gup need to handle THP? In the kvm case?


because the caller only cares about the physical address. Migration will
also split a hugepage if it receives it but you are potentially adding
tail pages to a list here and then migrating them. The split of the first
page will get very confused. I'm not exactly sure what the result will be
but it won't be pretty.

Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
during testing?






Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-19 Thread Lin Feng
Hi Wanpeng,

On 02/20/2013 10:44 AM, Wanpeng Li wrote:
>> Sorry, I misunderstood what "tail pages" means, stupid question, just ignore 
>> it.
>> flee...
> In a compound page, the first page is called the head page and the
> other sub-pages are called tail pages.
> 
> Regards,
> Wanpeng Li 
> 
Yes, you are right, thanks for explaining.
I thought it meant only the last one of the compound page's sub-pages..

thanks,
linfeng


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-19 Thread Lin Feng


On 02/19/2013 09:37 PM, Lin Feng wrote:
>> > 
>> > The other is that this is almost certainly broken for transhuge page
>> > handling. gup returns the head and tail pages and ordinarily this is ok
> I can't find the code that does this :(, could you please point me out?
> 
Sorry, I misunderstood what "tail pages" means, stupid question, just ignore it.
flee...

thanks,
linfeng


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-19 Thread Lin Feng
Hi Mel,

On 02/05/2013 09:32 PM, Mel Gorman wrote:
> On Tue, Feb 05, 2013 at 11:57:22AM +, Mel Gorman wrote:
>>
 +  migrate_pre_flag = 1;
 +  }
 +
 +  if (!isolate_lru_page(pages[i])) {
 +  inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
 +   page_is_file_cache(pages[i]));
 +  list_add_tail(&pages[i]->lru, &pagelist);
 +  } else {
 +  isolate_err = 1;
 +  goto put_page;
 +  }
>>
>> isolate_lru_page() takes the LRU lock every time.
> 
> Credit to Michal Hocko for bringing this up but with the number of
> other issues I missed that this is also broken with respect to huge page
> handling. hugetlbfs pages will not be on the LRU so the isolation will mess
> up and the migration has to be handled differently.  Ordinarily hugetlbfs
> pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
> it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
> encounters a hugetlbfs page, it'll just blow up.

Looking into the migrate_huge_page() code, I find that if we want to support
non-movable migration of hugetlbfs pages, we have to invent another allocation
interface such as alloc_huge_page_node_nonmovable(), whose cost is large (it
bloats the code and has a big impact on the current alloc_huge_page_node())
but gains little; I think pinning a hugepage is a corner case.

So can we skip over hugepages without migrating them but emit some WARN_ON()
info? Is that acceptable?

> 
> The other is that this is almost certainly broken for transhuge page
> handling. gup returns the head and tail pages and ordinarily this is ok

I can't find the code that does this :(, could you please point me out?

> because the caller only cares about the physical address. Migration will
> also split a hugepage if it receives it but you are potentially adding
> tail pages to a list here and then migrating them. The split of the first
> page will get very confused. I'm not exactly sure what the result will be
> but it won't be pretty.
> 
> Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
> during testing?

I checked my config file: both CONFIG options above are enabled. However, it
was only tested with two services invoking io_setup(), and it works fine..

thanks,
linfeng


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-19 Thread Mel Gorman
On Tue, Feb 19, 2013 at 05:55:30PM +0800, Lin Feng wrote:
> Hi Mel,
> 
> On 02/18/2013 11:17 PM, Mel Gorman wrote:
> >>> > > 
> >>> > >
> >>> > > result. It's a little clumsy but the memory hot-remove failure message
> >>> > > could list what applications have pinned the pages that cannot be 
> >>> > > removed
> >>> > > so the administrator has the option of force-killing the application. 
> >>> > > It
> >>> > > is possible to discover what application is pinning a page from 
> >>> > > userspace
> >>> > > but it would involve an expensive search with /proc/kpagemap
> >>> > > 
> > > >>> + if (migrate_pre_flag && !isolate_err) {
> > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> > > >>> + false, MIGRATE_SYNC, 
> > > >>> MR_SYSCALL);
> >>> > > 
> >>> > > The conversion of alloc_migrate_target is a bit problematic. It strips
> >>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts
> >>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a 
> >>> > > large
> >>> > > number of pages, particularly if the number of 
> >>> > > get_user_pages_non_movable()
> >>> > > increases for short-lived pins like direct IO.
> >> >
> >> > Sorry, I don't quite understand here either. If we use the following 
> >> > new 
> >> > migration allocation function as you said, the increasing number of 
> >> > get_user_pages_non_movable() will also lead to large numbers of 
> >> > MIGRATE_UNMOVABLE
> >> > pages. What's the difference, do I miss something?
> >> > 
> > The replacement function preserves the __GFP_MOVABLE flag. It cannot use
> > ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
> > other movable pages.
> 
> Ah, got it: "But GFP_MOVABLE is not only a zone specifier but also an
> allocation policy."
> 

Update the comment and describe the exception then.

> Could I clear the __GFP_HIGHMEM flag in alloc_migrate_target, depending on
> the private parameter, so that we keep the MIGRATE_UNMOVABLE policy but can
> also allocate pages from non-movable zones with little change?
> 

It should work (double-check gfp_zone) but then the allocation cannot use the
highmem zone. If you can be 100% certain that the zone will not exist or be
populated then it will work as expected, but it's a hack and should be
commented clearly. You could add a BUILD_BUG_ON if CONFIG_HIGHMEM is set to
enforce it.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-19 Thread Lin Feng
Hi Mel,

On 02/18/2013 11:17 PM, Mel Gorman wrote:
>>> > > 
>>> > >
>>> > > result. It's a little clumsy but the memory hot-remove failure message
>>> > > could list what applications have pinned the pages that cannot be 
>>> > > removed
>>> > > so the administrator has the option of force-killing the application. It
>>> > > is possible to discover what application is pinning a page from 
>>> > > userspace
>>> > > but it would involve an expensive search with /proc/kpagemap
>>> > > 
> > >>> +   if (migrate_pre_flag && !isolate_err) {
> > >>> +   ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> > >>> +   false, MIGRATE_SYNC, 
> > >>> MR_SYSCALL);
>>> > > 
>>> > > The conversion of alloc_migrate_target is a bit problematic. It strips
>>> > > the __GFP_MOVABLE flag and the consequence of this is that it converts
>>> > > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a 
>>> > > large
>>> > > number of pages, particularly if the number of 
>>> > > get_user_pages_non_movable()
>>> > > increases for short-lived pins like direct IO.
>> >
>> > Sorry, I don't quite understand here either. If we use the following new 
>> > migration allocation function as you said, the increasing number of 
>> > get_user_pages_non_movable() will also lead to large numbers of 
>> > MIGRATE_UNMOVABLE
>> > pages. What's the difference, do I miss something?
>> > 
> The replacement function preserves the __GFP_MOVABLE flag. It cannot use
> ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
> other movable pages.

Ah, got it: "But GFP_MOVABLE is not only a zone specifier but also an
allocation policy."

Could I clear the __GFP_HIGHMEM flag in alloc_migrate_target, depending on the
private parameter, so that we keep the MIGRATE_UNMOVABLE policy but can also
allocate pages from non-movable zones with little change?

Does that approach work? Otherwise I have to follow your suggestion.

thanks,
linfeng



Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-18 Thread Mel Gorman
On Mon, Feb 18, 2013 at 06:34:44PM +0800, Lin Feng wrote:
> >>> + if (!migrate_pre_flag) {
> >>> + if (migrate_prep())
> >>> + goto put_page;
> > 
> > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
> > return failure. You could make this BUG_ON(migrate_prep()), remove this
> > goto and the migrate_pre_flag check below becomes redundant.
> Sorry, I don't quite follow this. Without migrate_pre_flag, the BUG_ON()
> check will call migrate_prep() every time we isolate a single page; do we
> have to do so?

I was referring to the migrate_pre_flag check further down and I'm not
suggesting it be removed from here as you do want to call migrate_prep()
only once. However, as it'll never return failure for any kernel
configuration that allows memory hot-remove, this goto can be
removed and the flow simplified.

> > 
> >>> + migrate_pre_flag = 1;
> >>> + }
> >>> +
> >>> + if (!isolate_lru_page(pages[i])) {
> >>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> >>> +  page_is_file_cache(pages[i]));
> > >>> + list_add_tail(&pages[i]->lru, &pagelist);
> >>> + } else {
> >>> + isolate_err = 1;
> >>> + goto put_page;
> >>> + }
> > 
> > isolate_lru_page() takes the LRU lock every time. If
> > get_user_pages_non_movable is used heavily then you may encounter lock
> > contention problems. Batching this lock would be a separate patch and should
> > not be necessary yet but you should at least comment on it as a reminder.
> Farsighted, should improve it in the future.
> 
> > 
> > I think that a goto could also have been avoided here if you used break. The
> > i == ret check below would be false and it would just fall through.
> > Overall the flow would be a bit easier to follow.
> Yes, I noticed this before. I thought using goto could save some micro-ops
> (here, only the i == ret check), trading a little readability while still
> being clear.
> 
> Also there are some other places in the current kernel facing such a
> performance/readability trade-off, and it seems that the developers prefer
> readability. Why? I think performance is vital to the kernel..
> 

Memory hot-remove and page migration are not performance critical paths.
For page migration, the cost will be likely dominated by the page copy but
it's also possible the cost will be dominated by locking. The difference
between a break and goto will not even be measurable.

> > 
> >
> > result. It's a little clumsy but the memory hot-remove failure message
> > could list what applications have pinned the pages that cannot be removed
> > so the administrator has the option of force-killing the application. It
> > is possible to discover what application is pinning a page from userspace
> > but it would involve an expensive search with /proc/kpagemap
> > 
> >>> + if (migrate_pre_flag && !isolate_err) {
> > >>> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> >>> + false, MIGRATE_SYNC, MR_SYSCALL);
> > 
> > The conversion of alloc_migrate_target is a bit problematic. It strips
> > the __GFP_MOVABLE flag and the consequence of this is that it converts
> > those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
> > number of pages, particularly if the number of get_user_pages_non_movable()
> > increases for short-lived pins like direct IO.
>
> Sorry, I don't quite understand here either. If we use the following new
> migration allocation function as you said, the increasing number of
> get_user_pages_non_movable() calls will also lead to large numbers of
> MIGRATE_UNMOVABLE pages. What's the difference, do I miss something?
> 

The replacement function preserves the __GFP_MOVABLE flag. It cannot use
ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
other movable pages.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-18 Thread Lin Feng
Hi Mel,

See below.

On 02/05/2013 07:57 PM, Mel Gorman wrote:
> On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote:
>> The ifdefs aren't really needed here and I encourage people to omit
>> them.  This keeps the header files looking neater and reduces the
>> chances of things later breaking because we forgot to update some
>> CONFIG_foo logic in a header file.  The downside is that errors will be
>> revealed at link time rather than at compile time, but that's a pretty
>> small cost.
>>
> 
> As an aside, if ifdefs *have* to be used then it often better to have a
> pattern like
> 
> #ifdef CONFIG_MEMORY_HOTREMOVE
> int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
>   unsigned long start, int nr_pages, int write, int force,
>   struct page **pages, struct vm_area_struct **vmas);
> #else
> static inline get_user_pages_non_movable(...)
> {
>   get_user_pages(...)
> }
> #endif
> 
> It eliminates the need for #ifdefs in the C file that calls
> get_user_pages_non_movable().
Thanks for pointing out :)

>>> +
>>> +retry:
>>> +   ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
>>> +   vmas);
>>
>> We should handle (ret < 0) here.  At present the function will silently
>> convert an error return into "return 0", which is bad.  The function
>> does appear to otherwise do the right thing if get_user_pages() failed,
>> but only due to good luck.
>>
> 
> A BUG_ON check if we retry more than once wouldn't hurt either. It requires
> a broken implementation of alloc_migrate_target() but someone might "fix"
> it and miss this.
Agree.

> 
> A minor aside, it would be nice if we exited at this point if there was
> no populated ZONE_MOVABLE in the system. There is a movable_zone global
> variable already. If it was forced to be MAX_NR_ZONES before the call to
> find_usable_zone_for_movable() in memory initialisation we should be able
> to make a cheap check for it here.
OK.

>>> +   if (!migrate_pre_flag) {
>>> +   if (migrate_prep())
>>> +   goto put_page;
> 
> CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
> return failure. You could make this BUG_ON(migrate_prep()), remove this
> goto and the migrate_pre_flag check below becomes redundant.
Sorry, I don't quite follow this. Without migrate_pre_flag, the BUG_ON() check
will call/check migrate_prep() every time we isolate a single page, do we have
to do so?
I mean that for a group of pages it's sufficient to call migrate_prep() only
once. There is other migration-related code using this scheme.
So I get confused, do I miss something or is the performance cost small? :(

> 
>>> +   migrate_pre_flag = 1;
>>> +   }
>>> +
>>> +   if (!isolate_lru_page(pages[i])) {
>>> +   inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>> +page_is_file_cache(pages[i]));
> >>> +   list_add_tail(&pages[i]->lru, &pagelist);
>>> +   } else {
>>> +   isolate_err = 1;
>>> +   goto put_page;
>>> +   }
> 
> isolate_lru_page() takes the LRU lock every time. If
> get_user_pages_non_movable is used heavily then you may encounter lock
> contention problems. Batching this lock would be a separate patch and should
> not be necessary yet but you should at least comment on it as a reminder.
Farsighted, should improve it in the future.

> 
> I think that a goto could also have been avoided here if you used break. The
> i == ret check below would be false and it would just fall through.
> Overall the flow would be a bit easier to follow.
Yes, I noticed this before. I thought using goto could save some micro-ops
(here it is only the i == ret check); it lowers readability a little but is
still clear.

Also there are some other places in the current kernel facing such a
performance/readability trade-off, but it seems that the developers prefer
readability, why? While I think performance is vital to the kernel..

> 
> Why list_add_tail()? I don't think it's wrong but it's unusual to see
> list_add_tail() when list_add() is enough.
Sorry, not intentional, just stolen from other code without thinking more ;-)

> 
>>> +   }
>>> +   }
>>> +
>>> +   /* All pages are non movable, we are done :) */
> >>> +   if (i == ret && list_empty(&pagelist))
>>> +   return ret;
>>> +
>>> +put_page:
>>> +   /* Undo the effects of former get_user_pages(), we won't pin anything */
>>> +   for (i = 0; i < ret; i++)
>>> +   put_page(pages[i]);
>>> +
> 
> release_pages.
> 
> That comment is insufficient. There are non-obvious consequences to this
> logic. We are dropping pins on all pages regardless of what zone they
> are in. If the subsequent migration fails then we end up returning 0
> with no pages pinned. The user-visible effect is that io_setup() fails
> for non-obvious reasons. It will return EAGAIN to userspace which will be
> interpreted as "The specified 

Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-18 Thread Mel Gorman
On Mon, Feb 18, 2013 at 06:34:44PM +0800, Lin Feng wrote:
  + if (!migrate_pre_flag) {
  + if (migrate_prep())
  + goto put_page;
  
  CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
  return failure. You could make this BUG_ON(migrate_prep()), remove this
  goto and the migrate_pre_flag check below becomes redundant.
 Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() 
 check
 will call/check migrate_prep() every time we isolate a single page, do we have
 to do so?

I was referring to the migrate_pre_flag check further down and I'm not
suggesting it be removed from here as you do want to call migrate_prep()
only once. However, as it'll never return failure for any kernel
configuration that allows memory hot-remove, this goto can be
removed and the flow simplified.

  
  + migrate_pre_flag = 1;
  + }
  +
  + if (!isolate_lru_page(pages[i])) {
  + inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
  +  page_is_file_cache(pages[i]));
  + list_add_tail(&pages[i]->lru, &pagelist);
  + } else {
  + isolate_err = 1;
  + goto put_page;
  + }
  
  isolate_lru_page() takes the LRU lock every time. If
  get_user_pages_non_movable is used heavily then you may encounter lock
  contention problems. Batching this lock would be a separate patch and should
  not be necessary yet but you should at least comment on it as a reminder.
 Farsighted, should improve it in the future.
 
  
  I think that a goto could also have been avoided here if you used break. The
  i == ret check below would be false and it would just fall through.
  Overall the flow would be a bit easier to follow.
 Yes, I noticed this before. I thought using goto could save some micro-ops
 (here it is only the i == ret check); it lowers readability a little but is
 still clear.
 
 Also there are some other places in the current kernel facing such a
 performance/readability trade-off, but it seems that the developers prefer
 readability, why? While I think performance is vital to the kernel..
 

Memory hot-remove and page migration are not performance critical paths.
For page migration, the cost will be likely dominated by the page copy but
it's also possible the cost will be dominated by locking. The difference
between a break and goto will not even be measurable.

  SNIP
 
  result. It's a little clumsy but the memory hot-remove failure message
  could list what applications have pinned the pages that cannot be removed
  so the administrator has the option of force-killing the application. It
  is possible to discover what application is pinning a page from userspace
  but it would involve an expensive search with /proc/kpagemap
  
  + if (migrate_pre_flag && !isolate_err) {
  + ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
  + false, MIGRATE_SYNC, MR_SYSCALL);
  
  The conversion of alloc_migrate_target is a bit problematic. It strips
  the __GFP_MOVABLE flag and the consequence of this is that it converts
  those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large
  number of pages, particularly if the number of get_user_pages_non_movable()
  increases for short-lived pins like direct IO.

 Sorry, I don't quite understand here either. If we use the following new
 migration allocation function as you said, the increasing number of
 get_user_pages_non_movable() calls will also lead to large numbers of
 MIGRATE_UNMOVABLE pages. What's the difference, do I miss something?
 

The replacement function preserves the __GFP_MOVABLE flag. It cannot use
ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
other movable pages.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-06 Thread Mel Gorman
On Tue, Feb 05, 2013 at 06:26:51PM -0800, Michel Lespinasse wrote:
> Just nitpicking, but:
> 
> On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman  wrote:
> > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
> > +{
> > +   /* This mess avoids a potentially expensive pointer subtraction. */
> > +   int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> > +   return zone_off == idx * sizeof(*zone);
> > +}
> 
> Maybe:
> return zone == zone->zone_pgdat->node_zones + idx;
> ?
> 

Not a nit at all. Yours is more readable but it generates more code. A
single line function that uses the helper generates 0x3f bytes of code
(mostly function entry/exit) with your version and 0x39 bytes with mine. The
difference in efficiency is marginal as your version uses lea to multiply
by a constant but it's still slightly heavier.

The old code is fractionally better, your version is more readable so it's
up to Andrew really. Right now I think he's gone with his own version with
zone_idx() in the name of readability whatever sparse has to say about
the matter.

-- 
Mel Gorman
SUSE Labs



Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-05 Thread Michel Lespinasse
Just nitpicking, but:

On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman  wrote:
> +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
> +{
> +   /* This mess avoids a potentially expensive pointer subtraction. */
> +   int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> +   return zone_off == idx * sizeof(*zone);
> +}

Maybe:
return zone == zone->zone_pgdat->node_zones + idx;
?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-05 Thread Andrew Morton
On Tue, 05 Feb 2013 11:09:27 +0800
Lin Feng  wrote:

> > 
> >>  struct kvec;
> >>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
> >>struct page **pages);
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 73b64a3..5db811e 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
> >>return (idx == ZONE_NORMAL);
> >>  }
> >>  
> >> +static inline int is_movable(struct zone *zone)
> >> +{
> >> +  return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> >> +}
> > 
> > A better name would be zone_is_movable().  We haven't been very
> > consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> > 
> Yes, zone_is_xxx() should be a better name, but there are some analogous
> definition like is_dma32() and is_normal() etc, if we only use 
> zone_is_movable()
> for movable zone it will break such naming rules.
> Should we update other ones in a separate patch later or just keep the old 
> style?

I do think the old names were poorly chosen.  Yes, we could fix them up
sometime but it's hardly a pressing issue.

> > And a neater implementation would be
> > 
> > return zone_idx(zone) == ZONE_MOVABLE;
> > 
> OK. After your change, should we also update for other ones such as 
> is_normal()..?

Sure.

From: Andrew Morton 
Subject: include-linux-mmzoneh-cleanups-fix

use zone_idx() some more, further simplify is_highmem()

Cc: Lin Feng 
Cc: Mel Gorman 
Signed-off-by: Andrew Morton 
---

 include/linux/mmzone.h |   15 ---
 1 file changed, 4 insertions(+), 11 deletions(-)

diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix 
include/linux/mmzone.h
--- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix
+++ a/include/linux/mmzone.h
@@ -859,25 +859,18 @@ static inline int is_normal_idx(enum zon
  */
 static inline int is_highmem(struct zone *zone)
 {
-#ifdef CONFIG_HIGHMEM
-   enum zone_type idx = zone_idx(zone);
-
-   return idx == ZONE_HIGHMEM ||
-  (idx == ZONE_MOVABLE && zone_movable_is_highmem());
-#else
-   return 0;
-#endif
+   return is_highmem_idx(zone_idx(zone));
 }
 
 static inline int is_normal(struct zone *zone)
 {
-   return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
+   return zone_idx(zone) == ZONE_NORMAL;
 }
 
 static inline int is_dma32(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA32
-   return zone == zone->zone_pgdat->node_zones + ZONE_DMA32;
+   return zone_idx(zone) == ZONE_DMA32;
 #else
return 0;
 #endif
@@ -886,7 +879,7 @@ static inline int is_dma32(struct zone *
 static inline int is_dma(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA
-   return zone == zone->zone_pgdat->node_zones + ZONE_DMA;
+   return zone_idx(zone) == ZONE_DMA;
 #else
return 0;
 #endif
_

> >> +  /* All pages are non movable, we are done :) */
> >> +  if (i == ret && list_empty(&pagelist))
> >> +  return ret;
> >> +
> >> +put_page:
> >> +  /* Undo the effects of former get_user_pages(), we won't pin anything */
> >> +  for (i = 0; i < ret; i++)
> >> +  put_page(pages[i]);
> >> +
> >> +  if (migrate_pre_flag && !isolate_err) {
> >> +  ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
> >> +  false, MIGRATE_SYNC, MR_SYSCALL);
> >> +  /* Steal pages from non-movable zone successfully? */
> >> +  if (!ret)
> >> +  goto retry;
> > 
> > This is buggy.  migrate_pages() doesn't empty its `from' argument, so
> > page_list must be reinitialised here (or, better, at the start of the loop).
> migrate_pages()
>   list_for_each_entry_safe()
>  unmap_and_move()
>if (rc != -EAGAIN) {
>  list_del(&page->lru);
>}

ah, OK, there it is.




Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-05 Thread Mel Gorman
On Tue, Feb 05, 2013 at 11:57:22AM +, Mel Gorman wrote:
> 
> > > + migrate_pre_flag = 1;
> > > + }
> > > +
> > > + if (!isolate_lru_page(pages[i])) {
> > > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
> > > +  page_is_file_cache(pages[i]));
> > > + list_add_tail(&pages[i]->lru, &pagelist);
> > > + } else {
> > > + isolate_err = 1;
> > > + goto put_page;
> > > + }
> 
> isolate_lru_page() takes the LRU lock every time.

Credit to Michal Hocko for bringing this up but with the number of
other issues I missed that this is also broken with respect to huge page
handling. hugetlbfs pages will not be on the LRU so the isolation will mess
up and the migration has to be handled differently.  Ordinarily hugetlbfs
pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
encounters a hugetlbfs page, it'll just blow up.

The other is that this almost certainly broken for transhuge page
handling. gup returns the head and tail pages and ordinarily this is ok
because the caller only cares about the physical address. Migration will
also split a hugepage if it receives it but you are potentially adding
tail pages to a list here and then migrating them. The split of the first
page will get very confused. I'm not exactly sure what the result will be
but it won't be pretty.

Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
during testing?

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-05 Thread Mel Gorman
On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote:
> On Mon, 4 Feb 2013 18:04:07 +0800
> Lin Feng  wrote:
> 
> > get_user_pages() always tries to allocate pages from the movable zone, which
> > is not reliable for the memory hotremove framework in some cases.
> > 
> > This patch introduces a new library function called
> > get_user_pages_non_movable() to pin pages only from the non-movable zones
> > in memory. It's a wrapper of get_user_pages() but it makes sure that all
> > pages come from a non-movable zone via additional page migration.
> > 
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct 
> > mm_struct *mm,
> > struct page **pages, struct vm_area_struct **vmas);
> >  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> > struct page **pages);
> > +#ifdef CONFIG_MEMORY_HOTREMOVE
> > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct 
> > *mm,
> > +   unsigned long start, int nr_pages, int write, int force,
> > +   struct page **pages, struct vm_area_struct **vmas);
> > +#endif
> 
> The ifdefs aren't really needed here and I encourage people to omit
> them.  This keeps the header files looking neater and reduces the
> chances of things later breaking because we forgot to update some
> CONFIG_foo logic in a header file.  The downside is that errors will be
> revealed at link time rather than at compile time, but that's a pretty
> small cost.
> 

As an aside, if ifdefs *have* to be used then it often better to have a
pattern like

#ifdef CONFIG_MEMORY_HOTREMOVE
int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int nr_pages, int write, int force,
struct page **pages, struct vm_area_struct **vmas);
#else
static inline get_user_pages_non_movable(...)
{
get_user_pages(...)
}
#endif

It eliminates the need for #ifdefs in the C file that calls
get_user_pages_non_movable().

> >  struct kvec;
> >  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
> > struct page **pages);
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 73b64a3..5db811e 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
> > return (idx == ZONE_NORMAL);
> >  }
> >  
> > +static inline int is_movable(struct zone *zone)
> > +{
> > +   return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> > +}
> 
> A better name would be zone_is_movable(). 

He is matching the names of the is_normal(), is_dma32() and is_dma() helpers.
Unfortunately I would expect a name like is_movable() to return true if the
page had either been allocated with __GFP_MOVABLE or if it was checking
migrate can currently handle the page. is_movable() is indeed a terrible
name. They should be renamed or deleted in preparation -- see below.

> We haven't been very
> consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> 
> And a neater implementation would be
> 
>   return zone_idx(zone) == ZONE_MOVABLE;
> 

Yes.

> All of which made me look at mmzone.h, and what I saw wasn't very nice :(
> 
> Please review...
> 
> 
> From: Andrew Morton 
> Subject: include/linux/mmzone.h: cleanups
> 
> - implement zone_idx() in C to fix its references-args-twice macro bug
> 
> - use zone_idx() in is_highmem() to remove large amounts of silly fluff.
> 
> Cc: Lin Feng 
> Cc: Mel Gorman 
> Signed-off-by: Andrew Morton 
> ---
> 
>  include/linux/mmzone.h |   13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups 
> include/linux/mmzone.h
> --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
> +++ a/include/linux/mmzone.h
> @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
>  /*
>   * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, 
> etc.
>   */
> -#define zone_idx(zone)   ((zone) - 
> (zone)->zone_pgdat->node_zones)
> +static inline enum zone_type zone_idx(struct zone *zone)
> +{
> + return zone - zone->zone_pgdat->node_zones;
> +}
>  
>  static inline int populated_zone(struct zone *zone)
>  {
> @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
>  static inline int is_highmem(struct zone *zone)
>  {
>  #ifdef CONFIG_HIGHMEM
> - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> - return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
> -(zone_off == ZONE_MOVABLE * sizeof(*zone) &&
> - zone_movable_is_highmem());
> + enum zone_type idx = zone_idx(zone);
> +
> + return idx == ZONE_HIGHMEM ||
> +(idx == ZONE_MOVABLE && zone_movable_is_highmem());
>  #else
>   return 0;
>  #endif

*blinks*. Ok, 

Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-05 Thread Mel Gorman
On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote:
 On Mon, 4 Feb 2013 18:04:07 +0800
 Lin Feng linf...@cn.fujitsu.com wrote:
 
  get_user_pages() always tries to allocate pages from movable zone, which is 
  not
   reliable to memory hotremove framework in some case.
  
  This patch introduces a new library function called 
  get_user_pages_non_movable()
   to pin pages only from zone non-movable in memory.
  It's a wrapper of get_user_pages() but it makes sure that all pages come 
  from
  non-movable zone via additional page migration.
  
  ...
 
  --- a/include/linux/mm.h
  +++ b/include/linux/mm.h
  @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct 
  mm_struct *mm,
  struct page **pages, struct vm_area_struct **vmas);
   int get_user_pages_fast(unsigned long start, int nr_pages, int write,
  struct page **pages);
  +#ifdef CONFIG_MEMORY_HOTREMOVE
  +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct 
  *mm,
  +   unsigned long start, int nr_pages, int write, int force,
  +   struct page **pages, struct vm_area_struct **vmas);
  +#endif
 
 The ifdefs aren't really needed here and I encourage people to omit
 them.  This keeps the header files looking neater and reduces the
 chances of things later breaking because we forgot to update some
 CONFIG_foo logic in a header file.  The downside is that errors will be
 revealed at link time rather than at compile time, but that's a pretty
 small cost.
 

As an aside, if ifdefs *have* to be used then it often better to have a
pattern like

#ifdef CONFIG_MEMORY_HOTREMOVE
int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int nr_pages, int write, int force,
struct page **pages, struct vm_area_struct **vmas);
#else
static inline get_user_pages_non_movable(...)
{
get_user_pages(...)
}
#endif

It eliminates the need for #ifdefs in the C file that calls
get_user_pages_non_movable().

   struct kvec;
   int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
  struct page **pages);
  diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
  index 73b64a3..5db811e 100644
  --- a/include/linux/mmzone.h
  +++ b/include/linux/mmzone.h
  @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
  return (idx == ZONE_NORMAL);
   }
   
  +static inline int is_movable(struct zone *zone)
  +{
  +   return zone == zone-zone_pgdat-node_zones + ZONE_MOVABLE;
  +}
 
 A better name would be zone_is_movable(). 

He is matching the names of the is_normal(), is_dma32() and is_dma() helpers.
Unfortunately I would expect a name like is_movable() to return true if the
page had either been allocated with __GFP_MOVABLE or if it was checking
whether migration can currently handle the page. is_movable() is indeed a
terrible name. These helpers should be renamed or deleted in preparation -- see below.

 We haven't been very
 consistent about this in mmzone.h, but zone_is_foo() is pretty common.
 
 And a neater implementation would be
 
   return zone_idx(zone) == ZONE_MOVABLE;
 

Yes.

 All of which made me look at mmzone.h, and what I saw wasn't very nice :(
 
 Please review...
 
 
 From: Andrew Morton a...@linux-foundation.org
 Subject: include/linux/mmzone.h: cleanups
 
 - implement zone_idx() in C to fix its references-args-twice macro bug
 
 - use zone_idx() in is_highmem() to remove large amounts of silly fluff.
 
 Cc: Lin Feng linf...@cn.fujitsu.com
 Cc: Mel Gorman m...@csn.ul.ie
 Signed-off-by: Andrew Morton a...@linux-foundation.org
 ---
 
  include/linux/mmzone.h |   13 ++++++++-----
  1 file changed, 8 insertions(+), 5 deletions(-)
 
 diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups 
 include/linux/mmzone.h
 --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
 +++ a/include/linux/mmzone.h
 @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
  /*
   * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, 
 etc.
   */
 -#define zone_idx(zone)   ((zone) - 
 (zone)->zone_pgdat->node_zones)
 +static inline enum zone_type zone_idx(struct zone *zone)
 +{
 + return zone - zone->zone_pgdat->node_zones;
 +}
  
  static inline int populated_zone(struct zone *zone)
  {
 @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
  static inline int is_highmem(struct zone *zone)
  {
  #ifdef CONFIG_HIGHMEM
 - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
 - return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
 -(zone_off == ZONE_MOVABLE * sizeof(*zone) &&
 - zone_movable_is_highmem());
 + enum zone_type idx = zone_idx(zone);
 +
 + return idx == ZONE_HIGHMEM ||
 +(idx == ZONE_MOVABLE && zone_movable_is_highmem());
  #else
   return 0;
  #endif

*blinks*. Ok, while I think your version looks nicer, it is reverting

Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-05 Thread Mel Gorman
On Tue, Feb 05, 2013 at 11:57:22AM +0000, Mel Gorman wrote:
 
   + migrate_pre_flag = 1;
   + }
   +
   + if (!isolate_lru_page(pages[i])) {
   + inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
   +  page_is_file_cache(pages[i]));
   + list_add_tail(&pages[i]->lru, &pagelist);
   + } else {
   + isolate_err = 1;
   + goto put_page;
   + }
 
 isolate_lru_page() takes the LRU lock every time.

Credit to Michal Hocko for bringing this up but with the number of
other issues I missed that this is also broken with respect to huge page
handling. hugetlbfs pages will not be on the LRU so the isolation will mess
up and the migration has to be handled differently.  Ordinarily hugetlbfs
pages cannot be allocated from ZONE_MOVABLE but it is possible to configure
it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this
encounters a hugetlbfs page, it'll just blow up.

The other is that this almost certainly broken for transhuge page
handling. gup returns the head and tail pages and ordinarily this is ok
because the caller only cares about the physical address. Migration will
also split a hugepage if it receives it but you are potentially adding
tail pages to a list here and then migrating them. The split of the first
page will get very confused. I'm not exactly sure what the result will be
but it won't be pretty.

Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
during testing?

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-05 Thread Andrew Morton
On Tue, 05 Feb 2013 11:09:27 +0800
Lin Feng linf...@cn.fujitsu.com wrote:

  
   struct kvec;
   int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
 struct page **pages);
  diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
  index 73b64a3..5db811e 100644
  --- a/include/linux/mmzone.h
  +++ b/include/linux/mmzone.h
  @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
 return (idx == ZONE_NORMAL);
   }
   
  +static inline int is_movable(struct zone *zone)
  +{
  +  return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
  +}
  
  A better name would be zone_is_movable().  We haven't been very
  consistent about this in mmzone.h, but zone_is_foo() is pretty common.
  
 Yes, zone_is_xxx() should be a better name, but there are some analogous
 definition like is_dma32() and is_normal() etc, if we only use 
 zone_is_movable()
 for movable zone it will break such naming rules.
 Should we update other ones in a separate patch later or just keep the old 
 style?

I do think the old names were poorly chosen.  Yes, we could fix them up
sometime but it's hardly a pressing issue.

  And a neater implementation would be
  
  return zone_idx(zone) == ZONE_MOVABLE;
  
 OK. After your change, should we also update for other ones such as 
 is_normal()..?

Sure.

From: Andrew Morton a...@linux-foundation.org
Subject: include-linux-mmzoneh-cleanups-fix

use zone_idx() some more, further simplify is_highmem()

Cc: Lin Feng linf...@cn.fujitsu.com
Cc: Mel Gorman m...@csn.ul.ie
Signed-off-by: Andrew Morton a...@linux-foundation.org
---

 include/linux/mmzone.h |   15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix 
include/linux/mmzone.h
--- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix
+++ a/include/linux/mmzone.h
@@ -859,25 +859,18 @@ static inline int is_normal_idx(enum zon
  */
 static inline int is_highmem(struct zone *zone)
 {
-#ifdef CONFIG_HIGHMEM
-   enum zone_type idx = zone_idx(zone);
-
-   return idx == ZONE_HIGHMEM ||
-  (idx == ZONE_MOVABLE && zone_movable_is_highmem());
-#else
-   return 0;
-#endif
+   return is_highmem_idx(zone_idx(zone));
 }
 
 static inline int is_normal(struct zone *zone)
 {
-   return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
+   return zone_idx(zone) == ZONE_NORMAL;
 }
 
 static inline int is_dma32(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA32
-   return zone == zone->zone_pgdat->node_zones + ZONE_DMA32;
+   return zone_idx(zone) == ZONE_DMA32;
 #else
return 0;
 #endif
@@ -886,7 +879,7 @@ static inline int is_dma32(struct zone *
 static inline int is_dma(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA
-   return zone == zone->zone_pgdat->node_zones + ZONE_DMA;
+   return zone_idx(zone) == ZONE_DMA;
 #else
return 0;
 #endif
_

  +  /* All pages are non movable, we are done :) */
  +  if (i == ret && list_empty(&pagelist))
  +  return ret;
  +
  +put_page:
  +  /* Undo the effects of former get_user_pages(), we won't pin anything */
  +  for (i = 0; i < ret; i++)
  +  put_page(pages[i]);
  +
  +  if (migrate_pre_flag && !isolate_err) {
  +  ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
  +  false, MIGRATE_SYNC, MR_SYSCALL);
  +  /* Steal pages from non-movable zone successfully? */
  +  if (!ret)
  +  goto retry;
  
  This is buggy.  migrate_pages() doesn't empty its `from' argument, so
  page_list must be reinitialised here (or, better, at the start of the loop).
 migrate_pages()
   list_for_each_entry_safe()
     unmap_and_move()
       if (rc != -EAGAIN) {
         list_del(&page->lru);
       }

ah, OK, there it is.




Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-05 Thread Michel Lespinasse
Just nitpicking, but:

On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman mgor...@suse.de wrote:
 +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
 +{
 +   /* This mess avoids a potentially expensive pointer subtraction. */
 +   int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
 +   return zone_off == idx * sizeof(*zone);
 +}

Maybe:
return zone == zone->zone_pgdat->node_zones + idx;
?

-- 
Michel Walken Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-04 Thread Lin Feng
Hi Andrew,

On 02/05/2013 08:06 AM, Andrew Morton wrote:
> 
> melreadthis
> 
> On Mon, 4 Feb 2013 18:04:07 +0800
> Lin Feng  wrote:
> 
>> get_user_pages() always tries to allocate pages from movable zone, which is 
>> not
>>  reliable to memory hotremove framework in some case.
>>
>> This patch introduces a new library function called 
>> get_user_pages_non_movable()
>>  to pin pages only from zone non-movable in memory.
>> It's a wrapper of get_user_pages() but it makes sure that all pages come from
>> non-movable zone via additional page migration.
>>
>> ...
>>
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct 
>> mm_struct *mm,
>>  struct page **pages, struct vm_area_struct **vmas);
>>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>  struct page **pages);
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct 
>> *mm,
>> +unsigned long start, int nr_pages, int write, int force,
>> +struct page **pages, struct vm_area_struct **vmas);
>> +#endif
> 
> The ifdefs aren't really needed here and I encourage people to omit
> them.  This keeps the header files looking neater and reduces the
> chances of things later breaking because we forgot to update some
> CONFIG_foo logic in a header file.  The downside is that errors will be
> revealed at link time rather than at compile time, but that's a pretty
> small cost.
OK, got it, thanks :)

> 
>>  struct kvec;
>>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
>>  struct page **pages);
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 73b64a3..5db811e 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
>>  return (idx == ZONE_NORMAL);
>>  }
>>  
>> +static inline int is_movable(struct zone *zone)
>> +{
>> +return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
>> +}
> 
> A better name would be zone_is_movable().  We haven't been very
> consistent about this in mmzone.h, but zone_is_foo() is pretty common.
> 
Yes, zone_is_xxx() should be a better name, but there are some analogous
definition like is_dma32() and is_normal() etc, if we only use zone_is_movable()
for movable zone it will break such naming rules.
Should we update other ones in a separate patch later or just keep the old 
style?
 
> And a neater implementation would be
> 
>   return zone_idx(zone) == ZONE_MOVABLE;
> 
OK. After your change, should we also update for other ones such as 
is_normal()..?

> All of which made me look at mmzone.h, and what I saw wasn't very nice :(
> 
> Please review...
> 
> 
> From: Andrew Morton 
> Subject: include/linux/mmzone.h: cleanups
> 
> - implement zone_idx() in C to fix its references-args-twice macro bug
> 
> - use zone_idx() in is_highmem() to remove large amounts of silly fluff.
> 
> Cc: Lin Feng 
> Cc: Mel Gorman 
> Signed-off-by: Andrew Morton 
> ---
> 
>  include/linux/mmzone.h |   13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups 
> include/linux/mmzone.h
> --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
> +++ a/include/linux/mmzone.h
> @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
>  /*
>   * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, 
> etc.
>   */
> -#define zone_idx(zone)   ((zone) - 
> (zone)->zone_pgdat->node_zones)
> +static inline enum zone_type zone_idx(struct zone *zone)
> +{
> + return zone - zone->zone_pgdat->node_zones;
> +}
>  
>  static inline int populated_zone(struct zone *zone)
>  {
> @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
>  static inline int is_highmem(struct zone *zone)
>  {
>  #ifdef CONFIG_HIGHMEM
> - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> - return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
> -(zone_off == ZONE_MOVABLE * sizeof(*zone) &&
> - zone_movable_is_highmem());
> + enum zone_type idx = zone_idx(zone);
> +
> + return idx == ZONE_HIGHMEM ||
> +(idx == ZONE_MOVABLE && zone_movable_is_highmem());
>  #else
>   return 0;
>  #endif
> _
> 
> 
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -58,6 +58,8 @@
> >  #include <linux/elf.h>
> >  #include <linux/gfp.h>
> >  #include <linux/migrate.h>
> > +#include <linux/page-isolation.h>
> > +#include <linux/mm_inline.h>
> >  #include <linux/string.h>
> >  
> >  #include <asm/io.h>
>> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct 
>> mm_struct *mm,
>>  }
>>  EXPORT_SYMBOL(get_user_pages);
>>  
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +/**
>> + * It's a wrapper of get_user_pages() but it makes sure that all pages come 
>> from
>> + * non-movable zone via additional page migration.
>> + */
> 
> This needs a description of 

Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-04 Thread Andrew Morton

Also...

On Mon, 4 Feb 2013 16:06:24 -0800
Andrew Morton  wrote:

> > +put_page:
> > +   /* Undo the effects of former get_user_pages(), we won't pin anything */
> > +   for (i = 0; i < ret; i++)
> > +   put_page(pages[i]);

We can use release_pages() here.

release_pages() is designed to be more efficient when we're putting the
final reference to (most of) the pages.  It probably has little if any
benefit when putting still-in-use pages, as we're doing here.

But please consider...


Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()

2013-02-04 Thread Andrew Morton

melreadthis

On Mon, 4 Feb 2013 18:04:07 +0800
Lin Feng  wrote:

> get_user_pages() always tries to allocate pages from movable zone, which is 
> not
>  reliable to memory hotremove framework in some case.
> 
> This patch introduces a new library function called 
> get_user_pages_non_movable()
>  to pin pages only from zone non-movable in memory.
> It's a wrapper of get_user_pages() but it makes sure that all pages come from
> non-movable zone via additional page migration.
> 
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct 
> mm_struct *mm,
>   struct page **pages, struct vm_area_struct **vmas);
>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>   struct page **pages);
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> + unsigned long start, int nr_pages, int write, int force,
> + struct page **pages, struct vm_area_struct **vmas);
> +#endif

The ifdefs aren't really needed here and I encourage people to omit
them.  This keeps the header files looking neater and reduces the
chances of things later breaking because we forgot to update some
CONFIG_foo logic in a header file.  The downside is that errors will be
revealed at link time rather than at compile time, but that's a pretty
small cost.

>  struct kvec;
>  int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
>   struct page **pages);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 73b64a3..5db811e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
>   return (idx == ZONE_NORMAL);
>  }
>  
> +static inline int is_movable(struct zone *zone)
> +{
> + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
> +}

A better name would be zone_is_movable().  We haven't been very
consistent about this in mmzone.h, but zone_is_foo() is pretty common.

And a neater implementation would be

return zone_idx(zone) == ZONE_MOVABLE;

All of which made me look at mmzone.h, and what I saw wasn't very nice :(

Please review...


From: Andrew Morton 
Subject: include/linux/mmzone.h: cleanups

- implement zone_idx() in C to fix its references-args-twice macro bug

- use zone_idx() in is_highmem() to remove large amounts of silly fluff.

Cc: Lin Feng 
Cc: Mel Gorman 
Signed-off-by: Andrew Morton 
---

 include/linux/mmzone.h |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups 
include/linux/mmzone.h
--- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
+++ a/include/linux/mmzone.h
@@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
 /*
  * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
  */
-#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones)
+static inline enum zone_type zone_idx(struct zone *zone)
+{
+   return zone - zone->zone_pgdat->node_zones;
+}
 
 static inline int populated_zone(struct zone *zone)
 {
@@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
 static inline int is_highmem(struct zone *zone)
 {
 #ifdef CONFIG_HIGHMEM
-   int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
-   return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
-  (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
-   zone_movable_is_highmem());
+   enum zone_type idx = zone_idx(zone);
+
+   return idx == ZONE_HIGHMEM ||
+  (idx == ZONE_MOVABLE && zone_movable_is_highmem());
 #else
return 0;
 #endif
_


> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -58,6 +58,8 @@
>  #include <linux/elf.h>
>  #include <linux/gfp.h>
>  #include <linux/migrate.h>
> +#include <linux/page-isolation.h>
> +#include <linux/mm_inline.h>
>  #include <linux/string.h>
>  
>  #include <asm/io.h>
> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct 
> mm_struct *mm,
>  }
>  EXPORT_SYMBOL(get_user_pages);
>  
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +/**
> + * It's a wrapper of get_user_pages() but it makes sure that all pages come 
> from
> + * non-movable zone via additional page migration.
> + */

This needs a description of why the function exists - say something
about the requirements of memory hotplug.

Also a few words describing how the function works would be good.

> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm,
> + unsigned long start, int nr_pages, int write, int force,
> + struct page **pages, struct vm_area_struct **vmas)
> +{
> + int ret, i, isolate_err, migrate_pre_flag;
> + LIST_HEAD(pagelist);
> +
> +retry:
> + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
> + vmas);

We should handle (ret < 0) here.  At present the function will silently
