Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On 02/20/2013 07:31 PM, Simon Jeons wrote:
> On 02/20/2013 06:23 PM, Lin Feng wrote:
>> Hi Simon,
>>
>> On 02/20/2013 05:58 PM, Simon Jeons wrote:
>>>> The other is that this almost certainly broken for transhuge page
>>>> handling. gup returns the head and tail pages and ordinarily this is ok
>>>
>>> When need gup thp? in kvm case?
>>
>> gup just pins the wanted pages (4kB each on x86) of the user address space
>> in memory. We can't expect the pages allocated for the user address space
>> to be THP or normal pages, so we need to deal with both, and I think it
>> has nothing to do with kvm.
>
> Ok, I'm curious: which functions (that call gup) does a userspace process
> use to pin pages, other than make_pages_present?

No, a userspace process doesn't pin any pages directly; it pins them
indirectly through syscalls such as io_setup(), for paths where the kernel
can't take a page fault and has to keep the pages alive. When the kernel
wants to communicate with userspace, for example to notify it of some
events, it needs a buffer that both kernel and user space can access, which
leads to the so-called pinning of pages by gup.

thanks,
linfeng
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On 02/20/2013 06:23 PM, Lin Feng wrote:
> Hi Simon,
>
> On 02/20/2013 05:58 PM, Simon Jeons wrote:
>>> The other is that this almost certainly broken for transhuge page
>>> handling. gup returns the head and tail pages and ordinarily this is ok
>>
>> When need gup thp? in kvm case?
>
> gup just pins the wanted pages (4kB each on x86) of the user address space
> in memory. We can't expect the pages allocated for the user address space
> to be THP or normal pages, so we need to deal with both, and I think it
> has nothing to do with kvm.
>
> thanks,
> linfeng

Ok, I'm curious: which functions (that call gup) does a userspace process
use to pin pages, other than make_pages_present?
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Hi Simon,

On 02/20/2013 05:58 PM, Simon Jeons wrote:
>> The other is that this almost certainly broken for transhuge page
>> handling. gup returns the head and tail pages and ordinarily this is ok
>
> When need gup thp? in kvm case?

gup just pins the wanted pages (4kB each on x86) of the user address space
in memory. We can't expect the pages allocated for the user address space to
be THP or normal pages, so we need to deal with both, and I think it has
nothing to do with kvm.

thanks,
linfeng
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On 02/05/2013 09:32 PM, Mel Gorman wrote:
> On Tue, Feb 05, 2013 at 11:57:22AM +, Mel Gorman wrote:
>>> +		migrate_pre_flag = 1;
>>> +	}
>>> +
>>> +	if (!isolate_lru_page(pages[i])) {
>>> +		inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>> +				    page_is_file_cache(pages[i]));
>>> +		list_add_tail(&pages[i]->lru, &pagelist);
>>> +	} else {
>>> +		isolate_err = 1;
>>> +		goto put_page;
>>> +	}
>>
>> isolate_lru_page() takes the LRU lock every time.
>
> Credit to Michal Hocko for bringing this up but with the number of other
> issues I missed that this is also broken with respect to huge page
> handling. hugetlbfs pages will not be on the LRU so the isolation will
> mess up and the migration has to be handled differently. Ordinarily
> hugetlbfs pages cannot be allocated from ZONE_MOVABLE but it is possible
> to configure it to be allowed via
> /proc/sys/vm/hugepages_treat_as_movable. If this encounters a hugetlbfs
> page, it'll just blow up.

As you said, hugetlbfs pages are not on the LRU list, so how can this code
encounter a hugetlbfs page and blow up?

> The other is that this almost certainly broken for transhuge page
> handling. gup returns the head and tail pages and ordinarily this is ok

When need gup thp? in kvm case?

> because the caller only cares about the physical address. Migration will
> also split a hugepage if it receives it but you are potentially adding
> tail pages to a list here and then migrating them. The split of the first
> page will get very confused. I'm not exactly sure what the result will be
> but it won't be pretty.
>
> Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
> during testing?
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Hi Wanpeng,

On 02/20/2013 10:44 AM, Wanpeng Li wrote:
>> Sorry, I misunderstood what "tail pages" means, stupid question, just
>> ignore it.
>>
>> flee...
>
> In a compound page, the first page is called the head page and the other
> sub-pages are called tail pages.
>
> Regards,
> Wanpeng Li

Yes, you are right, thanks for explaining. I thought it meant only the very
last one of the compound pages..

thanks,
linfeng
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On 02/19/2013 09:37 PM, Lin Feng wrote:
>>> The other is that this almost certainly broken for transhuge page
>>> handling. gup returns the head and tail pages and ordinarily this is ok
>>
>> I can't find code doing such things :(, could you please point me out?

Sorry, I misunderstood what "tail pages" means, stupid question, just
ignore it.

flee...

thanks,
linfeng
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Hi Mel,

On 02/05/2013 09:32 PM, Mel Gorman wrote:
> On Tue, Feb 05, 2013 at 11:57:22AM +, Mel Gorman wrote:
>>> +		migrate_pre_flag = 1;
>>> +	}
>>> +
>>> +	if (!isolate_lru_page(pages[i])) {
>>> +		inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>> +				    page_is_file_cache(pages[i]));
>>> +		list_add_tail(&pages[i]->lru, &pagelist);
>>> +	} else {
>>> +		isolate_err = 1;
>>> +		goto put_page;
>>> +	}
>>
>> isolate_lru_page() takes the LRU lock every time.
>
> Credit to Michal Hocko for bringing this up but with the number of other
> issues I missed that this is also broken with respect to huge page
> handling. hugetlbfs pages will not be on the LRU so the isolation will
> mess up and the migration has to be handled differently. Ordinarily
> hugetlbfs pages cannot be allocated from ZONE_MOVABLE but it is possible
> to configure it to be allowed via
> /proc/sys/vm/hugepages_treat_as_movable. If this encounters a hugetlbfs
> page, it'll just blow up.

I looked into the migrate_huge_page() code and found that if we want to
support non-movable migration of hugetlbfs pages, we have to invent another
alloc_huge_page_node_nonmovable() or some such allocation interface, whose
cost is large (it bloats the code and has a big impact on the current
alloc_huge_page_node()) but gains little; I think pinning a hugepage is a
corner case. So can we skip over hugepages without migrating them, but emit
some WARN_ON() info? Is that acceptable?

> The other is that this almost certainly broken for transhuge page
> handling. gup returns the head and tail pages and ordinarily this is ok

I can't find code doing such things :(, could you please point me out?

> because the caller only cares about the physical address. Migration will
> also split a hugepage if it receives it but you are potentially adding
> tail pages to a list here and then migrating them. The split of the first
> page will get very confused. I'm not exactly sure what the result will be
> but it won't be pretty.
>
> Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled
> during testing?

I checked my config file and both CONFIG options above are enabled. However
it was only tested with two services invoking io_setup(), and it works
fine..

thanks,
linfeng
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On Tue, Feb 19, 2013 at 05:55:30PM +0800, Lin Feng wrote:
> Hi Mel,
>
> On 02/18/2013 11:17 PM, Mel Gorman wrote:
>>>> result. It's a little clumsy but the memory hot-remove failure message
>>>> could list what applications have pinned the pages that cannot be
>>>> removed so the administrator has the option of force-killing the
>>>> application. It is possible to discover what application is pinning a
>>>> page from userspace but it would involve an expensive search with
>>>> /proc/kpagemap
>>>>
>>>>> +	if (migrate_pre_flag && !isolate_err) {
>>>>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>>>>> +					false, MIGRATE_SYNC, MR_SYSCALL);
>>>>
>>>> The conversion of alloc_migrate_target is a bit problematic. It strips
>>>> the __GFP_MOVABLE flag and the consequence of this is that it converts
>>>> those allocation requests to MIGRATE_UNMOVABLE. This potentially is a
>>>> large number of pages, particularly if the number of
>>>> get_user_pages_non_movable() increases for short-lived pins like
>>>> direct IO.
>>>
>>> Sorry, I don't quite understand here either. If we use the new
>>> migration allocation function as you said, the increasing number of
>>> get_user_pages_non_movable() calls will also lead to large numbers of
>>> MIGRATE_UNMOVABLE pages. What's the difference, do I miss something?
>>
>> The replacement function preserves the __GFP_MOVABLE flag. It cannot use
>> ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
>> other movable pages.
>
> Ah, got it: "But GFP_MOVABLE is not only a zone specifier but also an
> allocation policy.".

Update the comment and describe the exception then.

> Could I clear the __GFP_HIGHMEM flag in alloc_migrate_target, depending
> on the private parameter, so that we keep the MIGRATE_UNMOVABLE policy
> but also allocate pages from non-movable zones with little change?

It should work (double check gfp_zone) but then the allocation cannot use
the highmem zone. If you can be 100% certain that zone will not be
populated then it will work as expected but it's a hack and should be
commented clearly. You could do a BUILD_BUG_ON if CONFIG_HIGHMEM is set to
enforce it.

--
Mel Gorman
SUSE Labs
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Hi Mel,

On 02/18/2013 11:17 PM, Mel Gorman wrote:
>>> result. It's a little clumsy but the memory hot-remove failure message
>>> could list what applications have pinned the pages that cannot be
>>> removed so the administrator has the option of force-killing the
>>> application. It is possible to discover what application is pinning a
>>> page from userspace but it would involve an expensive search with
>>> /proc/kpagemap
>>>
>>>> +	if (migrate_pre_flag && !isolate_err) {
>>>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>>>> +					false, MIGRATE_SYNC, MR_SYSCALL);
>>>
>>> The conversion of alloc_migrate_target is a bit problematic. It strips
>>> the __GFP_MOVABLE flag and the consequence of this is that it converts
>>> those allocation requests to MIGRATE_UNMOVABLE. This potentially is a
>>> large number of pages, particularly if the number of
>>> get_user_pages_non_movable() increases for short-lived pins like direct
>>> IO.
>>
>> Sorry, I don't quite understand here either. If we use the new migration
>> allocation function as you said, the increasing number of
>> get_user_pages_non_movable() calls will also lead to large numbers of
>> MIGRATE_UNMOVABLE pages. What's the difference, do I miss something?
>
> The replacement function preserves the __GFP_MOVABLE flag. It cannot use
> ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
> other movable pages.

Ah, got it: "But GFP_MOVABLE is not only a zone specifier but also an
allocation policy.".

Could I clear the __GFP_HIGHMEM flag in alloc_migrate_target, depending on
the private parameter, so that we keep the MIGRATE_UNMOVABLE policy but
also allocate pages from non-movable zones with little change? Does that
approach work? Otherwise I have to follow your suggestion.

thanks,
linfeng
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On Mon, Feb 18, 2013 at 06:34:44PM +0800, Lin Feng wrote:
>>> +	if (!migrate_pre_flag) {
>>> +		if (migrate_prep())
>>> +			goto put_page;
>>
>> CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never
>> return failure. You could make this BUG_ON(migrate_prep()), remove this
>> goto and the migrate_pre_flag check below becomes redundant.
>
> Sorry, I don't quite catch on to this. Without migrate_pre_flag, the
> BUG_ON() check will call/check migrate_prep() every time we isolate a
> single page; do we have to do so?

I was referring to the migrate_pre_flag check further down and I'm not
suggesting it be removed from here as you do want to call migrate_prep()
only once. However, as it'll never return failure for any kernel
configuration that allows memory hot-remove, this goto can be removed and
the flow simplified.

>>> +		migrate_pre_flag = 1;
>>> +	}
>>> +
>>> +	if (!isolate_lru_page(pages[i])) {
>>> +		inc_zone_page_state(pages[i], NR_ISOLATED_ANON +
>>> +				    page_is_file_cache(pages[i]));
>>> +		list_add_tail(&pages[i]->lru, &pagelist);
>>> +	} else {
>>> +		isolate_err = 1;
>>> +		goto put_page;
>>> +	}
>>
>> isolate_lru_page() takes the LRU lock every time. If
>> get_user_pages_non_movable is used heavily then you may encounter lock
>> contention problems. Batching this lock would be a separate patch and
>> should not be necessary yet but you should at least comment on it as a
>> reminder.
>
> Farsighted, should improve it in the future.
>
>> I think that a goto could also have been avoided here if you used break.
>> The i == ret check below would be false and it would just fall through.
>> Overall the flow would be a bit easier to follow.
>
> Yes, I noticed this before. I thought using goto could save some micro
> ops (here it is only the i == ret check), trading a little readability
> while still being clear.
>
> Also there are some other places in the current kernel facing such a
> performance/readability trade-off, but it seems that the developers
> prefer readability; why? I think performance is vital to the kernel..

Memory hot-remove and page migration are not performance critical paths.
For page migration, the cost will likely be dominated by the page copy but
it's also possible the cost will be dominated by locking. The difference
between a break and a goto will not even be measurable.

>> result. It's a little clumsy but the memory hot-remove failure message
>> could list what applications have pinned the pages that cannot be
>> removed so the administrator has the option of force-killing the
>> application. It is possible to discover what application is pinning a
>> page from userspace but it would involve an expensive search with
>> /proc/kpagemap
>>
>>> +	if (migrate_pre_flag && !isolate_err) {
>>> +		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
>>> +				false, MIGRATE_SYNC, MR_SYSCALL);
>>
>> The conversion of alloc_migrate_target is a bit problematic. It strips
>> the __GFP_MOVABLE flag and the consequence of this is that it converts
>> those allocation requests to MIGRATE_UNMOVABLE. This potentially is a
>> large number of pages, particularly if the number of
>> get_user_pages_non_movable() increases for short-lived pins like direct
>> IO.
>
> Sorry, I don't quite understand here either. If we use the new migration
> allocation function as you said, the increasing number of
> get_user_pages_non_movable() calls will also lead to large numbers of
> MIGRATE_UNMOVABLE pages. What's the difference, do I miss something?

The replacement function preserves the __GFP_MOVABLE flag. It cannot use
ZONE_MOVABLE but otherwise the newly allocated page will be grouped with
other movable pages.

--
Mel Gorman
SUSE Labs
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Hi Mel, See below. On 02/05/2013 07:57 PM, Mel Gorman wrote: > On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote: >> The ifdefs aren't really needed here and I encourage people to omit >> them. This keeps the header files looking neater and reduces the >> chances of things later breaking because we forgot to update some >> CONFIG_foo logic in a header file. The downside is that errors will be >> revealed at link time rather than at compile time, but that's a pretty >> small cost. >> > > As an aside, if ifdefs *have* to be used then it often better to have a > pattern like > > #ifdef CONFIG_MEMORY_HOTREMOVE > int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > unsigned long start, int nr_pages, int write, int force, > struct page **pages, struct vm_area_struct **vmas); > #else > static inline get_user_pages_non_movable(...) > { > get_user_pages(...) > } > #endif > > It eliminates the need for #ifdefs in the C file that calls > get_user_pages_non_movable(). Thanks for pointing out :) >>> + >>> +retry: >>> + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, >>> + vmas); >> >> We should handle (ret < 0) here. At present the function will silently >> convert an error return into "return 0", which is bad. The function >> does appear to otherwise do the right thing if get_user_pages() failed, >> but only due to good luck. >> > > A BUG_ON check if we retry more than once wouldn't hurt either. It requires > a broken implementation of alloc_migrate_target() but someone might "fix" > it and miss this. Agree. > > A minor aside, it would be nice if we exited at this point if there was > no populated ZONE_MOVABLE in the system. There is a movable_zone global > variable already. If it was forced to be MAX_NR_ZONES before the call to > find_usable_zone_for_movable() in memory initialisation we should be able > to make a cheap check for it here. OK. 
>>> + if (!migrate_pre_flag) { >>> + if (migrate_prep()) >>> + goto put_page; > > CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never > return failure. You could make this BUG_ON(migrate_prep()), remove this > goto and the migrate_pre_flag check below becomes redundant. Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check will call/check migrate_prep() every time we isolate a single page, do we have to do so? I mean that for a group of pages it's sufficient to call migrate_prep() only once. There are some other migration-related codes using this scheme. So I get confused, do I miss something or is the performance cost little? :( > >>> + migrate_pre_flag = 1; >>> + } >>> + >>> + if (!isolate_lru_page(pages[i])) { >>> + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + >>> +page_is_file_cache(pages[i])); >>> + list_add_tail(&pages[i]->lru, &pagelist); >>> + } else { >>> + isolate_err = 1; >>> + goto put_page; >>> + } > > isolate_lru_page() takes the LRU lock every time. If > get_user_pages_non_movable is used heavily then you may encounter lock > contention problems. Batching this lock would be a separate patch and should > not be necessary yet but you should at least comment on it as a reminder. Farsighted, should improve it in the future. > > I think that a goto could also have been avoided here if you used break. The > i == ret check below would be false and it would just fall through. > Overall the flow would be a bit easier to follow. Yes, I noticed this before. I thought using goto could save some micro ops (here it is only the i == ret check) though it lowers readability while still being clear. Also there are some other places in the current kernel facing such a performance/readability trade-off, but it seems that the developers prefer readability, why? While I think performance is vital to the kernel.. > > Why list_add_tail()? I don't think it's wrong but it's unusual to see > list_add_tail() when list_add() is enough.
Sorry, not intentional, just stole it from other code without thinking more ;-) > >>> + } >>> + } >>> + >>> + /* All pages are non movable, we are done :) */ >>> + if (i == ret && list_empty(&pagelist)) >>> + return ret; >>> + >>> +put_page: >>> + /* Undo the effects of former get_user_pages(), we won't pin anything */ >>> + for (i = 0; i < ret; i++) >>> + put_page(pages[i]); >>> + > > release_pages. > > That comment is insufficient. There are non-obvious consequences to this > logic. We are dropping pins on all pages regardless of what zone they > are in. If the subsequent migration fails then we end up returning 0 > with no pages pinned. The user-visible effect is that io_setup() fails for non-obvious reasons. It will return EAGAIN to userspace which will be interpreted as "The specified
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On Mon, Feb 18, 2013 at 06:34:44PM +0800, Lin Feng wrote: + if (!migrate_pre_flag) { + if (migrate_prep()) + goto put_page; CONFIG_MEMORY_HOTREMOVE depends on CONFIG_MIGRATION so this will never return failure. You could make this BUG_ON(migrate_prep()), remove this goto and the migrate_pre_flag check below becomes redundant. Sorry, I don't quite catch on this. Without migrate_pre_flag, the BUG_ON() check will call/check migrate_prep() every time we isolate a single page, do we have to do so? I was referring to the migrate_pre_flag check further down and I'm not suggesting it be removed from here as you do want to call migrate_prep() only once. However, as it'll never return failure for any kernel configuration that allows memory hot-remove, this goto can be removed and the flow simplified. + migrate_pre_flag = 1; + } + + if (!isolate_lru_page(pages[i])) { + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + + page_is_file_cache(pages[i])); + list_add_tail(&pages[i]->lru, &pagelist); + } else { + isolate_err = 1; + goto put_page; + } isolate_lru_page() takes the LRU lock every time. If get_user_pages_non_movable is used heavily then you may encounter lock contention problems. Batching this lock would be a separate patch and should not be necessary yet but you should at least comment on it as a reminder. Farsighted, should improve it in the future. I think that a goto could also have been avoided here if you used break. The i == ret check below would be false and it would just fall through. Overall the flow would be a bit easier to follow. Yes, I noticed this before. I thought using goto could save some micro ops (here it is only the i == ret check) though it lowers readability while still being clear. Also there are some other places in the current kernel facing such a performance/readability trade-off, but it seems that the developers prefer readability, why? While I think performance is vital to the kernel.. Memory hot-remove and page migration are not performance-critical paths.
For page migration, the cost will likely be dominated by the page copy but it's also possible the cost will be dominated by locking. The difference between a break and goto will not even be measurable. SNIP result. It's a little clumsy but the memory hot-remove failure message could list what applications have pinned the pages that cannot be removed so the administrator has the option of force-killing the application. It is possible to discover what application is pinning a page from userspace but it would involve an expensive search with /proc/kpagemap + if (migrate_pre_flag && !isolate_err) { + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, + false, MIGRATE_SYNC, MR_SYSCALL); The conversion of alloc_migrate_target is a bit problematic. It strips the __GFP_MOVABLE flag and the consequence of this is that it converts those allocation requests to MIGRATE_UNMOVABLE. This potentially is a large number of pages, particularly if the number of get_user_pages_non_movable() increases for short-lived pins like direct IO. Sorry, I don't quite understand here either. If we use the following new migration allocation function as you said, the increasing number of get_user_pages_non_movable() will also lead to large numbers of MIGRATE_UNMOVABLE pages. What's the difference, do I miss something? The replacement function preserves the __GFP_MOVABLE flag. It cannot use ZONE_MOVABLE but otherwise the newly allocated page will be grouped with other movable pages. -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
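The __GFP_MOVABLE point above can be illustrated with plain bit operations; the flag values below are invented for the example, not the kernel's real GFP bits. Stripping the flag regroups the newly allocated page as unmovable, while the replacement keeps the flag so the page still groups with movable pages even though it cannot come from ZONE_MOVABLE itself:

```c
#include <assert.h>

/* Invented flag values for illustration only -- not the kernel's GFP bits. */
#define EX_GFP_HIGHUSER 0x1u
#define EX_GFP_MOVABLE  0x2u

/* Old conversion: mask off the movable bit, turning the allocation
 * request into an unmovable one. */
static unsigned int strip_movable(unsigned int gfp)
{
        return gfp & ~EX_GFP_MOVABLE;
}

/* Behaviour of the replacement allocation function as described above:
 * keep the movable bit so the page groups with other movable pages. */
static unsigned int keep_movable(unsigned int gfp)
{
        return gfp;
}
```

The only difference is whether the movable bit survives the conversion, which is exactly what Mel says distinguishes the two allocation helpers.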
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On Tue, Feb 05, 2013 at 06:26:51PM -0800, Michel Lespinasse wrote: > Just nitpicking, but: > > On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman wrote: > > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) > > +{ > > + /* This mess avoids a potentially expensive pointer subtraction. */ > > + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > > + return zone_off == idx * sizeof(*zone); > > +} > > Maybe: > return zone == zone->zone_pgdat->node_zones + idx; > ? > Not a nit at all. Yours is more readable but it generates more code. A single line function that uses the helper generates 0x3f bytes of code (mostly function entry/exit) with your version and 0x39 bytes with mine. The difference in efficiency is marginal as your version uses lea to multiply by a constant but it's still slightly heavier. The old code is fractionally better, your version is more readable so it's up to Andrew really. Right now I think he's gone with his own version with zone_idx() in the name of readability whatever sparse has to say about the matter. -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
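The two variants compared above can be checked against each other in a userspace toy; the structs are stripped-down stand-ins, and a single global pgdat replaces zone->zone_pgdat:

```c
#include <assert.h>
#include <stddef.h>

/* Stripped-down stand-ins for the kernel structures. */
struct zone { char pad[192]; };                  /* arbitrary size */
struct pglist_data { struct zone node_zones[4]; };

static struct pglist_data pgdat;

/* Michel's variant: plain pointer comparison -- more readable, but per
 * Mel's measurement it generates slightly more code. */
static int zone_is_idx_ptr(struct zone *zone, int idx)
{
        return zone == pgdat.node_zones + idx;
}

/* Mel's variant: byte-offset comparison, avoiding the divide that
 * pointer subtraction by sizeof(struct zone) would otherwise imply. */
static int zone_is_idx_off(struct zone *zone, int idx)
{
        long zone_off = (char *)zone - (char *)pgdat.node_zones;

        return zone_off == (long)(idx * sizeof(struct zone));
}
```

Both forms answer the same question — "is this zone at index idx of its node?" — so the choice really is codegen versus readability, as the mail concludes.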
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Just nitpicking, but: On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman wrote: > +static inline bool zone_is_idx(struct zone *zone, enum zone_type idx) > +{ > + /* This mess avoids a potentially expensive pointer subtraction. */ > + int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > + return zone_off == idx * sizeof(*zone); > +} Maybe: return zone == zone->zone_pgdat->node_zones + idx; ? -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On Tue, 05 Feb 2013 11:09:27 +0800 Lin Feng wrote: > > > >> struct kvec; > >> int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > >>struct page **pages); > >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > >> index 73b64a3..5db811e 100644 > >> --- a/include/linux/mmzone.h > >> +++ b/include/linux/mmzone.h > >> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > >>return (idx == ZONE_NORMAL); > >> } > >> > >> +static inline int is_movable(struct zone *zone) > >> +{ > >> + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > >> +} > > > > A better name would be zone_is_movable(). We haven't been very > > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > > > Yes, zone_is_xxx() should be a better name, but there are some analogous > definition like is_dma32() and is_normal() etc, if we only use > zone_is_movable() > for movable zone it will break such naming rules. > Should we update other ones in a separate patch later or just keep the old > style? I do think the old names were poorly chosen. Yes, we could fix them up sometime but it's hardly a pressing issue. > > And a neater implementation would be > > > > return zone_idx(zone) == ZONE_MOVABLE; > > > OK. After your change, should we also update for other ones such as > is_normal()..? Sure. 
From: Andrew Morton Subject: include-linux-mmzoneh-cleanups-fix use zone_idx() some more, further simplify is_highmem() Cc: Lin Feng Cc: Mel Gorman Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 15 --- 1 file changed, 4 insertions(+), 11 deletions(-) diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix include/linux/mmzone.h --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix +++ a/include/linux/mmzone.h @@ -859,25 +859,18 @@ static inline int is_normal_idx(enum zon */ static inline int is_highmem(struct zone *zone) { -#ifdef CONFIG_HIGHMEM - enum zone_type idx = zone_idx(zone); - - return idx == ZONE_HIGHMEM || - (idx == ZONE_MOVABLE && zone_movable_is_highmem()); -#else - return 0; -#endif + return is_highmem_idx(zone_idx(zone)); } static inline int is_normal(struct zone *zone) { - return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL; + return zone_idx(zone) == ZONE_NORMAL; } static inline int is_dma32(struct zone *zone) { #ifdef CONFIG_ZONE_DMA32 - return zone == zone->zone_pgdat->node_zones + ZONE_DMA32; + return zone_idx(zone) == ZONE_DMA32; #else return 0; #endif @@ -886,7 +879,7 @@ static inline int is_dma32(struct zone * static inline int is_dma(struct zone *zone) { #ifdef CONFIG_ZONE_DMA - return zone == zone->zone_pgdat->node_zones + ZONE_DMA; + return zone_idx(zone) == ZONE_DMA; #else return 0; #endif _ > >> + /* All pages are non movable, we are done :) */ > >> + if (i == ret && list_empty(&pagelist)) > >> + return ret; > >> + > >> +put_page: > >> + /* Undo the effects of former get_user_pages(), we won't pin anything */ > >> + for (i = 0; i < ret; i++) > >> + put_page(pages[i]); > >> + > >> + if (migrate_pre_flag && !isolate_err) { > >> + ret = migrate_pages(&pagelist, alloc_migrate_target, 1, > >> + false, MIGRATE_SYNC, MR_SYSCALL); > >> + /* Steal pages from non-movable zone successfully? */ > >> + if (!ret) > >> + goto retry; > > > > This is buggy.
migrate_pages() doesn't empty its `from' argument, so > > page_list must be reinitialised here (or, better, at the start of the loop). > migrate_pages() > list_for_each_entry_safe() > unmap_and_move() > if (rc != -EAGAIN) { > list_del(&page->lru); > } ah, OK, there it is. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
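The list_for_each_entry_safe()/list_del() behaviour quoted above — unmap_and_move() deleting each successfully migrated entry from the caller's list, so the `from' list ends up empty — can be modelled with a minimal list. This is a sketch in the style of the kernel's list_head, not the real list.h:

```c
#include <assert.h>

/* Minimal circular doubly linked list in the style of the kernel's
 * struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
        n->prev = h->prev;
        n->next = h;
        h->prev->next = n;
        h->prev = n;
}

static void list_del(struct list_head *n)
{
        n->prev->next = n->next;
        n->next->prev = n->prev;
}

static int list_empty(const struct list_head *h) { return h->next == h; }

/* Stand-in for migrate_pages(): every entry "migrates" successfully
 * and, as unmap_and_move() does when rc != -EAGAIN, is deleted from
 * the from-list -- which is why the list is empty afterwards. */
static int fake_migrate_pages(struct list_head *from)
{
        while (!list_empty(from))
                list_del(from->next);
        return list_empty(from);        /* 1: nothing left to retry */
}

/* Build a two-entry list, "migrate" it, and report whether the
 * from-list really ended up empty. */
static int demo(void)
{
        struct list_head head, a, b;

        list_init(&head);
        list_add_tail(&a, &head);
        list_add_tail(&b, &head);
        return fake_migrate_pages(&head);
}
```

This is what Andrew's "ah, OK, there it is" acknowledges: on full success the list needs no reinitialisation, because migration itself drained it entry by entry.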
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On Tue, Feb 05, 2013 at 11:57:22AM +, Mel Gorman wrote: > > > > + migrate_pre_flag = 1; > > > + } > > > + > > > + if (!isolate_lru_page(pages[i])) { > > > + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + > > > + page_is_file_cache(pages[i])); > > > + list_add_tail(&pages[i]->lru, &pagelist); > > > + } else { > > > + isolate_err = 1; > > > + goto put_page; > > > + } > > isolate_lru_page() takes the LRU lock every time. Credit to Michal Hocko for bringing this up but with the number of other issues I missed that this is also broken with respect to huge page handling. hugetlbfs pages will not be on the LRU so the isolation will mess up and the migration has to be handled differently. Ordinarily hugetlbfs pages cannot be allocated from ZONE_MOVABLE but it is possible to configure it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this encounters a hugetlbfs page, it'll just blow up. The other is that this almost certainly broken for transhuge page handling. gup returns the head and tail pages and ordinarily this is ok because the caller only cares about the physical address. Migration will also split a hugepage if it receives it but you are potentially adding tail pages to a list here and then migrating them. The split of the first page will get very confused. I'm not exactly sure what the result will be but it won't be pretty. Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled during testing? -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote: > On Mon, 4 Feb 2013 18:04:07 +0800 > Lin Feng wrote: > > > get_user_pages() always tries to allocate pages from movable zone, which is > > not > > reliable to memory hotremove framework in some cases. > > > > This patch introduces a new library function called > > get_user_pages_non_movable() > > to pin pages only from zone non-movable in memory. > > It's a wrapper of get_user_pages() but it makes sure that all pages come > > from > > non-movable zone via additional page migration. > > > > ... > > > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct > > mm_struct *mm, > > struct page **pages, struct vm_area_struct **vmas); > > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > > struct page **pages); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct > > *mm, > > + unsigned long start, int nr_pages, int write, int force, > > + struct page **pages, struct vm_area_struct **vmas); > > +#endif > > The ifdefs aren't really needed here and I encourage people to omit > them. This keeps the header files looking neater and reduces the > chances of things later breaking because we forgot to update some > CONFIG_foo logic in a header file. The downside is that errors will be > revealed at link time rather than at compile time, but that's a pretty > small cost. > As an aside, if ifdefs *have* to be used then it is often better to have a pattern like #ifdef CONFIG_MEMORY_HOTREMOVE int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int nr_pages, int write, int force, struct page **pages, struct vm_area_struct **vmas); #else static inline get_user_pages_non_movable(...) { get_user_pages(...) } #endif It eliminates the need for #ifdefs in the C file that calls get_user_pages_non_movable().
> > struct kvec; > > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > > struct page **pages); > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 73b64a3..5db811e 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > > return (idx == ZONE_NORMAL); > > } > > > > +static inline int is_movable(struct zone *zone) > > +{ > > + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > > +} > > A better name would be zone_is_movable(). He is matching the names of the is_normal(), is_dma32() and is_dma() helpers. Unfortunately I would expect a name like is_movable() to return true if the page had either been allocated with __GFP_MOVABLE or if it was checking migrate can currently handle the page. is_movable() is indeed a terrible name. They should be renamed or deleted in preparation -- see below. > We haven't been very > consistent about this in mmzone.h, but zone_is_foo() is pretty common. > > And a neater implementation would be > > return zone_idx(zone) == ZONE_MOVABLE; > Yes. > All of which made me look at mmzone.h, and what I saw wasn't very nice :( > > Please review... > > > From: Andrew Morton > Subject: include/linux/mmzone.h: cleanups > > - implement zone_idx() in C to fix its references-args-twice macro bug > > - use zone_idx() in is_highmem() to remove large amounts of silly fluff. > > Cc: Lin Feng > Cc: Mel Gorman > Signed-off-by: Andrew Morton > --- > > include/linux/mmzone.h | 13 - > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups > include/linux/mmzone.h > --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups > +++ a/include/linux/mmzone.h > @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by > /* > * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, > etc. 
> */ > -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) > +static inline enum zone_type zone_idx(struct zone *zone) > +{ > + return zone - zone->zone_pgdat->node_zones; > +} > > static inline int populated_zone(struct zone *zone) > { > @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon > static inline int is_highmem(struct zone *zone) > { > #ifdef CONFIG_HIGHMEM > - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; > - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || > -(zone_off == ZONE_MOVABLE * sizeof(*zone) && > - zone_movable_is_highmem()); > + enum zone_type idx = zone_idx(zone); > + > + return idx == ZONE_HIGHMEM || > +(idx == ZONE_MOVABLE && zone_movable_is_highmem()); > #else > return 0; > #endif *blinks*. Ok, while I think your version looks nicer, it is reverting
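The "references-args-twice macro bug" named in Andrew's patch description is the classic double-evaluation problem: because the zone_idx() macro mentions its argument twice, an argument with side effects is evaluated twice, while the inline replacement evaluates it exactly once. A userspace toy with hypothetical names mirroring the zone_idx() shape:

```c
#include <assert.h>

/* Toy structures mirroring the zone_idx() shape; names are invented. */
struct pg;
struct tzone { struct pg *pgdat; };
struct pg { struct tzone zones[4]; };

static struct pg pd;
static int fetches;                     /* counts argument evaluations */

static void setup(void)
{
        int i;

        for (i = 0; i < 4; i++)
                pd.zones[i].pgdat = &pd;
}

static struct tzone *fetch_zone(void)   /* argument with a side effect */
{
        fetches++;
        return &pd.zones[2];
}

/* Old style: the macro mentions its argument twice. */
#define zone_idx_macro(z) ((z) - (z)->pgdat->zones)

/* New style: the inline function evaluates its argument once. */
static inline long zone_idx_inline(struct tzone *z)
{
        return z - z->pgdat->zones;
}

/* Each helper evaluates one form with a side-effecting argument and
 * returns how many times the argument actually ran. */
static int macro_evals(void)
{
        setup();
        fetches = 0;
        (void)zone_idx_macro(fetch_zone());     /* index is still 2... */
        return fetches;                         /* ...but the arg ran twice */
}

static int inline_evals(void)
{
        setup();
        fetches = 0;
        (void)zone_idx_inline(fetch_zone());
        return fetches;
}
```

Both forms compute the same index; the inline version simply cannot multiply side effects, which is why the cleanup converts the macro to C.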
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On Mon, Feb 04, 2013 at 04:06:24PM -0800, Andrew Morton wrote: On Mon, 4 Feb 2013 18:04:07 +0800 Lin Feng linf...@cn.fujitsu.com wrote: get_user_pages() always tries to allocate pages from movable zone, which is not reliable to memory hotremove framework in some case. This patch introduces a new library function called get_user_pages_non_movable() to pin pages only from zone non-movable in memory. It's a wrapper of get_user_pages() but it makes sure that all pages come from non-movable zone via additional page migration. ... --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, struct page **pages, struct vm_area_struct **vmas); int get_user_pages_fast(unsigned long start, int nr_pages, int write, struct page **pages); +#ifdef CONFIG_MEMORY_HOTREMOVE +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int nr_pages, int write, int force, + struct page **pages, struct vm_area_struct **vmas); +#endif The ifdefs aren't really needed here and I encourage people to omit them. This keeps the header files looking neater and reduces the chances of things later breaking because we forgot to update some CONFIG_foo logic in a header file. The downside is that errors will be revealed at link time rather than at compile time, but that's a pretty small cost. As an aside, if ifdefs *have* to be used then it often better to have a pattern like #ifdef CONFIG_MEMORY_HOTREMOVE int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int nr_pages, int write, int force, struct page **pages, struct vm_area_struct **vmas); #else static inline get_user_pages_non_movable(...) { get_user_pages(...) } #endif It eliminates the need for #ifdefs in the C file that calls get_user_pages_non_movable(). 
struct kvec; int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, struct page **pages); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 73b64a3..5db811e 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) return (idx == ZONE_NORMAL); } +static inline int is_movable(struct zone *zone) +{ + return zone == zone-zone_pgdat-node_zones + ZONE_MOVABLE; +} A better name would be zone_is_movable(). He is matching the names of the is_normal(), is_dma32() and is_dma() helpers. Unfortunately I would expect a name like is_movable() to return true if the page had either been allocated with __GFP_MOVABLE or if it was checking migrate can currently handle the page. is_movable() is indeed a terrible name. They should be renamed or deleted in preparation -- see below. We haven't been very consistent about this in mmzone.h, but zone_is_foo() is pretty common. And a neater implementation would be return zone_idx(zone) == ZONE_MOVABLE; Yes. All of which made me look at mmzone.h, and what I saw wasn't very nice :( Please review... From: Andrew Morton a...@linux-foundation.org Subject: include/linux/mmzone.h: cleanups - implement zone_idx() in C to fix its references-args-twice macro bug - use zone_idx() in is_highmem() to remove large amounts of silly fluff. Cc: Lin Feng linf...@cn.fujitsu.com Cc: Mel Gorman m...@csn.ul.ie Signed-off-by: Andrew Morton a...@linux-foundation.org --- include/linux/mmzone.h | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups +++ a/include/linux/mmzone.h @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by /* * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. 
*/ -#define zone_idx(zone) ((zone) - (zone)-zone_pgdat-node_zones) +static inline enum zone_type zone_idx(struct zone *zone) +{ + return zone - zone-zone_pgdat-node_zones; +} static inline int populated_zone(struct zone *zone) { @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon static inline int is_highmem(struct zone *zone) { #ifdef CONFIG_HIGHMEM - int zone_off = (char *)zone - (char *)zone-zone_pgdat-node_zones; - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || -(zone_off == ZONE_MOVABLE * sizeof(*zone) - zone_movable_is_highmem()); + enum zone_type idx = zone_idx(zone); + + return idx == ZONE_HIGHMEM || +(idx == ZONE_MOVABLE zone_movable_is_highmem()); #else return 0; #endif *blinks*. Ok, while I think your version looks nicer, it is reverting
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On Tue, Feb 05, 2013 at 11:57:22AM +, Mel Gorman wrote: + migrate_pre_flag = 1; + } + + if (!isolate_lru_page(pages[i])) { + inc_zone_page_state(pages[i], NR_ISOLATED_ANON + + page_is_file_cache(pages[i])); + list_add_tail(pages[i]-lru, pagelist); + } else { + isolate_err = 1; + goto put_page; + } isolate_lru_page() takes the LRU lock every time. Credit to Michal Hocko for bringing this up but with the number of other issues I missed that this is also broken with respect to huge page handling. hugetlbfs pages will not be on the LRU so the isolation will mess up and the migration has to be handled differently. Ordinarily hugetlbfs pages cannot be allocated from ZONE_MOVABLE but it is possible to configure it to be allowed via /proc/sys/vm/hugepages_treat_as_movable. If this encounters a hugetlbfs page, it'll just blow up. The other is that this almost certainly broken for transhuge page handling. gup returns the head and tail pages and ordinarily this is ok because the caller only cares about the physical address. Migration will also split a hugepage if it receives it but you are potentially adding tail pages to a list here and then migrating them. The split of the first page will get very confused. I'm not exactly sure what the result will be but it won't be pretty. Was THP enabled when this was tested? Was CONFIG_DEBUG_LIST enabled during testing? -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
On Tue, 05 Feb 2013 11:09:27 +0800 Lin Feng linf...@cn.fujitsu.com wrote: struct kvec; int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, struct page **pages); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 73b64a3..5db811e 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) return (idx == ZONE_NORMAL); } +static inline int is_movable(struct zone *zone) +{ + return zone == zone-zone_pgdat-node_zones + ZONE_MOVABLE; +} A better name would be zone_is_movable(). We haven't been very consistent about this in mmzone.h, but zone_is_foo() is pretty common. Yes, zone_is_xxx() should be a better name, but there are some analogous definition like is_dma32() and is_normal() etc, if we only use zone_is_movable() for movable zone it will break such naming rules. Should we update other ones in a separate patch later or just keep the old style? I do think the old names were poorly chosen. Yes, we could fix them up sometime but it's hardly a pressing issue. And a neater implementation would be return zone_idx(zone) == ZONE_MOVABLE; OK. After your change, should we also update for other ones such as is_normal()..? Sure. 
From: Andrew Morton a...@linux-foundation.org
Subject: include-linux-mmzoneh-cleanups-fix

use zone_idx() some more, further simplify is_highmem()

Cc: Lin Feng linf...@cn.fujitsu.com
Cc: Mel Gorman m...@csn.ul.ie
Signed-off-by: Andrew Morton a...@linux-foundation.org
---

 include/linux/mmzone.h | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix include/linux/mmzone.h
--- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups-fix
+++ a/include/linux/mmzone.h
@@ -859,25 +859,18 @@ static inline int is_normal_idx(enum zon
  */
 static inline int is_highmem(struct zone *zone)
 {
-#ifdef CONFIG_HIGHMEM
-	enum zone_type idx = zone_idx(zone);
-
-	return idx == ZONE_HIGHMEM ||
-	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
-#else
-	return 0;
-#endif
+	return is_highmem_idx(zone_idx(zone));
 }
 
 static inline int is_normal(struct zone *zone)
 {
-	return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
+	return zone_idx(zone) == ZONE_NORMAL;
 }
 
 static inline int is_dma32(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA32
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA32;
+	return zone_idx(zone) == ZONE_DMA32;
 #else
 	return 0;
 #endif
@@ -886,7 +879,7 @@ static inline int is_dma32(struct zone *
 static inline int is_dma(struct zone *zone)
 {
 #ifdef CONFIG_ZONE_DMA
-	return zone == zone->zone_pgdat->node_zones + ZONE_DMA;
+	return zone_idx(zone) == ZONE_DMA;
 #else
 	return 0;
 #endif
_

+	/* All pages are non movable, we are done :) */
+	if (i == ret && list_empty(&pagelist))
+		return ret;
+
+put_page:
+	/* Undo the effects of former get_user_pages(), we won't pin anything */
+	for (i = 0; i < ret; i++)
+		put_page(pages[i]);
+
+	if (migrate_pre_flag && !isolate_err) {
+		ret = migrate_pages(&pagelist, alloc_migrate_target, 1,
+					false, MIGRATE_SYNC, MR_SYSCALL);
+		/* Steal pages from non-movable zone successfully? */
+		if (!ret)
+			goto retry;

This is buggy. 
migrate_pages() doesn't empty its `from' argument, so pagelist must be reinitialised here (or, better, at the start of the loop).

migrate_pages()
	list_for_each_entry_safe()
		unmap_and_move()
			if (rc != -EAGAIN) {
				list_del(&page->lru);
			}

ah, OK, there it is.
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Just nitpicking, but:

On Tue, Feb 5, 2013 at 3:57 AM, Mel Gorman mgor...@suse.de wrote:

+static inline bool zone_is_idx(struct zone *zone, enum zone_type idx)
+{
+	/* This mess avoids a potentially expensive pointer subtraction. */
+	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
+	return zone_off == idx * sizeof(*zone);
+}

Maybe:

	return zone == zone->zone_pgdat->node_zones + idx;

?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Hi Andrew, On 02/05/2013 08:06 AM, Andrew Morton wrote: > > melreadthis > > On Mon, 4 Feb 2013 18:04:07 +0800 > Lin Feng wrote: > >> get_user_pages() always tries to allocate pages from movable zone, which is >> not >> reliable to memory hotremove framework in some case. >> >> This patch introduces a new library function called >> get_user_pages_non_movable() >> to pin pages only from zone non-movable in memory. >> It's a wrapper of get_user_pages() but it makes sure that all pages come from >> non-movable zone via additional page migration. >> >> ... >> >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct >> mm_struct *mm, >> struct page **pages, struct vm_area_struct **vmas); >> int get_user_pages_fast(unsigned long start, int nr_pages, int write, >> struct page **pages); >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct >> *mm, >> +unsigned long start, int nr_pages, int write, int force, >> +struct page **pages, struct vm_area_struct **vmas); >> +#endif > > The ifdefs aren't really needed here and I encourage people to omit > them. This keeps the header files looking neater and reduces the > chances of things later breaking because we forgot to update some > CONFIG_foo logic in a header file. The downside is that errors will be > revealed at link time rather than at compile time, but that's a pretty > small cost. 
OK, got it, thanks :)

>
>> struct kvec;
>> int get_kernel_pages(const struct kvec *iov, int nr_pages, int write,
>> 		struct page **pages);
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 73b64a3..5db811e 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx)
>>  	return (idx == ZONE_NORMAL);
>>  }
>>
>> +static inline int is_movable(struct zone *zone)
>> +{
>> +	return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE;
>> +}
>
> A better name would be zone_is_movable(). We haven't been very
> consistent about this in mmzone.h, but zone_is_foo() is pretty common.
>

Yes, zone_is_xxx() should be a better name, but there are some analogous definitions like is_dma32() and is_normal() etc; if we only use zone_is_movable() for the movable zone it will break such naming rules. Should we update the other ones in a separate patch later or just keep the old style?

> And a neater implementation would be
>
> 	return zone_idx(zone) == ZONE_MOVABLE;
>

OK. After your change, should we also update the other ones such as is_normal()..?

> All of which made me look at mmzone.h, and what I saw wasn't very nice :(
>
> Please review...
>
>
> From: Andrew Morton
> Subject: include/linux/mmzone.h: cleanups
>
> - implement zone_idx() in C to fix its references-args-twice macro bug
>
> - use zone_idx() in is_highmem() to remove large amounts of silly fluff.
>
> Cc: Lin Feng
> Cc: Mel Gorman
> Signed-off-by: Andrew Morton
> ---
>
>  include/linux/mmzone.h | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h
> --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups
> +++ a/include/linux/mmzone.h
> @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by
>  /*
>   * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
>   */
> -#define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
> +static inline enum zone_type zone_idx(struct zone *zone)
> +{
> +	return zone - zone->zone_pgdat->node_zones;
> +}
>
>  static inline int populated_zone(struct zone *zone)
>  {
> @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon
>  static inline int is_highmem(struct zone *zone)
>  {
>  #ifdef CONFIG_HIGHMEM
> -	int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
> -	return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
> -	       (zone_off == ZONE_MOVABLE * sizeof(*zone) &&
> -		zone_movable_is_highmem());
> +	enum zone_type idx = zone_idx(zone);
> +
> +	return idx == ZONE_HIGHMEM ||
> +	       (idx == ZONE_MOVABLE && zone_movable_is_highmem());
>  #else
>  	return 0;
>  #endif
> _
>
>
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -58,6 +58,8 @@
>>  #include <linux/elf.h>
>>  #include <linux/gfp.h>
>>  #include <linux/migrate.h>
>> +#include <linux/page-isolation.h>
>> +#include <linux/mm_inline.h>
>>  #include <linux/string.h>
>>
>>  #include <asm/io.h>
>> @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct
>> mm_struct *mm,
>>  }
>>  EXPORT_SYMBOL(get_user_pages);
>>
>> +#ifdef CONFIG_MEMORY_HOTREMOVE
>> +/**
>> + * It's a wrapper of get_user_pages() but it makes sure that all pages come
>> from
>> + * non-movable zone via additional page migration.
>> + */
>
> This needs a description of why the function exists - say something about
> the requirements of memory hotplug. Also a few words describing how the
> function works would be good.
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
Also...

On Mon, 4 Feb 2013 16:06:24 -0800 Andrew Morton wrote:

> > +put_page:
> > +	/* Undo the effects of former get_user_pages(), we won't pin anything */
> > +	for (i = 0; i < ret; i++)
> > +		put_page(pages[i]);

We can use release_pages() here. release_pages() is designed to be more efficient when we're putting the final reference to (most of) the pages. It probably has little if any benefit when putting still-in-use pages, as we're doing here. But please consider...
Re: [PATCH 1/2] mm: hotplug: implement non-movable version of get_user_pages() called get_user_pages_non_movable()
melreadthis On Mon, 4 Feb 2013 18:04:07 +0800 Lin Feng wrote: > get_user_pages() always tries to allocate pages from movable zone, which is > not > reliable to memory hotremove framework in some case. > > This patch introduces a new library function called > get_user_pages_non_movable() > to pin pages only from zone non-movable in memory. > It's a wrapper of get_user_pages() but it makes sure that all pages come from > non-movable zone via additional page migration. > > ... > > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1049,6 +1049,11 @@ int get_user_pages(struct task_struct *tsk, struct > mm_struct *mm, > struct page **pages, struct vm_area_struct **vmas); > int get_user_pages_fast(unsigned long start, int nr_pages, int write, > struct page **pages); > +#ifdef CONFIG_MEMORY_HOTREMOVE > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, int nr_pages, int write, int force, > + struct page **pages, struct vm_area_struct **vmas); > +#endif The ifdefs aren't really needed here and I encourage people to omit them. This keeps the header files looking neater and reduces the chances of things later breaking because we forgot to update some CONFIG_foo logic in a header file. The downside is that errors will be revealed at link time rather than at compile time, but that's a pretty small cost. > struct kvec; > int get_kernel_pages(const struct kvec *iov, int nr_pages, int write, > struct page **pages); > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 73b64a3..5db811e 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -838,6 +838,10 @@ static inline int is_normal_idx(enum zone_type idx) > return (idx == ZONE_NORMAL); > } > > +static inline int is_movable(struct zone *zone) > +{ > + return zone == zone->zone_pgdat->node_zones + ZONE_MOVABLE; > +} A better name would be zone_is_movable(). 
We haven't been very consistent about this in mmzone.h, but zone_is_foo() is pretty common. And a neater implementation would be return zone_idx(zone) == ZONE_MOVABLE; All of which made me look at mmzone.h, and what I saw wasn't very nice :( Please review... From: Andrew Morton Subject: include/linux/mmzone.h: cleanups - implement zone_idx() in C to fix its references-args-twice macro bug - use zone_idx() in is_highmem() to remove large amounts of silly fluff. Cc: Lin Feng Cc: Mel Gorman Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff -puN include/linux/mmzone.h~include-linux-mmzoneh-cleanups include/linux/mmzone.h --- a/include/linux/mmzone.h~include-linux-mmzoneh-cleanups +++ a/include/linux/mmzone.h @@ -815,7 +815,10 @@ unsigned long __init node_memmap_size_by /* * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc. */ -#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones) +static inline enum zone_type zone_idx(struct zone *zone) +{ + return zone - zone->zone_pgdat->node_zones; +} static inline int populated_zone(struct zone *zone) { @@ -857,10 +860,10 @@ static inline int is_normal_idx(enum zon static inline int is_highmem(struct zone *zone) { #ifdef CONFIG_HIGHMEM - int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones; - return zone_off == ZONE_HIGHMEM * sizeof(*zone) || - (zone_off == ZONE_MOVABLE * sizeof(*zone) && - zone_movable_is_highmem()); + enum zone_type idx = zone_idx(zone); + + return idx == ZONE_HIGHMEM || + (idx == ZONE_MOVABLE && zone_movable_is_highmem()); #else return 0; #endif _ > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -58,6 +58,8 @@ > #include > #include > #include > +#include > +#include > #include > > #include > @@ -1995,6 +1997,67 @@ int get_user_pages(struct task_struct *tsk, struct > mm_struct *mm, > } > EXPORT_SYMBOL(get_user_pages); > > +#ifdef CONFIG_MEMORY_HOTREMOVE > +/** > + * It's a wrapper of get_user_pages() 
but it makes sure that all pages come > from > + * non-movable zone via additional page migration. > + */ This needs a description of why the function exists - say something about the requirements of memory hotplug. Also a few words describing how the function works would be good. > +int get_user_pages_non_movable(struct task_struct *tsk, struct mm_struct *mm, > + unsigned long start, int nr_pages, int write, int force, > + struct page **pages, struct vm_area_struct **vmas) > +{ > + int ret, i, isolate_err, migrate_pre_flag; > + LIST_HEAD(pagelist); > + > +retry: > + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages, > + vmas); We should handle (ret < 0) here. At present the function will silently