Re: [PATCH 04/10] posix-cpu-timers: don't account cpu timer after stopped thread runtime accounting
Hm. I'm not sure why this patch series starts at [PATCH 4/10]. Maybe I need to review my script again. Anyway, patches 1-3 don't exist. Sorry for the confusion.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/3] process cputimer is moving faster than its corresponding clock
(4/27/13 12:40 AM), Olivier Langlois wrote:
> Forbids the cputimer to drift ahead of its process clock by
> blocking its update when a tick occurs while an autoreaping task
> is currently in do_exit() between the call to release_task() and
> its final call to schedule().
>
> Any task stats update after having called release_task() will
> be lost because they are added to the global process stats located
> in the signal struct from release_task().
>
> Ideally, you would postpone the release_task() call until after the
> final context switch to get all the stats added, but this is
> more complex to achieve.
>
> In other words, this slows down the cputimer so it keeps the same
> pace as the process clock; what should really be done is to
> speed up the process clock by adding the missing stats to it.
>
> Signed-off-by: Olivier Langlois
> ---
>  kernel/sched/fair.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a33e59..52d7b10 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -708,7 +708,15 @@ static void update_curr(struct cfs_rq *cfs_rq)
>
>  		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
>  		cpuacct_charge(curtask, delta_exec);
> -		account_group_exec_runtime(curtask, delta_exec);
> +		/*
> +		 * Do not update the cputimer if the task has already been
> +		 * released by release_task().
> +		 *
> +		 * It would be preferable to defer the autoreap release_task()
> +		 * until after the last context switch, but that is harder to do.
> +		 */
> +		if (likely(curtask->sighand))
> +			account_group_exec_runtime(curtask, delta_exec);
>  	}

I'm confused. glibc's rt/tst-cputimer1 doesn't have thread-exiting code. I have not seen any issue in this accounting.
Re: [PATCH] process cputimer is moving faster than its corresponding clock
On Fri, Apr 26, 2013 at 9:51 PM, Olivier Langlois wrote:
> On Fri, 2013-04-26 at 15:08 -0400, KOSAKI Motohiro wrote:
>>> I need to add that I can only confirm that to be true with
>>> sum_exec_runtime.
>>>
>>> To affirm it to be true for stime and utime would require more
>>> investigation. I didn't look at them at all. I was only concerned with
>>> sum_exec_runtime.
>>>
>>> I will prepare a v2 of the patch incorporating all the feedback that I
>>> received from KOSAKI Motohiro, Frederic Weisbecker and Peter Zijlstra
>>> and send it back here for further discussion.
>>>
>>> Thank you very much all!
>>
>> Do you mean your utime test case still fails? If you share your test case,
>> I'll look at your issue too.
>>
> Sure, with pleasure. My testcase is glibc-2.17/rt/tst-cputimer1.c
>
> That being said, it strictly tests CPUCLOCK_SCHED timers. Hence my focus
> when modifying the code was strictly on sum_exec_runtime.
>
> If the utime and stime components of the cputimer are moving faster than
> their associated clock, that is something I did not address.

Hmm... Sorry, I'm confused.

1) I haven't seen any glibc test failure after applying my patch.
2) tst-cputimer1.c only has a CLOCK_PROCESS_CPUTIME_ID test and doesn't have
   any utime or stime tests.

Please let me know if you've seen any failure after applying my patch.
Re: [PATCH] process cputimer is moving faster than its corresponding clock
> I need to add that I can only confirm that to be true with
> sum_exec_runtime.
>
> To affirm it to be true for stime and utime would require more
> investigation. I didn't look at them at all. I was only concerned with
> sum_exec_runtime.
>
> I will prepare a v2 of the patch incorporating all the feedback that I
> received from KOSAKI Motohiro, Frederic Weisbecker and Peter Zijlstra
> and send it back here for further discussion.
>
> Thank you very much all!

Do you mean your utime test case still fails? If you share your test case, I'll look at your issue too.
Re: [PATCH] process cputimer is moving faster than its corresponding clock
On Fri, Apr 19, 2013 at 10:38 AM, KOSAKI Motohiro wrote:
>> I feel we are hitting the same issue as this patch:
>> https://lkml.org/lkml/2013/4/5/116
>>
>> I'm adding Kosaki in Cc, who proposed roughly the same fix.
>
> Thanks for CCing. I'm at LSF now and can't read the whole ton of emails.
> However, the fix is definitely the same and I definitely agree with this
> approach.
>
> thank you.

And if I understand correctly, update_gt_cputime() is no longer necessary after this patch, because time never goes backward. What do you think?
Re: [PATCH] process cputimer is moving faster than its corresponding clock
> I feel we are hitting the same issue as this patch:
> https://lkml.org/lkml/2013/4/5/116
>
> I'm adding Kosaki in Cc, who proposed roughly the same fix.

Thanks for CCing. I'm at LSF now and can't read the whole ton of emails. However, the fix is definitely the same and I definitely agree with this approach.

thank you.
Re: [Bug fix PATCH] numa, cpu hotplug: Change links of CPU and node when changing node number by onlining CPU
> #ifdef CONFIG_HOTPLUG_CPU
> +static void change_cpu_under_node(struct cpu *cpu,
> +			unsigned int from_nid, unsigned int to_nid)
> +{
> +	int cpuid = cpu->dev.id;
> +	unregister_cpu_under_node(cpuid, from_nid);
> +	register_cpu_under_node(cpuid, to_nid);
> +	cpu->node_id = to_nid;
> +}
> +

Where is the stub for !CONFIG_HOTPLUG_CPU?

> static ssize_t show_online(struct device *dev,
>			   struct device_attribute *attr,
>			   char *buf)
> @@ -39,17 +48,23 @@ static ssize_t __ref store_online(struct device *dev,
> 				  const char *buf, size_t count)
> {
> 	struct cpu *cpu = container_of(dev, struct cpu, dev);
> +	int num = cpu->dev.id;

"num" is the wrong name; cpuid may be better.

> +	int from_nid, to_nid;
> 	ssize_t ret;
>
> 	cpu_hotplug_driver_lock();
> 	switch (buf[0]) {
> 	case '0':
> -		ret = cpu_down(cpu->dev.id);
> +		ret = cpu_down(num);
> 		if (!ret)
> 			kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
> 		break;
> 	case '1':
> -		ret = cpu_up(cpu->dev.id);
> +		from_nid = cpu_to_node(num);
> +		ret = cpu_up(num);
> +		to_nid = cpu_to_node(num);
> +		if (from_nid != to_nid)
> +			change_cpu_under_node(cpu, from_nid, to_nid);

You need to add several comments; this code is not straightforward.
Re: [PATCH 03/28] proc: Split kcore bits from linux/procfs.h into linux/kcore.h [RFC]
On Tue, Apr 16, 2013 at 3:07 PM, David Howells wrote:
>
> KOSAKI Motohiro wrote:
>
>> I haven't seen any issue in this change, but why? Is there any
>> motivation other than cleanup?
>
> Stopping stuff mucking about with the internals of procfs incorrectly
> (sometimes because the internals of procfs have changed, but the drivers
> haven't).

OK, thank you for the explanation.

Acked-by: KOSAKI Motohiro
Re: [PATCH 03/28] proc: Split kcore bits from linux/procfs.h into linux/kcore.h [RFC]
On Tue, Apr 16, 2013 at 11:26 AM, David Howells wrote:
> Split kcore bits from linux/procfs.h into linux/kcore.h.
>
> Signed-off-by: David Howells
> cc: linux-m...@linux-mips.org
> cc: sparcli...@vger.kernel.org
> cc: x...@kernel.org
> cc: linux...@kvack.org

I haven't seen any issue in this change, but why? Is there any motivation other than cleanup?
Re: Return value of __mm_populate
(4/13/13 5:14 AM), Marco Stornelli wrote:
> Hi,
>
> I was looking at the code of __mm_populate (in -next) and I have a doubt
> about the return value. The function __mlock_posix_error_return should
> return a proper error for mlock, converting the return value from
> __get_user_pages. It checks for EFAULT and ENOMEM. Actually
> __get_user_pages could return, in addition, ERESTARTSYS and EHWPOISON.

__get_user_pages doesn't return EHWPOISON if FOLL_HWPOISON is not specified.

I'm no expert on ERESTARTSYS, but if I understand correctly, ERESTARTSYS is only returned when a signal is received, and the signal handling routine (e.g. do_signal) modifies EIP and hides ERESTARTSYS from userland generically.

> So it seems to me that we could return an unexpected value to user space.
> I can't see these on the man page. In addition, we shouldn't ever return
> ERESTARTSYS to user space, but EINTR. According to the man pages,
> maybe we should return EAGAIN in these cases. Am I missing something?
Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable
(4/10/13 11:26 PM), Mitsuhiro Tanino wrote:
> Hi All,
> Please find a patch set that introduces these new sysctl interfaces,
> to handle the case when a memory error is detected on a dirty page cache.
>
> - vm.memory_failure_dirty_panic

The panic knob is OK with me. However, I agree with Andi: if we need a panic knob, it should handle generic I/O errors and data loss.

> - vm.memory_failure_print_ratelimit
> - vm.memory_failure_print_ratelimit_burst

But this is totally silly. print_ratelimit might omit important messages. Please do it the right way.
Re: [RFC v7 00/11] Support vrange for anonymous page
>>>> and adding new syscall invocation is unwelcome.
>>>
>>> Sure. But one more system call could be cheaper than page-granularity
>>> operation on a purged range.
>>
>> I don't think vrange(VOLATILE) cost is related to this discussion.
>> Whether sending SIGBUS or just nuking the pte, the purge should be done
>> in vmscan, not in the vrange() syscall.
>
> Again, please see MADV_FREE: http://lwn.net/Articles/230799/
> It changes pte and page flags on all pages of the range through
> zap_pte_range. So it would make vrange(VOLATILE) expensive, and
> the bigger the range is, the bigger the cost.

This hadn't crossed my mind. Currently try_to_discard_one() inserts a vrange for making SIGBUS; then we can insert pte_none() at the same cost too. Am I missing something? I can't imagine why the pte should be zapped in vrange(VOLATILE).
Re: [RFC v7 00/11] Support vrange for anonymous page
(4/11/13 4:02 AM), Minchan Kim wrote:
> On Thu, Apr 11, 2013 at 03:20:30AM -0400, KOSAKI Motohiro wrote:
>>>>> DONTNEED makes sure the user always sees zero-filled pages after
>>>>> he calls madvise, while with vrange he can see data or encounter SIGBUS.
>>>>
>>>> For replacing DONTNEED, the user wants zero-filled pages like DONTNEED
>>>> instead of SIGBUS. So a new flag option would be nice.
>>>
>>> If userspace people want it, I can do it.
>>> But I'm not sure they want it at the moment, because vrange is a rather
>>> different concept from madvise(DONTNEED), usage-wise.
>>>
>>> As you know well, in the case of DONTNEED, the user calls madvise _once_
>>> and the VM releases memory as soon as he has called the system call.
>>> But vrange amounts to delayed free when system memory pressure happens,
>>> so the user can't know when the OS frees the pages.
>>> It means the user should call the pair of system calls, both
>>> VRANGE_VOLATILE and VRANGE_NOVOLATILE, for correct usage of a volatile
>>> range (for simplicity, I won't describe the SIGBUS fault recovery method).
>>> If he makes a mistake (ie, does NOT call VRANGE_NOVOLATILE) on a range
>>> which is used by the current process, pages used by some process could
>>> disappear suddenly.
>>>
>>> In summary, I don't think vrange is a replacement for madvise(DONTNEED),
>>> but it could be a useful friend of madvise(DONTNEED). For example, we can
>>> make vrange(VRANGE_VOLATILE) return 1 if memory pressure was already
>>
>> Do you mean vrange(VRANGE_UNVOLATILE)?
>
> I meant VRANGE_VOLATILE. It seems my explanation was poor. Here it goes,
> again. Now vrange's semantics are to return just 0 if the system call is
> successful; otherwise, return an error. But we can change it as follows:
>
> 1. return 0 if the system call is successful and memory pressure isn't severe
> 2. return 1 if the system call is successful and memory pressure is severe
> 3. return -ERRXXX if the system call failed for some reason
>
> So the process can know the system-wide memory pressure without peeking at
> vmstat, and then call madvise(DONTNEED) right after the vrange call. The
> benefit is the system can zap all pages instantly.

Do you mean your patchset is not the latest? And when do you use this feature? What happens if VRANGE_VOLATILE returns 0 and the range is purged just after the syscall returns?

>> btw, assigning a new error number in asm-generic/errno.h is better than a
>> strange '1'.
>
> I can, and I admit "1" is rather weird.
> But it's not an error, either.

If this is really necessary, I don't oppose it. However, I am still not convinced.

>>> severe, so the user can catch memory pressure via the return value and
>>> call madvise(DONTNEED) if memory pressure was already severe. Of course,
>>> we could handle it in the vrange system call itself (ex, change the vrange
>>> system call to madvise(DONTNEED)), but I don't want that because I want to
>>> keep the vrange hinting system call very light at all times so the user
>>> can expect low latency.
>>
>> For allocator usage, vrange(UNVOLATILE) is annoying and not needed at all.
>> When data has already been purged, just return a new zero-filled page. So
>> maybe adding a new flag is worthwhile, because malloc is definitely a fast
>> path
>
> I really want it, and it's exactly the same as madvise(MADV_FREE).
> But for the implementation, we need page-granularity work on the address
> range in system call context, like zap_pte_range (ex, clear page table bits
> and mark something in the page flags for the reclaimer to detect it).
> It means the vrange system call is still bigger, although we are able to
> remove the lazy page fault.
>
> Do you have any idea how to remove it? If so, I'm very open to implementing
> it.

Hm. Maybe I am missing something. I'll look at the code closely after LSF.

>> and adding a new syscall invocation is unwelcome.
>
> Sure. But one more system call could be cheaper than page-granularity
> operation on a purged range.

I don't think vrange(VOLATILE) cost is related to this discussion. Whether sending SIGBUS or just nuking the pte, the purge should be done in vmscan, not in the vrange() syscall.
Re: [RFC v7 00/11] Support vrange for anonymous page
>>> DONTNEED makes sure the user always sees zero-filled pages after
>>> he calls madvise, while with vrange he can see data or encounter SIGBUS.
>>
>> For replacing DONTNEED, the user wants zero-filled pages like DONTNEED
>> instead of SIGBUS. So a new flag option would be nice.
>
> If userspace people want it, I can do it.
> But I'm not sure they want it at the moment, because vrange is a rather
> different concept from madvise(DONTNEED), usage-wise.
>
> As you know well, in the case of DONTNEED, the user calls madvise _once_
> and the VM releases memory as soon as he has called the system call.
> But vrange amounts to delayed free when system memory pressure happens,
> so the user can't know when the OS frees the pages.
> It means the user should call the pair of system calls, both
> VRANGE_VOLATILE and VRANGE_NOVOLATILE, for correct usage of a volatile
> range (for simplicity, I won't describe the SIGBUS fault recovery method).
> If he makes a mistake (ie, does NOT call VRANGE_NOVOLATILE) on a range
> which is used by the current process, pages used by some process could
> disappear suddenly.
>
> In summary, I don't think vrange is a replacement for madvise(DONTNEED),
> but it could be a useful friend of madvise(DONTNEED). For example, we can
> make vrange(VRANGE_VOLATILE) return 1 if memory pressure was already

Do you mean vrange(VRANGE_UNVOLATILE)?

btw, assigning a new error number in asm-generic/errno.h is better than a strange '1'.

> severe, so the user can catch memory pressure via the return value and
> call madvise(DONTNEED) if memory pressure was already severe. Of course,
> we could handle it in the vrange system call itself (ex, change the vrange
> system call to madvise(DONTNEED)), but I don't want that because I want to
> keep the vrange hinting system call very light at all times so the user
> can expect low latency.

For allocator usage, vrange(UNVOLATILE) is annoying and not needed at all. When data has already been purged, just return a new zero-filled page. So maybe adding a new flag is worthwhile, because malloc is definitely a fast path and adding a new syscall invocation is unwelcome.

>> # of       # of       # of
>> thread     iter       iter (patched glibc)
>
> What's the workload?

Ahh, sorry, I forgot to write it. I use ebizzy, your favorite workload.
Re: [RFC v7 00/11] Support vrange for anonymous page
(4/11/13 4:02 AM), Minchan Kim wrote: On Thu, Apr 11, 2013 at 03:20:30AM -0400, KOSAKI Motohiro wrote:

DONTNEED makes sure the user always sees zero-filled pages after calling madvise, while with vrange he can see the old data or encounter SIGBUS. For replacing DONTNEED, users would want zero-filled pages like DONTNEED instead of SIGBUS, so a new flag option would be nice. If userspace people want it, I can do it. But I'm not sure they want it at the moment, because vrange is a rather different concept from madvise(DONTNEED) from a usage point of view. As you know well, in the DONTNEED case the user calls madvise _once_ and the VM releases the memory as soon as the system call is made. But vrange is like a delayed free that happens when the system comes under memory pressure, so the user can't know when the OS frees the pages. It means the user should call the pair of system calls VRANGE_VOLATILE and VRANGE_NOVOLATILE for correct usage of a volatile range (for simplicity, I won't go into the SIGBUS fault-recovery method). If he makes a mistake (ie, doesn't call VRANGE_NOVOLATILE) on a range still used by the current process, pages used by the process could disappear suddenly. In summary, I don't think vrange is a replacement for madvise(DONTNEED), but it could be a useful companion to it. For example, we could make vrange(VRANGE_VOLATILE) return 1 if memory pressure was already

Do you mean vrange(VRANGE_UNVOLATILE)?

I meant VRANGE_VOLATILE. It seems my explanation was poor. Here it goes again. Currently vrange's semantics are to return just 0 if the system call succeeds and an error otherwise. But we could change that as follows:

1. return 0 if the system call succeeds and memory pressure isn't severe
2. return 1 if the system call succeeds and memory pressure is severe
3. return -ERRXXX if the system call fails for some reason

So the process can know the system-wide memory pressure without peeking at vmstat, and then call madvise(DONTNEED) right after the vrange call. The benefit is that the system can zap all the pages instantly.
Do you mean your patchset is not the latest? And when do you use this feature? What happens if VRANGE_VOLATILE returns 0 and the range is purged just after the syscall returns?

btw, assigning a new error number in asm-generic/errno.h is better than a strange '1'.

I can, and I admit 1 is rather weird. But it's not an error, either. If this is really necessary, I don't oppose it. However, I am still not convinced.

severe, so the user can detect memory pressure from the return value and call madvise(DONTNEED) when memory pressure is already severe. Of course, we could handle it in the vrange system call itself (ex, change the vrange system call into madvise(DONTNEED)), but I don't want that because I want to keep the vrange hinting system call very light at all times so the user can predict its latency. For allocator usage, vrange(UNVOLATILE) is annoying and not needed at all. When the data has already been purged, just return a new zero-filled page. So maybe adding a new flag is worthwhile, because malloc is definitely a fast path

I really want it, and it's exactly the same as madvise(MADV_FREE). But for the implementation, we need something page-granular over the address range in system call context, like zap_pte_range (ex, clear page table bits and mark something in the page flags for the reclaimer to detect). It means the vrange system call is still heavier, although we could remove the lazy page fault. Do you have any idea how to remove it? If so, I'm very open to implementing it.

Hm. Maybe I am missing something. I'll look at the code closely after LSF/MM.

and adding a new syscall invocation is unwelcome.

Sure. But one more system call could be cheaper than a page-granularity operation on the purged range.

I don't think the cost of vrange(VOLATILE) is what this discussion is about. Whether we send SIGBUS or just nuke the pte, the purge should be done in vmscan, not in the vrange() syscall.
Re: [RFC v7 00/11] Support vrange for anonymous page
and adding a new syscall invocation is unwelcome.

Sure. But one more system call could be cheaper than a page-granularity operation on the purged range.

I don't think the cost of vrange(VOLATILE) is what this discussion is about. Whether we send SIGBUS or just nuke the pte, the purge should be done in vmscan, not in the vrange() syscall.

Again, please see MADV_FREE. http://lwn.net/Articles/230799/ It changes pte and page flags on all pages of the range through zap_pte_range. So it would make vrange(VOLATILE) expensive, and the bigger the range is, the bigger the cost.

This hadn't crossed my mind. Currently try_to_discard_one() inserts a vrange entry for raising SIGBUS; then we could insert pte_none() at the same cost. Am I missing something? I can't imagine why the pte should be zapped in vrange(VOLATILE).
Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable
(4/10/13 11:26 PM), Mitsuhiro Tanino wrote: Hi All, Please find a patch set that introduces these new sysctl interfaces to handle the case when a memory error is detected on a dirty page cache page:

- vm.memory_failure_dirty_panic

The panic knob is OK with me. However, I agree with Andi: if we need a panic knob, it should handle generic I/O errors and data loss.

- vm.memory_failure_print_ratelimit
- vm.memory_failure_print_ratelimit_burst

But this is totally silly. print_ratelimit might omit important messages. Please do it the right way.
Re: [PATCH] mm: madvise: complete input validation before taking lock
(4/10/13 7:45 PM), Rasmus Villemoes wrote:
> In madvise(), there doesn't seem to be any reason for taking
> current->mm->mmap_sem before start and len_in have been
> validated. Incidentally, this removes the need for the out: label.
>
> Signed-off-by: Rasmus Villemoes

Looks good. Acked-by: KOSAKI Motohiro
Re: [PATCH 0/10] Reduce system disruption due to kswapd V2
>> I've never checked it but I would have expected kswapd to stay on the
>> same processor for significant periods of time. Have you experienced
>> problems where kswapd bounces around on CPUs within a node causing
>> workload disruption?
>
> When kswapd shares the same CPU as our main process it causes a measurable
> drop in response time (graphs show tiny spikes at the same time memory is
> freed). Would be nice to be able to ensure it runs on a different core
> than our latency sensitive processes at least. We can pin processes to
> subsets of cores but I don't think there's a way to keep kswapd from
> waking up on any of them?

You are only talking about an extreme corner case and ignoring the other side. When number-of-nodes > number-of-cpus, we have no way to avoid CPU sharing. Moreover, this is not a kswapd-specific issue; every kernel thread causes the same latency hiccup. So this issue should be solved at a more generic layer.
Re: + fix-hugetlb-memory-check-in-vma_dump_size.patch added to -mm tree
> Signed-off-by: Naoya Horiguchi
> Reviewed-by: Rik van Riel
> Acked-by: Michal Hocko
> Reviewed-by: HATAYAMA Daisuke
> Acked-by: KOSAKI Motohiro
> Acked-by: David Rientjes
> Cc: [2.6.34+]

Not correct. It's 3.7 material. See Michal's or my comments; cut-and-pasted here:

>Just for the record. It should be stable for 3.7+ since (314e51b98),
>because that commit lost the VM_RESERVED check which used to stop hugetlb
>mappings.
> Acked-by: Michal Hocko
Re: [RFC v7 00/11] Support vrange for anonymous page
(3/12/13 3:38 AM), Minchan Kim wrote:
> First of all, let's define the term.
> From now on, I'd like to call it vrange (a.k.a. volatile range)
> for anonymous pages. If you have a better name in mind, please suggest.
>
> This version is still *RFC* because it's just a quick prototype, so
> it doesn't support THP/HugeTLB/KSM and couldn't even build on !x86.
> Before sorting out further issues, I'd like to post the current direction
> and discuss it. Of course, I'd like to extend this discussion at the
> coming LSF/MM.
>
> In this version, I changed lots of things, especially removing the vma-based
> approach because it needs the write-side lock of mmap_sem, which will drop
> performance on multi-threaded big SMP systems, as KOSAKI pointed out.
> And the vma-based approach makes it hard to meet the requirements of the new
> system call under John Stultz's suggested semantics for consistent purged
> handling.
> (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
>
> I tested this patchset with a modified jemalloc allocator; the port was
> led by Jason Evans (the jemalloc author), who was interested in this feature
> and was happy to port his allocator to the new system call.
> Super thanks, Jason!
>
> The benchmark for the test is ebizzy. It has been used for testing
> allocator performance, so it's good for me. Again, thanks for recommending
> the benchmark, Jason.
> (http://people.freebsd.org/~kris/scaling/ebizzy.html)
>
> The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
>
> ebizzy -S 20
>
> jemalloc-vanilla: 52389 records/sec
> jemalloc-vrange: 203414 records/sec
>
> ebizzy -S 20 with background memory pressure
>
> jemalloc-vanilla: 40746 records/sec
> jemalloc-vrange: 174910 records/sec
>
> And it's much improved on a KVM virtual machine.
>
> This patchset is based on v3.9-rc2
>
> - What's the sys_vrange(addr, length, mode, behavior)?
>
> It's a hint that the user delivers to the kernel so the kernel can *discard*
> pages in a range anytime.
> mode is one of VRANGE_VOLATILE and VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is a
> memory pin operation, so the kernel can't discard any pages any more, while
> VRANGE_VOLATILE is a memory unpin operation, so the kernel can discard pages
> in the vrange anytime. At the moment, behavior is one of VRANGE_FULL and
> VRANGE_PARTIAL. VRANGE_FULL tells the kernel that once it decides to discard
> a page in a vrange, it should discard all of the pages in the vrange selected
> as the victim. VRANGE_PARTIAL tells the kernel to discard only some of the
> pages in a vrange. But I haven't implemented VRANGE_PARTIAL handling yet.
>
> - What happens if the user accesses a page (ie, virtual address) discarded
> by the kernel?
>
> The user can encounter SIGBUS.
>
> - What should the user do to avoid SIGBUS?
> He should call vrange(addr, length, VRANGE_NOVOLATILE, mode) before
> accessing a range on which he previously called
> vrange(addr, length, VRANGE_VOLATILE, mode).
>
> - What happens if the user accesses a page (ie, virtual address) that wasn't
> discarded by the kernel?
>
> The user sees the valid data that was there before the
> vrange(., VRANGE_VOLATILE) call, without a page fault.
>
> - What's different from madvise(DONTNEED)?
>
> System call semantics
>
> DONTNEED makes sure the user always sees zero-filled pages after
> he calls madvise, while with vrange he can see the data or encounter SIGBUS.

For replacing DONTNEED, users would want zero-filled pages like DONTNEED instead of SIGBUS. So a new flag option would be nice.

I played a bit with this patch. The result looks really promising (i.e. 20x faster). My machine has 24 cpus, 8GB ram, kvm guest. I guess the current DONTNEED implementation doesn't fit kvm at all.
# of threads   # of iter   # of iter (patched glibc)
----------------------------------------------------
 1               438        10740
 2               842        20916
 4               987        32534
 8               717        15155
12               714        14109
16               708        13457
20               720        13742
24               727        13642
28               715        13328
32               709        13096
36               705        13661
40               708        13634
44               707        13367
48               714        13377

- libc patch (just dirty hack)
------------------------------
diff --git a/malloc/arena.c b/malloc/arena.c
index 12a48ad..da04f67 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -365,6 +365,8 @@ extern struct dl_open_hook *_dl_open_hook;
 libc_hidden_proto (_dl_open_hook);
 #endif
 
+int vrange_enabled = 0;
+
 static void
 ptmalloc_init (void)
 {
@@ -457,6 +459,18 @@ ptmalloc_init (void)
   if (check_action != 0)
     __malloc_check_init();
   }
+
+  {
+    char *vrange = getenv("MALLOC_VRANGE");
+    if (vrange) {
+      int val = atoi(vrange);
+      if (val) {
+        printf("glibc: vrange enabled\n");
+        vrange_enabled = !!val;
+      }
+    }
+  }
+
   void (*hook) (void) = force_reg (__malloc_initialize_hook);
   if (hook != NULL)
     (*hook)();
@@ -628,9 +642,14 @@ shrink_heap(heap_info *h, long diff)
     return -2;
   h->mprotect_size = new_size;
   }
-  else
-    __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
+  else {
+    if (vrange_enabled) {
Re: [RESEND][PATCH v5 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
On Wed, Apr 10, 2013 at 12:17 PM, Naoya Horiguchi wrote:
> # I suspended the Reviewed and Acked tags given for the previous version, because
> # it has a non-minor change. If you want to restore them, please let me know.
> -
> With the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
> initializing vma(VM_HUGETLB)" applied to re-enable hugepage coredump, if a memory
> error happens on a hugepage and the affected processes try to access
> the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0)
> in get_page().
>
> The reason for this bug is that coredump-related code doesn't recognise
> the "hugepage hwpoison entry" with which a pmd entry is replaced when a memory
> error occurs on a hugepage.
> In other words, physical address information is stored in a different bit layout
> between a hugepage hwpoison entry and a pmd entry, so follow_hugetlb_page(),
> which is called in get_dump_page(), returns a wrong page for a given address.
>
> The expected behavior is like this:
>
> absent  is_swap_pte  FOLL_DUMP  Expected behavior
> -------------------------------------------------
> true    false        false      hugetlb_fault
> false   true         false      hugetlb_fault
> false   false        false      return page
> true    false        true       skip page (to avoid allocation)
> false   true         true       hugetlb_fault
> false   false        true       return page
>
> With this patch, we can call hugetlb_fault() and take the proper actions
> (we wait for migration entries, and fail with VM_FAULT_HWPOISON_LARGE for
> hwpoisoned entries,) and as a result we can dump all hugepages except
> for hwpoisoned ones.
>
> ChangeLog v5:
> - improve comment and description.
>
> ChangeLog v4:
> - move is_swap_page() to right place.
>
> ChangeLog v3:
> - add comment about using is_swap_pte()
>
> Signed-off-by: Naoya Horiguchi
> Cc: sta...@vger.kernel.org

Acked-by: KOSAKI Motohiro
Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
On Tue, Apr 9, 2013 at 6:43 PM, Naoya Horiguchi wrote:
> On Tue, Apr 09, 2013 at 05:27:44PM -0400, KOSAKI Motohiro wrote:
>> >> numa_node_id() is really silly. This might lead to allocating from an
>> >> offlining node.
>> >
>> > Right, it should've been alloc_huge_page().
>> >
>> >> and, offline_pages() should mark the hstate as isolated, like normal
>> >> pages, to prohibit new allocations first.
>> >
>> > It seems that alloc_migrate_target() calls alloc_page() for normal pages
>> > and the destination pages can be in the same node as the source pages
>> > (new page allocations from the same memblock are prohibited.)
>>
>> No. It can't. Memory hotplug changes the buddy attribute to MIGRATE_ISOLATE
>> first; then alloc_page() never allocates from the source node. However,
>> huge pages don't use the buddy allocator, so we need another trick.
>
> MIGRATE_ISOLATE is changed only within the range [start_pfn, end_pfn)
> given as the argument of __offline_pages (see also start_isolate_page_range),
> so it's set only for pages within the single memblock to be offlined.

For partial memory hot-remove, that's the correct behavior; a different node is not required.

> BTW, in a previous discussion I already agreed with checking the migrate type
> in the hugepage allocation code (maybe it will be in dequeue_huge_page_vma(),)
> so what you are concerned about should be solved in the next post.

Umm.. Maybe I missed such a discussion. Do you have a pointer?

>> > So if we want to avoid new page allocation from the same node,
>> > this is the problem both for normal and huge pages.
>> >
>> > BTW, is it correct to think that all users of memory hotplug assume
>> > that they want to hotplug a whole node (not part of it?)
>>
>> Both are valid use cases. An admin can isolate a part of memory to isolate
>> a broken memory range.
>>
>> But I'm sure almost all users want to remove a whole node.
>
> OK. So I'll think about "allocation in the nearest neighbor node",
> although it can be in a separate patch if it's hard to implement.
That's fine.
Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
> I rewrote the comment here, how about this?
>
> - if (absent ||
> + /*
> +  * We need to call hugetlb_fault for both hugepages under migration
> +  * (in which case hugetlb_fault waits for the migration,) and
> +  * hwpoisoned hugepages (in which case we need to prevent the
> +  * caller from accessing them.) In order to do this, we use
> +  * is_swap_pte here instead of is_hugetlb_entry_migration and
> +  * is_hugetlb_entry_hwpoisoned. This is because it simply covers
> +  * both cases, and because we can't follow correct pages directly
> +  * from any kind of swap entries.
> +  */
> + if (absent || is_swap_pte(huge_ptep_get(pte)) ||
>       ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) {
>       int ret;

Looks ok to me.
Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
>> numa_node_id() is really silly. This might lead to allocating from an
>> offlining node.
>
> Right, it should've been alloc_huge_page().
>
>> and, offline_pages() should mark the hstate as isolated, like normal pages,
>> to prohibit new allocations first.
>
> It seems that alloc_migrate_target() calls alloc_page() for normal pages
> and the destination pages can be in the same node as the source pages
> (new page allocations from the same memblock are prohibited.)

No. It can't. Memory hotplug changes the buddy attribute to MIGRATE_ISOLATE first; then alloc_page() never allocates from the source node. However, huge pages don't use the buddy allocator, so we need another trick.

> So if we want to avoid new page allocation from the same node,
> this is the problem both for normal and huge pages.
>
> BTW, is it correct to think that all users of memory hotplug assume
> that they want to hotplug a whole node (not part of it?)

Both are valid use cases. An admin can isolate a part of memory to isolate a broken memory range.

But I'm sure almost all users want to remove a whole node.
Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
On Tue, Apr 9, 2013 at 6:43 PM, Naoya Horiguchi n-horigu...@ah.jp.nec.com wrote: On Tue, Apr 09, 2013 at 05:27:44PM -0400, KOSAKI Motohiro wrote: numa_node_id() is really silly. This might lead to allocate from offlining node. Right, it should've been alloc_huge_page(). and, offline_pages() should mark hstate as isolated likes normal pages for prohibiting new allocation at first. It seems that alloc_migrate_target() calls alloc_page() for normal pages and the destination pages can be in the same node with the source pages (new page allocation from the same memblock are prohibited.) No. It can't. memory hotplug change buddy attribute to MIGRATE_ISOLTE at first. then alloc_page() never allocate from source node. however huge page don't use buddy. then we need another trick. MIGRATE_ISOLTE is changed only within the range [start_pfn, end_pfn) given as the argument of __offline_pages (see also start_isolate_page_range), so it's set only for pages within the single memblock to be offlined. When partial memory hot remove, that's correct behavior. different node is not required. BTW, in previous discussion I already agreed with checking migrate type in hugepage allocation code (maybe it will be in dequeue_huge_page_vma(),) so what you concern should be solved in the next post. Umm.. Maybe I missed such discussion. Do you have a pointer? So if we want to avoid new page allocation from the same node, this is the problem both for normal and huge pages. BTW, is it correct to think that all users of memory hotplug assume that they want to hotplug a whole node (not the part of it?) Both are valid use case. admin can isolate a part of memory for isolating broken memory range. but I'm sure almost user want to remove whole node. OK. So I think about allocation in the nearest neighbor node, although it can be in separate patch if it's hard to implement. That's fine. 
Re: [RFC PATCH 1/1] mm: Another attempt to monitor task's memory changes
> This approach works on any task via its proc, and can be used on different
> tasks in parallel.
>
> Also, Andrew was asking for some performance numbers related to the change.
> Now I can say that as long as soft dirty bits are not cleared, no performance
> penalty occurs, since the soft dirty bit and the regular dirty bit are set at
> the same time within the same instruction. When soft dirty is cleared via
> clear_refs, the task in question might slow down, but it will depend on how
> actively it uses the memory.
>
> What do you think, does it make sense to develop this approach further?

When touching an mmaped page, the CPU turns on the dirty bit but doesn't
turn on the soft dirty bit. So, I'm not convinced how to use this flag.
Please show us your userland algorithm for how to detect the diff.
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
(4/8/13 3:50 PM), Cody P Schafer wrote:
> On 04/08/2013 10:28 AM, Cody P Schafer wrote:
>> On 04/08/2013 05:20 AM, Gilad Ben-Yossef wrote:
>>> On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer wrote:
>>>> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
>>>> Updating it without being on the cpu owning the percpu pageset
>>>> potentially destroys this stability.
>>>>
>>>> Change for_each_cpu() to on_each_cpu() to fix.
>>>
>>> Are you referring to this? -
>>
>> This was the case I noticed.
>>
>>> 1329         if (pcp->count >= pcp->high) {
>>> 1330                 free_pcppages_bulk(zone, pcp->batch, pcp);
>>> 1331                 pcp->count -= pcp->batch;
>>> 1332         }
>>>
>>> I'm probably missing the obvious but won't it be simpler to do this in
>>> free_hot_cold_page() -
>>>
>>> 1329         if (pcp->count >= pcp->high) {
>>> 1330                 unsigned int batch = ACCESS_ONCE(pcp->batch);
>>> 1331                 free_pcppages_bulk(zone, batch, pcp);
>>> 1332                 pcp->count -= batch;
>>> 1333         }
>>
>> Potentially, yes. Note that this was simply the one case I noticed,
>> rather than certainly the only case.
>>
>> I also wonder whether there could be unexpected interactions between
>> ->high and ->batch not changing together atomically. For example, could
>> adjusting this knob cause ->batch to rise enough that it is greater than
>> the previous ->high? If the code above then runs with the previous
>> ->high, ->count wouldn't be correct (checking this inside
>> free_pcppages_bulk() might help on this one issue).
>>
>>> Now the batch value used is stable and you don't have to IPI every CPU
>>> in the system just to change a config knob...
>>
>> Is this really considered an issue? I wouldn't have expected someone to
>> adjust the config knob often enough (or even more than once) to cause
>> problems. Of course as a "It'd be nice" thing, I completely agree.
>
> Would using schedule_on_each_cpu() instead of on_each_cpu() be an
> improvement, in your opinion?

No. As long as a lightweight solution works, we should never introduce
heavyweight code. on_each_cpu() is really heavyweight, especially when
the number of cpus is more than a thousand.
Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
(4/8/13 3:49 PM), Cody P Schafer wrote:
> On 04/08/2013 12:26 PM, KOSAKI Motohiro wrote:
>> (4/8/13 1:32 PM), Cody P Schafer wrote:
>>> On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote:
>>>> (4/5/13 4:33 PM), Cody P Schafer wrote:
>>>>> No off-cpu users of the percpu pagesets exist.
>>>>>
>>>>> zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
>>>>> percpu pageset based on a zone's ->managed_pages. We don't need to drain
>>>>> the entire percpu pageset just to modify these fields. Avoid calling
>>>>> setup_pageset() (and the draining required to call it) and instead just
>>>>> set the fields' values.
>>>>>
>>>>> This does change the behavior of zone_pcp_update() as the percpu
>>>>> pagesets will not be drained when zone_pcp_update() is called (they will
>>>>> end up being shrunk, not completely drained, later when a 0-order page
>>>>> is freed in free_hot_cold_page()).
>>>>>
>>>>> Signed-off-by: Cody P Schafer
>>>>
>>>> NAK.
>>>>
>>>> 1) zone_pcp_update() is only used from memory hotplug, and it requires a
>>>> page drain.
>>>
>>> I'm looking at this code because I'm currently working on a patchset
>>> which adds another interface which modifies zone sizes, so "only used
>>> from memory hotplug" is a temporary thing (unless I discover that
>>> zone_pcp_update() is not intended to do what I want it to do).
>>
>> Maybe yes, maybe no. I don't know whether it is temporary or not. However
>> the fact is, you must not break anything. You always need to look at all
>> callers.
>
> Right, which is why I want to understand memory hotplug's actual
> requirements.
>
>>>> 2) stop_machine is used for avoiding races. Just removing it is insane.
>>>
>>> What race? Is there a cross cpu access to ->high & ->batch that makes
>>> using on_each_cpu() instead of stop_machine() inappropriate? It is
>>> absolutely not just being removed.
>>
>> OK, I missed that. However your code is still wrong.
>>
>> You can't call free_pcppages_bulk() from interrupt context, and then you
>> can't use on_each_cpu() anyway.
>
> Given drain_pages() implementation, I find that hard to believe (It uses
> on_each_cpu_mask() and eventually calls free_pcppages_bulk()).
>
> Can you provide a reference backing up your statement?

Grr. I missed it again. OK, you are right. Go ahead.

> If this turns out to be an issue, schedule_on_each_cpu() could be an
> alternative.

No way. schedule_on_each_cpu() is more problematic and should be removed
in the future. schedule_on_each_cpu() can only be used when the caller
task doesn't hold any locks; otherwise it may deadlock.
Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
> -	if (absent ||
> +	/*
> +	 * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned
> +	 * and hugepages under migration, in which case
> +	 * hugetlb_fault waits for the migration and bails out
> +	 * properly for HWPoisoned pages.
> +	 */
> +	if (absent || is_swap_pte(huge_ptep_get(pte)) ||
> 	    ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) {
> 		int ret;

Your comment describes what the code does. However, we want the comment to
describe why. In the migration case, calling hugetlb_fault() is natural,
but the hwpoison case needs more explanation. Why can't we call
is_hugetlb_entry_hwpoisoned() directly?
Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
(4/8/13 1:32 PM), Cody P Schafer wrote:
> On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote:
>> (4/5/13 4:33 PM), Cody P Schafer wrote:
>>> No off-cpu users of the percpu pagesets exist.
>>>
>>> zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
>>> percpu pageset based on a zone's ->managed_pages. We don't need to drain
>>> the entire percpu pageset just to modify these fields. Avoid calling
>>> setup_pageset() (and the draining required to call it) and instead just
>>> set the fields' values.
>>>
>>> This does change the behavior of zone_pcp_update() as the percpu
>>> pagesets will not be drained when zone_pcp_update() is called (they will
>>> end up being shrunk, not completely drained, later when a 0-order page
>>> is freed in free_hot_cold_page()).
>>>
>>> Signed-off-by: Cody P Schafer
>>
>> NAK.
>>
>> 1) zone_pcp_update() is only used from memory hotplug, and it requires a
>> page drain.
>
> I'm looking at this code because I'm currently working on a patchset
> which adds another interface which modifies zone sizes, so "only used
> from memory hotplug" is a temporary thing (unless I discover that
> zone_pcp_update() is not intended to do what I want it to do).

Maybe yes, maybe no. I don't know whether it is temporary or not. However
the fact is, you must not break anything. You always need to look at all
callers.

>> 2) stop_machine is used for avoiding races. Just removing it is insane.
>
> What race? Is there a cross cpu access to ->high & ->batch that makes
> using on_each_cpu() instead of stop_machine() inappropriate? It is
> absolutely not just being removed.

OK, I missed that. However your code is still wrong. You can't call
free_pcppages_bulk() from interrupt context, and then you can't use
on_each_cpu() anyway.
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
(4/8/13 8:20 AM), Gilad Ben-Yossef wrote:
> On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer wrote:
>> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
>> Updating it without being on the cpu owning the percpu pageset
>> potentially destroys this stability.
>>
>> Change for_each_cpu() to on_each_cpu() to fix.
>
> Are you referring to this? -
>
> 1329         if (pcp->count >= pcp->high) {
> 1330                 free_pcppages_bulk(zone, pcp->batch, pcp);
> 1331                 pcp->count -= pcp->batch;
> 1332         }
>
> I'm probably missing the obvious but won't it be simpler to do this in
> free_hot_cold_page() -
>
> 1329         if (pcp->count >= pcp->high) {
> 1330                 unsigned int batch = ACCESS_ONCE(pcp->batch);
> 1331                 free_pcppages_bulk(zone, batch, pcp);
> 1332                 pcp->count -= batch;
> 1333         }
>
> Now the batch value used is stable and you don't have to IPI every CPU
> in the system just to change a config knob...

OK, right. Your approach is much better.
Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch
(4/8/13 1:16 PM), Cody P Schafer wrote:
> On 04/07/2013 08:23 AM, KOSAKI Motohiro wrote:
>> (4/5/13 4:33 PM), Cody P Schafer wrote:
>>> In one case while modifying the ->high and ->batch fields of per cpu
>>> pagesets we're needlessly using stop_machine() (patches 1 & 2), and in
>>> another we don't have any synchronization at all (patch 3).
>>>
>>> This patchset fixes both of them.
>>>
>>> Note that it results in a change to the behavior of zone_pcp_update(),
>>> which is used by memory_hotplug. I _think_ that I've discerned (and
>>> preserved) the essential behavior (changing ->high and ->batch), and
>>> only eliminated unneeded actions (draining the per cpu pages), but this
>>> may not be the case.
>>
>> At least, memory hotplug needs to drain.
>
> Could you explain why the drain is required here? From what I can tell,
> after the stop_machine() completes, the per cpu page sets could be
> repopulated at any point, making the combination of draining and
> modifying ->batch & ->high unneeded.

Then memory hotplug would have to try to drain again and again. Moreover,
hotplug prevents repopulation by using MIGRATE_ISOLATE; without the drain,
the pcp page count never reaches 0, and that prevents memory hot remove.
Re: [PATCH] kernel: tsacct: strncpy, always be sure of NUL terminated.
On Sun, Apr 7, 2013 at 11:27 PM, Chen Gang wrote:
>
> for a NUL terminated string, always set '\0' at the end.
>
> Signed-off-by: Chen Gang
> ---
>  kernel/tsacct.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/tsacct.c b/kernel/tsacct.c
> index a1dd9a1..01bcc4e 100644
> --- a/kernel/tsacct.c
> +++ b/kernel/tsacct.c
> @@ -78,7 +78,8 @@ void bacct_add_tsk(struct user_namespace *user_ns,
>         stats->ac_minflt = tsk->min_flt;
>         stats->ac_majflt = tsk->maj_flt;
>
> -       strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
> +       strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm) - 1);
> +       stats->ac_comm[sizeof(stats->ac_comm) - 1] = '\0';

sizeof(tsk->comm) is 16 and sizeof(stats->ac_comm) is 32, so this
statement is strange. Also, set_task_comm() ensures tsk->comm is
NUL-terminated, so your code never changes the behavior, right?

And if a buggy driver changed tsk->comm without using set_task_comm() and
tsk->comm were not NUL-terminated, strncpy() would still read unrelated
memory and ac_comm might expose kernel internal info. That's bad.
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
(4/5/13 4:33 PM), Cody P Schafer wrote:
> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
> Updating it without being on the cpu owning the percpu pageset
> potentially destroys this stability.
>
> Change for_each_cpu() to on_each_cpu() to fix.
>
> Signed-off-by: Cody P Schafer

Acked-by: KOSAKI Motohiro
Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
(4/5/13 4:33 PM), Cody P Schafer wrote:
> No off-cpu users of the percpu pagesets exist.
>
> zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
> percpu pageset based on a zone's ->managed_pages. We don't need to drain
> the entire percpu pageset just to modify these fields. Avoid calling
> setup_pageset() (and the draining required to call it) and instead just
> set the fields' values.
>
> This does change the behavior of zone_pcp_update() as the percpu
> pagesets will not be drained when zone_pcp_update() is called (they will
> end up being shrunk, not completely drained, later when a 0-order page
> is freed in free_hot_cold_page()).
>
> Signed-off-by: Cody P Schafer

NAK.

1) zone_pcp_update() is only used from memory hotplug, and it requires a
page drain.

2) stop_machine is used for avoiding races. Just removing it is insane.
Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch
(4/5/13 4:33 PM), Cody P Schafer wrote:
> In one case while modifying the ->high and ->batch fields of per cpu
> pagesets we're needlessly using stop_machine() (patches 1 & 2), and in
> another we don't have any synchronization at all (patch 3).
>
> This patchset fixes both of them.
>
> Note that it results in a change to the behavior of zone_pcp_update(),
> which is used by memory_hotplug. I _think_ that I've discerned (and
> preserved) the essential behavior (changing ->high and ->batch), and only
> eliminated unneeded actions (draining the per cpu pages), but this may
> not be the case.

At least, memory hotplug needs to drain.
Re: [RFC][PATCH 0/9] extend hugepage migration
>> Please refer to hugetlb_fault for more information.
>
> Thanks for pointing that out. So my assumption is correct, is it? Can a
> pmd which supports 2MB map 32MB pages work well?

Simon, please stop hijacking unrelated threads. This is not a question
and answer thread.
Re: [RFC][PATCH 0/9] extend hugepage migration
Please refer to hugetlb_fault for more information. Thanks for your pointing out. So my assume is correct, is it? Can pmd which support 2MB map 32MB pages work well? Simon, Please stop hijaking unrelated threads. This is not question and answer thread. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's -high and -batch
(4/5/13 4:33 PM), Cody P Schafer wrote: In one case while modifying the -high and -batch fields of per cpu pagesets we're unneededly using stop_machine() (patches 1 2), and in another we don't have any syncronization at all (patch 3). This patchset fixes both of them. Note that it results in a change to the behavior of zone_pcp_update(), which is used by memory_hotplug. I _think_ that I've diserned (and preserved) the essential behavior (changing -high and -batch), and only eliminated unneeded actions (draining the per cpu pages), but this may not be the case. at least, memory hotplug need to drain. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()
(4/5/13 4:33 PM), Cody P Schafer wrote: No off-cpu users of the percpu pagesets exist. zone_pcp_update()'s goal is to adjust the -high and -mark members of a percpu pageset based on a zone's -managed_pages. We don't need to drain the entire percpu pageset just to modify these fields. Avoid calling setup_pageset() (and the draining required to call it) and instead just set the fields' values. This does change the behavior of zone_pcp_update() as the percpu pagesets will not be drained when zone_pcp_update() is called (they will end up being shrunk, not completely drained, later when a 0-order page is freed in free_hot_cold_page()). Signed-off-by: Cody P Schafer c...@linux.vnet.ibm.com NAK. 1) zone_pcp_update() is only used from memory hotplug and it require page drain. 2) stop_machin is used for avoiding race. just removing it is insane. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.
(4/5/13 4:33 PM), Cody P Schafer wrote: In free_hot_cold_page(), we rely on pcp-batch remaining stable. Updating it without being on the cpu owning the percpu pageset potentially destroys this stability. Change for_each_cpu() to on_each_cpu() to fix. Signed-off-by: Cody P Schafer c...@linux.vnet.ibm.com Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kernel: tsacct: strncpy, always be sure of NUL terminated.
On Sun, Apr 7, 2013 at 11:27 PM, Chen Gang gang.c...@asianux.com wrote: for a NUL terminated string, always set '\0' at the end. Signed-off-by: Chen Gang gang.c...@asianux.com --- kernel/tsacct.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/kernel/tsacct.c b/kernel/tsacct.c index a1dd9a1..01bcc4e 100644 --- a/kernel/tsacct.c +++ b/kernel/tsacct.c @@ -78,7 +78,8 @@ void bacct_add_tsk(struct user_namespace *user_ns, stats->ac_minflt = tsk->min_flt; stats->ac_majflt = tsk->maj_flt; - strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm)); + strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm) - 1); + stats->ac_comm[sizeof(stats->ac_comm) - 1] = '\0'; sizeof(tsk->comm) is 16 and sizeof(stats->ac_comm) is 32, so this statement is strange. And set_task_comm() ensures tsk->comm is NUL-terminated, so your code never changes the behavior, right? And if a buggy driver changes tsk->comm without using set_task_comm() and tsk->comm is not NUL-terminated, strncpy() still touches unrelated memory and ac_comm may expose kernel internal info. That's bad.
Re: [PATCH 4/4] fsfreeze: avoid to return zero in __get_user_pages
On Sat, Apr 6, 2013 at 6:07 AM, Marco Stornelli wrote: > In case of VM_FAULT_RETRY, __get_user_pages returns the number > of pages already gotten, but there isn't a check if this number is > zero. Instead, we have to return a proper error code so we can avoid > a possible extra call of __get_user_pages. There are several > places where get_user_pages is called inside a loop until all the > pages requested are gotten or an error code is returned. > > Signed-off-by: Marco Stornelli > --- > mm/memory.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index 494526a..cca14ed 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1858,7 +1858,7 @@ long __get_user_pages(struct task_struct *tsk, struct > mm_struct *mm, > if (ret & VM_FAULT_RETRY) { > if (nonblocking) > *nonblocking = 0; > - return i; > + return i ? i : -ERESTARTSYS; The nonblocking argument is only used from __mm_populate(), and it expects __get_user_pages() to return 0.
Re: [PATCH 10/10] prepare to remove /proc/sys/vm/hugepages_treat_as_movable
(3/25/13 11:12 AM), Michal Hocko wrote: > On Fri 22-03-13 16:23:55, Naoya Horiguchi wrote: > [...] >> @@ -2086,11 +2085,7 @@ int hugetlb_treat_movable_handler(struct ctl_table >> *table, int write, >> void __user *buffer, >> size_t *length, loff_t *ppos) >> { >> -proc_dointvec(table, write, buffer, length, ppos); >> -if (hugepages_treat_as_movable) >> -htlb_alloc_mask = GFP_HIGHUSER_MOVABLE; >> -else >> -htlb_alloc_mask = GFP_HIGHUSER; >> +/* hugepages_treat_as_movable is obsolete and to be removed. */ > > WARN_ON_ONCE("This knob is obsolete and has no effect. It is scheduled for > removal") Indeed.
Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
(3/22/13 4:23 PM), Naoya Horiguchi wrote: > Currently we can't offline memory blocks which contain hugepages because > a hugepage is considered as an unmovable page. But now with this patch > series, a hugepage has become movable, so by using hugepage migration we > can offline such memory blocks. > > What's different from other users of hugepage migration is that we need > to decompose all the hugepages inside the target memory block into free > buddy pages after hugepage migration, because otherwise free hugepages > remaining in the memory block interfere with the memory offlining. > For this reason we introduce the new functions dissolve_free_huge_page() and > dissolve_free_huge_pages(). > > Other than that, what this patch does is straightforward: add hugepage > migration code, that is, add hugepage code to the functions which scan > over pfns and collect hugepages to be migrated, and add a hugepage > allocation function to alloc_migrate_target(). > > As for larger hugepages (1GB for x86_64), it's not easy to do hotremove > over them because they're larger than a memory block. So for now we simply leave > them to fail as they are. 
> > ChangeLog v2: > - changed return value type of is_hugepage_movable() to bool > - is_hugepage_movable() uses list_for_each_entry() instead of *_safe() > - moved if(PageHuge) block before get_page_unless_zero() in > do_migrate_range() > - do_migrate_range() returns -EBUSY for hugepages larger than memory block > - dissolve_free_huge_pages() calculates scan step and sets it to minimum >hugepage size > > Signed-off-by: Naoya Horiguchi > --- > include/linux/hugetlb.h | 6 + > mm/hugetlb.c| 58 > + > mm/memory_hotplug.c | 42 +++ > mm/page_alloc.c | 12 ++ > mm/page_isolation.c | 5 + > 5 files changed, 114 insertions(+), 9 deletions(-) > > diff --git v3.9-rc3.orig/include/linux/hugetlb.h > v3.9-rc3/include/linux/hugetlb.h > index 981eff8..8220a8a 100644 > --- v3.9-rc3.orig/include/linux/hugetlb.h > +++ v3.9-rc3/include/linux/hugetlb.h > @@ -69,6 +69,7 @@ int dequeue_hwpoisoned_huge_page(struct page *page); > void putback_active_hugepage(struct page *page); > void putback_active_hugepages(struct list_head *l); > void migrate_hugepage_add(struct page *page, struct list_head *list); > +bool is_hugepage_movable(struct page *page); > void copy_huge_page(struct page *dst, struct page *src); > > extern unsigned long hugepages_treat_as_movable; > @@ -134,6 +135,7 @@ static inline int dequeue_hwpoisoned_huge_page(struct > page *page) > #define putback_active_hugepage(p) 0 > #define putback_active_hugepages(l) 0 > #define migrate_hugepage_add(p, l) 0 > +#define is_hugepage_movable(x) 0 This should be false instead of 0. 
> static inline void copy_huge_page(struct page *dst, struct page *src) > { > } > @@ -356,6 +358,9 @@ static inline int hstate_index(struct hstate *h) > return h - hstates; > } > > +extern void dissolve_free_huge_pages(unsigned long start_pfn, > + unsigned long end_pfn); > + > #else > struct hstate {}; > #define alloc_huge_page(v, a, r) NULL > @@ -376,6 +381,7 @@ static inline unsigned int pages_per_huge_page(struct > hstate *h) > } > #define hstate_index_to_shift(index) 0 > #define hstate_index(h) 0 > +#define dissolve_free_huge_pages(s, e) 0 No need for the 0 here. > #endif > > #endif /* _LINUX_HUGETLB_H */ > diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c > index d9d3dd7..ef79871 100644 > --- v3.9-rc3.orig/mm/hugetlb.c > +++ v3.9-rc3/mm/hugetlb.c > @@ -844,6 +844,36 @@ static int free_pool_huge_page(struct hstate *h, > nodemask_t *nodes_allowed, > return ret; > } > > +/* Dissolve a given free hugepage into free pages. */ > +static void dissolve_free_huge_page(struct page *page) > +{ > + spin_lock(&hugetlb_lock); > + if (PageHuge(page) && !page_count(page)) { > + struct hstate *h = page_hstate(page); > + int nid = page_to_nid(page); > + list_del(&page->lru); > + h->free_huge_pages--; > + h->free_huge_pages_node[nid]--; > + update_and_free_page(h, page); > + } > + spin_unlock(&hugetlb_lock); > +} > + > +/* Dissolve free hugepages in a given pfn range. Used by memory hotplug. */ > +void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn) > +{ > + unsigned int order = 8 * sizeof(void *); > + unsigned long pfn; > + struct hstate *h; > + > + /* Set scan step to minimum hugepage size */ > + for_each_hstate(h) > + if (order > huge_page_order(h)) > + order = huge_page_order(h); > + for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order) > + dissolve_free_huge_page(pfn_to_page(pfn)); > +} hotplug.c must not have such a pure hugepage function. > + > static struct page *alloc_buddy_huge_page(struct hstate *h, int nid) >
Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()
>> -if (!new_hpage) >> +/* >> + * Getting a new hugepage with alloc_huge_page() (which can happen >> + * when migration is caused by mbind()) can return an ERR_PTR value, >> + * so we need to take care of the case here. >> + */ >> +if (!new_hpage || IS_ERR_VALUE(new_hpage)) >> return -ENOMEM; > > Please no. get_new_page returns NULL or a page. You are hooking a wrong > callback here. The error value doesn't make any sense here. IMO you > should just wrap alloc_huge_page by something that returns NULL or page. I suggest just the opposite way: new_vma_page() always returns ENOMEM, ENOSPC etc. instead of NULL, and the caller propagates it to userland. I guess userland wants to distinguish why mbind() failed. Anyway, if new_vma_page() has a chance to return both NULL and -ENOMEM, that's a bug.
Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()
> @@ -1277,14 +1279,10 @@ static long do_mbind(unsigned long start, unsigned > long len, > if (!err) { > int nr_failed = 0; > > - if (!list_empty(&pagelist)) { > - WARN_ON_ONCE(flags & MPOL_MF_LAZY); > - nr_failed = migrate_pages(&pagelist, new_vma_page, + WARN_ON_ONCE(flags & MPOL_MF_LAZY); ??? MPOL_MF_LAZY always outputs a warning? That seems really insane.
Re: [PATCH 05/10] migrate: add hugepage migration code to migrate_pages()
(3/22/13 4:23 PM), Naoya Horiguchi wrote: > This patch extends check_range() to handle vma with VM_HUGETLB set. > We will be able to migrate hugepage with migrate_pages(2) after > applying the enablement patch which comes later in this series. > > Note that for larger hugepages (covered by pud entries, 1GB for > x86_64 for example), we simply skip it now. check_range() largely duplicates mm_walk, and it is a quirky subset of it. Instead, could you replace it with mm_walk and enhance/clean up mm_walk?
Re: [PATCH 04/10] migrate: clean up migrate_huge_page()
(3/22/13 4:23 PM), Naoya Horiguchi wrote: > Due to the previous patch, soft_offline_huge_page() switches to use > migrate_pages(), and migrate_huge_page() is not used any more. > So let's remove it. > > Signed-off-by: Naoya Horiguchi Acked-by: KOSAKI Motohiro
Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
(3/27/13 9:00 AM), Michal Hocko wrote: > On Tue 26-03-13 16:35:35, Naoya Horiguchi wrote: > [...] >> The difference is that migrate_huge_page() has one hugepage as an argument, >> and migrate_pages() has a pagelist with multiple hugepages. >> I already told this before and I'm not sure it's enough to answer the >> question, >> so I explain another point about why this patch does it like this. > > OK, I am blind. It is > + list_move(&hpage->lru, &pagelist); > + ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, > + MIGRATE_SYNC, MR_MEMORY_FAILURE); > > which moves it from active_list and so you have to put it back. > >> I think that we must do putback_*pages() for source pages whether migration >> succeeds or not. >> But when we call migrate_pages() with a pagelist, >> the caller can't access to the successfully migrated source pages >> after migrate_pages() returns, because they are no longer on the pagelist. >> So putback of the successfully migrated source pages should be done *in* >> unmap_and_move() and/or unmap_and_move_huge_page(). > > If the migration succeeds then the page becomes unused and free after > its last reference drops. So I do not see any reason to put it back to > active list and free it right afterwards. > On the other hand unmap_and_move does the same thing (although page > reference counting is a bit more complicated in that case) so it would > be good to keep in sync with regular pages case. Even if pages are isolated from lists, there are several page-count-increasing paths. So putback_pages() closes a race when the page count != 1. I'm not sure, but I guess follow_hugepage() can cause the same race.
Re: [PATCH 02/10] migrate: make core migration code aware of hugepage
>>> There doesn't seem to be any caller for this function. Please move it to >>> the patch which uses it. >> >> I would do like that if there's only one user of this function, but I thought >> that it's better to separate this part as changes of common code >> because this function is commonly used by multiple users which are added by >> multiple patches later in this series. > > Sure there is no hard rule for this. I just find it much easier to > review if there is a caller of introduced functionality. In this > particular case I found out only later that many migrate_pages callers > were changed to use migrate_movable_pages and made the > putback_movable_pages cleanup inconsistent between the two. > > It would help to mention what is the planned future usage of the > introduced function if you prefer to introduce it without users. I strongly agree with Michal.
Re: [PATCH 01/10] migrate: add migrate_entry_wait_huge()
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c > index 0a0be33..98a478e 100644 > --- v3.9-rc3.orig/mm/hugetlb.c > +++ v3.9-rc3/mm/hugetlb.c > @@ -2819,7 +2819,7 @@ int hugetlb_fault(struct mm_struct *mm, struct > vm_area_struct *vma, > if (ptep) { > entry = huge_ptep_get(ptep); > if (unlikely(is_hugetlb_entry_migration(entry))) { > - migration_entry_wait(mm, (pmd_t *)ptep, address); > + migration_entry_wait_huge(mm, (pmd_t *)ptep, address); Hm. How do you test this? From the x86 point of view, this patch seems unnecessary because hugetlb_fault() calls "address &= hugetlb_mask()" first, and then migration_entry_wait() can grab the right pte lock. And from the !x86 point of view, this function still doesn't work because huge page != pmd on some arches. I might be missing something though.
Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size()
(4/3/13 2:35 PM), Naoya Horiguchi wrote: > Documentation/filesystems/proc.txt says about the coredump_filter bitmask, > > Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only > effected by bit 5-6. > > However current code can go into the subsequent flag checks of bit 0-4 > for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work > as written in the document. > > Signed-off-by: Naoya Horiguchi > Cc: sta...@vger.kernel.org If I were you, I would merge this patch into [1/3] because both patches address the same regression, but it is no big matter. Acked-by: KOSAKI Motohiro
Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
(4/3/13 2:35 PM), Naoya Horiguchi wrote: > With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in > initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory > error happens on a hugepage and the affected processes try to access > the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) > in get_page(). > > The reason for this bug is that coredump-related code doesn't recognise > "hugepage hwpoison entry" with which a pmd entry is replaced when a memory > error occurs on a hugepage. > In other words, physical address information is stored in a different bit layout > between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page() > which is called in get_dump_page() returns a wrong page from a given address. > > We need to filter out only hwpoison hugepages to have data on healthy > hugepages in coredump. So this patch makes follow_hugetlb_page() avoid > trying to get page when a pmd is in swap entry like format. > > ChangeLog v3: > - add comment about using is_swap_pte() > > Signed-off-by: Naoya Horiguchi > Reviewed-by: Michal Hocko > Acked-by: Konstantin Khlebnikov > Cc: sta...@vger.kernel.org > --- > mm/hugetlb.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c > index 0d1705b..3bc20bd 100644 > --- v3.9-rc3.orig/mm/hugetlb.c > +++ v3.9-rc3/mm/hugetlb.c > @@ -2966,9 +2966,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct > vm_area_struct *vma, >* Some archs (sparc64, sh*) have multiple pte_ts to >* each hugepage. We have to make sure we get the >* first, for the page indexing below to work. > + * > + * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned > + * and hugepages under migration in which case > + * hugetlb_fault waits for the migration and bails out > + * properly for HWPoisoned pages. >*/ > pte = huge_pte_offset(mm, vaddr & huge_page_mask(h)); > - absent = !pte || huge_pte_none(huge_ptep_get(pte)); > + absent = !pte || huge_pte_none(huge_ptep_get(pte)) || > + is_swap_pte(huge_ptep_get(pte)); Hmmm... Now absent has two meanings: 1) skip hugetlb_fault() and return immediately if FOLL_DUMP is used; 2) call hugetlb_fault() if page population or CoW is needed. The description of this patch only explains (2), and I'm not convinced why we don't need to dump pages under migration.
Re: [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)
(4/3/13 2:35 PM), Naoya Horiguchi wrote: > Currently we fail to include any data on hugepages into coredump, > because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently > introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED and > mm->reserved_vm counter". This looks to me a serious regression, > so let's fix it. I don't think this is enough explanation. Let's explain the meaning of the code in chronological order. First, we had neither madvise(DONTDUMP) nor coredump_filter(HUGETLB); hugetlb pages were never dumped. Second, I added coredump_filter(HUGETLB), and then vm_dump_size() became... vm_dump_size() { /* Hugetlb memory check */ if (vma->vm_flags & VM_HUGETLB) { .. goto whole; } if (vma->vm_flags & VM_RESERVED) return 0; The point is, hugetlb was checked before VM_RESERVED, i.e. hugetlb core dump ignored VM_RESERVED. At this time, the "if (vma->vm_flags & VM_HUGETLB)" statement didn't need to return 0 because VM_RESERVED prevented going into the subsequent flag checks. Third, Jason added madvise(DONTDUMP). Then vm_dump_size() became... vm_dump_size() { if (vma->vm_flags & VM_NODUMP) return 0; /* Hugetlb memory check */ if (vma->vm_flags & VM_HUGETLB) { .. goto whole; } if (vma->vm_flags & VM_RESERVED) return 0; Look, VM_NODUMP and VM_RESERVED had similar but different meanings at this time. Finally, Konstantin removed VM_RESERVED, and the hugetlb coredump behavior changed. Thus, patches [1/3] and [2/3] should be marked [stable for v3.6 or later]. Anyway, this patch is correct. Thank you! Acked-by: KOSAKI Motohiro
[PATCH] posix-cpu-timers: fix counting delta_exec twice
From: KOSAKI Motohiro Currently the glibc rt/cpuclock2 test(*) sporadically fails. The pseudo test code is here. t0 = clock_gettime() abs = t0 + sleeptime; clock_nanosleep(TIMER_ABSTIME, abs) t1 = clock_gettime() assert(t1-t0 > sleeptime) assert(t1 > abs) This is because the current signal->cputimer->cputime logic can wrongly add delta_exec twice. The first is in initialization: thread_group_cputimer() calls thread_group_cputime() and it adds delta_exec. And later, update_curr() adds delta_exec again. Finally, clock_nanosleep() wakes up slightly earlier than the passed argument, and that is a POSIX violation. This issue was introduced by commit d670ec1317 (posix-cpu-timers: Cure SMP wobbles). (*) http://sourceware.org/git/?p=glibc.git;a=blob;f=rt/tst-cpuclock2.c;h=6752721717f959e89c0d692b3f1ee082d507eec2;hb=HEAD Cc: Peter Zijlstra Cc: David Miller Cc: sta...@kernel.org Cc: Thomas Gleixner Cc: Frederic Weisbecker Signed-off-by: KOSAKI Motohiro --- fs/binfmt_elf.c |2 +- fs/binfmt_elf_fdpic.c |2 +- include/linux/sched.h |4 ++-- kernel/posix-cpu-timers.c | 10 +- kernel/sched/core.c |6 -- kernel/sched/cputime.c|8 6 files changed, 17 insertions(+), 15 deletions(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 3939829..e139e18 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1321,7 +1321,7 @@ static void fill_prstatus(struct elf_prstatus *prstatus, * This is the record for the group leader. It shows the * group-wide total, not its individual thread total. */ - thread_group_cputime(p, &cputime); + thread_group_cputime(p, true, &cputime); cputime_to_timeval(cputime.utime, &prstatus->pr_utime); cputime_to_timeval(cputime.stime, &prstatus->pr_stime); } else { diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c index 9c13e02..ab5b508 100644 --- a/fs/binfmt_elf_fdpic.c +++ b/fs/binfmt_elf_fdpic.c @@ -1371,7 +1371,7 @@ static void fill_prstatus(struct elf_prstatus *prstatus, * This is the record for the group leader. It shows the * group-wide total, not its individual thread total. */ - thread_group_cputime(p, &cputime); + thread_group_cputime(p, true, &cputime); cputime_to_timeval(cputime.utime, &prstatus->pr_utime); cputime_to_timeval(cputime.stime, &prstatus->pr_stime); } else { diff --git a/include/linux/sched.h b/include/linux/sched.h index d35d2b6..deb49f3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2001,7 +2001,7 @@ static inline void disable_sched_clock_irqtime(void) {} #endif extern unsigned long long -task_sched_runtime(struct task_struct *task); +task_sched_runtime(struct task_struct *task, bool use_delta); /* sched_exec is called by processes performing an exec */ #ifdef CONFIG_SMP @@ -2624,7 +2624,7 @@ static inline int spin_needbreak(spinlock_t *lock) /* * Thread group CPU time accounting. */ -void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times); +void thread_group_cputime(struct task_struct *tsk, bool use_delta, struct task_cputime *times); void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times); static inline void thread_group_cputime_init(struct signal_struct *sig) diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c index 8fd709c..948004d 100644 --- a/kernel/posix-cpu-timers.c +++ b/kernel/posix-cpu-timers.c @@ -220,7 +220,7 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p, cpu->cpu = virt_ticks(p); break; case CPUCLOCK_SCHED: - cpu->sched = task_sched_runtime(p); + cpu->sched = task_sched_runtime(p, true); break; } return 0; @@ -251,7 +251,7 @@ void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times) * to synchronize the timer to the clock every time we start * it. */ - thread_group_cputime(tsk, &sum); + thread_group_cputime(tsk, false, &sum); raw_spin_lock_irqsave(&cputimer->lock, flags); cputimer->running = 1; update_gt_cputime(&cputimer->cputime, &sum); @@ -275,15 +275,15 @@ static int cpu_clock_sample_group(const clockid_t which_clock, default: return -EINVAL; case CPUCLOCK_PROF: - thread_group_cputime(p, &cputime); + thread_group_cputime(p, true, &cputime); cpu->cpu = cputime.utime + cputime.stime; break; case CPUCLOCK_VIRT: - thread_group_cputime(p, &cputime); + thread_group_cputime(p, true, &cputime); cpu->cpu = cputime.utime; break; case CPUCLOCK_SCHED: - thread_group_cputime(p,
[PATCH] posix-cpu-timers: fix counting delta_exec twice
From: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com Currently glibc rt/cpuclock2 test(*) sporadically fail. The pseudo test code is here. t0 = clock_gettime() abs = t0 + sleeptime; clock_nanosleep(TIMER_ABSTIME, abs) t1 = clock_gettime() assert(t1-t0 sleeptime) assert(t1 abs) Because current signal-cputimer-cputime logic can add delta_exec twice wrongly. The first is in initialization, thread_group_cputimer() call thread_group_cputime() and it adds delta_exec. and later, update_curr() adds delta_exec again. Finally, clock_nanosleep() wakes up slightly before than passed argument and it is posix violation. This issue was introduced by commit d670ec1317 (posix-cpu-timers: Cure SMP wobbles). (*) http://sourceware.org/git/?p=glibc.git;a=blob;f=rt/tst-cpuclock2.c;h=6752721717f959e89c0d692b3f1ee082d507eec2;hb=HEAD Cc: Peter Zijlstra a.p.zijls...@chello.nl Cc: David Miller da...@davemloft.net Cc: sta...@kernel.org Cc: Thomas Gleixner t...@linutronix.de Cc: Frederic Weisbecker fweis...@gmail.com Signed-off-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com --- fs/binfmt_elf.c |2 +- fs/binfmt_elf_fdpic.c |2 +- include/linux/sched.h |4 ++-- kernel/posix-cpu-timers.c | 10 +- kernel/sched/core.c |6 -- kernel/sched/cputime.c|8 6 files changed, 17 insertions(+), 15 deletions(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 3939829..e139e18 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1321,7 +1321,7 @@ static void fill_prstatus(struct elf_prstatus *prstatus, * This is the record for the group leader. It shows the * group-wide total, not its individual thread total. 
*/ - thread_group_cputime(p, cputime); + thread_group_cputime(p, true, cputime); cputime_to_timeval(cputime.utime, prstatus-pr_utime); cputime_to_timeval(cputime.stime, prstatus-pr_stime); } else { diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c index 9c13e02..ab5b508 100644 --- a/fs/binfmt_elf_fdpic.c +++ b/fs/binfmt_elf_fdpic.c @@ -1371,7 +1371,7 @@ static void fill_prstatus(struct elf_prstatus *prstatus, * This is the record for the group leader. It shows the * group-wide total, not its individual thread total. */ - thread_group_cputime(p, cputime); + thread_group_cputime(p, true, cputime); cputime_to_timeval(cputime.utime, prstatus-pr_utime); cputime_to_timeval(cputime.stime, prstatus-pr_stime); } else { diff --git a/include/linux/sched.h b/include/linux/sched.h index d35d2b6..deb49f3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2001,7 +2001,7 @@ static inline void disable_sched_clock_irqtime(void) {} #endif extern unsigned long long -task_sched_runtime(struct task_struct *task); +task_sched_runtime(struct task_struct *task, bool use_delta); /* sched_exec is called by processes performing an exec */ #ifdef CONFIG_SMP @@ -2624,7 +2624,7 @@ static inline int spin_needbreak(spinlock_t *lock) /* * Thread group CPU time accounting. 
*/ -void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times); +void thread_group_cputime(struct task_struct *tsk, bool use_delta, struct task_cputime *times); void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times); static inline void thread_group_cputime_init(struct signal_struct *sig) diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c index 8fd709c..948004d 100644 --- a/kernel/posix-cpu-timers.c +++ b/kernel/posix-cpu-timers.c @@ -220,7 +220,7 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p, cpu-cpu = virt_ticks(p); break; case CPUCLOCK_SCHED: - cpu-sched = task_sched_runtime(p); + cpu-sched = task_sched_runtime(p, true); break; } return 0; @@ -251,7 +251,7 @@ void thread_group_cputimer(struct task_struct *tsk, struct task_cputime *times) * to synchronize the timer to the clock every time we start * it. */ - thread_group_cputime(tsk, sum); + thread_group_cputime(tsk, false, sum); raw_spin_lock_irqsave(cputimer-lock, flags); cputimer-running = 1; update_gt_cputime(cputimer-cputime, sum); @@ -275,15 +275,15 @@ static int cpu_clock_sample_group(const clockid_t which_clock, default: return -EINVAL; case CPUCLOCK_PROF: - thread_group_cputime(p, cputime); + thread_group_cputime(p, true, cputime); cpu-cpu = cputime.utime + cputime.stime; break; case CPUCLOCK_VIRT: - thread_group_cputime(p, cputime
Re: [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)
(4/3/13 2:35 PM), Naoya Horiguchi wrote:
> Currently we fail to include any data on hugepages into coredump,
> because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was
> recently introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED
> and mm->reserved_vm counter".
> This looks to me a serious regression, so let's fix it.

I don't think this is enough explanation. Let me explain how the code
evolved over time.

First, we had neither madvise(DONTDUMP) nor coredump_filter(HUGETLB),
and hugetlb pages were never dumped.

Second, I added coredump_filter(HUGETLB), and then vm_dump_size() became:

	vm_dump_size()
	{
		/* Hugetlb memory check */
		if (vma->vm_flags & VM_HUGETLB) {
			...
			goto whole;
		}

		if (vma->vm_flags & VM_RESERVED)
			return 0;

The point is, hugetlb was checked before VM_RESERVED, i.e. hugetlb
coredump ignored VM_RESERVED. At this time, the
"if (vma->vm_flags & VM_HUGETLB)" branch didn't need a "return 0"
because VM_RESERVED prevented it from reaching the subsequent flag
checks.

Third, Jason added madvise(DONTDUMP), and vm_dump_size() became:

	vm_dump_size()
	{
		if (vma->vm_flags & VM_NODUMP)
			return 0;

		/* Hugetlb memory check */
		if (vma->vm_flags & VM_HUGETLB) {
			...
			goto whole;
		}

		if (vma->vm_flags & VM_RESERVED)
			return 0;

Look, VM_NODUMP and VM_RESERVED had similar but different meanings at
this time.

Finally, Konstantin removed VM_RESERVED, and the hugetlb coredump
behavior changed.

Thus, patches [1/3] and [2/3] should be marked [stable for v3.6 or later].

Anyway, this patch is correct. Thank you!

Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()
(4/3/13 2:35 PM), Naoya Horiguchi wrote:
> With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
> initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
> error happens on a hugepage and the affected processes try to access the
> error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0) in
> get_page().
>
> The reason for this bug is that coredump-related code doesn't recognise
> the "hugepage hwpoison entry" with which a pmd entry is replaced when a
> memory error occurs on a hugepage. In other words, physical address
> information is stored in a different bit layout between the hugepage
> hwpoison entry and a pmd entry, so follow_hugetlb_page(), which is
> called in get_dump_page(), returns a wrong page from a given address.
>
> We need to filter out only hwpoison hugepages to have data on healthy
> hugepages in coredump. So this patch makes follow_hugetlb_page() avoid
> trying to get the page when a pmd is in swap-entry-like format.
>
> ChangeLog v3:
>  - add comment about using is_swap_pte()
>
> Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com
> Reviewed-by: Michal Hocko mho...@suse.cz
> Acked-by: Konstantin Khlebnikov khlebni...@openvz.org
> Cc: sta...@vger.kernel.org
> ---
>  mm/hugetlb.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index 0d1705b..3bc20bd 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -2966,9 +2966,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * Some archs (sparc64, sh*) have multiple pte_ts to
>  		 * each hugepage. We have to make sure we get the
>  		 * first, for the page indexing below to work.
> +		 *
> +		 * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned
> +		 * and hugepages under migration, in which case
> +		 * hugetlb_fault waits for the migration and bails out
> +		 * properly for HWPoisoned pages.
>  		 */
>  		pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
> -		absent = !pte || huge_pte_none(huge_ptep_get(pte));
> +		absent = !pte || huge_pte_none(huge_ptep_get(pte)) ||
> +			is_swap_pte(huge_ptep_get(pte));

Hmmm... Now "absent" has two meanings:

1) skip hugetlb_fault() and return immediately if FOLL_DUMP is used.
2) call hugetlb_fault() if page population or COW is needed.

The description of this patch only explains (2), and I'm not convinced
why we don't need to dump pages under migration.
Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size()
(4/3/13 2:35 PM), Naoya Horiguchi wrote:
> Documentation/filesystems/proc.txt says about the coredump_filter bitmask:
>
>   Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are
>   only effected by bit 5-6.
>
> However, the current code can go into the subsequent flag checks of
> bit 0-4 for a vma(VM_HUGETLB). So this patch inserts a 'return' and
> makes it work as written in the document.
>
> Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com
> Cc: sta...@vger.kernel.org

If I were you, I would merge this patch into [1/3] because both patches
treat the same regression, but it is no big matter.

Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com
Re: [PATCH 01/10] migrate: add migrate_entry_wait_huge()
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index 0a0be33..98a478e 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -2819,7 +2819,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (ptep) {
>  		entry = huge_ptep_get(ptep);
>  		if (unlikely(is_hugetlb_entry_migration(entry))) {
> -			migration_entry_wait(mm, (pmd_t *)ptep, address);
> +			migration_entry_wait_huge(mm, (pmd_t *)ptep, address);

Hm. How do you test this?

From the x86 point of view, this patch seems unnecessary because
hugetlb_fault() does "address &= huge_page_mask(h)" at first, and then
migration_entry_wait() can grab the right pte lock. And from the !x86
point of view, this function still doesn't work because huge page != pmd
on some arches.

I might be missing something though.
Re: [PATCH 02/10] migrate: make core migration code aware of hugepage
>>> There doesn't seem to be any caller for this function. Please move it
>>> to the patch which uses it.
>>
>> I would do like that if there were only one user of this function, but
>> I thought that it's better to separate this part as changes of common
>> code because this function is commonly used by multiple users which are
>> added by multiple patches later in this series.
>
> Sure, there is no hard rule for this. I just find it much easier to
> review if there is a caller of the introduced functionality. In this
> particular case I found out only later that many migrate_pages callers
> were changed to use migrate_movable_pages, which made the
> putback_movable_pages cleanup inconsistent between the two.
>
> It would help to mention what the planned future usage of the
> introduced function is if you prefer to introduce it without users.

I strongly agree with Michal.
Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()
(3/27/13 9:00 AM), Michal Hocko wrote:
> On Tue 26-03-13 16:35:35, Naoya Horiguchi wrote:
> [...]
>> The difference is that migrate_huge_page() has one hugepage as an
>> argument, and migrate_pages() has a pagelist with multiple hugepages.
>> I already told this before and I'm not sure it's enough to answer the
>> question, so I explain another point about why this patch does it this
>> way.
>
> OK, I am blind. It is
>
> +	list_move(&hpage->lru, &pagelist);
> +	ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> +			    MIGRATE_SYNC, MR_MEMORY_FAILURE);
>
> which moves it from the active list, and so you have to put it back.
>
>> I think that we must do putback_*pages() for source pages whether
>> migration succeeds or not. But when we call migrate_pages() with a
>> pagelist, the caller can't access the successfully migrated source
>> pages after migrate_pages() returns, because they are no longer on the
>> pagelist. So putback of the successfully migrated source pages should
>> be done *in* unmap_and_move() and/or unmap_and_move_huge_page().
>
> If the migration succeeds then the page becomes unused and free after
> its last reference drops. So I do not see any reason to put it back to
> the active list and free it right afterwards. On the other hand,
> unmap_and_move() does the same thing (although page reference counting
> is a bit more complicated in that case), so it would be good to keep in
> sync with the regular pages case.

Even if pages are isolated from the lists, there are several
page-count-increasing paths. So putback_*pages() closes a race when
page count != 1. I'm not sure, but I guess follow_hugepage() can make
the same race.
Re: [PATCH 04/10] migrate: clean up migrate_huge_page()
(3/22/13 4:23 PM), Naoya Horiguchi wrote:
> Due to the previous patch, soft_offline_huge_page() switches to use
> migrate_pages(), and migrate_huge_page() is not used any more. So
> let's remove it.
>
> Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com

Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com
Re: [PATCH 05/10] migrate: add hugepage migration code to migrate_pages()
(3/22/13 4:23 PM), Naoya Horiguchi wrote:
> This patch extends check_range() to handle a vma with VM_HUGETLB set.
> We will be able to migrate hugepages with migrate_pages(2) after
> applying the enablement patch which comes later in this series.
>
> Note that for larger hugepages (covered by pud entries, 1GB for x86_64
> for example), we simply skip them for now.

check_range() largely duplicates mm_walk, and it is a quirky subset of
it. Instead, could you replace it with mm_walk and enhance/clean up
mm_walk?
Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()
> @@ -1277,14 +1279,10 @@ static long do_mbind(unsigned long start, unsigned long len,
>
>  	if (!err) {
>  		int nr_failed = 0;
>
> -		if (!list_empty(&pagelist)) {
> -			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
> -			nr_failed = migrate_pages(&pagelist, new_vma_page,
> +		WARN_ON_ONCE(flags & MPOL_MF_LAZY);

??? MPOL_MF_LAZY always outputs a warning? That seems really insane.
Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()
>> -	if (!new_hpage)
>> +	/*
>> +	 * Getting a new hugepage with alloc_huge_page() (which can happen
>> +	 * when migration is caused by mbind()) can return ERR_PTR value,
>> +	 * so we need take care of the case here.
>> +	 */
>> +	if (!new_hpage || IS_ERR_VALUE(new_hpage))
>>  		return -ENOMEM;
>
> Please no. get_new_page returns NULL or a page. You are hooking a wrong
> callback here. The error value doesn't make any sense here. IMO you
> should just wrap alloc_huge_page by something that returns NULL or a
> page.

I suggest just the opposite way: new_vma_page() always returns -ENOMEM,
-ENOSPC etc. instead of NULL, and the caller propagates it to userland.
I guess userland wants to distinguish why mbind() failed.

Anyway, if new_vma_page() has a chance to return both NULL and -ENOMEM,
that's a bug.
Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage
(3/22/13 4:23 PM), Naoya Horiguchi wrote:
> Currently we can't offline memory blocks which contain hugepages
> because a hugepage is considered an unmovable page. But now with this
> patch series, a hugepage has become movable, so by using hugepage
> migration we can offline such memory blocks.
>
> What's different from other users of hugepage migration is that we
> need to decompose all the hugepages inside the target memory block
> into free buddy pages after hugepage migration, because otherwise free
> hugepages remaining in the memory block intervene with the memory
> offlining. For this reason we introduce the new functions
> dissolve_free_huge_page() and dissolve_free_huge_pages().
>
> Other than that, what this patch does is straightforward: it adds
> hugepage migration code, that is, it adds hugepage code to the
> functions which scan over pfns and collect hugepages to be migrated,
> and adds a hugepage allocation function to alloc_migrate_target().
>
> As for larger hugepages (1GB for x86_64), it's not easy to do
> hot-remove over them because they are larger than a memory block. So
> we now simply leave it to fail as it is.
>
> ChangeLog v2:
>  - changed return value type of is_hugepage_movable() to bool
>  - is_hugepage_movable() uses list_for_each_entry() instead of *_safe()
>  - moved if(PageHuge) block before get_page_unless_zero() in do_migrate_range()
>  - do_migrate_range() returns -EBUSY for hugepages larger than memory block
>  - dissolve_free_huge_pages() calculates scan step and sets it to minimum
>    hugepage size
>
> Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com
> ---
>  include/linux/hugetlb.h |  6 +
>  mm/hugetlb.c            | 58 +
>  mm/memory_hotplug.c     | 42 +++
>  mm/page_alloc.c         | 12 ++
>  mm/page_isolation.c     |  5 +
>  5 files changed, 114 insertions(+), 9 deletions(-)
>
> diff --git v3.9-rc3.orig/include/linux/hugetlb.h v3.9-rc3/include/linux/hugetlb.h
> index 981eff8..8220a8a 100644
> --- v3.9-rc3.orig/include/linux/hugetlb.h
> +++ v3.9-rc3/include/linux/hugetlb.h
> @@ -69,6 +69,7 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
>  void putback_active_hugepage(struct page *page);
>  void putback_active_hugepages(struct list_head *l);
>  void migrate_hugepage_add(struct page *page, struct list_head *list);
> +bool is_hugepage_movable(struct page *page);
>  void copy_huge_page(struct page *dst, struct page *src);
>
>  extern unsigned long hugepages_treat_as_movable;
> @@ -134,6 +135,7 @@ static inline int dequeue_hwpoisoned_huge_page(struct page *page)
>  #define putback_active_hugepage(p)	0
>  #define putback_active_hugepages(l)	0
>  #define migrate_hugepage_add(p, l)	0
> +#define is_hugepage_movable(x)	0

Should be false instead of 0.

>  static inline void copy_huge_page(struct page *dst, struct page *src)
>  {
>  }
> @@ -356,6 +358,9 @@ static inline int hstate_index(struct hstate *h)
>  	return h - hstates;
>  }
>
> +extern void dissolve_free_huge_pages(unsigned long start_pfn,
> +				     unsigned long end_pfn);
> +
>  #else
>  struct hstate {};
>  #define alloc_huge_page(v, a, r) NULL
> @@ -376,6 +381,7 @@ static inline unsigned int pages_per_huge_page(struct hstate *h)
>  }
>  #define hstate_index_to_shift(index) 0
>  #define hstate_index(h) 0
> +#define dissolve_free_huge_pages(s, e) 0

No need for the 0.

>  #endif
>
>  #endif /* _LINUX_HUGETLB_H */
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index d9d3dd7..ef79871 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -844,6 +844,36 @@ static int free_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
>  	return ret;
>  }
>
> +/* Dissolve a given free hugepage into free pages. */
> +static void dissolve_free_huge_page(struct page *page)
> +{
> +	spin_lock(&hugetlb_lock);
> +	if (PageHuge(page) && !page_count(page)) {
> +		struct hstate *h = page_hstate(page);
> +		int nid = page_to_nid(page);
> +		list_del(&page->lru);
> +		h->free_huge_pages--;
> +		h->free_huge_pages_node[nid]--;
> +		update_and_free_page(h, page);
> +	}
> +	spin_unlock(&hugetlb_lock);
> +}
> +
> +/* Dissolve free hugepages in a given pfn range. Used by memory hotplug. */
> +void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> +{
> +	unsigned int order = 8 * sizeof(void *);
> +	unsigned long pfn;
> +	struct hstate *h;
> +
> +	/* Set scan step to minimum hugepage size */
> +	for_each_hstate(h)
> +		if (order > huge_page_order(h))
> +			order = huge_page_order(h);
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
> +		dissolve_free_huge_page(pfn_to_page(pfn));
> +}

hotplug.c must not have such a pure hugepage function.

>  static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
>  {
>  	struct page *page;
> @@ -3155,6 +3185,34 @@ static int
Re: [PATCH 10/10] prepare to remove /proc/sys/vm/hugepages_treat_as_movable
(3/25/13 11:12 AM), Michal Hocko wrote:
> On Fri 22-03-13 16:23:55, Naoya Horiguchi wrote:
> [...]
>> @@ -2086,11 +2085,7 @@ int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
>>  			void __user *buffer, size_t *length, loff_t *ppos)
>>  {
>> -	proc_dointvec(table, write, buffer, length, ppos);
>> -	if (hugepages_treat_as_movable)
>> -		htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
>> -	else
>> -		htlb_alloc_mask = GFP_HIGHUSER;
>> +	/* hugepages_treat_as_movable is obsolete and to be removed. */
>
> WARN_ON_ONCE("This knob is obsolete and has no effect. It is scheduled
> for removal")

Indeed.
Re: [patch] mm: speedup in __early_pfn_to_nid
> --- linux.orig/mm/page_alloc.c	2013-03-19 16:09:03.736450861 -0500
> +++ linux/mm/page_alloc.c	2013-03-22 17:07:43.895405617 -0500
> @@ -4161,10 +4161,23 @@ int __meminit __early_pfn_to_nid(unsigne
>  {
>  	unsigned long start_pfn, end_pfn;
>  	int i, nid;
> +	/*
> +	   NOTE: The following SMP-unsafe globals are only used early
> +	   in boot when the kernel is running single-threaded.
> +	 */
> +	static unsigned long last_start_pfn, last_end_pfn;
> +	static int last_nid;

Why don't you mark them __meminitdata? They seem freeable.
Re: mm: BUG in mempolicy's sp_insert
On Thu, Feb 28, 2013 at 1:53 AM, Hillf Danton wrote:
> On Thu, Feb 28, 2013 at 1:26 PM, KOSAKI Motohiro wrote:
>>> Insert new node after updating node in tree.
>>
>> Thanks. You are right. I could reproduce and verify it.
>
> Thank you too ;) pleasure to do minor work for you.
>
> btw, how about your belly now? fully recovered?

Yup. I learned a bit about US health care. =)