Re: [PATCH 04/10] posix-cpu-timers: don't account cpu timer after stopped thread runtime accounting

2013-04-29 Thread KOSAKI Motohiro
Hm. I'm not sure why this patch series starts at [patch 4/10]; maybe I need to
review my script again.
Anyway, patches 1-3 don't exist. Sorry for the confusion.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 1/3] process cputimer is moving faster than its corresponding clock

2013-04-28 Thread KOSAKI Motohiro
(4/27/13 12:40 AM), Olivier Langlois wrote:
> 
> 
> Forbids the cputimer to drift ahead of its process clock by
> blocking its update when a tick occurs while an autoreaping task
> is currently in do_exit(), between the call to release_task() and
> its final call to schedule().
> 
> Any task stats update after having called release_task() will
> be lost because they are added to the global process stats located
> in the signal struct from release_task().
> 
> Ideally, you should postpone the release_task() call after the
> final context switch to get all the stats added but this is
> more complex to achieve.
> 
> In other words, this slows down the cputimer so it keeps the same
> pace as the process clock, but in fact what should be done is to
> speed up the process clock by adding the missing stats to it.
> 
> Signed-off-by: Olivier Langlois 
> ---
>  kernel/sched/fair.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a33e59..52d7b10 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -708,7 +708,15 @@ static void update_curr(struct cfs_rq *cfs_rq)
>  
> trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> cpuacct_charge(curtask, delta_exec);
> -   account_group_exec_runtime(curtask, delta_exec);
> +   /*
> +    * Do not update the cputimer if the task has already been
> +    * released by release_task().
> +    *
> +    * It would be preferable to defer the autoreap release_task()
> +    * until after the last context switch, but that is harder to do.
> +    */
> +   if (likely(curtask->sighand))
> +           account_group_exec_runtime(curtask, delta_exec);
> }

I'm confused. glibc's rt/tst-cputimer1 doesn't have any thread-exit code, and
I haven't seen any issue in this accounting.
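The drift being debated here can be probed from user space without any thread-exit code. The sketch below is my own illustration, not glibc's tst-cputimer1 source: it simply checks that CLOCK_PROCESS_CPUTIME_ID never moves backward across a stretch of CPU work, which is the invariant a process CPU timer must also respect.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Read the process CPU clock in nanoseconds; returns -1 on error. */
static long long process_cpu_ns(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts))
		return -1;
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Burn a little CPU so the process clock has a chance to advance. */
static void burn(void)
{
	volatile unsigned long x = 0;
	unsigned long i;

	for (i = 0; i < 20UL * 1000 * 1000; i++)
		x += i;
}
```

A timer armed against this clock (as tst-cputimer1 does with timer_create()) should likewise never fire ahead of what the clock itself reports.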





Re: [PATCH] process cputimer is moving faster than its corresponding clock

2013-04-26 Thread KOSAKI Motohiro
On Fri, Apr 26, 2013 at 9:51 PM, Olivier Langlois
 wrote:
> On Fri, 2013-04-26 at 15:08 -0400, KOSAKI Motohiro wrote:
>> > I need to add that I can only confirm that to be true with
>> > sum_exec_runtime.
>> >
>> > To affirm it to be true for stime and utime would require more
>> > investigation. I didn't look them at all. I was only concerned with
>> > sum_exec_runtime.
>> >
>> > I will prepare a v2 of the patch accounting all the feedbacks that I
>> > received from KOSAKI Motohiro, Frederic Weisbecker and Peter Zijlstra
>> > and send it back here for further discussion.
>> >
>> > Thank you very much all!
>>
>> Do you mean your utime test case still fails? If you share your test case,
>> I'm going to look at your issue too.
>>
> Sure with pleasure. My testcase is glibc-2.17/rt/tst-cputimer1.c
>
> That being said, it strictly tests CPUCLOCK_SCHED timers. Hence my focus
> when modifying the code was strictly on sum_exec_runtime.
>
> If utime and stime components of cputimer are moving faster than their
> associated clock, this is something that I did not address.

Hmm... Sorry, I'm confused. 1) I haven't seen any glibc test failure after
applying my patch. 2) tst-cputimer1.c only has a CLOCK_PROCESS_CPUTIME_ID test
and doesn't have any utime or stime tests.

Please let me know if you've seen any failure after applying my patch.


Re: [PATCH] process cputimer is moving faster than its corresponding clock

2013-04-26 Thread KOSAKI Motohiro
> I need to add that I can only confirm that to be true with
> sum_exec_runtime.
> 
> To affirm it to be true for stime and utime would require more
> investigation. I didn't look them at all. I was only concerned with
> sum_exec_runtime.
> 
> I will prepare a v2 of the patch accounting all the feedbacks that I
> received from KOSAKI Motohiro, Frederic Weisbecker and Peter Zijlstra
> and send it back here for further discussion.
> 
> Thank you very much all!

Do you mean your utime test case still fails? If you share your test case,
I'm going to look at your issue too.




Re: [PATCH] process cputimer is moving faster than its corresponding clock

2013-04-19 Thread KOSAKI Motohiro
On Fri, Apr 19, 2013 at 10:38 AM, KOSAKI Motohiro
 wrote:
>> I feel we are hitting the same issue as this patch:
>> https://lkml.org/lkml/2013/4/5/116
>>
>> I'm adding Kosaki in Cc, who proposed roughly the same fix.
>
> Thanks for CCing me. I'm at LSF right now and can't read the whole pile of
> emails. However, the fix is definitely the same and I definitely agree with
> this approach.
>
> thank you.

And if I understand correctly, update_gt_cputime() is no longer
necessary after this patch, because time never goes backward.

What do you think?
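For context, the invariant behind that question can be sketched as a forward-only update. This is a simplification of what update_gt_cputime() guards, not the kernel source: the cached cputimer sample is only ever advanced, so if the underlying samples themselves can no longer go backward, the guard becomes redundant.

```c
/* Advance *cached only when the new sample is ahead of it, so the
 * published value can never run backward even if stale samples
 * arrive out of order. */
static void update_gt(unsigned long long *cached, unsigned long long sample)
{
	if (sample > *cached)
		*cached = sample;
}
```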


Re: [PATCH] process cputimer is moving faster than its corresponding clock

2013-04-19 Thread KOSAKI Motohiro
> I feel we are hitting the same issue as this patch:
> https://lkml.org/lkml/2013/4/5/116
>
> I'm adding Kosaki in Cc, who proposed roughly the same fix.

Thanks for CCing me. I'm at LSF right now and can't read the whole pile of
emails. However, the fix is definitely the same and I definitely agree with
this approach.

thank you.


Re: [PATCH] process cputimer is moving faster than its corresponding clock

2013-04-19 Thread KOSAKI Motohiro
 I feel we are hitting the same issue than this patch:
 https://lkml.org/lkml/2013/4/5/116

 I'm adding Kosaki in Cc, who proposed roughly the same fix.

Thanks to CCing. I'm now sitting LSF and I can't read whole tons emails.
However the fix is definitely same and I definitely agree this approach.

thank you.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] process cputimer is moving faster than its corresponding clock

2013-04-19 Thread KOSAKI Motohiro
On Fri, Apr 19, 2013 at 10:38 AM, KOSAKI Motohiro
kosaki.motoh...@gmail.com wrote:
 I feel we are hitting the same issue than this patch:
 https://lkml.org/lkml/2013/4/5/116

 I'm adding Kosaki in Cc, who proposed roughly the same fix.

 Thanks to CCing. I'm now sitting LSF and I can't read whole tons emails.
 However the fix is definitely same and I definitely agree this approach.

 thank you.

And if I understand correctly, update_gt_cputime() is no longer
necessary after this patch because time never makes backward.

What do you think?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Bug fix PATCH] numa, cpu hotplug: Change links of CPU and node when changing node number by onlining CPU

2013-04-18 Thread KOSAKI Motohiro
>  #ifdef CONFIG_HOTPLUG_CPU
> +static void change_cpu_under_node(struct cpu *cpu,
> +   unsigned int from_nid, unsigned int to_nid)
> +{
> +   int cpuid = cpu->dev.id;
> +   unregister_cpu_under_node(cpuid, from_nid);
> +   register_cpu_under_node(cpuid, to_nid);
> +   cpu->node_id = to_nid;
> +}
> +

Where is stub for !CONFIG_HOTPLUG_CPU?


>  static ssize_t show_online(struct device *dev,
>struct device_attribute *attr,
>char *buf)
> @@ -39,17 +48,23 @@ static ssize_t __ref store_online(struct device *dev,
>   const char *buf, size_t count)
>  {
> struct cpu *cpu = container_of(dev, struct cpu, dev);
> +   int num = cpu->dev.id;

"num" is the wrong name; "cpuid" would be better.


> +   int from_nid, to_nid;
> ssize_t ret;
>
> cpu_hotplug_driver_lock();
> switch (buf[0]) {
> case '0':
> -   ret = cpu_down(cpu->dev.id);
> +   ret = cpu_down(num);
> if (!ret)
> kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
> break;
> case '1':
> -   ret = cpu_up(cpu->dev.id);
> +   from_nid = cpu_to_node(num);
> +   ret = cpu_up(num);
> +   to_nid = cpu_to_node(num);
> +   if (from_nid != to_nid)
> +   change_cpu_under_node(cpu, from_nid, to_nid);

You need to add some comments; this code is not straightforward.
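The missing stub asked about above would follow the usual kernel pattern of pairing the real function with an empty static inline under the opposite config. A minimal, compilable illustration follows; the CONFIG macro and the counter are stand-ins of mine, not the actual driver code:

```c
#define CONFIG_HOTPLUG_CPU 1	/* stand-in; undefine to exercise the stub */

static int links_moved;		/* observable side effect for this sketch */

#ifdef CONFIG_HOTPLUG_CPU
/* Real version: move the CPU's sysfs links to its new node. */
static void change_cpu_under_node(int cpuid, int from_nid, int to_nid)
{
	(void)cpuid;
	(void)from_nid;
	(void)to_nid;
	links_moved++;		/* stands in for unregister + register */
}
#else
/* The requested stub: callers compile unchanged when CPU hotplug is
 * configured out, and the call collapses to nothing. */
static inline void change_cpu_under_node(int cpuid, int from_nid, int to_nid)
{
	(void)cpuid;
	(void)from_nid;
	(void)to_nid;
}
#endif
```

The point of the pattern is that the `#ifdef` lives next to the definition, not at every call site.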


Re: [PATCH 03/28] proc: Split kcore bits from linux/procfs.h into linux/kcore.h [RFC]

2013-04-16 Thread KOSAKI Motohiro
On Tue, Apr 16, 2013 at 3:07 PM, David Howells  wrote:
>
> KOSAKI Motohiro  wrote:
>
>> I haven't seen any issue with this change, but why? Is there any
>> motivation other than cleanup?
>
> Stopping stuff mucking about with the internals of procfs incorrectly
> (sometimes because the internals of procfs have changed, but the drivers
> haven't).

OK, thank you for the explanation.

Acked-by: KOSAKI Motohiro 


Re: [PATCH 03/28] proc: Split kcore bits from linux/procfs.h into linux/kcore.h [RFC]

2013-04-16 Thread KOSAKI Motohiro
On Tue, Apr 16, 2013 at 11:26 AM, David Howells  wrote:
> Split kcore bits from linux/procfs.h into linux/kcore.h.
>
> Signed-off-by: David Howells 
> cc: linux-m...@linux-mips.org
> cc: sparcli...@vger.kernel.org
> cc: x...@kernel.org
> cc: linux...@kvack.org

I haven't seen any issue with this change, but why? Is there any
motivation other than cleanup?


Re: Return value of __mm_populate

2013-04-13 Thread KOSAKI Motohiro
(4/13/13 5:14 AM), Marco Stornelli wrote:
> Hi,
> 
> I was looking at the code of __mm_populate (in -next) and I've got a doubt 
> about the return value. The function __mlock_posix_error_return should 
> return a proper error for mlock, converting the return value from 
> __get_user_pages. It checks for EFAULT and ENOMEM. Actually 
> __get_user_pages could return, in addition, ERESTARTSYS and EHWPOISON. 

__get_user_pages() doesn't return EHWPOISON if FOLL_HWPOISON is not specified.
I'm no expert on ERESTARTSYS, but if I understand correctly, ERESTARTSYS is
only returned when a signal is received, and the signal-handling routine (e.g.
do_signal) modifies EIP and hides ERESTARTSYS from userland generically.
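The conversion under discussion boils down to a small errno remapping. The following is a sketch modeled on __mlock_posix_error_return() as described above, a simplification rather than the kernel source: POSIX requires mlock() to report ENOMEM when part of the range is not mapped, and EAGAIN when locking failed for lack of resources, so the internal __get_user_pages() results are translated; anything else (including ERESTARTSYS) passes through.

```c
#include <errno.h>

/* Remap internal page-walk errors to the errno values POSIX mlock()
 * promises to user space:
 *   -EFAULT (range not mapped)   -> -ENOMEM
 *   -ENOMEM (out of resources)   -> -EAGAIN
 * Everything else is returned unchanged. */
static long mlock_posix_error(long retval)
{
	if (retval == -EFAULT)
		return -ENOMEM;
	if (retval == -ENOMEM)
		return -EAGAIN;
	return retval;
}
```

Whether ERESTARTSYS should also be remapped is exactly the open question in this thread; the sketch deliberately leaves it passing through.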


> So it seems to me that we could return an unexpected value to user space. 
> I can't see them on the man page. In addition we shouldn't ever return 
> ERESTARTSYS to the user space but EINTR. According to the man pages 
> maybe we should return EAGAIN in these cases. Am I missing something?




Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

2013-04-11 Thread KOSAKI Motohiro
(4/10/13 11:26 PM), Mitsuhiro Tanino wrote:
> Hi All,
> Please find a patch set that introduces these new sysctl interfaces,
> to handle a case when an memory error is detected on dirty page cache.
> 
> - vm.memory_failure_dirty_panic

A panic knob is OK with me. However, I agree with Andi: if we need a panic
knob, it should handle generic I/O errors and data loss.


> - vm.memory_failure_print_ratelimit
> - vm.memory_failure_print_ratelimit_burst

But this is totally silly:
print_ratelimit might omit important messages. Please do it the right way.




Re: [RFC v7 00/11] Support vrange for anonymous page

2013-04-11 Thread KOSAKI Motohiro
 and adding new syscall invokation is unwelcome.
>>>
>>> Sure. But one more system call could be cheaper than page-granuarity
>>> operation on purged range.
>>
>> I don't think vrange(VOLATILE) cost is the related of this discusstion.
>> Whether sending SIGBUS or just nuke pte, purge should be done on vmscan,
>> not vrange() syscall.
> 
> Again, please see the MADV_FREE. http://lwn.net/Articles/230799/
> It changes pte and page flags on all pages of the range through
> zap_pte_range. So it would make vrange(VOLATILE) expensive, and
> the bigger the range is, the bigger the cost.

This hadn't crossed my mind. Right now try_to_discard_one() inserts a vrange
for making SIGBUS; then we can insert pte_none() at the same cost too. Am I
missing something?

I can't imagine why the pte should be zapped on vrange(VOLATILE).


Re: [RFC v7 00/11] Support vrange for anonymous page

2013-04-11 Thread KOSAKI Motohiro
(4/11/13 4:02 AM), Minchan Kim wrote:
> On Thu, Apr 11, 2013 at 03:20:30AM -0400, KOSAKI Motohiro wrote:
>>>>>   DONTNEED makes sure user always can see zero-fill pages after
>>>>>   he calls madvise while vrange can see data or encounter SIGBUS.
>>>>
>>>> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
>>>> instead of SIGBUS. So, new flag option would be nice.
>>>
>>> If userspace people want it, I can do it. 
>>> But not sure they want it at the moment becaue vrange is rather
>>> different concept of madvise(DONTNEED) POV usage.
>>>
>>> As you know well, in case of DONTNEED, user calls madvise _once_ and
>>> VM releases memory as soon as he called system call.
>>> But vrange is same with delayed free when the system memory pressure
>>> happens so user can't know OS frees the pages anytime.
>>> It means user should call pair of system call both VRANGE_VOLATILE
>>> and VRANGE_NOVOLATILE for right usage of volatile range
>>> (for simple, I don't want to tell SIGBUS fault recovery method).
>>> If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
>>> which is used by current process, pages used by some process could be
>>> disappeared suddenly.
>>>
>>> In summary, I don't think vrange is a replacement of madvise(DONTNEED)
>>> but could be useful with madvise(DONTNEED) friend. For example, we can
>>> make return 1 in vrange(VRANGE_VOLATILE) if memory pressure was already
>>
>> Do you mean vrange(VRANGE_UNVOLATILE)?
> 
> I meant VRANGE_VOLATILE. It seems my explanation was poor. Here it goes, 
> again.
> Now vrange's semantic return just 0 if the system call is successful, 
> otherwise,
> return error. But we can change it as folows
> 
> 1. return 0 if the system call is successful and memory pressure isn't severe
> 2. return 1 if the system call is successful and memory pressure is severe
> 3. return -ERRXXX if the system call is failed by some reason
> 
> So the process can know system-wide memory pressure without peeking the vmstat
> and then call madvise(DONTNEED) right after vrange call. The benefit is system
> can zap all pages instantly.

Do you mean your patchset is not the latest? And when do you use this feature?
What happens if VRANGE_VOLATILE returns 0 and the range is purged just after
the syscall returns?
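Minchan's proposed 0/1/-errno semantics would let a caller react to memory pressure inline. The sketch below is purely hypothetical: `vrange()` here is a stub of mine standing in for the proposed syscall (which never landed in this form), and `drop_range_now()` stands in for madvise(MADV_DONTNEED).

```c
#define VRANGE_VOLATILE 0

/* Hypothetical stub for the proposed syscall: 0 = success under low
 * pressure, 1 = success under severe pressure, negative = error.
 * Here we just simulate "severe pressure". */
static int vrange(void *start, unsigned long len, int mode)
{
	(void)start; (void)len; (void)mode;
	return 1;
}

static int dontneed_called;

/* Stand-in for madvise(start, len, MADV_DONTNEED). */
static void drop_range_now(void *start, unsigned long len)
{
	(void)start; (void)len;
	dontneed_called = 1;
}

/* Caller pattern under the proposed semantics: mark the range volatile,
 * and if the kernel signals severe pressure, zap the pages immediately
 * instead of waiting for reclaim. */
static int mark_volatile(void *start, unsigned long len)
{
	int ret = vrange(start, len, VRANGE_VOLATILE);

	if (ret < 0)
		return ret;		/* real error */
	if (ret == 1)
		drop_range_now(start, len);
	return 0;
}
```

This illustrates the objection in the thread: the "1" is a status signal, not an error, which is why returning it from the syscall reads strangely.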


>> btw, assigning a new error number in asm-generic/errno.h is better than a
>> strange '1'.
> 
> I can, and I admit "1" is rather weird.
> But it's not an error, either.

If this is really necessary, I don't oppose it. However, I am still not
convinced.



>>> severe so user can catch up memory pressure by return value and calls
>>> madvise(DONTNEED) if memory pressure was already severe. Of course, we
>>> can handle it vrange system call itself(ex, change vrange system call to
>>> madvise(DONTNEED) but don't want it because I want to keep vrange hinting
>>> sytem call very light at all times so user can expect latency.
>>
>> For allocator usage, vrange(UNVOLATILE) is annoying and not needed at all.
>> When data has already been purged, just return a new zero-filled page. So
>> maybe adding a new flag is worthwhile, because malloc is definitely a fast path
> 
> I really want it and it's exactly the same as madvise(MADV_FREE).
> But for the implementation, we need page-granularity something in the address
> range in system call context, like zap_pte_range (e.g. clear page table bits
> and mark something in the page flags for the reclaimer to detect it).
> It means the vrange system call is still bigger although we are able to remove
> the lazy page fault.
> 
> Do you have any idea to remove it? If so, I'm very open to implement it.

Hm. Maybe I am missing something. I'll look at the code closely after LSF.


>> and adding a new syscall invocation is unwelcome.
> 
> Sure. But one more system call could be cheaper than page-granularity
> operations on the purged range.

I don't think the vrange(VOLATILE) cost is the point of this discussion.
Whether we send SIGBUS or just nuke the pte, the purge should be done in
vmscan, not in the vrange() syscall.



Re: [RFC v7 00/11] Support vrange for anonymous page

2013-04-11 Thread KOSAKI Motohiro
>>>   DONTNEED makes sure user always can see zero-fill pages after
>>>   he calls madvise while vrange can see data or encounter SIGBUS.
>>
>> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
>> instead of SIGBUS. So, new flag option would be nice.
> 
> If userspace people want it, I can do it. 
> But not sure they want it at the moment becaue vrange is rather
> different concept of madvise(DONTNEED) POV usage.
> 
> As you know well, in case of DONTNEED, user calls madvise _once_ and
> VM releases memory as soon as he called system call.
> But vrange is same with delayed free when the system memory pressure
> happens so user can't know OS frees the pages anytime.
> It means user should call pair of system call both VRANGE_VOLATILE
> and VRANGE_NOVOLATILE for right usage of volatile range
> (for simple, I don't want to tell SIGBUS fault recovery method).
> If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
> which is used by current process, pages used by some process could be
> disappeared suddenly.
> 
> In summary, I don't think vrange is a replacement of madvise(DONTNEED)
> but could be useful with madvise(DONTNEED) friend. For example, we can
> make return 1 in vrange(VRANGE_VOLATILE) if memory pressure was already

Do you mean vrange(VRANGE_UNVOLATILE)?
BTW, assigning a new error number in asm-generic/errno.h is better than a strange '1'.


> severe so user can catch up memory pressure by return value and calls
> madvise(DONTNEED) if memory pressure was already severe. Of course, we
> can handle it vrange system call itself(ex, change vrange system call to
> madvise(DONTNEED) but don't want it because I want to keep vrange hinting
> sytem call very light at all times so user can expect latency.

For allocator usage, vrange(UNVOLATILE) is annoying and not needed at all.
When data has already been purged, just return a new zero-filled page. So maybe
adding a new flag is worthwhile, because malloc is definitely a fast path
and adding a new syscall invocation is unwelcome.


>> # of # of   # of
>> thread   iter   iter (patched glibc)
> 
> What's the workload?

Ahh, sorry. I forgot to write it. I use ebizzy, your favorite workload.





Re: [RFC v7 00/11] Support vrange for anonymous page

2013-04-11 Thread KOSAKI Motohiro
   DONTNEED makes sure user always can see zero-fill pages after
   he calls madvise while vrange can see data or encounter SIGBUS.

 For replacing DONTNEED, user want to zero-fill pages like DONTNEED
 instead of SIGBUS. So, new flag option would be nice.
 
 If userspace people want it, I can do it. 
 But not sure they want it at the moment becaue vrange is rather
 different concept of madvise(DONTNEED) POV usage.
 
 As you know well, in case of DONTNEED, user calls madvise _once_ and
 VM releases memory as soon as he called system call.
 But vrange is same with delayed free when the system memory pressure
 happens so user can't know OS frees the pages anytime.
 It means user should call pair of system call both VRANGE_VOLATILE
 and VRANGE_NOVOLATILE for right usage of volatile range
 (for simple, I don't want to tell SIGBUS fault recovery method).
 If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
 which is used by current process, pages used by some process could be
 disappeared suddenly.
 
 In summary, I don't think vrange is a replacement of madvise(DONTNEED)
 but could be useful with madvise(DONTNEED) friend. For example, we can
 make return 1 in vrange(VRANGE_VOLATILE) if memory pressure was already

Do you mean vrange(VRANGE_UNVOLATILE)?
btw, assigning a new error number in asm-generic/errno.h is better than a strange '1'.


 severe so user can catch up memory pressure by return value and calls
 madvise(DONTNEED) if memory pressure was already severe. Of course, we
 can handle it vrange system call itself(ex, change vrange system call to
 madvise(DONTNEED) but don't want it because I want to keep vrange hinting
 sytem call very light at all times so user can expect latency.

For allocator usage, vrange(UNVOLATILE) is annoying and not needed at all.
When data has already been purged, just return a new zero-filled page. So
maybe adding a new flag is worthwhile, because malloc is definitely a fast
path and adding a new syscall invocation is unwelcome.


 # of # of   # of
 thread   iter   iter (patched glibc)
 
 What's the workload?

Ahh, sorry. I forgot to write. I use ebizzy, your favorite workload.





Re: [RFC v7 00/11] Support vrange for anonymous page

2013-04-11 Thread KOSAKI Motohiro
(4/11/13 4:02 AM), Minchan Kim wrote:
 On Thu, Apr 11, 2013 at 03:20:30AM -0400, KOSAKI Motohiro wrote:
   DONTNEED makes sure user always can see zero-fill pages after
   he calls madvise while vrange can see data or encounter SIGBUS.

 For replacing DONTNEED, user want to zero-fill pages like DONTNEED
 instead of SIGBUS. So, new flag option would be nice.

 If userspace people want it, I can do it. 
 But not sure they want it at the moment becaue vrange is rather
 different concept of madvise(DONTNEED) POV usage.

 As you know well, in case of DONTNEED, user calls madvise _once_ and
 VM releases memory as soon as he called system call.
 But vrange is same with delayed free when the system memory pressure
 happens so user can't know OS frees the pages anytime.
 It means user should call pair of system call both VRANGE_VOLATILE
 and VRANGE_NOVOLATILE for right usage of volatile range
 (for simple, I don't want to tell SIGBUS fault recovery method).
 If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
 which is used by current process, pages used by some process could be
 disappeared suddenly.

 In summary, I don't think vrange is a replacement of madvise(DONTNEED)
 but could be useful with madvise(DONTNEED) friend. For example, we can
 make return 1 in vrange(VRANGE_VOLATILE) if memory pressure was already

 Do you mean vrange(VRANGE_UNVOLATILE)?
 
 I meant VRANGE_VOLATILE. It seems my explanation was poor. Here it goes, 
 again.
 Now vrange's semantic return just 0 if the system call is successful, 
 otherwise,
 return error. But we can change it as folows
 
 1. return 0 if the system call is successful and memory pressure isn't severe
 2. return 1 if the system call is successful and memory pressure is severe
 3. return -ERRXXX if the system call is failed by some reason
 
 So the process can know system-wide memory pressure without peeking the vmstat
 and then call madvise(DONTNEED) right after vrange call. The benefit is system
 can zap all pages instantly.

Do you mean your patchset is not the latest? And when do you use this feature?
What happens if VRANGE_VOLATILE returns 0 and the range is purged just after
the syscall returns?


 btw, assign new error number to asm-generic/errno.h is better than strange 
 '1'.
 
 I can and admit 1 is rather weired.
 But it's not error, either.

If this is really necessary, I don't oppose it. However I am still not 
convinced.



 severe so user can catch up memory pressure by return value and calls
 madvise(DONTNEED) if memory pressure was already severe. Of course, we
 can handle it vrange system call itself(ex, change vrange system call to
 madvise(DONTNEED) but don't want it because I want to keep vrange hinting
 sytem call very light at all times so user can expect latency.

 For allocator usage, vrange(UNVOLATILE) is annoying and don't need at all.
 When data has already been purged, just return new zero filled page. so,
 maybe adding new flag is worthwhile. Because malloc is definitely fast path
 
 I really want it and it's exactly same with madvise(MADV_FREE).
 But for implementation, we need page granularity someting in address range
 in system call context like zap_pte_range(ex, clear page table bits and
 mark something to page flags for reclaimer to detect it).
 It means vrange system call is still bigger although we are able to remove
 lazy page fault.
 
 Do you have any idea to remove it? If so, I'm very open to implement it.

Hm. Maybe I am missing something. I'll look at the code closely after LSF/MM.


 and adding new syscall invokation is unwelcome.
 
 Sure. But one more system call could be cheaper than page-granuarity
 operation on purged range.

I don't think the vrange(VOLATILE) cost is related to this discussion.
Whether we send SIGBUS or just nuke the pte, the purge should be done in
vmscan, not in the vrange() syscall.











Re: [RFC v7 00/11] Support vrange for anonymous page

2013-04-11 Thread KOSAKI Motohiro
 and adding new syscall invokation is unwelcome.

 Sure. But one more system call could be cheaper than page-granuarity
 operation on purged range.

 I don't think vrange(VOLATILE) cost is the related of this discusstion.
 Whether sending SIGBUS or just nuke pte, purge should be done on vmscan,
 not vrange() syscall.
 
 Again, please see the MADV_FREE. http://lwn.net/Articles/230799/
 It does changes pte and page flags on all pages of the range through
 zap_pte_range. So it would make vrange(VOLASTILE) expensive and
 the bigger cost is, the bigger range is.

This hadn't crossed my mind. Now try_to_discard_one() inserts a vrange
entry for raising SIGBUS; then we can insert pte_none() at the same cost
too. Am I missing something?

I couldn't imagine why the pte should be zapped on vrange(VOLATILE).


Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

2013-04-11 Thread KOSAKI Motohiro
(4/10/13 11:26 PM), Mitsuhiro Tanino wrote:
 Hi All,
 Please find a patch set that introduces these new sysctl interfaces,
 to handle a case when an memory error is detected on dirty page cache.
 
 - vm.memory_failure_dirty_panic

The panic knob is OK with me. However, I agree with Andi: if we need a panic
knob, it should handle generic I/O errors and data loss.


 - vm.memory_failure_print_ratelimit
 - vm.memory_failure_print_ratelimit_burst

But this is totally silly.
print_ratelimit might omit important messages. Please do it the right way.




Re: [PATCH] mm: madvise: complete input validation before taking lock

2013-04-10 Thread KOSAKI Motohiro
(4/10/13 7:45 PM), Rasmus Villemoes wrote:
> In madvise(), there doesn't seem to be any reason for taking the
> current->mm->mmap_sem before start and len_in have been
> validated. Incidentally, this removes the need for the out: label.
> 
> 
> Signed-off-by: Rasmus Villemoes 

Looks good.

Acked-by: KOSAKI Motohiro 




Re: [PATCH 0/10] Reduce system disruption due to kswapd V2

2013-04-10 Thread KOSAKI Motohiro
>> I've never checked it but I would have expected kswapd to stay on the
>> same processor for significant periods of time. Have you experienced
>> problems where kswapd bounces around on CPUs within a node causing
>> workload disruption?
> 
> When kswapd shares the same CPU as our main process it causes a measurable
> drop in response time (graphs show tiny spikes at the same time memory is
> freed). Would be nice to be able to ensure it runs on a different core
> than our latency sensitive processes at least. We can pin processes to
> subsets of cores but I don't think there's a way to keep kswapd from
> waking up on any of them?

You are only talking about an extreme corner case and don't talk about the
other side. When number-of-nodes > number-of-cpus, we have no way to avoid
CPU sharing.

Moreover, this is not a kswapd-specific issue; every kernel thread causes the
same latency hiccup, so this issue should be solved at a more generic layer.


Re: + fix-hugetlb-memory-check-in-vma_dump_size.patch added to -mm tree

2013-04-10 Thread KOSAKI Motohiro
> Signed-off-by: Naoya Horiguchi 
> Reviewed-by: Rik van Riel 
> Acked-by: Michal Hocko 
> Reviewed-by: HATAYAMA Daisuke 
> Acked-by: KOSAKI Motohiro 
> Acked-by: David Rientjes 
> Cc: [2.6.34+]

Not correct. It's 3.7 materials. See Michal or my comments.
cut-n-paste here.

>Just for the record. It should be stable for 3.7+ since (314e51b98)
>becuase then have lost VM_RESERVED check which used to stop hugetlb
>mappings.
> Acked-by: Michal Hocko 


Re: [RFC v7 00/11] Support vrange for anonymous page

2013-04-10 Thread KOSAKI Motohiro
(3/12/13 3:38 AM), Minchan Kim wrote:
> First of all, let's define the term.
> From now on, I'd like to call it as vrange(a.k.a volatile range)
> for anonymous page. If you have a better name in mind, please suggest.
> 
> This version is still *RFC* because it's just quick prototype so
> it doesn't support THP/HugeTLB/KSM and even couldn't build on !x86.
> Before further sorting out issues, I'd like to post current direction
> and discuss it. Of course, I'd like to extend this discussion in
> comming LSF/MM.
> 
> In this version, I changed lots of thing, expecially removed vma-based
> approach because it needs write-side lock for mmap_sem, which will drop
> performance in mutli-threaded big SMP system, KOSAKI pointed out.
> And vma-based approach is hard to meet requirement of new system call by
> John Stultz's suggested semantic for consistent purged handling.
> (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
> 
> I tested this patchset with modified jemalloc allocator which was
> leaded by Jason Evans(jemalloc author) who was interest in this feature
> and was happy to port his allocator to use new system call.
> Super Thanks Jason!
> 
> The benchmark for test is ebizzy. It have been used for testing the
> allocator performance so it's good for me. Again, thanks for recommending
> the benchmark, Jason.
> (http://people.freebsd.org/~kris/scaling/ebizzy.html)
> 
> The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
> 
>   ebizzy -S 20
> 
> jemalloc-vanilla: 52389 records/sec
> jemalloc-vrange: 203414 records/sec
> 
>   ebizzy -S 20 with background memory pressure
> 
> jemalloc-vanilla: 40746 records/sec
> jemalloc-vrange: 174910 records/sec
> 
> And it's much improved on KVM virtual machine.
> 
> This patchset is based on v3.9-rc2
> 
> - What's the sys_vrange(addr, length, mode, behavior)?
> 
>   It's a hint that user deliver to kernel so kernel can *discard*
>   pages in a range anytime. mode is one of VRANGE_VOLATILE and
>   VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is memory pin operation so
>   kernel coudn't discard any pages any more while VRANGE_VOLATILE
>   is memory unpin opeartion so kernel can discard pages in vrange
>   anytime. At a moment, behavior is one of VRANGE_FULL and VRANGE
>   PARTIAL. VRANGE_FULL tell kernel that once kernel decide to
>   discard page in a vrange, please, discard all of pages in a
>   vrange selected by victim vrange. VRANGE_PARTIAL tell kernel
>   that please discard of some pages in a vrange. But now I didn't
>   implemented VRANGE_PARTIAL handling yet.
> 
> - What happens if user access page(ie, virtual address) discarded
>   by kernel?
> 
>   The user can encounter SIGBUS.
> 
> - What should user do for avoding SIGBUS?
>   He should call vrange(addr, length, VRANGE_NOVOLATILE, mode) before
>   accessing the range which was called
>   vrange(addr, length, VRANGE_VOLATILE, mode)
> 
> - What happens if user access page(ie, virtual address) doesn't
>   discarded by kernel?
> 
>   The user can see vaild data which was there before calling
> vrange(., VRANGE_VOLATILE) without page fault.
> 
> - What's different with madvise(DONTNEED)?
> 
>   System call semantic
> 
>   DONTNEED makes sure user always can see zero-fill pages after
>   he calls madvise while vrange can see data or encounter SIGBUS.

For replacing DONTNEED, users want zero-filled pages like DONTNEED
instead of SIGBUS. So, a new flag option would be nice.

I played with this patch a bit. The result looks really promising
(i.e. 20x faster).

My machine has 24 CPUs, 8GB RAM, KVM guest. I guess the current DONTNEED
implementation doesn't fit KVM at all.


# of       # of     # of
thread     iter     iter (patched glibc)
----------------------------------------
  1         438     10740
  2         842     20916
  4         987     32534
  8         717     15155
 12         714     14109
 16         708     13457
 20         720     13742
 24         727     13642
 28         715     13328
 32         709     13096
 36         705     13661
 40         708     13634
 44         707     13367
 48         714     13377


-libc patch (just dirty hack) --

diff --git a/malloc/arena.c b/malloc/arena.c
index 12a48ad..da04f67 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -365,6 +365,8 @@ extern struct dl_open_hook *_dl_open_hook;
 libc_hidden_proto (_dl_open_hook);
 #endif

+int vrange_enabled = 0;
+
 static void
 ptmalloc_init (void)
 {
@@ -457,6 +459,18 @@ ptmalloc_init (void)
 if (check_action != 0)
   __malloc_check_init();
   }
+
+  {
+char *vrange = getenv("MALLOC_VRANGE");
+if (vrange) {
+  int val = atoi(vrange);
+  if (val) {
+   printf("glibc: vrange enabled\n");
+   vrange_enabled = !!val;
+  }
+}
+  }
+
   void (*hook) (void) = force_reg (__malloc_initialize_hook);
   if (hook != NULL)
 (*hook)();
@@ -628,9 +642,14 @@ shrink_heap(heap_info *h, long diff)
return -2;
   h->mprotect_size = new_size;
 }
-  else
-__madvise ((char *)h + new_size, diff, MADV_DONTNEED);
+  else {
+if (vrange_enabled) {

Re: [RESEND][PATCH v5 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()

2013-04-10 Thread KOSAKI Motohiro
On Wed, Apr 10, 2013 at 12:17 PM, Naoya Horiguchi
 wrote:
> # I suspended Reviewed and Acked given for the previous version, because
> # it has a non-minor change. If you want to restore it, please let me know.
> -
> With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
> initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
> error happens on a hugepage and the affected processes try to access
> the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0)
> in get_page().
>
> The reason for this bug is that coredump-related code doesn't recognise
> "hugepage hwpoison entry" with which a pmd entry is replaced when a memory
> error occurs on a hugepage.
> In other words, physical address information is stored in different bit layout
> between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page()
> which is called in get_dump_page() returns a wrong page from a given address.
>
> The expected behavior is like this:
>
>   absent   is_swap_pte   FOLL_DUMP   Expected behavior
>   ---
>true false false   hugetlb_fault
>falsetrue  false   hugetlb_fault
>falsefalse false   return page
>true false trueskip page (to avoid allocation)
>falsetrue  truehugetlb_fault
>falsefalse truereturn page
>
> With this patch, we can call hugetlb_fault() and take proper actions
> (we wait for migration entries, fail with VM_FAULT_HWPOISON_LARGE for
> hwpoisoned entries,) and as the result we can dump all hugepages except
> for hwpoisoned ones.
>
> ChangeLog v5:
>  - improve comment and description.
>
> ChangeLog v4:
>  - move is_swap_page() to right place.
>
> ChangeLog v3:
>  - add comment about using is_swap_pte()
>
> Signed-off-by: Naoya Horiguchi 
> Cc: sta...@vger.kernel.org

Acked-by: KOSAKI Motohiro 





Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage

2013-04-09 Thread KOSAKI Motohiro
On Tue, Apr 9, 2013 at 6:43 PM, Naoya Horiguchi
 wrote:
> On Tue, Apr 09, 2013 at 05:27:44PM -0400, KOSAKI Motohiro wrote:
>> >> numa_node_id() is really silly. This might lead to allocate from 
>> >> offlining node.
>> >
>> > Right, it should've been alloc_huge_page().
>> >
>> >> and, offline_pages() should mark hstate as isolated likes normal pages 
>> >> for prohibiting
>> >> new allocation at first.
>> >
>> > It seems that alloc_migrate_target() calls alloc_page() for normal pages
>> > and the destination pages can be in the same node with the source pages
>> > (new page allocation from the same memblock are prohibited.)
>>
>> No. It can't. memory hotplug change buddy attribute to MIGRATE_ISOLTE at 
>> first.
>> then alloc_page() never allocate from source node. however huge page don't 
>> use
>> buddy. then we need another trick.
>
> MIGRATE_ISOLTE is changed only within the range [start_pfn, end_pfn)
> given as the argument of __offline_pages (see also start_isolate_page_range),
> so it's set only for pages within the single memblock to be offlined.

For partial memory hot-remove, that's the correct behavior; a different
node is not required.


> BTW, in previous discussion I already agreed with checking migrate type
> in hugepage allocation code (maybe it will be in dequeue_huge_page_vma(),)
> so what you concern should be solved in the next post.

Umm.. Maybe I missed such discussion. Do you have a pointer?


>> > So if we want to avoid new page allocation from the same node,
>> > this is the problem both for normal and huge pages.
>> >
>> > BTW, is it correct to think that all users of memory hotplug assume
>> > that they want to hotplug a whole node (not the part of it?)
>>
>> Both are valid use case. admin can isolate a part of memory for isolating
>> broken memory range.
>>
>> but I'm sure almost user want to remove whole node.
>
> OK. So I think about "allocation in the nearest neighbor node",
> although it can be in separate patch if it's hard to implement.

That's fine.


Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()

2013-04-09 Thread KOSAKI Motohiro
> I rewrite the comment here, how about this?
>
> -   if (absent ||
> +   /*
> +* We need call hugetlb_fault for both hugepages under migration
> +* (in which case hugetlb_fault waits for the migration,) and
> +* hwpoisoned hugepages (in which case we need to prevent the
> +* caller from accessing to them.) In order to do this, we use
> +* here is_swap_pte instead of is_hugetlb_entry_migration and
> +* is_hugetlb_entry_hwpoisoned. This is because it simply covers
> +* both cases, and because we can't follow correct pages directly
> +* from any kind of swap entries.
> +*/
> +   if (absent || is_swap_pte(huge_ptep_get(pte)) ||
> ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte)))) {
> int ret;

Looks ok to me.


Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage

2013-04-09 Thread KOSAKI Motohiro
>> numa_node_id() is really silly. This might lead to allocate from offlining 
>> node.
>
> Right, it should've been alloc_huge_page().
>
>> and, offline_pages() should mark hstate as isolated likes normal pages for 
>> prohibiting
>> new allocation at first.
>
> It seems that alloc_migrate_target() calls alloc_page() for normal pages
> and the destination pages can be in the same node with the source pages
> (new page allocation from the same memblock are prohibited.)

No, it can't. Memory hotplug changes the buddy attribute to MIGRATE_ISOLATE
first, so alloc_page() never allocates from the source node. However, huge
pages don't use the buddy allocator, so we need another trick.


> So if we want to avoid new page allocation from the same node,
> this is the problem both for normal and huge pages.
>
> BTW, is it correct to think that all users of memory hotplug assume
> that they want to hotplug a whole node (not the part of it?)

Both are valid use cases. An admin can isolate a part of memory to fence off
a broken memory range.

But I'm sure almost all users want to remove a whole node.


Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage

2013-04-09 Thread KOSAKI Motohiro
 numa_node_id() is really silly. This might lead to allocate from offlining 
 node.

 Right, it should've been alloc_huge_page().

 and, offline_pages() should mark hstate as isolated likes normal pages for 
 prohibiting
 new allocation at first.

 It seems that alloc_migrate_target() calls alloc_page() for normal pages
 and the destination pages can be in the same node with the source pages
 (new page allocation from the same memblock are prohibited.)

No. It can't. memory hotplug change buddy attribute to MIGRATE_ISOLTE at first.
then alloc_page() never allocate from source node. however huge page don't use
buddy. then we need another trick.


 So if we want to avoid new page allocation from the same node,
 this is the problem both for normal and huge pages.

 BTW, is it correct to think that all users of memory hotplug assume
 that they want to hotplug a whole node (not the part of it?)

Both are valid use case. admin can isolate a part of memory for isolating
broken memory range.

but I'm sure almost user want to remove whole node.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()

2013-04-09 Thread KOSAKI Motohiro
 I rewrite the comment here, how about this?

 -   if (absent ||
 +   /*
 +* We need call hugetlb_fault for both hugepages under 
 migration
 +* (in which case hugetlb_fault waits for the migration,) and
 +* hwpoisoned hugepages (in which case we need to prevent the
 +* caller from accessing to them.) In order to do this, we use
 +* here is_swap_pte instead of is_hugetlb_entry_migration and
 +* is_hugetlb_entry_hwpoisoned. This is because it simply 
 covers
 +* both cases, and because we can't follow correct pages 
 directly
 +* from any kind of swap entries.
 +*/
 +   if (absent || is_swap_pte(huge_ptep_get(pte)) ||
 ((flags  FOLL_WRITE)  !pte_write(huge_ptep_get(pte 
 {
 int ret;

Looks ok to me.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage

2013-04-09 Thread KOSAKI Motohiro
On Tue, Apr 9, 2013 at 6:43 PM, Naoya Horiguchi
<n-horigu...@ah.jp.nec.com> wrote:
> On Tue, Apr 09, 2013 at 05:27:44PM -0400, KOSAKI Motohiro wrote:
>>>> numa_node_id() is really silly. This might lead to allocate from
>>>> offlining node.
>>>
>>> Right, it should've been alloc_huge_page().
>>>
>>>> and, offline_pages() should mark hstate as isolated likes normal pages
>>>> for prohibiting new allocation at first.
>>>
>>> It seems that alloc_migrate_target() calls alloc_page() for normal pages
>>> and the destination pages can be in the same node with the source pages
>>> (new page allocation from the same memblock are prohibited.)
>>
>> No, it can't. Memory hotplug changes the buddy attribute to
>> MIGRATE_ISOLATE first, so alloc_page() never allocates from the source
>> node. However, huge pages don't use the buddy allocator, so we need
>> another trick.
>
> MIGRATE_ISOLATE is changed only within the range [start_pfn, end_pfn)
> given as the argument of __offline_pages (see also start_isolate_page_range),
> so it's set only for pages within the single memblock to be offlined.

For partial memory hot remove, that's the correct behavior; a different
node is not required.


> BTW, in previous discussion I already agreed with checking migrate type
> in hugepage allocation code (maybe it will be in dequeue_huge_page_vma(),)
> so what you concern should be solved in the next post.

Umm, maybe I missed such a discussion. Do you have a pointer?


>>>> So if we want to avoid new page allocation from the same node,
>>>> this is the problem both for normal and huge pages.
>>>>
>>>> BTW, is it correct to think that all users of memory hotplug assume
>>>> that they want to hotplug a whole node (not the part of it?)
>>
>> Both are valid use cases. An admin can offline part of a node to isolate
>> a broken memory range, but I'm sure almost all users want to remove a
>> whole node.
>
> OK. So I think about allocation in the nearest neighbor node,
> although it can be in separate patch if it's hard to implement.

That's fine.


Re: [RFC PATCH 1/1] mm: Another attempt to monitor task's memory changes

2013-04-08 Thread KOSAKI Motohiro
> This approach works on any task via it's proc, and can be used on different
> tasks in parallel.
> 
> Also, Andrew was asking for some performance numbers related to the change.
> Now I can say, that as long as soft dirty bits are not cleared, no performance
> penalty occur, since the soft dirty bit and the regular dirty bit are set at 
> the same time within the same instruction. When soft dirty is cleared via 
> clear_refs, the task in question might slow down, but it will depend on how
> actively it uses the memory.
> 
> 
> What do you think, does it make sense to develop this approach further?

When touching an mmaped page, the CPU turns on the dirty bit but doesn't
turn on the soft-dirty bit. So I'm not convinced how this flag is meant to
be used. Please show us your userland algorithm for detecting the diff.




Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

2013-04-08 Thread KOSAKI Motohiro
(4/8/13 3:50 PM), Cody P Schafer wrote:
> On 04/08/2013 10:28 AM, Cody P Schafer wrote:
>> On 04/08/2013 05:20 AM, Gilad Ben-Yossef wrote:
>>> On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer
>>>  wrote:
 In free_hot_cold_page(), we rely on pcp->batch remaining stable.
 Updating it without being on the cpu owning the percpu pageset
 potentially destroys this stability.

 Change for_each_cpu() to on_each_cpu() to fix.
>>>
>>> Are you referring to this? -
>>
>> This was the case I noticed.
>>
>>>
>>> 1329 if (pcp->count >= pcp->high) {
>>> 1330 free_pcppages_bulk(zone, pcp->batch, pcp);
>>> 1331 pcp->count -= pcp->batch;
>>> 1332 }
>>>
>>> I'm probably missing the obvious but won't it be simpler to do this in
>>>   free_hot_cold_page() -
>>>
>>> 1329 if (pcp->count >= pcp->high) {
>>> 1330  unsigned int batch = ACCESS_ONCE(pcp->batch);
>>> 1331 free_pcppages_bulk(zone, batch, pcp);
>>> 1332 pcp->count -= batch;
>>> 1333 }
>>>
>>
>> Potentially, yes. Note that this was simply the one case I noticed,
>> rather than certainly the only case.
>>
>> I also wonder whether there could be unexpected interactions between
>> ->high and ->batch not changing together atomically. For example, could
>> adjusting this knob cause ->batch to rise enough that it is greater than
>> the previous ->high? If the code above then runs with the previous
>> ->high, ->count wouldn't be correct (checking this inside
>> free_pcppages_bulk() might help on this one issue).
>>
>>> Now the batch value used is stable and you don't have to IPI every CPU
>>> in the system just to change a config knob...
>>
>> Is this really considered an issue? I wouldn't have expected someone to
>> adjust the config knob often enough (or even more than once) to cause
>> problems. Of course as a "It'd be nice" thing, I completely agree.
> 
> Would using schedule_on_each_cpu() instead of on_each_cpu() be an 
> improvement, in your opinion?

No. As long as a lightweight solution works, we should never introduce
heavyweight code. on_each_cpu() is really heavyweight, especially when the
number of CPUs is more than a thousand.




Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()

2013-04-08 Thread KOSAKI Motohiro
(4/8/13 3:49 PM), Cody P Schafer wrote:
> On 04/08/2013 12:26 PM, KOSAKI Motohiro wrote:
>> (4/8/13 1:32 PM), Cody P Schafer wrote:
>>> On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote:
>>>> (4/5/13 4:33 PM), Cody P Schafer wrote:
>>>>> No off-cpu users of the percpu pagesets exist.
>>>>>
>>>>> zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
>>>>> percpu pageset based on a zone's ->managed_pages. We don't need to drain
>>>>> the entire percpu pageset just to modify these fields. Avoid calling
>>>>> setup_pageset() (and the draining required to call it) and instead just
>>>>> set the fields' values.
>>>>>
>>>>> This does change the behavior of zone_pcp_update() as the percpu
>>>>> pagesets will not be drained when zone_pcp_update() is called (they will
>>>>> end up being shrunk, not completely drained, later when a 0-order page
>>>>> is freed in free_hot_cold_page()).
>>>>>
>>>>> Signed-off-by: Cody P Schafer 
>>>>
>>>> NAK.
>>>>
>>>> 1) zone_pcp_update() is only used from memory hotplug and it require page 
>>>> drain.
>>>
>>> I'm looking at this code because I'm currently working on a patchset
>>> which adds another interface which modifies zone sizes, so "only used
>>> from memory hotplug" is a temporary thing (unless I discover that
>>> zone_pcp_update() is not intended to do what I want it to do).
>>
>> maybe yes, maybe no. I don't know temporary or not. However the fact is,
>> you must not break anywhere. You need to look all caller always.
> 
> Right, which is why I want to understand memory hotplug's actual 
> requirements.
> 
>>>> 2) stop_machin is used for avoiding race. just removing it is insane.
>>>
>>> What race? Is there a cross cpu access to ->high & ->batch that makes
>>> using on_each_cpu() instead of stop_machine() inappropriate? It is
>>> absolutely not just being removed.
>>
>> OK, I missed that. however your code is still wrong.
>> However you can't call free_pcppages_bulk() from interrupt context and
>> then you can't use on_each_cpu() anyway.
> 
> Given drain_pages() implementation, I find that hard to believe (It uses 
> on_each_cpu_mask() and eventually calls free_pcppages_bulk()).
> 
> Can you provide a reference backing up your statement?

Grr, I missed it again. OK, you are right. Go ahead.



> If this turns out to be an issue, schedule_on_each_cpu() could be an 
> alternative.

No way. schedule_on_each_cpu() is more problematic and should be removed
in the future. schedule_on_each_cpu() can only be used when the caller
task doesn't hold any lock; otherwise it may deadlock.









Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()

2013-04-08 Thread KOSAKI Motohiro
> -   if (absent ||
> +   /*
> +* is_swap_pte test covers both is_hugetlb_entry_hwpoisoned
> +* and hugepages under migration in which case
> +* hugetlb_fault waits for the migration and bails out
> +* properly for HWPosined pages.
> +*/
> +   if (absent || is_swap_pte(huge_ptep_get(pte)) ||
> ((flags & FOLL_WRITE) && !pte_write(huge_ptep_get(pte 
> {
> int ret;

Your comment describes what the code does. However, we want the comment to
describe why. In the migration case, calling hugetlb_fault() is natural,
but the hwpoison case needs more explanation. Why can't we call
is_hugetlb_entry_hwpoisoned() directly?


Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()

2013-04-08 Thread KOSAKI Motohiro
(4/8/13 1:32 PM), Cody P Schafer wrote:
> On 04/07/2013 08:39 AM, KOSAKI Motohiro wrote:
>> (4/5/13 4:33 PM), Cody P Schafer wrote:
>>> No off-cpu users of the percpu pagesets exist.
>>>
>>> zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
>>> percpu pageset based on a zone's ->managed_pages. We don't need to drain
>>> the entire percpu pageset just to modify these fields. Avoid calling
>>> setup_pageset() (and the draining required to call it) and instead just
>>> set the fields' values.
>>>
>>> This does change the behavior of zone_pcp_update() as the percpu
>>> pagesets will not be drained when zone_pcp_update() is called (they will
>>> end up being shrunk, not completely drained, later when a 0-order page
>>> is freed in free_hot_cold_page()).
>>>
>>> Signed-off-by: Cody P Schafer 
>>
>> NAK.
>>
>> 1) zone_pcp_update() is only used from memory hotplug and it require page 
>> drain.
> 
> I'm looking at this code because I'm currently working on a patchset
> which adds another interface which modifies zone sizes, so "only used
> from memory hotplug" is a temporary thing (unless I discover that
> zone_pcp_update() is not intended to do what I want it to do).

Maybe yes, maybe no; I don't know whether it is temporary or not. However,
the fact is that you must not break anything. You always need to look at
all callers.


>> 2) stop_machin is used for avoiding race. just removing it is insane.
> 
> What race? Is there a cross cpu access to ->high & ->batch that makes
> using on_each_cpu() instead of stop_machine() inappropriate? It is
> absolutely not just being removed.

OK, I missed that. However, your code is still wrong: you can't call
free_pcppages_bulk() from interrupt context, so you can't use
on_each_cpu() anyway.








Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

2013-04-08 Thread KOSAKI Motohiro
(4/8/13 8:20 AM), Gilad Ben-Yossef wrote:
> On Fri, Apr 5, 2013 at 11:33 PM, Cody P Schafer  
> wrote:
>> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
>> Updating it without being on the cpu owning the percpu pageset
>> potentially destroys this stability.
>>
>> Change for_each_cpu() to on_each_cpu() to fix.
> 
> Are you referring to this? -
> 
> 1329 if (pcp->count >= pcp->high) {
> 1330 free_pcppages_bulk(zone, pcp->batch, pcp);
> 1331 pcp->count -= pcp->batch;
> 1332 }
> 
> I'm probably missing the obvious but won't it be simpler to do this in
>  free_hot_cold_page() -
> 
> 1329 if (pcp->count >= pcp->high) {
> 1330  unsigned int batch = ACCESS_ONCE(pcp->batch);
> 1331 free_pcppages_bulk(zone, batch, pcp);
> 1332 pcp->count -= batch;
> 1333 }
> 
> Now the batch value used is stable and you don't have to IPI every CPU
> in the system just to change a config knob...

OK, right. Your approach is much better.




Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch

2013-04-08 Thread KOSAKI Motohiro
(4/8/13 1:16 PM), Cody P Schafer wrote:
> On 04/07/2013 08:23 AM, KOSAKI Motohiro wrote:
>> (4/5/13 4:33 PM), Cody P Schafer wrote:
>>> In one case while modifying the ->high and ->batch fields of per cpu 
>>> pagesets
>>> we're unneededly using stop_machine() (patches 1 & 2), and in another we 
>>> don't have any
>>> syncronization at all (patch 3).
>>>
>>> This patchset fixes both of them.
>>>
>>> Note that it results in a change to the behavior of zone_pcp_update(), 
>>> which is
>>> used by memory_hotplug. I _think_ that I've diserned (and preserved) the
>>> essential behavior (changing ->high and ->batch), and only eliminated 
>>> unneeded
>>> actions (draining the per cpu pages), but this may not be the case.
>>
>> at least, memory hotplug need to drain.
> 
> Could you explain why the drain is required here? From what I can tell,
> after the stop_machine() completes, the per cpu page sets could be
> repopulated at any point, making the combination of draining and
> modifying ->batch & ->high uneeded.

Then memory hotplug has to try to drain again and again. Moreover, hotplug
prevents repopulation by using MIGRATE_ISOLATE. Without the drain, the
pcp's page count never reaches 0, and that prevents memory hot remove.



Re: [PATCH] kernel: tsacct: strncpy, always be sure of NUL terminated.

2013-04-07 Thread KOSAKI Motohiro
On Sun, Apr 7, 2013 at 11:27 PM, Chen Gang  wrote:
>
>   for NUL terminated string, always set '\0' at the end.
>
> Signed-off-by: Chen Gang 
> ---
>  kernel/tsacct.c |3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/tsacct.c b/kernel/tsacct.c
> index a1dd9a1..01bcc4e 100644
> --- a/kernel/tsacct.c
> +++ b/kernel/tsacct.c
> @@ -78,7 +78,8 @@ void bacct_add_tsk(struct user_namespace *user_ns,
> stats->ac_minflt = tsk->min_flt;
> stats->ac_majflt = tsk->maj_flt;
>
> -   strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm));
> +   strncpy(stats->ac_comm, tsk->comm, sizeof(stats->ac_comm) - 1);
> +   stats->ac_comm[sizeof(stats->ac_comm) - 1] = '\0';

sizeof(tsk->comm) is 16 and sizeof(stats->ac_comm) is 32, so this
statement is strange. Also, set_task_comm() ensures tsk->comm is
NUL-terminated.

So your code never changes the behavior, right?

And if a buggy driver changes tsk->comm without using set_task_comm() and
tsk->comm is not NUL-terminated, strncpy() still touches unrelated memory,
and ac_comm may expose kernel-internal data. That's bad.


Re: [PATCH 3/3] mm: when handling percpu_pagelist_fraction, use on_each_cpu() to set percpu pageset fields.

2013-04-07 Thread KOSAKI Motohiro
(4/5/13 4:33 PM), Cody P Schafer wrote:
> In free_hot_cold_page(), we rely on pcp->batch remaining stable.
> Updating it without being on the cpu owning the percpu pageset
> potentially destroys this stability.
> 
> Change for_each_cpu() to on_each_cpu() to fix.
> 
> Signed-off-by: Cody P Schafer 

Acked-by: KOSAKI Motohiro 




Re: [PATCH 2/3] mm/page_alloc: convert zone_pcp_update() to use on_each_cpu() instead of stop_machine()

2013-04-07 Thread KOSAKI Motohiro
(4/5/13 4:33 PM), Cody P Schafer wrote:
> No off-cpu users of the percpu pagesets exist.
> 
> zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
> percpu pageset based on a zone's ->managed_pages. We don't need to drain
> the entire percpu pageset just to modify these fields. Avoid calling
> setup_pageset() (and the draining required to call it) and instead just
> set the fields' values.
> 
> This does change the behavior of zone_pcp_update() as the percpu
> pagesets will not be drained when zone_pcp_update() is called (they will
> end up being shrunk, not completely drained, later when a 0-order page
> is freed in free_hot_cold_page()).
> 
> Signed-off-by: Cody P Schafer 

NAK.

1) zone_pcp_update() is only used from memory hotplug, and it requires a page drain.
2) stop_machine() is used to avoid a race. Just removing it is insane.




Re: [PATCH 0/3] mm: fixup changers of per cpu pageset's ->high and ->batch

2013-04-07 Thread KOSAKI Motohiro
(4/5/13 4:33 PM), Cody P Schafer wrote:
> In one case while modifying the ->high and ->batch fields of per cpu pagesets
> we're unneededly using stop_machine() (patches 1 & 2), and in another we 
> don't have any
> syncronization at all (patch 3).
> 
> This patchset fixes both of them.
> 
> Note that it results in a change to the behavior of zone_pcp_update(), which 
> is
> used by memory_hotplug. I _think_ that I've diserned (and preserved) the
> essential behavior (changing ->high and ->batch), and only eliminated unneeded
> actions (draining the per cpu pages), but this may not be the case.

At least, memory hotplug needs the drain.




Re: [RFC][PATCH 0/9] extend hugepage migration

2013-04-07 Thread KOSAKI Motohiro
>> Please refer to hugetlb_fault for more information.
> 
> Thanks for your pointing out. So my assume is correct, is it? Can pmd 
> which support 2MB map 32MB pages work well?

Simon, please stop hijacking unrelated threads. This is not a question-and-answer
thread.



Re: [PATCH 4/4] fsfreeze: avoid to return zero in __get_user_pages

2013-04-06 Thread KOSAKI Motohiro
On Sat, Apr 6, 2013 at 6:07 AM, Marco Stornelli
 wrote:
> In case of VM_FAULT_RETRY, __get_user_pages returns the number
> of pages alredy gotten, but there isn't a check if this number is
> zero. Instead, we have to return a proper error code so we can avoid
> a possible extra call of __get_user_pages. There are several
> places where get_user_pages is called inside a loop until all the
> pages requested are gotten or an error code is returned.
>
> Signed-off-by: Marco Stornelli 
> ---
>  mm/memory.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 494526a..cca14ed 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1858,7 +1858,7 @@ long __get_user_pages(struct task_struct *tsk, struct 
> mm_struct *mm,
> if (ret & VM_FAULT_RETRY) {
> if (nonblocking)
> *nonblocking = 0;
> -   return i;
> +   return i ? i : -ERESTARTSYS;

The nonblocking argument is only used from __mm_populate(), and it expects
__get_user_pages() to return 0.


Re: [PATCH 10/10] prepare to remove /proc/sys/vm/hugepages_treat_as_movable

2013-04-05 Thread KOSAKI Motohiro
(3/25/13 11:12 AM), Michal Hocko wrote:
> On Fri 22-03-13 16:23:55, Naoya Horiguchi wrote:
> [...]
>> @@ -2086,11 +2085,7 @@ int hugetlb_treat_movable_handler(struct ctl_table 
>> *table, int write,
>>  void __user *buffer,
>>  size_t *length, loff_t *ppos)
>>  {
>> -proc_dointvec(table, write, buffer, length, ppos);
>> -if (hugepages_treat_as_movable)
>> -htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
>> -else
>> -htlb_alloc_mask = GFP_HIGHUSER;
>> +/* hugepages_treat_as_movable is obsolete and to be removed. */
> 
> WARN_ON_ONCE("This knob is obsolete and has no effect. It is scheduled for 
> removal")

Indeed.


Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage

2013-04-05 Thread KOSAKI Motohiro
(3/22/13 4:23 PM), Naoya Horiguchi wrote:
> Currently we can't offline memory blocks which contain hugepages because
> a hugepage is considered as an unmovable page. But now with this patch
> series, a hugepage has become movable, so by using hugepage migration we
> can offline such memory blocks.
> 
> What's different from other users of hugepage migration is that we need
> to decompose all the hugepages inside the target memory block into free
> buddy pages after hugepage migration, because otherwise free hugepages
> remaining in the memory block intervene the memory offlining.
> For this reason we introduce new functions dissolve_free_huge_page() and
> dissolve_free_huge_pages().
> 
> Other than that, what this patch does is straightforwardly to add hugepage
> migration code, that is, adding hugepage code to the functions which scan
> over pfn and collect hugepages to be migrated, and adding a hugepage
> allocation function to alloc_migrate_target().
> 
> As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
> over them because it's larger than memory block. So we now simply leave
> it to fail as it is.
> 
> ChangeLog v2:
>  - changed return value type of is_hugepage_movable() to bool
>  - is_hugepage_movable() uses list_for_each_entry() instead of *_safe()
>  - moved if(PageHuge) block before get_page_unless_zero() in 
> do_migrate_range()
>  - do_migrate_range() returns -EBUSY for hugepages larger than memory block
>  - dissolve_free_huge_pages() calculates scan step and sets it to minimum
>hugepage size
> 
> Signed-off-by: Naoya Horiguchi 
> ---
>  include/linux/hugetlb.h |  6 +
>  mm/hugetlb.c| 58 
> +
>  mm/memory_hotplug.c | 42 +++
>  mm/page_alloc.c | 12 ++
>  mm/page_isolation.c |  5 +
>  5 files changed, 114 insertions(+), 9 deletions(-)
> 
> diff --git v3.9-rc3.orig/include/linux/hugetlb.h 
> v3.9-rc3/include/linux/hugetlb.h
> index 981eff8..8220a8a 100644
> --- v3.9-rc3.orig/include/linux/hugetlb.h
> +++ v3.9-rc3/include/linux/hugetlb.h
> @@ -69,6 +69,7 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
>  void putback_active_hugepage(struct page *page);
>  void putback_active_hugepages(struct list_head *l);
>  void migrate_hugepage_add(struct page *page, struct list_head *list);
> +bool is_hugepage_movable(struct page *page);
>  void copy_huge_page(struct page *dst, struct page *src);
>  
>  extern unsigned long hugepages_treat_as_movable;
> @@ -134,6 +135,7 @@ static inline int dequeue_hwpoisoned_huge_page(struct 
> page *page)
>  #define putback_active_hugepage(p) 0
>  #define putback_active_hugepages(l) 0
>  #define migrate_hugepage_add(p, l) 0
> +#define is_hugepage_movable(x) 0

should be false instead of 0.


>  static inline void copy_huge_page(struct page *dst, struct page *src)
>  {
>  }
> @@ -356,6 +358,9 @@ static inline int hstate_index(struct hstate *h)
>   return h - hstates;
>  }
>  
> +extern void dissolve_free_huge_pages(unsigned long start_pfn,
> +  unsigned long end_pfn);
> +
>  #else
>  struct hstate {};
>  #define alloc_huge_page(v, a, r) NULL
> @@ -376,6 +381,7 @@ static inline unsigned int pages_per_huge_page(struct 
> hstate *h)
>  }
>  #define hstate_index_to_shift(index) 0
>  #define hstate_index(h) 0
> +#define dissolve_free_huge_pages(s, e) 0

no need for the 0.

>  #endif
>  
>  #endif /* _LINUX_HUGETLB_H */
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index d9d3dd7..ef79871 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -844,6 +844,36 @@ static int free_pool_huge_page(struct hstate *h, 
> nodemask_t *nodes_allowed,
>   return ret;
>  }
>  
> +/* Dissolve a given free hugepage into free pages. */
> +static void dissolve_free_huge_page(struct page *page)
> +{
> + spin_lock(&hugetlb_lock);
> + if (PageHuge(page) && !page_count(page)) {
> + struct hstate *h = page_hstate(page);
> + int nid = page_to_nid(page);
> + list_del(&page->lru);
> + h->free_huge_pages--;
> + h->free_huge_pages_node[nid]--;
> + update_and_free_page(h, page);
> + }
> + spin_unlock(&hugetlb_lock);
> +}
> +
> +/* Dissolve free hugepages in a given pfn range. Used by memory hotplug. */
> +void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + unsigned int order = 8 * sizeof(void *);
> + unsigned long pfn;
> + struct hstate *h;
> +
> + /* Set scan step to minimum hugepage size */
> + for_each_hstate(h)
> + if (order > huge_page_order(h))
> + order = huge_page_order(h);
> + for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
> + dissolve_free_huge_page(pfn_to_page(pfn));
> +}

hotplug.c must not have such a pure hugepage function.


> +
>  static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
> 

Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()

2013-04-05 Thread KOSAKI Motohiro
>> -if (!new_hpage)
>> +/*
>> + * Getting a new hugepage with alloc_huge_page() (which can happen
>> + * when migration is caused by mbind()) can return ERR_PTR value,
>> + * so we need take care of the case here.
>> + */
>> +if (!new_hpage || IS_ERR_VALUE(new_hpage))
>>  return -ENOMEM;
> 
> Please no. get_new_page returns NULL or a page. You are hooking a wrong
> callback here. The error value doesn't make any sense here. IMO you
> should just wrap alloc_huge_page by something that returns NULL or page.

I suggest just the opposite way: new_vma_page() always returns -ENOMEM,
-ENOSPC etc. instead of NULL, and the caller propagates it to userland.
I guess userland wants to distinguish why mbind() failed.

Anyway, if new_vma_page() has a chance to return both NULL and -ENOMEM,
that's a bug.



Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()

2013-04-05 Thread KOSAKI Motohiro
> @@ -1277,14 +1279,10 @@ static long do_mbind(unsigned long start, unsigned 
> long len,
>   if (!err) {
>   int nr_failed = 0;
>  
> - if (!list_empty(&pagelist)) {
> - WARN_ON_ONCE(flags & MPOL_MF_LAZY);
> - nr_failed = migrate_pages(&pagelist, new_vma_page,
> + WARN_ON_ONCE(flags & MPOL_MF_LAZY);

???
MPOL_MF_LAZY always outputs a warning? That seems really insane.




Re: [PATCH 05/10] migrate: add hugepage migration code to migrate_pages()

2013-04-05 Thread KOSAKI Motohiro
(3/22/13 4:23 PM), Naoya Horiguchi wrote:
> This patch extends check_range() to handle vma with VM_HUGETLB set.
> We will be able to migrate hugepage with migrate_pages(2) after
> applying the enablement patch which comes later in this series.
> 
> Note that for larger hugepages (covered by pud entries, 1GB for
> x86_64 for example), we simply skip it now.

check_range() largely duplicates mm_walk, and it is a quirky subset of it.
Instead, could you replace it with mm_walk and enhance/clean up mm_walk?




Re: [PATCH 04/10] migrate: clean up migrate_huge_page()

2013-04-05 Thread KOSAKI Motohiro
(3/22/13 4:23 PM), Naoya Horiguchi wrote:
> Due to the previous patch, soft_offline_huge_page() switches to use
> migrate_pages(), and migrate_huge_page() is not used any more.
> So let's remove it.
> 
> Signed-off-by: Naoya Horiguchi 

Acked-by: KOSAKI Motohiro 





Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()

2013-04-05 Thread KOSAKI Motohiro
(3/27/13 9:00 AM), Michal Hocko wrote:
> On Tue 26-03-13 16:35:35, Naoya Horiguchi wrote:
> [...]
>> The differences is that migrate_huge_page() has one hugepage as an argument,
>> and migrate_pages() has a pagelist with multiple hugepages.
>> I already told this before and I'm not sure it's enough to answer the 
>> question,
>> so I explain another point about why this patch do like it.
> 
> OK, I am blind. It is
> +   list_move(&hpage->lru, &pagelist);
> +   ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> +   MIGRATE_SYNC, MR_MEMORY_FAILURE);
> 
> which moves it from active_list and so you have to put it back.
> 
>> I think that we must do putback_*pages() for source pages whether migration
>> succeeds or not.
>> But when we call migrate_pages() with a pagelist,
>> the caller can't access to the successfully migrated source pages
>> after migrate_pages() returns, because they are no longer on the pagelist.
>> So putback of the successfully migrated source pages should be done *in*
>> unmap_and_move() and/or unmap_and_move_huge_page().
> 
> If the migration succeeds then the page becomes unused and free after
> its last reference drops. So I do not see any reason to put it back to
> active list and free it right afterwards.
> On the other hand unmap_and_move does the same thing (although page
> reference counting is a bit more complicated in that case) so it would
> be good to keep in sync with regular pages case.

Even if pages are isolated from the lists, there are several page-count-increasing
paths. So putback_*pages() closes a race when page count != 1.

I'm not sure, but I guess follow_hugepage() can make the same race.




Re: [PATCH 02/10] migrate: make core migration code aware of hugepage

2013-04-05 Thread KOSAKI Motohiro
>>> There doesn't seem to be any caller for this function. Please move it to
>>> the patch which uses it.
>>
>> I would do like that if there's only one user of this function, but I thought
>> that it's better to separate this part as changes of common code
>> because this function is commonly used by multiple users which are added by
>> multiple patches later in this series.
> 
> Sure there is no hard rule for this. I just find it much easier to
> review if there is a caller of introduced functionality. In this
> particular case I found out only later that many migrate_pages callers
> were changed to use mograte_movable_pages and made the
> putback_movable_pages cleanup inconsistent between the two.
> 
> It would help to mention what is the planned future usage of the
> introduced function if you prefer to introduce it without users.

I strongly agree with Michal.



Re: [PATCH 01/10] migrate: add migrate_entry_wait_huge()

2013-04-05 Thread KOSAKI Motohiro
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index 0a0be33..98a478e 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -2819,7 +2819,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   if (ptep) {
>   entry = huge_ptep_get(ptep);
>   if (unlikely(is_hugetlb_entry_migration(entry))) {
> - migration_entry_wait(mm, (pmd_t *)ptep, address);
> + migration_entry_wait_huge(mm, (pmd_t *)ptep, address);

Hm.

How do you test this? From the x86 point of view, this patch seems unnecessary,
because hugetlb_fault() calls "address &= huge_page_mask(h)" first, and then
migration_entry_wait() could grab the right pte lock. And from the !x86 point of
view, this function still doesn't work, because a huge page != a pmd on some
archs.

I might be missing something though.


Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size()

2013-04-05 Thread KOSAKI Motohiro
(4/3/13 2:35 PM), Naoya Horiguchi wrote:
> Documentation/filesystems/proc.txt says about coredump_filter bitmask,
> 
>   Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only
>   effected by bit 5-6.
> 
> However current code can go into the subsequent flag checks of bit 0-4
> for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work
> as written in the document.
> 
> Signed-off-by: Naoya Horiguchi 
> Cc: sta...@vger.kernel.org

If I were you, I would merge this patch into [1/3], because both patches treat
the same regression. But it is no big matter.

Acked-by: KOSAKI Motohiro 




Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()

2013-04-05 Thread KOSAKI Motohiro
(4/3/13 2:35 PM), Naoya Horiguchi wrote:
> With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
> initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
> error happens on a hugepage and the affected processes try to access
> the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0)
> in get_page().
> 
> The reason for this bug is that coredump-related code doesn't recognise
> "hugepage hwpoison entry" with which a pmd entry is replaced when a memory
> error occurs on a hugepage.
> In other words, physical address information is stored in different bit layout
> between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page()
> which is called in get_dump_page() returns a wrong page from a given address.
> 
> We need to filter out only hwpoison hugepages to have data on healthy
> hugepages in coredump. So this patch makes follow_hugetlb_page() avoid
> trying to get page when a pmd is in swap entry like format.
> 
> ChangeLog v3:
>  - add comment about using is_swap_pte()
> 
> Signed-off-by: Naoya Horiguchi 
> Reviewed-by: Michal Hocko 
> Acked-by: Konstantin Khlebnikov 
> Cc: sta...@vger.kernel.org
> ---
>  mm/hugetlb.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
> index 0d1705b..3bc20bd 100644
> --- v3.9-rc3.orig/mm/hugetlb.c
> +++ v3.9-rc3/mm/hugetlb.c
> @@ -2966,9 +2966,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>* Some archs (sparc64, sh*) have multiple pte_ts to
>* each hugepage.  We have to make sure we get the
>* first, for the page indexing below to work.
> +  *
> +  * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned
> +  * and hugepages under migration in which case
> +  * hugetlb_fault waits for the migration and bails out
> +  * properly for HWPosined pages.
>*/
>   pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
> - absent = !pte || huge_pte_none(huge_ptep_get(pte));
> + absent = !pte || huge_pte_none(huge_ptep_get(pte)) ||
> + is_swap_pte(huge_ptep_get(pte));

Hmmm...

Now "absent" has two meanings: 1) skip hugetlb_fault() and return immediately
if FOLL_DUMP is used, and 2) call hugetlb_fault() if page population or cow is
needed.

The description of this patch only explains (2), and I'm not convinced why
we don't need to
dump pages under migration.



Re: [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)

2013-04-05 Thread KOSAKI Motohiro
(4/3/13 2:35 PM), Naoya Horiguchi wrote:
> Currently we fail to include any data on hugepages into coredump,
> because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently
> introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED and
> mm->reserved_vm counter". This looks to me a serious regression,
> so let's fix it.

I don't think this is enough explanation. Let me explain the code's meaning
in chronological order.

First, we had neither madvise(DONTDUMP) nor coredump_filter(HUGETLB); back
then, hugetlb pages were never dumped.

Second, I added coredump_filter(HUGETLB), and then vm_dump_size() became..

vm_dump_size()
{
/* Hugetlb memory check */
if (vma->vm_flags & VM_HUGETLB) {
..
goto whole;
}
if (vma->vm_flags & VM_RESERVED)
return 0;

The point is, hugetlb was checked before VM_RESERVED, i.e. hugetlb core dump
ignored VM_RESERVED. At that time, the "if (vma->vm_flags & VM_HUGETLB)"
statement didn't need to return 0, because VM_RESERVED prevented going into
the subsequent flag checks.

Third, Jason added madvise(DONTDUMP), and then vm_dump_size() became...

vm_dump_size()
{
if (vma->vm_flags & VM_NODUMP)
return 0;

/* Hugetlb memory check */
if (vma->vm_flags & VM_HUGETLB) {
..
goto whole;
}
if (vma->vm_flags & VM_RESERVED)
return 0;

Look, VM_NODUMP and VM_RESERVED had similar but different meanings at this time.

Finally, Konstantin removed VM_RESERVED and hugetlb coredump behavior
has been changed.

Thus, patch [1/3] and [2/3] should be marked [stable for v3.6 or later].

Anyway, this patch is correct. Thank you!

Acked-by: KOSAKI Motohiro 




[PATCH] posix-cpu-timers: fix counting delta_exec twice

2013-04-05 Thread kosaki . motohiro
From: KOSAKI Motohiro 

Currently the glibc rt/tst-cpuclock2 test(*) sporadically fails. The
pseudo test code is here.

 t0 = clock_gettime()
 abs = t0 + sleeptime;
 clock_nanosleep(TIMER_ABSTIME, abs)
 t1 = clock_gettime()
 assert(t1-t0 > sleeptime)
 assert(t1 > abs)

This happens because the current signal->cputimer->cputime logic can wrongly
add delta_exec twice. The first time is at initialization:
thread_group_cputimer() calls thread_group_cputime(), which adds delta_exec;
later, update_curr() adds delta_exec again.

As a result, clock_nanosleep() wakes up slightly earlier than the passed
argument, which is a POSIX violation.

This issue was introduced by commit d670ec1317 (posix-cpu-timers:
Cure SMP wobbles).

(*) 
http://sourceware.org/git/?p=glibc.git;a=blob;f=rt/tst-cpuclock2.c;h=6752721717f959e89c0d692b3f1ee082d507eec2;hb=HEAD

Cc: Peter Zijlstra 
Cc: David Miller 
Cc: sta...@kernel.org
Cc: Thomas Gleixner 
Cc: Frederic Weisbecker 
Signed-off-by: KOSAKI Motohiro 
---
 fs/binfmt_elf.c   |2 +-
 fs/binfmt_elf_fdpic.c |2 +-
 include/linux/sched.h |4 ++--
 kernel/posix-cpu-timers.c |   10 +-
 kernel/sched/core.c   |6 --
 kernel/sched/cputime.c|8 
 6 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 3939829..e139e18 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1321,7 +1321,7 @@ static void fill_prstatus(struct elf_prstatus *prstatus,
 * This is the record for the group leader.  It shows the
 * group-wide total, not its individual thread total.
 */
-   thread_group_cputime(p, &cputime);
+   thread_group_cputime(p, true, &cputime);
cputime_to_timeval(cputime.utime, &prstatus->pr_utime);
cputime_to_timeval(cputime.stime, &prstatus->pr_stime);
} else {
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 9c13e02..ab5b508 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1371,7 +1371,7 @@ static void fill_prstatus(struct elf_prstatus *prstatus,
 * This is the record for the group leader.  It shows the
 * group-wide total, not its individual thread total.
 */
-   thread_group_cputime(p, &cputime);
+   thread_group_cputime(p, true, &cputime);
cputime_to_timeval(cputime.utime, &prstatus->pr_utime);
cputime_to_timeval(cputime.stime, &prstatus->pr_stime);
} else {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d35d2b6..deb49f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2001,7 +2001,7 @@ static inline void disable_sched_clock_irqtime(void) {}
 #endif
 
 extern unsigned long long
-task_sched_runtime(struct task_struct *task);
+task_sched_runtime(struct task_struct *task, bool use_delta);
 
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
@@ -2624,7 +2624,7 @@ static inline int spin_needbreak(spinlock_t *lock)
 /*
  * Thread group CPU time accounting.
  */
-void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times);
+void thread_group_cputime(struct task_struct *tsk, bool use_delta, struct 
task_cputime *times);
 void thread_group_cputimer(struct task_struct *tsk, struct task_cputime 
*times);
 
 static inline void thread_group_cputime_init(struct signal_struct *sig)
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 8fd709c..948004d 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -220,7 +220,7 @@ static int cpu_clock_sample(const clockid_t which_clock, 
struct task_struct *p,
cpu->cpu = virt_ticks(p);
break;
case CPUCLOCK_SCHED:
-   cpu->sched = task_sched_runtime(p);
+   cpu->sched = task_sched_runtime(p, true);
break;
}
return 0;
@@ -251,7 +251,7 @@ void thread_group_cputimer(struct task_struct *tsk, struct 
task_cputime *times)
 * to synchronize the timer to the clock every time we start
 * it.
 */
-   thread_group_cputime(tsk, &sum);
+   thread_group_cputime(tsk, false, &sum);
raw_spin_lock_irqsave(&cputimer->lock, flags);
cputimer->running = 1;
update_gt_cputime(&cputimer->cputime, &sum);
@@ -275,15 +275,15 @@ static int cpu_clock_sample_group(const clockid_t 
which_clock,
default:
return -EINVAL;
case CPUCLOCK_PROF:
-   thread_group_cputime(p, &cputime);
+   thread_group_cputime(p, true, &cputime);
cpu->cpu = cputime.utime + cputime.stime;
break;
case CPUCLOCK_VIRT:
-   thread_group_cputime(p, &cputime);
+   thread_group_cputime(p, true, &cputime);
cpu->cpu = cputime.utime;
break;
case CPUCLOCK_SCHED:
-   thread_group_cputime(p, &cputime);
+   thread_group_cputime(p, true, &cputime);


Re: [PATCH v3 1/3] hugetlbfs: stop setting VM_DONTDUMP in initializing vma(VM_HUGETLB)

2013-04-05 Thread KOSAKI Motohiro
(4/3/13 2:35 PM), Naoya Horiguchi wrote:
 Currently we fail to include any data on hugepages into coredump,
 because VM_DONTDUMP is set on hugetlbfs's vma. This behavior was recently
 introduced by commit 314e51b98 "mm: kill vma flag VM_RESERVED and
 mm->reserved_vm counter". This looks to me a serious regression,
 so let's fix it.

I don't think this is enough explanation. Let me explain what the code
meant at each point in time, in chronological order.

First, we had neither madvise(DONTDUMP) nor coredump_filter(HUGETLB), so
hugetlb pages were never dumped.

Second, I added coredump_filter(HUGETLB), and then vm_dump_size() became:

vm_dump_size()
{
/* Hugetlb memory check */
if (vma->vm_flags & VM_HUGETLB) {
..
goto whole;
}
if (vma->vm_flags & VM_RESERVED)
return 0;

The point is that hugetlb was checked before VM_RESERVED, i.e. hugetlb
coredump ignored VM_RESERVED. At this time, the if (vma->vm_flags &
VM_HUGETLB) statement didn't need a return 0, because VM_RESERVED
prevented falling through to the subsequent flag checks.

Third, Jason added madvise(DONTDUMP), and then vm_dump_size() became:

vm_dump_size()
{
   if (vma->vm_flags & VM_NODUMP)
   return 0;

/* Hugetlb memory check */
if (vma->vm_flags & VM_HUGETLB) {
..
goto whole;
}
if (vma->vm_flags & VM_RESERVED)
return 0;

Look, VM_NODUMP and VM_RESERVED had similar but not identical meanings at this time.

Finally, Konstantin removed VM_RESERVED, and the hugetlb coredump behavior
changed.
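The ordering point above can be made concrete with a small userspace sketch (a hedged illustration, not the real vm_dump_size(); the flag values and the function are invented):

```c
#include <assert.h>

/* Sketch of the flag-check ordering discussed above: with the hugetlb
 * test placed before VM_RESERVED, a reserved hugetlb vma still dumps,
 * while a VM_NODUMP test that runs first suppresses the dump. */
#define VM_HUGETLB  0x1UL
#define VM_RESERVED 0x2UL
#define VM_NODUMP   0x4UL

/* Returns 1 = dump the whole vma, 0 = skip it. */
static int dump_decision(unsigned long vm_flags)
{
    if (vm_flags & VM_NODUMP)
        return 0;
    if (vm_flags & VM_HUGETLB)
        return 1;                       /* "goto whole" */
    if (vm_flags & VM_RESERVED)
        return 0;
    return 1;
}
```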

Thus, patch [1/3] and [2/3] should be marked [stable for v3.6 or later].

Anyway, this patch is correct. Thank you!

Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com




Re: [PATCH v3 3/3] hugetlbfs: add swap entry check in follow_hugetlb_page()

2013-04-05 Thread KOSAKI Motohiro
(4/3/13 2:35 PM), Naoya Horiguchi wrote:
 With applying the previous patch "hugetlbfs: stop setting VM_DONTDUMP in
 initializing vma(VM_HUGETLB)" to reenable hugepage coredump, if a memory
 error happens on a hugepage and the affected processes try to access
 the error hugepage, we hit VM_BUG_ON(atomic_read(&page->_count) <= 0)
 in get_page().
 
 The reason for this bug is that coredump-related code doesn't recognise
 hugepage hwpoison entry with which a pmd entry is replaced when a memory
 error occurs on a hugepage.
 In other words, physical address information is stored in different bit layout
 between hugepage hwpoison entry and pmd entry, so follow_hugetlb_page()
 which is called in get_dump_page() returns a wrong page from a given address.
 
 We need to filter out only hwpoison hugepages to have data on healthy
 hugepages in coredump. So this patch makes follow_hugetlb_page() avoid
 trying to get page when a pmd is in swap entry like format.
 
 ChangeLog v3:
  - add comment about using is_swap_pte()
 
 Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 Reviewed-by: Michal Hocko mho...@suse.cz
 Acked-by: Konstantin Khlebnikov khlebni...@openvz.org
 Cc: sta...@vger.kernel.org
 ---
  mm/hugetlb.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)
 
 diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
 index 0d1705b..3bc20bd 100644
 --- v3.9-rc3.orig/mm/hugetlb.c
 +++ v3.9-rc3/mm/hugetlb.c
 @@ -2966,9 +2966,15 @@ long follow_hugetlb_page(struct mm_struct *mm, struct 
 vm_area_struct *vma,
* Some archs (sparc64, sh*) have multiple pte_ts to
* each hugepage.  We have to make sure we get the
* first, for the page indexing below to work.
 +  *
 +  * is_swap_pte test covers both is_hugetlb_entry_hwpoisoned
 +  * and hugepages under migration in which case
 +  * hugetlb_fault waits for the migration and bails out
 +  * properly for HWPosined pages.
*/
   pte = huge_pte_offset(mm, vaddr & huge_page_mask(h));
 - absent = !pte || huge_pte_none(huge_ptep_get(pte));
 + absent = !pte || huge_pte_none(huge_ptep_get(pte)) ||
 + is_swap_pte(huge_ptep_get(pte));

Hmmm...

Now absent has two meanings: 1) skip hugetlb_fault() and return immediately
if FOLL_DUMP is used; 2) call hugetlb_fault() if page population or CoW is
needed.

The description of this patch only explains (2), and I'm not convinced why
we don't need to dump pages under migration.



Re: [PATCH v3 2/3] fix hugetlb memory check in vma_dump_size()

2013-04-05 Thread KOSAKI Motohiro
(4/3/13 2:35 PM), Naoya Horiguchi wrote:
 Documentation/filesystems/proc.txt says about coredump_filter bitmask,
 
   Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only
   effected by bit 5-6.
 
 However current code can go into the subsequent flag checks of bit 0-4
 for vma(VM_HUGETLB). So this patch inserts 'return' and makes it work
 as written in the document.
 
 Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 Cc: sta...@vger.kernel.org

If I were you, I would merge this patch into [1/3], because both patches
address the same regression. But it is no big matter.

Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com




Re: [PATCH 01/10] migrate: add migrate_entry_wait_huge()

2013-04-05 Thread KOSAKI Motohiro
 diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
 index 0a0be33..98a478e 100644
 --- v3.9-rc3.orig/mm/hugetlb.c
 +++ v3.9-rc3/mm/hugetlb.c
 @@ -2819,7 +2819,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
 vm_area_struct *vma,
   if (ptep) {
   entry = huge_ptep_get(ptep);
   if (unlikely(is_hugetlb_entry_migration(entry))) {
 - migration_entry_wait(mm, (pmd_t *)ptep, address);
 + migration_entry_wait_huge(mm, (pmd_t *)ptep, address);

Hm.

How do you test this? From the x86 point of view, this patch seems
unnecessary, because hugetlb_fault() calls address &= huge_page_mask()
first, and then migration_entry_wait() can grab the right pte lock. And
from the !x86 point of view, this function still doesn't work, because
huge page != pmd on some arches.

I might be missing something, though.




Re: [PATCH 02/10] migrate: make core migration code aware of hugepage

2013-04-05 Thread KOSAKI Motohiro
 There doesn't seem to be any caller for this function. Please move it to
 the patch which uses it.

 I would do like that if there's only one user of this function, but I thought
 that it's better to separate this part as changes of common code
 because this function is commonly used by multiple users which are added by
 multiple patches later in this series.
 
 Sure there is no hard rule for this. I just find it much easier to
 review if there is a caller of introduced functionality. In this
 particular case I found out only later that many migrate_pages callers
 were changed to use migrate_movable_pages and made the
 putback_movable_pages cleanup inconsistent between the two.
 
 It would help to mention what is the planned future usage of the
 introduced function if you prefer to introduce it without users.

I strongly agree with Michal.



Re: [PATCH 03/10] soft-offline: use migrate_pages() instead of migrate_huge_page()

2013-04-05 Thread KOSAKI Motohiro
(3/27/13 9:00 AM), Michal Hocko wrote:
 On Tue 26-03-13 16:35:35, Naoya Horiguchi wrote:
 [...]
 The differences is that migrate_huge_page() has one hugepage as an argument,
 and migrate_pages() has a pagelist with multiple hugepages.
 I already told this before and I'm not sure it's enough to answer the 
 question,
 so I explain another point about why this patch do like it.
 
 OK, I am blind. It is
  +   list_move(&hpage->lru, &pagelist);
  +   ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
  +   MIGRATE_SYNC, MR_MEMORY_FAILURE);
 
 which moves it from active_list and so you have to put it back.
 
 I think that we must do putback_*pages() for source pages whether migration
 succeeds or not.
 But when we call migrate_pages() with a pagelist,
 the caller can't access to the successfully migrated source pages
 after migrate_pages() returns, because they are no longer on the pagelist.
 So putback of the successfully migrated source pages should be done *in*
 unmap_and_move() and/or unmap_and_move_huge_page().
 
 If the migration succeeds then the page becomes unused and free after
 its last reference drops. So I do not see any reason to put it back to
 active list and free it right afterwards.
 On the other hand unmap_and_move does the same thing (although page
 reference counting is a bit more complicated in that case) so it would
 be good to keep in sync with regular pages case.

Even if pages are isolated from the lists, there are several paths that
increase the page count. So putback_*pages() closes a race when the page
count != 1.

I'm not sure, but I guess follow_hugetlb_page() can hit the same race.




Re: [PATCH 04/10] migrate: clean up migrate_huge_page()

2013-04-05 Thread KOSAKI Motohiro
(3/22/13 4:23 PM), Naoya Horiguchi wrote:
 Due to the previous patch, soft_offline_huge_page() switches to use
 migrate_pages(), and migrate_huge_page() is not used any more.
 So let's remove it.
 
 Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com

Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com





Re: [PATCH 05/10] migrate: add hugepage migration code to migrate_pages()

2013-04-05 Thread KOSAKI Motohiro
(3/22/13 4:23 PM), Naoya Horiguchi wrote:
 This patch extends check_range() to handle vma with VM_HUGETLB set.
 We will be able to migrate hugepage with migrate_pages(2) after
 applying the enablement patch which comes later in this series.
 
 Note that for larger hugepages (covered by pud entries, 1GB for
 x86_64 for example), we simply skip it now.

check_range() largely duplicates mm_walk, and it is a quirky subset of it.
Instead, could you replace it with mm_walk and enhance/clean up mm_walk?




Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()

2013-04-05 Thread KOSAKI Motohiro
 @@ -1277,14 +1279,10 @@ static long do_mbind(unsigned long start, unsigned 
 long len,
   if (!err) {
   int nr_failed = 0;
  
  - if (!list_empty(&pagelist)) {
  - WARN_ON_ONCE(flags & MPOL_MF_LAZY);
  - nr_failed = migrate_pages(&pagelist, new_vma_page,
  + WARN_ON_ONCE(flags & MPOL_MF_LAZY);

???
MPOL_MF_LAZY now always outputs a warning? It seems really insane.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 07/10] mbind: add hugepage migration code to mbind()

2013-04-05 Thread KOSAKI Motohiro
 -if (!new_hpage)
 +/*
 + * Getting a new hugepage with alloc_huge_page() (which can happen
 + * when migration is caused by mbind()) can return ERR_PTR value,
 + * so we need take care of the case here.
 + */
 +if (!new_hpage || IS_ERR_VALUE(new_hpage))
  return -ENOMEM;
 
 Please no. get_new_page returns NULL or a page. You are hooking a wrong
 callback here. The error value doesn't make any sense here. IMO you
 should just wrap alloc_huge_page by something that returns NULL or page.

I suggest just the opposite way: new_vma_page() should always return -ENOMEM,
-ENOSPC etc. instead of NULL, and the caller should propagate it to userland.
I guess userland wants to distinguish why mbind() failed.

Anyway, if new_vma_page() has a chance to return both NULL and -ENOMEM,
that's a bug.
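For illustration, the error-pointer convention that the quoted hunk guards against can be re-implemented in userspace as follows (a hedged minimal sketch; the real macros live in the kernel's include/linux/err.h):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal userspace version of the kernel's error-pointer convention:
 * an allocator may encode a negative errno in the pointer itself, so a
 * caller cannot test only for NULL.  The topmost 4095 addresses are
 * reserved for encoded errnos, exactly like -MAX_ERRNO in the kernel. */
#define MAX_ERRNO 4095

static void *ERR_PTR(long error) { return (void *)error; }
static long PTR_ERR(const void *ptr) { return (long)ptr; }
static int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

Under this convention a NULL check alone misses encoded errors, which is the bug being discussed.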



Re: [PATCH 09/10] memory-hotplug: enable memory hotplug to handle hugepage

2013-04-05 Thread KOSAKI Motohiro
(3/22/13 4:23 PM), Naoya Horiguchi wrote:
 Currently we can't offline memory blocks which contain hugepages because
 a hugepage is considered as an unmovable page. But now with this patch
 series, a hugepage has become movable, so by using hugepage migration we
 can offline such memory blocks.
 
 What's different from other users of hugepage migration is that we need
 to decompose all the hugepages inside the target memory block into free
 buddy pages after hugepage migration, because otherwise free hugepages
 remaining in the memory block intervene the memory offlining.
 For this reason we introduce new functions dissolve_free_huge_page() and
 dissolve_free_huge_pages().
 
 Other than that, what this patch does is straightforwardly to add hugepage
 migration code, that is, adding hugepage code to the functions which scan
 over pfn and collect hugepages to be migrated, and adding a hugepage
 allocation function to alloc_migrate_target().
 
 As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
 over them because it's larger than memory block. So we now simply leave
 it to fail as it is.
 
 ChangeLog v2:
  - changed return value type of is_hugepage_movable() to bool
  - is_hugepage_movable() uses list_for_each_entry() instead of *_safe()
  - moved if(PageHuge) block before get_page_unless_zero() in 
 do_migrate_range()
  - do_migrate_range() returns -EBUSY for hugepages larger than memory block
  - dissolve_free_huge_pages() calculates scan step and sets it to minimum
hugepage size
 
 Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 ---
  include/linux/hugetlb.h |  6 +
  mm/hugetlb.c| 58 
 +
  mm/memory_hotplug.c | 42 +++
  mm/page_alloc.c | 12 ++
  mm/page_isolation.c |  5 +
  5 files changed, 114 insertions(+), 9 deletions(-)
 
 diff --git v3.9-rc3.orig/include/linux/hugetlb.h 
 v3.9-rc3/include/linux/hugetlb.h
 index 981eff8..8220a8a 100644
 --- v3.9-rc3.orig/include/linux/hugetlb.h
 +++ v3.9-rc3/include/linux/hugetlb.h
 @@ -69,6 +69,7 @@ int dequeue_hwpoisoned_huge_page(struct page *page);
  void putback_active_hugepage(struct page *page);
  void putback_active_hugepages(struct list_head *l);
  void migrate_hugepage_add(struct page *page, struct list_head *list);
 +bool is_hugepage_movable(struct page *page);
  void copy_huge_page(struct page *dst, struct page *src);
  
  extern unsigned long hugepages_treat_as_movable;
 @@ -134,6 +135,7 @@ static inline int dequeue_hwpoisoned_huge_page(struct 
 page *page)
  #define putback_active_hugepage(p) 0
  #define putback_active_hugepages(l) 0
  #define migrate_hugepage_add(p, l) 0
 +#define is_hugepage_movable(x) 0

should be false instead of 0.


  static inline void copy_huge_page(struct page *dst, struct page *src)
  {
  }
 @@ -356,6 +358,9 @@ static inline int hstate_index(struct hstate *h)
   return h - hstates;
  }
  
 +extern void dissolve_free_huge_pages(unsigned long start_pfn,
 +  unsigned long end_pfn);
 +
  #else
  struct hstate {};
  #define alloc_huge_page(v, a, r) NULL
 @@ -376,6 +381,7 @@ static inline unsigned int pages_per_huge_page(struct 
 hstate *h)
  }
  #define hstate_index_to_shift(index) 0
  #define hstate_index(h) 0
 +#define dissolve_free_huge_pages(s, e) 0

no need for the 0.

  #endif
  
  #endif /* _LINUX_HUGETLB_H */
 diff --git v3.9-rc3.orig/mm/hugetlb.c v3.9-rc3/mm/hugetlb.c
 index d9d3dd7..ef79871 100644
 --- v3.9-rc3.orig/mm/hugetlb.c
 +++ v3.9-rc3/mm/hugetlb.c
 @@ -844,6 +844,36 @@ static int free_pool_huge_page(struct hstate *h, 
 nodemask_t *nodes_allowed,
   return ret;
  }
  
 +/* Dissolve a given free hugepage into free pages. */
 +static void dissolve_free_huge_page(struct page *page)
 +{
  + spin_lock(&hugetlb_lock);
  + if (PageHuge(page) && !page_count(page)) {
  + struct hstate *h = page_hstate(page);
  + int nid = page_to_nid(page);
  + list_del(&page->lru);
  + h->free_huge_pages--;
  + h->free_huge_pages_node[nid]--;
  + update_and_free_page(h, page);
  + }
  + spin_unlock(&hugetlb_lock);
 +}
 +
 +/* Dissolve free hugepages in a given pfn range. Used by memory hotplug. */
 +void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
 +{
 + unsigned int order = 8 * sizeof(void *);
 + unsigned long pfn;
 + struct hstate *h;
 +
 + /* Set scan step to minimum hugepage size */
 + for_each_hstate(h)
 + if (order  huge_page_order(h))
 + order = huge_page_order(h);
 + for (pfn = start_pfn; pfn  end_pfn; pfn += 1  order)
 + dissolve_free_huge_page(pfn_to_page(pfn));
 +}

hotplug.c must not have such pure huge page function.
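As an aside, the scan-step selection in the quoted dissolve_free_huge_pages() hunk can be sketched in userspace like this (a hedged illustration: the order list stands in for for_each_hstate(), and the pfn range is invented):

```c
#include <assert.h>

/* Sketch of the stepping logic: pick the smallest configured hugepage
 * order and walk the pfn range in strides of 1 << order, so the head
 * pfn of every minimum-sized hugepage is visited exactly once. */
static unsigned long count_scan_steps(const unsigned int *orders, int n,
                                      unsigned long start_pfn,
                                      unsigned long end_pfn)
{
    unsigned int order = 8 * sizeof(void *);    /* "impossibly large" start */
    unsigned long pfn, steps = 0;
    int i;

    /* Set scan step to the minimum hugepage order, as in the hunk. */
    for (i = 0; i < n; i++)
        if (order > orders[i])
            order = orders[i];
    for (pfn = start_pfn; pfn < end_pfn; pfn += 1UL << order)
        steps++;
    return steps;
}
```

With x86_64's 2MB (order 9) and 1GB (order 18) hstates, the walk strides by order 9 so smaller hugepages are not skipped.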


 +
  static struct page *alloc_buddy_huge_page(struct hstate *h, int nid)
  {
   struct page *page;
 @@ -3155,6 +3185,34 @@ static int 

Re: [PATCH 10/10] prepare to remove /proc/sys/vm/hugepages_treat_as_movable

2013-04-05 Thread KOSAKI Motohiro
(3/25/13 11:12 AM), Michal Hocko wrote:
 On Fri 22-03-13 16:23:55, Naoya Horiguchi wrote:
 [...]
 @@ -2086,11 +2085,7 @@ int hugetlb_treat_movable_handler(struct ctl_table 
 *table, int write,
  void __user *buffer,
  size_t *length, loff_t *ppos)
  {
 -proc_dointvec(table, write, buffer, length, ppos);
 -if (hugepages_treat_as_movable)
 -htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
 -else
 -htlb_alloc_mask = GFP_HIGHUSER;
 +/* hugepages_treat_as_movable is obsolete and to be removed. */
 
  WARN_ON_ONCE("This knob is obsolete and has no effect. It is scheduled for 
  removal")

Indeed.


Re: [patch] mm: speedup in __early_pfn_to_nid

2013-03-23 Thread KOSAKI Motohiro
> --- linux.orig/mm/page_alloc.c  2013-03-19 16:09:03.736450861 -0500
> +++ linux/mm/page_alloc.c   2013-03-22 17:07:43.895405617 -0500
> @@ -4161,10 +4161,23 @@ int __meminit __early_pfn_to_nid(unsigne
>  {
> unsigned long start_pfn, end_pfn;
> int i, nid;
> +   /*
> +  NOTE: The following SMP-unsafe globals are only used early
> +  in boot when the kernel is running single-threaded.
> +*/
> +   static unsigned long last_start_pfn, last_end_pfn;
> +   static int last_nid;

Why don't you mark them __meminitdata? They seem freeable.
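For reference, the last-range caching pattern in the quoted hunk can be sketched in userspace as follows (a hedged illustration: the range table, the search counter, and the function body are invented; only the memoization idea mirrors the patch):

```c
#include <assert.h>

/* Sketch of the __early_pfn_to_nid() speedup: remember the last matching
 * [start_pfn, end_pfn) range so repeated lookups of nearby pfns skip the
 * linear search.  Safe only because early boot is single-threaded. */
struct pfn_range { unsigned long start, end; int nid; };

static const struct pfn_range ranges[] = {
    { 0,    1000, 0 },
    { 1000, 2000, 1 },
};
static unsigned long searches;          /* counts slow-path scans */

static int early_pfn_to_nid(unsigned long pfn)
{
    static unsigned long last_start, last_end;
    static int last_nid = -1;
    unsigned int i;

    if (last_nid >= 0 && pfn >= last_start && pfn < last_end)
        return last_nid;                /* fast path: cached range */
    searches++;
    for (i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
        if (pfn >= ranges[i].start && pfn < ranges[i].end) {
            last_start = ranges[i].start;
            last_end = ranges[i].end;
            last_nid = ranges[i].nid;
            return last_nid;
        }
    }
    return -1;
}
```

Since boot-time pfn queries are strongly clustered, most calls hit the cached range and never rescan the table.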




Re: mm: BUG in mempolicy's sp_insert

2013-02-28 Thread KOSAKI Motohiro
On Thu, Feb 28, 2013 at 1:53 AM, Hillf Danton  wrote:
> On Thu, Feb 28, 2013 at 1:26 PM, KOSAKI Motohiro
>  wrote:
>>> Insert new node after updating node in tree.
>>
>> Thanks. you are right. I could reproduce and verified.
>
> Thank you too;) pleasure to do minor work for you.
>
> btw, how about your belly now? fully recovered?

Yup. I got to learn a bit about US health care. =)
<    1   2   3   4   5   6   7   8   9   10   >