Re: [RFC PATCH 1/2] kernel/fork.c: add a function to calculate page address from thread_info

2015-05-26 Thread KOSAKI Motohiro
On Sun, May 24, 2015 at 12:01 PM, Jungseok Lee  wrote:
> The current implementation assumes that the page backing thread_info is
> always correctly calculated via virt_to_page. This rules out different
> approaches, such as allocating thread_info from vmalloc space.
>
> This patch thus introduces an independent function to calculate the page
> address from the thread_info address.
>
> Suggested-by: Sungjinn Chung 
> Signed-off-by: Jungseok Lee 
> Cc: KOSAKI Motohiro 
> Cc: linux-arm-ker...@lists.infradead.org
> ---
>  kernel/fork.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)

I haven't received patch [2/2] and haven't reviewed the whole series, but
this patch itself is OK to me.
Acked-by: KOSAKI Motohiro 
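
For illustration, a minimal sketch of the idea (the helper name below is
hypothetical, not from the patch): with the default allocator, thread_info
lives in the linear mapping, so virt_to_page() is valid; funneling the
translation through one function lets an architecture that allocates
thread_info from vmalloc space substitute its own lookup.

static inline struct page *thread_info_page(struct thread_info *ti)
{
	return virt_to_page(ti);
}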


Re: [patch][resend] MAP_HUGETLB munmap fails with size not 2MB aligned

2015-03-30 Thread KOSAKI Motohiro
On Thu, Mar 26, 2015 at 10:08 AM, Eric B Munson  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On 03/26/2015 07:56 AM, Davide Libenzi wrote:
>> On Wed, 25 Mar 2015, David Rientjes wrote:
>>
>>> I looked at this thread at http://marc.info/?t=14139250881
>>> since I didn't have it in my mailbox, and I didn't get a chance
>>> to actually run your test code.
>>>
>>> In short, I think what you're saying is that
>>>
>>> ptr = mmap(..., 4KB, ..., MAP_HUGETLB | ..., ...)
>>> munmap(ptr, 4KB) == EINVAL
>>
>> I am not sure you have read the email correctly:
>>
>> munmap(mmap(size, HUGETLB), size) = EFAIL
>>
>> For every size not multiple of the huge page size. Whereas:
>>
>> munmap(mmap(size, HUGETLB), ALIGN(size, HUGEPAGE_SIZE)) = OK
>
> I think Davide is right here, this is a long existing bug in the
> MAP_HUGETLB implementation.  Specifically, the mmap man page says:
>
> All pages containing a part of the indicated range are unmapped, and
> subsequent references to these pages will generate SIGSEGV.
>
> I realize that huge pages may not have been considered by those that
> wrote the spec.  But if I read this I would assume that all pages,
> regardless of size, touched by the munmap() request should be unmapped.
>
> Please include
> Acked-by: Eric B Munson 
> to the original patch.  I would like to see the mmap man page adjusted
> to make note of this behavior as well.

This is just a bug fix, and I don't think it carries much risk. But as a
caution: we might revert it immediately if this patch causes some
regression, even if the regression comes from broken application code.

Acked-by: KOSAKI Motohiro 
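
For reference, a minimal userspace reproducer sketch of the behavior under
discussion (it assumes 2MB huge pages are configured and reserved on the
system):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 4096;	/* deliberately not huge-page aligned */
	void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (ptr == MAP_FAILED) {
		perror("mmap");	/* e.g. no huge pages reserved */
		return 1;
	}
	/*
	 * The mapping itself was rounded up to a full huge page, but
	 * before the fix this munmap() fails with EINVAL; passing a
	 * length rounded up to the huge page size succeeds.
	 */
	if (munmap(ptr, size))
		printf("munmap: %s\n", strerror(errno));
	return 0;
}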


Re: [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails

2014-12-12 Thread KOSAKI Motohiro
On Fri, Dec 12, 2014 at 11:05 AM, KOSAKI Motohiro
 wrote:
> On Fri, Dec 12, 2014 at 5:19 AM, Lai Jiangshan  wrote:
>> A pwq bound to a specified node might last long, or even forever, after
>> the node has gone offline, especially when the pwq has some back-to-back
>> work items which requeue themselves so that the pwq can't quit.
>>
>> These kinds of pwqs keep their own pools busy and may need to create
>> workers. Such pools will fail in create_worker() since the node is offline.
>>
>> This case is extremely rare, but it is possible. We want create_worker()
>> to be fault-tolerant in this case and in other cases where the node lacks
>> memory; for example, create_worker() can try to allocate memory from the
>> whole system rather than only from the target node. The most important
>> thing is making some progress.
>>
>> So the solution is that, when create_worker() fails on a specified node,
>> it will retry with NUMA_NO_NODE for further allocation.
>
> The code looks correct, but I don't think this issue depends on node
> offlining. The allocation may also fail if the node has no memory.

Oh, sorry, my comment is not correct. Please disregard it.


Re: [PATCH 5/5] workqueue: retry on NUMA_NO_NODE when create_worker() fails

2014-12-12 Thread KOSAKI Motohiro
On Fri, Dec 12, 2014 at 5:19 AM, Lai Jiangshan  wrote:
> A pwq bound to a specified node might last long, or even forever, after
> the node has gone offline, especially when the pwq has some back-to-back
> work items which requeue themselves so that the pwq can't quit.
>
> These kinds of pwqs keep their own pools busy and may need to create
> workers. Such pools will fail in create_worker() since the node is offline.
>
> This case is extremely rare, but it is possible. We want create_worker()
> to be fault-tolerant in this case and in other cases where the node lacks
> memory; for example, create_worker() can try to allocate memory from the
> whole system rather than only from the target node. The most important
> thing is making some progress.
>
> So the solution is that, when create_worker() fails on a specified node,
> it will retry with NUMA_NO_NODE for further allocation.

The code looks correct, but I don't think this issue depends on node
offlining. The allocation may also fail if the node has no memory.
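
For illustration, a sketch of the fallback the patch describes (simplified;
the real code lives in kernel/workqueue.c and differs in detail): if the
node-affine allocation fails, e.g. because the node went offline or has no
memory, retry without a node preference so some progress can still be made.

	/* first try the pool's preferred node ... */
	worker = alloc_worker(pool->node);
	if (!worker)
		/* ... then fall back to any node */
		worker = alloc_worker(NUMA_NO_NODE);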


Re: [PATCH] mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces

2014-06-25 Thread KOSAKI Motohiro
> I agree that reporting the amount of shared pages in that historical fashion
> might not be interesting for userspace tools resorting to sysinfo(2) nowadays.
>
> OTOH, our documentation implies we do return shared memory there, and FWIW,
> considering the other places we export the "shared memory" concept to
> userspace nowadays, we are suggesting it's the amount of tmpfs/shmem, and
> not the amount of shared mapped pages it historically represented once.
> What is really confusing is having a field that supposedly/expectedly would
> return the amount of shmem to userspace queries, but instead returns a
> hard-coded zero (0).
>
> I could easily find some user complaints/confusion about this semantic
> inconsistency in the past, as in:
> https://groups.google.com/forum/#!topic/comp.os.linux.development.system/ogWVn6XdvGA
>
> or in:
> http://marc.info/?l=net-snmp-cvs&m=132148788500667
>
> which suggests users seem to have always understood it as being shmem/tmpfs
> usage, as the /proc/meminfo field "MemShared" was tied directly to
> sysinfo.sharedram. Historically we reported shared memory that way, and
> when it wasn't accurately meaning that anymore, a 0 was hardcoded there to
> potentially not break compatibility with older tools (older than 2.4).
> In 2.6 we got rid of meminfo's "MemShared" until 2009, when you sort of
> re-introduced it re-branded as Shmem. IMO, we should leverage what we
> have in the kernel now and take this change to make the exposed data
> consistent across the interfaces that export it today -- sysinfo(2) &
> /proc/meminfo.
>
> This is not a hard requirement, though, but rather a simple maintenance
> nitpick from code review.

OK, ack then. But please update the patch description and repost, CCing
linux-...@vger.kernel.org. Someone might have a specific concern about
compatibility.
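
For reference, a minimal sketch of how userspace observes the field through
sysinfo(2); before the change sharedram reads as a hard-coded 0, with the
change it reports shmem/tmpfs pages:

#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
	struct sysinfo si;

	if (sysinfo(&si) != 0)
		return 1;
	/* mem_unit scales the fields on systems with lots of memory */
	printf("sharedram: %lu kB\n", si.sharedram * si.mem_unit / 1024);
	return 0;
}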


RE: [PATCH] mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces

2014-06-25 Thread Motohiro Kosaki


> -----Original Message-----
> From: Rafael Aquini [mailto:aqu...@redhat.com]
> Sent: Wednesday, June 25, 2014 4:16 PM
> To: Motohiro Kosaki
> Cc: linux...@kvack.org; Andrew Morton; Rik van Riel; Mel Gorman; Johannes Weiner; Motohiro Kosaki JP; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH] mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces
> 
> On Wed, Jun 25, 2014 at 12:41:17PM -0700, Motohiro Kosaki wrote:
> >
> >
> > > -Original Message-
> > > From: Rafael Aquini [mailto:aqu...@redhat.com]
> > > Sent: Wednesday, June 25, 2014 2:40 PM
> > > To: linux...@kvack.org
> > > Cc: Andrew Morton; Rik van Riel; Mel Gorman; Johannes Weiner;
> > > Motohiro Kosaki JP; linux-kernel@vger.kernel.org
> > > Subject: [PATCH] mm: export NR_SHMEM via sysinfo(2) / si_meminfo()
> > > interfaces
> > >
> > > This patch leverages the addition of explicit accounting for pages
> > > used by shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat"
> > > -- in order to make the users of sysinfo(2) and si_meminfo*() friends
> > > aware of that vmstat entry consistently across the interfaces.
> >
> > Why?
> 
> Because we do not report consistently across the interfaces we declare
> exporting that data. Check the sysinfo(2) manpage, for instance:
> [...]
>     struct sysinfo {
>         long uptime;             /* Seconds since boot */
>         unsigned long loads[3];  /* 1, 5, and 15 minute load averages */
>         unsigned long totalram;  /* Total usable main memory size */
>         unsigned long freeram;   /* Available memory size */
>         unsigned long sharedram; /* Amount of shared memory */ <<<<<
> [...]
>
> userspace tools resorting to the sysinfo() syscall will get a hardcoded 0
> for shared memory, which is reported differently from /proc/meminfo.
>
> Also, si_meminfo() & si_meminfo_node() are utilized within the kernel to
> gather statistics for /proc/meminfo & friends, and so we can leverage
> collecting sharedmem from those calls as well, just as we do for
> totalram, freeram & bufferram.

But "Amount of shared memory"  didn't mean amout of shmem. It actually meant 
amout of page of page-count>=2.
Again, there is a possibility to change the semantics. But I don't have enough 
userland knowledge to do. Please investigate
and explain why your change don't break any userland. 








RE: [PATCH] mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces

2014-06-25 Thread Motohiro Kosaki


> -----Original Message-----
> From: motohiro.kos...@us.fujitsu.com [mailto:motohiro.kos...@us.fujitsu.com]
> Sent: Wednesday, June 25, 2014 3:41 PM
> To: Rafael Aquini; linux...@kvack.org
> Cc: Andrew Morton; Rik van Riel; Mel Gorman; Johannes Weiner; Motohiro Kosaki JP; linux-kernel@vger.kernel.org
> Subject: RE: [PATCH] mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces
> 
> 
> 
> > -Original Message-
> > From: Rafael Aquini [mailto:aqu...@redhat.com]
> > Sent: Wednesday, June 25, 2014 2:40 PM
> > To: linux...@kvack.org
> > Cc: Andrew Morton; Rik van Riel; Mel Gorman; Johannes Weiner; Motohiro
> > Kosaki JP; linux-kernel@vger.kernel.org
> > Subject: [PATCH] mm: export NR_SHMEM via sysinfo(2) / si_meminfo()
> > interfaces
> >
> > This patch leverages the addition of explicit accounting for pages
> > used by shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat" --
> > in order to make the users of sysinfo(2) and si_meminfo*() friends
> > aware of that vmstat entry consistently across the interfaces.
> 
> Why?
> Traditionally sysinfo.sharedram was not used for shmem. It had totally
> strange semantics and is a completely outdated feature, so we may reuse it
> for another purpose. But I'm not sure of its benefit.
> 
> Why don't you use /proc/meminfo?
> I'm afraid userland programs will get confused.

For the record, this is the historical implementation from linux-2.3.12,
i.e., it accounted the sum of page counts.


void si_meminfo(struct sysinfo *val)
{
	int i;

	i = max_mapnr;
	val->totalram = 0;
	val->sharedram = 0;
	val->freeram = nr_free_pages << PAGE_SHIFT;
	val->bufferram = atomic_read(&buffermem);
	while (i-- > 0)  {
		if (PageReserved(mem_map+i))
			continue;
		val->totalram++;
		if (!page_count(mem_map+i))
			continue;
		val->sharedram += page_count(mem_map+i) - 1;
	}
	val->totalram <<= PAGE_SHIFT;
	val->sharedram <<= PAGE_SHIFT;
	return;
}




RE: [PATCH] mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces

2014-06-25 Thread Motohiro Kosaki


> -----Original Message-----
> From: Rafael Aquini [mailto:aqu...@redhat.com]
> Sent: Wednesday, June 25, 2014 2:40 PM
> To: linux...@kvack.org
> Cc: Andrew Morton; Rik van Riel; Mel Gorman; Johannes Weiner; Motohiro Kosaki JP; linux-kernel@vger.kernel.org
> Subject: [PATCH] mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces
> 
> This patch leverages the addition of explicit accounting for pages used by
> shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat" -- in order to
> make the users of sysinfo(2) and si_meminfo*() friends aware of that vmstat
> entry consistently across the interfaces.

Why?
Traditionally sysinfo.sharedram was not used for shmem. It had totally strange
semantics and is a completely outdated feature, so we may reuse it for another
purpose. But I'm not sure of its benefit.

Why don't you use /proc/meminfo?
I'm afraid userland programs will get confused.


> 
> Signed-off-by: Rafael Aquini 
> ---
>  drivers/base/node.c | 2 +-
>  fs/proc/meminfo.c   | 2 +-
>  mm/page_alloc.c     | 3 ++-
>  3 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 8f7ed99..c6d3ae0 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -126,7 +126,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  		       nid, K(node_page_state(nid, NR_FILE_PAGES)),
>  		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
>  		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
> -		       nid, K(node_page_state(nid, NR_SHMEM)),
> +		       nid, K(i.sharedram),
>  		       nid, node_page_state(nid, NR_KERNEL_STACK) *
>  				THREAD_SIZE / 1024,
>  		       nid, K(node_page_state(nid, NR_PAGETABLE)),
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 7445af0..aa1eee0 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -168,7 +168,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  		K(global_page_state(NR_WRITEBACK)),
>  		K(global_page_state(NR_ANON_PAGES)),
>  		K(global_page_state(NR_FILE_MAPPED)),
> -		K(global_page_state(NR_SHMEM)),
> +		K(i.sharedram),
>  		K(global_page_state(NR_SLAB_RECLAIMABLE) +
>  				global_page_state(NR_SLAB_UNRECLAIMABLE)),
>  		K(global_page_state(NR_SLAB_RECLAIMABLE)),
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 20d17f8..f72ea38 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3040,7 +3040,7 @@ static inline void show_node(struct zone *zone)
>  void si_meminfo(struct sysinfo *val)
>  {
>  	val->totalram = totalram_pages;
> -	val->sharedram = 0;
> +	val->sharedram = global_page_state(NR_SHMEM);
>  	val->freeram = global_page_state(NR_FREE_PAGES);
>  	val->bufferram = nr_blockdev_pages();
>  	val->totalhigh = totalhigh_pages;
> @@ -3060,6 +3060,7 @@ void si_meminfo_node(struct sysinfo *val, int nid)
>  	for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++)
>  		managed_pages += pgdat->node_zones[zone_type].managed_pages;
>  	val->totalram = managed_pages;
> +	val->sharedram = node_page_state(nid, NR_SHMEM);
>  	val->freeram = node_page_state(nid, NR_FREE_PAGES);
>  #ifdef CONFIG_HIGHMEM
>  	val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].managed_pages;
> --
> 1.9.3



RE: [patch 1/4] mm: vmscan: remove remains of kswapd-managed zone->all_unreclaimable

2014-06-23 Thread Motohiro Kosaki


> -----Original Message-----
> From: Minchan Kim [mailto:minc...@kernel.org]
> Sent: Monday, June 23, 2014 2:16 AM
> To: Johannes Weiner
> Cc: Andrew Morton; Mel Gorman; Rik van Riel; Michal Hocko; linux...@kvack.org; linux-kernel@vger.kernel.org; Motohiro Kosaki JP
> Subject: Re: [patch 1/4] mm: vmscan: remove remains of kswapd-managed zone->all_unreclaimable
> 
> On Fri, Jun 20, 2014 at 12:33:47PM -0400, Johannes Weiner wrote:
> > shrink_zones() has a special branch to skip the all_unreclaimable()
> > check during hibernation, because a frozen kswapd can't mark a zone
> > unreclaimable.
> >
> > But ever since 6e543d5780e3 ("mm: vmscan: fix do_try_to_free_pages()
> > livelock"), determining a zone to be unreclaimable is done by directly
> > looking at its scan history and no longer relies on kswapd setting the
> > per-zone flag.
> >
> > Remove this branch and let shrink_zones() check the reclaimability of
> > the target zones regardless of hibernation state.
> >
> > Signed-off-by: Johannes Weiner 
> Acked-by: Minchan Kim 
> 
> It would not be bad to CC KOSAKI, who was involved in the all_unreclaimable
> series several times with me.

Looks good to me.

KOSAKI Motohiro 






RE: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

2014-05-20 Thread Motohiro Kosaki


> -----Original Message-----
> From: Tetsuo Handa [mailto:penguin-ker...@i-love.sakura.ne.jp]
> Sent: Tuesday, May 20, 2014 11:58 PM
> To: da...@fromorbit.com; r...@redhat.com
> Cc: Motohiro Kosaki JP; fengguang...@intel.com; kamezawa.hir...@jp.fujitsu.com; a...@linux-foundation.org; h...@infradead.org; linux-kernel@vger.kernel.org; x...@oss.sgi.com
> Subject: Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().
> 
> Today I discussed with Kosaki-san at LinuxCon Japan 2014 about this issue.
> He does not like the idea of adding a timeout to the throttle loop. As Dave
> posted a patch that fixes a bug in XFS delayed allocation, I updated my
> patch accordingly.
> 
> Although the bug in XFS was fixed by Dave's patch, other kernel code may
> have bugs which would fall into this infinite throttle loop. But to keep
> the possibility of triggering the OOM killer to a minimum, can we agree
> with this updated patch (and in the future add some warning mechanism like
> /proc/sys/kernel/hung_task_timeout_secs for detecting memory allocation
> stalls)?
> 
> Dave, if you are OK with this updated patch, please let me know the commit
> ID of your patch.
> 
> Regards.
> --
> From 408e65d9025e8e24838e7bf6ac9066ba8a9391a6 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa 
> Date: Tue, 20 May 2014 23:34:34 +0900
> Subject: [PATCH] mm/vmscan: Do not throttle kswapd at shrink_inactive_list().
> 
> I can observe that commit 35cd7815 "vmscan: throttle direct reclaim when
> too many pages are isolated already" causes a RHEL7 environment to stall
> with 0% CPU usage when a certain type of memory pressure is applied.
> This is because nobody can reclaim memory due to the rules listed below.
> 
>   (a) XFS uses a kernel worker thread for delayed allocation
>   (b) kswapd wakes up the kernel worker thread for delayed allocation
>   (c) the kernel worker thread is throttled due to commit 35cd7815
> 
> This patch and commit "xfs: block allocation work needs to be kswapd
> aware" will solve rule (c).
> 
> Signed-off-by: Tetsuo Handa 
> ---
>  mm/vmscan.c |   20 +++-
>  1 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d..5c6960e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1460,12 +1460,22 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>   struct zone *zone = lruvec_zone(lruvec);
>   struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> 
> - while (unlikely(too_many_isolated(zone, file, sc))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + /*
> +  * Throttle only direct reclaimers. Allocations by kswapd (and
> +  * allocation workqueue on behalf of kswapd) should not be
> +  * throttled here; otherwise memory allocation will deadlock.
> +  */
> + if (!sc->hibernation_mode && !current_is_kswapd()) {
> + while (unlikely(too_many_isolated(zone, file, sc))) {
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
> - /* We are about to die and free our memory. Return now. */
> - if (fatal_signal_pending(current))
> - return SWAP_CLUSTER_MAX;
> + /*
> +  * We are about to die and free our memory.
> +  * Return now.
> +  */
> + if (fatal_signal_pending(current))
> + return SWAP_CLUSTER_MAX;
> + }
>   }


Acked-by: KOSAKI Motohiro 


Dave, I don't like Tetsuo's first patch because this too_many_isolated exists
to prevent false OOM kills, so a simple timeout would resurrect them. Please
let me know if you need further MM enhancements to solve the XFS issue. I'd
like to join and assist with this.

Thanks.








RE: [PATCH] mm,writeback: fix divide by zero in pos_ratio_polynom

2014-04-29 Thread Motohiro Kosaki


> -----Original Message-----
> From: Rik van Riel [mailto:r...@redhat.com]
> Sent: Tuesday, April 29, 2014 3:19 PM
> To: linux-kernel@vger.kernel.org
> Cc: linux...@kvack.org; sand...@redhat.com; a...@linux-foundation.org; jwei...@redhat.com; Motohiro Kosaki JP; mho...@suse.cz; fengguang...@intel.com; mpatla...@parallels.com
> Subject: [PATCH] mm,writeback: fix divide by zero in pos_ratio_polynom
> 
> It is possible for "limit - setpoint + 1" to equal zero, leading to a
> divide by zero error. Blindly adding 1 to "limit - setpoint" is not
> working, so we need to actually test the divisor before calling div64.
> 
> Signed-off-by: Rik van Riel 
> Cc: sta...@vger.kernel.org

Fairly obvious fix.

Acked-by: KOSAKI Motohiro 


> ---
>  mm/page-writeback.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ef41349..2682516 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -597,11 +597,16 @@ static inline long long pos_ratio_polynom(unsigned long setpoint,
>  					  unsigned long dirty,
>  					  unsigned long limit)
>  {
> +	unsigned int divisor;
>  	long long pos_ratio;
>  	long x;
> 
> +	divisor = limit - setpoint;
> +	if (!divisor)
> +		divisor = 1;
> +
>  	x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
> -		    limit - setpoint + 1);
> +		    divisor);
>  	pos_ratio = x;
>  	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
>  	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
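
To make the failure mode concrete, here is a standalone sketch (illustrative
values; it assumes a 64-bit build, and relies on the fact that the kernel's
div_s64() takes a 32-bit divisor): the unsigned long expression
"limit - setpoint + 1" is truncated at the call site, so a difference of
exactly 1UL << 32 becomes a zero divisor.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned long limit = 1UL << 32;	/* illustrative values only */
	unsigned long setpoint = 1;

	/* mimics the narrowing to div_s64()'s 32-bit divisor */
	uint32_t divisor = (uint32_t)(limit - setpoint + 1);

	printf("divisor after truncation: %u\n", divisor);	/* prints 0 */
	return 0;
}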


RE: [PATCH 3/4] ipc/shm.c: check for integer overflow during shmget.

2014-04-22 Thread Motohiro Kosaki
> > SHMMAX is the upper limit for the size of a shared memory segment,
> > counted in bytes. The actual allocation is that size, rounded up to
> > the next full page.
> > Add a check that prevents the creation of segments where the rounded
> > up size causes an integer overflow.
> >
> > Signed-off-by: Manfred Spraul 
> 
> Acked-by: Davidlohr Bueso 

Acked-by: KOSAKI Motohiro 
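
For reference, a sketch close to the check being acked here (condensed, not
the verbatim patch): rounding size up to whole pages must not wrap around
ULONG_MAX.

	size_t numpages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;

	/* if size was near ULONG_MAX, the rounded-up page count no
	 * longer covers the requested bytes: reject the request */
	if (numpages << PAGE_SHIFT < size)
		return -ENOSPC;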


RE: [PATCH 1/4] ipc/shm.c: check for ulong overflows in shmat

2014-04-22 Thread Motohiro Kosaki
> > find_vma_intersection does not work as intended if addr+size overflows.
> > The patch adds a manual check before the call to find_vma_intersection.
> >
> > Signed-off-by: Manfred Spraul 
> 
> Acked-by: Davidlohr Bueso 

Acked-by: KOSAKI Motohiro 
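
For reference, a sketch of the manual check described above (condensed from
the patch's intent, not the verbatim hunk):

	if (addr + size < addr)		/* addr + size wrapped around */
		goto invalid;

	if (find_vma_intersection(current->mm, addr, addr + size))
		goto invalid;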



RE: [PATCH 4/4] ipc/shm.c: Increase the defaults for SHMALL, SHMMAX.

2014-04-22 Thread Motohiro Kosaki


> -----Original Message-----
> From: Manfred Spraul [mailto:manf...@colorfullife.com]
> Sent: Monday, April 21, 2014 10:27 AM
> To: Davidlohr Bueso; Michael Kerrisk; Martin Schwidefsky
> Cc: LKML; Andrew Morton; KAMEZAWA Hiroyuki; Motohiro Kosaki JP; gthe...@google.com; as...@hp.com; linux...@kvack.org; Manfred Spraul
> Subject: [PATCH 4/4] ipc/shm.c: Increase the defaults for SHMALL, SHMMAX.
> 
> System V shared memory
> 
> a) can be abused to trigger out-of-memory conditions and the standard
>measures against out-of-memory do not work:
> 
> - it is not possible to use setrlimit to limit the size of shm segments.
> 
> - segments can exist without association with any processes, thus
>   the oom-killer is unable to free that memory.
> 
> b) is typically used for shared information - today often multiple GB.
>(e.g. database shared buffers)
> 
> The current default is a maximum segment size of 32 MB and a maximum total
> size of 8 GB. This is often too much for a) and not enough for b), which
> means that lots of users must change the defaults.
> 
> This patch increases the default limits (nearly) to the maximum, which is
> perfect for case b). The defaults are used after boot and as the initial
> value for each new namespace.
> 
> Admins/distros that need a protection against a) should reduce the limits
> and/or enable shm_rmid_forced.
> 
> Further notes:
> - The patch only changes default, overrides behave as before:
> # sysctl kernel.shmall=33554432
>   would recreate the previous limit for SHMMAX (for the current namespace).
> 
> - Disabling sysv shm allocation is possible with:
> # sysctl kernel.shmall=0
>   (not a new feature, also per-namespace)
> 
> - The limits are intentionally set to a value slightly less than ULONG_MAX,
>   to avoid triggering overflows in user space apps.
>   [not unreasonable, see http://marc.info/?l=linux-mm&m=139638334330127]
> 
> Signed-off-by: Manfred Spraul 
> Reported-by: Davidlohr Bueso 
> Cc: mtk.manpa...@gmail.com

Acked-by: KOSAKI Motohiro 


RE: [PATCH 2/4] ipc/shm.c: check for overflows of shm_tot

2014-04-22 Thread Motohiro Kosaki
> > shm_tot counts the total number of pages used by shm segments.
> >
> > If SHMALL is ULONG_MAX (or nearly ULONG_MAX), then the number can
> > overflow.  Subsequent calls to shmctl(,SHM_INFO,) would return wrong
> > values for shm_tot.
> >
> > The patch adds a detection for overflows.
> >
> > Signed-off-by: Manfred Spraul 
> 
> Acked-by: Davidlohr Bueso 

Acked-by: KOSAKI Motohiro 
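
For reference, a sketch close to the detection being acked (condensed):
the running page total must not wrap when the new segment is accounted.

	if (ns->shm_tot + numpages < ns->shm_tot ||
	    ns->shm_tot + numpages > ns->shm_ctlall)
		return -ENOSPC;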


RE: [PATCH] ipc/shm: Increase the defaults for SHMALL, SHMMAX to infinity

2014-04-18 Thread Motohiro Kosaki


> -----Original Message-----
> From: Manfred Spraul [mailto:manf...@colorfullife.com]
> Sent: Friday, April 18, 2014 2:19 AM
> To: Andrew Morton; Davidlohr Bueso
> Cc: LKML; KAMEZAWA Hiroyuki; Motohiro Kosaki JP; gthe...@google.com; as...@hp.com; linux...@kvack.org; Manfred Spraul; mtk.manpa...@gmail.com
> Subject: [PATCH] ipc/shm: Increase the defaults for SHMALL, SHMMAX to infinity
> 
> System V shared memory
> 
> a) can be abused to trigger out-of-memory conditions and the standard
>measures against out-of-memory do not work:
> 
> - it is not possible to use setrlimit to limit the size of shm segments.
> 
> - segments can exist without association with any processes, thus
>   the oom-killer is unable to free that memory.
> 
> b) is typically used for shared information - today often multiple GB.
>(e.g. database shared buffers)
> 
> The current default is a maximum segment size of 32 MB and a maximum total
> size of 8 GB. This is often too much for a) and not enough for b), which
> means that lots of users must change the defaults.
> 
> This patch increases the default limits to ULONG_MAX, which is perfect for
> case b). The defaults are used after boot and as the initial value for each
> new namespace.
> 
> Admins/distros that need a protection against a) should reduce the limits
> and/or enable shm_rmid_forced.
> 
> Further notes:
> - The patch only changes the boot time default, overrides behave as before:
>   # sysctl kernel/shmall=33554432
>   would recreate the previous limit for SHMMAX (for the current namespace).
> 
> - Disabling sysv shm allocation is possible with:
>   # sysctl kernel.shmall=0
>   (not a new feature, also per-namespace)
> 
> - ULONG_MAX is not really infinity, but 18 Exabyte segment size and
>   75 Zettabyte total size. This should be enough for the next few weeks.
>   (assuming a 64-bit system with 4k pages)
> 
> Risks:
> - The patch breaks installations that use "take current value and increase
>   it a bit". [seems to exist, http://marc.info/?l=linux-mm&m=139638334330127]
>   After a:
>   # CUR=`sysctl -n kernel.shmmax`
>   # NEW=`echo $CUR+1 | bc -l`
>   # sysctl -n kernel.shmmax=$NEW
>   shmmax ends up as 0, which disables shm allocations.
> 
> - There is no wrap-around protection for ns->shm_ctlall, i.e. the 75 ZB
>   limit is not enforced.
> 
> Signed-off-by: Manfred Spraul 
> Reported-by: Davidlohr Bueso 
> Cc: mtk.manpa...@gmail.com

I'm OK with either ULONG_MAX or 0 (as a special value meaning infinity).

Acked-by: KOSAKI Motohiro 


RE: [PATCH v2] kernel/panic: Add "crash_kexec_post_notifiers" option for kdump after panic_notifiers

2014-04-17 Thread Motohiro Kosaki


> -----Original Message-----
> From: Masami Hiramatsu [mailto:masami.hiramatsu...@hitachi.com]
> Sent: Wednesday, April 16, 2014 8:13 PM
> To: linux-kernel@vger.kernel.org; Vivek Goyal; Eric Biederman
> Cc: Andrew Morton; Yoshihiro YUNOMAE; Satoru MORIYA; Motohiro Kosaki; Tomoki Sekiyama
> Subject: [PATCH v2] kernel/panic: Add "crash_kexec_post_notifiers" option for kdump after panic_notifiers
> 
> Add a "crash_kexec_post_notifiers" option to run kdump after running 
> panic_notifiers and dump kmsg. This can help rare situations
> which kdump drops in failure because of unstable crashed kernel or hardware 
> failure (memory corruption on critical data/code), or
> the 2nd kernel is already broken by the 1st kernel (it's a broken behavior, 
> but who can guarantee that the "crashed" kernel works
> correctly?).
> 
> Usage: add "crash_kexec_post_notifiers" to kernel boot option.
> 
> Note that this actually increases risks of the failure of kdump.
> This option should be set only if you worry about the rare case of kdump 
> failure rather than increasing the chance of success.
> 
> Changes from v1:
>  - Rename late_kdump option to crash_kexec_post_notifiers.
>  - Remove unneeded warning message.
> 
> Signed-off-by: Masami Hiramatsu 
> Cc: Eric Biederman 
> Cc: Vivek Goyal 
> Cc: Andrew Morton 
> Cc: Yoshihiro YUNOMAE 
> Cc: Satoru MORIYA 
> Cc: Motohiro Kosaki 
> Cc: Tomoki Sekiyama 

I have no objection.
 Acked-by: KOSAKI Motohiro 
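
For illustration, a simplified sketch of the ordering the option selects
inside panic(); this is not the verbatim kernel code.

	if (!crash_kexec_post_notifiers)
		crash_kexec(NULL);	/* default: jump into kdump first */

	atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
	kmsg_dump(KMSG_DUMP_PANIC);

	if (crash_kexec_post_notifiers)
		crash_kexec(NULL);	/* opt-in: kdump after notifiers/kmsg */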



Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile

2014-04-07 Thread KOSAKI Motohiro
>> This changes the hwpoison and migration tag numbers. Maybe OK, maybe not.
>
> Though depending on config can't these tag numbers change anyway?

I don't think distros disable any of these.


>> I'd suggest using a younger number than hwpoison.
>> (That's why hwpoison uses a younger number than migration.)
>
> So I can, but the way these are defined makes the results seem pretty
> terrible:
>
> #define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM \
> 				+ SWP_MVOLATILE_PURGED_NUM + 1)
>
> Particularly when:
> #define MAX_SWAPFILES	((1 << MAX_SWAPFILES_SHIFT)	\
> 			- SWP_MIGRATION_NUM		\
> 			- SWP_HWPOISON_NUM		\
> 			- SWP_MVOLATILE_PURGED_NUM	\
> 			)
>
> Its a lot of unnecessary mental gymnastics. Yuck.
>
> Would a general cleanup like the following be ok to try to make this
> more extensible?
>
> thanks
> -john
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3507115..21387df 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -49,29 +49,38 @@ static inline int current_is_kswapd(void)
>   * actions on faults.
>   */
>
> +enum {
> +	/*
> +	 * NOTE: We use the high bits here (subtracting from
> +	 * 1<<MAX_SWAPFILES_SHIFT), so add new entries here at the
> +	 * top of the enum, not at the bottom.
> +	 */
> +#ifdef CONFIG_MEMORY_FAILURE
> +	SWP_HWPOISON_NR,
> +#endif
> +#ifdef CONFIG_MIGRATION
> +	SWP_MIGRATION_READ_NR,
> +	SWP_MIGRATION_WRITE_NR,
> +#endif
> +	SWP_MAX_NR,
> +};
> +#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT) - SWP_MAX_NR)
> +

I don't see any benefit in this code. At least, SWP_MAX_NR sucks: the name
doesn't match the actual meaning.


Re: [PATCH] ipc,shm: disable shmmax and shmall by default

2014-04-05 Thread KOSAKI Motohiro
On Fri, Apr 4, 2014 at 1:00 AM, Davidlohr Bueso  wrote:
> On Thu, 2014-04-03 at 19:39 -0400, KOSAKI Motohiro wrote:
>> On Thu, Apr 3, 2014 at 3:50 PM, Davidlohr Bueso  wrote:
>> > On Thu, 2014-04-03 at 21:02 +0200, Manfred Spraul wrote:
>> >> Hi Davidlohr,
>> >>
>> >> On 04/03/2014 02:20 AM, Davidlohr Bueso wrote:
>> >> > The default size for shmmax is, and always has been, 32Mb.
>> >> > Today, in the XXI century, it seems that this value is rather small,
>> >> > making users have to increase it via sysctl, which can cause
>> >> > unnecessary work and userspace application workarounds[1].
>> >> >
>> >> > Instead of choosing yet another arbitrary value, larger than 32Mb,
>> >> > this patch disables the use of both shmmax and shmall by default,
>> >> > allowing users to create segments of unlimited sizes. Users and
>> >> > applications that already explicitly set these values through sysctl
>> >> > are left untouched, and thus does not change any of the behavior.
>> >> >
>> >> > So a value of 0 bytes or pages, for shmmax and shmall, respectively,
>> >> > implies unlimited memory, as opposed to disabling sysv shared memory.
>> >> > This is safe as 0 cannot possibly be used previously as SHMMIN is
>> >> > hardcoded to 1 and cannot be modified.
>> >
>> >> Are we sure that no user space apps uses shmctl(IPC_INFO) and prints a
>> >> pretty error message if shmall is too small?
>> >> We would break these apps.
>> >
>> > Good point. 0 bytes/pages would definitely trigger an unexpected error
>> > message if users did this. But on the other hand I'm not sure this
>> > actually is a _real_ scenario, since upon overflow the value can still
>> > end up being 0, which is totally bogus and would cause the same
>> > breakage.
>> >
>> > So I see two possible workarounds:
>> > (i) Use ULONG_MAX for the shmmax default instead. This would make shmall
>> > default to 1152921504606846720 and 268435456, for 64 and 32bit systems,
>> > respectively.
>> >
>> > (ii) Keep the 0 bytes, but add a new a "transition" tunable that, if set
>> > (default off), would allow 0 bytes to be unlimited. With time, users
>> > could hopefully update their applications and we could eventually get
>> > rid of it. This _seems_ to be the less aggressive way to go.
>>
>> Do you mean
>>
>> set 0: IPC_INFO return shmmax = 0.
>> set 1: IPC_INFO return shmmax = ULONG_MAX.
>>
>> ?
>>
>> That makes sense.
>
> Well I was mostly referring to:
>
> set 0: leave things as there are now.
> set 1: this patch.

I don't recommend this approach because many users will never switch to 1,
and we'd finally end up with API fragmentation.


> I don't think it makes much sense to set unlimited for both 0 and
> ULONG_MAX, that would probably just create even more confusion.
>
> But then again, we shouldn't even care about breaking things with shmmax
> or shmall with 0 value, it just makes no sense from a user PoV. shmmax
> cannot be 0 unless there's an overflow, which voids any valid cases, and
> thus shmall cannot be 0 either as it would go against any values set for
> shmmax. I think it's safe to ignore this.

Agreed.
IMHO, until you find an actual incompatibility issue with this, we don't need
the switch, because we can't make a good workaround for it. I'd suggest
merging your patch and seeing what happens.
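
For reference, a minimal sketch of the shmctl(IPC_INFO) query such apps
perform (it assumes glibc's struct shminfo from <sys/shm.h>):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	struct shminfo info;

	/* IPC_INFO fills a struct shminfo despite the shmid_ds prototype */
	if (shmctl(0, IPC_INFO, (struct shmid_ds *)&info) < 0)
		return 1;
	/* an app comparing its needs against these limits would see 0 */
	printf("shmmax: %lu, shmall: %lu\n", info.shmmax, info.shmall);
	return 0;
}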


Re: [PATCH] ipc,shm: disable shmmax and shmall by default

2014-04-03 Thread KOSAKI Motohiro
> This change allows Linux to treat shm just as regular anonymous memory.
> One important difference between them, though, is handling out-of-memory
> conditions: as opposed to regular anon memory, the OOM killer will not
> kill processes that are hogging memory through shm, allowing users to
> potentially abuse this. To overcome this situation, the shm_rmid_forced
> option must be enabled.

Off topic: systemd implemented a similar feature, RemoveIPC, and it is
enabled by default.
http://lists.freedesktop.org/archives/systemd-devel/2014-March/018232.html


Re: [PATCH] ipc,shm: disable shmmax and shmall by default

2014-04-03 Thread KOSAKI Motohiro
On Thu, Apr 3, 2014 at 3:50 PM, Davidlohr Bueso  wrote:
> On Thu, 2014-04-03 at 21:02 +0200, Manfred Spraul wrote:
>> Hi Davidlohr,
>>
>> On 04/03/2014 02:20 AM, Davidlohr Bueso wrote:
>> > The default size for shmmax is, and always has been, 32Mb.
>> > Today, in the XXI century, it seems that this value is rather small,
>> > making users have to increase it via sysctl, which can cause
>> > unnecessary work and userspace application workarounds[1].
>> >
>> > Instead of choosing yet another arbitrary value, larger than 32Mb,
>> > this patch disables the use of both shmmax and shmall by default,
>> > allowing users to create segments of unlimited sizes. Users and
>> > applications that already explicitly set these values through sysctl
>> > are left untouched, and thus does not change any of the behavior.
>> >
>> > So a value of 0 bytes or pages, for shmmax and shmall, respectively,
>> > implies unlimited memory, as opposed to disabling sysv shared memory.
>> > This is safe as 0 cannot possibly be used previously as SHMMIN is
>> > hardcoded to 1 and cannot be modified.
>
>> Are we sure that no user space apps uses shmctl(IPC_INFO) and prints a
>> pretty error message if shmall is too small?
>> We would break these apps.
>
> Good point. 0 bytes/pages would definitely trigger an unexpected error
> message if users did this. But on the other hand I'm not sure this
> actually is a _real_ scenario, since upon overflow the value can still
> end up being 0, which is totally bogus and would cause the same
> breakage.
>
> So I see two possible workarounds:
> (i) Use ULONG_MAX for the shmmax default instead. This would make shmall
> default to 1152921504606846720 and 268435456, for 64 and 32bit systems,
> respectively.
>
> (ii) Keep the 0 bytes, but add a new a "transition" tunable that, if set
> (default off), would allow 0 bytes to be unlimited. With time, users
> could hopefully update their applications and we could eventually get
> rid of it. This _seems_ to be the less aggressive way to go.

Do you mean

set 0: IPC_INFO return shmmax = 0.
set 1: IPC_INFO return shmmax = ULONG_MAX.

?

That makes sense.


Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

2014-04-03 Thread KOSAKI Motohiro
On Wed, Apr 2, 2014 at 12:09 PM, Dave Hansen  wrote:
> On 04/02/2014 01:56 AM, Li Zhong wrote:
>> I noticed the phys_index and end_phys_index under
>> /sys/devices/system/memory/memoryXXX/ have the same value, e.g.
>> (for the test machine, one memory block has 8 sections, that is
>>  sections_per_block equals 8)
>>
>> # cd /sys/devices/system/memory/memory100/
>> # cat phys_index end_phys_index
>> 0064
>> 0064
>>
>> Seems they should reflect the start/end section number respectively, which
>> also matches what is said in Documentation/memory-hotplug.txt
>
> This changes a user-visible interface.  Won't this break userspace?

But who uses this? It is totally broken, and I think nobody can use it
meaningfully. I bet we can fix this right now.


Re: [PATCH] ipc,shm: disable shmmax and shmall by default

2014-04-03 Thread KOSAKI Motohiro
On Wed, Apr 2, 2014 at 8:20 PM, Davidlohr Bueso  wrote:
> The default size for shmmax is, and always has been, 32Mb.
> Today, in the XXI century, it seems that this value is rather small,
> making users have to increase it via sysctl, which can cause
> unnecessary work and userspace application workarounds[1].
>
> Instead of choosing yet another arbitrary value, larger than 32Mb,
> this patch disables the use of both shmmax and shmall by default,
> allowing users to create segments of unlimited sizes. Users and
> applications that already explicitly set these values through sysctl
> are left untouched, and thus does not change any of the behavior.
>
> So a value of 0 bytes or pages, for shmmax and shmall, respectively,
> implies unlimited memory, as opposed to disabling sysv shared memory.
> This is safe as 0 cannot possibly be used previously as SHMMIN is
> hardcoded to 1 and cannot be modified.
>
> This change allows Linux to treat shm just as regular anonymous memory.
> One important difference between them, though, is handling out-of-memory
> conditions: as opposed to regular anon memory, the OOM killer will not
> kill processes that are hogging memory through shm, allowing users to
> potentially abuse this. To overcome this situation, the shm_rmid_forced
> option must be enabled.

I'm very slightly against this sentence.

The OOM killer WILL kill the process, because touching shm increases RSS anyway.
But the killing doesn't free any memory, because it's shmem.

>
> Running this patch through LTP, everything passes, except the following,
> which, due to the nature of this change, is quite expected:
>
> shmget021  TFAIL  :  call succeeded unexpectedly
>
> [1]: http://rhaas.blogspot.com/2012/06/absurd-shared-memory-limits.html
>
> Signed-off-by: Davidlohr Bueso 
> ---
>  include/linux/shm.h  | 2 +-
>  include/uapi/linux/shm.h | 8 
>  ipc/shm.c| 6 --
>  3 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/shm.h b/include/linux/shm.h
> index 1e2cd2e..0ca06a3 100644
> --- a/include/linux/shm.h
> +++ b/include/linux/shm.h
> @@ -4,7 +4,7 @@
>  #include 
>  #include 
>
> -#define SHMALL (SHMMAX/PAGE_SIZE*(SHMMNI/16)) /* max shm system wide (pages) */
> +#define SHMALL 0 /* max shm system wide (pages) */
>  #include 
>  struct shmid_kernel /* private to the kernel */
>  {
> diff --git a/include/uapi/linux/shm.h b/include/uapi/linux/shm.h
> index 78b6941..5f0ef28 100644
> --- a/include/uapi/linux/shm.h
> +++ b/include/uapi/linux/shm.h
> @@ -9,14 +9,14 @@
>
>  /*
>   * SHMMAX, SHMMNI and SHMALL are upper limits are defaults which can
> - * be increased by sysctl
> + * be increased by sysctl. By default, disable SHMMAX and SHMALL with
> + * 0 bytes, thus allowing processes to have unlimited shared memory.
>   */
> -
> -#define SHMMAX 0x2000000	/* max shared seg size (bytes) */
> +#define SHMMAX 0		/* max shared seg size (bytes) */
>  #define SHMMIN 1		/* min shared seg size (bytes) */
>  #define SHMMNI 4096		/* max num of segs system wide */
>  #ifndef __KERNEL__
> -#define SHMALL (SHMMAX/getpagesize()*(SHMMNI/16))
> +#define SHMALL 0
>  #endif
>  #define SHMSEG SHMMNI   /* max shared segs per process */
>
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 7645961..ae01ffa 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -490,10 +490,12 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>  	int id;
>  	vm_flags_t acctflag = 0;
> 
> -	if (size < SHMMIN || size > ns->shm_ctlmax)
> +	if (ns->shm_ctlmax &&
> +	    (size < SHMMIN || size > ns->shm_ctlmax))
>  		return -EINVAL;
> 
> -	if (ns->shm_tot + numpages > ns->shm_ctlall)
> +	if (ns->shm_ctlall &&
> +	    ns->shm_tot + numpages > ns->shm_ctlall)
>  		return -ENOSPC;
> 
>  	shp = ipc_rcu_alloc(sizeof(*shp));

Looks good.

Acked-by: KOSAKI Motohiro 


Re: [PATCH] ipc,shm: increase default size for shmmax

2014-04-01 Thread KOSAKI Motohiro
>> > Ah-hah, that's interesting info.
>> >
>> > Let's make the default 64GB?
>>
>> 64GB was infinity at that time, but it is no longer near infinity today. I'd
>> like a very large number, or one proportional to total memory.
>
> So I still like 0 for unlimited. Nice, clean and much easier to look at
> than ULONG_MAX. And since we cannot disable shm through SHMMIN, I really
> don't see any disadvantages, as opposed to some other arbitrary value.
> Furthermore it wouldn't break userspace: any existing sysctl would
> continue to work, and if not set, the user never has to worry about this
> tunable again.
>
> Please let me know if you all agree with this...

Sure thing. Why not. :)


Re: [PATCH] ipc,shm: increase default size for shmmax

2014-04-01 Thread KOSAKI Motohiro
On Tue, Apr 1, 2014 at 5:48 PM, Andrew Morton  wrote:
> On Tue, 1 Apr 2014 17:41:54 -0400 KOSAKI Motohiro  
> wrote:
>
>> >> > Hmmm so 0 won't really work because it could be weirdly used to disable
>> >> > shm altogether... we cannot go to some negative value either since we're
>> >> > dealing with unsigned, and cutting the range in half could also hurt
>> >> > users that set the limit above that. So I was thinking of simply setting
>> >> > SHMMAX to ULONG_MAX and be done with it. Users can then set it manually
>> >> > if they want a smaller value.
>> >> >
>> >> > Makes sense?
>> >>
>> >> I don't think people use 0 for disabling. but ULONG_MAX make sense to me 
>> >> too.
>> >
>> > Distros could have set it to [U]LONG_MAX in initscripts ten years ago
>> > - less phone calls, happier customers.  And they could do so today.
>> >
>> > But they haven't.   What are the risks of doing this?
>>
>> I have no idea really. But at least I'm sure the current default is much worse.
>>
>> 1. Solaris changed the default to total-memory/4 since Solaris 10 for DB.
>>  http://www.postgresql.org/docs/9.1/static/kernel-resources.html
>>
>> 2. RHEL changed the default to a very big size starting with RHEL5 (now it
>> is 64GB), even though many boxes didn't have 64GB of memory at that time.
>
> Ah-hah, that's interesting info.
>
> Let's make the default 64GB?

64GB was infinity at that time, but it is no longer near infinity today. I'd
like a very large number, or one proportional to total memory.

But I'm open. Please let me know if anyone knows a disadvantage of a very
large value.


Re: [PATCH] ipc,shm: increase default size for shmmax

2014-04-01 Thread KOSAKI Motohiro
>> > Hmmm so 0 won't really work because it could be weirdly used to disable
>> > shm altogether... we cannot go to some negative value either since we're
>> > dealing with unsigned, and cutting the range in half could also hurt
>> > users that set the limit above that. So I was thinking of simply setting
>> > SHMMAX to ULONG_MAX and be done with it. Users can then set it manually
>> > if they want a smaller value.
>> >
>> > Makes sense?
>>
>> I don't think people use 0 for disabling, but ULONG_MAX makes sense to me too.
>
> Distros could have set it to [U]LONG_MAX in initscripts ten years ago
> - less phone calls, happier customers.  And they could do so today.
>
> But they haven't.   What are the risks of doing this?

I have no idea really. But at least I'm sure the current default is much worse.

1. Solaris changed the default to total-memory/4 since Solaris 10 for DB.
 http://www.postgresql.org/docs/9.1/static/kernel-resources.html

2. RHEL changed the default to a very big size starting with RHEL5 (now it is
64GB), even though many boxes didn't have 64GB of memory at that time.


Re: [PATCH] ipc,shm: increase default size for shmmax

2014-04-01 Thread KOSAKI Motohiro
On Tue, Apr 1, 2014 at 5:01 PM, Davidlohr Bueso  wrote:
> On Tue, 2014-04-01 at 15:51 -0400, KOSAKI Motohiro wrote:
>> >> So, I personally like 0 byte per default.
>> >
>> > If by this you mean 0 bytes == unlimited, then I agree. It's less harsh
>> > then removing it entirely. So instead of removing the limit we can just
>> > set it by default to 0, and in newseg() if shm_ctlmax == 0 then we don't
>> > return EINVAL if the passed size is great (obviously), otherwise, if the
>> > user _explicitly_ set it via sysctl then we respect that. Andrew, do you
>> > agree with this? If so I'll send a patch.
>>
>> Yes, my 0 bytes mean unlimited. I totally agree we shouldn't remove the knob
>> entirely.
>
> Hmmm so 0 won't really work because it could be weirdly used to disable
> shm altogether... we cannot go to some negative value either since we're
> dealing with unsigned, and cutting the range in half could also hurt
> users that set the limit above that. So I was thinking of simply setting
> SHMMAX to ULONG_MAX and be done with it. Users can then set it manually
> if they want a smaller value.
>
> Makes sense?

I don't think people use 0 for disabling, but ULONG_MAX makes sense to me too.


Re: [PATCH] ipc,shm: increase default size for shmmax

2014-04-01 Thread KOSAKI Motohiro
>> Our middleware engineers have been complaining about this sysctl limit.
>> System administrators need to calculate the required sysctl value by summing
>> over all planned middleware, and middleware providers need to write "please
>> calculate the sysctl param by ..." in their installation manuals.
>
> Why aren't people just setting the sysctl to a petabyte?  What problems
> would that lead to?

I don't have much Fujitsu middleware knowledge, but I'd like to explain a
very funny bug I saw.

1. middleware-A suggested setting SHMMAX to a very large value (maybe
LONG_MAX, but my memory was flushed)
2. middleware-B suggested increasing SHMMAX by some dozen megabytes.

Finally, the value overflowed and nothing worked at all.

Let's demonstrate.

# echo 18446744073709551615 > /proc/sys/kernel/shmmax
# cat /proc/sys/kernel/shmmax
18446744073709551615
# echo 18446744073709551616 > /proc/sys/kernel/shmmax
# cat /proc/sys/kernel/shmmax
0
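
The same wrap-around in a few lines of C (values illustrative):

	unsigned long cur = 18446744073709551615UL;	/* middleware-A: ULONG_MAX */
	unsigned long newmax = cur + (64UL << 20);	/* middleware-B: "+64MB" */
	/* newmax is now 67108863: the sum wrapped, leaving a tiny limit */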

That's why a lot of open source software continues this silly game. But
again, I don't have knowledge about Fujitsu middleware; I'm waiting for
Kamezawa-san's answer.


Re: [PATCH] ipc,shm: increase default size for shmmax

2014-04-01 Thread KOSAKI Motohiro
On Tue, Apr 1, 2014 at 2:31 PM, Davidlohr Bueso  wrote:
> On Tue, 2014-04-01 at 14:10 -0400, KOSAKI Motohiro wrote:
>> On Tue, Apr 1, 2014 at 1:01 PM, Davidlohr Bueso  wrote:
>> > On Mon, 2014-03-31 at 17:05 -0700, Andrew Morton wrote:
>> >> On Mon, 31 Mar 2014 16:25:32 -0700 Davidlohr Bueso  
>> >> wrote:
>> >>
>> >> > On Mon, 2014-03-31 at 16:13 -0700, Andrew Morton wrote:
>> >> > > On Mon, 31 Mar 2014 15:59:33 -0700 Davidlohr Bueso  
>> >> > > wrote:
>> >> > >
>> >> > > > >
>> >> > > > > - Shouldn't there be a way to alter this namespace's shm_ctlmax?
>> >> > > >
>> >> > > > Unfortunately this would also add the complexity I previously 
>> >> > > > mentioned.
>> >> > >
>> >> > > But if the current namespace's shm_ctlmax is too small, you're 
>> >> > > screwed.
>> >> > > Have to shut down the namespace all the way back to init_ns and start
>> >> > > again.
>> >> > >
>> >> > > > > - What happens if we just nuke the limit altogether and fall back 
>> >> > > > > to
>> >> > > > >   the next check, which presumably is the rlimit bounds?
>> >> > > >
>> >> > > > afaik we only have rlimit for msgqueues. But in any case, while I 
>> >> > > > like
>> >> > > > that simplicity, it's too late. Too many workloads (specially DBs) 
>> >> > > > rely
>> >> > > > heavily on shmmax. Removing it and relying on something else would 
>> >> > > > thus
>> >> > > > cause a lot of things to break.
>> >> > >
>> >> > > It would permit larger shm segments - how could that break things?  It
>> >> > > would make most or all of these issues go away?
>> >> > >
>> >> >
>> >> > So sysadmins wouldn't be very happy, per man shmget(2):
>> >> >
>> >> > EINVAL A new segment was to be created and size < SHMMIN or size >
>> >> > SHMMAX, or no new segment was to be created, a segment with given key
>> >> > existed, but size is greater than the size of that segment.
>> >>
>> >> So their system will act as if they had set SHMMAX=enormous.  What
>> >> problems could that cause?
>> >
>> > So, just like any sysctl configurable, only privileged users can change
>> > this value. If we remove this option, users can theoretically create
>> > huge segments, thus ignoring any custom limit previously set. This is
>> > what I fear. Think of it kind of like mlock's rlimit. And for that
>> > matter, why does sysctl exist at all, the same would go for the rest of
>> > the limits.
>>
>> Hmm. It's hard to agree. AFAIK 32MB is just borrowed from other Unixes
>> and doesn't reflect any Linux internals.
>
> Agreed, it's stupid, but it's what Linux chose to use since forever.
>
>> Look, a non-privileged user
>> can use unlimited memory, at least on Linux. So I don't see any
>> difference between regular anon and shmem.
>
> Fine, let's try it, if users complain we can revert.
>
>>
>> So, I personally like 0 bytes as the default.
>
> If by this you mean 0 bytes == unlimited, then I agree. It's less harsh
> then removing it entirely. So instead of removing the limit we can just
> set it by default to 0, and in newseg() if shm_ctlmax == 0 then we don't
> return EINVAL if the passed size is great (obviously), otherwise, if the
> user _explicitly_ set it via sysctl then we respect that. Andrew, do you
> agree with this? If so I'll send a patch.

Yes, by 0 bytes I mean unlimited. I totally agree we shouldn't remove the
knob entirely.


Re: [PATCH] ipc,shm: increase default size for shmmax

2014-04-01 Thread KOSAKI Motohiro
On Tue, Apr 1, 2014 at 1:01 PM, Davidlohr Bueso  wrote:
> On Mon, 2014-03-31 at 17:05 -0700, Andrew Morton wrote:
>> On Mon, 31 Mar 2014 16:25:32 -0700 Davidlohr Bueso  wrote:
>>
>> > On Mon, 2014-03-31 at 16:13 -0700, Andrew Morton wrote:
>> > > On Mon, 31 Mar 2014 15:59:33 -0700 Davidlohr Bueso  
>> > > wrote:
>> > >
>> > > > >
>> > > > > - Shouldn't there be a way to alter this namespace's shm_ctlmax?
>> > > >
>> > > > Unfortunately this would also add the complexity I previously 
>> > > > mentioned.
>> > >
>> > > But if the current namespace's shm_ctlmax is too small, you're screwed.
>> > > Have to shut down the namespace all the way back to init_ns and start
>> > > again.
>> > >
>> > > > > - What happens if we just nuke the limit altogether and fall back to
>> > > > >   the next check, which presumably is the rlimit bounds?
>> > > >
>> > > > afaik we only have rlimit for msgqueues. But in any case, while I like
>> > > > that simplicity, it's too late. Too many workloads (specially DBs) rely
>> > > > heavily on shmmax. Removing it and relying on something else would thus
>> > > > cause a lot of things to break.
>> > >
>> > > It would permit larger shm segments - how could that break things?  It
>> > > would make most or all of these issues go away?
>> > >
>> >
>> > So sysadmins wouldn't be very happy, per man shmget(2):
>> >
>> > EINVAL A new segment was to be created and size < SHMMIN or size >
>> > SHMMAX, or no new segment was to be created, a segment with given key
>> > existed, but size is greater than the size of that segment.
>>
>> So their system will act as if they had set SHMMAX=enormous.  What
>> problems could that cause?
>
> So, just like any sysctl configurable, only privileged users can change
> this value. If we remove this option, users can theoretically create
> huge segments, thus ignoring any custom limit previously set. This is
> what I fear. Think of it kind of like mlock's rlimit. And for that
> matter, why does sysctl exist at all, the same would go for the rest of
> the limits.

Hmm. It's hard to agree. AFAIK 32MB was just borrowed from other Unixes
and it doesn't reflect any Linux internals. Look, a non-privileged user
can use unlimited memory, at least on Linux. So I don't see any
difference between regular anon and shmem.

So, I personally prefer 0 bytes as the default.


Re: [PATCH 4/5] vrange: Set affected pages referenced when marking volatile

2014-03-23 Thread KOSAKI Motohiro
On Fri, Mar 21, 2014 at 2:17 PM, John Stultz  wrote:
> One issue that some potential users were concerned about, was that
> they wanted to ensure that all the pages from one volatile range
> were purged before we purge pages from a different volatile range.
> This would prevent the case where they have 4 large objects, and
> the system purges one page from each object, causing all of the
> objects to have to be re-created.
>
> The counter-point to this case, is when an application is using the
> SIGBUS semantics to continue to access pages after they have been
> marked volatile. In that case, the desire was that the most recently
> touched pages be purged last, and only the "cold" pages be purged
> from the specified range.
>
> Instead of adding option flags for the various usage model (at least
> initially), one way of getting a solution for both uses would be to
> have the act of marking pages as volatile in effect mark the pages
> as accessed. Since all of the pages in the range would be marked
> together, they would be of the same "age" and would (approximately)
> be purged together. Further, if any pages in the range were accessed
> after being marked volatile, they would be moved to the end of the
> lru and be purged later.

If you run after two hares, you will catch neither. I suspect this patch
won't make any user happy.
I suggest aiming for the former case (object-level caching) first, and
addressing the latter in a separate patch kit.


Re: [PATCH 3/5] vrange: Add page purging logic & SIGBUS trap

2014-03-23 Thread KOSAKI Motohiro
On Fri, Mar 21, 2014 at 2:17 PM, John Stultz  wrote:
> This patch adds the hooks in the vmscan logic to discard volatile pages
> and mark their pte as purged. With this, volatile pages will be purged
> under pressure, and their ptes swap entry's marked. If the purged pages
> are accessed before being marked non-volatile, we catch this and send a
> SIGBUS.
>
> This is a simplified implementation that uses logic from Minchan's earlier
> efforts, so credit to Minchan for his work.
>
> Cc: Andrew Morton 
> Cc: Android Kernel Team 
> Cc: Johannes Weiner 
> Cc: Robert Love 
> Cc: Mel Gorman 
> Cc: Hugh Dickins 
> Cc: Dave Hansen 
> Cc: Rik van Riel 
> Cc: Dmitry Adamushko 
> Cc: Neil Brown 
> Cc: Andrea Arcangeli 
> Cc: Mike Hommey 
> Cc: Taras Glek 
> Cc: Jan Kara 
> Cc: KOSAKI Motohiro 
> Cc: Michel Lespinasse 
> Cc: Minchan Kim 
> Cc: linux...@kvack.org 
> Signed-off-by: John Stultz 
> ---
>  include/linux/vrange.h |   2 +
>  mm/internal.h  |   2 -
>  mm/memory.c|  21 +
>  mm/rmap.c  |   5 +++
>  mm/vmscan.c|  12 ++
>  mm/vrange.c| 114 
> +
>  6 files changed, 154 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/vrange.h b/include/linux/vrange.h
> index 986fa85..d93ad21 100644
> --- a/include/linux/vrange.h
> +++ b/include/linux/vrange.h
> @@ -8,4 +8,6 @@
>  #define VRANGE_VOLATILE 1
>  #define VRANGE_VALID_FLAGS (0) /* Don't yet support any flags */
>
> +extern int discard_vpage(struct page *page);
> +
>  #endif /* _LINUX_VRANGE_H */
> diff --git a/mm/internal.h b/mm/internal.h
> index 29e1e76..ea66bf9 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -225,10 +225,8 @@ static inline void mlock_migrate_page(struct page 
> *newpage, struct page *page)
>
>  extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern unsigned long vma_address(struct page *page,
>  struct vm_area_struct *vma);
> -#endif
>  #else /* !CONFIG_MMU */
>  static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page 
> *p)
>  {
> diff --git a/mm/memory.c b/mm/memory.c
> index 22dfa61..db5f4da 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -60,6 +60,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -3643,6 +3644,8 @@ static int handle_pte_fault(struct mm_struct *mm,
>
> entry = *pte;
> if (!pte_present(entry)) {
> +   swp_entry_t vrange_entry;
> +retry:
> if (pte_none(entry)) {
> if (vma->vm_ops) {
> if (likely(vma->vm_ops->fault))
> @@ -3652,6 +3655,24 @@ static int handle_pte_fault(struct mm_struct *mm,
> return do_anonymous_page(mm, vma, address,
>  pte, pmd, flags);
> }
> +
> +   vrange_entry = pte_to_swp_entry(entry);
> +   if (unlikely(is_vpurged_entry(vrange_entry))) {
> +   if (vma->vm_flags & VM_VOLATILE)
> +   return VM_FAULT_SIGBUS;
> +
> +   /* zap pte */
> +   ptl = pte_lockptr(mm, pmd);
> +   spin_lock(ptl);
> +   if (unlikely(!pte_same(*pte, entry)))
> +   goto unlock;
> +   flush_cache_page(vma, address, pte_pfn(*pte));
> +   ptep_clear_flush(vma, address, pte);
> +   pte_unmap_unlock(pte, ptl);
> +   goto retry;

This looks strange. Why do we need to zap the pte here?

> +   }
> +
> +
> if (pte_file(entry))
> return do_nonlinear_fault(mm, vma, address,
> pte, pmd, flags, entry);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d9d4231..2b6f079 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -728,6 +728,11 @@ int page_referenced_one(struct page *page, struct 
> vm_area_struct *vma,
> referenced++;
> }
> pte_unmap_unlock(pte, ptl);
> +   if (vma->vm_flags & VM_VOLATILE) {
> +   pra->mapcount = 0;
> +   pra->vm_flags |= VM_VOLATILE;
> +   return SWAP_FAIL;
> +   }
> }
>
> if (referenced) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index

Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile

2014-03-23 Thread KOSAKI Motohiro
On Sun, Mar 23, 2014 at 1:26 PM, John Stultz  wrote:
> On Sun, Mar 23, 2014 at 10:50 AM, KOSAKI Motohiro
>  wrote:
>>> +/**
>>> + * vrange_check_purged_pte - Checks ptes for purged pages
>>> + *
>>> + * Iterates over the ptes in the pmd checking if they have
>>> + * purged swap entries.
>>> + *
>>> + * Sets the vrange_walker.pages_purged to 1 if any were purged.
>>> + */
>>> +static int vrange_check_purged_pte(pmd_t *pmd, unsigned long addr,
>>> +   unsigned long end, struct mm_walk 
>>> *walk)
>>> +{
>>> +   struct vrange_walker *vw = walk->private;
>>> +   pte_t *pte;
>>> +   spinlock_t *ptl;
>>> +
>>> +   if (pmd_trans_huge(*pmd))
>>> +   return 0;
>>> +   if (pmd_trans_unstable(pmd))
>>> +   return 0;
>>> +
>>> +   pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>>> +   for (; addr != end; pte++, addr += PAGE_SIZE) {
>>> +   if (!pte_present(*pte)) {
>>> +   swp_entry_t vrange_entry = pte_to_swp_entry(*pte);
>>> +
>>> +   if (unlikely(is_vpurged_entry(vrange_entry))) {
>>> +   vw->page_was_purged = 1;
>>> +   break;
>>
>> This function only detects whether a vpurge entry exists or not. But
>> VRANGE_NONVOLATILE should remove all vpurge entries.
>> Otherwise, a range made non-volatile can still raise SIGBUS.
>
> So in the following patch (3/5), we only SIGBUS if the swap entry
> is_vpurged_entry()  && the vma is still marked volatile, so this
> shouldn't be an issue.

When a VOLATILE -> NON-VOLATILE -> VOLATILE transition happens, is
the page immediately marked "was purged" again?

I don't understand how the vma check helps.


Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile

2014-03-23 Thread KOSAKI Motohiro
> +/**
> + * vrange_check_purged_pte - Checks ptes for purged pages
> + *
> + * Iterates over the ptes in the pmd checking if they have
> + * purged swap entries.
> + *
> + * Sets the vrange_walker.pages_purged to 1 if any were purged.
> + */
> +static int vrange_check_purged_pte(pmd_t *pmd, unsigned long addr,
> +   unsigned long end, struct mm_walk 
> *walk)
> +{
> +   struct vrange_walker *vw = walk->private;
> +   pte_t *pte;
> +   spinlock_t *ptl;
> +
> +   if (pmd_trans_huge(*pmd))
> +   return 0;
> +   if (pmd_trans_unstable(pmd))
> +   return 0;
> +
> +   pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> +   for (; addr != end; pte++, addr += PAGE_SIZE) {
> +   if (!pte_present(*pte)) {
> +   swp_entry_t vrange_entry = pte_to_swp_entry(*pte);
> +
> +   if (unlikely(is_vpurged_entry(vrange_entry))) {
> +   vw->page_was_purged = 1;
> +   break;

This function only detects whether a vpurge entry exists or not. But
VRANGE_NONVOLATILE should remove all vpurge entries.
Otherwise, a range made non-volatile can still raise SIGBUS.

> +   }
> +   }
> +   }
> +   pte_unmap_unlock(pte - 1, ptl);
> +   cond_resched();
> +
> +   return 0;
> +}
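
A rough sketch of the behavior I would expect on VRANGE_NONVOLATILE
(illustrative only; the locking follows the quoted walker, and whether to
clear here or in a separate walker is a design choice):

	if (unlikely(is_vpurged_entry(vrange_entry))) {
		vw->page_was_purged = 1;
		/* drop the vpurge entry so it cannot SIGBUS later */
		pte_clear(walk->mm, addr, pte);
		/* keep scanning: every entry in the range must go */
	}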


Re: [PATCH 2/5] vrange: Add purged page detection on setting memory non-volatile

2014-03-23 Thread KOSAKI Motohiro
On Fri, Mar 21, 2014 at 2:17 PM, John Stultz  wrote:
> Users of volatile ranges will need to know if memory was discarded.
> This patch adds the purged state tracking required to inform userland
> when it marks memory as non-volatile that some memory in that range
> was purged and needs to be regenerated.
>
> This is a simplified implementation which uses some of the logic from
> Minchan's earlier efforts, so credit to Minchan for his work.
>
> Cc: Andrew Morton 
> Cc: Android Kernel Team 
> Cc: Johannes Weiner 
> Cc: Robert Love 
> Cc: Mel Gorman 
> Cc: Hugh Dickins 
> Cc: Dave Hansen 
> Cc: Rik van Riel 
> Cc: Dmitry Adamushko 
> Cc: Neil Brown 
> Cc: Andrea Arcangeli 
> Cc: Mike Hommey 
> Cc: Taras Glek 
> Cc: Jan Kara 
> Cc: KOSAKI Motohiro 
> Cc: Michel Lespinasse 
> Cc: Minchan Kim 
> Cc: linux...@kvack.org 
> Signed-off-by: John Stultz 
> ---
>  include/linux/swap.h| 15 --
>  include/linux/swapops.h | 10 +++
>  include/linux/vrange.h  |  3 ++
>  mm/vrange.c | 75 
> +
>  4 files changed, 101 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6..18c12f9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -70,8 +70,19 @@ static inline int current_is_kswapd(void)
>  #define SWP_HWPOISON_NUM 0
>  #endif
>
> -#define MAX_SWAPFILES \
> -   ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> +
> +/*
> + * Purged volatile range pages
> + */
> +#define SWP_VRANGE_PURGED_NUM 1
> +#define SWP_VRANGE_PURGED (MAX_SWAPFILES + SWP_HWPOISON_NUM + 
> SWP_MIGRATION_NUM)
> +
> +
> +#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)  \
> +   - SWP_MIGRATION_NUM \
> +   - SWP_HWPOISON_NUM  \
> +   - SWP_VRANGE_PURGED_NUM \
> +   )

This changes the hwpoison and migration tag numbers. Maybe that's OK,
maybe not. I'd suggest using a younger number than hwpoison.
(That's why hwpoison uses a younger number than migration.)
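
A hypothetical sketch of that layout: the new entry takes the youngest
(lowest) special slot, so SWP_HWPOISON and SWP_MIGRATION_* keep the numeric
values they had before the patch (names follow the patch; the ordering is
the suggestion here, not the posted code):

	#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)	\
			- SWP_MIGRATION_NUM			\
			- SWP_HWPOISON_NUM			\
			- SWP_VRANGE_PURGED_NUM)

	/* new entry: youngest slot */
	#define SWP_VRANGE_PURGED	(MAX_SWAPFILES)
	/* unchanged numeric values */
	#define SWP_HWPOISON		(MAX_SWAPFILES + SWP_VRANGE_PURGED_NUM)
	#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_VRANGE_PURGED_NUM + \
					 SWP_HWPOISON_NUM)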


Re: [PATCH 1/5] vrange: Add vrange syscall and handle splitting/merging and marking vmas

2014-03-23 Thread KOSAKI Motohiro
Hi

On Fri, Mar 21, 2014 at 2:17 PM, John Stultz  wrote:
> This patch introduces the vrange() syscall, which allows for specifying
> ranges of memory as volatile, and able to be discarded by the system.
>
> This initial patch simply adds the syscall, and the vma handling,
> splitting and merging the vmas as needed, and marking them with
> VM_VOLATILE.
>
> No purging or discarding of volatile ranges is done at this point.
>
> Example man page:
>
> NAME
> vrange - Mark or unmark range of memory as volatile
>
> SYNOPSIS
> ssize_t vrange(unsigned_long start, size_t length,
>  unsigned_long mode, unsigned_long flags,
>  int *purged);
>
> DESCRIPTION
> Applications can use vrange(2) to advise kernel that pages of
> anonymous mapping in the given VM area can be reclaimed without
> swapping (or can no longer be reclaimed without swapping).
> The idea is that application can help kernel with page reclaim
> under memory pressure by specifying data it can easily regenerate
> and thus kernel can discard the data if needed.
>
> mode:
> VRANGE_VOLATILE
> Informs the kernel that the VM can discard in pages in
> the specified range when under memory pressure.
> VRANGE_NONVOLATILE
> Informs the kernel that the VM can no longer discard pages
> in this range.
>
> flags: Currently no flags are supported.
>
> purged: Pointer to an integer which will return 1 if
> mode == VRANGE_NONVOLATILE and any page in the affected range
> was purged. If purged returns zero during a mode ==
> VRANGE_NONVOLATILE call, it means all of the pages in the range
> are intact.
>
> If a process accesses volatile memory which has been purged, and
> was not set as non volatile via a VRANGE_NONVOLATILE call, it
> will receive a SIGBUS.
>
> RETURN VALUE
> On success vrange returns the number of bytes marked or unmarked.
> Similar to write(), it may return fewer bytes than specified
> if it ran into a problem.

This explanation doesn't match your implementation. You return the end of
the last VMA minus orig_start. So when there is a hole in the middle of
the range, the number of bytes actually marked (or unmarked) doesn't match
the return value.


>
> When using VRANGE_NON_VOLATILE, if the return value is smaller
> then the specified length, then the value specified by the purged
> pointer will be set to 1 if any of the pages specified in the
> return value as successfully marked non-volatile had been purged.
>
> If an error is returned, no changes were made.

Again, this explanation doesn't match the implementation. When you
encounter a file mapping, you don't roll back prior changes.

>
> ERRORS
> EINVAL This error can occur for the following reasons:
> * The value length is negative or not page size units.
> * addr is not page-aligned
> * mode not a valid value.
> * flags is not a valid value.
>
> ENOMEM Not enough memory
>
> ENOMEM Addresses in the specified range are not currently mapped,
>or are outside the address space of the process.
>
> EFAULT Purged pointer is invalid
>
> This is a simplified implementation which reuses some of the logic
> from Minchan's earlier efforts. So credit to Minchan for his work.
>
> Cc: Andrew Morton 
> Cc: Android Kernel Team 
> Cc: Johannes Weiner 
> Cc: Robert Love 
> Cc: Mel Gorman 
> Cc: Hugh Dickins 
> Cc: Dave Hansen 
> Cc: Rik van Riel 
> Cc: Dmitry Adamushko 
> Cc: Neil Brown 
> Cc: Andrea Arcangeli 
> Cc: Mike Hommey 
> Cc: Taras Glek 
> Cc: Jan Kara 
> Cc: KOSAKI Motohiro 
> Cc: Michel Lespinasse 
> Cc: Minchan Kim 
> Cc: linux...@kvack.org 
> Signed-off-by: John Stultz 
> ---
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  include/linux/mm.h   |   1 +
>  include/linux/vrange.h   |   8 ++
>  mm/Makefile  |   2 +-
>  mm/vrange.c  | 173 
> +++
>  5 files changed, 184 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/vrange.h
>  create mode 100644 mm/vrange.c
>
> diff --git a/arch/x86/syscalls/syscall_64.tbl 
> b/arch/x86/syscalls/syscall_64.tbl
> index a12bddc..7ae3940 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -322,6 +322,7 @@
>  313	common	finit_module		sys_finit_module

RE: Bug 71331 - mlock yields processor to lower priority process

2014-03-21 Thread Motohiro Kosaki
> Mike,
> 
> There are several problem domains where you protect critical sections by 
> assigning multiple threads to a single CPU and use priorities
> and SCHED_FIFO to ensure data integrity.
> 
> In this kind of design you don't make many syscalls.  The ones you do make, 
> have to be clearly understood
> if they block.
> 
> So, yes, I expect that a SCHED_FIFO task, that uses a subset of syscalls 
> known to be non-blocking, will not block.
> 
> If it is not 'unstoppable', then there is a defect in the OS.
> 
> In the past, a call to mlock() was known to be OK.  It would not block.  It 
> might take a while, but it would run to completion.  It does not
> do that any more.

False. mlock has been able to block since it was born.
mlock and mlockall need to allocate memory by definition, and that can
trigger VM activity, which may block. At least, on Linux.

lru_add_drain_all() is not the only place that can wait. Even if we
removed it, mlock could still block. I don't think this discussion makes
sense.

> If mlock() is now a blocking call, then fine.  It only needs to be called on 
> occasion, and this can be accounted for in the application

Now? I have not seen any recent change.

Note: I'm not sure whether Artem's use-case is good or bad. I'm only
saying that a false assumption doesn't make for a good discussion.



Re: [Update PATCH 2/2] aio, mem-hotplug: Add memory barrier to aio ring page migration.

2014-03-04 Thread KOSAKI Motohiro
>> + /*
>> +  * Ensure that the page's data was copied from old one by
>> +  * aio_migratepage().
>> +  */
>> + smp_rmb();
>> +
>
> smp_read_barrier_depends() is better.
>
> "One could place an A smp_rmb() primitive between the pointer fetch and
>  dereference. However, this imposes unneeded overhead on systems (such as
>  i386, IA64, PPC, and SPARC) that respect data dependencies on the read side.
>  A smp_read_barrier_depends() primitive has been added to the Linux 2.6 kernel
>  to eliminate overhead on these systems."
> -- From Chapter 7.1 of <Memory Barriers: a Hardware View for Software Hackers>
> Written by Paul E. McKenney
>

Right.
The memory-barriers document describes this situation.


see https://www.kernel.org/doc/Documentation/memory-barriers.txt

CPU 1			CPU 2
===============		===============
{ M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
M[1] = 4;
<write barrier>
ACCESS_ONCE(P) = 1
			Q = ACCESS_ONCE(P);
			<data dependency barrier>
			D = M[Q];
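
As a minimal sketch (not the actual aio code), CPU 2's side of the table
maps to:

	Q = ACCESS_ONCE(P);		/* fetch the index */
	smp_read_barrier_depends();	/* order the dependent load */
	D = M[Q];			/* sees M[1] == 4 when Q == 1 */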


Re: Is: 'mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq()' fixed it. Was: Re: [BISECTED] Xen HVM guest hangs since 3.12-rc5

2014-02-24 Thread KOSAKI Motohiro
On Mon, Feb 24, 2014 at 11:13 AM, Konrad Rzeszutek Wilk
 wrote:
> On Sat, Feb 22, 2014 at 11:53:31PM -0800, Steven Noonan wrote:
>> On Fri, Feb 21, 2014 at 12:07 PM, Konrad Rzeszutek Wilk
>>  wrote:
>> > On Thu, Feb 20, 2014 at 12:44:15PM -0800, Steven Noonan wrote:
>> >> On Wed, Feb 19, 2014 at 1:01 PM, Steven Noonan  
>> >> wrote:
>> >> > On Wed, Feb 19, 2014 at 9:41 AM, Konrad Rzeszutek Wilk
>> >> >  wrote:
>> >> >> On Tue, Feb 18, 2014 at 11:16:05PM -0800, Steven Noonan wrote:
>> >> >>> I've been running into problems on an Xen HVM domU. I've got a guest 
>> >> >>> with NUMA
>> >> >>> enabled, 60GB of RAM, and 3 disks attached (including root volume). 2 
>> >> >>> of the
>> >> >>> disks are in an MD RAID0 in the guest, with an ext4 filesystem on top 
>> >> >>> of that.
>> >> >>> I was running the fio 'iometer-file-access-server.fio' example config 
>> >> >>> against
>> >> >>> that fs. During this workload, it would eventually cause a soft 
>> >> >>> lockup, like
>> >> >>> the below:
>> >> >>
>> >> >> I presume since you mention NUMA and Mel is CC-ed that if you boot 
>> >> >> without
>> >> >> NUMA enabled (either via the toolstack or via Linux command line) - 
>> >> >> the issue
>> >> >> is not present?
>> >> >
>> >> > I mentioned NUMA because the bisected commit is sched/numa, and the
>> >> > guest is NUMA-enabled. I hadn't attempted booting with NUMA off. I
>> >> > just tried with numa=off, and the workload has run in a loop for 20
>> >> > minutes so far with no issues (normally the issue would repro in less
>> >> > than 5).
>> >>
>> >> The subject line is actually incorrect -- I did a 'git describe' on
>> >> the result of the bisection when writing the subject line, but the
>> >> '3.12-rc5' tag was just the base on which the code was originally
>> >> developed. As far as what tags actually contain the commit:
>> >>
>> >> $ git tag --contains b795854b1fa70f6aee923ae5df74ff7afeaddcaa
>> >> v3.13
>> >> v3.13-rc1
>> >> v3.13-rc2
>> >> v3.13-rc3
>> >> v3.13-rc4
>> >> v3.13-rc5
>> >> v3.13-rc6
>> >> v3.13-rc7
>> >> v3.13-rc8
>> >> v3.13.1
>> >> v3.13.2
>> >> v3.13.3
>> >> v3.14-rc1
>> >> v3.14-rc2
>> >>
>> >> So it's more accurate to say it was introduced in the v3.13 merge window.
>> >>
>> >> In any case, does anyone have any ideas?
>> >
>> > There is nothing in that git commit that gives that 'AHA' feeling.
>> >
>> > If you revert that patch on top of the latest Linux kernel does the problem
>> > go away? This is more of a double-check to see if the commit
>> > is really the fault or if it exposed some latent issue.
>>
>> I just tried out 3.13.5 and the problem went away. Looking through the
>> commit logs, it appears this commit (added as part of 3.13.4) resolved
>> the issue:
>
> Excellent! Problem solved :-)

Thanks. I'm not surprised by this result, because it fixed a very
fundamental mistake. :-)


RE: [patch] drop_caches: add some documentation and info message

2014-02-10 Thread Motohiro Kosaki


> -Original Message-
> From: Andrew Morton [mailto:a...@linux-foundation.org]
> Sent: Tuesday, February 11, 2014 5:51 AM
> To: Johannes Weiner
> Cc: Rik van Riel; Dave Hansen; Michal Hocko; Motohiro Kosaki JP; KAMEZAWA
> Hiroyuki; linux...@kvack.org; linux-kernel@vger.kernel.org
> Subject: Re: [patch] drop_caches: add some documentation and info
> message
> 
> On Fri, 7 Feb 2014 16:26:01 -0500 Johannes Weiner 
> wrote:
> 
> > On Fri, Feb 07, 2014 at 12:31:29PM -0800, Andrew Morton wrote:
> > > On Fri, 7 Feb 2014 13:13:32 -0500 Johannes Weiner
>  wrote:
> > >
> > > > @@ -63,6 +64,9 @@ int drop_caches_sysctl_handler(ctl_table *table,
> int write,
> > > > iterate_supers(drop_pagecache_sb, NULL);
> > > > if (sysctl_drop_caches & 2)
> > > > drop_slab();
> > > > +   printk_ratelimited(KERN_INFO "%s (%d): dropped kernel
> caches: %d\n",
> > > > +  current->comm, task_pid_nr(current),
> > > > +  sysctl_drop_caches);
> > > > }
> > > > return 0;
> > > >  }
> > >
> > > My concern with this is that there may be people whose
> > > other-party-provided software uses drop_caches.  Their machines will
> > > now sit there emitting log messages and there's nothing they can do
> > > about it, apart from whining at their vendors.
> >
> > Ironically, we have a customer that is complaining that we currently
> > do not log these events, and they want to know who in their stack is
> > being idiotic.
> 
> Right.  But if we release a kernel which goes blah on every write to
> drop_caches, that customer has logs full of blahs which they are now totally
> uninterested in.

Please let me know if I'm misunderstanding something. This patch uses
KERN_INFO, so no log should be emitted by default.

Moreover, if someone changes the syslog log level to INFO, they are going
to see plenty of annoying, excessive logs even if we reject this patch.






[PATCH] __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq

2014-02-04 Thread kosaki . motohiro
From: KOSAKI Motohiro 

Using spin_{un}lock_irq is dangerous if the caller has disabled
interrupts. During aio buffer migration, we can see the following call
stack:

aio_migratepage  [disable interrupt]
  migrate_page_copy
clear_page_dirty_for_io
  set_page_dirty
__set_page_dirty_buffers
  __set_page_dirty
spin_lock_irq

This means the current aio migration path can deadlock. spin_lock_irqsave
is a safer alternative and we should use it.
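
Schematically, the hazard looks like this (an illustrative sketch, not the
actual code; __set_page_dirty here is the static helper in fs/buffer.c,
and the caller-held lock follows the aio example above):

	spin_lock_irqsave(&ctx->completion_lock, flags); /* IRQs now off */
	...
	__set_page_dirty(page, mapping, 1);  /* ends in spin_unlock_irq() */
	...                                  /* IRQs unexpectedly ON here */
	spin_unlock_irqrestore(&ctx->completion_lock, flags);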

Reported-by: David Rientjes <rient...@google.com>
Signed-off-by: KOSAKI Motohiro 
Cc: sta...@vger.kernel.org
---
 fs/buffer.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 651dba1..27265a8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -654,14 +654,16 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 static void __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
 {
-   spin_lock_irq(&mapping->tree_lock);
+   unsigned long flags;
+
+   spin_lock_irqsave(&mapping->tree_lock, flags);
if (page->mapping) {/* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
-   spin_unlock_irq(&mapping->tree_lock);
+   spin_unlock_irqrestore(&mapping->tree_lock, flags);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 }
 
-- 
1.7.1



Re: [PATCH] mm: __set_page_dirty_nobuffers uses spin_lock_irqseve instead of spin_lock_irq

2014-02-04 Thread KOSAKI Motohiro
> Indeed, good catch.  Do we need the same treatment for
> __set_page_dirty_buffers() that can be called by way of
> clear_page_dirty_for_io()?

Indeed. I posted a patch that fixes __set_page_dirty() too. Please see:

Subject: [PATCH] __set_page_dirty uses spin_lock_irqsave instead of
spin_lock_irq


[PATCH] __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq

2014-02-04 Thread kosaki . motohiro
From: KOSAKI Motohiro 

Using spin_{un}lock_irq is dangerous if the caller has disabled
interrupts. spin_lock_irqsave is a safer alternative. Luckily, there is
currently no caller with such usage, but it would be nice to fix this
anyway.

Reported-by: David Rientjes <rient...@google.com>
Signed-off-by: KOSAKI Motohiro 
---
 fs/buffer.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 651dba1..27265a8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -654,14 +654,16 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 static void __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
 {
-   spin_lock_irq(&mapping->tree_lock);
+   unsigned long flags;
+
+   spin_lock_irqsave(&mapping->tree_lock, flags);
if (page->mapping) {/* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
-   spin_unlock_irq(&mapping->tree_lock);
+   spin_unlock_irqrestore(&mapping->tree_lock, flags);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 }
 
-- 
1.7.1



Re: [PATCH] __set_page_dirty uses spin_lock_irqsave instead of spin_lock_irq

2014-02-04 Thread KOSAKI Motohiro
On Tue, Feb 4, 2014 at 11:58 AM,   wrote:
> From: KOSAKI Motohiro 
>
> Using spin_{un}lock_irq is dangerous if the caller has disabled
> interrupts. spin_lock_irqsave is a safer alternative. Luckily, there is
> currently no caller with such usage, but it would be nice to fix this
> anyway.
>
> Reported-by: David Rientjes <rient...@google.com>
> Signed-off-by: KOSAKI Motohiro 

Self-NACK this. There IS a caller, and we should send this to stable.
I'll respin.


[PATCH] mm: __set_page_dirty_nobuffers uses spin_lock_irqseve instead of spin_lock_irq

2014-02-03 Thread kosaki . motohiro
From: KOSAKI Motohiro 

During an aio stress test, we observed the following lockdep warning.
This means AIO + numa_balancing can currently deadlock.

The problem is that aio_migratepage disables interrupts, but
__set_page_dirty_nobuffers unintentionally enables them again.

Generally, all helper functions should use spin_lock_irqsave()
instead of spin_lock_irq(), because they don't know their callers at all.

[  599.843948] other info that might help us debug this:
[  599.873748]  Possible unsafe locking scenario:
[  599.873748]
[  599.900902]        CPU0
[  599.912701]        ----
[  599.924929]   lock(&(&ctx->completion_lock)->rlock);
[  599.950299]   <Interrupt>
[  599.962576]     lock(&(&ctx->completion_lock)->rlock);
[  599.985771]
[  599.985771]  *** DEADLOCK ***

[  600.375623]  [] dump_stack+0x19/0x1b
[  600.398769]  [] print_usage_bug+0x1f7/0x208
[  600.425092]  [] ? 
print_shortest_lock_dependencies+0x1d0/0x1d0
[  600.458981]  [] mark_lock+0x21d/0x2a0
[  600.482910]  [] mark_held_locks+0xb9/0x140
[  600.508956]  [] ? _raw_spin_unlock_irq+0x2c/0x50
[  600.536825]  [] trace_hardirqs_on_caller+0x105/0x1d0
[  600.566861]  [] trace_hardirqs_on+0xd/0x10
[  600.593210]  [] _raw_spin_unlock_irq+0x2c/0x50
[  600.620599]  [] __set_page_dirty_nobuffers+0x8c/0xf0
[  600.649992]  [] migrate_page_copy+0x434/0x540
[  600.676635]  [] aio_migratepage+0xb1/0x140
[  600.703126]  [] move_to_new_page+0x7d/0x230
[  600.729022]  [] migrate_pages+0x5e5/0x700
[  600.754705]  [] ? buffer_migrate_lock_buffers+0xb0/0xb0
[  600.785784]  [] migrate_misplaced_page+0xbc/0xf0
[  600.814029]  [] do_numa_page+0x102/0x190
[  600.839182]  [] handle_pte_fault+0x241/0x970
[  600.865875]  [] handle_mm_fault+0x265/0x370
[  600.892071]  [] __do_page_fault+0x172/0x5a0
[  600.918065]  [] ? retint_swapgs+0x13/0x1b
[  600.943493]  [] do_page_fault+0x1a/0x70
[  600.968081]  [] page_fault+0x28/0x30

Signed-off-by: KOSAKI Motohiro 
Cc: Larry Woodman 
Cc: Rik van Riel 
Cc: Johannes Weiner 
Cc: sta...@vger.kernel.org
---
 mm/page-writeback.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2d30e2c..7106cb1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2173,11 +2173,12 @@ int __set_page_dirty_nobuffers(struct page *page)
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
struct address_space *mapping2;
+   unsigned long flags;
 
if (!mapping)
return 1;
 
-   spin_lock_irq(&mapping->tree_lock);
+   spin_lock_irqsave(&mapping->tree_lock, flags);
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
BUG_ON(mapping2 != mapping);
@@ -2186,7 +2187,7 @@ int __set_page_dirty_nobuffers(struct page *page)
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
-   spin_unlock_irq(&mapping->tree_lock);
+   spin_unlock_irqrestore(&mapping->tree_lock, flags);
if (mapping->host) {
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-- 
1.7.1



Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-27 Thread KOSAKI Motohiro
Hi Minchan,


On Thu, Jan 2, 2014 at 2:12 AM, Minchan Kim  wrote:
> Hey all,
>
> Happy New Year!
>
> I know it's bad timing to send this unfamiliar large patchset for
> review but hope there are some guys with freshed-brain in new year
> all over the world. :)
> And most important thing is that before I dive into lots of testing,
> I'd like to make an agreement on design issues and others
>
> o Syscall interface
> o Not bind with vma split/merge logic to prevent mmap_sem cost and
> o Not bind with vma split/merge logic to avoid vm_area_struct memory
>   footprint.
> o Purging logic - when we trigger purging volatile pages to prevent
>   working set and stop to prevent too excessive purging of volatile
>   pages
> o How to test
>   Currently, we have a patched jemalloc allocator by Jason's help
>   although it's not perfect and more rooms to be enhanced but IMO,
>   it's enough to prove vrange-anonymous. The problem is that
>   lack of benchmark for testing vrange-file side. I hope that
>   Mozilla folks can help.
>
> So its been a while since the last release of the volatile ranges
> patches, again. I and John have been busy with other things.
> Still, we have been slowly chipping away at issues and differences
> trying to get a patchset that we both agree on.
>
> There's still a few issues, but we figured any further polishing of
> the patch series in private would be unproductive and it would be much
> better to send the patches out for review and comment and get some wider
> opinions.
>
> You could get full patchset by git
>
> git clone -b vrange-v10-rc5 --single-branch 
> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git

Brief comments.

- You should provide the jemalloc patch too. Otherwise we cannot
  understand what your measurements mean.
- Your numbers only demonstrate the effectiveness of anon vrange, not
  file vrange.
- Still, nobody likes file vrange. At least, nobody has said so explicitly
  on the list. I won't ack the file vrange part until I am fully convinced
  of the pros and cons. You need to persuade the other MM guys if you
  really think anon vrange is not sufficient. (Maybe LSF is the best
  place.)
- I wrote in several previous iterations that you need to provide a
  measurement of the current implementation vs. a VMA-based
  implementation. You claimed it is fast, but gave no numbers, and you
  still haven't. I guess the reason is that you don't have access to a
  large machine. If so, I'll offer one. Please collaborate with us.

Unfortunately, I'm very busy and I haven't had a chance to review your
latest patch yet. But I'll finish it before the MM summit. And I'll show
you how much this patch improves glibc malloc too.

The glibc folks and I have agreed to push vrange into glibc malloc.

https://sourceware.org/ml/libc-alpha/2013-12/msg00343.html
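
For context, a minimal sketch of how an allocator could use the proposed
syscall on a free chunk (hypothetical usage based on the posted man page;
the real jemalloc patch differs and would invoke this via syscall(2)):

	int purged = 0;

	/* chunk is free: let the kernel discard it under pressure */
	vrange(chunk, size, VRANGE_VOLATILE, 0, &purged);

	/* chunk is reused: make it stable again and check for purging */
	vrange(chunk, size, VRANGE_NONVOLATILE, 0, &purged);
	if (purged)
		memset((void *)chunk, 0, size);	/* contents were discarded */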

Even so, I still dislike some aspects of this patch. I'd like to discuss
them and make a better design decision with you.

Thanks.


>
> In v10, there are some notable changes following as
>
> Whats new in v10:
> * Fix several bugs and build break
> * Add shmem_purge_page to correct purging shmem/tmpfs
> * Replace slab shrinker with direct hooked reclaim path
> * Optimize pte scanning by caching previous place
> * Reorder patch and tidy up Cc-list
> * Rebased on v3.12
> * Add vrange-anon test with jemalloc in Dhaval's test suite
>   - https://github.com/volatile-ranges-test/vranges-test
>   so, you could test any application with vrange-patched jemalloc by
>   LD_PRELOAD but please keep in mind that it's just a prototype to
>   prove vrange syscall concept so it has more rooms to optimize.
>   So, please do not compare it with another allocator.
>
> Whats new in v9:
> * Updated to v3.11
> * Added vrange purging logic to purge anonymous pages on
>   swapless systems
> * Added logic to allocate the vroot structure dynamically
>   to avoid added overhead to mm and address_space structures
> * Lots of minor tweaks, changes and cleanups
>
> Still TODO:
> * Sort out better solution for clearing volatility on new mmaps
> - Minchan has a different approach here
> * Agreement of systemcall interface
> * Better discarding trigger policy to prevent working set evction
> * Review, Review, Review.. Comment.
> * A ton of test
>
> Feedback or thoughts here would be particularly helpful!
>
> Also, thanks to Dhaval for his maintaining and vastly improving
> the volatile ranges test suite, which can be found here:
> [1] https://github.com/volatile-ranges-test/vranges-test
>
> These patches can also be pulled from git here:
> git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-v9
>
> We'd really welcome any feedback and comments on the patch series.


Re: [PATCH] x86, acpi memory hotplug, add parameter to disable memory hotplug

2014-01-13 Thread KOSAKI Motohiro
er 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Cc: Len Brown 
> Cc: "Rafael J. Wysocki" 
> Cc: Linn Crosetto 
> Cc: Pekka Enberg 
> Cc: Yinghai Lu 
> Cc: Andrew Morton 
> Cc: Toshi Kani 
> Cc: Tang Chen 
> Cc: Wen Congyang 
> Cc: Vivek Goyal 
> Cc: kosaki.motoh...@gmail.com
> Cc: dyo...@redhat.com
> Cc: Toshi Kani 
> Cc: linux-a...@vger.kernel.org
> Cc: linux...@kvack.org

I think we need a knob to manually enable memory hotplug when memmap is
specified. But that is another story.

Acked-by: KOSAKI Motohiro 


Re: [PATCH 2/2] x86, e820 disable ACPI Memory Hotplug if memory mapping is specified by user [v2]

2014-01-13 Thread KOSAKI Motohiro
On Sun, Jan 12, 2014 at 6:46 PM, Prarit Bhargava  wrote:
>
>
> On 01/11/2014 11:35 AM, 7egg...@gmx.de wrote:
>>
>>
>> On Fri, 10 Jan 2014, Prarit Bhargava wrote:
>>
>>> kdump uses memmap=exactmap and mem=X values to configure the memory
>>> mapping for the kdump kernel.  If memory is hotadded during the boot of
>>> the kdump kernel it is possible that the page tables for the new memory
>>> cause the kdump kernel to run out of memory.
>>>
>>> Since the user has specified a specific mapping ACPI Memory Hotplug should 
>>> be
>>> disabled in this case.
>>
>> I'll ask just in case: Is it possible to want memory hotplug in spite of
>> using memmap=exactmap or mem=X?
>
> Good question -- I can't think of a case.  When a user specifies "memmap" or
> "mem" IMO they are asking for a very specific memory configuration.  Having
> extra memory added above what the user has specified seems to defeat the 
> purpose
> of "memmap" and "mem".

Maybe yes, maybe no.

They are often used as a workaround for broken firmware issues.
If we have no way to explicitly enable hotplug, we lose that workaround.

Perhaps it doesn't matter, though. Today, memory hotplug is only used on
high-end machines, and their firmware is carefully developed and mostly
free of serious issues.


RE: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

2014-01-10 Thread Motohiro Kosaki
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 068522d..b99c742 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long
> cursor, unsigned int *mapcount,
>   BUG_ON(!page || PageAnon(page));
> 
>   if (locked_vma) {
> - mlock_vma_page(page);   /* no-op if already
> mlocked */
> - if (page == check_page)
> + if (page == check_page) {
> + /* we know we have check_page locked */
> + mlock_vma_page(page);
>   ret = SWAP_MLOCK;
> + } else if (trylock_page(page)) {
> + /*
> +  * If we can lock the page, perform mlock.
> +  * Otherwise leave the page alone, it will be
> +  * eventually encountered again later.
> +  */
> + mlock_vma_page(page);
> + unlock_page(page);
> + }
>   continue;   /* don't unmap */
>   }

I audited all of the related mm code. However, I couldn't find any race
that it closes.

First off, the current munlock code is crazy tricky.

munlock
  down_write(mmap_sem)
  do_mlock()
    mlock_fixup
      munlock_vma_pages_range
        __munlock_pagevec
          spin_lock_irq(zone->lru_lock)
          TestClearPageMlocked(page)
          del_page_from_lru_list
          spin_unlock_irq(zone->lru_lock)
          lock_page
          __munlock_isolated_page
          unlock_page
  up_write(mmap_sem)

Look, TestClearPageMlocked(page) is not protected by lock_page. But this
is still safe, because PageMlocked means one or more vmas are marked
VM_LOCKED, so we only need to care about turning down the final VM_LOCKED.
I.e., mmap_sem protects them.

And,

spin_lock_irq(zone->lru_lock)
del_page_from_lru_list
spin_unlock_irq(zone->lru_lock)

This idiom ensures that I (or someone else) have isolated the page from
the lru, and isolated pages will be put back by putback_lru_page in any
case. So, the pages will move to the right lru eventually.

So, taking the page lock doesn't help close the race against munlock.

On the other hand, page migration has the following call stack:

some-caller [isolate_lru_page]
  unmap_and_move
    __unmap_and_move
      trylock_page
      try_to_unmap
      move_to_new_page
        migrate_page
          migrate_page_copy
      unlock_page

The critical part (i.e. migrate_page_copy) is protected by both page
isolation and the page lock.
The page fault path takes the page lock and doesn't use page isolation.
This is correct.
try_to_unmap_cluster doesn't take the page lock, but it ensures the page
is isolated. This is correct too.

Plus, do_wp_page() has the following comment, but it is wrong. This lock
is necessary to protect against page migration, not lru manipulation.

if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) {
lock_page(old_page);/* LRU manipulation */
munlock_vma_page(old_page);
unlock_page(old_page);
}


But, you know, this is a crazy, ugly lock design. I'm OK with changing the
rule so that PG_mlocked must be protected by the page lock. If so, I
propose adding a PageMlocked() check like this:

} else if (!PageMlocked(page) && trylock_page(page)) {
	/*
	 * If we can lock the page, perform mlock.
	 * Otherwise leave the page alone, it will be
	 * eventually encountered again later.
	 */
	mlock_vma_page(page);
	unlock_page(page);

This is safe. Even if the race happens and vmscan fails to mark
PG_mlocked, that's no problem. The next vmscan pass may do the right job
and fix the mistake.

Anyway, I'm also sure this is not a recent issue. It has lived for 5
years. There is no reason to rush.



Re: [PATCH] acpi memory hotplug, add parameter to disable memory hotplug for kexec

2014-01-09 Thread KOSAKI Motohiro
On Thu, Jan 9, 2014 at 10:00 AM, Vivek Goyal  wrote:
> On Thu, Jan 09, 2014 at 12:00:29AM +0100, Rafael J. Wysocki wrote:
>
> [..]
>> > The system then panics and the kdump/kexec kernel boots.  During this boot
>> > ACPi is initialized and the kernel (as can be seen above)
>>
>> Which is a bug.  You're not supposed to initialize ACPI twice in a row.
>
> [CC lkml, kexec mailing list, dave young]
>
> It is a fresh instance of kernel booting and it is initializing its data
> structures fresh. It is *not* re-initializing ACPI in same kernel.
>
>> > This patchset resolves the problem by adding a kernel parameter,
>> > no_memory_hotplug, to disable ACPI memory hotplug.  It can be added by 
>> > default
>> > as a parameter to the kexec/kdump kernel so the kernel boots correctly.
>>
>> This problem is specific to kexec/kdump, so please don't add *generic* 
>> command
>> line parameters to address this.
>>
>
> There are other command line options to solve kdump problems. In general
> one might want to disable memory hogplug on the fly even if it is compiled
> in the kernel. So it can act as a good debugging aid.
>
> Secondly, it can be specified with memmap=exactmap and mem=X paramters to
> make sure no memory is hot added in the system.
>
> So I can see other usages of this parameter. To me it makes sense to have
> a separate command line option to disable memory hotplug feature on the
> fly.

I'm OK with this option. But note that even if this option is specified,
SH, Power and S390 can still use memory hotplug, because their firmware is
totally different from ACPI.

Maybe adding an acpi prefix would provide good clarification.


Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

2014-01-06 Thread KOSAKI Motohiro
On Mon, Jan 6, 2014 at 11:47 AM, Motohiro Kosaki
 wrote:
>
>
>> -Original Message-
>> From: linus...@gmail.com [mailto:linus...@gmail.com] On Behalf Of Linus
>> Torvalds
>> Sent: Friday, January 03, 2014 7:18 PM
>> To: Vlastimil Babka
>> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu;
>> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman;
>> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing
>> List
>> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear
>> VMAs
>>
>> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka  wrote:
>> >
>> > I'm for going with the removal of BUG_ON. The TestSetPageMlocked
>> > should provide enough race protection.
>>
>> Maybe. But dammit, that's subtle, and I don't think you're even right.
>>
>> It basically depends on mlock_vma_page() and munlock_vma_page() being
>> able to run CONCURRENTLY on the same page. In particular, you could have a
>> mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
>> immediately clearing it on another, and then the rest of those functions
>> could run with a totally arbitrary interleaving when working with the exact
>> same page.
>>
>> They both do basically
>>
>> if (!isolate_lru_page(page))
>> putback_lru_page(page);
>>
>> but one or the other would randomly win the race (it's internally protected
>> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do
>>
>> try_to_munlock(page);
>>
>> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely
>> broken - you end up with the PageMlocked bit clear, but
>> try_to_munlock() was never called on that page, because
>> mlock_vma_page() got to the page isolation before the "subsequent"
>> munlock_vma_page().
>>
>> And this is very much what the page lock serialization would prevent.
>> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit 
>> op,
>> yes, but that only "serializes" in one direction, not when you can have a mix
>> of bit setting and clearing.
>>
>> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least
>> enforces some kind of ordering. And try_to_unmap_cluster() is just broken
>> in calling that without the page being locked. That's my opinion. There may
>> be some *other* reason why it all happens to work, but no,
>> "TestSetPageMlocked should provide enough race protection" is simply not
>> true, and even if it were, it's way too subtle and odd to be a good rule.
>>
>> So I really object to just removing the BUG_ON(). Not with a *lot* more
>> explanation as to why these kinds of issues wouldn't matter.
>
> I don't have a perfect answer. But I can explain a bit of history. Let me try.
>
> First off, 5 years ago, Lee's original putback_lru_page() implementation
> required the page lock, but I removed the restriction months later.
> That's why we see this strange BUG_ON here.
>
> 5 years ago, both mlock(2) and munlock(2) called do_mlock(), and it was
> protected by mmap_sem (write mode). So mlock and munlock had no race.
> Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem
> in read mode. However, that is enough to protect against munlock.
>
> Next, in the case of mlock vs reclaim, the key is that mlock(2) is a
> two-step operation: 1) turn on VM_LOCKED under mmap_sem write mode,
> 2) turn on PageMlocked under mmap_sem read mode. If reclaim races against
> step (1), reclaim must lose because it uses trylock. On the other hand,
> if reclaim races against step (2), reclaim must detect VM_LOCKED, because
> both the VM_LOCKED modifier and observer take mmap_sem.
>
> By the way, page isolation is still necessary, because we need to protect
> against other page modifications like page migration.
>
>
> My memory has almost completely been flushed and I might have lost some
> technical concerns and past discussion. Please point it out if I am
> overlooking something.

No, I was talking about a completely different issue. My memory is
completely broken, as I said. I need to read the latest code and dig up
the past discussion. Sorry again; please ignore my last mail.


RE: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

2014-01-06 Thread Motohiro Kosaki


> -Original Message-
> From: linus...@gmail.com [mailto:linus...@gmail.com] On Behalf Of Linus
> Torvalds
> Sent: Friday, January 03, 2014 7:18 PM
> To: Vlastimil Babka
> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu;
> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman;
> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing
> List
> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear
> VMAs
> 
> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka  wrote:
> >
> > I'm for going with the removal of BUG_ON. The TestSetPageMlocked
> > should provide enough race protection.
> 
> Maybe. But dammit, that's subtle, and I don't think you're even right.
> 
> It basically depends on mlock_vma_page() and munlock_vma_page() being
> able to run CONCURRENTLY on the same page. In particular, you could have a
> mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
> immediately clearing it on another, and then the rest of those functions
> could run with a totally arbitrary interleaving when working with the exact
> same page.
> 
> They both do basically
> 
> if (!isolate_lru_page(page))
> putback_lru_page(page);
> 
> but one or the other would randomly win the race (it's internally protected
> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do
> 
> try_to_munlock(page);
> 
> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely
> broken - you end up with the PageMlocked bit clear, but
> try_to_munlock() was never called on that page, because
> mlock_vma_page() got to the page isolation before the "subsequent"
> munlock_vma_page().
> 
> And this is very much what the page lock serialization would prevent.
> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op,
> yes, but that only "serializes" in one direction, not when you can have a mix
> of bit setting and clearing.
> 
> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least
> enforces some kind of ordering. And try_to_unmap_cluster() is just broken
> in calling that without the page being locked. That's my opinion. There may
> be some *other* reason why it all happens to work, but no,
> "TestSetPageMlocked should provide enough race protection" is simply not
> true, and even if it were, it's way too subtle and odd to be a good rule.
> 
> So I really object to just removing the BUG_ON(). Not with a *lot* more
> explanation as to why these kinds of issues wouldn't matter.

I don't have a perfect answer. But I can explain a bit of history. Let me try.

First off, 5 years ago, Lee's original putback_lru_page() implementation
required the page lock, but I removed the restriction months later. That's
why we see this strange BUG_ON here.

5 years ago, both mlock(2) and munlock(2) called do_mlock(), and it was
protected by mmap_sem (write mode). So mlock and munlock had no race.
Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem in
read mode. However, that is enough to protect against munlock.

Next, in the case of mlock vs reclaim, the key is that mlock(2) is a
two-step operation: 1) turn on VM_LOCKED under mmap_sem write mode,
2) turn on PageMlocked under mmap_sem read mode. If reclaim races against
step (1), reclaim must lose because it uses trylock. On the other hand,
if reclaim races against step (2), reclaim must detect VM_LOCKED, because
both the VM_LOCKED modifier and observer take mmap_sem.
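
Schematically (an illustrative sketch of the locking order, not the actual
call paths):

	/* step 1: mark the vma, excluding all other mmap_sem users */
	down_write(&mm->mmap_sem);
	vma->vm_flags |= VM_LOCKED;
	up_write(&mm->mmap_sem);

	/* step 2: fault pages in and set PG_mlocked; readers allowed */
	down_read(&mm->mmap_sem);
	__mm_populate(start, len, 1);
	up_read(&mm->mmap_sem);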

By the way, page isolation is still necessary, because we need to protect
against other page modifications like page migration.


My memory has almost completely been flushed and I might have lost some
technical concerns and past discussion. Please point it out if I am
overlooking something.

Thanks.



Re: [PATCH V2] mm: add show num_poisoned_pages when oom

2013-12-10 Thread KOSAKI Motohiro
(12/9/2013 8:38 PM), Xishi Qiu wrote:
> Show num_poisoned_pages when OOM; it is a little helpful for finding the reason.
> Also it will be emitted anytime show_mem() is called.
> 
> Signed-off-by: Xishi Qiu 
> Suggested-by: Naoya Horiguchi 
> Acked-by: Michal Hocko 
> Acked-by: David Rientjes 
> ---
>  lib/show_mem.c |3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/lib/show_mem.c b/lib/show_mem.c
> index 5847a49..1cbdcd8 100644
> --- a/lib/show_mem.c
> +++ b/lib/show_mem.c
> @@ -46,4 +46,7 @@ void show_mem(unsigned int filter)
>   printk("%lu pages in pagetable cache\n",
>   quicklist_total_size());
>  #endif
> +#ifdef CONFIG_MEMORY_FAILURE
> + printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
> +#endif
>  }

Looks ok.

Acked-by: KOSAKI Motohiro 



Re: [PATCH 02/10 v2] posix-timers: Remove dead process posix cpu timers caching

2013-12-10 Thread KOSAKI Motohiro
(12/9/2013 12:58 PM), Frederic Weisbecker wrote:
> Now that we have removed the dead-thread posix cpu timers caching,
> let's remove the dead-process-wide version. This caching
> is similar to the per thread version but it should be even
> more rare:
> 
> * If the process is dead, we are not reading its timers
> status from a thread belonging to its group since they
> are all dead. So this caching only concern remote process
> timers reads. Now posix cpu timers using itimers or timer_settime()
> can't do remote process timers anyway so it's not even clear if there
> is actually a user for this caching.
> 
> * Unlike per thread timers caching, this only applies to
> zombies targets. Buried targets' process wide timers return
> 0 values. But then again, timer_gettime() can't read remote
> process timers, so if the process is dead, there can't be
> any reader left anyway.
> 
> Then again this caching seems to complicate the code for
> corner cases that are probably not worth it. So let's get
> rid of it.
> 
> Also remove the sample snapshot on dying process timer
> that is now useless, as suggested by Kosaki.
> 
> Signed-off-by: Frederic Weisbecker 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Oleg Nesterov 
> Cc: Kosaki Motohiro 
> Cc: Andrew Morton 

Looks good to me.

Acked-by: KOSAKI Motohiro 





Re: [PATCH 02/10] posix-timers: Remove dead process posix cpu timers caching

2013-12-04 Thread KOSAKI Motohiro
> @@ -1090,13 +1063,8 @@ void posix_cpu_timer_schedule(struct k_itimer *timer)
>   timer->it.cpu.expires = 0;
>   goto out_unlock;
>   } else if (unlikely(p->exit_state) && thread_group_empty(p)) {
> - /*
> -  * We've noticed that the thread is dead, but
> -  * not yet reaped.  Take this opportunity to
> -  * drop our task ref.
> -  */
> + /* Optimizations: if the process is dying, no need to rearm */
>   cpu_timer_sample_group(timer->it_clock, p, &now);

Why do we still need to call cpu_timer_sample_group? You removed to below line 
which uses
"now".

> - clear_dead_task(timer, now);
>   goto out_unlock;
>   }
>   spin_lock(&p->sighand->siglock);
> 



Re: [PATCH 2/4] check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logic

2013-11-22 Thread KOSAKI Motohiro
>> I have found no problem in this patch. However, I have a very basic question.
>> Why do we need to keep fs->in_exec?
> 
> To ensure that a sub-thread can't create a new process with the same
> ->fs while we are doing exec without LSM_UNSAFE_SHARE, I guess. This
> is only for security/ code.

But in the LSM_UNSAFE_SHARE case, we have no check, right? I'm amazed that
we don't need anything.


> 
>> If it is correct,
>> can't we move it it to signal->in_exec?
> 
> Yes, perhaps, I am thinking about more cleanups too. But not that this
> will add the subtle change. CLONE_THREAD doesn't require CLONE_FS, so
> copy_fs() can fail even it the caller doesn't share ->fs with the execing
> thread. And we still need fs->lock to set signal->in_exec, this looks
> a bit strange.

Oops. Yes, this is totally odd. Sorry, we need to stop the off-topic discussion.
Anyway

Acked-by: KOSAKI Motohiro 

> 
>> I am not expert in this area and I may overlook something.
> 
> Neither me ;) So this patch tries to not change the current logic.
> 
> I feel that perhaps we can do more cleanups, but I am not really sure
> and this needs a separate change.





Re: [PATCH v2 4/4] kill task_struct->did_exec

2013-11-22 Thread KOSAKI Motohiro
(11/22/2013 3:33 PM), Oleg Nesterov wrote:
> On 11/22, KOSAKI Motohiro wrote:
>>
>> (11/22/2013 12:54 PM), Oleg Nesterov wrote:
>>> We can kill either task->did_exec or PF_FORKNOEXEC, they are
>>> mutually exclusive. The patch kills ->did_exec because it has
>>> a single user.
>>
>> It's ok.
>>
>> but,
>>
>>> - * Auch. Had to add the 'did_exec' flag to conform completely to POSIX.
>>> - * LBT 04.03.94
>>> + * !PF_FORKNOEXEC check to conform completely to POSIX. LBT 04.03.94.
>>
>> I guess LBT is his name and !PF_FORKNOEXEC is not his opinion. Please just
>> remove "LBT 04.03.94" too. The git repo still keeps his contribution, so we
>> can avoid confusion.
> 
> OK, please see v2.
> 
> 
> Subject: [PATCH] kill task_struct->did_exec
> From: Oleg Nesterov 
> Date: Fri, 22 Nov 2013 18:43:40 +0100
> 
> We can kill either task->did_exec or PF_FORKNOEXEC, they are
> mutually exclusive. The patch kills ->did_exec because it has
> a single user.
> 
> Signed-off-by: Oleg Nesterov 

Acked-by: KOSAKI Motohiro 





Re: [PATCH 1/4] check_unsafe_exec: use while_each_thread() rather than next_thread()

2013-11-22 Thread KOSAKI Motohiro
(11/22/2013 3:24 PM), Oleg Nesterov wrote:
> On 11/22, KOSAKI Motohiro wrote:
>>
>> (11/22/2013 12:54 PM), Oleg Nesterov wrote:
>>> next_thread() should be avoided, change check_unsafe_exec()
>>> to use while_each_thread(). This also saves 32 bytes.
>>
>> Just curious.
>> Why it should be avoided? Just for cleaner code?
> 
> Nobody except signal->curr_target actually need next_thread-like
> code, and
> 
>> Or is there
>> serious issue?
> 
> We need to change (fix) this interface. This particular code is
> fine, p == current. But in general the code like this can loop
> forever if p exits and next_thread(t) can't reach the unhashed
> thread.

That's a good enough reason.

Acked-by: KOSAKI Motohiro 



Re: [PATCH 3/4] exec: move the final allow_write_access/fput into free_bprm()

2013-11-22 Thread KOSAKI Motohiro
(11/22/2013 12:54 PM), Oleg Nesterov wrote:
> Both the success and failure paths clean up bprm->file; we can move this
> code into free_bprm() to simplify and clean up this logic.
> 
> Signed-off-by: Oleg Nesterov 

Acked-by: KOSAKI Motohiro 




Re: [PATCH 2/4] check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logic

2013-11-22 Thread KOSAKI Motohiro
(11/22/2013 12:54 PM), Oleg Nesterov wrote:
> fs_struct->in_exec == T means that this ->fs is used by a single
> process (thread group), and one of the treads does do_execve().
> 
> To avoid the mt-exec races this code has the following complications:
> 
>   1. check_unsafe_exec() returns -EBUSY if ->in_exec was
>  already set by another thread.
> 
>   2. do_execve_common() records "clear_in_exec" to ensure
>  that the error path can only clear ->in_exec if it was
>  set by current.
> 
> However, after 9b1bf12d5d51 "signals: move cred_guard_mutex from
> task_struct to signal_struct" we do not need these complications:
> 
>   1. We can't race with our sub-thread, this is called under
>  per-process ->cred_guard_mutex. And we can't race with
>  another CLONE_FS task, we already checked that this fs
>  is not shared.
> 
>  We can remove the  dead -EAGAIN logic.
> 
>   2. "out_unmark:" in do_execve_common() is either called
>  under ->cred_guard_mutex, or after de_thread() which
>  kills other threads, so we can't race with sub-thread
>  which could set ->in_exec. And if ->fs is shared with
>  another process ->in_exec should be false anyway.
> 
>  We can clear in_exec unconditionally.
> 
> This also means that check_unsafe_exec() can be void.
> 
> Signed-off-by: Oleg Nesterov 

I have found no problem in this patch. However, I have a very basic question:
why do we need to keep fs->in_exec? If I understand correctly, it is needed to
resolve a race between fork() and exec() in the same process. If that is
correct, can't we move it to signal->in_exec? It seems to match
signal->cred_guard_mutex, and _I_ could easily understand what the code wants.

In other words, currently we have no protection against creating a new thread
when p->fs is shared with another process, and I have no idea why multi-process
sharing should influence multi-thread behavior.

I am not an expert in this area and I may be overlooking something. Please
correct me if I am silly.




Re: [PATCH 1/4] check_unsafe_exec: use while_each_thread() rather than next_thread()

2013-11-22 Thread KOSAKI Motohiro
(11/22/2013 12:54 PM), Oleg Nesterov wrote:
> next_thread() should be avoided, change check_unsafe_exec()
> to use while_each_thread(). This also saves 32 bytes.

Just curious: why should it be avoided? Just for cleaner code, or is there a
serious issue?



Re: [PATCH 4/4] kill task_struct->did_exec

2013-11-22 Thread KOSAKI Motohiro
(11/22/2013 12:54 PM), Oleg Nesterov wrote:
> We can kill either task->did_exec or PF_FORKNOEXEC, they are
> mutually exclusive. The patch kills ->did_exec because it has
> a single user.

It's ok.

but,

> - * Auch. Had to add the 'did_exec' flag to conform completely to POSIX.
> - * LBT 04.03.94
> + * !PF_FORKNOEXEC check to conform completely to POSIX. LBT 04.03.94.

I guess LBT is his name and !PF_FORKNOEXEC is not his opinion. Please just
remove "LBT 04.03.94" too. The git repo still keeps his contribution, so we
can avoid confusion.



Re: [PATCH] Expose sysctls for enabling slab/file_cache interleaving

2013-11-19 Thread KOSAKI Motohiro
On Tue, Nov 19, 2013 at 4:49 PM, Andi Kleen  wrote:
> Michal Hocko  writes:
>>
>> Another option would be to use sysctl values for the top cpuset as a
>> default. But then why not just do it manually without sysctl?
>
> I want to provide an alternative to having to use cpusets to use this,
> that is actually usable for normal people.
>
> Also this is really a global setting in my mind.
>
>> If you create a cpuset and explicitly disable spreading then you would
>> be quite surprised that your process gets pages from all nodes, no?
>
> If I enable it globally using a sysctl I would be quite surprised
> if some cpuset can override it.
>
> That argument is equally valid :-)
>
> The user configured an inconsistent configuration, and the kernel
> has to make a decision somehow.
>
> In the end it is arbitary, but not having to check the cpuset
> here is a lot cheaper, so I prefer the "sysctl has priority"
> option.

Sorry, I agree with Michal. If there are a large-scope knob and a small-scope
knob, the small-scope one should have higher priority. That is one of the best
practices of interface design.

However, I fully agree with the basic concept of this patch. A sysctl helps a
lot of admins.


Re: [tip:sched/urgent] sched: Optimize task_sched_runtime()

2013-11-18 Thread KOSAKI Motohiro
On 11/13/2013 12:25 PM, tip-bot for Peter Zijlstra wrote:
> Commit-ID:  911b2898b3c9fe0048e9485ad1629ed4fce330fd
> Gitweb: http://git.kernel.org/tip/911b2898b3c9fe0048e9485ad1629ed4fce330fd
> Author: Peter Zijlstra 
> AuthorDate: Mon, 11 Nov 2013 18:21:56 +0100
> Committer:  Ingo Molnar 
> CommitDate: Wed, 13 Nov 2013 13:33:54 +0100
> 
> sched: Optimize task_sched_runtime()
> 
> Large multi-threaded apps like to hit this using do_sys_times() and
> then queue up on the rq->lock.
> 
> Avoid when possible.
> 
> Larry reported ~20% performance increase his test case.
> 
> Reported-by: Larry Woodman 
> Suggested-by: Paul Turner 
> Signed-off-by: Peter Zijlstra 
> Cc: KOSAKI Motohiro 
> Cc: Linus Torvalds 
> Cc: Andrew Morton 
> Link: 
> http://lkml.kernel.org/r/2013172925.gg26...@twins.programming.kicks-ass.net
> Signed-off-by: Ingo Molnar 

Sorry for the delay; I took a vacation. As you know, this patch was originally
written by me, so I have already tested it. So, of course,

Acked-by: KOSAKI Motohiro 

# Maybe S-o-b would be better? Dunno.





Re: [PATCH] mm: cache largest vma

2013-11-03 Thread KOSAKI Motohiro
>> I'm slightly surprised this cache achieves a 15% hit rate. Which applications
>> get a benefit? You listed a lot of applications, but I'm not sure
>> which depends heavily on the largest vma.
>
> Well I chose the largest vma because it gives us a greater chance of
> being already cached when we do the lookup for the faulted address.
>
> The 15% improvement was with Hadoop. According to my notes it was at
> ~48% with the baseline kernel and increased to ~63% with this patch.
>
> In any case I didn't measure the rates on a per-task granularity, but at
> a general system level. When a system is first booted I can see that the
> mmap_cache access rate becomes the determinant factor and when adding a
> workload it doesn't change much. One exception to this was a kernel
> build, where we go from ~50% to ~89% hit rate on a vanilla kernel.

I looked at this patch a bit. Its value is in improving the cache hit ratio of
the heap.

1) For single-threaded applications, the heap is frequently the largest mapping
in the process.
2) For the Java VM, "java -Xms1000m -Xmx1000m HelloWorld" makes the following
/proc//smaps entry. That is, the JVM allocates a single heap even if the
application is multi-threaded.

c180-1 rw-p  00:00 0
Size:1024000 kB
Rss: 244 kB
Pss: 244 kB
Shared_Clean:  0 kB
Shared_Dirty:  0 kB
Private_Clean: 0 kB
Private_Dirty:   244 kB
Referenced:  244 kB
Anonymous:   244 kB
AnonHugePages: 0 kB
Swap:  0 kB
KernelPageSize:4 kB
MMUPageSize:   4 kB

That's good.

However, we know there is a situation where this patch doesn't work: glibc
creates per-thread heaps (arenas) by default, so it cannot be expected to work
well for glibc multi-threaded programs. That's a fairly big limitation.

Anyway, I haven't observed a real performance difference, because the biggest
penalty of find_vma comes from taking mmap_sem, not the rb-tree search.

Additional input is welcome, but I myself am not convinced this patch works
everywhere.
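
For reference, here is a rough sketch of the two-slot lookup being discussed;
the field names (mmap_cache, largest_vma) are illustrative, not necessarily the
patch's actual identifiers:

    struct vm_area_struct *find_vma_cached(struct mm_struct *mm,
                                           unsigned long addr)
    {
            struct vm_area_struct *vma = mm->mmap_cache;

            /* slot 1: last used vma, the existing ~45-50% hit rate */
            if (vma && vma->vm_start <= addr && addr < vma->vm_end)
                    return vma;

            /* slot 2: largest vma, e.g. the heap of a single-threaded
             * program or of a JVM; claimed to add ~10-15% */
            vma = mm->largest_vma;
            if (vma && vma->vm_start <= addr && addr < vma->vm_end)
                    return vma;

            return NULL;    /* caller falls back to the rb-tree walk */
    }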


Re: [PATCH] mm: cache largest vma

2013-11-01 Thread KOSAKI Motohiro

(11/1/13 4:17 PM), Davidlohr Bueso wrote:

While caching the last used vma already does a nice job avoiding
having to iterate the rbtree in find_vma, we can improve. After
studying the hit rate on a load of workloads and environments,
it was seen that it was around 45-50% - constant for a standard
desktop system (gnome3 + evolution + firefox + a few xterms),
and multiple java related workloads (including Hadoop/terasort),
and aim7, which indicates it's better than the 35% value documented
in the code.

By also caching the largest vma, that is, the one that contains
most addresses, there is a steady 10-15% hit rate gain, putting
it above the 60% region. This improvement comes at a very low
overhead for a miss. Furthermore, systems with !CONFIG_MMU keep
the current logic.


I'm slightly surprised this cache achieves a 15% hit rate. Which applications
get a benefit? You listed a lot of applications, but I'm not sure which depends
heavily on the largest vma.




Re: [PATCH 0/4] per anon_vma lock and turn anon_vma rwsem lock to rwlock_t

2013-11-01 Thread KOSAKI Motohiro
(11/1/13 3:54 AM), Yuanhan Liu wrote:
> Patch 1 turns locking the anon_vma's root to locking itself to let it be
> a per anon_vma lock, which would reduce contentions.
> 
> In the same time, lock range becomes quite small then, which is bascially
> a call of anon_vma_interval_tree_insert(). Patch 2 turn rwsem to rwlock_t.
> It's a patch made from Ingo, I just made some change to let it apply based on
> patch 1.
> 
> Patch 3 is from Peter. It was a diff, I edited it to be a patch ;)
> 
> Here is the detailed changed stats with this patch applied. The test base is 
> v3.12-rc7,
> and 1c00bef768d4341afa7d is patch 1, e3e37183ee805f33e88f is patch 2.
> 
> NOTE: both commits are compared to base v3.12-rc7.

I'd suggest CCing linux-mm when posting mm patches.



[PATCH v2] mm: __rmqueue_fallback() should respect pageblock type

2013-10-31 Thread kosaki . motohiro
From: KOSAKI Motohiro 

When __rmqueue_fallback() doesn't find a free block of the required size,
it splits a larger page and puts the rest of the page back on a free list.

But it has one serious mistake: when putting pages back, __rmqueue_fallback()
always uses start_migratetype if the type is not CMA. However, __rmqueue_fallback()
is only called when all of the start_migratetype queues are empty. That is,
__rmqueue_fallback() always puts memory back on the wrong queue unless
try_to_steal_freepages() changed the pageblock type (i.e. the requested size is
smaller than half a pageblock). As a result, the anti-fragmentation framework
increases fragmentation instead of decreasing it.

Mel's original anti-fragmentation code did the right thing, but commit 47118af076
(mm: mmzone: MIGRATE_CMA migration type added) broke it.

This patch restores the sane, old behavior. It also removes an incorrect
comment which was introduced by commit fef903efcf (mm/page_alloc.c: restructure
free-page stealing code and fix a bug).

Cc: Mel Gorman 
Signed-off-by: KOSAKI Motohiro 
---
 mm/page_alloc.c |   16 +---
 1 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..d488514 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1027,6 +1027,10 @@ static int try_to_steal_freepages(struct zone *zone, struct page *page,
 {
int current_order = page_order(page);
 
+   /*
+* When borrowing from MIGRATE_CMA, we need to release the excess
+* buddy pages to CMA itself.
+*/
if (is_migrate_cma(fallback_type))
return fallback_type;
 
@@ -1091,17 +1095,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
list_del(&page->lru);
rmv_page_order(page);
 
-   /*
-* Borrow the excess buddy pages as well, irrespective
-* of whether we stole freepages, or took ownership of
-* the pageblock or not.
-*
-* Exception: When borrowing from MIGRATE_CMA, release
-* the excess buddy pages to CMA itself.
-*/
-   expand(zone, page, order, current_order, area,
-  is_migrate_cma(migratetype)
-? migratetype : start_migratetype);
+   expand(zone, page, order, current_order, area, new_type);
 
trace_mm_page_alloc_extfrag(page, order,
current_order, start_migratetype, migratetype,
-- 
1.7.1
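
To see the bug concretely, here is an illustrative walk-through (the orders are
made up for the example and are not taken from the patch):

    /*
     * 1. An order-0 MIGRATE_MOVABLE request arrives, but the MOVABLE
     *    free lists are empty, so __rmqueue_fallback() runs.
     * 2. It falls back to an order-3 block from a MIGRATE_UNMOVABLE
     *    pageblock. order-3 is far below half a pageblock, so
     *    try_to_steal_freepages() does not change the pageblock type:
     *    new_type == MIGRATE_UNMOVABLE.
     * 3. Before the fix, expand() put the order-0..order-2 leftovers
     *    on the MOVABLE free lists even though the pageblock stays
     *    UNMOVABLE, so later movable allocations land inside an
     *    unmovable block and fragmentation grows.
     * 4. With the fix, expand(..., new_type) returns the leftovers to
     *    the UNMOVABLE lists, matching the pageblock type.
     */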



Re: [PATCH] mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve

2013-10-31 Thread KOSAKI Motohiro

Nit: I would like to add the following hunk. This is just a nit because moving
a reserve pageblock is extremely rare.

if (block_migratetype == MIGRATE_RESERVE) {
+   found++;
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
move_freepages_block(zone, page, MIGRATE_MOVABLE);
}


I don't really see the advantage but if you think it is necessary then I
do not object either.


For example, suppose a zone has five pageblocks b1, b2, b3, b4, b5 and b1 is
MIGRATE_RESERVE. When we hot-remove b1 and hot-add it again, your code needs to
scan all of the blocks, but mine only needs to scan b1 and b2. I mean, it's a
hotplug-specific optimization.




Re: [PATCH] mm: __rmqueue_fallback() should respect pageblock type

2013-10-30 Thread KOSAKI Motohiro
(10/31/13 12:24 AM), kosaki.motoh...@gmail.com wrote:
> From: KOSAKI Motohiro 
> 
> When __rmqueue_fallback() doesn't find a free block of the required size,
> it splits a larger page and puts the rest of the page back on a free list.
> 
> But it has one serious mistake: when putting pages back, __rmqueue_fallback()
> always uses start_migratetype if the type is not CMA. However, __rmqueue_fallback()
> is only called when all of the start_migratetype queues are empty. That is,
> __rmqueue_fallback() always puts memory back on the wrong queue unless
> try_to_steal_freepages() changed the pageblock type (i.e. the requested size is
> smaller than half a pageblock). As a result, the anti-fragmentation framework
> increases fragmentation instead of decreasing it.
> 
> Mel's original anti-fragmentation code did the right thing, but commit 47118af076
> (mm: mmzone: MIGRATE_CMA migration type added) broke it.
> 
> This patch restores the sane, old behavior.
> 
> Cc: Mel Gorman 
> Signed-off-by: KOSAKI Motohiro 
> ---
>   mm/page_alloc.c |2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dd886fa..ea7bb9a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1101,7 +1101,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>*/
>   expand(zone, page, order, current_order, area,
>  is_migrate_cma(migratetype)
> -  ? migratetype : start_migratetype);
> +  ? migratetype : new_type);

Oops, this can be simplified to the following, because try_to_steal_freepages()
already has the CMA check.


-   expand(zone, page, order, current_order, area,
-  is_migrate_cma(migratetype)
-? migratetype : start_migratetype);
+   expand(zone, page, order, current_order, area, new_type);





Re: [PATCH] mm: __rmqueue_fallback() should respect pageblock type

2013-10-30 Thread KOSAKI Motohiro

(10/31/13 12:35 AM), Andrew Morton wrote:

On Thu, 31 Oct 2013 00:24:49 -0400 kosaki.motoh...@gmail.com wrote:


When __rmqueue_fallback() doesn't find a free block of the required size,
it splits a larger page and puts the rest of the page back on a free list.

But it has one serious mistake: when putting pages back, __rmqueue_fallback()
always uses start_migratetype if the type is not CMA. However, __rmqueue_fallback()
is only called when all of the start_migratetype queues are empty. That is,
__rmqueue_fallback() always puts memory back on the wrong queue unless
try_to_steal_freepages() changed the pageblock type (i.e. the requested size is
smaller than half a pageblock). As a result, the anti-fragmentation framework
increases fragmentation instead of decreasing it.

Mel's original anti-fragmentation code did the right thing, but commit 47118af076
(mm: mmzone: MIGRATE_CMA migration type added) broke it.

This patch restores the sane, old behavior.


What are the user-visible runtime effects of this change?


Increased memory fragmentation may raise the compaction rate (and then the
system gets unnecessarily slow).




Re: [PATCH] mm: Do not walk all of system memory during show_mem

2013-10-30 Thread KOSAKI Motohiro

(10/16/13 6:42 AM), Mel Gorman wrote:

It has been reported on very large machines that show_mem is taking almost
5 minutes to display information. This is a serious problem if there is
an OOM storm. The bulk of the cost is in show_mem doing a very expensive
PFN walk to give us the following information

Total RAM:  Also available as totalram_pages
Highmem pages:  Also available as totalhigh_pages
Reserved pages: Can be inferred from the zone structure
Shared pages:   PFN walk required
Unshared pages: PFN walk required
Quick pages:Per-cpu walk required

Only the shared/unshared pages requires a full PFN walk but that information
is useless. It is also inaccurate as page pins of unshared pages would
be accounted for as shared.  Even if the information was accurate, I'm
struggling to think how the shared/unshared information could be useful
for debugging OOM conditions. Maybe it was useful before rmap existed when
reclaiming shared pages was costly but it is less relevant today.

The PFN walk could be optimised a bit but why bother as the information is
useless. This patch deletes the PFN walker and infers the total RAM, highmem
and reserved pages count from struct zone. It omits the shared/unshared page
usage on the grounds that it is useless.  It also corrects the reporting
of HighMem as HighMem/MovableOnly as ZONE_MOVABLE has similar problems to
HighMem with respect to lowmem/highmem exhaustion.

Signed-off-by: Mel Gorman 


That's OK. I have never used such information in my long OOM debugging history.

Acked-by: KOSAKI Motohiro 




[PATCH] mm: __rmqueue_fallback() should respect pageblock type

2013-10-30 Thread kosaki . motohiro
From: KOSAKI Motohiro 

When __rmqueue_fallback() doesn't find a free block of the required size,
it splits a larger page and puts the rest of the page back on a free list.

But it has one serious mistake: when putting pages back, __rmqueue_fallback()
always uses start_migratetype if the type is not CMA. However, __rmqueue_fallback()
is only called when all of the start_migratetype queues are empty. That is,
__rmqueue_fallback() always puts memory back on the wrong queue unless
try_to_steal_freepages() changed the pageblock type (i.e. the requested size is
smaller than half a pageblock). As a result, the anti-fragmentation framework
increases fragmentation instead of decreasing it.

Mel's original anti-fragmentation code did the right thing, but commit 47118af076
(mm: mmzone: MIGRATE_CMA migration type added) broke it.

This patch restores the sane, old behavior.

Cc: Mel Gorman 
Signed-off-by: KOSAKI Motohiro 
---
 mm/page_alloc.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..ea7bb9a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1101,7 +1101,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 */
expand(zone, page, order, current_order, area,
   is_migrate_cma(migratetype)
-? migratetype : start_migratetype);
+? migratetype : new_type);
 
trace_mm_page_alloc_extfrag(page, order,
current_order, start_migratetype, migratetype,
-- 
1.7.1



Re: [PATCH] mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve

2013-10-30 Thread KOSAKI Motohiro

@@ -3926,11 +3929,11 @@ static void setup_zone_migrate_reserve(struct zone *zone)
/*
 * Reserve blocks are generally in place to help high-order atomic
 * allocations that are short-lived. A min_free_kbytes value that
-* would result in more than 2 reserve blocks for atomic allocations
-* is assumed to be in place to help anti-fragmentation for the
-* future allocation of hugepages at runtime.
+* would result in more than MAX_MIGRATE_RESERVE_BLOCKS reserve blocks
+* for atomic allocations is assumed to be in place to help
+* anti-fragmentation for the future allocation of hugepages at runtime.
 */
-   reserve = min(2, reserve);
+   reserve = min(MAX_MIGRATE_RESERVE_BLOCKS, reserve);

for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
if (!pfn_valid(pfn))
@@ -3956,6 +3959,7 @@ static void setup_zone_migrate_reserve(struct zone *zone)
/* If this block is reserved, account for it */
if (block_migratetype == MIGRATE_RESERVE) {
reserve--;
+   found++;
continue;
}

@@ -3970,6 +3974,10 @@ static void setup_zone_migrate_reserve(struct zone *zone)
}
}

+   /* If all possible reserve blocks have been found, we're done */
+   if (found >= MAX_MIGRATE_RESERVE_BLOCKS)
+   break;
+
/*
 * If the reserve is met and this is a previous reserved block,
 * take it back


Nit: I would like to add the following hunk. This is just a nit because moving
a reserve pageblock is extremely rare.

if (block_migratetype == MIGRATE_RESERVE) {
+   found++;
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
move_freepages_block(zone, page, MIGRATE_MOVABLE);
}





Re: [PATCH] mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve

2013-10-30 Thread KOSAKI Motohiro

(10/30/13 11:19 AM), Mel Gorman wrote:

On Wed, Oct 23, 2013 at 05:01:32PM -0400, kosaki.motoh...@gmail.com wrote:

From: KOSAKI Motohiro 

Yasuaki Ishimatsu reported that memory hot-add spent more than 5 _hours_
on a 9TB memory machine, and we found that setup_zone_migrate_reserve()
spent >90% of the time.

The problem is that setup_zone_migrate_reserve() scans all pageblocks
unconditionally, but that is only necessary when the number of reserved
blocks has been reduced (i.e. memory hot-remove).
Moreover, the maximum number of MIGRATE_RESERVE pageblocks per zone is
currently 2, which means the number of reserved pageblocks is almost
always unchanged.

This patch adds zone->nr_migrate_reserve_block to maintain the number
of MIGRATE_RESERVE pageblocks, and it reduces the overhead of
setup_zone_migrate_reserve() dramatically.



It seems regrettable to expand the size of struct zone just for this.


This only matters when backporting to enterprise distros. But you are right,
it would be nice if it's avoidable.


You are right that the number of blocks does not exceed 2 because of a
check made in setup_zone_migrate_reserve so it should be possible to
special case this. I didn't test this or think about it particularly
carefully and no doubt there is a nicer way but for illustration
purposes see the patch below.


I'll test it. Please give me a few days.




Re: [PATCH] mm: cma: free cma page to buddy instead of being cpu hot page

2013-10-28 Thread KOSAKI Motohiro
> The concern is whether the likely/unlikely usage is proper in this piece of code.
> If we don't use memory isolation, the code path is used only for
> MIGRATE_RESERVE, which is a very rare allocation in a normal workload.
> 
> Even in a memory isolation environment, I'm not sure how much
> CMA/HOTPLUG is used compared to normal alloc/free.
> So, I think the below is more proper?
> 
> if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
> if (is_migrate_isolate(migratetype) || is_migrate_cma(migratetype))
> 
> I know it's another topic, but I'd like to discuss it this time because
> we will forget such a trivial thing later, again.

I completely agree with this.




[PATCH] Fix page_group_by_mobility_disabled breakage

2013-10-25 Thread kosaki . motohiro
From: KOSAKI Motohiro 

Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
MIGRATE_ISOLATE if page_group_by_mobility_disabled is true: it
rewrites the argument to MIGRATE_UNMOVABLE and we lose those attributes.

The problem was introduced by commit 49255c619f (page allocator: move
check for disabled anti-fragmentation out of fastpath). The fact that
this issue lived for 4 years may mean that nobody uses
page_group_by_mobility_disabled.

But anyway, this patch fixes the problem.

Signed-off-by: KOSAKI Motohiro 
Cc: Mel Gorman 
---
 mm/page_alloc.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..ef44d95 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -234,8 +234,8 @@ int page_group_by_mobility_disabled __read_mostly;
 
 void set_pageblock_migratetype(struct page *page, int migratetype)
 {
-
-   if (unlikely(page_group_by_mobility_disabled))
+   if (unlikely(page_group_by_mobility_disabled &&
+migratetype < MIGRATE_PCPTYPES))
migratetype = MIGRATE_UNMOVABLE;
 
set_pageblock_flags_group(page, (unsigned long)migratetype,
-- 
1.7.1
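
To illustrate the user-visible effect with one assumed scenario (for
illustration only; the call chain is simplified):

    /*
     * With page_group_by_mobility_disabled == 1, before the fix:
     *
     *   start_isolate_page_range()
     *     -> set_migratetype_isolate()
     *        -> set_pageblock_migratetype(page, MIGRATE_ISOLATE)
     *           -> silently stored as MIGRATE_UNMOVABLE
     *
     * so freed pages return to the normal free lists instead of the
     * isolate list, and memory offlining / CMA never sees the range
     * become isolated. The fix lets CMA/ISOLATE pass through and only
     * collapses the ordinary types below MIGRATE_PCPTYPES.
     */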



[PATCH] mm: get rid of unnecessary overhead of trace_mm_page_alloc_extfrag()

2013-10-25 Thread kosaki . motohiro
From: KOSAKI Motohiro 

In general, every tracepoint should have zero overhead when it is disabled.
However, trace_mm_page_alloc_extfrag() is one of the exceptions: it evaluates
"new_type == start_migratetype" even if the tracepoint is disabled.

However, the code can be moved into the tracepoint's TP_fast_assign(), which
exists for exactly this purpose. This patch does that.

Cc: Mel Gorman 
Signed-off-by: KOSAKI Motohiro 
---
 include/trace/events/kmem.h |   10 --
 mm/page_alloc.c |5 ++---
 2 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index d0c6134..aece134 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -267,14 +267,12 @@ DEFINE_EVENT_PRINT(mm_page, mm_page_pcpu_drain,
 TRACE_EVENT(mm_page_alloc_extfrag,
 
TP_PROTO(struct page *page,
-   int alloc_order, int fallback_order,
-   int alloc_migratetype, int fallback_migratetype,
-   int change_ownership),
+   int alloc_order, int fallback_order,
+   int alloc_migratetype, int fallback_migratetype, int new_migratetype),
 
TP_ARGS(page,
alloc_order, fallback_order,
-   alloc_migratetype, fallback_migratetype,
-   change_ownership),
+   alloc_migratetype, fallback_migratetype, new_migratetype),
 
TP_STRUCT__entry(
__field(struct page *,  page)
@@ -291,7 +289,7 @@ TRACE_EVENT(mm_page_alloc_extfrag,
__entry->fallback_order = fallback_order;
__entry->alloc_migratetype  = alloc_migratetype;
__entry->fallback_migratetype   = fallback_migratetype;
-   __entry->change_ownership   = change_ownership;
+   __entry->change_ownership   = (new_migratetype == alloc_migratetype);
),
 
TP_printk("page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4414c9d..58e67ea 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1103,9 +1103,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
   is_migrate_cma(migratetype)
 ? migratetype : start_migratetype);
 
-   trace_mm_page_alloc_extfrag(page, order,
-   current_order, start_migratetype, migratetype,
-   new_type == start_migratetype);
+   trace_mm_page_alloc_extfrag(page, order, current_order,
+   start_migratetype, migratetype, new_type);
 
return page;
}
-- 
1.7.1
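
For reference, a sketch of why the move helps (simplified and illustrative;
the real tracepoint macros are more involved):

    /* before: the comparison is a call argument, so it is computed at
     * the call site even when the tracepoint is disabled */
    trace_mm_page_alloc_extfrag(page, order, current_order,
                                start_migratetype, migratetype,
                                new_type == start_migratetype);

    /* after: the call site passes new_type, and the call expands to
     * roughly the following */
    if (static_key_false(&__tracepoint_mm_page_alloc_extfrag.key)) {
            /* TP_fast_assign() runs only here, so the
             * new_migratetype == alloc_migratetype comparison costs
             * nothing on the disabled path */
    }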



[PATCH] mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve

2013-10-23 Thread kosaki . motohiro
From: KOSAKI Motohiro 

Yasuaki Ishimatsu reported that memory hot-add spent more than 5 _hours_
on a 9TB memory machine, and we found that setup_zone_migrate_reserve()
spent >90% of the time.

The problem is that setup_zone_migrate_reserve() scans all pageblocks
unconditionally, but that is only necessary when the number of reserved
blocks has been reduced (i.e. memory hot-remove).
Moreover, the maximum number of MIGRATE_RESERVE pageblocks per zone is
currently 2, which means the number of reserved pageblocks is almost
always unchanged.

This patch adds zone->nr_migrate_reserve_block to maintain the number
of MIGRATE_RESERVE pageblocks, and it reduces the overhead of
setup_zone_migrate_reserve() dramatically.

Cc: Mel Gorman 
Reported-by: Yasuaki Ishimatsu 
Cc: Yasuaki Ishimatsu 
Signed-off-by: KOSAKI Motohiro 
---
 include/linux/mmzone.h |6 ++
 mm/page_alloc.c|   13 +
 2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd791e4..67ab5fe 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -490,6 +490,12 @@ struct zone {
unsigned long   managed_pages;
 
/*
+* Number of MIGRATE_RESERVE pageblocks. Maintained only as an
+* optimization. Protected by zone->lock.
+*/
+   int nr_migrate_reserve_block;
+
+   /*
 * rarely used fields:
 */
const char  *name;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 58e67ea..76ca054 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3909,6 +3909,7 @@ static void setup_zone_migrate_reserve(struct zone *zone)
struct page *page;
unsigned long block_migratetype;
int reserve;
+   int old_reserve;
 
/*
 * Get the start pfn, end pfn and the number of blocks to reserve
@@ -3930,6 +3931,12 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 * future allocation of hugepages at runtime.
 */
reserve = min(2, reserve);
+   old_reserve = zone->nr_migrate_reserve_block;
+
+   /* On memory hot-add, we almost always need to do nothing */
+   if (reserve == old_reserve)
+   return;
+   zone->nr_migrate_reserve_block = reserve;
 
for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
if (!pfn_valid(pfn))
@@ -3967,6 +3974,12 @@ static void setup_zone_migrate_reserve(struct zone *zone)
reserve--;
continue;
}
+   } else if (!old_reserve) {
+   /*
+* At boot time, we don't need to scan the whole zone
+* to turn off MIGRATE_RESERVE.
+*/
+   break;
}
 
/*
-- 
1.7.1
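
For a rough sense of scale (illustrative arithmetic, not numbers from the
report): with 4KB pages and 2MB pageblocks, a 9TB machine has about
9TB / 2MB ≈ 4.7 million pageblocks, and setup_zone_migrate_reserve() used to
walk all of them on every hot-add. With this patch, a call that leaves the
computed reserve count unchanged returns right after the
reserve == old_reserve check, and the boot-time case (old_reserve == 0) stops
scanning as soon as the reserve blocks have been placed.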



Re: [PATCH] alarmtimer: return EINVAL instead of ENOTSUPP if rtcdev doesn't exist

2013-10-18 Thread KOSAKI Motohiro
On 10/18/2013 6:39 PM, John Stultz wrote:
> On 10/17/2013 06:12 PM, KOSAKI Motohiro wrote:
>> (10/17/13 1:05 PM), John Stultz wrote:
>>> On 10/14/2013 02:33 PM, kosaki.motoh...@gmail.com wrote:
>>>> From: KOSAKI Motohiro 
>>>>
>>>> Fedora Ruby maintainer reported latest Ruby doesn't work on Fedora
>>>> Rawhide
>>>> on ARM. (http://bugs.ruby-lang.org/issues/9008)
>>>>
>>>> This is because commit 1c6b39ad3f (alarmtimers: Return -ENOTSUPP if no
>>>> RTC device is present) introduced returning ENOTSUPP when
>>>> clock_get{time,res} can't find an RTC device. However, that is incorrect.
>>>>
>>>> POSIX and the Linux man pages agree that clock_gettime and clock_getres
>>>> should return EINVAL if the clk_id argument is invalid. This is significantly
>>>> different from the timer_create API.
>>>>
>>>> This patch fixes it.
>>>
>>> Hrm... So I feel like there is a difference here. The clockid for
>>> CLOCK_BOOTTIME_ALARM and CLOCK_REALTIME_ALARM are both valid.
>>>
>>> Its just that they're not supported on this specific hardware because it
>>> apparently lacks a RTC that has told the system it can be used as a
>>> wakeup device (Its actually quite likely on the hardware that the RTC
>>> can be a wakeup device, but that the driver is probably setting the
>>> wakeup flag after the RTC registered - so there is probably a driver bug
>>> here too).
>>>
>>> So I feel like in this case EINVAL isn't quite right.  I'll admit it is
>>> somewhat new behavior, because we haven't had any clockids before that
>>> were dependent on the particular hardware, they either existed in a
>>> kernel verison or didn't.
>>>
>>> Would updating the manpage be a better route?
>>
>> Nope.
>>
>> ENOTSUPP is not exported to userland. ENOTSUP (single P) and EOPNOTSUPP are
>> valid errnos (and they are the same on Linux), but ENOTSUPP is
>> kernel-internal.
>>
>> Moreover, I completely disagree with your position. Both a kernel that
>> doesn't support CLOCK_REALTIME_ALARM and an ARM system that doesn't
>> support an RTC should use the same error, because applications need
>> the same fallback.
> 
> Ok. You're right. The technicality that the clockid is valid but
> unsupported isn't really useful to the applications, since the effect is
> the same.
> 
> What is the urgency on this? As the issue has been around since 3.0, is
> it ok if it gets queued for 3.13 and marked for stable, or does it need
> to land in 3.12?

3.13 is OK with me.






Re: [PATCH] alarmtimer: return EINVAL instead of ENOTSUPP if rtcdev doesn't exist

2013-10-17 Thread KOSAKI Motohiro

(10/17/13 1:05 PM), John Stultz wrote:

On 10/14/2013 02:33 PM, kosaki.motoh...@gmail.com wrote:

From: KOSAKI Motohiro 

Fedora Ruby maintainer reported latest Ruby doesn't work on Fedora Rawhide
on ARM. (http://bugs.ruby-lang.org/issues/9008)

This is because commit 1c6b39ad3f (alarmtimers: Return -ENOTSUPP if no
RTC device is present) introduced returning ENOTSUPP when
clock_get{time,res} can't find an RTC device. However, that is incorrect.

POSIX and the Linux man pages agree that clock_gettime and clock_getres
should return EINVAL if the clk_id argument is invalid. This is significantly
different from the timer_create API.

This patch fixes it.


Hrm... So I feel like there is a difference here. The clockid for
CLOCK_BOOTTIME_ALARM and CLOCK_REALTIME_ALARM are both valid.

Its just that they're not supported on this specific hardware because it
apparently lacks a RTC that has told the system it can be used as a
wakeup device (Its actually quite likely on the hardware that the RTC
can be a wakeup device, but that the driver is probably setting the
wakeup flag after the RTC registered - so there is probably a driver bug
here too).

So I feel like in this case EINVAL isn't quite right.  I'll admit it is
somewhat new behavior, because we haven't had any clockids before that
were dependent on the particular hardware, they either existed in a
kernel version or didn't.

Would updating the manpage be a better route?


Nope.

ENOTSUPP is not exported to userland. ENOTSUP (single P) and EOPNOTSUPP are
valid errnos (and they are the same on Linux), but ENOTSUPP is kernel-internal.

Moreover, I completely disagree with your position. Both a kernel that doesn't
support CLOCK_REALTIME_ALARM and an ARM system that doesn't support an RTC
should use the same error, because applications need the same fallback.
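
To make the fallback concrete, here is a minimal userspace sketch (standard
POSIX calls; the fallback policy itself is just an example):

    struct timespec ts;

    /* the same application code handles both an old kernel that lacks
     * the clockid entirely and new hardware without a wakeup-capable
     * RTC, because both return EINVAL */
    if (clock_gettime(CLOCK_BOOTTIME_ALARM, &ts) == -1 && errno == EINVAL)
            clock_gettime(CLOCK_BOOTTIME, &ts);     /* degrade gracefully */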





[PATCH] alarmtimer: return EINVAL instead of ENOTSUPP if rtcdev doesn't exist

2013-10-14 Thread kosaki . motohiro
From: KOSAKI Motohiro 

Fedora Ruby maintainer reported latest Ruby doesn't work on Fedora Rawhide
on ARM. (http://bugs.ruby-lang.org/issues/9008)

This is because commit 1c6b39ad3f (alarmtimers: Return -ENOTSUPP if no
RTC device is present) introduced returning ENOTSUPP when
clock_get{time,res} can't find an RTC device. However, that is incorrect.

POSIX and the Linux man pages agree that clock_gettime and clock_getres
should return EINVAL if the clk_id argument is invalid. This is significantly
different from the timer_create API.

This patch fixes it.

Reported-by: Vit Ondruch 
Cc: Thomas Gleixner 
Cc: Frederic Weisbecker 
Cc: John Stultz 
Signed-off-by: KOSAKI Motohiro 
---
 kernel/time/alarmtimer.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index eec50fc..88c9c65 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -490,7 +490,7 @@ static int alarm_clock_getres(const clockid_t which_clock, struct timespec *tp)
clockid_t baseid = alarm_bases[clock2alarm(which_clock)].base_clockid;
 
if (!alarmtimer_get_rtcdev())
-   return -ENOTSUPP;
+   return -EINVAL;
 
return hrtimer_get_res(baseid, tp);
 }
@@ -507,7 +507,7 @@ static int alarm_clock_get(clockid_t which_clock, struct timespec *tp)
struct alarm_base *base = &alarm_bases[clock2alarm(which_clock)];
 
if (!alarmtimer_get_rtcdev())
-   return -ENOTSUPP;
+   return -EINVAL;
 
*tp = ktime_to_timespec(base->gettime());
return 0;
-- 
1.7.1



Re: [PATCH 05/14] vrange: Add new vrange(2) system call

2013-10-07 Thread KOSAKI Motohiro

(10/7/13 11:07 PM), Minchan Kim wrote:

Hi KOSAKI,

On Mon, Oct 07, 2013 at 10:51:18PM -0400, KOSAKI Motohiro wrote:

Maybe, int madvise5(addr, length, MADV_DONTNEED|MADV_LAZY|MADV_SIGBUS,
 &purged, &ret);

Another reason this is hard is that madvise(2) is tightly coupled with
vma split/merge. It needs mmap_sem's write-side lock, which hurts
anon-vrange test performance heavily, and userland might want to
make volatile ranges with a small unit like "page size", so it's undesirable
to tie it to vmas. Then, should we filter out vma split/merge in the
implementation only for the MADV_LAZY case? Doable, but it could make the
code complicated and lose consistency with the other variants of madvise.


I haven't seen your performance test results. Could you please point me to the URLs?


https://lkml.org/lkml/2013/3/12/105


It's not a comparison with and without vma merging. I'm interested in how much
benefit avoiding the vma operations brings.



Re: [PATCH 05/14] vrange: Add new vrange(2) system call

2013-10-07 Thread KOSAKI Motohiro

Maybe, int madvise5(addr, length, MADV_DONTNEED|MADV_LAZY|MADV_SIGBUS,
 &purged, &ret);

Another reason this is hard is that madvise(2) is tightly coupled with
vma split/merge. It needs mmap_sem's write-side lock, which hurts
anon-vrange test performance heavily, and userland might want to
make volatile ranges with a small unit like "page size", so it's undesirable
to tie it to vmas. Then, should we filter out vma split/merge in the
implementation only for the MADV_LAZY case? Doable, but it could make the
code complicated and lose consistency with the other variants of madvise.


I haven't seen your performance test results. Could you please point me to the URLs?




Re: [PATCH 18/26] mm: Convert process_vm_rw_pages() to use get_user_pages_unlocked()

2013-10-07 Thread KOSAKI Motohiro

(10/7/13 4:55 PM), Jan Kara wrote:

On Thu 03-10-13 18:40:06, KOSAKI Motohiro wrote:

(10/2/13 3:36 PM), Jan Kara wrote:

On Wed 02-10-13 12:32:33, KOSAKI Motohiro wrote:

(10/2/13 10:27 AM), Jan Kara wrote:

Signed-off-by: Jan Kara 
---
   mm/process_vm_access.c | 8 ++--
   1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index fd26d0433509..c1bc47d8ed90 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -64,12 +64,8 @@ static int process_vm_rw_pages(struct task_struct *task,
*bytes_copied = 0;

/* Get the pages we're interested in */
-   down_read(&mm->mmap_sem);
-   pages_pinned = get_user_pages(task, mm, pa,
- nr_pages_to_copy,
- vm_write, 0, process_pages, NULL);
-   up_read(&mm->mmap_sem);
-
+   pages_pinned = get_user_pages_unlocked(task, mm, pa, nr_pages_to_copy,
+  vm_write, 0, process_pages);
if (pages_pinned != nr_pages_to_copy) {
rc = -EFAULT;
goto end;


This is wrong because the original code is wrong. In this function, pages may
point to anon pages, so you should keep holding mmap_sem until the copy
finishes. Otherwise a concurrent fork() causes a nasty COW issue.

   Hum, can you be more specific? I suppose you are speaking about situation
when the remote task we are copying data from/to does fork while
process_vm_rw_pages() runs. If we are copying data from remote task, I
don't see how COW could cause any problem. If we are copying to remote task
and fork happens after get_user_pages() but before copy_to_user() then I
can see we might be having some trouble. copy_to_user() would then copy
data into both original remote process and its child thus essentially
bypassing COW. If the child process manages to COW some of the pages before
copy_to_user() happens, it can even see only some of the pages. Is that what
you mean?


scenario 1: vm_write==0

Process P1 calls get_user_pages(pa, process_pages) in process_vm_rw_pages().
P1 unlocks mmap_sem.
Process P2 calls fork() and creates P3.
P2 writes to memory pa; now "process_pages" is owned by P3 (the child process).
P3 writes to memory pa, and the content of "process_pages" changes.
P1 reads process_pages as P2's pages, but actually it is P3's data. Thus
   P1 observes garbage, or at least unintended, data.

   Yeah, this really looks buggy because P1 can see data in (supposedly)
P2's address space which P2 never wrote there.


scenario 2: vm_write==1

Process P1 calls get_user_pages(pa, process_pages) in process_vm_rw_pages().
This breaks COW, so no anon page sharing remains.
P1 unlocks mmap_sem.
P2 calls fork() and creates P3; now COW page sharing is restored.
P2 writes to memory pa; now "process_pages" is owned by P3.
P3 writes to memory pa, i.e. P3 changes "process_pages".
P1 writes process_pages as P2's pages, but actually they are P3's. This
overrides P3's write above, and then P3 observes data corruption.

   Yep, this is a similar problem as above. Thanks for pointing this out.


The solution is to keep holding mmap_sem until the process_pages access
finishes, because mmap_sem prevents fork, and then the race can never
happen. I mean you can't use get_user_pages_unlocked() if the target
address points to anon pages.

   Yeah, if you are accessing third party mm,


One correction. The condition is:

 - a third-party mm, or
 - the current process has multiple threads and another thread does fork(),
   breaking COW


you've convinced me you
currently need mmap_sem to avoid problems with COW on anon pages. I'm just
thinking that this "hold mmap_sem to prevent fork" is somewhat subtle
(definitely would deserve a comment) and if it would be needed in more
places we might be better off if we have a more explicit mechanism for that
(like a special lock, fork sequence count, or something like that).


Hmm..

Actually, I tried this several years ago, but MM people disliked it because
they prefer a faster kernel over simple code. Any additional lock potentially
makes the kernel slower.

However, I fully agree the current mechanism is too complex and misleading,
or at least very hard to understand. So, any improvement idea is welcome.


Anyway
I'll have this in mind and if I see other places that need this, I'll try
to come up with some solution. Thanks again for explanation.


NP.

I tried to explain this at the MM summit, but my English is too poor and I
couldn't explain it well enough. Sorry about that.


Anyway, I'd like to fix process_vm_rw_pages() before my memory is flushed
again. Thank you for helping me remember this bug.



Re: [PATCH 18/26] mm: Convert process_vm_rw_pages() to use get_user_pages_unlocked()

2013-10-03 Thread KOSAKI Motohiro

(10/2/13 3:36 PM), Jan Kara wrote:

On Wed 02-10-13 12:32:33, KOSAKI Motohiro wrote:

(10/2/13 10:27 AM), Jan Kara wrote:

Signed-off-by: Jan Kara 
---
   mm/process_vm_access.c | 8 ++--
   1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index fd26d0433509..c1bc47d8ed90 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -64,12 +64,8 @@ static int process_vm_rw_pages(struct task_struct *task,
*bytes_copied = 0;

/* Get the pages we're interested in */
-   down_read(&mm->mmap_sem);
-   pages_pinned = get_user_pages(task, mm, pa,
- nr_pages_to_copy,
- vm_write, 0, process_pages, NULL);
-   up_read(&mm->mmap_sem);
-
+   pages_pinned = get_user_pages_unlocked(task, mm, pa, nr_pages_to_copy,
+  vm_write, 0, process_pages);
if (pages_pinned != nr_pages_to_copy) {
rc = -EFAULT;
goto end;


This is wrong because the original code is wrong. In this function, pages may
point to anon pages, so you should keep holding mmap_sem until the copy
finishes. Otherwise a concurrent fork() causes a nasty COW issue.

   Hum, can you be more specific? I suppose you are speaking about situation
when the remote task we are copying data from/to does fork while
process_vm_rw_pages() runs. If we are copying data from remote task, I
don't see how COW could cause any problem. If we are copying to remote task
and fork happens after get_user_pages() but before copy_to_user() then I
can see we might be having some trouble. copy_to_user() would then copy
data into both original remote process and its child thus essentially
bypassing COW. If the child process manages to COW some of the pages before
copy_to_user() happens, it can even see only some of the pages. Is that what
you mean?


scenario 1: vm_write==0

Process P1 calls get_user_pages(pa, process_pages) in process_vm_rw_pages().
P1 unlocks mmap_sem.
Process P2 calls fork() and creates P3.
P2 writes to memory pa; now "process_pages" is owned by P3 (the child process).
P3 writes to memory pa, and the content of "process_pages" changes.
P1 reads process_pages as P2's pages, but actually it is P3's data. Thus P1
observes garbage, or at least unintended, data.


scenario 2: vm_write==1

Process P1 calls get_user_pages(pa, process_pages) in process_vm_rw_pages().
This breaks COW, so no anon page sharing remains.
P1 unlocks mmap_sem.
P2 calls fork() and creates P3; now COW page sharing is restored.
P2 writes to memory pa; now "process_pages" is owned by P3.
P3 writes to memory pa, i.e. P3 changes "process_pages".
P1 writes process_pages as P2's pages, but actually they are P3's. This
overrides P3's write above, and then P3 observes data corruption.



The solution is to keep holding mmap_sem until the process_pages access
finishes, because mmap_sem prevents fork(), and then the race can never happen.
I mean you can't use get_user_pages_unlocked() if the target address points to
anon pages.

I'm not sure whether this story matches your explanation.
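In code, the fix I mean amounts to keeping the read lock across the whole
access (a minimal sketch, not an actual patch; copy_pages_locked() is a
stand-in name for the copy loop of process_vm_rw_pages()):

	down_read(&mm->mmap_sem);
	pages_pinned = get_user_pages(task, mm, pa, nr_pages_to_copy,
				      vm_write, 0, process_pages, NULL);
	if (pages_pinned == nr_pages_to_copy)
		rc = copy_pages_locked(process_pages, nr_pages_to_copy);
	/* fork() is excluded up to this point, so no COW re-sharing
	 * can happen between the pin and the copy */
	up_read(&mm->mmap_sem);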




Re: [PATCH 18/26] mm: Convert process_vm_rw_pages() to use get_user_pages_unlocked()

2013-10-02 Thread KOSAKI Motohiro
(10/2/13 10:27 AM), Jan Kara wrote:
> Signed-off-by: Jan Kara 
> ---
>   mm/process_vm_access.c | 8 ++--
>   1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index fd26d0433509..c1bc47d8ed90 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -64,12 +64,8 @@ static int process_vm_rw_pages(struct task_struct *task,
>   *bytes_copied = 0;
>   
>   /* Get the pages we're interested in */
> - down_read(&mm->mmap_sem);
> - pages_pinned = get_user_pages(task, mm, pa,
> -   nr_pages_to_copy,
> -   vm_write, 0, process_pages, NULL);
> - up_read(&mm->mmap_sem);
> -
> + pages_pinned = get_user_pages_unlocked(task, mm, pa, nr_pages_to_copy,
> +vm_write, 0, process_pages);
>   if (pages_pinned != nr_pages_to_copy) {
>   rc = -EFAULT;
>   goto end;

This is wrong because the original code is wrong. In this function, the pages
may point to anon pages, so you must keep holding mmap_sem until the copy
finishes. Otherwise a concurrent fork() creates a nasty COW issue.




Re: [PATCH 16/26] mm: Provide get_user_pages_unlocked()

2013-10-02 Thread KOSAKI Motohiro
(10/2/13 10:27 AM), Jan Kara wrote:
> Provide a wrapper for get_user_pages() which takes care of acquiring and
> releasing mmap_sem. Using this function reduces amount of places in
> which we deal with mmap_sem.
> 
> Signed-off-by: Jan Kara 
> ---
>   include/linux/mm.h | 14 ++
>   1 file changed, 14 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8b6e55ee8855..70031ead06a5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1031,6 +1031,20 @@ long get_user_pages(struct task_struct *tsk, struct 
> mm_struct *mm,
>   struct vm_area_struct **vmas);
>   int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>   struct page **pages);
> +static inline long
> +get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> + unsigned long start, unsigned long nr_pages,
> + int write, int force, struct page **pages)
> +{
> + long ret;
> +
> + down_read(&mm->mmap_sem);
> + ret = get_user_pages(tsk, mm, start, nr_pages, write, force, pages,
> +  NULL);
> + up_read(&mm->mmap_sem);
> + return ret;
> +}

Hmmm. I like the idea, but I really dislike this name. I don't like that an
xx_unlocked function takes a lock; it is a source of confusion, I believe.
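To illustrate from the caller's side (a sketch):

	/* the name suggests no lock is involved... */
	pinned = get_user_pages_unlocked(tsk, mm, addr, nr_pages,
					 write, 0, pages);
	/* ...but mmap_sem was taken and dropped inside the call, so any
	 * caller that needed the pages to stay stable under mmap_sem is
	 * silently racy (see the process_vm_rw_pages() discussion) */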









Re: [patch for-3.12] mm, memcg: protect mem_cgroup_read_events for cpu hotplug

2013-10-01 Thread KOSAKI Motohiro

(10/1/13 7:31 PM), David Rientjes wrote:

for_each_online_cpu() needs the protection of {get,put}_online_cpus() so
cpu_online_mask doesn't change during the iteration.

Signed-off-by: David Rientjes 


Acked-by: KOSAKI Motohiro 



Re: [PATCHv2 2/2] mm: add a field to store names for private anonymous memory

2013-10-01 Thread KOSAKI Motohiro
> +static void seq_print_vma_name(struct seq_file *m, struct vm_area_struct 
> *vma)
> +{
> + const char __user *name = vma_get_anon_name(vma);
> + struct mm_struct *mm = vma->vm_mm;
> +
> + unsigned long page_start_vaddr;
> + unsigned long page_offset;
> + unsigned long num_pages;
> + unsigned long max_len = NAME_MAX;
> + int i;
> +
> + page_start_vaddr = (unsigned long)name & PAGE_MASK;
> + page_offset = (unsigned long)name - page_start_vaddr;
> + num_pages = DIV_ROUND_UP(page_offset + max_len, PAGE_SIZE);
> +
> + seq_puts(m, "[anon:");
> +
> + for (i = 0; i < num_pages; i++) {
> + int len;
> + int write_len;
> + const char *kaddr;
> + long pages_pinned;
> + struct page *page;
> +
> + pages_pinned = get_user_pages(current, mm, page_start_vaddr,
> + 1, 0, 0, &page, NULL);
> + if (pages_pinned < 1) {
> + seq_puts(m, "]");
> + return;
> + }
> +
> + kaddr = (const char *)kmap(page);
> + len = min(max_len, PAGE_SIZE - page_offset);

You can't show the name if it ends up at a page boundary; the string may
continue on the next page.
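A sketch of the boundary handling I mean (an assumption about the intended
fix, not the submitted code): each iteration must consume at most the bytes
left in the current page, then continue from offset 0 on the next one.

		write_len = strnlen(kaddr + page_offset, len);
		seq_write(m, kaddr + page_offset, write_len);
		kunmap(page);
		put_page(page);

		/* a NUL inside this page means the name is complete */
		if (write_len < len)
			break;

		/* otherwise the name continues on the next page */
		max_len -= len;
		page_start_vaddr += PAGE_SIZE;
		page_offset = 0;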




Re: [RESEND PATCH v5 3/4] mm/vmalloc: revert "mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_info"

2013-09-16 Thread KOSAKI Motohiro
On Mon, Sep 16, 2013 at 8:18 PM, Wanpeng Li  wrote:
> Hi KOSAKI,
> On Mon, Sep 16, 2013 at 05:23:32PM -0400, KOSAKI Motohiro wrote:
>>On 9/14/2013 7:45 PM, Wanpeng Li wrote:
>>> Changelog:
>>>  *v2 -> v3: revert commit d157a558 directly
>>>
>>> The VM_UNINITIALIZED/VM_UNLIST flag introduced by commit f5252e00 (mm: avoid
>>> null pointer access in vm_struct via /proc/vmallocinfo) is used to avoid
>>> accessing the pages field while the pages are still unallocated when
>>> show_numa_info() is called. This patch moves the check to just before
>>> show_numa_info() so that some messages can still be dumped via
>>> /proc/vmallocinfo. This patch reverts commit d157a558 (mm/vmalloc.c: check
>>> VM_UNINITIALIZED flag in s_show instead of show_numa_info).
>>
>>Neither d157a558 nor your patch explains why yours is better. Yes, some
>>messages _can_ be dumped. But why should we do so?
>
> More messages can be dumped and original commit f5252e00(mm: avoid null 
> pointer
> access in vm_struct via /proc/vmallocinfo) do that.
>
>>And no, __get_vm_area_node() doesn't use __GFP_ZERO for allocating the
>>vm_struct, so a partial dump is not only partial, but may also be garbage.
>
> vm_struct is allocated by kzalloc_node.

Oops, you are right. Then your code _intentionally_ shows amazing zeros. Heh,
nice.
More messages are pointless: zero is just zero, it carries no information.


>>I wonder why we need to call setup_vmalloc_vm() _after_ insert_vmap_area.
>
> I think it's another topic.

Why?


> Fill the vm_struct and set the VM_VM_AREA flag. Or did I misunderstand your
> question?

VM_VM_AREA doesn't help; we have a race between insert_vmap_area() and
setup_vmalloc_vm().
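The reordering I am questioning would be fixed by something like this (sketch
only, using the function names as discussed in this thread; in the real code
the insert is buried inside alloc_vmap_area(), so it is not a two-line change):

	/* fill the vm_struct fields first... */
	setup_vmalloc_vm(area, va, flags, caller);
	/* ...and only then make the vmap_area visible to readers of
	 * vmap_area_list such as /proc/vmallocinfo */
	insert_vmap_area(va);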


Re: [RESEND PATCH v5 2/4] mm/vmalloc: revert "mm/vmalloc.c: emit the failure message before return"

2013-09-16 Thread KOSAKI Motohiro
On Mon, Sep 16, 2013 at 7:41 PM, Wanpeng Li  wrote:
> Hi KOSAKI,
> On Mon, Sep 16, 2013 at 04:15:29PM -0400, KOSAKI Motohiro wrote:
>>On 9/14/2013 7:45 PM, Wanpeng Li wrote:
>>> Changelog:
>>>  *v2 -> v3: revert commit 46c001a2 directly
>>>
>>> Don't warn twice in __vmalloc_area_node() and __vmalloc_node_range() when
>>> the __vmalloc_area_node() allocation fails. This patch reverts commit
>>> 46c001a2 (mm/vmalloc.c: emit the failure message before return).
>>>
>>> Reviewed-by: Zhang Yanfei 
>>> Signed-off-by: Wanpeng Li 
>>> ---
>>>  mm/vmalloc.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index d78d117..e3ec8b4 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -1635,7 +1635,7 @@ void *__vmalloc_node_range(unsigned long size, 
>>> unsigned long align,
>>>
>>>  addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
>>>  if (!addr)
>>> -goto fail;
>>> +return NULL;
>
> The goto fail was introduced by commit 46c001a2 (mm/vmalloc.c: emit the
> failure message before return); the commit author overlooked that there is
> already a warning in __vmalloc_area_node().
>
> http://marc.info/?l=linux-mm&m=137818671125209&w=2

But module_alloc() calls __vmalloc_node_range() directly, so your fix
introduces another regression.


>>This is not the right fix. We now have the following call stack:
>>
>> __vmalloc_node
>>   __vmalloc_node_range
>>     __vmalloc_node
>>
>>Even if we remove the warning from __vmalloc_node_range(), we can still see a
>>double warning because we call __vmalloc_node() recursively.
>
> Those are allocation failures of different sizes in your example, actually.

But when we cannot allocate a small amount of memory, we almost always can't
allocate a large amount either.

You need some refactoring to make the right fix.
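For reference, the recursion in the call stack quoted above comes from the
pages[] array allocation in __vmalloc_area_node() (quoting the logic of that
era's mm/vmalloc.c from memory, so treat it as a sketch):

	if (array_size > PAGE_SIZE) {
		/* the page-pointer array itself is too big for kmalloc:
		 * vmalloc it, re-entering __vmalloc_node() */
		pages = __vmalloc_node(array_size, 1,
				       nested_gfp | __GFP_HIGHMEM,
				       PAGE_KERNEL, node, area->caller);
	} else {
		pages = kmalloc_node(array_size, nested_gfp, node);
	}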


Re: [RESEND PATCH v5 3/4] mm/vmalloc: revert "mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_info"

2013-09-16 Thread KOSAKI Motohiro
On 9/14/2013 7:45 PM, Wanpeng Li wrote:
> Changelog:
>  *v2 -> v3: revert commit d157a558 directly
> 
> The VM_UNINITIALIZED/VM_UNLIST flag introduced by commit f5252e00 (mm: avoid
> null pointer access in vm_struct via /proc/vmallocinfo) is used to avoid
> accessing the pages field while the pages are still unallocated when
> show_numa_info() is called. This patch moves the check to just before
> show_numa_info() so that some messages can still be dumped via
> /proc/vmallocinfo. This patch reverts commit d157a558 (mm/vmalloc.c: check
> VM_UNINITIALIZED flag in s_show instead of show_numa_info).

Neither d157a558 nor your patch explains why yours is better. Yes, some
messages _can_ be dumped. But why should we do so?

And no, __get_vm_area_node() doesn't use __GFP_ZERO for allocating the
vm_struct, so a partial dump is not only partial, but may also be garbage.

I wonder why we need to call setup_vmalloc_vm() _after_ insert_vmap_area.


Re: [RESEND PATCH v5 2/4] mm/vmalloc: revert "mm/vmalloc.c: emit the failure message before return"

2013-09-16 Thread KOSAKI Motohiro
On 9/14/2013 7:45 PM, Wanpeng Li wrote:
> Changelog:
>  *v2 -> v3: revert commit 46c001a2 directly
> 
> Don't warn twice in __vmalloc_area_node() and __vmalloc_node_range() when
> the __vmalloc_area_node() allocation fails. This patch reverts commit
> 46c001a2 (mm/vmalloc.c: emit the failure message before return).
> 
> Reviewed-by: Zhang Yanfei 
> Signed-off-by: Wanpeng Li 
> ---
>  mm/vmalloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d78d117..e3ec8b4 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1635,7 +1635,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned 
> long align,
>  
>   addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
>   if (!addr)
> - goto fail;
> + return NULL;

This is not the right fix. We now have the following call stack:

 __vmalloc_node
   __vmalloc_node_range
     __vmalloc_node

Even if we remove the warning from __vmalloc_node_range(), we can still see a
double warning because we call __vmalloc_node() recursively.

I haven't caught your point on why warning twice is unacceptable, though.











Re: [RESEND PATCH v5 1/4] mm/vmalloc: don't set area->caller twice

2013-09-16 Thread KOSAKI Motohiro
On 9/14/2013 7:45 PM, Wanpeng Li wrote:
> Changelog:
>  *v1 -> v2: rebase against mmotm tree
> 
> The caller address has already been set in set_vmalloc_vm(), there's no need

setup_vmalloc_vm()

> to set it again in __vmalloc_area_node.
> 
> Reviewed-by: Zhang Yanfei 
> Signed-off-by: Wanpeng Li 
> ---
>  mm/vmalloc.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 1074543..d78d117 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1566,7 +1566,6 @@ static void *__vmalloc_area_node(struct vm_struct 
> *area, gfp_t gfp_mask,
>   pages = kmalloc_node(array_size, nested_gfp, node);
>   }
>   area->pages = pages;
> - area->caller = caller;
>   if (!area->pages) {
>   remove_vm_area(area->addr);
>   kfree(area);

Then __vmalloc_area_node() no longer needs the "caller" argument; it can use
area->caller instead.
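Concretely, the follow-up cleanup would be along these lines (sketch):

	-static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
	-				 pgprot_t prot, int node, const void *caller)
	+static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
	+				 pgprot_t prot, int node)

with the remaining uses of "caller" inside the function reading area->caller
instead.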




Re: [RESEND PATCH v5 4/4] mm/vmalloc: fix show vmap_area information race with vmap_area tear down

2013-09-16 Thread KOSAKI Motohiro
On 9/14/2013 7:45 PM, Wanpeng Li wrote:
> Changelog:
>  *v4 -> v5: return directly for !VM_VM_AREA case and remove (VM_LAZY_FREE | 
> VM_LAZY_FREEING) check 
> 
> There is a race window between vmap_area tear down and show vmap_area 
> information.
> 
>   A                                           B
> 
> remove_vm_area
>   spin_lock(&vmap_area_lock);
>   va->vm = NULL;
>   va->flags &= ~VM_VM_AREA;
>   spin_unlock(&vmap_area_lock);
>                                               spin_lock(&vmap_area_lock);
>                                               if (va->flags & (VM_LAZY_FREE |
>                                                                VM_LAZY_FREEING))
>                                                       return 0;
>                                               if (!(va->flags & VM_VM_AREA)) {
>                                                       seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
>                                                                  (void *)va->va_start,
>                                                                  (void *)va->va_end,
>                                                                  va->va_end - va->va_start);
>                                                       return 0;
>                                               }
> free_unmap_vmap_area(va);
>   flush_cache_vunmap
>   free_unmap_vmap_area_noflush
>     unmap_vmap_area
>     free_vmap_area_noflush
>       va->flags |= VM_LAZY_FREE
> 
> The assumption that !VM_VM_AREA represents a vm_map_ram allocation was
> introduced by commit d4033afd (mm, vmalloc: iterate vmap_area_list, instead
> of vmlist, in vmallocinfo()). However, !VM_VM_AREA also represents a
> vmap_area being torn down in the race window mentioned above. This patch
> fixes it by not dumping any information for the !VM_VM_AREA case, and also
> removes the (VM_LAZY_FREE | VM_LAZY_FREEING) check, since those flags can
> only appear in the !VM_VM_AREA case, which now returns early.
> 
> Suggested-by: Joonsoo Kim 
> Signed-off-by: Wanpeng Li 

Acked-by: KOSAKI Motohiro 
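For reference, the fix described in the changelog reduces to an early return
at the top of s_show() (a sketch of the resulting logic, not the patch body):

	static int s_show(struct seq_file *m, void *p)
	{
		struct vmap_area *va = p;

		/* !VM_VM_AREA now covers both vm_map_ram areas and areas
		 * being torn down in the window above; neither can be
		 * dumped safely, so print nothing */
		if (!(va->flags & VM_VM_AREA))
			return 0;

		/* ...existing dump of va->vm follows unchanged... */
		return 0;
	}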



Re: [PATCH] mm/mempolicy: use NUMA_NO_NODE

2013-09-16 Thread KOSAKI Motohiro

(9/16/13 8:53 AM), Jianguo Wu wrote:

Use the more appropriate NUMA_NO_NODE instead of -1.

Signed-off-by: Jianguo Wu 
---
  mm/mempolicy.c |   10 +-
  1 files changed, 5 insertions(+), 5 deletions(-)


I think this patch doesn't make any functional change, right?
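(For reference, the hunks are all of this shape — illustrative only, since the
patch body is not quoted here:

	-	if (nid == -1)
	+	if (nid == NUMA_NO_NODE)

NUMA_NO_NODE is defined as -1, so the generated code is identical.)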

Acked-by: KOSAKI Motohiro 




Re: RFD: Non-Disruptive Core Dump Infrastructure

2013-09-13 Thread KOSAKI Motohiro
On 9/12/2013 12:45 AM, Suzuki K. Poulose wrote:
> On 09/12/2013 12:57 AM, KOSAKI Motohiro wrote:
>> (9/3/13 4:39 AM), Janani Venkataraman wrote:
>>> Hello,
>>>
>>> We are working on an infrastructure to create a system core file of a
>>> specific
>>> process at run-time, non-disruptively. It can also be extended to a
>>> case where
>>> a process is able to take a self-core dump.
>>>
>>> gcore, an existing utility creates a core image of the specified
>>> process. It
>>> attaches to the process using gdb and runs the gdb gcore command and then
>>> detaches. In gcore the dump cannot be issued from a signal handler
>>> context as
>>> fork() is not signal safe and moreover it is disruptive in nature as
>>> the gdb
>>> attaches using ptrace which sends a SIGSTOP signal. Hence the gcore
>>> method
>>> cannot be used if the process wants to initiate a self dump.
>>
>> Maybe I'm missing something. But why does gcore use the C-level fork()? Does
>> gcore need to call the pthread atfork handlers? No. Does gcore need to flush
>> stdio buffers? No.
>>
> Let me clarify. If an application wants to dump itself, it has to do a
> fork() and then exec gcore with the pid of the application to
> generate the dump.

Oh, I had thought fork() was used to dump without stopping the application.
But that was incorrect.

Hmm. However, if an application _itself_ wants to dump itself, it can properly
avoid using a signal handler. I'm missing the point of this discussion
completely.

So, I'll keep silent for a while.

> 
> So, if the application wants to initiate the dump from a signal handler
> context, it may lead to trouble.




Re: [PATCH] vsprintf: drop comment claiming %n is ignored

2013-09-11 Thread KOSAKI Motohiro

Yeah, just found those too.
---
  fs/proc/consoles.c   |  2 +-
  fs/proc/nommu.c  | 20 ++--
  fs/proc/task_mmu.c   | 18 +-
  fs/proc/task_nommu.c | 20 ++--
  4 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/fs/proc/consoles.c b/fs/proc/consoles.c
index b701eaa..42f2bb7 100644
--- a/fs/proc/consoles.c
+++ b/fs/proc/consoles.c
@@ -47,7 +47,7 @@ static int show_console_dev(struct seq_file *m, void *v)
con_flags[a].name : ' ';
flags[a] = 0;

-   seq_printf(m, "%s%d%n", con->name, con->index, &len);
+   len = seq_printf(m, "%s%d", con->name, con->index);
len = 21 - len;
if (len < 1)
len = 1;
diff --git a/fs/proc/nommu.c b/fs/proc/nommu.c
index ccfd99b..91cfd19 100644
--- a/fs/proc/nommu.c
+++ b/fs/proc/nommu.c
@@ -50,16 +50,16 @@ static int nommu_region_show(struct seq_file *m, struct 
vm_region *region)
ino = inode->i_ino;
}

-   seq_printf(m,
-  "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
-  region->vm_start,
-  region->vm_end,
-  flags & VM_READ ? 'r' : '-',
-  flags & VM_WRITE ? 'w' : '-',
-  flags & VM_EXEC ? 'x' : '-',
-  flags & VM_MAYSHARE ? flags & VM_SHARED ? 'S' : 's' : 'p',
-  ((loff_t)region->vm_pgoff) << PAGE_SHIFT,
-  MAJOR(dev), MINOR(dev), ino, &len);
+   len = seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
+region->vm_start,
+region->vm_end,
+flags & VM_READ ? 'r' : '-',
+flags & VM_WRITE ? 'w' : '-',
+flags & VM_EXEC ? 'x' : '-',
+flags & VM_MAYSHARE ?
+flags & VM_SHARED ? 'S' : 's' : 'p',
+((loff_t)region->vm_pgoff) << PAGE_SHIFT,
+MAJOR(dev), MINOR(dev), ino);

if (file) {
len = 25 + sizeof(void *) * 6 - len;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 107d026..f84ee9f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -286,15 +286,15 @@ show_map_vma(struct seq_file *m, struct vm_area_struct 
*vma, int is_pid)
if (stack_guard_page_end(vma, end))
end -= PAGE_SIZE;

-   seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
-   start,
-   end,
-   flags & VM_READ ? 'r' : '-',
-   flags & VM_WRITE ? 'w' : '-',
-   flags & VM_EXEC ? 'x' : '-',
-   flags & VM_MAYSHARE ? 's' : 'p',
-   pgoff,
-   MAJOR(dev), MINOR(dev), ino, &len);
+   len = seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
+start,
+end,
+flags & VM_READ ? 'r' : '-',
+flags & VM_WRITE ? 'w' : '-',
+flags & VM_EXEC ? 'x' : '-',
+flags & VM_MAYSHARE ? 's' : 'p',
+pgoff,
+MAJOR(dev), MINOR(dev), ino);

/*
 * Print the dentry name for named mappings, and a
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 56123a6..1d7bbe5 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -155,16 +155,16 @@ static int nommu_vma_show(struct seq_file *m, struct 
vm_area_struct *vma,
pgoff = (loff_t)vma->vm_pgoff << PAGE_SHIFT;
}

-   seq_printf(m,
-  "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
-  vma->vm_start,
-  vma->vm_end,
-  flags & VM_READ ? 'r' : '-',
-  flags & VM_WRITE ? 'w' : '-',
-  flags & VM_EXEC ? 'x' : '-',
-  flags & VM_MAYSHARE ? flags & VM_SHARED ? 'S' : 's' : 'p',
-      pgoff,
-  MAJOR(dev), MINOR(dev), ino, &len);
+   len = seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu ",
+vma->vm_start,
+vma->

Re: RFD: Non-Disruptive Core Dump Infrastructure

2013-09-11 Thread KOSAKI Motohiro

(9/3/13 4:39 AM), Janani Venkataraman wrote:

Hello,

We are working on an infrastructure to create a system core file of a specific
process at run-time, non-disruptively. It can also be extended to a case where
a process is able to take a self-core dump.

gcore, an existing utility creates a core image of the specified process. It
attaches to the process using gdb and runs the gdb gcore command and then
detaches. In gcore the dump cannot be issued from a signal handler context as
fork() is not signal safe and moreover it is disruptive in nature as the gdb
attaches using ptrace which sends a SIGSTOP signal. Hence the gcore method
cannot be used if the process wants to initiate a self dump.


Maybe I'm missing something. But why does gcore use the C-level fork()? Does
gcore need to call the pthread atfork handlers? No. Does gcore need to flush
stdio buffers? No.
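The point being: a self-dumping helper could bypass the async-signal-unsafe
libc wrapper entirely with the raw system call, e.g. (a userspace sketch;
SYS_fork exists on x86 but not on every architecture, where a raw
clone(SIGCHLD, 0) is the equivalent):

	#include <sys/syscall.h>
	#include <unistd.h>

	/* raw fork: no pthread_atfork handlers run, no stdio flush */
	static pid_t raw_fork(void)
	{
		return (pid_t)syscall(SYS_fork);
	}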

