Re: [RFC v2] Support volatile range for anon vma

2012-12-10 Thread Minchan Kim
On Fri, Dec 07, 2012 at 04:49:56PM -0800, John Stultz wrote:
> On 12/04/2012 08:18 PM, Minchan Kim wrote:
> >On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> >>I don't think the problem is when vmas being marked VM_VOLATILE are
> >>being merged, its that when we mark the vma as *non-volatile*, and
> >>remove the VM_VOLATILE flag we merge the non-volatile vmas with
> >>neighboring vmas. So preserving the purged flag during that merge is
> >>important. Again, the example I used to trigger this was an
> >>alternating pattern of volatile and non volatile vmas, then marking
> >>the entire range non-volatile (though sometimes in two overlapping
> >>passes).
> >Understood. Thanks.
> >Below patch solves your problems? It's simple than yours.
> 
> Yea, this is nicer then my fix.
> Although I still need the purged handling in the vma merge code for
> me to see the behavior I expect in my tests.
> 
> I've integrated your patch and repushed my queue here:
> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
> 
> git://git.linaro.org/people/jstultz/android-dev.git dev/minchan-anonvol
> 
> >Anyway, both yours and mine are not right fix.
> >As I mentioned, locking scheme is broken.
> >We need anon_vma_lock to handle purged and we should consider fork
> >case, too.
> Hrm. I'm sure you're right, as I've not yet fully grasped all the
> locking rules here.  Could you clarify how it is broken? And why is
> the anon_vma_lock needed to manage the purged state that is part of
> the vma itself?

If you don't hold anon->lock, merge/split/fork can race with try_to_unmap
so vma->[purged|vm_flags] would lose consistency.

> 
> thanks
> -john
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-12-10 Thread Minchan Kim
On Fri, Dec 07, 2012 at 04:20:30PM -0800, John Stultz wrote:
> On 12/04/2012 11:01 PM, Minchan Kim wrote:
> >Hi John,
> >
> >On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> >>
> >>I don't think the problem is when vmas being marked VM_VOLATILE are
> >>being merged, its that when we mark the vma as *non-volatile*, and
> >>remove the VM_VOLATILE flag we merge the non-volatile vmas with
> >>neighboring vmas. So preserving the purged flag during that merge is
> >>important. Again, the example I used to trigger this was an
> >>alternating pattern of volatile and non volatile vmas, then marking
> >>the entire range non-volatile (though sometimes in two overlapping
> >>passes).
> >If I understand correctly, you mean following as.
> >
> >chunk1 = mmap(8M)
> >chunk2 = chunk1 + 2M;
> >chunk3 = chunk2 + 2M
> >chunk4 = chunk3 + 2M
> >
> >madvise(chunk1, 2M, VOLATILE);
> >madvise(chunk4, 2M, VOLATILE);
> >
> >/*
> >  * V : volatile vma
> >  * N : non volatile vma
> >  * So Now vma is VNVN.
> >  */
> >And chunk4 is purged.
> >
> >int ret = madvise(chunk1, 8M, NOVOLATILE);
> >ASSERT(ret == 1);
> >/* And you expect VNVN->N ?*/
> >
> >Right?
> 
> Yes. That's exactly right.
> 
> >If so, why should non-volatile function semantic allow it which cross over
> >non-volatile areas in a range? I would like to fail such case because
> >in case of MADV_REMOVE, it fails in the middle of operation if it encounter
> >VM_LOCKED.
> >
> >What do you think about it?
> Right, so I think this issue is maybe a problematic part of the VMA
> based approach.  While marking an area as nonvolatile twice might
> not make a lot of sense, I think userland applications would not
> appreciate the constraint that madvise(VOLATILE/NONVOLATILE) calls
> be made in perfect pairs of identical sizes.
> 
> For instance, if a browser has rendered a web page, but the page is
> so large that only a sliding window/view of that page is visible at
> one time, it may want to mark the regions not currently in the view
> as volatile.   So it would be nice (albeit naive) for that
> application that when the view location changed, it would just mark
> the new region as non-volatile, and any region not in the current
> view as volatile.  This would be easier then trying to calculate the
> diff of the old view region boundaries vs the new and modifying only
> the ranges that changed. Granted, doing so might be more efficient,
> but I'm not sure we can be sure every similar case would be more
> efficient.
> 
> So in my mind, double-clearing a flag should be allowed (as well as
> double-setting), as well as allowing for setting/clearing
> overlapping regions.

It might and as you said, it's not matched by normal madvise opearation.
So if user really want it, we might need another interface like new
system call like mlock.

Although we can implement it, what I has a concern is mmap_sem hold time.
For VMA approach, we need exclusive mmap_sem and it ends up preventing
concurrent page fault handling so it would mitigate anon volatile's goal
for user-space allocators. So I would like to avoid more works with
exclusive mmap_sem as far as possible.

Of course, you can argue that if we don't support such semantic,
user can call madvise(NOVOATILE) several time with several ranges
so it could be more bad. Right. But I suggest for plumbers to implement
range management smart and let's leave kernel implementation simple/fast.

I'm not solid. If user really want such semantic, I can support it with
new system call.

Frankly speaking, I would like to remove madvise(NOVOLATILE) call.
If you already saw my patch just I sent morning, you can guess what it is.
The problem of anon volatile with madvise(NOVOLATILE) is that time delay
between allocator allocats a free chunk and user really access the memory.
Normally, when allocator return free chunk to customer, allocator should
call madvise(NOVOLATILE) but user could access the memory long time after.
So during that time difference, that pages could be swap out.
So I decide to remove madvise(NOVOLATILE) and it's handled at first
page fault.

Yeb. The same rule couldn't applied to tmpfs volatile and it does needs
NOVOLATILE semantic. Hmm,, I am biasing to new system call.

int mvolatile(const void *addr, size_t len, int mode);
int munvolatile(const void *addr, size_t len;

If someone call mvolatile with AUTO mode, it would work as my anon volatile
while in MANUAL mode, user must call munvolatile before using.
It might meet your and mine goal. But adding new system call is last resort. :)

> 
> Aside from if the behavior should be allowed or not, the error mode
> of madvise is problematic as well, since failures can happen mid way
> through the operation, leaving the vmas in the range specified
> inconsistent. Since usually its only advisory, such inconsistent
> states aren't really problematic, and repeating the last action is
> probably fine.

True.

> 
> The problem with NOVOLATILE's  purged state, with vmas, is that if
> we hi

Re: [RFC v2] Support volatile range for anon vma

2012-12-07 Thread John Stultz

On 12/04/2012 08:18 PM, Minchan Kim wrote:

On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:

I don't think the problem is when vmas being marked VM_VOLATILE are
being merged, its that when we mark the vma as *non-volatile*, and
remove the VM_VOLATILE flag we merge the non-volatile vmas with
neighboring vmas. So preserving the purged flag during that merge is
important. Again, the example I used to trigger this was an
alternating pattern of volatile and non volatile vmas, then marking
the entire range non-volatile (though sometimes in two overlapping
passes).

Understood. Thanks.
Below patch solves your problems? It's simple than yours.


Yea, this is nicer then my fix.
Although I still need the purged handling in the vma merge code for me 
to see the behavior I expect in my tests.


I've integrated your patch and repushed my queue here:
http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol

git://git.linaro.org/people/jstultz/android-dev.git dev/minchan-anonvol


Anyway, both yours and mine are not right fix.
As I mentioned, locking scheme is broken.
We need anon_vma_lock to handle purged and we should consider fork
case, too.
Hrm. I'm sure you're right, as I've not yet fully grasped all the 
locking rules here.  Could you clarify how it is broken? And why is the 
anon_vma_lock needed to manage the purged state that is part of the vma 
itself?


thanks
-john

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-12-07 Thread John Stultz

On 12/04/2012 11:01 PM, Minchan Kim wrote:

Hi John,

On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:


I don't think the problem is when vmas being marked VM_VOLATILE are
being merged, its that when we mark the vma as *non-volatile*, and
remove the VM_VOLATILE flag we merge the non-volatile vmas with
neighboring vmas. So preserving the purged flag during that merge is
important. Again, the example I used to trigger this was an
alternating pattern of volatile and non volatile vmas, then marking
the entire range non-volatile (though sometimes in two overlapping
passes).

If I understand correctly, you mean following as.

chunk1 = mmap(8M)
chunk2 = chunk1 + 2M;
chunk3 = chunk2 + 2M
chunk4 = chunk3 + 2M

madvise(chunk1, 2M, VOLATILE);
madvise(chunk4, 2M, VOLATILE);

/*
  * V : volatile vma
  * N : non volatile vma
  * So Now vma is VNVN.
  */
And chunk4 is purged.

int ret = madvise(chunk1, 8M, NOVOLATILE);
ASSERT(ret == 1);
/* And you expect VNVN->N ?*/

Right?


Yes. That's exactly right.


If so, why should non-volatile function semantic allow it which cross over
non-volatile areas in a range? I would like to fail such case because
in case of MADV_REMOVE, it fails in the middle of operation if it encounter
VM_LOCKED.

What do you think about it?
Right, so I think this issue is maybe a problematic part of the VMA 
based approach.  While marking an area as nonvolatile twice might not 
make a lot of sense, I think userland applications would not appreciate 
the constraint that madvise(VOLATILE/NONVOLATILE) calls be made in 
perfect pairs of identical sizes.


For instance, if a browser has rendered a web page, but the page is so 
large that only a sliding window/view of that page is visible at one 
time, it may want to mark the regions not currently in the view as 
volatile.   So it would be nice (albeit naive) for that application that 
when the view location changed, it would just mark the new region as 
non-volatile, and any region not in the current view as volatile.  This 
would be easier then trying to calculate the diff of the old view region 
boundaries vs the new and modifying only the ranges that changed. 
Granted, doing so might be more efficient, but I'm not sure we can be 
sure every similar case would be more efficient.


So in my mind, double-clearing a flag should be allowed (as well as 
double-setting), as well as allowing for setting/clearing overlapping 
regions.


Aside from if the behavior should be allowed or not, the error mode of 
madvise is problematic as well, since failures can happen mid way 
through the operation, leaving the vmas in the range specified 
inconsistent. Since usually its only advisory, such inconsistent states 
aren't really problematic, and repeating the last action is probably fine.


The problem with NOVOLATILE's  purged state, with vmas, is that if we 
hit an error mid-way through, its hard to figure out what the state of 
the pages are for the range specified. Some of them could have been 
purged and set to non-volatile, while some may not be purged, and still 
left volatile. You can't just repeat the last action and get a sane 
result (as we lose the purged flag state).


With my earlier fallocate implementations, I tried to avoid this by 
making any memory allocations that might be required before making any 
state changes, so there wasn't a chance for a partial failure from 
-ENOMEM.  (It was also simpler because in my own range management code 
there were only volatile ranges,  non-volatility was simply the absence 
of a volatile range. With vmas we have to manage both volatile and 
nonvolatile vmas).  I'm not sure how this could be done with the vma 
method other then by maybe reworking the merge/split logic, but I'm wary 
of mucking with that too much as I know its performance sensitive.


Your thoughts?  Am I just being too set in my way of thinking here?

thanks
-john

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-12-04 Thread Minchan Kim
Hi John,

On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> On 12/03/2012 11:22 PM, Minchan Kim wrote:
> >On Mon, Dec 03, 2012 at 04:57:20PM -0800, John Stultz wrote:
> >>On 12/03/2012 04:00 PM, Minchan Kim wrote:
> >>>On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:
> On 11/21/2012 04:36 PM, John Stultz wrote:
> >2) Being able to use this with tmpfs files. I'm currently trying
> >to better understand the rmap code, looking to see if there's a
> >way to have try_to_unmap_file() work similarly to
> >try_to_unmap_anon(), to allow allow users to madvise() on mmapped
> >tmpfs files. This would provide a very similar interface as to
> >what I've been proposing with fadvise/fallocate, but just using
> >process virtual addresses instead of (fd, offset) pairs.   The
> >benefit with (fd,offset) pairs for Android is that its easier to
> >manage shared volatile ranges between two processes that are
> >sharing data via an mmapped tmpfs file (although this actual use
> >case may be fairly rare).  I believe we should still be able to
> >rework the ashmem internals to use madvise (which would provide
> >legacy support for existing android apps), so then its just a
> >question of if we could then eventually convince Android apps to
> >use the madvise interface directly, rather then the ashmem unpin
> >ioctl.
> Hey Minchan,
>  I've been playing around with your patch trying to better
> understand your approach and to extend it to support tmpfs files. In
> doing so I've found a few bugs, and have some rough fixes I wanted
> to share. There's still a few edge cases I need to deal with (the
> vma-purged flag isn't being properly handled through vma merge/split
> operations), but its starting to come along.
> >>>Hmm, my patch doesn't allow to merge volatile with another one by
> >>>inserting VM_VOLATILE into VM_SPECIAL so I guess merge isn't problem.
> >>>In case of split, __split_vma copy old vma to new vma like this
> >>>
> >>> *new = *vma;
> >>>
> >>>So the problem shouldn't happen, I guess.
> >>>Did you see the real problem about that?
> >>Yes, depending on the pattern that MADV_VOLATILE and MADV_NOVOLATILE
> >>is applied, we can get a result where data is purged, but we aren't
> >>notified of it.  Also, since madvise returns early if it encounters
> >>an error, in the case where you have checkerboard volatile regions
> >>(say every other page is volatile), which you mark non-volatile with
> >>one large MADV_NOVOLATILE call, the first volatile vma will be
> >>marked non-volatile, but since it returns purged, the madvise loop
> >>will stop and the following volatile regions will be left volatile.
> >>
> >>The patches in the git tree below which handle the perged state
> >>better seem to work for my tests, as far as resolving any
> >>overlapping calls. Of course there may yet still be problems I've
> >>not found.
> >>
> Anyway, take a look at the tree here and let me know what you think.
> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
> >>Eager to hear what you think!
> >Below two patches look good to me.
> >
> >[rmap: Simplify volatility checking by moving it out of try_to_unmap_one]
> >[rmap: ClearPageDirty() when returning SWAP_DISCARD]
> >
> >[madvise: Fix NOVOLATILE bug]
> >I can't understand description of the patch.
> >Could you elaborate it with example?
> The case I ran into here is if you have a range where you mark every
> other page as volatile. Then mark all the pages in that range as
> non-volatile in one madvise call.
> 
> sys_madvise() will then find the first vma in the range, and call
> madvise_vma(), which marks the first vma non-volatile and return the
> purged state.  If the page has been purged, sys_madvise code will
> note that as an error, and break out of the vma iteration loop,
> leaving the following vmas in the range volatile.
> 
> >[madvise: Fixup vma->purged handling]
> >I included VM_VOLATILE into VM_SPECIAL intentionally.
> >If comment of VM_SPECIAL is right, merge with volatile vmas shouldn't happen.
> >So I guess you see other problem. When I see my source code today, locking
> >scheme/purge handling is totally broken. I will look at it. Maybe you are 
> >seeing
> >bug related that. Part of patch is needed. It could be separate patch.
> >I will merge it.
> I don't think the problem is when vmas being marked VM_VOLATILE are
> being merged, its that when we mark the vma as *non-volatile*, and
> remove the VM_VOLATILE flag we merge the non-volatile vmas with
> neighboring vmas. So preserving the purged flag during that merge is
> important. Again, the example I used to trigger this was an
> alternating pattern of volatile and non volatile vmas, then marking
> the entire range non-volatile (though sometimes in two overlapping
> passes).

If I understand correctly, you mean following as.

chunk1 = m

Re: [RFC v2] Support volatile range for anon vma

2012-12-04 Thread Minchan Kim
On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> On 12/03/2012 11:22 PM, Minchan Kim wrote:
> >On Mon, Dec 03, 2012 at 04:57:20PM -0800, John Stultz wrote:
> >>On 12/03/2012 04:00 PM, Minchan Kim wrote:
> >>>On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:
> On 11/21/2012 04:36 PM, John Stultz wrote:
> >2) Being able to use this with tmpfs files. I'm currently trying
> >to better understand the rmap code, looking to see if there's a
> >way to have try_to_unmap_file() work similarly to
> >try_to_unmap_anon(), to allow allow users to madvise() on mmapped
> >tmpfs files. This would provide a very similar interface as to
> >what I've been proposing with fadvise/fallocate, but just using
> >process virtual addresses instead of (fd, offset) pairs.   The
> >benefit with (fd,offset) pairs for Android is that its easier to
> >manage shared volatile ranges between two processes that are
> >sharing data via an mmapped tmpfs file (although this actual use
> >case may be fairly rare).  I believe we should still be able to
> >rework the ashmem internals to use madvise (which would provide
> >legacy support for existing android apps), so then its just a
> >question of if we could then eventually convince Android apps to
> >use the madvise interface directly, rather then the ashmem unpin
> >ioctl.
> Hey Minchan,
>  I've been playing around with your patch trying to better
> understand your approach and to extend it to support tmpfs files. In
> doing so I've found a few bugs, and have some rough fixes I wanted
> to share. There's still a few edge cases I need to deal with (the
> vma-purged flag isn't being properly handled through vma merge/split
> operations), but its starting to come along.
> >>>Hmm, my patch doesn't allow to merge volatile with another one by
> >>>inserting VM_VOLATILE into VM_SPECIAL so I guess merge isn't problem.
> >>>In case of split, __split_vma copy old vma to new vma like this
> >>>
> >>> *new = *vma;
> >>>
> >>>So the problem shouldn't happen, I guess.
> >>>Did you see the real problem about that?
> >>Yes, depending on the pattern that MADV_VOLATILE and MADV_NOVOLATILE
> >>is applied, we can get a result where data is purged, but we aren't
> >>notified of it.  Also, since madvise returns early if it encounters
> >>an error, in the case where you have checkerboard volatile regions
> >>(say every other page is volatile), which you mark non-volatile with
> >>one large MADV_NOVOLATILE call, the first volatile vma will be
> >>marked non-volatile, but since it returns purged, the madvise loop
> >>will stop and the following volatile regions will be left volatile.
> >>
> >>The patches in the git tree below which handle the perged state
> >>better seem to work for my tests, as far as resolving any
> >>overlapping calls. Of course there may yet still be problems I've
> >>not found.
> >>
> Anyway, take a look at the tree here and let me know what you think.
> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
> >>Eager to hear what you think!
> >Below two patches look good to me.
> >
> >[rmap: Simplify volatility checking by moving it out of try_to_unmap_one]
> >[rmap: ClearPageDirty() when returning SWAP_DISCARD]
> >
> >[madvise: Fix NOVOLATILE bug]
> >I can't understand description of the patch.
> >Could you elaborate it with example?
> The case I ran into here is if you have a range where you mark every
> other page as volatile. Then mark all the pages in that range as
> non-volatile in one madvise call.
> 
> sys_madvise() will then find the first vma in the range, and call
> madvise_vma(), which marks the first vma non-volatile and return the
> purged state.  If the page has been purged, sys_madvise code will
> note that as an error, and break out of the vma iteration loop,
> leaving the following vmas in the range volatile.
> 
> >[madvise: Fixup vma->purged handling]
> >I included VM_VOLATILE into VM_SPECIAL intentionally.
> >If comment of VM_SPECIAL is right, merge with volatile vmas shouldn't happen.
> >So I guess you see other problem. When I see my source code today, locking
> >scheme/purge handling is totally broken. I will look at it. Maybe you are 
> >seeing
> >bug related that. Part of patch is needed. It could be separate patch.
> >I will merge it.
> I don't think the problem is when vmas being marked VM_VOLATILE are
> being merged, its that when we mark the vma as *non-volatile*, and
> remove the VM_VOLATILE flag we merge the non-volatile vmas with
> neighboring vmas. So preserving the purged flag during that merge is
> important. Again, the example I used to trigger this was an
> alternating pattern of volatile and non volatile vmas, then marking
> the entire range non-volatile (though sometimes in two overlapping
> passes).

Understood. Thanks.
Below patch solves your problems? It's simple than 

Re: [RFC v2] Support volatile range for anon vma

2012-12-04 Thread John Stultz

On 12/03/2012 11:22 PM, Minchan Kim wrote:

On Mon, Dec 03, 2012 at 04:57:20PM -0800, John Stultz wrote:

On 12/03/2012 04:00 PM, Minchan Kim wrote:

On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:

On 11/21/2012 04:36 PM, John Stultz wrote:

2) Being able to use this with tmpfs files. I'm currently trying
to better understand the rmap code, looking to see if there's a
way to have try_to_unmap_file() work similarly to
try_to_unmap_anon(), to allow allow users to madvise() on mmapped
tmpfs files. This would provide a very similar interface as to
what I've been proposing with fadvise/fallocate, but just using
process virtual addresses instead of (fd, offset) pairs.   The
benefit with (fd,offset) pairs for Android is that its easier to
manage shared volatile ranges between two processes that are
sharing data via an mmapped tmpfs file (although this actual use
case may be fairly rare).  I believe we should still be able to
rework the ashmem internals to use madvise (which would provide
legacy support for existing android apps), so then its just a
question of if we could then eventually convince Android apps to
use the madvise interface directly, rather then the ashmem unpin
ioctl.

Hey Minchan,
 I've been playing around with your patch trying to better
understand your approach and to extend it to support tmpfs files. In
doing so I've found a few bugs, and have some rough fixes I wanted
to share. There's still a few edge cases I need to deal with (the
vma-purged flag isn't being properly handled through vma merge/split
operations), but its starting to come along.

Hmm, my patch doesn't allow to merge volatile with another one by
inserting VM_VOLATILE into VM_SPECIAL so I guess merge isn't problem.
In case of split, __split_vma copy old vma to new vma like this

 *new = *vma;

So the problem shouldn't happen, I guess.
Did you see the real problem about that?

Yes, depending on the pattern that MADV_VOLATILE and MADV_NOVOLATILE
is applied, we can get a result where data is purged, but we aren't
notified of it.  Also, since madvise returns early if it encounters
an error, in the case where you have checkerboard volatile regions
(say every other page is volatile), which you mark non-volatile with
one large MADV_NOVOLATILE call, the first volatile vma will be
marked non-volatile, but since it returns purged, the madvise loop
will stop and the following volatile regions will be left volatile.

The patches in the git tree below which handle the perged state
better seem to work for my tests, as far as resolving any
overlapping calls. Of course there may yet still be problems I've
not found.


Anyway, take a look at the tree here and let me know what you think.
http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol

Eager to hear what you think!

Below two patches look good to me.

[rmap: Simplify volatility checking by moving it out of try_to_unmap_one]
[rmap: ClearPageDirty() when returning SWAP_DISCARD]

[madvise: Fix NOVOLATILE bug]
I can't understand description of the patch.
Could you elaborate it with example?
The case I ran into here is if you have a range where you mark every 
other page as volatile. Then mark all the pages in that range as 
non-volatile in one madvise call.


sys_madvise() will then find the first vma in the range, and call 
madvise_vma(), which marks the first vma non-volatile and return the 
purged state.  If the page has been purged, sys_madvise code will note 
that as an error, and break out of the vma iteration loop, leaving the 
following vmas in the range volatile.



[madvise: Fixup vma->purged handling]
I included VM_VOLATILE into VM_SPECIAL intentionally.
If comment of VM_SPECIAL is right, merge with volatile vmas shouldn't happen.
So I guess you see other problem. When I see my source code today, locking
scheme/purge handling is totally broken. I will look at it. Maybe you are seeing
bug related that. Part of patch is needed. It could be separate patch.
I will merge it.
I don't think the problem is when vmas being marked VM_VOLATILE are 
being merged, its that when we mark the vma as *non-volatile*, and 
remove the VM_VOLATILE flag we merge the non-volatile vmas with 
neighboring vmas. So preserving the purged flag during that merge is 
important. Again, the example I used to trigger this was an alternating 
pattern of volatile and non volatile vmas, then marking the entire range 
non-volatile (though sometimes in two overlapping passes).


thanks
-john

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-12-03 Thread Minchan Kim
On Mon, Dec 03, 2012 at 04:57:20PM -0800, John Stultz wrote:
> On 12/03/2012 04:00 PM, Minchan Kim wrote:
> >On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:
> >>On 11/21/2012 04:36 PM, John Stultz wrote:
> >>>2) Being able to use this with tmpfs files. I'm currently trying
> >>>to better understand the rmap code, looking to see if there's a
> >>>way to have try_to_unmap_file() work similarly to
> >>>try_to_unmap_anon(), to allow allow users to madvise() on mmapped
> >>>tmpfs files. This would provide a very similar interface as to
> >>>what I've been proposing with fadvise/fallocate, but just using
> >>>process virtual addresses instead of (fd, offset) pairs.   The
> >>>benefit with (fd,offset) pairs for Android is that its easier to
> >>>manage shared volatile ranges between two processes that are
> >>>sharing data via an mmapped tmpfs file (although this actual use
> >>>case may be fairly rare).  I believe we should still be able to
> >>>rework the ashmem internals to use madvise (which would provide
> >>>legacy support for existing android apps), so then its just a
> >>>question of if we could then eventually convince Android apps to
> >>>use the madvise interface directly, rather then the ashmem unpin
> >>>ioctl.
> >>Hey Minchan,
> >> I've been playing around with your patch trying to better
> >>understand your approach and to extend it to support tmpfs files. In
> >>doing so I've found a few bugs, and have some rough fixes I wanted
> >>to share. There's still a few edge cases I need to deal with (the
> >>vma-purged flag isn't being properly handled through vma merge/split
> >>operations), but its starting to come along.
> >Hmm, my patch doesn't allow to merge volatile with another one by
> >inserting VM_VOLATILE into VM_SPECIAL so I guess merge isn't problem.
> >In case of split, __split_vma copy old vma to new vma like this
> >
> > *new = *vma;
> >
> >So the problem shouldn't happen, I guess.
> >Did you see the real problem about that?
> Yes, depending on the pattern that MADV_VOLATILE and MADV_NOVOLATILE
> is applied, we can get a result where data is purged, but we aren't
> notified of it.  Also, since madvise returns early if it encounters
> an error, in the case where you have checkerboard volatile regions
> (say every other page is volatile), which you mark non-volatile with
> one large MADV_NOVOLATILE call, the first volatile vma will be
> marked non-volatile, but since it returns purged, the madvise loop
> will stop and the following volatile regions will be left volatile.
> 
> The patches in the git tree below which handle the perged state
> better seem to work for my tests, as far as resolving any
> overlapping calls. Of course there may yet still be problems I've
> not found.
> 
> >>Anyway, take a look at the tree here and let me know what you think.
> >>http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
> 
> Eager to hear what you think!

Below two patches look good to me.

[rmap: Simplify volatility checking by moving it out of try_to_unmap_one]
[rmap: ClearPageDirty() when returning SWAP_DISCARD]

[madvise: Fix NOVOLATILE bug]
I can't understand description of the patch.
Could you elaborate it with example?

[madvise: Fixup vma->purged handling]
I included VM_VOLATILE into VM_SPECIAL intentionally.
If comment of VM_SPECIAL is right, merge with volatile vmas shouldn't happen.
So I guess you see other problem. When I see my source code today, locking
scheme/purge handling is totally broken. I will look at it. Maybe you are seeing
bug related that. Part of patch is needed. It could be separate patch.
I will merge it.

ff --git a/mm/madvise.c b/mm/madvise.c
index 65af0b5..5469d76 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -139,8 +139,10 @@ success:
if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
volatile_lock(vma);
vma->vm_flags = new_flags;
-   if (behavior == MADV_NOVOLATILE)
+   if (behavior == MADV_NOVOLATILE) {
error = vma->purged;
+   vma->purged = 0;
+   }
if (behavior == MADV_NOVOLATILE || behavior == MADV_VOLATILE)
volatile_unlock(vma);
 out:

First of all, after I resolve above issue, let's talk about tmpfs volatile.
Thanks for the fix, John!

> 
> Thanks again!
> -john
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-12-03 Thread John Stultz

On 12/03/2012 04:00 PM, Minchan Kim wrote:

On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:

On 11/21/2012 04:36 PM, John Stultz wrote:

2) Being able to use this with tmpfs files. I'm currently trying
to better understand the rmap code, looking to see if there's a
way to have try_to_unmap_file() work similarly to
try_to_unmap_anon(), to allow allow users to madvise() on mmapped
tmpfs files. This would provide a very similar interface as to
what I've been proposing with fadvise/fallocate, but just using
process virtual addresses instead of (fd, offset) pairs.   The
benefit with (fd,offset) pairs for Android is that its easier to
manage shared volatile ranges between two processes that are
sharing data via an mmapped tmpfs file (although this actual use
case may be fairly rare).  I believe we should still be able to
rework the ashmem internals to use madvise (which would provide
legacy support for existing android apps), so then its just a
question of if we could then eventually convince Android apps to
use the madvise interface directly, rather then the ashmem unpin
ioctl.

Hey Minchan,
 I've been playing around with your patch trying to better
understand your approach and to extend it to support tmpfs files. In
doing so I've found a few bugs, and have some rough fixes I wanted
to share. There's still a few edge cases I need to deal with (the
vma-purged flag isn't being properly handled through vma merge/split
operations), but its starting to come along.

Hmm, my patch doesn't allow to merge volatile with another one by
inserting VM_VOLATILE into VM_SPECIAL so I guess merge isn't problem.
In case of split, __split_vma copy old vma to new vma like this

 *new = *vma;

So the problem shouldn't happen, I guess.
Did you see the real problem about that?
Yes, depending on the pattern that MADV_VOLATILE and MADV_NOVOLATILE is 
applied, we can get a result where data is purged, but we aren't 
notified of it.  Also, since madvise returns early if it encounters an 
error, in the case where you have checkerboard volatile regions (say 
every other page is volatile), which you mark non-volatile with one 
large MADV_NOVOLATILE call, the first volatile vma will be marked 
non-volatile, but since it returns purged, the madvise loop will stop 
and the following volatile regions will be left volatile.


The patches in the git tree below which handle the perged state better 
seem to work for my tests, as far as resolving any overlapping calls. Of 
course there may yet still be problems I've not found.



Anyway, take a look at the tree here and let me know what you think.
http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol


Eager to hear what you think!

Thanks again!
-john

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-12-03 Thread Minchan Kim
On Wed, Nov 28, 2012 at 08:18:01PM -0800, John Stultz wrote:
> On 11/21/2012 04:36 PM, John Stultz wrote:
> >2) Being able to use this with tmpfs files. I'm currently trying
> >to better understand the rmap code, looking to see if there's a
> >way to have try_to_unmap_file() work similarly to
> >try_to_unmap_anon(), to allow allow users to madvise() on mmapped
> >tmpfs files. This would provide a very similar interface as to
> >what I've been proposing with fadvise/fallocate, but just using
> >process virtual addresses instead of (fd, offset) pairs.   The
> >benefit with (fd,offset) pairs for Android is that its easier to
> >manage shared volatile ranges between two processes that are
> >sharing data via an mmapped tmpfs file (although this actual use
> >case may be fairly rare).  I believe we should still be able to
> >rework the ashmem internals to use madvise (which would provide
> >legacy support for existing android apps), so then its just a
> >question of if we could then eventually convince Android apps to
> >use the madvise interface directly, rather then the ashmem unpin
> >ioctl.
> 
> Hey Minchan,
> I've been playing around with your patch trying to better
> understand your approach and to extend it to support tmpfs files. In
> doing so I've found a few bugs, and have some rough fixes I wanted
> to share. There's still a few edge cases I need to deal with (the
> vma-purged flag isn't being properly handled through vma merge/split
> operations), but its starting to come along.

Hmm, my patch doesn't allow to merge volatile with another one by
inserting VM_VOLATILE into VM_SPECIAL so I guess merge isn't problem.
In case of split, __split_vma copy old vma to new vma like this

*new = *vma;

So the problem shouldn't happen, I guess.
Did you see the real problem about that?

> 
> Anyway, take a look at the tree here and let me know what you think.
> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
> 
> I'm sure much is wrong with the tree, but with it I can now mark
> tmpfs file pages as volatile/nonvolatile and see them purged under
> pressure. Unfortunately its not limited to tmpfs, so persistent
> files will also work, but the state of the underlying files on purge
> is undefined. Hopefully I can find a way to limit it to
> non-persistent filesystems for now, and if needed find a way to
> extend it to persistent filesystems in a sane way later.

I will take a look.
Thanks.

> 
> thanks
> -john
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-12-03 Thread Minchan Kim
Hi John,

Sorry for the long delay.
At last, I get a chance to look at this, again.

On Wed, Nov 21, 2012 at 04:36:42PM -0800, John Stultz wrote:
> On 10/29/2012 06:29 PM, Minchan Kim wrote:
> >This patch introudces new madvise behavior MADV_VOLATILE and
> >MADV_NOVOLATILE for anonymous pages. It's different with
> >John Stultz's version which considers only tmpfs while this patch
> >considers only anonymous pages so this cannot cover John's one.
> >If below idea is proved as reasonable, I hope we can unify both
> >concepts by madvise/fadvise.
> >
> >Rationale is following as.
> >Many allocators call munmap(2) when user call free(3) if ptr is
> >in mmaped area. But munmap isn't cheap because it have to clean up
> >all pte entries and unlinking a vma so overhead would be increased
> >linearly by mmaped area's size.
> >
> >Volatile conecept of Robert Love could be very useful for reducing
> >free(3) overhead. Allocators can do madvise(MADV_VOLATILE) instead of
> >munmap(2)(Of course, they need to manage volatile mmaped area to
> >reduce shortage of address space and sometime ends up unmaping them).
> >The madvise(MADV_VOLATILE|NOVOLATILE) is very cheap opeartion because
> >
> >1) it just marks the flag in VMA and
> >2) if memory pressure happens, VM can discard pages of volatile VMA
> >instead of swapping out when volatile pages is selected as victim
> >by normal VM aging policy.
> >3) freed mmaped area doesn't include any meaningful data so there
> >is no point to swap them out.
> >
> >Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> >allocating that area to user. Otherwise, accessing of volatile range
> >will meet SIGBUS error.
> >
> >The downside is that we have to age anon lru list although we don't
> >have swap because I don't want to discard volatile pages by top priority
> >when memory pressure happens as volatile in this patch means "We don't
> >need to swap out because user can handle the situation which data are
> >disappear suddenly", NOT "They are useless so hurry up to reclaim them".
> >So I want to apply same aging rule of nomal pages to them.
> >
> >Anon background aging of non-swap system would be a trade-off for
> >getting good feature. Even, we had done it two years ago until merge
> >[1] and I believe free(3) performance gain will beat loss of anon lru
> >aging's overead once all of allocator start to use madvise.
> >(This patch doesn't include background aging in case of non-swap system
> >  but it's trivial if we decide)
> 
> Hey Minchan!
> So I've been looking at your patch for a bit, and I'm still
> trying to fully grok it and the rmap code. Overall this approach
> looks pretty interesting,  and while your patch description focused
> on malloc/free behavior, I suspect your patch would satisfy what the
> mozilla folks are looking for, and while its not quite sufficient
> yet for Android, the interface semantics are very close to what I've
> been wanting (my test cases were easily mapped over).
> 
> The two major issues for me are:
> 1) As you noted, this approach currently doesn't work on non-swap
> systems, as we don't try to shrink the anonymous page lrus. This is
> a big problem, as it makes it unusable for most all Android systems.
> You suggest we may want to change aging the anonymous lru, and I had
> a patch earlier that tried to change some of the anonymous lru aging
> rules for volatile pages, but its not quite right for what you have
> here. So I'd be interested in hearing how you think the anonymous
> lru aging should happen with swapoff.

As I mentioned, anon LRU aging should happen for getting the benefit
of VOLATILE. I expect user-space allocator(ex, glibc) or VM heap management
will start it so the gain would be higher than lose.
Otherwise, we should choose and move volatile pages into anonther LRU
when madvise calls. It would be a big overhead as you already have measured.
So the design is same with rmap which move the overhead from frequent place
(ex, fork, pgfault) to to rare plcae(ie, reclaim)
I will accept if there is another good idea which is able to minimise
madvise's overhead.

> 
> 2) Being able to use this with tmpfs files. I'm currently trying to
> better understand the rmap code, looking to see if there's a way to
> have try_to_unmap_file() work similarly to try_to_unmap_anon(), to
> allow allow users to madvise() on mmapped tmpfs files. This would
> provide a very similar interface as to what I've been proposing with
> fadvise/fallocate, but just using process virtual addresses instead
> of (fd, offset) pairs.   The benefit with (fd,offset) pairs for
> Android is that its easier to manage shared volatile ranges between
> two processes that are sharing data via an mmapped tmpfs file
> (although this actual use case may be fairly rare).  I believe we
> should still be able to rework the ashmem internals to use madvise
> (which would provide legacy support for existing android apps), so
> then its just a question of if we coul

Re: [RFC v2] Support volatile range for anon vma

2012-11-28 Thread John Stultz

On 11/21/2012 04:36 PM, John Stultz wrote:
2) Being able to use this with tmpfs files. I'm currently trying to 
better understand the rmap code, looking to see if there's a way to 
have try_to_unmap_file() work similarly to try_to_unmap_anon(), to 
allow allow users to madvise() on mmapped tmpfs files. This would 
provide a very similar interface as to what I've been proposing with 
fadvise/fallocate, but just using process virtual addresses instead of 
(fd, offset) pairs.   The benefit with (fd,offset) pairs for Android 
is that its easier to manage shared volatile ranges between two 
processes that are sharing data via an mmapped tmpfs file (although 
this actual use case may be fairly rare).  I believe we should still 
be able to rework the ashmem internals to use madvise (which would 
provide legacy support for existing android apps), so then its just a 
question of if we could then eventually convince Android apps to use 
the madvise interface directly, rather then the ashmem unpin ioctl.


Hey Minchan,
I've been playing around with your patch trying to better 
understand your approach and to extend it to support tmpfs files. In 
doing so I've found a few bugs, and have some rough fixes I wanted to 
share. There's still a few edge cases I need to deal with (the 
vma-purged flag isn't being properly handled through vma merge/split 
operations), but its starting to come along.


Anyway, take a look at the tree here and let me know what you think.
http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol

I'm sure much is wrong with the tree, but with it I can now mark tmpfs 
file pages as volatile/nonvolatile and see them purged under pressure. 
Unfortunately its not limited to tmpfs, so persistent files will also 
work, but the state of the underlying files on purge is undefined. 
Hopefully I can find a way to limit it to non-persistent filesystems for 
now, and if needed find a way to extend it to persistent filesystems in 
a sane way later.


thanks
-john

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-11-22 Thread John Stultz

On 10/29/2012 06:29 PM, Minchan Kim wrote:

This patch introudces new madvise behavior MADV_VOLATILE and
MADV_NOVOLATILE for anonymous pages. It's different with
John Stultz's version which considers only tmpfs while this patch
considers only anonymous pages so this cannot cover John's one.
If below idea is proved as reasonable, I hope we can unify both
concepts by madvise/fadvise.

Rationale is following as.
Many allocators call munmap(2) when user call free(3) if ptr is
in mmaped area. But munmap isn't cheap because it have to clean up
all pte entries and unlinking a vma so overhead would be increased
linearly by mmaped area's size.

Volatile conecept of Robert Love could be very useful for reducing
free(3) overhead. Allocators can do madvise(MADV_VOLATILE) instead of
munmap(2)(Of course, they need to manage volatile mmaped area to
reduce shortage of address space and sometime ends up unmaping them).
The madvise(MADV_VOLATILE|NOVOLATILE) is very cheap opeartion because

1) it just marks the flag in VMA and
2) if memory pressure happens, VM can discard pages of volatile VMA
instead of swapping out when volatile pages is selected as victim
by normal VM aging policy.
3) freed mmaped area doesn't include any meaningful data so there
is no point to swap them out.

Allocator should call madvise(MADV_NOVOLATILE) before reusing for
allocating that area to user. Otherwise, accessing of volatile range
will meet SIGBUS error.

The downside is that we have to age anon lru list although we don't
have swap because I don't want to discard volatile pages by top priority
when memory pressure happens as volatile in this patch means "We don't
need to swap out because user can handle the situation which data are
disappear suddenly", NOT "They are useless so hurry up to reclaim them".
So I want to apply same aging rule of nomal pages to them.

Anon background aging of non-swap system would be a trade-off for
getting good feature. Even, we had done it two years ago until merge
[1] and I believe free(3) performance gain will beat loss of anon lru
aging's overead once all of allocator start to use madvise.
(This patch doesn't include background aging in case of non-swap system
  but it's trivial if we decide)


Hey Minchan!
So I've been looking at your patch for a bit, and I'm still trying 
to fully grok it and the rmap code. Overall this approach looks pretty 
interesting,  and while your patch description focused on malloc/free 
behavior, I suspect your patch would satisfy what the mozilla folks are 
looking for, and while its not quite sufficient yet for Android, the 
interface semantics are very close to what I've been wanting (my test 
cases were easily mapped over).


The two major issues for me are:
1) As you noted, this approach currently doesn't work on non-swap 
systems, as we don't try to shrink the anonymous page lrus. This is a 
big problem, as it makes it unusable for most all Android systems. You 
suggest we may want to change aging the anonymous lru, and I had a patch 
earlier that tried to change some of the anonymous lru aging rules for 
volatile pages, but its not quite right for what you have here. So I'd 
be interested in hearing how you think the anonymous lru aging should 
happen with swapoff.


2) Being able to use this with tmpfs files. I'm currently trying to 
better understand the rmap code, looking to see if there's a way to have 
try_to_unmap_file() work similarly to try_to_unmap_anon(), to allow 
allow users to madvise() on mmapped tmpfs files. This would provide a 
very similar interface as to what I've been proposing with 
fadvise/fallocate, but just using process virtual addresses instead of 
(fd, offset) pairs.   The benefit with (fd,offset) pairs for Android is 
that its easier to manage shared volatile ranges between two processes 
that are sharing data via an mmapped tmpfs file (although this actual 
use case may be fairly rare).  I believe we should still be able to 
rework the ashmem internals to use madvise (which would provide legacy 
support for existing android apps), so then its just a question of if we 
could then eventually convince Android apps to use the madvise interface 
directly, rather then the ashmem unpin ioctl.


The other concern with the madvise on mmapped files approach is that 
there's no easy way I can see to limit it to tmpfs files. I know some 
have been interested in having fallocate(VOLATILE) interface for 
non-tmpfs files, but I'm not sure I see the benefit there yet. I have 
noted folks mixing the idea of volatile pages being purged under memory 
pressure with the idea of volatile files, which might be purged from 
disk under disk pressure. While I think the second idea is interesting, 
I do think its completely separate from the volatile memory concept.


Anyway, I'd be interested in your thoughts on these two issues. Thanks 
so much for sending out this patch, its given me quite a bit to chew on, 
and I too hope we can merge our different appro

Re: [RFC v2] Support volatile range for anon vma

2012-11-05 Thread Arun Sharma

On 11/5/12 5:49 PM, Minchan Kim wrote:


Also, memory allocators have a second motivation in using madvise: to
create virtually contiguous regions of memory from a fragmented address
space, without increasing the RSS.


I don't get it. How do we create contiguos region by madvise?
Just out of curiosity.
Could you elaborate that use case? :)


By using a new anonymous map and faulting pages in.

The fragmented virtual memory is released via MADV_DONTNEED and if the 
malloc/free activity on the system is dominated by one process, chances 
are that the newly faulted in page is the one released by the same 
process :)


The net effect is that physical pages within a single address space are 
rearranged so larger allocations can be satisfied.


 -Arun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-11-05 Thread Minchan Kim
Hello,

On Mon, Nov 05, 2012 at 03:54:43PM -0800, Arun Sharma wrote:
> On Wed, Oct 31, 2012 at 06:56:05PM -0400, KOSAKI Motohiro wrote:
> > glibc malloc discard freed memory by using MADV_DONTNEED
> > as tcmalloc. and it is often a source of large performance decrease.
> > because of MADV_DONTNEED discard memory immediately and
> > right after malloc() call fall into page fault and pagesize memset() path.
> > then, using DONTNEED increased zero fill and cache miss rate.
> 
> The memcg based solution that I posted a few months ago is working well
> for us. We see significantly less cpu in zero'ing pages.
> 
> Not everyone was comfortable with the security implications of recycling
> pages between processes in a memcg, although it was disabled by default
> and had to be explicitly opted-in.
> 
> Also, memory allocators have a second motivation in using madvise: to
> create virtually contiguous regions of memory from a fragmented address 
> space, without increasing the RSS.

I don't get it. How do we create contiguos region by madvise?
Just out of curiosity.
Could you elaborate that use case? :)

> 
>  -Arun
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
Kind Regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-11-05 Thread Arun Sharma
On Wed, Oct 31, 2012 at 06:56:05PM -0400, KOSAKI Motohiro wrote:
> glibc malloc discard freed memory by using MADV_DONTNEED
> as tcmalloc. and it is often a source of large performance decrease.
> because of MADV_DONTNEED discard memory immediately and
> right after malloc() call fall into page fault and pagesize memset() path.
> then, using DONTNEED increased zero fill and cache miss rate.

The memcg based solution that I posted a few months ago is working well
for us. We see significantly less cpu in zero'ing pages.

Not everyone was comfortable with the security implications of recycling
pages between processes in a memcg, although it was disabled by default
and had to be explicitly opted-in.

Also, memory allocators have a second motivation in using madvise: to
create virtually contiguous regions of memory from a fragmented address 
space, without increasing the RSS.

 -Arun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-11-01 Thread Minchan Kim
Hi Bob,

On Fri, Nov 02, 2012 at 09:43:01AM +0800, Bob Liu wrote:
> On Tue, Oct 30, 2012 at 9:29 AM, Minchan Kim  wrote:
> > This patch introudces new madvise behavior MADV_VOLATILE and
> > MADV_NOVOLATILE for anonymous pages. It's different with
> > John Stultz's version which considers only tmpfs while this patch
> > considers only anonymous pages so this cannot cover John's one.
> > If below idea is proved as reasonable, I hope we can unify both
> > concepts by madvise/fadvise.
> >
> > Rationale is following as.
> > Many allocators call munmap(2) when user call free(3) if ptr is
> > in mmaped area. But munmap isn't cheap because it have to clean up
> > all pte entries and unlinking a vma so overhead would be increased
> > linearly by mmaped area's size.
> >
> 
> I have a question.
> Pte entries are cleaned up during munmap(), so if user space try to
> access the unmaped address
> page fault will be generated.
> 
> If use MADV_VOLATILE? What's the result?

1) If user accesses pages of volatile vma which are discard by VM
   due to memory pressure, it's SIGBUS.

2) If user accesses pages of volatile vma which live in memory without
   discarding, he can access it.

3) If user accesses pages of volatile vma which are swapped out before
  calling madvise(MADV_VOLATILE), It's okay.

> The pte entries are not cleaned so user space can still access the
> memory before
> VM discard pages?

Yeb.

> 
> > Volatile conecept of Robert Love could be very useful for reducing
> > free(3) overhead. Allocators can do madvise(MADV_VOLATILE) instead of
> > munmap(2)(Of course, they need to manage volatile mmaped area to
> > reduce shortage of address space and sometime ends up unmaping them).
> > The madvise(MADV_VOLATILE|NOVOLATILE) is very cheap opeartion because
> >
> > 1) it just marks the flag in VMA and
> > 2) if memory pressure happens, VM can discard pages of volatile VMA
> >instead of swapping out when volatile pages is selected as victim
> >by normal VM aging policy.
> > 3) freed mmaped area doesn't include any meaningful data so there
> >is no point to swap them out.
> >
> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> > allocating that area to user. Otherwise, accessing of volatile range
> > will meet SIGBUS error.
> >
> > The downside is that we have to age anon lru list although we don't
> > have swap because I don't want to discard volatile pages by top priority
> > when memory pressure happens as volatile in this patch means "We don't
> > need to swap out because user can handle the situation which data are
> > disappear suddenly", NOT "They are useless so hurry up to reclaim them".
> > So I want to apply same aging rule of nomal pages to them.
> >
> > Anon background aging of non-swap system would be a trade-off for
> > getting good feature. Even, we had done it two years ago until merge
> > [1] and I believe free(3) performance gain will beat loss of anon lru
> > aging's overead once all of allocator start to use madvise.
> > (This patch doesn't include background aging in case of non-swap system
> >  but it's trivial if we decide)
> >
> > I hope seeing opinions from others before diving into glibc or bionic.
> > Welcome to any comment.
> >
> > [1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap 
> > system
> >
> > Changelog
> >  * from RFC v1
> >* add clear comment of purge - Christoph
> >* Change patch descritpion
> >
> > Cc: John Stultz 
> > Cc: Andrew Morton 
> > Cc: Christoph Lameter 
> > Cc: Android Kernel Team 
> > Cc: Robert Love 
> > Cc: Mel Gorman 
> > Cc: Hugh Dickins 
> > Cc: Dave Hansen 
> > Cc: Rik van Riel 
> > Cc: Dave Chinner 
> > Cc: Neil Brown 
> > Cc: Mike Hommey 
> > Cc: Taras Glek 
> > Cc: KOSAKI Motohiro 
> > Cc: Christoph Lameter 
> > Cc: KAMEZAWA Hiroyuki 
> > Signed-off-by: Minchan Kim 
> > ---
> >  include/asm-generic/mman-common.h |3 +
> >  include/linux/mm.h|9 ++-
> >  include/linux/mm_types.h  |5 ++
> >  include/linux/rmap.h  |   24 ++-
> >  mm/ksm.c  |4 +-
> >  mm/madvise.c  |   32 +-
> >  mm/memory.c   |2 +
> >  mm/migrate.c  |6 +-
> >  mm/rmap.c |  126 
> > +++--
> >  mm/vmscan.c   |3 +
> >  10 files changed, 202 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/asm-generic/mman-common.h 
> > b/include/asm-generic/mman-common.h
> > index d030d2c..5f8090d 100644
> > --- a/include/asm-generic/mman-common.h
> > +++ b/include/asm-generic/mman-common.h
> > @@ -34,6 +34,9 @@
> >  #define MADV_SEQUENTIAL2   /* expect sequential page 
> > references */
> >  #define MADV_WILLNEED  3   /* will need these pages */
> >  #define MADV_DONTNEED  4   /* don't need these pages */
> > +#define MADV_VOLATILE  5   /* pages will dis

Re: [RFC v2] Support volatile range for anon vma

2012-11-01 Thread Bob Liu
On Tue, Oct 30, 2012 at 9:29 AM, Minchan Kim  wrote:
> This patch introudces new madvise behavior MADV_VOLATILE and
> MADV_NOVOLATILE for anonymous pages. It's different with
> John Stultz's version which considers only tmpfs while this patch
> considers only anonymous pages so this cannot cover John's one.
> If below idea is proved as reasonable, I hope we can unify both
> concepts by madvise/fadvise.
>
> Rationale is following as.
> Many allocators call munmap(2) when user call free(3) if ptr is
> in mmaped area. But munmap isn't cheap because it have to clean up
> all pte entries and unlinking a vma so overhead would be increased
> linearly by mmaped area's size.
>

I have a question.
Pte entries are cleaned up during munmap(), so if user space try to
access the unmaped address
page fault will be generated.

If use MADV_VOLATILE? What's the result?
The pte entries are not cleaned so user space can still access the
memory before
VM discard pages?

> Volatile conecept of Robert Love could be very useful for reducing
> free(3) overhead. Allocators can do madvise(MADV_VOLATILE) instead of
> munmap(2)(Of course, they need to manage volatile mmaped area to
> reduce shortage of address space and sometime ends up unmaping them).
> The madvise(MADV_VOLATILE|NOVOLATILE) is very cheap opeartion because
>
> 1) it just marks the flag in VMA and
> 2) if memory pressure happens, VM can discard pages of volatile VMA
>instead of swapping out when volatile pages is selected as victim
>by normal VM aging policy.
> 3) freed mmaped area doesn't include any meaningful data so there
>is no point to swap them out.
>
> Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> allocating that area to user. Otherwise, accessing of volatile range
> will meet SIGBUS error.
>
> The downside is that we have to age anon lru list although we don't
> have swap because I don't want to discard volatile pages by top priority
> when memory pressure happens as volatile in this patch means "We don't
> need to swap out because user can handle the situation which data are
> disappear suddenly", NOT "They are useless so hurry up to reclaim them".
> So I want to apply same aging rule of nomal pages to them.
>
> Anon background aging of non-swap system would be a trade-off for
> getting good feature. Even, we had done it two years ago until merge
> [1] and I believe free(3) performance gain will beat loss of anon lru
> aging's overead once all of allocator start to use madvise.
> (This patch doesn't include background aging in case of non-swap system
>  but it's trivial if we decide)
>
> I hope seeing opinions from others before diving into glibc or bionic.
> Welcome to any comment.
>
> [1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
>
> Changelog
>  * from RFC v1
>* add clear comment of purge - Christoph
>* Change patch descritpion
>
> Cc: John Stultz 
> Cc: Andrew Morton 
> Cc: Christoph Lameter 
> Cc: Android Kernel Team 
> Cc: Robert Love 
> Cc: Mel Gorman 
> Cc: Hugh Dickins 
> Cc: Dave Hansen 
> Cc: Rik van Riel 
> Cc: Dave Chinner 
> Cc: Neil Brown 
> Cc: Mike Hommey 
> Cc: Taras Glek 
> Cc: KOSAKI Motohiro 
> Cc: Christoph Lameter 
> Cc: KAMEZAWA Hiroyuki 
> Signed-off-by: Minchan Kim 
> ---
>  include/asm-generic/mman-common.h |3 +
>  include/linux/mm.h|9 ++-
>  include/linux/mm_types.h  |5 ++
>  include/linux/rmap.h  |   24 ++-
>  mm/ksm.c  |4 +-
>  mm/madvise.c  |   32 +-
>  mm/memory.c   |2 +
>  mm/migrate.c  |6 +-
>  mm/rmap.c |  126 
> +++--
>  mm/vmscan.c   |3 +
>  10 files changed, 202 insertions(+), 12 deletions(-)
>
> diff --git a/include/asm-generic/mman-common.h 
> b/include/asm-generic/mman-common.h
> index d030d2c..5f8090d 100644
> --- a/include/asm-generic/mman-common.h
> +++ b/include/asm-generic/mman-common.h
> @@ -34,6 +34,9 @@
>  #define MADV_SEQUENTIAL2   /* expect sequential page 
> references */
>  #define MADV_WILLNEED  3   /* will need these pages */
>  #define MADV_DONTNEED  4   /* don't need these pages */
> +#define MADV_VOLATILE  5   /* pages will disappear suddenly */
> +#define MADV_NOVOLATILE 6  /* pages will not disappear */
> +
>
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_REMOVE9   /* remove these pages & resources */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 311be90..78c8c08 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -120,6 +120,13 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_PFN_AT_MMAP 0x4000  /* PFNMAP vma that is fully mapped at 
> mmap time */
>  #define VM_MERGEABLE   0x8000  /* KSM may merge identi

Re: [RFC v2] Support volatile range for anon vma

2012-10-31 Thread KOSAKI Motohiro
>> - making zero page daemon and avoid pagesize zero fill at page fault
>> - making new vma or page flags and mark as discardable w/o swap and
>>   vmscan treat it. (like this and/or MADV_FREE)
>
> Thanks for the information.
> I realized by you I'm not first people to think of this idea.
> Rik already tried it(https://lkml.org/lkml/2007/4/17/53) by new page flag
> and even other OSes already have such good feature. And John's concept was
> already tried long time ago (https://lkml.org/lkml/2005/11/1/384)
>
> Hmm, I look over Rik's thread but couldn't find why it wasn't merged
> at that time. Anyone know it?

Dunno. and I like volatile feature than old one. but bold remark, please don't
100% trust me, I haven't review a detailed code of your patch and I don't
strictly understand it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-10-31 Thread Minchan Kim
On Wed, Oct 31, 2012 at 06:15:33PM -0700, Paul Turner wrote:
> On Wed, Oct 31, 2012 at 3:56 PM, KOSAKI Motohiro
>  wrote:
> >>> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> >>> > allocating that area to user. Otherwise, accessing of volatile range
> >>> > will meet SIGBUS error.
> >>>
> >>> Well, why?  It would be easy enough for the fault handler to give
> >>> userspace a new, zeroed page at that address.
> >>
> >> Note: MADV_DONTNEED already has this (nice) property.
> >
> > I don't think I strictly understand this patch. but maybe I can answer why
> > userland and malloc folks don't like MADV_DONTNEED.
> >
> > glibc malloc discard freed memory by using MADV_DONTNEED
> > as tcmalloc. and it is often a source of large performance decrease.
> > because of MADV_DONTNEED discard memory immediately and
> > right after malloc() call fall into page fault and pagesize memset() path.
> > then, using DONTNEED increased zero fill and cache miss rate.
> >
> > At called free() time, malloc don't have a knowledge when next big malloc()
> > is called. then, immediate discarding may or may not get good performance
> > gain. (Ah, ok, the rate is not 5:5. then usually it is worth. but not 
> > everytime)
> >
> 
> Ah; In tcmalloc allocations (and their associated free-lists) are
> binned into separate lists as a function of object-size which helps to
> mitigate this.
> 
> I'd make a separate more general argument here:
> If I'm allocating a large (multi-kilobyte object) the cost of what I'm
> about to do with that object is likely fairly large -- The fault/zero
> cost a probably fairly small proportional cost, which limits the
> optimization value.

While I look at thread trial of Rik which is same goal while implementation
is different, I found this number.

https://lkml.org/lkml/2007/4/20/390

I believe optimiation is valuable. Of course, I need simillar testing for
proving it.

> 
> >
> > In past, several developers tryied to avoid such situation, likes
> >
> > - making zero page daemon and avoid pagesize zero fill at page fault
> > - making new vma or page flags and mark as discardable w/o swap and
> >   vmscan treat it. (like this and/or MADV_FREE)
> > - making new process option and avoid page zero fill from page fault path.
> >   (yes, it is big incompatibility and insecure. but some embedded folks 
> > thought
> >they are acceptable downside)
> > - etc
> >
> >
> > btw, I'm not sure this patch is better for malloc because current 
> > MADV_DONTNEED
> > don't need mmap_sem and works very effectively when a lot of threads case.
> > taking mmap_sem might bring worse performance than DONTNEED. dunno.
> 
> MADV_VOLATILE also seems to end up looking quite similar to a
> user-visible (range-based) cleancache.
> 
> A second popular use-case for such semantics is the case of
> discardable cache elements (e.g. web browser).  I suspect we'd want to
> at least mention these in the changelog.  (Alternatively, what does a
> cleancache-backed-fs exposing these semantics look like?)
> 

It's a trial of John Stultz(http://lwn.net/Articles/518130/, there was another
trial long time ago https://lkml.org/lkml/2005/11/1/384) and I want to
expand the concept from file-backed page to anonymous page so this patch
is a trial for anonymous page. So, usecase of my patch have focussed on
malloc/free case.
I hope both are able to be unified.

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-10-31 Thread Minchan Kim
On Wed, Oct 31, 2012 at 06:22:58PM -0700, Paul Turner wrote:
> On Wed, Oct 31, 2012 at 5:50 PM, Minchan Kim  wrote:
> > Hello,
> >
> > On Wed, Oct 31, 2012 at 02:59:07PM -0700, Paul Turner wrote:
> >> On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
> >>  wrote:
> >> >
> >> > On Tue, 30 Oct 2012 10:29:54 +0900
> >> > Minchan Kim  wrote:
> >> >
> >> > > This patch introudces new madvise behavior MADV_VOLATILE and
> >> > > MADV_NOVOLATILE for anonymous pages. It's different with
> >> > > John Stultz's version which considers only tmpfs while this patch
> >> > > considers only anonymous pages so this cannot cover John's one.
> >> > > If below idea is proved as reasonable, I hope we can unify both
> >> > > concepts by madvise/fadvise.
> >> > >
> >> > > Rationale is following as.
> >> > > Many allocators call munmap(2) when user call free(3) if ptr is
> >> > > in mmaped area. But munmap isn't cheap because it have to clean up
> >> > > all pte entries and unlinking a vma so overhead would be increased
> >> > > linearly by mmaped area's size.
> >> >
> >> > Presumably the userspace allocator will internally manage memory in
> >> > large chunks, so the munmap() call frequency will be much lower than
> >> > the free() call frequency.  So the performance gains from this change
> >> > might be very small.
> >>
> >> I don't think I strictly understand the motivation from a
> >> malloc-standpoint here.
> >>
> >> These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
> >> to perform discards on Linux.For any reasonable allocator (short
> >> of binding malloc --> mmap, free --> unmap) this seems a better
> >> choice.
> >>
> >> Note also from a performance stand-point I doubt any allocator (which
> >> case about performance) is going to want to pay the cost of even a
> >> null syscall about typical malloc/free usage (consider: a tcmalloc
> >
> > Good point.
> >
> >> malloc/free pairis currently <20ns).  Given then that this cost is
> >> amortized once you start doing discards on larger blocks MADV_DONTNEED
> >> seems a preferable interface:
> >> - You don't need to reconstruct an arena when you do want to allocate
> >> since there's no munmap/mmap for the region to change about
> >> - There are no syscalls involved in later reallocating the block.
> >
> > Above benefits are applied on MADV_VOLATILE, too.
> > But as you pointed out, there is a little bit overhead than DONTNEED
> > because allocator should call madvise(MADV_NOVOLATILE) before allocation.
> > For mavise(NOVOLATILE) does just mark vma flag, it does need mmap_sem
> > and could be a problem on parallel malloc/free workload as KOSAKI pointed 
> > out.
> >
> > In such case, we can change semantic so malloc doesn't need to call
> > madivse(NOVOLATILE) before allocating. Then, page fault handler have to
> > check whether this page fault happen by access of volatile vma. If so,
> > it could return zero page instead of SIGBUS and mark the vma isn't volatile
> > any more.
> 
> I think being able to determine whether the backing was discarded
> (about a atomic transition to non-volatile) would be a required
> property to make this useful for non-malloc use-cases.
> 

Absolutely.

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-10-31 Thread Paul Turner
On Wed, Oct 31, 2012 at 5:50 PM, Minchan Kim  wrote:
> Hello,
>
> On Wed, Oct 31, 2012 at 02:59:07PM -0700, Paul Turner wrote:
>> On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
>>  wrote:
>> >
>> > On Tue, 30 Oct 2012 10:29:54 +0900
>> > Minchan Kim  wrote:
>> >
>> > > This patch introudces new madvise behavior MADV_VOLATILE and
>> > > MADV_NOVOLATILE for anonymous pages. It's different with
>> > > John Stultz's version which considers only tmpfs while this patch
>> > > considers only anonymous pages so this cannot cover John's one.
>> > > If below idea is proved as reasonable, I hope we can unify both
>> > > concepts by madvise/fadvise.
>> > >
>> > > Rationale is following as.
>> > > Many allocators call munmap(2) when user call free(3) if ptr is
>> > > in mmaped area. But munmap isn't cheap because it have to clean up
>> > > all pte entries and unlinking a vma so overhead would be increased
>> > > linearly by mmaped area's size.
>> >
>> > Presumably the userspace allocator will internally manage memory in
>> > large chunks, so the munmap() call frequency will be much lower than
>> > the free() call frequency.  So the performance gains from this change
>> > might be very small.
>>
>> I don't think I strictly understand the motivation from a
>> malloc-standpoint here.
>>
>> These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
>> to perform discards on Linux.For any reasonable allocator (short
>> of binding malloc --> mmap, free --> unmap) this seems a better
>> choice.
>>
>> Note also from a performance stand-point I doubt any allocator (which
>> case about performance) is going to want to pay the cost of even a
>> null syscall about typical malloc/free usage (consider: a tcmalloc
>
> Good point.
>
>> malloc/free pairis currently <20ns).  Given then that this cost is
>> amortized once you start doing discards on larger blocks MADV_DONTNEED
>> seems a preferable interface:
>> - You don't need to reconstruct an arena when you do want to allocate
>> since there's no munmap/mmap for the region to change about
>> - There are no syscalls involved in later reallocating the block.
>
> Above benefits are applied on MADV_VOLATILE, too.
> But as you pointed out, there is a little bit overhead than DONTNEED
> because allocator should call madvise(MADV_NOVOLATILE) before allocation.
> For mavise(NOVOLATILE) does just mark vma flag, it does need mmap_sem
> and could be a problem on parallel malloc/free workload as KOSAKI pointed out.
>
> In such case, we can change semantic so malloc doesn't need to call
> madivse(NOVOLATILE) before allocating. Then, page fault handler have to
> check whether this page fault happen by access of volatile vma. If so,
> it could return zero page instead of SIGBUS and mark the vma isn't volatile
> any more.

I think being able to determine whether the backing was discarded
(about a atomic transition to non-volatile) would be a required
property to make this useful for non-malloc use-cases.

>
>>
>> The only real additional cost is address-space.  Are you strongly
>> concerned about the 32-bit case?
>
> No. I believe allocators have a logic to clean up them once address space is
> almost full.
>
> Thanks, Paul.
>
> --
> Kind regards,
> Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-10-31 Thread Minchan Kim
Hi KOSAKI,

On Wed, Oct 31, 2012 at 06:56:05PM -0400, KOSAKI Motohiro wrote:
> >> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> >> > allocating that area to user. Otherwise, accessing of volatile range
> >> > will meet SIGBUS error.
> >>
> >> Well, why?  It would be easy enough for the fault handler to give
> >> userspace a new, zeroed page at that address.
> >
> > Note: MADV_DONTNEED already has this (nice) property.
> 
> I don't think I strictly understand this patch. but maybe I can answer why
> userland and malloc folks don't like MADV_DONTNEED.
> 
> glibc malloc discard freed memory by using MADV_DONTNEED
> as tcmalloc. and it is often a source of large performance decrease.
> because of MADV_DONTNEED discard memory immediately and
> right after malloc() call fall into page fault and pagesize memset() path.
> then, using DONTNEED increased zero fill and cache miss rate.
> 
> At called free() time, malloc don't have a knowledge when next big malloc()
> is called. then, immediate discarding may or may not get good performance
> gain. (Ah, ok, the rate is not 5:5. then usually it is worth. but not 
> everytime)
> 
> 
> In past, several developers tryied to avoid such situation, likes
> 
> - making zero page daemon and avoid pagesize zero fill at page fault
> - making new vma or page flags and mark as discardable w/o swap and
>   vmscan treat it. (like this and/or MADV_FREE)

Thanks for the information.
I realized by you I'm not first people to think of this idea.
Rik already tried it(https://lkml.org/lkml/2007/4/17/53) by new page flag
and even other OSes already have such good feature. And John's concept was
already tried long time ago (https://lkml.org/lkml/2005/11/1/384)

Hmm, I look over Rik's thread but couldn't find why it wasn't merged
at that time. Anyone know it?

> - making new process option and avoid page zero fill from page fault path.
>   (yes, it is big incompatibility and insecure. but some embedded folks 
> thought
>they are acceptable downside)
> - etc
> 
> 
> btw, I'm not sure this patch is better for malloc because current 
> MADV_DONTNEED
> don't need mmap_sem and works very effectively when a lot of threads case.
> taking mmap_sem might bring worse performance than DONTNEED. dunno.

It's a good point. 

Quote from my reply to Paul
"
In such case, we can change semantic so malloc doesn't need to call
madivse(NOVOLATILE) before allocating. Then, page fault handler have to
check whether this page fault happen by access of volatile vma. If so,
it could return zero page instead of SIGBUS and mark the vma isn't volatile
any more.
"
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-10-31 Thread Paul Turner
On Wed, Oct 31, 2012 at 3:56 PM, KOSAKI Motohiro
 wrote:
>>> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
>>> > allocating that area to user. Otherwise, accessing of volatile range
>>> > will meet SIGBUS error.
>>>
>>> Well, why?  It would be easy enough for the fault handler to give
>>> userspace a new, zeroed page at that address.
>>
>> Note: MADV_DONTNEED already has this (nice) property.
>
> I don't think I strictly understand this patch. but maybe I can answer why
> userland and malloc folks don't like MADV_DONTNEED.
>
> glibc malloc discard freed memory by using MADV_DONTNEED
> as tcmalloc. and it is often a source of large performance decrease.
> because of MADV_DONTNEED discard memory immediately and
> right after malloc() call fall into page fault and pagesize memset() path.
> then, using DONTNEED increased zero fill and cache miss rate.
>
> At called free() time, malloc don't have a knowledge when next big malloc()
> is called. then, immediate discarding may or may not get good performance
> gain. (Ah, ok, the rate is not 5:5. then usually it is worth. but not 
> everytime)
>

Ah; In tcmalloc allocations (and their associated free-lists) are
binned into separate lists as a function of object-size which helps to
mitigate this.

I'd make a separate more general argument here:
If I'm allocating a large (multi-kilobyte object) the cost of what I'm
about to do with that object is likely fairly large -- The fault/zero
cost a probably fairly small proportional cost, which limits the
optimization value.

>
> In past, several developers tryied to avoid such situation, likes
>
> - making zero page daemon and avoid pagesize zero fill at page fault
> - making new vma or page flags and mark as discardable w/o swap and
>   vmscan treat it. (like this and/or MADV_FREE)
> - making new process option and avoid page zero fill from page fault path.
>   (yes, it is big incompatibility and insecure. but some embedded folks 
> thought
>they are acceptable downside)
> - etc
>
>
> btw, I'm not sure this patch is better for malloc because current 
> MADV_DONTNEED
> don't need mmap_sem and works very effectively when a lot of threads case.
> taking mmap_sem might bring worse performance than DONTNEED. dunno.

MADV_VOLATILE also seems to end up looking quite similar to a
user-visible (range-based) cleancache.

A second popular use-case for such semantics is the case of
discardable cache elements (e.g. web browser).  I suspect we'd want to
at least mention these in the changelog.  (Alternatively, what does a
cleancache-backed-fs exposing these semantics look like?)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-10-31 Thread Minchan Kim
Hello,

On Wed, Oct 31, 2012 at 02:59:07PM -0700, Paul Turner wrote:
> On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
>  wrote:
> >
> > On Tue, 30 Oct 2012 10:29:54 +0900
> > Minchan Kim  wrote:
> >
> > > This patch introudces new madvise behavior MADV_VOLATILE and
> > > MADV_NOVOLATILE for anonymous pages. It's different with
> > > John Stultz's version which considers only tmpfs while this patch
> > > considers only anonymous pages so this cannot cover John's one.
> > > If below idea is proved as reasonable, I hope we can unify both
> > > concepts by madvise/fadvise.
> > >
> > > Rationale is following as.
> > > Many allocators call munmap(2) when user call free(3) if ptr is
> > > in mmaped area. But munmap isn't cheap because it have to clean up
> > > all pte entries and unlinking a vma so overhead would be increased
> > > linearly by mmaped area's size.
> >
> > Presumably the userspace allocator will internally manage memory in
> > large chunks, so the munmap() call frequency will be much lower than
> > the free() call frequency.  So the performance gains from this change
> > might be very small.
> 
> I don't think I strictly understand the motivation from a
> malloc-standpoint here.
> 
> These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
> to perform discards on Linux.For any reasonable allocator (short
> of binding malloc --> mmap, free --> unmap) this seems a better
> choice.
> 
> Note also from a performance stand-point I doubt any allocator (which
> case about performance) is going to want to pay the cost of even a
> null syscall about typical malloc/free usage (consider: a tcmalloc

Good point.

> malloc/free pairis currently <20ns).  Given then that this cost is
> amortized once you start doing discards on larger blocks MADV_DONTNEED
> seems a preferable interface:
> - You don't need to reconstruct an arena when you do want to allocate
> since there's no munmap/mmap for the region to change about
> - There are no syscalls involved in later reallocating the block.

Above benefits are applied on MADV_VOLATILE, too.
But as you pointed out, there is a little bit overhead than DONTNEED
because allocator should call madvise(MADV_NOVOLATILE) before allocation.
For mavise(NOVOLATILE) does just mark vma flag, it does need mmap_sem
and could be a problem on parallel malloc/free workload as KOSAKI pointed out.

In such case, we can change semantic so malloc doesn't need to call
madivse(NOVOLATILE) before allocating. Then, page fault handler have to
check whether this page fault happen by access of volatile vma. If so,
it could return zero page instead of SIGBUS and mark the vma isn't volatile
any more.

> 
> The only real additional cost is address-space.  Are you strongly
> concerned about the 32-bit case?

No. I believe allocators have a logic to clean up them once address space is
almost full.

Thanks, Paul.

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-10-31 Thread Minchan Kim
Hi Andrew,

On Wed, Oct 31, 2012 at 02:35:24PM -0700, Andrew Morton wrote:
> On Tue, 30 Oct 2012 10:29:54 +0900
> Minchan Kim  wrote:
> 
> > This patch introudces new madvise behavior MADV_VOLATILE and
> > MADV_NOVOLATILE for anonymous pages. It's different with
> > John Stultz's version which considers only tmpfs while this patch
> > considers only anonymous pages so this cannot cover John's one.
> > If below idea is proved as reasonable, I hope we can unify both
> > concepts by madvise/fadvise.
> > 
> > Rationale is following as.
> > Many allocators call munmap(2) when user call free(3) if ptr is
> > in mmaped area. But munmap isn't cheap because it have to clean up
> > all pte entries and unlinking a vma so overhead would be increased
> > linearly by mmaped area's size.
> 
> Presumably the userspace allocator will internally manage memory in
> large chunks, so the munmap() call frequency will be much lower than
> the free() call frequency.  So the performance gains from this change
> might be very small.
> 
> The whole point of the patch is to improve performance, but we have no
> evidence that it was successful in doing that!  I do think we'll need
> good quantitative testing results before proceeding with such a patch,
> please.

Absolutely. That's why I send it as RFC.
In this time, I would like to reach a concensus on that this idea
makes sense before further investigating because we have lots of
experienced developer pool and one of them might know this is really
needed or not.

> 
> Also, it is very desirable that we involve the relevant userspace
> (glibc, etc) developers in this.  And I understand that the google
> tcmalloc project will probably have interest in this - I've cc'ed
> various people@google in the hope that they can provide input (please).

Thanks! I should have done. Such input is really one I need now.

> 
> Also, it is a userspace API change.  Please cc mtk.manpa...@gmail.com.

This is RFC so we don't have anything fixed until now.
I will Cc'ed him after everything I should solve goes out and
interface is fixed.

> 
> Also, I assume that you have userspace test code.  At some stage,
> please consider adding a case to tools/testing/selftests.  Such a test
> would require to creation of memory pressure, which is rather contrary
> to the selftests' current philosopy of being a bunch of short-running
> little tests.  Perhaps you can come up with something.  But I suggest
> that such work be done later, once it becomes clearer that this code is
> actually headed into the kernel.

Yes.

> 
> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> > allocating that area to user. Otherwise, accessing of volatile range
> > will meet SIGBUS error.
> 
> Well, why?  It would be easy enough for the fault handler to give
> userspace a new, zeroed page at that address.

Absolutely. It would be convenient but as a matter of fact, I am considering
to unify John Stultz's fallocate volatile range which consider only tmpfs
pages so madvise/fadvise might be better candidate as API.
In tmpfs case, John implemented it as returning zero page when someone
access volatile region like you mentioned but in this kernel summit, Hugh
pointed out and wanted to return SIGBUS and I think it makes debug better.

Another option is we can put a flag in API which indicates that VM will
return zero page or SIGBUS when user access volatile range so user can do
what they want.

> 
> Or we could simply leave the old page in place at that address.  If the
> page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
> page (or all the not-yet-reclaimed pages) back to userspace at their
> old addresses.
> 
> Various options suggest themselves here.  You've chosen one of them but
> I would like to see a pretty exhaustive description of the reasoning
> behind that decision.

Will do.

> 
> Also, I wonder about the interaction with other vma manipulation
> operations.  For example, can a VMA get split when in the MADV_VOLATILE
> state?  If so, what happens?  

Both VMAs would be volatile although one of either has never reclaimed
pages. I understand it's not an optimal but I expect user will not do 
such operations(ex, mprotect, mremap) frequently on volatile vma.

If they do, maybe I need to come up with something but It wouldn't be easy.

> 
> Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
> but I bet this wasn't tested ;)

Yes. I didn't consider that yet. AFAIK, nonlinear vma is related to
file-mapped vma while this patch consider only anon vma which is good for
first step.

> 
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
> > if (error)
> > goto out;
> > break;
> > +   case MADV_VOLATILE:
> > +   if (vma->vm_flags & VM_LOCKED) {
> > +   error = -EINVAL;
> > +   goto out;
> > +   }
> > +   new_fla

Re: [RFC v2] Support volatile range for anon vma

2012-10-31 Thread KOSAKI Motohiro
>> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
>> > allocating that area to user. Otherwise, accessing of volatile range
>> > will meet SIGBUS error.
>>
>> Well, why?  It would be easy enough for the fault handler to give
>> userspace a new, zeroed page at that address.
>
> Note: MADV_DONTNEED already has this (nice) property.

I don't think I strictly understand this patch. but maybe I can answer why
userland and malloc folks don't like MADV_DONTNEED.

glibc malloc discard freed memory by using MADV_DONTNEED
as tcmalloc. and it is often a source of large performance decrease.
because of MADV_DONTNEED discard memory immediately and
right after malloc() call fall into page fault and pagesize memset() path.
then, using DONTNEED increased zero fill and cache miss rate.

At called free() time, malloc don't have a knowledge when next big malloc()
is called. then, immediate discarding may or may not get good performance
gain. (Ah, ok, the rate is not 5:5. then usually it is worth. but not everytime)


In past, several developers tryied to avoid such situation, likes

- making zero page daemon and avoid pagesize zero fill at page fault
- making new vma or page flags and mark as discardable w/o swap and
  vmscan treat it. (like this and/or MADV_FREE)
- making new process option and avoid page zero fill from page fault path.
  (yes, it is big incompatibility and insecure. but some embedded folks thought
   they are acceptable downside)
- etc


btw, I'm not sure this patch is better for malloc because current MADV_DONTNEED
don't need mmap_sem and works very effectively when a lot of threads case.
taking mmap_sem might bring worse performance than DONTNEED. dunno.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2] Support volatile range for anon vma

2012-10-31 Thread Paul Turner
On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
 wrote:
>
> On Tue, 30 Oct 2012 10:29:54 +0900
> Minchan Kim  wrote:
>
> > This patch introudces new madvise behavior MADV_VOLATILE and
> > MADV_NOVOLATILE for anonymous pages. It's different with
> > John Stultz's version which considers only tmpfs while this patch
> > considers only anonymous pages so this cannot cover John's one.
> > If below idea is proved as reasonable, I hope we can unify both
> > concepts by madvise/fadvise.
> >
> > Rationale is following as.
> > Many allocators call munmap(2) when user call free(3) if ptr is
> > in mmaped area. But munmap isn't cheap because it have to clean up
> > all pte entries and unlinking a vma so overhead would be increased
> > linearly by mmaped area's size.
>
> Presumably the userspace allocator will internally manage memory in
> large chunks, so the munmap() call frequency will be much lower than
> the free() call frequency.  So the performance gains from this change
> might be very small.

I don't think I strictly understand the motivation from a
malloc-standpoint here.

These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
to perform discards on Linux.For any reasonable allocator (short
of binding malloc --> mmap, free --> unmap) this seems a better
choice.

Note also from a performance stand-point I doubt any allocator (which
case about performance) is going to want to pay the cost of even a
null syscall about typical malloc/free usage (consider: a tcmalloc
malloc/free pairis currently <20ns).  Given then that this cost is
amortized once you start doing discards on larger blocks MADV_DONTNEED
seems a preferable interface:
- You don't need to reconstruct an arena when you do want to allocate
since there's no munmap/mmap for the region to change about
- There are no syscalls involved in later reallocating the block.

The only real additional cost is address-space.  Are you strongly
concerned about the 32-bit case?

>
> The whole point of the patch is to improve performance, but we have no
> evidence that it was successful in doing that!  I do think we'll need
> good quantitative testing results before proceeding with such a patch,
> please.
>
> Also, it is very desirable that we involve the relevant userspace
> (glibc, etc) developers in this.  And I understand that the google
> tcmalloc project will probably have interest in this - I've cc'ed
> various people@google in the hope that they can provide input (please).
>
> Also, it is a userspace API change.  Please cc mtk.manpa...@gmail.com.
>
> Also, I assume that you have userspace test code.  At some stage,
> please consider adding a case to tools/testing/selftests.  Such a test
> would require to creation of memory pressure, which is rather contrary
> to the selftests' current philosopy of being a bunch of short-running
> little tests.  Perhaps you can come up with something.  But I suggest
> that such work be done later, once it becomes clearer that this code is
> actually headed into the kernel.
>
> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> > allocating that area to user. Otherwise, accessing of volatile range
> > will meet SIGBUS error.
>
> Well, why?  It would be easy enough for the fault handler to give
> userspace a new, zeroed page at that address.

Note: MADV_DONTNEED already has this (nice) property.

>
> Or we could simply leave the old page in place at that address.  If the
> page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
> page (or all the not-yet-reclaimed pages) back to userspace at their
> old addresses.
>
> Various options suggest themselves here.  You've chosen one of them but
> I would like to see a pretty exhaustive description of the reasoning
> behind that decision.
>
> Also, I wonder about the interaction with other vma manipulation
> operations.  For example, can a VMA get split when in the MADV_VOLATILE
> state?  If so, what happens?
>
> Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
> but I bet this wasn't tested ;)
>
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
> >   if (error)
> >   goto out;
> >   break;
> > + case MADV_VOLATILE:
> > + if (vma->vm_flags & VM_LOCKED) {
> > + error = -EINVAL;
> > + goto out;
> > + }
> > + new_flags |= VM_VOLATILE;
> > + vma->purged = false;
> > + break;
> > + case MADV_NOVOLATILE:
> > + if (!(vma->vm_flags & VM_VOLATILE)) {
> > + error = -EINVAL;
> > + goto out;
>
> I wonder if this really should return an error.  Other madvise()
> options don't do this, and running MADV_NOVOLATILE against a
> not-volatile area seems pretty benign and has clearly defined before-
> and after- states.
>
> > + }
> > +
> > +   

Re: [RFC v2] Support volatile range for anon vma

2012-10-31 Thread Andrew Morton
On Tue, 30 Oct 2012 10:29:54 +0900
Minchan Kim  wrote:

> This patch introudces new madvise behavior MADV_VOLATILE and
> MADV_NOVOLATILE for anonymous pages. It's different with
> John Stultz's version which considers only tmpfs while this patch
> considers only anonymous pages so this cannot cover John's one.
> If below idea is proved as reasonable, I hope we can unify both
> concepts by madvise/fadvise.
> 
> Rationale is following as.
> Many allocators call munmap(2) when user call free(3) if ptr is
> in mmaped area. But munmap isn't cheap because it have to clean up
> all pte entries and unlinking a vma so overhead would be increased
> linearly by mmaped area's size.

Presumably the userspace allocator will internally manage memory in
large chunks, so the munmap() call frequency will be much lower than
the free() call frequency.  So the performance gains from this change
might be very small.

The whole point of the patch is to improve performance, but we have no
evidence that it was successful in doing that!  I do think we'll need
good quantitative testing results before proceeding with such a patch,
please.

Also, it is very desirable that we involve the relevant userspace
(glibc, etc) developers in this.  And I understand that the google
tcmalloc project will probably have interest in this - I've cc'ed
various people@google in the hope that they can provide input (please).

Also, it is a userspace API change.  Please cc mtk.manpa...@gmail.com.

Also, I assume that you have userspace test code.  At some stage,
please consider adding a case to tools/testing/selftests.  Such a test
would require to creation of memory pressure, which is rather contrary
to the selftests' current philosopy of being a bunch of short-running
little tests.  Perhaps you can come up with something.  But I suggest
that such work be done later, once it becomes clearer that this code is
actually headed into the kernel.

> Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> allocating that area to user. Otherwise, accessing of volatile range
> will meet SIGBUS error.

Well, why?  It would be easy enough for the fault handler to give
userspace a new, zeroed page at that address.

Or we could simply leave the old page in place at that address.  If the
page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
page (or all the not-yet-reclaimed pages) back to userspace at their
old addresses.

Various options suggest themselves here.  You've chosen one of them but
I would like to see a pretty exhaustive description of the reasoning
behind that decision.

Also, I wonder about the interaction with other vma manipulation
operations.  For example, can a VMA get split when in the MADV_VOLATILE
state?  If so, what happens?  

Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
but I bet this wasn't tested ;)

> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
>   if (error)
>   goto out;
>   break;
> + case MADV_VOLATILE:
> + if (vma->vm_flags & VM_LOCKED) {
> + error = -EINVAL;
> + goto out;
> + }
> + new_flags |= VM_VOLATILE;
> + vma->purged = false;
> + break;
> + case MADV_NOVOLATILE:
> + if (!(vma->vm_flags & VM_VOLATILE)) {
> + error = -EINVAL;
> + goto out;

I wonder if this really should return an error.  Other madvise()
options don't do this, and running MADV_NOVOLATILE against a
not-volatile area seems pretty benign and has clearly defined before-
and after- states.

> + }
> +
> + new_flags &= ~VM_VOLATILE;
> + break;
>   }
>  
>   if (new_flags == vma->vm_flags) {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC v2] Support volatile range for anon vma

2012-10-29 Thread Minchan Kim
This patch introudces new madvise behavior MADV_VOLATILE and
MADV_NOVOLATILE for anonymous pages. It's different with
John Stultz's version which considers only tmpfs while this patch
considers only anonymous pages so this cannot cover John's one.
If below idea is proved as reasonable, I hope we can unify both
concepts by madvise/fadvise.

Rationale is following as.
Many allocators call munmap(2) when user call free(3) if ptr is
in mmaped area. But munmap isn't cheap because it have to clean up
all pte entries and unlinking a vma so overhead would be increased
linearly by mmaped area's size.

Volatile conecept of Robert Love could be very useful for reducing
free(3) overhead. Allocators can do madvise(MADV_VOLATILE) instead of
munmap(2)(Of course, they need to manage volatile mmaped area to
reduce shortage of address space and sometime ends up unmaping them).
The madvise(MADV_VOLATILE|NOVOLATILE) is very cheap opeartion because

1) it just marks the flag in VMA and
2) if memory pressure happens, VM can discard pages of volatile VMA
   instead of swapping out when volatile pages is selected as victim
   by normal VM aging policy.
3) freed mmaped area doesn't include any meaningful data so there
   is no point to swap them out.

Allocator should call madvise(MADV_NOVOLATILE) before reusing for
allocating that area to user. Otherwise, accessing of volatile range
will meet SIGBUS error.

The downside is that we have to age anon lru list although we don't
have swap because I don't want to discard volatile pages by top priority
when memory pressure happens as volatile in this patch means "We don't
need to swap out because user can handle the situation which data are
disappear suddenly", NOT "They are useless so hurry up to reclaim them".
So I want to apply same aging rule of nomal pages to them.

Anon background aging of non-swap system would be a trade-off for
getting good feature. Even, we had done it two years ago until merge
[1] and I believe free(3) performance gain will beat loss of anon lru
aging's overead once all of allocator start to use madvise.
(This patch doesn't include background aging in case of non-swap system
 but it's trivial if we decide)

I hope seeing opinions from others before diving into glibc or bionic.
Welcome to any comment.

[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system

Changelog
 * from RFC v1
   * add clear comment of purge - Christoph
   * Change patch descritpion

Cc: John Stultz 
Cc: Andrew Morton 
Cc: Christoph Lameter 
Cc: Android Kernel Team 
Cc: Robert Love 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Rik van Riel 
Cc: Dave Chinner 
Cc: Neil Brown 
Cc: Mike Hommey 
Cc: Taras Glek 
Cc: KOSAKI Motohiro 
Cc: Christoph Lameter 
Cc: KAMEZAWA Hiroyuki 
Signed-off-by: Minchan Kim 
---
 include/asm-generic/mman-common.h |3 +
 include/linux/mm.h|9 ++-
 include/linux/mm_types.h  |5 ++
 include/linux/rmap.h  |   24 ++-
 mm/ksm.c  |4 +-
 mm/madvise.c  |   32 +-
 mm/memory.c   |2 +
 mm/migrate.c  |6 +-
 mm/rmap.c |  126 +++--
 mm/vmscan.c   |3 +
 10 files changed, 202 insertions(+), 12 deletions(-)

diff --git a/include/asm-generic/mman-common.h 
b/include/asm-generic/mman-common.h
index d030d2c..5f8090d 100644
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -34,6 +34,9 @@
 #define MADV_SEQUENTIAL2   /* expect sequential page 
references */
 #define MADV_WILLNEED  3   /* will need these pages */
 #define MADV_DONTNEED  4   /* don't need these pages */
+#define MADV_VOLATILE  5   /* pages will disappear suddenly */
+#define MADV_NOVOLATILE 6  /* pages will not disappear */
+
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE9   /* remove these pages & resources */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..78c8c08 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -120,6 +120,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_PFN_AT_MMAP 0x4000  /* PFNMAP vma that is fully mapped at 
mmap time */
 #define VM_MERGEABLE   0x8000  /* KSM may merge identical pages */
 
+/*
+ * Recently, Konstantin removed a few flags but not merged yet
+ * so we will get a room for new flag for supporting 32 bit.
+ * Thanks, Konstantin!
+ */
+#define VM_VOLATILE0x1
+
 /* Bits set in the VMA until the stack is in its final location */
 #define VM_STACK_INCOMPLETE_SETUP  (VM_RAND_READ | VM_SEQ_READ)
 
@@ -143,7 +150,7 @@ extern unsigned int kobjsize(const void *objp);
  * Special vmas that are non-mergable, non-mlock()able.
  * Note: mm/huge_memory.c VM_NO_THP depends on this definition.
  */