Re: [PATCH v10 00/16] Volatile Ranges v10

2014-02-03 Thread Minchan Kim

On Fri, Jan 31, 2014 at 11:49:01AM -0500, Johannes Weiner wrote:
> On Wed, Jan 29, 2014 at 02:11:02PM +0900, Minchan Kim wrote:
> > It's interesting timing: I posted this patch on New Year's Day
> > and received an in-depth design review on Lunar New Year's Day. :)
> > It's almost 0-day review. :)
> 
> That's the only way I can do 0-day reviews ;)
> 
> > On Tue, Jan 28, 2014 at 07:03:59PM -0500, Johannes Weiner wrote:
> > > Hello Minchan,
> > > 
> > > On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> > > > Hey all,
> > > > 
> > > > Happy New Year!
> > > > 
> > > > I know it's bad timing to send this unfamiliar large patchset for
> > > > review but hope there are some guys with freshed-brain in new year
> > > > all over the world. :)
> > > > And most important thing is that before I dive into lots of testing,
> > > > I'd like to make an agreement on design issues and others
> > > > 
> > > > o Syscall interface
> > > 
> > > Why do we need another syscall for this?  Can't we extend madvise to
> > 
> > Yep. I should have written down the reason. Early versions of this patchset
> > used madvise with VMA handling, but performance for the ebizzy workload was
> > terrible because of the mmap_sem write-side lock needed for merging/splitting
> > VMAs. It was even worse than before, so I gave up on the VMA approach.
> > 
> > You could see the difference.
> > https://lkml.org/lkml/2013/10/8/63
> 
> So the compared kernels are 4 releases apart and the test happened
> inside a VM.  It's also not really apparent from that link what the
> tested workload is doing.  We first have to agree that it's doing
> nothing that could be avoided.  E.g. we wouldn't introduce an
> optimized version of write() because an application that writes 4G at
> one byte per call is having problems.

About the ebizzy workload: the process allocates several chunks, then
each thread allocates its own chunk and copies into it the content of a
randomly chosen preallocated chunk.
This means lots of threads are page-faulting, so the mmap_sem write-side
lock is a really critical point for performance.
(I don't know whether ebizzy is really representative of real-world use, but
at least several papers and benchmark suites have used it, so we can't
ignore it. And per-thread allocators are really popular these days.)

With the VMA approach, we need the mmap_sem write-side lock twice, to mark and
unmark VM_VOLATILE in vma->vm_flags, so in my experiment the performance was
terrible, as I said in the link.
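
A minimal sketch of the cost described above, using a toy model rather than kernel code: an address space is a sorted list of VMAs, and changing a flag on a sub-range splits the affected VMA and later merges it back. The names `AddressSpace`, `Vma` and `write_locks`, and the flag value, are hypothetical and are not taken from the patchset. The counter only stands in for "the write side of mmap_sem was taken once for this call".

```python
from dataclasses import dataclass

VM_VOLATILE = 0x1  # illustrative flag value only, not the kernel's


@dataclass
class Vma:
    start: int
    end: int  # exclusive
    flags: int = 0


class AddressSpace:
    def __init__(self, start, end):
        self.vmas = [Vma(start, end)]
        self.write_locks = 0  # modelled write-side acquisitions

    def _set_flags(self, start, end, set_mask, clear_mask):
        self.write_locks += 1  # changing vm_flags needs the write side
        out = []
        for v in self.vmas:
            if v.end <= start or v.start >= end:
                out.append(v)  # untouched VMA
                continue
            if v.start < start:  # split off the untouched head
                out.append(Vma(v.start, start, v.flags))
            lo, hi = max(v.start, start), min(v.end, end)
            out.append(Vma(lo, hi, (v.flags | set_mask) & ~clear_mask))
            if v.end > end:  # split off the untouched tail
                out.append(Vma(end, v.end, v.flags))
        # merge neighbours that ended up with identical flags
        merged = [out[0]]
        for v in out[1:]:
            last = merged[-1]
            if last.end == v.start and last.flags == v.flags:
                last.end = v.end
            else:
                merged.append(v)
        self.vmas = merged

    def mark_volatile(self, start, end):
        self._set_flags(start, end, VM_VOLATILE, 0)

    def mark_nonvolatile(self, start, end):
        self._set_flags(start, end, 0, VM_VOLATILE)
```

In this model, marking the middle of a single VMA yields three VMAs, unmarking merges them back into one, and the pair of calls bumps `write_locks` twice. Every page-faulting thread would have to wait during each of those two windows.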

I don't think the situation in the current kernel would be better than in the old one.
And virtualization is a really important technique these days, so we can't
ignore it, although I tested on a VM only for convenience. If you want,
I can certainly test it on a bare-metal box.

> 
> The vroot lock has the same locking granularity as mmap_sem.  Why is
> mmap_sem more contended in this test?

I think the explanation above answers this.

> 
> > > take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> > > in the range was purged?
> > 
> > > In that case, -ENOMEM would have two meanings, "Purged" and "Out
> > > of memory, so the system call failed midway", which could become
> > > a problem later. So we would need the return value to indicate
> > > how many bytes have succeeded so far, which means we need an additional
> > > out parameter. But yes, we can solve it by modifying the semantics and
> > > behavior (e.g., as you said below, we could just unmark volatile
> > > successfully if the user passes an (offset, len) consistent with the marked
> > > volatile ranges; IOW, if we give up the overlapping/subrange marking/unmarking
> > > use case. I expect that would simplify the code further).
> > > It was a request from John, so if he is okay with it, I have no problem.
> 
> Yes, I don't insist on using madvise.  And it's too early to decide on
> an interface before we have fully nailed the semantics and features.
> 
> > > > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > > > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> > > >   footprint.
> > > 
> > > VMAs are there to track attributes of memory ranges.  Duplicating
> > > large parts of their functionality and co-maintaining both structures
> > > on create, destroy, split, and merge means duplicate code and complex
> > > interactions.
> > > 
> > > > 1. You need to define semantics and coordinate what happens when the
> > > >    vma underlying a volatile range changes.
> > > > 
> > > >    Either you have to strictly co-maintain both range objects, or you
> > > >    have weird behavior like volatility outliving a vma and then applying
> > > >    to a separate vma created in its place.
> > > > 
> > > >    Userspace won't get this right, and even in the kernel this is
> > > >    error prone and adds a lot to the complexity of vma management.
> > 
> > > The current semantics are as follows:
> > > The vma handling logic in mm doesn't need to know about vrange handling, because
> > > vrange's internal logic always checks the validity of the vma. The
> > > only thing the vma logic has to do is clear old volatile ranges
> > > when creating a new vma.
> > (Look

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-02-03 Thread Johannes Weiner

On Mon, Feb 03, 2014 at 03:58:06PM +0100, Jan Kara wrote:
> On Fri 31-01-14 11:49:01, Johannes Weiner wrote:
> > On Wed, Jan 29, 2014 at 02:11:02PM +0900, Minchan Kim wrote:
> > > > >    The only way to make these semantics clean is either
> > > > > 
> > > > >      a) have vrange() return a range ID so that only full ranges can
> > > > >         later be marked non-volatile, or
> > > > > 
> > > > >      b) remember individual page purges so that sub-range changes can
> > > > >         properly report them
> > > > > 
> > > > >    I don't like a) much because it's somewhat arbitrarily more
> > > > >    restrictive than madvise, mprotect, mmap/munmap etc.  And for b),
> > > > >    the straight-forward solution would be to put purge-cookies into
> > > > >    the page tables to properly report purges in subrange changes, but
> > > > >    that would be even more coordination between vmas, page tables, and
> > > > >    the ad-hoc vranges.
> > > 
> > > > Agreed, but I don't want to build that accuracy into the default vrange syscall.
> > > > A page table lookup needs mmap_sem and has O(N) cost, so I'm afraid it would
> > > > make userland folks hesitant to use this system call.
> > 
> > If userspace sees nothing but cost in this system call, nothing but a
> > voluntary donation for the common good of the system, then it does not
> > matter how cheap this is, nobody will use it.  Why would they?  Even
>   I think this is a flawed logic. If you take it to the extreme then why
> each application doesn't allocate all the available memory and never free
> it? Because users will kick such application in the ass as soon as they
> have a viable alternative. So there is certainly a relatively strong
> benefit in being a good citizen on the system. But it's a matter of a
> tradeoff - if being a good citizen costs you too much (in the extreme if it
> would make the application hardly usable because it is too slow), then you
> just give up or hack it around in some other way...

Oh, that is exactly what I was trying to point out.  The argument was
basically that it has to be as cheap and lightweight as humanly
possible because applications participate voluntarily and they won't
donate memory back if it comes at a cost.

And as you said, this is flawed.  There is an incentive to give back
memory other than altruistic tendencies, namely the looming kick in
the butt.

So I very much agree that there is a trade-off to be had, but I think
the cost of the proposed implementation is not justified.

If we agree that simply not returning memory is unacceptable anyway,
providing an interface that is drastically cheaper than the current
means of returning memory is already an improvement.  Even if it's
still O(#pages).  So I think the incentive to use it is there.  We
should design it to fit into the existing VM and then optimize it,
rather than design for an (unnecessary) optimization.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-31 Thread Johannes Weiner

On Wed, Jan 29, 2014 at 02:11:02PM +0900, Minchan Kim wrote:
> It's interesting timing: I posted this patch on New Year's Day
> and received an in-depth design review on Lunar New Year's Day. :)
> It's almost 0-day review. :)

That's the only way I can do 0-day reviews ;)

> On Tue, Jan 28, 2014 at 07:03:59PM -0500, Johannes Weiner wrote:
> > Hello Minchan,
> > 
> > On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> > > Hey all,
> > > 
> > > Happy New Year!
> > > 
> > > I know it's bad timing to send this unfamiliar large patchset for
> > > review but hope there are some guys with freshed-brain in new year
> > > all over the world. :)
> > > And most important thing is that before I dive into lots of testing,
> > > I'd like to make an agreement on design issues and others
> > > 
> > > o Syscall interface
> > 
> > Why do we need another syscall for this?  Can't we extend madvise to
> 
> Yep. I should have written down the reason. Early versions of this patchset
> used madvise with VMA handling, but performance for the ebizzy workload was
> terrible because of the mmap_sem write-side lock needed for merging/splitting
> VMAs. It was even worse than before, so I gave up on the VMA approach.
> 
> You could see the difference.
> https://lkml.org/lkml/2013/10/8/63

So the compared kernels are 4 releases apart and the test happened
inside a VM.  It's also not really apparent from that link what the
tested workload is doing.  We first have to agree that it's doing
nothing that could be avoided.  E.g. we wouldn't introduce an
optimized version of write() because an application that writes 4G at
one byte per call is having problems.

The vroot lock has the same locking granularity as mmap_sem.  Why is
mmap_sem more contended in this test?

> > take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> > in the range was purged?
> 
> In that case, -ENOMEM would have two meanings, "Purged" and "Out
> of memory, so the system call failed midway", which could become
> a problem later. So we would need the return value to indicate
> how many bytes have succeeded so far, which means we need an additional
> out parameter. But yes, we can solve it by modifying the semantics and
> behavior (e.g., as you said below, we could just unmark volatile
> successfully if the user passes an (offset, len) consistent with the marked
> volatile ranges; IOW, if we give up the overlapping/subrange marking/unmarking
> use case. I expect that would simplify the code further).
> It was a request from John, so if he is okay with it, I have no problem.

Yes, I don't insist on using madvise.  And it's too early to decide on
an interface before we have fully nailed the semantics and features.
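
A minimal sketch of the ambiguity raised in the quoted paragraph, assuming a toy model and not any real interface. `madvise_style` folds both conditions into one errno-style return value, while `vrange_style` returns the number of bytes processed and reports the purge state through a separate channel. Both function names, the `Outcome` type and the byte counts are hypothetical. The thread does not show the actual vrange() signature.

```python
import errno
from typing import NamedTuple


class Outcome(NamedTuple):
    purged: bool        # some page in the range was discarded
    alloc_failed: bool  # the kernel ran out of memory mid-call


def madvise_style(outcome):
    # Single return channel: both conditions collapse into -ENOMEM,
    # so the caller cannot tell "purged" from "failed midway".
    if outcome.purged or outcome.alloc_failed:
        return -errno.ENOMEM
    return 0


def vrange_style(outcome, bytes_requested, bytes_done_before_failure):
    # Return value: bytes processed so far.
    # Purge state: separate out-parameter, modelled here as a tuple member.
    if outcome.alloc_failed:
        done = bytes_done_before_failure
    else:
        done = bytes_requested
    return done, outcome.purged
```

With the single-channel variant, `Outcome(True, False)` and `Outcome(False, True)` are indistinguishable to the caller. With the two-channel variant they come back as `(8192, True)` and `(4096, False)` for an 8192-byte request that fails after 4096 bytes.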

> > > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> > >   footprint.
> > 
> > VMAs are there to track attributes of memory ranges.  Duplicating
> > large parts of their functionality and co-maintaining both structures
> > on create, destroy, split, and merge means duplicate code and complex
> > interactions.
> > 
> > 1. You need to define semantics and coordinate what happens when the
> >    vma underlying a volatile range changes.
> > 
> >    Either you have to strictly co-maintain both range objects, or you
> >    have weird behavior like volatility outliving a vma and then applying
> >    to a separate vma created in its place.
> > 
> >    Userspace won't get this right, and even in the kernel this is
> >    error prone and adds a lot to the complexity of vma management.
> 
> The current semantics are as follows:
> The vma handling logic in mm doesn't need to know about vrange handling, because
> vrange's internal logic always checks the validity of the vma. The
> only thing the vma logic has to do is clear old volatile ranges
> when creating a new vma.
> (Look at  [PATCH v10 02/16] vrange: Clear volatility on new mmaps)
> Actually, I don't like that idea and suggested the following instead:
> https://git.kernel.org/cgit/linux/kernel/git/minchan/linux.git/commit/?h=vrange-working&id=821f58333b381fd88ee7f37fd9c472949756c74e
> But John didn't like it. I guess if VMA size really matters,
> maybe we can embed the flag somewhere in a field of the
> vma (e.g., the LSB of vm_file?).

It's not entirely clear to me how the per-VMA variable can work like
that when vmas can merge and split by other means (mprotect e.g.)

> > 2. If page reclaim discards a page from the upper end of a range,
> >    you mark the whole range as purged.  If the user later marks the
> >    lower half of the range as non-volatile, the syscall will report
> >    purged=1 even though all requested pages are still there.
> 
> True. The basic assumption is that the user should have one range
> per object, but we give the user the flexibility to handle subranges
> of a volatile range, so it might report a false positive, as you said.
> In that case, the user can use mincore(2) for accuracy if he
> wants, so he has the flexibility but loses a bit of performance.
> It's a tradeoff, IMO.

Look, we can't present a syscall that takes an exact range of bytes
and then return results that are not applicable to this range at all.

We can not make performance

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-30 Thread Johannes Weiner

On Thu, Jan 30, 2014 at 05:27:18PM -0800, John Stultz wrote:
> On 01/29/2014 10:30 AM, Johannes Weiner wrote:
> > On Tue, Jan 28, 2014 at 05:43:54PM -0800, John Stultz wrote:
> >> On 01/28/2014 04:03 PM, Johannes Weiner wrote:
> >>> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> >>>> o Syscall interface
> >>> Why do we need another syscall for this?  Can't we extend madvise to
> >>> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> >>> in the range was purged?
> >> So the madvise interface is insufficient to provide the semantics
> >> needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the
> >> NONVOLATILE call, we have to atomically unmark the volatility status of
> >> the byte range and provide the purge status, which informs the caller if
> >> any of the data in the specified range was discarded (and thus needs to
> >> be regenerated).
> >>
> >> The problem is that by clearing the range, we may need to allocate
> >> memory (possibly by splitting an existing range segment into two),
> >> which possibly could fail. Unfortunately this could happen after we've
> >> modified the volatile state of part of that range.  At this point we
> >> can't just fail, because we've modified state and we also need to return
> >> the purge status of the modified state.
> > munmap() can theoretically fail for the same reason (splitting has to
> > allocate a new vma) but it's not even documented.  The allocator does
> > not fail allocations of that order.
> >
> > I'm not sure this is good enough, but to me it sounds a bit overkill
> > to design a new system call around a non-existent problem.
> 
> I still think it's a problematic design issue. With munmap, I think
> re-calling on failure should be fine. But with _NONVOLATILE we could
> possibly lose the purge status on a second call (for instance if only
> the first page of memory was purged, but we errored out mid-call w/
> ENOMEM, on the second call it will seem like the range was successfully
> set non-volatile with no memory purged).
> 
> And even if the current allocator never ever fails, I worry at some
> point in the future that rule might change and then we'd have a broken
> interface.

Fair enough, we don't have to paint ourselves into a corner.

> >>> 2. If page reclaim discards a page from the upper end of a range,
> >>>    you mark the whole range as purged.  If the user later marks the
> >>>    lower half of the range as non-volatile, the syscall will report
> >>>    purged=1 even though all requested pages are still there.
> >> To me this aspect is a non-ideal but acceptable result of the usage 
> >> pattern.
> >>
> >> Semantically, the hard rule would be we never report non-purged if pages
> >> in a range were purged.  Reporting purged when pages technically weren't
> >> is not optimal but acceptable side effect of unmarking a sub-range. And
> >> could be avoided by applications marking and unmarking objects 
> >> consistently.
> >>
> >>
> >>>    The only way to make these semantics clean is either
> >>>
> >>>      a) have vrange() return a range ID so that only full ranges can
> >>>         later be marked non-volatile, or
> >>>
> >>>      b) remember individual page purges so that sub-range changes can
> >>>         properly report them
> >>>
> >>>    I don't like a) much because it's somewhat arbitrarily more
> >>>    restrictive than madvise, mprotect, mmap/munmap etc.
> >> Agreed on A.
> >>
> >>> And for b),
> >>>    the straight-forward solution would be to put purge-cookies into
> >>>    the page tables to properly report purges in subrange changes, but
> >>>    that would be even more coordination between vmas, page tables, and
> >>>    the ad-hoc vranges.
> >> And for B this would cause way too much overhead for the mark/unmark
> >> operations, which have to be lightweight.
> > Yes, and allocators/message passers truly don't need this because at
> > the time they set a region to volatile the contents are invalidated
> > and the non-volatile declaration doesn't give a hoot if content has
> > been destroyed.
> >
> > But caches certainly would have to know if they should regenerate the
> > contents.  And bigger areas should be using huge pages, so we'd check
> > in 2MB steps.  Is this really more expensive than regenerating the
> > contents on a false positive?
> 
> So you make a good argument. I'd counter that the false-positives are
> only caused when unmarking subranges of larger marked volatile range,
> and for use cases that would care about regenerating the contents,
> that's not a likely usage model (as they're probably going to be
> marking objects in memory volatile/nonvolatile, not just arbitrary
> ranges of pages).

I can imagine that applications have continuous areas of same-sized
objects and want to mark a whole range of them volatile in one go,
then later come back for individual objects.
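
A minimal sketch of the difference under discussion, assuming a toy model and not the patchset's data structures. `RangeLevel` keeps one purged flag per volatile range (the behaviour criticised in point 2), while `PageLevel` remembers individual purged pages (option b). The class names, the method names and the 4096-byte page size are assumptions made for illustration.

```python
PAGE = 4096  # assumed page size


class RangeLevel:
    """One purged flag for the whole volatile range."""

    def __init__(self, start, length):
        self.start, self.end = start, start + length
        self.purged = False

    def reclaim_page(self, addr):
        if self.start <= addr < self.end:
            self.purged = True  # the whole range is marked purged

    def unmark(self, start, length):
        # A sub-range unmark can only report the range-wide flag.
        return self.purged


class PageLevel:
    """Remembers which individual pages were purged."""

    def __init__(self, start, length):
        self.start, self.end = start, start + length
        self.purged_pages = set()

    def reclaim_page(self, addr):
        if self.start <= addr < self.end:
            self.purged_pages.add(addr // PAGE)

    def unmark(self, start, length):
        # Cost grows with the number of pages in the requested sub-range.
        first = start // PAGE
        last = (start + length - 1) // PAGE
        return any(p in self.purged_pages for p in range(first, last + 1))
```

If a range of eight pages loses only its last page, `RangeLevel.unmark` on the first four pages reports purged, which is the false positive. `PageLevel.unmark` reports not purged for the same request, at the price of a per-page check.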

Otherwise we'd require N adjacent objects to be marked individually
through N syscalls to create N

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-30 Thread John Stultz

On 01/29/2014 10:30 AM, Johannes Weiner wrote:
> On Tue, Jan 28, 2014 at 05:43:54PM -0800, John Stultz wrote:
>> On 01/28/2014 04:03 PM, Johannes Weiner wrote:
>>> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
>>>> o Syscall interface
>>> Why do we need another syscall for this?  Can't we extend madvise to
>>> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
>>> in the range was purged?
>> So the madvise interface is insufficient to provide the semantics
>> needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the
>> NONVOLATILE call, we have to atomically unmark the volatility status of
>> the byte range and provide the purge status, which informs the caller if
>> any of the data in the specified range was discarded (and thus needs to
>> be regenerated).
>>
>> The problem is that by clearing the range, we may need to allocate
>> memory (possibly by splitting in an existing range segment into two),
>> which possibly could fail. Unfortunately this could happen after we've
>> modified the volatile state of part of that range.  At this point we
>> can't just fail, because we've modified state and we also need to return
>> the purge status of the modified state.
> munmap() can theoretically fail for the same reason (splitting has to
> allocate a new vma) but it's not even documented.  The allocator does
> not fail allocations of that order.
>
> I'm not sure this is good enough, but to me it sounds a bit overkill
> to design a new system call around a non-existent problem.

I still think it's a problematic design issue. With munmap, I think
re-calling on failure should be fine. But with _NONVOLATILE we could
possibly lose the purge status on a second call (for instance if only
the first page of memory was purged, but we errored out mid-call w/
ENOMEM, on the second call it will seem like the range was successfully
set non-volatile with no memory purged).
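
A minimal sketch of the retry scenario described above, written as a toy Python model rather than kernel code. The `VolatilePages` class, the `fail_at` parameter, and the rule that already-processed pages stay non-volatile are all assumptions made for illustration; none of them come from the patchset.

```python
class OutOfMemory(Exception):
    """Stands in for a mid-call -ENOMEM."""


class VolatilePages:
    """Toy per-page model: each page is volatile or not, purged or not."""

    def __init__(self, npages):
        self.volatile = [True] * npages
        self.purged = [False] * npages

    def mark_nonvolatile(self, start, end, fail_at=None):
        """Unmark pages [start, end); return True if any page was purged.

        fail_at simulates an allocation failure just before that page.
        Pages processed before the failure stay non-volatile and their
        purge status is consumed (assumption of this model).
        """
        purged = False
        for page in range(start, end):
            if page == fail_at:
                raise OutOfMemory(page)
            if self.volatile[page]:
                purged |= self.purged[page]
                self.volatile[page] = False
                self.purged[page] = False
        return purged


pages = VolatilePages(4)
pages.purged[0] = True          # reclaim discarded only the first page

try:
    pages.mark_nonvolatile(0, 4, fail_at=2)
    first_call_failed = False
except OutOfMemory:
    first_call_failed = True    # the purge status of page 0 is lost here

retry_reports_purged = pages.mark_nonvolatile(0, 4)
print(first_call_failed, retry_reports_purged)  # True False
```

The retry succeeds and reports "nothing purged" even though page 0 was discarded, which is the information loss the paragraph above worries about.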

And even if the current allocator never ever fails, I worry at some
point in the future that rule might change and then we'd have a broken
interface.



>>>> o Not bind with vma split/merge logic to prevent mmap_sem cost and
>>>> o Not bind with vma split/merge logic to avoid vm_area_struct memory
>>>>   footprint.
>>> VMAs are there to track attributes of memory ranges.  Duplicating
>>> large parts of their functionality and co-maintaining both structures
>>> on create, destroy, split, and merge means duplicate code and complex
>>> interactions.
>>>
>>> 1. You need to define semantics and coordinate what happens when the
>>>    vma underlying a volatile range changes.
>>>
>>>    Either you have to strictly co-maintain both range objects, or you
>>>    have weird behavior like volatility outliving a vma and then applying
>>>    to a separate vma created in its place.
>> So indeed this is a difficult problem!  My initial approach is simply
>> when any new mapping is made, we clear the volatility of the affected
>> process memory. Admittedly this has extra overhead and Minchan has an
>> alternative here (which I'm not totally sold on yet, but may be ok). 
>> I'm almost convinced that for anonymous volatility, storing the
>> volatility in the vma would be ok, but Minchan is worried about the
>> performance overhead of the required locking for manipulating the vmas.
>>
>> For file volatility, this is more complicated, because since the
>> volatility is shared, the ranges have to be tracked against the
>> address_space structure, and can't be stored in per-process vmas. So
>> this is partially why we've kept range trees hanging off of the mm and
>> address_spaces structures, since it allows the range manipulation logic
>> to be shared in both cases.
> The fs people probably have not noticed yet what you've done to struct
> address_space / struct inode ;-) I doubt that this is mergeable in its
> current form, so we have to think about a separate mechanism for shmem
> page ranges either way.

Yea. But given the semantics will likely be *very* similar, it seems
strange to try to force separate mechanisms.

That said, in an earlier implementation I stored the range tree in a
hash so we wouldn't have to add anything to the address_space structure.
But for now I want to make it clear that the ranges are tied to the
address space (and it gives the fs folks something to notice ;).


>>>    Userspace won't get this right, and even in the kernel this is
>>>    error prone and adds a lot to the complexity of vma management.
>> Not sure exactly I understand what you mean by "userspace won't get this
>> right" ?
> I meant, userspace being responsible for keeping vranges coherent with
> its mmap and munmap operations, instead of the kernel doing it.
>
>>> 2. If page reclaim discards a page from the upper end of a range,
>>>    you mark the whole range as purged.  If the user later marks the
>>>    lower half of the range as non-volatile, the syscall will report
>>>    purged=1 even though all requested pages are still there.
>> To me

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-30 Thread Johannes Weiner

On Thu, Jan 30, 2014 at 05:27:18PM -0800, John Stultz wrote:
> On 01/29/2014 10:30 AM, Johannes Weiner wrote:
> > On Tue, Jan 28, 2014 at 05:43:54PM -0800, John Stultz wrote:
> >> On 01/28/2014 04:03 PM, Johannes Weiner wrote:
> >>> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> >>>> o Syscall interface
> >>> Why do we need another syscall for this?  Can't we extend madvise to
> >>> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> >>> in the range was purged?
> >> So the madvise interface is insufficient to provide the semantics
> >> needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the
> >> NONVOLATILE call, we have to atomically unmark the volatility status of
> >> the byte range and provide the purge status, which informs the caller if
> >> any of the data in the specified range was discarded (and thus needs to
> >> be regenerated).
> >>
> >> The problem is that by clearing the range, we may need to allocate
> >> memory (possibly by splitting in an existing range segment into two),
> >> which possibly could fail. Unfortunately this could happen after we've
> >> modified the volatile state of part of that range.  At this point we
> >> can't just fail, because we've modified state and we also need to return
> >> the purge status of the modified state.
> > munmap() can theoretically fail for the same reason (splitting has to
> > allocate a new vma) but it's not even documented.  The allocator does
> > not fail allocations of that order.
> >
> > I'm not sure this is good enough, but to me it sounds a bit overkill
> > to design a new system call around a non-existent problem.
>
> I still think its problematic design issue. With munmap, I think
> re-calling on failure should be fine. But with _NONVOLATILE we could
> possibly lose the purge status on a second call (for instance if only
> the first page of memory was purged, but we errored out mid-call w/
> ENOMEM, on the second call it will seem like the range was successfully
> set non-volatile with no memory purged).
>
> And even if the current allocator never ever fails, I worry at some
> point in the future that rule might change and then we'd have a broken
> interface.

Fair enough, we don't have to paint ourselves into a corner.

> >>> 2. If page reclaim discards a page from the upper end of a a range,
> >>>    you mark the whole range as purged.  If the user later marks the
> >>>    lower half of the range as non-volatile, the syscall will report
> >>>    purged=1 even though all requested pages are still there.
> >> To me this aspect is a non-ideal but acceptable result of the usage
> >> pattern.
> >>
> >> Semantically, the hard rule would be we never report non-purged if pages
> >> in a range were purged.  Reporting purged when pages technically weren't
> >> is not optimal but acceptable side effect of unmarking a sub-range. And
> >> could be avoided by applications marking and unmarking objects
> >> consistently.
> >>
> >>
> >>>    The only way to make these semantics clean is either
> >>>
> >>>      a) have vrange() return a range ID so that only full ranges can
> >>>      later be marked non-volatile, or
> >>>
> >>>      b) remember individual page purges so that sub-range changes can
> >>>      properly report them
> >>>
> >>>    I don't like a) much because it's somewhat arbitrarily more
> >>>    restrictive than madvise, mprotect, mmap/munmap etc.
> >> Agreed on A.
> >>
> >>> And for b),
> >>>    the straight-forward solution would be to put purge-cookies into
> >>>    the page tables to properly report purges in subrange changes, but
> >>>    that would be even more coordination between vmas, page tables, and
> >>>    the ad-hoc vranges.
> >> And for B this would cause way too much overhead for the mark/unmark
> >> operations, which have to be lightweight.
> > Yes, and allocators/message passers truly don't need this because at
> > the time they set a region to volatile the contents are invalidated
> > and the non-volatile declaration doesn't give a hoot if content has
> > been destroyed.
> >
> > But caches certainly would have to know if they should regenerate the
> > contents.  And bigger areas should be using huge pages, so we'd check
> > in 2MB steps.  Is this really more expensive than regenerating the
> > contents on a false positive?
>
> So you make a good argument. I'd counter that the false-positives are
> only caused when unmarking subranges of larger marked volatile range,
> and for use cases that would care about regenerating the contents,
> that's not a likely useage model (as they're probably going to be
> marking objects in memory volatile/nonvolatile, not just arbitrary
> ranges of pages).

I can imagine that applications have contiguous areas of same-sized
objects and want to mark a whole range of them volatile in one go,
then later come back for individual objects.

Otherwise we'd require N adjacent objects to be marked individually
through N syscalls to create N separate internal ranges, or they'd get
strange and unexpected results.

I'm agreeing with you about what's the most likely and common usecase,
but it shouldn't get too weird around the edges.
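
The tradeoff between one call over N adjacent objects and N separate calls can be sketched with a toy Python model. It assumes, as in the discussion, that each internal range carries a single purged flag; the function name `mark_and_unmark` and its parameters are hypothetical and only serve the illustration.

```python
def mark_and_unmark(n_objects, pages_per_object, purged_page, one_call):
    """Return (mark_syscalls, objects_reported_purged).

    Objects are laid out back to back. Either one range covers all of
    them (one_call=True) or each object gets its own range. A range is
    reported as purged if any page in it was discarded.
    """
    total = n_objects * pages_per_object
    if one_call:
        ranges = [(0, total)]
    else:
        ranges = [(i * pages_per_object, (i + 1) * pages_per_object)
                  for i in range(n_objects)]

    # Ranges whose single purged flag got set by reclaiming purged_page.
    purged_ranges = [r for r in ranges if r[0] <= purged_page < r[1]]

    reported = 0
    for i in range(n_objects):
        lo, hi = i * pages_per_object, (i + 1) * pages_per_object
        # Unmarking object i reports purged if it overlaps a purged range.
        if any(s < hi and lo < e for s, e in purged_ranges):
            reported += 1
    return len(ranges), reported


print(mark_and_unmark(8, 4, purged_page=30, one_call=True))   # (1, 8)
print(mark_and_unmark(8, 4, purged_page=30, one_call=False))  # (8, 1)
```

With one call, a single discarded page makes all eight objects look purged; with eight calls only the affected object does, at the cost of eight syscalls and eight range objects.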

  MADV_NONVOLATILE and

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-29 Thread Johannes Weiner

On Tue, Jan 28, 2014 at 05:43:54PM -0800, John Stultz wrote:
> On 01/28/2014 04:03 PM, Johannes Weiner wrote:
> > On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> >> o Syscall interface
> > Why do we need another syscall for this?  Can't we extend madvise to
> > take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> > in the range was purged?
> 
> So the madvise interface is insufficient to provide the semantics
> needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the
> NONVOLATILE call, we have to atomically unmark the volatility status of
> the byte range and provide the purge status, which informs the caller if
> any of the data in the specified range was discarded (and thus needs to
> be regenerated).
> 
> The problem is that by clearing the range, we may need to allocate
> memory (possibly by splitting in an existing range segment into two),
> which possibly could fail. Unfortunately this could happen after we've
> modified the volatile state of part of that range.  At this point we
> can't just fail, because we've modified state and we also need to return
> the purge status of the modified state.

munmap() can theoretically fail for the same reason (splitting has to
allocate a new vma) but it's not even documented.  The allocator does
not fail allocations of that order.

I'm not sure this is good enough, but to me it sounds a bit overkill
to design a new system call around a non-existent problem.

> >> o Not bind with vma split/merge logic to prevent mmap_sem cost and
> >> o Not bind with vma split/merge logic to avoid vm_area_struct memory
> >>   footprint.
> > VMAs are there to track attributes of memory ranges.  Duplicating
> > large parts of their functionality and co-maintaining both structures
> > on create, destroy, split, and merge means duplicate code and complex
> > interactions.
> >
> > 1. You need to define semantics and coordinate what happens when the
> >    vma underlying a volatile range changes.
> >
> >    Either you have to strictly co-maintain both range objects, or you
> >    have weird behavior like volatility outliving a vma and then applying
> >    to a separate vma created in its place.
> 
> So indeed this is a difficult problem!  My initial approach is simply
> when any new mapping is made, we clear the volatility of the affected
> process memory. Admittedly this has extra overhead and Minchan has an
> alternative here (which I'm not totally sold on yet, but may be ok). 
> I'm almost convinced that for anonymous volatility, storing the
> volatility in the vma would be ok, but Minchan is worried about the
> performance overhead of the required locking for manipulating the vmas.
>
> For file volatility, this is more complicated, because since the
> volatility is shared, the ranges have to be tracked against the
> address_space structure, and can't be stored in per-process vmas. So
> this is partially why we've kept range trees hanging off of the mm and
> address_spaces structures, since it allows the range manipulation logic
> to be shared in both cases.

The fs people probably have not noticed yet what you've done to struct
address_space / struct inode ;-) I doubt that this is mergeable in its
current form, so we have to think about a separate mechanism for shmem
page ranges either way.

> >    Userspace won't get this right, and even in the kernel this is
> >    error prone and adds a lot to the complexity of vma management.
> Not sure exactly I understand what you mean by "userspace won't get this
> right" ?

I meant, userspace being responsible for keeping vranges coherent with
its mmap and munmap operations, instead of the kernel doing it.

> > 2. If page reclaim discards a page from the upper end of a range,
> >    you mark the whole range as purged.  If the user later marks the
> >    lower half of the range as non-volatile, the syscall will report
> >    purged=1 even though all requested pages are still there.
> 
> To me this aspect is a non-ideal but acceptable result of the usage pattern.
> 
> Semantically, the hard rule would be we never report non-purged if pages
> in a range were purged.  Reporting purged when pages technically weren't
> is not optimal but acceptable side effect of unmarking a sub-range. And
> could be avoided by applications marking and unmarking objects consistently.
> 
> 
> >    The only way to make these semantics clean is either
> >
> >      a) have vrange() return a range ID so that only full ranges can
> >      later be marked non-volatile, or
> >
> >      b) remember individual page purges so that sub-range changes can
> >      properly report them
> >
> >    I don't like a) much because it's somewhat arbitrarily more
> >    restrictive than madvise, mprotect, mmap/munmap etc.
> Agreed on A.
> 
> > And for b),
> >    the straight-forward solution would be to put purge-cookies into
> >    the page tables to properly report purges in subrange changes, but
> >    that would be even more

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-28 Thread Minchan Kim

Hi Hannes,

It's interesting timing: I posted this patchset on New Year's Day
and received an in-depth design review on Lunar New Year's Day. :)
It's almost a 0-day review. :)

On Tue, Jan 28, 2014 at 07:03:59PM -0500, Johannes Weiner wrote:
> Hello Minchan,
> 
> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> > Hey all,
> > 
> > Happy New Year!
> > 
> > I know it's bad timing to send this unfamiliar large patchset for
> > review but hope there are some guys with freshed-brain in new year
> > all over the world. :)
> > And most important thing is that before I dive into lots of testing,
> > I'd like to make an agreement on design issues and others
> > 
> > o Syscall interface
> 
> Why do we need another syscall for this?  Can't we extend madvise to

Yep. I should have written down the reason. Early versions of this
patchset used madvise with VMA handling, but performance was terrible for
the ebizzy workload because VMA merging/splitting has to take mmap_sem
for writing. It was even worse than before, so I gave up on the VMA
approach.

You can see the difference here:
https://lkml.org/lkml/2013/10/8/63

It might not be a good decision, and someone might say that if mmap_sem
is really the headache, we should fix that first, but as you know well,
that is never a simple problem.
If a better idea or a final decision comes up (e.g., let's hold off
until someone fixes mmap_sem scalability), I can follow that.

> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> in the range was purged?

In that case, -ENOMEM would have two meanings: "purged" and "out of
memory, so the system call failed in the middle of processing". That
could become a problem later, so we would need to return a value that
indicates how many bytes were processed successfully, which means an
additional out parameter. But yes, we can solve it by modifying the
semantics and behavior (e.g., as you said below, we could unmark volatile
successfully only if the user passes an (offset, len) consistent with the
marked volatile ranges; IOW, if we give up the overlapping/subrange
marking/unmarking use case. I expect that would simplify the code further).
It was a request from John, so if he is okay with it, I have no problem.

And there was another reason that made reusing madvise hard:

full purging vs. partial purging

Someone who has to regenerate the full object anyway, without
considering fault/alloc/zeroing overhead (and many people wanted this
rather than partial), would want VOLATILE_RANGE_FULL purging, while
others (actually, only I) like VOLATILE_RANGE_PARTIAL for
allocators (i.e., vrange-anon) because it's fair to every process
from the allocator's POV (fault + alloc + zeroing overhead).

It's not implemented yet in this patchset, but I thought it was worth
discussing, and if we want it, the current madvise isn't enough.

> 
> > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> >   footprint.
> 
> VMAs are there to track attributes of memory ranges.  Duplicating
> large parts of their functionality and co-maintaining both structures
> on create, destroy, split, and merge means duplicate code and complex
> interactions.
> 
> 1. You need to define semantics and coordinate what happens when the
>    vma underlying a volatile range changes.
> 
>    Either you have to strictly co-maintain both range objects, or you
>    have weird behavior like volatility outliving a vma and then applying
>    to a separate vma created in its place.
> 
>    Userspace won't get this right, and even in the kernel this is
>    error prone and adds a lot to the complexity of vma management.

The current semantics are as follows:
the vma handling logic in mm doesn't need to know about vrange handling,
because vrange's internal logic always checks the validity of the vma.
The only thing the vma logic has to do is clear old volatile ranges
when creating a new vma.
(Look at [PATCH v10 02/16] vrange: Clear volatility on new mmaps)
Actually, I don't like that idea and suggested the following instead:
https://git.kernel.org/cgit/linux/kernel/git/minchan/linux.git/commit/?h=vrange-working=821f58333b381fd88ee7f37fd9c472949756c74e
But John didn't like it. I guess if VMA size really matters,
maybe we can embed the flag in some existing field of the
vma (e.g., the vm_file LSB?).

Anyway, what I want to say is that co-maintaining vma/vrange
doesn't seem to be that bad.
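
The "clear volatility on new mmaps" rule described above can be illustrated with a toy Python model. Everything here is an assumption for illustration: the `ProcessModel` class, tracking pages as sets, and in particular the choice that `munmap` leaves the volatility record untouched. It is not how the patchset is implemented.

```python
class ProcessModel:
    """Toy address space: sets of page numbers, no real VMAs."""

    def __init__(self, clear_on_mmap):
        self.clear_on_mmap = clear_on_mmap
        self.mapped = set()
        self.volatile = set()       # tracked separately from the mappings

    def mmap(self, pages):
        pages = set(pages)
        if self.clear_on_mmap:
            self.volatile -= pages  # drop stale volatility first
        self.mapped |= pages

    def munmap(self, pages):
        self.mapped -= set(pages)   # assumption: volatility is not touched

    def mark_volatile(self, pages):
        self.volatile |= set(pages) & self.mapped

    def is_volatile(self, page):
        return page in self.mapped and page in self.volatile


def stale_after_remap(clear_on_mmap):
    """Mark a mapping volatile, unmap it, then map the same pages again."""
    p = ProcessModel(clear_on_mmap)
    p.mmap(range(0, 4))
    p.mark_volatile(range(0, 4))
    p.munmap(range(0, 4))
    p.mmap(range(0, 4))             # unrelated new mapping, same address
    return p.is_volatile(0)


print(stale_after_remap(False), stale_after_remap(True))  # True False
```

Without the clearing step the new mapping silently inherits the old range's volatility, which is the "volatility outliving a vma" behavior raised in the review.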

> 
> 2. If page reclaim discards a page from the upper end of a range,
>    you mark the whole range as purged.  If the user later marks the
>    lower half of the range as non-volatile, the syscall will report
>    purged=1 even though all requested pages are still there.

True. The basic assumption is that the user has one range
per object, but we give the user the flexibility to handle subranges
of a volatile range, so it might report a false positive, as you said.
In that case, the user can use mincore(2) for accuracy if he
wants, so he keeps the flexibility but loses a bit of performance.
It's a tradeoff, IMO.
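
A rough sketch of that tradeoff as a toy Python model: the range-level purged flag is used as a cheap hint, and a slower per-page check (standing in for something like mincore(2)) runs only when the hint is positive. The `VolatileRange` class and `needs_regeneration` helper are hypothetical, and the per-page set is a simplification; a real residency check would not distinguish purged pages from pages that were merely swapped out.

```python
class VolatileRange:
    """One range with a single purged flag, as in review point 2."""

    def __init__(self, start, end):
        self.start, self.end = start, end   # page numbers, end exclusive
        self.purged = False
        self.missing = set()                # pages actually discarded

    def reclaim(self, page):
        self.missing.add(page)
        self.purged = True                  # whole range is marked purged

    def unmark(self, start, end):
        """Range-granular answer: may be a false positive for a subrange."""
        return self.purged

    def unmark_precise(self, start, end):
        """Per-page answer, the slow path a mincore-like check stands for."""
        return any(start <= p < end for p in self.missing)


def needs_regeneration(r, start, end):
    if not r.unmark(start, end):
        return False                        # cheap path: nothing was purged
    return r.unmark_precise(start, end)     # slow path only on a positive


r = VolatileRange(0, 8)
r.reclaim(7)
print(r.unmark(0, 4), needs_regeneration(r, 0, 4))  # True False
```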

> 
>    The only way to make these semantics clean is either
> 
>      a) have vrange() return a

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-28 Thread John Stultz

On 01/28/2014 04:03 PM, Johannes Weiner wrote:
> Hello Minchan,
>
> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
>> Hey all,
>>
>> Happy New Year!
>>
>> I know it's bad timing to send this unfamiliar large patchset for
>> review but hope there are some guys with freshed-brain in new year
>> all over the world. :)
>> And most important thing is that before I dive into lots of testing,
>> I'd like to make an agreement on design issues and others
>>
>> o Syscall interface
> Why do we need another syscall for this?  Can't we extend madvise to
> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> in the range was purged?

So the madvise interface is insufficient to provide the semantics
needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the
NONVOLATILE call, we have to atomically unmark the volatility status of
the byte range and provide the purge status, which informs the caller if
any of the data in the specified range was discarded (and thus needs to
be regenerated).

The problem is that by clearing the range, we may need to allocate
memory (possibly by splitting an existing range segment into two),
which possibly could fail. Unfortunately this could happen after we've
modified the volatile state of part of that range.  At this point we
can't just fail, because we've modified state and we also need to return
the purge status of the modified state.

Thus we seem to need a write-like interface, which returns the number of
bytes successfully manipulated. But we also have to return the purge
state, which we currently do via an argument pointer.
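
A minimal sketch of such a write-like interface, as a toy Python model. The `VrangeResult` type, the function names, and the `fail_after_pages` knob are assumptions for illustration and do not reflect the actual vrange() prototype; the point is only that returning both the progress and the purge state lets a caller resume without losing information.

```python
from dataclasses import dataclass

PAGE = 4096


@dataclass
class VrangeResult:
    bytes_done: int   # like write(): how far the call got
    purged: bool      # purge status of the part that was processed


def unmark_volatile(purged_pages, start, length, fail_after_pages=None):
    """Toy NONVOLATILE call that may stop early, like a short write()."""
    first = start // PAGE
    done = 0
    purged = False
    for i in range(length // PAGE):
        if fail_after_pages is not None and i == fail_after_pages:
            break                           # simulated allocation failure
        purged |= (first + i) in purged_pages
        done += 1
    return VrangeResult(done * PAGE, purged)


def unmark_all(purged_pages, start, length, fail_after_pages=None):
    """Caller loop: resume after a short result and OR the purge flags."""
    purged = False
    offset = 0
    while offset < length:
        res = unmark_volatile(purged_pages, start + offset,
                              length - offset, fail_after_pages)
        fail_after_pages = None             # assume the failure was transient
        purged |= res.purged
        offset += res.bytes_done
    return purged


print(unmark_volatile({0}, 0, 4 * PAGE, fail_after_pages=2))
print(unmark_all({0}, 0, 4 * PAGE, fail_after_pages=2))  # True
```

Because the short call still reports `purged=True` for the two pages it handled, the retry loop ends up with the correct overall answer.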

hpa suggested to create something like an madvise2 interface which would
provide the needed interface change, but would be a shared interface for
the new flags as well as the old (possibly allowing various flags to be
combined). I'm fine changing it (the interface has changed a number of
times already), but we really haven't seen much in the way of a deeper
review, so the current vrange syscall is mostly a placeholder to
demonstrate the functionality and hopefully spur discussion on the
deeper semantics of how volatile ranges should work.

>> o Not bind with vma split/merge logic to prevent mmap_sem cost and
>> o Not bind with vma split/merge logic to avoid vm_area_struct memory
>>   footprint.
> VMAs are there to track attributes of memory ranges.  Duplicating
> large parts of their functionality and co-maintaining both structures
> on create, destroy, split, and merge means duplicate code and complex
> interactions.
>
> 1. You need to define semantics and coordinate what happens when the
>    vma underlying a volatile range changes.
>
>    Either you have to strictly co-maintain both range objects, or you
>    have weird behavior like volatility outliving a vma and then applying
>    to a separate vma created in its place.

So indeed this is a difficult problem!  My initial approach is simply
that when any new mapping is made, we clear the volatility of the affected
process memory. Admittedly this has extra overhead and Minchan has an
alternative here (which I'm not totally sold on yet, but may be ok).
I'm almost convinced that for anonymous volatility, storing the
volatility in the vma would be ok, but Minchan is worried about the
performance overhead of the required locking for manipulating the vmas.

For file volatility, this is more complicated: because the
volatility is shared, the ranges have to be tracked against the
address_space structure, and can't be stored in per-process vmas. So
this is partially why we've kept range trees hanging off of the mm and
address_space structures, since it allows the range manipulation logic
to be shared in both cases.

>    Userspace won't get this right, and even in the kernel this is
>    error prone and adds a lot to the complexity of vma management.
I'm not sure I understand exactly what you mean by "userspace won't get
this right"?

>
> 2. If page reclaim discards a page from the upper end of a range,
>    you mark the whole range as purged.  If the user later marks the
>    lower half of the range as non-volatile, the syscall will report
>    purged=1 even though all requested pages are still there.

To me this aspect is a non-ideal but acceptable result of the usage pattern.

Semantically, the hard rule would be that we never report non-purged if pages
in a range were purged.  Reporting purged when pages technically weren't
is a non-optimal but acceptable side effect of unmarking a sub-range, and it
could be avoided by applications marking and unmarking objects consistently.
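
As a small illustration of that rule, the sketch below models one volatile range with a single range-wide purged flag. It is a toy model under that assumption only; `VolatileRange`, `reclaim` and `unmark` are invented names, and the `discarded` set is ground truth kept for the demo that a real interface would not have.

```python
class VolatileRange:
    """One volatile range of pages with a single range-wide purged flag."""

    def __init__(self, first_page, last_page):
        self.first_page, self.last_page = first_page, last_page
        self.purged = False       # what the interface can report
        self.discarded = set()    # ground truth, kept only for this demo

    def reclaim(self, page):
        """Model page reclaim discarding one page of the range."""
        assert self.first_page <= page <= self.last_page
        self.discarded.add(page)
        self.purged = True        # the whole range is now marked purged

    def unmark(self, first_page, last_page):
        """Return (reported, actual) purge status for a sub-range."""
        actual = any(first_page <= p <= last_page for p in self.discarded)
        return self.purged, actual


obj = VolatileRange(0, 7)
obj.reclaim(7)                    # a page from the upper end is discarded
lower_half = obj.unmark(0, 3)
upper_half = obj.unmark(4, 7)
print(lower_half, upper_half)
```

Unmarking the lower half reports purged although none of its pages were discarded (a false positive), while a sub-range that really lost a page can never be reported as intact, because any discard sets the range-wide flag.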

>    The only way to make these semantics clean is either
>
>      a) have vrange() return a range ID so that only full ranges can
>         later be marked non-volatile, or
>
>      b) remember individual page purges so that sub-range changes can
>         properly report them
>
>    I don't like a) much because it's somewhat arbitrarily more
>    restrictive than madvise, mprotect, mmap/munmap etc.
Agreed on a).

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-28 Thread Johannes Weiner

Hello Minchan,

On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> Hey all,
> 
> Happy New Year!
> 
> I know it's bad timing to send this unfamiliar large patchset for
> review but hope there are some guys with freshed-brain in new year
> all over the world. :)
> And most important thing is that before I dive into lots of testing,
> I'd like to make an agreement on design issues and others
> 
> o Syscall interface

Why do we need another syscall for this?  Can't we extend madvise to
take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
in the range was purged?

> o Not bind with vma split/merge logic to prevent mmap_sem cost and
> o Not bind with vma split/merge logic to avoid vm_area_struct memory
>   footprint.

VMAs are there to track attributes of memory ranges.  Duplicating
large parts of their functionality and co-maintaining both structures
on create, destroy, split, and merge means duplicate code and complex
interactions.

1. You need to define semantics and coordinate what happens when the
   vma underlying a volatile range changes.

   Either you have to strictly co-maintain both range objects, or you
   have weird behavior like volatility outliving a vma and then applying
   to a separate vma created in its place.

   Userspace won't get this right, and even in the kernel this is
   error prone and adds a lot to the complexity of vma management.

2. If page reclaim discards a page from the upper end of a range,
   you mark the whole range as purged.  If the user later marks the
   lower half of the range as non-volatile, the syscall will report
   purged=1 even though all requested pages are still there.

   The only way to make these semantics clean is either

     a) have vrange() return a range ID so that only full ranges can
        later be marked non-volatile, or

     b) remember individual page purges so that sub-range changes can
        properly report them

   I don't like a) much because it's somewhat arbitrarily more
   restrictive than madvise, mprotect, mmap/munmap etc.  And for b),
   the straight-forward solution would be to put purge-cookies into
   the page tables to properly report purges in subrange changes, but
   that would be even more coordination between vmas, page tables, and
   the ad-hoc vranges.

3. Page reclaim usually happens on individual pages until an
   allocation can be satisfied, but the shrinker purges entire ranges.

   Should it really take out an entire 1G volatile range even though 4
   pages would have been enough to satisfy an allocation?  Sure, we
   assume a range represents a single "object" and userspace would
   have to regenerate the whole thing with only one page missing, but
   there is still a massive difference in page frees, faults, and
   allocations.
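
The size of that difference can be worked out directly. The arithmetic below assumes 4 KiB base pages; the 1G range and the 4 needed pages are the numbers from the example above.

```python
PAGE_SIZE = 4096           # assumed 4 KiB base pages
RANGE_BYTES = 1 << 30      # the 1G volatile range from the example
PAGES_NEEDED = 4           # pages that would satisfy the allocation

pages_in_range = RANGE_BYTES // PAGE_SIZE
extra_pages = pages_in_range - PAGES_NEEDED

print(f"pages purged with whole-range purging: {pages_in_range}")
print(f"pages purged beyond what was needed:   {extra_pages}")
print(f"ratio: {pages_in_range // PAGES_NEEDED}x")
```

Under these assumptions, purging the whole range frees 262144 pages where 4 would have done, and each of those pages later costs a fault, an allocation and zeroing if the application touches the range again.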

There needs to be a *really* good argument why VMAs are not enough for
this purpose.  I would really like to see anon volatility implemented
as a VMA attribute, and have regular reclaim decide based on rmap of
individual pages whether it needs to swap or purge.  Something like
this:

MADV_VOLATILE:
  split vma if necessary
  set VM_VOLATILE

MADV_NONVOLATILE:
  clear VM_VOLATILE
  merge vma if possible
  pte walk to check for pmd_purged()/pte_purged()
  return any_purged

shrink_page_list():
  if PageAnon:
    if try_to_purge_anon():
      page_lock_anon_vma_read()
      anon_vma_interval_tree_foreach:
        if vma->vm_flags & VM_VOLATILE:
          lock page table
          unmap page
          set_pmd_purged() / set_pte_purged()
          unlock page table
      page_unlock_anon_vma_read()
  ...
  try to reclaim
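
The pseudocode above can be played through in a toy userspace model. This is only a sketch of the idea, not kernel code: `Mm`, `Vma`, the `VM_VOLATILE` value and the page states are invented for illustration, pages are plain indices, and clearing the purge marker on MADV_NONVOLATILE is an assumption the proposal does not spell out.

```python
VM_VOLATILE = 0x1                      # hypothetical flag value
PRESENT, PURGED, NONE = "present", "purged", "none"


class Vma:
    def __init__(self, start, end, flags=0):
        self.start, self.end, self.flags = start, end, flags


class Mm:
    """Address space of npages pages, one vma to begin with."""

    def __init__(self, npages):
        self.vmas = [Vma(0, npages)]
        self.ptes = [PRESENT] * npages

    def _split_at(self, addr):
        for i, v in enumerate(self.vmas):
            if v.start < addr < v.end:
                self.vmas[i:i + 1] = [Vma(v.start, addr, v.flags),
                                      Vma(addr, v.end, v.flags)]
                return

    def _merge(self):
        merged = [self.vmas[0]]
        for v in self.vmas[1:]:
            last = merged[-1]
            if last.end == v.start and last.flags == v.flags:
                last.end = v.end       # adjacent and equal: merge
            else:
                merged.append(v)
        self.vmas = merged

    def _set_flags(self, start, end, set_bits, clear_bits):
        self._split_at(start)          # split vma if necessary
        self._split_at(end)
        for v in self.vmas:
            if start <= v.start and v.end <= end:
                v.flags = (v.flags | set_bits) & ~clear_bits
        self._merge()                  # merge vma if possible

    def madvise_volatile(self, start, end):
        self._set_flags(start, end, VM_VOLATILE, 0)

    def madvise_nonvolatile(self, start, end):
        self._set_flags(start, end, 0, VM_VOLATILE)
        any_purged = False
        for page in range(start, end):  # pte walk over the request only
            if self.ptes[page] == PURGED:
                any_purged = True
                self.ptes[page] = NONE  # assumption: marker is consumed
        return any_purged

    def reclaim(self, page):
        """Purge the page if its vma is volatile; return True if purged."""
        for v in self.vmas:
            if v.start <= page < v.end and v.flags & VM_VOLATILE:
                self.ptes[page] = PURGED
                return True
        return False                   # non-volatile: would be swapped instead


mm = Mm(8)
mm.madvise_volatile(0, 8)
mm.reclaim(7)                              # purge a page at the upper end
lower_purged = mm.madvise_nonvolatile(0, 4)
vmas_after_lower = len(mm.vmas)
upper_purged = mm.madvise_nonvolatile(4, 8)
print(lower_purged, vmas_after_lower, upper_purged, len(mm.vmas))
```

Because the purge marker lives with the individual page, unmarking the lower half reports no purge even though a page in the upper half is gone. The price is the vma split and merge on every mark and unmark.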

> o Purging logic - when we trigger purging volatile pages to prevent
>   working set and stop to prevent too excessive purging of volatile
>   pages
> o How to test
>   Currently, we have a patched jemalloc allocator by Jason's help
>   although it's not perfect and more rooms to be enhanced but IMO,
>   it's enough to prove vrange-anonymous. The problem is that
>   lack of benchmark for testing vrange-file side. I hope that
>   Mozilla folks can help.
> 
> So its been a while since the last release of the volatile ranges
> patches, again. I and John have been busy with other things.
> Still, we have been slowly chipping away at issues and differences
> trying to get a patchset that we both agree on.
> 
> There's still a few issues, but we figured any further polishing of
> the patch series in private would be unproductive and it would be much
> better to send the patches out for review and comment and get some wider
> opinions.
> 
> You could get full patchset by git
> 
> git clone -b vrange-v10-rc5 --single-branch 
> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
> 
> In v10, there are some notable changes following as
> 
> Whats new in v10:
> * Fix several bugs and build break
> * Add shmem_purge_page to correct purging shmem/tmpfs
> * Replace slab shrinker with direct hooked reclaim path
> * Optimize pte scanning by caching previous place
> * Reorder patch and tidy up Cc-list

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-28 Thread Minchan Kim

Hi Hannes,

It's interesting timing: I posted this patch on New Year's Day
and received an in-depth design review on Lunar New Year's Day. :)
It's almost a 0-day review. :)

On Tue, Jan 28, 2014 at 07:03:59PM -0500, Johannes Weiner wrote:
> Hello Minchan,
>
> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> > Hey all,
> >
> > Happy New Year!
> >
> > o Syscall interface
>
> Why do we need another syscall for this?  Can't we extend madvise to

You could see the difference.
https://lkml.org/lkml/2013/10/8/63

It might not be a good decision, and someone might say that if mmap_sem
is really a headache, please fix that first, but as you know well,
it's never a simple problem.
If a better idea or a final decision comes up (e.g., let's hold
until someone fixes mmap_sem scalability), I can follow that.

> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> in the range was purged?

And there was another reason that makes reusing madvise hard:

full purging vs. partial purging

If someone has to regenerate the full object regardless of the
fault/alloc/zeroing cost (and many people wanted this rather than
partial), he would want VOLATILE_RANGE_FULL purging, while
others (actually, only I) like VOLATILE_RANGE_PARTIAL for
allocators (i.e., vrange-anon) because it's fair to every process
from the allocator's POV (fault + alloc + zeroing overhead).

It's not implemented yet in this patchset, but I thought it was worth
discussing, and if we want it, the current madvise isn't enough.

> > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> >   footprint.
>
> 1. You need to define semantics and coordinate what happens when the
>    vma underlying a volatile range changes.
>
>    Either you have to strictly co-maintain both range objects, or you
>    have weird behavior like volatility outliving a vma and then applying
>    to a separate vma created in its place.
>
>    Userspace won't get this right, and even in the kernel this is
>    error prone and adds a lot to the complexity of vma management.

Anyway, what I want to say is that co-maintaining vma/vrange
doesn't seem to be that bad.

>    The only way to make these semantics clean is either
>
>      a) have vrange() return a range ID so that only full ranges can
>         later be marked

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-27 Thread Minchan Kim

On Mon, Jan 27, 2014 at 05:09:59PM -0800, Taras Glek wrote:
> 
> 
> John Stultz wrote:
> >On 01/27/2014 04:12 PM, Minchan Kim wrote:
> >>On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
> >>>- Your number only claimed the effectiveness anon vrange, but not file 
> >>>vrange.
> >>Yes. It's really problem as I said.
> >> From the beginning, John Stultz wanted to promote vrange-file to replace
> >>android's ashmem and when I heard usecase of vrange-file, it does make sense
> >>to me so that's why I'd like to unify them in a same interface.
> >>
> >>But the problem is lack of interesting from others and lack of time to
> >>test/evaluate it. I'm not an expert of userspace so actually I need a bit
> >>help from them who require the feature but at a moment,
> >>but I don't know who really want or/and help it.
> >>
> >>Even, Android folks didn't have any interest on vrange-file.
> >
> >Just as a correction here. I really don't think this is the case, as
> >Android's use definitely relies on file based volatility. It might be
> >more fair to say there hasn't been very much discussion from Android
> >developers on the particulars of the file volatility semantics (out of
> >possibly not having any particular objections, or more likely, being a
> >bit too busy to follow all the various theoretical tangents we've
> >discussed).
> >
> >But I'd not want anyone to get the impression that anonymous-only
> >volatility would be sufficient for Android's needs.
> Mozilla is starting to use android's ashmem for discardable memory
> within a single process:
> https://bugzilla.mozilla.org/show_bug.cgi?id=748598 .
> 
> Volatile ranges do help with that specific(uncommon?) use of ashmem.

Thanks for the info.

I'd like to ask a question.
Do you prefer an fvrange(fd, offset, len) or fadvise(fd, offset, len, advise)
interface over the current vrange syscall interface for vrange-file?

I ask because I think it would remove the unnecessary mmap/munmap syscalls for
the vrange interface, and would also avoid running out of address space on
32-bit machines.

> 
> For Mozilla sharing memory across processes via ashmem is not a
> nearterm project. It's something that is likely to require
> significant rework. Process-local discardable memory can be
> retrofited in a more straight-forward fashion.
> 
> Taras

-- 
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-27 Thread Minchan Kim

On Mon, Jan 27, 2014 at 04:42:27PM -0800, John Stultz wrote:
> On 01/27/2014 04:12 PM, Minchan Kim wrote:
> > On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
> >> - Your number only claimed the effectiveness anon vrange, but not file 
> >> vrange.
> > Yes. It's really problem as I said.
> > From the beginning, John Stultz wanted to promote vrange-file to replace
> > android's ashmem and when I heard usecase of vrange-file, it does make sense
> > to me so that's why I'd like to unify them in a same interface.
> >
> > But the problem is lack of interesting from others and lack of time to
> > test/evaluate it. I'm not an expert of userspace so actually I need a bit
> > help from them who require the feature but at a moment,
> > but I don't know who really want or/and help it.
> >
> > Even, Android folks didn't have any interest on vrange-file.
> 
> Just as a correction here. I really don't think this is the case, as
> Android's use definitely relies on file based volatility. It might be
> more fair to say there hasn't been very much discussion from Android
> > developers on the particulars of the file volatility semantics (out of
> > possibly not having any particular objections, or more likely, being a
> > bit too busy to follow all the various theoretical tangents we've
> discussed).
> 
> But I'd not want anyone to get the impression that anonymous-only
> volatility would be sufficient for Android's needs.

Right. Thanks for the correction.

> 
> 
> (And to further clarify here, since this can be confusing... 
> shmem/tmpfs-only file volatility *would* be sufficient, despite that
> technically being anonymous backed memory. The key issue is we need to
> be able to share the volatility between processes.)
> 
> 
> > So, we might drop vrange-file part in this patchset if it's really headache.
> > But let's discuss further because still I believe it's valuable feature to
> > keep instead of dropping.
> 
> If it helps gets interest in reviewing this, I'm ok with deferring
> (tmpfs) file volatility, so folks can get comfortable with anonymous
> volatility. But I worry its too critical a feature to ignore.

Yes. I don't want to drop it without more discussion with real user
of it but the problem is it's very hard to find one to have extra time
to discuss it.


> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-27 Thread John Stultz

On 01/27/2014 04:12 PM, Minchan Kim wrote:
> On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
>> - Your number only claimed the effectiveness anon vrange, but not file 
>> vrange.
> Yes. It's really problem as I said.
> From the beginning, John Stultz wanted to promote vrange-file to replace
> android's ashmem and when I heard usecase of vrange-file, it does make sense
> to me so that's why I'd like to unify them in a same interface.
>
> But the problem is lack of interesting from others and lack of time to
> test/evaluate it. I'm not an expert of userspace so actually I need a bit
> help from them who require the feature but at a moment,
> but I don't know who really want or/and help it.
>
> Even, Android folks didn't have any interest on vrange-file.

Just as a correction here. I really don't think this is the case, as
Android's use definitely relies on file based volatility. It might be
more fair to say there hasn't been very much discussion from Android
developers on the particulars of the file volatility semantics (out of
possibly not having any particular objections, or more likely, being a
bit too busy to follow all the various theoretical tangents we've
discussed).

But I'd not want anyone to get the impression that anonymous-only
volatility would be sufficient for Android's needs.

(And to further clarify here, since this can be confusing... 
shmem/tmpfs-only file volatility *would* be sufficient, despite that
technically being anonymous backed memory. The key issue is we need to
be able to share the volatility between processes.)

> So, we might drop vrange-file part in this patchset if it's really headache.
> But let's discuss further because still I believe it's valuable feature to
> keep instead of dropping.

If it helps gets interest in reviewing this, I'm ok with deferring
(tmpfs) file volatility, so folks can get comfortable with anonymous
volatility. But I worry its too critical a feature to ignore.

thanks
-john


Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-27 Thread Minchan Kim

Hey KOSAKI,

On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
> Hi Minchan,
> 
> 
> On Thu, Jan 2, 2014 at 2:12 AM, Minchan Kim  wrote:
> > Hey all,
> >
> > Happy New Year!
> >
> > I know it's bad timing to send this unfamiliar large patchset for
> > review but hope there are some guys with freshed-brain in new year
> > all over the world. :)
> > And most important thing is that before I dive into lots of testing,
> > I'd like to make an agreement on design issues and others
> >
> > o Syscall interface
> > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> >   footprint.
> > o Purging logic - when we trigger purging volatile pages to prevent
> >   working set and stop to prevent too excessive purging of volatile
> >   pages
> > o How to test
> >   Currently, we have a patched jemalloc allocator by Jason's help
> >   although it's not perfect and more rooms to be enhanced but IMO,
> >   it's enough to prove vrange-anonymous. The problem is that
> >   lack of benchmark for testing vrange-file side. I hope that
> >   Mozilla folks can help.
> >
> > So its been a while since the last release of the volatile ranges
> > patches, again. I and John have been busy with other things.
> > Still, we have been slowly chipping away at issues and differences
> > trying to get a patchset that we both agree on.
> >
> > There's still a few issues, but we figured any further polishing of
> > the patch series in private would be unproductive and it would be much
> > better to send the patches out for review and comment and get some wider
> > opinions.
> >
> > You could get full patchset by git
> >
> > git clone -b vrange-v10-rc5 --single-branch 
> > git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
> 
> Brief comments.
> 
> - You should provide jemalloc patch too. Otherwise we cannot

I did. :) It seems you missed below in this description.
You could see it via following URL in Dhaval's test suite.

https://github.com/volatile-ranges-test/vranges-test/blob/master/0001-Implement-experimental-mvolatile-2-mnovolatile-2-sup.patch

Dhaval: Pz, could you merge patches John sent in your test suite?
I just pinged you.

But KOSAKI, please don't focus on jemalloc's implementation.
It's not about how efficiently jemalloc uses volatile ranges; it's just
one example of how to use volatile ranges.
I think volatile ranges could be really useful for garbage collection
of custom allocators(ex, In-memory DB, JVM, Dalvik, v8) as well as
general allocators.

> understand what the your mesurement mean.

> - Your number only claimed the effectiveness anon vrange, but not file vrange.

Yes. It's really a problem, as I said.
From the beginning, John Stultz wanted to promote vrange-file to replace
Android's ashmem, and when I heard the use case for vrange-file, it made sense
to me, so that's why I'd like to unify them in the same interface.

But the problem is a lack of interest from others and a lack of time to
test/evaluate it. I'm not a userspace expert, so I need a bit of
help from those who require the feature, but at the moment
I don't know who really wants it and/or can help.

Even Android folks didn't show any interest in vrange-file.
So we might drop the vrange-file part from this patchset if it's really a headache.
But let's discuss further, because I still believe it's a valuable feature to
keep rather than drop.

I want dropping vrange-file to be a truly last resort for making forward
progress on vrange-anon.

> - Still, Nobody likes file vrange. At least nobody said explicitly on
> the list. I don't ack file vrange part until
>   I fully convinced Pros/Cons. You need to persuade other MM guys if
> you really think anon vrange is not
>   sufficient. (Maybe LSF is the best place)
> - I wrote you need to put a mesurement current implementation vs
> VMA-based implementation at several
>   previous iteration. Because You claimed fast, but no number and you
> haven't yet. I guess the reason is

I did. :) Look at the number.
https://lkml.org/lkml/2013/10/8/63

The point is that we need mmap_sem's write-side lock for vma handling (e.g.,
merge/split), and that is a real bottleneck for ebizzy, where another
thread wants to malloc (i.e., mmap of a new chunk also requires mmap_sem's
write-side lock).

Additionally, some users want to handle vranges at fine granularity (in the
worst case, PAGE_SIZE), so VMA handling would be real overhead
for us.
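
The granularity concern can be illustrated with a simple count. The sketch assumes that volatility is stored as a per-vma flag and that adjacent pages with equal flags always share one vma (real merging depends on more than one flag); `vma_count` is a name invented here.

```python
from itertools import groupby


def vma_count(page_flags):
    """One VMA per maximal run of adjacent pages with equal flags."""
    return sum(1 for _ in groupby(page_flags))


NPAGES = 1024
nothing_marked = [0] * NPAGES
one_block = [0] * 256 + [1] * 512 + [0] * 256
every_other_page = [i % 2 for i in range(NPAGES)]

for name, flags in [("nothing marked", nothing_marked),
                    ("one 512-page block", one_block),
                    ("every other page", every_other_page)]:
    print(name, vma_count(flags))
```

Marking one block costs two extra vmas, but page-granular marking in the worst case needs one vma per page, each created by a split done under the write-side lock.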

>   you don't have any access to large machine. If so, I'll offer it.
> Plz collaborate with us.

Yes, Yes, Yes. That's what I want and you're really proper person to
collaborate. Pz, ping me if you're ready. :)

> 
> Unfortunately, I'm very busy and I didn't have a chance to review your
> latest patch yet. But I'll finish it until
> mm summit. And, I'll show you guys how much this patch improve glibc malloc 
> too.

Cool! It's really helpful for this work. I believe it's a really
helpful feature for Linux, so I never want to drop this feature just
because of a lack of interest from other MM guys who are

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-27 Thread John Stultz

On 01/27/2014 02:23 PM, KOSAKI Motohiro wrote:
> Hi Minchan,
>
>
> On Thu, Jan 2, 2014 at 2:12 AM, Minchan Kim  wrote:
>> Hey all,
>>
>> Happy New Year!
>>
>> I know it's bad timing to send this unfamiliar large patchset for
>> review but hope there are some guys with freshed-brain in new year
>> all over the world. :)
>> And most important thing is that before I dive into lots of testing,
>> I'd like to make an agreement on design issues and others
>>
>> o Syscall interface
>> o Not bind with vma split/merge logic to prevent mmap_sem cost and
>> o Not bind with vma split/merge logic to avoid vm_area_struct memory
>>   footprint.
>> o Purging logic - when we trigger purging volatile pages to prevent
>>   working set and stop to prevent too excessive purging of volatile
>>   pages
>> o How to test
>>   Currently, we have a patched jemalloc allocator by Jason's help
>>   although it's not perfect and more rooms to be enhanced but IMO,
>>   it's enough to prove vrange-anonymous. The problem is that
>>   lack of benchmark for testing vrange-file side. I hope that
>>   Mozilla folks can help.
>>
>> So its been a while since the last release of the volatile ranges
>> patches, again. I and John have been busy with other things.
>> Still, we have been slowly chipping away at issues and differences
>> trying to get a patchset that we both agree on.
>>
>> There's still a few issues, but we figured any further polishing of
>> the patch series in private would be unproductive and it would be much
>> better to send the patches out for review and comment and get some wider
>> opinions.
>>
>> You could get full patchset by git
>>
>> git clone -b vrange-v10-rc5 --single-branch 
>> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
> Brief comments.
>
> - You should provide jemalloc patch too. Otherwise we cannot
> understand what the your mesurement mean.
> - Your number only claimed the effectiveness anon vrange, but not file vrange.
> - Still, Nobody likes file vrange. At least nobody said explicitly on
> the list. I don't ack file vrange part until
>   I fully convinced Pros/Cons. You need to persuade other MM guys if
> you really think anon vrange is not
>   sufficient. (Maybe LSF is the best place)

I do agree that the semantics for volatile-ranges on files is more
difficult for folks to grasp (and like after doing so). I've almost
gotten to the point (as I've discussed with Minchan privately) where I'm
willing to hold back on volatile-ranges on files in the short-term just
to see if it helps to get key mm folks to review and comment on the
volatile-ranges on anonymous memory.

That said, I do think volatile ranges on files is an important concept,
and I'd like to make sure we don't design something that can't be used
for files in the future.

Part of the major interest in volatile memory has been from web
browsers. Both Chrome and Firefox are already making use of the
file-based ashmem, where available, in order to have this "discardable
memory" feature.

And while the Mozilla developers don't see file based volatile memory as
critical right now for their needs, I can imagine as they continue to
work on multi-process firefox
(http://billmccloskey.wordpress.com/2013/12/05/multiprocess-firefox/)
for performance and security reasons, the need to have memory volatility
shared between processes will become more important.

thanks
-john

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-27 Thread KOSAKI Motohiro

Hi Minchan,


On Thu, Jan 2, 2014 at 2:12 AM, Minchan Kim  wrote:
> Hey all,
>
> Happy New Year!
>
> I know it's bad timing to send this unfamiliar large patchset for
> review but hope there are some guys with freshed-brain in new year
> all over the world. :)
> And most important thing is that before I dive into lots of testing,
> I'd like to make an agreement on design issues and others
>
> o Syscall interface
> o Not bind with vma split/merge logic to prevent mmap_sem cost and
> o Not bind with vma split/merge logic to avoid vm_area_struct memory
>   footprint.
> o Purging logic - when we trigger purging volatile pages to prevent
>   working set and stop to prevent too excessive purging of volatile
>   pages
> o How to test
>   Currently, we have a patched jemalloc allocator by Jason's help
>   although it's not perfect and more rooms to be enhanced but IMO,
>   it's enough to prove vrange-anonymous. The problem is that
>   lack of benchmark for testing vrange-file side. I hope that
>   Mozilla folks can help.
>
> So its been a while since the last release of the volatile ranges
> patches, again. I and John have been busy with other things.
> Still, we have been slowly chipping away at issues and differences
> trying to get a patchset that we both agree on.
>
> There's still a few issues, but we figured any further polishing of
> the patch series in private would be unproductive and it would be much
> better to send the patches out for review and comment and get some wider
> opinions.
>
> You could get full patchset by git
>
> git clone -b vrange-v10-rc5 --single-branch 
> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git

Brief comments.

- You should provide the jemalloc patch too. Otherwise we cannot
  understand what your measurements mean.
- Your numbers only show the effectiveness of anon vrange, not file vrange.
- Still, nobody likes file vrange. At least nobody has said so explicitly on
  the list. I won't ack the file vrange part until I'm fully convinced of the
  pros/cons. You need to persuade the other MM guys if you really think anon
  vrange is not sufficient. (Maybe LSF is the best place.)
- Over several previous iterations I wrote that you need to include a
  measurement of the current implementation vs. a VMA-based implementation,
  because you claimed it is fast but gave no numbers, and you still haven't.
  I guess the reason is that you don't have access to a large machine. If so,
  I'll offer one. Plz collaborate with us.

Unfortunately, I'm very busy and haven't had a chance to review your
latest patch yet. But I'll finish it by the mm summit. And I'll show you
guys how much this patch improves glibc malloc too.

The glibc folks and I agreed we push vrange into glibc malloc.

https://sourceware.org/ml/libc-alpha/2013-12/msg00343.html

Even so, I still dislike some aspects of this patch. I'd like to
discuss them and make better design decisions with you.

Thanks.


>
> In v10, there are some notable changes following as
>
> Whats new in v10:
> * Fix several bugs and build break
> * Add shmem_purge_page to correct purging shmem/tmpfs
> * Replace slab shrinker with direct hooked reclaim path
> * Optimize pte scanning by caching previous place
> * Reorder patch and tidy up Cc-list
> * Rebased on v3.12
> * Add vrange-anon test with jemalloc in Dhaval's test suite
>   - https://github.com/volatile-ranges-test/vranges-test
>   so, you could test any application with vrange-patched jemalloc by
>   LD_PRELOAD but please keep in mind that it's just a prototype to
>   prove vrange syscall concept so it has more rooms to optimize.
>   So, please do not compare it with another allocator.
>
> Whats new in v9:
> * Updated to v3.11
> * Added vrange purging logic to purge anonymous pages on
>   swapless systems
> * Added logic to allocate the vroot structure dynamically
>   to avoid added overhead to mm and address_space structures
> * Lots of minor tweaks, changes and cleanups
>
> Still TODO:
> * Sort out better solution for clearing volatility on new mmaps
> - Minchan has a different approach here
> * Agreement of systemcall interface
> * Better discarding trigger policy to prevent working set evction
> * Review, Review, Review.. Comment.
> * A ton of test
>
> Feedback or thoughts here would be particularly helpful!
>
> Also, thanks to Dhaval for his maintaining and vastly improving
> the volatile ranges test suite, which can be found here:
> [1] https://github.com/volatile-ranges-test/vranges-test
>
> These patches can also be pulled from git here:
> git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-v9
>
> We'd really welcome any feedback and comments on the patch series.



Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-27 Thread John Stultz

On 01/27/2014 04:12 PM, Minchan Kim wrote:
> On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
>> - Your number only claimed the effectiveness anon vrange, but not file
>> vrange.
> Yes. It's really problem as I said.
> From the beginning, John Stultz wanted to promote vrange-file to replace
> android's ashmem and when I heard usecase of vrange-file, it does make sense
> to me so that's why I'd like to unify them in a same interface.
>
> But the problem is lack of interesting from others and lack of time to
> test/evaluate it. I'm not an expert of userspace so actually I need a bit
> help from them who require the feature but at a moment,
> but I don't know who really want or/and help it.
>
> Even, Android folks didn't have any interest on vrange-file.

Just as a correction here. I really don't think this is the case, as
Android's use definitely relies on file based volatility. It might be
more fair to say there hasn't been very much discussion from Android
developers on the particulars of the file volatility semantics (out
of possibly not having any particular objections, or more likely, being a
bit too busy to follow all the various theoretical tangents we've
discussed).

But I'd not want anyone to get the impression that anonymous-only
volatility would be sufficient for Android's needs.


(And to further clarify here, since this can be confusing... 
shmem/tmpfs-only file volatility *would* be sufficient, despite that
technically being anonymous backed memory. The key issue is we need to
be able to share the volatility between processes.)


> So, we might drop vrange-file part in this patchset if it's really headache.
> But let's discuss further because still I believe it's valuable feature to
> keep instead of dropping.

If it helps get interest in reviewing this, I'm ok with deferring
(tmpfs) file volatility, so folks can get comfortable with anonymous
volatility. But I worry it's too critical a feature to ignore.

thanks
-john


Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-27 Thread Minchan Kim

On Mon, Jan 27, 2014 at 04:42:27PM -0800, John Stultz wrote:
> On 01/27/2014 04:12 PM, Minchan Kim wrote:
> > On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
> >> - Your number only claimed the effectiveness anon vrange, but not file
> >> vrange.
> > Yes. It's really problem as I said.
> > From the beginning, John Stultz wanted to promote vrange-file to replace
> > android's ashmem and when I heard usecase of vrange-file, it does make sense
> > to me so that's why I'd like to unify them in a same interface.
> >
> > But the problem is lack of interesting from others and lack of time to
> > test/evaluate it. I'm not an expert of userspace so actually I need a bit
> > help from them who require the feature but at a moment,
> > but I don't know who really want or/and help it.
> >
> > Even, Android folks didn't have any interest on vrange-file.
>
> Just as a correction here. I really don't think this is the case, as
> Android's use definitely relies on file based volatility. It might be
> more fair to say there hasn't been very much discussion from Android
> developers on the particulars of the file volatility semantics (out
> possibly not having any particular objections, or more-likely, being a
> bit too busy to follow the all various theoretical tangents we've
> discussed).
>
> But I'd not want anyone to get the impression that anonymous-only
> volatility would be sufficient for Android's needs.

Right. Thanks for the correction.

 
 
> (And to further clarify here, since this can be confusing...
> shmem/tmpfs-only file volatility *would* be sufficient, despite that
> technically being anonymous backed memory. The key issue is we need to
> be able to share the volatility between processes.)
>
>
> > So, we might drop vrange-file part in this patchset if it's really headache.
> > But let's discuss further because still I believe it's valuable feature to
> > keep instead of dropping.
>
> If it helps gets interest in reviewing this, I'm ok with deferring
> (tmpfs) file volatility, so folks can get comfortable with anonymous
> volatility. But I worry its too critical a feature to ignore.

Yes. I don't want to drop it without more discussion with real users
of it, but the problem is that it's very hard to find one with the spare
time to discuss it.


 
> thanks
> -john
 

-- 
Kind regards,
Minchan Kim

Re: [PATCH v10 00/16] Volatile Ranges v10

2014-01-27 Thread Minchan Kim

On Mon, Jan 27, 2014 at 05:09:59PM -0800, Taras Glek wrote:
 
 
> John Stultz wrote:
> > On 01/27/2014 04:12 PM, Minchan Kim wrote:
> >> On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
> >>> - Your number only claimed the effectiveness anon vrange, but not file
> >>> vrange.
> >> Yes. It's really problem as I said.
> >> From the beginning, John Stultz wanted to promote vrange-file to replace
> >> android's ashmem and when I heard usecase of vrange-file, it does make sense
> >> to me so that's why I'd like to unify them in a same interface.
> >>
> >> But the problem is lack of interesting from others and lack of time to
> >> test/evaluate it. I'm not an expert of userspace so actually I need a bit
> >> help from them who require the feature but at a moment,
> >> but I don't know who really want or/and help it.
> >>
> >> Even, Android folks didn't have any interest on vrange-file.
> >
> > Just as a correction here. I really don't think this is the case, as
> > Android's use definitely relies on file based volatility. It might be
> > more fair to say there hasn't been very much discussion from Android
> > developers on the particulars of the file volatility semantics (out
> > possibly not having any particular objections, or more-likely, being a
> > bit too busy to follow the all various theoretical tangents we've
> > discussed).
> >
> > But I'd not want anyone to get the impression that anonymous-only
> > volatility would be sufficient for Android's needs.
> Mozilla is starting to use android's ashmem for discardable memory
> within a single process:
> https://bugzilla.mozilla.org/show_bug.cgi?id=748598 .
>
> Volatile ranges do help with that specific(uncommon?) use of ashmem.

Thanks for the info.

I'd like to ask a question.
For vrange-file, would you prefer an fvrange(fd, offset, len) or
fadvise(fd, offset, len, advise) interface over the current vrange
syscall interface?

I ask because I think it would remove the otherwise unnecessary mmap/munmap
syscalls needed to use the vrange interface, and would also avoid running
out of address space on 32-bit machines.
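
The difference in call shape can be sketched with interfaces that already
exist. This is only an illustration under stated assumptions: neither
fvrange() nor a volatile advice value exists, so the sketch borrows
posix_fadvise() with POSIX_FADV_DONTNEED for the fd-based shape and
madvise() with MADV_DONTNEED for the address-based shape. The helper names
hint_by_fd(), hint_by_address() and make_temp_file() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* fd-based shape: the range is named by (fd, offset, len); no mapping needed. */
static int hint_by_fd(int fd, off_t off, off_t len)
{
    /* posix_fadvise() returns 0 on success or an error number. */
    return posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}

/*
 * Address-based shape: the range is named by a virtual address, so the
 * caller has to mmap() the file range first and munmap() it afterwards.
 * That costs two extra syscalls and len bytes of address space.
 */
static int hint_by_address(int fd, off_t off, size_t len)
{
    int ret;
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);

    if (p == MAP_FAILED)
        return -1;
    ret = madvise(p, len, MADV_DONTNEED);
    munmap(p, len);
    return ret;
}

/* Hypothetical helper: create an unlinked temporary file of the given size. */
static int make_temp_file(size_t len)
{
    char path[] = "/tmp/vrange-sketch-XXXXXX";
    int fd;

    assert(len > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Both helpers express the same kind of hint on the same file range; only the
address-based one has to create and tear down a mapping to do so.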

 
> For Mozilla sharing memory across processes via ashmem is not a
> nearterm project. It's something that is likely to require
> significant rework. Process-local discardable memory can be
> retrofited in a more straight-forward fashion.
>
> Taras

-- 
Kind regards,
Minchan Kim

[PATCH v10 00/16] Volatile Ranges v10

2014-01-01 Thread Minchan Kim

Hey all,

Happy New Year!

I know it's bad timing to send this large, unfamiliar patchset for
review, but I hope there are some folks with fresh brains in the new year
all over the world. :)
Most importantly, before I dive into lots of testing,
I'd like to reach agreement on the design issues and others:

o Syscall interface
o Not binding to the VMA split/merge logic, to avoid mmap_sem cost
o Not binding to the VMA split/merge logic, to avoid vm_area_struct memory
  footprint
o Purging logic - when to trigger purging of volatile pages to avoid
  evicting the working set, and when to stop to prevent excessive purging
  of volatile pages
o How to test
  Currently, we have a patched jemalloc allocator, made with Jason's help.
  It's not perfect and has room for enhancement, but IMO
  it's enough to prove vrange-anonymous. The problem is the
  lack of a benchmark for testing the vrange-file side. I hope the
  Mozilla folks can help.

So it's been a while since the last release of the volatile ranges
patches, again. John and I have been busy with other things.
Still, we have been slowly chipping away at issues and differences
trying to get a patchset that we both agree on.

There are still a few issues, but we figured any further polishing of
the patch series in private would be unproductive and it would be much
better to send the patches out for review and comment and get some wider
opinions.

You can get the full patchset via git:

git clone -b vrange-v10-rc5 --single-branch git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git

In v10, there are some notable changes following as

Whats new in v10:
* Fix several bugs and a build break
* Add shmem_purge_page to fix purging of shmem/tmpfs
* Replace the slab shrinker with a hook directly in the reclaim path
* Optimize pte scanning by caching the previous position
* Reorder patches and tidy up the Cc-list
* Rebased on v3.12
* Add vrange-anon test with jemalloc in Dhaval's test suite
  - https://github.com/volatile-ranges-test/vranges-test
  So you can test any application with the vrange-patched jemalloc via
  LD_PRELOAD, but please keep in mind that it's just a prototype to
  prove the vrange syscall concept, so it still has room for
  optimization. Please do not compare it with other allocators.
   
What's new in v9:
* Updated to v3.11
* Added vrange purging logic to purge anonymous pages on
  swapless systems
* Added logic to allocate the vroot structure dynamically
  to avoid added overhead to mm and address_space structures
* Lots of minor tweaks, changes and cleanups

Still TODO:
* Sort out a better solution for clearing volatility on new mmaps
  - Minchan has a different approach here
* Agreement on the system call interface
* Better discarding trigger policy to prevent working set eviction
* Review, Review, Review.. Comment.
* A ton of testing

Feedback or thoughts here would be particularly helpful!

Also, thanks to Dhaval for maintaining and vastly improving
the volatile ranges test suite, which can be found here:
[1] https://github.com/volatile-ranges-test/vranges-test

These patches can also be pulled from git here:
git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-v9

We'd really welcome any feedback and comments on the patch series.

thanks

== &< =

Volatile ranges provide a method for userland to inform the kernel that
a range of memory is safe to discard (i.e. it can be regenerated) but
userspace may want to try to access it in the future.  It can be thought
of as similar to MADV_DONTNEED, except that the actual freeing of the
memory is delayed and only done under memory pressure, and the user can
try to cancel the action and quickly access any unpurged pages. The
idea originated from Android's ashmem, but I've since learned that other
OSes provide similar functionality.

This functionality allows for a number of interesting uses:
* Userland caches that have kernel triggered eviction under memory
pressure. This allows for the kernel to "rightsize" userspace caches for
current system-wide workload. Things like image bitmap caches, or
rendered HTML in a hidden browser tab, where the data is not visible and
can be regenerated if needed, are good examples.

* Opportunistic freeing of memory that may be quickly reused. Minchan
has done a malloc implementation where free() marks the pages as
volatile, allowing the kernel to reclaim under pressure. This avoids the
unmapping and remapping of anonymous pages on free/malloc. So if
userland wants to malloc memory quickly after the free, it just needs to
mark the pages as non-volatile, and only purged pages will have to be
faulted back in. I did some tests with jemalloc, with help from Jason
Evans, the author of jemalloc, because he was interested in the vrange
system call.

Test(RAM 2G, CPU 4, ebizzy benchmark)
ebizzy argument: ./ebizzy -S 30 -n 512

The default chunk size is 512k, so 512k * 512 = 256M: a single ebizzy
process has a 256M footprint.

(1.1) stands for 1 process and 1 thread, so (1.4) is
1 process and 4 threads.

vanilla

[PATCH v10 00/16] Volatile Ranges v10

2014-01-01 Thread Minchan Kim

Hey all,

Happy New Year!

I know it's bad timing to send this large, unfamiliar patchset for
review, but I hope there are some people with fresh brains in the new
year all over the world. :)
Most importantly, before I dive into lots of testing, I'd like to
reach agreement on design issues and other open points:

o Syscall interface
o Not bound to VMA split/merge logic, to avoid mmap_sem cost and
  vm_area_struct memory footprint.
o Purging logic - when to trigger purging of volatile pages so that
  the working set is not evicted, and when to stop so that volatile
  pages are not purged excessively
o How to test
  Currently, we have a patched jemalloc allocator, thanks to Jason's
  help. It's not perfect and has room for improvement but, IMO, it's
  enough to prove vrange-anonymous. The problem is the lack of a
  benchmark for testing the vrange-file side. I hope that Mozilla
  folks can help.

So it's been a while since the last release of the volatile ranges
patches, again. John and I have been busy with other things.
Still, we have been slowly chipping away at issues and differences
trying to get a patchset that we both agree on.

There are still a few issues, but we figured any further polishing of
the patch series in private would be unproductive and it would be much
better to send the patches out for review and comment and get some wider
opinions.

You can get the full patchset via git:

git clone -b vrange-v10-rc5 --single-branch git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git

v10 has some notable changes, listed below.

What's new in v10:
* Fix several bugs and a build break
* Add shmem_purge_page to fix purging of shmem/tmpfs
* Replace the slab shrinker with a hook directly in the reclaim path
* Optimize pte scanning by caching the previous position
* Reorder patches and tidy up the Cc-list
* Rebased on v3.12
* Add vrange-anon test with jemalloc in Dhaval's test suite
  - https://github.com/volatile-ranges-test/vranges-test
  So you can test any application with the vrange-patched jemalloc via
  LD_PRELOAD, but please keep in mind that it's just a prototype to
  prove the vrange syscall concept, so it still has room for
  optimization. Please do not compare it with other allocators.
   
What's new in v9:
* Updated to v3.11
* Added vrange purging logic to purge anonymous pages on
  swapless systems
* Added logic to allocate the vroot structure dynamically
  to avoid added overhead to mm and address_space structures
* Lots of minor tweaks, changes and cleanups

Still TODO:
* Sort out a better solution for clearing volatility on new mmaps
  - Minchan has a different approach here
* Agreement on the system call interface
* Better discarding trigger policy to prevent working set eviction
* Review, Review, Review.. Comment.
* A ton of testing

Feedback or thoughts here would be particularly helpful!

Also, thanks to Dhaval for maintaining and vastly improving
the volatile ranges test suite, which can be found here:
[1] https://github.com/volatile-ranges-test/vranges-test

These patches can also be pulled from git here:
git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-v9

We'd really welcome any feedback and comments on the patch series.

thanks

==  =

Volatile ranges provide a method for userland to inform the kernel that
a range of memory is safe to discard (i.e. it can be regenerated) but
userspace may want to try to access it in the future.  It can be thought
of as similar to MADV_DONTNEED, except that the actual freeing of the
memory is delayed and only done under memory pressure, and the user can
try to cancel the action and quickly access any unpurged pages. The
idea originated from Android's ashmem, but I've since learned that other
OSes provide similar functionality.
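
For contrast, here is a minimal sketch of the eager behaviour that
volatile ranges are meant to relax. It uses only the existing mmap() and
madvise() calls, not the vrange syscall from this series. On Linux,
MADV_DONTNEED on a private anonymous mapping drops the pages right away,
so the next access sees zero-filled pages. The function name
dontneed_discards_contents is made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Illustrative helper (hypothetical name, not part of the patchset).
 * Returns 1 if a private anonymous mapping reads back as zero after
 * MADV_DONTNEED, 0 if the old contents survived, -1 on error.
 */
int dontneed_discards_contents(size_t len)
{
    assert(len > 0);

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xAB, len);     /* populate the "cache" */

    /* Eager discard: no waiting for memory pressure, no way to cancel. */
    if (madvise(buf, len, MADV_DONTNEED) != 0) {
        munmap(buf, len);
        return -1;
    }

    /* The next touch faults in a fresh zero-filled page. */
    int zeroed = (buf[0] == 0);

    munmap(buf, len);
    return zeroed;
}
```

With volatile ranges, the same buffer would keep its contents until the
kernel actually came under memory pressure, and userspace could ask
whether anything was purged before reusing it.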

This functionality allows for a number of interesting uses:
* Userland caches that have kernel triggered eviction under memory
pressure. This allows for the kernel to rightsize userspace caches for
current system-wide workload. Things like image bitmap caches, or
rendered HTML in a hidden browser tab, where the data is not visible and
can be regenerated if needed, are good examples.

* Opportunistic freeing of memory that may be quickly reused. Minchan
has done a malloc implementation where free() marks the pages as
volatile, allowing the kernel to reclaim under pressure. This avoids the
unmapping and remapping of anonymous pages on free/malloc. So if
userland wants to malloc memory quickly after the free, it just needs to
mark the pages as non-volatile, and only purged pages will have to be
faulted back in. I did some tests with jemalloc, with help from Jason
Evans, the author of jemalloc, because he was interested in the vrange
system call.
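
The free/malloc pattern described above can be sketched as a small
userspace simulation. This is only a model of the semantics, under the
assumption that each page is either intact or purged; it does not call
the vrange syscall, and every name in it (sim_range,
sim_mark_volatile, sim_memory_pressure, sim_mark_nonvolatile) is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_PAGES 8

/* Hypothetical per-page state; not a kernel data structure. */
struct sim_page {
    bool is_volatile;   /* userspace said "safe to discard" */
    bool purged;        /* the simulated kernel dropped the contents */
};

struct sim_range {
    struct sim_page pages[SIM_PAGES];
};

/* free() side: mark pages volatile; nothing is unmapped or freed yet. */
void sim_mark_volatile(struct sim_range *r, size_t first, size_t count)
{
    assert(first <= SIM_PAGES && count <= SIM_PAGES - first);
    for (size_t i = first; i < first + count; i++)
        r->pages[i].is_volatile = true;
}

/* Memory pressure: purge up to `want` volatile pages, return the count. */
size_t sim_memory_pressure(struct sim_range *r, size_t want)
{
    size_t purged = 0;

    for (size_t i = 0; i < SIM_PAGES && purged < want; i++) {
        if (r->pages[i].is_volatile && !r->pages[i].purged) {
            r->pages[i].purged = true;
            purged++;
        }
    }
    return purged;
}

/*
 * malloc() side: mark pages non-volatile again. Returns how many of them
 * were purged in the meantime; only those need to be faulted back in.
 */
size_t sim_mark_nonvolatile(struct sim_range *r, size_t first, size_t count)
{
    size_t lost = 0;

    assert(first <= SIM_PAGES && count <= SIM_PAGES - first);
    for (size_t i = first; i < first + count; i++) {
        if (r->pages[i].purged) {
            lost++;
            r->pages[i].purged = false;
        }
        r->pages[i].is_volatile = false;
    }
    return lost;
}
```

If four pages are marked volatile and pressure purges two of them,
marking the four pages non-volatile again reports two lost pages; the
other two are reused without any fault.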

Test(RAM 2G, CPU 4, ebizzy benchmark)
ebizzy argument: ./ebizzy -S 30 -n 512

The default chunk size is 512k, so 512k * 512 = 256M: a single ebizzy
process has a 256M footprint.
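
The footprint arithmetic can be checked with a tiny helper. This is
just a sketch, assuming "512k" means 512 KiB and "256M" means 256 MiB;
footprint_mib is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Footprint in MiB of `chunks` chunks of `chunk_kib` KiB each. */
size_t footprint_mib(size_t chunks, size_t chunk_kib)
{
    /* Guard against overflow in the multiplication below. */
    assert(chunk_kib == 0 || chunks <= (size_t)-1 / chunk_kib);
    return chunks * chunk_kib / 1024;
}
```

footprint_mib(512, 512) gives the per-process footprint for the
"-n 512" run above.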

(1.1) stands for 1 process and 1 thread, so (1.4) is
1 process and 4 threads.

vanilla
