Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 01:13 PM, John Stultz wrote:
> On 04/02/2014 12:47 PM, Johannes Weiner wrote:
>> It's really nothing but a use-after-free bug that has consequences for
>> no-one but the faulty application. The thing that IS new is that even
>> a read is enough to corrupt your data in this case.
>>
>> MADV_REVIVE could return 0 if all pages in the specified range were
>> present, -Esomething if otherwise. That would be semantically sound
>> even if userspace messes up.
>
> So it's semantically just a combined mincore+dirty operation, and
> nothing more?
>
> What are other folks thinking about this? Although I don't particularly
> like it, I probably could go along with Johannes' approach, forgoing
> SIGBUS for zero-fill and adopting the semantics that are, in my mind, a
> bit stranger. This would allow for ashmem-style behavior w/ the
> additional write-clears-volatile-state and read-clears-purged-state
> constraints (which I don't think would be problematic for Android, but
> am not totally sure).
>
> But I do worry that these semantics are easier for kernel-mm developers
> to grasp, but much, much harder for application developers to
> understand.

So I don't feel like we've gotten enough feedback for consensus here.
Thus, to at least address other issues pointed out at LSF-MM, I'm going
to shortly send out a v13 of the patchset which keeps the previous
approach instead of adopting Johannes' suggested approach here.

If folks do prefer Johannes' approach, please speak up, as I'm willing
to give it a whirl despite my concerns about the subtle semantics.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/07/2014 09:32 PM, Kevin Easton wrote:
> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
>> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote:
>>> I'm just dying to hear a "normal" use case then. :)
>>
>> So the more "normal" use case would be marking objects volatile and
>> then non-volatile w/o accessing them in-between. In this case the
>> zero-fill vs SIGBUS semantics don't really matter; it's really just a
>> trade-off in how we handle applications deviating (intentionally or
>> not) from this use case.
>>
>> So to maybe flesh out the context here for folks who are following
>> along (but weren't in the hallway at LSF :), Johannes made a fairly
>> interesting proposal (Johannes: please correct me where I'm maybe
>> slightly off) to use only the dirty bits of the ptes to mark a page
>> as volatile. Then the kernel could reclaim these clean pages as it
>> needed, and when we marked the range as non-volatile, the pages would
>> be re-dirtied, and if any of the pages were missing, we could return
>> a flag with the purged state. This had some different semantics than
>> what I've been working with for a while (for example, any writes to
>> pages would implicitly clear volatility), so I wasn't completely
>> comfortable with it, but figured I'd think about it to see if it
>> could be done. Particularly since it would in some ways simplify the
>> tmpfs/shm shared volatility that I'd eventually like to do.
> ...
>> Now, while for the case I'm personally most interested in (ashmem),
>> zero-fill would technically be ok, since that's what Android does.
>> Even so, I don't think it's the best approach for the interface,
>> since applications may end up quite surprised by the results when
>> they accidentally don't follow the "don't touch volatile pages" rule.
>>
>> That point aside, I think the other problem with the page-cleaning
>> volatility approach is that there are other awkward side effects.
>> For example: say an application marks a range as volatile. One page
>> in the range is then purged. The application, due to a bug or
>> otherwise, reads the volatile range. This causes the page to be
>> zero-filled in, and the application silently uses the corrupted data
>> (which isn't great). More problematic, though, is that by faulting
>> the page in, they've in effect lost the purge state for that page.
>> When the application then goes to mark the range as non-volatile,
>> all pages are present, so we'd return that no pages were purged.
>> From an application perspective this is pretty ugly.
>
> The write-implicitly-clears-volatile semantics would actually be
> an advantage for some use cases. If you have a volatile cache of
> many sub-page-size objects, the application can just include at
> the start of each page "int present, in_use;". "present" is set
> to non-zero before marking volatile, and when the application wants
> to unmark as volatile it writes to "in_use" and tests the value of
> "present". No need for a syscall at all, although it does take a
> minor fault.
>
> The syscall would be better for the case of large objects, though.
>
> Or is that fatally flawed?

Well, as you note, each object would then have to be page size or
smaller, which limits some of the potential use cases. However, these
semantics would match better with the MADV_FREE proposal Minchan is
pushing, so this method would work fine there.

thanks
-john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote:
> > I'm just dying to hear a "normal" use case then. :)
>
> So the more "normal" use case would be marking objects volatile and
> then non-volatile w/o accessing them in-between. In this case the
> zero-fill vs SIGBUS semantics don't really matter; it's really just a
> trade-off in how we handle applications deviating (intentionally or
> not) from this use case.
>
> So to maybe flesh out the context here for folks who are following
> along (but weren't in the hallway at LSF :), Johannes made a fairly
> interesting proposal (Johannes: please correct me where I'm maybe
> slightly off) to use only the dirty bits of the ptes to mark a page
> as volatile. Then the kernel could reclaim these clean pages as it
> needed, and when we marked the range as non-volatile, the pages would
> be re-dirtied, and if any of the pages were missing, we could return
> a flag with the purged state. This had some different semantics than
> what I've been working with for a while (for example, any writes to
> pages would implicitly clear volatility), so I wasn't completely
> comfortable with it, but figured I'd think about it to see if it
> could be done. Particularly since it would in some ways simplify the
> tmpfs/shm shared volatility that I'd eventually like to do.
...
> Now, while for the case I'm personally most interested in (ashmem),
> zero-fill would technically be ok, since that's what Android does.
> Even so, I don't think it's the best approach for the interface,
> since applications may end up quite surprised by the results when
> they accidentally don't follow the "don't touch volatile pages" rule.
>
> That point aside, I think the other problem with the page-cleaning
> volatility approach is that there are other awkward side effects. For
> example: say an application marks a range as volatile. One page in
> the range is then purged. The application, due to a bug or otherwise,
> reads the volatile range. This causes the page to be zero-filled in,
> and the application silently uses the corrupted data (which isn't
> great). More problematic, though, is that by faulting the page in,
> they've in effect lost the purge state for that page. When the
> application then goes to mark the range as non-volatile, all pages
> are present, so we'd return that no pages were purged. From an
> application perspective this is pretty ugly.

The write-implicitly-clears-volatile semantics would actually be
an advantage for some use cases. If you have a volatile cache of
many sub-page-size objects, the application can just include at
the start of each page "int present, in_use;". "present" is set
to non-zero before marking volatile, and when the application wants
to unmark as volatile it writes to "in_use" and tests the value of
"present". No need for a syscall at all, although it does take a
minor fault.

The syscall would be better for the case of large objects, though.

Or is that fatally flawed?

    - Kevin
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 03:27:44PM -0400, Johannes Weiner wrote:
> On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> > Hi everyone,
> >
> > On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > > So between zero-fill and SIGBUS, I think SIGBUS makes the most
> > > sense. If you have a third option you're thinking of, I'd of
> > > course be interested in hearing it.
> >
> > I actually thought the way of being notified with a page fault
> > (sigbus or whatever) was the most efficient way of using volatile
> > ranges.
> >
> > Why having to call a syscall to know if you can still access the
> > volatile range, if there was no VM pressure before the access?
> > Syscalls are expensive, accessing the memory directly is not. Only
> > if the page was actually missing and a page fault would fire, you'd
> > take the slowpath.
>
> Not everybody wants to actually come back for the data in the range;
> allocators and message-passing applications just want to be able to
> reuse the memory mapping.
>
> By tying the volatility to the dirty bit in the page tables, an
> allocator could simply clear those bits once on free(). When malloc()
> hands out this region again, the user is expected to write, which will
> either overwrite the old page or, if it was purged, fault in a fresh
> zero page. But there is no second syscall needed to clear volatility.
>
> > > Now... once you've chosen SIGBUS semantics, there will be folks
> > > who will try to exploit the fact that we get SIGBUS on purged page
> > > access (at least on the user-space side) and will try to access
> > > pages that are volatile until they are purged and try to then
> > > handle the SIGBUS to fix things up. Those folks exploiting that
> > > will have to be particularly careful not to pass volatile data to
> > > the kernel, and if they do they'll have to be smart enough to
> > > handle the EFAULT, etc. That's really all their problem, because
> > > they're being clever. :)
> >
> > I'm actually working on a feature that would solve the problem for
> > syscalls accessing missing volatile pages. So you'd never see an
> > -EFAULT, because syscalls won't return even if they encounter a
> > missing page in a volatile range dropped by VM pressure.
> >
> > It's called userfaultfd. You call sys_userfaultfd(flags) and it
> > connects the current mm to a pseudo filedescriptor. The
> > filedescriptor works similarly to eventfd but with a different
> > protocol.
> >
> > You need a thread that will never access the userfault area with
> > the CPU, that is responsible to poll on the userfaultfd and talk
> > the userfaultfd protocol to fill in missing pages. After a POLLIN
> > event, the userfault thread reads the virtual addresses of the
> > fault that must have happened on some other thread of the same mm,
> > and then writes back a "handled" virtual range into the fd, after
> > the page (or pages, if multiple) have been regenerated and mapped
> > in with sys_remap_anon_pages(), mremap, or equivalent atomic
> > pagetable page swapping. Then, depending on the "solved" range
> > written back into the fd, the kernel will wake up the thread or
> > threads that were waiting in kernel mode on the "handled" virtual
> > range, and retry the fault without ever exiting kernel mode.
> >
> > We need this in KVM for running the guest on memory that is on
> > other nodes or in other processes (postcopy live migration is the
> > most common use case, but there are others, like memory
> > externalization and cross-node KSM in the cloud, to keep a single
> > copy of memory across multiple nodes and externalized to the VM and
> > to the host node).
> >
> > This thread made me wonder if we could mix the two features, and
> > you would then depend on MADV_USERFAULT and userfaultfd to deliver
> > to userland the "faults" happening on the volatile pages that have
> > been purged as a result of VM pressure.
> >
> > I'm just saying this after Johannes mentioned the issue with
> > syscalls returning -EFAULT. Because that is the very issue that the
> > userfaultfd is going to solve for the KVM migration thread.
> >
> > What I'm thinking now would be to mark the volatile range also
> > MADV_USERFAULT and then call userfaultfd, and instead of having the
> > cache regeneration "slow path" inside the SIGBUS handler, to run it
> > in the userfault thread that polls the userfaultfd. Then you could
> > write the volatile ranges to disk with a write() syscall (or use
> > any other syscall on the volatile ranges), without having to worry
> > about -EFAULT being returned because one page was discarded. And if
> > MADV_USERFAULT is not called in combination with the vrange
> > syscalls, then it'd still work without the userfault, but with the
> > vrange syscalls only.
> >
> > In short, the idea would be to let the userfault code solve the
> > fault delivery to userland for you, and make the vrange syscalls
> > focus only on the page purging problem, without having to worry
> > about what happens when something accesses a missing page.

Yes, the two seem certainly combinable to me: madvise(MADV_FREE |
MADV_USERFAULT) to allow purging and userspace fault handling.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
Hello Andrea,

On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> Hi everyone,
>
> On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > So between zero-fill and SIGBUS, I think SIGBUS makes the most
> > sense. If you have a third option you're thinking of, I'd of course
> > be interested in hearing it.
>
> I actually thought the way of being notified with a page fault
> (sigbus or whatever) was the most efficient way of using volatile
> ranges.
>
> Why having to call a syscall to know if you can still access the
> volatile range, if there was no VM pressure before the access?
> Syscalls are expensive, accessing the memory directly is not. Only if
> the page was actually missing and a page fault would fire, you'd take
> the slowpath.

True.

> The usages I see for this are plenty, like maintaining caches in
> memory that may be big and would be nice to discard if there's VM
> pressure; uncompressed jpeg images sound like a candidate too. So the
> browser size would shrink if there's VM pressure, instead of ending
> up swapping out uncompressed image data that can be regenerated more
> quickly with the CPU than with swapins.

That's really the typical case vrange is targeting.

> > Now... once you've chosen SIGBUS semantics, there will be folks who
> > will try to exploit the fact that we get SIGBUS on purged page
> > access (at least on the user-space side) and will try to access
> > pages that are volatile until they are purged and try to then
> > handle the SIGBUS to fix things up. Those folks exploiting that
> > will have to be particularly careful not to pass volatile data to
> > the kernel, and if they do they'll have to be smart enough to
> > handle the EFAULT, etc. That's really all their problem, because
> > they're being clever. :)
>
> I'm actually working on a feature that would solve the problem for
> syscalls accessing missing volatile pages. So you'd never see an
> -EFAULT, because syscalls won't return even if they encounter a
> missing page in a volatile range dropped by VM pressure.
>
> It's called userfaultfd. You call sys_userfaultfd(flags) and it
> connects the current mm to a pseudo filedescriptor. The
> filedescriptor works similarly to eventfd but with a different
> protocol.
>
> You need a thread that will never access the userfault area with the
> CPU, that is responsible to poll on the userfaultfd and talk the
> userfaultfd protocol to fill in missing pages. After a POLLIN event,
> the userfault thread reads the virtual addresses of the fault that
> must have happened on some other thread of the same mm, and then
> writes back a "handled" virtual range into the fd, after the page (or
> pages, if multiple) have been regenerated and mapped in with
> sys_remap_anon_pages(), mremap, or equivalent atomic pagetable page
> swapping. Then, depending on the "solved" range written back into the
> fd, the kernel will wake up the thread or threads that were waiting
> in kernel mode on the "handled" virtual range, and retry the fault
> without ever exiting kernel mode.

Sounds flexible.

> We need this in KVM for running the guest on memory that is on other
> nodes or in other processes (postcopy live migration is the most
> common use case, but there are others, like memory externalization
> and cross-node KSM in the cloud, to keep a single copy of memory
> across multiple nodes and externalized to the VM and to the host
> node).
>
> This thread made me wonder if we could mix the two features, and you
> would then depend on MADV_USERFAULT and userfaultfd to deliver to
> userland the "faults" happening on the volatile pages that have been
> purged as a result of VM pressure.
>
> I'm just saying this after Johannes mentioned the issue with syscalls
> returning -EFAULT. Because that is the very issue that the
> userfaultfd is going to solve for the KVM migration thread.
>
> What I'm thinking now would be to mark the volatile range also
> MADV_USERFAULT and then call userfaultfd, and instead of having the
> cache regeneration "slow path" inside the SIGBUS handler, to run it
> in the userfault thread that polls the userfaultfd. Then you could
> write the volatile ranges to disk with a write() syscall (or use any
> other syscall on the volatile ranges), without having to worry about
> -EFAULT being returned because one page was discarded. And if
> MADV_USERFAULT is not called in combination with the vrange syscalls,
> then it'd still work without the userfault, but with the vrange
> syscalls only.
>
> In short, the idea would be to let the userfault code solve the fault
> delivery to userland for you, and make the vrange syscalls focus only
> on the page purging problem, without having to worry about what
> happens when something accesses a missing page.
>
> But if you don't intend to solve the syscall -EFAULT problem, well,
> then probably the overlap is still as thin as I thought it was before
> (as also mentioned in the link below).

Sounds doable. I will look into your patch. Thanks for reminding!
Re: [PATCH 0/5] Volatile Ranges (v12) LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner han...@cmpxchg.org wrote: I'm just dying to hear a normal use case then. :) So the more normal use cause would be marking objects volatile and then non-volatile w/o accessing them in-between. In this case the zero-fill vs SIGBUS semantics don't really matter, its really just a trade off in how we handle applications deviating (intentionally or not) from this use case. So to maybe flesh out the context here for folks who are following along (but weren't in the hallway at LSF :), Johannes made a fairly interesting proposal (Johannes: Please correct me here where I'm maybe slightly off here) to use only the dirty bits of the ptes to mark a page as volatile. Then the kernel could reclaim these clean pages as it needed, and when we marked the range as non-volatile, the pages would be re-dirtied and if any of the pages were missing, we could return a flag with the purged state. This had some different semantics then what I've been working with for awhile (for example, any writes to pages would implicitly clear volatility), so I wasn't completely comfortable with it, but figured I'd think about it to see if it could be done. Particularly since it would in some ways simplify tmpfs/shm shared volatility that I'd eventually like to do. ... Now, while for the case I'm personally most interested in (ashmem), zero-fill would technically be ok, since that's what Android does. Even so, I don't think its the best approach for the interface, since applications may end up quite surprised by the results when they accidentally don't follow the don't touch volatile pages rule. That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. 
This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly.

The write-implicitly-clears-volatile semantics would actually be an advantage for some use cases. If you have a volatile cache of many sub-page-size objects, the application can just include "int present, in_use;" at the start of each page. present is set to non-zero before marking volatile, and when the application wants to unmark as volatile it writes to in_use and tests the value of present. No need for a syscall at all, although it does take a minor fault. The syscall would be better for the case of large objects, though. Or is that fatally flawed? - Kevin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/07/2014 09:32 PM, Kevin Easton wrote: On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner han...@cmpxchg.org wrote: I'm just dying to hear a "normal" use case then. :) So the more "normal" use case would be marking objects volatile and then non-volatile w/o accessing them in-between. In this case the zero-fill vs SIGBUS semantics don't really matter, it's really just a trade-off in how we handle applications deviating (intentionally or not) from this use case. So to maybe flesh out the context here for folks who are following along (but weren't in the hallway at LSF :), Johannes made a fairly interesting proposal (Johannes: please correct me where I'm maybe slightly off here) to use only the dirty bits of the ptes to mark a page as volatile. Then the kernel could reclaim these clean pages as it needed, and when we marked the range as non-volatile, the pages would be re-dirtied and if any of the pages were missing, we could return a flag with the purged state. This had some different semantics than what I've been working with for a while (for example, any writes to pages would implicitly clear volatility), so I wasn't completely comfortable with it, but figured I'd think about it to see if it could be done. Particularly since it would in some ways simplify the tmpfs/shm shared volatility that I'd eventually like to do. ... Now, while for the case I'm personally most interested in (ashmem), zero-fill would technically be ok, since that's what Android does. Even so, I don't think it's the best approach for the interface, since applications may end up quite surprised by the results when they accidentally don't follow the "don't touch volatile pages" rule. That point aside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged.
The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. The write-implicitly-clears-volatile semantics would actually be an advantage for some use cases. If you have a volatile cache of many sub-page-size objects, the application can just include "int present, in_use;" at the start of each page. present is set to non-zero before marking volatile, and when the application wants to unmark as volatile it writes to in_use and tests the value of present. No need for a syscall at all, although it does take a minor fault. The syscall would be better for the case of large objects, though. Or is that fatally flawed?

Well, as you note, each object would then have to be page size or smaller, which limits some of the potential use cases. However, these semantics would match better with the MADV_FREE proposal Minchan is pushing, so this method would work fine there. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote: > > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > >> On 04/01/2014 04:01 PM, Dave Hansen wrote: > >> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > >> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > >> > John, this was something that the Mozilla guys asked for, right? Any > >> > idea why this isn't ever a problem for them? > >> So one of their use cases for it is for library text. Basically they > >> want to decompress a compressed library file into memory. Then they plan > >> to mark the uncompressed pages volatile, and then be able to call into > >> it. Ideally for them, the kernel would only purge cold pages, leaving > >> the hot pages in memory. When they traverse a purged page, they handle > >> the SIGBUS and patch the page up. > > > > How big are these libraries compared to overall system size? > > Mike or Taras would have to refresh my memory on this detail. My > recollection is it mostly has to do with keeping the on-disk size of > the library small, so it can load off of slow media very quickly. > > >> Now.. this is not what I'd consider a normal use case, but was hoping to > >> illustrate some of the more interesting uses and demonstrate the > >> interface's flexibility. > > > > I'm just dying to hear a "normal" use case then. :) > > So the more "normal" use case would be marking objects volatile and > then non-volatile w/o accessing them in-between. In this case the > zero-fill vs SIGBUS semantics don't really matter, it's really just a > trade-off in how we handle applications deviating (intentionally or > not) from this use case.
> > So to maybe flesh out the context here for folks who are following > along (but weren't in the hallway at LSF :), Johannes made a fairly > interesting proposal (Johannes: please correct me where I'm maybe > slightly off here) to use only the dirty bits of the ptes to mark a > page as volatile. Then the kernel could reclaim these clean pages as > it needed, and when we marked the range as non-volatile, the pages > would be re-dirtied and if any of the pages were missing, we could

I'd like to understand more clearly what Hannes and you are thinking. You mean that when we unmark the range, we should re-dirty all of the pages' ptes? Or SetPageDirty? If we re-dirty the ptes, maybe the soft-dirty people (i.e. CRIU) might be angry because it could make lots of diff. If we just do SetPageDirty, it would invalidate the writeout-avoid logic for swapped pages which are already on swap. Yep, but that could be minor, and the SetPageDirty model would be proper for a shared-vrange implementation. But how could we know any pages were missing at unmark time? Where do we keep the information? It's no problem for vrange-anon because we can keep the information in the pte, but how about vrange-file (i.e. vrange-shared)? Using a shadow entry of the radix tree? What are you thinking about? Another major concern is still the syscall's overhead. Such a page-based scheme has trouble with syscall speed, so I'm afraid users might not use the syscall any more. :( Frankly speaking, we don't have a concrete user so I'm not sure how severe the overhead is, but we can easily imagine that in the future some user might want to mark huge GBs of memory volatile. But I couldn't insist on the range-based option because it has downsides, too. If we don't use the page-based model, the reclaim path clearly has a big overhead to scan virtual memory to find victim pages. As a worst case, just a single page in a huge multi-GB vma. Even then, the page might be in another zone.
:( If we could optimize that path to prevent CPU burning in the future, it could get very complicated and I'm not sure it would work well. We already have a similar issue with compaction. ;-) So, it's really a dilemma.

> return a flag with the purged state. This had some different > semantics than what I've been working with for a while (for example, > any writes to pages would implicitly clear volatility), so I wasn't > completely comfortable with it, but figured I'd think about it to see > if it could be done. Particularly since it would in some ways simplify > tmpfs/shm shared volatility that I'd eventually like to do. > > After thinking it over in the hallway, I talked some of the details w/ > Johannes and there was one issue: while w/ anonymous memory we can > still add a VM_VOLATILE flag on the vma, so we can get SIGBUS > semantics, on shared volatile ranges we don't have anything > to hang a volatile flag on w/o adding some new vma-like structure to > the address_space structure (much as we did in the past w/ earlier > volatile range implementations). This would negate much of the point > of using the dirty bits to simplify the shared volatility > implementation. > > Thus Johannes is reasonably questioning the need for SIGBUS semantics, > since if it wasn't needed, the simpler page-cleaning based volatility > could potentially be used. I think the SIGBUS scenario isn't common but in the case of JIT, it is necessary and the amount of ram consumed would be
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 12:36:38PM -0400, Johannes Weiner wrote: > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > > On 04/01/2014 04:01 PM, Dave Hansen wrote: > > > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > > >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > > >>> Either way, optimistic volatile pointers are nowhere near as > > >>> transparent to the application as the above description suggests, > > >>> which makes this usecase not very interesting, IMO. > > >> ... however, I think you're still derating the value way too much. The > > >> case of user space doing elastic memory management is more and more > > >> common, and for a lot of those applications it is perfectly reasonable > > >> to either not do system calls or to have to devolatilize first. > > > The SIGBUS is only in cases where the memory is set as volatile and > > > _then_ accessed, right? > > Not just set volatile and then accessed, but when a volatile page has > > been purged and then accessed without being made non-volatile. > > > > > > > John, this was something that the Mozilla guys asked for, right? Any > > > idea why this isn't ever a problem for them? > > So one of their use cases for it is for library text. Basically they > > want to decompress a compressed library file into memory. Then they plan > > to mark the uncompressed pages volatile, and then be able to call into > > it. Ideally for them, the kernel would only purge cold pages, leaving > > the hot pages in memory. When they traverse a purged page, they handle > > the SIGBUS and patch the page up. > > How big are these libraries compared to overall system size?

One of the JIT examples I had is 5M bytes for just a simple node.js service. Actually I'm not sure it was JIT or something else; just what I saw was that they were rwxp vmas, so I guess they are JIT. Anyway, it's a really simple script but it consumed 5M bytes.
It's really big for Embedded WebOS because other, more complicated services could be executed in parallel on the system. > > > Now.. this is not what I'd consider a normal use case, but was hoping to > > illustrate some of the more interesting uses and demonstrate the > > interface's flexibility. > > I'm just dying to hear a "normal" use case then. :) > > > Also it provided a clear example of benefits to doing LRU based > > cold-page purging rather than full object purging. Though I think the > > same could be demonstrated in a simpler case of a large cache of objects > > that the application wants to mark volatile in one pass, unmarking > > sub-objects as it needs. > > Agreed. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: em...@kvack.org -- Kind regards, Minchan Kim
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed 02-04-14 13:13:34, John Stultz wrote: > On 04/02/2014 12:47 PM, Johannes Weiner wrote: > > On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: > >> On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner > >> wrote: > >>> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > That point aside, I think the other problem with the page-cleaning > volatility approach is that there are other awkward side effects. For > example: Say an application marks a range as volatile. One page in the > range is then purged. The application, due to a bug or otherwise, > reads the volatile range. This causes the page to be zero-filled in, > and the application silently uses the corrupted data (which isn't > great). More problematic though, is that by faulting the page in, > they've in effect lost the purge state for that page. When the > application then goes to mark the range as non-volatile, all pages are > present, so we'd return that no pages were purged. From an > application perspective this is pretty ugly. > > Johannes: Any thoughts on this potential issue with your proposal? Am > I missing something else? > >>> No, this is accurate. However, I don't really see how this is > >>> different than any other use-after-free bug. If you access malloc > >>> memory after free(), you might receive a SIGSEGV, you might see random > >>> data, you might corrupt somebody else's data. This certainly isn't > >>> nice, but it's not exactly new behavior, is it? > >> The part that troubles me is that I see the purged state as kernel > >> data being corrupted by userland in this case. The kernel will tell > >> userspace that no pages were purged, even though they were. Only > >> because userspace made an errant read of a page, and got garbage data > >> back. > > That sounds overly dramatic to me. First of all, this data still > > reflects accurately the actions of userspace in this situation.
And > > secondly, the kernel does not rely on this data to be meaningful from > > a userspace perspective to function correctly. > > Maybe you're right, but I feel this is the sort of thing application > developers would be surprised and annoyed by. > > > > It's really nothing but a use-after-free bug that has consequences for > > no-one but the faulty application. The thing that IS new is that even > > a read is enough to corrupt your data in this case. > > > > MADV_REVIVE could return 0 if all pages in the specified range were > > present, -Esomething if otherwise. That would be semantically sound > > even if userspace messes up. > > So it's semantically more of just a combined mincore+dirty operation.. > and nothing more? > > What are other folks thinking about this? Although I don't particularly > like it, I probably could go along with Johannes' approach, forgoing > SIGBUS for zero-fill and adapting the semantics that are in my mind a > bit stranger. This would allow for ashmem-like style behavior w/ the > additional write-clears-volatile-state and read-clears-purged-state > constraints (which I don't think would be problematic for Android, but > am not totally sure). > > But I do worry that these semantics are easier for kernel-mm-developers > to grasp, but are much much harder for application developers to > understand. Yeah, I have to admit that although the simplicity of the implementation looks compelling, the interface from a userspace POV looks weird. Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 12:47 PM, Johannes Weiner wrote: > On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: >> On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner wrote: >>> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: That point aside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? >>> No, this is accurate. However, I don't really see how this is >>> different than any other use-after-free bug. If you access malloc >>> memory after free(), you might receive a SIGSEGV, you might see random >>> data, you might corrupt somebody else's data. This certainly isn't >>> nice, but it's not exactly new behavior, is it? >> The part that troubles me is that I see the purged state as kernel >> data being corrupted by userland in this case. The kernel will tell >> userspace that no pages were purged, even though they were. Only >> because userspace made an errant read of a page, and got garbage data >> back. > That sounds overly dramatic to me. First of all, this data still > reflects accurately the actions of userspace in this situation. And > secondly, the kernel does not rely on this data to be meaningful from > a userspace perspective to function correctly.
Maybe you're right, but I feel this is the sort of thing application developers would be surprised and annoyed by. > It's really nothing but a use-after-free bug that has consequences for > no-one but the faulty application. The thing that IS new is that even > a read is enough to corrupt your data in this case. > > MADV_REVIVE could return 0 if all pages in the specified range were > present, -Esomething if otherwise. That would be semantically sound > even if userspace messes up. So it's semantically more of just a combined mincore+dirty operation.. and nothing more? What are other folks thinking about this? Although I don't particularly like it, I probably could go along with Johannes' approach, forgoing SIGBUS for zero-fill and adapting the semantics that are in my mind a bit stranger. This would allow for ashmem-like style behavior w/ the additional write-clears-volatile-state and read-clears-purged-state constraints (which I don't think would be problematic for Android, but am not totally sure). But I do worry that these semantics are easier for kernel-mm-developers to grasp, but are much much harder for application developers to understand. Additionally, unless we could really leave access-after-volatile as totally undefined behavior, this would lock us into O(page) behavior and would remove the possibility of O(log(ranges)) behavior Minchan and I were able to get (admittedly with more complicated code - but something I was hoping we'd be able to get back to after the base semantics and interface behavior was understood and merged). Since applications will have bugs and will access memory after marking it volatile, I don't think we'll be able to get away with that sort of behavioral flexibility. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 11:31 AM, Andrea Arcangeli wrote: > On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: >> Now... once you've chosen SIGBUS semantics, there will be folks who will >> try to exploit the fact that we get SIGBUS on purged page access (at >> least on the user-space side) and will try to access pages that are >> volatile until they are purged and try to then handle the SIGBUS to fix >> things up. Those folks exploiting that will have to be particularly >> careful not to pass volatile data to the kernel, and if they do they'll >> have to be smart enough to handle the EFAULT, etc. That's really all >> their problem, because they're being clever. :) > I'm actually working on a feature that would solve the problem for the > syscalls accessing missing volatile pages. So you'd never see a > -EFAULT because all syscalls won't return even if they encounter a > missing page in the volatile range dropped by the VM pressure. > > It's called userfaultfd. You call sys_userfaultfd(flags) and it > connects the current mm to a pseudo filedescriptor. The filedescriptor > works similarly to eventfd but with a different protocol. So yea! I actually think (it's been a while now) I mentioned your work to Taras (or maybe he mentioned it to me?), but it did seem like userfaultfd would be a better solution for the style of fault handling they were thinking about. (Especially as actually handling SIGBUS and doing something sane in a large threaded application seems very difficult). That said, explaining volatile ranges as a concept has been difficult enough without mixing in other new concepts :), so I'm hesitant to tie the functionality together until it's clear the userfaultfd approach is likely to land. But maybe I need to take a closer look at it.
thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner wrote: > > On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > >> That point aside, I think the other problem with the page-cleaning > >> volatility approach is that there are other awkward side effects. For > >> example: Say an application marks a range as volatile. One page in the > >> range is then purged. The application, due to a bug or otherwise, > >> reads the volatile range. This causes the page to be zero-filled in, > >> and the application silently uses the corrupted data (which isn't > >> great). More problematic though, is that by faulting the page in, > >> they've in effect lost the purge state for that page. When the > >> application then goes to mark the range as non-volatile, all pages are > >> present, so we'd return that no pages were purged. From an > >> application perspective this is pretty ugly. > >> > >> Johannes: Any thoughts on this potential issue with your proposal? Am > >> I missing something else? > > > > No, this is accurate. However, I don't really see how this is > > different than any other use-after-free bug. If you access malloc > > memory after free(), you might receive a SIGSEGV, you might see random > > data, you might corrupt somebody else's data. This certainly isn't > > nice, but it's not exactly new behavior, is it? > > The part that troubles me is that I see the purged state as kernel > data being corrupted by userland in this case. The kernel will tell > userspace that no pages were purged, even though they were. Only > because userspace made an errant read of a page, and got garbage data > back. That sounds overly dramatic to me. First of all, this data still reflects accurately the actions of userspace in this situation. And secondly, the kernel does not rely on this data to be meaningful from a userspace perspective to function correctly.
It's really nothing but a use-after-free bug that has consequences for no-one but the faulty application. The thing that IS new is that even a read is enough to corrupt your data in this case. MADV_REVIVE could return 0 if all pages in the specified range were present, -Esomething if otherwise. That would be semantically sound even if userspace messes up. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 11:07 AM, Johannes Weiner wrote: > On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote: >> I suspect handling the SIGBUS and patching up the purged page you >> trapped on is likely much too complicated for most use cases. But I do >> think SIGBUS is preferable to zero-fill on purged page access, just >> because it's likely to be easier to debug applications. > > Fully agreed, but it seems a bit overkill to add a separate syscall, a > range-tree on top of shmem address_spaces, and an essentially new > programming model based on SIGBUS userspace fault handling (incl. all > the complexities and confusion this inevitably will bring when people > DO end up passing these pointers into kernel space) just to be a bit > nicer about use-after-free bugs in applications. It's more about making an interface that has graspable semantics to userspace, instead of having the semantics be a side-effect of the implementation. Tying volatility to the page-clean state and page-was-purged to page-present seems problematic to me, because there are too many ways to change the page-clean or page-present state outside of the interface being proposed. I feel this causes a cascade of corner cases that have to be explained to users of the interface. Also, I disagree that we're adding a new programming model, as SIGBUSes can already be caught; it's just that there's not usually much one can do, whereas with volatile pages it's more likely something could be done. And again, it's really just a side-effect of having semantics (SIGBUS on purged page access) that are more helpful from an application's perspective. As for the separate syscall: again, this is mainly needed to handle allocation failures that happen mid-way through modifying the range. There may still be a way to do the allocation first and only after it succeeds do the modification. 
The vma merge/splitting logic doesn't make this easy, but if we can be sure that on a failed split of 1 vma -> 3 vmas (which may fail halfway) we can re-merge w/o allocation and error out (without having to do any other allocations), this might be avoidable. I'm still wanting to look at this. If so, it would be easier to re-add this support under madvise, if folks really really don't like the new syscall. For the most part, having the separate syscall allows us to discuss other details of the semantics, which to me are more important than the syscall naming. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote: > Hi everyone, > > On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: > > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If > > you have a third option you're thinking of, I'd of course be interested > > in hearing it. > > I actually thought the way of being notified with a page fault (sigbus > or whatever) was the most efficient way of using volatile ranges. > > Why have to call a syscall to know if you can still access the > volatile range, if there was no VM pressure before the access? > syscalls are expensive, accessing the memory directly is not. Only if > the page was actually missing and a page fault would fire, you'd take > the slowpath. Not everybody wants to actually come back for the data in the range; allocators and message-passing applications just want to be able to reuse the memory mapping. By tying the volatility to the dirty bit in the page tables, an allocator could simply clear those bits once on free(). When malloc() hands out this region again, the user is expected to write, which will either overwrite the old page, or, if it was purged, fault in a fresh zero page. But there is no second syscall needed to clear volatility. > > Now... once you've chosen SIGBUS semantics, there will be folks who will > > try to exploit the fact that we get SIGBUS on purged page access (at > > least on the user-space side) and will try to access pages that are > > volatile until they are purged and try to then handle the SIGBUS to fix > > things up. Those folks exploiting that will have to be particularly > > careful not to pass volatile data to the kernel, and if they do they'll > > have to be smart enough to handle the EFAULT, etc. That's really all > > their problem, because they're being clever. :) > > I'm actually working on a feature that would solve the problem for the > syscalls accessing missing volatile pages. 
So you'd never see a > -EFAULT because all syscalls won't return even if they encounter a > missing page in the volatile range dropped by the VM pressure. > > It's called userfaultfd. You call sys_userfaultfd(flags) and it > connects the current mm to a pseudo filedescriptor. The filedescriptor > works similarly to eventfd but with a different protocol. > > You need a thread that will never access the userfault area with the > CPU, that is responsible to poll on the userfaultfd and talk the > userfaultfd protocol to fill-in missing pages. The userfault thread > after a POLLIN event reads the virtual addresses of the fault that > must have happened on some other thread of the same mm, and then > writes back a "handled" virtual range into the fd, after the page (or > pages if multiple) have been regenerated and mapped in with > sys_remap_anon_pages(), mremap or equivalent atomic pagetable page > swapping. Then depending on the "solved" range written back into the > fd, the kernel will wake up the thread or threads that were waiting in > kernel mode on the "handled" virtual range, and retry the fault > without ever exiting kernel mode. > > We need this in KVM for running the guest on memory that is on other > nodes or other processes (postcopy live migration is the most common > use case but there are others like memory externalization and > cross-node KSM in the cloud, to keep a single copy of memory across > multiple nodes and externalized to the VM and to the host node). > > This thread made me wonder if we could mix the two features and you > would then depend on MADV_USERFAULT and userfaultfd to deliver to > userland the "faults" happening on the volatile pages that have been > purged as a result of VM pressure. > > I'm just saying this after Johannes mentioned the issue with syscalls > returning -EFAULT. Because that is the very issue that the userfaultfd > is going to solve for the KVM migration thread. 
> > What I'm thinking now would be to mark the volatile range also > MADV_USERFAULT and then calling userfaultfd and instead of having the > cache regeneration "slow path" inside the SIGBUS handler, to run it in > the userfault thread that polls the userfaultfd. Then you could write > the volatile ranges to disk with a write() syscall (or use any other > syscall on the volatile ranges), without having to worry about -EFAULT > being returned because one page was discarded. And if MADV_USERFAULT > is not called in combination with vrange syscalls, then it'd still > work without the userfault, but with the vrange syscalls only. > > In short the idea would be to let the userfault code solve the fault > delivery to userland for you, and make the vrange syscalls only focus > on the page purging problem, without having to worry about what > happens when something accesses a missing page. Yes, the two seem certainly combinable to me. madvise(MADV_FREE | MADV_USERFAULT) to allow purging and userspace fault handling. In the fault slowpath, you can then regenerate any missing data and
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner wrote: > On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: >> That point beside, I think the other problem with the page-cleaning >> volatility approach is that there are other awkward side effects. For >> example: Say an application marks a range as volatile. One page in the >> range is then purged. The application, due to a bug or otherwise, >> reads the volatile range. This causes the page to be zero-filled in, >> and the application silently uses the corrupted data (which isn't >> great). More problematic though, is that by faulting the page in, >> they've in effect lost the purge state for that page. When the >> application then goes to mark the range as non-volatile, all pages are >> present, so we'd return that no pages were purged. From an >> application perspective this is pretty ugly. >> >> Johannes: Any thoughts on this potential issue with your proposal? Am >> I missing something else? > > No, this is accurate. However, I don't really see how this is > different than any other use-after-free bug. If you access malloc > memory after free(), you might receive a SIGSEGV, you might see random > data, you might corrupt somebody else's data. This certainly isn't > nice, but it's not exactly new behavior, is it? The part that troubles me is that I see the purged state as kernel data being corrupted by userland in this case. The kernel will tell userspace that no pages were purged, even though they were. Only because userspace made an errant read of a page, and got garbage data back. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
Hi everyone, On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If > you have a third option you're thinking of, I'd of course be interested > in hearing it. I actually thought the way of being notified with a page fault (sigbus or whatever) was the most efficient way of using volatile ranges. Why have to call a syscall to know if you can still access the volatile range, if there was no VM pressure before the access? syscalls are expensive, accessing the memory directly is not. Only if the page was actually missing and a page fault would fire, you'd take the slowpath. The usages I see for this are plenty, like for maintaining caches in memory that may be big and would be nice to discard if there's VM pressure; uncompressed jpeg images sound like a candidate too. So the browser's memory footprint would shrink if there's VM pressure, instead of ending up swapping out uncompressed image data that can be regenerated more quickly with the CPU than with swapins. > Now... once you've chosen SIGBUS semantics, there will be folks who will > try to exploit the fact that we get SIGBUS on purged page access (at > least on the user-space side) and will try to access pages that are > volatile until they are purged and try to then handle the SIGBUS to fix > things up. Those folks exploiting that will have to be particularly > careful not to pass volatile data to the kernel, and if they do they'll > have to be smart enough to handle the EFAULT, etc. That's really all > their problem, because they're being clever. :) I'm actually working on a feature that would solve the problem for the syscalls accessing missing volatile pages. So you'd never see a -EFAULT because all syscalls won't return even if they encounter a missing page in the volatile range dropped by the VM pressure. It's called userfaultfd. You call sys_userfaultfd(flags) and it connects the current mm to a pseudo filedescriptor. 
The filedescriptor works similarly to eventfd but with a different protocol. You need a thread that will never access the userfault area with the CPU, that is responsible to poll on the userfaultfd and talk the userfaultfd protocol to fill-in missing pages. The userfault thread after a POLLIN event reads the virtual addresses of the fault that must have happened on some other thread of the same mm, and then writes back a "handled" virtual range into the fd, after the page (or pages if multiple) have been regenerated and mapped in with sys_remap_anon_pages(), mremap or equivalent atomic pagetable page swapping. Then depending on the "solved" range written back into the fd, the kernel will wake up the thread or threads that were waiting in kernel mode on the "handled" virtual range, and retry the fault without ever exiting kernel mode. We need this in KVM for running the guest on memory that is on other nodes or other processes (postcopy live migration is the most common use case but there are others like memory externalization and cross-node KSM in the cloud, to keep a single copy of memory across multiple nodes and externalized to the VM and to the host node). This thread made me wonder if we could mix the two features and you would then depend on MADV_USERFAULT and userfaultfd to deliver to userland the "faults" happening on the volatile pages that have been purged as a result of VM pressure. I'm just saying this after Johannes mentioned the issue with syscalls returning -EFAULT. Because that is the very issue that the userfaultfd is going to solve for the KVM migration thread. What I'm thinking now would be to mark the volatile range also MADV_USERFAULT and then calling userfaultfd and instead of having the cache regeneration "slow path" inside the SIGBUS handler, to run it in the userfault thread that polls the userfaultfd. 
Then you could write the volatile ranges to disk with a write() syscall (or use any other syscall on the volatile ranges), without having to worry about -EFAULT being returned because one page was discarded. And if MADV_USERFAULT is not called in combination with vrange syscalls, then it'd still work without the userfault, but with the vrange syscalls only. In short the idea would be to let the userfault code solve the fault delivery to userland for you, and make the vrange syscalls only focus on the page purging problem, without having to worry about what happens when something accesses a missing page. But if you don't intend to solve the syscall -EFAULT problem, well then probably the overlap is still as thin as I thought it was before (like also mentioned in the below link). Thanks, Andrea PS. my last email about this from a more KVM centric point of view: http://www.spinics.net/lists/kvm/msg101449.html
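The protocol Andrea sketches predates the userfaultfd API that eventually merged, so none of the calls below are a stable interface; the names and message shapes are taken directly from his description. The handler thread's loop would look roughly like:

```
/* userfault handler thread, per the proposed protocol (pseudocode) */
uffd = sys_userfaultfd(flags);          /* connect current mm to the fd   */
for (;;) {
    poll(uffd, POLLIN);                 /* wait for a fault notification  */
    read(uffd, &addr);                  /* virtual address that faulted
                                           in some other thread of the mm */
    regenerate(addr);                   /* rebuild the purged contents    */
    sys_remap_anon_pages(addr, fresh);  /* atomically map the fresh pages
                                           (or mremap / equivalent)       */
    write(uffd, &range);                /* write back the "handled" range;
                                           the kernel wakes the blocked
                                           thread(s) and retries the fault
                                           without exiting kernel mode    */
}
```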
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen wrote: > > On 04/02/2014 10:18 AM, Johannes Weiner wrote: > >> Hence my follow-up question in the other mail about how large we > >> expect such code caches to become in practice in relationship to > >> overall system memory. Are code caches interesting reclaim candidates > >> to begin with? Are they big enough to make the machine thrash/swap > >> otherwise? > > > > A big chunk of the use cases here are for swapless systems anyway, so > > this is the *only* way for them to reclaim anonymous memory. Their > > choices are either to be constantly throwing away and rebuilding these > > objects, or to leave them in memory effectively pinned. > > > > In practice I did see ashmem (the Android thing that we're trying to > > replace) get used a lot by the Android web browser when I was playing > > with it. John said that it got used for storing decompressed copies of > > images. > > Although images are a simpler case where it's easier to not touch > volatile pages. I think Johannes is mostly concerned about cases where > volatile pages are being accessed while they are volatile, for which the > Mozilla folks are so far the only viable case (in my mind... folks may > have others) where they intentionally want to access pages while > they're volatile and thus require SIGBUS semantics. Yes, absolutely, that is my only concern. Compressed images as in Android can easily be marked non-volatile before they are accessed again. Code caches are harder because control is handed off to the CPU, but I'm not entirely sure yet whether these are in fact interesting reclaim candidates. > I suspect handling the SIGBUS and patching up the purged page you > trapped on is likely much too complicated for most use cases. But I do > think SIGBUS is preferable to zero-fill on purged page access, just > because it's likely to be easier to debug applications. 
Fully agreed, but it seems a bit overkill to add a separate syscall, a range-tree on top of shmem address_spaces, and an essentially new programming model based on SIGBUS userspace fault handling (incl. all the complexities and confusion this inevitably will bring when people DO end up passing these pointers into kernel space) just to be a bit nicer about use-after-free bugs in applications.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote: > > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > >> On 04/01/2014 04:01 PM, Dave Hansen wrote: > >> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > >> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > >> > John, this was something that the Mozilla guys asked for, right? Any > >> > idea why this isn't ever a problem for them? > >> So one of their use cases for it is for library text. Basically they > >> want to decompress a compressed library file into memory. Then they plan > >> to mark the uncompressed pages volatile, and then be able to call into > >> it. Ideally for them, the kernel would only purge cold pages, leaving > >> the hot pages in memory. When they traverse a purged page, they handle > >> the SIGBUS and patch the page up. > > > > How big are these libraries compared to overall system size? > > Mike or Taras would have to refresh my memory on this detail. My > recollection is it mostly has to do with keeping the on-disk size of > the library small, so it can load off of slow media very quickly. > > >> Now.. this is not what I'd consider a normal use case, but was hoping to > >> illustrate some of the more interesting uses and demonstrate the > >> interface's flexibility. > > > > I'm just dying to hear a "normal" use case then. :) > > So the more "normal" use case would be marking objects volatile and > then non-volatile w/o accessing them in-between. In this case the > zero-fill vs SIGBUS semantics don't really matter, it's really just a > trade-off in how we handle applications deviating (intentionally or > not) from this use case. 
> > So to maybe flesh out the context here for folks who are following > along (but weren't in the hallway at LSF :), Johannes made a fairly > interesting proposal (Johannes: Please correct me here where I'm maybe > slightly off here) to use only the dirty bits of the ptes to mark a > page as volatile. Then the kernel could reclaim these clean pages as > it needed, and when we marked the range as non-volatile, the pages > would be re-dirtied and if any of the pages were missing, we could > return a flag with the purged state. This had some different > semantics than what I've been working with for a while (for example, > any writes to pages would implicitly clear volatility), so I wasn't > completely comfortable with it, but figured I'd think about it to see > if it could be done. Particularly since it would in some ways simplify > tmpfs/shm shared volatility that I'd eventually like to do. > > After thinking it over in the hallway, I talked some of the details w/ > Johannes, and there was one issue: while w/ anonymous memory we can > still add a VM_VOLATILE flag on the vma, so we can get SIGBUS > semantics, on shared volatile ranges we don't have anything > to hang a volatile flag on w/o adding some new vma-like structure to > the address_space structure (much as we did in the past w/ earlier > volatile range implementations). This would negate much of the point > of using the dirty bits to simplify the shared volatility > implementation. > > Thus Johannes is reasonably questioning the need for SIGBUS semantics, > since if it wasn't needed, the simpler page-cleaning based volatility > could potentially be used. Thanks for summarizing this again! > Now, while for the case I'm personally most interested in (ashmem), > zero-fill would technically be ok, since that's what Android does. 
> Even so, I don't think it's the best approach for the interface, since > applications may end up quite surprised by the results when they > accidentally don't follow the "don't touch volatile pages" rule. > > That point beside, I think the other problem with the page-cleaning > volatility approach is that there are other awkward side effects. For > example: Say an application marks a range as volatile. One page in the > range is then purged. The application, due to a bug or otherwise, > reads the volatile range. This causes the page to be zero-filled in, > and the application silently uses the corrupted data (which isn't > great). More problematic though, is that by faulting the page in, > they've in effect lost the purge state for that page. When the > application then goes to mark the range as non-volatile, all pages are > present, so we'd return that no pages were purged. From an > application perspective this is pretty ugly. > > Johannes: Any thoughts on this potential issue with your proposal? Am > I missing something else? No, this is accurate. However, I don't really see how this is different than any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it?
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen wrote: > On 04/02/2014 10:18 AM, Johannes Weiner wrote: >> Hence my follow-up question in the other mail about how large we >> expect such code caches to become in practice in relationship to >> overall system memory. Are code caches interesting reclaim candidates >> to begin with? Are they big enough to make the machine thrash/swap >> otherwise? > > A big chunk of the use cases here are for swapless systems anyway, so > this is the *only* way for them to reclaim anonymous memory. Their > choices are either to be constantly throwing away and rebuilding these > objects, or to leave them in memory effectively pinned. > > In practice I did see ashmem (the Android thing that we're trying to > replace) get used a lot by the Android web browser when I was playing > with it. John said that it got used for storing decompressed copies of > images. Although images are a simpler case where it's easier to not touch volatile pages. I think Johannes is mostly concerned about cases where volatile pages are being accessed while they are volatile, for which the Mozilla folks are so far the only viable case (in my mind... folks may have others) where they intentionally want to access pages while they're volatile and thus require SIGBUS semantics. I suspect handling the SIGBUS and patching up the purged page you trapped on is likely much too complicated for most use cases. But I do think SIGBUS is preferable to zero-fill on purged page access, just because it's likely to be easier to debug applications. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote: > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: >> On 04/01/2014 04:01 PM, Dave Hansen wrote: >> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: >> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: >> > John, this was something that the Mozilla guys asked for, right? Any >> > idea why this isn't ever a problem for them? >> So one of their use cases for it is for library text. Basically they >> want to decompress a compressed library file into memory. Then they plan >> to mark the uncompressed pages volatile, and then be able to call into >> it. Ideally for them, the kernel would only purge cold pages, leaving >> the hot pages in memory. When they traverse a purged page, they handle >> the SIGBUS and patch the page up. > > How big are these libraries compared to overall system size? Mike or Taras would have to refresh my memory on this detail. My recollection is it mostly has to do with keeping the on-disk size of the library small, so it can load off of slow media very quickly. >> Now.. this is not what I'd consider a normal use case, but was hoping to >> illustrate some of the more interesting uses and demonstrate the >> interface's flexibility. > > I'm just dying to hear a "normal" use case then. :) So the more "normal" use case would be marking objects volatile and then non-volatile w/o accessing them in-between. In this case the zero-fill vs SIGBUS semantics don't really matter, it's really just a trade-off in how we handle applications deviating (intentionally or not) from this use case. So to maybe flesh out the context here for folks who are following along (but weren't in the hallway at LSF :), Johannes made a fairly interesting proposal (Johannes: Please correct me here where I'm maybe slightly off here) to use only the dirty bits of the ptes to mark a page as volatile. 
Then the kernel could reclaim these clean pages as it needed, and when we marked the range as non-volatile, the pages would be re-dirtied and if any of the pages were missing, we could return a flag with the purged state. This had some different semantics than what I've been working with for a while (for example, any writes to pages would implicitly clear volatility), so I wasn't completely comfortable with it, but figured I'd think about it to see if it could be done. Particularly since it would in some ways simplify tmpfs/shm shared volatility that I'd eventually like to do. After thinking it over in the hallway, I talked some of the details w/ Johannes, and there was one issue: while w/ anonymous memory we can still add a VM_VOLATILE flag on the vma, so we can get SIGBUS semantics, on shared volatile ranges we don't have anything to hang a volatile flag on w/o adding some new vma-like structure to the address_space structure (much as we did in the past w/ earlier volatile range implementations). This would negate much of the point of using the dirty bits to simplify the shared volatility implementation. Thus Johannes is reasonably questioning the need for SIGBUS semantics, since if it wasn't needed, the simpler page-cleaning based volatility could potentially be used. Now, while for the case I'm personally most interested in (ashmem), zero-fill would technically be ok, since that's what Android does. Even so, I don't think it's the best approach for the interface, since applications may end up quite surprised by the results when they accidentally don't follow the "don't touch volatile pages" rule. That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. 
This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 10:18 AM, Johannes Weiner wrote: > Hence my follow-up question in the other mail about how large we > expect such code caches to become in practice in relationship to > overall system memory. Are code caches interesting reclaim candidates > to begin with? Are they big enough to make the machine thrash/swap > otherwise? A big chunk of the use cases here are for swapless systems anyway, so this is the *only* way for them to reclaim anonymous memory. Their choices are either to be constantly throwing away and rebuilding these objects, or to leave them in memory effectively pinned. In practice I did see ashmem (the Android thing that we're trying to replace) get used a lot by the Android web browser when I was playing with it. John said that it got used for storing decompressed copies of images.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 09:37:49AM -0700, H. Peter Anvin wrote: > On 04/02/2014 09:32 AM, H. Peter Anvin wrote: > > On 04/02/2014 09:30 AM, Johannes Weiner wrote: > >> > >> So between zero-fill and SIGBUS, I'd prefer the one which results in > >> the simpler user interface / fewer system calls. > >> > > > > The use cases are different; I believe this should be a user space option. > > > > Case in point, for example: imagine a JIT. You *really* don't want to > zero-fill memory behind the back of your JIT, as all zero memory may not > be a trapping instruction (it isn't on x86, for example, and if you are > unlucky you may be modifying *part* of an instruction.) Yes, and I think this would be comparable to the compressed-library usecase that John mentioned. What's special about these cases is that the accesses are no longer under control of the application because it's literally code that the CPU jumps into. It is obvious to me that such a usecase would require SIGBUS handling. However, it seems that in any usecase *besides* executable code caches, userspace would have the ability to mark the pages non-volatile ahead of time, and thus not require SIGBUS delivery. Hence my follow-up question in the other mail about how large we expect such code caches to become in practice in relationship to overall system memory. Are code caches interesting reclaim candidates to begin with? Are they big enough to make the machine thrash/swap otherwise?
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 09:32 AM, H. Peter Anvin wrote: > On 04/02/2014 09:30 AM, Johannes Weiner wrote: >> >> So between zero-fill and SIGBUS, I'd prefer the one which results in >> the simpler user interface / fewer system calls. >> > > The use cases are different; I believe this should be a user space option. > Case in point, for example: imagine a JIT. You *really* don't want to zero-fill memory behind the back of your JIT, as all zero memory may not be a trapping instruction (it isn't on x86, for example, and if you are unlucky you may be modifying *part* of an instruction.) Thus, SIGBUS is the only safe option. -hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > On 04/01/2014 04:01 PM, Dave Hansen wrote: > > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > >>> Either way, optimistic volatile pointers are nowhere near as > >>> transparent to the application as the above description suggests, > >>> which makes this usecase not very interesting, IMO. > >> ... however, I think you're still derating the value way too much. The > >> case of user space doing elastic memory management is more and more > >> common, and for a lot of those applications it is perfectly reasonable > >> to either not do system calls or to have to devolatilize first. > > The SIGBUS is only in cases where the memory is set as volatile and > > _then_ accessed, right? > Not just set volatile and then accessed, but when a volatile page has > been purged and then accessed without being made non-volatile. > > > > John, this was something that the Mozilla guys asked for, right? Any > > idea why this isn't ever a problem for them? > So one of their use cases for it is for library text. Basically they > want to decompress a compressed library file into memory. Then they plan > to mark the uncompressed pages volatile, and then be able to call into > it. Ideally for them, the kernel would only purge cold pages, leaving > the hot pages in memory. When they traverse a purged page, they handle > the SIGBUS and patch the page up. How big are these libraries compared to overall system size? > Now.. this is not what I'd consider a normal use case, but was hoping to > illustrate some of the more interesting uses and demonstrate the > interface's flexibility. I'm just dying to hear a "normal" use case then. :) > Also it provided a clear example of benefits to doing LRU based > cold-page purging rather than full object purging. > Though I think the > same could be demonstrated in a simpler case of a large cache of objects > that the application wants to mark volatile in one pass, unmarking > sub-objects as it needs. Agreed.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 09:30 AM, Johannes Weiner wrote: > > So between zero-fill and SIGBUS, I'd prefer the one which results in > the simpler user interface / fewer system calls. > The use cases are different; I believe this should be a user space option. -hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: > On 04/01/2014 02:21 PM, Johannes Weiner wrote: > > [ I tried to bring this up during LSFMM but it got drowned out. > > Trying again :) ] > > > > On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote: > >> Optimistic method: > >> 1) Userland marks a large range of data as volatile > >> 2) Userland continues to access the data as it needs. > >> 3) If userland accesses a page that has been purged, the kernel will > >> send a SIGBUS > >> 4) Userspace can trap the SIGBUS, mark the affected pages as > >> non-volatile, and refill the data as needed before continuing on > > As far as I understand, if a pointer to volatile memory makes it into > > a syscall and the fault is trapped in kernel space, there won't be a > > SIGBUS, the syscall will just return -EFAULT. > > > > Handling this would mean annotating every syscall invocation to check > > for -EFAULT, refill the data, and then restart the syscall. This is > > complicated even before taking external libraries into account, which > > may not propagate syscall returns properly or may not be reentrant at > > the necessary granularity. > > > > Another option is to never pass volatile memory pointers into the > > kernel, but that too means that knowledge of volatility has to travel > > alongside the pointers, which will either result in more complexity > > throughout the application or severely limited scope of volatile > > memory usage. > > > > Either way, optimistic volatile pointers are nowhere near as > > transparent to the application as the above description suggests, > > which makes this usecase not very interesting, IMO. If we can support > > it at little cost, why not, but I don't think we should complicate the > > common usecases to support this one. > > So yea, thanks again for all the feedback at LSF-MM! I'm trying to get > things integrated for a v13 here shortly (although with visitors in town > this week it may not happen until next week). 
> So, maybe it's best to ignore the fact that folks want to do semi-crazy > user-space faulting via SIGBUS. At least to start with. Let's look at the > semantic for the "normal" mark volatile, never touch the pages until you > mark non-volatile - basically where accessing volatile pages is similar > to a use-after-free bug. > > So, for the most part, I'd say the proposed SIGBUS semantics don't > complicate things for this basic use-case, at least when compared with > things like zero-fill. If an application accidentally accessed a > purged volatile page, I think SIGBUS is the right thing to do. They most > likely immediately crash, but it's better than them moving along with > silent corruption because they're mucking with zero-filled pages. > > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If > you have a third option you're thinking of, I'd of course be interested > in hearing it. The reason I'm bringing this up again is because I see very few solid use cases for a separate vrange() syscall once we have something like MADV_FREE and MADV_REVIVE, which respectively clear the dirty bits of a range of anon/tmpfs pages, and set them again and report if any pages in the given range were purged on revival. So between zero-fill and SIGBUS, I'd prefer the one which results in the simpler user interface / fewer system calls.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner han...@cmpxchg.org wrote: > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > > On 04/01/2014 04:01 PM, Dave Hansen wrote: > >> On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > >>> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > >> John, this was something that the Mozilla guys asked for, right? Any idea why this isn't ever a problem for them? > > So one of their use cases for it is for library text. Basically they want to decompress a compressed library file into memory. Then they plan to mark the uncompressed pages volatile, and then be able to call into it. Ideally for them, the kernel would only purge cold pages, leaving the hot pages in memory. When they traverse a purged page, they handle the SIGBUS and patch the page up. > How big are these libraries compared to overall system size? Mike or Taras would have to refresh my memory on this detail. My recollection is it mostly has to do with keeping the on-disk size of the library small, so it can load off of slow media very quickly. > > Now.. this is not what I'd consider a normal use case, but was hoping to illustrate some of the more interesting uses and demonstrate the interface's flexibility. > I'm just dying to hear a "normal" use case then. :) So the more normal use case would be marking objects volatile and then non-volatile w/o accessing them in-between. In this case the zero-fill vs SIGBUS semantics don't really matter; it's really just a trade-off in how we handle applications deviating (intentionally or not) from this use case. So to maybe flesh out the context here for folks who are following along (but weren't in the hallway at LSF :), Johannes made a fairly interesting proposal (Johannes: please correct me where I'm maybe slightly off) to use only the dirty bits of the ptes to mark a page as volatile. Then the kernel could reclaim these clean pages as it needed, and when we marked the range as non-volatile, the pages would be re-dirtied and, if any of the pages were missing, we could return a flag with the purged state. This had some different semantics than what I've been working with for awhile (for example, any writes to pages would implicitly clear volatility), so I wasn't completely comfortable with it, but figured I'd think about it to see if it could be done. Particularly since it would in some ways simplify the tmpfs/shm shared volatility that I'd eventually like to do. After thinking it over in the hallway, I talked some of the details w/ Johannes, and there was one issue: while w/ anonymous memory we can still add a VM_VOLATILE flag on the vma (so we can get SIGBUS semantics), for shared volatile ranges we don't have anything to hang a volatile flag on w/o adding some new vma-like structure to the address_space structure (much as we did in the past w/ earlier volatile range implementations). This would negate much of the point of using the dirty bits to simplify the shared volatility implementation. Thus Johannes is reasonably questioning the need for SIGBUS semantics, since if SIGBUS isn't needed, the simpler page-cleaning based volatility could potentially be used. Now, while for the case I'm personally most interested in (ashmem), zero-fill would technically be ok, since that's what Android does. Even so, I don't think it's the best approach for the interface, since applications may end up quite surprised by the results when they accidentally don't follow the "don't touch volatile pages" rule. That point aside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic, though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen d...@sr71.net wrote: > On 04/02/2014 10:18 AM, Johannes Weiner wrote: > > Hence my follow-up question in the other mail about how large we expect such code caches to become in practice in relationship to overall system memory. Are code caches interesting reclaim candidates to begin with? Are they big enough to make the machine thrash/swap otherwise? > A big chunk of the use cases here are for swapless systems anyway, so this is the *only* way for them to reclaim anonymous memory. Their choices are either to be constantly throwing away and rebuilding these objects, or to leave them in memory effectively pinned. > In practice I did see ashmem (the Android thing that we're trying to replace) get used a lot by the Android web browser when I was playing with it. John said that it got used for storing decompressed copies of images. Although images are a simpler case where it's easier to not touch volatile pages. I think Johannes is mostly concerned about cases where volatile pages are being accessed while they are volatile, and the Mozilla folks are so far the only viable case (in my mind... folks may have others) where pages are intentionally accessed while they're volatile, and which thus requires SIGBUS semantics. I suspect handling the SIGBUS and patching up the purged page you trapped on is likely much too complicated for most use cases. But I do think SIGBUS is preferable to zero-fill on purged page access, just because it's likely to be easier to debug applications. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner han...@cmpxchg.org wrote: > > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > >> On 04/01/2014 04:01 PM, Dave Hansen wrote: > >>> On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > >>>> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > >>> John, this was something that the Mozilla guys asked for, right? Any idea why this isn't ever a problem for them? > >> So one of their use cases for it is for library text. Basically they want to decompress a compressed library file into memory. Then they plan to mark the uncompressed pages volatile, and then be able to call into it. Ideally for them, the kernel would only purge cold pages, leaving the hot pages in memory. When they traverse a purged page, they handle the SIGBUS and patch the page up. > > How big are these libraries compared to overall system size? > Mike or Taras would have to refresh my memory on this detail. My recollection is it mostly has to do with keeping the on-disk size of the library small, so it can load off of slow media very quickly. > > > Now.. this is not what I'd consider a normal use case, but was hoping to illustrate some of the more interesting uses and demonstrate the interface's flexibility. > > I'm just dying to hear a "normal" use case then. :) > So the more normal use case would be marking objects volatile and then non-volatile w/o accessing them in-between. In this case the zero-fill vs SIGBUS semantics don't really matter; it's really just a trade-off in how we handle applications deviating (intentionally or not) from this use case. > So to maybe flesh out the context here for folks who are following along (but weren't in the hallway at LSF :), Johannes made a fairly interesting proposal (Johannes: please correct me where I'm maybe slightly off) to use only the dirty bits of the ptes to mark a page as volatile. Then the kernel could reclaim these clean pages as it needed, and when we marked the range as non-volatile, the pages would be re-dirtied and, if any of the pages were missing, we could return a flag with the purged state. > This had some different semantics than what I've been working with for awhile (for example, any writes to pages would implicitly clear volatility), so I wasn't completely comfortable with it, but figured I'd think about it to see if it could be done. Particularly since it would in some ways simplify the tmpfs/shm shared volatility that I'd eventually like to do. > After thinking it over in the hallway, I talked some of the details w/ Johannes, and there was one issue: while w/ anonymous memory we can still add a VM_VOLATILE flag on the vma (so we can get SIGBUS semantics), for shared volatile ranges we don't have anything to hang a volatile flag on w/o adding some new vma-like structure to the address_space structure (much as we did in the past w/ earlier volatile range implementations). This would negate much of the point of using the dirty bits to simplify the shared volatility implementation. Thus Johannes is reasonably questioning the need for SIGBUS semantics, since if SIGBUS isn't needed, the simpler page-cleaning based volatility could potentially be used. Thanks for summarizing this again! > Now, while for the case I'm personally most interested in (ashmem), zero-fill would technically be ok, since that's what Android does. Even so, I don't think it's the best approach for the interface, since applications may end up quite surprised by the results when they accidentally don't follow the "don't touch volatile pages" rule. > That point aside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic, though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. > Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? No, this is accurate. However, I don't really see how this is different than any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it?
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen d...@sr71.net wrote: > > On 04/02/2014 10:18 AM, Johannes Weiner wrote: > >> Hence my follow-up question in the other mail about how large we expect such code caches to become in practice in relationship to overall system memory. Are code caches interesting reclaim candidates to begin with? Are they big enough to make the machine thrash/swap otherwise? > > A big chunk of the use cases here are for swapless systems anyway, so this is the *only* way for them to reclaim anonymous memory. Their choices are either to be constantly throwing away and rebuilding these objects, or to leave them in memory effectively pinned. > > In practice I did see ashmem (the Android thing that we're trying to replace) get used a lot by the Android web browser when I was playing with it. John said that it got used for storing decompressed copies of images. > Although images are a simpler case where it's easier to not touch volatile pages. I think Johannes is mostly concerned about cases where volatile pages are being accessed while they are volatile, and the Mozilla folks are so far the only viable case where pages are intentionally accessed while they're volatile, thus requiring SIGBUS semantics. Yes, absolutely, that is my only concern. Compressed images as in Android can easily be marked non-volatile before they are accessed again. Code caches are harder because control is handed off to the CPU, but I'm not entirely sure yet whether these are in fact interesting reclaim candidates. > I suspect handling the SIGBUS and patching up the purged page you trapped on is likely much too complicated for most use cases. But I do think SIGBUS is preferable to zero-fill on purged page access, just because it's likely to be easier to debug applications. Fully agreed, but it seems a bit overkill to add a separate syscall, a range-tree on top of shmem address_spaces, and an essentially new programming model based on SIGBUS userspace fault handling (incl. all the complexities and confusion this inevitably will bring when people DO end up passing these pointers into kernel space) just to be a bit nicer about use-after-free bugs in applications.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
Hi everyone, On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If you have a third option you're thinking of, I'd of course be interested in hearing it. I actually thought the way of being notified with a page fault (sigbus or whatever) was the most efficient way of using volatile ranges. Why having to call a syscall to know if you can still access the volatile range, if there was no VM pressure before the access? syscalls are expensive, accessing the memory direct is not. Only if it page was actually missing and a page fault would fire, you'd take the slowpath. The usages I see for this are plenty, like for maintaining caches in memory that may be big and would be nice to discard if there's VM pressure, jpeg uncompressed images sounds like a candidate too. So the browser size would shrink if there's VM pressure, instead of ending up swapping out uncompressed image data that can be regenerated more quickly with the CPU than with swapins. Now... once you've chosen SIGBUS semantics, there will be folks who will try to exploit the fact that we get SIGBUS on purged page access (at least on the user-space side) and will try to access pages that are volatile until they are purged and try to then handle the SIGBUS to fix things up. Those folks exploiting that will have to be particularly careful not to pass volatile data to the kernel, and if they do they'll have to be smart enough to handle the EFAULT, etc. That's really all their problem, because they're being clever. :) I'm actually working on feature that would solve the problem for the syscalls accessing missing volatile pages. So you'd never see a -EFAULT because all syscalls won't return even if they encounters a missing page in the volatile range dropped by the VM pressure. It's called userfaultfd. You call sys_userfaultfd(flags) and it connects the current mm to a pseudo filedescriptor. 
The file descriptor works similarly to eventfd but with a different protocol. You need a thread that will never access the userfault area with the CPU, and that is responsible for polling the userfaultfd and talking the userfaultfd protocol to fill in missing pages. After a POLLIN event, the userfault thread reads the virtual addresses of the faults that must have happened on some other thread of the same mm, and then writes back a handled virtual range into the fd, after the page (or pages, if multiple) have been regenerated and mapped in with sys_remap_anon_pages(), mremap, or equivalent atomic pagetable page swapping. Then, depending on the resolved range written back into the fd, the kernel will wake up the thread or threads that were waiting in kernel mode on the handled virtual range, and retry the fault without ever exiting kernel mode.

We need this in KVM for running the guest on memory that is on other nodes or in other processes (postcopy live migration is the most common use case, but there are others like memory externalization and cross-node KSM in the cloud, to keep a single copy of memory across multiple nodes, externalized to the VM and to the host node).

This thread made me wonder if we could mix the two features: you would then depend on MADV_USERFAULT and userfaultfd to deliver to userland the faults happening on the volatile pages that have been purged as a result of VM pressure. I'm just saying this after Johannes mentioned the issue with syscalls returning -EFAULT, because that is the very issue that the userfaultfd is going to solve for the KVM migration thread. What I'm thinking now would be to mark the volatile range also MADV_USERFAULT, then call userfaultfd, and instead of having the cache-regeneration slow path inside the SIGBUS handler, run it in the userfault thread that polls the userfaultfd.
Then you could write the volatile ranges to disk with a write() syscall (or use any other syscall on the volatile ranges), without having to worry about -EFAULT being returned because one page was discarded. And if MADV_USERFAULT is not called in combination with the vrange syscalls, it'd still work without the userfault, just with the vrange syscalls only.

In short, the idea would be to let the userfault code solve the fault delivery to userland for you, and let the vrange syscalls focus only on the page purging problem, without having to worry about what happens when something accesses a missing page. But if you don't intend to solve the syscall -EFAULT problem, then the overlap is probably still as thin as I thought it was before (as also mentioned in the link below).

Thanks, Andrea

PS. my last email about this from a more KVM-centric point of view: http://www.spinics.net/lists/kvm/msg101449.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner han...@cmpxchg.org wrote: On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic, though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else?

No, this is accurate. However, I don't really see how this is different from any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it?

The part that troubles me is that I see the purged state as kernel data being corrupted by userland in this case. The kernel will tell userspace that no pages were purged, even though they were, only because userspace made an errant read of a page and got garbage data back.

thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote: Hi everyone, On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If you have a third option you're thinking of, I'd of course be interested in hearing it. I actually thought the way of being notified with a page fault (sigbus or whatever) was the most efficient way of using volatile ranges. Why having to call a syscall to know if you can still access the volatile range, if there was no VM pressure before the access? syscalls are expensive, accessing the memory direct is not. Only if it page was actually missing and a page fault would fire, you'd take the slowpath. Not everybody wants to actually come back for the data in the range, allocators and message passing applications just want to be able to reuse the memory mapping. By tying the volatility to the dirty bit in the page tables, an allocator could simply clear those bits once on free(). When malloc() hands out this region again, the user is expected to write, which will either overwrite the old page, or, if it was purged, fault in a fresh zero page. But there is no second syscall needed to clear volatility. Now... once you've chosen SIGBUS semantics, there will be folks who will try to exploit the fact that we get SIGBUS on purged page access (at least on the user-space side) and will try to access pages that are volatile until they are purged and try to then handle the SIGBUS to fix things up. Those folks exploiting that will have to be particularly careful not to pass volatile data to the kernel, and if they do they'll have to be smart enough to handle the EFAULT, etc. That's really all their problem, because they're being clever. :) I'm actually working on feature that would solve the problem for the syscalls accessing missing volatile pages. 
So you'd never see a -EFAULT because all syscalls won't return even if they encounters a missing page in the volatile range dropped by the VM pressure. It's called userfaultfd. You call sys_userfaultfd(flags) and it connects the current mm to a pseudo filedescriptor. The filedescriptor works similarly to eventfd but with a different protocol. You need a thread that will never access the userfault area with the CPU, that is responsible to poll on the userfaultfd and talk the userfaultfd protocol to fill-in missing pages. The userfault thread after a POLLIN event reads the virtual addresses of the fault that must have happened on some other thread of the same mm, and then writes back an handled virtual range into the fd, after the page (or pages if multiple) have been regenerated and mapped in with sys_remap_anon_pages(), mremap or equivalent atomic pagetable page swapping. Then depending on the solved range written back into the fd, the kernel will wakeup the thread or threads that were waiting in kernel mode on the handled virtual range, and retry the fault without ever exiting kernel mode. We need this in KVM for running the guest on memory that is on other nodes or other processes (postcopy live migration is the most common use case but there are others like memory externalization and cross-node KSM in the cloud, to keep a single copy of memory across multiple nodes and externalized to the VM and to the host node). This thread made me wonder if we could mix the two features and you would then depend on MADV_USERFAULT and userfaultfd to deliver to userland the faults happening on the volatile pages that have been purged as result of VM pressure. I'm just saying this after Johannes mentioned the issue with syscalls returning -EFAULT. Because that is the very issue that the userfaultfd is going to solve for the KVM migration thread. 
What I'm thinking now would be to mark the volatile range also MADV_USERFAULT and then calling userfaultfd and instead of having the cache regeneration slow path inside the SIGBUS handler, to run it in the userfault thread that polls the userfaultfd. Then you could write the volatile ranges to disk with a write() syscall (or use any other syscall on the volatile ranges), without having to worry about -EFAULT being returned because one page was discarded. And if MADV_USERFAULT is not called in combination with vrange syscalls, then it'd still work without the userfault, but with the vrange syscalls only. In short the idea would be to let the userfault code solve the fault delivery to userland for you, and make the vrange syscalls only focus on the page purging problem, without having to worry about what happens when something access a missing page. Yes, the two seem certainly combinable to me. madvise(MADV_FREE | MADV_USERFAULT) to allow purging and userspace fault handling. In the fault slowpath, you can then regenerate any missing data and do MADV_FREE again if it should remain volatile. And again, any actual writes to the region
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 11:07 AM, Johannes Weiner han...@cmpxchg.org wrote: On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote: I suspect handling the SIGBUS and patching up the purged page you trapped on is likely much too complicated for most use cases. But I do think SIGBUS is preferable to zero-fill on purged page access, just because it's likely to be easier to debug applications.

Fully agreed, but it seems a bit overkill to add a separate syscall, a range-tree on top of shmem address_spaces, and an essentially new programming model based on SIGBUS userspace fault handling (incl. all the complexities and confusion this inevitably will bring when people DO end up passing these pointers into kernel space) just to be a bit nicer about use-after-free bugs in applications.

It's more about making an interface that has graspable semantics for userspace, instead of having the semantics be a side effect of the implementation. Tying volatility to the page-clean state and page-was-purged to page-present seems problematic to me, because there are too many ways to change the page-clean or page-present state outside of the interface being proposed. I feel this causes a cascade of corner cases that have to be explained to users of the interface.

Also, I disagree that we're adding a new programming model, as SIGBUSes can already be caught; it's just that there's not usually much one can do, whereas with volatile pages it's more likely something could be done. And again, it's really just a side effect of having semantics (SIGBUS on purged page access) that are more helpful from an application's perspective.

As for the separate syscall: again, this is mainly needed to handle allocation failures that happen mid-way through modifying the range. There may still be a way to do the allocation first and only after it succeeds do the modification.
The vma merge/splitting logic doesn't make this easy, but if we can be sure that on a failed split of 1 vma -> 3 vmas (which may fail half way) we can re-merge without allocation and error out (without having to do any other allocations), this might be avoidable. I'm still wanting to look at this. If so, it would be easier to re-add this support under madvise, if folks really really don't like the new syscall. For the most part, having the separate syscall allows us to discuss other details of the semantics, which to me are more important than the syscall naming.

thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner han...@cmpxchg.org wrote: On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? No, this is accurate. However, I don't really see how this is different than any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it? The part that troubles me is that I see the purged state as kernel data being corrupted by userland in this case. The kernel will tell userspace that no pages were purged, even though they were. Only because userspace made an errant read of a page, and got garbage data back. That sounds overly dramatic to me. First of all, this data still reflects accurately the actions of userspace in this situation. And secondly, the kernel does not rely on this data to be meaningful from a userspace perspective to function correctly. It's really nothing but a use-after-free bug that has consequences for no-one but the faulty application. 
The thing that IS new is that even a read is enough to corrupt your data in this case. MADV_REVIVE could return 0 if all pages in the specified range were present, -Esomething if otherwise. That would be semantically sound even if userspace messes up.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 11:31 AM, Andrea Arcangeli wrote: On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: Now... once you've chosen SIGBUS semantics, there will be folks who will try to exploit the fact that we get SIGBUS on purged page access (at least on the user-space side) and will try to access pages that are volatile until they are purged and try to then handle the SIGBUS to fix things up. Those folks exploiting that will have to be particularly careful not to pass volatile data to the kernel, and if they do they'll have to be smart enough to handle the EFAULT, etc. That's really all their problem, because they're being clever. :)

I'm actually working on a feature that would solve the problem for the syscalls accessing missing volatile pages. So you'd never see a -EFAULT, because syscalls won't fail even when they encounter a missing page in a volatile range dropped by VM pressure. It's called userfaultfd. You call sys_userfaultfd(flags) and it connects the current mm to a pseudo file descriptor. The file descriptor works similarly to eventfd but with a different protocol.

So yea! I actually think (it's been awhile now) I mentioned your work to Taras (or maybe he mentioned it to me?), but it did seem like the userfaultfd would be a better solution for the style of fault handling they were thinking about. (Especially as actually handling SIGBUS and doing something sane in a large threaded application seems very difficult.)

That said, explaining volatile ranges as a concept has been difficult enough without mixing in other new concepts :), so I'm hesitant to tie the functionality together until it's clear the userfaultfd approach is likely to land. But maybe I need to take a closer look at it.

thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 12:47 PM, Johannes Weiner wrote: On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner han...@cmpxchg.org wrote: On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? No, this is accurate. However, I don't really see how this is different than any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it? The part that troubles me is that I see the purged state as kernel data being corrupted by userland in this case. The kernel will tell userspace that no pages were purged, even though they were. Only because userspace made an errant read of a page, and got garbage data back. That sounds overly dramatic to me. First of all, this data still reflects accurately the actions of userspace in this situation. And secondly, the kernel does not rely on this data to be meaningful from a userspace perspective to function correctly. 
<insert dramatic-chipmunk video w/ text overlay "errant read corrupted volatile page purge state"> Maybe you're right, but I feel this is the sort of thing application developers would be surprised and annoyed by.

It's really nothing but a use-after-free bug that has consequences for no-one but the faulty application. The thing that IS new is that even a read is enough to corrupt your data in this case. MADV_REVIVE could return 0 if all pages in the specified range were present, -Esomething if otherwise. That would be semantically sound even if userspace messes up.

So it's semantically more of just a combined mincore+dirty operation.. and nothing more?

What are other folks thinking about this? Although I don't particularly like it, I probably could go along with Johannes' approach, forgoing SIGBUS for zero-fill and adopting the semantics that are in my mind a bit stranger. This would allow for ashmem-like behavior w/ the additional write-clears-volatile-state and read-clears-purged-state constraints (which I don't think would be problematic for Android, but am not totally sure).

But I do worry that these semantics are easier for kernel-mm-developers to grasp, but are much much harder for application developers to understand. Additionally, unless we could really leave access-after-volatile as totally undefined behavior, this would lock us into O(page) behavior and would remove the possibility of the O(log(ranges)) behavior Minchan and I were able to get (admittedly with more complicated code - but something I was hoping we'd be able to get back to after the base semantics and interface behavior were understood and merged). Since applications will have bugs and will access after volatile, we won't be able to get away with that sort of behavioral flexibility.
thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed 02-04-14 13:13:34, John Stultz wrote: On 04/02/2014 12:47 PM, Johannes Weiner wrote: On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner han...@cmpxchg.org wrote: On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? No, this is accurate. However, I don't really see how this is different than any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it? The part that troubles me is that I see the purged state as kernel data being corrupted by userland in this case. The kernel will tell userspace that no pages were purged, even though they were. Only because userspace made an errant read of a page, and got garbage data back. That sounds overly dramatic to me. First of all, this data still reflects accurately the actions of userspace in this situation. And secondly, the kernel does not rely on this data to be meaningful from a userspace perspective to function correctly. 
<insert dramatic-chipmunk video w/ text overlay "errant read corrupted volatile page purge state"> Maybe you're right, but I feel this is the sort of thing application developers would be surprised and annoyed by.

It's really nothing but a use-after-free bug that has consequences for no-one but the faulty application. The thing that IS new is that even a read is enough to corrupt your data in this case. MADV_REVIVE could return 0 if all pages in the specified range were present, -Esomething if otherwise. That would be semantically sound even if userspace messes up.

So it's semantically more of just a combined mincore+dirty operation.. and nothing more?

What are other folks thinking about this? Although I don't particularly like it, I probably could go along with Johannes' approach, forgoing SIGBUS for zero-fill and adopting the semantics that are in my mind a bit stranger. This would allow for ashmem-like behavior w/ the additional write-clears-volatile-state and read-clears-purged-state constraints (which I don't think would be problematic for Android, but am not totally sure).

But I do worry that these semantics are easier for kernel-mm-developers to grasp, but are much much harder for application developers to understand.

Yeah, I have to admit that although the simplicity of the implementation looks compelling, the interface from a userspace POV looks weird.

Honza -- Jan Kara j...@suse.cz SUSE Labs, CR
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 04:01 PM, Dave Hansen wrote: > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: >>> Either way, optimistic volatile pointers are nowhere near as >>> transparent to the application as the above description suggests, >>> which makes this usecase not very interesting, IMO. >> ... however, I think you're still derating the value way too much. The >> case of user space doing elastic memory management is more and more >> common, and for a lot of those applications it is perfectly reasonable >> to either not do system calls or to have to devolatilize first. > The SIGBUS is only in cases where the memory is set as volatile and > _then_ accessed, right? Not just set volatile and then accessed, but when a volatile page has been purged and then accessed without being made non-volatile. > John, this was something that the Mozilla guys asked for, right? Any > idea why this isn't ever a problem for them? So one of their use cases for it is for library text. Basically they want to decompress a compressed library file into memory. Then they plan to mark the uncompressed pages volatile, and then be able to call into it. Ideally for them, the kernel would only purge cold pages, leaving the hot pages in memory. When they traverse a purged page, they handle the SIGBUS and patch the page up. Now.. this is not what I'd consider a normal use case, but I was hoping to illustrate some of the more interesting uses and demonstrate the interface's flexibility. Also it provided a clear example of the benefits of doing LRU-based cold-page purging rather than full-object purging. Though I think the same could be demonstrated in the simpler case of a large cache of objects that the application wants to mark volatile in one pass, unmarking sub-objects as it needs.
thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 09:03 PM, John Stultz wrote: > > So, maybe it's best to ignore the fact that folks want to do semi-crazy > user-space faulting via SIGBUS. At least to start with. Let's look at the > semantics for the "normal" mark volatile, never touch the pages until you > mark non-volatile - basically where accessing volatile pages is similar > to a use-after-free bug. > > So, for the most part, I'd say the proposed SIGBUS semantics don't > complicate things for this basic use-case, at least when compared with > things like zero-fill. If an application accidentally accesses a > purged volatile page, I think SIGBUS is the right thing to do. They most > likely immediately crash, but it's better than them moving along with > silent corruption because they're mucking with zero-filled pages. > > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If > you have a third option you're thinking of, I'd of course be interested > in hearing it. > People already do SIGBUS for mmap, so there is nothing new here. > Now... once you've chosen SIGBUS semantics, there will be folks who will > try to exploit the fact that we get SIGBUS on purged page access (at > least on the user-space side) and will try to access pages that are > volatile until they are purged and try to then handle the SIGBUS to fix > things up. Those folks exploiting that will have to be particularly > careful not to pass volatile data to the kernel, and if they do they'll > have to be smart enough to handle the EFAULT, etc. That's really all > their problem, because they're being clever. :) Yep. -hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 02:21 PM, Johannes Weiner wrote: > [ I tried to bring this up during LSFMM but it got drowned out. > Trying again :) ] > > On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote: >> Optimistic method: >> 1) Userland marks a large range of data as volatile >> 2) Userland continues to access the data as it needs. >> 3) If userland accesses a page that has been purged, the kernel will >> send a SIGBUS >> 4) Userspace can trap the SIGBUS, mark the affected pages as >> non-volatile, and refill the data as needed before continuing on > As far as I understand, if a pointer to volatile memory makes it into > a syscall and the fault is trapped in kernel space, there won't be a > SIGBUS, the syscall will just return -EFAULT. > > Handling this would mean annotating every syscall invocation to check > for -EFAULT, refill the data, and then restart the syscall. This is > complicated even before taking external libraries into account, which > may not propagate syscall returns properly or may not be reentrant at > the necessary granularity. > > Another option is to never pass volatile memory pointers into the > kernel, but that too means that knowledge of volatility has to travel > alongside the pointers, which will either result in more complexity > throughout the application or severely limited scope of volatile > memory usage. > > Either way, optimistic volatile pointers are nowhere near as > transparent to the application as the above description suggests, > which makes this usecase not very interesting, IMO. If we can support > it at little cost, why not, but I don't think we should complicate the > common usecases to support this one. So yea, thanks again for all the feedback at LSF-MM! I'm trying to get things integrated for a v13 here shortly (although with visitors in town this week it may not happen until next week). So, maybe it's best to ignore the fact that folks want to do semi-crazy user-space faulting via SIGBUS. At least to start with.
Let's look at the semantics for the "normal" mark volatile, never touch the pages until you mark non-volatile - basically where accessing volatile pages is similar to a use-after-free bug.

So, for the most part, I'd say the proposed SIGBUS semantics don't complicate things for this basic use-case, at least when compared with things like zero-fill. If an application accidentally accesses a purged volatile page, I think SIGBUS is the right thing to do. They most likely immediately crash, but it's better than them moving along with silent corruption because they're mucking with zero-filled pages.

So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If you have a third option you're thinking of, I'd of course be interested in hearing it.

Now... once you've chosen SIGBUS semantics, there will be folks who will try to exploit the fact that we get SIGBUS on purged page access (at least on the user-space side) and will try to access pages that are volatile until they are purged and try to then handle the SIGBUS to fix things up. Those folks exploiting that will have to be particularly careful not to pass volatile data to the kernel, and if they do they'll have to be smart enough to handle the EFAULT, etc. That's really all their problem, because they're being clever. :)

I've maybe made a mistake in talking at length about those use cases, because I wanted to make sure folks didn't have suggestions on how to better address those cases (so far I've not heard any), and it sort of helps wrap folks' heads around at least some of the potential variations on the desired purging semantics (LRU-based cold page purging, or entire-object-based purging).

Now, one other potential variant, which Keith brought up at LSF-MM, and others have mentioned before, is to have *any* volatile page access (purged or not) return a SIGBUS.
This seems "safe" in that it protects developers from themselves, and makes application behavior more deterministic (rather than depending on memory pressure). However, it also has the overhead of setting up the pte swp entries for each page in order to trip the SIGBUS. Since folks have explicitly asked for it, allowing non-purged volatile page access seems more flexible. And it's cheaper. So that's what I've been leaning towards. thanks again! -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > On 04/01/2014 02:21 PM, Johannes Weiner wrote: >> Either way, optimistic volatile pointers are nowhere near as >> transparent to the application as the above description suggests, >> which makes this usecase not very interesting, IMO. > > ... however, I think you're still derating the value way too much. The > case of user space doing elastic memory management is more and more > common, and for a lot of those applications it is perfectly reasonable > to either not do system calls or to have to devolatilize first. The SIGBUS is only in cases where the memory is set as volatile and _then_ accessed, right? John, this was something that the Mozilla guys asked for, right? Any idea why this isn't ever a problem for them?
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 02:21 PM, Johannes Weiner wrote: > > Either way, optimistic volatile pointers are nowhere near as > transparent to the application as the above description suggests, > which makes this usecase not very interesting, IMO. > ... however, I think you're still derating the value way too much. The case of user space doing elastic memory management is more and more common, and for a lot of those applications it is perfectly reasonable to either not do system calls or to have to devolatilize first. -hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 02:21 PM, Johannes Weiner wrote: > [ I tried to bring this up during LSFMM but it got drowned out. > Trying again :) ] > > On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote: >> Optimistic method: >> 1) Userland marks a large range of data as volatile >> 2) Userland continues to access the data as it needs. >> 3) If userland accesses a page that has been purged, the kernel will >> send a SIGBUS >> 4) Userspace can trap the SIGBUS, mark the affected pages as >> non-volatile, and refill the data as needed before continuing on > > As far as I understand, if a pointer to volatile memory makes it into > a syscall and the fault is trapped in kernel space, there won't be a > SIGBUS, the syscall will just return -EFAULT. > > Handling this would mean annotating every syscall invocation to check > for -EFAULT, refill the data, and then restart the syscall. This is > complicated even before taking external libraries into account, which > may not propagate syscall returns properly or may not be reentrant at > the necessary granularity. > > Another option is to never pass volatile memory pointers into the > kernel, but that too means that knowledge of volatility has to travel > alongside the pointers, which will either result in more complexity > throughout the application or severely limited scope of volatile > memory usage. > > Either way, optimistic volatile pointers are nowhere near as > transparent to the application as the above description suggests, > which makes this usecase not very interesting, IMO. If we can support > it at little cost, why not, but I don't think we should complicate the > common usecases to support this one. > The whole EFAULT thing is a fundamental problem with the kernel interface, and this is by no means the only place that suffers from it. The fact that we cannot reliably get SIGSEGV or SIGBUS because a pointer may have been passed into a system call is an enormous problem. The question is whether it is fixable at all. 
-hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
[ I tried to bring this up during LSFMM but it got drowned out. Trying again :) ] On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote: > Optimistic method: > 1) Userland marks a large range of data as volatile > 2) Userland continues to access the data as it needs. > 3) If userland accesses a page that has been purged, the kernel will > send a SIGBUS > 4) Userspace can trap the SIGBUS, mark the affected pages as > non-volatile, and refill the data as needed before continuing on As far as I understand, if a pointer to volatile memory makes it into a syscall and the fault is trapped in kernel space, there won't be a SIGBUS, the syscall will just return -EFAULT. Handling this would mean annotating every syscall invocation to check for -EFAULT, refill the data, and then restart the syscall. This is complicated even before taking external libraries into account, which may not propagate syscall returns properly or may not be reentrant at the necessary granularity. Another option is to never pass volatile memory pointers into the kernel, but that too means that knowledge of volatility has to travel alongside the pointers, which will either result in more complexity throughout the application or severely limited scope of volatile memory usage. Either way, optimistic volatile pointers are nowhere near as transparent to the application as the above description suggests, which makes this usecase not very interesting, IMO. If we can support it at little cost, why not, but I don't think we should complicate the common usecases to support this one.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 09:03 PM, John Stultz wrote: > So, maybe it's best to ignore the fact that folks want to do semi-crazy > user-space faulting via SIGBUS. At least to start with. Let's look at the > semantics of the "normal" case: mark volatile, never touch the pages until > you mark them non-volatile - basically where accessing volatile pages is > similar to a use-after-free bug. So, for the most part, I'd say the proposed > SIGBUS semantics don't complicate things for this basic use-case, at least > when compared with things like zero-fill. If an application accidentally > accesses a purged volatile page, I think SIGBUS is the right thing to do. > It will most likely crash immediately, but that's better than moving along > with silent corruption because it's mucking with zero-filled pages. So > between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If you > have a third option you're thinking of, I'd of course be interested in > hearing it. People already do SIGBUS for mmap, so there is nothing new here. > Now... once you've chosen SIGBUS semantics, there will be folks who will > try to exploit the fact that we get SIGBUS on purged-page access (at least > on the user-space side): they will access pages that are volatile until > they are purged, and then handle the SIGBUS to fix things up. Those folks > exploiting that will have to be particularly careful not to pass volatile > data to the kernel, and if they do they'll have to be smart enough to > handle the EFAULT, etc. That's really all their problem, because they're > being clever. :) Yep. -hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 04:01 PM, Dave Hansen wrote: > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: >>> Either way, optimistic volatile pointers are nowhere near as >>> transparent to the application as the above description suggests, >>> which makes this usecase not very interesting, IMO. >> ... however, I think you're still derating the value way too much. The >> case of user space doing elastic memory management is more and more >> common, and for a lot of those applications it is perfectly reasonable >> to either not do system calls or to have to devolatilize first. > The SIGBUS is only in cases where the memory is set as volatile and > _then_ accessed, right? Not just set volatile and then accessed, but when a volatile page has been purged and then accessed without being made non-volatile. > John, this was something that the Mozilla guys asked for, right? Any > idea why this isn't ever a problem for them? So one of their use cases is library text. Basically they want to decompress a compressed library file into memory, mark the uncompressed pages volatile, and then be able to call into them. Ideally for them, the kernel would purge only cold pages, leaving the hot pages in memory. When they traverse a purged page, they handle the SIGBUS and patch the page up. Now... this is not what I'd consider a normal use case, but I was hoping to illustrate some of the more interesting uses and demonstrate the interface's flexibility. It also provides a clear example of the benefits of LRU-based cold-page purging rather than full-object purging. Though I think the same could be demonstrated with the simpler case of a large cache of objects that the application wants to mark volatile in one pass, unmarking sub-objects as it needs them. 
thanks -john
[PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
Just wanted to send out an updated patch set that includes changes from some of the reviews. Hopefully folks will have some time to look them over prior to the LSF-MM discussion on volatile ranges on Tuesday (see below for LSF-MM discussion points to think about). New changes are: o Added flags argument to the syscall, which is unused, but per https://lwn.net/Articles/585415/ seems like a good idea. o Minor vma traversing cleanups suggested by Jan o Return an error when trying to mark unmapped regions o First pass implementation of marking pages referenced when they are marked volatile, so the pages in a range are set to the same "age" and will be approximately purged together. This behavior is still open for discussion. o Very naive implementation of anonymous page aging on swapless systems. This has clear performance issues, as we burn time overly scanning anonymous pages, but provides something concrete upon which to discuss what the best way would be to solve this. o Other minor code cleanups The first three patches are still the core functionality, which I'd really like further review on. The last two patches in this series are more discussion starters, and are less serious. Potential discussion items for LSF-MM to think about: o How to increase reviewer interest? - Lots of interest from application world o Page aging semantics when marking volatile. - Should marking volatile be the same as accessing pages? - Should volatile ranges be put on end of inactive lru? - Should we just punt this and have applications combine madvise() use with vrange() to specify range age? o Volatile page & purged page accounting - Volatility is stored in per-process vma, not page - vmstats are page based, how do we deal w/ COWed pages? o Aging anonymous memory on swapless systems - Any thoughts on improving over naive method? - Better volatile page accounting might help? - Do we need a separate volatile LRU? 
o Shared volatility on tmpfs/shm/memfd (required for ashmem) - Johannes' idea for clearing dirty bits? - vma-like structure on the address space? thanks -john Volatile ranges provide a method for userland to inform the kernel that a range of memory is safe to discard (ie: can be regenerated) but userspace may want to try to access it in the future. It can be thought of as similar to MADV_DONTNEED, except that the actual freeing of the memory is delayed and only done under memory pressure, and the user can try to cancel the action and be able to quickly access any unpurged pages. The idea originated from Android's ashmem, but I've since learned that other OSes provide similar functionality. This functionality allows for a number of interesting uses. One such example is: Userland caches that have kernel-triggered eviction under memory pressure. This allows the kernel to "rightsize" userspace caches for the current system-wide workload. Things like image bitmap caches, or rendered HTML in a hidden browser tab, where the data is not visible and can be regenerated if needed, are good examples. Both Chrome and Firefox already make use of volatile ranges via the ashmem interface: https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34 https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc There are two basic ways volatile ranges can be used: Explicit marking method: 1) Userland marks a range of memory that can be regenerated if necessary as volatile 2) Before accessing the memory again, userland marks the memory as nonvolatile, and the kernel will provide notification if any pages in the range have been purged. Optimistic method: 1) Userland marks a large range of data as volatile 2) Userland continues to access the data as it needs. 
3) If userland accesses a page that has been purged, the kernel will send a SIGBUS 4) Userspace can trap the SIGBUS, mark the affected pages as non-volatile, and refill the data as needed before continuing on You can read more about the history of volatile ranges here (~reverse chronological order): https://lwn.net/Articles/590991/ http://permalink.gmane.org/gmane.linux.kernel.mm/98848 http://permalink.gmane.org/gmane.linux.kernel.mm/98676 https://lwn.net/Articles/522135/ https://lwn.net/Kernel/Index/#Volatile_ranges Continuing from the last release, this revision is reduced in scope when compared to earlier attempts. I've only focused on handling volatility on anonymous memory, and we're storing the volatility in the VMA. This may have performance implications compared with the earlier approach, but it does simplify the approach. I'm open to expanding functionality via flags arguments, but for now I'm wanting to keep focus on what the right default behavior should be and keep the use cases restricted to help get reviewer interest. Further, the page
[PATCH 0/5] Volatile Ranges (v12) LSF-MM discussion fodder
Just wanted to send out an updated patch set that includes changes from some of the reviews. Hopefully folks will have some time to look them over prior to the LSF-MM discussion on volatile ranges on Tuesday (see below for LSF-MM discussion points to think about). New changes are: o Added flags argument to the syscall, which is unused, but per https://lwn.net/Articles/585415/ seems like a good idea. o Minor vma traversing cleanups suggested by Jan o Return an error when trying to mark unmapped regions o First pass implementation of marking pages referenced when they are marked volatile, so the pages in a range are set to the same age and will be approximately purged together. This behavior is still open for discussion. o Very naive implementation of anonymous page aging on swapless systems. This has clear performance issues, as we burn time overly scanning anonymous pages, but provides something concrete upon which to discuss what the best way would be to solve this. o Other minor code cleanups The first three patches are still the core functionality, which I'd really like further review on. The last two patches in this series are more discussion starters, and are less serious. Potential discussion items for LSF-MM to think about: o How to increase reviewer interest? - Lots of interest from application world o Page aging semantics when marking volatile. - Should marking volatile be the same as accessing pages? - Should volatile ranges be put on end of inactive lru? - Should we just punt this and have applications combine madvise() use with vrange() to specify range age? o Volatile page purged page accounting - Volatility is stored in per-process vma, not page - vmstats are page based, how do we deal w/ COWed pages? o Aging anonymous memory on swapless systems - Any thoughts on improving over naive method? - Better volatile page accounting might help? - Do we need a separate volatile LRU? 
o Shared volatility on tmpfs/shm/memfd (required for ashmem) - Johannes idea for clearing dirty bits? - vma-like structure on the address space? thanks -john Volatile ranges provides a method for userland to inform the kernel that a range of memory is safe to discard (ie: can be regenerated) but userspace may want to try access it in the future. It can be thought of as similar to MADV_DONTNEED, but that the actual freeing of the memory is delayed and only done under memory pressure, and the user can try to cancel the action and be able to quickly access any unpurged pages. The idea originated from Android's ashmem, but I've since learned that other OSes provide similar functionality. This functionality allows for a number of interesting uses. One such example is: Userland caches that have kernel triggered eviction under memory pressure. This allows for the kernel to rightsize userspace caches for current system-wide workload. Things like image bitmap caches, or rendered HTML in a hidden browser tab, where the data is not visible and can be regenerated if needed, are good examples. Both Chrome and Firefox already make use of volatile ranges via the ashmem interface: https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34 https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc There are two basic ways volatile ranges can be used: Explicit marking method: 1) Userland marks a range of memory that can be regenerated if necessary as volatile 2) Before accessing the memory again, userland marks the memory as nonvolatile, and the kernel will provide notification if any pages in the range has been purged. Optimistic method: 1) Userland marks a large range of data as volatile 2) Userland continues to access the data as it needs. 
3) If userland accesses a page that has been purged, the kernel will send a SIGBUS 4) Userspace can trap the SIGBUS, mark the affected pages as non-volatile, and refill the data as needed before continuing on You can read more about the history of volatile ranges here (~reverse chronological order): https://lwn.net/Articles/590991/ http://permalink.gmane.org/gmane.linux.kernel.mm/98848 http://permalink.gmane.org/gmane.linux.kernel.mm/98676 https://lwn.net/Articles/522135/ https://lwn.net/Kernel/Index/#Volatile_ranges Continuing from the last release, this revision is reduced in scope when compared to earlier attempts. I've only focused on handled volatility on anonymous memory, and we're storing the volatility in the VMA. This may have performance implications compared with the earlier approach, but it does simplify the approach. I'm open to expanding functionality via flags arugments, but for now I'm wanting to keep focus on what the right default behavior should be and keep the use cases restricted to help get reviewer interest. Further, the page