Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 01:13 PM, John Stultz wrote:
> On 04/02/2014 12:47 PM, Johannes Weiner wrote:
>> It's really nothing but a use-after-free bug that has consequences for
>> no-one but the faulty application. The thing that IS new is that even
>> a read is enough to corrupt your data in this case.
>>
>> MADV_REVIVE could return 0 if all pages in the specified range were
>> present, -Esomething if otherwise. That would be semantically sound
>> even if userspace messes up.
>
> So it's semantically just a combined mincore+dirty operation, and
> nothing more?
>
> What are other folks thinking about this? Although I don't particularly
> like it, I probably could go along with Johannes' approach, forgoing
> SIGBUS for zero-fill and adopting the semantics that are, in my mind, a
> bit stranger. This would allow for ashmem-style behavior w/ the
> additional write-clears-volatile-state and read-clears-purged-state
> constraints (which I don't think would be problematic for Android, but
> am not totally sure).
>
> But I do worry that these semantics are easier for kernel-mm developers
> to grasp, but much, much harder for application developers to
> understand.

So I don't feel like we've gotten enough feedback for consensus here.
Thus, to at least address other issues pointed out at LSF-MM, I'm going
to shortly send out a v13 of the patchset which keeps the previous
approach instead of adopting Johannes' suggested approach here.

If folks do prefer Johannes' approach, please speak up, as I'm willing
to give it a whirl despite my concerns about the subtle semantics.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/07/2014 09:32 PM, Kevin Easton wrote:
> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
>> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote:
>>> I'm just dying to hear a "normal" use case then. :)
>>
>> So the more "normal" use case would be marking objects volatile and
>> then non-volatile w/o accessing them in-between. In this case the
>> zero-fill vs SIGBUS semantics don't really matter; it's really just a
>> trade-off in how we handle applications deviating (intentionally or
>> not) from this use case.
>>
>> So to maybe flesh out the context here for folks who are following
>> along (but weren't in the hallway at LSF :), Johannes made a fairly
>> interesting proposal (Johannes: please correct me where I'm maybe
>> slightly off) to use only the dirty bits of the ptes to mark a page
>> as volatile. Then the kernel could reclaim these clean pages as it
>> needed, and when we marked the range as non-volatile, the pages would
>> be re-dirtied, and if any of the pages were missing, we could return
>> a flag with the purged state. This had some different semantics than
>> what I've been working with for a while (for example, any writes to
>> pages would implicitly clear volatility), so I wasn't completely
>> comfortable with it, but figured I'd think about it to see if it
>> could be done. Particularly since it would in some ways simplify the
>> tmpfs/shm shared volatility that I'd eventually like to do.
> ...
>> Now, while for the case I'm personally most interested in (ashmem),
>> zero-fill would technically be ok, since that's what Android does.
>> Even so, I don't think it's the best approach for the interface,
>> since applications may end up quite surprised by the results when
>> they accidentally don't follow the "don't touch volatile pages" rule.
>>
>> That point aside, I think the other problem with the page-cleaning
>> volatility approach is that there are other awkward side effects.
>> For example: say an application marks a range as volatile. One page
>> in the range is then purged. The application, due to a bug or
>> otherwise, reads the volatile range. This causes the page to be
>> zero-filled in, and the application silently uses the corrupted data
>> (which isn't great). More problematic, though, is that by faulting
>> the page in, they've in effect lost the purge state for that page.
>> When the application then goes to mark the range as non-volatile,
>> all pages are present, so we'd return that no pages were purged.
>> From an application perspective this is pretty ugly.
>
> The write-implicitly-clears-volatile semantics would actually be
> an advantage for some use cases. If you have a volatile cache of
> many sub-page-size objects, the application can just include at
> the start of each page "int present, in_use;". "present" is set
> to non-zero before marking volatile, and when the application wants
> to unmark as volatile it writes to "in_use" and tests the value of
> "present". No need for a syscall at all, although it does take a
> minor fault.
>
> The syscall would be better for the case of large objects, though.
>
> Or is that fatally flawed?

Well, as you note, each object would then have to be page size or
smaller, which limits some of the potential use cases. However, these
semantics would match better with the MADV_FREE proposal Minchan is
pushing, so this method would work fine there.

thanks
-john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote:
> > I'm just dying to hear a "normal" use case then. :)
>
> So the more "normal" use case would be marking objects volatile and
> then non-volatile w/o accessing them in-between. In this case the
> zero-fill vs SIGBUS semantics don't really matter; it's really just a
> trade-off in how we handle applications deviating (intentionally or
> not) from this use case.
>
> So to maybe flesh out the context here for folks who are following
> along (but weren't in the hallway at LSF :), Johannes made a fairly
> interesting proposal (Johannes: please correct me where I'm maybe
> slightly off) to use only the dirty bits of the ptes to mark a page
> as volatile. Then the kernel could reclaim these clean pages as it
> needed, and when we marked the range as non-volatile, the pages would
> be re-dirtied, and if any of the pages were missing, we could return
> a flag with the purged state. This had some different semantics than
> what I've been working with for a while (for example, any writes to
> pages would implicitly clear volatility), so I wasn't completely
> comfortable with it, but figured I'd think about it to see if it
> could be done. Particularly since it would in some ways simplify the
> tmpfs/shm shared volatility that I'd eventually like to do.
...
> Now, while for the case I'm personally most interested in (ashmem),
> zero-fill would technically be ok, since that's what Android does.
> Even so, I don't think it's the best approach for the interface,
> since applications may end up quite surprised by the results when
> they accidentally don't follow the "don't touch volatile pages" rule.
>
> That point aside, I think the other problem with the page-cleaning
> volatility approach is that there are other awkward side effects. For
> example: say an application marks a range as volatile. One page in
> the range is then purged. The application, due to a bug or otherwise,
> reads the volatile range. This causes the page to be zero-filled in,
> and the application silently uses the corrupted data (which isn't
> great). More problematic, though, is that by faulting the page in,
> they've in effect lost the purge state for that page. When the
> application then goes to mark the range as non-volatile, all pages
> are present, so we'd return that no pages were purged. From an
> application perspective this is pretty ugly.

The write-implicitly-clears-volatile semantics would actually be
an advantage for some use cases. If you have a volatile cache of
many sub-page-size objects, the application can just include at
the start of each page "int present, in_use;". "present" is set
to non-zero before marking volatile, and when the application wants
to unmark as volatile it writes to "in_use" and tests the value of
"present". No need for a syscall at all, although it does take a
minor fault.

The syscall would be better for the case of large objects, though.

Or is that fatally flawed?

    - Kevin
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 03:27:44PM -0400, Johannes Weiner wrote:
> On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> > Hi everyone,
> >
> > On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > > So between zero-fill and SIGBUS, I think SIGBUS makes the most
> > > sense. If you have a third option you're thinking of, I'd of
> > > course be interested in hearing it.
> >
> > I actually thought the way of being notified with a page fault
> > (sigbus or whatever) was the most efficient way of using volatile
> > ranges.
> >
> > Why having to call a syscall to know if you can still access the
> > volatile range, if there was no VM pressure before the access?
> > Syscalls are expensive, accessing the memory directly is not. Only
> > if the page was actually missing and a page fault would fire, you'd
> > take the slowpath.
>
> Not everybody wants to actually come back for the data in the range;
> allocators and message-passing applications just want to be able to
> reuse the memory mapping.
>
> By tying the volatility to the dirty bit in the page tables, an
> allocator could simply clear those bits once on free(). When malloc()
> hands out this region again, the user is expected to write, which will
> either overwrite the old page or, if it was purged, fault in a fresh
> zero page. But there is no second syscall needed to clear volatility.
>
> > > Now... once you've chosen SIGBUS semantics, there will be folks
> > > who will try to exploit the fact that we get SIGBUS on purged page
> > > access (at least on the user-space side) and will try to access
> > > pages that are volatile until they are purged and try to then
> > > handle the SIGBUS to fix things up. Those folks exploiting that
> > > will have to be particularly careful not to pass volatile data to
> > > the kernel, and if they do they'll have to be smart enough to
> > > handle the EFAULT, etc. That's really all their problem, because
> > > they're being clever. :)
> >
> > I'm actually working on a feature that would solve the problem for
> > syscalls accessing missing volatile pages. So you'd never see an
> > -EFAULT, because syscalls won't return even if they encounter a
> > missing page in a volatile range dropped by VM pressure.
> >
> > It's called userfaultfd. You call sys_userfaultfd(flags) and it
> > connects the current mm to a pseudo filedescriptor. The
> > filedescriptor works similarly to eventfd but with a different
> > protocol.
> >
> > You need a thread that will never access the userfault area with
> > the CPU, that is responsible to poll on the userfaultfd and talk
> > the userfaultfd protocol to fill in missing pages. After a POLLIN
> > event, the userfault thread reads the virtual addresses of the
> > fault that must have happened on some other thread of the same mm,
> > and then writes back a "handled" virtual range into the fd, after
> > the page (or pages, if multiple) have been regenerated and mapped
> > in with sys_remap_anon_pages(), mremap, or equivalent atomic
> > pagetable page swapping. Then, depending on the "solved" range
> > written back into the fd, the kernel will wake up the thread or
> > threads that were waiting in kernel mode on the "handled" virtual
> > range, and retry the fault without ever exiting kernel mode.
> >
> > We need this in KVM for running the guest on memory that is on
> > other nodes or in other processes (postcopy live migration is the
> > most common use case, but there are others, like memory
> > externalization and cross-node KSM in the cloud, to keep a single
> > copy of memory across multiple nodes and externalized to the VM and
> > to the host node).
> >
> > This thread made me wonder if we could mix the two features, and
> > you would then depend on MADV_USERFAULT and userfaultfd to deliver
> > to userland the "faults" happening on the volatile pages that have
> > been purged as a result of VM pressure.
> >
> > I'm just saying this after Johannes mentioned the issue with
> > syscalls returning -EFAULT. Because that is the very issue that the
> > userfaultfd is going to solve for the KVM migration thread.
> >
> > What I'm thinking now would be to mark the volatile range also
> > MADV_USERFAULT and then call userfaultfd, and instead of having the
> > cache regeneration "slow path" inside the SIGBUS handler, to run it
> > in the userfault thread that polls the userfaultfd. Then you could
> > write the volatile ranges to disk with a write() syscall (or use
> > any other syscall on the volatile ranges), without having to worry
> > about -EFAULT being returned because one page was discarded. And if
> > MADV_USERFAULT is not called in combination with the vrange
> > syscalls, then it'd still work without the userfault, but with the
> > vrange syscalls only.
> >
> > In short, the idea would be to let the userfault code solve the
> > fault delivery to userland for you, and make the vrange syscalls
> > focus only on the page purging problem, without having to worry
> > about what happens when something accesses a missing page.

Yes, the two seem certainly combinable to me: madvise(MADV_FREE |
MADV_USERFAULT) to allow purging and userspace fault handling.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
Hello Andrea,

On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> Hi everyone,
>
> On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > So between zero-fill and SIGBUS, I think SIGBUS makes the most
> > sense. If you have a third option you're thinking of, I'd of course
> > be interested in hearing it.
>
> I actually thought the way of being notified with a page fault
> (sigbus or whatever) was the most efficient way of using volatile
> ranges.
>
> Why having to call a syscall to know if you can still access the
> volatile range, if there was no VM pressure before the access?
> Syscalls are expensive, accessing the memory directly is not. Only if
> the page was actually missing and a page fault would fire, you'd take
> the slowpath.

True.

> The usages I see for this are plenty, like maintaining caches in
> memory that may be big and would be nice to discard if there's VM
> pressure; uncompressed jpeg images sound like a candidate too. So the
> browser size would shrink if there's VM pressure, instead of ending
> up swapping out uncompressed image data that can be regenerated more
> quickly with the CPU than with swapins.

That's really the typical case vrange is targeting.

> > Now... once you've chosen SIGBUS semantics, there will be folks who
> > will try to exploit the fact that we get SIGBUS on purged page
> > access (at least on the user-space side) and will try to access
> > pages that are volatile until they are purged and try to then
> > handle the SIGBUS to fix things up. Those folks exploiting that
> > will have to be particularly careful not to pass volatile data to
> > the kernel, and if they do they'll have to be smart enough to
> > handle the EFAULT, etc. That's really all their problem, because
> > they're being clever. :)
>
> I'm actually working on a feature that would solve the problem for
> syscalls accessing missing volatile pages. So you'd never see an
> -EFAULT, because syscalls won't return even if they encounter a
> missing page in a volatile range dropped by VM pressure.
>
> It's called userfaultfd. You call sys_userfaultfd(flags) and it
> connects the current mm to a pseudo filedescriptor. The
> filedescriptor works similarly to eventfd but with a different
> protocol.
>
> You need a thread that will never access the userfault area with the
> CPU, that is responsible to poll on the userfaultfd and talk the
> userfaultfd protocol to fill in missing pages. After a POLLIN event,
> the userfault thread reads the virtual addresses of the fault that
> must have happened on some other thread of the same mm, and then
> writes back a "handled" virtual range into the fd, after the page (or
> pages, if multiple) have been regenerated and mapped in with
> sys_remap_anon_pages(), mremap, or equivalent atomic pagetable page
> swapping. Then, depending on the "solved" range written back into the
> fd, the kernel will wake up the thread or threads that were waiting
> in kernel mode on the "handled" virtual range, and retry the fault
> without ever exiting kernel mode.

Sounds flexible.

> We need this in KVM for running the guest on memory that is on other
> nodes or in other processes (postcopy live migration is the most
> common use case, but there are others, like memory externalization
> and cross-node KSM in the cloud, to keep a single copy of memory
> across multiple nodes and externalized to the VM and to the host
> node).
>
> This thread made me wonder if we could mix the two features, and you
> would then depend on MADV_USERFAULT and userfaultfd to deliver to
> userland the "faults" happening on the volatile pages that have been
> purged as a result of VM pressure.
>
> I'm just saying this after Johannes mentioned the issue with syscalls
> returning -EFAULT. Because that is the very issue that the
> userfaultfd is going to solve for the KVM migration thread.
>
> What I'm thinking now would be to mark the volatile range also
> MADV_USERFAULT and then call userfaultfd, and instead of having the
> cache regeneration "slow path" inside the SIGBUS handler, to run it
> in the userfault thread that polls the userfaultfd. Then you could
> write the volatile ranges to disk with a write() syscall (or use any
> other syscall on the volatile ranges), without having to worry about
> -EFAULT being returned because one page was discarded. And if
> MADV_USERFAULT is not called in combination with the vrange syscalls,
> then it'd still work without the userfault, but with the vrange
> syscalls only.
>
> In short, the idea would be to let the userfault code solve the fault
> delivery to userland for you, and make the vrange syscalls focus only
> on the page purging problem, without having to worry about what
> happens when something accesses a missing page.
>
> But if you don't intend to solve the syscall -EFAULT problem, well,
> then probably the overlap is still as thin as I thought it was before
> (as also mentioned in the link below).

Sounds doable. I will look into your patch. Thanks for reminding!
Re: [PATCH 0/5] Volatile Ranges (v12) LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner han...@cmpxchg.org wrote: I'm just dying to hear a normal use case then. :) So the more normal use cause would be marking objects volatile and then non-volatile w/o accessing them in-between. In this case the zero-fill vs SIGBUS semantics don't really matter, its really just a trade off in how we handle applications deviating (intentionally or not) from this use case. So to maybe flesh out the context here for folks who are following along (but weren't in the hallway at LSF :), Johannes made a fairly interesting proposal (Johannes: Please correct me here where I'm maybe slightly off here) to use only the dirty bits of the ptes to mark a page as volatile. Then the kernel could reclaim these clean pages as it needed, and when we marked the range as non-volatile, the pages would be re-dirtied and if any of the pages were missing, we could return a flag with the purged state. This had some different semantics then what I've been working with for awhile (for example, any writes to pages would implicitly clear volatility), so I wasn't completely comfortable with it, but figured I'd think about it to see if it could be done. Particularly since it would in some ways simplify tmpfs/shm shared volatility that I'd eventually like to do. ... Now, while for the case I'm personally most interested in (ashmem), zero-fill would technically be ok, since that's what Android does. Even so, I don't think its the best approach for the interface, since applications may end up quite surprised by the results when they accidentally don't follow the don't touch volatile pages rule. That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. 
This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly.

The write-implicitly-clears-volatile semantics would actually be an advantage for some use cases. If you have a volatile cache of many sub-page-size objects, the application can just include "int present, in_use;" at the start of each page. present is set to non-zero before marking volatile, and when the application wants to unmark as volatile it writes to in_use and tests the value of present. No need for a syscall at all, although it does take a minor fault. The syscall would be better for the case of large objects, though. Or is that fatally flawed? - Kevin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/07/2014 09:32 PM, Kevin Easton wrote: On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner han...@cmpxchg.org wrote: I'm just dying to hear a "normal" use case then. :) So the more "normal" use case would be marking objects volatile and then non-volatile w/o accessing them in-between. In this case the zero-fill vs SIGBUS semantics don't really matter, it's really just a trade-off in how we handle applications deviating (intentionally or not) from this use case. So to maybe flesh out the context here for folks who are following along (but weren't in the hallway at LSF :), Johannes made a fairly interesting proposal (Johannes: please correct me where I'm maybe slightly off here) to use only the dirty bits of the ptes to mark a page as volatile. Then the kernel could reclaim these clean pages as it needed, and when we marked the range as non-volatile, the pages would be re-dirtied and if any of the pages were missing, we could return a flag with the purged state. This had some different semantics than what I've been working with for a while (for example, any writes to pages would implicitly clear volatility), so I wasn't completely comfortable with it, but figured I'd think about it to see if it could be done. Particularly since it would in some ways simplify the tmpfs/shm shared volatility that I'd eventually like to do. ... Now, while for the case I'm personally most interested in (ashmem), zero-fill would technically be ok, since that's what Android does. Even so, I don't think it's the best approach for the interface, since applications may end up quite surprised by the results when they accidentally don't follow the "don't touch volatile pages" rule. That point aside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged.
The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. The write-implicitly-clears-volatile semantics would actually be an advantage for some use cases. If you have a volatile cache of many sub-page-size objects, the application can just include "int present, in_use;" at the start of each page. present is set to non-zero before marking volatile, and when the application wants to unmark as volatile it writes to in_use and tests the value of present. No need for a syscall at all, although it does take a minor fault. The syscall would be better for the case of large objects, though. Or is that fatally flawed?

Well, as you note, each object would then have to be page size or smaller, which limits some of the potential use cases. However, these semantics would match better with the MADV_FREE proposal Minchan is pushing, so this method would work fine there. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote: > > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > >> On 04/01/2014 04:01 PM, Dave Hansen wrote: > >> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > >> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > >> > John, this was something that the Mozilla guys asked for, right? Any > >> > idea why this isn't ever a problem for them? > >> So one of their use cases for it is for library text. Basically they > >> want to decompress a compressed library file into memory. Then they plan > >> to mark the uncompressed pages volatile, and then be able to call into > >> it. Ideally for them, the kernel would only purge cold pages, leaving > >> the hot pages in memory. When they traverse a purged page, they handle > >> the SIGBUS and patch the page up. > > > > How big are these libraries compared to overall system size? > > Mike or Taras would have to refresh my memory on this detail. My > recollection is it mostly has to do with keeping the on-disk size of > the library small, so it can load off of slow media very quickly. > > >> Now.. this is not what I'd consider a normal use case, but was hoping to > >> illustrate some of the more interesting uses and demonstrate the > >> interface's flexibility. > > > > I'm just dying to hear a "normal" use case then. :) > > So the more "normal" use case would be marking objects volatile and > then non-volatile w/o accessing them in-between. In this case the > zero-fill vs SIGBUS semantics don't really matter, it's really just a > trade-off in how we handle applications deviating (intentionally or > not) from this use case.
> > So to maybe flesh out the context here for folks who are following > along (but weren't in the hallway at LSF :), Johannes made a fairly > interesting proposal (Johannes: please correct me where I'm maybe > slightly off here) to use only the dirty bits of the ptes to mark a > page as volatile. Then the kernel could reclaim these clean pages as > it needed, and when we marked the range as non-volatile, the pages > would be re-dirtied and if any of the pages were missing, we could

I'd like to understand more clearly what Hannes and you are thinking. You mean that when we unmark the range, we should re-dirty all of the pages' ptes? Or SetPageDirty? If we re-dirty the ptes, maybe the soft-dirty people (i.e. CRIU) might be angry because it could make lots of diff. If we just do SetPageDirty, it would invalidate the writeout-avoid logic for swapped pages which are already on swap. Yep, but that could be minor, and the SetPageDirty model would be proper for a shared-vrange implementation. But how could we know any pages were missing at unmark time? Where do we keep the information? It's no problem for vrange-anon because we can keep the information in the pte, but how about vrange-file (i.e. vrange-shared)? Using a shadow entry of the radix tree? What are you thinking about? Another major concern is still the syscall's overhead. Such a page-based scheme has trouble with syscall speed, so I'm afraid users might not use the syscall any more. :( Frankly speaking, we don't have a concrete user so I'm not sure how severe the overhead is, but we can easily imagine that in the future some user might want to mark huge GBs of memory volatile. But I couldn't insist on the range-based option because it has downsides, too. If we don't use the page-based model, the reclaim path clearly has a big overhead to scan virtual memory to find victim pages. As a worst case, just a single page in a huge multi-GB vma. Even then, the page might be in another zone.
:( If we could optimize that path to prevent CPU burning in the future, it could get very complicated and I'm not sure it would work well. We already have a similar issue with compaction. ;-) So, it's really a dilemma.

> return a flag with the purged state. This had some different > semantics than what I've been working with for a while (for example, > any writes to pages would implicitly clear volatility), so I wasn't > completely comfortable with it, but figured I'd think about it to see > if it could be done. Particularly since it would in some ways simplify > tmpfs/shm shared volatility that I'd eventually like to do. > > After thinking it over in the hallway, I talked some of the details w/ > Johannes and there was one issue: while w/ anonymous memory we can > still add a VM_VOLATILE flag on the vma, so we can get SIGBUS > semantics, on shared volatile ranges we don't have anything > to hang a volatile flag on w/o adding some new vma-like structure to > the address_space structure (much as we did in the past w/ earlier > volatile range implementations). This would negate much of the point > of using the dirty bits to simplify the shared volatility > implementation. > > Thus Johannes is reasonably questioning the need for SIGBUS semantics, > since if it wasn't needed, the simpler page-cleaning based volatility > could potentially be used. I think the SIGBUS scenario isn't common but in the case of JIT, it is necessary and the amount of ram consumed would be
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 12:36:38PM -0400, Johannes Weiner wrote: > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > > On 04/01/2014 04:01 PM, Dave Hansen wrote: > > > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > > >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > > >>> Either way, optimistic volatile pointers are nowhere near as > > >>> transparent to the application as the above description suggests, > > >>> which makes this usecase not very interesting, IMO. > > >> ... however, I think you're still derating the value way too much. The > > >> case of user space doing elastic memory management is more and more > > >> common, and for a lot of those applications it is perfectly reasonable > > >> to either not do system calls or to have to devolatilize first. > > > The SIGBUS is only in cases where the memory is set as volatile and > > > _then_ accessed, right? > > Not just set volatile and then accessed, but when a volatile page has > > been purged and then accessed without being made non-volatile. > > > > > > > John, this was something that the Mozilla guys asked for, right? Any > > > idea why this isn't ever a problem for them? > > So one of their use cases for it is for library text. Basically they > > want to decompress a compressed library file into memory. Then they plan > > to mark the uncompressed pages volatile, and then be able to call into > > it. Ideally for them, the kernel would only purge cold pages, leaving > > the hot pages in memory. When they traverse a purged page, they handle > > the SIGBUS and patch the page up. > > How big are these libraries compared to overall system size?

One of the JIT examples I had is 5M bytes for just a simple node.js service. Actually I'm not sure it was JIT or something else; just what I saw was that they were rwxp vmas, so I guess they are JIT. Anyway, it's a really simple script but it consumed 5M bytes.
It's really big for Embedded WebOS because other, more complicated services could be executed in parallel on the system. > > > Now.. this is not what I'd consider a normal use case, but was hoping to > > illustrate some of the more interesting uses and demonstrate the > > interface's flexibility. > > I'm just dying to hear a "normal" use case then. :) > > > Also it provided a clear example of benefits to doing LRU based > > cold-page purging rather than full object purging. Though I think the > > same could be demonstrated in a simpler case of a large cache of objects > > that the application wants to mark volatile in one pass, unmarking > > sub-objects as it needs. > > Agreed. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: em...@kvack.org -- Kind regards, Minchan Kim
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed 02-04-14 13:13:34, John Stultz wrote: > On 04/02/2014 12:47 PM, Johannes Weiner wrote: > > On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: > >> On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner > >> wrote: > >>> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > That point aside, I think the other problem with the page-cleaning > volatility approach is that there are other awkward side effects. For > example: Say an application marks a range as volatile. One page in the > range is then purged. The application, due to a bug or otherwise, > reads the volatile range. This causes the page to be zero-filled in, > and the application silently uses the corrupted data (which isn't > great). More problematic though, is that by faulting the page in, > they've in effect lost the purge state for that page. When the > application then goes to mark the range as non-volatile, all pages are > present, so we'd return that no pages were purged. From an > application perspective this is pretty ugly. > > Johannes: Any thoughts on this potential issue with your proposal? Am > I missing something else? > >>> No, this is accurate. However, I don't really see how this is > >>> different than any other use-after-free bug. If you access malloc > >>> memory after free(), you might receive a SIGSEGV, you might see random > >>> data, you might corrupt somebody else's data. This certainly isn't > >>> nice, but it's not exactly new behavior, is it? > >> The part that troubles me is that I see the purged state as kernel > >> data being corrupted by userland in this case. The kernel will tell > >> userspace that no pages were purged, even though they were. Only > >> because userspace made an errant read of a page, and got garbage data > >> back. > > That sounds overly dramatic to me. First of all, this data still > > reflects accurately the actions of userspace in this situation.
And > > secondly, the kernel does not rely on this data to be meaningful from > > a userspace perspective to function correctly. > > Maybe you're right, but I feel this is the sort of thing application > developers would be surprised and annoyed by. > > > > It's really nothing but a use-after-free bug that has consequences for > > no-one but the faulty application. The thing that IS new is that even > > a read is enough to corrupt your data in this case. > > > > MADV_REVIVE could return 0 if all pages in the specified range were > > present, -Esomething if otherwise. That would be semantically sound > > even if userspace messes up. > > So it's semantically more of just a combined mincore+dirty operation.. > and nothing more? > > What are other folks thinking about this? Although I don't particularly > like it, I probably could go along with Johannes' approach, forgoing > SIGBUS for zero-fill and adapting the semantics that are in my mind a > bit stranger. This would allow for ashmem-like style behavior w/ the > additional write-clears-volatile-state and read-clears-purged-state > constraints (which I don't think would be problematic for Android, but > am not totally sure). > > But I do worry that these semantics are easier for kernel-mm-developers > to grasp, but are much much harder for application developers to > understand. Yeah, I have to admit that although the simplicity of the implementation looks compelling, the interface from a userspace POV looks weird. Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 12:47 PM, Johannes Weiner wrote: > On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: >> On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner wrote: >>> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: That point aside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? >>> No, this is accurate. However, I don't really see how this is >>> different than any other use-after-free bug. If you access malloc >>> memory after free(), you might receive a SIGSEGV, you might see random >>> data, you might corrupt somebody else's data. This certainly isn't >>> nice, but it's not exactly new behavior, is it? >> The part that troubles me is that I see the purged state as kernel >> data being corrupted by userland in this case. The kernel will tell >> userspace that no pages were purged, even though they were. Only >> because userspace made an errant read of a page, and got garbage data >> back. > That sounds overly dramatic to me. First of all, this data still > reflects accurately the actions of userspace in this situation. And > secondly, the kernel does not rely on this data to be meaningful from > a userspace perspective to function correctly.
Maybe you're right, but I feel this is the sort of thing application developers would be surprised and annoyed by. > It's really nothing but a use-after-free bug that has consequences for > no-one but the faulty application. The thing that IS new is that even > a read is enough to corrupt your data in this case. > > MADV_REVIVE could return 0 if all pages in the specified range were > present, -Esomething if otherwise. That would be semantically sound > even if userspace messes up. So it's semantically more of just a combined mincore+dirty operation.. and nothing more? What are other folks thinking about this? Although I don't particularly like it, I probably could go along with Johannes' approach, forgoing SIGBUS for zero-fill and adapting the semantics that are in my mind a bit stranger. This would allow for ashmem-like style behavior w/ the additional write-clears-volatile-state and read-clears-purged-state constraints (which I don't think would be problematic for Android, but am not totally sure). But I do worry that these semantics are easier for kernel-mm-developers to grasp, but are much much harder for application developers to understand. Additionally, unless we could really leave access-after-volatile as totally undefined behavior, this would lock us into O(page) behavior and would remove the possibility of O(log(ranges)) behavior Minchan and I were able to get (admittedly with more complicated code - but something I was hoping we'd be able to get back to after the base semantics and interface behavior was understood and merged). Since applications will have bugs and will access memory after marking it volatile, I don't think we'll be able to get away with that sort of behavioral flexibility. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 11:31 AM, Andrea Arcangeli wrote: > On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: >> Now... once you've chosen SIGBUS semantics, there will be folks who will >> try to exploit the fact that we get SIGBUS on purged page access (at >> least on the user-space side) and will try to access pages that are >> volatile until they are purged and try to then handle the SIGBUS to fix >> things up. Those folks exploiting that will have to be particularly >> careful not to pass volatile data to the kernel, and if they do they'll >> have to be smart enough to handle the EFAULT, etc. That's really all >> their problem, because they're being clever. :) > I'm actually working on a feature that would solve the problem for the > syscalls accessing missing volatile pages. So you'd never see a > -EFAULT because all syscalls won't return even if they encounter a > missing page in the volatile range dropped by the VM pressure. > > It's called userfaultfd. You call sys_userfaultfd(flags) and it > connects the current mm to a pseudo filedescriptor. The filedescriptor > works similarly to eventfd but with a different protocol. So yea! I actually think (it's been a while now) I mentioned your work to Taras (or maybe he mentioned it to me?), but it did seem like userfaultfd would be a better solution for the style of fault handling they were thinking about. (Especially as actually handling SIGBUS and doing something sane in a large threaded application seems very difficult). That said, explaining volatile ranges as a concept has been difficult enough without mixing in other new concepts :), so I'm hesitant to tie the functionality together until it's clear the userfaultfd approach is likely to land. But maybe I need to take a closer look at it.
thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner wrote: > > On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > >> That point aside, I think the other problem with the page-cleaning > >> volatility approach is that there are other awkward side effects. For > >> example: Say an application marks a range as volatile. One page in the > >> range is then purged. The application, due to a bug or otherwise, > >> reads the volatile range. This causes the page to be zero-filled in, > >> and the application silently uses the corrupted data (which isn't > >> great). More problematic though, is that by faulting the page in, > >> they've in effect lost the purge state for that page. When the > >> application then goes to mark the range as non-volatile, all pages are > >> present, so we'd return that no pages were purged. From an > >> application perspective this is pretty ugly. > >> > >> Johannes: Any thoughts on this potential issue with your proposal? Am > >> I missing something else? > > > > No, this is accurate. However, I don't really see how this is > > different than any other use-after-free bug. If you access malloc > > memory after free(), you might receive a SIGSEGV, you might see random > > data, you might corrupt somebody else's data. This certainly isn't > > nice, but it's not exactly new behavior, is it? > > The part that troubles me is that I see the purged state as kernel > data being corrupted by userland in this case. The kernel will tell > userspace that no pages were purged, even though they were. Only > because userspace made an errant read of a page, and got garbage data > back. That sounds overly dramatic to me. First of all, this data still reflects accurately the actions of userspace in this situation. And secondly, the kernel does not rely on this data to be meaningful from a userspace perspective to function correctly.
It's really nothing but a use-after-free bug that has consequences for no-one but the faulty application. The thing that IS new is that even a read is enough to corrupt your data in this case. MADV_REVIVE could return 0 if all pages in the specified range were present, -Esomething if otherwise. That would be semantically sound even if userspace messes up. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 11:07 AM, Johannes Weiner wrote: > On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote: >> I suspect handling the SIGBUS and patching up the purged page you >> trapped on is likely much too complicated for most use cases. But I do >> think SIGBUS is preferable to zero-fill on purged page access, just >> because it's likely to be easier to debug applications. > > Fully agreed, but it seems a bit overkill to add a separate syscall, a > range-tree on top of shmem address_spaces, and an essentially new > programming model based on SIGBUS userspace fault handling (incl. all > the complexities and confusion this inevitably will bring when people > DO end up passing these pointers into kernel space) just to be a bit > nicer about use-after-free bugs in applications. It's more about making an interface that has graspable semantics to userspace, instead of having the semantics be a side-effect of the implementation. Tying volatility to the page-clean state and page-was-purged to page-present seems problematic to me, because there are too many ways to change the page-clean or page-present state outside of the interface being proposed. I feel this causes a cascade of corner cases that have to be explained to users of the interface. Also, I disagree that we're adding a new programming model, as SIGBUSes can already be caught; it's just that there's not usually much one can do, whereas with volatile pages it's more likely something could be done. And again, it's really just a side-effect of having semantics (SIGBUS on purged page access) that are more helpful from an application's perspective. As for the separate syscall: again, this is mainly needed to handle allocation failures that happen mid-way through modifying the range. There may still be a way to do the allocation first and only after it succeeds do the modification. 
The vma merge/splitting logic doesn't make this easy, but if we can be sure that on a failed split of 1 vma -> 3 vmas (which may fail halfway) we can re-merge w/o allocation and error out (without having to do any other allocations), this might be avoidable. I'm still wanting to look at this. If so, it would be easier to re-add this support under madvise, if folks really really don't like the new syscall. For the most part, having the separate syscall allows us to discuss other details of the semantics, which to me are more important than the syscall naming. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote: > Hi everyone, > > On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: > > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If > > you have a third option you're thinking of, I'd of course be interested > > in hearing it. > > I actually thought the way of being notified with a page fault (sigbus > or whatever) was the most efficient way of using volatile ranges. > > Why have to call a syscall to know if you can still access the > volatile range, if there was no VM pressure before the access? > syscalls are expensive, accessing the memory directly is not. Only if > the page was actually missing and a page fault would fire, you'd take > the slowpath. Not everybody wants to actually come back for the data in the range; allocators and message-passing applications just want to be able to reuse the memory mapping. By tying the volatility to the dirty bit in the page tables, an allocator could simply clear those bits once on free(). When malloc() hands out this region again, the user is expected to write, which will either overwrite the old page, or, if it was purged, fault in a fresh zero page. But there is no second syscall needed to clear volatility. > > Now... once you've chosen SIGBUS semantics, there will be folks who will > > try to exploit the fact that we get SIGBUS on purged page access (at > > least on the user-space side) and will try to access pages that are > > volatile until they are purged and try to then handle the SIGBUS to fix > > things up. Those folks exploiting that will have to be particularly > > careful not to pass volatile data to the kernel, and if they do they'll > > have to be smart enough to handle the EFAULT, etc. That's really all > > their problem, because they're being clever. :) > > I'm actually working on a feature that would solve the problem for the > syscalls accessing missing volatile pages. 
So you'd never see a > -EFAULT because all syscalls won't return even if they encounter a > missing page in the volatile range dropped by the VM pressure. > > It's called userfaultfd. You call sys_userfaultfd(flags) and it > connects the current mm to a pseudo filedescriptor. The filedescriptor > works similarly to eventfd but with a different protocol. > > You need a thread that will never access the userfault area with the > CPU, that is responsible to poll on the userfaultfd and talk the > userfaultfd protocol to fill-in missing pages. The userfault thread > after a POLLIN event reads the virtual addresses of the fault that > must have happened on some other thread of the same mm, and then > writes back a "handled" virtual range into the fd, after the page (or > pages if multiple) have been regenerated and mapped in with > sys_remap_anon_pages(), mremap or equivalent atomic pagetable page > swapping. Then depending on the "solved" range written back into the > fd, the kernel will wake up the thread or threads that were waiting in > kernel mode on the "handled" virtual range, and retry the fault > without ever exiting kernel mode. > > We need this in KVM for running the guest on memory that is on other > nodes or other processes (postcopy live migration is the most common > use case but there are others like memory externalization and > cross-node KSM in the cloud, to keep a single copy of memory across > multiple nodes and externalized to the VM and to the host node). > > This thread made me wonder if we could mix the two features and you > would then depend on MADV_USERFAULT and userfaultfd to deliver to > userland the "faults" happening on the volatile pages that have been > purged as a result of VM pressure. > > I'm just saying this after Johannes mentioned the issue with syscalls > returning -EFAULT. Because that is the very issue that the userfaultfd > is going to solve for the KVM migration thread. 
> > What I'm thinking now would be to mark the volatile range also > MADV_USERFAULT and then calling userfaultfd and instead of having the > cache regeneration "slow path" inside the SIGBUS handler, to run it in > the userfault thread that polls the userfaultfd. Then you could write > the volatile ranges to disk with a write() syscall (or use any other > syscall on the volatile ranges), without having to worry about -EFAULT > being returned because one page was discarded. And if MADV_USERFAULT > is not called in combination with vrange syscalls, then it'd still > work without the userfault, but with the vrange syscalls only. > > In short the idea would be to let the userfault code solve the fault > delivery to userland for you, and make the vrange syscalls only focus > on the page purging problem, without having to worry about what > happens when something accesses a missing page. Yes, the two seem certainly combinable to me. madvise(MADV_FREE | MADV_USERFAULT) to allow purging and userspace fault handling. In the fault slowpath, you can then regenerate any missing data and
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner wrote: > On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: >> That point beside, I think the other problem with the page-cleaning >> volatility approach is that there are other awkward side effects. For >> example: Say an application marks a range as volatile. One page in the >> range is then purged. The application, due to a bug or otherwise, >> reads the volatile range. This causes the page to be zero-filled in, >> and the application silently uses the corrupted data (which isn't >> great). More problematic though, is that by faulting the page in, >> they've in effect lost the purge state for that page. When the >> application then goes to mark the range as non-volatile, all pages are >> present, so we'd return that no pages were purged. From an >> application perspective this is pretty ugly. >> >> Johannes: Any thoughts on this potential issue with your proposal? Am >> I missing something else? > > No, this is accurate. However, I don't really see how this is > different than any other use-after-free bug. If you access malloc > memory after free(), you might receive a SIGSEGV, you might see random > data, you might corrupt somebody else's data. This certainly isn't > nice, but it's not exactly new behavior, is it? The part that troubles me is that I see the purged state as kernel data being corrupted by userland in this case. The kernel will tell userspace that no pages were purged, even though they were. Only because userspace made an errant read of a page, and got garbage data back. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
Hi everyone, On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If > you have a third option you're thinking of, I'd of course be interested > in hearing it. I actually thought the way of being notified with a page fault (sigbus or whatever) was the most efficient way of using volatile ranges. Why have to call a syscall to know if you can still access the volatile range, if there was no VM pressure before the access? syscalls are expensive, accessing the memory directly is not. Only if the page was actually missing and a page fault would fire, you'd take the slowpath. The usages I see for this are plenty, like for maintaining caches in memory that may be big and would be nice to discard if there's VM pressure; uncompressed jpeg images sound like a candidate too. So the browser's memory footprint would shrink if there's VM pressure, instead of ending up swapping out uncompressed image data that can be regenerated more quickly with the CPU than with swapins. > Now... once you've chosen SIGBUS semantics, there will be folks who will > try to exploit the fact that we get SIGBUS on purged page access (at > least on the user-space side) and will try to access pages that are > volatile until they are purged and try to then handle the SIGBUS to fix > things up. Those folks exploiting that will have to be particularly > careful not to pass volatile data to the kernel, and if they do they'll > have to be smart enough to handle the EFAULT, etc. That's really all > their problem, because they're being clever. :) I'm actually working on a feature that would solve the problem for the syscalls accessing missing volatile pages. So you'd never see a -EFAULT because all syscalls won't return even if they encounter a missing page in the volatile range dropped by the VM pressure. It's called userfaultfd. You call sys_userfaultfd(flags) and it connects the current mm to a pseudo filedescriptor. 
The filedescriptor works similarly to eventfd but with a different protocol. You need a thread that will never access the userfault area with the CPU, that is responsible to poll on the userfaultfd and talk the userfaultfd protocol to fill-in missing pages. The userfault thread after a POLLIN event reads the virtual addresses of the fault that must have happened on some other thread of the same mm, and then writes back a "handled" virtual range into the fd, after the page (or pages if multiple) have been regenerated and mapped in with sys_remap_anon_pages(), mremap or equivalent atomic pagetable page swapping. Then depending on the "solved" range written back into the fd, the kernel will wake up the thread or threads that were waiting in kernel mode on the "handled" virtual range, and retry the fault without ever exiting kernel mode. We need this in KVM for running the guest on memory that is on other nodes or other processes (postcopy live migration is the most common use case but there are others like memory externalization and cross-node KSM in the cloud, to keep a single copy of memory across multiple nodes and externalized to the VM and to the host node). This thread made me wonder if we could mix the two features and you would then depend on MADV_USERFAULT and userfaultfd to deliver to userland the "faults" happening on the volatile pages that have been purged as a result of VM pressure. I'm just saying this after Johannes mentioned the issue with syscalls returning -EFAULT. Because that is the very issue that the userfaultfd is going to solve for the KVM migration thread. What I'm thinking now would be to mark the volatile range also MADV_USERFAULT and then calling userfaultfd and instead of having the cache regeneration "slow path" inside the SIGBUS handler, to run it in the userfault thread that polls the userfaultfd. 
Then you could write the volatile ranges to disk with a write() syscall (or use any other syscall on the volatile ranges), without having to worry about -EFAULT being returned because one page was discarded. And if MADV_USERFAULT is not called in combination with vrange syscalls, then it'd still work without the userfault, but with the vrange syscalls only. In short the idea would be to let the userfault code solve the fault delivery to userland for you, and make the vrange syscalls only focus on the page purging problem, without having to worry about what happens when something accesses a missing page. But if you don't intend to solve the syscall -EFAULT problem, well then probably the overlap is still as thin as I thought it was before (like also mentioned in the below link). Thanks, Andrea PS. my last email about this from a more KVM centric point of view: http://www.spinics.net/lists/kvm/msg101449.html
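The protocol Andrea sketches predates the userfaultfd API that eventually merged, so none of the calls below are a stable interface; the names and message shapes are taken directly from his description. The handler thread's loop would look roughly like:

```
/* userfault handler thread, per the proposed protocol (pseudocode) */
uffd = sys_userfaultfd(flags);          /* connect current mm to the fd   */
for (;;) {
    poll(uffd, POLLIN);                 /* wait for a fault notification  */
    read(uffd, &addr);                  /* virtual address that faulted
                                           in some other thread of the mm */
    regenerate(addr);                   /* rebuild the purged contents    */
    sys_remap_anon_pages(addr, fresh);  /* atomically map the fresh pages
                                           (or mremap / equivalent)       */
    write(uffd, &range);                /* write back the "handled" range;
                                           the kernel wakes the blocked
                                           thread(s) and retries the fault
                                           without exiting kernel mode    */
}
```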
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen wrote: > > On 04/02/2014 10:18 AM, Johannes Weiner wrote: > >> Hence my follow-up question in the other mail about how large we > >> expect such code caches to become in practice in relationship to > >> overall system memory. Are code caches interesting reclaim candidates > >> to begin with? Are they big enough to make the machine thrash/swap > >> otherwise? > > > > A big chunk of the use cases here are for swapless systems anyway, so > > this is the *only* way for them to reclaim anonymous memory. Their > > choices are either to be constantly throwing away and rebuilding these > > objects, or to leave them in memory effectively pinned. > > > > In practice I did see ashmem (the Android thing that we're trying to > > replace) get used a lot by the Android web browser when I was playing > > with it. John said that it got used for storing decompressed copies of > > images. > > Although images are a simpler case where it's easier to not touch > volatile pages. I think Johannes is mostly concerned about cases where > volatile pages are being accessed while they are volatile, for which the > Mozilla folks are so far the only viable case (in my mind... folks may > have others) where they intentionally want to access pages while > they're volatile and thus require SIGBUS semantics. Yes, absolutely, that is my only concern. Compressed images as in Android can easily be marked non-volatile before they are accessed again. Code caches are harder because control is handed off to the CPU, but I'm not entirely sure yet whether these are in fact interesting reclaim candidates. > I suspect handling the SIGBUS and patching up the purged page you > trapped on is likely much too complicated for most use cases. But I do > think SIGBUS is preferable to zero-fill on purged page access, just > because it's likely to be easier to debug applications. 
Fully agreed, but it seems a bit overkill to add a separate syscall, a range-tree on top of shmem address_spaces, and an essentially new programming model based on SIGBUS userspace fault handling (incl. all the complexities and confusion this inevitably will bring when people DO end up passing these pointers into kernel space) just to be a bit nicer about use-after-free bugs in applications.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote: > > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > >> On 04/01/2014 04:01 PM, Dave Hansen wrote: > >> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > >> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > >> > John, this was something that the Mozilla guys asked for, right? Any > >> > idea why this isn't ever a problem for them? > >> So one of their use cases for it is for library text. Basically they > >> want to decompress a compressed library file into memory. Then they plan > >> to mark the uncompressed pages volatile, and then be able to call into > >> it. Ideally for them, the kernel would only purge cold pages, leaving > >> the hot pages in memory. When they traverse a purged page, they handle > >> the SIGBUS and patch the page up. > > > > How big are these libraries compared to overall system size? > > Mike or Taras would have to refresh my memory on this detail. My > recollection is it mostly has to do with keeping the on-disk size of > the library small, so it can load off of slow media very quickly. > > >> Now.. this is not what I'd consider a normal use case, but was hoping to > >> illustrate some of the more interesting uses and demonstrate the > >> interface's flexibility. > > > > I'm just dying to hear a "normal" use case then. :) > > So the more "normal" use case would be marking objects volatile and > then non-volatile w/o accessing them in-between. In this case the > zero-fill vs SIGBUS semantics don't really matter, it's really just a > trade-off in how we handle applications deviating (intentionally or > not) from this use case. 
> > So to maybe flesh out the context here for folks who are following > along (but weren't in the hallway at LSF :), Johannes made a fairly > interesting proposal (Johannes: Please correct me here where I'm maybe > slightly off here) to use only the dirty bits of the ptes to mark a > page as volatile. Then the kernel could reclaim these clean pages as > it needed, and when we marked the range as non-volatile, the pages > would be re-dirtied and if any of the pages were missing, we could > return a flag with the purged state. This had some different > semantics than what I've been working with for a while (for example, > any writes to pages would implicitly clear volatility), so I wasn't > completely comfortable with it, but figured I'd think about it to see > if it could be done. Particularly since it would in some ways simplify > tmpfs/shm shared volatility that I'd eventually like to do. > > After thinking it over in the hallway, I talked some of the details w/ > Johannes, and there was one issue: while w/ anonymous memory we can > still add a VM_VOLATILE flag on the vma, so we can get SIGBUS > semantics, on shared volatile ranges we don't have anything > to hang a volatile flag on w/o adding some new vma-like structure to > the address_space structure (much as we did in the past w/ earlier > volatile range implementations). This would negate much of the point > of using the dirty bits to simplify the shared volatility > implementation. > > Thus Johannes is reasonably questioning the need for SIGBUS semantics, > since if it wasn't needed, the simpler page-cleaning based volatility > could potentially be used. Thanks for summarizing this again! > Now, while for the case I'm personally most interested in (ashmem), > zero-fill would technically be ok, since that's what Android does. 
> Even so, I don't think it's the best approach for the interface, since > applications may end up quite surprised by the results when they > accidentally don't follow the "don't touch volatile pages" rule. > > That point beside, I think the other problem with the page-cleaning > volatility approach is that there are other awkward side effects. For > example: Say an application marks a range as volatile. One page in the > range is then purged. The application, due to a bug or otherwise, > reads the volatile range. This causes the page to be zero-filled in, > and the application silently uses the corrupted data (which isn't > great). More problematic though, is that by faulting the page in, > they've in effect lost the purge state for that page. When the > application then goes to mark the range as non-volatile, all pages are > present, so we'd return that no pages were purged. From an > application perspective this is pretty ugly. > > Johannes: Any thoughts on this potential issue with your proposal? Am > I missing something else? No, this is accurate. However, I don't really see how this is different than any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it?
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen wrote: > On 04/02/2014 10:18 AM, Johannes Weiner wrote: >> Hence my follow-up question in the other mail about how large we >> expect such code caches to become in practice in relationship to >> overall system memory. Are code caches interesting reclaim candidates >> to begin with? Are they big enough to make the machine thrash/swap >> otherwise? > > A big chunk of the use cases here are for swapless systems anyway, so > this is the *only* way for them to reclaim anonymous memory. Their > choices are either to be constantly throwing away and rebuilding these > objects, or to leave them in memory effectively pinned. > > In practice I did see ashmem (the Android thing that we're trying to > replace) get used a lot by the Android web browser when I was playing > with it. John said that it got used for storing decompressed copies of > images. Although images are a simpler case where it's easier to not touch volatile pages. I think Johannes is mostly concerned about cases where volatile pages are being accessed while they are volatile, for which the Mozilla folks are so far the only viable case (in my mind... folks may have others) where they intentionally want to access pages while they're volatile and thus require SIGBUS semantics. I suspect handling the SIGBUS and patching up the purged page you trapped on is likely much too complicated for most use cases. But I do think SIGBUS is preferable to zero-fill on purged page access, just because it's likely to be easier to debug applications. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner wrote: > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: >> On 04/01/2014 04:01 PM, Dave Hansen wrote: >> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: >> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: >> > John, this was something that the Mozilla guys asked for, right? Any >> > idea why this isn't ever a problem for them? >> So one of their use cases for it is for library text. Basically they >> want to decompress a compressed library file into memory. Then they plan >> to mark the uncompressed pages volatile, and then be able to call into >> it. Ideally for them, the kernel would only purge cold pages, leaving >> the hot pages in memory. When they traverse a purged page, they handle >> the SIGBUS and patch the page up. > > How big are these libraries compared to overall system size? Mike or Taras would have to refresh my memory on this detail. My recollection is it mostly has to do with keeping the on-disk size of the library small, so it can load off of slow media very quickly. >> Now.. this is not what I'd consider a normal use case, but was hoping to >> illustrate some of the more interesting uses and demonstrate the >> interface's flexibility. > > I'm just dying to hear a "normal" use case then. :) So the more "normal" use case would be marking objects volatile and then non-volatile w/o accessing them in-between. In this case the zero-fill vs SIGBUS semantics don't really matter, it's really just a trade-off in how we handle applications deviating (intentionally or not) from this use case. So to maybe flesh out the context here for folks who are following along (but weren't in the hallway at LSF :), Johannes made a fairly interesting proposal (Johannes: Please correct me here where I'm maybe slightly off here) to use only the dirty bits of the ptes to mark a page as volatile. 
Then the kernel could reclaim these clean pages as it needed, and when we marked the range as non-volatile, the pages would be re-dirtied and if any of the pages were missing, we could return a flag with the purged state. This had some different semantics than what I've been working with for a while (for example, any writes to pages would implicitly clear volatility), so I wasn't completely comfortable with it, but figured I'd think about it to see if it could be done. Particularly since it would in some ways simplify tmpfs/shm shared volatility that I'd eventually like to do. After thinking it over in the hallway, I talked some of the details w/ Johannes, and there was one issue: while w/ anonymous memory we can still add a VM_VOLATILE flag on the vma, so we can get SIGBUS semantics, on shared volatile ranges we don't have anything to hang a volatile flag on w/o adding some new vma-like structure to the address_space structure (much as we did in the past w/ earlier volatile range implementations). This would negate much of the point of using the dirty bits to simplify the shared volatility implementation. Thus Johannes is reasonably questioning the need for SIGBUS semantics, since if it wasn't needed, the simpler page-cleaning based volatility could potentially be used. Now, while for the case I'm personally most interested in (ashmem), zero-fill would technically be ok, since that's what Android does. Even so, I don't think it's the best approach for the interface, since applications may end up quite surprised by the results when they accidentally don't follow the "don't touch volatile pages" rule. That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. 
This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 10:18 AM, Johannes Weiner wrote: > Hence my follow-up question in the other mail about how large we > expect such code caches to become in practice in relationship to > overall system memory. Are code caches interesting reclaim candidates > to begin with? Are they big enough to make the machine thrash/swap > otherwise? A big chunk of the use cases here are for swapless systems anyway, so this is the *only* way for them to reclaim anonymous memory. Their choices are either to be constantly throwing away and rebuilding these objects, or to leave them in memory effectively pinned. In practice I did see ashmem (the Android thing that we're trying to replace) get used a lot by the Android web browser when I was playing with it. John said that it got used for storing decompressed copies of images.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 09:37:49AM -0700, H. Peter Anvin wrote: > On 04/02/2014 09:32 AM, H. Peter Anvin wrote: > > On 04/02/2014 09:30 AM, Johannes Weiner wrote: > >> > >> So between zero-fill and SIGBUS, I'd prefer the one which results in > >> the simpler user interface / fewer system calls. > >> > > > > The use cases are different; I believe this should be a user space option. > > > > Case in point, for example: imagine a JIT. You *really* don't want to > zero-fill memory behind the back of your JIT, as all zero memory may not > be a trapping instruction (it isn't on x86, for example, and if you are > unlucky you may be modifying *part* of an instruction.) Yes, and I think this would be comparable to the compressed-library usecase that John mentioned. What's special about these cases is that the accesses are no longer under control of the application because it's literally code that the CPU jumps into. It is obvious to me that such a usecase would require SIGBUS handling. However, it seems that in any usecase *besides* executable code caches, userspace would have the ability to mark the pages non-volatile ahead of time, and thus not require SIGBUS delivery. Hence my follow-up question in the other mail about how large we expect such code caches to become in practice in relationship to overall system memory. Are code caches interesting reclaim candidates to begin with? Are they big enough to make the machine thrash/swap otherwise?
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 09:32 AM, H. Peter Anvin wrote: > On 04/02/2014 09:30 AM, Johannes Weiner wrote: >> >> So between zero-fill and SIGBUS, I'd prefer the one which results in >> the simpler user interface / fewer system calls. >> > > The use cases are different; I believe this should be a user space option. > Case in point, for example: imagine a JIT. You *really* don't want to zero-fill memory behind the back of your JIT, as all zero memory may not be a trapping instruction (it isn't on x86, for example, and if you are unlucky you may be modifying *part* of an instruction.) Thus, SIGBUS is the only safe option. -hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > On 04/01/2014 04:01 PM, Dave Hansen wrote: > > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > >>> Either way, optimistic volatile pointers are nowhere near as > >>> transparent to the application as the above description suggests, > >>> which makes this usecase not very interesting, IMO. > >> ... however, I think you're still derating the value way too much. The > >> case of user space doing elastic memory management is more and more > >> common, and for a lot of those applications it is perfectly reasonable > >> to either not do system calls or to have to devolatilize first. > > The SIGBUS is only in cases where the memory is set as volatile and > > _then_ accessed, right? > Not just set volatile and then accessed, but when a volatile page has > been purged and then accessed without being made non-volatile. > > > > John, this was something that the Mozilla guys asked for, right? Any > > idea why this isn't ever a problem for them? > So one of their use cases for it is for library text. Basically they > want to decompress a compressed library file into memory. Then they plan > to mark the uncompressed pages volatile, and then be able to call into > it. Ideally for them, the kernel would only purge cold pages, leaving > the hot pages in memory. When they traverse a purged page, they handle > the SIGBUS and patch the page up. How big are these libraries compared to overall system size? > Now.. this is not what I'd consider a normal use case, but was hoping to > illustrate some of the more interesting uses and demonstrate the > interface's flexibility. I'm just dying to hear a "normal" use case then. :) > Also it provided a clear example of benefits to doing LRU based > cold-page purging rather than full object purging. > Though I think the > same could be demonstrated in a simpler case of a large cache of objects > that the application wants to mark volatile in one pass, unmarking > sub-objects as it needs. Agreed.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 09:30 AM, Johannes Weiner wrote: > > So between zero-fill and SIGBUS, I'd prefer the one which results in > the simpler user interface / fewer system calls. > The use cases are different; I believe this should be a user space option. -hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: > On 04/01/2014 02:21 PM, Johannes Weiner wrote: > > [ I tried to bring this up during LSFMM but it got drowned out. > > Trying again :) ] > > > > On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote: > >> Optimistic method: > >> 1) Userland marks a large range of data as volatile > >> 2) Userland continues to access the data as it needs. > >> 3) If userland accesses a page that has been purged, the kernel will > >> send a SIGBUS > >> 4) Userspace can trap the SIGBUS, mark the affected pages as > >> non-volatile, and refill the data as needed before continuing on > > As far as I understand, if a pointer to volatile memory makes it into > > a syscall and the fault is trapped in kernel space, there won't be a > > SIGBUS, the syscall will just return -EFAULT. > > > > Handling this would mean annotating every syscall invocation to check > > for -EFAULT, refill the data, and then restart the syscall. This is > > complicated even before taking external libraries into account, which > > may not propagate syscall returns properly or may not be reentrant at > > the necessary granularity. > > > > Another option is to never pass volatile memory pointers into the > > kernel, but that too means that knowledge of volatility has to travel > > alongside the pointers, which will either result in more complexity > > throughout the application or severely limited scope of volatile > > memory usage. > > > > Either way, optimistic volatile pointers are nowhere near as > > transparent to the application as the above description suggests, > > which makes this usecase not very interesting, IMO. If we can support > > it at little cost, why not, but I don't think we should complicate the > > common usecases to support this one. > > So yea, thanks again for all the feedback at LSF-MM! I'm trying to get > things integrated for a v13 here shortly (although with visitors in town > this week it may not happen until next week). 
> So, maybe it's best to ignore the fact that folks want to do semi-crazy > user-space faulting via SIGBUS. At least to start with. Let's look at the > semantic for the "normal" mark volatile, never touch the pages until you > mark non-volatile - basically where accessing volatile pages is similar > to a use-after-free bug. > > So, for the most part, I'd say the proposed SIGBUS semantics don't > complicate things for this basic use-case, at least when compared with > things like zero-fill. If an application accidentally accessed a > purged volatile page, I think SIGBUS is the right thing to do. They most > likely immediately crash, but it's better than them moving along with > silent corruption because they're mucking with zero-filled pages. > > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If > you have a third option you're thinking of, I'd of course be interested > in hearing it. The reason I'm bringing this up again is because I see very few solid use cases for a separate vrange() syscall once we have something like MADV_FREE and MADV_REVIVE, which respectively clear the dirty bits of a range of anon/tmpfs pages, and set them again and report if any pages in the given range were purged on revival. So between zero-fill and SIGBUS, I'd prefer the one which results in the simpler user interface / fewer system calls.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner han...@cmpxchg.org wrote: > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > > On 04/01/2014 04:01 PM, Dave Hansen wrote: > >> On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > >>> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > >> John, this was something that the Mozilla guys asked for, right? Any idea why this isn't ever a problem for them? > > So one of their use cases for it is for library text. Basically they want to decompress a compressed library file into memory. Then they plan to mark the uncompressed pages volatile, and then be able to call into it. Ideally for them, the kernel would only purge cold pages, leaving the hot pages in memory. When they traverse a purged page, they handle the SIGBUS and patch the page up. > How big are these libraries compared to overall system size? Mike or Taras would have to refresh my memory on this detail. My recollection is it mostly has to do with keeping the on-disk size of the library small, so it can load off of slow media very quickly. > > Now.. this is not what I'd consider a normal use case, but was hoping to illustrate some of the more interesting uses and demonstrate the interface's flexibility. > I'm just dying to hear a "normal" use case then. :) So the more normal use case would be marking objects volatile and then non-volatile w/o accessing them in-between. In this case the zero-fill vs SIGBUS semantics don't really matter; it's really just a trade-off in how we handle applications deviating (intentionally or not) from this use case. So to maybe flesh out the context here for folks who are following along (but weren't in the hallway at LSF :), Johannes made a fairly interesting proposal (Johannes: please correct me where I'm maybe slightly off) to use only the dirty bits of the ptes to mark a page as volatile. Then the kernel could reclaim these clean pages as it needed, and when we marked the range as non-volatile, the pages would be re-dirtied and, if any of the pages were missing, we could return a flag with the purged state. This had some different semantics than what I've been working with for awhile (for example, any writes to pages would implicitly clear volatility), so I wasn't completely comfortable with it, but figured I'd think about it to see if it could be done. Particularly since it would in some ways simplify the tmpfs/shm shared volatility that I'd eventually like to do. After thinking it over in the hallway, I talked some of the details w/ Johannes, and there was one issue: while w/ anonymous memory we can still add a VM_VOLATILE flag on the vma (so we can get SIGBUS semantics), for shared volatile ranges we don't have anything to hang a volatile flag on w/o adding some new vma-like structure to the address_space structure (much as we did in the past w/ earlier volatile range implementations). This would negate much of the point of using the dirty bits to simplify the shared volatility implementation. Thus Johannes is reasonably questioning the need for SIGBUS semantics, since if SIGBUS isn't needed, the simpler page-cleaning based volatility could potentially be used. Now, while for the case I'm personally most interested in (ashmem), zero-fill would technically be ok, since that's what Android does. Even so, I don't think it's the best approach for the interface, since applications may end up quite surprised by the results when they accidentally don't follow the "don't touch volatile pages" rule. That point aside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic, though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen d...@sr71.net wrote: > On 04/02/2014 10:18 AM, Johannes Weiner wrote: > > Hence my follow-up question in the other mail about how large we expect such code caches to become in practice in relationship to overall system memory. Are code caches interesting reclaim candidates to begin with? Are they big enough to make the machine thrash/swap otherwise? > A big chunk of the use cases here are for swapless systems anyway, so this is the *only* way for them to reclaim anonymous memory. Their choices are either to be constantly throwing away and rebuilding these objects, or to leave them in memory effectively pinned. > In practice I did see ashmem (the Android thing that we're trying to replace) get used a lot by the Android web browser when I was playing with it. John said that it got used for storing decompressed copies of images. Although images are a simpler case where it's easier to not touch volatile pages. I think Johannes is mostly concerned about cases where volatile pages are being accessed while they are volatile, and the Mozilla folks are so far the only viable case (in my mind... folks may have others) where pages are intentionally accessed while they're volatile, and which thus requires SIGBUS semantics. I suspect handling the SIGBUS and patching up the purged page you trapped on is likely much too complicated for most use cases. But I do think SIGBUS is preferable to zero-fill on purged page access, just because it's likely to be easier to debug applications. thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner han...@cmpxchg.org wrote: > > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote: > >> On 04/01/2014 04:01 PM, Dave Hansen wrote: > >>> On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > >>>> On 04/01/2014 02:21 PM, Johannes Weiner wrote: > >>> John, this was something that the Mozilla guys asked for, right? Any idea why this isn't ever a problem for them? > >> So one of their use cases for it is for library text. Basically they want to decompress a compressed library file into memory. Then they plan to mark the uncompressed pages volatile, and then be able to call into it. Ideally for them, the kernel would only purge cold pages, leaving the hot pages in memory. When they traverse a purged page, they handle the SIGBUS and patch the page up. > > How big are these libraries compared to overall system size? > Mike or Taras would have to refresh my memory on this detail. My recollection is it mostly has to do with keeping the on-disk size of the library small, so it can load off of slow media very quickly. > > > Now.. this is not what I'd consider a normal use case, but was hoping to illustrate some of the more interesting uses and demonstrate the interface's flexibility. > > I'm just dying to hear a "normal" use case then. :) > So the more normal use case would be marking objects volatile and then non-volatile w/o accessing them in-between. In this case the zero-fill vs SIGBUS semantics don't really matter; it's really just a trade-off in how we handle applications deviating (intentionally or not) from this use case. > So to maybe flesh out the context here for folks who are following along (but weren't in the hallway at LSF :), Johannes made a fairly interesting proposal (Johannes: please correct me where I'm maybe slightly off) to use only the dirty bits of the ptes to mark a page as volatile. Then the kernel could reclaim these clean pages as it needed, and when we marked the range as non-volatile, the pages would be re-dirtied and, if any of the pages were missing, we could return a flag with the purged state. > This had some different semantics than what I've been working with for awhile (for example, any writes to pages would implicitly clear volatility), so I wasn't completely comfortable with it, but figured I'd think about it to see if it could be done. Particularly since it would in some ways simplify the tmpfs/shm shared volatility that I'd eventually like to do. > After thinking it over in the hallway, I talked some of the details w/ Johannes, and there was one issue: while w/ anonymous memory we can still add a VM_VOLATILE flag on the vma (so we can get SIGBUS semantics), for shared volatile ranges we don't have anything to hang a volatile flag on w/o adding some new vma-like structure to the address_space structure (much as we did in the past w/ earlier volatile range implementations). This would negate much of the point of using the dirty bits to simplify the shared volatility implementation. Thus Johannes is reasonably questioning the need for SIGBUS semantics, since if SIGBUS isn't needed, the simpler page-cleaning based volatility could potentially be used. Thanks for summarizing this again! > Now, while for the case I'm personally most interested in (ashmem), zero-fill would technically be ok, since that's what Android does. Even so, I don't think it's the best approach for the interface, since applications may end up quite surprised by the results when they accidentally don't follow the "don't touch volatile pages" rule. > That point aside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic, though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. > Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? No, this is accurate. However, I don't really see how this is different than any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it?
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote: > On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen d...@sr71.net wrote: > > On 04/02/2014 10:18 AM, Johannes Weiner wrote: > >> Hence my follow-up question in the other mail about how large we expect such code caches to become in practice in relationship to overall system memory. Are code caches interesting reclaim candidates to begin with? Are they big enough to make the machine thrash/swap otherwise? > > A big chunk of the use cases here are for swapless systems anyway, so this is the *only* way for them to reclaim anonymous memory. Their choices are either to be constantly throwing away and rebuilding these objects, or to leave them in memory effectively pinned. > > In practice I did see ashmem (the Android thing that we're trying to replace) get used a lot by the Android web browser when I was playing with it. John said that it got used for storing decompressed copies of images. > Although images are a simpler case where it's easier to not touch volatile pages. I think Johannes is mostly concerned about cases where volatile pages are being accessed while they are volatile, and the Mozilla folks are so far the only viable case where pages are intentionally accessed while they're volatile, thus requiring SIGBUS semantics. Yes, absolutely, that is my only concern. Compressed images as in Android can easily be marked non-volatile before they are accessed again. Code caches are harder because control is handed off to the CPU, but I'm not entirely sure yet whether these are in fact interesting reclaim candidates. > I suspect handling the SIGBUS and patching up the purged page you trapped on is likely much too complicated for most use cases. But I do think SIGBUS is preferable to zero-fill on purged page access, just because it's likely to be easier to debug applications. Fully agreed, but it seems a bit overkill to add a separate syscall, a range-tree on top of shmem address_spaces, and an essentially new programming model based on SIGBUS userspace fault handling (incl. all the complexities and confusion this inevitably will bring when people DO end up passing these pointers into kernel space) just to be a bit nicer about use-after-free bugs in applications.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
Hi everyone, On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If you have a third option you're thinking of, I'd of course be interested in hearing it. I actually thought the way of being notified with a page fault (sigbus or whatever) was the most efficient way of using volatile ranges. Why having to call a syscall to know if you can still access the volatile range, if there was no VM pressure before the access? syscalls are expensive, accessing the memory direct is not. Only if it page was actually missing and a page fault would fire, you'd take the slowpath. The usages I see for this are plenty, like for maintaining caches in memory that may be big and would be nice to discard if there's VM pressure, jpeg uncompressed images sounds like a candidate too. So the browser size would shrink if there's VM pressure, instead of ending up swapping out uncompressed image data that can be regenerated more quickly with the CPU than with swapins. Now... once you've chosen SIGBUS semantics, there will be folks who will try to exploit the fact that we get SIGBUS on purged page access (at least on the user-space side) and will try to access pages that are volatile until they are purged and try to then handle the SIGBUS to fix things up. Those folks exploiting that will have to be particularly careful not to pass volatile data to the kernel, and if they do they'll have to be smart enough to handle the EFAULT, etc. That's really all their problem, because they're being clever. :) I'm actually working on feature that would solve the problem for the syscalls accessing missing volatile pages. So you'd never see a -EFAULT because all syscalls won't return even if they encounters a missing page in the volatile range dropped by the VM pressure. It's called userfaultfd. You call sys_userfaultfd(flags) and it connects the current mm to a pseudo filedescriptor. 
The file descriptor works similarly to eventfd but with a different protocol. You need a thread that will never access the userfault area with the CPU, and that is responsible for polling the userfaultfd and talking the userfaultfd protocol to fill in missing pages. After a POLLIN event, the userfault thread reads the virtual addresses of the faults that must have happened on some other thread of the same mm, and then writes back a handled virtual range into the fd, after the page (or pages, if multiple) have been regenerated and mapped in with sys_remap_anon_pages(), mremap, or equivalent atomic pagetable page swapping. Then, depending on the resolved range written back into the fd, the kernel will wake up the thread or threads that were waiting in kernel mode on the handled virtual range, and retry the fault without ever exiting kernel mode.

We need this in KVM for running the guest on memory that is on other nodes or in other processes (postcopy live migration is the most common use case, but there are others like memory externalization and cross-node KSM in the cloud, to keep a single copy of memory across multiple nodes, externalized to the VM and to the host node).

This thread made me wonder if we could mix the two features: you would then depend on MADV_USERFAULT and userfaultfd to deliver to userland the faults happening on the volatile pages that have been purged as a result of VM pressure. I'm just saying this after Johannes mentioned the issue with syscalls returning -EFAULT, because that is the very issue that the userfaultfd is going to solve for the KVM migration thread. What I'm thinking now would be to mark the volatile range also MADV_USERFAULT, then call userfaultfd, and instead of having the cache-regeneration slow path inside the SIGBUS handler, run it in the userfault thread that polls the userfaultfd.
Then you could write the volatile ranges to disk with a write() syscall (or use any other syscall on the volatile ranges), without having to worry about -EFAULT being returned because one page was discarded. And if MADV_USERFAULT is not called in combination with the vrange syscalls, it'd still work without the userfault, just with the vrange syscalls only.

In short, the idea would be to let the userfault code solve the fault delivery to userland for you, and let the vrange syscalls focus only on the page purging problem, without having to worry about what happens when something accesses a missing page. But if you don't intend to solve the syscall -EFAULT problem, then the overlap is probably still as thin as I thought it was before (as also mentioned in the link below).

Thanks, Andrea

PS. my last email about this from a more KVM-centric point of view: http://www.spinics.net/lists/kvm/msg101449.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner han...@cmpxchg.org wrote: On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic, though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else?

No, this is accurate. However, I don't really see how this is different from any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it?

The part that troubles me is that I see the purged state as kernel data being corrupted by userland in this case. The kernel will tell userspace that no pages were purged, even though they were, only because userspace made an errant read of a page and got garbage data back.

thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote: Hi everyone, On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If you have a third option you're thinking of, I'd of course be interested in hearing it. I actually thought the way of being notified with a page fault (sigbus or whatever) was the most efficient way of using volatile ranges. Why having to call a syscall to know if you can still access the volatile range, if there was no VM pressure before the access? syscalls are expensive, accessing the memory direct is not. Only if it page was actually missing and a page fault would fire, you'd take the slowpath. Not everybody wants to actually come back for the data in the range, allocators and message passing applications just want to be able to reuse the memory mapping. By tying the volatility to the dirty bit in the page tables, an allocator could simply clear those bits once on free(). When malloc() hands out this region again, the user is expected to write, which will either overwrite the old page, or, if it was purged, fault in a fresh zero page. But there is no second syscall needed to clear volatility. Now... once you've chosen SIGBUS semantics, there will be folks who will try to exploit the fact that we get SIGBUS on purged page access (at least on the user-space side) and will try to access pages that are volatile until they are purged and try to then handle the SIGBUS to fix things up. Those folks exploiting that will have to be particularly careful not to pass volatile data to the kernel, and if they do they'll have to be smart enough to handle the EFAULT, etc. That's really all their problem, because they're being clever. :) I'm actually working on feature that would solve the problem for the syscalls accessing missing volatile pages. 
So you'd never see a -EFAULT because all syscalls won't return even if they encounters a missing page in the volatile range dropped by the VM pressure. It's called userfaultfd. You call sys_userfaultfd(flags) and it connects the current mm to a pseudo filedescriptor. The filedescriptor works similarly to eventfd but with a different protocol. You need a thread that will never access the userfault area with the CPU, that is responsible to poll on the userfaultfd and talk the userfaultfd protocol to fill-in missing pages. The userfault thread after a POLLIN event reads the virtual addresses of the fault that must have happened on some other thread of the same mm, and then writes back an handled virtual range into the fd, after the page (or pages if multiple) have been regenerated and mapped in with sys_remap_anon_pages(), mremap or equivalent atomic pagetable page swapping. Then depending on the solved range written back into the fd, the kernel will wakeup the thread or threads that were waiting in kernel mode on the handled virtual range, and retry the fault without ever exiting kernel mode. We need this in KVM for running the guest on memory that is on other nodes or other processes (postcopy live migration is the most common use case but there are others like memory externalization and cross-node KSM in the cloud, to keep a single copy of memory across multiple nodes and externalized to the VM and to the host node). This thread made me wonder if we could mix the two features and you would then depend on MADV_USERFAULT and userfaultfd to deliver to userland the faults happening on the volatile pages that have been purged as result of VM pressure. I'm just saying this after Johannes mentioned the issue with syscalls returning -EFAULT. Because that is the very issue that the userfaultfd is going to solve for the KVM migration thread. 
What I'm thinking now would be to mark the volatile range also MADV_USERFAULT and then calling userfaultfd and instead of having the cache regeneration slow path inside the SIGBUS handler, to run it in the userfault thread that polls the userfaultfd. Then you could write the volatile ranges to disk with a write() syscall (or use any other syscall on the volatile ranges), without having to worry about -EFAULT being returned because one page was discarded. And if MADV_USERFAULT is not called in combination with vrange syscalls, then it'd still work without the userfault, but with the vrange syscalls only. In short the idea would be to let the userfault code solve the fault delivery to userland for you, and make the vrange syscalls only focus on the page purging problem, without having to worry about what happens when something access a missing page. Yes, the two seem certainly combinable to me. madvise(MADV_FREE | MADV_USERFAULT) to allow purging and userspace fault handling. In the fault slowpath, you can then regenerate any missing data and do MADV_FREE again if it should remain volatile. And again, any actual writes to the region
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 2, 2014 at 11:07 AM, Johannes Weiner han...@cmpxchg.org wrote: On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote: I suspect handling the SIGBUS and patching up the purged page you trapped on is likely much too complicated for most use cases. But I do think SIGBUS is preferable to zero-fill on purged page access, just because it's likely to be easier to debug applications.

Fully agreed, but it seems a bit overkill to add a separate syscall, a range-tree on top of shmem address_spaces, and an essentially new programming model based on SIGBUS userspace fault handling (incl. all the complexities and confusion this inevitably will bring when people DO end up passing these pointers into kernel space) just to be a bit nicer about use-after-free bugs in applications.

It's more about making an interface that has graspable semantics for userspace, instead of having the semantics be a side effect of the implementation. Tying volatility to the page-clean state and page-was-purged to page-present seems problematic to me, because there are too many ways to change the page-clean or page-present state outside of the interface being proposed. I feel this causes a cascade of corner cases that have to be explained to users of the interface.

Also, I disagree that we're adding a new programming model, as SIGBUSes can already be caught; it's just that there's not usually much one can do, whereas with volatile pages it's more likely something could be done. And again, it's really just a side effect of having semantics (SIGBUS on purged page access) that are more helpful from an application's perspective.

As for the separate syscall: again, this is mainly needed to handle allocation failures that happen mid-way through modifying the range. There may still be a way to do the allocation first and only after it succeeds do the modification.
The vma merge/splitting logic doesn't make this easy, but if we can be sure that on a failed split of 1 vma -> 3 vmas (which may fail half way) we can re-merge without allocation and error out (without having to do any other allocations), this might be avoidable. I'm still wanting to look at this. If so, it would be easier to re-add this support under madvise, if folks really really don't like the new syscall. For the most part, having the separate syscall allows us to discuss other details of the semantics, which to me are more important than the syscall naming.

thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner han...@cmpxchg.org wrote: On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? No, this is accurate. However, I don't really see how this is different than any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it? The part that troubles me is that I see the purged state as kernel data being corrupted by userland in this case. The kernel will tell userspace that no pages were purged, even though they were. Only because userspace made an errant read of a page, and got garbage data back. That sounds overly dramatic to me. First of all, this data still reflects accurately the actions of userspace in this situation. And secondly, the kernel does not rely on this data to be meaningful from a userspace perspective to function correctly. It's really nothing but a use-after-free bug that has consequences for no-one but the faulty application. 
The thing that IS new is that even a read is enough to corrupt your data in this case. MADV_REVIVE could return 0 if all pages in the specified range were present, -Esomething if otherwise. That would be semantically sound even if userspace messes up.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 11:31 AM, Andrea Arcangeli wrote: On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote: Now... once you've chosen SIGBUS semantics, there will be folks who will try to exploit the fact that we get SIGBUS on purged page access (at least on the user-space side) and will try to access pages that are volatile until they are purged and try to then handle the SIGBUS to fix things up. Those folks exploiting that will have to be particularly careful not to pass volatile data to the kernel, and if they do they'll have to be smart enough to handle the EFAULT, etc. That's really all their problem, because they're being clever. :)

I'm actually working on a feature that would solve the problem for the syscalls accessing missing volatile pages. So you'd never see a -EFAULT, because syscalls won't fail even when they encounter a missing page in a volatile range dropped by VM pressure. It's called userfaultfd. You call sys_userfaultfd(flags) and it connects the current mm to a pseudo file descriptor. The file descriptor works similarly to eventfd but with a different protocol.

So yea! I actually think (it's been awhile now) I mentioned your work to Taras (or maybe he mentioned it to me?), but it did seem like the userfaultfd would be a better solution for the style of fault handling they were thinking about. (Especially as actually handling SIGBUS and doing something sane in a large threaded application seems very difficult.)

That said, explaining volatile ranges as a concept has been difficult enough without mixing in other new concepts :), so I'm hesitant to tie the functionality together until it's clear the userfaultfd approach is likely to land. But maybe I need to take a closer look at it.

thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/02/2014 12:47 PM, Johannes Weiner wrote: On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner han...@cmpxchg.org wrote: On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? No, this is accurate. However, I don't really see how this is different than any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it? The part that troubles me is that I see the purged state as kernel data being corrupted by userland in this case. The kernel will tell userspace that no pages were purged, even though they were. Only because userspace made an errant read of a page, and got garbage data back. That sounds overly dramatic to me. First of all, this data still reflects accurately the actions of userspace in this situation. And secondly, the kernel does not rely on this data to be meaningful from a userspace perspective to function correctly. 
<insert dramatic-chipmunk video w/ text overlay "errant read corrupted volatile page purge state"> Maybe you're right, but I feel this is the sort of thing application developers would be surprised and annoyed by.

It's really nothing but a use-after-free bug that has consequences for no-one but the faulty application. The thing that IS new is that even a read is enough to corrupt your data in this case. MADV_REVIVE could return 0 if all pages in the specified range were present, -Esomething if otherwise. That would be semantically sound even if userspace messes up.

So it's semantically more of just a combined mincore+dirty operation.. and nothing more?

What are other folks thinking about this? Although I don't particularly like it, I probably could go along with Johannes' approach, forgoing SIGBUS for zero-fill and adopting the semantics that are in my mind a bit stranger. This would allow for ashmem-like behavior w/ the additional write-clears-volatile-state and read-clears-purged-state constraints (which I don't think would be problematic for Android, but am not totally sure).

But I do worry that these semantics are easier for kernel-mm-developers to grasp, but are much much harder for application developers to understand. Additionally, unless we could really leave access-after-volatile as totally undefined behavior, this would lock us into O(page) behavior and would remove the possibility of the O(log(ranges)) behavior Minchan and I were able to get (admittedly with more complicated code - but something I was hoping we'd be able to get back to after the base semantics and interface behavior were understood and merged). Since applications will have bugs and will access after volatile, we won't be able to get away with that sort of behavioral flexibility.
thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On Wed 02-04-14 13:13:34, John Stultz wrote: On 04/02/2014 12:47 PM, Johannes Weiner wrote: On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote: On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner han...@cmpxchg.org wrote: On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote: That point beside, I think the other problem with the page-cleaning volatility approach is that there are other awkward side effects. For example: Say an application marks a range as volatile. One page in the range is then purged. The application, due to a bug or otherwise, reads the volatile range. This causes the page to be zero-filled in, and the application silently uses the corrupted data (which isn't great). More problematic though, is that by faulting the page in, they've in effect lost the purge state for that page. When the application then goes to mark the range as non-volatile, all pages are present, so we'd return that no pages were purged. From an application perspective this is pretty ugly. Johannes: Any thoughts on this potential issue with your proposal? Am I missing something else? No, this is accurate. However, I don't really see how this is different than any other use-after-free bug. If you access malloc memory after free(), you might receive a SIGSEGV, you might see random data, you might corrupt somebody else's data. This certainly isn't nice, but it's not exactly new behavior, is it? The part that troubles me is that I see the purged state as kernel data being corrupted by userland in this case. The kernel will tell userspace that no pages were purged, even though they were. Only because userspace made an errant read of a page, and got garbage data back. That sounds overly dramatic to me. First of all, this data still reflects accurately the actions of userspace in this situation. And secondly, the kernel does not rely on this data to be meaningful from a userspace perspective to function correctly. 
<insert dramatic-chipmunk video w/ text overlay "errant read corrupted volatile page purge state"> Maybe you're right, but I feel this is the sort of thing application developers would be surprised and annoyed by.

It's really nothing but a use-after-free bug that has consequences for no-one but the faulty application. The thing that IS new is that even a read is enough to corrupt your data in this case. MADV_REVIVE could return 0 if all pages in the specified range were present, -Esomething if otherwise. That would be semantically sound even if userspace messes up.

So it's semantically more of just a combined mincore+dirty operation.. and nothing more?

What are other folks thinking about this? Although I don't particularly like it, I probably could go along with Johannes' approach, forgoing SIGBUS for zero-fill and adopting the semantics that are in my mind a bit stranger. This would allow for ashmem-like behavior w/ the additional write-clears-volatile-state and read-clears-purged-state constraints (which I don't think would be problematic for Android, but am not totally sure).

But I do worry that these semantics are easier for kernel-mm-developers to grasp, but are much much harder for application developers to understand.

Yeah, I have to admit that although the simplicity of the implementation looks compelling, the interface from a userspace POV looks weird.

Honza -- Jan Kara j...@suse.cz SUSE Labs, CR
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 04:01 PM, Dave Hansen wrote: > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: >>> Either way, optimistic volatile pointers are nowhere near as >>> transparent to the application as the above description suggests, >>> which makes this usecase not very interesting, IMO. >> ... however, I think you're still derating the value way too much. The >> case of user space doing elastic memory management is more and more >> common, and for a lot of those applications it is perfectly reasonable >> to either not do system calls or to have to devolatilize first. > The SIGBUS is only in cases where the memory is set as volatile and > _then_ accessed, right? Not just set volatile and then accessed, but when a volatile page has been purged and then accessed without being made non-volatile. > John, this was something that the Mozilla guys asked for, right? Any > idea why this isn't ever a problem for them? So one of their use cases for it is for library text. Basically they want to decompress a compressed library file into memory. Then they plan to mark the uncompressed pages volatile, and then be able to call into it. Ideally for them, the kernel would only purge cold pages, leaving the hot pages in memory. When they traverse a purged page, they handle the SIGBUS and patch the page up. Now.. this is not what I'd consider a normal use case, but I was hoping to illustrate some of the more interesting uses and demonstrate the interface's flexibility. Also it provided a clear example of the benefits of doing LRU-based cold-page purging rather than full-object purging. Though I think the same could be demonstrated in the simpler case of a large cache of objects that the application wants to mark volatile in one pass, unmarking sub-objects as it needs.
thanks -john
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 09:03 PM, John Stultz wrote: > > So, maybe it's best to ignore the fact that folks want to do semi-crazy > user-space faulting via SIGBUS. At least to start with. Let's look at the > semantics for the "normal" mark volatile, never touch the pages until you > mark non-volatile - basically where accessing volatile pages is similar > to a use-after-free bug. > > So, for the most part, I'd say the proposed SIGBUS semantics don't > complicate things for this basic use-case, at least when compared with > things like zero-fill. If an application accidentally accesses a > purged volatile page, I think SIGBUS is the right thing to do. They most > likely immediately crash, but it's better than them moving along with > silent corruption because they're mucking with zero-filled pages. > > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If > you have a third option you're thinking of, I'd of course be interested > in hearing it. > People already do SIGBUS for mmap, so there is nothing new here. > Now... once you've chosen SIGBUS semantics, there will be folks who will > try to exploit the fact that we get SIGBUS on purged page access (at > least on the user-space side) and will try to access pages that are > volatile until they are purged and try to then handle the SIGBUS to fix > things up. Those folks exploiting that will have to be particularly > careful not to pass volatile data to the kernel, and if they do they'll > have to be smart enough to handle the EFAULT, etc. That's really all > their problem, because they're being clever. :) Yep. -hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 02:21 PM, Johannes Weiner wrote: > [ I tried to bring this up during LSFMM but it got drowned out. > Trying again :) ] > > On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote: >> Optimistic method: >> 1) Userland marks a large range of data as volatile >> 2) Userland continues to access the data as it needs. >> 3) If userland accesses a page that has been purged, the kernel will >> send a SIGBUS >> 4) Userspace can trap the SIGBUS, mark the affected pages as >> non-volatile, and refill the data as needed before continuing on > As far as I understand, if a pointer to volatile memory makes it into > a syscall and the fault is trapped in kernel space, there won't be a > SIGBUS, the syscall will just return -EFAULT. > > Handling this would mean annotating every syscall invocation to check > for -EFAULT, refill the data, and then restart the syscall. This is > complicated even before taking external libraries into account, which > may not propagate syscall returns properly or may not be reentrant at > the necessary granularity. > > Another option is to never pass volatile memory pointers into the > kernel, but that too means that knowledge of volatility has to travel > alongside the pointers, which will either result in more complexity > throughout the application or severely limited scope of volatile > memory usage. > > Either way, optimistic volatile pointers are nowhere near as > transparent to the application as the above description suggests, > which makes this usecase not very interesting, IMO. If we can support > it at little cost, why not, but I don't think we should complicate the > common usecases to support this one. So yea, thanks again for all the feedback at LSF-MM! I'm trying to get things integrated for a v13 here shortly (although with visitors in town this week it may not happen until next week). So, maybe it's best to ignore the fact that folks want to do semi-crazy user-space faulting via SIGBUS. At least to start with.
Let's look at the semantics for the "normal" mark volatile, never touch the pages until you mark non-volatile - basically where accessing volatile pages is similar to a use-after-free bug.

So, for the most part, I'd say the proposed SIGBUS semantics don't complicate things for this basic use-case, at least when compared with things like zero-fill. If an application accidentally accesses a purged volatile page, I think SIGBUS is the right thing to do. They most likely immediately crash, but it's better than them moving along with silent corruption because they're mucking with zero-filled pages.

So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If you have a third option you're thinking of, I'd of course be interested in hearing it.

Now... once you've chosen SIGBUS semantics, there will be folks who will try to exploit the fact that we get SIGBUS on purged page access (at least on the user-space side) and will try to access pages that are volatile until they are purged and try to then handle the SIGBUS to fix things up. Those folks exploiting that will have to be particularly careful not to pass volatile data to the kernel, and if they do they'll have to be smart enough to handle the EFAULT, etc. That's really all their problem, because they're being clever. :)

I've maybe made a mistake in talking at length about those use cases, because I wanted to make sure folks didn't have suggestions on how to better address those cases (so far I've not heard any), and it sort of helps wrap folks' heads around at least some of the potential variations on the desired purging semantics (LRU-based cold page purging, or entire-object-based purging).

Now, one other potential variant, which Keith brought up at LSF-MM, and others have mentioned before, is to have *any* volatile page access (purged or not) return a SIGBUS.
This seems "safe" in that it protects developers from themselves, and makes application behavior more deterministic (rather than depending on memory pressure). However, it also has the overhead of setting up the pte swp entries for each page in order to trip the SIGBUS. Since folks have explicitly asked for it, allowing non-purged volatile page access seems more flexible. And it's cheaper. So that's what I've been leaning towards. thanks again! -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 02:35 PM, H. Peter Anvin wrote: > On 04/01/2014 02:21 PM, Johannes Weiner wrote: >> Either way, optimistic volatile pointers are nowhere near as >> transparent to the application as the above description suggests, >> which makes this usecase not very interesting, IMO. > > ... however, I think you're still derating the value way too much. The > case of user space doing elastic memory management is more and more > common, and for a lot of those applications it is perfectly reasonable > to either not do system calls or to have to devolatilize first. The SIGBUS is only in cases where the memory is set as volatile and _then_ accessed, right? John, this was something that the Mozilla guys asked for, right? Any idea why this isn't ever a problem for them?
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 02:21 PM, Johannes Weiner wrote: > > Either way, optimistic volatile pointers are nowhere near as > transparent to the application as the above description suggests, > which makes this usecase not very interesting, IMO. > ... however, I think you're still derating the value way too much. The case of user space doing elastic memory management is more and more common, and for a lot of those applications it is perfectly reasonable to either not do system calls or to have to devolatilize first. -hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 02:21 PM, Johannes Weiner wrote: > [ I tried to bring this up during LSFMM but it got drowned out. > Trying again :) ] > > On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote: >> Optimistic method: >> 1) Userland marks a large range of data as volatile >> 2) Userland continues to access the data as it needs. >> 3) If userland accesses a page that has been purged, the kernel will >> send a SIGBUS >> 4) Userspace can trap the SIGBUS, mark the affected pages as >> non-volatile, and refill the data as needed before continuing on > > As far as I understand, if a pointer to volatile memory makes it into > a syscall and the fault is trapped in kernel space, there won't be a > SIGBUS, the syscall will just return -EFAULT. > > Handling this would mean annotating every syscall invocation to check > for -EFAULT, refill the data, and then restart the syscall. This is > complicated even before taking external libraries into account, which > may not propagate syscall returns properly or may not be reentrant at > the necessary granularity. > > Another option is to never pass volatile memory pointers into the > kernel, but that too means that knowledge of volatility has to travel > alongside the pointers, which will either result in more complexity > throughout the application or severely limited scope of volatile > memory usage. > > Either way, optimistic volatile pointers are nowhere near as > transparent to the application as the above description suggests, > which makes this usecase not very interesting, IMO. If we can support > it at little cost, why not, but I don't think we should complicate the > common usecases to support this one. > The whole EFAULT thing is a fundamental problem with the kernel interface, and this is by no means the only place that suffers from it. The fact that we cannot reliably get SIGSEGV or SIGBUS because a pointer may have been passed into a system call is an enormous problem. The question is whether it is fixable at all. 
-hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
[ I tried to bring this up during LSFMM but it got drowned out. Trying again :) ] On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote: > Optimistic method: > 1) Userland marks a large range of data as volatile > 2) Userland continues to access the data as it needs. > 3) If userland accesses a page that has been purged, the kernel will > send a SIGBUS > 4) Userspace can trap the SIGBUS, mark the affected pages as > non-volatile, and refill the data as needed before continuing on As far as I understand, if a pointer to volatile memory makes it into a syscall and the fault is trapped in kernel space, there won't be a SIGBUS, the syscall will just return -EFAULT. Handling this would mean annotating every syscall invocation to check for -EFAULT, refill the data, and then restart the syscall. This is complicated even before taking external libraries into account, which may not propagate syscall returns properly or may not be reentrant at the necessary granularity. Another option is to never pass volatile memory pointers into the kernel, but that too means that knowledge of volatility has to travel alongside the pointers, which will either result in more complexity throughout the application or severely limited scope of volatile memory usage. Either way, optimistic volatile pointers are nowhere near as transparent to the application as the above description suggests, which makes this usecase not very interesting, IMO. If we can support it at little cost, why not, but I don't think we should complicate the common usecases to support this one.
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 09:03 PM, John Stultz wrote: > So, maybe it's best to ignore the fact that folks want to do semi-crazy > user-space faulting via SIGBUS. At least to start with. Let's look at the > semantics of the "normal" case: mark volatile, never touch the pages until > you mark them non-volatile - basically where accessing volatile pages is > similar to a use-after-free bug. So, for the most part, I'd say the proposed > SIGBUS semantics don't complicate things for this basic use-case, at least > when compared with things like zero-fill. If an application accidentally > accesses a purged volatile page, I think SIGBUS is the right thing to do. > It will most likely crash immediately, but that's better than moving along > with silent corruption because it's mucking with zero-filled pages. So > between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If you > have a third option you're thinking of, I'd of course be interested in > hearing it. People already do SIGBUS for mmap, so there is nothing new here. > Now... once you've chosen SIGBUS semantics, there will be folks who will > try to exploit the fact that we get SIGBUS on purged-page access (at least > on the user-space side): they will access pages that are volatile until > they are purged, and then handle the SIGBUS to fix things up. Those folks > exploiting that will have to be particularly careful not to pass volatile > data to the kernel, and if they do they'll have to be smart enough to > handle the EFAULT, etc. That's really all their problem, because they're > being clever. :) Yep. -hpa
Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
On 04/01/2014 04:01 PM, Dave Hansen wrote: > On 04/01/2014 02:35 PM, H. Peter Anvin wrote: >> On 04/01/2014 02:21 PM, Johannes Weiner wrote: >>> Either way, optimistic volatile pointers are nowhere near as >>> transparent to the application as the above description suggests, >>> which makes this usecase not very interesting, IMO. >> ... however, I think you're still derating the value way too much. The >> case of user space doing elastic memory management is more and more >> common, and for a lot of those applications it is perfectly reasonable >> to either not do system calls or to have to devolatilize first. > The SIGBUS is only in cases where the memory is set as volatile and > _then_ accessed, right? Not just set volatile and then accessed, but when a volatile page has been purged and then accessed without being made non-volatile. > John, this was something that the Mozilla guys asked for, right? Any > idea why this isn't ever a problem for them? So one of their use cases is library text. Basically they want to decompress a compressed library file into memory, mark the uncompressed pages volatile, and then be able to call into them. Ideally for them, the kernel would purge only cold pages, leaving the hot pages in memory. When they traverse a purged page, they handle the SIGBUS and patch the page up. Now... this is not what I'd consider a normal use case, but I was hoping to illustrate some of the more interesting uses and demonstrate the interface's flexibility. It also provides a clear example of the benefits of LRU-based cold-page purging rather than full-object purging. Though I think the same could be demonstrated with the simpler case of a large cache of objects that the application wants to mark volatile in one pass, unmarking sub-objects as it needs them. 
thanks -john
[PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder
Just wanted to send out an updated patch set that includes changes from some of the reviews. Hopefully folks will have some time to look them over prior to the LSF-MM discussion on volatile ranges on Tuesday (see below for LSF-MM discussion points to think about). New changes are: o Added flags argument to the syscall, which is unused, but per https://lwn.net/Articles/585415/ seems like a good idea. o Minor vma traversing cleanups suggested by Jan o Return an error when trying to mark unmapped regions o First pass implementation of marking pages referenced when they are marked volatile, so the pages in a range are set to the same "age" and will be approximately purged together. This behavior is still open for discussion. o Very naive implementation of anonymous page aging on swapless systems. This has clear performance issues, as we burn time overly scanning anonymous pages, but provides something concrete upon which to discuss what the best way would be to solve this. o Other minor code cleanups The first three patches are still the core functionality, which I'd really like further review on. The last two patches in this series are more discussion starters, and are less serious. Potential discussion items for LSF-MM to think about: o How to increase reviewer interest? - Lots of interest from application world o Page aging semantics when marking volatile. - Should marking volatile be the same as accessing pages? - Should volatile ranges be put on end of inactive lru? - Should we just punt this and have applications combine madvise() use with vrange() to specify range age? o Volatile page & purged page accounting - Volatility is stored in per-process vma, not page - vmstats are page based, how do we deal w/ COWed pages? o Aging anonymous memory on swapless systems - Any thoughts on improving over naive method? - Better volatile page accounting might help? - Do we need a separate volatile LRU? 
o Shared volatility on tmpfs/shm/memfd (required for ashmem) - Johannes' idea for clearing dirty bits? - vma-like structure on the address space? thanks -john Volatile ranges provide a method for userland to inform the kernel that a range of memory is safe to discard (ie: can be regenerated) but userspace may want to try to access it in the future. It can be thought of as similar to MADV_DONTNEED, except that the actual freeing of the memory is delayed and only done under memory pressure, and the user can try to cancel the action and be able to quickly access any unpurged pages. The idea originated from Android's ashmem, but I've since learned that other OSes provide similar functionality. This functionality allows for a number of interesting uses. One such example is: Userland caches that have kernel-triggered eviction under memory pressure. This allows the kernel to "rightsize" userspace caches for the current system-wide workload. Things like image bitmap caches, or rendered HTML in a hidden browser tab, where the data is not visible and can be regenerated if needed, are good examples. Both Chrome and Firefox already make use of volatile ranges via the ashmem interface: https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34 https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc There are two basic ways volatile ranges can be used: Explicit marking method: 1) Userland marks a range of memory that can be regenerated if necessary as volatile 2) Before accessing the memory again, userland marks the memory as nonvolatile, and the kernel will provide notification if any pages in the range have been purged. Optimistic method: 1) Userland marks a large range of data as volatile 2) Userland continues to access the data as it needs. 
3) If userland accesses a page that has been purged, the kernel will send a SIGBUS 4) Userspace can trap the SIGBUS, mark the affected pages as non-volatile, and refill the data as needed before continuing on You can read more about the history of volatile ranges here (~reverse chronological order): https://lwn.net/Articles/590991/ http://permalink.gmane.org/gmane.linux.kernel.mm/98848 http://permalink.gmane.org/gmane.linux.kernel.mm/98676 https://lwn.net/Articles/522135/ https://lwn.net/Kernel/Index/#Volatile_ranges Continuing from the last release, this revision is reduced in scope when compared to earlier attempts. I've only focused on handling volatility on anonymous memory, and we're storing the volatility in the VMA. This may have performance implications compared with the earlier approach, but it does simplify the approach. I'm open to expanding functionality via flags arguments, but for now I'm wanting to keep focus on what the right default behavior should be and keep the use cases restricted to help get reviewer interest. Further, the page
[PATCH 0/5] Volatile Ranges (v12) LSF-MM discussion fodder
Just wanted to send out an updated patch set that includes changes from some of the reviews. Hopefully folks will have some time to look them over prior to the LSF-MM discussion on volatile ranges on Tuesday (see below for LSF-MM discussion points to think about). New changes are: o Added flags argument to the syscall, which is unused, but per https://lwn.net/Articles/585415/ seems like a good idea. o Minor vma traversing cleanups suggested by Jan o Return an error when trying to mark unmapped regions o First pass implementation of marking pages referenced when they are marked volatile, so the pages in a range are set to the same age and will be approximately purged together. This behavior is still open for discussion. o Very naive implementation of anonymous page aging on swapless systems. This has clear performance issues, as we burn time overly scanning anonymous pages, but provides something concrete upon which to discuss what the best way would be to solve this. o Other minor code cleanups The first three patches are still the core functionality, which I'd really like further review on. The last two patches in this series are more discussion starters, and are less serious. Potential discussion items for LSF-MM to think about: o How to increase reviewer interest? - Lots of interest from application world o Page aging semantics when marking volatile. - Should marking volatile be the same as accessing pages? - Should volatile ranges be put on end of inactive lru? - Should we just punt this and have applications combine madvise() use with vrange() to specify range age? o Volatile page purged page accounting - Volatility is stored in per-process vma, not page - vmstats are page based, how do we deal w/ COWed pages? o Aging anonymous memory on swapless systems - Any thoughts on improving over naive method? - Better volatile page accounting might help? - Do we need a separate volatile LRU? 
o Shared volatility on tmpfs/shm/memfd (required for ashmem) - Johannes idea for clearing dirty bits? - vma-like structure on the address space? thanks -john Volatile ranges provides a method for userland to inform the kernel that a range of memory is safe to discard (ie: can be regenerated) but userspace may want to try access it in the future. It can be thought of as similar to MADV_DONTNEED, but that the actual freeing of the memory is delayed and only done under memory pressure, and the user can try to cancel the action and be able to quickly access any unpurged pages. The idea originated from Android's ashmem, but I've since learned that other OSes provide similar functionality. This functionality allows for a number of interesting uses. One such example is: Userland caches that have kernel triggered eviction under memory pressure. This allows for the kernel to rightsize userspace caches for current system-wide workload. Things like image bitmap caches, or rendered HTML in a hidden browser tab, where the data is not visible and can be regenerated if needed, are good examples. Both Chrome and Firefox already make use of volatile ranges via the ashmem interface: https://hg.mozilla.org/releases/mozilla-b2g28_v1_3t/rev/a32c32b24a34 https://chromium.googlesource.com/chromium/src/base/+/47617a69b9a57796935e03d78931bd01b4806e70/memory/discardable_memory_allocator_android.cc There are two basic ways volatile ranges can be used: Explicit marking method: 1) Userland marks a range of memory that can be regenerated if necessary as volatile 2) Before accessing the memory again, userland marks the memory as nonvolatile, and the kernel will provide notification if any pages in the range has been purged. Optimistic method: 1) Userland marks a large range of data as volatile 2) Userland continues to access the data as it needs. 
3) If userland accesses a page that has been purged, the kernel will send a SIGBUS 4) Userspace can trap the SIGBUS, mark the affected pages as non-volatile, and refill the data as needed before continuing on You can read more about the history of volatile ranges here (~reverse chronological order): https://lwn.net/Articles/590991/ http://permalink.gmane.org/gmane.linux.kernel.mm/98848 http://permalink.gmane.org/gmane.linux.kernel.mm/98676 https://lwn.net/Articles/522135/ https://lwn.net/Kernel/Index/#Volatile_ranges Continuing from the last release, this revision is reduced in scope when compared to earlier attempts. I've only focused on handled volatility on anonymous memory, and we're storing the volatility in the VMA. This may have performance implications compared with the earlier approach, but it does simplify the approach. I'm open to expanding functionality via flags arugments, but for now I'm wanting to keep focus on what the right default behavior should be and keep the use cases restricted to help get reviewer interest. Further, the page