Re: [PATCH v10 00/16] Volatile Ranges v10
On Fri, Jan 31, 2014 at 11:49:01AM -0500, Johannes Weiner wrote:
> On Wed, Jan 29, 2014 at 02:11:02PM +0900, Minchan Kim wrote:
> > It's interesting timing, I posted this patch New Year's Day
> > and receives in-depth design review Lunar New Year's Day. :)
> > It's almost 0-day review. :)
>
> That's the only way I can do 0-day reviews ;)
>
> > On Tue, Jan 28, 2014 at 07:03:59PM -0500, Johannes Weiner wrote:
> > > Hello Minchan,
> > >
> > > On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> > > > Hey all,
> > > >
> > > > Happy New Year!
> > > >
> > > > I know it's bad timing to send this unfamiliar large patchset for
> > > > review but hope there are some guys with freshed-brain in new year
> > > > all over the world. :)
> > > > And most important thing is that before I dive into lots of testing,
> > > > I'd like to make an agreement on design issues and others
> > > >
> > > > o Syscall interface
> > >
> > > Why do we need another syscall for this?  Can't we extend madvise to
> >
> > Yeb. I should have written the reason. Early versions in this patchset
> > had used madvise with VMA handling but it was terrible performance for
> > ebizzy workload by mmap_sem's downside lock due to merging/split VMA.
> > Even it was worse than old so I gave up the VMA approach.
> >
> > You could see the difference.
> > https://lkml.org/lkml/2013/10/8/63
>
> So the compared kernels are 4 releases apart and the test happened
> inside a VM.  It's also not really apparent from that link what the
> tested workload is doing.  We first have to agree that it's doing
> nothing that could be avoided.  E.g. we wouldn't introduce an
> optimized version of write() because an application that writes 4G at
> one byte per call is having problems.

About the ebizzy workload: the process allocates several chunks, then
threads start to allocate their own chunk and *copy* the content from a
random one of the preallocated chunks into their own chunk.
It means lots of threads are page-faulting, so the mmap_sem write-side
lock is a really critical point for performance.
(I don't know whether ebizzy is really good for real practice, but at
least several papers and benchmark suites have used it, so we can't
ignore it. And per-thread allocators are really popular these days.)

With the VMA approach, we need the mmap_sem write-side lock twice to
mark/unmark VM_VOLATILE in vma->vm_flags, so in my experiment the
performance was terrible, as I said in the link. I don't think the
situation on the current kernel would be better than on the old one.
And virtualization is a really important technique these days, so we
can't ignore that, although I tested it on a VM for convenience.
If you want, I surely can test it on a bare box.

>
> The vroot lock has the same locking granularity as mmap_sem.  Why is
> mmap_sem more contended in this test?

It seems the above explanation is enough.

>
> > > take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> > > in the range was purged?
> >
> > In that case, -ENOMEM would have duplicated meaning "Purged" and "Out
> > of memory so failed in the middle of the system call processing" and
> > later could be a problem so we need to return value to indicate
> > how many bytes are succeeded so far so it means we need additional
> > out parameter. But yes, we can solve it by modifying semantic and
> > behavior (ex, as you said below, we could just unmark volatile
> > successfully if user pass (offset, len) consistent with marked volatile
> > ranges. (IOW, if we give up overlapping/subrange marking/unmarking
> > usecase. I expect it makes code simple further).
> > It's request from John so If he is okay, I'm no problem.
>
> Yes, I don't insist on using madvise.  And it's too early to decide on
> an interface before we have fully nailed the semantics and features.
>
> > > > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > > > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> > > >   footprint.
> > >
> > > VMAs are there to track attributes of memory ranges.  Duplicating
> > > large parts of their functionality and co-maintaining both structures
> > > on create, destroy, split, and merge means duplicate code and complex
> > > interactions.
> > >
> > > 1. You need to define semantics and coordinate what happens when the
> > >    vma underlying a volatile range changes.
> > >
> > >    Either you have to strictly co-maintain both range objects, or you
> > >    have weird behavior like volatility outliving a vma and then applying
> > >    to a separate vma created in its place.
> > >
> > >    Userspace won't get this right, and even in the kernel this is
> > >    error prone and adds a lot to the complexity of vma management.
> >
> > Current semantic is following as,
> > Vma handling logic in mm doesn't need to know vrange handling because
> > vrange's internal logic always checks validity of the vma but
> > one thing to do in vma logic is only clearing old volatile ranges
> > on creating new vma.
> > (Look at [PATCH v10 02/16] vrange: Clear volatility on new mmaps)
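As an illustration of the split/merge cost argued about in this message, here is a toy Python model (not kernel code) of what happens to a VMA list when a per-VMA flag is set on a sub-range and later cleared. The `Vma` tuple and the `set_flag` helper are hypothetical names invented for this sketch; the `volatile` field only stands in for a VM_VOLATILE bit in vm_flags, and the model ignores locking entirely. In the kernel, each of these list changes would be done under the mmap_sem write lock, which is the cost Minchan is pointing at.

```python
from typing import List, NamedTuple


class Vma(NamedTuple):
    start: int      # first page, inclusive
    end: int        # last page, exclusive
    volatile: bool  # stand-in for a VM_VOLATILE bit (assumption)


def set_flag(vmas: List[Vma], start: int, end: int, volatile: bool) -> List[Vma]:
    """Return a new sorted VMA list with [start, end) set to `volatile`."""
    out: List[Vma] = []
    for v in vmas:
        lo, hi = max(v.start, start), min(v.end, end)
        if lo >= hi or v.volatile == volatile:
            out.append(v)  # no overlap, or the flag already matches
            continue
        if v.start < lo:
            out.append(Vma(v.start, lo, v.volatile))  # split off left part
        out.append(Vma(lo, hi, volatile))             # the changed middle
        if hi < v.end:
            out.append(Vma(hi, v.end, v.volatile))    # split off right part

    # Merge neighbours that are adjacent and carry the same flag.
    merged: List[Vma] = []
    for v in out:
        if merged and merged[-1].end == v.start and merged[-1].volatile == v.volatile:
            merged[-1] = Vma(merged[-1].start, v.end, v.volatile)
        else:
            merged.append(v)
    return merged


vmas = [Vma(0, 100, False)]
marked = set_flag(vmas, 40, 60, True)       # one VMA becomes three
unmarked = set_flag(marked, 40, 60, False)  # three VMAs merge back into one
print(len(marked), len(unmarked))
```

One mark/unmark cycle on a sub-range therefore means two rounds of list surgery, which matches the "write-side lock twice" remark above.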
Re: [PATCH v10 00/16] Volatile Ranges v10
On Mon, Feb 03, 2014 at 03:58:06PM +0100, Jan Kara wrote:
> On Fri 31-01-14 11:49:01, Johannes Weiner wrote:
> > On Wed, Jan 29, 2014 at 02:11:02PM +0900, Minchan Kim wrote:
> > > >    The only way to make these semantics clean is either
> > > >
> > > >      a) have vrange() return a range ID so that only full ranges can
> > > >         later be marked non-volatile, or
> > > >
> > > >      b) remember individual page purges so that sub-range changes can
> > > >         properly report them
> > > >
> > > >    I don't like a) much because it's somewhat arbitrarily more
> > > >    restrictive than madvise, mprotect, mmap/munmap etc.  And for b),
> > > >    the straight-forward solution would be to put purge-cookies into
> > > >    the page tables to properly report purges in subrange changes, but
> > > >    that would be even more coordination between vmas, page tables, and
> > > >    the ad-hoc vranges.
> > >
> > > Agree but I don't want to put a accuracy of default vrange syscall.
> > > Page table lookup needs mmap_sem and O(N) cost so I'm afraid it would
> > > make userland folks hesitant using this system call.
> >
> > If userspace sees nothing but cost in this system call, nothing but a
> > voluntary donation for the common good of the system, then it does not
> > matter how cheap this is, nobody will use it.  Why would they?  Even
>   I think this is a flawed logic. If you take it to the extreme then why
> each application doesn't allocate all the available memory and never free
> it? Because users will kick such application in the ass as soon as they
> have a viable alternative. So there is certainly a relatively strong
> benefit in being a good citizen on the system. But it's a matter of a
> tradeoff - if being a good citizen costs you too much (in the extreme if it
> would make the application hardly usable because it is too slow), then you
> just give up or hack it around in some other way...

Oh, that is exactly what I was trying to point out.

The argument was basically that it has to be as cheap and lightweight
as humanly possible because applications participate voluntarily and
they won't donate memory back if it comes at a cost.

And as you said, this is flawed.  There is an incentive to give back
memory other than altruistic tendencies, namely the looming kick in
the butt.

So I very much agree that there is a trade-off to be had, but I think
the cost of the proposed implementation is not justified.

If we agree that simply not returning memory is unacceptable anyway,
providing an interface that is drastically cheaper than the current
means of returning memory is already an improvement.  Even if it's
still O(#pages).  So I think the incentive to use it is there.  We
should design it to fit into the existing VM and then optimize it,
rather than design for an (unnecessary) optimization.
Re: [PATCH v10 00/16] Volatile Ranges v10
On Wed, Jan 29, 2014 at 02:11:02PM +0900, Minchan Kim wrote:
> It's interesting timing, I posted this patch New Year's Day
> and receives in-depth design review Lunar New Year's Day. :)
> It's almost 0-day review. :)

That's the only way I can do 0-day reviews ;)

> On Tue, Jan 28, 2014 at 07:03:59PM -0500, Johannes Weiner wrote:
> > Hello Minchan,
> >
> > On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> > > Hey all,
> > >
> > > Happy New Year!
> > >
> > > I know it's bad timing to send this unfamiliar large patchset for
> > > review but hope there are some guys with freshed-brain in new year
> > > all over the world. :)
> > > And most important thing is that before I dive into lots of testing,
> > > I'd like to make an agreement on design issues and others
> > >
> > > o Syscall interface
> >
> > Why do we need another syscall for this?  Can't we extend madvise to
>
> Yeb. I should have written the reason. Early versions in this patchset
> had used madvise with VMA handling but it was terrible performance for
> ebizzy workload by mmap_sem's downside lock due to merging/split VMA.
> Even it was worse than old so I gave up the VMA approach.
>
> You could see the difference.
> https://lkml.org/lkml/2013/10/8/63

So the compared kernels are 4 releases apart and the test happened
inside a VM.  It's also not really apparent from that link what the
tested workload is doing.  We first have to agree that it's doing
nothing that could be avoided.  E.g. we wouldn't introduce an
optimized version of write() because an application that writes 4G at
one byte per call is having problems.

The vroot lock has the same locking granularity as mmap_sem.  Why is
mmap_sem more contended in this test?

> > take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> > in the range was purged?
>
> In that case, -ENOMEM would have duplicated meaning "Purged" and "Out
> of memory so failed in the middle of the system call processing" and
> later could be a problem so we need to return value to indicate
> how many bytes are succeeded so far so it means we need additional
> out parameter. But yes, we can solve it by modifying semantic and
> behavior (ex, as you said below, we could just unmark volatile
> successfully if user pass (offset, len) consistent with marked volatile
> ranges. (IOW, if we give up overlapping/subrange marking/unmarking
> usecase. I expect it makes code simple further).
> It's request from John so If he is okay, I'm no problem.

Yes, I don't insist on using madvise.  And it's too early to decide on
an interface before we have fully nailed the semantics and features.

> > > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> > >   footprint.
> >
> > VMAs are there to track attributes of memory ranges.  Duplicating
> > large parts of their functionality and co-maintaining both structures
> > on create, destroy, split, and merge means duplicate code and complex
> > interactions.
> >
> > 1. You need to define semantics and coordinate what happens when the
> >    vma underlying a volatile range changes.
> >
> >    Either you have to strictly co-maintain both range objects, or you
> >    have weird behavior like volatility outliving a vma and then applying
> >    to a separate vma created in its place.
> >
> >    Userspace won't get this right, and even in the kernel this is
> >    error prone and adds a lot to the complexity of vma management.
>
> Current semantic is following as,
> Vma handling logic in mm doesn't need to know vrange handling because
> vrange's internal logic always checks validity of the vma but
> one thing to do in vma logic is only clearing old volatile ranges
> on creating new vma.
> (Look at [PATCH v10 02/16] vrange: Clear volatility on new mmaps)
> Actually I don't like the idea and suggested following as.
> https://git.kernel.org/cgit/linux/kernel/git/minchan/linux.git/commit/?h=vrange-working&id=821f58333b381fd88ee7f37fd9c472949756c74e
> But John didn't like it. I guess if VMA size is really matter,
> maybe we can embedded the flag into somewhere field of
> vma(ex, vm_file LSB?)

It's not entirely clear to me how the per-VMA variable can work like
that when vmas can merge and split by other means (mprotect e.g.)

> > 2. If page reclaim discards a page from the upper end of a range,
> >    you mark the whole range as purged.  If the user later marks the
> >    lower half of the range as non-volatile, the syscall will report
> >    purged=1 even though all requested pages are still there.
>
> True, The assumption is that basically, user should have a range
> per object but we gives flexibility for user to handle subranges
> of a volatile range so it might report false positive as you said.
> In that case, please user can use mincore(2) for accuracy if he
> want so he has flexibility but lose performance a bit.
> It's a tradeoff, IMO.

Look, we can't present a syscall that takes an exact range of bytes
and then return results that are not applicable to this range at all.

We can not make performance
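The false positive described in point 2 can be reproduced with a small Python model. This is only a sketch under stated assumptions: `RangeFlagTracker` and `PerPageTracker` are hypothetical classes invented here, page numbers stand in for addresses, and neither class reflects the actual vrange data structures in the patchset. The first keeps a single purged flag per volatile range; the second remembers each purged page, in the spirit of option b).

```python
class RangeFlagTracker:
    """Coarse model: one purged flag for the whole volatile range."""

    def __init__(self, start, end):
        self.start, self.end = start, end
        self.purged = False

    def reclaim_page(self, page):
        if self.start <= page < self.end:
            self.purged = True

    def mark_nonvolatile(self, start, end):
        # The range-wide flag is reported whatever sub-range is asked for.
        return self.purged


class PerPageTracker:
    """Fine model: remember every purged page individually."""

    def __init__(self, start, end):
        self.start, self.end = start, end
        self.purged_pages = set()

    def reclaim_page(self, page):
        if self.start <= page < self.end:
            self.purged_pages.add(page)

    def mark_nonvolatile(self, start, end):
        # Only purges inside the requested sub-range count.
        return any(start <= p < end for p in self.purged_pages)


coarse = RangeFlagTracker(0, 8)
fine = PerPageTracker(0, 8)
for tracker in (coarse, fine):
    tracker.reclaim_page(7)          # reclaim hits the upper end of the range

lower_half_coarse = coarse.mark_nonvolatile(0, 4)
lower_half_fine = fine.mark_nonvolatile(0, 4)
print(lower_half_coarse, lower_half_fine)
```

The coarse tracker reports a purge for the lower half even though pages 0 to 3 are intact, while the per-page tracker does not; the price of the second is the per-page bookkeeping the thread is arguing about.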
Re: [PATCH v10 00/16] Volatile Ranges v10
On Thu, Jan 30, 2014 at 05:27:18PM -0800, John Stultz wrote:
> On 01/29/2014 10:30 AM, Johannes Weiner wrote:
> > On Tue, Jan 28, 2014 at 05:43:54PM -0800, John Stultz wrote:
> >> On 01/28/2014 04:03 PM, Johannes Weiner wrote:
> >>> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> >>>> o Syscall interface
> >>> Why do we need another syscall for this?  Can't we extend madvise to
> >>> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> >>> in the range was purged?
> >> So the madvise interface is insufficient to provide the semantics
> >> needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the
> >> NONVOLATILE call, we have to atomically unmark the volatility status of
> >> the byte range and provide the purge status, which informs the caller if
> >> any of the data in the specified range was discarded (and thus needs to
> >> be regenerated).
> >>
> >> The problem is that by clearing the range, we may need to allocate
> >> memory (possibly by splitting in an existing range segment into two),
> >> which possibly could fail. Unfortunately this could happen after we've
> >> modified the volatile state of part of that range.  At this point we
> >> can't just fail, because we've modified state and we also need to return
> >> the purge status of the modified state.
> > munmap() can theoretically fail for the same reason (splitting has to
> > allocate a new vma) but it's not even documented.  The allocator does
> > not fail allocations of that order.
> >
> > I'm not sure this is good enough, but to me it sounds a bit overkill
> > to design a new system call around a non-existent problem.
>
> I still think it's a problematic design issue. With munmap, I think
> re-calling on failure should be fine. But with _NONVOLATILE we could
> possibly lose the purge status on a second call (for instance if only
> the first page of memory was purged, but we errored out mid-call w/
> ENOMEM, on the second call it will seem like the range was successfully
> set non-volatile with no memory purged).
>
> And even if the current allocator never ever fails, I worry at some
> point in the future that rule might change and then we'd have a broken
> interface.

Fair enough, we don't have to paint ourselves into a corner.

> >>> 2. If page reclaim discards a page from the upper end of a range,
> >>>    you mark the whole range as purged.  If the user later marks the
> >>>    lower half of the range as non-volatile, the syscall will report
> >>>    purged=1 even though all requested pages are still there.
> >> To me this aspect is a non-ideal but acceptable result of the usage
> >> pattern.
> >>
> >> Semantically, the hard rule would be we never report non-purged if pages
> >> in a range were purged.  Reporting purged when pages technically weren't
> >> is not optimal but acceptable side effect of unmarking a sub-range. And
> >> could be avoided by applications marking and unmarking objects
> >> consistently.
> >>
> >>
> >>>    The only way to make these semantics clean is either
> >>>
> >>>      a) have vrange() return a range ID so that only full ranges can
> >>>         later be marked non-volatile, or
> >>>
> >>>      b) remember individual page purges so that sub-range changes can
> >>>         properly report them
> >>>
> >>>    I don't like a) much because it's somewhat arbitrarily more
> >>>    restrictive than madvise, mprotect, mmap/munmap etc.
> >> Agreed on A.
> >>
> >>> And for b),
> >>>    the straight-forward solution would be to put purge-cookies into
> >>>    the page tables to properly report purges in subrange changes, but
> >>>    that would be even more coordination between vmas, page tables, and
> >>>    the ad-hoc vranges.
> >> And for B this would cause way too much overhead for the mark/unmark
> >> operations, which have to be lightweight.
> > Yes, and allocators/message passers truly don't need this because at
> > the time they set a region to volatile the contents are invalidated
> > and the non-volatile declaration doesn't give a hoot if content has
> > been destroyed.
> >
> > But caches certainly would have to know if they should regenerate the
> > contents.  And bigger areas should be using huge pages, so we'd check
> > in 2MB steps.  Is this really more expensive than regenerating the
> > contents on a false positive?
>
> So you make a good argument. I'd counter that the false-positives are
> only caused when unmarking subranges of larger marked volatile range,
> and for use cases that would care about regenerating the contents,
> that's not a likely usage model (as they're probably going to be
> marking objects in memory volatile/nonvolatile, not just arbitrary
> ranges of pages).

I can imagine that applications have continuous areas of same-sized
objects and want to mark a whole range of them volatile in one go,
then later come back for individual objects.

Otherwise we'd require N adjacent objects to be marked individually
through N syscalls to create N
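For a sense of scale on the "check in 2MB steps" remark, the short Python calculation below counts how many page-table entries a purge check would have to visit for a given range. The sizes are assumptions typical of x86-64 (4 KiB base pages, 2 MiB huge pages) and are not stated in the thread; `entries_to_check` is a hypothetical helper for this sketch.

```python
PAGE_SIZE = 4 * 1024                # assumed base page size (4 KiB)
HUGE_PAGE_SIZE = 2 * 1024 * 1024    # assumed huge page size (2 MiB)


def entries_to_check(range_bytes, step):
    """Number of entries visited when walking range_bytes in `step`-sized units."""
    return -(-range_bytes // step)  # ceiling division


one_gib = 1 << 30
small = entries_to_check(one_gib, PAGE_SIZE)
huge = entries_to_check(one_gib, HUGE_PAGE_SIZE)
print(small, huge, small // huge)
```

Under these assumptions a 1 GiB range needs 262144 checks at base-page granularity but only 512 at huge-page granularity, a factor of 512; whether either number is cheap enough is the open question in the exchange above.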
Re: [PATCH v10 00/16] Volatile Ranges v10
On 01/29/2014 10:30 AM, Johannes Weiner wrote:
> On Tue, Jan 28, 2014 at 05:43:54PM -0800, John Stultz wrote:
>> On 01/28/2014 04:03 PM, Johannes Weiner wrote:
>>> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
>>>> o Syscall interface
>>> Why do we need another syscall for this?  Can't we extend madvise to
>>> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
>>> in the range was purged?
>> So the madvise interface is insufficient to provide the semantics
>> needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the
>> NONVOLATILE call, we have to atomically unmark the volatility status of
>> the byte range and provide the purge status, which informs the caller if
>> any of the data in the specified range was discarded (and thus needs to
>> be regenerated).
>>
>> The problem is that by clearing the range, we may need to allocate
>> memory (possibly by splitting in an existing range segment into two),
>> which possibly could fail. Unfortunately this could happen after we've
>> modified the volatile state of part of that range.  At this point we
>> can't just fail, because we've modified state and we also need to return
>> the purge status of the modified state.
> munmap() can theoretically fail for the same reason (splitting has to
> allocate a new vma) but it's not even documented.  The allocator does
> not fail allocations of that order.
>
> I'm not sure this is good enough, but to me it sounds a bit overkill
> to design a new system call around a non-existent problem.

I still think it's a problematic design issue. With munmap, I think
re-calling on failure should be fine. But with _NONVOLATILE we could
possibly lose the purge status on a second call (for instance if only
the first page of memory was purged, but we errored out mid-call w/
ENOMEM, on the second call it will seem like the range was successfully
set non-volatile with no memory purged).

And even if the current allocator never ever fails, I worry at some
point in the future that rule might change and then we'd have a broken
interface.

>>>> o Not bind with vma split/merge logic to prevent mmap_sem cost and
>>>> o Not bind with vma split/merge logic to avoid vm_area_struct memory
>>>>   footprint.
>>> VMAs are there to track attributes of memory ranges.  Duplicating
>>> large parts of their functionality and co-maintaining both structures
>>> on create, destroy, split, and merge means duplicate code and complex
>>> interactions.
>>>
>>> 1. You need to define semantics and coordinate what happens when the
>>>    vma underlying a volatile range changes.
>>>
>>>    Either you have to strictly co-maintain both range objects, or you
>>>    have weird behavior like volatility outliving a vma and then applying
>>>    to a separate vma created in its place.
>> So indeed this is a difficult problem!  My initial approach is simply
>> when any new mapping is made, we clear the volatility of the affected
>> process memory. Admittedly this has extra overhead and Minchan has an
>> alternative here (which I'm not totally sold on yet, but may be ok).
>> I'm almost convinced that for anonymous volatility, storing the
>> volatility in the vma would be ok, but Minchan is worried about the
>> performance overhead of the required locking for manipulating the vmas.
>>
>> For file volatility, this is more complicated, because since the
>> volatility is shared, the ranges have to be tracked against the
>> address_space structure, and can't be stored in per-process vmas. So
>> this is partially why we've kept range trees hanging off of the mm and
>> address_spaces structures, since it allows the range manipulation logic
>> to be shared in both cases.
> The fs people probably have not noticed yet what you've done to struct
> address_space / struct inode ;-) I doubt that this is mergeable in its
> current form, so we have to think about a separate mechanism for shmem
> page ranges either way.

Yea. But given the semantics will likely be *very* similar, it seems
strange to try to force separate mechanisms.

That said, in an earlier implementation I stored the range tree in a
hash so we wouldn't have to add anything to the address_space structure.
But for now I want to make it clear that the ranges are tied to the
address space (and it gives the fs folks something to notice ;).

>>>    Userspace won't get this right, and even in the kernel this is
>>>    error prone and adds a lot to the complexity of vma management.
>> Not sure exactly I understand what you mean by "userspace won't get this
>> right" ?
> I meant, userspace being responsible for keeping vranges coherent with
> its mmap and munmap operations, instead of the kernel doing it.
>
>>> 2. If page reclaim discards a page from the upper end of a range,
>>>    you mark the whole range as purged.  If the user later marks the
>>>    lower half of the range as non-volatile, the syscall will report
>>>    purged=1 even though all requested pages are still there.
>> To me
Re: [PATCH v10 00/16] Volatile Ranges v10
On Thu, Jan 30, 2014 at 05:27:18PM -0800, John Stultz wrote: On 01/29/2014 10:30 AM, Johannes Weiner wrote: On Tue, Jan 28, 2014 at 05:43:54PM -0800, John Stultz wrote: On 01/28/2014 04:03 PM, Johannes Weiner wrote: On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote: o Syscall interface Why do we need another syscall for this? Can't we extend madvise to take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something in the range was purged? So the madvise interface is insufficient to provide the semantics needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the NONVOLATILE call, we have to atomically unmark the volatility status of the byte range and provide the purge status, which informs the caller if any of the data in the specified range was discarded (and thus needs to be regenerated). The problem is that by clearing the range, we may need to allocate memory (possibly by splitting in an existing range segment into two), which possibly could fail. Unfortunately this could happen after we've modified the volatile state of part of that range. At this point we can't just fail, because we've modified state and we also need to return the purge status of the modified state. munmap() can theoretically fail for the same reason (splitting has to allocate a new vma) but it's not even documented. The allocator does not fail allocations of that order. I'm not sure this is good enough, but to me it sounds a bit overkill to design a new system call around a non-existent problem. I still think its problematic design issue. With munmap, I think re-calling on failure should be fine. But with _NONVOLATILE we could possibly lose the purge status on a second call (for instance if only the first page of memory was purged, but we errored out mid-call w/ ENOMEM, on the second call it will seem like the range was successfully set non-volatile with no memory purged). 
And even if the current allocator never ever fails, I worry at some point in the future that rule might change and then we'd have a broken interface. Fair enough, we don't have to paint ourselves into a corner. 2. If page reclaim discards a page from the upper end of a a range, you mark the whole range as purged. If the user later marks the lower half of the range as non-volatile, the syscall will report purged=1 even though all requested pages are still there. To me this aspect is a non-ideal but acceptable result of the usage pattern. Semantically, the hard rule would be we never report non-purged if pages in a range were purged. Reporting purged when pages technically weren't is not optimal but acceptable side effect of unmarking a sub-range. And could be avoided by applications marking and unmarking objects consistently. The only way to make these semantics clean is either a) have vrange() return a range ID so that only full ranges can later be marked non-volatile, or b) remember individual page purges so that sub-range changes can properly report them I don't like a) much because it's somewhat arbitrarily more restrictive than madvise, mprotect, mmap/munmap etc. Agreed on A. And for b), the straight-forward solution would be to put purge-cookies into the page tables to properly report purges in subrange changes, but that would be even more coordination between vmas, page tables, and the ad-hoc vranges. And for B this would cause way too much overhead for the mark/unmark operations, which have to be lightweight. Yes, and allocators/message passers truly don't need this because at the time they set a region to volatile the contents are invalidated and the non-volatile declaration doesn't give a hoot if content has been destroyed. But caches certainly would have to know if they should regenerate the contents. And bigger areas should be using huge pages, so we'd check in 2MB steps. Is this really more expensive than regenerating the contents on a false positive? 
So you make a good argument. I'd counter that the false-positives are only caused when unmarking subranges of larger marked volatile range, and for use cases that would care about regenerating the contents, that's not a likely useage model (as they're probably going to be marking objects in memory volatile/nonvolatile, not just arbitrary ranges of pages). I can imagine that applications have continuous areas of same-sized objects and want to mark a whole range of them volatile in one go, then later come back for individual objects. Otherwise we'd require N adjacent objects to be marked individually through N syscalls to create N separate internal ranges, or they'd get strange and unexpected results. I'm agreeing with you about what's the most likely and common usecase, but it shouldn't get too weird around the edges. MADV_NONVOLATILE and
Re: [PATCH v10 00/16] Volatile Ranges v10
On Tue, Jan 28, 2014 at 05:43:54PM -0800, John Stultz wrote: > On 01/28/2014 04:03 PM, Johannes Weiner wrote: > > On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote: > >> o Syscall interface > > Why do we need another syscall for this? Can't we extend madvise to > > take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something > > in the range was purged? > > So the madvise interface is insufficient to provide the semantics > needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the > NONVOLATILE call, we have to atomically unmark the volatility status of > the byte range and provide the purge status, which informs the caller if > any of the data in the specified range was discarded (and thus needs to > be regenerated). > > The problem is that by clearing the range, we may need to allocate > memory (possibly by splitting in an existing range segment into two), > which possibly could fail. Unfortunately this could happen after we've > modified the volatile state of part of that range. At this point we > can't just fail, because we've modified state and we also need to return > the purge status of the modified state. munmap() can theoretically fail for the same reason (splitting has to allocate a new vma) but it's not even documented. The allocator does not fail allocations of that order. I'm not sure this is good enough, but to me it sounds a bit overkill to design a new system call around a non-existent problem. > >> o Not bind with vma split/merge logic to prevent mmap_sem cost and > >> o Not bind with vma split/merge logic to avoid vm_area_struct memory > >> footprint. > > VMAs are there to track attributes of memory ranges. Duplicating > > large parts of their functionality and co-maintaining both structures > > on create, destroy, split, and merge means duplicate code and complex > > interactions. > > > > 1. You need to define semantics and coordinate what happens when the > >vma underlying a volatile range changes. 
> > > >Either you have to strictly co-maintain both range objects, or you > >have weird behavior like volatily outliving a vma and then applying > >to a separate vma created in its place. > > So indeed this is a difficult problem! My initial approach is simply > when any new mapping is made, we clear the volatility of the affected > process memory. Admittedly this has extra overhead and Minchan has an > alternative here (which I'm not totally sold on yet, but may be ok). > I'm almost convinced that for anonymous volatility, storing the > volatility in the vma would be ok, but Minchan is worried about the > performance overhead of the required locking for manipulating the vmas. > > For file volatility, this is more complicated, because since the > volatility is shared, the ranges have to be tracked against the > address_space structure, and can't be stored in per-process vmas. So > this is partially why we've kept range trees hanging off of the mm and > address_spaces structures, since it allows the range manipulation logic > to be shared in both cases. The fs people probably have not noticed yet what you've done to struct address_space / struct inode ;-) I doubt that this is mergeable in its current form, so we have to think about a separate mechanism for shmem page ranges either way. > >Userspace won't get this right, and even in the kernel this is > >error prone and adds a lot to the complexity of vma management. > Not sure exactly I understand what you mean by "userspace won't get this > right" ? I meant, userspace being responsible for keeping vranges coherent with its mmap and munmap operations, instead of the kernel doing it. > > 2. If page reclaim discards a page from the upper end of a a range, > >you mark the whole range as purged. If the user later marks the > >lower half of the range as non-volatile, the syscall will report > >purged=1 even though all requested pages are still there. 
> > To me this aspect is a non-ideal but acceptable result of the usage pattern. > > Semantically, the hard rule would be we never report non-purged if pages > in a range were purged. Reporting purged when pages technically weren't > is not optimal but acceptable side effect of unmarking a sub-range. And > could be avoided by applications marking and unmarking objects consistently. > > > >The only way to make these semantics clean is either > > > > a) have vrange() return a range ID so that only full ranges can > > later be marked non-volatile, or > > > > b) remember individual page purges so that sub-range changes can > > properly report them > > > >I don't like a) much because it's somewhat arbitrarily more > >restrictive than madvise, mprotect, mmap/munmap etc. > Agreed on A. > > > And for b), > >the straight-forward solution would be to put purge-cookies into > >the page tables to properly report purges in subrange changes, but > >that would be even more
Re: [PATCH v10 00/16] Volatile Ranges v10
Hi Hannes, Interesting timing: I posted this patchset on New Year's Day and received an in-depth design review on Lunar New Year's Day. :) It's almost a 0-day review. :) On Tue, Jan 28, 2014 at 07:03:59PM -0500, Johannes Weiner wrote: > Hello Minchan, > > On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote: > > Hey all, > > > > Happy New Year! > > > > I know it's bad timing to send this unfamiliar large patchset for > > review but hope there are some guys with freshed-brain in new year > > all over the world. :) > > And most important thing is that before I dive into lots of testing, > > I'd like to make an agreement on design issues and others > > > > o Syscall interface > > Why do we need another syscall for this? Can't we extend madvise to Yes, I should have written down the reason. Early versions of this patchset used madvise with VMA handling, but performance on the ebizzy workload was terrible because merging and splitting VMAs takes the mmap_sem write-side lock. It was even worse than before, so I gave up on the VMA approach. You can see the difference here: https://lkml.org/lkml/2013/10/8/63 It might not be a good decision, and someone might say that if mmap_sem is really the headache, it should be fixed first, but as you know well, that is never a simple problem. If a better idea or a final decision comes up (e.g., let's hold off until someone fixes mmap_sem scalability), I can follow that. > take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something > in the range was purged? In that case, -ENOMEM would have two meanings: "purged" and "out of memory, so the system call failed partway through". That could become a problem later, so we would need to return a value indicating how many bytes were processed so far, which means we need an additional out parameter. But yes, we can solve it by modifying the semantics and behavior (e.g., as you said below, we could simply unmark volatile successfully if the user passes an (offset, len) consistent with the marked volatile ranges).
(IOW, if we give up the overlapping/subrange marking/unmarking use case, which I expect would simplify the code further.) That use case is a request from John, so if he is okay with it, I have no problem. There was another reason that made reusing madvise hard: full purging vs. partial purging. Someone who has to regenerate the whole object anyway, regardless of fault/alloc/zeroing cost (and many people wanted this rather than partial), would want VOLATILE_RANGE_FULL purging, while others (actually, only me) like VOLATILE_RANGE_PARTIAL for allocators (i.e., vrange-anon) because it is fair to every process from the allocator's point of view (fault + alloc + zeroing overhead). It's not implemented yet in this patchset, but I thought it was worth discussing, and if we want it, the current madvise isn't enough. > > > o Not bind with vma split/merge logic to prevent mmap_sem cost and > > o Not bind with vma split/merge logic to avoid vm_area_struct memory > > footprint. > > VMAs are there to track attributes of memory ranges. Duplicating > large parts of their functionality and co-maintaining both structures > on create, destroy, split, and merge means duplicate code and complex > interactions. > > 1. You need to define semantics and coordinate what happens when the >vma underlying a volatile range changes. > >Either you have to strictly co-maintain both range objects, or you >have weird behavior like volatility outliving a vma and then applying >to a separate vma created in its place. > >Userspace won't get this right, and even in the kernel this is >error prone and adds a lot to the complexity of vma management. The current semantics are as follows: the VMA handling logic in mm doesn't need to know about vrange handling, because vrange's internal logic always checks the validity of the VMA. The only thing the VMA logic has to do is clear old volatile ranges when a new VMA is created. (See [PATCH v10 02/16] vrange: Clear volatility on new mmaps.) Actually, I don't like that idea and suggested the following instead:
https://git.kernel.org/cgit/linux/kernel/git/minchan/linux.git/commit/?h=vrange-working=821f58333b381fd88ee7f37fd9c472949756c74e But John didn't like it. If VMA size really matters, maybe we can embed the flag in some existing field of the vma (e.g., the LSB of vm_file?). Anyway, what I want to say is that co-maintaining vma and vrange doesn't seem that bad. > > 2. If page reclaim discards a page from the upper end of a range, >you mark the whole range as purged. If the user later marks the >lower half of the range as non-volatile, the syscall will report >purged=1 even though all requested pages are still there. True. The basic assumption is that the user has one range per object, but we give the user the flexibility to handle subranges of a volatile range, so it might report a false positive, as you said. In that case, the user can call mincore(2) if he wants accuracy, so he keeps the flexibility but loses a bit of performance. It's a tradeoff, IMO. > >The only way to make these semantics clean is either > > a) have vrange() return a
Re: [PATCH v10 00/16] Volatile Ranges v10
On 01/28/2014 04:03 PM, Johannes Weiner wrote: > Hello Minchan, > > On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote: >> Hey all, >> >> Happy New Year! >> >> I know it's bad timing to send this unfamiliar large patchset for >> review but hope there are some guys with freshed-brain in new year >> all over the world. :) >> And most important thing is that before I dive into lots of testing, >> I'd like to make an agreement on design issues and others >> >> o Syscall interface > Why do we need another syscall for this? Can't we extend madvise to > take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something > in the range was purged? So the madvise interface is insufficient to provide the semantics needed. Not so much for MADV_VOLATILE, but for MADV_NONVOLATILE. For the NONVOLATILE call, we have to atomically unmark the volatility status of the byte range and provide the purge status, which informs the caller if any of the data in the specified range was discarded (and thus needs to be regenerated). The problem is that by clearing the range, we may need to allocate memory (possibly by splitting an existing range segment into two), which possibly could fail. Unfortunately this could happen after we've modified the volatile state of part of that range. At this point we can't just fail, because we've modified state and we also need to return the purge status of the modified state. Thus we seem to need a write-like interface, which returns the number of bytes successfully manipulated. But we also have to return the purge state, which we currently do via an argument pointer. hpa suggested creating something like an madvise2 interface which would provide the needed interface change, but would be a shared interface for the new flags as well as the old (possibly allowing various flags to be combined).
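[Editor's sketch] The partial-failure argument above can be illustrated with a toy userspace model in Python. This is not code from the patchset: `ToyRange`, `unmark_errno_style`, `unmark_write_like`, and the `fail_after` failure injection are all hypothetical names, and the model assumes that a page's purge status is consumed once the page is unmarked. Under those assumptions, an errno-only return loses the purge status on retry, while a write-like return (bytes done plus purge flag) lets the caller accumulate it.

```python
PAGE = 4096  # assumed page size for the toy model


class ToyRange:
    """Toy model: per-page volatile/purged flags for one mapping."""

    def __init__(self, npages):
        self.volatile = [False] * npages
        self.purged = [False] * npages

    def mark_volatile(self, start, n):
        for i in range(start, start + n):
            self.volatile[i] = True

    def reclaim(self, page):
        # A page can only be discarded while it is volatile.
        if self.volatile[page]:
            self.purged[page] = True

    def _unmark_pages(self, start, n, fail_after):
        """Unmark up to n pages. Stops after fail_after pages to simulate
        an allocation failure mid-call. Returns (pages_done, purged)."""
        done, purged = 0, False
        for i in range(start, start + n):
            if fail_after is not None and done == fail_after:
                break
            if self.volatile[i]:
                purged |= self.purged[i]
                self.volatile[i] = False
                self.purged[i] = False  # assumption: status is consumed here
            done += 1
        return done, purged

    def unmark_errno_style(self, start, n, fail_after=None):
        done, purged = self._unmark_pages(start, n, fail_after)
        if done < n:
            raise MemoryError("ENOMEM")  # partial purge status is dropped
        return purged

    def unmark_write_like(self, start, n, fail_after=None):
        done, purged = self._unmark_pages(start, n, fail_after)
        return done * PAGE, purged  # caller sees progress and purge status


def demo_errno():
    r = ToyRange(4)
    r.mark_volatile(0, 4)
    r.reclaim(0)  # only the first page is purged
    try:
        r.unmark_errno_style(0, 4, fail_after=2)
    except MemoryError:
        pass
    return r.unmark_errno_style(0, 4)  # retry after the failure


def demo_write_like():
    r = ToyRange(4)
    r.mark_volatile(0, 4)
    r.reclaim(0)
    nbytes, purged_first = r.unmark_write_like(0, 4, fail_after=2)
    done = nbytes // PAGE
    _, purged_rest = r.unmark_write_like(done, 4 - done)  # continue
    return purged_first or purged_rest
```

In this model `demo_errno()` reports no purge on the retry even though page 0 was discarded, whereas `demo_write_like()` still reports it, because the caller combines the flags from both calls.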
I'm fine changing it (the interface has changed a number of times already), but we really haven't seen much in the way of a deeper review, so the current vrange syscall is mostly a placeholder to demonstrate the functionality and hopefully spur discussion on the deeper semantics of how volatile ranges should work. >> o Not bind with vma split/merge logic to prevent mmap_sem cost and >> o Not bind with vma split/merge logic to avoid vm_area_struct memory >> footprint. > VMAs are there to track attributes of memory ranges. Duplicating > large parts of their functionality and co-maintaining both structures > on create, destroy, split, and merge means duplicate code and complex > interactions. > > 1. You need to define semantics and coordinate what happens when the >vma underlying a volatile range changes. > >Either you have to strictly co-maintain both range objects, or you >have weird behavior like volatily outliving a vma and then applying >to a separate vma created in its place. So indeed this is a difficult problem! My initial approach is simply when any new mapping is made, we clear the volatility of the affected process memory. Admittedly this has extra overhead and Minchan has an alternative here (which I'm not totally sold on yet, but may be ok). I'm almost convinced that for anonymous volatility, storing the volatility in the vma would be ok, but Minchan is worried about the performance overhead of the required locking for manipulating the vmas. For file volatility, this is more complicated, because since the volatility is shared, the ranges have to be tracked against the address_space structure, and can't be stored in per-process vmas. So this is partially why we've kept range trees hanging off of the mm and address_spaces structures, since it allows the range manipulation logic to be shared in both cases. >Userspace won't get this right, and even in the kernel this is >error prone and adds a lot to the complexity of vma management. 
Not sure exactly I understand what you mean by "userspace won't get this right" ? > > 2. If page reclaim discards a page from the upper end of a a range, >you mark the whole range as purged. If the user later marks the >lower half of the range as non-volatile, the syscall will report >purged=1 even though all requested pages are still there. To me this aspect is a non-ideal but acceptable result of the usage pattern. Semantically, the hard rule would be we never report non-purged if pages in a range were purged. Reporting purged when pages technically weren't is not optimal but acceptable side effect of unmarking a sub-range. And could be avoided by applications marking and unmarking objects consistently. >The only way to make these semantics clean is either > > a) have vrange() return a range ID so that only full ranges can > later be marked non-volatile, or > > b) remember individual page purges so that sub-range changes can > properly report them > >I don't like a) much because it's somewhat arbitrarily more >restrictive than madvise, mprotect, mmap/munmap etc. Agreed
Re: [PATCH v10 00/16] Volatile Ranges v10
Hello Minchan,

On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> Hey all,
>
> Happy New Year!
>
> I know it's bad timing to send this unfamiliar large patchset for
> review but hope there are some guys with freshed-brain in new year
> all over the world. :)
> And most important thing is that before I dive into lots of testing,
> I'd like to make an agreement on design issues and others
>
> o Syscall interface

Why do we need another syscall for this?  Can't we extend madvise to
take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
in the range was purged?

> o Not bind with vma split/merge logic to prevent mmap_sem cost and
> o Not bind with vma split/merge logic to avoid vm_area_struct memory
>   footprint.

VMAs are there to track attributes of memory ranges.  Duplicating
large parts of their functionality and co-maintaining both structures
on create, destroy, split, and merge means duplicate code and complex
interactions.

1. You need to define semantics and coordinate what happens when the
   vma underlying a volatile range changes.

   Either you have to strictly co-maintain both range objects, or you
   have weird behavior like volatility outliving a vma and then applying
   to a separate vma created in its place.

   Userspace won't get this right, and even in the kernel this is
   error prone and adds a lot to the complexity of vma management.

2. If page reclaim discards a page from the upper end of a range,
   you mark the whole range as purged.  If the user later marks the
   lower half of the range as non-volatile, the syscall will report
   purged=1 even though all requested pages are still there.

   The only way to make these semantics clean is either

     a) have vrange() return a range ID so that only full ranges can
        later be marked non-volatile, or

     b) remember individual page purges so that sub-range changes can
        properly report them

   I don't like a) much because it's somewhat arbitrarily more
   restrictive than madvise, mprotect, mmap/munmap etc.  And for b),
   the straight-forward solution would be to put purge-cookies into
   the page tables to properly report purges in subrange changes, but
   that would be even more coordination between vmas, page tables, and
   the ad-hoc vranges.

3. Page reclaim usually happens on individual pages until an
   allocation can be satisfied, but the shrinker purges entire ranges.

   Should it really take out an entire 1G volatile range even though 4
   pages would have been enough to satisfy an allocation?  Sure, we
   assume a range represents a single "object" and userspace would
   have to regenerate the whole thing with only one page missing, but
   there is still a massive difference in page frees, faults, and
   allocations.

There needs to be a *really* good argument why VMAs are not enough for
this purpose.  I would really like to see anon volatility implemented
as a VMA attribute, and have regular reclaim decide based on rmap of
individual pages whether it needs to swap or purge.  Something like
this:

MADV_VOLATILE:
  split vma if necessary
  set VM_VOLATILE

MADV_NONVOLATILE:
  clear VM_VOLATILE
  merge vma if possible
  pte walk to check for pmd_purged()/pte_purged()
  return any_purged

shrink_page_list():
  if PageAnon:
    if try_to_purge_anon():
      page_lock_anon_vma_read()
      anon_vma_interval_tree_foreach:
        if vma->vm_flags & VM_VOLATILE:
          lock page table
          unmap page
          set_pmd_purged() / set_pte_purged()
          unlock page table
      page_unlock_anon_vma_read()
  ...
  try to reclaim

> o Purging logic - when we trigger purging volatile pages to prevent
>   working set and stop to prevent too excessive purging of volatile
>   pages
> o How to test
>   Currently, we have a patched jemalloc allocator by Jason's help
>   although it's not perfect and more rooms to be enhanced but IMO,
>   it's enough to prove vrange-anonymous. The problem is that
>   lack of benchmark for testing vrange-file side. I hope that
>   Mozilla folks can help.
>
> So its been a while since the last release of the volatile ranges
> patches, again. I and John have been busy with other things.
> Still, we have been slowly chipping away at issues and differences
> trying to get a patchset that we both agree on.
>
> There's still a few issues, but we figured any further polishing of
> the patch series in private would be unproductive and it would be much
> better to send the patches out for review and comment and get some wider
> opinions.
>
> You could get full patchset by git
>
> git clone -b vrange-v10-rc5 --single-branch
>   git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
>
> In v10, there are some notable changes following as
>
> Whats new in v10:
> * Fix several bugs and build break
> * Add shmem_purge_page to correct purging shmem/tmpfs
> * Replace slab shrinker with direct hooked reclaim path
> * Optimize pte scanning by caching previous place
> * Reorder patch and tidy up Cc-list
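The per-page proposal in this review (a volatile attribute on the range, purge markers left in the page tables, and reclaim that stops as soon as enough pages are freed) can be sketched as a small userspace model in C. Everything here is an illustrative assumption: `struct model_mm`, the `MODEL_*` names and the three functions do not exist in the kernel or the patchset, and a flat array stands in for both the VMA flag and the page table entries. Clearing the marker when the range is unmarked is also just a choice made for this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NPAGES 16

enum model_pte { MODEL_PRESENT, MODEL_PURGED };

struct model_mm {
    bool volatile_page[MODEL_NPAGES];  /* stands in for VM_VOLATILE on the vma */
    enum model_pte pte[MODEL_NPAGES];  /* stands in for pte_purged() markers */
};

/* MADV_VOLATILE analogue: flag every page in [start, end). */
static void model_mark_volatile(struct model_mm *mm, size_t start, size_t end)
{
    assert(start <= end && end <= MODEL_NPAGES);
    for (size_t i = start; i < end; i++)
        mm->volatile_page[i] = true;
}

/* Reclaim analogue: purge volatile pages one by one until 'want' are freed. */
static size_t model_reclaim(struct model_mm *mm, size_t want)
{
    size_t freed = 0;
    for (size_t i = 0; i < MODEL_NPAGES && freed < want; i++) {
        if (mm->volatile_page[i] && mm->pte[i] == MODEL_PRESENT) {
            mm->pte[i] = MODEL_PURGED;
            freed++;
        }
    }
    return freed;
}

/*
 * MADV_NONVOLATILE analogue: clear the flag, walk the per-page markers
 * and report whether any page in [start, end) was purged.
 */
static bool model_mark_nonvolatile(struct model_mm *mm, size_t start, size_t end)
{
    bool any_purged = false;
    assert(start <= end && end <= MODEL_NPAGES);
    for (size_t i = start; i < end; i++) {
        mm->volatile_page[i] = false;
        if (mm->pte[i] == MODEL_PURGED) {
            any_purged = true;
            mm->pte[i] = MODEL_PRESENT;  /* model choice: marker is consumed */
        }
    }
    return any_purged;
}
```

In this model, reclaiming two pages from an eight-page volatile range purges only two pages, and unmarking the untouched half reports no purge, which is the behaviour points 2 b) and 3 ask for. The cost the reply messages worry about (VMA split/merge under mmap_sem) is not modelled here.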
Re: [PATCH v10 00/16] Volatile Ranges v10
On 01/28/2014 04:03 PM, Johannes Weiner wrote: Hello Minchan, On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote: Hey all, Happy New Year! I know it's bad timing to send this unfamiliar large patchset for review but hope there are some guys with freshed-brain in new year all over the world. :) And most important thing is that before I dive into lots of testing, I'd like to make an agreement on design issues and others o Syscall interface Why do we need another syscall for this? Can't we extend madvise to take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something in the range was purged? So the madvise interface is insufficient to provide the semantics needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the NONVOLATILE call, we have to atomically unmark the volatility status of the byte range and provide the purge status, which informs the caller if any of the data in the specified range was discarded (and thus needs to be regenerated). The problem is that by clearing the range, we may need to allocate memory (possibly by splitting in an existing range segment into two), which possibly could fail. Unfortunately this could happen after we've modified the volatile state of part of that range. At this point we can't just fail, because we've modified state and we also need to return the purge status of the modified state. Thus we seem to need a write-like interface, which returns the number of bytes successfully manipulated. But we also have to return the purge state, which we currently do via a argument pointer. hpa suggested to create something like an madvise2 interface which would provide the needed interface change, but would be a shared interface for the new flags as well as the old (possibly allowing various flags to be combined). 
I'm fine changing it (the interface has changed a number of times already), but we really haven't seen much in the way of a deeper review, so the current vrange syscall is mostly a placeholder to demonstrate the functionality and hopefully spur discussion on the deeper semantics of how volatile ranges should work. o Not bind with vma split/merge logic to prevent mmap_sem cost and o Not bind with vma split/merge logic to avoid vm_area_struct memory footprint. VMAs are there to track attributes of memory ranges. Duplicating large parts of their functionality and co-maintaining both structures on create, destroy, split, and merge means duplicate code and complex interactions. 1. You need to define semantics and coordinate what happens when the vma underlying a volatile range changes. Either you have to strictly co-maintain both range objects, or you have weird behavior like volatily outliving a vma and then applying to a separate vma created in its place. So indeed this is a difficult problem! My initial approach is simply when any new mapping is made, we clear the volatility of the affected process memory. Admittedly this has extra overhead and Minchan has an alternative here (which I'm not totally sold on yet, but may be ok). I'm almost convinced that for anonymous volatility, storing the volatility in the vma would be ok, but Minchan is worried about the performance overhead of the required locking for manipulating the vmas. For file volatility, this is more complicated, because since the volatility is shared, the ranges have to be tracked against the address_space structure, and can't be stored in per-process vmas. So this is partially why we've kept range trees hanging off of the mm and address_spaces structures, since it allows the range manipulation logic to be shared in both cases. Userspace won't get this right, and even in the kernel this is error prone and adds a lot to the complexity of vma management. 
Not sure exactly I understand what you mean by userspace won't get this right ? 2. If page reclaim discards a page from the upper end of a a range, you mark the whole range as purged. If the user later marks the lower half of the range as non-volatile, the syscall will report purged=1 even though all requested pages are still there. To me this aspect is a non-ideal but acceptable result of the usage pattern. Semantically, the hard rule would be we never report non-purged if pages in a range were purged. Reporting purged when pages technically weren't is not optimal but acceptable side effect of unmarking a sub-range. And could be avoided by applications marking and unmarking objects consistently. The only way to make these semantics clean is either a) have vrange() return a range ID so that only full ranges can later be marked non-volatile, or b) remember individual page purges so that sub-range changes can properly report them I don't like a) much because it's somewhat arbitrarily more restrictive than madvise, mprotect, mmap/munmap etc. Agreed on A. And for b), the straight-forward solution would be
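The "write-like" interface described in this message (return the number of bytes processed, report the purge state through a pointer argument) can be illustrated with a userspace model in C. This is a sketch under stated assumptions, not the vrange syscall: `struct model_state`, `model_nonvolatile()` and the `fail_at_page` knob that simulates an allocation failure part-way through are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define MODEL_PAGE   4096
#define MODEL_NPAGES 8

struct model_state {
    bool is_volatile[MODEL_NPAGES];
    bool purged[MODEL_NPAGES];
    size_t fail_at_page;  /* simulated allocation failure at this page;
                             set to MODEL_NPAGES for "never fails" */
};

/*
 * Unmark [off, off + len) as non-volatile.
 * Returns the number of bytes actually processed (possibly short, like
 * write(2)), or -ENOMEM if nothing could be processed. *purged reports
 * whether any page in the processed part had been discarded.
 */
static ssize_t model_nonvolatile(struct model_state *s, size_t off,
                                 size_t len, int *purged)
{
    assert(off % MODEL_PAGE == 0 && len % MODEL_PAGE == 0);
    assert(off + len <= (size_t)MODEL_NPAGES * MODEL_PAGE);

    size_t first = off / MODEL_PAGE;
    size_t npages = len / MODEL_PAGE;
    size_t done = 0;

    *purged = 0;
    for (size_t i = first; i < first + npages; i++) {
        if (i == s->fail_at_page)
            break;              /* state before this page stays modified */
        if (s->purged[i])
            *purged = 1;
        s->is_volatile[i] = false;
        s->purged[i] = false;
        done++;
    }
    if (done == 0 && npages > 0)
        return -ENOMEM;
    return (ssize_t)(done * MODEL_PAGE);
}
```

A caller would loop on the short return value the way it does for write(2), accumulating the purge status across calls. A plain madvise-style return code cannot carry both the byte count and the purge state, which is the limitation this message points at.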
Re: [PATCH v10 00/16] Volatile Ranges v10
Hi Hannes,

It's interesting timing: I posted this patchset on New Year's Day
and received an in-depth design review on Lunar New Year's Day. :)
It's almost a 0-day review. :)

On Tue, Jan 28, 2014 at 07:03:59PM -0500, Johannes Weiner wrote:
> Hello Minchan,
>
> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> > Hey all,
> >
> > Happy New Year!
> >
> > I know it's bad timing to send this unfamiliar large patchset for
> > review but hope there are some guys with freshed-brain in new year
> > all over the world. :)
> > And most important thing is that before I dive into lots of testing,
> > I'd like to make an agreement on design issues and others
> >
> > o Syscall interface
>
> Why do we need another syscall for this?  Can't we extend madvise to

Yes, I should have written down the reason. Early versions of this
patchset used madvise with VMA handling, but performance on the ebizzy
workload was terrible because of the mmap_sem write-side lock needed
for merging/splitting VMAs. It was even worse than before, so I gave up
the VMA approach.

You can see the difference here:
https://lkml.org/lkml/2013/10/8/63

It might not be a good decision, and someone might say that if mmap_sem
is really a headache, we should fix that first, but as you know well,
that is never a simple problem. If a better idea or a final decision
comes up (e.g., let's hold off until someone fixes mmap_sem
scalability), I can follow that.

> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> in the range was purged?

In that case, -ENOMEM would have two meanings, "Purged" and "Out of
memory, so the system call failed in the middle of processing", and
that could become a problem later. So we need a return value that
indicates how many bytes have succeeded so far, which means we need an
additional out parameter. But yes, we can solve it by modifying the
semantics and behavior (e.g., as you said below, we could simply unmark
volatility successfully if the user passes an (offset, len) consistent
with the marked volatile ranges; IOW, if we give up the
overlapping/subrange marking/unmarking use case, I expect it would make
the code simpler).
That was a request from John, so if he is okay with it, I have no
problem.

And there was another reason that made reusing madvise hard.

Full purging vs. partial purging:
someone who must regenerate the whole object, regardless of
fault/alloc/zeroing cost (and many people wanted this rather than
partial purging), would want VOLATILE_RANGE_FULL purging, while others
(actually, only me) prefer VOLATILE_RANGE_PARTIAL for allocators (i.e.,
vrange-anon), because from the allocator's point of view it is fair to
every process (fault + alloc + zeroing overhead).
It's not implemented yet in this patchset, but I thought it was worth
discussing, and if we want it, the current madvise isn't enough.

> > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> >   footprint.
>
> VMAs are there to track attributes of memory ranges.  Duplicating
> large parts of their functionality and co-maintaining both structures
> on create, destroy, split, and merge means duplicate code and complex
> interactions.
>
> 1. You need to define semantics and coordinate what happens when the
>    vma underlying a volatile range changes.
>
>    Either you have to strictly co-maintain both range objects, or you
>    have weird behavior like volatily outliving a vma and then applying
>    to a separate vma created in its place.
>
>    Userspace won't get this right, and even in the kernel this is
>    error prone and adds a lot to the complexity of vma management.

The current semantics are as follows:
the vma handling logic in mm doesn't need to know about vrange handling,
because vrange's internal logic always checks the validity of the vma.
The only thing the vma logic has to do is clear old volatile ranges
when a new vma is created.
(Look at [PATCH v10 02/16] vrange: Clear volatility on new mmaps)
Actually, I don't like that idea and suggested the following instead:
https://git.kernel.org/cgit/linux/kernel/git/minchan/linux.git/commit/?h=vrange-working&id=821f58333b381fd88ee7f37fd9c472949756c74e

But John didn't like it. I guess if VMA size really matters, maybe we
can embed the flag in some existing field of the vma (e.g., the LSB of
vm_file?).

Anyway, what I want to say is that co-maintaining vma and vrange does
not seem too bad.

> 2. If page reclaim discards a page from the upper end of a a range,
>    you mark the whole range as purged.  If the user later marks the
>    lower half of the range as non-volatile, the syscall will report
>    purged=1 even though all requested pages are still there.

True. The assumption is that, basically, the user should have one range
per object, but we give the user the flexibility to handle subranges of
a volatile range, so it might report a false positive as you said.
In that case, the user can use mincore(2) for accuracy if he wants, so
he keeps the flexibility but loses a bit of performance. It's a
tradeoff, IMO.

>    The only way to make these semantics clean is either
>
>      a) have vrange() return a range ID so that only full ranges can
>         later be marked
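The mincore(2) suggestion above can be sketched in C as follows. mincore(2) is an existing Linux call that reports which pages of a mapping are resident. Whether residency is an exact stand-in for "not purged" under the proposed vrange semantics is an assumption of this sketch, and `count_resident_pages()` is a hypothetical helper name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for mincore() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count how many pages of [addr, addr + len) are resident in memory.
 * addr must be page-aligned. Returns -1 on error.
 */
static long count_resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;

    long resident = -1;
    if (mincore(addr, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;   /* bit 0: page is resident */
    }
    free(vec);
    return resident;
}
```

An application that unmarks only part of an object and gets purged=1 back could call such a helper on the sub-range to tell a real purge from a false positive, at the price of one extra system call per check.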
Re: [PATCH v10 00/16] Volatile Ranges v10
On Mon, Jan 27, 2014 at 05:09:59PM -0800, Taras Glek wrote: > > > John Stultz wrote: > >On 01/27/2014 04:12 PM, Minchan Kim wrote: > >>On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote: > >>>- Your number only claimed the effectiveness anon vrange, but not file > >>>vrange. > >>Yes. It's really problem as I said. > >> From the beginning, John Stultz wanted to promote vrange-file to replace > >>android's ashmem and when I heard usecase of vrange-file, it does make sense > >>to me so that's why I'd like to unify them in a same interface. > >> > >>But the problem is lack of interesting from others and lack of time to > >>test/evaluate it. I'm not an expert of userspace so actually I need a bit > >>help from them who require the feature but at a moment, > >>but I don't know who really want or/and help it. > >> > >>Even, Android folks didn't have any interest on vrange-file. > > > >Just as a correction here. I really don't think this is the case, as > >Android's use definitely relies on file based volatility. It might be > >more fair to say there hasn't been very much discussion from Android > >developers on the particulars of the file volatility semantics (out > >possibly not having any particular objections, or more-likely, being a > >bit too busy to follow the all various theoretical tangents we've > >discussed). > > > >But I'd not want anyone to get the impression that anonymous-only > >volatility would be sufficient for Android's needs. > Mozilla is starting to use android's ashmem for discardable memory > within a single process: > https://bugzilla.mozilla.org/show_bug.cgi?id=748598 . > > Volatile ranges do help with that specific(uncommon?) use of ashmem. Thanks for the info. I'd like to ask a question. Do you prefer fvrange(fd, offset, len) or fadvise(fd, offset, len, advise) inteface rather than current vrange syscall interface for vrange-file? 
I ask because I think it would remove the unnecessary mmap/munmap
syscalls the vrange interface needs, and would also avoid running out
of address space on 32-bit machines.

> 
> For Mozilla sharing memory across processes via ashmem is not a
> nearterm project. It's something that is likely to require
> significant rework. Process-local discardable memory can be
> retrofited in a more straight-forward fashion.
> 
> Taras

-- 
Kind regards,
Minchan Kim
Re: [PATCH v10 00/16] Volatile Ranges v10
On Mon, Jan 27, 2014 at 04:42:27PM -0800, John Stultz wrote: > On 01/27/2014 04:12 PM, Minchan Kim wrote: > > On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote: > >> - Your number only claimed the effectiveness anon vrange, but not file > >> vrange. > > Yes. It's really problem as I said. > > From the beginning, John Stultz wanted to promote vrange-file to replace > > android's ashmem and when I heard usecase of vrange-file, it does make sense > > to me so that's why I'd like to unify them in a same interface. > > > > But the problem is lack of interesting from others and lack of time to > > test/evaluate it. I'm not an expert of userspace so actually I need a bit > > help from them who require the feature but at a moment, > > but I don't know who really want or/and help it. > > > > Even, Android folks didn't have any interest on vrange-file. > > Just as a correction here. I really don't think this is the case, as > Android's use definitely relies on file based volatility. It might be > more fair to say there hasn't been very much discussion from Android > developers on the particulars of the file volatility semantics (out > possibly not having any particular objections, or more-likely, being a > bit too busy to follow the all various theoretical tangents we've > discussed). > > But I'd not want anyone to get the impression that anonymous-only > volatility would be sufficient for Android's needs. Right. Thanks for the correction. > > > (And to further clarify here, since this can be confusing... > shmem/tmpfs-only file volatility *would* be sufficient, despite that > technically being anonymous backed memory. The key issue is we need to > be able to share the volatility between processes.) > > > > So, we might drop vrange-file part in this patchset if it's really headache. > > But let's discuss further because still I believe it's valuable feature to > > keep instead of dropping. 
> 
> If it helps gets interest in reviewing this, I'm ok with deferring
> (tmpfs) file volatility, so folks can get comfortable with anonymous
> volatility. But I worry its too critical a feature to ignore.

Yes. I don't want to drop it without more discussion with a real user
of it, but the problem is that it's very hard to find one who has extra
time to discuss it.

> 
> thanks
> -john

-- 
Kind regards,
Minchan Kim
Re: [PATCH v10 00/16] Volatile Ranges v10
On 01/27/2014 04:12 PM, Minchan Kim wrote: > On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote: >> - Your number only claimed the effectiveness anon vrange, but not file >> vrange. > Yes. It's really problem as I said. > From the beginning, John Stultz wanted to promote vrange-file to replace > android's ashmem and when I heard usecase of vrange-file, it does make sense > to me so that's why I'd like to unify them in a same interface. > > But the problem is lack of interesting from others and lack of time to > test/evaluate it. I'm not an expert of userspace so actually I need a bit > help from them who require the feature but at a moment, > but I don't know who really want or/and help it. > > Even, Android folks didn't have any interest on vrange-file. Just as a correction here. I really don't think this is the case, as Android's use definitely relies on file based volatility. It might be more fair to say there hasn't been very much discussion from Android developers on the particulars of the file volatility semantics (out possibly not having any particular objections, or more-likely, being a bit too busy to follow the all various theoretical tangents we've discussed). But I'd not want anyone to get the impression that anonymous-only volatility would be sufficient for Android's needs. (And to further clarify here, since this can be confusing... shmem/tmpfs-only file volatility *would* be sufficient, despite that technically being anonymous backed memory. The key issue is we need to be able to share the volatility between processes.) > So, we might drop vrange-file part in this patchset if it's really headache. > But let's discuss further because still I believe it's valuable feature to > keep instead of dropping. If it helps gets interest in reviewing this, I'm ok with deferring (tmpfs) file volatility, so folks can get comfortable with anonymous volatility. But I worry its too critical a feature to ignore. 
thanks
-john
Re: [PATCH v10 00/16] Volatile Ranges v10
Hey KOSAKI, On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote: > Hi Minchan, > > > On Thu, Jan 2, 2014 at 2:12 AM, Minchan Kim wrote: > > Hey all, > > > > Happy New Year! > > > > I know it's bad timing to send this unfamiliar large patchset for > > review but hope there are some guys with freshed-brain in new year > > all over the world. :) > > And most important thing is that before I dive into lots of testing, > > I'd like to make an agreement on design issues and others > > > > o Syscall interface > > o Not bind with vma split/merge logic to prevent mmap_sem cost and > > o Not bind with vma split/merge logic to avoid vm_area_struct memory > > footprint. > > o Purging logic - when we trigger purging volatile pages to prevent > > working set and stop to prevent too excessive purging of volatile > > pages > > o How to test > > Currently, we have a patched jemalloc allocator by Jason's help > > although it's not perfect and more rooms to be enhanced but IMO, > > it's enough to prove vrange-anonymous. The problem is that > > lack of benchmark for testing vrange-file side. I hope that > > Mozilla folks can help. > > > > So its been a while since the last release of the volatile ranges > > patches, again. I and John have been busy with other things. > > Still, we have been slowly chipping away at issues and differences > > trying to get a patchset that we both agree on. > > > > There's still a few issues, but we figured any further polishing of > > the patch series in private would be unproductive and it would be much > > better to send the patches out for review and comment and get some wider > > opinions. > > > > You could get full patchset by git > > > > git clone -b vrange-v10-rc5 --single-branch > > git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git > > Brief comments. > > - You should provide jemalloc patch too. Otherwise we cannot I did. :) It seems you missed below in this description. 
You can see it at the following URL in Dhaval's test suite:
https://github.com/volatile-ranges-test/vranges-test/blob/master/0001-Implement-experimental-mvolatile-2-mnovolatile-2-sup.patch

Dhaval: Please, could you merge the patches John sent into your test
suite? I just pinged you.

But KOSAKI, please don't focus on jemalloc's implementation. The point
is not how efficiently jemalloc uses volatile ranges; it is just one
example of how to use volatile ranges. I think volatile ranges could
be really useful for garbage collection in custom allocators (e.g.,
in-memory DB, JVM, Dalvik, v8) as well as in general allocators.

> understand what the your mesurement mean.
> - Your number only claimed the effectiveness anon vrange, but not file vrange.

Yes. It's really a problem, as I said.
From the beginning, John Stultz wanted to promote vrange-file to replace
Android's ashmem, and when I heard the use case for vrange-file, it made
sense to me, so that's why I'd like to unify them in the same interface.

But the problem is a lack of interest from others and a lack of time to
test/evaluate it. I'm not a userspace expert, so I actually need a bit
of help from those who require the feature, but at the moment I don't
know who really wants it and/or can help.

Even the Android folks didn't show any interest in vrange-file.

So, we might drop the vrange-file part of this patchset if it's really
a headache. But let's discuss it further, because I still believe it's
a valuable feature to keep instead of dropping. I want dropping
vrange-file to be really the last resort for making forward progress on
vrange-anon.

> - Still, Nobody likes file vrange. At least nobody said explicitly on
> the list. I don't ack file vrange part until
> I fully convinced Pros/Cons. You need to persuade other MM guys if
> you really think anon vrange is not
> sufficient. (Maybe LSF is the best place)
> - I wrote you need to put a mesurement current implementation vs
> VMA-based implementation at several
> previous iteration. Because You claimed fast, but no number and you
> haven't yet. I guess the reason is

I did. :) Look at the numbers.
https://lkml.org/lkml/2013/10/8/63

The point is we need an mmap_sem's readside lock for vma handling (e.g.,
merge/split), and it's really a bottleneck for ebizzy, where another
thread wants to malloc (i.e., mmap with a new chunk requires mmap_sem's
write-side lock).

Additionally, some users want to handle vranges at fine granularity
(e.g., in the worst case, PAGE_SIZE), so VMA handling would be real
overhead for us.

> you don't have any access to large machine. If so, I'll offer it.
> Plz collaborate with us.

Yes, Yes, Yes. That's what I want, and you're really the proper person
to collaborate with. Please ping me when you're ready. :)

> 
> Unfortunately, I'm very busy and I didn't have a chance to review your
> latest patch yet. But I'll finish it until
> mm summit. And, I'll show you guys how much this patch improve glibc malloc
> too.

Cool! It's really helpful for the work which I believe it's really helpful feature for the Linux so I
Re: [PATCH v10 00/16] Volatile Ranges v10
On 01/27/2014 02:23 PM, KOSAKI Motohiro wrote:
> Hi Minchan,
>
> On Thu, Jan 2, 2014 at 2:12 AM, Minchan Kim wrote:
>> Hey all,
>>
>> Happy New Year!
>>
>> I know it's bad timing to send this unfamiliar, large patchset for
>> review, but I hope there are some guys with fresh brains in the new year
>> all over the world. :)
>> And the most important thing is that before I dive into lots of testing,
>> I'd like to reach an agreement on design issues and others:
>>
>> o Syscall interface
>> o Not bound to vma split/merge logic, to prevent mmap_sem cost
>> o Not bound to vma split/merge logic, to avoid vm_area_struct memory
>>   footprint
>> o Purging logic - when we trigger purging of volatile pages to prevent
>>   working set eviction, and when we stop to prevent excessive purging
>>   of volatile pages
>> o How to test
>>   Currently, we have a patched jemalloc allocator thanks to Jason's help.
>>   It's not perfect and has more room for enhancement, but IMO it's
>>   enough to prove vrange-anonymous. The problem is the lack of a
>>   benchmark for testing the vrange-file side. I hope that Mozilla folks
>>   can help.
>>
>> So it's been a while since the last release of the volatile ranges
>> patches, again. John and I have been busy with other things.
>> Still, we have been slowly chipping away at issues and differences,
>> trying to get a patchset that we both agree on.
>>
>> There are still a few issues, but we figured any further polishing of
>> the patch series in private would be unproductive and it would be much
>> better to send the patches out for review and comment and get some wider
>> opinions.
>>
>> You could get the full patchset by git
>>
>> git clone -b vrange-v10-rc5 --single-branch
>> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
>
> Brief comments.
>
> - You should provide the jemalloc patch too. Otherwise we cannot
>   understand what your measurements mean.
> - Your numbers only show the effectiveness of anon vrange, not file vrange.
> - Still, nobody likes file vrange. At least nobody has said so explicitly
>   on the list. I won't ack the file vrange part until I am fully convinced
>   of the pros/cons. You need to persuade other MM guys if you really think
>   anon vrange is not sufficient. (Maybe LSF is the best place.)

I do agree that the semantics for volatile-ranges on files is more
difficult for folks to grasp (and like after doing so). I've almost
gotten to the point (as I've discussed with Minchan privately) where I'm
willing to hold back on volatile-ranges on files in the short-term just
to see if it helps to get key mm folks to review and comment on the
volatile-ranges on anonymous memory.

That said, I do think volatile ranges on files is an important concept,
and I'd like to make sure we don't design something that can't be used
for files in the future.

Part of the major interest in volatile memory has been from web
browsers. Both Chrome and Firefox are already making use of the
file-based ashmem, where available, in order to have this "discardable
memory" feature. And while the Mozilla developers don't see file based
volatile memory as critical right now for their needs, I can imagine as
they continue to work on multi-process firefox
(http://billmccloskey.wordpress.com/2013/12/05/multiprocess-firefox/)
for performance and security reasons, the need to have memory volatility
shared between processes will become more important.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 00/16] Volatile Ranges v10
Hi Minchan,

On Thu, Jan 2, 2014 at 2:12 AM, Minchan Kim wrote:
> Hey all,
>
> Happy New Year!
>
> I know it's bad timing to send this unfamiliar, large patchset for
> review, but I hope there are some guys with fresh brains in the new year
> all over the world. :)
> And the most important thing is that before I dive into lots of testing,
> I'd like to reach an agreement on design issues and others:
>
> o Syscall interface
> o Not bound to vma split/merge logic, to prevent mmap_sem cost
> o Not bound to vma split/merge logic, to avoid vm_area_struct memory
>   footprint
> o Purging logic - when we trigger purging of volatile pages to prevent
>   working set eviction, and when we stop to prevent excessive purging
>   of volatile pages
> o How to test
>   Currently, we have a patched jemalloc allocator thanks to Jason's help.
>   It's not perfect and has more room for enhancement, but IMO it's
>   enough to prove vrange-anonymous. The problem is the lack of a
>   benchmark for testing the vrange-file side. I hope that Mozilla folks
>   can help.
>
> So it's been a while since the last release of the volatile ranges
> patches, again. John and I have been busy with other things.
> Still, we have been slowly chipping away at issues and differences,
> trying to get a patchset that we both agree on.
>
> There are still a few issues, but we figured any further polishing of
> the patch series in private would be unproductive and it would be much
> better to send the patches out for review and comment and get some wider
> opinions.
>
> You could get the full patchset by git
>
> git clone -b vrange-v10-rc5 --single-branch
> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git

Brief comments.

- You should provide the jemalloc patch too. Otherwise we cannot
  understand what your measurements mean.
- Your numbers only show the effectiveness of anon vrange, not file vrange.
- Still, nobody likes file vrange. At least nobody has said so explicitly
  on the list. I won't ack the file vrange part until I am fully convinced
  of the pros/cons. You need to persuade other MM guys if you really think
  anon vrange is not sufficient. (Maybe LSF is the best place.)
- I wrote at several previous iterations that you need to provide a
  measurement of the current implementation vs. a VMA-based
  implementation, because you claimed it is fast but gave no numbers, and
  you haven't yet. I guess the reason is you don't have access to a large
  machine. If so, I'll offer one. Plz collaborate with us.

Unfortunately, I'm very busy and I didn't have a chance to review your
latest patch yet. But I'll finish it by the mm summit. And I'll show you
guys how much this patch improves glibc malloc too.

The glibc folks and I agreed to push vrange into glibc malloc.

https://sourceware.org/ml/libc-alpha/2013-12/msg00343.html

Even so, I still dislike some aspects of this patch. I'd like to discuss
them and make better design decisions with you.

Thanks.

> In v10, there are some notable changes, as follows.
>
> What's new in v10:
> * Fix several bugs and build break
> * Add shmem_purge_page to correct purging shmem/tmpfs
> * Replace slab shrinker with direct hooked reclaim path
> * Optimize pte scanning by caching previous place
> * Reorder patch and tidy up Cc-list
> * Rebased on v3.12
> * Add vrange-anon test with jemalloc in Dhaval's test suite
>   - https://github.com/volatile-ranges-test/vranges-test
>   so, you could test any application with vrange-patched jemalloc by
>   LD_PRELOAD, but please keep in mind that it's just a prototype to
>   prove the vrange syscall concept, so it has more room for optimization.
>   So, please do not compare it with another allocator.
>
> What's new in v9:
> * Updated to v3.11
> * Added vrange purging logic to purge anonymous pages on
>   swapless systems
> * Added logic to allocate the vroot structure dynamically
>   to avoid added overhead to mm and address_space structures
> * Lots of minor tweaks, changes and cleanups
>
> Still TODO:
> * Sort out better solution for clearing volatility on new mmaps
>   - Minchan has a different approach here
> * Agreement on the system call interface
> * Better discarding trigger policy to prevent working set eviction
> * Review, Review, Review.. Comment.
> * A ton of testing
>
> Feedback or thoughts here would be particularly helpful!
>
> Also, thanks to Dhaval for his maintaining and vastly improving
> the volatile ranges test suite, which can be found here:
> [1] https://github.com/volatile-ranges-test/vranges-test
>
> These patches can also be pulled from git here:
> git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-v9
>
> We'd really welcome any feedback and comments on the patch series.
Re: [PATCH v10 00/16] Volatile Ranges v10
On 01/27/2014 04:12 PM, Minchan Kim wrote:
> On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
>> - Your numbers only show the effectiveness of anon vrange, not file vrange.
> Yes. It's really a problem, as I said.
> From the beginning, John Stultz wanted to promote vrange-file to replace
> Android's ashmem, and when I heard the use case for vrange-file, it made
> sense to me, so that's why I'd like to unify them in the same interface.
>
> But the problem is a lack of interest from others and a lack of time to
> test/evaluate it. I'm not a userspace expert, so I need a bit of help
> from those who require the feature, but at the moment I don't know who
> really wants it and/or could help with it.
>
> Even Android folks haven't shown any interest in vrange-file.

Just as a correction here. I really don't think this is the case, as
Android's use definitely relies on file based volatility. It might be
more fair to say there hasn't been very much discussion from Android
developers on the particulars of the file volatility semantics (out of
possibly not having any particular objections, or more likely, being a
bit too busy to follow all the various theoretical tangents we've
discussed).

But I'd not want anyone to get the impression that anonymous-only
volatility would be sufficient for Android's needs.

(And to further clarify here, since this can be confusing...
shmem/tmpfs-only file volatility *would* be sufficient, despite that
technically being anonymous backed memory. The key issue is we need to
be able to share the volatility between processes.)

> So, we might drop the vrange-file part from this patchset if it's really
> a headache. But let's discuss further, because I still believe it's a
> valuable feature to keep instead of dropping.

If it helps get interest in reviewing this, I'm ok with deferring
(tmpfs) file volatility, so folks can get comfortable with anonymous
volatility. But I worry it's too critical a feature to ignore.

thanks
-john
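John's distinction between anonymous-only volatility and shmem/tmpfs volatility comes down to which mappings more than one process can see. The sketch below illustrates just that sharing property, with no volatility involved, assuming Linux and Python 3. A shared anonymous mapping is shmem-backed on Linux, whereas a private anonymous mapping is copy-on-write after fork(). The function name `child_write_visible_to_parent` is hypothetical.

```python
import mmap
import os


def child_write_visible_to_parent(flags: int) -> bool:
    """Fork a child that writes one byte; report whether the parent sees it."""
    buf = mmap.mmap(-1, mmap.PAGESIZE, flags=flags | mmap.MAP_ANONYMOUS)
    try:
        pid = os.fork()
        if pid == 0:
            # Child: modify the first byte, then exit without running
            # the parent's cleanup code.
            buf[0] = 1
            os._exit(0)
        os.waitpid(pid, 0)
        # Parent: a shared (shmem-backed) mapping shows the child's write;
        # a private mapping was copied on write, so the parent still sees 0.
        return buf[0] == 1
    finally:
        buf.close()


if __name__ == "__main__":
    print("MAP_SHARED :", child_write_visible_to_parent(mmap.MAP_SHARED))
    print("MAP_PRIVATE:", child_write_visible_to_parent(mmap.MAP_PRIVATE))
```

Only the shared mapping carries state across the process boundary. That is why volatility limited to private anonymous memory could not replace ashmem-style sharing, while shmem/tmpfs-backed volatility could.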
Re: [PATCH v10 00/16] Volatile Ranges v10
On Mon, Jan 27, 2014 at 04:42:27PM -0800, John Stultz wrote:
> On 01/27/2014 04:12 PM, Minchan Kim wrote:
> > On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
> >> - Your numbers only show the effectiveness of anon vrange, not file vrange.
> > Yes. It's really a problem, as I said.
> > From the beginning, John Stultz wanted to promote vrange-file to replace
> > Android's ashmem, and when I heard the use case for vrange-file, it made
> > sense to me, so that's why I'd like to unify them in the same interface.
> >
> > But the problem is a lack of interest from others and a lack of time to
> > test/evaluate it. I'm not a userspace expert, so I need a bit of help
> > from those who require the feature, but at the moment I don't know who
> > really wants it and/or could help with it.
> >
> > Even Android folks haven't shown any interest in vrange-file.
>
> Just as a correction here. I really don't think this is the case, as
> Android's use definitely relies on file based volatility. It might be
> more fair to say there hasn't been very much discussion from Android
> developers on the particulars of the file volatility semantics (out of
> possibly not having any particular objections, or more likely, being a
> bit too busy to follow all the various theoretical tangents we've
> discussed).
>
> But I'd not want anyone to get the impression that anonymous-only
> volatility would be sufficient for Android's needs.

Right. Thanks for the correction.

> (And to further clarify here, since this can be confusing...
> shmem/tmpfs-only file volatility *would* be sufficient, despite that
> technically being anonymous backed memory. The key issue is we need to
> be able to share the volatility between processes.)
>
> > So, we might drop the vrange-file part from this patchset if it's really
> > a headache. But let's discuss further, because I still believe it's a
> > valuable feature to keep instead of dropping.
>
> If it helps get interest in reviewing this, I'm ok with deferring
> (tmpfs) file volatility, so folks can get comfortable with anonymous
> volatility. But I worry it's too critical a feature to ignore.

Yes. I don't want to drop it without more discussion with its real
users, but the problem is that it's very hard to find one who has extra
time to discuss it.

> thanks
> -john

--
Kind regards,
Minchan Kim
Re: [PATCH v10 00/16] Volatile Ranges v10
On Mon, Jan 27, 2014 at 05:09:59PM -0800, Taras Glek wrote:
> John Stultz wrote:
> > On 01/27/2014 04:12 PM, Minchan Kim wrote:
> > > On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
> > >> - Your numbers only show the effectiveness of anon vrange, not file vrange.
> > > Yes. It's really a problem, as I said.
> > > From the beginning, John Stultz wanted to promote vrange-file to replace
> > > Android's ashmem, and when I heard the use case for vrange-file, it made
> > > sense to me, so that's why I'd like to unify them in the same interface.
> > >
> > > But the problem is a lack of interest from others and a lack of time to
> > > test/evaluate it. I'm not a userspace expert, so I need a bit of help
> > > from those who require the feature, but at the moment I don't know who
> > > really wants it and/or could help with it.
> > >
> > > Even Android folks haven't shown any interest in vrange-file.
> >
> > Just as a correction here. I really don't think this is the case, as
> > Android's use definitely relies on file based volatility. It might be
> > more fair to say there hasn't been very much discussion from Android
> > developers on the particulars of the file volatility semantics (out of
> > possibly not having any particular objections, or more likely, being a
> > bit too busy to follow all the various theoretical tangents we've
> > discussed).
> >
> > But I'd not want anyone to get the impression that anonymous-only
> > volatility would be sufficient for Android's needs.
>
> Mozilla is starting to use Android's ashmem for discardable memory
> within a single process:
> https://bugzilla.mozilla.org/show_bug.cgi?id=748598 . Volatile ranges do
> help with that specific (uncommon?) use of ashmem.

Thanks for the info.

I'd like to ask a question.
Do you prefer an fvrange(fd, offset, len) or fadvise(fd, offset, len,
advise) interface over the current vrange syscall interface for
vrange-file?

I ask because I think it would remove the unnecessary mmap/munmap
syscalls for the vrange interface, as well as avoid running out of
address space on 32-bit machines.

> For Mozilla, sharing memory across processes via ashmem is not a
> near-term project. It's something that is likely to require significant
> rework. Process-local discardable memory can be retrofitted in a more
> straightforward fashion.
>
> Taras

--
Kind regards,
Minchan Kim
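The fd-based interface Minchan asks about has the same call shape as the existing posix_fadvise(fd, offset, len, advice): the range is named by file descriptor and offset, so nothing has to be mapped into the address space. The following is a minimal sketch assuming Linux and Python 3. POSIX_FADV_DONTNEED is an existing advice value used only to show the call shape; no volatile advice value exists in mainline, and the function name `advise_range_without_mapping` is hypothetical.

```python
import os
import tempfile


def advise_range_without_mapping(offset: int, length: int) -> int:
    """Give advice on a byte range of a file using only its descriptor.

    Returns the file size afterwards to show that the data itself is
    untouched by the advice call.
    """
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * (offset + length))
        f.flush()
        os.fsync(f.fileno())
        # The range is addressed as (fd, offset, len); no mmap()/munmap()
        # pair and no virtual address space are needed.
        os.posix_fadvise(f.fileno(), offset, length, os.POSIX_FADV_DONTNEED)
        return os.fstat(f.fileno()).st_size


if __name__ == "__main__":
    print(advise_range_without_mapping(4096, 8192))
```

Addressing by (fd, offset, len) is what would let a 32-bit process manage volatility on a large file without first finding room to map it.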
[PATCH v10 00/16] Volatile Ranges v10
Hey all, Happy New Year! I know it's bad timing to send this unfamiliar large patchset for review but hope there are some guys with freshed-brain in new year all over the world. :) And most important thing is that before I dive into lots of testing, I'd like to make an agreement on design issues and others o Syscall interface o Not bind with vma split/merge logic to prevent mmap_sem cost and o Not bind with vma split/merge logic to avoid vm_area_struct memory footprint. o Purging logic - when we trigger purging volatile pages to prevent working set and stop to prevent too excessive purging of volatile pages o How to test Currently, we have a patched jemalloc allocator by Jason's help although it's not perfect and more rooms to be enhanced but IMO, it's enough to prove vrange-anonymous. The problem is that lack of benchmark for testing vrange-file side. I hope that Mozilla folks can help. So its been a while since the last release of the volatile ranges patches, again. I and John have been busy with other things. Still, we have been slowly chipping away at issues and differences trying to get a patchset that we both agree on. There's still a few issues, but we figured any further polishing of the patch series in private would be unproductive and it would be much better to send the patches out for review and comment and get some wider opinions. 
You could get full patchset by git git clone -b vrange-v10-rc5 --single-branch git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git In v10, there are some notable changes following as Whats new in v10: * Fix several bugs and build break * Add shmem_purge_page to correct purging shmem/tmpfs * Replace slab shrinker with direct hooked reclaim path * Optimize pte scanning by caching previous place * Reorder patch and tidy up Cc-list * Rebased on v3.12 * Add vrange-anon test with jemalloc in Dhaval's test suite - https://github.com/volatile-ranges-test/vranges-test so, you could test any application with vrange-patched jemalloc by LD_PRELOAD but please keep in mind that it's just a prototype to prove vrange syscall concept so it has more rooms to optimize. So, please do not compare it with another allocator. Whats new in v9: * Updated to v3.11 * Added vrange purging logic to purge anonymous pages on swapless systems * Added logic to allocate the vroot structure dynamically to avoid added overhead to mm and address_space structures * Lots of minor tweaks, changes and cleanups Still TODO: * Sort out better solution for clearing volatility on new mmaps - Minchan has a different approach here * Agreement of systemcall interface * Better discarding trigger policy to prevent working set evction * Review, Review, Review.. Comment. * A ton of test Feedback or thoughts here would be particularly helpful! Also, thanks to Dhaval for his maintaining and vastly improving the volatile ranges test suite, which can be found here: [1] https://github.com/volatile-ranges-test/vranges-test These patches can also be pulled from git here: git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-v9 We'd really welcome any feedback and comments on the patch series. thanks == &< = Volatile ranges provides a method for userland to inform the kernel that a range of memory is safe to discard (ie: can be regenerated) but userspace may want to try access it in the future. 
It can be thought of as similar to MADV_DONTNEED, except that the actual freeing of the memory is delayed and only done under memory pressure, and the user can try to cancel the action and quickly access any unpurged pages.

The idea originated from Android's ashmem, but I've since learned that other OSes provide similar functionality.

This functionality allows for a number of interesting uses:

* Userland caches that have kernel-triggered eviction under memory pressure. This allows the kernel to "rightsize" userspace caches for the current system-wide workload. Things like image bitmap caches, or rendered HTML in a hidden browser tab, where the data is not visible and can be regenerated if needed, are good examples.

* Opportunistic freeing of memory that may be quickly reused. Minchan has done a malloc implementation where free() marks the pages as volatile, allowing the kernel to reclaim them under pressure. This avoids the unmapping and remapping of anonymous pages on free/malloc. So if userland wants to malloc memory quickly after the free, it just needs to mark the pages as non-volatile, and only purged pages will have to be faulted back in.

I did some tests with jemalloc, with help from Jason Evans, the author of jemalloc, who was interested in the vrange system call.

Test (RAM 2G, CPU 4, ebizzy benchmark)
ebizzy argument: ./ebizzy -S 30 -n 512
The default chunk size is 512k, so 512k * 512 = 256M; *an* ebizzy process has a 256M footprint.
(1.1) stands for 1 process and 1 thread, so (1.4) is 1 process and 4 threads.

vanilla
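
To make the mark/unmark semantics described above a bit more concrete, here is a toy userspace model in Python. It is only a sketch and not the vrange syscall or any kernel code: the class VrangeModel, its method names, and the choice to have the unmark step return a "was anything purged" flag are all assumptions made for illustration. It tracks per-page volatile and purged state, lets a simulated reclaim pass purge only volatile pages, and reports purged pages when a range is made non-volatile again.

```python
class VrangeModel:
    """Toy model of volatile ranges over a page array (hypothetical, not a kernel API)."""

    def __init__(self, num_pages):
        # volatile[p]: userspace has said page p may be discarded.
        # purged[p]:   the simulated kernel discarded page p while it was volatile.
        self.volatile = [False] * num_pages
        self.purged = [False] * num_pages

    def mark_volatile(self, start, length):
        """Tell the model that pages [start, start + length) are safe to discard."""
        for p in range(start, start + length):
            self.volatile[p] = True

    def mark_nonvolatile(self, start, length):
        """Cancel volatility; return True if any page in the range was purged.

        Assumption: purged pages are simply refilled on the next access,
        so the purged flag is cleared here.
        """
        any_purged = False
        for p in range(start, start + length):
            if self.purged[p]:
                any_purged = True
                self.purged[p] = False
            self.volatile[p] = False
        return any_purged

    def reclaim(self, want):
        """Simulate memory pressure: purge up to `want` volatile pages.

        Non-volatile pages are never touched. Returns the number purged.
        """
        freed = 0
        for p in range(len(self.volatile)):
            if freed == want:
                break
            if self.volatile[p] and not self.purged[p]:
                self.purged[p] = True
                freed += 1
        return freed


if __name__ == "__main__":
    m = VrangeModel(8)
    m.mark_volatile(0, 4)             # e.g. free() marks the chunk volatile
    m.reclaim(2)                      # memory pressure purges some volatile pages
    lost = m.mark_nonvolatile(0, 4)   # e.g. malloc() reuses the chunk
    print("contents lost, must regenerate" if lost else "contents intact")
```

In this model a cache or allocator only pays for regeneration or re-faulting when the unmark step reports a purge; with no memory pressure, the mark/unmark pair changes nothing.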