Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Thu, 13 Nov 2008 12:38:07 +0200 Izik Eidus <[EMAIL PROTECTED]> wrote: > > If KSM pages are on radix-tree, it will be accounted automatically. > > Now, we have "Unevictable" LRU and mlocked() pages are smartly isolated > > into its > > own LRU. So, just doing > > > > - inode's radix-tree > > - make all pages mlocked. > > - provide special page fault handler for your purpose > > > > Well in this version that i am going to merge the pages arent going to > be swappable, > Latter after Ksm will get merged we will make the KsmPages swappable... good to hear > so i think working with cgroups would be effective / useful only when > KsmPages will start be swappable... > Do you agree? > (What i am saying is that right now lets dont count the KsmPages inside > the cgroup, lets do it when KsmPages > will be swappable) > ok. > If you feel this pages should be counted in the cgroup i have no problem > to do it via hooks like page migration is doing. > > thanks. > > > is simple one. But ok, whatever implementation you'll do, I have to check it > > and consider whether it should be tracked or not. Then, add codes to memcg > > to > > track it or ignore it or comments on your patches ;) > > > > It's helpful to add me to CC: when you post this set again. > > > > Sure will. > If necessary, I'll have to add "ignore in this case" hook in memcg. (ex. checking PageKSM flag in memcg.) If you are sufferred from memcg in your test, cgroup_disable=memory boot option will allow you to disable memcg. Thanks, -Kame -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
ציטוט KAMEZAWA Hiroyuki: Thank you for answers. On Wed, 12 Nov 2008 13:11:12 +0200 Izik Eidus <[EMAIL PROTECTED]> wrote: Avi Kivity wrote: KAMEZAWA Hiroyuki wrote: Can I make a question ? (I'm working for memory cgroup.) Now, we do charge to anonymous page when - charge(+1) when it's mapped firstly (mapcount 0->1) - uncharge(-1) it's fully unmapped (mapcount 1->0) vir page_remove_rmap(). My quesion is - PageKSM pages are not necessary to be tracked by memory cgroup ? When we reaplacing page using page_replace() we have: oldpage - > anonymous page that is going to be replaced by newpage newpage -> kernel allocated page (KsmPage) so about oldpage we are calling page_remove_rmap() that will notify cgroup and about newpage it wont be count inside cgroup beacuse it is file rmap page (we are calling to page_add_file_rmap), so right now PageKSM wont ever be tracked by cgroup. If not in radix-tree, it's not tracked. (But we don't want to track non-LRU pages which are not freeable.) Yes. - Can we know that "the page is just replaced and we don't necessary to do charge/uncharge". The caller of page_replace does know it, the only problem is that page_remove_rmap() automaticly change the cgroup for anonymous pages, if we want it not to change the cgroup, we can: increase the cgroup count before page_remove (but in that case what happen if we reach to the limit???) give parameter to page_remove_rmap() that we dont want the cgroup to be changed. Hmm, current mem cgroup works via page_cgroup struct to track pages. page <-> page_cgroup has one-to-one relation ship. So, "exchanging page" itself causes trouble. But I may be able to provide necessary hooks to you as I did in page migraiton. But if we dont track KsmPages, and we call to page_remove_rmap() on the anonymous page that we replace, why would it be a problem? (should everything be correct in that case???) - annonymous page from KSM is worth to be tracked by memory cgroup ? (IOW, it's on LRU and can be swapped-out ?) KSM have no anonymous pages (it share anonymous pages into KsmPAGE -> kernel allocated page without mapping) so it isnt in LRU and it cannt be swapped, only when KsmPAGEs will be break by do_wp_page() the duplication will be able to swap. Ok, thank you for confirmation. My feeling is that shared pages should be accounted as if they were not shared; that is, a share page should be accounted for each process that shares it. Perhaps sharing within a cgroup should be counted as 1 page for all the ptes pointing to it. If KSM pages are on radix-tree, it will be accounted automatically. Now, we have "Unevictable" LRU and mlocked() pages are smartly isolated into its own LRU. So, just doing - inode's radix-tree - make all pages mlocked. - provide special page fault handler for your purpose Well in this version that i am going to merge the pages arent going to be swappable, Latter after Ksm will get merged we will make the KsmPages swappable... so i think working with cgroups would be effective / useful only when KsmPages will start be swappable... Do you agree? (What i am saying is that right now lets dont count the KsmPages inside the cgroup, lets do it when KsmPages will be swappable) If you feel this pages should be counted in the cgroup i have no problem to do it via hooks like page migration is doing. thanks. is simple one. But ok, whatever implementation you'll do, I have to check it and consider whether it should be tracked or not. Then, add codes to memcg to track it or ignore it or comments on your patches ;) It's helpful to add me to CC: when you post this set again. Sure will. Thanks, -Kame -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Thank you for answers. On Wed, 12 Nov 2008 13:11:12 +0200 Izik Eidus <[EMAIL PROTECTED]> wrote: > Avi Kivity wrote: > > KAMEZAWA Hiroyuki wrote: > >> Can I make a question ? (I'm working for memory cgroup.) > >> > >> Now, we do charge to anonymous page when > >> - charge(+1) when it's mapped firstly (mapcount 0->1) > >> - uncharge(-1) it's fully unmapped (mapcount 1->0) vir > >> page_remove_rmap(). > >> > >> My quesion is > >> - PageKSM pages are not necessary to be tracked by memory cgroup ? > When we reaplacing page using page_replace() we have: > oldpage - > anonymous page that is going to be replaced by newpage > newpage -> kernel allocated page (KsmPage) > so about oldpage we are calling page_remove_rmap() that will notify cgroup > and about newpage it wont be count inside cgroup beacuse it is file rmap > page > (we are calling to page_add_file_rmap), so right now PageKSM wont ever > be tracked by cgroup. > If not in radix-tree, it's not tracked. (But we don't want to track non-LRU pages which are not freeable.) > >> - Can we know that "the page is just replaced and we don't necessary > >> to do > >>charge/uncharge". > > The caller of page_replace does know it, the only problem is that > page_remove_rmap() > automaticly change the cgroup for anonymous pages, > if we want it not to change the cgroup, we can: > increase the cgroup count before page_remove (but in that case what > happen if we reach to the limit???) > give parameter to page_remove_rmap() that we dont want the cgroup to be > changed. Hmm, current mem cgroup works via page_cgroup struct to track pages. page <-> page_cgroup has one-to-one relation ship. So, "exchanging page" itself causes trouble. But I may be able to provide necessary hooks to you as I did in page migraiton. > > >> - annonymous page from KSM is worth to be tracked by memory cgroup ? > >>(IOW, it's on LRU and can be swapped-out ?) > > KSM have no anonymous pages (it share anonymous pages into KsmPAGE -> > kernel allocated page without mapping) > so it isnt in LRU and it cannt be swapped, only when KsmPAGEs will be > break by do_wp_page() the duplication will be able to swap. > Ok, thank you for confirmation. > >> > > > > My feeling is that shared pages should be accounted as if they were > > not shared; that is, a share page should be accounted for each process > > that shares it. Perhaps sharing within a cgroup should be counted as > > 1 page for all the ptes pointing to it. > > > > If KSM pages are on radix-tree, it will be accounted automatically. Now, we have "Unevictable" LRU and mlocked() pages are smartly isolated into its own LRU. So, just doing - inode's radix-tree - make all pages mlocked. - provide special page fault handler for your purpose is simple one. But ok, whatever implementation you'll do, I have to check it and consider whether it should be tracked or not. Then, add codes to memcg to track it or ignore it or comments on your patches ;) It's helpful to add me to CC: when you post this set again. Thanks, -Kame -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Thursday 13 November 2008 13:31, Andrea Arcangeli wrote: > On Thu, Nov 13, 2008 at 03:00:59AM +0100, Andrea Arcangeli wrote: > > CPU0 migrate.c CPU1 filemap.c > > --- -- > > find_get_page > > radix_tree_lookup_slot returns the oldpage > > page_count still = expected_count > > freeze_ref (oldpage->count = 0) > > radix_tree_replace (too late, other side already got the oldpage) > > unfreeze_ref (oldpage->count = 2) > > page_cache_get_speculative(old_page) > > set count to 3 and succeeds > > After reading more of this lockless radix tree code, I realized this > below check is the one that was intended to restart find_get_page and > prevent it to return the oldpage: > > if (unlikely(page != *pagep)) { > > But there's no barrier there, atomic_add_unless would need to provide > an atomic smp_mb() _after_ atomic_add_unless executed. In the old days > the atomic_* routines had no implied memory barriers, you had to use > smp_mb__after_atomic_add_unless if you wanted to avoid the race. I > don't see much in the ppc implementation of atomic_add_unless that > would provide an implicit smb_mb after the page_cache_get_speculative > returns, so I can't see why the pagep can't be by find_get_page read > before the other cpu executes radix_tree_replace in the above > timeline. All atomic functions that both return a value and modify their memory are defined to provide full barriers before and after the operation. powerpc should be OK __asm__ __volatile__ ( LWSYNC_ON_SMP "1: lwarx %0,0,%1 # atomic_add_unless\n\ cmpw0,%0,%3 \n\ beq-2f \n\ add %0,%2,%0 \n" PPC405_ERR77(0,%2) " stwcx. %0,0,%1 \n\ bne-1b \n" ISYNC_ON_SMP " subf%0,%2,%0 \n\ 2:" : "=&r" (t) : "r" (&v->counter), "r" (a), "r" (u) : "cc", "memory"); lwsync instruction prevents all reorderings except allows loads to pass stores. But because it is followed by a LL/SC sequence, the store part of that sequence is prevented from passing stores, so therefore it will fail if the load had executed earlier and the value has changed. isync prevents speculative execution, so the branch has to be resolved before any subsequent instructions start. The branch depends on the result of the LL/SC, so that has to be complete. So that prevents loads from passing the LL/SC. For the SC to complete, I think by definition the store has to be visible, because it has to check the value has not changed (so it would have to own the cacheline). I think that's how it works. I'm not an expert at powerpc's weird barriers, but I'm pretty sure they work. > I guess you intended to put an smp_mb() in between the > page_cache_get_speculative and the *pagep to make the code safe on ppc > too, but there isn't, and I think it must be fixed, either that or I > don't understand ppc assembly right. The other side has a smp_wmb > implicit inside radix_tree_replace_slot so it should be ok already to > ensure we see the refcount going to 0 before we see the pagep changed > (the fact the other side has a memory barrier, further confirms this > side needs it too). I think it is OK. It should have a comment there however to say it relies on a barrier in get_page_unless_zero (I thought I had one..) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Thu, Nov 13, 2008 at 03:00:59AM +0100, Andrea Arcangeli wrote: > CPU0 migrate.cCPU1 filemap.c > --- -- > find_get_page > radix_tree_lookup_slot returns the oldpage > page_count still = expected_count > freeze_ref (oldpage->count = 0) > radix_tree_replace (too late, other side already got the oldpage) > unfreeze_ref (oldpage->count = 2) > page_cache_get_speculative(old_page) > set count to 3 and succeeds After reading more of this lockless radix tree code, I realized this below check is the one that was intended to restart find_get_page and prevent it to return the oldpage: if (unlikely(page != *pagep)) { But there's no barrier there, atomic_add_unless would need to provide an atomic smp_mb() _after_ atomic_add_unless executed. In the old days the atomic_* routines had no implied memory barriers, you had to use smp_mb__after_atomic_add_unless if you wanted to avoid the race. I don't see much in the ppc implementation of atomic_add_unless that would provide an implicit smb_mb after the page_cache_get_speculative returns, so I can't see why the pagep can't be by find_get_page read before the other cpu executes radix_tree_replace in the above timeline. I guess you intended to put an smp_mb() in between the page_cache_get_speculative and the *pagep to make the code safe on ppc too, but there isn't, and I think it must be fixed, either that or I don't understand ppc assembly right. The other side has a smp_wmb implicit inside radix_tree_replace_slot so it should be ok already to ensure we see the refcount going to 0 before we see the pagep changed (the fact the other side has a memory barrier, further confirms this side needs it too). BTW, the radix_tree_deref_slot might miss a rcu_barrier_depends() after radix_tree_deref_slot returns but I'm not entirely sure and only alpha would be affected in the worst case. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, Nov 12, 2008 at 05:09:03PM -0500, Lee Schermerhorn wrote: > Maybe not so wild, given the complexity of these interactions... Perhaps Christoph's right it's just wild ideas, but see below. You both seem to agree the first theory of the tree_lock is bogus as it's lockless for find_get_page. The second theory of why it shouldn't happen thanks to the refcount freezing is bogus too... CPU0 migrate.c CPU1 filemap.c --- -- find_get_page radix_tree_lookup_slot returns the oldpage page_count still = expected_count freeze_ref (oldpage->count = 0) radix_tree_replace (too late, other side already got the oldpage) unfreeze_ref (oldpage->count = 2) page_cache_get_speculative(old_page) set count to 3 and succeeds Admittedly I couldn't understand what the freeze_ref was about, I thought it was something related to the radix tree internals (which I didn't check as they weren't relevant at that point besides being lockless) as there was nothing inside that freeze/unfreeze critical section that could affect find_get_page, so I ignored it. If if was meant to stop find_get_page to get the oldpage it clearly isn't working. Perhaps I'm still missing something... If I'm right my suggested fix is to simply change the remove_migration_ptes to set the pte to point to the swapcache, instead of leaving the swapentry in it. That will make do_swap_page bailout like every other path in memory.c in the pte_same check, and additionally it'll avoid an unnecessary minor fault. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, 2008-11-12 at 14:27 -0600, Christoph Lameter wrote: > On Wed, 12 Nov 2008, Andrea Arcangeli wrote: > > > On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote: > > > get_user_pages() cannot get to it since the pagetables have already been > > > modified. If get_user_pages runs then the fault handling will occur > > > which will block the thread until migration is complete. > > > > migrate.c does nothing for ptes pointing to swap entries and > > do_swap_page won't wait for them either. Assume follow_page in > > If a anonymous page is a swap page then it has a mapping. > migrate_page_move_mapping() will lock the radix tree and ensure that no > additional reference (like done by do_swap_page) is established during > migration. So, it's Nick's reference freezing you asked about in response to my mail that prevents do_swap_page() from getting another reference on the page in the swap cache just after migrate_page_move_mapping() checks the ref count and replaces the slot with new swap pte. Radix tree lock just prevents other threads from modifying the slot, right? [Hmmm, looks like we need to update the reference to "write lock" in the comments on the 'deref_slot() and _replace_slot() definitions in radix-tree.h.] Therefore, do_swap_page() will either get the old page and raise the ref before migration check, or it will [possibly loop in find_get_page() and then] get the new page. Migration will bail out, for this pass anyway, in the former case. In the second case, do_swap_page() will wait on the new page lock until migration completes, deferring any direct IO. Or am I still missing something? > > > However it's not exactly the same bug as the one in fork, I was > > talking about before, it's also not o_direct specific. Still > > So far I have seen wild ideas not bugs. Maybe not so wild, given the complexity of these interactions... Later, Lee -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, 12 Nov 2008, Lee Schermerhorn wrote: > Might want/need to check for migration entry in do_swap_page() and loop > back to migration_entry_wait() call when the changed pte is detected > rather than returning an error to the caller. > > Does that sound reasonable? The reference count freezing and the rechecking of the pte in do_swap_page() does not work? Nick broke it during lock removal for the lockless page cache? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, 12 Nov 2008, Andrea Arcangeli wrote: > On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote: > > get_user_pages() cannot get to it since the pagetables have already been > > modified. If get_user_pages runs then the fault handling will occur > > which will block the thread until migration is complete. > > migrate.c does nothing for ptes pointing to swap entries and > do_swap_page won't wait for them either. Assume follow_page in If a anonymous page is a swap page then it has a mapping. migrate_page_move_mapping() will lock the radix tree and ensure that no additional reference (like done by do_swap_page) is established during migration. > However it's not exactly the same bug as the one in fork, I was > talking about before, it's also not o_direct specific. Still So far I have seen wild ideas not bugs. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, 2008-11-12 at 18:32 +0100, Andrea Arcangeli wrote: > On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote: > > get_user_pages() cannot get to it since the pagetables have already been > > modified. If get_user_pages runs then the fault handling will occur > > which will block the thread until migration is complete. > > migrate.c does nothing for ptes pointing to swap entries and > do_swap_page won't wait for them either. Assume follow_page in > migrate.c returns a swapcache not mapped but with a pte pointing to > it. That means page_count 1 (+1 after you isolate it from the lru), > page_mapcount 0, page_mapped 0, page_mapping = swap address space, > swap_count = 2 (1 swapcache, 1 the pte with the swapentry). Now assume > one thread does o_direct read from disk that triggers a minor fault in > do_swap_cache called by get_user_pages. The other cpu is running > sys_move_pages and the expected count will match the page count in > migrate_page_move_mapping. Page is still in swapcache. So after the > expected count matches in the migrate.c thread, the other thread > continues in do_swap_page and runs lookup_swap_cache that succeeds > (the page wasn't removed from swapcache yet as migrate.c needs to bail > out if the expected count doesn't match, so it can't mess with the > oldpage until it's sure it can migrate it). After that do_swap_page > gets a reference on the swapcache (at that point migrate.c continues > despite the expected count isn't 2 anymore! just a second after having > verified that it was 2). lock_page blocks do_swap_page until migration > is complete but pte_same in do_swap_page won't fail because the pte is > still pointing to the same swapentry (it's just the swapcache inode > radix tree that points to a different page, the swapentry is still the > same as before the migration - is_swap_pte will succeed but > is_migration_entry failed when restoring the pte). Ah. try_to_unmap_one() won't replace the pte entry with a migration_pte() if the [anon] page is already in the swap cache. When migration completes, we won't modify the page tables with the newpage pte--we'll just let any subsequent swap page [minor] fault handle that. That suggests a possible fix: instead of replacing the pte with a duplicate swap entry in try_to_unmap_one(), go ahead and replace the pte with a migration pte. Then back in try_to_unmap_anon(), after unmapping all references, free the swap cache entry, so as not to leak it [assuming we're in a lock state that allows that--I haven't checked that far]. Then, the page table WILL have been modified by the time migration unlocks the page. Might want/need to check for migration entry in do_swap_page() and loop back to migration_entry_wait() call when the changed pte is detected rather than returning an error to the caller. Does that sound reasonable? > Finally the pte is > overwritten with the old page and any data written to the new page in > between is lost. And wouldn't the new page potentially be leaked? That is, could it end up on the lru with page_count == page_mapcount() >= 1, but no page table reference to ever be unmapped to release the count? > > However it's not exactly the same bug as the one in fork, I was > talking about before, it's also not o_direct specific. Still > page_wrprotect + replace_page is orders of magnitude simpler logic > than migrate.c and it has no bugs or at least it's certainly much > simpler to proof as correct. Furthermore we never 'stall' any userland > task while we do our work. We only mark the pte wrprotected, the task > can cow or takeover it if refcount allows anytime, and later we'll > bailout during replace_page if something has happened in between > page_wrprotect and replace_page. So our logic is simpler and tuned for > max performance and fewer interference with userland runtime. Not > really sure if it worth for us to call into migrate.c. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to [EMAIL PROTECTED] For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: mailto:"[EMAIL PROTECTED]"> [EMAIL PROTECTED] -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote: > get_user_pages() cannot get to it since the pagetables have already been > modified. If get_user_pages runs then the fault handling will occur > which will block the thread until migration is complete. migrate.c does nothing for ptes pointing to swap entries and do_swap_page won't wait for them either. Assume follow_page in migrate.c returns a swapcache not mapped but with a pte pointing to it. That means page_count 1 (+1 after you isolate it from the lru), page_mapcount 0, page_mapped 0, page_mapping = swap address space, swap_count = 2 (1 swapcache, 1 the pte with the swapentry). Now assume one thread does o_direct read from disk that triggers a minor fault in do_swap_cache called by get_user_pages. The other cpu is running sys_move_pages and the expected count will match the page count in migrate_page_move_mapping. Page is still in swapcache. So after the expected count matches in the migrate.c thread, the other thread continues in do_swap_page and runs lookup_swap_cache that succeeds (the page wasn't removed from swapcache yet as migrate.c needs to bail out if the expected count doesn't match, so it can't mess with the oldpage until it's sure it can migrate it). After that do_swap_page gets a reference on the swapcache (at that point migrate.c continues despite the expected count isn't 2 anymore! just a second after having verified that it was 2). lock_page blocks do_swap_page until migration is complete but pte_same in do_swap_page won't fail because the pte is still pointing to the same swapentry (it's just the swapcache inode radix tree that points to a different page, the swapentry is still the same as before the migration - is_swap_pte will succeed but is_migration_entry failed when restoring the pte). Finally the pte is overwritten with the old page and any data written to the new page in between is lost. However it's not exactly the same bug as the one in fork, I was talking about before, it's also not o_direct specific. Still page_wrprotect + replace_page is orders of magnitude simpler logic than migrate.c and it has no bugs or at least it's certainly much simpler to proof as correct. Furthermore we never 'stall' any userland task while we do our work. We only mark the pte wrprotected, the task can cow or takeover it if refcount allows anytime, and later we'll bailout during replace_page if something has happened in between page_wrprotect and replace_page. So our logic is simpler and tuned for max performance and fewer interference with userland runtime. Not really sure if it worth for us to call into migrate.c. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Avi Kivity wrote: KAMEZAWA Hiroyuki wrote: Can I make a question ? (I'm working for memory cgroup.) Now, we do charge to anonymous page when - charge(+1) when it's mapped firstly (mapcount 0->1) - uncharge(-1) it's fully unmapped (mapcount 1->0) vir page_remove_rmap(). My quesion is - PageKSM pages are not necessary to be tracked by memory cgroup ? When we reaplacing page using page_replace() we have: oldpage - > anonymous page that is going to be replaced by newpage newpage -> kernel allocated page (KsmPage) so about oldpage we are calling page_remove_rmap() that will notify cgroup and about newpage it wont be count inside cgroup beacuse it is file rmap page (we are calling to page_add_file_rmap), so right now PageKSM wont ever be tracked by cgroup. - Can we know that "the page is just replaced and we don't necessary to do charge/uncharge". The caller of page_replace does know it, the only problem is that page_remove_rmap() automaticly change the cgroup for anonymous pages, if we want it not to change the cgroup, we can: increase the cgroup count before page_remove (but in that case what happen if we reach to the limit???) give parameter to page_remove_rmap() that we dont want the cgroup to be changed. - annonymous page from KSM is worth to be tracked by memory cgroup ? (IOW, it's on LRU and can be swapped-out ?) KSM have no anonymous pages (it share anonymous pages into KsmPAGE -> kernel allocated page without mapping) so it isnt in LRU and it cannt be swapped, only when KsmPAGEs will be break by do_wp_page() the duplication will be able to swap. My feeling is that shared pages should be accounted as if they were not shared; that is, a share page should be accounted for each process that shares it. Perhaps sharing within a cgroup should be counted as 1 page for all the ptes pointing to it. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
KAMEZAWA Hiroyuki wrote: Can I make a question ? (I'm working for memory cgroup.) Now, we do charge to anonymous page when - charge(+1) when it's mapped firstly (mapcount 0->1) - uncharge(-1) it's fully unmapped (mapcount 1->0) vir page_remove_rmap(). My quesion is - PageKSM pages are not necessary to be tracked by memory cgroup ? - Can we know that "the page is just replaced and we don't necessary to do charge/uncharge". - annonymous page from KSM is worth to be tracked by memory cgroup ? (IOW, it's on LRU and can be swapped-out ?) My feeling is that shared pages should be accounted as if they were not shared; that is, a share page should be accounted for each process that shares it. Perhaps sharing within a cgroup should be counted as 1 page for all the ptes pointing to it. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, 12 Nov 2008, Andrea Arcangeli wrote: > So are you checking if there's an unresolved reference only in the > very place I just quoted in the previous email? If answer is yes: what > should prevent get_user_pages from running in parallel from another > thread? get_user_pages will trigger a minor fault and get the elevated > reference just after you read page_count. To you it looks like there > is no o_direct in progress when you proceed to the core of migration > code, but in effect o_direct just started a moment after you read the > page count. get_user_pages() cannot get to it since the pagetables have already been modified. If get_user_pages runs then the fault handling will occur which will block the thread until migration is complete. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, Nov 11, 2008 at 06:27:09PM -0600, Christoph Lameter wrote: > Then page migration will not occur because there is an unresolved > reference. So are you checking if there's an unresolved reference only in the very place I just quoted in the previous email? If answer is yes: what should prevent get_user_pages from running in parallel from another thread? get_user_pages will trigger a minor fault and get the elevated reference just after you read page_count. To you it looks like there is no o_direct in progress when you proceed to the core of migration code, but in effect o_direct just started a moment after you read the page count. What can protect you is PG lock or mmap_sem in _write_ mode (and they've to be hold for the whole duration of the migration). I don't see any of the two being hold while you read the page count... You don't seem to be using stop_machine either (stop_machine pretty expensive on the 4096 way I guess). This wasn't reproduced in practice but it should be possible to reproduce it by just writing a testcase with three threads, one forks in a loop (child just quit) the other memset 0 the first 512bytes of a page, and then o_direct read from a 0xff 512byte region and checks that the first 512bytes are all non-zero in a loop, and the third writes 1 byte to the last 512bytes of the page in a loop. Eventually the comparison should show zero data in the page. To reproduce with migration just start the thread that memset 0, reads a 0xff region with o_direct, and checks it's all 0xff in a loop, and then migrate the memory of this thread back and forth between two nodes with the sys_move_pages (mpol is safe by luck because it surrounds migrate_pages with the mmap_sem in write mode). Eventually you should see zero bytes despite I/O is complete. Reproducing this is normal life would take time and for the fork bug it may not be reproducible depending of what the app is doing. Mixing sys_move_pages with o_direct in the same process with on two different threads, instead should eventually eventually reproduce it. And with gup_fast is now unfixable until more infrastructure is added to slowdown gup_fast a bit (unless Nick finds an RCU way of doing it). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008 23:24:21 +0100 Andrea Arcangeli <[EMAIL PROTECTED]> wrote: > On Tue, Nov 11, 2008 at 03:31:18PM -0600, Christoph Lameter wrote: > > > ksm need the pte inside the vma to point from anonymous page into > > > filebacked > > > page > > > can migrate.c do it without changes? > > > > So change anonymous to filebacked page? > > > > Currently page migration assumes that the page will continue to be part > > of the existing file or anon vma. > > > > What you want sounds like assigning a swap pte to an anonymous page? That > > way a anon page gains membership in a file backed mapping. > > KSM needs to convert anonymous pages to PageKSM, which means a page > owned by ksm.c and only known by ksm.c. The Linux VM will free this > page in munmap but that's about it, all we do is to match the number > of anon-ptes pointing to the page with the page_count. So besides > freeing the page when the last user exit()s or cows it, the VM will do > nothing about it. Initially. Later it can swap it in a nonlinear way. > Can I make a question ? (I'm working for memory cgroup.) Now, we do charge to anonymous page when - charge(+1) when it's mapped firstly (mapcount 0->1) - uncharge(-1) it's fully unmapped (mapcount 1->0) vir page_remove_rmap(). My quesion is - PageKSM pages are not necessary to be tracked by memory cgroup ? - Can we know that "the page is just replaced and we don't necessary to do charge/uncharge". - annonymous page from KSM is worth to be tracked by memory cgroup ? (IOW, it's on LRU and can be swapped-out ?) Thanks, -Kame -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, 12 Nov 2008, Andrea Arcangeli wrote: > > O_DIRECT does not take a refcount on the page in order to prevent this? > > It definitely does, it's also the only thing it does. Then page migration will not occur because there is an unresolved reference. > The whole point is that O_DIRECT can start the instruction after > page_count returns as far as I can tell. But there must still be reference for the bio and whatever may be going on at the time in order to perform the I/O operation. > If you check the three emails I linked in answer to Andrew on the > topic, we agree the o_direct can't start under PT lock (or under > mmap_sem in write mode but migrate.c rightefully takes the read > mode). So the fix used in ksm page_wrprotect and in fork() is to check > page_count vs page_mapcount inside PT lock before doing anything on > the pte. If you just mark the page wprotect while O_DIRECT is in > flight, that's enough for fork() to generate data corruption in the > parent (not the child where the result would be undefined). But in the > parent the result of the o-direct is defined and it'd never corrupt if > this was a cached-I/O. The moment the parent pte is marked readonly, a thread > in the parent could write to the last 512bytes of the page, leading to > the first 512bytes coming with O_DIRECT from disk being lost (as the > write will trigger a cow before I/O is complete and the dma will > complete on the oldpage). Have you actually seen corruption or this conjecture? AFACT the page count is elevated while I/O is in progress and thus this is safe. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Wed, Nov 12, 2008 at 12:17:22AM +0100, Andrea Arcangeli wrote: > We don't have to check the page_count vs mapcount later in > replace_page because we know if anybody started an O_DIRECT read from > disk, it would have triggered a cow, and the pte_same check that we > have to do for other reasons would take care of bailing out the > replace_page. Ah, for completeness: above I didn't mention the case of O_DIRECT writes to disk, because we never need to care about those. They're never a problem. If the page is replaced and the cpu writes to the page and by doing so triggers a cow that lead to the CPU write to go in a different page (not the one under dma) it'll be like if the write to disk completed before the cpu overwritten the page, so result is undefined. I don't think we've to define the case of somebody doing a direct read from a location where there's still an o_direct write in flight either. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, Nov 11, 2008 at 04:30:22PM -0600, Christoph Lameter wrote: > On Tue, 11 Nov 2008, Andrea Arcangeli wrote: > > > this page_count check done with only the tree_lock won't prevent a > > task to start O_DIRECT after page_count has been read in the above line. > > > > If a thread starts O_DIRECT on the page, and the o_direct is still in > > flight by the time you copy the page to the new page, the read will > > not be represented fully in the newpage leading to userland data > > corruption. > > O_DIRECT does not take a refcount on the page in order to prevent this? It definitely does, it's also the only thing it does. The whole point is that O_DIRECT can start the instruction after page_count returns as far as I can tell. If you check the three emails I linked in answer to Andrew on the topic, we agree the o_direct can't start under PT lock (or under mmap_sem in write mode but migrate.c rightefully takes the read mode). So the fix used in ksm page_wrprotect and in fork() is to check page_count vs page_mapcount inside PT lock before doing anything on the pte. If you just mark the page wprotect while O_DIRECT is in flight, that's enough for fork() to generate data corruption in the parent (not the child where the result would be undefined). But in the parent the result of the o-direct is defined and it'd never corrupt if this was a cached-I/O. The moment the parent pte is marked readonly, a thread in the parent could write to the last 512bytes of the page, leading to the first 512bytes coming with O_DIRECT from disk being lost (as the write will trigger a cow before I/O is complete and the dma will complete on the oldpage). We do the check in page_wprotect because that's the _first_ place where we mark a pte readonly. Same as fork. And we can't convert a pte from writeable to wrprotected without doing this check inside PT lock or we generate data corruption with o_direct (again same as the bug in fork). We don't have to check the page_count vs mapcount later in replace_page because we know if anybody started an O_DIRECT read from disk, it would have triggered a cow, and the pte_same check that we have to do for other reasons would take care of bailing out the replace_page. > Oh they could be migrated if you had a callback to the devices method for > giving up references. Same as slab defrag. The oldpage is a regular anonymous page for us, not the 'strange' object called PageKSM. And we need to migrate many anonymous pages having destination a single PageKSM page already preallocated and not being an anonymous page. We never migrate PageKSM to anything, that's always the destination. > Seems that we are tinkering around with the concept of what an anonymous > page is? Doesnt shmem have some means of converting pages to file backed? > Swizzling? Anonymous page is anything with page->mapping pointing to an anon_vma or swapcache_space instead of an address_space of a real inode backed by the vfs. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Christoph Lameter wrote: On Tue, 11 Nov 2008, Avi Kivity wrote: Christoph Lameter wrote: page migration requires the page to be on the LRU. That could be changed if you have a different means of isolating a page from its page tables. Isn't rmap the means of isolating a page from its page tables? I guess I'm misunderstanding something. In order to migrate a page one first has to make sure that userspace and the kernel cannot access the page in any way. User space must be made to generate page faults and all kernel references must be accounted for and not be in use. The user space portion involves changing the page tables so that faults are generated. The kernel portion isolates the page from the LRU (to exempt it from kernel reclaim handling etc). Thanks. Only then can the page and its metadata be copied to a new location. Guess you already have the LRU portion done. So you just need the user space isolation portion? We don't want to limit all faults, just writes. So we write protect the page before merging. What do you mean by page metadata? obviously the dirty bit (Izik?), accessed bit and position in the LRU are desirable (the last is quite difficult for ksm -- the page occupied *two* positions in the LRU). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008, Andrea Arcangeli wrote: > this page_count check done with only the tree_lock won't prevent a > task to start O_DIRECT after page_count has been read in the above line. > > If a thread starts O_DIRECT on the page, and the o_direct is still in > flight by the time you copy the page to the new page, the read will > not be represented fully in the newpage leading to userland data > corruption. O_DIRECT does not take a refcount on the page in order to prevent this? > > Define a regular VM page? A page on the LRU? > > Yes, pages owned, allocated and worked on by the VM. So they can be > swapped, collected, migrated etc... You can't possibly migrate a > device driver page for example and infact those device driver pages > can't be migrated either. Oh they could be migrated if you had a callback to the devices method for giving up references. Same as slab defrag. > The KSM page initially is a driver page, later we'd like to teach the > VM how to swap it by introducing rmap methods and adding it to the > LRU. As long as it's only anonymous memory that we're sharing/cloning, > we won't have to patch pagecache radix tree and other stuff. BTW, if > we ever decice to clone pagecache we could generate immense metadata > ram overhead in the radix tree with just a single page of data. All > issues that don't exist for anon ram. Seems that we are tinkering around with the concept of what an anonymous page is? Doesnt shmem have some means of converting pages to file backed? Swizzling? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, Nov 11, 2008 at 03:31:18PM -0600, Christoph Lameter wrote: > > ksm need the pte inside the vma to point from anonymous page into filebacked > > page > > can migrate.c do it without changes? > > So change anonymous to filebacked page? > > Currently page migration assumes that the page will continue to be part > of the existing file or anon vma. > > What you want sounds like assigning a swap pte to an anonymous page? That > way a anon page gains membership in a file backed mapping. KSM needs to convert anonymous pages to PageKSM, which means a page owned by ksm.c and only known by ksm.c. The Linux VM will free this page in munmap but that's about it, all we do is to match the number of anon-ptes pointing to the page with the page_count. So besides freeing the page when the last user exit()s or cows it, the VM will do nothing about it. Initially. Later it can swap it in a nonlinear way. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, Nov 11, 2008 at 03:26:57PM -0600, Christoph Lameter wrote: > On Tue, 11 Nov 2008, Andrea Arcangeli wrote: > > > btw, page_migration likely is buggy w.r.t. o_direct too (and now > > unfixable with gup_fast until the 2.4 brlock is added around it or > > similar) if it does the same thing but without any page_mapcount vs > > page_count check. > > Details please? spin_lock_irq(&mapping->tree_lock); pslot = radix_tree_lookup_slot(&mapping->page_tree, page_index(page)); expected_count = 2 + !!PagePrivate(page); if (page_count(page) != expected_count || this page_count check done with only the tree_lock won't prevent a task to start O_DIRECT after page_count has been read in the above line. If a thread starts O_DIRECT on the page, and the o_direct is still in flight by the time you copy the page to the new page, the read will not be represented fully in the newpage leading to userland data corruption. > > page_migration does too much for us, so us calling into migrate.c may > > not be ideal. It has to convert a fresh page to a VM page. In KSM we > > don't convert the newpage to be a VM page, we just replace the anon > > page with another page. The new page in the KSM case is not a page > > known by the VM, not in the lru etc... > > A VM page as opposed to pages not in the VM? ??? Yes, you migrate old VM pages to new VM pages. We replace VM pages with unknown stuff called KSM pages. So in the inner function where you replace the pte-migration-placeholder with a pte pointing to the newpage, you also rightfully do unconditional stuff we can't be doing like page_add_*_rmap. It's VM pages you're dealing with. Not for us. > page migration requires the page to be on the LRU. That could be changed > if you have a different means of isolating a page from its page tables. Yes we'd need to change those bits to be able to use migrate.c. > Define a regular VM page? A page on the LRU? Yes, pages owned, allocated and worked on by the VM. So they can be swapped, collected, migrated etc... You can't possibly migrate a device driver page for example and infact those device driver pages can't be migrated either. The KSM page initially is a driver page, later we'd like to teach the VM how to swap it by introducing rmap methods and adding it to the LRU. As long as it's only anonymous memory that we're sharing/cloning, we won't have to patch pagecache radix tree and other stuff. BTW, if we ever decice to clone pagecache we could generate immense metadata ram overhead in the radix tree with just a single page of data. All issues that don't exist for anon ram. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Christoph Lameter wrote: On Tue, 11 Nov 2008, Avi Kivity wrote: Christoph Lameter wrote: page migration requires the page to be on the LRU. That could be changed if you have a different means of isolating a page from its page tables. Isn't rmap the means of isolating a page from its page tables? I guess I'm misunderstanding something. In order to migrate a page one first has to make sure that userspace and the kernel cannot access the page in any way. User space must be made to generate page faults and all kernel references must be accounted for and not be in use. This is really not the case for ksm, in ksm we allow the page to be accessed all the time, we dont have to swap the page like migrate.c is doing... -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008, Avi Kivity wrote: > Christoph Lameter wrote: > > page migration requires the page to be on the LRU. That could be changed > > if you have a different means of isolating a page from its page tables. > > > > Isn't rmap the means of isolating a page from its page tables? I guess I'm > misunderstanding something. In order to migrate a page one first has to make sure that userspace and the kernel cannot access the page in any way. User space must be made to generate page faults and all kernel references must be accounted for and not be in use. The user space portion involves changing the page tables so that faults are generated. The kernel portion isolates the page from the LRU (to exempt it from kernel reclaim handling etc). Only then can the page and its metadata be copied to a new location. Guess you already have the LRU portion done. So you just need the user space isolation portion? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008, Izik Eidus wrote: > > What do you mean by kernel page? The kernel can allocate a page and then > > point a user space pte to it. That is how page migration works. > > > i mean filebacked page (!AnonPage()) ok. > ksm need the pte inside the vma to point from anonymous page into filebacked > page > can migrate.c do it without changes? So change anonymous to filebacked page? Currently page migration assumes that the page will continue to be part of the existing file or anon vma. What you want sounds like assigning a swap pte to an anonymous page? That way a anon page gains membership in a file backed mapping. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Christoph Lameter wrote: page migration requires the page to be on the LRU. That could be changed if you have a different means of isolating a page from its page tables. Isn't rmap the means of isolating a page from its page tables? I guess I'm misunderstanding something. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Christoph Lameter wrote: Currently page migration assumes that the page will continue to be part of the existing file or anon vma. exactly, and ksm really need it to get out of the existing anon vma! What you want sounds like assigning a swap pte to an anonymous page? That way a anon page gains membership in a file backed mapping. No, i want pte that is found inside vma and point to anonymous page, will stop point into the anonymous page and will point to a whole diffrent page that i chose (for ksm it is needed because this way we are mapping alot of ptes into the same write_protected page and save memory) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008, Andrea Arcangeli wrote: > btw, page_migration likely is buggy w.r.t. o_direct too (and now > unfixable with gup_fast until the 2.4 brlock is added around it or > similar) if it does the same thing but without any page_mapcount vs > page_count check. Details please? > page_migration does too much for us, so us calling into migrate.c may > not be ideal. It has to convert a fresh page to a VM page. In KSM we > don't convert the newpage to be a VM page, we just replace the anon > page with another page. The new page in the KSM case is not a page > known by the VM, not in the lru etc... A VM page as opposed to pages not in the VM? ??? page migration requires the page to be on the LRU. That could be changed if you have a different means of isolating a page from its page tables. > The way to go could be to change the page_migration to use > replace_page (or __replace_page if called in some shared inner-lock > context) after preparing the newpage to be a regular VM page. If we > can do that, migrate.c will get the o_direct race fixed too for free. Define a regular VM page? A page on the LRU? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, Nov 11, 2008 at 03:21:45PM -0600, Christoph Lameter wrote: > What do you mean by kernel page? The kernel can allocate a page and then > point a user space pte to it. That is how page migration works. Just to make an example, remove_migration_pte adds the page back to rmap layer. We can't do that right now as rmap for the ksm pages will be built inside ksm, or alternatively rmap.c will have to learn to handle nonlinear anon-vma. Migration simply migrates the page. The new page is identical to the original one, just backed by different physical memory. For us the new page is an entirely different beast that we build ourself (we can't let migrate.c to pretend dealing with the newpage like if it resembled the old page like it's doing now). We replace a linear anon page with something that isn't an anonymous page at all right now (in the future it may become a nonlinear anon page if VM learns about it, or still an unknown page external-rmappable if we go the external-rmap way). There's clearly something to share, but the replace_page seem to be the one that could be called from migrate.c. What is different is that we don't need the migration pte placeholder, we never block releasing locks, all atomic with pte wrprotected, and a final pte_same check under PT lock before we replace the page. There isn't a whole lot to share after all, but surely it'd be nice to share if we can. Us calling into migrate.c isn't feasible right now without some significant change to migrate.c where it would be misplaced IMHO as to share we'd need migrate.c to call into VM core instead. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Christoph Lameter wrote: page migration as far as i saw cant migrate anonymous page into kernel page. if you want we can change page_migration to do that, but i thought you will rather have ksm changes separate. What do you mean by kernel page? The kernel can allocate a page and then point a user space pte to it. That is how page migration works. i mean filebacked page (!AnonPage()) ksm need the pte inside the vma to point from anonymous page into filebacked page can migrate.c do it without changes? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008, Izik Eidus wrote: > yes but it replace it with kernel allocated page. > > page migration already kinda does that. Is there common ground? > > > > > page migration as far as i saw cant migrate anonymous page into kernel page. > if you want we can change page_migration to do that, but i thought you will > rather have ksm changes separate. What do you mean by kernel page? The kernel can allocate a page and then point a user space pte to it. That is how page migration works. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, Nov 11, 2008 at 11:45:55AM -0800, Andrew Morton wrote: > page migration already kinda does that. Is there common ground? btw, page_migration likely is buggy w.r.t. o_direct too (and now unfixable with gup_fast until the 2.4 brlock is added around it or similar) if it does the same thing but without any page_mapcount vs page_count check. page_migration does too much for us, so us calling into migrate.c may not be ideal. It has to convert a fresh page to a VM page. In KSM we don't convert the newpage to be a VM page, we just replace the anon page with another page. The new page in the KSM case is not a page known by the VM, not in the lru etc... The way to go could be to change the page_migration to use replace_page (or __replace_page if called in some shared inner-lock context) after preparing the newpage to be a regular VM page. If we can do that, migrate.c will get the o_direct race fixed too for free. > I don't understand the restrictions on anonymous pages. Please expand > the changelog so that reviewers can understand the reasons for this > restriction. That's a good question but I don't have a definitive answer as I didn't account for exactly how complex it would be yet. Suppose a file has 0-4k equal to 4k-8k. A MAP_SHARED that maps both pages with the same physical page sounds tricky. The shared pages are KSM owned and invisible to the VM (later could be made visible with an external-rmap), but pagecache can't be just KSM owned, they at least must go in pagecache radix tree so that requires patching the radix tree etc... All things we don't need for anon ram. I think first thing to extend is to add external-rmap and make ksm swappable, then we can think if something can be done about MAP_SHARED/MAP_PRIVATE too (MAP_PRIVATE post-COW already works, the question is pre-COW). One excuse of why I didn't think too much about it yet is because in effect KSM it's mostly useful to the anon ram, the pagecache can be solved in userland with hardlinks with openvz, and shared libs already do all they can to share .text (the not-shared post dynamic link should be covered by ksm already). > Again, we could make the presence of this code selectable by subsystems > which want it. Indeed!! -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
Andrew Morton wrote: On Tue, 11 Nov 2008 15:21:39 +0200 Izik Eidus <[EMAIL PROTECTED]> wrote: From: Izik Eidus <[EMAIL PROTECTED]> this function is needed in cases you want to change the userspace virtual mapping into diffrent physical page, Not sure that I understand that description. We want to replace a live page in an anonymous VMA with a different one? It looks that way. yes but it replace it with kernel allocated page. page migration already kinda does that. Is there common ground? page migration as far as i saw cant migrate anonymous page into kernel page. if you want we can change page_migration to do that, but i thought you will rather have ksm changes separate. KSM need this for merging the identical pages. this function is working by removing the oldpage from the rmap and calling put_page on it, and by setting the virtual address pte to point into the new page. (note that the new page (the page that we change the pte to map to) cannot be anonymous page) I don't understand the restrictions on anonymous pages. Please expand the changelog so that reviewers can understand the reasons for this restriction. the page that we are going to map into the pte going to be nonlinear from the point of view of anon-vma therefore it cannot be anonymous. --- include/linux/mm.h |3 ++ mm/memory.c| 68 2 files changed, 71 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ffee2f7..4da7fa8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn); +int replace_page(struct vm_area_struct *vma, struct page *oldpage, +struct page *newpage, pte_t orig_pte, pgprot_t prot); + struct page *follow_page(struct vm_area_struct *, unsigned long address, unsigned int foll_flags); #define FOLL_WRITE 0x01/* check pte is writable */ diff --git a/mm/memory.c b/mm/memory.c index 164951c..b2c542c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, } EXPORT_SYMBOL(vm_insert_mixed); +/** + * replace _page - replace the pte mapping related to vm area between two pages s/replace _page/replace_page/ + * (from oldpage to newpage) + * NOTE: you should take into consideration the impact on the VM when replacing + * anonymous pages with kernel non swappable pages. + */ This _is_ a kerneldoc comment, but kernedoc comments conventionally document the arguments and the return value also. +int replace_page(struct vm_area_struct *vma, struct page *oldpage, +struct page *newpage, pte_t orig_pte, pgprot_t prot) +{ + struct mm_struct *mm = vma->vm_mm; + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep; + spinlock_t *ptl; + unsigned long addr; + int ret; + + BUG_ON(PageAnon(newpage)); + + ret = -EFAULT; + addr = page_address_in_vma(oldpage, vma); + if (addr == -EFAULT) + goto out; + + pgd = pgd_offset(mm, addr); + if (!pgd_present(*pgd)) + goto out; + + pud = pud_offset(pgd, addr); + if (!pud_present(*pud)) + goto out; + + pmd = pmd_offset(pud, addr); + if (!pmd_present(*pmd)) + goto out; + + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); + if (!ptep) + goto out; + + if (!pte_same(*ptep, orig_pte)) { + pte_unmap_unlock(ptep, ptl); + goto out; + } + + ret = 0; + get_page(newpage); + page_add_file_rmap(newpage); + + flush_cache_page(vma, addr, pte_pfn(*ptep)); + ptep_clear_flush(vma, addr, ptep); + set_pte_at(mm, addr, ptep, mk_pte(newpage, prot)); + + page_remove_rmap(oldpage, vma); + if (PageAnon(oldpage)) { + dec_mm_counter(mm, anon_rss); + inc_mm_counter(mm, file_rss); + } + put_page(oldpage); + + pte_unmap_unlock(ptep, ptl); + +out: + return ret; +} +EXPORT_SYMBOL(replace_page); Again, we could make the presence of this code selectable by subsystems which want it. ] sure. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another
On Tue, 11 Nov 2008 15:21:39 +0200 Izik Eidus <[EMAIL PROTECTED]> wrote: > From: Izik Eidus <[EMAIL PROTECTED]> > > this function is needed in cases you want to change the userspace > virtual mapping into diffrent physical page, Not sure that I understand that description. We want to replace a live page in an anonymous VMA with a different one? It looks that way. page migration already kinda does that. Is there common ground? > KSM need this for merging the identical pages. > > this function is working by removing the oldpage from the rmap and > calling put_page on it, and by setting the virtual address pte > to point into the new page. > (note that the new page (the page that we change the pte to map to) > cannot be anonymous page) > I don't understand the restrictions on anonymous pages. Please expand the changelog so that reviewers can understand the reasons for this restriction. > --- > include/linux/mm.h |3 ++ > mm/memory.c| 68 > > 2 files changed, 71 insertions(+), 0 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index ffee2f7..4da7fa8 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned > long addr, > int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, > unsigned long pfn); > > +int replace_page(struct vm_area_struct *vma, struct page *oldpage, > + struct page *newpage, pte_t orig_pte, pgprot_t prot); > + > struct page *follow_page(struct vm_area_struct *, unsigned long address, > unsigned int foll_flags); > #define FOLL_WRITE 0x01/* check pte is writable */ > diff --git a/mm/memory.c b/mm/memory.c > index 164951c..b2c542c 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, > unsigned long addr, > } > EXPORT_SYMBOL(vm_insert_mixed); > > +/** > + * replace _page - replace the pte mapping related to vm area between two > pages s/replace _page/replace_page/ > + * (from oldpage to newpage) > + * NOTE: you should take into consideration the impact on the VM when > replacing > + * anonymous pages with kernel non swappable pages. > + */ This _is_ a kerneldoc comment, but kernedoc comments conventionally document the arguments and the return value also. > +int replace_page(struct vm_area_struct *vma, struct page *oldpage, > + struct page *newpage, pte_t orig_pte, pgprot_t prot) > +{ > + struct mm_struct *mm = vma->vm_mm; > + pgd_t *pgd; > + pud_t *pud; > + pmd_t *pmd; > + pte_t *ptep; > + spinlock_t *ptl; > + unsigned long addr; > + int ret; > + > + BUG_ON(PageAnon(newpage)); > + > + ret = -EFAULT; > + addr = page_address_in_vma(oldpage, vma); > + if (addr == -EFAULT) > + goto out; > + > + pgd = pgd_offset(mm, addr); > + if (!pgd_present(*pgd)) > + goto out; > + > + pud = pud_offset(pgd, addr); > + if (!pud_present(*pud)) > + goto out; > + > + pmd = pmd_offset(pud, addr); > + if (!pmd_present(*pmd)) > + goto out; > + > + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); > + if (!ptep) > + goto out; > + > + if (!pte_same(*ptep, orig_pte)) { > + pte_unmap_unlock(ptep, ptl); > + goto out; > + } > + > + ret = 0; > + get_page(newpage); > + page_add_file_rmap(newpage); > + > + flush_cache_page(vma, addr, pte_pfn(*ptep)); > + ptep_clear_flush(vma, addr, ptep); > + set_pte_at(mm, addr, ptep, mk_pte(newpage, prot)); > + > + page_remove_rmap(oldpage, vma); > + if (PageAnon(oldpage)) { > + dec_mm_counter(mm, anon_rss); > + inc_mm_counter(mm, file_rss); > + } > + put_page(oldpage); > + > + pte_unmap_unlock(ptep, ptl); > + > +out: > + return ret; > +} > +EXPORT_SYMBOL(replace_page); Again, we could make the presence of this code selectable by subsystems which want it. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html