Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-13 Thread KAMEZAWA Hiroyuki
On Thu, 13 Nov 2008 12:38:07 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:
> > If KSM pages are on radix-tree, it will be accounted automatically.
> > Now, we have "Unevictable" LRU and mlocked() pages are smartly isolated 
> > into its
> > own LRU. So, just doing
> >
> >  - inode's radix-tree
> >  - make all pages mlocked.
> >  - provide special page fault handler for your purpose
> >   
> 
> Well in this version that i am going to merge the pages arent going to 
> be swappable,
> Latter after Ksm will get merged we will make the KsmPages swappable...
good to hear

> so i think working with cgroups would be effective / useful only when 
> KsmPages will start be swappable...
> Do you agree?
> (What i am saying is that right now lets dont count the KsmPages inside 
> the cgroup, lets do it when KsmPages
> will be swappable)
> 
ok.

> If you feel this pages should be counted in the cgroup i have no problem 
> to do it via hooks like page migration is doing.
> 
> thanks.
> 
> > is simple one. But ok, whatever implementation you'll do, I have to check it
> > and consider whether it should be tracked or not. Then, add codes to memcg 
> > to
> > track it or ignore it or comments on your patches ;)
> >
> > It's helpful to add me to CC: when you post this set again.
> >   
> 
> Sure will.
> 

If necessary, I'll have to add "ignore in this case" hook in memcg.
(ex. checking PageKSM flag in memcg.)

If you are sufferred from memcg in your test, cgroup_disable=memory boot option
will allow you to disable memcg.


Thanks,
-Kame








--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-13 Thread Izik Eidus

ציטוט KAMEZAWA Hiroyuki:

Thank you for answers.

On Wed, 12 Nov 2008 13:11:12 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

  

Avi Kivity wrote:


KAMEZAWA Hiroyuki wrote:
  

Can I make a question ? (I'm working for memory cgroup.)

Now, we do charge to anonymous page when
  - charge(+1) when it's mapped firstly (mapcount 0->1)
  - uncharge(-1) it's fully unmapped (mapcount 1->0) vir 
page_remove_rmap().


My quesion is
 - PageKSM pages are not necessary to be tracked by memory cgroup ?


When we reaplacing page using page_replace() we have:
oldpage - > anonymous page that is going to be replaced by newpage
newpage -> kernel allocated page (KsmPage)
so about oldpage we are calling page_remove_rmap() that will notify cgroup
and about newpage it wont be count inside cgroup beacuse it is file rmap 
page
(we are calling to page_add_file_rmap), so right now PageKSM wont ever 
be tracked by cgroup.




If not in radix-tree, it's not tracked.
(But we don't want to track non-LRU pages which are not freeable.)
  


Yes.



  
 - Can we know that "the page is just replaced and we don't necessary 
to do

   charge/uncharge".

The caller of page_replace does know it, the only problem is that 
page_remove_rmap()

automaticly change the cgroup for anonymous pages,
if we want it not to change the cgroup, we can:
increase the cgroup count before page_remove (but in that case what 
happen if we reach to the limit???)
give parameter to page_remove_rmap() that we dont want the cgroup to be 
changed.



Hmm, current mem cgroup works via page_cgroup struct to track pages.

   page <-> page_cgroup has one-to-one relation ship.

So, "exchanging page" itself causes trouble. But I may be able to provide
necessary hooks to you as I did in page migraiton.
  


But if we dont track KsmPages, and we call to page_remove_rmap() on the 
anonymous page that we

replace, why would it be a problem?
(should everything be correct in that case???)

  

 - annonymous page from KSM is worth to be tracked by memory cgroup ?
   (IOW, it's on LRU and can be swapped-out ?)

KSM have no anonymous pages (it share anonymous pages into KsmPAGE -> 
kernel allocated page without mapping)
so it isnt in LRU and it cannt be swapped, only when KsmPAGEs will be 
break by do_wp_page() the duplication will be able to swap.




Ok, thank you for confirmation.

  
  

My feeling is that shared pages should be accounted as if they were 
not shared; that is, a share page should be accounted for each process 
that shares it.  Perhaps sharing within a cgroup should be counted as 
1 page for all the ptes pointing to it.



  


If KSM pages are on radix-tree, it will be accounted automatically.
Now, we have "Unevictable" LRU and mlocked() pages are smartly isolated into its
own LRU. So, just doing

 - inode's radix-tree
 - make all pages mlocked.
 - provide special page fault handler for your purpose
  


Well in this version that i am going to merge the pages arent going to 
be swappable,

Latter after Ksm will get merged we will make the KsmPages swappable...
so i think working with cgroups would be effective / useful only when 
KsmPages will start be swappable...

Do you agree?
(What i am saying is that right now lets dont count the KsmPages inside 
the cgroup, lets do it when KsmPages

will be swappable)

If you feel this pages should be counted in the cgroup i have no problem 
to do it via hooks like page migration is doing.


thanks.


is simple one. But ok, whatever implementation you'll do, I have to check it
and consider whether it should be tracked or not. Then, add codes to memcg to
track it or ignore it or comments on your patches ;)

It's helpful to add me to CC: when you post this set again.
  


Sure will.


Thanks,
-Kame






--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
  


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread KAMEZAWA Hiroyuki
Thank you for answers.

On Wed, 12 Nov 2008 13:11:12 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

> Avi Kivity wrote:
> > KAMEZAWA Hiroyuki wrote:
> >> Can I make a question ? (I'm working for memory cgroup.)
> >>
> >> Now, we do charge to anonymous page when
> >>   - charge(+1) when it's mapped firstly (mapcount 0->1)
> >>   - uncharge(-1) it's fully unmapped (mapcount 1->0) vir 
> >> page_remove_rmap().
> >>
> >> My quesion is
> >>  - PageKSM pages are not necessary to be tracked by memory cgroup ?
> When we reaplacing page using page_replace() we have:
> oldpage - > anonymous page that is going to be replaced by newpage
> newpage -> kernel allocated page (KsmPage)
> so about oldpage we are calling page_remove_rmap() that will notify cgroup
> and about newpage it wont be count inside cgroup beacuse it is file rmap 
> page
> (we are calling to page_add_file_rmap), so right now PageKSM wont ever 
> be tracked by cgroup.
> 
If not in radix-tree, it's not tracked.
(But we don't want to track non-LRU pages which are not freeable.)


> >>  - Can we know that "the page is just replaced and we don't necessary 
> >> to do
> >>charge/uncharge".
> 
> The caller of page_replace does know it, the only problem is that 
> page_remove_rmap()
> automaticly change the cgroup for anonymous pages,
> if we want it not to change the cgroup, we can:
> increase the cgroup count before page_remove (but in that case what 
> happen if we reach to the limit???)
> give parameter to page_remove_rmap() that we dont want the cgroup to be 
> changed.

Hmm, current mem cgroup works via page_cgroup struct to track pages.

   page <-> page_cgroup has one-to-one relation ship.

So, "exchanging page" itself causes trouble. But I may be able to provide
necessary hooks to you as I did in page migraiton.

> 
> >>  - annonymous page from KSM is worth to be tracked by memory cgroup ?
> >>(IOW, it's on LRU and can be swapped-out ?)
> 
> KSM have no anonymous pages (it share anonymous pages into KsmPAGE -> 
> kernel allocated page without mapping)
> so it isnt in LRU and it cannt be swapped, only when KsmPAGEs will be 
> break by do_wp_page() the duplication will be able to swap.
> 
Ok, thank you for confirmation.

> >>   
> >
> > My feeling is that shared pages should be accounted as if they were 
> > not shared; that is, a share page should be accounted for each process 
> > that shares it.  Perhaps sharing within a cgroup should be counted as 
> > 1 page for all the ptes pointing to it.
> >
> >

If KSM pages are on radix-tree, it will be accounted automatically.
Now, we have "Unevictable" LRU and mlocked() pages are smartly isolated into its
own LRU. So, just doing

 - inode's radix-tree
 - make all pages mlocked.
 - provide special page fault handler for your purpose

is simple one. But ok, whatever implementation you'll do, I have to check it
and consider whether it should be tracked or not. Then, add codes to memcg to
track it or ignore it or comments on your patches ;)

It's helpful to add me to CC: when you post this set again.

Thanks,
-Kame






--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Nick Piggin
On Thursday 13 November 2008 13:31, Andrea Arcangeli wrote:
> On Thu, Nov 13, 2008 at 03:00:59AM +0100, Andrea Arcangeli wrote:
> > CPU0 migrate.c  CPU1 filemap.c
> > --- --
> > find_get_page
> > radix_tree_lookup_slot returns the oldpage
> > page_count still = expected_count
> > freeze_ref (oldpage->count = 0)
> > radix_tree_replace (too late, other side already got the oldpage)
> > unfreeze_ref (oldpage->count = 2)
> > page_cache_get_speculative(old_page)
> > set count to 3 and succeeds
>
> After reading more of this lockless radix tree code, I realized this
> below check is the one that was intended to restart find_get_page and
> prevent it to return the oldpage:
>
>   if (unlikely(page != *pagep)) {
>
> But there's no barrier there, atomic_add_unless would need to provide
> an atomic smp_mb() _after_ atomic_add_unless executed. In the old days
> the atomic_* routines had no implied memory barriers, you had to use
> smp_mb__after_atomic_add_unless if you wanted to avoid the race. I
> don't see much in the ppc implementation of atomic_add_unless that
> would provide an implicit smb_mb after the page_cache_get_speculative
> returns, so I can't see why the pagep can't be by find_get_page read
> before the other cpu executes radix_tree_replace in the above
> timeline.

All atomic functions that both return a value and modify their memory
are defined to provide full barriers before and after the operation.

powerpc should be OK

__asm__ __volatile__ (
LWSYNC_ON_SMP
"1: lwarx   %0,0,%1 # atomic_add_unless\n\
cmpw0,%0,%3 \n\
beq-2f \n\
add %0,%2,%0 \n"
PPC405_ERR77(0,%2)
"   stwcx.  %0,0,%1 \n\
bne-1b \n"
ISYNC_ON_SMP
"   subf%0,%2,%0 \n\
2:"
: "=&r" (t)
: "r" (&v->counter), "r" (a), "r" (u)
: "cc", "memory");

lwsync instruction prevents all reorderings except allows loads to
pass stores. But because it is followed by a LL/SC sequence, the
store part of that sequence is prevented from passing stores, so
therefore it will fail if the load had executed earlier and the
value has changed.

isync prevents speculative execution, so the branch has to be
resolved before any subsequent instructions start. The branch
depends on the result of the LL/SC, so that has to be complete.

So that prevents loads from passing the LL/SC.

For the SC to complete, I think by definition the store has to be
visible, because it has to check the value has not changed (so it
would have to own the cacheline).

I think that's how it works. I'm not an expert at powerpc's weird
barriers, but I'm pretty sure they work.


> I guess you intended to put an smp_mb() in between the
> page_cache_get_speculative and the *pagep to make the code safe on ppc
> too, but there isn't, and I think it must be fixed, either that or I
> don't understand ppc assembly right. The other side has a smp_wmb
> implicit inside radix_tree_replace_slot so it should be ok already to
> ensure we see the refcount going to 0 before we see the pagep changed
> (the fact the other side has a memory barrier, further confirms this
> side needs it too).

I think it is OK. It should have a comment there however to say it
relies on a barrier in get_page_unless_zero (I thought I had one..)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Andrea Arcangeli
On Thu, Nov 13, 2008 at 03:00:59AM +0100, Andrea Arcangeli wrote:
> CPU0 migrate.cCPU1 filemap.c
> ---   --
>   find_get_page
>   radix_tree_lookup_slot returns the oldpage
> page_count still = expected_count
> freeze_ref (oldpage->count = 0)
> radix_tree_replace (too late, other side already got the oldpage)
> unfreeze_ref (oldpage->count = 2)
>   page_cache_get_speculative(old_page)
>   set count to 3 and succeeds

After reading more of this lockless radix tree code, I realized this
below check is the one that was intended to restart find_get_page and
prevent it to return the oldpage:

if (unlikely(page != *pagep)) {

But there's no barrier there, atomic_add_unless would need to provide
an atomic smp_mb() _after_ atomic_add_unless executed. In the old days
the atomic_* routines had no implied memory barriers, you had to use
smp_mb__after_atomic_add_unless if you wanted to avoid the race. I
don't see much in the ppc implementation of atomic_add_unless that
would provide an implicit smb_mb after the page_cache_get_speculative
returns, so I can't see why the pagep can't be by find_get_page read
before the other cpu executes radix_tree_replace in the above
timeline.

I guess you intended to put an smp_mb() in between the
page_cache_get_speculative and the *pagep to make the code safe on ppc
too, but there isn't, and I think it must be fixed, either that or I
don't understand ppc assembly right. The other side has a smp_wmb
implicit inside radix_tree_replace_slot so it should be ok already to
ensure we see the refcount going to 0 before we see the pagep changed
(the fact the other side has a memory barrier, further confirms this
side needs it too).

BTW, the radix_tree_deref_slot might miss a rcu_barrier_depends()
after radix_tree_deref_slot returns but I'm not entirely sure and only
alpha would be affected in the worst case.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Andrea Arcangeli
On Wed, Nov 12, 2008 at 05:09:03PM -0500, Lee Schermerhorn wrote:
> Maybe not so wild, given the complexity of these interactions... 

Perhaps Christoph's right it's just wild ideas, but see below.

You both seem to agree the first theory of the tree_lock is bogus
as it's lockless for find_get_page.

The second theory of why it shouldn't happen thanks to the refcount
freezing is bogus too...

CPU0 migrate.c  CPU1 filemap.c
--- --
find_get_page
radix_tree_lookup_slot returns the oldpage
page_count still = expected_count
freeze_ref (oldpage->count = 0)
radix_tree_replace (too late, other side already got the oldpage)
unfreeze_ref (oldpage->count = 2)
page_cache_get_speculative(old_page)
set count to 3 and succeeds

Admittedly I couldn't understand what the freeze_ref was about, I
thought it was something related to the radix tree internals (which I
didn't check as they weren't relevant at that point besides being
lockless) as there was nothing inside that freeze/unfreeze critical
section that could affect find_get_page, so I ignored it. If if was
meant to stop find_get_page to get the oldpage it clearly isn't
working.

Perhaps I'm still missing something...

If I'm right my suggested fix is to simply change the
remove_migration_ptes to set the pte to point to the swapcache,
instead of leaving the swapentry in it. That will make do_swap_page
bailout like every other path in memory.c in the pte_same check, and
additionally it'll avoid an unnecessary minor fault.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Lee Schermerhorn
On Wed, 2008-11-12 at 14:27 -0600, Christoph Lameter wrote:
> On Wed, 12 Nov 2008, Andrea Arcangeli wrote:
> 
> > On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> > > get_user_pages() cannot get to it since the pagetables have already been
> > > modified. If get_user_pages runs then the fault handling will occur
> > > which will block the thread until migration is complete.
> >
> > migrate.c does nothing for ptes pointing to swap entries and
> > do_swap_page won't wait for them either. Assume follow_page in
> 
> If a anonymous page is a swap page then it has a mapping.
> migrate_page_move_mapping() will lock the radix tree and ensure that no
> additional reference (like done by do_swap_page) is established during
> migration.

So, it's Nick's reference freezing you asked about in response to my
mail that prevents do_swap_page() from getting another reference on the
page in the swap cache just after migrate_page_move_mapping() checks the
ref count and replaces the slot with new swap pte.  Radix tree lock just
prevents other threads from modifying the slot, right?  [Hmmm, looks
like we need to update the reference to "write lock" in the comments on
the 'deref_slot() and _replace_slot() definitions in radix-tree.h.]  

Therefore, do_swap_page() will either get the old page and raise the ref
before migration check, or it will [possibly loop in find_get_page() and
then] get the new page.

Migration will bail out, for this pass anyway, in the former case.  In
the second case, do_swap_page() will wait on the new page lock until
migration completes, deferring any direct IO. 

Or am I still missing something?

> 
> > However it's not exactly the same bug as the one in fork, I was
> > talking about before, it's also not o_direct specific. Still
> 
> So far I have seen wild ideas not bugs.

Maybe not so wild, given the complexity of these interactions... 

Later,
Lee


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Christoph Lameter
On Wed, 12 Nov 2008, Lee Schermerhorn wrote:

> Might want/need to check for migration entry in do_swap_page() and loop
> back to migration_entry_wait() call when the changed pte is detected
> rather than returning an error to the caller.
>
> Does that sound reasonable?

The reference count freezing and the rechecking of the pte in
do_swap_page() does not work? Nick broke it during lock removal for the
lockless page cache?


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Christoph Lameter
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> > get_user_pages() cannot get to it since the pagetables have already been
> > modified. If get_user_pages runs then the fault handling will occur
> > which will block the thread until migration is complete.
>
> migrate.c does nothing for ptes pointing to swap entries and
> do_swap_page won't wait for them either. Assume follow_page in

If a anonymous page is a swap page then it has a mapping.
migrate_page_move_mapping() will lock the radix tree and ensure that no
additional reference (like done by do_swap_page) is established during
migration.

> However it's not exactly the same bug as the one in fork, I was
> talking about before, it's also not o_direct specific. Still

So far I have seen wild ideas not bugs.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Lee Schermerhorn
On Wed, 2008-11-12 at 18:32 +0100, Andrea Arcangeli wrote:
> On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> > get_user_pages() cannot get to it since the pagetables have already been
> > modified. If get_user_pages runs then the fault handling will occur
> > which will block the thread until migration is complete.
> 
> migrate.c does nothing for ptes pointing to swap entries and
> do_swap_page won't wait for them either. Assume follow_page in
> migrate.c returns a swapcache not mapped but with a pte pointing to
> it. That means page_count 1 (+1 after you isolate it from the lru),
> page_mapcount 0, page_mapped 0, page_mapping = swap address space,
> swap_count = 2 (1 swapcache, 1 the pte with the swapentry). Now assume
> one thread does o_direct read from disk that triggers a minor fault in
> do_swap_cache called by get_user_pages. The other cpu is running
> sys_move_pages and the expected count will match the page count in
> migrate_page_move_mapping. Page is still in swapcache. So after the
> expected count matches in the migrate.c thread, the other thread
> continues in do_swap_page and runs lookup_swap_cache that succeeds
> (the page wasn't removed from swapcache yet as migrate.c needs to bail
> out if the expected count doesn't match, so it can't mess with the
> oldpage until it's sure it can migrate it). After that do_swap_page
> gets a reference on the swapcache (at that point migrate.c continues
> despite the expected count isn't 2 anymore! just a second after having
> verified that it was 2). lock_page blocks do_swap_page until migration
> is complete but pte_same in do_swap_page won't fail because the pte is
> still pointing to the same swapentry (it's just the swapcache inode
> radix tree that points to a different page, the swapentry is still the
> same as before the migration - is_swap_pte will succeed but
> is_migration_entry failed when restoring the pte). 

Ah.  try_to_unmap_one() won't replace the pte entry with a
migration_pte() if the [anon] page is already in the swap cache.  When
migration completes, we won't modify the page tables with the newpage
pte--we'll just let any subsequent swap page [minor] fault handle that.

That suggests a possible fix:  instead of replacing the pte with a
duplicate swap entry in try_to_unmap_one(), go ahead and replace the pte
with a migration pte.  Then back in try_to_unmap_anon(), after unmapping
all references, free the swap cache entry, so as not to leak it
[assuming we're in a lock state that allows that--I haven't checked that
far].  Then, the page table WILL have been modified by the time
migration unlocks the page.

Might want/need to check for migration entry in do_swap_page() and loop
back to migration_entry_wait() call when the changed pte is detected
rather than returning an error to the caller.

Does that sound reasonable?

> Finally the pte is
> overwritten with the old page and any data written to the new page in
> between is lost.

And wouldn't the new page potentially be leaked?  That is, could it end
up on the lru with page_count == page_mapcount() >= 1, but no page table
reference to ever be unmapped to release the count?  

> 
> However it's not exactly the same bug as the one in fork, I was
> talking about before, it's also not o_direct specific. Still
> page_wrprotect + replace_page is orders of magnitude simpler logic
> than migrate.c and it has no bugs or at least it's certainly much
> simpler to proof as correct. Furthermore we never 'stall' any userland
> task while we do our work. We only mark the pte wrprotected, the task
> can cow or takeover it if refcount allows anytime, and later we'll
> bailout during replace_page if something has happened in between
> page_wrprotect and replace_page. So our logic is simpler and tuned for
> max performance and fewer interference with userland runtime. Not
> really sure if it worth for us to call into migrate.c.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [EMAIL PROTECTED]  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"[EMAIL PROTECTED]"> [EMAIL PROTECTED] 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 09:10:45PM -0600, Christoph Lameter wrote:
> get_user_pages() cannot get to it since the pagetables have already been
> modified. If get_user_pages runs then the fault handling will occur
> which will block the thread until migration is complete.

migrate.c does nothing for ptes pointing to swap entries and
do_swap_page won't wait for them either. Assume follow_page in
migrate.c returns a swapcache not mapped but with a pte pointing to
it. That means page_count 1 (+1 after you isolate it from the lru),
page_mapcount 0, page_mapped 0, page_mapping = swap address space,
swap_count = 2 (1 swapcache, 1 the pte with the swapentry). Now assume
one thread does o_direct read from disk that triggers a minor fault in
do_swap_cache called by get_user_pages. The other cpu is running
sys_move_pages and the expected count will match the page count in
migrate_page_move_mapping. Page is still in swapcache. So after the
expected count matches in the migrate.c thread, the other thread
continues in do_swap_page and runs lookup_swap_cache that succeeds
(the page wasn't removed from swapcache yet as migrate.c needs to bail
out if the expected count doesn't match, so it can't mess with the
oldpage until it's sure it can migrate it). After that do_swap_page
gets a reference on the swapcache (at that point migrate.c continues
despite the expected count isn't 2 anymore! just a second after having
verified that it was 2). lock_page blocks do_swap_page until migration
is complete but pte_same in do_swap_page won't fail because the pte is
still pointing to the same swapentry (it's just the swapcache inode
radix tree that points to a different page, the swapentry is still the
same as before the migration - is_swap_pte will succeed but
is_migration_entry failed when restoring the pte). Finally the pte is
overwritten with the old page and any data written to the new page in
between is lost.

However it's not exactly the same bug as the one in fork, I was
talking about before, it's also not o_direct specific. Still
page_wrprotect + replace_page is orders of magnitude simpler logic
than migrate.c and it has no bugs or at least it's certainly much
simpler to proof as correct. Furthermore we never 'stall' any userland
task while we do our work. We only mark the pte wrprotected, the task
can cow or takeover it if refcount allows anytime, and later we'll
bailout during replace_page if something has happened in between
page_wrprotect and replace_page. So our logic is simpler and tuned for
max performance and fewer interference with userland runtime. Not
really sure if it worth for us to call into migrate.c.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Izik Eidus

Avi Kivity wrote:

KAMEZAWA Hiroyuki wrote:

Can I make a question ? (I'm working for memory cgroup.)

Now, we do charge to anonymous page when
  - charge(+1) when it's mapped firstly (mapcount 0->1)
  - uncharge(-1) it's fully unmapped (mapcount 1->0) vir 
page_remove_rmap().


My quesion is
 - PageKSM pages are not necessary to be tracked by memory cgroup ?

When we reaplacing page using page_replace() we have:
oldpage - > anonymous page that is going to be replaced by newpage
newpage -> kernel allocated page (KsmPage)
so about oldpage we are calling page_remove_rmap() that will notify cgroup
and about newpage it wont be count inside cgroup beacuse it is file rmap 
page
(we are calling to page_add_file_rmap), so right now PageKSM wont ever 
be tracked by cgroup.


 - Can we know that "the page is just replaced and we don't necessary 
to do

   charge/uncharge".


The caller of page_replace does know it, the only problem is that 
page_remove_rmap()

automaticly change the cgroup for anonymous pages,
if we want it not to change the cgroup, we can:
increase the cgroup count before page_remove (but in that case what 
happen if we reach to the limit???)
give parameter to page_remove_rmap() that we dont want the cgroup to be 
changed.



 - annonymous page from KSM is worth to be tracked by memory cgroup ?
   (IOW, it's on LRU and can be swapped-out ?)


KSM have no anonymous pages (it share anonymous pages into KsmPAGE -> 
kernel allocated page without mapping)
so it isnt in LRU and it cannt be swapped, only when KsmPAGEs will be 
break by do_wp_page() the duplication will be able to swap.


  


My feeling is that shared pages should be accounted as if they were 
not shared; that is, a share page should be accounted for each process 
that shares it.  Perhaps sharing within a cgroup should be counted as 
1 page for all the ptes pointing to it.




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-12 Thread Avi Kivity

KAMEZAWA Hiroyuki wrote:

Can I make a question ? (I'm working for memory cgroup.)

Now, we do charge to anonymous page when
  - charge(+1) when it's mapped firstly (mapcount 0->1)
  - uncharge(-1) it's fully unmapped (mapcount 1->0) vir page_remove_rmap().

My quesion is
 - PageKSM pages are not necessary to be tracked by memory cgroup ?
 - Can we know that "the page is just replaced and we don't necessary to do
   charge/uncharge".
 - annonymous page from KSM is worth to be tracked by memory cgroup ?
   (IOW, it's on LRU and can be swapped-out ?)
  


My feeling is that shared pages should be accounted as if they were not 
shared; that is, a share page should be accounted for each process that 
shares it.  Perhaps sharing within a cgroup should be counted as 1 page 
for all the ptes pointing to it.



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> So are you checking if there's an unresolved reference only in the
> very place I just quoted in the previous email? If answer is yes: what
> should prevent get_user_pages from running in parallel from another
> thread? get_user_pages will trigger a minor fault and get the elevated
> reference just after you read page_count. To you it looks like there
> is no o_direct in progress when you proceed to the core of migration
> code, but in effect o_direct just started a moment after you read the
> page count.

get_user_pages() cannot get to it since the pagetables have already been
modified. If get_user_pages runs then the fault handling will occur
which will block the thread until migration is complete.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 06:27:09PM -0600, Christoph Lameter wrote:
> Then page migration will not occur because there is an unresolved
> reference.

So are you checking if there's an unresolved reference only in the
very place I just quoted in the previous email? If answer is yes: what
should prevent get_user_pages from running in parallel from another
thread? get_user_pages will trigger a minor fault and get the elevated
reference just after you read page_count. To you it looks like there
is no o_direct in progress when you proceed to the core of migration
code, but in effect o_direct just started a moment after you read the
page count.

What can protect you is PG lock or mmap_sem in _write_ mode (and
they've to be hold for the whole duration of the migration). I don't
see any of the two being hold while you read the page count... You
don't seem to be using stop_machine either (stop_machine pretty
expensive on the 4096 way I guess).

This wasn't reproduced in practice but it should be possible to
reproduce it by just writing a testcase with three threads, one forks
in a loop (child just quit) the other memset 0 the first 512bytes of a
page, and then o_direct read from a 0xff 512byte region and checks
that the first 512bytes are all non-zero in a loop, and the third
writes 1 byte to the last 512bytes of the page in a loop. Eventually
the comparison should show zero data in the page.

To reproduce with migration just start the thread that memset 0, reads
a 0xff region with o_direct, and checks it's all 0xff in a loop, and
then migrate the memory of this thread back and forth between two
nodes with the sys_move_pages (mpol is safe by luck because it
surrounds migrate_pages with the mmap_sem in write mode). Eventually
you should see zero bytes despite I/O is complete.

Reproducing this is normal life would take time and for the fork bug
it may not be reproducible depending of what the app is doing. Mixing
sys_move_pages with o_direct in the same process with on two different
threads, instead should eventually eventually reproduce it. And with
gup_fast is now unfixable until more infrastructure is added to
slowdown gup_fast a bit (unless Nick finds an RCU way of doing it).
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread KAMEZAWA Hiroyuki
On Tue, 11 Nov 2008 23:24:21 +0100
Andrea Arcangeli <[EMAIL PROTECTED]> wrote:

> On Tue, Nov 11, 2008 at 03:31:18PM -0600, Christoph Lameter wrote:
> > > ksm need the pte inside the vma to point from anonymous page into 
> > > filebacked
> > > page
> > > can migrate.c do it without changes?
> > 
> > So change anonymous to filebacked page?
> >
> > Currently page migration assumes that the page will continue to be part
> > of the existing file or anon vma.
> > 
> > What you want sounds like assigning a swap pte to an anonymous page? That
> > way a anon page gains membership in a file backed mapping.
> 
> KSM needs to convert anonymous pages to PageKSM, which means a page
> owned by ksm.c and only known by ksm.c. The Linux VM will free this
> page in munmap but that's about it, all we do is to match the number
> of anon-ptes pointing to the page with the page_count. So besides
> freeing the page when the last user exit()s or cows it, the VM will do
> nothing about it. Initially. Later it can swap it in a nonlinear way.
> 
Can I make a question ? (I'm working for memory cgroup.)

Now, we do charge to anonymous page when
  - charge(+1) when it's mapped firstly (mapcount 0->1)
  - uncharge(-1) it's fully unmapped (mapcount 1->0) vir page_remove_rmap().

My quesion is
 - PageKSM pages are not necessary to be tracked by memory cgroup ?
 - Can we know that "the page is just replaced and we don't necessary to do
   charge/uncharge".
 - annonymous page from KSM is worth to be tracked by memory cgroup ?
   (IOW, it's on LRU and can be swapped-out ?)

Thanks,
-Kame


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Wed, 12 Nov 2008, Andrea Arcangeli wrote:

> > O_DIRECT does not take a refcount on the page in order to prevent this?
>
> It definitely does, it's also the only thing it does.

Then page migration will not occur because there is an unresolved
reference.

> The whole point is that O_DIRECT can start the instruction after
> page_count returns as far as I can tell.

But there must still be reference for the bio and whatever may be going on
at the time in order to perform the I/O operation.

> If you check the three emails I linked in answer to Andrew on the
> topic, we agree the o_direct can't start under PT lock (or under
> mmap_sem in write mode but migrate.c rightefully takes the read
> mode). So the fix used in ksm page_wrprotect and in fork() is to check
> page_count vs page_mapcount inside PT lock before doing anything on
> the pte. If you just mark the page wprotect while O_DIRECT is in
> flight, that's enough for fork() to generate data corruption in the
> parent (not the child where the result would be undefined). But in the
> parent the result of the o-direct is defined and it'd never corrupt if
> this was a cached-I/O. The moment the parent pte is marked readonly, a thread
> in the parent could write to the last 512bytes of the page, leading to
> the first 512bytes coming with O_DIRECT from disk being lost (as the
> write will trigger a cow before I/O is complete and the dma will
> complete on the oldpage).

Have you actually seen corruption or this conjecture? AFACT the page
count is elevated while I/O is in progress and thus this is safe.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Wed, Nov 12, 2008 at 12:17:22AM +0100, Andrea Arcangeli wrote:
> We don't have to check the page_count vs mapcount later in
> replace_page because we know if anybody started an O_DIRECT read from
> disk, it would have triggered a cow, and the pte_same check that we
> have to do for other reasons would take care of bailing out the
> replace_page.

Ah, for completeness: above I didn't mention the case of O_DIRECT
writes to disk, because we never need to care about those. They're
never a problem. If the page is replaced and the cpu writes to the
page and by doing so triggers a cow that lead to the CPU write to go
in a different page (not the one under dma) it'll be like if the write
to disk completed before the cpu overwritten the page, so result is
undefined. I don't think we've to define the case of somebody doing a
direct read from a location where there's still an o_direct write in
flight either.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 04:30:22PM -0600, Christoph Lameter wrote:
> On Tue, 11 Nov 2008, Andrea Arcangeli wrote:
> 
> > this page_count check done with only the tree_lock won't prevent a
> > task to start O_DIRECT after page_count has been read in the above line.
> >
> > If a thread starts O_DIRECT on the page, and the o_direct is still in
> > flight by the time you copy the page to the new page, the read will
> > not be represented fully in the newpage leading to userland data
> > corruption.
> 
> O_DIRECT does not take a refcount on the page in order to prevent this?

It definitely does, it's also the only thing it does.

The whole point is that O_DIRECT can start the instruction after
page_count returns as far as I can tell.

If you check the three emails I linked in answer to Andrew on the
topic, we agree the o_direct can't start under PT lock (or under
mmap_sem in write mode but migrate.c rightefully takes the read
mode). So the fix used in ksm page_wrprotect and in fork() is to check
page_count vs page_mapcount inside PT lock before doing anything on
the pte. If you just mark the page wprotect while O_DIRECT is in
flight, that's enough for fork() to generate data corruption in the
parent (not the child where the result would be undefined). But in the
parent the result of the o-direct is defined and it'd never corrupt if
this was a cached-I/O. The moment the parent pte is marked readonly, a thread
in the parent could write to the last 512bytes of the page, leading to
the first 512bytes coming with O_DIRECT from disk being lost (as the
write will trigger a cow before I/O is complete and the dma will
complete on the oldpage).

We do the check in page_wprotect because that's the _first_ place
where we mark a pte readonly. Same as fork. And we can't convert a pte
from writeable to wrprotected without doing this check inside PT lock
or we generate data corruption with o_direct (again same as the bug in
fork).

We don't have to check the page_count vs mapcount later in
replace_page because we know if anybody started an O_DIRECT read from
disk, it would have triggered a cow, and the pte_same check that we
have to do for other reasons would take care of bailing out the
replace_page.

> Oh they could be migrated if you had a callback to the devices method for
> giving up references. Same as slab defrag.

The oldpage is a regular anonymous page for us, not the 'strange'
object called PageKSM. And we need to migrate many anonymous pages
having destination a single PageKSM page already preallocated and not
being an anonymous page. We never migrate PageKSM to anything, that's
always the destination.

> Seems that we are tinkering around with the concept of what an anonymous
> page is? Doesnt shmem have some means of converting pages to file backed?
> Swizzling?

Anonymous page is anything with page->mapping pointing to an anon_vma
or swapcache_space instead of an address_space of a real inode backed
by the vfs.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Avi Kivity

Christoph Lameter wrote:

On Tue, 11 Nov 2008, Avi Kivity wrote:

  

Christoph Lameter wrote:


page migration requires the page to be on the LRU. That could be changed
if you have a different means of isolating a page from its page tables.

  

Isn't rmap the means of isolating a page from its page tables?  I guess I'm
misunderstanding something.



In order to migrate a page one first has to make sure that userspace and
the kernel cannot access the page in any way. User space must be made to
generate page faults and all kernel references must be accounted for and
not be in use.

The user space portion involves changing the page tables so that faults
are generated.

The kernel portion isolates the page from the LRU (to exempt it from
kernel reclaim handling etc).

  


Thanks.


Only then can the page and its metadata be copied to a new location.

Guess you already have the LRU portion done. So you just need the user
space isolation portion?
  


We don't want to limit all faults, just writes.  So we write protect the 
page before merging.


What do you mean by page metadata?  obviously the dirty bit (Izik?), 
accessed bit and position in the LRU are desirable (the last is quite 
difficult for ksm -- the page occupied *two* positions in the LRU).


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

> this page_count check done with only the tree_lock won't prevent a
> task to start O_DIRECT after page_count has been read in the above line.
>
> If a thread starts O_DIRECT on the page, and the o_direct is still in
> flight by the time you copy the page to the new page, the read will
> not be represented fully in the newpage leading to userland data
> corruption.

O_DIRECT does not take a refcount on the page in order to prevent this?

> > Define a regular VM page? A page on the LRU?
>
> Yes, pages owned, allocated and worked on by the VM. So they can be
> swapped, collected, migrated etc... You can't possibly migrate a
> device driver page for example and infact those device driver pages
> can't be migrated either.

Oh they could be migrated if you had a callback to the devices method for
giving up references. Same as slab defrag.

> The KSM page initially is a driver page, later we'd like to teach the
> VM how to swap it by introducing rmap methods and adding it to the
> LRU. As long as it's only anonymous memory that we're sharing/cloning,
> we won't have to patch pagecache radix tree and other stuff. BTW, if
> we ever decice to clone pagecache we could generate immense metadata
> ram overhead in the radix tree with just a single page of data. All
> issues that don't exist for anon ram.

Seems that we are tinkering around with the concept of what an anonymous
page is? Doesnt shmem have some means of converting pages to file backed?
Swizzling?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 03:31:18PM -0600, Christoph Lameter wrote:
> > ksm need the pte inside the vma to point from anonymous page into filebacked
> > page
> > can migrate.c do it without changes?
> 
> So change anonymous to filebacked page?
>
> Currently page migration assumes that the page will continue to be part
> of the existing file or anon vma.
> 
> What you want sounds like assigning a swap pte to an anonymous page? That
> way a anon page gains membership in a file backed mapping.

KSM needs to convert anonymous pages to PageKSM, which means a page
owned by ksm.c and only known by ksm.c. The Linux VM will free this
page in munmap but that's about it, all we do is to match the number
of anon-ptes pointing to the page with the page_count. So besides
freeing the page when the last user exit()s or cows it, the VM will do
nothing about it. Initially. Later it can swap it in a nonlinear way.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 03:26:57PM -0600, Christoph Lameter wrote:
> On Tue, 11 Nov 2008, Andrea Arcangeli wrote:
> 
> > btw, page_migration likely is buggy w.r.t. o_direct too (and now
> > unfixable with gup_fast until the 2.4 brlock is added around it or
> > similar) if it does the same thing but without any page_mapcount vs
> > page_count check.
> 
> Details please?

  spin_lock_irq(&mapping->tree_lock);

  pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));

expected_count = 2 + !!PagePrivate(page);
if (page_count(page) != expected_count ||

this page_count check done with only the tree_lock won't prevent a
task to start O_DIRECT after page_count has been read in the above line.

If a thread starts O_DIRECT on the page, and the o_direct is still in
flight by the time you copy the page to the new page, the read will
not be represented fully in the newpage leading to userland data
corruption.

> > page_migration does too much for us, so us calling into migrate.c may
> > not be ideal. It has to convert a fresh page to a VM page. In KSM we
> > don't convert the newpage to be a VM page, we just replace the anon
> > page with another page. The new page in the KSM case is not a page
> > known by the VM, not in the lru etc...
> 
> A VM page as opposed to pages not in the VM? ???

Yes, you migrate old VM pages to new VM pages. We replace VM pages
with unknown stuff called KSM pages. So in the inner function where you
replace the pte-migration-placeholder with a pte pointing to the
newpage, you also rightfully do unconditional stuff we can't be doing
like page_add_*_rmap. It's VM pages you're dealing with. Not for us.

> page migration requires the page to be on the LRU. That could be changed
> if you have a different means of isolating a page from its page tables.

Yes we'd need to change those bits to be able to use migrate.c.

> Define a regular VM page? A page on the LRU?

Yes, pages owned, allocated and worked on by the VM. So they can be
swapped, collected, migrated etc... You can't possibly migrate a
device driver page for example and infact those device driver pages
can't be migrated either.

The KSM page initially is a driver page, later we'd like to teach the
VM how to swap it by introducing rmap methods and adding it to the
LRU. As long as it's only anonymous memory that we're sharing/cloning,
we won't have to patch pagecache radix tree and other stuff. BTW, if
we ever decice to clone pagecache we could generate immense metadata
ram overhead in the radix tree with just a single page of data. All
issues that don't exist for anon ram.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Izik Eidus

Christoph Lameter wrote:

On Tue, 11 Nov 2008, Avi Kivity wrote:

  

Christoph Lameter wrote:


page migration requires the page to be on the LRU. That could be changed
if you have a different means of isolating a page from its page tables.

  

Isn't rmap the means of isolating a page from its page tables?  I guess I'm
misunderstanding something.



In order to migrate a page one first has to make sure that userspace and
the kernel cannot access the page in any way. User space must be made to
generate page faults and all kernel references must be accounted for and
not be in use.
  

This is really not the case for ksm,
in ksm we allow the page to be accessed all the time, we dont have to 
swap the page

like migrate.c is doing...

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Avi Kivity wrote:

> Christoph Lameter wrote:
> > page migration requires the page to be on the LRU. That could be changed
> > if you have a different means of isolating a page from its page tables.
> >
>
> Isn't rmap the means of isolating a page from its page tables?  I guess I'm
> misunderstanding something.

In order to migrate a page one first has to make sure that userspace and
the kernel cannot access the page in any way. User space must be made to
generate page faults and all kernel references must be accounted for and
not be in use.

The user space portion involves changing the page tables so that faults
are generated.

The kernel portion isolates the page from the LRU (to exempt it from
kernel reclaim handling etc).

Only then can the page and its metadata be copied to a new location.

Guess you already have the LRU portion done. So you just need the user
space isolation portion?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Izik Eidus wrote:

> > What do you mean by kernel page? The kernel can allocate a page and then
> > point a user space pte to it. That is how page migration works.
> >
> i mean filebacked page (!AnonPage())

ok.

> ksm need the pte inside the vma to point from anonymous page into filebacked
> page
> can migrate.c do it without changes?

So change anonymous to filebacked page?

Currently page migration assumes that the page will continue to be part
of the existing file or anon vma.

What you want sounds like assigning a swap pte to an anonymous page? That
way a anon page gains membership in a file backed mapping.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Avi Kivity

Christoph Lameter wrote:

page migration requires the page to be on the LRU. That could be changed
if you have a different means of isolating a page from its page tables.
  


Isn't rmap the means of isolating a page from its page tables?  I guess 
I'm misunderstanding something.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Izik Eidus

Christoph Lameter wrote:



Currently page migration assumes that the page will continue to be part
of the existing file or anon vma.
  


exactly, and ksm really need it to get out of the existing anon vma!


What you want sounds like assigning a swap pte to an anonymous page? That
way a anon page gains membership in a file backed mapping.


  


No, i want pte that is found inside vma and point to anonymous page, 
will stop point into the anonymous page
and will point to a whole diffrent page that i chose (for ksm it is 
needed because this way we are mapping alot

of ptes into the same write_protected page and save memory)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Andrea Arcangeli wrote:

> btw, page_migration likely is buggy w.r.t. o_direct too (and now
> unfixable with gup_fast until the 2.4 brlock is added around it or
> similar) if it does the same thing but without any page_mapcount vs
> page_count check.

Details please?

> page_migration does too much for us, so us calling into migrate.c may
> not be ideal. It has to convert a fresh page to a VM page. In KSM we
> don't convert the newpage to be a VM page, we just replace the anon
> page with another page. The new page in the KSM case is not a page
> known by the VM, not in the lru etc...

A VM page as opposed to pages not in the VM? ???

page migration requires the page to be on the LRU. That could be changed
if you have a different means of isolating a page from its page tables.

> The way to go could be to change the page_migration to use
> replace_page (or __replace_page if called in some shared inner-lock
> context) after preparing the newpage to be a regular VM page. If we
> can do that, migrate.c will get the o_direct race fixed too for free.

Define a regular VM page? A page on the LRU?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 03:21:45PM -0600, Christoph Lameter wrote:
> What do you mean by kernel page? The kernel can allocate a page and then
> point a user space pte to it. That is how page migration works.

Just to make an example, remove_migration_pte adds the page back to
rmap layer. We can't do that right now as rmap for the ksm pages will
be built inside ksm, or alternatively rmap.c will have to learn to
handle nonlinear anon-vma.

Migration simply migrates the page. The new page is identical to the
original one, just backed by different physical memory.

For us the new page is an entirely different beast that we build
ourself (we can't let migrate.c to pretend dealing with the newpage
like if it resembled the old page like it's doing now).

We replace a linear anon page with something that isn't an anonymous
page at all right now (in the future it may become a nonlinear anon
page if VM learns about it, or still an unknown page
external-rmappable if we go the external-rmap way).

There's clearly something to share, but the replace_page seem to be
the one that could be called from migrate.c. What is different is that
we don't need the migration pte placeholder, we never block releasing
locks, all atomic with pte wrprotected, and a final pte_same check
under PT lock before we replace the page. There isn't a whole lot to
share after all, but surely it'd be nice to share if we can. Us
calling into migrate.c isn't feasible right now without some
significant change to migrate.c where it would be misplaced IMHO as to
share we'd need migrate.c to call into VM core instead.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Izik Eidus

Christoph Lameter wrote:

page migration as far as i saw cant migrate anonymous page into kernel page.
if you want we can change page_migration to do that, but i thought you will
rather have ksm changes separate.



What do you mean by kernel page? The kernel can allocate a page and then
point a user space pte to it. That is how page migration works.
  

i mean filebacked page (!AnonPage())
ksm need the pte inside the vma to point from anonymous page into 
filebacked page

can migrate.c do it without changes?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Christoph Lameter
On Tue, 11 Nov 2008, Izik Eidus wrote:

> yes but it replace it with kernel allocated page.
> > page migration already kinda does that.  Is there common ground?
> >
> >
> page migration as far as i saw cant migrate anonymous page into kernel page.
> if you want we can change page_migration to do that, but i thought you will
> rather have ksm changes separate.

What do you mean by kernel page? The kernel can allocate a page and then
point a user space pte to it. That is how page migration works.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrea Arcangeli
On Tue, Nov 11, 2008 at 11:45:55AM -0800, Andrew Morton wrote:
> page migration already kinda does that.  Is there common ground?

btw, page_migration likely is buggy w.r.t. o_direct too (and now
unfixable with gup_fast until the 2.4 brlock is added around it or
similar) if it does the same thing but without any page_mapcount vs
page_count check.

page_migration does too much for us, so us calling into migrate.c may
not be ideal. It has to convert a fresh page to a VM page. In KSM we
don't convert the newpage to be a VM page, we just replace the anon
page with another page. The new page in the KSM case is not a page
known by the VM, not in the lru etc...

The way to go could be to change the page_migration to use
replace_page (or __replace_page if called in some shared inner-lock
context) after preparing the newpage to be a regular VM page. If we
can do that, migrate.c will get the o_direct race fixed too for free.

> I don't understand the restrictions on anonymous pages.  Please expand
> the changelog so that reviewers can understand the reasons for this
> restriction.

That's a good question but I don't have a definitive answer as I
didn't account for exactly how complex it would be yet.

Suppose a file has 0-4k equal to 4k-8k. A MAP_SHARED that maps both
pages with the same physical page sounds tricky. The shared pages are
KSM owned and invisible to the VM (later could be made visible with an
external-rmap), but pagecache can't be just KSM owned, they at least
must go in pagecache radix tree so that requires patching the radix
tree etc... All things we don't need for anon ram. I think first thing
to extend is to add external-rmap and make ksm swappable, then we can
think if something can be done about MAP_SHARED/MAP_PRIVATE too
(MAP_PRIVATE post-COW already works, the question is pre-COW). One
excuse of why I didn't think too much about it yet is because in
effect KSM it's mostly useful to the anon ram, the pagecache can be
solved in userland with hardlinks with openvz, and shared libs already
do all they can to share .text (the not-shared post dynamic link
should be covered by ksm already).

> Again, we could make the presence of this code selectable by subsystems
> which want it.

Indeed!!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Izik Eidus

Andrew Morton wrote:

On Tue, 11 Nov 2008 15:21:39 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

  

From: Izik Eidus <[EMAIL PROTECTED]>

this function is needed in cases you want to change the userspace
virtual mapping into diffrent physical page,



Not sure that I understand that description.  We want to replace a live
page in an anonymous VMA with a different one?

It looks that way.
  

yes but it replace it with kernel allocated page.

page migration already kinda does that.  Is there common ground?

  

page migration as far as i saw cant migrate anonymous page into kernel page.
if you want we can change page_migration to do that, but i thought you 
will rather have ksm changes separate.



KSM need this for merging the identical pages.

this function is working by removing the oldpage from the rmap and
calling put_page on it, and by setting the virtual address pte
to point into the new page.
(note that the new page (the page that we change the pte to map to)
cannot be anonymous page)




I don't understand the restrictions on anonymous pages.  Please expand
the changelog so that reviewers can understand the reasons for this
restriction.
  
the page that we are going to map into the pte going to be nonlinear 
from the point of view of anon-vma

therefore it cannot be anonymous.



  

---
 include/linux/mm.h |3 ++
 mm/memory.c|   68 
 2 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffee2f7..4da7fa8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned 
long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
 
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,

+struct page *newpage, pte_t orig_pte, pgprot_t prot);
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
 #define FOLL_WRITE 0x01/* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 164951c..b2c542c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned 
long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+/**

+ * replace _page - replace the pte mapping related to vm area between two pages



s/replace _page/replace_page/

  

+ * (from oldpage to newpage)
+ * NOTE: you should take into consideration the impact on the VM when replacing
+ * anonymous pages with kernel non swappable pages.
+ */



This _is_ a kerneldoc comment, but kernedoc comments conventionally
document the arguments and the return value also.

  

+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep;
+   spinlock_t *ptl;
+   unsigned long addr;
+   int ret;
+
+   BUG_ON(PageAnon(newpage));
+
+   ret = -EFAULT;
+   addr = page_address_in_vma(oldpage, vma);
+   if (addr == -EFAULT)
+   goto out;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   goto out;
+
+   ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+   if (!ptep)
+   goto out;
+
+   if (!pte_same(*ptep, orig_pte)) {
+   pte_unmap_unlock(ptep, ptl);
+   goto out;
+   }
+
+   ret = 0;
+   get_page(newpage);
+   page_add_file_rmap(newpage);
+
+   flush_cache_page(vma, addr, pte_pfn(*ptep));
+   ptep_clear_flush(vma, addr, ptep);
+   set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
+
+   page_remove_rmap(oldpage, vma);
+   if (PageAnon(oldpage)) {
+   dec_mm_counter(mm, anon_rss);
+   inc_mm_counter(mm, file_rss);
+   }
+   put_page(oldpage);
+
+   pte_unmap_unlock(ptep, ptl);
+
+out:
+   return ret;
+}
+EXPORT_SYMBOL(replace_page);



Again, we could make the presence of this code selectable by subsystems
which want it.

]


sure.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] Add replace_page(), change the mapping of pte from one page into another

2008-11-11 Thread Andrew Morton
On Tue, 11 Nov 2008 15:21:39 +0200
Izik Eidus <[EMAIL PROTECTED]> wrote:

> From: Izik Eidus <[EMAIL PROTECTED]>
> 
> this function is needed in cases you want to change the userspace
> virtual mapping into diffrent physical page,

Not sure that I understand that description.  We want to replace a live
page in an anonymous VMA with a different one?

It looks that way.

page migration already kinda does that.  Is there common ground?

> KSM need this for merging the identical pages.
> 
> this function is working by removing the oldpage from the rmap and
> calling put_page on it, and by setting the virtual address pte
> to point into the new page.
> (note that the new page (the page that we change the pte to map to)
> cannot be anonymous page)
> 

I don't understand the restrictions on anonymous pages.  Please expand
the changelog so that reviewers can understand the reasons for this
restriction.


> ---
>  include/linux/mm.h |3 ++
>  mm/memory.c|   68 
> 
>  2 files changed, 71 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ffee2f7..4da7fa8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1207,6 +1207,9 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned 
> long addr,
>  int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
>   unsigned long pfn);
>  
> +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
> +  struct page *newpage, pte_t orig_pte, pgprot_t prot);
> +
>  struct page *follow_page(struct vm_area_struct *, unsigned long address,
>   unsigned int foll_flags);
>  #define FOLL_WRITE   0x01/* check pte is writable */
> diff --git a/mm/memory.c b/mm/memory.c
> index 164951c..b2c542c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1472,6 +1472,74 @@ int vm_insert_mixed(struct vm_area_struct *vma, 
> unsigned long addr,
>  }
>  EXPORT_SYMBOL(vm_insert_mixed);
>  
> +/**
> + * replace _page - replace the pte mapping related to vm area between two 
> pages

s/replace _page/replace_page/

> + * (from oldpage to newpage)
> + * NOTE: you should take into consideration the impact on the VM when 
> replacing
> + * anonymous pages with kernel non swappable pages.
> + */

This _is_ a kerneldoc comment, but kernedoc comments conventionally
document the arguments and the return value also.

> +int replace_page(struct vm_area_struct *vma, struct page *oldpage,
> +  struct page *newpage, pte_t orig_pte, pgprot_t prot)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *ptep;
> + spinlock_t *ptl;
> + unsigned long addr;
> + int ret;
> +
> + BUG_ON(PageAnon(newpage));
> +
> + ret = -EFAULT;
> + addr = page_address_in_vma(oldpage, vma);
> + if (addr == -EFAULT)
> + goto out;
> +
> + pgd = pgd_offset(mm, addr);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, addr);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, addr);
> + if (!pmd_present(*pmd))
> + goto out;
> +
> + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + if (!ptep)
> + goto out;
> +
> + if (!pte_same(*ptep, orig_pte)) {
> + pte_unmap_unlock(ptep, ptl);
> + goto out;
> + }
> +
> + ret = 0;
> + get_page(newpage);
> + page_add_file_rmap(newpage);
> +
> + flush_cache_page(vma, addr, pte_pfn(*ptep));
> + ptep_clear_flush(vma, addr, ptep);
> + set_pte_at(mm, addr, ptep, mk_pte(newpage, prot));
> +
> + page_remove_rmap(oldpage, vma);
> + if (PageAnon(oldpage)) {
> + dec_mm_counter(mm, anon_rss);
> + inc_mm_counter(mm, file_rss);
> + }
> + put_page(oldpage);
> +
> + pte_unmap_unlock(ptep, ptl);
> +
> +out:
> + return ret;
> +}
> +EXPORT_SYMBOL(replace_page);

Again, we could make the presence of this code selectable by subsystems
which want it.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html