Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-10 Thread IWAMOTO Toshihiro
At Tue, 8 Feb 2005 16:26:26 +0000 (GMT),
Hugh Dickins wrote:
> 
> On Mon, 7 Feb 2005, Hugh Dickins wrote:
> > On Thu, 3 Feb 2005, IWAMOTO Toshihiro wrote:
> > > The current implementation of memory hotremoval relies on the fact
> > > that pages can be unmapped from process spaces.  After successful
> > > unmapping, subsequent accesses to the pages are blocked and don't
> > > interfere with the hotremoval operation.
> > > 
> > > However, this code
> > > 
> > > if (PageSwapCache(page) &&
> > > page_count(page) != page_mapcount(page) + 2) {
> > > ret = SWAP_FAIL;
> > > goto out_unmap;
> > > }
> > 
> > Yes, that is odd code.  It would be nice to have a solution without it.
> > 
> > > in try_to_unmap_one() prevents unmapping pages that are referenced via
> > > get_user_pages(), and such references can be held for a long time if
> > > they are due to operations such as direct IO.
> > > I've made a test program that issues multiple direct IO read requests
> > > against a single read buffer, and pages that belong to the buffer
> > > cannot be hotremoved because they aren't unmapped.
> > 
> > I was hoping to append my own patch to this response, but I haven't
> > got it working right yet (swap seems too full).  I hope tomorrow,
> > but thought I'd better not delay these comments any longer.
> 
> Seems it was okay after all, I got confused by an unrelated issue.
> Here's what I had in mind, what do you think?  Does it indeed help
> with your problem?  I'm copying Andrea because it was he who devised
> that fix to the get_user_pages issue, and also (I think, longer ago)
> can_share_swap_page itself.

I've tested with linux-2.6.10-rc2-mm3 + my hotremoval patch, and it
passed hotremoval tests including the direct IO related one.

Thanks.

> --- 2.6.11-rc3-mm1/mm/memory.c	2005-02-05 07:09:40.000000000 +0000
> +++ linux/mm/memory.c	2005-02-07 19:50:47.000000000 +0000

--
IWAMOTO Toshihiro
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Changing COW detection to be memory hotplug friendly

2005-02-09 Thread IWAMOTO Toshihiro
At Mon, 7 Feb 2005 21:24:59 +0000 (GMT),
Hugh Dickins wrote:
> 
> On Thu, 3 Feb 2005, IWAMOTO Toshihiro wrote:
> > The current implementation of memory hotremoval relies on the fact
> > that pages can be unmapped from process spaces.  After successful
> > unmapping, subsequent accesses to the pages are blocked and don't
> > interfere with the hotremoval operation.
> > 
> > However, this code
> > 
> > if (PageSwapCache(page) &&
> > page_count(page) != page_mapcount(page) + 2) {
> > ret = SWAP_FAIL;
> > goto out_unmap;
> > }
> 
> Yes, that is odd code.  It would be nice to have a solution without it.
> 
> > in try_to_unmap_one() prevents unmapping pages that are referenced via
> > get_user_pages(), and such references can be held for a long time if
> > they are due to operations such as direct IO.
> > I've made a test program that issues multiple direct IO read requests
> > against a single read buffer, and pages that belong to the buffer
> > cannot be hotremoved because they aren't unmapped.
> 
> I haven't looked at the rest of your hotremoval, so it's not obvious
> to me how a change here would help you - obviously you wouldn't want
> to be migrating pages while direct IO to them was in progress.
> 
> I presume your patch works for you by letting the page count fall
> to a point where migration moves it automatically as soon as the
> got_user_pages are put, where without your patch the count is held
> too high, and you keep doing scans which tend to miss the window
> in which those pages are put?

Yes.  And my test program has no such time window because IO requests
are issued without waiting for completion of older requests.
I think issuing IO requests in such a manner is nonsense, but
a misbehaving process shouldn't be able to prevent memory hotremoval.

If the hotremoval code can unmap a page from a process space, the
process will be blocked when it causes a page fault against the page.
So, a process cannot continuously issue direct IO requests to keep
page counts high.  (get_user_pages() causes a (simulated) page fault.)


> >   - The can_share_swap_page() call in do_swap_page() always returns
> > false.  It is inefficient but should be harmless.  Incrementing
> > page_mapcount before calling that function should fix the problem,
> > but it may cause bad side effects.
> 
> Odd that your patch moves it if it now doesn't even work!
> But I think some more movement should be able to solve that.

Moving swap_free() is necessary to avoid can_share_swap_page()
returning true when it shouldn't.  And write access cases are handled
later by the do_wp_page() call, anyway.

> >   - Another obvious solution to this issue is to find the "offending"
> > process from an un-unmappable page and suspend it until the page is
> > unmapped.  I'm afraid the implementation would be much more complicated.
> 
> Agreed, let's not get into that.

Nice to hear that.

> > --- mm/memory.c.orig	2005-01-17 14:47:11.000000000 +0900
> > +++ mm/memory.c	2005-01-17 14:55:51.000000000 +0900
> > @@ -1786,10 +1786,6 @@ static int do_swap_page(struct mm_struct
> > }
> >  
> > /* The page isn't present yet, go ahead with the fault. */
> > -   
> > -   swap_free(entry);
> > -   if (vm_swap_full())
> > -   remove_exclusive_swap_page(page);
> >  
> > mm->rss++;
> > acct_update_integrals();
> > @@ -1800,6 +1796,10 @@ static int do_swap_page(struct mm_struct
> > pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > write_access = 0;
> > }
> > +   
> > +   swap_free(entry);
> > +   if (vm_swap_full())
> > +   remove_exclusive_swap_page(page);
> > unlock_page(page);
> >  
> > flush_icache_page(vma, page);
> > --- mm/rmap.c.orig	2005-01-17 14:40:08.000000000 +0900
> > +++ mm/rmap.c	2005-01-21 12:34:06.000000000 +0900
> > @@ -569,8 +569,11 @@ static int try_to_unmap_one(struct page 
> >  */
> > if (PageSwapCache(page) &&
> > page_count(page) != page_mapcount(page) + 2) {
> > -   ret = SWAP_FAIL;
> > -   goto out_unmap;
> > +   if (page_mapcount(page) > 1) {  /* happens when COW is in progress */
> > +   ret = SWAP_FAIL;
> > +   goto out_unmap;
> > +   }
> > +   printk("unmapping GUPed page\n");
> > }
> >  
> > /* Nuke the page table entry. */
> 
> I'

[RFC] Changing COW detection to be memory hotplug friendly

2005-02-02 Thread IWAMOTO Toshihiro
The current implementation of memory hotremoval relies on the fact that
pages can be unmapped from process spaces.  After successful unmapping,
subsequent accesses to the pages are blocked and don't interfere with
the hotremoval operation.

However, this code

if (PageSwapCache(page) &&
page_count(page) != page_mapcount(page) + 2) {
ret = SWAP_FAIL;
goto out_unmap;
}

in try_to_unmap_one() prevents unmapping pages that are referenced via
get_user_pages(), and such references can be held for a long time if
they are due to operations such as direct IO.
I've made a test program that issues multiple direct IO read requests
against a single read buffer, and pages that belong to the buffer
cannot be hotremoved because they aren't unmapped.

The following patch, which is against linux-2.6.11-rc1-mm1 and also
tested with linux-2.6.11-rc2-mm2, fixes this issue.  The purpose of
this patch is to be able to unmap pages that have incremented
page_count.  To do that consistently, the COW detection logic needs to
be modified not to rely on page_count.  I'm aware that such
extensive use of page_mapcount is discouraged and there is a plan to
kill page_mapcount (*), but I cannot think of a better alternative
solution.

(*) c.f. http://www.ussg.iu.edu/hypermail/linux/kernel/0406.0/0483.html

Some notes about my code:

  - I think it's safe to rely on page_mapcount in do_swap_page(),
because its use is protected by lock_page().
  - The can_share_swap_page() call in do_swap_page() always returns
false.  It is inefficient but should be harmless.  Incrementing
page_mapcount before calling that function should fix the problem,
but it may cause bad side effects.
  - Another obvious solution to this issue is to find the "offending"
    process from an un-unmappable page and suspend it until the page is
unmapped.  I'm afraid the implementation would be much more complicated.
  - I could not test the following situation.  It should be possible
to write some kernel code to do that, but please let me know if
you know any such test cases.
- A page_count is incremented by get_user_pages().
- The page gets unmapped.
- The process causes a write fault for the page, before the
  incremented page_count is dropped.

Also, while I've tried carefully not to make mistakes and done some
testing, I'm not very sure this is bug free.  Please comment.

--- mm/memory.c.orig	2005-01-17 14:47:11.000000000 +0900
+++ mm/memory.c	2005-01-17 14:55:51.000000000 +0900
@@ -1786,10 +1786,6 @@ static int do_swap_page(struct mm_struct
}
 
/* The page isn't present yet, go ahead with the fault. */
-   
-   swap_free(entry);
-   if (vm_swap_full())
-   remove_exclusive_swap_page(page);
 
mm->rss++;
acct_update_integrals();
@@ -1800,6 +1796,10 @@ static int do_swap_page(struct mm_struct
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
write_access = 0;
}
+   
+   swap_free(entry);
+   if (vm_swap_full())
+   remove_exclusive_swap_page(page);
unlock_page(page);
 
flush_icache_page(vma, page);
--- mm/rmap.c.orig	2005-01-17 14:40:08.000000000 +0900
+++ mm/rmap.c	2005-01-21 12:34:06.000000000 +0900
@@ -569,8 +569,11 @@ static int try_to_unmap_one(struct page 
 */
if (PageSwapCache(page) &&
page_count(page) != page_mapcount(page) + 2) {
-   ret = SWAP_FAIL;
-   goto out_unmap;
+   if (page_mapcount(page) > 1) {  /* happens when COW is in progress */
+   ret = SWAP_FAIL;
+   goto out_unmap;
+   }
+   printk("unmapping GUPed page\n");
}
 
/* Nuke the page table entry. */
--- mm/swapfile.c.orig	2005-01-17 14:47:11.000000000 +0900
+++ mm/swapfile.c	2005-01-31 16:59:11.000000000 +0900
@@ -293,7 +293,7 @@ static int exclusive_swap_page(struct pa
if (p->swap_map[swp_offset(entry)] == 1) {
/* Recheck the page count with the swapcache lock held.. */
write_lock_irq(&swapper_space.tree_lock);
-   if (page_count(page) == 2)
+   if (page_mapcount(page) == 1)
retval = 1;
write_unlock_irq(&swapper_space.tree_lock);
}
@@ -316,22 +316,17 @@ int can_share_swap_page(struct page *pag
 
if (!PageLocked(page))
BUG();
-   switch (page_count(page)) {
-   case 3:
-   if (!PagePrivate(page))
-   break;
-   /* Fallthrough */
-   case 2:
-   if (!PageSwapCache(page))
-   break;
-   retval = exclusive_swap_page(page);
-   break;
-   case 1:
-   if (PageReserved(page))
-   break;
-