Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Rik van Riel
On Thu, 31 Jan 2008 12:32:24 +0100
Andi Kleen <[EMAIL PROTECTED]> wrote:
> On Thu, Jan 31, 2008 at 05:52:09AM -0500, Rik van Riel wrote:

> > Don't malloc() and free() hopelessly fragment memory
> > over time, ensuring that little related data can be
> > found inside each 1MB chunk if the process is large
> > enough?  (say, firefox)
> 
> Even if they do (I don't know if it's true or not) it does not really 
> matter because on modern hard disks/systems it does not cost less to 
> transfer 1MB versus 4K. The actual threshold seems to be rising in
> fact.

That is definately true.

> The only drawback is that the swap might be full sooner, but 
> I would actually consider this a feature because it would likely
> end many prolonged oom death dances much sooner.

A second drawback would be that we evict more potentially
useful data every time we swap in a whole lot of extra
data around the little bit of data we need.

On the other hand, swapping should be the exception on
many of today's workloads.

Maybe we can measure how many of the swapped in pages end
up being used and how many are evicted again without being
used and automatically change our chunk size based on those
statistics?

I would expect most desktop systems to end up with large
chunks, because they rarely swap.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 05:52:09AM -0500, Rik van Riel wrote:
> On Thu, 31 Jan 2008 12:06:10 +0100
> Andi Kleen <[EMAIL PROTECTED]> wrote:
> 
> > > Yeah, the 2.5 switch to physical scanning killed us there.
> > > 
> > > I still don't know why my
> > > allocate-swapspace-according-to-virtual-address change didn't
> > > help.  Much.  Marcelo played with that a bit too.
> > 
> > I've been thinking about just always doing swap on > page clusters. 
> > Any reason swapping couldn't be done on e.g. 1MB chunks? 
> 
> Don't malloc() and free() hopelessly fragment memory
> over time, ensuring that little related data can be
> found inside each 1MB chunk if the process is large
> enough?  (say, firefox)

Even if they do (I don't know if it's true or not) it does not really 
matter because on modern hard disks/systems it does not cost less to 
transfer 1MB versus 4K. The actual threshold seems to be rising in fact.

The only drawback is that the swap might be full sooner, but 
I would actually consider this a feature because it would likely
end many prolonged oom death dances much sooner.

-Andi 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Rik van Riel
On Thu, 31 Jan 2008 12:06:10 +0100
Andi Kleen <[EMAIL PROTECTED]> wrote:

> > Yeah, the 2.5 switch to physical scanning killed us there.
> > 
> > I still don't know why my
> > allocate-swapspace-according-to-virtual-address change didn't
> > help.  Much.  Marcelo played with that a bit too.
> 
> I've been thinking about just always doing swap on > page clusters. 
> Any reason swapping couldn't be done on e.g. 1MB chunks? 

Don't malloc() and free() hopelessly fragment memory
over time, ensuring that little related data can be
found inside each 1MB chunk if the process is large
enough?  (say, firefox)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Andi Kleen
> Yeah, the 2.5 switch to physical scanning killed us there.
> 
> I still don't know why my allocate-swapspace-according-to-virtual-address
> change didn't help.  Much.  Marcelo played with that a bit too.

I've been thinking about just always doing swap on > page clusters. 
Any reason swapping couldn't be done on e.g. 1MB chunks? 

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Andrew Morton
On Thu, 31 Jan 2008 11:15:08 +0100 Andi Kleen <[EMAIL PROTECTED]> wrote:

> Peter Zijlstra <[EMAIL PROTECTED]> writes:
> >
> > Ah, that is Lennarts Pulse Audio thing, he has samples in memory which
> > might not have been used for a while, and he wants to be able to
> > pre-fetch those when he suspects they might need to be played. So that
> > once the audio thread comes along and stuffs them down /dev/dsp its all
> > nice in memory.
> 
> The real problem that seems to make swapping so slow is that the data
> tends to be badly fragmented on the swap partition. I suspect if that
> problem was attached the need for such prefetching would be far less
> because swap in would be much faster.
> 

Yeah, the 2.5 switch to physical scanning killed us there.

I still don't know why my allocate-swapspace-according-to-virtual-address
change didn't help.  Much.  Marcelo played with that a bit too.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Andrew Morton
On Thu, 31 Jan 2008 11:10:13 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> 
> On Thu, 2008-01-31 at 02:05 -0800, Andrew Morton wrote:
> > On Thu, 31 Jan 2008 10:53:26 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > 
> > > 
> > > On Thu, 2008-01-31 at 01:47 -0800, Andrew Morton wrote:
> > > > On Thu, 31 Jan 2008 10:35:18 +0100 Peter Zijlstra <[EMAIL PROTECTED]> 
> > > > wrote:
> > > > 
> > > > > 
> > > > > On Thu, 2008-01-31 at 01:12 -0800, Andrew Morton wrote:
> > > > > 
> > > > > > Implementation-wise: make_pages_present() _can_ be converted to do 
> > > > > > this. 
> > > > > > But it's a lot of patching, and the result will be a cleaner, 
> > > > > > faster and
> > > > > > smaller core MM.  Whereas your approach is easy, but adds more code 
> > > > > > and
> > > > > > leaves the old stuff slow-and-dirty.
> > > > > > 
> > > > > > Guess which approach is preferred? ;)
> > > > > 
> > > > > Ok, I'll look at using make_pages_present().
> > > > 
> > > > Am still curious to know what inspired this change.  What are the use
> > > > cases?  Performance testing results, etc?
> > > 
> > > Ah, that is Lennarts Pulse Audio thing, he has samples in memory which
> > > might not have been used for a while, and he wants to be able to
> > > pre-fetch those when he suspects they might need to be played. So that
> > > once the audio thread comes along and stuffs them down /dev/dsp its all
> > > nice in memory.
> > > 
> > > Since its all soft real-time at best he feels its better to do a best
> > > effort at not hitting swap than it is to strain the system with mlock
> > > usage.
> > 
> > hrm.  Does he know about pthread_create()?
> 
> I'm very sure he does. So you're suggesting to just create a thread and
> touch that memory and be done with it?
> 
> Lennart?

That would get him out of trouble.  But it certainly makes _sense_ for the
kernel to implement MADV_WILLNEED for anon memory.  From a consistency POV.

But I don't know that the usefulness of the feature is worth actually
expending code on.  Heck, after five-odd years I'm still asking every
second person I meet "why don't you use fadvise()?"  (Reponse: h!)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Andi Kleen
Peter Zijlstra <[EMAIL PROTECTED]> writes:
>
> Ah, that is Lennarts Pulse Audio thing, he has samples in memory which
> might not have been used for a while, and he wants to be able to
> pre-fetch those when he suspects they might need to be played. So that
> once the audio thread comes along and stuffs them down /dev/dsp its all
> nice in memory.

The real problem that seems to make swapping so slow is that the data
tends to be badly fragmented on the swap partition. I suspect if that
problem was attached the need for such prefetching would be far less
because swap in would be much faster.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Peter Zijlstra

On Thu, 2008-01-31 at 02:05 -0800, Andrew Morton wrote:
> On Thu, 31 Jan 2008 10:53:26 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > 
> > On Thu, 2008-01-31 at 01:47 -0800, Andrew Morton wrote:
> > > On Thu, 31 Jan 2008 10:35:18 +0100 Peter Zijlstra <[EMAIL PROTECTED]> 
> > > wrote:
> > > 
> > > > 
> > > > On Thu, 2008-01-31 at 01:12 -0800, Andrew Morton wrote:
> > > > 
> > > > > Implementation-wise: make_pages_present() _can_ be converted to do 
> > > > > this. 
> > > > > But it's a lot of patching, and the result will be a cleaner, faster 
> > > > > and
> > > > > smaller core MM.  Whereas your approach is easy, but adds more code 
> > > > > and
> > > > > leaves the old stuff slow-and-dirty.
> > > > > 
> > > > > Guess which approach is preferred? ;)
> > > > 
> > > > Ok, I'll look at using make_pages_present().
> > > 
> > > Am still curious to know what inspired this change.  What are the use
> > > cases?  Performance testing results, etc?
> > 
> > Ah, that is Lennarts Pulse Audio thing, he has samples in memory which
> > might not have been used for a while, and he wants to be able to
> > pre-fetch those when he suspects they might need to be played. So that
> > once the audio thread comes along and stuffs them down /dev/dsp its all
> > nice in memory.
> > 
> > Since its all soft real-time at best he feels its better to do a best
> > effort at not hitting swap than it is to strain the system with mlock
> > usage.
> 
> hrm.  Does he know about pthread_create()?

I'm very sure he does. So you're suggesting to just create a thread and
touch that memory and be done with it?

Lennart?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Andrew Morton
On Thu, 31 Jan 2008 10:53:26 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> 
> On Thu, 2008-01-31 at 01:47 -0800, Andrew Morton wrote:
> > On Thu, 31 Jan 2008 10:35:18 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > 
> > > 
> > > On Thu, 2008-01-31 at 01:12 -0800, Andrew Morton wrote:
> > > 
> > > > Implementation-wise: make_pages_present() _can_ be converted to do 
> > > > this. 
> > > > But it's a lot of patching, and the result will be a cleaner, faster and
> > > > smaller core MM.  Whereas your approach is easy, but adds more code and
> > > > leaves the old stuff slow-and-dirty.
> > > > 
> > > > Guess which approach is preferred? ;)
> > > 
> > > Ok, I'll look at using make_pages_present().
> > 
> > Am still curious to know what inspired this change.  What are the use
> > cases?  Performance testing results, etc?
> 
> Ah, that is Lennarts Pulse Audio thing, he has samples in memory which
> might not have been used for a while, and he wants to be able to
> pre-fetch those when he suspects they might need to be played. So that
> once the audio thread comes along and stuffs them down /dev/dsp its all
> nice in memory.
> 
> Since its all soft real-time at best he feels its better to do a best
> effort at not hitting swap than it is to strain the system with mlock
> usage.

hrm.  Does he know about pthread_create()?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Peter Zijlstra

On Thu, 2008-01-31 at 01:47 -0800, Andrew Morton wrote:
> On Thu, 31 Jan 2008 10:35:18 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > 
> > On Thu, 2008-01-31 at 01:12 -0800, Andrew Morton wrote:
> > 
> > > Implementation-wise: make_pages_present() _can_ be converted to do this. 
> > > But it's a lot of patching, and the result will be a cleaner, faster and
> > > smaller core MM.  Whereas your approach is easy, but adds more code and
> > > leaves the old stuff slow-and-dirty.
> > > 
> > > Guess which approach is preferred? ;)
> > 
> > Ok, I'll look at using make_pages_present().
> 
> Am still curious to know what inspired this change.  What are the use
> cases?  Performance testing results, etc?

Ah, that is Lennarts Pulse Audio thing, he has samples in memory which
might not have been used for a while, and he wants to be able to
pre-fetch those when he suspects they might need to be played. So that
once the audio thread comes along and stuffs them down /dev/dsp its all
nice in memory.

Since its all soft real-time at best he feels its better to do a best
effort at not hitting swap than it is to strain the system with mlock
usage.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Andrew Morton
On Thu, 31 Jan 2008 10:35:18 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> 
> On Thu, 2008-01-31 at 01:12 -0800, Andrew Morton wrote:
> 
> > Implementation-wise: make_pages_present() _can_ be converted to do this. 
> > But it's a lot of patching, and the result will be a cleaner, faster and
> > smaller core MM.  Whereas your approach is easy, but adds more code and
> > leaves the old stuff slow-and-dirty.
> > 
> > Guess which approach is preferred? ;)
> 
> Ok, I'll look at using make_pages_present().

Am still curious to know what inspired this change.  What are the use
cases?  Performance testing results, etc?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Peter Zijlstra

On Thu, 2008-01-31 at 01:12 -0800, Andrew Morton wrote:

> Implementation-wise: make_pages_present() _can_ be converted to do this. 
> But it's a lot of patching, and the result will be a cleaner, faster and
> smaller core MM.  Whereas your approach is easy, but adds more code and
> leaves the old stuff slow-and-dirty.
> 
> Guess which approach is preferred? ;)

Ok, I'll look at using make_pages_present().

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Andrew Morton
On Thu, 31 Jan 2008 09:44:00 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> 
> On Wed, 2008-01-30 at 14:40 -0800, Andrew Morton wrote:
> > On Wed, 30 Jan 2008 18:28:59 +0100
> > Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > 
> > > Implement MADV_WILLNEED for anonymous pages by walking the page tables and
> > > starting asynchonous swap cache reads for all encountered swap pages.
> > 
> > Why cannot this use (a perhaps suitably-modified) make_pages_present()?
> 
> Because make_pages_present() relies on page faults to bring data in and
> will thus wait for all data to be present before returning.
> 
> This solution is async; it will just issue a read for the requested
> pages and moves on.
> 

I of course realise that.  I also realise that swapin_readahead() is
_supposed_ to make the difference moot.

There's something you guys aren't telling us.  Several things, actually. 
Please don't do that.



Implementation-wise: make_pages_present() _can_ be converted to do this. 
But it's a lot of patching, and the result will be a cleaner, faster and
smaller core MM.  Whereas your approach is easy, but adds more code and
leaves the old stuff slow-and-dirty.

Guess which approach is preferred? ;)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-31 Thread Peter Zijlstra

On Wed, 2008-01-30 at 14:40 -0800, Andrew Morton wrote:
> On Wed, 30 Jan 2008 18:28:59 +0100
> Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> 
> > Implement MADV_WILLNEED for anonymous pages by walking the page tables and
> > starting asynchonous swap cache reads for all encountered swap pages.
> 
> Why cannot this use (a perhaps suitably-modified) make_pages_present()?

Because make_pages_present() relies on page faults to bring data in and
will thus wait for all data to be present before returning.

This solution is async; it will just issue a read for the requested
pages and moves on.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-30 Thread Andrew Morton
On Wed, 30 Jan 2008 18:28:59 +0100
Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> Implement MADV_WILLNEED for anonymous pages by walking the page tables and
> starting asynchonous swap cache reads for all encountered swap pages.

Why cannot this use (a perhaps suitably-modified) make_pages_present()?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-30 Thread Matt Mackall

On Wed, 2008-01-30 at 18:28 +0100, Peter Zijlstra wrote:
> Subject: mm: MADV_WILLNEED implementation for anonymous memory
> 
> Implement MADV_WILLNEED for anonymous pages by walking the page tables and
> starting asynchonous swap cache reads for all encountered swap pages.
> 
> Doing so required a modification to the page table walking library functions.
> Previously ->pte_entry() could be called while holding a kmap_atomic, to
> overcome this problem the pte walker is changed to copy batches of the pmd
> and iterate them.

That's a pretty reasonable approach. My original approach was to buffer
a page worth of PTEs with all the attendant malloc annoyances. Then
Andrew and I came up with another fix a bit ago by effectively doing a
batch of size 1: mapping and immediately unmapping per PTE. That's
basically a no-op on !HIGHPTE but could potentially be expensive in the
HIGHPTE case. Your approach might be a good complexity/performance
middle ground.

Unfortunately, I think we only implemented our fix in one of the
relevant places: the /proc/pid/pagemap code hooks a callback at the pte
table level and then does its own walk across the table. Perhaps I
should refactor this so that it hooks in at the pte entry level of the
walker instead.

> +/*
> + * Much of the complication here is to work around CONFIG_HIGHPTE which needs
> + * to kmap the pmd. So copy batches of ptes from the pmd and iterate over
> + * those.
> + */
> +#define WALK_BATCH_SIZE  32
> +
>  static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> const struct mm_walk *walk, void *private)
>  {
>   pte_t *pte;
> + pte_t ptes[WALK_BATCH_SIZE];
> + unsigned long start;
> + unsigned int i;
>   int err = 0;
>  
> - pte = pte_offset_map(pmd, addr);
>   do {
> - err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, private);
> - if (err)
> -break;
> - } while (pte++, addr += PAGE_SIZE, addr != end);
> + start = addr;
>  
> - pte_unmap(pte);
> + pte = pte_offset_map(pmd, addr);
> + for (i = 0; i < WALK_BATCH_SIZE && addr != end;
> + i++, pte++, addr += PAGE_SIZE)
> + ptes[i] = *pte;

Looks like this could be:

for (i = 0; i < WALK_BATCH_SIZE && addr + i * PAGE_SIZE != end; 
i++)
ptes[i] = pte[i];

> + pte_unmap(pte);
> +
> + for (i = 0, pte = ptes, addr = start;
> + i < WALK_BATCH_SIZE && addr != end;
> + i++, pte++, addr += PAGE_SIZE) {
> + err = walk->pte_entry(pte, addr, addr + PAGE_SIZE,
> + private);
for (i = 0; i < WALK_BATCH_SIZE && addr != end;
i++, addr+= PAGE_SIZE) {
err = walk->pte_entry(ptes[i], addr, addr + PAGE_SIZE,
private);

And we can ditch start.

Also, one wonders if setting batch size to 1 will then convince the
compiler to collapse this into a more trivial loop in the !HIGHPTE case.

-- 
Mathematics is the supreme nostalgia of our time.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] mm: MADV_WILLNEED implementation for anonymous memory

2008-01-30 Thread Peter Zijlstra
Subject: mm: MADV_WILLNEED implementation for anonymous memory

Implement MADV_WILLNEED for anonymous pages by walking the page tables and
starting asynchonous swap cache reads for all encountered swap pages.

Doing so required a modification to the page table walking library functions.
Previously ->pte_entry() could be called while holding a kmap_atomic, to
overcome this problem the pte walker is changed to copy batches of the pmd
and iterate them.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/madvise.c  |   48 +---
 mm/pagewalk.c |   33 +++--
 2 files changed, 68 insertions(+), 13 deletions(-)

Index: linux-2.6/mm/madvise.c
===
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -11,6 +11,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -100,17 +102,51 @@ out:
return error;
 }
 
+static int madvise_willneed_anon_pte(pte_t *ptep, unsigned long addr,
+   unsigned long end, void *arg)
+{
+   struct vm_area_struct *vma = arg;
+   struct page *page;
+
+   if (is_swap_pte(*ptep)) {
+   page = read_swap_cache_async(pte_to_swp_entry(*ptep),
+   GFP_KERNEL, vma, addr);
+   if (page)
+   page_cache_release(page);
+   }
+
+   return 0;
+}
+
+static long madvise_willneed_anon(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+   unsigned long len;
+   struct mm_walk walk = {
+   .pte_entry = madvise_willneed_anon_pte,
+   };
+
+   len = max_sane_readahead((end - start) >> PAGE_SHIFT);
+   end = start + (len << PAGE_SHIFT);
+
+   walk_page_range(vma->vm_mm, start, end, &walk, vma);
+   *prev = vma;
+
+   return 0;
+}
+
 /*
  * Schedule all required I/O operations.  Do not wait for completion.
  */
-static long madvise_willneed(struct vm_area_struct * vma,
-struct vm_area_struct ** prev,
+static long madvise_willneed(struct vm_area_struct *vma,
+struct vm_area_struct **prev,
 unsigned long start, unsigned long end)
 {
struct file *file = vma->vm_file;
 
if (!file)
-   return -EBADF;
+   return madvise_willneed_anon(vma, prev, start, end);
 
if (file->f_mapping->a_ops->get_xip_page) {
/* no bad return value, but ignore advice */
@@ -119,8 +155,6 @@ static long madvise_willneed(struct vm_a
 
*prev = vma;
start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-   if (end > vma->vm_end)
-   end = vma->vm_end;
end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
force_page_cache_readahead(file->f_mapping,
@@ -147,8 +181,8 @@ static long madvise_willneed(struct vm_a
  * An interface that causes the system to free clean pages and flush
  * dirty pages is already available as msync(MS_INVALIDATE).
  */
-static long madvise_dontneed(struct vm_area_struct * vma,
-struct vm_area_struct ** prev,
+static long madvise_dontneed(struct vm_area_struct *vma,
+struct vm_area_struct **prev,
 unsigned long start, unsigned long end)
 {
*prev = vma;
Index: linux-2.6/mm/pagewalk.c
===
--- linux-2.6.orig/mm/pagewalk.c
+++ linux-2.6/mm/pagewalk.c
@@ -2,20 +2,41 @@
 #include 
 #include 
 
+/*
+ * Much of the complication here is to work around CONFIG_HIGHPTE which needs
+ * to kmap the pmd. So copy batches of ptes from the pmd and iterate over
+ * those.
+ */
+#define WALK_BATCH_SIZE32
+
 static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
  const struct mm_walk *walk, void *private)
 {
pte_t *pte;
+   pte_t ptes[WALK_BATCH_SIZE];
+   unsigned long start;
+   unsigned int i;
int err = 0;
 
-   pte = pte_offset_map(pmd, addr);
do {
-   err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, private);
-   if (err)
-  break;
-   } while (pte++, addr += PAGE_SIZE, addr != end);
+   start = addr;
 
-   pte_unmap(pte);
+   pte = pte_offset_map(pmd, addr);
+   for (i = 0; i < WALK_BATCH_SIZE && addr != end;
+   i++, pte++, addr += PAGE_SIZE)
+   ptes[i] = *pte;
+   pte_unmap(pte);
+
+   for (i = 0, pte = ptes, addr = start;
+   i < WALK_BATCH_SIZE && addr != end;
+