Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-09 Thread Matt Mackall
On Wed, Feb 07, 2007 at 12:30:38AM -0800, Christoph Lameter wrote:
> On Tue, 6 Feb 2007, David Rientjes wrote:
> 
> > Extracts the page table entry walker from the smaps-specific code in
> > fs/proc/task_mmu.c.  This will be used later for clearing the reference
> > bits on pages to measure the number of pages accessed over a time period
> > through /proc/pid/smaps.
> 
> Clearing reference bits? Ummm... That is a pretty inaccurate measure since 
> reclaim can remove pages and revert the reference bits. It can never work 
> reliably.
> 
> > The new struct pte_walker includes the struct vm_area_struct of the memory
> > to walk over.  Iteration begins at the start address and completes at the
> > the end address.  A pointer to another data structure may be stored in the
> > private field such as the struct mem_size_stats, which acts as the smaps
> > accumulator.  For each page table entry in the VMA, the func function is
> > called with the corresponding struct pte_walker, the pte_t, and its
> > address.
> 
> Would it be possible to sync up with the people doing the page table 
> interface?

Have a pointer to this?

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-09 Thread Matt Mackall
On Wed, Feb 07, 2007 at 12:30:38AM -0800, Christoph Lameter wrote:
 On Tue, 6 Feb 2007, David Rientjes wrote:
 
  Extracts the page table entry walker from the smaps-specific code in
  fs/proc/task_mmu.c.  This will be used later for clearing the reference
  bits on pages to measure the number of pages accessed over a time period
  through /proc/pid/smaps.
 
 Clearing reference bits? Ummm... That is a pretty inaccurate measure since 
 reclaim can remove pages and revert the reference bits. It can never work 
 reliably.
 
  The new struct pte_walker includes the struct vm_area_struct of the memory
  to walk over.  Iteration begins at the start address and completes at the
  the end address.  A pointer to another data structure may be stored in the
  private field such as the struct mem_size_stats, which acts as the smaps
  accumulator.  For each page table entry in the VMA, the func function is
  called with the corresponding struct pte_walker, the pte_t, and its
  address.
 
 Would it be possible to sync up with the people doing the page table 
 interface?

Have a pointer to this?

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-08 Thread David Rientjes
On Thu, 8 Feb 2007, Matt Mackall wrote:

> I've been looking at a similar refactoring of other code and I think
> the way to go is a callback per block-of-PTEs with start and end
> pointers. That gets rid of most of the call indirection overhead.
> 

Yes, but only in a limited number of cases (including smaps).  It is not 
always sufficient to do it on the block-of-ptes level.  Many of the the 
pte iterators in the code right now are slightly different on each level: 
the ioremap code, for example, allocates the various directories as it 
progresses in depth.

To have one library pte iterator would only be possible with callback 
functions on each pgd, pud, and pmd level that (at least) ioremap would 
require to be implemented on top of.  That isn't very appealing.

The PTI patchset from July moved these iterators into .h files but did not 
implement a single pte iterator to build on.  It just turned out nicely in 
the smaps case that there was no difference on the pgd, pud, or pmd level 
between the two required iterators.  I added a void *private to carry the 
struct for statistics accumulating in the smaps case but this is worthless 
in the clear_refs case.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-08 Thread Matt Mackall
On Wed, Feb 07, 2007 at 03:29:34PM +0900, Paul Mundt wrote:
> On Tue, Feb 06, 2007 at 10:15:47PM -0800, David Rientjes wrote:
> > Extracts the page table entry walker from the smaps-specific code in
> > fs/proc/task_mmu.c.  This will be used later for clearing the reference
> > bits on pages to measure the number of pages accessed over a time period
> > through /proc/pid/smaps.
> > 
> I like the general idea of this patch set, however..
> 
> > Since the PTE walker is now extracted from the smaps code,
> > smaps_pte_func() is invoked for each PTE in the VMA.  Its behavior is
> > identical to the existing implementation, except it is slightly slower
> > because each PTE now invokes a function call.
> > 
> Perhaps this is something that needs to be looked at more closely and
> made more generic? There are many ranged page table walkers that aren't
> so performance critical that the function call cost would cause too much
> pain. ioremap_page_range() comes to mind, and there's bound to be others.
> This would also help people to get the pte map/unmap right, which seems
> to pop up from time to time as well..

I've been looking at a similar refactoring of other code and I think
the way to go is a callback per block-of-PTEs with start and end
pointers. That gets rid of most of the call indirection overhead.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-08 Thread Matt Mackall
On Wed, Feb 07, 2007 at 03:29:34PM +0900, Paul Mundt wrote:
 On Tue, Feb 06, 2007 at 10:15:47PM -0800, David Rientjes wrote:
  Extracts the page table entry walker from the smaps-specific code in
  fs/proc/task_mmu.c.  This will be used later for clearing the reference
  bits on pages to measure the number of pages accessed over a time period
  through /proc/pid/smaps.
  
 I like the general idea of this patch set, however..
 
  Since the PTE walker is now extracted from the smaps code,
  smaps_pte_func() is invoked for each PTE in the VMA.  Its behavior is
  identical to the existing implementation, except it is slightly slower
  because each PTE now invokes a function call.
  
 Perhaps this is something that needs to be looked at more closely and
 made more generic? There are many ranged page table walkers that aren't
 so performance critical that the function call cost would cause too much
 pain. ioremap_page_range() comes to mind, and there's bound to be others.
 This would also help people to get the pte map/unmap right, which seems
 to pop up from time to time as well..

I've been looking at a similar refactoring of other code and I think
the way to go is a callback per block-of-PTEs with start and end
pointers. That gets rid of most of the call indirection overhead.

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-08 Thread David Rientjes
On Thu, 8 Feb 2007, Matt Mackall wrote:

 I've been looking at a similar refactoring of other code and I think
 the way to go is a callback per block-of-PTEs with start and end
 pointers. That gets rid of most of the call indirection overhead.
 

Yes, but only in a limited number of cases (including smaps).  It is not 
always sufficient to do it on the block-of-ptes level.  Many of the the 
pte iterators in the code right now are slightly different on each level: 
the ioremap code, for example, allocates the various directories as it 
progresses in depth.

To have one library pte iterator would only be possible with callback 
functions on each pgd, pud, and pmd level that (at least) ioremap would 
require to be implemented on top of.  That isn't very appealing.

The PTI patchset from July moved these iterators into .h files but did not 
implement a single pte iterator to build on.  It just turned out nicely in 
the smaps case that there was no difference on the pgd, pud, or pmd level 
between the two required iterators.  I added a void *private to carry the 
struct for statistics accumulating in the smaps case but this is worthless 
in the clear_refs case.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-07 Thread David Rientjes
On Wed, 7 Feb 2007, Christoph Lameter wrote:

> Clearing reference bits? Ummm... That is a pretty inaccurate measure since 
> reclaim can remove pages and revert the reference bits. It can never work 
> reliably.
> 

It's not intended to work precisely, it's intended to give a good estimate 
of task memory footprint.  There are several different scenarios that can 
interfere with getting a precise measurement via this method; then again 
/proc/pid/smaps is so expensive already because of the iteration through 
VMA's that it's not something you'd do regularly.

Any other suggestions on how to obtain this data is certainly welcome.

> Would it be possible to sync up with the people doing the page table 
> interface?
> 

(+pauld)

Is there an update on Paul Davies' implementation that would allow us to 
iterate through a set of pte's in a vm_area_struct range?

> Could we somehow consolidate smaps and numa_maps?
> 

Sure, but that's beyond the scope of this patchset.  My intention in 
extracting away a new pte_walker was to prevent having two identical 
implementations and it could easily be extracted even further to lib for 
the ioremap case that Paul mentioned.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-07 Thread Christoph Lameter
On Tue, 6 Feb 2007, David Rientjes wrote:

> Extracts the page table entry walker from the smaps-specific code in
> fs/proc/task_mmu.c.  This will be used later for clearing the reference
> bits on pages to measure the number of pages accessed over a time period
> through /proc/pid/smaps.

Clearing reference bits? Ummm... That is a pretty inaccurate measure since 
reclaim can remove pages and revert the reference bits. It can never work 
reliably.

> The new struct pte_walker includes the struct vm_area_struct of the memory
> to walk over.  Iteration begins at the start address and completes at the
> the end address.  A pointer to another data structure may be stored in the
> private field such as the struct mem_size_stats, which acts as the smaps
> accumulator.  For each page table entry in the VMA, the func function is
> called with the corresponding struct pte_walker, the pte_t, and its
> address.

Would it be possible to sync up with the people doing the page table 
interface?

Could we somehow consolidate smaps and numa_maps?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-07 Thread Christoph Lameter
On Tue, 6 Feb 2007, David Rientjes wrote:

 Extracts the page table entry walker from the smaps-specific code in
 fs/proc/task_mmu.c.  This will be used later for clearing the reference
 bits on pages to measure the number of pages accessed over a time period
 through /proc/pid/smaps.

Clearing reference bits? Ummm... That is a pretty inaccurate measure since 
reclaim can remove pages and revert the reference bits. It can never work 
reliably.

 The new struct pte_walker includes the struct vm_area_struct of the memory
 to walk over.  Iteration begins at the start address and completes at the
 the end address.  A pointer to another data structure may be stored in the
 private field such as the struct mem_size_stats, which acts as the smaps
 accumulator.  For each page table entry in the VMA, the func function is
 called with the corresponding struct pte_walker, the pte_t, and its
 address.

Would it be possible to sync up with the people doing the page table 
interface?

Could we somehow consolidate smaps and numa_maps?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-07 Thread David Rientjes
On Wed, 7 Feb 2007, Christoph Lameter wrote:

 Clearing reference bits? Ummm... That is a pretty inaccurate measure since 
 reclaim can remove pages and revert the reference bits. It can never work 
 reliably.
 

It's not intended to work precisely, it's intended to give a good estimate 
of task memory footprint.  There are several different scenarios that can 
interfere with getting a precise measurement via this method; then again 
/proc/pid/smaps is so expensive already because of the iteration through 
VMA's that it's not something you'd do regularly.

Any other suggestions on how to obtain this data is certainly welcome.

 Would it be possible to sync up with the people doing the page table 
 interface?
 

(+pauld)

Is there an update on Paul Davies' implementation that would allow us to 
iterate through a set of pte's in a vm_area_struct range?

 Could we somehow consolidate smaps and numa_maps?
 

Sure, but that's beyond the scope of this patchset.  My intention in 
extracting away a new pte_walker was to prevent having two identical 
implementations and it could easily be extracted even further to lib for 
the ioremap case that Paul mentioned.

David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-06 Thread Paul Mundt
On Tue, Feb 06, 2007 at 10:49:14PM -0800, Andrew Morton wrote:
> On Wed, 7 Feb 2007 15:29:34 +0900 Paul Mundt <[EMAIL PROTECTED]> wrote:
> > I like the general idea of this patch set, however..
> 
> David didn't really spell out the rationale.  Userspace people ask "how
> much memory is my application using".  For system planning and such.  We
> don't know, really.  So the approach taken here is to nuke all the
> referenced bits and to then monitor them coming back over time.
> 
> Obviously it doesn't work very well if the page scanner is doing things,
> but one hopes that when someone is instrumenting their application they'll
> ensure that sufficient memory is available to prevent this inaccuracy.
> 
This was the problem we kept running in to as well, people that wanted to
instrument their applications on the desktop where all of the required
infrastructure already brought the system to its knees long before
instrumentation could begin didn't really provide the most consistent
metrics, especially when the application depended heavily on overcommit.

This resulted in a lot of extra hacks on top of the smaps code trying to
balance out the numbers, this just ended up being a massive headache.
smaps seems to be an appealing target for piling additional semantics on,
unfortunately.

> I don't really have a sense for how much stuff we want to put into the kernel
> to support this sort of thing.  The proposed patches are, I think, minimal.
> Perhaps it needs more.  If so, opinions are solicited before we go and add
> (and hence be forced to maintain) this new interface.
> 
This sort of minimalist interface is not necessarily a bad approach if it
helps in a way that works for the people that really want it. The issue
with any of the smaps extensions is being able to provide people with a
"good enough" metric without falling in to interface hell. This really
depends on how one defines "good enough", though. It would be nice to
hear from application folks if this ends up being useful for them or not.

> > Perhaps this is something that needs to be looked at more closely and
> > made more generic? There are many ranged page table walkers that aren't
> > so performance critical that the function call cost would cause too much
> > pain. ioremap_page_range() comes to mind, and there's bound to be others.
> > This would also help people to get the pte map/unmap right, which seems
> > to pop up from time to time as well..
> 
> That's what Paul Davies' patches are aimed at:
> 
> http://marc.theaimsgroup.com/?l=linux-mm=115276500100695=2
> 
> I promised Paul that I'd take a serious look at those patches next time
> they pop up.   It would be good if others could help out in that.

It would be nice to see this smaps patchset rebased on something like
that rather than introducing another walker abstraction simply to have it
overhauled later. I realize this isn't really David's problem, but we
shouldn't be trying to solve the same problem multiple times.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-06 Thread Andrew Morton
On Wed, 7 Feb 2007 15:29:34 +0900 Paul Mundt <[EMAIL PROTECTED]> wrote:

> On Tue, Feb 06, 2007 at 10:15:47PM -0800, David Rientjes wrote:
> > Extracts the page table entry walker from the smaps-specific code in
> > fs/proc/task_mmu.c.  This will be used later for clearing the reference
> > bits on pages to measure the number of pages accessed over a time period
> > through /proc/pid/smaps.
> > 
> I like the general idea of this patch set, however..

David didn't really spell out the rationale.  Userspace people ask "how
much memory is my application using".  For system planning and such.  We
don't know, really.  So the approach taken here is to nuke all the
referenced bits and to then monitor them coming back over time.

Obviously it doesn't work very well if the page scanner is doing things,
but one hopes that when someone is instrumenting their application they'll
ensure that sufficient memory is available to prevent this inaccuracy.

The other part of the question is of course pagecache.  Hopefully /proc/pid/io
will suffice for that, probably in combination with /proc/sys/vm/drop_caches.

I don't really have a sense for how much stuff we want to put into the kernel
to support this sort of thing.  The proposed patches are, I think, minimal.
Perhaps it needs more.  If so, opinions are solicited before we go and add
(and hence be forced to maintain) this new interface.

> > Since the PTE walker is now extracted from the smaps code,
> > smaps_pte_func() is invoked for each PTE in the VMA.  Its behavior is
> > identical to the existing implementation, except it is slightly slower
> > because each PTE now invokes a function call.
> > 
> Perhaps this is something that needs to be looked at more closely and
> made more generic? There are many ranged page table walkers that aren't
> so performance critical that the function call cost would cause too much
> pain. ioremap_page_range() comes to mind, and there's bound to be others.
> This would also help people to get the pte map/unmap right, which seems
> to pop up from time to time as well..

That's what Paul Davies' patches are aimed at:

http://marc.theaimsgroup.com/?l=linux-mm=115276500100695=2

I promised Paul that I'd take a serious look at those patches next time
they pop up.   It would be good if others could help out in that.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-06 Thread Paul Mundt
On Tue, Feb 06, 2007 at 10:15:47PM -0800, David Rientjes wrote:
> Extracts the page table entry walker from the smaps-specific code in
> fs/proc/task_mmu.c.  This will be used later for clearing the reference
> bits on pages to measure the number of pages accessed over a time period
> through /proc/pid/smaps.
> 
I like the general idea of this patch set, however..

> Since the PTE walker is now extracted from the smaps code,
> smaps_pte_func() is invoked for each PTE in the VMA.  Its behavior is
> identical to the existing implementation, except it is slightly slower
> because each PTE now invokes a function call.
> 
Perhaps this is something that needs to be looked at more closely and
made more generic? There are many ranged page table walkers that aren't
so performance critical that the function call cost would cause too much
pain. ioremap_page_range() comes to mind, and there's bound to be others.
This would also help people to get the pte map/unmap right, which seems
to pop up from time to time as well..
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-06 Thread David Rientjes
Extracts the page table entry walker from the smaps-specific code in
fs/proc/task_mmu.c.  This will be used later for clearing the reference
bits on pages to measure the number of pages accessed over a time period
through /proc/pid/smaps.

The new struct pte_walker includes the struct vm_area_struct of the memory
to walk over.  Iteration begins at the start address and completes at the
the end address.  A pointer to another data structure may be stored in the
private field such as the struct mem_size_stats, which acts as the smaps
accumulator.  For each page table entry in the VMA, the func function is
called with the corresponding struct pte_walker, the pte_t, and its
address.

Since the PTE walker is now extracted from the smaps code,
smaps_pte_func() is invoked for each PTE in the VMA.  Its behavior is
identical to the existing implementation, except it is slightly slower
because each PTE now invokes a function call.

Cc: Hugh Dickins <[EMAIL PROTECTED]>
Cc: Paul Mundt <[EMAIL PROTECTED]>
Cc: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: David Rientjes <[EMAIL PROTECTED]>
---
 fs/proc/task_mmu.c |  126 ++--
 1 files changed, 82 insertions(+), 44 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 55ade0d..e87824b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -122,6 +122,15 @@ struct mem_size_stats
unsigned long private_dirty;
 };
 
+struct pte_walker {
+   struct vm_area_struct *vma; /* VMA */
+   unsigned long start;/* start address */
+   unsigned long end;  /* end address */
+   void *private;  /* private data */
+   /* function to invoke for each pte in the above range */
+   void (*func)(struct pte_walker*, pte_t*, unsigned long);
+};
+
 static int show_map_internal(struct seq_file *m, void *v, struct 
mem_size_stats *mss)
 {
struct proc_maps_private *priv = m->private;
@@ -204,98 +213,127 @@ static int show_map(struct seq_file *m, void *v)
return show_map_internal(m, v, NULL);
 }
 
-static void smaps_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, unsigned long end,
-   struct mem_size_stats *mss)
+/*
+ * Walks each PTE in the struct pte_walker address range and calls
+ * walker->func() for each entry.
+ */
+static void walk_ptes(struct pte_walker *walker, pmd_t *pmd)
 {
-   pte_t *pte, ptent;
+   struct vm_area_struct *vma = walker->vma;
+   unsigned long addr = walker->start;
spinlock_t *ptl;
-   struct page *page;
+   pte_t *pte;
 
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, );
do {
-   ptent = *pte;
-   if (!pte_present(ptent))
-   continue;
-
-   mss->resident += PAGE_SIZE;
-
-   page = vm_normal_page(vma, addr, ptent);
-   if (!page)
-   continue;
-
-   if (page_mapcount(page) >= 2) {
-   if (pte_dirty(ptent))
-   mss->shared_dirty += PAGE_SIZE;
-   else
-   mss->shared_clean += PAGE_SIZE;
-   } else {
-   if (pte_dirty(ptent))
-   mss->private_dirty += PAGE_SIZE;
-   else
-   mss->private_clean += PAGE_SIZE;
-   }
-   } while (pte++, addr += PAGE_SIZE, addr != end);
+   walker->func(walker, pte, addr);
+   } while (pte++, addr += PAGE_SIZE, addr != walker->end);
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
 }
 
-static inline void smaps_pmd_range(struct vm_area_struct *vma, pud_t *pud,
-   unsigned long addr, unsigned long end,
-   struct mem_size_stats *mss)
+static inline void walk_pmds(struct pte_walker *walker, pud_t *pud)
 {
-   pmd_t *pmd;
+   unsigned long addr = walker->start;
+   unsigned long end = walker->end;
unsigned long next;
+   pmd_t *pmd;
 
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd))
continue;
-   smaps_pte_range(vma, pmd, addr, next, mss);
+   walk_ptes(walker, pmd);
} while (pmd++, addr = next, addr != end);
 }
 
-static inline void smaps_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
-   unsigned long addr, unsigned long end,
-   struct mem_size_stats *mss)
+static inline void walk_puds(struct pte_walker *walker, pgd_t *pgd)
 {
-   pud_t *pud;
+   unsigned long addr = walker->start;
+   unsigned long end = walker->end;
unsigned long next;
+   pud_t *pud;
 
pud = pud_offset(pgd, 

[patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-06 Thread David Rientjes
Extracts the page table entry walker from the smaps-specific code in
fs/proc/task_mmu.c.  This will be used later for clearing the reference
bits on pages to measure the number of pages accessed over a time period
through /proc/pid/smaps.

The new struct pte_walker includes the struct vm_area_struct of the memory
to walk over.  Iteration begins at the start address and completes at the
the end address.  A pointer to another data structure may be stored in the
private field such as the struct mem_size_stats, which acts as the smaps
accumulator.  For each page table entry in the VMA, the func function is
called with the corresponding struct pte_walker, the pte_t, and its
address.

Since the PTE walker is now extracted from the smaps code,
smaps_pte_func() is invoked for each PTE in the VMA.  Its behavior is
identical to the existing implementation, except it is slightly slower
because each PTE now invokes a function call.

Cc: Hugh Dickins [EMAIL PROTECTED]
Cc: Paul Mundt [EMAIL PROTECTED]
Cc: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: David Rientjes [EMAIL PROTECTED]
---
 fs/proc/task_mmu.c |  126 ++--
 1 files changed, 82 insertions(+), 44 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 55ade0d..e87824b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -122,6 +122,15 @@ struct mem_size_stats
unsigned long private_dirty;
 };
 
+struct pte_walker {
+   struct vm_area_struct *vma; /* VMA */
+   unsigned long start;/* start address */
+   unsigned long end;  /* end address */
+   void *private;  /* private data */
+   /* function to invoke for each pte in the above range */
+   void (*func)(struct pte_walker*, pte_t*, unsigned long);
+};
+
 static int show_map_internal(struct seq_file *m, void *v, struct 
mem_size_stats *mss)
 {
struct proc_maps_private *priv = m-private;
@@ -204,98 +213,127 @@ static int show_map(struct seq_file *m, void *v)
return show_map_internal(m, v, NULL);
 }
 
-static void smaps_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long addr, unsigned long end,
-   struct mem_size_stats *mss)
+/*
+ * Walks each PTE in the struct pte_walker address range and calls
+ * walker-func() for each entry.
+ */
+static void walk_ptes(struct pte_walker *walker, pmd_t *pmd)
 {
-   pte_t *pte, ptent;
+   struct vm_area_struct *vma = walker-vma;
+   unsigned long addr = walker-start;
spinlock_t *ptl;
-   struct page *page;
+   pte_t *pte;
 
pte = pte_offset_map_lock(vma-vm_mm, pmd, addr, ptl);
do {
-   ptent = *pte;
-   if (!pte_present(ptent))
-   continue;
-
-   mss-resident += PAGE_SIZE;
-
-   page = vm_normal_page(vma, addr, ptent);
-   if (!page)
-   continue;
-
-   if (page_mapcount(page) = 2) {
-   if (pte_dirty(ptent))
-   mss-shared_dirty += PAGE_SIZE;
-   else
-   mss-shared_clean += PAGE_SIZE;
-   } else {
-   if (pte_dirty(ptent))
-   mss-private_dirty += PAGE_SIZE;
-   else
-   mss-private_clean += PAGE_SIZE;
-   }
-   } while (pte++, addr += PAGE_SIZE, addr != end);
+   walker-func(walker, pte, addr);
+   } while (pte++, addr += PAGE_SIZE, addr != walker-end);
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
 }
 
-static inline void smaps_pmd_range(struct vm_area_struct *vma, pud_t *pud,
-   unsigned long addr, unsigned long end,
-   struct mem_size_stats *mss)
+static inline void walk_pmds(struct pte_walker *walker, pud_t *pud)
 {
-   pmd_t *pmd;
+   unsigned long addr = walker-start;
+   unsigned long end = walker-end;
unsigned long next;
+   pmd_t *pmd;
 
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd))
continue;
-   smaps_pte_range(vma, pmd, addr, next, mss);
+   walk_ptes(walker, pmd);
} while (pmd++, addr = next, addr != end);
 }
 
-static inline void smaps_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
-   unsigned long addr, unsigned long end,
-   struct mem_size_stats *mss)
+static inline void walk_puds(struct pte_walker *walker, pgd_t *pgd)
 {
-   pud_t *pud;
+   unsigned long addr = walker-start;
+   unsigned long end = walker-end;
unsigned long next;
+   pud_t *pud;
 
pud = pud_offset(pgd, addr);
do {

Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-06 Thread Paul Mundt
On Tue, Feb 06, 2007 at 10:15:47PM -0800, David Rientjes wrote:
 Extracts the page table entry walker from the smaps-specific code in
 fs/proc/task_mmu.c.  This will be used later for clearing the reference
 bits on pages to measure the number of pages accessed over a time period
 through /proc/pid/smaps.
 
I like the general idea of this patch set, however..

 Since the PTE walker is now extracted from the smaps code,
 smaps_pte_func() is invoked for each PTE in the VMA.  Its behavior is
 identical to the existing implementation, except it is slightly slower
 because each PTE now invokes a function call.
 
Perhaps this is something that needs to be looked at more closely and
made more generic? There are many ranged page table walkers that aren't
so performance critical that the function call cost would cause too much
pain. ioremap_page_range() comes to mind, and there's bound to be others.
This would also help people to get the pte map/unmap right, which seems
to pop up from time to time as well..
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-06 Thread Andrew Morton
On Wed, 7 Feb 2007 15:29:34 +0900 Paul Mundt [EMAIL PROTECTED] wrote:

 On Tue, Feb 06, 2007 at 10:15:47PM -0800, David Rientjes wrote:
  Extracts the page table entry walker from the smaps-specific code in
  fs/proc/task_mmu.c.  This will be used later for clearing the reference
  bits on pages to measure the number of pages accessed over a time period
  through /proc/pid/smaps.
  
 I like the general idea of this patch set, however..

David didn't really spell out the rationale.  Userspace people ask how
much memory is my application using.  For system planning and such.  We
don't know, really.  So the approach taken here is to nuke all the
referenced bits and to then monitor them coming back over time.

Obviously it doesn't work very well if the page scanner is doing things,
but one hopes that when someone is instrumenting their application they'll
ensure that sufficient memory is available to prevent this inaccuracy.

The other part of the question is of course pagecache.  Hopefully /proc/pid/io
will suffice for that, probably in combination with /proc/sys/vm/drop_caches.

I don't really have a sense for how much stuff we want to put into the kernel
to support this sort of thing.  The proposed patches are, I think, minimal.
Perhaps it needs more.  If so, opinions are solicited before we go and add
(and hence be forced to maintain) this new interface.

  Since the PTE walker is now extracted from the smaps code,
  smaps_pte_func() is invoked for each PTE in the VMA.  Its behavior is
  identical to the existing implementation, except it is slightly slower
  because each PTE now invokes a function call.
  
 Perhaps this is something that needs to be looked at more closely and
 made more generic? There are many ranged page table walkers that aren't
 so performance critical that the function call cost would cause too much
 pain. ioremap_page_range() comes to mind, and there's bound to be others.
 This would also help people to get the pte map/unmap right, which seems
 to pop up from time to time as well..

That's what Paul Davies' patches are aimed at:

http://marc.theaimsgroup.com/?l=linux-mmm=115276500100695w=2

I promised Paul that I'd take a serious look at those patches next time
they pop up.   It would be good if others could help out in that.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/3 take2] smaps: extract pte walker from smaps code

2007-02-06 Thread Paul Mundt
On Tue, Feb 06, 2007 at 10:49:14PM -0800, Andrew Morton wrote:
 On Wed, 7 Feb 2007 15:29:34 +0900 Paul Mundt [EMAIL PROTECTED] wrote:
  I like the general idea of this patch set, however..
 
 David didn't really spell out the rationale.  Userspace people ask how
 much memory is my application using.  For system planning and such.  We
 don't know, really.  So the approach taken here is to nuke all the
 referenced bits and to then monitor them coming back over time.
 
 Obviously it doesn't work very well if the page scanner is doing things,
 but one hopes that when someone is instrumenting their application they'll
 ensure that sufficient memory is available to prevent this inaccuracy.
 
This was the problem we kept running in to as well, people that wanted to
instrument their applications on the desktop where all of the required
infrastructure already brought the system to its knees long before
instrumentation could begin didn't really provide the most consistent
metrics, especially when the application depended heavily on overcommit.

This resulted in a lot of extra hacks on top of the smaps code trying to
balance out the numbers, this just ended up being a massive headache.
smaps seems to be an appealing target for piling additional semantics on,
unfortunately.

 I don't really have a sense for how much stuff we want to put into the kernel
 to support this sort of thing.  The proposed patches are, I think, minimal.
 Perhaps it needs more.  If so, opinions are solicited before we go and add
 (and hence be forced to maintain) this new interface.
 
This sort of minimalist interface is not necessarily a bad approach if it
helps in a way that works for the people that really want it. The issue
with any of the smaps extensions is being able to provide people with a
good enough metric without falling in to interface hell. This really
depends on how one defines good enough, though. It would be nice to
hear from application folks if this ends up being useful for them or not.

  Perhaps this is something that needs to be looked at more closely and
  made more generic? There are many ranged page table walkers that aren't
  so performance critical that the function call cost would cause too much
  pain. ioremap_page_range() comes to mind, and there's bound to be others.
  This would also help people to get the pte map/unmap right, which seems
  to pop up from time to time as well..
 
 That's what Paul Davies' patches are aimed at:
 
 http://marc.theaimsgroup.com/?l=linux-mmm=115276500100695w=2
 
 I promised Paul that I'd take a serious look at those patches next time
 they pop up.   It would be good if others could help out in that.

It would be nice to see this smaps patchset rebased on something like
that rather than introducing another walker abstraction simply to have it
overhauled later. I realize this isn't really David's problem, but we
shouldn't be trying to solve the same problem multiple times.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/